
Students consider what training is by exploring two distinct examples: song recommendation and plagiarism detection. As a result of this exploration, they learn that training is the act of transforming data into a model, which is a resource-intensive and time-intensive process.

Lesson Goals

Students will be able to…​

  • Define training as the act of transforming data into a model, usually after aggregating the data in some way.

  • Define a bag of words as a model that represents text as an unordered collection of words with frequencies.

  • Describe the importance of data normalization.

Student-facing Lesson Goals

  • Let’s think about training, the act of transforming data into a model.

Materials

Preparation

  • The final section of this lesson involves an activity where students measure the angle difference between rays.

  • If you’d like your students to practice using their protractors, be sure to have protractors available for Angle Difference.

  • Alternatively, you can have your students estimate angle size using what they know about angle measures and the activity will still be valuable.

🔗Song Recommendation Systems

Overview

Students explore song recommendation, another example of data-driven algorithms at work. Students consider how data aggregation is a key component of machine learning.

Launch

Invite students to share their responses.

  • If their responses highlight that data-driven algorithms produce a higher quality output when we provide more data—​great! Your students understood the key take-away from our Data-Driven Algorithms: Spell Checkers lesson.

  • Your students might bring new complexity to the conversation by acknowledging the possibility that a programmer’s changes to the algorithm could have caused Michelle’s increased satisfaction with her playlist.

  • If your students do not propose that Spotify’s algorithm was updated (as we suggest in the answer key), that’s okay! There’s no need to reveal that possibility immediately. We recommend moving on with the lesson. After completing Designing a Song Recommendation System, you can circle back to Case Study: Michelle’s Spotify Use.

Investigate

It’s likely that Michelle could get better suggestions without Spotify making any changes to the code base. But it’s also possible that changing the code would improve Michelle’s experience! Let’s consider song recommendation in more depth and explore the idea that sometimes it is helpful for algorithms to change.

Very broadly, a song recommendation system does two things:

  • collect a user’s listening history to build a detailed profile of their musical tastes

  • given a new song, determine whether or not to recommend that song

What does building the profile for a listener entail?

  • With a partner, complete Designing a Song Recommendation System.

  • First, you will think about what data could be collected about a song of your choosing.

  • Then, you will (informally) design your own song recommendation system!

  • What would your song recommendation system’s algorithm prioritize?

  • Answers will vary!

As you discovered, song recommendation systems need training in order to make recommendations.

Training is the act of transforming data into a model.

In machine learning, we generally start with a large chunk of data. A model is then generated from the data. That model is generally expected to be much smaller than the original chunk of data (but it may still be huge!). The model can then be queried to get answers.

In the case of Spotify, the model is a summary of the users' listening habits. Using this model, Spotify answers questions about new, unseen data (e.g. Do we expect Michelle to like the latest Taylor Swift song…​ or not?).
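To make the idea concrete, here is a toy sketch in Python (entirely hypothetical; Spotify’s actual system is far more complex and secret). Training aggregates a raw listening history into a much smaller model, which we can then query about new songs:

```python
# Hypothetical toy example: "train" a tiny taste model by aggregating
# a listening history into genre counts. The model is much smaller
# than the raw data, but can still answer questions about new songs.
def train(history):
    model = {}
    for genre in history:
        model[genre] = model.get(genre, 0) + 1
    return model

def recommend(model, genre):
    # Query the model: suggest a new song if we've often seen its genre
    return model.get(genre, 0) >= 2

listening_history = ["jazz", "pop", "jazz", "rock", "jazz", "pop"]
model = train(listening_history)
print(model)                         # {'jazz': 3, 'pop': 2, 'rock': 1}
print(recommend(model, "jazz"))      # True
print(recommend(model, "country"))   # False
```

Notice that the six-item history collapses into a three-entry model; with millions of listens, the savings would be enormous.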

When Michelle observed that Spotify must have updated its algorithms, she may have been correct…​

  • Maybe the algorithm was altered to put more weight on other listeners' behaviors, and less weight on the user’s listening behaviors.

  • Maybe the designer realized they’d left out an important factor in predicting people’s musical tastes and the algorithm was completely overhauled!

But we have no way of knowing what actually goes on behind the scenes at Spotify, because 1) these are trade secrets (which companies don’t talk about) and 2) there is a huge team actively working on Spotify, so how it works could change from day to day!

If your students did not suggest that Spotify improved its algorithms on Case Study: Michelle’s Spotify Use, now is an appropriate time to add some complexity and nuance to the conversation. Discuss the possibility that the algorithm changed a little or a lot…​ and that there’s no way for us to know!

Synthesize

  • Why do you think models are generally smaller than the training data?

  • Generally, the model summarizes the data, eliminating all but its most essential features—​the features that enable it to make predictions, generate text, etc.

  • What advantages might there be to the model being smaller than the training data?

  • Smaller models can be more efficient and less costly. They require less memory and fewer resources. Reduced computational demand can translate to lower hardware expenses and reduced energy consumption.

  • What disadvantages might there be?

  • If the training data is large and complex, a smaller model might not generalize well to new, unseen data. If a model is too small, it may be inaccurate.

  • How is the problem of Spotify trying to improve its recommendations similar to the problem of ALVINN trying to drive on new surfaces?

  • At first, Michelle did not like Spotify’s "Discover Weekly" playlist because the songs did not match her tastes. Giving Spotify more data is one possible way that Michelle could get better song recommendations.

  • Similarly, ALVINN will produce safer, more accurate steering instructions when exposed to more training: training on snowy roads, on icy roads, on three-lane highways, etc. With data-driven algorithms, more data produces better results even when the same algorithm is being used!

  • Another option, though, is to use a different algorithm! Just as an improvement to Spotify’s algorithm might result in Michelle enjoying its output more, a change in ALVINN’s contract could produce safer driving.

  • For instance, ALVINN’s programmers could update the contract for its function so that the program takes into consideration some history, rather than making all decisions instantaneously. This way, the program could respond appropriately to road signs and other data.

🔗Bags of Words

Overview

Students practice thinking like a hacker to determine the basic requirements of a successful plagiarism detection program. They consider the "bag-of-words" model, which a user can query to better understand how similar or different two documents are.

Launch

As a student, you probably know what it feels like to be under surveillance.

  • When you use the internet at your school or on a school-issued computer, software probably monitors your web use and blocks you from visiting a multitude of sites.

  • When you take a test, it’s likely proctored.

  • You might even go to a school where adults are stationed around the building and in the hallways or use cameras to check that students are dressed and behaving a certain way.

  • When you submit an essay to your English or History teacher, you can expect that they will check for plagiarism — perhaps by running it through a plagiarism detector to be certain that all words and thoughts are your own.

Good designers of these systems have to imagine all the ways that someone might try to hack or fool them. This is called "adversarial thinking". Let’s practice thinking like a designer.

  • Imagine that your teacher announces that they will be running all student writing through a plagiarism detector and you are a student who wants to plagiarize. Exercise some creativity: What are your strategies for evading detection?

  • Responses will vary, but may include the following:

    • replace common words with synonyms

    • change the ordering of sentences and paragraphs

    • plagiarize from an unlikely source (maybe a friend who took the class 5 years ago?)

    • plagiarize from multiple sources

    • paraphrase text so that its tone matches the student’s voice

Adversarial Thinking

Go easy on your students! As students share their plagiarism strategies, you may feel judgmental. We urge you to keep those feelings at bay.

In this exercise, we are trying to get students to engage in Adversarial Thinking (put simply, thinking like a hacker). This strategy is taught, for example, in university courses focused on security, data protection, and harms caused by AI. Adversarial Thinking is a valuable skill for students to develop; the key is that they learn how to exercise it in an ethical way!

Just because students excel at thinking in this way doesn’t mean they are ethically compromised. Focus on and commend their creativity and reasoning instead of judging them.

To understand the workings of plagiarism detection, we’ll start by looking at a simple detector.

Debrief the page with your class using the prompts below.

  • What does the simple-equality detector do?

  • Takes in two documents and returns true if they match exactly and false if they don’t match exactly.

  • How would you evaluate the effectiveness of the simple-equality detector?

  • It doesn’t work very well! We have no way of knowing how similar the documents are unless they are an exact match. Whether two documents are almost identical or have nothing in common, we will be told that they aren’t a match.

  • What might a more effective plagiarism detector do differently?

  • Answers will vary.
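The simple-equality detector described above can be captured in a few lines. Here is an illustrative Python sketch (the curriculum’s own code is in Pyret, and the function name is our own):

```python
def simple_equality(doc1, doc2):
    # Returns True only when the two documents match exactly,
    # character for character; any other pair returns False.
    return doc1 == doc2

print(simple_equality("doo be doo", "doo be doo"))   # True
print(simple_equality("doo be doo", "doo be doo!"))  # False
```

A single changed character is enough to defeat it, which is exactly the weakness the debrief questions surface.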

Plagiarizers usually alter at least a few words of the original document. Sometimes they change the ordering of the text, and sometimes they delete a sentence or word here and there.

  • If the simple-equality detector finds a match, we can be certain that the two documents are identical.

  • If the detector does not find a match, all we know is that the two documents are not identical.

We need a more sophisticated plagiarism detector!

The last section of A Primitive Plagiarism Detector invites students to think about measuring similarity. Remind students about the mountain sorting activity that they completed during Introduction to Artificial Intelligence to recontextualize the concept of measuring similarity.

  • Yara and Xola agree that there has to be a way to measure the similarity of the two essays.

  • With a partner, complete the last section of A Primitive Plagiarism Detector, where you will consider two proposals for measuring similarity — and then develop your own method for measuring similarity!

Rather than detecting identicality, we need to determine the closeness of two documents. To do that, we summarize each document, and then compute the distance between the summaries.

Investigate

One standard way to summarize a document is by creating a "bag of words" model. Let’s try it on two documents (below); each document is an example of jazz "scatting", in which a vocalist improvises with nonsense syllables.

  • Document a: "doo be doo be doo"

  • Document b: "doo doo be doo be"

The bag-of-words summary for Document a looks like this: "be": 2, "doo": 3

A bag-of-words model represents text as an unordered collection of words with their frequencies.

As you can see, we’ve taken the original sentence and disregarded word order, creating a collection that focuses solely on word frequency.

  • What is the bag-of-words summary for Document b?

  • The bag-of-words summary for Document b looks like this: "be": 2, "doo": 3.

  • It should be identical to the bag-of-words summary for Document a.

  • How did you know what order to put the words in?

  • I used the same order as the bag-of-words summary for Document a.

Note: We could have written these bag-of-words summaries as "doo": 3, "be": 2, but once we decide on a word order for one document, we must adhere to that same order for every document. The simplest way to be consistent is to use alphabetical order.
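Computing a bag-of-words summary takes only a line or two. Here is a Python sketch for illustration (the lesson itself uses Pyret):

```python
from collections import Counter

def bag_of_words(text):
    # Split on whitespace and count each word, ignoring order
    return Counter(text.split())

doc_a = "doo be doo be doo"
doc_b = "doo doo be doo be"

print(bag_of_words(doc_a))                    # Counter({'doo': 3, 'be': 2})
print(bag_of_words(doc_a) == bag_of_words(doc_b))  # True: order is discarded
```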

The bag-of-words summary for both documents is exactly the same!

A plagiarism detector that uses this model, taking stock of word frequency but not word order, could compare the bags instead of the documents. If it did so, it would conclude that the two bags of words are a perfect match…​ and that Document a and Document b are suspiciously similar.

  • How is the bag-equality plagiarism detector different from our primitive simple-equality plagiarism detector?

  • The bag-equality plagiarism detector compares two bag-of-words summaries, rather than simply comparing two texts.

  • How is the bag-equality plagiarism detector similar to our primitive simple-equality plagiarism detector?

  • Like our primitive plagiarism detector, it checks for identicality. It determines if the two bags of words are identical or not.

Checking if two bags of words are identical is an improvement from checking if two texts are identical.
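The bag-equality detector differs from the simple-equality detector by just one step: it summarizes before comparing. A Python sketch (illustrative only, with a hypothetical function name):

```python
from collections import Counter

def bags_match(doc1, doc2):
    # Compare bag-of-words summaries instead of the raw texts,
    # so reorderings of the same words count as a match
    return Counter(doc1.split()) == Counter(doc2.split())

print(bags_match("doo be doo be doo", "doo doo be doo be"))  # True
print(bags_match("doo be doo", "doo be bop"))                # False
```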

Synthesize

  • What similarities are there between a system that recommends songs and bag-equality plagiarism detection?

  • Both systems build summaries of the available data and then work with those summaries.

  • Can you think of any other apps or technologies that measure similarity in some way?

  • Image retrieval — finding images similar to a given image from a large database

  • Facial recognition — identifying and verifying individuals based on facial features

  • Product recommendation — suggesting items for purchase based on a customer’s browsing history

🔗Data Normalization

Overview

Students explore the importance of data normalization, the practice of organizing data to follow a standard pattern.

Launch

Here are some discoveries we have made so far:

  • Checking if two texts are identical is not an effective way of detecting plagiarism.

  • Summarizing documents as bags of words, and then checking for identicality is better than comparing two texts…​ but it is also not an effective way of detecting plagiarism.

What we need is a way to check if bags are similar!

One strategy programmers use for this is to represent bags of words as points in space.

Let’s see how that would work for Documents a and b.

  • We already know that Document a "doo be doo be doo" can be represented as the bag of words ("be": 2, "doo": 3).

  • Written as a coordinate pair, it would look like this: (2, 3)

  • Plotting that point on the be-doo plane looks like this:

The first quadrant of a coordinate plane with the x-axis labeled doo and the y-axis labeled be

When we plot a point on the coordinate plane, we typically locate x on the horizontal axis and y on the vertical axis. Similarly, bags must use the same word order if we want meaningful results.

  • How would you represent Document b ("doo doo be doo be") as a point on the be-doo plane?

  • The point would be in the exact same position as the point for Document a.

  • Because we decided on "be" then "doo" for Document a, we must use "be" then "doo" for Document b also.

The decision we just made — to use "be" then "doo" for both bags of words — is an example of data normalization. Data normalization is the act of adapting and modifying disparate data so that they all have the same characteristics (making them easy to compare and otherwise compute with).

Don’t skip data normalization!

Failure to normalize data can lead to useless and confusing outputs.

Investigate

Let’s look at some slightly more complicated documents and consider how to plot their points in a multi-dimensional space.

  • Document c: "doo be doo be doo doo doo"

  • Document d: "be bop bop bop be bop bop"

Document | Bag-of-words summary | Point
c        | "be": 2, "doo": 5    | (2, 5)
d        | "be": 2, "bop": 5    | (2, 5)

We have a problem. We can plainly see that Documents c and d are not the same …​ but their points are…​

  • What went wrong here?

  • The point is to draw out student thinking here rather than to get to any particular answer. The remainder of the lesson will dig into the details. Students might suggest:

    • The points were written as if there were only two items in the list…​ but, in fact there are three different items!

    • 5 represents "doo" in the first point and "bop" in the second point…​ but we’ve lost that information.


In the example above, we forgot the data normalization. How can we fix it?

To solve this problem, let’s start by taking a closer look at our data.

  • Document c: "doo be doo be doo doo doo"

  • Document d: "be bop bop bop be bop bop"

When we use a Venn Diagram to visualize the data…​

A venn diagram with doo in the left circle, be where the circles overlap, and bop in the right circle

…​we recognize that Documents c and d
contain a total of three different words!

Because there are three words, we need to use a three dimensional space, rather than a coordinate plane, which has just two dimensions.

We must revise our bag-of-words summaries and our points!

Document | Bag-of-words summary        | Point
c        | "be": 2, "bop": 0, "doo": 5 | (2, 0, 5)
d        | "be": 2, "bop": 5, "doo": 0 | (2, 5, 0)

Normalizing the data requires that we consider all the words: when a word occurs zero times in a document, we record that zero rather than glossing over the dimension. When we include all of the words from both documents, we produce a model with the correct dimensionality. For the bag-of-words model, the dimensionality equals the number of distinct words in the corpus.
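This normalization step (a shared, alphabetized vocabulary with explicit zeros) can be sketched in Python for illustration; the function name is our own:

```python
from collections import Counter

def normalized_points(*documents):
    bags = [Counter(doc.split()) for doc in documents]
    # Shared vocabulary across ALL documents, in alphabetical order
    vocab = sorted(set().union(*bags))
    # Record an explicit 0 when a word is absent, so every point
    # lives in the same multi-dimensional space
    return [tuple(bag.get(word, 0) for word in vocab) for bag in bags]

doc_c = "doo be doo be doo doo doo"
doc_d = "be bop bop bop be bop bop"
print(normalized_points(doc_c, doc_d))  # [(2, 0, 5), (2, 5, 0)]
```

Because the vocabulary is built from both documents together, the two points automatically land in the same three-dimensional space.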

It is a bit trickier to envision plotting these points, but not impossible!

a 2D representation of a 3-dimensional space, with two points plotted: (5, 2, 0) and (0, 2, 5)

  • In the 3-dimensional space to the right, which point represents c?

  • The one on the bottom.

  • How do you know?

  • It’s at point (2,5) on the be-doo plane, and has moved 0 in the bop direction.


Let’s recap:

  • We started out with two documents.

  • Now, in place of our two documents, we have two points that exist at specific locations in a multi-dimensional space.

  • We are going to think about how to make use of those points very soon…​

  • But first, let’s practice!

  • Complete Plotting Bags of Words, where you will convert text documents into bags of words, and then plot points to represent those bags.

  • You will also get an opportunity to reverse the process. (You will convert plotted points into bags and text!)

Once students have completed Plotting Bags of Words, reflect on the activity by discussing the prompts below.

  • Which cells on Plotting Bags of Words had more than one correct solution? Why?

  • When we were asked to write the text when given either an ordered pair or a bag-of-words summary, multiple solutions were possible.

  • For instance, in row I, "be doo doo" and "doo be doo" would both be correct responses.

  • Multiple responses are correct in these instances because the bag-of-words model eliminates word order.

  • Who do you agree with, Sierra or Jaden?

  • Students can reasonably agree with either Sierra or Jaden, depending on whether they think the specific lyrics define song A, or if its repetitive nature is what defines song A.

  • Some students may contest that it is too difficult to determine similarity with such limited information — also a valid point.

  • If your students discuss the actual distance of the different points on the coordinate plane, they are thinking like programmers!

Synthesize

  • Earlier in the lesson, you learned that generally, models summarize the data, eliminating all but the most essential features. Which features of the starting document does the bag of words eliminate? Which features does it preserve?

  • The bag-of-words model eliminates word order. It preserves word count.

  • Why is it important for the bag-of-words summary to acknowledge when a word occurs zero times?

  • Each point exists in a multi-dimensional space. To compare points and consider their closeness, the points must exist in the same multi-dimensional space. When we omit a word that occurs zero times, we are in fact omitting a dimension and constructing a broken model.

🔗Computing Closeness with Angle Difference

Overview

Compressing text into bags of words gives us a coarse-grained notion of similarity. Let’s explore how to produce a more refined notion of similarity.

Launch

When we ask people whether two documents are the same, they rarely give us a black-and-white "yes" or "no" answer. Instead they tend to speak about shades of similarity. Likewise, we would like our computer to give us a range of values that give us a sense of how similar the two documents are. In other words, we would like the output to be a Number, not just a Boolean (identical, not identical).

Investigate

Now that we know how to represent our bag of words summaries as points in space, we can draw a ray from the origin through each of those points and ask: What is the angle between the two rays?

Take, for example, this comparison between two strings: StringA ("doo doo doo doo") and StringB ("be be be be").

StringA: doo doo doo doo

Word | Frequency
be   | 0
doo  | 4

Ordered pair: (0,4)

StringB: be be be be

Word | Frequency
be   | 4
doo  | 0

Ordered pair: (4,0)

a coordinate plane with rays from the origin through (0,4) and (4,0)

The angle formed is 90°.

If two documents are identical, they will be at the same point in space, and have the same ray extending from the origin to that point. That means the angle between those rays will be 0°. Even if one document just rearranges the other, their bags of words will be identical—thereby again making the angle between the rays 0°.

  • Complete Angle Difference using your knowledge of bags of words and plotting points.

    • First, fill in the frequency tables by referring to the provided string.

    • Translate the bags of words to ordered pairs.

    • Plot the points.

    • Draw a ray from the origin to each of the points.

    • Approximate the angle size.

As the documents share fewer words, the angle between the rays grows. To capture this, we can use the angle-difference function, which gives us a value between 0° (if the two are identical) and 90° (if the two have nothing in common).
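Here is one way such an angle difference could be computed, sketched in Python for illustration (the curriculum’s actual angle-difference lives in the Pyret starter file; this sketch only mirrors its contract):

```python
import math
from collections import Counter

def angle_difference(doc1, doc2):
    # Summarize each document as a bag of words
    bag1, bag2 = Counter(doc1.split()), Counter(doc2.split())
    # Normalize: a shared, alphabetized vocabulary with explicit zeros
    vocab = sorted(set(bag1) | set(bag2))
    v1 = [bag1.get(w, 0) for w in vocab]
    v2 = [bag2.get(w, 0) for w in vocab]
    # Cosine of the angle between the two vectors from the origin
    dot = sum(a * b for a, b in zip(v1, v2))
    length1 = math.sqrt(sum(a * a for a in v1))
    length2 = math.sqrt(sum(b * b for b in v2))
    cosine = dot / (length1 * length2)
    # Convert back to degrees (clamp to guard against rounding error)
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

print(angle_difference("doo doo doo doo", "be be be be"))  # about 90
print(angle_difference("hello world", "hello"))            # about 45
```

As the note below explains, real systems typically stop at the cosine; converting to an angle is a final step for readability.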

Points, Rays, and Vectors

As you’ve discovered, our plagiarism detector computes the angle difference between rays extending from the origin to various points that we have plotted in space.

In machine learning, we generally refer to these bag-of-word representations not as points, but as vectors. Why? A point represents a location in space, whereas a vector represents a magnitude and a direction.

To reduce the amount of new vocabulary introduced in this lesson, we have opted to refer simply to points and rays. More commonly, however, the term vector is used in a machine learning context.

If you or your students are wondering why we wouldn’t just compute the distance between points, rather than complicating things and introducing angles…​ it’s because typically, machine learning uses vectors, not points.

The contract for angle-difference is below.

# angle-difference :: (String, String) -> Number

Let’s try the angle-difference function in Pyret.

  • Check your work on Angle Difference.

    • Open Plagiarism Detection Starter File and click "Run".

    • Enter angle-difference("doo doo doo doo", "be be be be") into the Interactions Area.

    • Does the angle size that Pyret produces match the angle that you drew? (Hopefully yes!)

    • Use angle-difference to compare each pair of strings on Angle Difference.

Angles?!

Yes, angles!

Did you know that geometry is at the heart of modern AI? This lesson shows how. The same angles that your students learn to compute in middle school are sitting at the heart of the machine learning calculations that power so many things in the world today. Even the plagiarism detectors that might be checking their essays on angles…​ are computing angles. So if your students ask “When are we ever going to use this?”, you can tell them, “You already do, all the time.”

The plot thickens, especially if you have older students who have learned some trigonometry. In practice, real machine learning systems don’t quite use angles. Instead, they use the cosine of the angle. There are two reasons for this:

  • The angle itself is a somewhat awkward value to work with. In contrast, the cosine has a nice numeric range, between -1 and 1, which makes it convenient to use in various other mathematical settings. (Specifically, it’s used in a process called gradient descent.)

  • It’s simpler to compute the cosine directly. In fact, inside Pyret, angle-difference actually first computes the cosine, then converts the result into an angle!

For the purposes of this curriculum, you can ignore this difference. In particular, if your students have never even heard of the cosine, that’s fine! For students who are familiar with cosine and curious to explore, the Plagiarism Detection Starter File contains a cosine-similarity function.

Synthesize

Here are three different lines of code.

angle-difference("hello world", "hello")

angle-difference("hello", "goodbye")

angle-difference("hello", "hello")

  • Which line of code produces 90°? How do you know?

  • angle-difference("hello", "goodbye"); the two strings are completely different.

  • Which line of code produces 45°? How do you know?

  • angle-difference("hello world", "hello"); the two strings have one word in common; they are not entirely different nor are they identical.

  • Which line of code produces 0°? How do you know?

  • angle-difference("hello", "hello"); the two strings are exactly the same.

These materials were developed partly through support of the National Science Foundation, (awards 1042210, 1535276, 1648684, 1738598, 2031479, and 1501927). CCbadge Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.