(Also available in CODAP)
Students are introduced to box plots, learn to evaluate the spread of a quantitative column, and deepen their perspective on shape by matching box plots to histograms.
Making Box Plots 30 minutes
Overview
Students are introduced to the notion of spread in a dataset. They learn about quartiles, box plots, and how to use them to talk about spread.
Launch
When we explored measures of center, we tried to determine the "typical" weight of animals at the shelter.
We determined that the average pet weighs almost 40 pounds.
But, how useful is this fact, really? Maybe all the pets weigh between 35 and 45 pounds, with every pet close to the mean. But maybe all the pets are super small or huge, and none are anywhere near the mean!
Once we have our summary for a "normal value", it’s likely we’ll ask: How typical is the average?
We’d expect some deviation - or spread - in any sample!
How do we measure the spread of a sample?
-
We can start by lining up all the animals' weights from smallest to largest.
-
We can compute the range of a dataset by finding the distance between the minimum and maximum values.
Note: the term “Range” means something different in statistics than it does in algebra and programming!
To learn more about how evenly the data is distributed, we can:
-
Find the median, which splits the data in half
-
Split the data into four equal-sized quarters (by splitting each of these halves in half) and identify the quartiles (boundary points between these equal quarters).
-
Combine the quartiles with the minimum and the maximum to get a five-number summary of the dataset:
-
Minimum: the smallest value in a dataset - it starts the first quarter
-
Q1 (lower quartile): the number that separates the first quarter of the data from the second quarter of the data
-
Q2 (median): the middle value in a dataset - it separates the second quarter of the data from the third
-
Q3 (upper quartile): the value that separates the third quarter of the data from the last
-
Maximum: the largest value in a dataset - it ends the fourth quarter of the data
-
-
Use the quartiles to calculate IQR (Interquartile Range), the distance spanned by the middle half of the data. IQR = Q3 - Q1
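For teachers who want to check these steps outside of Pyret, here is a minimal Python sketch of the procedure above. The function name `five_number_summary` and the sample weights are hypothetical, chosen only for illustration. The sketch assumes the "median of each half" method described in this lesson, leaving the middle value out of both halves when the count is odd; other tools may use slightly different quartile conventions and give slightly different answers.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of at least two numbers."""
    ordered = sorted(values)            # line the values up from smallest to largest
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # lower half (middle value left out when n is odd)
    upper = ordered[half + n % 2:]      # upper half (middle value left out when n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical weights in pounds (not taken from the Animals Dataset)
weights = [2, 4, 6, 9, 12, 20, 35, 48, 80]

minimum, q1, q2, q3, maximum = five_number_summary(weights)
data_range = maximum - minimum          # distance spanned by all of the data
iqr = q3 - q1                           # distance spanned by the middle half

print("Five-number summary:", (minimum, q1, q2, q3, maximum))
print("Range:", data_range)
print("IQR:", iqr)
```

In this sample the range is 78 pounds while the IQR is only 36.5 pounds, so the middle half of the values takes up less than half of the total span.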
Investigate
To visualize the five-number summary, the range, and the interquartile range, we can use box plots, which show how the four equal quarters of data are spread out along the number line.
When we see that some of the sections are narrow and others are wider, we know that the narrow sections are packed more densely. They contain exactly as many points as the wider sections, but have less room for them to spread out!
-
Which quarter of data is packed the densest in this box plot?
-
The third one
-
-
Which quarter of the data is the most dispersed in this box plot?
-
The fourth one
-
When the points are evenly distributed, the four sections of the box plot will be equal in size, like this:
Even Distribution
Left and right skew are easy to identify from a quick glance at a box plot, with longer whiskers trailing off toward potential outliers.
Sometimes there is roughly the same amount of variation on the low end as on the high end. For example, the distribution of newborns who are smaller than average might mirror that of newborns who are bigger than average. We call this kind of spread symmetric.
Symmetric
Below is the Contract for box-plot:
box-plot :: (t::Table, col::String) -> Image
Box plots divide our sample into four equally populated groups, and show which of those groups are spread wide or are tightly packed.
Let’s see what we can learn about the spread of the data in the pounds column by making a box-plot!
-
Log into code.pyret.org (CPO), open your saved "Animals Starter File" and click "Run". If you don’t have the file, you can open a new one.
-
Turn to Summarizing Columns with Measures of Spread and follow the directions to complete the Summarizing the Pounds Column section.
Students will type box-plot(animals-table, "pounds") into the Interactions Area and use the resulting box plot to fill in the five-number summary for the pounds column and sketch the box plot.
-
What conclusions can you draw about the distribution of values in this column?
-
While the animals' weights range from 0.1 pounds to 172 pounds, 50% of the animals weigh 11.3 pounds or less. The animal that weighs 172 pounds may be an outlier.
-
-
If Q1 is the value for which 25% of the animals weigh that amount or less, what does Q3 represent?
-
The third quartile is the value for which 75% of the animals weigh that amount or less. Another way of saying that is that it is the value for which 25% of the animals weigh that amount or more.
-
-
Could we make a box plot for every column in the data set?
-
No. We can only make box plots for quantitative columns.
-
-
Why do you think this display is sometimes called a "box and whisker plot"?
-
The lines stretching from the box out to the minimum and the maximum look like whiskers!
-
If students are struggling to write conclusions, go over the following five-number summary from the box plot they made.
-
Minimum (the left “whisker”) - the smallest value in the dataset. In our dataset, that’s just 0.1 pounds.
-
Q1 (the left edge of the box) - computed by taking the median of the lower half of the values. In the pounds column, that’s 3.9 pounds.
-
Q2 / Median value (the line in the middle) - the middle quartile of the whole dataset. We already computed this to be 11.3 pounds.
-
Q3 (the right edge of the box), which is computed by taking the median of the upper half of the values. That’s 60.4 pounds in our dataset.
-
Maximum (the right “whisker”) - the largest value in the dataset. In our dataset, that’s 172 pounds.
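To help students move from the five numbers to a conclusion, the arithmetic can be spelled out. The Python sketch below (not Pyret, and not part of the starter file) uses only the five values reported above for the pounds column; the variable names and the labels "first" through "fourth" are illustrative choices.

```python
# Five-number summary of the pounds column, as reported in this lesson
minimum, q1, median, q3, maximum = 0.1, 3.9, 11.3, 60.4, 172

data_range = maximum - minimum    # distance spanned by all of the weights
iqr = q3 - q1                     # distance spanned by the middle half

# Width of each quarter of the box plot, in pounds.
# Every quarter holds 25% of the animals, so a narrower quarter is more densely packed.
quarter_widths = {
    "first": q1 - minimum,
    "second": median - q1,
    "third": q3 - median,
    "fourth": maximum - q3,
}

densest = min(quarter_widths, key=quarter_widths.get)
most_spread = max(quarter_widths, key=quarter_widths.get)

print(f"Range: {data_range:.1f} pounds")
print(f"IQR: {iqr:.1f} pounds")
for name, width in quarter_widths.items():
    print(f"{name} quarter spans {width:.1f} pounds")
print("Densest quarter:", densest)
print("Most spread-out quarter:", most_spread)
```

For the pounds column, the lightest quarter of the animals is squeezed into less than 4 pounds, while the heaviest quarter stretches across more than 100 pounds, which points to a right-skewed distribution.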
Choose another quantitative column to summarize and complete the second half of Summarizing Columns with Measures of Spread.
Common Misconceptions
It is extremely common for students to forget that the quartiles divide the data into quarters, each of which includes 25% of the dataset. This will need to be heavily reinforced.
Synthesize
-
What percentage of points fall in the first quarter?
-
25%
-
-
What percentage of points fall in the second quarter?
-
25%
-
-
What percentage of points fall in the third quarter?
-
25%
-
-
What percentage of points fall in the fourth quarter?
-
25%
-
-
What percentage of points fall in the Interquartile Range (IQR)?
-
50%
-
-
What percentage of points fall within the Range?
-
100%
-
Interpreting Box Plots 30 minutes
Overview
Students learn how to read a box plot, connecting this visualization of spread to what they know about histograms.
Launch
Box plots and histograms give us two different views of the shape of quantitative data.
| | Intervals | Data points per Interval |
|---|---|---|
| Box Plots | Variable | Fixed - 25% of the data in each Interval |
| Histograms | Fixed Bins | Variable - Points “pile up in bins”, so we can see how many are in each. |
In histograms, skewness shows up as a long tail of shorter bars to one side.
In a box plot, skewness is seen as a longer "whisker" or more spread in one half of the box.
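The contrast between the two displays can be made concrete with a small computation. The Python sketch below uses a hypothetical, right-skewed list of values and an arbitrarily chosen bin width of 10. It counts points per fixed-width bin (the histogram view), then measures the width of each quarter (the box plot view), using the same "median of each half" method for quartiles that this lesson describes.

```python
from statistics import median

# Hypothetical right-skewed sample, already sorted
values = [1, 2, 2, 3, 3, 4, 5, 7, 9, 12, 18, 30]

# Histogram view: fixed-width bins, variable number of points in each
bin_width = 10
bin_counts = {}
for v in values:
    bin_start = (v // bin_width) * bin_width
    bin_counts[bin_start] = bin_counts.get(bin_start, 0) + 1

# Box plot view: a fixed share of the points in each interval, variable widths
half = len(values) // 2
q1 = median(values[:half])
q2 = median(values)
q3 = median(values[half + len(values) % 2:])
boundaries = [values[0], q1, q2, q3, values[-1]]
interval_widths = [high - low for low, high in zip(boundaries, boundaries[1:])]

print("Points per bin (bin start -> count):", bin_counts)
print("Quarter boundaries:", boundaries)
print("Quarter widths:", interval_widths)
```

Here the bins starting at 0, 10, and 30 hold 9, 2, and 1 points (the bin starting at 20 is empty), which is the histogram's long tail of short bars. The quarters each hold three points, but their widths grow from 1.5 to 19.5, which is the box plot's long right whisker.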
Kinesthetic Activity
Divide the class into groups, and give each group a ruler and a ball of play-dough. Have them draw a number line from 0-6 with the ruler, marking off the points at 0, 3, 4, 4.5 and 6 inches. Have the groups roll the dough into a thick cylinder, divide that cylinder in half, and then split each half to form four equally-sized cylinders. The play-dough represents a sample, with values divided into four quarters.
Box plots stretch and squeeze these equal quarters of the data across a number line, so that they fit into their respective intervals. On their number line, students have intervals from 0-3, 3-4, 4-4.5, and 4.5-6. Have students shape their cylinders into rectangles that fill each of these intervals, and are all about 1 inch thick.
Students should notice that the play-dough is taller for shorter intervals and flatter for longer intervals. Even though a box plot doesn’t show us how high the data points pile up, we know that a small interval has the same amount of data "squeezed" into it as a large interval has spread across it.
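The heights students should expect can be worked out ahead of time. The Python sketch below assumes each of the four cylinders contains 3 cubic inches of play-dough; that volume is a made-up figure chosen to keep the arithmetic simple, and only the ratios between the heights matter.

```python
# Number-line marks from the activity, in inches
boundaries = [0, 3, 4, 4.5, 6]

thickness = 1.0            # each rectangle is about 1 inch thick
volume_per_quarter = 3.0   # hypothetical: cubic inches of dough in each equal piece

heights = []
for left, right in zip(boundaries, boundaries[1:]):
    width = right - left
    # Same volume in every interval, so height = volume / (width * thickness)
    heights.append(volume_per_quarter / (width * thickness))

print(heights)  # [1.0, 3.0, 6.0, 2.0]
```

The 4 to 4.5 interval is the narrowest, so its rectangle is the tallest: six times as tall as the one covering 0 to 3.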
Investigate
-
Let’s practice identifying the shape of data from box plots! Turn to Identifying Shape - Box Plots.
-
To make connections between histograms and box plots, complete Matching Box Plots to Histograms.
-
With a partner, complete the Box Plot Vocab Concept Map and see if you can draw connections between these concepts!
-
Complete Reading Box Plots to practice matching box plots to a written description of a distribution.
-
Complete Matching Box Plots to Histograms 2 and/or the Matching Box Plots to Histograms slide of Box plot practice (Desmos).
Modified Box Plots
Classes with a stronger statistics or math focus may also be familiar with modified box plots (video explanation), which exclude outliers from the box and whiskers and draw them as asterisks beyond the whiskers.
Modified box plots are also available in Bootstrap:Data Science, using the following Contract:
# modified-box-plot :: (Table table-name, String column) -> Image
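This lesson does not state how a modified box plot decides which points count as outliers. A common convention is the 1.5 × IQR rule, which flags any value more than 1.5 IQRs below Q1 or above Q3. The Python sketch below applies that rule to the pounds quartiles reported earlier in this lesson. Whether modified-box-plot uses exactly this rule is an assumption, and the helper `is_outlier` is a hypothetical name.

```python
# Quartiles of the pounds column, as reported earlier in this lesson
q1, q3 = 3.9, 60.4
iqr = q3 - q1

# Assumed convention: the 1.5 x IQR rule (not confirmed by this lesson)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def is_outlier(value):
    """Return True when a value falls outside the 1.5 x IQR fences."""
    return value < lower_fence or value > upper_fence

print(f"Fences: {lower_fence:.2f} to {upper_fence:.2f} pounds")
print("Is 172 pounds an outlier?", is_outlier(172))
print("Is 0.1 pounds an outlier?", is_outlier(0.1))
```

Under this rule the upper fence is about 145 pounds, so the 172-pound animal would be drawn as a separate point, while the 0.1-pound animal would stay on the whisker.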
Now that you have the skills to interpret box plots, complete Data Cycle: Shape of the Animals Dataset.
Synthesize
Now that we’ve explored the spread of the dataset, do you think the mean is the best measure of center for the animals' weights?
Data Exploration Project (Box Plots) flexible
Overview
Students apply what they have learned about box plots to their chosen dataset. They will add three items to their Data Exploration Project Slide Template: (1) at least two box plots, (2) the corresponding five-number summaries, and (3) any interesting questions they develop.
To learn more about the sequence and scope of the Exploration Project, visit Project: Dataset Exploration. For teachers with time and interest, Project: Research Capstone is an extension of the Dataset Exploration, where students select a single question to investigate via data analysis.
Launch
Let’s review what we have learned about making and interpreting box plots.
-
Does a box plot display categorical or quantitative data? How many columns of data does a box plot display?
-
Box plots display a single column of quantitative data.
-
-
How are box plots similar to histograms? How are they different?
-
Box plots and histograms give us two different views on the concept of shape. Histograms have fixed intervals ("bins") with variable numbers of data points in each one. Box plots have variable intervals ("quarters") with a fixed number of data points in each one.
-
-
A box plot lets us visualize the five-number summary. What does the five-number summary tell us about the column of data?
-
The five-number summary includes the minimum, median, and maximum. It also includes the median of the lower half of the values (Q1) and the median of the upper half of the values (Q3).
-
Investigate
Let’s connect what we know about box plots to your chosen dataset.
Students have the opportunity to choose a dataset that interests them from our List of Datasets in the Choosing Your Dataset lesson.
-
Open your chosen dataset starter file in Pyret.
-
Remind yourself which two columns you investigated in the Measures of Center lesson and make a box plot for one of them.
-
What question does your display answer?
-
Possible responses: How is the data for a certain column distributed? Are the values close together or really spread out? Are there any outliers?
-
-
Now, write down that question in the top section of Data Cycle: Shape of My Dataset.
-
Then, complete the rest of the data cycle, recording how you considered, analyzed and interpreted the question.
-
Repeat this process for the other column you explored before (and any others you are curious about).
If students want to investigate new columns from their dataset, they will need to copy/paste additional Measures of Center and Spread slides into their Exploration Project and calculate the mean, median and modes for the new columns.
Confirm that all students have created and understand how to interpret their box plots. Once you are confident that all students have made adequate progress, invite them to access their Data Exploration Project Slide Template from Google Drive.
-
It’s time to add to your Data Exploration Project Slide Template.
-
Find the box plot slide in the "Making Displays" section and copy/paste your first box plot here. Duplicate the slide to add your other box plots.
-
Add the five-number summaries from these plots to the corresponding "Measures of Center and Spread" slides.
-
Be sure to also add any interesting questions that you developed while making and thinking about box plots to the "My Questions" slide at the end of the deck.
Synthesize
-
What shape did you notice in your box plots?
-
Did you discover anything surprising or interesting about your dataset?
-
What, if any, outliers did you discover when making box plots?
-
When you compared your findings with others, did you make any interesting discoveries? (For instance: Did everyone find outliers? Was there more or less similarity than expected?)
Additional Exercises
These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, 1738598, 2031479, and 1501927).
Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.