
Students are introduced to box plots, learn to evaluate the spread of a quantitative column, and deepen their perspective on shape by matching box plots to histograms.

Lesson Goals

Students will be able to…​

  • apply one approach to measuring and displaying spread of a dataset

  • compare and contrast information displayed in a box plot and a histogram

Student-facing Lesson Goals

  • Let’s compare different uses for box plots and histograms when talking about data.

Materials

Preparation

Glossary
box plot

the box plot (a.k.a. box-and-whisker plot) is a way of displaying a distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum

interquartile range

(IQR) is one possible measure of spread, based on dividing a dataset into four parts. The values that divide each part are called the first quartile (Q1), the median, and third quartile (Q3). IQR is calculated as Q3 minus Q1.

maximum

the largest value in a dataset

median

the middle value of a quantitative dataset, once its values are sorted from smallest to largest

minimum

the smallest value in a dataset

quartile

each of four equal groups into which a population can be divided according to the distribution of values of a particular variable.

range

the type or set of outputs that a function produces

range of a dataset

the distance between minimum and maximum values

sample

a set of individuals or objects collected or selected from a statistical population by a defined procedure

shape

The aspect of a dataset - visible in a histogram or box plot - that describes which values are more or less common.

spread

the extent to which values in a dataset vary, either from one another or from the center

Making Box Plots (30 minutes)

Overview

Students are introduced to the notion of spread in a dataset. They learn about quartiles, box plots, and how to use them to talk about spread.

Launch

When we explored measures of center, we tried to answer a question about "typical" values. We considered a fact - that the Animal Shelter Bureau says the average pet weighs almost 41 pounds.

How useful is this fact, really? Maybe all the pets weigh between 35 and 45 pounds, with every pet close to the mean. But maybe all the pets are super small or huge, and no one is even near to the mean!

So once we have our summary for a "normal value", it’s likely we’ll ask another question: If the average pet is 41 pounds, just how typical is that?

There are differences in every class of students. Not everyone likes the same music, not everyone dresses the same, etc. So we’d expect some deviation - or spread - in any class of students! Some classes are more different than others. How do we measure the spread of a population?

Suppose we lined up all animals' weights from smallest to largest, and then split them in half by taking the median. We can learn something about the spread of the dataset by taking the median of each half, splitting the population into four equal-sized quartiles.

  • The first quartile (Q1) is the value for which 25% of the animals weighed that amount or less.

  • What animals does the third quartile represent?

    • The third quartile is the value for which 75% of the animals weighed that amount or less.

Besides looking at the median as center, and the spread between Q1 and Q3, we also gain valuable information from the spread of the entire dataset—that is, the distance between minimum and maximum. This is called the range of a dataset. (Note: the term “Range” means something different in statistics than it does in algebra and programming!)

Splitting a dataset into quartiles gives us five numbers that we can play with to measure spread. To summarize what we’ve seen so far:

  1. Minimum: the smallest value in a dataset

  2. Q1: the median of the lower half of the data, between the minimum and Q2

  3. Q2 (Median): the middle value in a dataset

  4. Q3: the median of the upper half of the data, between Q2 and the maximum

  5. Maximum: the largest value in a dataset

Taken together, these are called the five-number summary of a dataset, and this summary is one tool for describing spread. We can use these numbers to calculate two new values:

  • Maximum - Minimum = Range : the distance spanned by the extreme values in the dataset

  • Q3 - Q1 = IQR: the Interquartile Range, or the distance spanned by the middle half of the data
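The arithmetic behind the five-number summary can be sketched in a few lines. This is a minimal illustration in Python rather than Pyret, and the `weights` list and both function names are hypothetical, invented for this example. It uses the "median of each half" approach described above and, when the number of values is odd, leaves the median out of both halves. That is one common convention; other tools may compute quartiles slightly differently.

```python
def median(sorted_values):
    """Return the middle value of an already-sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    # Even count: average the two middle values.
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Median-of-halves method; needs at least two values.

    Assumption: for an odd count, the median is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2 :]
    return {
        "min": ordered[0],
        "q1": median(lower_half),
        "median": median(ordered),
        "q3": median(upper_half),
        "max": ordered[-1],
    }


# Hypothetical weights in pounds (not the Animals Dataset).
weights = [2, 4, 5, 7, 9, 12, 30, 48, 60, 85, 110]
summary = five_number_summary(weights)

data_range = summary["max"] - summary["min"]  # Maximum - Minimum
iqr = summary["q3"] - summary["q1"]           # Q3 - Q1

print(summary)
print("Range:", data_range, "IQR:", iqr)
```

For this made-up list the range is much larger than the IQR, which suggests that a few heavy animals stretch the extremes while the middle half of the data sits in a narrower band.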

Investigate

We can use box plots to visualize the five-number summary, the range, and the interquartile range. Below is the contract for box-plot, along with an example that will make a box plot for the pounds column in the animals-table.

# box-plot :: (t :: Table, col :: String) -> Image
# Consumes a table and the name of the column
# to plot, and produces a box plot
box-plot(animals-table, "pounds")

Box plots divide our sample into equally-sized groups, and show where those groups are spread thin or clumped together.

Type box-plot(animals-table, "pounds") into the Interactions Area, and see the resulting plot.

This plot shows us the center and spread in our dataset according to those five numbers.

  • Minimum (the left “whisker”) - the smallest value in the dataset. In our dataset, that’s just 0.1 pounds.

  • Q1 (the left edge of the box) - the median of the lower half of the values. In the pounds column, that’s 3.9 pounds.

  • Q2 / Median (the line in the middle of the box) - the middle value of the whole dataset. We already computed this to be 11.3 pounds.

  • Q3 (the right edge of the box) - the median of the upper half of the values. In our dataset, that’s 60.4 pounds.

  • Maximum (the right “whisker”) - the largest value in the dataset. In our dataset, that’s 172 pounds.

  • Turn to Summarizing Columns in the Animals Dataset

  • Fill in the five-number summary for the pounds column, and sketch the box plot.

  • What conclusions can you draw about the distribution of values in this column?

    • While the animals' weights range from 0.1 pounds to 172 pounds, 50% of the animals weigh 11.3 pounds or less. The animal that weighs 172 pounds may be an outlier.
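As a quick check on those conclusions, the range and IQR for the pounds column can be worked out from the five numbers reported above. The sketch below is in Python rather than Pyret, and the variable names are chosen only for this example.

```python
# Five-number summary for the pounds column, as reported in this lesson.
minimum, q1, median, q3, maximum = 0.1, 3.9, 11.3, 60.4, 172

data_range = maximum - minimum  # distance spanned by all of the data
iqr = q3 - q1                   # distance spanned by the middle half
box_share = iqr / data_range    # fraction of the range covered by the box

print(f"Range: {data_range:.1f} pounds")
print(f"IQR:   {iqr:.1f} pounds")
print(f"The box covers {box_share:.0%} of the range")
```

The middle half of the animals fits inside roughly a third of the range, so much of the plot's length comes from the long right whisker.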

Common Misconceptions

It is extremely common for students to forget that every quartile always includes 25% of the dataset. This will need to be heavily reinforced.

Synthesize

  • What percentage of points make up Q1?

    • 25%

  • What percentage of points make up Q2?

    • 25%

  • What percentage of points make up Q3?

    • 25%

  • What percentage of points make up Q4?

    • 25%

  • What percentage of points make up the Interquartile Range (IQR)?

    • 50%

  • What percentage of points make up the Range?

    • 100%
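These percentages can be checked by counting. The following Python sketch (not Pyret) uses a hypothetical list of 12 distinct values, invented for this example, and the same median-of-halves method described earlier. With repeated values, or a count that is not a multiple of four, the groups come out only approximately equal.

```python
def median(sorted_values):
    """Return the middle value of an already-sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


# Hypothetical data: 12 distinct values.
values = sorted([3, 8, 1, 15, 22, 4, 9, 40, 6, 11, 30, 18])
n = len(values)

q1 = median(values[: n // 2])
q2 = median(values)
q3 = median(values[(n + 1) // 2 :])

# Sort each value into one of the four quartile groups.
groups = [
    [v for v in values if v <= q1],
    [v for v in values if q1 < v <= q2],
    [v for v in values if q2 < v <= q3],
    [v for v in values if v > q3],
]

shares = [len(group) / n for group in groups]
iqr_share = shares[1] + shares[2]  # the two groups inside the box

print("Share of points in each quartile group:", shares)
print("Share of points between Q1 and Q3:", iqr_share)
```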

Optional: Have students work in pairs to complete this Box Plot Vocab Concept Map.

Interpreting Box Plots (30 minutes)

Overview

Students learn how to read a box plot, and consider spread and variability. They connect this visualization of spread to what they learned about histograms.

Launch

Just as pie and bar charts are ways of visualizing categorical data, box plots and histograms are both ways of visualizing the shape of quantitative data.

Box plots make it easy to see the 5-number summary, and compare the Range and Interquartile Range. Histograms make it easier to see skewness and more details of the shape, and offer more granularity when using smaller bins.

Left-skewness is seen as a long left tail in a histogram. In a box plot, it’s seen as a longer left "whisker" or more spread in the left part of the box. Likewise, right-skewness is seen as a long right tail in a histogram, and as a longer right "whisker" or more spread in the right part of the box.
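The two box-plot clues can be read straight off a five-number summary. The Python sketch below (not Pyret) is a rough reading aid rather than a formal skewness statistic, and `side` and `skew_hints` are hypothetical helper names. The first example uses the pounds values reported earlier in this lesson.

```python
def side(left, right):
    """Say which of two lengths is longer."""
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "neither"


def skew_hints(minimum, q1, median, q3, maximum):
    """Compare the two whiskers and the two halves of the box."""
    return {
        "longer whisker": side(q1 - minimum, maximum - q3),
        "wider half of box": side(median - q1, q3 - median),
    }


# Five-number summary for the pounds column, from this lesson.
pounds_hints = skew_hints(0.1, 3.9, 11.3, 60.4, 172)
print(pounds_hints)

# A made-up symmetric summary, for contrast.
symmetric_hints = skew_hints(0, 2, 5, 8, 10)
print(symmetric_hints)
```

Both clues point to the right for the pounds column, which matches a histogram with a long right tail.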

Box plots and histograms give us two different views on the concept of shape.

              Intervals    Points-per-Interval
Box Plots     Variable     Fixed
Histograms    Fixed        Variable

Histograms: fixed intervals (“bins”) with variable numbers of data points in each one. Points “pile up in bins”, so we can see how many are in each. Larger bars show where the clusters are.

Box plots: variable intervals (“quartiles”) with a fixed number of data points in each one. Box plots treat data more like “pizza dough”, dividing it into four equal quarters and showing where the data is tightly clumped or spread thin. Smaller intervals show where the clusters are.
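The contrast between the two views can be made concrete with a small computation. This Python sketch (not Pyret) uses a hypothetical sample of 12 values and a bin width of 10, both chosen only for illustration, and computes quartiles with the median-of-halves method.

```python
def median(sorted_values):
    """Return the middle value of an already-sorted, non-empty list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


# Hypothetical sample, already sorted.
values = [1, 2, 2, 3, 4, 5, 7, 9, 14, 20, 31, 40]
n = len(values)

# Histogram view: every bin is 10 wide, and the counts vary.
bin_width = 10
bin_counts = {}
for v in values:
    bin_start = (v // bin_width) * bin_width
    bin_counts[bin_start] = bin_counts.get(bin_start, 0) + 1

# Box plot view: every interval holds n / 4 points, and the widths vary.
q1 = median(values[: n // 2])
q2 = median(values)
q3 = median(values[(n + 1) // 2 :])
interval_widths = [q1 - values[0], q2 - q1, q3 - q2, values[-1] - q3]

print("Points per fixed-width bin:", bin_counts)
print("Width of each quartile interval:", interval_widths)
```

The tallest histogram bar and the narrowest box-plot interval both sit at the low end of the number line, where this sample is clustered.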

Kinesthetic Activity

Divide the class into groups, and give each group a ruler and a ball of playdough. Have them draw a number line from 0-6 with the ruler, marking off the points at 0, 3, 4, 4.5 and 6 inches. Have the groups roll the dough into a thick cylinder, divide that cylinder in half, and then split each half to form four equally-sized cylinders. The playdough represents a sample, with values divided into four quartiles.

Box plots stretch and squeeze these equal quartiles across a number line, so that each quartile fills up its own interval on the line. On their number line, students have intervals from 0-3, 3-4, 4-4.5, and 4.5-6. Have students roll their cylinders so that they fill each of these intervals, retaining a uniform thickness.

They should notice that shorter intervals have thicker cylinders, and longer ones have skinnier cylinders. Even though a box plot doesn’t show us the thickness of the datapoints, we can tell that a small interval has the same amount of data "squeezed" into it as a large interval.
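The same observation can be expressed numerically. In this Python sketch (not Pyret), each cylinder holds a quarter of the dough, so the amount of dough per inch is 0.25 divided by the length of its interval. For a cylinder this corresponds to cross-sectional area, so the numbers track how thick each piece is without being literal diameters. The variable names are chosen only for this example.

```python
# Number-line marks from the activity, in inches.
marks = [0, 3, 4, 4.5, 6]
share_per_cylinder = 0.25  # four equal pieces of dough

dough_per_inch = []
for left, right in zip(marks, marks[1:]):
    length = right - left
    dough_per_inch.append(share_per_cylinder / length)

for (left, right), density in zip(zip(marks, marks[1:]), dough_per_inch):
    print(f"{left}-{right} inches: {density:.3f} of the dough per inch")

thickest = dough_per_inch.index(max(dough_per_inch))
print("Thickest cylinder is interval number", thickest + 1)
```

The shortest interval, from 4 to 4.5 inches, packs six times as much dough into each inch as the longest interval, from 0 to 3 inches.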

Investigate

Modified Box Plots

More statistics- or math-oriented classes will also be familiar with modified box plots (video explanation), which remove outliers from the box-and-whisker and draw them as asterisks outside of the plot. Modified box plots are also available in Bootstrap:Data Science, using the following contract:

# modified-box-plot :: (t :: Table, col :: String) -> Image
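This lesson does not say how a modified box plot decides which points count as outliers. A widely used convention flags any value more than 1.5 times the IQR below Q1 or above Q3. The Python sketch below (not Pyret) applies that convention. Whether modified-box-plot uses exactly this rule is an assumption, and `split_outliers` is a hypothetical helper name.

```python
def split_outliers(values, q1, q3, k=1.5):
    """Separate values using the 1.5 x IQR convention (an assumed rule)."""
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return inside, outliers


# The five summary values for the pounds column, from this lesson.
pounds_summary = [0.1, 3.9, 11.3, 60.4, 172]
inside, outliers = split_outliers(pounds_summary, q1=3.9, q3=60.4)

print("Within the fences:", inside)
print("Drawn as separate points:", outliers)
```

Under this convention the 172-pound animal falls outside the upper fence, so a modified box plot would draw it as a separate point and end the right whisker at the largest remaining value.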

Synthesize

Histograms, box plots, and measures of center and spread are all different ways to get at the shape of our data. It’s important to get comfortable using every tool in the toolbox when discussing shape!

We started talking about measures of center with a single question: is "average" the right measure to use when talking about animals' weights? Now that we’ve explored the spread of the dataset, do you agree or disagree that average is the right summary?

Project Option: Stress or Chill?

Students can gather data about their own lives, and use what they’ve learned in the class so far to analyze it. This project can be used as a mid-term or formative assessment, or as a capstone for a limited implementation of Bootstrap:Data Science. See the project description, Stress or Chill?, and its rubric. (You will also need the Personality Colors assessment.)

Your Own Analysis (flexible)

Overview

Students apply what they’ve learned to their own dataset.

Launch

What are the quantitative columns in your dataset? How are they distributed?

Are all the values pretty close together, or really spread out?

Are they clumped on the right, with a few outliers skewing to the left? Or are they clumped on the left, with a few outliers skewing to the right?

Investigate

  • How are the quantitative columns in your dataset distributed? Turn to Data Cycle: Shape of My Dataset, and use the Data Cycle to explore two quantitative columns with box plots.

  • Then add these displays - and your interpretations! - to the "Making Displays" section.

  • Do these displays bring up any interesting questions? If so, add them to the end of the document.

  • Complete Shape of My Dataset, and explain the connection between measures of center and your box plots.

  • Complete the "Measures of Center and Spread" section of the Dataset Exploration.

Synthesize

Have students share their findings.

  • Were any of them surprising?

  • What, if any, outliers did they discover when making box plots?

  • Which measure of center makes the most sense for each column?

These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.