Randomness and Sample Size


Lesson Pathway, Standards and Practices

Standards

Common Core Math Standards

HSS.IC.B.3: Recognize the purposes of and differences among sample surveys, experiments, and observational studies; explain how randomization relates to each.

CSTA Standards

2-DA-08: Collect data using computational tools and transform the data to make it more useful and reliable.
2-DA-09: Refine computational models based on the data they have generated.

Oklahoma Standards

OK.L1.IC.C.02: Test and refine computational artifacts to reduce bias and equity deficits.
OK.PA.A.2.2: Identify, describe, and analyze linear relationships between two variables.
OK.PA.D.2.2: Determine how samples are chosen (random, limited, biased) to draw and support conclusions about generalizing a sample to a population.

Textbook Alignment

IM.7.8.17
IM.7.8.14
IM.7.8.12
Samples and Populations: Making Comparisons and Predictions

Students learn about random samples and statistical inference, as applied to the Animals Dataset. In the process, students get a light introduction to the role of sample size and the importance of statistical inference.

Lesson Goals

Students will be able to…

Take random samples from a population
Understand the need for random samples
Understand the role of sample size

Student-facing Lesson Goals

Let’s explore how random sampling can be used with datasets.

Materials

Preparation

Make sure all materials have been gathered.
Decide how students will be grouped in pairs.
Computer for each student (or pair), with access to the internet
Student workbook, and something to write with

Optional Projects

Language Table

Number
Functions: +, -, *, /, num-sqrt
Values: 4, -1.2, 2/3, pi

String
Functions: string-length, string-repeat, string-contains
Values: "hello", "91"

Boolean
Functions: <, >, <=, >=, ==, <>
Values: true, false

Image
Functions: star, triangle, circle, square, rhombus, ellipse, regular-polygon, radial-star, bar-chart, pie-chart, box-plot, scatter-plot, bar-chart-summarized, pie-chart-summarized
Values: 🔵🔺🔶

Table
Functions: .row-n, .order-by, .filter, .build-column

Glossary

statistical inference: using information from a sample to draw conclusions about the larger population from which the sample was taken

Do Now

Students should log into CPO, open the Expanded Animals Starter File (Pyret), and save a copy.

Flip the Script: Inference v. Probability (30 minutes)

Overview

Statistical inference involves looking at a sample and trying to infer something you don’t know about a larger population. This requires a sort of backwards reasoning, kind of like making a guess about a cause, based on the effect that we see. To better understand the process of going from the sample back to the population, it helps to understand the more straightforward process of going from the population to a sample. If the sample is random, we call this process Probability!

In real life we typically don’t know what’s true for an entire population. But this probability thought-experiment will start with a larger population with known properties (such as the fact that nearly half of the entire population are males). Then we’ll see what kind of behavior we tend to see in random samples taken from that population.
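For teachers who want to demonstrate this thought-experiment outside the starter file, here is a minimal sketch in Python (not part of the Pyret lesson materials). It assumes a made-up population of 1,000 animals in which 48% are male; the helper names `make_population` and `sample_share_male` are hypothetical, chosen only for illustration.

```python
import random


def make_population(size, share_male, seed=0):
    """Build an invented population of labels with a known share of males."""
    n_male = round(size * share_male)
    population = ["male"] * n_male + ["female"] * (size - n_male)
    random.Random(seed).shuffle(population)
    return population


def sample_share_male(population, sample_size, rng):
    """Draw one random sample (without replacement) and return its share of males."""
    sample = rng.sample(population, sample_size)
    return sample.count("male") / sample_size


# Assumption: 1,000 animals, 48% male. These numbers are illustrative only.
population = make_population(size=1000, share_male=0.48)
rng = random.Random(42)

# Reason "forwards": known population -> what do random samples look like?
shares = [sample_share_male(population, 40, rng) for _ in range(5)]
print("True share of males:", population.count("male") / len(population))
print("Share of males in five random samples of 40:", shares)
```

Each run of the last step gives five slightly different shares, which is the behavior students should expect: random samples vary, but they tend to land near the known population value.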

Launch

Inference Reasons Backwards; Probability Reasons Forwards

One of the most useful tasks in Data Science is using sample data to infer (guess) what’s true about the larger population from which the sample was taken. This process, called statistical inference, is used to gain information in practically every field of study you can imagine: medicine, business, politics, history, even art! Early on, statisticians discovered that random samples almost always work best.

Suppose we want to estimate what percentage of all Americans plan to vote for a certain candidate. We can’t ask everyone who they’re voting for, so pollsters instead take a sample of Americans, and generalize the opinion of the sample to estimate how Americans as a whole feel. But choosing a sample can be tricky…

Would it be problematic to only call voters who are registered Democrats? To only call voters under 25? To only call regular churchgoers? Why or why not?
How could we choose a representative subset, or sample, of American voters?
Would it be problematic to only sample a handful of voters? What do we gain by taking a larger sample?

Before we infer something unknown about a population from a sample, we need to know what makes a "good" sample!
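The polling questions above can be made concrete with a small simulation. The following Python sketch is only an illustration under invented assumptions: a population of 10,000 hypothetical voters in which those under 25 support a candidate at a rate of about 70% and everyone else at about 40%. None of these figures come from real polling data.

```python
import random

rng = random.Random(1)

# Invented population: each voter is (age, supports_candidate).
# Assumption: under-25 voters support the candidate more often than older voters.
voters = []
for _ in range(10_000):
    age = rng.randint(18, 90)
    chance_of_support = 0.7 if age < 25 else 0.4
    voters.append((age, rng.random() < chance_of_support))


def support_rate(group):
    """Fraction of the group that supports the candidate."""
    return sum(supports for _, supports in group) / len(group)


true_rate = support_rate(voters)

# Biased sample: only voters under 25 are called.
under_25 = [voter for voter in voters if voter[0] < 25]
biased_sample = rng.sample(under_25, 500)

# Random sample: every voter is equally likely to be called.
random_sample = rng.sample(voters, 500)

print("Whole population:", round(true_rate, 3))
print("Biased sample (under 25 only):", round(support_rate(biased_sample), 3))
print("Random sample:", round(support_rate(random_sample), 3))
```

Both samples contain 500 voters, so the difference in accuracy comes from how the sample was chosen, not from how large it is.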

Sampling is a complicated issue. The main reason for doing inference is to guess about something that’s unknown for the whole population. But a useful step along the way is to practice with situations where we happen to know what’s true for the whole population. As an exercise, we can keep taking random samples from that population and see how close they tend to get us to the truth. Another discovery (besides the value of randomness) that statisticians made early on was something that’s perfectly consistent with common sense: Larger samples are better than smaller ones, because they tend to get us closer to the truth about the whole population.
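The exercise described here (repeatedly sampling from a population whose truth we already know) can be sketched in Python as follows. The population is invented for illustration (10,000 animals, 48% male, coded as 1 for male and 0 for female), and `typical_error` is a hypothetical helper name, not something from the lesson's starter file.

```python
import random
import statistics

rng = random.Random(7)

# Invented population: 1 = male, 0 = female, with 48% males.
population = [1] * 4800 + [0] * 5200
true_share = sum(population) / len(population)


def typical_error(sample_size, trials=200):
    """Average distance between a random sample's share of males and the true share."""
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        errors.append(abs(sum(sample) / sample_size - true_share))
    return statistics.mean(errors)


for n in (10, 40, 160):
    print("Sample size", n, "-> typical error", round(typical_error(n), 3))
```

Under these assumptions the typical error shrinks as the sample size grows, which matches the common-sense claim that larger random samples tend to land closer to the truth.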

Let’s see what happens when we switch from smaller to larger sample sizes while taking random samples of shelter animals to infer what’s true about the larger population…

Investigate

The Animals Dataset we’ve been using is just one sample taken from a very large animal shelter. How much can we infer about the whole population of hundreds of animals, by looking at just this one sample?

Divide the class into groups of 3-5 students.
Have students open the Expanded Animals Starter File (Pyret), and click "Run".
Have students complete Sampling and Inference (Page 48), sharing their results and discussing with the group.
For a deeper exploration of the impact of sample size, have students complete Predictions from Samples.

Common Misconceptions

Many people mistakenly believe that larger populations need to be represented by larger samples. In fact, as long as the population is much larger than the sample, the formulas that Data Scientists use to assess how well a sample represents its population are based only on the sample size, not the population size.
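One standard formula that illustrates this point is the standard error of a sample proportion for a simple random sample. It is offered here as an example for teachers, not as the specific formula the lesson has in mind. Here p is the true proportion in the population, n is the sample size, and N is the population size; the worked numbers (p = 0.5, n = 100) are chosen only for illustration.

```latex
% Standard error of a sample proportion (population much larger than sample):
\[
  SE = \sqrt{\frac{p(1-p)}{n}}
\]
% With the finite population correction, which matters only when n is a
% sizable fraction of N:
\[
  SE_{\text{finite}} = \sqrt{\frac{p(1-p)}{n}} \cdot \sqrt{\frac{N-n}{N-1}}
\]
% Worked example with p = 0.5 and n = 100:
%   no correction:   SE = \sqrt{0.25 / 100} = 0.05
%   N = 1{,}000:     0.05 \cdot \sqrt{900 / 999} \approx 0.047
%   N = 1{,}000{,}000: 0.05 \cdot \sqrt{999900 / 999999} \approx 0.050
```

The population size N appears only in the correction factor, which is already close to 1 once the population is many times larger than the sample. A sample of 100 is therefore about as informative for a population of a million as for a population of a billion.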

Extension

In a statistics-focused class, or if appropriate for your learning goals, this is a great place to include more rigorous statistics content on sample size, sampling bias, etc.

Synthesize

Have students share how much better their larger samples are at guessing the truth about the whole population.

Project Options: Food Habits / Time Use

In both of these projects, students can gather data about their own lives, and use what they’ve learned in the class so far to analyze it. These projects can be used as a mid-term or formative assessment, or as a capstone for a limited implementation of Bootstrap:Data Science. See the project descriptions for Food Habits and Time Use.

(Based on the projects of the same name from IDS at UCLA)

These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap:Data Science by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.