Probability and Inference

(Also available in Pyret)

Students explore sampling and probability as a mechanism for detecting patterns. After exploring this in a binary system (flipping a coin), they consider the role of sampling as it applies to relationships in a dataset.

Lesson Goals

Students will be able to…​

  • Understand the connection between probability and inference

  • Understand the need for random samples

  • Understand the role of sample size

  • Take random samples from a population

Student-facing Lesson Goals

  • Let’s explore the connection between probability and inference.

  • Let’s consider how sampling techniques impact the inferences we can draw.

Materials

Supplemental Materials

🔗How to Spot a Scam 20 minutes

Overview

Students consider a classic randomness scenario: the probability that a coin will land on heads or tails. From a data science perspective, this can be flipped from a discussion of probability to one of inference. Specifically, "based on the number of coin flips we observed, what can we conclude about the fairness of a coin?"

Launch

A casino dealer (called a "croupier") invites you to play a game of chance. They’ll flip a coin repeatedly. On each flip, the croupier gives you a dollar if it comes up tails. If it comes up heads, you pay them a dollar.

"It’s a pure game of chance", they tell you, "we each have equal odds of winning".

  • If you decide to play the game, how could you then decide if the croupier’s coin is fair, or if the croupier is scamming you?

  • For a fair coin, what are the chances of it landing heads? Tails?

    • A fair coin has a 50% chance of landing heads and a 50% chance of landing tails.

  • How do you know if a coin is fair or not?

    • Flip it! The more flips you make, the more accurately you can assess whether it is fair (see the sketch below).
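
Teachers who want to preview this idea can run a quick simulation. Here is a minimal sketch (in Python, which is not the lesson’s language; the flip counts are arbitrary) showing that estimates from more flips settle closer to the true 50%:

    import random

    def percent_heads(num_flips, prob_heads=0.5):
        # Flip a simulated coin num_flips times; report the percent of heads.
        heads = sum(random.random() < prob_heads for _ in range(num_flips))
        return 100 * heads / num_flips

    # Small samples bounce around a lot; larger samples settle near 50%.
    for n in [10, 100, 1000, 10000]:
        print(n, "flips:", percent_heads(n), "percent heads")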

Investigate

A fair coin should land on "heads" about as often as it lands on "tails": half the time.

In general, we assume that in the long run, an ordinary coin will land on "heads" 50% of the time. Our assumption that there is no bias towards "heads" or "tails" is our null hypothesis. A weighted coin, on the other hand, might be heavier on one side, creating a bias toward one side! And since we lose money on heads, we’re worried about bias in favor of heads.

So how do we test the null hypothesis?

The starter file above is written in Pyret, a programming language that will be unfamiliar to students. Students do not need to know any coding to complete the lesson. They will simply hit "Run" several times.

Have students share back their sample results and their predictions after 5 samples.

Complete the rest of Finding the Trick Coin.

  • Did the percent heads for any of the coins change significantly?

  • Do any samples seem to contradict the null hypothesis?

  • Could a coin come up "heads" twice in a row, and still be a fair coin? Why or why not? What about 10 times in a row? 20?

    • The coin could be fair in all of these instances! A coin landing on heads 20 times in a row is extremely unlikely, however.

In Statistics and Data Science, samples like these don’t prove anything about the coins! Instead, they either produce enough evidence to reject the null hypothesis, or fail to do so. If the null hypothesis is actually false, larger samples give us a better chance of producing evidence to reject it.

The chances of getting "heads" from a fair coin three times in a row aren’t too small: 0.5 × 0.5 × 0.5, or 1-in-8! Maybe it was just the luck of the draw, and the coin is still fair.

Should we suspect a scam if the croupier’s coin flipped heads 10 times in a row?

The probability of a fair coin getting all heads for 10 flips in a row is 1/2^10 (1 in 1,024), or roughly 0.001.

Statisticians would explain this as follows: "If the null hypothesis were true, then the probability of getting sample results at least as extreme as the ones observed is 0.001."

But of course, there is a way for a fair coin to do this. It’s just…​incredibly unlikely.

Going Deeper: p-value

Describing what the number 0.001 means in the example above is a mouthful, because we have to express it as an "If…​then…​" statement.

Statisticians use formal language to express the probability of obtaining sample results at least as extreme as the ones observed, under the assumption that the null hypothesis is true for the population. They call this probability a "p-value", and typically report it as a decimal.
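
For teachers who want to see the arithmetic behind that definition, here is a minimal sketch (in Python, not part of the lesson; the helper name is our own):

    from math import comb

    def p_value(k, n, p=0.5):
        # P(X >= k) for X ~ Binomial(n, p): the probability of sample results
        # at least as extreme as k heads in n flips, assuming the null
        # hypothesis (a fair coin, p = 0.5) is true.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    print(p_value(10, 10))  # 10 heads in 10 flips: ~0.00098, roughly 0.001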

Most of us say…​                            Statisticians say…​
"There’s a 1-in-10 chance of this"          "The p-value is 0.1"
"There’s a 1-in-100 chance of this"         "The p-value is 0.01"
"There’s a 2-in-100 chance of this"         "The p-value is 0.02"
"There’s a one-in-a-million chance"         "The p-value is 0.000001"

Common Misconceptions

Students may think that any sample from a fair coin should have an equal number of heads and tails outcomes. That’s not true at all! A fair coin might land on "tails" three times in a row! The fact that this is possible doesn’t mean it’s likely. Landing on "tails" five times in a row? Still possible, but much less likely.

This is where arithmetic thinking and statistical thinking diverge: it’s not a question of what is possible, but rather what is probable or improbable.

Synthesize

  • What is the relationship between how weighted a coin is and the number of flips you need to suspect that it’s weighted?

    • A fair coin should land on heads about 50% of the time.

    • If a coin has been designed to land on heads 100% of the time, it wouldn’t take long to figure out that something was up!

    • A trick coin designed to come up heads 60% of the time, however, would need a much larger sample to be detected. The smaller the bias, the larger the sample we need to see it (see the sketch after this list).

    • A small bias might be enough to guarantee that a casino turns a profit, yet be virtually undetectable without a massive sample!

  • Suppose we are rolling a 6-sided die. How could we tell if it’s weighted or not?

    • We could record how many times the die landed on each number after rolling many times. If the die is fair, we should see that it lands on each number approximately equally.
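
To make the bias-versus-sample-size tradeoff concrete, here is a minimal simulation sketch (in Python, not part of the lesson; the 60% bias and the "more than 55% heads looks suspicious" rule are illustrative choices):

    import random

    def percent_heads(n, prob_heads):
        # Percent of heads in n flips of a coin with the given bias.
        return 100 * sum(random.random() < prob_heads for _ in range(n)) / n

    # How often does a 60%-heads trick coin actually look suspicious
    # (more than 55% heads) at different sample sizes?
    for n in [10, 100, 1000]:
        flagged = sum(percent_heads(n, 0.60) > 55 for _ in range(1000))
        print(n, "flips: flagged in", flagged, "of 1000 trials")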

🔗Probability v. Inference 35 minutes

Overview

Statistical inference involves looking at a sample and trying to infer something you don’t know about a larger population. This requires a sort of backwards reasoning, kind of like making a guess about a cause, based on the effect that we see.

Launch

Probability reasons forwards.

Because we know that the chance of heads in the whole "population" of flips of a fair coin is 0.5, we can do probability calculations like "the probability of getting three heads in three coin flips is 0.5 × 0.5 × 0.5 = 0.125." Likewise, we can say the probability of getting three of a kind in a randomly dealt set of five cards is about 0.02.

"Based on what we know is true in the population, what’s the chance of this or that happening in a sample?" This is the kind of reasoning involved in probability.

Inference reasons backwards.

In the coin-flip activity, we took samples of coin flips and used our knowledge about chance and probability to make inferences about whether the coin was fair or weighted.

In other words, we looked at sample results and used them to decide what to believe about the population of all flips of that coin: was the overall chance of heads really 0.5?

"Based on what we saw in our sample, what do we believe is true about the population the sample came from?" This is the kind of reasoning involved in inference.

Statistical inference uses information from a sample to draw conclusions about the larger population from which the sample was taken. It is used in practically every field of study you can imagine: medicine, business, politics, history…​ even art!

Suppose we want to estimate what percentage of all Americans plan to vote for a certain candidate. We don’t have time to ask every single person who they’re voting for, so pollsters instead take a sample of Americans, and infer how all Americans feel based on the sample.

Just like our coin-flip, we can start out with the null hypothesis: assuming that the vote is split equally. Flipping a coin 10 times isn’t enough to infer whether it’s weighted, and polling 10 people isn’t enough to convince us that one candidate is in the lead. But if we survey enough people we can be fairly confident in inferring something about the whole population.

Sample size matters!

Suppose we were able to make a million phone calls to U.S. voters…​

  • Would it be problematic to only call voters who are registered Democrats?

  • To only call voters under 25?

  • To only call regular churchgoers?

  • Why or why not?

    • Calling only certain segments of the population will not reveal the way an entire population will vote.

  • We’re taking a survey of religions in our neighborhood.

  • There’s a Baptist church right down the street.

  • Would it be problematic to get a nice big sample by asking everyone there?

    • Collecting our sample at the church would bias the data. Everyone at the church is Baptist, but the entire neighborhood might not be!

    • Taking a sample of whoever is nearby is called a convenience sample.

Bad samples can be an accident - or malice!

When designing a survey or collecting data, Data Scientists need to make sure they are working hard to get a good, random sample that reflects the population. Lazy surveys can result in some really bad data! But poor sampling can also happen when someone is trying to hide something, or to oppress or erase a group of people.

  • A teacher who wants the class to vote for a trip to the dinosaur museum might only call on the students who they know love dinosaurs, and then say "well, everyone I asked wanted that one!"

  • A mayor who wants to claim that they ended homelessness could order census-takers to only talk to people at verified home addresses. Since homeless people don’t typically have an address, the census would show no homeless people in the city!

  • A city that is worried about childhood depression could survey children to ask about their mood…​but only conduct the survey at an amusement park!

Can you think of other examples where biased sampling could result in intentionally or unintentionally misleading results?

Investigate

The main reason for doing inference is to guess about something that’s unknown for the whole population.

A useful step along the way is to practice with situations where we happen to know what’s true for the whole population. As an exercise, we can keep taking random samples from that population and see how close they tend to get us to the truth.

The Animals Dataset we’ve been using is just one sample taken from a very large animal shelter.

We’re going to analyze which is better at guessing the truth about an entire population - a small sample of 10 randomly selected animals, or a large sample of 40 randomly selected animals.

Select Sampler from the Plugins drop-down menu.

The Sampler plugin features a Mixer, Spinner, and Collector. Today, we’ll be using the Collector, which chooses a specified number of cases from a dataset.

What do you notice about the Sampler? What do you wonder?

Possible Wonders include: How many turquoise balls are there? Why is there that amount? How many brackets are alongside the collection of turquoise balls? Why are there that many?

With or without "replacement"?

If we pick cards from a deck, each draw changes the possible outcomes of the ones that follow. There’s only one Ace of Hearts in the deck, and you can’t draw it twice! When flipping a coin, each flip has the same possible outcomes as the one before: heads or tails. It’s as if each outcome were put back and replaced with an identical copy before the next flip.

That’s the difference between sampling with or without replacement. If it’s like rolling dice or flipping a coin, it’s sampling with replacement. If it’s like drawing cards from a deck, it’s sampling without replacement.
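
The distinction maps directly onto standard library calls, for teachers who want a quick demonstration (a Python sketch, not part of the lesson; the deck is modeled as rank-plus-suit strings):

    import random

    deck = [rank + suit for rank in "23456789TJQKA" for suit in "SHDC"]

    # Without replacement: like dealing cards, no card can appear twice.
    print(random.sample(deck, 5))

    # With replacement: like flipping coins or rolling dice, repeats can happen.
    print(random.choices(deck, k=5))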

  • Can you think of other examples for each?

  • Select the Options tab of the Sampler.

  • Which makes the most sense for our dataset: collecting cases with replacement or without replacement?

Discuss with the class, making sure everyone understands which one this is!

  • Designate the number of items to collect in each sample, and the number of samples to take.

  • What would it mean to select three samples of five items each? (These are CODAP’s default settings.)

  • Enter the correct specifications for 1 collection of 10 items.

  • Click Start to observe the sampling simulation.

  • When it’s complete, the sample will be shown as a new table called experiment/samples/items. Rename it (by clicking on its title) to small-sample.

Ensure that students understand all the components of the new table they’ve created! Now that students are comfortable using the Sampler, it’s time to dig into the data.

  • We want large-sample (on the worksheet) to be its own unique table! To produce a new table using Sampler, reopen the plugin rather than simply modifying the number of items.

  • Complete Sampling and Inference, sharing results and discussing with the group.

  • Complete Predictions from Samples.

Larger samples tend to yield better estimates of what’s true for the whole population.
Random samples help avoid bias.
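
Both takeaways can be previewed in a quick simulation: draw many samples of 10 and of 40 from a population with a known proportion, and compare how far the estimates stray. A sketch (in Python, not part of the lesson; the 30% "true" proportion is an invented stand-in, not a fact about the Animals Dataset):

    import random

    TRUE_PROPORTION = 0.30  # hypothetical share of, say, cats in the shelter

    def sample_estimate(n):
        # Estimate the proportion from one random sample of size n.
        return sum(random.random() < TRUE_PROPORTION for _ in range(n)) / n

    for n in [10, 40]:
        estimates = [sample_estimate(n) for _ in range(1000)]
        worst_miss = max(abs(e - TRUE_PROPORTION) for e in estimates)
        print("sample size", n, ": worst miss in 1000 samples =", round(worst_miss, 2))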

Common Misconceptions

Many people mistakenly believe that larger populations need to be represented by larger samples. In fact, the formulas that Data Scientists use to assess how good a job a sample does are based only on the sample size, not the population size.
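
This, too, can be checked by simulation (a sketch; the population sizes and the 30% proportion are arbitrary assumptions):

    import random

    def spread(population_size, sample_size, trials=500):
        # Build a population that is 30% "yes", then measure how much
        # sample estimates vary (their standard deviation).
        yes = int(0.3 * population_size)
        population = [1] * yes + [0] * (population_size - yes)
        ests = [sum(random.sample(population, sample_size)) / sample_size
                for _ in range(trials)]
        mean = sum(ests) / trials
        return (sum((e - mean) ** 2 for e in ests) / trials) ** 0.5

    # Same sample size, wildly different population sizes: similar spread.
    print(spread(10_000, 100))
    print(spread(1_000_000, 100))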

Extension

In a statistics-focused class, or if appropriate for your learning goals, this is a great place to include more rigorous statistics content on sample size, sampling bias, etc.

Synthesize

  • Were larger samples always better for guessing the truth about the whole population? If so, how much better?

  • Why is taking a random sample important for avoiding bias in our analyses?

Project Options: Food Habits / Time Use

Optional Project: Food Habits [rubric] and Optional Project: Time Use [rubric] are both projects in which students gather data about their own lives and use what they’ve learned in the class so far to analyze it. These projects can be used as a mid-term or formative assessment, or as a capstone for a limited implementation of Bootstrap:Data Science. Both projects also require that students break down tasks and follow a timeline - either individually or in groups. Rubrics for assessing the projects are linked in the materials section at the top of the lesson.

(Based on the projects of the same name from IDS at UCLA)

These materials were developed partly through support of the National Science Foundation, (awards 1042210, 1535276, 1648684, 1738598, 2031479, and 1501927). CCbadge Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.