Students consider possible threats to the validity of their analysis.

Prerequisites

Relevant Standards

Common Core Math Standards
HSS.IC.B.6

Evaluate reports based on data.

CSTA Standards
3B-NI-07

Evaluate the ability of models and simulations to test and support the refinement of hypotheses.

K-12CS Standards
6-8.Data and Analysis.Collection

People design algorithms and tools to automate the collection of data by computers. When data collection is automated, data is sampled and converted into a form that a computer can process. For example, data from an analog sensor must be converted into a digital form. The method used to automate data collection is influenced by the availability of tools and the intended use of the data.

9-12.Data and Analysis.Inference and Models

The accuracy of predictions or inferences depends upon the limitations of the computer model and the data the model is built upon. The amount, quality, and diversity of data and the features chosen can affect the quality of a model and ability to understand a system. Predictions or inferences are tested to validate models.

P1

Fostering an Inclusive Computing Culture

Next-Gen Science Standards
HS-SEP1-7

Ask and/or evaluate questions that challenge the premise(s) of an argument, the interpretation of a data set, or the suitability of the design.

HS-SEP4-3

Consider limitations of data analysis (e.g., measurement error, sample selection) when analyzing and interpreting data.

Oklahoma Standards
OK.L1.IC.C.02

Test and refine computational artifacts to reduce bias and equity deficits.

Lesson Goals

Students will be able to…

  • Define several types of Threats to Validity

  • Identify those threats by reading the description of an analysis

  • Identify those threats in their own analysis

Student-facing Lesson Goals

  • Let’s identify issues that could affect our data analysis.

Materials

Preparation

  • Make sure all materials have been gathered

  • Decide how students will be grouped in pairs

  • Computer for each student (or pair), with access to the internet

  • Student workbook, and something to write with

Supplemental Resources

Language Table

Types | Functions | Values
Number | num-sqrt, num-sqr, mean, median, modes | 4, -1.2, 2/3
String | string-repeat, string-contains | "hello", "91"
Boolean | ==, <, <=, >=, string-equal | true, false
Image | triangle, circle, star, rectangle, ellipse, square, text, overlay, bar-chart, pie-chart, bar-chart-summarized, pie-chart-summarized, histogram, scatter-plot, lr-plot | 🔵🔺🔶
Table | count, .row-n, .order-by, .filter, .build-column |

Glossary
threats to validity

factors that can undermine the conclusion of a study

Threats to Validity (20 minutes)

Overview

Students are introduced to the concept of validity, and a number of possible threats that might make an analysis invalid.

Launch

Survey says: “People prefer cats to dogs”

As good Data Scientists, the staff at the animal shelter is constantly gathering data about their animals, their volunteers, and the people who come to visit. But just because they have data doesn’t mean the conclusions they draw from it are correct! For example: suppose they surveyed 1,000 cat-owners and found that 95% of them thought cats were the best pet. Could they really claim that people generally prefer cats to dogs?

Have students share back what they think. The issue here is that cat-owners are not a representative sample of the population, so the claim is invalid.

There’s more to data analysis than simply collecting data and crunching numbers. In the example of the cat-owning survey, the claim that “people prefer cats to dogs” is invalid because the data itself wasn’t representative of the whole population (of course cat-owners are partial to cats!). This is just one example of what are called Threats to Validity.
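The cat-owner survey can be sketched in a few lines of Python (used here purely for illustration; it is not the course's own language). The population below is entirely made up: the counts, the `population` list, and the `share_preferring_cats` helper are assumptions chosen so that cat owners favor cats 95% of the time while the population as a whole does not.

```python
import random

# Hypothetical population of 1,000 people; every count here is invented.
# Each person is an (owns_cat, prefers_cats) pair.
population = (
    [(True, True)] * 285      # cat owners who prefer cats
    + [(True, False)] * 15    # cat owners who prefer dogs
    + [(False, True)] * 210   # everyone else who prefers cats
    + [(False, False)] * 490  # everyone else who prefers dogs
)

def share_preferring_cats(people):
    """Fraction of the given people who prefer cats."""
    return sum(prefers for _, prefers in people) / len(people)

# Biased sample: only cat owners are surveyed.
cat_owners = [person for person in population if person[0]]

# Random sample: 200 people drawn from the whole population (seeded for repeatability).
random_sample = random.Random(0).sample(population, 200)

biased_estimate = share_preferring_cats(cat_owners)
random_estimate = share_preferring_cats(random_sample)
true_share = share_preferring_cats(population)

print(f"Cat owners only: {biased_estimate:.1%}")   # 95.0%
print(f"Random sample:   {random_estimate:.1%}")
print(f"Whole population: {true_share:.1%}")       # 49.5%
```

In this invented population, the cat-owner survey reports 95% even though fewer than half of all people prefer cats. The random sample should typically land much closer to the true share.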

There are several major threats to validity you should be on guard against:

  1. Selection bias - Data was gathered from a biased, non-representative sample of the population. This is the problem with surveying cat owners to find out which animal is most loved. Remember that, in general, randomness is the key to obtaining unbiased samples!

  2. Bias in the study design - Suppose you survey a random sample of pet owners that includes representative numbers of both cat and dog owners. But you ask them a “loaded” question like “Since annual vet care comes to about $300 for dogs and only about half of that for cats, would you say that owning a cat is less of a burden than owning a dog?” This could easily lead to a misrepresentation of people’s true opinions.

  3. Poor choice of summary - Even if the sample is unbiased, outliers can be so extreme that they shift a summary statistic (such as the mean) in ways that don’t represent the population as a whole. For example, if the shelter happened to house a 100-year-old tortoise and summarized its animals’ ages with the mean, that one animal would inflate our perception of what age is typical.

  4. Confounding variables - The gathered data does not take into account other factors that might influence a relationship. For example, a study might conclude that cat owners are more environmentally conscious: they’re more likely to use public transportation than dog owners. The confounding variable here could be urban versus rural dwelling: people who live in big cities are more likely to use public transportation and also more likely to own cats.
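The tortoise example from the third threat can be checked with a quick calculation. The sketch below uses Python's standard `statistics` module for illustration; the list of ages is made up and does not come from the shelter dataset.

```python
from statistics import mean, median

# Hypothetical ages (in years) of nine shelter animals; values are invented.
ages = [1, 2, 2, 3, 4, 5, 6, 8, 9]

# The same animals plus one 100-year-old tortoise.
ages_with_tortoise = ages + [100]

print(round(mean(ages), 2), median(ages))                    # 4.44 4
print(mean(ages_with_tortoise), median(ages_with_tortoise))  # 14 4.5
```

One extreme value roughly triples the mean, while the median barely moves. With these invented numbers, the median is the more honest summary of a "typical" age.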

This is just a small list of different threats to validity. There are plenty more!

Investigate

On Identifying Threats to Validity (Page 84) and Identifying Threats to Validity (Page 85), you’ll find four different claims backed by four different datasets. Each one of those claims suffers from a serious threat to validity. Can you figure out what those threats are?

Synthesize

Give students time to discuss and share back.

Life is messy, and there are always threats to validity. Data Science is about doing the best you can to minimize those threats, and to be up front about what they are whenever you publish a finding. When you do your own analysis, make sure you include a discussion of the threats to validity!

Fake News! (20 minutes)

Overview

Students are asked to consider the ways in which statistics are misused in popular culture, and become critical consumers of some statistical claims. Finally, they are given the opportunity to misuse their own statistics, to better understand how someone might distort data for their own ends.

Launch

You’ve already seen a number of ways that statistics can be misused:

  1. Intentionally using the wrong chart

  2. Changing the scale of a chart

  3. Using the mean instead of the median with heavily-skewed data

  4. Using the wrong language when describing a Linear Regression

  5. Using a correlation to imply causation

With all the news being shared through newspapers, television, radio, and social media, it’s important to be critical consumers of information!
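As one illustration of the second item in the list above, the following Python sketch estimates how much taller one bar looks than another when the value axis does not start at zero. The adoption counts and the `apparent_ratio` helper are hypothetical, introduced only for this example.

```python
def apparent_ratio(a, b, axis_start=0):
    """Ratio of the drawn heights of bars b and a when the axis begins at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Hypothetical counts: 50 cat adoptions and 52 dog adoptions.
cat_adoptions, dog_adoptions = 50, 52

honest = apparent_ratio(cat_adoptions, dog_adoptions)                    # axis starts at 0
truncated = apparent_ratio(cat_adoptions, dog_adoptions, axis_start=48)  # axis starts at 48

print(honest)     # 1.04
print(truncated)  # 2.0
```

With the axis starting at 48, a 4% difference is drawn as a bar twice as tall, which is why checking the scale is a useful first step when reading any chart.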

Investigate

  • On Fake News! (Page 86), you’ll find some deliberately misleading claims made by slimy Data Scientists. Can you figure out why these claims should not be trusted?

  • Once you’ve finished, consider your own dataset and analysis: what misleading claims could someone make about your work? Turn to Lies, Darned Lies, and Statistics (Page 87), and come up with four misleading claims based on data or displays from your work.

  • Trade papers with another group, and see if you can figure out why each other’s claims are not to be trusted!

Synthesize

Have students share back their "lies". Was anyone able to stump the other group?

Your Analysis (flexible)

Overview

Students repeat the previous activity, this time applying it to their own dataset and interpreting their own results. Note: this activity can be done briefly as a homework assignment, but we recommend giving students an additional class period to work on this.

Launch

In every analysis, there are always threats to validity. It’s important to always be upfront about what those threats are, so that anyone who reads your analysis can make their own decision.

Investigate

  • Students should fill in the Findings portion of their Research Paper, discussing threats to validity and drawing conclusions from their linear regression results.

These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, and 1738598). Bootstrap:Data Science by Emmanuel Schanzer, Nancy Pfenning, Emma Youndtsmith, Jennifer Poole, Shriram Krishnamurthi, Joe Politz, Ben Lerner, Flannery Denny, and Dorai Sitaram with help from Eric Allatta and Joy Straub is licensed under a Creative Commons 4.0 Unported License. Based on a work at www.BootstrapWorld.org. Permissions beyond the scope of this license may be available by contacting schanzer@BootstrapWorld.org.