Dataset Notes

We’ve written a description of each dataset, including suggestions and things to be aware of as you support students. Because these datasets come from the real world, they are not all equally conducive to every learning goal.

Skip to a Subset of our Library

The Environment & Health
Politics
Sports
Entertainment
Education
Nutrition

Datasets by Characteristic

Datasets requiring calculations to convert data before making graphs
- Rhode Island Schools
- Marijuana Arrests and Laws by State
- Countries of the World
- U.S. Jobs
- Colleges in the US (2019-2020)
- Esports Earnings
Datasets with a limited number of categories
- Countries of the World
- Movies
- Arctic Sea Ice
- Marijuana Arrests and Laws by State
- Fast Food
Datasets with a huge number of columns
- MLB Hitting Stats
- U.S. Voter Turnout 2016
- Air Quality, Pollution Sources & Health in the U.S.
- Health By U.S. County
- U.S. Jobs
- Colleges in the US (2019-2020)
- Covid-19
Datasets with fairly little quantitative data (well-suited to simpler analysis, not recommended for scatter plots)
- International Exhibition of Modern Art 1913
- NYPD Stop Question Frisk 2019

The Environment & Health

Global Waste by Country

This dataset includes data from 174 countries that indicates their annual waste production as well as basic societal statistics that are associated with waste production.
The dataset also includes the global region of countries, as well as a few basic indicators of national waste management practices.
This dataset has a limited number of categories, making it accessible to any student.
The columns of this dataset are defined to allow students to start analysis without much additional coding.
This dataset is good for finding linear associations between two variables. Many of the variables are positively correlated with solid municipal waste production (total and per capita).
Column Titles: Name, Region, Economic Development, GDP per Capita (USD), Total Municipal Solid Waste per Year (tons), Population, Hazardous Waste per Year (tons), Has national agency for enforcing solid waste laws, Has national law governing solid waste management, MSW per Capita (tons/people), Hazardous Waste per Capita (tons/people), Percent Urban, Urbanization Rate, Human Development Index, Adult Obesity Rate by BMI
Have students…
- define a function urban-msw-trend, which takes two country names and returns true if the one with a greater MSW per capita also has a greater urban percentage, and has an increasing urbanization rate.
- define a function number-urban, which computes the total number of people within a country living in urban areas.
- define a function total-gdp, which computes the total GDP of a country.
- define a function number-of-subgroup, that takes a column and string, and determines how many times that string occurs as a value in that column. Use this function to determine how many countries in the dataset are in Sub-Saharan Africa, and how many have economies in transition.
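For teachers who want a reference point, the counting and urban-population exercises above could look roughly like the following Python sketch. It assumes each row is a dictionary keyed by column title and that Percent Urban is on a 0-100 scale; the sample rows are invented for illustration and are not values from the dataset, and the names are Python-style renderings of the exercise names.

```python
# Hypothetical sample rows; every value here is invented, not real data.
rows = [
    {"Name": "Country A", "Region": "Sub-Saharan Africa", "Population": 1000, "Percent Urban": 40},
    {"Name": "Country B", "Region": "Europe", "Population": 2000, "Percent Urban": 75},
    {"Name": "Country C", "Region": "Sub-Saharan Africa", "Population": 500, "Percent Urban": 20},
]

def number_of_subgroup(rows, column, value):
    """Count how many rows hold `value` in the given column."""
    return sum(1 for row in rows if row[column] == value)

def number_urban(row):
    """Urban population, assuming Percent Urban is a 0-100 percentage."""
    return row["Population"] * row["Percent Urban"] / 100

print(number_of_subgroup(rows, "Region", "Sub-Saharan Africa"))
print(number_urban(rows[1]))
```

The same counting function works for any categorical column, such as Economic Development.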
Heads Up
- Strong Correlations: Urban percent and MSW per capita, Obesity Rate and MSW per capita, Population and MSW, Urban percent and HDI
- Outlier to be aware of: Ukraine has an extremely high hazardous waste output. This value is from the World Bank dataset so it should be reputable, but from further research it does not seem to be indicative of the average annual hazardous waste output for Ukraine.
- Outlier to be aware of: Island nations of Oceania tend to have much higher obesity rates than any other countries in the dataset.

World Cities: Proximity to Ocean

This dataset looks at the 5000 most populous (as of 2020) cities in the world and their geographical information, such as elevation, country, continent, and distance from the nearest shoreline. The data comes from various sources, including the World Cities Database (for cities and population), the National Centers for Environmental Information (for the world’s shoreline), JohnSnowLabs (for each country’s continent), and the World Bank (for country GDP per capita).
This dataset has a limited number of categories, making it accessible to any student.
The columns of this dataset are defined to allow students to start analysis without much additional coding.
Column Titles: Name, Country, Population (2020), Continent, Elevation (m), Distance to Shore (km), GDP per Capita ($) 2017. This refers to the country’s GDP per capita.
Have students…
- define a function is-poverty, which returns true if a city’s GDP per capita is below $15,000.
- define a function is-climate-change-prone, which returns true if a city’s distance from shore is less than 20km and elevation is less than 10m.
- define a function is-metropolis, which returns true if a city’s population is above 1 million and GDP per capita is above $10,000.
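Each of these exercises is a predicate that combines one or two column comparisons. As a rough illustration, is-climate-change-prone might be sketched in Python as below, assuming a city is represented as a dictionary keyed by column title. The two cities are hypothetical, with invented values.

```python
def is_climate_change_prone(city):
    """True when the city is under 20 km from shore and under 10 m in elevation."""
    return city["Distance to Shore (km)"] < 20 and city["Elevation (m)"] < 10

# Hypothetical cities; the values are invented for illustration.
coastal = {"Name": "City A", "Distance to Shore (km)": 3, "Elevation (m)": 5}
inland = {"Name": "City B", "Distance to Shore (km)": 400, "Elevation (m)": 5}

print(is_climate_change_prone(coastal))
print(is_climate_change_prone(inland))
```

Both conditions must hold, so a low-lying city far from the coast is not flagged.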

Earthquake sequences, 2000-2021, Magnitude 7+

This dataset looks at earthquake sequences recorded by the US Geological Survey that occurred from 2000 to 2021, with the mainshock having a magnitude >= 7.0. The dataset contains geographic and economic information about the location, geologic data, and death/injury tolls.
This dataset can be used to analyze what factors affect earthquakes, and model what makes an earthquake severe.
This dataset has a large number of columns (18) that will excite some students and may overwhelm others. The columns of this dataset are defined to allow students to start analysis without much additional coding.
Column Titles: Timestamp, Location, GDP (Millions of USD), GDP per Capita (nearest USD), Year, Month, Weekday, Latitude, Longitude, Depth (km), Magnitude, Over Land?, Type of Earthquake, Plate No1, Plate No2, Tsunami?, Death Toll, Injury Toll
Have students…
- define a function high-toll, which returns true if an earthquake’s death and injury tolls are higher than a certain threshold (to be determined by the student).
- define a function pct-pacific, which computes the percentage of earthquakes that happen along a boundary of the Pacific plate.
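One way to approach pct-pacific is to check both plate columns for each row and then divide by the number of rows. The Python sketch below assumes rows are dictionaries and that the plate columns hold plain plate names such as "Pacific"; the actual spelling in the spreadsheet may differ, and the sample rows are invented.

```python
# Hypothetical rows; the plate-name spelling is an assumption about the data.
quakes = [
    {"Plate No1": "Pacific", "Plate No2": "North American"},
    {"Plate No1": "Eurasian", "Plate No2": "Indian"},
    {"Plate No1": "Nazca", "Plate No2": "South American"},
    {"Plate No1": "Philippine Sea", "Plate No2": "Pacific"},
]

def on_pacific_boundary(quake):
    """True when either plate column names the Pacific plate exactly."""
    return "Pacific" in (quake["Plate No1"], quake["Plate No2"])

def pct_pacific(quakes):
    """Percentage (0-100) of rows on a Pacific plate boundary."""
    if not quakes:
        return 0.0
    matches = sum(1 for q in quakes if on_pacific_boundary(q))
    return 100 * matches / len(quakes)

print(pct_pacific(quakes))
```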
Heads Up
- It may be useful to combine death and injury tolls for correlation analysis.
- The columns Depth, Magnitude, Death Toll, and Injury Toll all skew right.
- The weakness of correlations in this data may prompt a discussion about sampling bias - this data only considers earthquakes with magnitude 7.0 or greater. The magnitude scale ranges from 0-10, so this subset is certainly not representative of all earthquakes.
- Outliers to be aware of: there are some earthquakes with unusually high death/injury rates. Scatterplots including these will likely flatten all other data if using a linear axis.

Air Quality and Related Data by US County

This data mostly comes from censuses completed by either the federal or state government. Besides the census, the dataset also includes information from the U.S. Energy Information Administration and the Bureau of Economic Analysis. The dataset can be narrowed down by isolating states or even geographical locations.
This dataset has many applications in the real world. For example, a government that wants to spread out new power plants could use it to understand how they might affect the population. The government also spends a lot of money on Medicare, and one might be interested in lowering that amount by reducing some of the causes of disease.
This dataset has a huge number of columns that will excite some students and may overwhelm others.
The columns of this dataset are defined to allow students to start analysis without much additional coding, though students could use many of the columns to narrow down the number of rows, which might count as additional coding.
Column Titles: State, County, Full Name, FIPS Code, Median AQI, Median Income, Population Size, Urban, Rural, Metro, Rural-Urban Continuum Code, % White (Non-Hispanic), % Other Races (Hispanic Included), Estimated Asthma Prevalence in Population by Medicare Spending, Cancer Incidence Rate (cases per 100,000), Cancer Projection, Total Number of Power Plants, Primary Power Plant Type, Total C02 Emissions from Power Plants (tons), Mean Travel Time to Work (Minutes), Life Expectancy (years), Main Industry by # of Employees, % Agriculture, forestry, fishing and hunting of Total Industries, % Mining, quarrying, and oil and gas extraction of Total Industries, % Transportation and warehousing of Total Industries, Interstate Highway in County?
Have students…
- Isolate factors by narrowing down the data by urban/rural/metropolitan areas, states, or main industry.
- define a function pct-industry, which computes the percent of a certain industry in America by County
- define a function state-income, that finds the median income for an entire state
- define a function missing-counties, that given a full list of counties finds which ones are missing on the spreadsheet
Heads Up
- The data does not include Minnesota because the state does not publish medical information by county.
- The data used to include Puerto Rico, but it was removed for the same reason. This is a shame, as Puerto Rico has incredibly high rates of asthma, which would be interesting to study.
- The data contains varying times of collection– for example, median income was collected in 2018, while the AQI data is from 2021. All data is post 2016.
- It might be helpful to find the population density in each county to understand the spread of some of the factors (such as power plants or interstate highways), although the Rural-Urban Continuum Code is likely a good proxy.

Health by U.S. County 2021

This dataset includes data from 2906 counties in the United States from https://www.countyhealthrankings.org/ and is focused on indicators of the mental and physical health condition of these counties in 2021.
This dataset has a huge number of columns that will excite some students and may overwhelm others.
The columns of this dataset are defined to allow students to start analysis without much additional coding.
Column Titles: FIPS, County, State, % Fair or Poor Health, Average Number of Physically Unhealthy Days, Average Number of Mentally Unhealthy Days, % Smokers, % Adults with Obesity, % Physically Inactive, % Excessive Drinking, % Uninsured, % Completed High School, % Some College, % Unemployed, % Children in Poverty, 80th Percentile Income, 20th Percentile Income, Income Ratio, % Children in Single-Parent Households, Presence of Water Violation, % Severe Housing Problem, % Long Commute - Drives Alone
Have students…
- define a function is-in-ca, which returns “true” if a county belongs to California
- define a function violate-water, which returns “true” if a county has a water violation
- define a function mentally-unhealthy, which returns “true” if a county has more mentally unhealthy days than physically unhealthy days
- Students are encouraged to investigate the correlation between Average Number of Physically Unhealthy Days and % Smokers/% Adults with Obesity/% Physically Inactive/% Excessive Drinking.
- Students are encouraged to investigate the correlation between Average Number of Mentally Unhealthy Days and Income Ratio/% Children in Single-Parent Households/% Severe Housing Problem/% Long Commute - Drives Alone.
Outliers to be aware of:
- There is one county with only 21% completing high school and 1% attending college, percentages significantly below those of all other counties included in the dataset.
Strong Correlations to be aware of:
- There exists a strong positive correlation between Average Number of Physically Unhealthy Days and Average Number of Mentally Unhealthy Days.
- There exists a very strong positive correlation between % Completed High School and % Some College.
Heads Up
- Numbers in the dataset are displayed without their fractional parts, but the underlying values are not necessarily whole numbers.

Covid-19

This dataset includes COVID-19 data from 3128 counties of the USA from December 2021.
This dataset has a huge number of columns that will excite some students and may overwhelm others.
The columns of this dataset are defined to allow students to start analysis without much additional coding.
Column Titles: FIPS County Code, County Name, State, Population, Total Cases, Total Deaths, Cases per 100k, Deaths per 100k, Poverty Estimate, Poverty Percentage, Median Household Income, Dose 1 Administered, Dose 1 Administered %, Series Complete, Series Complete %, Booster Doses, Booster %, Social Vulnerability Index, Metro?
Have students…
- Add a column for pct-poverty
Heads Up
- Most data was taken from 27th December, 2021, when cases were picking up across the US due to the Delta and Omicron variants.
- No US territories are included. Some larger cities, such as New York City, were also removed in the data cleaning process.
- Some values for Dose 1 Administered and its % equivalents were set at 0 although it is unlikely that this is the case. It is likely that the information was not available on that particular date.

Arctic Sea Ice

This dataset combines data on Arctic sea ice, global mean sea level variation, global average temperature anomalies, and atmospheric carbon dioxide levels.
This dataset has a limited number of categories, making it accessible to any student.
The columns of this dataset are defined to allow students to start analysis without much additional coding.
Column Titles: January Average Global Temperature Anomaly, February Average Global Temperature Anomaly, March Average Global Temperature Anomaly, April Average Global Temperature Anomaly, May Average Global Temperature Anomaly, June Average Global Temperature Anomaly, July Average Global Temperature Anomaly, August Average Global Temperature Anomaly, September Average Global Temperature Anomaly, October Average Global Temperature Anomaly, November Average Global Temperature Anomaly, December Average Global Temperature Anomaly, Annual Average Global Temperature Anomaly, Average Sea Ice Extent in the Arctic, Minimum Sea Ice Extent in the Arctic, Month Minimum Sea Ice Extent in the Arctic Occurred, Month Maximum Sea Ice Extent in the Arctic Occurred, Global Mean Sea Level Variation, Global CO2 Levels.
Have students…
- define a function variation-over-time which computes how much a variable has changed within a span of any two years.
Heads Up: The difference between the maximum and minimum sea ice extent in the Arctic can be calculated to show the range in sea ice extent per year and how that may fluctuate.

Politics

Countries of the World

This dataset includes 192 countries.
Column Titles: Country, Life-expectancy in years, GDP (in US$), population, continent, has universal healthcare? (yes/no)
Have students…
- define a function gdp-per-capita, which divides the gdp by the population.

Gerrymandering

This dataset looks at the 2018 United States House of Representatives elections. It includes state-wise population and voter turnout data, as well as columns comparing the percentage of votes a particular party received with the number of seats they won.
This dataset has a huge number of columns that will excite some students and may overwhelm others.
The columns of this dataset are defined to allow students to start analysis without much additional coding.
Column Titles: Population, %Turnout, %Vote Democrat, %Vote Republican, Total Seats, Democrat Seats, Republican Seats, %Seats Democrat, %Seats Republican, Winning Party, Seats match vote (Computed for each state by multiplying %Vote Democrat and %Vote Republican by the Total Seats for that state. If the resultant "expected" number of seats can be rounded up or down to match the data in both the Democrat Seats and Republican Seats columns respectively, this field is true. Else false.)
Have students…
- define a function majority-turnout, which returns true if a state has a voter turnout of more than 50%.
- define a function num-votes-dem, which returns the number of people who voted for Democratic party candidates in a given state.
- define a function victory-margin, which returns the number of votes the winning party in a particular state won by.
- define a function expected-seats-rep, which returns the number of seats you would expect Republican party candidates to win given the percentage of votes they received in a particular state.
- define a function flipped-winner, which returns true if the party that received the majority of votes did not win the majority of seats in a given state.
Heads Up: Before computing the number of people who voted for either party, the state’s population would have to be multiplied by the percentage voter turnout.
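The two-step calculation described in this Heads Up can be sketched in Python as follows. This is only an illustration: it assumes a state is a dictionary keyed by column title and that the percentage columns are on a 0-100 scale, and the sample state uses invented numbers.

```python
def num_votes_dem(state):
    """Estimated Democratic votes: population x turnout share x Democratic vote share."""
    voters = state["Population"] * state["%Turnout"] / 100
    return voters * state["%Vote Democrat"] / 100

def expected_seats_rep(state):
    """Seats proportional to the Republican vote share, left unrounded."""
    return state["Total Seats"] * state["%Vote Republican"] / 100

# Hypothetical state; all values are invented for illustration.
state = {"Population": 1_000_000, "%Turnout": 50, "%Vote Democrat": 40,
         "%Vote Republican": 60, "Total Seats": 10}

print(num_votes_dem(state))
print(expected_seats_rep(state))
```

Leaving the expected seat count unrounded lets students decide for themselves how to compare it with the actual seat columns.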

Marijuana Arrests and Laws by State

This dataset looks at marijuana arrest records for the year 2018 in 49 US States (data was unavailable for Florida), and breaks these arrests down by race. Each state also includes data for the years marijuana was legalized and/or decriminalized, and/or when medical marijuana was legalized.
This dataset has a limited number of categories, making it accessible to any student.
The columns of this dataset require calculations to convert data before students start making graphs.
Column Titles: State, All Drug Arrests, Total Marijuana Arrests, Total Marijuana Possession Arrests, Black Marijuana Possession Arrests, White Marijuana Possession Arrests, Black Population, White Population, Total Population, Marijuana Legalized (year), Marijuana Decriminalized (year), Medical Marijuana Legalized (year)
Have students…
- define a function pct-marijuana, which computes the percent of drug arrests that were marijuana arrests.
Heads Up
- A few states have multiple dates for medical marijuana legalization, denoted by listing the two dates separated by a slash.
- In order to compare between states, percentages will need to be calculated. These could include the percentage of a population that is black or white, the percentage of marijuana arrests that were of a black or white person, or the percentage of total arrests that were for marijuana possession.
- It may also make sense to split data into groups depending on whether marijuana/medical marijuana was legalized, decriminalized, or illegal.
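The percentage calculations suggested above all share one shape: a part divided by a whole. A minimal Python sketch, assuming each state is a dictionary keyed by column title and using an invented sample state, might look like this. The helper name pct_of is hypothetical.

```python
def pct_of(part, whole):
    """Percentage (0-100) that `part` makes up of `whole`."""
    return 100 * part / whole

def pct_marijuana(state):
    """Percent of all drug arrests that were marijuana arrests."""
    return pct_of(state["Total Marijuana Arrests"], state["All Drug Arrests"])

def pct_black_population(state):
    """Percent of the state's total population that is Black."""
    return pct_of(state["Black Population"], state["Total Population"])

# Hypothetical state; all values are invented for illustration.
state = {"All Drug Arrests": 2000, "Total Marijuana Arrests": 500,
         "Black Population": 200, "Total Population": 1000}

print(pct_marijuana(state))
print(pct_black_population(state))
```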

LAPD Arrests 2010 - 2019

This dataset includes data from arrests in the city of Los Angeles between 2010 and 2019 and could be used to:
- analyze arrest demographics and compare them to the actual demographics of LA city (provided in the README tab)
- compare arrests released vs. booked
- analyze geographic concentrations of arrests
This dataset has a limited number of categories, making it accessible to any student. The columns of this dataset are defined to allow students to start analysis without much additional coding.
Column Titles: Report ID, Report Type, Area Name, Sex Code, Descent Code, Address, Arrest Type Code, Age, Arrest Date, Time, Latitude, Longitude
Have students…
- define a function pct-black, which computes the percent of black people arrested out of the total
- define a function avg-age that computes the average age of all the people arrested
- define a function amt-hollywood that computes the number of arrests that occurred in the Hollywood area
Heads Up
- Outliers to be aware of: A few ages are below 10 years old.
- Age column skews right

NYPD Stop Question Frisk 2019

This dataset includes nearly 9000 incidents of Stop, Question and Frisk in NYC in 2019.
This dataset mostly contains categorical data.
Column Titles: Description of Suspected Crime, Minutes Observed, Searched?, Frisked?, Asked for Consent?, Consent Given?, Weapon Found?, Arrested?, Time of Stop, Month, Day, Issuing Officer Rank, Officer in Uniform?, Suspect Reported Age, Suspect Sex, Suspect Race Description, Suspect Eye Color, Suspect Hair Color, Demeanor of Suspect, Boro Where Stopped

Refugees 2018

This dataset looks at the countries of the world in 2018, recording the percent share of global refugees who originated from each country, as well as the personal security, civil liberty, and human development score of each country.
This dataset has a limited number of categories, making it accessible to any student.
The columns of this dataset are defined to allow students to start analysis without much additional coding.
Column Titles: Country, Region, Percent of Global Population, Percent of Global Refugees Originating from This Country, Personal Security Index, Democracy Category, Civil Liberties Index, Human Development Index
Have students…
- define a function little-liberty, which returns true if a country’s civil liberties index is between 6 & 7.
- define a function much-liberty, which returns true if a country’s civil liberties index is between 1 & 2.
- define a function little-security, which returns true if a country’s personal security index is <40.
- define a function much-security, which returns true if a country’s personal security index is >60.
- define a function little-development, which returns true if a country’s human development index is <0.4.
- define a function much-development, which returns true if a country’s human development index is >0.6.
- define a function high-indices, which returns true if a country’s much-liberty, much-security, & much-development all return true.
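The last exercise builds on the earlier ones, which makes it a good example of function composition. A rough Python sketch is below. It assumes a country is a dictionary keyed by column title and treats "between 1 & 2" as inclusive, which is an interpretation rather than something the dataset specifies; the sample countries are invented.

```python
def much_liberty(country):
    """Civil liberties index between 1 and 2, inclusive (an assumed reading)."""
    return 1 <= country["Civil Liberties Index"] <= 2

def much_security(country):
    return country["Personal Security Index"] > 60

def much_development(country):
    return country["Human Development Index"] > 0.6

def high_indices(country):
    """True only when all three 'much-' predicates are true."""
    return much_liberty(country) and much_security(country) and much_development(country)

# Hypothetical countries; all values are invented for illustration.
country_a = {"Civil Liberties Index": 1, "Personal Security Index": 75, "Human Development Index": 0.9}
country_b = {"Civil Liberties Index": 1, "Personal Security Index": 30, "Human Development Index": 0.9}

print(high_indices(country_a))
print(high_indices(country_b))
```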
Heads Up
- Only three countries are in the Oceania category, and only three are in the North America category.
- To compare regions of the world, a sieving/filtering function should be created to filter out countries by region.

State Demographics

This dataset includes all 50 states and the District of Columbia.
Column Titles: state, region, population 2010, population 2014, % bachelors or higher, % hs or higher, % homeowners, # of households, # of housing units, land area (square miles), % non-English at home, mean commute times (minutes), median household income (US$), per capita income (US$), % older than 65, % female, % under 18, % under 5, % in poverty, number of veterans
Have students…
- define a function pop-density, which computes the people per square mile.
- define a function pct-change-pop, which computes the percent change in the population from 2010 to 2014.
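Both exercises are short formulas over a single row. The Python sketch below shows one possible reading, assuming a state is a dictionary keyed by column title. Using the 2014 population for density is an assumption, since the exercise does not say which year to use, and the sample state is invented.

```python
def pop_density(state):
    """People per square mile, using the 2014 population (an assumed choice)."""
    return state["population 2014"] / state["land area (square miles)"]

def pct_change_pop(state):
    """Percent change in population from 2010 to 2014."""
    old = state["population 2010"]
    new = state["population 2014"]
    return 100 * (new - old) / old

# Hypothetical state; all values are invented for illustration.
state = {"population 2010": 1000, "population 2014": 1100, "land area (square miles)": 50}

print(pop_density(state))
print(pct_change_pop(state))
```

Dividing by the 2010 value matters: percent change is measured relative to the starting population.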

U.S. Income

This dataset covers the years 1967-2015.
Column Titles: Year, Number of Families, Percent in each income group (<15k, 15k-25k, 25k-35k, 35k-50k, 50k-75k, 75k-100k, 100k-150k, 150k-200k, >200k), median income, mean income

U.S. Jobs

This dataset includes data about 140 occupations in the U.S. in 2019. For each occupation, this dataset records its occupation type, total number of employees, percentage of non-white employees, percentage of female employees, typical entry level educational requirement, annual median wage, weekly median wage, and female weekly median wage.
This dataset has a huge number of columns that will excite some students and may overwhelm others.
The columns of this dataset require calculations to convert data before students start making graphs.
Column Titles: Occupation, Occupation Type, Total Number of Employees, Percentage of Non-white Employees, Percentage of Female Employees, Typical Entry Level Educational Requirement, Annual Median Wage, Weekly Median Wage, Female Weekly Median Wage
Have students…
- define a function gender-wage-gap, which computes the numerical difference between female employees' weekly median wage and all employees' weekly median wage.
- define a function female-wage-is-lower, which returns true if an occupation’s weekly median wage for female employees is lower than its weekly median wage for all employees.
Heads Up
- Columns list the median wage for all employees and median wage for female employees. In order to look into gender income equality, wage gap would need to be calculated.
- Outlier to be aware of: The veterinarian occupation offers a significantly higher wage for female employees.
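The wage-gap calculation mentioned above could be sketched in Python as follows, assuming an occupation is a dictionary keyed by column title. The sign convention (female wage minus overall wage, so a negative result means women earn less) is one reasonable choice rather than the only one, and the sample occupation is invented.

```python
def gender_wage_gap(job):
    """Female weekly median wage minus the overall weekly median wage."""
    return job["Female Weekly Median Wage"] - job["Weekly Median Wage"]

def female_wage_is_lower(job):
    """True when the gap is negative under the convention above."""
    return gender_wage_gap(job) < 0

# Hypothetical occupation; the values are invented for illustration.
job = {"Occupation": "Occupation A", "Weekly Median Wage": 1000, "Female Weekly Median Wage": 900}

print(gender_wage_gap(job))
print(female_wage_is_lower(job))
```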

U.S. Voter Turnout 1986-2018

This dataset looks at demographics of the electorate. Information is provided on what percentage of the electorate each group represents and how turnout for elections varies across age, educational attainment and race.
Column Titles: year, Turnout Rate 18-29, Turnout Rate 30-44, Turnout Rate 45-59, Turnout Rate 60+, Share of Electorate 18-29, Share of Electorate 30-44, Share of Electorate 45-59, Share of Electorate 60+, Turnout Rate Less Than High School, Turnout Rate High School Grad, Turnout Rate Some College to College Grad, Turnout Rate Post-Graduate, Share of Electorate Less Than High School, Share of Electorate High School Grad, Share of Electorate Some College to College Grad, Share of Electorate Post-Graduate, Turnout Rate Non-Hispanic White, Turnout Rate Non-Hispanic Black, Turnout Rate Hispanic, Turnout Rate Other, Share of Electorate Non-Hispanic White, Share of Electorate Non-Hispanic Black, Share of Electorate Hispanic, Share of Electorate Other

Sports

Esports Earnings

This dataset includes data from 1998-2020 that comes from https://www.esportsearnings.com, a publicly-sourced repository of information on esports earnings.
This dataset has a limited number of categories, making it accessible to any student.
The columns of this dataset require calculations to convert data before students start making graphs.
Column Titles: game, release-date, genre, total-earnings (total earnings of all players in all tournaments), online-earnings (total earnings of all players in online-only tournaments), total-winners (total number of tournament winners in all tournaments), total tournaments (total number of tournaments that took place from the release date to 2020)
Have students…
- define a function online-ratio, which returns the ratio of online-earnings to total-earnings for a game.
- define a function annual-earnings, which returns the average earnings per year since the release date for a game.
Heads Up
- Note that since the data is publicly sourced, there are games with no earnings and no tournaments. Additionally, the nature of the data means that there are a few extremely large outliers that the student will have to deal with when analyzing the data.
- The data for many less popular games varies greatly due to one-off tournaments. Additionally, the number of games with zero tournaments and zero players may vary, as the data is publicly sourced. Thus, it is recommended that comparisons between genres be done with subsets containing the games with the highest earnings, as much more data is available for those games.
- Other metrics that may be interesting to the student for analysis are the average earnings per year since release and the average number of tournaments per year, counted from the release year of the game or 1998, whichever is later.
- Keep in mind that the data only includes tournaments that occur from 1998 to 2020, even though some of the games included in the dataset were released before 1998.
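Because the data starts in 1998, a per-year average needs to count years from the later of the release year and 1998. The Python sketch below illustrates that idea under a few assumptions: the release year is passed in as an integer (the format of the release-date column is not described here), the years are counted inclusively through 2020, and games with zero total earnings get no ratio at all.

```python
FIRST_YEAR = 1998  # earliest year covered by the data
LAST_YEAR = 2020   # latest year covered by the data

def years_tracked(release_year):
    """Years of data available for a game, counted inclusively (an assumed choice)."""
    start = max(release_year, FIRST_YEAR)
    return LAST_YEAR - start + 1

def annual_earnings(total_earnings, release_year):
    """Average earnings per tracked year."""
    return total_earnings / years_tracked(release_year)

def online_ratio(online_earnings, total_earnings):
    """Share of earnings from online-only tournaments, or None when there are no earnings."""
    if total_earnings == 0:
        return None
    return online_earnings / total_earnings

print(years_tracked(1995))
print(annual_earnings(1000, 2016))
print(online_ratio(0, 0))
```

Returning None for games with no earnings avoids a division-by-zero error and makes those rows easy to filter out.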

MLB Hitting Stats (2018)

This dataset includes 30 teams.
Column Titles: Team, League, 28 Columns of Baseball stats!, Salary

NBA Players 2018-2019

This dataset includes selected box score stats for the NBA players from the 2018-2019 season.
Column Titles: id, name, was-traded, Team, Position, Age, Games Played, Minutes played per game, Free throws attempted, Free throw percentage, 2-point shots attempted, 2-point shot percentage, 3-point shots attempted, 3-point shots percentage, effective field goal percentage, true shooting percentage, height in inches
Heads Up: Players who were traded during the season are listed more than once, so that stats can be given for their time with each team they played for.

NFL Passing

This dataset includes stats on the top 102 NFL players of the 2019 season.
Column Titles: Rank, Name, Invited to ProBowl, Team, Position, Passes completed, Passes attempted, Percent of Passes Completed, Yards gained, Passing Touchdowns, Percentage of times intercepted when attempting to pass, Adjusted yards gained per pass attempted, Yards gained per pass completion, Yards gained per game played, Times sacked.

NFL Rushing

This dataset includes stats on the top 334 NFL players of the 2019 season.
Column Titles: Rank, Name, Team, Age, Position, Games Played, Games Started, Attempts, Yards, Touchdowns, First-Downs, Longest, Yards-Average, Yards-Game, Fumbles

Entertainment

Movies

This dataset includes the 100 top-grossing movies (as of 2018).
Column Titles: Rank, Movie Title, Studio Name, Female-lead, POC-lead, Total Gross Income (million dollars), Domestic Income (million $), Overseas Income (million $), Year
Have students…
- define a function return-on-investment, which computes the ratio of the Total Gross Income to the budget.
Heads Up: Only a few films are from before 2000.

IGN Game Review

This dataset focuses on video game reviews from 2017-2018.
Column Titles: title, playstation, xbox, nintendo, pc, other, score, genre, editors-choice, release-year, release-month
Since many titles are available for more than one platform, columns like playstation and pc are all Booleans. This also makes it easy to create grouped samples, via simple filter functions that merely look up the value of a particular column.
Have students…
- explore whether game studios tend to release their highest-rated games around certain months of the year.
- define filter functions like is-xbox or is-pc, to create grouped samples of platform-specific reviews and search for differences between platforms.
Heads Up
- Scatter plots for this dataset may be surprising, because IGN used whole numbers (4, 5, 6, etc.) for most of their ratings. As a result, most of the dots on the scatter plot are "clumped" along whole numbers.
- Interestingly enough, IGN switched to allow decimal ratings in some years (4.1, 4.5, 4.6, etc.), which results in some wildly different distributions. Can students find this difference by looking at the scatter plot? Can they uncover the reason (IGN’s shift in scoring) by looking online?

International Exhibition of Modern Art 1913

This dataset describes nearly 800 of the pieces included in the first large exhibition of modern art in the United States.
This dataset has very little quantitative data.
Column Titles: Artist, Sex, Title, Medium, Location, Number of Locations exhibited, Sale Price in Dollars, Reason the piece was included in the show

Pipe Organs

This dataset describes about 500 pipe organs, a sample taken from a database of North American organ builders.
This dataset has a limited number of categories, making it accessible to any student.
The columns of this dataset may require additional coding depending on how the student wishes to analyze the data.
Column Titles: State, Venue, Most recent builder, Ranks, Stops, Manuals, Lowest Pitch, Year Completed, Altered
Have students…
- define a function average-ranks-wurlitzer, which computes the average number of ranks for all organs made by Wurlitzer. Do the same for a few other major organ companies: Austin, Noack, Hook & Hastings, Möller, Casavant, and Aeolian-Skinner.
- define a function ranks-stops-difference which computes the difference between an organ’s number of ranks and its number of stops.
Heads Up
- Outlier to be aware of: Organ 131 is the largest pipe organ in the world.
- Outliers to be aware of: Only a few organs are not currently in the United States.
- Outliers to be aware of: Organs 2253 and 66436 are comparatively old.

Pokemon

This dataset contains information about all eight generations of Pokemon, which is current up to 2022. This dataset was further extended by a student at Mission Vista High School in CA!
Column Titles: number, name, type1, type2, total, hp, attack, defense, special-attack, special-defense, speed, generation, legendary, region and tier

Music

This dataset includes over 10,000 songs. It is a subset of the Million Song Dataset, published in 2011.
Column Titles: artist, song, duration in seconds, loudness, beats per minute, end of fade-in (seconds), start of fade-out (seconds), familiarity, buzz, terms (category of music)

Education

College Majors by Gender, Employment Rates & Earnings

This dataset examines census data from 2015-2019, recording gender ratios, employment levels, and median income based on college major.
This dataset has a limited number of categories, making it accessible to any student.
The columns of this dataset are defined to allow students to start analysis without much additional coding.
Column Titles: Major, Major Category, Men, Women, Total, Employed, Unemployed, Median Income
Have students…
- define a function pct-women, which computes the percent of people in a major who are women
- define a function unemployment-rate, which computes the percent of people who are unemployed for each major
- define a function rank-by-income, which sorts the majors from highest to lowest median income
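One possible Python sketch of these three functions is shown here. It assumes each major is a dictionary keyed by the column titles, and the sample rows are invented for illustration. The denominator for the unemployment rate is also an assumption: this version divides by Employed + Unemployed, although dividing by Total is another reasonable reading of the prompt.

```python
# Hypothetical sample rows: the numbers are invented, not real census values.
SAMPLE_MAJORS = [
    {"Major": "Example Major A", "Men": 40, "Women": 60, "Total": 100,
     "Employed": 90, "Unemployed": 10, "Median Income": 50000},
    {"Major": "Example Major B", "Men": 150, "Women": 50, "Total": 200,
     "Employed": 160, "Unemployed": 40, "Median Income": 72000},
]


def pct_women(major):
    """Percent of the people in a major who are women."""
    return 100 * major["Women"] / major["Total"]


def unemployment_rate(major):
    """Percent unemployed, out of employed plus unemployed (an assumption)."""
    labor_force = major["Employed"] + major["Unemployed"]
    return 100 * major["Unemployed"] / labor_force


def rank_by_income(majors):
    """Return a new list of majors sorted from highest to lowest median income."""
    return sorted(majors, key=lambda m: m["Median Income"], reverse=True)
```

Note that the first two functions work on a single row, while rank-by-income needs the whole table.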

Colleges in the US (2019-2020)

This dataset includes data from 3232 post-secondary educational institutions in the United States during the 2019-2020 school year.
Students applying to colleges can use this dataset to evaluate their options given their monetary, locational, and other preferences.
This dataset has a huge number of columns that will excite some students and may overwhelm others.
The columns of this dataset require calculations to convert data before students start making graphs.
Column Titles: School Name, City, State, Urbanicity, Region, Control, Program, Highest Degree, Gender, Operating, Main Campus, HBCU, Indigenous, US Service, Converted SAT Range, Undergrad Population, Acceptance Rate, Part-time Student Ratio, Average Monthly Faculty Salary, In-state Tuition, Out-of-state Tuition, Pell Grant, Average Cost of Attendance, Average Net Price
Have students…
- define a function all-pell-grant, which computes the total number of undergraduate students in all universities that received a Pell grant.
- define a function same-tuition, which returns true if a school has the same in-state and out-of-state tuition.
- define a function average-fin-aid, which returns the difference between a school’s cost of attendance and net price.
- define a function low-accept, which returns true if a school has less than a 10% acceptance rate.
- filter out a list of all schools in their state and compare them.
- filter out a list of all schools in the “1500+” Converted SAT Range and compare them with a filtered list of all schools with an acceptance rate below 0.1.
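Several of the row-level functions and the state filter above might look like this in Python. This is a sketch only: each school is assumed to be a dictionary keyed by the column titles, Acceptance Rate is assumed to be stored as a fraction between 0 and 1, and the sample schools and the helper name schools_in_state are hypothetical.

```python
# Hypothetical sample rows: names and numbers are invented for illustration.
SAMPLE_SCHOOLS = [
    {"School Name": "Example University", "State": "RI", "Acceptance Rate": 0.08,
     "In-state Tuition": 20000, "Out-of-state Tuition": 20000,
     "Average Cost of Attendance": 30000, "Average Net Price": 12000},
    {"School Name": "Sample State College", "State": "CA", "Acceptance Rate": 0.65,
     "In-state Tuition": 7000, "Out-of-state Tuition": 19000,
     "Average Cost of Attendance": 24000, "Average Net Price": 15000},
]


def same_tuition(school):
    """True if in-state and out-of-state tuition are equal."""
    return school["In-state Tuition"] == school["Out-of-state Tuition"]


def average_fin_aid(school):
    """Difference between the cost of attendance and the net price."""
    return school["Average Cost of Attendance"] - school["Average Net Price"]


def low_accept(school):
    """True if the acceptance rate is below 10% (stored here as 0.10)."""
    return school["Acceptance Rate"] < 0.10


def schools_in_state(schools, state):
    """Keep only the schools located in the given state."""
    return [s for s in schools if s["State"] == state]
```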
Heads Up
- Note that a few schools may have a negative net price, suggesting that some students may actually end up receiving money for attending the particular school.
- Acceptance rates of 1 mean virtually everyone is accepted, though there may still be special circumstances in which a particular student is rejected.
- Some schools lacking significant data may be excluded from this dataset of 3232 colleges.

R.I. Schools

This is data from 271 Rhode Island public & charter schools.
Column Titles: District Name, School Name, % passing ELA in 2018, % passing Math in 2018, Charter School (yes/no?), Number of students free lunch eligible, Number of students reduced lunch eligible, Number of students by racial/ethnic group (American Indian/Alaska Native, Asian/Pacific Islander, Hispanic, Black, White, 2 or more races), male, female, total population.
Have students…
- define a function pct-black, which computes the percent of Black students at a school.
- define a function pct-hispanic, which computes the percent of Hispanic students at a school.
- define a function high-math, which returns true if a school has more than 60% of students passing the state math test.
Heads Up
- Other than ELA and Math Passing Percentages, columns list the number of students. In order to compare between schools, percentages would need to be calculated.
- Free and Reduced lunch students are listed as two separate quantities. Usually we combine these numbers for analysis.
- Outlier to be aware of: Classical High School
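The count-to-percentage conversion and the combined free and reduced lunch figure described above could be sketched in Python as follows. The short keys used here ("Black", "Hispanic", "free lunch eligible", and so on) are assumed stand-ins for the actual column names, the passing percentage is assumed to be on a 0-100 scale, and the sample school is invented.

```python
# Hypothetical sample school: the keys are shortened stand-ins for the real
# column names, and the values are invented for illustration.
SAMPLE_SCHOOL = {
    "School Name": "Example Elementary",
    "% passing Math in 2018": 62.0,
    "free lunch eligible": 80,
    "reduced lunch eligible": 20,
    "Black": 50,
    "Hispanic": 30,
    "total population": 200,
}


def pct_of_total(school, column):
    """Convert a student count into a percent of the school's total population."""
    return 100 * school[column] / school["total population"]


def pct_black(school):
    return pct_of_total(school, "Black")


def pct_hispanic(school):
    return pct_of_total(school, "Hispanic")


def pct_free_reduced_lunch(school):
    """Combine the free and reduced lunch counts before converting to a percent."""
    combined = school["free lunch eligible"] + school["reduced lunch eligible"]
    return 100 * combined / school["total population"]


def high_math(school):
    """True if more than 60% of students passed the state math test."""
    return school["% passing Math in 2018"] > 60
```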

Evolution of College Admissions (UC vs Other Public Schools)

This dataset includes data from the Common Data Sets of all 9 colleges in the UC system and 2 colleges from the CSU system.
This dataset can be used to analyze the future of college admissions and whether UCs have found a way to increase diversity at schools despite the affirmative action ban.
This dataset has a limited number of categories, making it accessible to any student.
The columns of this dataset are defined to allow students to start analysis without much additional coding. We do not recommend taking many random subsets of the data, since the analysis hinges on examining schools over time and random subsets may leave unnecessary holes in the data.
Column Titles: College + Year, College, Year, Admission Rate, Most Dominant Race, Percentage of Most Dominant Race, Percentage of Whites, Percentage of Asians, Percentage of Blacks/Latinos/Native Americans, GPA (average), Percentage of In-State Enrollees, UC?, Weighted GPA?
Have students…
- define a function that calculates the difference between the percentage of the most dominant ethnicity and the underrepresented ethnicities for a row.
- define a function that takes in a row and a threshold and returns whether the average GPA of that row exceeds the threshold.
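A minimal Python sketch of these two functions follows. It assumes each row is a dictionary keyed by the column titles; the function names and the sample row are hypothetical, since the prompts above do not name the functions.

```python
# Hypothetical sample row: the values are invented for illustration.
SAMPLE_ROW = {
    "College": "Example Campus",
    "Year": 2019,
    "Percentage of Most Dominant Race": 40.0,
    "Percentage of Blacks/Latinos/Native Americans": 25.0,
    "GPA (average)": 3.9,
}


def dominant_minus_underrepresented(row):
    """Gap between the most dominant group's share and the underrepresented share."""
    return (row["Percentage of Most Dominant Race"]
            - row["Percentage of Blacks/Latinos/Native Americans"])


def gpa_exceeds(row, threshold):
    """True if the row's average GPA is strictly above the threshold."""
    return row["GPA (average)"] > threshold
```

Rows with a missing GPA would need to be filtered out before calling gpa_exceeds.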
Heads Up
- Because of the COVID pandemic, 2020 data from each of the schools may skew from the general trend.
- A large chunk of UC Irvine data is missing, either because the GPA was not reported or because the Common Data Set was unavailable.
- All non-UC percentages except for the in-state percentage are hand-calculated.
- There is a noticeable negative correlation between the year and admission rates and a positive correlation between the year and average GPA.
- For UCs, there is a negative correlation between the year and in-state percentage.
- For UCs, there is a positive correlation between the year and underrepresented ethnicities percentage.

Nutrition

Soda, Coffee & other drinks

This dataset includes nutritional data from the USDA FNDDS database on 560 beverages, including juices, milks, coffee, soda, energy drinks and more.
This dataset has a limited number of categories, making it accessible to any student.
Column Titles: Name, Type, Calories, Fat (g), Protein (g), Carbohydrate (g), Sugars (g), Sodium (mg), Caffeine (mg), Serving Weight (g), Serving Description (fl oz), 200 Calorie Weight (g)
Heads Up: There are some beverages with much higher levels of caffeine than others.

Fast Food Nutrition

This dataset includes simplified nutritional information about over 400 fast food menu items from the USDA FNDDS database.
This dataset has a limited number of categories, making it accessible to any student.
Column Titles: Name, Type, Vendor, Calories, Fat (g), Protein (g), Carbohydrate (g), Cholesterol (mg), Saturated Fats (g), Sodium (mg), Serving Weight (g), 200 Calorie Weight (g)

These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, 1738598, 2031479, and 1501927). Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.