Scatter plots can be used to show a relationship between two quantitative columns. Each row in the dataset is represented by a point, with one column providing the x-value and the other providing the y-value.

The resulting “point cloud” makes it possible to look for a relationship between those two columns.
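
As a minimal sketch of how rows become points, the Python below pulls two quantitative columns out of a small table. The dataset, the column names (`hours_studied`, `quiz_score`), and the helper `to_points` are all made up for illustration and are not part of these materials.

```python
# Made-up dataset for illustration only: each dict is one row.
rows = [
    {"name": "A", "hours_studied": 1, "quiz_score": 2},
    {"name": "B", "hours_studied": 2, "quiz_score": 4},
    {"name": "C", "hours_studied": 3, "quiz_score": 5},
    {"name": "D", "hours_studied": 4, "quiz_score": 4},
    {"name": "E", "hours_studied": 5, "quiz_score": 5},
]


def to_points(rows, x_column, y_column):
    """Turn each row into one (x, y) point using the two chosen columns."""
    return [(row[x_column], row[y_column]) for row in rows]


# One point per row: x comes from one column, y from the other.
points = to_points(rows, "hours_studied", "quiz_score")
print(points)
```

Plotting these pairs on x- and y-axes would give the point cloud described above.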

  • If the points in a scatter plot appear to follow a straight line, it suggests that a linear relationship exists between those two columns. A number called a correlation can be used to summarize this relationship.

  • 𝑟 is the name of the correlation statistic. The 𝑟-value will always fall between −1 and +1. The sign tells us whether the correlation is positive or negative. Distance from 0 tells us the strength of the correlation.

    • −1 is the strongest possible negative correlation.

    • +1 is the strongest possible positive correlation.

    • 0 means no correlation.

    • An 𝑟-value of about ±0.65 or ±0.70 or stronger is typically considered a “strong correlation”.

    • An 𝑟-value between ±0.35 and ±0.65 is typically considered a “moderate correlation”.

    • An 𝑟-value closer to 0 than about ±0.25 or ±0.35 may be considered weak.

    • However, these cutoffs are not an exact science! In some contexts an 𝑟-value of ±0.50 might be considered impressively strong!

  • The correlation is positive if the point cloud slopes up as it goes farther to the right. This means larger y-values tend to go with larger x-values. The correlation is negative if the point cloud slopes down as it goes farther to the right.

  • It is a strong correlation if the points are tightly clustered around a line. In this case, knowing the x-value gives us a pretty good idea of the y-value. It is a weak correlation if the points are loosely scattered and the y-value doesn’t depend much on the x-value.

  • Points that do not fit the trend line in a scatter plot are called unusual observations.

  • We graphically summarize this relationship by drawing a straight line through the point cloud, placed so that the vertical distances between the line and the points, taken together, are as small as possible. This line is called the line of best fit, and it allows us to predict y-values based on x-values.

  • Correlation is not causation! Correlation only suggests that two columns are related; it does not tell us whether one causes the other. For example, hot days are correlated with people running their air conditioners, but air conditioners do not cause hot days!
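
The ideas in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes 𝑟 to be Pearson's correlation coefficient, takes the line of best fit to be the least-squares line (the one that minimizes the sum of squared vertical distances), and uses made-up data. The function names and the exact cutoffs in `describe_strength` are hypothetical choices based on the rough ranges given above, not fixed rules.

```python
import math

# Made-up data for illustration only: five (x, y) points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def mean(values):
    return sum(values) / len(values)


def correlation(xs, ys):
    """Pearson's r: the sign gives direction, distance from 0 gives strength."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def line_of_best_fit(xs, ys):
    """Least-squares line, returned as (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def describe_strength(r):
    """Rough label only; these cutoffs are approximate, not an exact science."""
    size = abs(r)
    if size >= 0.65:
        return "strong"
    if size >= 0.35:
        return "moderate"
    return "weak"


r = correlation(xs, ys)
slope, intercept = line_of_best_fit(xs, ys)

# Use the line to predict a y-value for an x-value we have not seen.
predicted_y = slope * 6 + intercept

direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.2f} ({describe_strength(r)}, {direction})")
print(f"line of best fit: y = {slope:.2f}x + {intercept:.2f}")
print(f"predicted y at x = 6: {predicted_y:.2f}")
```

For this made-up data the point cloud slopes upward, so 𝑟 comes out positive. A prediction from the line is only as trustworthy as the correlation is strong, and none of these numbers say anything about whether x causes y.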

These materials were developed partly through support of the National Science Foundation (awards 1042210, 1535276, 1648684, 1738598, 2031479, and 1501927). Bootstrap by the Bootstrap Community is licensed under a Creative Commons 4.0 Unported License. This license does not grant permission to run training or professional development. Offering training or professional development with materials substantially derived from Bootstrap must be approved in writing by a Bootstrap Director. Permissions beyond the scope of this license, such as to run training, may be available by contacting contact@BootstrapWorld.org.