Scatter Graphs and Correlation
This topic covers how to interpret scatter graphs, which show the relationship between two variables. You'll need to identify trends (correlation), draw a line of best fit by eye, and use it to make predictions, while understanding the critical difference between correlation and causation.
Part of the ESAT Mathematics 1 syllabus — revision for the Engineering and Science Admissions Test (ESAT), the UAT-UK admissions test for Cambridge, Imperial, Oxford and UCL.
Key points
- A scatter graph plots pairs of values from bivariate data to show the relationship, or correlation, between them.
- Correlation can be positive (variables increase together), negative (one increases as the other decreases), or non-existent. It can also be strong (points are close to a line) or weak (points are widely scattered).
- Crucially, correlation does not imply causation. Two variables might be correlated due to a third, unobserved factor, or by pure coincidence.
- A line of best fit is a single straight line that best represents the trend of the data. It should be drawn by eye to have roughly an equal number of points above and below it.
- Interpolation (predicting within the data range) is generally considered more reliable than extrapolation (predicting outside the data range), as the trend may not continue indefinitely.
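In the exam you describe correlation by eye, but a short numerical check can make the ideas of direction and strength concrete. The sketch below computes Pearson's correlation coefficient r, which these notes do not name; it is introduced here only as one common way to quantify a linear relationship. The data values and the cutoffs used for "strong" and "no correlation" are assumptions chosen for illustration, not fixed rules.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def describe(r, strong_cutoff=0.7, none_cutoff=0.2):
    """Label r in words; the cutoffs are illustrative assumptions."""
    if abs(r) < none_cutoff:
        return "no correlation"
    strength = "strong" if abs(r) >= strong_cutoff else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

# Hypothetical bivariate data, invented for illustration.
temperature = [10, 20, 30, 40, 50]
time_taken = [82, 63, 52, 33, 20]

r = pearson_r(temperature, time_taken)
print(round(r, 3), describe(r))
```

A value of r near -1 or +1 corresponds to points lying close to a straight line; a value near 0 corresponds to no linear trend.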
Formulae
y = mx + c
This is the general equation of a straight line. Your line of best fit will follow this form, where m is the gradient (the rate of change of y with respect to x) and c is the y-intercept.
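If you read two points off your line of best fit, m and c can be found as sketched here. The labels (x_1, y_1) and (x_2, y_2) are notation introduced for illustration and do not appear elsewhere in these notes.

```latex
\[
m = \frac{y_2 - y_1}{x_2 - x_1}, \qquad c = y_1 - m\,x_1, \qquad x_1 \neq x_2
\]
```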
Definitions
- Bivariate Data
- A set of data that has two variables for each observation. For example, the height and shoe size of a group of people.
- Correlation
- A measure of the extent to which two variables are linearly related, that is, how closely the plotted points follow a straight line.
- Line of Best Fit
- A straight line drawn on a scatter graph that passes through the middle of the data points, representing the overall trend.
- Interpolation
- Estimating a value for a variable that lies within the range of the original data, using the line of best fit.
- Extrapolation
- Estimating a value for a variable that lies outside the range of the original data by extending the line of best fit.
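The distinction between the last two definitions comes down to a range check. A minimal sketch, assuming the observed x-values are available as a list; the function name and the data are hypothetical.

```python
def classify_prediction(x, x_values):
    """Return 'interpolation' if x lies within the observed range, else 'extrapolation'."""
    lo, hi = min(x_values), max(x_values)
    return "interpolation" if lo <= x <= hi else "extrapolation"

# Hypothetical observed x-values, e.g. temperatures in °C.
observed = [10, 20, 30, 40, 50]

print(classify_prediction(25, observed))  # interpolation
print(classify_prediction(70, observed))  # extrapolation
```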
Worked example
A scientist measures the time taken for a chemical reaction to complete at different temperatures. The results are plotted on a scatter graph, where temperature is on the x-axis (°C) and time is on the y-axis (seconds). A line of best fit is drawn passing through the points (10, 80) and (50, 20).
1. Describe the correlation shown by these data.
2. Use the line of best fit to estimate the time taken for the reaction at 20°C.
3. A student claims that a lower reaction time causes the temperature to be higher. Explain the flaw in this statement.
- 1
Step 1 (Correlation):
Observe the points used for the line of best fit.
As the temperature (x-value) increases from 10 to 50, the time (y-value) decreases from 80 to 20.
This indicates a negative correlation.
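Strength depends on how closely the points cluster around the line, not on whether a line can be drawn. Assuming the plotted points lie close to the line of best fit, the correlation is strong.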
- 2
Step 2 (Estimation):
We need the equation of the line of best fit, which we can find from the two given points.
Let's find the gradient (m):
m = (change in y) / (change in x) = (20 - 80) / (50 - 10) = -60 / 40 = -1.5 s/°C
- 3
Step 3:
Use one of the points to find the equation.
Using (10, 80):
y - 80 = -1.5 × (x - 10) ⇒ y - 80 = -1.5x + 15 ⇒ y = -1.5x + 95
- 4
Step 4:
Substitute x = 20°C into the equation:
y = -1.5 × 20 + 95 = -30 + 95 = 65 seconds
This is an interpolation, as 20°C is within the data range of 10°C to 50°C.
- 5
Step 5 (Causation):
The student has reversed cause and effect.
In chemistry, increasing the temperature provides more kinetic energy, which causes the reaction to happen faster (in less time).
The temperature is the explanatory variable (cause), and the reaction time is the response variable (effect), not the other way around.
Answer:
1. Strong negative correlation.
2. 65 seconds.
3. The student has confused correlation with causation and reversed the roles of cause and effect; the change in temperature causes the change in reaction time.
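The arithmetic in parts of this worked example can be checked with a short script. This is only a sketch for self-checking (you will not have a computer in the test); the helper name line_through is hypothetical, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

def line_through(p1, p2):
    """Return (m, c) for the straight line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = Fraction(y2 - y1, x2 - x1)  # gradient: change in y over change in x
    c = y1 - m * x1                 # rearranged from y1 = m*x1 + c
    return m, c

# The two points given on the line of best fit.
m, c = line_through((10, 80), (50, 20))

# Estimated reaction time at 20 °C.
estimate = m * 20 + c

print(m, c, estimate)  # -3/2 95 65
```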
Common mistakes
- Mistaking correlation for causation. This is the most common conceptual error. Just because ice cream sales and shark attacks are correlated doesn't mean one causes the other (the hidden factor is warm weather).
- Drawing a line of best fit incorrectly, for example, by forcing it to go through the origin (0,0) or connecting the first and last points.
- Treating extrapolated values as being as reliable as interpolated values. The trend may not hold true outside the measured range.
- Reading from the wrong axis when using the line of best fit to make a prediction.
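To see why extrapolation deserves caution, the line from the worked example, y = -1.5x + 95, can be evaluated outside its 10°C to 50°C data range. The sketch below assumes that line and range; the temperatures 60 and 70 are chosen purely for illustration.

```python
def predicted_time(temp_c):
    """Line of best fit from the worked example: y = -1.5x + 95 (seconds)."""
    return -1.5 * temp_c + 95

DATA_RANGE = (10, 50)  # °C, the data range stated in the worked example

for temp in (20, 60, 70):
    inside = DATA_RANGE[0] <= temp <= DATA_RANGE[1]
    kind = "interpolation" if inside else "extrapolation"
    print(f"{temp} °C -> {predicted_time(temp)} s ({kind})")
```

At 70°C the line gives -10 seconds, which is physically impossible: a reaction cannot take negative time, so the linear trend must break down somewhere beyond the measured range.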
No-calculator tips
- To draw a line of best fit, use a transparent ruler so you can see the points on both sides. Position it so it follows the general path of the points, aiming for a roughly equal number of data points above and below the line.
- When estimating using your line, use the edge of your question paper as a straight edge to draw a faint pencil line from the axis value to your line of best fit, then across to the other axis. This improves visual accuracy.
- If you need the gradient of your line of best fit, pick two points on the line itself (not necessarily data points) that have easy-to-read integer coordinates, to simplify the 'rise over run' calculation.
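The "equal number of points above and below" guideline can be checked numerically when practising. A minimal sketch, assuming a candidate line y = mx + c and a small data set; the function name balance and the data points are hypothetical, invented for illustration.

```python
def balance(points, m, c):
    """Count data points above, below and on the line y = m*x + c."""
    above = sum(1 for x, y in points if y > m * x + c)
    below = sum(1 for x, y in points if y < m * x + c)
    on_line = len(points) - above - below
    return above, below, on_line

# Hypothetical data points, invented for illustration.
points = [(10, 82), (20, 63), (30, 52), (40, 33), (50, 20)]

# Candidate line of best fit: y = -1.5x + 95.
print(balance(points, -1.5, 95))  # (2, 2, 1)
```

A roughly even split is a sanity check rather than a guarantee of a good line: the line should also follow the direction of the trend.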