Sometimes numerical data comes in pairs. Perhaps a paleontologist measures the lengths of the femur (leg bone) and humerus (arm bone) in five fossils of the same dinosaur species. It might make sense to consider the arm lengths separately from the leg lengths, and calculate things such as the mean, or the standard deviation. But what if the researcher is curious to know if there is a relationship between these two measurements? It's not enough to just look at the arms separately from the legs. Instead, the paleontologist should pair the lengths of the bones for each skeleton and use an area of statistics known as correlation.
What is correlation? In the example above, suppose that the researcher studied the data and reached the unsurprising result that dinosaur fossils with longer arms also had longer legs, and fossils with shorter arms had shorter legs. A scatterplot of the data showed that the data points were all clustered near a straight line. The researcher would then say that there is a strong straight-line relationship, or correlation, between the lengths of arm bones and leg bones of the fossils. It requires some more work to say how strong the correlation is.
Correlation and Scatterplots
Since each data point represents two numbers, a two-dimensional scatterplot is a great help in visualizing the data. Suppose we actually have our hands on the dinosaur data, and the five fossils have the following measurements:
- Femur 50 cm, humerus 41 cm
- Femur 57 cm, humerus 61 cm
- Femur 61 cm, humerus 71 cm
- Femur 66 cm, humerus 70 cm
- Femur 75 cm, humerus 82 cm
A scatterplot of the data places the femur measurement in the horizontal direction and the humerus measurement in the vertical direction. Each point represents the measurements of one of the skeletons. For instance, the point at the bottom left, at (50, 41), corresponds to skeleton #1. The point at the upper right, at (75, 82), is skeleton #5.
It certainly looks like we could draw a straight line that would be very close to all of the points. But how can we tell for certain? Closeness is in the eye of the beholder. How do we know that our definition of "closeness" matches someone else's? Is there any way that we could quantify this closeness?
To objectively measure how close the data is to being along a straight line, the correlation coefficient comes to the rescue. The correlation coefficient, typically denoted r, is a real number between -1 and 1, inclusive. The value of r measures the strength of the straight-line relationship based on a formula, eliminating any subjectivity in the process. There are several guidelines to keep in mind when interpreting the value of r.
- If r = 0 then there is no straight-line relationship between the data. The points are often a complete jumble, although they may still follow some other pattern, such as a curve.
- If r = -1 or r = 1 then all of the data points line up perfectly on a sloped line.
- If r is a value other than these extremes, then the result is a less than perfect fit of a straight line. The closer r is to -1 or 1, the more tightly the points cluster around a line. In real-world data sets, this is the most common result.
- If r is positive then the line is going up with a positive slope. If r is negative then the line is going down with a negative slope.
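To make these guidelines concrete, here is a minimal Python sketch that computes r for the five fossils listed earlier. It assumes the usual sample (Pearson) correlation coefficient, built from the means and sample standard deviations of the two lists. The function name `pearson_r` and the variable names are hypothetical, chosen only for this illustration.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation coefficient of paired data (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    # Average product of standardized deviations, using n - 1 in the denominator.
    # Note: this divides by zero if either list is constant.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Measurements in cm, taken from the list of five fossils above.
femur = [50, 57, 61, 66, 75]
humerus = [41, 61, 71, 70, 82]

r = pearson_r(femur, humerus)
print(f"r = {r:.3f}")
```

For this data the value comes out at roughly 0.94: positive, and close to 1, which agrees with the impression that the points lie near an upward-sloping line.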
The Calculation of the Correlation Coefficient
The formula for the correlation coefficient r is somewhat complicated. The ingredients of the formula are the means and standard deviations of both sets of numerical data, as well as the number of data points. For most practical applications r is tedious to compute by hand. If our data has been entered into a calculator or spreadsheet program with statistical commands, then there is usually a built-in function to calculate r.
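The formula itself is not reproduced in this text. One standard way to write it, assuming the usual sample (Pearson) correlation coefficient and using exactly the ingredients named above, is the following. The symbols are chosen here for illustration: n is the number of data points, x̄ and ȳ are the two means, and s_x and s_y are the two sample standard deviations.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each factor in the sum is a standardized value: it says how many standard deviations a measurement lies above or below its mean. When the paired measurements tend to fall on the same side of their means, the products are mostly positive and r is positive. When they tend to fall on opposite sides, r is negative.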
Limitations of Correlation
Although correlation is a powerful tool, there are some limitations in using it:
- Correlation does not tell us everything about the data. Means and standard deviations continue to be important.
- The data may be described by a curve more complicated than a straight line, but this will not show up in the calculation of r. Because r measures only straight-line relationships, strongly curved data can still produce a value of r near 0.
- Outliers strongly influence the correlation coefficient. If we see any outliers in our data, we should be careful about what conclusions we draw from the value of r.
- Just because two sets of data are correlated, it doesn't mean that one is the cause of the other.
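Two of these limitations can be illustrated with a short Python sketch. It again assumes the usual sample (Pearson) correlation coefficient. The helper `pearson_r`, the parabola data, and the extra "outlier" fossil are all made up for this illustration and are not part of the original data set.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation coefficient of paired data (illustrative helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# 1. A curved relationship: every point lies exactly on y = x**2,
#    yet there is no straight-line trend, so r is essentially 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# 2. An outlier: the five fossils from the text, then the same data
#    with one invented measurement (femur 80 cm, humerus 30 cm) added.
femur = [50, 57, 61, 66, 75]
humerus = [41, 61, 71, 70, 82]
r_fossils = pearson_r(femur, humerus)
r_with_outlier = pearson_r(femur + [80], humerus + [30])

print(f"parabola:        r = {r_curve:.3f}")
print(f"five fossils:    r = {r_fossils:.3f}")
print(f"with an outlier: r = {r_with_outlier:.3f}")
```

In the first case the points follow a perfect pattern that r does not detect. In the second, a single unusual point is enough to pull a strong correlation down to nearly zero, which is why a scatterplot should be inspected alongside the value of r.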