Hazards of correlation and regression

1. Drawing causation conclusions
Ski and snowboard sales tend to rise and fall together, but sales of one don’t cause sales of the other; both are tied to an outside factor, such as the season. Throughout grade school, mathematical skill correlates positively with height. Height doesn’t make you good at math; rather, older children are on average both taller and better at math than younger children.
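
The height and math example can be simulated to see the outside factor at work. This is only a sketch on made-up data: the age range, the linear formulas, and the noise levels are all assumptions chosen for illustration, not measurements of real children.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical model: age drives both height and math score.
# Height never appears in the formula for the math score.
ages = [rng.uniform(6, 12) for _ in range(5000)]
heights = [110 + 6 * a + rng.gauss(0, 5) for a in ages]
math_scores = [10 * a + rng.gauss(0, 10) for a in ages]

# Across all ages, height and math score look strongly related.
overall = pearson(heights, math_scores)

# Among children of nearly the same age, the relationship mostly vanishes.
same_age = [(h, m) for a, h, m in zip(ages, heights, math_scores)
            if 9.0 <= a < 9.5]
within = pearson([h for h, _ in same_age], [m for _, m in same_age])

print(f"all ages: r = {overall:.2f}")
print(f"ages 9.0 to 9.5 only: r = {within:.2f}")
```

Holding age roughly fixed is a crude way of controlling for it; the correlation that remains is the part age does not explain.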

2. Ecological correlation
This means averaging subsets of data and then correlating the averages (for example, comparing population traits after averaging by nation). It can overstate the correlation because the variation among individuals within each group, outliers included, has been averaged away before the correlation is computed.
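
A small simulation shows how much averaging can inflate a correlation. The numbers here are invented: 30 hypothetical "nations" of 200 people each, with two traits that share a national baseline but are otherwise independent from person to person.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable

all_x, all_y = [], []        # one entry per individual
avg_x, avg_y = [], []        # one entry per nation

for _ in range(30):
    baseline = rng.uniform(0, 10)  # shared by both traits within a nation
    xs = [baseline + rng.gauss(0, 5) for _ in range(200)]
    ys = [baseline + rng.gauss(0, 5) for _ in range(200)]
    all_x.extend(xs)
    all_y.extend(ys)
    avg_x.append(sum(xs) / len(xs))
    avg_y.append(sum(ys) / len(ys))

individual_r = pearson(all_x, all_y)  # correlation among people
ecological_r = pearson(avg_x, avg_y)  # correlation among national averages

print(f"individuals: r = {individual_r:.2f}")
print(f"national averages: r = {ecological_r:.2f}")
```

The averages track each other closely because the person-to-person noise has cancelled out, even though knowing one trait of an individual says fairly little about the other.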

3. Regression fallacy
If you measure a population twice and the correlation between the two rounds of measurements is less than 1, individuals with exceptionally high measurements in round 1 will tend to score lower in round 2, and those with exceptionally low measurements in round 1 will tend to score higher (“regression toward the mean”). The fallacy is in attributing this to anything other than plain mathematics. One way to think about it is that if you are in the top ranks in the first measurement and you don’t get the same result on the second, there’s a lot more room to move down than up.
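
Regression toward the mean shows up even when nothing at all happens between the two rounds. The sketch below assumes a simple model that is not from any real study: each measurement is a fixed underlying value plus independent noise, with both spreads set to 10.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

n = 5000
# Hypothetical model: a stable underlying value, measured twice with noise.
true_value = [rng.gauss(100, 10) for _ in range(n)]
round1 = [t + rng.gauss(0, 10) for t in true_value]
round2 = [t + rng.gauss(0, 10) for t in true_value]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Rank individuals by their round 1 result and take the extremes.
order = sorted(range(n), key=lambda i: round1[i])
bottom = order[: n // 10]   # lowest 10% in round 1
top = order[-(n // 10):]    # highest 10% in round 1

top_r1 = mean(round1[i] for i in top)
top_r2 = mean(round2[i] for i in top)
bottom_r1 = mean(round1[i] for i in bottom)
bottom_r2 = mean(round2[i] for i in bottom)

print(f"top 10%:    round 1 = {top_r1:.1f}, round 2 = {top_r2:.1f}")
print(f"bottom 10%: round 1 = {bottom_r1:.1f}, round 2 = {bottom_r2:.1f}")
```

Nobody's underlying value changed, yet the top group drops and the bottom group rises, because part of what put them at the extremes in round 1 was noise that does not repeat.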

4. Application problems
Applying a technique to a data set that doesn’t meet the criteria for that technique can, at worst, completely lie to you about the data. Choosing the wrong variables to compare might not give you any mathematical errors, but it could mask what’s “really going on.” For example, perimeter and area of rectangles are strongly correlated, but the real story is about each of those relating to length and width.
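
The rectangle example is easy to check numerically. The sketch below assumes side lengths drawn uniformly between 1 and 10, a range picked only for illustration; a different distribution of sides would give a different correlation.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(3)  # fixed seed so the run is repeatable

# Random rectangles: both quantities are computed from length and width.
sides = [(rng.uniform(1, 10), rng.uniform(1, 10)) for _ in range(5000)]
perimeters = [2 * (length + width) for length, width in sides]
areas = [length * width for length, width in sides]

r = pearson(perimeters, areas)
print(f"perimeter vs. area: r = {r:.2f}")

# Perimeter alone does not determine area: same perimeter, different areas.
thin = (1, 9)
square = (5, 5)
thin_perimeter, thin_area = 2 * sum(thin), thin[0] * thin[1]
square_perimeter, square_area = 2 * sum(square), square[0] * square[1]
print(thin_perimeter, thin_area)      # 20 9
print(square_perimeter, square_area)  # 20 25
```

The correlation is high, but a regression of area on perimeter would miss the structure entirely: two rectangles with identical perimeters can have very different areas, and only length and width explain why.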

rweber.net
