R-squared measures how much of the variation in an underlying dataset a model’s output “explains”. Formally, R-squared is 1 - RSS/TSS, where RSS stands for “residual sum of squares” and TSS stands for “total sum of squares”. RSS is the sum of the squared differences between the model’s predicted values and the actual values. TSS is the sum of the squared differences between the mean of the dataset and the actual values. A model that just “predicts the mean” has RSS equal to TSS and therefore an R-squared of 0. The closer the R-squared is to 1, the more of the variation in the underlying data the model output “explains”.
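The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name `r_squared` and the toy numbers are made up for the example.

```python
def r_squared(actual, predicted):
    """Return 1 - RSS/TSS for paired sequences of actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    # RSS: squared distances between predictions and actual values.
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # TSS: squared distances between the mean and actual values.
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - rss / tss


# Toy data, invented for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
mean_only = [5.0] * len(actual)      # a model that always predicts the mean
perfect = list(actual)               # a model that predicts every value exactly
close = [2.5, 3.5, 6.5, 7.5]         # a model that is off by 0.5 each time

r2_mean = r_squared(actual, mean_only)    # 0.0, since RSS equals TSS
r2_perfect = r_squared(actual, perfect)   # 1.0, since RSS is zero
r2_close = r_squared(actual, close)       # about 0.95 (RSS = 1, TSS = 20)

print(r2_mean, r2_perfect, r2_close)
```

Note that nothing stops RSS from exceeding TSS: a model that predicts worse than the mean gets a negative R-squared under this formula.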

It is important to flag that a higher R-squared does not necessarily mean the model is better. There are several ways in which a high R-squared can mislead.

First, a model can be overfit. Suppose one wishes to predict the results of an NBA game. One collects tens of thousands of factors and adds them all into a simple machine learning model. The R-squared will be quite high: given enough variables, the model can eventually find a pattern that explains the past data well. But that model might be terrible at predicting the future, as many of the variables one observed are just junk. As a result, good data scientists are careful to avoid “overfitting” by “cross-validating” their models (holding a chunk of data out of the sample and testing on it). Moreover, in a linear regression, adding a variable can never lower the in-sample R-squared: the fit can always assign the new variable a weight of zero and do exactly as well as before, and in practice even a truly irrelevant variable picks up some noise and nudges the R-squared up. As a result, simply comparing two models by R-squared will tend to reward models with lots of variables. That’s why careful researchers often use measures like “adjusted R-squared”, which penalizes each additional variable.
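The effect of junk variables can be seen in a small simulation. The sketch below is illustrative only: the data are randomly generated, the helpers `solve`, `ols_r_squared`, and `adjusted_r_squared` are hypothetical names written for this example, and the adjustment used is the common formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors excluding the intercept.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols_r_squared(columns, y):
    """Fit y on an intercept plus the given columns; return the in-sample R-squared."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(rows[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(r[j] * beta[j] for j in range(k)) for r in rows]
    mean_y = sum(y) / n
    rss = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    tss = sum((v - mean_y) ** 2 for v in y)
    return 1 - rss / tss


def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 40
signal = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * s + rng.gauss(0, 1) for s in signal]                   # depends on signal only
junk = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(10)]   # unrelated to y

results = []
for n_junk in (0, 5, 10):
    cols = [signal] + junk[:n_junk]
    r2 = ols_r_squared(cols, y)
    results.append((n_junk, r2, adjusted_r_squared(r2, n, len(cols))))

for n_junk, r2, adj in results:
    print(f"junk variables: {n_junk:2d}  R2: {r2:.4f}  adjusted R2: {adj:.4f}")
```

The in-sample R-squared does not fall as junk columns are added, even though they carry no information about `y`. The adjusted figure is never above the raw one, and the gap widens as p grows relative to n.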

Second, a model can mistake correlation for causation. Suppose one wishes to estimate the effect of absences on school grades. A model that predicts grades from absences may have a high R-squared, seemingly saying absences really harm school performance. This is an eminently plausible hypothesis, but there is a possible confounder: students who are chronically absent are also less likely to do homework and might be less interested in the subject. As such, good researchers must be careful not to conflate a high R-squared with causation.
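A simulation can make the confounding story concrete. The sketch below assumes a deliberately extreme world, invented for illustration: a hidden variable `interest` drives both absences and grades, and absences have no direct effect on grades at all. The variable names, noise levels, and helper functions are all hypothetical.

```python
import random


def simple_r_squared(x, y):
    """R-squared from regressing y on x with an intercept (the squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def residuals(x, y):
    """What is left of y after regressing it on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 2000
interest = [rng.gauss(0, 1) for _ in range(n)]            # the hidden confounder
absences = [-i + rng.gauss(0, 0.5) for i in interest]     # less interest, more absences
grades = [i + rng.gauss(0, 0.5) for i in interest]        # note: no absences term here

# Regressing grades on absences alone gives a sizeable R-squared.
r2_naive = simple_r_squared(absences, grades)

# Remove the part of each variable explained by interest, then compare what is left.
r2_controlled = simple_r_squared(residuals(interest, absences),
                                 residuals(interest, grades))

print(f"R2, grades on absences:           {r2_naive:.3f}")
print(f"R2 after controlling for interest: {r2_controlled:.3f}")
```

In this made-up setting the first R-squared is large even though changing a student’s absences would do nothing to their grades, and it collapses toward zero once interest is accounted for. Real data would rarely be this clean, and the confounder is often not observed at all.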
