Understanding R-squared in Linear Regression: A Case Study

In the realm of statistical modeling, R-squared (R²) is a widely used measure of the goodness-of-fit of a linear regression model. It represents the proportion of variance in the dependent variable that is predictable from the independent variables. However, R² is easy to misread, and misinterpreting it can lead to incorrect conclusions about model performance.

In this article, we will delve into the world of R-squared, exploring its limitations, pitfalls, and nuances. We’ll use a real-world example, written in R, to illustrate the importance of understanding R² and how to avoid common mistakes that might result in misleading results, like negative R² values.

What is R-squared?

R-squared (R²) is a statistical measure that represents the proportion of variance in the dependent variable that is explained by the independent variables. For a model with an intercept, it is computed from the residual sum of squares (SSR) and the total sum of squares (SST). Mathematically, R² can be represented as:

[ R^2 = 1 - \frac{SSR}{SST} ]

where SSR is the sum of squared residuals of the regression model and SST is the sum of squared deviations of the observed values from their mean. Multiplying by 100 expresses R² as a percentage of variance explained.
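
This definition can be checked directly in R: computing 1 - SSR/SST by hand reproduces the value that `lm()` reports. A minimal sketch (the simulated data here is illustrative, not from the case study below):

```r
# Verify that R^2 = 1 - SSR/SST matches lm()'s built-in value
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)

ssr <- sum(residuals(fit)^2)       # residual sum of squares
sst <- sum((y - mean(y))^2)        # total sum of squares
r2_manual <- 1 - ssr / sst

all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```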

Types of Regression Models

There are several types of linear regression models, including:

  • Simple Linear Regression (SLR): A model that predicts a single continuous outcome variable from a single predictor variable.
  • Multiple Linear Regression (MLR): A model that predicts a single continuous outcome variable from two or more predictor variables.

Interpreting R-squared Values

The value of R² tells us how well the regression model explains the variance in the dependent variable. Here are some general guidelines for interpreting R² values (these are rules of thumb; acceptable values vary widely by field):

  • High R² (above roughly 0.7): The model fits well, and most of the variance is explained by the predictor variables.
  • Moderate R² (0.5–0.7): The model has some explanatory power, but there’s still room for improvement.
  • Low R² (below 0.5): The model explains only a small share of the variance, and most of it remains unexplained.
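
What pushes R² from one band to another is largely the signal-to-noise ratio. A small sketch (illustrative data; the three noise levels are chosen to land roughly in the high, moderate, and low bands):

```r
# How the noise level drives R^2: same linear signal, three noise levels
set.seed(42)
x <- rnorm(200)

r2_at_noise <- function(noise_sd) {
  y <- 2 * x + rnorm(200, sd = noise_sd)
  summary(lm(y ~ x))$r.squared
}

r2_levels <- sapply(c(0.5, 2, 10), r2_at_noise)
print(r2_levels)  # decreases as noise grows: roughly high, moderate, low
```

The signal variance is fixed at 4 (slope 2, unit-variance x), so in expectation R² is 4 divided by 4 plus the noise variance; the thresholds above describe the data, not the quality of the fitting procedure.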

Real-World Example

Let’s examine the R code below, which fits two linear regression models: one with a single predictor (SLR) and one with two predictors (MLR). The goal is to predict a continuous outcome variable (y) from the predictor variables (z1 and z2), and to compare in-sample and out-of-sample R².

# Generate sample data
set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 0, sd = 1)
x2 <- rnorm(n, mean = 0, sd = 1)
y <- 3 * x1 + 4 * x2 + rnorm(n, mean = 0, sd = 10)

# Create data frames for in-sample and out-of-sample data.
# Note: the out-of-sample y is resampled independently of the new
# z1 and z2 draws, so the fitted relationship does not hold there.
data_in_sample <- data.frame(y = y, z1 = x1, z2 = x2)
data_out_sample <- data.frame(
  y = sample(y, size = n, replace = TRUE),
  z1 = rnorm(n, mean = 0, sd = 1),
  z2 = rnorm(n, mean = 0, sd = 1)
)

# Fit the models on the in-sample data only
model_slr <- lm(y ~ z1, data = data_in_sample)
model_mlr <- lm(y ~ z1 + z2, data = data_in_sample)

# Make predictions with the in-sample models, both on the data they
# were fitted to and on the out-of-sample data
predictions_slr_in_sample <- predict(model_slr, newdata = data_in_sample)
predictions_mlr_in_sample <- predict(model_mlr, newdata = data_in_sample)
predictions_slr_out_sample <- predict(model_slr, newdata = data_out_sample)
predictions_mlr_out_sample <- predict(model_mlr, newdata = data_out_sample)

# Calculate R-squared as 1 - SSE/SST, with SST taken around the mean
# of the data being evaluated
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

R_Square_in_sample_slr <- r_squared(data_in_sample$y, predictions_slr_in_sample)
R_Square_in_sample_mlr <- r_squared(data_in_sample$y, predictions_mlr_in_sample)
R_Square_out_sample_slr <- r_squared(data_out_sample$y, predictions_slr_out_sample)
R_Square_out_sample_mlr <- r_squared(data_out_sample$y, predictions_mlr_out_sample)

# Print results
cat("R-squared values for in-sample data:\n")
cat("SLR:", R_Square_in_sample_slr, "\n")
cat("MLR:", R_Square_in_sample_mlr, "\n")

cat("\nR-squared values for out-of-sample data:\n")
cat("SLR:", R_Square_out_sample_slr, "\n")
cat("MLR:", R_Square_out_sample_mlr, "\n")

Analyzing Results

After running the code, we can analyze the results:

  • In-Sample Data: The signal variance is 9 + 16 = 25 (from the coefficients 3 and 4 on unit-variance predictors), while the noise variance is 100, so even the correctly specified MLR model can explain only about 25/125 = 20% of the variance in expectation. The SLR model, which uses only z1, explains even less (about 9/125 ≈ 7%). Both in-sample R² values are therefore low, with MLR clearly ahead of SLR.
  • Out-of-Sample Data: Because the out-of-sample y was resampled independently of the new z1 and z2 draws, neither model has any genuine predictive power there. The out-of-sample R² values land near zero and can be negative, since 1 − SSE/SST is not bounded below by zero when a model is evaluated on data it was not fitted to.

Conclusion

Based on our analysis, we can conclude that:

  • Simple Linear Regression (SLR): Using only z1 leaves the variance contributed by z2, plus all of the noise, unexplained, so its R² is lower.
  • Multiple Linear Regression (MLR): Including both z1 and z2 recovers more of the true signal and yields a higher in-sample R², though the large noise term still caps how high it can go.
  • Out-of-sample R²: A respectable in-sample R² does not guarantee out-of-sample performance. Evaluated on data where the fitted relationship does not hold, R² can fall to zero or below; a negative value means the model predicts worse than simply using the mean of the new data.

By using both predictor variables (z1 and z2) as independent variables, we create a model that captures more of the relationships between the variables, improving the in-sample fit. But R² must always be interpreted in context: check how it was computed, on which data, and whether the guideline thresholds above make sense for the problem at hand.
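
The negative-R² pitfall mentioned at the start can be reproduced in a few lines. A minimal sketch (variable names are illustrative): a model fitted on one data set is evaluated on new data where y is unrelated to x, and 1 − SSE/SST drops below zero.

```r
# Out-of-sample R^2 can be negative when the fitted relationship
# does not hold in the new data
set.seed(7)
train <- data.frame(x = rnorm(100))
train$y <- 3 * train$x + rnorm(100)
fit <- lm(y ~ x, data = train)

# New data in which y is unrelated to x
test_data <- data.frame(x = rnorm(100), y = rnorm(100))
pred <- predict(fit, newdata = test_data)

sse <- sum((test_data$y - pred)^2)
sst <- sum((test_data$y - mean(test_data$y))^2)
r2_out <- 1 - sse / sst
r2_out  # well below zero for this setup
```

Here the model confidently predicts 3·x on data where the true slope is zero, so its predictions are far worse than the mean of the new y, which is exactly what a negative R² encodes.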


Last modified on 2024-10-15