Linear regression diagnostics

library(regressinator)
library(ggplot2)
library(broom)

The basic linear regression model assumes that \[ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon, \] where \(\epsilon\) has mean zero and variance \(\sigma^2\). The regressors \(X_1, \dots, X_p\) might be the original predictors, or could be transformations and combinations of them, such as interactions or polynomials. The assumptions on the distribution of \(\epsilon\) are used to obtain the variance of \(\hat \beta\) and hence are important for inference, while the assumption of a linear relationship determines whether our mean function is accurate.

For these examples, we’ll consider a simple multivariate setting where the population relationship is nonlinear and we misspecify the model:

nonlinear_pop <- population(
  x1 = predictor(runif, min = 1, max = 8),
  x2 = predictor(runif, min = 4, max = 12),
  y = response(0.7 + 0.8 * x1**2 + 1.2 * x2,
               family = gaussian(), error_scale = 4.0)
)

nonlinear_data <- sample_x(nonlinear_pop, n = 100) |>
  sample_y()

fit <- lm(y ~ x1 + x2, data = nonlinear_data)

We’ll use broom::augment() to get diagnostic information from the fit in a standardized data frame, making it easy to use ggplot2 to produce different diagnostic plots.

Residual plots

Residual plots can detect misspecification of the mean function. When we plot the residuals against any linear combination of the regressors, they should have mean zero and constant variance. If they do not, the model may be incorrect.

One such linear combination is the fitted values \(\hat Y\). We can obtain the residuals and fitted values from broom::augment():

augment(fit) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted value", y = "Residual")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There is a suspicious trend here, but it is hard to tell what specifically is wrong from the model from this. It would be more helpful to know which predictor is modeled incorrectly.

To plot residuals versus the regressors, the regressinator provides augment_longer(), which is like augment() but converts the data to long form with one row per predictor per observation, making it easy to facet the residuals:

augment_longer(fit) |>
  ggplot(aes(x = .predictor_value, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(vars(.predictor_name), scales = "free_x") +
  labs(x = "Predictor", y = "Residual")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The trend in the residuals against x1 makes it clear that something is wrong with our specification of x1 in the model, which is correct: we modeled it as linear when it should be quadratic.

Detecting misspecification is a matter of judgment, as the residuals are random and trends can appear even when the model is well-specified. Using model_lineup(), we can compare the true residual plots to several where the model is correctly specified. Each row gives the plots for one simulation, and one of the fives rows (at random) is the true residual plots:

model_lineup(fit, fn = augment_longer, n = 5) |>
  ggplot(aes(x = .predictor_value, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_grid(rows = vars(.sample), cols = vars(.predictor_name),
             scales = "free_x") +
  labs(x = "Predictor", y = "Residual")
#> decrypt("gOh4 kDUD pB 9RtpUpRB v5")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This gives a sense of the variation to be expected when the model is well-specified. It should still be easy to spot the row containing the plots for our misspecified model.

Partial residual plots

The partial_residuals() function fetches partial residuals in a convenient data frame format. Partial residuals are defined in terms of the original predictors, not the regressors; we can think of a partial residual plot for a particular predictor as showing the relationship between that predictor and the response, after “adjusting for” the other predictors according to our fitted model. See the function documentation for references on the use and interpretation of partial residuals.

Detecting basic misspecification

Here the black line gives the fitted predictor effects (i.e. the model estimate of the relationship), while the blue line smooths the partial residuals. We can see that for x1, the blue line deviates systematically from the black line:

partial_residuals(fit) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point() + # partial residuals
  geom_smooth(se = FALSE) + # smoothed residuals
  geom_line(aes(x = .predictor_value, y = .predictor_effect)) + # effects
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor", y = "Partial residual")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The shape of the blue line approximates the shape of the true relationship between x1 and the response (provided the rest of the model is well-specified!), so we can use its shape to determine how to change our model.

We can again use model_lineup() to see how these compare to the partial residuals when the model is correctly specified:

model_lineup(fit, partial_residuals, n = 5) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_line(aes(x = .predictor_value, y = .predictor_effect)) +
  facet_grid(rows = vars(.sample), cols = vars(.predictor_name),
             scales = "free") +
  labs(x = "Predictor", y = "Partial residual")
#> decrypt("gOh4 kDUD pB 9RtpUpRB vV")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In the plots where the model is correctly specified, the blue and black lines coincide, making it easy to spot the misspecified model.

Checking for interactions

Partial residuals can also be used to detect interaction terms that should be included in the model. For example, let’s define a population in which \(X_2\) is a factor variable with two levels, and the slope and intercept of the relationship between \(X_1\) and \(Y\) depend on the value of \(X_2\).

intercepts <- c(
  "foo" = -0.3,
  "bar" = 1.7
)

slopes <- c(
  "foo" = 1.8,
  "bar" = -0.4
)

interact_pop <- population(
  x1 = predictor(runif, min = 1, max = 8),
  x2 = predictor(rfactor, levels = c("foo", "bar")),
  y = response(by_level(x2, intercepts) + by_level(x2, slopes) * x1,
               family = gaussian(), error_scale = 4.0)
)

interact_data <- sample_x(interact_pop, n = 100) |>
  sample_y()

no_interact_fit <- lm(y ~ x1 + x2, data = interact_data)

We have fit a model that does not account for the interaction, allowing a different intercept for the two groups but forcing the slope of \(X_1\) to be the same. If we are concerned the interaction may be present, we can make the conventional partial residual plot, but with the points colored by the value of \(X_2\):

partial_residuals(no_interact_fit) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid, color = x2)) +
  geom_point() + # partial residuals
  geom_smooth(se = FALSE) + # smoothed residuals
  geom_line(aes(x = .predictor_value, y = .predictor_effect, color = NULL)) + # effects
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor", y = "Partial residual")
#> Factor predictors were dropped:
#> • `x2`
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Notice the very distinct smoothing lines: the foo group appears to have a much steeper slope than the bar group. This is in fact true in the population. This may alert us that while our average slope is accurate, and the relationship appears reasonably linear, the slope may differ for the groups. We could, of course, produce a lineup to assess if the slope difference is significant, but it appears fairly large here.

We can now fit a model with the interaction term and generate partial residuals again:

interact_fit <- lm(y ~ x1 * x2, data = interact_data)

partial_residuals(interact_fit) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid, color = x2)) +
  geom_point() + # partial residuals
  geom_smooth(se = FALSE) + # smoothed residuals
  geom_line(aes(x = .predictor_value, y = .predictor_effect, color = NULL)) + # effects
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor", y = "Partial residual")
#> Factor predictors were dropped:
#> • `x2`
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

In this plot, both groups appear to have the same linear trend, and so adding the interaction appears to have been a good idea.

Note the positive slope in this plot. In the calculation of partial residuals for \(X_1\), all other predictors are set to 0 or their baseline level, so \(X_2\) is set to foo. Hence the overall slope is approximately 1.8, even for points with \(X_2 = \text{bar}\), which can be confusing.

Cook’s distances

The Cook’s distance for an observation represents how much the model fitted values would change if that observation were removed, scaled by the model’s mean squared error. A Cook’s distance of 1 is often considered a cutoff for a highly influential observation. Returning to our original model fit to nonlinear data, we can produce the Cook’s distances:

augment(fit) |>
  ggplot(aes(x = seq_along(.cooksd), y = .cooksd)) +
  geom_col() +
  labs(x = "Row index", y = "Cook's distance")

Note that the Cook’s distance measures changes in fitted values. If several regressors are collinear, a small change in an observation may change \(\hat \beta\) dramatically but not change the fitted values much. (It can also be interpreted as the Mahalanobis distance between the fitted \(\hat \beta\) with and without each observation, where the Mahalanobis distance takes into account the variance of \(\hat \beta\); collinear regressors will tend to have high variance in their parameter estimates, so while dropping a covariate may change their entries in \(\hat \beta\), this will not cause a great increase in the Mahalanobis distance.)

Residual Q-Q plots

Q-Q plots are used to see the distribution of the standardized residuals, and compare it to a normal distribution:

augment(fit) |>
  ggplot(aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Normal Q-Q plot of standardized residuals",
       x = "Theoretical quantiles", y = "Observed quantiles")

It is difficult for novices to judge normality from a Q-Q plot, so a lineup can be helpful:

model_lineup(fit) |>
  ggplot(aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(vars(.sample)) +
  labs(title = "Normal Q-Q plot of standardized residuals",
       x = "Theoretical quantiles", y = "Observed quantiles")
#> decrypt("gOh4 kDUD pB 9RtpUpRB Vi")

The true Q-Q plot does not stand out here, so we have little evidence for systematically non-normal residuals.

A case study

Let’s illustrate the use of these diagnostics on a real dataset: the Palmer Penguins data, giving measurements of 344 penguins in the Palmer Archipelago in Antarctica.

library(palmerpenguins)

Let’s consider building a model of bill length as a function of flipper length and species:

penguin_1 <- lm(bill_length_mm ~ flipper_length_mm + species,
                data = penguins)

This model allows a different intercept per species, but not a different slope. Let’s plot the partial residuals for flipper length:

partial_residuals(penguin_1, flipper_length_mm) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point(aes(color = species)) +
  geom_smooth(aes(color = species), se = FALSE) +
  geom_line(aes(y = .predictor_effect)) +
  labs(x = "Flipper length (mm)", y = "Partial residual",
       color = "Species")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Here we can see that gentoo penguins tend to have longer flippers than Adélie and chinstrap penguins. Looking at the partial residuals in each group, there appear to be varying slopes: there is a consistent upward trend in the residuals for gentoo penguins relative to the overall trend, for instance. This suggests that perhaps there is an interaction, and the slope of the relationship differs for each species.

Let’s fit a new model with an interaction.

penguin_2 <- lm(bill_depth_mm ~ flipper_length_mm * species,
                data = penguins)

Now the residuals do not seem to have separate trends in each group:

partial_residuals(penguin_2, flipper_length_mm) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point(aes(color = species)) +
  geom_smooth(aes(color = species), se = FALSE) +
  geom_line(aes(y = .predictor_effect)) +
  labs(x = "Flipper length (mm)", y = "Partial residual",
       color = "Species")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can now proceed to other diagnostics, such as checking for equal variance and normality of the residuals. Let’s make a Q-Q plot of the standardized residuals for each species from this fit:

augment(penguin_2) |>
  ggplot(aes(sample = .std.resid)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(vars(species)) +
  labs(title = "Normal Q-Q plot of standardized residuals",
       x = "Theoretical quantiles", y = "Observed quantiles")

The residuals appear reasonably normally distributed, but notice the slopes are different for each species, indicating they have different variances. The residual variance appears to be largest for Adélie penguins and smallest for gentoo penguins. If we were using this model purely for prediction, this would not be a concern; but if we intend to conduct tests or produce confidence intervals, this may be an issue.

There are no obvious major outliers either in the partial residual plots or residual Q-Q plots, so we have no reason to suspect there are highly influential observations. Nonetheless, we can check the Cook’s distances:

augment(penguin_2) |>
  ggplot(aes(x = seq_along(.cooksd), y = .cooksd)) +
  geom_col() +
  labs(x = "Row index", y = "Cook's distance")

This looks fine. Even the most influential point still has a very small Cook’s distance.