--- title: "Validating predictInterval() against brms" author: "Jared Knowles" date: "Written 2026-05-29 (last updated 2026-05-29)" output: rmarkdown::html_vignette vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Validating predictInterval() against brms} %\VignetteEncoding{UTF-8} --- ## Why compare to brms? `merTools::predictInterval()` was written to put prediction intervals on mixed-effect models that are too large or too slow to bootstrap. It does this by simulating from the estimated distribution of the fixed and random effects -- fast, but an approximation. Since then, [brms](https://paulbuerkner.com/brms/) has become a standard for *fully* Bayesian multilevel modeling, and its posterior predictive distribution is about as close to a gold standard as we have. This vignette asks a simple question: **how close are `predictInterval()`'s intervals to the ones brms produces?** We look at three things: a Gaussian random-slopes model, what happens for an entirely unobserved group, and a binomial GLMM. This vignette is *precompiled* -- reproducing it requires the `brms` package and a working Stan toolchain. ## A random-slopes model We use the bundled `hsb` data (7,185 students in 160 schools) and model math achievement from student SES, school-mean SES, and a school-varying SES slope. We hold out 20% of students for testing, plus six whole schools to stand in for "new groups" later. ``` r data(hsb) hsb$schid <- factor(hsb$schid) new_schools <- sample(levels(hsb$schid), 6) hsb$.set <- "train" hsb$.set[hsb$schid %in% new_schools] <- "test_new" seen <- which(hsb$.set == "train") hsb$.set[sample(seen, round(0.2 * length(seen)))] <- "test_seen" train <- droplevels(hsb[hsb$.set == "train", ]) test_seen <- hsb[hsb$.set == "test_seen" & hsb$schid %in% levels(train$schid), ] test_new <- hsb[hsb$.set == "test_new", ] f1 <- mathach ~ ses + meanses + (ses | schid) m_lme <- lmer(f1, data = train) brms_fit_seconds <- system.time( m_brm <- fit_brm(f1, train, gaussian(), "brms_hsb_meanses"))[["elapsed"]] ``` We draw a predictive distribution for each held-out student from each method -- `predictInterval(returnSims = TRUE)` and `brms::posterior_predict()` -- and compute everything (point estimates, intervals, coverage) from those draws. ``` r mt_seen <- mt_draws(m_lme, test_seen) pp_seen <- posterior_predict(m_brm, newdata = test_seen, allow_new_levels = TRUE) ``` ### Point estimates are essentially identical Both methods condition on the same estimated fixed effects and school BLUPs, so the conditional-mean predictions should agree almost exactly -- and they do. ``` r lme_mean <- predict(m_lme, newdata = test_seen) brm_mean <- colMeans(posterior_epred(m_brm, newdata = test_seen, allow_new_levels = TRUE)) c(correlation = cor(lme_mean, brm_mean), mean_abs_diff = mean(abs(lme_mean - brm_mean)), response_sd = sd(test_seen$mathach)) ``` ``` #> correlation mean_abs_diff response_sd #> 0.99984045 0.03670134 6.71745425 ``` ``` r ggplot(data.frame(merTools = lme_mean, brms = brm_mean), aes(merTools, brms)) + geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") + geom_point(alpha = 0.3, size = 0.8, color = "#1B9E77") + coord_equal() + labs(x = "lme4 predict()", y = "brms posterior_epred()") + theme_minimal(base_size = 12) ```
### The prediction intervals agree ``` r b_mt <- qband(mt_seen, 0.90); b_brm <- qband(pp_seen, 0.90) c(width_merTools = median(b_mt$upr - b_mt$lwr), width_brms = median(b_brm$upr - b_brm$lwr), cor_lower = cor(b_mt$lwr, b_brm$lwr), cor_upper = cor(b_mt$upr, b_brm$upr)) ``` ``` #> width_merTools width_brms cor_lower cor_upper #> 20.1789145 20.2344075 0.9901934 0.9899334 ``` ``` r endpts <- rbind(data.frame(bound = "lower", merTools = b_mt$lwr, brms = b_brm$lwr), data.frame(bound = "upper", merTools = b_mt$upr, brms = b_brm$upr)) ggplot(endpts, aes(merTools, brms, color = bound)) + geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") + geom_point(alpha = 0.3, size = 0.8) + coord_equal() + labs(title = "90% prediction-interval endpoints", x = "merTools", y = "brms") + theme_minimal(base_size = 12) ```
### And they are equally well calibrated The real test is out-of-sample: do nominal intervals cover held-out scores at the nominal rate? Both methods track the diagonal, and each other. ``` r data.frame(nominal = NOMINAL, merTools = coverage(mt_seen, test_seen$mathach, NOMINAL), brms = coverage(pp_seen, test_seen$mathach, NOMINAL)) ``` ``` #> nominal merTools brms #> 1 0.50 0.4569776 0.4598698 #> 2 0.80 0.7895879 0.7932032 #> 3 0.90 0.9168474 0.9168474 #> 4 0.95 0.9660159 0.9703543 ``` ### At a fraction of the cost ``` r t_lme <- system.time(lmer(f1, data = train))[["elapsed"]] t_mt <- system.time(mt_draws(m_lme, test_seen))[["elapsed"]] t_pp <- system.time(posterior_predict(m_brm, newdata = test_seen, allow_new_levels = TRUE))[["elapsed"]] data.frame( # brms_fit_seconds was captured when the model was fit, above step = c("lme4 fit", "predictInterval", "brms fit (compile+sample)", "posterior_predict"), seconds = round(c(t_lme, t_mt, brms_fit_seconds, t_pp), 2)) ``` ``` #> step seconds #> 1 lme4 fit 0.05 #> 2 predictInterval 0.53 #> 3 brms fit (compile+sample) 398.97 #> 4 posterior_predict 1.52 ``` For this model, the entire lme4 + `predictInterval()` path runs in about a second, while a single brms fit takes several minutes -- a couple of orders of magnitude difference, before the (much larger) models `predictInterval()` was really built for. ## What about an entirely new group? For a school that was never seen during fitting there is a genuine modeling choice: do you *drop* the school's effect and predict from the fixed effects only, or *sample* a plausible effect for it from the estimated school-level distribution? Both packages support either, and the comparison is only fair when the choices match: | treatment of an unseen group | `predictInterval()` | brms | |---|---|---| | drop the effect (population prediction) | `new.levels = "zero"` *(default)* | `re_formula = NA` | | sample a new effect | `new.levels = "draw"` | `allow_new_levels = TRUE` *(default)* | Matched up, the two tools agree on both the width and the coverage of the new-school intervals: ``` r nd <- test_new pp_pop <- posterior_predict(m_brm, newdata = nd, re_formula = NA) pp_draw <- posterior_predict(m_brm, newdata = nd, allow_new_levels = TRUE) mt_zero <- mt_draws(m_lme, nd, new.levels = "zero") mt_draw <- mt_draws(m_lme, nd, new.levels = "draw") w <- function(d) median(qband(d, 0.90)$upr - qband(d, 0.90)$lwr) data.frame( unseen_group = c("drop effect", "sample effect"), merTools_width = c(w(mt_zero), w(mt_draw)), brms_width = c(w(pp_pop), w(pp_draw)), merTools_cov90 = c(coverage(mt_zero, nd$mathach, 0.90), coverage(mt_draw, nd$mathach, 0.90)), brms_cov90 = c(coverage(pp_pop, nd$mathach, 0.90), coverage(pp_draw, nd$mathach, 0.90))) ``` ``` #> unseen_group merTools_width brms_width merTools_cov90 brms_cov90 #> 1 drop effect 19.96801 20.02855 0.8708487 0.8745387 #> 2 sample effect 20.67910 20.76924 0.8892989 0.8856089 ``` So the interval width for an unseen group is set by the *choice*, not the method. Earlier I compared `predictInterval()`'s default against brms's default, which is an apples-to-oranges pairing -- the apparent "gap" was the two packages making different default assumptions, not a deficiency in either. How much the choice matters depends on how much group variance is left after the fixed effects. Here `meanses` already absorbs most of the between-school variation, so dropping vs. sampling the school effect barely changes the width. Remove `meanses` and the school SD roughly doubles, so the two options diverge: ``` r m0 <- lmer(mathach ~ ses + (ses | schid), data = train) data.frame( model = c("with meanses", "without meanses"), school_SD = c(attr(VarCorr(m_lme)$schid, "stddev")[["(Intercept)"]], attr(VarCorr(m0)$schid, "stddev")[["(Intercept)"]]), width_zero = c(w(mt_draws(m_lme, nd, new.levels = "zero")), w(mt_draws(m0, nd, new.levels = "zero"))), width_draw = c(w(mt_draws(m_lme, nd, new.levels = "draw")), w(mt_draws(m0, nd, new.levels = "draw")))) ``` ``` #> model school_SD width_zero width_draw #> 1 with meanses 1.598503 19.96801 20.67910 #> 2 without meanses 2.125770 19.94993 21.22576 ``` The practical guidance: `predictInterval()`'s default (`new.levels = "zero"`) answers "what is the expected outcome for a typical member of an unseen group," while `new.levels = "draw"` answers "what is the outcome for an unseen group drawn from the population" -- the same quantity brms returns by default. Pick the one that matches your question. ## A generalized linear mixed model The story carries over to GLMMs. We fit a binomial model for whether grouse had ticks and compare predicted probabilities -- `predictInterval(type = "probability")` against `brms::posterior_epred()`. ``` r data(grouseticks) grouseticks$TICKS_BIN <- as.integer(grouseticks$TICKS >= 1) grouseticks$cHEIGHT <- as.numeric(scale(grouseticks$HEIGHT)) gk <- grouseticks gk$.set <- ifelse(runif(nrow(gk)) < 0.2, "test", "train") gk_tr <- droplevels(gk[gk$.set == "train", ]) gk_te <- gk[gk$.set == "test" & gk$BROOD %in% levels(gk_tr$BROOD) & gk$LOCATION %in% unique(gk_tr$LOCATION), ] gf <- TICKS_BIN ~ YEAR + cHEIGHT + (1 | BROOD) + (1 | LOCATION) g_lme <- glmer(gf, family = binomial, data = gk_tr, control = glmerControl(optimizer = "bobyqa")) g_brm <- fit_brm(gf, gk_tr, bernoulli(), "brms_grouseticks") mt_pr <- mt_draws(g_lme, gk_te, type = "probability", include.resid.var = FALSE) if (max(mt_pr) > 1 || min(mt_pr) < 0) mt_pr <- plogis(mt_pr) brm_pr <- posterior_epred(g_brm, newdata = gk_te, allow_new_levels = TRUE) mt_p <- apply(mt_pr, 2, median); brm_p <- apply(brm_pr, 2, median) c(correlation = cor(mt_p, brm_p), mean_abs_diff = mean(abs(mt_p - brm_p))) ``` ``` #> correlation mean_abs_diff #> 0.99488587 0.02704796 ``` ``` r ggplot(data.frame(merTools = mt_p, brms = brm_p), aes(merTools, brms)) + geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") + geom_point(alpha = 0.4, color = "#7570B3") + coord_equal(xlim = c(0, 1), ylim = c(0, 1)) + labs(title = "Predicted P(tick): merTools vs brms", x = "predictInterval(type = 'probability')", y = "posterior_epred()") + theme_minimal(base_size = 12) ```
## Takeaways - For **observed groups**, `predictInterval()` reproduces brms's point estimates and prediction intervals almost exactly, with the same out-of-sample coverage, at a tiny fraction of the computational cost. - For **new groups**, you choose how to treat the unseen effect: `new.levels = "zero"` (the default) drops it and predicts from the fixed effects, while `new.levels = "draw"` samples it from the estimated covariance -- matching brms's `re_formula = NA` and `allow_new_levels = TRUE` respectively. Either way the two packages agree; pick the convention that fits your question. - The same agreement holds for **GLMMs** on the probability scale. In short, when bootstrapping or full MCMC is impractical, `predictInterval()` is a fast, well-calibrated stand-in for the cases it was designed to handle.