--- title: "Real Data Examples with Different Distributions" shorttitle: "Real Data Examples" author: - name: "Frédéric Bertrand" affiliation: - Cedric, Cnam, Paris email: frederic.bertrand@lecnam.net date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Real Data Examples with Different Distributions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} #file.edit(normalizePath("~/.Renviron")) LOCAL <- identical(Sys.getenv("LOCAL"), "TRUE") #LOCAL=TRUE knitr::opts_chunk$set(purl = LOCAL) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette demonstrates **SelectBoost.gamlss** on real datasets using *different* distributions. We include a **growth** example (Dutch boys) and further count/continuous/binary cases. ## What you'll learn * How to configure scopes and engines for Box–Cox t growth models, insurance claims, and quine attendance. * How to use `selection_table()` and `effect_plot()` to interpret the resulting fits. * How to adjust bootstrap settings (`B`, `pi_thr`, `parallel`) for heavier real-world analyses. > You may want to install optional engines for best results: > `install.packages(c("glmnet","grpreg","SGL","future.apply"))` ```{r} library(SelectBoost.gamlss) library(gamlss) library(MASS) # Insurance, quine set.seed(1) ``` ## 1. Growth data (BCT) — Dutch boys We model **height** vs **age** with **BCT** (Box–Cox t). Grouped penalties select among smoothers and interactions with pubertal stage. Increase `B` to 50 for real use. ```{r, cache=TRUE, eval=LOCAL} set.seed(123) # Load and CLEAN the Dutch boys growth data utils::data("boys7482", package = "SelectBoost.gamlss") # Keep only rows complete boys <- boys7482[stats::complete.cases(boys7482),] boys$gen <- as.factor(boys$gen) # (Optional) keep levels actually present boys <- droplevels(boys) # Fit SelectBoost.gamlss on CLEANED data (no parallel for vignette stability) fit_growth <- sb_gamlss( hgt ~ 1, data = boys, family = gamlss.dist::BCT(), mu_scope = ~ pb(age) + pb(age):gen, sigma_scope = ~ pb(age), nu_scope = ~ pb(age), engine = "grpreg", engine_sigma = "grpreg", engine_nu = "grpreg", grpreg_penalty = "grLasso", B = 1, pi_thr = 0.6, pre_standardize = TRUE, parallel = "none", trace = FALSE ) # Peek at selection (first rows) print(utils::head(selection_table(fit_growth), 12)) # Effect plot for age on mu if (requireNamespace("ggplot2", quietly = TRUE)) { print(effect_plot(fit_growth, "age", boys, what = "mu")) } ``` Increase `B` to 50 for real use. ```{r, cache=TRUE, eval=FALSE} set.seed(123) # Stability curves over a small c0 grid (still on CLEANED data) g <- sb_gamlss_c0_grid( hgt ~ 1, data = boys, family = gamlss.dist::BCT(), mu_scope = ~ pb(age) + pb(age):gen, sigma_scope = ~ pb(age), nu_scope = ~ pb(age), c0_grid = seq(0.2, 0.8, by = 0.1), B = 1, pi_thr = 0.6, pre_standardize = TRUE, parallel = "none", trace = FALSE ) plot_stability_curves(g, terms = c("pb(age)", "pb(age):gen"), parameter = "mu") ``` **Interpretation.** Expect `pb(age)` to be stably selected for μ and σ; interactions like `pb(age):gen` often appear around puberty. --- ## 2. Count data (PO) — Insurance claims Fit **Poisson** with log-offset for exposure. ```{r, cache=TRUE, eval=LOCAL} data(Insurance, package = "MASS") ins <- transform(Insurance, logH = log(Holders)) fit_po <- sb_gamlss( Claims ~ offset(logH), data = ins, family = gamlss.dist::PO(), mu_scope = ~ Age + District + Group, engine = "glmnet", glmnet_alpha = 1, # lasso B = 100, pi_thr = 0.6, pre_standardize = FALSE, parallel = "auto", trace = FALSE ) selection_table(fit_po) ``` --- ## 3. Positive continuous (GA) — Old Faithful Use **Gamma** for positive waiting times vs eruption length. ```{r, cache=TRUE, eval=LOCAL} data(faithful) faith <- transform(faithful, eru2 = eruptions^2) fit_ga <- sb_gamlss( waiting ~ 1, data = faith, family = gamlss.dist::GA(), mu_scope = ~ pb(eruptions) + eru2, sigma_scope = ~ pb(eruptions), engine = "glmnet", glmnet_alpha = 0.5, # elastic-net B = 60, pi_thr = 0.6, pre_standardize = TRUE, parallel = "auto", trace = FALSE ) selection_table(fit_ga) ``` ```{r, eval=LOCAL} # Effect plot for eruptions on mu if (requireNamespace('ggplot2', quietly = TRUE)) { print(effect_plot(fit_ga, 'eruptions', faith, what = 'mu')) } ``` --- ## 4. Binary (BI) — `mtcars` transmission Treat `am` (0/1) as Bernoulli. ```{r, cache=TRUE, eval=LOCAL} mt <- transform(mtcars, am = as.integer(am), cyl = factor(cyl), gear = factor(gear), carb = factor(carb), vs = factor(vs)) fit_bin <- sb_gamlss( am ~ 1, data = mt, family = gamlss.dist::BI(), mu_scope = ~ wt + hp + qsec + cyl + gear + carb + vs, engine = "grpreg", grpreg_penalty = "grLasso", B = 80, pi_thr = 0.6, pre_standardize = TRUE, parallel = "auto", trace = FALSE ) selection_table(fit_bin) ``` --- ## 5. Overdispersed counts (NBII) — `quine` absences **Negative Binomial II** handles extra-Poisson variation. ```{r, cache=TRUE, eval=LOCAL} data(quine, package = "MASS") fit_nb2 <- sb_gamlss( Days ~ 1, data = quine, family = gamlss.dist::NBII(), mu_scope = ~ Eth + Sex + Lrn + Age, engine = "grpreg", grpreg_penalty = "grLasso", B = 80, pi_thr = 0.6, pre_standardize = FALSE, parallel = "auto", trace = FALSE ) selection_table(fit_nb2) ``` --- ### Tips - If you have many factor levels or smoothers, prefer `engine="grpreg"`/`"sgl"` (group-aware). - For very wide numeric designs, try `engine="glmnet"` with `glmnet_alpha` in \[0,1\]. - Use `pre_standardize=TRUE` (fast C++ scaler) when predictors are numeric and on different scales. - For speed, enable parallel workers: `future::plan(multisession, workers = 4); parallel="auto"`.