---
title: "Choosing a transformation: Box-Cox in practice"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Choosing a transformation: Box-Cox in practice}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 7, fig.height = 4.2)
```

```{r setup, message = FALSE}
library(shewhartr)
library(ggplot2)
```

Box & Cox (1964) introduced a one-parameter family of power
transformations for positive data $x$,

$$
y(\lambda) = \begin{cases}
  (x^\lambda - 1)/\lambda & \lambda \neq 0 \\
  \log(x) & \lambda = 0,
\end{cases}
$$

and a procedure for choosing $\lambda$ by maximum likelihood. The goal is
to find a scale on which the residuals are approximately normal and
homoscedastic, which are the assumptions that classical inferential tools,
including Shewhart charts, presuppose.

`shewhart_box_cox()` returns the profile log-likelihood, the maximiser
$\hat \lambda$, and a 95% confidence interval based on the chi-square
approximation to twice the log-likelihood drop.

## A textbook example

```{r}
set.seed(2025)
y <- rlnorm(200, meanlog = 0, sdlog = 0.5)  # log-normal -> lambda = 0
bc <- shewhart_box_cox(y)
bc
```

The estimated $\hat \lambda$ should be near zero (log transformation), and
the 95% CI should cover zero. Let's plot the profile:

```{r, eval = FALSE}
autoplot(bc)
```

If the CI for $\lambda$ contains 1, no transformation is needed (the data
are approximately normal as is). If it contains 0, take logs. If it
contains 0.5, take square roots, and so on.
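
To make the procedure concrete, the following base-R sketch computes the profile log-likelihood on a grid of $\lambda$ values and reads the 95% interval off the chi-square cutoff. It assumes an i.i.d. normal model with constant mean on the transformed scale. `box_cox()` and `profile_loglik()` are hypothetical helper names used only for illustration; they are not the internals of `shewhart_box_cox()`, whose implementation may differ.

```r
# Box-Cox transform for strictly positive x (log at lambda = 0).
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood, up to an additive constant, assuming the
# transformed values are i.i.d. normal with a constant mean.
profile_loglik <- function(lambda, x) {
  n <- length(x)
  z <- box_cox(x, lambda)
  sigma2 <- mean((z - mean(z))^2)            # ML variance estimate
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(x))  # last term: Jacobian
}

set.seed(2025)
x <- rlnorm(200, meanlog = 0, sdlog = 0.5)

# Evaluate the profile on a grid of candidate lambdas.
grid <- seq(-2, 2, by = 0.01)
ll <- vapply(grid, profile_loglik, numeric(1), x = x)

lambda_hat <- grid[which.max(ll)]

# 95% CI: every lambda whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
cutoff <- max(ll) - qchisq(0.95, df = 1) / 2
ci <- range(grid[ll >= cutoff])

lambda_hat
ci
```

The grid step bounds the precision of both the estimate and the interval endpoints; a finer grid or a one-dimensional optimiser such as `optimize()` would tighten them at little cost.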

## Interaction with `shewhart_regression(model = "auto")`

The `"auto"` model in `shewhart_regression()` calls `shewhart_box_cox()`
internally on the response (with a +1 shift to keep zeros valid) and
selects among `linear`, `log`, `loglog` according to the value of
$\hat \lambda$:

* $|\hat \lambda - 1| \le 0.1$ → `linear`
* $|\hat \lambda - 0| \le 0.1$ → `log`
* $|\hat \lambda - 0.5| \le 0.1$ → `loglog`
* otherwise default to `linear` with a warning

This is a guidance step, not a guarantee. Always inspect the residual
diagnostics afterwards via `shewhart_diagnostics()`.

## When *not* to transform

If your data are counts, proportions, times-to-event, or other
quantities with a known parametric family, model that family explicitly.
Box was clear about this: if you can model the right distribution, do so.
Transforms exist for the case where the right distribution isn't
tractable and a normal approximation on a suitably-chosen scale is the
best available compromise.

The c, u, p, and np charts in this package implement that advice: they
support `limits = "poisson"` (or `"binomial"`) for exact
distribution-aware limits, instead of relying on a transformation to
coerce counts into approximate normality.

## References

- Box, G. E. P., & Cox, D. R. (1964). An Analysis of Transformations.
  *Journal of the Royal Statistical Society B*, 26(2), 211-252.
- Atkinson, A. C. (1985). *Plots, Transformations and Regression*.
  Oxford.
- Box, G. E. P., Hunter, W. G., & Hunter, J. S. (2005). *Statistics for
  Experimenters* (2nd ed.). Wiley.
- Sakia, R. M. (1992). The Box-Cox Transformation Technique: A Review.
  *The Statistician*, 41(2), 169-178.