--- title: "Use cases: stats and residuals" output: rmarkdown::html_vignette: css: vignette.css vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{Use cases: stats and residuals} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, fig.width = 6, fig.height = 4, dpi = 96, warning = FALSE, message = FALSE ) ``` ```{r viz-setup, include=FALSE} library(ggplot2) library(ggerror) set_theme( theme_minimal(base_size = 13) + theme( plot.title = element_text( family = "mono", size = 11, hjust = 0), plot.caption = element_text(face = 'italic'), legend.position = 'none' ) ) ``` This vignette demonstrates more advanced usage of the package. For a getting-started guide and overview of this package's basic use case, see `vignette("ggerror")`. ## Setup We use `airquality` — Daily air quality measurements in New York, May to September 1973. This dataset has 153 rows with 6 columns: 4 continuous measurements, Month and Day. ```{r data-prep} data("airquality"); airq <- airquality # It wouldn't be an R workflow without minimal data cleaning... day_in_month <- function(day_in_month, month, year) { days_abbr <- format(as.Date(sprintf("%d-%02d-%02d", year, month, day_in_month)), "%a") factor(days_abbr, levels = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"), ordered = TRUE) } airq$Day <- day_in_month(airq$Day, airq$Month, 1973) airq$Month <- factor(airq$Month, labels = month.abb[5:9]) ```
More on airquality (click to expand, or type ?airquality) ::: {.alert .alert-info} `airquality` contains daily New York air-quality measurements from May to September 1973. The main continuous variables are `Ozone`, `Solar.R`, `Wind`, and `Temp`; `Month` and `Day` identify when each measurement was taken. :::
Let's use it to show two use cases: plotting a sign-aware residual plot and summarizing raw data with `stat_error()`. ## Residual plot with `sign_aware` `sign_aware = TRUE` interprets the sign of `error` as direction: positive values extend in the positive axis direction, negative in the negative. ### Let's fit a linear model and see how `Temp` can be explained by `Wind`. ```{r fit-model} model <- lm(Temp ~ Wind, data = airq) airq_fit <- cbind(airquality[,c('Temp','Wind')], predicted = predict(model), residuals = resid(model) ) ``` Let us complement the usual ggplot2 workflow for plotting residuals versus true values: ```{r residuals-plot, fig.height = 4.5} ggplot(airq_fit, aes(x = Wind, y = predicted)) + geom_line(linewidth = 0.4, colour = "grey40") + geom_point(aes(y = Temp), alpha = 0.5) + geom_error(aes(error = residuals), sign_aware = TRUE, orientation = "x", color_pos = "firebrick", color_neg = "steelblue", linewidth = 0.5, alpha = 0.85) + labs(title = paste( "[1] residuals = resid(lm(Temp ~ Wind))", "[2] geom_error(error = residuals, sign_aware = TRUE)", sep = "\n" )) ``` - **Flexible styling**: `ggerror` provides us with per-direction styling (`color`) as well as global styles (`alpha`, `linewidth`, etc.). How to interpret the plot is up to you. You may want to color the direction differently, to show the direction the bar extends, rather than the sign of the error. Experiment with this by tweaking either the actual color value, or the `neg/pos` suffix. - **`orientation = "x"`** Because both axes are numeric, `ggerror` can't infer the direction the bar should extend, so we pass `orientation = "x"` (which extends the bar along the y axis) explicitly. When one axis is discrete, orientation is inferred. - **NA values vs 0:** An **`NA` value** draws nothing, and triggers a row-indexed warning. As in `ggplot2`'s ecosystem, we would usually exclude invalid (missing or infinite) values before plotting. 0 is treated as any other value, and "draws" a bar of length zero (basically, nothing). Using `NA` comes in handy when we want to explicitly draw one-sided bars (see `vignette("ggerror")`). ## Summarizing raw data with `stat_error()` `stat_error()` follows ggplot2's `fun.data` logic: pass a summary function and it computes `y`, `ymin`, `ymax` per group. Regardless of the orientation of your plot, the summary function **must return a data frame with these exact column names**. The default summary function is `mean_se` (mean ± one standard error): ```{r stat-default} ggplot(airq, aes(Month, Temp)) + stat_error(width = 0.2) + stat_summary(geom='point') + labs(title = "stat_error() with default = mean_se") ``` `ggerror` ships two summary functions. You just saw the default, `mean_se`. The second is `mean_ci` which computes the mean and 95% confidence interval error bars. ```{r stat-ci} ggplot(airq, aes(Month, Temp)) + stat_error(fun = "mean_ci", error_geom = "pointrange") + labs(title = 'stat_error(fun = "mean_ci", error_geom = "pointrange")') ``` The `conf.int` argument controls the CI width — `0.95` by default, but any valid probability in `(0, 1)` works: ```{r stat-conf} ggplot(airq, aes(Month, Temp)) + stat_error(fun = "mean_ci", conf.int = 0.99, error_geom = "linerange") + stat_summary(geom="point") + labs(title = 'stat_error(fun = "mean_ci", conf.int = 0.99)') ``` > **Note:** I combine multiple summary geoms. `ggerror` is compatible > with `ggplot2` (core) and its ecosystem. ### Custom summary function: median ± mean absolute deviation You shall not be limited by my imagination. Create your own custom summary functions, as long as they obey the rules I mentioned earlier (returned object must be a data frame with three columns: `y`, `ymin`, `ymax` and one numeric row). R's `mad()` computes the median absolute deviation from the median. Let us use the **mean** absolute deviation (from the median), and call it `mae` (`e` for error, to avoid naming conflicts). **median** as the center, **mean absolute deviation from the median** as the spread: $$\mathrm{mae}(x) = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \mathrm{md}(x)\right|$$ ```{r stat-custom} mae_summary <- function(my_vec, scale_by = 1) { md <- median(my_vec) mae <- mean(abs(my_vec - md)) * scale_by data.frame(y = md, ymin = md - mae, ymax = md + mae) } ggplot(airq, aes(Month, Temp)) + stat_error(fun = mae_summary, error_geom = "crossbar", fill = "grey90") + labs(title = "stat_error(fun = median ± mean(|x − median|))") ``` Custom functions can take extra parameters too — `stat_error()` forwards any matching named arguments in the same way you are used to using the `...` argument. In this example, to multiply the error by 2, you'd write `stat_error(fun = mae_summary, scale_by = 2)`.
Passing `scale_by = 2` to `stat_error` ```{r aditional-code-collapsed} ggplot(airq, aes(Month, Wind)) + stat_error(fun = mae_summary, scale_by = 2, error_geom = "crossbar") + stat_summary(geom = "point") ```
> **Aside** — `geom_error(stat = "error", fun = ...)` is the layer-level > equivalent of `stat_error()`, which by default uses > `stat = "identity"`.