--- title: "miceFast - Introduction" author: "Maciej Nasinski" date: "`r Sys.Date()`" output: html_document: toc: true vignette: > %\VignetteIndexEntry{miceFast - Introduction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` # Overview This vignette introduces the **miceFast** package, which provides fast imputations in an object-oriented programming style. The underlying methods rely on quantitative models with closed-form solutions implemented via linear algebra operations. Additional utility functions are provided for easy integration with popular R packages such as **dplyr** and **data.table**. ## Installation and Setup Before you begin, ensure the **miceFast** package is installed. You may also want to install a high-performance BLAS library for significant speedups. ### BLAS **Recommended: Install an optimized BLAS library (potentially up to 100x faster):** - **Linux**: `sudo apt-get install libopenblas-dev` - **macOS (Accelerate/vecLib BLAS)**: ```bash cd /Library/Frameworks/R.framework/Resources/lib ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib ``` ## Loading the Package and Setting a Seed ```{r, echo=TRUE, message=FALSE, warning=FALSE} pkgs <- c("miceFast", "mice", "ggplot2", "dplyr", "data.table") inst <- lapply(pkgs, library, character.only = TRUE) set.seed(123456) ``` ## Description The **miceFast** package accelerates imputations for quantitative models (with closed-form solutions) by using efficient linear algebra routines. It offers substantial speed improvements in: - **Linear Discriminant Analysis** (up to 5x faster), - **Grouping-based calculations** (often 10x or more, depending on data size, number of groups, etc.), - **Multiple imputations** (faster by approximately the number of imputations, since the core model is computed only once), - **Variance Inflation Factor (VIF)** computations (around 5x faster by skipping redundant steps), - **Predictive Mean Matching (PMM)** (around 3x faster thanks to pre-sorting and binary search). > **Note:** For details on the performance-testing procedure, see the file `performance_validity.R` in the `extdata` folder. Additional simulation plots (which you are free to modify) are located in `extdata/images`. > _Environment: R 4.2 on Mac M1._ ## Motivations Missing data is a common issue in data analysis. One approach is to delete any row containing missing observations, though this can degrade the quality or representativeness of your dataset. Another, often better, approach is to use imputation methods (such as multiple or regular imputations) to fill in missing data, leveraging observed independent variables. Since R and Python are user-friendly but can be slower for heavy computation, **miceFast** uses efficient C++/RcppArmadillo routines under the hood to speed up the imputation process. --- # Introduction for dplyr/data.table Users The **miceFast** package provides additional functions (`fill_NA` and `fill_NA_N`) that are designed to be stable in the presence of user-level mistakes or complex data structures. Below are minimal examples using **dplyr** and **data.table**. ## Example Data ```{r, echo=TRUE} data(air_miss) # airquality with additional variables ``` ### Examining Missingness ```{r, echo=TRUE} upset_NA(air_miss, 6) ``` ## Using dplyr Below we show how to: 1. Calculate VIF. 2. Perform various imputations with and without grouping. 3. Safely handle collinearity or small group sizes by wrapping calls in `tryCatch`. ```{r, echo=TRUE} # VIF (Variance Inflation Factor) > ~10 suggests collinearity problems. VIF( air_miss, posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp", "x_character", "Day", "weights", "groups") ) ``` ```{r, echo=TRUE} air_miss <- air_miss %>% # Impute with grouping and weights group_by(groups) %>% do(mutate(., Solar_R_imp = fill_NA( x = ., model = "lm_pred", posit_y = "Solar.R", posit_x = c("Wind", "Temp", "Intercept"), w = .[["weights"]] ) )) %>% ungroup() %>% # Impute a discrete variable mutate(x_character_imp = fill_NA( x = ., model = "lda", posit_y = "x_character", posit_x = c("Wind", "Temp") )) %>% # Log regression because Ozone is roughly log-normal mutate(Ozone_imp1 = fill_NA( x = ., model = "lm_bayes", posit_y = "Ozone", posit_x = c("Intercept"), logreg = TRUE )) %>% # Impute by position indices mutate(Ozone_imp2 = fill_NA( x = ., model = "lm_bayes", posit_y = 1, posit_x = c(4, 6), logreg = TRUE )) %>% # Imputations (mean of 30 draws) mutate(Ozone_imp3 = fill_NA_N( x = ., model = "lm_noise", posit_y = "Ozone", posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"), w = .[["weights"]], logreg = TRUE, k = 30 )) %>% mutate(Ozone_imp4 = fill_NA_N( x = ., model = "lm_bayes", posit_y = "Ozone", posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"), w = .[["weights"]], logreg = TRUE, k = 30 )) %>% group_by(groups) %>% # Single evaluation model do(mutate(., Ozone_imp5 = fill_NA( x = ., model = "lm_pred", posit_y = "Ozone", posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"), w = .[["weights"]], logreg = TRUE ) )) %>% do(mutate(., Ozone_imp6 = fill_NA_N( x = ., model = "pmm", posit_y = "Ozone", posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"), w = .[["weights"]], logreg = TRUE, k = 20 ) )) %>% ungroup() %>% # Combine imputations mutate(Ozone_imp_mix = rowMeans(select(., starts_with("Ozone_imp")))) %>% # Protect against errors in small groups group_by(groups) %>% do(mutate(., Ozone_chac_imp = tryCatch( fill_NA( x = ., model = "lda", posit_y = "Ozone_chac", posit_x = c("Intercept", "Month", "Day", "Temp", "x_character_imp"), w = .[["weights"]] ), error = function(e) .[["Ozone_chac"]] ) )) %>% ungroup() ``` ### Visualizing Results ```{r, echo=TRUE} # Compare original vs. imputed distributions compare_imp(air_miss, origin = "Ozone", target = "Ozone_imp_mix") # Or compare multiple imputation columns compare_imp(air_miss, origin = "Ozone", target = c("Ozone_imp2", "Ozone_imp_mix")) ``` ## Using data.table Below is a similar approach using **data.table** syntax. ```{r, echo=TRUE} data(air_miss) setDT(air_miss) ``` ```{r, echo=TRUE} # Imputations air_miss[, Solar_R_imp := fill_NA_N( x = .SD, model = "lm_bayes", posit_y = "Solar.R", posit_x = c("Wind","Temp","Intercept"), w = .SD[["weights"]], k = 100 ), by = .(groups)] %>% .[, x_character_imp := fill_NA( x = .SD, model = "lda", posit_y = "x_character", posit_x = c("Wind","Temp","groups") )] %>% .[, Ozone_imp1 := fill_NA( x = .SD, model = "lm_bayes", posit_y = "Ozone", posit_x = c("Intercept"), logreg = TRUE )] %>% .[, Ozone_imp2 := fill_NA( x = .SD, model = "lm_bayes", posit_y = 1, posit_x = c(4,6), logreg = TRUE )] %>% .[, Ozone_imp3 := fill_NA_N( x = .SD, model = "lm_noise", posit_y = "Ozone", posit_x = c("Intercept","x_character_imp","Wind","Temp"), w = .SD[["weights"]], logreg = TRUE, k = 30 )] %>% .[, Ozone_imp4 := fill_NA_N( x = .SD, model = "lm_bayes", posit_y = "Ozone", posit_x = c("Intercept","x_character_imp","Wind","Temp"), w = .SD[["weights"]], logreg = TRUE, k = 30 )] %>% .[, Ozone_imp5 := fill_NA( x = .SD, model = "lm_pred", posit_y = "Ozone", posit_x = c("Intercept","x_character_imp","Wind","Temp"), w = .SD[["weights"]], logreg = TRUE ), by = .(groups)] %>% .[, Ozone_imp6 := fill_NA_N( x = .SD, model = "pmm", posit_y = "Ozone", posit_x = c("Intercept","x_character_imp","Wind","Temp"), w = .SD[["weights"]], logreg = TRUE, k = 10 ), by = .(groups)] %>% # Average across a set of methods .[, Ozone_imp_mix := apply(.SD, 1, mean), .SDcols = Ozone_imp1:Ozone_imp6] %>% # tryCatch for small group issues or collinearity .[, Ozone_chac_imp := tryCatch( fill_NA( x = .SD, model = "lda", posit_y = "Ozone_chac", posit_x = c("Intercept","Month","Day","Temp","x_character_imp"), w = .SD[["weights"]] ), error = function(e) .SD[["Ozone_chac"]] ), by = .(groups)] ``` --- # Generating Data with the corrData Module You can generate correlated data via `corrData`: ```r # Constructors: new(corrData, nr_cat, n_obs, means, cor_matrix) new(corrData, n_obs, means, cor_matrix) # Some relevant methods: cd_obj$fill("type") # "contin", "binom", or "discrete" ``` - `nr_cat`: number of categories for a discrete dependent variable - `n_obs`: number of observations - `means`: centers for independent variables - `cor_matrix`: a positive-definite correlation matrix Example usage: ```{r, eval=FALSE} cd <- new(corrData, nr_cat = 3, n_obs = 1000, means = c(0,1,2), cor_matrix = some_matrix) generated_data <- cd$fill("discrete") ``` --- # Imputing Data with the miceFast Module The **miceFast** module performs fast imputations via various built-in models. Below is a summary of its main methods: 1. **`set_data(x)`**: Provide the data as a numeric matrix by reference. 2. **`set_g(g)`**: Provide a grouping variable (numeric vector, must be positive, no NAs). 3. **`set_w(w)`**: Provide a weighting variable (numeric vector, must be positive, no NAs). 4. **`impute(model, posit_y, posit_x)`**: Single imputation according to the chosen model. 5. **`impute_N(model, posit_y, posit_x, k)`**: Multiple imputations (averaged result) for certain models. 6. **`update_var(posit_y, imputations)`**: Permanently update the object’s data for the specified column with new imputations. 7. **`vifs(posit_y, posit_x)`**: Compute Variance Inflation Factors. 8. **`sort_byg()`** / **`is_sorted_byg()`**: Sort the data by the grouping variable (or check if sorted). 9. **`which_updated()`**: Check which variables have been permanently updated in the object. Supported models: - `"lda"`: Linear discriminant analysis (for categorical `posit_y`). - `"lm_pred"`: Standard linear model prediction (including a mean-imputation approach if you only include an intercept). - `"lm_bayes"`: Linear model with Bayesian noise. - `"lm_noise"`: Similar to `"lm_bayes"` but with different noise assumptions. - `"pmm"`: Predictive Mean Matching (used via `impute_N`). > *Note:* For purely mean imputation, add an intercept column and use `"lm_pred"`. The LDA model requires at least 15 complete observations. Linear models require at least as many complete observations as the number of predictors. ## Examples ### Simple Usage ```{r, echo=TRUE} data <- cbind(as.matrix(mice::nhanes), intercept = 1, index = 1:nrow(mice::nhanes)) model <- new(miceFast) model$set_data(data) # Single imputation, linear model imp_result <- model$impute("lm_pred", posit_y = 2, posit_x = 5) model$update_var(2, imp_result$imputations) # LDA imputation for a categorical variable model$update_var(3, model$impute("lda", 3, c(1,2))$imputations) # Multiple imputation example multi_imp <- model$impute_N("lm_bayes", 4, c(1,2,3), k = 10) model$update_var(4, multi_imp$imputations) # Check which variables were updated model$which_updated() # Always be cautious with 'update_var' # since it permanently alters the object and your data matrix. ``` ### Example with Weights and Groups (Pre-sorted) ```{r, echo=TRUE} data <- cbind(as.matrix(airquality[,-5]), intercept = 1, index = 1:nrow(airquality)) weights <- rgamma(nrow(data), 3, 3) groups <- as.numeric(airquality[,5]) model <- new(miceFast) model$set_data(data) model$set_w(weights) model$set_g(groups) # Impute with group-wise weighting model$update_var(1, model$impute("lm_pred", 1, 6)$imputations) model$update_var(2, model$impute_N("lm_bayes", 2, c(1,3,4,5,6), k = 10)$imputations) # Inspect updated data, returning to original order via model$get_index() head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4) ``` ### Example with Unsorted Groups ```{r, echo=TRUE} data <- cbind(as.matrix(airquality[, -5]), intercept = 1, index = 1:nrow(airquality)) weights <- rgamma(nrow(data), 3, 3) groups <- as.numeric(sample(1:8, nrow(data), replace = TRUE)) model <- new(miceFast) model$set_data(data) model$set_w(weights) model$set_g(groups) # The data will be automatically sorted by 'groups' during the first imputation: model$update_var(1, model$impute("lm_pred", 1, 6)$imputations) model$update_var(2, model$impute_N("lm_bayes", 2, c(1,3,4,5,6), 10)$imputations) head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4) ``` --- # Tips and Additional Notes - **Creating matrices from data frames**: Remember that standard R matrices can only hold one data type (often numeric). Factor and character columns must be transformed (e.g., via `model.matrix`). ```r mtcars$cyl <- factor(mtcars$cyl) mtcars$gear <- factor(mtcars$gear) mtcars_mat <- model.matrix(~ ., data = mtcars, na.action = "na.pass") ``` - **Variance Inflation Factors (VIF)**: VIF measures how much the variance of a regression coefficient is inflated due to collinearity. High values (>10) may indicate problematic collinearity. ```r airquality2 <- airquality airquality2$Temp2 <- airquality2$Temp^2 airquality2$Month <- factor(airquality2$Month) # Using data.table: VIF( airquality2, posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp", "Month", "Day", "Temp2"), correct = FALSE ) ``` --- # References - **Exposing C++ functions and classes with Rcpp modules** Dirk Eddelbuettel and Romain François (2018) - **RcppArmadillo: Easily Extending R with High-Performance C++ Code** Dirk Eddelbuettel and Conrad Sanderson (2012) - **Extending R with C++: A Brief Introduction to Rcpp** Dirk Eddelbuettel and James Joseph Balamuta (2018) - **MICE: Multivariate Imputation by Chained Equations in R** Stef van Buuren (2013) - **CSCE 666: Pattern Analysis** Ricardo Gutierrez-Osuna (Fall 2013)