--- title: "Using the missForestPredict package" author: "Elena Albu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using the missForestPredict package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction ## What is this document? The goal of this document is to highlight the functionality implemented in the package `missForestPredict` and to provide guidance for the usage of this package. ## Package information The package `missForestPredict` implements the missing data imputation algorithm used in the R package `missForest` [@stekhoven2012missforest] with adaptations for prediction settings. The function `missForest` is used to impute a (training) dataset with missing values and to learn imputations models that can be later used for imputing new observations. The function `missForestPredict` is used to impute one or multiple new observations (test set) using the models learned on the training data. The word "Predict" in the function name should not misguide the user. The function does not perform prediction of an outcome and is agnostic on whether the outcome variable for a prediction model is part of the training data or not; it will treat all columns of the provided data as variables to be imputed. ## Package functionality **Fast implementation** The imputation algorithm is based on random forests [@breiman01] as implemented in the `ranger` R package [@JSSv077i01]. Ranger provides a fast implementation of random forests suitable for large datasets as well as high dimensional data. **Saved models and initialization** The missing values in each column are initialized with the mean/mode (or median/mode) of that variable derived on complete cases or a custom imputation scheme. Each variable is then imputed using the iterative algorithm of missForest [@stekhoven2012missforest] until a stopping criterion is met. The algorithm supports all variable types (continuous and categorical with two or more levels) and uses a common stopping criterion for all variables. The initialization used for the training data and the random forest models for each iteration are saved and can be later used to impute new observations. Imputation initialization and models are by default "learned" also for variables with no missing values in the original (training) data. This allows for unfortunate situations in which new observations have different missing patterns than the one encountered in the training data (for example, because of accidental registration errors or because of unfortunate train / test split in which all missing values of a variable with low missingness fall in the test set). **Imputation of new observations** The models are applied iteratively to "predict" the missing values of each variable for the new observation, using the same number of iterations as used in the training data. **Convergence criteria** At each iteration the out-of-bag (OOB) error is calculated for each variable separately. To obtain a global error the OOB errors for all variables a weighted average is used, that can be controlled by the `var_weights` parameter. By default the weights are set to the proportion of missing values of each variable. The normalized mean squared error is used for both continuous and categorical variables. For continuous variables, it is equivalent to $1 - R^2$. For categorical variables, it is equivalent to $1 - BSS$ (Brier Skill Score) [@brier1950verification]. More information on convergence criteria and error monitoring is provided in a separate vignette. **Support for dataframe and tibble** Both dataframe (`data.frame` class) and tibble (`tbl_df` class) are supported as input for the package functions. Currently matrix input is not supported. ## How to install ```{r, eval=FALSE} install.packages("missForestPredict") ``` # How to use the package ## Data used for demonstration **Iris data** The iris dataset in R base contains 4 continuous variables and one categorical variable with three categories for N = 150 flowers [@anderson1935irises]. **Diamonds data** The diamonds dataset from `ggplot2` R package contains seven continuous variables and three categorical variables for N = 53940 diamonds [@ggplot2]. ## Imputation of training and test set After installing the package you can load it in your R sessions with: ```{r} library(missForestPredict) ``` We will load the iris dataset and split it in a training set (100 observations) and a test set (50 observations). ```{r} data(iris) N <- nrow(iris) n_test <- floor(N/3) set.seed(2022) id_test <- sample(1:N, n_test) iris_train <- iris[-id_test,] iris_test <- iris[id_test,] ``` We produce 10% random missing values on each column in both the training and the test set. ```{r} set.seed(2022) iris_train_miss <- produce_NA(iris_train, proportion = 0.1) iris_test_miss <- produce_NA(iris_test, proportion = 0.1) head(iris_train_miss) head(iris_test_miss) ``` We will impute the training set and learn the random forest imputation models at the same time using the function `missForest`. To later use the models learned on a training set for imputation of new observations, `save_models` needs to be set to `TRUE`. By default feedback on the number of iterations and the error monitoring is provided. You can set `verbose = FALSE` to silence this output. More information on error monitoring is provided in a separate vignette. Please note that in all following examples we set the `ranger` parameter `num.threads` to 2. If you are running the code on a machine with more cores and you are willing to use them for running the code, you can remove this parameter completely. ```{r} set.seed(2022) iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE, num.threads = 2) ``` The imputed training set can be found by extracting `ximp` dataframe from the object. ```{r} iris_train_imp <- iris_train_imp_object$ximp head(iris_train_imp) ``` We will further impute the test set using the learned imputation models. The function `missForestPredict` will: - initialize the missing values in each variable with the initialization "learned" from the training set (mean/mode) - imperatively predict the missing values of each variable using the learned random forest models for each iteration ```{r} iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, newdata = iris_test_miss) head(iris_test_imp) ``` ## Imputation of a single new observation `missForestPredict` can impute a new observation with missing values. The new observation has to be provided as a dataframe with one row and with named columns (the column names have to correspond to the column names of the training set). ```{r} single_observation <- iris_test_miss[1,] single_observation[1,2] <- NA print(single_observation) single_observation_imp <- missForestPredict::missForestPredict(iris_train_imp_object, newdata = single_observation) print(single_observation_imp) ``` `missForestPredict` package can impute observations with new missingness patterns not present in the training data as well as variables with no missingness (complete) in training data. The initialization and the random forest imputation models are "learned" for all variables in the dataset, regardless of the amount of missingness. ## Predict without storing the imputed matrix The function `missForest` returns both the imputed training dataframe as well as the imputation models. ```{r} str(iris_train_imp_object, max.level = 1) ``` The imputed training data is though not necessary for imputing the test set. To avoid storing further these data in the object, `ximp` can be set to NULL. ```{r} iris_train_imp <- iris_train_imp_object$ximp iris_train_imp_object$ximp <- NULL iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, newdata = iris_test_miss) head(iris_test_imp) ``` In `ximp` is set to NULL by mistake, and the training set imputation is lost, it can be recovered without rerunning the algorithm. Imputing the training set with the function `missForestPredict` will give the same results as the initial results of `missForest` function because the same models are applied to the variables in the same order and for the same number of iterations. ```{r} set.seed(2022) iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE, num.threads = 2) # store imputed dataframe iris_train_imp <- iris_train_imp_object$ximp iris_train_imp_object$ximp <- NULL # re-impute the same dataframe using missForestPredict iris_train_imp_2 <- missForestPredict::missForestPredict(iris_train_imp_object, newdata = iris_train_miss) identical(iris_train_imp, iris_train_imp_2) ``` ## Impute larger datasets by adapting `num.trees` Although `missForestPredict` benefits of the improved computation time of `ranger` package, larger dataset can still prove time consuming to impute. We will load the diamonds dataset, which contains more than 50000 observations and produce 30% missing values on each variable. ```{r} library(ggplot2) data(diamonds) # split train / test N <- nrow(diamonds) n_test <- floor(N/3) set.seed(2022) id_test <- sample(1:N, n_test) diamonds_train <- diamonds[-id_test,] diamonds_test <- diamonds[id_test,] diamonds_train_miss <- produce_NA(diamonds_train, proportion = 0.1) diamonds_test_miss <- produce_NA(diamonds_test, proportion = 0.1) head(diamonds_train_miss) head(diamonds_test_miss) ``` The function `missForest` supports additional parameters to be passed to the `ranger` function. By default, the default values of `ranger` are used (e.g. the number of trees in the forest is 500). This can be overridden by passing `num.trees = 100`. Using less trees will prove to be computationally more efficient. ```{r} set.seed(2022) diamonds_train_imp_object <- missForestPredict::missForest(diamonds_train_miss, save_models = TRUE, num.trees = 100, num.threads = 2) # impute test set diamonds_train_imp_object$ximp <- NULL diamonds_test_imp <- missForestPredict::missForestPredict(diamonds_train_imp_object, newdata = diamonds_test_miss) head(diamonds_test_imp) ``` Alternatively, the `maxiter` parameter can be set to a lower number (the default is 10) or other `ranger` parameters can be adapted. Note that not all parameters to ranger function are supported. Parameters that should be adapted in function of the imputed variable (outcome for the ranger function) are not supported. Generic parameters that can be applied to all variables are supported (like: `num.trees, mtry, min.node.size, max.depth, replace,` ...), as well as `class.weights` for factor variables. ## Custom initialization The parameter `initialization` supports three values: `mean/mode`, `median/mode` and `custom`. The default is mean/mode, which will initialize each variable with the mean (for continuous variables) or mode (for categorical variables) calculated based on the complete observations on that variable. When `custom` is used, a complete dataframe is expected in `x_init`. For example, the imputations of another imputation method can be used as initialization. When `custom` is used in `missForest` function, an initialization dataframe has to be passed to the `missForestPredict` function later on too. We will exemplify this with an initialization using linear models on the `iris` dataset. Let's assume that variables `Sepal.Width` and `Species` are known to be never missing. We will regress the other variables on these two variables and save the linear models to be applied on the test set afterwards. ```{r} data(iris) # split train / test N <- nrow(iris) n_test <- floor(N/3) set.seed(2022) id_test <- sample(1:N, n_test) iris_train <- iris[-id_test,] iris_test <- iris[id_test,] # produce missing values set.seed(2022) iris_train_miss <- produce_NA(iris_train, proportion = c(0.2, 0, 0.2, 0.2, 0)) iris_test_miss <- produce_NA(iris_test, proportion = c(0.2, 0, 0.2, 0.2, 0)) # build linear models for Sepal.Length, Petal.Width, Petal.Length using complete cases for each variable fit_1 <- lm(Sepal.Length ~ ., data = iris_train_miss[!is.na(iris_train_miss$Sepal.Length), c("Sepal.Length", "Sepal.Width", "Species")]) fit_2 <- lm(Petal.Width ~ ., data = iris_train_miss[!is.na(iris_train_miss$Petal.Width), c("Petal.Width", "Sepal.Width", "Species")]) fit_3 <- lm(Petal.Length ~ ., data = iris_train_miss[!is.na(iris_train_miss$Petal.Length), c("Petal.Length", "Sepal.Width", "Species")]) # impute training with predictions of linear model iris_train_init <- iris_train_miss iris_train_init$Sepal.Length[is.na(iris_train_init$Sepal.Length)] <- predict(fit_1, iris_train_init[is.na(iris_train_init$Sepal.Length), c("Sepal.Width", "Species")]) iris_train_init$Petal.Width[is.na(iris_train_init$Petal.Width)] <- predict(fit_2, iris_train_init[is.na(iris_train_init$Petal.Width), c("Sepal.Width", "Species")]) iris_train_init$Petal.Length[is.na(iris_train_init$Petal.Length)] <- predict(fit_3, iris_train_init[is.na(iris_train_init$Petal.Length), c("Sepal.Width", "Species")]) # impute the training set using this initialization set.seed(2022) iris_train_imp_obj <- missForest(iris_train_miss, save_models = TRUE, initialization = "custom", x_init = iris_train_init, num.threads = 2) # build test set initialization using the linear models learned on training iris_test_init <- iris_test_miss iris_test_init$Sepal.Length[is.na(iris_test_init$Sepal.Length)] <- predict(fit_1, iris_test_init[is.na(iris_test_init$Sepal.Length), c("Sepal.Width", "Species")]) iris_test_init$Petal.Width[is.na(iris_test_init$Petal.Width)] <- predict(fit_2, iris_test_init[is.na(iris_test_init$Petal.Width), c("Sepal.Width", "Species")]) iris_test_init$Petal.Length[is.na(iris_test_init$Petal.Length)] <- predict(fit_3, iris_test_init[is.na(iris_test_init$Petal.Length), c("Sepal.Width", "Species")]) # impute test set iris_test_imp <- missForestPredict(iris_train_imp_obj, newdata = iris_test_miss, x_init = iris_test_init) evaluate_imputation_error(iris_test_imp, iris_test_miss, iris_test) evaluate_imputation_error(iris_test_init, iris_test_miss, iris_test) ``` The errors (MSE, NMSE) for `Sepal.Length` and `Petal.Length` decreased compared to the initialization imputation. The missclasification error (MER) for `Species` is 0 (if a variable does not have missing values, the error returned will be 0). Also keep in mind that `x_init` should be complete dataframe, if used. For example, if you would wish to impute only one variable using a different method as initialization, you cannot rely on the `missForest` function to do mean/mode initialization for the other variables. You would have to do it yourself before providing `x_init`. ## Using `predictor_matrix` By default, missForest will create models for all variables in a dataframe and will use all other variables in the imputation of each variable. Controlling which variables to impute and which predictors to use for the imputation of each variable can be done using the parameter `predictor_matrix`. We illustrate three different cases: - variables not imputed and not used as predictor - variables not imputed but used as predictor - variables imputed but not used as predictor ### Variables not imputed and not used as predictor We consider the case when a date variable is kept in the dataframe for further reference. Such variable does not need imputation, nor can be used as predictor. ```{r} data(iris) iris$Date_collected <- seq(Sys.Date() - nrow(iris) + 1, Sys.Date(), by="days") # split train / test N <- nrow(iris) n_test <- floor(N/3) set.seed(2022) id_test <- sample(1:N, n_test) iris_train <- iris[-id_test,] iris_test <- iris[id_test,] iris_train_miss <- produce_NA(iris_train, proportion = c(0.1,0.1,0.1,0.1,0.1,0)) iris_test_miss <- produce_NA(iris_test, proportion = c(0.1,0.1,0.1,0.1,0.1,0)) head(iris_train_miss) ``` We will create a predictor matrix so that the `Date_collected` variable is not included in imputation and not included as predictor. We will initialize the predictor matrix using the function `create_predictor_matrix`. This is the default predictor matrix in which all variables are included for imputation and all other variables are used as predictors. The diagonal is 0 as one variable will not be used as predictor for its own imputation model. ```{r} predictor_matrix <- create_predictor_matrix(iris_train_miss) print(predictor_matrix) ``` The rows control the variables to be imputed and the columns control the variables as predictors. We will set all zeros for `Date_collected` both row-wise and column-wise. ```{r} predictor_matrix["Date_collected",] <- 0 predictor_matrix[,"Date_collected"] <- 0 print(predictor_matrix) ``` We will pass this matrix to the missForest function. ```{r} set.seed(2022) iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE, predictor_matrix = predictor_matrix, verbose = TRUE, num.threads = 2) iris_train_imp <- iris_train_imp_object$ximp head(iris_train_imp) iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, newdata = iris_test_miss) head(iris_test_imp) ``` ### Variables not imputed but used as predictor We consider the case when we do not want to impute the variable `Sepal.Length` (for example, we do not plan to use it further in a prediction model), but we want to use it as predictor for the imputation of other variables. ```{r} data(iris) # split train / test N <- nrow(iris) n_test <- floor(N/3) set.seed(2022) id_test <- sample(1:N, n_test) iris_train <- iris[-id_test,] iris_test <- iris[id_test,] iris_train_miss <- produce_NA(iris_train, proportion = 0.1) iris_test_miss <- produce_NA(iris_test, proportion = 0.1) head(iris_train_miss) ``` We create the predictor matrix and set zeros row-wise. The row represents the variable to be imputed, the columns represent the predictors used in imputation. Setting zero row-wise means: do not use any of the predictors in imputing `Sepal.Length`. This situation is interpreted as: do not build imputation models for `Sepal.Length`. ```{r} predictor_matrix <- create_predictor_matrix(iris_train_miss) predictor_matrix["Sepal.Length",] <- 0 print(predictor_matrix) ``` ```{r} set.seed(2022) iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE, predictor_matrix = predictor_matrix, verbose = TRUE, num.threads = 2) iris_train_imp <- iris_train_imp_object$ximp iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, newdata = iris_test_miss) head(iris_test_imp) ``` Models are not built for the variable `Sepal.Length`. Nevertheless this variable is initialized (with mean imputation), as it is used in other imputation models. As this variable is not imputed iteratively, its mean imputation might not be optimal for it being used as predictor. We recommend using this case only for complete variables in the training data and no expected missing values in future data. ```{r} names(iris_train_imp_object$models[[1]]) names(iris_train_imp_object$init) ``` ### Variables imputed but not used as predictor A third case is that when we want to impute a variable (based on all other variables), but we don't want to use it as predictor for specific variables. For example we do not want to use the `Species` variable in the imputation of `Petal.Length` and `Petal.Width`. ```{r} data(iris) # split train / test N <- nrow(iris) n_test <- floor(N/3) set.seed(2022) id_test <- sample(1:N, n_test) iris_train <- iris[-id_test,] iris_test <- iris[id_test,] iris_train_miss <- produce_NA(iris_train, proportion = 0.1) iris_test_miss <- produce_NA(iris_test, proportion = 0.1) head(iris_train_miss) predictor_matrix <- create_predictor_matrix(iris_train_miss) predictor_matrix["Petal.Length","Species"] <- 0 predictor_matrix["Petal.Width","Species"] <- 0 print(predictor_matrix) set.seed(2022) iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE, predictor_matrix = predictor_matrix, verbose = TRUE, num.threads = 2) iris_train_imp <- iris_train_imp_object$ximp iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, newdata = iris_test_miss) head(iris_test_imp) ``` A predictor matrix can be checked using the function `check_predictor_matrix`. In case the predictor matrix is not valid, an error will be returned. For valid predictor matrices, human readable message about the imputation scheme will be printed. ```{r} check_predictor_matrix(predictor_matrix, iris_train) ``` ### Impact of `proportion_usable_cases` on the predictor matrix The default predictor matrix can be altered by the setting of `proportion_usable_cases`. missForest initializes each variable and then builds models for each variable using the observed values of that variable as outcome of a random forest model. It then imputes the missing part of the variable using the learned models. Situations when a variable is mostly or completely missing for the observed part of another variable are suboptimal for learning the models, as the models will mostly rely on the initialization rather than on true values. Situations when a variable is missing for the missing part of another variable are suboptimal for the imputation (prediction) part as the predictions will rely on the initialization rather than on true values. By default `proportion_usable_cases` will filter the variables (as predictors) if they are completely missing for the observed or missing part of another variable. As an example, we will make `Sepal.Length` and `Sepal.Width` in `iris` missing together and `Petal.Length` missing for the observed part of these variables. Checking the predictor matrix post imputation will reveal that these variables will not be used as predictor for each other because of the "sub-optimal" missingness pattern. ```{r} data(iris) iris_mis <- iris iris_mis[1:50, "Sepal.Length"] <- NA iris_mis[1:50, "Sepal.Width"] <- NA iris_mis[51:150, "Petal.Length"] <- NA set.seed(2022) iris_train_imp_object <- missForestPredict::missForest(iris_mis, save_models = TRUE, verbose = TRUE, num.threads = 2) print(iris_train_imp_object$predictor_matrix) check_predictor_matrix(iris_train_imp_object$predictor_matrix, iris_mis, verbose = TRUE) ``` # Final words The package `missForestPredict` is based on the `missForest` package [@stekhoven2012missforest] with additional functionality for convergence criteria, initialization, predictor specification and imputation of new observations. Other imputation packages based on random forests are available in the R ecosystem, using iterative algorithms, like `missRanger` [@mayer2019package], `mice` [@van2007mice], `CALIBERrfimpute` [@shah2021package], `miceRanger` [@wilson2020miceranger]. These provide various extended functionality, but they do not support the imputation of new observations, with the exception of `miceRanger` [@wilson2020miceranger] which, to our knowledge, is the only iterative algorithm that can impute new observations. Non-iterative algorithms are also available: the function `rfImpute` in the `randomForest` [@randomForestPackage] package requires presence of the outcome at imputation time, which makes it inadequate for prediction settings (where imputation for a single new observation for which the outcome is not yet know may be of interest); `caret` [@kuhn2008caret] implements bagged trees without iterations and will impute new observations for most missingness patterns; `imputeMissings` implements random forests non-iteratively but the initialization is not learned on the training set and will require a test set for initialization. The R package `missForestPredict` uses an iterative algorithm and can impute a single new observation in (applied) prediction settings, even when the new observation exhibits a missingness pattern not encountered in the training set. It is therefore suitable for (applied) prediction settings. # References