slideimp


{slideimp} is a lightweight R package for fast K-NN and PCA imputation of missing values in high-dimensional numeric matrices.

Core functions

Installation

The stable version of {slideimp} can be installed from CRAN using:

install.packages("slideimp")

You can install the development version of {slideimp} with:

pak::pkg_install("hhp94/slideimp")

Workflow

Let’s simulate some DNA methylation (DNAm) microarray data from 2 chromosomes. All {slideimp} functions expect the input to be a numeric matrix where variables are stored in the columns.

library(slideimp)
# Simulate data from 2 chromosomes
set.seed(1234)
sim_obj <- sim_mat(m = 20, n = 50, perc_NA = 0.3, perc_col_NA = 1, nchr = 2)
# Here we see that variables are stored in rows
sim_obj$input[1:5, 1:5]
#>              s1        s2        s3        s4        s5
#> feat1 0.2391314 0.0000000 0.5897476 0.4201222        NA
#> feat2        NA 0.2810446 0.3677927        NA 0.6387734
#> feat3 0.7203854 0.1600776 0.5027545        NA 0.5556735
#> feat4 0.0000000 0.1816453 0.3608640 0.3356484 0.6394179
#> feat5 0.5827582 0.3774313 0.2801131 0.5047049 0.5761809

# So we t() to put the variables in columns
obj <- t(sim_obj$input)

We can optionally estimate the prediction accuracy of different methods and tune hyperparameters prior to imputation with tune_imp().

For custom functions (.f argument), the parameters data.frame must include the columns corresponding to the arguments passed to the custom function. The custom function must accept obj as the first argument and return a matrix with the same dimensions as obj.
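As an illustration of that contract, here is a minimal base-R sketch of a custom imputer. The function name `col_mean_impute` and its `fallback` argument are hypothetical and not part of {slideimp}; a matching parameters data.frame would then need a `fallback` column.

```r
# Hypothetical custom imputer: fill each column's NAs with that column's mean.
# `obj` is the first argument and the result has the same dimensions as `obj`.
col_mean_impute <- function(obj, fallback = 0) {
  stopifnot(is.matrix(obj), is.numeric(obj))
  col_means <- colMeans(obj, na.rm = TRUE)
  # A column that is entirely NA has a NaN mean; use the fallback value instead.
  col_means[is.nan(col_means)] <- fallback
  missing <- is.na(obj)
  # col(obj)[missing] gives the column index of every missing cell.
  obj[missing] <- col_means[col(obj)[missing]]
  obj
}

# Toy check: 3 x 3 matrix whose second column is entirely missing.
toy <- matrix(c(1, NA, 3, NA, NA, NA, 2, 4, NA), nrow = 3)
out <- col_mean_impute(toy, fallback = 0)

# Parameters data.frame with one column per extra argument of the function.
toy_params <- data.frame(fallback = 0)
```

Checking `identical(dim(out), dim(toy))` before tuning is a cheap way to catch a function that accidentally drops or reorders dimensions.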

For illustration, we tune with 2 repeats (rep = 2); use more repeats in real analyses.

knn_params <- tibble::tibble(k = c(5, 20))
# Parallelization is controlled by `cores` only for knn or slideimp knn
tune_knn <- tune_imp(obj, parameters = knn_params, cores = 2, rep = 2)
#> Tuning knn_imp
#> Step 1/2: Injecting NA
#> Running in parallel...
#> Step 2/2: Tuning
compute_metrics(tune_knn)
#> # A tibble: 12 × 7
#>        k cores param_set   rep .metric .estimator .estimate
#>    <dbl> <dbl>     <int> <int> <chr>   <chr>          <dbl>
#>  1     5     2         1     1 mae     standard     0.178  
#>  2     5     2         1     1 rmse    standard     0.225  
#>  3     5     2         1     1 rsq     standard     0.00454
#>  4    20     2         2     1 mae     standard     0.149  
#>  5    20     2         2     1 rmse    standard     0.190  
#>  6    20     2         2     1 rsq     standard     0.0172 
#>  7     5     2         1     2 mae     standard     0.202  
#>  8     5     2         1     2 rmse    standard     0.259  
#>  9     5     2         1     2 rsq     standard     0.00960
#> 10    20     2         2     2 mae     standard     0.172  
#> 11    20     2         2     2 rmse    standard     0.219  
#> 12    20     2         2     2 rsq     standard     0.0850
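The idea behind this kind of tuning (hide some observed values, impute them, then compare against the truth) can be sketched in base R. This is not the package's implementation: the column-mean imputer is a stand-in, and defining rsq as the squared correlation is an assumption.

```r
set.seed(1)
truth <- matrix(runif(200), nrow = 20) # 20 samples x 10 variables, no NAs

# Step 1: inject NAs into 10% of the cells, remembering where they were.
hidden <- sample(length(truth), size = 20)
masked <- truth
masked[hidden] <- NA

# Step 2: impute. Column means are used here only as a stand-in method.
col_means <- colMeans(masked, na.rm = TRUE)
imputed <- masked
imputed[hidden] <- col_means[col(masked)[hidden]]

# Step 3: score only the cells that were hidden.
err <- imputed[hidden] - truth[hidden]
mae <- mean(abs(err))
rmse <- sqrt(mean(err^2))
rsq <- cor(imputed[hidden], truth[hidden])^2 # assumed definition of rsq
```

Repeating the three steps with a different random set of hidden cells corresponds to one additional repeat.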

For PCA and custom functions, set up parallelization with mirai::daemons().

mirai::daemons(2) # Start 2 daemons (parallel workers)
# Note: for PCA and custom functions, parallelization is controlled by
# `mirai::daemons()` and the `cores` argument is ignored.

# PCA imputation. Specified by the `ncp` column in the `pca_params` tibble.
pca_params <- tibble::tibble(ncp = c(1, 5))
tune_pca <- tune_imp(obj, parameters = pca_params, rep = 2)

# The parameters have `mean` and `sd` columns.
custom_params <- tibble::tibble(mean = 1, sd = 0)
# This function imputes missing values with `rnorm()` draws using the given `mean` and `sd`.
custom_function <- function(obj, mean, sd) {
  missing <- is.na(obj)
  obj[missing] <- rnorm(sum(missing), mean = mean, sd = sd)
  return(obj)
}
tune_custom <- tune_imp(obj, parameters = custom_params, .f = custom_function, rep = 2)

mirai::daemons(0) # Close daemons

Next, if the variables can be meaningfully grouped (e.g., by chromosome), prefer imputing by group with group_imp().

PCA-based imputation with group_imp() can be parallelized using the {mirai} package, similar to how parallelization is done with tune_imp().

# Use the `group_features()` helper function
group_df <- group_features(obj, sim_obj$group_feature)
group_df

# We choose K-NN imputation, k = 5, from the `tune_imp` results.
knn_group_results <- group_imp(obj, group = group_df, k = 5, cores = 2)

# For PCA imputation (set `ncp`), parallelization is controlled by
# `mirai::daemons()`, as with `tune_imp()`.
mirai::daemons(2)
pca_group_results <- group_imp(obj, group = group_df, ncp = 3)
mirai::daemons(0)
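To show what imputing by group means conceptually, here is a base-R sketch that splits the columns by group, imputes each block on its own, and writes the blocks back. It does not reproduce group_imp(): `impute_block` is a hypothetical stand-in (row means within the block) and the feature-to-group map is made up for the example.

```r
# Hypothetical stand-in imputer: fill a missing cell with the mean of the
# same sample's observed values within the current block of features.
impute_block <- function(block) {
  row_means <- rowMeans(block, na.rm = TRUE)
  miss <- is.na(block)
  block[miss] <- row_means[row(block)[miss]]
  block
}

set.seed(42)
x <- matrix(
  runif(60),
  nrow = 6,
  dimnames = list(paste0("s", 1:6), paste0("feat", 1:10))
)
# Knock out four cells at fixed positions (row, column).
x[cbind(c(1, 2, 3, 4), c(1, 4, 6, 9))] <- NA

# Made-up grouping: first 5 features on chr1, the remaining 5 on chr2.
groups <- split(colnames(x), rep(c("chr1", "chr2"), each = 5))

result <- x
for (g in names(groups)) {
  cols <- groups[[g]]
  # Each group is imputed using only its own columns.
  result[, cols] <- impute_block(x[, cols, drop = FALSE])
}
```

Because each block only sees its own columns, the value filled in for `feat1` depends on chr1 features alone.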

Alternatively, full matrix imputation can be performed using knn_imp() or pca_imp().

full_knn_results <- knn_imp(obj = obj, k = 5)
full_pca_results <- pca_imp(obj = obj, ncp = 5)

Sliding Window Imputation

Sliding window imputation can be performed using slide_imp(). Note: DNAm WGBS/EM-seq data should be grouped by chromosome and converted into either beta or M values before sliding window imputation. See the vignette for more details.
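For reference, the usual relationship between the two scales is M = log2(beta / (1 - beta)). The helpers below are a hedged base-R sketch of that conversion; the function names and the `eps` clamp (which avoids infinite M values at beta = 0 or 1) are assumptions, not part of {slideimp}.

```r
# Hypothetical helper: beta values in [0, 1] to M values.
beta_to_m <- function(beta, eps = 1e-6) {
  # Clamp away from exactly 0 and 1 so the log ratio stays finite.
  beta <- pmin(pmax(beta, eps), 1 - eps)
  log2(beta / (1 - beta))
}

# Hypothetical helper: M values back to beta values.
m_to_beta <- function(m) {
  2^m / (1 + 2^m)
}

beta <- matrix(c(0.5, 0.2, NA, 0), nrow = 2)
m_vals <- beta_to_m(beta)
```

Missing values stay missing through both conversions, so the conversion can be applied before imputation.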

chr1_beta <- t(sim_mat(m = 10, n = 2000, perc_NA = 0.3, perc_col_NA = 1, nchr = 1)$input)
dim(chr1_beta)
#> [1]   10 2000
chr1_beta[1:5, 1:5]
#>        feat1     feat2     feat3     feat4     feat5
#> s1        NA 0.7297743        NA        NA 0.3968039
#> s2 0.7346970        NA 0.5669140 0.3236858 0.3932419
#> s3        NA        NA        NA 0.3108793        NA
#> s4 0.5401526 0.5779956 0.4271064        NA 0.3309645
#> s5 0.6457875        NA 0.7308792 0.4803642 0.5929590

# From the tune results, choose window size of 50, overlap of size 5 between windows,
# K-NN imputation using k = 10. Specify `ncp` for sliding window PCA imputation.
slide_imp(obj = chr1_beta, n_feat = 50, n_overlap = 5, k = 10, cores = 2, .progress = FALSE)
#> ImputedMatrix (KNN)
#> Dimensions: 10 x 2000
#> 
#>        feat1     feat2     feat3     feat4     feat5
#> s1 0.5067435 0.7297743 0.5884198 0.5063839 0.3968039
#> s2 0.7346970 0.4551576 0.5669140 0.3236858 0.3932419
#> s3 0.5625864 0.4790436 0.5316400 0.3108793 0.5234974
#> s4 0.5401526 0.5779956 0.4271064 0.5551127 0.3309645
#> s5 0.6457875 0.4006866 0.7308792 0.4803642 0.5929590
#> 
#> # Showing [1:5, 1:5] of full matrix
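To make the roles of the window size and overlap concrete, the sketch below computes start and end column indices for overlapping windows. `window_bounds` is a hypothetical helper written for this example; the window layout slide_imp() uses internally, and how it combines values in overlapping regions, may differ.

```r
# Hypothetical helper: start/end column indices of overlapping windows.
window_bounds <- function(n_total, n_feat, n_overlap) {
  stopifnot(n_total >= 1, n_overlap >= 0, n_feat > n_overlap)
  # Consecutive windows are shifted by the non-overlapping part.
  step <- n_feat - n_overlap
  # Stop before a window would consist only of already-covered overlap.
  starts <- seq(1, max(1, n_total - n_overlap), by = step)
  # Clip the last window at the final column.
  ends <- pmin(starts + n_feat - 1, n_total)
  data.frame(start = starts, end = ends)
}

w <- window_bounds(n_total = 2000, n_feat = 50, n_overlap = 5)
head(w, 3)
tail(w, 2)
```

With these settings each window shares its last 5 columns with the start of the next one, and the final window is shorter because it is clipped at column 2000.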