The SNF Config

Download a copy of the vignette to follow along here: snf_config.Rmd

This vignette outlines how to construct and use the SNF config, an object storing all the settings and hyperparameters required to convert data in a data_list class object into a space of cluster solutions.

Creating a default SNF config

The most minimal SNF config (snf_config class object) can be obtained by providing a data list to the snf_config() function.

library(metasnf)

dl <- data_list(
    list(cort_t, "cortical_thickness", "neuroimaging", "continuous"),
    list(cort_sa, "cortical_surface_area", "neuroimaging", "continuous"),
    list(subc_v, "subcortical_volume", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

sc <- snf_config(dl, n_solutions = 5)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
##                            1    2    3    4    5
## SNF hyperparameters:
## alpha                    0.4  0.3  0.3  0.8  0.6
## k                         48   14   85   98   34  
## t                         20   20   20   20   20  
## SNF scheme:
##                            1    2    1    1    3  
## Clustering functions:
##                            1    2    2    2    2  
## Distance functions:
## CNT                        1    1    1    1    1  
## DSC                        1    1    1    1    1  
## ORD                        1    1    1    1    1  
## CAT                        1    1    1    1    1  
## MIX                        1    1    1    1    1  
## Component dropout:
## cortical_thickness         ✔    ✔    ✖    ✔    ✔  
## cortical_surface_area      ✔    ✖    ✔    ✔    ✔  
## subcortical_volume         ✔    ✖    ✔    ✔    ✔  
## household_income           ✔    ✔    ✔    ✔    ✔  
## pubertal_status            ✔    ✔    ✔    ✔    ✔  
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 5 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1 
## $ mrisdp_2 1, 1, 1, 1, 1 
## $ mrisdp_3 1, 1, 1, 1, 1 
## $ mrisdp_4 1, 1, 1, 1, 1 
## $ mrisdp_5 1, 1, 1, 1, 1 
## …and 329 more features.

Similarity network fusion-based clustering pipelines require the following steps:

  1. Selecting a set of data frames to integrate
  2. Converting those data frames into distance matrices using a distance function
  3. Converting those distance matrices to similarity matrices using the SNFtool package’s affinityMatrix() function
  4. Integrating the similarity matrices into one final similarity matrix using the SNFtool package’s SNF() function
  5. Converting that final similarity matrix into a cluster solution using a clustering function
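The five steps above can be sketched directly with SNFtool on made-up data. This is only an illustration under stated assumptions: it assumes the SNFtool package is installed, the toy matrices and the values K = 10, sigma = 0.5, and t = 20 are arbitrary choices, and it is not meant to describe how metasnf runs the pipeline internally.

```r
# Toy walk-through of the pipeline using SNFtool directly.
# All data and hyperparameter values are invented for illustration.
library(SNFtool)

set.seed(1)
n <- 40

# Step 1: two data frames describing the same 40 observations
df_a <- matrix(rnorm(n * 5), nrow = n)
df_b <- matrix(rnorm(n * 3), nrow = n)

# Step 2: data frames -> distance matrices (Euclidean)
dist_a <- as.matrix(stats::dist(df_a))
dist_b <- as.matrix(stats::dist(df_b))

# Step 3: distance matrices -> similarity matrices
sim_a <- affinityMatrix(dist_a, K = 10, sigma = 0.5)
sim_b <- affinityMatrix(dist_b, K = 10, sigma = 0.5)

# Step 4: fuse the similarity matrices into one
fused <- SNF(list(sim_a, sim_b), K = 10, t = 20)

# Step 5: fused similarity matrix -> cluster solution
# (the number of clusters is fixed at 2 here purely for simplicity)
solution <- spectralClustering(fused, K = 2)
table(solution)
```

Every choice made by hand in this sketch (which data frames, which distance function, the hyperparameters, the clustering function) is something the SNF config records for you, once per row of settings.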

The SNF config is made up of four parts, each addressing different steps of that pipeline:

  1. The settings data frame (class settings_df, extends class data.frame), which contains information about SNF-specific hyperparameters (step 4), which distance and clustering functions will be used (steps 2 and 5), and if any components of the data list (data frames) will be excluded on a particular run (step 1). Each row of the data frame corresponds to a complete set of settings that can yield a single cluster solution from the data list.
  2. The distance functions list (class dist_fns_list, extends class list), which stores the actual distance functions that are referenced in the settings data frame (step 2)
  3. The clustering functions list (class clust_fns_list, extends class list), which similarly stores clustering functions (step 5)
  4. The weights matrix (class weights_matrix, extends classes matrix, array), which contains feature weights to account for during the data to distance matrix conversion step (step 2).

The settings data frame

You can view the settings data frame in closer detail as follows:

sc$"settings_df"
##                            1    2    3    4    5
## SNF hyperparameters:
## alpha                    0.4  0.3  0.3  0.8  0.6
## k                         48   14   85   98   34  
## t                         20   20   20   20   20  
## SNF scheme:
##                            1    2    1    1    3  
## Clustering functions:
##                            1    2    2    2    2  
## Distance functions:
## CNT                        1    1    1    1    1  
## DSC                        1    1    1    1    1  
## ORD                        1    1    1    1    1  
## CAT                        1    1    1    1    1  
## MIX                        1    1    1    1    1  
## Component dropout:
## cortical_thickness         ✔    ✔    ✖    ✔    ✔  
## cortical_surface_area      ✔    ✖    ✔    ✔    ✔  
## subcortical_volume         ✔    ✖    ✔    ✔    ✔  
## household_income           ✔    ✔    ✔    ✔    ✔  
## pubertal_status            ✔    ✔    ✔    ✔    ✔
# Printed as a regular data frame
sc$"settings_df" |> as.data.frame()
##   solution alpha  k  t snf_scheme clust_alg cnt_dist dsc_dist ord_dist cat_dist
## 1        1   0.4 48 20          1         1        1        1        1        1
## 2        2   0.3 14 20          2         2        1        1        1        1
## 3        3   0.3 85 20          1         2        1        1        1        1
## 4        4   0.8 98 20          1         2        1        1        1        1
## 5        5   0.6 34 20          3         2        1        1        1        1
##   mix_dist inc_cortical_thickness inc_cortical_surface_area
## 1        1                      1                         1
## 2        1                      1                         0
## 3        1                      0                         1
## 4        1                      1                         1
## 5        1                      1                         1
##   inc_subcortical_volume inc_household_income inc_pubertal_status
## 1                      1                    1                   1
## 2                      0                    1                   1
## 3                      1                    1                   1
## 4                      1                    1                   1
## 5                      1                    1                   1

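The columns in a settings_df class object include solution (the row label), the SNF hyperparameters alpha, k, and t, the snf_scheme and clust_alg columns, one distance function column per variable type (cnt_dist, dsc_dist, ord_dist, cat_dist, and mix_dist), and one inc_ column per data list component recording whether that component is included (1) or dropped (0).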

By default, the alpha and k hyperparameters are randomly varied from 0.3 to 0.8 and from 10 to 100 respectively, based on suggestions from the original SNF paper. The t hyperparameter stays fixed at 20 by default. The snf_scheme column varies randomly from 1 to 3, corresponding to the three different schemes that are available. The clust_alg column randomly varies between 1 and 2 for the two default clustering algorithm functions: (1) spectral clustering using the eigen-gap heuristic to calculate the number of clusters and (2) spectral clustering using the rotation cost heuristic. The distance columns will always be 1 by default, as there is only one default distance metric function per variable type: simple Euclidean for anything numeric and Gower’s distance for anything mixed or categorical.
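To make the default ranges concrete, here is a minimal base R sketch that draws settings from those same ranges. It is an illustration only, not the sampling code used by snf_config(): the step size of 0.1 for alpha is an assumption inferred from the printed values, and the data frame name toy_settings is hypothetical.

```r
# Illustrative only: mimic the default ranges described above.
# This is NOT the implementation used by snf_config().
set.seed(123)
n_solutions <- 5

toy_settings <- data.frame(
    solution = seq_len(n_solutions),
    # alpha in [0.3, 0.8]; the 0.1 step is an assumption
    alpha = sample(seq(0.3, 0.8, by = 0.1), n_solutions, replace = TRUE),
    # k in [10, 100]
    k = sample(10:100, n_solutions, replace = TRUE),
    # t fixed at 20
    t = 20,
    # one of three SNF schemes
    snf_scheme = sample(1:3, n_solutions, replace = TRUE),
    # one of the two default clustering functions
    clust_alg = sample(1:2, n_solutions, replace = TRUE)
)

toy_settings
```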

The distance functions list

The distance functions list is simply a list of functions capable of converting a data frame into a distance matrix. Distance functions within the list are organized by the type of variable they handle: continuous, discrete, ordinal, categorical, or mixed (any combination of the preceding four).

dfl <- sc$"dist_fns_list"

dfl
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
names(dfl)
## [1] "cnt_dist_fns" "dsc_dist_fns" "ord_dist_fns" "cat_dist_fns" "mix_dist_fns"
dfl$"cnt_dist_fns"[[1]]
## function (df, weights_row) 
## {
##     weights <- diag(weights_row, nrow = length(weights_row))
##     weighted_df <- as.matrix(df) %*% weights
##     distance_matrix <- as.matrix(stats::dist(weighted_df, method = "euclidean"))
##     return(distance_matrix)
## }
## <bytecode: 0x56265ea3bc58>
## <environment: namespace:metasnf>

You can learn more about customizing distance metrics in the distance metrics vignette.
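As a rough sketch of what a custom distance function might look like, the example below follows the same (df, weights_row) signature as the euclidean_distance function printed above but swaps in Manhattan distance. The function name manhattan_sketch and the toy data are hypothetical, and how such a function gets registered in a config is left to the distance metrics vignette.

```r
# Hypothetical distance function with the same signature as the
# euclidean_distance function printed above.
manhattan_sketch <- function(df, weights_row) {
    # Scale each feature (column) by its weight
    weights <- diag(weights_row, nrow = length(weights_row))
    weighted_df <- as.matrix(df) %*% weights
    # Pairwise Manhattan distances between observations (rows)
    distance_matrix <- as.matrix(
        stats::dist(weighted_df, method = "manhattan")
    )
    return(distance_matrix)
}

# Three observations, two features, equal weights
toy_df <- data.frame(x = c(0, 1, 3), y = c(0, 2, 2))
toy_dist <- manhattan_sketch(toy_df, weights_row = c(1, 1))
toy_dist
```

The result is a symmetric 3 x 3 matrix with zeros on the diagonal, which is the shape a distance function is expected to return for three observations.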

The clustering functions list

The clustering functions list is similarly a list of functions capable of converting a similarity matrix into a cluster solution (numeric vector).

cfl <- sc$"clust_fns_list"

cfl
## [1] spectral_eigen
## [2] spectral_rot
names(cfl)
## [1] "spectral_eigen" "spectral_rot"
cfl[[1]]
## function (similarity_matrix) 
## {
##     estimated_n <- estimate_nclust_given_graph(W = similarity_matrix, 
##         NUMC = 2:10)
##     nclust_estimate <- estimated_n$`Eigen-gap best`
##     solution <- SNFtool::spectralClustering(similarity_matrix, 
##         nclust_estimate)
##     return(solution)
## }
## <bytecode: 0x56265df73ac0>
## <environment: namespace:metasnf>

You can learn more about customizing clustering functions in the clustering algorithms vignette.
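For intuition, a clustering function only needs to accept a similarity matrix and return one cluster label per observation. The sketch below is a hypothetical example in base R that uses average-linkage hierarchical clustering cut into two clusters. The name hclust_two_sketch, the toy matrix, and the choice of algorithm are all assumptions made for illustration.

```r
# Hypothetical clustering function: similarity matrix in,
# vector of cluster labels out.
hclust_two_sketch <- function(similarity_matrix) {
    # Convert similarities to dissimilarities
    dissimilarity <- stats::as.dist(
        max(similarity_matrix) - similarity_matrix
    )
    tree <- stats::hclust(dissimilarity, method = "average")
    # Cut the tree into exactly two clusters
    solution <- stats::cutree(tree, k = 2)
    return(solution)
}

# Four observations: 1 and 2 are similar, 3 and 4 are similar
toy_sim <- matrix(
    c(
        1.0, 0.9, 0.1, 0.2,
        0.9, 1.0, 0.2, 0.1,
        0.1, 0.2, 1.0, 0.8,
        0.2, 0.1, 0.8, 1.0
    ),
    nrow = 4,
    byrow = TRUE
)

toy_solution <- hclust_two_sketch(toy_sim)
toy_solution
```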

The weights matrix

wm <- sc$"weights_matrix"

wm
## Weights defined for 5 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1 
## $ mrisdp_2 1, 1, 1, 1, 1 
## $ mrisdp_3 1, 1, 1, 1, 1 
## $ mrisdp_4 1, 1, 1, 1, 1 
## $ mrisdp_5 1, 1, 1, 1, 1 
## …and 329 more features.
class(wm) <- "matrix"

wm[1:5, 1:5]
##      mrisdp_1 mrisdp_2 mrisdp_3 mrisdp_4 mrisdp_5
## [1,]        1        1        1        1        1
## [2,]        1        1        1        1        1
## [3,]        1        1        1        1        1
## [4,]        1        1        1        1        1
## [5,]        1        1        1        1        1

There’s one row in the weights matrix corresponding to every row in the settings data frame and one column for every feature in the data list. By default, all the weights are set to 1, so no weighting occurs.
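To see what a row of weights does, the sketch below reuses the body of the euclidean_distance function printed earlier under a hypothetical name (euclidean_sketch) and applies it to a two-row toy data frame. The toy values are made up for illustration.

```r
# Same body as the euclidean_distance function printed earlier,
# copied under a hypothetical name so the example is self-contained.
euclidean_sketch <- function(df, weights_row) {
    weights <- diag(weights_row, nrow = length(weights_row))
    weighted_df <- as.matrix(df) %*% weights
    distance_matrix <- as.matrix(
        stats::dist(weighted_df, method = "euclidean")
    )
    return(distance_matrix)
}

# Two observations that differ by 3 in feature a and 4 in feature b
toy_df <- data.frame(a = c(0, 3), b = c(0, 4))

# All weights 1: ordinary Euclidean distance, sqrt(3^2 + 4^2)
d_equal <- euclidean_sketch(toy_df, c(1, 1))[1, 2]

# Weight 0 on b: feature b is ignored, leaving only the gap in a
d_drop_b <- euclidean_sketch(toy_df, c(1, 0))[1, 2]

# Weight 2 on a: feature a counts double, sqrt(6^2 + 4^2)
d_double_a <- euclidean_sketch(toy_df, c(2, 1))[1, 2]

c(d_equal, d_drop_b, d_double_a)
```

With all weights at 1 the multiplication leaves the data unchanged, which is why the default weights matrix results in no weighting.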

Customizing an SNF config

When no parameters beyond the number of solutions are specified, snf_config() randomly varies the values in the settings data frame.

# Default behaviour: all settings are varied randomly
sc <- snf_config(
    dl,
    n_solutions = 100
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
##                            1    2    3    4    5    6    7    8    9   10
## SNF hyperparameters:
## alpha                    0.3  0.6  0.8  0.6  0.8  0.6  0.3  0.8  0.4  0.4
## k                         81   64   17   71   82   90   11   51   91   30  
## t                         20   20   20   20   20   20   20   20   20   20  
## SNF scheme:
##                            2    1    3    1    2    3    3    1    1    1  
## Clustering functions:
##                            2    1    1    1    1    2    1    1    1    1  
## Distance functions:
## CNT                        1    1    1    1    1    1    1    1    1    1  
## DSC                        1    1    1    1    1    1    1    1    1    1  
## ORD                        1    1    1    1    1    1    1    1    1    1  
## CAT                        1    1    1    1    1    1    1    1    1    1  
## MIX                        1    1    1    1    1    1    1    1    1    1  
## Component dropout:
## cortical_thickness         ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## cortical_surface_area      ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## subcortical_volume         ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✖    ✔    ✔  
## household_income           ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## pubertal_status            ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## …and settings defined to create 90 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 100 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## …and 329 more features.

Alpha, k, and t

You can control any of these parameters either by providing a vector of values you’d like to randomly sample from or by specifying a minimum and maximum range.

# Through minimums and maximums
sc <- snf_config(
    dl,
    n_solutions = 100,
    min_k = 10,
    max_k = 60,
    min_alpha = 0.3,
    max_alpha = 0.8
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
# Through specific value sampling
sc <- snf_config(
    dl,
    n_solutions = 20,
    k_values = c(10, 25, 50),
    alpha_values = c(0.4, 0.8)
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.

Inclusion columns and data frame dropout

You can control both the bounds on the number of input data frames removed and the way in which that number is chosen.

By default, the settings_df generated during the call to snf_config() will pick, for each row, a random number of data frames to drop, ranging from 0 to one less than the total number of data frames in the data list, based on an exponential probability distribution. The exponential distribution makes it very likely that a small number of data frames will be dropped and much less likely that a large number will be dropped. In the printed settings, a dropped data frame is shown as a red ✖ and an included one as a green ✔.

You can control the distribution by changing the dropout_dist value to “uniform” (which will result in a much higher number of data frames being dropped on average) or “none” (which will result in no data frames being dropped).
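The difference between the two distributions can be illustrated with a small base R sketch. The decay rate of 1 used for the exponential weights is an arbitrary assumption; the actual probabilities used by snf_config() are not shown in this vignette and may differ.

```r
# Illustrative comparison of exponential-like and uniform dropout.
# The decay rate of 1 is an assumption, not metasnf's actual value.
n_frames <- 5
possible_drops <- 0:(n_frames - 1)  # at least one frame must remain

# Exponentially decaying probabilities, normalized to sum to 1
exp_probs <- exp(-possible_drops)
exp_probs <- exp_probs / sum(exp_probs)

# Uniform probabilities
unif_probs <- rep(1 / length(possible_drops), length(possible_drops))

# Expected number of dropped data frames under each distribution
expected_exp <- sum(possible_drops * exp_probs)
expected_unif <- sum(possible_drops * unif_probs)

round(c(exponential = expected_exp, uniform = expected_unif), 2)

# Simulated number of dropped frames for 20 rows of settings
set.seed(1)
drops_exp <- sample(possible_drops, 20, replace = TRUE, prob = exp_probs)
table(drops_exp)
```

Under these assumed weights the exponential option drops well under one data frame per row on average, while the uniform option drops two.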

# Exponential dropping
sc <- snf_config(
    dl,
    n_solutions = 20,
    dropout_dist = "exponential" # the default behaviour
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
##                            1    2    3    4    5    6    7    8    9   10
## SNF hyperparameters:
## alpha                    0.4  0.7  0.6  0.5  0.8  0.3  0.8  0.3  0.8  0.4
## k                         88   17   53   27   82   90   31   97   59   14  
## t                         20   20   20   20   20   20   20   20   20   20  
## SNF scheme:
##                            3    2    1    1    2    1    2    2    3    3  
## Clustering functions:
##                            2    2    1    1    2    1    1    1    1    1  
## Distance functions:
## CNT                        1    1    1    1    1    1    1    1    1    1  
## DSC                        1    1    1    1    1    1    1    1    1    1  
## ORD                        1    1    1    1    1    1    1    1    1    1  
## CAT                        1    1    1    1    1    1    1    1    1    1  
## MIX                        1    1    1    1    1    1    1    1    1    1  
## Component dropout:
## cortical_thickness         ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## cortical_surface_area      ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## subcortical_volume         ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## household_income           ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## pubertal_status            ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## …and settings defined to create 10 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 20 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## …and 329 more features.
# Uniform dropping
sc <- snf_config(
    dl,
    n_solutions = 20,
    dropout_dist = "uniform"
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
##                            1    2    3    4    5    6    7    8    9   10
## SNF hyperparameters:
## alpha                    0.7  0.3  0.6  0.5  0.6  0.3  0.8  0.4  0.5  0.5
## k                         20   75   61   89   11   56   99   58   69   91  
## t                         20   20   20   20   20   20   20   20   20   20  
## SNF scheme:
##                            2    2    1    2    2    3    3    2    2    1  
## Clustering functions:
##                            2    1    2    1    2    2    1    2    2    2  
## Distance functions:
## CNT                        1    1    1    1    1    1    1    1    1    1  
## DSC                        1    1    1    1    1    1    1    1    1    1  
## ORD                        1    1    1    1    1    1    1    1    1    1  
## CAT                        1    1    1    1    1    1    1    1    1    1  
## MIX                        1    1    1    1    1    1    1    1    1    1  
## Component dropout:
## cortical_thickness         ✖    ✖    ✖    ✖    ✖    ✔    ✔    ✔    ✔    ✖  
## cortical_surface_area      ✖    ✖    ✖    ✔    ✖    ✔    ✖    ✔    ✔    ✖  
## subcortical_volume         ✖    ✖    ✔    ✖    ✖    ✔    ✔    ✖    ✔    ✖  
## household_income           ✔    ✔    ✔    ✖    ✖    ✔    ✔    ✔    ✔    ✔  
## pubertal_status            ✖    ✖    ✖    ✖    ✔    ✔    ✔    ✔    ✔    ✖  
## …and settings defined to create 10 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 20 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## …and 329 more features.
# No dropping
sc <- snf_config(
    dl,
    n_solutions = 20,
    dropout_dist = "none"
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
##                            1    2    3    4    5    6    7    8    9   10
## SNF hyperparameters:
## alpha                    0.5  0.8  0.4  0.8  0.8  0.7  0.7  0.6  0.8  0.7
## k                         37   12   20   79   86   40   48   96   78   55  
## t                         20   20   20   20   20   20   20   20   20   20  
## SNF scheme:
##                            2    2    3    3    2    3    2    3    1    2  
## Clustering functions:
##                            2    2    2    2    2    1    1    2    2    1  
## Distance functions:
## CNT                        1    1    1    1    1    1    1    1    1    1  
## DSC                        1    1    1    1    1    1    1    1    1    1  
## ORD                        1    1    1    1    1    1    1    1    1    1  
## CAT                        1    1    1    1    1    1    1    1    1    1  
## MIX                        1    1    1    1    1    1    1    1    1    1  
## Component dropout:
## cortical_thickness         ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## cortical_surface_area      ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## subcortical_volume         ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## household_income           ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## pubertal_status            ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## …and settings defined to create 10 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 20 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## …and 329 more features.

The bounds on the number of data frames that can be dropped can be controlled using the min_removed_inputs and max_removed_inputs parameters:

sc <- snf_config(
    dl,
    n_solutions = 20,
    min_removed_inputs = 3
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
# No row will exclude fewer than 3 data frames during SNF
sc
## Settings Data Frame:
##                            1    2    3    4    5    6    7    8    9   10
## SNF hyperparameters:
## alpha                    0.3  0.5  0.6  0.3  0.7  0.5  0.3  0.3  0.7  0.5
## k                         56   36   57   74   63   90   82   63   37   23  
## t                         20   20   20   20   20   20   20   20   20   20  
## SNF scheme:
##                            1    2    1    3    2    2    2    3    3    2  
## Clustering functions:
##                            2    2    1    1    1    2    2    1    1    1  
## Distance functions:
## CNT                        1    1    1    1    1    1    1    1    1    1  
## DSC                        1    1    1    1    1    1    1    1    1    1  
## ORD                        1    1    1    1    1    1    1    1    1    1  
## CAT                        1    1    1    1    1    1    1    1    1    1  
## MIX                        1    1    1    1    1    1    1    1    1    1  
## Component dropout:
## cortical_thickness         ✖    ✖    ✔    ✔    ✔    ✔    ✖    ✔    ✖    ✔  
## cortical_surface_area      ✖    ✔    ✖    ✖    ✖    ✖    ✔    ✖    ✖    ✖  
## subcortical_volume         ✔    ✔    ✖    ✖    ✔    ✖    ✔    ✔    ✖    ✖  
## household_income           ✔    ✖    ✔    ✖    ✖    ✖    ✖    ✖    ✔    ✔  
## pubertal_status            ✖    ✖    ✖    ✔    ✖    ✔    ✖    ✖    ✔    ✖  
## …and settings defined to create 10 more cluster solutions.
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 20 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… 
## …and 329 more features.
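One way to check dropout bounds is to count the zeros in the inc_ columns of each row. The sketch below defines a hypothetical helper, count_dropped(), and runs it on a hand-typed data frame that copies the inc_ columns from the five-row as.data.frame() printout shown earlier, so it can be run without metasnf.

```r
# Hypothetical helper: number of dropped data frames per row,
# based on the inc_ columns of a plain data frame of settings.
count_dropped <- function(settings) {
    inc_cols <- grep("^inc_", names(settings), value = TRUE)
    rowSums(settings[, inc_cols, drop = FALSE] == 0)
}

# inc_ columns copied by hand from the five-row printout shown earlier
toy_settings <- data.frame(
    solution = 1:5,
    inc_cortical_thickness = c(1, 1, 0, 1, 1),
    inc_cortical_surface_area = c(1, 0, 1, 1, 1),
    inc_subcortical_volume = c(1, 0, 1, 1, 1),
    inc_household_income = c(1, 1, 1, 1, 1),
    inc_pubertal_status = c(1, 1, 1, 1, 1)
)

dropped <- count_dropped(toy_settings)
dropped
```

With a real config, the same helper could be applied to `as.data.frame(sc$"settings_df")`. For a config built with min_removed_inputs = 3, every count should be at least 3.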

Grid searching

If you are interested in searching over just a specific set of alpha and k values, you may want to consider varying those parameters and keeping everything else fixed:

sc <- snf_config(
    dl,
    n_solutions = 10,
    alpha_values = c(0.3, 0.5, 0.8),
    k_values = c(20, 40, 60),
    dropout_dist = "none"
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc
## Settings Data Frame:
##                            1    2    3    4    5    6    7    8    9   10
## SNF hyperparameters:
## alpha                    0.8  0.5  0.8  0.3  0.5  0.3  0.5  0.5  0.8  0.5
## k                         40   40   40   20   20   60   40   60   60   20  
## t                         20   20   20   20   20   20   20   20   20   20  
## SNF scheme:
##                            1    2    3    1    2    3    3    2    1    3  
## Clustering functions:
##                            1    2    2    2    2    1    2    1    2    2  
## Distance functions:
## CNT                        1    1    1    1    1    1    1    1    1    1  
## DSC                        1    1    1    1    1    1    1    1    1    1  
## ORD                        1    1    1    1    1    1    1    1    1    1  
## CAT                        1    1    1    1    1    1    1    1    1    1  
## MIX                        1    1    1    1    1    1    1    1    1    1  
## Component dropout:
## cortical_thickness         ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## cortical_surface_area      ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## subcortical_volume         ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## household_income           ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## pubertal_status            ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔    ✔  
## Distance Functions List:
## Continuous (1):
## [1] euclidean_distance
## Discrete (1):
## [1] euclidean_distance
## Ordinal (1):
## [1] euclidean_distance
## Categorical (1):
## [1] gower_distance
## Mixed (1):
## [1] gower_distance
## Clustering Functions List:
## [1] spectral_eigen
## [2] spectral_rot
## Weights Matrix:
## Weights defined for 10 cluster solutions.
## $ mrisdp_1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 
## $ mrisdp_2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 
## $ mrisdp_3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 
## $ mrisdp_4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 
## $ mrisdp_5 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 
## …and 329 more features.
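Note that the printed settings above contain repeated alpha and k pairs (for example, solutions 2 and 7 both use alpha = 0.5 and k = 40), so the values are sampled rather than enumerated exhaustively. If you want to see what the full grid of combinations looks like, base R's expand.grid() can list it. This sketch only enumerates the combinations; it does not show how to pass them to metasnf.

```r
# Enumerate every alpha/k combination from the values used above
alpha_values <- c(0.3, 0.5, 0.8)
k_values <- c(20, 40, 60)

full_grid <- expand.grid(alpha = alpha_values, k = k_values)

# 3 alpha values x 3 k values = 9 combinations
nrow(full_grid)
full_grid
```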

Assembling an SNF config in pieces

Rather than varying everything equally all at once, you may be interested in looking at “chunks” of solution spaces that are based on distinct SNF configs. For example, you may want to look at 25 solutions generated with k = 50 and look at another 25 solutions generated with k = 80. You can build two separate SNF configs and join them using the merge() function.

set.seed(42)
sc_1 <- snf_config(
    dl,
    n_solutions = 25,
    k_values = 50
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
sc_2 <- snf_config(
    dl,
    n_solutions = 25,
    k_values = 80
)
## ℹ No distance functions specified. Using defaults.
## ℹ No clustering functions specified. Using defaults.
full_sc <- merge(sc_1, sc_2)

“settings_df building failed to converge”

snf_config() will never build duplicate rows. As a consequence, if you request a very large number of rows over a very small range of possible values, it will be impossible to build the settings data frame. For example, there’s no way to generate 10 unique rows when the only varying parameter is which of two clustering algorithms is used: only 2 rows could ever be created. If you encounter this error (or the message “Matrix building failed”), try generating fewer rows or loosening the restrictions on which values are allowed.
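A quick way to reason about this limit is to multiply the number of distinct values each varying setting can take. The helper below, max_unique_rows(), is a hypothetical base R sketch of that arithmetic and is not part of metasnf.

```r
# Hypothetical helper: upper bound on the number of unique rows,
# given the set of allowed values for each varying setting.
max_unique_rows <- function(value_sets) {
    prod(lengths(value_sets))
}

# Only the clustering algorithm varies: at most 2 unique rows,
# so asking for 10 unique rows cannot succeed
only_clust <- max_unique_rows(list(clust_alg = 1:2))
only_clust

# Allowing alpha and k to vary as well raises the bound
with_alpha_k <- max_unique_rows(list(
    alpha = c(0.3, 0.5, 0.8),
    k = c(20, 40, 60),
    clust_alg = 1:2
))
with_alpha_k
```

If the number of rows you request approaches or exceeds this bound, requesting fewer rows or widening the allowed values is the natural fix.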