All models are wrong, but some are useful – (Box, 1976)
1 Introduction
Traditional memory-based learning (MBL) methods fit a new local model for each query observation. While effective, this per-query refitting can be computationally expensive and requires access to the full reference library at prediction time.
The liblex() function (library of localised experts) takes a different approach: it pre-computes a library of local models during a build phase, then retrieves and combines relevant experts at prediction time without refitting. This design offers several advantages:
Computational efficiency: Models are fitted once; prediction requires only retrieval and weighted averaging.
Privacy preservation: Only model coefficients and centroids need to be stored (not the original training spectra).
Intrinsic uncertainty: Disagreement among retrieved experts provides a natural measure of prediction uncertainty.
Interpretability: Each expert retains its regression coefficients and variable importance, enabling inspection of local relationships.
The method is introduced and described in detail in Ramirez-Lopez et al. (2026).
2 How liblex works
The liblex algorithm operates in two phases:
2.1 Build phase
Anchor selection: A subset of reference observations (anchors) is selected. Each anchor will generate one local expert model. By default, all observations serve as anchors; for large datasets, a representative subset can be specified via anchor_indices.
Neighborhood construction: For each anchor, the \(k\) nearest neighbors are identified from the full reference set using a dissimilarity measure. Note that neighbors are drawn from all reference observations, not just anchors.
Local model fitting: A partial least squares (PLS) regression model (or a variant such as weighted average PLS or modified PLS) is fitted within each neighborhood, producing regression coefficients. The neighborhood centroid is stored for use during prediction.
Validation (optional): Nearest-neighbor cross-validation assesses performance and optimizes hyperparameters (\(k\) and PLS component range).
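The build steps above can be sketched in a few lines of base R. This is an illustration only, under loud simplifying assumptions: plain lm() stands in for the PLS fit, Euclidean distance for the configurable dissimilarity measure, and each "expert" is reduced to a centroid plus coefficients (liblex() stores more than this).

```r
# Illustrative base-R sketch of the build phase. Assumptions: lm()
# stands in for PLS, Euclidean distance for the dissimilarity measure.
build_library <- function(X, y, k, anchors = seq_len(nrow(X))) {
  lapply(anchors, function(a) {
    # distances from the anchor to ALL reference observations
    d <- sqrt(rowSums((X - matrix(X[a, ], nrow(X), ncol(X), byrow = TRUE))^2))
    nn <- order(d)[seq_len(k)]          # k nearest neighbors (anchor included)
    fit <- lm(y[nn] ~ X[nn, , drop = FALSE])
    list(
      centroid = colMeans(X[nn, , drop = FALSE]),  # stored for retrieval
      coefs    = coef(fit)                         # local model coefficients
    )
  })
}
```

Note that only the centroid and coefficients survive the build phase; the neighborhood spectra themselves are not retained, which is what enables the privacy and storage advantages listed in the introduction.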
2.2 Prediction phase
Expert retrieval: For a new observation, dissimilarities to all stored centroids are computed, and the \(k\) nearest experts are retrieved. The same optimal \(k\) determined during fitting is used here, so the number of anchors should be at least as large as the maximum \(k\) being evaluated.
Weighted combination: Each retrieved expert generates a prediction. These predictions are combined via distance-based kernel weighting, with closer experts receiving higher weights.
Uncertainty estimation: The weighted variance across expert predictions provides a per-sample uncertainty estimate, reflecting disagreement among retrieved experts.
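The three prediction steps can likewise be sketched in base R, under the same simplifying assumptions: Euclidean distance to the stored centroids, a Gaussian kernel over distances normalized by the largest retrieved distance, and hypothetical experts that each carry a centroid and a predict() closure.

```r
# Base-R sketch of the prediction phase: retrieval, kernel-weighted
# combination, and weighted variance as the uncertainty estimate.
predict_from_library <- function(experts, xnew, k) {
  cents <- t(vapply(experts, `[[`, numeric(length(xnew)), "centroid"))
  d <- sqrt(rowSums((cents - matrix(xnew, nrow(cents), ncol(cents),
                                    byrow = TRUE))^2))
  idx <- order(d)[seq_len(k)]                    # retrieve k nearest experts
  preds <- vapply(experts[idx], function(e) e$predict(xnew), numeric(1))
  w <- exp(-(d[idx] / max(d[idx]))^2)            # closer experts weigh more
  w <- w / sum(w)
  mu <- sum(w * preds)
  list(pred = mu,
       uncertainty = sum(w * (preds - mu)^2))    # weighted variance
}
```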
3 Data preparation
library(resemble)
library(prospectr)

data(NIRsoil)

# Wavelengths
wavs <- as.numeric(colnames(NIRsoil$spc))

# Preprocess: detrend + first derivative
NIRsoil$spc_pr <- savitzkyGolay(
  detrend(NIRsoil$spc, wav = wavs),
  m = 1, p = 1, w = 7
)

# Split into training and test sets
# Note: missing values in the response are allowed in liblex
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1, ]
train_y <- NIRsoil$Ciso[NIRsoil$train == 1]
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0, ]
test_y <- NIRsoil$Ciso[NIRsoil$train == 0]

cat("Training set:", nrow(train_x), "observations\n")
Training set: 618 observations
cat("Test set:", nrow(test_x), "observations\n")
Test set: 207 observations
4 Core components
4.1 Neighbor selection
Neighborhood size is specified using neighbors_k() for fixed-\(k\) selection or neighbors_diss() for threshold-based selection. Multiple values can be provided for hyperparameter tuning.
# Fixed neighborhood sizes to evaluate
neighbors_k(k = seq(40, 120, by = 20))
Neighbor selection: fixed k
k : 40, 60, 80, 100, 120
The dissimilarity measure determines how neighbors are identified. Available methods include diss_pca(), diss_pls(), diss_correlation(), diss_cosine(), etc (see the dissimilarity() function for all methods). The moving-window correlation dissimilarity is particularly effective for spectral data:
# Moving-window correlation dissimilarity (recommended for spectra)
diss_correlation(ws = 37, scale = TRUE)
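For intuition, the moving-window correlation idea can be written out in base R: correlate the two spectra within each window of width ws, map each correlation r to [0, 1] via (1 - r) / 2, and average across windows. This is a sketch of the general idea; the exact formula used by diss_correlation() may differ in detail.

```r
# Sketch of a moving-window correlation dissimilarity between two
# spectra x and y, using windows of width ws.
mw_cor_diss <- function(x, y, ws) {
  starts <- seq_len(length(x) - ws + 1)
  mean(vapply(starts, function(i) {
    r <- cor(x[i:(i + ws - 1)], y[i:(i + ws - 1)])
    (1 - r) / 2   # r = 1 -> 0 (identical), r = -1 -> 1 (opposite)
  }, numeric(1)))
}
```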
The default method in the liblex() function to fit the local regressions is the weighted average PLS (waPLS), which combines predictions from multiple PLS component counts. The fit_wapls() constructor specifies the component range and algorithm:
# waPLS with modified PLS algorithm and scaling
fit_wapls(min_ncomp = 3, max_ncomp = 15, scale = TRUE, method = "mpls")
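The combination step behind waPLS can be illustrated with a toy base-R function: predictions from models with different component counts are averaged with weights inversely proportional to each model's error. The 1/RMSE weighting here is an assumption for illustration only; fit_wapls() handles the actual weighting internally.

```r
# Toy illustration of the waPLS combination step. Assumption: weights
# proportional to 1 / RMSE; the package's scheme may differ.
wapls_combine <- function(preds_by_ncomp, rmse_by_ncomp) {
  w <- 1 / rmse_by_ncomp
  w <- w / sum(w)
  sum(w * preds_by_ncomp)   # accurate component counts dominate
}
```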
The plot method for liblex objects visualizes the performance of the fitted experts across different neighborhood sizes and shows the centroids of the neighborhoods. Figure 1 shows the resulting plots for the library built with fixed neighborhood sizes. The top panel compares the performance of the best model obtained for each neighborhood size based on the RMSE, while the bottom panel displays the centroids of the neighborhoods, which are used later in prediction to select the models to be used.
plot(ciso_lib_k)
Figure 1: Top: Best model obtained for each neighborhood size (based on the RMSE). Bottom: Centroids of the neighborhoods, used later in prediction to select the models to be used.
The centroids in Figure 1 represent the average spectra of the neighbors for each expert. During prediction, the similarity between the new observation and these centroids is used to determine which experts to retrieve and how to weight their predictions.
5.2 Using dissimilarity thresholds
Alternatively, neighbors can be selected based on a dissimilarity threshold. The k_min and k_max arguments ensure reasonable neighborhood sizes:
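The selection logic can be sketched in base R: keep every observation within the dissimilarity threshold, then clamp the neighborhood size to the [k_min, k_max] range. This is an illustration of the idea, not the package's implementation.

```r
# Sketch of threshold-based neighbor selection. d is a vector of
# dissimilarities from one anchor to all reference observations.
select_by_threshold <- function(d, threshold, k_min, k_max) {
  n_within <- sum(d <= threshold)
  k <- min(max(n_within, k_min), k_max)  # clamp to [k_min, k_max]
  order(d)[seq_len(k)]                   # indices of the k closest
}
```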
The stored neighborhood centroids can be compared against the spectra to be predicted. Good coverage of the prediction space by the centroids indicates that relevant experts are available:
Figure 2: Neighborhood centroids (blue) compared to test spectra (red)
5.4 Visualizing the regression models
The stored regression coefficients provide insight into how different spectral regions contribute to predictions across the library. Each expert model has its own set of coefficients, reflecting the local relationships learned within its neighborhood. Figure 3 shows an example of how the regression models can be visualized.
Figure 3: Regression coefficients (top) and intercept distribution (bottom) of the local experts
The top panel shows regression coefficients for each expert model across the spectral range, with the mean profile highlighted in red. Regions with high coefficient variability indicate wavelengths where local relationships differ substantially across the reference set. The bottom panel shows the distribution of intercepts, reflecting the range of baseline predictions across experts.
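A plot along these lines can be approximated with base graphics, assuming that the coefficient matrix holds one expert per row and one wavelength per column (consistent with coefficients$B as used elsewhere in this document); whether the intercept is a column of B or stored separately is package-specific, so treat this as a sketch.

```r
# Sketch of the coefficient panel: one grey line per expert, with the
# mean coefficient profile overlaid in red.
plot_expert_coefs <- function(B, wavs) {
  matplot(wavs, t(B), type = "l", lty = 1, col = "grey70",
          xlab = "Wavelength", ylab = "Coefficient")
  lines(wavs, colMeans(B), col = "red", lwd = 2)  # mean profile
}
```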
6 Choosing anchor samples
For large libraries, building models for every observation may be computationally prohibitive. The anchor_indices argument allows specifying a subset of observations as anchors while still using the full library for neighbor retrieval.
6.1 Using k-means sampling
The naes() function from the prospectr package implements the k-means sampling algorithm, which is a reasonable approach for selecting representative anchor samples:
# Select 350 representative anchors using k-means sampling
# on the first 20 principal components
set.seed(1124)
kms <- naes(
  train_x, k = 350, pc = 20, iter.max = 100,
  .center = TRUE, .scale = TRUE
)
anchor_km <- kms$model

cat("Selected", length(anchor_km), "anchors via k-means\n")
The final number of models can be verified by checking the number of rows of the ciso_lib_anchored$coefficients$B matrix, which contains the regression coefficients of each expert.
# Number of experts in the anchored library
n_experts <- nrow(ciso_lib_anchored$coefficients$B)
cat("Number of experts in the anchored library:", n_experts, "\n")
Number of experts in the anchored library: 350
It is good practice to compare the spectra of the samples to be predicted against the centroids of the models stored in the library. Figure 4 shows the centroids of the anchored library compared to the test spectra. This visualization helps assess whether the selected anchors provide good coverage of the spectral space of the test samples, which is crucial for reliable predictions.
Available kernels include "gaussian", "tricube", "triweight", "triangular", "quartic", "parabolic", and "cauchy".
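For intuition, two of these kernels can be written out in base R and turned into normalized combination weights. Normalizing by the largest distance is an assumption for illustration; the package's internal scaling may differ.

```r
# Two of the named kernels, plus a helper that turns distances into
# normalized combination weights.
gaussian_kernel <- function(u) exp(-u^2)
tricube_kernel  <- function(u) ifelse(u < 1, (1 - u^3)^3, 0)

kernel_weights <- function(d, kernel = gaussian_kernel) {
  w <- kernel(d / max(d))  # map distances to [0, 1], then weight
  w / sum(w)               # normalize so the weights sum to 1
}
```

The choice of kernel mainly controls how quickly influence decays with distance: tricube gives zero weight to the farthest retrieved expert, while the Gaussian kernel never fully excludes it.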
7.3 Enforcing specific experts
The enforce_indices argument ensures certain experts are always included in the prediction neighborhood. This is useful when some anchors are known to be from the same domain as the target observations.
# Always include specific experts in predictions
predict(ciso_lib_k, newdata = test_x, enforce_indices = c(1, 5, 10))
The key difference between liblex() and mbl() is when models are fitted:
Aspect         | mbl()                         | liblex()
-------------- | ----------------------------- | -----------------------------
Model fitting  | Per query (on demand)         | Once (build phase)
Prediction     | Fit + predict                 | Retrieve + combine
Storage        | Reference library             | Coefficients + centroids
Uncertainty    | Typically requires resampling | Intrinsic (expert dispersion)
For applications requiring repeated predictions on new data, liblex() offers computational advantages once the library is built.
10 Parallel processing
Note
These functions support parallel execution via the foreach and doParallel packages. However, parallel execution is only beneficial when the workload per iteration is large enough to outweigh the overhead of spawning worker processes and serializing data between them. In practice this means large prediction or reference sets (typically hundreds of observations or more), large neighborhoods, and many PLS components. For small datasets, sequential execution is almost always faster. When in doubt, benchmark both before committing to a parallel workflow.
The following example may not be faster than sequential execution due to the relatively small size of the dataset, but it illustrates how to set up parallel processing for liblex():
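A minimal set-up sketch, assuming a doParallel backend; the liblex() call itself is left commented and schematic, since its arguments follow the earlier examples.

```r
# Sketch: register a parallel backend before building the library,
# then shut it down afterwards.
library(doParallel)

cl <- makeCluster(2)            # small cluster; adjust to your machine
registerDoParallel(cl)

# ciso_lib_par <- liblex(...)   # same call as in the sequential examples

stopCluster(cl)
registerDoSEQ()                 # restore the sequential backend
```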
The chunk_size parameter in liblex_control() controls how many models are processed per parallel task. Larger values reduce scheduling overhead but may cause load imbalance.
11 Summary
The liblex() function provides a retrieval-based approach to memory-based learning that separates model building from prediction. Key features include:
Pre-computed library of local waPLS experts
Retrieval-gated prediction via centroid similarity
Intrinsic uncertainty quantification through expert dispersion
Support for anchor subsampling to handle large libraries (e.g., via naes())
Flexible weighting schemes for combining expert predictions
For large spectral libraries where predictions need to be made repeatedly, this approach offers efficiency and interpretability advantages over traditional per-query refitting methods.
References
Box, G.E., 1976. Science and statistics. Journal of the American Statistical Association 71, 791–799.
Ramirez-Lopez, L., Metz, M., Lesnoff, M., Orellano, C., Perez-Fernandez, E., Plans, M., Breure, T., Behrens, T., Viscarra Rossel, R., Peng, Y., 2026. Rethinking local spectral modelling: From per-query refitting to model libraries. Analytica Chimica Acta.