---
title: "Prepared References for Repeated Exact Search"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Prepared References for Repeated Exact Search}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
if (!requireNamespace("bigmemory", quietly = TRUE)) {
  cat("This vignette requires the 'bigmemory' package.\n")
  knitr::knit_exit()
}
library(bigKNN)
library(bigmemory)
```

Prepared references let `bigKNN` cache metric-specific information about a fixed reference matrix and reuse it across later exact searches. They are the right tool when the reference data stays put but queries arrive in batches over time. This article walks through that pattern end to end:

- build a file-backed reference matrix
- prepare it once for cosine distance
- reuse the prepared object across multiple query batches
- stream prepared results into destination `big.matrix` objects
- persist the prepared cache to disk and reload it later

```{r helpers, include=FALSE}
knn_table <- function(result, query_ids, ref_ids) {
  do.call(rbind, lapply(seq_along(query_ids), function(i) {
    data.frame(
      query = query_ids[i],
      rank = seq_len(result$k),
      neighbor = ref_ids[result$index[i, ]],
      distance = signif(result$distance[i, ], 5),
      row.names = NULL
    )
  }))
}
```

# When prepared references help

Prepared references are most useful when:

- the reference matrix stays fixed
- you need to answer many exact query batches
- you want to persist the cache between sessions

They do not change the search result. The advantage is that repeated searches can reuse cached row-wise quantities instead of recomputing them every time.

# Build a file-backed reference

For this vignette we will use a file-backed `big.matrix`, because persisted prepared caches are easiest to demonstrate when the reference can be reattached through files on disk.
```{r create-reference}
scratch_dir <- file.path(tempdir(), "bigknn-prepared-search")
dir.create(scratch_dir, recursive = TRUE, showWarnings = FALSE)

reference_points <- data.frame(
  id = paste0("r", 1:8),
  x1 = c(1, 1, 2, 2, 3, 3, 4, 4),
  x2 = c(1, 2, 1, 2, 2, 3, 3, 4),
  x3 = c(0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.5)
)

reference <- filebacked.big.matrix(
  nrow = nrow(reference_points),
  ncol = 3,
  type = "double",
  backingfile = "reference.bin",
  descriptorfile = "reference.desc",
  backingpath = scratch_dir
)
reference[,] <- as.matrix(reference_points[c("x1", "x2", "x3")])

query_batch_a <- matrix(
  c(1.1, 1.2, 0.5,
    2.7, 2.2, 1.4),
  ncol = 3, byrow = TRUE
)
query_batch_b <- matrix(
  c(3.6, 3.1, 1.9,
    1.5, 1.8, 0.8),
  ncol = 3, byrow = TRUE
)
query_ids_a <- c("a1", "a2")
query_ids_b <- c("b1", "b2")

reference_points
```

All rows are non-zero, which matters because cosine distance requires non-zero reference and query vectors.

# Building a prepared reference with `knn_prepare_bigmatrix()`

```{r prepare-reference}
prepared <- knn_prepare_bigmatrix(reference, metric = "cosine")
prepared
```

Internally, a prepared object stores:

- the external pointer to the reference matrix
- the chosen `metric`
- a metric-specific numeric `row_cache`
- cached dimensions and execution metadata

The print method keeps that summary compact:

```{r prepared-summary}
summary(prepared)
length(prepared$row_cache)
head(prepared$row_cache, 4)
```

For cosine distance, `row_cache` contains row-wise quantities that are reused during later searches. In normal workflows you rarely need to manipulate it directly; it is included here so you can see that a prepared object is more than just a wrapper around the original `big.matrix`.
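The exact contents of `row_cache` are an implementation detail of `bigKNN`, but a short base-R sketch shows the kind of row-wise quantity worth caching for cosine distance: each reference row's Euclidean norm, which appears in the denominator of every cosine distance and never changes while the reference is fixed. Everything in this chunk is illustrative base R, not the package's actual cache.

```{r row-cache-sketch}
# The same coordinates as `reference_points`, as a plain matrix so this
# sketch stands on its own.
ref <- cbind(
  x1 = c(1, 1, 2, 2, 3, 3, 4, 4),
  x2 = c(1, 2, 1, 2, 2, 3, 3, 4),
  x3 = c(0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.5)
)

# Cosine distance between a query q and a reference row r is
#   1 - sum(q * r) / (||q|| * ||r||),
# so the row norms ||r|| can be computed once and reused for every query.
row_norms <- sqrt(rowSums(ref^2))
row_norms

# Distances from the first query in batch A to every reference row,
# reusing the cached norms:
q <- c(1.1, 1.2, 0.5)
1 - as.vector(ref %*% q) / (sqrt(sum(q^2)) * row_norms)
```

With the norms cached, each new query only costs one matrix-vector product plus its own norm, which is exactly the kind of saving prepared references aim for.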
# Reusing it with `knn_search_prepared()`

```{r prepared-search}
batch_a_result <- knn_search_prepared(
  prepared,
  query = query_batch_a,
  k = 2,
  exclude_self = FALSE
)
batch_b_result <- knn_search_prepared(
  prepared,
  query = query_batch_b,
  k = 2,
  exclude_self = FALSE
)

batch_a_result
knn_table(batch_a_result, query_ids = query_ids_a, ref_ids = reference_points$id)
knn_table(batch_b_result, query_ids = query_ids_b, ref_ids = reference_points$id)
```

The result contract is the same as `knn_bigmatrix()`. The difference is that the reference preparation step has already been done, so you can reuse the same `prepared` object across many query batches. To make that explicit, we can compare a prepared search with the one-shot API:

```{r prepared-vs-direct}
direct_batch_a <- knn_bigmatrix(
  reference,
  query = query_batch_a,
  k = 2,
  metric = "cosine",
  exclude_self = FALSE
)
identical(batch_a_result$index, direct_batch_a$index)
all.equal(batch_a_result$distance, direct_batch_a$distance)
```

Prepared search is therefore an ergonomics and performance feature, not a different search algorithm.

# Streaming prepared results with `knn_search_stream_prepared()`

If you want the prepared search to write directly into destination `big.matrix` objects, use `knn_search_stream_prepared()`. This is helpful when the query set is large, or when you want to keep results in shared-memory or file-backed structures instead of dense R matrices.
```{r prepared-stream}
index_store <- big.matrix(nrow(query_batch_b), 2, type = "integer")
distance_store <- big.matrix(nrow(query_batch_b), 2, type = "double")

streamed_batch_b <- knn_search_stream_prepared(
  prepared,
  query = query_batch_b,
  xpIndex = index_store,
  xpDistance = distance_store,
  k = 2,
  exclude_self = FALSE
)

bigmemory::as.matrix(streamed_batch_b$index)
round(bigmemory::as.matrix(streamed_batch_b$distance), 6)
all.equal(
  bigmemory::as.matrix(streamed_batch_b$distance),
  batch_b_result$distance
)
```

The neighbour indices and distances are the same as the in-memory prepared search; the only difference is where the results land.

# Persisting caches with `cache_path`

Prepared references can be serialized with `cache_path`, which is useful when a project repeatedly opens the same file-backed reference over many sessions.

```{r persist-prepared}
cache_path <- file.path(scratch_dir, "prepared-cosine-cache.rds")
prepared_cached <- knn_prepare_bigmatrix(
  reference,
  metric = "cosine",
  cache_path = cache_path
)
prepared_cached
file.exists(cache_path)
```

Persisted prepared references are especially helpful for long-running projects and reproducible pipelines.

# Reloading with `knn_load_prepared()`

```{r load-prepared}
loaded <- knn_load_prepared(cache_path)
loaded
```

`knn_load_prepared()` restores the cached metadata and reattaches the underlying `big.matrix` through its stored descriptor. That means the prepared cache is tied to the original reference backing files: if those files move or disappear, the cache can no longer be reattached.

# Validating with `knn_validate_prepared()`

Validation is usually worth calling after loading a cache from disk, or any time you want to confirm that the descriptor, cached dimensions, and row cache still match the underlying reference.
```{r validate-prepared}
isTRUE(knn_validate_prepared(loaded))
```

Once the cache has been loaded and validated, it behaves like any other prepared reference:

```{r loaded-search}
loaded_batch_b <- knn_search_prepared(
  loaded,
  query = query_batch_b,
  k = 2,
  exclude_self = FALSE
)
identical(loaded_batch_b$index, batch_b_result$index)
all.equal(loaded_batch_b$distance, batch_b_result$distance)
```

# Common failure modes and how to avoid them

- Reusing a cache after the reference data changed: rebuild the prepared object whenever the underlying reference matrix is modified.
- Missing or moved backing files: persisted caches rely on the stored `big.matrix` descriptor, so the reference files need to remain accessible.
- Zero-norm rows with cosine distance: keep `validate = TRUE` when preparing cosine references so incompatible rows are caught early.
- Overusing one-shot search: if the reference is fixed and many query batches are coming, switch from repeated `knn_bigmatrix()` calls to `knn_prepare_bigmatrix()` plus `knn_search_prepared()`.

Prepared references are a small API feature with a big practical payoff: you do the setup work once, and then exact search against the same reference becomes easier to repeat, easier to stream, and easier to persist.
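As a closing sketch, the prepare-once / search-many pattern from this vignette condenses to a few lines. Here `incoming_batches` is a hypothetical list of query matrices standing in for batches that arrive over time, so the chunk is shown with `eval = FALSE`:

```{r workflow-sketch, eval=FALSE}
# Prepare (and persist) once per session, using the functions shown above.
prepared <- knn_prepare_bigmatrix(
  reference,
  metric = "cosine",
  cache_path = file.path(scratch_dir, "prepared-cosine-cache.rds")
)

# Then answer each batch against the same prepared reference.
for (batch in incoming_batches) {  # incoming_batches: hypothetical
  res <- knn_search_prepared(prepared, query = batch, k = 2,
                             exclude_self = FALSE)
  # ...consume res$index and res$distance here...
}
```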