---
title: "Prepared References for Repeated Exact Search"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Prepared References for Repeated Exact Search}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
if (!requireNamespace("bigmemory", quietly = TRUE)) {
  cat("This vignette requires the 'bigmemory' package.\n")
  knitr::knit_exit()
}
library(bigKNN)
library(bigmemory)
```

Prepared references let `bigKNN` cache metric-specific information about a fixed reference matrix and reuse it across later exact searches. They are the right tool when the reference data stays put but queries arrive in batches over time. This article walks through that pattern end to end:

- build a file-backed reference matrix
- prepare it once for cosine distance
- reuse the prepared object across multiple query batches
- stream prepared results into destination `big.matrix` objects
- persist the prepared cache to disk and reload it later

```{r helpers, include=FALSE}
knn_table <- function(result, query_ids, ref_ids) {
  do.call(rbind, lapply(seq_along(query_ids), function(i) {
    data.frame(
      query = query_ids[i],
      rank = seq_len(result$k),
      neighbor = ref_ids[result$index[i, ]],
      distance = signif(result$distance[i, ], 5),
      row.names = NULL
    )
  }))
}
```

# When prepared references help

Prepared references are most useful when:

- the reference matrix stays fixed
- you need to answer many exact query batches
- you want to persist the cache between sessions

They do not change the search result. The advantage is that repeated searches can reuse cached row-wise quantities instead of recomputing them every time.

# Build a file-backed reference

For this vignette we will use a file-backed `big.matrix`, because persisted prepared caches are easiest to demonstrate when the reference can be reattached through files on disk.
```{r create-reference}
scratch_dir <- file.path(tempdir(), "bigknn-prepared-search")
dir.create(scratch_dir, recursive = TRUE, showWarnings = FALSE)

reference_points <- data.frame(
  id = paste0("r", 1:8),
  x1 = c(1, 1, 2, 2, 3, 3, 4, 4),
  x2 = c(1, 2, 1, 2, 2, 3, 3, 4),
  x3 = c(0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.5)
)

reference <- filebacked.big.matrix(
  nrow = nrow(reference_points),
  ncol = 3,
  type = "double",
  backingfile = "reference.bin",
  descriptorfile = "reference.desc",
  backingpath = scratch_dir
)
reference[,] <- as.matrix(reference_points[c("x1", "x2", "x3")])

query_batch_a <- matrix(
  c(1.1, 1.2, 0.5,
    2.7, 2.2, 1.4),
  ncol = 3, byrow = TRUE
)
query_batch_b <- matrix(
  c(3.6, 3.1, 1.9,
    1.5, 1.8, 0.8),
  ncol = 3, byrow = TRUE
)
query_ids_a <- c("a1", "a2")
query_ids_b <- c("b1", "b2")

reference_points
```

All rows are non-zero, which matters because cosine distance requires non-zero reference and query vectors.

# Building a prepared reference with `knn_prepare_bigmatrix()`

```{r prepare-reference}
prepared <- knn_prepare_bigmatrix(reference, metric = "cosine")
prepared
```

Internally, a prepared object stores:

- the external pointer to the reference matrix
- the chosen `metric`
- a metric-specific numeric `row_cache`
- cached dimensions and execution metadata

The print method keeps that summary compact:

```{r prepared-summary}
summary(prepared)
length(prepared$row_cache)
head(prepared$row_cache, 4)
```

For cosine distance, `row_cache` contains row-wise quantities that are reused during later searches. In normal workflows you rarely need to manipulate it directly; it is included here so you can see that a prepared object is more than just a wrapper around the original `big.matrix`.
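The exact contents of `row_cache` are an implementation detail of `bigKNN`, but a short base-R sketch shows the kind of row-wise quantity worth caching for cosine distance: each reference row's Euclidean norm, which appears in the denominator of every cosine distance and never changes while the reference is fixed. Everything in this chunk is illustrative base R, not the package's actual cache.

```{r row-cache-sketch}
# The same coordinates as `reference_points`, as a plain matrix so this
# sketch stands on its own.
ref <- cbind(
  x1 = c(1, 1, 2, 2, 3, 3, 4, 4),
  x2 = c(1, 2, 1, 2, 2, 3, 3, 4),
  x3 = c(0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.5)
)

# Cosine distance between a query q and a reference row r is
#   1 - sum(q * r) / (||q|| * ||r||),
# so the row norms ||r|| can be computed once and reused for every query.
row_norms <- sqrt(rowSums(ref^2))
row_norms

# Distances from the first query in batch A to every reference row,
# reusing the cached norms:
q <- c(1.1, 1.2, 0.5)
1 - as.vector(ref %*% q) / (sqrt(sum(q^2)) * row_norms)
```

With the norms cached, each new query only costs one matrix-vector product plus its own norm, which is exactly the kind of saving prepared references aim for.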
# Reusing it with `knn_search_prepared()`

```{r prepared-search}
batch_a_result <- knn_search_prepared(
  prepared,
  query = query_batch_a,
  k = 2,
  exclude_self = FALSE
)
batch_b_result <- knn_search_prepared(
  prepared,
  query = query_batch_b,
  k = 2,
  exclude_self = FALSE
)

batch_a_result
knn_table(batch_a_result, query_ids = query_ids_a, ref_ids = reference_points$id)
knn_table(batch_b_result, query_ids = query_ids_b, ref_ids = reference_points$id)
```

The result contract is the same as `knn_bigmatrix()`. The difference is that the reference preparation step has already been done, so you can reuse the same `prepared` object across many query batches. To make that explicit, we can compare a prepared search with the one-shot API:

```{r prepared-vs-direct}
direct_batch_a <- knn_bigmatrix(
  reference,
  query = query_batch_a,
  k = 2,
  metric = "cosine",
  exclude_self = FALSE
)
identical(batch_a_result$index, direct_batch_a$index)
all.equal(batch_a_result$distance, direct_batch_a$distance)
```

Prepared search is therefore an ergonomics and performance feature, not a different search algorithm.

# Streaming prepared results with `knn_search_stream_prepared()`

If you want the prepared search to write directly into destination `big.matrix` objects, use `knn_search_stream_prepared()`. This is helpful when the query set is large, or when you want to keep results in shared-memory or file-backed structures instead of dense R matrices.
```{r prepared-stream}
index_store <- big.matrix(nrow(query_batch_b), 2, type = "integer")
distance_store <- big.matrix(nrow(query_batch_b), 2, type = "double")

streamed_batch_b <- knn_search_stream_prepared(
  prepared,
  query = query_batch_b,
  xpIndex = index_store,
  xpDistance = distance_store,
  k = 2,
  exclude_self = FALSE
)

bigmemory::as.matrix(streamed_batch_b$index)
round(bigmemory::as.matrix(streamed_batch_b$distance), 6)
all.equal(
  bigmemory::as.matrix(streamed_batch_b$distance),
  batch_b_result$distance
)
```

The neighbour indices and distances are the same as the in-memory prepared search; the only difference is where the results land.

# Persisting caches with `cache_path`

Prepared references can be serialized with `cache_path`, which is useful when a project repeatedly opens the same file-backed reference over many sessions.

```{r persist-prepared}
cache_path <- file.path(scratch_dir, "prepared-cosine-cache.rds")
prepared_cached <- knn_prepare_bigmatrix(
  reference,
  metric = "cosine",
  cache_path = cache_path
)
prepared_cached
file.exists(cache_path)
```

Persisted prepared references are especially helpful for long-running projects and reproducible pipelines.

# Reloading with `knn_load_prepared()`

```{r load-prepared}
loaded <- knn_load_prepared(cache_path)
loaded
```

`knn_load_prepared()` restores the cached metadata and reattaches the underlying `big.matrix` through its stored descriptor. That means the prepared cache is tied to the original reference backing files: if those files move or disappear, the cache can no longer be reattached.

# Validating with `knn_validate_prepared()`

Validation is usually worth calling after loading a cache from disk, or any time you want to confirm that the descriptor, cached dimensions, and row cache still match the underlying reference.
```{r validate-prepared}
isTRUE(knn_validate_prepared(loaded))
```

Once the cache has been loaded and validated, it behaves like any other prepared reference:

```{r loaded-search}
loaded_batch_b <- knn_search_prepared(
  loaded,
  query = query_batch_b,
  k = 2,
  exclude_self = FALSE
)
identical(loaded_batch_b$index, batch_b_result$index)
all.equal(loaded_batch_b$distance, batch_b_result$distance)
```

# Common failure modes and how to avoid them

- Reusing a cache after the reference data changed: rebuild the prepared object whenever the underlying reference matrix is modified.
- Missing or moved backing files: persisted caches rely on the stored `big.matrix` descriptor, so the reference files need to remain accessible.
- Zero-norm rows with cosine distance: keep `validate = TRUE` when preparing cosine references so incompatible rows are caught early.
- Overusing one-shot search: if the reference is fixed and many query batches are coming, switch from repeated `knn_bigmatrix()` calls to `knn_prepare_bigmatrix()` plus `knn_search_prepared()`.

Prepared references are a small API feature with a big practical payoff: you do the setup work once, and then exact search against the same reference becomes easier to repeat, easier to stream, and easier to persist.
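As a closing sketch, the prepare-once / search-many pattern from this vignette condenses to a few lines. Here `incoming_batches` is a hypothetical list of query matrices standing in for batches that arrive over time, so the chunk is shown with `eval = FALSE`:

```{r workflow-sketch, eval=FALSE}
# Prepare (and persist) once per session, using the functions shown above.
prepared <- knn_prepare_bigmatrix(
  reference,
  metric = "cosine",
  cache_path = file.path(scratch_dir, "prepared-cosine-cache.rds")
)

# Then answer each batch against the same prepared reference.
for (batch in incoming_batches) {  # incoming_batches: hypothetical
  res <- knn_search_prepared(prepared, query = batch, k = 2,
                             exclude_self = FALSE)
  # ...consume res$index and res$distance here...
}
```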