---
title: "bigANNOY Versus bigKNN"
output:
  litedown::html_format:
    meta:
      css: ["@default"]
---

<!--
%\VignetteEngine{litedown::vignette}
%\VignetteIndexEntry{bigANNOY Versus bigKNN}
%\VignetteEncoding{UTF-8}
-->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(bigANNOY.progress = FALSE)
set.seed(20260326)
```

`bigANNOY` and `bigKNN` are meant to complement each other, not compete for the
same role.

- `bigKNN` gives you **exact Euclidean neighbours**
- `bigANNOY` gives you **fast approximate neighbours through persisted Annoy
  indexes**

That makes them a natural pair:

- use `bigKNN` when exactness is the requirement
- use `bigANNOY` when scale and latency matter more than perfect exactness
- use them together when you want a ground-truth baseline for evaluating an
  approximate workflow

This vignette explains how to think about that split and how to compare the two
packages in practice.

## The Core Difference

At a high level, the packages answer slightly different questions.

`bigKNN` asks:

- what are the exact Euclidean nearest neighbours of each query row?

`bigANNOY` asks:

- which reference rows are very likely to be among the nearest neighbours of
  each query row, according to an approximate Annoy index?

That distinction has consequences:

- exact search is the correctness baseline
- approximate search is the route to take when speed or scale is the constraint
- exact search is usually the right benchmark target for Euclidean workloads
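
To make "exact" concrete, here is a minimal base-R sketch of a brute-force
Euclidean search over dense matrices. The helper `exact_knn_dense()` is
hypothetical and exists only for illustration; it is not how `bigKNN` is
implemented, and it ignores blocking and `big.matrix` inputs entirely.

```r
# Brute-force exact Euclidean k-nearest neighbours in base R.
# `exact_knn_dense()` is a hypothetical helper, not part of bigKNN or bigANNOY.
exact_knn_dense <- function(ref, query, k) {
  stopifnot(is.matrix(ref), is.matrix(query),
            ncol(ref) == ncol(query), k >= 1L, k <= nrow(ref))
  index <- matrix(NA_integer_, nrow(query), k)
  distance <- matrix(NA_real_, nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # Distance from query row i to every reference row.
    diffs <- sweep(ref, 2L, query[i, ], "-")
    d <- sqrt(rowSums(diffs^2))
    # Keep the k smallest distances, nearest first.
    nearest <- order(d)[seq_len(k)]
    index[i, ] <- nearest
    distance[i, ] <- d[nearest]
  }
  list(index = index, distance = distance)
}

# Tiny hand-written example: four reference points, two queries.
ref <- rbind(c(0, 0), c(1, 0), c(0, 2), c(5, 5))
query <- rbind(c(0.1, 0), c(4, 4))
res <- exact_knn_dense(ref, query, k = 2L)
res$index
```

Every query is compared with every reference row, which is what guarantees
exactness and also what an approximate index is designed to avoid.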

## When To Use Which Package

Use `bigKNN` when:

- exact Euclidean neighbours are required
- the result itself is a scientific or statistical reference quantity
- you need a benchmark ground truth for recall measurement
- approximation is not acceptable for the downstream task

Use `bigANNOY` when:

- query latency matters more than exactness
- reference data is large enough that approximate search is operationally
  attractive
- you want a persisted Annoy index that can be reopened and reused
- a small loss in recall is acceptable in exchange for speed

In other words:

- `bigKNN` is the answer when the question is "what is exactly correct?"
- `bigANNOY` is the answer when the question is "what is fast enough while
  still good enough?"

## Shared Result Shape

One of the most useful design choices in `bigANNOY` is that its result object
is intentionally aligned with the one returned by `bigKNN`.

The returned components are conceptually parallel:

- `index`
- `distance`
- `k`
- `metric`
- `n_ref`
- `n_query`
- `exact`
- `backend`

For `bigANNOY`, `exact = FALSE` and `backend = "annoy"`.

That shared shape matters because it makes these workflows much simpler:

- row-by-row comparison of neighbour ids
- inspection of distance matrices under the same indexing conventions
- recall-at-`k` comparisons against an exact Euclidean baseline
- swapping exact and approximate results into the same downstream code more
  easily
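
As a rough illustration of that last point, the sketch below writes one small
function against the shared component names and calls it on two hand-built
lists. Both lists are mock stand-ins with made-up values, not real package
output, and the exact backend label is a placeholder because the string
`bigKNN` actually reports is not assumed here.

```r
# One downstream function that only relies on the shared component names.
# `describe_result()` is a hypothetical helper for illustration.
describe_result <- function(res) {
  stopifnot(is.matrix(res$index), is.matrix(res$distance),
            identical(dim(res$index), dim(res$distance)))
  sprintf("%s backend (%s): %d queries x %d neighbours, metric = %s",
          res$backend,
          if (isTRUE(res$exact)) "exact" else "approximate",
          nrow(res$index), ncol(res$index), res$metric)
}

# Mock results with made-up values; real objects come from the packages.
mock_approx <- list(
  index = matrix(c(1L, 4L, 2L, 3L), nrow = 2),
  distance = matrix(c(0.1, 0.3, 0.5, 0.9), nrow = 2),
  k = 2L, metric = "euclidean", n_ref = 4L, n_query = 2L,
  exact = FALSE, backend = "annoy"
)
mock_exact <- mock_approx
mock_exact$exact <- TRUE
mock_exact$backend <- "exact-placeholder"  # placeholder label, an assumption

describe_result(mock_approx)
describe_result(mock_exact)
```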

## Load the Packages You Need

This vignette always uses `bigANNOY`, together with `bigmemory` for the
reference matrix. The `bigKNN` parts are optional and only run when `bigKNN` is
installed.

```{r}
library(bigANNOY)
library(bigmemory)
```

## A Small Comparison Dataset

We will create a small reference matrix and a separate query matrix. The data
is large enough to show the workflow clearly and small enough to keep the
vignette fast.

```{r}
compare_dir <- tempfile("bigannoy-vs-bigknn-")
dir.create(compare_dir, recursive = TRUE, showWarnings = FALSE)

ref_dense <- matrix(rnorm(120 * 6), nrow = 120, ncol = 6)
query_dense <- matrix(rnorm(15 * 6), nrow = 15, ncol = 6)

ref_big <- as.big.matrix(ref_dense)
dim(ref_big)
dim(query_dense)
```

## Approximate Search with bigANNOY

`bigANNOY` first builds an Annoy index and then searches that persisted index.

```{r}
annoy_index <- annoy_build_bigmatrix(
  ref_big,
  path = file.path(compare_dir, "ref.ann"),
  metric = "euclidean",
  n_trees = 20L,
  seed = 123L,
  load_mode = "eager"
)

approx_result <- annoy_search_bigmatrix(
  annoy_index,
  query = query_dense,
  k = 5L,
  search_k = 100L
)

names(approx_result)
approx_result$exact
approx_result$backend
approx_result$index[1:3, ]
round(approx_result$distance[1:3, ], 3)
```

This is the standard approximate Euclidean workflow in `bigANNOY`.

## Exact Search with bigKNN When Available

If `bigKNN` is installed, the exact Euclidean comparison is straightforward
because the result structure is deliberately similar.

```{r}
if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
  knn_bigmatrix <- get("knn_bigmatrix", envir = asNamespace("bigKNN"))

  exact_result <- knn_bigmatrix(
    ref_big,
    query = query_dense,
    k = 5L,
    metric = "euclidean",
    block_size = 64L,
    exclude_self = FALSE
  )

  list(
    names = names(exact_result),
    exact = exact_result$exact,
    backend = exact_result$backend,
    index_head = exact_result$index[1:3, ],
    distance_head = round(exact_result$distance[1:3, ], 3)
  )
} else {
  "bigKNN is not installed in this session, so the exact comparison example is skipped."
}
```

The exact result uses the same high-level structure, but now `exact` is
expected to be `TRUE` and the backend identifies the exact search path.

## What Does "Aligned Result Shape" Buy You?

The aligned result shape means you can compare exact and approximate neighbour
sets directly, provided both searches used `metric = "euclidean"` and the same
`k`.

When `bigKNN` is available, a simple overlap-style recall comparison looks like
this:

```{r}
if (length(find.package("bigKNN", quiet = TRUE)) > 0L) {
  knn_bigmatrix <- get("knn_bigmatrix", envir = asNamespace("bigKNN"))

  exact_result <- knn_bigmatrix(
    ref_big,
    query = query_dense,
    k = 5L,
    metric = "euclidean",
    block_size = 64L,
    exclude_self = FALSE
  )

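  # Per-query recall: the share of the 5 exact neighbours that the approximate
  # search also returned, ignoring order. The outer mean averages over queries.
  recall_at_5 <- mean(vapply(seq_len(nrow(query_dense)), function(i) {
    length(intersect(approx_result$index[i, ], exact_result$index[i, ])) / 5
  }, numeric(1L)))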

  recall_at_5
} else {
  "Recall example skipped because bigKNN is not installed."
}
```

That is the core evaluation pattern:

- `bigKNN` provides the exact answer
- `bigANNOY` provides the approximate answer
- the overlap between the two tells you how much quality you are giving up
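
The same overlap calculation can be wrapped in a small reusable helper. The
sketch below is one possible shape for it: `recall_at_k()` is a hypothetical
function name, and the two index matrices are tiny hand-written examples
rather than package output.

```r
# Recall-at-k from two neighbour-id matrices with one row per query.
# `recall_at_k()` is a hypothetical helper, not exported by either package.
recall_at_k <- function(approx_index, exact_index) {
  stopifnot(is.matrix(approx_index), is.matrix(exact_index),
            identical(dim(approx_index), dim(exact_index)))
  k <- ncol(exact_index)
  per_query <- vapply(seq_len(nrow(exact_index)), function(i) {
    # Order within a row does not matter, only set membership.
    length(intersect(approx_index[i, ], exact_index[i, ])) / k
  }, numeric(1L))
  list(per_query = per_query, mean = mean(per_query))
}

# Hand-written ids: the first query misses one neighbour, the second is perfect.
exact_idx  <- rbind(c(1L, 2L, 3L), c(4L, 5L, 6L))
approx_idx <- rbind(c(2L, 1L, 9L), c(4L, 5L, 6L))
rec <- recall_at_k(approx_idx, exact_idx)
rec$per_query
rec$mean
```

Keeping the per-query values is useful because a good average can hide a few
queries with poor recall. Note also that when several reference rows are tied
at the same distance, the exact top-`k` is not unique, so id overlap can
slightly understate quality.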

## Why bigANNOY Still Matters When bigKNN Exists

If exact search exists, why use approximate search at all?

Because operationally, the best answer is not always the exact answer.

`bigANNOY` adds capabilities that solve a different problem:

- persisted Annoy indexes that can be reopened across sessions
- approximate search that can answer queries without scanning every reference
  row, which is what makes it attractive for latency-sensitive workloads
- control over the build/search trade-off through `n_trees` and `search_k`
- file-backed and descriptor-oriented workflows around `bigmemory`

So the two packages fit a common progression:

1. use `bigKNN` to establish correctness and a benchmark baseline
2. use `bigANNOY` to explore how much latency you can save
3. compare recall against the exact baseline
4. choose the operating point that is acceptable for the application
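
Step 4 can be sketched as a simple filter over a table of candidate settings:
keep the configurations that meet a recall target, then take the fastest. The
helper `pick_operating_point()` is hypothetical, and the numbers in
`candidates` are placeholders invented for illustration, not measurements. The
column names mirror the benchmark summary columns used elsewhere in this
vignette.

```r
# Choose the fastest configuration that still meets a recall target.
# `pick_operating_point()` is a hypothetical helper for illustration.
pick_operating_point <- function(summary_df, min_recall) {
  meets_target <- !is.na(summary_df$recall_at_k) &
    summary_df$recall_at_k >= min_recall
  ok <- summary_df[meets_target, , drop = FALSE]
  if (nrow(ok) == 0L) {
    return(NULL)  # nothing is good enough at this threshold
  }
  ok[which.min(ok$search_elapsed), , drop = FALSE]
}

# Placeholder values only; substitute rows from your own benchmark runs.
candidates <- data.frame(
  n_trees = c(10L, 20L, 50L),
  search_k = c(50L, 100L, 500L),
  search_elapsed = c(0.01, 0.02, 0.08),
  recall_at_k = c(0.80, 0.95, 1.00)
)

best <- pick_operating_point(candidates, min_recall = 0.9)
best
```

Returning `NULL` when no row qualifies makes the "stay with exact search"
outcome explicit instead of silently picking a configuration that is too lossy.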

## Benchmark Integration

The benchmark helpers in `bigANNOY` already support this pairing directly for
Euclidean workloads. If `bigKNN` is available, they can report exact timing and
recall automatically.

```{r}
bench <- benchmark_annoy_bigmatrix(
  n_ref = 200L,
  n_query = 20L,
  n_dim = 6L,
  k = 5L,
  n_trees = 20L,
  search_k = 100L,
  metric = "euclidean",
  exact = length(find.package("bigKNN", quiet = TRUE)) > 0L,
  path_dir = compare_dir,
  load_mode = "eager"
)

bench$summary[, c(
  "metric",
  "n_trees",
  "search_k",
  "build_elapsed",
  "search_elapsed",
  "exact_elapsed",
  "recall_at_k"
)]
```

This is usually the easiest way to decide whether an approximate search
configuration is worth adopting.

## A Practical Decision Framework

Here is a simple way to decide between the two packages for a Euclidean
workflow.

Start with `bigKNN` when:

- you need the exact answer
- you are still defining the benchmark target
- you do not yet know how much approximation your downstream task tolerates

Move toward `bigANNOY` when:

- exact search is too slow for the intended query workload
- you want a persisted index that can be reopened repeatedly
- you have measured acceptable recall relative to the exact baseline

Keep both in the workflow when:

- you want to monitor approximation quality over time
- you benchmark new `n_trees` or `search_k` settings
- you need a trustworthy exact baseline for evaluation or regression tests
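
For the regression-test case, one lightweight pattern is a guard that fails
loudly when measured recall drops below an agreed floor. The sketch below
assumes a hypothetical helper `assert_min_recall()` and an arbitrary floor of
0.9; both are illustrative choices, not package defaults.

```r
# Fail when measured recall falls below a required floor.
# `assert_min_recall()` and the 0.9 floor are illustrative assumptions.
assert_min_recall <- function(recall, min_recall) {
  stopifnot(is.numeric(recall), length(recall) == 1L, !is.na(recall))
  if (recall < min_recall) {
    stop(sprintf("recall %.3f is below the required %.3f", recall, min_recall),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Passes silently.
assert_min_recall(0.96, min_recall = 0.9)

# Fails; the error message is captured here so the example keeps running.
outcome <- tryCatch(
  assert_min_recall(0.80, min_recall = 0.9),
  error = function(e) conditionMessage(e)
)
outcome
```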

## Important Boundaries

There are also a few boundaries worth keeping clear:

- `bigKNN` is the exact baseline only for Euclidean search
- `bigANNOY` supports additional Annoy metrics beyond Euclidean
- recall comparisons against `bigKNN` only make sense for Euclidean workloads
- an approximate result can be operationally excellent even when it is not
  exactly identical to the true top-`k`

That last point is easy to forget. The question is not whether approximate
search is exact. The question is whether the approximation quality is good
enough for the application you care about.
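
One way to quantify "good enough" beyond id overlap is to compare distances
instead of ids. The sketch below computes, for each query, the ratio of the
approximate `k`-th neighbour distance to the exact one. It assumes both
distance matrices have one row per query with columns sorted from nearest to
farthest, and that the exact `k`-th distance is non-zero. The helper name
`kth_distance_ratio()` and the input values are hypothetical.

```r
# Ratio of approximate to exact k-th neighbour distance, per query.
# Assumes columns are ordered nearest-first; hypothetical helper.
kth_distance_ratio <- function(approx_distance, exact_distance) {
  stopifnot(is.matrix(approx_distance), is.matrix(exact_distance),
            identical(dim(approx_distance), dim(exact_distance)))
  k <- ncol(exact_distance)
  approx_distance[, k] / exact_distance[, k]
}

# Hand-written distances: the first query's 2nd neighbour is slightly farther
# than the true 2nd neighbour, the second query matches exactly.
exact_d  <- rbind(c(1.0, 2.0), c(0.5, 1.0))
approx_d <- rbind(c(1.0, 2.2), c(0.5, 1.0))
ratios <- kth_distance_ratio(approx_d, exact_d)
ratios
```

A ratio of 1 means the approximate search reached a neighbour as close as the
true `k`-th one. Values slightly above 1 indicate a near miss, which many
applications tolerate even though it counts against id-based recall.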

## Recap

The best way to think about the pair is:

- `bigKNN` gives you exact Euclidean truth
- `bigANNOY` gives you fast approximate search on top of persisted Annoy
  indexes
- the shared result shape makes comparison practical
- the benchmark helpers let you quantify the trade-off instead of guessing

If you are beginning a new Euclidean workflow, a strong default is to start
with `bigKNN` as the baseline, then move to `bigANNOY` once latency, scale, or
persisted-index workflows become the limiting factor.