---
title: "Benchmarking bigKNN Against Other Exact Implementations"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Benchmarking bigKNN Against Other Exact Implementations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This article summarizes the benchmark recorded in the `bench/` directory for `bigKNN` 0.3.0. The goal is not to claim a universal winner. The goal is to show:

- how `bigKNN` compares with several exact dense k-nearest-neighbour implementations on the same Euclidean workloads
- how `bigKNN` behaves when the reference data is stored as a file-backed `bigmemory::big.matrix`

The benchmark driver is `bench/exact_knn_benchmark.R`, and the full recorded outputs are:

- `bench/exact_knn_benchmark_report.md`
- `bench/exact_knn_benchmark_results.csv`
- `bench/exact_knn_benchmark_validation.csv`

# Comparator Set

The dense exact comparator set in the recorded run was:

- `Rfast::dista(..., index = TRUE)`
- `FNN::get.knnx(..., algorithm = "brute")`
- `dbscan::kNN(..., approx = 0)`
- `RANN::nn2(..., eps = 0)`
- `nabor::knn(..., eps = 0)`
- `BiocNeighbors::queryKNN(..., BNPARAM = KmknnParam(distance = "Euclidean"))`
- a base-R brute-force Euclidean implementation on the smallest case for full index-and-distance validation

`bigKNN` itself was measured in three dense modes:

- `knn_bigmatrix()`
- `knn_search_prepared()`
- `knn_search_stream_prepared()`

For the larger scale runs, the benchmark used file-backed `big.matrix` references and measured:

- `knn_prepare_bigmatrix()`
- `knn_search_prepared()`
- `knn_search_stream_prepared()`

# Recorded Environment

The recorded benchmark was generated on:

- R 4.5.2
- `bigKNN` 0.3.0
- `bigmemory` 4.6.4
- `Rfast` 2.1.5.2
- `FNN` 1.1.4.1
- `dbscan` 1.2.4
- `RANN` 2.6.2
- `nabor` 0.5.0
- `BiocNeighbors` 2.2.0

# Problem Sizes

The benchmark covers three dense comparator cases and two larger file-backed scale cases.
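The two footprint columns in the table below are plain 8-byte-double arithmetic, which is easy to sanity-check in base R. The helper names here are illustrative, not part of `bigKNN`:

```r
# Footprint of a dense double-precision matrix: elements * 8 bytes.
# Helper names are illustrative, not bigKNN functions.
ref_mb      <- function(n_ref, p)       n_ref * p * 8 / 1024^2
pairwise_gb <- function(n_ref, n_query) n_ref * n_query * 8 / 1024^3

round(ref_mb(5000, 16), 2)          # dense_small reference: 0.61
round(pairwise_gb(50000, 1000), 2)  # dense_large full pairwise: 0.37
round(pairwise_gb(200000, 1000), 2) # filebacked_2xlarge full pairwise: 1.49
```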
Case | `n_ref` | `n_query` | `p` | `k` | Reference size | Full dense pairwise matrix
--- | --- | --- | --- | --- | --- | ---
`dense_small` | `5,000` | `250` | `16` | `10` | `0.61 MB` | `0.01 GB`
`dense_medium` | `20,000` | `500` | `16` | `10` | `2.44 MB` | `0.07 GB`
`dense_large` | `50,000` | `1,000` | `16` | `10` | `6.10 MB` | `0.37 GB`
`filebacked_xlarge` | `100,000` | `1,000` | `32` | `10` | `24.41 MB` | `0.75 GB`
`filebacked_2xlarge` | `200,000` | `1,000` | `32` | `10` | `48.83 MB` | `1.49 GB`

The `full_pairwise_gb` column is intentionally included because it highlights what `bigKNN` does not materialize: a full dense query-by-reference distance matrix.

# Correctness

All dense exact comparators matched `bigKNN` on neighbour indices for the recorded cases, and all distance-capable dense comparators matched on distances as well. That matters because the benchmark is intended to compare exact search implementations, not approximate recall. The validation table in `bench/exact_knn_benchmark_validation.csv` is therefore as important as the timing table.

# Dense Benchmark Snapshot

The table below compares the recorded `bigKNN_prepared` median with the fastest external exact comparator in each dense case.

Case | `bigKNN_prepared` median | Fastest external exact comparator
--- | --- | ---
`dense_small` | `0.110 s` | `0.042 s` (`FNN_brute` and `dbscan_kdtree` tie)
`dense_medium` | `0.822 s` | `0.328 s` (`RANN_kd`, `Rfast_index_only`, and `nabor_kd` tie)
`dense_large` | `4.642 s` | `1.289 s` (`nabor_kd`)

The broader dense median ranking in the recorded run showed:

- at the small and medium dense sizes, the in-memory comparator packages were noticeably faster than `bigKNN`
- at the largest dense case in this benchmark, the external exact comparators still clustered around `1.29 s`, while `bigKNN_prepared` remained around `4.64 s`

This is a sensible result.
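It is worth restating what "exact" means in these comparisons: every implementation must return the same neighbours a brute-force scan would. A minimal base-R brute-force baseline, in the spirit of the validation check on the smallest case, looks like the sketch below. The function name and code are illustrative, not the benchmark script's actual implementation:

```r
# Illustrative base-R brute-force exact kNN (Euclidean).
# Not the benchmark script's code; shown to make "exact" concrete.
brute_knn <- function(ref, query, k) {
  idx  <- matrix(0L, nrow(query), k)
  dst  <- matrix(0,  nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # squared Euclidean distance from query row i to every reference row
    d2 <- rowSums(sweep(ref, 2, query[i, ])^2)
    nearest <- order(d2)[seq_len(k)]
    idx[i, ] <- nearest
    dst[i, ] <- sqrt(d2[nearest])
  }
  list(index = idx, distance = dst)
}

set.seed(1)
ref   <- matrix(rnorm(50 * 4), 50, 4)
query <- matrix(rnorm(5 * 4), 5, 4)
res   <- brute_knn(ref, query, k = 3)
```

Any exact implementation in the comparator set should reproduce `res$index` exactly, and `res$distance` up to floating-point tolerance.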
The dense comparator packages are optimized for ordinary in-memory matrix search, while `bigKNN` is designed around `bigmemory::big.matrix`, reusable prepared references, streaming, and larger workflows that do not assume everything should round-trip through ordinary R matrices.

# File-Backed Scale Snapshot

The scale portion of the benchmark focuses on the feature set that is more specific to `bigKNN`: exact search on file-backed `big.matrix` references and streamed output into destination `big.matrix` objects.

Case | `bigKNN_prepare` | `bigKNN_prepared_search` | `bigKNN_stream`
--- | --- | --- | ---
`filebacked_xlarge` (`100,000 x 1,000 x 32`) | `0.039 s` | `9.714 s` | `9.290 s`
`filebacked_2xlarge` (`200,000 x 1,000 x 32`) | `0.078 s` | `18.594 s` | `18.533 s`

Two points stand out:

- preparation overhead is small relative to the search itself
- doubling the reference size produced close to linear growth in search time

That second point is important in practice. It means the file-backed exact path behaves predictably as the reference grows, which is exactly the kind of workflow `bigKNN` is meant to support.

# How To Read These Results

These benchmark numbers should be interpreted in context:

- if your workload is a dense in-memory Euclidean query and you only need the top-`k` neighbours back in ordinary R matrices, the specialist dense comparator packages can be faster
- if your workload is built around `bigmemory::big.matrix`, reusable prepared references, streamed output, resumable jobs, or larger file-backed data, `bigKNN` provides functionality that goes beyond the dense in-memory comparator set

In other words, the benchmark is useful both for performance comparison and for positioning the package.
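For readers who want to try the file-backed path themselves, the overall shape of the workflow is roughly the sketch below. The `bigmemory` calls are standard API; the `bigKNN` call signatures are assumptions inferred from the function names used in this benchmark, so check the package documentation before relying on them:

```r
# Sketch of the file-backed workflow. The bigKNN signatures below are
# ASSUMED from the function names, not a verified API; bigmemory calls
# are real. Sizes are kept small for illustration.
have_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("bigKNN", quietly = TRUE)

if (have_pkgs) {
  # file-backed reference matrix living on disk, not in the R heap
  ref <- bigmemory::filebacked.big.matrix(
    nrow = 10000, ncol = 32, type = "double",
    backingfile = "ref.bin", descriptorfile = "ref.desc",
    backingpath = tempdir()
  )
  ref[, ] <- rnorm(10000 * 32)
  query <- matrix(rnorm(100 * 32), nrow = 100)

  # prepare once, then reuse the prepared reference across searches
  prep <- bigKNN::knn_prepare_bigmatrix(ref)               # assumed signature
  res  <- bigKNN::knn_search_prepared(prep, query, k = 10) # assumed signature
}
```

The cheap prepare step followed by repeated searches mirrors the recorded split between `bigKNN_prepare` and `bigKNN_prepared_search` in the table above.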
# Reproducing The Benchmark

From the package root, rerun the benchmark with:

```sh
env LC_ALL=C LANG=C Rscript bench/exact_knn_benchmark.R
```

The script prefers benchmarking the source tree through `pkgload::load_all()` when available, and otherwise falls back to the installed `bigKNN` package.

# Caveats

Benchmark results are machine-specific. The recorded values in this article are best treated as a reproducible snapshot rather than a universal ranking.

The benchmark is also deliberately split into two stories:

- a dense exact comparison against several external implementations
- a file-backed scale story that highlights the `big.matrix`-oriented workflow of `bigKNN`

That split keeps the benchmark honest about what is truly comparable and what is specific to the package's design goals.