---
title: "Benchmarking bigKNN Against Other Exact Implementations"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Benchmarking bigKNN Against Other Exact Implementations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This article summarizes the benchmark recorded in the `bench/` directory for `bigKNN` 0.3.0. The goal is not to claim a universal winner. The goal is to show:

- how `bigKNN` compares with several exact dense k-nearest-neighbour implementations on the same Euclidean workloads
- how `bigKNN` behaves when the reference data is stored as a file-backed `bigmemory::big.matrix`

The benchmark driver is `bench/exact_knn_benchmark.R`, and the full recorded outputs are:

- `bench/exact_knn_benchmark_report.md`
- `bench/exact_knn_benchmark_results.csv`
- `bench/exact_knn_benchmark_validation.csv`

# Comparator Set

The dense exact comparator set in the recorded run was:

- `Rfast::dista(..., index = TRUE)`
- `FNN::get.knnx(..., algorithm = "brute")`
- `dbscan::kNN(..., approx = 0)`
- `RANN::nn2(..., eps = 0)`
- `nabor::knn(..., eps = 0)`
- `BiocNeighbors::queryKNN(..., BNPARAM = KmknnParam(distance = "Euclidean"))`
- a base-R brute-force Euclidean implementation on the smallest case for full index-and-distance validation

`bigKNN` itself was measured in three dense modes:

- `knn_bigmatrix()`
- `knn_search_prepared()`
- `knn_search_stream_prepared()`

For the larger scale runs, the benchmark used file-backed `big.matrix` references and measured:

- `knn_prepare_bigmatrix()`
- `knn_search_prepared()`
- `knn_search_stream_prepared()`

# Recorded Environment

The recorded benchmark was generated on:

- R 4.5.2
- `bigKNN` 0.3.0
- `bigmemory` 4.6.4
- `Rfast` 2.1.5.2
- `FNN` 1.1.4.1
- `dbscan` 1.2.4
- `RANN` 2.6.2
- `nabor` 0.5.0
- `BiocNeighbors` 2.2.0

# Problem Sizes

The benchmark covers three dense comparator cases and two larger file-backed scale cases.
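The two footprint columns in the table below are plain 8-byte-double arithmetic, which is easy to sanity-check in base R. The helper names here are illustrative, not part of `bigKNN`:

```r
# Footprint of a dense double-precision matrix: elements * 8 bytes.
# Helper names are illustrative, not bigKNN functions.
ref_mb      <- function(n_ref, p)       n_ref * p * 8 / 1024^2
pairwise_gb <- function(n_ref, n_query) n_ref * n_query * 8 / 1024^3

round(ref_mb(5000, 16), 2)          # dense_small reference: 0.61
round(pairwise_gb(50000, 1000), 2)  # dense_large full pairwise: 0.37
round(pairwise_gb(200000, 1000), 2) # filebacked_2xlarge full pairwise: 1.49
```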
Case | `n_ref` | `n_query` | `p` | `k` | Reference size | Full dense pairwise matrix
--- | --- | --- | --- | --- | --- | ---
`dense_small` | `5,000` | `250` | `16` | `10` | `0.61 MB` | `0.01 GB`
`dense_medium` | `20,000` | `500` | `16` | `10` | `2.44 MB` | `0.07 GB`
`dense_large` | `50,000` | `1,000` | `16` | `10` | `6.10 MB` | `0.37 GB`
`filebacked_xlarge` | `100,000` | `1,000` | `32` | `10` | `24.41 MB` | `0.75 GB`
`filebacked_2xlarge` | `200,000` | `1,000` | `32` | `10` | `48.83 MB` | `1.49 GB`

The `full_pairwise_gb` column is intentionally included because it highlights what `bigKNN` does not materialize: a full dense query-by-reference distance matrix.

# Correctness

All dense exact comparators matched `bigKNN` on neighbour indices for the recorded cases, and all distance-capable dense comparators matched on distances as well. That matters because the benchmark is intended to compare exact search implementations, not approximate recall. The validation table in `bench/exact_knn_benchmark_validation.csv` is therefore as important as the timing table.

# Dense Benchmark Snapshot

The table below compares the recorded `bigKNN_prepared` median with the fastest external exact comparator in each dense case.

Case | `bigKNN_prepared` median | Fastest external exact comparator
--- | --- | ---
`dense_small` | `0.110 s` | `0.042 s` (`FNN_brute` and `dbscan_kdtree` tie)
`dense_medium` | `0.822 s` | `0.328 s` (`RANN_kd`, `Rfast_index_only`, and `nabor_kd` tie)
`dense_large` | `4.642 s` | `1.289 s` (`nabor_kd`)

The broader dense median ranking in the recorded run showed:

- at the small and medium dense sizes, the in-memory comparator packages were noticeably faster than `bigKNN`
- at the largest dense case in this benchmark, the external exact comparators still clustered around `1.29 s`, while `bigKNN_prepared` remained around `4.64 s`

This is a sensible result.
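It is worth restating what "exact" means in these comparisons: every implementation must return the same neighbours a brute-force scan would. A minimal base-R brute-force baseline, in the spirit of the validation check on the smallest case, looks like the sketch below. The function name and code are illustrative, not the benchmark script's actual implementation:

```r
# Illustrative base-R brute-force exact kNN (Euclidean).
# Not the benchmark script's code; shown to make "exact" concrete.
brute_knn <- function(ref, query, k) {
  idx  <- matrix(0L, nrow(query), k)
  dst  <- matrix(0,  nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # squared Euclidean distance from query row i to every reference row
    d2 <- rowSums(sweep(ref, 2, query[i, ])^2)
    nearest <- order(d2)[seq_len(k)]
    idx[i, ] <- nearest
    dst[i, ] <- sqrt(d2[nearest])
  }
  list(index = idx, distance = dst)
}

set.seed(1)
ref   <- matrix(rnorm(50 * 4), 50, 4)
query <- matrix(rnorm(5 * 4), 5, 4)
res   <- brute_knn(ref, query, k = 3)
```

Any exact implementation in the comparator set should reproduce `res$index` exactly, and `res$distance` up to floating-point tolerance.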
The dense comparator packages are optimized for ordinary in-memory matrix search, while `bigKNN` is designed around `bigmemory::big.matrix`, reusable prepared references, streaming, and larger workflows that do not assume everything should round-trip through ordinary R matrices.

# File-Backed Scale Snapshot

The scale portion of the benchmark focuses on the feature set that is more specific to `bigKNN`: exact search on file-backed `big.matrix` references and streamed output into destination `big.matrix` objects.

Case | `bigKNN_prepare` | `bigKNN_prepared_search` | `bigKNN_stream`
--- | --- | --- | ---
`filebacked_xlarge` (`100,000 x 1,000 x 32`) | `0.039 s` | `9.714 s` | `9.290 s`
`filebacked_2xlarge` (`200,000 x 1,000 x 32`) | `0.078 s` | `18.594 s` | `18.533 s`

Two points stand out:

- preparation overhead is small relative to the search itself
- doubling the reference size produced close to linear growth in search time

That second point is important in practice. It means the file-backed exact path behaves predictably as the reference grows, which is exactly the kind of workflow `bigKNN` is meant to support.

# How To Read These Results

These benchmark numbers should be interpreted in context:

- if your workload is a dense in-memory Euclidean query and you only need the top-`k` neighbours back in ordinary R matrices, the specialist dense comparator packages can be faster
- if your workload is built around `bigmemory::big.matrix`, reusable prepared references, streamed output, resumable jobs, or larger file-backed data, `bigKNN` provides functionality that goes beyond the dense in-memory comparator set

In other words, the benchmark is useful both for performance comparison and for positioning the package.
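For readers who want to try the file-backed path themselves, the overall shape of the workflow is roughly the sketch below. The `bigmemory` calls are standard API; the `bigKNN` call signatures are assumptions inferred from the function names used in this benchmark, so check the package documentation before relying on them:

```r
# Sketch of the file-backed workflow. The bigKNN signatures below are
# ASSUMED from the function names, not a verified API; bigmemory calls
# are real. Sizes are kept small for illustration.
have_pkgs <- requireNamespace("bigmemory", quietly = TRUE) &&
  requireNamespace("bigKNN", quietly = TRUE)

if (have_pkgs) {
  # file-backed reference matrix living on disk, not in the R heap
  ref <- bigmemory::filebacked.big.matrix(
    nrow = 10000, ncol = 32, type = "double",
    backingfile = "ref.bin", descriptorfile = "ref.desc",
    backingpath = tempdir()
  )
  ref[, ] <- rnorm(10000 * 32)
  query <- matrix(rnorm(100 * 32), nrow = 100)

  # prepare once, then reuse the prepared reference across searches
  prep <- bigKNN::knn_prepare_bigmatrix(ref)               # assumed signature
  res  <- bigKNN::knn_search_prepared(prep, query, k = 10) # assumed signature
}
```

The cheap prepare step followed by repeated searches mirrors the recorded split between `bigKNN_prepare` and `bigKNN_prepared_search` in the table above.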
# Reproducing The Benchmark

From the package root, rerun the benchmark with:

```sh
env LC_ALL=C LANG=C Rscript bench/exact_knn_benchmark.R
```

The script prefers benchmarking the source tree through `pkgload::load_all()` when available, and otherwise falls back to the installed `bigKNN` package.

# Caveats

Benchmark results are machine-specific. The recorded values in this article are best treated as a reproducible snapshot rather than a universal ranking.

The benchmark is also deliberately split into two stories:

- a dense exact comparison against several external implementations
- a file-backed scale story that highlights the `big.matrix`-oriented workflow of `bigKNN`

That split keeps the benchmark honest about what is truly comparable and what is specific to the package's design goals.