--- title: "Exact Graph Construction from big.matrix Data" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Exact Graph Construction from big.matrix Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- `bigKNN` can build exact neighbour graphs directly from `bigmemory::big.matrix` references. These graph builders are useful when the exact neighbourhood structure is the real output you care about, not only the raw search result. Typical uses include: - clustering and community discovery - manifold learning and neighborhood-preserving embeddings - graph-based preprocessing for downstream models - exact ground truth for evaluating approximate graph builders This vignette introduces the graph families available in `bigKNN`, the symmetrization rules they use, and the output formats that make them easy to inspect or hand off downstream. ```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") if (!requireNamespace("bigmemory", quietly = TRUE)) { cat("This vignette requires the 'bigmemory' package.\n") knitr::knit_exit() } library(bigKNN) library(bigmemory) ``` ```{r helpers, include=FALSE} clean_graph <- function(x) { out <- as.data.frame(x) rownames(out) <- NULL out[order(out$from, out$to), , drop = FALSE] } ``` # Build a Small Reference Set We will work with six points arranged as two well-separated local groups. That geometry keeps the graph outputs easy to read while still showing meaningful differences between directed, mutual, shared-nearest-neighbour, and radius graphs. ```{r create-reference} reference_points <- data.frame( id = paste0("p", 1:6), x1 = c(0.0, 0.3, 1.2, 4.0, 4.3, 5.2), x2 = c(0.0, 0.0, 0.0, 4.0, 4.1, 4.0) ) reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")])) reference_points ``` The returned graph objects always refer to rows of the reference matrix. In other words, vertices `1:6` correspond to `p1:p6`. # Directed `k`-NN graphs with `knn_graph_bigmatrix()` A directed `k`-nearest-neighbour graph stores one outgoing edge per retained neighbour for every row. With `include_distance = TRUE`, each edge value is the exact distance between the source row and the neighbour row. ```{r directed-knn-graph} directed_knn <- knn_graph_bigmatrix( reference, k = 1, format = "edge_list", symmetrize = "none" ) clean_graph(directed_knn) ``` Here: - `from` is the source row - `to` is the chosen neighbour row - `distance` is the exact Euclidean distance This is the most literal graph form: every query row keeps its own directed outgoing edges. If one row points to another but not vice versa, both facts are visible. # Mutual `k`-NN graphs with `mutual_knn_graph_bigmatrix()` Mutual `k`-NN graphs keep only reciprocal neighbour relationships. They are often sparser and more conservative than directed `k`-NN graphs, because one-sided edges are dropped. ```{r mutual-graph} mutual_knn <- mutual_knn_graph_bigmatrix( reference, k = 1, format = "edge_list" ) clean_graph(mutual_knn) ``` In this example, only the pairs `(1, 2)` and `(4, 5)` survive. Vertices `3` and `6` each point toward their nearest local anchor, but that preference is not returned in the other direction, so those edges disappear in the mutual graph. `mutual_knn_graph_bigmatrix()` is equivalent to calling `knn_graph_bigmatrix(..., symmetrize = "mutual")`: ```{r mutual-wrapper} identical( clean_graph(mutual_knn), clean_graph(knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual")) ) ``` # Shared-nearest-neighbour graphs with `snn_graph_bigmatrix()` Shared-nearest-neighbour graphs connect pairs of rows that have overlapping exact neighbour sets. Instead of storing distances, they store a similarity-like weight derived from neighbourhood overlap. `bigKNN` currently supports two weight definitions: - `weight = "count"`: the number of shared neighbours - `weight = "jaccard"`: shared neighbours divided by the union size ```{r snn-graph} snn_count <- snn_graph_bigmatrix( reference, k = 2, weight = "count", format = "edge_list" ) snn_jaccard <- snn_graph_bigmatrix( reference, k = 2, weight = "jaccard", format = "edge_list" ) merge( clean_graph(snn_count), clean_graph(snn_jaccard), by = c("from", "to"), suffixes = c("_count", "_jaccard") ) ``` The edge set is the same for both weight choices in this example, but the weights differ: - `count` emphasizes absolute overlap - `jaccard` normalizes overlap by neighbourhood size That normalization can be helpful when you want weights that are easier to compare across regions of the graph with different neighbour-set structure. # Radius graphs with `radius_graph_bigmatrix()` Radius graphs use a fixed distance threshold instead of a fixed `k`. That means the number of neighbours can vary from row to row. ```{r radius-graph} radius_directed <- radius_graph_bigmatrix( reference, radius = 1.1, format = "edge_list", symmetrize = "none" ) radius_union <- radius_graph_bigmatrix( reference, radius = 1.1, format = "edge_list", symmetrize = "union" ) clean_graph(radius_directed) clean_graph(radius_union) ``` Compared with fixed-`k` graphs: - radius graphs encode a distance scale directly - dense regions can produce more edges than sparse regions - isolated rows may have very few or no neighbours at the chosen radius In this self-search setting with a symmetric metric, `symmetrize = "union"` and `symmetrize = "mutual"` coincide because any radius edge that appears in one direction also appears in the other: ```{r radius-symmetry} identical( clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "union")), clean_graph(radius_graph_bigmatrix(reference, radius = 1.1, format = "edge_list", symmetrize = "mutual")) ) ``` # Symmetrization options and edge semantics For `k`-NN and radius graph builders, `symmetrize` controls how directed edges are collapsed: - `"none"` keeps directed edges as-is - `"union"` keeps a pair if either direction appears - `"mutual"` keeps a pair only when both directions appear The difference is easiest to see on `k = 1`: ```{r symmetrize-comparison} graph_none <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "none") graph_union <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "union") graph_mutual <- knn_graph_bigmatrix(reference, k = 1, format = "edge_list", symmetrize = "mutual") data.frame( symmetrize = c("none", "union", "mutual"), n_edges = c(nrow(clean_graph(graph_none)), nrow(clean_graph(graph_union)), nrow(clean_graph(graph_mutual))), row.names = NULL ) ``` When distances are stored, `bigKNN` keeps the minimum directed distance for the collapsed undirected pair. When unit weights are stored instead (`include_distance = FALSE`), the retained value is the maximum weight, which remains `1` for unweighted graphs. # Converting outputs with `as_edge_list()`, `as_triplet()`, and `as_sparse_matrix()` Different downstream tasks prefer different graph formats: - edge lists are easiest to inspect and export - triplet form is a compact sparse representation in R lists - sparse matrices are convenient for linear algebra and adjacency-style operations ```{r graph-coercion} triplet_graph <- as_triplet(graph_mutual) sparse_graph <- as_sparse_matrix( knn_graph_bigmatrix( reference, k = 1, format = "edge_list", symmetrize = "union", include_distance = FALSE ) ) triplet_graph class(sparse_graph) Matrix::summary(sparse_graph) ``` One subtle but important point: symmetric edge lists and triplets store each undirected pair once, while the sparse matrix representation stores both matrix entries `(i, j)` and `(j, i)`. That is why the sparse summary above shows twice as many non-zero entries as the symmetrized edge list. You can round-trip back to an edge list when needed: ```{r graph-roundtrip} clean_graph(as_edge_list(sparse_graph)) ``` # Using graph outputs downstream Once a graph is in sparse-matrix form, ordinary matrix operations become useful immediately. For an unweighted symmetric adjacency, row sums give vertex degrees: ```{r downstream-usage} degree <- Matrix::rowSums(sparse_graph) data.frame( vertex = reference_points$id, degree = as.numeric(degree), row.names = NULL ) ``` The same graph could also be handed off to a separate graph package after conversion: - use `as_edge_list()` for packages that ingest edge tables - use `as_sparse_matrix()` for adjacency-based workflows - use `as_triplet()` when you want a lightweight sparse representation without constructing a matrix object yet That keeps `bigKNN` focused on exact search and exact graph construction, without forcing a particular downstream graph ecosystem on the user.