Methods Overview

A concise summary of the statistical methods implemented in splikit. For a hands-on walkthrough see the Splikit Manual; the full source is at https://github.com/csglab/splikit.

Local junction variants (LJVs)

Splice junctions are grouped into local junction variants (LJVs): sets of junctions that share either a 5-prime or a 3-prime coordinate. For each junction, splikit builds an inclusion matrix M1 of its per-cell read counts and an exclusion matrix M2 holding the summed counts of the other junctions in its LJV. M1 and M2 are sparse dgCMatrix objects of dimension events × cells. A junction that participates in two LJVs (one per shared coordinate) contributes two rows with different M2 values; downstream code is designed to handle this.
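
The relationship between M1 and M2 can be illustrated with a small sketch. This is not splikit's implementation: it uses dense Python lists instead of sparse matrices, the function name build_m1_m2 and the event labels are hypothetical, and "start"/"end" stand in for the 5-prime and 3-prime coordinates without any strand handling.

```python
from collections import defaultdict

def build_m1_m2(junctions, counts):
    """Illustrative only. junctions: list of (id, start, end);
    counts: dict id -> list of per-cell read counts.
    Returns a list of (event_label, m1_row, m2_row)."""
    # One candidate LJV per shared start and per shared end coordinate.
    groups = defaultdict(list)
    for jid, start, end in junctions:
        groups[("start", start)].append(jid)
        groups[("end", end)].append(jid)

    rows = []
    for (side, coord), members in sorted(groups.items()):
        if len(members) < 2:
            continue  # a lone junction has no alternatives, so no LJV
        n_cells = len(counts[members[0]])
        # Per-cell total over every junction in this LJV.
        total = [sum(counts[j][c] for j in members) for c in range(n_cells)]
        for jid in members:
            m1 = list(counts[jid])
            # Exclusion = LJV total minus the junction's own counts.
            m2 = [t - x for t, x in zip(total, m1)]
            rows.append((f"{jid}@{side}:{coord}", m1, m2))
    return rows

# Junction B shares its start with A and its end with C,
# so it yields two rows with different exclusion counts.
example_junctions = [("A", 100, 200), ("B", 100, 300), ("C", 150, 300)]
example_counts = {"A": [3, 0], "B": [1, 2], "C": [0, 5]}
example_rows = build_m1_m2(example_junctions, example_counts)
```

In this toy input, B appears once in the LJV defined by the shared start (with A) and once in the LJV defined by the shared end (with C), which is the duplicated-row case described above.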

Variable-event selection

find_variable_events() computes, for each event, the per-library binomial deviance of the inclusion ratio M1 / (M1 + M2) against an intercept-only baseline p_hat = sum(M1) / sum(M1 + M2). Events with the largest summed deviance are retained as highly variable.
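
As a rough illustration of the deviance score, the sketch below computes the binomial deviance of one event against a pooled inclusion ratio, then sums it over libraries. It is a minimal sketch under assumptions: the function names are hypothetical, each library is assumed to get its own pooled p_hat, and cells with no reads on the event are skipped.

```python
import math

def binomial_deviance(m1_row, m2_row):
    """Deviance of per-cell inclusion counts against one pooled ratio."""
    total_inc = sum(m1_row)
    total_all = total_inc + sum(m2_row)
    if total_all == 0:
        return 0.0
    p_hat = total_inc / total_all  # intercept-only baseline
    dev = 0.0
    for inc, exc in zip(m1_row, m2_row):
        n = inc + exc
        if n == 0:
            continue  # no coverage in this cell
        # Terms with a zero count contribute 0 (limit of x * log x).
        if inc > 0:
            dev += inc * math.log(inc / (n * p_hat))
        if exc > 0:
            dev += exc * math.log(exc / (n * (1.0 - p_hat)))
    return 2.0 * dev

def summed_deviance(m1_row, m2_row, library):
    """Sum of per-library deviances; library[i] labels cell i.
    Assumption: each library uses its own pooled baseline."""
    by_lib = {}
    for inc, exc, lib in zip(m1_row, m2_row, library):
        rows = by_lib.setdefault(lib, ([], []))
        rows[0].append(inc)
        rows[1].append(exc)
    return sum(binomial_deviance(a, b) for a, b in by_lib.values())
```

An event whose inclusion ratio is the same in every cell scores zero, while an event that is fully included in some cells and fully excluded in others scores high, which is why ranking by this score favours variable events.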

Variable-gene selection

find_variable_genes() offers two methods on the gene-expression matrix: "sum_deviance" fits a per-gene negative-binomial deviance with a method-of-moments theta estimate, and "vst" returns a Seurat-style variance-stabilising transformation.
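
The "sum_deviance" idea can be sketched for a single gene as follows. This is an assumption-laden illustration, not splikit's code: it uses the textbook negative-binomial unit deviance, takes the gene's mean as the fitted value (no size factors or per-library handling), estimates theta as mean^2 / (variance - mean), and falls back to the Poisson deviance when the gene shows no overdispersion. The function name nb_deviance is hypothetical.

```python
import math

def nb_deviance(counts):
    """Negative-binomial deviance of one gene against its mean.
    Illustrative sketch; expects at least two cells."""
    n = len(counts)
    mu = sum(counts) / n
    if mu == 0:
        return 0.0
    var = sum((y - mu) ** 2 for y in counts) / (n - 1)
    # Method-of-moments dispersion: var = mu + mu^2 / theta.
    theta = mu * mu / (var - mu) if var > mu else math.inf

    dev = 0.0
    for y in counts:
        term = y * math.log(y / mu) if y > 0 else 0.0
        if math.isinf(theta):
            term -= y - mu  # Poisson limit (theta -> infinity)
        else:
            term -= (y + theta) * math.log((y + theta) / (mu + theta))
        dev += term
    return 2.0 * dev
```

Under these assumptions a gene expressed at the same level in every cell has zero deviance, and a gene concentrated in a few cells scores higher than one with mild fluctuations around the same mean.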

Event-covariate association

get_pseudo_correlation() fits a per-event binomial logistic GLM of the inclusion ratio on a target covariate by iteratively reweighted least squares, and reports a Cox-Snell / Nagelkerke pseudo-R-squared computed from the residual deviance. This quantifies how strongly each event tracks the covariate (e.g. a cluster label or a gene’s expression).
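
The fit can be sketched for one event and one covariate as below. Several choices here are assumptions rather than documented splikit behaviour: the function names are hypothetical, the sample size n in the pseudo-R-squared formulas is taken to be the number of cells with coverage, and the deviance is used in place of -2 log-likelihood, which is exact only for 0/1 data.

```python
import math

def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def _clamp(p, eps=1e-10):
    return min(max(p, eps), 1.0 - eps)

def fit_logistic_irls(inc, tot, x, n_iter=25, tol=1e-8):
    """Intercept + slope logistic GLM by IRLS. inc: inclusion counts,
    tot: inclusion + exclusion counts, x: covariate, one value per cell."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for y, n, xi in zip(inc, tot, x):
            if n == 0:
                continue
            eta = b0 + b1 * xi
            p = _clamp(_sigmoid(eta))
            w = n * p * (1.0 - p)                      # working weight
            z = eta + (y / n - p) / (p * (1.0 - p))   # working response
            s_w += w; s_wx += w * xi; s_wxx += w * xi * xi
            s_wz += w * z; s_wxz += w * xi * z
        det = s_w * s_wxx - s_wx * s_wx
        if abs(det) < 1e-12:
            break  # covariate is constant or there is no coverage
        # Solve the 2x2 weighted least-squares normal equations.
        nb0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        nb1 = (s_w * s_wxz - s_wx * s_wz) / det
        done = abs(nb0 - b0) + abs(nb1 - b1) < tol
        b0, b1 = nb0, nb1
        if done:
            break
    return b0, b1

def _deviance(inc, tot, probs):
    dev = 0.0
    for y, n, p in zip(inc, tot, probs):
        p = _clamp(p)
        if y > 0:
            dev += y * math.log(y / (n * p))
        if n - y > 0:
            dev += (n - y) * math.log((n - y) / (n * (1.0 - p)))
    return 2.0 * dev

def pseudo_r2(inc, tot, x):
    """Returns (cox_snell, nagelkerke) under the assumptions above."""
    b0, b1 = fit_logistic_irls(inc, tot, x)
    fitted = [_sigmoid(b0 + b1 * xi) for xi in x]
    p_null = sum(inc) / sum(tot)
    d_model = _deviance(inc, tot, fitted)
    d_null = _deviance(inc, tot, [p_null] * len(inc))
    k = sum(1 for n in tot if n > 0)
    cox_snell = 1.0 - math.exp((d_model - d_null) / k)
    ceiling = 1.0 - math.exp(-d_null / k)  # largest attainable Cox-Snell value
    nagelkerke = cox_snell / ceiling if ceiling > 0 else 0.0
    return cox_snell, nagelkerke
```

For example, pseudo_r2([1, 2, 8, 9], [10, 10, 10, 10], [0, 0, 1, 1]) describes an event whose inclusion ratio rises sharply with a binary cluster label; the Nagelkerke value rescales the Cox-Snell value so that a perfect fit reaches 1.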

Implementation

All four kernels are written in C++ via Rcpp / RcppArmadillo with OpenMP parallelism over rows or cells. make_m2() automatically falls back to a data.table batched path when the working set would overflow 32-bit Armadillo indices.