--- title: "Algorithmic Structure of Optimal Subsampling" output: rmarkdown::html_vignette bibliography: references.bib vignette: > %\VignetteIndexEntry{Algorithmic Structure of Optimal Subsampling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette describes the common algorithmic structure used by the optimal subsampling methods in `subsampling`. The exact model, score, and objective function differ across functions, but the optimal subsampling workflow in this package can be viewed as the following procedure. 1. Draw a first-step pilot subsample. 2. Fit a first-step pilot estimator. 3. Calculate second-step subsampling probabilities (feasible optimal subsampling probabilities). 4. Draw the second-step subsample. 5. Fit the model using the corresponding objective function. 6. Optionally combine the pilot and second-step estimators. The steps below explain the same structure used across the models in this package. ## Full-Data Problem Suppose the full data are $$ \{z_i: i = 1,\ldots,N\}, $$ where $z_i$ may contain a response and covariates. Let $\theta$ denote the model parameter. The full-data estimator can often be written as an optimization problem of the form $$ \widehat{\theta}_{\mathrm{full}} = \arg\min_\theta \frac{1}{N}\sum_{i=1}^N \ell_i(\theta). $$ When $N$ is large, solving the full-data problem may be computationally expensive. The goal of subsampling is to draw a smaller informative subsample that balances computational efficiency and statistical efficiency. ## Step 1: Pilot Estimation Optimal subsampling probabilities usually depend on the unknown model parameter and on quantities that would be expensive to calculate on the full data. The pilot sample provides an initial value or a cheaper approximation to these quantities. Let $S_{\mathrm{plt}}$ be the pilot sample and let $\pi_i^{\mathrm{plt}}$ be the pilot sampling probability. A generic weighted pilot estimator is $$ \widehat{\theta}_{\mathrm{plt}} = \arg\min_\theta \sum_{i \in S_{\mathrm{plt}}} \frac{1}{\pi_i^{\mathrm{plt}}}\ell_i(\theta). $$ Some models use a specialized pilot design. For example, `ssp.relogit()` draws a case-control pilot sample from positive and negative classes, then applies an intercept correction. `ssp.glm.rF()` can use balanced pilot sampling to preserve rare binary features. The purpose is the same: obtain a useful first estimate or approximation without fitting the model on all $N$ observations. ## Step 2: Calculate Raw Importance Scores The pilot estimator is used to assign each observation a raw importance score, say $$ m_i = m(z_i; \widehat{\theta}_{\mathrm{plt}}). $$ This score measures relative importance, not a valid sampling probability. A large $m_i$ means observation $i$ is more useful for the target estimation goal. For example, in `ssp.glm()` with logistic regression, $$ m_i^{\mathrm{optL}} = |y_i - \widehat{p}_i|\,\lVert x_i\rVert. $$ Other models use different scores, but the role of the score is the same: it summarizes how important each observation is for constructing the second-step subsampling probabilities. ## Step 3: Construct Probabilities and Draw the Second-Step Subsample Raw scores are transformed so that they become valid sampling probabilities. Once these probabilities are available, the second-step subsample is drawn directly from the corresponding sampling design. The package uses two main sampling procedures: sampling with replacement and Poisson subsampling. We use $\pi_i$ for both mechanisms, but the meaning is different. For sampling with replacement, a common drawing probability is $$ \pi_i = (1-\alpha)\frac{m_i}{\sum_{j=1}^N m_j} + \alpha\frac{1}{N}, $$ where $\alpha \in [0,1]$ mixes the optimal probabilities with uniform probabilities. These probabilities satisfy $$ \sum_{i=1}^N \pi_i = 1. $$ Here $\pi_i$ is the probability that observation $i$ is selected on one draw. Sampling with replacement draws exactly `n.ssp` times: $$ I_1,\ldots,I_{n_{\mathrm{ssp}}} \sim \mathrm{Multinomial}(\pi_1,\ldots,\pi_N). $$ Repeated row indices are allowed. For Poisson subsampling, $\pi_i$ is an inclusion probability for observation $i$. The raw score may be truncated as $$ m_i^* = \min(m_i, H), $$ and then transformed into $$ \pi_i = \min\left[ n_{\mathrm{ssp}} \left\{ (1-\alpha)\frac{m_i^*}{\widehat{\Phi}} + \alpha\frac{1}{N} \right\}, 1 \right]. $$ Here $H$ is a truncation threshold and $\widehat{\Phi}$ is a pilot-based normalizing quantity. The probabilities are designed so that $$ \sum_{i=1}^N \pi_i \approx n_{\mathrm{ssp}}, $$ up to truncation at 1 and pilot approximation error. Poisson subsampling draws each observation independently: $$ \delta_i \sim \mathrm{Bernoulli}(\pi_i), $$ and $$ S_{\mathrm{ssp}} = \{i: \delta_i = 1\}. $$ There is no replacement under Poisson subsampling because each observation is included at most once. The actual sample size can differ from the designed expected size: $$ |S_{\mathrm{ssp}}| = \sum_{i=1}^N \delta_i. $$ ## Step 4: Fit a Corrected Subsample Objective Function Nonuniform subsampling changes the objective. A corrected objective is required to remove the bias introduced by informative sampling. The most general correction is inverse-probability weighting: $$ \widehat{\theta}_{\mathrm{ssp}} = \arg\min_\theta \sum_{i \in S_{\mathrm{ssp}}} \frac{1}{\pi_i}\ell_i(\theta). $$ For with-replacement sampling, the denominator should be read as the drawing probability. For Poisson subsampling, the denominator is the inclusion probability. Some models support other objectives instead of a weighted objective. For logistic regression with Poisson subsampling, `ssp.glm()` supports `logOddsCorrection`. For rare-event logistic regression, `ssp.relogit()` uses a negative-sampling log-odds correction by default. For softmax regression, `ssp.softmax()` supports `MSCLE`, a sampled conditional likelihood. The implementation details differ by model, but the role is the same: fit the model on the subsample using a loss that accounts for the sampling design. ## Step 5: Optional Combination Some functions combine the pilot estimator and the subsample estimator. The pilot sample has already been computed, so it can still contribute information. A typical combination has the form $$ \widehat{\theta}_{\mathrm{cmb}} = \left(\widehat{I}_{\mathrm{plt}} + \widehat{I}_{\mathrm{ssp}}\right)^{-1} \left( \widehat{I}_{\mathrm{plt}}\widehat{\theta}_{\mathrm{plt}} + \widehat{I}_{\mathrm{ssp}}\widehat{\theta}_{\mathrm{ssp}} \right), $$ where $\widehat{I}_{\mathrm{plt}}$ and $\widehat{I}_{\mathrm{ssp}}$ are estimated information matrices. Not every function uses this step in the same way. For example, `ssp.glm()`, `ssp.relogit()`, and `ssp.softmax()` combine pilot and subsample estimators, while `ssp.quantreg()` uses repeated subsampling to estimate covariance for the subsample estimator. ## Function Map The table below maps package-level functions to the helper functions that implement the common algorithmic pattern. | Function | Pilot | Probability and draw | Subsample fit | Combination | |---|---|---|---|---| | `ssp.glm()` | `pilot.estimate()` | `subsampling()` | `subsample.estimate()` | `combining()` | | `ssp.relogit()` | `rare.pilot.estimate()` | `rare.subsampling()` | `rare.subsample.estimate()` | `rare.combining()` | | `ssp.softmax()` | `softmax.plt.estimate()` | `softmax.subsampling()` | `softmax.subsample.estimate()` | `softmax.combining()` | | `ssp.quantreg()` | `plt.estimation.quantreg()` | `ssp.estimation.quantreg()` | `ssp.estimation.quantreg()` | Not used directly | | `ssp.glm.rF()` | `rF.pilot.estimate()` | `rF.subsampling()` | `rF.subsample.estimate()` | `rF.combining()` | ## Checking Points When adapting the package structure for a new method, the following checks are useful. - Check the scale of the probabilities. For with-replacement sampling, $\sum_i \pi_i$ should be 1. For Poisson subsampling, $\sum_i \pi_i$ should be close to the designed expected subsample size, after accounting for any specific constraints. - For Poisson subsampling, check whether the actual sample size is far from the expected size. Some variation is natural. If the actual size is repeatedly far too small or too large, the probability normalization, truncation, or constraint logic may be wrong. - In the combination step, sample size matters. A larger second-step subsample usually pulls the combined estimator toward the second-step estimator because the combination is an information-weighted average. - Check the corrected loss against the sampling design. If the subsample was drawn with unequal probabilities, the fitted loss should use the corresponding weights, offsets, or conditional likelihood correction. - Check the pilot sample before trusting the second-step probabilities. If the pilot estimator is unstable, the raw importance scores can be unstable even when the probability formulas are coded correctly.