This article describes the scree estimation used in the
dppca package. The main idea is to rewrite each scree value
as a mean estimation problem.
The package considers three main methods as follows.
Let \[ X \in \mathbb{R}^{n \times p} \]
be the preprocessed data matrix. This means that the data have been centered, and possibly standardized.
Let \[ V_k = [v_1,\ldots,v_k] \in \mathbb{R}^{p \times k} \]
be the matrix of PC directions. Depending on the analysis, \(V_k\) may be non-private or differentially private.
The goal is to estimate the scree vector \(\lambda = (\lambda_1,\ldots,\lambda_k)\) in a differentially private way.
In the sample PCA case, the \(\ell\)-th scree value is
\[ \hat\lambda_\ell = v_\ell^\top \hat\Sigma v_\ell \quad \text{where} \quad \hat\Sigma = \frac{1}{n-1}X^\top X. \]
In dppca, we construct private estimates \(\widetilde\lambda_1,\ldots,\widetilde\lambda_k\)
and use them to build a differentially private scree plot.
For the \(\ell\)-th principal component direction \(v_\ell\), define the score vector \(z_\ell = X v_\ell\).
The sample variance of the \(\ell\)-th score vector is
\[ \hat\lambda_\ell = \frac{1}{n-1} \sum_{i=1}^n z_{i\ell}^2 = \frac{n}{n-1} \left( \frac{1}{n}\sum_{i=1}^n w_{i\ell} \right) \quad \text{where} \quad w_{i\ell} = z_{i\ell}^2. \]
Therefore, for each component \(\ell\), scree estimation can be viewed as a mean estimation problem for \(w_{1\ell},\ldots,w_{n\ell}\).
In dppca, we want to construct a differentially private
estimate of \(\widehat{\lambda}_\ell\),
denoted by \(\widetilde{\lambda}_\ell\). So, we first
privately estimate the mean of \(W_{1\ell},\ldots,W_{n\ell}\), and then
rescale it by \(n/(n-1)\).
\[ \widetilde{\lambda}_\ell = \frac{n}{n-1} \cdot \operatorname{DP-Mean}\left(w_{1\ell},\ldots,w_{n\ell}\right). \]
The simplest way to privately estimate the mean of \(w_{1\ell},\ldots,w_{n\ell}\) is to first clip the values to a bounded interval and then add Gaussian noise.
Choose a clipping threshold \[ C > 0. \]
Define the clipped observations by \(w_{i\ell}^{\mathrm{clip}} = \min(w_{i\ell}, C)\).
Then \[ 0 \leq w_{i\ell}^{\mathrm{clip}} \leq C. \]
The clipped empirical mean is
\[ \hat\mu_\ell^{\mathrm{clip}} = \frac{1}{n} \sum_{i=1}^n w_{i\ell}^{\mathrm{clip}} = \frac{1}{n} \sum_{i=1}^n \min(w_{i\ell}, C). \]
The corresponding non-private clipped scree estimate is \[ \hat\lambda_\ell^{\mathrm{clip}} = \frac{n}{n-1} \hat\mu_\ell^{\mathrm{clip}}. \]
Because each clipped observation lies in \([0,C]\), changing one observation can change the mean by at most \(\frac{C}{n}\).
After multiplying by \(n/(n-1)\), the sensitivity of the clipped scree estimate is
\[ \Delta_\ell \leq \frac{n}{n-1} \cdot \frac{C}{n} = \frac{C}{n-1}. \]
For privacy parameters \((\epsilon_\ell,\delta_\ell)\), the Gaussian mechanism releases
\[ \widetilde\lambda_\ell^{\mathrm{clip}} = \hat\lambda_\ell^{\mathrm{clip}} + \xi_\ell, \]
where
\[ \xi_\ell \sim N(0,\sigma_\ell^2) \] and \[ \sigma_\ell = \frac{\Delta_\ell\sqrt{2\log(1.25/\delta_\ell)}}{\epsilon_\ell} = \frac{(C/(n-1))\sqrt{2\log(1.25/\delta_\ell)}}{\epsilon_\ell}. \]
Clipped scree estimation is simple and easy to implement, but it depends on the choice of the clipping threshold \(C\).
The Huber method is based on the differentially private robust mean estimator of Yu, Ren, and Zhou (2024). For each component \(\ell\), it estimates the mean of \(w_{1\ell},\ldots,w_{n\ell}\) using a robust loss function instead of the ordinary sample mean.
For a robustification parameter \(\tau > 0\), the Huber loss is
\[ \rho_\tau(r) = \begin{cases} \frac{1}{2}r^2, & |r| \leq \tau,\\ \tau |r| - \frac{1}{2}\tau^2, & |r| > \tau. \end{cases} \]
For small residuals, the Huber loss behaves like squared error loss. For large residuals, it grows linearly, which reduces the influence of extreme observations.
The derivative of the Huber loss is the score function
\[ \psi_\tau(r) = \rho_\tau'(r) = \begin{cases} -\tau, & r < -\tau,\\ r, & |r| \leq \tau,\\ \tau, & r > \tau. \end{cases} \]
Thus, \(\psi_\tau(r)\) clips the residual \(r\) to the interval \([-\tau,\tau]\).
For the \(\ell\)-th component, the Huber mean estimator is defined as
\[ \hat\mu_{\tau,\ell} \in \arg\min_{\mu \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \rho_\tau(w_{i\ell}-\mu). \]
The first-order condition is
\[ \frac{1}{n} \sum_{i=1}^n \psi_\tau(w_{i\ell}-\mu) = 0. \]
This equation shows that large residuals are clipped through the score function. As a result, the estimator is less sensitive to outliers than the ordinary mean.
The Huber objective is optimized using noisy gradient descent.
Given the current estimate \(\mu_\ell^{(t)}\) at iteration \(t\), we define the corresponding residuals as
\[ r_{i\ell}^{(t)} = w_{i\ell} - \mu_\ell^{(t)}, \qquad i=1,\ldots,n. \]
The clipped score values are \(\psi_\tau(r_{i\ell}^{(t)})\),
and the average gradient type quantity is
\[ g_\ell^{(t)} = \frac{1}{n} \sum_{i=1}^n \psi_\tau(r_{i\ell}^{(t)}). \]
A non-private update would take the form
\[ \mu_\ell^{(t+1)} = \mu_\ell^{(t)} + \eta_t g_\ell^{(t)} \quad \text{where} \quad \eta_t > 0 \quad \text{is the step size.} \]
To ensure privacy, Gaussian noise is added at each iteration.
\[ \mu_\ell^{(t+1)} = \mu_\ell^{(t)} + \eta_t g_\ell^{(t)} + \xi_\ell^{(t)} \quad \text{where} \quad \xi_\ell^{(t)} \sim N(0,\sigma_{t,\ell}^2). \]
Since
\[ \psi_\tau(r) \in [-\tau,\tau], \]
changing one observation can change the average score by at most \(\frac{2\tau}{n}\).
After multiplying by the step size \(\eta_t\), the one-step sensitivity is
\[ \Delta_{\mathrm{step}} = \frac{2\eta_t \tau}{n}. \]
This sensitivity determines the scale of the Gaussian noise added at each gradient step.
After \(T\) noisy gradient descent steps, let the final private Huber mean estimate be
\[ \mu_\ell^{(T)}. \]
The final Huber scree estimate is
\[ \widetilde\lambda_\ell^{\mathrm{Huber}} = \frac{n}{n-1} \mu_\ell^{(T)}. \]
The Huber method is useful when the squared scores \(w_{i\ell}\) may be heavy-tailed or affected by outliers.
The robustification parameter \(\tau\) controls the trade-off between bias, robustness, and privacy.
A small \(\tau\) gives stronger clipping and more robustness, but may increase the bias of the estimator. A large \(\tau\) behaves more like the ordinary mean, but may be less robust and may require more noise.
In the Huber DP mean, \(\tau\) is chosen using a scale quantity related to the second moment. A typical theoretical form is
\[ \tau \asymp \sqrt{m_2} \left( \frac{\epsilon n} {\sqrt{(d+\log n)\log n}} \right)^{1/2}, \]
where
\[ m_2 = \mathbb{E}\|X-\mu\|_2^2 \quad \text{is a second-moment scale.} \]
Because \(m_2\) is usually unknown and can be sensitive to outliers, it may need to be estimated robustly and privately. we use a private robust estimator for \(m_2\) described in Algorithm 2.
The Huber scree method also uses additional control parameters for noisy gradient descent and for the private scale-proxy step used to estimate \(m_2\). The default values are chosen based on the recommendations in the Yu, Ren, and Zhou (2024)
mu0: Initial value for noisy gradient descent.
(default: \(\mu_\ell^{(0)} =
0\))
eta0: Step size for noisy gradient descent.
(default: \(\eta_0 = 1\))
T: Number of noisy gradient descent iterations.
(default: \(T = \lfloor \log n
\rfloor\))
M: Number of blocks used in the private estimator
for \(m_2\). (default: \(M = \lfloor \sqrt{n}/2 \rfloor\))
k_min_m2 and k_max_m2: Lower and upper
dyadic bin indices used in the private histogram step for estimating
\(m_2\). The histogram searches over
scale levels \(2^k\) for
\[ k_{\min} \le k \le k_{\max}. \]
m2_frac: Fraction of the Huber scree privacy budget
used to privately estimate \(m_2\).
If \((\epsilon_{\mathrm{scree}},\delta_{\mathrm{scree}})\) is the budget for Huber scree estimation, then \((\epsilon_{m_2},\delta_{m_2}) = \text{m2_frac}\cdot (\epsilon_{\text{scree}},\delta_{\text{scree}})\), while the remaining budget \((\epsilon_{\text{gd}},\delta_{\text{gd}}) = (1-\text{m2_frac})\cdot (\epsilon_{\text{scree}},\delta_{\text{scree}})\) is used for Huber noisy gradient descent.
In dppca, the Huber scree estimator is controlled by
\[ \texttt{scree\_huber\_control( mu0,\ eta0,\ T,\ M,\ k\_min\_m2,\ k\_max\_m2,\ m2\_frac)}. \]
The PMWM method is based on the private modified winsorized mean of Ramsay and Spicker (2025). For each component \(\ell\), it estimates the mean of \(w_{1\ell},\ldots,w_{n\ell}\) by privately estimating tail quantiles, winsorizing the data, and then adding noise to the winsorized mean.
The PMWM method builds on the non-private modified winsorized mean of Lugosi and Mendelson (2021).
The non-private modified winsorized mean starts with a clipping proportion \(0 < p < \frac{1}{2}\), which determines the lower and upper tail quantiles used for winsorization.
Let \(\hat\xi_p \quad\text{and}\quad \hat\xi_{1-p}\) be empirical lower and upper quantiles.
Define the function
\[ \phi_{a,b}(x) = \begin{cases} a, & x < a,\\ x, & a \leq x \leq b,\\ b, & x > b. \end{cases} \]
The non-private modified winsorized mean is
\[ \hat\mu_p = \frac{1}{n} \sum_{i=1}^n \phi_{\hat\xi_p,\hat\xi_{1-p}}(X_i). \]
The idea is to first estimate the lower and upper tail cutoffs, and then winsorize the data by replacing extreme values with the corresponding cutoffs.
In the theoretical description, the data may be split into two subsets:
If the full sample size is \(n\), a simple split is \[ n_q = n_m = \frac{n}{2}. \]
In practice, the implementation may use all available observations at each step instead of splitting the sample.
PMWM uses private quantile estimates instead of non-private empirical quantiles. For component \(\ell\), let the clipping proportion be \[ p_\ell = \zeta_\ell. \]
A theoretical choice has the form
\[ \zeta_\ell = 16\eta + \frac{112}{3} \frac{ \log\left(32(\beta(u-\ell)/(\beta-1)\vee 1)/\delta\right) } {n_q}, \]
where
For practical implementation, the paper suggests using the clipping proportion
\[ p = \frac{C}{n_q} \vee \eta, \]
where \(C\) is a user-chosen
trimming constant. This is also the choice used in the
dppca package.
Let \(L_\ell\) and \(U_\ell\) be private estimates of the lower and upper quantiles.
\[ L_\ell \approx Q_{\zeta_\ell}(w_{1\ell},\ldots,w_{n\ell}), \quad \text{and} \quad U_\ell \approx Q_{1-\zeta_\ell}(w_{1\ell},\ldots,w_{n\ell}). \]
Define the winsorized observations by
\[ w_{i\ell}^{\mathrm{win}} = \min\left\{ \max(w_{i\ell},L_\ell), U_\ell \right\}. \]
Equivalently,
Thus, \[ L_\ell \leq w_{i\ell}^{\mathrm{win}} \leq U_\ell. \]
Using the mean estimation subset \(I_m\), define \(\hat\mu_\ell^{\mathrm{win}} = \frac{1}{n_m} \sum_{i\in I_m} w_{i\ell}^{\mathrm{win}}\).
The corresponding non-private winsorized scree estimate is
\[ \hat\lambda_\ell^{\mathrm{PMWM,np}} = \frac{n}{n-1} \hat\mu_\ell^{\mathrm{win}}. \]
Because all winsorized observations lie in \([L_\ell,U_\ell]\), the sensitivity of the winsorized mean is
\[ \Delta_\ell^{(\mu)} = \frac{U_\ell-L_\ell}{n_m}. \]
After multiplying by \(n/(n-1)\), the scree sensitivity is \(\Delta_\ell^{(\lambda)} = \frac{n}{n-1} \cdot \frac{U_\ell-L_\ell}{n_m}\).
The final PMWM scree estimate can be written as
\[ \widetilde\lambda_\ell^{\mathrm{PMWM}} = \hat\lambda_\ell^{\mathrm{PMWM,np}} + Z_\ell \quad \text{where} \quad Z_\ell \sim N(0,\sigma_\ell^2). \]
For privacy parameters \((\epsilon_{M,\ell},\delta_{M,\ell})\), the noise scale is
\[ \sigma_\ell = \frac{ \Delta_\ell^{(\lambda)} \sqrt{2\log(1.25/\delta_{M,\ell})} } {\epsilon_{M,\ell}} = \frac{ \left( \frac{n}{n-1} \cdot \frac{U_\ell-L_\ell}{n_m} \right) \sqrt{2\log(1.25/\delta_{M,\ell})} } {\epsilon_{M,\ell}}. \]
PMWM uses privacy budget for both private quantile estimation and private mean estimation.
For component \(\ell\), let the total component-level budget be \((\epsilon_\ell,\delta_\ell)\).
This can be split as \[ \epsilon_\ell = \epsilon_{Q,\ell} + \epsilon_{M,\ell}, \qquad \delta_\ell = \delta_{Q,\ell} + \delta_{M,\ell}. \]
The quantile budget itself is used to estimate two quantiles, so it can be split again.
\[ \epsilon_{Q,\ell} = \epsilon_{q1,\ell} + \epsilon_{q2,\ell}, \qquad \delta_{Q,\ell} = \delta_{q1,\ell} + \delta_{q2,\ell}. \]
The PMWM scree estimator uses additional parameters for private quantile estimation and winsorization.
beta: Log-binning base used in the private quantile
estimator.
It determines the spacing of the geometric search grid and must satisfy \(\beta > 1\).
a, b: Lower and upper search bounds
supplied to the private quantile. The private lower and upper clipping
cutoffs are searched within this range.
trim_const, eta: Parameters used to set
the practical clipping proportion \[
p
=
\min\left\{
\max\left(\frac{\texttt{trim\_const}}{n_q}, \eta\right),
0.49
\right\}.
\] Here, trim_const / n_q controls the baseline
clipping level, while eta gives a lower bound reflecting
the expected contamination level.
split_mode: Logical value indicating whether the
sample is split into two parts.
If TRUE, one part is used for private quantile
estimation and the other part is used for the winsorized mean step. If
FALSE, all observations are used in both steps.
max_extra_bins: Maximum number of additional
log-grid bins searched beyond the largest occupied bin in the private
quantile.
In dppca, the PMWM-specific parameters can be specified
through
\[ \text{scree_pmw_control( beta = beta,a = a,b = b, trim_const = C, eta = eta split_mode = split)} \]
Because of the added privacy noise, the raw DP scree estimates
\[ (\widetilde\lambda_1,\ldots,\widetilde\lambda_k) \]
may not have the usual scree shape. They may not be decreasing, and some values may be negative.
In ordinary PCA, scree values satisfy
\[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \ge 0. \]
Therefore, dppca can apply post-processing to make the
DP scree estimates nonnegative and decreasing.
\[ \widetilde\lambda_1^{\mathrm{mono}} \ge \widetilde\lambda_2^{\mathrm{mono}} \ge \cdots \ge \widetilde\lambda_k^{\mathrm{mono}} \ge 0. \]
If monotone post-processing is used, the PVE can be computed from the post-processed scree values.
\[ \widetilde{\operatorname{PVE}}_\ell^{\mathrm{mono}} = \frac{\widetilde\lambda_\ell^{\mathrm{mono}}} {\sum_{j=1}^k \widetilde\lambda_j^{\mathrm{mono}}}. \]
This step only modifies the already released DP estimates, so it does not use any additional privacy budget.
dp_scree_plot(
X,
k = 3,
dp_scree_method = "clipped",
eps_total = 1,
delta_total = 1e-6,
control = clipped_control(
C_clip = 3
)
)dp_scree_plot(
X,
k = 3,
dp_scree_method = "all",
eps_total = 1,
delta_total = 1e-6,
center = TRUE,
standardize = FALSE,
control = list(
clipped = clipped_control(
C_clip = 3
),
pmwm = pmwm_control(
beta = 1.01,
a = 0,
b = 10,
trim_const = 10,
eta = 0.01,
split_mode = TRUE
),
huber = huber_control(
mu0 = 0,
eta0 = 1,
T = 50,
M = 20,
k_min_m2 = -40,
k_max_m2 = 40,
m2_frac = 1/4
)
)
)Myeonghun Yu. Zhao Ren. Wen-Xin Zhou. “Gaussian differentially private robust mean estimation and inference”. Bernoulli 30 (4) 3059 - 3088, November 2024. https://doi.org/10.3150/23-BEJ1706
Kelly Ramsay and Dylan Spicker. (2025). “Improved subsample-and-aggregate via the private modified winsorized mean”. arXiv preprint. https://arxiv.org/abs/2501.14095
Gábor Lugosi. Shahar Mendelson. “Robust multivariate mean estimation: The optimality of trimmed mean.” Ann. Statist. 49 (1) 393 - 410, February 2021. https://doi.org/10.1214/20-AOS1961