A PCA score plot is a standard visualization for examining the
low-dimensional structure of multivariate data. In a non-private
analysis, the score plot displays the projected observations directly.
In dppca, the differentially private score plot instead
represents the distribution of two-dimensional PCA scores by a
differentially private histogram.
Let
\[ X \in \mathbb{R}^{n \times p} \]
be the input data matrix after the requested preprocessing. In
dppca, preprocessing is controlled by the arguments
center and standardize.
Let
\[ V_k = [v_1,\ldots,v_k] \in \mathbb{R}^{p \times k} \]
be the matrix of principal component directions, where the column \(v_\ell\) is the \(\ell\)-th principal component direction. Let \(x_i \in \mathbb{R}^p\) denote the \(i\)-th observation, so that \(x_i^\top\) is the \(i\)-th row of \(X\). Its \(k\)-dimensional score vector is
\[ z_i = V_k^\top x_i \in \mathbb{R}^k, \qquad i=1,\ldots,n. \]
For visualization, we select two score coordinates. If
axes = c(a, b), define
\[ s_i = (z_{i,a}, z_{i,b})^\top \in \mathbb{R}^2, \qquad i=1,\ldots,n. \]
The collection \(S = \{s_i\}_{i=1}^n\) is the two-dimensional score point cloud. A non-private score plot would draw these points directly. The private score plot instead releases a noisy two-dimensional histogram of these points.
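The score computation above can be illustrated with a short base R sketch. It is non-private and uses simulated data; the SVD is used here only as one standard way to obtain \(V_k\), and the variable names are illustrative rather than taken from dppca.

```r
# Minimal non-private sketch: PCA scores on two selected axes.
set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)  # toy data: n = 200, p = 5
X <- scale(X, center = TRUE, scale = FALSE)        # centering only

k <- 3
V_k <- svd(X)$v[, 1:k, drop = FALSE]  # p x k matrix of PC directions
Z <- X %*% V_k                        # n x k; row i is z_i^T = (V_k^T x_i)^T

axes <- c(1, 2)
S <- Z[, axes, drop = FALSE]          # n x 2 score point cloud
```

Each row of S is one point \(s_i\) that a non-private score plot would draw directly.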
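The private score visualization in dppca has the
following steps: obtain the principal component directions (privately
when g_dppca = TRUE), compute the two-dimensional scores, construct a
private plotting frame, divide the frame into bins, and release a
differentially private histogram of the scores over those bins.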
The plotting frame and histogram both consume privacy budget. If
g_dppca = TRUE, the private PC directions also consume
privacy budget.
Before constructing a two-dimensional histogram, we need a plotting region. This region is called the plotting frame. If the frame is too narrow, many points are excluded. If it is too wide, the histogram may become sparse and visually uninformative.
The current implementation uses a private center-radius frame. This approach constructs a square frame by privately estimating a center and then privately estimating a radius around that center. The private quantiles appearing in this step are computed using a smooth-sensitivity-based DP quantile estimator, as in Nissim, Raskhodnikova, and Smith (2007).
Arrange the score points as the rows of an \(n \times 2\) score matrix, also denoted \(S\), whose \(i\)-th row is \(s_i^\top = (z_{i,a}, z_{i,b})\). The frame center is estimated coordinate-wise using private medians:
\[ \widetilde c_1 = \widetilde Q_{0.5}(z_{1,a},\ldots,z_{n,a}), \qquad \widetilde c_2 = \widetilde Q_{0.5}(z_{1,b},\ldots,z_{n,b}). \]
Here \(\widetilde Q_q(\cdot)\) denotes a private estimate of the \(q\)-quantile. The private center is
\[ \widetilde c = (\widetilde c_1,\widetilde c_2)^\top. \]
After obtaining the private center, compute the Euclidean distance from each score point to the private center:
\[ r_i = \|s_i-\widetilde c\|_2 = \sqrt{(z_{i,a}-\widetilde c_1)^2 + (z_{i,b}-\widetilde c_2)^2}, \qquad i=1,\ldots,n. \]
The radius is then estimated by the private 0.99 quantile of these distances:
\[ \widetilde R = \widetilde Q_{0.99}(r_1,\ldots,r_n). \]
To add a visual margin and reduce boundary effects, the private radius is inflated by a fixed factor \(\alpha > 0\):
\[ \widetilde R_{\mathrm{infl}} = (1+\alpha)\widetilde R, \]
where the current implementation uses \(\alpha = 0.20\).
The final plotting frame is
\[ F = [\widetilde c_1-\widetilde R_{\mathrm{infl}}, \widetilde c_1+\widetilde R_{\mathrm{infl}}] \times [\widetilde c_2-\widetilde R_{\mathrm{infl}}, \widetilde c_2+\widetilde R_{\mathrm{infl}}]. \]
This produces a square frame centered at the private center.
The distances \(r_i\) are nonnegative, but the private quantile estimator adds random noise. Therefore, the private radius estimate can occasionally become non-finite or nonpositive, especially when the privacy budget is very small, the sample size is small, or the score points are nearly identical.
The implementation checks the private radius before forming the frame. If the private radius is not finite or is nonpositive, the score plotting routine stops with an informative error.
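The frame construction and the radius check can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: center_radius_frame and quantile_fn are hypothetical names, and the default quantile function is the ordinary non-private stats::quantile, used only as a stand-in for a DP quantile estimator.

```r
# Hypothetical helper: center-radius frame from an n x 2 score matrix S.
# quantile_fn(x, prob) stands in for a DP quantile estimator. The default
# below is NOT private and only makes the sketch runnable.
center_radius_frame <- function(S, alpha = 0.20,
                                quantile_fn = function(x, prob)
                                  unname(stats::quantile(x, probs = prob))) {
  # Coordinate-wise medians give the frame center.
  center <- c(quantile_fn(S[, 1], 0.5), quantile_fn(S[, 2], 0.5))
  # Euclidean distance of every score point to the center.
  r <- sqrt((S[, 1] - center[1])^2 + (S[, 2] - center[2])^2)
  # The 0.99 quantile of the distances gives the radius.
  radius <- quantile_fn(r, 0.99)
  if (!is.finite(radius) || radius <= 0) {
    stop("radius estimate is not finite or not positive; cannot form the frame")
  }
  r_infl <- (1 + alpha) * radius
  list(center = center, radius = radius,
       xlim = center[1] + c(-1, 1) * r_infl,
       ylim = center[2] + c(-1, 1) * r_infl)
}

set.seed(1)
S <- cbind(rnorm(500), rnorm(500))
frame <- center_radius_frame(S)
```

Passing a noisy estimator as quantile_fn exercises the same check: a nonpositive or non-finite radius stops the routine before any frame is formed.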
After the plotting frame \(F\) has been determined, it is divided into histogram bins. Let \(m_x\) and \(m_y\) be the number of bins along the two score axes. The two-dimensional histogram then has
\[ m = m_x m_y \]
bins in total.
In dppca, the user specifies the bin counts through the
bins argument, for example bins = c(20, 20).
The best bin choice depends on the sample size, privacy budget, and
visible structure in the score distribution. Fewer bins can be more
stable under stronger privacy noise, while more bins can reveal finer
structure when the sample size and privacy budget are sufficiently
large.
Let the private plotting frame be divided into bins \(B_1,\ldots,B_m\). For the score point set \(S = \{s_i\}_{i=1}^n\), the non-private count in bin \(B_k\) is
\[ c_k = \sum_{i=1}^n \mathbf{1}\{s_i \in B_k\}, \qquad k=1,\ldots,m. \]
The count vector is \(c = (c_1,\ldots,c_m) \in \mathbb{N}^m\). The empirical frequency in bin \(B_k\) is
\[ q_k = \frac{c_k}{n}, \qquad k=1,\ldots,m. \]
The private score visualization displays a noisy version of this frequency vector.
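The non-private counts \(c_k\) and frequencies \(q_k\) can be computed with a short base R sketch. The helper bin_counts is a hypothetical name, the bins are taken to be equal-width cells of the frame, and points outside the frame are dropped, which is an assumption about the handling of excluded points.

```r
# Hypothetical helper: count the rows of an n x 2 matrix S in an
# mx x my grid of equal-width bins over the frame xlim x ylim.
# Points outside the frame are dropped.
bin_counts <- function(S, xlim, ylim, bins = c(20, 20)) {
  x_breaks <- seq(xlim[1], xlim[2], length.out = bins[1] + 1)
  y_breaks <- seq(ylim[1], ylim[2], length.out = bins[2] + 1)
  ix <- findInterval(S[, 1], x_breaks, rightmost.closed = TRUE)
  iy <- findInterval(S[, 2], y_breaks, rightmost.closed = TRUE)
  inside <- ix >= 1 & ix <= bins[1] & iy >= 1 & iy <= bins[2]
  tab <- table(factor(ix[inside], levels = seq_len(bins[1])),
               factor(iy[inside], levels = seq_len(bins[2])))
  matrix(as.integer(tab), nrow = bins[1], ncol = bins[2])
}

set.seed(1)
S <- cbind(rnorm(300), rnorm(300))
counts <- bin_counts(S, xlim = c(-4, 4), ylim = c(-4, 4), bins = c(15, 15))
freq <- counts / nrow(S)   # empirical frequencies q_k = c_k / n
```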
Under row-level adjacency, two neighboring datasets differ in one observation. Changing one observation can move one score point from one bin to another. Therefore, the count vector can change by at most \(+1\) in one bin and \(-1\) in another bin. Hence,
\[ \Delta_1(c) \leq 2, \qquad \Delta_2(c) \leq \sqrt{2}. \]
These sensitivity bounds are used to calibrate privacy noise for the histogram mechanisms.
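A toy example makes the two bounds concrete. The bin assignments below are invented for illustration: one observation moves from bin 1 to bin 4, and the resulting change in the count vector attains both bounds.

```r
# Toy bin assignments for n = 6 points over m = 4 bins.
bins_a <- c(1, 1, 2, 3, 3, 4)
bins_b <- bins_a
bins_b[1] <- 4   # neighboring dataset: one observation moves from bin 1 to bin 4

count_a <- tabulate(bins_a, nbins = 4)
count_b <- tabulate(bins_b, nbins = 4)

diff_counts <- count_b - count_a   # -1 in bin 1, +1 in bin 4, 0 elsewhere
l1 <- sum(abs(diff_counts))        # L1 change: 2
l2 <- sqrt(sum(diff_counts^2))     # L2 change: sqrt(2)
```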
The DP score histogram procedure has two main privacy-consuming steps
when g_dppca = FALSE: constructing the private plotting frame and
releasing the private histogram.
If the total privacy budget is \((\epsilon,\delta)\), the implementation splits the budget as
\[ (\epsilon_{\mathrm{frame}},\delta_{\mathrm{frame}}) = (\epsilon/2,\delta/2), \qquad (\epsilon_{\mathrm{hist}},\delta_{\mathrm{hist}}) = (\epsilon/2,\delta/2). \]
The frame construction itself uses three private quantile estimates: two private medians for the center and one private 0.99 quantile for the radius. These share the frame budget by basic composition.
When g_dppca = TRUE, private PC direction estimation
also consumes privacy budget. In that case, the total budget is split
across three steps: private PC direction estimation, frame
construction, and histogram release.
The implementation uses an equal split:
\[ (\epsilon_{\mathrm{pc}},\delta_{\mathrm{pc}}) = (\epsilon_{\mathrm{frame}},\delta_{\mathrm{frame}}) = (\epsilon_{\mathrm{hist}},\delta_{\mathrm{hist}}) = (\epsilon/3,\delta/3). \]
By basic composition, the overall procedure satisfies the requested \((\epsilon,\delta)\)-DP guarantee.
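The equal split can be written as a small base R sketch. The helper split_budget is a hypothetical name introduced for illustration and is not part of dppca.

```r
# Hypothetical helper mirroring the equal budget split described above.
split_budget <- function(eps, delta, g_dppca = FALSE) {
  steps <- c(if (g_dppca) "pc", "frame", "hist")
  k <- length(steps)   # 2 steps without private PCs, 3 with them
  data.frame(step = steps, eps = eps / k, delta = delta / k)
}

budget <- split_budget(eps = 5, delta = 1e-5, g_dppca = TRUE)
```

The per-step budgets sum back to the requested total, which is what basic composition requires.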
A simple DP histogram can be constructed by adding independent Gaussian noise to each bin count. The noisy counts are then post-processed to be nonnegative and normalized. This additive-noise approach is commonly used for DP histograms (Wasserman and Zhou, 2010), and the procedure is summarized in Additive DP histogram.
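A minimal base R sketch of the additive approach follows. It assumes the classical Gaussian mechanism calibration \(\sigma = \Delta_2 \sqrt{2\log(1.25/\delta)}/\epsilon\), which is stated for \(0 < \epsilon < 1\); the package may calibrate its noise differently. The helper additive_dp_hist is a hypothetical name.

```r
# Hypothetical helper: Gaussian-noise histogram from a vector or matrix of counts.
# Assumes sigma = Delta_2 * sqrt(2 * log(1.25 / delta)) / eps (classical
# Gaussian mechanism, stated for 0 < eps < 1).
additive_dp_hist <- function(counts, eps, delta, l2_sens = sqrt(2)) {
  sigma <- l2_sens * sqrt(2 * log(1.25 / delta)) / eps
  noisy <- counts + rnorm(length(counts), mean = 0, sd = sigma)
  noisy <- pmax(noisy, 0)                  # post-processing: clamp at zero
  total <- sum(noisy)
  if (total > 0) noisy / total else noisy  # post-processing: normalize
}

set.seed(1)
counts <- matrix(rpois(25, lambda = 40), nrow = 5)
p_hat <- additive_dp_hist(counts, eps = 0.5, delta = 1e-5)
```

Clamping and normalizing act only on the noisy output, so they do not affect the privacy guarantee.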
When many bins are empty, adding noise to every bin can dominate the visualization. A sparse histogram aims to report only bins whose counts are large enough to be distinguishable from noise.
In dppca, the sparse histogram is based on the
stability-based private histogram idea of Karwa
and Vadhan (2018), summarized in Sparse DP histogram.
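The stability-based idea can be sketched in base R as follows. This uses one common formulation, assumed here for illustration: Laplace noise with scale \(2/\epsilon\) is added to non-empty bins only, and a bin is released only if its noisy count exceeds \(1 + 2\log(2/\delta)/\epsilon\). The constants used by the package may differ, and sparse_dp_hist is a hypothetical name.

```r
# Hypothetical helper: stability-based sparse histogram on bin counts.
# Assumes Laplace scale 2 / eps and threshold 1 + 2 * log(2 / delta) / eps.
sparse_dp_hist <- function(counts, eps, delta) {
  scale <- 2 / eps
  threshold <- 1 + 2 * log(2 / delta) / eps
  out <- counts * 0                    # zeros with the shape of counts
  nonempty <- which(counts > 0)
  # Laplace(0, scale) noise as a difference of two exponentials.
  noise <- rexp(length(nonempty), rate = 1 / scale) -
    rexp(length(nonempty), rate = 1 / scale)
  noisy <- counts[nonempty] + noise
  keep <- noisy > threshold            # empty bins are never released
  out[nonempty[keep]] <- noisy[keep]
  total <- sum(out)
  if (total > 0) out / total else out  # normalize the released bins
}

set.seed(1)
counts <- matrix(0, nrow = 10, ncol = 10)
counts[5, 5] <- 400
counts[5, 6] <- 300
counts[2, 9] <- 1
p_hat <- sparse_dp_hist(counts, eps = 1, delta = 1e-5)
```

Bins that are empty in the data stay exactly zero in the output, so the noise no longer spreads over the whole grid.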
When group labels are available, DP score histograms can be constructed separately for each group. Let
\[ \{(s_i,g_i)\}_{i=1}^n \]
denote the score data with group labels, where \(s_i \in \mathbb{R}^2\) is the two-dimensional PCA score and \(g_i \in \mathcal{G}\) is the group label.
The score directions, private plotting frame, and histogram grid are shared across all groups. For each group \(g \in \mathcal{G}\), define the group-specific bin count
\[ c_k^{(g)} = \sum_{i=1}^n \mathbf{1}\{s_i \in B_k,\; g_i = g\}. \]
Because the groups form a partition of the rows, group-wise histogram releases can use parallel composition across groups on the common grid.
In dppca, the group-wise version can be constructed
using either the group-wise
additive DP histogram or the group-wise sparse DP
histogram.
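The group-specific counts \(c_k^{(g)}\) on a shared grid can be tabulated as in the base R sketch below. The bin indices and group labels are simulated, and the variable names are illustrative rather than part of dppca.

```r
# Toy inputs: a bin index in 1..m for each observation, plus a group label.
set.seed(1)
m <- 9
bin_id <- sample.int(m, size = 200, replace = TRUE)
group <- sample(c("A", "B"), size = 200, replace = TRUE)

# Rows index the bins B_1..B_m of the shared grid; columns index the groups.
group_counts <- table(factor(bin_id, levels = seq_len(m)), group)
```

Each column is the count vector of one group and can be passed to a histogram mechanism on its own.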
library(dppca)
data(gau, package = "dppca")
set.seed(123)
score_plot <- dp_score_plot(
  X = gau,
  eps = 5,
  delta = 1e-5,
  bins = c(15, 15),
  method = c("add", "sparse"),
  axes = c(1, 2)
)
score_plot$plot$all

For grouped score histograms:
Nissim, K., Raskhodnikova, S., & Smith, A. (2007). “Smooth sensitivity and sampling in private data analysis”. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (STOC ’07), 75–84. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1250790.1250803
Lei, J. (2011). “Differentially private M-estimators”. Advances in Neural Information Processing Systems, 24. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2011/file/f718499c1c8cef6730f9fd03c8125cab-Paper.pdf
Wasserman, L., & Zhou, S. (2010). “A Statistical Framework for Differential Privacy”. Journal of the American Statistical Association, 105(489), 375–389. https://doi.org/10.1198/jasa.2009.tm08651
Karwa, V., & Vadhan, S. (2018). “Finite sample differentially private confidence intervals”. In Proceedings of ITCS 2018, LIPIcs, 94, 44:1–44:9. https://doi.org/10.4230/LIPIcs.ITCS.2018.44