---
title: "DP score plots in dppca"
description: >
  Differentially private PCA score visualization in dppca, including private
  center-radius plotting frames, additive DP histograms, sparse DP histograms,
  and group-wise score histograms.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{DP score plots in dppca}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
```

A PCA score plot is a standard visualization for examining the low-dimensional
structure of multivariate data. In a non-private analysis, the score plot displays
the projected observations directly. In `dppca`, the differentially private score
plot instead represents the distribution of two-dimensional PCA scores by a
differentially private histogram.

## PC scores

Let

\[
X \in \mathbb{R}^{n \times p}
\]

be the input data matrix after the requested preprocessing. In `dppca`,
preprocessing is controlled by the arguments `center` and `standardize`.

Let

\[
V_k = [v_1,\ldots,v_k] \in \mathbb{R}^{p \times k}
\]

be the matrix of principal component directions, where the column \(v_\ell\)
is the \(\ell\)-th principal component direction. For the \(i\)-th observation
\(x_i \in \mathbb{R}^p\) (the \(i\)-th row of \(X\), written as a column
vector), the \(k\)-dimensional score vector is

\[
z_i
=
V_k^\top x_i
\in
\mathbb{R}^k,
\qquad i=1,\ldots,n.
\]

For visualization, we select two score coordinates. If `axes = c(a, b)`, define

\[
s_i
=
(z_{i,a}, z_{i,b})^\top
\in
\mathbb{R}^2,
\qquad i=1,\ldots,n.
\]

The collection \(S = \{s_i\}_{i=1}^n\) is the two-dimensional score point cloud.
A non-private score plot would draw these points directly. The private score
plot instead releases a noisy two-dimensional histogram of these points.
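
The score definitions above can be sketched in base R. This is a non-private illustration on simulated data; the toy matrix, the choice `k = 3`, and the use of `eigen()` on the sample covariance are assumptions made for the example, not a description of how `dppca` computes its directions.

```{r, eval = FALSE}
set.seed(1)
# Toy data: n = 200 observations, p = 5 variables (illustrative only).
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

# Preprocessing analogous to center = TRUE, standardize = FALSE.
Xc <- scale(X, center = TRUE, scale = FALSE)

# Non-private PC directions from the sample covariance matrix.
k <- 3
V_k <- eigen(cov(Xc), symmetric = TRUE)$vectors[, seq_len(k), drop = FALSE]

# Row i of Z is z_i^T = (V_k^T x_i)^T, so Z = Xc %*% V_k.
Z <- Xc %*% V_k

# Select two score coordinates, as with axes = c(a, b).
axes <- c(1, 2)
S <- Z[, axes, drop = FALSE]
dim(S)
```

Each row of `S` is one score point \(s_i\); a non-private score plot would simply call `plot(S)`.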

## Overview of the DP score plot

The private score visualization in `dppca` has the following steps.

1. Compute two-dimensional PCA scores.
2. Construct a private plotting frame.
3. Divide the frame into rectangular bins.
4. Count how many score points fall into each bin.
5. Apply a differentially private histogram mechanism.
6. Normalize and visualize the noisy bin frequencies.

The plotting frame and histogram both consume privacy budget. If `g_dppca =
TRUE`, the private PC directions also consume privacy budget.

## 1. Private plotting frame

Before constructing a two-dimensional histogram, we need a plotting region. This
region is called the plotting frame. If the frame is too narrow, many points are
excluded. If it is too wide, the histogram may become sparse and visually
uninformative.

The current implementation uses a **private center-radius frame**. This approach
constructs a square frame by privately estimating a center and then privately
estimating a radius around that center. The private quantiles appearing in this
step are computed using a smooth-sensitivity-based DP quantile estimator, as in
[Nissim, Raskhodnikova, and Smith (2007)](#ref-Nissim2007).

### Private center

Stack the score points into an \(n \times 2\) score matrix, also written \(S\),
whose \(i\)-th row is \(s_i^\top = (z_{i,a}, z_{i,b})\). The frame center is
estimated coordinate-wise using private medians:

\[
\widetilde c_1
=
\widetilde Q_{0.5}(z_{1,a},\ldots,z_{n,a}),
\qquad
\widetilde c_2
=
\widetilde Q_{0.5}(z_{1,b},\ldots,z_{n,b}).
\]

Here \(\widetilde Q_q(\cdot)\) denotes a private estimate of the \(q\)-quantile.
The private center is

\[
\widetilde c
=
(\widetilde c_1,\widetilde c_2)^\top.
\]

### Private radius

After obtaining the private center, compute the Euclidean distance from each
score point to the private center:

\[
r_i
=
\|s_i-\widetilde c\|_2
=
\sqrt{(z_{i,a}-\widetilde c_1)^2
+
(z_{i,b}-\widetilde c_2)^2},
\qquad i=1,\ldots,n.
\]

The radius is then estimated by the private 0.99 quantile of these distances:

\[
\widetilde R
=
\widetilde Q_{0.99}(r_1,\ldots,r_n).
\]

To add a visual margin and reduce boundary effects, the private radius is
inflated by a fixed factor \(\alpha > 0\):

\[
\widetilde R_{\mathrm{infl}}
=
(1+\alpha)\widetilde R,
\]

where the current implementation uses \(\alpha = 0.20\).

The final plotting frame is

\[
F
=
[\widetilde c_1-\widetilde R_{\mathrm{infl}},
 \widetilde c_1+\widetilde R_{\mathrm{infl}}]
\times
[\widetilde c_2-\widetilde R_{\mathrm{infl}},
 \widetilde c_2+\widetilde R_{\mathrm{infl}}].
\]

This produces a square frame centered at the private center.
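
The geometry of the center-radius frame can be sketched as follows. The helper `center_radius_frame()` and its `qfun` argument are hypothetical names introduced for this illustration. The example plugs in `stats::quantile()`, which is **not** private; a real private frame would need a DP quantile estimator in its place.

```{r, eval = FALSE}
# Build a square center-radius frame from an n x 2 score matrix S.
# `qfun(x, prob)` is a hypothetical quantile routine; a private frame
# requires a DP quantile estimator here, not stats::quantile().
center_radius_frame <- function(S, qfun, alpha = 0.20, radius_prob = 0.99) {
  center <- c(qfun(S[, 1], 0.5), qfun(S[, 2], 0.5))
  # Euclidean distance of every score point to the estimated center.
  r <- sqrt((S[, 1] - center[1])^2 + (S[, 2] - center[2])^2)
  R_infl <- (1 + alpha) * qfun(r, radius_prob)
  list(
    center = center,
    radius = R_infl,
    xlim = center[1] + c(-1, 1) * R_infl,
    ylim = center[2] + c(-1, 1) * R_infl
  )
}

# NON-PRIVATE stand-in, used only to illustrate the geometry.
plain_quantile <- function(x, prob) unname(stats::quantile(x, probs = prob))

set.seed(1)
S <- cbind(rnorm(500), rnorm(500, sd = 2))
frame <- center_radius_frame(S, qfun = plain_quantile)

inside <- S[, 1] >= frame$xlim[1] & S[, 1] <= frame$xlim[2] &
  S[, 2] >= frame$ylim[1] & S[, 2] <= frame$ylim[2]
mean(inside)  # share of score points that fall inside the frame
```

Because the frame is a square that contains the disc of radius \(\widetilde R_{\mathrm{infl}}\), it covers at least the points within the estimated 0.99 distance quantile of the center.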

### Numerical safeguard for the private radius

The distances \(r_i\) are nonnegative, but the private quantile estimator adds
random noise. Therefore, the private radius estimate can occasionally become
non-finite or nonpositive, especially when the privacy budget is very small, the
sample size is small, or the score points are nearly identical.

The implementation checks the private radius before forming the frame. If the
private radius is not finite or is nonpositive, the score plotting routine stops
with an informative error.
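
A check of this kind might look like the sketch below. The function name `check_radius()` and the wording of the error message are assumptions for illustration and are not taken from the `dppca` source.

```{r, eval = FALSE}
# Hypothetical helper: validate a (noisy) radius before building the frame.
check_radius <- function(R) {
  if (length(R) != 1L || !is.finite(R) || R <= 0) {
    stop(
      "Private radius is not a finite positive number; ",
      "consider a larger privacy budget or more observations.",
      call. = FALSE
    )
  }
  invisible(R)
}

check_radius(2.7)                       # passes silently
try(check_radius(-0.3), silent = TRUE)  # a nonpositive radius raises an error
```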

## 2. Choosing the number of bins

After the plotting frame \(F\) has been determined, it is divided into histogram
bins. Let \(m_x\) and \(m_y\) be the number of bins along the two score axes. The
two-dimensional histogram then has

\[
m = m_x m_y
\]

bins in total.

In `dppca`, the user specifies the bin counts through the `bins` argument, for
example `bins = c(20, 20)`. The best bin choice depends on the sample size,
privacy budget, and visible structure in the score distribution. Fewer bins can
be more stable under stronger privacy noise, while more bins can reveal finer
structure when the sample size and privacy budget are sufficiently large.


## 3. Two-dimensional histogram

Let the private plotting frame be divided into bins \(B_1,\ldots,B_m\). For the
score point set \(S = \{s_i\}_{i=1}^n\), the non-private count in bin \(B_k\) is

\[
c_k
=
\sum_{i=1}^n
\mathbf{1}\{s_i \in B_k\},
\qquad
k=1,\ldots,m.
\]

The count vector is \(c = (c_1,\ldots,c_m) \in \mathbb{N}^m\). The empirical
frequency in bin \(B_k\) is

\[
q_k
=
\frac{c_k}{n},
\qquad
k=1,\ldots,m.
\]

The private score visualization displays a noisy version of this frequency
vector.
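
The non-private counts \(c_k\) and frequencies \(q_k\) can be computed on a fixed frame as sketched below. The helper `bin_counts_2d()` is a hypothetical name, and dropping points that fall outside the frame is an assumption of this sketch; the package may treat such points differently.

```{r, eval = FALSE}
# Count points of an n x 2 matrix S on an mx-by-my grid over xlim x ylim.
# Points outside the frame are dropped (an assumption of this sketch).
bin_counts_2d <- function(S, xlim, ylim, bins = c(20, 20)) {
  bx <- seq(xlim[1], xlim[2], length.out = bins[1] + 1)
  by <- seq(ylim[1], ylim[2], length.out = bins[2] + 1)
  ix <- findInterval(S[, 1], bx, rightmost.closed = TRUE)
  iy <- findInterval(S[, 2], by, rightmost.closed = TRUE)
  keep <- ix >= 1 & ix <= bins[1] & iy >= 1 & iy <= bins[2]
  # Linear bin index k in 1..m, with m = mx * my.
  k <- (iy[keep] - 1) * bins[1] + ix[keep]
  matrix(tabulate(k, nbins = prod(bins)), nrow = bins[1], ncol = bins[2])
}

set.seed(1)
S <- cbind(rnorm(300), rnorm(300))
counts <- bin_counts_2d(S, xlim = c(-4, 4), ylim = c(-4, 4), bins = c(15, 15))
q <- counts / nrow(S)  # empirical frequencies q_k = c_k / n
sum(counts)            # number of points inside the frame
```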

### Sensitivity of histogram counts

Under row-level adjacency, two neighboring datasets differ in one observation.
With the bins held fixed, changing one observation can move one score point from
one bin to another. Therefore, the count vector changes in at most two bins:
\(-1\) in the bin the point leaves and \(+1\) in the bin it enters. Hence,

\[
\Delta_1(c) \leq 2,
\qquad
\Delta_2(c) \leq \sqrt{2}.
\]

These sensitivity bounds are used to calibrate privacy noise for the histogram
mechanisms.
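
The bounds can be checked numerically on a toy grid. The sketch below uses a hypothetical helper `grid_counts()` on the unit square and replaces a single point so that it lands in a different bin; it illustrates the argument and is not part of the package.

```{r, eval = FALSE}
# Counts on an m-by-m grid over the unit square (toy setting).
grid_counts <- function(S, m = 10) {
  breaks <- seq(0, 1, length.out = m + 1)
  ix <- findInterval(S[, 1], breaks, rightmost.closed = TRUE)
  iy <- findInterval(S[, 2], breaks, rightmost.closed = TRUE)
  tabulate((iy - 1) * m + ix, nbins = m * m)
}

set.seed(1)
S <- cbind(runif(100), runif(100))

# Neighboring dataset: replace one observation so that it changes bin.
S_neighbor <- S
S_neighbor[1, 1] <- (S[1, 1] + 0.5) %% 1

d <- grid_counts(S) - grid_counts(S_neighbor)
sum(abs(d))     # L1 distance: 2, attaining the bound
sqrt(sum(d^2))  # L2 distance: sqrt(2), attaining the bound
```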

## 4. Privacy accounting

The DP score histogram procedure has two main privacy-consuming steps when
`g_dppca = FALSE`:

1. private quantile estimation for constructing the plotting frame,
2. private histogram release.

If the total privacy budget is \((\epsilon,\delta)\), the implementation splits
the budget as

\[
(\epsilon_{\mathrm{frame}},\delta_{\mathrm{frame}})
=
(\epsilon/2,\delta/2),
\qquad
(\epsilon_{\mathrm{hist}},\delta_{\mathrm{hist}})
=
(\epsilon/2,\delta/2).
\]

The frame construction itself uses three private quantile estimates: two private
medians for the center and one private 0.99 quantile for the radius. These share
the frame budget by basic composition.

When `g_dppca = TRUE`, private PC direction estimation also consumes privacy
budget. In that case, the total budget is split across

1. private PC direction estimation,
2. private plotting frame construction,
3. private histogram release.

The implementation uses an equal split:

\[
(\epsilon_{\mathrm{pc}},\delta_{\mathrm{pc}})
=
(\epsilon_{\mathrm{frame}},\delta_{\mathrm{frame}})
=
(\epsilon_{\mathrm{hist}},\delta_{\mathrm{hist}})
=
(\epsilon/3,\delta/3).
\]

By basic composition, the overall procedure satisfies the requested
\((\epsilon,\delta)\)-DP guarantee.
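
The budget arithmetic can be written out as a small sketch. The helper `split_budget()` is hypothetical, and the equal three-way split of the frame budget across the three quantile releases is an assumption; the text above only states that they share the frame budget.

```{r, eval = FALSE}
# Hypothetical helper mirroring the equal budget split described above.
split_budget <- function(eps, delta, g_dppca = FALSE) {
  steps <- c(if (g_dppca) "pc", "frame", "hist")
  k <- length(steps)
  data.frame(step = steps, eps = eps / k, delta = delta / k)
}

b <- split_budget(eps = 3, delta = 1e-5, g_dppca = TRUE)
b

# Basic composition: the per-step budgets add up to the requested total.
sum(b$eps)
sum(b$delta)

# Within the frame step, three quantile releases share the frame budget.
# An equal split across them is an assumption of this sketch.
eps_frame <- b$eps[b$step == "frame"]
eps_frame / 3
```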

## Method 1: Additive DP histogram

A simple DP histogram can be constructed by adding independent Gaussian noise to
each bin count. The noisy counts are then post-processed to be nonnegative and
normalized. This additive-noise approach is commonly used for DP histograms
(see [Wasserman and Zhou (2010)](#ref-Wasserman2010)), and the procedure is summarized
in [Additive DP histogram](algorithms.html#alg-add-hist).
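
A minimal sketch of the additive approach is given below. The function name `additive_dp_hist()` is hypothetical, and the noise scale uses the classical Gaussian-mechanism calibration, which is stated for \(\epsilon < 1\). That calibration is an assumption of this example and may differ from the one used in `dppca`.

```{r, eval = FALSE}
# Additive-noise sketch: Gaussian noise on counts, then post-processing.
# sigma uses the classical Gaussian-mechanism calibration (assumes eps < 1);
# this is an assumption, not necessarily the calibration used by dppca.
additive_dp_hist <- function(counts, eps, delta, l2_sens = sqrt(2)) {
  sigma <- l2_sens * sqrt(2 * log(1.25 / delta)) / eps
  noisy <- counts + rnorm(length(counts), mean = 0, sd = sigma)
  noisy <- pmax(noisy, 0)  # post-processing: clip negative counts to zero
  total <- sum(noisy)
  if (total > 0) noisy / total else noisy  # normalize to frequencies
}

set.seed(1)
counts <- matrix(rpois(25, lambda = 20), nrow = 5)
freq <- additive_dp_hist(counts, eps = 0.5, delta = 1e-5)
round(freq, 3)
```

Clipping and normalization operate only on the noisy counts, so they are post-processing and do not consume additional privacy budget.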

## Method 2: Sparse DP histogram

When many bins are empty, adding noise to every bin can dominate the
visualization. A sparse histogram aims to report only bins whose counts are large
enough to be distinguishable from noise.

In `dppca`, the sparse histogram is based on the stability-based private
histogram idea of [Karwa and Vadhan (2018)](#ref-Karwa2017), summarized in
[Sparse DP histogram](algorithms.html#alg-sparse-hist).
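
The stability-based idea can be sketched as follows: noise is added only to non-empty bins, and a bin is reported only when its noisy count clears a threshold. The helper names, the Laplace scale \(2/\epsilon\), and the threshold \(1 + 2\log(2/\delta)/\epsilon\) follow one common textbook form and are assumptions of this sketch; the constants in `dppca` may differ.

```{r, eval = FALSE}
# Laplace(0, b) draws as a difference of two exponential draws.
rlaplace <- function(n, b) rexp(n, rate = 1 / b) - rexp(n, rate = 1 / b)

# Stability-based sketch. The noise scale and threshold are assumptions.
sparse_dp_hist <- function(counts, eps, delta) {
  out <- counts * 0
  nz <- which(counts > 0)  # only non-empty bins receive noise
  noisy <- counts[nz] + rlaplace(length(nz), b = 2 / eps)
  threshold <- 1 + 2 * log(2 / delta) / eps
  out[nz] <- ifelse(noisy >= threshold, noisy, 0)  # suppress small bins
  total <- sum(out)
  if (total > 0) out / total else out
}

set.seed(1)
counts <- matrix(0L, nrow = 10, ncol = 10)
counts[4:6, 4:6] <- 200L  # one dense cluster; all other bins are empty
freq <- sparse_dp_hist(counts, eps = 1, delta = 1e-5)
sum(freq > 0)             # number of bins that are reported
```

Empty bins stay exactly zero in this sketch, which is what keeps the plot free of background noise when most of the grid is unoccupied.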

## Group-wise DP score histograms

When group labels are available, DP score histograms can be constructed
separately for each group. Let

\[
\{(s_i,g_i)\}_{i=1}^n
\]

denote the score data with group labels, where \(s_i \in \mathbb{R}^2\) is the
two-dimensional PCA score and \(g_i \in \mathcal{G}\) is the group label.

The score directions, private plotting frame, and histogram grid are shared
across all groups. For each group \(g \in \mathcal{G}\), define the group-specific
bin count

\[
c_k^{(g)}
=
\sum_{i=1}^n
\mathbf{1}\{s_i \in B_k,\; g_i = g\}.
\]

Because the groups form a partition of the rows, group-wise histogram releases
can use parallel composition across groups on the common grid.
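
The partition structure can be seen in a small simulated example. The bin indices and group labels below are toy data, and the object names are chosen for illustration only.

```{r, eval = FALSE}
set.seed(1)
n <- 300
bin <- sample.int(16, n, replace = TRUE)  # bin index of each score point
group <- sample(c("A", "B", "C"), n, replace = TRUE)

# Rows are bins on the shared grid, columns are groups:
# entry (k, g) is the group-specific count c_k^(g).
counts_by_group <- table(
  bin = factor(bin, levels = 1:16),
  group = factor(group)
)

# Each observation contributes to exactly one group's histogram,
# so the group-wise counts add up to the pooled counts.
all(rowSums(counts_by_group) == tabulate(bin, nbins = 16))
```

Since changing one observation's score affects only the histogram of that observation's group, each group's release can be calibrated separately on the common grid.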

In `dppca`, the group-wise version can be constructed using either the
[group-wise additive DP histogram](algorithms.html#alg-group-add-hist) or the
[group-wise sparse DP histogram](algorithms.html#alg-group-sparse-hist).

## Example usage

```{r, eval = FALSE}
library(dppca)

data(gau, package = "dppca")

set.seed(123)
score_plot <- dp_score_plot(
  X = gau,
  eps = 5,
  delta = 1e-5,
  bins = c(15, 15),
  method = c("add", "sparse"),
  axes = c(1, 2)
)

score_plot$plot$all
```

For grouped score histograms:

```{r, eval = FALSE}
library(dppca)

data(gau_g, package = "dppca")

set.seed(123)
score_plot_group <- dp_score_plot_group(
  X = gau_g,
  group = "group",
  eps = 3,
  delta = 1e-5,
  bins = c(15, 15),
  method = c("add", "sparse")
)

score_plot_group$plot$all
```

## References

<span id="ref-Nissim2007"></span>
Nissim, K., Raskhodnikova, S., & Smith, A. (2007). "Smooth sensitivity and sampling in private data analysis". In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (STOC '07), 75--84. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1250790.1250803

<span id="ref-Lei2011"></span>
Lei, J. (2011). "Differentially private M-estimators". Advances in Neural Information Processing Systems, 24. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2011/file/f718499c1c8cef6730f9fd03c8125cab-Paper.pdf

<span id="ref-Wasserman2010"></span>
Wasserman, L., & Zhou, S. (2010). "A Statistical Framework for Differential Privacy". Journal of the American Statistical Association, 105(489), 375--389. https://doi.org/10.1198/jasa.2009.tm08651

<span id="ref-Karwa2017"></span>
Karwa, V., & Vadhan, S. (2018). "Finite sample differentially private confidence intervals". In <em>Proceedings of ITCS 2018</em>, LIPIcs, 94, 44:1--44:9. https://doi.org/10.4230/LIPIcs.ITCS.2018.44