---
title: "PC Directions in dppca"
description: >
  Principal component direction estimation used in dppca, including
  non-private sample PCA directions and differentially private g-DPPCA directions.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{PC Directions in dppca}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
```

In ordinary PCA, the principal component (PC) directions are the eigenvectors of the sample covariance matrix. In `dppca`, these directions can be computed in two ways:

1. **Non-private PC directions**: eigenvectors of the sample covariance matrix.
2. **Differentially private PC directions**: private principal component directions obtained through the g-DPPCA procedure.

## Notation

Let
\[
X =
\begin{bmatrix}
X_1^\top \\
X_2^\top \\
\vdots \\
X_n^\top
\end{bmatrix}
\in \mathbb{R}^{n \times p}
\]
be the data matrix used for PCA, where $X_i \in \mathbb{R}^p$ is the $i$-th observation. We assume that $X$ has been centered, and optionally standardized.

The principal component direction matrix is denoted by
\[
V_k = [v_1,\ldots,v_k] \in \mathbb{R}^{p \times k},
\]
where each column $v_\ell$ is a unit vector representing the $\ell$-th PC direction. The corresponding score matrix is $Z = X V_k$.

## 1. Non-private PC directions

The classical sample covariance matrix is
\[
\hat\Sigma = \frac{1}{n-1}X^\top X.
\]
The non-private PCA directions are obtained from the eigenvalue decomposition
\[
\hat\Sigma = \hat V \hat\Lambda \hat V^\top,
\]
where
\[
\hat V = [\hat v_1,\ldots,\hat v_p],
\quad
\hat\Lambda = \operatorname{diag}(\hat\lambda_1,\ldots,\hat\lambda_p)
\quad \text{with} \quad
\hat\lambda_1 \geq \hat\lambda_2 \geq \cdots \geq \hat\lambda_p \geq 0.
\]
The $\ell$-th sample principal component direction is $\hat v_\ell$.
Equivalently,
\[
\hat v_\ell = \arg\max_{\|v\|_2 = 1} v^\top \hat\Sigma v
\quad \text{subject to} \quad
v^\top \hat v_j = 0, \qquad j = 1,\ldots,\ell-1.
\]
In the non-private option of `dppca`, the direction matrix used for projection is
\[
\hat V_k = [\hat v_1,\ldots,\hat v_k].
\]

## 2. DP PC directions

[Kim and Jung (2025)](#ref-Kim2025) proposed `g-DPPCA`, which applies a matrix Gaussian mechanism to the generalized multivariate Kendall's tau matrix. This matrix is built on a robust data transformation, the generalized spatial sign, proposed by [Raymaekers and Rousseeuw (2019)](#ref-Raymaekers2019).

For a positive-valued scale function $\xi: (0, \infty) \to (0, \infty)$, consider the map $g_\xi: \mathbb{R}^d \to \mathbb{R}^d$ defined by
\[
g_\xi(t) = \xi(\|t\|_2)\cdot \frac{t}{\|t\|_2},
\]
where $d$ is the data dimension (denoted $p$ above). The map $g_{\xi}$ is called a *generalized spatial sign* with respect to $\xi$. The *generalized multivariate Kendall's tau* matrix with respect to $g_\xi$ is defined as
\[
K_{g_\xi} = \mathbb{E}_{X, X'}\left[
g_\xi\left( \frac{X - X'}{\sqrt{2}}\right)
g_\xi\left( \frac{X - X'}{\sqrt{2}}\right)^\top
\right],
\]
where $X'$ is an independent copy of $X$.

Importantly, if $X$ follows an elliptical distribution (a family that includes the Gaussian and multivariate $t$-distributions), $K_{g_\xi}$ has the same eigenvectors as $\operatorname{cov}(X)$, in the same order. One can therefore conduct PCA by estimating $K_{g_\xi}$ and computing its eigenvectors.

For convenience, we write $g$ for the chosen generalized spatial sign function. For a random sample $S = (X_1, \dots, X_n)$, the second-order *U*-statistic estimating $K_{g}$ can be written as
\[
\widehat{K}_g(S) = \frac{2}{n(n-1)} \sum_{i < j}
g\left(\frac{X_j - X_i}{\sqrt{2}}\right)
g\left(\frac{X_j - X_i}{\sqrt{2}}\right)^\top.
\]
The sensitivity of $\widehat{K}_g$ with respect to the Frobenius norm is bounded above by
\[
\Delta_F(\widehat{K}_g) = \sup_{S \sim S'} \|\widehat{K}_g(S) - \widehat{K}_g(S')\|_F \le \frac{4\|g\|_\infty^2}{n}.
\]
Here $S \sim S'$ denotes datasets that differ in a single observation, and $\|g\|_\infty = \sup_t \|g(t)\|_2$. The bound follows because replacing one observation changes $n-1$ of the $n(n-1)/2$ summands, each by at most $2\|g\|_\infty^2$ in Frobenius norm.

Consequently, for a dataset $S = (x_1, \dots, x_n)$, the randomized mechanism $\bar{K}_g$ defined by
\[
\bar{K}_g(S) := \frac{2}{n(n-1)} \sum_{i < j}
g\left(\frac{x_j-x_i}{\sqrt{2}}\right)
g\left(\frac{x_j-x_i}{\sqrt{2}}\right)^\top
+ \operatorname{vecd}^{-1}(\zeta),
\]
where $\zeta \sim N_{d(d+1)/2}(0, \sigma_{\varepsilon, \delta}^2 I_{d(d+1)/2})$ and
\[
\sigma_{\varepsilon, \delta} = \frac{4\|g\|_{\infty}^2 \sqrt{2 \ln(1.25/\delta)}}{n\varepsilon},
\]
satisfies $(\varepsilon, \delta)$-DP.

Define $\bar{V}_{g, m}(S) \in \mathcal{O}(d, m)$, the set of $d \times m$ matrices with orthonormal columns, as the matrix of the first $m$ eigenvectors of $\bar{K}_g(S)$; here $m$ plays the role of $k$ above. By the post-processing property, $\bar{V}_{g, m}(S)$ also satisfies $(\varepsilon, \delta)$-DP, so it can serve as a differentially private estimate of the PC directions. Kim and Jung (2025) call this procedure `g-DPPCA`.

In the implementation of the function `dp_pc_dir` with option `g_dppca=TRUE`, we use the spherical transformation $g_{sph}(t) = t/\|t\|_2$ to output differentially private PC directions $\bar{V}_{sph,m}$. In this case $\|g_{sph}\|_{\infty} = 1$, and thus the standard deviation of the additive Gaussian noise is set to
\[
\sigma_{\varepsilon, \delta} = \frac{4\sqrt{2 \ln(1.25/\delta)}}{n\varepsilon}.
\]

## Summary

The principal component direction step in `dppca` can be summarized as follows.

1. Start with a preprocessed data matrix $X$.
2. Choose a direction estimation method.
3. Obtain a direction matrix $V_k$.
4. Compute projected scores $Z = X V_k$.
5. Use the scores for private scree estimation or private score visualization.

The main distinction is whether $V_k$ is obtained from the ordinary sample covariance matrix or from a differentially private robust PC direction estimator.
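The steps above can be sketched in base R. This is a minimal illustration, not the implementation behind `dp_pc_dir`. The toy data, the helper names `kendall_sph()` and `sketch_dp_dirs()`, and the privacy parameters are all hypothetical. The noise placement is also an assumption: i.i.d. Gaussian entries on the upper triangle, mirrored below the diagonal, stand in for $\operatorname{vecd}^{-1}$.

```{r}
set.seed(1)

# Toy centered data (illustrative assumption): n observations, d variables.
n <- 200
d <- 4
X <- matrix(rnorm(n * d), n, d) %*% diag(c(3, 2, 1, 0.5))
X <- scale(X, center = TRUE, scale = FALSE)
k <- 2

# Non-private directions: leading eigenvectors of the sample covariance.
Sigma_hat <- crossprod(X) / (n - 1)
V_np <- eigen(Sigma_hat, symmetric = TRUE)$vectors[, seq_len(k), drop = FALSE]

# Hypothetical helper: U-statistic of the spherical Kendall's tau matrix.
# For g_sph(t) = t / ||t||, the 1 / sqrt(2) scaling cancels and is omitted.
kendall_sph <- function(X) {
  n <- nrow(X)
  K <- matrix(0, ncol(X), ncol(X))
  for (i in seq_len(n - 1)) {
    # Rows are the differences X_j - X_i for all j > i.
    D <- sweep(X[(i + 1):n, , drop = FALSE], 2, X[i, ])
    r <- sqrt(rowSums(D^2))
    keep <- r > 0                      # skip exact ties
    U <- D[keep, , drop = FALSE] / r[keep]
    K <- K + crossprod(U)              # adds sum_j u_j u_j^T
  }
  2 * K / (n * (n - 1))
}

# Hypothetical helper: noisy Kendall's tau matrix, then leading eigenvectors.
sketch_dp_dirs <- function(X, k, eps, delta) {
  n <- nrow(X)
  d <- ncol(X)
  # Noise scale from the text, using ||g_sph||_inf = 1.
  sigma <- 4 * sqrt(2 * log(1.25 / delta)) / (n * eps)
  # Assumed noise layout: i.i.d. N(0, sigma^2) on the upper triangle, mirrored.
  E <- matrix(0, d, d)
  idx <- upper.tri(E, diag = TRUE)
  E[idx] <- rnorm(sum(idx), sd = sigma)
  E[lower.tri(E)] <- t(E)[lower.tri(E)]
  K_bar <- kendall_sph(X) + E
  eigen(K_bar, symmetric = TRUE)$vectors[, seq_len(k), drop = FALSE]
}

V_dp <- sketch_dp_dirs(X, k = k, eps = 0.5, delta = 1e-5)

# Projected scores Z = X V_k for each choice of directions.
Z_np <- X %*% V_np
Z_dp <- X %*% V_dp

# Absolute inner products between matching directions (values near 1 mean
# the private and non-private directions are close).
abs(diag(crossprod(V_np, V_dp)))
```

Since $\sigma_{\varepsilon,\delta}$ is proportional to $1/(n\varepsilon)$, the private directions in this sketch move closer to the noise-free Kendall's tau directions as the sample size or the privacy budget grows.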
## References

<a id="ref-Kim2025"></a>Minwoo Kim and Sungkyu Jung (2025), "Robust and differentially private principal component analysis," *Statistical Analysis and Data Mining*, 18(6), https://doi.org/10.1002/sam.70053

<a id="ref-Raymaekers2019"></a>Jakob Raymaekers and Peter Rousseeuw (2019), "A generalized spatial sign covariance matrix," *Journal of Multivariate Analysis*, 171:94–111, https://doi.org/10.1016/j.jmva.2018.11.010