---
title: "Pedigree PCA"
author: "Hannes Becher"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{pedigree-pca}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(randPedPCA)
```

# Intro
Principal component analysis (PCA) is a method for summarising the information
in the columns of a matrix. Because pedigrees can be represented as matrices,
they can, in principle, be subjected to PCA.

When thinking about PCA,
one usually begins with a data matrix, such as the iris dataset, where rows represent
individuals (or observations) and columns contain traits (or features or markers). However, when
dealing with a pedigree-derived matrix, there is no data matrix of this kind. For
instance, a pedigree's additive relationship matrix $A$ is a covariance matrix.
Fortunately, PCA can also be achieved by starting from a covariance matrix. That is what
the **`randPedPCA`** package does.

In this vignette, we demonstrate how to use the package **`randPedPCA`** and how to
carry out PCA on your own pedigree or pedigree matrix.

## Pedigree input
Using an example dataset that ships with **`randPedPCA`**, called `pedMeta2`, we first generate a pedigree object. This is done using the function `pedigree` from the dependence **`pedigreeTools`**.
Then we use `rppca` to perform PCA. Ignore the deprecation warning about matrix conversion if one comes up. It is generated by a dependency.
```{r}
# generate a pedigree object from the example dataset provided
ped <- pedigree(pedMeta2$fid,
                pedMeta2$mid,
                pedMeta2$id)
pc01 <- rppca(ped, center=TRUE)
plot(pc01, col = factor(pedMeta2$population))
```


## Matrix input
Instead of starting from a pedigree, we can also use the pedigree's $L^{-1}$ matrix
as the input to `rppca`. This matrix can be generated using the `getLInv` function
from **`pedigreeTools`**. Note that `rppca` expects a sparse matrix
in `spam` format. This is why we wrap `sparse2spam` around `getLInv`:
```{r}
li <- sparse2spam(getLInv(ped))
pc02 <- rppca(li, center = TRUE)
plot(pc02, col = factor(pedMeta2$population))
```

For simplicity, we decided to ship this $L^{-1}$ matrix as an example dataset.
The PCA result is identical when using this object `pedLInv2`:
```{r}
pc03 <- rppca(pedLInv2, center = TRUE)
plot(pc03, col = factor(pedMeta2$population))
```

## Centring
It is usually recommended that PCA is run on centred data, which is what we did
in the examples above. If the data are not centred, PC1 tends to capture the
deviation from mean = 0 of each data column. In the context of pedigree PCA,
this usually means that PC1 accounts for a high proportion of the total variance
and it usually aligns with time or generation number.
```{r echo=F, message=F, fig.dim = c(6, 3)}
set.seed(123345) # set random seed as Hutch++ uses random numbers
ttv <- hutchpp(pedLInv2,num_queries = 100, center = TRUE) # estimate
opar <- par(mfrow=c(1,2))
plot(rppca(ped), main="Un-centred", col=factor(pedMeta2$population))
plot(rppca(ped, center=TRUE, totVar = ttv), main="Centred", col=factor(pedMeta2$population))
par(opar)
```

The plots above show an un-centred and a cented PCA of the same simulated pedigree. Two
breeds (red and black) are generated from an initial population and are propagated
individually for 10 generations. Then, a hybrid population (green) is made, which keeps
receiving geneflow from both red and black for another 10 generations. Without
centring (left), PC1 aligns well with the generation number, a pattern that we often observe.
With centring (right), PC1 reflects the differentiation
between the breeds.

## Variance components
When running PCA, users often care about how much variance and what
proportion of the total variance is accounted for by individual principal
components. This can be queried using the `summary` function on a PCA object. For
objects generated by `rppca`, `summary` will return the standard deviation of each PC.
There will be two additional, sowing the proportion of total variance and their
cumulative sum, unless if the input to `rppca` was an $L^{-1}$ matrix.
R users may be used to this behaviour from the built-in functions `prcomp` and
`princomp`.

The reason that the variance proportions and cumulative sums are not shown by default
is that these numbers are costly to estimate from a large $L^{-1}$ matrix,
compared to running the actual PCA. If the total variance is known, it can be
specified when running `rppca` using the argument `totVar`. Below, we describe
how the total variance can be estimated.

### Estimating the total variance from an $L^{-1}$ matrix
Again, this is only required if the pedigree is unknown and the user has access to only the $L^{-1}$ matrix.
In the context of pedigrees, the total variance of the data corresponds to the
variance of the pedigree's $L$ matrix, which is equivalent to the trace of the
corresponding $A$ matrix. Both matrices are expensive to compute. If the PCA is
ran without centring, then the total variance also equals the sum of (the
inbreeding values + 1), which can be computed directly from the pedigree. With
centring, this does not work and the total variance will be lower. It has to be
estimated. Here, we implemented the [Hutch++](https://doi.org/10.48550/arXiv.2010.09649) trace estimation algorithm for this purpose. 

```{r}
# True total variance computed via the pedigree's inbreeding values
sum(inbreeding(ped) + 1) # "ped" was defined at the top

# Now estimate the total variance (the trace of A) from the corresponding
#  L inverse matrix using Hutch++
li <- sparse2spam(getLInv(ped)) # generate L inverse and convert to spam format
set.seed(123345) # set random seed as Hutch++ uses random numbers
hutchpp(li) # Hutch++ with default settings

# for higher accuracy increase num_queries (increases running time)
hutchpp(li,num_queries = 100)
```
We recommend that users experiment with the value of `num_queries` to optimise
accuracy. Using the the default value of 10 should work within reasonable time
even for a large pedigree of 1M individuals.

Hutch++ can also be used to compute the total variance of a **centred dataset**.
We demonstrate this here using a small simulated example, where we can perform
naive centring for comparison.
```{r}
# Get L, the "data matrix" of a pedigree
ll <- getL(ped)
# centre L (because L is upper triangular, we centre the rows)
llc <- apply(ll, 1, function(x) x - mean(x))
# compute additive relationship matrix of the centred data
ac <- llc %*% t(llc)
sum(diag(ac)) # exact value (would be too expensive to compute for a large pedigree)

# Obtain centred estimate from L inverse using Hutch++
li <- sparse2spam(getLInv(ped)) # generate L inverse and convert to spam format
set.seed(123345) # set random seed as Hutch++ uses random numbers
hutchpp(li,num_queries = 100, center = TRUE) # estimate
```

*If you want to use a* `hutchpp` *estimate when running* `rppca`, *make sure that the*
`center` *values are identical. E.g., when running a PCA with centring, make
sure that you have run* `hutchpp` *with centring as well!*

### Variance components - Examples
The summary of a PCA generated from an $L^{-1}$ matrix by default only shows
the standard deviation accounted for by each PC. It also throws a warning saying
that there is no information about the total variance:
```{r}
summary(pc02)
```

When starting from a pedigree and with no centring, then variance proportions and
cumulative sums are returned as well:
```{r}
pc04 <- rppca(ped, center=FALSE)
summary(pc04)
```


If the user has an **estimate of the total variance**, this can be supplied when calling
`rppca` on a matrix object and the variance proportions and cumulative sums are returned.
In practice, this should rarely be needed as the user will probably have access
to the pedigree, from which the total variance is automatically computed (example
above).
```{r}
pc05 <- rppca(pedLInv2, center=FALSE, totVar=3521.534)
summary(pc05)
```


When running `rppca` on a pedigree object without centring, totVar can be used
to override the value computed from the pedigree. This triggers a warning. In this
case here, it also causes variance proportions > 1 which do not make sense:
```{r}
pc07 <- rppca(ped, center=FALSE, totVar=123)
summary(pc07)
```

**Centred PCA**  
Variance proportions are automatically computed if the input is
a pedigree.
```{r}
pc06 <- rppca(ped, center=TRUE) 
summary(pc06)
```


If the input is an $L^{-1}$ matrix, variance proportions are only computed if
an estimate of the total variance is supplied.
```{r}
# No proportions shown by default
pc08 <- rppca(pedLInv2, center=TRUE) 
summary(pc08)
```

```{r}
# Only when estimate is supplied
pc09 <- rppca(pedLInv2, center=TRUE, totVar=2673.6) 
summary(pc09)
```

# Plotting
We generated a `plot` method for `rppca` objects. If the variance proportions are
known, they will be displayed in the axis labels by default. Note that if the
total variance was estimated as in this example, where it was supplied when
calling `rppca`, then the proportions displayed come with some uncertainty.
```{r}
# PC1 and PC2 are plotted by default
plot(pc06, col=factor(pedMeta2$population), main="My pedigree PCA")
# plot PC1 and PC3 instead
plot(pc06, dims=c(1,3), col=factor(pedMeta2$population), main="My pedigree PCA,\ncustom PCs shown")
```

## Down-sampling
Because `plot.rppca` relies on R's relatively slow base plotting function, it
is useful to down-sample the individuals to be plotted. Thus, `plot.rppca` will
automatically down-sample to 10,000 individuals if there are more in the dataset.
Technically there is no down-sampling and all the data are retained. Instead, an
integer vector is added in the `ds` slot of the `rppca` object.
That vector is automatically used by `plot.rppca` for individuals and any
colours specified by the `col` parameter. The value of `col` should thus have
the same length as there were individuals in the full dataset.
To adjust the down-sampling, one can use the parameter `to` of `plot.rppca`.
The value of `to` is passed to the function `dspc`, which carries out the down-sampling/vector
generation. See `?dspc` for details. To plot wihthout any down-sampling, do `plot(myPc, to=NA)`.

One potential issue is that every time `plot.rppca` is run, a new set of
individuals is sampled, so that repeated plots of the same dataset will look different.
To retain a specific down-sampling, one can run `dspc` outside of `plot.rppca`,
like `pcDown <- dspc(pc)`. Calling `plot(pcDown)` repeatedly will generate the same plot every time.

# Performance
We have implemented two PCA methods in `rppca`, accessible through the `method` argument.
The user can choose between `randSVS` (the default) and `rspec`. The latter does not
use randomised SVD but rather the Spectra C++ library (https://spectralib.org/) via the 
**`RSPpectra`** package. This method tends to be faster than the default if a greater number
of principal components are requested. The exact threshold depends on the dataset and
the computing platform used. As a rule of thumb, if the number of PCs requested is $\leq$ 6,
then the default method will be faster. This addition was inspired by reviewer 2
of our manuscript submitted to GSE.


# Using one's own data
Users will mainly want to analyse their own empirical or simulated datasets. This
should be straightforward if the data are in pedigree format, which we recommend.
In this case, `rppca` can be called directly in such an object as demonstrated above. 

If the user has an $L^{-1}$ matrix, this needs to be converted into 
a `spam` object, a specific sparse matrix format. We provide a wrapper
called `importLinv` for reading
matrices in Matrix Market format from disk and converting these into `spam` format.
If a sparse $L^{-1}$ matrix is already available in in the workspace, `sparse2spam`
should work to convert it to `spam` format. Let us know if this does not work for
you (and please share a reproducible example).