--- title: "Example pipeline" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Example pipeline} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(BREADR) ``` ## Example This basic example shows you how to analyse a real (anonymised) ancient DNA data set. An analysis would normally start by defining paths to all three Eigenstrat files, i.e., ```{r} ind_path <- "path/to/eigenstrat/indfile" snp_path <- "path/to/eigenstrat/snpfile" geno_path <- "path/to/eigenstrat/genofile" ``` and we would then preprocess this data using the processEigenstrat() function, ```{r,eval=FALSE} counts_example <- processEigenstrat( indfile = ind_path, snpfile = snp_path, genofile = geno_path ) ``` Since this step is the most computationally expensive, we include an option to automatically save the output as a TSV upon completion of the preprocessing, i.e, to avoid repeating the preprocessing ```{r,eval=FALSE} counts_example <- processEigenstrat( indfile = ind_path, snpfile = snp_path, genofile = geno_path, outfile = "path_to_save_tsv" ) ``` Also included are options to change the minimum distance between overlapping sites (filter_length) from the default $1\times 10^5,$ an option to only include individuals (pop_pattern) from certain populations (as defined in the ind file) and an option to remove C->T and G->A SNPs to avoid the effects of deamination. Note that since the Eigenstrat files are too large to include, we provide a pre-processed data set called [counts_example](https://github.com/jonotuke/BREADR/blob/master/data/counts_example.rda) which is the product of running processEigenstrat() on real data (although the raw Eigenstrat files can be found on [the Github page](https://github.com/jonotuke/BREADR/tree/master/data-raw)). We now load the counts_example tibble. ```{r example, eval = FALSE} library(BREADR) counts_example ``` ```{r, echo = FALSE} counts_example ``` We can we can estimate the degrees of relatedness from the raw counts using the callRelatedness() function. Users have the option to change the prior probability for each degree of relatedness (class_prior) from the default uniform prior, to define the expected PMR (average_relatedness) for a pair of unrelated individuals from the default of the sample median, and to set the minimum number of overlapping SNPs required for a pair of individuals (filter_n) to be included in the analysis from the default of 1. If the user decides to use the sample median as an estimate for the expected PMR for a pair of unrelated individuals, the minimum number of overlapping SNPs for a pair (median_co) can be changed from the default of 500. ```{r, eval = FALSE} relatedness_example <- callRelatedness(counts_example) relatedness_example ``` ```{r, echo = FALSE} relatedness_example <- callRelatedness(counts_example) relatedness_example ``` An overall picture of the relatedness for these individuals can be created, which displays the highest posterior probability degree of relatedness for each pair of individuals, indicated by the colour and shape of the points with 95% confidence intervals. The dashed horizontal lines indicated the expected values for each degree of relatedness. Users can choose to remove individuals with less than a user-defined number of overlapping SNPs (nsnps_cutoff). In large data sets, it is expected that the vast majority of individuals will be unrelated, and so the user may wish to only plot the more closely related individuals. TO this end, the user may choose to plot only the first $N$ individuals (using the input parameter N), which have been sorted by PMR value, in ascending order. ```{r, warning=FALSE, message=FALSE} plotLOAF(relatedness_example) ``` A plot for the assignment of the "Unrelated" degree of relatedness to individuals Ind1 and Ind2 can be produced which shows diagnostic information about the estimated genetic relatedness between a single pair of individuals. In the left panel, the distribution of PMR values for each degree of relatedness, given the number of overlapping SNPs, with the observed PMR (and 95% confidence interval) displayed below this. In the right panel are the normalised posterior probabilities for the each of the degrees of relatedness, indicating the certainty with which the degree of relatedness with the highest posterior probability was chosen. ```{r} plotSLICE(relatedness_example, row = 1) ``` Alternatively we can choose the row using the full pair name instead of the row number. ```{r} plotSLICE(relatedness_example, row = "Ind1 - Ind2") ``` We can test whether the observed PMR between Ind1 and Ind2 is consistent with a 3rd-degree genetic relationship, or not, but any degree of relatedness from 0 to 10 can be investigated. This function returns the p-value from the associated binomial test, and by setting verbose to TRUE, all diagnostic output from the test can be displayed ```{r} test_degree(relatedness_example, 1, 3, verbose = TRUE) ```