% Copyright (C) 2019-2020 David Tallman
% $Id: CNSigs.Rnw
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{CNSigs.Rnw}
\documentclass[nojss]{jss}

\title{CNSigs: An R package for the identification of copy number signatures}
\Plaintitle{CNSigs: An R package for the identification of copy number signatures}
\Shorttitle{CNSigs: Copy number signatures in R}
\author{David Tallman\\The Ohio State University}
\Plainauthor{David Tallman}
\Address{
  David Tallman\\
  The Ohio State University\\
  512 BRT 460 W 12th Ave,\\
  Columbus, OH 43210\\
  E-mail: \email{tallman.52@osu.edu}\\
  URL: \url{https://u.osu.edu/stoverlab/}
}
\usepackage[utf8]{inputenc}
\usepackage{listings}
\usepackage{float}
\usepackage[font=scriptsize]{caption}

\Abstract{
Copy number aberrations (CNAs) are gains and losses of large genomic segments present across most cancer types and are a hallmark of cancer genomic alterations. However, the processes underlying CNAs and the characteristic patterns they produce are poorly understood. Using single nucleotide variant (SNV) data, bioinformatic advances have identified underlying mutational signatures resulting from distinct mutational processes. Mutational signatures have led to a variety of discoveries, several of which are being investigated in the clinical management of cancer. The development of algorithms able to uncover similar signatures for CNAs, rather than SNVs, is still in its infancy. Here we present an analysis package for the R programming language called CNSigs that allows for the robust and reproducible derivation of copy number signatures. Based on a list of extracted copy number features previously verified in ovarian cancer, we utilize mixture model approaches and non-negative matrix factorization to derive CNA signatures across cancer types. The development of a package to derive signatures from copy number data allows further investigation of the underlying processes that may be responsible for these CNA fingerprints. The CNSigs package also allows researchers to easily analyze their own samples, look for signatures in their copy number profiles, and compare these to signatures previously derived for their cancer type.
}
\Keywords{R, copy number signatures, copy number segments}
\Plainkeywords{R, copy number signatures, copy number segments}

\begin{document}

\section{Introduction} \label{sec:introduction}
This package is used to generate and analyze signatures derived from copy number segments. There are two main ways to use the analysis pipeline: you can either run it step by step so that you can inspect the results as you go, or run everything at once with a single function. The individual steps are laid out in the following sections, and the single function is described in the section ``Running the Full Pipeline''. The package can save all of the results into a folder that includes the plots and other outputs from the pipeline. It also provides many functions for analyzing and interpreting your results, supports parallelization, and exposes several tunable parameters.

\newpage

\section{Running the Pipeline} \label{sec:pipeline}
The package can be loaded with the usual library call. This document begins by walking through each step of the pipeline; the method for running the full pipeline at once is described later in the document.

<<>>=
library(CNSigs)
@

\subsection{Loading Data} \label{sec:loadData}
The first step in using the pipeline is to load in the data. CNSigs supports running the pipeline either straight from an already formatted list or from a text file.
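As a minimal sketch (with made-up segment values), an already formatted input is a list containing one data.frame per sample, each using the five expected column names described below:

<<eval=FALSE>>=
# A made-up, minimally formatted input: one data.frame per sample
seg1 = data.frame(ID         = "Sample1",
                  chromosome = c(1, 1, 2),
                  start      = c(1, 5000001, 1),
                  end        = c(5000000, 12000000, 9000000),
                  segVal     = c(2, 3, 2))
segData = list(Sample1 = seg1)
@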
The pipeline expects each data.frame to have exactly five columns named ``ID'', ``chromosome'', ``start'', ``end'', and ``segVal''. If your text file contains the same data but under different column names, you can use the colMap parameter of the readSegs function. The example data below contains all of the data that is needed, but the columns are named differently. To load that segment file properly, you must specify the mapping from your column names to the expected column names using the colMap parameter of readSegs, as seen below.

\includegraphics{./SampleSegData.png}

<<eval=FALSE>>=
readSegs("./SampleSegs.txt",
         colMap = c("Sample_ID", "Chrom", "Start", "End", "Total_CN"))
@

\subsection{Smoothing the Data} \label{sec:smoothData}
One of the preprocessing steps of the pipeline is to smooth the segments. This step reconciles differences across copy number calling pipelines: since the features extracted from the samples are directly related to the segments that were called, smoothing the segments dampens the effect of using different copy number callers. The package provides the smoothSegs function for this purpose. We recommend this smoothing step no matter which copy number caller is used.

<<>>=
smooth = smoothSegs(segDataExp)

# Visualize the smoothed segments
toPlot = list(segDataExp[[1]], smooth[[1]])
names(toPlot) = c("Original", "Smoothed")
plotSegs(toPlot, sep = T)
@

As you can see above, smoothing generally removes some of the smaller and more erroneous segments and leaves behind the higher confidence segments. The number in parentheses next to the sample name denotes the number of segments. Smoothing tends to lessen the effect of the choice of copy number caller on the resulting signatures.

\subsection{Extracting the copy number features} \label{sec:extractCNFeats}
The CNSigs package uses six different copy number features based on those used by Macintyre et al. Extracting these features from a dataset is straightforward: the extractCNFeats function returns a list object containing the six features. The six features are the segment size, the number of breakpoints per 10 megabase bin, the number of copy number oscillation events, the average size of changepoints, the average copy number, and the number of breakpoints per chromosome arm. Most of the features are an average value per chromosome. The features are then passed on to the mixture modelling step.

<<>>=
feats = extractCNFeats(smooth)
@
\vspace{-5 mm}

\subsection{Fitting the mixture models} \label{sec:fitModels}
After extracting the features, the pipeline fits a mixture model to each of the features and uses these mixture models to define the underlying components of the signatures. By default, the package searches for a mixture of between 2 and 10 components of the corresponding distribution type for each feature. This allows the pipeline to account for cases that have different underlying subgroups for each feature. If you want to change the range of the number of components to search for, you can do so using the mincomps and maxcomps parameters; each of these is a vector of six values giving the minimum or maximum number of components for each feature, as sketched below. For more information, refer to ?fitModels. You may also want to skip this step and use fixed components; if so, refer to the ``Fixed components'' section later in this document.
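For instance, to narrow the search to between 2 and 8 components for every feature, a call might look like the following sketch:

<<eval=FALSE>>=
# Search for between 2 and 8 mixture components for each of the six features
comps = fitModels(feats, mincomps = rep(2, 6), maxcomps = rep(8, 6))
@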
With the default settings, the call is simply:

<<eval=FALSE>>=
comps = fitModels(feats)
@

\subsubsection{Peak Reduction}
Since most of the copy number features have a ``normal'' value, that value often produces an abnormally large peak in the distribution once you have enough samples. The flexmix package, which is used for the mixture modelling, is often unable to accommodate these peaks well and will often fail to converge when finding the components. To get around this, the package includes an option to perform peak reduction before the modelling step. This method looks through the histogram of the input feature, identifies peaks that are larger than the surrounding bins, and reduces them. The overall structure of the distribution is preserved, so the modelling step no longer fails while the input data is largely unchanged. Setting the pR parameter to T in your fitModels call performs peak reduction before each of the modelling steps.

\begin{figure}[H]
\centering
\begin{minipage}{0.45\textwidth}
\centering
<<>>=
hist(featsExp$copynumber$value, breaks = 100,
     main = "Histogram of original copynumber data",
     xlab = "Copynumber values")
@
\caption{Original}
\end{minipage}\hfill
\begin{minipage}{0.45\textwidth}
\centering
<<>>=
test = CNSigs:::reducePeaks(featsExp$copynumber$value)
hist(test, breaks = 100,
     main = "Histogram of copynumber after reduction",
     xlab = "Copynumber values")
@
\caption{Reduced Peaks}
\end{minipage}
\end{figure}

\subsection{Generating the sample by component matrix (scm)} \label{sec:scm}
Once you have the components, you then create the sample by component matrix (scm). The scm shows the distribution of the components within each of the samples and is the input to NMF when defining the signatures. To generate the scm, the package looks at each sample and finds the posterior probability that the sample's features lie within each of the derived components. It then sums these probabilities and normalizes them by the number of features for the sample. This gives you an idea of how much each component can be found within a particular sample. The full pipeline automatically generates a heatmap plot to display the scm, but it can also be done using the code below.

<<>>=
scm = generateSCM(feats, cancerComps)
plotScm(scm)
@

\subsection{Adding Ploidy Data (Optional)} \label{sec:ploidy}
After generating the scm, you can include ploidy data in the signature identification. The package provides the addPloidyData function: given the scm from the previous step and the samples' ploidy data, it returns a new scm that includes the ploidy data transformed to log2 ploidy. Make sure that you give it the absolute ploidy, because the function performs the log2 transform by default. If your ploidy data is in the same file as your segment data, you can read both in at once by calling readSegs with the readPloidy parameter set to TRUE.

<<eval=FALSE>>=
scm = addPloidyData(scm, ploidyData)
@

\subsection{Generating the signatures} \label{sec:createSigs}
Once the scm has been generated, you can perform the NMF step using the createSigs function. The main parameter for this function, besides the scm, is nsig, which determines how many signatures the function will find. If you aren't sure how many signatures are appropriate, you can use the determineNumSigs function as laid out in a later section. createSigs returns the resulting NMF object.

<<eval=FALSE>>=
sigs = createSigs(scm, 5)
@

To look at the final component weights for the signatures, you can use the NMF function scoef to get the scaled coefficients for the signatures. You can also use the basis function from NMF to see the signature exposures for each of the samples.
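For example, a quick inspection of the fit might look like this sketch (assuming sigs holds the NMF object returned by createSigs above):

<<eval=FALSE>>=
library(NMF)
scoef(sigs)  # scaled coefficients: the component weights for each signature
basis(sigs)  # the signature exposures for each of the samples
@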
\subsection{Generating the exposures} \label{sec:findExposures}
As an alternative, once the scm has been generated you can use a fixed set of signatures to identify the signature contribution of each sample. findExposures does this by constrained least squares optimization, using the lsei function from limSolve to keep the output non-negative. The CNSigs package contains pre-generated signatures derived from TCGA, such as CancerSigs (a set of 25 signatures) and collapsedSigs (a set of 13 signatures). You can also reuse your own signatures: save the results of a previous createSigs run and pass its sigs element as fixedSigs for a new run on a different sample set.

<<eval=FALSE>>=
sigs = findExposures(scm, fixedSigs)
@

\section{Running the Full Pipeline}
The descriptions above walked you through analyzing your copy number data step by step. The CNSigs package also allows you to do all of the above with one command, runPipeline, which follows the flowchart laid out in the picture below. This function allows you to specify almost all of the parameters of the functions described above; for a full list of the parameters, refer to the function documentation using ?runPipeline. We recommend saving the results of the pipeline to an object so that you can investigate them further with subsequent functions. If you forget to do so, but you specified the saveRes parameter as T, you can simply read in the 'Pipeline Results.rds' file to recover the same results data structure.

<<eval=FALSE>>=
results = runPipeline(segDataExp)
@

\includegraphics{./runPipeline_flowchart.png}

\subsection{Using Fixed components or signatures} \label{sec:fixed}
There are many use cases where you may want to run the analysis pipeline using a fixed set of signatures or components. The CNSigs package easily allows the user to fix either of these.

\subsubsection{Fixed components} \label{sec:fixedComps}
Fixing the signature components allows you to compare signatures across datasets, since it ensures that the underlying signature components point at the same feature distributions. Also, if the dataset you are analyzing is small, there may not be enough information to derive generalized components. If you are analyzing a set of cancer samples, you may want to use the components included in the package that were derived from the entire TCGA landscape. These components are representative of all cancer types and allow you to easily and accurately compare signatures derived from different cancer samples. Since fixing the components skips the modelling process, it also allows you to more robustly investigate smaller datasets. To use the derived cancer components, just pass in the cancerComps object from the CNSigs package as seen below.

<<eval=FALSE>>=
results = runPipeline(segData, components = cancerComps)
@

\subsubsection{Fixed signatures} \label{sec:fixedSigs}
You may also want to run the pipeline with not only fixed components but also fixed signatures.
If you have already derived signatures for a dataset and simply want to see which of those signatures appear in a new set of samples, fixing the signatures is the way to do it. This sort of analysis is similar to what is done with mutational signatures: since the mutational signatures have already been defined, you often want to see which of them are present in your samples. To use fixed signatures in the CNSigs pipeline, you have to specify both the components that made the signatures and the signatures themselves. The easiest way to do this is to load the results object from a previous run and take the components and signatures from there, as seen below. Just replace 'resultsPath' with the path to the results of the previous run. The package includes a set of 25 unique signatures that were found by looking at all 33 cancer types in the TCGA dataset.

<<eval=FALSE>>=
# referenceExp = readRDS("resultsPath/Pipeline results.rds")
newResults = runPipeline(segData,
                         components = referenceExp$CN_components,
                         fixedSigs = referenceExp$sigs)
@

\subsection{Ploidy Data} \label{sec:Pipeline Ploidy Data}
If you are using fixed signatures and components that utilize ploidy data, you can pass your segment ploidy data into the pipeline function as ploidyData. The pipeline requires input ploidyData if your fixed signatures were built with ploidy.

<<eval=FALSE>>=
newResults = runPipeline(segData,
                         components = cancerComps,
                         fixedSigs = collapsedSigs,
                         ploidyData = segPloidy)
@

\section{Analyzing your results}
This package includes many functions for analyzing the results produced by the pipeline. Here we describe those functions and why they may be useful to you.

\subsection{Plotting the components} \label{plotComps}
One thing you may want to do is see what the components fitted to the extracted features look like. For this you can use the plotComp and plotComps functions; plotComps simply calls plotComp for all of the different component types. plotComp looks at the parameters of the mixture model that was fit for the component type you specify and plots an approximation of the fitted distributions. This is useful if you want to compare two independently fit component sets. For instance, you may want to visualize how similar the bp10MB components are across two separate datasets; by plotting the two you can get a rough idea of how similar they are.
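For example, assuming compsA and compsB are hypothetical component sets from two separate fitModels runs, a side-by-side comparison might look like this sketch:

<<eval=FALSE>>=
# compsA and compsB are illustrative: two independently fit component sets
plotComp(compsA, "bp10MB")
plotComp(compsB, "bp10MB")
@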
Follow the code samples below to plot your components.

<<eval=FALSE>>=
plotComp(cancerComps, "bp10MB") # Only plots the bp10MB component
@

<<eval=FALSE>>=
plotComps(cancerComps) # Plots all of the components
@

\begin{figure}[H]
\centering
\begin{minipage}{0.3\textwidth}
\centering
<<>>=
plotComp(cancerComps, "segsize")
@
\caption{Segment Size}
\end{minipage}\hfill
\begin{minipage}{0.3\textwidth}
\centering
<<>>=
plotComp(cancerComps, "bp10MB")
@
\caption{Breakpoints per 10MB}
\end{minipage}\hfill
\begin{minipage}{0.3\textwidth}
\centering
<<>>=
plotComp(cancerComps, "osCN")
@
\caption{Number of Oscillations}
\end{minipage}
\vspace*{1cm}
\begin{minipage}{0.3\textwidth}
\centering
<<>>=
plotComp(cancerComps, "changepoint")
@
\caption{Copy Number Changepoint}
\end{minipage}\hfill
\begin{minipage}{0.3\textwidth}
\centering
<<>>=
plotComp(cancerComps, "copynumber")
@
\caption{Copynumber}
\end{minipage}\hfill
\begin{minipage}{0.3\textwidth}
\centering
<<>>=
plotComp(cancerComps, "bpchrarm")
@
\caption{Breakpoints per chromosome arm}
\end{minipage}
\end{figure}

\subsection{Determining signature similarity}
Another useful task is estimating the similarity between two signatures. There are multiple ways of looking at signature similarity.

\subsubsection{Same components, similar signatures}
If you believe that your two sets of signatures should be nearly identical, you can use the matchSigs function. It finds the best match for each signature between the reference and comparison set. This can be useful if you have two datasets made up of similar samples that you believe should produce similar signatures.

<<>>=
matchSigs(referenceExp$sigs, referenceExp$sigs)
@

\subsubsection{Same components, different signatures}
If you don't expect your signatures to be the same, but you used the same components to derive both signature sets (e.g., cancerComps), you can use the sigSim function. It compares each signature in the reference set to every signature in the comparison set, then plots a heatmap showing the similarity of the signatures. (Note: the output of this function is meaningful only if the components in both signature sets are identical.)

<<>>=
sigSim(referenceExp, referenceExp)
@

\subsubsection{Different Components}
If you let the pipeline fit the models independently, the sigSim function recognizes that the underlying components are no longer the same and no longer point to the same data. In this case, it estimates the signature similarities instead. It first attempts to recreate the underlying feature distributions that made up each signature by simulating values based on the feature models and the component weights in the signatures. The two simulated feature distributions are then compared using the Kolmogorov-Smirnov (KS) test. This gives a similarity per component, and the component similarities are averaged to give an overall signature similarity.
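As a conceptual sketch of this estimation (not the package's internal code), you can simulate values from two fitted feature distributions and compare them with the KS test; one minus the KS statistic is used here as an illustrative similarity score:

<<eval=FALSE>>=
set.seed(42)
# Simulated draws standing in for two components' feature distributions
# (simple normals here, in place of the fitted mixture components)
simA = rnorm(1000, mean = 2.0, sd = 0.5)
simB = rnorm(1000, mean = 2.2, sd = 0.6)
# The KS statistic D is the maximum distance between the two empirical
# distributions, so 1 - D is one possible per-component similarity score
D = ks.test(simA, simB)$statistic
similarity = 1 - D
@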
\subsection{Plotting Signature Exposure}
It is very useful to visualize the exposure of your samples to the signatures. The pipeline generates this plot automatically, but the package also includes two functions that allow you to do it yourself.

\subsubsection{Signature Exposure Matrix Visualization}
This heatmap based visualization is the default visualization method used by the pipeline. Using this function, you can specify the ordering of both the rows and the columns of the plot. The ordering parameters let you order the values either by hierarchical clustering or with a user specified order.

<<>>=
plotSigExposureMat(referenceExp$sigExposure)
@

\subsubsection{Signature Exposure Stacked Bar Visualization}
Another way of visualizing the signature exposures is via a stacked bar plot, created with the plotSigExposure function. One of the most useful features of this function is the ability to display additional data as tracks along the plot. You can also specify a set of sorting instructions to sort the exposure plot by both the exposures and the trackData.

<<>>=
plotSigExposure(referenceExp$sigExposure)
@

This function allows you to add data as tracks. If you only want to add a single track, pass a vector of data in the same order as the samples in the sigExposure matrix using the trackData parameter. You can plot up to 3 tracks by passing them in as a list.

<<>>=
sigExposure = referenceExp$sigExposure

# Generate random data to represent track data
sampleTrackData = sample(c(1, 2, 3, 4), ncol(sigExposure), replace = TRUE)
sampleTrackData2 = sample(c(1, 2, 3, 4), ncol(sigExposure), replace = TRUE)

# Plot with a single track
plotSigExposure(sigExposure, trackData = sampleTrackData)

# Plot multiple tracks
plotSigExposure(sigExposure,
                trackData = list(sampleTrackData, sampleTrackData2))
@

Using the sortOrder parameter you can sort the plot in many different ways. Once you give the function a set of trackData, you can specify a sortOrder, which lets you sort the main plot in a different order. ``m'' represents the main plot, and ``t'' followed by the number of a track (i.e., ``t1'', ``t2'', ...) represents the tracks. By chaining the values together you can specify a variety of ways to sort the final plot. For example, a sortOrder of ``mt1t2'' specifies that the plot should be sorted by the signature exposures first, then by the first track, and finally by the second track. As another example, a sortOrder of ``t2mt1'' sorts the plot by track 2 first, then by the signature exposures, and lastly by track 1.

<<>>=
# Plot multiple tracks sorted by the main plot and then the first track
plotSigExposure(sigExposure,
                trackData = list(sampleTrackData, sampleTrackData2),
                sort = T, sortOrder = "mt1")

# Plot multiple tracks sorted by the second track, then the first track,
# and then the main plot
plotSigExposure(sigExposure,
                trackData = list(sampleTrackData, sampleTrackData2),
                sort = T, sortOrder = "t2t1m")
@

\section{Additional Options}
Most of the functions within this pipeline have additional parameters that allow for further tuning of the copy number signatures pipeline. For more information on and examples of these parameters, refer to the package manual, or to the individual function documentation using R's built-in ? notation (?functionName).

\subsection{Parallelization}
A large majority of the functions in this package can be run in parallel. Using the cores parameter, you can specify the number of workers for the functions to use. Both extractCNFeats and fitModels can use a maximum of 6 workers, since each worker is assigned one of the 6 features. Parallelization helps speed up many of the methods, especially the NMF steps within the createSigs function. It is worth noting that having many concurrent workers can cause a spike in memory usage.
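For example, the feature extraction and model fitting steps might be run with the maximum of six workers like this (a sketch; smooth and feats are as defined earlier in this document):

<<eval=FALSE>>=
# One worker per feature: 6 is the maximum useful value for these two steps
feats = extractCNFeats(smooth, cores = 6)
comps = fitModels(feats, cores = 6)
@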
\subsection{Determining Number of Signatures}
The package includes a function for determining the optimal number of signatures for a given dataset. This function is called determineNumSigs, and there are three ways to use it. By default it looks at 3 to 12 signatures, but this can be changed using the rmin and rmax parameters.

\subsubsection{Within a runPipeline call}
If you either don't give a value for nsig during a call to runPipeline, or you set nsig to 0, the pipeline will automatically run the determineNumSigs function. After the determineNumSigs step finishes, it will wait for user input specifying the number of signatures, and will then continue to run the rest of the pipeline. If you are saving the results, the plot is saved in the results folder; otherwise it should pop up for you to look at before picking a value. As you can see in the example below, after it checks ranks 3-12 it waits for user input before proceeding.

\includegraphics{./detSigNumInRP.jpg}

\subsubsection{Directly using the determineNumSigs function}
The determineNumSigs function can be used directly if you have already extracted the features and fit the models. Pass both the features and components to the function, and it will search from rmin to rmax and output the plot.

<<eval=FALSE>>=
smooth = smoothSegs(segDataExp)
feats = extractCNFeats(smooth)
comps = fitModels(feats)
determineNumSigs(feats, comps)
@

\subsubsection{Using the detSigNumPipeline function}
If you mainly want to determine the number of signatures to get a better idea about your data, you can use the detSigNumPipeline function. It executes the first few steps of the full pipeline at once and stops after running determineNumSigs.

<<eval=FALSE>>=
detSigNumPipeline(segDataExp, smooth = T)
@

\subsection{Interpreting the determineNumSigs plot}
The plot generated by the methods described above displays a variety of metrics describing the output of the NMF, comparing the NMF output on the real data to the NMF output on randomly scrambled data. To determine the best number of signatures, it is useful to consider several of these metrics together. Following the descriptions of the measures below, you can pick the value of nsig that gives the best results. All of the measures are calculated on both the input data (solid line) and a randomized permutation of the input data (dotted line). For some measures, such as sparseness, it is useful to compare the value on the real data to the value on the randomized data.

\includegraphics{./detSigNumPlot.jpg}

\subsubsection{Cophenetic}
Also known as the cophenetic similarity or cophenetic distance, this measure describes how similar two objects have to be in order to be grouped into the same cluster. In the context of copy number signatures, it describes how similar two samples' copy number profiles need to be in order to be assigned the same signature. Ideally you want higher values for the cophenetic: with a low cophenetic distance, two samples that aren't very similar could be classified under the same signature.

\subsubsection{Dispersion}
The dispersion coefficient is derived from the NMF consensus matrix. The consensus matrix can be described as the average connectivity matrix across multiple NMF runs; each value in it essentially describes the probability that two samples belong to the same cluster across the nrun NMF runs. In our package, nrun is set to 250 to ensure a stable consensus matrix. For a perfect consensus matrix, meaning all of the connectivity matrices across all the NMF runs were the same, the dispersion value is 1. A lower dispersion value means that the clusters were less reproducible across the NMF runs.
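If you want to compute these two measures yourself, the NMF package exposes them directly; a sketch, assuming sigs is a multi-run NMF fit such as the object returned by createSigs:

<<eval=FALSE>>=
library(NMF)
# Cophenetic correlation and dispersion of the consensus matrix
# (sigs: a multi-run NMF fit, e.g. from createSigs)
cophcor(consensus(sigs))
dispersion(consensus(sigs))
@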
\subsubsection{Rss}
Rss stands for the residual sum of squares of the NMF model. This measures how different the final estimation is from the input data: NMF estimates a factorization of the input matrix, and the rss describes how closely that factorization matches the input data. You want a low rss; however, since many of the NMF algorithms are designed to minimize the rss, this measure usually has little effect on the decision for the number of signatures, because runs across all values of nsig generally achieve a low rss.

\subsubsection{Silhouette}
Silhouette looks at the consistency of the clusters and measures how well each sample has been classified. For each sample, it compares how similar the sample is to others in the same cluster versus samples in other clusters. A single silhouette value ranges from -1 to 1; a lower value means that the sample is more similar to samples from other clusters than to samples within its own cluster. The mean silhouette across all samples within a cluster measures how tightly grouped that cluster is, and the mean silhouette across the entire dataset measures how appropriately the samples are clustered.

\subsubsection{Sparseness}
The sparseness of a vector is defined by how much of the vector's energy is concentrated in a small number of its components. In theory, the most sparse vector, with only one non-zero value, has a sparseness of 1; the least sparse vector, in which all components have an equal value, has a sparseness of 0. For a matrix, the sparseness is simply the average sparseness of its column vectors. The sparseness of the basis matrix describes how many components drive the signatures: a sparser basis matrix means that the signatures are composed of fewer components. The sparseness of the coefficients matrix describes how many signatures each sample belongs to: a sparser coefficients matrix means that each sample is made up of fewer signatures.

\subsection{Remapping the Pipeline Results}
There are times when you may want to reorder the signatures found during an analysis, or simply give the signatures names. The package provides the remapResults function for this. You give it the path of an entire results folder along with either a numeric mapping for the signatures or a set of names for them. The function generates a new results folder in which all of the outputs are rearranged according to the new mapping and all of the plots are regenerated in the new order.
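For example, for a run with five signatures, calls might look like the following sketch (the argument form here is illustrative; see ?remapResults for the exact interface):

<<eval=FALSE>>=
# Reorder the signatures of a previous run into a new results folder
remapResults("path/to/resultsFolder", c(2, 1, 3, 5, 4))
# Or give the signatures names instead
remapResults("path/to/resultsFolder", c("SigA", "SigB", "SigC", "SigD", "SigE"))
@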
\subsection{Subsetting the features used}
Many of the functions in the pipeline have an optional parameter that specifies which features to use. With it, you can generate copy number signatures from any subset of the features. For instance, if you don't want to include copy number oscillations in your final signatures, you can remove this feature from the list of features to use and it will not be used. The package includes a vector called defaultFeats that contains the names of the default features; it is convenient to subset this vector to the features you wish to use and pass the result into the functions.

<<>>=
defaultFeats
desiredFeats = defaultFeats[-3]
desiredFeats
@

<<eval=FALSE>>=
results = runPipeline(segDataExp, featsToUse = desiredFeats)
@

\subsection{Different Genome Builds}
One of the important factors for the package to work properly is specifying the genome build correctly. The package uses both the lengths of the chromosomes and the positions of the centromeres when extracting some of the copy number features, and since these values change from one build to another, it is important to specify the build correctly. By default, the genome build is assumed to be hg19. Currently the package supports hg18, hg19, and hg38. Both runPipeline and extractCNFeats have a parameter called gbuild that allows you to specify the genome build.
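For example, for segments called against hg38 (a sketch, assuming the build is passed as one of the strings listed above):

<<eval=FALSE>>=
# Use the hg38 chromosome lengths and centromere positions
feats = extractCNFeats(smooth, gbuild = "hg38")
results = runPipeline(segDataExp, gbuild = "hg38")
@

\end{document}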