%\VignetteIndexEntry{NameNeedle} %\VignetteKeywords{Needleman-Wunsch, Global Alignment, Cell Line Names} %\VignetteDepends{stats} %\VignettePackage{NameNeedle} \documentclass{article} \usepackage{hyperref} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \title{Sequence Alignment and Cell Line Names} \author{Kevin R. Coombes} \begin{document} \maketitle \tableofcontents \section{Introduction} A problem that frequently plagues statistical analysts who are trying to combine data from two or more sources is that sample identifiers are entered inconsistently in the two data sets. This problem appears to be particularly prevalent in the world of cell line research, where no standard exists to define the names. Most cell line names can be broken down into three parts: an alphabetic prefix, a numeric identifier, and an optional alphanumeric suffix. Common problems include both the introduction of (apparently random) punctuation between the parts and abbreviation of the prefix (presumably under the belief that, within the context of the experiment, everyone will know what the abbreviation stands for). We recognized that the problem of matching cell line names from two experiments was related to the problem of aligning biological sequence data. As a result, we implemented a straightforward version of the Needleman-Wunsch global alignment algorithm that can be applied to this problem. This vignette explains how to use the \Rpackage{NameNeedle} package to match cell line names. \section{Getting Started} We start by loading the package into the current R session. <>= library(NameNeedle) @ \section{Aligning Two Character Strings} The basic function is called \Rfunction{needles} and implements the Needleman-Wunsch algorithm for aligning two text strings. The simplest use is to just supply the text strings directly: <>= needles("hcc-123", "hcc1243") @ The return value includes the optimal alignments of the two strings, a score, the ``score matrix'' (\Robject{sm}) and the ``backtrace matrix'' (\Robject{dm}). All of these items depend on a set of parameters, which can be supplied to the \Rfunction{needles} function as a list. The default value is <>= defaultNeedleParams @ Notice that the \Robject{GAPCHAR} component specifies the character to use in the alignment strings to indicate that the best alignment involves leaving a gap in one string rather than matching the characters directly in the other string. The other three parameters determine the rewards (for an exact match) or penalties (for mismatches or gaps) that contribute to the score. A simple way to alter the parameters is to start with the default values and modify the entries: <>= myParams <- defaultNeedleParams myParams$MISMATCH <- -2 myParams$MATCH <- 2 @ We can re-run the algorithm using our own parameter list. <>= needles("hcc-123", "hcc1243", myParams) @ Note that, in this case, the alignment does not change, but the score and the score matrix both change to reflect the altered parameters. In more complex cases, the actual alignments can change depending on the set of parameters. We find that the values in \Robject{myParams} work well for matching cell line names. \section{Cell Line Names} The \Rpackage{NameNeedle} library includes a sample data set consisting of the actual cell line names, as they were presented to us, from three related experiments. We use the \Rfunction{data} command to load these names into the current R session. <>= data(cellLineNames) ls() @ Now we review the cell line names. <>= class(sf2Names) length(sf2Names) sf2Names[1:10] @ <>= class(rppaNames) length(rppaNames) rppaNames[1:10] @ <>= class(illuNames) length(illuNames) summary(illuType) illuNames[1:10] @ \subsection{Matching One Cell Line Name} The \Rfunction{needles} function works on one character string at a time. (If you supply a character vector of length greater than one, it silently ignores everything except the first entry). In order to find the best match for one name in a list of possibilities, we use the \Rfunction{needleScores} function. For example, suppose we want to find the best match for the following name <>= probeName <- sf2Names[6] probeName @ in the character vector \Robject{illuNames}. Then we simply write <>= scores <- needleScores(probeName, illuNames, myParams) summary(scores) @ We see that the highest score is $11$. We can use the usual R tools to figure out which character string gives the highest score (and thus the best match). <>= w <- which(scores==max(scores)) illuNames[w] @ We note that the match differs from the probe name by the insertion of a space character between the alphabetic prefix and the numerical identifier in the name. \subsection{Matching All Cell Line Names} More generally, we would like to match two complete lists of names. For example, we might want to match the names in \Robject{sf2names} with the names in \Robject{rppaNames}. There is no special function to perform this task. Instead, we can simply run through a loop. <>= go <- proc.time() matchscore <- matchcode <- rep(NA, length(sf2Names)) for (j in 1:length(sf2Names)) { scores <- needleScores(sf2Names[j], rppaNames, myParams) matchcode[j] <- paste(which(scores==max(scores)), collapse=',') matchscore[j] <- max(scores) } used <- proc.time() - go @ Note, however, that we must allow for the possibility that multiple names in \Robject{rppaNames} will provide the best match (highest score) for any given name in \Robject{sf2Names}. We have saved the indices of all the best matches in the \Robject{matchcode} variable, as a comma-separated character string. The next loop expands those indices to the actual names, which are stored as semicolon-separated character strings. <>= rppaMatch <- sapply(matchcode, function(x) { y <- as.numeric(strsplit(x, ',')[[1]]) paste(rppaNames[y], collapse="; ") }) @ For example, we have <>= i <- 116 sf2Names[i] rppaMatch[i] @ In this case, sample ``\textit{HCC-2998}'' was used in the SF2 study but not in the RPPA study, and there are two cell lines in the RPPA study that give equally good (although incorrect) matches. If we want to check the actual alignments, we can again use the basic \Rfunction{needles} function. <>= x <- needles("HCC-2998", "HCC2279", myParams) x$align1 x$align2 @ For completeness, we also match \Robject{sf2names} to \Robject{illuNames}. <>= go <- proc.time() imatchscore <- imatchcode <- rep(NA, length(sf2Names)) for (j in 1:length(sf2Names)) { scores <- needleScores(sf2Names[j], illuNames, myParams) imatchcode[j] <- paste(which(scores==max(scores)), collapse=',') imatchscore[j] <- max(scores) } illuMatch <- sapply(imatchcode, function(x) { y <- as.numeric(strsplit(x, ',')[[1]]) paste(illuNames[y], collapse="; ") }) iused <- proc.time() - go used iused used + iused @ We can combine the results into a data frame. <>= matcher <- data.frame(rppaMatch=rppaMatch, rppaScore=matchscore, illuMatch=illuMatch, illuScore=imatchscore)#,combined) rownames(matcher) <- sf2Names matcher[1:10,] @ <>= combined <- read.table("combined.tsv", sep="\t", header=TRUE, row.names=1) matcher <- data.frame(rppaMatch=rppaMatch, rppaScore=matchscore, illuMatch=illuMatch, illuScore=imatchscore, combined) write.table(matcher, file="namesMatched.tsv", sep="\t", quote=FALSE, col.names=NA) @ \section{Conclusions} A simple implementation of the Needleman-Wunsch global alignment algorithm works reasonably well as a first approximation for matching cell line names from different datasets. We must note, however, that more sophisticated tools are available for aligning biological sequences; many such tools are implemented in the \Rpackage{Biostrings} package from Bioconductor. \end{document}