@ \documentclass[a4paper]{article} \usepackage{Sweave, amssymb, amsmath} \usepackage[ pdftex, pdfpagelabels, plainpages=false, pdfborder={0 0 0}, pdfstartview=FitH, bookmarks=true, bookmarksnumbered=true, bookmarksopen=true ] {hyperref} \title{Overview of the \emph{intervals} package} \author{Richard Bourgon} \date{06 June 2009} % The is for R CMD check, which finds it in spite of the "%", and also for % automatic creation of links in the HTML documentation for the package: % \VignetteIndexEntry{Overview of the intervals package.} \begin{document} %%%%%%%% Setup % Don't reform code \SweaveOpts{keep.source=TRUE, eps=FALSE} % Size for figures \setkeys{Gin}{width=.7\textwidth} % Fonts \DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=1cm,fontshape=sl,fontsize=\small} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1cm,fontsize=\small} % Reduce characters per line in R output @ <>= options( width = 80 ) @ % Make title \maketitle % Typesetting commands \newcommand{\R}{\mathbb{R}} \newcommand{\Z}{\mathbb{Z}} %%%%%%%% TOC \tableofcontents \vspace{.25in} %%%%%%%% Main text \section{Introduction} The \emph{intervals} packages defines two S4 classes which represent collections of intervals over either the integers ($\Z$) or the real number line ($\R$). An instance of either class consists of a two-column matrix of endpoints, plus additional slots describing endpoint closure and whether the intervals are to be thought of as being over $\Z$ or $\R$. @ <>= library( intervals ) x <- Intervals( matrix( 1:6, ncol = 2 ) ) x x[2,2] <- NA x[3,1] <- 6 x @ Objects of class \texttt{Intervals} represent collections of intervals with common endpoint closure, e.g., all left-closed, right-open. More control over endpoints is permitted with the \texttt{Intervals\_full} class. (Both classes are derived from \texttt{Intervals\_virtual}, which is not intended for use by package users.) @ <>= y <- as( x, "Intervals_full" ) y <- c( y, Intervals_full( c(2,3,5,7) ) ) closed(y)[2:3,1] <- FALSE closed(y)[4,2] <- FALSE rownames(y) <- letters[1:5] y @ The \texttt{size} method gives measure --- counting measure over $\Z$ or Lebesgue measure over $\R$ --- for each interval represented in an object. The \texttt{empty} method identifies intervals that are in fact empty sets, which over $\R$ is not the same thing as having size 0. (Valid objects must have each right endpoint no less than the corresponding left endpoint. When one or both endpoints are open, however, valid intervals may be empty.) @ <>= size(x) empty(x) empty(y) @ \begin{figure}[!tb] \centering @ <>= plot( y ) @ \caption{The \texttt{Intervals\_full} object \texttt{y}, plotted with \texttt{plot(y)}. The second row contains an \texttt{NA} endpoint, and is not shown in the plot. The empty interval is plotted as a hollow point. By default, vertical placement avoids overlaps but is compact. } \label{fig:plotting} \end{figure} \section{Interpretation of objects} An \texttt{Intervals} or \texttt{Intervals\_full} object can be thought of in two different modes, each of which is useful in certain contexts: \begin{enumerate} \item As a (non-unique) representation of a subset of $\Z$ or $\R$. \item As a collection of (possibly overlapping) intervals, each of which has a meaningful identity. \end{enumerate} \subsection{As a subset of $\Z$ or $\R$} The \emph{intervals} package provides a number of basic tools for working in the first mode, where an object represents a subset of $\Z$ or $\R$ but the rows of the endpoint matrix do not have any external identity. Basic tools include \texttt{reduce}, which returns a sorted minimal representation equivalent to the original (dropping any intervals with \texttt{NA} endpoints), as well as \texttt{interval\_union}, \texttt{interval\_complement}, and \texttt{interval\_intersection}. @ <>= reduce( y ) interval_intersection( x, x + 2 ) interval_complement( x ) @ Note that combining \texttt{x} and its complement in order to compute a union requires mixing endpoint closure types; coercion to \texttt{Intervals\_full} is automatic. @ <>= interval_union( x, interval_complement( x ) ) @ The \texttt{distance\_to\_nearest} function treats its \texttt{to} argument in the first mode, as just a subset of $\Z$ or $\R$; it treats its \texttt{from} argument, however, in the second mode, returning one distance for every row of the \texttt{from} object. In the example below, we also look at performance for large data sets (less than one second on a 2 GHz Intel Core 2 Duo Macintosh, although the time shown below will likely differ). A histogram of \texttt{d} is given in Figure~\ref{fig:distance}. @ <>= B <- 100000 left <- runif( B, 0, 1e8 ) right <- left + rexp( B, rate = 1/10 ) v <- Intervals( cbind( left, right ) ) head( v ) mean( size( v ) ) dim( reduce( v ) ) system.time( d <- distance_to_nearest( sample( 1e8, B ), v ) ) @ \begin{figure}[!tb] \centering @ <>= hist( d, main = "Distance to nearest interval" ) @ \caption{Histogram of distances from a random set of points to the nearest point in \texttt{v}. There is also a \texttt{distance\_to\_nearest} method for comparing two sets of intervals.} \label{fig:distance} \end{figure} \subsection{As a set of meaningful, possibly overlapping intervals} In some applications, each row of an object's endpoint matrix has a meaningful identity, and particular points from $\Z$ or $\R$ may be found in more than one row. To support this mode, objects may be given row names, which are propagated through calculations when appropriate. The \texttt{c} methods simply stack objects (like \texttt{rbind}), preserving row names and retaining redundancy, if any. The \texttt{interval\_overlap} method works in this mode. In the next example we use it to identify rows of \texttt{v} which are at least partially redundant, i.e., which intersect at least one other row of \texttt{v}. All rows overlap themselves, so we look for rows that overlap at least two rows: @ <>= rownames(v) <- sprintf( "%06i", 1:nrow(v) ) io <- interval_overlap( v, v ) head( io, n = 3 ) n <- sapply( io, length ) sum( n > 1 ) k <- which.max( n ) io[ k ] v[ k, ] v[ io[[ k ]], ] @ The \texttt{which\_nearest} method also respects row identity, for both its \texttt{to} and \texttt{from} arguments. In addition to computing the distance from each \texttt{from} interval to the nearest point in \texttt{to}, it also returns the row index of the \texttt{to} interval (or intervals, in case of ties) located at the indicated distance. Another function which operates in this mode is \texttt{clusters}, which takes a set of points or intervals and identifies maximal groups which cluster together --- which are separated from one another by no more than a user-specified threshold. The following code is taken from the \texttt{clusters} documentation: @ <>= B <- 100 left <- runif( B, 0, 1e4 ) right <- left + rexp( B, rate = 1/10 ) y <- reduce( Intervals( cbind( left, right ) ) ) w <- 100 c2 <- clusters( y, w ) c2[1:3] @ \section{Floating point and intervals over $\R$} When \texttt{type == "R"}, interval endpoints are not truly in $\R$, but rather, in the subset which can be represented by floating point arithmetic. (For the moment, this is also true when \texttt{type == "Z"}. See Section~\ref{sec:representation}.) This limits the endpoint values which can be represented; more importantly, if computations are performed on interval endpoints, it means that floating point error can affect whether or not endpoints coincide, whether intervals which meet at or near endpoints overlap one another, etc. In spite of this potentially serious limitation, it is still often convenient to work with intervals with non-integer endpoints, including data where adjacent intervals exactly meet at a non-integer endpoint. To address this, the \emph{intervals} package takes the following approach: \begin{itemize} \item Floating point representations of interval endpoints are assumed to be \emph{exactly equal} (in the sense of \texttt{==} in R or C++) if and only if the user intends the real values corresponding to these representations to be exactly equal. \item For cases where floating point error and approximate equality are a concern, tools are provided to permit distinguishing between ambiguous and unambiguous intersection, union, etc. \end{itemize} In the next example, \texttt{y1} does not literally overlap \texttt{y2[2,]}, although R's \texttt{all.equal} function asserts that the gap between them is smaller than the default tolerance for equivalence up to floating point precision. @ <>= delta <- .Machine[[ "double.eps" ]]^0.5 y1 <- Intervals( c( .5, 1 - delta / 2 ) ) y2 <- Intervals( c( .25, 1, .75, 2 ) ) y1 y2 all.equal( y1[1,2], y2[2,1] ) interval_intersection( y1, y2 ) @ The \texttt{expand} and \texttt{contract} methods, used with \texttt{type = "relative"}, permit consideration of the maximal and minimal interval sets which are consistent with the nominal endpoints --- from the point of view of endpoint relative difference. The \texttt{contract} method, for example, contracts each interval in a collection so that the relative difference between original and contracted endpoints is equal to tolerance \texttt{delta}. Thus, if a relative difference less than or equal to \texttt{delta} is our criterion for approximate floating point equality, the contracted object has endpoints which are approximately equal to those of the original --- even though the contracted object is a proper subset of the original. The \texttt{expand} method is similar, but generates a proper superset. @ <>= contract( y1, delta, "relative" ) @ We compute two separate intersections which bound the nominal intersection: @ <>= inner <- interval_intersection( contract( y1, delta, "relative" ), contract( y2, delta, "relative" ) ) inner outer <- interval_intersection( expand( y1, delta, "relative" ), expand( y2, delta, "relative" ) ) outer @ Finally, we identify points which may or may not be in the intersection, depending on whether we make a conservative, literal, or anti-conservative interpretation of the nominal endpoints. @ <>= interval_difference( outer, inner ) @ The \texttt{expand} and \texttt{contract} methods have other uses as well. Here, we eliminate gaps of size 2 or smaller: @ <>= x <- Intervals( c(1,10,100,8,50,200), type = "Z" ) x w <- 2 close_intervals( contract( reduce( expand(x, w/2) ), w/2 ) ) @ \section{Notes on implementation} \subsection{Endpoint representation} \label{sec:representation} For the moment, interval endpoints are always stored using R's \emph{numeric} data type. Although this is wasteful from an memory and speed point of view, we do it for two reasons. First, use of R's \texttt{Inf} and \texttt{-Inf} --- not possible with the \emph{integer} type --- is very convenient when computing complements. Second, the range of integers which can be represented using the \emph{numeric} data type is considerably greater: @ <>= .Machine$integer.max numeric_max <- with( .Machine, double.base^double.digits ) options( digits = ceiling( log10( numeric_max ) ) ) numeric_max @ \subsection{Efficiency} All computations are accomplished by treating intervals as pairs of tagged endpoints, sorting these endpoints (along with their tags), and then making a single pass through the results. Computational complexity for set operations is therefore $O(n \log n)$, where input object $i$ contains $n_i$ rows and $n = \sum_i n_i$. The same sorting approach is also used for \texttt{interval\_overlap}, although if every interval in a query object of $m$ rows overlaps every intervals in a target object of $n$ rows, generating output alone must of necessity be $O(mn)$. Sorted endpoint vectors are not retained in memory. If one wishes to query a particular object over and over, repeated sorting would be inefficient; in practice so far, however, such repeated querying has not been needed. \subsection{Checking validity} The code behind \texttt{which\_nearest} and \texttt{reduce} (key methods in the \emph{intervals} package, which may be directly called by the user and are also used internally in numerous locations) is written in C++ for efficiency. The compiled code makes a number of assumptions about the \texttt{SEXP} objects passed in as arguments, but does not explicitly check these assumptions. Nonetheless, when the R wrappers for the compiled code are applied to \emph{valid} objects from the \texttt{Intervals} or \texttt{Intervals\_full} classes, all assumptions will always be met. This design decision was taken so that the requirements for individual objects and their contents could be gathered together in a single, natural location: the classes' \texttt{validity} functions. The \emph{intervals} package provides replacement methods --- e.g., \texttt{type} and \texttt{closed} --- which implement error checking and preserve object validity. R's implementation of S4 classes, however, leaves object data slots exposed to the user. As a consequence, a user can directly manipulate the data slots of a valid \texttt{Intervals} or \texttt{Intervals\_full} object in a way that invalidates the object, but does not generate any warning or error. To prevent invalid objects from being passed to compiled code --- and potentially generating segmentation faults or other problems --- all wrapper code in this package includes a \texttt{check\_valid} argument. This argument is set to \texttt{TRUE} by default, so that \texttt{validObject} is called on relevant objects before handing them off to the compiled code. For efficiency, users may choose to override this extra check if they are certain they have not manually assigned inappropriate content to objects' data slots. \section{Session information} @ <>= si <- as.character( toLatex( sessionInfo() ) ) cat( si[ -grep( "Locale", si ) ], sep = "\n" ) @ \end{document}