\documentclass{article}
\usepackage{amsfonts}
% TODO(mstokely): Can use a larger title here with spaces and such.
%\VignetteIndexEntry{HistogramTools-Intro}

%% JSS stuff we can remove for JSS version.
\usepackage[authoryear,round,longnamesfirst]{natbib}

\bibpunct{(}{)}{;}{a}{}{,}
\bibliographystyle{jss}
%% page layout
\topmargin 0pt
\textheight 46\baselineskip
\advance\textheight by \topskip
\oddsidemargin 0.1in
\evensidemargin 0.15in
\marginparwidth 1in
\oddsidemargin 0.125in
\evensidemargin 0.125in
\marginparwidth 0.75in
\textwidth 6.125in
%% paragraphs
\setlength{\parskip}{0.7ex plus0.1ex minus0.1ex}
\setlength{\parindent}{0em}

% The \cite command functions as follows:
%   \citet{key} ==>>                Jones et al. (1990)
%   \citet*{key} ==>>               Jones, Baker, and Smith (1990)
%   \citep{key} ==>>                (Jones et al., 1990)
%   \citep*{key} ==>>               (Jones, Baker, and Smith, 1990)
%   \citep[chap. 2]{key} ==>>       (Jones et al., 1990, chap. 2)
%   \citep[e.g.][]{key} ==>>        (e.g. Jones et al., 1990)
%   \citep[e.g.][p. 32]{key} ==>>   (e.g. Jones et al., p. 32)
%   \citeauthor{key} ==>>           Jones et al.
%   \citeauthor*{key} ==>>          Jones, Baker, and Smith
%   \citeyear{key} ==>>             1990

\newcommand{\keywords}[1]{\textbf{Keywords: }#1}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{\textbf{#1}}
\usepackage{Sweave}
\usepackage{listings}             % Include the listings-package
\usepackage{url}
\usepackage{graphicx}

<<echo=FALSE,print=FALSE>>=
library("HistogramTools")
oldoptions <- options("width"=65)
oldpar <- par(no.readonly = TRUE)
set.seed(0)
ht.version <- packageDescription("HistogramTools")$Version
prettyDate <- format(Sys.Date(), "%B %e, %Y")
@
% closing $ needed here


%% almost as usual
\author{Murray Stokely}
\title{HistogramTools for Distributions of Large Data Sets}
\date{Version \Sexpr{ht.version} as of \Sexpr{prettyDate}}
\begin{document}

\maketitle

\abstract{
  \noindent
  Histograms are a common graphical representation of the distribution
  of a data set.  They are particularly useful for collecting very
  large data sets into a binned form for easier data storage and
  analysis.  The \pkg{HistogramTools} R package augments the
  built-in support for histograms with a number of methods that are
  useful for analyzing large data sets.  Specifically, methods are
  included for
  serializing histograms into a compact Protocol Buffer representation
  for sharing between distributed tasks, functions for
  manipulating the resulting aggregate histograms, and functions for
  measuring and visualizing the information loss associated with
  histogram representations of a data set.\\
\vspace{1ex}

\noindent
\keywords{histograms, distance, non-parametric, metric, map-reduce}
}
% \Keywords{histograms, distributions, R, mapreduce, protocol buffers, RProtoBuf}
% \Plainkeywords{histograms, distributions, R, mapreduce, protocol
%  buffers, RProtoBuf} %% without formatting.
%% at least one keyword must be supplied

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{50}
%% \Issue{9}
%% \Month{June}
%% \Year{2012}
%% \Submitdate{2012-06-04}
%% \Acceptdate{2012-06-04}

%% The address of (at least) one author should be given
%% in the following format:

%\Address{
%  Murray Stokely\\
%  Google, Inc.\\
%  1600 Amphitheatre Parkway\\
%  Mountain View, CA 94043\\
%  E-mail: \email{mstokely@google.com}\\
%  URL: \url{http://research.google.com/pubs/MurrayStokely.html}
%  }
%
%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\tableofcontents

\section{Introduction}

Histograms have been used for over a century
\citep{pearson1895contributions} as a compact graphical representation
of a distribution of data.  Support for generating histograms is
included in almost all statistical software, including R.  However,
the growth in large-scale data analysis in R with the MapReduce
\citep{dean2008mapreduce} paradigm has highlighted the need for some
extensions to the histogram functionality in R\citep{r} for the
collection, transmission, and manipulation of larged binned data sets.

% TODO(mstokely): Also cite the streaming algorithm work?
Much previous work on histograms
\citep{sturges1926choice,scott1979optimal} involves identifying ideal
breakpoints for visualizing a data set.  Our context is different; we
use histograms as compact representations in a large MapReduce
environment, and need to merge histograms from subsets of the data to
obtain a histogram for the whole data set.  In this case, the
challenges are engineering challenges, rather than visualization challenges.
For example, how do we efficiently store large numbers of histograms
generated frequently by real-time monitoring systems of many thousands
of computers?  How can we aggregate histograms generated by different
tasks in a large scale computation?  If we chose very granular bucket
boundaries for our initial data collection pass, how can we then
reshape the bucket boundaries to be more relevant to the analysis
questions at hand?

% TODO(mstokely): Reference and cite ``Improved Histograms for
% Selectivity Estimation of Range Predicates'' which is a great survey
% of different types of histograms from the database community.

% [cite R ?hist algo] but in the parallell
% context each histogram must share the same breakpoints, even if they
% are not the optimal ones for the specific data subset any particular
% routine is accessing.  If we hope to merge the resulting histograms
% into a meaningful metric of the full data set.

% Histograms are increasingly used in leage scale distributed systems
% research.  Especially with the MapReduce paradigm where many
% thousands of tasks may analyze a big data set and collect
% distributions of interest Examples from search clusters, distributed
% filesystems, etc. [cite]

\section{Histogram Bin Manipulation}

\subsection{Trimming Empty Buckets from the Tails}

When generating histograms with a large number of fine-grained
bucket boundaries, the resulting histograms may have a large
number of empty consecutive buckets on the left or right side of the
histogram.  The \code{TrimHistogram} function can be used to remove
them, as illustrated in Figure~\ref{fig:trimhist}.

<<echo=TRUE,print=FALSE>>=
hist.1 <- hist(runif(100,min=2,max=4), breaks=seq(0,6,by=.2), plot=FALSE)
hist.trimmed <- TrimHistogram(hist.1)
@

<<echo=TRUE,print=TRUE>>=
length(hist.1$counts)
sum(hist.1$counts)
length(hist.trimmed$counts)
sum(hist.trimmed$counts)
@

<<trimhist,fig=TRUE,echo=TRUE,include=FALSE,width=8,height=4>>=
par(mfrow=c(1,2))
plot(hist.1)
plot(TrimHistogram(hist.1), main="Trimmed Histogram")
@

\begin{figure}[h!]
\begin{center}
\includegraphics[width=5in,height=2.5in]{HistogramTools-trimhist}
\end{center}
\caption{Effect of the \code{TrimHistogram} function.}
\label{fig:trimhist}
\end{figure}

\subsection{Adding/Merging Histograms}

If histograms (for different data sets) have the same bucket
boundaries, it is possible to add them together to obtain the
histogram for the combined data set by aggregating the counts values
with the \code{AddHistograms} function illustrated in
Figure~\ref{fig:mergehist}.

<<echo=TRUE,print=FALSE>>=
hist.1 <- hist(c(1,2,3,4), plot=FALSE)
hist.2 <- hist(c(1,2,2,4), plot=FALSE)
hist.sum <- AddHistograms(hist.1, hist.2)
@

<<mergehist,fig=TRUE,echo=FALSE,include=FALSE,width=8,height=4>>=
par(mfrow=c(1,3))
plot(hist.1)
plot(hist.2)
plot(hist.sum,main="Aggregated Histogram")
@

\begin{figure}[h!]
\begin{center}
\includegraphics[width=5in,height=2.5in]{HistogramTools-mergehist}
\end{center}
\caption{Effect of the \code{AddHistograms} function.}
\label{fig:mergehist}
\end{figure}

\code{AddHistograms} accepts an arbitrary number of histogram objects
to aggregate as long as they have the same bucket boundaries.

<<echo=TRUE,print=FALSE>>=
  hist.1 <- hist(c(1,2,3), breaks=0:9, plot=FALSE)
  hist.2 <- hist(c(1,2,3), breaks=0:9, plot=FALSE)
  hist.3 <- hist(c(4,5,6), breaks=0:9, plot=FALSE)
  hist.sum <- AddHistograms(hist.1, hist.2, hist.3)
  hist.sum
@

\subsection{Merging Buckets}

We may want a version of a histogram with fewer buckets, to save
storage or to produce better plots.  The \code{MergeBuckets} function takes
a histogram and a parameter for the number of adjacent buckets to
merge together and returns a new histogram with different bucket
boundaries, as illustrated in Figure~\ref{fig:downsamplehist}.

<<echo=TRUE,print=FALSE>>=
overbinned <- hist(c(rexp(100), 1+rexp(100)), breaks=seq(0, 10, by=.01), plot=FALSE)
better.hist <- MergeBuckets(overbinned, adj=30)
@

<<downsamplehist,fig=TRUE,echo=FALSE,print=FALSE,include=FALSE,width=8,height=4>>=
par(mfrow=c(1,2))
plot(overbinned)
plot(better.hist)
@

\begin{figure}[h!]
\begin{center}
\includegraphics[width=3.3in,height=2.5in]{HistogramTools-downsamplehist}
\end{center}
\caption{Effect of the \code{MergeBuckets} function.}
\label{fig:downsamplehist}
\end{figure}

\subsection{Subsetting Histograms}

% If, say you are only interested in part of a distribution, for
% example for performance measurements the requests that had a latency
% of greater than X seconds only.  Then the SubsetHistogram()
% functions can help manipulate the histograms to zoom in on an area
% of interest for additional plotting.

The \code{SubsetHistogram} function takes a histogram and a new
minimum and maximum bucket boundary and returns a new histogram with a
subset of the buckets, as illustrated in Figure~\ref{fig:subsethist}.

% TODO(mstokely): Add reference to iceberg queries, database work in
% this area.

<<echo=TRUE,print=FALSE>>=
hist.1 <- hist(runif(100, min=0, max=10), breaks=seq(from=0, to=10, by=.5), plot=FALSE)
hist.2 <- SubsetHistogram(hist.1, minbreak=2, maxbreak=6)
@

<<subsethist,fig=TRUE,echo=FALSE,print=FALSE,include=FALSE,width=5.3,height=4>>=
par(mfrow=c(1,2))
plot(hist.1, main="hist.1")
plot(hist.2, main="hist.2")
@

\begin{figure}[h!]
\begin{center}
\includegraphics[width=3.3in,height=2.5in]{HistogramTools-subsethist}
\end{center}
\caption{Effect of the \code{SubsetHistogram} function.}
\label{fig:subsethist}
\end{figure}

\subsection{Intersecting Histograms}

The \code{IntersectHistograms} function takes two histograms with
identical bucket boundaries and returns a histogram of the
intersection of the two.  Each bucket of the returned histogram
contains the minimum of the equivalent buckets in the two provided
histograms.

<<echo=TRUE,print=FALSE>>=
hist.1 <- hist(runif(100))
hist.2 <- hist(runif(100))
hist.3 <- IntersectHistograms(hist.1, hist.2)
@

<<intersecthist,fig=TRUE,echo=FALSE,print=FALSE,include=FALSE,width=5.3,height=4>>=
par(mfrow=c(1,3))
plot(hist.1, main="hist.1")
plot(hist.2, main="hist.2")
plot(hist.3, main="hist.3")
@

\begin{figure}[h!]
\begin{center}
\includegraphics[width=3.3in,height=2.5in]{HistogramTools-intersecthist}
\end{center}
\caption{Effect of the \code{IntersectHistograms} function.}
\label{fig:intersecthist}
\end{figure}

\subsection{Scaling Histograms}

The \code{ScaleHistogram} function takes a histogram and a numeric
value to scale each bucket bound by.  This can be used, e.g.\ to
normalize the histogram area to a desired value, as illustrated in
Figure~\ref{fig:scalehist}.

<<echo=TRUE,print=FALSE>>=
hist.2 <- ScaleHistogram(hist.1, 1/sum(hist.1$counts))
@

<<scalehist,fig=TRUE,echo=FALSE,print=FALSE,include=FALSE,width=5.3,height=4>>=
par(mfrow=c(1,2))
plot(hist.1, main="hist.1")
plot(hist.2, main="hist.2")
@

\begin{figure}[h!]
\begin{center}
\includegraphics[width=3.3in,height=2.5in]{HistogramTools-scalehist}
\end{center}
\caption{Effect of the \code{ScaleHistogram} function.}
\label{fig:scalehist}
\end{figure}

\section{Histogram Distance Measures}

This package provides a number of metrics for computing a distance
measure for two histograms.  At the moment, only bin-by-bin similarity
measures are supported meaning that the two histograms must have
exactly the same bucket boundaries.  The attributes of the different
distance measures are described in \citep{rubner2000earth}.

\subsection{Minkowski Distance}

The Minkowski distance of order $p$ between two histograms can be
computed with the \code{minkowski.dist} function.

<<echo=TRUE,print=FALSE>>=
h1 <- hist(runif(100), plot=FALSE)
h2 <- hist(runif(100), plot=FALSE)

minkowski.dist(h1, h2, 1)
minkowski.dist(h1, h2, 2)
minkowski.dist(h1, h2, 3)

# Verify the implementation:
p <- 3
#dist(t(matrix(c(h1$counts, h2$counts), nrow=2)), "minkowski", p=p)
# Or, alternatively:
(sum(abs(h1$counts - h2$counts)^p))^(1/p)
@

\subsection{Histogram Intersection Distance}

The histogram intersection distance \citep{swain1991color} between two
histograms can be computed with the \code{intersect.dist} function.

<<echo=TRUE,print=FALSE>>=
intersect.dist(h1, h2)
@

If histograms \code{h1} and \code{h2} do not contain the
same number of counts, then this metric will not be symmetric.

\subsection{Kullback-Leibler Divergence}

The Kullback-Leibler Divergence \citep{kullback1997information}
between two histograms can be computed with the \code{kl.divergence}
function.

<<echo=TRUE,print=FALSE>>=
kl.divergence(h1, h2)
@

\subsection{Jeffrey Divergence}

The \code{jeffrey.divergence} function computes the Jeffrey
divergence between two histograms \citep{puzicha1997non}.

<<echo=TRUE,print=FALSE>>=
jeffrey.divergence(h1, h2)
@

\section{Quantiles and Cumulative Distribution Functions}

When histograms are used as a binned data storage mechanism to reduce
data storage cost, information about the underlying distribution is
lost.  We can however approximate the quantiles and cumulative
distribution function for the underying distribution from the
histogram.

The \code{Count}, \code{ApproxMean}, and \code{ApproxQuantile}
functions are meant to help with this, but note that they will only be
accurate with very granular histogram buckets.  They would rarely be
appropriate with histogram buckets chosen by the default algorithm in
R.

<<echo=TRUE>>=
hist <- hist(c(1,2,3), breaks=c(0,1,2,3,4,5,6,7,8,9), plot=FALSE)
@
<<echo=TRUE,print=TRUE>>=
Count(hist)
ApproxMean(hist)
ApproxQuantile(hist, .5)
ApproxQuantile(hist, c(.05, .95))
@

The \code{HistToEcdf} function takes a histogram and returns an
empirical distribution function similar to what is returned by the
\code{ecdf} function on a distribution.

<<execdf,fig=TRUE,include=FALSE,echo=TRUE,height=4,width=8>>=
h <- hist(runif(100), plot=FALSE)
e <- HistToEcdf(h)
e(.5)
par(mfrow=c(1,2))
plot(h)
plot(HistToEcdf(h))
par(mfrow=c(1,1))
@

\begin{figure}[h]
\begin{center}
\includegraphics[width=6in,height=3in]{HistogramTools-execdf}
\end{center}
\caption{Histogram and CDF Created with HistToEcdf}
\label{fig:excdf}
\end{figure}

\section{Error estimates in CDFs approximated from histograms}

When constructing cumulative distribution functions from binned
histogram data sets there will usually be some amount of
information loss.  We can however come up with an upper bound
for the error by looking at two extreme cases.  For a given histogram
$h$ with bucket counts $C_i$ for $1 \le i \le n$ and left-closed bucket boundaries
$B_i$ for $1 \le i \le n+1$, we construct two data sets.  Let $X$ be
the (unknown) underlying data set for which we now only have the
binned representation $h$.  Let $X_{\rm{min}}$ be the data set constructed
by assuming the data points in each bucket of the histogram are all
equal to the left bucket boundary.  Let $X_{\rm{max}}$ be the data
set constructed by assuming the data points in each bucket of the
histogram are at the right bucket boundary. Then
%histogram are all infinitesimally close to the right bucket boundary. Then
$F_X$ is the true empirical CDF of the underlying data, 
% $F(X_b)$ is any estimate of the empirical CDF based on the binned data, 
and
$F_{X_{\rm{min}}}(x) \ge F_X(x) \ge F_{X_{\rm{max}}}(x)$. 

%If $F(x)$ is the true empirical CDF of the underying data, and
%$F_b(x)$ is any estimate of the empirical CDF based on the binned
%data, then we can compute an error estimate by integrating the
%difference between the largest and smallest CDFs that could be
%represented by the binned data.  

% TODO(mstokely): Technically one of these is a limit, plus epsilon
\begin{displaymath}
X_{\rm{min}} = \left ( (B_i)_{k=1}^{C_i} : 1 \le i \le n \right ) \\
X_{\rm{max}} = \left ( (B_{i+1})_{k=1}^{C_i} : 1 \le i \le n \right )
\end{displaymath}
% x1 <- rep(head(h$breaks, -1), h$counts)
% x2 <- rep(tail(h$breaks, -1), h$counts)

The package provides two different distance metrics to measure the
difference between empirical cumulative distribution functions
$F_{X_{\rm{min}}}$ and $F_{X_{\rm{max}}}$.  These distances serve as
upper-bounds for the amount of error between the true empirical
distribution function $F_X$ of the unbinned data and an ECDF
calculated from the binned data.

<<echo=FALSE,print=FALSE>>=
PlotAll <- function(x, h) {
  plot(x, main="x")
  plot(h)
  PlotKSDCC(h, 0.3)
  PlotEMDCC(h)
}
@

\subsection{Kolmogorov-Smirnov Distance of the Cumulative Curves}

The first metric provided by the package is the two-sample
Kolmogorov-Smirnov distance between $X_{\rm{min}}$ and $X_{\rm{max}}$.
In other words, it the largest possible distance between cumulative
distribution functions that could be represented by the binned data.
This metric is more formally defined as

\begin{displaymath}
\sup_{x} \left | F_{X_{\rm{max}}}(x) - F_{X_{\rm{min}}}(x) \right |
\end{displaymath}

This function is also occasionally called the maximum displacement of
the cumulative curves (MDCC).

<<echo=TRUE,print=FALSE>>=
x <- rexp(1000)
h <- hist(x, breaks=c(0,1,2,3,4,8,16,32), plot=FALSE)
x.min <- rep(head(h$breaks, -1), h$counts)
x.max <- rep(tail(h$breaks, -1), h$counts)
ks.test(x.min, x.max, exact=F)
KSDCC(h)
@

The \code{KSDCC} function accepts a histogram, generates the largest
and smallest empirical cumulative distribution functions that could be
represented by that histogram, and then calculates the
Kolmogorov-Smirnov distance between the two CDFs.  This measure can be
used in cases where expanding the binned data into \code{x.min} and
\code{x.max} explicitly to calculate distances using
e.g.\ \code{ks.test} directly would not be feasible.
Figure~\ref{fig:KSDCC} illustrates geometrically what value is being
returned.

<<hist1mdcc,fig=TRUE,echo=FALSE,print=FALSE,include=FALSE,height=4,width=8>>=
par(mfrow=c(2,3))

x <- rexp(100)
h1 <- hist(x, plot=FALSE)
h2 <- hist(x, breaks=seq(0,round(max(x) + 1),by=0.1), plot=FALSE)

plot(sort(x), main="sort(x)")
plot(h1)
PlotKSDCC(h1, 0.2, main="CDF with KSDCC")

plot(sort(x), main="sort(x)")
plot(h2)
PlotKSDCC(h2, 0.2, main="CDF with KSDCC")
@

\begin{figure}[h!]
\begin{center}
\includegraphics[width=6in,height=3in]{HistogramTools-hist1mdcc}
\end{center}
\caption{Sorted raw data (column 1), Histograms (column 2), CDF with
  KSDCC (column 3)}
\label{fig:KSDCC}
\end{figure}

\subsection{Earth Mover's Distance of the Cumulative Curves}
% The histograms above were generated using the default computed
%breakpoints of R, which uses a modified algorithm from
%\cite{sturges1926choice}.  However, for our actual data analysis tasks
%we generate our own bucket widths in advance so that all of our
%histograms have comparable and consistent bucket boundaries.  For the
%same distribution, we obviously have control over the histogram bins,
%and using more fine-grained bins produces a more accurate
%representation of the underlying distribution.

% TODO kind of like iceberg histograms, but not quite. there are
% references from the database community for what we do and why.

%A natural question is, are we using a good initial choice of buckets?
%Which of the 100,000 histograms that we generated last week had the
%greatest potential error, and how might we change the bucketing to
%reduce that error?
%
% subsubsection KS
%
%Mean Square Error measures the accuracy of an estimator of f in a single point, MISE Mean Integrated Squared Error is the global metric.  But we don't know f for the underlying distribution, since we only have the histogram, so we can't compute these metrics, right?
%
%If $F(x)$ is the true empirical CDF of the underying data, and
%$F_b(x)$ is any estimate of the empirical CDF based on the binned
%data, then we can compute an error estimate by integrating the
%difference between the largest and smallest CDFs that could be
%represented by the binned data.
%
% we can use an integrated version of the Kolmogorov-Smirnov
% distance to compute the error due to our binning.

The ``Earth Mover's Distance'' is like the Kolmogorov-Smirnof
statistic, but uses an integral to capture the difference across all
points of the curve rather than just the maximum difference.  This is
also known as Mallows distance, or Wasserstein distance with $p=1$.

The value of this metric for our histogram $h$ is
% We remove the \sup because we defined F_max and F_min above.
% \sup_{F_b,F}
\begin{displaymath}
\int_{\mathbb{R}} |F_{X_{\rm{max}}}(x) - F_{X_{\rm{min}}}(x)|dx,
\end{displaymath}

Figure~\ref{fig:EMDCC} shows the same e.c.d.f of the binned data
represented in the previous section, but with the yellow boxes
representing the range of possible values for any real e.c.d.f. having
the same binned representation.  For any distribution $X$ with the
binned representation in $h$, the e.c.d.f $F(X)$ will pass only
through the yellow boxes.  The area of the yellow boxes is the Earth
Mover's Distance.

<<hist1emdcc,fig=TRUE,echo=FALSE,print=FALSE,include=FALSE,height=4,width=8>>=
par(mfrow=c(2,3))
plot(sort(x), main="sort(x)")
plot(h1)
PlotEMDCC(h1, main="CDF with EMDCC")

plot(sort(x), main="sort(x)")
plot(h2)
PlotEMDCC(h2, main="CDF with EMDCC")
@

\begin{figure}[h!]
\begin{center}
\includegraphics[width=6in,height=3in]{HistogramTools-hist1emdcc}
\end{center}
\caption{Sorted raw data (column 1), Histograms (column 2), CDF with
  EMDCC (column 3)}
\label{fig:EMDCC}
\end{figure}

<<echo=TRUE>>=
x <- rexp(100)
h1 <- hist(x, plot=FALSE)
h2 <- hist(x, breaks=seq(0,round(max(x) + 1),by=0.1), plot=FALSE)
KSDCC(h1)
KSDCC(h2)
EMDCC(h1)
EMDCC(h2)
@

So, using this metric, the second lower example with more granular
histogram bins produces an ECDF with reduced worst case error bounds
compared to the one with default buckets, as expected.

%\begin{figure}[h!]
%\begin{center}
%\includegraphics[width=5in,height=2.5in]{HistogramTools-downsamplehist}
%\end{center}
%\caption{Effect of the \code{MergeBuckets} function.}
%\label{fig:downsamplehist}
%\end{figure}

% TODO
% \section{Extensions}
%\subsection{Exemplar Histograms}
% TODO
%

%Sharing Histograms Between Tools
%In a big data analysis task, we may combine many different tools and languages.  For example, high-performance C++ or Java code might process millions of records to generate histograms and then serialize them (using e.g. Protocol buffers or Thrift) to a data store.
%Other code in Python or R may analyze, merge, clean the resulting outputs.  Finally, a front-end dashboard may read in a JSON representation of the cleaned histograms from the datastore into an interactive Javascript web isualization (sing e.g. D3 (\cite{bostock2011d3}) or Gviz)
%Implementation details:
%
%- Benchmarks of Merge operation
%Benchmarks
%Compare size of R serialized histograms and RProtoBuf serialized histograms.
%%
%code snippet:
%
%Table:  columns:  serialized size, serialized time (ms)
%rows:  R serialize, Rprotobuf naive implementation, RProtobuf C++ optimized implementation


%TODO Table example performance analysis.
% label=tab1,echo=FALSE,results=tex>>=
%require(xtable)
%foo <- data.frame(a=c(1,2,3,4),b=c(5,6,7,8))
%print(xtable(foo, caption="Example Caption", center="centering", file="", label="tab:one",floating=FALSE))
%@
%
%Now we have an example reference to Table\ref{tab:one}.
\section{Visualization}

\subsection{Average Shifted Histograms}

Average shifted histograms are a simple device for taking a collection
of $m$ histograms each with uniform bin width $h$, but with bin
origins offset/shifted \citep{scott2009multivariate}.

Given the focus of this package on large data sets, a HistToASH
function is provided to produce the average shifted histogram from a
single histogram assuming the underlying dataset is no longer
available in unbinned form.  The next plot shows the original density histogram
of the overlaid with the average shifted histogram with smoothing
parameter 5.

<<fig=TRUE>>=
x <- runif(1000, min=0, max=100)
h <- hist(x, breaks=0:100, plot=F)

plot(h,freq=FALSE, main="Histogram of x with ASH superimposed in red")
# Superimpose the Average Shifted Histogram on top of the original.
lines(HistToASH(h), col="red")
@

\subsection{Log-scale Histograms}

Many histograms encountered in databases, distributed systems, and
other areas of computer science use log-scaled bucket boundaries
and may still contain counts that vary exponentially.

Examples of such data sets include the size or age of files stored, or
the latency enountered by read requests.  Such histograms can be read
into R using the methods of this package, but the default plotting
functionality is not very useful.

For example, Figure~\ref{fig:logloghist} shows the default plot output
for the histogram of read sizes observed when building a FreeBSD
kernel.  Neither the density histogram (left) nor the frequency
histogram (middle) make it easy to understand the distribution.  In
contract, the log ECDF produced by the \code{PlotLog2ByteEcdf}
function makes it easy to see that about 20\% of the reads were of 64
bytes or smaller, and that the median read size was well under a
kilobyte.

<<logloghist,fig=TRUE,include=FALSE,echo=FALSE,height=4,width=8>>=
par(mfrow=c(1,3))
filename <- system.file("extdata/buildkernel-readsize-dtrace.txt",
                        package="HistogramTools")
dtrace.hists <- ReadHistogramsFromDtraceOutputFile(filename)
plot(dtrace.hists[["TOTAL"]], main="", freq=FALSE)
plot(dtrace.hists[["TOTAL"]], main="", freq=TRUE)
PlotLog2ByteEcdf(dtrace.hists[["TOTAL"]])
@

\begin{figure}[h]
\begin{center}
\includegraphics[width=5in,height=2.5in]{HistogramTools-logloghist}
\end{center}
\caption{Example Log Log Histogram Data}
\label{fig:logloghist}
\end{figure}


\section{Efficient Representations of Histograms}

Consider an example histogram of 100 random data points.
Figure~\ref{fig:exhist} shows the graphical representation and list
structure of R histogram objects.

<<exhist,fig=TRUE,include=FALSE,echo=TRUE>>=
myhist <- hist(runif(100))
@

\begin{figure}[h]
\begin{minipage}{.4\textwidth}
\begin{center}
\includegraphics[width=2.5in,height=2.5in]{HistogramTools-exhist}
\end{center}
%\captionof{figure}{Example Histogram}
%\label{fig:exhist}
\end{minipage}\hfill \begin{minipage}{.5\textwidth}
\begin{tiny}
<<echo=FALSE,print=TRUE>>=
myhist
@
\end{tiny}
\end{minipage}
\caption{Example Histogram}
\label{fig:exhist}
\end{figure}

This histogram compactly represents the full distribution.
Histogram objects in R are lists with \Sexpr{length(myhist)} components: breaks, counts,
density, mids, name, and equidist.

If we are working in a parallel environment and need to distribute
such a histogram to other tasks running in a compute cluster, we
need to serialize this histogram object to a binary format that can be
transferred over the network.

\subsection{Native R Serialization}

R includes a built-in serialization framework that allows one to
serialize any R object to an Rdata file.

<<echo=TRUE,print=TRUE>>=
length(serialize(myhist, NULL))
@

This works and is quite convenient if the histogram must only be
shared between tasks running the R interpreter, but it is not a very
portable format.

\subsection{Protocol Buffers}

Protocol Buffers are a flexible, efficient, automated, cross-platform
mechanism for serializing structured data \citep{Pike:2005:IDP:1239655.1239658,protobuf}.  The RProtoBuf package
\citep{rprotobuf} provides an interface for manipulating protocol
buffer objects directly within R.

Of the \Sexpr{length(myhist)} elements stored in an R histogram object, we only need to
store three in our serialization format since the others can be
re-computed.  This leads to the following simple protocol buffer
definition of the breaks, counts, and name of a histogram:

<<echo=FALSE>>=
invisible(cat(paste(readLines(system.file("proto/histogram.proto",
                                package="HistogramTools")), "\n")))
@

The package provides \code{as.Message} and \code{as.histogram} methods
for converting between R histogram objects and this protocol buffer
representation.

In addition to the added portability, the protocol buffer
representation is significantly more compact.

<<echo=FALSE,print=FALSE>>=
if(require(RProtoBuf)) {
hist.msg <- as.Message(myhist)
} else {
hist.msg <- "RProtoBuf library not available"
}
<<echo=TRUE,print=TRUE,eval=FALSE>>=
hist.msg <- as.Message(myhist)
@

Our histogram protocol buffer has a human-readable ASCII representation:

<<echo=TRUE>>=
cat(as.character(hist.msg))
@

But it is most useful when serialized to a compact binary representation:

<<echo=TRUE,print=TRUE,eval=FALSE>>=
length(hist.msg$serialize(NULL))
<<echo=F,print=T,eval=TRUE>>=
if (require(RProtoBuf)) {
  length(hist.msg$serialize(NULL))
} else {
  invisible(cat("RProtoBuf not available."))
}
@

This protocol buffer representation is not compressed by default,
however, so we can do better:

<<echo=TRUE,print=TRUE,eval=FALSE>>=
raw.bytes <- memCompress(hist.msg$serialize(NULL), "gzip")
length(raw.bytes)
<<echo=F,print=T,eval=TRUE>>=
if (require(RProtoBuf)) {
  raw.bytes <- memCompress(hist.msg$serialize(NULL), "gzip")
  length(raw.bytes)
} else {
  raw.bytes <- memCompress("Not available", "gzip")
  cat("RProtoBuf not available")
}
@

We can then send this compressed binary representation of the
histogram over a network or store it to a cross-platform data store
for later analysis by other tools.  To recreate the
original R histogram object from the serialized protocol buffer we can
use the \code{as.histogram} method.

<<echo=TRUE,print=FALSE,eval=FALSE>>=
uncompressed.bytes <- memDecompress(raw.bytes, "gzip")
new.hist.proto <- P("HistogramTools.HistogramState")$read(uncompressed.bytes)
length(uncompressed.bytes)
<<echo=FALSE,eval=TRUE>>=
uncompressed.bytes <- memDecompress(raw.bytes, "gzip")
if (require(RProtoBuf)) {
  new.hist.proto <- P("HistogramTools.HistogramState")$read(uncompressed.bytes)
  length(uncompressed.bytes)
}
@

The resulting histogram is the same as the original; it was converted
to a protocol buffer, serialized, compressed, then uncompressed,
parsed, and converted back to a histogram.

<<echo=TRUE,print=FALSE,eval=FALSE>>=
plot(myhist)
plot(as.histogram(new.hist.proto))
@

<<fig=TRUE,echo=FALSE,height=4,width=8>>=
par(mfrow=c(1,2))
plot(myhist)
if (require(RProtoBuf)) {
  plot(as.histogram(new.hist.proto))
} else {
  plot(myhist)
}
@

\section{Applications}

\subsection{Filesystem Workload Characterization with DTrace}

The DTrace framework \cite{cantrill2004dynamic} provides a scalable
method of dynamically collecting and aggregating system performance
data on Unix systems.  The \code{ReadHistogramsFromDtraceOutputFile}
Parses the text output of the DTrace command to convert the ASCII
representation of aggregate distributions into R histogram objects.

\subsection{Binning Large Data Sets with Map-Reduce}

Many large data sets in fields such as particle physics and information
processing are stored in binned or histogram form in order to reduce
the data storage requirements \citep{scott2009multivariate}.

There are two common patterns for generating histograms of large data
sets with MapReduce.  In the first method, each mapper task can
generate a histogram over a subset of the data that is has been
assigned, and then the histograms of each mapper are sent to one or
more reducer tasks to merge.

In the second method, each mapper rounds a data point to a bucket
width and outputs that bucket as a key and '1' as a value.  Reducers
then sum up all of the values with the same key and output to a data store.

In both methods, the mapper tasks must choose identical
bucket boundaries even though they are analyzing disjoint parts of the
input set that may cover different ranges, or we must implement
multiple phases.

\begin{figure}[h!]
\begin{center}
\includegraphics{histogram-mapreduce-diag1.pdf}
\end{center}
\caption{Diagram of MapReduce Histogram Generation Pattern}
\label{fig:mr-histogram-pattern1}
\end{figure}

Figure~\ref{fig:mr-histogram-pattern1} illustrates the second method
described above for histogram generation of large data sets with
MapReduce.

This package is designed to be helpful if some of the Map or Reduce
tasks are written in R, or if those components are written in other
languages and only the resulting output histograms need to be
manipulated in R.

%There are really two cookbook recipes for generating histograms of
%large data sets.  Have each mapper generate an actual histogram, in
%which case the reducer must merge them.

%Or have each mapper output round to a bucket width and output for
%every bucket width item then have the reducer sum those up.

%Why do we do the first one?  Why not the second one?

%What is the runtime difference between the two?  Is the second
%faster?  Or the first?

%TODO:
%Good diagram for paper showing MapReduce generation of histograms.  As described in Hadoop MapReduce cookbook :

% Histogram makes sense only under a continuous dimension (for example, access time and file % size). It groups the number of occurrences of some event into several groups in the dimension. % For example, in this recipe, if we take the access time from weblogs as the dimension, then we will group the access time by the hour.
%
%The following figure shows a summary of the execution. Here the mapper calculates the hour of %the day and emits the "hour of the day" and 1 as the key and value respectively. Then each %reducer receives all the occurrences of one hour of a day, and calculates the number of %occurrences:
% THIS ABOVE IS FROM :
%\url{http://my.safaribooksonline.com/book/-/9781849517287/6dot-analytics/ch06s06_html}

%\section{Applications}
%Very large data sets in Pearson's day meant more than 100 observations, but analysis of web-scale data, such as the number of unlinks in the trillions of pages on the web requires significantly larger computation resources.
%Pearson was interested in very large data sets (more than 1000)

\section{Summary}
The HistogramTools package presented in this paper has been in wide
use for the last several years to allow engineers to read distribution
data from internal data stores written in other languages.  Internal
production monitoring tools and databases, as well as the output of
many MapReduce jobs have been used.

\bibliography{refs}

<<echo=FALSE,print=FALSE>>=
par(oldpar)
options(oldoptions)
@

\end{document}
%%  LocalWords:  HistogramTools MapReduce RProtoBuf rprotobuf proto
%%  LocalWords:  readLines memCompress gzip memDecompress mfrow CDFs
%%  LocalWords:  ApproxMean ApproxQuantile PlotAll PlotKSDCC Smirnov
%%  LocalWords:  PlotEMDCC Kolmogorov MDCC rexp KSDCC EMDCC ECDF
%%  LocalWords:  TrimHistogram trimhist AddHistograms mergehist
%%  LocalWords:  AddManyHistograms MergeBuckets overbinned Subsetting
%%  LocalWords:  downsamplehist SubsetHistogram subsethist