%\documentclass[10pt,a4paper]{article} \documentclass[nojss]{jss} \usepackage[utf8]{inputenc} \usepackage[english]{babel} %\usepackage{a4wide} %\setlength{\parskip}{0.5ex plus0.1ex minus0.1ex} %\setlength{\parindent}{0em} %\usepackage[round,longnamesfirst]{natbib} %\usepackage{hyperref} %%% for tabulars %\usepackage{rotating} %\usepackage{multirow} %%% for hanging paragraph %\usepackage{hanging} %%% double spacing % \usepackage{setspace} % \doublespacing %\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}} \newcommand{\class}[1]{\mbox{\textsf{#1}}} \newcommand{\func}[1]{\mbox{\texttt{#1()}}} %\newcommand{\code}[1]{\mbox{\texttt{#1}}} \newcommand{\pkg}[1]{\strong{#1}} \newcommand{\samp}[1]{`\mbox{\texttt{#1}}'} %\newcommand{\proglang}[1]{\textsf{#1}} \newcommand{\set}[1]{\mathcal{#1}} \newcommand{\vect}[1]{\mathbf{#1}} \DeclareTextFontCommand{\emph}{\normalfont} %\usepackage{Sweave} %\VignetteIndexEntry{stream: Extending the stream Framework} %% publication information %% NOTE: This needs to filled out ONLY IF THE PAPER WAS ACCEPTED. %% If it was not (yet) accepted, leave them commented. %% \Volume{13} %% \Issue{9} %% \Month{September} %% \Year{2004} %% \Submitdate{2004-09-29} %% \Acceptdate{2004-09-29} \author{ Michael Hahsler\\Southern Methodist University \And Matthew Bola\~nos\\Microsoft Corporation \AND John Forrest\\Microsoft Corporation } \title{Extending the \pkg{stream} Framework} \Plainauthor{Michael Hahsler, Matthew Bolanos, John Forrest} \Plaintitle{stream: Extending the stream Framework} \Shorttitle{\pkg{stream}: Extending the \pkg{stream} Framework} %% an abstract and keywords \Abstract{This document describes how to add new data stream sources \code{DSD} and data stream tasks \code{DST} to the \pkg{stream} framework.} \Keywords{data streams, data mining, clustering} \Plainkeywords{data streams, data mining, clustering} \Address{Michael Hahsler\\ Computer Science\\ Lyle School of Engineering\\ Southern Methodist University\\ P.O. Box 750122 \\ Dallas, TX 75275-0122\\ E-mail: \email{mhahsler@lyle.smu.edu}\\ URL: \url{http://lyle.smu.edu/~mhahsler} Matthew Bola\~nos\\ Research Now\\ 5800 Tennyson Pkwy \# 600\\ Plano, TX 75024 E-mail: \email{mbolanos@curiouscrane.com} John Forrest\\ Microsoft Corporation\\ One Microsoft Way\\ Redmond, WA 98052-7329\\ E-mail: \email{jforrest@microsoft.com} } \begin{document} \vfill %\maketitle %% Add TOC (not with jss style) %\clearpage \tableofcontents \clearpage %\sloppy <>= options(width = 75, digits = 3, prompt = 'R> ', scipen = 3) @ \section{Extending the stream framework} \label{sec:extension} Since stream mining is a relatively young field and many advances are expected in the near future, the object oriented framework in \pkg{stream} is developed with easy extensibility in mind. Implementations for data streams (DSD) and data stream mining tasks (DST) can be easily added by implementing a small number of core functions. The actual implementation can be written in either \proglang{R}, \proglang{Java}, \proglang{C}/\proglang{C++} or any other programming language which can be interfaced by \proglang{R}. In the following we discuss how to extend \pkg{stream} with new DSD and DST implementations. %In the following we discuss how to extend DSD, DST and how to interface %algorithms from other frameworks in \pkg{stream}. \subsection{Adding a new data stream source (DSD)} DSD objects can be a management layer on top of a real data stream, a wrapper for data stored in memory or on disk, or a generator which simulates a data stream with know properties for controlled experiments. Figure~\ref{figure:dsd} shows the relationship (inheritance hierarchy) of the DSD classes as a UML class diagram~\citep{stream:Fowler:2003}. All DSD classes extend the abstract base class~\code{DSD}. There are currently two types of DSD implementations, classes which implement \proglang{R}-based data streams~(\code{DSD_R}) and MOA-based stream generators~(\code{DSD_MOA}) provided in \pkg{streamMOA}. Note that abstract classes define interfaces and only implement common functionality. Only implementation classes can be used to create objects (instances). This mechanism is not enforced by S3, but is implemented in \pkg{stream} by providing for all abstract classes constructor functions which create an error. The class hierarchy in Figure~\ref{figure:dsd} is implemented using the S3 class system~\citep{stream:Chambers:1992}. Class membership and the inheritance hierarchy is represented by a vector of class names stored as the object's class attribute. For example, an object of class \code{DSD_Gaussians} will have the class attribute vector \code{c("DSD_Gaussians", "DSD_R", "DSD")} indicating that the object is an \proglang{R} implementation of DSD. This allows the framework to implement all common functionality as functions at the level of \code{DSD} and \code{DSD_R} and only a minimal set of functions is required to implement a new data stream source. Note that the class attribute has to contain a vector of all parent classes in the class diagram in bottom-up order. \begin{figure} \centering \includegraphics[width=\linewidth]{dsd_uml} \caption{Overview of the data stream data (DSD) class structure.} \label{figure:dsd} \end{figure} For a new DSD implementation only the following two functions need to be implemented: \begin{enumerate} \item A creator function (with a name starting with the prefix \code{DSD_}) and \item the \code{get_points()} method. \end{enumerate} The creator function creates an object of the appropriate \code{DSD} subclass. Typically this S3 object contains a list of all parameters, an open \proglang{R} connection and/or an environment or a reference class for storing state information (e.g., the current position in the stream). Standard parameters are \code{d} and \code{k} for the number of dimensions of the created data and the true number of clusters, respectively. In addition an element called \code{"description"} should be provided. This element is used by \code{print()}. The implemented \code{get_points()} needs to dispatch for the class and create as the output a data frame containing the new data points as rows. If called with \code{info = TRUE} additional information columns starting with \code{.} should be returned. For example, a column called \code{.class} with the ground truth (true cluster assignment as an integer vector; noise is represented by \code{NA}) should be returned for data streams for clustering or classification. Other information columns include \code{.id} for point IDs and \code{.time} for time stamps. For a very simple example, we show here the implementation of \code{DSD_UniformNoise} available in the package's source code in file \code{DSD_UniformNoise.R}. This generator creates noise points uniformly distributed in a $d$-dimensional hypercube with a given range. <<>>= library("stream") @ <<>>= DSD_UniformNoise <- function(d = 2, range = NULL) { if(is.null(range)) range <- matrix(c(0, 1), ncol = 2, nrow = d, byrow = TRUE) structure(list(description = "Uniform Noise Data Stream", d = d, k = NA_integer_, range = range), class = c("DSD_UniformNoise", "DSD_R", "DSD")) } get_points.DSD_UniformNoise <- function(x, n = 1, info = TRUE, ...) { data <- data.frame(t(replicate(n, runif( x$d, min = x$range[, 1], max = x$range[, 2])))) if (info) data[[".class"]] <- NA data } @ The constructor only stores the description, the dimensionality and the range of the data. For this data generator \code{k}, the number of true clusters, is not applicable. Since all data is random, there is also no need to store a state. The \code{get_points()} implementation creates $n$ random points and if class assignment info is requested, then a \code{.class} column is added containing all \code{NA}s indicating that the data points are all noise. Now the new stream type can already be used. <>= stream <- DSD_UniformNoise() stream plot(stream, main = description(stream)) @ The resulting plot is shown in Figure~\ref{figure:dsd_example}. \begin{figure} \centering \includegraphics[width=.5\linewidth]{extending_stream-dsd_example} \caption{Sample points from the newly implemented \code{DSD\_UniformNoise} object.} \label{figure:dsd_example} \end{figure} \subsection{Adding a new data stream tasks (DST)} DST refers to any data mining task that can be applied to data streams. The design is flexible enough for future extensions including even currently unknown tasks. Figure~\ref{figure:dst} shows the class hierarchy for DST. \begin{figure} \centering \includegraphics[width=\linewidth]{dst_uml} \caption{Overview of the data stream task (DST) class structure with subclasses for clustering (DSC), classification (DSClassify) and frequent pattern mining (DSFP) and outlier detection (DSOutlier).} \label{figure:dst} \end{figure} DST classes implement mutable objects which can be changed without creating a copy. This is more efficient, since otherwise a new copy of all data structures used by the algorithm would be created for processing each data point. Mutable objects can be implemented in \proglang{R} using environments or the recently introduced reference class construct (see package~\pkg{methods} by the \cite{stream:R:2005}). Alternatively, pointers to external data structures in \proglang{Java} or \proglang{C/C++} can be used to create mutable objects. To add a new data stream mining tasks (e.g., frequent pattern mining), a new package with a subclass hierarchy similar to the hierarchy in Figure~\ref{figure:dst} for data stream clustering (DSC) can be easily added. This new package can take full advantage of the already existing infrastructure in \pkg{stream}. An example is the package~\pkg{streamMOA}~\cite{stream:streamMOA:2014}, which can be used as a model to develop a new package. We plan to provide more add-on packages to \pkg{stream} for frequent pattern mining and data stream classification in the near future. In the following we discuss how to interface an existing algorithm with \pkg{stream}. We concentrate again on clustering, but interfacing algorithms for other types of tasks is similar. To interface an existing clustering algorithm with \pkg{stream}, \begin{enumerate} \item a creator function (typically named after the algorithm and starting with \code{DSC_}) which created the clustering object, \item an implementation of the actual cluster algorithm, and \item accessors for the clustering \end{enumerate} are needed. The implementation depends on the interface that is used. Currently an \code{R} interface is available as \code{DSC_R} and a MOA interface is implemented in \code{DSC_MOA} (in \pkg{streamMOA}). The implementation for \code{DSC_MOA} takes care of all MOA-based clustering algorithms and we will concentrate here on the \proglang{R} interface. For the \proglang{R} interface, the clustering class needs to contain the elements \code{"description"} and \code{"RObj"}. The description needs to contain a character string describing the algorithm. RObj is expected to be a reference class object and contain the following methods: \begin{enumerate} \item \code{cluster(newdata, ...)}, where \code{newdata} is a data frame with new data points. \item \code{get_assignment(dsc, points, ...)}, where the clusterer \code{dsc} returns cluster assignments for the input \code{points} data frame. \item For micro-clusters: \code{get_microclusters(...)} and \code{get_microweights(...)} \item For macro-clusters: \code{get_macroclusters(...)}, \code{get_macroweights} and \\ \code{microToMacro(micro, ...)} which does micro- to macro-cluster matching. \end{enumerate} Note that these are methods for reference classes and do not contain the called object in the parameter list. Neither of these methods are called directly by the user. Figure~\ref{figure:interaction} shows that the function \code{update()} is used to cluster data points, and \code{get_centers()} and \code{get_weights()} are used to obtain the clustering. These user facing functions call internally the methods in RObj via the \proglang{R} interface in class \code{DSC_R}. \begin{figure} \centering \includegraphics[width=\linewidth]{interaction} \caption{Interaction between the DSD and DSC classes.} \label{figure:interaction} \end{figure} For a comprehensive example of a clustering algorithm implemented in \proglang{R}, we refer the reader to \code{DSC_DStream} (in file \code{DSC_DStream.R}) in the package's \code{R} directory. % %%\subsection{Interfacing Algorithms from Other Frameworks} %%TODO % %\pagebreak[1] % \bibliography{stream} \end{document}