--- title: "Using prim for bump hunting and estimating highest density difference regions" output: rmarkdown::html_vignette description: Using prim for bump hunting and estimating highest density difference regions author: Tarn Duong vignette: > %\VignetteIndexEntry{Using prim for bump hunting and estimating highest density difference regions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE, message=FALSE} knitr::opts_chunk$set(global.par=TRUE, collapse=TRUE, comment="#>", fig.width=5, fig.height=5, fig.align="center", dpi=96) options(tibble.print_min=4L, tibble.print_max=4L) ``` ## Introduction The Patient Rule Induction Method (PRIM) was introduced by Friedman and Fisher (1999). It is a technique from data mining for finding `interesting' regions in high-dimensional data. We start with regression-type data *(**X**~1~, Y~1~), ..., (**X**~n~, Y~n~)* where ***X**~i~* is *d*-dimensional and *Y~i~* is a scalar response variable. We are interested in the conditional expectation function *m(**x**) = **E** (Y | **X** = **x**)*. ## Bump hunting In the case where we have a single sample then PRIM finds the bumps of *m(**x**)*. We use a subset of the `MASS::Boston` data set. It contains housing data measurements for 506 towns in the Boston, USA area. For the explanatory variables, we take the nitrogen oxides concentration in parts per 10 million `nox` and the average number of rooms per dwelling `rm`. The response is the per capita crime rate `crim`. We are interested in characterising those areas with higher crime rates in order to provide better support infrastructure. ```{r} library(prim) data(Boston, package="MASS") x <- Boston[,5:6] y <- Boston[,1] boston.prim <- prim.box(x=x, y=y, threshold.type=1) ``` The default settings for `prim.box` are * peeling quantile: `peel.alpha=0.05` * pasting is carried out: `pasting=TRUE` * pasting quantile: `paste.alpha=0.01` * minimum box mass (proportion of points inside a box): `mass.min=0.05` * `threshold` is the overall mean of the response variable `y` * `threshold.type=0` We use the default settings except that we wish to only find high crime areas {*m(**x**)* $\ge$ `threshold`} so we set `threshold.type=1`. We view the output using a `summary` method. This displays three columns: the box mean, the box mass, and the threshold type. Each line is a summary for each box, as well as an overall summary. The box which is asterisked indicates that it is the box which contains the rest of the data not processed by PRIM. There is one box which contains 42.89% of the towns and where the average crime rate is 7.62. This is our HDR estimate. This regions comprises the bulk of the high crime areas, and is described in terms of nitrogen oxides levels in [0.53, 0.74] and average number of rooms in [3.04, 7.07]. The other 57.11% of the towns have an average crime rate of 0.6035. ```{r} summary(boston.prim, print.box=TRUE) ``` We plot the PRIM boxes, including all those towns whose crime rate exceeds 3.5. Thus verifying that the majority of high crime towns fall inside thebump. ```{r, fig.asp=1} plot(boston.prim, col="transparent") points(x[y>3.5,]) ``` There are many options for the graphical display. See the help guide for more details `?plot.prim`. The default function applied to the response variable `y` is `mean`. We can input another function, e.g. `median`, using ```{r} boston.prim.med <- prim.box(x=x, y=y, threshold.type=1, y.fun=median) ``` We compare the results: the box for the mean is in black, for the median in red: ```{r, fig.asp=1} plot(boston.prim, col="transparent") plot(boston.prim.med, col="transparent", border="red", add=TRUE) legend("topleft", legend=c("mean", "median"), col=1:2, lty=1, bty="n") ``` The covariate `x` can also include categorical variables: we replace the average number of rooms per dwelling `rm` with the index of accessibility to radial highways `rad` which takes integral values from 1 to 24 inclusive. ```{r, fig.asp=1} x2 <- Boston[,c(5,9)] y <- Boston[,1] boston.cat.prim <- prim.box(x=x2, y=y, threshold.type=1) summary(boston.cat.prim, print.box=TRUE) plot(boston.cat.prim, col="transparent") points(x2[y>3.5,]) ``` ## Estimating highest density difference regions In the case where we have 2 samples, we can label the response as a binary variable with *Y~i~ = 1* if ***X**~i~* is in sample 1 or *Y~i~ = -1* if ***X**~i~* is in sample 2. Then PRIM finds the regions where the samples are most different. Here we have a positive HDR (where sample 1 points dominate) and a negative HDR (where sample 2 points dominate). We look at a 3-dimensional data set `quasiflow` included in the `prim` library. It is a randomly generated data set from two normal mixture distributions whose structure mimics some light scattering data, taken from a machine known as a flow cytometer. ```{r} library(prim) data(quasiflow) yflow <- quasiflow[,4] xflow <- quasiflow[,1:3] xflowp <- quasiflow[yflow==1,1:3] xflown <- quasiflow[yflow==-1,1:3] ``` We can think of `xflowp` as flow cytometric measurements from an HIV+ patient, and `xflown` from an HIV-- patient. ```{r} pairs(xflowp, cex=0.5, pch=16, col=grey(0,0.1), xlim=c(0,1), ylim=c(0,1)) pairs(xflown, cex=0.5, pch=16, col=grey(0,0.1), xlim=c(0,1), ylim=c(0,1)) ``` There are two ways of using `prim.box` to estimate where the two samples are most different (or equivalently to estimate the HDRs of the difference of the density functions). In the first way, we assume that we have suitable values for the thresholds. Then we can use ```{r}= qflow.thr <- c(0.38, -0.23) qflow.prim <- prim.box(x=xflow, y=yflow, threshold=qflow.thr, threshold.type=0) ``` An alternative is compute PRIM box sequences which cover the range of the response variable `y`, and then use `prim.hdr` to experiment with different threshold values. This two-step process is more efficient and faster than calling `prim.box` for each different threshold. We are happy with the positive HDR threshold so we can compute the positive HDR directly: ``` {r} qflow.hdr.pos <- prim.box(x=xflow, y=yflow, threshold=0.38, threshold.type=1) summary(qflow.hdr.pos) ``` On the other hand, we are not sure about the negative HDR thresholds. So we try several different values for `threshold`. ```{r} qflow.neg <- prim.box(x=xflow, y=yflow, threshold.type=-1) qflow.hdr.neg1 <- prim.hdr(qflow.neg, threshold=-0.23, threshold.type=-1) qflow.hdr.neg2 <- prim.hdr(qflow.neg, threshold=-0.43, threshold.type=-1) qflow.hdr.neg3 <- prim.hdr(qflow.neg, threshold=-0.63, threshold.type=-1) ``` After examining the summaries and plots, ```{r} summary(qflow.hdr.neg1) ``` we choose `qflow.hdr.neg1` to combine with `qflow.hdr.pos`. ```{r} qflow.prim2 <- prim.combine(qflow.hdr.pos, qflow.hdr.neg1) summary(qflow.prim2) ``` In the plot below, the positive HDR is coloured orange (where the HIV+ sample is more prevalent), and the negative HDR is coloured blue (where the HIV- sample is more prevalent) ```{r, fig.asp=1, fig.height=10} plot(qflow.prim2, x.pt=xflow, pch=16, cex=0.5, alpha=0.1) ``` or equivalently as a 3D scatter plot. ```{r, fig.asp=1, fig.height=10} plot(qflow.prim2, x.pt=xflow, pch=16, cex=0.5, alpha=0.1, splom=FALSE, colkey=FALSE, ticktype="detailed") ``` ## References Friedman, J. H. and Fisher, N. I. (1999) Bump-hunting for high dimensional data. *Statistics and Computing* **9**, 123--143.   -- Generated `r format(Sys.time(), '%d %B %Y')`.