---
title: "R2sample"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{R2sample}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: [R2sample.bib]
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(R2sample)
```
This package brings together a number of routines for the two sample problem for univariate data. There are two data sets x and y and we want to test whether they were generated by the same probability distribution.
The highlights of this package are:
- it runs a large number of tests simultaneously.
- the user can provide their own tests
- it works for continuous and for discrete data.
- it uses a permutation test to find p values (except for chi square tests).
- tests can be combined and a single adjusted p value found
- for large continuous data sets p values can be found via large sample theory.
- it uses Rcpp and parallel programming to be very fast.
- it includes routines that make power studies very simple.
- it has a routine that runs (up to) 20 different case studies, comparing the power of a new method to those included in the package. This will make it very easy for someone who has developed a new method to test its performance.
## Example 1
We generate two data sets of size 100 and 120 respectively from standard normal distributions and run the test:
```{r}
set.seed(123)
```
**Note**: all examples are run with arguments *B=500, maxProcessor = 2* in order to pass *devtools::check()*.
```{r}
x1 = rnorm(100)
y1 = rnorm(120)
twosample_test(x1, y1, B=500, maxProcessor = 2)
```
In this case the null hypothesis is true: both data sets were generated by a standard normal distribution. And indeed, all 13 tests have p values much larger than (say) 0.05 and therefore correctly fail to reject the null hypothesis.
Alternatively one might wish to carry out a select number of tests, but then find a correct p value for the combined test. This can be done as follows:
```{r}
twosample_test_adjusted_pvalue(x1, y1, B=c(500,500))
```
By default for continuous data this runs four tests (chi square with a small number of bins, ZA, ZK and Wassp1). It then finds the smallest p value (here 0.02) and adjusts it for the multiple testing problem, resulting in an overall p value of 0.05.
### New Tests
Say we wish to use two new tests for continuous data. One is based on the difference in standard deviations and the other on the difference in standardized means:
$$
\begin{aligned}
&TS_1 = sd(x) - sd(y) \\
&TS_2 = \bar{x}/sd(x) - \bar{y}/sd(y) \\
\end{aligned}
$$
```{r}
myTS1 = function(x, y) {
  out = c(0, 0)
  out[1] = abs(sd(x) - sd(y))
  out[2] = abs(mean(x)/sd(x) - mean(y)/sd(y))
  names(out) = c("std", "std t test")
  out
}
```
Then we can simply run
```{r}
twosample_test(x1, y1, TS=myTS1, B=500, maxProcessor = 2)
```
#### Arguments and output of new test routine for continuous data
The arguments have to be x and y for the two data sets and (optionally) a list called TSextra for any additional input needed to find the test statistic.
Note that the output vector of the routine has to be a **named** vector.
If the routine is written in Rcpp, parallel programming is not available.
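For example, a (hypothetical) statistic that compares trimmed means and takes the trimming fraction from TSextra might look like this; the function name *myTS_trim* and the list element *trim* are made up for this illustration:
```{r eval=FALSE}
myTS_trim = function(x, y, TSextra) {
  out = abs(mean(x, trim=TSextra$trim) - mean(y, trim=TSextra$trim))
  names(out) = "trimmed mean"   # the output has to be a named vector
  out
}
# twosample_test(x1, y1, TS=myTS_trim, TSextra=list(trim=0.1), B=500)
```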
## Example 2
Here we generate two data sets of size 1000 and 1200 respectively from a binomial distribution with 5 trials and success probabilities of 0.5 and 0.55, respectively.
```{r}
x2 = table(c(0:5,rbinom(1000, 5, 0.5)))-1
y2 = table(c(0:5,rbinom(1200, 5, 0.55)))-1
rbind(x2, y2)
twosample_test(x2, y2, vals=0:5, B=500, maxProcessor = 2)$p.values
```
In this case the null hypothesis is false, and indeed all nine tests for discrete data have p values much smaller than (say) 0.05 and therefore correctly reject the null hypothesis.
Notice that it is the presence of the *vals* argument that tells the *twosample_test* command that the data is discrete. The vectors x and y are the counts. Note that the three vectors have to have the same length and that no value of *vals* may have both its x and y counts equal to 0.
### New Test
Again we might want to run a different test, say based again on the difference in standardized means. This time we will implement the new test using Rcpp:
(For reasons to do with submission to CRAN this routine is already part of *R2sample*)
```{Rcpp, eval=FALSE}
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector myTS2(IntegerVector x,
                    IntegerVector y,
                    NumericVector vals) {
  Rcpp::CharacterVector methods=CharacterVector::create("std t test");
  int const nummethods=methods.size();
  int k=x.size(), n=sum(x), m=sum(y), i;
  double meanx=0.0, meany=0.0, sdx=0.0, sdy=0.0;
  NumericVector TS(nummethods);
  TS.names() = methods;
  for(i=0; i<k; ++i) {  // means from counts and values
    meanx += x[i]*vals[i]/n;
    meany += y[i]*vals[i]/m;
  }
  for(i=0; i<k; ++i) {  // standard deviations
    sdx += x[i]*(vals[i]-meanx)*(vals[i]-meanx);
    sdy += y[i]*(vals[i]-meany)*(vals[i]-meany);
  }
  sdx = sqrt(sdx/(n-1));
  sdy = sqrt(sdy/(m-1));
  TS(0) = std::abs(meanx/sdx - meany/sdy);
  return TS;
}
```
#### Arguments and output of new test routine for discrete data
The arguments have to be x, y and vals, where x and y are the counts and vals the values of the discrete random variable, with $x_i+y_i>0$. Also (optionally) a list called TSextra for any additional input needed to find the test statistic.
Note that the output vector of the routine has to be a **named** vector.
If the routine is written in Rcpp, parallel programming is not available.
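The new statistic can then be passed to *twosample_test* just as in the continuous case. A sketch, assuming *myTS2* has been compiled (for example with *Rcpp::sourceCpp*) or is taken from the package:
```{r eval=FALSE}
twosample_test(x2, y2, vals=0:5, TS=myTS2, B=500)
```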
## Permutation Method
For all the tests except the chi square variants the p values are found via the permutation method. The idea of a permutation test is simple. Say we have data sets $x_1,..,x_n$ and $y_1,..,y_m$. They are combined into one large data set, permuted and split again into sets of sizes n and m. Under the null hypothesis these new data sets come from the same distribution as the actual data. Therefore, by calculating the test statistics for them and repeating this many times, one can build up the distributions of the test statistics and find p values from them.
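As a minimal illustration of this idea (not the package's implementation), a permutation p value for a single statistic, say the absolute difference in means, could be computed like this:
```{r eval=FALSE}
# minimal sketch of a permutation p value for the difference in means
perm_pvalue = function(x, y, B=1000) {
  n = length(x)
  z = c(x, y)
  TS_obs = abs(mean(x) - mean(y))
  TS_perm = replicate(B, {
    zp = sample(z)                           # permute the combined sample
    abs(mean(zp[1:n]) - mean(zp[-(1:n)]))    # split into new x and y samples
  })
  mean(TS_perm >= TS_obs)   # fraction of permuted statistics at least as large
}
```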
In the discrete case the situation is somewhat more complicated. Say we have data sets with values $v_1,..,v_k$ and counts $x_1,..,x_k$ and $y_1,..,y_k$. One can then mimic the description above by expanding these to yield a large data set with $x_1+y_1$ $v_1$'s, $x_2+y_2$ $v_2$'s and so on. Then this vector is permuted and split as described above. The drawback of this sampling method is that its computation time increases with the sample size.
Alternatively, R2sample provides the following option. One can show that the counts of the permuted data sets have the following distribution: let $n=\sum x_i$ and $m=\sum y_i$, then
$$P(X=a|x,y)=\frac{\prod_{j=1}^k{{x_j+y_j}\choose{a_j}}}{{n+m}\choose{n}}$$
for any $a$ such that $0\le a_i \le x_i+y_i; i=1,..,k$ and $\sum a_i=n$. Sampling from this distribution can be done using MCMC. R2sample uses the Metropolis-Hastings algorithm as follows: in each step two coordinates $i$ and $j$ with $1\le i,j \le k, i\ne j$ are chosen randomly. Then the proposal $a_i=x_i+1, a_j=x_j-1, b_i=y_i-1, b_j=y_j+1$, with all other coordinates left unchanged ($a_l=x_l, b_l=y_l$), is accepted with probability
$$\frac{P(X=a|x,y)}{P(X=x|x,y)}=\frac{y_ix_j}{(x_i+1)(y_j+1)}$$
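To make this concrete, here is a minimal R sketch of a single Metropolis-Hastings update of the counts; it is only an illustration, not the package's internal implementation:
```{r eval=FALSE}
# one Metropolis-Hastings step for the permutation distribution of the counts
mh_step = function(x, y) {
  ij = sample(length(x), 2)     # two distinct coordinates i and j
  i = ij[1]; j = ij[2]
  ratio = y[i]*x[j] / ((x[i]+1)*(y[j]+1))   # acceptance ratio from the formula above
  if(runif(1) < ratio) {        # accept: move one count from y_i to x_i and from x_j to y_j
    x[i] = x[i]+1; x[j] = x[j]-1
    y[i] = y[i]-1; y[j] = y[j]+1
  }
  list(x=x, y=y)
}
```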
As is generally the case with MCMC simulation one needs a larger number of simulations than with independence sampling. The default is independence sampling, with a simulation size of 5000 for testing and (1000, 1000) for power estimation. If the sample sizes are very large this might run quite slowly and it might be advisable to change to samplingmethod="MCMC".
Of course MCMC sampling suffers from the usual issues of having to reach the stationary distribution. In the current situation a two-sample test is likely only done if the two data sets are reasonably close, and so the MCMC algorithm should start essentially at the stationary distribution and no burn-in period is required.
On a standard pc (as of 2024) independence sampling of two data sets with 10000 observations each takes only a few seconds and so is perfectly doable.
## The Testing Methods
In the continuous case we have a sample x of size n and a sample y of size m. In the discussion below both are assumed to be ordered. We denote by z the combined sample. In the discrete case we have a random variable with values v, and x and y are the counts. We denote by k the number of observed values.
We denote the edf's of the two data sets by $\widehat{F}$ and $\widehat{G}$, respectively. Moreover we denote the empirical distribution function of the combined data set by $\widehat{H}$.
- **chi large**
In the case of continuous data, it is binned into *nbins[1]* equal probability bins, with the default *nbins[1]=100*. In the case of discrete data the number of classes comes from *vals*, the values observed.
Then a standard chi square test is run and the p value is found using the usual chi square approximation.
- **chi small**
Many previous studies have shown that a chi square test with a large number of classes often has very poor power. So in the case of continuous data this test uses *nbins[2]* equal probability bins, with the default *nbins[2]=10*. In the case of discrete data, if the number of classes exceeds *nbins[2]=10*, classes are combined until there are *nbins[2]=10* classes.
In either case the p values are found using the usual chi-square approximation. The degrees of freedom are the number of bins. For a literature review of chi square tests see [@rolke2021].
---
For all other tests the p values are found using a permutation test. They are
- **KS** Kolmogorov-Smirnov test
This test is based on the quantity $\max\{|\widehat{F}(x)-\widehat{G}(x)|:x\in \mathbf{R}\}$. In the continuous case (without ties) the function $\widehat{F}(x)-\widehat{G}(x)$ either takes a jump of size $1/n$ up at a point in the x data set, a jump of size $1/m$ down if x is a point in the y data set, or is flat. Therefore the test statistic is
$$\max\{\vert\sum_{i=1}^{j} \left( \frac1n I\{z_i \in x\}-\frac1m I\{z_i \in y\} \right)\vert:j=1,..,n+m\}$$
In the discrete case the jumps have sizes $x_i/n$ and $y_j/m$, respectively and the test statistic is
$$\max\{\vert\sum_{i=1}^{j} \left( \frac{x_i}{n} -\frac{y_i}{m} \right)\vert:j=1,..,k\}$$
This test was first proposed in [@Kolmogorov1933], [@Smirnov1939] and is one of the most widely used tests today.
There is a close connection between the edf's and the ranks of the x and y in the combined data set. Using this one can also calculate the KS statistic, and many statistics to follow, based on these ranks. This is used in many software routines, for example the ks.test routine in R. However, when applied to discrete data these formulas can fail badly because of the many ties, and so we will not use them in our routine.
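A direct R transcription of the cumulative-sum formulation above (continuous case, no ties; written for clarity rather than speed):
```{r eval=FALSE}
# KS statistic via cumulative sums of the jumps (continuous data, no ties)
ks_stat = function(x, y) {
  n = length(x); m = length(y)
  z = sort(c(x, y))
  steps = ifelse(z %in% x, 1/n, -1/m)   # jump up at x points, down at y points
  max(abs(cumsum(steps)))
}
```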
- **Kuiper** Kuiper's test
This test is closely related to Kolmogorov-Smirnov, but it uses the sum of the largest positive and negative differences between the edf's as a test statistic:
$$T_x=\max\{\sum_{i=1}^{j} \left( \frac1n I\{z_i \in x\}-\frac1m I\{z_i \in y\} \right):j=1,..,n+m\}$$
$$T_y=\min\{\sum_{i=1}^{j} \left( \frac1n I\{z_i \in x\}-\frac1m I\{z_i \in y\} \right):j=1,..,n+m\}$$
and the test statistic is $T_x-T_y$. The discrete case follows directly. This test was first proposed in [@Kuiper1960].
- **CvM** Cramer-vonMises test
This test is based on the integrated squared difference between the edf's:
$$\frac{nm}{(n+m)^2}\sum_{i=1}^{n+m} \left( \hat{F}(z_i)-\hat{G}(z_i) \right)^2 $$
The extension of the Cramer-vonMises criterion to the two-sample problem is discussed in [@Anderson1962].
- **AD** Anderson-Darling test
This test is based on
$$nm\sum_{i=1}^{n+m-1} \frac{\left( \hat{F}(z_i)-\hat{G}(z_i) \right)^2}{i(n+m-i)} $$
It was first proposed in [@anderson1952].
- **LR** Lehmann-Rosenblatt test
Let $r_i$ and $s_i$ be the ranks of x and y in the combined sample, then the test statistic is given by
$$\frac1{nm(n+m)}\left[n\sum_{i=1}^n(r_i-i)^2+m\sum_{j=1}^m(s_j-j)^2\right]$$
See [@lehmann1951] and [@rosenblatt1952].
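The rank formulation above can be written down directly in R (a sketch for continuous data without ties):
```{r eval=FALSE}
# Lehmann-Rosenblatt statistic from the ranks in the combined sample
lr_stat = function(x, y) {
  n = length(x); m = length(y)
  rk = rank(c(x, y))
  r = sort(rk[1:n])           # ranks of the ordered x's in the combined sample
  s = sort(rk[(n+1):(n+m)])   # ranks of the ordered y's
  (n*sum((r - 1:n)^2) + m*sum((s - 1:m)^2)) / (n*m*(n+m))
}
```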
- **ZA, ZK and ZC** Zhang's tests
These tests are based on the paper [@zhang2006]. Note that the calculation of *ZC* is the one exception to the rule: even in the discrete data case the calculation of the test statistic does require loops of lengths n and m. The calculation time will therefore increase with the sample sizes, and for very large data sets this test might have to be excluded.
- **Wassp1** Wasserstein p=1 test
A test based on the Wasserstein p=1 metric. It compares the quantiles of x and y in the combined data set. If n=m the test statistic in the continuous case is very simple:
$$ \frac1n\sum_{i=1}^n |x_i-y_i|$$
If $n\ne m$ it is first necessary to find the respective quantiles of the x's and y's in the combined data set. In the discrete case the test statistic is
$$\sum_{i=1}^{k-1} \vert \sum_{j=1}^i\frac{x_j}{n}-\sum_{j=1}^i\frac{y_j}{m} \vert(v_{i+1}-v_i)$$
For a discussion of the Wasserstein distance see [@wasserstein1969].
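For equal sample sizes the continuous version is just the average absolute difference between the ordered samples; a minimal sketch:
```{r eval=FALSE}
# Wasserstein p=1 statistic for the continuous case with n = m
wassp1_stat = function(x, y) {
  mean(abs(sort(x) - sort(y)))   # compare order statistics directly
}
```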
#### The Arguments of *twosample_test*
The routine has the following arguments:
- **x** and **y** are numeric vectors. In the continuous data case they are simply the data, in the discrete case the counts.
- **vals** a numeric vector of values of the discrete random variable. If missing continuous data is assumed. In the discrete case x, y and vals have to be vectors of equal length.
- **TS** a routine to calculate a test statistic. If missing built-in tests are used.
- **TSextra** a list to be passed to TS.
- **B=5000** number of simulation runs for permutation test. If B=0 only test statistics are found.
- **nbins=c(100, 10)** number of bins to use for chi-square test. In the continuous data case bins are found equi-probable via the quantiles of the combined data set. In the discrete case nbins[1] is changed to the number of values, so the chi-square test is done on the original data set. Then neighboring values are combined until there are nbins[2] classes.
- **maxProcessor** if >1 parallel computing is used.
- **doMethods** a character vector giving the names of the methods to include.
Say for example we want to use only the KS and AD tests:
```{r}
x=rnorm(10)
y=rnorm(12)
twosample_test(x, y, B=500, maxProcessor = 2, doMethods=c("KS","AD"))
```
### The Output of *twosample_test*
A list with vectors statistics (the test statistics) and p.values.
### *shiny app*
The tests can also be run via a shiny app. To do so run
```{r eval=FALSE}
run_shiny()
```
### **twosample_power** and **plot_power**
The routine *twosample_power* allows the estimation of the power of the various tests, and *plot_power* draws the corresponding power graph with the methods sorted by their mean power.
Note that due to the use of permutation tests the power calculations are fairly slow. Because of the time constraints imposed by CRAN the following routines are not run. A full power study with (say) 25 alternatives and fairly large data sets might take several hours to run.
New tests can be used in the same way as above.
### Example 3
Say we wish to find the powers of the tests when one data set comes from a standard normal distribution with sample size 100 and the other from a t distribution with 1-10 degrees of freedom and sample size 200.
```{r eval=FALSE}
plot_power(twosample_power(
f=function(df) list(x=rnorm(100), y=rt(200, df)), df=1:10)
)
```
#### The Arguments of *twosample_power*
- **f** A function that generates data sets x and y, either continuous or the counts of discrete variables. In the discrete case care needs to be taken that the resulting vectors have the same length as *vals*. The output has to be a list with elements x and y. In the case of discrete data the list also has to include a vector **vals**, the values of the discrete variable.
- **...** arguments passed to f. The most common case would be a single vector, but two vectors or none is also possible.
- **alpha=0.05** type I error probability to use in power calculation
- **B=c(1000, 1000)** number of simulation runs for power estimation and estimation of p values.
- **TS** a routine to calculate a test statistic. If missing built-in tests are used.
- **TSextra** a list to be passed to TS.
- **nbins=c(100,10)** Number of bins to use in chi square tests
- **maxProcessor** if >1 parallel programming is used
The arguments of **plot_power** are a matrix of powers created by **twosample_power**, (optionally) Smooth=FALSE if no smoothing is desired, and a name for the variable on the x axis.
### Example 4
We wish to find the powers of the tests when the data set x consists of 100 observations from a binomial with n=10 and p=0.5 and y of 120 observations from a binomial with n=10 and a number of different success probabilities p. Note that in either case not all the possible values 0-10 will actually appear in every data set, so we need to take some care to ensure that x, y and vals match every time:
```{r eval=FALSE}
plot_power(twosample_power(
function(p) {
vals=0:10
x=table(c(vals, rbinom(100, 10, 0.5))) - 1 #Make sure each vector has length 11
y=table(c(vals, rbinom(120, 10, p))) - 1 #and names 0-10
vals=vals[x+y>0] # only vals with at least one count
list(x=x[as.character(vals)],y=y[as.character(vals)], vals=vals)
},
p=seq(0.5, 0.6, length=5)
), "p")
```
### Example 5
We want to assess the effect of the sample size when one data set comes from a standard normal and the other from a normal with mean 1 and standard deviation 1:
```{r eval=FALSE}
plot_power(twosample_power(
f=function(n) list(x=rnorm(n), y=rnorm(n, 1)),
n=10*1:10), "n")
```
### Adjusted p values for Several Tests
As no single test can be relied upon to consistently have good power, it is reasonable to employ several of them. We would then reject the null hypothesis if any of the tests does so, that is, if the smallest p-value is less than the desired type I error probability $\alpha$.
This procedure clearly suffers from the problem of simultaneous inference, and the true type I error probability will be much larger than $\alpha$. It is however possible to adjust the p value so it does achieve the desired $\alpha$. This can be done as follows:
We generate a number of data sets under the null hypothesis. Generally about 1000 will be sufficient. Then for each simulated data set we apply the tests we wish to include and record the smallest p value. Here is an example.
```{r eval=FALSE}
pvals=matrix(0,1000,13)
for(i in 1:1000)
pvals[i, ]=R2sample::twosample_test(runif(100), runif(100), B=1000)$p.values
```
Next we find the smallest p value in each run for two selections of four methods. One is the selection found to be best above, namely Zhang's ZC and ZK methods, the methods by Kuiper and Wasserstein as well as a chi square test with a small number of bins. As a second selection we use the methods by Kolmogorov-Smirnov, Kuiper, Anderson-Darling, Cramer-vonMises and Lehman-Rosenblatt, which for this null data turn out to be highly correlated.
```{r eval=FALSE}
colnames(pvals)=names(R2sample::twosample_test(runif(100), runif(100), B=1000)$p.values)
p1=apply(pvals[, c("ZK", "ZC", "Wassp1", "Kuiper", "ES small" )], 1, min)
p2=apply(pvals[, c("KS", "Kuiper", "AD", "CvM", "LR")], 1, min)
```
Next we find the empirical distribution function for the two sets of p values and draw their graphs. We also add the curves for the cases of four identical tests and the case of four independent tests, which of course is the Bonferroni correction. The data for the cdfs is in the inst/extdata directory of the package.
```{r}
tmp=R2sample::pvaluecdf
Tests=factor(c(rep("Identical Tests", nrow(tmp)),
rep("Correlated Selection", nrow(tmp)),
rep("Best Selection", nrow(tmp)),
rep("Independent Tests", nrow(tmp))),
levels=c("Identical Tests", "Correlated Selection",
"Best Selection", "Independent Tests"),
ordered = TRUE)
dta=data.frame(x=c(tmp[,1],tmp[,1],tmp[,1],tmp[,1]),
y=c(tmp[,1],tmp[,3],tmp[,2],1-(1-tmp[,1])^4),
Tests=Tests)
ggplot2::ggplot(data=dta, ggplot2::aes(x=x,y=y,col=Tests))+
ggplot2::geom_line(linewidth=1.2)+
ggplot2::labs(x="p value", y="CDF")+
ggplot2::scale_color_manual(values=c("blue","red", "Orange", "green"))
```
### Case Studies
The package includes the routine *run.studies*, which provides 20 case studies each for the continuous and the discrete case. This allows the user to easily compare the power of the methods included in the package to a different one of their choice.
Say a user wishes to study the performance of the chi-square test, implemented in *chitest*. Of course this test is already implemented in *R2sample*, so this is for illustration only. Say we wish to see its performance when compared to the tests implemented in *R2sample* for the case where the null hypothesis specifies a uniform distribution but the data actually comes from a linear distribution with slope s:
```{r}
chitest = function(x, y, TSextra) {
  nbins = TSextra$nbins
  nx = length(x); ny = length(y); n = nx+ny
  xy = c(x, y)
  bins = quantile(xy, (0:nbins)/nbins)
  Ox = hist(x, bins, plot=FALSE)$counts
  Oy = hist(y, bins, plot=FALSE)$counts
  tmp = sqrt(sum(Ox)/sum(Oy))
  chi = sum((Ox/tmp-Oy*tmp)^2/(Ox+Oy))
  pval = 1-pchisq(chi, nbins-1)
  out = ifelse(TSextra$statistic, chi, pval)
  names(out) = "ChiSquare"
  out
}
TSextra=list(nbins=5,statistic=FALSE)
```
```{r}
pwr=R2sample::run.studies(chitest, "uniform.linear", TSextra=TSextra, With.p.value = TRUE)
R2sample::plot_power(pwr, "slope")
```
Arguments of *run.studies*:
- TS: new test statistic
- study: either the name(s) or position(s) of the desired studies in the list of all studies
- TSextra: a list passed to TS
- With.p.value=FALSE: set to TRUE if the new method finds p value(s); TS should then return only those
- BasicComparison=TRUE: All 20 case studies are run for one value of the parameter under the alternative.
- nsample=500: sample size
- alpha=0.05: true type I error probability
- param_alt: a vector of values of the parameter of the alternative distribution. If missing the ones included in R2sample are used. If more than one study is run this needs to be a list of vectors
- B=c(1000, 1000) number of simulation runs
Arguments of routine to calculate new test:
- Continuous data:
x,y the two data sets
TSextra (optional): a list passed to the routine with any object needed to do the calculation.
- Discrete data:
x,y the two data sets
vals: the possible values of the discrete random variable
TSextra (optional): a list passed to the routine with any object needed to do the calculation.
The output of the routine has to be a named vector with either the test statistic(s) or the p value(s).
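For example, a plain R version of the standardized-mean statistic used earlier (*myTS2*) has the required form for discrete data; this is only a sketch:
```{r eval=FALSE}
# discrete-data test statistic: x, y are counts, vals the values
myTS2_R = function(x, y, vals) {
  n = sum(x); m = sum(y)
  meanx = sum(vals*x)/n; meany = sum(vals*y)/m
  sdx = sqrt(sum(x*(vals-meanx)^2)/(n-1))
  sdy = sqrt(sum(y*(vals-meany)^2)/(m-1))
  out = abs(meanx/sdx - meany/sdy)
  names(out) = "std t test"   # output must be a named vector
  out
}
```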
The case studies are as follows. In each the first term specifies the model under the null hypothesis and the second the model under the alternative hypothesis.
1. uniform.linear$\hspace{3cm}$ U[0,1] vs a linear model on [0,1] with slope s.
2. uniform.quadratic$\hspace{2.5cm}$ U[0,1] vs a quadratic model with vertex at 0.5 and some curvature a.
3. uniform.bump$\hspace{3.1cm}$ U[0,1] vs U[0,1]+N(0.5,0.05).
4. uniform.sine$\hspace{3.3cm}$ U[0,1] vs U[0,1]+Sine wave
5. beta22.betaaa$\hspace{3cm}$ Beta(2,2) vs Beta(a,a)
6. beta22.beta2a$\hspace{3cm}$ Beta(2,2) vs Beta(2,a)
7. normal.shift$\hspace{3.5cm}$ N(0,1) vs N($\mu$,1)
8. normal.stretch$\hspace{3.1cm}$ N(0,1) vs N(0, $\sigma$)
9. normal.t$\hspace{4.2cm}$ N(0,1) vs t(df)
10. normal.outlier1$\hspace{3.1cm}$ N(0,1) vs N(0,1)+U[2,3]
11. normal.outlier2$\hspace{3.1cm}$ N(0,1) vs N(0,1)+U[-3,-2]+U[2,3]
12. exponential.gamma$\hspace{2.3cm}$ Exp(1) vs Gamma(1,b)
13. exponential.weibull$\hspace{2.5cm}$ Exp(1) vs Weibull(1,b)
14. exponential.bump$\hspace{2.7cm}$ Exp(1) vs Exp(1)+N(0.5,0.05)
15. gamma.normal$\hspace{3.1cm}$ Gamma($\mu$) vs N($\bar{x}$, sd(x)); here the mean and standard deviation of the normal distribution are the sample mean and sample standard deviation of the x data set.
16. normal.normalmixture$\hspace{2.1cm}$ N(0,1) vs N($-\mu$,1)+N($\mu$,1)
17. uniform.uniformmixture$\hspace{1.9cm}$ U[0,1] vs. $\alpha$U[0,1/2]+(1-$\alpha$)U[1/2,1]
18. uniform.betamixture$\hspace{2.5cm}$ U[0,1] vs. $\alpha$U[0,1/2]+(1-$\alpha$)Beta(2,2)
19. chisquare.noncentral$\hspace{2.3cm}$ $\chi^2(5)$ vs. $\chi^2(5, \tau)$
20. uniform.triangular$\hspace{2.9cm}$ U[0,1] vs. triangular
Generally a user will wish to apply their test to all the case studies. This can be done easily with:
```{r eval=FALSE}
R2sample::run.studies(chitest, TSextra=TSextra, With.p.value=TRUE)
```
*run.studies* can also be used to run the studies for the included methods but for different values of param_alt, nsample and/or alpha. For example, say we wish to find the powers of the included continuous methods for the cases uniform.linear and uniform.quadratic, with samples of size 2000 and a true type I error of 0.1. Moreover, for the case uniform.linear we want the slopes to be 0.1 and 0.2 and for the case uniform.quadratic we want param_alt to be 1.3:
```{r eval=FALSE, message=FALSE}
R2sample::run.studies(TRUE, # continuous data/model
study=c("uniform.linear", "uniform.quadratic"),
param_alt=list(c(0.1, 0.2), 1.3),
nsample=2000,
alpha=0.1)
```
# References