| Type: | Package |
| Version: | 0.3.0 |
| Title: | One- And Two-Sample Hausdorff Goodness-of-Fit Test |
| Description: | Computes the test statistic and p-values of the one-sample and two-sample Hausdorff (H) goodness-of-fit tests. The H statistic measures the Hausdorff distance, under the Chebyshev (l-infinity) metric, between the two cumulative distribution functions (cdfs) underlying the corresponding one-sample and two-sample null hypotheses. It coincides with the side length of the largest axis-aligned square (hypercube) that can be inscribed between the two cdfs. The following cases are covered: (i) one-sample univariate; (ii) two-sample univariate; and (iii) two-sample bivariate. Exact one-sample p-values are computed in O(n^2 log n) time via the 'Exact-KS-FFT' method of Dimitrova, Kaishev, and Tan (2020) <doi:10.18637/jss.v095.i10>; two-sample p-values are obtained by permutation. A key advantage of the H test is that its sensitivity can be directed towards the left tail, body, or right tail of the distribution by tuning a scale parameter sigma, thereby maximizing its power, which, as shown numerically, is significantly higher than that of classical tests such as the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests, especially when the right tail of the distribution is targeted. The sensitivity of the test (left tail, body, or right tail) is governed by two parameters psi1 and psi2, whose values need to be supplied; the optimal value of the scale parameter sigma is then computed automatically. |
| Depends: | R (≥ 3.5.0) |
| Imports: | Rcpp (≥ 0.12.12), KSgeneral, stats, utils, withr |
| LinkingTo: | Rcpp, RcppEigen |
| License: | GPL (≥ 3) |
| URL: | https://github.com/fakecloudsjy/HausdorffGoF |
| Encoding: | UTF-8 |
| Copyright: | Dimitrina S. Dimitrova, Yun Jia, and Vladimir K. Kaishev own the copyright of the original R and C++ code in this package. Portions of the C++ source code (src/fastCDF.cpp, src/fastCDF.h, src/fastCDFOnSample.cpp, src/fastCDFOnSample.h, src/nDDominanceAlone.cpp) are derived from the StOpt library and are Copyright (C) 2020 EDF - Electricite de France, published under the GNU Lesser General Public License (LGPL-3). See the file inst/COPYRIGHTS for full details. |
| NeedsCompilation: | yes |
| Packaged: | 2026-05-11 22:04:13 UTC; Yun |
| Author: | Dimitrina S. Dimitrova [aut], Yun Jia [aut, cre], Vladimir K. Kaishev [aut] |
| Maintainer: | Yun Jia <yunjia2019@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-05-15 20:40:14 UTC |
One- and Two-Sample Hausdorff Goodness-of-Fit Test
Description
This package computes the test statistic and p-values of the one-sample and
two-sample Hausdorff (H) goodness-of-fit tests, proposed by Dimitrova,
Jia, and Kaishev (2026a) and Dimitrova, Jia, and Kaishev (2026b), respectively.
Exploiting the scale dependence of H, it is shown that the sensitivity
of the H test can be controlled, making it left-tail, central-body, or
right-tail sensitive and thereby allowing its power to be optimized for the
problem at hand.
Simulation studies further demonstrate that, in terms of statistical power, the
H test substantially outperforms classical goodness-of-fit tests such as
the Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests. The
improvement is particularly pronounced when discrepancies in the right tail of
the distribution are of primary interest, as is often the case in finance and
insurance, economics, the natural sciences, and extreme value theory. These
results make the H test a compelling and effective alternative to
existing testing procedures.
In the one-sample setting, given a random sample
\{X_{1}, \ldots, X_{n}\} of size n with empirical cdf
F_{n}(x) and a prespecified null cdf F(x), the H
statistic is defined as the Hausdorff distance between the planar curves (completed graphs)
F^{c} and F^{c}_{n} of F(x) and F_{n}(x)
(completed by vertical segments at jump discontinuities). Based on Lemma 2.1 of Dimitrova, Jia, and Kaishev (2026a), the latter can be expressed as
H(F^{c}, F^{c}_{n}) = \sup_{A \in F^{c}_{n}} \inf_{B \in F^{c}} \rho_{1}(A,B)= \sup_{A \in F^{c}} \inf_{B \in F^{c}_{n}} \rho_{1}(A,B) ,
where \rho_{1}(\cdot, \cdot) is the Chebyshev
(\ell_{\infty}) metric, i.e., \rho_{1}(A,B) = \max\{|x_{A}-x_{B}|, |z_{A}-z_{B}|\}.
In the two-sample setting, the null hypothesis is H_{0}: F(x) = G(x) for all x \in \mathbb{R}^{k}, against the
alternative H_{1}: F(x) \neq G(x) for at least one x, where F(x) and G(x) are unknown k-dimensional cdfs.
The null hypothesis H_{0} is tested using two independent random samples
\{X_{1}, \ldots, X_{m}\} and \{Y_{1}, \ldots, Y_{n}\} of sizes
m and n from F(x) and G(x),
respectively.
The two-sample H statistic is then defined as the Hausdorff
distance between the completed graphs F^{c}_{m} and G^{c}_{n} of
the empirical cdfs F_{m}(x) and G_{n}(x), again under the
Chebyshev distance \rho_{1}. Similarly to the one-sample case, using Lemma 2.4 of Dimitrova, Jia, and Kaishev (2026b), it can be expressed as
\mathcal{H}_{m,n} = H(F^{c}_{m}, G^{c}_{n}) = \sup_{A \in F^{c}_{m}} \inf_{B \in G^{c}_{n}} \rho_{1}(A,B) = \sup_{A \in G^{c}_{n}} \inf_{B \in F^{c}_{m}} \rho_{1}(A,B).
A geometric interpretation given by Theorem 2.5 of Dimitrova, Jia, and Kaishev
(2026b) is that the H statistic equals the side length of the largest
square (hypercube) that can be inscribed in the region between the two curves (hypersurfaces).
In the univariate case this is equivalent to the Lévy metric between
F and F_{n} or F_{m} and G_{n} (see Remark 6 of Dimitrova, Jia, and Kaishev 2026b).
The package HausdorffGoF implements efficient algorithms to compute the
H statistic in the one-sample (Algorithm 1 of Dimitrova, Jia, and
Kaishev 2026a), two-sample univariate (Section 3.1 of Dimitrova, Jia, and
Kaishev 2026b) and two-sample bivariate (Appendix A of Dimitrova, Jia, and
Kaishev 2026b) cases. Exact one-sample p-values are provided via a
rectangle-probability representation (Theorem 4 of Dimitrova, Jia, and Kaishev
2026a), evaluated in O(n^{2}\log(n)) time using the Exact-KS-FFT method
of Dimitrova, Kaishev, and Tan (2020) via
ks_c_cdf_Rcpp from the package
KSgeneral. Two-sample p-values are obtained by exact or Monte Carlo
permutation (see functions H_test_2s_1d and H_test_2s_2d).
A key feature of the H test is that its p-values are not
scale-invariant. This property is exploited to make the test
left-tail, body, or right-tail sensitive and correspondingly to optimize its power. This is achieved by rescaling the data with a scale parameter \sigma > 0, selected as follows.
In the one-sample case the scaled statistic is
\mathcal{H}_{n}(\sigma) = H(F^{c}_{\sigma}, F^{c}_{n,\sigma}), where
F_{\sigma}(x) = F(x/\sigma) and F_{n,\sigma}(x) = F_{n}(x/\sigma).
An explicit scale-tuning formula
\sigma^{*} = \frac{\psi_{1} - \psi_{2}}{F^{-1}(\psi_{1}) - F^{-1}(\psi_{2})},
depending only on the null distribution F and two probability levels
\psi_{1} > \psi_{2} \in (0,1), allows the power to be focused on the
right tail (e.g. \psi_{1} = 0.99, \psi_{2} = 0.95), the left tail
(e.g. \psi_{1} = 0.05, \psi_{2} = 0.01), or the body
(e.g. \psi_{1} = 0.70, \psi_{2} = 0.40) of the distribution. Proposition
13 of Dimitrova, Jia, and Kaishev (2026a) shows that
\mathcal{H}_{n}(\sigma^{*}) and its p-value are then invariant under any
further affine rescaling of the data.
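For illustration, here is a minimal sketch of this closed-form rule for a
standard normal null, targeting the right tail; this is the same computation
that Hausdorff_test performs internally when scale_psi = c(0.99, 0.95)
is supplied.
## sigma* = (psi1 - psi2) / (F^{-1}(psi1) - F^{-1}(psi2)) for F = N(0,1)
psi1 <- 0.99; psi2 <- 0.95
(psi1 - psi2) / (qnorm(psi1) - qnorm(psi2))  # sigma* approx. 0.0587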
In the two-sample case no null distribution is assumed, so \sigma^{*}
cannot be computed from F directly. Instead, the scale is estimated from
the pooled sample via the permutation-based rule
\sigma^{*} = \mathrm{E}_{R_{m+n}}\!\left[
\max\!\left(
\frac{\psi_{1}-\psi_{2}}{Q_{X^{\dagger}_{m}}(\psi_{1}) - Q_{X^{\dagger}_{m}}(\psi_{2})},\;
\frac{\psi_{1}-\psi_{2}}{Q_{Y^{\dagger}_{n}}(\psi_{1}) - Q_{Y^{\dagger}_{n}}(\psi_{2})}
\right) \;\Bigg|\; Z_{m+n}
\right],
where Q_{X^{\dagger}_{m}}(\psi) and Q_{Y^{\dagger}_{n}}(\psi)
are the \psi-quantiles of permuted sub-samples X^{\dagger}_{m}
and Y^{\dagger}_{n} of the pooled sample
Z_{m+n} = \{x_{1},\ldots,x_{m},y_{1},\ldots,y_{n}\} (Equation (49) of
Dimitrova, Jia, and Kaishev 2026b). Theorem 4.4 of Dimitrova, Jia, and Kaishev
(2026b) shows that \sigma^{*} converges in probability to the one-sample
rule applied to the pooled distribution E = \eta F + (1-\eta)G, and
that the critical value of the scale-tuned permutation test is asymptotically
equivalent to that of the original H test. Recommended choices of
(\psi_{1}, \psi_{2}) are the same as in the one-sample case: right tail
(0.99,\,0.95), body (0.70,\,0.40) or (0.60,\,0.30),
left tail (0.05,\,0.01).
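A minimal sketch of this permutation-based estimate follows; the helper
sigma_star_2s is illustrative (not exported) and mirrors what
Hausdorff_test approximates with scale_nperms random splits.
## Estimate sigma* by averaging the rule over B random splits of the pool
sigma_star_2s <- function(x, y, psi1 = 0.99, psi2 = 0.95, B = 1000) {
  z <- c(x, y); m <- length(x)
  mean(replicate(B, {
    idx <- sample(length(z), m)                      # random permutation split
    qx <- quantile(z[idx],  probs = c(psi1, psi2))   # quantiles of X-dagger
    qy <- quantile(z[-idx], probs = c(psi1, psi2))   # quantiles of Y-dagger
    max((psi1 - psi2) / (qx[1] - qx[2]),
        (psi1 - psi2) / (qy[1] - qy[2]))
  }))
}
set.seed(1)
sigma_star_2s(rexp(60, rate = 2), rexp(60, rate = 3))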
Scale tuning is integrated directly into
Hausdorff_test via the scale_psi
argument, which accepts the pair (\psi_{1}, \psi_{2}) and computes
\sigma^{*} automatically before running the test.
The package provides the following exported functions. For the
one-sample H test:
H_stat_1s_1d computes the test
statistic, H_test_1s_1d computes
the exact p-value, and
H_test_c_cdf is the low-level
rectangle-probability engine used by H_test_1s_1d and useful when
\hat{q} is already available.
For the two-sample H test:
H_test_2s_1d performs the
univariate permutation test,
H_stat_2s_1d_tr computes
the univariate two-sample statistic via the C++ transformation method,
H_stat_2s_1d_p computes
the univariate two-sample statistic via the C++ projection method,
H_test_2s_2d performs the bivariate permutation test, and
H_stat_2s_2d computes the
Hausdorff distance between two bivariate empirical cdfs.
The unified interface
Hausdorff_test and
Hausdorff_stat are S3 generics that dispatch
to the appropriate underlying function based on the type of input, covering
one-sample, two-sample univariate, and two-sample bivariate cases in a single
call. Null distributions are specified via
distribution, which constructs a
"NullDist" object bundling the cdf, quantile function, and density.
Arguments
No arguments. This is a package-level documentation page.
Details
One-sample H test:
The algorithm to compute the H statistic is based on Lemmas 2 and 3 of
Dimitrova, Jia, and Kaishev (2026a). It identifies the locally farthest vertices
\{B_{l}\} of the planar curve F^{c}_{n} from the null curve
F^{c}, then for each B_{l} finds the intersection point
E_{l} of the line \{(x,z) : x + z = x_{B_{l}} + z_{B_{l}}\}
with F^{c}, and returns \max_{l} \rho_{1}(B_{l}, E_{l}).
This is implemented in
H_stat_1s_1d.
The exact p-value P(H(F^{c}, F^{c}_{n}) > q) is computed by
H_test_1s_1d using the
rectangle-probability representation of Theorem 4 of Dimitrova, Jia, and
Kaishev (2026a):
P(H(F^{c}, F^{c}_{n}) > q) = 1 - P(a_{i} \leq U_{(i)} \leq b_{i},\; 1 \leq i \leq n),
where U_{(1)} \leq \cdots \leq U_{(n)} are the order statistics of
n i.i.d. Uniform(0,1) random variables, and
a_{i} = F\!\left(F^{-1}\!\!\left(\frac{i}{n} - q\right) - q\right), \quad b_{i} = F\!\left(F^{-1}\!\!\left(\frac{i-1}{n} + q\right) + q\right).
This rectangle probability is evaluated in O(n^{2}\log(n)) time via
KSgeneral::ks_c_cdf_Rcpp, with the boundary construction handled by
H_test_c_cdf. A Monte Carlo
alternative (bootstrap p-value) is available by passing method = "mc"
to Hausdorff_test.
Two-sample H test:
The two-sample H statistic
\mathcal{H}_{m,n} = H(F^{c}_{m}, G^{c}_{n}) is computed by
H_stat_2s_1d_tr using the
transformation method of Section 3.1 of Dimitrova, Jia, and Kaishev (2026b).
The two staircase curves F^{c}_{m} and G^{c}_{n} are first
modified so that one lies entirely above the other (Lemma 2.13), then the plane
is rotated by \pi/4 so that \mathcal{H}_{m,n} becomes the
supremum of half the vertical difference between the rotated curves (Section
3.1, ibid.), attained over a finite set of transformed vertices. The bivariate
version H_stat_2s_2d implements
the projection method of Appendix A of Dimitrova, Jia, and Kaishev (2026b) via
an internal C++ routine.
Because the distribution of \mathcal{H}_{m,n} depends on the unknown
underlying distributions F and G, p-values are obtained by the
permutation approach of Proposition 3.5 of Dimitrova, Jia, and Kaishev (2026b).
The permutation Hausdorff statistic \mathcal{H}^{\dagger}_{m,n} is
defined by randomly splitting the pooled sample Z_{m+n} into two groups
of sizes m and n and recomputing H on each split; its conditional
distribution given Z_{m+n} is the proportion of all C = \frac{(m+n)!}{m!\,n!} splits on which the recomputed statistic exceeds q:
P(\mathcal{H}^{\dagger}_{m,n} > q \mid Z_{m+n}) = \frac{1}{C} \sum_{i=1}^{C} \mathbf{1}\!\left\{H(F^{c\dagger}_{m}(\pi_{i}), G^{c\dagger}_{n}(\pi_{i})) > q\right\}.
Theorems 3.2–3.4 of Dimitrova, Jia, and Kaishev (2026b) establish that this
permutation p-value is asymptotically equivalent to the true p-value of
\mathcal{H}_{m,n} under the null and under fixed or contiguous
alternatives, and that the permutation test controls the type I error at the
nominal level for any k \geq 1. This is implemented in
H_test_2s_1d, which performs exact
enumeration when C \leq 2.147 \times 10^{9} and Monte Carlo permutation
otherwise.
The asymptotic null distribution of the normalised statistic
\sqrt{mn/(m+n)}\,\mathcal{H}^{\dagger}_{m,n} is given by Theorem 3.6
of Dimitrova, Jia, and Kaishev (2026b) and involves the supremum of a Brownian
bridge weighted by the density of the pooled distribution.
Unified interface:
Hausdorff_test and
Hausdorff_stat are S3 generics that dispatch
on the class of their second argument y: a
distribution object triggers the one-sample path;
a numeric vector triggers the two-sample univariate path; a matrix or list
triggers the two-sample bivariate path. Hausdorff_test also accepts
scale_psi to activate scale tuning in a single call, computing
\sigma^{*} automatically and returning the augmented "htest"
object with $sigma and $scale_psi components.
Value
No return value. This is a package-level documentation page. See the
individual function help pages (e.g. Hausdorff_test,
Hausdorff_stat) for the return values of each exported
function.
Author(s)
Dimitrina S. Dimitrova D.Dimitrova@city.ac.uk, Yun Jia yunjia2019@gmail.com, Vladimir K. Kaishev Vladimir.Kaishev.1@city.ac.uk
Maintainer: Yun Jia yunjia2019@gmail.com
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026a). “On a One Sample Goodness-of-fit Test Based on the Hausdorff Metric”. Submitted.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026b). “On a Two-Sample Multivariate Goodness-of-Fit Test Based on the Hausdorff Metric”. Submitted to the Annals of Statistics.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan (2020). “Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous”. Journal of Statistical Software, 95(10): 1–42. doi:10.18637/jss.v095.i10.
Computes the one-sample Hausdorff test statistic for a given sample and a prespecified null distribution
Description
Computes the Hausdorff distance H(F^{c}, F^{c}_{n}) between the planar
curve F^{c} of a prespecified null cdf F(x) and the planar curve
F^{c}_{n} of the empirical cdf based on a given sample
\{x_{1}, \ldots, x_{n}\}.
Usage
H_stat_1s_1d(x, CDF, pdf = NULL, tol = 1e-10, max.init = 1000)
Arguments
x |
a numeric vector of data sample values |
CDF |
a vectorised R function for the null cdf F(x) |
pdf |
(optional) a vectorised R function for the null density f(x) |
tol |
a numeric value giving the tolerance for the root-finding step.
Defaults to 1e-10 |
max.init |
maximum number of Newton–Raphson iterations.
Used only when a density pdf is supplied. Defaults to 1000 |
Details
Given a random sample \{x_{1}, \ldots, x_{n}\} with empirical cdf
F_{n}(x) and a prespecified null cdf F(x),
H_stat_1s_1d computes the Hausdorff
test statistic
H(F^{c}, F^{c}_{n}) = \max_{l} \rho_{1}(B_{l}, E_{l}),
where \{B_{l}\} are the locally farthest vertices of the planar curve
F^{c}_{n} from the null curve F^{c}, E_{l} is the
intersection of the line \{(x,z): x + z = x_{B_{l}} + z_{B_{l}}\} with
F^{c}, and
\rho_{1}(A,B) = \max\{|x_{A}-x_{B}|, |z_{A}-z_{B}|\}.
This implements Algorithm 1 (Lemmas 2 and 3) of Dimitrova, Jia, and Kaishev
(2026a).
The function first identifies the set \mathcal{B}_{loc} of locally
farthest vertices, then for each B_{l} \in \mathcal{B}_{loc} solves the
equation x + F(x) = x_{B_{l}} + z_{B_{l}} for its root x^{*}_{l} (the
x-coordinate of E_{l}), and finally returns the maximum over l of
|x^{*}_{l} - x_{B_{l}}|. When
pdf is supplied the equation is solved via Newton–Raphson; otherwise
uniroot is used.
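A minimal sketch of the root-finding step for a single vertex, assuming an
Exp(1) null (the vertex coordinates xB and zB below are illustrative):
## Solve x + F(x) = x_B + z_B for the intersection point E_l = (x*, F(x*))
Fnull <- function(t) pexp(t, rate = 1)
xB <- 1.2; zB <- 0.4
xstar <- uniroot(function(x) x + Fnull(x) - (xB + zB),
                 interval = c(-10, 10), tol = 1e-10)$root
## On the line x + z = x_B + z_B the Chebyshev distance reduces to |x* - x_B|
max(abs(xstar - xB), abs(Fnull(xstar) - zB))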
Value
A single numeric value: the observed Hausdorff statistic
\hat{q} = H(F^{c}, F^{c}_{n}).
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026a). “On a One Sample Goodness-of-fit Test Based on the Hausdorff Metric”. Submitted.
See Also
H_test_1s_1d,
H_test_c_cdf,
Hausdorff_test.
Examples
## Compute the H statistic for a sample from Exp(1) against the Exp(1) null
set.seed(1)
x <- rexp(50, rate = 1)
H_stat <- H_stat_1s_1d(x = x,
CDF = function(t) pexp(t, rate = 1),
pdf = function(t) dexp(t, rate = 1))
H_stat
## Compute the H statistic for a sample from N(0,1) against the N(0,1) null
set.seed(2)
y <- rnorm(100)
H_stat2 <- H_stat_1s_1d(x = y,
CDF = pnorm,
pdf = dnorm)
H_stat2
C++ implementation of the two-sample Hausdorff distance via the projection method
Description
A C++ implementation (via Rcpp) of the projection method for computing the
Hausdorff distance H(F^{c}_{m}, G^{c}_{n}) between the planar curves of
the empirical cdfs of two independent univariate samples
\{x_{1}, \ldots, x_{m}\} and \{y_{1}, \ldots, y_{n}\}.
See Appendix E of Dimitrova, Jia, and Kaishev (2026b) for a numerical accuracy
and timing comparison with
H_stat_2s_1d_tr.
Usage
H_stat_2s_1d_p(a, b)
Arguments
a |
a numeric vector of data sample values |
b |
a numeric vector of data sample values |
Details
H_stat_2s_1d_p implements the
projection method of Section 3.1 of Dimitrova, Jia, and Kaishev (2026b),
specialised to k = 1; it proceeds in the five steps described for the
bivariate case in
H_stat_2s_2d. The overall computation is O(m + n) in the number of
distinct observations.
Value
A single numeric value: the Hausdorff distance H(F^{c}_{m}, G^{c}_{n}).
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026b). “On a Two-Sample Multivariate Goodness-of-Fit Test Based on the Hausdorff Metric”. Submitted to the Annals of Statistics.
See Also
H_stat_2s_1d_tr for the
faster transformation-method implementation.
Examples
set.seed(1)
a <- rnorm(50)
b <- rnorm(50)
H_stat_2s_1d_p(a, b)
## Verify agreement with the transformation-method C++ implementation
H_stat_2s_1d_tr(a, b)
C++ implementation of the two-sample Hausdorff distance via the transformation method
Description
A C++ implementation (via Rcpp) of the transformation method for computing
the Hausdorff distance H(F^{c}_{m}, G^{c}_{n}) between the planar curves
of the empirical cdfs of two independent univariate samples
\{x_{1}, \ldots, x_{m}\} and \{y_{1}, \ldots, y_{n}\}. This is
the faster of the two exported C++ routines; see Appendix E of Dimitrova, Jia, and
Kaishev (2026b) for timing comparisons.
Usage
H_stat_2s_1d_tr(a, b)
Arguments
a |
a numeric vector of data sample values |
b |
a numeric vector of data sample values |
Details
H_stat_2s_1d_tr implements
the transformation method of Section 3.1 of Dimitrova, Jia, and Kaishev
(2026b). The plane is rotated by \pi/4 via the linear map
\mathrm{Tr}(x, z) = (x - z,\; x + z) (Equation (35), ibid.), after
which the Hausdorff distance equals half the supremum of the absolute vertical
difference between the two rotated staircase curves (Equation (36), ibid.). The
algorithm proceeds in four steps.
Step 1 (signed-increment encoding). For each sample, the sorted distinct values and their relative jump heights are extracted. Both samples are shifted by subtracting the joint minimum. Each staircase curve is encoded as an interleaved signed-increment vector.
Step 2 (rotated coordinates via cumulative sums).
The \pi/4 rotation is performed implicitly: the z'-coordinate
(x + z) and the -x'-coordinate (z - x) of every
staircase vertex are derived from the encoded vectors via two cumulative sums.
Step 3 (greedy pointer sweep).
A single left-to-right pass through both encoded curves simultaneously identifies,
for each vertex of one curve, the segment of the other curve whose
\lambda-projection interval contains it.
Step 4 (distance formula and return value). At each vertex, the candidate Hausdorff distance H_{\mathrm{temp}}[i] is computed in closed form from the parity of the current pointer position. The Hausdorff distance is
H(F^{c}_{m}, G^{c}_{n}) = \frac{1}{2} \max_{i}\, |H_{\mathrm{temp}}[i]|.
The entire computation is O(m + n) in the number of distinct observations.
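The rotation of Step 2 can be checked directly; a minimal sketch (the
verification below is ours):
## Tr(x, z) = (x - z, x + z) is sqrt(2) times a proper rotation by pi/4
Tr <- function(x, z) c(x - z, x + z)
e1 <- Tr(1, 0); e2 <- Tr(0, 1)
sum(e1 * e2)     # 0: the images of the axis vectors remain orthogonal
sqrt(sum(e1^2))  # sqrt(2): each axis vector is stretched by sqrt(2)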
Value
A single numeric value: the Hausdorff distance H(F^{c}_{m}, G^{c}_{n}).
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026b). “On a Two-Sample Multivariate Goodness-of-Fit Test Based on the Hausdorff Metric”. Submitted to the Annals of Statistics.
See Also
H_stat_2s_1d_p for the C++
projection-method implementation (same result, slower).
Examples
set.seed(1)
a <- rnorm(50)
b <- rnorm(50)
H_stat_2s_1d_tr(a, b)
## Verify agreement with the projection-method C++ implementation
H_stat_2s_1d_p(a, b)
Computes the Hausdorff distance between two bivariate empirical cdfs
Description
Computes the Hausdorff distance between the bivariate empirical cdfs of two
independent two-dimensional data samples
\bm{X} = \{(X_{1,1}, X_{1,2}), \ldots, (X_{m,1}, X_{m,2})\} and
\bm{Y} = \{(Y_{1,1}, Y_{1,2}), \ldots, (Y_{n,1}, Y_{n,2})\}.
Usage
H_stat_2s_2d(x, y, tol = 1e-6)
Arguments
x |
a two-column numeric matrix representing the first bivariate sample
of size m |
y |
a two-column numeric matrix representing the second bivariate sample
of size n |
tol |
a numeric value giving the tolerance for locating omnidirectional jump
vertices in the bivariate empirical cdfs (see Definition 2.11 of
Dimitrova, Jia, and Kaishev 2026b). Defaults to 1e-6 |
Details
H_stat_2s_2d computes the Hausdorff
distance H(F^{c}_{m}, G^{c}_{n}) between the three-dimensional
staircase surfaces corresponding to the bivariate empirical cdfs
F_{m}(\bm{x}) and G_{n}(\bm{x}) of x and y,
under the Chebyshev distance
\rho_{1}(A,B) = \max_{1 \leq i \leq 3}\{|w^{(i)}_{A} - w^{(i)}_{B}|\}.
It implements the projection method of Appendix A of Dimitrova, Jia, and
Kaishev (2026b), which generalises the univariate algorithm of Section 3.1 to
k = 2.
Bivariate empirical cdf computation.
The evaluation of the bivariate empirical cdfs F_{m} and G_{n}
at the required grid points is performed using the fast divide-and-conquer
algorithm of Langrené and Warin (2021). For N data points in
d dimensions this algorithm achieves
O\!\left(N\log(N)^{(d-1) \vee 1}\right) complexity, compared to the
naive O(N^{2}), by recursively splitting the dataset along alternating
coordinate axes and accumulating counts in each sub-problem. In the bivariate
case (d = 2) this gives O(N \log N) complexity.
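For reference, a naive quadratic-time evaluation of a bivariate empirical cdf
is sketched below (the helper ecdf2d is illustrative; the package uses the
fast C++ routine instead):
## Evaluate the ecdf of a two-column sample x at each row of pts
ecdf2d <- function(x, pts) {
  apply(pts, 1, function(p) mean(x[, 1] <= p[1] & x[, 2] <= p[2]))
}
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)
ecdf2d(x, rbind(c(0, 0), c(1, 1)))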
Projection method (Appendix A, Dimitrova, Jia, and Kaishev 2026b). The algorithm proceeds in five steps.
Step 1 (omnidirectional jump vertices of F_{m}).
A point \alpha \in A_{x^{(1)}} \times A_{x^{(2)}} is an
omnidirectional jump of F_{m} if and only if
F_{m}(\alpha) \neq F_{m}(\alpha - \varepsilon_{0} e_{i}) for both
i = 1 and i = 2 (Definition 2.11), where
e_{1} = (1,0), e_{2} = (0,1), and
\varepsilon_{0} = \frac{1}{2}\min_{i; j \neq l} |x^{(i)}_{j} -
x^{(i)}_{l}|. Each omnidirectional jump \alpha_{i} gives rise to
two vertices of the planar curve F^{c}_{m} (Equation (19)):
A_{2i-1} = (\alpha_{i},\, F_{m}(\alpha_{i} - \varepsilon)),
\qquad
A_{2i} = (\alpha_{i},\, F_{m}(\alpha_{i})),
\qquad i = 1, \ldots, \upsilon,
where \varepsilon = (\varepsilon_{0}, \varepsilon_{0}).
Step 2 (truncation boundary and additional vertices).
The region between the bivariate planar curves is unbounded. Lemma 2.3 is
applied to truncate both curves at
M = 1 + \max\{x^{(i)}_{j}, y^{(i)}_{l}\}, ensuring that
H(\hat{F}^{c}_{m}(M), \hat{G}^{c}_{n}(M)) = H(F^{c}_{m}, G^{c}_{n}).
Step 3 (projection of F^{c}_{m} vertices).
Each vertex A_{l} of \hat{F}^{c}_{m}(M) is projected onto the
x^{(1)}Ox^{(2)} plane along the direction
E_{0} = (1, 1, -1) \in \mathbb{R}^{3}.
Step 4 (projection partition for G^{c}_{n}).
The vertices and sides of G^{c}_{n} are projected in the same
direction E_{0}, forming a partition of \mathbb{R}^{2} whose regions
determine which coordinate of A_{l} drives the distance to
G^{c}_{n}.
Step 5 (nearest-point distances via Lemma A.3).
For each locally farthest vertex A_{l} \in \mathcal{V}_{loc}, the
nearest point on G^{c}_{n} is found via the Pareto frontier of
dominated projected vertices, and the closed-form distance formula of Lemma
A.3 is applied. The Hausdorff distance is
H(F^{c}_{m}, G^{c}_{n}) = \max_{A_{l} \in \mathcal{V}_{loc}}
\inf_{B \in G^{c}_{n}} \rho_{1}(A_{l}, B).
The combinatorial search is executed by the internal C++ routine
hsearch_Rcpp (via Rcpp).
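A minimal sketch of the omnidirectional-jump test of Step 1 (illustrative
only; the package performs this test in C++ with tolerance tol):
## eps0 = half the smallest gap between distinct values in either coordinate
eps0 <- function(x) {
  0.5 * min(apply(x, 2, function(v) min(diff(sort(unique(v))))))
}
## alpha is an omnidirectional jump of F_m iff shifting it left by eps0 in
## either coordinate changes the ecdf value (Definition 2.11)
omni_jump <- function(alpha, x) {
  f <- function(p) mean(x[, 1] <= p[1] & x[, 2] <= p[2])
  e <- eps0(x)
  f(alpha) != f(alpha - c(e, 0)) && f(alpha) != f(alpha - c(0, e))
}
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
omni_jump(x[1, ], x)  # TRUE: every observed point is an omnidirectional jump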
Component-wise orders and H^{\circ}_{m,n}.
The function computes the statistic under the standard component-wise order
\preceq_{1} (\bm{x} \preceq_{1} \bm{y} iff
x^{(1)} \leq y^{(1)} and x^{(2)} \leq y^{(2)}). The
order-invariant statistic
H^{\circ}_{m,n} = \max_{1 \leq i \leq 4} H(F^{c}_{m,i}, G^{c}_{n,i})
(Equation (17) of Dimitrova, Jia, and Kaishev 2026b) is available directly
via Hausdorff_stat with
invariant = TRUE, which calls this function over the four sign-flip
orderings and takes the maximum.
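A minimal sketch of that maximisation, realising each ordering by flipping
the signs of the corresponding coordinates (the helper H_invariant is
illustrative):
## Maximise H over the 2^2 = 4 sign-flip orderings of the two coordinates
H_invariant <- function(x, y) {
  max(apply(expand.grid(s1 = c(1, -1), s2 = c(1, -1)), 1, function(s) {
    S <- diag(s)                       # reflect the data coordinate-wise
    H_stat_2s_2d(x %*% S, y %*% S)
  }))
}
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)
y <- matrix(rnorm(100), ncol = 2)
H_invariant(x, y)  # should agree with Hausdorff_stat(x, y, invariant = TRUE)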
The inputs x and y must each be two-column numeric matrices;
otherwise the function stops with an error. To pass list input, use
Hausdorff_stat, which converts lists to matrices before
calling this function.
Value
A single numeric value: the Hausdorff distance H(F^{c}_{m}, G^{c}_{n})
between the bivariate empirical cdfs of x and y.
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026b). “On a Two-Sample Multivariate Goodness-of-Fit Test Based on the Hausdorff Metric”. Submitted to the Annals of Statistics.
Nicolas Langrené, Xavier Warin (2021). “Fast Multivariate Empirical Cumulative Distribution Function with Connection to Kernel Density Estimation”. Computational Statistics & Data Analysis, 162, 107267. doi:10.1016/j.csda.2021.107267.
See Also
Hausdorff_stat,
Hausdorff_test.
Examples
## Hausdorff distance between two bivariate samples from the same distribution
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)
y <- matrix(rnorm(100), ncol = 2)
H_stat_2s_2d(x, y)
## Hausdorff distance between two bivariate samples from different distributions
set.seed(2)
x2 <- matrix(rnorm(100), ncol = 2)
y2 <- matrix(rnorm(100, mean = 0.5), ncol = 2)
H_stat_2s_2d(x2, y2)
## List input: use Hausdorff_stat() which handles the conversion
set.seed(3)
x3 <- list(rnorm(30), rnorm(30))
y3 <- list(rnorm(30), rnorm(30))
Hausdorff_stat(x3, y3)
## Order-invariant statistic via the unified wrapper
Hausdorff_stat(x, y, invariant = TRUE)
One-sample Hausdorff goodness-of-fit test
Description
A thin wrapper that first computes the one-sample Hausdorff test statistic
\hat{q} = H(F^{c}, F^{c}_{n}) via
H_stat_1s_1d, then delegates
the p-value computation to
H_test_c_cdf. The split into
two functions allows the p-value to be computed independently when
\hat{q} is already known.
Usage
H_test_1s_1d(x, CDF, CDFinverse = NULL, pdf = NULL, tol = 1e-10,
max.init = 1000)
Arguments
x |
a numeric vector of data sample values |
CDF |
a vectorised R function for the null cdf F(x) |
CDFinverse |
(optional) a vectorised R function for the null quantile function
F^{-1}(p) |
pdf |
(optional) a vectorised R function for the null density f(x) |
tol |
tolerance for root-finding, passed to both
H_stat_1s_1d and H_test_c_cdf. Defaults to 1e-10 |
max.init |
maximum number of Newton–Raphson iterations, passed to
H_stat_1s_1d and H_test_c_cdf. Defaults to 1000 |
Value
An object of class "htest" with the following components:
statistic: the observed Hausdorff statistic
\hat{q} = H(F^{c}, F^{c}_{n}), named "H".
p.value: the exact p-value P(H(F^{c}, F^{c}_{n}) > \hat{q}).
method: the character string "One-sample Hausdorff test".
alternative: the character string "two-sided".
data.name: a character string giving the name of the data object.
See Also
H_stat_1s_1d,
H_test_c_cdf,
Hausdorff_test.
Examples
set.seed(1)
x <- rexp(50, rate = 1)
H_test_1s_1d(x,
CDF = function(t) pexp(t, rate = 1),
CDFinverse = function(p) qexp(p, rate = 1),
pdf = function(t) dexp(t, rate = 1))
Two-sample Hausdorff goodness-of-fit test with exact or Monte Carlo permutation p-values
Description
Tests the null hypothesis H_{0}: F(x) = G(x) for all x against
the two-sided alternative H_{1}: F(x) \neq G(x) for at least one
x, using the Hausdorff distance between the empirical cdfs of two
independent univariate samples. The p-value is obtained either by exact
enumeration of all \frac{(m+n)!}{m!n!} permutations of the pooled
sample or by Monte Carlo permutation. The test statistic is computed by
H_stat_2s_1d_tr.
Usage
H_test_2s_1d(x1, x2, nboots = 2000, Exact = FALSE)
Arguments
x1 |
a numeric vector of data sample values |
x2 |
a numeric vector of data sample values |
nboots |
a positive integer giving the number of Monte Carlo permutation
replications, used only when Exact = FALSE. Defaults to 2000 |
Exact |
a logical value indicating whether an exact permutation p-value should be
computed by enumerating all \frac{(m+n)!}{m!\,n!} splits of the pooled
sample. Defaults to FALSE |
Details
Given two independent random samples
\{x_{1}, \ldots, x_{m}\} and \{y_{1}, \ldots, y_{n}\} from
unknown univariate cdfs F and G respectively, the two-sample
Hausdorff (H) test statistic is
\hat{q} = \mathcal{H}_{m,n} = H(F^{c}_{m}, G^{c}_{n}),
computed by
H_stat_2s_1d_tr. By
Theorem 2.5 of Dimitrova, Jia, and Kaishev (2026b), this equals the side length
of the largest axis-aligned square that can be inscribed in the region between
the two empirical cdf staircase curves F^{c}_{m} and G^{c}_{n}.
Permutation p-value. Because the distribution of \mathcal{H}_{m,n}
depends on the unknown F and G, p-values are computed via a
permutation argument (Proposition 3.5 of Dimitrova, Jia, and Kaishev 2026b). Let
Z_{m+n} = \{x_{1},\ldots,x_{m},y_{1},\ldots,y_{n}\} be the pooled
sample. The permutation Hausdorff statistic
\mathcal{H}^{\dagger}_{m,n} is defined on all
C = \frac{(m+n)!}{m!n!} splits of Z_{m+n} into groups of
sizes m and n, and its conditional distribution given
Z_{m+n} is:
P(\mathcal{H}^{\dagger}_{m,n} > q \mid Z_{m+n})
= \frac{1}{C} \sum_{i=1}^{C}
\mathbf{1}\!\left\{H(F^{c\dagger}_{m}(\pi_{i}), G^{c\dagger}_{n}(\pi_{i})) > q\right\},
where F^{c\dagger}_{m}(\pi_{i}) and G^{c\dagger}_{n}(\pi_{i})
are the empirical cdf planar curves of the i-th permuted sub-samples.
When Exact = TRUE all C splits are enumerated. When
Exact = FALSE, the p-value is estimated by Monte Carlo: in each of
nboots replications the pooled sample is randomly split into groups of
sizes m and n and H is recomputed; the p-value is the
proportion of replications for which H \geq \hat{q}. If the estimated
p-value is zero it is replaced by 1/(2 \times nboots).
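A minimal sketch of this Monte Carlo step (the helper perm_pvalue is
illustrative; H_test_2s_1d implements it internally):
perm_pvalue <- function(x1, x2, nboots = 2000) {
  q_hat <- H_stat_2s_1d_tr(x1, x2)      # observed statistic
  z <- c(x1, x2); m <- length(x1)
  stats <- replicate(nboots, {
    idx <- sample(length(z), m)         # random split of the pooled sample
    H_stat_2s_1d_tr(z[idx], z[-idx])
  })
  p <- mean(stats >= q_hat)
  if (p == 0) p <- 1 / (2 * nboots)     # zero-estimate correction
  p
}
set.seed(1)
perm_pvalue(rnorm(30), rnorm(30), nboots = 500)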
Asymptotic equivalence (Theorems 3.2–3.4). Under the null hypothesis,
or under fixed or contiguous alternatives (with bounded densities and
m/(m+n) \to \eta \in (0,1)), the permutation p-value is asymptotically
equivalent to the true p-value of \mathcal{H}_{m,n}. The permutation
test controls the type I error at the nominal level for all k \geq 1,
and the powers of the two tests are asymptotically equal.
Scale tuning. The power of \mathcal{H}_{m,n} is not
scale-invariant. The optimal scale \sigma^{*} can be estimated and
applied automatically by passing scale_psi to the unified wrapper
Hausdorff_test, which calls this function
internally on the already-scaled samples.
Value
An object of class "htest" with the following components:
statistic: the observed Hausdorff statistic
\hat{q} = \mathcal{H}_{m,n} = H(F^{c}_{m}, G^{c}_{n}), named "H".
p.value: the exact or Monte Carlo permutation p-value, i.e. the proportion of
permutation statistics H(F^{c\dagger}_{m}(\pi_{i}), G^{c\dagger}_{n}(\pi_{i}))
that are greater than or equal to \hat{q}. If the Monte Carlo estimate is zero
it is replaced by 1/(2 \times nboots).
method: a character string indicating the method:
"Two-sample Hausdorff Test (Exact)" when all permutations are enumerated, or
"Two-sample Hausdorff Test (Monte Carlo)" otherwise.
alternative: the character string "two-sided".
data.name: a character string giving the names of the two data objects.
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026a). “On a One Sample Goodness-of-fit Test Based on the Hausdorff Metric”. Submitted.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026b). “On a Two-Sample Multivariate Goodness-of-Fit Test Based on the Hausdorff Metric”. Submitted to the Annals of Statistics.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan (2020). “Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous”. Journal of Statistical Software, 95(10): 1–42. doi:10.18637/jss.v095.i10.
See Also
H_stat_2s_1d_tr,
Hausdorff_test.
Examples
## Two-sample H test: both samples from N(0,1) (null is true)
set.seed(1)
x1 <- rnorm(30)
x2 <- rnorm(30)
H_test_2s_1d(x1, x2, nboots = 1000)
## Two-sample H test: samples from N(0,1) and N(0.5,1) (null is false)
set.seed(2)
x3 <- rnorm(30)
x4 <- rnorm(30, mean = 0.5)
H_test_2s_1d(x3, x4, nboots = 1000)
## Exact permutation test for small samples
set.seed(3)
a <- rnorm(8)
b <- rnorm(8)
H_test_2s_1d(a, b, Exact = TRUE)
## Scale-tuned test via the unified wrapper (recommended)
set.seed(4)
x_exp <- rexp(60, rate = 2)
y_exp <- rexp(60, rate = 3)
Hausdorff_test(x_exp, y_exp, scale_psi = c(0.99, 0.95), scale_nperms = 500)
Two-sample bivariate Hausdorff goodness-of-fit test with Monte Carlo permutation p-values
Description
Tests the null hypothesis H_{0}: F(\bm{x}) = G(\bm{x}) for all
\bm{x} \in \mathbb{R}^{2} against the two-sided alternative
H_{1}: F(\bm{x}) \neq G(\bm{x}) for at least one \bm{x},
using the Hausdorff distance between the bivariate empirical cdfs of two
independent samples. The p-value is obtained by Monte Carlo permutation.
The test statistic is computed by H_stat_2s_2d.
Usage
H_test_2s_2d(x, y, nboots = 2000, Exact = FALSE, invariant = FALSE, tol = 1e-6)
Arguments
x |
a two-column numeric matrix representing the first bivariate sample
of size m |
y |
a two-column numeric matrix representing the second bivariate sample
of size n |
nboots |
a positive integer giving the number of Monte Carlo permutation
replications, used only when Exact = FALSE. Defaults to 2000 |
Exact |
a logical value indicating whether an exact permutation p-value
should be computed by enumerating all \frac{(m+n)!}{m!\,n!} splits of the
pooled sample. Defaults to FALSE |
invariant |
logical. When TRUE, the order-invariant statistic H^{\circ}_{m,n}
is used. Defaults to FALSE |
tol |
a numeric value giving the tolerance for locating omnidirectional
jump vertices in the bivariate empirical cdfs, passed to
H_stat_2s_2d. Defaults to 1e-6 |
Details
Given two independent random samples
\bm{X} = \{(X_{1,1}, X_{1,2}), \ldots, (X_{m,1}, X_{m,2})\} and
\bm{Y} = \{(Y_{1,1}, Y_{1,2}), \ldots, (Y_{n,1}, Y_{n,2})\} from
unknown bivariate cdfs F and G respectively, the two-sample
Hausdorff (H) test statistic is
\hat{q} = \mathcal{H}_{m,n} = H(F^{c}_{m}, G^{c}_{n}),
computed by H_stat_2s_2d. By Theorem 2.5 of Dimitrova, Jia,
and Kaishev (2026b), this equals the side length of the largest axis-aligned
hypercube that can be inscribed in the region between the two bivariate
empirical cdf surfaces F^{c}_{m} and G^{c}_{n}.
Permutation p-value. Because the null distribution of
\mathcal{H}_{m,n} depends on the unknown F and G,
p-values are obtained by permutation (Proposition 3.5 of Dimitrova, Jia, and
Kaishev 2026b). When Exact = TRUE all \frac{(m+n)!}{m!\,n!}
splits of the pooled sample are enumerated; if the count exceeds
2.147 \times 10^{9} the function falls back to Monte Carlo
automatically. When Exact = FALSE the pooled sample is randomly split
for nboots replications and the p-value is the proportion of
permutation statistics \geq \hat{q}, floored at
1/(2 \times nboots). The asymptotic equivalence between the
permutation p-value and the true p-value of \mathcal{H}_{m,n}, and
the control of type I error, follow from Theorems 3.2–3.4 of Dimitrova, Jia,
and Kaishev (2026b); see H_test_2s_1d for the full statement.
Order-invariant statistic. When invariant = TRUE the
order-invariant statistic
H^{\circ}_{m,n} = \max_{1 \leq i \leq 4} H(F^{c}_{m,i}, G^{c}_{n,i})
(Equation (17) of Dimitrova, Jia, and Kaishev 2026b) is used for both the
observed value and every permutation replicate. It maximises over the
2^{2} = 4 sign-flip orderings of the two coordinates, making the
test invariant to the labelling of coordinate axes.
Scale tuning. For scale-tuned testing pass scale_psi to
Hausdorff_test, which estimates the column-wise optimal scale
vector (\sigma^{*}_{1}, \sigma^{*}_{2}) and calls this function
internally on the already-scaled samples.
Value
An object of class "htest" with the following components:
statistic: the observed Hausdorff statistic
\hat{q} = \mathcal{H}_{m,n} = H(F^{c}_{m}, G^{c}_{n}) (or H^{\circ}_{m,n}
when invariant = TRUE), named "H".
p.value: the Monte Carlo permutation p-value, i.e. the proportion of
permutation statistics greater than or equal to \hat{q}. If zero, replaced by
1/(2 \times nboots).
method: a character string indicating the procedure:
"Two-sample bivariate Hausdorff test (Exact)" or
"Two-sample bivariate Hausdorff test (Monte Carlo permutation)", with the
suffix ", order-invariant" appended when invariant = TRUE.
alternative: the character string "two-sided".
data.name: a character string giving the names of the two data objects.
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026a). “On a One Sample Goodness-of-fit Test Based on the Hausdorff Metric”. Submitted.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026b). “On a Two-Sample Multivariate Goodness-of-Fit Test Based on the Hausdorff Metric”. Submitted to the Annals of Statistics.
See Also
H_stat_2s_2d,
Hausdorff_test,
H_test_2s_1d.
Examples
## Bivariate H test: both samples from N(0,I) (null is true)
set.seed(1)
xm <- matrix(rnorm(100), ncol = 2)
ym <- matrix(rnorm(100), ncol = 2)
H_test_2s_2d(xm, ym, nboots = 1000)
## Bivariate H test: samples from N(0,I) and N(0.5,I) (null is false)
set.seed(2)
xm2 <- matrix(rnorm(100), ncol = 2)
ym2 <- matrix(rnorm(100, mean = 0.5), ncol = 2)
H_test_2s_2d(xm2, ym2, nboots = 1000)
## Exact permutation test for small samples
set.seed(3)
H_test_2s_2d(matrix(rnorm(16), ncol = 2),
matrix(rnorm(16), ncol = 2), Exact = TRUE)
## Order-invariant statistic
set.seed(4)
H_test_2s_2d(xm, ym, nboots = 1000, invariant = TRUE)
## Scale-tuned test via the unified wrapper (recommended)
set.seed(4)
Hausdorff_test(xm, ym, nboots = 1000,
scale_psi = c(0.99, 0.95), scale_nperms = 500)
Exact complementary cdf of the one-sample Hausdorff statistic
Description
Computes the exact complementary cdf
P\!\left(H(F^{c},F_n^{c})>q\right)
of the one-sample Hausdorff goodness-of-fit statistic for a sample of size
n. For a continuous null cdf F, the rectangle-probability
representation of Theorem 4 of Dimitrova, Jia, and Kaishev (2026a) shows that
P\!\left(H(F^{c},F_n^{c})>q\right)
=
1-P(a_i\le U_{(i)}\le b_i,\;1\le i\le n),
where U_{(i)} are the order statistics of an i.i.d. U(0,1)
sample and
a_i = F\!\left(F^{-1}\!\left(\frac{i}{n}-q\right)-q\right),
\qquad
b_i = F\!\left(F^{-1}\!\left(\frac{i-1}{n}+q\right)+q\right).
The function evaluates this rectangle-probability representation numerically
via KSgeneral::ks_c_cdf_Rcpp() after constructing the boundary
vectors required by that routine.
This is the low-level p-value engine used by
H_test_1s_1d and is useful when
the observed Hausdorff statistic q has already been computed.
Usage
H_test_c_cdf(q, n, CDF, CDFinverse = NULL, pdf = NULL, tol = 1e-10,
max.init = 1000)
Arguments
q |
a single numeric value giving the observed Hausdorff statistic. |
n |
a positive integer, the sample size. |
CDF |
a vectorised R function implementing the null cdf F(x). |
CDFinverse |
(optional) a vectorised R function implementing the null quantile
function F^{-1}(p), where p \in [0,1]. |
pdf |
(optional) a vectorised R function for the null density f(x). |
tol |
numerical tolerance used in the root-finding step. Defaults to
1e-10. |
max.init |
maximum number of Newton–Raphson iterations allowed at each boundary
point. Defaults to 1000. |
Details
For q \in (0,1), the function constructs two boundary vectors
f_a and f_b of length n. If CDFinverse is
available, these are computed directly from the closed-form expressions for
a_i and b_i. Otherwise, the code solves
F(x) = i/n - q and F(x) = (i-1)/n + q numerically and maps
the resulting roots through F(x - q) and F(x + q),
respectively. When i/n - q < 0 the entry a_i = 0 is assigned
directly; when (i-1)/n + q > 1 the entry b_i = 1 is assigned
directly, avoiding unnecessary root-finding.
The two boundary vectors are written to a temporary file
‘Boundary_Crossing_Time.txt’, then passed to
KSgeneral::ks_c_cdf_Rcpp(n), which evaluates the required
rectangle probability in O(n^{2} \log n) time. The file is deleted
after use.
The trivial boundary cases q >= 1 (returns 0) and q <= 0
(returns 1) are handled analytically before any computation.
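A minimal sketch of the boundary construction when a closed-form quantile
function is available (the helper boundaries is illustrative;
H_test_c_cdf builds these vectors internally):
## a_i = F(F^{-1}(i/n - q) - q),     with a_i = 0 when i/n - q < 0
## b_i = F(F^{-1}((i-1)/n + q) + q), with b_i = 1 when (i-1)/n + q > 1
boundaries <- function(q, n, CDF, CDFinverse) {
  i <- seq_len(n)
  lo <- i / n - q
  hi <- (i - 1) / n + q
  a <- numeric(n)                                # zeros where lo < 0
  a[lo >= 0] <- CDF(CDFinverse(lo[lo >= 0]) - q)
  b <- rep(1, n)                                 # ones where hi > 1
  b[hi <= 1] <- CDF(CDFinverse(hi[hi <= 1]) + q)
  list(a = a, b = b)
}
boundaries(q = 0.1, n = 5,
           CDF = function(t) pexp(t, rate = 1),
           CDFinverse = function(p) qexp(p, rate = 1))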
Value
A single numeric value equal to the exact complementary cdf
P(H(F^{c},F_n^{c})>q).
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026a). “On a One Sample Goodness-of-fit Test Based on the Hausdorff Metric”. Submitted.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan (2020). “Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous”. Journal of Statistical Software, 95(10), 1–42. doi:10.18637/jss.v095.i10.
See Also
H_test_1s_1d,
H_stat_1s_1d,
Hausdorff_test.
Examples
set.seed(1)
x <- rexp(50, rate = 1)
q <- H_stat_1s_1d(x = x,
CDF = function(t) pexp(t, rate = 1),
pdf = function(t) dexp(t, rate = 1))
H_test_c_cdf(q = q,
n = length(x),
CDF = function(t) pexp(t, rate = 1),
CDFinverse = function(p) qexp(p, rate = 1),
pdf = function(t) dexp(t, rate = 1))
Unified one-sample and two-sample Hausdorff test statistic
Description
An S3 generic that computes the Hausdorff (H) test statistic
without a p-value, dispatching on the class of y in the same way as
Hausdorff_test:
y numeric: two-sample statistic
\mathcal{H}_{m,n} = H(F^{c}_{m}, G^{c}_{n}) via H_stat_2s_1d_tr.
y NullDist: one-sample statistic H(F^{c}, F^{c}_{n}) via H_stat_1s_1d.
y function: one-sample shorthand; the bare cdf is passed directly to
H_stat_1s_1d with pdf = NULL.
y matrix or list: two-sample bivariate statistic
H(F^{c}_{m}, G^{c}_{n}) via H_stat_2s_2d. Both x and y must be two-column
matrices (or lists of two numeric vectors) of sizes m and n. When
invariant = TRUE the order-invariant statistic
H^{\circ}_{m,n} = \max_{1 \leq i \leq 4} H(F^{c}_{m,i}, G^{c}_{n,i}) is
returned by maximising over the 2^{2} = 4 sign-flip orderings of the two
coordinates (Equation (17) of Dimitrova, Jia, and Kaishev 2026b). List input
is converted to a two-column matrix and the same path is followed.
Usage
Hausdorff_stat(x, y, tol = 1e-10, ...)
## S3 method for class 'numeric'
Hausdorff_stat(x, y, tol = 1e-10, ...)
## S3 method for class 'NullDist'
Hausdorff_stat(x, y, tol = 1e-10, ...)
## S3 method for class 'function'
Hausdorff_stat(x, y, tol = 1e-10, ...)
## S3 method for class 'matrix'
Hausdorff_stat(x, y, tol = 1e-6, invariant = FALSE, ...)
## S3 method for class 'list'
Hausdorff_stat(x, y, tol = 1e-6, invariant = FALSE, ...)
Arguments
x |
a numeric vector (univariate case) or a two-column numeric matrix or list of two numeric vectors (bivariate case). |
y |
one of: (i) a numeric vector (two-sample univariate), (ii) a
"NullDist" object or a bare cdf function (one-sample), or (iii) a
two-column numeric matrix or list of two numeric vectors (two-sample
bivariate). |
tol |
a numeric value giving the tolerance for the root-finding step (one-sample
path, default 1e-10) or for locating omnidirectional jump vertices
(bivariate path, default 1e-6). |
invariant |
logical; bivariate path only. When TRUE, the order-invariant
statistic H^{\circ}_{m,n} is returned. Defaults to FALSE. |
... |
further arguments (currently unused). |
Value
A single numeric value: the observed Hausdorff statistic \hat{q}.
See Also
distribution,
Hausdorff_test,
H_stat_1s_1d,
H_stat_2s_1d_tr,
H_stat_2s_2d.
Examples
## --- Univariate two-sample statistic ----------------------------------------
set.seed(1)
Hausdorff_stat(rnorm(40), rnorm(40))
## --- One-sample statistic: full distribution object -------------------------
set.seed(2)
x <- rexp(50, rate = 1)
null_exp <- distribution(CDF = function(t) pexp(t, rate = 1),
pdf = function(t) dexp(t, rate = 1))
Hausdorff_stat(x, null_exp)
## --- One-sample statistic: bare CDF shorthand -------------------------------
set.seed(3)
Hausdorff_stat(rnorm(60), pnorm)
## --- Bivariate two-sample statistic (standard ordering) ---------------------
set.seed(4)
x <- matrix(rnorm(100), ncol = 2)
y <- matrix(rnorm(100), ncol = 2)
Hausdorff_stat(x, y)
## --- Bivariate two-sample statistic (order-invariant) -----------------------
Hausdorff_stat(x, y, invariant = TRUE)
## --- Bivariate two-sample statistic: list input -----------------------------
set.seed(5)
x3 <- list(rnorm(30), rnorm(30))
y3 <- list(rnorm(30), rnorm(30))
Hausdorff_stat(x3, y3)
Unified one-sample and two-sample Hausdorff goodness-of-fit test
Description
An S3 generic that performs the Hausdorff (H) goodness-of-fit test
and returns an object of class "htest", dispatching on the class of
y.
Dispatch paths.
y NullDist: one-sample test. H_test_1s_1d is called when method is
"default" or "exact", giving the exact rectangle-probability p-value. When
method = "mc", a Monte Carlo bootstrap p-value is computed instead: nboots
samples of size n are drawn from F via the quantile transform and the
proportion of bootstrap H statistics exceeding the observed value is returned.
y function: one-sample shorthand. The bare cdf is promoted to a
NullDist with CDFinverse = NULL and pdf = NULL, then dispatched as above.
Activating scale_psi with a bare function is not supported because
\sigma^{*} requires a quantile function; an informative error directs the
user to distribution.
y numeric: two-sample univariate test via H_test_2s_1d. "default" and
"mc" use Monte Carlo permutation; "exact" enumerates all
\frac{(m+n)!}{m!\,n!} permutations.
y matrix or list: two-sample bivariate test via H_test_2s_2d. Monte Carlo
permutation is used by default; method = "exact" enumerates all
\frac{(m+n)!}{m!\,n!} splits (automatically falling back to Monte Carlo if
the count exceeds 2.147 \times 10^{9}). List input is converted to a
two-column matrix before proceeding.
Resolution of method by path.
"default"One-sample: exact rectangle-probability p-value. Two-sample (univariate and bivariate): Monte Carlo permutation.
"exact"One-sample: exact rectangle-probability p-value (same as
"default"). Two-sample univariate: full enumeration of all permutations. Two-sample bivariate: full enumeration of all permutations (falls back to Monte Carlo automatically if sample sizes are too large)."mc"One-sample: Monte Carlo bootstrap p-value. Bootstrap samples are drawn from
FviaCDFinverseif supplied in theNullDistobject, otherwise byunirootinversion ofCDF. Two-sample (univariate and bivariate): Monte Carlo permutation.
Scale tuning (scale_psi).
When scale_psi is supplied, \sigma^{*} is computed before
the test, the data are rescaled, and the test is run on the scaled inputs.
One-sample (y is NullDist): \sigma^{*} uses
the closed-form formula of Proposition 13 of Dimitrova, Jia, and Kaishev
(2026a),
\sigma^{*} = \frac{\psi_{1} - \psi_{2}}{F^{-1}(\psi_{1}) - F^{-1}(\psi_{2})}.
The quantile F^{-1} is evaluated by priority: CDFinverse
(if supplied in the NullDist object) > Newton–Raphson (if
pdf is supplied) > uniroot applied to CDF.
The Newton–Raphson iteration is bounded by max.init steps. The
sample is replaced by \sigma^{*} x and the null distribution by
F_{\sigma^{*}}(t) = F(t/\sigma^{*}). Only functions that exist in
the original NullDist object are scaled; absent ones remain
NULL.
Two-sample univariate (y is numeric): \sigma^{*} is
estimated from the pooled sample by averaging over scale_nperms
random splits (Equation (49) of Dimitrova, Jia, and Kaishev 2026b),
\sigma^{*} \approx \frac{1}{B}\sum_{b=1}^{B}
\max\!\left(
\frac{\psi_{1}-\psi_{2}}{Q_{X^{\dagger}_{m}}^{(b)}(\psi_{1}) -
Q_{X^{\dagger}_{m}}^{(b)}(\psi_{2})},\;
\frac{\psi_{1}-\psi_{2}}{Q_{Y^{\dagger}_{n}}^{(b)}(\psi_{1}) -
Q_{Y^{\dagger}_{n}}^{(b)}(\psi_{2})}
\right).
Both samples are scaled by the scalar \sigma^{*}.
Two-sample bivariate (y is matrix or list): the univariate
formula is applied column-wise, yielding a length-2 vector
(\sigma^{*}_{1}, \sigma^{*}_{2}). Column j of each matrix
is scaled by \sigma^{*}_{j}.
In all cases, scale_psi and \sigma^{*} are attached to the
returned "htest" object as $scale_psi and $sigma,
and the string "(scale-tuned)" is appended to $method.
Usage
Hausdorff_test(x, y, method = "default", nboots = 2000, tol = 1e-10,
scale_psi = NULL, scale_nperms = 1000, max.init = 1000, ...)
## S3 method for class 'NullDist'
Hausdorff_test(x, y, method = "default", nboots = 2000,
tol = 1e-10, scale_psi = NULL,
scale_nperms = 1000, max.init = 1000, ...)
## S3 method for class 'function'
Hausdorff_test(x, y, method = "default", nboots = 2000,
tol = 1e-10, scale_psi = NULL,
scale_nperms = 1000, max.init = 1000, ...)
## S3 method for class 'numeric'
Hausdorff_test(x, y, method = "default", nboots = 2000,
tol = 1e-10, scale_psi = NULL,
scale_nperms = 1000, ...)
## S3 method for class 'matrix'
Hausdorff_test(x, y, method = "default", nboots = 2000,
tol = 1e-6, scale_psi = NULL,
scale_nperms = 1000, max.init = 1000,
invariant = FALSE, ...)
## S3 method for class 'list'
Hausdorff_test(x, y, method = "default", nboots = 2000,
tol = 1e-6, scale_psi = NULL,
scale_nperms = 1000, max.init = 1000,
invariant = FALSE, ...)
Arguments
x |
a numeric vector (one-sample or two-sample univariate), or a two-column numeric matrix or list of two numeric vectors (bivariate). |
y |
one of: (i) a |
method |
|
nboots |
a positive integer: number of Monte Carlo replications for the test
p-value (one-sample bootstrap or two-sample permutation).
Defaults to |
tol |
numeric tolerance for root-finding in the one-sample statistic and p-value
computation (default |
scale_psi |
|
scale_nperms |
a positive integer: number of Monte Carlo splits used to estimate
|
max.init |
a positive integer: maximum number of Newton–Raphson iterations when
inverting |
invariant |
logical; bivariate path only. When |
... |
further arguments (currently unused). |
Value
An object of class "htest" with components:
statistic: the observed Hausdorff statistic \hat{q} (computed on the scaled
data when scale_psi is supplied), named "H".
p.value: the p-value computed according to the active path and method.
method: a character string identifying the procedure. The suffix
"(scale-tuned)" is appended when scale_psi is supplied.
alternative: the character string "two-sided".
data.name: a character string giving the names of the data objects.
sigma: (only when scale_psi is supplied) the computed \sigma^{*}: a scalar
for the one-sample and two-sample univariate paths; a length-2 vector for the
bivariate path.
scale_psi: (only when scale_psi is supplied) the validated, sorted
c(psi_low, psi_high) vector used to compute \sigma^{*}.
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026a). “On a One Sample Goodness-of-fit Test Based on the Hausdorff Metric”. Submitted.
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026b). “On a Two-Sample Multivariate Goodness-of-Fit Test Based on the Hausdorff Metric”. Submitted to the Annals of Statistics.
Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan (2020). “Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous”. Journal of Statistical Software, 95(10): 1–42. doi:10.18637/jss.v095.i10.
See Also
distribution,
Hausdorff_stat,
H_test_1s_1d,
H_test_c_cdf,
H_test_2s_1d,
H_stat_2s_2d.
Examples
## ---- One-sample, no scaling ------------------------------------------------
set.seed(1)
x <- rexp(50, rate = 1)
null_e <- distribution(CDF = function(t) pexp(t, rate = 1),
CDFinverse = function(p) qexp(p, rate = 1),
pdf = function(t) dexp(t, rate = 1))
Hausdorff_test(x, null_e)
## ---- One-sample, scale-tuned: right tail -----------------------------------
res <- Hausdorff_test(x, null_e, scale_psi = c(0.99, 0.95))
res$method # "... (scale-tuned)"
res$sigma
res$scale_psi
## ---- One-sample, Monte Carlo p-value, scale-tuned: body --------------------
Hausdorff_test(x, null_e, method = "mc", nboots = 1000,
scale_psi = c(0.70, 0.40))
## ---- One-sample, only CDF and pdf supplied (Newton-Raphson sigma*) ---------
null_cdf_only <- distribution(CDF = function(t) pexp(t, rate = 1),
pdf = function(t) dexp(t, rate = 1))
Hausdorff_test(x, null_cdf_only, scale_psi = c(0.99, 0.95))
## ---- Two-sample univariate, no scaling -------------------------------------
set.seed(2)
x1 <- rnorm(40); x2 <- rnorm(40)
Hausdorff_test(x1, x2)
## ---- Two-sample univariate, scale-tuned ------------------------------------
res2 <- Hausdorff_test(x1, x2, scale_psi = c(0.99, 0.95), scale_nperms = 500)
res2$sigma
## ---- Two-sample univariate, exact permutation, small samples ---------------
set.seed(3)
Hausdorff_test(rnorm(8), rnorm(8), method = "exact")
## ---- Two-sample bivariate, no scaling --------------------------------------
set.seed(4)
xm <- matrix(rnorm(100), ncol = 2)
ym <- matrix(rnorm(100, mean = 0.5), ncol = 2)
Hausdorff_test(xm, ym, nboots = 1000)
## ---- Two-sample bivariate, scale-tuned (column-wise sigma*) ----------------
res3 <- Hausdorff_test(xm, ym, nboots = 1000,
scale_psi = c(0.70, 0.40), scale_nperms = 500)
res3$sigma # length-2 vector, one sigma* per coordinate
## ---- Two-sample bivariate, order-invariant statistic -----------------------
Hausdorff_test(xm, ym, nboots = 1000, invariant = TRUE)
Specify a null distribution for the one-sample Hausdorff test
Description
Creates an object of class "NullDist" that bundles the null
cumulative distribution function F(x) together with its optional
quantile function F^{-1}(p) and density f(x).
The resulting object is passed as the second argument y to
Hausdorff_test or
Hausdorff_stat to trigger the one-sample
dispatch path.
Usage
distribution(CDF, CDFinverse = NULL, pdf = NULL)
Arguments
CDF |
an R function for the null cdf F(x). |
CDFinverse |
(optional) an R function for the null quantile function
F^{-1}(p). Only in-range values of p are passed to it; the stored
version is boundary-extended. |
pdf |
(optional) an R function for the null density f(x). |
Value
An object of class "NullDist": a named list with elements
CDF, CDFinverse, and pdf, all stored in their
vectorised and (for CDFinverse) boundary-extended forms. A
print method is provided.
References
Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2026a). “On a One Sample Goodness-of-fit Test Based on the Hausdorff Metric”. Submitted.
See Also
Hausdorff_test,
Hausdorff_stat.
Examples
## Standard normal null (all three functions supplied)
null_norm <- distribution(CDF = pnorm,
CDFinverse = qnorm,
pdf = dnorm)
null_norm
## Exp(2) null -- only CDF supplied (root-finding used internally)
null_exp2 <- distribution(CDF = function(x) pexp(x, rate = 2))
null_exp2