| Type: | Package |
| Title: | Another Test of Association for Count Data |
| Version: | 0.1.0 |
| Date: | 2025-12-20 |
| Description: | The Upsilon test assesses association among categorical variables against the null hypothesis of independence (Luo 2021 MS thesis; ProQuest Publication No. 28649813). While promoting dominant function patterns, it demotes non-dominant function patterns. It is robust to low expected counts, so a continuity correction such as Yates's seems unnecessary. Because it uses a common null population following a uniform distribution, contingency tables are comparable by statistical significance; this is not the case for most association tests, which define a varying null population by the tensor product of observed marginals. Although Pearson's chi-squared test, Fisher's exact test, and Woolf's G-test (related to mutual information) are useful in some contexts, the Upsilon test appeals to ranking association patterns that do not necessarily follow the same marginal distributions, such as in count data from DNA sequencing, an important modern scientific domain. |
| Encoding: | UTF-8 |
| License: | LGPL (≥ 3) |
| Imports: | Rcpp (≥ 1.0.8), Rdpack, ggplot2 (≥ 3.4.0), reshape2, scales |
| RdMacros: | Rdpack |
| LinkingTo: | Rcpp |
| RoxygenNote: | 7.3.3 |
| Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0), DescTools, USP, metan, FunChisq, patchwork |
| VignetteBuilder: | knitr |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | yes |
| Packaged: | 2025-12-20 13:50:51 UTC; joesong |
| Author: | Xuye Luo [aut],
Joe Song |
| Maintainer: | Joe Song <joemsong@nmsu.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-01-06 11:40:07 UTC |
Fast Zero-Tolerant Pearson's Chi-squared Test of Association
Description
Performs a fast zero-tolerant Pearson's chi-squared test (Pearson 1900) to evaluate association between observations from two categorical variables.
Usage
fast.chisq.test(x, y, log.p = FALSE)
Arguments
x |
a vector to
specify observations of the first
categorical variable. The vector can be of
numeric, character, or logical type.
|
y |
a vector to specify observations of
the second categorical variable.
Must not contain |
log.p |
a logical. If |
Value
A list with class "htest"
containing the following components:
statistic |
the value of chi-squared test statistic. |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
Cramér's V statistic representing the effect size. |
method |
a character string indicating the method used. |
data.name |
a character string giving the names of input data. |
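As a rough illustration of the effect size reported in estimate, Cramér's V can be computed from an ordinary Pearson chi-squared statistic in base R. This sketch uses stats::chisq.test rather than this package and assumes the textbook definition V = sqrt(chi-squared / (n * min(r - 1, c - 1))). How the package treats empty rows or columns when computing V is not covered here.

```r
# Base-R sketch of Cramér's V (textbook definition; not the package's code)
weather <- c("rainy", "sunny", "rainy", "sunny", "rainy")
mood <- c("wistful", "upbeat", "upbeat", "upbeat", "wistful")

tab <- table(weather, mood)

# Pearson's statistic without Yates's correction; small counts trigger a warning
chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)

n <- sum(tab)                          # total number of observations
k <- min(nrow(tab), ncol(tab)) - 1     # smaller table dimension minus one
cramers_v <- sqrt(chi2 / (n * k))
print(cramers_v)
```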
Note
The test uses an internal hash table, instead of a matrix, to store the contingency table. Savings in both runtime and memory can be substantial if the contingency table is sparse and large. The test is implemented in C++ to give an additional layer of speedup over an R implementation.
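The idea of storing only the observed cells can be sketched in plain R with an environment used as a hash map. This is an illustration of the data structure only: the helper count_pairs and the "|" key separator are hypothetical, and the package's C++ implementation is not shown in this manual and may differ.

```r
# Sketch: count (x, y) pairs in a hash map so that empty cells use no storage.
# count_pairs is a hypothetical helper, not part of the package.
count_pairs <- function(x, y) {
  stopifnot(length(x) == length(y))
  counts <- new.env(hash = TRUE)
  keys <- paste(x, y, sep = "|")       # one key per observed combination
  for (k in keys) {
    old <- counts[[k]]                 # NULL if the key has not been seen yet
    counts[[k]] <- if (is.null(old)) 1L else old + 1L
  }
  unlist(mget(ls(counts), envir = counts))
}

weather <- c("rainy", "sunny", "rainy", "sunny", "rainy")
mood <- c("wistful", "upbeat", "upbeat", "upbeat", "wistful")
pair_counts <- count_pairs(weather, mood)
print(pair_counts)
```

Only the combinations that actually occur are stored, whereas a dense matrix would allocate one cell for every row-by-column combination.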
References
Pearson K (1900). “X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157–175. doi:10.1080/14786440009463897.
Examples
library("Upsilon")
weather <- c(
"rainy", "sunny", "rainy", "sunny", "rainy"
)
mood <- c(
"wistful", "upbeat", "upbeat", "upbeat", "wistful"
)
fast.chisq.test(weather, mood)
# The result is equivalent to:
modified.chisq.test(table(weather, mood))
Fast Zero-Tolerant G-Test of Association
Description
Performs a fast zero-tolerant G-test (Woolf 1957) to evaluate association between observations from two categorical variables.
Usage
fast.gtest(x, y, log.p = FALSE)
Arguments
x |
a vector to
specify observations of the first
categorical variable. The vector can be of
numeric, character, or logical type.
|
y |
a vector to specify observations of
the second categorical variable.
Must not contain |
log.p |
a logical. If |
Value
A list with class "htest" containing the following components:
statistic |
the G-test statistic (Likelihood Ratio Chi-squared statistic). |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
the mutual information between the two variables. |
method |
a character string indicating the method used. |
data.name |
a character string giving the names of the data. |
Note
The test uses an internal hash table, instead of a matrix, to store the contingency table. Savings in both runtime and memory can be substantial if the contingency table is sparse and large. The test is implemented in C++ to give an additional layer of speedup over an R implementation.
References
Woolf B (1957). “The log likelihood ratio test (the G-test); methods and tables for tests of heterogeneity in contingency tables.” Annals of Human Genetics, 21(4), 397–409. doi:10.1111/j.1469-1809.1972.tb00293.x.
Examples
library("Upsilon")
weather <- c(
"rainy", "sunny", "rainy", "sunny", "rainy"
)
mood <- c(
"wistful", "upbeat", "upbeat", "upbeat", "wistful"
)
fast.gtest(weather, mood)
# The result is equivalent to:
modified.gtest(table(weather, mood))
Fast Upsilon Test of Association between Two Categorical Variables
Description
Performs a fast Upsilon test (Luo 2021) to evaluate association between observations from two categorical variables.
Usage
fast.upsilon.test(x, y, log.p = FALSE)
Arguments
x |
a vector to
specify observations of the first
categorical variable. The vector can be of
numeric, character, or logical type.
|
y |
a vector to specify observations of
the second categorical variable.
Must not contain |
log.p |
a logical. If |
Details
The Upsilon test is designed to promote dominant function patterns. In contrast to other tests of association, which favor all function patterns, it is unique in demoting non-dominant function patterns.
Null hypothesis (H_0): Row and column variables are
statistically independent.
Null population: A discrete uniform distribution, where each entry in the table has the same probability.
Null distribution: The Upsilon test statistic
asymptotically follows a chi-squared distribution
with (nrow(x) - 1)(ncol(x) - 1) degrees of freedom,
under the null hypothesis on the null population.
See (Luo 2021) for full details of the Upsilon test.
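Given a statistic and the degrees of freedom stated above, a p-value follows from the upper tail of the chi-squared distribution. The sketch below uses only stats::pchisq. The statistic values are made up for illustration, and it is an assumption that the package's log.p argument corresponds to the log.p argument of pchisq.

```r
# Sketch: p-value from a chi-squared null distribution (base R only)
n_row <- 3
n_col <- 3
df <- (n_row - 1) * (n_col - 1)        # degrees of freedom for a 3 x 3 table

stat <- 12                             # hypothetical statistic value
p <- pchisq(stat, df = df, lower.tail = FALSE)
log_p <- pchisq(stat, df = df, lower.tail = FALSE, log.p = TRUE)

# For a very strong pattern the p-value underflows to 0,
# while its natural logarithm remains finite and can still be ranked
stat_big <- 2000                       # hypothetical statistic value
p_big <- pchisq(stat_big, df = df, lower.tail = FALSE)
log_p_big <- pchisq(stat_big, df = df, lower.tail = FALSE, log.p = TRUE)
print(c(p = p, log_p = log_p, p_big = p_big, log_p_big = log_p_big))
```

Working on the log scale is what makes it possible to rank many highly significant tables whose p-values would otherwise all print as 0.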
Value
A list with class "htest" containing the following components:
statistic |
the Upsilon test statistic. |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
the effect size derived from the Upsilon statistic. |
method |
a character string indicating the method used. |
data.name |
a character string giving the name of input data. |
Note
The test uses an internal hash table, instead of a matrix, to store the contingency table. Savings in both runtime and memory can be substantial if the contingency table is sparse and large. The test is implemented in C++ to give an additional layer of speedup over an R implementation.
References
Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.
Examples
library("Upsilon")
weather <- c(
"rainy", "sunny", "rainy", "sunny", "rainy"
)
mood <- c(
"wistful", "upbeat", "upbeat", "upbeat", "wistful"
)
fast.upsilon.test(weather, mood)
# The result is equivalent to:
upsilon.test(table(weather, mood))
Zero-Tolerant Pearson's Chi-squared Statistic
Description
Calculates Pearson's chi-squared test statistic for contingency tables, ignoring entries with zero-expected count.
Usage
modified.chisq.statistic(x)
Arguments
x |
a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative. |
Details
This test is useful if a p-value must be returned
on a contingency table with valid non-negative counts,
where the built-in R implementation of
chisq.test could return NA
as the p-value, regardless of a pattern being
strong or weak. See Examples.
Unlike chisq.test, this
function handles tables with empty rows or columns (where
expected values are 0) by calculating the test
statistic over non-zero entries only. This prevents
the result from becoming NA, while giving
meaningful p-values.
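The zero-tolerant idea described above can be sketched in a few lines of base R: compute the expected counts from the marginals, then sum the Pearson terms only over cells whose expected count is positive. The helper zero_tolerant_chisq is hypothetical and written for illustration; it is not the package's implementation.

```r
# Sketch: Pearson's chi-squared statistic, skipping cells with zero expected count.
# zero_tolerant_chisq is a hypothetical helper, not part of the package.
zero_tolerant_chisq <- function(x) {
  x <- as.matrix(x)
  expected <- outer(rowSums(x), colSums(x)) / sum(x)
  keep <- expected > 0                 # drop cells from empty rows or columns
  sum((x[keep] - expected[keep])^2 / expected[keep])
}

x <- matrix(c(0, 3, 0, 3, 0, 0), nrow = 2, byrow = TRUE)
stat <- zero_tolerant_chisq(x)
print(stat)
```

Without the keep mask, the empty third column would contribute 0/0 terms and the sum would be NaN.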
Value
The numeric value of the modified Pearson's chi-squared test statistic.
Note
This function only takes a contingency table
as input. It does not support goodness-of-fit
tests on vectors.
It does not offer an option
to apply Yates's continuity correction
on 2 × 2 tables.
References
Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.
Examples
library("Upsilon")
# Create a table with empty rows or columns
x <- matrix(c(0, 3, 0, 3, 0, 0), nrow = 2, byrow = TRUE)
print(x)
# Standard chisq.test might warn or fail on a table with empty rows or columns
chisq.test(x)
# Modified statistic handles it gracefully
modified.chisq.statistic(x)
Zero-Tolerant Pearson's Chi-squared Test for Contingency Tables
Description
Performs Pearson's chi-squared test (Pearson 1900) on contingency tables, slightly modified to handle rows or columns of all zeros.
Usage
modified.chisq.test(x, log.p = FALSE)
Arguments
x |
a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative. |
log.p |
a logical. If |
Details
This test is useful if a p-value must be returned
on a contingency table with valid non-negative counts,
where the built-in R implementation of
chisq.test could return NA
as the p-value, regardless of a pattern being
strong or weak. See Examples.
Unlike chisq.test, this
function handles tables with empty rows or columns (where
expected values are 0) by calculating the test
statistic over non-zero entries only. This prevents
the result from becoming NA, while giving
meaningful p-values.
Value
A list with class "htest" containing:
statistic |
the chi-squared test statistic (calculated ignoring entries of 0-expected count). |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
Cramér's V statistic. |
observed |
the observed counts. |
expected |
the expected counts under the null hypothesis. |
Note
This function only takes a contingency table
as input. It does not support goodness-of-fit
tests on vectors.
It does not offer an option
to apply Yates's continuity correction
on 2 × 2 tables.
References
Pearson K (1900). “X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157–175. doi:10.1080/14786440009463897.
Examples
library("Upsilon")
# A table with a dominant function and an empty column
x <- matrix(
c(0, 3, 0,
3, 0, 0),
nrow = 2, byrow = TRUE)
print(x)
# Standard chisq.test may fail or return NA with a warning
chisq.test(x)
# Modified chi-squared test is significant:
modified.chisq.test(x)
Zero-Tolerant G-Test for Contingency Tables
Description
Performs G-test (Woolf 1957) on contingency tables, slightly modified to handle rows or columns of all zeros.
Usage
modified.gtest(x, log.p = FALSE)
Arguments
x |
a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative. |
log.p |
a logical. If |
Details
This test is useful if a p-value must be returned
on a contingency table with valid non-negative counts,
where other implementations of G-test could
return NA as the p-value, regardless of a
pattern being strong or weak.
This function handles tables with empty rows
or columns (where expected values are 0) by
calculating the test statistic over non-zero
entries only. This prevents the result from
becoming NA, while giving meaningful
p-values.
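A minimal base-R sketch of the same idea for the G statistic follows, summing only over cells with a positive observed count (a zero observed count contributes nothing to the sum). The helper zero_tolerant_g is hypothetical. The last lines use the standard identity G = 2 * n * MI, with mutual information MI in nats; the units the package reports for its estimate are not stated in this manual.

```r
# Sketch: G statistic over non-zero cells (hypothetical helper, not the package's code)
zero_tolerant_g <- function(x) {
  x <- as.matrix(x)
  expected <- outer(rowSums(x), colSums(x)) / sum(x)
  keep <- x > 0                        # observed > 0 implies expected > 0
  2 * sum(x[keep] * log(x[keep] / expected[keep]))
}

x <- matrix(c(0, 3, 0, 3, 0, 0), nrow = 2, byrow = TRUE)
g <- zero_tolerant_g(x)

# Mutual information implied by G (standard identity)
mi_nats <- g / (2 * sum(x))
mi_bits <- mi_nats / log(2)
print(c(G = g, MI_nats = mi_nats, MI_bits = mi_bits))
```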
Value
A list with class "htest" containing:
statistic |
the G statistic (log-likelihood ratio). |
parameter |
the degrees of freedom. |
p.value |
the p-value of the test. |
estimate |
the value of mutual information. |
method |
a character string indicating the method used. |
data.name |
a character string, name of the input data. |
observed |
the observed counts. |
expected |
the expected counts under the null hypothesis. |
References
Woolf B (1957). “The log likelihood ratio test (the G-test); methods and tables for tests of heterogeneity in contingency tables.” Annals of Human Genetics, 21(4), 397–409. doi:10.1111/j.1469-1809.1972.tb00293.x.
Examples
library("Upsilon")
# Create a sparse table with empty rows/cols
x <- matrix(
c(0, 3, 0,
3, 0, 0),
nrow = 2, byrow = TRUE
)
print(x)
# Perform the modified G-test
modified.gtest(x)
Plot Matrix with Entries Represented by Balloons of Varying Sizes and Colors
Description
Creates a "balloon plot" to visualize numeric data in a matrix or contingency table.
Usage
plot_matrix(
x,
title = "Balloon plot",
shape.color = c("tomato"),
s.min = 1,
s.max = 30,
x.axis = NULL,
y.axis = NULL,
x.lab = "",
y.lab = "",
bg.color = "white",
grid.color = "black",
grid.width = 0.1,
size.by = c("column", "row", "global", "none"),
color.by = c("column", "row", "global", "none"),
number.size = 6,
shape.by = c("column", "row", ""),
shapes = c(21, 22, 23, 24)
)
Arguments
x |
a numeric matrix or table to be plotted. |
title |
a character string for the main title of the plot.
Defaults to |
shape.color |
a character string specifying the
color for entries (e.g., |
s.min |
a numeric value specifying the minimum size of the shapes. Defaults to 1. |
s.max |
a numeric value specifying the maximum size of the shapes. Defaults to 30. |
x.axis |
a character vector for custom x-axis labels.
If |
y.axis |
a character vector for custom y-axis labels.
If |
x.lab |
a character string for the x-axis title.
Defaults to |
y.lab |
a character string for the y-axis title.
Defaults to |
bg.color |
a character string for the background
color of the tiles. Defaults to |
grid.color |
a character string specifying color of
grid lines ( |
grid.width |
a numeric value to specify the width of grid lines. |
size.by |
a character string to specify how to
scale the size of balloon: |
color.by |
a character string to specify how to
scale the color of balloon: |
number.size |
a numeric value specifying the font size for text. |
shape.by |
a character string to specify how to
choose the shape of balloon: |
shapes |
a numeric vector to specify shape codes. |
Details
Each entry in the matrix is represented by a shape, with size and color corresponding to the magnitude of the entry's value. It offers an alternative to a heatmap for displaying count data.
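To make the mapping from values to balloon sizes concrete, the sketch below reshapes a matrix into long format and rescales the values linearly between a minimum and a maximum size. It uses base R only and global scaling; it is an assumed illustration of the idea, not the function's internal code, and the column names row, column, value, and size are hypothetical.

```r
# Sketch: long-format data and global size scaling for a balloon plot
mat <- matrix(c(10, 20, 30, 50, 80, 60, 40, 30), nrow = 2,
              dimnames = list(c("Row1", "Row2"), c("C1", "C2", "C3", "C4")))

# One record per matrix entry: row label, column label, value
long <- as.data.frame(as.table(mat), responseName = "value")
names(long)[1:2] <- c("row", "column")

# Linearly map the smallest value to s.min and the largest to s.max
s.min <- 1
s.max <- 30
rng <- range(long$value)
long$size <- s.min + (long$value - rng[1]) / (rng[2] - rng[1]) * (s.max - s.min)
print(long)
```

A data frame of this shape is what a point-based plotting layer would typically consume, with one point per row and the size column driving the balloon area or radius.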
Value
A ggplot object.
Examples
library("Upsilon")
library(ggplot2)
mat <- matrix(c(10, 20, 30, 50, 80, 60, 40, 30), nrow = 2)
rownames(mat) <- c("Row1", "Row2")
colnames(mat) <- c("C1", "C2", "C3", "C4")
# Color by Row (Row 1 = red, Row 2 = blue)
plot_matrix(mat, color.by = "row", shape.color = c("tomato", "steelblue"))
# Color by Column (Rainbow colors)
plot_matrix(mat, color.by = "column", shape.color = c("red", "green", "blue", "orange"))
Recover Raw Data Vectors from Contingency Table
Description
Converts a contingency table (count data) back into two vectors of raw observations. This is useful when you have a summary table but need to run tests that require raw data vectors (like the functions in this package).
Usage
table.to.vectors(x)
Arguments
x |
A numeric matrix or contingency table containing non-negative integer counts. Must not contain NA values. |
Value
A list containing two integer vectors:
x_vector |
A vector of row indices corresponding to the observations. |
y_vector |
A vector of column indices corresponding to the observations. |
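The conversion can be pictured in base R as repeating each cell's row and column index as many times as its count. This is a sketch of the idea under that assumption; the order of observations returned by table.to.vectors may differ.

```r
# Sketch: expand a contingency table into index vectors with base R
tab <- matrix(c(10, 5, 2, 8, 5, 10), nrow = 2, byrow = TRUE)

# row(tab) and col(tab) give each cell's indices; repeat them by the cell count
x_vec <- rep(as.vector(row(tab)), times = as.vector(tab))
y_vec <- rep(as.vector(col(tab)), times = as.vector(tab))

# Tabulating the vectors recovers the original counts
rebuilt <- table(x_vec, y_vec)
print(rebuilt)
```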
Examples
library("Upsilon")
# Create a sample contingency table
# Rows = Variable A (levels 1,2), Cols = Variable B (levels 1,2,3)
tab <- matrix(c(10, 5, 2, 8, 5, 10), nrow = 2, byrow = TRUE)
print(tab)
# Recover the raw vectors
res <- table.to.vectors(tab)
# Check the result
length(res$x_vector) # Should be sum(tab) = 40
head(cbind(res$x_vector, res$y_vector))
table(res$x_vector, res$y_vector) # Should be the same as tab
Upsilon Goodness-of-Fit Test Statistic
Description
(FOR INTERNAL USE ONLY) Calculates the Upsilon statistic for a Goodness-of-Fit (GoF) test.
Usage
upsilon.gof.statistic(x, p = rep(1/length(x), length(x)), rescale.p = TRUE)
Arguments
x |
a numeric vector or one-column matrix representing observed counts. |
p |
a numeric vector of probabilities
of the same length as |
rescale.p |
a logical scalar.
If |
Details
This statistic measures the discrepancy between observed counts and expected probabilities.
Value
A numeric value of the Upsilon Goodness-of-Fit statistic.
Examples
library("Upsilon")
counts <- c(10, 20, 30)
upsilon.gof.statistic(counts)
Upsilon Goodness-of-Fit Test for Count Data
Description
(FOR INTERNAL USE ONLY) Performs the Upsilon Goodness-of-Fit test to determine if a sample of observed counts fits a specified probability distribution. The Upsilon statistic uses a specific normalization (dividing by the average expected count) which differs from the standard Pearson's Chi-squared test.
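One reading of the normalization described above is sketched below: the squared deviations are summed and divided by the mean expected count, whereas Pearson's statistic divides each squared deviation by its own expected count. The exact formula is an assumption based on that sentence, and upsilon_gof_sketch is a hypothetical helper, not the package's function.

```r
# Sketch (assumed form): squared deviations divided by the average expected count
upsilon_gof_sketch <- function(x, p = rep(1 / length(x), length(x))) {
  p <- p / sum(p)                      # rescale probabilities to sum to 1
  expected <- sum(x) * p
  sum((x - expected)^2) / mean(expected)
}

counts <- c(10, 20, 30)

# Under a uniform p all expected counts are equal, so this matches Pearson's statistic
u_uniform <- upsilon_gof_sketch(counts)

# Under a non-uniform p the two normalizations differ
p_skewed <- c(0.2, 0.3, 0.5)
u_skewed <- upsilon_gof_sketch(counts, p = p_skewed)
expected_skewed <- sum(counts) * p_skewed
pearson_skewed <- sum((counts - expected_skewed)^2 / expected_skewed)
print(c(u_uniform = u_uniform, u_skewed = u_skewed, pearson_skewed = pearson_skewed))
```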
Usage
upsilon.gof.test(
x,
p = rep(1/length(x), length(x)),
rescale.p = TRUE,
log.p = FALSE
)
Arguments
x |
A numeric vector representing observed counts. Must be non-negative. |
p |
A numeric vector of probabilities of the same length as |
rescale.p |
Logical. If |
log.p |
a logical. If |
Value
A list with class "htest" containing:
statistic |
The Upsilon test statistic. |
parameter |
The degrees of freedom (k - 1). |
p.value |
The p-value of the test. |
estimate |
The effect size. |
method |
A character string indicating the method used. |
data.name |
A character string giving the name(s) of the data. |
observed |
The observed counts. |
expected |
The expected counts. |
residuals |
The Pearson residuals. |
p.normalized |
The probability vector used (after rescaling if applicable). |
Examples
library("Upsilon")
# Test against uniform distribution
counts <- c(10, 20, 30)
upsilon.gof.test(counts)
Upsilon Test Statistic for Contingency Tables
Description
Calculates the Upsilon test statistic \Upsilon.
Usage
upsilon.statistic(x)
Arguments
x |
a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative. |
Details
The Upsilon test is designed to promote dominant function patterns. In contrast to other tests of association, which favor all function patterns, it is unique in demoting non-dominant function patterns.
Null hypothesis (H_0): Row and column variables are
statistically independent.
Null population: A discrete uniform distribution, where each entry in the table has the same probability.
Null distribution: The Upsilon test statistic
asymptotically follows a chi-squared distribution
with (nrow(x) - 1)(ncol(x) - 1) degrees of freedom,
under the null hypothesis on the null population.
See (Luo 2021) for full details of the Upsilon test.
Value
The numeric value of Upsilon test statistic \Upsilon.
References
Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.
Examples
library("Upsilon")
# Create a contingency table
x <- matrix(c(
0, 3, 0,
3, 0, 0),
nrow = 2, byrow = TRUE)
print(x)
# Calculate statistic
upsilon.statistic(x)
Upsilon Test of Association for Count Data
Description
Performs the Upsilon test to evaluate association among categorical variables represented by a contingency table.
Usage
upsilon.test(x, log.p = FALSE)
Arguments
x |
a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative. |
log.p |
a logical. If |
Details
The Upsilon test is designed to promote dominant function patterns. In contrast to other tests of association, which favor all function patterns, it is unique in demoting non-dominant function patterns.
Null hypothesis (H_0): Row and column variables are
statistically independent.
Null population: A discrete uniform distribution, where each entry in the table has the same probability.
Null distribution: The Upsilon test statistic
asymptotically follows a chi-squared distribution
with (nrow(x) - 1)(ncol(x) - 1) degrees of freedom,
under the null hypothesis on the null population.
See (Luo 2021) for full details of the Upsilon test.
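The contrast between promoting and demoting patterns can be illustrated with a small sketch. It assumes the Upsilon statistic keeps the usual expected counts from the observed marginals but divides the summed squared deviations by the uniform cell count n / (rows * columns). That formula is an assumption made for illustration and is not stated in this manual; see Luo (2021) for the definition. Both helpers are hypothetical.

```r
# Sketch: Pearson's statistic versus an ASSUMED form of the Upsilon statistic
pearson_sketch <- function(x) {
  e <- outer(rowSums(x), colSums(x)) / sum(x)
  keep <- e > 0
  sum((x[keep] - e[keep])^2 / e[keep])
}

upsilon_sketch <- function(x) {
  e <- outer(rowSums(x), colSums(x)) / sum(x)
  # Assumed normalization: the same uniform expected count for every cell
  sum((x - e)^2) / (sum(x) / length(x))
}

non_dominant <- diag(c(4, 1, 1))       # same tables as in the Examples
dominant <- diag(c(2, 2, 2))

results <- rbind(
  pearson = c(non_dominant = pearson_sketch(non_dominant),
              dominant = pearson_sketch(dominant)),
  upsilon_assumed = c(non_dominant = upsilon_sketch(non_dominant),
                      dominant = upsilon_sketch(dominant))
)
print(results)
```

Under this assumed form, Pearson's statistic cannot separate the two tables, while the uniform normalization scores the table with balanced diagonal counts higher than the one dominated by a single cell.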
Value
A list with class "htest" containing:
statistic |
the value of the Upsilon statistic. |
parameter |
the degrees of freedom. |
p.value |
the p-value. |
estimate |
the effect size. |
method |
a character string giving the test name. |
data.name |
a character string giving the name of input data. |
observed |
the observed counts, a matrix copy of the input data. |
expected |
the expected counts under the null hypothesis using the observed marginals. |
References
Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.
Examples
library("Upsilon")
# A contingency table with independent row and column variables
x <- matrix(
c(1, 1, 0,
1, 1, 0,
1, 1, 0),
nrow = 3, byrow = TRUE
)
print(x)
upsilon.test(x)
# A contingency table with a non-dominant function
x <- matrix(
c(4, 0, 0,
0, 1, 0,
0, 0, 1),
nrow = 3, byrow = TRUE
)
print(x)
upsilon.test(x)
# A contingency table with a dominant function
x <- matrix(
c(2, 0, 0,
0, 2, 0,
0, 0, 2),
nrow = 3, byrow = TRUE)
print(x)
upsilon.test(x)
# Another contingency table with a dominant function
x <- matrix(
c(3, 0, 0,
0, 3, 0,
0, 0, 0),
nrow = 3, byrow = TRUE)
print(x)
upsilon.test(x)