---
title: "irdc-demo"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{irdc-demo}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(FORD)    # Our package
library(FOCI)    # For comparison
library(ggplot2) # For visualization
```

# Introduction

We propose a new dependence measure $\nu(Y, \mathbf{X})$ ([*A New Measure Of Dependence: Integrated R2*](http://arxiv.org/abs/2505.18146)) that quantifies how much a random vector $\mathbf{X}$ explains a univariate response $Y$.

Let $Y$ be a random variable and $\mathbf{X} = (X_1, \ldots, X_p)$ a random vector defined on the same probability space. Let $\mu$ be the probability law of $Y$ and $S$ the support of $\mu$. Define

$$
\tilde{S} =
\begin{cases}
S \setminus \{s_{\max}\} & \text{if } S \text{ has a maximum } s_{\max}, \\
S & \text{otherwise.}
\end{cases}
$$

We define the measure $\tilde{\mu}$ on $S$ by

$$
\tilde{\mu}(A) = \frac{\mu(A \cap \tilde{S})}{\mu(\tilde{S})}, \quad \text{for measurable } A \subseteq S.
$$

The **irdc dependence coefficient** is then defined as

$$
\nu(Y, \mathbf{X}) := \int \frac{\mathrm{Var}(\mathbb{E}[\mathbf{1}\{Y > t\} \mid \mathbf{X}])}{\mathrm{Var}(\mathbf{1}\{Y > t\})} \, d\tilde{\mu}(t).
$$

In contrast, [*A Simple Measure Of Conditional Dependence*](https://www.jstor.org/stable/27170947) considers

$$
T(Y, \mathbf{X}) = \frac{\int \mathrm{Var}(\mathbb{E}[\mathbf{1}\{Y \ge t\} \mid \mathbf{X}]) \, d\mu(t)}{\int \mathrm{Var}(\mathbf{1}\{Y \ge t\}) \, d\mu(t)}.
$$

# Continuous Case

```{r continuous}
n <- 1000
x <- matrix(runif(n * 3), nrow = n)
y <- (x[, 1] + x[, 2]) %% 1  # y depends on x1 and x2 jointly, not on x3

irdc(y, x[, 1])
irdc(y, x[, 2])
irdc(y, x[, 3])
```

# Discrete Case

## Example 1

```{r discrete-1}
n <- 10000
s <- 0.1
x1 <- c(rep(0, n * s), runif(n * (1 - s)))  # atom at zero plus a continuous part
x2 <- runif(n)
y <- x1

irdc(y, x1, dist.type.X = "discrete")
irdc(y, x2)
```

## Example 2

```{r discrete-2}
n <- 10000
x1 <- runif(n)
y1 <- rbinom(n, 1, 0.5)  # independent of x1
y2 <-
as.numeric(x1 >= 0.5)    # deterministic threshold of x1

irdc(y1, x1, dist.type.X = "discrete")
irdc(y2, x1, dist.type.X = "discrete")
FOCI::codec(y1, x1)
FOCI::codec(y2, x1)
```

## Example 3: Hurdle vs Gamma Mixture

```{r hurdle-vs-gamma}
# Draw from a hurdle Poisson model: zero with probability p_zero,
# otherwise a zero-truncated Poisson(lambda).
r_hurdle_poisson <- function(n, p_zero = 0.3, lambda = 2) {
  is_zero <- rbinom(n, 1, p_zero)
  # Zero-truncated Poisson via rejection sampling
  rztpois <- function(m, lambda) {
    samples <- numeric(m)
    for (i in seq_len(m)) {
      repeat {
        x <- rpois(1, lambda)
        if (x > 0) {
          samples[i] <- x
          break
        }
      }
    }
    samples
  }
  result <- numeric(n)
  result[is_zero == 0] <- rztpois(sum(is_zero == 0), lambda)
  result
}

set.seed(123)
n <- 1000
p_zero <- 0.4
lambda <- 10

hurdle <- r_hurdle_poisson(n, p_zero, lambda)
gamma_mix <- c(rep(0, round(p_zero * n)),
               rgamma(round((1 - p_zero) * n), shape = lambda, rate = 1))

df <- data.frame(
  value = c(hurdle, gamma_mix),
  source = rep(c("Hurdle Poisson", "Gamma Mixture"), each = n)
)

ggplot(df, aes(x = value, fill = source)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 40) +
  labs(title = "Comparison: Hurdle Poisson vs Gamma Mixture",
       x = "Value", y = "Count", fill = "Distribution") +
  theme_bw()
```

## Example 3 Continued

```{r discrete-3}
x1 <- sort(gamma_mix)
y1 <- rbinom(n, 1, 0.5)
y2 <- sort(hurdle)

irdc(y1, x1, dist.type.X = "discrete")
irdc(y2, x1, dist.type.X = "discrete")
FOCI::codec(y1, x1)
FOCI::codec(y2, x1)
```

## Example 4

```{r discrete-4}
x1 <- sort(hurdle)
y1 <- rbinom(n, 1, 0.5)
y2 <- sort(gamma_mix)

irdc(y1, x1, dist.type.X = "discrete")
irdc(y2, x1, dist.type.X = "discrete")
FOCI::codec(y1, x1)
FOCI::codec(y2, x1)
```

# Conclusion

*irdc* provides a flexible and theoretically grounded dependence measure that works for both continuous and discrete predictors. For further theoretical details, see our paper: Azadkia and Roudaki (2025), [*A New Measure Of Dependence: Integrated R2*](http://arxiv.org/abs/2505.18146).
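
# Appendix: A Naive Plug-in Sketch of $T(Y, \mathbf{X})$

To make the $T(Y, \mathbf{X})$ formula above concrete, here is a naive plug-in sketch for scalar $X$: the conditional expectation $\mathbb{E}[\mathbf{1}\{Y \ge t\} \mid X]$ is handled via the covariance of the indicator with its value at the nearest neighbour in $X$. The helper `est_T` is purely illustrative and is *not* the estimator implemented in `FOCI::codec` (which is rank-based), so the two need not return matching values.

```{r appendix-plugin}
# Illustrative plug-in estimate of T(Y, X) for scalar x (hypothetical helper;
# not part of FORD or FOCI). For each threshold t, Var(E[1{Y >= t} | X]) is
# approximated by Cov(ind_i, ind_{nn(i)}), where nn(i) is the index of the
# nearest neighbour of x_i; numerator and denominator are summed over t in y.
est_T <- function(y, x) {
  n <- length(y)
  nn <- sapply(seq_len(n), function(i) {
    d <- abs(x - x[i])
    d[i] <- Inf          # exclude the point itself
    which.min(d)
  })
  num <- 0
  den <- 0
  for (t in y) {
    ind <- as.numeric(y >= t)
    num <- num + mean(ind * ind[nn]) - mean(ind)^2  # approx. Var(E[ind | X])
    den <- den + var(ind)                            # Var(ind)
  }
  num / den
}

set.seed(1)
n <- 500
x <- runif(n)
y_dep <- x + rnorm(n, sd = 0.01)  # nearly a function of x: estimate near 1
y_ind <- rnorm(n)                 # independent of x: estimate near 0
est_T(y_dep, x)
est_T(y_ind, x)
```

This is only a sketch of the plug-in idea under the stated nearest-neighbour approximation; it is $O(n^2)$ and has no bias correction.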