This package enables the user to quantify dependencies between mixed (continuous, discrete, binary) variables in the framework of a Gaussian copula model by estimating the correlation matrix of the copula (Tomilina, Mazo, and Jaffrézic (2024)) as well as its conditional correlation structure.

Context

When working with \(d\) mixed variables \(X_1, ..., X_d\), it can be complicated to infer a correlation network as well-known statistical methods do not work in case of mixed data. Indeed, most classical network inference methods such as gLasso or Gaussian graphical models rely on the assumption that the data follow a Gaussian distribution. Recently, the Non-paranormal distribution was introduced for network inference for continuous, non-Gaussian data (Liu (2009)). It consists in a transformation of the cumulative distribution functions via a Gaussian copula and provides results for non-Gaussian but continuous variables. We propose an extension of this model to the case of mixed variables.

The Model

Let \(X_1, ..., X_d\) be mixed variables (continuous or discrete). Let \(F_1, ..., F_d\) denote their marginal CDFs, \(\Phi^{-1}\) the inverse of the standard normal CDF and \(\Phi_\Sigma\) the Gaussian CDF of correlation matrix \(\Sigma\). We assume that the multivariate CDF of the vector \((X_1,\dots,X_d)\) is given by: \[F(X_1, ..., X_d)=C_\Sigma(F_1(X_1), ..., F_d(X_d)):=\Phi_\Sigma(\Phi^{-1}(F_1(X_1)), ..., \Phi^{-1}(F_d(X_d))).\]

Estimation

Our package enables the estimation of \(\Sigma\), the correlation matrix of the copula, and \(\Omega=\Sigma^{-1}\), the conditional covariance matrix of the copula..

In order to estimate \(\Sigma\), the rho_estim function uses a semiparametric pairwise maximum likelihood estimator. It returns the estimated correlation matrix of the copula and takes as arguments the data set and the variable types in a vector. In the example below, we have used a subset of the ICGC data set (Zhang, Bajari, and D. (2019)) which contains 5 RNA-seq, 5 protein and 5 mutation variables. We have specified the variable types, where a “C” stands for “continuous” and a “D” for “discrete”.

data(icgc_data)
R <- rho_estim(icgc_data,c(rep("D",5),rep("C",5),rep("D",5)))

A \(6\times6\) subset of the obtained copula correlation matrix is represented below.

	ACACA	AKT1S1	ANLN	ANXA1	AR	ACACA_P
ACACA	1.00	-0.13	0.09	-0.32	0.36	0.57
AKT1S1	-0.13	1.00	-0.12	-0.16	-0.18	0.11
ANLN	0.09	-0.12	1.00	0.12	-0.35	-0.07
ANXA1	-0.32	-0.16	0.12	1.00	-0.36	-0.22
AR	0.36	-0.18	-0.35	-0.36	1.00	0.14
ACACA_P	0.57	0.11	-0.07	-0.22	0.14	1.00

The precision matrix, \(\Omega=\Sigma^{-1}\) that encodes the latent conditional correlation network, can also be estimated via gLASSO penalized inversion (Friedman, Hastie, and Tibshirani (2008)). Because it automatically sets some coefficients to zero, no posterior thresholding is needed. Our function omega_estim takes as arguments the data set and the type of variables (or the correlation matrix if it has already been estimated), a grid of penalization parameters \(\lambda\), and the number of observations in the data if a correlation matrix has been entered as the first parameter.

O <- omega_estim(R, lambda=seq(0.4,0.5,0.01), n=250)

It returns a list containing: the correlation matrix, the optimal precision matrix according to the HBIC criterion, the optimal corresponding \(\lambda\), all tested values of \(\lambda\) and all corresponding values of the \(HBIC\). A \(6\times6\) subset of the obtained copula penalized conditional correlation matrix is represented below.

	ACACA	AKT1S1	ANLN	ANXA1	AR	ACACA_P
ACACA	0.73	0.00	0.00	0.00	0.00	-0.09
AKT1S1	0.00	0.71	0.00	0.00	0.00	0.00
ANLN	0.00	0.00	0.72	0.00	0.00	0.00
ANXA1	0.00	0.00	0.00	0.74	0.00	0.00
AR	0.00	0.00	0.00	0.00	0.74	0.00
ACACA_P	-0.09	0.00	0.00	0.00	0.00	0.76

The minimum \(\lambda\) and \(HBIC\) are returned by the package.

#> [1] "The minimal lambda is 0.4"
#> [1] "The minimal HBIC is 11.0917177044302"

Finally, it is possible to graphically represent the evolution of the HBIC depending on the \(\lambda\) by running the following code:

plot(O[[5]],O[[6]],xlab="Lambda",ylab="HBIC",type="l")

Graphical representation

The cor_network_graph function enables to visualize the obtained network. For a correlation network, it takes as arguments the dataset, the correlation matrix, the threshold and a legend.

cor_network_graph(R,TS=0.3,legend=c(rep("RNAseq",5),rep("Proteins",5),rep("Mutations",5)))

To visualize the conditional correlation network that is encoded in \(\Omega\), it is sufficient to specify the threshold parameter to exactly zero as the estimated null coefficients have been shrunk to zero by penalization.

cor_network_graph(O[[2]],TS=0,legend=c(rep("RNAseq",5),rep("Proteins",5),rep("Mutations",5)))

Simulation

Our package is also able to simulate data distributed according to the Gaussian copula model. Two functions enable us to generate two types of correlation matrices: block-wise and sparse. The diag_block_matrix function enables the user to get a block-wise correlation matrix. It takes as arguments a vector containing the block sizes and a vector containing the coefficients of each block. An example is shown below.

R <- diag_block_matrix(c(3,2),c(0.4,0.8))

1.0	0.4	0.4	0.0	0.0
0.4	1.0	0.4	0.0	0.0
0.4	0.4	1.0	0.0	0.0
0.0	0.0	0.0	1.0	0.8
0.0	0.0	0.0	0.8	1.0

The matrix_gen function enables the user to generate a sparse correlation matrix of initial sparsity parameter \(\gamma\), which has to be specified. It is based on the Cholesky decomposition where a lower triangular matrix of sparsity \(\gamma\) is generated before being multiplied by its transpose in order to obtain the final matrix. Note that the initial parameter is not equal to the final parameter, which is also returned by the function. In the example below, the first element of the list is the resulting matrix, and the second element of the list is the final sparsity parameter.

R <- matrix_gen(5,0.81)

1.0	0.00	0.6	0	0.00
0.0	1.00	0.0	0	0.67
0.6	0.00	1.0	0	0.00
0.0	0.00	0.0	1	0.00
0.0	0.67	0.0	0	1.00

sparsity = 0.64

The CopulaSim function enables the user to generate a data set which CDF can be expressed as a Gaussian copula of correlation matrix R (to be specified). In the example below, we first generate a block diagonal correlation matrix R and then generate the data set. Then, CopulaSim takes as arguments the number of observations, the correlation matrix of the copula, a vector containing the probability distributions and their parameters, the number of repetitions of each distribution, and enables the user to randomize their order. It returns a list of three elements: the data frame containing the generated data, the correlation matrix, and the permutation realized on the rows and columns of R order after randomization.

R <- diag_block_matrix(c(3,5,2),c(0.7,0.3,0.5))
CopulaSim(5,R,c(rep("qnorm(0,1)",5),rep("qexp(0.5)",3),rep("qbinom(4,0.8)",2)),random=TRUE)
#> [[1]]
#>            X1           X2         X3         X4         X5         X6
#> 1 -0.31432935 -0.003397378 -0.3042208 -0.1121517 -0.3293876 1.85493071
#> 2 -0.04569702 -0.157287732 -1.8411259  0.3596431  1.3008398 8.30431146
#> 3 -0.63241425 -0.571101700  1.1884076 -0.9861470 -0.7240118 0.02478069
#> 4 -0.34398010 -0.284438037  0.4719280 -1.1694105 -1.3806585 0.30800447
#> 5 -0.98796694 -0.692130436 -2.1990930 -1.2800052  1.2636802 4.57877686
#>          X7        X8 X9 X10
#> 1 3.1266226 2.8760999  3   3
#> 2 1.1930258 0.7823285  3   3
#> 3 0.6112743 0.1576969  1   3
#> 4 1.5508764 0.7029956  3   4
#> 5 0.3770914 1.3759504  4   3
#> 
#> [[2]]
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,]  1.0  0.0  0.5  0.0  0.0  0.0  0.0  0.0  0.0   0.0
#>  [2,]  0.0  1.0  0.0  0.0  0.0  0.0  0.7  0.0  0.7   0.0
#>  [3,]  0.5  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0   0.0
#>  [4,]  0.0  0.0  0.0  1.0  0.3  0.3  0.0  0.3  0.0   0.3
#>  [5,]  0.0  0.0  0.0  0.3  1.0  0.3  0.0  0.3  0.0   0.3
#>  [6,]  0.0  0.0  0.0  0.3  0.3  1.0  0.0  0.3  0.0   0.3
#>  [7,]  0.0  0.7  0.0  0.0  0.0  0.0  1.0  0.0  0.7   0.0
#>  [8,]  0.0  0.0  0.0  0.3  0.3  0.3  0.0  1.0  0.0   0.3
#>  [9,]  0.0  0.7  0.0  0.0  0.0  0.0  0.7  0.0  1.0   0.0
#> [10,]  0.0  0.0  0.0  0.3  0.3  0.3  0.0  0.3  0.0   1.0
#> 
#> [[3]]
#>  [1]  3  2 10  4  8  5  1  6  7  9

Additionally, the gauss_gen function, which is used in CopulaSim, generates the latent Gaussian variables linked by the correlation matrix R. Its only arguments are the correlation matrix R and the number of observations.

latent_data <- gauss_gen(R,10)

X1	X2	X3	X4	X5	X6	X7	X8	X9	X10
0.24	0.22	0.79	0.15	0.33	0.50	0.52	0.83	0.42	0.45
0.79	0.91	0.82	0.88	0.96	0.56	0.43	0.90	0.09	0.19
0.46	0.45	0.83	0.76	0.99	0.87	0.83	0.92	0.48	0.58
0.48	0.63	0.30	0.08	0.10	0.06	0.38	0.20	0.04	0.11
0.48	0.84	0.86	0.96	0.96	0.15	1.00	0.95	0.66	0.66
0.41	0.42	0.15	0.35	0.88	0.16	0.94	0.16	0.84	0.50
0.91	0.85	0.97	0.54	0.20	0.88	0.25	0.73	0.84	0.91
0.59	0.89	0.59	0.43	0.07	0.24	0.68	0.74	0.91	0.79
0.84	0.58	0.85	0.75	0.85	0.79	0.78	0.43	0.82	0.73
0.24	0.45	0.54	0.38	0.10	0.71	0.12	0.33	0.29	0.87

References

Friedman, J, T Hastie, and R Tibshirani. 2008. “Sparse Inverse Covariance Estimation with the Graphical Lasso.” Biostatistics. https://doi.org/10.1093/biostatistics/kxm045.

Liu. 2009. “The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs.” Journal of Machine Learning Research.

Tomilina, Mazo, and Jaffrézic. 2024. “Mixed Copula-Based Models for Semi-Parametric Network Inference: An Application to Multi-Omics Data.”

Zhang, J., R. Bajari, and Andric D. 2019. “The International Cancer Genome Consortium Data Portal.” Nat Biotechnol. https://doi.org/doi:10.1038/s41587-019-0055-9.

heterocop: an R package for Gaussian copula semi-parametric inference for heterogeneous data

Ekaterina Tomilina

2025-06-06