---
title: "The_rvif_package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The_rvif_package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

This document presents the *rvif* package with the aim of facilitating its use. To this end, it is divided into four sections, each dedicated to one of the functions that make up the package.

The package focuses on determining whether or not the degree of approximate multicollinearity in a multiple linear regression model is of concern, in the sense that it affects the statistical analysis (i.e. the individual significance tests) of the model.

Please note that the results presented here are based on the articles by Salmerón, García and García (2025) and Salmerón, García and García (working paper); see the bibliography at the end of this document.

We start by loading the *rvif* library, which in turn loads the *multiColl* package.

```{r}
library(rvif)
```

## *cv_vif* function

Given the multiple linear regression model $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u},$ where the design matrix $\mathbf{X}$ contains the observations of each independent variable in columns, the function *cv_vif* calculates the Coefficients of Variation (CV) and Variance Inflation Factors (VIF) of each variable (column) contained in $\mathbf{X}$. The command and arguments of this function are as follows:

```{r}
# cv_vif(x, tol = 1e-30)
```

Both CVs and VIFs are calculated using existing functions in the *multiColl* package. Since these functions require the intercept of the model to be in the first column of $\mathbf{X}$, the design matrix $\mathbf{X}$ entered as the first argument of the *cv_vif* function must have the intercept in its first column. This is crucial for the function to work correctly.

#### A first example: how to use the function

To illustrate this point and how the function works, we first use the Wissel data available in the *rvif* package. For more details use *help(Wissel)*.

```{r}
head(Wissel, n=5)
```

As can be seen, the design matrix $\mathbf{X}$ is in this case formed by the last four columns. To calculate the CVs and VIFs of each independent variable, the following code must be executed:

```{r}
x = Wissel[,-c(1,2)]
cv_vif(x)
```

An alternative approach is to obtain the design matrix of the independent variables after specifying the corresponding model with the *lm* command and the *model.matrix* function:

```{r}
attach(Wissel)
reg_W = lm(D~C+I+CP)
detach(Wissel)
x = model.matrix(reg_W)
cv_vif(x)
```

As can be seen, the same results are obtained in both cases.

According to Salmerón, Rodríguez and García (2020), the Coefficient of Variation (CV) is useful for detecting non-essential multicollinearity (the relationship between the intercept and at least one of the remaining independent variables of a linear model). Specifically, they state that if the CV of an independent variable is less than 0.1002506, then that variable is related to the intercept and needs to be centered to eliminate this relationship. As can be seen from the Wissel data, the degree of non-essential multicollinearity is not strong. On the other hand, according to Salmerón, García and García (2018), the VIF is only able to detect multicollinearity of the essential type (a linear relationship between at least two independent variables of the linear model, excluding the intercept). As the VIF values suggest the presence of this type of multicollinearity, the results clearly indicate that it exists in the model for the Wissel data.
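Since $VIF(i) = \frac{1}{1-R_{i}^{2}}$, where $R_{i}^{2}$ is the coefficient of determination of the auxiliary regression of the $i$-th independent variable on the remaining ones, the values reported by *cv_vif* can be reproduced by hand. The following minimal sketch (our own illustration, assuming `x` is still the design matrix `model.matrix(reg_W)` built above) does so for the variable C:

```{r}
# Auxiliary regression of C (column 2 of x) on I and CP; the VIF of C is
# 1/(1 - R^2) of this regression and should match the value reported above.
aux = lm(x[, 2] ~ x[, 3] + x[, 4])
1 / (1 - summary(aux)$r.squared)
```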
#### A second example: construction of the design matrix

Since the concept of multicollinearity only affects the independent variables of the model, it is possible to detect it without considering the dependent variable. When constructing the design matrix directly, it is important to note that the first column must correspond to the intercept, as previously mentioned. The following simulation illustrates the situation:

```{r}
set.seed(2025)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01) # variable with very little variability
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 1) # fourth variable related to the third
x5 = rnorm(obs, -1, 30)
x = cbind(cte, x2, x3, x4, x5) # the first column has to be the intercept
cv_vif(x)
```

In this case, it is observed that the second variable causes non-essential multicollinearity, whereas the relation between the third and fourth variables causes essential multicollinearity (this is how the simulated data have been designed).

#### A third example: checking some errors

In what follows, we use the data on soil characteristics as predictors of forest diversity from the paper by Bondell and Reich (2008). This data set is available in the *rvif* package; for more details, use *help(soil)*. It contains a variable called SumCation, which is the sum of the concentrations of calcium (Ca), magnesium (Mg), potassium (K) and sodium (Na). For this reason, the VIF calculation produces errors in this case.

```{r, error = TRUE}
head(soil, n=5)
x = soil[,-16]
cv_vif(x)
```

The second argument of this function sets the threshold used to determine whether the system is computationally singular. Eliminating the second variable (SumCation) allows the results to be obtained without any problem.

```{r}
x = soil[,-c(2,16)]
cv_vif(x)
```

It is now observed that the design matrix has 14 variables, but the output only provides information on 13 of them (from 2 to 14). This is because the code is designed to eliminate the first column of the design matrix (which is assumed to correspond to the intercept). In this case, the design matrix entered does not have an intercept in its first column, which has resulted in the elimination of the first variable. This can be resolved either by entering the constant manually into the design matrix, or by obtaining it from the specification of the linear regression model:

```{r}
y = soil[,16]
reg_S = lm(y~as.matrix(x))
cv_vif(model.matrix(reg_S))
```

It can be seen that variables 13 (pH) and 14 (ExchAc) are causing non-essential multicollinearity, while all the other variables, except for 7 (Na), 8 (P), 10 (Zn) and 11 (Mn), are causing essential multicollinearity.

If the design matrix is constructed directly but the intercept is not positioned in the first column of the design matrix $\mathbf{X}$, the following error message occurs:

```{r, error = TRUE}
cte = rep(1, length(y))
x = cbind(x, cte)
cv_vif(x)
```

#### A fourth example: the special case of the simple linear regression model

In the special case of the simple linear regression model, the design matrix $\mathbf{X}$ consists of an intercept and a second variable. In this case, it makes no sense to calculate the VIF, since it is always equal to 1 regardless of the observed values of the second variable (see Salmerón, Rodríguez and García (2020) for more details).
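The reason is immediate: the auxiliary regression of the second variable contains only the intercept, so its coefficient of determination is zero and, consequently, $VIF = \frac{1}{1-0} = 1$. A quick check of this fact, following the same auxiliary-regression logic sketched earlier:

```{r}
set.seed(2025)
x2 = rnorm(obs, 3, 4)
aux = lm(x2 ~ 1)        # auxiliary regression on the intercept alone
summary(aux)$r.squared  # equal to 0, so VIF = 1/(1 - 0) = 1
```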
For this reason, to execute the *cv_vif* function, the design matrix (the first argument of the function) must have more than two columns:

```{r, error = TRUE}
cte = rep(1, obs)
set.seed(2025)
x2 = rnorm(obs, 3, 4)
x = cbind(cte, x2)
cv_vif(x)
```

#### Final warning regarding the presence of binary variables in the model

The code is not designed to distinguish between quantitative and binary independent variables, so users should be especially careful when interpreting results involving binary variables. Note that the VIF of a binary variable is calculated using the coefficient of determination of a regression in which the binary variable is treated as the dependent variable. However, the coefficient of determination is generally considered inappropriate in such cases, which is why models such as logit or probit are typically recommended for binary outcomes. The following example illustrates this:

```{r}
set.seed(2025)
x3 = rbinom(obs, 1, 0.5)
x = cbind(cte, x2, x3)
head(x, n=5)
cv_vif(x)
```

All calculations involving a binary variable in the design matrix $\mathbf{X}$ are performed without any problem. However, we would like to point out that the associated VIF should not be interpreted as that of a quantitative variable. We recommend ignoring it.

## *cv_vif_plot* function

The *cv_vif_plot* function is intended to facilitate the interpretation of the results provided by the *cv_vif* function. For this reason, the input of this function (its first argument) is the output of the *cv_vif* function. It represents the scatter plot of the Coefficient of Variation (CV) and the Variance Inflation Factor (VIF) of the independent variables (excluding the intercept) of a multiple linear regression model. The command and arguments of this function are as follows:

```{r}
# cv_vif_plot(x, limit = 40)
```

The distinction between essential and non-essential multicollinearity, together with the limitations of each measure (CV and VIF) for detecting each kind, can be very useful for determining whether there is a troubling degree of multicollinearity, what kind of multicollinearity it is and which variables are causing it. For this purpose, it is important to include in the scatter plot of the CV and VIF the lines corresponding to the established thresholds for each measure: a dashed vertical line at 0.1002506 (CV) and a dotted horizontal line at 10 (VIF). These lines determine four regions (see the following graphical representation) which can be interpreted as follows:

- A: existence of troubling non-essential and non-troubling essential multicollinearity;
- B: existence of troubling essential and non-essential multicollinearity;
- C: existence of non-troubling non-essential and troubling essential multicollinearity;
- D: non-troubling degree of existing multicollinearity (essential and non-essential).

```{r, out.width="50%", fig.align='center'}
plot(-2:20, -2:20, type = "n", xlab="Coefficient of Variation", ylab="Variance Inflation Factor")
abline(h=10, col="red", lwd=3, lty=2)
abline(h=0, col="black", lwd=1)
abline(v=0.1002506, col="red", lwd=3, lty=3)
text(-1.25, 2, "A", pos=3, col="blue")
text(-1.25, 12, "B", pos=3, col="blue")
text(10, 12, "C", pos=3, col="blue")
text(10, 2, "D", pos=3, col="blue")
```

#### Some examples

When the *cv_vif_plot* command is applied to the Wissel data, it is clear that the three independent variables have a CV greater than 0.1002506 (dashed vertical line) and a VIF greater than 10 (dotted horizontal line).
Therefore, it can be concluded immediately that the degree of multicollinearity of the non-essential type is not strong, but that of the essential type is.

```{r, out.width="50%", fig.align='center'}
x = Wissel[,-c(1,2)]
cv_vif_plot(cv_vif(x))
```

When the *cv_vif_plot* function is applied to the simulation of the second example of the *cv_vif* function, it can be observed that the second variable causes non-essential multicollinearity while the third and fourth variables cause essential multicollinearity:

```{r, out.width="50%", fig.align='center'}
set.seed(2025)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01) # variable with very little variability
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 1) # fourth variable related to the third
x5 = rnorm(obs, -1, 30)
x = cbind(cte, x2, x3, x4, x5) # the first column has to be the intercept
cv_vif_plot(cv_vif(x))
cv_vif_plot(cv_vif(x), limit=0) # note how the 'limit' argument works
```

## *rvifs* function

In their paper, Salmerón, García and García (2025) propose a redefinition of the Variance Inflation Factor (VIF) obtained by modifying the reference orthogonal model. The redefined VIF (RVIF) is defined as follows:
$$RVIF(i) = \frac{\mathbf{X}_{i}^{t} \mathbf{X}_{i}}{\mathbf{X}_{i}^{t} \mathbf{X}_{i} - \mathbf{X}_{i}^{t} \mathbf{X}_{-i} \cdot \left( \mathbf{X}_{-i}^{t} \mathbf{X}_{-i} \right)^{-1} \cdot \mathbf{X}_{-i}^{t} \mathbf{X}_{i}},$$
which shows, among other things, that the RVIF is defined for $i=1,2,\dots,k$ (supposing the design matrix $\mathbf{X}$ has $k$ independent variables (columns)). In contrast to the VIF, the RVIF can therefore also be calculated for the intercept of the linear regression model. The following considerations should also be taken into account:

- If the data are expressed in unit length, using the same transformation used to calculate the condition number, then: $$RVIF(i) = \frac{1}{1 - \mathbf{X}_{i}^{t} \mathbf{X}_{-i} \cdot \left( \mathbf{X}_{-i}^{t} \mathbf{X}_{-i} \right)^{-1} \cdot \mathbf{X}_{-i}^{t} \mathbf{X}_{i}}, \quad i=1,\dots,k.$$
- In this case (data expressed in unit length), when $\mathbf{X}_{i}$ is orthogonal to $\mathbf{X}_{-i}$, it is verified that $\mathbf{X}_{i}^{t} \mathbf{X}_{-i} = \mathbf{0}$ and, consequently, $RVIF(i) = 1$ for $i=1,\dots,k$. That is, the RVIF is always greater than or equal to 1, and its minimum value indicates the absence of multicollinearity.
- Denoting $a_{i} = \mathbf{X}_{i}^{t} \mathbf{X}_{-i} \cdot \left( \mathbf{X}_{-i}^{t} \mathbf{X}_{-i} \right)^{-1} \cdot \mathbf{X}_{-i}^{t} \mathbf{X}_{i}$, we have $RVIF(i) = \frac{1}{1-a_{i}}$, where $a_{i}$ can be interpreted as the percentage of approximate multicollinearity due to the variable $\mathbf{X}_{i}$. Note the similarity of this expression to the VIF expression: $VIF(i) = \frac{1}{1-R_{i}^{2}}$.
- Finally, Salmerón, García and García (2025) show, from a simulation for $k=3$, that if $a_{i} > 0.826$ (equivalently, if $RVIF(i) > 1/(1-0.826) \approx 5.75$), then the degree of multicollinearity is worrying. In any case, this value should be refined by considering a higher number of independent variables.

In short, while it is not necessary to transform the data, it is advisable to do so for a better interpretation of the results (as with the condition number).
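To illustrate these expressions, the following minimal sketch (our own check, not code from the package) applies the unit-length transformation, dividing each column by its Euclidean norm as is done for the condition number, and then computes $a_{i}$ and $RVIF(i)$ directly from their definitions. Applied to the simulated matrix `x` above, its output can be compared with that of the *rvifs* function presented next:

```{r}
# a_i and RVIF(i) computed directly from the formulas above.
rvif_by_hand = function(X) {
  X = apply(X, 2, function(col) col / sqrt(sum(col^2))) # unit length
  t(sapply(seq_len(ncol(X)), function(i) {
    Xi  = X[, i]
    Xmi = X[, -i, drop = FALSE]
    a_i = drop(t(Xi) %*% Xmi %*% solve(crossprod(Xmi)) %*% t(Xmi) %*% Xi)
    c(a = a_i, RVIF = 1 / (1 - a_i))
  }))
}
rvif_by_hand(x) # x = cbind(cte, x2, x3, x4, x5) from the previous chunk
```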
The *rvifs* function calculates the $RVIF(i)$ and $a_{i}$ values associated with each independent variable $i=1,2,\dots,k$. The command and arguments of this function are as follows:

```{r}
# rvifs(x, ul = TRUE, intercept = TRUE, tol = 1e-30)
```

As can be seen:

- The first argument of the function refers to the design matrix $\mathbf{X}$.
- The second argument indicates whether the data are to be transformed to unit length (the default) or whether no transformation is to be performed.
- The third argument is a novelty: the design matrix $\mathbf{X}$ may or may not have an intercept, and this argument specifies which is the case. If an intercept is present, as in the previous cases, it must correspond to the first column of $\mathbf{X}$.
- The last argument is the threshold that determines whether the system is computationally singular, as in the *cv_vif* function.

#### Some examples

When the *rvifs* function is applied to the simulation of the second example of the *cv_vif* function, it can be observed that all the variables (except the last one) have high RVIF values and a high percentage of approximate multicollinearity:

```{r}
set.seed(2025)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01) # variable with very little variability
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 1) # fourth variable related to the third
x5 = rnorm(obs, -1, 30)
x = cbind(cte, x2, x3, x4, x5) # the first column has to be the intercept
rvifs(x)
```

When applied to the data on soil characteristics, where there is no intercept in the design matrix, high values are observed for all the variables (the second call repeats the calculation after eliminating SumCation):

```{r}
x = soil[,-16]
rvifs(x, intercept=FALSE)
rvifs(x[,-2], intercept=FALSE)
```

Finally, by applying the *rvifs* function to the simple linear regression model, we can determine the degree of non-essential multicollinearity:

```{r}
cte = rep(1, obs)
set.seed(2025)
x2 = rnorm(obs, 3, 4)
x = cbind(cte, x2)
rvifs(x)
```

A high degree of non-essential multicollinearity is detected if a variable with little variability is generated:

```{r}
cte = rep(1, obs)
set.seed(2025)
x2 = rnorm(obs, 3, 0.04)
x = cbind(cte, x2)
rvifs(x)
```

If this variable is centered to solve the detected problem, the RVIFs coincide with their minimum value, indicating orthogonality between the columns of the design matrix (after centering, $\mathbf{X}_{2}^{t} \cdot \mathbf{cte} = \sum_{j} x_{2j} = 0$):

```{r}
x2 = x2 - mean(x2)
x = cbind(cte, x2)
rvifs(x)
```

If the data are not transformed to unit length, the percentages of multicollinearity due to each variable are still equal to zero. However, since in that case there is no universal minimum reference value, interpreting the RVIF becomes more complicated:

```{r}
rvifs(x, ul=FALSE)
```

This last example illustrates why it is advisable to transform the data to unit length when calculating the RVIF.

## *multicollinearity* function

Until now, measures of the degree of multicollinearity in multiple linear regression models have been used. However, detecting a certain degree of multicollinearity does not by itself establish that it affects the analysis of the model. In their working paper, Salmerón, García and García propose a statistical test, based on a region of non-rejection (which depends on a level of significance), that determines whether the degree of multicollinearity affects the statistical analysis of the model. More specifically, the test determines whether the non-rejection of the null hypothesis in the individual significance test of the coefficient associated with each independent variable is due to the degree of multicollinearity.

Ultimately, the decision rule of the test is determined by the following result:

> Given the multiple linear regression model $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u}$, the degree of multicollinearity affects its statistical analysis (with a significance level of $\alpha$) if there is a variable $i$, with $i=1,\dots,k$, that verifies $RVIF(i) > \max \{ c_{0}(i), c_{3}(i) \}$, where: $$c_{0}(i) = \left( \frac{\widehat{\beta}_{i}}{\widehat{\sigma} \cdot t_{n-k}(1-\alpha/2)} \right)^{2}$$ and $$c_{3}(i) = \left( \frac{t_{n-k}(1-\alpha/2)}{\widehat{\beta}_{i,o}} \right)^{2} \cdot \widehat{var} \left( \widehat{\beta}_{i} \right),$$ where $\widehat{\beta}_{i}$ is the OLS estimate of $\beta_{i}$, $\widehat{var} \left( \widehat{\beta}_{i} \right)$ its estimated variance, $\widehat{\beta}_{i,o}$ is the OLS estimate of $\beta_{i}$ in the orthogonal model, $\widehat{\sigma}^{2}$ is the estimated variance of the random disturbance, and $t_{n-k}(1-\alpha/2)$ is the theoretical value associated with the individual significance test.

The *multicollinearity* function allows this condition to be checked using the following code:

```{r}
# multicollinearity(y, x, alpha = 0.05)
```

The first argument of the function refers to the dependent variable of the linear regression model, the second to the design matrix containing the independent variables, and the third to the significance level considered in the statistical test. Note that, as with the *cv_vif* function, the *multicollinearity* command will only work correctly if the design matrix has the intercept in its first column. It should also be noted that, in this case, the RVIF is calculated without transforming the data to unit length, since the results obtained in the working paper by Salmerón, García and García do not assume any transformation.
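To gain some intuition about the non-rejection region, the bound $c_{0}(i)$ can be computed directly from a fitted model. The following minimal sketch (our own illustration, not the package's internal code) does so for the Wissel regression fitted earlier; here $\widehat{\sigma}$ is taken to be the residual standard error, and $c_{3}(i)$, which also requires $\widehat{\beta}_{i,o}$ from the orthogonal reference model, is left to the *multicollinearity* function itself:

```{r}
# c_0(i) = (beta_hat_i / (sigma_hat * t_{n-k}(1 - alpha/2)))^2 for every
# coefficient of a fitted lm model.
c0_bound = function(reg, alpha = 0.05) {
  t_crit = qt(1 - alpha/2, df = reg$df.residual)
  (coef(reg) / (summary(reg)$sigma * t_crit))^2
}
c0_bound(reg_W)
```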
### A first example: Wissel data

In the Wissel data on outstanding mortgage debt, the model is jointly significant while, at the same time, the null hypothesis of individual significance is not rejected for any of the coefficients of the independent variables:

```{r}
summary(reg_W)
```

Let us remember that a high degree of essential multicollinearity due to the three independent variables was previously determined. However, when the *multicollinearity* command is used, it can be observed that only the second column of the design matrix (the personal consumption variable, C) is affected by the detected problem:

```{r}
y = Wissel[,2]
x = Wissel[,3:6]
multicollinearity(y, x)
```

In other words, it can be concluded that the failure to reject the null hypothesis in the test associated with the personal consumption (C) variable is due to the degree of multicollinearity in the model.

### A second example: Klein and Goldberger data

The *multiColl* package contains data from Klein and Goldberger on consumption and wage income.

```{r}
data(KG)
attach(KG)
reg_KG = lm(consumption~wage.income+non.farm.income+farm.income)
detach(KG)
summary(reg_KG)
```

Once again, at a 5% significance level, the null hypothesis is not rejected in the individual significance tests, while it is rejected in the joint significance test. This situation is considered a sign of strong multicollinearity.
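This contrast can be read directly from the fitted model: the following sketch (standard R, nothing specific to the *rvif* package) extracts the individual p-values and the p-value of the joint F-test from the summary above:

```{r}
s = summary(reg_KG)
s$coefficients[, "Pr(>|t|)"] # individual t-test p-values
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3],
   lower.tail = FALSE)       # joint F-test p-value
```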
Using the *multicollinearity* command, it is possible to determine that the statistical analysis of the second variable is affected by the degree of multicollinearity in the model:

```{r}
head(KG, n=5)
y = KG[,1]
x = model.matrix(reg_KG)
multicollinearity(y, x)
```

### A third example: the case of the simple linear regression

The work of Salmerón, García and García (2025) illustrates the limitations of the *vif* command in the *car* package for detecting the degree of multicollinearity in a simple linear regression model. The data used for this purpose in that work are available in the *rvif* package. For more details, use *help(SLM1)* and *help(SLM2)*. If the first data set is used, the *vif* command indeed indicates that the model has fewer than two terms and does not perform any calculations:

```{r, error = TRUE}
head(SLM1, n=5)
attach(SLM1)
reg_SLM1 = lm(y1~V)
detach(SLM1)
library(car)
vif(reg_SLM1)
```

In other words, it denies the possibility that the intercept could form part of the model as another independent variable and that, consequently, there could be a multicollinearity problem. However, if the *multicollinearity* command is used, it is possible to determine the degree of multicollinearity in the model and whether it affects the statistical analysis of the model:

```{r}
y = SLM1[,1]
x = SLM1[,-1]
multicollinearity(y, x)
```

While the degree of multicollinearity does not affect the statistical analysis of the model in the first data set, it does affect it in the second:

```{r}
head(SLM2, n=5)
y = SLM2[,1]
x = SLM2[,-1]
multicollinearity(y, x)
```

It can be observed that when the significance level is changed, the values of the non-rejection region (determined by the values $c_{0}(i)$ and $c_{3}(i)$) change:

```{r}
multicollinearity(y, x, alpha=0.01)
```

### A fourth example: soil characteristics data

Using the *rvifs* command on the soil characteristics data revealed that all the variables had high RVIF values and a high percentage of multicollinearity. Meanwhile, using the *cv_vif* command revealed that the pH and ExchAc variables caused a non-essential multicollinearity problem, while all other variables except Na, P, Zn and Mn caused essential multicollinearity. Using the *multicollinearity* command, it is observed that only the statistical analysis of the intercept, CECbuffer and HumicMatter is affected by the degree of multicollinearity present:

```{r}
y = soil[,16]
x = soil[,-16]
intercept = rep(1, length(y))
x = cbind(intercept, x) # the design matrix has to have the intercept in the first column
multicollinearity(y, x)
multicollinearity(y, x[,-3]) # eliminating the problematic variable (SumCation)
names(x[,-3])
```

## Bibliography

- Bondell, H.D. and Reich, B.J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1), 115-123, doi: https://doi.org/10.1111/j.1541-0420.2007.00843.x.
- Salmerón, R., García, C.B. and García, J. (2018). Variance Inflation Factor and Condition Number in multiple linear regression. Journal of Statistical Computation and Simulation, 88(12), 2365-2384, doi: https://doi.org/10.1080/00949655.2018.1463376.
- Salmerón, R., Rodríguez, A. and García, C.B. (2020). Diagnosis and quantification of the non-essential collinearity. Computational Statistics, 35(2), 647-666, doi: https://doi.org/10.1007/s00180-019-00922-x.
- Salmerón, R., García, C.B. and García, J. (2025). A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics, 65, 337-363, doi: https://doi.org/10.1007/s10614-024-10575-8.
- Salmerón, R., García, C.B. and García, J. (working paper). Overcoming the inconsistencies of the Variance Inflation Factor: a redefined VIF and a test to detect statistical troubling multicollinearity. Submitted to The R Journal, https://arxiv.org/pdf/2005.02245.