Example Multivariate Adjustment Workflow

library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.1-10
library(rACMEMEEV)
set.seed(987654)

Let’s imagine a scenario where you have recently done a survey and gathered information on three different food items that people consume via an FFQ. For ease, let’s call them x, y, and z. Now, because measuring what people eat is quite complicated and not necessarily always accurate, there is a tacit assumption that there is measurement error. Normally, in these circumstances, researchers will conduct a validation study to understand how different the overall population deviates from the validation study. This is typically done by using a more accurate form of dietary assessment (24 hour recall or urinary recovery biomarkers). Often times this is a very costly step and not always practical. In these cases, we can turn to multivariate measurement error adjustment. The rest of this vignette will demo how to use this library and show its performance. This library is built upon the work by Muoka et al. 2020.

Setting up the Workflow

Normally dietary consumption data is characterized by a gamma distribution (Passerelli et al. 2022), so we that’s the data that will be generated. But, because we know that there is measurement error in the gathering of our experimental variables of x, y, and z, we will need to taint our data to simulate this. This can be done by introducing an additive \(\epsilon\) value to each of the variables, which will simulate by adding a normally distributed random error term.

Before we go further, we will generate an output variable, conveniently named output, that is a linear combination of the three non-measurement error tainted exposures. With this, we can then fit a model that shows us how the three exposures are related to the output. This gives us the ultimate “true” relationship between the exposures and output that is being clouded by the measurement error. Which, also allows us to look back at the end and assess how close the adjusted covariates via multivariate measurement error adjustment are to the “true” relationship.

simulate_gamma_mix <- function(n, a1 = 3, a2 = 4, a3 = 5) {
  shape1 <- 2
  rate1 <- 2
  shape2 <- 3
  rate2 <- 3
  shape3 <- 5
  rate3 <- 5

  x <- rgamma(n, shape1, rate1)
  y <- rgamma(n, shape2, rate2)
  z <- rgamma(n, shape3, rate3)
  output <- a1 * x + a2 * y + a3 * z + rnorm(n)

  frame <- data.frame(
    list(
      output = output,
      x = x,
      y = y,
      z = z
    )
  )
  return(frame)
}

The above code block generates three gamma distributions of input consumption data with a yielded output that is a linear combination of these inputs.

df <- simulate_gamma_mix(100)
head(df, 10)
#>       output         x         y         z
#> 1   8.886392 1.4884158 0.5664902 0.4330269
#> 2   5.673900 1.1749567 0.3490859 0.4502716
#> 3  14.354707 0.6011771 1.3360054 1.3566710
#> 4   8.640942 0.8426431 0.9276903 0.7586085
#> 5  10.755779 0.8854849 0.7491859 0.8433502
#> 6  17.362400 1.2975287 1.0343964 1.6273187
#> 7  14.152803 0.4781479 2.0028591 1.0814472
#> 8  13.511813 0.3113799 1.5703030 1.1260048
#> 9  15.159301 1.8270135 1.6296193 0.9581821
#> 10 10.430072 0.4189057 0.9167687 1.4367271

Now, we can fit a model and see what the resultant coefficients are for each of x, y, and z.

model <- glm("output ~ x + y + z", data = df)
summary(model)
#> 
#> Call:
#> glm(formula = "output ~ x + y + z", data = df)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.01221    0.30158   -0.04    0.968    
#> x            3.09764    0.10936   28.33   <2e-16 ***
#> y            3.81939    0.16940   22.55   <2e-16 ***
#> z            4.99010    0.19621   25.43   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 0.8312565)
#> 
#>     Null deviance: 1766.939  on 99  degrees of freedom
#> Residual deviance:   79.801  on 96  degrees of freedom
#> AIC: 271.22
#> 
#> Number of Fisher Scoring iterations: 2

As we can see, given the above code, the output is directly related to the combination of the three input exposures and the coefficients are the specified a1, a2, and a3 parameters as well.

Generate the Tainted Dataset

Now we can see the tainted dataset. Using the created dataframe, we can then add a normally distributed error term, the measurement error, to each of the exposure estimates.

tainted_df <- df
tainted_df$x <- df$x + rnorm(100, mean = 2)
tainted_df$y <- df$y + rnorm(100, mean = 2)
tainted_df$z <- df$z + rnorm(100, mean = 2)
head(tainted_df, 10)
#>       output        x        y        z
#> 1   8.886392 5.353364 1.491154 3.353126
#> 2   5.673900 2.395862 4.459188 3.553458
#> 3  14.354707 3.581042 4.309288 2.149549
#> 4   8.640942 3.631810 2.727864 1.982963
#> 5  10.755779 1.460380 3.052161 3.978306
#> 6  17.362400 3.033691 2.591469 3.435692
#> 7  14.152803 2.266423 3.292294 3.293816
#> 8  13.511813 3.270974 4.539700 2.165729
#> 9  15.159301 3.905160 1.683995 2.519602
#> 10 10.430072 2.217785 2.374776 2.734878

This resultant tainted dataset can now be fit with a similar GLM, and then we can see how this measurement error effects real-world studies.

tainted_model <- glm("output ~ x + y + z", data = tainted_df)
summary(tainted_model)
#> 
#> Call:
#> glm(formula = "output ~ x + y + z", data = tainted_df)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.7401     1.9210   1.426  0.15699    
#> x             0.9756     0.2917   3.345  0.00118 ** 
#> y             1.4191     0.3422   4.146 7.29e-05 ***
#> z             0.7174     0.3529   2.033  0.04481 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 13.97885)
#> 
#>     Null deviance: 1766.9  on 99  degrees of freedom
#> Residual deviance: 1342.0  on 96  degrees of freedom
#> AIC: 553.46
#> 
#> Number of Fisher Scoring iterations: 2

It is clear to see how this “measurement error” has significantly impacted our view of the world. The estimated coefficients on x, y, and z are vastly different and do not at all reflect the set values of 3, 4, and 5. Thus, we can see how and why this measurement error adjustment is so important.

Generate Validity Coefficients

As stated above, in the absence of the running a validation study, we can leverage literature based validity coefficients to help adjust for the measurement error. Because this developed method is Bayesian in nature, we can instead use a distribution over the range of the validity coefficients to add more prior data. In this library the generate_coefficient function can easily do this for us. This performs a Fisher-Z transformation on the values to normalize the distribution and then back transforms it to yield the resultant mean value of the distribution. Below is an example of this.

x_v_coef <- generate_coefficient(1000, 0.4, 0.7, 0.95)
y_v_coef <- generate_coefficient(1000, 0.5, 0.7, 0.95)
z_v_coef <- generate_coefficient(1000, 0.3, 0.6, 0.95)

cat(sprintf(
  "X validity coefficient: %s\nY validity coefficient: %s\nZ validity coefficient: %s", x_v_coef, y_v_coef, z_v_coef
))
#> X validity coefficient: 0.566634647205238
#> Y validity coefficient: 0.608293709304761
#> Z validity coefficient: 0.46117766768142

Pre-Model Fitting

A pre-model is needed for to solve for the covariances in the observed data. This covariance structure will allow us to then continue and solve the attenuation-contamination matrix and finally adjust the coefficients of the fitted model. Please see the references for a much more in-depth analysis of the proposed methodology as well as assumptions and decisions made along the way.

output <- acme_model(tainted_df, c("x", "y", "z"))
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 100
#>    Unobserved stochastic nodes: 4
#>    Total graph size: 166
#> 
#> Initializing model

The posterior of this model can be sampled using either a JAGS or Stan sampler, but the Stan sampler should be treated as experimental. In Muoka et. al 2020’s paper, the JAGS backend was used.

With this fitted model, a summary of the covariance structure can be assessed:

output$covariance_matrix
#> 
#> Iterations = 1001:11000
#> Thinning interval = 1 
#> Number of chains = 1 
#> Sample size per chain = 10000 
#> 
#> 1. Empirical mean and standard deviation for each variable,
#>    plus standard error of the mean:
#> 
#>              Mean        SD  Naive SE Time-series SE
#> rho[1,1]  1.00000 1.409e-16 1.409e-18      0.0000000
#> rho[2,1]  0.03249 9.887e-02 9.887e-04      0.0009769
#> rho[3,1] -0.09363 9.836e-02 9.836e-04      0.0009836
#> rho[1,2]  0.03249 9.887e-02 9.887e-04      0.0009769
#> rho[2,2]  1.00000 1.408e-16 1.408e-18      0.0000000
#> rho[3,2] -0.12599 9.774e-02 9.774e-04      0.0009774
#> rho[1,3] -0.09363 9.836e-02 9.836e-04      0.0009836
#> rho[2,3] -0.12599 9.774e-02 9.774e-04      0.0009774
#> rho[3,3]  1.00000 0.000e+00 0.000e+00      0.0000000
#> sigma[1]  1.29782 9.259e-02 9.259e-04      0.0009259
#> sigma[2]  1.10944 7.958e-02 7.958e-04      0.0007958
#> sigma[3]  1.08123 7.700e-02 7.700e-04      0.0007700
#> 
#> 2. Quantiles for each variable:
#> 
#>             2.5%      25%      50%      75%   97.5%
#> rho[1,1]  1.0000  1.00000  1.00000  1.00000 1.00000
#> rho[2,1] -0.1568 -0.03263  0.03257  0.09860 0.22773
#> rho[3,1] -0.2860 -0.16204 -0.09435 -0.02647 0.10091
#> rho[1,2] -0.1568 -0.03263  0.03257  0.09860 0.22773
#> rho[2,2]  1.0000  1.00000  1.00000  1.00000 1.00000
#> rho[3,2] -0.3141 -0.19238 -0.12677 -0.06026 0.06758
#> rho[1,3] -0.2860 -0.16204 -0.09435 -0.02647 0.10091
#> rho[2,3] -0.3141 -0.19238 -0.12677 -0.06026 0.06758
#> rho[3,3]  1.0000  1.00000  1.00000  1.00000 1.00000
#> sigma[1]  1.1319  1.23351  1.29199  1.35581 1.49623
#> sigma[2]  0.9694  1.05315  1.10438  1.15979 1.28022
#> sigma[3]  0.9429  1.02830  1.07699  1.13066 1.24478

And the model used to solve can be viewed:

output$model
#> JAGS model:
#> 
#> 
#>   model {
#>     for (i in 1:n) {
#>       model_data[i, 1:p] ~ dmnorm( zMu[1:p] , zInvCovMat[1:p, 1:p] )
#>     }
#> 
#>     for (varIdx in 1:p) {
#>       # -- uninformative prior for zMu -- #
#>       zMu[varIdx] ~ dnorm(0, 1000000)
#>     }
#>     zInvCovMat ~ dwish(zRmat[1:p, 1:p], zdf)
#> 
#>     # -- Convert invCovMat to SD and correlation -- #
#>     zCovMat <- inverse(zInvCovMat)
#>     for (varIdx in 1:p) {
#>       zSigma[varIdx] <- sqrt(zCovMat[varIdx,varIdx])
#>     }
#>     for (varIdx1 in 1:p) {
#>       for (varIdx2 in 1:p) {
#>         zRho[varIdx1, varIdx2] <- (
#>           zCovMat[varIdx1, varIdx2] / (zSigma[varIdx1] * zSigma[varIdx2])
#>         )
#>       }
#>     }
#> 
#>     for (varIdx in 1:p) {
#>       sigma[varIdx] <- zSigma[varIdx] * sd_orig[, varIdx]
#>       mu[varIdx] <- zMu[varIdx] * sd_orig[, varIdx] + mean_orig[, varIdx]
#>     }
#>     for (varIdx1 in 1:p) {
#>       for (varIdx2 in 1:p) {
#>         rho[varIdx1, varIdx2] <- zRho[varIdx1, varIdx2]
#>       }
#>     }
#>   }
#>   
#> Fully observed variables:
#>  mean_orig model_data n p sd_orig zRmat zdf

Attenutation Contamination Matrix

Now the attenuation contamination matrix can be solved. This matrix will allow for the error structure to be used in adjusting a model based on the various exposure estimates.

validity_coefficients <- c(x_v_coef, y_v_coef, z_v_coef)
lambda <- attenuation_matrix(output, c("x", "y", "z"), validity_coefficients)

The diagonal values of the matrix are subsequently used to adjust a fitted model in the next step to adjust the resultant coefficients.

lambda$matrix
#>               x            y           z
#> x  0.3215063804  0.001684071 0.006131843
#> y  0.0000614873  0.371217889 0.009765095
#> z -0.0042398293 -0.010257094 0.210882347

Multivariate Model

Now that we have solved for the attenuation-contamination matrix, a model can be fitted and comparisons can be made between a naive model (i.e. no error adjustment) and the multivariate adjusted model.

model_output <- multivariate_model(
  "output ~ x + y + z",
  data = tainted_df,
  columns = c("x", "y", "z"),
  a_c_matrix = lambda$matrix,
  sds = lambda$sds,
  variances = lambda$variances,
  univariate = TRUE
)

Please note that the univariate = TRUE argument specifies that the naive model object should be returned. This allows for plotting and comparisons to be made.

Comparing Between Unadjusted and Adjusted

Recall the original model yielded coefficients that were not at all close to the “true” simulated related from earlier.

summary(tainted_model)
#> 
#> Call:
#> glm(formula = "output ~ x + y + z", data = tainted_df)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.7401     1.9210   1.426  0.15699    
#> x             0.9756     0.2917   3.345  0.00118 ** 
#> y             1.4191     0.3422   4.146 7.29e-05 ***
#> z             0.7174     0.3529   2.033  0.04481 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 13.97885)
#> 
#>     Null deviance: 1766.9  on 99  degrees of freedom
#> Residual deviance: 1342.0  on 96  degrees of freedom
#> AIC: 553.46
#> 
#> Number of Fisher Scoring iterations: 2
summary(model)
#> 
#> Call:
#> glm(formula = "output ~ x + y + z", data = df)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -0.01221    0.30158   -0.04    0.968    
#> x            3.09764    0.10936   28.33   <2e-16 ***
#> y            3.81939    0.16940   22.55   <2e-16 ***
#> z            4.99010    0.19621   25.43   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 0.8312565)
#> 
#>     Null deviance: 1766.939  on 99  degrees of freedom
#> Residual deviance:   79.801  on 96  degrees of freedom
#> AIC: 271.22
#> 
#> Number of Fisher Scoring iterations: 2

But, with the proposed multivariate adjustment methodology can truly show how powerful it is.

summary(model_output$multivariate)
#>        x                y                z         
#>  Min.   :-1.137   Min.   :0.4442   Min.   :-3.205  
#>  1st Qu.: 2.455   1st Qu.:3.2656   1st Qu.: 2.013  
#>  Median : 3.067   Median :3.8929   Median : 3.138  
#>  Mean   : 3.066   Mean   :3.9019   Mean   : 3.139  
#>  3rd Qu.: 3.671   3rd Qu.:4.5408   3rd Qu.: 4.268  
#>  Max.   : 6.458   Max.   :7.7536   Max.   : 9.625

Looking through at the mean values of the adjusted coefficients, we can see that we now show x = 1.46, y = 3.30, and z = 3.37. Of course, these do not exactly match with the known simulated \(\alpha\) parameters of 3, 4, and 5, but they are much closer to reality than what had previously been derived from the tainted dataset. Even more importantly, when we look at the 97.5% percentile, we can see that 3, 4, and 5 are contained within each one of the adjusted covariates. This means that we have recovered a lot of the ground truth from the tainted dataset.

quantile(model_output$multivariate$x, 0.975)
#>    97.5% 
#> 4.871627
quantile(model_output$multivariate$y, 0.975)
#>    97.5% 
#> 5.721841
quantile(model_output$multivariate$z, 0.975)
#>    97.5% 
#> 6.470592

Another nice ergonomic of the library is the ability to directly look at plots comparing the unadjusted and adjusted coefficients. Again, since we know that the reality is 3, 4, and 5, respectively for x, y, and z, we can much more clearly see how the adjusted coefficient distribution (blue line) is much closer and even contains the reality compared to the unadjusted coefficient distribution (red line).

plots <- plot_covariates(model_output, c("x", "y", "z"))
#> No id variables; using all as measure variables
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the rACMEMEEV package.
#>   Please report the issue at <https://github.com/westford14/rACME-MEEV/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> No id variables; using all as measure variables
#> No id variables; using all as measure variables
plots$x

plots$y

plots$z

A Quasi-Sensitivity Analysis

One thing to note is that depending on the validity coefficients, we can potentially adjust out more of the measurement error. There is an interesting confluence of factors between the strength of association of an exposure and its resulting effect on the measurement error adjustment. Looking at the z exposure: we set the association to be 5. Let’s look at what happens when we reduce the strength of association.

First let’s look at reducing the strength of association between the two exposure and the outcome.

df <- simulate_gamma_mix(100, a3 = 2)
head(df, 10)
#>       output         x         y         z
#> 1  13.181947 1.8244780 2.2692883 0.4517797
#> 2   8.757858 0.4440549 0.8786792 1.1566166
#> 3   8.063726 0.7796099 1.3721743 0.4640524
#> 4   4.201791 0.2260912 0.7166295 0.9217702
#> 5   6.031654 0.6864201 0.6548484 0.8998979
#> 6  13.602016 0.4014021 1.9616123 1.6035283
#> 7  17.116508 2.4952813 1.3514314 1.7624075
#> 8   7.627437 0.7120872 0.7557619 1.1326039
#> 9   2.940515 0.2323428 0.3984574 0.8797067
#> 10  8.688601 0.1406167 1.1074552 0.8710625

From here we can follow the same procedure as above and re-use the validity coefficients to then show how this strength of association reduction changes the measurement error adjustment. Let’s taint the dataset.

tainted_df_lower_assoc <- df
tainted_df_lower_assoc$x <- df$x + rnorm(100, mean = 2)
tainted_df_lower_assoc$y <- df$y + rnorm(100, mean = 2)
tainted_df_lower_assoc$z <- df$z + rnorm(100, mean = 2)
head(tainted_df_lower_assoc, 10)
#>       output        x        y         z
#> 1  13.181947 2.369628 2.855428 3.8620031
#> 2   8.757858 2.806635 4.128599 3.6763995
#> 3   8.063726 2.504857 4.518488 1.8930857
#> 4   4.201791 3.368679 2.659698 2.7233855
#> 5   6.031654 2.321898 3.237331 0.3067704
#> 6  13.602016 3.059397 3.512237 2.4065276
#> 7  17.116508 3.259143 2.913477 2.4865265
#> 8   7.627437 3.529231 2.960672 2.4626126
#> 9   2.940515 3.558587 2.441093 4.5495388
#> 10  8.688601 3.437748 2.968184 2.6295337

And now we can fit the measurement error model.

output <- acme_model(tainted_df_lower_assoc, c("x", "y", "z"))
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 100
#>    Unobserved stochastic nodes: 4
#>    Total graph size: 166
#> 
#> Initializing model
lambda <- attenuation_matrix(output, c("x", "y", "z"), validity_coefficients)
model_output <- multivariate_model(
  "output ~ x + y + z",
  data = tainted_df_lower_assoc,
  columns = c("x", "y", "z"),
  a_c_matrix = lambda$matrix,
  sds = lambda$sds,
  variances = lambda$variances,
  univariate = TRUE
)

And now we can see the model:

summary(model_output$multivariate)
#>        x                y               z          
#>  Min.   :-0.598   Min.   :0.834   Min.   :-2.5214  
#>  1st Qu.: 2.637   1st Qu.:3.122   1st Qu.: 0.7513  
#>  Median : 3.201   Median :3.628   Median : 1.4707  
#>  Mean   : 3.198   Mean   :3.628   Mean   : 1.4784  
#>  3rd Qu.: 3.754   3rd Qu.:4.131   3rd Qu.: 2.1954  
#>  Max.   : 6.304   Max.   :6.781   Max.   : 5.4852

Here we can see that the coefficient is now approximately 1.5, which is much closer to the actual coefficient than it was previously when we had the greater strength of association.

Summary

Here we have shown how the proposed methodology based off of Muoka et al. 2020 can successfully adjust for measurement error in a dataset. This has been proven via a simulated, tainted dataset that the coefficients can then be recovered to closely mirror what the true data was.

Other Resources

The National Cancer Institute has its own error adjustment model (Zhang et. al 2011) that can be used as well. They offer their own vignettes that can be used to understand other idiosyncracies of measurement error adjustment.

Traceplots and ACFs

If you would like, the traceplots and ACFs are available for all fitted models.

Traceplots:

traceplots(model_output$naive, c("x", "y", "z"))

ACF Plots:

acf_plots(model_output$naive, c("x", "y", "z"))

References

  1. Muoka, A. K., Agogo, G. O., Ngesa, O. O., & Mwambi, H. G. (2020b). A Method to adjust for measurement error in multiple exposure variables measured with correlated errors in the absence of an internal validation study. F1000Research, 9, 1486. https://doi.org/10.12688/f1000research.27892.1
  2. Passarelli, S., Free, C. M., Allen, L. H., Batis, C., Beal, T., Biltoft-Jensen, A. P., Bromage, S., Cao, L., Castellanos-Gutiérrez, A., Christensen, T., Crispim, S. P., Dekkers, A., De Ridder, K., Kronsteiner-Gicevic, S., Lee, C., Li, Y., Moursi, M., Moyersoen, I., Schmidhuber, J., … Golden, C. D. (2022). Estimating national and subnational nutrient intake distributions of global diets. The American Journal of Clinical Nutrition, 116(2), 551–560. https://doi.org/10.1093/ajcn/nqac108
  3. Wikipedia contributors. (2025, September 4). Fisher transformation. Wikipedia. https://en.wikipedia.org/wiki/Fisher_transformation
  4. Zhang, S., Midthune, D., Guenther, P. M., Krebs-Smith, S. M., Kipnis, V., Dodd, K. W., Buckman, D. W., Tooze, J. A., Freedman, L., & Carroll, R. J. (2011). A NEW MULTIVARIATE MEASUREMENT ERROR MODEL WITH ZERO-INFLATED DIETARY DATA, AND ITS APPLICATION TO DIETARY ASSESSMENT. The Annals of Applied Statistics, 5(2B), 1456–1487. https://doi.org/10.1214/10-AOAS446

Maintainers