library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.1-10
library(rACMEMEEV)
set.seed(987654)
Let’s imagine a scenario where you have recently run a survey and
gathered information on three different food items that people consume,
reported via a food frequency questionnaire (FFQ). For ease, let’s call
them x, y, and z. Now, because measuring what people eat is quite
complicated and not always accurate, there is a tacit assumption that
there is measurement error. Normally, in these circumstances,
researchers conduct a validation study to understand how far the
FFQ-reported intakes deviate from a more accurate reference, typically a
more accurate form of dietary assessment such as a 24-hour recall or
urinary recovery biomarkers. Oftentimes this is a very costly step and
not always practical. In these cases, we can turn to multivariate
measurement error adjustment. The rest of this vignette will demonstrate
how to use this library and show its performance. This library is built
upon the work by Muoka et al. 2020.
Normally, dietary consumption data is characterized by a gamma
distribution (Passerelli et al. 2022), so that’s the kind of data that
will be generated. But, because we know that there is measurement error
in the gathering of our experimental variables x, y, and z, we will need
to taint our data to simulate this. This can be done by introducing an
additive \(\epsilon\) value to each of the variables, which we will
simulate by adding a normally distributed random error term.
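Concretely, this corresponds to the classical additive measurement error model. As a rough sketch of the assumption (the error mean and variance used later in this vignette are arbitrary simulation choices, not estimates):
\[
x^{*} = x + \epsilon, \qquad \epsilon \sim \mathcal{N}(\mu_{\epsilon}, \sigma_{\epsilon}^{2}),
\]
and analogously for y and z, where \(x^{*}\) is the error-prone (FFQ-reported) exposure and \(x\) is the true exposure.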
Before we go further, we will generate an output variable,
conveniently named output, that is a linear combination of
the three exposures before any measurement error is added. With this, we
can then fit a model that shows us how the three exposures are related
to the output. This gives us the ultimate “true” relationship between
the exposures and the output that will later be clouded by the
measurement error. It also allows us to look back at the end and assess
how close the coefficients adjusted via multivariate measurement error
adjustment are to the “true” relationship.
simulate_gamma_mix <- function(n, a1 = 3, a2 = 4, a3 = 5) {
  # Shape and rate parameters for the three gamma-distributed exposures
  shape1 <- 2
  rate1 <- 2
  shape2 <- 3
  rate2 <- 3
  shape3 <- 5
  rate3 <- 5
  # Error-free exposures
  x <- rgamma(n, shape1, rate1)
  y <- rgamma(n, shape2, rate2)
  z <- rgamma(n, shape3, rate3)
  # Outcome: a linear combination of the exposures plus standard normal noise
  output <- a1 * x + a2 * y + a3 * z + rnorm(n)
  frame <- data.frame(
    list(
      output = output,
      x = x,
      y = y,
      z = z
    )
  )
  return(frame)
}
The above code block generates three gamma-distributed input consumption variables and an output that is a linear combination of these inputs plus standard normal noise.
df <- simulate_gamma_mix(100)
head(df, 10)
#> output x y z
#> 1 8.886392 1.4884158 0.5664902 0.4330269
#> 2 5.673900 1.1749567 0.3490859 0.4502716
#> 3 14.354707 0.6011771 1.3360054 1.3566710
#> 4 8.640942 0.8426431 0.9276903 0.7586085
#> 5 10.755779 0.8854849 0.7491859 0.8433502
#> 6 17.362400 1.2975287 1.0343964 1.6273187
#> 7 14.152803 0.4781479 2.0028591 1.0814472
#> 8 13.511813 0.3113799 1.5703030 1.1260048
#> 9 15.159301 1.8270135 1.6296193 0.9581821
#> 10 10.430072 0.4189057 0.9167687 1.4367271
Now, we can fit a model and see what the resultant coefficients are
for each of x, y, and z.
model <- glm("output ~ x + y + z", data = df)
summary(model)
#>
#> Call:
#> glm(formula = "output ~ x + y + z", data = df)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.01221 0.30158 -0.04 0.968
#> x 3.09764 0.10936 28.33 <2e-16 ***
#> y 3.81939 0.16940 22.55 <2e-16 ***
#> z 4.99010 0.19621 25.43 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.8312565)
#>
#> Null deviance: 1766.939 on 99 degrees of freedom
#> Residual deviance: 79.801 on 96 degrees of freedom
#> AIC: 271.22
#>
#> Number of Fisher Scoring iterations: 2
As we can see from the above output, the output is directly related to
the combination of the three input exposures, and the estimated
coefficients closely match the specified a1, a2, and a3 parameters.
Now we can create the tainted dataset. Using the dataframe created above, we add a normally distributed error term, the measurement error, to each of the exposures.
tainted_df <- df
tainted_df$x <- df$x + rnorm(100, mean = 2)
tainted_df$y <- df$y + rnorm(100, mean = 2)
tainted_df$z <- df$z + rnorm(100, mean = 2)
head(tainted_df, 10)
#> output x y z
#> 1 8.886392 5.353364 1.491154 3.353126
#> 2 5.673900 2.395862 4.459188 3.553458
#> 3 14.354707 3.581042 4.309288 2.149549
#> 4 8.640942 3.631810 2.727864 1.982963
#> 5 10.755779 1.460380 3.052161 3.978306
#> 6 17.362400 3.033691 2.591469 3.435692
#> 7 14.152803 2.266423 3.292294 3.293816
#> 8 13.511813 3.270974 4.539700 2.165729
#> 9 15.159301 3.905160 1.683995 2.519602
#> 10 10.430072 2.217785 2.374776 2.734878
This resultant tainted dataset can now be fit with the same GLM formula, and then we can see how this measurement error affects real-world studies.
tainted_model <- glm("output ~ x + y + z", data = tainted_df)
summary(tainted_model)
#>
#> Call:
#> glm(formula = "output ~ x + y + z", data = tainted_df)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.7401 1.9210 1.426 0.15699
#> x 0.9756 0.2917 3.345 0.00118 **
#> y 1.4191 0.3422 4.146 7.29e-05 ***
#> z 0.7174 0.3529 2.033 0.04481 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 13.97885)
#>
#> Null deviance: 1766.9 on 99 degrees of freedom
#> Residual deviance: 1342.0 on 96 degrees of freedom
#> AIC: 553.46
#>
#> Number of Fisher Scoring iterations: 2
It is clear how this “measurement error” has significantly
impacted our view of the world. The estimated coefficients on
x, y, and z are vastly different
and no longer reflect the set values of 3, 4, and 5. This is why
measurement error adjustment is so important.
As stated above, in the absence of a validation study, we
can leverage literature-based validity coefficients to help adjust for
the measurement error. Because this method is Bayesian in nature, we can
use a distribution over the range of the published validity coefficients
to incorporate more prior information. In this library the
generate_coefficient function does this for us.
It performs a Fisher-Z transformation on the
values to normalize the distribution and then back-transforms to
yield the resultant mean value of the distribution. Below is an example
of this.
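For intuition, here is a minimal base-R sketch of the Fisher-Z idea. This is an illustrative assumption about how such a coefficient could be produced from a published range, not the actual internals or signature of generate_coefficient; the helper name and the interpretation of the bounds as a confidence interval are hypothetical.
# Illustrative only -- not the library's implementation
fisher_z_coefficient <- function(n, lower, upper, level = 0.95) {
  z_lower <- atanh(lower)  # Fisher-Z (arctanh) transform of the bounds
  z_upper <- atanh(upper)
  z_mid <- (z_lower + z_upper) / 2  # centre of the transformed range
  # Treat the transformed range as a confidence interval on the z scale
  z_sd <- (z_upper - z_lower) / (2 * qnorm(1 - (1 - level) / 2))
  draws <- rnorm(n, mean = z_mid, sd = z_sd)  # sample on the z scale
  tanh(mean(draws))  # back-transform the mean to the correlation scale
}
fisher_z_coefficient(1000, 0.4, 0.7)
The actual library call is shown below.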
x_v_coef <- generate_coefficient(1000, 0.4, 0.7, 0.95)
y_v_coef <- generate_coefficient(1000, 0.5, 0.7, 0.95)
z_v_coef <- generate_coefficient(1000, 0.3, 0.6, 0.95)
cat(sprintf(
"X validity coefficient: %s\nY validity coefficient: %s\nZ validity coefficient: %s", x_v_coef, y_v_coef, z_v_coef
))
#> X validity coefficient: 0.566634647205238
#> Y validity coefficient: 0.608293709304761
#> Z validity coefficient: 0.46117766768142
A pre-model is needed to solve for the covariances in the observed data. This covariance structure will allow us to then solve the attenuation-contamination matrix and finally adjust the coefficients of the fitted model. Please see the references for a much more in-depth analysis of the proposed methodology, as well as the assumptions and decisions made along the way.
output <- acme_model(tainted_df, c("x", "y", "z"))
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 100
#> Unobserved stochastic nodes: 4
#> Total graph size: 166
#>
#> Initializing model
The posterior of this model can be sampled using either a JAGS or Stan sampler, but the Stan sampler should be treated as experimental. In the Muoka et al. 2020 paper, the JAGS backend was used.
With this fitted model, the covariance structure can be summarized:
output$covariance_matrix
#>
#> Iterations = 1001:11000
#> Thinning interval = 1
#> Number of chains = 1
#> Sample size per chain = 10000
#>
#> 1. Empirical mean and standard deviation for each variable,
#> plus standard error of the mean:
#>
#> Mean SD Naive SE Time-series SE
#> rho[1,1] 1.00000 1.409e-16 1.409e-18 0.0000000
#> rho[2,1] 0.03249 9.887e-02 9.887e-04 0.0009769
#> rho[3,1] -0.09363 9.836e-02 9.836e-04 0.0009836
#> rho[1,2] 0.03249 9.887e-02 9.887e-04 0.0009769
#> rho[2,2] 1.00000 1.408e-16 1.408e-18 0.0000000
#> rho[3,2] -0.12599 9.774e-02 9.774e-04 0.0009774
#> rho[1,3] -0.09363 9.836e-02 9.836e-04 0.0009836
#> rho[2,3] -0.12599 9.774e-02 9.774e-04 0.0009774
#> rho[3,3] 1.00000 0.000e+00 0.000e+00 0.0000000
#> sigma[1] 1.29782 9.259e-02 9.259e-04 0.0009259
#> sigma[2] 1.10944 7.958e-02 7.958e-04 0.0007958
#> sigma[3] 1.08123 7.700e-02 7.700e-04 0.0007700
#>
#> 2. Quantiles for each variable:
#>
#> 2.5% 25% 50% 75% 97.5%
#> rho[1,1] 1.0000 1.00000 1.00000 1.00000 1.00000
#> rho[2,1] -0.1568 -0.03263 0.03257 0.09860 0.22773
#> rho[3,1] -0.2860 -0.16204 -0.09435 -0.02647 0.10091
#> rho[1,2] -0.1568 -0.03263 0.03257 0.09860 0.22773
#> rho[2,2] 1.0000 1.00000 1.00000 1.00000 1.00000
#> rho[3,2] -0.3141 -0.19238 -0.12677 -0.06026 0.06758
#> rho[1,3] -0.2860 -0.16204 -0.09435 -0.02647 0.10091
#> rho[2,3] -0.3141 -0.19238 -0.12677 -0.06026 0.06758
#> rho[3,3] 1.0000 1.00000 1.00000 1.00000 1.00000
#> sigma[1] 1.1319 1.23351 1.29199 1.35581 1.49623
#> sigma[2] 0.9694 1.05315 1.10438 1.15979 1.28022
#> sigma[3] 0.9429 1.02830 1.07699 1.13066 1.24478
And the underlying JAGS model used to solve for the covariance structure can be viewed:
output$model
#> JAGS model:
#>
#>
#> model {
#> for (i in 1:n) {
#> model_data[i, 1:p] ~ dmnorm( zMu[1:p] , zInvCovMat[1:p, 1:p] )
#> }
#>
#> for (varIdx in 1:p) {
#> # -- uninformative prior for zMu -- #
#> zMu[varIdx] ~ dnorm(0, 1000000)
#> }
#> zInvCovMat ~ dwish(zRmat[1:p, 1:p], zdf)
#>
#> # -- Convert invCovMat to SD and correlation -- #
#> zCovMat <- inverse(zInvCovMat)
#> for (varIdx in 1:p) {
#> zSigma[varIdx] <- sqrt(zCovMat[varIdx,varIdx])
#> }
#> for (varIdx1 in 1:p) {
#> for (varIdx2 in 1:p) {
#> zRho[varIdx1, varIdx2] <- (
#> zCovMat[varIdx1, varIdx2] / (zSigma[varIdx1] * zSigma[varIdx2])
#> )
#> }
#> }
#>
#> for (varIdx in 1:p) {
#> sigma[varIdx] <- zSigma[varIdx] * sd_orig[, varIdx]
#> mu[varIdx] <- zMu[varIdx] * sd_orig[, varIdx] + mean_orig[, varIdx]
#> }
#> for (varIdx1 in 1:p) {
#> for (varIdx2 in 1:p) {
#> rho[varIdx1, varIdx2] <- zRho[varIdx1, varIdx2]
#> }
#> }
#> }
#>
#> Fully observed variables:
#> mean_orig model_data n p sd_orig zRmat zdfNow the attenuation contamination matrix can be solved. This matrix will allow for the error structure to be used in adjusting a model based on the various exposure estimates.
validity_coefficients <- c(x_v_coef, y_v_coef, z_v_coef)
lambda <- attenuation_matrix(output, c("x", "y", "z"), validity_coefficients)
The diagonal values of this matrix are used in the next step to adjust the coefficients of the fitted model.
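For intuition, this mirrors the standard regression calibration relationship (stated loosely here, not as the library’s exact derivation): the expected naive coefficients are related to the true coefficients through the attenuation-contamination matrix \(\Lambda\),
\[
E\left(\hat{\beta}_{\text{naive}}\right) \approx \Lambda \, \beta_{\text{true}},
\]
so adjusting amounts to undoing this distortion, in the simplest univariate case by dividing a coefficient by its attenuation factor.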
Now that we have solved for the attenuation-contamination matrix, a model can be fitted and comparisons can be made between a naive model (i.e. no error adjustment) and the multivariate adjusted model.
model_output <- multivariate_model(
"output ~ x + y + z",
data = tainted_df,
columns = c("x", "y", "z"),
a_c_matrix = lambda$matrix,
sds = lambda$sds,
variances = lambda$variances,
univariate = TRUE
)
Please note that the univariate = TRUE argument
specifies that the naive model object should be returned. This allows
for plotting and comparisons to be made.
Recall that the model fitted to the tainted data yielded coefficients that were not at all close to the “true” simulated relationship from earlier.
summary(tainted_model)
#>
#> Call:
#> glm(formula = "output ~ x + y + z", data = tainted_df)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.7401 1.9210 1.426 0.15699
#> x 0.9756 0.2917 3.345 0.00118 **
#> y 1.4191 0.3422 4.146 7.29e-05 ***
#> z 0.7174 0.3529 2.033 0.04481 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 13.97885)
#>
#> Null deviance: 1766.9 on 99 degrees of freedom
#> Residual deviance: 1342.0 on 96 degrees of freedom
#> AIC: 553.46
#>
#> Number of Fisher Scoring iterations: 2
summary(model)
#>
#> Call:
#> glm(formula = "output ~ x + y + z", data = df)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.01221 0.30158 -0.04 0.968
#> x 3.09764 0.10936 28.33 <2e-16 ***
#> y 3.81939 0.16940 22.55 <2e-16 ***
#> z 4.99010 0.19621 25.43 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 0.8312565)
#>
#> Null deviance: 1766.939 on 99 degrees of freedom
#> Residual deviance: 79.801 on 96 degrees of freedom
#> AIC: 271.22
#>
#> Number of Fisher Scoring iterations: 2
But the proposed multivariate adjustment methodology shows just how powerful this approach can be.
summary(model_output$multivariate)
#> x y z
#> Min. :-1.137 Min. :0.4442 Min. :-3.205
#> 1st Qu.: 2.455 1st Qu.:3.2656 1st Qu.: 2.013
#> Median : 3.067 Median :3.8929 Median : 3.138
#> Mean : 3.066 Mean :3.9019 Mean : 3.139
#> 3rd Qu.: 3.671 3rd Qu.:4.5408 3rd Qu.: 4.268
#> Max. : 6.458 Max. :7.7536 Max. : 9.625
Looking at the mean values of the adjusted coefficients, we can see that we now show x = 3.07, y = 3.90, and z = 3.14. Of course, these do not exactly match the known simulated a1, a2, and a3 parameters of 3, 4, and 5, but they are much closer to reality than what had previously been derived from the tainted dataset. Even more importantly, when we look at the 97.5th percentiles below, we can see that 3, 4, and 5 all fall within the spread of the corresponding adjusted coefficient distributions. This means that we have recovered a lot of the ground truth from the tainted dataset.
quantile(model_output$multivariate$x, 0.975)
#> 97.5%
#> 4.871627
quantile(model_output$multivariate$y, 0.975)
#> 97.5%
#> 5.721841
quantile(model_output$multivariate$z, 0.975)
#> 97.5%
#> 6.470592
Another nice feature of the library is the ability to directly look
at plots comparing the unadjusted and adjusted coefficients. Again,
since we know that the true values are 3, 4, and 5, respectively, for
x, y, and z, we can much more
clearly see how the adjusted coefficient distribution (blue line) is
much closer to, and even contains, the true values compared to the
unadjusted coefficient distribution (red line).
plots <- plot_covariates(model_output, c("x", "y", "z"))
#> No id variables; using all as measure variables
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the rACMEMEEV package.
#> Please report the issue at <https://github.com/westford14/rACME-MEEV/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> No id variables; using all as measure variables
#> No id variables; using all as measure variables
plots$x
One thing to note is that, depending on the validity coefficients, we
can potentially adjust out more of the measurement error. There is an
interesting interplay between the strength of association between an
exposure and the outcome and its resulting effect on the measurement
error adjustment. Looking at the z exposure: we set the
association to be 5. Let’s look at what happens when we reduce the
strength of association.
First, let’s look at reducing the strength of association between the z exposure and the outcome.
df <- simulate_gamma_mix(100, a3 = 2)
head(df, 10)
#> output x y z
#> 1 13.181947 1.8244780 2.2692883 0.4517797
#> 2 8.757858 0.4440549 0.8786792 1.1566166
#> 3 8.063726 0.7796099 1.3721743 0.4640524
#> 4 4.201791 0.2260912 0.7166295 0.9217702
#> 5 6.031654 0.6864201 0.6548484 0.8998979
#> 6 13.602016 0.4014021 1.9616123 1.6035283
#> 7 17.116508 2.4952813 1.3514314 1.7624075
#> 8 7.627437 0.7120872 0.7557619 1.1326039
#> 9 2.940515 0.2323428 0.3984574 0.8797067
#> 10 8.688601 0.1406167 1.1074552 0.8710625
From here we can follow the same procedure as above, re-using the validity coefficients, to show how this reduced strength of association changes the measurement error adjustment. Let’s taint the dataset.
tainted_df_lower_assoc <- df
tainted_df_lower_assoc$x <- df$x + rnorm(100, mean = 2)
tainted_df_lower_assoc$y <- df$y + rnorm(100, mean = 2)
tainted_df_lower_assoc$z <- df$z + rnorm(100, mean = 2)
head(tainted_df_lower_assoc, 10)
#> output x y z
#> 1 13.181947 2.369628 2.855428 3.8620031
#> 2 8.757858 2.806635 4.128599 3.6763995
#> 3 8.063726 2.504857 4.518488 1.8930857
#> 4 4.201791 3.368679 2.659698 2.7233855
#> 5 6.031654 2.321898 3.237331 0.3067704
#> 6 13.602016 3.059397 3.512237 2.4065276
#> 7 17.116508 3.259143 2.913477 2.4865265
#> 8 7.627437 3.529231 2.960672 2.4626126
#> 9 2.940515 3.558587 2.441093 4.5495388
#> 10 8.688601 3.437748 2.968184 2.6295337
And now we can fit the measurement error model.
output <- acme_model(tainted_df_lower_assoc, c("x", "y", "z"))
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 100
#> Unobserved stochastic nodes: 4
#> Total graph size: 166
#>
#> Initializing model
lambda <- attenuation_matrix(output, c("x", "y", "z"), validity_coefficients)
model_output <- multivariate_model(
"output ~ x + y + z",
data = tainted_df_lower_assoc,
columns = c("x", "y", "z"),
a_c_matrix = lambda$matrix,
sds = lambda$sds,
variances = lambda$variances,
univariate = TRUE
)
And now we can summarize the adjusted model:
summary(model_output$multivariate)
#> x y z
#> Min. :-0.598 Min. :0.834 Min. :-2.5214
#> 1st Qu.: 2.637 1st Qu.:3.122 1st Qu.: 0.7513
#> Median : 3.201 Median :3.628 Median : 1.4707
#> Mean : 3.198 Mean :3.628 Mean : 1.4784
#> 3rd Qu.: 3.754 3rd Qu.:4.131 3rd Qu.: 2.1954
#> Max. : 6.304 Max. :6.781 Max. : 5.4852
Here we can see that the adjusted z coefficient is now approximately 1.5, which is much closer to its true value (now 2) than the adjustment achieved previously, when we had the greater strength of association of 5.
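As a quick check, the same quantile approach used earlier can be applied to the adjusted z draws to see whether the new true value of 2 falls within their central interval:
quantile(model_output$multivariate$z, c(0.025, 0.975))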
Here we have shown how the proposed methodology, based on Muoka et al. 2020, can successfully adjust for measurement error in a dataset. This was demonstrated on a simulated, tainted dataset, from which the coefficients could be recovered to closely mirror the true data-generating values.
The National Cancer Institute has its own error adjustment model (Zhang et al. 2011) that can be used as well. They offer their own vignettes that can help in understanding other idiosyncrasies of measurement error adjustment.
If you would like, the traceplots and ACFs are available for all fitted models.
Traceplots:
ACF Plots: