processRaw()
The processRaw() function calculates the actual count \((N)\) of each product-symptom combination, the expected count \((E)\) under the row/column independence assumption, the relative reporting ratio \((RR)\), and the proportional reporting ratio \((PRR)\). processRaw() has several parameters, some of which are shown below.
Suppose the data look like this:
dat
#> var1 var2 id
#> 1 product_B event_1 1
#> 2 product_A event_1 2
#> 3 product_B event_2 3
#> 4 product_A event_1 4
#> 5 product_A event_1 5
#> 6 product_A event_1 6
#> 7 product_A event_2 7
#> 8 product_A event_2 8
#> 9 product_A event_1 9
#> 10 product_A event_2 10
#> 11 product_A event_2 11
#> 12 product_B event_2 12
#> 13 product_B event_1 13
#> 14 product_B event_2 14
#> 15 product_B event_1 15
#> 16 product_B event_2 16
#> 17 product_C event_1 17
We can calculate \(N\), \(E\), \(RR\), and \(PRR\) for the product-symptom pairs:
processRaw(data = dat, stratify = FALSE, zeroes = FALSE)
#> var1 var2 N E RR PRR
#> 1 product_A event_1 5 4.7647059 1.05 1.11
#> 2 product_A event_2 4 4.2352941 0.94 0.89
#> 3 product_B event_1 3 3.7058824 0.81 0.71
#> 4 product_B event_2 4 3.2941176 1.21 1.43
#> 5 product_C event_1 1 0.5294118 1.89 2.00
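To make the four quantities concrete, the following is a minimal base R sketch that reproduces the table above by hand. It is not how processRaw() is implemented internally; it assumes that \(E\) is the row total times the column total divided by the grand total, that \(RR = N/E\), and that \(PRR\) is the event rate for the product divided by the event rate for all other products combined. These assumptions match the printed output for this toy data set. Intermediate names such as `row_tot` and `rate_other` are introduced only for illustration.

```r
# Toy data matching the example above: one product-event pair per report.
dat <- data.frame(
  var1 = c("product_B", "product_A", "product_B", rep("product_A", 8),
           rep("product_B", 5), "product_C"),
  var2 = c("event_1", "event_1", "event_2", "event_1", "event_1", "event_1",
           "event_2", "event_2", "event_1", "event_2", "event_2", "event_2",
           "event_1", "event_2", "event_1", "event_2", "event_1"),
  id = 1:17
)

# Observed counts N as a product-by-event matrix.
N <- unclass(table(dat$var1, dat$var2))
n_total <- sum(N)
row_tot <- rowSums(N)   # reports per product
col_tot <- colSums(N)   # reports per event

# Expected counts under row/column independence.
E <- outer(row_tot, col_tot) / n_total

# PRR: event rate within the product vs. event rate among all other products.
rate_this  <- N / row_tot
other_N    <- matrix(col_tot, nrow(N), ncol(N), byrow = TRUE) - N
rate_other <- other_N / (n_total - row_tot)
PRR <- rate_this / rate_other

# Reshape to one row per product-event pair and drop pairs never observed.
res <- data.frame(
  var1 = rownames(N)[row(N)],
  var2 = colnames(N)[col(N)],
  N    = as.vector(N),
  E    = as.vector(E),
  RR   = round(as.vector(N / E), 2),
  PRR  = round(as.vector(PRR), 2)
)
res <- res[res$N > 0, ]
res <- res[order(res$var1, res$var2), ]
print(res, row.names = FALSE)
```

For example, product_A has 9 reports and event_1 has 9 reports out of 17, so \(E = 9 \times 9 / 17 \approx 4.76\) and \(RR = 5 / 4.76 \approx 1.05\).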
Stratification can help control for confounding variables. For instance, food, cosmetics, and dietary supplements are often consumed at different rates by different genders and age groups. Similarly, adverse events associated with these products occur with varying rates. Therefore, we might wish to control for these variables when we examine the CAERS data.
Now assume the data look like this:
dat
#> var1 var2 strat1 strat2 id
#> 1 product_B event_1 F age_cat2 1
#> 2 product_A event_1 M age_cat1 2
#> 3 product_B event_2 M age_cat1 3
#> 4 product_A event_1 M age_cat1 4
#> 5 product_A event_1 F age_cat1 5
#> 6 product_A event_1 F age_cat1 6
#> 7 product_A event_2 F age_cat1 7
#> 8 product_A event_2 F age_cat1 8
#> 9 product_A event_1 M age_cat2 9
#> 10 product_A event_2 M age_cat1 10
#> 11 product_A event_2 M age_cat1 11
#> 12 product_B event_2 M age_cat2 12
#> 13 product_B event_1 M age_cat1 13
#> 14 product_B event_2 M age_cat1 14
#> 15 product_B event_1 M age_cat1 15
#> 16 product_B event_2 F age_cat1 16
#> 17 product_C event_1 M age_cat1 17
Notice that we now have stratification variables present (columns whose names contain the substring ‘strat’). We can use these stratification variables to get adjusted estimates for the \(EBGM\) scores. Stratification affects \(E\) and \(RR\), but not \(PRR\). Each \(E\) is calculated by summing the expected counts from every stratum. Ideally, each stratum should contain several unique CAERS reports to ensure good estimates of \(E\).
processRaw(data = dat, stratify = TRUE, zeroes = FALSE)
#> stratification variables used: strat1, strat2
#> there were 4 strata: F-age_cat1, F-age_cat2, M-age_cat1, M-age_cat2
#> Warning in .checkStrata_processRaw(data, max_cats): at least one stratum
#> contains less than 50 unique IDs
#> var1 var2 N E RR PRR
#> 1 product_A event_1 5 4.3222222 1.16 1.11
#> 2 product_A event_2 4 4.6777778 0.86 0.89
#> 3 product_B event_1 3 4.1222222 0.73 0.71
#> 4 product_B event_2 4 2.8777778 1.39 1.43
#> 5 product_C event_1 1 0.5555556 1.80 2.00
Notice that we use stratify = TRUE to accommodate the new stratification variables. As a result, the calculations for \(E\) and \(RR\) are adjusted, while \(PRR\) is unchanged.
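The stratified \(E\) can also be checked by hand. The sketch below, in base R, assumes that each stratum is one combination of strat1 and strat2, computes the independence-based expected counts within each stratum, and sums them across strata. It is an illustration that reproduces the output above, not processRaw()'s internal code; names such as `stratum` and `tab3` are hypothetical.

```r
# Toy data matching the stratified example above.
dat <- data.frame(
  var1 = c("product_B", "product_A", "product_B", rep("product_A", 8),
           rep("product_B", 5), "product_C"),
  var2 = c("event_1", "event_1", "event_2", "event_1", "event_1", "event_1",
           "event_2", "event_2", "event_1", "event_2", "event_2", "event_2",
           "event_1", "event_2", "event_1", "event_2", "event_1"),
  strat1 = "M",
  strat2 = "age_cat1",
  id = 1:17
)
dat$strat1[c(1, 5, 6, 7, 8, 16)] <- "F"
dat$strat2[c(1, 9, 12)] <- "age_cat2"

# One stratum per observed strat1-strat2 combination (4 strata here).
stratum <- interaction(dat$strat1, dat$strat2, sep = "-", drop = TRUE)

# Product x event x stratum array of counts.
tab3 <- unclass(table(dat$var1, dat$var2, stratum))

# Sum the within-stratum expected counts over all strata.
E <- 0
for (s in levels(stratum)) {
  Ns <- tab3[, , s]
  E <- E + outer(rowSums(Ns), colSums(Ns)) / sum(Ns)
}

# Observed counts pooled over strata, and the adjusted RR.
N  <- apply(tab3, c(1, 2), sum)
RR <- round(N / E, 2)

print(E)
print(RR)
```

For product_A and event_1, the three strata that contain product_A contribute \(4 \times 5 / 9\), \(4 \times 2 / 5\), and \(1 \times 1 / 2\), which sum to about 4.32, the value shown in the output above.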
Finally, in some cases one may wish to calculate the \(E\)s for product-symptom combinations that do not occur in the data. These can be calculated by using the zeroes = TRUE argument in the processRaw() function. It is typically not required to perform such calculations for zero counts, and doing so can lead to much longer execution times when estimating hyperparameters. For this reason, zero counts are only recommended for hyperparameter estimation when convergence of optimization routines cannot be reached otherwise. If zero counts are used, data squashing should typically follow. Even if zero counts are used for hyperparameter estimation, \(EBGM\) scores for zero counts never add value to an analysis. For this reason, rows with zero counts should be removed after estimating hyperparameters but before calculating \(EBGM\) and quantile scores.
processRaw(data = dat, stratify = FALSE, zeroes = TRUE)
#> var1 var2 N E RR PRR
#> 1 product_A event_1 5 4.7647059 1.05 1.11
#> 2 product_A event_2 4 4.2352941 0.94 0.89
#> 3 product_B event_1 3 3.7058824 0.81 0.71
#> 4 product_B event_2 4 3.2941176 1.21 1.43
#> 5 product_C event_1 1 0.5294118 1.89 2.00
#> 6 product_C event_2 0 0.4705882 0.00 0.00
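Removing the zero-count rows is a plain subsetting step. The sketch below uses a hand-copied stand-in for the output above (the data frame `proc` is hypothetical and built by hand so the example runs without the package); the same subsetting would apply to the data frame returned by processRaw().

```r
# Stand-in for the processRaw(..., zeroes = TRUE) output shown above.
proc <- data.frame(
  var1 = c("product_A", "product_A", "product_B", "product_B",
           "product_C", "product_C"),
  var2 = c("event_1", "event_2", "event_1", "event_2", "event_1", "event_2"),
  N    = c(5, 4, 3, 4, 1, 0),
  E    = c(4.7647059, 4.2352941, 3.7058824, 3.2941176, 0.5294118, 0.4705882),
  RR   = c(1.05, 0.94, 0.81, 1.21, 1.89, 0.00),
  PRR  = c(1.11, 0.89, 0.71, 1.43, 2.00, 0.00)
)

# Keep the full table (zeros included) for hyperparameter estimation if needed,
# then drop zero-count rows before computing EBGM and quantile scores.
proc_nonzero <- proc[proc$N > 0, ]
print(proc_nonzero, row.names = FALSE)
```

Keeping both objects separate makes it harder to pass the zero-count rows to the scoring step by accident.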
Next, the Hyperparameter Estimation with openEBGM vignette will demonstrate how to estimate the hyperparameters of the prior distribution.