--- title: "1. Basic workflow for ER with binary endpoint" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{1. Basic workflow for ER with binary endpoint} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette provides a basic workflow for the development of ER models for binary endpoints and subsequent simulation from the developed ER model. ```{r setup, warning=FALSE, message=FALSE} library(dplyr) library(BayesERtools) ggplot2::theme_set(ggplot2::theme_bw(base_size = 12)) ``` ## Data We will use an example simulated dataset included in the package (`d_sim_binom_cov`) for the analysis. In this document we use hypoglycemia (hgly2) as an example AE. Another example AE is diarrhea (dr2), where you would see fairly flat ER curve. ```{r} data(d_sim_binom_cov) head(d_sim_binom_cov) |> gt::gt() |> gt::fmt_number(n_sigfig = 3) d_sim_binom_cov_2 <- d_sim_binom_cov |> mutate( AUCss_1000 = AUCss / 1000, BAGE_10 = BAGE / 10, BWT_10 = BWT / 10, BHBA1C_5 = BHBA1C / 5, Dose = paste(Dose_mg, "mg") ) # Grade 2+ hypoglycemia df_er_ae_hgly2 <- d_sim_binom_cov_2 |> filter(AETYPE == "hgly2") # Grade 2+ diarrhea df_er_ae_dr2 <- d_sim_binom_cov_2 |> filter(AETYPE == "dr2") ``` We also defines variables that is used in the analysis. ```{r} var_resp <- "AEFLAG" # HbA1c & glucose are only relevant for hyperglycemia var_cov_ae_hgly2 <- c("BAGE_10", "BWT_10", "RACE", "VISC", "BHBA1C_5", "BGLUC") var_cov_ae_dr2 <- c("BAGE_10", "BWT_10", "RACE", "VISC") ``` ## 1. Basic model development `dev_ermod_bin()` function can be used to develop basic ER model. (Note that this function can also be used for models with covariates, if you already know the covariates to be included.) ```{r} set.seed(1234) ermod_bin <- dev_ermod_bin( data = df_er_ae_hgly2, var_resp = var_resp, var_exposure = "AUCss_1000" ) ermod_bin ``` You can compare the observed data with the model fit using `plot_er()` function. ```{r} # Using `*` instead of `+` so that scale can be # applied for both panels (main plot and boxplot) plot_er_gof(ermod_bin, var_group = "Dose", show_coef_exp = TRUE) * xgxr::xgx_scale_x_log10() ``` MCMC samples can be obtained with `as_draws()` family of functions, such as `as_draws_df()`. ```{r} draws_df <- as_draws_df(ermod_bin) draws_df_summary <- posterior::summarize_draws(draws_df) draws_df_summary |> gt::gt() |> gt::fmt_number(n_sigfig = 3) ``` You can predict the probability of events for a given exposure level with `sim_er_new_exp()` function. Here, the prediction is done for AUCss_1000 of 1, 2, 3 (AUCss of 1000, 2000, 3000), and the output is the median and 95% CI of the predicted probability. You can set `output_type = "draw"` to get the raw posterior draws. ```{r} ersim_med_qi <- sim_er_new_exp( ermod_bin, exposure_to_sim_vec = 1:3, output_type = "median_qi" ) ersim_med_qi ``` ## 2. Selection of exposure metrics `dev_ermod_bin_exp_sel()` function can be used to select the best exposure metric(s) from a list of candidate exposure metrics. In this case, AUCss_1000 is selected as the best exposure metric, as it had the highest elpd (expected log predictive density).[^1] [^1]: Some references about elpd: \code{?loo::`loo-glossary`}, [What is the interpretation of ELPD / elpd_loo / elpd_diff?](https://mc-stan.org/loo/articles/online-only/faq.html#elpd_interpretation) ```{r} set.seed(1234) ermod_bin_exp_sel <- dev_ermod_bin_exp_sel( # Use reduced N to make the example run faster data = slice_sample(df_er_ae_hgly2, n = 100), var_resp = var_resp, var_exp_candidates = c("AUCss_1000", "Cmaxss", "Cminss"), # Use reduced iter to make the example run faster iter = 1000 ) ermod_bin_exp_sel ``` The ER curve for all the evaluated exposure metrics can be generated with `plot_er_exp_sel()` function. ```{r} plot_er_exp_sel(ermod_bin_exp_sel) + xgxr::xgx_scale_x_log10() ``` ## 3. Selection of covariates `dev_ermod_bin_cov_sel()` function can be used to select the best covariates from a list of candidate covariates. In this case, HbA1c (BHBA1C_5) and glucose (BGLUC) are selected, in addition to the exposure metric AUCss_1000 as predictors. ```{r, eval = FALSE} set.seed(1234) ermod_bin_cov_sel <- dev_ermod_bin_cov_sel( data = df_er_ae_hgly2, var_resp = var_resp, var_exposure = "AUCss_1000", var_cov_candidate = var_cov_ae_hgly2, verbosity_level = 2 ) ``` ```{r, eval = FALSE, include = FALSE} # Save the output in vignettes/data to avoid running it during the build. saveRDS(ermod_bin_cov_sel, file = "data/ermod_bin_cov_sel.rds") ``` ```{r, include = FALSE} ermod_bin_cov_sel <- readRDS("data/ermod_bin_cov_sel.rds") ``` ```{r} ermod_bin_cov_sel ``` The plot below shows that AUCss_1000, BHBA1C_5, and BGLUC contributes to improving the model performance, and after then the inclusion of no other covariates improves the model performance. ```{r} plot_submod_performance(ermod_bin_cov_sel) ``` In some cases, you might see a warning message like below. This indicates that approximation of leave-one-out cross-validation performance (PSIS-LOO) is not reliable. ``` Warning: In the recalculation of the reference model's PSIS-LOO CV weights for the performance evaluation, ... Pareto k-values are in the interval...`. ``` Alternatively to the default `cv_method = "LOO"`, you can use k-fold cross-validation by setting`cv_method = "kfold"` in `dev_ermod_bin_cov_sel()` function. This can take longer time to run, but it can be more reliable in the cases where LOO is not reliable. You can also set `validate_search = TRUE` to let the function perform the variable selection for each fold separately, rather than using the selected variable sequence from the full dataset evaluation. ```{r, eval = FALSE} set.seed(1234) ermod_bin_cov_sel_kfold <- dev_ermod_bin_cov_sel( data = df_er_ae_hgly2, var_resp = var_resp, var_exposure = "AUCss_1000", var_cov_candidate = var_cov_ae_hgly2, cv_method = "kfold", validate_search = TRUE, verbosity_level = 2 ) ``` ```{r, eval = FALSE, include = FALSE} # Save the output in vignettes/data to avoid running it during the build. saveRDS(ermod_bin_cov_sel_kfold, file = "data/ermod_bin_cov_sel_kfold.rds") ``` ```{r, include = FALSE} ermod_bin_cov_sel_kfold <- readRDS("data/ermod_bin_cov_sel_kfold.rds") ``` Added bonus of using k-fold cv is that you can visualize how often each variable is selected in the model. Here, as you can see (and as expected), HbA1c (BHBA1C_5) and glucose (BGLUC) are highly related and they are almost interchangeably selected in the second and third positions. Note that the function enforces the exposure metric to be included first in the model. ```{r} ermod_bin_cov_sel_kfold plot_submod_performance(ermod_bin_cov_sel_kfold) plot_var_ranking(ermod_bin_cov_sel_kfold) ``` ## 4. Marginal ER prediction You can simulate the marginal ER relationship, i.e. ER relationships for "marginalized", or averaged, response for the population of interest, using `sim_er_new_exp_marg()` function. By default, the covariate distribution is from the original data, but you can also supply other distribution with `data_cov` argument. ```{r} ersim_new_exp_marg_med_qi <- sim_er_new_exp_marg( ermod_bin_cov_sel, exposure_to_sim_vec = c(2:6), output_type = "median_qi" ) ersim_new_exp_marg_med_qi plot_er(ersim_new_exp_marg_med_qi, marginal = TRUE) ``` ## 5. Evaluation of covariate effects You can visualize the effect of the covariates with `sim_coveff()` and `plot_coveff()` functions. You can see that all three predictors have fairly strong effects on the odds ratio of hypoglycemia. ```{r, fig.width = 5, fig.height = 4.5} coveffsim <- sim_coveff(ermod_bin_cov_sel) plot_coveff(coveffsim) print_coveff(coveffsim) ```