R package

Simulated zika dataset

To illustrate the bounds, a simulated dataset, zika_learner, is constructed. It is inspired by a numerical zika example used in Smith and VanderWeele (2019) together with a case-control study that investigates the effect of zika virus on microcephaly (Araújo et al. 2018). The variables included are:

Living area \((V)\)
Socioeconomic status, SES \((U)\)
Zika \((T)\)
Microcephaly \((Y)\)
Birth \((S_1)\)
Public hospital \((S_2)\)

The relationships between the variables are illustrated in Figure 2 and Table 1. The prevalences of the variables, and strengths of dependencies between them, are chosen to mimic real data and the assumed values for the sensitivity parameters in Smith and VanderWeele (2019). The simulated data mimics a cohort with 5000 observations, even though the original study is a case-control study. For more details of the variables and the models, see Zetterstrom and Waernbaum (2023).

Figure 2. Causal model for the zika_learner dataset.

The causal dependencies are generated by the logit models described in Table 1.

Table 1. Data generating process for the dataset `zika_learner`. Models generating causal dependencies are logistic, \(g(X'\theta)\), for predictor variables \(X\) (including a constant 1 for the intercept) and model parameter \(\theta\) (including the intercept).
Model	Coefficients (\(\theta\))/Proportions	Function argument
\(P(V=1)\)	\(0.85\)	`Vval`
\(P(U=1)\)	\(0.50\)	`Uval`
\(P(T=1\|V)=g(V'\theta_T)\)	\((-6.20,1.75)\)	`Tcoef`
\(P(Y=1\|T,U)=g[(T,U)'\theta_{Y}]\)	\((-5.20,5.00,-1.00)\)	`Ycoef`
\(P(S_1=1\|V,U,T)=g[(V,U,T)'\theta_{S1}]\)	\((1.20,0.00,2.00,-4.00)\)	`Scoef`
\(P(S_2=1\|V,U,T)=g[(V,U,T)'\theta_{S2}]\)	\((2.20,0.50,-2.75,0.00)\)	`Scoef`

The data was generated in R, version 4.2.0, using the package arm, version 1.13-1, with the following code:

# Seed.
set.seed(158118)

# Number of observations.
nObs = 5000

# The unmeasured variable, living area (V).
urban = rbinom(nObs, 1, 0.85)

# The treatment variable, zika.
zika_prob = arm::invlogit(-6.2 + 1.75 * urban)
zika = rbinom(nObs, 1, zika_prob)

# The unmeasured variable, SES (U).
SES = rbinom(nObs, 1, 0.5)

# The outcome variable, microcephaly.
mic_ceph_prob = arm::invlogit(-5.2 + 5 * zika - 1 * SES)
mic_ceph = rbinom(nObs, 1, mic_ceph_prob)

# The first selection variable, birth.
birth_prob = arm::invlogit(1.2 - 4 * zika + 2 * SES)
birth = rbinom(nObs, 1, birth_prob)

# The second selection variable, hospital.
hospital_prob = arm::invlogit(2.2 + 0.5 * urban - 2.75 * SES)
hospital = rbinom(nObs, 1, hospital_prob)

# The selection indicator.
sel_ind = birth * hospital

The resulting proportions of the zika_learner data, for the total dataset, the subset with \(S_1=1\) and the subset with \(S_1=S_2=1\) are seen in Tables 2-4.

Table 2. Proportions for the simulated dataset, by treatment status and overall.
	Not zika infected (N=4939)	Zika infected (N=61)	Overall (N=5000)
Microcephaly
Mean	0.003	0.361	0.008
Living area
Mean	0.849	0.951	0.850
SES
Mean	0.499	0.426	0.498

Table 3. Proportions for the simulated dataset, by treatment status and overall, after the first selection.
	Not zika infected (N=4268)	Zika infected (N=11)	Overall (N=4279)
Microcephaly
Mean	0.003	0.273	0.004
Living area
Mean	0.845	1.000	0.846
SES
Mean	0.556	0.818	0.557

Table 4. Proportions for the simulated dataset, by treatment status and overall, after both selections.
	Not zika infected (N=2869)	Zika infected (N=7)	Overall (N=2876)
Microcephaly
Mean	0.004	0.286	0.005
Living area
Mean	0.858	1.000	0.858
SES
Mean	0.382	0.714	0.382

The dataset and data generating process (DGP) can be used to test the functions in SelectionBias.

`sensitivityparametersM()`

The sensitivity parameters for the SV, sharp, and GAF bounds are calculated for the generalized M-structure, illustrated in Figure 1. The sensitivity parameters are only calculated for an assumed model structure, since they depend on the unobserved variable, U. However, the observed probabilities of the outcome, \(P(Y=1|T=t,I_S=1)\), \(t=0,1\), are inputs. The code and the output are:

# SV bound
sensparSV = sensitivityparametersM(whichEst = "RR_tot",
                   whichBound = "SV",
                   Vval = matrix(c(1, 0, 0.85, 0.15), ncol = 2),
                   Uval = matrix(c(1, 0, 0.5, 0.5), ncol = 2),
                   Tcoef = c(-6.2, 1.75),
                   Ycoef = c(-5.2, 5.0, -1.0),
                   Scoef = matrix(c(1.2, 2.2, 0.0, 0.5,
                                    2.0, -2.75, -4.0, 0.0),
                                  ncol = 4),
                   Mmodel = "L",
                   pY1_T1_S1 = 0.286,
                   pY1_T0_S1 = 0.004)
print(sensparSV)
#> $parameter
#>  [1] "BF_11"     "BF_00"     "BF_10"     "BF_01"     "RR_UY|T=1" "RR_UY|T=0"
#>  [7] "RR_SU|11"  "RR_SU|00"  "RR_SU|10"  "RR_SU|01" 
#> 
#> $value
#>  [1] 1.2102 1.3532 1.3208 1.3895 1.9448 2.7089 1.5566 1.7058 1.9998 1.7998
#> 
#> attr(,"class")
#> [1] "sensparams"
#> attr(,"row.names")
#>  [1]  1  2  3  4  5  6  7  8  9 10

# Sharp bound
sensparSharp = sensitivityparametersM(whichEst = "RR_tot",
                   whichBound = "sharp",
                   Vval = matrix(c(1, 0, 0.85, 0.15), ncol = 2),
                   Uval = matrix(c(1, 0, 0.5, 0.5), ncol = 2),
                   Tcoef = c(-6.2, 1.75),
                   Ycoef = c(-5.2, 5.0, -1.0),
                   Scoef = matrix(c(1.2, 2.2, 0.0, 0.5,
                                    2.0, -2.75, -4.0, 0.0),
                                  ncol = 4),
                   Mmodel = "L",
                   pY1_T1_S1 = 0.286,
                   pY1_T0_S1 = 0.004)

print(sensparSharp)
#> $parameter
#>  [1] "BF_11"     "BF_00"     "BF_10"     "BF_01"     "RR_UY|T=1" "RR_UY|T=0"
#>  [7] "RR_SU|11"  "RR_SU|00"  "RR_SU|10"  "RR_SU|01" 
#> 
#> $value
#>  [1] 1.2102 1.3532 1.3208 1.3895 1.9448 2.7089 1.5566 1.7058 1.9998 1.7998
#> 
#> attr(,"class")
#> [1] "sensparams"
#> attr(,"row.names")
#>  [1]  1  2  3  4  5  6  7  8  9 10

# GAF bound
sensparGAF = sensitivityparametersM(whichEst = "RR_tot",
                   whichBound = "GAF",
                   Vval = matrix(c(1, 0, 0.85, 0.15), ncol = 2),
                   Uval = matrix(c(1, 0, 0.5, 0.5), ncol = 2),
                   Tcoef = c(-6.2, 1.75),
                   Ycoef = c(-5.2, 5.0, -1.0),
                   Scoef = matrix(c(1.2, 2.2, 0.0, 0.5,
                                    2.0, -2.75, -4.0, 0.0),
                                  ncol = 4),
                   Mmodel = "L",
                   pY1_T1_S1 = 0.286,
                   pY1_T0_S1 = 0.004)

print(sensparGAF)
#> $parameter
#> [1] "m_T" "M_T"
#> 
#> $value
#> [1] 0.0020 0.4502
#> 
#> attr(,"class")
#> [1] "sensparams"
#> attr(,"row.names")
#> [1] 1 2

The first argument is whichEst, where the user inputs the causal estimand of interest. It must be one of the four "RR_tot", "RD_tot", "RR_sub" or "RD_sub". The second argument is whichBound, where the user inputs the bound they want to use, either "SV", "sharp", or "GAF". Third, the argument Vval takes the matrix for V as input. The first column contains the values that V can take, and the second column contains the corresponding probabilities. In this example, V is binary, so the first two elements in the matrix are 1 and 0. However, any discrete V can be used. An approximation of a continuous V can be used, if it is discretized. The fourth argument is Uval, which takes the matrix for U as input. The matrix U has a similar structure as V. The fifth argument is Tcoef, containing the coefficients used in the model for T. The first entry in Tcoef is the intercept of the model, and the second the slope for V. The sixth argument is Ycoef, containing the coefficient vector for the outcome model, where the first entry is the intercept, the second the slope coefficient for T and third is the slope coefficient for U. The seventh argument is Scoef, the coefficient matrix for the selection variables. The number of rows is equal to the number of selection variables, and the number of columns is equal to four. The columns represent the intercept, and slope coefficients for V, U and T, respectively. A summary of the code notation is seen in the last column of Table 3. The eighth argument is Mmodel, which indicates whether the models in the M-structure are probit (Mmodel = "P") or logit (Mmodel = "L"). The ninth and tenth arguments are pY1_T1_S1 and pY1_T0_S1. They are the observed probabilities \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\). The output is the sensitivity parameters for the chosen bound.

In the zika example, the estimand of interest is the risk ratio in the total population, whichEst = "RR_tot", the DGP is found in Table 1, logistic models are used in the DGP, and the probabilities are found in Table 4. For the SV and sharp bounds, the output is \(RR_{UY|T=1}=1.9448\), \(RR_{UY|T=0}=2.7089\), \(RR_{SU|11}=1.5566\), \(RR_{SU|00}=1.7058\),\(RR_{SU|10}=1.9998\), and \(RR_{SU|01}=1.7998\), which gives \(BF_{11}=1.2102\) and \(BF_{00}=1.3532\), \(BF_{10}=1.3208\), and \(BF_{01}=1.3895\). For the GAF bound, the output is \(M_T=0.4502\) and \(m_T=0.002\).

`SVbound()`

The SV bound can be calculated using the function SVbound(). The first argument is whichEst, indicating the causal estimand of interest ("RR_tot", "RD_tot", "RR_sub" or "RD_sub"). The second argument is sens, which is either output from sensitivityparametersM(), a named list or a data frame with column names parameter and value, which includes the sensitivity parameters. This input is optional as the sensitivity parameters can also be specified directly in the function with the below arguments. The third and fourth arguments are the observed conditional probabilities \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\), which are needed to calculate bounds for the causal estimands and not the selection bias itself. The optional fifth and sixth arguments are the treatment probabilities \(P(T=1|I_S=1)\) and \(P(T=0|I_S=1)\). These are only needed for the alternative SV bound for the causal risk difference in the subpopulation. If they are not included, the original SV bound in the risk difference is calculated. The subsequent arguments are the sensitivity parameters provided by the user. The default value for all sensitivity parameters are NULL, and the user must then specify numeric values on the sensitivity parameters that are necessary for the bound for the chosen estimand. These arguments are also optional in case the argument sens is used. The sensitivity parameters can either be calculated using the function sensitivityparametersM(), or found elsewhere. For sensitivity parameters found elsewhere, SVbound() is not restricted to the generalized M-structure. However, the necessary assumptions for the SV bound must still be fulfilled (Smith and VanderWeele 2019). The output is the SV bound. The code and output are:

SVbound(whichEst = "RR_tot",
        pY1_T1_S1 = 0.286,
        pY1_T0_S1 = 0.004,
        RR_UY_T1 = 1.9448,
        RR_UY_T0 = 2.7089,
        RR_SU_11 = 1.5566,
        RR_SU_00 = 1.7058,
        RR_SU_10 = 1.9998,
        RR_SU_01 = 1.7998)
#>      [,1]             [,2]  
#> [1,] "SV lower bound" 43.66 
#> [2,] "SV upper bound" 131.22

As before in the zika example, the causal estimand is the risk ratio in the total population, whichEst = "RR_tot". The sensitivity parameters are \(RR_{UY|T=1}=1.9448\), \(RR_{UY|T=0}=2.7089\), \(RR_{SU|11}=1.5566\), \(RR_{SU|00}=1.7058\),\(RR_{SU|10}=1.9998\), and \(RR_{SU|01}=1.7998\), calculated above in sensitivityparametersM(), which gives a lower SV bound equal to 43.66 and a higher SV bound equal to 131.22. Alternatively, the output from the function sensitivityparametersM() can be used directly as input which gives the code:

SVbound(whichEst = "RR_tot",
        sens = sensparSV,
        pY1_T1_S1 = 0.286,
        pY1_T0_S1 = 0.004)
#>      [,1]             [,2]  
#> [1,] "SV lower bound" 43.66 
#> [2,] "SV upper bound" 131.22

This approach gives the same bounds as specifying the sensitivity parameters manually.

`sharpbound()`

The sharp bound can be calculated using the function sharpbound(). The input arguments are very similar to that of SVbound(). The first argument is whichEst, indicating the causal estimand of interest ("RR_tot", "RD_tot", "RR_sub" or "RD_sub"). The second argument is sens, which is either output from sensitivityparametersM(), a named list or a data frame with column names parameter and value, which includes the sensitivity parameters. This input is optional as the sensitivity parameters can also be specified directly in the function with the below arguments. The third and fourth arguments are the observed conditional probabilities \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\), which are needed to calculate bounds for the causal estimands and not the selection bias itself. The fifth and sixth arguments are the treatment probabilities \(P(T=1|I_S=1)\) and \(P(T=0|I_S=1)\). These are only needed for the bounds for the causal estimands in the subpopulation. The seventh and eighth arguments are the probabilities \(P(I_S=1|T=1)\) and \(P(I_S=1|T=0)\) which are needed for the bounds for the causal estimands in the total population. If they are unknown, they can be set to 0 for more conservative bounds. The subsequent arguments are the sensitivity parameters provided by the user. The default value for all sensitivity parameters are NULL, and the user must then specify numeric values on the sensitivity parameters that are necessary for the bound for the chosen estimand. These arguments are also optional in case the argument sens is used. The sensitivity parameters can either be calculated using the function sensitivityparametersM(), or found elsewhere. For sensitivity parameters found elsewhere, sharpbound() is not restricted to the generalized M-structure. However, the necessary assumptions for the sharp bound must still be fulfilled (Zetterstrom, Sjölander, and Waernbaum 2025). The output is the sharp bound. The code and output with the two options to specify the sensitivity parameters are:

sharpbound(whichEst = "RR_tot",
        pY1_T1_S1 = 0.286,
        pY1_T0_S1 = 0.004,
        pS1_T1 = 0.11,
        pS1_T0 = 0.58,
        RR_UY_T1 = 1.9448,
        RR_UY_T0 = 2.7089,
        RR_SU_11 = 1.5566,
        RR_SU_00 = 1.7058,
        RR_SU_10 = 1.9998,
        RR_SU_01 = 1.7998)
#>      [,1]                [,2] 
#> [1,] "Sharp lower bound" 73.91
#> [2,] "Sharp upper bound" 78.99

sharpbound(whichEst = "RR_tot",
        sens = sensparSharp,
        pY1_T1_S1 = 0.286,
        pY1_T0_S1 = 0.004,
        pS1_T1 = 0.11,
        pS1_T0 = 0.58,)
#>      [,1]                [,2] 
#> [1,] "Sharp lower bound" 73.91
#> [2,] "Sharp upper bound" 78.99

As before in the zika example, the causal estimand is the risk ratio in the total population, whichEst = "RR_tot". The observed probabilities come from the dataset, and the sensitivity parameters are \(RR_{UY|T=1}=1.9448\), \(RR_{UY|T=0}=2.7089\), \(RR_{SU|11}=1.5566\), \(RR_{SU|00}=1.7058\), \(RR_{SU|10}=1.9998\), and \(RR_{SU|01}=1.7998\), calculated above in sensitivityparametersM(), which gives a lower sharp bound equal to 73.91 and a higher sharp bound equal to 78.99.

`checksharpSVbound()`

The sharpness of the SV bound can be evaluated using checksharpSVbound() (Zetterstrom, Sjölander, and Waernbaum 2025). The first argument is whichEst, indicating the causal estimand of interest ("RR_tot", "RR_sub" or "RD_sub"). Note that the SV bound for the risk difference in the total population is not sharp. The second argument is sens, which is either output from sensitivityparametersM(), a named list or a data frame with column names parameter and value, which includes the bounding factors (\(BF\)). This input is optional as the bounding factors can also be specified directly in the function with the argument below. The third argument is BF, the bounding factors \(BF_{00}\) and \(BF{10}\) for the total population, and \(BF_0\) and \(BF_1\) for the subpopulation. The fourth argument is pY1, the probabilities \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\). Note that the order of the bounding factors and probabilities matter. The output is two strings stating whether the lower and upper SV bound are sharp or not. Note that the SV bound for the risk ratio in the total population and risk difference in the subpopulation can only be arbitrarily sharp, see (Zetterstrom, Sjölander, and Waernbaum 2025) for more details. The code and output are:

checksharpSVbound(whichEst = "RR_tot",
        sens = sensparSV,
        pY1 = c(0.286, 0.004))
#> [[1]]
#> [1] "The lower SV bound for the risk ratio in the total population is arbitrarily sharp. See vignette for details."
#> 
#> [[2]]
#> [1] "The upper SV bound for the risk ratio in the total population is arbitrarily sharp. See vignette for details."

Here, both the lower and upper SV bounds are arbitrarily sharp.

`AFbound()`

The AF bound is calculated using the function AFbound(). The first argument is the causal estimand of interest ("RR_tot", "RD_tot", "RR_sub" or "RD_sub"). The second argument is outcome, where the user inputs either the observed numeric vector with the outcome variable or a vector with the conditional outcome probabilities, \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\). The third argument is treatment, where the user inputs either the observed numeric vector with the treatment variable or a vector with the conditional treatment probabilities, \(P(T=1|I_S=1)\) and \(P(T=0|I_S=1)\). The fourth argument is selection where the user can either input the observed selection vector or the selection probability. Its default value is NULL since it is only required when the causal estimands in the total population are of interest. If the subpopulation is of interest and selection = NULL, the outcome and treatment vectors must only include the selected subjects. The output is the lower and upper AF bounds. The code and output are:

attach(zika_learner)

AFbound(whichEst = "RR_tot",
        outcome = mic_ceph[sel_ind == 1],
        treatment = zika[sel_ind == 1],
        selection = mean(sel_ind))
#>      [,1]             [,2]  
#> [1,] "AF lower bound" 0     
#> [2,] "AF upper bound" 454.09

Similar to before, whichEst = "RR_tot". Furthermore, the outcome and treatment variables are microcephaly and zika. The selection probability is specified since the other variables are restricted to those subjects with \(I_S=1\). The output is the lower and upper AF bounds, which are 0 and 454.09 in the zika example.

If the raw data is not available, one can input the conditional probabilities instead. In this example, these probabilities are:

AFbound(whichEst = "RR_tot",
        outcome = c(0.286, 0.004),
        treatment = c(0.002, 0.998),
        selection = mean(sel_ind))
#>      [,1]             [,2]  
#> [1,] "AF lower bound" 0     
#> [2,] "AF upper bound" 435.14

The difference in these two examples comes from rounding errors in the estimated probabilities.

`GAFbound()`

The GAF bound is calculated using the function GAFbound(). The first argument is the causal estimand of interest ("RR_tot", "RD_tot", "RR_sub" or "RD_sub"). The second argument is sens, which is either output from sensitivityparametersM(), a named list or a data frame with column names “parameter” and “value”, which includes the sensitivity parameters. This input is optional as the sensitivity parameters can also be specified directly in the function with the below arguments. The third and fourth arguments are M and m which are the two sensitivity parameters for the GAF bound. The sensitivity parameters can either be calculated using sensitivityparametersM(), or found elsewhere. For sensitivity parameters found elsewhere, GAFbound() is not restricted to the generalized M-structure. However, the necessary assumptions for the GAF bound must still be fulfilled (Zetterstrom 2024). The fifth argument is outcome, where the user inputs either the observed numeric vector with the outcome variable or a vector with the conditional outcome probabilities, \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\). The fifth argument is treatment, where the user inputs either the observed numeric vector with the treatment variable or a vector with the conditional treatment probabilities, \(P(T=1|I_S=1)\) and \(P(T=0|I_S=1)\). The sixth argument is selection where the user can either input the observed selection vector or selection probability. Its default value is NULL since it is only required when the causal estimands in the total population are of interest. If the subpopulation is of interest and selection = NULL, the outcome and treatment vectors must only include the selected subjects. The output is the lower and upper GAF bounds. The code and output are:

GAFbound(whichEst = "RR_tot",
         M = 0.4502,
         m = 0.002, 
         outcome = mic_ceph[sel_ind == 1],
         treatment = zika[sel_ind == 1],
         selection = mean(sel_ind))
#>      [,1]              [,2]  
#> [1,] "GAF lower bound" 0.01  
#> [2,] "GAF upper bound" 147.42

GAFbound(whichEst = "RR_tot",
         sens = sensparGAF, 
         outcome = mic_ceph[sel_ind == 1],
         treatment = zika[sel_ind == 1],
         selection = mean(sel_ind))
#>      [,1]              [,2]  
#> [1,] "GAF lower bound" 0.01  
#> [2,] "GAF upper bound" 147.42

Similar to before, whichEst = "RR_tot". The sensitivity parameters are the output from sensitivityparametersM(). Furthermore, the outcome and treatment variables are microcephaly and zika. The selection probability is specified since the other variables are restricted to those subjects with \(I_S=1\). The output is the lower and upper GAF bounds, which are 0.01 and 147.42 in the zika example. If the raw data is not available, one can input the conditional probabilities instead, similar to the AF bound.

`CAFbound()`

The CAF bound is calculated using the function CAFbound(). The first argument is the causal estimand of interest ("RR_tot", "RD_tot", "RR_sub" or "RD_sub"). The second and third arguments are M and m which are the two sensitivity parameters for the CAF bound. The fourth argument is outcome, where the user inputs either the observed numeric vector with the outcome variable or a vector with the conditional outcome probabilities, \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\). The fifth argument is treatment, where the user inputs either the observed numeric vector with the treatment variable or a vector with the conditional treatment probabilities, \(P(T=1|I_S=1)\) and \(P(T=0|I_S=1)\). The sixth argument is selection where the user can either input the observed selection vector or selection probability. Its default value is NULL since it is only required when the causal estimands in the total population are of interest. If the subpopulation is of interest and selection = NULL, the outcome and treatment vectors must only include the selected subjects. The output is the lower and upper CAF bounds. The code and output are:

CAFbound(whichEst = "RR_tot",
         M = 0.3,
         m = 0.005, 
         outcome = c(0.286, 0.004),
         treatment = c(0.002, 0.998),
         selection = mean(sel_ind))
#>      [,1]              [,2] 
#> [1,] "CAF lower bound" 0.04 
#> [2,] "CAF upper bound" 67.78

Similar to before, whichEst = "RR_tot". The sensitivity parameters are chosen by the user. Furthermore, the outcome and treatment variables are microcephaly and zika. The selection probability is specified since other variables are restricted to those subjects with \(I_S=1\). The output is the lower and upper CAF bounds, which are 0.04 and 67.78 in the zika example. If the raw data is not available, one can input the conditional probabilities instead, similar to the AF and GAF bounds.

SelectionBias

Introduction