Selecting a study population from a larger source population, based
on the research question, is a common procedure, for example in an
observational study with data from a population register. Subjects who
fulfill all the selection criteria are included in the study population,
and subjects who do not fulfill at least one selection criterion are
excluded from the study population. These selections might introduce a
systematic error when estimating a causal effect, commonly referred to
as selection bias. Selection bias can also arise if the selections are
involuntary, for example, if there are dropouts or other missing values
for some individuals in the study. In an applied study, it is often of
interest to assess the magnitude of potential biases using a sensitivity
analysis, such as bounding the bias. Different bounds, the SV (Smith and
VanderWeele), sharp, AF (assumption-free), GAF (generalized
assumption-free), and CAF (counterfactual assumption-free), for the
causal estimand under selection bias can be calculated in this R package
SelectionBias
. The content is:
zika_learner
: a simulated dataset of zika virus and
microcephaly inspired both by data and a previous example (Araújo et al. 2018; Smith and VanderWeele
2019).sensitivityparametersM()
: a function that calculates
the sensitivity parameters for the SV and GAF bounds for an assumed
model following the M-structure in Figure 1.SVbound()
: a function that calculates the SV lower and
upper bound for the risk ratio or risk difference in either the total or
subpopulation for sensitivity parameters given by the user, or
calculated from sensitivityparametersM()
.sharpbound()
: a function that calculates the sharp
lower and upper bound for the risk ratio or risk difference in either
the total or subpopulation for sensitivity parameters given by the user,
or calculated from sensitivityparametersM()
.AFbound()
: a function that calculates the AF lower and
upper bounds for the risk ratio or risk difference in either the total
or subpopulation for a dataset or probabilities from the data.GAFbound()
: a function that calculates the GAF lower
and upper bounds for the risk ratio or risk difference in either the
total or subpopulation for a dataset or probabilities from the data and
sensitivity parameters either given by the user, or calculated from
sensitivityparametersM()
.CAFbound()
: a function that calculates the CAF lower
and upper bounds for the risk ratio or risk difference in either the
total or subpopulation for a dataset or probabilities from the data and
sensitivity parameters.checksharpSVbound()
: a function that evaluates if the
SV bound is sharp.For the formulas of the bounds as well as the theory behind them, we refer to the original papers [Smith and VanderWeele (2019);Zetterstrom and Waernbaum (2022);Zetterstrom and Waernbaum (2023);Zetterstrom (2024);zetterstrom2025investigations].
To illustrate the bounds, a simulated dataset,
zika_learner
, is constructed. It is inspired by a numerical
zika example used in Smith and VanderWeele
(2019) together with a case-control study that investigates the
effect of zika virus on microcephaly (Araújo et
al. 2018). The variables included are:
The relationships between the variables are illustrated in Figure 2 and Table 1. The prevalences of the variables, and strengths of dependencies between them, are chosen to mimic real data and the assumed values for the sensitivity parameters in Smith and VanderWeele (2019). The simulated data mimics a cohort with 5000 observations, even though the original study is a case-control study. For more details of the variables and the models, see Zetterstrom and Waernbaum (2023).
zika_learner
dataset.The causal dependencies are generated by the logit models described in Table 1.
Model | Coefficients (\(\theta\))/Proportions | Function argument |
---|---|---|
\(P(V=1)\) | \(0.85\) | Vval |
\(P(U=1)\) | \(0.50\) | Uval |
\(P(T=1|V)=g(V'\theta_T)\) | \((-6.20,1.75)\) | Tcoef |
\(P(Y=1|T,U)=g[(T,U)'\theta_{Y}]\) | \((-5.20,5.00,-1.00)\) | Ycoef |
\(P(S_1=1|V,U,T)=g[(V,U,T)'\theta_{S1}]\) | \((1.20,0.00,2.00,-4.00)\) | Scoef |
\(P(S_2=1|V,U,T)=g[(V,U,T)'\theta_{S2}]\) | \((2.20,0.50,-2.75,0.00)\) | Scoef |
The data was generated in R
, version 4.2.0, using the
package arm
, version 1.13-1, with the following code:
# Seed.
set.seed(158118)
# Number of observations.
nObs = 5000
# The unmeasured variable, living area (V).
urban = rbinom(nObs, 1, 0.85)
# The treatment variable, zika.
zika_prob = arm::invlogit(-6.2 + 1.75 * urban)
zika = rbinom(nObs, 1, zika_prob)
# The unmeasured variable, SES (U).
SES = rbinom(nObs, 1, 0.5)
# The outcome variable, microcephaly.
mic_ceph_prob = arm::invlogit(-5.2 + 5 * zika - 1 * SES)
mic_ceph = rbinom(nObs, 1, mic_ceph_prob)
# The first selection variable, birth.
birth_prob = arm::invlogit(1.2 - 4 * zika + 2 * SES)
birth = rbinom(nObs, 1, birth_prob)
# The second selection variable, hospital.
hospital_prob = arm::invlogit(2.2 + 0.5 * urban - 2.75 * SES)
hospital = rbinom(nObs, 1, hospital_prob)
# The selection indicator.
sel_ind = birth * hospital
The resulting proportions of the zika_learner
data, for
the total dataset, the subset with \(S_1=1\) and the subset with \(S_1=S_2=1\) are seen in Tables 2-4.
Not zika infected (N=4939) |
Zika infected (N=61) |
Overall (N=5000) |
|
---|---|---|---|
Microcephaly | |||
Mean | 0.003 | 0.361 | 0.008 |
Living area | |||
Mean | 0.849 | 0.951 | 0.850 |
SES | |||
Mean | 0.499 | 0.426 | 0.498 |
Not zika infected (N=4268) |
Zika infected (N=11) |
Overall (N=4279) |
|
---|---|---|---|
Microcephaly | |||
Mean | 0.003 | 0.273 | 0.004 |
Living area | |||
Mean | 0.845 | 1.000 | 0.846 |
SES | |||
Mean | 0.556 | 0.818 | 0.557 |
Not zika infected (N=2869) |
Zika infected (N=7) |
Overall (N=2876) |
|
---|---|---|---|
Microcephaly | |||
Mean | 0.004 | 0.286 | 0.005 |
Living area | |||
Mean | 0.858 | 1.000 | 0.858 |
SES | |||
Mean | 0.382 | 0.714 | 0.382 |
The dataset and data generating process (DGP) can be used to test the
functions in SelectionBias
.
sensitivityparametersM()
The sensitivity parameters for the SV, sharp, and GAF bounds are calculated for the generalized M-structure, illustrated in Figure 1. The sensitivity parameters are only calculated for an assumed model structure, since they depend on the unobserved variable, U. However, the observed probabilities of the outcome, \(P(Y=1|T=t,I_S=1)\), \(t=0,1\), are inputs. The code and the output are:
# SV bound
sensparSV = sensitivityparametersM(whichEst = "RR_tot",
whichBound = "SV",
Vval = matrix(c(1, 0, 0.85, 0.15), ncol = 2),
Uval = matrix(c(1, 0, 0.5, 0.5), ncol = 2),
Tcoef = c(-6.2, 1.75),
Ycoef = c(-5.2, 5.0, -1.0),
Scoef = matrix(c(1.2, 2.2, 0.0, 0.5,
2.0, -2.75, -4.0, 0.0),
ncol = 4),
Mmodel = "L",
pY1_T1_S1 = 0.286,
pY1_T0_S1 = 0.004)
print(sensparSV)
#> $parameter
#> [1] "BF_11" "BF_00" "BF_10" "BF_01" "RR_UY|T=1" "RR_UY|T=0"
#> [7] "RR_SU|11" "RR_SU|00" "RR_SU|10" "RR_SU|01"
#>
#> $value
#> [1] 1.2102 1.3532 1.3208 1.3895 1.9448 2.7089 1.5566 1.7058 1.9998 1.7998
#>
#> attr(,"class")
#> [1] "sensparams"
#> attr(,"row.names")
#> [1] 1 2 3 4 5 6 7 8 9 10
# Sharp bound
sensparSharp = sensitivityparametersM(whichEst = "RR_tot",
whichBound = "sharp",
Vval = matrix(c(1, 0, 0.85, 0.15), ncol = 2),
Uval = matrix(c(1, 0, 0.5, 0.5), ncol = 2),
Tcoef = c(-6.2, 1.75),
Ycoef = c(-5.2, 5.0, -1.0),
Scoef = matrix(c(1.2, 2.2, 0.0, 0.5,
2.0, -2.75, -4.0, 0.0),
ncol = 4),
Mmodel = "L",
pY1_T1_S1 = 0.286,
pY1_T0_S1 = 0.004)
print(sensparSharp)
#> $parameter
#> [1] "BF_11" "BF_00" "BF_10" "BF_01" "RR_UY|T=1" "RR_UY|T=0"
#> [7] "RR_SU|11" "RR_SU|00" "RR_SU|10" "RR_SU|01"
#>
#> $value
#> [1] 1.2102 1.3532 1.3208 1.3895 1.9448 2.7089 1.5566 1.7058 1.9998 1.7998
#>
#> attr(,"class")
#> [1] "sensparams"
#> attr(,"row.names")
#> [1] 1 2 3 4 5 6 7 8 9 10
# GAF bound
sensparGAF = sensitivityparametersM(whichEst = "RR_tot",
whichBound = "GAF",
Vval = matrix(c(1, 0, 0.85, 0.15), ncol = 2),
Uval = matrix(c(1, 0, 0.5, 0.5), ncol = 2),
Tcoef = c(-6.2, 1.75),
Ycoef = c(-5.2, 5.0, -1.0),
Scoef = matrix(c(1.2, 2.2, 0.0, 0.5,
2.0, -2.75, -4.0, 0.0),
ncol = 4),
Mmodel = "L",
pY1_T1_S1 = 0.286,
pY1_T0_S1 = 0.004)
print(sensparGAF)
#> $parameter
#> [1] "m_T" "M_T"
#>
#> $value
#> [1] 0.0020 0.4502
#>
#> attr(,"class")
#> [1] "sensparams"
#> attr(,"row.names")
#> [1] 1 2
The first argument is whichEst
, where the user inputs
the causal estimand of interest. It must be one of the four
"RR_tot"
, "RD_tot"
, "RR_sub"
or
"RD_sub"
. The second argument is whichBound
,
where the user inputs the bound they want to use, either
"SV"
, "sharp"
, or "GAF"
. Third,
the argument Vval
takes the matrix for V as input.
The first column contains the values that V can take, and the
second column contains the corresponding probabilities. In this example,
V is binary, so the first two elements in the matrix are 1 and
0. However, any discrete V can be used. An approximation of a
continuous V can be used, if it is discretized. The fourth
argument is Uval
, which takes the matrix for U as
input. The matrix U has a similar structure as V. The
fifth argument is Tcoef
, containing the coefficients used
in the model for T. The first entry in Tcoef
is
the intercept of the model, and the second the slope for V. The
sixth argument is Ycoef
, containing the coefficient vector
for the outcome model, where the first entry is the intercept, the
second the slope coefficient for T and third is the slope
coefficient for U. The seventh argument is Scoef
,
the coefficient matrix for the selection variables. The number of rows
is equal to the number of selection variables, and the number of columns
is equal to four. The columns represent the intercept, and slope
coefficients for V, U and T, respectively. A
summary of the code notation is seen in the last column of Table 3. The
eighth argument is Mmodel
, which indicates whether the
models in the M-structure are probit (Mmodel = "P"
) or
logit (Mmodel = "L"
). The ninth and tenth arguments are
pY1_T1_S1
and pY1_T0_S1
. They are the observed
probabilities \(P(Y=1|T=1,I_S=1)\) and
\(P(Y=1|T=0,I_S=1)\). The output is the
sensitivity parameters for the chosen bound.
In the zika example, the estimand of interest is the risk ratio in
the total population, whichEst = "RR_tot"
, the DGP is found
in Table 1, logistic models are used in the DGP, and the probabilities
are found in Table 4. For the SV and sharp bounds, the output is \(RR_{UY|T=1}=1.9448\), \(RR_{UY|T=0}=2.7089\), \(RR_{SU|11}=1.5566\), \(RR_{SU|00}=1.7058\),\(RR_{SU|10}=1.9998\), and \(RR_{SU|01}=1.7998\), which gives \(BF_{11}=1.2102\) and \(BF_{00}=1.3532\), \(BF_{10}=1.3208\), and \(BF_{01}=1.3895\). For the GAF bound, the
output is \(M_T=0.4502\) and \(m_T=0.002\).
SVbound()
The SV bound can be calculated using the function
SVbound()
. The first argument is whichEst
,
indicating the causal estimand of interest ("RR_tot"
,
"RD_tot"
, "RR_sub"
or "RD_sub"
).
The second argument is sens
, which is either output from
sensitivityparametersM()
, a named list or a data frame with
column names parameter and value, which includes the sensitivity
parameters. This input is optional as the sensitivity parameters can
also be specified directly in the function with the below arguments. The
third and fourth arguments are the observed conditional probabilities
\(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\), which are needed to
calculate bounds for the causal estimands and not the selection bias
itself. The optional fifth and sixth arguments are the treatment
probabilities \(P(T=1|I_S=1)\) and
\(P(T=0|I_S=1)\). These are only needed
for the alternative SV bound for the causal risk difference in the
subpopulation. If they are not included, the original SV bound in the
risk difference is calculated. The subsequent arguments are the
sensitivity parameters provided by the user. The default value for all
sensitivity parameters are NULL
, and the user must then
specify numeric values on the sensitivity parameters that are necessary
for the bound for the chosen estimand. These arguments are also optional
in case the argument sens
is used. The sensitivity
parameters can either be calculated using the function
sensitivityparametersM()
, or found elsewhere. For
sensitivity parameters found elsewhere, SVbound()
is not
restricted to the generalized M-structure. However, the necessary
assumptions for the SV bound must still be fulfilled (Smith and VanderWeele 2019). The output is the
SV bound. The code and output are:
SVbound(whichEst = "RR_tot",
pY1_T1_S1 = 0.286,
pY1_T0_S1 = 0.004,
RR_UY_T1 = 1.9448,
RR_UY_T0 = 2.7089,
RR_SU_11 = 1.5566,
RR_SU_00 = 1.7058,
RR_SU_10 = 1.9998,
RR_SU_01 = 1.7998)
#> [,1] [,2]
#> [1,] "SV lower bound" 43.66
#> [2,] "SV upper bound" 131.22
As before in the zika example, the causal estimand is the risk ratio
in the total population, whichEst = "RR_tot"
. The
sensitivity parameters are \(RR_{UY|T=1}=1.9448\), \(RR_{UY|T=0}=2.7089\), \(RR_{SU|11}=1.5566\), \(RR_{SU|00}=1.7058\),\(RR_{SU|10}=1.9998\), and \(RR_{SU|01}=1.7998\), calculated above in
sensitivityparametersM()
, which gives a lower SV bound
equal to 43.66 and a higher SV bound equal to 131.22. Alternatively, the
output from the function sensitivityparametersM()
can be
used directly as input which gives the code:
SVbound(whichEst = "RR_tot",
sens = sensparSV,
pY1_T1_S1 = 0.286,
pY1_T0_S1 = 0.004)
#> [,1] [,2]
#> [1,] "SV lower bound" 43.66
#> [2,] "SV upper bound" 131.22
This approach gives the same bounds as specifying the sensitivity parameters manually.
checksharpSVbound()
The sharpness of the SV bound can be evaluated using
checksharpSVbound()
(Zetterstrom,
Sjölander, and Waernbaum 2025). The first argument is
whichEst
, indicating the causal estimand of interest
("RR_tot"
, "RR_sub"
or "RD_sub"
).
Note that the SV bound for the risk difference in the total population
is not sharp. The second argument is sens
, which is either
output from sensitivityparametersM()
, a named list or a
data frame with column names parameter and value, which includes the
bounding factors (\(BF\)). This input
is optional as the bounding factors can also be specified directly in
the function with the argument below. The third argument is
BF
, the bounding factors \(BF_{00}\) and \(BF{10}\) for the total population, and
\(BF_0\) and \(BF_1\) for the subpopulation. The fourth
argument is pY1
, the probabilities \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\). Note that the order of
the bounding factors and probabilities matter. The output is two strings
stating whether the lower and upper SV bound are sharp or not. Note that
the SV bound for the risk ratio in the total population and risk
difference in the subpopulation can only be arbitrarily sharp, see (Zetterstrom, Sjölander, and Waernbaum 2025) for
more details. The code and output are:
checksharpSVbound(whichEst = "RR_tot",
sens = sensparSV,
pY1 = c(0.286, 0.004))
#> [[1]]
#> [1] "The lower SV bound for the risk ratio in the total population is arbitrarily sharp. See vignette for details."
#>
#> [[2]]
#> [1] "The upper SV bound for the risk ratio in the total population is arbitrarily sharp. See vignette for details."
Here, both the lower and upper SV bounds are arbitrarily sharp.
AFbound()
The AF bound is calculated using the function AFbound()
.
The first argument is the causal estimand of interest
("RR_tot"
, "RD_tot"
, "RR_sub"
or
"RD_sub"
). The second argument is outcome
,
where the user inputs either the observed numeric vector with the
outcome variable or a vector with the conditional outcome probabilities,
\(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\). The third argument is
treatment
, where the user inputs either the observed
numeric vector with the treatment variable or a vector with the
conditional treatment probabilities, \(P(T=1|I_S=1)\) and \(P(T=0|I_S=1)\). The fourth argument is
selection
where the user can either input the observed
selection vector or the selection probability. Its default value is
NULL
since it is only required when the causal estimands in
the total population are of interest. If the subpopulation is of
interest and selection = NULL
, the outcome and treatment
vectors must only include the selected subjects. The output is the lower
and upper AF bounds. The code and output are:
attach(zika_learner)
AFbound(whichEst = "RR_tot",
outcome = mic_ceph[sel_ind == 1],
treatment = zika[sel_ind == 1],
selection = mean(sel_ind))
#> [,1] [,2]
#> [1,] "AF lower bound" 0
#> [2,] "AF upper bound" 454.09
Similar to before, whichEst = "RR_tot"
. Furthermore, the
outcome and treatment variables are microcephaly and zika. The selection
probability is specified since the other variables are restricted to
those subjects with \(I_S=1\). The
output is the lower and upper AF bounds, which are 0 and 454.09 in the
zika example.
If the raw data is not available, one can input the conditional probabilities instead. In this example, these probabilities are:
AFbound(whichEst = "RR_tot",
outcome = c(0.286, 0.004),
treatment = c(0.002, 0.998),
selection = mean(sel_ind))
#> [,1] [,2]
#> [1,] "AF lower bound" 0
#> [2,] "AF upper bound" 435.14
The difference in these two examples comes from rounding errors in the estimated probabilities.
GAFbound()
The GAF bound is calculated using the function
GAFbound()
. The first argument is the causal estimand of
interest ("RR_tot"
, "RD_tot"
,
"RR_sub"
or "RD_sub"
). The second argument is
sens
, which is either output from
sensitivityparametersM()
, a named list or a data frame with
column names “parameter” and “value”, which includes the sensitivity
parameters. This input is optional as the sensitivity parameters can
also be specified directly in the function with the below arguments. The
third and fourth arguments are M
and m
which
are the two sensitivity parameters for the GAF bound. The sensitivity
parameters can either be calculated using
sensitivityparametersM()
, or found elsewhere. For
sensitivity parameters found elsewhere, GAFbound()
is not
restricted to the generalized M-structure. However, the necessary
assumptions for the GAF bound must still be fulfilled (Zetterstrom 2024). The fifth argument is
outcome
, where the user inputs either the observed numeric
vector with the outcome variable or a vector with the conditional
outcome probabilities, \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\). The fifth argument is
treatment
, where the user inputs either the observed
numeric vector with the treatment variable or a vector with the
conditional treatment probabilities, \(P(T=1|I_S=1)\) and \(P(T=0|I_S=1)\). The sixth argument is
selection
where the user can either input the observed
selection vector or selection probability. Its default value is
NULL
since it is only required when the causal estimands in
the total population are of interest. If the subpopulation is of
interest and selection = NULL
, the outcome and treatment
vectors must only include the selected subjects. The output is the lower
and upper GAF bounds. The code and output are:
GAFbound(whichEst = "RR_tot",
M = 0.4502,
m = 0.002,
outcome = mic_ceph[sel_ind == 1],
treatment = zika[sel_ind == 1],
selection = mean(sel_ind))
#> [,1] [,2]
#> [1,] "GAF lower bound" 0.01
#> [2,] "GAF upper bound" 147.42
GAFbound(whichEst = "RR_tot",
sens = sensparGAF,
outcome = mic_ceph[sel_ind == 1],
treatment = zika[sel_ind == 1],
selection = mean(sel_ind))
#> [,1] [,2]
#> [1,] "GAF lower bound" 0.01
#> [2,] "GAF upper bound" 147.42
Similar to before, whichEst = "RR_tot"
. The sensitivity
parameters are the output from sensitivityparametersM()
.
Furthermore, the outcome and treatment variables are microcephaly and
zika. The selection probability is specified since the other variables
are restricted to those subjects with \(I_S=1\). The output is the lower and upper
GAF bounds, which are 0.01 and 147.42 in the zika example. If the raw
data is not available, one can input the conditional probabilities
instead, similar to the AF bound.
CAFbound()
The CAF bound is calculated using the function
CAFbound()
. The first argument is the causal estimand of
interest ("RR_tot"
, "RD_tot"
,
"RR_sub"
or "RD_sub"
). The second and third
arguments are M
and m
which are the two
sensitivity parameters for the CAF bound. The fourth argument is
outcome
, where the user inputs either the observed numeric
vector with the outcome variable or a vector with the conditional
outcome probabilities, \(P(Y=1|T=1,I_S=1)\) and \(P(Y=1|T=0,I_S=1)\). The fifth argument is
treatment
, where the user inputs either the observed
numeric vector with the treatment variable or a vector with the
conditional treatment probabilities, \(P(T=1|I_S=1)\) and \(P(T=0|I_S=1)\). The sixth argument is
selection
where the user can either input the observed
selection vector or selection probability. Its default value is
NULL
since it is only required when the causal estimands in
the total population are of interest. If the subpopulation is of
interest and selection = NULL
, the outcome and treatment
vectors must only include the selected subjects. The output is the lower
and upper CAF bounds. The code and output are:
CAFbound(whichEst = "RR_tot",
M = 0.3,
m = 0.005,
outcome = c(0.286, 0.004),
treatment = c(0.002, 0.998),
selection = mean(sel_ind))
#> [,1] [,2]
#> [1,] "CAF lower bound" 0.04
#> [2,] "CAF upper bound" 67.78
Similar to before, whichEst = "RR_tot"
. The sensitivity
parameters are chosen by the user. Furthermore, the outcome and
treatment variables are microcephaly and zika. The selection probability
is specified since other variables are restricted to those subjects with
\(I_S=1\). The output is the lower and
upper CAF bounds, which are 0.04 and 67.78 in the zika example. If the
raw data is not available, one can input the conditional probabilities
instead, similar to the AF and GAF bounds.