This vignette provides a demonstration of the functions available in
the R package branchingprocess. This package quantifies
outbreak risk posed by an infectious disease with emerging transmission
potential.
Even when infectious pathogens are able to transmit from person to person, the realized risk of a particular pathogen causing an outbreak that infects many individuals over multiple generations of transmission is uncertain. For example, zoonotic pathogens that primarily exist in non-human animal reservoirs occasionally spill over to humans, and but onward transmission from human to human may or may not occur. Even when a spillover event leads to an outbreak infecting dozens of people, onward transmission often fizzles out.
Thus we can ask, what is the probability that a new spillover event will lead to an ever larger outbreak than what has been observed during past outbreaks? In other words, have large outbreaks been avoided up to the current time point simply by chance? The functions in this package can help provide quantitative answers to these questions.
A branching process is a stochastic (random) process in which each infected individual in generation \(n\) produces a random number of new infected individuals, or “offspring,” in generation \(n+1\), continuing for some number of generations or until there are no newly infected individuals remaining. If no infected individuals remain this is called extinction.
The random number of “offspring” infections produced by each infected individual is drawn from the offspring distribution, a discrete probability distribution with non-negative range. To model infectious disease outbreaks, it is common to use a negative binomial offspring distribution with two parameters: the mean \(R\), commonly known as the reproduction number, and the dispersion parameter \(k\).
\(R\), the reproduction number, is the expected (mean) number of transmissions from an infected individual. For a fixed value of \(R\), a lower dispersion \(k\) gives a higher variance of the negative binomial distribution: variance = \(R + R^2/k\). Infectious diseases with a lower value of \(k\) exhibit a higher rate of “superspreading” events, in which a single individual produces a number of transmission substantially higher than the mean.
pNextGenSize(x,y,R,k) functionThe function pNextGenSize(x,y,R,k) extends the
dnbinom() function to more than one potentially
transmitting individual: it calculates the probability that
x individuals in one generation transmit to a total of
y individuals in the next generation, where each of the
x individuals independently transmits according the
negative binomial distribution with parameters R and
k.
Example Problem: With \(R=0.8\) and \(k=0.1\), and with 2 individuals infected, what is the probability that those 2 individuals will produce a total of exactly 5 transmissions in the next generation?
There is about a 2.1% chance that the next generation will consist of
exactly 5 infected individuals. Note that pNextGenSize()
does not explicitly calculate the component probabilities for each of
the two potential transmitters (e.g., one transmitting to 2 and other
transmitting to 3, etc.); it only calculate the probability that the
total number of transmissions is 5.
We can observe some effects of the dispersion parameter \(k\) by comparing part of the distribution for two examples:
pNextGenSize(x=2, y=c(0,1,5,10), R=0.8, k=10)*100
#> [1] 21.454820740 31.784919615  2.033695140  0.002137314
pNextGenSize(x=2, y=c(0,1,5,10), R=0.8, k=0.1)*100
#> [1] 64.4394015 11.4558936  2.1146614  0.6795566The results show the probability (times 100, or % chance) that two individuals transmit to a total of exactly 0, 1, 5, or 10 when \(R = 0.8\), for two values of \(k\).
With a high value of \(k=10\), there is only a 21% chance that no transmissions occur, but the probability of a higher numbers of transmissions drops quickly: there is only a 0.002% or approximately one-in-500 chance that 10 transmissions occur.
With a low value of \(k=0.1\), we see some very different results despite the fact that the reproduction number is the same. This is because the varaince of the negative binomial distribution is inversly proportional to that value of \(k\). From the perspective of public health, a lower value of \(k\) is a mixture of good and bad news. The good news is that the probability of no transmissions (the best case scenario) has risen to 64%, about 3 times higher than the result for \(k=10\). The bad news is that if transmissions do occur, there is a substantially higher chance that the number of transmissions will be high; for example there is a 0.8% chance of 10 total transmissions from the first two individuals.
Of course, if the initial cases produce transmissions in the next generation, the proability of a “small” outbreak occuring is non-zero. Those transmissions produce potentially infectious individuals in the second generation who can subsequently transmit to produce a third generation, and so on. If \(R\) does not exceed 1, it is mathematically provable that the outbreak will eventual go extinct even if a second or third generation of infections occurs. It is guaranteed that there will be eventually be a generation of infected individuals who all produce zero transmissions. Nonetheless, it is still important to ask: how big could an outbreak get before extinction inevitably occurs?
A common approach to analyzing the distribution of stochastic
outbreak sizes is to run repeated simulations of outbreaks that
randomize transmissions from each simulated individual and repeat,
e.g. within a while loop, until no further transmissions
can occur. However, there are mathematical formulae derived in the
literature that can efficiently calculate the distribution of final
outbreak sizes with no simulation needed. A summary of the history and
derivation of this formula can be found in Toth et al. (2015): https://doi.org/10.3201/eid2108.150170
The function pFinalSize() makes use of such a formula,
for the case of the same negative binomial offspring distribution
applied independently to each infected individual. For example,
This calculated the probability that, with \(R=0.8\) and \(k=0.1\) applied to every case, \(n=2\) initially infected individuals produce an outbreak that ends with a final size of \(j=12\), i.e. the \(n=2\) initial cases plus 10 subsequent cases, over any possible number of transmission generations.
Now let’s look at an example of how this function could be applied to
study data from actual observed outbreak sizes. The H5N1 strain of
influenza is currently of concern because of outbreaks among animals
that sometimes cause spillover to humans. If these spillover events
start leading to some substantial outbreaks involving human-to-human
transmission, how could we use pFinalSize() to study these
data?
Say there are 30 H5N1 human infections from zoonotic spillover events, i.e. 30 instances of human infection from exposure to animals. In 25 of the spillover events, no subsequent human-to-human occurred. In 3 of the spillover events, exactly one human-to-human transmission occurred from the initially infected human with no further transmission (final size of 2 infections). In one spillover event, there was an outbreak of final size 20 infections, and another spillover event produced an outbreak of final size 50 infections. What values of R and k would best characterize this set of observations?
We can make use of the pFinalSize() function to create a
(log) likelihood function for this dataset of outbreak clusters. First,
we write a function called pClusterSizes() that calculates
the probabilities of the four different final sizes observed (1, 2, 20,
and 50) for given values of \(R\) and
\(k\):
The argument n to pFinalSize is the initial
number of cases at the start of an outbreak, here set to
n=1 because each of the observed H5N1 outbreak started from
one initial infection (the spillover event). The argument j
is the final size of the outbreak, here set as a vector of the four
unique final sizes observed.
If the reproduction number R = 1.5, and the dispersion
parameter k = 1, the function calculates a 40% chance that
an outbreak of total size 1 (no tranmsission) would occur, a 9.6% chance
of exactly size 2, a 0.12% chance of exactly size 20, and less than
0.01% chance of outbreak of exactly size 50.
Next we write a function logLikData() to calculate the
log likelihood of observing 25, 3, 1, and 1 outbreak(s) of those four
sizes, for given values of R and k:
What is the likelihood of observing exactly these 30 outbreak sizes after 30 spillover events, for our test values of R and k?
Can we do better for different choices of R and k? We can use
optim to locate the maximum likelihood solution for R and k
by optimizing the log likelihood using the built in R function
optim():
optim(f = function(x) ifelse(all(x>0), -logLikData(x[1],x[2]), Inf), par = c(R=1,k=1))$par
#>          R          k 
#> 0.70283388 0.07953917The maximum likelihood estimate is R = 0.7, k = 0.08.
Once an outbreak is underway after an initial (e.g. spillover) event has taken place, the affected individuals and public health officials may take action to reduce the chances of further transmission. Efforts to reduce transmission may be especially intense after a symptomatic individual tests positive for an emerging, high-risk pathogen. To model that effect, the package provides functions to calculate the final size distribution for a branching process in which the offspring distribution parameters (e.g. the reproduction number) changes after a given number of transmission generations following the initial infection.
The function pFinalSizeSwitchOne(n,j,R0,k0,Rc,kc)
calculates the probability that n initial cases lead to an
outbreak of final size j given that the initial cases
transmit with a negative binomial offspring distribution with parameters
R = R0 (the basic reproduction number) and
k = k0, and any/all subsequent cases transmit with a
negative binomial offspring distribution with parameters
R = Rc (the control reproduction number) and
k = kc.
In the above example of data from 30 spillover events, there were only 5 events resulting in transmission from the initial case. Thus, there were only 5 outbreaks providing data that might be used to estimate Rc and kc - not enough to provide robust estimates. However, suppose experts judged that control measures enacted after identifying a spillover event reduced transmission potential of subsequent cases by half. Assuming, say, \(R_c=R_0/2\) and \(k_c=k_0\), how would our estimates of \(R_0\) and \(k_0\) change?
pClusterSizesSwitch1 <- function(R0,k0) Vectorize(pFinalSizeSwitch1)(n=1, j=c(1,2,20,50), R0=R0, k0=k0, Rc=R0/2, kc=k0)
logLikDataSwitch1 <- function(R0,k0) sum(c(25,3,1,1)*log(pClusterSizesSwitch1(R0,k0)))
optim(f = function(x) ifelse(all(x>0), -logLikDataSwitch1(x[1],x[2]), Inf), par = c(R0=1,k0=1))$par
#>         R0         k0 
#> 1.07610845 0.06205758The resulting estimate of \(R_0>1\) suggests that an uncontrolled outbreak might be possible in the absence of any control measures like the ones that were implented after the 5 spillover events that resulted in human-to-human transmission.
There could be more information from an outbreak cluster than just
the final size. For example, the number of generations of transmission
over which the outbreak occurred might also be known. The
pFinalSize function accounts for all possibilities for the
number of generations. For example, an outbreak of size 10 could have
occurred in as few as one transmission generation (the initial case
transmitting directly to 9 others, who each transmit to none), or as
many as nine generations (every case transmitting to exactly one other,
until the tenth case fails to transmit). To break down the
possibilities, we also provide the
pFinalSizeAndGen(g,n,j,R,k) function, which calculates the
joint probability that n initial cases leads to an outbreak
of final size j over g transmission
generations, when the offspring distribution is negative binomial with
mean R and dispersion k.
The following gives the probability of each of the 9 possibilities for the joint probability of an outbreak of final size 10 and 1 to 9 total transmission generations, for R = 0.8 and k = 0.07.
finalSize10 <- Vectorize(pFinalSizeAndGen,'g')(g=1:9,n=1,j=10,R=0.8,k=0.07)
finalSize10
#> [1] 7.550110e-04 1.694990e-03 6.145547e-04 9.859817e-05 9.145005e-06
#> [6] 5.257486e-07 1.859572e-08 3.726243e-10 3.250305e-12
c(sum(finalSize10), pFinalSize(n=1,j=10,R=0.8,k=0.07))
#> [1] 0.003172844 0.003172844The second output above demonstrates that the 9 probabilities sum to
the total probability of an outbreak of final size 10 calculated by
pFinalSize.
Now say we have additional information about our example H5N1 cluster size data set: the size-20 outbreak occurred over 3 total generations of transmission, and the size-50 outbreak occurred over 4 generations. Does this change our estimates for R and k?
pClusterSizeAndGen <- function(R,k) Vectorize(pFinalSizeAndGen)(g=c(0,1,3,4),n=1, j=c(1,2,20,50), R=R, k=k)
logLikDataGen <- function(R,k) sum(c(25,3,1,1)*log(pClusterSizeAndGen(R,k)))
optim(f = function(x) ifelse(all(x>0), -logLikDataGen(x[1],x[2]), Inf), par = c(R=1,k=1))$par
#>          R          k 
#> 0.70286569 0.06847609In this example, the generation data did not substantially alter our estimates.
We also provide the function
pClusterSizeAndGenSwitch1(g,n,j,R0,k0,Rc,kc), which
caculates the same joint size & generation probability for the model
with a control reproduction number after the first transmission
generation.
What happens if we try to estimate three parameters simultaneously: \(R_0\), \(R_c\), and \(k=k_0=k_c\):
pClusterSizeAndGenControl <- function(R0,Rc,k) Vectorize(pFinalSizeAndGenSwitch1)(g=c(0,1,3,4),n=1, j=c(1,2,20,50), R0=R0, k0=k, Rc=Rc, kc=k)
logLikDataGenControl <- function(R0,Rc,k) sum(c(25,3,1,1)*log(pClusterSizeAndGenControl(R0,Rc,k)))
optim(f = function(x) ifelse(all(x>0), -logLikDataGenControl(x[1],x[2],x[3]), Inf), par = c(R0=1,Rc=1,k=1))$par
#>         R0         Rc          k 
#> 0.88844217 0.62453674 0.06894505With estimates based on past outbreaks in hand, what can we say about what may happen after future spillover events? What are the chances that we have not yet seen a “worst case scenario” and that even larger outbreaks are coming?
For example, the values R = 0.7, k = 0.08, applied to call generations, were the estimates derived above using example data in which the largest outbreak from a single spillover event was 50 cases, after 30 spillover events. What is the probability that a new spillover event will result in an outbreak of even more cases than the previous maximum? For example, more than twice as big as the previous worst outbreak?
To calculate the probability that the final size of an outbreak is \(x\) or more, we must first calculate the sum of the probabilities that the final size is exactly each the sizes from 1 to \(x-1\). For example, to calculate the probability of 100-or-more cases:
The model estimates that there would be about 0.5% or roughly 1-in-200 chance of a 100-or-more case outbreak after a future spillover event.
This type of result is known as an “exceedance” probability. There is
not a separate function for calculating exceedance, we only have a
function to calculate the probability of an exact final size. Hence, the
above function call to pFinalSize calculates the
probability of a final size exactly equal to 1, 2, … 99, and one minus
the sum of these probabilities is the exceedance probability
required.
How does this estimate change for the result from the second model that found \(R_0\) = 1.08 and \(k_0\) = 0.06, under the assumption of a control reproductive number \(R_c = R_0/2\) and \(k_c=k_0\)?
The result is a slightly lower 0.4% estimated chance of a 100-or-more person outbreak. But what if the \(R_0\) = 1.08 \(k_0\) = 0.06 result applied to every case, i.e. no control measures were enacted for subsequent outbreaks?
The chance of a large outbreak rises substantially to over 2%. With \(R\) being over the supercritical threshold of 1, that 2% chance includes the risk of an outbreak that grows indefinitely, addressed in the next sub-section.
We provide the function pExtinct(R,k), which calculates
the probability that an outbreak will eventually go extinct with a
negative binomial offspring distribution parameterized by mean
R and dispersion k. Then one minus this
probability is the “escape” probability, i.e. the risk that an outbreak
will grow indefinitely until something acts to reduce the reproduction
number, such as depletion of susceptible individuals in the population.
When \(R\leq 1\), the extinction
probability is 1, but if \(R>1\) the
probability depends on both R and k. From the example above:
Interestingly, despite the fact that \(R>1\), the escape probability is less than 1%. This suggests that seeing a large number of spillover events that all lead to extinguished outbreaks does not necessarily mean that the pathogen poses no risk of an extremely large outbreak.
While this branching process model does not include an explicit time
component, it does contain a measure of outbreak duration as quantified
by the number of transmission generations that occur before the outbreak
is extinguished. We provide the function pGen(gMax,R,k),
which calculates, for a single initial case and negative binomial
offspring distribution parameterized by mean R and
dispersion k, the probability that the an outbreak produces
less than g generations of transmission for g = 1 to
gMax.
pGen(gMax=10,R=0.7,k=0.08)
#>  [1] 0.8334502 0.9306003 0.9627497 0.9776830 0.9858313 0.9906938 0.9937569
#>  [8] 0.9957540 0.9970860 0.9979878The output is vector of probabilities of length gMax -
in this example we can see there is a 99.8% of the outbreak lasting less
than 10 generations of transmission, or a 0.2% risk of an outbreak
lasting 10 generations of transmission or more.
We also provide pGenSwitch1(gMax,R0,k0,Rc,kc), which
calculates the same probabilities for the model with a control
reproduction number.
pGenSwitch1(gMax=10,R0=1.08,k0=0.06,Rc=0.54,kc=0.06)
#>  [1] 0.8380594 0.9474815 0.9770431 0.9887943 0.9942516 0.9969780 0.9983912
#>  [8] 0.9991379 0.9995364 0.9997502While this model did not produce a substantially different final size risk compared to the previous one in the example above, it does produce a different duration risk: the risk of a 10-or-more generation outbreak is an order of magnitude lower at 0.02%.
We provide another set of functions that can be used to analyze an
ongoing outbreak prior to extinguishing, i.e. before it has reached
final size. The functions pSizeAtGen(g,n,j,R,k) and
pSizeAtGenSwitch1(g,n,j,R0,k0,Rc,kc) calculate the
probability that n initial cases lead to an outbreak that
lasts at least g generations of transmission AND has
exactly j total cases after generation g.
In the H5N1 data analysis example, these functions could be used to include data from an ongoing outbreak that may not yet be extinguished. For example, a spillover even that produced an outbreak with 10 cases (so far) after 2 generations of transmission: