---
title: "Sampling from Gompertz processes"
author: "TA Trikalinos" 
date: "2025-08-20"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Sampling Gompertz times}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  % \VignetteDepends{tictoc, truncnorm}
---

```{r, include = FALSE}
set.seed(20250820)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
K <- 10^5
```

## Simulation description  

Assume a population of $K = `r K`$ individuals indexed by $k \in [K] := \{ 1, \dots, K\}$. We will sample events from Gompertz processes over the age (time) interval $[T_{k0}, T_{k1})$. For example, $T_{k0}$ may be the age in years when person $k$ enters the simulation, and $T_{k1}$ the age in years when that person exits the simulation (e.g., dies from an 'other cause'). (In practice, $T_{k1}$ would be obtained by separate point process.)  Only some people will develop clinical cancer over their simulated lifetime. 

To fix a simulation scenario, let $T_{k0} = 20$ for all $k$ and $T_{k1} \sim U(60 , 100)$, where $U()$ is the uniform distribution.  


### Setup 
We will use `data.table`s -- but the same analysis should be obvious in base `R`. 
We will use functions from three more packages, without loading them:
The `tictoc()` package will be used for a simple time comparison between
the different ways one can simulate this problem. The `stats` package is used
to generate normally and uniformly distributed samples. 

```{r setup}
library(data.table)
library(nhppp)
```

Setup `pop`, the population `data.table`. The person specific values T_{k0}$ and $T_{k1}$ are variables in `pop`.
  
```{r population}
pop <- setDT(
  list(
    id = 1:K,
    T0 = rep(20, K),
    T1 = stats::runif(n = K, min = 60, max = 100)
  )
)
setindex(pop, id)
pop
```

## The Gompertz Process functions  

The Gompertz intensity function is 

$\lambda_k(t) = a_k b_k \ e^{b_k t}$, 

where $t$ is age in years. The parameters $a_k, \b_k$ are random over the individuals in the population with 
$a_k \sim U(0.0045, 0.0055)$, and 
$b_k \sim U(0.085, 0.095)$.

Add these values as person-level parameters in the dataset: 

``` {r} 
pop[, `:=`(
  a_gompertz = stats::runif(n = K, min = 0.0045, max = 0.0055),
  b_gompertz = stats::runif(n = K, min = 0.085, max = 0.095)
  )]
```


A cumulative intensity function is 

$\Lambda_k(t) = a_k (e^{b_k t} - 1)$, 

with the integration constant was arbitrarily set so that $\Lambda(0) = 0$. The inverse cumulative intensity function is 

$\Lambda^{-1}(z) = b_k^{-1} \ \log \big (\frac{z}{a_k} -1 \big) $.

#### Vectorized specification of the Gompertz $\lambda()$, $\Lambda()$, and $\Lambda^{-1}()$

Define vectorized forms of the Gompertz $\lambda()$, $\Lambda()$, and $\Lambda^{-1}()$ functions that take as 
default the values in `pop`.

```{r function-lambda-gompertz} 
l_gompertz <- function(t, a = pop$a_gompertz, b = pop$b_gompertz, ...) {
  a * b * exp(b * t)
}

L_gompertz <- function(t, a = pop$a_gompertz, b = pop$b_gompertz, ...) {
  a * exp(b * t) - a
}

Li_gompertz <- function(z, a = pop$a_gompertz, b = pop$b_gompertz, ...) {
  log(z/a +1) / b
}
 
```


### Method 1: Vectorized sampling using only $\lambda()$ 

When you only know the intensity function $\lambda$, `nhppp` employs a thinning 
algorithm. 

One of the items needed for the thinning algorithm is a piecewise constant
majorizer function $\lambda_*$ such that: $\lambda_*(t) >= \lambda(t)$ for 
all $t$ of interest.
 
The `nhppp::vdraw_intensity` function assumes that 
you will provide the majorizer function as a matrix (`lambda_maj_matrix`). 
To create this matrix, split the simulation time (here, from age 40 to age 100) 
in $M$ equal-length intervals. 
For person $k$ and interval $m$, the element `lambda_maj_matrix[k, m]`
records a supremum of $\lambda_k$ over the $m$-th interval. 
Any supremum will do -- but the algorithm is most 
efficient when you give it the least upper bound  -- practically, the maximum 
of $\lambda(t)$ over all $t$ in the interval. For monotone intensity functions, 
such as the Gompertz, the maximum is at one of the interval's bounds. 
It will be at the left bound, if $\lambda$ is decreasing, and at the right bound, 
if $\lambda$ is increasing. 

There is a helper function in `nhppp` that generates the majorizer matrix 
automatically for monotone (and possibly discontinuous) functions and 
for nonmonotone continuous Lipschitz functions (functions whose maximum slope is 
bounded). Even if your case is more complex, you should be able to find a 
supremum that works. 

This code samples in a vectorized fashion when you know only $\lambda()$. It creates
a majorizer matrix over $M=5$ intervals (we chose this arbitrarily -- not trying to be fast). To let the software know which times 
correspond to each of the $M$ intervals it suffices to specify a start and stop 
time for each row of the majorizer matrix with the `rate_matrix_t_min` and `rate_matrix_t_max`
options. The sampling intervals $[T_{k0}, T_{k1})$for each simulated 
person are a subset of the interval for which the majorizer matrix is defined, 
and are specified with the `t_min` and `t_max` options. 
(The `atmostB` option can be useful to speed up the sampling and minimize memory 
needs when one is interested in the first event only. The smaller the value, the faster 
the algorithm but you have to check that you have not specified it to be too small.
In this example, `atmostB = 5` is fine -- it returns exact solutions; but we 
have checked it [not shown].  If you do not want to mess with it, 
do not use the option. The function may be already fast enough for your needs). 

```{r method-1}
tictoc::tic("Method 1 (vectorized, thinning)")
M <- 5
break_points <- seq.int(from = 20, to = 100, length.out = M + 1)
breaks_mat <- matrix(rep(break_points, each = K), nrow = K)

lmaj_mat <- nhppp::get_step_majorizer(
  fun = l_gompertz,
  breaks = breaks_mat,
  is_monotone = TRUE
)

pop[
  ,
  t_thinning := nhppp::vdraw_intensity(
    lambda = l_gompertz,
    lambda_maj_matrix = lmaj_mat,
    rate_matrix_t_min = 20,
    rate_matrix_t_max = 100,
    t_min = pop$T0,
    t_max = pop$T1,
    atmost1 = TRUE,
    atmostB = 10
  )
]
tictoc::toc(log = TRUE) # timer end
```

### Method 2: Vectorized sampling using $\Lambda()$ and $\Lambda^{-1}()$

The most efficient sampling is possible when one knows $\Lambda()$ and $\Lambda^{-1}()$. 
The `nhppp` package can sample in this case using the `vdraw_cumulative_intensity()`
function. Here `range_t` is a matrix with information on each person's $[T_{k0}, T_{k1})$. 

```{r method-2}
tictoc::tic("Method 2 (vectorized, inversion)")
pop[
  ,
  t_inversion := nhppp::vdraw_cumulative_intensity(
    Lambda = L_gompertz,
    Lambda_inv = Li_gompertz,
    t_min = pop$T0,
    t_max = pop$T1,
    atmost1 = TRUE
  )
]
tictoc::toc(log = TRUE) # timer end
```

### Comparisons 

#### Simulation time-costs
The simulation time-costs that you see in this document depend on the machine 
that rendered it. If you read this online, this machine is probably some 
virtual server with minimal resources. If you installed the package locally, 
it is probably the machine you are using to run `R`. 

1. `r tictoc::tic.log()[[1]]`. This is the slower approach -- but still not
 bad for $`r K`$ samples! It uses the thinning algorithm which is very flexible -- it can accommodate very complex time varying intensity 
functions. You almost always know $\lambda$ and can get its majorizer $\lambda_*$ 
easily and fast.  

2. `r tictoc::tic.log()[[2]]`. This approach is many times faster that the previous one, 
but requires implementations for $\Lambda$ and $\Lambda^{-1}$.

### Simulated times 

Both methods sample correctly from the specified Gompertz process. There 
is no approximation at play.

The QQ plots compare the simulated times with the two methods. The agreement 
is excellent over this population of size $K = `r K`$. As $K$ increases 
the agreement remains excellent (not shown here - try it for yourself). 
The paper in the bibliography includes in-depth comparisons. 
A set of QQ plots should suffice here.

```{r qq-plot, fig.alt="QQ plot comparing simulated times with the two methods. The QQ plot indicates excellent agreement."}
qqplot(pop$t_thinning, pop$t_inversion)
```

## Simulating all event trajectories

In the above, we simulated the time to first event. Let's simulate all times that realize from that process in the interval of interest. You can use either method; we demonstrate with the faster one. 

``` {r all_events}
Z <- nhppp::vdraw_cumulative_intensity(
    Lambda = L_gompertz,
    Lambda_inv = Li_gompertz,
    t_min = pop$T0,
    t_max = pop$T1,
    atmost1 = FALSE
  )

dim(Z) # a lot of events!
```

There is at least one person among the `r K` simulated persons with `r dim(Z)[2]` total events in the interval of interest. 


## Demonstrating that we simulate from the correct intensity function 

Let's fix the parameter values for all $K$ people and do a histogram of the simulated times. They match the shape of the intensity function over the interval $[T_0, T_1)$, scaled to unit area, i.e. $\frac{\lambda(t)}{\big(\Lambda(T_1)-\Lambda(T_0) \big)}$. Run more samples to convince yourself -- or also calculate the Wasserstein distance of the empirical and theoretical distribution, as described in the numerical analyses in the `nhppp` paper. 

```{r histogram}
if(nchar(system.file(package = 'ggplot2'))>0) {
  pop2 <- setDT(list(id = 1:1000)) 
  pop2[, `:=`(
    T0 = 20, 
    T1 = 100, 
    a = 0.005, 
    b = 0.090
  )]
  setindex(pop2, id)
  pop2
  
  ## re-define functions to use `pop2` by default 
  ##    clunckiness of `nhppp` -- help us improve it!
  l  <- function(t, a = pop2$a, b = pop2$b, ...) a * b * exp(b * t)
  L  <- function(t, a = pop2$a, b = pop2$b, ...) a * exp(b * t) - a
  Li <- function(z, a = pop2$a, b = pop2$b, ...) log(z/a +1) / b
  
  ## get all times (full trajectories)
  Z <- nhppp::vdraw_cumulative_intensity(
      Lambda = L,
      Lambda_inv = Li,
      t_min = pop2$T0,
      t_max = pop2$T1,
      atmost1 = FALSE
    )
  
  ## ignore non-realized times 
  Z <- as.vector(Z)
  Z <- Z[!is.na(Z)]
  
  x <- seq(from = 20, to = 100, length.out = 100)
  ## normalize the hazard rate to have area 1 between 20 and 100
  y <- l(x, a = pop2$a[1], b = pop2$b[1]) / (
        L(100, a = pop2$a[1], b = pop2$b[1]) - 
        L(20, a = pop2$a[1], b = pop2$b[1])
       )
  
  
  library(ggplot2)
  ggplot(xlim = c(0, 110)) +
    geom_histogram(
      aes(x = Z, y = after_stat(density)), 
      bins = 100, 
      fill = "gray", 
      color = "white") +
    geom_line(
      aes(x = x, y = y), 
      color = "red"
    ) +
    labs(title = "Histogram of all simulated Gompertz times", x = "Time", y = "Empirical and theoretical density")
}
```


## Acknowledgments 

Thanks to Fernando Alarid-Escudero for suggesting this vignette. 


## Bibliography

Trikalinos TA, Sereda Y. _nhppp: Simulating Nonhomogeneous Poisson Point Processes in R_. arXiv preprint arXiv:2402.00358. 2024 Feb 1.

Trikalinos TA, Sereda Y. _The nhppp package for simulating non-homogeneous Poisson point processes in R_. PLoS One. 2024 Nov 21;19(11):e0311311.

Marshall AW, Olkin I. "The Gompertz distribution" (p 364) in _Life distributions_  Springer, New York; 2007.

Since the publication of the paper, the syntax and options of the `nhppp` package have
evolved. To reproduce the code in the paper, you have to install the version 
of `nhppp` used in the paper. Alternatively, take a look at the vignettes, 
which are written to work with the current package.