---
title: "Best practices"
author: "Maarten van Kessel"
always_allow_html: yes
output:
  html_document:
    toc: yes
    toc_depth: '3'
    df_print: paged
  html_vignette:
    toc: yes
    toc_depth: 3
    vignette: >
      %\VignetteIndexEntry{Best practices}
      %\VignetteEngine{knitr::rmarkdown}
      %\VignetteEncoding{UTF-8}
  pdf_document:
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction
There are some rules of thumb that I follow when using the `TreatmentPatterns` package. These rules tend to work well in most situations, across databases and datasets.

### TLDR

- `minPostCombinationWindow <= minEraDuration`.
- `combinationWindow >= minEraDuration`.
- Small cohorts should not be considered.
- Pathways with a low count should not be considered.
- Possible number of pathways: $n^n$
- Possible number of pathways, with no re-occurrence: $n!$
- Total number of possible events (events + combinations): $2^n - 1$
- Total number of possible combinations: $2^n - (n + 1)$

## Cohorts
When creating cohorts, it is important to keep in mind that the subjects will be dived across pathways. Lets assume we have 10000 subjects in a fictitious cohort. Let's also assume we have 5 event cohorts.

The total number of potential pathways, assuming only mono therapies equals to $pathways_n = n^{n}$, assuming we do not allow for any re-occurring treatments it would still equal to $pathways_n = n!$.

Assuming our 5 event cohorts this would equal to:
```{r}
5^5
factorial(5)
```

## Combinations
Combinations add additional pathway possibilities. Each event can be uniquely combined with each other event. Each combination can combine with another singular event or any other combination. However, each event in a combination must be unique. So: $AB = BA$. As an example it is irrelevant if a person receives penicillin and ibuprofen or ibuprofen and penicillin.

We can draw out all possible combinations in a graph for events $A$ $B$ $C$.
```{r, echo=FALSE, eval=FALSE}
# library(DiagrammeR)
# 
# g <- grViz("digraph {
#   graph [layout = dot, rankdir = TB, splines = line]
#   
#   A [label = 'A@_{1}']
#   B [label = 'B@_{1}']
#   C [label = 'C@_{1}']
#   D [label = 'D@_{1}']
#   
#   AB [label = 'AB@_{2}']
#   AC [label = 'AC@_{2}']
#   AD [label = 'AD@_{2}']
#   BC [label = 'BC@_{2}']
#   BD [label = 'BD@_{2}']
#   CD [label = 'CD@_{2}']
#   
#   ABC [label = 'ABC@_{3}']
#   ABD [label = 'ABD@_{3}']
#   ACD [label = 'ACD@_{3}']
#   BCD [label = 'BCD@_{3}']
#   
#   ABCD [label = 'ABCD@_{4}']
#   
#   subgraph cluster1 {
#     A -> AB -> ABC -> ABCD
#          AB -> ABD
#     A -> AC
#          AC -> ACD
#     A -> AD
#   }
#   
#   subgraph cluster2 {
#     B -> BC -> BCD
#     B -> BD
#   }
#   
#   subgraph cluster3 {
#     C -> CD
#   }
#   
#   subgraph cluster4 {
#     D
#   }
# }")
# 
# g |>
#   DiagrammeRsvg::export_svg() |>
#   charToRaw() |>
#   rsvg::rsvg_png(file = "./figures/a000_graph.png")
```

![](./figures/a000_graph.png)

The subscript of the nodes are the layers where the combination exists in. I.e. combination $AB$ is in layer 2, and combinations $ABCD$ is in layer 4. The layer coincides with the number of events in the combination.

We can count the number of nodes per layer, for each graph:
$$
\begin{matrix}
  & l1 & l2 & l3 & l4 & sum\\
A &  1 &  3 &  3 &  1 & 8\\
B &  1 &  2 &  1 &  0 & 4\\
C &  1 &  1 &  0 &  0 & 2\\
D &  1 &  0 &  0 &  0 & 1
\end{matrix}
$$

Our sums look suspiciously similar to $2^n$.
```{r}
2^1
2^2
2^3
2^4
```
We seem to overshoot by 1 $n$, so we can try $2^{n-1}$.
```{r}
2^0
2^1
2^2
2^3
```

So our total number of events equals:
$$
\sum^{n-1}_{i=0}2^{i}
$$

Which we can define as a function $f_1$.
```{r}
sum(c(2^0, 2^1, 2^2, 2^3))

# Or:
n <- 4
sum(2^(0:(n - 1)))

f_1 <- function(n) {
  sum(2^(0:(n - 1)))
}
```

We can simulate our $f_1$ function for 100 events.
```{r}
n <- 1:25
f_1_events <- unlist(lapply(n, f_1))

data.frame(
  n = n,
  f_1 = f_1_events
)
```

Notice how the number of events increases with $2^n-1$.

We define this as $f_2$. We can compare $f_1$ to $f_2$.
```{r}
f_2 <- function(n) {
  2^n - 1
}

n <- 1:25
f_1_events <- unlist(lapply(n, f_1))
f_2_events <- unlist(lapply(n, f_2))

data.frame(
  n = n,
  f_1 = f_1_events,
  f_2 = f_2_events
)
```

Now we can assert the following:
$$
monoEvents = n
\\
totalEvents = 2^n - 1
\\
combinationEvents = totalEvents - n
$$
```{r}
n <- 5
totalEvents <- 2^n - 1
combinationEvents <- totalEvents - n

sprintf("monoEvents: %s", n)
sprintf("totalEvents: %s", totalEvents)
sprintf("combinationEvents: %s", combinationEvents)
```


## Settings
The `minEraDuration`, `combinationWindow`, and `minPostCombinationWindow` have significant effects on how the treatment pathways are built. Conciser the following example:

```{r, message=FALSE}
library(dplyr)

cohort_table <- tribble(
  ~cohort_definition_id, ~subject_id, ~cohort_start_date,    ~cohort_end_date,
  1,                     1,           as.Date("2020-01-01"), as.Date("2021-01-01"),
  2,                     1,           as.Date("2020-01-01"), as.Date("2020-01-20"),
  3,                     1,           as.Date("2020-01-22"), as.Date("2020-02-28"),
  4,                     1,           as.Date("2020-02-20"), as.Date("2020-03-3")
)

cohort_table
```


Assume that the target cohort is cohort_definition_id: 1, the rest are event cohorts.

```{r}
cohort_table <- cohort_table %>%
  mutate(duration = as.numeric(cohort_end_date - cohort_start_date))

cohort_table
```

As you can see, the duration of the treatments are: 19, 37 and 12 days. Also cohort 3 overlaps with treatment 4 for 8 days.

We can compute the overlap as follows:
```{r}
cohort_table <- cohort_table %>%
  # Filter out target cohort
  filter(cohort_definition_id != 1) %>%
  mutate(overlap = case_when(
    # If the result of the next cohort_end_date is NA, set 0
    is.na(lead(cohort_end_date)) ~ 0,
    # Compute duration of cohort_end_date - next cohort_start_date
    # 2020-02-28 - 2020-02-20 = -8
    .default = as.numeric(cohort_end_date - lead(cohort_start_date))))

cohort_table
```

We see that the overlap between treatment 2 and 3 is `-2`, so rather than an overlap there is a gap between these treatments. Between treatment 3 and 4 there is an 8 day overlap. There is no next treatment after treatment 4, so the overlap is 0, let's assume our `minEraDuration = 5`.

We can draw it out like so:
```
2:   -------------------
3:                        -------------------------------------
4:                                                     ------------
```

If we set our `minCombinationWindow = 5`, the combination would be computed for cohort 3 and 4. This would leave us with the following treatments:
```
2:   -------------------
3:                        -----------------------------
3+4:                                                   --------
4:                                                             ----
```

Treatment 3 now lasts 11 days; Treatment 4 lasts 4 days; and combination treatment 3+4 lasts 8 days. If our `minPostCombinationDuration` is not set properly, we can filter out either too many, or too little treatments.

Assuming we would set `minPostCombinationDuration = 10`, we would lose treatment 4 and combination treatment 3+4. This would leave us with the following paths:
```
2:   -------------------
3:                        -----------------------------

Pathway: 2-3
```

As a rule of thumb the setting the `minPostCombinationDuration <= minEraDuration` seems to yield reasonable results. This would leave us with the following paths `minPostCombinationDuration = 5`:
```
2:   -------------------
3:                        -----------------------------
3+4:                                                   --------

Pathway: 2-3-3+4
```