---
title: "An introduction to incidence2"
output:
    markdown::html_format:
        options:
            toc: true
            number_sections: true
vignette: >
  %\VignetteIndexEntry{An introduction to incidence2}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{outbreaks, ggplot2, ciTools, tidyr}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
data.table::setDTthreads(2)
```

## What does it do?

**incidence2** is an R package that implements functions to compute, handle
and visualise *incidence* data. It aims to be intuitive to use for both
interactive data exploration and as part of more robust outbreak analytic
pipelines.

The package is based around objects of the namesake class, `incidence2`. These
objects are a [`tibble`](https://cran.r-project.org/package=tibble) subclass
with some additional invariants. That is, an `incidence2` object must:
  
- have one column representing the date index (this does not need to be a `Date`
  object but must have an inherent ordering over time);
  
- have one column representing the count variable (i.e. what is being counted)
  and one variable representing the associated count;
  
- have zero or more columns representing groups;

- not have duplicated rows with regards to the date, group and count variables.


## Functions at a glance

To create and work with `incidence2` objects we provide a number of functions:

- `incidence()`: for the construction of incidence objects from both linelists
  and pre-aggregated data sets.
  
- `regroup()`: regroup incidence from different groups into one global incidence
  time series.
  
- `incidence_()` and `regroup_()`: These work similar to their aforementioned
  namesakes but also add support for
  [tidy-select](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html)
  semantics in their arguments.

- `plot.incidence2()`: generate simple plots with reasonable defaults.

- `cumulate()`: calculate the cumulative incidence over time.
  
- `complete_dates()`: ensure every possible combination of date and groupings
  is represented with a count.

- `keep_first()`, `keep_last()`: keep the rows corresponding to the first (or
  last) set of grouped dates (ordered by time) from an `incidence2` object.

- `keep_peaks()`; keep the rows corresponding to the maximum count value for
  each grouping of an `incidence2` object. A convenience wrapper around this,
  `first_peak()` keeps returns the earliest occurring peak row.
  
- `bootstrap_incidence()`; sampling (with replacement and optional randomisation)
  from incidence2 objects.
  
- `estimate_peak()`; estimate the peak of an epidemic curve using bootstrapped
  samples of the available data.

- Accessors for underlying variables: `get_date_index()`, `get_count_variable()`,
  `get_count_value()`, `get_groups()`, `get_count_value_name()`,
  `get_count_variable_name()`, `get_date_index_name()` and `get_group_names()`.
  
- Methods for common base R generics:
    - `as.data.frame.incidence2()`
    - `$<-.incidence2()`
    - `[.incidence2()`
    - `[<-.incidence2()`
    - `names<-.incidence2()`
    - `split.incidence2()`
    - `rbind.incidence2()`
  
- Methods for generics from the wider R package ecosystem, including:
    - `mutate.incidence2()`
    - `summarise.incidence2()`
    - `nest.incidence2()`
    - `as_tibble.incidence2()`
    - `as.data.table.incidence2()`
    
## Basic Usage

Examples in the vignette utilise three different sets of data:

- A synthetic linelist generated from a simulated Ebola Virus Disease
  (EVD) outbreak and available in the 
  [outbreaks](https://cran.r-project.org/package=outbreaks) package.

- A pre-aggregated time-series of Covid cases, tests, hospitalisations,
  and deaths for UK regions that is included within the incidence2 package.
  
- 136 cases of influenza A H7N9 in China. Again available in the outbreaks
  package.
  
### Computing incidence from a linelist

Broadly speaking, we refer to data with one row of observations (e.g. 'Sex',
'Date of symptom onset', 'Date of Hospitalisation') per individual as a linelist

```{r}
library(incidence2)

# linelist from the simulated ebola outbreak  (removing some missing entries)
ebola <- subset(outbreaks::ebola_sim_clean$linelist ,!is.na(hospital))
str(ebola)
```

To compute daily incidence we pass to `incidence()` our linelist data frame
as well as the name of a column in the data that we can use to index over
time. Whilst we refer to this index as the `date_index` there is no
restriction on it's type, save the requirement that is has an inherent ordering.

```{r}
(daily_incidence <- incidence(ebola, date_index = "date_of_onset"))
```

incidence2 also provides a simple plot method (see `help("plot.incidence2")`)
built upon [ggplot2](https://cran.r-project.org/package=ggplot2).  

```{r}
#| fig.height: 5
#| dpi: 90
#| fig.alt: >
#|   Bar chart of daily incidence covering the period April 2014 to April 2015
#|   inclusive. The graph appears to peaks around September 2014.
plot(daily_incidence)
```

The daily data is quite noisy, so it may be worth grouping the dates prior to
calculating the incidence. One way to do this is to utilise functions from the
[grates](https://cran.r-project.org/package=grates) package. incidence2
depends on the grates package so all of it's functionality is available
directly to users. Here we use the `as_isoweek()` function to convert the
'date of onset' to an isoweek (a week starting on a Monday) before proceeding
to calculate the incidence:

```{r}
#| fig.alt: >
#|   Bar chart of weekly incidence covering 2014-W15 to 2015-W18 inclusive.
#|   The graph peaks at 2014-W38. The "descent" from the peak tapers off 
#|   slower than the initial "ascent".
(weekly_incidence <- 
    ebola |>
    mutate(date_of_onset = as_isoweek(date_of_onset)) |> 
    incidence(date_index = "date_of_onset"))
plot(weekly_incidence, border_colour = "white")
```

As this sort of date grouping is often required we have chosen to integrate this
within the `incidence()` function via the `interval` parameter.
`interval` can take any of the following values:

- 'day' or 'daily' (mapping to `Date` objects);
- 'week(s)', 'isoweek(s)' or 'weekly' (mapping to `grates_isoweek`);
- 'epiweek(s)' (mapping to `grates_epiweek`);
- 'month(s)', 'yearmonth(s)' or 'monthly' (`grates_yearmonth`);
- 'quarter(s)', 'yearquarter(s)' or 'quarterly' (`grates_yearquarter`);
- 'year(s)' or 'yearly' (`grates_year`).

As an example, the following is equivalent to the `weekly_incidence` output above:

```{r}
(dat <- incidence(ebola, date_index = "date_of_onset", interval = "isoweek"))
# check equivalent
identical(dat, weekly_incidence)
```

If we wish to aggregate by specified groups we can use the `groups` argument.
For instance, to compute the weekly incidence by gender:

```{r}
(weekly_incidence_gender <- incidence(
    ebola,
    date_index = "date_of_onset",
    groups = "gender",
    interval = "isoweek"
))
```

For grouped data, the plot method will create a faceted plot across groups
unless a fill variable is specified:

```{r}
#| fig.alt: >
#|   Two bar charts (side by side) of weekly incidence covering 2014-W15 to
#|   2015-W18 inclusive. Females are on the left, Males the right. The graphs
#|   peak between 2014-W35 and 2014-W45. The "descent" from the peak tapers off
#|   slower than the initial "ascent".
plot(weekly_incidence_gender, border_colour = "white", angle = 45)
```

```{r}
#| fig.alt: >
#|   Bar chart of weekly incidence covering 2014-W15 to 2015-W18 inclusive.
#|   The graph peaks at 2014-W38. The "descent" from the peak tapers off 
#|   slower than the initial "ascent". The graph is "filled" by the number of
#|   male versus female but it is hard to descern the difference.
plot(weekly_incidence_gender, border_colour = "white", angle = 45, fill = "gender")
```

`incidence()` also supports multiple date inputs and allows renaming via the
use of named vectors:

```{r}
(weekly_multi_dates <- incidence(
    ebola,
    date_index = c(
        onset = "date_of_onset",
        infection = "date_of_infection"
    ), 
    interval = "isoweek",
    groups = "gender"
))
```

For a quick, high-level, overview of grouped data we can use the `summary()` method:

```{r}
summary(weekly_multi_dates)
```

When multiple date indices are given, they are used for rows of the resultant
plot, unless the resultant variable is used to fill:

```{r}
#| fig.alt: >
#|   Four bar charts arranged in a 2 by 2 grid. The top row represents incidence
#|   by date of infection, the bottom row by date of onset. Each row is arranged
#|   with females in the left plots and males the right. The graphs all peak
#|   between 2014-W35 and 2014-W45. The "descent" from the peaks tapers off
#|   slower than the initial "ascent".
plot(weekly_multi_dates, angle = 45, border_colour = "white")
```

```{r}
#| fig.alt: >
#|   Two bar charts (side by side) of weekly incidence covering 2014-W15 to
#|   2015-W18 inclusive. Females are on the left, Males the right. The graphs
#|   peak between 2014-W35 and 2014-W45. The "descent" from the peak tapers off
#|   slower than the initial "ascent". The graph is "filled" by the incidence
#|   according to date of onset and the incidence accorsing to date of
#|   infection.
plot(weekly_multi_dates, angle = 45, border_colour = "white", fill = "count_variable")
```

### Computing incidence from pre-aggregated data

In terms of this package, pre-aggregated data, is data where we have a single
column representing time and associated counts linked to those times (still
optionally split by characteristics). The included Covid data set is in this
wide format with multiple count values given for each day.

```{r}
covid <- subset(
    covidregionaldataUK,
    !region %in% c("England", "Scotland", "Northern Ireland", "Wales")
)
str(covid)
```

Like with our linelist data, `incidence()` requires us to specify a `date_index`
column and optionally our `groups` and/or `interval`. In addition we must now
also provide the `counts` variable(s) that we are interested in.

Before continuing, take note of the missing values in output above. Where
these occur in one of the count variables, `incidence()` warns users:

```{r}
monthly_covid <- incidence(
    covid,
    date_index = "date",
    groups = "region",
    counts = "cases_new",
    interval = "yearmonth"
)
monthly_covid
```

Whilst we could have let `incidence()` ignore missing values (equivalent to
setting `sum(..., na.rm=TRUE)`), we prefer that users make an explicit choice
on how these should (or should not) be imputed. For example, to treat missing
values as zero counts we can simply replace them in the data prior to calling
`incidence()`:

```{r}
#| fig.alt: >
#|   Nine bar charts arranged in a 3 by 3 grid representing incidence new covid
#|   cases by month across nine English regions. Each plot goes from the start
#|   of 2020 to mid 2021. In each plot we see an increase in cases towards the 
#|   end of 2020 and in to early 2021.
(monthly_covid <-
     covid |>
     tidyr::replace_na(list(cases_new = 0)) |> 
     incidence(
         date_index = "date",
         groups = "region",
         counts = "cases_new",
         interval = "yearmonth"
     ))
plot(monthly_covid, nrow = 3, angle = 45, border_colour = "white")
```

### Plotting in style of European Programme for Intervention Epidemiology Training (EPIET)

For small datasets it is convention of EPIET to display individual cases as
rectangles. We can do this by setting `show_cases = TRUE` in the call to
`plot()` which will display each case as an individual square with a white
border.

```{r}
#| fig.height: 3
#| fig.alt: >
#|   Bar chart of daily incidence covering the period 2014-07-08 to 2014-07-16
#|   inclusive. It shows 21 cases, with each case represented by an individual
#|   square. 
dat <- ebola[160:180, ]

(small <- incidence(
    dat,
    date_index = "date_of_onset",
    date_names_to = "date"
))
plot(small, show_cases = TRUE, angle = 45, n_breaks = 10)
```

```{r}
#| fig.height: 3
#| fig.alt: >
#|   Bar chart of daily incidence covering the period 2014-07-08 to 2014-07-16
#|   inclusive. It shows 21 cases, with each case represented by an individual
#|   square filled with a colour based on an individuals gender. There is a peak
#|   on 2014-07-13 with 5 cases.
(small_gender <- incidence(
    dat,
    date_index = "date_of_onset",
    groups = "gender",
    date_names_to = "date"
)) 
plot(small_gender, show_cases = TRUE, angle = 45, n_breaks = 10, fill = "gender")
```

## Support for tidy-select semantics

When working interactively it can feel a little onerous constantly having to
quote inputs for column names. To alleviate this we include the functions
`incidence_()` and `regroup_()` which both support
[tidy-select](https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html)
semantics in their column arguments (i.e. `date_index`, `groups` and `counts`).

For now we have chosen to distinguish the functions via the post-fix underscore
and have a preference for the standard version for non-interactive (e.g.
programmatic usage). This could change over time if users feel having two
similar functions is confusing.

## Working with incidence objects

On top of the incidence constructor function and the basic plotting, printing
and summary we provide a number of other useful functions and integrations for
working with incidence2 objects.

**Note:** The following sections utilise tidy-select semantics and hence use the
post-fix version of the incidence function (`incidence_()`)

### `regroup()`

If you've created a grouped incidence object but now want to change the internal
grouping, you can `regroup()` to the desired aggregation:

```{r}
# generate an incidence object with 3 groups
(x <- incidence_(
    ebola,
    date_index = date_of_onset,
    groups = c(gender, hospital, outcome),
    interval = "isoweek"
))
# regroup to just two groups
regroup_(x, c(gender, outcome))
# standard (non-tidy-select) version
regroup(x, c("gender", "outcome"))
# drop all groups
regroup(x)
```

### `complete_dates()`
Sometimes your incidence data does not span consecutive units of time, or
different groupings may cover different periods. To this end we provide a
`complete_dates()` function which ensures a complete and identical range of
dates are given counts (by default filling with a 0 value).

```{r}
dat <- data.frame(
    dates = as.Date(c("2020-01-01", "2020-01-04")),
    gender = c("male", "female")
)

(incidence <- incidence_(dat, date_index = dates, groups = gender))
complete_dates(incidence)
```

### `keep_first()`, `keep_last()` and `keep_peaks()`

Once your data is grouped by date, you can select the first or last few entries
based on a particular date grouping using `keep_first()` and `keep_last()`:

```{r}
weekly_incidence <- incidence_(
    ebola,
    date_index = date_of_onset,
    groups = hospital,
    interval = "isoweek"
)

keep_first(weekly_incidence, 3)
keep_last(weekly_incidence, 3)
```

Similarly `keep_peaks()`returns the rows corresponding to the maximum count
value for each grouping of an `incidence2` object:

```{r}
keep_peaks(weekly_incidence)
```

### Bootstrapping and estimating peaks

`estimate_peak()` returns an estimate of the peak of an epidemic curve using
bootstrapped samples of the available data. It is a wrapper around two functions:

- Firstly, the imaginatively named, `first_peak()`, that returns the earliest
occurring peak row per group; and,
- Secondly, `bootstrap_incidence()` which samples (with replacement and optional
  randomisation) from incidence2 objects.

Note that the bootstrapping approach used for estimating the peak time makes
the following assumptions:

- the total number of event is known (no uncertainty on total incidence);
- dates with no events (zero incidence) will never be in bootstrapped datasets; and
- the reporting is assumed to be constant over time, i.e. every case is
  equally likely to be reported.


```{r}
#| fig.alt: >
#|   Bar chart of daily incidence covering the period March 2013 to August 2013
#|   inclusive. The graph appears to peak around the start of April.
influenza <- incidence_(
    outbreaks::fluH7N9_china_2013,
    date_index = date_of_onset,
    groups = province
)

# across provinces (we suspend progress bar for markdown)
estimate_peak(influenza, progress = FALSE) |> 
    select(-count_variable)
# regrouping for overall peak
plot(regroup(influenza))
estimate_peak(regroup(influenza), progress = FALSE) |> 
    select(-count_variable)
# return the first peak of the grouped and ungrouped data
first_peak(influenza)
first_peak(regroup(influenza))
# bootstrap a single sample
bootstrap_incidence(influenza)
```

### `cumulate()`
You can use `cumulate()` to easily generate cumulative incidences:

```{r}
#| fig.alt: >
#|   Fives graphs representing cumulative weekly incidence covering 2014-W15 to
#|   2015-W18 inclusive. Each graph represents a hospital in the data set. The
#|   five graphs fill a 3 by 2 grid with the bottom-right square being left
#|   blank.
(y <- cumulate(weekly_incidence))
plot(y, angle = 45, nrow = 3)
```

## Building on incidence2

The benefit incidence2 brings is not in the functionality it provides (which is
predominantly wrapping around the functionality of other packages) but in the
guarantees incidence2 objects give to a user about the underlying object
structure and invariants that must hold. 

To make these objects easier to build upon we give sensible behaviour when the
invariants are broken, an interface to access the variables underlying the
`incidence2` object and methods, for popular group-aware generics, that
implicitly utilise the underlying grouping structure.

### Class preservation

As mentioned at the beginning of the vignetted, by definition, `incidence2`
objects must:

- have one column representing the date index (this does not need to be a `Date`
  object but must have an inherent ordering over time);
  
- have one column representing the count variable (i.e. what is being counted)
  and one variable representing the associated count;
  
- have zero or more columns representing groups;

- not have duplicated rows with regards to the date, group and count
  variables.

Due to these requirements it is important that these objects preserve (or drop)
their structure appropriately under the range of different operations that can
be applied to data frames. By this we mean that if an operation is applied to an
incidence2 object then as long as the invariants of the object are preserved
(i.e. required columns and uniqueness of rows) then the object will retain it's
incidence class. If the invariants are not preserved then a `tibble` will be
returned instead.

```{r}
# create a weekly incidence object
weekly_incidence <- incidence_(
    ebola,
    date_index = date_of_onset,
    groups = c(gender, hospital),
    interval = "isoweek"
)

# filtering preserves class
weekly_incidence |> 
    subset(gender == "f" & hospital == "Rokupa Hospital") |> 
    class()

class(weekly_incidence[c(1L, 3L, 5L), ])

# Adding columns preserve class
weekly_incidence$future <- weekly_incidence$date_index + 999L
class(weekly_incidence)
weekly_incidence |> 
    mutate(past = date_index - 999L) |> 
    class()

# rename preserve class
names(weekly_incidence)[names(weekly_incidence) == "date_index"] <- "isoweek"
str(weekly_incidence)

# select returns a data frame unless all date, count and group variables are
# preserved in the output
str(weekly_incidence[,-1L])
str(weekly_incidence[, -6L])

# duplicating rows will drop the class but only if duplicate rows
class(rbind(weekly_incidence, weekly_incidence))
class(rbind(weekly_incidence[1:5, ], weekly_incidence[6:10, ]))
```

### Accessing variable information

We provide multiple accessors to easily access information about an
`incidence2` object's structure:

```{r}
# the name of the date_index variable of x
get_date_index_name(weekly_incidence)
# alias for `get_date_index_name()`
get_dates_name(weekly_incidence)
# the name of the count variable of x
get_count_variable_name(weekly_incidence)
# the name of the count value of x
get_count_value_name(weekly_incidence)
# the name(s) of the group variable(s) of x
get_group_names(weekly_incidence)
# the date_index variable of x
str(get_date_index(weekly_incidence))
# alias for get_date_index
str(get_dates(weekly_incidence))
# the count variable of x
str(get_count_variable(weekly_incidence))
# the count value of x
str(get_count_value(weekly_incidence))
# list of the group variable(s) of x
str(get_groups(weekly_incidence))
```

### Grouping aware methods

incidence2 provides methods for popular group-aware generics from both base R
and the wider package ecosystem:

- base: `split()`.
- [dplyr](https://cran.r-project.org/package=dplyr): `mutate()` and `summarise()`.
- [tidyr](https://cran.r-project.org/package=tidyr): `nest()`.

When called on incidence2 objects, these methods will utilise the
underlying grouping structure without the user needing to explicitly state
what it is. This makes it very easy to utilise in analysis pipelines.

#### Example fitting a poisson model

```{r}
#| fig.alt: >
#|   Fives bar charts representing weekly incidence covering 2014-W15 to
#|   2014-W35 inclusive. Each graph represents a hospital in the data set. The
#|   five graphs fill a 3 by 2 grid with the bottom-right square being left
#|   blank. On top of each bar graph is a line along with associated confidence
#|   intervals showing an increasing trend over the displayed weeks.
# first twenty weeks of the ebola data set across hospitals
dat <- incidence_(ebola, date_of_onset, groups = hospital, interval = "isoweek")
dat <- keep_first(dat, 20L)

# fit a poisson model to the grouped data
(fitted <-
    dat |>
    nest(.key = "data") |>
    mutate(
        model  = lapply(
            data,
            function(x) glm(count ~ date_index, data = x, family = poisson)
        )
    ))
# Add confidence intervals to the result
(intervals <-
    fitted |>
    mutate(result = Map(
        function(data, model) {
            data |>
                ciTools::add_ci(
                    model,
                    alpha = 0.05,
                    names = c("lower_ci", "upper_ci")
                ) |>
                as_tibble()
        },
        data,
        model
    )) |>
    select(hospital, result) |>
    unnest(result))
# plot
plot(dat, angle = 45) +
    ggplot2::geom_line(
        ggplot2::aes(date_index, y = pred),
        data = intervals,
        inherit.aes = FALSE
    ) +
    ggplot2::geom_ribbon(
        ggplot2::aes(date_index, ymin = lower_ci, ymax = upper_ci),
        alpha = 0.2,
        data = intervals,
        inherit.aes = FALSE,
        fill = "#BBB67E"
    )
```

#### Example - Adding a rolling average

```{r}
#| fig.alt: >
#|   Fives bar charts representing weekly incidence covering 2014-W15 to
#|   2015-W18 inclusive. Each graph represents a hospital in the data set. The
#|   five graphs fill a 3 by 2 grid with the bottom-right square being left
#|   blank. On top of each bar graph is a line along that displays the rolling
#|   average.
weekly_incidence |>
    regroup_(hospital) |> 
    mutate(rolling_average = data.table::frollmean(count, n = 3L, align = "right")) |> 
    plot(border_colour = "white", angle = 45) +
    ggplot2::geom_line(ggplot2::aes(x = date_index, y = rolling_average))
```