---
title: "Matching Patients in the `cancer` Dataset with `vecmatch`"
output: rmarkdown::html_vignette
bibliography: references.bib
link-citations: TRUE
vignette: >
  %\VignetteIndexEntry{Matching Patients in the `cancer` Dataset with `vecmatch`}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.asp = 0.8,
  echo = TRUE
)
```

# Practical Example of the Vector Matching Workflow

In this chapter, we will describe an exemplary data analysis of the
`cancer` dataset from the `vecmatch` package, matching multiple cohorts
based on selected covariates. The `vecmatch` package follows a strict
workflow based on the vector matching algorithm defined by
@lopez2017estimation, and it is advisable to follow it. The whole
process consists of five steps to ensure the best possible matching
quality using the vector matching algorithm.

## Step 1: Data Exploration and Initial Imbalance Assessment

The `vecmatch` package provides two graphical functions for assessing
initial dataset imbalance: `raincloud()` and `mosaic()`. The
`raincloud()` function is designed for continuous variables, while
`mosaic()` is for categorical variables. Both functions allow grouping
by up to two categorical variables using the `group` and `facet`
aesthetics and compute various statistical summaries, including
significance tests and effect size coefficients.

In this example analysis, we focus on two predictor variables from the
`cancer` dataset: the discrete `sex` and the continuous `age`. To
evaluate differences in age across `status` groups, we can use the
`raincloud()` function as follows:

```{r, message=FALSE}
library(vecmatch)
library(ggplot2)

raincloud(cancer,
  age,
  status,
  significance = "t_test",
  sig_label_color = TRUE,
  sig_label_size = 3,
  limits = c(10, 120)
) +
  scale_y_continuous(breaks = seq(10, 100, 10))
```

This plot presents the conditional distribution of `age` across
different `status` groups. The standardized mean differences (SMD) are
represented by the brackets on the right side of the jittered point
plot. The results of the Student's t-tests are displayed alongside the
boxplots on the right, providing insights into the statistical
significance of the differences in `age` between the groups.

Similarly, we can evaluate the differences in the `sex` variable using
the `mosaic()` function. The code snippet below demonstrates this:

```{r, message=FALSE}
mosaic(cancer,
  status,
  sex,
  group_counts = TRUE,
  significance = TRUE
)
```

The plot shows the conditional distribution of `sex` across different
`status` groups. The counts for all groups are displayed inside the
corresponding boxes, representing each subgroup within the mosaic plot.
Partial significance is visually coded through color filling aesthetics.
The labels provide the results of the Chi-squared test of independence,
summarizing the statistical assessment of the relationship between the
`sex` and `status` variables.

Based on the plots, we can observe some imbalance in the `age` and `sex`
variables. However, in both cases, the differences between the `status`
groups appear to be rather moderate. All Student's t-tests, adjusted for
multiple comparisons, resulted in statistically insignificant
differences in mean age values between the groups. Additionally, 2 out
of the 6 calculated standardized mean differences (SMDs) fell below the
0.10 threshold, indicating relatively small differences.

After a closer examination of the mean age values, we can notice that
they form two separate clusters. The first cluster includes controls and
patients with adenomas, who tend to be younger, while the second cluster
consists of patients with benign and malignant colorectal cancer, who
are generally older. The primary objective of the matching process will
then be to align these two clusters, ensuring they have a common mean
age value.

There were no significant differences in the `sex` distributions across
the `status` groups, as the overall Chi-square test of independence was
insignificant. All partial significance tests were statistically
insignificant, indicating small values of standardized Pearson's
residuals.

Given the relatively small differences between the `status` groups, the
vector matching algorithm is expected to produce satisfactory results.

## Step 2: Estimation of Generalized Propensity Scores

The next step in the vector matching algorithm is the calculation of
generalized propensity scores (GPS) for the treatment variable. These
scores represent the treatment assignment probabilities and are based on
user-defined covariates. The relationship between the treatment variable
and predictors can be easily defined using `R`-specific formula
notation, allowing for both additive and interaction effects.

The `vecmatch` package provides a straightforward method for calculating
the GPS-matrix with different methods using the `estimate_gps()`
function. However, custom approaches can also be applied within the
vector-matching workflow. The GPS-matrix used in subsequent analyses
must have the exact same structure as the output of `estimate_gps`,
containing the treatment variable and the treatment assignment
probabilities for each level of the treatment variable. Importantly, the
probabilities for each row in the GPS-matrix must sum to 1. The
resulting `data.frame` has to be of class `gps`.

In the following example, we will use a simple formula with `age` and
`sex` as predictors, allowing for an interaction effect. The method used
to estimate the generalized propensity scores (GPS) will be a
multinomial logistic regression, which can handle multiple levels of the
treatment variable. The code for estimating the GPS using the
multinomial logistic regression model is as follows:

```{r}
formula_cancer <- status ~ age * sex
gps_matrix <- estimate_gps(formula_cancer,
  cancer,
  method = "multinom",
  reference = "control"
)
head(gps_matrix, 7)
```

## Step 3: Calculating Common Support Region Borders

To remove observations that are not eligible for further matching, the
borders of the common support region (CSR) must be calculated. The
`csregion()` function facilitates this process by taking the
`gps_matrix` object as its sole argument. It computes the CSR borders
and automatically filters the input `gps_matrix` to include only the
observations that fall within the calculated CSR borders.

The function returns an object of class `csr_matrix`, which contains the
filtered data along with several additional attributes. These attributes
enable users to manually filter the initial `gps_matrix` or access a
summary of the CSR border calculation as a `data.frame` object. The
`csregion()` function can be used as follows:

```{r}
csr_matrix <- csregion(gps_matrix)
```

We can then compare the dimensions of the original `gps_matrix` and the
filtered `csr_matrix`.

```{r}
dim(gps_matrix)
dim(csr_matrix)
```

Clearly, 24 observations have been filtered out of the original matrix.

Next, as recommended by @lopez2017estimation, the GPS can be recalculated
using only the data within the CSR. However, in our experience, this additional
step does not result in a significant improvement in matching quality,
especially if the number of observations that fall outside of the CSR is
relatively small in comparison to the whole dataset size. Therefore, we decided
to bypass this step and proceed with matching directly on the filtered dataset.

## Step 4: k-Means Clustering and Matching {#sec-examatch}

The most crucial step of the vector matching algorithm is the *k*-means
clustering and subsequent matching within the resulting clusters. This approach
ensures that only observations with similar GPS vectors are matched, thereby
justifying the name *vector matching*. The core components of this step are the
`stats::kmeans()` function [@rlang2024], which performs clustering, and the
`Matching::Matchby()` or `optmatch::fullmatch()` functions
[@sekhon2011multivariate, @hansen2006optimal], which match observations within
each *k*-means cluster. These functionalities are combined into a single
function named `match_gps()`.

The `match_gps()` function takes the `csr_matrix` as input and returns a matched
dataset. The matching process can be customized using various arguments, and it
is advisable to experiment with different parameter combinations to optimize
both matching quality and the number of matched samples (see appendix
\ref{app:practical}). For the replicability of results, it is also crucial to
set the random number generator seed before running the clustering and matching
procedures. This ensures that the clustering assignments and matches are
consistent across different runs of the analysis, allowing others to reproduce
the results exactly.

The `match_gps()` function can be used as follows:

```{r}
set.seed(164373)
matched_cancer <- match_gps(csr_matrix,
  caliper = 0.21,
  kmeans_cluster = 2,
  reference = "control",
  replace = TRUE,
  method = "fullopt",
  order = "desc"
)
```

## Step 5: Post-Matching Quality Assessment {#quality}
The `vecmatch` package provides a dedicated function, `balqual()`, to assess
the quality of post-matching results. This function compares various metrics and
statistical summaries between the pre- and post-matching datasets, using a
user-defined formula to evaluate balance. Its usage is straightforward and
requires only the matched dataset (produced by `match_gps()`) and the formula
used for the calculation of the GPS:

```{r}
balqual(matched_cancer,
  formula_cancer,
  type = "smd",
  statistic = "max",
  round = 4
)
```

After assessing the quality with `balqual()`, we can combine both matched and unmatched datasets into a single data frame to visualize age and sex using `raincloud()` and `mosaic()`:

```{r}
matched_cancer$dataset <- "matched"
unmatched_cancer <- cancer
unmatched_cancer$dataset <- "unmatched"
data_full <- rbind(matched_cancer, unmatched_cancer)
```

```{r, message=FALSE, fig.asp=1.5}
raincloud(data_full,
  age,
  status,
  dataset,
  significance = "t_test",
  sig_label_color = TRUE,
  sig_label_size = 3,
  limits = c(10, 120)
) +
  scale_y_continuous(breaks = seq(10, 100, 10))
```

```{r, message=FALSE, fig.asp=1.5}
mosaic(data_full,
  status,
  sex,
  dataset,
  group_counts = TRUE,
  significance = TRUE
)
```

## References