Introduction to treeSS: reproducing Cançado et al. (2025)

Overview

The treeSS package implements the tree-spatial scan statistic of Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters that are simultaneously anomalous in geographic space and on a hierarchical classification tree. It generalises Kulldorff’s (1997) circular spatial scan by searching jointly over circular spatial zones $z$ and tree branches $g$, returning the (zone, branch) pair that maximises the likelihood ratio.

The package targets epidemiological and surveillance applications where the events to be clustered (deaths, hospitalisations, crimes, collisions) carry both a location and a categorical hierarchy (ICD-10 chapter $\to$ subchapter $\to$ leaf, NAICS sector, road type, etc.). Detecting clusters jointly in both dimensions can identify specific cause–area combinations that pure spatial or pure tree scans would miss.

The package provides:

Function	Purpose
`treespatial_scan()`	Tree-spatial scan (main entry point)
`circular_scan()`	Kulldorff’s circular spatial scan (tree-free)
`tree_scan()`	Tree-based scan (space-free)
`filter_clusters()`	Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1
`sequential_scan()`	Sequential adjustment of Zhang, Assunção & Kulldorff (2010)
`get_cluster_regions()`	Tidy region-by-cluster table for plotting
`aggregate_tree()`	Roll counts up the tree

All scans accept the Poisson and Binomial models of the original paper and run the Monte Carlo step under OpenMP via the n_cores argument.

The package ships four example datasets:

Dataset	Country	Domain	Regions	Tree
`rj_mortality` + `rj_tree`	Brazil	Public health	89 municipalities	ICD-10 (622 nodes)
`chicago_crimes` + `chicago_tree`	USA	Crime	77 community areas	Type × Description × Location (2841 nodes)
`fl_deaths`	USA	Public health	65 counties	(built by user from raw data)
`london_collisions` + `london_tree`	UK	Road safety	33 boroughs	Light × Road × Junction (81 nodes)

This vignette walks through rj_mortality end-to-end, reproducing the published analysis of Cançado et al. (2025). A companion vignette, vignette("florida", package = "treeSS"), is pedagogical: it builds the inputs from raw long-form data and the ICD-10 tree from the codes that actually appear in the data. The Chicago and London datasets are worked out in the companion software paper.

Infant mortality in Rio de Janeiro

We reproduce Section 5.2 of Cançado et al. (2025): all live births and infant deaths in the 89 municipalities of Rio de Janeiro state in 2016, classified by the ICD-10 hierarchy. The pre-built dataset combines deaths, live births, ICD-10 leaf codes and centroid coordinates in long format — exactly the parallel-vector input contract that treeSS expects.

Run the scan

library(treeSS)
data(rj_mortality)
data(rj_tree)

c(rows           = nrow(rj_mortality),
  municipalities = length(unique(rj_mortality$region_id)),
  tree_nodes     = nrow(rj_tree),
  total_deaths   = sum(rj_mortality$cases),
  total_births   = sum(unique(
    rj_mortality[, c("region_id", "live_births")]
  )$live_births))
#> rows           municipalities  tree_nodes  total_deaths  total_births
#> 1440           89              622         2981          219124

The scan is one function call:

result_rj <- treespatial_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_pop_pct = 0.50,
  nsim        = 999, seed = 2016,
  n_cores     = 4L
)
print(result_rj)

The most likely cluster is the ICD-10 leaf code P209 (“unspecified intrauterine hypoxia”) concentrated in 18 municipalities of the north fluminense region. Twenty-seven deaths are observed against an expectation of 3.34 under uniform risk — a relative risk of $\sim 8$ — with a Monte Carlo $p$-value of $1/(\text{nsim} + 1) = 0.001$.

This closely matches Table 7 of Cançado et al. (2025): the same 18 municipalities (by IBGE code), 27 observed deaths, expected $\approx 3.37$, log-likelihood ratio $LR = 38.48$ (we obtain $\approx 38.83$). The minor LR discrepancy is explained by a SIM database update between the paper’s analysis and the version shipped here, and by the discontinuation of TCU population estimates after 2017; the conclusions are identical.

Why the joint scan matters here

To make the value of the joint search concrete, it is worth running the two simpler scans on exactly the same data. The purely spatial scan, ignoring the tree, returns a single 8-municipality cluster around greater Rio de Janeiro driven by the aggregate death count (Cançado et al. 2025, Table 5); the P209 cluster in the north of the state does not appear at all, because aggregating over all causes dilutes the cause-specific signal. We can verify this directly by aggregating to one row per municipality and running the circular scan on the totals:

rj_agg <- unique(rj_mortality[, c("region_id", "x", "y", "live_births")])
rj_agg$cases <- tapply(rj_mortality$cases,
                        rj_mortality$region_id,
                        sum)[as.character(rj_agg$region_id)]

result_spatial <- circular_scan(
  cases       = rj_agg$cases,
  population  = rj_agg$live_births,
  region_id   = rj_agg$region_id,
  x           = rj_agg$x,
  y           = rj_agg$y,
  max_pop_pct = 0.50,
  nsim        = 999, seed = 2016
)
result_spatial$most_likely_cluster$region_ids

Conversely, the purely tree-based scan, ignoring geography, returns more than fifty significant cuts at the 5 % level on this dataset (Cançado et al. 2025, Table 6); the analyst is then left to apply a multiple-testing correction across tree branches and decide which to follow up. The joint scan returns a single most-likely $(z, g)$ pair with one Monte Carlo $p$-value — the family-wise error rate is already controlled by the joint maximisation.

Multiple distinct clusters (paper-faithful)

The paper’s Section 5.1.1 specifies a filtering rule: a candidate (zone, branch) pair is retained only if its branch is disjoint from every previously retained branch (no ancestor/descendant relation), or its zone is disjoint from every previously retained zone (no region in common). This is implemented in filter_clusters():

filter_clusters(result_rj)[1:5,
                            c("node_id", "n_regions",
                              "cases", "expected", "llr")]

No additional Monte Carlo simulation is performed; the secondary clusters are read directly from result_rj$secondary_clusters and filtered for distinctness in $O(k^2)$ time. The top filtered candidates correspond to the rows of Table 7 of Cançado et al. (2025).

Visualising the cluster

get_cluster_regions() returns a tidy data.frame ready to merge with any polygon layer:

# Most likely cluster only (single map)
cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE)

# Top-2 distinct clusters (faceted map)
cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE)

The package does not ship sf polygons for Rio de Janeiro — they are easy to fetch on demand from :

library(ggplot2)
library(geobr)
library(sf)

mun       <- read_municipality(code_muni = "RJ", year = 2016)
mun$code6 <- as.integer(substr(mun$code_muni, 1, 6))

cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2,
                                  overlap = TRUE),
             unique(rj_mortality[, c("region_id", "ibge_code")]),
             by = "region_id")
mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code")

ggplot(mun_facet) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40",
          alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"),
                    na.value = "gray95", na.translate = FALSE) +
  facet_wrap(~ panel, nrow = 1) +
  theme_minimal() +
  theme(legend.position = "none")

A complete annotated worked example, including the polygon download and faceted plots for both filter_clusters() and sequential_scan() results, is shipped at system.file("examples", "example_brazil_rj.R", package = "treeSS").

Secondary clusters when the MLC shadows weaker ones

filter_clusters() extracts distinct clusters from a single run of the scan. It is fast (no additional Monte Carlo) and faithful to the paper. When the most likely cluster is overwhelmingly strong, however, it can shadow weaker secondary clusters: the excess cases inside the MLC inflate the expected counts everywhere else, weakening the likelihood ratios of any other genuine cluster on the same dataset.

For that situation the package also offers sequential_scan(), adapted from Zhang, Assunção and Kulldorff (2010). It detects the MLC, removes the MLC’s regions (optionally with a buffer of nearest neighbours) from the dataset, and re-runs the scan on the reduced data with a fresh Monte Carlo simulation. The procedure iterates until the current MLC is no longer significant at the user-specified $\alpha$ level. Because each iteration’s $p$-value is computed against the simulated null distribution of the current (reduced) dataset, no multiple-testing correction is required.

seq_rj <- sequential_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_iter    = 5L, alpha = 0.05,
  nsim        = 999, seed = 2016,
  max_pop_pct = 0.50, n_cores = 4L
)
print(seq_rj)

The clusters data.frame has one row per iteration with the cluster identifier, log-likelihood ratio, fresh $p$-value, and a logical significant flag:

seq_rj$clusters[, c("iteration", "node_id", "n_regions",
                    "llr", "pvalue", "significant")]

Mapping the sequential result uses the same get_cluster_regions() generic. The complete plotting workflow is shipped in the same annotated example file.

When to use which

Use filter_clusters() when you want to follow Cançado et al.
1. exactly: one scan, then a non-overlap pass over the candidate pool. No extra computation.
Use sequential_scan() when the MLC may be hiding weaker secondary clusters that filter_clusters() cannot detect on the original (unreduced) data.

The two answer different questions and are complementary; both are appropriate in different applied scenarios.

Computational notes

treespatial_scan() and sequential_scan() both accept n_cores, which distributes the Monte Carlo step across OpenMP threads in C++. Each thread carries its own std::mt19937 generator seeded deterministically from the user-supplied seed, so the result is reproducible for any fixed pair (seed, n_cores). The serial path (n_cores = 1L) uses R’s rmultinom and is bit-identical to the pre-OpenMP implementation.

For publication-critical work where simulated $p$-values must be bit-identical between machines, use n_cores = 1L. For exploratory work, n_cores = 4 typically yields a 4–6× speed-up.

File	Purpose
`inst/examples/example_brazil_rj.R`	Full plotting workflow for the RJ analysis (this vignette)
`inst/examples/example_florida.R`	Building the input from raw long-form data — see `vignette("florida")`
`inst/examples/example_chicago.R`	Crime in Chicago, 2023 (discussed in the companion paper)
`inst/examples/example_london.R`	Road collisions in London, 2022 (discussed in the companion paper)