Introduction to treeSS

Overview

The treeSS package implements the tree-spatial scan statistic of Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters that are simultaneously anomalous in geographic space and on a hierarchical classification tree. It generalises Kulldorff’s (1997) circular spatial scan by searching jointly over circular spatial zones \(z\) and tree branches \(g\), returning the (zone, branch) pair that maximises the likelihood ratio.

The package targets epidemiological and surveillance applications where the events to be clustered (deaths, hospitalisations, crimes, collisions) carry both a location and a categorical hierarchy (ICD-10 chapter \(\to\) subchapter \(\to\) leaf, NAICS sector, road type, etc.). Detecting clusters jointly in both dimensions can identify specific cause-area combinations that pure spatial or pure tree scans would miss.

The package provides:

Function Purpose
treespatial_scan() Tree-spatial scan (main entry point)
circular_scan() Kulldorff’s circular spatial scan (tree-free)
tree_scan() Tree-based scan (space-free)
iterative_scan() Sequential conditional scan with Holm–Bonferroni correction
filter_clusters() Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1
get_cluster_regions() Tidy region-by-cluster table for plotting
aggregate_tree() Roll counts up the tree

All scans accept the Poisson and Binomial models of the original paper and run the Monte Carlo step under OpenMP via the n_cores argument.

The package ships four example datasets:

Dataset Country Domain Regions Tree
rj_mortality + rj_tree Brazil Public health 89 municipalities ICD-10 (622 nodes)
chicago_crimes + chicago_tree USA Crime 77 community areas Type × Description × Location (2841 nodes)
fl_deaths USA Public health 65 counties (built by user from raw data)
london_collisions + london_tree UK Road safety 33 boroughs Light × Road × Junction (81 nodes)

This vignette walks through rj_mortality (reproducing the published analysis) and chicago_crimes (a different-domain illustration), then shows how to use iterative_scan() for sequential cluster detection. The Florida and London examples are shipped in inst/examples/ and discussed in the companion paper.


Example 1 — Infant mortality in Rio de Janeiro

This reproduces Section 5.2 of Cançado et al. (2025): all live births and infant deaths in the 89 municipalities of Rio de Janeiro state in 2016, classified by the ICD-10 hierarchy. The pre-built dataset combines deaths, live births, ICD-10 leaf codes and centroid coordinates in long format.

Run the scan

library(treeSS)
data(rj_mortality)
data(rj_tree)

str(rj_mortality, max = 1)
#> 'data.frame': 1440 obs. of 9 variables: region_id, ibge_code,
#>   name, live_births, x, y, node_id, cases, ...
result_rj <- treespatial_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_pop_pct = 0.50,
  nsim        = 999, seed = 2016,
  n_cores     = 4L
)
print(result_rj)

Running this produces the most likely cluster P209 with 18 municipalities, log-likelihood ratio \(LR \approx 38.83\), \(p < 0.001\), 27 observed deaths against 3.34 expected — closely matching the published Table 7 (\(LR = 38.48\), 18 municipalities, 27 deaths, 3.37 expected). The minor LR difference comes from a SIM database update between the paper’s analysis and the dataset shipped here; the conclusions are identical.

Multiple distinct clusters (paper-faithful)

To extract more than one cluster from a single scan, the paper’s Section 5.1.1 specifies a filtering rule: a candidate (zone, branch) pair is retained only if its branch is disjoint from every previously retained branch (no ancestor/descendant relation), or its zone is disjoint from every previously retained zone (no region in common). This is implemented in filter_clusters():

filter_clusters(result_rj)[1:5, c("node_id", "n_regions",
                                   "cases", "expected", "llr")]

For visualisation, get_cluster_regions() returns a tidy data.frame ready to merge with a polygon layer:

# Most likely cluster only (single map)
cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE)

# Top-2 distinct clusters (faceted map)
cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE)

A complete plotting workflow with geobr and ggplot2:

library(ggplot2)
library(geobr)
library(sf)

mun <- read_municipality(code_muni = "RJ", year = 2016)
mun$code6 <- as.integer(substr(mun$code_muni, 1, 6))

cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE),
             unique(rj_mortality[, c("region_id", "ibge_code")]),
             by = "region_id")
mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code")

ggplot(mun_facet) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"),
                    na.value = "gray95", na.translate = FALSE) +
  facet_wrap(~ panel, nrow = 1) +
  theme_minimal() +
  theme(legend.position = "none")

A complete worked example with detailed annotation is shipped at system.file("examples", "example_brazil_rj.R", package = "treeSS").


Example 2 — Crime in Chicago

The Chicago dataset illustrates the package on a non-medical domain with a much larger tree (2841 nodes covering crime type, description and location). It also demonstrates a recurring choice users face in surveillance applications: which population should be the denominator?

chicago_crimes ships with two denominator columns:

The choice changes both the model and the interpretation. We use the incidence-rate denominator below.

Run the scan

data(chicago_crimes)
data(chicago_tree)

result_chi <- treespatial_scan(
  cases       = chicago_crimes$cases,
  population  = chicago_crimes$pop_residential,
  region_id   = chicago_crimes$region_id,
  x           = chicago_crimes$x,
  y           = chicago_crimes$y,
  node_id     = chicago_crimes$node_id,
  tree        = chicago_tree,
  max_pop_pct = 0.25,
  nsim        = 999, seed = 2023,
  n_cores     = 4L
)
print(result_chi)

The most likely cluster covers a contiguous group of community areas on Chicago’s South Side (Englewood, West Englewood, Auburn Gresham, Roseland and neighbouring CCAs), driven by an over-occurrence of violent and property crimes against residents. The relative risk for this cluster is approximately 1.58 — a 58 % excess over the citywide rate.

Visualise the most likely cluster

chicago_crimes joins to the bundled chicago_map polygon layer via the area_number column:

data(chicago_map)

cr1 <- merge(get_cluster_regions(result_chi, n_clusters = 1, overlap = FALSE),
             unique(chicago_crimes[, c("region_id", "area_number", "name")]),
             by = "region_id")
chi <- merge(chicago_map, cr1, by.x = "AREA_NUM", by.y = "area_number",
              all.x = TRUE)

ggplot(chi) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52"), na.value = "gray95",
                    name = "Cluster") +
  theme_minimal()

Multi-cluster detection: iterative_scan()

filter_clusters() (used in Example 1) returns distinct clusters extracted from a single run of the scan. An alternative — popular in the spatial-epidemiology literature since Zhang, Assunção and Kulldorff (2010) — is to run the scan, remove the cases assigned to the detected cluster, and re-run on the residual data. Repeating this yields a sequence of clusters whose subsequent components are conditionally most likely, given that the prior clusters have been explained away.

The function iterative_scan() implements this procedure. Note that it is not part of Cançado et al. (2025); users seeking strict fidelity to that paper should use filter_clusters(). The iterative procedure is offered as a useful extension and is documented as such.

Because each iteration is a separate hypothesis test on data modified by the previous iteration, the raw p-values overstate significance. iterative_scan() collects the raw p-values from all iterations and applies the Holm–Bonferroni correction at the end:

iter_rj <- iterative_scan(
  cases       = rj_mortality$cases,
  population  = rj_mortality$live_births,
  region_id   = rj_mortality$region_id,
  x           = rj_mortality$x,
  y           = rj_mortality$y,
  node_id     = rj_mortality$node_id,
  tree        = rj_tree,
  max_iter    = 5, alpha = 0.05,
  nsim        = 999, seed = 2016,
  max_pop_pct = 0.50, n_cores = 4L
)
print(iter_rj)

The returned clusters data.frame has one row per iteration with the raw p-value, the Holm-adjusted p-value, and a significant flag:

iter_rj$clusters[, c("iteration", "node_id", "n_regions",
                      "llr", "pvalue", "pvalue_adjusted",
                      "significant")]

Mapping the iterative result uses the same S3 dispatch:

cr_it <- merge(get_cluster_regions(iter_rj, overlap = TRUE),
               unique(rj_mortality[, c("region_id", "ibge_code")]),
               by = "region_id")
mun_it <- merge(mun, cr_it, by.x = "code6", by.y = "ibge_code",
                all.x = TRUE)

ggplot(mun_it) +
  geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) +
  scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0",
                                "3" = "#55A868", "4" = "#8172B2",
                                "5" = "#CCB974"),
                    na.value = "gray95", na.translate = FALSE) +
  facet_wrap(~ panel) +
  theme_minimal() + theme(legend.position = "none")

When to use which:


Computational notes

The Monte Carlo step (replicates of the null Poisson/Binomial counts under proportional intensities) dominates the runtime. All scan functions accept n_cores, which parallelises the Monte Carlo step via OpenMP. On a 4-core laptop the Rio de Janeiro analysis runs in under 20 seconds with nsim = 999; the Chicago analysis (much larger tree) takes around two minutes.

For tasks where the case counts are concentrated near the leaves and the tree is deep, aggregate_tree() can be used to compute the internal-node sums once and reuse them across calls.


Other examples

The inst/examples/ directory of the installed package contains fully annotated worked examples for all four datasets:

list.files(system.file("examples", package = "treeSS"))
#> [1] "example_brazil_rj.R"  "example_chicago.R"
#> [2] "example_florida.R"    "example_london.R"

The Florida script demonstrates the full workflow from raw long data — building the ICD-10 tree from scratch, downloading county polygons via tigris, assembling parallel input vectors and plotting with ggplot2. The London script illustrates the use of a non-medical tree (light × road × junction) on traffic collisions, with an interactive leaflet map.

These two examples are also the basis of the detailed use cases in the companion R Journal paper.


References

Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H. (2025). A tree-spatial scan statistic. Environmental and Ecological Statistics, 32, 953–978. doi:10.1007/s10651-025-00670-w

Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics — Theory and Methods, 26(6), 1481–1496.

Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan statistic for database disease surveillance. Biometrics, 59(2), 323–331.

Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan statistics adjusted for multiple clusters. Journal of Probability and Statistics, 2010, 642379.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.