--- title: "Introduction to treeSS" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to treeSS} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ## Overview The **treeSS** package implements the tree-spatial scan statistic of Cançado, Oliveira, Quadros and Duczmal (2025), which detects clusters that are simultaneously anomalous in geographic space and on a hierarchical classification tree. It generalises Kulldorff's (1997) circular spatial scan by searching jointly over circular spatial zones $z$ and tree branches $g$, returning the (zone, branch) pair that maximises the likelihood ratio. The package targets epidemiological and surveillance applications where the events to be clustered (deaths, hospitalisations, crimes, collisions) carry both a *location* and a *categorical hierarchy* (ICD-10 chapter $\to$ subchapter $\to$ leaf, NAICS sector, road type, etc.). Detecting clusters jointly in both dimensions can identify specific cause-area combinations that pure spatial or pure tree scans would miss. The package provides: | Function | Purpose | |:---------|:--------| | `treespatial_scan()` | Tree-spatial scan (main entry point) | | `circular_scan()` | Kulldorff's circular spatial scan (tree-free) | | `tree_scan()` | Tree-based scan (space-free) | | `iterative_scan()` | Sequential conditional scan with Holm–Bonferroni correction | | `filter_clusters()` | Distinct secondary clusters per Cançado et al. (2025), Sec. 5.1.1 | | `get_cluster_regions()` | Tidy region-by-cluster table for plotting | | `aggregate_tree()` | Roll counts up the tree | All scans accept the Poisson and Binomial models of the original paper and run the Monte Carlo step under OpenMP via the `n_cores` argument. The package ships four example datasets: | Dataset | Country | Domain | Regions | Tree | |:--------|:--------|:-------|:--------|:-----| | `rj_mortality` + `rj_tree` | Brazil | Public health | 89 municipalities | ICD-10 (622 nodes) | | `chicago_crimes` + `chicago_tree` | USA | Crime | 77 community areas | Type × Description × Location (2841 nodes) | | `fl_deaths` | USA | Public health | 65 counties | (built by user from raw data) | | `london_collisions` + `london_tree` | UK | Road safety | 33 boroughs | Light × Road × Junction (81 nodes) | This vignette walks through `rj_mortality` (reproducing the published analysis) and `chicago_crimes` (a different-domain illustration), then shows how to use `iterative_scan()` for sequential cluster detection. The Florida and London examples are shipped in `inst/examples/` and discussed in the companion paper. --- ## Example 1 — Infant mortality in Rio de Janeiro This reproduces Section 5.2 of Cançado et al. (2025): all live births and infant deaths in the 89 municipalities of Rio de Janeiro state in 2016, classified by the ICD-10 hierarchy. The pre-built dataset combines deaths, live births, ICD-10 leaf codes and centroid coordinates in long format. ### Run the scan ```{r rj-data} library(treeSS) data(rj_mortality) data(rj_tree) str(rj_mortality, max = 1) #> 'data.frame': 1440 obs. of 9 variables: region_id, ibge_code, #> name, live_births, x, y, node_id, cases, ... ``` ```{r rj-scan} result_rj <- treespatial_scan( cases = rj_mortality$cases, population = rj_mortality$live_births, region_id = rj_mortality$region_id, x = rj_mortality$x, y = rj_mortality$y, node_id = rj_mortality$node_id, tree = rj_tree, max_pop_pct = 0.50, nsim = 999, seed = 2016, n_cores = 4L ) print(result_rj) ``` Running this produces the most likely cluster `P209` with 18 municipalities, log-likelihood ratio $LR \approx 38.83$, $p < 0.001$, 27 observed deaths against 3.34 expected — closely matching the published Table 7 ($LR = 38.48$, 18 municipalities, 27 deaths, 3.37 expected). The minor LR difference comes from a SIM database update between the paper's analysis and the dataset shipped here; the conclusions are identical. ### Multiple distinct clusters (paper-faithful) To extract more than one cluster from a single scan, the paper's Section 5.1.1 specifies a filtering rule: a candidate (zone, branch) pair is retained only if its branch is disjoint from every previously retained branch (no ancestor/descendant relation), or its zone is disjoint from every previously retained zone (no region in common). This is implemented in `filter_clusters()`: ```{r rj-filter} filter_clusters(result_rj)[1:5, c("node_id", "n_regions", "cases", "expected", "llr")] ``` For visualisation, `get_cluster_regions()` returns a tidy data.frame ready to merge with a polygon layer: ```{r rj-getregions, eval = FALSE} # Most likely cluster only (single map) cr1 <- get_cluster_regions(result_rj, n_clusters = 1, overlap = FALSE) # Top-2 distinct clusters (faceted map) cr2 <- get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE) ``` A complete plotting workflow with `geobr` and `ggplot2`: ```{r rj-plot} library(ggplot2) library(geobr) library(sf) mun <- read_municipality(code_muni = "RJ", year = 2016) mun$code6 <- as.integer(substr(mun$code_muni, 1, 6)) cr2 <- merge(get_cluster_regions(result_rj, n_clusters = 2, overlap = TRUE), unique(rj_mortality[, c("region_id", "ibge_code")]), by = "region_id") mun_facet <- merge(mun, cr2, by.x = "code6", by.y = "ibge_code") ggplot(mun_facet) + geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) + scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0"), na.value = "gray95", na.translate = FALSE) + facet_wrap(~ panel, nrow = 1) + theme_minimal() + theme(legend.position = "none") ``` A complete worked example with detailed annotation is shipped at `system.file("examples", "example_brazil_rj.R", package = "treeSS")`. --- ## Example 2 — Crime in Chicago The Chicago dataset illustrates the package on a non-medical domain with a much larger tree (2841 nodes covering crime type, description and location). It also demonstrates a recurring choice users face in surveillance applications: **which population should be the denominator?** `chicago_crimes` ships with two denominator columns: * `population` — the total incidents in each community area. This is a *compositional* denominator: it expresses each cell as a share of the citywide incident count and answers the question *"which crime types over-occur in which areas, relative to the citywide profile?"* * `pop_residential` — the residential population of each community area (ACS 2020 5-year, aggregated to community areas by CMAP Community Data Snapshots). This is the appropriate denominator for an *incidence-rate* analysis (incidents per resident), analogous to deaths per live birth in Example 1. The choice changes both the model and the interpretation. We use the incidence-rate denominator below. ### Run the scan ```{r chi-scan} data(chicago_crimes) data(chicago_tree) result_chi <- treespatial_scan( cases = chicago_crimes$cases, population = chicago_crimes$pop_residential, region_id = chicago_crimes$region_id, x = chicago_crimes$x, y = chicago_crimes$y, node_id = chicago_crimes$node_id, tree = chicago_tree, max_pop_pct = 0.25, nsim = 999, seed = 2023, n_cores = 4L ) print(result_chi) ``` The most likely cluster covers a contiguous group of community areas on Chicago's South Side (Englewood, West Englewood, Auburn Gresham, Roseland and neighbouring CCAs), driven by an over-occurrence of violent and property crimes against residents. The relative risk for this cluster is approximately 1.58 — a 58 % excess over the citywide rate. ### Visualise the most likely cluster `chicago_crimes` joins to the bundled `chicago_map` polygon layer via the `area_number` column: ```{r chi-plot} data(chicago_map) cr1 <- merge(get_cluster_regions(result_chi, n_clusters = 1, overlap = FALSE), unique(chicago_crimes[, c("region_id", "area_number", "name")]), by = "region_id") chi <- merge(chicago_map, cr1, by.x = "AREA_NUM", by.y = "area_number", all.x = TRUE) ggplot(chi) + geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) + scale_fill_manual(values = c("1" = "#C44E52"), na.value = "gray95", name = "Cluster") + theme_minimal() ``` --- ## Multi-cluster detection: `iterative_scan()` `filter_clusters()` (used in Example 1) returns distinct clusters extracted from a *single* run of the scan. An alternative — popular in the spatial-epidemiology literature since Zhang, Assunção and Kulldorff (2010) — is to run the scan, remove the cases assigned to the detected cluster, and re-run on the residual data. Repeating this yields a sequence of clusters whose subsequent components are *conditionally most likely*, given that the prior clusters have been explained away. The function `iterative_scan()` implements this procedure. **Note that it is not part of Cançado et al. (2025);** users seeking strict fidelity to that paper should use `filter_clusters()`. The iterative procedure is offered as a useful extension and is documented as such. Because each iteration is a separate hypothesis test on data modified by the previous iteration, the raw p-values overstate significance. `iterative_scan()` collects the raw p-values from all iterations and applies the Holm–Bonferroni correction at the end: ```{r rj-iter} iter_rj <- iterative_scan( cases = rj_mortality$cases, population = rj_mortality$live_births, region_id = rj_mortality$region_id, x = rj_mortality$x, y = rj_mortality$y, node_id = rj_mortality$node_id, tree = rj_tree, max_iter = 5, alpha = 0.05, nsim = 999, seed = 2016, max_pop_pct = 0.50, n_cores = 4L ) print(iter_rj) ``` The returned `clusters` data.frame has one row per iteration with the raw p-value, the Holm-adjusted p-value, and a `significant` flag: ```{r rj-iter-table} iter_rj$clusters[, c("iteration", "node_id", "n_regions", "llr", "pvalue", "pvalue_adjusted", "significant")] ``` Mapping the iterative result uses the same S3 dispatch: ```{r rj-iter-map} cr_it <- merge(get_cluster_regions(iter_rj, overlap = TRUE), unique(rj_mortality[, c("region_id", "ibge_code")]), by = "region_id") mun_it <- merge(mun, cr_it, by.x = "code6", by.y = "ibge_code", all.x = TRUE) ggplot(mun_it) + geom_sf(aes(fill = factor(cluster)), color = "gray40", alpha = 0.75) + scale_fill_manual(values = c("1" = "#C44E52", "2" = "#4C72B0", "3" = "#55A868", "4" = "#8172B2", "5" = "#CCB974"), na.value = "gray95", na.translate = FALSE) + facet_wrap(~ panel) + theme_minimal() + theme(legend.position = "none") ``` When to use which: * Use `filter_clusters()` / `get_cluster_regions(n_clusters = N)` when you want to follow Cançado et al. (2025) exactly. * Use `iterative_scan()` when residual clustering is plausible and you want a sequence of conditionally distinct clusters with proper multiple-testing control. --- ## Computational notes The Monte Carlo step (replicates of the null Poisson/Binomial counts under proportional intensities) dominates the runtime. All scan functions accept `n_cores`, which parallelises the Monte Carlo step via OpenMP. On a 4-core laptop the Rio de Janeiro analysis runs in under 20 seconds with `nsim = 999`; the Chicago analysis (much larger tree) takes around two minutes. For tasks where the case counts are concentrated near the leaves and the tree is deep, `aggregate_tree()` can be used to compute the internal-node sums once and reuse them across calls. --- ## Other examples The `inst/examples/` directory of the installed package contains fully annotated worked examples for all four datasets: ```{r examples-dir} list.files(system.file("examples", package = "treeSS")) #> [1] "example_brazil_rj.R" "example_chicago.R" #> [2] "example_florida.R" "example_london.R" ``` The Florida script demonstrates the full workflow from raw long data — building the ICD-10 tree from scratch, downloading county polygons via `tigris`, assembling parallel input vectors and plotting with `ggplot2`. The London script illustrates the use of a non-medical tree (light × road × junction) on traffic collisions, with an interactive `leaflet` map. These two examples are also the basis of the detailed use cases in the companion R Journal paper. --- ## References Cançado, A. L. F., Oliveira, G. S., Quadros, A. V. C., & Duczmal, L. H. (2025). A tree-spatial scan statistic. *Environmental and Ecological Statistics*, 32, 953–978. [doi:10.1007/s10651-025-00670-w](https://doi.org/10.1007/s10651-025-00670-w) Kulldorff, M. (1997). A spatial scan statistic. *Communications in Statistics — Theory and Methods*, 26(6), 1481–1496. Kulldorff, M., Fang, Z., & Walsh, S. J. (2003). A tree-based scan statistic for database disease surveillance. *Biometrics*, 59(2), 323–331. Zhang, Z., Assunção, R., & Kulldorff, M. (2010). Spatial scan statistics adjusted for multiple clusters. *Journal of Probability and Statistics*, 2010, 642379. Holm, S. (1979). A simple sequentially rejective multiple test procedure. *Scandinavian Journal of Statistics*, 6(2), 65–70.