Grouped Hyper Data Frame

Tingting Zhan

2025-06-05

Introduction

This vignette of package groupedHyperframe (CRAN, GitHub, RPubs) documents the creation of a groupedHyperframe object, the batch processes for a groupedHyperframe, and the aggregation of various statistics over a multi-level grouping structure.

Prerequisite

Experimental (and possibly unstable) features are added to the GitHub version very frequently. Active developers should use the GitHub version; suggestions and bug reports are welcome!

remotes::install_github('tingtingzhan/groupedHyperframe')

Stable releases on CRAN are typically updated every 2 to 3 months, or when the authors have an upcoming manuscript in peer review. Developers should not use the CRAN version!

utils::install.packages('groupedHyperframe') # Developers, do NOT use!!

Package groupedHyperframe may require the development versions of the spatstat family.

remotes::install_github('spatstat/spatstat')
remotes::install_github('spatstat/spatstat.data')
remotes::install_github('spatstat/spatstat.explore')
remotes::install_github('spatstat/spatstat.geom')
remotes::install_github('spatstat/spatstat.linnet')
remotes::install_github('spatstat/spatstat.model')
remotes::install_github('spatstat/spatstat.random')
remotes::install_github('spatstat/spatstat.sparse')
remotes::install_github('spatstat/spatstat.univar')
remotes::install_github('spatstat/spatstat.utils')

Getting Started

Examples in this vignette require that the search path has

library(groupedHyperframe)
library(survival) # to help hyperframe understand Surv object

Terms and Abbreviations

Term / Abbreviation Description
|> Forward pipe operator introduced in R 4.1.0
.Machine Numerical characteristics of the machine R is running on, e.g., 32-bit integers and IEC 60559 floating-point (double precision) arithmetic
attr, attributes Attributes
CRAN, R The Comprehensive R Archive Network
cor Correlation matrix
cor.spatial Tjøstheim’s nonparametric correlation coefficient, from package SpatialPack (Vallejos, Osorio, and Bevilacqua 2020)
cov, cov2cor Variance-covariance matrix, and conversion to correlation matrix
data.frame Data frame
diag Matrix diagonals
dist Distance matrix; to take advantage of stats:::as.matrix.dist
file.size File size in bytes
formula Formula
fv, fv.object, plot.fv (Plot of) function value table
groupedData, ~ g1/.../gm Grouped data frame; nested grouping structure, from package nlme (Pinheiro, Bates, and R Core Team 2025)
groupedHyperframe Grouped hyper data frame
hypercolumns, hyperframe (Hyper columns of) hyper data frame, from package spatstat.geom (Baddeley and Turner 2005)
inherits Class inheritance
kerndens Kernel density, stats::density.default()$y
Inf Positive infinity \infty
kmeans k-means clustering
list, listof Lists of objects
markformat Storage mode of marks
marks, marked Marks of a point pattern
mc.cores Number of CPU cores to use for parallel computing
message Diagnostic message printed in R console
multitype Multitype spatial object
NaN Not-a-Number
object.size Memory allocation
pmean, pmedian Parallel, or point-wise, mean and median, groupedHyperframe::pmean; groupedHyperframe::pmedian
pmax, pmin Parallel, or point-wise, maxima and minima
ppp, ppp.object (Marked) point pattern
quantile Quantile
save, saveRDS, xz Save with xz compression
S3, generic, methods S3 object oriented system, UseMethod; getS3method; https://adv-r.hadley.nz/s3.html
sd Standard deviation
search Search path
Surv Survival, i.e., time-to-event, object
trapz, cumtrapz (Cumulative) trapezoidal integration, from package pracma (Borchers 2023)
vector Vector

Acknowledgement

This work is supported by National Institutes of Health, U.S. Department of Health and Human Services grants

Grouped Hyper Data Frame

We introduce a new S3 class groupedHyperframe for grouped hyper data frames, which inherits from the hyper data frame class hyperframe from package spatstat.geom (Baddeley, Rubak, and Turner 2015; Baddeley and Turner 2005). A hyperframe contains columns either as vectors, as in a data.frame, or as lists of objects of the same class, a.k.a. the hypercolumns. This data structure is particularly useful in spatial analysis, e.g., with medical images, where the spatial information in each image is represented by one element of a hypercolumn. The derived class groupedHyperframe has additional attributes

The grammar of the nested grouping structure g_1/.../g_m (~g1/.../gm) follows that of the parameter random of functions nlme::lme() and nlme::nlme(). In fact, the 'grouped' extension of a hyperframe is inspired by the nlme::groupedData class which inherits from data.frame (Pinheiro, Bates, and R Core Team 2025).
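
As a minimal base-R sketch of how a nested grouping formula can be read (this is an illustration only, not necessarily how the package parses its group argument), the grouping levels of ~patient_id/image_id can be listed from highest to lowest with all.vars(). The object names group and lev are made up for this example.

```r
# a nested grouping structure, highest level first
group = ~ patient_id/image_id

# variable names in order of appearance, i.e., g_1, ..., g_m
lev = all.vars(group)
lev               # "patient_id" "image_id"

# the lowest grouping level g_m
lev[length(lev)]  # "image_id"
```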

In this section, we introduce several S3 method dispatches of the S3 generic as.groupedHyperframe() to convert various classes into a groupedHyperframe. We also introduce the aggregation functions aggregate_*() of the hypercolumns in a groupedHyperframe, at any one of the nested grouping levels g_1,\cdots,g_{m-1}. Aggregation at the lowest grouping level g_m is ignored, i.e., no aggregation is performed. Available aggregation methods are the parallel minima base::pmin(), parallel maxima base::pmax(), parallel means pmean() (default) and parallel medians pmedian().
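
To make "parallel", or point-wise, aggregation concrete, here is a small base-R sketch. The helpers pmean_sketch() and pmedian_sketch() are hypothetical stand-ins written for this example; they are not the package's pmean() and pmedian(), whose implementation may differ.

```r
# point-wise mean: bind the vectors as columns, then average each row
pmean_sketch = function(...) {
  m = cbind(...)
  rowMeans(m)
}

# point-wise median: same idea, with the median of each row
pmedian_sketch = function(...) {
  m = cbind(...)
  apply(m, MARGIN = 1L, FUN = stats::median)
}

a = c(1, 5, 3)
b = c(4, 2, 6)
d = c(7, 8, 0)

pmin(a, b, d)            # 1 2 0
pmax(a, b, d)            # 7 8 6
pmean_sketch(a, b, d)    # 4 5 3
pmedian_sketch(a, b, d)  # 4 5 3
```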

From data.frame

Users may convert a data.frame with a substantial amount of duplicated information into a groupedHyperframe using the S3 method dispatch as.groupedHyperframe.data.frame(). This function

  1. inspects the input data.frame by the user-specified (nested) grouping structure;
  2. identifies the column(s) with non-identical elements within the lowest group, and converts them into hypercolumn(s);
  3. returns a groupedHyperframe with the user-specified (nested) grouping structure.
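
The three steps above can be sketched in base R as follows. This is only an illustration on a made-up data.frame with hypothetical column names; it is not the implementation of as.groupedHyperframe.data.frame(), and a plain list column stands in for a hypercolumn.

```r
# toy data: cells (rows) nested in images nested in patients
d = data.frame(
  patient = c('p1', 'p1', 'p1', 'p2', 'p2'),
  image   = c('a',  'a',  'b',  'c',  'c'),
  age     = c(60, 60, 60, 45, 45),
  hladr   = c(.1, .2, .3, .4, .5)
)

# step 1: the lowest group, in order of first appearance
key = paste(d$patient, d$image, sep = '/')
g = factor(key, levels = unique(key))

# step 2: columns with non-identical elements within the lowest group
varies = vapply(d, function(x) {
  any(tapply(x, g, function(v) length(unique(v))) > 1L)
}, NA)
names(varies)[varies]  # "hladr"

# step 3: one row per lowest group; varying column becomes a list column
out = d[!duplicated(g), !varies, drop = FALSE]
out$hladr = unname(split(d$hladr, g))
out
```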

In the following example, consider a toy data set wrobel_lung0 with non-identical column hladr in the lowest group image_id of the nested grouping structure ~patient_id/image_id.

wrobel_lung0 = wrobel_lung |>
  within.data.frame(expr = {
    x = y = NULL
    dapi = phenotype = tissue = NULL
  })
wrobel_lung0 |> head()
#>            image_id    patient_id gender stage_numeric pack_years
#> 1 [40864,18015].im3 #01 0-889-121      F             1         60
#> 2 [40864,18015].im3 #01 0-889-121      F             1         60
#> 3 [40864,18015].im3 #01 0-889-121      F             1         60
#> 4 [40864,18015].im3 #01 0-889-121      F             1         60
#> 5 [40864,18015].im3 #01 0-889-121      F             1         60
#> 6 [40864,18015].im3 #01 0-889-121      F             1         60
#>   adjuvant_therapy hladr    OS mhcII stage age
#> 1               No 0.115 3488+   low    IA  85
#> 2               No 0.239 3488+   low    IA  85
#> 3               No 0.268 3488+   low    IA  85
#> 4               No 0.245 3488+   low    IA  85
#> 5               No 0.127 3488+   low    IA  85
#> 6               No 0.136 3488+   low    IA  85

By converting wrobel_lung0 into a groupedHyperframe, the numeric hladr from each ~patient_id/image_id are converted into elements of the numeric-hypercolumn hladr in the returned wrobel_lung0g. Each row of a groupedHyperframe represents the lowest group of the nested grouping structure. The R console output (S3 method dispatch print.groupedHyperframe()) highlights the nested grouping structure, number of clusters at each grouping level, as well as the first 10 (or less) rows of the groupedHyperframe.

(wrobel_lung0g = wrobel_lung0 |> as.groupedHyperframe(group = ~ patient_id/image_id))
#> Grouped Hyperframe: ~patient_id/image_id
#> 
#> 15 image_id nested in
#> 3 patient_id
#> 
#> Preview of first 10 (or less) rows:
#> 
#>        hladr          image_id    patient_id gender stage_numeric pack_years
#> 1  (numeric) [40864,18015].im3 #01 0-889-121      F             1         60
#> 2  (numeric) [42689,19214].im3 #01 0-889-121      F             1         60
#> 3  (numeric) [42806,16718].im3 #01 0-889-121      F             1         60
#> 4  (numeric) [44311,17766].im3 #01 0-889-121      F             1         60
#> 5  (numeric) [45366,16647].im3 #01 0-889-121      F             1         60
#> 6  (numeric) [56576,16907].im3 #02 1-037-393      M             1         30
#> 7  (numeric) [56583,15235].im3 #02 1-037-393      M             1         30
#> 8  (numeric) [57130,16082].im3 #02 1-037-393      M             1         30
#> 9  (numeric) [57396,17896].im3 #02 1-037-393      M             1         30
#> 10 (numeric) [57403,16934].im3 #02 1-037-393      M             1         30
#>    adjuvant_therapy    OS mhcII stage age
#> 1                No 3488+   low    IA  85
#> 2                No 3488+   low    IA  85
#> 3                No 3488+   low    IA  85
#> 4                No 3488+   low    IA  85
#> 5                No 3488+   low    IA  85
#> 6                No  1605   low    IA  66
#> 7                No  1605   low    IA  66
#> 8                No  1605   low    IA  66
#> 9                No  1605   low    IA  66
#> 10               No  1605   low    IA  66

Reducing memory allocation

Converting a data.frame with a substantial amount of duplicated information, like wrobel_lung0, into a groupedHyperframe greatly reduces the memory allocation.

unclass(object.size(wrobel_lung0g)) / unclass(object.size(wrobel_lung0))
#> [1] 0.113451

A groupedHyperframe, however, does not reduce the saved file.size by much compared to a data.frame, if xz compression is used for both.

f_g = tempfile(fileext = '.rds')
wrobel_lung0g |> saveRDS(file = f_g, compress = 'xz')
f = tempfile(fileext = '.rds')
wrobel_lung0 |> saveRDS(file = f, compress = 'xz')
file.size(f_g) / file.size(f) # not much reduction
#> [1] 0.9382716
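
The same comparison can be tried without the package. The sketch below uses simulated data and a plain list (not a groupedHyperframe) as the collapsed representation, so the ratio it gives only illustrates the idea that storing duplicated information once saves memory; the names long, collapsed and ratio are made up for this example.

```r
set.seed(1)
n_img = 10L    # number of images
n_cell = 1000L # cells per image

# long format: image-level information duplicated on every cell
long = data.frame(
  image_id = rep(sprintf('img%02d', seq_len(n_img)), each = n_cell),
  age = rep(seq(from = 50, by = 1, length.out = n_img), each = n_cell),
  hladr = runif(n_img * n_cell)
)

# collapsed format: image-level information stored once per image
collapsed = list(
  image_id = unique(long$image_id),
  age = long$age[!duplicated(long$image_id)],
  hladr = unname(split(long$hladr, long$image_id))
)

ratio = unclass(object.size(collapsed)) / unclass(object.size(long))
ratio # below 1; the exact value depends on the data
```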

Aggregation of numeric-hypercolumn

We use function aggregate_quantile() to aggregate the quantiles of each element in the numeric-hypercolumn hladr in wrobel_lung0g by point-wise means (default of parameter f_aggr_) at the second-lowest group ~patient_id. The returned object is a hyperframe instead of a groupedHyperframe, as we have one aggregated hladr.quantile per ~patient_id, which eliminates the need for a grouping structure.

wrobel_lung0g |>
  aggregate_quantile(by = ~ patient_id, probs = seq.int(from = .01, to = .99, by = .01))
#> Hyperframe:
#>      patient_id gender stage_numeric pack_years adjuvant_therapy    OS mhcII
#> 1 #01 0-889-121      F             1         60               No 3488+   low
#> 2 #02 1-037-393      M             1         30               No  1605   low
#> 3 #03 2-080-378      M             3         50               No   176  high
#>   stage age hladr.quantile
#> 1    IA  85      (numeric)
#> 2    IA  66      (numeric)
#> 3  IIIA  84      (numeric)
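
The idea behind this aggregation can be sketched in base R: compute the quantiles of each image on a common grid of probabilities, then take the point-wise mean of the quantile vectors within each patient. The data, the object names and the use of rowMeans() below are assumptions made for illustration; they do not reproduce the internals of aggregate_quantile().

```r
set.seed(1)
# three images (elements of a numeric "hypercolumn") from two patients
x = list(a = rnorm(50), b = rnorm(80), c = rnorm(60))
patient = c('p1', 'p1', 'p2')

probs = seq.int(from = .01, to = .99, by = .01)

# quantiles of each image on the common grid of probabilities
q = lapply(x, quantile, probs = probs)

# point-wise mean of the quantile vectors within each patient
agg = lapply(split(q, patient), function(qs) rowMeans(do.call(cbind, qs)))
lengths(agg) # one vector of 99 aggregated quantiles per patient
```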

In this package, we have included a groupedHyperframe example Ki67 with a numeric-hypercolumn logKi67 and a nested grouping structure ~patientID/tissueID.

data(Ki67, package = 'groupedHyperframe')
Ki67
#> Grouped Hyperframe: ~patientID/tissueID
#> 
#> 645 tissueID nested in
#> 622 patientID
#> 
#> Preview of first 10 (or less) rows:
#> 
#>      logKi67 tissueID Tstage  PFS recfreesurv_mon recurrence adj_rad adj_chemo
#> 1  (numeric) TJUe_I17      2 100+             100          0   FALSE     FALSE
#> 2  (numeric) TJUe_G17      1   22              22          1   FALSE     FALSE
#> 3  (numeric) TJUe_F17      1  99+              99          0   FALSE        NA
#> 4  (numeric) TJUe_D17      1  99+              99          0   FALSE      TRUE
#> 5  (numeric) TJUe_J18      1  112             112          1    TRUE      TRUE
#> 6  (numeric) TJUe_N17      4   12              12          1    TRUE     FALSE
#> 7  (numeric) TJUe_J17      2  64+              64          0   FALSE     FALSE
#> 8  (numeric) TJUe_F19      2  56+              56          0   FALSE     FALSE
#> 9  (numeric) TJUe_P19      2  79+              79          0   FALSE     FALSE
#> 10 (numeric) TJUe_O19      2   26              26          1   FALSE      TRUE
#>    histology  Her2   HR  node  race age patientID
#> 1          3  TRUE TRUE  TRUE White  66   PT00037
#> 2          3 FALSE TRUE FALSE Black  42   PT00039
#> 3          3 FALSE TRUE FALSE White  60   PT00040
#> 4          3 FALSE TRUE  TRUE White  53   PT00042
#> 5          3 FALSE TRUE  TRUE White  52   PT00054
#> 6          2  TRUE TRUE  TRUE Black  51   PT00059
#> 7          3 FALSE TRUE  TRUE Asian  50   PT00062
#> 8          2  TRUE TRUE  TRUE White  37   PT00068
#> 9          3  TRUE TRUE FALSE White  68   PT00082
#> 10         2  TRUE TRUE FALSE Black  55   PT00084

Similarly, we use function aggregate_quantile() to aggregate the quantiles of each element in the numeric-hypercolumn logKi67 at the second-lowest group ~patientID.

s = Ki67 |>
  aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01))
s |> head()
#> Hyperframe:
#>   Tstage  PFS recfreesurv_mon recurrence adj_rad adj_chemo histology  Her2   HR
#> 1      2 100+             100          0   FALSE     FALSE         3  TRUE TRUE
#> 2      1   22              22          1   FALSE     FALSE         3 FALSE TRUE
#> 3      1  99+              99          0   FALSE        NA         3 FALSE TRUE
#> 4      1  99+              99          0   FALSE      TRUE         3 FALSE TRUE
#> 5      1  112             112          1    TRUE      TRUE         3 FALSE TRUE
#> 6      4   12              12          1    TRUE     FALSE         2  TRUE TRUE
#>    node  race age patientID logKi67.quantile
#> 1  TRUE White  66   PT00037        (numeric)
#> 2 FALSE Black  42   PT00039        (numeric)
#> 3 FALSE White  60   PT00040        (numeric)
#> 4  TRUE White  53   PT00042        (numeric)
#> 5  TRUE White  52   PT00054        (numeric)
#> 6  TRUE Black  51   PT00059        (numeric)

Users are encouraged to learn more about the applications of the aggregated quantiles of Ki67 data from package hyper.gam vignettes (RPubs, CRAN), section Quantile Index, as well as from our peer-reviewed publications Yi et al. (2025); Yi et al. (2023b); Yi et al. (2023a).

From hyperframe

Users may convert a hyperframe provided in the package spatstat.data into a groupedHyperframe using the S3 method dispatch as.groupedHyperframe.hyperframe(). This function simply inspects and adds a (nested) grouping structure to the input hyperframe.

In the following example, we inspect the data set spatstat.data::osteo, which has the serial number of sampling volume brick nested in the bone sample id, and add the nested grouping structure ~id/brick to it.

spatstat.data::osteo |> 
  as.groupedHyperframe(group = ~ id/brick)
#> Grouped Hyperframe: ~id/brick
#> 
#> 40 brick nested in
#> 4 id
#> 
#> Preview of first 10 (or less) rows:
#> 
#>        id shortid brick   pts depth
#> 1  c77za4       4     1 (pp3)    45
#> 2  c77za4       4     2 (pp3)    60
#> 3  c77za4       4     3 (pp3)    55
#> 4  c77za4       4     4 (pp3)    60
#> 5  c77za4       4     5 (pp3)    85
#> 6  c77za4       4     6 (pp3)    90
#> 7  c77za4       4     7 (pp3)    95
#> 8  c77za4       4     8 (pp3)    65
#> 9  c77za4       4     9 (pp3)   100
#> 10 c77za4       4    10 (pp3)   100

In this vignette, we do not place much emphasis on the objects provided in package spatstat.data, for now.

Grouping Structure on ppp-Hypercolumn

In this section, we introduce the creation of a groupedHyperframe with one-and-only-one point pattern (ppp) hypercolumn, as well as the batch processes of spatial point pattern analyses on the one-and-only-one ppp-hypercolumn of a hyperframe (and/or groupedHyperframe).

These batch processes are not intended for a hyperframe (and/or groupedHyperframe) with multiple ppp-hypercolumns in the foreseeable future, as that would require checking for name clashes in the marks from multiple ppp-hypercolumns.

Grouped hyper data frame with ppp-hypercolumn

Function grouped_ppp() creates a groupedHyperframe with one-and-only-one ppp-hypercolumn.

In the following example, the argument formula specifies

(s = wrobel_lung |>
   grouped_ppp(formula = hladr + phenotype ~ OS + gender + age | patient_id/image_id))
#> Grouped Hyperframe: ~patient_id/image_id
#> 
#> 15 image_id nested in
#> 3 patient_id
#> 
#> Preview of first 10 (or less) rows:
#> 
#>       OS gender age    patient_id          image_id  ppp.
#> 1  3488+      F  85 #01 0-889-121 [40864,18015].im3 (ppp)
#> 2  3488+      F  85 #01 0-889-121 [42689,19214].im3 (ppp)
#> 3  3488+      F  85 #01 0-889-121 [42806,16718].im3 (ppp)
#> 4  3488+      F  85 #01 0-889-121 [44311,17766].im3 (ppp)
#> 5  3488+      F  85 #01 0-889-121 [45366,16647].im3 (ppp)
#> 6   1605      M  66 #02 1-037-393 [56576,16907].im3 (ppp)
#> 7   1605      M  66 #02 1-037-393 [56583,15235].im3 (ppp)
#> 8   1605      M  66 #02 1-037-393 [57130,16082].im3 (ppp)
#> 9   1605      M  66 #02 1-037-393 [57396,17896].im3 (ppp)
#> 10  1605      M  66 #02 1-037-393 [57403,16934].im3 (ppp)

Batch processes

Batch processes to return fv-hypercolumn

In this section, we discuss the batch processes that return a function value table (fv) hypercolumn, i.e., a hypercolumn which consists of a list of fv.objects.

Batch processes applicable to numeric marks

| Batch Process | Workhorse in package spatstat.explore | fv-hypercolumns Suffix |
|---|---|---|
| Emark_() and Vmark_() | Emark and Vmark, conditional mean E(r) and conditional variance V(r), diagnostics for dependence between the points and the marks (Schlather, Ribeiro, and Diggle 2003) | .E and .V |
| markcorr_() | markcorr, marked correlation k_{mm}(r) or generalized mark correlation k_f(r) (Stoyan and Stoyan 1994) | .k |
| markvario_() | markvario, mark variogram \gamma(r) (Wälder and Stoyan 1996) | .gamma |
| Kmark_() | Kmark, mark-weighted K_f(r) function (Penttinen, Stoyan, and Henttonen 1992) | .K |

Exception handling

In package spatstat.explore (up to version 3.4.3, 2025-05-21), function markcorr() is the workhorse inside functions Emark(), Vmark() and markvario(). Function markcorr() relies on the un-exported workhorse function spatstat.explore:::sewsmod(), whose default method = "density" contains the calculation of the ratio of two kernel densities. Due to the floating-point precision of R, such density ratios may have exceptional returns of

See for yourself
0 / c(2.6e-324, 2.5e-324)
#> [1]   0 NaN
c(2.5e-324, 2.6e-324) / 0
#> [1] NaN Inf

Function markcorr() provides a default argument of parameter r, at which the mark correlation function k_f(r) is evaluated, using function spatstat.geom::handle.r.b.args(). The S3 method dispatch spatstat.explore::print.fv() prints the recommended range and available range of the argument r.

spatstat.data::spruces |> 
  spatstat.explore::markcorr()
#> Function value object (class 'fv')
#> for the function r -> k[mm](r)
#> ................................................................................
#>       Math.label             
#> r     r                      
#> theo  {k[mm]^{iid}}(r)       
#> trans {hat(k)[mm]^{trans}}(r)
#> iso   {hat(k)[mm]^{iso}}(r)  
#>       Description                                       
#> r     distance argument r                               
#> theo  theoretical value (independent marks) for k[mm](r)
#> trans translation-corrected estimate of k[mm](r)        
#> iso   Ripley isotropic correction estimate of k[mm](r)  
#> ................................................................................
#> Default plot formula:  .~r
#> where "." stands for 'iso', 'trans', 'theo'
#> Recommended range of argument r: [0, 9.5]
#> Available range of argument r: [0, 9.5]
#> Unit of length: 1 metre

We may observe exceptional returns if we go beyond the recommended range and/or available range. In the following example, the mark correlation k_f(r) (column iso) has value NaN at r=81,85,88,89,90, value 0 at r=82 and value Inf at r=83,84,87.

spatstat.data::spruces |> 
  spatstat.explore::markcorr(r = 0:90) |>
  spatstat.explore::as.data.frame.fv() |>
  utils::tail(n = 10L)
#>     r theo trans      iso
#> 82 81    1   NaN      NaN
#> 83 82    1   NaN  0.00000
#> 84 83    1   NaN      Inf
#> 85 84    1   NaN      Inf
#> 86 85    1   Inf      NaN
#> 87 86    1   Inf 11.50015
#> 88 87    1   NaN      Inf
#> 89 88    1   Inf      NaN
#> 90 89    1   NaN      NaN
#> 91 90    1   NaN      NaN

The batch process markcorr_(), as well as Emark_(), Vmark_() and markvario_() which rely on the workhorse markcorr(), print the recommended r_\text{max} from each of the fv-returns, regardless of whether a user-specified argument for parameter r is provided.

s |>
  Emark_(correction = 'none')
#> 
#> Recommended rmax for hladr.E are 15⨯ 125.625

When a user-specified r is provided for a batch process on all ppp.objects in the ppp-hypercolumn, inevitably some of the fv-returns may have exceptional values. We discuss this exception handling in the next section Aggregation over nested grouping structure.

Batch processes applicable to multitype marks

| Batch Process | Workhorse in package spatstat.explore | fv-hypercolumns Suffix |
|---|---|---|
| Gcross_() | Gcross, multitype nearest-neighbour distance G_{ij}(r) | .G |
| Kcross_() | Kcross, multitype K_{ij}(r) | .K |
| Jcross_() | Jcross, multitype J_{ij}(r) (Van Lieshout and Baddeley 1999) | .J |
| Lcross_() | Lcross, multitype L_{ij}(r)=\sqrt{\frac{K_{ij}(r)}{\pi}} | .L |
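
As a quick numerical check of the L-transformation in the last row, the sketch below plugs in the Poisson benchmark K(r) = \pi r^2 (a standard fact for complete spatial randomness in the plane, not a result from this vignette). Under that benchmark the transformed function is simply L(r) = r, which is why L is often easier to read than K. The objects K and L are plain numeric vectors made up for this example, not fv.objects.

```r
r = seq.int(from = 0, to = 250, by = 10)

K = pi * r^2     # benchmark K-function under complete spatial randomness
L = sqrt(K / pi) # transformation L(r) = sqrt(K(r) / pi)

all.equal(L, r)  # TRUE, i.e., L(r) = r under the benchmark
```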

Batch processes to return numeric-hypercolumn

| Batch Process | Workhorse in package spatstat.geom | Applicable to | numeric-hypercolumns Suffix |
|---|---|---|---|
| nncross_() | nncross.ppp(., what = 'dist'), nearest neighbour distance | multitype marks | .nncross |

Batch processes in a pipeline

Multiple batch processes may be applied to a hyperframe (and/or groupedHyperframe) in a pipeline using the native pipe operator |> introduced in R 4.1.0.

r = seq.int(from = 0, to = 250, by = 10)
out = s |>
  Emark_(r = r, correction = 'none') |> # slow
  # Vmark_(r = r, correction = 'none') |> # slow
  # markcorr_(r = r, correction = 'none') |> # slow
  # markvario_(r = r, correction = 'none') |> # slow
  # Kmark_(r = r, correction = 'none') |> # fast
  Gcross_(i = 'CK+.CD8-', j = 'CK-.CD8+', r = r, correction = 'none') |> # fast
  # Kcross_(i = 'CK+.CD8-', j = 'CK-.CD8+', r = r, correction = 'none') |> # fast
  nncross_(i = 'CK+.CD8-', j = 'CK-.CD8+', correction = 'none') # fast
#> 
#> Recommended rmax for hladr.E are 15⨯ 125.625
#> 

The returned hyperframe (or groupedHyperframe) has

out
#> Grouped Hyperframe: ~patient_id/image_id
#> 
#> 15 image_id nested in
#> 3 patient_id
#> 
#> Preview of first 10 (or less) rows:
#> 
#>       OS gender age    patient_id          image_id  ppp. hladr.E phenotype.G
#> 1  3488+      F  85 #01 0-889-121 [40864,18015].im3 (ppp)    (fv)        (fv)
#> 2  3488+      F  85 #01 0-889-121 [42689,19214].im3 (ppp)    (fv)        (fv)
#> 3  3488+      F  85 #01 0-889-121 [42806,16718].im3 (ppp)    (fv)        (fv)
#> 4  3488+      F  85 #01 0-889-121 [44311,17766].im3 (ppp)    (fv)        (fv)
#> 5  3488+      F  85 #01 0-889-121 [45366,16647].im3 (ppp)    (fv)        (fv)
#> 6   1605      M  66 #02 1-037-393 [56576,16907].im3 (ppp)    (fv)        (fv)
#> 7   1605      M  66 #02 1-037-393 [56583,15235].im3 (ppp)    (fv)        (fv)
#> 8   1605      M  66 #02 1-037-393 [57130,16082].im3 (ppp)    (fv)        (fv)
#> 9   1605      M  66 #02 1-037-393 [57396,17896].im3 (ppp)    (fv)        (fv)
#> 10  1605      M  66 #02 1-037-393 [57403,16934].im3 (ppp)    (fv)        (fv)
#>    phenotype.nncross
#> 1          (numeric)
#> 2          (numeric)
#> 3          (numeric)
#> 4          (numeric)
#> 5          (numeric)
#> 6          (numeric)
#> 7          (numeric)
#> 8          (numeric)
#> 9          (numeric)
#> 10         (numeric)

Aggregation over nested grouping structure

Of fv-hypercolumn(s)

Function aggregate_fv() aggregates

In the following example, we have

(afv = out |>
  aggregate_fv(by = ~ patient_id, f_aggr_ = pmean))
#> Hyperframe:
#>      OS gender age    patient_id hladr.E.value hladr.E.cumtrapz
#> 1 3488+      F  85 #01 0-889-121     (numeric)        (numeric)
#> 2  1605      M  66 #02 1-037-393     (numeric)        (numeric)
#> 3   176      M  84 #03 2-080-378     (numeric)        (numeric)
#>   phenotype.G.value phenotype.G.cumtrapz
#> 1         (numeric)            (numeric)
#> 2         (numeric)            (numeric)
#> 3         (numeric)            (numeric)

Each of the numeric-hypercolumns contains tabulated values on the common grid of r. One “slice” of this grid may be extracted by

afv$hladr.E.cumtrapz |> .slice(j = '50')
#>        1        2        3 
#> 10.60151 10.49337 31.05115
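
For readers unfamiliar with cumulative trapezoidal integration (cumtrapz in the glossary, from package pracma), the following base-R sketch shows the computation on a grid of r and how a value at r = 50 can be picked out by name. The helper cumtrapz_sketch() is written for this example only; it is not pracma::cumtrapz() nor the package's .slice(), and the integrand y = r is chosen merely because its integral r^2/2 is known.

```r
# cumulative trapezoidal integral of y over x, starting from 0
cumtrapz_sketch = function(x, y) {
  c(0, cumsum(diff(x) * (y[-1L] + y[-length(y)]) / 2))
}

r = seq.int(from = 0, to = 250, by = 10)
y = r # toy integrand, exact integral is r^2 / 2

ct = cumtrapz_sketch(r, y)
names(ct) = r # label each value by its r

ct[['50']] # 1250, i.e., 50^2 / 2
```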

Exception handling

As mentioned in the previous section Batch processes, the same user-specified argument of r is used for all ppp.objects in the ppp-hypercolumn. Suppose a naive user supplies an r-vector well beyond the recommended range and/or available range. In this case, function aggregate_fv() prints a message of Legal r_\text{max}, i.e., the last value of r up to which no NaN and/or Inf appears in any of the fv-returns, e.g., in the hypercolumn hladr.E. Note that 0-values in an fv-return are typically a sign of degeneration as well, but function aggregate_fv() does not take 0-values into account when determining the legal r_\text{max}.

r = seq.int(from = 0, to = 1000, by = 50)
s |>
  Emark_(r = r, correction = 'none') |>
  aggregate_fv(by = ~ patient_id, f_aggr_ = pmean)
#> 
#> Recommended rmax for hladr.E are 15⨯ 125.625
#> Legal rmax for hladr.E is 750
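
One way to read the rule for the legal r_\text{max} is sketched below in base R, under the assumption that it is the last r before the first NaN or Inf in any of the fv-returns. The function values are made up so that the answer matches the 750 printed above; the actual logic inside aggregate_fv() may differ.

```r
r = seq.int(from = 0, to = 1000, by = 50) # 21 grid points

# made-up function values of two fv-returns on the common grid
fv_values = list(
  c(rep(1, 16L), NaN, Inf, NaN, NaN, NaN), # finite up to r = 750
  c(rep(1, 19L), Inf, NaN)                 # finite up to r = 900
)

# index of the first NaN/Inf in each fv-return
first_bad = vapply(fv_values, function(y) {
  bad = which(!is.finite(y))
  if (length(bad)) bad[1L] else length(y) + 1L
}, NA_integer_)

# last r at which all fv-returns are still finite;
# empty if some fv-return is already NaN/Inf at the first grid point
legal_rmax = r[min(first_bad) - 1L]
legal_rmax # 750
```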

Of numeric-hypercolumn and numeric marks in ppp-hypercolumn

On quantiles

Function aggregate_quantile() aggregates the quantiles of the numeric-hypercolumns and the numeric marks in the ppp-hypercolumn.

In the following example, we have

out |>
  aggregate_quantile(by = ~ patient_id, probs = seq.int(from = 0, to = 1, by = .1))
#> Hyperframe:
#>      OS gender age    patient_id phenotype.nncross.quantile hladr.quantile
#> 1 3488+      F  85 #01 0-889-121                  (numeric)      (numeric)
#> 2  1605      M  66 #02 1-037-393                  (numeric)      (numeric)
#> 3   176      M  84 #03 2-080-378                  (numeric)      (numeric)

On kernel densities

Function aggregate_kerndens() aggregates the kernel density of the numeric-hypercolumns and the numeric marks in the ppp-hypercolumn.

In the following example, we have

(mdist = out$phenotype.nncross |> unlist() |> max())
#> [1] 333.2417
out |> 
  aggregate_kerndens(by = ~ patient_id, from = 0, to = mdist)
#> Hyperframe:
#>      OS gender age    patient_id phenotype.nncross.kerndens hladr.kerndens
#> 1 3488+      F  85 #01 0-889-121                  (numeric)      (numeric)
#> 2  1605      M  66 #02 1-037-393                  (numeric)      (numeric)
#> 3   176      M  84 #03 2-080-378                  (numeric)      (numeric)
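
The role of the arguments from and to can be seen in a base-R sketch: kernel densities are only comparable point-wise if they are evaluated on the same grid, which stats::density() guarantees when from, to (and the default n = 512) are shared. The simulated data and the point-wise mean via rowMeans() are assumptions for illustration, not the internals of aggregate_kerndens().

```r
set.seed(2)
# three images (elements of a numeric "hypercolumn") from two patients
x = list(a = rexp(100), b = rexp(120), c = rexp(90))
patient = c('p1', 'p1', 'p2')

# common evaluation range for all kernel densities
to = max(unlist(x))

# kernel density values ($y) of each image on the common grid
kd = lapply(x, function(v) stats::density(v, from = 0, to = to)$y)

# point-wise mean of the density curves within each patient
agg = lapply(split(kd, patient), function(k) rowMeans(do.call(cbind, k)))
lengths(agg) # 512 density values per patient
```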

Appendix

k-Means Clustering

The S3 generic .kmeans() performs k-means clustering (workhorse function stats::kmeans).

On ppp.object

data(shapley, package = 'spatstat.data')
shapley
#> Marked planar point pattern: 4215 points
#> Mark variables: Mag, V, SigV 
#> window: polygonal boundary
#> enclosing rectangle: [192.80185, 216.26227] x [-37.75511, -27.40664] degrees

The S3 method dispatch .kmeans.ppp() performs k-means clustering, with parameters

By coordinate(s) and/or marks

The example below shows a clustering based on the x- and y-coordinates, as well as the numeric mark Mag.

km = shapley |> .kmeans(formula = ~ x + y + Mag, centers = 3L)
km |> class()
#> [1] "kmeans"

The example below shows a clustering based on the x-coordinate and Mag.

km1 = shapley |> .kmeans(formula = ~ x + Mag, centers = 3L)
km1 |> class()
#> [1] "kmeans"

The example below shows a clustering based on the x- and y-coordinates only.

km2 = shapley |> .kmeans(formula = ~ x + y, centers = 3L)
km2 |> class()
#> [1] "kmeans"

By clusterSize

The example below shows a clustering specified by clusterSize.

km3 = shapley |> .kmeans(formula = ~ x + y, clusterSize = 1e3L)
km3 |> class()
#> [1] "kmeans"
km3$centers # 5 clusters needed
#>          x         y
#> 1 202.1441 -31.01862
#> 2 206.0951 -33.34088
#> 3 194.6935 -29.95076
#> 4 199.0664 -34.90336
#> 5 210.8689 -31.15603
km3$cluster |> table()
#> 
#>    1    2    3    4    5 
#> 1467  679 1099  448  522
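
The comment "5 clusters needed" is consistent with choosing the number of centers as the point count divided by clusterSize, rounded up. The base-R sketch below assumes exactly that rule (an assumption, not confirmed by the package source) and uses simulated coordinates in place of shapley. As in the output above, the resulting clusters are not forced to have clusterSize points each.

```r
set.seed(3)
# simulated coordinates, same number of points as shapley
xy = cbind(x = runif(4215), y = runif(4215))

clusterSize = 1e3L
# assumed rule: enough clusters for about clusterSize points each
centers = as.integer(ceiling(nrow(xy) / clusterSize))
centers # 5

km = stats::kmeans(xy, centers = centers, iter.max = 50L)
km$cluster |> table() # cluster sizes vary around clusterSize
```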

Split by k-Means Clustering

The S3 generic split_kmeans() splits ppp.object, listof ppp.objects, and hyperframe by k-means clustering.

Note that many functions in package groupedHyperframe require a 'dataframe' markformat for ppp.objects.

data(flu, package = 'spatstat.data')
flu$pattern[[1L]] |> 
  spatstat.geom::markformat()
#> [1] "vector"

Users may convert a 'vector' markformat to 'dataframe' using the syntactic sugar `mark_name<-`,

flu$pattern[] = flu$pattern |> 
  lapply(FUN = `mark_name<-`, value = 'stain') # read ?flu carefully
flu$pattern[[1L]] |> 
  spatstat.geom::markformat()
#> [1] "dataframe"

Split a ppp.object

The S3 method dispatch split_kmeans.default() splits a ppp.object by k-means clustering.

flu$pattern[[1L]] |> split_kmeans(formula = ~ x + y, centers = 3L)
#> Point pattern split by factor 
#> 
#> 1:
#> Marked planar point pattern: 157 points
#> Multitype, with levels = M2, M1 
#> window: rectangle = [0, 3331] x [0, 3331] nm
#> 
#> 2:
#> Marked planar point pattern: 145 points
#> Multitype, with levels = M2, M1 
#> window: rectangle = [0, 3331] x [0, 3331] nm
#> 
#> 3:
#> Marked planar point pattern: 169 points
#> Multitype, with levels = M2, M1 
#> window: rectangle = [0, 3331] x [0, 3331] nm

Split a listof ppp.objects

The S3 method dispatch split_kmeans.listof() splits a listof ppp.objects by k-means clustering.

The returned object has attributes

flu$pattern[1:2] |> split_kmeans(formula = ~ x + y, centers = 3L) 
#> $`wt M2-M1 13.1`
#> Marked planar point pattern: 153 points
#> Multitype, with levels = M2, M1 
#> window: rectangle = [0, 3331] x [0, 3331] nm
#> 
#> $`wt M2-M1 13.2`
#> Marked planar point pattern: 147 points
#> Multitype, with levels = M2, M1 
#> window: rectangle = [0, 3331] x [0, 3331] nm
#> 
#> $`wt M2-M1 13.3`
#> Marked planar point pattern: 171 points
#> Multitype, with levels = M2, M1 
#> window: rectangle = [0, 3331] x [0, 3331] nm
#> 
#> $`wt M2-M1 22.1`
#> Marked planar point pattern: 94 points
#> Multitype, with levels = M2, M1 
#> window: rectangle = [0, 3331] x [0, 3331] nm
#> 
#> $`wt M2-M1 22.2`
#> Marked planar point pattern: 79 points
#> Multitype, with levels = M2, M1 
#> window: rectangle = [0, 3331] x [0, 3331] nm
#> 
#> $`wt M2-M1 22.3`
#> Marked planar point pattern: 44 points
#> Multitype, with levels = M2, M1 
#> window: rectangle = [0, 3331] x [0, 3331] nm
#> 
#> attr(,"id")
#> [1] 1 1 1 2 2 2
#> attr(,"cluster")
#> [1] 1 2 3 1 2 3

Split a hyperframe and/or groupedHyperframe

The S3 method dispatch split_kmeans.hyperframe() splits a hyperframe and/or groupedHyperframe by k-means clustering of the one-and-only-one ppp-hypercolumn.

The returned object is a groupedHyperframe with grouping structure

flu[1:2,] |> split_kmeans(formula = ~ x + y, centers = 3L)
#> Grouped Hyperframe: ~.id/.cluster
#> 
#> 6 .cluster nested in
#> 2 .id
#> 
#> Preview of first 10 (or less) rows:
#> 
#>   pattern .id .cluster virustype stain frameid
#> 1   (ppp)   1        1        wt M2-M1      13
#> 2   (ppp)   1        2        wt M2-M1      13
#> 3   (ppp)   1        3        wt M2-M1      13
#> 4   (ppp)   2        1        wt M2-M1      22
#> 5   (ppp)   2        2        wt M2-M1      22
#> 6   (ppp)   2        3        wt M2-M1      22

Pairwise Tjøstheim’s Coefficient

The S3 generic pairwise_cor_spatial() calculates the nonparametric, rank-based, Tjøstheim’s correlation coefficients (Tjøstheim 1978; Hubert and Golledge 1982) in a pairwise-combination fashion, using the workhorse function SpatialPack::cor.spatial(). All S3 method dispatches return an object of class 'pairwise_cor_spatial', which inherits from class 'dist'.

Of ppp.object

The S3 method dispatch pairwise_cor_spatial.ppp() finds the nonparametric Tjøstheim’s correlation coefficients from the pairwise-combinations of all numeric marks of a ppp.object.

data(finpines, package = 'spatstat.data')
(r = finpines |> pairwise_cor_spatial())
#>         diameter
#> height 0.7287879

The printing of 'pairwise_cor_spatial' is taken care of by function stats:::print.dist.

Matrix of pairwise Tjøstheim’s coefficient

The S3 method dispatch as.matrix.pairwise_cor_spatial() returns a matrix with diagonal values of 1.

r |> as.matrix()
#>           diameter    height
#> diameter 1.0000000 0.7287879
#> height   0.7287879 1.0000000
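
Why a dedicated as.matrix method is useful can be seen with a plain 'dist' object: stats:::as.matrix.dist fills the diagonal with 0, which suits distances but not coefficients. The sketch below builds a 'dist' object by hand from the value printed above and then sets the diagonal to 1. It is an assumption about how such a method could work, not the source of as.matrix.pairwise_cor_spatial().

```r
# a hand-made 'dist' object holding one pairwise coefficient
r0 = structure(
  0.7287879,
  Size = 2L, Labels = c('diameter', 'height'),
  Diag = FALSE, Upper = FALSE,
  class = 'dist'
)

m = as.matrix(r0) # dispatches to stats:::as.matrix.dist
diag(m)           # 0 0, the default for distances

diag(m) = 1       # a variable is perfectly associated with itself
m
```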

Note that this matrix is not a correlation matrix, because Tjøstheim’s correlation coefficient

References

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman and Hall/CRC Press. https://www.routledge.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/p/book/9781482210200/.
Baddeley, Adrian, and Rolf Turner. 2005. “spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of Statistical Software 12 (6): 1–42. https://doi.org/10.18637/jss.v012.i06.
Borchers, Hans W. 2023. pracma: Practical Numerical Math Functions. https://doi.org/10.32614/CRAN.package.pracma.
Hubert, Lawrence J., and Reginald G. Golledge. 1982. “Measuring Association Between Spatially Defined Variables: Tjøstheim’s Index and Some Extensions.” Geographical Analysis 14 (3): 273–78. https://doi.org/10.1111/j.1538-4632.1982.tb00077.x.
Penttinen, Antti, Dietrich Stoyan, and Helena M. Henttonen. 1992. “Marked Point Processes in Forest Statistics.” Forest Science 38 (4): 806–24. https://doi.org/10.1093/forestscience/38.4.806.
Pinheiro, José, Douglas Bates, and R Core Team. 2025. nlme: Linear and Nonlinear Mixed Effects Models. https://doi.org/10.32614/CRAN.package.nlme.
Schlather, Martin, Paulo J. Ribeiro Jr., and Peter J. Diggle. 2003. “Detecting Dependence Between Marks and Locations of Marked Point Processes.” Journal of the Royal Statistical Society Series B: Statistical Methodology 66 (1): 79–93. https://doi.org/10.1046/j.1369-7412.2003.05343.x.
Stoyan, Helga, and Dietrich Stoyan. 1994. Fractals, Random Shapes and Point Fields: Methods of Geometrical Statistics. John Wiley & Sons. https://www.wiley.com/Fractals%2C+Random+Shapes+and+Point+Fields%3A+Methods+of+Geometrical+Statistics-p-9780471937579.
Tjøstheim, Dag. 1978. “A Measure of Association for Spatial Variables.” Biometrika 65 (1): 109–14. https://doi.org/10.1093/biomet/65.1.109.
Vallejos, R., F. Osorio, and M. Bevilacqua. 2020. Spatial Relationships Between Two Georeferenced Variables: With Applications in R. New York: Springer. http://srb2gv.mat.utfsm.cl/.
Van Lieshout, M. N. M., and A. J. Baddeley. 1999. “Indices of Dependence Between Types in Multivariate Point Patterns.” Scandinavian Journal of Statistics 26 (4): 511–32. https://doi.org/10.1111/1467-9469.00165.
Wälder, Olga, and Dietrich Stoyan. 1996. “On Variograms in Point Process Statistics.” Biometrical Journal 38 (8): 895–905. https://doi.org/10.1002/bimj.4710380802.
Yi, Misung, Tingting Zhan, Amy R. Peck, Jeffrey A. Hooke, Albert J. Kovatich, Craig D. Shriver, Hai Hu, Yunguang Sun, Hallgeir Rui, and Inna Chervoneva. 2023a. “Quantile Index Biomarkers Based on Single-Cell Expression Data.” Laboratory Investigation 103 (8): 100158. https://doi.org/10.1016/j.labinv.2023.100158.
———. 2023b. “Selection of Optimal Quantile Protein Biomarkers Based on Cell-Level Immunohistochemistry Data.” BMC Bioinformatics 24 (1): 298. https://doi.org/10.1186/s12859-023-05408-8.
Yi, Misung, Tingting Zhan, Hallgeir Rui, and Inna Chervoneva. 2025. “Functional Protein Biomarkers Based on Distributions of Expression Levels in Single-Cell Imaging Data.” Bioinformatics, April, btaf182. https://doi.org/10.1093/bioinformatics/btaf182.