The package spatialGE
provides
a collection of tools for the visualization of gene expression from
spatially-resolved transcriptomic experiments. The data input methods
have been designed so that any data can be analyzed as long as it
contains gene expression counts per spot or cell, and the spatial
coordinates of those spots or cells. Specialized data input options are
available to allow interoperability with workflows such as Space
Ranger.
This tutorial covers basic analysis possible within spatialGE. Tutorials explaining more advanced spatial analyses and other data input types can be found in the documentation site of the spatialGE GitHub repository. A point-and-click, user-friendly interface is also available at https://spatialge.moffitt.org.
The spatialGE
repository is available at the
Comprehensive R Archive Network (CRAN) repository and can be installed
via install.packages
:
A development version is available at GitHub and can be installed via
devtools
:
# if("devtools" %in% rownames(installed.packages()) == FALSE){
# install.packages("devtools")
# }
#devtools::install_github("fridleylab/spatialGE")
To use spatialGE
, load the package.
To show the utility of some of the functions in
spatialGE
, we use the spatial transcriptomics data set
generated with the platform Visium by Bassiouni
et al. (2023). This data set includes triple negative breast
cancer biopsies from 22 patients, with two tissue slices per patient.
The Visium platform allows gene expression quantitation in approximately
5000 capture areas (i.e., “spots”) separated each other by 100μM. The
spots are 55μM in diameter, which corresponds to 1-10 cells
approximately, depending on cell size.
Data for all the tissue slices are available at the Gene Expression Omnibus (GEO). For the purpose of this tutorial, we will use eight samples from four patients.
The GEO repositories can be accessed using the following links:
For convenience, the data is also available at our spatialGE_Data
GitHub repository. During this tutorial, we will pull the data from
this repository. Nonetheless, users are eoncouraged to download the data
and explore its contents in order to familiarize with the input formats
accepted by spatialGE
. For each sample directory, the file
filtered_feature_bc_matrix.h5
contains the gene expression
counts. The spatial
sub-directory contains the
tissue_positions_list.csv
,
tissue_hires_image.png
, and
scalefactors_json.json
files. This file structure,
corresponds roughly to the file structure in outputs generated by Space
Ranger, the software provided by 10X Genomics to process raw Visium
data.
In this tutorial, data will be deposited in a temporary directory. The data is downloaded from this link:
The following code block creates a temporary directory:
The data will be downloaded to the temporary directory. The data is
compressed in a zip file, which will be unzipped into a sub-directory
called tnbc_bassiouni
:
tryCatch({ # In case data is not available from network
download.file(lk, destfile=paste0(visium_tmp, '/', 'tnbc_bassiouni.zip'), mode='wb')
zip_tmp = list.files(visium_tmp, pattern='tnbc_bassiouni.zip$', full.names=TRUE)
unzip(zipfile=zip_tmp, exdir=thrane_tmp)
net_check <<- TRUE
}, error = function(e) {
message("Could not download data. Are you connected to the internet?")
net_check <<- FALSE
})
#> Could not download data. Are you connected to the internet?
In spatialGE
, raw and processed data are stored in an
STlist (S4 class object). As previously
mentioned, an STlist can be created with the function
STlist
, from a variety of data formats (see here
for more info or type ?STlist
in the R console). In this
tutorial we will provide the file paths to the folders downloaded in the
previous step.
Additionally, we will input meta data associated with each sample.
The meta data is provided in the form of a comma-delimited file. We have
extracted some of the clinical meta data for the eight samples from the
original publication and saved it as part of the spatialGE
package. The user is encouraged to look at the structure of this file by
downloading it from the spatialGE_Data
GitHub repository. The most important aspect when constructing this
file is that the sample names are in the first column, and they match
the names of the folders containing the data:
From the temporary directory, we can use R to generate the file paths
to be passed to the STlist
function:
visium_folders <- list.dirs(paste0(visium_tmp, '/tnbc_bassiouni'), full.names=TRUE, recursive=FALSE)
The meta data can be accessed directly from the
spatialGE
package installed in the computer like so:
clin_file <- list.files(paste0(visium_tmp, '/tnbc_bassiouni'), full.names=TRUE, recursive=FALSE, pattern='clinical')
We can load the files into an STlist using this command:
The tnbc
object is an STlist that contains the count
data, spot coordinates, and clinical meta data.
As observed, by calling the tnbc
object, information on
the number of spots and genes per sample is displayed. For count
statistics, the summarize_STlist
function can be used:
The minimum number of counts per spot is 42, which seems low. We can
look at the distribution of counts and genes per spot using the
distribution_plots
function:
Now, let us remove spots with low counts by keeping only those spots with at least 5000 counts. We will also restrict the data set to spots expressing at least 1000 genes. We also will remove a few spots that have abnormally large number of counts, as in sample_094c. These criteria are not a rule, and samples in each study have to be carefully examined. For example, this criteria may be not enough to reduce the differences in counts, especially for sample_120d and samples_120e.
We run this filter with the filter_data
function:
The functions pseudobulk_samples
and
pseudobulk_pca_plot
are the initial steps to obtain a quick
snapshot of the variation in gene expression among samples. The function
pseudobulk_samples
creates (pseudo) “bulk” RNAseq data sets
by combining all counts from each sample. Then, it log transforms the
“pseudo bulk” RNAseq counts and runs a Principal Component Analysis
(PCA). Note that the spatial coordinate information is not considered
here, which is intended only as an exploratory analysis analysis. The
max_var_genes
argument is used to specify the maximum
number of genes used for computation of PCA. The genes are selected
based on their standard deviation across samples.
In this case, we apply the function to look for agreement between
samples from the same patient: It is expected that tissue slices from
the same patient are more similar among them than tissue slices from
other patients. The pseudobulk_pca
allows to map a
sample-level variable to the points in the PCA by including the name of
the column from the sample metadata (patient_id
in this
example).
Users can also generate a heatmap from pseudobulk counts by calling
the pseudobulk_heatmap
function, which also requires prior
use of the pseudobulk_samples
function. The number of
variable genes to show can be controlled via the
hm_display_genes
argument.
Many transformation methods are available for RNAseq count data. In
spatialGE
, the function transform_data
applies
log-transformation to the counts, after library size normalization
performed on each sample separately. Similar to Seurat,
it applies a scaling factor (scale_f=10000
by default).
Users also can apply variance-stabilizing transformation (SCT; Hafemeister and Satija (2019)), which is another method increasingly used in single-cell and spatial transcriptomics studies. See here for details.
After data transformation, expression of specific genes can be
visualized using “quilt” plots. The function STplot
shows
the transformed expression of one or several genes. We have adopted the
color palettes from the packages khroma
and
RColorBrewer
. The name of a color palette can be passed
using the argument color_pal
. The default behavior of the
function produces plots for all samples within the STlist, but we can
pass specific samples to be plotted using the argument
samples
.
Let’s produce a quilt plot for the genes NDRG1 and
IGKC (hypoxia and B-cell markers, respectively), for sample
number 1 of patient 14 (samples=sample_094c
).
quilts1 <- STplot(tnbc,
genes=c('NDRG1', 'IGKC'),
samples='sample_094c',
color_pal='BuRd',
ptsize=0.8)
Because spatialGE
functions output lists of ggplot
objects, we can plot the results side-by-side using functions such as
ggarrange()
:
We can see that gene expression patterns of both genes are non-overlapping: IGKC is expressed in the upper right portion of the tissue, whereas NDRG1 is expressed in the right bottom portion (although relatively lower compared to IGKC as indicated by the white-colored spots). The location of gene expression of those two genes may be indicative of an immune-infiltrated area and a tumor area. With the help of spatial interpolation, visualization of these regions can be easier as will be showed next.
We can predict a smooth gene expression surface for each sample. In
spatialGE
, this prediction is achieved by using a spatial
interpolation method very popular in spatial statistics. The method
known as ‘kriging’ allows the estimation of gene expression values in
the un-sampled areas between spots, or cells/spots that were filtered
during data quality control. Estimating a transcriptomic surface via
kriging assumes that gene expression of two given points is correlated
to the spatial distance between them.
The function gene_interpolation
performs kriging of gene
expression via the fields
package. We specify that kriging
will be performed for two of the spatial samples
(samples=c('sample_094c', 'sample_117e')
):
tnbc <- gene_interpolation(tnbc,
genes=c('NDRG1', 'IGKC'),
samples=c('sample_094c', 'sample_117e'), cores=2)
Generating gene expression surfaces can be time consuming. The finer
resolution to which the surface is to be predicted (ngrid
argument), the longer the time it takes. The execution time also depends
on the number of spots/cells. The surfaces can be visualized using the
STplot_interpolation()
function:
kriges1 <- STplot_interpolation(tnbc,
genes=c('NDRG1', 'IGKC'),
samples='sample_094c')
#ggpubr::ggarrange(plotlist=kriges1, nrow=1, ncol=2, common.legend=TRUE, legend='bottom')
By looking at the transcriptomic surfaces of the two genes, it is easier to detect where “pockets” of high and low expression are located within the tissue. It is now more evident that expression of NDRG1 is higher in the lower right region of the tumor slice, as well in a smaller area to the left (potentially another hypoxic region).
If tissue images were uploaded to the STlist
(tissue_hires_image.png.gz
), it can be displayed for
comparison using the plot_image
function:
Detecting tissue compartments or niches is an important part of the
study of the tissue architecture. We can do this by applying
STclust
, a spatially-informed
clustering method implemented in spatialGE
. The
STclust
method uses weighted average matrices to capture
the transcriptomic differences among the cells/spots. As a first step in
STclust
, top variable genes are identified via Seurat’s
FindVariableFeatures
, and transcriptomic scaled distances
are calculated using only those genes. Next, scaled euclidean distances
are computed from the spatial coordinates of the spots/cells. The user
defines a weight (ws
) from 0 to 1, to apply to the physical
distances. The higher the weight, the less biologically meaningful the
clustering solution is, given that the clusters would only reflect the
physical distances between the spots/cells and less information on the
transcriptomic profiles will be used. After many tests, we have found
that weights between 0.025 - 0.25 are enough to capture the tissue
heterogeneity. By default, STclust
uses dynamic tree cuts
(Langfelder, Zhang, and Horvath 2007) to
define the number of clusters. But users can also test a series of k
values (ks
). For a more detailed description of the method,
please refer to the paper describing spatialGE
and
STclust
(Ospina et al.
2022).
We’ll try several weights to see it’s effect on the cluster assignments:
Results of clustering can be plotted via the STplot
function:
cluster_p <- STplot(tnbc,
samples='sample_094c',
ws=c(0, 0.025, 0.05, 0.2),
color_pal='highcontrast')
ggpubr::ggarrange(plotlist=cluster_p, nrow=2, ncol=2, legend='right')
We can see that from w=0
and w=0.05
, we can
only detect two tissue niches. At w=0.025
, we gain higher
resolution as one of the clusters is split and a third (‘yellow’)
cluster appears, potentially indicating that an different
transcriptional profile is present there. At w=0.2
, the
clusters seem too compact, indicating that the weight of spatial
information is probably too high. We have used here the dynamic tree
cuts (dtc
) to automatically select the number of clusters,
resulting in very coarse resolution tissue niches, however, users can
define their own range of k to be evaluated, allowing further detection
of tissue compartments.
To explore the relationship between a clinical (sample-level)
variable of interest and the level of gene expression spatial uniformity
within a sample, we can use the SThet()
function:
The SThet
function calculates the Moran’s I statsitic
(or Geary’s C) to measure the level of spatial heterogeneity in the
expression of the genes ( NDRG1, IGKC). The estimates
can be compared across samples using the function
compare_SThet()
p <- compare_SThet(tnbc,
samplemeta='overall_survival_days',
color_by='patient_id',
gene=c('NDRG1', 'IGKC'),
color_pal='muted',
ptsize=3)
p
The calculation of spatial statistics with SThet
and and
multi-sample comparison with compare_SThet
provides and
easy way to identify samples and genes exhibiting spatial patterns. The
previous figure shows that expression of NDRG1 is more
spatially uniform (lower Moran’s I) across the tissues in samples from
patients 2 and 8 compared to patents 9 and 14. The samples with higher
spatial uniformity in the expression of NDRG1 also tended to
have higher overall survival. Trends are less clear for IGKC,
however, it looks like samples where expression of IGKC was
spatially aggregated in “pockets” (higher Moran’s I) tended to have
lower survival (but note that patient 9 does not follow this trend). As
studies using spatial transcriptomics become larger, more samples will
provide more insightful patterns into the association of gene expression
spatial distribution and non-spatial traits associated with the
tissues.
The computed statistics are stored in the STlist for additional
analysis/plotting that the user may want to complete. The statistics
value can be accessed as a data frame using the
get_gene_meta
function:
SThet
can be
interpreted?
See the table below for a simplistic interpretation of the spatial
autocorrelation statistics calculated in spatialGE
:
To better understand how the Moran’s I and Geary’s C statistics
quantify spatial heterogeneity, tissue can be simulated using the
scSpatialSIM
and spatstat
packages. Also
tidyverse
and janitor
for some data
manipulation.
library('rpart')
library('spatstat')
library('scSpatialSIM')
library('tidyverse')
library('janitor')
The sc.SpatialSIM
package uses spatial point processes
to simulate the locations of spots/cells within a tissue. To facilitate
interpretability of the Moran’s I and Geary’s C statistics, first tissue
will be simulated, and then gene expression values will be
simulated.
The first step is to create a spatstat
observation
window:
wdw <- owin(xrange=c(0, 3), yrange=c(0, 3))
sim_visium <- CreateSimulationObject(sims=1, cell_types=1, window=wdw)
Next, the sc.SpatialSIM
is used to generate the spatial
point process (positions of the spots) within the observation window.
Then, assignments of spots to tissue domains are simulated and
visualized:
# Generate point process
# Then, simulate tissue compartments
set.seed(12345)
sim_visium <- GenerateSpatialPattern(sim_visium, gridded=TRUE, overwrite=TRUE) %>%
GenerateTissue(k=1, xmin=1, xmax=2, ymin=1, ymax=2, sdmin=1, sdmax=2, overwrite=TRUE)
PlotSimulation(sim_visium, which=1, what="tissue points") +
scale_shape_manual(values=c(19, 1))
Now, the simulated tissue domain assignments are extracted from the
SpatSimObj
object. Gene counts will be simulated in such a
way that:
# Extract tissue assignments from the `SpatSimObj` object
# Simulate expression of 'gene_1'
sim_visium_df <- sim_visium@`Spatial Files`[[1]] %>%
clean_names() %>%
mutate(gene_1=case_when(tissue_assignment == 'Tissue 1' ~ 1, TRUE ~ 0.1))
# Generate expression patter of "gene_2"
for(i in 1:nrow(sim_visium_df)){
if(i%%2 == 0){
sim_visium_df[i, 'gene_2'] = 1
} else{
sim_visium_df[i, 'gene_2'] = 0.1
}
}
# Generate expression of "gene_3"
# Set seed for resproducibility
set.seed(12345)
sim_visium_df[['gene_3']] <- sample(sim_visium_df[['gene_2']])
To visualize the simulated expression and run SThet
, an
STlist is created:
# Extract simulated expression data
sim_expr <- sim_visium_df %>%
add_column(libname=paste0('spot', seq(1:nrow(.)))) %>%
select(c('libname', 'gene_1', 'gene_2', 'gene_3')) %>%
column_to_rownames('libname') %>% t() %>%
as.data.frame() %>% rownames_to_column('genename')
# Extract simulated spot locations
sim_xy <- sim_visium_df %>%
add_column(libname=paste0('spot', seq(1:nrow(.)))) %>%
select(c('libname', 'y', 'x'))
# The `STlist` function can take a list of data frames
simulated <- STlist(rnacounts=list(sim_visium=sim_expr),
spotcoords=list(sim_visium=sim_xy), cores=2)
# Plot expression
ps <- STplot(simulated, genes=c('gene_1', 'gene_2', 'gene_3'), data_type='raw', color_pal='sunset', ptsize=1)
ggpubr::ggarrange(plotlist=ps, ncol=3)
As expected, simulated expression of “gene_1” is aggregated (i.e.,
clustered, red spots concentrated towards the center of the tissue).
This result was intently obtained by setting the simulation parameters
to xmin=2
, xmax=3
, ymin=2
,
ymax=3
in sc.SpatialSIM
.
# The `SThet` function requires normalized data
simulated <- transform_data(simulated, cores=2)
# Run `SThet`
simulated <- SThet(simulated, genes=c('gene_1', 'gene_2', 'gene_3'), method=c('moran', 'geary'))
# Extract results
get_gene_meta(simulated, sthet_only=TRUE)
The results for each of the metrics are as expected: “gene_1” shows aggregation/clustering, indicative of “hot-spot” expression. “gene_2” and “gene_3” show uniform and random expression respectively. Notice that none of these values are very far from the “random” expectation (Moran’s I = 0 and Geary’s C = 1). One of reasons for this result is the effect of library-size normalization. In addition, obtaining extreme values of I and C require extreme spatial patterns, which are unlikely to be observed even in the way data has been simulated here.
sessionInfo()
#> R version 4.5.0 (2025-04-11)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Sequoia 15.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] spatialGE_1.2.2
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.6.5 cli_3.6.5 knitr_1.50 rlang_1.1.6
#> [5] xfun_0.52 png_0.1-8 generics_0.1.3 jsonlite_2.0.0
#> [9] glue_1.8.0 htmltools_0.5.8.1 sass_0.4.10 scales_1.4.0
#> [13] rmarkdown_2.29 grid_4.5.0 evaluate_1.0.3 jquerylib_0.1.4
#> [17] tibble_3.2.1 fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
#> [21] compiler_4.5.0 dplyr_1.1.4 RColorBrewer_1.1-3 Rcpp_1.0.14
#> [25] pkgconfig_2.0.3 farver_2.1.2 lattice_0.22-7 digest_0.6.37
#> [29] R6_2.6.1 tidyselect_1.2.1 pillar_1.10.2 magrittr_2.0.3
#> [33] bslib_0.9.0 Matrix_1.7-3 tools_4.5.0 gtable_0.3.6
#> [37] ggpolypath_0.3.0 ggplot2_3.5.2 cachem_1.1.0