---
title: "Human Connectome Project Analysis"
author: "Shael Brown and Dr. Reza Farivar"
output: 
  rmarkdown::html_document:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Human Connectome Project Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: REFERENCES.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

# get original graphic parameters to be able to
# revert back at the end of the vignette
original_mfrow <- par()$mfrow
original_xpd <- par()$xpd
original_mar <- par()$mar
original_scipen <- options()$scipen
if(exists(".Random.seed", .GlobalEnv) == F)
{
  runif(1)
}
oldseed <- get(".Random.seed", .GlobalEnv)
oldRNGkind <- RNGkind()

# set some new parameters for viewing and reproducibility
options(scipen = 999)
set.seed(123)
par(mfrow = c(1,1))
```

# Abstract

In the **TDApplied** package vignette "TDApplied Theory and Practice" simulated data is used to provide examples of package function. In this vignette we will demonstrate that **TDApplied** can carry out meaningful analyses of real (i.e. non-simulated) data which other software packages cannot. We analyzed data from a very well-known neurological dataset, and identified topological features of neurological computation in single and multiple subjects which were correlated with (i) the task the subjects were performing while the data was collected, and (ii) the behavior (i.e. reaction time) of the subjects during the task -- these correlations suggest that the features are meaningful in the context of an neuroimaging analysis. Moreover, these features were identified, interpreted and analyzed with the `bootstrap_persistence_diagram`, `vr_graphs` and `diagram_kpca` functions, only the first of which has any implementation in another R package (**TDA**). **TDApplied** is therefore a powerful tool for applied topological analyses of data.

# Introduction

A popular technology for studying neural function is called *functional magnetic resonance imaging* (fMRI), in which oxygenated blood-flow across the brain is detected via magnetic resonance over multiple time points; fMRI is a proxy measurement of neural activity. Spatial activity patterns, i.e. vectors of measured values across space in a single time point, are modulated by performing tasks. The *study design* is the sequence of temporal *blocks* of performing these tasks and each task type is called a *condition*, and a study design can evoke meaningful information about neural processing related to tasks. Collections of spatial activity patterns (for example the spatial patterns evoked by a particular task over multiple time points) have previously been analyzed with topological and geometric techniques which capture their global structural features [@brain_manifold;@BrainOrganizationMapper;@fMRI_clustering_2;@resting_state_mapper]. However, these analyses are not designed to capture the spatially periodic features of fMRI data which we would expect to exist in abundance [@fMRI_noise1;@fMRI_noise2;@Denoising]. One persistent homology analysis of fMRI spatial pattern data found robust 0-dimensional topological features (i.e. clusters of time points with similar spatial patterns) whose persistence values negatively correlated with fluid intelligence [@topological_time_and_space]. Larger differences between spatial patterns at different time points generally corresponded to lower values of fluid intelligence, but higher-dimensional topological features such as loops were not considered in that analysis.

In this exploratory analysis, using the R package **TDApplied**, we utilized persistent homology to find, in one subject's fMRI data in an emotion task, task-related signal in the form of a spatial loop. We then linked the loop back to the subject's raw data to interpret what neurological features the loop represented. Finally we showed that topological features in 100 subject's emotion task fMRI data were correlated with behavior (i.e. the subject's response time to certain task blocks). While neuroimaging researchers would not consider a single loop within one subject "real" or "significant" without finding a similar loop across multiple subjects, the task of optimally matching loops between datasets is an open problem, so we leave the problem of finding group-level spatial loops to future work. For our analysis we will use data from the famous Human Connectome Project [@HCP] which contains extensively preprocessed neuroimaging data from roughly 1200 subjects. We focused on the HCP emotion task data, which alternated between two conditions - deciding which of two faces matched another target face in emotion, and deciding which of two shapes matched another target shape. We only analyzed the right-to-left phase encoding scan (this is a parameter of MRI imaging which determines the ordering of when image slices of the brain are obtained) as this was the phase encoding direction for the specific subject loop we analyzed. Also, all fMRI data was projected onto surface nodes -- points on a mesh of the brain's surface geometry -- which are more comparable across subjects than standard 3D volumes [@HCP]. The script used to perform the analysis can be found in the exec directory of **TDApplied**. Our analysis demonstrates the potential of using **TDApplied** for deriving interpretable and otherwise obscured insights from real datasets.

# A task-related spatial loop

For HCP subject 103111 we calculated a persistence diagram by analyzing the time-point by time-point correlation matrix $[\rho_{i,j}]$ of spatial patterns. A distance matrix was computed by the transformation $\rho_{i,j} \rightarrow \sqrt{2(1 - \rho_{i,j})}$ (see the appendix for details), and the bootstrap procedure [@bootstrap] found a single significant loop. We plotted the VR graph [@VR_graph] of the loop representative with $\epsilon$ being the birth value (with nodes labelled by their time points), and the VR graph of the whole dataset at this $\epsilon$ (with nodes in the representative cycle colored red):

```{r,echo=FALSE,message=FALSE}
library("TDApplied")
```

```{r,echo = F,out.width="45%", out.height="45%",fig.show="hold",fig.align = 'center',fig.cap="The VR graphs of (left) just the representative cycle time points, and (right) all time points, with both epsilon scales at the loop birth value."}
g <- readRDS("rips_cycle.rds")
plot_vr_graph(g,eps = as.numeric(names(g$graphs)),vertex_labels = T,title = "Rips graph of representative cycle") # clear loop
g <- readRDS("rips_all.rds")
cols <- readRDS("cols.rds")
l <- plot_vr_graph(g,eps = as.numeric(names(g$graphs)),component_of = 20,vertex_labels = F,cols = cols,return_layout = T,title = "Rips graph of all data") # two clear loops! One big, one dense
```

Two things are apparent from these plots -- the first is that the most persistent loop is the first task block (based on the time points in that block), and the second is that the dataset mainly forms two major loops in a figure eight. The secondary loop was the second most persistent loop in the diagram, and was comprised of (almost) all other time points. We found that physiology (i.e. breathing, measured in the HCP dataset as the pressure exerted by the subject's abdomen on the sensor belt, and averaged for each graph node) did not account for the two-loop structure, whereas task structure (measured by time-since-last-block -- i.e. the length of time since the most recent task block started) did. In these graphs, red represents high values, pink middle-high, white middle, light blue middle-low, blue low and black missing.

```{r,echo = F,out.width="45%", out.height="45%",fig.show="hold",fig.align = 'center',fig.cap="VR graph of all time points, colored by (left) mean respiration and (right) time-since-last-block."}
cols_respiratory <- readRDS("cols_respiratory.rds")
cols_time_since_last_block <- readRDS("cols_time_since_last_block.rds")
plot_vr_graph(g,eps = as.numeric(names(g$graphs)),cols = cols_respiratory,component_of = 20,vertex_labels = F,layout = l,title = "Respiratory")
plot_vr_graph(g,eps = as.numeric(names(g$graphs)),cols = cols_time_since_last_block,component_of = 20,vertex_labels = F,layout = l,title = "Time since last block")
```

The secondary loop seemed task-related when we colored its VR graph by time-since-last-block (i.e. the length of time since the most recent task block started), with clusters of similarly-colored time points (i.e. nodes) and smooth gradients at various points around the loop:

```{r,echo = F,out.width="45%", out.height="45%",fig.show="hold",fig.align = 'center',fig.cap = "VR graph of the secondary loop, colored by time-since-last-block."}
secondary_loop <- readRDS("rips_secondary.rds")
plot_vr_graph(secondary_loop,eps = as.numeric(names(secondary_loop$graphs)),component_of = 99,vertex_labels = F,cols = cols_time_since_last_block[as.numeric(secondary_loop$vertices)],title = "Rips graph of secondary loop")
```

# Linking the secondary loop to raw data

We next linked the secondary loop back to the fMRI data from which it came. From the 2D layout of its VR graph we computed a $\theta$ variable (angle around the loop, between $-\pi$ and $\pi$) and an $r$ variable (distance to the origin). Using linear models of the form $A_i = \beta_{1}\cos(\theta) + \beta_{2}\sin(\theta) + \beta_3$ and $A_i = \beta_{4}r + \beta_5$ for the activity $A$ of surface node $i$, we found that out of the total 91282 surface nodes, the activity of 1475 had a significant relationship with either $\cos(\theta)$ or $\sin(\theta)$, and the activity of 53 had a significant relationship with $r$ (significance thresholding was done at the Bonferroni level of approximately $0.05/(2*91282)=2.74*10^{-7}$). These two sets of nodes only shared one common node. 

Here are surface plots of the nodes for the left hemisphere, generated using python, responding to $\theta$ and $r$:

```{r,echo = F,out.width="45%", out.height="45%",fig.show="hold",fig.align = 'center',fig.cap="Surface nodes whose activity was significantly correlated with (left) theta and (right) r."}
knitr::include_graphics(c("theta_nodes.png"))
knitr::include_graphics(c("r_nodes.png"))
```

A large cluster of surface nodes responding to $\theta$ was found in the left hemisphere (and not in the right hemisphere), which appeared to belong to the brain region Pol1 in the insular cortex. Interestingly, Pol1 was not found to show significant differences in activity between the faces and shapes conditions [@HCP_multimodal_parcellation], despite the implication of the insular cortex in emotional processing [@insula]. This finding suggests that spatial loops of fMRI data may contain complementary information to typical task-based analyses of fMRI data. 

A t-test then found a significant difference in mean $r$ values between shape and face blocks:

```{r,echo = F,out.width="45%", out.height="45%",fig.show="hold",fig.align = 'center',fig.cap="Boxplot of r values in shape and face blocks."}
r_shape <- readRDS("r_shape.rds")
r_face <- readRDS("r_face.rds")
graphics::boxplot(r_shape,r_face,main = "r across conditions, p < 0.01",xlab = "Condition",ylab = "r",names = c("Shape","Face"))
```

# Linking topology to behavior

The main goal of using a summary statistic of neurological data, such as persistence diagrams of spatial activity patterns, is to correlate the statistic with behavior. In the HCP dataset, mean reaction time (in milliseconds) was recorded for all subjects and all tasks, so we checked whether topology was predictive of reaction time for the emotion task. We selected 100 subjects at random, computed their emotion right-to-left phase encoding data persistence diagrams, and computed a one dimensional kernel PCA embedding. We then plotted the relationship between the first embedding dimension and reaction time to shape blocks, which had a correlation of about 0.31:

```{r,echo = F,out.width="45%", out.height="45%",fig.show="hold",fig.align = 'center',fig.cap="A 2D scatterplot, whose x-axis is the 1D PCA embedding coordinates of the 100 subject's emotion persistence diagrams and whose y-axis is the 100 subject's mean response times in the shape blocks trials."}
emb <- readRDS("emb.rds")
shape_rt <- readRDS("shape_rt.rds")
plot(emb,shape_rt,xlab = "Embedding dim 1",ylab = "Mean Shape Block Reaction Time (ms)",main = "Topology-Behavior Relationship")
l <- lm(formula = shape_rt ~ emb)
coefficients <- as.numeric(coef(l))
graphics::lines(x = c(min(emb),max(emb)),y = coefficients[[1]] + coefficients[[2]]*c(min(emb),max(emb)))
cor_val <- round(cor(emb,shape_rt),digits = 2)
graphics::text(x = 0,y = 1400,paste0("Correlation = ",cor_val))
```

We then computed statistical significance by Fisher-transforming the sample correlation of 0.31 to obtain a test-statistic of about 0.32, and compared this statistic to a normal null distribution with mean 0 and standard error $1/\sqrt{100-3} \approx 0.1$ [@fisher]. The resulting p-value, for a two-sided null hypothesis, would have been less than 0.002, suggesting a significant positive correlation between the topology of neural activity spatial patterns and subject behavior.

# Conclusion

We analyzed loops of spatial pattern correlations in HCP emotion fMRI data using **TDApplied**, and were able to visualize and interpret significant loops of an individual subject, relating them to the study design. Such a result implies a complex structure for the representation of the conditions, a complexity that cannot be captured by linear models assuming clustered distributions. Moreover, our topology embedding, enabled with this software and approach, allowed us to discover a relationship between fMRI patterns and reaction time -- effectively, capturing a complex brain-behavior relationship. These results point towards the potential usage of 1-dimensional topology in finding task and/or behavior relationships with fMRI or other neuroimaging modalities, again implying that the shape of neuroimaging data may not be completely captured by traditional analysis methods. Moreover, our analysis hinged on the usage of several functions which are (largely) unique to **TDApplied**. These results show that **TDApplied** is the most flexible and useful tool for carrying out meaningful analyses of data.

# Appendix: Converting Correlations to Distances

In our analysis of HCP data we converted correlation values to correlation distance values using the formula $\rho \rightarrow \sqrt{2*(1-\rho)}$. To see why this transformation does produce distance values, let $X$ and $Y$ be two vectors (of the same length $n$) with 0 mean and unit variance. Then $\rho(X,Y) = \frac{Cov(E,Y)}{\sigma_{x}\sigma_{y}} = \frac{E[(X-0)(Y-0)]}{(1)(1)} = E[XY] = \frac{\sum_{i = 1}^{n}X_i Y_i}{n}$. The Euclidean distance of $X$ and $Y$ is therefore $d(X,Y) = \sqrt{\sum_{i=1}^{n}(X_i-Y_i)^2} = \sqrt{\sum_{i=1}^{n}X_i^2 + \sum_{i=1}^{n}Y_i^2 -2\sum_{i=1}^{n}X_i Y_i} = \sqrt{n + n - 2\rho(X,Y)} = \sqrt{2n(1-\rho(X,Y))} \propto \sqrt{2(1-\rho(X,Y))}$. Therefore, since a scaled distance metric (by a positive number like $\sqrt{n}$) is also a distance metric (this is very easy to verify), the transformation $\rho \rightarrow \sqrt{2*(1-\rho)}$ indeed gives distance values. Note that in [@topological_time_and_space] a proportional transformation was used $\rho \rightarrow \sqrt{1-\rho} \propto \sqrt{2(1-\rho)}$ and therefore our results are consistent with those found in that work.

```{r,echo = F}
# reset parameters
par(mfrow = original_mfrow,xpd = original_xpd,mar = original_mar)
options(scipen = original_scipen)
do.call("RNGkind",as.list(oldRNGkind))
assign(".Random.seed", oldseed, .GlobalEnv)
```

## References