A number of R packages exist for computing distances between pairs of persistence diagrams, including TDA (Fasy et al. 2021), rgudhi (Stamm 2023) and TDApplied. Comparing the speed of these calculations was performed in the “Benchmarking and Speed” package vignette, but here we treat the more basic question of “are these distance calculations the same across packages?” Through examples we show that the answer is unfortunately no, but through exploration we attempt to reconcile these differences and provide guidelines for using the different packages. Moreover, we include a proof of algorithm correctness for TDApplied’s distance function and in the following section we describe why we do not compare TDApplied’s distance function against that of the TDAstats package (Wadhwa et al. 2019).
phom.dist
FunctionThe TDAstats package has a (wasserstein) distance
function, phom.dist
, which is described in its package
documentation as being “not meaningful without a null distribution”. But
what does this mean? We examined the source code for
TDAstats, i.e. its R/inference.R file, and found that
the internal function wass_workhorse
on lines 77-103 is the
actual distance calculation. In this function, the persistence values
(i.e. death - birth) for topological features of two diagrams (and their
diagonal projections appended to the opposite diagram) are ordered from
largest to smallest. The persistence values are then paired between the
two diagrams according to their orderings, and the absolute differences
of these ordered persistence values are exponentiated and summed. This
algorithm is not guaranteed to produce an optimal matching of the
topological features (in the sense of the usual wasserstein cost
function), and is missing the final exponentiation (in the case of the
2-wasserstein metric we take the square root of the sum of squared
distances). These differences to the standard wasserstein formulation
were clear to the authors of TDAstats, and while their
phom.dist
function is not exactly a wasserstein metric it
is still a comparison statistic of two persistence diagrams which can be
studied with inference techniques. Nevertheless, these differences are
the reason why in TDApplied documentation we refer to
TDAstats’ distance calculation as being “non-standard”
and why comparisons (of calculations and benchmarking) are not made
against it.
In order to compare the distance calculations of TDA, rgudhi and TDApplied, we will specify three simple diagrams (the same D1, D2 and D3 from the package vignette “TDApplied Theory and Practice”) and consider distances between each of the three possible pairs. The three diagrams are as follows.
D1’s point is \((2,3)\), D2’s points are \(\{(2,3.3),(0,0.5)\}\), and D3’s point is \((0,0.5)\).
In order to compare the distance calculations of the packages we need to have the correct values of the distances. Using the infinity-norm distance for calculating matchings in the bottleneck and wasserstein distances as in (Kerber, Morozov, and Nigmetov 2017; Edelsbrunner and Harer 2010), we get the following optimal matchings and distance values:
For the Fisher information metric (Le and Yamada 2018) we will use the parameter \(\sigma = 1\) for simplicity. We must first calculate vectors \(\rho_1\) and \(\rho_2\), normalize them by dividing by the sum of their respective elements, then compute \(\mbox{cos}^{-1}(\sqrt{\rho_1} \cdot \sqrt{\rho_2})\) (where \(\mbox{cos}^{-1}\) is the arccos function). See (Le and Yamada 2018) for computational details.
\[\begin{align} \rho_1 = & \{\mbox{exp}(0) + \mbox{exp}(-0.545/2) + \mbox{exp}(-10.625/2),\\ & \mbox{exp}(-0.545/2) + \mbox{exp}(0) + \mbox{exp}(-11.52/2),\\ & \mbox{exp}(-10.625/2) + \mbox{exp}(-11.52/2) + \mbox{exp}(0),\\ & \mbox{exp}(-0.09/2) + \mbox{exp}(-0.845/2) + \mbox{exp}(-12.365/2),\\ & \mbox{exp}(-10.25/2) + \mbox{exp}(-11.645/2) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-0.5/2) + \mbox{exp}(-0.045/2) + \mbox{exp}(-10.125/2)\}/(2\pi)\end{align}\]
\[\begin{align} \rho_2 = & \{\mbox{exp}(-0.09/2) + \mbox{exp}(-10.25/2) + \mbox{exp}(-0.5/2),\\ & \mbox{exp}(-0.845/2) + \mbox{exp}(-11.645/2) + \mbox{exp}(-0.045/2),\\ & \mbox{exp}(-12.365/2) + \mbox{exp}(-0.125/2) + \mbox{exp}(-10.125/2),\\ & \mbox{exp}(0) + \mbox{exp}(-11.84/2) + \mbox{exp}(-0.89/2),\\ & \mbox{exp}(-11.84/2) + \mbox{exp}(0) + \mbox{exp}(-10.25/2),\\ & \mbox{exp}(-0.89/2) + \mbox{exp}(-10.25/2) + \mbox{exp}(0)\}/(2\pi)\end{align}\]
Therefore the arccos of the dot product of the square root of the sum-normalized vectors is approximately 0.02354624.
\[\begin{align} \rho_1 = & \{\mbox{exp}(0) + \mbox{exp}(-10.625/2),\\ & \mbox{exp}(-10.625/2) + \mbox{exp}(0),\\ & \mbox{exp}(-10.25/2) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-0.5/2) + \mbox{exp}(-10.125/2)\}/(2\pi)\end{align}\]
\[\begin{align} \rho_2 = & \{\mbox{exp}(-10.25/2) + \mbox{exp}(-0.5/2),\\ & \mbox{exp}(-0.125/2) + \mbox{exp}(-10.125/2),\\ & \mbox{exp}(0) + \mbox{exp}(-10.25/2),\\ & \mbox{exp}(-10.25/2) + \mbox{exp}(0)\}/(2\pi)\end{align}\]
Therefore the arccos of the dot product of the square root of the sum-normalized vectors is approximately 0.08821907.
\[\begin{align} \rho_1 = & \{\mbox{exp}(0) + \mbox{exp}(-11.84/2) + \mbox{exp}(-12.365/2),\\ & \mbox{exp}(-11.84/2) + \mbox{exp}(0) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-12.365/2) + \mbox{exp}(-0.125/2) + \mbox{exp}(0),\\ & \mbox{exp}(-11.84/2) + \mbox{exp}(0) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-0.845/2) + \mbox{exp}(-11.645/2) + \mbox{exp}(-11.52/2),\\ & \mbox{exp}(-12.365/2) + \mbox{exp}(-0.125/2) + \mbox{exp}(0)\}/\sqrt{2\pi}\end{align}\]
\[\begin{align} \rho_2 = & \{\mbox{exp}(-11.84/2) + \mbox{exp}(-0.845/2) + \mbox{exp}(-12.365/2),\\ & \mbox{exp}(0) + \mbox{exp}(-11.645/2) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-0.125/2) + \mbox{exp}(-11.52/2) + \mbox{exp}(0),\\ & \mbox{exp}(0) + \mbox{exp}(-11.645/2) + \mbox{exp}(-0.125/2),\\ & \mbox{exp}(-11.645/2) + \mbox{exp}(0) + \mbox{exp}(-11.52/2),\\ & \mbox{exp}(-0.125/2) + \mbox{exp}(-11.52/2) + \mbox{exp}(0)\}/\sqrt{2\pi}\end{align}\]
Therefore the arccos of the dot product of the square root of the sum-normalized vectors is approximately 0.08741134.
While all three of TDA, rgudhi and TDApplied provide functions for the bottleneck and wasserstein calculations, only TDApplied and rgudhi have the functionality to calculate the Fisher information metric. We calculated the results of each distance calculation, for each pair of \(D_i\) and \(D_j\) (\(i \neq j\)) and each package, and stored the results in tables according to the three distance metrics under comparison:
Bottleneck comparison:
## pair ground_truth TDApplied TDA rgudhi
## 1 D1 and D2 0.30 0.30 0.30 0.30
## 2 D1 and D3 0.50 0.50 0.50 0.50
## 3 D2 and D3 0.65 0.65 0.65 0.65
Wasserstein comparison:
## pair ground_truth TDApplied TDA rgudhi
## 1 D1 and D2 0.3905125 0.3905125 0.55 0.55
## 2 D1 and D3 0.5590170 0.5590170 0.75 0.75
## 3 D2 and D3 0.6500000 0.6500000 0.65 0.65
Fisher information metric comparison:
## pair ground_truth TDApplied rgudhi
## 1 D1 and D2 0.02354624 0.02354624 0.02354624
## 2 D1 and D3 0.08821907 0.08821907 0.08821907
## 3 D2 and D3 0.08741134 0.08741134 0.08741134
All three packages agreed on the value of the three bottleneck calculations, and both TDApplied and rgudhi agreed on all Fisher information metric calculations. However, while TDA and TDAstats agreed on the three wasserstein calculations, two of these differed from TDApplied’s output. This occurred because TDA and rgudhi use a slightly different formula for computing wasserstein distances – where the distance between matched pairs of persistence diagram points is Euclidean rather than an infinity-norm distance. This is a perfectly suitable distance metric and matches the formula in (Robinson and Turner 2017). However, it is different from the published formulas in important works like (Kerber, Morozov, and Nigmetov 2017) and (Edelsbrunner and Harer 2010) (which are the formulas that TDApplied implements).
We state a quick last note of comparison between the kernel
calculations in TDApplied and rgudhi.
Even though their Fisher information metric calculations appear to be
the same (perhaps up to small differences in algorithm precision), it
turns out that rgudhi and TDApplied
return drastically different kernel values. For example, the Fisher
information metric between D2 and D3 was (correctly) stated as
0.08741134 for both packages. It follows that when \(t = 2\) the persistence Fisher kernel value
should be \(\mbox{exp}(-2*0.08741134) \approx
0.8396059\), which is the exact value of the
TDApplied calculation
diagram_kernel(D2,D3,dim = 0,t = 2)
. However, the code
gudhi_kern <- rgudhi::PersistenceFisherKernel$new(bandwidth = 0.5)
gudhi_kern$apply(D2[,2:3],D3[,2:3])
returns the value 0.7550683 (note that the bandwidth
parameter is \(1/t\)). Even more
perplexing is that the following code
gudhi_kern <- rgudhi::PersistenceFisherKernel$new(bandwidth_fisher = 0.5)
gudhi_kern$apply(D2[,2:3],D3[,2:3])
returns the value 1.8326, which should not be possible for the function \(\mbox{exp}(-t*d_{FIM})\) as \(t\) and \(d_{FIM}\) are always positive and non-negative respectively (so the maximum value should be 1). Unfortunately we were not able to identify the source of this confusion by examining the source code of rgudhi (and GUDHI), but it is still possible to calculate correct kernel values as follows:
d <- rgudhi::PersistenceFisherDistance$new() # sigma = 1
t <- 2 # or whatever desired parameter
exp(-t*d$apply(D2[,2:3],D3[,2:3]))
Since distance calculations are much faster with rgudhi than with TDApplied (see the package vignette “Benchmarking and Speedups”), and because distance and Gram matrices can be precomputed and reused across multiple analyses (again see “Benchmarking and Speedups”) it would be desirable to have a (correct) rgudhi Gram matrix function. An example of such a function is the following:
# create rgudhi distance object
# sigma = 0.01
gudhi_dist <- rgudhi::PersistenceFisherDistance$new(bandwidth = 0.01,n_jobs = 1L)
# create list of diagrams, only birth and death values in dimension 1
g <- lapply(X = 1:10,FUN = function(X){
df <- diagram_to_df(TDA::ripsDiag(X = TDA::circleUnif(n = 50),
maxdimension = 1,maxscale = 2))
return(df[which(df[,1] == 1),2:3])
})
# distance matrix function
# diagrams is a list of diagrams (only birth and death columns)
# gudhi_dist is the rgudhi distance object
gudhi_distance_matrix <- function(diagrams,gudhi_dist){
# get number of rows of each diagram since rgudhi can't calculate
# distances with empty diagrams
rows <- unlist(lapply(diagrams,FUN = nrow))
inds <- which(rows > 0)
# if inds is empty then return 0 matrix
if(length(inds) == 0)
{
return(matrix(data = 0,nrow = length(diagrams),ncol = length(diagrams)))
}
# calculate distance matrix for non-zero-row diagrams
d_non_zero <- gudhi_dist$fit_transform(diagrams[inds])
# fix diagonal which can sometimes have non-zero entries
diag(d_non_zero) <- rep(0,nrow(d_non_zero))
# symmetrize (necessary due to numeric rounding issues)
d_non_zero[which(upper.tri(d_non_zero),arr.ind = T)[,c("col","row")]] <-
d_non_zero[upper.tri(d_non_zero)]
# if all diagrams had at least one row, return
if(length(inds) == length(diagrams))
{
return(d_non_zero)
}
# create empty distance matrix d
d <- matrix(data = 0,nrow = length(diagrams),ncol = length(diagrams))
# update entries of d
e <- as.matrix(expand.grid(inds,inds))
e <- e[which(e[,1] < e[,2]),]
if(!is.matrix(e))
{
e <- t(as.matrix(e))
}
d[e] <- d_non_zero[which(upper.tri(d_non_zero),arr.ind = T)]
e <- e[,2:1]
if(!is.matrix(e))
{
e <- t(as.matrix(e))
}
d[e] <- d_non_zero[which(upper.tri(d_non_zero),arr.ind = T)]
return(d)
}
# Gram matrix function
# diagrams is a list of diagrams (only birth and death columns)
# t is the t parameter like in diagram_kernel
# gudhi_dist is the rgudhi distance object
gudhi_gram_matrix <- function(diagrams,t,gudhi_dist){
# calculate distance matrix
D <- gudhi_distance_matrix(diagrams = diagrams,gudhi_dist = gudhi_dist)
return(exp(-t*D))
}
# calculate the Gram matrix
G <- gudhi_gram_matrix(diagrams = g,t = 1,gudhi_dist = gudhi_dist)
Another issue with rgudhi calculations can be reproduced as follows:
# create data frame
D = data.frame(dimension = c(0,0),birth = c(0,0),death = c(1.089866,1.098640))
# create rgudhi distance object
gudhi_dist <- rgudhi::PersistenceFisherDistance$new(bandwidth = 0.01,n_jobs = 1L)
# compute distance
gudhi_dist$apply(D[,2:3],D[,2:3])
## [1] 0.00000001490116
This calculation returns a non-zero number for the distance value of
D with itself. This is likely due to numerical rounding issues, and its
discovery in rgudhi has led to an update in
TDApplied’s diagram_distance
function
which now returns 0 for this calculation:
diagram_distance(D,D,distance = "fisher",sigma = 0.001)
## [1] 0
Even though the Hungarian algorithm can be used to solve the linear sum assignment problem (LSAP) (Hornik 2005), finding a minimal cost matching of two sets of points, some work needs to be done to properly apply the algorithm to calculate wasserstein or bottleneck distances. For an example we will consider the bottleneck distance, although the argument still holds with a simple change for the wasserstein distance (squaring matrix entries). Let Diag1 and Diag2 be two diagrams, with \(n_1\) and \(n_2\) points respectively, whose projections onto the diagonal are denoted by \(\pi(\mbox{Diag1})\) and \(\pi(\mbox{Diag2})\) respectively. Then take \(M\) to be the following \((n_1 + n_2) \times (n_1 + n_2)\) matrix:
\[M = \left[ \begin{array}{c|c} d_{\infty}(\mbox{Diag1},\mbox{Diag2}) & d_{\infty}(\mbox{Diag1},\pi(\mbox{Diag2})) \\ \hline d_{\infty}(\pi(\mbox{Diag1}),\mbox{Diag2}) & 0 \end{array} \right]\]
Each row corresponds to the \(n_1\) points in Diag1 followed by the \(n_2\) projections \(\pi(\mbox{Diag2})\), and vice versa for the columns. Then we claim that the solution of the LSAP problem on \(M\) has the same cost as the bottleneck distance value between Diag1 and Diag2.
Firstly, we claim that a solution to the LSAP problem on \(M\) has a cost which is no less than the distance value. Let the distance value be \(s\). Now suppose, to reach a contradiction, that there existed a lower-cost matching for the LSAP problem for \(M\), \(m\),of cost \(s' < s\). Since projection points are matched together with cost 0 in \(M\), let \(m'\) contain all the matches in \(m\) which are not between two projection points. Then \(m'\) would be matching for the distance calculation which has lower cost than \(s\), contradicting the minimality of \(s\). Therefore, a solution to the LSAP problem on \(M\) has a cost which is no less than the distance value.
Next, we claim that a solution to the LSAP problem on \(M\) has a cost which is no greater than the distance value. Now suppose, to reach a contradiction, that the solution to the LSAP on problem \(M\) had cost \(s\), which was larger than the real distance value, \(s'\). Let \(m'\) be a matching for the distance calculation of \(s'\). Then since each point in either diagram is either paired with a point in the other diagram or its own diagonal projection, there must be an equal number of unpaired points in both diagrams in \(m'\). Therefore, we can augment \(m'\) to a matching \(m\) on \(M\) in which the unpaired diagonal points are arbitrarily paired up with cost 0. Thus, \(m\) has cost \(s' < s\), contradicting the minimality of \(s\). Therefore, a solution to the LSAP problem on \(M\) has a cost which is no greater than the distance value.
Therefore, a solution to the LSAP problem for \(M\) has a cost which is both greater than and less than the bottleneck distance value, and hence the two values must be equal.