
NOTES ON INSTALLING/USING EXPOKIT

aka 

Rapid Exponentiation Of Large Sparse Matrices With The Expokit FORTRAN Library For Dummies."

by Nicholas J. Matzke, 2012-06-12
modified 2013-02-08
matzke@berkeley.edu









#######################################################
# 1. BACKGROUND: Matrix exponentiation for biologists working with phylogenetic models
#######################################################

1A. GOAL

Usually, our goal is to calculate the likelihood of some tip data, given a model:

Likelihood = Prob(data | model)

The "model" typically includes:

- a transition matrix specifying the probability of transitions between states
- the base frequencies of the states
- a phylogenetic tree

Often, our goal is to 

- find the model parameters that confer the highest likelihood on the data (this is maximum likelihood, ML, inference)

- or to sample the posterior distribution of the parameters (Bayesian inference, where we we calculate the posterior probability of the model given the data, using Bayes theorem and MCMC sampling techniques:

posterior prob(model | data) ~ likelihood * prior
...or...
P(model | data) ~ Likelihood * prior
...or...
P(model | data) ~ P(data | model) * P(model)

Note: P(model | data) means "probability of the model given the data"
Note: ~ means "proportional to"

REFERENCE: For a more thorough-but-still-quick-n'-dirty introduction, see Peter G. Foster's "The Idiot's Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed" at e.g.:

http://www.bioinf.org/molsys/data/idiots.pdf
http://www.bmnh.org/web_users/pf/idiots.pdf
http://www.ufscar.br/~evolucao/TGE/Lect09.pdf



1B. LIKELIHOOD OF DISCRETE DATA

When the data are discrete (e.g., nucleotide bases, geographical ranges in a discrete geography of presence/absence, etc.), the probability of changes between states is described by a instantaneous transition matrix, aka a Q matrix, which gives the relative probability of different changes.  Exponentiation of this matrix by t, the time, yields the probability of observing each state, given the starting state and the amount of time.

These probabilities can be calculated along every branch on a phylogeny and multiplied together to produce the likelihood of the tip data, given the tree and transition model.  This calculation is at the core of most phylogenetic analyses of discrete data (model comparison, ancestral state estimation, etc.).

By far the most common method for getting from a transition matrix (Q) to the state probabilities matrix (P) is exponentiation of the rate matrix (although there are some alternative approaches and do not involve exponentiation of the rate matrix, e.g. calculation through ODEs* or Bayesian path sampling). E.g.:

exp(Qt) = P(t)

The major difficulty with this approach is that for larger matrices, this calculation becomes slower.  For matrix of order 4 (a 4x4 matrix, e.g. for transitions between the 4 states of DNA, A, C, G, and T), the calculation is fast.  For a large matrix (e.g. a 20x20 amino acid matrix), it is somewhat slower, and for very large matrices (e.g. a 61x61 codon matrix, or a matrix of transitions between geographic ranges, the calculation can severely limit the complexity of the state space that can be allowed in the model.

No matter what software is used, this limit will be hit fairly quickly, but we might as well use the best software available for large matrices.  In addition, with "large but sparse" matrices -- matrices that contain a lot of zeros -- tricks exist that can speed up the calculation.


* ODEs = ordinary differential equation solving with e.g. the R package deSolve; see Diversitree




1C. EXPOKIT BACKGROUND

One oft-mentioned package for exponentiating matrices is EXPOKIT, a FORTRAN package. The original version is here:

http://www.maths.uq.edu.au/expokit/
http://www.maths.uq.edu.au/expokit/support.html

...and it has been coopted via wrappers in various other C++, C, java, etc. programs.  It comes with a Matlab wrapper as well, but that doesn't help us in R.


Compiling expokit to some of the example executables is easy, as long as you have the correct reference to a gfortran -77 (g77?) compiler in the Makefile:

==================================
cd /code/expokit/fortran/
./configure
pwd
make
./sample_g
==================================

However, this isn't very useful, since with insufficient documentation you can't really tell what the heck these functions are doing, and they are not very useable for practical purposes.


HOWEVER: Parts of expokit have been used in a few R and Python packages, however they weren't particularly usable for generic purposes (except expoRkit, put online after this was written).


1C(a). R packages that use EXPOKIT

Unfortunately, there is not (yet) much in R on use of EXPOKIT:
https://www.google.com/search?q=expokit&domains=r-project.org&sitesearch=r-project.org&btnG=Google+Search

A summary of what I found:

- Diversitree (Richard Fitzjohn) -- parts of expokit are included, but the R objects and dynamic functions employed to use them are quite complex and specific to the hidden functions (extractable in source code) that Diversitree uses.  I couldn't decipher it all in a short amount of time, and sometimes deSolve is used, not expokit.

- ctarma -- zexpm(A) function; this was helpful for getting a sense of how it could be done, but it uses only one expokit function.

cd /code/ctarma/
R CMD INSTALL /code/ctarma 


Also, according to:

Estimation of sparse multivariate diffusions
Alexander Sokol <alexander@math.ku.dk> & Niels Richard Hansen
Department of Mathematical Sciences
http://www.math.ku.dk/~alexander/2011_ISI_OUProcesses.pdf

http://www.math.ku.dk/~alexander/
http://www.math.ku.dk/~richard/

"The R package expm calculates these using Al-Mohy & Higham (2009), not taking advantage of sparsity."

"A new R interface to the Fortran library Expokit speeds up the computation of the matrix exponential for large sparse B-matrices."

This package eventually went onto CRAN as expoRkit (http://cran.r-project.org/web/packages/expoRkit/index.html) by Niels Richard Hansen, who gave advice at several key points during the development of rexpokit.






1C(b). PYTHON

Also some Python packages use EXPOKIT.  These were somewhat useful to play with, although I couldn't get any of them to actually function, due to ridiculous numbers of dependencies etc.  At least one used f2c to produce a C version of EXPOKIT.



i. Python package "Lineal": 

Solving the Master Equations in Python
http://sf.anu.edu.au/~mhk900/Python_Workshop/short.pdf

svn co http://lineal.developer.nicta.com.au/local/repos/trunk lineal




ii. Python package "Pyprop":
http://code.google.com/p/pyprop/

http://pyprop.googlecode.com/svn/trunk/core/krylov/expokit/expokitpropagator.cpp
http://www.koders.com/python/fidCA95B5A4B2FB77455A72B8A361CF684FFE48F4DC.aspx?s=fourier+transform

cd /code/pyprop/
git clone https://code.google.com/p/pyprop/





iii. Python package "qit"

Python Quantum Information Toolkit
http://qit.sourceforge.net/docs/html/utils.html

svn co https://qit.svn.sourceforge.net/svnroot/qit qit 
/code/qit 
cd /code/qit/python/trunk
python

==========
# ipython -pylab
#
# Actually, newer python:
ipython --pylab
from qit import *
==========
expv()

...I could access this one, but couldn't get expv to work






iv. EXPOKIT via f2c, can use with -lf2c flag

cd /Dropbox/_njm/__packages/phyRmcmc_setup/libf2c/libf2c/
./configure
make hadd
make check
make
sudo make install
cp libf2c.a /usr/lib/
cp f2c.h /usr/include/



NOTES: I attempted this, it compiles, but I get a memory error when I use it:
/Dropbox/_njm/__packages/rexpokit_setup/z_expokit_f2c_src_couldnt_get_it_to_work 
/Dropbox/_njm/__packages/rexpokit_setup/z_expokit_f2c_src_couldnt_get_it_to_work/expokit_f2c.R 






1C(c). Other packages appear to use EXPOKIT also.  Notably:

- BEAST
- Lagrange

Stephen Smith's C++ Lagrange code was the most useful, and I borrowed/modified from this.





1C(d). There are other ways to exponentiate a matrix in R; here are the methods I have discovered:

"msm::MatrixExp(Qmat*t, method='pade')",
"msm::MatrixExp(Qmat*t, method='series')",
"expm::expm(Qmat*t)",
"expm::expm(Qmat*t, method='Higham08.b')",
"expm::expm(Qmat*t, method='Ward77')",
"expm::expm(Qmat*t, method='R_Eigen')",
"expm::expm(Qmat*t, method='Taylor')",
"Matrix::expm(Qmat*t)",
"ape::matexpo(Qmat*t)",
"ctarma::zexpm(Qmat*t)"

In addition, some R packages exist that deal with sparse matrices -- e.g. SparseM and spam.  But, as far as I can tell, when they talk about matrix exponentiation, they are talking about exponentiation of each individual element, which of course is useless to us.











#######################################################
# 2. MAKING LAGRANGE CODE COMPILE INTO A *.so OBJECT
#######################################################
# FUNCTIONING VERSION IN:
# /Dropbox/_njm/__packages/rexpokit_setup/rexpokit.R

(*.so = shared object, i.e. a library that can be linked to from R)

=================================
cd /bioinformatics/lagrange_cpp/2012-04-23_lagrange/blackrim-lagrange-b7cf2d3/src
rm *.o
rm *.d
rm Makefile
./configure
make
=================================



NOTES: To get Lagrange's version of expokit to install/work properly,
some hints:


0. Dependencies:

You need to have the correct header files (*.h) in:
/usr/include

...and the correct .so or .a files (library files; these are derived from Makefile compiler statements which compile code files (*.f, *.c, *.cpp) and put them in:
/usr/lib

E.g. check for a library:

cd /usr/lib/
ls *blas*



a. If you need to change compilers, something like this:

rm /usr/bin/g++
ln -s /my_gcc/bin/g++46 /usr/bin/g++

rm /usr/local/bin/gfortran
ln -s gfortran-4.2 /usr/local/bin/gfortran

rm /usr/bin/gfortran
ln -s gfortran-4.2 /usr/bin/gfortran

My new compiler (gcc4.6) is in something like: /usr/my_gcc




b. Update Armadillo, and in the compile joining command, add the -larmadillo flag

==========================
cd /bioinformatics/lagrange_cpp/2012-04-23_lagrange/blackrim-lagrange-b7cf2d3/deps/armadillo-3.0.2
 ./configure
sudo make
sudo make install
==========================


c. Update LAPACK (use make blaslib & make, to get both -lblas and -llapack flags working); LAPACK is a huge collection of FORTRAN libraries for linear algebra operations

http://www.netlib.org/lapack/
===========================
cd /Users/nickm/Downloads/lapack-3.4.1

cp make.inc.example make.inc
sudo make blaslib && make
sudo make install
ls
ls *.a
cp *.a /usr/lib
===========================








#######################################################
# CALLING THE ROUTINES
#######################################################


1. PADM-type functions take the matrix as a list of doubles, i.e. as.double or as.numeric

2. DMEXP-type functions, though, deal with sparse matrices.  These have a lot of zeros,
and thus can be compressed into COO (coordinated list) format, which is here:

http://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29

which, in expokit, is input as 3 vectors (first two integer, the third double)

ia = row number
ja = column number
a = value of that cell in the matrix (skipping 0 cells)



