--- title: "A brief guide to CAFT" author: "Glen A Satten, Mo Li, Ni Zhao" date: "`r format(Sys.time(), '%d %b %Y')`" output: rmarkdown::html_vignette: number_sections: false toc: true vignette: > %\VignetteIndexEntry{A brief guide to CAFT} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) ``` ```{r, include=FALSE} library(CAFT) ``` # Overview ## Introduction Microbiome sequencing produces compositional data with with many taxa having zero counts. Many methods of analyzing microbiome data replace a zero count by a pseudocounts, which can produce poor results. The **CAFT** package implements a rank-based compositional analysis using an Accelerated Failure Time (AFT) modeling framework tailored for microbiome data. Zero read counts are treated as censored observations below a detection limit given by the library size (total counts for a sample). On the log-relative-abundance scale, the analysis becomes an implementation of the accelerated failure time model (AFT) in survival analysis. CAFT uses a rank-based method to fit the AFT which does not make model error distributional assumptions. CAFT supports taxon-level differential abundance testing for binary or multi-category phenotypes, with optional adjustment for continuous and categorical covariates. CAFT fits a log-linear model to values of $-\text{log}(\widehat{p}_{ij})$ where $\widehat{p}_{ij}$ is the relative abundance of taxon $j$ in sample $i$. When $\widehat{p}_{ij}=0$, it is considered to be left-censored at $\widehat{p}_{ij}=1/N_i$ where $N_i$ is the library size for the $i$th sample. Defining $\Delta_{ij}=I(\widehat{p}_{ij}>0)$ and defining the right-censored "failure times" $t_{ij}=-\text{log}(\widehat{p}_{ij})$ when $\Delta_{ij}=1$ and $t_{ij}=-\text{log}(1/N_i)=\text{log}(N_i)$ when $\Delta_{ij}=0$, CAFT fits the model $$-t_{ij} =\alpha_j + x_{i\cdot} \beta_j + \epsilon_{ij}$$ to relative abundances of each taxon $j$, where $x_{i \cdot}$ is the $i$th row of the data matrix $x$. This model corresponds to the `Accelerated Failure Time (AFT) Model' in survival analysis. CAFT uses a rank-based estimation procedure to fit this model that makes no assumptions on the distribution of $\epsilon_{ij}$. CAFT is designed to test hypotheses of the form $$\Gamma \beta = b$$ in one of two ways. The easiest way is to write $$x=(x_{test},x_{adj})$$ where $x_{test}$ has as its columns the variables we wish to test, while $x_{adj}$ has as its columns variables we wish to adjust for. If $x_{test}$ has $d$ columns and $x_{adj}$ has $m$ columns then CAFT uses the "default choice" $\Gamma=(I_{d \times d},0_{m \times m})$ where $I$ is the identity matrix and $0$ is a matrix with all zeroes. The "default choice" for $b$ is $b=\Gamma \widehat{\beta}^{(m)}$ where $\widehat{\beta}^{(m)}$ is the median (over taxa) of the $\widehat{\beta}_j$ values. This gives a $d$-degree-of-freedom test of $\beta_1=b_1, \beta_2=b_2,\cdots,\beta_d=b_d$. Alternatively, or for more complex tests, the user may specify their own values of $\Gamma$ and $b$ as inputs. ## Installation Install from the local file (the .tar.gz file), which you can download from https://github.com/mli171/CAFT. ```{r, eval=FALSE} pak::pak("CAFT_0.1.0.tar.gz") # local directory ``` You can also install the developer version from github via the following code. ```{r, eval=FALSE} remotes::install_github("mli171/CAFT", dependencies = TRUE) ``` ## Open the Vignette in R ```{r, eval=FALSE} browseVignettes("CAFT") ``` or ```{r, eval=FALSE} vignette("vignette", package = "CAFT") ``` # CAFT: Compositional Accelerated Failure Time Model for Microbiome Data Differential Abundance The main function in CAFT package is ```{r, eval=FALSE} caft( otu.table, x.test = NULL, x.adj = NULL, x = NULL, Gamma = NULL, b = NULL, filter.thresh = 0.05, fdr.nominal = 0.20, adjust.method = "BH", regularize = TRUE, test.method = "rank", n.cores = 1L, boot.B = 0L, boot.parallel = FALSE, boot.n.cores = 1L, boot.seed = NULL, boot.return.dist = FALSE, verbose = FALSE, return.mr.resid = FALSE ) ``` The input parameters are: **otu.table**, a matrix of read counts with samples as rows and taxa as columns; **x.test**, a data matrix with samples as rows and columns corresponding to the variables we want to test; **x.adj**, a data matrix with samples as rows and columns corresponding to the variables we want to control for. Inputs **x**, **Gamma** and **b** will be explained later. CAFT filters out taxa that occur in proportionately fewer than **filter.thresh** samples (the default is 5%), and **fdr.nominal** is the desired nominal false-discovery rate. **regularize** adds a small quadratic penalty to the objective function to avoid infinite parameter estimates when there is complete separation of the data; its contribution is an order of **nrow(otu.table)** smaller than the solution when it is not required and can therefore be ignored. The use of **CAFT** and the other input parameters for special cases are described below by example in the sections that follow, using two example data sets for illustration. # Example data sets illustration The **CAFT** package includes filtered data from two studies: `URT` (upper respiratory tract), originally from [1] and obtained from the R package **LOCOM** [2]; and `Colon` from the R package **curatedMetagenomicData** [3] (which can be installed using **Bioconductor**). These processed microbiome data sets can be loaded directly from **CAFT** package. The URT data are a subset of data from a study of the microbiome of the upper respiratory tract [1], also used in the LOCOM package [2]. Following [2], we used only left oropharyngeal observations, and followed the same filtering steps as in the LOCOM paper [2]. We excluded three samples from individuals who had used antibiotics within three months prior to sample collection, and subset both the OTU count and taxonomy tables accordingly. The filtered OTU table have 57 rows (samples) and 856 columns (taxa), and the average censoring rate (the mean proportion of zero counts across taxa) is 89.35%. The `Colon` data in **CAFT** are described in the R package **curatedMetagenomicData** [3]. For our analyses here, we retain only adult participants, stool specimens, and samples with disease status labeled as colorectal cancer (CRC), adenoma, or healthy. ## Binary outcome phenotypes example We use the URT microbiome data set to illustrate the simplest use of **CAFT** for a single binary outcome. We first load the data and apply the filters described previously. ```{r} data("URT") throat.otu.table <- URT$otu throat.meta <- URT$meta throat.otu.taxonomy <- URT$tax filter.out.sam = which(throat.meta$AntibioticUsePast3Months_TimeFromAntibioticUsage != "None") throat.otu.table.filter = throat.otu.table[-filter.out.sam,] throat.meta.filter = throat.meta[-filter.out.sam,] dim(throat.otu.table.filter) cens.prop = colMeans(throat.otu.table.filter == 0, na.rm = T) mean(cens.prop) ``` Setting `filter.thresh = 0.10` and `fdr.nominal = 0.10`, we conduct the taxa differential abundant analysis by entering $x_{test}$ and $x_{adj}$ as separate arguments, using ```{r} x.test = ifelse(throat.meta.filter$SmokingStatus == "NonSmoker", 0, 1) x.adj = ifelse(throat.meta.filter$Sex == "Male", 0, 1) res.CAFT.throat = caft(otu.table=throat.otu.table.filter, x.test=x.test, x.adj=x.adj, filter.thresh = 0.10, fdr.nominal = 0.10) ``` With the default multiple comparison adjustment method of "BH", we can determine the significantly differential abundant taxa at the nominal FDR level as shown below. ```{r} res.CAFT.throat$q.detected.otu ``` If we want to see the p- or q-values for these taxa, we can use the code (only first 5 shown) ```{r} res.CAFT.throat$p.otu[res.CAFT.throat$q.detected.otu][1:5] ``` Note these are in the order that the taxa are given in the otu table. Note also that discoveries at different FDR thresholds can be obtained by simply selecting values of res.CAFT.throat$q.otu < FDR threshold. Setting the grouping label of each sample, we can obtain a ggplot2-based boxplot showing the relative abundance of a specified OTU using the function `otuboxplot()` in our **CAFT** package. ```{r, fig.width=7, fig.height=5, out.width="80%", fig.align='center'} groups = interaction(x.test, x.adj, sep = "_", drop = TRUE) groups = factor(groups, labels = c("Non-Smoker\nMale", "Smoker\nMale", "Non-Smoker\nFemale", "Smoker\nFemale")) boxplot.otu = "2831" throat.otu.taxonomy[which(colnames(throat.otu.table.filter) == boxplot.otu),] otuboxplot(plot.otu=boxplot.otu, count.data=throat.otu.table.filter, plot.title = "Prevotellaceae Prevotella",groups=groups) ``` ## Multicategory outcome phenotypes example We use the Colon data to illustrate multivariate covariates such as multigroup categorial data. Here we also use the **phyloseq** package to extract the data: note that **phyloseq** is not required to run **CAFT**. - `age_category == "adult"` - `body_site == "stool"` - `disease %in% c("CRC", "healthy", "adenoma")` The filtered OTU table have 856 rows (samples) and 935 columns (taxa). For convenience of display in this vignette, we also set the names of each taxon to its column number. ```{r} data(Colon) count.tab = Colon$otu sample.tab = Colon$meta tax.tab = Colon$tax dim(count.tab) colnames(count.tab)=1:ncol(count.tab) ``` We also removed samples with missing age values (no samples were missing gender) and then filtered the dataset to retain only taxa present in more than one sample (i.e., taxa with positive counts in at least two samples), subsetting both the count and taxonomy tables accordingly. After this additional filtering, the OTU table has 849 rows (samples) and 708 columns (taxa), and the average censoring rate (the mean proportion of zero counts across taxa) is 87.20%. ```{r} pNA = which(is.na(sample.tab$age)) if(length(pNA) > 0){ count.tab = count.tab[-pNA, ] sample.tab = sample.tab[-pNA,] } # No samples have missing values for gender ## otu presence filtering p_otu = which(rowSums(t(count.tab) > 0) > 1) count.tab = count.tab[,p_otu] tax.tab = tax.tab[p_otu,] dim(count.tab) cens.prop = colMeans(count.tab == 0, na.rm = T) mean(cens.prop) ``` Then we put the covariates we wish to test into the columns of `x.test`, setting "healthy" as the reference level and including two indicators for "CRC" and "adenoma". We also put the covariates we wish to adjust for (the variables "Age" and "Gender") into the columns of `x.adj`. ```{r} Disease1 = Disease2 = rep(0, NROW(sample.tab)) # healthy Disease1[sample.tab$disease == "CRC"] = 1 Disease2[sample.tab$disease == "adenoma"] = 1 Age = as.numeric(sample.tab$age) Gender = as.numeric(factor(sample.tab$gender)) - 1 x.test = cbind(CRC = Disease1, adenoma = Disease2) x.adj = cbind(Age = Age, Gender = Gender) ``` We can now run **CAFT** exactly as in the univariate example: ```{r} res.CAFT.Colon = caft(otu.table = count.tab, x.test=x.test, x.adj=x.adj, filter.thresh = 0.10, fdr.nominal = 0.10) res.CAFT.Colon$Gamma res.CAFT.Colon$q.detected.otu ``` We see that 59 taxa were detected at the FDR=0.1 nominal level. If we want to see the p- or q-values for these taxa, we can use the code (only first 10 shown) ```{r} res.CAFT.Colon$p.otu[res.CAFT.Colon$q.detected.otu][1:10] ``` Again, note these are in the order the taxa appear in the otu table, not the size of the p-value. ### Separate estimation and testing workflow The CAFT analysis can also be carried out in two separate steps. The advantage of this approach is computational, and is explained after we describe the two steps. For the first step, we use **caft_estimate** to obtain the $\widehat{\beta}_j$s, the unrestricted estimates of parameters $\beta_j$. Because no hypothesis is specified, data should be entered using the full data matrix x. If data are stored as `x_test' and `x_adj', then simply combine them using `x=cbind(x_test,x_adj)'. ```{r} est.CAFT = caft_estimate(otu.table=count.tab, x=cbind(x.test, x.adj), filter.thresh=0.10, regularize=TRUE, n.cores=1L) print(est.CAFT) head(est.CAFT$beta.est) ``` For the second step, we specify a hypothesis we want to test, either by specifying values of `x.test` and `x.adj` or by specifying `Gamma` and `b`. Using these inputs, **caft_test** will calculate the restricted parameter estimates $\widehat{\beta}_j^{(r)}$ that satisfy the null hypothesis $\Gamma \widehat{\beta}_j^{(r)}=b$, and conduct the hypothesis test to determine whether to accept or reject the hypothesis $\Gamma \beta=b$. Since the unrestricted estimates are the same regardless of the null hypothesis we wish to test (as long as the variables in `x` are not changed), the multi-step analysis can be more efficient. In the example below, we test the hypothesis $\Gamma \beta = b$. Because we do not specify $b$, the default choice $b=\text{median}(\Gamma \widehat{\beta})$ is used, where $\widehat{\beta}$ are the unrestricted estimators obtained from **caft_estimate**. For example, we can test the significance of the two coefficients corresponding to disease category, controlling for age and gender using ```{r} Gamma=cbind(diag(2), matrix(0,nrow=2, ncol=2) ) Gamma ``` which we can see is the same as ```{r} res.CAFT.Colon$Gamma ``` from the previous section. Then, we run **caft_test** to finish the analysis: ```{r} test.CAFT = caft_test(est.CAFT, Gamma=Gamma, fdr.nominal=0.10, adjust.method="BH") print(test.CAFT) test.CAFT$betahat.median test.CAFT$b.null test.CAFT$q.detected.otu test.CAFT$p.otu[test.CAFT$q.detected.otu][1:10] ``` These results should be identical to those obtained in the previous section. The advantage of the separate analysis path is that we can now test for a difference in the effect of the disease categories CRC and adenoma without having to run **caft_estimate** again (i.e., using the same unrestricted null estimates), by executing the following steps: ```{r} Gamma=matrix( c(1,-1,0,0), nrow=1) test.CAFT.2 = caft_test(est.CAFT, Gamma=Gamma, fdr.nominal=0.20, adjust.method="BH") print(test.CAFT.2) test.CAFT.2$betahat.median test.CAFT.2$b.null test.CAFT.2$q.detected.otu length(test.CAFT.2$q.detected.otu) test.CAFT.2$p.otu[test.CAFT.2$q.detected.otu][1:10] ``` We see that there are 39 taxa at which there is evidence that $\beta_1=\beta_{\text{CRC}}$ is different from $\beta_2=\beta_{\text{adenoma}}$. Note that, if the `x.test` and `x.adj` input format is used in `caft_test`, the order of the variables must be the same as was used when running `caft_estimate`. The object `test.CAFT$betahat.median` gives the value of $\text{median}(\Gamma \widehat{\beta})$, default value of $b$, the median-based compositional reference value. The user may specify a different value of $b$ (e.g., for sensitivity analyses or to choose a different reference, such as the mode of $\Gamma \widehat{\beta}$ rather than the median), which is returned in `test.CAFT$b.null`. ### Parallel computation We can also implement multi-threads parallel computing to speed up the computation by setting the values of `n.cores` based on available computation resources. ```{r, eval=FALSE} res.CAFT = caft(otu.table=count.tab, x.test=x.test, x.adj=x.adj, n.cores=2) ``` ### Bootstrap implementation of CAFT CAFT also provides bootstrap version to calculate taxon-level p-values when either the number of samples or the number of taxa is small. The bootstrap calculation can be requested by setting `boot.B` >= 2. In practice, a larger number of bootstrap replicates, such as `boot.B = 1000`, is recommended, as the smallest non-zero $p$-value that can be obtained is 1/(boot.B+1). Note that the bootstrap version of CAFT can be very slow. ```{r, eval=FALSE} res.CAFT.boot = caft(otu.table=count.tab, x.test=x.test, x.adj=x.adj, boot.B=1000, boot.seed=1) res.CAFT.boot$q.boot.detected.otu res.CAFT.boot$q.boot.chi.detected.otu ``` Bootstrap calibration can also be parallelized over bootstrap replicates. The following code is not evaluated in this vignette because the optimal number of cores depends on the user's computing environment. ```{r, eval=FALSE} res.CAFT.boot = caft(otu.table=count.tab, x.test=x.test, x.adj=x.adj, boot.B=1000, boot.parallel=TRUE, boot.n.cores=2, boot.seed=1) ``` # How to cite CAFT If you use the **CAFT** package or the CAFT method in your work, please cite: Satten, G. A., Li, M., & Zhao, N. (2025). *CAFT: A Compositional Log-Linear Model for Microbiome Data with Zero Cells*. bioRxiv, 2025.11.26.690468. https://doi.org/10.1101/2025.11.26.690468 A BibTeX entry is: ```bibtex @article{satten2025caft, title = {CAFT: A Compositional Log-Linear Model for Microbiome Data with Zero Cells}, author = {Satten, Glen A. and Li, Mo and Zhao, Ni}, journal = {bioRxiv}, year = {2025}, doi = {10.1101/2025.11.26.690468}, note = {Preprint} } ``` # References [1] Charlson, E. S., Chen, J., Custers-Allen, R., *et al*. Disordered microbial communities in the upper respiratory tract of cigarette smokers. *PloS one*, 5(12), e15216. https://doi.org/10.1371/journal.pone.0015216 [2] Hu, Y., Satten, G. A., \& Hu, Y. J. LOCOM: A logistic regression model for testing differential abundance in compositional microbiome data with false discovery rate control. *Proceedings of the National Academy of Sciences*, 119(30), e2122788119 (2022). https://doi.org/10.1073/pnas.2122788119 [3] Pasolli, E., Schiffer, L., Manghi, P., *et al*. Accessible, curated metagenomic data through ExperimentHub. *Nature Methods*, 14(11), 1023–1024 (2017). https://doi.org/10.1038/nmeth.4468 [4] Satten, G. A., Li, M., & Zhao, N. CAFT: A Compositional Log-Linear Model for Microbiome Data with Zero Cells. *bioRxiv*, 2025.11.26.690468 (2025). https://doi.org/10.1101/2025.11.26.690468 ```{r} sessionInfo() ```