A new pipeline to explore structural similarity across metabolite modules

1 Introduction

MetChem is an R package used to perform structural and functional analysis of metabolites using a simple pipeline.

2 Installation

2.1 Installation via CRAN

The R package MetChem (current version 0.2) is part of the Comprehensive R Archive Network (CRAN)^[https://cran.r-project.org/]. The simplest way to install the package is to enter the following command into your R session: install.packages("MetChem"). We suggest installing the following R packages: pheatmap and RColorBrewer to enable data visualization in heatmaps, readxl for the data reading of Excel files, and impute for the imputation of missing data.

# To install the pheatmap package
install.packages("pheatmap")

# To install the RColorBrewer package
install.packages("RColorBrewer")

# To install the readxl package
install.packages("readxl")

# To install the impute package
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("impute")

2.2 Manual installation from source

The package can be manually installed from source by opening the package’s page in CRAN and then proceeding as follows:

Download MetChem.tar.gz and save it to your hard disk
Open a shell/terminal/command prompt window and change to the desired directory for installation of MetChem.tar.gz. Enter R CMD INSTALL MetChem.tar.gz to install the package. Note that this may require additional software on some platforms. Windows requires Rtools^[https://developer.apple.com/xcode/] to be installed and to be available in the default search path (environment variable PATH). MAC OS X requires installation of Xcode developers and command line tools.

2.3 Compatibility issues

The package downloadable from CRAN was built using R version, R.4.2.1. The package should work without major issues on R versions > 3.5.0 and KODAMA package >= 2.3.

3 Getting Started

The R package MetChem depends on the R package rcdk, which is partially implemented in Java. In Windows environment, we reported an issue related to the uploading of rcdk package. It can be solved by running the following code before loading the package MetChem.

replacement <- function(category = "LC_ALL") {
  if (identical(category, "LC_MESSAGES"))
    return("")
  category <- match(category, .LC.categories)
  if (is.na(category)) 
    stop("invalid 'category' argument")
  .Internal(Sys.getlocale(category))

}
base <- asNamespace("base")
environment(replacement) <- base
unlockBinding("Sys.getlocale", base)
assign("Sys.getlocale", replacement, envir = base)
lockBinding("Sys.getlocale", base)

To load the package, enter the following command in your R session:

library("MetChem")

If this command terminates without any error messages, the package is installed successfully. The MetChem package is now ready for use.

The package includes both a user manual (this document) and a reference manual (help pages for each function). To view the user manual, enter vignette("MetChem"). Help pages can be viewed using the command help(package="MetChem").

\newpage

4 Example 1: murine prostate tissues metabolic profile

Here, we introduce an example for the analysis of metabolic structural information using the MetChem package. For this, we used a data set of mass spectrometry dataset obtained from murine prostate tissue samples reported by Labbé and Zadra \emph{et al.} (2019) (Supplementary Data 2). The metabolic data are obtained from ventral prostate tissues of mice that overexpress a human c-MYC transgene (MYC) in the prostate epithelium and wild-type littermates (WT). Mice were fed either a high-fat diet (HFD; 60% kcal from fat; lard—rich in saturated fat) or a control diet (CTD; 10% kcal from fat). The data set includes six replicates for each group (\emph{i.e.}, WT_CTD, MYC_CTD, WT_HFD, and MYC_HFD). To begin, download the data from the Labbé and Zadra (2019) study. Download it and save it to your hard disk. Metabolomic data is extracted using the instructions below. Data is then imputed using a k-nearest neighbour (kNN) algorithm using the function impute as described in the publication.

require("readxl")
require("impute")
d=as.data.frame(read_excel("41467_2019_12298_MOESM5_ESM.xlsx",skip = 3))
d=d[1:414,]
rownames(d)=d[,"Metabolite"]
met=d[,4:27]
label=rep(c("WT_CTD","MYC_CTD","WT_HFD","MYC_HFD"),each=6)
label_MYC=rep(c("WT","MYC","WT","MYC"),each=6)
colnames(met)=paste(label,1:6)
met=data.matrix(met)
met=impute.knn(met,k=5)$data

Heatmap visualization is generated using the function pheatmap. Metabolites are hierarchically clustered according to their relative concentration The hierarchical clustering is performed using the distance matrix based on the KODAMA dimensions. KODAMA is a learning algorithm for unsupervised feature extraction specifically designed for analyzing noisy and high-dimensional data sets (Cacciatore \emph{et al.}, 2014), implemented in the R package KODAMA (Cacciatore \emph{et al.}, 2017). Additional information can be found in the review of Zinga \emph{et al.}, 2023.

require("pheatmap")
require("RColorBrewer")

my_colour1 = list(genotype=c(MYC="#000000ff",WT="#eeeeeeff"),
                  group=c(MYC_CTD="#373898ff",MYC_HFD="#c11630ff",
                          WT_CTD="#00a4cfff",WT_HFD="#e40a81ff"))

config=umap.defaults
config$n_neighbors=5
kk1=KODAMA.matrix(t(met),ncomp=10,FUN="simpls")
col=KODAMA.visualization(kk1,config=config)
hcol=hclust(dist(col),method="ward.D")
kk2=KODAMA.matrix(scale(met),ncomp=10,FUN="simpls")
row=KODAMA.visualization(kk2)
hrow=hclust(dist(row),method="ward.D")
my_sample_col <- data.frame(group = label,genotype=label_MYC)
row.names(my_sample_col) <- colnames(met)


pheatmap(met,
         cluster_cols = hcol,
         cluster_rows = hrow,
         labels_row = rep("",nrow(met)),
         annotation_col = my_sample_col, 
         annotation_colors = my_colour1,
         color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100))

\begin{figure}[h] \centerline{\includegraphics[width=12cm]{KODAMApaper.pdf}} \caption{Heatmap of metabolites with hierarchical clustering based on their concentration.} \label{fig:F1} \end{figure}

\newpage

To analyze the chemical similarities among metabolites, we need the Simplified Molecular-Input Line-Entry System (SMILES) of each metabolite. The SMILES of the previous data set are stored in the list HFD that can be loaded using the function data(HFD). The MetChem package includes the clusters.detection function based on KODAMA analysis. This function repeats the follwing steps (10 times as default): i) transformation of the chemical structure dissimilarity matrix in a multidimensional space (with 50 dimensions as defaults) using multidimensional scaling; ii) KODAMA features extraction; iii) hierarchical clustering based on the KODAMA output; iv) Calculation of the silhoutte index from different number of clusters (from 2 to 30 as default). The average of the silhouette index is calculated for each cluster number to identify the optimal cluster number.

data(HFD)
met=met[rownames(HFD),]
SMILES=HFD$SMILES
names(SMILES)=rownames(HFD)
clu=clusters.detection(SMILES)
plot(clu$min_nc:clu$max_nc,clu$silhouette,type="l",
     ylab="Rousseeuw's Silhouette index",xlab="Number of clusters")
abline(v=5*(1:6),lty=2)

print(clu$main_cluster)

\begin{figure}[h] \centerline{\includegraphics[width=8cm]{FLindex.pdf}} \caption{Silhouette index.} \label{fig:F2} \end{figure}

\newpage

Based on the average Silhouette indices, we have the optimal number of clusters for this data set is 8. Below is shown a graphical visualization of the final output of KODAMA. Each cluster is represented by a different color code. Each dot represents a different metabolite. Metabolites that are located near each other share a similar chemical structure.

plot(clu$visualization,pch=21,bg=rainbow(8,alpha = 0.7)[clu$clusters[,"Clusters 8"]],cex=2)
legend(-30, 20, legend=paste("Cluster", unique(clu$clusters[,"Clusters 8"])),
   col= rainbow(8,alpha = 0.7), pch= 16, cex=1)

The following code can be used to identify the points on the scatter plot. The cluster belonging and chemical name of the selected points will be displayed. ESC key to terminate the command.

data.frame(metabolite=rownames(met),
           cluster=clu$clusters[,"Clusters 8"])[identify(clu$visualization),]

\begin{figure}[h] \centerline{\includegraphics[width=12cm]{KODAMA.pdf}} \caption{KODAMA plot.} \label{fig:F3} \end{figure}

\newpage

A new heatmap is generated where metabolites are clustered according to their chemical similarity.

my_colour2 = list(cluster = c("1"=rainbow(8,alpha = 1)[1],
                              "2"=rainbow(8,alpha = 1)[2],
                              "3"=rainbow(8,alpha = 1)[3],
                              "4"=rainbow(8,alpha = 1)[4],
                              "5"=rainbow(8,alpha = 1)[5],
                              "6"=rainbow(8,alpha = 1)[6],
                              "7"=rainbow(8,alpha = 1)[7],
                              "8"=rainbow(8,alpha = 1)[8]),
                  genotype=c(MYC="#000000ff",WT="#eeeeeeff"),
                  group=c(MYC_CTD="#373898ff",MYC_HFD="#c11630ff",
                          WT_CTD="#00a4cfff",WT_HFD="#e40a81ff"))

clusters8=clu$clusters[,"Clusters 8"]
my_sample_row <- data.frame(cluster = as.character(clusters8))
row.names(my_sample_row) <- rownames(met)


met=met[rownames(HFD),] 
kk1=KODAMA.matrix(t(met),ncomp=10,FUN="simpls")
col=KODAMA.visualization(kk1,config=config)
hcol=hclust(dist(col),method="ward.D")
hrow=clu$hclust

my_sample_col <- data.frame(group = label,genotype=label_MYC)
row.names(my_sample_col) <- colnames(met)

pheatmap(met,
         cluster_cols = hcol,
         cluster_rows = hrow,
         labels_row = rep("",nrow(met)), 
         annotation_colors = my_colour2,
         annotation_col = my_sample_col, 
         annotation_row = my_sample_row, 
         cutree_rows = 8, 
         color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100))

\begin{figure}[h] \centerline{\includegraphics[width=12cm]{heatmap.pdf}} \caption{Heatmap of metabolites with vertical hierarchical clustering based on their molecular structure.} \label{fig:F4} \end{figure}

\newpage

We next build a heatmap of the metabolites belonging to cluster 4.

sel=clu$clusters[,"Clusters 8"]==4
met.sel=met[sel,] 
my_sample_col <- data.frame(group = label,genotype=label_MYC)
row.names(my_sample_col) <- colnames(met)
oo=order(row.names(my_sample_col))
my_sample_col=my_sample_col[oo,]
met.sel=met.sel[,oo]
hrow.sel=hclust(dist(clu$visualization[sel,]),method="ward.D") 

pheatmap(met.sel,fontsize = 7,
         cluster_cols = FALSE,
         cluster_rows = hrow.sel,
         annotation_col = my_sample_col, 
         annotation_colors = my_colour1,
         color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100))

\begin{figure}[h] \centerline{\includegraphics[width=12cm]{KODAMAselection.pdf}} \caption{Heatmap} \label{fig:F6} \end{figure}

\newpage

In the next step, we apply the Weighted Metabolite Chemical Similarity Analysis (WMCSA) on all branches of the hierarchical clustering performed on cluster 4. WMCSA is implemented in the function WMCSA. This function summarizes the relative concentration of metabolites within each module. Each module is defined according to chemical similarity.

my_sample_col <- data.frame(group = label,genotype=label_MYC)
row.names(my_sample_col) <- colnames(met)
cl=allbranches(hrow.sel)
ww=WMCSA(met.sel,cl)

hrow=hclust(dist(ww),method="ward.D")

pheatmap(ww,
         cluster_cols = FALSE,
         cluster_rows = hrow,
         annotation_col = my_sample_col, 
         annotation_colors = my_colour1,
         color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100))

\begin{figure}[h] \centerline{\includegraphics[width=12cm]{KODAMAall.pdf}} \caption{Heatmap of the output of WMCSA.} \label{fig:F5} \end{figure}

\newpage

Differential analysis of the relevant modules can be performed using the function multi_analysis present in the R package clinalyze. In the example below, we perform a differential analysis between MYC transgenic mice fed with a high-fat diet or a control diet named MYC_HFD and MYC_CTD, respectively.

library("clinalyze")
label=rep(c("MYC","WT"),each=12)
multi_analysis(t(ww),label)

Feature	MYC	WT	p-value	FDR
Mod1, median [IQR]	-2.919 [-3.833 -1.536]	2.938 [2.195 3.563]	3.66e-05	7.84e-05
Mod2, median [IQR]	-2.17 [-2.638 -1.706]	2.237 [1.98 2.478]	3.66e-05	7.84e-05
Mod3, median [IQR]	-1.798 [-2.033 -1.215]	1.409 [1.015 2.228]	3.66e-05	7.84e-05
Mod4, median [IQR]	-2.45 [-3.037 -1.381]	2.139 [1.776 2.827]	3.66e-05	7.84e-05
Mod5, median [IQR]	-1.902 [-2.171 -1.425]	1.688 [1.563 1.941]	3.66e-05	7.84e-05
Mod6, median [IQR]	-1.772 [-3.022 -1.119]	1.701 [0.742 2.204]	2.46e-04	4.10e-04
Mod7, median [IQR]	0.886 [-1.181 2.231]	-0.329 [-1.224 0.379]	4.03e-01	4.64e-01
Mod8, median [IQR]	-1.674 [-2.485 -0.656]	1.248 [0.609 1.995]	3.84e-04	5.76e-04
Mod9, median [IQR]	1.609 [0.961 2.06]	-1.412 [-1.706 -1.203]	4.69e-05	8.80e-05
Mod10, median [IQR]	-0.84 [-1.956 0.087]	0.8 [-0.148 1.161]	4.64e-02	6.33e-02
Mod11, median [IQR]	-0.023 [-0.639 1.15]	-0.583 [-1.229 0.027]	1.75e-01	2.19e-01
Mod12, median [IQR]	-1.347 [-1.928 -0.681]	1.305 [0.711 1.736]	3.66e-05	7.84e-05
Mod13, median [IQR]	0.757 [-0.717 1.532]	-0.004 [-1.003 0.6]	5.07e-01	5.43e-01
Mod14, median [IQR]	1.691 [0.59 1.952]	-1.389 [-1.662 -0.889]	3.66e-05	7.84e-05
Mod15, median [IQR]	0.196 [-1.074 0.842]	0.388 [-0.853 0.638]	7.95e-01	7.95e-01

\newpage

\newpage The function readMet connects R to the HMDB database ^[https://www.hmdb.ca]. to retrieve chemical and functional information of each metabolite. This can be summarized using different functions: substituentsMet, diseaseMet, enzymeMet, pathwaysMet, taxonomyMet. The function features associates the most prominent features to each module.

In this example, we characterized the modules by functional group.

HMDB=HFD$HMDB
names(HMDB)=rownames(HFD)
doc=readMet(HMDB)
cla1=substituentsMet(doc)
substituents=features(doc,cla1,cl)
cla2=enzymesMet(doc)
enzymes=features(doc,cla2,cl)

To visualize the metabolite of , for example, module 12, we can type:

cl[[12]]

Fisher’s test was used to rank the association of each module to the metabolite information. Below are reported the p-value for associations of substituents with module 12.

substituents[[12]]

Substituents	p-value
Histidine or derivatives	1.44e-06
Imidazolyl carboxylic acid derivative	1.21e-05
Hybrid peptide	4.97e-05
N-acyl-alpha amino acid or derivatives	1.21e-03
Imidazole	1.31e-03
Azole	1.53e-03
Aromatic heteromonocyclic compound	1.77e-03
N-acyl-alpha-amino acid	2.33e-03
Beta amino acid or derivatives	8.62e-03
Organoheterocyclic compound	1.18e-02
Heteroaromatic compound	1.47e-02
Carboximidic acid	1.91e-02
Carboximidic acid derivative	2.04e-02
Azacycle	2.05e-02
Organic 1,3-dipolar compound	2.32e-02
Propargyl-type 1,3-dipolar organic compound	2.32e-02
Alpha-amino acid or derivatives	3.95e-02
Amino acid or derivatives	4.13e-02
\newpage

Here are reported the p-value for associations of metabolite-related enzymes with the module 12.

enzymes[[12]]

Substituents	p-value
CNDP1	2.97e-04
CARNS1	1.38e-03
VIM	8.62e-03
HSPA1A	8.62e-03
SLC15A2	8.62e-03
MPO	2.57e-02

4 How to Cite this Package

Ebtesam Abdel-Shafy, Tadele Melak, David A. MacIntyre, Giorgia Zadra, Luiz F. Zerbini, Silvano Piazza, and Stefano Cacciatore Publication in submission

To obtain BibTex entries of the two references, you can enter the following into your R session to BibTex citation("MetChem").

5 References

Cacciatore S, Luchinat C, Tenori L. Knowledge discovery by accuracy maximization. \emph{Proc Natl Acad Sci USA} 2014; 111: 5117-22.

Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA (2017) KODAMA: an R package for knowledge discovery and data mining. \emph{Bioinformatics} 2017; 33(4): 621-623.

Labbé DP, Zadra G, Yang M, Reyes JM, Lin CY, Cacciatore S, Ebot EM, Creech AL, Giunchi F, Fiorentino M, Elfandy H, Syamala S, Karoly ED, Alshalalfa M, Erho N, Ross A, Schaeffer EM, Gibb EA, Takhar M, Den RB, Lehrer J, Karnes RJ, Freedland SJ, Davicioni E, Spratt DE, Ellis L, Jaffe JD, D’Amico AV, Kantoff PW, Bradner JE, Mucci LA, Chavarro JE, Loda M, Brown M. High-fat diet fuels prostate cancer progression by rewiring the metabolome and amplifying the MYC program. \emph{Nat Commun} 2019; 10: 4358.

Zinga MM, Abdel-Shafy E, Melak T, Vignoli A, Piazza S, Zerbini LF, Tenori L, Cacciatore S. KODAMA exploratory analysis in metabolic phenotyping. \emph{Front Mol Biosci} 2023; 9: 1436.

A new pipeline to explore structural similarity across metabolite modules

|

2025-06-04

1 Introduction

2 Installation

2.1 Installation via CRAN

2.2 Manual installation from source

2.3 Compatibility issues

3 Getting Started

4 Example 1: murine prostate tissues metabolic profile

4 How to Cite this Package

5 References