Estimate Metabolite Taxonomy and Structure with a Fragment-Centered Database and Fragment Network

Abstract

Metabolite structure identification has become the major bottleneck of mass spectrometry-based metabolomics research. To date, a number of mass spectral databases and search algorithms have been developed to address this issue. However, two critical problems remain: the low coverage of chemical components in database records and the significant variation of MS/MS spectra with experimental equipment and parameter settings. In this work, we consider molecular fragments as the basic building blocks of metabolites, which have relatively consistent signatures in MS/MS spectra. From a bottom-up point of view, we built a fragment-centered database, MSFragDB, by reorganizing the data from the Human Metabolome Database (HMDB), and developed an intensity-free search algorithm to search and rank the most relevant metabolites according to the user's input. We also propose the concept of a fragment network, a graph structure that encodes the relationships between molecular fragments, to find closely related motifs that indicate a specific chemical structure. Although MSFragDB is based on the same dataset as the HMDB, validation results imply that it has a higher hit ratio and, furthermore, can estimate the possible taxonomy to which a query spectrum belongs when the corresponding chemical component is missing from the database. Aided by the fragment network, MSFragDB was also shown to estimate the right structure when the MS/MS spectrum suffers from precursor contamination. The proposed strategy is general and can be adopted in existing databases. We believe MSFragDB and the fragment network can improve the performance of structure identification with existing data. The beta version of the database is freely available at www.xrzhanglab.com/msfragdb/.

Hansen Zhao , Xu Zhao , Huan Yao , Jiaxin Feng , Sichun Zhang , Xinrong Zhang Department of Chemistry, Tsinghua University, Beijing, China * Corresponding author: [email protected]; [email protected]

Introduction

Metabolomics, which systematically studies cellular chemical compositions and their interaction networks, has drawn increasing interest recently for its unique roles in both fundamental biological research and next-generation precision medicine [1, 2, 3]. To detect the highly diverse chemical components in limited biological samples, untargeted mass spectrometry (MS) is usually applied to acquire the mass-to-charge ratio (m/z) of the molecules with high throughput and sensitivity [4, 5]. Although advanced experimental techniques that obtain MS spectra from biological samples with higher coverage or throughput continue to be reported, the metabolite identification step remains time-consuming and challenging, and is considered the major bottleneck in converting the abundant spectral information into biological insights [1, 3, 8, 9, 10].

Compared with biomacromolecules such as nucleic acids or proteins, which consist of a limited number of basic building blocks, the structures of metabolites are highly diverse. To retrieve the structure of a metabolite from MS spectra, tandem MS (MS/MS) is widely applied, in which intact precursor ions are fragmented and the molecular fragments are detected by the second MS stage to form the MS/MS spectrum. The MS/MS process provides extra structural information about the precursor molecule and serves as the basic starting point for the subsequent structure retrieval. Although a large number of algorithms based on machine learning [10, 11] or in-silico fragmentation have been developed in recent years to estimate the structures of metabolites from their MS/MS spectra, spectral matching, which compares the experimental data against online database records, is still considered the ‘gold standard’ for metabolite identification and is also the approach most often adopted in practice. Existing databases such as METLIN and the Human Metabolome Database (HMDB) provide invaluable information for metabolite identification but suffer from two major disadvantages: 1) limited component coverage; 2) inefficient spectral similarity matching due to the variation of MS/MS spectra across different equipment or detection parameters [9, 15, 16], collision energy for example.

The Global Natural Products Social Networking Library (GNPS), on the other hand, constructs a molecular similarity network (MN) to gain insight into the structural relationships within an experimental dataset [17, 18]. By doing so, the algorithm focuses on the similarity between spectra acquired by the same researcher with the same equipment and in the same batch, thus eliminating false matches caused by systematic variations. In this way, more chemical structures can be estimated by referring to the well-known nodes of the MN and to prior knowledge about the homologues, even if the component is absent from the database. However, algorithms based on spectral similarity matching intrinsically assume that the MS/MS spectrum originates from the fragmentation of a pure precursor component. We argue that this assumption may fail in the analysis of complex biological samples, where multiple components with similar m/z are fragmented together because of the limited precursor filtering resolution (precursor contamination). To address this issue, mining the relationships between the fragment ions in an MS/MS spectrum may produce extra information for chemical component identification [...] spectra in different cell types can have different profiles, which may be due to precursor contamination. MSFragDB can thus aid structure estimation in these cases. Overall, the fragment-based database and algorithm provide a new perspective for retrieving structural information from MS/MS spectra and can be adopted as a new search method in existing databases to improve matching accuracy.

Results

Database construction.

Fig. 1 The database construction process.

Fragment-based searching and ranking.

The search set module is the major feature provided by our online search tool. For each input fragment, a varying number of hit records can be found in the database, and ranking the results so that the most likely candidates appear at the top of the result list is one of the key issues. While traditional spectrum matching algorithms based on similarity measurements give continuous scores, a score based on individual fragments can be discrete. Here, we introduce two mechanisms to rank the search results.

The first mechanism is a naïve consideration to achieve ‘all-hit-priority’, which means that the taxonomies and components containing the most fragments in the query set should be presented first. Let $H_{ijk}$ be an indicator of whether input m/z $i$ hits library component $j$, which belongs to taxonomy $k$. The taxonomy can be scored as

$$T_k^{naive} = \max\{C_{1k}, C_{2k}, \ldots\} \tag{1}$$

$$C_{jk} = \frac{\sum_i H_{ijk}}{N} \tag{2}$$

$$H_{ijk} = \begin{cases} 1 & \text{hit record} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $N$ is the number of query fragments.

The second strategy is adopted from web page searching and is named TF-IDF scoring. While web page searching involves a three-level topic-webpage-keyword organization, the same structure can be found in metabolite identification as the taxonomy-component-fragment organization. Thus, querying chemical components by fragments can be considered a process similar to querying webpages by keywords. The term-frequency term, written $FT_{ik}$ here, normalizes the hit count of fragment $i$ on taxonomy $k$ by the total number of records in the database that belong to taxonomy $k$. The IDF term, on the other hand, measures the specificity of the fragment. In other words, fragments that occur in many taxonomies have less predictive strength than those that appear in only a small number of taxonomies. Let $U$ be the set of database records, $U_k$ the subset of records that belong to taxonomy $k$, and $S_i$ the subset of all records hit by fragment $i$. Then

$$FT_{ik} = \frac{\sum_j H_{ijk}}{|U_k|} \tag{4}$$

$$IDF_i = \log\left(\frac{\|U\|}{\|S_i\|}\right) \tag{5}$$

$$T_k^{FT\text{-}IDF} = \sum_i FT_{ik} \cdot IDF_i \tag{6}$$

where $|\cdot|$ is the size of a set and $\|\cdot\|$ is the number of unique taxonomies in a set. The candidate taxonomies are sorted by $T_k^{naive}$ and $T_k^{FT\text{-}IDF}$. Within each taxonomy, the candidate molecules are sorted by the number of hit fragments. Finally, the user interface presents the search results in a three-column hierarchical format showing the potential hit taxonomies, metabolites and hit fragments, respectively (Fig. 2).

Fig. 2 Illustration of the searching process.
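The two ranking scores in Eqs. (1)-(6) can be illustrated with a minimal sketch. It is not the MSFragDB implementation: the toy library, the fragment values, the 0.01 Da matching tolerance `TOL`, and the choice to sort by the naïve score first and use the FT-IDF score as a tie-breaker are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy library: (component, taxonomy, fragment m/z values).
LIBRARY = [
    ("PC(16:0/18:1)", "Phosphatidylcholines", [184.07, 104.11, 86.10]),
    ("LysoPC(16:0)", "Lysophosphatidylcholines", [184.07, 104.11]),
    ("PS(16:0/18:1)", "Phosphatidylserines", [255.23, 281.25]),
]
TOL = 0.01  # assumed m/z matching tolerance in Da (not stated in the text)


def hits(mz, fragments, tol=TOL):
    """H_ijk of Eq. (3): 1 if the query m/z matches a fragment of the record."""
    return int(any(abs(mz - f) <= tol for f in fragments))


def score_taxonomies(query, library=LIBRARY):
    """Return (taxonomy, T_naive, T_FT-IDF) tuples, best candidate first."""
    if not query:
        return []
    n = len(query)                                 # N in Eq. (2)
    all_taxa = {tax for _, tax, _ in library}      # ||U|| = len(all_taxa)
    records_per_taxon = defaultdict(int)           # |U_k|
    for _, tax, _ in library:
        records_per_taxon[tax] += 1

    naive = defaultdict(float)                     # T_k^naive, Eq. (1)
    for _, tax, frags in library:
        c_jk = sum(hits(mz, frags) for mz in query) / n   # Eq. (2)
        naive[tax] = max(naive[tax], c_jk)

    ft_idf = defaultdict(float)                    # T_k^FT-IDF, Eq. (6)
    for mz in query:
        hit_count = defaultdict(int)               # sum_j H_ijk per taxonomy
        for _, tax, frags in library:
            hit_count[tax] += hits(mz, frags)
        hit_taxa = [t for t, c in hit_count.items() if c > 0]  # ||S_i||
        if not hit_taxa:
            continue                               # fragment hits no record
        idf = math.log(len(all_taxa) / len(hit_taxa))      # Eq. (5)
        for tax in hit_taxa:
            ft = hit_count[tax] / records_per_taxon[tax]   # Eq. (4)
            ft_idf[tax] += ft * idf

    # Assumed ordering: naive score first, FT-IDF score as tie-breaker.
    ranked = sorted(all_taxa, key=lambda t: (naive[t], ft_idf[t]), reverse=True)
    return [(t, naive[t], ft_idf[t]) for t in ranked]


result = score_taxonomies([184.07, 104.11, 86.10])
```

In this toy query, the fragment at 86.10 hits only one taxonomy and therefore receives the largest IDF weight, which is the specificity effect the IDF term is meant to capture.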

Database evaluation.

We first evaluated the database by searching the spectra contained in MSFragDB itself; 100% of them were correctly annotated by the top item of the resulting taxonomy and component lists, indicating that there was no systematic error or bias. Then, we tested the search accuracy of MSFragDB by comparing its performance with that of the HMDB, given that they share the same underlying data but use different search strategies. We randomly selected 40 spectra from the predicted dataset of the LipidBlast library, as lipidomics is one of the most frequently searched areas. Surprisingly, we found that none of these spectra was correctly annotated in HMDB, which may be due to the large differences in intensity profile between the query spectra and the library spectra (Fig. 3b). This result indicates that fragment intensity is highly variable and unreliable for spectrum matching, which is consistent with previous publications [9, 15, 16]. The fragment-based search strategy, which ignores intensity information entirely, showed a significant improvement in the correct hit ratio. Specifically, for components present in the HMDB (marked as 1 in the GT column in Fig. 3a), the top component record achieved 36% accuracy, while the top-5 records achieved 52% accuracy. The taxonomy was correctly annotated by the top record in 72% of cases, and this ratio increased to 88% for the top-5 records. For components absent from the database, the correct component hit ratio is necessarily 0%, yet the taxonomy could still be correctly inferred with 47% accuracy by the top record and 67% by the top-5 records. These results show that the fragment-based search strategy outperformed the traditional spectrum matching algorithm, at least in this test case. The strategy adopted in MSFragDB also provides a reasonable estimate of the taxonomy of absent components, which can be helpful in scientific research.

Fig. 3 Evaluation of the database. (a) The search accuracy in MSFragDB. GT=0 means the target component is not in the database. (b) A typical query spectrum and its corresponding database spectrum in HMDB.

Interpret precursor contaminated spectrum.

Precursor contamination can occur in the analysis of small-volume complex samples such as single-cell plasma, in which a large number of metabolite components exist and separation methods are hard to apply. Traditional methods assume that the MS/MS spectrum originates from a single component. However, this assumption may not hold in the analysis of complex small samples, where two or more components with similar m/z can be fragmented at the same time and recorded in the same spectrum. Here, we take a simple proof-of-concept example from lipidomics analysis. PC (16:0/18:0) (m/z = 761.59) and PS (16:0/18:1) (m/z = 761.52) have close m/z values and can easily be co-fragmented in MS/MS analysis. The resulting spectrum had a characteristic high-intensity peak at m/z = 184.07 (Fig. 4a), which made it easy to mistake for a phosphatidylcholine (PC) alone. However, in the taxonomy hit matrix $M_{ik} = \sum_j H_{ijk}$ of the query fragment set (Fig. 4b), the hit fingerprints (rows of the matrix) of the fragments indicated that roughly two distinct groups existed. This observation can be further confirmed by constructing a molecular fragment network (MFN), which can be mathematically represented as $G(V, E)$. Each node ($V$) in the network stands for a fragment, and the edges ($E$) characterize the relationship between two nodes by their weights (Fig. 4c):

$$F_{ik} = \begin{cases} 1 & M_{ik} > 0 \\ 0 & M_{ik} = 0 \end{cases} \tag{7}$$

$$W_{ij} = \begin{cases} \dfrac{\sum_k (F_{ik} \cdot F_{jk})}{\sqrt{\sum_k F_{ik}}\,\sqrt{\sum_k F_{jk}}} & \sum_k F_{ik} > 0 \text{ and } \sum_k F_{jk} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $W_{ij}$ is the weight of the edge between fragments $i$ and $j$. Closely related fragment clusters can thus be detected by thresholding the edge weights and applying a social network community detection algorithm [21, 22] (Fig. 4d). Finally, each fragment subset identified by this process can be searched separately in MSFragDB to reveal the true composition of the MS/MS spectrum. This strategy, based on the MFN and community detection, provides a novel way to find clues of precursor contamination and to uncover the composition of the mixture.
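The construction in Eqs. (7) and (8) can be sketched as follows. This is an illustrative sketch only: the hit matrix is a hypothetical toy example, the 0.5 edge-weight threshold is an assumed value, and plain connected components are used as a simple stand-in for the community detection algorithm cited in the text.

```python
import math


def binarize(hit_matrix):
    """Eq. (7): F_ik = 1 if M_ik > 0, otherwise 0."""
    return [[1 if m > 0 else 0 for m in row] for row in hit_matrix]


def edge_weight(f_i, f_j):
    """Eq. (8): cosine similarity of two binary taxonomy fingerprints."""
    n_i, n_j = sum(f_i), sum(f_j)
    if n_i == 0 or n_j == 0:
        return 0.0
    shared = sum(a * b for a, b in zip(f_i, f_j))
    return shared / (math.sqrt(n_i) * math.sqrt(n_j))


def fragment_communities(hit_matrix, threshold=0.5):
    """Drop edges below the (assumed) threshold, then group fragments
    by connected components as a stand-in for community detection."""
    f = binarize(hit_matrix)
    n = len(f)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if edge_weight(f[i], f[j]) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)

    seen, communities = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:                       # depth-first traversal
            node = stack.pop()
            group.append(node)
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)
        communities.append(sorted(group))
    return communities


# Hypothetical hit matrix M_ik: rows are query fragments, columns are taxonomies.
M = [
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 2],
]
communities = fragment_communities(M)
```

Fragments 0 and 1 share the same taxonomy fingerprint, as do fragments 2 and 3, so the sketch separates them into two groups that could then be searched independently.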

Fig. 4 Identify precursor contaminations. (a) The spectrum that co-derives from phosphatidylcholine (PC) and phosphatidylserine (PS). (b) Taxonomy hit matrix of the query fragment set. (c) Raw MFN constructed from the hit matrix. (d) Two communities were found by edge weight thresholding and community detection.

Single cell plasma analysis.

We finally re-analyzed the MS/MS spectra acquired by Pico-ESI-MS, where metabolites of both human astrocyte cells and glioblastoma cells were extracted and electrosprayed into the mass spectrometer. As the electrospray time was largely extended, MS/MS spectra could be acquired for each MS1 peak in a data-dependent manner, and thus abundant MS/MS information could be obtained. However, the previous analysis still focused on the MS1 spectra, for example differentiating cell types according to the MS1 spectra. Here, we focus on the MS/MS information of each single cell (Fig. 5a). Interestingly, the MS/MS spectra show significant differences between these two types of cells even after normalizing the sum of the fragment intensities in each spectrum to 1 (Fig. 5b). As the normalization step eliminates the abundance information of the metabolites in MS1, if only one chemical component were fragmented in a given precursor channel, there should be no differences between these two cell types. Thus, Fig. 5b indicates that co-fragmentation of multiple components in a given MS/MS spectrum may commonly occur in single-cell metabonomics analysis. These components can have different abundances in different types of cells, leading to different MS/MS profiles. In this case, traditional search algorithms based on spectrum matching may be misleading, as their single-original-component assumption is broken. A fragment-centered strategy, however, may be helpful here by mining the relationships among the fragments and uncovering their co-existence communities. To find an example case, we filtered the fragment peaks by three criteria: total occurrence count > 5, fold change > 2 and p value < 0.01 (ANOVA); the filtered peaks are shown in Fig. 5c. We then examined whether the variations of the MS/MS spectra with the same precursor m/z were random or cell-type-related. Molecular network analysis was performed (Fig. 5d, precursor m/z = 522.14).
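The normalization argument above can be checked with a small numerical sketch. The two fragment patterns and the abundances below are hypothetical values chosen only for illustration: a pure component gives the same normalized profile at any abundance, whereas a co-fragmented mixture gives a profile that depends on the abundance ratio.

```python
def normalize(spectrum):
    """Scale fragment intensities so that they sum to 1."""
    total = sum(spectrum.values())
    return {mz: intensity / total for mz, intensity in spectrum.items()}


def mix(pattern_a, pattern_b, amount_a, amount_b):
    """Co-fragment two components present at the given abundances."""
    mixed = {}
    for pattern, amount in ((pattern_a, amount_a), (pattern_b, amount_b)):
        for mz, relative in pattern.items():
            mixed[mz] = mixed.get(mz, 0.0) + amount * relative
    return mixed


# Hypothetical relative fragment patterns of two co-isolated components.
COMPONENT_A = {184.07: 0.8, 104.11: 0.2}
COMPONENT_B = {163.06: 0.5, 145.05: 0.5}

# Pure component A at two abundances: identical after normalization.
pure_low = normalize(mix(COMPONENT_A, COMPONENT_B, 1.0, 0.0))
pure_high = normalize(mix(COMPONENT_A, COMPONENT_B, 10.0, 0.0))

# Mixtures with different abundance ratios: profiles differ after normalization.
cell_type_1 = normalize(mix(COMPONENT_A, COMPONENT_B, 9.0, 1.0))
cell_type_2 = normalize(mix(COMPONENT_A, COMPONENT_B, 1.0, 9.0))
```

Under these assumptions, the normalized intensity at 184.07 is the same for both pure spectra but drops from 0.72 to 0.08 between the two mixtures, which is the kind of cell-type-dependent difference that a single-component spectrum could not show.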
Spectrum similarity was measured as described before, with the 5 strongest peaks selected and a 0.01 Da alignment threshold; edges with a weight lower than 0.95 were eliminated. The resulting MN was plotted with a force layout, which shows that the two cell types cluster separately. This result indicates that the MS/MS spectra of the single-cell samples had higher similarity within a cell type than between different types. Thus, the variations were cell-type-related, which confirms the assumption made above. Two typical spectra are plotted in Fig. 5e. The spectrum corresponding to normal human astrocyte cells (blue) showed a clear fragmentation pattern that is easy to interpret: all peaks, such as 86.10, 125.00 and 184.07, were related to the phosphatidylcholine structure. However, the spectrum corresponding to glioblastoma cancer cells showed a much more complex pattern while still preserving the phosphatidylcholine-related peaks. We selected the 10 strongest peaks and searched them in MSFragDB. Molecular fragment network analysis showed that 3 fragment clusters were formed, and the top-hit records for the clusters were LysoPC, Flavonoid-O-glycosides and Furospirostanes, respectively. These results imply that precursor contamination may frequently occur in the MS analysis of complex small-volume samples and that MSFragDB can help to identify the right component composition with a fragment-centered strategy.

Fig. 5 Identify precursor contaminations. (a) Typical MS/MS map of A172 (left) cells and Normal (right) cells. (b) PCA analysis of the single cells. (c) Map of significantly different MS/MS peaks (fold change > 2, -lg(p) > 2). (d) Molecular Network of the precursor with m/z=522.14. The orange color and blue color indicate A172 cells and Normal cells, respectively. (e) MS/MS spectra corresponding to the nodes indicated by red arrows in (d). (f) Fragment network constructed by MSFragDB and the suggested metabolites.

Discussions

Here, we proposed a fragment-based strategy to aid metabolite identification. By constructing a fragment-centered database named MSFragDB and evaluating its performance, we demonstrated the improvement in hit accuracy offered by the fragment-based strategy, which may encourage existing databases to adopt it to improve their matching performance. As the intensity of the fragment peaks in MS/MS spectra varies greatly between equipment settings, the fragment-centered matching strategy may be more reliable than spectrum-level similarity measurement. As a large number of MS spectra can be acquired in a short time with advanced equipment and analysis methods, the collected MS/MS spectra datasets carry increasingly abundant information. The MFN analysis proposed here can be an efficient method for mining large datasets and extracting fragment relationships. It is reasonable to assume that the fragments of molecules have certain relationships. For example, fragments of the characteristic structures of a chemical taxonomy may be more likely to coexist in MS/MS spectra. Mining these relationships can simplify the MS profile and aid the subsequent spectral interpretation and structure identification. Moreover, our analysis of the single-cell MS/MS spectra dataset suggests that precursor contamination can frequently occur in the fragmentation of small-volume complex biological samples, which breaks the basic assumption of spectrum-level matching algorithms. A fragment-centered strategy is therefore essential for uncovering the right metabolites. We also built a community prototype on the website that allows users to upload their own experimental data and leave their information on a personal page. By doing so, researchers can contact each other for further cooperation. However, the current records in MSFragDB originate from the data in HMDB, which limits its performance. Adding more data sources should improve its search accuracy.
Further work may also include adding a batch search method to meet the demands of high-throughput data analysis. The manuscript is prepared for preprint submission. The online website proposed here was developed by the first author as a proof of concept. It may suffer from unexpected bugs or accidents such as an unreliable network connection in the lab. Any comments, suggestions or bug reports are welcome and can be submitted via e-mail to the first author ([email protected]) or the corresponding authors ([email protected]; [email protected]).

Reference

1. Yi Z, Zhu Z-J. Overview of Tandem Mass Spectral and Metabolite Databases for Metabolite Identification in Metabolomics. Methods in molecular biology (Clifton, NJ), 139-148 (2020).
2. Perez De Souza L, Alseekh S, Brotman Y, Fernie AR. Network-based strategies in metabolomics data analysis and interpretation: from molecular networking to biological interpretation. Expert Review of Proteomics, 243-255 (2020).
3. O’Shea K, Misra BB. Software tools, databases and resources in metabolomics: updates from 2018 to 2019. Metabolomics, 36 (2020).
4. Blaženović I, et al. Structure Annotation of All Mass Spectra in Untargeted Metabolomics. Analytical Chemistry, 2155-2162 (2019).
5. Nguyen D, Nguyen C, Mamitsuka H. Recent advances and prospects of computational methods for metabolite identification: a review with emphasis on machine learning approaches. Briefings in Bioinformatics, 2028-2043 (2019).
6. Wang R, Zhao H, Zhang X, Zhao X, Song Z, Ouyang J. Metabolic Discrimination of Breast Cancer Subtypes at the Single-Cell Level by Multiple Microextraction Coupled with Mass Spectrometry. Analytical Chemistry, 667-3674 (2019).
7. Yao H, et al. Label-free Mass Cytometry for Unveiling Cellular Metabolic Heterogeneity. Analytical Chemistry, 9777-9783 (2019).
8. Dührkop K, et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nature Methods, 299-302 (2019).
9. Aron AT, et al. Reproducible molecular networking of untargeted mass spectrometry data using GNPS. Nature Protocols, 1954-1991 (2020).
10. Laponogov I, Sadawi N, Galea D, Mirnezami R, Veselkov KA. ChemDistiller: an engine for metabolite annotation in mass spectrometry. Bioinformatics, 209V 2102 (2018).
11. Ji HC, Deng HZ, Lu HM, Zhang ZM. Predicting a Molecular Fingerprint from an Electron Ionization Mass Spectrum with Deep Neural Networks. Analytical Chemistry, 8649-8653 (2020).
12. Kind T, Liu K-H, Lee DY, DeFelice B, Meissen JK, Fiehn O. LipidBlast in silico tandem mass spectrometry database for lipid identification. Nature Methods, 755 (2013).
13. Guijas C, et al. METLIN: A Technology Platform for Identifying Knowns and Unknowns. Analytical Chemistry, 3156-3164 (2018).
14. Wishart DS, et al. HMDB: the Human Metabolome Database. Nucleic Acids Research, D521-D526 (2007).
15. Kind T, et al. Identification of small molecules using accurate mass MS/MS search. Mass Spectrom Rev, 513-532 (2018).
16. Ichou F, et al. Comparison of the activation time effects and the internal energy distributions for the CID, PQD and HCD excitation modes. Journal of Mass Spectrometry, 498-508 (2014).
17. Watrous J, et al. Mass spectral molecular networking of living microbial colonies. Proceedings of the National Academy of Sciences, E1743 (2012).
18. Wang M, et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nature Biotechnology, 828 (2016).
19. Nikolskiy I, Mahieu NG, Chen Y, Jr., Tautenhahn R, Patti GJ. An Untargeted Metabolomic Workflow to Improve Structural Characterization of Metabolites. Analytical Chemistry, 7713-7719 (2013).
20. Sawada Y, et al. RIKEN tandem mass spectral database (ReSpect) for phytochemicals: A plant-specific MS/MS-based data resource and database. Phytochemistry, 38-45 (2012).
21. Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 814-818 (2005).
22. Palla G, Barabási A-L, Vicsek T. Quantifying social group evolution. Nature, 664 (2007).
23. Zhang X-C, et al. Combination of Droplet Extraction and Pico-ESI-MS Allows the Identification of Metabolites from Single Cancer Cells. Analytical Chemistry, 9897-9903 (2018).
24. Frank AM, et al. Clustering Millions of Tandem Mass Spectra. Journal of Proteome Research, 113-122 (2008).
25. Fruchterman TMJ, Reingold EM. Graph drawing by force-directed placement. Software: Practice and Experience21