Deep learning for peptide identification from metaproteomics datasets
Xuan Guo
Department of Computer Science and Engineering, University of North Texas, Denton, TX
Shichao Feng
Department of Computer Science and Engineering, University of North Texas, Denton, TX
September 24, 2020
Abstract
Metaproteomics is becoming widely used in microbiome research for gaining insights into the functional state of the microbial community. Current metaproteomics studies are generally based on high-throughput tandem mass spectrometry (MS/MS) coupled with liquid chromatography. The identification of peptides and proteins from MS data involves the computational procedure of searching MS/MS spectra against a predefined protein sequence database and assigning top-scored peptides to spectra. Existing computational tools are still far from being able to extract all the information out of large MS/MS datasets acquired from metaproteome samples. In this paper, we propose a deep-learning-based algorithm, called DeepFilter, for improving the rate of confident peptide identifications from a collection of tandem mass spectra. Compared with other post-processing tools, including Percolator, Q-ranker, PeptideProphet, and iProphet, DeepFilter identified 20% more peptide-spectrum matches and 10% more proteins on marine microbial and soil microbial metaproteome samples at a false discovery rate of 1%.
Metaproteomics is the analysis of protein samples from multiple organisms in a specific environment. Because of the significant role they play in nutrient cycling and the immune system, studies of complex microbial communities have gained increasing attention in recent years. Metaproteomics analysis is mostly based on the shotgun proteomics technique, which uses high-pressure liquid chromatography-tandem mass spectrometry (LC-MS/MS).
The data sets from complex microbial communities are worth analyzing, but some of their features impede MS-based metaproteomics. Because the samples are derived directly from the environment, the large number of microbial species [1] makes protein database searching complex and heterogeneous. As described in [2], a typical metaproteomics environment contains more than 1000 unique species, each with several hundred proteins, which yields a massive number of peptide sequences after digestion, so the computational effort and the memory needed to store the digested peptides increase accordingly. Besides, since the number of microbial species is far greater than the number of species recorded in the published protein databases [3], protein identification is difficult when the taxonomic composition of the expressed proteins is not covered by the existing database [4]. Also, homologous protein sequences digest into common peptide candidates, which causes redundant peptide identifications. That is, the common peptide candidates are scored against different mass spectra at random; as the size of the protein database and the number of spectra increase, the probability that a false-positive match receives a high PSM score increases.
In summary, the progress of shotgun proteomics, which has enhanced the resolution of tandem mass spectrometry, together with the challenges inherent to metaproteomics, creates a strong need to improve the efficiency of protein identification from MS-based data. We propose a deep learning method that post-processes the PSM candidates produced by a database-searching engine to improve the quality and efficiency of protein identification for metaproteomics. We hope this method can help recognize potential patterns in the mass spectra, reducing false-positive identifications and increasing protein identifications.
PREPRINT - SEPTEMBER 24, 2020 (arXiv, q-bio.QM)
In Section 2, we introduce the popular database-searching engines and post-processors. In that section, we also discuss prior work showing the potential of deep learning to improve protein identification performance. In the method section, we detail the whole workflow and architecture of our DeepFilter post-processing system and state the baselines used. In the result section, we present the performance on complex microbial communities as well as on a single organism, and compare our model with the baselines. Finally, we use class activation mappings to visualize the features learned by the deep learning model.
Mass-spectrometry-based metaproteomics has become a typical and effective technique for protein identification. Following the basic principles, proteins are digested, and the sequences from the database for the specific microbial community are converted into predicted peptides. The process of comparing and matching tandem mass spectrometry data against the predicted peptides is called database searching. Database-searching algorithms score these peptide-spectrum matches (PSMs) by comparing the experimental mass spectrum data with peptides from the database using mathematical or statistical methods. Comet [5] uses cross-correlation to calculate the score between the experimental spectrum and the theoretical peptide. Most other algorithms use probability-based scoring functions to assess the quality of PSMs: MyriMatch [6] uses a multivariate hypergeometric distribution to model random matches and stratifies the peak intensities, yielding more accurate scores; Andromeda [7] uses a binomial distribution as the basic scoring function.
Although database-searching engines work well, these algorithms still have considerable room for improvement. Many researchers have proposed methods to re-score the PSMs after database searching, an approach called post-processing. Publications on post-processing of PSM scores follow two main approaches. Many works use machine learning to re-score the PSMs: Percolator [8] trains semi-supervised support vector machines (SVMs) on features extracted from PSMs; Q-ranker [9], based on Percolator, expands the feature set used to train the SVM; Nokoi [10] splits the target and decoy data sets and uses logistic regression to train a post-processing model. The other popular approach is to re-score the PSMs with statistical models: iProphet uses the expectation-maximization algorithm to build statistical models iteratively.
Recently, as deep learning has become efficient and effective in pattern recognition, some researchers have proposed deep learning architectures for proteomics. DeepNovo [11] implements a sequence-to-sequence architecture that combines a CNN-based ion detection model with an RNN-based peptide sequence decoding model in order to generate de novo peptide sequences from tandem mass spectrometry data. DeepMatch [12] constructs a deep neural network that uses a Bi-LSTM as a mass spectrum encoder and a CNN as a fragment ion detection model to build a post-processing model for PSM identification in a single organism.
Deep learning models are therefore a feasible way to detect patterns between the mass spectra and the fragment ions of peptides. Since the increased resolution of mass spectra results in a sizable increase in the complexity of metaproteomics data, we expect the capacity of deep learning models to help detect the underlying patterns of PSMs.
In this section we discuss the details of our model, which we refer to as DeepFilter. DeepFilter consists of six major components; sub-figures A to F of the workflow in Figure 1 show how they relate to each other: representative training PSM candidate selection and data set construction (A), charge detection for the experimental spectrum (B), isotopic distribution generation (C), representation of the spectrum and of 11 extra PSM features (D, E), and the CNN-based deep learning model (F). In brief, we first used a database-searching engine, Comet, together with several post-processing tools to determine which tool gives the best PSM identification performance; we then constructed a representative training data set by applying a threshold on the posterior error probability. After that, the training data set was processed by the charge detection algorithm and the isotope distribution generation algorithm and converted into the spectrum representation that is fed into the CNN model, which we refer to as the spectrum encoder. The other features we used are attributes extracted from the corresponding PSM candidates; these 11 extra PSM features were fed into a fully connected layer, which we refer to as the PSM feature encoder. These two encoders are the core of our deep learning model. We explain each component in detail below.
Traditional methods identify PSMs by comparing the similarity between the peptide sequence and the experimental spectra using mathematical and statistical methods. Moreover, they use the target-decoy search strategy [13], which adds reversed protein sequences as decoys to the protein database, to select confident PSMs. This strategy estimates false discovery rates (FDR) and uses the FDR as a threshold to filter the most confident PSMs in protein identification. For this process, most widely used tools can digest the protein sequences and retrieve the most similar peptide sequence for each scan. In our experiment, we first used Comet to collect a set of top-scoring PSM candidates for each scan, and re-scored these PSMs with several widely used post-processing algorithms: Percolator [8], Q-ranker [9], PeptideProphet [14], and iProphet [14]. After comparing these post-processing tools, we selected the algorithm that identifies the most PSMs under the target-decoy strategy, which is Percolator in our experiment, as the post-processor used to generate the training data set.
Figure 1: The workflow of building DeepFilter
There are seven data sets in our experiments: three are metaproteomes from a marine microbial community [15] and three are from a soil community [16]; we also use an E. coli proteome to test whether our model also fits a single organism. One data set of the marine microbial community is used to train the model, and the other data sets are test data sets for the benchmark.
The training data was processed following the workflow described in the first paragraph: a first pass with Comet and a second pass with Percolator, collecting the top-5 scored PSM candidates for each scan. After removing the PSM candidates whose posterior error probabilities are more than 0.93, we obtained a training data set containing 926,253 PSM candidates. For the annotation, the top-1 PSM candidates that are target PSMs are labeled as positive PSMs; the top-1 PSM candidates that are decoys, and the remaining 4 of the top-5 PSM candidates, are labeled as negative PSMs.
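The annotation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PSM` record and its field names are hypothetical, and the posterior error probability (PEP) is assumed to be available per candidate after the Percolator pass.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    # Hypothetical record; field names are illustrative, not from the paper.
    scan: str
    peptide: str
    rank: int        # 1 = top-scoring candidate for this scan
    is_decoy: bool   # True if the peptide comes from a reversed (decoy) protein
    pep: float       # posterior error probability after post-processing


def build_training_set(psms, pep_cutoff=0.93, top_n=5):
    """Return (psm, label) pairs; label 1 = positive, 0 = negative."""
    labeled = []
    for psm in psms:
        # Keep only the top-N candidates per scan with PEP at or below the cutoff.
        if psm.rank > top_n or psm.pep > pep_cutoff:
            continue
        # Only a top-1 target PSM is positive; top-1 decoys and ranks 2..N are negative.
        label = 1 if (psm.rank == 1 and not psm.is_decoy) else 0
        labeled.append((psm, label))
    return labeled
```

The cutoff and the top-5 limit are exposed as parameters so that the same sketch can be tried with other thresholds.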
The intensity distribution in MS2 files is a mixture of different fragment ions of the peptide sequences. To capture patterns in different fragment ions, we first deconvolute the mass spectrum by applying a charge detection algorithm to the experimental mass spectra to detect the charge of the fragment ions. We use MaxQuant to process the MS2 files first to get the most redundant m/z-intensity pairs; these pairs are recorded in an APL format file. We reconstruct the experimental mass spectra by identifying the charge state of the most abundant peaks and storing them in a charge-to-m/z mapping dictionary for later processing. The algorithm is described below.
Algorithm 1: Charge detection algorithm
Data: MS2 file of each scan and APL format file
Result: plain text file containing the charge, m/z, and intensity (most redundant mappings) for each scan
Initialization: W_h: weight of hydrogen; W_n: weight of neutron
for scan in MS2 file do
    Initialization: Group_i, where i is the detected charge for the peak
    for m/z in scan do
        for charge in range (1 to 3) do
            detect_mz = mz * charge - (charge - 1) * W_h + j * W_n    // Discharge and consider the isotope
            if Search(detect_mz, APL file) is True then
                m/z-intensity ∈ Group_charge
            else
                m/z-intensity ∈ Group_None
    WriteOut(scan, Groups, Plain text)
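A minimal Python sketch of the charge detection loop follows. It is an interpretation rather than the authors' implementation: the mass constants, the matching tolerance `tol`, and the range of the isotope index `j` are assumptions, and a peak is assigned to the first charge state for which a match is found in the APL peak list, falling back to the `None` group otherwise.

```python
from bisect import bisect_left

W_H = 1.007825  # assumed hydrogen mass in Da
W_N = 1.008665  # assumed neutron mass in Da


def found(sorted_mz, target, tol):
    """True if any value in sorted_mz lies within tol of target."""
    i = bisect_left(sorted_mz, target - tol)
    return i < len(sorted_mz) and sorted_mz[i] <= target + tol


def detect_charges(peaks, apl_mz, tol=0.01, isotope_offsets=(0, 1, 2)):
    """Group (mz, intensity) peaks of one scan by detected fragment charge."""
    apl_sorted = sorted(apl_mz)
    groups = {1: [], 2: [], 3: [], None: []}
    for mz, intensity in peaks:
        assigned = None
        for charge in (1, 2, 3):
            for j in isotope_offsets:
                # Convert to the singly charged m/z and shift by j isotope steps.
                detect_mz = mz * charge - (charge - 1) * W_H + j * W_N
                if found(apl_sorted, detect_mz, tol):
                    assigned = charge
                    break
            if assigned is not None:
                break
        groups[assigned].append((mz, intensity))
    return groups
```

Sorting the APL values once and using a binary search keeps each lookup logarithmic in the size of the peak list, which matters when every peak of every scan is tested for three charge states.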
We used Sipros [17] to compute the isotope distributions for the peptide sequences of the training PSM candidates. For each peak, we group the distributions by fragment charge and ion type (in our experiment, only charge states 1, 2, and 3 and only B-ions and Y-ions are considered), and we kept the cumulative isotopic abundance below 98%. After that, for each PSM candidate, we combine its isotope distribution with the experimental mass spectrum obtained from the charge detection process to get the spectrum representation used for training.
Table 1: 11 extra PSM features used in DeepFilter
1     Xcorr        Cross correlation between calculated and observed spectra
2     ΔCn          Fractional difference between current and second best XCorr
3     ΔCln         Fractional difference between current and third best XCorr
4     Mass         The observed mass [M+H]+
5     DeltaM       The difference in calculated and observed mass
6     abs(ΔM)      The absolute value of the difference in calculated and observed mass
7     pepLen       The length of the matched peptide, in residues
8     enzInt       Number of missed internal enzymatic sites
9-11  charge 1-3   Three Boolean features indicating the charge state
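As an illustration of how the 11 features in Table 1 could be assembled into one input vector, consider the sketch below. The dictionary keys are hypothetical names for values reported by the search engine, and the sign convention for DeltaM (calculated minus observed) is an assumption.

```python
def psm_feature_vector(psm):
    """Build the 11-element feature list of Table 1 from a PSM record (a dict)."""
    charge = psm["charge"]
    delta_m = psm["calc_mass"] - psm["obs_mass"]  # assumed sign convention
    return [
        psm["xcorr"],               # 1  Xcorr
        psm["delta_cn"],            # 2  delta Cn
        psm["delta_cln"],           # 3  delta Cln
        psm["obs_mass"],            # 4  observed mass [M+H]+
        delta_m,                    # 5  DeltaM
        abs(delta_m),               # 6  abs(DeltaM)
        float(len(psm["peptide"])), # 7  pepLen
        float(psm["enz_int"]),      # 8  enzInt as reported by the search engine
        float(charge == 1),         # 9  charge 1
        float(charge == 2),         # 10 charge 2
        float(charge == 3),         # 11 charge 3
    ]
```

Encoding the charge state as three Boolean features avoids imposing an ordering on the charge values.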
As part B of Figure 1 shows, a tandem mass spectrum consists of pairs of m/z and intensity; for the theoretical spectrum (Figure 1, part C), the isotopic distribution of the different fragment ions is given as the theoretical m/z and the abundance probability of each isotope. After the fragment charge detection for the experimental mass spectrum and the calculation of the isotopic distribution for the PSM candidates, as shown in Figure 2, we get four groups for the experimental spectrum (three charge states plus the peaks whose charge state is not detected) and six groups for the isotopic distribution (three charge states times two ion types). In DeepFilter, to integrate the spectrum intensities with the isotopic distribution, the intensities and the isotope abundance probabilities are discretized. Then, combined with the charge and ion type identified for the fragment ions, we construct the spectrum representation matrix that is fed into the CNN model. The representation of the 11 extra PSM features is calculated after database searching and is encoded with a fully connected layer. The details of these two representations are given below.
Spectrum representation
Our spectrum representation is a matrix constructed from the charge-annotated peaks of the experimental mass spectra and the charge-annotated fragment ions of the isotope distributions. We encode the spectrum representation by using the matrix index to indicate the m/z, ion type, and charge state; the details are shown in Figure 2. For each PSM candidate, we use 0.5 Da as the resolution parameter, keep the m/z values from 100 Da to 1900 Da, and drop the rest. We initialize an 8*3600 matrix with zeros, then scan the sorted m/z value lists group by group (four groups for the experimental mass spectrum and six groups for the isotopic distribution) for the specific PSM candidate. During the scan, for each pair M_i/Z_i, we calculate the column index as index = (M_i - M_min) / resolution. The intensity is written at that index into one of the first three rows for fragment charges 1, 2, and 3, and the intensity of a pair whose fragment charge is not identified is written into the fourth row. The abundance probabilities of the isotopes fill the remaining rows according to the combination of charge state and ion type (charge=1, Y-ion; charge=1, B-ion; charge=2, Y-ion; charge=2, B-ion; charge=3, Y-ion; charge=3, B-ion). After L2 normalization, the matrix is fed to the classification model for training.
Representation of 11 extra PSM features
For each PSM candidate, we also extracted 11 features based on Comet for the later classification architecture. These 11 additional features were chosen after examining the weight of each feature used by Percolator and Q-ranker. The features are shown in Table 1.
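The binning step of the spectrum representation can be sketched with NumPy as follows. This is a minimal illustration, not the authors' code. The number of rows is taken from the number of groups passed in, because the text states an 8*3600 matrix while the enumeration lists four experimental and six isotopic groups; keeping the maximum value when several peaks fall into one bin and normalizing over the whole matrix are both assumptions.

```python
import numpy as np

MZ_MIN, MZ_MAX, RESOLUTION = 100.0, 1900.0, 0.5
N_BINS = int((MZ_MAX - MZ_MIN) / RESOLUTION)  # 3600 columns


def spectrum_matrix(groups):
    """groups: one list of (mz, value) pairs per matrix row.

    value is an intensity for experimental rows and an isotope
    abundance probability for theoretical rows.
    """
    mat = np.zeros((len(groups), N_BINS), dtype=np.float32)
    for row, peaks in enumerate(groups):
        for mz, value in peaks:
            if MZ_MIN <= mz < MZ_MAX:  # drop peaks outside 100-1900 Da
                idx = int((mz - MZ_MIN) / RESOLUTION)
                # Assumption: keep the largest value if several peaks share a bin.
                mat[row, idx] = max(mat[row, idx], value)
    norm = np.linalg.norm(mat)  # L2 norm over the whole matrix (assumed)
    return mat / norm if norm > 0 else mat
```

For example, a peak at 100.0 Da maps to column 0 and a peak at 1899.9 Da maps to column 3599.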
We constructed a deep learning architecture that includes two encoders. The first consists of four convolutional layers and two fully connected layers; this encoder is called the spectrum encoder. The spectrum encoder is used to detect the pattern of the top-scoring spectrum across different fragment ions and charge states. The other encoder consists of a single fully connected layer that acts as a pattern recognizer for the extra features; it is called the PSM feature encoder below. The input of the spectrum encoder is the spectrum representation, and the input of the PSM feature encoder is the 11 extra PSM features. We then concatenate the outputs of these two encoders and pass them through a fully connected layer to get the probability that the PSM candidate is a target rather than a decoy PSM. The architecture of our model is described in Figure 2, and each encoder is detailed below.
Spectrum encoder
The spectrum encoder is a deep neural network model with four convolutional layers and two fully connected layers. To capture the pattern of high-scoring target PSM candidates, we used small kernel sizes to recognize the fragment ions. Each of the four convolutional layers uses 16 kernels, and the layers use different kernel sizes, namely (4,7), (2,11), (2,9), and (2,9), to capture the unsupervised features given by the spectrum. Because of the high dimensionality after each convolution operation, we used max-pooling with a (1,3) kernel to keep the most heavily weighted features. To speed up training and avoid over-fitting, we also apply batch normalization to each convolutional layer and add a dropout layer with a drop probability of 0.5 after the last convolutional layer. In the first fully connected layer, the input dimension is 3076 and there are 1024 hidden units. The second fully connected layer has an input dimension of 1024 and outputs a 512-dimensional vector, which is activated by the ReLU function and used as the representation of the spectrum encoder for the later classification model.
Figure 2: Data pre-processing and architecture
PSM feature encoder
For each PSM candidate, several PSM-related properties can be calculated after the Comet search. As described above, we fed the 11 features obtained from Comet for each sample into a single fully connected layer. The input dimension of this layer is 11, and after activation by the ReLU function the output is a 512-dimensional vector. This vector was used as the representation of the PSM feature encoder.
Scoring model
The representations from the spectrum encoder and the PSM feature encoder were then concatenated into a 1024-dimensional vector and fed into another fully connected layer with the softmax activation function. The output is a probability between 0 and 1 that predicts whether a PSM candidate is a target PSM.
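The scoring step can be illustrated with a small NumPy sketch. It assumes that the final layer has two output units (negative, target) and that the target probability is read from the second unit; the weights `W` and `b` stand in for trained parameters and are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def score(spec_repr, feat_repr, W, b):
    """Concatenate the two 512-d encoder outputs and return P(target).

    spec_repr, feat_repr: arrays of shape (batch, 512)
    W: (1024, 2) weight matrix, b: (2,) bias (assumed two-class output)
    """
    x = np.concatenate([spec_repr, feat_repr], axis=-1)  # (batch, 1024)
    probs = softmax(x @ W + b)                           # (batch, 2)
    return probs[:, 1]
```

With all-zero weights the two classes are equally likely, so every candidate receives a score of 0.5, which is a convenient sanity check for the wiring.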
Loss function
As described in the section on data set construction, the data set is annotated as positive or negative, which means the scoring model is a binary classifier. However, the labels are derived from the scores that Comet assigns to the PSM candidates and are not ground-truth labels. To help the model detect more target PSMs, we apply a modified cross-entropy loss function that incorporates the posterior error probability (PEP) obtained after filtering by Percolator. We take p_i as the probability of being correct calculated from the PEP and t_i as the predicted probability. The loss function in Equation 1 is then used to update the weights of the model.
\[ \mathrm{Loss} = -\sum_i \left[ p_i \log t_i + (1 - p_i) \log(1 - t_i) \right] \tag{1} \]
[Table 2: composition of the data sets. Columns: marine 1, marine 2, marine 3, soil 1, soil 2, soil 3, ecoli. The table body is missing.]
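The loss in Equation 1 is a cross-entropy with soft targets, and it can be written in NumPy as below. The mapping p_i = 1 - PEP_i is an assumption made for this sketch (the text only says that p_i is calculated from the PEP), and the clipping constant `eps` is added for numerical safety.

```python
import numpy as np


def soft_label_cross_entropy(pep, t, eps=1e-7):
    """Equation 1 with soft targets derived from posterior error probabilities.

    pep: PEP of each PSM candidate; t: predicted target probability.
    """
    p = 1.0 - np.asarray(pep, dtype=float)  # assumption: p_i = 1 - PEP_i
    t = np.clip(np.asarray(t, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return float(-np.sum(p * np.log(t) + (1.0 - p) * np.log(1.0 - t)))
```

Compared with hard 0/1 labels, a candidate with an intermediate PEP contributes to both terms, so uncertain annotations pull the prediction less strongly toward either class.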
We evaluated the performance of DeepFilter against four other post-processing tools that are state of the art in different shotgun proteomics tasks: Percolator [8], Q-ranker [9], PeptideProphet [18], and iProphet [14]. Percolator, Q-ranker, and PeptideProphet all use Comet as the upstream search algorithm. iProphet takes as input the PSM candidates that PeptideProphet has processed from the database-searching engine. Because iProphet uses peptide-level and protein-level features to improve the PSM-level result, its filtering at those two levels is not independent of the PSM-level filtering. We therefore also ran iProphet with its 5 statistical models disabled to evaluate its performance from a different perspective.
The data sets to be evaluated are the six test data sets from marine and soil microbial communities, as introduced in the method section. The composition of the data sets is shown in Table 2. First, we use Comet as the search engine to get a set of confident peptide sequences. Then we post-process the peptide sequences with the benchmark post-processors above. To measure each post-processor's performance, we count the identified PSMs, peptides, and proteins. For every spectrum, only the PSM candidate with the highest score is regarded as the PSM for this spectrum; we then set different FDR values as thresholds to compare the post-processors. The FDR is calculated as shown in Equation 2, where Target is the number of target PSMs and Decoy is the number of decoy PSMs.
\[ \mathrm{FDR} = \frac{\mathrm{Decoy}}{\mathrm{Target}} \tag{2} \]
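The FDR-based filtering described above can be sketched as a simple threshold search, taking the FDR as the ratio of decoy to target PSM counts among the accepted matches. The input format, a list of `(score, is_decoy)` pairs with one top-scoring PSM per spectrum, is an assumption of this example.

```python
def accept_at_fdr(psms, fdr_threshold=0.01):
    """Return the number of target PSMs accepted at the given FDR.

    psms: iterable of (score, is_decoy); a higher score means a better match.
    """
    ranked = sorted(psms, key=lambda x: x[0], reverse=True)
    targets = decoys = 0
    best_targets = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Accept the longest prefix whose decoy/target ratio stays within the threshold.
        if targets > 0 and decoys <= fdr_threshold * targets:
            best_targets = targets
    return best_targets
```

Scanning the whole ranked list, rather than stopping at the first violation, returns the largest accepted set, since the ratio can drop back below the threshold as more targets accumulate.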
DeepFilter is applied to five data sets of complex microbial communities. We compare its peptide identification results with the Comet database-searching algorithm and the four existing post-processing algorithms (Percolator, Q-ranker, PeptideProphet, and iProphet) at the PSM, peptide, and protein levels with the FDR threshold equal to 1%. The result is shown in Table 3, where the row iProphet-NON represents the experiment with all the statistical models disabled; the bold entry is the best result for the specific data set, and the underlined entry is the second best. We also applied the model to a single organism, E. coli, at the three identification levels with the FDR equal to 1%. The result is shown in Table 4.
Table 3 shows the number of identifications at each level when accepting a false discovery rate of 1%, which allows the different post-processors based on Comet to be compared. At the PSM and peptide levels, our model and iProphet always achieve the top-2 identification counts: our model performs best at the PSM level on the marine data sets, and iProphet performs best at the peptide level on the soil1 and soil3 data sets. However, although iProphet shows better performance at the PSM and peptide levels in some cases, this increase does not help improve protein identification. A likely reason is that the PSM level and the protein level are not independent for iProphet, which uses some protein-level features to improve PSM identification. Our model always performs best at the protein level at FDR 1%; the improvement fraction is shown in Table 5. The "Second best" column of that table shows the second-best number of target proteins detected at the protein level within FDR 1%. Compared with the second-best algorithm, our model mostly achieves more than 5% improvement; the exception is the soil1 data set, where there is still a slight improvement. For the single organism E. coli, DeepFilter also achieves protein identification performance comparable to the benchmark post-processors at the different identification levels with the FDR equal to 1%.
                 Marine 2   Marine 3   Soil 1     Soil 2     Soil 3
PSM identification at PSM level within FDR 1%
Comet            31822      38490      79505      75693      72454
Percolator       34741      41714      88037      84623      81331
Q-ranker         33899      40832      86433      82773      79006
PeptideProphet   30670      37072      73821      71281      68067
iProphet         38476      44588      [missing]  [missing]  [missing]
iProphet-NON     30846      37304      75360      73331      70121
DeepFilter       [missing]  [missing]  [missing]  [missing]  [missing]
[section heading and preceding rows missing]
iProphet-NON     21696      24661      25403      22775      19922
DeepFilter       [missing]  [missing]  [missing]  [missing]  [missing]
Table 3: Identification performance using five real-world metaproteomes at FDR 1%

                 PSM        Peptide    Protein
Comet            504466     30944      2147
Percolator       508017     31179      2167
Q-ranker         507425     30829      2165
PeptideProphet   493686     30014      2060
iProphet         498810     31201      2041
iProphet-NON     495353     30037      2057
DeepFilter       [missing]  [missing]  [missing]
Table 4: Identification performance using E. coli data set at FDR 1%

           Second best   DeepFilter   Improvement
marine2    7715          8313         7.19%
marine3    8209          8851         7.25%
soil1      7756          8069         3.88%
soil2      7519          8041         6.49%
soil3      6387          6976         8.44%
Table 5: Improvement fraction for complex microbial communities at protein level within FDR 1%
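The percentages in Table 5 can be reproduced with a few lines of Python. The formula below is inferred from the numbers rather than stated in the text: the reported values match the difference divided by the DeepFilter count, not by the second-best count.

```python
# Protein counts from Table 5: (second best, DeepFilter)
table5 = {
    "marine2": (7715, 8313),
    "marine3": (8209, 8851),
    "soil1": (7756, 8069),
    "soil2": (7519, 8041),
    "soil3": (6387, 6976),
}


def improvement(second_best, deepfilter):
    """Improvement fraction relative to the DeepFilter count (inferred convention)."""
    return (deepfilter - second_best) / deepfilter


for name, (second_best, deepfilter) in table5.items():
    print(f"{name}: {100 * improvement(second_best, deepfilter):.2f}%")
```

Dividing by the second-best count instead would give slightly larger percentages for every data set.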
We vary the FDR from 0.2% to 10% to compare our model with the database-searching engine and the other post-processing algorithms. Figures 3, 4, and 5 show the results of these experiments, which were conducted on the data sets described above to identify peptides at FDR values from 0.2% to 10% at the PSM, peptide, and protein levels. For complex microbial communities, iProphet and DeepFilter generally hold the top-2 positions at the PSM and peptide levels; iProphet in particular shows a larger gain on the soil community data sets than on the other data sets. However, because it relies on protein-level features to build its statistical model, iProphet does not perform well at the protein level. At the protein level, DeepFilter, Percolator, and Q-ranker hold the top-3 positions across all acceptable FDR values. DeepFilter always performs best except on the soil1 data set from FDR 5% to 10%. Considering that the training data set consists only of the marine microbial community, this may be caused by a lack of representative training samples, so that some potential patterns are not learned during training. For disparate data sets from different communities, fine-tuning our model on the new data set may yield a well-performing post-processor within a short training time. For the single organism, Figures 3, 4, and 5 show that on the E. coli data set our model achieves results very similar to the best post-processor among the baselines. That is, DeepFilter is also applicable to single organisms.
Much work has studied how to improve peptide identification after database searching with different machine learning methods. Percolator [8] and Q-ranker [9] use the SVM algorithm; PeptideProphet [19] applies expectation-maximization (EM) to estimate the peptide identification probability. However, most of them construct and train on the data sets in a semi-supervised fashion: they select positive PSMs by controlling the FDR, use the set of positive PSMs combined with all the decoy PSMs to train the machine learning model, and then apply that model to the whole data set to re-score the PSM candidates. As a result, part of the PSMs being re-scored have already been seen during training, and users must retrain a new model every time they post-process and re-score a set of PSM candidates. With DeepFilter, the experiments show that the model generalizes well to unseen data sets; the model can be used on another data set, which suggests that the unsupervised feature learning of the CNN recognizes potential patterns.
With the development of deep learning, more research has been published on applying deep neural networks to detect patterns in tandem MS data in proteomics. For example, DeepMatch [12] uses VGG-16 to recognize patterns between tandem MS data and the encoded amino acid sequence to increase the number of PSMs, and uses three well-known single organisms (yeast, human, mouse) to evaluate the model; DeepNovo [11] uses a sequence-to-sequence model built from a CNN and an LSTM to generate de novo sequences.
In contrast to the existing algorithms above, we trained a CNN-based deep learning model to extract potential information from MS2 data and from the peptide sequences returned by the database-searching engine, and we use that model to estimate PSM probabilities in new data sets.
We did not use a semi-supervised scheme that trains the model on a subset each time the search engine results are post-processed; once a model is trained, we directly use it without any fine-tuning to estimate the peptide probability for new data sets. Compared with the DeepMatch PSM identification model, our model has a simpler architecture and fewer neural network parameters, making it more feasible to deploy. Besides, we evaluate our model not just at the PSM level with a specific acceptable FDR value; peptide-level and protein-level inference are also tested. Our model also achieves better performance in complex microbial communities; we focus on metaproteomics, while DeepMatch analyzes single organisms. Some ensemble approaches show excellent PSM identification results: MSBlender [20] uses a probabilistic approach, while Sipros Ensemble [21] uses the logistic regression algorithm to integrate peptide identifications from multiple database-searching engines. Nevertheless, this is a different line of research, which is not comparable to our approach.
Figure 3: PSM performance within different FDR. Panels: (a) marine2, (b) marine3, (c) soil1, (d) soil2, (e) soil3, (f) ecoli.
Figure 4: Peptide performance within different FDR. Panels: (a) marine2, (b) marine3, (c) soil1, (d) soil2, (e) soil3, (f) ecoli.
Figure 5: Protein performance within different FDR. Panels: (a) marine2, (b) marine3, (c) soil1, (d) soil2, (e) soil3, (f) ecoli.
To mine the patterns and visualize the features learned by the unsupervised feature engineering of our deep learning architecture, we adopted a class activation mapping (CAM) generation technique [22] to help us interpret the decisions of the CNN model. A CAM shows the most influential image region, that is, the region that contributes most to predicting a particular category. In our experiment, we apply this algorithm to our spectrum representation to visualize the patterns used to predict a target PSM.
In Figure 6 and Figure 7, we present four CAMs each for target and decoy PSMs, respectively. In the figures, white points indicate mass spectrum data and the corresponding intensity at that position. The color represents the importance that the learned CNN weights assign to the area: the brighter the background color, the more the region contributes to predicting a target PSM. The rectangle with red lines marks a region that shows an obvious ion matching pattern between the MS data and the theoretical data.
Figure 6 presents the CAMs for target PSMs, with one sub-figure per PSM. We can see that the most significant part (red region) learned by the CNN covers the fragment ions. Figure 7 shows the CAMs for decoy PSMs; the sub-figures present different situations in which the features learned by the CNN do not contribute to predicting a target PSM. Sub-figure (a) shows a low-weighted region (blue region) for detecting fragment ion matches in the ions' region; sub-figure (b) has a light class mapping, but it fails to cover the ion positions; sub-figure (c) has a large red region, but there is no usable MS data in it; and sub-figure (d) has dense mass spectrum data covered by the CAM, but the data do not match.
Figure 6: Class activation mappings of target PSMs. Panels: (a), (b), (c), (d).
Figure 7: Class activation mappings of decoy PSMs. Panels: (a), (b), (c), (d).
References
[1] Andreas Schlüter, Thomas Bekel, Naryttza N Diaz, Michael Dondrup, Rudolf Eichenlaub, Karl-Heinz Gartemann, Irene Krahn, Lutz Krause, Holger Krömeke, Olaf Kruse, et al. The metagenome of a biogas-producing microbial community of a production-scale biogas plant fermenter analysed by the 454-pyrosequencing technology. Journal of Biotechnology, 136(1-2):77–90, 2008.
[2] Ngom Issa Isaac, Decloquement Philippe, Armstrong Nicholas, Didier Raoult, and Chabriere Eric. Metaproteomics of the human gut microbiota: Challenges and contributions to other omics. Clinical Mass Spectrometry, 14:18–30, 2019.
[3] Kenneth J Locey and Jay T Lennon. Scaling laws predict global microbial diversity. Proceedings of the National Academy of Sciences, 113(21):5970–5975, 2016.
[4] Robert Heyer, Kay Schallert, Roman Zoun, Beatrice Becher, Gunter Saake, and Dirk Benndorf. Challenges and perspectives of metaproteomic data analysis. Journal of Biotechnology, 261:24–36, 2017.
[5] Jimmy K Eng, Tahmina A Jahan, and Michael R Hoopmann. Comet: an open-source MS/MS sequence database search tool. Proteomics, 13(1):22–24, 2013.
[6] David L Tabb, Christopher G Fernando, and Matthew C Chambers. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. Journal of Proteome Research, 6(2):654–661, 2007.