Deep learning for peptide identification from metaproteomics datasets

Abstract

Metaproteomics is becoming widely used in microbiome research for gaining insights into the functional state of microbial communities. Current metaproteomics studies are generally based on high-throughput tandem mass spectrometry (MS/MS) coupled with liquid chromatography. Identifying peptides and proteins from MS data involves searching MS/MS spectra against a predefined protein sequence database and assigning the top-scoring peptides to spectra. Existing computational tools are still far from able to extract all the information from the large MS/MS datasets acquired from metaproteome samples. In this paper, we propose a deep-learning-based algorithm, called DeepFilter, for improving the rate of confident peptide identifications from a collection of tandem mass spectra. Compared with other post-processing tools, including Percolator, Q-ranker, PeptideProphet, and iProphet, DeepFilter identified 20% and 10% more peptide-spectrum matches and proteins, respectively, on marine microbial and soil microbial metaproteome samples at a false discovery rate of 1%.


DEEP LEARNING FOR PEPTIDE IDENTIFICATION FROM METAPROTEOMICS DATASETS

Xuan Guo

Department of Computer Science and Engineering, University of North Texas, Denton, TX


Shichao Feng

Department of Computer Science and Engineering, University of North Texas, Denton, TX


September 24, 2020

PREPRINT - SEPTEMBER 24, 2020 (arXiv, q-bio.QM)

Introduction

Metaproteomics is the analysis of protein samples from multiple organisms in a specific environment. Because of the significant role they play in nutrient cycling and the immune system, complex microbial communities have gained increasing attention in recent years. Metaproteomics analysis is mostly based on the shotgun proteomics technique, which uses high-pressure liquid chromatography-tandem mass spectrometry (LC-MS/MS).

Complex microbial community data sets are worth analyzing, but some of their features impede MS-based metaproteomics. Because the samples are derived directly from the environment, the large number of microbial species [1] makes protein database searching complex and heterogeneous. As [2] describes, a typical metaproteomics environment contains more than 1000 unique species, each with several hundred proteins; digestion therefore yields a massive number of peptide sequences, which increases the computational effort and memory needed to store the digested peptides. In addition, since the number of microbial species is far greater than the number of species recorded in published protein databases [3], protein identification is difficult when the taxonomic composition of the expressed proteins is not recorded in the existing database [4]. Homologous protein sequences also digest into common peptide candidates, which causes redundant peptide identifications. That is, the common peptide candidates are scored against different mass spectra at random; as the size of the protein database and the number of spectra increase, so does the probability that a false-positive match receives a high PSM score.

Thus, progress in shotgun proteomics, which has raised the resolution of tandem mass spectrometry, together with the challenges inherent to metaproteomics, creates a pressing need to improve the efficiency of protein identification from MS-based data. We propose a deep learning method that post-processes the PSM candidates produced by a database-searching engine in order to improve the quality and efficiency of protein identification for metaproteomics. We hope this method can help recognize latent patterns in the mass spectrum, reducing false-positive identifications and increasing protein identifications.

In the next section, we introduce the popular database-searching engines and post-processors and discuss prior work showing the potential of deep learning to improve protein identification. In the method section, we detail the workflow and architecture of our DeepFilter post-processing system and state the baselines used. In the results section, we present the performance on complex microbial communities as well as a single organism and compare our model with the baselines. Finally, we use class activation mappings to visualize the features learned by the deep learning model.

Related work

Mass-spectrometry-based metaproteomics has become a standard and effective technique for protein identification. In the basic procedure, proteins are digested, and the sequences in the database for the specific microbial community are converted into predicted peptides. Database searching is the process of comparing tandem mass spectrometry data with the predicted peptides to find matches. Database-searching algorithms score these peptide-spectrum matches (PSMs) by comparing the experimental mass spectra with peptides from the database using mathematical or statistical methods. Comet [5] uses cross-correlation to score an experimental spectrum against a theoretical peptide. Most other algorithms use probability-based scoring functions to assess the quality of PSMs: MyriMatch [6] uses a multivariate hypergeometric distribution to model random matches and stratifies peaks by intensity, yielding more accurate scores; Andromeda [7] uses a binomial distribution as its basic scoring function.

Although database-searching engines work well, they still have considerable room for improvement. Many researchers have proposed methods that re-score the PSMs after the database search, a step called post-processing. Publications on post-processing of PSM scores follow two main approaches. Many works use machine learning to re-score the PSMs: Percolator [8] trains semi-supervised support vector machines (SVMs) on features extracted from PSMs; Q-ranker [9], based on Percolator, expands the feature set used to train the SVM; Nokoi [10] splits the target and decoy data sets and uses logistic regression to train a post-processing model. The other popular approach is to re-score the PSMs with statistical models: iProphet uses the expectation-maximization algorithm to build statistical models iteratively.

Recently, as deep learning has become efficient and effective in pattern recognition, some researchers have proposed deep learning architectures for proteomics. DeepNovo [11] implements a sequence-to-sequence architecture that combines a CNN-based ion detection model with an RNN-based peptide sequence decoding model to generate de novo peptide sequences from tandem mass spectrometry data. DeepMatch [12] constructs a deep neural network that uses a Bi-LSTM as a mass spectrum encoder and a CNN as a fragment ion detection model to build a post-processing model for PSM identification in a single organism.

Deep learning models can feasibly detect patterns between mass spectra and the fragment ions of peptides. Since higher mass spectral resolution brings a sizable increase in the complexity of metaproteomics data, we hope the power of deep learning helps detect latent patterns in PSMs.

Method

This section describes our model, which we refer to as DeepFilter. DeepFilter consists of six major components; sub-figures A to F of the workflow in Figure 1 show how they relate: selection of representative training PSM candidates and data set construction (A), charge detection for the experimental spectrum (B), isotopic distribution generation (C), representation of the spectrum and of 11 extra PSM features (D, E), and the CNN-based deep learning model (F). In brief, we first ran a database-searching engine, Comet, followed by several post-processing tools, to determine which tool gives the best PSM identification performance; we then constructed a representative training data set by applying a threshold on the posterior error probability. Next, the training data set was processed by the charge detection algorithm and the isotopic distribution generation algorithm and converted into a spectrum representation that is fed into a CNN model, referred to as the spectrum encoder. The other inputs are attributes extracted from the corresponding PSM candidates; these 11 extra PSM features are fed into a fully connected layer, referred to as the PSM feature encoder. These two encoders are the core of our deep learning model. Each component is explained in detail below.

Data set construction

Traditional methods identify PSMs by measuring the similarity between a peptide sequence and an experimental spectrum using mathematical and statistical methods. They also use the target-decoy search strategy [13], which adds reversed protein sequences to the protein database as decoys, to select confident PSMs; this strategy estimates the false discovery rate (FDR) and uses an FDR threshold to filter the most confident PSMs for protein identification. For this process, most widely used tools first digest the protein sequences and then find the most similar peptide sequence for each scan. In our experiment, we first used Comet to collect a set of top-scoring PSM candidates for each scan and re-scored these PSMs with several widely used post-processing algorithms: Percolator [8], Q-ranker [9], PeptideProphet [14], and iProphet [14]. After evaluating these tools, we selected the algorithm that identifies the most PSMs under the target-decoy strategy, which was Percolator in our experiment, as the post-processor used to generate the training data set.

Figure 1: The workflow of building DeepFilter.

There are seven data sets in our experiments: three are metaproteomes from a marine microbial community [15], three are from a soil community [16], and one is an E. coli proteome used to test whether our model also suits a single organism. One marine microbial community data set is used to train the model, and the other data sets are test data sets for the benchmark.

The training data was processed following the workflow described above: a first pass with Comet and a second pass with Percolator, collecting the top-5 scored PSM candidates for each scan. After removing the PSM candidates whose posterior error probabilities are greater than 0.93, we obtained a training data set containing 926,253 PSM candidates. For annotation, top-1 PSM candidates that are target PSMs are labeled as positive PSMs; top-1 PSM candidates that are decoys, and the remaining four of the top-5 PSM candidates, are labeled as negative PSMs.
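The filtering and labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `PsmCandidate` record, its field names, and the `build_training_set` function are hypothetical, and the sketch assumes each candidate already carries its per-scan rank and the posterior error probability reported by the post-processor.

```python
from dataclasses import dataclass


@dataclass
class PsmCandidate:
    # Hypothetical record; field names are illustrative only.
    scan_id: str
    rank: int        # 1 = top-scoring candidate for the scan
    is_decoy: bool   # True if the peptide comes from a reversed (decoy) protein
    pep: float       # posterior error probability from the post-processor


def build_training_set(candidates, pep_cutoff=0.93, top_n=5):
    """Return (candidate, label) pairs, where 1 = positive and 0 = negative."""
    labeled = []
    for cand in candidates:
        # Keep only the top-N candidates per scan and drop those above the PEP cutoff.
        if cand.rank > top_n or cand.pep > pep_cutoff:
            continue
        # Only a top-1 target PSM is positive; top-1 decoys and ranks 2..N are negative.
        label = 1 if (cand.rank == 1 and not cand.is_decoy) else 0
        labeled.append((cand, label))
    return labeled
```

With this rule, at most one candidate per scan is positive, so the resulting training set contains far more negatives than positives.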

Charge detection for experimental spectra

The intensity distribution in an MS2 file is a mixture of the different fragment ions of the peptide sequences. To capture patterns in the different fragment ions, we first deconvolute each experimental mass spectrum by applying a charge detection algorithm that determines the charge of its fragment ions. We first use MaxQuant to process the MS2 files and obtain the most abundant m/z-intensity pairs, which are recorded in an APL-format file. We then reconstruct the experimental mass spectra by identifying the charge state of the most abundant peaks and storing them in a charge-to-m/z mapping dictionary for later processing. The algorithm is described in Algorithm 1.

Algorithm 1: Charge detection algorithm

Data: MS2 file for each scan and APL-format file
Result: plain-text file containing the charge, m/z, and intensity (most abundant mappings) for each scan

Initialization: W_h: weight of hydrogen; W_n: weight of neutron
for scan in MS2 file do
    Initialization: Group_i, where i is the detected charge of the peak
    for m/z in scan do
        for charge in range (1 to 3) do
            detect_mz = mz * charge − (charge − 1) * W_h + j * W_n    // discharge and consider the isotope
            if Search(detect_mz, APL file) is True then
                m/z-intensity ∈ Group_charge
            else
                m/z-intensity ∈ Group_None
    WriteOut(scan, Groups, plain text)
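Algorithm 1 can be sketched in Python as follows. Several details are assumptions made for illustration and are not given in the text: the numeric values of `W_H` and `W_N`, the matching tolerance `tol`, the range of the isotope offset `j` (0 to `max_isotope`), and the representation of the APL file as a plain list of de-charged peak masses. The sketch also places a peak in the `None` group only when no charge from 1 to 3 yields a match, which is one reading of the pseudocode.

```python
from bisect import bisect_left

W_H = 1.00728  # assumed value for the "weight of hydrogen" (proton mass, Da)
W_N = 1.00335  # assumed value for the "weight of neutron" (isotope spacing, Da)


def found(sorted_masses, target, tol):
    """True if any reference mass lies within +/- tol of target."""
    i = bisect_left(sorted_masses, target - tol)
    return i < len(sorted_masses) and sorted_masses[i] <= target + tol


def detect_charges(peaks, apl_masses, max_isotope=2, tol=0.01):
    """Group (mz, intensity) peaks of one scan by detected fragment charge.

    peaks: list of (mz, intensity) pairs from the MS2 scan.
    apl_masses: de-charged peak masses for the same scan (assumed APL content).
    """
    reference = sorted(apl_masses)
    groups = {1: [], 2: [], 3: [], None: []}
    for mz, intensity in peaks:
        assigned = None
        for charge in (1, 2, 3):
            for j in range(max_isotope + 1):
                # Discharge the peak and allow for the j-th isotope offset.
                detect_mz = mz * charge - (charge - 1) * W_H + j * W_N
                if found(reference, detect_mz, tol):
                    assigned = charge
                    break
            if assigned is not None:
                break
        groups[assigned].append((mz, intensity))
    return groups
```

The returned dictionary plays the role of the charge-to-m/z mapping that the later spectrum representation consumes.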

Isotopic distribution generation

We used Sipros [17] to compute the isotopic distributions for the peptide sequences of the training PSM candidates. For each peak, we group the distributions by fragment charge and ion type (for the charge state, only charges 1, 2, and 3 are considered; for the ion type, only B-ions and Y-ions are considered), and we keep the cumulative isotopic abundance below 98%. Then, for each PSM candidate, we combine its isotopic distribution with the charge-detected experimental mass spectrum to obtain the spectrum representation used in the later training process.
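The grouping and the 98% cap can be illustrated with the short Python sketch below. It does not call Sipros; it assumes the isotopic envelope of each fragment ion has already been computed and is passed in as a list of (m/z, abundance) pairs. The text does not say in which order isotope peaks are accumulated, so the sketch assumes ascending m/z and always keeps the first peak. The function names are hypothetical.

```python
from collections import defaultdict


def truncate_isotopes(envelope, max_cumulative=0.98):
    """Keep leading isotope peaks while the cumulative abundance stays within the cap.

    envelope: list of (mz, abundance) in ascending m/z, abundances summing to about 1.
    """
    kept, total = [], 0.0
    for mz, abundance in envelope:
        # Stop before the peak that would push the total past the cap,
        # but always keep at least the first peak.
        if kept and total + abundance > max_cumulative:
            break
        kept.append((mz, abundance))
        total += abundance
    return kept


def group_fragments(fragments, max_cumulative=0.98):
    """Group truncated envelopes by (charge, ion type).

    fragments: iterable of (ion_type, charge, envelope); only b/y ions with
    charge 1 to 3 are retained, as in the text.
    """
    groups = defaultdict(list)
    for ion_type, charge, envelope in fragments:
        if ion_type in ("b", "y") and charge in (1, 2, 3):
            groups[(charge, ion_type)].extend(
                truncate_isotopes(envelope, max_cumulative)
            )
    return dict(groups)
```

The six possible keys of the result correspond to the six theoretical groups used later in the spectrum representation.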

Table 1: The 11 extra PSM features used in DeepFilter

No.    Feature       Description
1      Xcorr         Cross-correlation between calculated and observed spectra
2      ∆Cn           Fractional difference between current and second-best Xcorr
3      ∆Cln          Fractional difference between current and third-best Xcorr
4      Mass          The observed mass [M+H]+
5      DeltaM        The difference between calculated and observed mass
6      abs(∆M)       The absolute value of the difference between calculated and observed mass
7      pepLen        The length of the matched peptide, in residues
8      enzInt        The number of missed internal enzymatic sites
9-11   charge 1-3    Three Boolean features indicating the charge state

Representation of the spectrum and the extra PSM features

As part B of Figure 1 shows, a tandem mass spectrum consists of pairs of m/z and intensity; in the theoretical spectrum (Figure 1, part C), the isotopic distribution of each fragment ion is given as theoretical m/z values and the abundance probabilities of the isotopes. After fragment charge detection for the experimental mass spectrum and calculation of the isotopic distributions for the PSM candidates, as shown in Figure 2, we have four groups of experimental peaks (one per charge state from 1 to 3, plus the peaks whose charge state is not detected) and six groups of isotopic distributions (three charge states by two ion types). To integrate the spectrum intensities with the isotopic distributions, DeepFilter discretizes the peak intensities and the isotope abundance probabilities. Then, combined with the charge and ion type identified for the fragment ions, we construct the spectrum representation matrix that is fed into the CNN model. The 11 extra PSM features are calculated after database searching, and this representation is encoded with a fully connected layer. The two representations are detailed below.

Spectrum representation

Our spectrum representation is a matrix built from the charge-annotated peaks of the experimental mass spectrum and the charge-annotated fragment ions of the isotopic distributions. The matrix index encodes the m/z, ion type, and charge state; the details are shown in Figure 2. For each PSM candidate, we use 0.5 Da as the resolution parameter, keep m/z values from 100 Da to 1900 Da, and drop the rest. We initialize an 8 × 3600 matrix with zeros, then scan the sorted m/z lists group by group (four groups for the experimental mass spectrum and six groups for the isotopic distribution) for the specific PSM candidate. During the scan, for each pair with m/z value M_i, we calculate the column index as index = (M_i − M_min) / resolution. Peaks with fragment charge 1, 2, or 3 fill the first three rows with their intensities, and peaks whose fragment charge is not identified fill the fourth row. The isotope abundance probabilities fill the remaining rows, one for each combination of charge state and ion type (charge = 1, Y-ion; charge = 1, B-ion; charge = 2, Y-ion; charge = 2, B-ion; charge = 3, Y-ion; charge = 3, B-ion). After L2 normalization, the matrix is fed to the classification model for training.

Representation of 11 extra PSM features

For each PSM candidate, we also extracted 11 features from the Comet output for the later classification architecture. These 11 additional features were chosen after examining the weight of each feature used by Percolator and Q-ranker. The features are shown in Table 1.
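The construction of the spectrum representation matrix described above can be sketched as follows. The text states an 8 × 3600 matrix but enumerates four experimental and six theoretical groups; this sketch gives each of the ten groups its own row, which is an assumption. Two further assumptions: when several peaks fall into the same 0.5 Da bin the largest value is kept, and L2 normalization is applied over the whole matrix. All names are hypothetical.

```python
import math

MZ_MIN, MZ_MAX, RESOLUTION = 100.0, 1900.0, 0.5
N_COLS = int((MZ_MAX - MZ_MIN) / RESOLUTION)  # 3600 columns

# Row layout (assumed): 4 experimental rows followed by 6 theoretical rows.
EXP_ROWS = {1: 0, 2: 1, 3: 2, None: 3}
THEO_ROWS = {(1, "y"): 4, (1, "b"): 5, (2, "y"): 6,
             (2, "b"): 7, (3, "y"): 8, (3, "b"): 9}
N_ROWS = len(EXP_ROWS) + len(THEO_ROWS)


def build_matrix(exp_groups, theo_groups):
    """exp_groups: {charge or None: [(mz, intensity)]};
    theo_groups: {(charge, ion_type): [(mz, abundance)]}."""
    matrix = [[0.0] * N_COLS for _ in range(N_ROWS)]

    def fill(row, peaks):
        for mz, value in peaks:
            if MZ_MIN <= mz < MZ_MAX:  # drop peaks outside 100 to 1900 Da
                col = int((mz - MZ_MIN) / RESOLUTION)
                matrix[row][col] = max(matrix[row][col], value)

    for charge, peaks in exp_groups.items():
        fill(EXP_ROWS[charge], peaks)
    for key, peaks in theo_groups.items():
        fill(THEO_ROWS[key], peaks)

    # L2-normalize the whole matrix (assumed scope of the normalization).
    norm = math.sqrt(sum(v * v for row in matrix for v in row))
    if norm > 0:
        matrix = [[v / norm for v in row] for row in matrix]
    return matrix
```

Because the column index depends only on m/z, an experimental peak and a theoretical isotope peak at the same m/z land in the same column of different rows, which is what lets small convolution kernels pick up ion-matching patterns.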

Deep learning model

We constructed a deep learning architecture with two encoders. The first, called the spectrum encoder, consists of four convolutional layers and two fully connected layers; it detects patterns in the top-scoring spectrum across the different fragment ions and charge states. The second, called the PSM feature encoder, consists of a single fully connected layer that acts as a pattern recognizer for the extra features. The input to the spectrum encoder is the spectrum representation, and the input to the PSM feature encoder is the 11 extra PSM features. The outputs of the two encoders are concatenated and passed through a fully connected layer to obtain the probability that the PSM candidate is a target PSM rather than a decoy. The architecture of our model is shown in Figure 2, and each encoder is described below.

Spectrum encoder

The spectrum encoder is a deep neural network with four convolutional layers and two fully connected layers. To capture the pattern of high-scoring target PSM candidates, we use small kernel sizes to recognize the fragment ions. Each of the four convolutional layers uses 16 kernels, with kernel sizes of (4,7), (2,11), (2,9), and (2,9), respectively, to capture the unsupervised features of the spectrum. Because of the high dimensionality after each convolution, we apply max-pooling with a (1,3) kernel to keep the most heavily weighted features. To speed up training and avoid over-fitting, we also apply batch normalization to each convolutional layer and add a dropout layer with a drop probability of 0.5 after the last convolutional layer. The first fully connected layer has an input dimension of 3076 and 1024 hidden units. The second fully connected layer has an input dimension of 1024 and an output dimension of 512; its output is activated by the ReLU function and used as the spectrum encoder's representation in the later classification model.

Figure 2: Data pre-processing and architecture.

PSM feature encoder

For each PSM candidate, several PSM-related properties can be calculated after the Comet search. As described above, we feed the 11 features obtained from Comet for each sample into a single fully connected layer with an input dimension of 11; after ReLU activation, the layer outputs a 512-dimensional vector, which is used as the representation of the PSM feature encoder.

Scoring model

The representations from the spectrum encoder and the PSM feature encoder are concatenated into a 1024-dimensional vector and fed into another fully connected layer with the softmax activation function. The output is a probability between 0 and 1 that the PSM candidate is a target PSM.
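The scoring step can be illustrated in plain Python. The sketch assumes the final layer has two output units ordered (decoy, target) and that the target probability is the second softmax output; the text does not state the number of output units. The weights here are stand-ins supplied by the caller, not trained parameters, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_psm(spectrum_vec, feature_vec, weights, bias):
    """Concatenate the two encoder outputs and return P(target).

    weights: two rows (decoy, target), each as long as the concatenated vector.
    bias: two values, one per output unit.
    """
    joint = list(spectrum_vec) + list(feature_vec)  # 512 + 512 = 1024 values
    logits = [
        sum(w * x for w, x in zip(row, joint)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)[1]
```

With two output units, the softmax reduces to a logistic function of the difference between the two logits, so the output behaves like a single calibrated target score.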

Loss function

As described in the section Data set construction, the data set is annotated as positive or negative, so the scoring model is a binary classifier. However, the labels are derived from the scores that Comet assigns to the PSM candidates and are not ground truth. To help the model detect more target PSMs, we use a modified cross-entropy loss that incorporates the posterior error probability (PEP) produced by Percolator. We take p_i to be the probability of a correct match calculated from the PEP and t_i to be the predicted probability. The loss function in Equation 1 is then used to update the weights of the model.

$$\mathrm{Loss} = -\sum_{i} \left[ p_i \log t_i + (1 - p_i) \log (1 - t_i) \right] \qquad (1)$$
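Equation 1 is a cross-entropy with soft targets, and it can be written directly in Python. The sketch below is illustrative: the clipping constant `eps` is added only to avoid taking the logarithm of zero, and how p_i is obtained from the PEP (for example as 1 − PEP) is an assumption that the text does not spell out.

```python
import math


def soft_label_cross_entropy(p, t, eps=1e-12):
    """Equation 1: -sum_i [p_i log t_i + (1 - p_i) log(1 - t_i)].

    p: soft labels derived from the PEP (assumed here to be 1 - PEP).
    t: predicted probabilities that each PSM is a target PSM.
    """
    total = 0.0
    for p_i, t_i in zip(p, t):
        t_i = min(max(t_i, eps), 1.0 - eps)  # keep log() finite
        total += p_i * math.log(t_i) + (1.0 - p_i) * math.log(1.0 - t_i)
    return -total
```

For a fixed soft label p_i, the loss is minimized when the prediction t_i equals p_i, so a PSM with a borderline PEP pulls the model toward an intermediate score rather than a hard 0 or 1.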

[Table 2: data sets (marine 1, marine 2, marine 3, soil 1, soil 2, soil 3, E. coli); table body not recoverable]

Results

We compared the performance of DeepFilter with four other post-processing tools that are state of the art in different shotgun proteomics tasks: Percolator [8], Q-ranker [9], PeptideProphet [18], and iProphet [14]. Percolator, Q-ranker, and PeptideProphet all use Comet as the upstream search algorithm. iProphet takes as input the PSM candidates that PeptideProphet has processed from the database-searching engine. Because iProphet uses peptide-level and protein-level features to improve the PSM-level result, its filtering at those two levels is not independent of its PSM-level filtering. We therefore also ran iProphet with its five statistical models disabled, to evaluate performance from a different perspective.

The evaluation uses the six test data sets, five from the marine and soil microbial communities and one from E. coli, as introduced in the method section. The composition of the data sets is shown in Table 2. First, we use Comet as the search engine to obtain a set of candidate peptide sequences for each spectrum. We then post-process these candidates with the benchmark post-processors above. To measure each post-processor's performance, we count the PSMs, peptides, and proteins it identifies. For every spectrum, only the PSM candidate with the highest score is regarded as the PSM for that spectrum; we then apply different FDR thresholds to compare the post-processors. The FDR is calculated with Equation 2, where Target is the number of target PSMs and Decoy is the number of decoy PSMs.

$$\mathrm{FDR} = \frac{\mathrm{Decoy}}{\mathrm{Target}} \qquad (2)$$
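The FDR filtering used for the comparison can be sketched as a walk down the score-ranked list, taking the FDR as the ratio of decoy to target PSMs above the cutoff. The function below is a minimal illustration under the assumption that a higher score means a better match and that the largest prefix satisfying the threshold is accepted; the function name and input format are hypothetical.

```python
def accept_at_fdr(psms, fdr_threshold=0.01):
    """Return the number of target PSMs accepted at the given FDR.

    psms: list of (score, is_decoy), one top-scoring PSM per spectrum.
    """
    ranked = sorted(psms, key=lambda item: item[0], reverse=True)
    targets = decoys = 0
    accepted = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest cutoff at which Decoy / Target stays within the threshold.
        if targets > 0 and decoys / targets <= fdr_threshold:
            accepted = targets
    return accepted
```

Running the same function over the scores produced by each post-processor gives directly comparable identification counts at a common FDR.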

DeepFilter is applied to five data sets from complex microbial communities. We compared its identification results with those of the Comet database-searching algorithm and the four existing post-processing algorithms (Percolator, Q-ranker, PeptideProphet, and iProphet) at the PSM, peptide, and protein levels with an FDR threshold of 1%. The results are shown in Table 3; the row named iProphet-NON is the experiment with all the statistical models disabled, the bold entry is the best result for each data set, and the underlined entry is the second best. We also applied the model to a single organism, E. coli, at the three identification levels with an FDR of 1%. The results are shown in Table 4.

Table 3 shows the numbers of identifications at each level when accepting a false discovery rate of 1%, which allows the post-processors built on Comet to be compared. At the PSM and peptide levels, our model and iProphet always give the top two identification counts: our model performs best at the PSM level on the marine data sets, and iProphet performs best at the peptide level on the soil1 and soil3 data sets. However, although iProphet performs better at the PSM and peptide levels, these gains do not translate into better protein identification. A likely reason is that iProphet uses protein-level features to improve PSM identification, so its PSM-level and protein-level results are not independent. Our model always performs best at the protein level at an FDR of 1%; the improvement is shown in Table 5, where the "Second best" column gives the second-highest number of target proteins identified at the protein level at an FDR of 1%. Compared with the second-best algorithm, our model achieves an improvement of more than 5% on most data sets; the exception is soil1, where there is still a slight improvement.

For the single organism E. coli, DeepFilter also achieves protein identification performance comparable to that of the benchmark post-processors at the different identification levels at an FDR of 1%.

                  Marine 2   Marine 3   Soil 1   Soil 2   Soil 3
PSM identification at PSM level within FDR 1%
Comet             31822      38490      79505    75693    72454
Percolator        34741      41714      88037    84623    81331
Q-ranker          33899      40832      86433    82773    79006
PeptideProphet    30670      37072      73821    71281    68067
iProphet          38476      44588
iProphet-NON      30846      37304      75360    73331    70121
DeepFilter

iProphet-NON      21696      24661      25403    22775    19922
DeepFilter

Table 3: Identification performance using five real-world metaproteomes at FDR 1%

                  PSM       Peptide   Protein
Comet             504466    30944     2147
Percolator        508017    31179     2167
Q-ranker          507425    30829     2165
PeptideProphet    493686    30014     2060
iProphet          498810    31201     2041
iProphet-NON      495353    30037     2057
DeepFilter

Table 4: Identification performance using the E. coli data set at FDR 1%

            Second best   DeepFilter   Improvement
marine2     7715          8313         7.19%
marine3     8209          8851         7.25%
soil1       7756          8069         3.88%
soil2       7519          8041         6.49%
soil3       6387          6976         8.44%

Table 5: Improvement fraction for complex microbial communities at the protein level within FDR 1%
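The improvement column of Table 5 can be reproduced from the two count columns. The reported percentages are consistent with dividing the gain by the DeepFilter count rather than by the second-best count; the short Python check below makes that arithmetic explicit. The dictionary and function names are illustrative.

```python
# (second best, DeepFilter) protein counts at FDR 1%, copied from Table 5.
TABLE_5 = {
    "marine2": (7715, 8313),
    "marine3": (8209, 8851),
    "soil1": (7756, 8069),
    "soil2": (7519, 8041),
    "soil3": (6387, 6976),
}


def improvement(second_best, deepfilter):
    """Gain of DeepFilter over the second-best tool, as a percentage of the DeepFilter count."""
    return (deepfilter - second_best) / deepfilter * 100.0


for name, (second_best, deepfilter) in TABLE_5.items():
    print(f"{name}: {improvement(second_best, deepfilter):.2f}%")
```

Measured against the second-best count instead, the same gains would be slightly larger (for example about 7.75% for marine2), so the choice of denominator matters when comparing with other reports.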

We varied the FDR from 0.2% to 10% to compare our model with the database-searching engine and the other post-processing algorithms. Figures 3, 4, and 5 show the results of these experiments, which were conducted on the data sets described above at FDR values from 0.2% to 10% at the PSM, peptide, and protein levels, respectively. For the complex microbial communities, iProphet and DeepFilter generally hold the top two positions at the PSM and peptide levels; iProphet shows an especially large gain on the soil community data sets. However, because it relies on protein-level features to build its statistical model, iProphet does not perform well at the protein level. At the protein level, DeepFilter, Percolator, and Q-ranker hold the top three positions across all FDR values considered. DeepFilter performs best throughout, except on the soil1 data set for FDR values from 5% to 10%. Given that the training data set comes from the marine microbial community, this may be caused by a lack of representative training samples, so that some latent patterns are not learned during training. For data sets from disparate communities, fine-tuning our model on the new data set may yield a well-performing post-processor within a short training time. For the single organism, the E. coli results in Figures 3, 4, and 5 show that our model achieves results very similar to those of the best baseline post-processor. That is, DeepFilter is also applicable to single organisms.

Discussion

Much work has studied how to improve peptide identification after database searching with different machine learning methods. Percolator [8] and Q-ranker [9] use the SVM algorithm; PeptideProphet [19] applies expectation-maximization (EM) to estimate the peptide identification probability. However, most of these tools construct and train on their data sets in a semi-supervised fashion: they select positive PSMs by controlling the FDR, train the machine learning model on this set of positive PSMs combined with all the decoy PSMs, and then apply the model to the whole data set to re-score the PSM candidates. As a result, some of the PSMs being re-scored were already seen during training, and users must retrain a new model every time they post-process and re-score a set of PSM candidates. With DeepFilter, the experiments show that the model generalizes well to unseen data sets; the model can be reused on another data set, which suggests that the unsupervised feature learning of the CNN recognizes latent patterns.

With the development of deep learning, more research has been published on applying deep neural networks to detect patterns in tandem MS data in proteomics. For example, DeepMatch [12] uses VGG-16 to recognize patterns between tandem MS data and encoded amino acid sequences to increase the number of PSMs, and evaluates the model on three well-known single organisms (yeast, human, mouse); DeepNovo [11] uses a sequence-to-sequence model built from a CNN and an LSTM to generate de novo sequences.

In contrast to the algorithms above, we trained a CNN-based deep learning model to extract latent information from MS2 data and the peptide sequences returned by the database-searching engine, and we use that model to estimate PSM probabilities in new data sets.
We did not adopt a semi-supervised scheme that trains the model on a subset of the data each time the search engine results are post-processed; once the model is trained, we use it directly, without any fine-tuning, to estimate peptide probabilities for new data sets. Compared with the DeepMatch PSM identification model, our model has a simpler architecture and fewer neural network parameters, making it easier to deploy. In addition, we evaluate our model not only at the PSM level with a specific FDR value; peptide-level and protein-level inference is also tested. Our model also achieves better performance on complex microbial communities; we focus on metaproteomics, while DeepMatch analyzes single organisms. Some ensemble approaches show excellent PSM identification results: MSBlender [20] uses a probabilistic approach, while Sipros Ensemble [21] uses logistic regression to integrate peptide identifications from multiple database-searching engines. Nevertheless, this is a different line of research and is not comparable to our approach.

Figure 3: PSM performance within different FDR. Panels: (a) marine2, (b) marine3, (c) soil1, (d) soil2, (e) soil3, (f) ecoli.

Figure 4: Peptide performance within different FDR. Panels: (a) marine2, (b) marine3, (c) soil1, (d) soil2, (e) soil3, (f) ecoli.

Figure 5: Protein performance within different FDR. Panels: (a) marine2, (b) marine3, (c) soil1, (d) soil2, (e) soil3, (f) ecoli.

Figure 6: Class activation mappings of target PSMs. Panels: (a) to (d).

Figure 7: Class activation mappings of decoy PSMs. Panels: (a) to (d).

To mine the patterns and visualize the features learned by the unsupervised feature learning of our deep learning architecture, we adopted a class activation mapping (CAM) generation technique [22] to help interpret the decisions of the CNN model. A CAM shows the image regions that contribute most to predicting a particular category. In our experiment, we apply this technique to our spectrum representation to visualize the patterns used to predict a target PSM.

In Figure 6 and Figure 7, we present four CAMs each for target and decoy PSMs, respectively. In the figures, white points indicate the mass spectrum data and the corresponding intensity at that position. The color represents the importance that the CNN's learned weights assign to the area: the brighter the background, the more the region contributes to predicting a target PSM. The rectangle outlined in red marks a region that shows a clear ion-matching pattern between the MS data and the theoretical data.

Figure 6 presents the CAMs for target PSMs, with one sub-figure per PSM. The most significant part learned by the CNN (red region) covers the fragment ions. Figure 7 shows the CAMs for decoy PSMs; its sub-figures present different situations in which the features learned by the CNN do not contribute to predicting a target PSM. Sub-figure (a) shows a low-weighted region (blue region) over the ions' region, where fragment ion matching would be detected; sub-figure (b) has a bright class mapping that fails to cover the ion positions; sub-figure (c) has a large red region but no usable MS data; and sub-figure (d) has dense mass spectrum data covered by the CAM, but the data and the theoretical ions do not match.

References

[1] Andreas Schlüter, Thomas Bekel, Naryttza N Diaz, Michael Dondrup, Rudolf Eichenlaub, Karl-Heinz Gartemann, Irene Krahn, Lutz Krause, Holger Krömeke, Olaf Kruse, et al. The metagenome of a biogas-producing microbial community of a production-scale biogas plant fermenter analysed by the 454-pyrosequencing technology. Journal of biotechnology, 136(1-2):77–90, 2008.

[2] Ngom Issa Isaac, Decloquement Philippe, Armstrong Nicholas, Didier Raoult, and Chabriere Eric. Metaproteomics of the human gut microbiota: Challenges and contributions to other omics. Clinical Mass Spectrometry, 14:18–30, 2019.

[3] Kenneth J Locey and Jay T Lennon. Scaling laws predict global microbial diversity. Proceedings of the National Academy of Sciences, 113(21):5970–5975, 2016.

[4] Robert Heyer, Kay Schallert, Roman Zoun, Beatrice Becher, Gunter Saake, and Dirk Benndorf. Challenges and perspectives of metaproteomic data analysis. Journal of biotechnology, 261:24–36, 2017.

[5] Jimmy K Eng, Tahmina A Jahan, and Michael R Hoopmann. Comet: an open-source ms/ms sequence database search tool. Proteomics, 13(1):22–24, 2013.

[6] David L Tabb, Christopher G Fernando, and Matthew C Chambers. Myrimatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. Journal of proteome research, 6(2):654–661, 2007.