Antimicrobial Peptide Prediction Using Ensemble Learning Algorithm


Neda Zarayeneh

EECS Department, WSU Pullman, WA, U.S.

Zahra Hanifeloo

EECS Department, ZNU Strasbourg, France

Abstract: Antimicrobial peptides (AMPs), which act as a first line of defense against bacteria, have recently become an active area of research. They are attracting attention as an efficient way of fighting multidrug resistance. Discovering and identifying AMPs in wet labs is challenging, expensive, and time-consuming, so computational methods for AMP prediction have gained attention as a more efficient approach. In this paper, we develop an ensemble learning algorithm that integrates well-known learning models to predict AMPs. First, we extract features from the physicochemical, evolutionary, and secondary structure properties of the peptide sequences. Our ensemble algorithm then trains conventional learning algorithms on these features and combines their outputs. Finally, the proposed ensemble algorithm improves prediction performance by about 10% compared to the individual conventional learning algorithms.

Keywords: Antimicrobial peptides; Ensemble learning; Feature selection; Bacteria; Prediction

I. INTRODUCTION

Bacteria are by far the most diverse and abundant organisms on Earth. They play an important role in human life and have been an active area of research for decades [1-3]. Many studies have tried to understand their mechanisms by clustering them, reconstructing their evolutionary history, or examining their lateral gene transfer processes [3-5]. Most of these efforts aim to improve our understanding of bacterial antimicrobial resistance, which has become a real threat to global healthcare according to the World Health Organization [6]. Attempts to fight antimicrobial resistance have led researchers to a key weapon provided by nature: antimicrobial peptides (AMPs). AMPs, which are reported to be effective against microorganisms such as viruses, bacteria, and fungi, are important natural immune molecules that establish a first line of host defense by damaging the cell membrane or the intracellular functions of microorganisms [7]. Developing synthetic antimicrobial drugs can take years, and emerging antimicrobial resistance continually creates the need for new lines of drugs. Because of these obstacles, AMPs have gained attention as an alternative to conventional approaches [7].

Discovering AMPs in wet labs is itself a challenge because it is time-consuming. Therefore, given the availability of sufficient data, sequence-based computational tools have proven to be an effective way of identifying peptides with a high likelihood of being good AMP candidates [8]. Identifying such candidates prior to wet-lab experiments increases the probability of designing an AMP in a shorter time [8].

Here we discuss some of the most recent works that have applied computational biology approaches to predict AMPs. In [8], the authors developed a supervised learning algorithm to predict AMPs. They first extracted physicochemical and structure-based features, and then trained a Support Vector Machine (SVM) on these features. Their approach increased accuracy compared to previous approaches; however, we suggest that accuracy could be increased further by using ensemble models rather than a single SVM. AmPEP [9] is a more recent study that applied an ensemble learning algorithm to predict AMPs. The authors generated the distribution patterns of amino acid properties as peptide features and then used them as input to a Random Forest for AMP prediction. Their algorithm increased accuracy compared to the previous model; however, its precision is not as convincing as its accuracy. AMAP [10] is another machine learning algorithm developed to predict the antimicrobial activity of peptides. AMAP applies multi-label classification to predict several types of antimicrobial peptides. The authors evaluated their model using cross-validation and compared it to state-of-the-art methods, and the results showed an improvement in performance.

All the methods mentioned above, along with the other computational tools listed in [11], have generated useful knowledge for the prediction of AMPs. However, the algorithms still need to be improved to minimize the number of false positives. In this study, we attempt to develop a computational approach for predicting antibacterial peptides with higher performance. First, we generate features from the physicochemical, evolutionary, and secondary structure properties of the peptide sequences. Next, we reduce the dimension of the features, and finally we use them as input to our ensemble machine learning algorithm. Our approach was found to be more accurate than the conventional learning algorithms it combines.

This paper is organized as follows. Section II explains our methodology in detail, including data collection, feature extraction, and learning algorithms. Section III evaluates our approach. Section IV presents our conclusions and future work.

II. METHODOLOGY

In this section, we explain our methodology in detail: first how we collected the data, then the features we generated, and finally the model we built.

A. Data Collection

We collected positive antibacterial peptides (ABPs) from several publicly available databases. In total, we downloaded 5000 positive ABPs from the Data Repository of Antimicrobial Peptides (DRAMP) [12], the database of antimicrobial peptides (dbAMP) [13], and the Collection of Antimicrobial Peptides (CAMP) [14]. For the negative dataset, we first computed the average weight of each amino acid in the positive data, as well as the length distribution of the positive peptides. Based on the result, we then generated 5000 negative peptides with the same weight and length distribution as the positive peptides. Figure 1 plots the length distribution for the positive and negative data, and Figure 2 shows the distribution of the positive and negative data in terms of the grand average of hydropathicity (GRAVY) and the molecular weight of the sequences. The plots show how close the generated negative dataset is to the positive peptides. Using such a stringent dataset lends confidence to the results of the model.

B. Feature Extraction

We extracted several types of features from the peptide sequences. We surveyed recent studies to find the optimal features. A number of studies [15-17] suggest using physicochemical, evolutionary, and secondary structure properties as optimal features for peptides.

Figure 1- The distribution of the positive and negative AMPs in terms of the lengths and number in the dataset.

Figure 2- The distribution of the positive and negative AMPs in terms of sequence grand average of hydropathicity (gravy), and the molecular weight of the sequence

Table 1 lists the features that we generated. The amino acid composition of a sequence is the fraction of each amino acid relative to the peptide length. The composition, transition, and distribution (CTD) model examines physicochemical properties of the amino acids such as normalized van der Waals volume, hydrophobicity, polarity, polarizability, and secondary structure. In total, the feature sets in Table 1 yield 591 features per sequence. iFeature [17] is a Python-based tool that implements most protein sequence features. We used the classes developed by iFeature and by [15]. To reduce the number of features, we first computed Pearson's correlation coefficient (1) between the features.
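As a rough sketch of two steps named in this paragraph, amino acid composition and correlation-based feature reduction, the following plain-Python example may help. The 0.90 cutoff is the one the text gives alongside Eq. (1). The function names, the toy peptides, the handling of constant columns, and the greedy keep-first strategy are all assumptions made for illustration; the paper does not state how one feature of a correlated pair was chosen, and the actual features came from iFeature.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return 20 fractions: count(residue) / len(sequence)."""
    sequence = sequence.upper()
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def pearson(a, b):
    """Pearson correlation as in Eq. (1), using population statistics."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    sigma_a = sqrt(sum((x - mu_a) ** 2 for x in a) / n)
    sigma_b = sqrt(sum((y - mu_b) ** 2 for y in b) / n)
    if sigma_a == 0 or sigma_b == 0:
        return 0.0  # assumption: treat constant columns as uncorrelated
    return cov / (sigma_a * sigma_b)


def drop_correlated(columns, threshold=0.90):
    """Greedy filter (an assumption, not the paper's stated procedure).

    A column is kept only if |r| < threshold against every column kept
    so far. Returns the indices of the kept columns.
    """
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


if __name__ == "__main__":
    peptides = ["GIGKFLHSAKKF", "KWKLFKKIEK", "ACDEFGHIKL"]  # toy sequences
    rows = [amino_acid_composition(p) for p in peptides]
    columns = [list(col) for col in zip(*rows)]  # one list per feature
    print("kept feature indices:", drop_correlated(columns))
```

With real data, each column would hold one of the 591 features across all peptides, and the order in which columns are visited determines which member of a correlated pair survives.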

$$\mathrm{Pearson}(A, B) = \frac{E\left[(A - \mu_A)(B - \mu_B)\right]}{\sigma_A \sigma_B} \qquad (1)$$

where $E$ is the expectation, $\mu_A$ and $\mu_B$ are the mean values, and $\sigma_A$ and $\sigma_B$ are the standard deviations of $A$ and $B$, respectively. The correlation is a number in the interval [-1, +1]; the farther it is from zero, the stronger the correlation between $A$ and $B$. Here, we kept the features with |correlation| < 0.90. In this way we reduced the number of features from 591 to 49.

Table 1- Features for the peptides

| Feature | Dimension |
|---|---|
| Amino acid composition | 20 |
| Composition, transition, and distribution (CTD) model | 168 |
| Predicted secondary structure | 3 |
| Position-specific scoring matrix (PSSM) | 400 |

C. Learning Algorithm

We trained our model using three well-known machine learning algorithms: Support Vector Machine (SVM) [18], Random Forest (RF) [19], and Gradient Boosting Model (GBM) [20]. We then developed an ensemble algorithm [15] that combines the three algorithms.

SVM

The Support Vector Machine (SVM) is a non-probabilistic binary classifier, linear in its basic form, that can be used for both classification and regression. It learns a hyperplane that separates the classes of the data: in an n-dimensional feature space, the SVM learns an (n - 1)-dimensional hyperplane that divides the space into two classes. An SVM can also classify a dataset that is not linearly separable by projecting it into a higher-dimensional space in which it becomes linearly separable. Its performance degrades when the data are noisy.

Random Forest

Random forest [19] is a well-known ensemble algorithm that combines a large number of decision trees. The RF algorithm operates by voting and thus benefits from the wisdom of the crowd: every individual tree in the forest predicts a class for a data point, and the class with the highest number of votes becomes the final prediction. Training a large number of uncorrelated decision trees is the key to why RF works well. Uncorrelated trees lead to more accurate predictions because the trees protect each other from their individual errors. To build a random forest model, the features, and consequently the trees generated from them, should have low correlation.

Gradient Boosting Model

Gradient boosting is another ensemble learning algorithm, in which the predictors are not independent but are built sequentially. The gradient boosting model (GBM) is a technique for both regression and classification problems. It produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, as other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function.

Ensemble Method

We built an ensemble learning algorithm using RF, GBM, and SVM. As shown in Figure 3, the base classifiers (RF, GBM, and SVM) first take the training dataset as input, and then each provides a decision individually. We mapped the categorical labels "positive" and "negative" to 1 and 0, respectively. Let the outputs of their decisions be $O_{RF}$, $O_{GBM}$, and $O_{SVM}$. The final decision is then calculated from the average of the three outputs, so that $f$ lies between 0 and 1:

$$f = \frac{O_{RF} + O_{GBM} + O_{SVM}}{3} \qquad (2)$$

$$\text{decision} = \begin{cases} \text{Strong Positive} & \text{if } f = 1 \\ \text{Positive} & \text{if } f \ge 0.66 \\ \text{Negative} & \text{if } f \le 0.33 \\ \text{Strong Negative} & \text{if } f = 0 \end{cases} \qquad (3)$$

The final decision is based on the value of $f$, which also lets us suggest how likely a peptide is to be positive or negative. For the final two-class classification, however, the prediction is positive if $f > 0.5$ and negative otherwise.

Figure 3- Ensemble method created by RF, GBM, and SVM
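The voting rule of the ensemble method can be sketched in a few lines of Python. This is a minimal illustration, assuming that f is the mean of the three 0/1 base-classifier outputs (so that the thresholds in (3) and the 0.5 cutoff apply on a 0-to-1 scale) and that 0.66 and 0.33 are rounded forms of 2/3 and 1/3. The function names are hypothetical, and the base classifiers themselves are not trained here.

```python
from fractions import Fraction
from itertools import product


def ensemble_score(o_rf, o_gbm, o_svm):
    """Mean of the three 0/1 base-classifier outputs, kept exact."""
    for o in (o_rf, o_gbm, o_svm):
        if o not in (0, 1):
            raise ValueError("outputs must be 0 (negative) or 1 (positive)")
    return Fraction(o_rf + o_gbm + o_svm, 3)


def graded_label(f):
    """Four-level label following (3).

    Assumption: 0.66 and 0.33 stand for 2/3 and 1/3, so two votes give
    "Positive" and one vote gives "Negative".
    """
    if f == 1:
        return "Strong Positive"
    if f >= Fraction(2, 3):
        return "Positive"
    if f == 0:
        return "Strong Negative"
    return "Negative"


def binary_decision(f):
    """Final two-class decision: 1 (positive) if f > 0.5, else 0."""
    return 1 if f > Fraction(1, 2) else 0


if __name__ == "__main__":
    # Enumerate all eight combinations of base-classifier votes.
    for votes in product((0, 1), repeat=3):
        f = ensemble_score(*votes)
        print(votes, float(f), graded_label(f), binary_decision(f))
```

Under these assumptions the binary decision is a simple majority vote, and the graded label only adds information about whether the three classifiers were unanimous.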

III. RESULTS

To evaluate our model, we used four evaluation metrics [21]: accuracy (4), recall (5), F1 score (6), and the ROC curve, which is built from the false positive rate (7) and the true positive rate (8). We first define true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TPs are peptides correctly predicted as antibacterial. TNs are peptides correctly predicted as not antibacterial. An FP occurs when a non-antibacterial peptide is predicted as antibacterial. An FN occurs when a peptide is predicted as not antibacterial while it is actually antibacterial. The evaluation metrics are defined in terms of these quantities.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (5)$$

$$\mathrm{F1\ Score} = \frac{2TP}{2TP + FP + FN} \qquad (6)$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (7)$$

$$\mathrm{TPR} = \frac{TP}{FN + TP} \qquad (8)$$

The Receiver Operating Characteristic (ROC) curve is created by plotting TPR against FPR. It shows the ability of the model to separate the two classes of a binary dataset. We held out 25% of the data as a test set and trained the model on the remaining 75%. Table 2 compares the performance of the ensemble method and the three individual models.
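For concreteness, the metrics in Eqs. (4)-(8) and the ROC construction can be sketched in plain Python as follows. The function names and the toy labels are assumptions made for illustration, and the sketch assumes both classes are present so that no denominator is zero. The AUC here uses the trapezoidal rule over the ROC points, which is one common choice and not necessarily the one used to produce the reported figures.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Evaluation metrics following Eqs. (4)-(8)."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # Eq. (4)
        "recall": tp / (tp + fn),                     # Eq. (5)
        "f1": 2 * tp / (2 * tp + fp + fn),            # Eq. (6)
        "fpr": fp / (fp + tn),                        # Eq. (7)
        "tpr": tp / (fn + tp),                        # Eq. (8)
    }


def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by thresholding the scores, high to low."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0]  # toy labels
    y_pred = [1, 1, 0, 0, 0, 1]  # toy predictions
    print(metrics(*confusion_counts(y_true, y_pred)))
```

Because recall (5) and TPR (8) share the same formula, the two entries always coincide; they are kept separate only to mirror the numbering in the text.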

Table 2- Performance Evaluation

| Method | Accuracy | F1 Score | Recall |
|---|---|---|---|
| SVM | 0.75 | 0.73 | 0.69 |
| GBM | 0.63 | 0.61 | 0.58 |
| RF | 0.76 | 0.76 | 0.74 |
| Ensemble | 0.87 | 0.86 | 0.86 |

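The table shows that the ensemble method improves prediction by roughly 10 percentage points over the best individual model (for example, accuracy rises from 0.76 for RF to 0.87). Because the F1 score combines precision and recall, the higher F1 score indicates that the ensemble also improved the precision of the model.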

Figure 4 – The ROC curve for the proposed ensemble method and for three other individual learning algorithms

Figure 4 plots the ROC curves for the ensemble algorithm and the three individual algorithms. The figure shows that all the models perform better than randomly selecting peptides. SVM works better than the other two individual models. The ensemble model benefits from combining the three models, and its higher area under the curve (AUC) shows the improvement.

IV. CONCLUSION

Predicting antimicrobial peptides has recently gained attention. In this work, we developed a learning algorithm to predict antibacterial peptides. The contribution of our work comes from combining well-known algorithms to produce a more powerful learning algorithm. We trained and tested our model on a highly stringent dataset, and the results show an almost 10% performance improvement. In future work, we will design an ensemble model for predicting all types of antimicrobial peptides. We will also try to design a meta-classifier to improve our model further.

REFERENCES

[1] C. J. William, G. G. Geesey, and K-J. Cheng. "How bacteria stick." Scientific American 238, no. 1 (1978): 86-95.

[2] J. Adler. "Chemotaxis in bacteria." Science 153, no. 3737 (1966): 708-716.

[3] E. Khaledian, K. A. Brayton, and S. L. Broschat. "A Systematic Approach to Bacterial Phylogeny Using Order Level Sampling and Identification of HGT Using Network Science." Microorganisms 8, no. 2 (2020): 312.

[4] S. Frederik, E. A. Eloe-Fadrosh, R. M. Bowers, J. Jarett, T. Nielsen, N. N. Ivanova, N. C. Kyrpides, and T. Woyke. "Towards a balanced view of the bacterial tree of life." Microbiome 5, no. 1 (2017): 140.

[5] N. Shahla, D. M. Weinreich, and A. E. Vasdekis. "Cellular Noise and Response to Antibiotics." Biophysical Journal 118, no. 3 (2020): 452a.

[6] S. Saurabh R., P. S. Shrivastava, and J. Ramasamy. "Responding to the challenge of antibiotic resistance: World Health Organization." Journal of Research in Medical Sciences 23, no. 1 (2018): 21.

[7] M. Margit, J. Håkansson, L. Ringstad, and C. Björn. "Antimicrobial peptides: an emerging category of therapeutic agents." Frontiers in Cellular and Infection Microbiology 6 (2016): 194.

[8] M. Prabina Kumar, T. Kumar Sahu, V. Saini, and A. Ramakrishna Rao. "Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC." Scientific Reports 7, no. 1 (2017): 1-12.

[9] B. Pratiti, J. Yan, J. Li, S. Fong, and S. WI Siu. "AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest." Scientific Reports 8, no. 1 (2018): 1-10.

[10] G. Sadaf, N. Shamim, and F. Minhas. "AMAP: Hierarchical multi-label prediction of biologically active and antimicrobial peptides." Computers in Biology and Medicine 107 (2019): 172-181.

[11] P. W. F., A. S. Pires, and O. L. Franco. "Computational tools for exploring sequence databases as a resource for antimicrobial peptides." Biotechnology Advances 35, no. 3 (2017): 337-349.

[12] K. Xinyue, F. Dong, C. Shi, S. Liu, J. Sun, J. Chen, H. Li, H. Xu, X. Lao, and H. Zheng. "DRAMP 2.0, an updated data repository of antimicrobial peptides." Scientific Data 6, no. 1 (2019): 1-10.

[13] J. Jhih-Hua, Y. Chi, W. Li, T. Lin, K. Huang, and T. Lee. "dbAMP: an integrated resource for exploring antimicrobial peptides with functional activities and physicochemical properties on transcriptome and proteome data." Nucleic Acids Research 47, no. D1 (2019): D285-D297.

[14] W. Faiza Hanif, and S. Idicula-Thomas. "Collection of antimicrobial peptides database and its derivatives: Applications and beyond." Protein Science 29, no. 1 (2020): 36-42.

[15] C. Abu Sayed, E. Khaledian, and S. L. Broschat. "Capreomycin resistance prediction in two species of Mycobacterium using a stacked ensemble method." Journal of Applied Microbiology.

[16] L. Hong, J. Xu, L. Tao, X. Feng Li, S. Li, X. Zeng, S. Ying Chen et al. "SVM-Prot 2016: a web-server for machine learning prediction of protein functional families from sequence irrespective of similarity." PLoS ONE 11, no. 8 (2016).

[17] C. Zhen, P. Zhao, F. Li, A. Leier, T. Marquez-Lago, Y. Wang, G. I. Webb et al. "iFeature: a python package and web server for features extraction and selection from protein and peptide sequences." Bioinformatics 34, no. 14 (2018): 2499-2502.

[18] S. Johan AK, and J. Vandewalle. "Least squares support vector machine classifiers." Neural Processing Letters 9, no. 3 (1999): 293-300.

[19] L. Andy, and M. Wiener. "Classification and regression by randomForest." R News 2, no. 3 (2002): 18-22.

[20] N. Alexey, and A. Knoll. "Gradient boosting machines, a tutorial." Frontiers in Neurorobotics.

[21] H. Mohammad, and M. N. Sulaiman. "A review on evaluation metrics for data classification evaluations." International Journal of Data Mining & Knowledge Management Process 5, no. 2 (2015): 1.