AMP0: Species-Specific Prediction of Anti-microbial Peptides using Zero and Few Shot Learning

Abstract

The evolution of drug-resistant microbial species is one of the major challenges to global health. The development of new antimicrobial treatments such as antimicrobial peptides needs to be accelerated to combat this threat. However, the discovery of novel antimicrobial peptides is hampered by low-throughput biochemical assays. Computational techniques can be used for rapid screening of promising antimicrobial peptide candidates prior to testing in the wet lab. The vast majority of existing antimicrobial peptide predictors are non-targeted in nature, i.e., they can predict whether a given peptide sequence is antimicrobial, but they are unable to predict whether the sequence can target a particular microbial species. In this work, we have developed a targeted antimicrobial peptide activity predictor that can predict whether a peptide is effective against a given microbial species or not. This has been made possible through zero-shot and few-shot machine learning. The proposed predictor, called AMP0, takes in the peptide amino acid sequence and any N/C-termini modifications together with the genomic sequence of a target microbial species to generate targeted predictions. It is important to note that the proposed method can generate predictions for species that are not part of its training set. The accuracy of predictions for novel test species can be further improved by providing a few example peptides for that species. Our computational cross-validation results show that the proposed scheme is particularly effective for targeted antimicrobial prediction in comparison to existing approaches and can be used for screening potential antimicrobial peptides in a targeted manner, especially for cases in which the number of training examples is small. The webserver of the method is available at http://ampzero.pythonanywhere.com.


Sadaf Gull and Fayyaz Minhas

PIEAS Biomedical Informatics Lab, Pakistan Institute of Engineering and Applied Sciences, PO Nilore, Islamabad, Pakistan.

Department of Computer Science, University of Warwick, Coventry, UK

[email protected], [email protected]

*To whom correspondence should be addressed.

Availability:

We have also developed a webserver for the proposed methodology, available at http://ampzero.pythonanywhere.com. The data used for training and testing are available for download in the supplementary material.

Contact: [email protected]

Supplementary information:

Supplementary data are available.

Introduction

Antibiotics play a significant role in protecting humans from microbial infections. The discovery and use of antibiotics since the 1930s has helped in treating serious infections and saved many lives (Aslam et al., 2018; Blair, 2018; Ventola, 2015). Resistance against antibiotics in microbes was detected in the 1960s and it prompted an evolutionary arms race between microbes and antibiotics (Ventola, 2015). Antimicrobial resistance is currently a major global health crisis. The annual number of deaths due to infections caused by antibiotic resistance is increasing and is estimated to reach up to 10 million by 2050 (Blair, 2018).

Fig. 1. A general framework of machine learning predictors for (a) non-targeted and (b) targeted predictions

The World Health Organization (WHO) has generated a list of antibiotic-resistant bacterial species that are a major threat to global health and require urgent development of novel therapeutics against them: Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter (Lakemeyer et al., 2018). To handle the issue of antibiotic resistance, the development of novel antibiotics is necessary (Aslam et al., 2018; Blair, 2018; Ventola, 2015; Lakemeyer et al., 2018; Spaulding et al., 2018). In comparison to the rate of development of antimicrobial resistance, the pace of discovery or development of new antibiotics is very slow: in the last two decades only two new classes of antibiotics were introduced for clinical use (Lakemeyer et al., 2018). Consequently, the use of vaccines, lysins, antibodies, probiotics, bacteriophages and antimicrobial peptides (AMPs) is becoming popular in therapeutics as alternatives to antibiotics (Aslam et al., 2018).

For designing new drugs, the use of AMPs is rapidly gaining attention (Aslam et al., 2018; Kampshoff et al., 2019; Costa et al., 2019; Yu et al., 2018). AMPs exhibit different biological activities against microbes, e.g., bacteria, viruses, fungi, etc. (Aslam et al., 2018), have higher inhibition rates than antibiotics, and can potentially slow down the evolution of antibiotic resistance as well (Yu et al., 2018). Potential AMP candidates need to be tested and evaluated experimentally before entering clinical trials. Pre-screening potential antimicrobial peptides with machine learning reduces the cost of determining, in the wet lab, whether a peptide sequence is effective against a microbial species. A number of machine learning based AMP predictors are available in the literature (Gull et al., 2019; Bhadra et al., 2018; Torrent et al., 2009; Waghu et al., 2015; Lin and Xu, 2016; Agrawal and Raghava, 2018). The primary issue with these non-targeted predictors is that they are unable to predict whether a given peptide sequence will be effective against a given target microbial species or not (see Fig. 1).

Only a small number of targeted predictors exist in the literature, and they are not able to generate predictions for novel microbial species (Kleandrova et al., 2016; Vishnepolsky et al., 2018; Speck-Planche et al., 2016). Vishnepolsky et al. developed a predictor for 6 different gram-negative bacterial strains (Vishnepolsky et al., 2018). The AMP predictor by Kleandrova et al. used 70 different gram-negative strains of bacteria in training to predict antimicrobial and cytotoxic activity of individual amino acids in a peptide sequence for different strains (Kleandrova et al., 2016). Although they covered a large set of bacterial species, their method can generate predictions only for specific gram-negative bacterial strains. Unavailability of their predictor for public use is also a limitation (Kleandrova et al., 2016). The major drawback of existing targeted predictors is thus their inability to generate predictions for novel microbial species.

Table 1. Filtering criteria applied to the DBAASP database to obtain the required dataset

| Filtering criteria | Number of peptides |
| --- | --- |
| DBAASP monomer peptides | 12,984 |
| Sequences with length >5 | 12,517 |
| Sequences with microbial targets (excluding cancers) | 9,890 |
| Sequences with MIC in (μM) or (μg/mL) | 8,045 |
| Sequences with target species genomes available in NCBI (Coordinators, 2016) | 8,025 |
| Sequences with at least one target species with MIC ≤ 25 μg/mL | 5,710 |

Methods

Data collection and preprocessing

For constructing the dataset used for training and evaluation of our machine learning models, we have used DBAASP version 2 (Pirtskhalava et al., 2015). DBAASP has been widely used in recent studies in this field (Kleandrova et al., 2016; Youmans et al., 2017; Vishnepolsky et al., 2018; Speck-Planche et al., 2016; Win et al., 2017). It contains a total of 12,984 peptide sequences and their experimentally verified minimum inhibitory concentrations (MICs) against various target microbial species. In order to construct our dataset from DBAASP, we have used peptides with length greater than 5 amino acids whose experimentally validated MICs are available in micromolar (μM) or microgram per milliliter (μg/mL). We also ensured that the genomes of the target species are available in NCBI (Coordinators, 2016) and that each peptide in our dataset has at least one target species for which its MIC was ≤ 25 μg/mL (Vishnepolsky et al., 2018). The details of the different filtration stages used to extract the dataset are given in Table 1. DBAASP reports the effectiveness of a peptide sequence against multiple strains of a microbial species. We have taken the minimum MIC of a peptide across different strains of a species as its MIC against that species. All MIC values have been converted to μg/mL (Kleandrova et al., 2016). Our final dataset comprises 5,710 peptides that are effective against a total of 336 different microbial species. The details of individual peptides and their MICs against their target species are given in the supplementary material.

Fig. 2. MICs converted to continuous labels between -1 and +1 using a bipolar sigmoid function

As an additional preprocessing step, we have scaled the MIC scores using a sigmoidal curve such that MIC scores ≤ 25 μg/mL are mapped close to +1 and those ≥ 100 μg/mL are mapped close to -1 (see Fig. 2). For this purpose, we have utilized a sigmoid rescaling function which maps raw MIC scores $y$ as follows:

$$y' = s\left(-\frac{y - 55}{10}\right), \qquad s(z) = 2\left(\frac{e^{z}}{1 + e^{z}}\right) - 1.$$

For example, $y = 25$ gives $z = 3$ and $y' \approx 0.905$, while $y = 100$ gives $z = -4.5$ and $y' \approx -0.978$. This rescaling ensures that subsequent processing and machine learning models are not affected by large variations in MICs across different target species and peptides, which can vary from a few μg/mL to more than 2000 μg/mL. If the MIC of a peptide is not known for a species, its rescaled score is set to 0.0.

Feature extraction
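Before moving on, the MIC rescaling from the preprocessing step can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `rescale_mic` is hypothetical, and the midpoint (55) and scale (10) are read from the rescaling formula as it appears in the text, so they should be treated as assumptions.

```python
import math

def rescale_mic(mic_ug_per_ml, midpoint=55.0, scale=10.0):
    """Map a raw MIC (ug/mL) to a score in [-1, +1] with a bipolar sigmoid.

    midpoint and scale are assumptions taken from the formula in the text.
    A missing MIC (None) is mapped to 0.0, as described in the text.
    """
    if mic_ug_per_ml is None:
        return 0.0
    z = -(mic_ug_per_ml - midpoint) / scale
    # 2 * e^z / (1 + e^z) - 1 is algebraically equal to tanh(z / 2);
    # tanh avoids overflow for very large MICs.
    return math.tanh(z / 2.0)

if __name__ == "__main__":
    for mic in (5.0, 25.0, 55.0, 100.0, 2000.0, None):
        print(mic, round(rescale_mic(mic), 3))
```

Low MICs (potent peptides) land near +1, high MICs near -1, and the unknown case sits exactly at the neutral value 0.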

To predict the antimicrobial activity of a peptide against a given species through machine learning, we need features of both the peptide and the genomic sequence of the target microbial species, as discussed below (see Fig. 3).

Amino Acid Sequence features

In order to obtain peptide-level features, we have used one-hot encoding of the residues of the peptide sequence, which results in a 40-dimensional feature vector (frequency counts of 20 L-amino acids and 20 D-amino acids). The feature representation models the type of amino acid (L and D) in the peptide sequence separately, as peptide bioactivity is dependent upon the type of amino acids (Cava et al., 2011; Mangoni et al., 2006; Baltz, 2009; Kawai et al., 2004). The resulting feature vector for a given peptide is normalized to unit norm. We have also analyzed 2-mer composition, which results in a $40^2 = 1600$-dimensional feature vector (Leslie et al., 2001). DBAASP (Pirtskhalava et al., 2015) also provides information about N-terminus and C-terminus modifications of peptides, which can play a significant role in their antimicrobial activity. Modification at the N-terminus and C-terminus of peptides can change their biological activity (Crusca Jr et al., 2011). We have used one-hot encoding to capture information about C- and N-terminus modifications in our feature representation. The sequence features are concatenated with the C and N termini features. Details about the different types of C and N termini modifications are given in the supplementary information.

Genomic features

In order to perform targeted prediction of antimicrobial activity of a peptide sequence against a particular species through machine learning, we need to extract species-level features as well. The literature reports the use of mono-, di-, tri- and tetra-nucleotide composition of genomic sequences for comparison or clustering of genomes (Karlin and Ladunga, 1994; Karlin et al., 1998; Karlin and Burge, 1995; Karlin, 1998; Nakashima et al., 1997, 1998; Pride et al., 2003; Takahashi et al., 2009). As a consequence, we have extracted features from complete genomes of species downloaded from NCBI (Coordinators, 2016). For feature extraction, the counts of 1-mers, 2-mers, 3-mers and 4-mers are calculated from a given genome sequence and normalized to unit norm, resulting in a 340-dimensional ($4 + 16 + 64 + 256$) feature representation.
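The peptide and genomic feature extraction steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, D-amino acids are assumed to be written in lower case, the 1- to 4-mer counts are assumed to be concatenated before a single L2 normalization, and terminus-modification features are omitted.

```python
import math
from itertools import product

L_AMINO = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: upper case marks L-amino acids, lower case marks D-amino acids.
ALPHABET = L_AMINO + L_AMINO.lower()

def unit_norm(vec):
    """Scale a vector to unit L2 norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

def peptide_composition(seq):
    """40-dim count vector over 20 L- and 20 D-amino acids, unit-normalized."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    counts = [0.0] * len(ALPHABET)
    for aa in seq:
        if aa in index:  # non-standard symbols are skipped
            counts[index[aa]] += 1.0
    return unit_norm(counts)

def genome_kmer_features(genome, max_k=4):
    """Concatenated 1..max_k-mer counts over ACGT (340 dims for max_k=4)."""
    genome = genome.upper()
    features = []
    for k in range(1, max_k + 1):
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        counts = dict.fromkeys(kmers, 0.0)
        for i in range(len(genome) - k + 1):
            window = genome[i:i + k]
            if window in counts:  # windows with N or other symbols are skipped
                counts[window] += 1.0
        features.extend(counts[kmer] for kmer in kmers)
    return unit_norm(features)

if __name__ == "__main__":
    x = peptide_composition("GIGKFLHSAKKFGKAFVGEIMNS")
    s = genome_kmer_features("ACGTTGCAACGTNNACGGT")
    print(len(x), len(s))
```

For real genomes the sliding-window loop would typically be replaced by a vectorized or streaming counter, but the resulting feature layout is the same.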

Prediction Models

To predict whether a given peptide sequence will be effective against a target microbial species or not, we have proposed a zero-shot machine learning model. We compare the proposed model to a conventional machine learning model as a baseline, as discussed below. In order to aid the reader in understanding our modeling approach for the baseline and zero-shot predictors, we denote a peptide sequence by its $d$-dimensional feature vector $\mathbf{x}_i$, $i = 1, \ldots, 5710$, whereas a particular microbial species is represented by an $a$-dimensional attribute vector $\mathbf{s}_j$, $j = 1, \ldots, 336$, based on its genomic sequence. We denote the rescaled MIC of a peptide $\mathbf{x}_i$ against species $\mathbf{s}_j$ by the target variable $y_{ij}$. The prediction problem can then be expressed as finding a mathematical function $f(\mathbf{x}_i, \mathbf{s}_j; \boldsymbol{\Theta})$, parameterized by learnable parameters $\boldsymbol{\Theta}$, that can predict the effectiveness of a sequence $\mathbf{x}_i$ for microbial species $\mathbf{s}_j$.

Baseline models

We have chosen Radial Basis Function SVM (Cortes and Vapnik, 1995) and XGBoost (Chen and Guestrin, 2016) as baseline models due to their widespread use and ease of modeling. In order to predict the effectiveness of a given peptide sequence against a microbial species, we construct a joint feature representation $\boldsymbol{\phi}_{ij} = [\mathbf{x}_i \; \mathbf{s}_j]$ by concatenating peptide and species level features, with the associated training label $y_{ij}$ set to +1 (antimicrobial) if the MIC of peptide $\mathbf{x}_i$ for species $\mathbf{s}_j$ is ≤ 25 μg/mL and -1 (non-antimicrobial) if the MIC is ≥ 100 μg/mL. A conventional SVM or XGBoost model can then be trained over such a data set.

Zero and Few shot learning
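For reference, the joint feature construction used by the baseline models in the previous subsection can be sketched as below. The function name, the dictionary-based inputs, and the decision to drop pairs with an MIC strictly between 25 and 100 μg/mL are assumptions made for illustration; the text only defines labels for the two outer ranges.

```python
def build_baseline_dataset(peptide_feats, species_feats, mic_table,
                           active_max=25.0, inactive_min=100.0):
    """Return (features, labels) for peptide-species pairs with a usable MIC.

    peptide_feats: dict peptide_id -> list of floats (x_i)
    species_feats: dict species_id -> list of floats (s_j)
    mic_table:     dict (peptide_id, species_id) -> MIC in ug/mL
    """
    features, labels = [], []
    for (pep, sp), mic in mic_table.items():
        if mic <= active_max:
            label = +1          # antimicrobial
        elif mic >= inactive_min:
            label = -1          # non-antimicrobial
        else:
            continue            # assumption: intermediate MICs are left out
        # phi_ij = [x_i  s_j]: concatenate peptide and species features
        features.append(list(peptide_feats[pep]) + list(species_feats[sp]))
        labels.append(label)
    return features, labels

if __name__ == "__main__":
    peptides = {"p1": [1.0, 0.0], "p2": [0.0, 1.0]}
    species = {"s1": [0.5, 0.5, 0.0]}
    mics = {("p1", "s1"): 12.5, ("p2", "s1"): 256.0}
    X, y = build_baseline_dataset(peptides, species, mics)
    print(X, y)
```

The resulting feature rows and labels can be passed to any RBF-kernel SVM or gradient-boosted tree implementation.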

In this work, we propose to model the problem of targeted antimicrobial activity prediction through zero-shot learning (ZSL) (Romera-Paredes and Torr, 2015). Widely used in object classification and computer vision, ZSL allows a classification model to generate predictions for novel classes which were not available at training time (Socher et al., 2013; Norouzi et al., 2013; Fu et al., 2015). This is achieved by learning the definition of a class through an attribute vector representation instead of predicting class labels directly as in conventional classification. Many variants of ZSL have been proposed in the literature (Palatucci et al., 2009; Zhang and Saligrama, 2015; Socher et al., 2013; Norouzi et al., 2013; Fu et al., 2015; Kodirov et al., 2017; Romera-Paredes and Torr, 2015). While ZSL assumes that no examples of a novel class presented during testing are available for training, the related case of few-shot learning (FSL) aims at building a machine learning model when only a few training examples are available for the target class (Snell et al., 2017; Sung et al., 2018; Gidaris and Komodakis, 2018; Garcia and Bruna, 2017; Ravi and Larochelle, 2016). FSL techniques perform significantly better than conventional classification methods when the number of training examples is very small (Snell et al., 2017; Sung et al., 2018; Gidaris and Komodakis, 2018).

The problem of targeted antimicrobial activity prediction is ideally suited to zero and few shot learning: in typical machine learning guided design of wet lab experiments for screening potential peptides that are effective against a target microbial species, no or very few peptides with known labels are available for training. Furthermore, in order to predict how effective a peptide is against a novel microbial species for which no or very few training examples are available, we can model the target microbial species as a class represented by an attribute vector based on its genomic sequence.

Fig. 3. Proposed model framework using features of peptide and genomic sequences

In this work, we have used the ZSL scheme given by Romera-Paredes and Torr (2015). For predicting the MIC of a peptide sequence for a target species, the discriminant function used by this ZSL model can be written as $f(\mathbf{x}_i, \mathbf{s}_j; \boldsymbol{\Theta}) = \mathbf{x}_i^T \boldsymbol{\Theta} \mathbf{s}_j$ with the learnable weight matrix $\boldsymbol{\Theta} \in \mathbb{R}^{d \times a}$. If the numbers of peptides and species (classes) available during training are $m$ and $z$, respectively, and the rescaled MIC scores for each peptide against each microbe are represented by the matrix $\mathbf{Y} \in [-1, 1]^{m \times z}$, the learning problem for ZSL can be formulated as the following optimization problem:

$$\boldsymbol{\Theta}^* = \underset{\boldsymbol{\Theta} \in \mathbb{R}^{d \times a}}{\operatorname{argmin}} \; \left\|\mathbf{X}^T \boldsymbol{\Theta} \mathbf{S} - \mathbf{Y}\right\|_F^2 + \left(\gamma \left\|\boldsymbol{\Theta} \mathbf{S}\right\|_F^2 + \lambda \left\|\mathbf{X}^T \boldsymbol{\Theta}\right\|_F^2 + \gamma\lambda \left\|\boldsymbol{\Theta}\right\|_F^2\right)$$

Here, $\mathbf{X} \in \mathbb{R}^{d \times m}$ and $\mathbf{S} \in \mathbb{R}^{a \times z}$ represent matrices of all peptide features ($m$ examples, each with a $d$-dimensional feature vector) and attributes of microbial species ($z$ classes, each with $a$ attributes), respectively. The first term represents the loss function with the aim of minimizing the error between predicted and target MICs. The second term $\left(\gamma \|\boldsymbol{\Theta} \mathbf{S}\|_F^2 + \lambda \|\mathbf{X}^T \boldsymbol{\Theta}\|_F^2 + \gamma\lambda \|\boldsymbol{\Theta}\|_F^2\right)$ is the regularization factor that ensures smoothness of the prediction function $f(\mathbf{x}, \mathbf{s}; \boldsymbol{\Theta})$ and sparsity of the weight matrix $\boldsymbol{\Theta}$ through penalization of the squared Frobenius norm $\|\cdot\|_F^2$ of the respective matrices. $\gamma$ and $\lambda$ are regularization hyperparameters. In addition to better performance over benchmark datasets, another reason for choosing this ZSL implementation is the existence of a computationally efficient closed-form solution of its underlying optimization problem, which can be written as follows:

$$\boldsymbol{\Theta}^* = \left(\mathbf{X}\mathbf{X}^T + \gamma \mathbf{I}\right)^{-1} \mathbf{X} \mathbf{Y} \mathbf{S}^T \left(\mathbf{S}\mathbf{S}^T + \lambda \mathbf{I}\right)^{-1}$$

Once the optimal weight matrix $\boldsymbol{\Theta}^*$ has been obtained, the prediction for a peptide (represented by the feature vector $\mathbf{x}$) for a species (represented by the attribute vector $\mathbf{s}$) can be generated by the decision function $f(\mathbf{x}, \mathbf{s}; \boldsymbol{\Theta}^*) = \mathbf{x}^T \boldsymbol{\Theta}^* \mathbf{s}$. Note that this decision function can be used for generating predictions both for novel peptides and for novel species, provided their attribute representation $\mathbf{s}$ is available. The most likely target species for a given peptide can be identified by simply ranking the resulting decision function scores across a given list of potential target species.

This formulation can be kernelized for non-linear kernels as well by applying the Representer theorem to the underlying optimization problem (Romera-Paredes and Torr, 2015). For this purpose, an $m \times m$ sized kernel matrix $\mathbf{K}$ with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is computed over the training data using a kernel function such as the radial basis function (RBF) $k(\mathbf{a}, \mathbf{b}) = \exp\left(-\kappa \|\mathbf{a} - \mathbf{b}\|^2\right)$ with the hyperparameter $\kappa > 0$.
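Returning to the linear closed-form solution for the weight matrix, the following NumPy sketch shows how it could be computed and used for zero-shot prediction. The function names `fit_zsl` and `predict_zsl` are hypothetical, the shapes follow the definitions in the text (peptide features of size d × m, species attributes of size a × z, rescaled MICs of size m × z), and the default regularization values are simply the ones reported as best-performing in the text.

```python
import numpy as np

def fit_zsl(X, S, Y, gamma=2.0, lam=1e-4):
    """Theta* = (X X^T + gamma I)^-1  X Y S^T  (S S^T + lam I)^-1.

    X: (d, m) peptide features, S: (a, z) species attributes,
    Y: (m, z) rescaled MIC scores in [-1, 1].
    """
    d, a = X.shape[0], S.shape[0]
    left = X @ X.T + gamma * np.eye(d)    # (d, d)
    right = S @ S.T + lam * np.eye(a)     # (a, a)
    middle = X @ Y @ S.T                  # (d, a)
    # Solve linear systems rather than forming explicit inverses.
    theta = np.linalg.solve(left, middle)            # left^-1 @ middle
    theta = np.linalg.solve(right.T, theta.T).T      # (...) @ right^-1
    return theta

def predict_zsl(theta, x, s):
    """Decision score f(x, s) = x^T Theta s for one peptide and one species."""
    return float(x @ theta @ s)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 12))          # d=5 features, m=12 peptides
    S = rng.normal(size=(3, 4))           # a=3 attributes, z=4 species
    Y = rng.uniform(-1, 1, size=(12, 4))
    theta = fit_zsl(X, S, Y)
    s_novel = rng.normal(size=3)          # attribute vector of an unseen species
    print(predict_zsl(theta, X[:, 0], s_novel))
```

Because the unseen species enters only through its attribute vector, no retraining is needed to score a peptide against it.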
The closed-form solution of the kernelized ZSL optimization problem requires calculation of an $m \times a$ sized instance-attribute association matrix $\mathbf{A}$ from training data as follows (see Romera-Paredes and Torr (2015) for details):

$$\mathbf{A} = \left(\mathbf{K}^T \mathbf{K} + \gamma \mathbf{I}\right)^{-1} \mathbf{K} \mathbf{Y} \mathbf{S}^T \left(\mathbf{S}\mathbf{S}^T + \lambda \mathbf{I}\right)^{-1}$$

For inference or prediction of effectiveness of a peptide represented by a feature vector $\mathbf{x}$ against a microbial species represented by its attribute vector $\mathbf{s}$, an $m$-dimensional vector of kernel scores $\mathbf{k}(\mathbf{x}) = [k(\mathbf{x}, \mathbf{x}_1) \; k(\mathbf{x}, \mathbf{x}_2) \; \cdots \; k(\mathbf{x}, \mathbf{x}_m)]^T$ of the test example with each training example is computed and used in the kernelized prediction function $f(\mathbf{x}, \mathbf{s}; \mathbf{A}) = \mathbf{k}(\mathbf{x})^T \mathbf{A} \mathbf{s}$.

It is important to note that this framework extends seamlessly to FSL by simply adding further training instances for a target class. The hyperparameters of the model $(\gamma, \lambda, \kappa)$ are tuned through cross-validation. The best performance of the model was found using $\gamma = 2.0$, $\lambda = 0.0001$, and the RBF kernel hyperparameter $\kappa = 2.0$.

Performance evaluation
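The kernelized variant described at the end of the previous section can be sketched in the same way. This is an illustration under assumptions: the function names are hypothetical, training peptides are stored one per row (m × d) for convenience, the species attribute matrix is kept as a × z as in the linear case (so the right-hand factor uses its transpose), and the RBF kernel is taken with a squared Euclidean distance.

```python
import numpy as np

def rbf_kernel(A, B, kappa=2.0):
    """k(a, b) = exp(-kappa * ||a - b||^2) between all rows of A and all rows of B."""
    sq = (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-kappa * np.maximum(sq, 0.0))  # clip tiny negative round-off

def fit_kernel_zsl(X_rows, S, Y, gamma=2.0, lam=1e-4, kappa=2.0):
    """A = (K^T K + gamma I)^-1  K Y S^T  (S S^T + lam I)^-1.

    X_rows: (m, d) training peptides, S: (a, z) species attributes,
    Y: (m, z) rescaled MIC scores. Returns A with shape (m, a).
    """
    K = rbf_kernel(X_rows, X_rows, kappa)                         # (m, m)
    m, a = K.shape[0], S.shape[0]
    A = np.linalg.solve(K.T @ K + gamma * np.eye(m), K @ Y @ S.T)  # (m, a)
    A = np.linalg.solve((S @ S.T + lam * np.eye(a)).T, A.T).T
    return A

def predict_kernel_zsl(A, X_rows, x_new, s, kappa=2.0):
    """f(x, s) = k(x)^T A s, with k(x) the kernel scores against training peptides."""
    k = rbf_kernel(x_new[None, :], X_rows, kappa)[0]              # (m,)
    return float(k @ A @ s)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_rows = rng.normal(size=(10, 5))
    S = rng.normal(size=(3, 4))
    Y = rng.uniform(-1, 1, size=(10, 4))
    A = fit_kernel_zsl(X_rows, S, Y)
    print(predict_kernel_zsl(A, X_rows, rng.normal(size=5), rng.normal(size=3)))
```

In this sketch, few-shot learning amounts to appending the extra labeled peptides as rows of `X_rows` and `Y`, with the target species added as a column of `S`, before refitting.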

We consider two practical use-cases of our system: 1) Target Species Ranking (TSR): given a set of microbial species for which labeled peptide sequences are available for training, predict the microbe that is most likely to be targeted by a novel peptide sequence; and 2) Peptide Activity Prediction for Novel Species (PAP): predict whether a peptide is effective against a given species or not when no or very few peptide examples for that species are available during training (i.e., Zero Shot or Few Shot Learning) (see Fig. 4). It is important to note that both these scenarios reflect practical use cases for biologists who are interested in machine-learning guided discovery of targeted antimicrobial peptides. In order to evaluate the performance of the baseline and proposed machine learning models for TSR, we have used 5-fold cross validation (John Lu, 2010). The dataset of 5,710 peptides is divided into 5 non-overlapping folds. A given model is trained on labeled examples of all peptides in 4 folds and tested on the remaining peptides. This process is repeated 5 times, once for each fold. For each test peptide in a fold, model scores for all 336 species are sorted in descending order. The rank of the highest scoring microbe that is a known target of the given test peptide (positive example) is used as a peptide-specific performance metric. This simple biologist-centric performance metric, called Rank of First Positive Prediction (RFPP), is based on the premise that an ideal machine learning model should assign a high score to a known target species of a given peptide sequence and, consequently, place target species nearer the top of the sorted list (i.e., at numerically lower ranks) than non-target species (Minhas et al., 2014). As a result, for an ideal machine learning model, the RFPP for all test peptides should be 1.0.
As discussed in the results section, we report the percentile-wise RFPP scores for all test peptides for different machine learning models together with a random predictor as an experimental control. The RFPP score at a certain percentile $p$, henceforth denoted by $RFPP(p)$, is defined as follows: $RFPP(p) = q$ if $p\%$ of test peptides have at least one known target microbial species among their top $q$ predictions (out of 336). Thus, for an ideal classifier $RFPP(100) = 1$, i.e., for every peptide, the top scoring species is a real target species of the given test peptide. RFPP is a biologist-centric metric as it tells us directly how often top-ranking predictions of a peptide can be expected to correspond to true target species, and it can be directly used in experiment design.

Fig. 4. (a) PAP takes a peptide sequence and a novel species genome as inputs to predict whether the peptide is effective against the given species or not; (b) TSR requires a novel peptide sequence and predicts the microbe that is most likely to be targeted by that peptide (out of 336 given species)

For PAP, i.e., predicting a peptide's effectiveness for a novel species, our proposed modeling approach takes peptide and genomic sequences as input, and the score generated by the decision function of a machine learning model is used for classification of peptide sequences for individual species. In order to quantify predictive accuracy, a selected set of 17 test species from DBAASP with a small but sufficient number (75-180) of known positive and negative peptide examples is used (details given in Table 2). For ZSL, the model is trained on all examples from other species and its predictive performance is evaluated for individual species in Table 2 using the area under the receiver operating characteristic curve (AUC-ROC) as a performance metric (Davis and Goadrich, 2006). For few-shot learning (FSL), a few positive and negative examples of a test species (1, 2, 4, 8 and half of all available examples for that species) are randomly sampled for training together with all examples from all other species, and the model is evaluated on the remaining examples of the test species. This process is repeated 20 times with different species-level training and test examples to get average AUC-ROC scores and their standard deviation.

Results
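Before turning to the results, the RFPP metric defined in the evaluation section can be made concrete with a short sketch. The function names and the dictionary-based score format are hypothetical; the percentile rule follows the definition in the text (the smallest rank cutoff that covers the requested percentage of test peptides).

```python
import math

def rfpp(scores, known_targets):
    """Rank (1 = best) of the highest-scoring known target species for one peptide.

    scores:        dict species_id -> model score for this peptide
    known_targets: set of species_ids that are true targets of the peptide
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    for rank, species in enumerate(ranked, start=1):
        if species in known_targets:
            return rank
    raise ValueError("peptide has no known target among the scored species")

def rfpp_at_percentile(rfpp_values, p):
    """Smallest q such that at least p% of peptides have an RFPP <= q."""
    ordered = sorted(rfpp_values)
    needed = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[needed - 1]

if __name__ == "__main__":
    scores = {"E. coli": 0.9, "S. aureus": 0.5, "C. albicans": 0.1}
    print(rfpp(scores, {"S. aureus", "C. albicans"}))
    print(rfpp_at_percentile([1, 1, 1, 4, 10], 75))
```

An ideal model yields an RFPP of 1 for every peptide, so `rfpp_at_percentile(values, 100)` would equal 1.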

In this section, we discuss the results for the two learning tasks below.

Target Species Ranking (TSR)

Fig. 5 shows the percentile-wise RFPP scores for all classifiers. As discussed in section 2.4, the ideal RFPP score for all peptides is 1.0. For the random classifier that generates a random score for a given example, the median RFPP is 75, i.e., for 50% of test peptides in cross-validation, a true target species is within the top 75 (out of 336) predictions. In contrast, for the XGBoost and SVM baseline models, the median RFPPs are 50 and 10, respectively. However, the proposed model performs much better than these baseline models: the RFPP for the proposed model at the 75th percentile is 1.0, i.e., for up to 75% of peptides, the top prediction by the model is correct. This clearly shows the effectiveness of the proposed prediction scheme for identifying the correct target species of a peptide. For the TSR use case, the optimal results of SVM and XGBoost were obtained with 2-mer composition features of peptides.

Peptide Activity Prediction for Novel Species

Table 2 shows the results of various machine learning models for the Peptide Activity Prediction (PAP) task. In this task the objective is to evaluate whether a given machine learning model can correctly predict peptides that target a novel species for which none or very few training examples are available. For this purpose, we compare the performance of conventional machine learning models (SVM, XGBoost), the proposed Zero Shot Learning (ZSL) and Few Shot Learning (FSL) models in addition to existing state of the art non-targeted antimicrobial activity predictors (CAMP (Waghu et al., 2015) (Gabere and Noble, 2017) and AMAP (Gull et al., 2019)). For this use case, XGBoost with amino acid composition features performed significantly better than SVM (results not shown for brevity). However, the prediction performance of XGBoost was typically no better than a random classifier, especially when the number of training examples from a given test species was very small (see Supplementary Information for complete results). Similarly, existing state of the art methods such as CAMP (Waghu et al., 2015) and AMAP (Gull et al., 2019) do not give satisfactory predictive performance for the chosen species. In contrast, the proposed few shot learning model performs significantly better than other methods, with an expected increase in prediction accuracy when the number of training examples of a species is increased.
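The AUC-ROC values discussed above, and the few-shot sampling protocol behind them, can be illustrated with the sketch below. It is not the authors' evaluation code: the function names are hypothetical, AUC is computed with the pairwise (Mann-Whitney) formulation, and drawing k positive and k negative examples per run is an assumption, since the text does not state whether the sample sizes are per class.

```python
import random

def auc_roc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y > 0]
    neg = [s for y, s in zip(labels, scores) if y <= 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def few_shot_split(labels, k, rng):
    """Pick k positive and k negative indices for training; the rest are for testing."""
    pos = [i for i, y in enumerate(labels) if y > 0]
    neg = [i for i, y in enumerate(labels) if y <= 0]
    train = set(rng.sample(pos, k) + rng.sample(neg, k))
    test = [i for i in range(len(labels)) if i not in train]
    return sorted(train), test

if __name__ == "__main__":
    labels = [1, 1, -1, -1]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(auc_roc(labels, scores))
    print(few_shot_split([1, 1, 1, -1, -1, -1], 1, random.Random(0)))
```

Repeating the split with different random seeds and averaging the resulting AUC values gives the mean and standard deviation format used in Table 2.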

Webserver and Code

The webserver developed for the proposed model, together with the code, is available at http://ampzero.pythonanywhere.com. The webserver takes a peptide sequence in FASTA format along with any C-terminus and N-terminus modifications as input, together with the genome of a species, in order to predict the degree of effectiveness of the peptide against the given species. Additionally, the user can upload a list of known positive and negative example peptide sequences for the given species for generating few shot learning based predictions.

Conclusions

We have developed a targeted antimicrobial activity predictor called AMPZero that can predict the effectiveness of a given peptide sequence against a given target species. The use of zero and few shot learning in the proposed model helps in overcoming the shortcomings of conventional machine learning techniques for this purpose. Our cross-validation analysis shows that the proposed model can perform better than existing approaches and that it can be easily integrated into the experimental discovery of antimicrobial peptide sequences for novel species.

Acknowledgements

Sadaf Gull is supported by a grant under indigenous 5000 Ph.D. fellowship scheme by the Higher Education Commission (HEC) of Pakistan.

Conflict of Interest: none declared.

Fig. 5. Percentile-wise RFPP for target species ranking for various machine learning models

Table 2. Results for Peptide Activity Prediction for Novel Species. The first column indicates the type of the different test species used in this analysis. The species name together with the total number of positive (P) and negative (N) examples available for that species are given in the second column. Results for zero shot learning (ZSL), in which no examples of the given test species are included in training, are shown for the proposed ZSL model. For few shot learning, results for different numbers of training examples (1, 2, 4, 8 and Half of all available examples) of the target species are shown. In the interest of relevance and brevity, results for XGBoost are shown only when half of the available examples are used for training. CAMP and AMAP are existing state of the art predictors for antimicrobial activity and the prediction results were obtained using their respective webservers. Values in bold indicate the highest prediction performance. Note that the average AUC-ROC across multiple runs is reported together with the standard deviation (in parentheses).

| Species type | Species name (P, N) | ZSL (0) | FSL (1) | FSL (2) | FSL (4) | FSL (8) | FSL (Half) | XGBoost (Half) | CAMP | AMAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fungus | Aspergillus fumigatus (P: 44, N: 33) | 0.746 (0.056) | 0.807 (0.059) | 0.806 (0.041) | 0.820 (0.046) | 0.835 (0.043) | (0.043) | 0.614 (0.073) | 0.798 (0.051) | 0.545 (0.055) |
| | Candida glabrata (P: 35, N: 47) | | | | | | (0.052) | 0.677 (0.087) | 0.350 (0.047) | 0.489 (0.081) |
| | Candida parapsilosis (P: 51, N: 33) | | | | | | (0.055) | 0.639 (0.120) | 0.660 (0.069) | 0.662 (0.075) |
| | Candida tropicalis (P: 88, N: 16) | | | | | | (0.042) | 0.669 (0.066) | 0.703 (0.076) | 0.561 (0.058) |
| | Cryptococcus neoformans (P: 167, N: 14) | | | | | | (0.068) | 0.541 (0.078) | 0.576 (0.089) | 0.581 (0.084) |
| | Saccharomyces cerevisiae (P: 132, N: 36) | | | | | | (0.052) | 0.604 (0.046) | 0.388 (0.053) | 0.448 (0.043) |
| | Fusarium oxysporum (P: 125, N: 18) | | | | | | (0.021) | 0.696 (0.094) | 0.418 (0.049) | 0.396 (0.033) |
| Gram Negative Bacteria | Enterobacter aerogenes (P: 36, N: 49) | | | | | | (0.051) | 0.773 (0.069) | 0.550 (0.067) | 0.443 (0.069) |
| | Erwinia amylovora (P: 112, N: 35) | | | | | | (0.047) | (0.032) | 0.877 (0.021) | 0.385 (0.046) |
| | Pasteurella multocida (P: 37, N: 53) | | | | | | (0.019) | 0.914 (0.046) | 0.528 (0.067) | 0.295 (0.039) |
| | Proteus mirabilis (P: 27, N: 105) | | | | | | (0.048) | 0.731 (0.079) | 0.377 (0.046) | 0.269 (0.065) |
| | Proteus vulgaris (P: 84, N: 34) | | | | | | (0.030) | 0.667 (0.071) | 0.465 (0.048) | 0.567 (0.048) |
| | Serratia marcescens (P: 48, N: 62) | | | | | | (0.021) | 0.571 (0.068) | 0.397 (0.084) | 0.418 (0.046) |
| Gram Positive Bacteria | Listeria innocua (P: 64, N: 36) | 0.686 (0.038) | 0.688 (0.067) | 0.710 (0.064) | 0.738 (0.056) | 0.763 (0.062) | (0.057) | | | |
| | Streptococcus mutans (P: 129, N: 11) | | | | | | (0.045) | 0.767 (0.108) | 0.472 (0.099) | 0.706 (0.083) |
| | Streptococcus pneumoniae (P: 86, N: 17) | | | | | | (0.066) | 0.507 (0.071) | 0.3081 (0.057) | 0.384 (0.077) |
| | Streptococcus pyogenes (P: 161, N: 9) | | | | | | (0.037) | 0.669 (0.069) | 0.569 (0.156) | 0.737 (0.062) |

References

Afsar Minhas,F. ul A. et al. (2014) PAIRpred: Partner-specific prediction of interacting residues from sequence and structure. Proteins: Structure, Function, and Bioinformatics, 1142–1155.

Agrawal,P. and Raghava,G.P. (2018) Prediction of Antimicrobial Potential of a Chemically Modified Peptide From Its Tertiary Structure. Frontiers in Microbiology, 2551.

Aslam,B. et al. (2018) Antibiotic resistance: a rundown of a global crisis. Infection and drug resistance, 1645.

Baltz,R.H. (2009) Daptomycin: mechanisms of action and resistance, and biosynthetic engineering. Current opinion in chemical biology, 144–151.

Bhadra,P. et al. (2018) AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Scientific reports, 1697.

Blair,J.M. (2018) A climate for antibiotic resistance. Nature Climate Change, 460.

Cava,F. et al. (2011) Emerging knowledge of regulatory roles of D-amino acids in bacteria. Cellular and Molecular Life Sciences, 817–831.

Chen,T. and Guestrin,C. (2016) Xgboost: A scalable tree boosting system. In, Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, pp. 785–794.

Coordinators,N.R. (2016) Database resources of the national center for biotechnology information. Nucleic acids research, D7.

Cortes,C. and Vapnik,V. (1995) Support-vector networks. Machine learning, 273–297.

Costa,F. et al. (2019) Clinical Application of AMPs. In, Antimicrobial Peptides. Springer, pp. 281–298.

Crusca Jr,E. et al. (2011) Influence of N-terminus modifications on the biological activity, membrane interaction, and secondary structure of the antimicrobial peptide hylin-a1. Peptide Science, 41–48.

Davis,J. and Goadrich,M. (2006) The relationship between Precision-Recall and ROC curves. In, Proceedings of the 23rd international conference on Machine learning. ACM, pp. 233–240.

Fu,Y. et al. (2015) Transductive multi-view zero-shot learning. IEEE transactions on pattern analysis and machine intelligence, 2332–2345.

Gabere,M.N. and Noble,W.S. (2017) Empirical comparison of web-based antimicrobial peptide prediction tools. Bioinformatics, 1921–1929.

Garcia,V. and Bruna,J. (2017) Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.

Gidaris,S. and Komodakis,N. (2018) Dynamic few-shot visual learning without forgetting. In, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375.

Gull,S. et al. (2019) AMAP: Hierarchical multi-label prediction of biologically active and antimicrobial peptides. Computers in biology and medicine, 172–181.

John Lu,Z. (2010) The elements of statistical learning: data mining, inference, and prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society), 693–694.

Kampshoff,F. et al. (2019) A Pilot Study of the Synergy between Two Antimicrobial Peptides and Two Common Antibiotics. Antibiotics, 60.

Karlin,S. and Burge,C. (1995) Dinucleotide relative abundance extremes: a genomic signature. Trends in genetics, 283–290.

Karlin,S. et al. (1998) Comparative DNA analysis across diverse genomes. Annual review of genetics, 185–225.

Karlin,S. (1998) Global dinucleotide signatures and analysis of genomic heterogeneity. Current opinion in microbiology, 598–610.

Karlin,S. and Ladunga,I. (1994) Comparisons of eukaryotic genomic sequences. Proceedings of the National Academy of Sciences, 12832–12836.

Kawai,Y. et al. (2004) Structural and functional differences in two cyclic bacteriocins with the same sequences produced by lactobacilli. Appl. Environ. Microbiol., 2906–2911.

Kleandrova,V.V. et al. (2016) Enabling the discovery and virtual screening of potent and safe antimicrobial peptides. Simultaneous prediction of antibacterial activity and cytotoxicity. ACS combinatorial science, 490–498.

Kodirov,E. et al. (2017) Semantic autoencoder for zero-shot learning. In, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3174–3183.

Lakemeyer,M. et al. (2018) Thinking Outside the Box—Novel Antibacterials To Tackle the Resistance Crisis.

Angewandte Chemie International Edition , , 14440–14475. Leslie,C. et al. (2001) The spectrum kernel: A string kernel for SVM protein classi-fication. In, Biocomputing 2002 . World Scientific, pp. 564–575. Lin,W. and Xu,D. (2016) Imbalanced multi-label learning for identifying antimi-crobial peptides and their functional types.

Bioinformatics , , 3745–3752. Mangoni,M.L. et al. (2006) Effect of natural L-to D-amino acid conversion on the organization, membrane binding, and biological function of the antimicrobial peptides bombinins H. Biochemistry , , 4266–4276. Nakashima,H. et al. (1997) Di. erences in Dinucleotide Frequencies of Human, Yeast, and Escherichia coli Genes. DNA Research , , 185–192. Nakashima,H. et al. (1998) Genes from nine genomes are separated into their organisms in the dinucleotide composition space. DNA Research , , 251–259. Norouzi,M. et al. (2013) Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650 . Palatucci,M. et al. (2009) Zero-shot learning with semantic output codes. In, Ad-vances in neural information processing systems ., pp. 1410–1418. Pirtskhalava,M. et al. (2015) DBAASP v. 2: an enhanced database of structure and antimicrobial/cytotoxic activity of natural and synthetic peptides.

Nucleic acids research , , D1104–D1112. Pride,D.T. et al. (2003) Evolutionary implications of microbial genome tetranu-cleotide frequency biases. Genome research , , 145–158. Ravi,S. and Larochelle,H. (2016) Optimization as a model for few-shot learning. Romera-Paredes,B. and Torr,P. (2015) An embarrassingly simple approach to zero-shot learning. In, International Conference on Machine Learning ., pp. 2152–2161. Snell,J. et al. (2017) Prototypical networks for few-shot learning. In,

Advances in Neural Information Processing Systems ., pp. 4077–4087. Socher,R. et al. (2013) Zero-shot learning through cross-modal transfer. In,

Ad-vances in neural information processing systems ., pp. 935–943. Spaulding,C.N. et al. (2018) Precision antimicrobial therapeutics: the path of least resistance?

NPJ biofilms and microbiomes , , 4. Speck-Planche,A. et al. (2016) First multitarget chemo-Bioinformatic model to enable the discovery of antibacterial peptides against multiple gram-positive pathogens. Journal of chemical information and modeling , , 588–598. Sung,F. et al. (2018) Learning to compare: Relation network for few-shot learning. In, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition ., pp. 1199–1208. Takahashi,M. et al. (2009) Estimation of bacterial species phylogeny through oligonucleotide frequency distances.