Learning interpretable models of phenotypes from whole genome sequences with the Set Covering Machine
Alexandre Drouin, Sébastien Giguère, Vladana Sagatovich, Maxime Déraspe, François Laviolette, Mario Marchand, Jacques Corbeil
Université Laval, Québec, Canada
Département d'informatique et de génie logiciel
Département de médecine moléculaire
Abstract
The increased affordability of whole genome sequencing has motivated its use for phenotypic studies. We address the problem of learning interpretable models for discrete phenotypes from whole genomes. We propose a general approach that relies on the Set Covering Machine and a k-mer representation of the genomes. We show results for the problem of predicting the resistance of Pseudomonas aeruginosa, an important human pathogen, against 4 antibiotics. Our results demonstrate that extremely sparse models which are biologically relevant can be learnt using this approach.
Recent advances in next-generation sequencing (NGS) have led to a tremendous increase in the affordability of whole genome sequencing (WGS) [5]. The reduced cost and increased throughput of NGS have motivated the use of WGS for phenotypic studies [5, 6, 8, 15]. In particular, it is now possible to use whole genome sequences instead of DNA microarrays, which require prior knowledge of the genomic regions of interest. Learning models which can accurately predict a discrete phenotype from a genome has direct applications in the clinical setting and can lead to a better understanding of biological processes. This is especially true if the learnt model is simple and easy to interpret, which, unfortunately, is not the case for most learning algorithms. Moreover, the large size and increased availability of WGS data give rise to new computational challenges which must be addressed.

Machine learning algorithms require that the genomes be represented by a set of features. One common representation consists of a set of single nucleotide polymorphisms (SNP) [3]. Obtaining the SNPs for multiple genomes requires multiple sequence alignment, which is computationally expensive and affected by genome rearrangements [9]. Moreover, potentially important information about the genome can be lost in this process. In this paper, we favor an alternative approach, inspired by the "bag-of-words" model that is heavily used in the domain of text classification. It consists in representing each genome by the presence or absence of k-mers, which are words of k nucleotides that are possibly its subsequences [6]. In addition, this approach does not require any sequence alignment [9].

As previously stated, another key requirement when learning to predict a phenotype is that the model must be interpretable by domain experts. Interpretable models provide biological insight on the decision function and can lead to the discovery of novel biological processes.
Models must thus be sparse and composed of elements from which sufficient biological knowledge can be extracted. Sparsity of the model is desirable since it contributes to reducing the cost of validation and promotes its usage in a clinical setting. Many state-of-the-art learning algorithms do not provide interpretable models. This is the case of the Support Vector Machine (SVM) [4], which yields dense models. The Lasso [14] yields sparser models than the SVM; nevertheless, these often contain a large number of non-null coefficients, rendering their biological understanding difficult.

In this paper, we propose a novel approach for learning sparse and interpretable models from whole genomes for predicting discrete phenotypes. Our approach relies on the Set Covering Machine [11], a learning algorithm that produces highly sparse models and has achieved state-of-the-art accuracy for many learning tasks, such as learning from DNA microarray data [12]. The models obtained are short conjunctions or disjunctions of boolean-valued rules which can explicitly highlight the importance of specific DNA sequences.

We first present the Set Covering Machine learning algorithm. Subsequently, we demonstrate how it can be applied to predict phenotypes based on whole genomes. We then present an example application to the problem of predicting the antimicrobial resistance of Pseudomonas aeruginosa (PA), an opportunistic, nosocomial pathogen of immunocompromised individuals [7]. PA typically infects the pulmonary tract, urinary tract, burns, and wounds, and also causes blood infections. It is an important human pathogen. Finally, we discuss the models found for the resistance of PA against four antibiotics, the limitations of the approach, and propose paths to extend our work.

∗ Corresponding author: [email protected]. Accepted for presentation at Machine Learning in Computational Biology 2014, Montréal, Québec, Canada.
In the supervised machine learning setting, we assume that data are available as a set S = {(x_i, y_i)}_{i=1}^m ∼ D^m, where x_i ∈ X is a training example, y_i ∈ Y is its associated label and D is a data generating distribution. We consider binary classification problems where Y = {0, 1}. The goal of a learning algorithm is to obtain a predictor h : X → Y, such that h(x) = y for most (x, y) ∼ D. The Set Covering Machine (SCM) [11] is a learning algorithm that produces predictors that are conjunctions or disjunctions of boolean-valued rules r_i : X → {0, 1}. Given a set of rules, the SCM attempts to find the smallest subset of rules that correctly classifies the data. As mentioned in [11], this problem can be reduced to the set cover problem, which is known to be NP-hard. However, the SCM uses a greedy algorithm to obtain an approximate solution with a worst case guarantee. Algorithm 1 presents the SCM algorithm in the case where the returned predictor is a conjunction of boolean-valued rules. The disjunction case can be obtained from the previous one by using S′ = {(x_i, ¬y_i) : (x_i, y_i) ∈ S} as the set of training examples for Algorithm 1 and taking the complement of the returned predictor h. This follows from De Morgan's law: ¬⋀_{r*_i ∈ R*} r*_i(x) = ⋁_{r*_i ∈ R*} ¬r*_i(x).

In Algorithm 1, the parameter s acts as a regularizer by limiting the complexity of h. The parameter p adds robustness to noise and to class imbalance by controlling a trade-off between minimizing the error on each class. If more than one rule has a maximal utility value U_i, we select the one which most reduces the empirical risk and thus maximises |A_i| − |B_i|. If there still exists a tie, we select either of the rules. Notice that, at each iteration, only the examples for which the outcome of h is not already determined are considered.
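As an illustration, the greedy conjunction case can be sketched in a few lines of Python. This is a reconstruction from the description above, not the authors' implementation; the tie-breaking on |A_i| − |B_i| is omitted for brevity.

```python
def train_scm_conjunction(examples, rules, p=1.0, s=5):
    """Greedy SCM: learn a conjunction of at most s boolean-valued rules.

    examples: list of (x, y) pairs with y in {0, 1}
    rules:    list of callables r(x) -> 0 or 1
    p:        class trade-off parameter
    s:        maximum number of rules in the conjunction
    """
    P = [x for x, y in examples if y == 1]  # positive examples
    N = [x for x, y in examples if y == 0]  # negative examples
    chosen = []
    while N and len(chosen) < s:
        best = None  # (utility, rule index, A, B)
        for i, r in enumerate(rules):
            # A conjunction outputs 0 as soon as one rule outputs 0, so a
            # rule correctly classifies the negatives on which it is 0.
            A = [x for x in N if r(x) == 0]
            B = [x for x in P if r(x) == 0]  # positives misclassified
            if len(A) >= len(B):
                u = len(A) - p * len(B)
                if best is None or u > best[0]:
                    best = (u, i, A, B)
        if best is None:  # every rule had utility -infinity: stop early
            break
        _, i, A, B = best
        chosen.append(rules[i])
        N = [x for x in N if x not in A]  # outcome now determined
        P = [x for x in P if x not in B]
    return lambda x: int(all(r(x) for r in chosen))
```

The disjunction case follows, as stated above, by negating the labels before training and complementing the returned predictor.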
This prevents the selection of correlated rules, which would unnecessarily increase the complexity of the predictor. This is crucial in a clinical setting, since increased predictor complexity can lead to dispensable costs.

The running time complexity of Algorithm 1 is O(|R| · |S| · s) in the worst case. It thus scales linearly in the number of rules and the number of training examples. In addition, note that at each iteration, the computation can be parallelized by distributing the computation of the |A_i| and |B_i| on multiple cores.

In order to apply the SCM to whole genomes, we use a k-mer representation. First, given a dataset S = {(x_i, y_i)}_{i=1}^m, where x_i is a genome and y_i ∈ {0, 1} is the phenotype, we define K as the set of all unique k-mers that are present in at least one genome. All overlapping k-mers are considered while constructing this set. Note that k-mers that are in all genomes can be discarded, since they cannot be selected by Algorithm 1. Then, for each (x, y) ∈ S, we represent a genome x by φ(x), a boolean vector of |K| dimensions where component φ_i(x) = 1 if k_i ∈ K is in genome x and 0 otherwise. Then, for each k-mer k_i ∈ K, we define a presence rule p_{k_i}(φ(x)) := I(φ_i(x) = 1) and an absence rule a_{k_i}(φ(x)) := I(φ_i(x) = 0), where I is the indicator function. We can then apply Algorithm 1 by taking S′ = {(φ(x_i), y_i) : (x_i, y_i) ∈ S} as the training set of examples and by using the set of 2 · |K| presence and absence rules for R. By doing so, we obtain a predictor which explicitly highlights the importance of a small set of k-mers for predicting a phenotype. This predictor has a form which is simple to interpret, since its predictions are the outcome of a simple logical operation. The running time complexity of using Algorithm 1 in this setting is O(|K| · |S| · s).

Pseudomonas aeruginosa
We validated our approach by addressing the problem of antibiotic resistance. There is a pressing need for rapid clinical diagnostic tests that can accurately predict the resistance of a bacterial strain to a wide range of antibiotics [16].

Algorithm 1:
Train SCM(S, R, p, s)
input: S: a set of training examples, R: a set of boolean-valued rules, p: the class trade-off parameter, s: the maximum number of rules in h.

    R* ← ∅
    P ← the set of examples in S with label 1
    N ← the set of examples in S with label 0
    stop ← False
    while N ≠ ∅ and |R*| < s and ¬stop do
        ∀ r_i ∈ R, A_i ← the subset of N correctly classified by r_i
        ∀ r_i ∈ R, B_i ← the subset of P misclassified by r_i
        ∀ r_i ∈ R, U_i ← |A_i| − p · |B_i| if |A_i| ≥ |B_i| and −∞ otherwise
        i* ← argmax_i U_i
        if U_{i*} ≠ −∞ then
            R* ← R* ∪ {r_{i*}}
            N ← N − A_{i*}
            P ← P − B_{i*}
        else
            stop ← True
        end
    end
    return h, where h(x) = ⋀_{r*_i ∈ R*} r*_i(x)

Antibiotic     Resistant   Sensitive   Total   2 · |K|
Amikacin       84          281         365     119 023 612
Doripenem      137         226         363     118 580 512
Levofloxacin   169         189         358     118 335 666
Meropenem      153         215         368     118 931 911

Table 1: Number of resistant/sensitive strains and boolean-valued rules (2 · |K|) for each antibiotic.

Our approach could be used to obtain interpretable predictors of the resistant phenotype for multiple antibiotics. In addition to assisting in a clinical setting, these may help to uncover new resistance mechanisms and ultimately identify new drug targets. In order to learn such models, we used a dataset containing the genomes of 390 Pseudomonas aeruginosa strains and their measured resistance phenotype (resistant, intermediate or sensitive) for 4 antibiotics [7]. We addressed each antibiotic as a distinct classification problem. We binarized each problem by discarding intermediate strains and assigning the label 1 to resistant strains and the label 0 to sensitive strains. For constructing the k-mer set K, we used k = 31, because it is a standard parameter used in de novo genome assembly algorithms [2]; if k = 31 provides enough information to allow good genome assembly, it should be suitable for our learning task. An overview of the resulting dataset is shown in Table 1.

For each antibiotic, we obtained sets K which contained millions of k-mers. Therefore, the dimensionality of φ(x) was extremely high, leading to the problem of storing the examples in memory.
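The k-mer featurization described above can be sketched as follows; this is an illustrative reconstruction and the function names are ours. At the scale of Table 1, |K| is in the millions, which is precisely what makes the resulting φ(x) vectors too large for memory.

```python
def kmers(genome, k=31):
    """Set of all overlapping k-mers of a genome (a string over A, C, G, T)."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}

def build_kmer_set(genomes, k=31):
    """K: every k-mer present in at least one genome, minus the k-mers
    present in all genomes (those can never be selected by Algorithm 1)."""
    sets = [kmers(g, k) for g in genomes]
    return sorted(set.union(*sets) - set.intersection(*sets))

def phi(genome, K, k=31):
    """Boolean presence vector: phi_i(x) = 1 iff k-mer k_i occurs in x."""
    present = kmers(genome, k)
    return [1 if km in present else 0 for km in K]
```

Each k-mer k_i then yields one presence rule I(φ_i(x) = 1) and one absence rule I(φ_i(x) = 0), for 2 · |K| rules in total.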
In order to overcome this problem, we stored the φ(x) vectors on disk in an HDF5 dataset [13], which supports built-in compression and array-like access to the data. At each iteration of Algorithm 1, we accessed the data by blocks of a fixed number of rows and columns. This led to an implementation for which the memory usage was fully controllable. For each antibiotic, the first iteration of Algorithm 1 took less than 7 minutes, while using a single CPU and 3 GB of RAM on a system equipped with a 2.8 GHz Intel Core i7 and a 5200 RPM hard drive. All subsequent iterations took less time.

For each antibiotic, we conducted nested 5-fold cross-validation (CV). We first split the dataset into 5 parts, called the outer folds. One outer fold was left out as a validation set to compute the risk of the algorithm, while the remaining outer folds formed a training dataset used to train the algorithm. On this inner dataset, we performed standard 5-fold CV to select the algorithm's hyperparameters. After choosing those minimizing the inner CV risk, we retrained the algorithm using the whole inner dataset, and computed predictions for the examples of the left out outer fold. We repeated this procedure 5 times (once per outer fold) and reported the mean left out outer-fold risk, which corresponds to an estimate of the risk incurred on novel data.

We conducted this experiment for our approach and selected the hyperparameters p and s of Algorithm 1 from the ranges [10−, ] and { , ..., } respectively. For comparison, we repeated the experiment using an SVM with a linear kernel and selected the value of the hyperparameter C in the range [10−, ].† In addition, in order to validate that our approach actually learns from the WGS data, we compared to a predictor that predicts the majority class for each antibiotic. The results are summarized in Table 2.

[Table 2: Risk of the SCM, SVM and majority-class predictors for each antibiotic, and the number of rules |R*| used by the SCM; the best risk values are in bold. Numerical values not recoverable from the extraction.]
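The nested cross-validation protocol described above can be sketched generically as follows; the training function and hyperparameter grid are placeholders, not the authors' code.

```python
import numpy as np

def nested_cv_risk(X, y, train_fn, grid, n_folds=5, seed=42):
    """Nested 5-fold CV: the outer loop estimates the risk, the inner
    loop selects hyperparameters using the remaining folds only.

    train_fn(X, y, params) -> predictor h with h(X) -> predictions
    grid: list of hyperparameter dicts to try
    """
    rng = np.random.RandomState(seed)
    outer = np.array_split(rng.permutation(len(y)), n_folds)
    risks = []
    for i in range(n_folds):
        test = outer[i]
        inner_idx = np.concatenate([outer[j] for j in range(n_folds) if j != i])
        inner = np.array_split(inner_idx, n_folds)
        # Inner CV: pick the hyperparameters with the lowest mean risk.
        best_params, best_risk = None, np.inf
        for params in grid:
            fold_risks = []
            for k in range(n_folds):
                val = inner[k]
                tr = np.concatenate([inner[j] for j in range(n_folds) if j != k])
                h = train_fn(X[tr], y[tr], params)
                fold_risks.append(np.mean(h(X[val]) != y[val]))
            if np.mean(fold_risks) < best_risk:
                best_risk, best_params = np.mean(fold_risks), params
        # Retrain on the whole inner dataset, evaluate on the left-out fold.
        h = train_fn(X[inner_idx], y[inner_idx], best_params)
        risks.append(np.mean(h(X[test]) != y[test]))
    return np.mean(risks)  # estimate of the risk on novel data
```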
On all but one antibiotic, our approach outperforms the SVM. This is an interesting result, since our predictors are composed of very few rules, whereas the SVM decision function is fully dense and thus attributes non-null weights to millions of rules. Moreover, these results confirm that our approach indeed learns from the WGS data, since it outperforms the majority class predictor.
To evaluate the ability of our approach to learn biologically relevant models, we relearned an SCM on the entire dataset of each antibiotic. The optimal parameters for each dataset were selected by 5-fold CV based on the same ranges as in the previous experiment. For amikacin and meropenem, the obtained predictors are conjunctions of 5 and 2 rules respectively. For doripenem and levofloxacin, the predictors are disjunctions of 2 rules. Hence, in each case, the obtained predictor is based on a very small number of rules.

The predictor for levofloxacin is a disjunction of 2 absence rules for k-mers located in the wildtype DNA gyrase, which is the target of levofloxacin. Interestingly, the first k-mer is located in the quinolone-resistance-determining region of subunit A and covers two amino acids, Thr-83 and Asp-87, known to confer resistance to levofloxacin when mutated [1]. Similarly, the second k-mer is located in subunit B and covers amino acids Ser-468 and Glu-470, which also confer resistance when mutated [1]. Therefore, in this case, the predictor has recovered known biological facts. For all other antibiotics, the model contains a rule relative to the absence of a k-mer in the DNA gyrase. However, none of these antibiotics actually target the DNA gyrase.

In Figure 1, we see that most strains are resistant to multiple antibiotics and that many are resistant to all of them (central set). Following this observation, we used each predictor to predict the labels of the training examples. We analysed the distribution of false negatives among the sets of Figure 1 and found that only a small fraction fall in the central set. This suggests that, since the SCM can only learn one conjunction or disjunction, it learns the one which best classifies the largest set of the training examples. The presence of the DNA gyrase in all the predictors can be explained by a shared resistance with levofloxacin for many examples.
In addition, this observation explains the difference in the accuracies observed for levofloxacin and all other antibiotics in Table 2.

We have addressed the problem of learning interpretable models of phenotypes from whole genome sequences. We have demonstrated that the Set Covering Machine can be used to achieve this goal. Our results for the problem of predicting the antibiotic resistance of Pseudomonas aeruginosa suggest that our approach indeed yields sparse and interpretable models. For one antibiotic, we have recovered the target gene and, for the others, the models were interpretable enough to gain insight on a limitation of our approach. Future work will therefore address the problem of learning more than one conjunction/disjunction with the SCM. A disjunction of conjunctions can still be very sparse, and may allow richer biological pathways to be modelled than a single conjunction or a single disjunction.

† Note that learning with the SVM from the φ(x) vectors is equivalent to using the spectrum kernel [10].

Figure 1: Distribution of the resistant strains among the antibiotics in the dataset.
Acknowledgments
The authors would like to thank Veronica Kos, Humphrey Gardner and their colleagues from AstraZeneca for providing the Pseudomonas aeruginosa dataset. Computations were performed on the Colosse supercomputer at Université Laval (resource allocation project: nne-790-ad), under the auspices of Calcul Québec and Compute Canada. AD is the recipient of an Alexander Graham Bell Canada Graduate Scholarship (Doctoral) from the Natural Sciences and Engineering Research Council of Canada (NSERC). This work was supported in part by the Fonds de recherche du Québec - Nature et technologies (FL, MM & JC; 2013-PR-166708) and NSERC Discovery Grants (FL; 262067, MM; 122405). JC acknowledges the Canada Research Chair in Medical Genomics.
References

[1] Takaaki Akasaka et al. "Type II topoisomerase mutations in fluoroquinolone-resistant clinical strains of Pseudomonas aeruginosa isolated in 1998 and 1999: role of target enzyme in mechanism of fluoroquinolone resistance". In: Antimicrobial Agents and Chemotherapy.
[2] In: Journal of Computational Biology.
[3] In: Gene.
[4] In: Machine Learning.
[5] In: Trends in Genetics.
[6] In: PLoS ONE.
[7] In: Antimicrobial Agents and Chemotherapy. Under Review (2014).
[8] Claudio U Köser et al. "Routine use of microbial whole genome sequencing in diagnostic and public health microbiology". In: PLoS Pathogens.
[9] In: Bioinformatics (2014), btu177.
[10] Christina S Leslie, Eleazar Eskin, and William Stafford Noble. "The spectrum kernel: A string kernel for SVM protein classification". In: Pacific Symposium on Biocomputing. Vol. 7. 2002, pp. 566–575.
[11] Mario Marchand and John Shawe-Taylor. "The set covering machine". In: The Journal of Machine Learning Research.
[12] In: Pattern Analysis and Machine Intelligence, IEEE Transactions on.
[13] Hierarchical Data Format, version 5. 2000–2010. URL: .
[14] Robert Tibshirani. "Regression shrinkage and selection via the lasso". In: Journal of the Royal Statistical Society. Series B (Methodological) (1996), pp. 267–288.
[15] ME Török and SJ Peacock. "Rapid whole-genome sequencing of bacterial pathogens in the clinical microbiology laboratory: pipe dream or reality?" In: Journal of Antimicrobial Chemotherapy (2012), dks247.
[16] Nina Tuite et al. "Rapid nucleic acid diagnostics for the detection of antimicrobial resistance in Gram-negative bacteria: is it time for a paradigm shift?" In: