Large scale modeling of antimicrobial resistance with interpretable classifiers
Alexandre Drouin, Frédéric Raymond, Gaël Letarte St-Pierre, Mario Marchand, Jacques Corbeil, François Laviolette
LLarge scale modeling of antimicrobial resistancewith interpretable classifiers
Alexandre Drouin , , † , Fr´ed´eric Raymond , , Ga¨el Letarte St-Pierre , , Mario Marchand , ,Jacques Corbeil , , Franc¸ois Laviolette , Department of Computer Science and Software Engineering, Infectious Disease Research Center, Big Data Research CenterUniversit´e Laval, Qu´ebec, Canada
Abstract
Antimicrobial resistance is an important public health concern that has implications in the practice ofmedicine worldwide. Accurately predicting resistance phenotypes from genome sequences showsgreat promise in promoting better use of antimicrobial agents, by determining which antibioticsare likely to be effective in specific clinical cases. In healthcare, this would allow for the designof treatment plans tailored for specific individuals, likely resulting in better clinical outcomes forpatients with bacterial infections. In this work, we present the recent work of Drouin et al. (2016)on using Set Covering Machines to learn highly interpretable models of antibiotic resistance andcomplement it by providing a large scale application of their method to the entire PATRIC database.We report prediction results for new datasets and present the Kover AMR platform, a new web-based tool allowing the visualization and interpretation of the generated models. Modern medicine relies on antimicrobial drugs to treat infections. However, bacteria have evolved mechanisms toprotect themselves from antibiotics [26, 19]. Thus, the use and abuse of antibiotics has lead to the selection and tothe spread of antibiotic-resistant pathogens. In order to define the right treatment and to reduce clinical failures, theprediction of antimicrobial resistance (AMR) is essential in choosing the right drugs to treat a specific patient [12].In clinical laboratories, AMR is measured using antibiograms. This method determines the minimal inhibitory con-centration (MIC) of an antibiotic by measuring the growth of the infecting microorganism in presence of differentconcentrations of the drug. The overall objective of this method is to determine if the pathogen will respond to treat-ment. Clinicians do so by comparing the measured MIC to the guidelines from CLSI or EUCAST, which are constantlybeing reevaluated by international committees [14]. However, while susceptibility to antibiotics can be predicted us-ing MIC, it does not always hold true, as susceptible isolates can sometimes become phenotypically resistant giventhe proper conditions [17]. In this case, genomic determinants of resistance can be effective in predicting clinicallyrelevant antimicrobial resistance [18]. In addition, determination of the MIC requires the growth of microorganisms,which generally necessitates one to two days of in vitro culture, and even more in the case of slow-growing organisms,such as
Mycobacterium tuberculosis [28]. Genomic methods, such as polymerase chain reaction or whole genomesequencing, can now be used to predict the resistance phenotypes of pathogens in a more rapid manner [20].Reanalysis of publicly available genome databases for which AMR phenotypes are available is a useful starting pointto improve our understanding of the relationship between genotype and phenotype. Indeed, several groups have usedmachine learning and statistics to understand and predict antimicrobial resistance ([2, 11, 22, 10, 9]). However, modelsthat predict AMR should not be static and should improve as new genomes are added to databases. For instance,the Pathosystems Resource Integration Center (PATRIC) database is a large-scale aggregation platform for bacterialgenomes and their associated metadata [27, 22]. The number of genomes in the PATRIC database nearly doubledbetween 2014 and 2015, with now more than 52 thousand microbial genome sequences. Recently, Drouin et al.(2016) proposed to use the Set Covering Machine [16] to learn extremely sparse models of antimicrobial resistance thatare intelligible for domain experts [10]. They compared their models to more complex predictors, such as linear andkernel-based Support Vector Machines [7, 23, 24], as well as decision trees [3], and showed that Set Covering Machinesachieved comparable, often superior, generalization performance, while being significantly sparser. Moreover, theydemonstrated that highly accurate models of AMR could be obtained, despite features spaces tens of thousands oftimes larger than the number of learning examples. † Corresponding author: [email protected] and accepted for presentation at the Machine Learning for Health Workshop, NIPS 2016, Barcelona, Spain. a r X i v : . [ q - b i o . GN ] D ec he present work summarizes the work of Drouin et al. (2016) and presents extensive new results. Specifically,we present a large scale application of their method to prospectively generate predictive models of AMR fromthe ever-growing collection of genomes in the PATRIC database. Moreover, we present the Kover AMR Platform(https://aldro61.github.io/kover-amr-platform/), a web-based tool that catalogs AMR prediction results for a wide va-riety of species and antibiotics, providing detailed metrics and allowing the visualization of the generated models. Thisinitiative will allow the interpretation of our results by healthcare researchers, generating new research and treatmentopportunities. We address the problem of predicting antimicrobial resistance as a supervised learning problem. The goal is to learn amodel that accurately discriminates genomes that are resistant or susceptible to an antibiotic based on genomic char-acteristics. Formally, we assume that we are given a dataset S def = { ( x i , y i ) } mi =1 ∼ D m , where x i ∈ { A, C, G, T } ∗ is a bacterial genome, y i ∈ { , } is its associated phenotype ( for susceptible and for resistant) and D is anunknown data generating distribution from which the dataset is sampled. We start by defining an alternative represen-tation for the genomic sequences, where each genome is characterized by the presence or absence of every possible k -mer, i.e. every possible sequence of k DNA nucleotides. This representation is obtained through a mapping function φφφ : { A, C, G, T } ∗ → { , } k , such that φ j ( x ) def = 1 if the k -mer k j is in the genome x and otherwise. This yield thetransformed dataset S (cid:48) def = { ( φφφ ( x i ) , y i ) } mi =1 , which is then used to train the learning algorithm.The goal is then to find a model h that has a good generalization performance, i.e. that minimizes the probability R ( h ) of making a prediction error for any example drawn according to distribution D , i.e., R ( h ) def = P r ( x,y ) ∼ D [ h ( φφφ ( x )) (cid:54) = y ] . (1)Furthermore, we seek highly interpretable models from which biologically relevant knowledge can be extracted. Such interpretable models are obtained through the Set Covering Machine algorithm (SCM) [16, 10], which producesmodels that are logical combinations (conjunctions or disjunctions) of boolean-valued rules that are generated fromthe data. We now briefly present the algorithm and direct the reader to [10, 16] for further explanations.The input of the SCM algorithm is a set of learning examples S and a set of boolean-valued rules R . In our context, S is composed of genomes, in the k -mer form induced by φφφ , and their labels. Let K be the set of all, possibly overlapping, k -mers that are present in at least one genome of S . For each k -mer k j ∈ K , we consider a presence rule, definedas p k j ( φφφ ( x )) def = I [ φ j ( x ) = 1] and an absence rule, defined as a k j ( φφφ ( x )) def = I [ φ j ( x ) = 0] , where I [ true ] = 1 and otherwise. These boolean-valued rules constitute the set R . Given S and R , the SCM attempts to find the modelthat relies on the smallest possible set of rules R (cid:63) = { r (cid:63) , ..., r (cid:63)n } ⊆ R , while minimizing Equation (1). The modelsgenerated can be conjunctions h ( φφφ ( x )) def = r (cid:63) ( φφφ ( x )) ∧ ... ∧ r (cid:63)n ( φφφ ( x )) or disjunctions h ( φφφ ( x )) def = r (cid:63) ( φφφ ( x )) ∨ ... ∨ r (cid:63)n ( φφφ ( x )) .Hence, they directly highlight the importance of a small set of genomic sequences for predicting AMR phenotypes.However, it is important to note that the distribution D is unknown; therefore, it is not possible to directly minimizeEquation (1). Instead, the algorithm constructs a model that achieves an appropriate trade-off between the empiricalrisk (i.e., the fraction of training errors) and the number of rules it uses. A model containing many rules is likelyto overfit the data, whereas a model containing too few rules is likely to underfit. To find the appropriate trade-offbetween the classifier’s size and its accuracy on the training set, the SCM relies on a modified version of the setcovering greedy algorithm of Chv´atal [6], which has a worst-case guarantee. The running time and space complexitiesof this algorithm are linear in the number of examples and rules, thus linear in the number of genomes and k -mers.Consequently, this algorithm is particularly well-suited for learning from large datasets of extreme dimensionalities,such as the ones that often occur in healthcare applications.The experiments in this work were performed using Kover, the SCM implementation of Drouin et al. (2016), whichhas been tailored for learning from k -mer represented genomes [10]. In Kover, the SCM algorithm is trained out-of-core , which means that the data resides on the disk and is accessed in blocks by the algorithm. The implementationexploits a compressed data representation and atomic CPU instructions to speed up computations. It is open-sourceand available at https://github.com/aldro61/kover. 2 igure 1: Each point corresponds to a dataset. Colored symbols identify bacterial species. a) Size of the datasets interms of genomes and k -mers; b) Specificity of the models with respect to their sensitivity; c) Accuracy of the modelswith respect to the baseline predictor (see text for details); d) Accuracy of the models with respect to the number ofgenomes; e) Accuracy of the models with respect to the number of k -mers; f) Time required to train the algorithmwith respect to the number of genomes. The data used in this study were extracted from the PATRIC database via its FTP server (ftp.patricbrc.org). First, thelatest AMR metadata were acquired. These data consisted of genome identifiers and measured resistance phenotypes(resistant or susceptible) for various antibiotics. The data were segmented by species and antibiotic to form datasets,and those with at least genomes in each class were retained. Finally, for each dataset, the assembled genomes weredownloaded and their k -mer representation was obtained using the DSK k -mer counter [21]. The value of k was setto , since extensive experimentation has shown that this value is appropriate for the task at hand [10]. A total of datasets, comprising five bacterial species, were extracted from PATRIC. A detailed list is available inAppendix - Table 1. Figure 1a shows the size of these datasets, in which the number of examples ranged from to and the number of k -mers from . to . millions. For most species, an increase in the number of genomesdoes not lead to a large increase in the number of k -mers. This reflects the limited genomic diversity that exists withinspecies (e.g.: M. tuberculosis [13]). Contrasting results were observed for
P. aeruginosa and
A. baumannii , whichhave higher genomic diversities [15, 25].The SCM’s ability to predict AMR phenotypes was evaluated on held-out subsets of the data. First, each dataset wasrandomly partitioned into a training set S ( ) and a testing set T ( ). The SCM was then trained on S and metricswere computed based on the model’s predictions on T . This procedure was repeated times, on different partitions,3 Figure 1: Each point corresponds to a dataset. Colored symbols identify bacterial species. a) Size of the datasets interms of genomes and k -mers; b) Specificity of the models with respect to their sensitivity; c) Accuracy of the modelswith respect to the baseline predictor (see text for details); d) Accuracy of the models with respect to the number ofgenomes; e) Accuracy of the models with respect to the number of k -mers; f) Time required to train the algorithmwith respect to the number of genomes. The data used in this study were extracted from the PATRIC database via its FTP server (ftp.patricbrc.org). First, thelatest AMR metadata were acquired. These data consisted of genome identifiers and measured resistance phenotypes(resistant or susceptible) for various antibiotics. The data were segmented by species and antibiotic to form datasets,and those with at least genomes in each class were retained. Finally, for each dataset, the assembled genomes weredownloaded and their k -mer representation was obtained using the DSK k -mer counter [21]. The value of k was set to , since extensive experimentation has shown that this value is appropriate for the task at hand [10]. A total of datasets, comprising five bacterial species, were extracted from PATRIC. A detailed list is available inAppendix - Table 1. Figure 1a shows the size of these datasets, in which the number of examples ranged from to and the number of k -mers from . to . millions. For most species, an increase in the number of genomesdoes not lead to a large increase in the number of k -mers. This reflects the limited genomic diversity that exists withinspecies (e.g.: M. tuberculosis [13]). Contrasting results were observed for
P. aeruginosa and
A. baumannii , whichhave higher genomic diversities [15, 25].The SCM’s ability to predict AMR phenotypes was evaluated on held-out subsets of the data. First, each dataset wasrandomly partitioned into a training set S ( ) and a testing set T ( ). The SCM was then trained on S and metricswere computed based on the model’s predictions on T . This procedure was repeated times, on different partitions,3nd the resulting metrics were averaged. The values of the algorithm’s hyperparameters (HPs) were selected bybound selection (see [16, 10]). Bound selection uses a probabilistic upper bound on Equation (1), computed fromthe training data, to score each of the HP combinations. For each of the latter, a single training of the algorithm isrequired; hence, bound selection is much faster than standard cross-validation. Drouin et al (2016). proposed sucha bound for conjunctions/disjunctions of presence/absence rules of k -mers and found that bound selection yieldedresults comparable to those of 5-fold cross-validation.The results are summarized in Figure 1b-e and detailed in Appendix - Table 2. Figure 1b shows the specificity of themodels with respect to their sensitivity. A perfect model would score for each of these metrics. Specificities superiorto were achieved for / datasets and comparable sensitivities were achieved for / datasets. In general,the obtained models are more specific than sensitive, meaning that they sometimes fail to identify resistant isolates,but very rarely mark a susceptible isolate as resistant. Figure 1c compares the SCM models to a baseline predictor thatpredicts the most abundant class in the training data. The SCM models, which achieve accuracies greater than on / datasets, generally surpass the baseline predictors, indicating that the algorithm extracts relevant patterns ofantibiotic resistance. Of note, the models learned by the SCM are extremely sparse, using an average of . rules (std: . ), which makes them well-suited for further review and experimental validation. Moreover, Figure 1c highlightsthe strong class imbalance that exists in some of the datasets considered, as the baseline predictors often achieve highaccuracies. Furthermore, based on Figure 1d, we observe that the accuracy of the models is generally higher fordatasets that contain more examples. There are notable exceptions, such as S. pneumoniae , where the accuracies arelower for larger datasets. However, the two largest datasets for this species correspond to combinations of drugs (Beta-lactams and Trimethoprim/Sulfamethoxazole), which could complexify the learning task. Also, notice that there arecases where the algorithm achieves near perfect accuracies with very few examples, despite disproportionately largefeature spaces. In fact, Figure 1e illustrates that the accuracy of the models is not related to the number of k -mers.This supports the theoretical and empirical results of Drouin et al. (2016), which found that the SCM could avoidoverfitting, even in such high dimensional settings. Finally, Figure 1f shows the time required to train the algorithmwith respect to the number of genomes in each dataset. The time, which grows linearly with the number of genomes,varied between seconds and minutes, using a single CPU core and less than 1 GB of memory.The models generated using the SCM are available through the Kover AMR Platform. The k -mer sequences of therules in these models were annotated using BLAST [1]. This revealed that, for most antibiotics, the SCM was accuratein identifying known resistance mechanisms. For example, the absence of one k -mer located in the DNA gyrasesubunit A ( gyrA ) predicts resistance to moxifloxacin in M. tuberculosis with an error rate as low as . This k -mer refers to the amino-acids 88 to 94 of GyrA, the mutation of which confers resistance to fluoroquinolones [5].Thus, the model relies on the absence of the susceptible genotype to account for the presence of resistant variants,which are more diverse. This behavior was also observed for predicting resistance to isoniazid in M. tuberculosis ,where the model rightly targets a region of the katG gene where multiple mutations are known to induce resistance [4,8]. The model also relies on a k -mer in the rpoB gene, a known rifampicin resistance determinant [8]. This could resultfrom the frequent combined use of antituberculosis drugs.We have briefly demonstrated the accuracy and interpretability of models produced by the SCM. Detailed results areavailable in the Appendix and the Kover AMR platform, which allows the visualization and further investigation ofAMR models generated at an unprecedented scale. In summary, we have outlined the recently published work of Drouin et al. (2016), while complementing their analysiswith a large-scale application of their method to the ever-growing PATRIC database. Kover, an extremely efficient out-of-core implementation of the SCM, allowed the rapid generation of these results with limited computational resources.Moreover, our results show that the method yields accurate results for predicting AMR in most datasets and that, dueto their strong interpretability, the obtained models can generate biologically relevant insight into these phenotypes.On another hand, contrasting results were obtained, which pave the way to extensions of their method. In fact, theobtained models were generally highly specific, but some lacked sensitivity. This could result from seeking the sparsestmodel that detects resistant genomes in the entire population of isolates. Hence, the deconvolution of resistancemechanisms based on population structure could provide a deeper understanding of antibiotic resistance and increasethe sensitivity of the models. Future work will therefore involve the development of algorithmic extensions to theSet Covering Machine, which allow the inclusion of prior knowledge of the population structure and the biologicalstructures present in the data (e.g., gene functions, pathways). p ∈ { . , . , . , . , , . , . , . , , + ∞} , s ∈ { , ..., } , model type ∈ { conjunction , disjunction } (notation of [10])
4e have only scratched the surface of the biological knowledge that can be generated from these results.The fact that our approach generates interpretable predictors, together with our proposed Kover AMR Platform(https://aldro61.github.io/kover-amr-platform/) will allow further analysis of these results by researchers with diversebackgrounds, bridging the gap between machine learning and healthcare research.
Acknowledgements
The authors acknowledge S´ebastien Gigu`ere and Pier-Luc Plante for helpful comments and suggestions. Computationswere performed on the Colosse supercomputer at Universit´e Laval (resource allocation project: nne-790-ae), underthe auspices of Calcul Qu´ebec and Compute Canada. AD is recipient of an Alexander Graham Bell Canada GraduateScholarship Doctoral Award of the National Sciences and Engineering Research Council of Canada (NSERC). Thiswork was supported in part by the NSERC Discovery Grants (FL; 262067, MM; 122405). JC acknowledges theCanada Research Chair in Medical Genomics.
References [1] Stephen F Altschul et al. “Basic local alignment search tool”. In:
Journal of Molecular Biology
Nature communications
Classification and regression trees . New York: CRC press, 1984.[4] Christine E Cade et al. “Isoniazid-resistance conferring mutations in Mycobacterium tuberculosis KatG: Cata-lase, peroxidase, and INH-NADH adduct formation activities”. In:
Protein Science
Antimicrobial agents and chemotherapy
Mathematics of Operations Research
Machine learning
Journal of antimicrobial chemotherapy
Scientific reports
BMC Genomics
Nature Microbiology
Clinical Microbiology and Infection
PLoS Biol
Journal of Antimicrobial Chemotherapy
Antimicrobial Agents and Chemotherapy
The Journal of Machine LearningResearch
Antimicrobial Agents and Chemotherapy
Current opinion in microbiology
27 (2015), pp. 45–50.[19] Jose M Munita, Arnold S Bayer, and Cesar A Arias. “Evolving resistance among Gram-positive pathogens”.In:
Clinical Infectious Diseases
Human genomics
Bioinformatics (2013), btt020.[22] John W Santerre et al. “Machine Learning for Antimicrobial Resistance”. In: arXiv.org (July 2016).arXiv: .[23] Bernhard Sch¨olkopf, Koji Tsuda, and Jean-Philippe Vert.
Kernel methods in computational biology . Cambridge,Massachusetts: MIT press, 2004.[24] John Shawe-Taylor and Nello Cristianini.
Kernel methods for pattern analysis . Cambridge, United Kingdom:Cambridge university press, 2004.[25] Lalena Wallace et al. “Use of Comparative Genomics To Characterize the Diversity of Acinetobacter baumanniiSurveillance Isolates in a Health Care Institution”. In:
Antimicrobial Agents and Chemotherapy
Nature
Nucleic acidsresearch (2013), gkt1099.[28] Adam A Witney et al. “Clinical use of whole genome sequencing for Mycobacterium tuberculosis”. In: