AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning
Ahmed M. Alaa    Mihaela van der Schaar

Abstract
Clinical prognostic models derived from large-scale healthcare data can inform critical diagnostic and therapeutic decisions. To enable off-the-shelf usage of machine learning (ML) in prognostic research, we developed AutoPrognosis: a system for automating the design of predictive modeling pipelines tailored for clinical prognosis. AutoPrognosis optimizes ensembles of pipeline configurations efficiently using a novel batched Bayesian optimization (BO) algorithm that learns a low-dimensional decomposition of the pipelines' high-dimensional hyperparameter space in concurrence with the BO procedure. This is achieved by modeling the pipelines' performances as a black-box function with a Gaussian process prior, and modeling the "similarities" between the pipelines' baseline algorithms via a sparse additive kernel with a Dirichlet prior. Meta-learning is used to warmstart BO with external data from "similar" patient cohorts by calibrating the priors using an algorithm that mimics the empirical Bayes method. The system automatically explains its predictions by presenting the clinicians with logical association rules that link patients' features to predicted risk strata. We demonstrate the utility of AutoPrognosis using 10 major patient cohorts representing various aspects of cardiovascular patient care.
1. Introduction
In clinical medicine, prognosis refers to the risk of future health outcomes in patients with given features. Prognostic research aims at building actionable predictive models that can inform clinicians about the future course of patients' clinical conditions in order to guide screening and therapeutic decisions. With the recent abundance of data linkages, electronic health records, and bio-repositories, clinical researchers have become aware that the value conferred by big, heterogeneous clinical data can only be realized with prognostic models based on flexible machine learning (ML) approaches. There is, however, a concerning gap between the potential and actual utilization of ML in prognostic research; the reason being that clinicians with no expertise in data science find it hard to manually design and tune ML pipelines (Luo et al., 2017).

* Equal contribution. University of California, Los Angeles, USA; Alan Turing Institute, London, UK.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

To fill this gap, we developed AutoPrognosis, an automated ML (AutoML) framework tailored for clinical prognostic modeling. AutoPrognosis takes as an input data from a patient cohort, and uses such data to automatically configure ML pipelines. Every ML pipeline comprises all stages of prognostic modeling: missing data imputation, feature preprocessing, prediction, and calibration. The system handles different types of clinical data, including longitudinal and survival (time-to-event) data, and automatically explains its predictions to the clinicians via an "interpreter" module which outputs clinically interpretable associations between patients' features and predicted risk strata. An overview of the system is provided in Figure 1.

The core component of AutoPrognosis is an algorithm for configuring ML pipelines using Bayesian optimization (BO) (Snoek et al., 2012). Our BO algorithm models the pipelines' performances as a black-box function, the input to which is a "pipeline configuration", i.e. a selection of algorithms and hyperparameter settings, and the output of which is the performance (predictive accuracy) achieved by such a configuration. We implement BO with a
Gaussian process (GP) prior on the black-box function. To deal with the high dimensionality of the pipeline configuration space, we capitalize on the fact that, for a given dataset, the performance of one ML algorithm may not be correlated with that of another algorithm. For instance, it may be the case that the observed empirical performance of logistic regression on a given dataset does not tell us much about how a neural network would perform on the same dataset. In such a case, both algorithms should not share the same GP prior, but should rather be modeled independently. Our BO learns such a decomposition of algorithms from data in order to break down the high-dimensional optimization problem into a set of lower-dimensional sub-problems. We model the decomposition of algorithms via an additive kernel with a Dirichlet prior on its structure, and learn the decomposition from data in concurrence with the BO iterations. We also propose a batched (parallelized) version of the BO procedure, along with a computationally efficient algorithm for maximizing the BO acquisition function.

Figure 1. Illustration of exemplary outputs of AutoPrognosis.

AutoPrognosis follows a principled Bayesian approach in all of its components. The system implements post-hoc construction of pipeline ensembles via
Bayesian model averaging, and implements a meta-learning algorithm that utilizes data from external cohorts of "similar" patients using an empirical Bayes method. In order to resolve the tension between accuracy and interpretability, which is crucial for clinical decision-making (Cabitza et al., 2017), the system presents the clinicians with a rule-based approximation for the learned ML pipeline by mining for logical associations between patients' features and the model's predicted risk strata using a Bayesian associative classifier (Agrawal et al., 1993; Kruschke, 2008).

We conclude the paper by conducting a set of experiments on multiple patient cohorts representing various aspects of cardiovascular patient care, and show that prognostic models learned by AutoPrognosis outperform widely used clinical risk scores and existing AutoML frameworks.
Related work:
To the best of our knowledge, none of the existing AutoML frameworks, such as AUTO-WEKA (Kotthoff et al., 2016), AUTO-SKLEARN (Feurer et al., 2015), and TPOT (Olson & Moore, 2016), use principled GP-based BO to configure ML pipelines. All of the existing frameworks model the sparsity of the pipelines' hyperparameter space via frequentist tree-based structures. Both AUTO-WEKA and AUTO-SKLEARN use BO, but through tree-based heuristics, such as random forest models and tree Parzen estimators, whereas TPOT uses a tree-based genetic programming algorithm. Previous works have refrained from using principled GP-based BO because of its statistical and computational complexity in high-dimensional hyperparameter spaces. Our algorithm makes principled, high-dimensional GP-based BO possible by learning a sparse additive kernel decomposition for the GP prior. This approach confers many advantages as it captures the uncertainty about the sparsity structure of the GP prior, and allows for principled approaches to (Bayesian) meta-learning and ensemble construction that are organically connected to the BO procedure. In Section 5, we compare the performance of AutoPrognosis with that of AUTO-WEKA, AUTO-SKLEARN, and TPOT, demonstrating the superiority of our algorithm.

Various previous works have addressed the problem of high-dimensional GP-based BO. (Wang et al., 2013) identifies a low-dimensional effective subspace for the black-box function via random embedding. However, in the AutoML setup, this approach cannot incorporate our prior knowledge about dependencies between the different hyperparameters (we know the sets of hyperparameters that are "activated" upon selecting an algorithm (Hutter et al., 2011)). This prior knowledge was captured by the Arc-kernel proposed in (Swersky et al., 2014), and similarly in (Jenatton et al., 2017), where a BO algorithm for domains with tree-structured dependencies was proposed. Unfortunately, both methods require full prior knowledge of the dependencies between the hyperparameters, and hence cannot be used when jointly configuring hyperparameters across multiple algorithms, since the correlations of the performances of different algorithms are not known a priori. (Bergstra et al., 2011) proposed a naïve approach that defines an independent GP for every set of hyperparameters that belong to the same algorithm. Since it does not share any information between the different algorithms, this approach would require trying all combinations of algorithms in a pipeline exhaustively. (In our system, there are 4,800 possible pipelines.) Our model solves the problems above via a data-driven kernel decomposition, through which only relevant groups of hyperparameters share a common GP prior, thereby balancing the trade-off between "information sharing" among hyperparameters and statistical efficiency.
2. AutoPrognosis: A Practical System for Automated Clinical Prognostic Modeling
Consider a dataset D = {(x_i, y_i)}_{i=1}^n for a cohort of n patients, with x_i being patient i's features, and y_i being the patient's clinical endpoint. AutoPrognosis takes D as an input, and outputs an automatically configured prognostic model which predicts the patients' risks, along with "explanations" for the predicted risk strata. This Section provides an overview of the components of AutoPrognosis; a schematic depiction of the system is shown in Figure 2.

Figure 2. A schematic depiction of AutoPrognosis. Every ML pipeline comprises imputation, feature processing, prediction, and calibration algorithms. The ensemble construction and interpreter modules are included in the system as post-processing steps.

The core component of AutoPrognosis is an algorithm that automatically configures ML pipelines, where every pipeline comprises algorithms for missing data imputation, feature preprocessing, prediction, and calibration. Table 1 lists the baseline algorithms adopted by the system in all the stages of a pipeline. The imputation and calibration stages are particularly important for clinical prognostic modeling (Blaha, 2016), and are not supported in existing AutoML frameworks. The total number of hyperparameters in AutoPrognosis is 106, which is less than those of AUTO-WEKA (786) and AUTO-SKLEARN (110). The pipeline configuration algorithm uses Bayesian optimization to estimate the performance of different pipeline configurations in a scalable fashion by learning a structured kernel decomposition that identifies algorithms with correlated performance. Details of the Bayesian optimization algorithm are provided in Sections 3 and 5.

In order to cope with the diverse nature of clinical data and health outcomes, AutoPrognosis pipelines are enriched with three modes of operation: (a) classification mode, (b) temporal mode, and (c) survival mode. The classification mode handles datasets with binary clinical outcomes (Yoon et al., 2017). In this mode, the baseline predictive models include all algorithms in the scikit-learn library (Pedregosa et al., 2011), in addition to other powerful algorithms, such as XGBoost (Chen & Guestrin, 2016). The temporal mode handles longitudinal and time series data (Alaa et al., 2017) by applying the classification algorithms above on data residing in a sliding window within the time series, which we parametrize by the sequence time (Hripcsak et al., 2015). The survival mode handles time-to-event data, and involves all the classification algorithms above, in addition to survival models such as the Cox proportional hazards model and survival forests (Ishwaran et al., 2008), and models for multiple competing risks (Fine & Gray, 1999).

The meta-learning module is a pre-processing step that is used to warmstart BO using data from external cohorts, whereas the ensemble construction and interpreter modules post-process the BO outputs. All three modules run with a relatively low computational burden. Details of the three modules are provided in Sections 4 and 5.
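The four-stage pipeline structure can be sketched with off-the-shelf scikit-learn components. This is an illustrative stand-in, not the AutoPrognosis implementation; the particular algorithm choices (median imputation, PCA, a random forest, sigmoid calibration) are just one selection from the stages in Table 1.

```python
# Minimal sketch of a four-stage pipeline: imputation -> feature
# processing -> prediction, wrapped in a probability-calibration step.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = Pipeline([
    ("imputation", SimpleImputer(strategy="median")),        # missing-data stage
    ("features", PCA(n_components=10)),                      # feature-processing stage
    ("prediction", RandomForestClassifier(random_state=0)),  # prediction stage
])
# Sigmoid (Platt) calibration as the final pipeline stage.
model = CalibratedClassifierCV(base, method="sigmoid", cv=3)
model.fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]  # calibrated risk predictions
print(probs.shape)
```

Swapping any stage for another entry of Table 1 (e.g. MICE imputation, or an XGBoost predictor) yields a different point in the pipeline space that the system searches over.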
3. Pipeline Configuration via Bayesian Optimization with Structured Kernels
Let (𝒜_d, 𝒜_f, 𝒜_p, 𝒜_c) be the sets of all missing data imputation, feature processing, prediction, and calibration algorithms considered in AutoPrognosis (Table 1), respectively. A pipeline P is a tuple of the form

    P = (A_d, A_f, A_p, A_c),

where A_v ∈ 𝒜_v, ∀v ∈ {d, f, p, c}. The space of all possible pipelines is given by 𝒫 = 𝒜_d × 𝒜_f × 𝒜_p × 𝒜_c. Thus, a pipeline is a selection of algorithms from the elements of Table 1. An exemplary pipeline can be specified as follows: P = {MICE, PCA, Random Forest, Sigmoid}. The total number of pipelines in AutoPrognosis is |𝒫| = 4,800.

The specification of a pipeline configuration is completed by determining the hyperparameters of its constituting algorithms. The space of hyperparameter configurations for a pipeline is Θ = Θ_d × Θ_f × Θ_p × Θ_c, where Θ_v = ∪_a Θ_v^a, for v ∈ {d, f, p, c}, with Θ_v^a being the space of hyperparameters associated with the a-th algorithm in 𝒜_v. Thus, a pipeline configuration P_θ ∈ 𝒫_Θ is a selection of algorithms P ∈ 𝒫 and hyperparameter settings θ ∈ Θ; 𝒫_Θ is the space of all possible pipeline configurations.

The main goal of AutoPrognosis is to identify the best pipeline configuration P*_{θ*} ∈ 𝒫_Θ for a given patient cohort

Pipeline Stage     Algorithms
Data Imputation    missForest (2), Median (0), Most-frequent (0), Mean (0), EM (1), Matrix completion (2), MICE (1), None (0)
Feature process.   Feature agglo. (4), Kernel PCA (5), Polynomial (3), Fast ICA (4), PCA (2), R. kitchen sinks (2), Nystroem (5), Linear SVM (3), Select Rates (3), None (0)
Prediction         Bernoulli NB (2), AdaBoost (4), Decision Tree (4), Grad. Boost. (6), LDA (4), Gaussian NB (0), XGBoost (5), Extr. R. Trees (5), Light GBM (5), L. SVM (4), Multinomial NB (2), R. Forest (5), Neural Net. (5), Log. Reg. (0), GP (3), Ridge Class. (1), Bagging (4), k-NN (1), Surv. Forest (5), Cox Reg. (0)
Calibration        Sigmoid (0), Isotonic (0), None (0)
Table 1. List of algorithms included in every stage of the pipeline. Numbers in brackets correspond to the number of hyperparameters.

D via J-fold cross-validation as follows:

    P*_{θ*} ∈ argmax_{P_θ ∈ 𝒫_Θ} (1/J) Σ_{i=1}^J L(P_θ; D_train^(i), D_valid^(i)),    (1)

where L is a given accuracy metric (AUC-ROC, c-index, etc.), and D_train^(i) and D_valid^(i) are the training and validation splits of D in the i-th fold. The optimization problem in (1) is dubbed the Pipeline Selection and Configuration Problem (PSCP). The PSCP can be thought of as a generalization of the combined algorithm selection and hyperparameter optimization (CASH) problem in (Feurer et al., 2015; Kotthoff et al., 2016), which maximizes an objective with respect to selections of single algorithms from the set 𝒜_p, rather than selections of full-fledged pipelines from 𝒫_Θ.

The objective in (1) has no analytic form, and hence we treat the PSCP as a black-box optimization problem. In particular, we assume that (1/J) Σ_{i=1}^J L(P_θ; D_train^(i), D_valid^(i)) is a noisy version of a black-box function f : Λ → ℝ, where Λ = Θ × 𝒫, and use BO to search for the pipeline configuration P*_{θ*} that maximizes the black-box function f(·) (Snoek et al., 2012). The BO algorithm specifies a Gaussian process (GP) prior on f(·) as follows:

    f ∼ GP(µ(Λ), k(Λ, Λ′)),    (2)

where µ(Λ) is the mean function, encoding the expected performance of different pipelines, and k(Λ, Λ′) is the covariance kernel (Rasmussen & Williams, 2006), encoding the similarity between the different pipelines.

The function f is defined over the D-dimensional space Λ, where D = dim(Λ) is given by

    D = dim(𝒫) + Σ_{v ∈ {d,f,p,c}} Σ_{a ∈ 𝒜_v} dim(Θ_v^a).    (3)

In AutoPrognosis, the domain Λ is high-dimensional, with D = 106. (The dimensionality of Λ can be calculated by summing up the number of pipeline stages and the number of hyperparameters in Table 1.)
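The cross-validated objective in (1) can be sketched as a plain black-box evaluation routine. The helper `evaluate_configuration` below is a hypothetical name, and any scikit-learn-style estimator can stand in for a pipeline configuration P_θ; this is not the AutoPrognosis code itself.

```python
# Sketch of the black-box objective in Eq. (1): score one pipeline
# configuration by J-fold cross-validated AUC-ROC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def evaluate_configuration(estimator, X, y, J=5, seed=0):
    """One noisy evaluation of f at a single pipeline configuration."""
    scores = []
    for train_idx, valid_idx in KFold(J, shuffle=True, random_state=seed).split(X):
        est = estimator.fit(X[train_idx], y[train_idx])
        p = est.predict_proba(X[valid_idx])[:, 1]
        scores.append(roc_auc_score(y[valid_idx], p))
    # (1/J) * sum_i L(P_theta; D_train^(i), D_valid^(i))
    return float(np.mean(scores))

X, y = make_classification(n_samples=400, random_state=0)
score = evaluate_configuration(LogisticRegression(max_iter=1000), X, y)
print(round(score, 3))
```

BO treats calls like this as expensive, noisy draws of f(·): each evaluation requires fully training and validating the selected algorithms.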
High dimensionality renders standard GP-based BO infeasible, as both the sample complexity of nonparametric estimation and the computational complexity of maximizing the acquisition function are exponential in D (Györfi et al., 2006; Kandasamy et al., 2015). For this reason, existing AutoML frameworks have refrained from using GP priors, and relied instead on scalable tree-based heuristics (Feurer et al., 2015; Kotthoff et al., 2016). Despite its superior performance, recent empirical findings have shown that plain-vanilla GP-based BO is feasible only for problems of low dimensionality (Wang et al., 2013). Thus, the deployment of GP-based BO has been limited to hyperparameter optimization for single, pre-defined ML models via tools such as Google's Vizier and HyperTune (Golovin et al., 2017). AutoPrognosis overcomes this challenge by leveraging the structure of the PSCP problem, as we show in what follows.

3.1. The Structure of the PSCP Problem
The key idea of our BO algorithm is that, for a given dataset, the performance of a given group of algorithms may not be informative of the performance of another group of algorithms. Since the kernel k(Λ, Λ′) encodes the correlations between the performances of the different pipeline configurations, the underlying "informativeness" structure that relates the different hyperparameters can be expressed via the following sparse additive kernel decomposition:

    k(Λ, Λ′) = Σ_{m=1}^M k_m(Λ^(m), Λ′^(m)),    (4)

where Λ^(m) ∈ Λ^(m), ∀m ∈ {1, …, M}, with {Λ^(m)}_m being a set of disjoint subspaces of Λ. (That is, ∪_m Λ^(m) = Λ, and Λ^(m) ∩ Λ^(m′) = ∅ for m ≠ m′.) The subspaces are assigned mutually exclusive subsets of the dimensions of Λ, so that Σ_m dim(Λ^(m)) = D. The structure of the kernel in (4) is unknown a priori, and needs to be learned from data. The kernel decomposition breaks down f as follows:

    f(Λ) = Σ_{m=1}^M f_m(Λ^(m)).    (5)

Figure 3. Illustration of an exemplary subspace decomposition {Λ^(m)}_m.

The additively sparse structure in (4) gives rise to a statistically efficient BO procedure. That is, if f is γ-smooth, then our additive kernels reduce the sample complexity from O(n^{−γ/(γ+D)}) to O(n^{−γ/(γ+D_m)}), where D_m is the maximum number of dimensions in any subspace (Raskutti et al., 2009; Yang et al., 2015). (Similar improvements hold for the cumulative regret (Kandasamy et al., 2015).)

Each subspace Λ^(m) ⊂ Λ contains the hyperparameters of algorithms with correlated performances, whereas algorithms residing in two different subspaces Λ^(m) and Λ^(m′) have uncorrelated performances. Since a hyperparameter in Θ is only relevant to f(·) when the corresponding algorithm in 𝒫 is selected (Hutter et al., 2009), the decomposition {Λ^(m)}_m must ensure that all the hyperparameters of the same algorithm are bundled together in the same subspace. This a priori knowledge about the "conditional relevance" of the dimensions of Λ makes it easier to learn the kernel decomposition from data. Figure 3 provides an illustration of an exemplary subspace decomposition for the hyperparameters of a set of prediction, feature processing and imputation algorithms. Since the structured kernel in (4) is not fully specified a priori, we propose an algorithm to learn it from the data in the next Section.

3.2. Structured Kernel Learning

AutoPrognosis uses a Bayesian approach to learn the subspace decomposition {Λ^(m)}_m in concurrence with the BO procedure, where the following Dirichlet-Multinomial prior is placed on the structured kernel (Wang et al., 2017):

    α ∼ Dirichlet(M, γ),  z_{v,a} ∼ Multi(α),    (6)

∀a ∈ 𝒜_v, v ∈ {d, f, p, c}, where γ = {γ_m}_m is the parameter of a Dirichlet prior, α = {α_m}_m are the Multinomial mixing proportions, and z_{v,a} is an indicator variable that determines the subspace to which the a-th algorithm in 𝒜_v belongs. The kernel decomposition in (4) is learned by updating the posterior distribution of {Λ^(m)}_m in every iteration of the BO procedure. The posterior distribution over the variables {z_{v,a}}_{v,a} and α is given by:

    P(z, α | H_t, γ) ∝ P(H_t | z) P(z | α) P(α | γ),    (7)

where z = {z_{v,a} : ∀a ∈ 𝒜_v, ∀v ∈ {d, f, p, c}}, and H_t is the history of evaluations of the black-box function up to iteration t.
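The additive decomposition in (4) can be illustrated with a small numpy sketch: a covariance built as a sum of RBF kernels, each acting only on one disjoint group of dimensions. The grouping here is hand-picked for illustration, not a learned decomposition.

```python
# Sketch of the sparse additive kernel in Eq. (4): a GP covariance that is
# a sum of RBF kernels, each restricted to one disjoint group of dimensions.
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def additive_kernel(X1, X2, groups):
    """k(x, x') = sum_m k_m(x^(m), x'^(m)) over disjoint dimension groups."""
    return sum(rbf(X1[:, g], X2[:, g]) for g in groups)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
groups = [[0, 1], [2], [3, 4]]   # disjoint subspaces of the 5 dimensions
K = additive_kernel(X, X, groups)

# A valid GP covariance: symmetric and positive semi-definite.
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-9)
```

Because each summand depends only on its own group of dimensions, posterior inference and acquisition maximization can be carried out per component, which is what yields the statistical and computational savings discussed above.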
Since the variables {z_{v,a}}_{v,a} are sufficient statistics for the subspace decomposition, the posterior over {Λ^(m)}_m is fully specified by (7) marginalized over α, which can be evaluated using Gibbs sampling as follows:

    P(z_{v,a} = m | z \ {z_{v,a}}, H_t) ∝ P(H_t | z) (|𝒜_v^(m)| + γ_m),

where P(H_t | z) is the GP likelihood under the kernel induced by z. The Gibbs sampler is implemented via the Gumbel-Max trick (Maddison et al., 2014) as follows:

    ω_m ∼ Gumbel(0, 1), i.i.d., m ∈ {1, …, M},    (8)
    z_{v,a} = argmax_m  log[ P(H_t | z, z_{v,a} = m) (|𝒜_v^(m)| + γ_m) ] + ω_m.

3.3. Exploration via Diverse Batch Selection
The BO procedure solves the PSCP problem by exploring the performances of a sequence of pipelines {P_{θ_1}, P_{θ_2}, …} until it (hopefully) converges to the optimal pipeline P*_{θ*}. In every iteration t, BO picks a pipeline to evaluate using an acquisition function A(P_θ; H_t) that balances between exploration and exploitation. AutoPrognosis deploys a 2-step batched (parallelized) exploration scheme that picks B pipelines for evaluation at every iteration t as follows:

Step 1: Select the frequentist kernel decomposition {Λ̂^(m)}_m that maximizes the posterior P(z | H_t).

Step 2: Select the B pipelines {P_θ^b}_{b=1}^B with the highest values for the acquisition function {A(P_θ^b; H_t)}_{b=1}^B, such that each pipeline P_θ^b, b ∈ {1, …, B}, involves a distinct prediction algorithm from a distinct subspace in {Λ̂^(m)}_m.

We use the well-known Upper Confidence Bound (UCB) as the acquisition function (Snoek et al., 2012). The decomposition in (5) offers an exponential speed-up in the overall computational complexity of Step 2, since the UCB acquisition function is maximized separately for every (low-dimensional) component f_m; this reduces the complexity of maximizing the acquisition function from being exponential in D to being exponential in D_m. The batched implementation is advantageous since sequential evaluations of f(·) are time-consuming, as each involves training the selected ML algorithms.

Step 2 in the algorithm above encourages exploration as follows. In every iteration t, we select a "diverse" batch of pipelines in which every pipeline is representative of a distinct subspace in {Λ̂^(m)}_m. The batch selection scheme above encourages diverse exploration without the need for sampling pipelines via a determinantal point process with an exponential complexity as in (Kathuria et al., 2016; Nikolov, 2015; Wang et al., 2017). We also devise an efficient backward induction algorithm that exploits the structure of a pipeline to maximize the acquisition function efficiently. (Details are provided in the supplement.)
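The diverse batch rule of Step 2 can be sketched as follows. The GP posterior means and standard deviations are simulated with random numbers for illustration, and the subspace labels are hypothetical; in the actual procedure they come from the fitted GP and the learned decomposition {Λ̂^(m)}_m.

```python
# Sketch of batched UCB selection: score candidates with a UCB acquisition
# and keep the top-B, at most one candidate per kernel subspace.
import numpy as np

def ucb(mu, sigma, beta=2.0):
    return mu + beta * sigma

def diverse_batch(mu, sigma, subspace, B):
    """Pick up to B candidates with highest UCB, each from a distinct subspace."""
    order = np.argsort(-ucb(mu, sigma))       # candidates by descending UCB
    chosen, used = [], set()
    for i in order:
        if subspace[i] not in used:
            chosen.append(int(i))
            used.add(subspace[i])
        if len(chosen) == B:
            break
    return chosen

rng = np.random.default_rng(1)
mu = rng.normal(size=20)                      # simulated posterior means
sigma = rng.uniform(0.1, 1.0, size=20)        # simulated posterior stds
subspace = rng.integers(0, 4, size=20)        # which of M=4 subspaces each candidate is in
batch = diverse_batch(mu, sigma, subspace, B=3)
print(batch, [int(subspace[i]) for i in batch])
```

The B selected pipelines can then be trained and evaluated in parallel, which is the point of the batched scheme.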
4. Ensemble Construction & Meta-learning
In this Section, we discuss the details of the ensemble construction and meta-learning modules; details of the interpreter module are provided in the next Section.
The frequentist approach to pipeline configuration is to pick the pipeline with the best observed performance from the set {P_{θ_1}, …, P_{θ_t}} explored by the BO algorithm in Section 3.3. However, such an approach does not capture the uncertainty in the pipelines' performances, and wastefully throws away t − 1 of the evaluated pipelines. On the contrary, AutoPrognosis makes use of all such pipelines via post-hoc Bayesian model averaging, where it creates an ensemble of weighted pipelines Σ_i w_i P_{θ_i}. Model averaging is particularly useful in cohorts with small sample sizes, where large uncertainty about the pipelines' performances would render frequentist solutions unreliable.

The ensemble weight w_i = P(P_{θ_{i*}} = P_{θ_i} | H_t) is the posterior probability of P_{θ_i} being the best performing pipeline:

    w_i = Σ_z P(P_{θ_{i*}} = P_{θ_i} | z, H_t) · P(z | H_t),    (9)

where i* is the pipeline configuration with the best (true) generalization performance. The weights in (9) are computed by Monte Carlo sampling of kernel decompositions via the posterior P(z | H_t), and then sampling the pipelines' performances from the posterior f | z, H_t. Note that, unlike the ensemble builder of AUTO-SKLEARN (Feurer et al., 2015), the weights in (9) account for correlations between different pipelines, and hence penalize combinations of "similar" pipelines even if they are performing well. Moreover, our post-hoc approach allows building ensembles without requiring extra hyperparameters: in AUTO-WEKA, ensemble construction requires a 5-fold increase in the number of hyperparameters (Kotthoff et al., 2016).
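The Monte Carlo computation of the weights in (9) can be sketched as follows. The posterior means and standard deviations here are toy values; in AutoPrognosis the joint performance samples would be drawn from the GP posterior f | z, H_t, which also captures correlations between pipelines.

```python
# Sketch of the ensemble weights in Eq. (9): estimate, by Monte Carlo,
# the posterior probability that each pipeline is the best performer.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.76, 0.74, 0.70])   # toy posterior mean AUC of 3 pipelines
sd = np.array([0.02, 0.01, 0.05])   # toy posterior std of each

S = 10_000
samples = rng.normal(mu, sd, size=(S, 3))   # S draws of joint performances
best = samples.argmax(axis=1)               # index of the best pipeline per draw
w = np.bincount(best, minlength=3) / S      # w_i ~ P(pipeline i is best | H_t)
print(w.round(3), w.sum())
```

Note how the high-variance third pipeline retains nonzero weight even though its posterior mean is lowest; a frequentist "pick the argmax" rule would discard it entirely.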
The Bayesian model used for solving the PSCP problem in Section 3 can be summarized as follows:

    f ∼ GP(µ, k | z),  z ∼ Multi(α),  α ∼ Dirichlet(M, γ).

The speed of convergence of BO depends on the calibration of the prior's hyperparameters (M, γ, µ, k). An agnostic prior would require many iterations to converge to satisfactory pipeline configurations. To warmstart the BO procedure for a new cohort D, we incorporate prior information obtained from previous runs of AutoPrognosis on a repository of K complementary cohorts {D_1, …, D_K}. Our meta-learning approach combines {H_t^1, …, H_t^K} (optimizer runs on the K complementary cohorts) with the data in D to obtain an empirical Bayes estimate (M̂, γ̂, µ̂, k̂).

Our approach to meta-learning works as follows. For every complementary dataset D_k, we create a set of 55 meta-features M(D_k), 40 of which are statistical meta-features (e.g. number of features, size of data, class imbalance, etc.), and the remaining 15 are clinical meta-features (e.g. lab tests, vital signs, ICD-10 codes, diagnoses, etc.). For every complementary dataset D_j, we optimize the hyperparameters (M̂_j, γ̂_j, µ̂_j, k̂_j) via marginal likelihood maximization. For a new cohort D, we compute a set of weights {η_j}_j, with η_j = ℓ_j / Σ_k ℓ_k, where ℓ_j = ‖M(D) − M(D_j)‖, and calibrate its prior (M, γ, µ, k) by setting it to be the average of the estimates (M̂_j, γ̂_j, µ̂_j, k̂_j), weighted by {η_j}_j.

Existing methods for meta-learning focus only on identifying well-performing pipelines from other datasets, and use them for initializing the optimization procedure (Brazdil et al., 2008; Feurer et al., 2015). Conceptualizing meta-learning as an empirical Bayes calibration procedure allows the transfer of a much richer set of information across datasets. Through the method described above, AutoPrognosis can import information on the smoothness of the black-box function (k), the similarities among baseline algorithms (γ, M), and the expected pipelines' performances (µ). This improves not only the initialization of the BO procedure, but also the mechanism by which it explores the pipelines' design space.
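The prior-calibration step can be sketched numerically for a single scalar hyperparameter. The meta-feature vectors and per-cohort estimates below are toy values, and the weights follow the formula exactly as stated in the text (η_j proportional to ‖M(D) − M(D_j)‖).

```python
# Sketch of the empirical-Bayes warmstart: average hyperparameter estimates
# from K complementary cohorts, weighted by meta-feature distances.
import numpy as np

meta_new = np.array([0.2, 0.5, 0.1])           # M(D) for the new cohort (toy)
meta_old = np.array([[0.1, 0.4, 0.1],          # M(D_j), j = 1..K (toy)
                     [0.9, 0.1, 0.7],
                     [0.2, 0.6, 0.2]])
mu_hat = np.array([0.70, 0.60, 0.72])          # per-cohort estimates of mu (toy)

ell = np.linalg.norm(meta_new - meta_old, axis=1)   # l_j = ||M(D) - M(D_j)||
eta = ell / ell.sum()                               # eta_j = l_j / sum_k l_k
mu_prior = float(eta @ mu_hat)                      # weighted average of the mu-hats
print(eta.round(3), round(mu_prior, 3))
```

The same weighted average would be applied to the remaining prior hyperparameters (M, γ, k) before starting BO on the new cohort.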
5. Evaluation of AutoPrognosis
In this section, we assess the ability of AutoPrognosis to automatically make the right prognostic modeling choices when confronted with a variety of clinical datasets with different meta-features. We conducted experiments on 10 cardiovascular cohorts that correspond to the following aspects of patient care:

• Preventive care:
We considered two major cohorts for preventive cardiology. The first is the Meta-analysis Global Group in Chronic heart failure database (MAGGIC), which holds data for 46,817 patients gathered from multiple clinical studies (Wong et al., 2014). The second cohort is the UK Biobank, which is a bio-repository with data for more than 500,000 volunteers in the UK (Sudlow et al., 2015).

• Heart transplant wait-list management: We extracted data from the United Network for Organ Sharing (UNOS) database, which holds information on all heart transplants conducted in the US between the years 1985 to 2015. Cohort UNOS-I is a pre-transplant population of 36,329 cardiac patients who were enrolled in a transplant wait-list.

• Post-transplant follow-up: Cohort UNOS-II is a post-transplant population of 60,400 patients in the US who underwent a transplant between the years 1985 to 2015.

• Cardiovascular comorbidities: We extracted 6 cohorts from the Surveillance, Epidemiology, and End Results (SEER) cancer registries, which cover approximately 28% of the US population (Yoo & Coughlin, 2018). We predict cardiac deaths in patients diagnosed with breast cancer (SEER-I), colorectal cancer (SEER-II), leukemia (SEER-III), respiratory cancers (SEER-IV), digestive system cancer (SEER-V), and urinary system cancer (SEER-VI).

The first three groups of datasets were collected for cohorts of patients diagnosed with (or at risk for) cardiac diseases, and so they shared a set of meta-features, including a large number of cardiac risk factors, a low censoring rate, and moderate class imbalance. The last group of datasets was collected for cohorts of cancer patients for whom cardiac diseases are potential comorbidities. These datasets shared a different set of meta-features, including a small number of cardiac risk factors, a high censoring rate, and severe class imbalance. Our experiments will demonstrate the ability of AutoPrognosis to adapt its modeling choices to these different clinical setups.
Table 2 shows the performance of various competing prognostic modeling approaches evaluated in terms of the area under the receiver operating characteristic curve (AUC-ROC) with 5-fold cross-validation. We compared the performance of AutoPrognosis with the clinical risk scores used for predicting prognosis in each cohort (the MAGGIC score in MAGGIC and UNOS-I (Wong et al., 2014), the Framingham score in the UK Biobank (Schnabel et al., 2009), and the IMPACT score in UNOS-II (Weiss et al., 2011)). We also compared with various AutoML frameworks, including AUTO-WEKA (Kotthoff et al., 2016), AUTO-SKLEARN (Feurer et al., 2015), and TPOT (Olson & Moore, 2016). Finally, we compared with a standard Cox proportional hazards (Cox PH) model, which is the model most commonly used in clinical prognostic research. All algorithms were allowed to run for a maximum of 10 hours to ensure a fair comparison.

Table 2 demonstrates the superiority of AutoPrognosis to all the competing models on all the cohorts under consideration. This reflects the robustness of our system, since the 10 cohorts had very different characteristics. In many experiments, the learned kernel decomposition reflected an intuitive clustering of algorithms by the similarity of their structure. For instance, Figure 4 shows one subspace in the frequentist decomposition learned by AutoPrognosis over the BO iterations for the MAGGIC cohort. We can see that all ensemble methods in the imputation and prediction stages that use decision trees as their base learners were lumped together in the same subspace.

Figure 4. The learned kernel decomposition for MAGGIC.

Albeit accurate, models built by AutoPrognosis would generally be hard for a clinician to "interpret". To address this issue, AutoPrognosis deploys an interpreter module (see Figure 2) that takes as an input the learned model for
Table 2. Performance of the different prognostic models in terms of the AUC-ROC with 5-fold cross-validation. Bold numbers correspond to the best result; the "best predictor" row lists the prediction algorithms picked by vanilla AutoPrognosis. (Cells marked "?" are unreadable in the source.)

| | MAGGIC | UK Biobank | UNOS-I | UNOS-II | SEER-I | SEER-II | SEER-III | SEER-IV | SEER-V | SEER-VI |
|---|---|---|---|---|---|---|---|---|---|---|
| AutoPrognosis (vanilla) | 0.76 ± .004 | 0.71 ± .004 | 0.78 ± .002 | 0.65 ± .001 | 0.68 ± .002 | 0.66 ± .005 | 0.61 ± .001 | 0.69 ± .002 | 0.64 ± .002 | 0.65 ± .003 |
| best predictor | Grad. Boost | XGBoost | AdaBoost | Rand. Forest | Cox PH | Cox PH | S. Forest | Cox PH | S. Forest | Cox PH |
| + ensembles | 0.77 ± .002 | 0.73 ± .003 | 0.80 ± .001 | ? ± .001 | ? ± .002 | 0.67 ± .003 | 0.62 ± .001 | 0.69 ± .002 | 0.66 ± .002 | 0.65 ± .002 |
| + meta-learning | 0.77 ± .004 | 0.72 ± .004 | 0.79 ± .002 | 0.65 ± .002 | 0.72 ± .003 | 0.68 ± .003 | 0.64 ± .001 | 0.71 ± .003 | 0.69 ± .003 | 0.66 ± .002 |
| full-fledged | ? ± .004 | 0.74 ± .003 | 0.81 ± .001 | 0.66 ± .001 | 0.73 ± .003 | 0.69 ± .003 | 0.64 ± .001 | 0.72 ± .002 | 0.70 ± .003 | 0.67 ± .002 |
| auto-sklearn | ? ± .003 | 0.72 ± .004 | 0.77 ± .002 | 0.63 ± .002 | 0.67 ± .002 | 0.51 ± .005 | 0.60 ± .001 | 0.65 ± .004 | 0.64 ± .002 | 0.61 ± .003 |
| Auto-WEKA | ? ± .003 | 0.72 ± .005 | 0.78 ± .001 | 0.62 ± .002 | 0.66 ± .002 | 0.54 ± .004 | 0.59 ± .002 | 0.68 ± .003 | 0.63 ± .004 | 0.63 ± .002 |
| TPOT | 0.74 ± .006 | 0.68 ± .005 | 0.72 ± .003 | 0.61 ± .003 | 0.64 ± .003 | 0.59 ± .003 | 0.57 ± .002 | 0.67 ± .004 | 0.62 ± .005 | 0.61 ± .003 |
| Clinical Score | 0.70 ± .007 | 0.70 ± .003 | 0.62 ± .001 | 0.56 ± .001 | — | — | — | — | — | — |
| Cox PH | 0.75 ± .005 | 0.71 ± ? | ? ± .001 | 0.59 ± .001 | 0.71 ± .003 | 0.65 ± .004 | 0.62 ± .002 | ? ± .003 | ? ± .003 | ? ± .002 |

a given cohort, in addition to a set of actionable risk strata R, and outputs an "explanation" for its predictions in terms of a set of logical association rules of the form:

C₁ ∧ C₂ ∧ … ∧ C_{l(r)} ⇒ r,  ∀ r ∈ R,    (10)

where {C₁, …, C_{l(r)}} is a set of Boolean conditions associated with risk stratum r. The association rules are obtained via a Bayesian associative classifier (Ma & Liu, 1998; Agrawal et al., 1993; Kruschke, 2008; Luo, 2016), with a prior over association rules and a posterior computed based on target labels that correspond to the outputs of the learned model discretized via the strata in R. The Bayesian approach allows incorporating prior knowledge (from clinical literature) about "likely" association rules.

We report one example of an explanation provided by the interpreter module based on our experiments on the UK Biobank cohort. For this cohort, the standard Framingham risk score exhibited an AUC-ROC of 0.705 for the overall cohort, but its AUC-ROC for patients with Type-2 Diabetes (T2D) was as low as 0.63. On the contrary, AutoPrognosis performed almost equally well in the two subgroups. The interpreter provided an explanation for the improved predictions through the following association rule:
Diabetic ∧ Lipid-lowering ∧ (Age ≥ …) ⇒ High risk
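A rule of this form can be checked against a cohort in the usual associative-classification way, by measuring its support and confidence on the model's risk strata. The records and the age threshold below are hypothetical, chosen only to make the sketch runnable; this is not the system's Bayesian classifier:

```python
# Hypothetical patient records plus the learned model's risk stratum.
patients = [
    {"diabetic": True,  "lipid_lowering": True,  "age": 67, "risk": "high"},
    {"diabetic": True,  "lipid_lowering": True,  "age": 72, "risk": "high"},
    {"diabetic": True,  "lipid_lowering": False, "age": 55, "risk": "low"},
    {"diabetic": False, "lipid_lowering": False, "age": 48, "risk": "low"},
]

def rule_stats(patients, conditions, stratum):
    """Support and confidence of the rule  C1 ∧ ... ∧ Cl => stratum,
    where each condition is a Boolean predicate over one record."""
    covered = [p for p in patients if all(c(p) for c in conditions)]
    hits = [p for p in covered if p["risk"] == stratum]
    support = len(covered) / len(patients)
    confidence = len(hits) / len(covered) if covered else 0.0
    return support, confidence

conds = [lambda p: p["diabetic"],
         lambda p: p["lipid_lowering"],
         lambda p: p["age"] >= 60]          # illustrative threshold
support, confidence = rule_stats(patients, conds, "high")
# On this toy cohort the rule covers half the patients (support 0.5)
# and every covered patient is high risk (confidence 1.0).
```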
None of these risk factors were included in the standard guidelines. That is, the interpreter indicates that a better stratification, with new risk factors such as the usage of lipid-lowering drugs, is possible for diabetic patients. Clinicians can use the interpreter as a data-driven hypothesis generator that prompts new risk factors and strata for subsequent research.

AutoPrognosis as a Clairvoyant
We split Table 2 into two groups of columns: group 1 (left) contains cohorts obtained from cardiology studies, whereas group 2 (right) contains cohorts obtained from cancer studies with cardiac secondary outcomes. As mentioned earlier, the two groups had different meta-features. We tracked the modeling choices made by vanilla AutoPrognosis (no ensembles or meta-learning) in both groups (the "best predictor" row in Table 2). For all datasets in group 2, AutoPrognosis decided that survival modeling (using the Cox PH model or survival forests) is the right approach. This is because, given the high prevalence of censored time-to-event data, survival models are more data-efficient than operating on binarized survival labels and removing patients lost to follow-up. When given richer datasets with a large number of relevant features, low rates of censoring, and moderate imbalance (group 1), AutoPrognosis spent more iterations navigating ML classifiers, and learned that an algorithm like AdaBoost is a better choice for a dataset like UNOS-I. Such a (non-intuitive) choice would not have been identified by a clinical researcher; researchers typically use the Cox PH model, which on the UNOS-I cohort provides inferior performance.
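The data-efficiency argument can be made concrete: binarizing survival labels at a fixed horizon forces us to discard every patient censored before that horizon. The records and horizon below are invented for illustration:

```python
# Each record: (observed_time_years, event_occurred). event_occurred=False
# means the patient was censored (lost to follow-up) at that time.
cohort = [(1.2, False), (4.0, True), (2.5, False), (6.0, False),
          (0.8, True), (3.1, False), (5.5, True), (1.9, False)]
horizon = 5.0

def binarize(records, horizon):
    """Turn time-to-event data into binary labels at a fixed horizon.
    Patients censored before the horizon carry no usable label and must
    be dropped -- the information loss that survival models avoid."""
    labels = []
    for t, event in records:
        if event and t <= horizon:
            labels.append(1)            # event within horizon
        elif t >= horizon:
            labels.append(0)            # known event-free at horizon
        # censored before horizon: outcome unknown, record dropped
    return labels

labels = binarize(cohort, horizon)
dropped = len(cohort) - len(labels)
# Half of this toy cohort is censored early, so half the records are lost
# to the binarized classifier but remain usable by a survival model.
```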
Meta-learning was implemented via leave-one-dataset-out validation: we ran vanilla AutoPrognosis on all of the 10 cohorts, and then, for every cohort, we used the other 9 cohorts as the complementary datasets for the meta-learning algorithm. Since the pool of complementary cohorts contained 5 datasets for cardiovascular comorbidities, meta-learning was most useful for the group 2 datasets, as they all had very similar meta-features. With meta-learning, AutoPrognosis had a strong prior on survival models for group 2 datasets, and hence it converged quickly to a decision on using a survival model having observed the dataset's meta-features. Ensemble construction was most useful for the MAGGIC and UNOS cohorts, since those datasets had more complex hypotheses to learn.

Clinical researchers often ask the question: when should I use machine learning for my prognostic study? The answer depends on the nature of the dataset involved. As we have seen in Table 2, a simple Cox model may in some cases be sufficient to issue accurate predictions. The meta-learning module in AutoPrognosis can act as a clairvoyant that tells whether ML models would add value to a given prognostic study without training any model. That is, by looking at the "meta-learned" GP prior calibrated by a new dataset's meta-features, we can see whether the prior assigns high scores to ML models compared to a simple Cox model, and hence decide whether ML has gains to offer for such a dataset.
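A crude sketch of this "clairvoyant" use of meta-learning: weight each model family's performance on previously seen cohorts by meta-feature similarity, and compare the resulting prior scores before training anything. The cohorts, meta-features, and scores below are invented for illustration, and the kernel-weighted average is only a stand-in for the meta-learned GP prior:

```python
import math

# Meta-features: (censoring_rate, class_imbalance). Scores: best AUC per
# model family observed on previously processed cohorts (all values toy).
history = [
    {"meta": (0.1, 0.3), "scores": {"ml_ensemble": 0.78, "cox_ph": 0.74}},
    {"meta": (0.7, 0.9), "scores": {"ml_ensemble": 0.63, "cox_ph": 0.71}},
    {"meta": (0.8, 0.8), "scores": {"ml_ensemble": 0.61, "cox_ph": 0.70}},
]

def prior_scores(meta, history, bandwidth=0.5):
    """Kernel-weighted average of past performance: a crude stand-in for
    the meta-learned GP prior calibrated by a new dataset's meta-features."""
    out = {}
    for family in history[0]["scores"]:
        w = [math.exp(-sum((a - b) ** 2 for a, b in zip(meta, h["meta"]))
                      / bandwidth ** 2) for h in history]
        out[family] = sum(wi * h["scores"][family]
                          for wi, h in zip(w, history)) / sum(w)
    return out

# A new cancer cohort: high censoring, severe imbalance.
scores = prior_scores((0.75, 0.85), history)
# The prior favors Cox PH over ML ensembles before any model is trained.
```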
References
Agrawal, Rakesh, Imieliński, Tomasz, and Swami, Arun. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pp. 207–216. ACM, 1993.

Alaa, Ahmed M, Hu, Scott, and van der Schaar, Mihaela. Learning from clinical judgments: Semi-Markov-modulated marked Hawkes processes for risk prognosis. International Conference on Machine Learning, 2017.

Bergstra, James, Bardenet, Rémi, Kégl, B, and Bengio, Y. Implementations of algorithms for hyper-parameter optimization. In NIPS Workshop on Bayesian Optimization, pp. 29, 2011.

Blaha, Michael J. The critical importance of risk score calibration, 2016.

Brazdil, Pavel, Carrier, Christophe Giraud, Soares, Carlos, and Vilalta, Ricardo. Metalearning: Applications to Data Mining. Springer Science & Business Media, 2008.

Cabitza, Federico, Rasoini, Raffaele, and Gensini, Gian Franco. Unintended consequences of machine learning in medicine. JAMA, 318(6):517–518, 2017.

Chen, Tianqi and Guestrin, Carlos. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, 2016.

Feurer, Matthias, Klein, Aaron, Eggensperger, Katharina, Springenberg, Jost, Blum, Manuel, and Hutter, Frank. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2962–2970, 2015.

Fine, Jason P and Gray, Robert J. A proportional hazards model for the subdistribution of a competing risk. Journal of the American Statistical Association, 94(446):496–509, 1999.

Golovin, Daniel, Solnik, Benjamin, Moitra, Subhodeep, Kochanski, Greg, Karro, John, and Sculley, D. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495. ACM, 2017.

Györfi, László, Kohler, Michael, Krzyzak, Adam, and Walk, Harro. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.

Hripcsak, George, Albers, David J, and Perotte, Adler. Parameterizing time in electronic health record studies. Journal of the American Medical Informatics Association, 22(4):794–804, 2015.

Hutter, Frank, Hoos, Holger H, Leyton-Brown, Kevin, and Stützle, Thomas. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36(1):267–306, 2009.

Hutter, Frank, Hoos, Holger H, and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. LION, 5:507–523, 2011.

Ishwaran, Hemant, Kogalur, Udaya B, Blackstone, Eugene H, and Lauer, Michael S. Random survival forests. The Annals of Applied Statistics, pp. 841–860, 2008.

Jenatton, Rodolphe, Archambeau, Cedric, González, Javier, and Seeger, Matthias. Bayesian optimization with tree-structured dependencies. In International Conference on Machine Learning, pp. 1655–1664, 2017.

Kandasamy, Kirthevasan, Schneider, Jeff, and Póczos, Barnabás. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning (ICML), pp. 295–304, 2015.

Kathuria, Tarun, Deshpande, Amit, and Kohli, Pushmeet. Batched Gaussian process bandit optimization via determinantal point processes. In Advances in Neural Information Processing Systems, pp. 4206–4214, 2016.

Kotthoff, Lars, Thornton, Chris, Hoos, Holger H, Hutter, Frank, and Leyton-Brown, Kevin. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research, 17:1–5, 2016.

Kruschke, John K. Bayesian approaches to associative learning: From passive to active learning. Learning & Behavior, 36(3):210–226, 2008.

Luo, Gang. Automatically explaining machine learning prediction results: a demonstration on type 2 diabetes risk prediction. Health Information Science and Systems, 4(1):2, 2016.

Luo, Gang, Stone, Bryan L, Johnson, Michael D, Tarczy-Hornoch, Peter, Wilcox, Adam B, Mooney, Sean D, Sheng, Xiaoming, Haug, Peter J, and Nkoy, Flory L. Automating construction of machine learning models with clinical big data: proposal rationale and methods. JMIR Research Protocols, 6(8), 2017.

Ma, Yiming, Liu, Bing, and Hsu, Wynne. Integrating classification and association rule mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998.

Maddison, Chris J, Tarlow, Daniel, and Minka, Tom. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.

Nikolov, Aleksandar. Randomized rounding for the largest simplex problem. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 861–870. ACM, 2015.

Olson, Randal S and Moore, Jason H. TPOT: A tree-based pipeline optimization tool for automating machine learning. In ICML Workshop on Automatic Machine Learning, pp. 66–74, 2016.

Pedregosa, Fabian, Varoquaux, Gaël, Gramfort, Alexandre, Michel, Vincent, Thirion, Bertrand, Grisel, Olivier, Blondel, Mathieu, Prettenhofer, Peter, Weiss, Ron, Dubourg, Vincent, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

Raskutti, Garvesh, Yu, Bin, and Wainwright, Martin J. Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness. In Advances in Neural Information Processing Systems, pp. 1563–1570, 2009.

Rasmussen, Carl Edward and Williams, Christopher KI. Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, 2006.

Schnabel, Renate B, Sullivan, Lisa M, Levy, Daniel, Pencina, Michael J, Massaro, Joseph M, D'Agostino, Ralph B, Newton-Cheh, Christopher, Yamamoto, Jennifer F, Magnani, Jared W, Tadros, Thomas M, et al. Development of a risk score for atrial fibrillation (Framingham Heart Study): a community-based cohort study. The Lancet, 373(9665):739–745, 2009.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pp. 2951–2959, 2012.

Sudlow, Cathie, Gallacher, John, Allen, Naomi, Beral, Valerie, Burton, Paul, Danesh, John, Downey, Paul, Elliott, Paul, Green, Jane, Landray, Martin, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779, 2015.

Swersky, Kevin, Duvenaud, David, Snoek, Jasper, Hutter, Frank, and Osborne, Michael A. Raiders of the lost architecture: Kernels for Bayesian optimization in conditional parameter spaces. arXiv preprint arXiv:1409.4011, 2014.

Wang, Zi, Li, Chengtao, Jegelka, Stefanie, and Kohli, Pushmeet. Batched high-dimensional Bayesian optimization via structural kernel learning. International Conference on Machine Learning (ICML), 2017.

Wang, Ziyu, Zoghi, Masrour, Hutter, Frank, Matheson, David, De Freitas, Nando, et al. Bayesian optimization in high dimensions via random embeddings. In IJCAI, pp. 1778–1784, 2013.

Weiss, Eric S, Allen, Jeremiah G, Arnaoutakis, George J, George, Timothy J, Russell, Stuart D, Shah, Ashish S, and Conte, John V. Creation of a quantitative recipient risk index for mortality prediction after cardiac transplantation (IMPACT). The Annals of Thoracic Surgery, 92(3):914–922, 2011.

Wong, Chih M, Hawkins, Nathaniel M, Petrie, Mark C, Jhund, Pardeep S, Gardner, Roy S, Ariti, Cono A, Poppe, Katrina K, Earle, Nikki, Whalley, Gillian A, Squire, Iain B, et al. Heart failure in younger patients: the Meta-analysis Global Group in Chronic Heart Failure (MAGGIC). European Heart Journal, 35(39):2714–2721, 2014.

Yang, Yun, Tokdar, Surya T, et al. Minimax-optimal nonparametric regression in high dimensions. The Annals of Statistics, 43(2):652–674, 2015.

Yoo, Wonsuk and Coughlin, Steven S. Surveillance, epidemiology, and end results (SEER) data for monitoring cancer trends. Journal of the Georgia Public Health Association, 2018.

Yoon, Jinsung, Alaa, Ahmed M, Cadeiras, Martin, and van der Schaar, Mihaela. Personalized donor-recipient matching for organ transplantation. In