BET: Bayesian Ensemble Trees for Clustering and Prediction in Heterogeneous Data
Leo L. Duan, John P. Clancy and Rhonda D. Szczesniak

Summary
We propose a novel "tree-averaging" model that utilizes the ensemble of classification and regression trees (CART). Each constituent tree is estimated with a subset of similar data. We treat this grouping of subsets as Bayesian ensemble trees (BET) and model them as an infinite mixture Dirichlet process. We show that BET adapts to data heterogeneity and accurately estimates each component. Compared with the bootstrap-aggregating approach, BET shows improved prediction performance with fewer trees. We develop an efficient estimating procedure with improved sampling strategies in both CART and mixture models. We demonstrate these advantages of BET with simulations, classification of breast cancer and regression of lung function measurements of cystic fibrosis patients.
KEY WORDS:
Bayesian CART; Dirichlet Process; Ensemble Approach; Heterogeneity; Mixture of Trees.

Department of Mathematical Sciences, University of Cincinnati; Division of Pulmonary Medicine, Cincinnati Children's Hospital Medical Center; Division of Biostatistics and Epidemiology, Cincinnati Children's Hospital Medical Center. Corresponding author address: 3333 Burnet Ave, MLC 5041, Cincinnati, OH 45229. Phone: (513) 803-0563, email: [email protected]

Introduction
Classification and regression trees (CART) (Breiman, Friedman, Olshen, and Stone, 1984) is a nonparametric learning approach that provides fast partitioning of data through a binary split tree and an intuitive interpretation of the relation between the covariates and the outcome. Aside from simple model assumptions, CART is not affected by potential collinearity or singularity of covariates. From a statistical perspective, CART models the data entries as conditionally independent given the partition, which not only retains the likelihood simplicity but also preserves the nested structure.

Since the introduction of CART, many approaches have been derived with better model parsimony and prediction. The Random Forests model (Breiman, 2001) generates bootstrap estimates of trees and utilizes the bootstrap-aggregating ("bagging") estimator for prediction. Boosting (Friedman, 2001, 2002) creates a generalized additive model of trees and then uses the sum of trees for inference. Bayesian CART (Denison, Mallick, and Smith, 1998; Chipman, George, and McCulloch, 1998) assigns a prior distribution to the tree and uses Bayesian model averaging to achieve better estimates. Bayesian additive regression trees (BART; Chipman, George, McCulloch, et al., 2010) combine the advantages of the prior distribution and the sum-of-trees structure to gain further improvement in prediction.

Regardless of the differences among the aforementioned models, they share one principle: multiple trees create a more diverse fit than a single tree; therefore, the combined information accommodates more sources of variability in the data. Our design follows this principle. We create a new ensemble approach called the Bayesian Ensemble Trees (BET) model, which utilizes the information available in subsamples of the data. Similar to Random Forests, we use the average of the trees, where each tree achieves an optimal fit without any restraints. Nonetheless, we determine the subsamples through clustering rather than bootstrapping. This setting automates the control of the number of trees and also adapts the trees to possible heterogeneity in the data.

In the following sections, we first introduce the model notation and its sampling algorithm. We illustrate the clustering performance through three different simulation settings. We demonstrate the new tree sampler using homogeneous example data from a breast cancer study. Next we benchmark BET against other tree-based methods using heterogeneous data on lung function collected on cystic fibrosis patients. Lastly, we discuss the BET results and possible extensions.
We denote the i-th record of the outcome as Y_i, which can be either categorical or continuous. Each Y_i has a corresponding covariate vector X_i. In the standard CART model, we generate a binary decision tree T that uses only the values of X_i to assign the i-th record to a certain region. In each region, the elements of Y_i are identically and independently distributed with a set of parameters θ. Our goals are to find the optimal tree T, estimate θ and make inference about an unknown Y_s given values of X_s, where s indexes the observation to predict.

We further assume that {Y_i, X_i} comes from one of (countably infinitely) many trees {T_j}_j. Its true origin is only known up to a probability w_j for the j-th tree. Therefore, we need to estimate both T_j and w_j for each j. Since it is impossible to estimate over all j's, we only calculate those j's with non-negligible w_j, as explained later.

We now formally define the proposed model. We use [.] to denote the probability density. Let [Y_i | X_i, T_j] denote the probability of Y_i conditional on its originating from the j-th tree. The mixture likelihood can be expressed as

[Y_i | X_i, {T_j}_j] = \sum_{j=1}^{\infty} w_j [Y_i | X_i, T_j],    (1)

where the mixture weight vector has an infinite-dimensional Dirichlet distribution with precision parameter α: W = {w_j}_j ~ Dir_∞(α). The likelihood above corresponds to Y_i ~iid DP(α, G), where DP stands for the Dirichlet process and the base distribution G is simply [Y_i | T_j].

We first define the nodes as the building units of a tree. We adopt the notation introduced by Wu, Tjelmeland, and West (2007) and use the following method to assign indices to the nodes: for any node k, the two child nodes are indexed as left (2k + 1) and right (2k + 2); the root node has index 0. The parent of any node k > 0 is \lfloor (k − 1)/2 \rfloor, where \lfloor . \rfloor denotes the integer part of a non-negative value. The depth of a node k is \lfloor \log_2(k + 1) \rfloor.

Each node can have either zero (not split) or two (split) child nodes. Conditional on the parent node being split, we use s_k = 1 to denote a node being split (interior), and s_k = 0 otherwise (leaf). Therefore, the frame of a tree with at least one split is

[s_0 = 1] \prod_{k \in I} [s_{2k+1} | s_k = 1] [s_{2k+2} | s_k = 1],

where I = {k : s_k = 1} denotes the set of interior nodes.

Each node has splitting thresholds that correspond to the m covariates in X. Let the m-dimensional vector t_k denote these splitting thresholds. Each node also has a random draw variable c_k from {1, ..., m}. We assume s_k, t_k, c_k are independent. For the c_k-th element X^{(c_k)}, if X^{(c_k)} < t_k^{(c_k)}, then observation i is distributed to the left child; otherwise it is distributed to the right child. For every i, this distributing process iterates from the root node and ends in a leaf node. We use θ_k to denote the distribution parameters in the leaf nodes. For each node and for a complete tree, the prior densities are

[T_k] = [s_k] [c_k]^{s_k} [t_k]^{s_k} [θ_k]^{1 − s_k},    [T] = [T_0] \prod_{k \in I} [T_{2k+1}] [T_{2k+2}].    (2)

For s_k, t_k, c_k, we specify the prior distributions as follows:

s_k ~ B(exp(−\lfloor \log_2(k + 1) \rfloor / δ)),
c_k ~ MN_m(ξ), where ξ ~ Dir(1_m),
[t_k] ∝ 1,

where B denotes the Bernoulli distribution and MN_m the m-dimensional multinomial distribution. The hyper-parameter δ is a tuning parameter for which smaller δ results in smaller trees.
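To make the node indexing and the depth-penalized split prior concrete, here is a minimal sketch (the helper names and the use of Python are ours; δ enters as the argument `delta`):

```python
import math

def children(k):
    # left and right child indices of node k
    return 2 * k + 1, 2 * k + 2

def parent(k):
    # parent index of any node k > 0
    return (k - 1) // 2

def depth(k):
    # depth of node k; the root (k = 0) has depth 0
    return int(math.floor(math.log2(k + 1)))

def prior_split_prob(k, delta):
    # prior probability that node k is split: exp(-depth(k) / delta),
    # so deeper nodes are less likely to split and smaller delta gives smaller trees
    return math.exp(-depth(k) / delta)
```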
In each partition, objective priors are used for θ. If Y is continuous, then [θ_k] = [µ, σ] ∝ 1/σ; if Y is discrete, then θ_k = p ~ Dir(0.5 · 1). Note that the Dirichlet reduces to a Beta distribution when Y is a binary outcome. To guarantee the posterior propriety of θ, we further require that each partition contain at least q observations, where q is a small positive integer.
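These conjugate-style leaf priors allow the leaf parameters to be integrated out in closed form, which is exploited later when growing the trees. As one illustration (a standard Dirichlet-multinomial calculation, written here for intuition; the symmetric prior weight a, for example a = 1/2 as above, is our notation), a leaf with category counts n_1, ..., n_K has marginal likelihood

\int \prod_{k=1}^{K} p_k^{n_k} \, \mathrm{Dir}(p \mid a\mathbf{1}) \, dp = \frac{\Gamma(Ka)}{\Gamma(a)^{K}} \cdot \frac{\prod_{k=1}^{K} \Gamma(n_k + a)}{\Gamma(n + Ka)}, \qquad n = \sum_{k=1}^{K} n_k .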
The vector ξ reveals the proportion of instances in which a certain variable is used in constructing a tree. One variable can be used more times than another, therefore resulting in a larger proportion in ξ. Therefore, ξ can be utilized in variable selection, and we name it the variable ranking probability.
The changing dimension of the Dirichlet process creates difficulties in Bayesian sampling. Pioneering studies include exploring the infinite state space with reversible-jump Markov chain Monte Carlo (Green and Richardson, 2001) and with an auxiliary variable for possible new states (Neal, 2000). At the same time, an equivalent construction named the stick-breaking process (Ishwaran and James, 2001) gained popularity for its decreased computational burden. The stick-breaking process decomposes the Dirichlet process into an infinite series of Beta distributions:

v_j ~ Beta(1, α),    (3)
w_1 = v_1,    w_j = v_j \prod_{k<j} (1 − v_k) for j > 1.    (4)

To avoid computing infinitely many weights, we adopt the slice sampler (Kalli, Griffin, and Walker, 2011), which introduces a latent uniform variable u_i ~ U(0, 1) for each observation. The probability in (1) becomes

[Y_i | X_i, {T_j}_j] = \sum_{j=1}^{\infty} 1(u_i < w_j) [Y_i | X_i, T_j],    (5)

due to \int 1(u_i < w_j) du_i = w_j. The Monte Carlo sampling of u_i leads to omitting the w_j's that are too small. We found that the slice sampler usually keeps the effective number of components small (max j below 10 in our runs, even for large n), hence more rapid convergence than a simple truncation.
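The practical payoff of (5) is that each sweep only needs the components whose weights exceed the smallest slice variable. A small sketch of this truncation (the value of α, the number of sticks drawn and the toy slice variables are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(v):
    # w_j = v_j * prod_{k<j} (1 - v_k)
    return v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

alpha = 1.0
v = rng.beta(1.0, alpha, size=50)       # finitely many sticks, for illustration only
w = stick_breaking_weights(v)

u = rng.uniform(0.0, w[0], size=1000)   # toy slice variables, as if every Z_i sat in the first component
active = np.flatnonzero(w > u.min())    # only these components can be visited this sweep
print(f"{active.size} components have w_j above the smallest slice variable")
```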
We now explain the sampling algorithm for the BET model. Let Z_i = j denote the latent assignment of the i-th observation to the j-th tree. The sampling scheme for the BET model then iterates over two steps: tree growing and clustering.

Tree growing: [T | W, Z, Y]. Each tree with its allocated data is grown in this step. We sample in the order [s, c, t] and then [θ | s, c, t]. As stated by Chipman, George, and McCulloch (1998), using [Y | s, c, t] marginalized over θ facilitates rapid change of the tree structure. After the tree is updated, the conditional sampling of θ provides [Y_i | T_j] for different j.

During the updating of [s, c, t], we found that a random choice among grow/prune/swap/change (GPSC) in one Metropolis-Hastings (MH) step (Chipman, George, and McCulloch, 1998) is not sufficient to grow large trees in our model. This is not a drawback of the proposal mechanism; it is instead primarily due to the clustering process distributing the data entries to many small trees if a large tree has not yet formed. In other words, the goal is to have our model prioritize "first grow the tree, then cluster" instead of the other order. Therefore, we devise a new Gibbs sampling scheme, which sequentially samples the full conditional distributions [s_k | (s_{−k})], [c_k | (c_{−k})] and [t_k | (t_{−k})]. For each update, the MH criterion is used. We disallow updates of c, t that would result in an empty node, so that s does not change in these steps. The major difference of this approach from the GPSC method is that, rather than one random change in one random node, we use micro steps to exhaustively explore possible changes in every node, which improves chain convergence. Besides increasing the convergence rate, the other function of the Gibbs sampler is to force each change in the tree structure to be small and local. Although some radical change steps (Wu, Tjelmeland, and West, 2007; Pratola, 2013) can facilitate jumps between the modes of a single tree, for a mixture of trees, local changes and mode sticking are useful to prevent label switching.

Clustering: [W, Z | T, Y]. In this step, we take advantage of the latent uniform variable U in the slice sampler. In order to sample from the joint density [U, W, Z | T], we use blocked Gibbs sampling again, in the order of the conditional densities [W | Z, T], [U | W, Z, T] and [Z | W, U, T]:

v_j ~ Beta(1 + \sum_i 1(Z_i = j), α + \sum_i 1(Z_i > j)),    w_j = v_j \prod_{k<j} (1 − v_k).
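For concreteness, a simplified sketch of this clustering step (tree growing is abstracted away behind a hypothetical `loglik(i, j)` returning log [Y_i | X_i, T_j]; a full slice sampler would also extend the stick until the leftover mass falls below the smallest u_i, whereas this sketch keeps a fixed pool of J components):

```python
import numpy as np

def update_clustering(Z, alpha, loglik, J, rng):
    """One blocked Gibbs pass over [W | Z], [U | W, Z] and [Z | W, U, T].

    Z is an integer array of current tree assignments (values in 0..J-1)."""
    n = len(Z)
    # [W | Z]: v_j ~ Beta(1 + n_j, alpha + n_{>j}), then stick-breaking weights
    n_eq = np.array([(Z == j).sum() for j in range(J)])
    n_gt = np.array([(Z > j).sum() for j in range(J)])
    v = rng.beta(1.0 + n_eq, alpha + n_gt)
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    # [U | W, Z]: u_i ~ Uniform(0, w_{Z_i})
    u = rng.uniform(0.0, w[Z])
    # [Z | W, U, T]: each observation moves among components with w_j > u_i
    for i in range(n):
        cand = np.flatnonzero(w > u[i])
        logp = np.array([loglik(i, j) for j in cand])
        p = np.exp(logp - logp.max())
        Z[i] = rng.choice(cand, p=p / p.sum())
    return Z, w
```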
Figure 1: Simulation Study I shows one cluster is found and the partitioning scheme is correctly uncovered.
Simulation Study II

This example illustrates the scenario of having a "mixture distribution inside one tree". We duplicate the previous observations in each partition and change one set of their means. As shown in Table 2, each partition now becomes bimodal. Since partitioning based on {X1, X2} is equivalent to partitioning based on {X1, X2, X3}, we drop X3 for a clearer demonstration.

Index     X1          X2          Y
1-100     U(0.1,0.4)  U(0.1,0.4)  N(1.0, σ²)
101-200   U(0.1,0.4)  U(0.6,0.9)  N(3.0, σ²)
201-300   U(0.6,0.9)  U(0.1,0.9)  N(5.0, σ²)
301-400   U(0.1,0.4)  U(0.1,0.4)  N(1.5, σ²)
401-500   U(0.1,0.4)  U(0.6,0.9)  N(5.5, σ²)
501-600   U(0.6,0.9)  U(0.1,0.9)  N(3.5, σ²)

Table 2: Setting for Simulation Study II

The results are shown in Figure 2. Two clusters are correctly identified by BET. The fitted splitting criteria and the estimated parameters are consistent with the true values. It is interesting to note that in the leftmost nodes of the two clusters, there is a significant overlap in distribution (within one standard deviation of the normal means). As a result, compared with the original counts in the data generation, some randomness is observed in these two nodes. Nevertheless, the two fitted trees are almost exactly as anticipated.

Simulation Study III

This scenario reflects the most complicated case, which is quite realistic in large, heterogeneous data. We again duplicate the data in Table 1 and then change the partition scheme for indices 301-600, as shown in Table 3.

Index     X1          X2          Y
1-100     U(0.1,0.4)  U(0.1,0.4)  N(1.0, σ²)
101-200   U(0.1,0.4)  U(0.6,0.9)  N(3.0, σ²)
201-300   U(0.6,0.9)  U(0.1,0.9)  N(5.0, σ²)
301-400   U(0.1,0.4)  U(0.6,0.9)  N(5.0, σ²)
401-500   U(0.6,0.9)  U(0.6,0.9)  N(1.0, σ²)
501-600   U(0.1,0.9)  U(0.1,0.4)  N(3.0, σ²)

Table 3: Setting for Simulation Study III

To ensure convergence, we ran the model for 20,000 steps and discarded the first 10,000 as burn-in. Among the remaining 10,000 steps, the numbers of steps corresponding to 1, 2 and 3 clusters are 214, 9,665 and 121, respectively. Clearly, the 2-cluster model is the most probable for these data. The best ensemble of trees is shown in Figure 4. This is consistent with the data generation diagram in Figure 3, since the two means in the upper-left region are exchangeable due to its bimodal nature.
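For reference, the Study II and III designs can be generated as below (the noise standard deviation `SD` is an assumed value for illustration, as are the helper names):

```python
import numpy as np

rng = np.random.default_rng(1)
SD = 0.5  # assumed noise standard deviation, for illustration only

def block(n, x1_range, x2_range, mean):
    # one row-block of Table 2 or 3: X1, X2 uniform on the given ranges, Y normal
    return np.column_stack([
        rng.uniform(*x1_range, n),
        rng.uniform(*x2_range, n),
        rng.normal(mean, SD, n),
    ])

# Simulation Study II (Table 2): each partition duplicated with a shifted mean
study2 = np.vstack([
    block(100, (0.1, 0.4), (0.1, 0.4), 1.0),
    block(100, (0.1, 0.4), (0.6, 0.9), 3.0),
    block(100, (0.6, 0.9), (0.1, 0.9), 5.0),
    block(100, (0.1, 0.4), (0.1, 0.4), 1.5),
    block(100, (0.1, 0.4), (0.6, 0.9), 5.5),
    block(100, (0.6, 0.9), (0.1, 0.9), 3.5),
])

# Simulation Study III (Table 3): rows 301-600 follow a different partition scheme
study3 = np.vstack([
    study2[:300],
    block(100, (0.1, 0.4), (0.6, 0.9), 5.0),
    block(100, (0.6, 0.9), (0.6, 0.9), 1.0),
    block(100, (0.1, 0.9), (0.1, 0.4), 3.0),
])
```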
Figure 2: Simulation Study II shows two clusters are found and the parameters are correctly estimated.

Figure 3: Simulation Study III has data from different partitioning schemes mixed together. The means are labeled in the center of each region. The shared region in the upper left has mixed means.

These simulation results show that BET is capable of detecting not only the mixing of distributions, but also the mixing of trees. In the following sections, we test its prediction performance through real data examples.
Breast Cancer Data Example

We first demonstrate the performance of the Gibbs sampler in the BET model with homogeneous data. We examined the breast cancer data available from the machine learning repository of the University of California at Irvine (Bache and Lichman, 2013). These data were originally the work of Wolberg and Mangasarian (1990) and have subsequently been utilized in numerous application studies.
We focus on the outcome of breast cancer as the response variable. We define benign and malignant results as y = 0 and y = 1, respectively. Our application includes nine clinical covariates: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. We consider data for 683 females who have no missing values for the outcome and covariates at hand. We ran the model for 110,000 steps and discarded the first 10,000 steps.

The results show that the chain converges in the joint log-likelihood [Y, Z | T] and in [Y | Z, T] (Figure 5(a)). The misclassification rate obtained with the cluster-specific estimator is small (Figure 5(b)), and only one cluster is found in the best ensemble of trees T (Figure 5(c)).
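As a note on how the misclassification summary in Figure 5(b) can be formed, one simple post-processing step (our own illustration, not the exact code used) thresholds the posterior mean leaf probabilities at 0.5:

```python
import numpy as np

def misclassification_rate(y, p_hat):
    # y in {0, 1}; p_hat is the posterior mean probability of y = 1 for each record,
    # taken from the leaf the record falls into under its most likely cluster's tree
    return float(np.mean((p_hat > 0.5).astype(int) != y))
```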
Figure 4: Simulation Study III shows the two clusters correctly identified ((a) Cluster 1; (b) Cluster 2).

Next we test BET on a large dataset, which possibly contains heterogeneous data, and simultaneously illustrate its regression performance.
Figure 5: Result of the breast cancer test. (a) Log-likelihood given assignment; (b) misclassification rate with the cluster-specific estimator; (c) best ensemble: only one cluster is found.
Cystic Fibrosis Data Example
We used lung function data obtained from the Cystic Fibrosis Foundation Patient Registry (Cystic Fibrosis Foundation, 2012). Percent predicted of forced expiratory volume in 1 second (FEV1%) is a continuous measure of lung function in cystic fibrosis (CF) patients obtained at each clinical visit. We have previously demonstrated that the rates of FEV1% change nonlinearly (Szczesniak, McPhail, Duan, Macaluso, Amin, and Clancy, 2013), which can be described via semiparametric regression using penalized cubic splines. Although trees may not outperform spline methods in prediction of continuous outcomes, they provide reliable information for variable selection when traditional methods such as p-value inspection and AIC may fall short.

We used longitudinal data from 3,500 subjects (a total of 60,000 entries) from the CF dataset and utilized eight clinical covariates: baseline FEV1%, age, gender, infections (each abbreviated as MRSA, PA, BC, CFRD) and insurance (a measure of socioeconomic status, SES). We randomly selected longitudinal data from 2,943 subjects (roughly 50,000 entries) and used these data as the training set. We then carried out prediction on the remaining 10,000 entries.

We illustrate the prediction results of BET in Figure 6(a and b). With the assumption of one constant mean per partition, the predicted curve takes the shape of a step function, which correctly captures the declining trend of FEV1%. The prediction seems unbiased, as the differences between the predicted and true values are symmetric about the diagonal line. We also computed difference metrics against the true values; the results are listed in Table 4.

Model                          RMSE    MAD
Spline Regression              16.60   10.07
CART                           18.07   10.32
Random Forests (5 trees)       17.38   11.29
Random Forests (50 trees)      17.13   11.28
Boosting (1000 trees)          20.34   14.22
BART (50 trees)                16.72   10.32
BET (clustered by subjects)    16.97   10.57
BET (clustered by entries)     16.70   10.13

Table 4: Cross-validation results of various methods applied to the cystic fibrosis data.

Besides the BET model, we also tested other popular regression tree methods (with the corresponding R packages): CART ("rpart", Therneau, Atkinson, et al., 1997), Random Forests ("randomForest", Breiman, 2001), Boosting ("gbm", Friedman, 2001) and BART ("bartMachine", Chipman, George, McCulloch, et al., 2010). Since the tested data are essentially longitudinal, we can choose whether to group observations by subjects or by entries alone. In the subject-clustered version, we first used one outcome entry of each subject in the prediction subset, computed the most likely cluster and then computed the cluster-specific predictor; in the entry-clustered version, we simply used the ensemble predictor. We did not see an obvious difference between the two predictors. This is likely because entry-clustered BET achieves a better fit, which compensates for the lower accuracy of the ensemble predictor.
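The two prediction modes can be sketched as follows; `predict` and `loglik` are hypothetical per-tree interfaces (a fitted tree's leaf estimate and the log-likelihood of an observation under that tree), and `w` holds the posterior tree weights:

```python
import numpy as np

def ensemble_predict(x_new, trees, w):
    # entry-clustered version: weighted average of the trees' leaf estimates
    return sum(w_j * tree.predict(x_new) for w_j, tree in zip(w, trees))

def subject_predict(x_new, x_first, y_first, trees, w):
    # subject-clustered version: use one observed entry of the subject to pick
    # the most likely cluster, then predict with that cluster's tree alone
    scores = [np.log(w_j) + tree.loglik(y_first, x_first) for w_j, tree in zip(w, trees)]
    best = int(np.argmax(scores))
    return trees[best].predict(x_new)
```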
In the comparison among the listed tree-based methods, the two Bayesian methods, BET and BART, provide the results closest to spline regression in prediction accuracy. Similar to the relation between BART and Boosting, BET can be viewed as the Bayesian counterpart of Random Forests. Besides the use of a prior, one important distinction is that the Random Forests approach uses the average of bootstrap sample estimates, whereas BET uses the weighted average of cluster sample estimates. In Random Forests, the number of bootstrap samples needs to be specified by the user, while in BET it is determined by the data through the Dirichlet process. During this test, Random Forests used 50 trees, while BET converged to only 2-3 trees (Figure 6(c)) and achieved similar prediction accuracy. The tree structures are shown in the appendix.

Lastly, we focus on the variable selection issue. Although some information criteria have been established in this area, such as AIC, BIC and the Bayes factor, their mechanisms are complicated by the necessity of fitting different models to the data multiple times. The inclusion probability (reviewed by O'Hara and Sillanpää, 2009) is an attractive alternative, which penalizes the addition of more variables through the inclusion prior. In tree-based methods, however, since multiplicity is not a concern, it is possible to compare several variables of similar importance at the same time, without inclusion or exclusion.

Since multiple trees are present, we use the weighted posterior ξ̄ = \sum_j w_j ξ_j as the measure of variable importance. We plot the variable ranking probability ξ̄ for each covariate (Figure 6(d)). The interpretation of this value follows naturally as how likely a covariate is to be chosen in forming the trees. Therefore, the ranking of ξ̄ reveals the order of importance of the covariates. This concept is quite similar to the variable importance measure used in Random Forests. The difference is that their index ranks the decrease in accuracy after permuting a covariate, whereas ours is a purely probabilistic measure. Regardless of this difference, the rankings of variables from the two models are remarkably similar (see Supplementary Information for the Random Forests results): baseline FEV1% and age are the two most important variables, while gender and MRSA seem to play the least important roles.
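Computing ξ̄ from posterior draws is a one-liner; in this sketch `w` are the tree weights and `xi` the per-tree variable-use proportions (rows indexing trees, columns indexing the eight covariates):

```python
import numpy as np

def variable_ranking(w, xi):
    # xi_bar = sum_j w_j * xi_j, normalized in case the retained weights do not sum to one
    w = np.asarray(w, dtype=float)
    xi = np.asarray(xi, dtype=float)
    return w @ xi / w.sum()
```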
Figure 6: Results of FEV1% fitting and prediction. (a) True values (dots) and the 95% credible interval of the BET prediction (solid) in a single subject (FEV1% versus age); (b) true values versus the BET predicted means: the estimates are piecewise constant and seem unbiased; (c) number of clusters found; (d) the variable selection probabilities.

Discussion
Empirically, compared with a single model, a group of models (an ensemble) usually has better performance in inference and prediction. This view is also supported in the field of decision trees. As reviewed by Hastie, Tibshirani, and Friedman (2009), the use of decision trees has benefited from multiple-tree methods such as Random Forests, Boosting and Bayesian tree averaging. One interesting question that remains, however, is how many models are enough, or how many are more than enough?

To address this issue, machine learning algorithms usually resort to repeated cross-validation or out-of-bootstrap error calculation, in which the ensemble size is gradually increased until performance starts to degrade. In Bayesian model averaging, the number of steps to keep is often "the more, the better", as long as the steps have low autocorrelation. Our proposed method, BET, demonstrates a more efficient way to create an ensemble with the same predictive capability but a much smaller size. With the help of the Dirichlet process, the self-reinforcing behavior of large clusters reduces the number of needed clusters (sub-models). Rather than using a simple average over many trees, we showed that using a weighted average over a few important trees can provide the same accuracy.

It is worth comparing BET with other mixture models. In the latter, the component distributions are typically continuous and unimodal; in BET, each component tree is discrete and, more importantly, multi-modal itself. This construction could have created caveats in model fitting, as one can imagine obtaining only a large ensemble of very small trees. We circumvented this issue by applying Gibbs sampling within each tree, which rapidly increases the fit of the tree to the data during tree growing and decreases the chance that the data are scattered across more clusters.

It is also of interest to develop an empirical algorithm for BET. One possible extension is to use a local optimization technique (also known as a "greedy algorithm") under some randomness to explore the tree structure. This implementation may not be difficult, since users can access existing CART packages to grow trees for subsets of data and then update the clustering as described previously.
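A minimal sketch of such an empirical variant, assuming scikit-learn's CART implementation and, unlike the Dirichlet-process model, a fixed number of clusters:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def greedy_bet(X, y, n_clusters=3, n_iter=20, seed=0):
    # Alternate between fitting an off-the-shelf CART to each cluster and
    # reassigning every entry to the tree that fits it best.
    rng = np.random.default_rng(seed)
    Z = rng.integers(n_clusters, size=len(y))
    for _ in range(n_iter):
        trees = []
        for j in range(n_clusters):
            idx = np.flatnonzero(Z == j)
            if idx.size == 0:
                idx = rng.integers(len(y), size=10)  # reseed an empty cluster at random
            trees.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))
        resid = np.column_stack([np.abs(y - t.predict(X)) for t in trees])
        Z = resid.argmin(axis=1)
    return trees, Z
```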
Acknowledgments
Partial funding was provided by the Cystic Fibrosis Foundation Research and Development Program (grant number R457-CR11). The authors are grateful to the Cystic Fibrosis Foundation Patient Registry Committee for their thoughtful comments and data dispensation. The breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. The authors thank M. Zwitter and M. Soklic for the availability of these data.

References
K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.

Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935-948, 1998.

Hugh A. Chipman, Edward I. George, Robert E. McCulloch, et al. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266-298, 2010.

Cystic Fibrosis Foundation. Cystic Fibrosis Foundation patient registry 2012 annual data report. 2012.

David G. T. Denison, Bani K. Mallick, and Adrian F. M. Smith. A Bayesian CART algorithm. Biometrika, 85(2):363-377, 1998.

Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189-1232, 2001.

Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.

Peter J. Green and Sylvia Richardson. Modelling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, 28(2):355-375, 2001.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning, volume 2. Springer, 2009.

Hemant Ishwaran and Lancelot F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161-173, 2001.

Maria Kalli, Jim E. Griffin, and Stephen G. Walker. Slice sampling mixture models. Statistics and Computing, 21(1):93-105, 2011.

Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249-265, 2000.

Robert B. O'Hara, Mikko J. Sillanpää, et al. A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85-117, 2009.

M. T. Pratola. Efficient Metropolis-Hastings proposal mechanisms for Bayesian regression tree models. arXiv preprint arXiv:1312.1895, 2013.

Rhonda D. Szczesniak, Gary L. McPhail, Leo L. Duan, Maurizio Macaluso, Raouf S. Amin, and John P. Clancy. A semiparametric approach to estimate rapid lung function decline in cystic fibrosis. Annals of Epidemiology, 23(12):771-777, 2013.

Terry M. Therneau, Elizabeth J. Atkinson, et al. An introduction to recursive partitioning using the rpart routines. 1997.

William H. Wolberg and Olvi L. Mangasarian. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87(23):9193-9196, 1990.

Yuhong Wu, Håkon Tjelmeland, and Mike West. Bayesian CART: prior specification and posterior simulation. Journal of Computational and Graphical Statistics, 16(1):44-66, 2007.

Supplementary Information
Variable importance calculated by Random Forests with 50 trees, using the cystic fibrosis data.