[PDF] High-dimensional nonparametric monotone function estimation using BART

Abstract

For the estimation of a regression relationship between Y and a large set of potential predictors x 1 , . . . , x p , the flexible nature of a nonparametric approach such as BART (Bayesian Additive Regression Trees) allows for a much richer set of possibilities than a more restrictive parametric approach. However, it may often occur that subject matter considerations suggest the relationship will be monotone in one or more of the predictors. For such situations, we propose monotone BART, a constrained version of BART that uses the monotonicity information to improve function estimation without the need of using a parametric form. Imposing monotonicity, when appropriate, results in (i) function estimates that are smoother and more interpretable, (ii) better out-of-sample predictive performance, (iii) less uncertainty, and (iv) less sensitivity to prior choice. While some of the key aspects of the unconstrained BART model carry over directly to monotone BART, the imposition of the monotonicity constraints necessitates a fundamental rethinking of how the model is implemented. In particular, in the original BART algorithm, the Markov Chain Monte Carlo algorithm relied on a conditional conjugacy that is no longer available in a high-dimensional, constrained space.

Full PDF

HHigh-dimensional nonparametric monotone functionestimation using BART

Hugh A. Chipman, Edward I. George, Robert E. McCulloch, Thomas S. Shively ∗ December 7, 2016 ∗ Hugh Chipman is Professor, Department of Mathematics and Statistics, Acadia University, Wolfville,Nova Scotia, B4P 2R6, [email protected]. Edward I. George is Professor of Statistics, De-partment of Statistics, The Wharton School, University of Pennsylvania, 3730 Walnut St, 400 JMHH,Philadelphia, PA 19104-6304, [email protected]. Robert E. McCulloch is Professor of Statis-tics, The School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ, 85281,[email protected]. Thomas S. Shively is Professor of Statistics, Department of Information, Risk,and Operations Management, University of Texas at Austin, 1 University Station, B6500, Austin, TX78712-1175, [email protected]. This work was supported by the grant DMS-1406563 fromthe National Science Foundation. a r X i v : . [ s t a t . O T ] D ec bstract For the estimation of a regression relationship between Y and a large set of po-tential predictors x , . . . , x p , the ﬂexible nature of a nonparametric approach such asBART (Bayesian Additive Regression Trees) allows for a much richer set of possibilitiesthan a more restrictive parametric approach. However, it may often occur that subjectmatter considerations suggest the relationship will be monotone in one or more of thepredictors. For such situations, we propose monotone BART, a constrained version ofBART that uses the monotonicity information to improve function estimation with-out the need of using a parametric form. Imposing monotonicity, when appropriate,results in (i) function estimates that are smoother and more interpretable, (ii) betterout-of-sample predictive performance, (iii) less uncertainty, and (iv) less sensitivity toprior choice. While some of the key aspects of the unconstrained BART model carryover directly to monotone BART, the imposition of the monotonicity constraints ne-cessitates a fundamental rethinking of how the model is implemented. In particular,in the original BART algorithm, the Markov Chain Monte Carlo algorithm relied ona conditional conjugacy that is no longer available in a high-dimensional, constrainedspace.KEY WORDS: Bayesian nonparametrics, high-dimensional regression, ensemble model,MCMC algorithm, monotonicity constraints ontents T j Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.2 The σ Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.3 The M j | T j Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.3.1 The Choice of m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Introduction

Suppose one would like to learn how Y depends on a vector of potential predictors x =( x , . . . , x p ) when very little prior information is available about the form of the relationship.With only very weak assumptions, the Bayesian nonparametric approach BART (BayesianAdditive Regression Trees) can quickly discover the nature of this relationship; see Chip-man, George, and McCulloch (2010), hereafter CGM10. More precisely, based only on theassumption that Y = f ( x ) + (cid:15), (cid:15) ∼ N (0 , σ ) , (1.1)BART can quickly obtain full posterior inference for the unknown regression function, f ( x ) = E ( Y | x ) (1.2)and the unknown variance σ . BART also provides predictive inference as well as model-freevariable selection and interaction detection, see Chipman, George, and McCulloch (2013),Bleich et al. (2014), and Kapelner and Bleich (2016).The main goal of this paper is the development of monotone BART (hereafter mBART),a constrained version of BART that restricts attention to regression functions f ( x ) thatare monotone in any predesignated subset of the components of x . Such monotonicityconstraints often arise naturally from subject matter considerations. For example, in oneof our illustrative data sets, the Y variable of interest is the price of a house. One of thecomponents of the explanatory variable x is the size of the house. It seems reasonable torestrict our search for the function f to functions such that bigger houses sell for more, allother things equal.There is a rich literature on monotone function estimation in both the frequentist andBayesian paradigms. Frequentist methods to estimate univariate monotone functions includeBarlow et al. (1972), Mammen (1991), Ramsay (1998) and Kong and Eubank (2006) whileBayesian methods include Lavine and Mockus (1995), Holmes and Heard (2003), Neelon andDunson (2004), Shively, Sage, Walker (2009), and Shively, Walker, Damien (2011). How-ever, these methods are diﬃcult to implement for high-dimensional multivariate functionestimation because they are built on basis elements that are fundamentally low-dimensionalobjects. Saarela and Arjas (2011) and Lin and Dunson (2014) develop methods to estimatemonotone multivariate functions but they become computationally intensive as the numberof predictor variables increase. As we will show, mBART imposes no restrictions on f beyondthe monotonicity constraints while at the same time, it can easily handle high-dimensional1ata. The reason is that mBART is built on a sum-of-trees approximation to f and istherefore composed of multivariate basis elements.The extension of BART to our monotonically constrained setting requires two basic innova-tions. First, it is necessary to develop general constraints for regression trees to be monotonein any predesignated set of coordinates. Under these constraints, the monotonicity of thecomplete sum-of-trees approximation follows directly. The second innovation requires a newapproach for MCMC posterior computation. Whereas the original BART formulation al-lowed straightforward marginalization over regression tree parameters using a conditionalconjugacy argument, the constrained trees formulation requires a more nuanced approachbecause the conjugacy argument no longer applies.The outline of the paper is as follows. In Section 2, we describe in detail the constrainedsum-of-trees model used for monotone function estimation. Section 3 discusses the regu-larization prior for the trees while section 4 describes the new MCMC algorithm requiredto implement mBART. Section 5 provides simulation results for one-dimensional and ﬁve-dimensional function estimation as well as three examples using house price, car price andstock returns data. Section 6 contains some concluding remarks. The essence of BART is a sum-of-trees model approximation of the relationship between y and x in (1.1); Y = m (cid:88) j =1 g ( x ; T j , M j ) + (cid:15), (cid:15) ∼ N (0 , σ ) , (2.1)where each T j is a binary regression tree with a set M j of associated terminal node constants µ ij , and g ( x ; T j , M j ) is the function which assigns µ ij ∈ M j to x according to the sequenceof decision rules in T j . These decision rules are binary partitions of the predictor space ofthe form { x ≤ a } vs { x > a } where the splitting value a is in the range of x . (A clarifyingexample of how g works appears in Figure 1 below and is described later in this section).When m = 1, (2.1) reduces to the single tree model used by Chipman et al. (1998) forBayesian CART.Under (2.1), E ( Y | x ) is the sum, over trees T , . . . , T m , of all the terminal node µ ij ’s assignedto x by the g ( x ; T j , M j )’s. As the µ ij can take any values it is easy to see that the sum-of-trees model (2.1) is a very ﬂexible representation capable of representing a wide class of2unctions from R n to R , especially when the number of trees m is large. Composed of manysimple functions from R p to R , namely the g ( x ; T j , M j ), the sum-of-trees representation ismuch more manageable than a representation with more complicated basis elements such asmultidimensional wavelets or multidimensional splines. And because each tree function g isinvariant to monotone transformations of x (aside from the splitting value), standardizationchoices are not needed for the predictors.Key to the construction of monotone BART are the conditions under which the underly-ing sum-of-trees function (cid:80) mj =1 g ( x ; T j , M j ) will satisfy the following precise deﬁnition of amultivariate monotone function. Deﬁnition:

For a subset S of the coordinates of x ∈ R n , a function f : R n → R is said tobe monotone in S if for each x i ∈ S and all values of x , f satisﬁes f ( x , . . . , x i + δ, . . . , x p ) ≥ f ( x , . . . , x i , . . . , x p ) , (2.2)for all δ > f is nondecreasing), or for all δ < f is nonincreasing).Clearly, a sum-of-trees function will be monotone in S whenever each of the componenttrees is monotone in S . Thus it suﬃces to focus on the conditions for a single tree function g ( x ; T, M ) to be monotone in S . As we’ll see, this will only entail providing constraints onthe set of terminal node constants M ; constraints determined by the tree T .We illustrate these concepts with the bivariate monotone tree function in Figure 1. Thistree has six terminal nodes, labeled 4,10,11,12,13, and 7. The labels follow the standard treenode labeling scheme where the top node is labeled 1 and any non-terminal node with label j has a left child with label 2 j and a right child with label 2 j + 1. Beginning at the topnode, each x = ( x , x ) is assigned to subsequent nodes according to the sequence of splittingrules it meets. This continues until x reaches a terminal node where g ( x ; T, M ) assigns thedesignated value of µ from the set M . For example, with this choice of ( T, M ), g ( x ; T, M )= 3 when x = ( . , . x space induced by T . The terminal node regions, R , R , R , R , R , R ,correspond to the six similarly labeled terminal nodes of T . Figure 2b shows g ( x ; T, M ) as asimple step function which assigns a level µ ∈ M to each terminal node region. From Figure2b, it is clear that for any x = ( x , x ), moving x to ( x + δ, x ) or to ( x , x + δ ) can onlyincrease g for δ >

0. Thus, in the sense of our deﬁnition, this g ( x ; T, M ) is monotone in both x and x . 3 Figure 1: A bivariate, monotone regression tree T with 6 terminal nodes. Intermediatenodes are labeled with their splitting rules. Terminal nodes (bottom leaf nodes) are labeledwith their node number. Below each terminal node is the value of µ ∈ M assigned to x by g ( x ; T, M ). 4 .2 0.4 0.6 0.8 . . . . x1 x x x f ( x ) a b Figure 2: Two alternative views of the bivariate single tree model in Figure 1. (a) The sixregions R , R , R , R , R , R , corresponding to the terminal nodes 4,10,11,12,13,7. (b) Thelevels of the regions assigned by the step function g ( x ; T, M ).5 .0 0.2 0.4 0.6 0.8 1.0 − . . . . . . . x f ( x ) Figure 3: A monotone, univariate tree function g ( x ; T, M ).To see the essence of what is needed to guarantee the monotonicity of a tree function,consider the very simple case of a monotone g ( x ; T, M ) when T is a function of x = x only,as depicted in Figure 3. Each level region of g corresponds to a terminal node region in x space, which is simply an interval whenever g is a univariate function. For each suchregion, consider the adjoining region with larger values of x , which we refer to as an above-neighbor region, and the adjoining region with smaller values of x , which we refer to as abelow-neighbor region. End regions will only have single neighboring regions. To guarantee(nondecreasing) monotonicity, it suﬃces to constrain the µ level assigned to each terminalnode region to be less than or equal to the µ level of its above-neighbor region, and to begreater than or equal to the µ level of its below-neighbor region.To apply these notions to a bivariate tree function g ( x ; T, M ) as depicted in Figures 1and 2, we will simply say that rectangular regions are neighboring if they have boundarieswhich are adjoining in any of the coordinates. Furthermore, a region R k will be called anabove-neighbor of a region R k ∗ if the lower adjoining boundary of R k is the upper adjoiningboundary of R k ∗ . A below-neighbor is deﬁned similarly. For example, in Figure 2a, R is anabove-neighbor of R , R and R ; and R and R are below-neighbors of R .Note that R and R are not neighbors. We will say the R and R are separated becausethe x upper boundary of R is less than the x lower boundary of R . For a small enoughstep size δ , it is impossible to get from R to R by changing any x i by δ so that the meanlevel of one does not constrain the mean level of the other.To make these deﬁnitions precise for a d -dimensional tree T (a function of x = ( x , . . . , x d )),6e note that each terminal node region of T will be a rectangular region of the form R k = { x : x i ∈ [ L ik , U ik ) , i = 1 , . . . , d } , (2.3)where the interval [ L ik , U ik ) for each x i is determined by the sequence of splitting rulesleading to R k .We say that R k is separated from R k ∗ if U ik < L ik ∗ or L ik > U ik ∗ for some i . In Figure 2a, R is separated from R and R .If R k and R k ∗ are not separated, R k will be said to be an above-neighbor of R k ∗ if L ik = U ik ∗ for some i , and it will be said to be a below-neighbor of R k ∗ if U ik = L ik ∗ for some i . Notethat any terminal node region may have several above-neighbor and below-neighbor regions. R has below neighbors R and R and above neighbor R .The constraints on the µ levels under which g ( x ; T, M ) will be monotone are now straight-forward to state.

Constraint Conditions for Tree Monotonicity : A tree function g ( x ; T, M ) will bemonotone if the µ level of each of its terminal node regions is(a) less than or equal to the minimum level of all of its above-neighbor regions, and(b) greater than or equal to the maximum level of all of its below-neighbor regions.The function g will be monotone in S if the neighboring regions satisfy (a) and (b) for allthe coordinates in S (rather than all coordinates).As we’ll see in subsequent sections, an attractive feature of these conditions is that theydovetail perfectly with the nature of our iterative MCMC simulation calculations. At eachstep there, we simulate one terminal node level at time conditionally on all the other nodelevels, so imposing the constraints is straightforward. This avoids the need to simultaneouslyconstrain all the levels at once. The mBART model speciﬁcation is completed by putting a constrained regularization prioron the parameters, ( T , M ) , . . . , ( T m , M m ) and σ , of the sum-of-trees model (2.1). Essentiallya modiﬁcation of the original BART prior formulation to accommodate designated monotone7onstraints, we follow CGM10 and proceed by restricting attention to priors of the form p (( T , M ) , . . . , ( T m , M m ) , σ ) = (cid:34)(cid:89) j p ( T j , M j ) (cid:35) p ( σ )= (cid:34)(cid:89) j p ( M j | T j ) p ( T j ) (cid:35) p ( σ ) . (3.1)Under such priors, the tree components ( T , M ) , . . . , ( T m , M m ) are independent of each otherand of σ .As discussed in the previous section, a sum-of-trees function (cid:80) mj =1 g ( x ; T j , M j ) is guaranteedto be monotone whenever each of the trees g ( x ; T j , M j ) is monotone in the sense of (2.2).Thus, it suﬃces to restrict the support of p ( M j | T j ) to µ ij values which satisfy the Mono-tonicity Constraints (a) and (b) from Section 2. For this purpose, let C be the set of all( T, M ) which satisfy these monotonicity constraints, namely C = { ( T, M ) : g ( x ; T, M ) is monotone in x i ∈ S } . (3.2)These constraints are then incorporated into the prior by constraining the CGM10 BARTindependence form p ( M j | T j ) = (cid:81) i p ( µ ij | T j ) to have support only over C , p ( M j | T j ) ∝  b j (cid:89) i =1 p ( µ ij | T j )  χ C ( T j , M j ) . (3.3)Here b j is the number of bottom (terminal) nodes of T j , and χ C ( · ) = 1 on C and = 0otherwise.In the next three sections we discuss the choice of priors p ( T j ), p ( σ ), and p ( µ ij | T j ). Thesepriors will have the same form as in CGM10, but in some cases the monotonicity constraintwill motivate modiﬁcations to our choices for the hyper parameters. T j Prior

The tree prior p ( T j ) is speciﬁed by three aspects: (i) the probability of a node having childrenat depth d (= 0 , , , . . . ) is α (1 + d ) − β , α ∈ (0 , , β ∈ [0 , ∞ ) , (3.4)(ii) the uniform distribution over available predictors for splitting rule assignment at eachinterior node, and (iii) the uniform distribution on the discrete set of available splitting8alues for the assigned predictor at each interior node. This last choice has the appeal ofinvariance under monotone transformations of the predictors.Because we want the regularization prior to keep the individual tree components small,especially when m is set to be large, we typically recommend the defaults α = .

95 and β = 2in (3.4) in the unconstrained case. With this choice, simulation of tree skeletons directly from(i) shows us that trees with 1, 2, 3, 4, and ≥ α and β in the constrained case is deferred to the end of Section 4.3since our choices are motivated by details of the Markov Chain Monte Carlo algorithm forposterior computation. σ Prior

For p ( σ ), we use the (conditionally) conjugate inverse chi-square distribution σ ∼ ν λ/χ ν .To guide the speciﬁcation of the hyperparameters ν and λ , we recommend a data-informedapproach in order to assign substantial probability to the entire region of plausible σ val-ues while avoiding overconcentration and overdispersion. This entails calibrating the priordegrees of freedom ν and scale λ using a “rough data-based overestimate” ˆ σ of σ .The two natural choices for ˆ σ are (1) the “naive” speciﬁcation, in which we take ˆ σ tobe the sample standard deviation of Y (or some fraction of it), or (2) the “linear model”speciﬁcation, in which we take ˆ σ as the residual standard deviation from a least squareslinear regression of Y on the original x ’s. We then pick a value of ν between 3 and 10to get an appropriate shape, and a value of λ so that the q th quantile of the prior on σ is located at ˆ σ , that is P ( σ < ˆ σ ) = q. We consider values of q such as 0.75, 0.90 or 0.99to center the distribution below ˆ σ . For automatic use, we recommend the default setting( ν, q ) = (3 , .

90) which tends to avoid extremes. Alternatively, the values of ( ν, q ) may bechosen by cross-validation from a range of reasonable choices. This choice is exactly as inCGM10. M j | T j Prior

For the choice of p ( µ ij | T j ) in (3.3), we adopt a normal form as in BART. However, herewe use diﬀerent choices of the prior variance depending on whether C in (3.2) imposes9onstraints on µ ij . For µ ij unconstrained by C , we set p ( µ ij | T j ) = φ µ µ ,σ µ , (3.5)the same N ( µ µ , σ µ ) density used by CGM10 for BART, whereas for µ ij constrained by C ,we set p ( µ ij | T j ) = φ µ µ ,cσ µ , (3.6)the N ( µ µ , cσ µ ) density with c = ππ − ≈ . µ and µ constrained to satisfy µ ≤ µ . Under (3.6), the joint distribution of µ and µ is p ( µ , µ ) ∝ φ µ µ ,cσ µ ( µ ) φ µ µ ,cσ µ ( µ ) χ { µ ≤ µ } ( µ , µ ) . (3.7)In this case, the marginal distributions of µ and µ are skew normal distributions (Azzalini1985) with variances equal to σ µ , and means µ µ − cσ µ / √ π and µ µ + cσ µ / √ π , respectively.Thus, the marginal prior variances of µ and µ are identical to the marginal variances of theunconstrained means. This equality helps to balance the prior eﬀects across predictors andfacilitates the calibrated speciﬁcation of σ µ described below. Of course, it will be the case thatsome means µ ij may be further constrained when they occur deeper down the tree, therebyfurther reducing their prior variance. Although additional small prior adjustments can beconsidered for such cases, we view them as relatively unimportant because the vast majorityof BART trees will be small with at most one or two constraints. Thus, we recommend theprior (3.6) for all µ ij which are constrained.To guide the speciﬁcation of the hyperparameters µ µ and σ µ , we use the same informalempirical Bayes strategy in CGM10. Based on the idea that that E ( Y | x ) is very likelybetween y min and y max , the observed minimum and maximum of Y , we want to choose µ µ and σ µ so that the induced prior on E ( Y | x ) assigns substantial probability to the interval( y min , y max ). By using the observed y min and y max , we aim to ensure that the implicit priorfor E ( Y | x ) is in the right “ballpark”.In the unconstrained case where each value of E ( Y | x ) is the sum of m iid µ ij ’s under thesum-of-trees model, the induced prior on E ( Y | x ) under (3.5) is exactly N ( m µ µ , m σ µ ). Letus argue now that when monotone constraints are introduced, N ( m µ µ , m σ µ ) still holds upas a useful approximation to the induced prior on E ( Y | x ). To begin with, for each valueof x , let g ( x ; T j , M j ) = µ xj , the mean assigned to x by the j th tree T j . Then, under thesum-of-trees model, E ( Y | x ) = (cid:80) mj =1 µ xj is the sum of m independent means since the µ xj ’s10re independent across trees. Using central limit theorem considerations, this sum of smallrandom eﬀects will be approximately normal, at least for the central part of the distribution.The means of all the random eﬀects will be centered around µ µ , (the constrained µ ij ’s willhave pairwise oﬀsetting biases), and so the mean of E ( Y | x ) will be approximately µ µ .Finally, since the marginal variance for all µ xj ’s is at least approximately σ µ , the varianceof E ( Y | x ) will be approximately mσ µ .Proceeding as in CGM10, we thus choose µ µ and σ µ so that m µ µ − k √ m σ µ = y min and m µ µ + k √ m σ µ = y max for some preselected value of k . This is conveniently implementedby ﬁrst shifting and rescaling Y so that the observed transformed y values range from y min = − . y max = 0 .

5, and then setting µ µ = 0 and σ µ = 0 . /k √ m . Using k = 2,for example, would yield a 95% prior probability that E ( Y | x ) is in the interval ( y min , y max ),thereby assigning substantial probability to the entire region of plausible values of E ( Y | x )while avoiding overconcentration and overdispersion. As k and/or the number of trees m is increased, this prior will become tighter, thus limiting the eﬀect of the individual treecomponents of (2.1) by keeping the µ ij values small. We have found that values of k between1 and 3 yield good results, and we recommend k = 2 as an automatic default choice, thesame default recommendation for BART. Alternatively, the value of k may be chosen bycross-validation from a range of reasonable choices. m Again as in BART, we treat m as a ﬁxed tuning constant to be chosen by the user. Forprediction, we have found that mBART performs well with values of at least m = 50. Forvariable selection, values as small as m = 10 are often eﬀective. Let y be the n × Y from (2.1). All post-data infor-mation for Bayesian inference about any aspects of the unknowns, ( T , M ) , . . . , ( T m , M m ), σ and future values of Y , is captured by the full posterior distribution p (( T , M ) , . . . , ( T m , M m ) , σ | y ) . (4.1)11ince all inference is conditional on the given predictor values, we suppress them in the nota-tion. This posterior is proportional to the product of the likelihood p ( y | ( T , M ) , . . . , ( T m , M m ) , σ ),which is the product of normal likelihoods based on (2.1), and the constrained regularizationprior p (( T , M ) , . . . , ( T m , M m ) , σ ) described in Section 3.To extract information from (4.1), which is generally intractable, we propose an MCMCbackﬁtting algorithm that simulates a sequence of draws, k = 1 , . . . , K ,( T , M ) ( k ) , . . . , ( T m , M m ) ( k ) , σ ( k ) (4.2)that is converging in distribution to (4.1) as K → ∞ .Beginning with a set of initial values of (( T , M ) (0) , . . . , ( T m , M m ) (0) , σ (0) ), this algorithmproceeds by simulating a sequence of transitions ( T j , M j ) ( k ) → ( T j , M j ) ( k +1) , for j = 1 , . . . , m , σ ( k ) → σ ( k +1) . The ( T j , M j ) ( k ) → ( T j , M j ) ( k +1) transition is obtained by using a Metropolis-Hastings (MH) algorithm to simulate a single transition of a Markov chain with stabledistribution p (( T j , M j ) | r ( k ) j , σ ( k ) ) , (4.3)for j = 1 , . . . , m , where r ( k ) j ≡ y − (cid:88) j (cid:48) j g ( x ; T j (cid:48) , M j (cid:48) ) ( k ) (4.4)is the n − vector of partial residuals based on a ﬁt that excludes the most current simulatedvalues of T j (cid:48) , M j (cid:48) for j (cid:48) (cid:54) = j . A full iteration of the algorithm is then completed by simulatingthe draw of σ ( k +1) from the full conditional σ | ( T , M ) ( k +1) , . . . , ( T m , M m ) ( k +1) , y. (4.5)Because conditioning the distribution of ( T j , M j ) on r ( k ) j and σ ( k ) in (4.3) is equivalent toconditioning on the excluded values of ( T j (cid:48) , M j (cid:48) ), σ ( k ) and y , this algorithm is an instance ofMH within a Gibbs sampler. We now describe a new localized MH algorithm for the simulation of ( T j , M j ) ( k ) → ( T j , M j ) ( k +1) as single transitions of a Markov chain converging to the (possibly constrained) posterior(4.3). For simplicity of notation, let us denote a generic instance of these moves by ( T , M ) → ( T , M ). Dropping σ ( k ) from (4.3) since it is ﬁxed throughout this move, and dropping all12he remaining subscripts and superscripts, the target posterior distribution can be expressedas p ( T, M | r ) = p ( r | T, M ) p ( M | T ) p ( T ) /p ( r ) , (4.6)where its components are as follows.First, p ( r | T, M ) is the normal likelihood which would correspond to an observation of r = g ( x ; T, M ) + (cid:15) , where (cid:15) ∼ N n (0 , σ I ). Assuming M = ( µ , . . . , µ b ), and letting r i be thevector of components of r assigned to µ i by T , this likelihood is of the form p ( r | T, M ) = b (cid:89) i =1 p ( r i | µ i ) (4.7)where p ( r i | µ i ) ∝ (cid:89) j exp( − ( r ij − µ i ) / σ ) . (4.8)The prior of M | T given by (3.3) is of the form p ( M | T ) ∝ (cid:34) b (cid:89) i =1 p ( µ i | T ) (cid:35) χ C ( T, M ) , (4.9)where p ( µ i | T ) = φ µ µ ,σ µ ( µ i ) from (3.5) if µ i is unconstrained by χ C , and p ( µ i | T ) = φ µ µ ,cσ µ ( µ i )from (3.6) if µ i is constrained by χ C . The tree prior p ( T ) described in Section 3.1 is the sameform used for unconstrained BART. Finally, the intractable marginal p ( r ), which would inprinciple be obtained by summing and integrating over T and M , will fortunately play norole in our algorithm.In unconstrained CART and BART, CGM98 and CGM10 used the following two stepMetropolis-Hastings (MH) procedure for the simulation of ( T , M ) → ( T , M ). First,a proposal T ∗ was generated with probability q ( T → T ∗ ). Letting q ( T ∗ → T ) be theprobability of the reversed step, the move T = T ∗ was then accepted with probability α = min (cid:26) q ( T ∗ → T ) q ( T → T ∗ ) p ( T ∗ | r ) p ( T | r ) , (cid:27) = min (cid:26) q ( T ∗ → T ) q ( T → T ∗ ) p ( r | T ∗ ) p ( r | T ) p ( T ∗ ) p ( T ) , (cid:27) . (4.10)If accepted, any part of M with a new ancestry under M is simulated from independentnormals since p ( M | T , r ) just consists of b independent normals given the independenceand conditional conjugacy of our prior (which is (4.9) without the monotonicity constraint13 C ( T, M )) and the conditional data independence (4.7). Otherwise ( T , M ) is set equal to( T , M ).In the contrained case, the basic algorithm is the same except that with the monotonicityconstraint in (4.9), the µ i in M are dependent. Hence, when we make local moves involvinga few of the µ i we must be careful to condition on the remaining elements. In addition,computations must be done numerically since we lose the conditional conjugacy. The movesin mBART only operate on one or two of the µ values at a time so that the appropriateconditional integrals can easily be done numerically.We consider localized proposals ( T , M ) → ( T ∗ , M ∗ ) under which M and M ∗ diﬀer onlyby those µ ’s which have diﬀerent ancestries under T and T ∗ . Letting µ same be the part of M with the same ancestry under T and T ∗ , we restrict attention to proposals for which M = ( µ same , µ old ) and M ∗ = ( µ same , µ new ), where µ old is the part of M that will be replacedby µ new in M ∗ . It will also be convenient in what follows to let r old be the components of thedata r assigned to µ old by T , r new to be the components assigned to µ new by T ∗ , and r same to be the components assigned to the identical components of µ same by both T and T ∗ .For example, suppose we begin with a proposal T → T ∗ that randomly chooses between abirth step and death step, and that T ∗ was obtained by a birth step, which entails adding twochild nodes at a randomly chosen terminal node of T . This move is illustrated in Figure 4where M = ( µ , µ , µ ) and M ∗ = ( µ , µ , µ L , µ R ), so that µ same = ( µ , µ ) to which r same = ( r , r ) is assigned, µ old = µ to which r old = r is assigned, and µ new = ( µ L , µ R ) towhich r new = ( r L , r R ) is assigned. Note that the set of observations in ( r L , r R ) is just thedivision of the set of observations in r deﬁned by the decision rule associated with node 7in the tree T ∗ .The key is to then proceed conditional on µ same and the tree ancestry associated with it. InFigure 4, we condition on µ same = ( µ , µ ) and the ancestral tree structure given by nodes(1 , , ,

6) including the decision rules associated with the interior nodes 1 and 3. To keepthe notation clean, we will use µ same as a conditioning variable in our expressions below andthe reader must make a mental note to include the associated tree ancestry as conditioninginformation.Conditional on µ same , our Metropolis procedure is as follows. First, a proposal T ∗ is generatedwith probability q ( T → T ∗ ), using the same CGM98 proposal used in unconstrained CARTand BART. Letting q ( T ∗ → T ) be the probability of the reversed step, the move T = T ∗

12 36 714 15

Figure 4: A typical birth step starting at ( T , M ) and proposing ( T ∗ , M ∗ ). T includesthe nodes 1,2,3,6,7. T ∗ includes the nodes 1,2,3,6,7,14,15. Here µ same = ( µ , µ ). Our MHstep proceeds conditionally on µ same and the associated ancestral parts of the tree structures T and T ∗ , nodes 1,2,3,6. Our proposal generates the candidate rule associated with node7 in T ∗ . Conditional on all these elements, we integrate out µ or ( µ L , µ R ) subject to theconstraints implied by the conditioning elements. Note that the proposal for the node 7 ruledoes not depend on µ same , it only depends on the tree structures.15s then accepted with probability α = min (cid:26) q ( T ∗ → T ) q ( T → T ∗ ) p ( T ∗ | µ same , r ) p ( T | µ same , r ) , (cid:27) = min (cid:26) q ( T ∗ → T ) q ( T → T ∗ ) p ( T ∗ | µ same , r new ) p ( T | µ same , r old ) , (cid:27) = min (cid:26) q ( T ∗ → T ) q ( T → T ∗ ) p ( r new | T ∗ , µ same ) p ( r old | T , µ same ) p ( T ∗ ) p ( T ) , (cid:27) . (4.11)The diﬀerence between (4.10) and (4.11) is that we condition on µ same throughout and ex-plicitly note that the r same part of r does not matter. In going from the ﬁrst line above to thesecond we have used the fact that, conditional on µ same , r same gives the same multiplicativecontribution to the top and bottom of the acceptance ratio so that it cancels out leavingonly terms depending on r new and r old . To go from the second line above to the third we willcompute the required r new and r old marginals numerically as detailed in Section 4.3 below.Note also that in the BART prior, T and M are dependent only through the dimension of M so p ( T ∗ ) / p ( T ) is the same as in the unconstrained case.If T = T ∗ is accepted, µ new is then simulated from p ( µ new | T , µ same , r ) = p ( µ new | T , µ same , r new )and M is set equal to ( µ same , µ new ). Otherwise ( T , M ) is set equal to ( T , M ). The implementation of our localized MH algorithm requires the evaluation of p ( r new | T ∗ , µ same )and p ( r old | T , µ same ) for the α calculation in (4.11), and the simulation from p ( µ new | T , µ same , r new ).Although these can all be done quickly and easily in the unconstrained cases, a diﬀerent ap-proach is needed for constrained cases. This approach, which we now describe, relies cruciallyon the reduced computational requirements for the localized MH algorithm when T → T ∗ is restricted to local moves at a single node.For the moment, consider the birth move described in Section 4.2 and illustrated in Figure 4.In this case, µ new = ( µ L , µ R ) with corresponding r new = ( r L , r R ) and µ old = µ with corre-sponding r . Thus, to perform this move, it is necessary to compute p ( r L , r R | T ∗ , µ same ) and p ( r | T , µ same ) for the computation of α in (4.11), and to simulate ( µ L , µ R ) from p ( µ L , µ R | r L , r R , T ∗ , µ same ) when T = T ∗ is selected. For the corresponding death step, wewould need to simulate µ from p ( µ | r , T , µ same ). When these means are unconstrained,these calculations can be done quickly with closed form expressions and the simulations byroutine methods so we focus here on the constrained case.16et us begin with the calculation of p ( r L , r R | T ∗ , µ same ) = (cid:90) p ( r L | µ L ) p ( r R | µ R ) p ( µ L , µ R | T ∗ , µ same ) dµ L dµ R (4.12)where p ( µ L , µ R | T ∗ , µ same ) = φ µ µ ,cσ µ ( µ L ) φ µ µ ,cσ µ ( µ R ) χ C ( µ L , µ R ) / d ∗ (4.13)and d ∗ is the normalizing constant. The determination of χ C ( µ L , µ R ) is discussed in Section 2;it is the set ( µ L , µ R , µ same ) which results in a monotonic function. Note that C is of the form C = { ( µ L , µ R ) : a ≤ µ L ≤ µ R ≤ b } with a, b (possibly −∞ and/or ∞ ) determined by theconditioning on T ∗ and µ same . In particular, note that C depends on µ same but we havesuppressed this in the notation for the sake of simplicity.Closed forms for (4.12) and the norming constant d ∗ are unavailable. However, since theintegrals are only two-dimensional, it is straighforward to compute them numerically. To usea very simple approach, we approximate them by summing over a grid of ( µ L , µ R ) values.We choose a grid of equally spaced µ values and then let G be the set of ( µ L , µ R ) where both µ L and µ R belong to the grid. Then, our approximate integrals are˜ p ( r L , r R | T ∗ , µ same ) = (cid:88) ( µ L ,µ R ) ∈ G ∩ C p ( r L | µ L ) p ( r R | µ R ) ˜ p ( µ L , µ R | T ∗ , µ same ) , (4.14)where ˜ p ( µ L , µ R | T ∗ , µ same ) = φ µ µ ,cσ µ ( µ L ) φ µ µ ,cσ µ ( µ R ) / ˜ d ∗ (4.15)with ˜ d ∗ = (cid:88) ( µ L ,µ R ) ∈ G ∩ C φ µ µ ,cσ µ ( µ L ) φ µ µ ,cσ µ ( µ R ) . (4.16)Note that we do not include “∆ µ ” terms (the diﬀerence between adjacent grid values) in ourintegral approximations since they cancel out.If T = T ∗ is accepted, the simulation of ( µ L , µ R ) proceeds by sampling from the probabilitydistribution over G ∩ C given by˜ p ( µ L , µ R | r L , r R , T ∗ , µ same ) = p ( r L | µ L ) p ( r R | µ R ) ˜ p ( µ L , µ R | T ∗ , µ same )˜ p ( r L , r R | T ∗ , µ same ) . (4.17)Note that ˜ d ∗ cancels in (4.17) so that we are just renormalizing p ( r L | µ L ) p ( r R | µ R ) φ µ µ ,cσ µ ( µ L ) φ µ µ ,cσ µ ( µ R )to sum to one on G ∩ C . 17or the calculation of p ( r | T , µ same ) = (cid:90) p ( r | µ ) p ( µ | T , µ same ) dµ (4.18)where p ( µ | T , µ same ) = φ µ µ ,cσ µ ( µ ) χ C ( µ ) / d (4.19)and d is the normalizing constant with the constraint set of the form C = { ( µ ) : a ≤ µ ≤ b } , similar griding can be done to obtain a discrete approximation ˜ d of d and a constrainedposterior sample of µ . Again, C implicitly depends on T and µ same . The grid here wouldbe just one-dimensional.Computations for the reverse death move would proceed similarly. Local moves for T → T ∗ beyond birth and death moves may also be similarly applied, as long as µ old and µ new areeach at most two dimensional since beyond two dimensions, grids become computationallydemanding. For example, T → T ∗ obtained by changing a splitting rule whose children areterminal nodes would fall into this category. In all our examples, we use birth/death movesand draws of a single µ component given T and all the remaining elements of M .The approach outlined above for birth/death moves involves two bivariate integrals and twounivariate integrals which we approximate with two sums over a bivariate grid and two sumsover a univariate grid. In practice, we reduce the computational burden by letting ˜ d ∗ and ˜ d equal one and then compensating for this omission with an adjustment of our T prior. Forexample, in a birth move, setting the d ’s to one ignores a factor ˜ d / ˜ d ∗ in our ratio. Notethat from (4.13) d ∗ is just the constrained integral of the product of two univariate normaldensities. Without the constraint, the integral would be one. The more our monotonicityconstraint limits the integral (through χ C ( µ L , µ R )), the smaller d ∗ is. Similary, d is aconstrained univariate integral. However, in a birth step, d ∗ is typically more constrainedthan d . Hence, ˜ d / ˜ d ∗ is a ratio depending on T and T ∗ which we expect to be great thanone. Note that d only depends on T and d ∗ only depends on T ∗ (that is, not on µ same ).We compensate for the omission of ˜ d ∗ and ˜ d by letting α = .

25 and β = . α = .

95 and β = 2. With α = .

25 and β = . p ( T ∗ ) /p ( T )is larger mimicking the eﬀect of the omitted d ratio. We have found that with these choiceswe get tree sizes comparable to those obtained in unconstrained BART. The values α = . β = . Examples

In this section we present several examples to illustrate the performance of mBART. Wepresent results for two simulated scenarios and three real data sets. In the simulations,where we know the true function is monotonic and non-linear, we compare mBART toBART. For the real data, we compare mBART to BART and the standard linear model.In all cases (except Figure 6 in which we study prior sensitivity) we use default priors formBART and BART but remind the reader that for best out-of-sample results, it may bewise to consider the use of cross-validation to tune the prior choice as illustrated in CGM10.Note that below when we refer to the “ﬁt” of BART or mBART at a given x , we mean theposterior mean of f ( x ) estimated by averaging the f draws evaluated at x .Our ﬁrst simulated example has a one-dimensional x so that we can easily see the ﬁts graphi-cally. We see three diﬀerences between mBART and BART which are intuitive consequencesof the the injection of the strong prior information that the function is monotonic: (i) theﬁtted function is smoother, (ii) the uncertainty is lower, and (iii) for some values of x , theinﬂuence of the prior is reduced.Our second example simulates data sets where f is a ﬁve-dimensional nonlinear monotonicfunction. We explore the relationship between the out-of-sample predictive performanceof mBART and the signal to noise ratio by using four diﬀerent values for the error stan-dard deviation. In the high signal case, BART is able to estimate f without the aid ofthe additional information about monotonicity so that BART and mBART have the sameperformance out-of-sample. In the low signal cases, the additional information is importantand mBART beats BART out-of-sample.In our ﬁrst real example y is the price of a house and x represents attributes of the house.In this example, mBART, BART, and the linear model all give similar ﬁts. However, thenon-monotonic BART ﬁts are very counter intuitive while the mBART ﬁts are quite reason-able. This example also illustrates a very important feature of mBART: it handles ordered categorical explanatory variables beautifully.In our second real example y is the price of a used car and x represents attributes of a car.In this example, the linear model fails while mBART and BART perform comparably withmBART giving more interpretable results.In our third real example, y is the excess return for a cross section of ﬁrms in a given monthand x measures attributes of each ﬁrm in the previous month. This example is extreme in19hat the signal to noise ratio is very low. The unconstrained BART ﬁt is very noisy whilethe mBART ﬁt is nicely smooth and suggestive of non-linearity.The examples illustrate two key attributes of mBART in cases where the monotonicity isappropriate. First, the smoother mBART ﬁts are more interpretable than the more “jumbly”BART ﬁts. Second, in terms of the bias-variance trade oﬀ, the monotone constraint mayserve to restrain the ﬁt giving improved out-of-sample performance. This is clearly illustratedin our second simulated example. In all three of our real examples, the posterior distributionof σ from BART covers the posterior mean from mBART, informally suggesting that themonotonicity constraint is consistent with the data. In all three of our real examples, themBART ﬁts are more appealing. In the car price example, BART and mBART beat thelinear model out-of-sample, and mBART is (perhaps) a bit better than BART. In the returnsexample, linear beats BART out-of-sample, while mBART does as well as linear. In general,we can think of mBART as a convenient “half way house” in between the very ﬂexibleensemble method BART and the inﬂexible linear method. In this section we present a very simple simulated example with just one x variable so that wecan visually see some of the basic properties of the monotone BART inference. We simulate n = 200 observations from the model Y i = X i + (cid:15) i , (cid:15) i ∼ N (0 , . ) , X i ∼ U [ − , . Figure 5 shows the results for a single simulated data set. The BART inference is displayedin the left panel and the mBART inference is in the right panel. The mBART ﬁt is muchbetter. The ﬁt is smoother and the uncertainty is much smaller. Clearly, injecting the correctprior information that the function is monotone dramatically tightens up the inference.Given the additional information in the monotonicity constraint, mBART should also be lesssensitive to the choice of prior than BART. To explore this, we ﬁt BART and mBART using36 diﬀerent prior speciﬁcations and compared the variation in ﬁts. We used three diﬀerentvalues for m , the number of trees in the sum: 50, 200, and 500. We used four diﬀerent valuesfor k where a large value of k implies a tighter prior on the end node µ parameters : k = 11, 2, 3, or 5 (see CGM10 and Section (3.3)). We used three diﬀerent settings for the pairof values ( ν, q ) where ν is the degrees of freedom for the inverted chi-squared prior and thescale is chosen so the least squares estimate of the error standard deviation σ is at the q th − . − . . . . x y

95% pointwise posterior intervals, bart true fposterior mean −1.0 −0.5 0.0 0.5 1.0 − . − . . . . x y

95% pointwise posterior intervals, mbart

Figure 5:

One-dimensional simlulation. The left panel shows the unconstrained BART ﬁt.The right panel shows the mBART ﬁt. The monotone ﬁt is much better and the 95% pointwiseintervals for f ( x ) are much tighter. quantile of the σ prior: (3,.9), (3,.99), and (10,.75). These settings follow the prior settingsexplored in CGM10. All possible combinations gives us 3 × × x . Hence, x is increasing as you go from left to right in Figure 6.For each observation the left vertical line gives the range of ﬁtted values for BART whilethe right vertical line gives the range of ﬁtted values for mBART. The average ﬁtted valueis subtracted from the range so we can see the relative spreads.Interestingly, which method gives a greater dispersion depends on the location of the x atwhich predictions are made. In the middle of the data mBART is dramatically less sensitiveto the prior choice. However at the end points, mBART is more sensitive. This make senseas the monotonicity constrains points in the interior but not at the edges. In this section we present results for a ﬁve-dimensional simulated example. We comparethe out-of-sample prediction for mBART with that of BART. We vary the size of the error21

50 100 150 200 − . . . . observation f ( x ) range of fits over 36 prior choices bartmbart Figure 6:

One-dimensial simulation. Prior sensitivity. The range of ﬁts obtained at each x value using diﬀerent prior speciﬁcations. Y = f ( x ) + (cid:15) where x denotes a vector which ﬁve components. Forsmall errors, there is no diﬀerence in the performance as BART is able to infer the functionwith very little error. As the the error variance increases, the additional information thatthe function is monotonic becomes more useful and mBART outperforms BART.We simulated the data using Y i = x i x i + x i x i + x i + (cid:15) i . All of the x values are drawn from the standard uniform distribution on (0,1). Our function f ( x ) = x x + x x + x is clearly monotonic for such x . We let (cid:15) i ∼ N (0 , σ ) with σ = .2, .5,.7, and 1. For each value of σ we simulated 200 data sets. Each data set has 500 in-sample(training) observations and 1,000 out-of-sample (test) observations. For the training data,we draw x and y , while for the test data we need only draw x . For each simulated data setwe compute the RMSE RM SE = (cid:118)(cid:117)(cid:117)(cid:116) (cid:88) i =1 ( f ( x i ) − ˆ f ( x i )) where f is the true function, ˆ f ( x i ) is the posterior mean, and the x i are the test x vectors.The results are displayed in Figure 7. Each boxplot depicts the 200 RMSE values for amethod. The ﬁrst two boxplots show results for the draws with σ = .

2, the second two for σ = . σ both methods give similar results. As σ increases, mBART clearly outperforms unconstrainedBART. We have 128 observations on houses that have sold. The dependent variable y is the salesprice (thousands of dollars). The explanatory variables in x tell us about the houses. Theﬁrst explanatory variable ( nbhd ) indicates which of three neighborhoods the house is in.The three neighborhoods are labeled 1,2, and 3 and we know that neighborhood 1 is moredesirable than 2 and 2 than 3 ( location, location, location !! ). The second x is the size ofthe house ( size , thousands of square feet). The third x is a 0-1 dummy variable indicatingwhether or not the house is made of brick ( brickYes , 0 for non-brick, 1 for brick), and the23 ll ll l lll lll l ll ll bart−1 mbart−1 bart−2 mbart−2 bart−3 mbart−3 bart−4 mbart−4 . . . . . . Figure 7:

Five-dimensional simulation. Out-of-sample RMSE. fourth x is the number of bedrooms nbedrm . In each case, a monotone relationship seemslikely with larger x values giving larger y values.Note that because mBART is based on trees and nbhd is an ordered categorical variable, wecan just include it in mBART as a numeric variable. Of course, this might not be appropriatein a linear model since it would assume the diﬀerence between neighborhoods 1 and 2 is thesame as the diﬀerence between neighborhoods 2 and 3.Below is the multiple regression output where dummies for neighborhoods 2 and 3 have beenincluded. Estimate Std. Error t value Pr(>|t|)(Intercept) 18.130 10.732 1.689 0.0937 .nbhd2 5.096 2.788 1.828 0.0700 .nbhd3 34.909 3.232 10.802 < 2e-16 ***size 42.656 6.031 7.073 1.03e-10 ***brickYes 19.363 2.433 7.958 1.02e-12 ***nbedrm 2.700 1.925 1.403 0.1632---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1Residual standard error: 12.45 on 122 degrees of freedomMultiple R-squared: 0.7936,Adjusted R-squared: 0.7851F-statistic: 93.81 on 5 and 122 DF, p-value: < 2.2e-16

Note that the signs of the coeﬃcients are all positive. However, the relationship need not belinear and we can check this by running the ﬂexible ﬁtters BART and mBART.24igure 8 displays aspects of the inference from the linear model, BART, and mBART. Thetop left panel plots the BART MCMC draws of σ . The solid horizontal line indicates the σ estimate from the linear regression, the “dot dash” horizontal line indicates the posteriormean of σ from mBART, and the “dash” horizontal line indicates the posterior mean of σ from BART. The top right panel shows the σ draws from the mBART MCMC with thehorizontal lines indicating estimates of σ as in the top left panel. The bottom left panelplots the BART ﬁts versus the mBART ﬁts. The bottom right panel plots the linear ﬁtsversus the mBART ﬁts. All three methods give a similar level of ﬁt. draw s i g m a d r a w , ba r t llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll bart sigma draws linear sigmabart sigmambart sigma 0 200 400 600 800 1000 draw s i g m a d r a w , m ba r t llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll mbart sigma draws l ll l ll llll lll l llll l llll l ll lll l ll l ll ll llll ll l lll lll ll lll l l ll lll ll ll l ll l lll l ll l ll ll llll ll ll l ll ll llll l ll l l ll ll llll l llll ll lll l ll ll ll

100 120 140 160 180 bart fits m ba r t f i t s fits, bart vs. mbart l ll l ll llll lll l ll ll l ll ll l ll lll l ll l ll ll llll ll l lll ll l ll lll l l ll lll ll ll lll l lll l ll l ll l l llll ll lll ll ll llll l ll l l ll ll l llll llll ll lll l ll ll ll

100 120 140 160 180 linear fits m ba r t f i t s fits, linear vs. mbart Figure 8:

House price example. Top row: σ draws from BART (left panel), σ draws frommBART (right panel). Solid horizontal line at least squares estimate of σ . Bottom row:BART versus mBART (left panel) and linear ﬁts versus mBART (right panel). Figure 9 displays the conditional eﬀects of neighborhood ( nbhd ) and house size ( size ). To25isualize the conditional eﬀects from the BART/mBART ﬁts we construct x vectors such thatone of the x coordinates varies while the others are held ﬁxed. For example, in the bottomleft panel of Figure 9 we see the estimate of f ( x ) for x having the size values indicatedon the horizontal axis. The various curves in the ﬁgure correspond to diﬀerent ﬁxed levelsof the other three variables in x . We picked a grid of values for each variable and thenconstructed a design matrix composed of all possible combinations. We used 15 quantiles forthe values of size , 2, 3, 4, and 5 for nbedrm , 1,2, and 3 for nbhd , and 0 or 1 for brickYes .giving 15 × × × .0 1.5 2.0 2.5 3.0 nbhd f ( x ) , ba r t nbhd f ( x ) , m ba r t size f ( x ) , ba r t size f ( x ) , m ba r t Figure 9:

House price example. Conditional eﬀects of neighborhood (top panels) and housesize (bottom panels) for BART (left panels) and mBART (right panels). .0 0.2 0.4 0.6 0.8 1.0 brick f ( x ) , ba r t brick f ( x ) , m ba r t nbedrm f ( x ) , ba r t nbedrm f ( x ) , m ba r t Figure 10:

House price example. Condtional eﬀects of brick (top panels) and number ofbedrooms (bottom panels) for BART (left panels) and mBART (right panels). .4 The Car Price Example In this example we have 1,000 observations and y is the sale price of a used Mercedes car.Our explanatory x variables are: (i) the mileage on the car ( mileage ), (ii) the year of thecar ( year ) (iii) feature count ( featureCount ) and (iv) has the car had just one owner (1 ifyes, 0 if no) ( isOneOwner ).We certainly expect a car with more mileage to sell for less. We multiplied mileage by -1 tomake the relationship monotonic increasing. We certainly expect the price to be monotonicincreasing in year and a higher price if there is just one owner. The feature count variableis interesting. Based on our initial understanding of this variable, we expected it to havea negative eﬀect on y so we multiplied by -1 to make it monotonic increasing. Below isthe multiple regression output. We see that all the signs are positive and featureCount is “signiﬁcant”. It turns out we misunderstood this variable and we will discuss it furtherwhen we look at the mBART results. featureCount is left in the presented analysis as wefeel the case “put in a variable by accident” is very realistic and mBART handles it nicely. Coefficients: Estimate Std. Error t value Pr(>|t|)(Intercept) -5.427e+06 1.732e+05 -31.334 < 2e-16 ***mileage 1.529e-01 8.353e-03 18.301 < 2e-16 ***year 2.726e+03 8.613e+01 31.648 < 2e-16 ***featureCount 3.263e+01 9.751e+00 3.346 0.000851 ***isOneOwner 1.324e+03 6.761e+02 1.959 0.050442 .---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1Residual standard error: 7492 on 995 degrees of freedomMultiple R-squared: 0.8351,Adjusted R-squared: 0.8344F-statistic: 1260 on 4 and 995 DF, p-value: < 2.2e-16

Figure 11 has the same layout as Figure 8. The top left panel shows σ draws from BART,the top right panel shows σ draws from mBART, and in each plot the estimate of σ fromthe linear regression is indicated by a horizontal solid line. Both BART and mBART quicklyburn-in to σ values much smaller than the least squares estimate indicating a much tighterﬁt. The monotonicity constraint makes the σ draws slightly larger. In the bottom left panelswe see that the BART and mBART ﬁts are quite similar. In the bottom right panel we seethat the mBART ﬁts are quite diﬀerent from the linear model ﬁts.Figures 12 and 13 have the same layout as Figures 9 and 10. They show the conditionaleﬀects for the four variables in x . 29

500 1000 1500 2000 2500 draw s i g m a d r a w , ba r t llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll bart sigma draws linear sigmabart sigmambart sigma 0 500 1000 1500 2000 2500 draw s i g m a d r a w , m ba r t llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll mbart sigma draws l ll lll l ll ll ll l llll lll ll ll ll lll ll ll ll l lll ll ll lll l lll lll l l lll l l lll ll ll ll ll l lll ll ll ll l l ll ll lll ll lll ll l l llll lll l ll ll ll l ll ll ll llll ll lll l lll l lll ll l ll l ll lll l ll ll ll lllll lll lll ll ll lll ll l ll lll l l ll llllll llll llll l lll ll ll llll lll ll ll l lll lll ll ll lll l ll ll l ll lll ll l lll lll l ll l lll ll ll ll lll ll l l lll ll l lll ll ll l ll ll lll l llll lll ll l l ll ll l ll lll l lll l lll ll ll l lll lll llll ll lll ll l ll l llll lllll llll l ll ll ll l l l l lll ll lll l lll ll l lll l l ll lll ll ll l ll ll ll ll l l ll ll l lll l ll lll l lll l l llll l lll l lll ll ll l llll ll l l ll ll lll lll l lll l lll l ll ll l l llll l l lll l ll l ll l lll l lllll l ll ll lll l llll llll llll l l lll lllll l ll llll ll ll l l ll l ll l ll lll lll ll llll l llll ll llll ll lll llll ll lll l llllll llll ll l l ll l l ll llll ll ll l llll l ll llll l ll ll ll lll l lllll ll lll l ll lll ll l llll lll lll l ll ll llll l ll lll l l ll ll ll l l ll ll l lll lll ll ll llll ll ll l ll ll ll ll l ll ll ll lll lll lll lll l ll l ll ll llll l lll lll l lll lll lll l lllll lll ll l l l ll ll l ll llllll l ll ll ll ll ll lll ll l ll lll lll lll ll lll lll ll l ll ll lllllll l ll lll ll ll l ll ll ll l ll l ll ll l ll ll l l ll l llll l l l l llllll llll ll ll lll ll ll ll lll llll ll l ll l ll ll ll lll l ll ll l lll lllll l l lll l l lll llll lll l ll l l ll lllll ll l l ll ll bart fits m ba r t f i t s fits, bart vs. mbart l ll lll l ll ll ll l llll lll ll ll ll ll l ll ll lll lll ll ll lll l lll lll l l llll l lll ll ll ll ll l lll ll ll ll l l ll ll lll ll lll ll l l llll lll l ll ll ll lll ll ll llll ll lll l lll l lll ll l ll l l l lll l ll ll ll lllll lll lll ll ll lll ll l ll lll l l ll llllll llll llll l lll ll ll lll l lll ll ll l lll lll ll ll lll l ll ll l ll lllll ll ll lll l ll l lll ll ll ll lll ll l l lll ll l lll ll ll l ll ll lll l llll lll ll l l ll ll l ll lll l lll l lll ll ll l lll lll llll ll lll ll l ll l llll lllll llll l ll ll ll l l l l lll ll llll lll ll llll l l ll lll ll ll lll ll ll ll l l lllll lll l ll llll lll l l llll l lll l lll ll ll l llll ll l l ll ll lll ll l l lll l lll l ll ll l l llll l l lll lll l ll l lll l lllll l ll ll llll llll llll llll l l lll ll lll lll llll ll ll l l lll lll ll lll lll ll llll l llll ll llll ll lll llll ll lll l llllll llll ll l l ll ll ll llll ll ll l llll l ll lll l l ll ll ll lll l lllll ll lll l ll lll ll l llll lll lll l ll ll ll ll l ll ll l l l ll ll ll l l ll ll llll lll l l ll llll ll ll l llll ll ll l ll ll ll lll lll lll lll l ll l ll ll llll llll llll lll lll lll l lllll ll l ll ll l ll ll l ll llll ll lll ll ll ll ll lll ll l ll lll lll lll ll lll lllll l ll ll llllll l l ll lll ll ll l ll ll ll l ll l ll ll l ll ll l l ll l lll l l l l l llllll llll ll ll lll ll ll ll lll l lll ll l ll l ll l l ll lll l ll ll llll l llll l l lll l l lll llll lll l ll l l ll lllll ll ll ll ll −20000 0 20000 40000 60000 linear fits m ba r t f i t s fits, linear vs. mbart Figure 11:

Car price example. Top row: σ draws from BART (left panel), σ draws frommBART (right panel). Solid horizontal line at least squares estimate of σ . Bottom row:BART versus mBART (left panel) and linear ﬁts versus mBART (right panel). mileage and year (Figure 12) are quite similar for BART andmBART. The mBART eﬀect for mileage is smoother and everywhere monotonic while theBART ﬁt does exhibit slight dips. −140000 −120000 −100000 −80000 −60000 mileage f ( x ) , ba r t −140000 −120000 −100000 −80000 −60000 mileage f ( x ) , m ba r t year f ( x ) , ba r t year f ( x ) , m ba r t Figure 12:

Car price example. Conditional eﬀects of mileage (top panels) and year (bottompanels) for BART (left panels) and mBART (right panels).

The conditional eﬀects for featureCount and isOneOwner are in Figure 13. The conditionaleﬀect plot for featureCount is quite striking. The monotonic constraint results in a ﬂatlining of the plot dramatically indicating the absence of an eﬀect (contrary to the *** andt-value of 3.346 in the R multiple regression output). After we obtained these results, wechecked back with source of the data and found that we had misunderstood the variable featureCount and in fact, there is no reason to expect it to be predictive of the car prices!It measures web activity of a shopper and not a feature of the actual car.31

80 −60 −40 −20 0 featureCount f ( x ) , ba r t −80 −60 −40 −20 0 featureCount f ( x ) , m ba r t isOneOwner f ( x ) , ba r t isOneOwner f ( x ) , m ba r t Figure 13:

Car price example. Conditional eﬀects of featureCount (top panels) and isOne-Owner (bottom panels) for BART (left panels) and mBART (right panels). mileage and year for ﬁxed values of featureCount and isOneOwner . The BART ﬁt is on the left andthe mBART ﬁt is on the right. While similar, the mBART ﬁt is smoother. m il e a g e y ea r f ( x ) , ba r t m il e a g e y ea r f ( x ) , m ba r t Figure 14:

Car price example. Bivariate plot of ﬁtted price vs. mileage and year. BART(left panel), mBART (right panel).

We conducted a simple out-of-sample experiment to check for over-ﬁtting: 200 times werandomly selected 75% of the data to be in-sample and predicted the remaining 25% of the y values given their x values using linear regression, BART, and mBART. The out-of-sampleroot mean square errors are reported in Figure 15. BART and mBART give similar resultsand both are dramatically better than the linear predictions.Our results indicate that (i) the monotonicity is indeed reasonable, (ii) we can can get nicemonotonic ﬁts, and (iii) only mileage and year matter.33 linear bart mbart out of sample rmse, cars data Figure 15:

Car price example. Out-of-sample results. .5 The Stock Returns Example An important and heavily studied problem is the predictability of stock market returns.Can we measure characteristics of a ﬁrm ( x ) that can help us predict a future return ( y )?The data are monthly and the x ’s are measured the previous month so that the relationshipbeing studied is predictive. Our y is actually excess return, the diﬀerence between the returnfor a ﬁrm and the average return for ﬁrms that month. While still very useful in practice,predicting the excess return is easier than predicting the whole return.Often in predictability studies predictive models are ﬁt for each month and then rollingwindows of months are considered. In this section we pick one month (at random) andconsider the ﬁtting of a model for that particular month. We picked December, 1981 froma much larger data set of 594 months. Since the modeling is done for each month, it makessense to focus on a particular month to see how diﬀerent approaches might work. However,the predictive model uncovered is exploratory rather than reﬂective of the actual predictivemechanism used in practice, which is based on averaging rolling ﬁts.We used four predictive variables in x . logme : market equity (logged), r1 : return, gpat :gross proﬁtability ((sales - cost of goods sold) / total assets), and logag : growth in totalassets (logged). Remember, x is lagged. We have observations on 1,531 ﬁrms. Although logtransformations of predictors is unnecessary for BART and mBART, these transformationsfacilitate comparisons with linear regression.Logged market equity, previous return, and logged growth in total assets are all multipliedby -1 to give monotone increasing relationships. Below are the results of multiple linearregression: Estimate Std. Error t value Pr(>|t|)(Intercept) 0.028895 0.010052 2.875 0.00410 **logme 0.004461 0.001626 2.744 0.00614 **r1 0.063310 0.020397 3.104 0.00195 **gpat 0.035634 0.007428 4.797 1.77e-06 ***logag 0.080160 0.010715 7.481 1.24e-13 ***---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1Residual standard error: 0.07285 on 1526 degrees of freedomMultiple R-squared: 0.05767,Adjusted R-squared: 0.0552F-statistic: 23.35 on 4 and 1526 DF, p-value: < 2.2e-16

From the multiple regression output we see that after ﬂipping three of the x variables we doindeed get positive coeﬃcient estimates. It is widely believed that larger ﬁrms are less risky35nd hence generate lower returns. Hence, the monotonicity for logme is strongly motivated.For the other variables, the story is less simple. One might think a high previous return wouldlead to a high current return (giving a negative sign in the regression since we multipliedby -1). However a tendency for “short term reversals” has been found. Intuitively, grossproﬁtability should be positively related to returns as in our regression. The monotonicityof the logag eﬀect is less clear and, indeed, the sign of the regression coeﬃcient can varyfrom month to month. However, ﬁnancial theory suggests that if we interpret our x ’s asrepresentative of underlying factors, the direction of an eﬀect can vary month to month, butwe still expect the eﬀect to be monotonic within a given month across a set of ﬁrms.The R in the multiple regression is less than 6%, indicating a very low signal to noiseratio. Bias-variance considerations suggest that only the simplest models can be used topredict since ﬁtting complex models with such a low signal is not feasible. This gives us astrong motivation for examining the ﬁt of mBART. Perhaps using mBART allows us to bemore ﬂexible than a simple linear approach without running the risks associated with anunconstrained ﬁt given the low signal.Figure 16 displays ﬁts from BART, mBART, and a linear regression (using the same layoutas in our previous examples). The top left plot shows the sequence of σ draws from theBART ﬁt, while the top right plot shows the sequence of σ draws from mBART. In eachplot, a solid horizontal line is drawn at the least squares estimate of σ . The σ draws from theBART ﬁt tend to be smaller than the least squares estimate while the least squares estimateis right at the center of the mBART ﬁts. The monotonicity constraint has pulled the BARTﬁt back so that overall, it is more comparable to the linear ﬁt.The lower left panel of Figure 16 plots the BART ﬁts versus the mBART ﬁts and the lowerleft panel plots the linear ﬁts versus the mBART ﬁts. Given the very low signal, it is notablethat all three methods pick up similar ﬁts. The mBART ﬁt appears to be more like thelinear ﬁt than the BART ﬁt.Figures 17 and 18 display the conditional eﬀects using the same format as in our previoustwo examples. The mBART ﬁts are much more appealing. They are quite close to linear(especially for r1 ) but there is some suggestion of nonlinearity for three of the variables.Figure 19 plots the ﬁtted cross section of expected returns against r1 and logme . Theunconstrained BART ﬁt seems quite absurd while the mBART ﬁt suggests some nonlinearity,but also leaves open the possibility that it is close enough to linear for prediction given thehigh noise level. 36

500 1000 1500 2000 2500 . . . . . . draw s i g m a d r a w , ba r t llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll bart sigma draws linear sigmabart sigmambart sigma 0 500 1000 1500 2000 2500 . . . . . . draw s i g m a d r a w , m ba r t llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll mbart sigma draws l ll l ll l ll ll ll l lll ll lll l ll l l lll lll lll l llll llllll llll l ll l ll llll ll lll l l ll llll ll l ll ll lll lllll lll ll l llll l ll l ll l ll l ll lll ll l ll lll ll llll l lll ll l l lll ll l ll ll l llll l ll ll l ll l lll ll l ll lll ll ll l ll l l llll l llll lllll l lllll lllll ll ll ll ll l lll ll lll lll ll lll l l ll ll ll l ll l ll ll llll lll ll l l lll ll ll ll ll l ll ll l ll ll l ll l llll lll lll llll ll l l lllll l lllll l l ll l ll l llll lll lll ll l l l ll ll l l ll lll lll llllll l l lll ll ll ll ll ll lll l ll lll ll ll ll l ll lll ll l lllll ll l ll ll l llll ll lll ll ll ll l ll lll lllllll ll l lll ll ll ll ll lll llll ll ll lllll l lllll l ll l ll ll l lll ll lll ll ll lll lll l l l ll lll lll l ll ll ll lllll l ll lll llllll lll ll lllll ll lllll ll lll lll l ll lll ll ll l ll llll l lll lll l lll l ll ll ll l llll l llll l lll l ll llll lll ll l ll lll ll llll l ll lll lll lll llll l l lll l ll l ll ll llll ll lll l ll lll ll lllll l l llll ll l ll lll llll ll ll lll lll lll lll ll ll l llll lll ll ll l lll ll ll l llll llll lll llll lll lll l lll l lllllll l lll ll lll lll ll ll llll ll lll lll l l lll ll llll l ll l lll ll lll ll ll ll lll ll ll ll l lll l llll l lll llll lll l ll lll ll ll lll l lll ll lll lll lll llll ll llll l llll llll llll ll lllll ll l ll ll l ll llll l lll llllll ll l ll lllll ll lll l ll lll l ll llll ll lll l l l ll ll lll lll l lll ll lll ll lll l l ll l lll ll l ll l ll ll lll llllllll ll lll ll llll ll l lll ll l l ll llllll ll l ll ll ll lll l lll lll ll ll ll llll ll lll l l ll lll ll llll lll lll ll l ll l ll l ll ll ll ll l lll ll ll l l llll lll ll l lll l ll lll ll ll llll l ll lllll ll lll lll ll l l lll l lll lll l ll ll lll l l lll l l lll ll l lll ll lll l ll ll ll l l ll l ll llllll l ll ll ll ll ll ll lll ll ll ll lll ll ll llll ll ll ll llll ll ll ll ll ll lll l llll ll ll lll ll l lll llll ll lll l lll l ll l ll lll ll lll l lll ll ll l ll ll lll lll lll l llll lll lll lll ll lll ll ll ll ll l ll ll lll lll l lll l llll ll ll llllll ll ll l l lll ll ll ll l ll ll llll llll lll ll ll l ll llllll l lll ll lll llll ll l lll ll ll llll l ll l lllllll l lll ll ll l ll ll l lll ll ll l ll −0.10 −0.05 0.00 0.05 − . − . − . − . . . bart fits m ba r t f i t s fits, bart vs. mbart l ll l ll l ll ll ll l l ll ll l ll lll l ll ll lll llll llll lll lll llll l ll lllllll ll lll l l ll l ll ll llll ll lll lll lll lll llll ll llll ll lll lll lll ll lll lll ll l ll llll l ll lllll ll l ll lll lllll llll l ll ll ll ll l ll lll l lll l ll l lllll l lll ll llll lll lll lll ll ll llll ll l lll ll lll lllll lll l l llll ll l ll lll l l llll l lllll llll llll ll ll lll l ll ll ll l lll lll l l ll llllll l lll l ll lll ll llll llll llll llll lll lll l lll l ll ll l l lllll ll l ll ll ll l l lll llll ll l lll lll lll lll ll ll ll l ll lll ll ll llll ll lll ll l llll l llll llllll lll lll lllllll lll lllll ll llll l ll llll ll ll llllllll lll l lll ll ll l lll ll lll ll lllll l ll l l l ll ll l lll l ll ll ll lllll ll l ll lllll lll ll ll ll lll ll ll lll ll lll lll l ll lll llll llllll l llll lll l lll l ll ll lll llll llll l l llllll llll l ll lll ll l ll ll l lll l l llll lllll l llll l llll l l ll ll ll llll ll l ll lll lllll llll l ll ll lllll ll lll llll ll ll l ll lll ll l lll ll ll lllll l ll ll ll l ll lll ll l llll ll ll lll ll lllll lll l llll lllll ll lll lll lll lll ll ll ll ll lll ll ll l l ll ll ll ll ll l ll llll ll lllll ll ll lll ll ll ll lllll lll l l ll lll ll lll l ll l ll ll ll lll l lll ll l lll l ll ll ll ll l lll ll l l lll l lll ll ll ll lll ll l l l ll lll ll l l lll ll l ll llll lllll lllll ll lll l ll ll llll l lll ll llllll ll ll lll lll l ll l ll lll ll llll l l l llllll l ll l ll ll lll ll lll lll ll lll ll ll ll ll l l ll ll ll llll llll ll l lll l ll llll l ll lll ll ll ll ll ll l ll ll l lll lll ll l lll ll l lll ll l ll l ll lll ll ll ll l llll lll ll ll ll lll ll ll ll l l l lll ll ll ll l llll lll ll lll ll llllll l lll l lll lll ll lll llll l lll l l lll ll llll ll l lll ll l llll l ll l lll llll l l lll lll ll ll l l lll ll ll ll lll ll l ll lll l lll lll l ll ll ll ll ll ll llll l lll ll ll lllll l ll lllll ll lll lll l l ll l ll lll ll l ll l lllll ll l ll ll lll lll lll l ll ll lll l ll lll l l ll l ll ll ll ll l ll ll lll ll ll ll ll lll lll ll ll l lll l lll l l lll ll llll l ll llll ll llll ll l ll ll l ll llllll l lllll lll llll l llll ll ll lllll l ll llll l lllllll ll ll lll ll llll ll lllll −0.10 −0.05 0.00 0.05 − . − . − . − . . . linear fits m ba r t f i t s fits, linear vs. mbart Figure 16:

Returns example. Top row: σ draws from BART (left panel), σ draws frommBART (right panel). Solid horizontal line at least squares estimate of σ . Bottom row:BART versus mBART (left panel) and linear ﬁts versus mBART (right panel). − . − . . . . . . logme f ( x ) , ba r t −7.0 −6.5 −6.0 −5.5 −5.0 −4.5 − . − . − . . . . . logme f ( x ) , m ba r t −0.15 −0.10 −0.05 0.00 0.05 − . − . . . . . . r1 f ( x ) , ba r t −0.15 −0.10 −0.05 0.00 0.05 − . − . − . . . . . r1 f ( x ) , m ba r t Figure 17:

Returns example. Conditional eﬀects of logme (top panels) and r1 (bottom panels)for BART (left panels) and mBART (right panels). .1 0.2 0.3 0.4 0.5 0.6 0.7 − . − . . . . . . gpat f ( x ) , ba r t − . − . − . . . . . gpat f ( x ) , m ba r t −0.30 −0.25 −0.20 −0.15 −0.10 −0.05 − . − . . . . . . logag f ( x ) , ba r t −0.30 −0.25 −0.20 −0.15 −0.10 −0.05 − . − . − . . . . . logag f ( x ) , m ba r t Figure 18:

Returns example. Conditional eﬀects of gpat (top panels) and logag (bottompanels) for BART (left panels) and mBART (right panels). o g m e r f ( x ) , ba r t l o g m e r f ( x ) , m ba r t Figure 19:

Returns example. Bivariate plot of ﬁtted expected return vs. logme and r1. BART(left panel), mBART (right panel).

To get some feeling for out-of-sample predictability, we performed a “stylized” out-of-sampleexperiment using the same setup as we used for the used cars example in the previoussection. We call this “stylized” because it is unrealistic to observe returns from 75% of theﬁrms and then predict the rest. However, this gives a sense for how the procedures workin the particular month. Out-of-sample RMSE’s are displayed in Figure 20. We see thatmBART and the linear ﬁt give very similar results while BART is slightly worse. Again,given the very low signal to noise ratio, we do not expect to be able to detect large diﬀerences.Our main point is that if you want to consider something more ﬂexible than linear, andinterpret the ﬁts on a monthly basis, Figure 19 shows that BART does a poor job, whilemBART can give plausible nonlinear results.40 l linear bart mbart . . . . out of sample rmse, finance data Figure 20:

Returns example. Out-of-sample RMSE. mBART is comparable to linear, whileBART is worse. Conclusion

When prior information is available, we should use it. Information about whether an increasein an explanatory x is likely to increase or decrease the dependent variable Y on average isoften available. For example, we do not expect a smaller house to sell for more, all otherthings being equal. Even in the relatively simple context of linear modeling, prior informationabout the signs of the coeﬃcients may be helpful.In the context of highly ﬂexible nonlinear models such as BART, the potential beneﬁts ofincluding prior information regarding monotonicity are even greater. By sensibly constrain-ing our search in the vast space of high-dimensional non-linear functions we ease the burdenof search and inference. Our resulting estimates are smoother and more interpretable. Wesee this in all our examples (Figures 17, 18, and 19 are particularly dramatic). The addi-tional information provides a form of regularization that may lead to better out-of-sampleprediction (see Figure 7). Injection of additional prior information can aﬀect both our pos-terior uncertainty (Figure 5) and the sensitivity of our results to other aspects of our priorspeciﬁcation (Figure 6).A key advantage of our approach is that it works for high-dimensional x whereas othermonotonic approaches become diﬃcult in this case. The simple observation that the sumof monotonic functions is monotonic reduces our problem to that of monotonic inferencefor a single tree. Since we seek the stochastic search exploration of the posterior uncer-tainty provided by Markov Chain Monte Carlo algorithms, we adapt the BART MCMCalgorithm. This turns out to be non-trivial because a key step in the BART algorithm usesthe conditional conjugacy of the prior setup to enable the integration of bottom node meanparameters. This conditional conjugacy is not available in the constrained case. In a funda-mental rethinking of the BART algorithm, we make local moves conditional on the rest of thetree including the current values of bottom node mean parameters. This conditioning allowsus to locally impose the monotonicity in a simple way and integrate parameters numericallywithout dramatically slowing the time per iteration.In some examples, the data are suﬃciently informative so that the injection of the additionalinformation regarding montonicity makes little diﬀerence and BART results are similar tomBART results. However, when the noise level is relatively high, the additional informationcan be useful as in Sections 5.2 and 5.5. In all examples, given the obvious appeal ofmonotonicity, the ability to obtain strictly monotonic estimates provides more interpretableand appealing results. 42ll data and code used in the paper is available at:https://bitbucket.org/remcc/mbart as a git repo.That is, git clone https://bitbucket.org/remcc/mbart will get everthing and then check outREADME.txt. 43 eferences Azzalini, A . (1985). A class of distributions which includes the normal ones.

Scand. J.Statist. Barlow, R.E. , Bartholomew, D. , Bremner, J.M. and

Brunk, H.D. (1972).

Sta-tistical Inference Under Order Restrictions: Theory and Application of Isotonic Re-gression.

New York: Wiley.

Bleich, J. , Kapelner, A. , George, E.I. and

Jensen, S.T. (2014). Variable selectionfor BART: An application to gene regulation.

Ann. Appl. Stat. Chipman, H. , George, E.I. and

McCulloch, R.E. (1998). Bayesian CART modelsearch (with discussion and a rejoinder by the authors).

J. Amer. Statist. Assoc. Chipman, H. , George, E.I. and

McCulloch, R.E. (2010). BART: Bayesian additiveregression trees.

Ann. Appl. Stat. Chipman, H. , George, E.I. and

McCulloch, R.E. (2013). Bayesian Regression Struc-ture Discovery.

In Bayesian Theory and Applications , (Eds, P. Damien, P. Dellaportas,N. Polson, D. Stephens), Oxford University Press, USA.

Holmes, C.C. and

Heard, N.A. (2003). Generalized monotonic regression using randomchange points.

Statist. Med. Kapelner, A. and

Bleich, J. (2016). bartMachine: Machine learning with Bayesianadditive regression trees.

J. Stat. Softw. Kong, M. and

Eubank, R.L. (2006). Monotone smoothing with application to dose-response curve.

Commun. Statist. B-Simul. Lavine, M. and

Mockus, A. (1995). A nonparametric Bayes method for isotonic regres-sion.

J. Stat. Plan. Inference , , 235-248. Lin, L. and

Dunson, D.B. (2014). Bayesian monotone regression using Gaussian processprojection.

Biometrika

Mammen, E. (1991). Estimating a smooth monotone regression function.

Ann. Statist. eelon, B. and Dunson, D.B. (2004). Bayesian isotonic regression and trend analysis.

Biometrics Ramsay, J.O. (1998). Estimating smooth monotone functions.

J. R. Statist. Soc. B Saarela, O. and

Arjas, E. (2011). A method for Bayesian monotonic multiple regres-sion.

Scand. J. Statist. Shively, T.S. , Sager T.W. and

Walker, S.G. (2009). A Bayesian approach to non-parametric monotone function estimation.

J. R. Statist. Soc. B Shively, T.S. , Walker, S.G. and

Damien, P. (2011). Nonparametric function estima-tion subject to monotonicity, convexity and other shape constraints.

J. Econometrics161