Inference for BART with Multinomial Outcomes
Yizhen Xu, Joseph W. Hogan, Michael J. Daniels, Rami Kantor, Ann Mwangi
1. Department of Biostatistics, Johns Hopkins University, 615 N Wolfe St, Baltimore MD, U.S.A.
2. Department of Biostatistics, Brown University, 121 S. Main Street, Providence RI, U.S.A.
3. Department of Statistics, University of Florida, Gainesville FL, U.S.A.
4. Division of Infectious Diseases, Brown University, Providence RI, U.S.A.
5. Academic Model Providing Access to Healthcare (AMPATH), Eldoret, Kenya
6. College of Health Sciences, School of Medicine, Moi University, Eldoret, Kenya
* [email protected]
* This is part of the thesis work of the first author.
Abstract
The multinomial probit Bayesian additive regression trees (MPBART) framework was proposed by Kindo et al. (KD), approximating the latent utilities in the multinomial probit (MNP) model with BART. Compared to multinomial logistic models, MNP does not assume independent alternatives, and the correlation structure among alternatives can be specified through multivariate Gaussian distributed latent utilities. We introduce two new algorithms for fitting the MPBART and show that the theoretical mixing rates of our proposals are equal or superior to the existing algorithm in KD. Through simulations, we explore the robustness of the methods to the choice of reference level, imbalance in outcome frequencies, and the specification of prior hyperparameters for the utility error term. The work is motivated by the application of generating posterior predictive distributions for mortality and engagement in care among HIV-positive patients based on electronic health records (EHRs) from the Academic Model Providing Access to Healthcare (AMPATH) in Kenya. In both the application and the simulations, we observe better performance using our proposals as compared to KD in terms of MCMC convergence rate and posterior predictive accuracy.

Keywords: Latent Models, Bayesian Data Augmentation, Additive Regression Trees
1 Introduction

Bayesian additive regression trees (BART) is a flexible nonparametric Bayesian approach for regression on a recursively binary-partitioned predictor space; it uses a sum of trees to model the mean function, so that nonlinearities and interactions, along with additive effects, are naturally accounted for, and regularization priors are imposed to favor shallow trees and reduce over-fitting. There is considerable literature on extending BART to various types of problems, given BART's predictive power for continuous and binary outcomes. We specifically consider the extension of BART to multinomial probit (MNP) models. Existing BART-related work has developed efficient Markov chain Monte Carlo (MCMC) algorithms for Gaussian likelihoods, which naturally adapt to frameworks with Gaussian-distributed latent variables. However, careful consideration of data augmentation (DA) schemes is needed for the computational efficiency of implementing BART for categorical outcomes. The main contributions of this paper are to provide a detailed review of parameter-expanded sampling algorithms based on DA schemes and to introduce a set of new MCMC algorithms for multinomial probit BART (MPBART).

Our work is motivated by predictive modeling of the progression of patients' retention and mortality through the HIV care cascade, adopting the regression framework proposed by Lee et al. for modeling patients' engagement in care over time as transitions among categorical outcome states conditional on patients' clinical history. The HIV care cascade is a conceptual model that outlines essential stages of the HIV care continuum: (a) HIV diagnosis through testing, (b) linkage to care, (c) engagement in care, (d) initiation of antiretroviral therapy (ART) through retention, and (e) sustained suppression of viral load; it has been widely used as a monitoring and evaluation tool for improving and managing HIV health care systems. We demonstrate and compare different algorithms for MPBART in this application in Section 4.

MNP and multinomial logistic (MNL) models are widely used tools for predicting and describing the relationships of explanatory variables to categorical outcomes. Kindo et al. proposed the MPBART framework, which uses BART to fit models to the Gaussian latent variables in the MNP. Related work incorporating BART into categorical response models was introduced by Murray, where BART is extended to log-linear models, including multinomial logistic BART (MLBART). Both MNP and MNL can be derived from a latent variable framework, in which each outcome category is modeled by a utility that is a function of the covariates, and the outcome is the utility-maximizing category. MNP and MNL assume the utilities to be multivariate Gaussian distributed and independent extreme-value distributed, respectively. The MNP is appealing because the independence assumption under the MNL may be counterintuitive in practice; the MNP does not have this restriction. The same contrast carries over to MLBART and MPBART. We will show that allowing non-zero correlations between utilities is an important feature of the latent variable distribution and can have a substantial impact on predictive accuracy.

There are two difficulties in sampling from posterior distributions for the MNP.
First, a closed-form expression for the outcome's marginal probabilities is not available; second, identifiability of the MNP requires constraints on the covariance matrix of the utilities, hindering specification of conjugate distributions and making posterior sampling more difficult. There has been considerable work on Bayesian sampling techniques, based on DA-related methods, to address the computational issues of the MNP. The original DA algorithm is a stochastic generalization of the EM algorithm. Marginal data augmentation (MDA) generalizes and accelerates the DA algorithm via parameter expansion, such that full conditionals are easier to sample from and the expansion parameter(s) are subsequently marginalized over. Heuristically, the MDA Gibbs sampler can traverse the parameter space more efficiently with the extra variation induced by the expansion parameter(s), resulting in possible computational gains, including a faster mixing rate. Li et al. provided an example of posterior sampling of a correlation matrix via parameter expansion. For the MNP, sampling from the constrained model parameter space is difficult because the full conditionals do not have a simple closed form; the MDA scheme circumvents the difficulty and allows easier and more efficient joint sampling of the expansion parameter and transformed model parameters. Imai & van Dyk unified several previous proposals under the umbrella of MDA, examined different prior specifications for model parameters, and outlined two adaptations of the MDA scheme for posterior sampling of the MNP based on parameter expansion.

Building upon the work of Imai & van Dyk, Kindo et al. proposed an algorithm, which we refer to as KD, for fitting the MPBART. However, the existing package for KD has been shown to have problems; our implementation of KD reveals further drawbacks, such as oversized posterior trees from potential overfitting, difficulty in posterior convergence, and lack of robustness to the choice of reference level. We propose two alternative procedures for fitting the MPBART that have a simpler algorithmic structure, improved convergence in the sum-of-trees and the covariance matrix, and a better mixing rate when the Markov chain reaches equilibrium. In particular, our algorithms show better out-of-sample accuracy and stability in predictive tasks under various settings when evaluated in terms of both posterior means and posterior modes; posterior mode accuracy is the comparison metric used in Kindo et al. The intuition behind our proposals is to fit the sum of trees in a constrained parameter space to reduce disruptions to the stochastic search of posterior trees, resulting in faster convergence of the Markov chain.

In every step of the Gibbs sampler, the MDA scheme requires (1) the joint sampling of expansion parameter(s) and transformed model parameters, and (2) the marginalization over the expansion parameter. However, these two actions are not always feasible for complicated Gibbs sampling problems. For example, sampling the functional mean component jointly with an expansion parameter in an MPBART algorithm is difficult because posterior trees are sampled by stochastic search. Algorithms for the MNP and MPBART generally fall under the category of partially marginalized augmentation (PMA) samplers, which relax the fully marginalized structure of the MDA and expect an improvement in convergence rate when more steps involve jointly sampling and marginalizing components of the expansion parameter(s).
We will show that KD and one of our proposals are essentially the MPBART generalizations of the two procedures proposed in Imai & van Dyk; the main difference between the two MPBART algorithms is that the latter does not employ augmentation in the sampling of model coefficients for the mean component of the random utilities. In the context of MPBART, this is contrary to the intuition about PMA samplers and shows that, when Metropolis-Hastings or stochastic search is involved in complicated samplers, particular steps may be sensitive to the incorporation of expansion parameter(s); as such, more consideration in algorithm design is needed.

This paper is structured as follows. Section 2.1 introduces the formulation of the MNP and MPBART frameworks; Section 2.2 reviews sampling schemes for the MNP, including DA and MDA; Section 2.4 introduces the existing and proposed algorithms for fitting the MPBART; and Section 2.5 theoretically compares the MPBART algorithms in terms of the mixing rate under stationarity. Section 3 provides empirical investigations of the MPBART algorithms on simulated data under different settings. Section 4 conducts further comparisons of the algorithms on a real-world dataset. Section 5 summarizes the conclusions.

2.1 The MNP and MPBART Frameworks

For the categorical outcome $S$, which takes values in $\{0, \ldots, C\}$, the general latent variable framework for multinomial models assumes that $S$ is a manifestation of unobserved latent utilities $Z = (Z_0, \ldots, Z_C) \in \mathbb{R}^{C+1}$, where $S(Z) = \operatorname{argmax}_l Z_l$, i.e., $S = k$ if $Z_k \geq Z_l$ for all $l \neq k$. In general, $C$ is the number of outcome levels minus one. The framework requires normalization for identifiability because $S$ is invariant to a translation or a scaling (by a positive constant) of $Z$. Without loss of generality, we assume that the reference outcome category is 0; the normalization for location is achieved by modeling $S$ as a function of latent variables $W = (W_1, \ldots, W_C) \in \mathbb{R}^C$, where $W_l = Z_l - Z_0$ and

$$S(W) = \begin{cases} l & \text{if } \max(W) = W_l \geq 0 \\ 0 & \text{if } \max(W) < 0. \end{cases} \qquad (1)$$

The MNP models $W$ in terms of covariates $X$ and accounts for correlation across outcome levels by assuming $W$ follows a multivariate normal model,

$$W(X) \sim MVN(G(X; \theta), \Sigma), \qquad (2)$$

where $G(X; \theta) = (G_1(X; \theta_1), \ldots, G_C(X; \theta_C))$, $\theta = (\theta_1, \ldots, \theta_C)$, and $\Sigma = \{\sigma_{ij}\}$ is a $C \times C$ positive definite symmetric matrix.

Identifiability of the model also requires normalizing the scale of $W$, because by definition the outcome $S$ is invariant to multiplication of $W$ by any positive constant. From (2), the normalization for scale is obtained by imposing a constraint on the covariance matrix $\Sigma$, such as $\operatorname{trace}(\Sigma) = C$. To illustrate, suppose there are latent variables $\widetilde{W}$ of the following form,

$$\widetilde{W}(X) \sim MVN(G(X; \tilde{\theta}), \widetilde{\Sigma}), \qquad (3)$$

where $\widetilde{W}(X) = \alpha W(X)$, $G(X; \tilde{\theta}) = \alpha G(X; \theta)$, $\widetilde{\Sigma} = \alpha^2 \Sigma$, and $\alpha > 0$. By (1), $\widetilde{W}$ and $W$ yield the same $S$. However, if $\Sigma$ satisfies the trace constraint, then $W$ is the normalized counterpart of $\widetilde{W}$, and $\alpha = \sqrt{\operatorname{trace}(\widetilde{\Sigma})/C}$ is the positive scalar that ensures a one-to-one mapping from $W$ to $\widetilde{W}$.
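To make the normalization concrete, the following minimal sketch (Python with NumPy; the function names are our own illustrations, not from the paper's software) draws an outcome via (1)-(2) and shows that the trace constraint pins down the scale that (1) leaves unidentified.

```python
import numpy as np

def outcome_from_utilities(W):
    """Equation (1): S = l if max(W) = W_l >= 0, and S = 0 (reference) otherwise.
    Categories are labeled 0, 1, ..., C, with W_l = Z_l - Z_0."""
    return int(np.argmax(W)) + 1 if np.max(W) >= 0 else 0

rng = np.random.default_rng(0)
C = 2
G = np.array([0.3, -0.2])                    # G(X; theta) for one subject
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])   # trace(Sigma) = C: normalized scale

W = rng.multivariate_normal(G, Sigma)        # equation (2)
S = outcome_from_utilities(W)

# Identifiability: S is invariant to scaling W by any alpha > 0 ...
for alpha in (0.1, 1.0, 7.5):
    assert outcome_from_utilities(alpha * W) == S

# ... so the scale is pinned down by the trace constraint: given an
# unconstrained Sigma_tilde = alpha^2 * Sigma, the normalization
# alpha^2 = trace(Sigma_tilde)/C recovers the constrained Sigma.
Sigma_tilde = 4.0 * Sigma
alpha_sq = np.trace(Sigma_tilde) / C
assert np.allclose(Sigma_tilde / alpha_sq, Sigma)
```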
Direct posterior sampling from (2) is difficult due to the constraint on $\Sigma$. A technique for easier sampling is to augment the parameter space so that a conjugate prior can be specified, with the target parameters obtained by converting samples back to the normalized scale. The obvious choice of augmented parameter space is the one without the normalization for scale, i.e., $(\widetilde{W}, \tilde{\theta}, \widetilde{\Sigma})$ in (3). Imai & van Dyk suggest a constrained inverse Wishart prior for $\Sigma$ such that its joint distribution with $\alpha$ is equivalent to the unconstrained covariance matrix having prior distribution $\widetilde{\Sigma} \sim \text{inv-Wishart}(\nu, \Psi)$. This makes it possible to sample easily from the conditional posterior of $\widetilde{\Sigma}$. Setting $\nu = C + 1$ and $\Psi$ equal to the identity matrix is equivalent to sampling the corresponding correlations of $\widetilde{\Sigma}$ from a uniform distribution. When $\nu > C + 1$, the expectation of $\widetilde{\Sigma}$ has the closed form $E(\widetilde{\Sigma}) = \Psi/(\nu - C - 1)$.

The classic MNP framework assumes a linear model specification for each $W_l(X)$, i.e., $G_l(X; \theta_l) = X\theta_l$ for $l = 1, \ldots, C$. Kindo et al. proposed MPBART to increase the predictive power and the flexibility in dealing with complicated nonlinear and interaction effects. The innovative idea is to approximate each mean component of $W(X)$ using a sum of $m$ trees, $G_l(X; \theta_l) = \sum_{k=1}^{m} g(X; \theta_{lk})$, where $l = 1, \ldots, C$ and $\theta_{lk}$ is the set of parameters for the $k$th binary tree of the $l$th latent variable, $W_l(X)$. MPBART uses the same Bayesian regularization prior on the trees to restrict over-fitting as in Chipman et al. An important contribution of Kindo et al. is deriving from (2) the conditional distribution for Gibbs sampling of each individual tree and embedding it into the backfitting procedure of BART; see Chipman et al. for details on the BART backfitting procedure.

2.2 Data Augmentation Schemes

Suppose $y$ and $\phi$ represent the augmented data and model parameters, respectively. The goal of data augmentation (DA) schemes is to draw samples from $f(y, \phi)$. The sampling algorithm proposed by Kindo et al. for MPBART relies heavily on Imai & van Dyk's work on fitting the MNP, which explores different Gibbs samplers of $(W, \theta, \Sigma)$ under the umbrella of marginal data augmentation (MDA), an extension and improvement of the DA algorithm. This section provides a brief overview of the developments of the DA algorithm that are relevant to fitting the MPBART in Section 2.4.
Basic data augmentation. To begin with, we illustrate the simple task of sampling $(y, \phi)$ under the DA algorithm of Tanner & Wong:

Scheme [DA]
1. Draw $y \sim f(y \mid \phi)$.
2. Draw $\phi \sim f(\phi \mid y)$.
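For concreteness, the best-known instance of Scheme [DA] is Albert & Chib's binary-probit sampler, in which Step 1 is a truncated-normal draw of the latent data and Step 2 a conjugate Gaussian draw of the coefficients. A minimal sketch (our own illustration, not part of the MPBART algorithms), assuming a $N(0, \sigma_0^2 I)$ prior on $\phi$:

```python
import numpy as np
from scipy.stats import truncnorm

def probit_da(X, S, n_iter=2000, prior_var=100.0, seed=1):
    """Scheme [DA] for binary probit: S_i = 1{y_i > 0}, y_i ~ N(x_i' phi, 1)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = np.linalg.inv(X.T @ X + np.eye(p) / prior_var)  # cov of phi | y
    L = np.linalg.cholesky(V)
    phi = np.zeros(p)
    draws = []
    for _ in range(n_iter):
        # Step 1: y ~ f(y | phi), truncated normal with side set by S.
        mu = X @ phi
        lo = np.where(S == 1, -mu, -np.inf)   # standardized lower bounds
        hi = np.where(S == 1, np.inf, -mu)    # standardized upper bounds
        y = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # Step 2: phi ~ f(phi | y), conjugate multivariate normal draw.
        phi = V @ (X.T @ y) + L @ rng.standard_normal(p)
        draws.append(phi.copy())
    return np.array(draws)
```

For the MNP, the same template applies with vector-valued latent utilities and the truncation region implied by (1).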
Marginalized data augmentation (MDA). The basic idea of MDA, versus DA, is to expand the model and overparameterize $f(y, \phi)$ to $f(y, \phi, \alpha)$; the expansion parameter $\alpha$ often corresponds to a transformation of $y$ and/or $\phi$. For example, $\alpha$ may index a transformation of $y$ to $\tilde{y} = t_\alpha(y)$, where $t_\alpha$ is one-to-one and differentiable, expanding the model from $f(y, \phi)$ to $f(\tilde{y}, \phi, \alpha)$. The choice to sample from $f(y, \phi, \alpha)$ or $f(\tilde{y}, \phi, \alpha)$ depends on the specific model, and the two are usually interchangeable. This approach is appealing when sampling from $f(y, \alpha \mid \phi)$ or $f(\tilde{y}, \alpha \mid \phi)$ is easier than sampling $y$ alone. Liu & Wu and Meng & van Dyk developed MDA simultaneously. Liu & Wu provided theoretical results on the convergence rate of MDA. Meng & van Dyk introduced MDA under two augmentation schemes, grouping and collapsing; both procedures lead to the same distribution of $(y, \phi)$ as Scheme [DA].

MDA with grouping. The grouping scheme samples conditionally on the expansion parameter $\alpha$, while the collapsing scheme integrates $\alpha$ out of the joint distribution. MDA under the grouping scheme is preferred when sampling $y$ or $\phi$ jointly with $\alpha$ is easier than the corresponding step in Scheme [DA]. For example, when $f(\phi \mid y, \alpha)$ is easier to sample than $f(\phi \mid y)$, and $f(y, \alpha \mid \phi)$ is easy to sample, the sampler can "group" $y$ and $\alpha$ together and treat them as a single component:

Scheme [MDA-G]
1. Draw $(y, \alpha) \sim f(y, \alpha \mid \phi)$.
2. Draw $\phi \sim f(\phi \mid y, \alpha)$.

MDA with collapsing.
MDA under the collapsing scheme "collapses down" $\alpha$ by integrating it out of the joint distributions, i.e., $y \sim f(y \mid \phi) = \int f(y \mid \phi, \alpha) f(\alpha \mid \phi)\, d\alpha$ and $\phi \sim f(\phi \mid y) = \int f(\phi \mid y, \alpha) f(\alpha \mid y)\, d\alpha$. The implementation is as follows:

Scheme [MDA-C]
1. Draw $(y, \alpha) \sim f(y, \alpha \mid \phi)$ by $\alpha \sim f(\alpha \mid \phi)$ and $y \sim f(y \mid \phi, \alpha)$.
2. Draw $(\phi, \alpha) \sim f(\phi, \alpha \mid y)$ by $\alpha \sim f(\alpha \mid y)$ and $\phi \sim f(\phi \mid y, \alpha)$.

Notice that the newly sampled $\alpha$ is discarded in each step of Scheme [MDA-C]. In practice, it may be reasonable to assume a priori independence between $\phi$ and $\alpha$, because $\phi$ are parameters identified from the observed data, which contains no information on $\alpha$. Furthermore, given that transforming the augmented data $y$ is of interest, the conditional sampling of the model parameters $\phi$ may be more tractable under $\tilde{y}$ than under $y$. Accordingly, Scheme [MDA-C] can be rewritten as:

Scheme [MDA-C']
1. Draw $(\tilde{y}, \alpha)$ by drawing $\alpha \sim f(\alpha)$ and then $y \sim f(y \mid \phi, \alpha)$, and compute $\tilde{y} = t_\alpha(y)$.
2. Draw $(\phi, \alpha)$ by drawing $\alpha \sim f(\alpha \mid \tilde{y})$ and then $\phi \sim f(\phi \mid \tilde{y}, \alpha)$.

Here $f(\alpha)$ and $f(\alpha \mid \tilde{y})$ are the prior and the posterior (under the transformed augmented data) of $\alpha$, respectively. The optimality of MDA under the collapsing scheme (Scheme [MDA-C]) over the DA algorithm (Scheme [DA]) in terms of convergence rate is proven in Meng & van Dyk and Liu & Wu. Liu & Wu also introduced Scheme [MDA-LW], which is equivalent to Scheme [MDA-C'] in terms of the sampling distribution and the rate of convergence. This scheme is implicitly applied in the algorithms for fitting the MNP and the MPBART, typically in the normalization of model parameters after each round of Gibbs sampling. Structurally, Scheme [MDA-LW] is in the form of Scheme [DA] with an additional intermediate step, which makes clearer the connection between the MDA and the DA algorithm:

Scheme [MDA-LW]
1. Draw $y \sim f(y \mid \phi)$.
2. Draw $\alpha \sim f(\alpha)$ and compute $\tilde{y} = t_\alpha(y)$; draw $\alpha \sim f(\alpha \mid \tilde{y})$ and compute $y' = t_\alpha^{-1}(\tilde{y})$.
3. Draw $\phi \sim f(\phi \mid y')$.

Note that $y$ and $y'$ follow the same distribution. The intuition behind the improvement of Scheme [MDA-LW] over the DA algorithm is that the intermediate step of sampling $y'$ allows the sampler for $\phi$ to explore the expanded model space with more freedom.
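Since the MDA variants differ only in where the expansion parameter is drawn and discarded, the structure of Scheme [MDA-LW] can be captured in a model-agnostic skeleton; the conditional samplers below are user-supplied placeholders (hypothetical, for illustration only):

```python
def mda_lw_step(phi, draw_y, draw_alpha_prior, draw_alpha_post, t, t_inv, draw_phi):
    """One iteration of Scheme [MDA-LW].

    draw_y(phi)                   -- samples y ~ f(y | phi)
    draw_alpha_prior()            -- samples alpha ~ f(alpha)
    draw_alpha_post(y_tilde)      -- samples alpha ~ f(alpha | y~)
    t(alpha, y), t_inv(alpha, y~) -- the one-to-one transformation and its inverse
    draw_phi(y)                   -- samples phi ~ f(phi | y)
    """
    y = draw_y(phi)                    # Step 1: standard DA draw of augmented data
    alpha = draw_alpha_prior()         # Step 2: expand ...
    y_tilde = t(alpha, y)
    alpha = draw_alpha_post(y_tilde)   # ... redraw alpha given transformed data ...
    y_prime = t_inv(alpha, y_tilde)    # ... and map back: y' has the same law as y
    return draw_phi(y_prime)           # Step 3: update phi from the "jittered" y'
```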
For fitting the MNP, Imai & van Dyk introduced two algorithms for the Gibbs sampling of $(W, \theta, \Sigma)$, which we refer to as IvD1 and IvD2. IvD1 modifies Scheme [MDA-C'] by expanding the model to $(\widetilde{W}, \tilde{\theta}, \widetilde{\Sigma}, \alpha)$ such that $\widetilde{W}$ and $(\tilde{\theta}, \widetilde{\Sigma})$ correspond to $\tilde{y}$ and $\phi$, respectively, and $\alpha = (\alpha_1, \alpha_2, \alpha_3)$:

Scheme [IvD1]

1. Draw $(\widetilde{W}, \alpha_1)$ by drawing $\alpha_1 \sim f(\alpha \mid \Sigma)$ and then $W \sim f(W \mid \theta, \Sigma)$, and compute $\widetilde{W} = \alpha_1 W$.
2. Draw $(\tilde{\theta}, \alpha_2)$ by drawing $\alpha_2 \sim f(\alpha \mid \widetilde{W}, \Sigma)$ and then $\tilde{\theta} \sim f(\tilde{\theta} \mid \alpha_2, \widetilde{W}, \Sigma)$, and compute $\theta = \tilde{\theta}/\alpha_2$.
3. Draw $(\widetilde{\Sigma}, \alpha_3)$ by $\widetilde{\Sigma} \sim f(\widetilde{\Sigma} \mid \widetilde{W} - X\tilde{\theta})$ and compute $\alpha_3 = \sqrt{\operatorname{trace}(\widetilde{\Sigma})/C}$.

Using $\widetilde{\Sigma}$ and $\alpha_3$ from Step 3, we can compute the normalized covariance matrix $\Sigma = \widetilde{\Sigma}/\alpha_3^2$ and use it in Steps 1 and 2 of the next round of posterior sampling; this is analogous to having $\alpha_3$ index a one-to-one mapping from the expanded model space ($\widetilde{\Sigma}$) to the normalized space ($\Sigma$). Steps 1 and 3 of Scheme [IvD1] collapse down $\alpha_1$ and $\alpha_3$, but Scheme [IvD1] is not a direct implementation of MDA as in Scheme [MDA-C'], because Step 1 is conditional on $\theta$, or equivalently on $(\tilde{\theta}, \alpha_2)$ where $\theta = \tilde{\theta}/\alpha_2$. Hence, Step 2 does not integrate out (collapse down) $\alpha_2$.

Standard MDA (Schemes [MDA-C] and [MDA-C']) yields collapsed Gibbs samplers that integrate out the expansion parameter(s) by redrawing and discarding $\alpha$ in every step. Scheme [IvD1] is in fact a partially marginalized augmentation (PMA) procedure that relaxes the restrictive structure of full marginalization in MDA. PMA allows the conditional distribution in the $k$th step of the Gibbs sampler to depend on expansion parameter(s) drawn in other steps. The algorithms for fitting the MPBART in Section 2.4 are generally also PMA procedures.
IvD1 can also be viewed from a different perspective. Due to the linearity of the MNP model specification, i.e., $G_l(X; \theta_l) = X\theta_l$ for $l = 1, \ldots, C$, the linear relationship between $\theta$ and $\tilde{\theta}$ holds in Step 2 of Scheme [IvD1], which is therefore equivalent to directly sampling $\theta$ from $f(\theta \mid \widetilde{W}/\alpha_1, \Sigma)$. Hence, IvD1 can be rearranged as follows:

Scheme [IvD1']

1. Draw $W \sim f(W \mid \theta, \Sigma)$.
2. Draw $\alpha_1 \sim f(\alpha \mid \Sigma)$ and compute $\widetilde{W} = \alpha_1 W$; draw $\alpha_2 \sim f(\alpha \mid \widetilde{W}, \Sigma)$ and compute $W' = \widetilde{W}/\alpha_2$.
3. Draw $\theta \sim f(\theta \mid W', \Sigma)$.
4. Draw $\Sigma$ by $\widetilde{\Sigma} \sim f(\widetilde{\Sigma} \mid \widetilde{W} - X\tilde{\theta})$; compute $\alpha_3 = \sqrt{\operatorname{trace}(\widetilde{\Sigma})/C}$ and $\Sigma = \widetilde{\Sigma}/\alpha_3^2$, where $\tilde{\theta} = \alpha_2 \theta$.

The first three steps are equivalent to sampling $f(W, \theta \mid \Sigma)$ as in Scheme [MDA-LW]. Step 4 collapses down $\alpha_3$, but the fact that Step 4 is conditional on $(\alpha_1, \alpha_2)$ through $(\widetilde{W}, \tilde{\theta})$ means that IvD1 is not, collectively, a collapsed Gibbs sampler. IvD2 is given as follows:

Scheme [IvD2]
1. Draw $(\tilde{\epsilon}, \alpha_1)$ by $\alpha_1 \sim f(\alpha \mid \Sigma)$ and $W \sim f(W \mid \theta, \Sigma)$, and compute $\tilde{\epsilon} = \alpha_1 [W - G(X; \theta)]$.
2. Draw $(\Sigma, \alpha_3)$ by $\widetilde{\Sigma} \sim f(\widetilde{\Sigma} \mid \tilde{\epsilon})$; compute $\alpha_3 = \sqrt{\operatorname{trace}(\widetilde{\Sigma})/C}$ and $\Sigma = \widetilde{\Sigma}/\alpha_3^2$.
3. Draw $\theta \sim f(\theta \mid W, \Sigma)$.

IvD2 separates the sampling into two parts, $(\tilde{\epsilon}, \Sigma)$ and $\theta$; the first part utilizes MDA under Scheme [MDA-C], and the second part is a standard Gibbs sampling draw. Theoretically, as stated in Imai & van Dyk, IvD1 and IvD2 have the same lag-one autocorrelation when the MCMC chain is stationary. However, they showed through numerical experiments that IvD1 is better than IvD2 for estimating the MNP, in the sense of being less sensitive to the starting values of $(\theta, \Sigma)$.

In the next section, we describe Kindo et al.'s algorithm (KD) and our two new proposals, and connect them to the schemes reviewed here.

2.4 Posterior Sampling Algorithms of MPBART

For ease of notation, let $W_{i,-j} = (W_{i1}, \ldots, W_{i,j-1}, W_{i,j+1}, \ldots, W_{iC})$, and let $\mu = G(X; \theta)$ be the sum-of-trees component under the normalization of scale. Kindo et al.'s algorithm for fitting the MPBART can be summarized as the following augmented Gibbs sampler:

Algorithm [KD]
1. Sample $\widetilde{W}, \alpha_1 \mid \mu, \Sigma, S$.
   (a) Draw $\alpha_1^2$ from its conditional prior $f(\alpha^2 \mid \Sigma) = \operatorname{trace}[\Psi \Sigma^{-1}]/\chi^2_{\nu C}$;
   (b) for each $j$, update $W_{ij}$ conditional on $W_{i,-j}$, $\mu$, $\Sigma$, and the observed outcome $S_i$, from a truncated normal distribution; and
   (c) transform $W_i$ and $\Sigma$ to $\widetilde{W}_i = \alpha_1 W_i$ and $\widetilde{\Sigma}^* = \alpha_1^2 \Sigma$.
2. Sample $\tilde{\theta} \mid \widetilde{W}, \alpha_1, \Sigma$.
   (a) Draw $\tilde{\theta} \sim f(\tilde{\theta} \mid \widetilde{W}, \widetilde{\Sigma}^*)$; and
   (b) set $\tilde{\mu} = G(X; \tilde{\theta})$ and $\mu = \tilde{\mu}/\alpha_1$.
3. Sample $\Sigma, \alpha_3 \mid \widetilde{W}, \tilde{\theta}$.
   (a) Draw $\widetilde{\Sigma} \sim \text{Inv-Wishart}(N + \nu, \Psi + \sum_{i=1}^N \tilde{\epsilon}_i \tilde{\epsilon}_i^T)$, where $\tilde{\epsilon}_i = \widetilde{W}_i - \tilde{\mu}_i$;
   (b) set $\alpha_3^2 = \operatorname{trace}(\widetilde{\Sigma})/C$; and
   (c) set $\Sigma = \widetilde{\Sigma}/\alpha_3^2$ and $W = \mu + \tilde{\epsilon}/\alpha_3$.

Step 1 jointly samples from $f(\widetilde{W}, \alpha_1 \mid \mu, \Sigma, S)$ by first drawing the expansion parameter $\alpha_1$ from its prior distribution $f(\alpha \mid \Sigma)$ and then computing $\widetilde{W} = \alpha_1 W$, where $W$ is sampled from $f(W \mid \mu, \Sigma, S)$. Step 1(a) samples $\alpha_1$ such that $\alpha_1^2/\operatorname{trace}[\Psi \Sigma^{-1}]$ follows an inverse chi-squared distribution with $\nu C$ degrees of freedom. Step 1(b) samples each $W_{ij}$ from a truncated normal distribution, described in Appendix D.1, based on (1): the observed outcome $S_i$ imposes an interval constraint on $W_i$; e.g., if $S_i$ equals the reference level 0, then the $W_{ij}$'s are right-truncated at 0 (a code sketch of this update follows this discussion). Step 2 samples posterior trees across the multivariate mean components by Gibbs sampling, with each posterior tree sampled as in regular BART. Step 3 computes $\alpha_3$ using the sampled $\widetilde{\Sigma}$ and then normalizes the scale of the model in Step 3(c).

Notice that the sampling of the model parameters $\tilde{\theta}$ is conditional on $(\widetilde{W}, \widetilde{\Sigma}^*)$, which is equivalent to conditioning on $(\widetilde{W}, \alpha_1, \Sigma)$ or $(W, \alpha_1, \Sigma)$; this observation is essential to the analysis of Algorithm [KD] in Section 2.5. Algorithm [KD] is closely related to IvD1 (Scheme [IvD1]) but differs in that it does not update the expansion parameter $\alpha_2$ as in Step 2 of IvD1; this is analogous to setting $\alpha_2$ in IvD1 to the sampled value of $\alpha_1$ from Step 1(a). The reason for this modification is that the posterior tree parameters in BART, denoted by $\theta$, are drawn via stochastic search; it would be extremely challenging to derive an analytical expression for $f(\alpha \mid \widetilde{W}, \Sigma)$ from $\int f(\alpha, \theta \mid \widetilde{W}, \Sigma)\, d\theta$ as in the MNP, since the specification is no longer linear in $\theta$.

In the first step, $\widetilde{W}$ is a scaled version of $W$ through $\widetilde{W} = \alpha_1 W$. From (3), fitting the sum-of-trees component to $\widetilde{W}$ is analogous to sampling the parameters in an un-normalized space. Posterior tree sampling in BART makes a one-step update of each tree from its previous state, via one of four types of proposals: GROW, PRUNE, CHANGE, and SWAP. Stochastic search over a massive space of possible tree structures can be challenging when $\widetilde{W}$, the quantity to which the sum of trees is being fitted, is unstable. Heuristically, we would expect fitting the sum-of-trees component to the normalized quantity $W$, instead of $\widetilde{W}$, to be more stable, induce better posterior convergence, and improve prediction accuracy.
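The truncated-normal update referenced in Step 1(b) combines standard multivariate-normal conditioning with the interval constraints implied by (1); Appendix D.1 gives the details, which we do not reproduce. A sketch of Step 1(a) and of one sweep of Step 1(b), under our own coding of the categories (0 = reference, $W$ 0-indexed in code):

```python
import numpy as np
from scipy.stats import truncnorm, chi2

def draw_alpha1_sq(Sigma, Psi, nu, rng):
    """Step 1(a): alpha1^2 ~ trace(Psi Sigma^{-1}) / chi^2_{nu*C}."""
    C = Sigma.shape[0]
    return np.trace(Psi @ np.linalg.inv(Sigma)) / chi2.rvs(nu * C, random_state=rng)

def gibbs_update_Wi(W_i, mu_i, Sigma, S_i, rng):
    """Step 1(b): one sweep over j of W_ij | W_i,-j, mu_i, Sigma, S_i.

    Constraints from (1): S_i = k (k >= 1) iff W_ik = max(W_i) >= 0;
    S_i = 0 iff max(W_i) < 0. Code index j corresponds to category j + 1.
    """
    C = len(W_i)
    for j in range(C):
        idx = [l for l in range(C) if l != j]
        # conditional mean/variance of W_ij given W_i,-j under MVN(mu_i, Sigma)
        s12 = Sigma[j, idx]
        s22_inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
        m = mu_i[j] + s12 @ s22_inv @ (W_i[idx] - mu_i[idx])
        sd = np.sqrt(Sigma[j, j] - s12 @ s22_inv @ s12)
        if S_i == j + 1:      # W_ij must be the nonnegative maximum
            lo, hi = max(0.0, W_i[idx].max(initial=-np.inf)), np.inf
        elif S_i == 0:        # all utilities negative
            lo, hi = -np.inf, 0.0
        else:                 # W_ij must stay below the winning utility
            lo, hi = -np.inf, W_i[S_i - 1]
        W_i[j] = m + sd * truncnorm.rvs((lo - m) / sd, (hi - m) / sd, random_state=rng)
    return W_i
```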
Given these considerations, we modify Algorithm [KD] and propose the following:

Algorithm [P1]

1. Sample $W, \alpha_1 \mid \mu, \Sigma, S$.
   (a) Draw $\alpha_1^2$ from its conditional prior $f(\alpha^2 \mid \Sigma) = \operatorname{trace}[\Psi \Sigma^{-1}]/\chi^2_{\nu C}$;
   (b) for each $j$, update $W_{ij}$ conditional on $W_{i,-j}$, $\mu$, $\Sigma$, and the observed outcome $S_i$, from a truncated normal distribution; and
   (c) transform $W_i$ to $\widetilde{W}_i = \alpha_1 W_i$.
2. Sample $\theta \mid W, \Sigma$. Draw $\theta \sim f(\theta \mid W, \Sigma)$ and then set $\mu = G(X; \theta)$.
3. Sample $\Sigma, \alpha_3 \mid W, \alpha_1, \theta$.
   (a) Draw $\widetilde{\Sigma} \sim \text{Inv-Wishart}(N + \nu, \Psi + \sum_{i=1}^N \tilde{\epsilon}_i \tilde{\epsilon}_i^T)$, where $\tilde{\epsilon}_i = \alpha_1 (W_i - \mu_i)$;
   (b) set $\alpha_3^2 = \operatorname{trace}(\widetilde{\Sigma})/C$; and
   (c) set $\Sigma = \widetilde{\Sigma}/\alpha_3^2$ and $W = \mu + \tilde{\epsilon}/\alpha_3$.

In the first proposal (Algorithm [P1]), the expansion parameters $(\alpha_1, \alpha_3)$ do not affect the sampling of the trees in Step 2. If the order of Steps 2 and 3 were swapped, the algorithm would become Scheme [IvD2] of Section 2.2. Algorithms [KD] and [P1] are the MPBART analogues of IvD1 and IvD2 for the MNP. Imai & van Dyk expected IvD1 to outperform IvD2 for the MNP and demonstrated this through simulations. For MPBART, in contrast, we find Algorithm [P1] to be equal or superior to Algorithm [KD] both theoretically (Section 2.5) and computationally (Sections 3 and 4).
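Step 3 has the same shape in Algorithms [KD] and [P1] (and in Algorithm [P2] below, with $\alpha_1 = 1$): an unconstrained inverse-Wishart draw followed by renormalization to the trace constraint. A minimal sketch using SciPy (our own illustration):

```python
import numpy as np
from scipy.stats import invwishart

def draw_Sigma_normalized(eps, Psi, nu, rng):
    """Steps 3(a)-(c): Sigma_tilde ~ Inv-Wishart(N + nu, Psi + sum_i eps_i eps_i'),
    then alpha3^2 = trace(Sigma_tilde)/C and Sigma = Sigma_tilde / alpha3^2.

    eps : (N, C) residual matrix; alpha1*(W - mu) for [KD]/[P1], W - mu for [P2].
    """
    N, C = eps.shape
    Sigma_tilde = invwishart.rvs(df=N + nu, scale=Psi + eps.T @ eps, random_state=rng)
    alpha3_sq = np.trace(Sigma_tilde) / C
    return Sigma_tilde / alpha3_sq, np.sqrt(alpha3_sq)
```

The renormalization implements the collapsing of the expansion parameter $\alpha_3$: the unconstrained draw is mapped back to the trace-constrained space, and $\alpha_3$ itself is discarded.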
As an alternative to Algorithm [P1], we introduce another proposal, Algorithm [P2], which "abandons" the MDA framework and adopts a Scheme [MDA-LW]-like strategy only for Step 3. If we fix $\alpha_1$ to be 1, both Algorithms [KD] and [P1] simplify to Algorithm [P2]. We show in Appendix B that Algorithms [P1] and [P2] draw $\Sigma$ from approximately the same sampling distribution under certain conditions.

Algorithm [P2]

1. Sample $W \mid \mu, \Sigma, S$. For each $j$, update $W_{ij}$ conditional on $W_{i,-j}$, $\mu$, $\Sigma$, and $S_i$, from a truncated normal distribution.
2. Sample $\theta \mid W, \Sigma$. Draw $\theta \sim f(\theta \mid W, \Sigma)$ and then set $\mu = G(X; \theta)$.
3. Sample $\Sigma, \alpha_3 \mid W, \theta$.
   (a) Draw $\widetilde{\Sigma} \sim \text{Inv-Wishart}(N + \nu, \Psi + \sum_{i=1}^N \epsilon_i \epsilon_i^T)$, where $\epsilon_i = W_i - \mu_i$;
   (b) set $\alpha_3^2 = \operatorname{trace}(\widetilde{\Sigma})/C$; and
   (c) set $\Sigma = \widetilde{\Sigma}/\alpha_3^2$ and $W = \mu + \epsilon/\alpha_3$.

Appendix D provides more details on the implementation of the algorithms. Software for fitting both KD and our proposed algorithms is available on the first author's GitHub page (https://github.com/yizhenxu/GcompBART).
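Putting the steps together, one Gibbs iteration of Algorithm [P2] has the following shape. This is a structural sketch only: `gibbs_update_Wi` and `draw_Sigma_normalized` refer to the illustrative helpers sketched earlier, and `bart_backfit` is a hypothetical stand-in for a full BART backfitting pass, which we do not reproduce.

```python
def p2_iteration(W, mu, Sigma, S, X, bart_backfit, Psi, nu, rng):
    """One iteration of Algorithm [P2] for data (X, S) and latent W (N x C)."""
    N, C = W.shape
    # Step 1: W | mu, Sigma, S -- truncated-normal sweep, no expansion parameter.
    for i in range(N):
        W[i] = gibbs_update_Wi(W[i], mu[i], Sigma, S[i], rng)
    # Step 2: theta | W, Sigma -- one backfitting pass over the m trees of each
    # latent component, fitted to the *normalized* W; returns mu = G(X; theta).
    mu = bart_backfit(X, W, Sigma)
    # Step 3: (Sigma, alpha3) | W, theta -- inverse-Wishart draw + renormalization,
    # then keep W on the normalized scale via W = mu + eps / alpha3.
    Sigma, alpha3 = draw_Sigma_normalized(W - mu, Psi, nu, rng)
    W = mu + (W - mu) / alpha3
    return W, mu, Sigma
```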
2.5 Mixing Rates under Stationarity

In what follows, we assume the Markov chain of $(W, \theta, \Sigma)$ has reached stationarity. Liu introduced the use of diagrams that show the dependency structure between two consecutive iterations for analyzing Bayesian algorithms. We do this for Algorithms [KD], [P1], and [P2], and derive their mixing rates in terms of autocovariances. We restate the algorithms under the expanded model $(W, \mu, \Sigma, \alpha)$, where $W$ denotes the normalized latent variables with distribution $MVN(\mu, \Sigma)$ and $\alpha$ is the expansion parameter:

Algorithm [KD]: $(W, \alpha_1) \mid \mu, \Sigma \;\Rightarrow\; \mu \mid (W, \alpha_1), \Sigma \;\Rightarrow\; (\Sigma, \alpha_3) \mid (W, \alpha_1), \mu$,
Algorithm [P1]: $(W, \alpha_1) \mid \mu, \Sigma \;\Rightarrow\; \mu \mid W, \Sigma \;\Rightarrow\; (\Sigma, \alpha_3) \mid (W, \alpha_1), \mu$,
Algorithm [P2]: $W \mid \mu, \Sigma \;\Rightarrow\; \mu \mid W, \Sigma \;\Rightarrow\; (\Sigma, \alpha_3) \mid W, \mu$,

where the $\alpha$'s are indexed as in Scheme [IvD1].

We make a few observations about these three algorithms: (a) Algorithm [KD] groups $W$ and $\alpha_1$ together, as in Scheme [MDA-G]; (b) Algorithm [P1] is structurally equivalent to Scheme [IvD2]; and (c) the sampling of the normalized covariance matrix in all three algorithms integrates out $\alpha_3$ as in Scheme [MDA-C], i.e., $\Sigma \sim \int f(\Sigma, \alpha_3 \mid W, \mu)\, d\alpha_3$ in Algorithm [P2], and $\Sigma \sim \int f(\Sigma, \alpha_3 \mid W, \alpha_1, \mu)\, d\alpha_3$ in Algorithms [KD] and [P1]. Based on these observations, we derive the dependency structures shown as diagrams in Figure 1.

A common measure for quantifying the mixing rate of a Markov chain is the lag-1 autocorrelation; lower autocorrelation indicates a better mixing rate. Using the dependency diagrams, we argue that Algorithm [P2] has the best mixing rate when the Markov chain is stationary.

Theorem 1. Assuming the chain of MPBART parameters $(W, \mu, \Sigma, \alpha)$ has reached equilibrium:
1. for $\mu$, Algorithms [P1] and [P2] have the same lag-1 autocorrelation, which is no larger than that of Algorithm [KD];
2. for $\Sigma$, Algorithms [KD] and [P1] have the same lag-1 autocorrelation, which is no less than that of Algorithm [P2].

Proof: Appendix A.

3 Simulation Studies

This simulation study compares the three algorithms in terms of MCMC convergence and prediction accuracy. We consider two metrics of predictive accuracy, defined as follows. For subject $i = 1, \ldots, N$ in a dataset used for calculating predictive accuracy, given covariates $X_i$ and posterior samples of the model parameters $\{\theta^{(j)}, \Sigma^{(j)} : j = 1, \ldots, J\}$, the posterior predictive distribution of $S_i$ can be represented by its $J$ posterior predictions $\{\hat{S}_i^{(1)}, \ldots, \hat{S}_i^{(J)}\}$, where

$$\hat{S}_i^{(j)} = \begin{cases} l & \text{if } \max(\hat{W}_i^{(j)}) = \hat{W}_{il}^{(j)} \geq 0 \\ 0 & \text{if } \max(\hat{W}_i^{(j)}) < 0, \end{cases} \qquad (4)$$

$\hat{W}_i^{(j)} = (\hat{W}_{i1}^{(j)}, \ldots, \hat{W}_{iC}^{(j)})$ is the vector of latent variables, $\hat{W}_i^{(j)} \sim MVN(G(X_i; \theta^{(j)}), \Sigma^{(j)})$, and $G(X_i; \theta^{(j)}) = (G_1(X_i; \theta_1^{(j)}), \ldots, G_C(X_i; \theta_C^{(j)}))$. Recall that each mean component is parameterized as a sum of trees, $G_l(X_i; \theta_l^{(j)}) = \sum_{k=1}^m g(X_i; \theta_{lk}^{(j)})$, for $l = 1, \ldots, C$.

We use posterior percent agreement and posterior mode to assess prediction accuracy. Posterior percent agreement is the percent concordance between the observed outcome $S_i$ and the posterior predictions $\{\hat{S}_i^{(1)}, \ldots, \hat{S}_i^{(J)}\}$, averaged over the $N$ observations and $J$ posterior samples,

$$\frac{1}{N} \sum_{i=1}^N \left\{ \frac{1}{J} \sum_{j=1}^J \mathbb{1}\{\hat{S}_i^{(j)} = S_i\} \right\}. \qquad (5)$$

We also summarize agreement between the observed outcome and the posterior mode via

$$\frac{1}{N} \sum_{i=1}^N \mathbb{1}\{\mathrm{mode}_j(\hat{S}_i^{(j)}) = S_i\}, \qquad (6)$$

where $\mathrm{mode}_j(\hat{S}_i^{(j)}) = \operatorname{argmax}_l \sum_{j=1}^J \mathbb{1}\{\hat{S}_i^{(j)} = l\}$ is the most frequent outcome prediction among the posterior predictive draws for individual $i$. The two measures differ in that (6) ignores the infrequent categories in the MCMC sampling, while (5) uses all posterior predictions.

Numerical experiments for the simulations use 30,000 posterior draws after a burn-in of 50,000 for each model. We use a sum of 100 trees for the mean component of each latent variable. Following Chipman et al., we set the proposal probabilities for the posterior tree search to 0.25, 0.25, 0.4, and 0.1 for GROW, PRUNE, CHANGE, and SWAP, respectively, and use the regularization priors for trees recommended in Chipman et al. The prior specification for the latent variable covariance matrix assumes the scale matrix $\Psi$ has diagonal elements equal to 1.

For each simulation, we create a training set $D_1$ and a test set $D_2$, each of size 5000. Under different specifications of the reference level and of the prior on the covariance matrix, we apply Algorithms [KD], [P1], and [P2] to $D_1$; for each set of posterior samples from each algorithm, the corresponding out-of-sample performance is calculated by evaluating (5) and (6) on $D_2$.

We use two specifications similar to those of Kang & Schafer for generating the outcome distribution in $D_1$ and $D_2$. We set $C = 2$ and assume covariates $(U, V)$, where $U = (U_1, \ldots, U_5) \stackrel{iid}{\sim} \mathrm{Uniform}(0, 1)$ and $V \sim \mathrm{Uniform}(0, 1)$, and set $G_1 = 15 \sin(\pi U_1 U_2) + (U_3 - 0.5)^2 - U_4 - U_5$. For Setting 1, we set $G_2 = (U_1 - 0.5)^2 - U_2 U_3 + 4V$. For Setting 2, $G_2$ takes the same functional form with constants chosen to yield a highly imbalanced distribution across the categories, similar to our application. The covariance matrix is

$$\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$$

for both settings.

3.1 Setting 1

Under Setting 1, the frequency of the outcome alternatives for the first simulated dataset was $(2227, \ldots)$ and $(2225, \ldots)$ for the training and test set, respectively.
Table 1 compares the out-of-sample accuracy of Algorithms [KD], [P1], and [P2] for Setting 1, with the reference level set to 1, 2, and 3, under the prior $\widetilde{\Sigma} \sim \text{Inv-Wishart}(3, I)$. Algorithms [P1] and [P2] have similar accuracy and are always better than Algorithm [KD]. The accuracy values differ markedly between the metrics in (5) and (6); compared to Algorithm [KD], the improvement from Algorithms [P1] and [P2] is much larger under the percent agreement metric (5). In addition, Algorithm [KD] is sensitive to the choice of reference level: its posterior agreement accuracy under-performs under reference level 1, while the other two algorithms stay relatively stable.

Because the covariance $\Sigma$ of the latent variables partially determines the outcome distribution and affects predictive accuracy, we also investigate how the three algorithms behave in estimating $\Sigma$ when the reference level for MPBART is the same as that used for data generation, i.e., under reference level 3, and how different prior specifications on $\Sigma$ affect the posterior accuracy and the estimation of $\Sigma$. In particular, we consider the Inv-Wishart$(\nu, \Psi)$ prior for $\widetilde{\Sigma}$ with $\Psi_{11} = \Psi_{22} = 1$, comparing a uniform ($\nu = C + 1$, $\Psi_{12} = 0$), negatively tilted ($\nu = C + 3$, $\Psi_{12} = -0.5$), and positively tilted ($\nu = C + 3$, $\Psi_{12} = 0.5$) prior for the un-normalized covariance $\sigma_{12}$. Table 3 compares the out-of-sample accuracy, and Table 4 reports the posterior mean and 95% credible intervals of the normalized covariance matrix entries $\sigma_{11}$ and $\sigma_{12}$. We do not examine $\sigma_{22}$ because $\Sigma$ is normalized on its scale, so $\sigma_{22} = \operatorname{trace}(\Sigma) - \sigma_{11}$.

Appendix C shows how $\sigma_{12}$ affects the outcome distribution, given $\sigma_{11} = \sigma_{22} = 1$. Conditional on $(G_1, G_2)$, $\sigma_{12}$ has a substantial effect on the outcome predictive distribution, as shown in Figure 7, and a negative $\sigma_{12}$ induces a smaller reference-level outcome probability $P(S = 3)$. Conversely, a negative estimated posterior mean of $\sigma_{12}$, as produced by Algorithm [KD], may lead to estimates of $G_1$ and $G_2$ that are systematically different from those of the true data generating mechanism, in which $\sigma_{12}$ is positive.

It is clear from Table 3 that, compared to Algorithms [P1] and [P2], Algorithm [KD] is more sensitive to the prior specification on $\Sigma$. Notably, Algorithm [KD] systematically gives a negative estimated posterior mean of the covariance $\sigma_{12}$, even though the empirical $\operatorname{corr}(W_1, W_2)$ in the simulated training data is 0.424 and the true conditional correlation $\operatorname{corr}(W_1, W_2 \mid G)$ is 0.5; the posterior estimate of $\operatorname{cov}(W_1, W_2)$ is positive under Algorithms [P1] and [P2]. Table 4 shows that the three algorithms are otherwise generally robust to the specification of the covariance hyperparameters. We also repeat the simulation 50 times and report the mean and standard deviation of the posterior mean of $\Sigma$ across the 50 replications (Table 7), where $E[\cdot \mid D]$ is the posterior mean based on one simulated dataset $D$, and $E\{E[\cdot \mid D]\}$ and $S\{E[\cdot \mid D]\}$ are the mean and standard deviation of $E[\cdot \mid D]$ across the 50 replications. The conclusions from Table 7 mirror those from Table 4, and the conclusions on posterior predictive accuracy across the 50 replications (Table 8) match those from a single run of the simulation (Table 3). Trace plots of the MCMC convergence for the sum-of-trees and the covariance components are given in Figures 2 and 5, respectively.
For the first simulated datasetunder Setting 2, the frequency of outcome alternatives was (1557 , , and (1601 , , for the training and test set, respectively; the frequency of S = 3 is less than 4 % , representing avery unbalanced outcome distribution. Table 2 compares the out-of-sample accuracy of Algorithms[KD], [P1], and [P2] when the reference level is respectively set to 1, 2, and 3. The conclusions are19imilar to that from Table 1, but the sensitivity of Algorithm [KD] to the choice of reference levelis even more pronounced here, while the behavior of Algorithms [P1] and [P2] remains consistentregardless of that choice.Under reference level 3, Table 5 shows the comparison of the out-of-sample accuracy and Table6 shows the posterior mean and 95 % credible intervals of the normalized covariance matrix entries σ and σ , with an inverse Wishart prior on (cid:101) Σ respectively comparing a uniform ( ν = C +1 , Ψ =0 ), negatively tilted ( ν = C + 3 , Ψ = − . ), and positively tilted ( ν = C + 3 , Ψ = 0 . ) priorfor the un-normalized correlation. Similar conclusions to Setting 1 from Tables 3 and 4 can beseen here, except that the deviation in the estimate of the covariance σ between Algorithm [KD]and Algorithms [P1] and [P2] is even more extreme in this case with the outcome unbalanced. Asmentioned in Section 3.1, the conditional covariance between W and W is 0.5, and the estimatedposterior mean of covariance from Algorithms [P1] and [P2] taking values near 0.8, which agreeswith the true correlation in sign. Algorithm [KD] yields negative estimates around -0.35 and itsimpact on predicted value accuracy is similarly poor as in Setting 1.Table 9 and 10 report the posterior mean of Σ and the posterior accuracy, by their mean and stan-dard deviation across 50 replications. As before, the observations are consistent with those underone run of simulation. Moreover, Figure 3 and 6 provide diagnostic plots of the MCMC convergencefor the sum-of-trees and the covariance components, respectively; it is obvious that Algorithms [P1]and [P2] converge faster than Algorithm [KD]. When the outcome is highly imbalanced, posteriorconvergence is more difficult than the balanced case. Compared to Algorithms [P1] and [P2], theapproach Algorithm [KD] uses to update the sum-of-trees component using un-normalized latentutilities may further obstruct posterior convergence.The simulation study in Sections 3.1 and 3.2 shows that: (a) the proposals, Algorithms [P1] and[P2], have better posterior predictive accuracy than Algorithm [KD] in general; (b) the proposalsare more robust than Algorithm [KD] such that the degree of accuracy is unaffected by choices ofreference level, imbalance in the categorical distribution, and the choice of prior specifications inthe sum-of-trees and variance components; (c) sign of the posterior estimates for the covariance20arameter agrees with the underlying truth under the proposals but not under Algorithm [KD]. In this application, we investigate patients’ retention in HIV care after enrollment as a functionof their baseline characteristics and treatment status. The data were extracted from electronic healthrecords of adults enrolled in HIV care between June 1st 2008 and August 23rd 2016 in AMPATH,an HIV care monitoring program in Kenya. 
We look at a 200-days window after the initial care en-counter and split the data into training and test sets of sample sizes 49,942 and 26,714, respectively.We define the outcome as disengagement, engagement, and reported death, where engagement incare means there was at least one visit to the clinic for HIV care during the first 200 days after apatient’s initial encounter, and disengagement otherwise if the person was not reported dead. Theoutcome distribution is extremely imbalanced, such that the frequency of disengagement, engage-ment, and death is 16 % , 80 % , and 4 % , respectively. Covariates include baseline age, gender, yearof enrollment, travel time to clinic, marriage status, weight, height, baseline treatment status, indi-cation of CD4 measurement at or post baseline, and the most recent CD4. Table 11 summarizesthe observed distribution of each covariate stratified by outcome level.We use 10,000 posterior draws after a burn-in of 10,000 and keep other settings the same asin simulations. Table 12 compares the posterior accuracy for Algorithms [KD], [P1], and [P2].Algorithm [KD] has posterior mode accuracy close to, but not as good as, that from Algorithms[P1] and [P2]. In terms of posterior agreement accuracy, Algorithm [KD] is substantially inferiorto the proposals, indicating that the proposals yield a better separation between the random utilityfor the true outcome level and those for the other outcome alternatives. In terms of the stability inaccuracy measures with respect to the choice of reference level, the proposals also performed betterthan Algorithm [KD].Under the reference level being disengagement, the first row of Figure 4 presents the MCMCconvergence plots of the average tree depth corresponding to latent variables W = Z eng − Z diseng and W = Z death − Z diseng , and the histogram of the posteriors of σ = Cov ( W , W ) , where21 Z eng , Z diseng , Z death ) are latent utilities corresponding to each of the outcome levels and σ is thenormalized conditional covariance of W and W . The plots show that the average tree depths arearound 6 and 9 respectively for W and W under Algorithm [KD], and approximately 2 for thoseunder Algorithms [P1] and [P2]. The Bayesian regularization priors that favor shallow trees do notwork well for Algorithm [KD], as a tree depth of 6 allows up to leaves, which increases the riskof over-fitting and makes the stochastic tree search inefficient. The second and third rows of Figure4 set engagement and reported death as the reference level, respectively, and the latent variablesare defined accordingly. Similar conclusions are observed for tree depth. Under all choices of thereference level, the histogram of σ from Algorithms [P1] and [P2] agree on the sign of σ , whichwas demonstrated in previous simulations to match the sign of the true value of the underlying σ . While computational performance is an important criterion in building Gibbs sampler for com-plicated models, the dependency structure and sampling schemes are as crucial for devising analgorithm that generates a Markov chain with computational efficiency and fast mixing rates. Weexplore the data augmentation scheme for MPBART involved in KD, and propose two alternativealgorithms for MPBART that have improved computational and theoretical properties. 
Theoreti-cally, we prove that the mixing rate of our proposals (Algorithms [P1] and [P2]) is at least as goodas Algorithm [KD] for both the mean and covariance matrix of the latent variables.To assess computational performance, we consider two aspects, predictive accuracy and poste-rior convergence. In terms of predictive accuracy, we point out the limitation of the classificationerror metric in Kindo et al. , and also investigate the posterior agreement accuracy. Through nu-merical studies, we observe that Algorithm [KD] performs well under the posterior mode metric butits variation across posterior predictions tends to too large. Our proposals yield better predictiveaccuracy than Algorithm [KD], with the improvement substantial under the posterior agreementaccuracy.We also observed that for Algorithm [KD], posterior trees on average converge to more complex22rees than that in our proposals, and this has two important implications. First, the Bayesian regular-ization priors from BART which put more weight on smaller trees do not seem to work well underAlgorithm [KD]. Second, the convergence of the sum-of-trees parameters has a large influence notonly on the posterior predictive accuracy but also on the convergence of the covariance matrix. Asshown in simulations, the posterior covariance in Algorithm [KD] may seem to converge but oftento a value that has the wrong sign. In Appendix C we further explore how the correlation of thelatent utilities affects the outcome distribution, and show that an estimated covariance of the wrongsign can lead to a sum-of-trees component with values systematically different from the true datagenerating mechanism. We would like to acknowledge support for this project from the National Institutes of Health (NIH grants R01 AI108441, R01 CA 183854, GM 112327, AI 136664, K24 AI 134359). ables Posterior Agreement AccuracyTrain TestRef Level KD P1 P2 KD P1 P21 0.57 0.89 0.89 0.54 0.87 0.872 0.60 0.89 0.90 0.58 0.88 0.883 0.60 0.90 0.90 0.58 0.88 0.88Posterior Mode AccuracyTrain TestRef Level KD P1 P2 KD P1 P21 0.93 0.95 0.94 0.87 0.91 0.922 0.92 0.94 0.94 0.87 0.92 0.913 0.92 0.94 0.95 0.87 0.92 0.92Table 1: Accuracy comparison of Algorithms [KD], [P1], and [P2]. Training and test datasets eachwith 5000 observations are generated under Setting 1 with reference level 3. Posterior predictiveaccuracy measured by (5) and (6) are reported under reference levels 1, 2, and 3. The prior of (cid:101) Σ isInv-Wishart ( C + 1 , I C ) , where C = 2 . Posterior Agreement AccuracyTrain TestRef Level KD P1 P2 KD P1 P21 0.60 0.90 0.90 0.59 0.87 0.872 0.57 0.90 0.90 0.55 0.88 0.883 0.65 0.91 0.90 0.63 0.88 0.88Posterior Mode AccuracyTrain TestRef Level KD P1 P2 KD P1 P21 0.90 0.95 0.94 0.87 0.91 0.912 0.93 0.95 0.95 0.88 0.91 0.913 0.92 0.96 0.95 0.89 0.91 0.91Table 2: Accuracy comparison of Algorithms [KD], [P1], and [P2]. Training and test datasets eachwith 5000 observations are generated under Setting 2 with reference level 3. Posterior predictiveaccuracy measured by (5) and (6) are reported under reference levels 1, 2, and 3. The prior of (cid:101) Σ isInv-Wishart ( C + 1 , I C ) , where C = 2 . 
Posterior Agreement Accuracy
                    Train                   Test
 ν − C   Ψ12    KD     P1     P2       KD     P1     P2
   1      0    0.60   0.90   0.90     0.59   0.87   0.87
   3    -0.5   0.57   0.90   0.90     0.55   0.88   0.88
   3     0.5   0.65   0.91   0.90     0.63   0.88   0.88

Posterior Mode Accuracy
                    Train                   Test
 ν − C   Ψ12    KD     P1     P2       KD     P1     P2
   1      0    0.90   0.95   0.94     0.87   0.91   0.91
   3    -0.5   0.93   0.95   0.95     0.88   0.91   0.91
   3     0.5   0.92   0.96   0.95     0.89   0.91   0.91

Table 3: Accuracy comparison of Algorithms [KD], [P1], and [P2] under reference level 3. Training and test datasets, each with 5000 observations, are generated under Setting 1 with reference level 3. The prior of $\widetilde{\Sigma}$ is Inv-Wishart$(\nu, \Psi)$, where $\Psi_{11} = \Psi_{22} = 1$. Posterior predictive accuracy measured by (5) and (6) is reported for $(\nu - C, \Psi_{12})$ equal to $(1, 0)$, $(3, -0.5)$, and $(3, 0.5)$.

Table 4: Posterior mean and 95% credible intervals of the entries of $\Sigma$, the covariance of the latent utilities under the trace constraint (equal to $C$), with rows corresponding to prior hyperparameters $(\nu - C, \Psi_{12})$ equal to $(1, 0)$, $(3, -0.5)$, and $(3, 0.5)$, respectively, where $\Psi$ is the scale matrix in the prior of $\widetilde{\Sigma}$ with $\Psi_{11} = \Psi_{22} = 1$. Training and test datasets, each with 5000 observations, are generated under Setting 1 using reference level 3.

Posterior Agreement Accuracy
                    Train                   Test
 ν − C   Ψ12    KD     P1     P2       KD     P1     P2
   1      0    0.65   0.91   0.90     0.63   0.88   0.88
   3    -0.5   0.70   0.91   0.90     0.67   0.88   0.88
   3     0.5   0.70   0.91   0.91     0.67   0.88   0.88

Posterior Mode Accuracy
                    Train                   Test
 ν − C   Ψ12    KD     P1     P2       KD     P1     P2
   1      0    0.92   0.96   0.95     0.89   0.91   0.91
   3    -0.5   0.93   0.95   0.95     0.90   0.91   0.91
   3     0.5   0.92   0.95   0.96     0.89   0.91   0.91

Table 5: Accuracy comparison of Algorithms [KD], [P1], and [P2] under reference level 3. Training and test datasets, each with 5000 observations, are generated under Setting 2 with reference level 3. The prior of $\widetilde{\Sigma}$ is Inv-Wishart$(\nu, \Psi)$, where $\Psi_{11} = \Psi_{22} = 1$. Posterior predictive accuracy measured by (5) and (6) is reported for $(\nu - C, \Psi_{12})$ equal to $(1, 0)$, $(3, -0.5)$, and $(3, 0.5)$.

Table 6: Posterior mean and 95% credible intervals of the entries of $\Sigma$, the covariance of the latent utilities under the trace constraint (equal to $C$), with rows corresponding to prior hyperparameters $(\nu - C, \Psi_{12})$ equal to $(1, 0)$, $(3, -0.5)$, and $(3, 0.5)$, respectively, where $\Psi$ is the scale matrix in the prior of $\widetilde{\Sigma}$ with $\Psi_{11} = \Psi_{22} = 1$. Training and test datasets, each with 5000 observations, are generated under Setting 2 using reference level 3.

E{E[σ11 | D]} (S{E[σ11 | D]})
 ν − C   Ψ12        KD              P1              P2
   1      0    1.311 (0.032)   1.035 (0.041)   1.039 (0.039)
   3    -0.5   1.325 (0.057)   1.035 (0.036)   1.034 (0.036)
   3     0.5   1.296 (0.059)   1.039 (0.042)   1.038 (0.042)

E{E[σ12 | D]} (S{E[σ12 | D]})
 ν − C   Ψ12        KD              P1              P2
   1      0    -0.108 (0.007)   0.344 (0.053)   0.354 (0.056)
   3    -0.5   -0.118 (0.009)   0.321 (0.057)   0.348 (0.062)
   3     0.5   -0.122 (0.010)   0.365 (0.054)   0.359 (0.059)

Table 7: Comparison of Algorithms [KD], [P1], and [P2] under reference level 3 on $\Sigma$, with 50 replications. Training and test datasets, each with 5000 observations, are generated under Setting 1 using reference level 3. $E[\cdot \mid D]$ denotes the posterior mean on one simulated dataset $D$; $E\{E[\cdot \mid D]\}$ and $S\{E[\cdot \mid D]\}$ are the mean and standard deviation of $E[\cdot \mid D]$ across the 50 simulations of $D$. The prior of $\widetilde{\Sigma}$ is Inv-Wishart$(\nu, \Psi)$, where $\Psi_{11} = \Psi_{22} = 1$, with $(\nu - C, \Psi_{12})$ equal to $(1, 0)$, $(3, -0.5)$, and $(3, 0.5)$.

Posterior Agreement Accuracy
                        Train                                      Test
 ν − C   Ψ12      KD             P1             P2           KD             P1             P2
   1      0    0.603 (0.003)  0.900 (0.004)  0.900 (0.004)  0.580 (0.004)  0.881 (0.003)  0.882 (0.003)
   3    -0.5   0.650 (0.004)  0.900 (0.004)  0.900 (0.004)  0.618 (0.004)  0.881 (0.003)  0.882 (0.003)
   3     0.5   0.647 (0.003)  0.900 (0.004)  0.900 (0.004)  0.617 (0.004)  0.881 (0.003)  0.882 (0.003)

Posterior Mode Accuracy
                        Train                                      Test
 ν − C   Ψ12      KD             P1             P2           KD             P1             P2
   1      0    0.921 (0.005)  0.946 (0.003)  0.946 (0.003)  0.872 (0.006)  0.919 (0.004)  0.919 (0.004)
   3    -0.5   0.932 (0.004)  0.946 (0.003)  0.946 (0.003)  0.872 (0.006)  0.919 (0.003)  0.919 (0.004)
   3     0.5   0.928 (0.004)  0.946 (0.004)  0.946 (0.003)  0.873 (0.006)  0.919 (0.004)  0.919 (0.003)

Table 8: Accuracy comparison of Algorithms [KD], [P1], and [P2] under reference level 3, with 50 replications. Training and test datasets, each with 5000 observations, are generated under Setting 1 using reference level 3. The average (standard deviation) of accuracy across the 50 rounds is reported. The prior of $\widetilde{\Sigma}$ is Inv-Wishart$(\nu, \Psi)$, where $\Psi_{11} = \Psi_{22} = 1$. Posterior predictive accuracy measured by (5) and (6) is reported for $(\nu - C, \Psi_{12})$ equal to $(1, 0)$, $(3, -0.5)$, and $(3, 0.5)$.

E{E[σ11 | D]} (S{E[σ11 | D]})
 ν − C   Ψ12        KD              P1              P2
   1      0    0.848 (0.051)   0.769 (0.046)   0.770 (0.041)
   3    -0.5   0.599 (0.059)   0.783 (0.047)   0.774 (0.041)
   3     0.5   0.691 (0.067)   0.779 (0.036)   0.758 (0.049)

E{E[σ12 | D]} (S{E[σ12 | D]})
 ν − C   Ψ12        KD              P1              P2
   1      0    -0.321 (0.009)   0.801 (0.029)   0.797 (0.025)
   3    -0.5   -0.349 (0.011)   0.782 (0.028)   0.791 (0.026)
   3     0.5   -0.354 (0.011)   0.801 (0.026)   0.802 (0.028)

Table 9: Comparison of Algorithms [KD], [P1], and [P2] under reference level 3 on $\Sigma$, with 50 replications. Training and test datasets, each with 5000 observations, are generated under Setting 2 using reference level 3. $E[\cdot \mid D]$ denotes the posterior mean on one simulated dataset $D$; $E\{E[\cdot \mid D]\}$ and $S\{E[\cdot \mid D]\}$ are the mean and standard deviation of $E[\cdot \mid D]$ across the 50 simulations of $D$. The prior of $\widetilde{\Sigma}$ is Inv-Wishart$(\nu, \Psi)$, where $\Psi_{11} = \Psi_{22} = 1$, with $(\nu - C, \Psi_{12})$ equal to $(1, 0)$, $(3, -0.5)$, and $(3, 0.5)$.

Posterior Agreement Accuracy
                        Train                                      Test
 ν − C   Ψ12      KD             P1             P2           KD             P1             P2
   1      0    0.651 (0.003)  0.905 (0.003)  0.905 (0.003)  0.632 (0.005)  0.881 (0.003)  0.882 (0.003)
   3    -0.5   0.701 (0.003)  0.905 (0.003)  0.905 (0.003)  0.676 (0.005)  0.882 (0.003)  0.882 (0.003)
   3     0.5   0.700 (0.003)  0.906 (0.003)  0.906 (0.004)  0.679 (0.005)  0.882 (0.003)  0.882 (0.004)

Posterior Mode Accuracy
                        Train                                      Test
 ν − C   Ψ12      KD             P1             P2           KD             P1             P2
   1      0    0.924 (0.004)  0.953 (0.003)  0.952 (0.003)  0.896 (0.005)  0.918 (0.004)  0.918 (0.004)
   3    -0.5   0.930 (0.003)  0.952 (0.003)  0.952 (0.003)  0.893 (0.006)  0.918 (0.004)  0.918 (0.004)
   3     0.5   0.927 (0.003)  0.952 (0.003)  0.952 (0.003)  0.895 (0.005)  0.918 (0.004)  0.918 (0.004)

Table 10: Accuracy comparison of Algorithms [KD], [P1], and [P2] under reference level 3, with 50 replications. Training and test datasets, each with 5000 observations, are generated under Setting 2 using reference level 3. The average (standard deviation) of accuracy across the 50 rounds is reported. The prior of $\widetilde{\Sigma}$ is Inv-Wishart$(\nu, \Psi)$, where $\Psi_{11} = \Psi_{22} = 1$. Posterior predictive accuracy measured by (5) and (6) is reported for $(\nu - C, \Psi_{12})$ equal to $(1, 0)$, $(3, -0.5)$, and $(3, 0.5)$.

                        Disengaged (6497)   Engaged (67462)   Died (2697)
 Male                        22.5               34              51.3
 Year of Enrolment
   2008                       5.1                9.7            11.2
   2009                       8.3               18.7            17.1
   2010                       9.3               17.3            17.6
   2011                       9.2               15.8            17
   2012                      17.9               11.5            14
   2013                      18.5                8.9            11.3
   2014                      18.8                9.0             8.2
   2015                      12.8                8.3             3.3
   2016                       0.3                0.8             0.3
 Travel Time
   < 30 min                  17.4               24              23.6
   30 min - 1 h              19.4               26.9            29.4
   1 h - 2 h                  8.2               14.6            16.5

Table 11: Observed distribution (column percentages) of baseline covariates stratified by outcome level.

Table 12: Posterior accuracy comparison of Algorithms [KD], [P1], and [P2] on the AMPATH data. The prior of $\widetilde{\Sigma}$ is Inv-Wishart$(3, I)$.

Figures

Figure 1: Dependency diagrams between two consecutive iterations, $t$ and $t+1$, of $(W, \alpha, \mu, \Sigma)$; from left to right, the diagrams correspond to Algorithms [KD], [P1], and [P2], respectively.

Figure 2: Plot of posterior average tree depth for each latent utility as a time series, under simulation Setting 1 and the hyperparameters given in the panel labels: (a) $\nu = C+1$, $\Psi_{12} = 0$, reference level 1; (b) $\nu = C+1$, $\Psi_{12} = 0$, reference level 2; (c) $\nu = C+1$, $\Psi_{12} = 0$, reference level 3; (d) $\nu = C+3$, $\Psi_{12} = -0.5$, reference level 3; (e) $\nu = C+3$, $\Psi_{12} = 0.5$, reference level 3. Red, black, and blue correspond to Algorithms [KD], [P1], and [P2], respectively.

Figure 3: Plot of posterior average tree depth for each latent utility as a time series, under simulation Setting 2, with the same panel layout and color specification as in Figure 2.
Figure 4: Plot of posterior average tree depth for each latent utility as a time series for the application analysis on the AMPATH data (left), and histogram of $\sigma_{12}$ under its prior (purple) and its posterior from Algorithms [KD] (red), [P1] (black), and [P2] (blue); the same color specification applies to the left plot. Posterior inference is under $\nu = C + 1$, $\Psi_{12} = 0$, with the reference level as indicated in the panel labels: (a) reference level 1 (disengagement); (b) reference level 2 (engagement); (c) reference level 4 (death).

Figure 5: Plot of posterior $\sigma_{11}$ and $\sigma_{12}$ as time series (left), and histogram of $\sigma_{12}$ under its prior (purple) and its posterior from Algorithms [KD] (red), [P1] (black), and [P2] (blue); the same color specification applies to the left plot. Posterior inference is conducted under Setting 1, reference level 3, and hyperparameters (a) $\nu = C+1$, $\Psi_{12} = 0$; (b) $\nu = C+3$, $\Psi_{12} = -0.5$; (c) $\nu = C+3$, $\Psi_{12} = 0.5$.

Figure 6: Plot of posterior $\sigma_{11}$ and $\sigma_{12}$ as time series (left), and histogram of $\sigma_{12}$ under its prior (purple) and its posterior from Algorithms [KD] (red), [P1] (black), and [P2] (blue). Posterior inference is conducted under simulation Setting 2, reference level 3, and hyperparameters (a) $\nu = C+1$, $\Psi_{12} = 0$; (b) $\nu = C+3$, $\Psi_{12} = -0.5$; (c) $\nu = C+3$, $\Psi_{12} = 0.5$.

A Proof of Theorem 1
The lag-1 autocorrelation of $\mu$ is defined as
\[ \operatorname{corr}(\mu^{(t)}, \mu^{(t+1)}) = \frac{\operatorname{cov}(\mu^{(t)}, \mu^{(t+1)})}{\sqrt{\operatorname{var}(\mu^{(t)})\operatorname{var}(\mu^{(t+1)})}}, \]
where $t$ indexes posterior draws. Under the condition that the chain has reached its stationary distribution $f(W, \mu, \Sigma, \alpha \mid S, X)$, where $S$ and $X$ are the observed outcomes and covariates, $\operatorname{var}(\mu^{(t)}) = \operatorname{var}(\mu^{(t+1)}) = \operatorname{var}(\mu)$. Hence we only need to compare the covariances when comparing the autocorrelations.

Consider two consecutive draws of $\mu$ from Algorithm [KD]. We find that
\begin{align*}
E(\mu^{(t)} \mu^{(t+1)}) &= E\big[ E(\mu^{(t)} \mu^{(t+1)} \mid \Sigma^{(t)}, W^{(t+1)}, \alpha^{(t+1)}) \big] \\
&= E\big[ E(\mu^{(t)} \mid \Sigma^{(t)}, W^{(t+1)}, \alpha^{(t+1)}) \, E(\mu^{(t+1)} \mid \Sigma^{(t)}, W^{(t+1)}, \alpha^{(t+1)}) \big] \\
&= E\big[ E(\mu \mid \Sigma, W, \alpha)^2 \big],
\end{align*}
where the first equality follows from the law of total expectation, and the second and third follow from the fact that $\mu^{(t)}$ and $\mu^{(t+1)}$ are conditionally independent and identically distributed given $(\Sigma^{(t)}, W^{(t+1)}, \alpha^{(t+1)})$. This can be seen from the diagram for Algorithm [KD] in Figure 1; in particular, $\mu^{(t)}$ connects with $\mu^{(t+1)}$ only through $(\Sigma^{(t)}, W^{(t+1)}, \alpha^{(t+1)})$. As a result, the covariance is
\[ \operatorname{cov}(\mu^{(t)}, \mu^{(t+1)}) = E(\mu^{(t)} \mu^{(t+1)}) - E(\mu^{(t)}) E(\mu^{(t+1)}) = E\big[ E(\mu \mid \Sigma, W, \alpha)^2 \big] - E\big[ E(\mu \mid \Sigma, W, \alpha) \big]^2 = \operatorname{var}\big[ E(\mu \mid \Sigma, W, \alpha) \big]. \]
Similarly, the covariance under Algorithms [P1] and [P2] is derived to be $\operatorname{var}[E(\mu \mid \Sigma, W)]$; the key to this calculation is the fact that there $\mu^{(t)}$ connects with $\mu^{(t+1)}$ only through $(\Sigma^{(t)}, W^{(t+1)})$ (see Figure 1). We now compare the two variances:
\[ \operatorname{var}\big[ E(\mu \mid \Sigma, W, \alpha) \big] = \operatorname{var}\big\{ E[E(\mu \mid \Sigma, W, \alpha) \mid \Sigma, W] \big\} + E\big\{ \operatorname{var}[E(\mu \mid \Sigma, W, \alpha) \mid \Sigma, W] \big\} \ge \operatorname{var}\big\{ E[E(\mu \mid \Sigma, W, \alpha) \mid \Sigma, W] \big\} = \operatorname{var}\big[ E(\mu \mid \Sigma, W) \big]. \]
The first equality comes from the law of total (conditional) variance and the last equality results from the law of total expectation. Therefore, we can conclude that the lag-1 autocorrelation of $\mu$, $\operatorname{corr}(\mu^{(t)}, \mu^{(t+1)})$, is closer to zero in Algorithms [P1] and [P2] than in Algorithm [KD].

Recall from Figure 1 that $\Sigma^{(t)}$ connects with $\Sigma^{(t+1)}$ only through $(W^{(t+1)}, \alpha^{(t+1)}, \mu^{(t+1)})$ in Algorithms [KD] and [P1], and through $(W^{(t+1)}, \mu^{(t+1)})$ in Algorithm [P2]. Hence, by a similar argument as above, the lag-1 autocorrelation of $\Sigma$, $\operatorname{corr}(\Sigma^{(t)}, \Sigma^{(t+1)})$, for Algorithm [P2] is no larger than in Algorithms [KD] and [P1].
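This comparison can also be checked empirically from saved MCMC output by estimating the lag-1 autocorrelation of the $\mu$ draws under each algorithm. Below is a minimal Python sketch; the chains `mu_kd` and `mu_p2` are hypothetical stand-ins (synthetic AR(1) series), not output from the actual samplers.

```python
import numpy as np

def lag1_autocorr(draws):
    """Sample lag-1 autocorrelation of each scalar component of a chain.

    draws: array of shape (T, d), one row per retained MCMC iteration.
    """
    draws = np.asarray(draws, dtype=float)
    centered = draws - draws.mean(axis=0)
    num = (centered[:-1] * centered[1:]).sum(axis=0)   # lag-1 cross-products
    den = (centered ** 2).sum(axis=0)                  # lag-0 sums of squares
    return num / den

# Synthetic AR(1) chains standing in for mu draws from two samplers:
rng = np.random.default_rng(0)

def ar1(rho, T=5000):
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = rho * x[t - 1] + rng.standard_normal()
    return x[:, None]

mu_kd, mu_p2 = ar1(0.8), ar1(0.3)   # hypothetical slower- vs. faster-mixing chains
print(lag1_autocorr(mu_kd), lag1_autocorr(mu_p2))
```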
Appendix B   Equivalence in the Sampling Distribution between Algorithms [P1] and [P2]

The purpose of this section is to show that Algorithms [P1] and [P2] have the same sampling distribution of $\Sigma$ when the sample size of the observed data, $N$, is sufficiently large and the scale matrix in the prior distribution of $\widetilde\Sigma$ is relatively small compared to the sample estimate of the covariance matrix of the latent variables. Note that Algorithms [P1] and [P2] draw $\theta$ and $W$ from the same conditional distributions. Hence, the conclusion here implies that the two procedures sample from the same joint distribution of $(\theta, W, \Sigma)$.

Given the prior on the unconstrained covariance matrix $\widetilde\Sigma \sim \text{Inv-Wishart}(\nu, \Psi)$, we can calculate in closed form the conditional posterior distributions of $\widetilde\Sigma$ and $\Sigma$ under Algorithms [P1] and [P2], respectively, where $\Sigma = C\widetilde\Sigma/\operatorname{trace}(\widetilde\Sigma)$.

In Algorithm [P1], Step 1 samples the working parameter $\alpha_1^2$ from its conditional prior distribution, $\alpha_1^2 \mid \Sigma \sim \text{Inv-Gamma}(\nu C/2, \operatorname{trace}(\Psi\Sigma^{-1})/2)$, which is equivalent to $\operatorname{trace}(\Psi\Sigma^{-1})/\chi^2_{\nu C}$ and has expectation $E[\alpha_1^2 \mid \Sigma] = \operatorname{trace}(\Psi\Sigma^{-1})/(\nu C - 2)$. Since the inverse-Wishart is conditionally conjugate for the covariance matrix in a Gaussian model, the conditional posterior distribution of $\widetilde\Sigma$ under Algorithm [P1] is
\[ \widetilde\Sigma \mid \alpha_1^2, W, \mu \sim \text{Inv-Wishart}\Big( N + \nu, \; \Psi + \alpha_1^2 \sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T \Big), \]
and this leads to the conditional posterior distribution of the corresponding restricted covariance matrix $\Sigma$,
\[ f(\Sigma \mid \alpha_1^2, W, \mu) \propto |\Sigma|^{-(N+\nu+C+1)/2} \operatorname{trace}\Big\{ \Big[ \Psi + \alpha_1^2 \sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T \Big] \Sigma^{-1} \Big\}^{-(N+\nu)C/2}. \]
Similarly, the conditional posterior distributions under Algorithm [P2] for $\widetilde\Sigma$ and $\Sigma$, respectively, are
\[ \widetilde\Sigma \mid W, \mu \sim \text{Inv-Wishart}\Big( N + \nu, \; \Psi + \sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T \Big) \]
and
\[ f(\Sigma \mid W, \mu) \propto |\Sigma|^{-(N+\nu+C+1)/2} \operatorname{trace}\Big\{ \Big[ \Psi + \sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T \Big] \Sigma^{-1} \Big\}^{-(N+\nu)C/2}. \]

To compare the conditional posterior of $\Sigma$ under Algorithms [P1] and [P2], we examine the posterior mean and variance using a first-order Taylor series expansion. The posterior mean under Algorithm [P1] is
\begin{align*}
E(\Sigma \mid \alpha_1^2, W, \mu) &= E\Big( \frac{\widetilde\Sigma}{\alpha_3^2} \;\Big|\; \alpha_1^2, W, \mu \Big) \approx \frac{E(\widetilde\Sigma \mid \alpha_1^2, W, \mu)}{E(\alpha_3^2 \mid \alpha_1^2, W, \mu)} = \frac{E(\widetilde\Sigma \mid \alpha_1^2, W, \mu)}{E[\operatorname{trace}(\widetilde\Sigma) \mid \alpha_1^2, W, \mu]} \times C = \frac{E(\widetilde\Sigma \mid \alpha_1^2, W, \mu)}{\operatorname{trace}[E(\widetilde\Sigma \mid \alpha_1^2, W, \mu)]} \times C \\
&= \frac{\Psi + \alpha_1^2 \sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T}{\operatorname{trace}(\Psi) + \alpha_1^2 \sum_{i=1}^N \sum_{j=1}^C (W_{ij} - \mu_{ij})^2} \times C.
\end{align*}
Similarly, the posterior mean of $\Sigma$ under Algorithm [P2] is
\[ E(\Sigma \mid W, \mu) \approx \frac{\Psi + \sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T}{\operatorname{trace}(\Psi) + \sum_{i=1}^N \sum_{j=1}^C (W_{ij} - \mu_{ij})^2} \times C. \]
For the posterior variance of $\Sigma$, we simplify the notation by writing the conditional posterior distribution of $\widetilde\Sigma$ as $\text{Inv-Wishart}(\widetilde\nu, \widetilde\Psi)$, where $\widetilde\nu = N + \nu$ and the scale matrix is
\[ \widetilde\Psi = \begin{cases} \Psi + \alpha_1^2 \sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T & \text{under Algorithm [P1]}, \\ \Psi + \sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T & \text{under Algorithm [P2]}. \end{cases} \]
Then, with $\alpha_3^2 = \operatorname{trace}(\widetilde\Sigma)/C$, the posterior variance has the first-order (delta-method) form
\[ \operatorname{var}(\sigma_{ij}) = \operatorname{var}\Big( \frac{\widetilde\sigma_{ij}}{\alpha_3^2} \Big) \approx \Big[ \frac{E(\widetilde\sigma_{ij})}{E(\alpha_3^2)} \Big]^2 \Big\{ \frac{\operatorname{var}(\widetilde\sigma_{ij})}{E(\widetilde\sigma_{ij})^2} - \frac{2\operatorname{cov}(\widetilde\sigma_{ij}, \alpha_3^2)}{E(\widetilde\sigma_{ij}) E(\alpha_3^2)} + \frac{\operatorname{var}(\alpha_3^2)}{E(\alpha_3^2)^2} \Big\}, \]
where, by the standard inverse-Wishart moments,
\begin{align*}
E(\widetilde\sigma_{ij}) &= \frac{\widetilde\Psi_{ij}}{\widetilde\nu - C - 1}, \qquad E(\alpha_3^2) = \frac{1}{C} E(\widetilde\sigma_{11} + \cdots + \widetilde\sigma_{CC}) = \frac{\operatorname{trace}(\widetilde\Psi)}{C(\widetilde\nu - C - 1)}, \\
\operatorname{var}(\widetilde\sigma_{ij}) &= \begin{cases} \dfrac{(\widetilde\nu - C + 1)\widetilde\Psi_{ij}^2 + (\widetilde\nu - C - 1)\widetilde\Psi_{ii}\widetilde\Psi_{jj}}{(\widetilde\nu - C)(\widetilde\nu - C - 1)^2(\widetilde\nu - C - 3)} & \text{if } i \ne j, \\[2ex] \dfrac{2\widetilde\Psi_{ii}^2}{(\widetilde\nu - C - 1)^2(\widetilde\nu - C - 3)} & \text{if } i = j, \end{cases} \\
\operatorname{cov}(\widetilde\sigma_{ij}, \alpha_3^2) &= \frac{1}{C} \sum_{k=1}^C \operatorname{cov}(\widetilde\sigma_{ij}, \widetilde\sigma_{kk}) = \frac{1}{C} \sum_{k=1}^C \frac{2\widetilde\Psi_{ij}\widetilde\Psi_{kk} + 2(\widetilde\nu - C - 1)\widetilde\Psi_{ik}\widetilde\Psi_{jk}}{(\widetilde\nu - C)(\widetilde\nu - C - 1)^2(\widetilde\nu - C - 3)}, \\
\operatorname{var}(\alpha_3^2) &= \frac{2\big[ \operatorname{trace}(\widetilde\Psi)^2 + (\widetilde\nu - C - 1) \sum_{i=1}^C \sum_{k=1}^C \widetilde\Psi_{ik}^2 \big]}{C^2 (\widetilde\nu - C)(\widetilde\nu - C - 1)^2(\widetilde\nu - C - 3)}.
\end{align*}

Specifying the inverse-Wishart prior for the covariance matrix is analogous to assuming prior knowledge of $\nu$ Gaussian samples having covariance matrix $\Psi$. As the number of observations $N$ grows, the posterior concentrates more on the empirical covariance matrix. When $N$ is sufficiently large and $\Psi_{kj} \ll \alpha_1^2 \sum_{i=1}^N (W_{ik} - \mu_{ik})(W_{ij} - \mu_{ij})$ for all $k, j = 1, \ldots, C$, the posterior means of $\Sigma$ under Algorithms [P1] and [P2] are approximately the same:
\[ E(\Sigma \mid \alpha_1^2, W, \mu) \approx \frac{\alpha_1^2 \sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T}{\alpha_1^2 \sum_{i=1}^N \sum_{j=1}^C (W_{ij} - \mu_{ij})^2} \times C = \frac{\sum_{i=1}^N (W_i - \mu_i)(W_i - \mu_i)^T}{\sum_{i=1}^N \sum_{j=1}^C (W_{ij} - \mu_{ij})^2} \times C \approx E(\Sigma \mid W, \mu). \]
In the same way, we can show that $\operatorname{var}(\Sigma \mid \alpha_1^2, W, \mu) \approx \operatorname{var}(\Sigma \mid W, \mu)$.
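This large-$N$ agreement can be probed numerically by drawing $\widetilde\Sigma$ from each conditional and projecting onto the trace restriction. A minimal sketch follows; the dimensions, hyperparameters, residuals, and working-parameter value are all hypothetical stand-ins, not quantities from the actual data analysis.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)
C, N, nu = 3, 2000, 4                      # hypothetical dimension, sample size, prior df
Psi = np.eye(C)                            # hypothetical prior scale matrix
cov_true = 0.7 * np.eye(C) + 0.3           # a valid covariance for the stand-in residuals
resid = rng.multivariate_normal(np.zeros(C), cov_true, size=N)  # stand-in for W_i - mu_i
S = resid.T @ resid                        # sum_i (W_i - mu_i)(W_i - mu_i)^T
alpha2 = 1.7                               # a fixed working-parameter value alpha_1^2

def trace_restricted_mean(scale, df, M=4000):
    """Mean of Sigma = C * Sigma_tilde / trace(Sigma_tilde), Sigma_tilde ~ IW(df, scale)."""
    draws = invwishart.rvs(df=df, scale=scale, size=M, random_state=rng)
    tr = np.trace(draws, axis1=1, axis2=2)
    return (C * draws / tr[:, None, None]).mean(axis=0)

m_p1 = trace_restricted_mean(Psi + alpha2 * S, N + nu)  # Algorithm [P1] conditional
m_p2 = trace_restricted_mean(Psi + S, N + nu)           # Algorithm [P2] conditional
print(np.abs(m_p1 - m_p2).max())           # small when N is large and Psi is negligible
```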
Appendix C   Outcome Distribution as a Function of Correlation

This section discusses how the outcome distribution is connected to the correlation of the latent utilities for $C = 2$. We assume the outcome is determined as follows:
\[ W = (W_1, W_2)^T \sim \mathrm{MVN}\bigg( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \bigg), \qquad S = \begin{cases} 1 & \text{if } W_1 \ge \max\{0, W_2\}, \\ 2 & \text{if } W_2 \ge \max\{0, W_1\}, \\ 3 & \text{if } W_1 < 0 \text{ and } W_2 < 0. \end{cases} \]

We start with the outcome probability at the reference level,
\begin{align*}
P(S = 3) &= P(W_1 < 0, W_2 < 0) = \int_{-\infty}^{0} \int_{-\infty}^{0} \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\Big\{ -\frac12 (w_1 - \mu_1, w_2 - \mu_2) \Sigma^{-1} \begin{pmatrix} w_1 - \mu_1 \\ w_2 - \mu_2 \end{pmatrix} \Big\} \, dw_2 \, dw_1 \\
&= \int_{-\infty}^{0} \int_{-\infty}^{0} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\Big\{ -\frac{[w_2 - \mu_2 - \rho(w_1 - \mu_1)]^2}{2(1-\rho^2)} \Big\} \exp\Big\{ -\frac{(w_1 - \mu_1)^2}{2} \Big\} \, dw_2 \, dw_1 \\
&= \int_{-\infty}^{-\mu_1} \int_{-\infty}^{0} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\Big\{ -\frac{(w_2 - \mu_2 - \rho\widetilde w_1)^2}{2(1-\rho^2)} \Big\} \exp\Big\{ -\frac{\widetilde w_1^2}{2} \Big\} \, dw_2 \, d\widetilde w_1 \\
&= \int_{-\infty}^{-\mu_1} \int_{-\infty}^{\frac{-\mu_2 - \rho\widetilde w_1}{\sqrt{1-\rho^2}}} \frac{1}{2\pi} \exp\Big\{ -\frac{\widetilde w_2^2}{2} \Big\} \exp\Big\{ -\frac{\widetilde w_1^2}{2} \Big\} \, d\widetilde w_2 \, d\widetilde w_1 \\
&= \int_{-\infty}^{-\mu_1} \frac{1}{2\pi} \, \tau\Big( \frac{-\mu_2 - \rho\widetilde w_1}{\sqrt{1-\rho^2}} \Big) \exp\Big\{ -\frac{\widetilde w_1^2}{2} \Big\} \, d\widetilde w_1,
\end{align*}
where the second equality comes from the matrix inversion and determinant lemmas, the substitutions are $\widetilde w_1 = w_1 - \mu_1$ and $\widetilde w_2 = [w_2 - \mu_2 - \rho\widetilde w_1]/\sqrt{1-\rho^2}$, and $\tau(u) = \int_{-\infty}^{u} \exp\{-s^2/2\}\, ds$ in the last equality.

Next, we write $P(S=3)$ as a function of $\rho$, $f(\rho) = \int_{-\infty}^{-\mu_1} \frac{1}{2\pi} \tau\big( \frac{-\mu_2 - \rho t}{\sqrt{1-\rho^2}} \big) \exp\{-t^2/2\}\, dt$. The corresponding derivative with respect to $\rho$ has the form
\begin{align*}
\frac{d}{d\rho} f(\rho) &= \int_{-\infty}^{-\mu_1} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\Big\{ -\frac{(\mu_2 + \rho t)^2}{2(1-\rho^2)} \Big\} \exp\Big\{ -\frac{t^2}{2} \Big\} \Big( -\frac{t + \rho\mu_2}{1-\rho^2} \Big)\, dt \\
&= \int_{-\infty}^{-\mu_1} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\Big\{ -\frac{(t + \rho\mu_2)^2 + \mu_2^2(1-\rho^2)}{2(1-\rho^2)} \Big\} \Big( -\frac{t + \rho\mu_2}{1-\rho^2} \Big)\, dt \\
&= \int_{-\infty}^{-\mu_1 + \rho\mu_2} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\Big\{ -\frac{\widetilde t^{\,2}}{2(1-\rho^2)} \Big\} \exp\Big\{ -\frac{\mu_2^2}{2} \Big\} \Big( -\frac{\widetilde t}{1-\rho^2} \Big)\, d\widetilde t \\
&= \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\Big\{ -\frac{\mu_2^2}{2} \Big\} \exp\Big\{ -\frac{(-\mu_1 + \rho\mu_2)^2}{2(1-\rho^2)} \Big\} \; > \; 0, \qquad \rho \in (-1, 1).
\end{align*}
As a result, for every possible combination of $(\mu_1, \mu_2)$, $P(S=3)$ is always a strictly increasing function of $\rho$. Next, we show that how the non-reference-level outcome probabilities change with $\rho$ depends on $(\mu_1, \mu_2)$. For outcome level 1 we have
\[ P(S = 1) = P(W_1 \ge W_2, W_1 \ge 0) = P(W_2 - W_1 \le 0, -W_1 \le 0) = P(Z_1 \le 0, Z_2 \le 0), \quad \text{with} \quad \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim N\bigg( \begin{pmatrix} \mu_2 - \mu_1 \\ -\mu_1 \end{pmatrix}, \begin{pmatrix} 2(1-\rho) & 1-\rho \\ 1-\rho & 1 \end{pmatrix} \bigg). \]
The outcome probability can be written as
\begin{align*}
P(S = 1) &= \int_{-\infty}^{0} \int_{-\infty}^{0} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\Big\{ -\frac{[z_1 - (\mu_2 - \mu_1) - (1-\rho)(z_2 + \mu_1)]^2}{2(1-\rho^2)} \Big\} \exp\Big\{ -\frac{(z_2 + \mu_1)^2}{2} \Big\} \, dz_1 \, dz_2 \\
&= \int_{-\infty}^{\mu_1} \frac{1}{2\pi} \, \tau\Big( \frac{-(\mu_2 - \mu_1) - (1-\rho)t}{\sqrt{1-\rho^2}} \Big) \exp\Big\{ -\frac{t^2}{2} \Big\} \, dt.
\end{align*}
Similar to the procedure for $f(\rho)$, we write $P(S=1)$ as $g(\rho)$. Up to the positive factor $K(\rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\big\{ -\frac{(\mu_2 - \mu_1)^2}{4(1-\rho)} \big\}$, the derivative with respect to $\rho$ is
\[ \frac{d}{d\rho} g(\rho) = K(\rho) \bigg[ -\frac12 \exp\Big\{ -\frac{1}{1+\rho} \Big( \frac{\mu_1 + \mu_2}{2} \Big)^2 \Big\} - \frac{(\mu_2 - \mu_1)\sqrt{1+\rho}}{2\sqrt{2}\,(1-\rho)} \, \tau\Big( \frac{\mu_1 + \mu_2}{\sqrt{2(1+\rho)}} \Big) \bigg], \]
and since $\tau(u) \in (0, \sqrt{2\pi})$ by definition, the bracketed term lies in the interval
\[ \bigg( -\frac12 \exp\Big\{ -\frac{1}{1+\rho} \Big( \frac{\mu_1 + \mu_2}{2} \Big)^2 \Big\} \; \pm \; \frac{|\mu_2 - \mu_1|}{2(1-\rho)} \sqrt{\pi(1+\rho)} \bigg). \]
In fact, from the way $S$ depends on $(W_1, W_2)$ for the non-reference levels, and by the symmetry between the two non-reference levels, the derivative of $P(S=2)$ with respect to $\rho$ falls into the same interval. Clearly, the center and width of the interval depend on $(\mu_1 + \mu_2, \rho)$ and $(|\mu_2 - \mu_1|, \rho)$, respectively. So how $P(S=1)$ and $P(S=2)$ vary with $\rho$ is heavily influenced by the position of, and the distance between, the latent utility means.

The following plots display how the outcome distribution changes with $\rho$ under three different pairs $(\mu_1, \mu_2)$. We can see that the reference-level outcome probability $P(S=3)$ increases with $\rho$ in all three settings, and the relative positions of $P(S=1)$ and $P(S=2)$ are closely related to the values of, and the difference between, $\mu_1$ and $\mu_2$.

Figure 7: Outcome distribution as a function of $\rho$ under $(\mu_1, \mu_2)$ being (0, 0), (0, -0.2), and (-0.1, 0.3).
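The probabilities plotted in Figure 7 can be computed directly from the bivariate normal CDF rather than the closed-form integrals above. A minimal sketch follows; the $(\mu_1, \mu_2)$ pair mirrors one of the settings in the figure.

```python
import numpy as np
from scipy.stats import multivariate_normal

def outcome_probs(mu1, mu2, rho):
    """(P(S=1), P(S=2), P(S=3)) for the C = 2 latent-utility model above."""
    # P(S = 3): both utilities negative.
    p3 = multivariate_normal.cdf([0.0, 0.0], mean=[mu1, mu2],
                                 cov=[[1.0, rho], [rho, 1.0]])
    # P(S = 1): Z1 = W2 - W1 <= 0 and Z2 = -W1 <= 0, as derived above.
    p1 = multivariate_normal.cdf([0.0, 0.0], mean=[mu2 - mu1, -mu1],
                                 cov=[[2.0 * (1.0 - rho), 1.0 - rho],
                                      [1.0 - rho, 1.0]])
    return p1, 1.0 - p1 - p3, p3            # P(S = 2) by complement

# P(S = 3) should be strictly increasing in rho, per the derivative above.
for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(rho, outcome_probs(0.0, -0.2, rho))
```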
Appendix D   Algorithms' Pseudo Code

Using the notation in Section 2.1, we provide details on the algorithms in Section 2.4.

Algorithm [KD]

Step 0: Initialize parameters $l = 0$, $\alpha^{(0)}, W^{(0)}, \theta^{(0)}, \Sigma^{(0)}$
while $l < L$ do
    Step 1: Update $(\widetilde W^{(l+1)}, (\alpha_1^{(l+1)})^2)$ via $P(\widetilde W, \alpha^2 \mid \mu^{(l)}, \Sigma^{(l)}, S)$
        (a) Draw $(\alpha_1^{(l+1)})^2 \sim \operatorname{trace}[\Psi(\Sigma^{(l)})^{-1}]/\chi^2_{\nu C}$;
        (b) Draw $W^{(l+1)} = (W_1^{(l+1)}, \ldots, W_C^{(l+1)}) \sim \mathrm{MVN}(\mu^{(l)}, \Sigma^{(l)})$ by
            for $i \in 1, \ldots, N$ do
                for $j \in 1, \ldots, C$ do
                    $W_{ij}^{(l+1)} \mid W_{i(-j)}^{(l+1)} \sim \mathrm{TN}(m_{ij}^{(l)}, (\tau_j^{(l)})^2)$,   ▷ Appendix D.1
                    where $W_{i(-j)}^{(l+1)} = (W_{i1}^{(l+1)}, \ldots, W_{i,j-1}^{(l+1)}, W_{i,j+1}^{(l)}, \ldots, W_{i,C}^{(l)})$
                end for
            end for;
        (c) Set $\widetilde W^{(l+1)} = \alpha_1^{(l+1)} W^{(l+1)}$.
    Step 2: Update $\widetilde\theta^{(l+1)}$ via $P(\widetilde\theta \mid \widetilde W^{(l+1)}, \alpha_1^{(l+1)}, \Sigma^{(l)})$
        (a) Gibbs sampling of binary trees:
            for $b \in 1, \ldots, m$ do
                for $j \in 1, \ldots, C$ do
                    Update $\widetilde W^{\dagger}_{jb}$ and draw $\widetilde\theta_{jb}^{(l+1)} \sim P(\widetilde\theta_{jb} \mid \widetilde W^{\dagger}_{jb}, (\alpha_1^{(l+1)} \tau_j^{(l)})^2)$   ▷ Appendix D.2
                end for
            end for;
        (b) Set $\widetilde\mu_{ij}^{(l+1)} = G_j(X_i; \widetilde\theta_j^{(l+1)})$ and $\mu_{ij}^{(l+1)} = \widetilde\mu_{ij}^{(l+1)}/\alpha_1^{(l+1)}$.
    Step 3: Update $(\Sigma^{(l+1)}, (\alpha_3^{(l+1)})^2)$ via $P(\Sigma, \alpha^2 \mid \widetilde W^{(l+1)}, \widetilde\theta^{(l+1)})$
        (a) Draw $\widetilde\Sigma \sim \text{Inv-Wishart}(N + \nu, \Psi + \sum_{i=1}^N \widetilde\epsilon_i \widetilde\epsilon_i^T)$,
            where $\widetilde\epsilon_i = (\widetilde\epsilon_{i1}, \ldots, \widetilde\epsilon_{iC})$ and $\widetilde\epsilon_{ij} = \widetilde W_{ij}^{(l+1)} - \widetilde\mu_{ij}^{(l+1)}$ for $j = 1, \ldots, C$;
        (b) Set $(\alpha_3^{(l+1)})^2 = \operatorname{trace}(\widetilde\Sigma)/C$;
        (c) Re-scale model parameters based on $\alpha_3^{(l+1)}$: $\Sigma^{(l+1)} = \widetilde\Sigma/(\alpha_3^{(l+1)})^2$ and $W^{(l+1)} = \mu^{(l+1)} + \widetilde\epsilon/\alpha_3^{(l+1)}$;
end while
Step 4: Prediction given new input   ▷ Appendix D.3

Algorithm [P1]

Step 0: Initialize parameters $l = 0$, $\alpha^{(0)}, W^{(0)}, \theta^{(0)}, \Sigma^{(0)}$
while $l < L$ do
    Step 1: Update $(\widetilde W^{(l+1)}, (\alpha_1^{(l+1)})^2)$ via $P(\widetilde W, \alpha^2 \mid \mu^{(l)}, \Sigma^{(l)}, S)$
        (a) Draw $(\alpha_1^{(l+1)})^2 \sim \operatorname{trace}[\Psi(\Sigma^{(l)})^{-1}]/\chi^2_{\nu C}$;
        (b) Draw $W^{(l+1)} = (W_1^{(l+1)}, \ldots, W_C^{(l+1)}) \sim \mathrm{MVN}(\mu^{(l)}, \Sigma^{(l)})$ by the same coordinate-wise truncated-normal loop as in Algorithm [KD], Step 1(b);   ▷ Appendix D.1
        (c) Set $\widetilde W^{(l+1)} = \alpha_1^{(l+1)} W^{(l+1)}$.
    Step 2: Update $\theta^{(l+1)}$ via $P(\theta \mid W^{(l+1)}, \Sigma^{(l)})$
        (a) Gibbs sampling of binary trees:
            for $b \in 1, \ldots, m$ do
                for $j \in 1, \ldots, C$ do
                    Update $W^{\dagger}_{jb}$ and draw $\theta_{jb}^{(l+1)} \sim P(\theta_{jb} \mid W^{\dagger}_{jb}, (\tau_j^{(l)})^2)$   ▷ Appendix D.2
                end for
            end for;
        (b) Set $\mu_{ij}^{(l+1)} = G_j(X_i; \theta_j^{(l+1)})$.
    Step 3: Update $(\Sigma^{(l+1)}, (\alpha_3^{(l+1)})^2)$ via $P(\Sigma, \alpha^2 \mid \widetilde W^{(l+1)}, \alpha_1^{(l+1)}, \theta^{(l+1)})$
        (a) Draw $\widetilde\Sigma \sim \text{Inv-Wishart}(N + \nu, \Psi + \sum_{i=1}^N \widetilde\epsilon_i \widetilde\epsilon_i^T)$,
            where $\widetilde\epsilon_i = (\widetilde\epsilon_{i1}, \ldots, \widetilde\epsilon_{iC})$ and $\widetilde\epsilon_{ij} = \widetilde W_{ij}^{(l+1)} - \alpha_1^{(l+1)} \mu_{ij}^{(l+1)}$ for $j = 1, \ldots, C$;
        (b) Set $(\alpha_3^{(l+1)})^2 = \operatorname{trace}(\widetilde\Sigma)/C$;
        (c) Re-scale model parameters based on $\alpha_3^{(l+1)}$: $\Sigma^{(l+1)} = \widetilde\Sigma/(\alpha_3^{(l+1)})^2$ and $W^{(l+1)} = \mu^{(l+1)} + \widetilde\epsilon/\alpha_3^{(l+1)}$.
end whileStep 4 : Prediction given new input (cid:46) Appendix D.3 lgorithm [P2] Step 0 : Initialize parameters l = 0 , α (0) , W (0) , θ (0) , Σ (0) while l < L do doStep 1 : Update W ( l +1) via P ( W | µ ( l ) , Σ ( l ) , S ) (a) Draw W ( l +1) = ( W ( l +1)1 , . . . , W ( l +1) C ) ∼ M V N ( µ ( l ) , Σ ( l ) ) by for i ∈ , . . . , N dofor j ∈ , . . . , C do W ( l +1) ij | W ( l +1) i ( − j ) ∼ T N ( m ( l ) ij , ( τ ( l ) j ) ) (cid:46) Appendix D.1where W ( l +1) i ( − j ) = ( W ( l +1) i , . . . , W ( l +1) i,j − , W ( l ) i,j +1 , . . . , W ( l ) i,C ) end forend for . Step 2 : Update θ ( l +1) via P ( θ | W ( l +1) , Σ ( l ) ) (a) Gibbs sampling of binary trees: for b ∈ , . . . , m dofor j ∈ , . . . , C do Update W † jb and draw θ ( l +1) jb ∼ P ( θ jb | W † jb , ( τ ( l ) j ) ) (cid:46) Appendix D.2 end forend for ;(b) Set µ ( l +1) ij = G j ( X i ; θ ( l +1) j ) . Step 3 : Update (Σ ( l +1) , ( α ( l +1)3 ) ) via P (Σ , α | W ( l +1) , θ ( l +1) ) (a) Draw (cid:101) Σ ∼ Inv-Wishart ( n + ν, Ψ + (cid:80) ni =1 (cid:15) i (cid:15) Ti ) ,where (cid:15) i = ( (cid:15) i , . . . , (cid:15) iC ) and (cid:15) ij = W ( l +1) ij − µ ( l +1) ij for j = 1 , . . . , C ;(b) Setting ( α ( l +1)3 ) = trace ( (cid:101) Σ /C ) ;(c) Re-scaling model parameters based on α ( l +1)3 : Σ ( l +1) = (cid:101) Σ / ( α ( l +1)3 ) and W ( l +1) = µ ( l +1) + (cid:101) (cid:15)/α ( l +1)3 . end whileStep 4 : Prediction given new input (cid:46) Appendix D.3
D.1 Gibbs Sampling of the Latent Utilities
Gibbs sampling of the latent utilities from univariate truncated normal distributions is described in Section 3 of McCulloch and Rossi (1994). In the pseudo-code,
\[ m_{ij} = \mu_{ij} + \Sigma_{j(-j)} (\Sigma_{(-j)(-j)})^{-1} [W_{i(-j)} - \mu_{i(-j)}], \qquad \tau_j^2 = \Sigma_{jj} - \Sigma_{j(-j)} (\Sigma_{(-j)(-j)})^{-1} \Sigma_{(-j)j} \]
for $i = 1, \ldots, N$ and $j = 1, \ldots, C$, where $\mu_{ij} = \sum_{d=1}^{m} g(X_i; \theta_{jd})$, $\Sigma_{jj}$ is the element in the $j$th row and $j$th column of $\Sigma$, $\Sigma_{(-j)(-j)}$ is the submatrix of $\Sigma$ that excludes its $j$th row and $j$th column, $\Sigma_{j(-j)}$ is the $j$th row of $\Sigma$ excluding its $j$th element $\Sigma_{jj}$, and $\Sigma_{(-j)j}$ is defined similarly.

D.2 Tree Sampling in MPBART

Using Algorithms [P1] and [P2] as an example, we follow Section 3.2 of Kindo et al. (2016) and provide details on the conditional distributions used to update each individual tree. For simplicity, we suppress the subscript $i$. Given $W \sim \mathrm{MVN}(\mu, \Sigma)$, we have $W_j \mid (W_{(-j)}, \mu, \Sigma) \sim N(m_j, \tau_j^2)$, where $m_j$ and $\tau_j^2$ are defined in Appendix D.1. Based on the fact that $\mu_j = \sum_{d=1}^{m} g(X; \theta_{jd})$, define
\[ W^{\dagger}_{jb} = W_j - \sum_{d=1}^{b-1} g(X; \theta_{jd}) - \sum_{d'=b+1}^{m} g(X; \theta_{jd'}) - \Sigma_{j(-j)} (\Sigma_{(-j)(-j)})^{-1} [W_{(-j)} - \mu_{(-j)}]. \]
Conditional on $(W_{(-j)}, \mu, \Sigma)$, $W^{\dagger}_{jb} - g(X; \theta_{jb}) = W_j - m_j \sim N(0, \tau_j^2)$, so $W^{\dagger}_{jb} \sim N(g(X; \theta_{jb}), \tau_j^2)$. Consequently, the $b$th binary tree of the $j$th latent variable, $\theta^{(l+1)}_{jb}$, is updated to estimate the mean of
\[ W^{\dagger}_{jb} = W^{(l+1)}_j - \sum_{d=1}^{b-1} g(X; \theta^{(l+1)}_{jd}) - \sum_{d'=b+1}^{m} g(X; \theta^{(l)}_{jd'}) - \Sigma^{(l)}_{j(-j)} (\Sigma^{(l)}_{(-j)(-j)})^{-1} [W^{(l+1)}_{(-j)} - \mu^{(l+1)}_{(-j)}], \]
where $\mu^{(l+1)}_{(-j)} = \{ G_k(X; \theta^{(l+1)}_k); k \ne j \}$ and $G_k(X; \theta^{(l+1)}_k) = \sum_{d=1}^{b-1} g(X; \theta^{(l+1)}_{kd}) + \sum_{d'=b}^{m} g(X; \theta^{(l)}_{kd'})$. Similarly, in Algorithm [KD],
\[ \widetilde W^{\dagger}_{jb} = \widetilde W^{(l+1)}_j - \sum_{d=1}^{b-1} g(X; \widetilde\theta^{(l+1)}_{jd}) - \sum_{d'=b+1}^{m} g(X; \widetilde\theta^{(l)}_{jd'}) - \Sigma^{(l)}_{j(-j)} (\Sigma^{(l)}_{(-j)(-j)})^{-1} [\widetilde W^{(l+1)}_{(-j)} - \widetilde\mu^{(l+1)}_{(-j)}], \]
where $\widetilde\mu^{(l+1)}_{(-j)} = \{ G_k(X; \widetilde\theta^{(l+1)}_k); k \ne j \}$ and $G_k(X; \widetilde\theta^{(l+1)}_k) = \sum_{d=1}^{b-1} g(X; \widetilde\theta^{(l+1)}_{kd}) + \sum_{d'=b}^{m} g(X; \widetilde\theta^{(l)}_{kd'})$.
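In code, the conditional moments of Appendix D.1 reduce to a Schur-complement computation followed by a truncated-normal draw. The sketch below shows one such univariate update; the truncation interval (lo, hi) is supplied by the caller, since the outcome-specific region (determined by $S_i$ and the other utilities) is not derived here.

```python
import numpy as np
from scipy.stats import truncnorm

def conditional_moments(mu_i, Sigma, W_i, j):
    """m_ij and tau_j^2 for W_ij | W_i(-j), as in Appendix D.1."""
    idx = [k for k in range(len(mu_i)) if k != j]
    S_jmj = Sigma[j, idx]                               # Sigma_{j(-j)}
    S_inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])      # Sigma_{(-j)(-j)}^{-1}
    m = mu_i[j] + S_jmj @ S_inv @ (W_i[idx] - mu_i[idx])
    tau2 = Sigma[j, j] - S_jmj @ S_inv @ Sigma[idx, j]
    return m, tau2

def draw_w_ij(mu_i, Sigma, W_i, j, lo, hi, rng):
    """Draw W_ij from TN(m_ij, tau_j^2) truncated to (lo, hi)."""
    m, tau2 = conditional_moments(mu_i, Sigma, W_i, j)
    sd = np.sqrt(tau2)
    a, b = (lo - m) / sd, (hi - m) / sd                 # standardized bounds
    return truncnorm.rvs(a, b, loc=m, scale=sd, random_state=rng)

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.0]])
mu_i = np.array([0.2, -0.1, 0.0]); W_i = np.array([0.5, -0.4, 0.3])
print(draw_w_ij(mu_i, Sigma, W_i, 1, -np.inf, 0.0, rng))
```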
D.3 Predictions from MPBART

Given the fitted model parameters from the $L$th round of posterior sampling, $(\theta^{(L)}, \Sigma^{(L)})$, we obtain an outcome prediction for a new input $x$ by drawing $W(x) \sim \mathrm{MVN}(G(x; \theta^{(L)}), \Sigma^{(L)})$ and setting
\[ S(x) = \begin{cases} \text{reference level} & \text{if } \max\{ W_1(x), \ldots, W_C(x) \} < 0, \\ j & \text{if } \max\{ 0, W_1(x), \ldots, W_C(x) \} = W_j(x). \end{cases} \]
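A minimal sketch of this prediction rule follows, assuming the vector `G_x` of fitted sum-of-trees means $G(x; \theta^{(L)})$ has already been evaluated (the BART evaluation itself is not shown).

```python
import numpy as np

def predict_outcome(G_x, Sigma, rng, n_draws=1000):
    """Posterior predictive outcome draws for one new input.

    G_x: length-C vector of fitted utilities G(x; theta^(L)).
    Returns predicted labels: 0 denotes the reference level,
    and j = 1, ..., C denotes the j-th non-reference level.
    """
    W = rng.multivariate_normal(G_x, Sigma, size=n_draws)   # W(x) ~ MVN(G, Sigma)
    return np.where(W.max(axis=1) < 0, 0, W.argmax(axis=1) + 1)

rng = np.random.default_rng(4)
draws = predict_outcome(np.array([0.3, -0.2, 0.1]),
                        np.array([[1.0, 0.2, 0.1],
                                  [0.2, 1.0, 0.2],
                                  [0.1, 0.2, 1.0]]), rng)
print(np.bincount(draws, minlength=4) / draws.size)         # predictive class frequencies
```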
References

Albert, J. H., & Chib, S. (1993). Bayesian Analysis of Binary and Polychotomous Response Data. Journal of the American Statistical Association, 88(422), 669-679. doi: 10.2307/2290350

Burgette, L. F., & Nordheim, E. V. (2012). The Trace Restriction: An Alternative Identification Strategy for the Bayesian Multinomial Probit Model. Journal of Business & Economic Statistics, 30(3), 404-410. doi: 10.1080/07350015.2012.680416

Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART Model Search. Journal of the American Statistical Association, 93(443), 935-948. doi: 10.1080/01621459.1998.10473750

Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266-298. doi: 10.1214/09-AOAS285

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1-38.

Gardner, E. M., McLees, M. P., Steiner, J. F., Del Rio, C., & Burman, W. J. (2011). The spectrum of engagement in HIV care and its relevance to test-and-treat strategies for prevention of HIV infection. Clinical Infectious Diseases, 52(6), 793-800. doi: 10.1093/cid/ciq243

Imai, K., & van Dyk, D. A. (2005). A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics, 124(2), 311-334.

Kang, J. D., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4), 523-539.

Kindo, B. P., Wang, H., & Peña, E. A. (2016). Multinomial probit Bayesian additive regression trees. Stat, 5(1), 119-131. doi: 10.1002/sta4.110

Lee, H., Genberg, B. L., Nyambura, M., Hogan, J., Braitstein, P., & Sang, E. (2017). State-Space Models for Engagement, Retention, and Reentry in the HIV Care Cascade. CROI Conference.

Li, Z. R., McCormick, T. H., & Clark, S. J. (2018). Using Bayesian Latent Gaussian Graphical Models to Infer Symptom Associations in Verbal Autopsies. Bayesian Analysis. doi: 10.1214/19-BA1172

Liu, J. S. (1994). The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem. Journal of the American Statistical Association, 89(427), 958-966. doi: 10.2307/2290921

Liu, J. S. (1994a). Fraction of Missing Information and Convergence Rate of Data Augmentation. Research Triangle Park, North Carolina.

Liu, J. S., Wong, W. H., & Kong, A. (1994). Covariance Structure of the Gibbs Sampler with Applications to the Comparisons of Estimators and Augmentation Schemes. Biometrika, 81(1), 27-40. doi: 10.2307/2337047

Liu, J. S., & Wu, Y. N. (1999). Parameter Expansion for Data Augmentation. Journal of the American Statistical Association, 94(448), 1264-1274. doi: 10.1080/01621459.1999.10473879

Low-Kam, C., Telesca, D., Ji, Z., Zhang, H., Xia, T., Zink, J. I., & Nel, A. E. (2015). A Bayesian regression tree approach to identify the effect of nanoparticles' properties on toxicity profiles. The Annals of Applied Statistics, 9(1), 383-401. doi: 10.1214/14-AOAS797

McCulloch, R., & Rossi, P. (1994). An exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64(1-2), 207-240.

McCulloch, R. E., Polson, N. G., & Rossi, P. E. (2000). A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics, 99(1), 173-193. doi: 10.1016/S0304-4076(00)00034-8

McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. Frontiers in Econometrics.

Meng, X.-L., & Van Dyk, D. A. (1999). Seeking Efficient Data Augmentation Schemes via Conditional and Marginal Augmentation. Biometrika, 86(2), 301-320.

Murray, J. S. (2017). Log-Linear Bayesian Additive Regression Trees for Categorical and Count Responses. arXiv:1701.01503 [stat].

Nobile, A. (1998). A hybrid Markov chain for the Bayesian analysis of the multinomial probit model. Statistics and Computing, 8(3), 229-242. doi: 10.1023/A:1008905311214

Sparapani, R. A., Logan, B. R., McCulloch, R. E., & Laud, P. W. (2016). Nonparametric survival analysis using Bayesian Additive Regression Trees (BART). Statistics in Medicine, 35(16), 2741-2753. doi: 10.1002/sim.6893

Steenburgh, T. J. (2008). The Invariant Proportion of Substitution Property (IPS) of Discrete-Choice Models. Marketing Science, 27(2), 300-307.

Tanner, M. A., & Wong, W. H. (1987). The Calculation of Posterior Distributions by Data Augmentation. Journal of the American Statistical Association, 82(398), 528-540. doi: 10.2307/2289457

Van Dyk, D., & Meng, X.-L. (2001). The Art of Data Augmentation. Journal of Computational and Graphical Statistics, 10(1), 1-111.

van Dyk, D. A. (2010). Marginal Markov chain Monte Carlo methods. Statistica Sinica, 20(4), 1423-1454.

Waldmann, P. (2016). Genome-wide prediction using Bayesian additive regression trees. Genetics Selection Evolution, 48. doi: 10.1186/s12711-016-0219-8

WHO. (2012). Meeting report on Framework for metrics to support effective treatment as prevention.