BART with Targeted Smoothing: An analysis of patient-specific stillbirth risk
Jennifer E. Starling∗, Jared S. Murray, Carlos M. Carvalho, Radek K. Bukowski, James G. Scott
June 4, 2019

Abstract
This article introduces BART with Targeted Smoothing, or tsBART, a new Bayesian tree-based model for nonparametric regression. The goal of tsBART is to introduce smoothness over a single target covariate t, while not necessarily requiring smoothness over other covariates x. tsBART is based on the Bayesian Additive Regression Trees (BART) model, an ensemble of regression trees. tsBART extends BART by parameterizing each tree's terminal nodes with smooth functions of t, rather than independent scalars. Like BART, tsBART captures complex nonlinear relationships and interactions among the predictors. But unlike BART, tsBART guarantees that the response surface will be smooth in the target covariate. This improves interpretability and helps regularize the estimate.

After introducing and benchmarking the tsBART model, we apply it to our motivating example: pregnancy outcomes data from the National Center for Health Statistics. Our aim is to provide patient-specific estimates of stillbirth risk across gestational age (t), based on maternal and fetal risk factors (x). Obstetricians expect stillbirth risk to vary smoothly over gestational age, but not necessarily over other covariates, and tsBART has been designed precisely to reflect this structural knowledge. The results of our analysis show the clear superiority of the tsBART model for quantifying stillbirth risk, thereby providing patients and doctors with better information for managing the risk of fetal mortality. All methods described here are implemented in the R package tsbart.

Keywords and phrases:
Bayesian additive regression tree, ensemble method, Gaussian process, regression tree, regularization
An ongoing research challenge in obstetrics is to quantify the risk of stillbirth, defined as fetal death after 20 weeks of gestation. Stillbirth is a major public-health problem, with 23,595 reported cases in the U.S. in 2013 alone [MacDorman and Gregory, 2015]. Stillbirth is less well understood than other adverse pregnancy outcomes, and stillbirth rates have remained largely unchanged, even as many other serious adverse pregnancy outcomes (e.g. neonatal death) have become rarer. Providing better estimates of stillbirth risk as gestational age advances can yield important insights for obstetricians and patients. If an obstetrician knew, for example, that a patient's stillbirth risk was likely to rise earlier in pregnancy than usual, or was likely to rise to higher than normal levels at later gestational ages, then proactive steps could be taken to manage that risk, especially in pregnancy at term. Conservative steps might entail increased monitoring and more frequent prenatal clinic visits, while a more aggressive step might involve an elective Cesarean section or early induction of labor.

Statistically speaking, we can think of stillbirth risk as a regression function h(t, x) representing the conditional probability of stillbirth at gestational age t, given that the fetus survived in utero until just before t, and given other characteristics x of the maternal-fetal dyad. Thus the fundamental biomedical problem we address in this paper is to provide better patient-specific estimates of h(t, x). This fills an important knowledge gap, since the current obstetrics literature does not provide an especially nuanced characterization of this function. In particular, the way that h(t, x) depends upon maternal-fetal characteristics is not well understood.

∗ Corresponding author: [email protected]
Structurally, obstetricians do expect that stillbirth risk evolves smoothly, without sudden jumps or discontinuities, as gestational age (t) advances; however, they do not have strong prior knowledge about how it should change with other maternal-fetal characteristics (x).

The central argument of our paper is that this situation calls for nonparametric regression with targeted smoothing in gestational age t: that is, we require that h(t, x) be smooth with respect to t (the target covariate), but we remain agnostic about smoothness with respect to x. This approach realizes two complementary advantages when quantifying stillbirth risk. First, from a clinical perspective, targeted smoothing reflects prior knowledge, aids interpretability, and assists doctors in communicating stillbirth risks to patients as clearly as possible. For example, smoothing helps prevent doctors and patients alike from over-interpreting the small jumps or wiggles in h(t, x) that arise in a completely nonparametric estimate, but that are likely just noise. Second, from a statistical perspective, targeted smoothing can reduce variance without inflating bias.

To incorporate these benefits into our analysis of stillbirth risk, we propose a Bayesian approach called BART with Targeted Smoothing, or tsBART, which is based on the highly successful Bayesian Additive Regression Trees (BART) model introduced by Chipman et al. [2010]. The original BART model is a Bayesian ensemble-of-trees approach to nonparametric regression. It predicts a scalar response y using a sum of many binary regression trees, where each tree is encouraged by a prior to be a "weak learner"—that is, to have relatively few splits and to use only a small set of the available predictors. BART with Targeted Smoothing is similar in this regard, and we use the same prior over tree space proposed in the original BART paper. Where tsBART differs is in the prior used for the terminal nodes of each tree.
BART specifies a Gaussian prior for the scalar mean parameters in each terminal node. tsBART replaces the Gaussian prior with a Gaussian process prior over univariate functions in the "target" covariate t, so that each terminal node is parameterized by a smooth function of t.

Thus to summarize our contributions:

1. We introduce the tsBART model and demonstrate its advantages for problems where targeted smoothing is desirable.
2. We apply this method to data on birth records from the National Center for Health Statistics in order to produce accurate estimates for h(t, x), providing clinicians with more granular knowledge of patient-specific stillbirth risk.

It would certainly be possible to estimate stillbirth risk using existing techniques for modeling time-to-event data [see, e.g. Mandujano et al., 2013]. Thus a major focus of our paper is to demonstrate that the specific features we had in mind when designing tsBART—targeted smoothing in gestational age, while avoiding strong assumptions in other covariates—have some very real advantages for this kind of problem. Available techniques either lack smoothness entirely, or impose smoothness assumptions globally across all covariates rather than just t. Moreover, while our motivating example involves estimating a smooth hazard function (or, in continuous time, a hazard rate), the tsBART model is much more general than this. The same approach can work in any nonparametric regression problem where targeted smoothing is desired a priori, regardless of whether the response is continuous, binary, or (as in our case) a time-to-event outcome. Across a series of benchmarking examples, we show that our approach to targeted smoothing can lead to a favorable bias-variance tradeoff versus both classes of competing methods: those that make global smoothness assumptions, and those that make no smoothness assumptions.
Our simulation studies also bear out another considerable advantage: when the targeted smoothing assumption is valid, tsBART tends to yield superior frequentist coverage versus plausible alternative methods.

The paper proceeds as follows. Section 2 provides an overview of the stillbirth risk-curve modeling problem and dataset. Section 3 details the tsBART model and reviews the relevant literature. Section 4 presents the results of simulation studies showing the advantages of the method. Section 5 then presents our core scientific contribution: an analysis of stillbirth risk using the tsBART model. Section 6 concludes with a brief discussion. Further details, including on computational methods, are in the appendices. All methods described in this paper are implemented in the R package tsbart.

Stillbirth is a significant public health concern that affects tens of thousands of Americans each year. In the U.S. in 2013, a total of 23,595 stillbirths were reported [MacDorman and Gregory, 2015]. The National Vital Statistics System notes that stillbirth has been significantly overlooked in public-health research and obstetrics guidance, and its mechanisms are not well understood. Obstetricians do know that the risk of stillbirth typically (but not universally) increases cumulatively with time in utero. But this risk must be balanced against the potential negative consequences of early delivery. Preterm and early term births are associated with increased risk of neonatal mortality and morbidity, adverse neuro-developmental and cognitive outcomes, and increased healthcare costs [e.g. Muraskas and Parsi, 2008, Kornhauser and Schneiderman, 2010].
Obstetricians can therefore benefit greatly from access to better estimates of stillbirth risk over gestational age, so that they can give clinical advice that minimizes the overall risk of adverse perinatal outcomes. Conservative patient management might entail increased monitoring and more frequent prenatal clinic visits, while more aggressive steps include an early delivery via either elective Cesarean section or early induction of labor. From a statistical perspective, this means that accurate uncertainty quantification is vital for helping doctors understand which cases have a less precisely estimated risk profile.

Previous research on adverse perinatal outcomes has focused more heavily on neonatal death than on stillbirth [e.g. Bailit et al., 2010, Clark et al., 2010, Reddy et al., 2011]. A more recent line of work attempts to refine these broad conclusions by seeking to model stillbirth risk based on a patient's individual risk factors. In particular, Mandujano et al. [2013] model hazard functions for stillbirth over gestational age.

The tsbart R package is available at https://github.com/jestarling/tsbart.

Our analysis uses anonymized birth data from the National Center for Health Statistics from the years 2004 and 2005 (Table 1). Each medical record is associated with a single pregnancy. Each record contains the gestational age in weeks at which the pregnancy was delivered, based on calculation from the woman's last normal menstrual period, or a clinical estimate. The outcome of each pregnancy is recorded as either a stillbirth or a live birth. Each record also contains information about the maternal-fetal dyad, including maternal risk factors, such as diabetes, hypertension, and sociodemographic variables, and fetal characteristics, such as sex or estimated fetal weight.

The dataset consists of 8,371,461 pregnancies, with 7,940,495 live births, 100,072 stillbirths, and 330,894 cases where the stillbirth outcome is missing. We restrict our analysis to complete cases, with all maternal-fetal information and stillbirth response present.
Analysis is also limited to pregnancies delivered from 34 to 42 weeks inclusive, as this is the range where clinicians might plausibly recommend delivering a baby based on elevated stillbirth risk, barring truly exceptional circumstances. These restrictions yield 4,553,868 pregnancies for analysis, of which 7,175 are stillbirths, for an overall prevalence of 1.58 stillbirths per thousand pregnancies from 34 to 42 weeks' gestation. The prevalence in the high-risk category was 2.85 stillbirths per thousand, while the prevalence in the low-risk group was 1.45 per thousand. Prevalence is comparable to the dataset analyzed by Mandujano et al. [2013], where overall prevalence was 1.45 stillbirths per thousand: 2.68 in the high-risk group, and 1.34 in the low-risk group. A full table of summary statistics for our sample is shown in Table 1. In practice, we work with a smaller case-control sample of this full data set. This is described in Section 5; full details of the data pipeline are also available at github.com/jestarling/tsbart-analysis.

Maternal-fetal characteristics were selected for inclusion in our regression models based on clinical knowledge, availability of data, and previous research findings on risk factors for stillbirth [e.g. Mandujano et al., 2013, Muraskas and Parsi, 2008, Kornhauser and Schneiderman, 2010]. Maternal covariates include maternal age, primiparity, whether the labor was induced, ethnicity (White non-Hispanic, Black non-Hispanic, Hispanic, Other), aggregate pregnancy weight gain quantile, presence of diabetes mellitus, presence of chronic hypertension, and an indicator for the presence of any other risk factor. Other risk factors include anemia, cardiac disease, lung disease, hemoglobinopathy, and Rh sensitization. Consistent with the analysis of Mandujano et al. [2013], pregnancy-related complications, such as gestational diabetes, abruption, or preeclampsia, were not included as risk factors.
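As a quick check of the arithmetic, the overall prevalence follows directly from the counts above:

```python
# Overall stillbirth prevalence per thousand, from the counts reported above.
stillbirths = 7175
pregnancies = 4553868

per_thousand = 1000 * stillbirths / pregnancies
print(round(per_thousand, 2))  # 1.58, matching the reported prevalence
```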
Fetal covariates include infant sex and birth weight quantile.

We did not exclude any variables on statistical grounds. One of the benefits of the BART framework, which also applies to the tsBART method, is that variable selection procedures are not generally required. As discussed in Section 3, the BART prior guides the model toward choosing subsets of the most relevant covariates for inclusion in each tree.

Table 1: Cohort characteristics for our dataset on stillbirth. The "low-risk" and "high-risk" designations are not used formally in our model, but they are provided for the sake of comparison with Mandujano et al. [2013], Table 1. Our cohort is similar in composition to the cohort analyzed there. Numbers in parentheses are percentages with respect to the given cohort.
Characteristic                 Full Cohort        Low risk           High risk
                               (n=4,553,868)      (n=4,137,260)      (n=416,608)

Maternal age (Yrs)
  <20                          452,060 (9.93)     418,953 (10.13)    33,107 (7.95)
  20-29                        2,401,223 (52.73)  2,204,168 (53.28)  197,055 (47.30)
  30-39                        1,585,226 (34.81)  1,415,991 (34.23)  169,235 (40.62)
  40-49                        115,020 (2.53)     97,855 (2.37)      17,165 (4.12)
  50+                          339 (0.01)         293 (0.01)         46 (0.01)

Maternal race and ethnicity
  White, non-Hispanic          2,757,816 (60.56)  2,520,632 (60.93)  237,184 (56.93)
  Black, non-Hispanic          693,751 (15.23)    619,761 (14.98)    73,990 (17.76)
  Hispanic                     809,086 (17.77)    736,908 (17.81)    72,178 (17.33)
  Other                        293,215 (6.44)     259,959 (6.28)     33,256 (7.98)

Parity
  Primiparous                  1,490,501 (32.73)  1,370,443 (33.12)  120,058 (28.82)
  Multiparous                  3,063,367 (67.27)  2,766,817 (66.88)  296,550 (71.18)

Maternal risk factors
  Anemia                       115,663 (2.54)     0 (0.00)           115,663 (27.76)
  Cardiac disease              20,937 (0.46)      0 (0.00)           20,937 (5.03)
  Lung disease                 63,063 (1.38)      0 (0.00)           63,063 (15.14)
  Diabetes mellitus            159,765 (3.51)     0 (0.00)           159,765 (38.35)
  Hemoglobinopathy             4,260 (0.09)       0 (0.00)           4,260 (1.02)
  Chronic hypertension         43,935 (0.96)      0 (0.00)           43,935 (10.55)
  Renal disease                14,210 (0.31)      0 (0.00)           14,210 (3.41)
  Rh isoimmunization           31,317 (0.69)      0 (0.00)           31,317 (7.52)

Infant sex
  Male                         2,330,557 (51.18)  2,117,958 (51.19)  212,599 (51.03)
  Female                       2,223,311 (48.82)  2,019,302 (48.81)  204,009 (48.97)
Birth weight cannot be observed directly by a doctor contemplating whether to deliver a pregnancy early due to elevated stillbirth risk. However, birth weight quantile acts as a sensible proxy for the information doctors would actually have at their disposal in a prenatal visit: fetal weight quantile in utero, which is estimated routinely using ultrasound and fetal growth charts. Because fetal weight quantile at later gestational ages correlates very strongly with birth weight quantile, we do not expect that there is substantial error introduced by using birth weight quantile (which we have and a doctor wouldn't) as a proxy for fetal weight quantile in utero (which a doctor would have).
We now introduce the tsBART model, which later in Section 5 we will use to analyze the stillbirth data just described. Throughout the remaining sections, we let t ∈ T represent the target covariate, i.e. the covariate in which the response surface is assumed to be smooth, which in our case is gestational age (discrete time, measured in weeks or days). We let x ∈ X represent a vector of covariates other than t, which in our case are the characteristics of a particular maternal-fetal dyad.

Because tsBART is a general approach for targeted smoothing in nonparametric regression, we first introduce the model in full generality. We then explain how to adapt it more specifically for modeling the hazard function for stillbirth, h(t, x), which represents the conditional probability of stillbirth at gestational age t, given that a fetus has survived in utero through gestational age t − 1.

Before introducing tsBART, we briefly review the original BART framework. BART (for Bayesian Additive Regression Trees) is a fully Bayesian ensemble-of-trees model [Chipman et al., 2010]. BART models the mean response for a non-linear regression function as the sum of a large number of binary trees, each of which is constrained by the BART prior to be shallow (and therefore a weak learner). The model is defined by a likelihood and prior, and inference is performed by sampling from the posterior. Specifically, suppose that y_i is a scalar response and x_i is a vector of covariates. The BART model assumes that

    y_i = f(x_i) + ε_i,   ε_i iid∼ N(0, σ²),   (1)

    f(x_i) = Σ_{j=1}^m g(x_i; T_j, M_j).   (2)

Here each T_j is a binary tree that induces a step function in x via a partition of the covariate space, while the M_j = {µ_1j, ..., µ_{b_j j}} are the b_j terminal node values in tree j (i.e. the levels of the step function).
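As a concrete sketch of Equations (1)–(2): each tree routes x to a terminal node, and the fit is the sum of the resulting step functions. The nested-dict tree representation and all numeric values below are illustrative inventions, not the tsbart implementation.

```python
def eval_tree(tree, x):
    """Follow a tree's split rules down to a terminal node; return its scalar mu."""
    while "mu" not in tree:
        tree = tree["left"] if x[tree["var"]] < tree["cut"] else tree["right"]
    return tree["mu"]

def eval_ensemble(trees, x):
    """The BART fit f(x) is the sum of the m tree-induced step functions."""
    return sum(eval_tree(t, x) for t in trees)

# Two shallow "weak learner" trees with made-up splits and node values.
trees = [
    {"var": 0, "cut": 0.5,
     "left": {"mu": 0.4},
     "right": {"var": 1, "cut": 0.3, "left": {"mu": 0.9}, "right": {"mu": -0.2}}},
    {"var": 1, "cut": 0.8, "left": {"mu": 0.1}, "right": {"mu": 0.3}},
]

print(round(eval_ensemble(trees, [0.2, 0.9]), 6))  # 0.7 = 0.4 + 0.3
```

Each individual tree contributes only a coarse step function; it is the ensemble sum, combined with the shallow-tree prior, that yields a flexible fit.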
We can think of each g as a basis function parameterized by the binary tree defined by (T_j, M_j).

The BART prior consists of three elements. The first component is the conjugate prior for the error variance, σ² ∼ νλ/χ²_ν. The second component is the specification of independent Gaussians µ_hj iid∼ N(µ_µ, τ²) on the terminal node parameters M_j = {µ_1j, ..., µ_{b_j j}} of each tree. The third component is the prior over tree space, composed of a set of probabilities governing three things: the choice of splitting covariate, the choice of splitting value for each covariate, and whether a node at a given depth is a terminal node. We refer interested readers to Chipman et al. [2010], who recommend default hyperparameters that favor shallow trees, which both regularizes the estimate and encourages rapid mixing.

BART has been successful in a variety of contexts including prediction and classification [Chipman et al., 2010, Murray, 2017, Linero and Yang, 2018, Linero, 2018, Hernández et al., 2018], survival analysis [Sparapani et al., 2016], and causal inference [Hill, 2011, Hahn et al., 2017, Logan et al., 2019, Sivaganesan et al., 2017].

Motivated by the success of BART models, we introduce tsBART, an extension of BART for estimating regression functions that are smooth in a target covariate. Consider a regression problem with scalar response y_i = f(t_i, x_i) + e_i, where the underlying mean function f(t_i, x_i) depends both on t (a scalar) and x (a vector), and should be smooth in t. To adapt BART for this setting, we replace the scalar node-level parameters µ_hj with univariate functions in t, µ_hj(t), and we assume

Figure 1: Illustrates the extension of BART to tsBART for a single tree, where nodes are parameterized by smooth functions instead of scalar parameters.
(Left) A vanilla BART example: a binary tree T_j where terminal nodes are labeled with the corresponding scalar parameters µ_hj. (Middle) The corresponding vanilla BART partition of the sample space and the step function g(T_j, M_j). (Right) Our BART with Targeted Smoothing modification, where the µ_hj(t) parameters associated with terminal nodes are now functions of t.

that only x variables (but not the target variable t) are used to define tree splits. (See Figure 1.) These univariate functions in t can in principle be assigned any prior over function space; in the applications considered in this paper, we use Gaussian process priors.

More formally, we express the tsBART model as follows. Suppose that each observation i in our data set consists of predictor variables (t_i, x_i) together with outcome y_i, for i = 1, ..., N. (Recall that t_i is the target variable for smoothing, while x_i is a vector of all other variables.) We now let

    y_i = α(t_i) + f(t_i, x_i) + ε_i,   ε_i iid∼ N(0, σ²),   (3)

    f(t_i, x_i) = Σ_{j=1}^m g(t_i, x_i; T_j, M_j).

Here T_j is a binary tree whose terminal nodes partition the "non-target" covariate space X into b_j disjoint regions, just as in the original BART model. But unlike BART, we parametrize the terminal nodes of the tree not by scalars, but by a collection of Gaussian processes in t: M_j = {µ_1j(t), ..., µ_{b_j j}(t)}, with each function µ_hj(t) associated with one terminal node. The right panel of Figure 1 illustrates an example with b_j = 3 terminal nodes. The overall response is the sum of m such trees, so that at any fixed design point x = (x_1, ..., x_p), the response f(t, x) is the sum of m Gaussian processes. We center the model at α(t), a baseline function of t, so that the trees parametrize deviations from the baseline that are associated with x.
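The change from scalar leaves to functional leaves can be sketched in the same style (again with invented trees; the leaf functions here are arbitrary smooth curves standing in for Gaussian process draws):

```python
import math

def eval_ts_tree(tree, t, x):
    """Route on x exactly as in BART, but the terminal node now holds a
    univariate function of the target covariate t rather than a scalar."""
    while "mu" not in tree:
        tree = tree["left"] if x[tree["var"]] < tree["cut"] else tree["right"]
    return tree["mu"](t)

def f(trees, t, x):
    """At fixed x, f(t, x) is a sum of smooth leaf functions, hence smooth in t."""
    return sum(eval_ts_tree(tr, t, x) for tr in trees)

trees = [
    {"var": 0, "cut": 0.5,
     "left": {"mu": lambda t: 0.2 * math.sin(t)},
     "right": {"mu": lambda t: -0.1 * t}},
    {"var": 1, "cut": 0.3,
     "left": {"mu": lambda t: 0.05 * t},
     "right": {"mu": lambda t: 0.3 * math.cos(t)}},
]

# Moving x across a split boundary can still change the curve abruptly;
# only the dependence on t is guaranteed smooth.
curve = [f(trees, t / 4.0, [0.2, 0.9]) for t in range(9)]
print(len(curve))  # 9
```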
We estimate α(t) using the sample mean response for observations at each t.

We use the same prior over tree space as in the original BART paper. To model the µ_hj(t)'s in each terminal node, we use a zero-centered Gaussian process prior: µ(t) ∼ GP(0, C_θ(t, t′)), where C_θ(t, t′) is the covariance function with hyperparameter θ, which can be either chosen based on prior knowledge or tuned using the data. (Zero-centering is appropriate here because we separate out the mean term α(t) in Equation 3.)

In principle any covariance function can be used. For all examples in this paper, we use the squared-exponential covariance function with variance parameter τ²/m and length scale l. (This implies that f(t, x) is a Gaussian process in t for fixed x, but it is not a Gaussian process in (t, x) jointly.) That is,

    C(t, t′) = (τ²/m) exp{−d(t, t′)² / (2l²)},   (4)

where d(t, t′) is the Euclidean distance between t and t′. Here τ² determines the marginal variance of the µ_hj's, while l governs their "wiggliness." As in the original BART model, we scale the variance parameter τ² inversely by the number of trees m. Since the mean-response function f(t, x) is the sum of m trees, this implies that the marginal prior variance of f(t, x) at any point t is τ². We then assign τ a half-Cauchy prior, as in Gelman [2006], Linero and Yang [2018], and Hahn et al. [2017].

The tsBART model also requires specifying l, the length scale of the Gaussian process prior, with smaller l corresponding to more wiggliness. This length scale can be set using prior knowledge, but in Section 3.3 we provide a method to tune it automatically over a grid of possible values. As we also explain in Section 3.3, a reasonable default choice when using the squared-exponential covariance function is l = T/π, where T is the range of the t_i values in the data set.

We make the simplifying assumption of an i.i.d.
error structure, and complete the model specification by assigning σ² an inverse chi-square prior, σ² ∼ νλ/χ²_ν. For full computational details, including the data augmentation, prior specification, and posterior full conditional distributions, see Appendices A.1 and A.2.

We must select l, the length-scale parameter of the covariance function. To do this, we represent l using a formula by Kratz [2006] for the expected number of times a random function crosses a level s, E[N_T(s)], on some interval I = [0, t_max]. This formula gives us a closed-form solution for the length-scale parameter as a function of the expected number of times that f(t, x) crosses zero. Recall that if f(t, x) = 0, then the overall response at predictor x is simply α(t), which we can think of as the baseline response over t. The more times that f(t, x) crosses zero, the more sharply the covariate-specific mean response deviates from the overall mean response.

To set E[N_T(s)], let r(s) be the correlation function between time 0 and time s:

    r(s) = E[{f(0, x) − µ(0)}{f(s, x) − µ(s)}] / [sd(f(0, x)) · sd(f(s, x))].

Per Kratz,

    E[N_T(s)] = t_max · exp(−s²/2) · √(−r′′(0)) / π,

and we let s = 0, which maximizes E[N_T(s)] and corresponds to crossings of the mean. We use the squared-exponential covariance kernel, so

    r(t) = Cov(f(0, x), f(t, x)) / τ² = exp{−t² / (2l²)}.

Some algebra yields

    l = t_max / (π E[N_T(0)]).   (5)

This opens up several options for choosing the length scale. The first is by subjective choice. This would entail eliciting a guess for κ ≡ E[N_T(0)], the average number of times that the response α(t) + f(t, x) will cross the overall mean response α(t). This is a useful basis for elicitation, since the number of crossings is a sensible and intuitive measure for the wiggliness of our response as a function of t.

The second option is to choose a default value for κ.
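The mapping in Equation (5) and the covariance in Equation (4) can be sketched together as follows (the grid and hyperparameter values are illustrative, not the package defaults):

```python
import math

def length_scale(t_max, kappa):
    """Equation (5): l = t_max / (pi * E[N_T(0)]), with kappa the expected
    number of mean crossings on [0, t_max]."""
    return t_max / (math.pi * kappa)

def se_cov(ts, tau2, m, l):
    """Squared-exponential covariance of Equation (4),
    C(t, t') = (tau^2 / m) * exp(-d(t, t')^2 / (2 l^2)), over a grid ts."""
    return [[(tau2 / m) * math.exp(-(t - s) ** 2 / (2 * l * l)) for s in ts]
            for t in ts]

# Nine weekly gestational ages (34-42), the kappa = 1 default, illustrative tau2, m.
ts = list(range(34, 43))
l = length_scale(t_max=len(ts) - 1, kappa=1.0)  # t_max = 8, so l = 8/pi
C = se_cov(ts, tau2=1.0, m=200, l=l)

# Each node function has prior variance tau^2/m, so the sum of m trees
# recovers marginal variance tau^2 at any fixed t.
print(round(200 * C[0][0], 6))  # 1.0
```

With κ = 1 and a nine-week grid spanning 34–42 weeks, t_max = 8 and l = 8/π ≈ 2.55.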
If a default must be chosen, we recommend κ = 1, or equivalently, l = t_max/π. This encodes the belief that each response surface in t will cross the overall mean response α(t) once, on average across all predictor values. This allows for a substantial amount of heterogeneity in the mean responses over time, while still shrinking towards the overall mean.

A final option, which we use in our simulation studies and real-data examples, is to tune κ = E[N_T(0)] over a grid of candidate values. This could be done using cross validation, as in the original BART paper, although we use the Watanabe-Akaike information criterion, or WAIC [Watanabe, 2013]. WAIC is calculated as the log pointwise posterior predictive density adjusted by a penalty for the effective number of parameters, to avoid overfitting. It provides an estimate of generalization error without requiring that we split the data into multiple subsets; see Appendix A.3 for details. In our simulations, we note that values of κ ≈ 1 are frequently chosen by this data-driven approach, lending further credence to the choice of κ = 1 as a reasonable default.

In their original paper, Chipman et al. [2010] provide a probit version of the BART model for binary outcomes Y ∈ {0, 1}:

    p(Y = 1 | x) = Φ(G(x)),   (6)

    G(x) = Σ_{j=1}^m g(x; T_j, M_j),   (7)

where Φ(·) is the standard normal CDF, and where G is the standard BART model. Inference proceeds via data augmentation, using the method of Albert and Chib [1993].

Our tsBART model can be extended in the same way. Suppose that we observe a binary response c_i, together with target covariate t_i and non-target covariates x_i. The tsBART probit model introduces a latent Gaussian variable z_i, and then parametrizes z_i using tsBART, in a manner parallel to the original BART probit model:

    c_i = 1 if z_i ≥ 0, and c_i = 0 if z_i < 0,   (8)

    z_i = α(t_i) + f(t_i, x_i) + ε_i,   ε_i iid∼ N(0, 1),   (9)

    f(t_i, x_i) = Σ_{j=1}^m g(t_i, x_i; T_j, M_j).   (10)
Here α(t) and f(t, x) are defined exactly as in Equation 3, and each g_j is assigned the same prior outlined in the previous subsection. Marginalizing over z_i yields the desired probability under the probit model, P(c_i = 1 | x_i, t_i) = Φ{α(t_i) + f(t_i, x_i)}.

Crucially for our application, it is also straightforward to extend tsBART-probit to discrete right-censored time-to-event outcomes, as noted by Sparapani et al. [2016] in the context of the original BART-probit model. Suppose that t_i ∈ T is a discrete time-to-event outcome, and that c_i is a censoring indicator: c_i = 1 means that an event occurred at time t_i, while c_i = 0 means that observation i was right-censored at time t_i. In our stillbirth risk-modeling problem, c_i = 1 corresponds to a stillbirth at gestational age t_i, while c_i = 0 corresponds to a live birth at t_i (which is right-censoring with respect to the stillbirth event). The object of interest is the set of conditional probabilities p = {p_it}, where p_it is the conditional probability of an event at time t for observation i, given that no event has happened through time t − 1. These conditional probabilities define the discrete-time hazard function h(t, x). For ease of exposition, we assume here that the possible event times are T = {1, ..., T}, but this is not a requirement.

To accommodate this data structure, we use the following standard factorization of the likelihood for a discrete-time hazard model. We introduce binary auxiliary variables {c̃_is : s = 1, ..., t_i} for each observation i = 1, ..., N, where

    c̃_is = 1 if c_i = 1 and s = t_i, and c̃_is = 0 otherwise.

The likelihood for the hazard function is now

    L(p) = Π_{i=1}^N Π_{s=1}^{t_i} p_is^{c̃_is} (1 − p_is)^{1 − c̃_is}.

We note, as do Sparapani et al.
[2016], that the product form of this likelihood does not come from the assumption that the binary c̃_is events are independent, but rather from the definition of each p_is as a conditional probability.

We can now use the same latent-variable trick from Albert and Chib [1993] to construct the tsBART-probit model for p, as follows:

    c̃_is = 1 if z_is ≥ 0, and c̃_is = 0 if z_is < 0,   (11)

    z_is = α(s) + f(s, x_i) + ε_is,   ε_is iid∼ N(0, 1),   (12)

    f(s, x_i) = Σ_{j=1}^m g(s, x_i; T_j, M_j),   (13)

where α and the g_j's are parametrized just as in the tsBART model described previously, treating time as the target covariate for smoothing.

Our paper sits in a long line of other research on extensions to the Bayesian tree-modeling framework. Two papers in particular are especially close in spirit to ours. The first is Sparapani et al. [2016], who introduce a model for nonparametric survival analysis using BART. Their model incorporates dependence on t by simply adding time as an ordinary covariate to a BART-probit for the discrete-time hazard function. This does not impose any continuity or smoothness constraints on f(t, x). In contrast, our approach smooths the hazard function over time, while still retaining the benefits of BART. The second paper is the treed Gaussian process (TGP) model of Gramacy and Lee [2008]. Their model uses a single deep tree with a Gaussian process in each terminal node; our model, in contrast, is a sum of many trees. Our work therefore generalizes that of Gramacy and Lee [2008] in the same way that the BART model generalizes the single-tree Bayesian CART model of Chipman et al. [1998].

Smooth or partially smooth extensions of Bayesian tree models have also been proposed previously by Linero and Yang [2018], who smooth a regression tree ensemble by randomizing the decision rules at internal nodes of the tree.
This model induces smoothness over all covariates by effectively replacing the step functions induced by the binary trees with sigmoids. In contrast, our approach smooths over just one target covariate, while avoiding the high computational cost associated with the method of Linero and Yang [2018].

We conduct two simulation studies to compare tsBART to existing methods. These simulations are designed to evaluate tsBART along several dimensions—out-of-sample predictive performance, credible interval coverage, and interpretability—in settings with varying degrees of complexity in covariate interactions.

Given the importance of uncertainty quantification for modeling stillbirth risk, we do not benchmark against pure machine-learning methods that do not readily produce valid confidence or credible intervals. This excludes neural networks, boosting, CART, and many other ensemble methods. We do, however, benchmark against BART, which has been shown to enjoy comparable or superior predictive performance to all these pure machine-learning methods across a range of scenarios [see, e.g. Chipman et al., 2010, who run these comparisons across 42 benchmark data sets]. Thus very little is lost by excluding methods that perform comparably to BART in terms of pure prediction, but that cannot produce confidence or credible sets for those predictions.

One plausible benchmark might be Random Forests, for which recent research [Wager et al., 2014] has addressed the problem of accurate uncertainty quantification. However, we choose not to include Random Forests in the simulation benchmarks for two reasons. First, Chipman et al. [2010] performed extensive benchmarking of ordinary BART versus Random Forests, and they make a persuasive argument that if the computational resources are available for BART, it tends to perform a bit better, on average. Additionally, we did investigate the performance of Random Forests on the stillbirth dataset that we analyze in Section 5.
We found that the stillbirth risk curve estimates provided by Random Forests had many of the same interpretational problems posed by BART: by not imposing adequate smoothness over time, the method limits interpretability for clinicians and encourages them to over-interpret small wiggles in the fit. This analysis is included in Appendix A.5.
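Several of the methods compared below are probit models fit by data augmentation. Conditional on the fit f, the latent-variable construction of Equations (11)-(13) reduces to drawing each $z_{is}$ from a truncated normal, as in Albert and Chib [1993]. A minimal sketch of that augmentation step (the function name is illustrative, not from the paper's implementation):

```python
import numpy as np
from scipy.stats import truncnorm

def sample_latents(c, mean, rng):
    """One Albert-Chib augmentation step for a probit model: given binary
    outcomes c (1 = event) and the current mean m = alpha(s) + f(s, x),
    draw z ~ N(m, 1) truncated to [0, inf) where c = 1 and to (-inf, 0)
    where c = 0."""
    c = np.asarray(c)
    mean = np.asarray(mean, dtype=float)
    # truncnorm takes standardized bounds (lo - m) / sd and (hi - m) / sd
    lo = np.where(c == 1, 0.0, -np.inf)
    hi = np.where(c == 1, np.inf, 0.0)
    return truncnorm.rvs(lo - mean, hi - mean, loc=mean, scale=1.0,
                         random_state=rng)
```

Given the sampled latents, the conjugate Gaussian updates for the ensemble proceed exactly as in the continuous-response model.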
We first conduct a simulation study comparing tsBART to the ordinary BART model. The initial focus on BART is intended to isolate a key feature of our approach: smoothing in t, versus simply including t as another predictor available to the model. BART is also the most relevant practical comparison for our application, since Sparapani et al. [2016] have already shown that ordinary BART-probit has cutting-edge performance for discrete-time survival modeling, versus a wide range of competing methods, including many more traditional time-to-event models.

We simulated datasets across three scenarios of modest dimension in the non-target variables x: one with four covariates, one with eight covariates, and another with twenty covariates. For all scenarios, we used eight discrete time points ($T = \{1, \ldots, 8\}$) for the target covariate. This mimics the stillbirth data, where information on gestational age is used at a weekly resolution between 34 and 42 weeks. It also reflects many other obstetrics, public health, and biomedical applications where data is observed at discrete intervals. We generated each pair of covariates $(x_{ij}, x_{i,j+1})$ for odd j from a bivariate Gaussian with moderate correlation and unit variances. For each case, we simulated data sets with sample sizes $n \in \{100, 250, 500, 1000\}$, for a total of twelve combinations of sample size (n) and dimension of the non-target covariates (p). For each of these twelve combinations, we simulated 100 datasets.

We focus on a ground truth in which the mean response evolves smoothly in t, and we seek to answer two key questions: 1) can tsBART adapt to the correct degree of smoothness, and 2) if so, how large are the gains versus an otherwise very similar model that makes no smoothness assumptions? In the p = 4 case, we let

$f(t, x) = g(x_1, x_2) \cdot \cos(t + 2\pi h(x_3, x_4)),$

so that the covariates x modify both amplitude and phase shift.
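A sketch of this data-generating process for the p = 4 case, taking g and h to be sums of their covariate pairs (the correlation and noise scale below are illustrative assumptions, not values from the paper):

```python
import numpy as np

def simulate_p4(n, rho=0.5, sigma=0.1, seed=0):
    """Simulate the p = 4 scenario: covariate pairs from a bivariate
    Gaussian with correlation rho and unit variances, eight discrete time
    points, and mean response
        f(t, x) = g(x1, x2) * cos(t + 2*pi*h(x3, x4)),
    with g and h the sums of their covariate pairs."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    x12 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    x34 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    t = rng.choice(np.arange(1, 9), size=n)      # T = {1, ..., 8}
    g = x12.sum(axis=1)                          # amplitude modifier
    h = x34.sum(axis=1)                          # phase-shift modifier
    f = g * np.cos(t + 2.0 * np.pi * h)
    y = f + rng.normal(0.0, sigma, size=n)
    return t, np.column_stack([x12, x34]), f, y
```

The p = 8 and p = 20 cases extend this pattern by alternating sine and cosine terms over further covariate pairs.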
We let g and h be simple functions of the covariate pairs; here we sum each pair of covariates. In the p = 8 and p = 20 cases, we continue in a similar fashion, alternating sines and cosines, so that

$f(t, x) = g_1(x_1, x_2) \cdot \cos(t + 2\pi h_1(x_3, x_4)) + g_2(x_5, x_6) \cdot \sin(t + 2\pi h_2(x_7, x_8)) + \cdots,$

where this pattern continues through the available covariates. We again let g and h be sums of each pair of covariates. We generate responses $y(t, x) = f(t, x) + \epsilon$, where $\epsilon \overset{iid}{\sim} N(0, \sigma^2)$.

We compare BART and tsBART using m = 200 trees and 10,000 MCMC draws, with a burn-in of 1,000 draws. We compare performance by calculating the log-loss at each iteration of the algorithm, both in-sample and for a held-out sample, taking the mean log-loss across all MCMC iterations. Log-losses are scaled by sample size. We tune the length scale l using the method described in Appendix A.3. We compare models using log-loss since our goal is not to classify patients by whether they will experience stillbirth, but to provide well-calibrated probabilities of stillbirth to clinicians. Log-loss is a proper scoring rule which measures how effectively each method calibrates its probability estimates.

tsBART consistently outperforms ordinary BART in out-of-sample log-loss (Table 2), with the most significant gains in scenarios with small sample sizes or more predictors. Figure 2 illustrates the out-of-sample fits and log-loss for a single scenario, with n = 500 and p = 4; tsBART tends to smooth away the long-range periodicities in f(t, x) much less than ordinary BART does.

We next compare tsBART to four existing models in a simulation study designed to mimic the basic properties of the hazard functions we expect to see in our stillbirth data. We generate hazard functions and corresponding survival data for three scenarios, where covariates determine the shape of the hazard function with increasing degrees of interaction complexity. We compare the following methods.

1.
tsBART: the BART with Targeted Smoothing method, with smoothing parameter κ tuned as described in Appendix A.3.

2. tsBART (default): the BART with Targeted Smoothing method with our suggested default smoothing parameter, κ = 1.

3. BART: an ordinary BART-probit model, which also sets hyperparameters (ν, λ) as recommended in Chipman et al. [2010], and includes time t as an ordinary covariate.

Table 2: In-sample and out-of-sample log-loss (scaled by sample size), averaged across 100 replicates, comparing BART with tsBART. tsBART consistently outperforms ordinary BART in the out-of-sample log-loss. tsBART has the most significant gains in scenarios with small sample sizes or more predictors, as evidenced by p-values from a paired Wilcoxon test comparing out-of-sample results. [Table body: in-sample and out-of-sample log-loss for BART and tsBART at each combination of p ∈ {4, 8, 20} and sample size n, with Wilcoxon p-values.]
[Figure 2 panels: tsBART and BART out-of-sample fits, y = f(t, x) over time t, for several held-out observations; overall mean out-of-sample log-loss −1.59 for tsBART versus −1.76 for BART.]
Figure 2: Compares a single model fit for tsBART and BART, for one scenario (n = 500, p = 4), to illustrate the difference in fit when the ground truth is smooth in a single target covariate t. (Left) The bold dashed line represents the true function value, while the solid line and shading give posterior means and 95% credible intervals. Lighter dashed lines give the prediction intervals. (Right) tsBART outperforms BART across t and on average. The boxplot gives the distribution of average log-loss at each t for both methods, with overall average log-loss in parentheses on the x-axis label.

Figure 3: Illustrates the different ground-truth hazards used in the three simulation scenarios (linear, linear with interaction, and nonlinear with interaction). The same basic shapes of hazard functions are present in all three scenarios; the difference is in how covariates x influence which shape arises. There is a mixture of gently rising hazards and hockey-stick hazards; linearity determines the straightness of the rise, and the presence of an interaction increases the strength of the hockey stick.

4. Splines 1: a logistic regression model using cubic B-splines with seven degrees of freedom, with main effects for all covariates included in $x_i$. Use of the spline basis induces targeted smoothness in t by ensuring that, for fixed x, f(t, x) is piecewise polynomial with continuous first and second derivatives.

5. Splines 2: another logistic regression model using cubic B-splines with seven degrees of freedom, with the addition of interactions between each basis element in t and each covariate in $x_i$.

6. P-Splines: a penalized spline model including the same covariates and a penalized spline basis with 9 spline basis elements and a second-order smoothing penalty [Eilers and Marx, 1996].
(The maximum possible number of basis elements is determined by the fact that there are only 9 distinct values for gestational age, 34-42 weeks.) By allowing all possible basis elements to enter the model while penalizing deviations from smoothness, penalized splines provide flexibility while still regularizing the stillbirth risk curve estimates.

We evaluate the performance of tsBART for three scenarios, representing increasing degrees of difficulty in how x parametrizes the hazard function. We simulate data as follows. Let t be a grid of times on the unit interval, spaced in increments of 0.1. Generate n = 1000 ten-dimensional covariates $x_i = (x_{i1}, \ldots, x_{i10})$, where $x_{ij} \overset{iid}{\sim} U(0, 1)$. The first five covariates in each $x_i$ impact the response; the rest are noise. In each case, we define the hazard function as the weighted combination of two "template" hazard functions $f_1(t)$ and $f_2(t)$, where the weights $w(x_i)$ depend on the covariates:

$h(t, x_i) = 0.1\, x_{i5} + w(x_i)\, f_1(t) + [1 - w(x_i)]\, f_2(t).$

The differences between the three scenarios are in how the weight depends on the covariates: linearly, linearly with interactions, or nonlinearly with interactions. Figure 3 illustrates the resulting hazard functions for each scenario. There are four general hazard function shapes, dictated by high versus low baseline risk, and with or without a sharp increase in hazard beginning at t = 0.6. Appendix A.4 provides further detail, and code is available at https://github.com/jestarling/tsbart-analysis.

Table 3: Average coverage rates (nominal is 95%) and mean squared error across 500 simulated datasets for each weighting scenario and model combination. For tsBART and BART, coverage is for posterior credible intervals and mean squared error uses the posterior mean. For the spline-based methods, coverage is for prediction intervals.
tsBART has better coverage, even with the default smoothing parameter, and MSE for all methods is small and comparable.

Weighting scenario             Method                   Coverage   MSE
Linear                         tsBART (tuned)           0.9310     0.0014
                               tsBART (default)         0.9092     0.0017
                               BART                     0.7642     0.0011
                               Splines 1 (Linear)       0.7925     0.0007
                               Splines 2 (Interaction)  0.7788     0.0014
                               P-Splines                0.7720     0.0007
Linear (with interaction)      tsBART (tuned)           0.9571     0.0019
                               tsBART (default)         0.9443     0.0022
                               BART                     0.7907     0.0022
                               Splines 1 (Linear)       0.8874     0.0039
                               Splines 2 (Interaction)  0.7213     0.0391
                               P-Splines                0.8718     0.0036
Nonlinear (with interaction)   tsBART (tuned)           0.9539     0.0013
                               tsBART (default)         0.9354     0.0016
                               BART                     0.7408     0.0012
                               Splines 1 (Linear)       0.8918     0.0006
                               Splines 2 (Interaction)  0.8392     0.0013
                               P-Splines                0.8747     0.0006

Table 4: Overall out-of-sample log-loss for each method, averaged over five evenly balanced case-control samples. tsBART outperforms the other methods, with the tuned smoothness parameter only slightly outperforming the default.

Method             Log-loss
tsBART (tuned)     -1.711
tsBART (default)   -1.713
BART               -1.810
Splines 1          -1.725
Splines 2          -1.919
P-splines          -1.724

For each of the three scenarios, we simulate 500 datasets to compare the point-wise coverage of tsBART against the methods detailed above. The mean squared errors of the estimates are small and comparable across all methods. Most striking, however, is that tsBART gives far better coverage than the other methods, both with the smoothing parameter tuned and with it set to the default value of 1 (Table 3). No other method consistently produces credible/confidence sets that are close to the nominal value of 95%. We conclude that tsBART is capable of matching or exceeding other methods in terms of mean squared error, while producing error bars that are statistically trustworthy and scientifically sensible.

We now turn to our motivating application, applying the tsBART method to estimate patient-specific stillbirth risk using the data described in Section 2.
To model the discrete-time hazard function for stillbirth, h(t, x), we use the extension of the tsBART-probit formulation described in Section 3.4. Our target covariate for smoothing is gestational age in weeks: $t_i \in \{34, 35, \ldots, 42\}$. We let $y_i$ be an indicator of whether stillbirth has occurred for each pregnancy, and $x_i$ be the vector of maternal-fetal covariates for each patient, including maternal age, primiparity, ethnicity, infant sex, presence of diabetes mellitus, presence of chronic hypertension, presence of other risk factors, whether the pregnancy was induced, and birth weight and weight gain quantiles.

We first focus on the question of whether tsBART does, indeed, yield better-calibrated risk estimates than existing methods for our data set. For the purpose of evaluating all models while maintaining computational tractability, we created five balanced case-control samples of n = 1,000 pregnancies each. (Since stillbirth is a rare event, using a balanced case-control sample also more clearly highlights differences among methods.) We then split each balanced case-control sample into training and testing sets. We used the training set to fit tsBART, in addition to each of the four models discussed in Section 4: vanilla BART, the two B-spline models, and P-splines. We tuned the length-scale parameter of tsBART using the method described in Appendix A.3, and we set tree-prior hyperparameters (ν, λ) as recommended in Chipman et al. [2010]. We then used the fitted models to predict the hazard functions for all held-out points, and we computed held-out log-losses. We repeated this process over the five balanced case-control data sets and averaged the results (Table 4).

tsBART outperforms the other methods, with the tuned smoothness parameter setting only slightly outperforming the default (untuned) setting.
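The discrete-time formulation implies a person-period data layout: each pregnancy contributes one Bernoulli observation for every week it remains at risk, with the event indicator switched on only in the final week of a stillbirth. A minimal sketch of that expansion (the function name and default starting week are illustrative):

```python
import numpy as np

def person_period(gest_age, stillbirth, t_min=34):
    """Expand one record per pregnancy into discrete-time risk rows.

    A pregnancy ending at week T contributes rows for weeks t_min, ..., T;
    the event indicator is 0 for every week except the final one, which
    equals the stillbirth indicator."""
    rows_id, rows_t, rows_event = [], [], []
    for i, (T, d) in enumerate(zip(gest_age, stillbirth)):
        weeks = np.arange(t_min, T + 1)
        rows_id.append(np.full(weeks.size, i))
        rows_t.append(weeks)
        ev = np.zeros(weeks.size, dtype=int)
        ev[-1] = d                      # event occurs only in the last week
        rows_event.append(ev)
    return (np.concatenate(rows_id), np.concatenate(rows_t),
            np.concatenate(rows_event))
```

The expanded rows, with the covariate vector repeated for each week, are what the probit hazard model is fit to.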
To provide some intuition for these results, Figure 4 also shows the relative out-of-sample log-losses of all methods as a function of gestational age, with tuned tsBART normalized to 1.
Figure 4: Illustrates the performance of each method relative to tsBART with tuned smoothing parameter κ. Shows weekly out-of-sample log-loss for each method, averaged over five evenly balanced case-control samples. tsBART's gains are especially apparent at higher and lower gestational ages; other methods have small gains in the 37 to 39 week range, at the expense of inflation at extreme gestational ages where sample sizes are small.

The figure shows that tsBART's gains are especially apparent at higher and lower gestational ages, where fewer observations are available. Most methods are comparable at gestational ages across the middle of the available range (37-39 weeks).

We next turn to the question of how obstetricians might use the results of tsBART to understand stillbirth risk and communicate that risk to their patients. To do so, we construct a set of hypothetical "test" patients representing various configurations of maternal-fetal characteristics:

• Patient 1 is a young, primiparous, white patient in her early 20's, with no medical history, normal weight gain, and normal birth weight for a female infant.
• Patient 2 is otherwise similar to Patient 1, but has hypertension.
• Patient 3 is otherwise similar to Patient 1, but has both hypertension and diabetes.
• Patient 4 is also a young white patient in her early 20's, but is multiparous, with birth weight less than the 10th quantile.
• Patient 5 is a white patient in her early 40's, with diabetes, hypertension, and other risk factors present; her labor is induced, and her infant is male.

To maintain computational tractability, we again select a case-control sample from the overall data set. We include all stillbirths in the case-control sample. Then, for each gestational age, we sample 2% of the live births at that age. As a result, stillbirths are 50 times more prevalent in our sample than they are in the full data set, both overall and at each gestational age.
This approach yields a dataset that is still reasonably large, with 91,078 pregnancies: all 7,175 stillbirth cases and 83,903 live-birth controls. While we would prefer to fit the model to all 4.55 million data points, we are not yet able to do so, owing to computational constraints. Scalable Bayesian ensemble methods are an active area of research, and we are currently drawing on this work to develop methods for scaling tsBART to use the entire dataset.
Figure 5: Estimated stillbirth risk curves (per 1,000 remaining pregnancies) for five hypothetical patients with different combinations of maternal-fetal covariates, using the full case-control sample. Each row is a method (tsBART, BART, Splines 1, Splines 2, P-splines), and each column is a hypothetical patient. In each row, the posterior mean and credible interval are highlighted (dark line and shading), while the other methods' posterior means are shown as dashed lines for comparison. Patient 1 is a low-risk patient (young, primiparous, no medical history, normal weight gain and birth weight). Patient 2 introduces hypertension; Patient 3 introduces both diabetes and hypertension. Patient 4 is multiparous with very low birth weight, and Patient 5 has a combination of risk factors (older, diabetes, hypertension, medical history, induced labor). tsBART gives a smooth fit with sensible credible intervals, consistent with clinical intuition about the way stillbirth risk evolves with gestational age.

We use this large case-control sample to fit all methods from Section 4. We use the results to produce estimates of the stillbirth hazard function for each of our hypothetical test patients. We then rescale the estimated hazard functions to account for the 50-fold down-sampling of live births in our case-control sample, and we express the resulting hazard functions as a stillbirth rate per 1,000 live births.

The results are shown in Figure 5. Each column represents one test patient, while each row shows a particular method. In each panel, we show the estimated conditional probability of stillbirth at gestational age t, given survival through time t − 1, along with 95% uncertainty intervals. Estimated probabilities for all other methods are also visible in grey within each panel, for easier comparison across panels.
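The rescaling step amounts to the usual case-control odds correction: keeping all cases but only 1/50 of the controls inflates the sample odds 50-fold relative to the population, so the estimated odds are deflated by the same factor. A sketch of this adjustment (the function name is hypothetical; the paper does not spell out its exact implementation):

```python
def rescale_hazard(p_hat, control_sampling=1.0 / 50.0):
    """Correct a hazard estimated on a case-control sample in which
    controls (live births) were down-sampled.

    If controls are kept with probability 1/50 while all cases are kept,
    the sample odds are 50 times the population odds, so we deflate the
    estimated odds by the sampling fraction and convert back."""
    odds = p_hat / (1.0 - p_hat)
    odds_pop = odds * control_sampling
    return odds_pop / (1.0 + odds_pop)
```

Multiplying the corrected probability by 1,000 gives the risk per 1,000 remaining pregnancies plotted in Figure 5.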
For tsBART and BART, the estimates are posterior-mean predicted probabilities and (Bayesian) credible intervals; for the spline methods, the estimates are predicted probabilities and (frequentist) prediction intervals.

These plots have several features of interest (we focus on the tsBART results in the top row). First, there is considerable heterogeneity in the estimated stillbirth risk curves: in their shape, level, and degree of interaction between maternal-fetal covariates and gestational age. Patient 1, for example, has a lower overall risk with a relatively small increase in risk at very late gestational ages (41-42 weeks). Patients 2-4 have slightly higher overall risk at earlier gestational ages, but much more pronounced "spikes" in risk at late gestational ages, when the inherent stillbirth risk at an advanced stage of pregnancy is exacerbated by these patients' covariates (hypertension, diabetes plus hypertension, and low fetal weight, respectively). Patient 5, on the other hand, has a higher overall risk at all gestational ages, but a much more linear risk trajectory across gestational age compared with Patients 1-4, without the pronounced spike.

This striking heterogeneity across patients illustrates the shortcomings of collapsing patients into two risk groups, as in Mandujano et al. [2013]. Our method, in contrast, can produce individualized estimates of risk for any patient, across all gestational ages.

We note that the estimates from the BART model are generally similar in shape to the tsBART estimates, but lack smoothness over gestational age. This results in increased variance and poorer overall out-of-sample performance, as evident from Table 4. It also invites clinicians and patients to over-interpret small wiggles in the risk curves that are a result of estimation noise, rather than clinically meaningful differences. The spline models, meanwhile, tend to produce estimates that are either over-smoothed (Splines 1, P-splines) or under-smoothed and erratic (Splines 2).
We attribute this to the fact that Splines 1 and P-splines are underparametrized: they fail to include clinically meaningful interactions (e.g., between hypertension and diabetes). This results in higher bias, poorer estimation performance, and infeasibly narrow confidence intervals that, in light of our simulation studies (Section 4), are likely to be anti-conservative. Splines 2, meanwhile, is likely overparametrized: it allows for the possibility of all pairwise interactions between maternal-fetal covariates and gestational age, needlessly inflating variance for the sake of finding a small handful of clinically important interactions. This suggests that the spline models, in order to yield good performance for stillbirth prediction, would require more nuanced model selection and attention to functional form, since simply including more flexible interactions was not a fruitful approach.

tsBART, in contrast, produces the best out-of-sample performance, smooth estimates, and wider, more clinically sensible error bars. It also finds the important interactions out of the box, without the need to specify them by hand or to conduct a specification search for the right form of the model. In addition, the posterior credible intervals from tsBART are noticeably wider for patients with unusual combinations of characteristics, an intuitive result which reflects a higher degree of uncertainty about rarer, more medically complex cases.

Our tsBART model is a novel extension of BART which allows for targeted smoothing over a selected covariate. tsBART enjoys the same advantages as BART: excellent predictive performance, easily tunable hyperparameters, and no need to specify interactions by hand. Hyperparameters are set efficiently via data-driven approaches, using the recommendations of Chipman et al. [2010] and our suggested method for tuning the length-scale of the covariance function.
tsBART provides regularization by constraining trees to be shallow learners in the prior, a well-studied and highly successful approach to regularization in regression.

The kind of stillbirth risk analysis made possible by tsBART represents a substantial advance over previous work in obstetrics [Mandujano et al., 2013], in terms of capturing the heterogeneity of risk curves by patient and quantifying levels of certainty around each risk curve. Further investigation into nuanced approaches for stillbirth risk modeling is warranted; maternal-fetal covariates such as age, weight gain, and birth weight may play a role in the risk of stillbirth, and may interact with other covariates in complex ways. Our fully Bayesian approach naturally allows the model to capture rich and complex interactions and to quantify uncertainty about stillbirth risk, which appropriately varies by patient.

We recognize the potential limitation of confounding between the decision to induce labor, risk of stillbirth, and maternal-fetal covariates. We currently consider the decision to induce to be a proxy for other maternal-fetal covariates which may increase stillbirth risk but are not included in the model; future work may include modeling this covariate in a causal framework. A second limitation is the inability to link birth records to the same mother, potentially violating the independence assumption (the de-identified nature of the data prevents this linking). However, because our data set spans only two years, it is unlikely that a large fraction of the overall births are multiple births to the same mother. Moreover, the concern about non-independence is mitigated because we have included many of the known risk factors for stillbirth in our model.
While it is not plausible that two stillbirth events for the same mother are marginally independent, it is much more plausible that they are conditionally independent, or nearly so, given these risk factors.

Future areas of methodological work may include the extension of tsBART to a causal inference framework for observational data, as well as extensions to priors with other types of structure. tsBART may also be adapted to the accelerated framework of He et al. [2018] to speed computation. It would also be interesting to explore more nuanced characterizations of the partial dependence of stillbirth risk on individual covariates. For example, individual conditional expectation (ICE) plots may be used to assess partial relationships between the response and specific covariates, using the techniques described in Goldstein et al. [2015]. ICE plots go beyond the simple partial dependence plot by showing the functional relationship between response and feature at the level of individual observations (rather than averaging across the sample). This could potentially give insight into the extent of heterogeneity in the conditional expectation function. ICE plots can be created using the ICEbox R library [Goldstein et al., 2015].
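As a sketch of the idea (not the ICEbox implementation), ICE curves require only a fitted prediction function: vary one feature over a grid while holding each observation's other features fixed, and record the predictions per observation.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """Individual conditional expectation (ICE) curves.

    For each row of X, sweep one feature over `grid` while holding the
    other features fixed; `predict` is any fitted model's prediction
    function (a stand-in here). Returns an (n_obs, n_grid) array; averaging
    over rows recovers the ordinary partial dependence curve."""
    X = np.asarray(X, dtype=float)
    curves = np.empty((X.shape[0], len(grid)))
    for j, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, feature] = v
        curves[:, j] = predict(Xv)
    return curves
```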
A Appendix
A.1 Review of the Bayesian backfitting MCMC
The original BART model is typically fit using an algorithm called Bayesian backfitting [Hastie and Tibshirani, 2000, Chipman et al., 2010]. We review this algorithm, then describe the modifications necessary to fit the BART with Targeted Smoothing model.

Bayesian backfitting involves sampling each tree and its parameters one at a time, given the partial residuals from all other m − 1 trees. One iteration of the sampler consists of looping through the m trees, sampling each tree $T_j$ via a Metropolis step, and then sampling its associated leaf parameters $M_j$, conditional on σ and the remaining trees and leaf parameters. After a pass through all m trees, σ is updated in a Gibbs step.

To sample $\{T_j, M_j\}$ conditioned on the other trees and leaf parameters $\{T_{-j}, M_{-j}\}$, define the partial residual as

$R_{ij} = y_i - \sum_{k=1, k \neq j}^{m} g(x_i; T_k, M_k). \quad (14)$

Using $R_j$ as the working response vector, at step s of the MCMC one samples $T_j^{(s)}$ by proposing one of four local changes to $T_j^{(s-1)}$, marginalizing analytically over $M_j$. The local change is selected randomly from the following candidates:

• grow randomly selects a terminal node and splits it into two child nodes.
• prune randomly selects an internal node with two children and no grandchildren, and prunes the children, making the selected node a leaf.
• change randomly selects an internal node and draws a new splitting rule.
• swap randomly selects a parent-child pair of internal nodes and swaps their decision rules.

The change and swap moves are computationally expensive; in practice, BART is often implemented with only prune and grow proposals [Pratola et al., 2014]. Once the move in tree space is either accepted or rejected, $M_j$ is sampled from its Gaussian full conditional, given $T_j$ and σ.

A.2 Fitting the tsBART model with Bayesian backfitting
Our approach to fitting tsBART retains the form of the Bayesian backfitting MCMC algorithm, as detailed by Chipman et al. [2010]. The primary modification is that all conjugate updates are replaced by their multivariate forms. We assume an i.i.d. error structure, although this is easily modified, and we use a redundant multiplicative parameterization of the scale parameter to facilitate faster MCMC mixing [Gelman, 2006, Hahn et al., 2017]. Thus our model is

$y_i = \alpha(t_i) + \eta f(t_i, x_i) + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
$f(t_i, x_i) = \sum_{j=1}^{m} g(t_i, x_i; T_j, M_j), \quad M_j = \{\mu_{1j}(t), \ldots, \mu_{b_j j}(t)\}$
$\mu_{hj}(t) \sim GP(0, C(t, t'))$
$\eta \sim N(0, \tau^2 \gamma), \quad \gamma \sim IG\left(\tfrac{1}{2}, \tfrac{1}{2}\right)$
$\sigma^2 \sim \nu\lambda / \chi^2_\nu.$

Recall that $\mu_{hj}(t)$ is the function at terminal node h of tree j. As described previously, this function has a Gaussian process prior with a squared exponential covariance function with length scale l. Because we have already introduced η as a leading multiplicative scale parameter, we set the variance parameter of the covariance function to 1/m, and calibrate the prior half-Cauchy median τ to the marginal standard deviation of y.

We use the same prior over trees $T_j$ as in Chipman et al. [2010] and Hahn et al. [2017], and so we omit many details here and refer the interested reader there. Specifically, these papers parametrize tree depth in terms of the pair (α, β); we set (α = 0.95, β = 2), which puts high probability on trees of depth 2 and 3, and minimizes probability on trees with depth 1 or greater than 4. For σ, we follow Chipman et al.'s recommendation of a rough over-estimate $\hat\sigma$. We choose ν = 3 and q = 0.9, estimate $\hat\sigma$ by regressing y onto x (including the index variable as a covariate), and then choose λ such that the q-th quantile of the prior is located at $\hat\sigma$, i.e., $P(\sigma \leq \hat\sigma) = q$.
For simplicity of notation, we assume the times t lie on a common discrete grid, where T is again the range of t values in the data set (although this is not a requirement of the method). The posterior conditional distributions are as follows. We update $\sigma^2$ as

$\sigma^2 \mid \bullet \sim \frac{\nu\lambda + RSS}{\chi^2_{\nu + N}}, \quad \text{where } RSS = \sum_{i} \left(y_i - \alpha(t_i) - \eta f(t_i, x_i)\right)^2,$

where N is the count of observations across all time points, $N = \sum_{i=1}^{n} N_i$ with $N_i$ the number of time points for observation i, and $\chi^2_{\nu + N}$ is a draw from a chi-squared random variable.

The update for each leaf vector $\mu_h = (\mu_h(1), \ldots, \mu_h(T))$ is

$\mu_h \mid \bullet \sim N(\tilde m, \tilde\Sigma), \quad \text{where } \tilde\Sigma = (\Lambda^{-1} + K)^{-1} \text{ and } \tilde m = \tilde\Sigma\, (\Lambda^{-1} \bar y_l + K \mu_0),$

where $\Lambda = \sigma^2 N_l^{-1}$ involves the inverse of the diagonal matrix $N_l$ of sample sizes at each time point for observations in leaf l, $K = \Sigma^{-1}$ is the prior precision, and $\bar y_l$ is the vector of sample means for observations in leaf l at each time point.

The update for η is Gaussian, $\eta \mid \bullet \sim N(\tilde m, \tilde v)$, where

$\tilde v = \left(\frac{1}{\tau^2\gamma} + \frac{1}{\sigma^2} \sum_{i} f(t_i, x_i)^2\right)^{-1}, \quad \tilde m = \tilde v \cdot \frac{1}{\sigma^2} \sum_{i} y_i f(t_i, x_i).$

Finally, the update for γ is

$\gamma \mid \bullet \sim IG\left(1, \frac{\eta^2/\tau^2 + 1}{2}\right).$

For updating the trees $T_j$, the marginal likelihood is the multivariate extension of the marginal likelihood in regular BART. We again let $R_{ij}$ represent the partial residuals defined in Equation (14), and let $R_l$ denote the vector containing the residuals for data points in leaf l. We then obtain the marginal likelihood over the b terminal nodes as

$p(R \mid T_j, \sigma) = \prod_{l=1}^{b} \int N(R_l \mid W_l \mu_l, \sigma^2 I)\, N(\mu_l \mid \mu_0, \Sigma_0)\, d\mu_l,$

where $W_l$ is a matrix whose elements indicate the times at which each $y_i$ in leaf l is observed. This Gaussian integral is easily computed in closed form.
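A sketch of the conjugate leaf update above, assuming the per-time-point counts and residual means for the leaf have already been computed (the function and argument names are illustrative):

```python
import numpy as np

def leaf_update(ybar, counts, sigma2, C, mu0=None):
    """Conjugate update for a leaf's vector of time-indexed values.

    ybar holds the per-time-point residual means in the leaf, counts the
    per-time-point sample sizes, and C the GP prior covariance over the
    time grid. The posterior precision adds the data precision
    diag(n_t) / sigma2 to the prior precision K = C^{-1}."""
    K = np.linalg.inv(C)
    data_prec = np.diag(counts / sigma2)
    post_cov = np.linalg.inv(data_prec + K)
    prior_term = np.zeros_like(ybar) if mu0 is None else K @ mu0
    post_mean = post_cov @ (data_prec @ ybar + prior_term)
    return post_mean, post_cov
```

Time points with no observations in the leaf contribute zero data precision, so the GP prior alone smooths the leaf function there.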
(In the example shown in Figure 6, the optimal expected number of crossings is 2.)
Figure 6: Example of tuning the expected number of crossings. The jagged dashed line illustrates the Monte Carlo variation present in the WAIC estimates. The solid line shows the spline fit. The horizontal dotted line shows the minimum WAIC value plus one standard deviation from the spline fit. The solid vertical line marks the smallest candidate expected number of crossings whose WAIC value falls below this threshold.
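The selection rule illustrated in Figure 6 can be sketched as follows; a cubic polynomial fit stands in here for the paper's spline smoother.

```python
import numpy as np

def select_crossings(candidates, waic):
    """Select the smallest candidate expected-crossings value whose WAIC
    lies within one residual standard deviation of the minimum WAIC.

    The residual standard deviation of a smooth fit to the WAIC grid
    estimates the Monte Carlo noise scale zeta; a cubic polynomial is used
    as a stand-in for the cubic spline."""
    candidates = np.asarray(candidates, dtype=float)
    waic = np.asarray(waic, dtype=float)
    smooth = np.polyval(np.polyfit(candidates, waic, deg=3), candidates)
    zeta = np.std(waic - smooth)              # Monte Carlo noise scale
    eligible = candidates[waic <= waic.min() + zeta]
    return eligible.min()
```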
A.3 Additional detail on hyperparameter tuning for length-scale
Here we provide additional detail on tuning the expected number of crossings, $E[N_T(0)]$, which determines the covariance function's length-scale parameter. We select the optimal $E[N_T(0)]$ by beginning with a grid of candidate values $e_c \in \{e_1, \ldots, e_C\}$. For each candidate $e_c$, we fit the BART with Targeted Smoothing model and calculate WAIC [Watanabe, 2013], yielding a grid of WAIC values $\Omega = \{\omega_1, \ldots, \omega_C\}$.

The WAIC values contain Monte Carlo variation; to overcome this, we fit a cubic spline model to Ω. Let ζ be the standard deviation of the residuals from this model fit. We select the smallest number of expected crossings $e_c$ whose corresponding $\omega_c$ is within ζ of min(Ω). This approach encourages smoothing while maintaining performance. Figure 6 gives a visualization of this tuning. Other methods, such as cross-validation, could easily be used for tuning the expected number of crossings; we find this data-driven approach to be efficient while still yielding good results.

A.4 Simulation Details
Here we provide more detail for the second simulation described in Section 4. We simulate data as follows. Let t be a grid of times on the unit interval, spaced in increments of 0.1. We generate n = 1000 ten-dimensional covariates $x_i = (x_{i1}, \ldots, x_{i10})$, where $x_{ij} \overset{iid}{\sim} U(0, 1)$. The first five covariates in each $x_i$ impact the response; the rest are noise.

We generate data using a weighted combination of two risk functions, where $f_1(t) = 0.1 t$ is the baseline risk function, and $f_2(t) \propto \max(0, t - 0.6)^{\rho}$ is a second risk function which produces a large "kick" at t = 0.6, with the constant of proportionality and the exponent ρ chosen so that the $f_2(t)$ risk at t = 1 is five times the baseline risk.

The weights $w(x_i)$ for combining $f_1(t)$ and $f_2(t)$ depend on the covariates $x_i$. We generate data for three scenarios, letting the weights $w(x_i)$ depend on the covariates either linearly, linearly with interactions, or nonlinearly with interactions. These scenarios represent increasing degrees of difficulty in learning the underlying function.

Figure 7: Estimated stillbirth risk curves from the Random Forest model, using the same five hypothetical patients and full case-control sample as Figure 5. The posterior mean and credible intervals are highlighted (dark line and shading), while the other methods' posterior means are shown as dashed lines for comparison. Each column is a hypothetical patient. Patient 1 is a low-risk patient (young, primiparous, no medical history, normal weight gain and birth weight). Patient 2 introduces hypertension; Patient 3 introduces both diabetes and hypertension. Patient 4 is multiparous with very low birth weight, and Patient 5 has a combination of risk factors (older, diabetes, hypertension, medical history, induced labor). Random Forest gives curves generally similar in shape to BART, and does not induce smoothness.
• Linear: $w(x_i) = \text{sigmoid}\left[5(x_{i1} - x_{i2} + x_{i3} - x_{i4})\right]$

• Linear with interaction: $w(x_i) = \text{sigmoid}\left[5(x_{i1} - x_{i2}) + 5(x_{i1} - 0.5)(x_{i2} - 0.5) + 5(x_{i3} - x_{i4}) + 5(x_{i3} - 0.5)(x_{i4} - 0.5)\right]$

• Nonlinear with interaction: $w(x_i) = \text{sigmoid}\left[5\left(\max(x_{i1}, x_{i2}) - \max(x_{i3}, x_{i4})\right)\right]$

We then generate the simulated hazard function data according to $h_i(t)$ below, rescale the responses so that the overall survival probability is roughly 0.5, and simulate event times for each observation:

$$h_i(t) = c \, x_{i5} + w(x_i)\, f_1(t) + (1 - w(x_i))\, f_0(t),$$

where $c$ is a fixed constant scaling the direct contribution of the fifth covariate.

A.5 Stillbirth Results Using Random Forest
Here we illustrate the Random Forest fit for the stillbirth dataset. We do not include Random Forest in the set of models for the stillbirth analysis; while Wager et al. [2014] provide variance estimation for Random Forest, Chipman et al. [2010] demonstrated that BART tends to outperform Random Forest. In addition, Random Forest does not induce smoothness, as we see in Figure 7.
B Appendix
The R package tsbart implements the BART with Targeted Smoothing method. It is available at https://github.com/jestarling/tsbart.

References
J. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc., 88(422):669–679, 1993.

J. Bailit, K. Gregory, and U. Reddy. Maternal and neonatal outcomes by labor onset type and gestational age. Am. J. Obstet. Gynecol., 202(3):245.e1–245.e12, 2010.

H. Chipman, E. George, and R. McCulloch. Bayesian CART model search. J. Amer. Statist. Assoc., 93(443):935–948, 1998.

H. Chipman, E. George, and R. McCulloch. BART: Bayesian additive regression trees. Ann. Appl. Stat., 4(1):266–298, 2010.

S. Clark, D. Frye, and J. Myers. Reduction in elective delivery at ≤ 39 weeks of gestation: comparative effectiveness of 3 approaches to change and the impact on neonatal intensive care admission and stillbirth. Am. J. Obstet. Gynecol., 203(3):449.e1–449.e6, 2010.

P. Eilers and B. Marx. Flexible smoothing with B-splines and penalties. Stat. Sci., 11(2):89–121, 1996.

A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayes. Anal., 1(3):515–534, 2006.

A. Goldstein, K. Kapelner, J. Bleich, and E. Pitkin. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. J. Comp. Graph. Stat., 24(1):44–65, 2015.

R. B. Gramacy and H. K. Lee. Bayesian treed Gaussian process models with an application to computer modeling. J. Amer. Statist. Assoc., 103(483):1119–1130, 2008.

P. R. Hahn, J. S. Murray, and C. M. Carvalho. Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects. Preprint, arXiv:1706.09523v2, 2017.

T. Hastie and R. Tibshirani. Bayesian backfitting. Stat. Sci., 15(3):196–223, 2000.

J. He, S. Yalov, and P. R. Hahn. Accelerated Bayesian additive regression trees. Preprint, arXiv:1810.02215, 2018.

B. Hernández, A. Raftery, S. R. Pennington, et al. Bayesian additive regression trees using Bayesian model averaging. Stat. Comput., 28:869–890, 2018.

J. L. Hill. Bayesian nonparametric modeling for causal inference. J. Comp. Graph. Stat., 20(1):217–240, 2011.

M. Kornhauser and R. Schneiderman. How plans can improve outcomes and cut costs for preterm infant care. Managed Care, Jan 2010.

M. F. Kratz. Level crossings and other level functionals of stationary Gaussian processes. Prob. Surv., 3:230–288, 2006.

A. Linero and Y. Yang. Bayesian regression tree ensembles that adapt to smoothness and sparsity. J. Roy. Statist. Soc. Ser. B, 80(5):1087–1110, 2018.

A. R. Linero. Bayesian regression trees for high-dimensional prediction and variable selection. J. Amer. Statist. Assoc., 113(522):626–636, 2018.

B. R. Logan, R. Sparapani, R. E. McCulloch, and P. W. Laud. Decision making and uncertainty quantification for individualized treatments using Bayesian additive regression trees. Stat. Meth. in Med. Res., 28(4):1079–1093, 2019.

M. F. MacDorman and E. C. Gregory. Fetal and perinatal mortality: United States, 2013. Natl. Vital Stat. Rep., 66(6):1–24, 2015.

A. Mandujano, T. Waters, and S. Myers. The risk of fetal death: current concepts of best gestational age for delivery. Am. J. Obstet. Gynecol., 208(3):207.e1–207.e8, 2013.

J. Muraskas and K. Parsi. The cost of saving the tiniest lives: NICUs versus prevention. J. of Ethics, 10(10):655–658, 2008.

J. S. Murray. Log-linear Bayesian additive regression trees for categorical and count responses. Preprint, arXiv:1701.01503, 2017.

M. T. Pratola, H. A. Chipman, J. R. Gattiker, D. M. Higdon, R. McCulloch, and W. N. Rust. Parallel Bayesian additive regression trees. J. Comp. Graph. Stat., 23(3):830–852, 2014.

U. Reddy, V. Bettegowda, and T. Dias. Term pregnancy: a period of heterogeneous risk for infant mortality. Obstet. Gynecol., 117:1279–1287, 2011.

S. Sivaganesan, P. Muller, and B. Huang. Subgroup finding via Bayesian additive regression trees. Stat. Med., 36:2391–2403, 2017.

R. Sparapani, B. Logan, R. McCulloch, and P. Laud. Nonparametric survival analysis using Bayesian additive regression trees. Stat. Med., 35(16):2741–2753, 2016.

S. Wager, T. Hastie, and B. Efron. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. J. Mach. Learn. Res., 15:1625–1651, 2014.

S. Watanabe. A widely applicable Bayesian information criterion. J. Mach. Learn. Res., 14:867–897, 2013.

J. Xu, S. L. Murphy, K. D. Kochanek, and B. A. Bastian. Deaths: Final data for 2013.