[PDF] Hierarchical Bayesian Bootstrap for Heterogeneous Treatment Effect Estimation

Abstract

A major focus of causal inference is the estimation of heterogeneous average treatment effects (HTE) - average treatment effects within strata of another variable of interest. This involves estimating a stratum-specific regression and integrating it over the distribution of confounders in that stratum - which itself must be estimated. Standard practice in the Bayesian causal literature is to use Rubin's Bayesian bootstrap to estimate these stratum-specific confounder distributions independently. However, this becomes problematic for sparsely populated strata with few unique observed confounder vectors. By construction, the Bayesian bootstrap allocates no prior mass on confounder values unobserved within each stratum - even if these values are observed in other strata and we think they are a priori plausible. We propose causal estimation via a hierarchical Bayesian bootstrap (HBB) prior over the stratum-specific confounder distributions. Based on the Hierarchical Dirichlet Process, the HBB partially pools the stratum-specific confounder distributions by assuming all confounder vectors seen in the overall sample are plausible. In large strata, estimates allocate much of the mass to values seen within the strata, while placing small non-zero mass on unseen values. However, for sparse strata, more weight is given to values unseen in that stratum but seen elsewhere - thus shrinking the distribution towards the marginal. This allows us to borrow information across strata when estimating HTEs - leading to efficiency gains over standard marginalization approaches while avoiding strong parametric modeling assumptions about the confounder distribution when estimating HTEs. Moreover, the HBB is computationally efficient (due to conjugacy) and compatible with arbitrary outcome models.

Full PDF

HHierarchical Bayesian Bootstrap for Heterogeneous Treatment EﬀectEstimation

Arman Oganisian ∗ , Nandita Mitra , and Jason A. Roy Division of BiostatisticsDepartment of Biostatistics, Epidemiology, and InformaticsUniversity of Pennsylvania Department of Biostatistics and EpidemiologyRutgers University

Abstract

A major focus of causal inference is the estimation of heterogeneous average treatment eﬀects (HTE) -average treatment eﬀects within strata of another variable of interest. This involves estimating a stratum-speciﬁc regression and integrating it over the distribution of confounders in that stratum - which itself mustbe estimated. Standard practice in the Bayesian causal literature is to use Rubin’s Bayesian bootstrap toestimate these stratum-speciﬁc confounder distributions independently. However, this becomes problematicfor sparsely populated strata with few unique observed confounder vectors. By construction, the Bayesianbootstrap allocates no prior mass on confounder values unobserved within each stratum - even if these valuesare observed in other strata and we think they are a priori plausible. We propose causal estimation via ahierarchical Bayesian bootstrap (HBB) prior over the stratum-speciﬁc confounder distributions. Based onthe Hierarchical Dirichlet Process, the HBB partially pools the stratum-speciﬁc confounder distributionsby assuming all confounder vectors seen in the overall sample are a priori plausible. In large strata,estimates allocate much of the mass to values seen within the strata, while placing small non-zero mass onunseen values. However, for sparse strata, more weight is given to values unseen in that stratum but seenelsewhere - thus shrinking the distribution towards the marginal. This allows us to borrow informationacross strata when estimating HTEs - leading to eﬃciency gains over standard marginalization approacheswhile avoiding strong parametric modeling assumptions about the confounder distribution when estimatingHTEs. Moreover, the HBB is computationally eﬃcient (due to conjugacy) and compatible with arbitraryoutcome models. ∗ Corresponding author. Email: [email protected]. a r X i v : . [ s t a t . M E ] S e p Introduction

Estimation of heterogeneous causal eﬀects - i.e., eﬀects within strata of some other relevant variable - is popularin the causal inference literature. Such eﬀects can be identiﬁed under rather standard causal assumptions (nounmeasured confounding, positivity, etc) and computed using standardization in the point-treatment setting.Within each stratum, standardization involves averaging a stratum-speciﬁc regression model adjusting forconfounders and treatment over the distribution of confounders within that stratum. Fully Bayesian approachesto standardization, and causal estimation broadly, have been growing in popularity. For instance, BARTregression models were used in early work by Hill [1] to compute marginal eﬀects and subsequently by Zeldowet al. [2], Hahn et al. [3], and Henderson et a. [4] to compute individual treatment eﬀects. Other Bayesiannonparametric (BNP) priors such as Dirichlet process (DP) and variations such as the enriched DP anddependent DP regressions have also been used to do full posterior inference on marginal treatment eﬀects invarious settings. For instance, such methods have been developed for computing eﬀects under zero-inﬂation [5],in the presence of missingness, [6], in mediation scenarios [7], for censored survival outcomes under competingrisks [8], and to compute causal quantile eﬀects [9]. Work by Wang et al. [10] explore Bayesian causal inferencevia generalized linear model (GLM) regressions and Saarela et al.[11] develop a Bayesian estimation frameworkfor marginal structural models. Nethery et al. [12] address the problem of marginal eﬀect estimation in thepresence of non-overlap via Bayesian modeling.To perform standardization, regression models must be averaged over the confounder distribution of thetarget population. For instance, Hill averages BART over the empirical distribution when computing marginaleﬀects. This is a ﬂexible approach as it makes no modeling assumption about the distribution. However, itis unsatisfying from a Bayesian point of view since it uses a plug-in estimate and variability of this estimatedoes not ﬂow through to the posterior of the causal eﬀects. To address this, Wang et al. and Nethery et al.used the Rubin’s Bayesian bootstrap (BB) [13]. Broadly, this approach models the confounder distribution asa point-mass distribution with unknown mass/weight at each observed confounder value. Posterior inferenceis done on these unknown weights and variability propagates through to the causal eﬀects of interest.The popularity of the BB for marginal estimation has lead to its adoption for heterogenous average treat-ment eﬀect (HTE) estimation. These are average treatment eﬀects within strata of some other variable. Forinstance, Roy et al. [14] evaluate the eﬀect of antiretroviral therapy on various outcomes among HIV-positivepatients with and without recent alcohol use. They do this by estimating a marginal structural outcomemodel via a dependent DP. To estimate the cofounder distributions, they use separate BB estimates for theconfounder distributions of recent alcohol users and non-users separately. Taddy et al. [15] are concerned withestimating eﬀects of a large A/B testing experiment among “new” and “old” platform users. Again, theseuse separate BB estimates within these two strata. More recent work by Boatman et al. [16] attempts todo causal estimation in a setting where data are collected from several “supplemental sources” in addition toa “primary” data source. They then estimate a causal eﬀect within the “primary” stratum by averaging aBART regression over a BB estimate of the confounder distribution in the primary source, separate from theother sources.Though common, using separate BBs within each stratum is ineﬃcient when some strata are sparse. Forinstance, in our motivating data example we compute marginal treatment eﬀects within cancer strata. This iscomplicated as some cancer types (e.g. lung) may be rare in the sample, giving us little data on the confounderdistribution within these strata. By construction, the BB places zero mass on confounder values unseen withinthis stratum - even if this is due to small samples and not due to an a priori belief that unseen values areimpossible. While plausible covariate values for lung cancer patients may have been observed for, say, brain2ancer patients, stratum-speciﬁc BBs have no way of borrowing this information. In this paper we proposea hierarchical Bayesian bootstrap (HBB) prior for estimating stratum-speciﬁc confounder distributions in theHTE estimation setting. Based on the Hierarchical Dirichlet Process (HDP), our stratum-level distributionestimates place most mass on confounder values observed within each stratum, but also place non-zero mass onvalues unseen in the stratum but seen in other strata. For large strata, the HBB shrinks to the stratum-speciﬁcBB. For small strata, the confounder distribution is shrunk more heavily towards values seen in other strata. Inthis way, the stratum-level confounder distribution estimates borrow information across strata. This approachmaintains the ﬂexibility of the BB (we make no parametric assumptions about the confounder distributions),provides room for eﬃciency gains, and is agnostic to the choice of outcome model.Several notable modiﬁcations to the bootstrap have been proposed which are distinct from our work. Forinstance, Makela et al. [17] developed a two-stage Bayesian bootstrap for a cluster-randomized study setting.Here, clusters/strata are sampled and then individuals are sampled within a cluster. A key problem here ishow to account for strata that exist, but are never sampled. This is distinct from our problem where strataare known and ﬁxed and the issue is to borrow information across them. Approaches such as “bag-of-littlebootstraps” [18] and the Bayesian analogue [19] have been proposed with the goal of scaling bootstrap to largedatasets. The idea is to split the data into subsets, run separate bootstraps, and the combine in such a wayas to approximate the overall bootstrap distribution from these sub-samples. Rather, our concern is not withestimating the overall data distribution, but stratum-speciﬁc distributions in a more statistically eﬃciencyway. Finally, several “smoothed” bootstraps have been developed by Silverman [20, 21, 22] and a smoothedBayesian analogue has also been discussed [23]. To summarize, the view here is that the Efron’s bootstrap issampling from the empirical distribution that places uniform mass (inverse of the sample size) on each observeddata value. This point-mass distribution is convoluted with a kernel function to yield a smoother estimate.While indeed a step in the right direction, it is unsatisfactory from the perspective of HTE estimation. Wecould, for instance, estimate stratum-speciﬁc smoothed bootstrap distributions. This will indeed place somemass on the unseen values, but this mass is allocated via an ad-hoc kernel, rather than informed by data inthe other strata. Speciﬁcation of this kernel is also a hurdle, which BB does not face. As we will see, however,this smoothed bootstrap can be seen as a special case of the HBB.In the remaining sections, we introduce some notation and motivate the causal problem more precisely. Wethen discuss the HBB, posterior computation, and choice of hyperparameter values. We show how the stratum-speciﬁc BB and the smoothed bootstrap are limiting cases of the HBB. Simulation studies are conductedassessing the performance of the HBB relative to dominant approaches in the causal literature under a varietyof complex settings. We end with an analysis contrasting the risk of adverse events for proton versus photontherapies across various cancer types. We demonstrate how our approach can be easily combined with bothparametric and nonparametric models for diﬀerent outcome types to estimate a variety of common marginalcausal estimands contrasting adverse event risk.

Suppose we observe outcome Y for subjects assigned to one of two treatments A ∈ { , } along with somecovariates L = ( W, V ) that are measured pre-treatment. Here, we separate L into a component W which webelieve to be confounders (drivers of both treatment and outcome) and a component V which, for our purposes,will be a scalar stratum number, V ∈ { , , . . . , K } along which we wish to make causal comparisons. Let E [ Y a ] denote the average potential outcome [24] that would have occurred under treatment A = a . One3opular causal estimand is the heterogeneous, or stratum-speciﬁc, average treatment eﬀect (HTE) Ψ( v ) = E [ Y − Y | V = v ] - the average diﬀerence in outcomes had everyone in the stratum V = v taken treatment1 versus 0.While we could estimate E [ Y | A = a, V ] with observed data, in general E [ Y | A = a, V ] (cid:54) = E [ Y a | , V ].That is, the average outcome among subjects treated with A = a in V may not be the same as the averageoutcome had everyone in V taken treatment A = a . This is due to confounding: treated subjects may be anon-representative subset of the patients in stratum V (e.g. systematically sicker and, therefore, more likelyto have worse outcomes). However, under well-known causal identiﬁcation assumptions, we can estimateeach term of Ψ( v ) by integrating the diﬀerence in stratum-speciﬁc outcome regressions over the conditionaldistribution of W (see Appendix A)Ψ( v ) = (cid:90) W (cid:110) E [ Y | A = 1 , V = v, W ] − E [ Y | A = 0 , V = v, W ] (cid:111) dP v ( W ) (2.1)where P v ( W ) = P ( W | V = v ). This formula is known as standardization - a special case of Robins’ g-formula[25] in the point-treatment setting. The same general approach can be used to compute a marginal (over V and W ) average treatment eﬀect (ATE) Ψ = E [ Y − Y ], except we do not set V = v . Instead we integratedthe outcome regression over the joint distribution P ( L ) = P ( W, V )Suppose we observe n independent subjects with data, D = { Y i , A i , W i , V i } n . Let S v = { i : V i = v } contain the indices of subjects in stratum V = v and let n v denote the cardinality/size of S v such that n = (cid:80) v n v . Bayesian inference typically proceeds by obtaining a posterior over the regression E [ Y | A, V, W ].For instance, with a linear additive speciﬁcation, this amounts to just ﬁnding a posterior over the intercept andslope coeﬃcients. However, as discussed in the introduction, an array of more ﬂexible BNP models have beenused. Importantly, we must also estimate the conditional confounder distributions P v ( W ) as it is typicallyunknown in applications. From the perspective of causal estimation this is just a nuisance parameter and wewould like to do this as ﬂexibly as possible.One approach is to plug in the empirical distribution ˆ P v ( w ) = n v (cid:80) i ∈ S v δ W i ( w ) - where δ x ( · ) denotesthe degenerate distribution at x . For compactness we sometimes denote these as simply P v and δ x . Thisplaces uniform mass of 1 /n v on each confounder vectors observed in stratum v . This can be viewed as Efron’sbootstrap because “sampling the data with replacement” is formally deﬁned as drawing from this empiricaldistribution: in every iteration, each data vector is selected into the bootstrap resample with probability 1 /n v .For ATE estimation, the same is often done for the joint distribution, ˆ P ( l ) = n (cid:80) ni =1 δ L i ( l ). This is preciselythe approach of Hill [1] when computing ATEs with a BART prior on E [ Y | A, L ]. Indeed, as Ding et al. [26]note in their discussion of Bayesian causal inference, this is still a popular choice in the literature.To our knowledge, Wang et al. [10] ﬁrst proposed using Rubin’s Bayesian bootstrap (BB) [13] over thisempirical approach, as it accounts for uncertainty in the empirical estimate when doing full posterior ATEestimation. Nethery et al. [12], Saarela et al. [11], and Xu et al [9] also used this BB approach to do Bayesianinference on marginal causal estimation, as discussed in the introduction. To summarize the BB approachfor HTE estimation, it models the covariate distribution as a point mass at each observed covariate value P v ( w ) = (cid:80) i ∈ S v π vi δ W i ( w ), but unlike the empirical approach the weights, π v = { π vi } i ∈ S v , are consideredunknown parameters that completely determine P v . A prior over these these weights is then a prior over P v .Noting that the weight vector lives in the simplex, π v ∈ { R n v : π vi > ∀ i ∈ S v , (cid:80) i ∈ S v π vi = 1 } , BB placesan improper Dirichlet prior over this space π v ∼ Dir (0 n v ), where 0 n v is the n v -dimensional zero vector. Thisis a conjugate model with posterior π v | { W i } i ∈ S v ∼ Dir (1 n v ), where 1 n v is the n v -dimensional vector of4nes. Note that this is done for each V = v , separately . This is the approach used for HTE estimation in theBayesian causal inference literature by Boatman et al. [16], Roy et al. [14], and Taddy et al. [15]. We notethat the BB is closely related to the Bayesian motivation of the empirical likelihood via exponential tilting[27]. In addition, the BB connects to the literature on Bayesian numerical quadrature dating back to Diaconis[28] and O’Hagan [29] with more recent BART-based developments [30]. In particular, the π v could be viewedas unknown “quadrature weights”, π v , associated with a grid of “quadrature points”, { W i } i ∈ S v .This common approach of using separate stratum-speciﬁc BB estimates of P v for v = 1 , . . . , K does haveseveral advantages. First, it retains the ﬂexibility of the empirical distribution. Note that the posteriorexpectation of each π vi is 1 /n v . Second, unlike the empirical estimate, uncertainty in this estimate ﬂowsthrough to the posterior of Ψ( v ). Third, it is computationally easy (due to conjugacy) and agnostic to thechoice of outcome model. However, this approach becomes problematic for sparse strata where few uniquevalues of W are observed. Under the BB, P v assigns zero probability to values of W that are unseen in stratum v . This is undesirable because there are many values that we may think are a priori plausible. Indeed, wemay observe such values in other strata. However, since the BB estimates of P v are done independently, theposterior estimate of P v cannot borrow information from the other strata in these conditions - yielding lessstable estimates of Ψ( v ). The proposed HBB retains these desirable properties of the BB while addressing thesmall-strata shortcomings by adding the capacity to “partially pool” the estimates of P v . While the bulk ofthe mass remains on values seen within stratum v , all values of W observed in the full sample are a priori plausible. Let W v = { W i } i ∈ S v denote the observed confounders in stratum v and let W = { W v } v =1: K denote the fullset of confounders. We model W v as following an unknown distribution W v | P v ∼ P v and propose a prior for P v that borrows information across V . Our approach is inspired by the hierarchical Dirichlet Process (HDP)of Teh et al. [31]. The DP is a stochastic process that generates random distributions. Due to its ﬂexibilityand conjugacy, it has become a popular prior for unknown distributions in Bayesian analysis. An importantcharacteristic of the DP is that it generates discrete distributions almost surely. Suppose we place a DP prioron each P v , denoted P v ∼ DP ( αP v ). The realizations of P v are centered around a “mean” distribution of P , with α controlling the dispersion of these realizations around P v . This is ﬂexible because the posteriorof P v under a DP is a compromise between the prior mean, P v , and the empirical distribution in stratum V = v , (cid:80) i ∈ S v δ W i ( w ), with relative weight controlled by α . However, each P v is centered around its own P v ,preventing any borrowing of information across strata. This is precisely the motivation behind the HDP prior,which centers the P v around a common mean distribution P and adds a DP hyperprior on P . We note thatwhile the following theoretical development may seem complicated, the actual posterior computation will befully conjugate and eﬃcient. Under the HDP prior, the full model for the covariates is W i | P v ∼ P v for i ∈ S v P v | α, P ∼ DP ( αP ) for v = 1 , . . . , KP | γ, P ∗ ∼ DP ( γP ∗ ) (3.1)The DP hyperprior on P implies that the random P are discrete - allocating mass to atoms. The distributions P v have support on the same atoms as P but allocate mass diﬀerently across these atoms in a way that is5ocal to V . Since the DP is conjugate, the posterior of P v conditional on P is another DP: P v | P , α, W v ∼ DP ( αP + (cid:80) i ∈ S v δ W i ). Similarly the marginal posterior of P is also a DP: P | W ∼ DP ( γP ∗ + (cid:80) ni =1 δ W i ).For the Hierarchical BB, we set γ = 0 in (3.1) and denote this prior on P v as P v | α ∼ HBB ( α ). This yieldsposterior under the HBB ( α ) P v | P , α, W v ∼ DP ( αP + (cid:88) i ∈ S v δ W i ) P | W ∼ DP ( n (cid:88) i =1 δ W i ) (3.2)With γ = 0, P are random distributions centered around the empirical distribution P | W ∼ DP ( (cid:80) ni =1 δ W i )This distribution is discrete with an atom at each of the n observed W i . A P can be drawn from this posteriorby drawing a vector of weights π n ∼ Dir (1 n ), where π n = ( π , π , . . . , π n ). This draw of P can then berepresented as P = (cid:80) ni =1 π i δ W i . Note that this is exactly the BB of Rubin. However, now we have anadditional layer, since all the stratum-speciﬁc distributions must be drawn around this P . Conditional onthis P , we draw these stratum-speciﬁc distributions from P v | P , α, W v ∼ DP ( α ( (cid:80) ni =1 π i δ W i ) + (cid:80) i ∈ S v δ W i ). Figure 1: Draw from posterior of P v under prior P v ∼ HBB (2) with simulated scalar W i for n = 90 subjects from V = 1 , ,

3. These 90 atoms are represented by vertical bars with colors indicating stratum of the atom. The height ofthe lines represent probability mass drawn from the HBB posterior. Left panel: a draw of P - recall this is centeredaround the empirical distribution (i.e. line 2 in (3.2) ). The next panel shows a draw from the Dirichlet Processposterior of P v conditional on this draw of P - i.e. line one of (3.2). Note that P , P , and P place positive mass on all observed atoms. For instance, independent BB estimates of P would put place 0 mass on all atoms but the red -unlike the third panel. Again, conditional on a draw of P , each P v is a discrete distribution with atoms at each of the observed n points in the entire sample. Combining like terms in the summations, however, we see that atoms observedin stratum V = v have a weight of απ i + 1 - higher than the weight on atoms unseen in stratum V = v , whichis απ i . To see this, note that in expectation (over many draws of P v ), the posterior distribution of W v can berepresented using Blackwell and MacQueen’s Polya Urn [33]: P v ( W = w | P , α, W v ) = αα + n v ( n (cid:88) i =1 π i δ W i ) + 1 α + n v (cid:88) i ∈ S v δ W i = 1 α + n v (cid:110) (cid:88) i/ ∈ S v απ i δ W i + (cid:88) i ∈ S v (1 + απ i ) δ W i (cid:111) (3.3)Again due to the ﬁnitely many atoms, we can draw a P v from this posterior by drawing from an n -dimensional6irichlet distribution with the i th concentration parameter being απ i for i / ∈ S v and 1 + απ i for i ∈ S v .Intuitively, this can be seen as adding an additional α subjects from the marginal distribution into stratum V . These “pseudo-subjects” can take on any observed value in the marginal, even if they are unobserved inthe stratum - thus, borrowing information. As with the posterior update for P , a draw from this Dirichletdistribution yields an n-dimensional set of weights π v n and thus a draw of P v is given by P v ( w ) = (cid:80) ni =1 π vi δ W i .Note that in the above we used a common α across strata. This is without loss of generality, as each stratum canhave its own α v without changing the results. We will return to the topic of specifying these hyperparametersafter discussing posterior computation. Here we describe posterior inference for HTEs under a

HBB ( α ) prior for P v . We can sample P v from the HBBposterior distribution eﬃciently using Dirichlet distributions as described in the previous section. Speciﬁcally,at each iterations m , we1. Obtain a posterior draw of P by drawing weights π ( m )1: n ∼ Dir (1 n ) : P ( m )0 ( w ) = n (cid:88) i =1 π ( m ) i δ W i ( w )2. For each v = 1 , . . . , K , obtain a posterior draw, P ( m ) v , conditional on P ( m )0 . We do this by drawing π v ( m )1: n ∼ Dir ( η ( m ) n ), where η ( m ) n is the n-dimensional vector with element i being απ ( m ) i if i / ∈ S v and(1 + απ ( m ) i ) if i ∈ S v . Note the sum of the elements in η ( m ) n is α + n v . This now forms a draw of P v P ( m ) v ( w ) = n (cid:88) i =1 π v ( m ) i δ W i ( w )Returning to the ultimate goal of HTE estimation, suppose we also have m = 1 , . . . , M posterior drawsof the regression E [ Y | A, W, V ], denoted by µ ( m ) ( A, W, V ). This can be from any model. For instance, ina GLM this could be µ ( m ) ( A, W, V ) = g − ( β ( m )0 + W (cid:48) β ( m ) w + V (cid:48) β ( m ) v + β ( m ) A A ) where g − is the inverse linkfunction. This could also be a posterior draw µ ( m ) ( A, W, V ) = f ( m ) ( A, W, V ) where f ( m ) is the posterior drawof a sum-of-trees model under a f ∼ BART prior. To estimate the HTE, we include a third step3. Integrate over HBB draw of P v from Step 2, P ( m ) v .Ψ ( m ) ( v ) = (cid:90) W (cid:110) E [ Y | A = 1 , V = v, W ] − E [ Y | A = 0 , V = v, W ] (cid:111) dP v ( W ) ≈ (cid:90) W (cid:110) µ ( m ) ( A = 1 , W i , V = v ) − µ ( m ) ( A = 0 , W i , V = v ) (cid:111) dP ( m ) v ( W ) ≈ n (cid:88) i =1 π v ( m ) i (cid:110) µ ( m ) ( A = 1 , W i , V = v ) − µ ( m ) ( A = 0 , W i , V = v ) (cid:111) (3.4)Repeating this procedure for each of the draws yields a set of M draws from the posterior of Ψ( v ), { Ψ ( m ) ( v ) } M , for each stratum v = 1 , . . . , K . Note that the W i from all subjects contribute to Ψ ( m ) ( v ).However, values from the stratum and values outside the stratum are weighted diﬀerently according to π v ( m ) i .7 .2 Some Limiting Cases and Hyperparameter Choice Here we consider the limiting behavior of the HBB by analyzing (3.3) conditional on P ( w ) = (cid:80) ni =1 π i δ W i and the choice of hyperparameter. Note that for α = 0, the ﬁrst term in (3.3) disappears and our estimatereduces to P v ( W = w | P , α, W v ) = n v (cid:80) i ∈ S v δ W i . This is the empirical distribution within stratum v and represents a completely unpooled estimate where values of W unseen in stratum v have no mass. Thus,there is no borrowing of information. This is also the posterior mean of the BB estimate of P v used by Roy,Taddy, and Boatman for HTE estimation. Now consider the other extreme where α >> n v . In this case(3.3) reduces to P v ( w | P , α, D ) = (cid:80) ni =1 π i δ W i - the BB estimate of the entire marginal distribution (over V )that places expected mass E [ π i ] = 1 /n on each observed value of W in the entire sample. That, is we have completely pooled all the stratum-speciﬁc distributions. The parameter α controls the posterior compromisebetween these extremes for a particular stratum. We can also view α as trading oﬀ bias and variance whenestimating when estimating P v ( W ). Suppose we believe that W ⊥ V so that the conditional equals themarginal, P v ( W ) = P ( W ). Then, we could use all n data points to form an empirical distribution for P v ( W )when computing Ψ( v ). However, if we are wrong and W, V are dependent, then this would introduce bias.With α >> n v , we shrink towards a prior belief in this independence assumption. Conversely, α << n v shrinksto the stratum-level BB that only uses subjects in stratum v (prior belief of dependence). Hyperparameter choice

To consider choice of α , note that we can interpret α > α pseudo-subjects fromthe marginal distribution of W to the n v subjects in stratum V = v . Higher α places more weight on thepseudo-subjects - who may have values unseen in stratum V = v (i.e. more shrinkage towards the marginal).In general, the relative mass on a point seen within the stratum relative to an unseen point is approximately ρ = α/nα/n = nα + 1. This is seen in (3.3) when substituting π i with its posterior expectation of 1 /n . Forexample, if we add α = n pseudo-subjects, then on average the atoms seen in stratum v are about as likelyas the atoms not seen in stratum v . This is fairly aggressive shrinkage. For some M >

0, we propose setting α = n · Mn v which implies a relative weight of ρ = n v M + 1. We should now subscript this parameter as α v as it isstratum-speciﬁc depending on n v - but we omit this notation where there is no ambiguity. Here, M is user-speciﬁed and can be interpreted as the minimum desired sample size in each stratum. This may partially be setdepending on the number of confounders we are integrating over and the complexity of the joint distribution.For instance, with well-behaved, standard joint distribution (e.g. multivariate Gaussian), M = 30 subjectswithin a stratum may be suﬃcient to estimate the distribution. On the other hands, if the covariates arecomplex, skewed, and multimodal we may need a larger M to obtain a good nonparametric estimate sucha distribution. Note that strata with size n v << M implies ρ ≈ n v >> M , ρ gets larger - placing increasingly more weight on atoms withinstratum v only. This reduces shrinkage proportional to n v . Figure 2 depicts draws from the posterior of P v under a prior P v ∼ HBB ( nM/n v ) with synthetic data. Note that strata that are more sparse (relative to M )have distribution draws that are more heavily shrunk towards the marginal. However, we place positive mass onall points observed in the sample. Finally, not only does this choice of α = n · Mn v have interpretable appeal but italso has desirable behavior for large strata (as n v . For ﬁxed user-speciﬁed M , consider the weight on the prior, P , in the ﬁrst line of (3.3), αα + n v . With our choice of α , this becomes nM/n v nM/n v + n v = M/n v M/n v + n v /n . As n, n v getlarge (at the same rate) this term is approximately zero: M/n v M/n v + n v /n ≈ n v /n = 0. Now consider the weight onthe empirical distribution of stratum v under this choice of α : α + n v = M/ ( n v /n )+ n v . As n, n v get much larger8han M , the weight is increasingly dominated by n v , M/ ( n v /n )+ n v ≈ n v . So for large strata, HBB estimate in(3.3) is dominated by the stratum-speciﬁc empirical distribution: P v ( W = w | P , α, W v ) ≈ n v (cid:80) i ∈ S v δ W i . Figure 2: Draw from posterior of P v under prior P v ∼ HBB ( nM/n v ) with n = 300 scalar confounders simulated for v = 1 , , M = 30. Note that for stratum V = 1 we have far greater observations than M andso the draw of P places most mass on atoms seen in this stratum. Stratum 2 has size slightly larger than M and soplaces ρ = 58 /

30 + 1 ≈ ρ = 10 /

30 + 1 ≈ The smoothed bootstrap as a limiting case

As mentioned, the smooth bootstrap has been proposed as one way of placing mass on unseen values of W .In this section, we brieﬂy show how this is a limiting case of the HBB ( α ) prior on a mixing distribution when α →

0. The smooth bootstrap estimate of P v is given byˆ P v ( w ) = 1 n v (cid:88) i ∈ S v K h (cid:16) w − W i h (cid:17) Smoothness is induced by convoluting a user-speciﬁed symmetric kernel, K h , with the empirical distribution.The parameter h controls smoothing and diﬀerent methods have been proposed for selecting it. Angelis et al.[34] discuss a cross-validation approach and generally recommend a “smaller order of h ”. For concreteness,suppose the kernel is chosen to be standard Normal K h ( w − W i h ) = N ( w − W i h ; 0 , n v kernels centered around each observed W i with variance h . The mixing distribution isthe empirical distribution giving weight 1 /n v to each mixture component. Now consider a Bayesian mixturemodel with unknown mixing distribution P v P ( w | P v ) = (cid:90) W K h (cid:0) w − Wh (cid:1) dP v ( W )Above, W are random with distribution P v and w is a particular value. With an HBB ( α ) prior on the mixingdistribution, recall that the mean of P v is given via the Polya Urn in (3.3). Plugging this urn expression infor P v yields P ( w ) = (cid:90) W K h (cid:0) w − Wh (cid:1)(cid:110) αα + n v P ( W ) + 1 α + n v (cid:88) i ∈ S v δ W i ( W ) (cid:111)

9n the improper limit as α →

0, the left term in the Polya Urn goes to 0. Distributing the kernel and notingthat the integral over W is hit only at each W i , P ( w ) = 1 n v (cid:88) i ∈ S v K h (cid:16) w − W i h (cid:17) This is exactly the smoothed bootstrap estimate ˆ P v . However, in Bayesian inference we would typically havea posterior distribution over P v (a conjugate Dirichlet process) which we would draw from. This propagatesprior uncertainty whereas the smooth bootstrap uses the empirical distribution as a ﬁxed plug-in estimate ofthe mixing distribution. In this section we assess the behavior of the HBB relative to other approaches under a variety of settings viasimulation. In all settings, we simulate 1000 datasets with n = 300 observations from K = 4 strata of varyingsparsity. On average, the strata counts are n = 120, n = 90, n = 60, n = 30. Thus, stratum 4 is themost sparse stratum and stratum 1 is the least sparse. In each simulated data set, we simulate a vector, W ,of 10 confounders for each subject conditional on stratum V . The treatment indicator A itself is simulatedas a function of stratum membership and confounders. We simulate a binary outcome model conditional on V , W , and A from a logistic model. In the true outcome model, each stratum has a diﬀerent (conditional)treatment eﬀect, leading to true HTEs that vary across strata. These represent challenging scenarios withmany confounders and small samples that are often encountered in applied work.For each simulated dataset, we use a correctly speciﬁed Bayesian logistic regression with wide, null-centeredGaussian priors. This is to focus attention on the confounder distribution models. After posterior sampling forthe regression, we compute a causal risk diﬀerence, Ψ( v ) = E [ Y | V = v ] − E [ Y | V = v ], by integrating theregression over the confounder distribution model under both treatment interventions and taking the diﬀerence.We integrate over four confounder distribution models: the empirical distribution, the stratum-speciﬁc BBestimate, the HBB, and the oracle. By “oracle” we mean a Monte Carlo integration over draws from thetrue stratum-speciﬁc confounder distribution. For the HBB, we set P v ∼ HBB ( nM/n v ) with M = 100 in allsettings. We assess the bias, variance, coverage, and precision of posterior estimates for Ψ(1) (the eﬀect in theleast sparse stratum) and Ψ(4) (the eﬀect in the most sparse stratum) across simulation results in Table 1.In the ﬁrst setting, we consider a relatively simple multivariate Gaussian generating distribution for W ,which does not vary across V . In this “homogeneous Gaussian” setting, we see little diﬀerence in performanceamong the 4 methods in the least sparse stratum ( V = 1). This is desirable as we would want the HBB toperform similar to other methods in such populous stratum. In the sparse stratum ( V = 4), the HBB hasslightly lower bias with lower MSE (equal up to three decimal places). Notably, the HBB borrows informationacross strata to yield, on average, narrower intervals than the BB interval (.46 v .478) while maintainingnominal coverage of around 95%. Note that the BB produces a wider interval than the empirical distributionas well (.478 v .459) this is because the BB accounts for uncertainty in the confounder distribution estimate.The second setting considers a more diﬃcult scenario where W is marginally generated from a locationmixture of Gaussians. Each W is generated from a 10-dimensional multivariate normal but with diﬀerentmean for each stratum. Thus, borrowing information from diﬀerent strata is expected to come at the expenseof more increased bias. Indeed, in stratum 4 we see that absolute bias is about six times higher for HBBrelative to BB (.018 v. .003), however variation is also reduced (.01 v. .014) leading to about a 20% reduction10etting 1: Homogeneous GaussianModel MSE Bias Variance Interval Width CoverageStratum 1 Empirical 0.005 0.007 0.005 0.256 0.930BB 0.005 0.007 0.005 0.260 0.930HBB 0.005 0.007 0.005 0.258 0.933Oracle 0.005 0.008 0.005 0.255 0.928Stratum 4 Empirical 0.013 0.002 0.013 0.459 0.944BB 0.013 0.002 0.013 0.478 0.951HBB 0.013 0.000 0.013 0.460 0.948Oracle 0.013 0.000 0.013 0.457 0.945Setting 2: Gaussian MixtureModel MSE Bias Variance Interval Width CoverageStratum 1 Empirical 0.005 0.003 0.005 0.261 0.938BB 0.005 0.003 0.005 0.264 0.939HBB 0.005 0.007 0.005 0.253 0.941Oracle 0.005 0.004 0.005 0.259 0.934Stratum 4 Empirical 0.014 0.003 0.014 0.465 0.949BB 0.014 0.003 0.014 0.484 0.952HBB 0.011 0.018 0.010 0.440 0.950Oracle 0.013 0.000 0.013 0.463 0.957Setting 3: Bernoulli MixtureModel MSE Bias Variance Interval Width CoverageStratum 1 Empirical 0.007 0.005 0.007 0.310 0.933BB 0.007 0.005 0.007 0.312 0.936HBB 0.007 0.012 0.006 0.300 0.930Oracle 0.007 0.006 0.007 0.310 0.931Stratum 4 Empirical 0.023 0.010 0.022 0.577 0.953BB 0.023 0.010 0.022 0.584 0.953HBB 0.021 0.032 0.020 0.544 0.945Oracle 0.022 0.011 0.022 0.575 0.95Setting 4: Gamma MixtureModel MSE Bias Variance Interval Width CoverageStratum 1 Empirical 0.005 0.013 0.005 0.268 0.942BB 0.005 0.013 0.005 0.272 0.947HBB 0.006 0.022 0.006 0.288 0.934Oracle 0.005 0.009 0.005 0.260 0.950Stratum 4 Empirical 0.032 0.092 0.023 0.587 0.904BB 0.032 0.092 0.023 0.592 0.907HBB 0.011 0.002 0.011 0.405 0.943Oracle 0.010 0.018 0.009 0.371 0.933 Table 1: Simulation results: MSE, absolute bias, empirical variance of the posterior mean along with the widthand coverage of the 95% credible interval across 1,000 simulation runs. MSE is computed as average of the squareddiﬀerence between posterior mean and truth across simulations. Empirical variance is computed as the variance of the1000 posterior meanss. In general, the HBB trades oﬀ bias for gains in eﬃciency, leading to overall reduction in MSEfor sparse strata. Performance is generally similar to BB in more populous strata. The performance is particularly goodin the complicated Gamma mixture setting, where stratum 4 has too few observations from the tail of the Gamma-distributed W to estimate P ( W ) reliably via BB. The HBB, however, is able to borrow tail values observed in theother strata. of MSE (.011 v .014). The HBB interval is narrower relative to BB (.440 v .484) while maintaining close tonominal coverage. In stratum 1 we see equivalent MSE across methods. In the third setting, we consider thecase where W is comprised of independent Bernoulli realizations - with separate probability vectors for eachstratum. Each vector can have 2 possible values, but there are far fewer than 2 observations in any of the11 igure 3: Distribution of the 1000 simulated 95% credible interval widths for the HTE of each stratum, across thethree simulation settings. As expected, interval width increases as strata get more sparse. In all settings, the HBB haslower interval width relative to the stratum-speciﬁc BB. As strata get sparse, the BB gets less precise while the HBBremains comparable to the oracle. As shown in Table 1, the HBB has close to nominal coverage throughout. stratum. This complicates estimation of P v ( W ). In the sparse stratum V = 4, we see the HBB has nearlythree times higher absolute bias (.032 v .01) but has reduced variability (.020 v .022). The MSE is reduced byabout 8% (.021 v .023) with the HBB. Notably, the HBB interval is, on average, narrower while maintainingclose to nominal coverage. While, in any stratum, we observe far fewer than the 2 possible values of W , theHBB is able to borrow values seen in other strata.Lastly, in the fourth setting we consider an even more complicated scenario where W is generated from a10-dimensional location mixture of Gamma distributions. Each stratum has a diﬀerent mean and, importantly,skewness. This scenario is designed to assess the tail-behavior of the HBB. As shown in Table 1, the HBBperforms especially well in this complicated scenario. In stratum 4, the MSE, bias, and variance are lowerthan the BB. Intervals are narrower and coverage is closer to the nominal rate (94.3%). The small samplesize in stratum 4 leads to too few covariate observations from the tail of the skewed Gamma to have a reliablenonparametric estimate of P ( W ). This leads to poor BB estimates. However, in this setting the HBB borrowsrealizations from the tail in other strata - leading to a better estimate of P ( W ).To illustrate the precision of the results, Figure 3 displays the distribution of the 95% credible intervalwidths across the 1000 simulation runs in all settings. Note that in general interval width increases (precisiondecreases) as strata get more sparse. In all settings, the HBB interval widths are generally comparable to theoracle. In particular, in the extreme Gamma mixture setting, the empirical and BB methods are much moreimprecise, while the HBB interval widths are more similar to the oracle. These narrower interval widths donot come at the cost of signiﬁcant undercoverage. Recall from Table 1 that coverage for the HBB is close tonominal across settings, even while comparators suﬀer. In this section we conduct posterior inference for casual contrasts of proton versus photon therapy amongpatients being treated for various locally-advanced cancers. For the cancers under consideration, standard-of-care therapy is a combination of chemotherapy and radiation - known as concurrent chemoradiotherapy12CRT). However, many modalities of radiation exist. The most common modality used in CRT has beenphoton radiation. In recent year, proton radiation therapy has become more accessible alternative to patientsas barriers to access have eased and health systems have adopted the necessary technology. The idea ofproton therapy is to deliver radiation in a more targeted way to the cancer site, while being less damaging tohealthy tissue relative to photon. Observational data were collected from n = 1468 adult patients diagnosedwith non-metastatic cancer and treated with CRT at the University of Pennsylvania Health Systems from2011-2016.Our data includes assigned treatment to CRT with either proton or photon radiation, several confoundersmeasured at the time of treatment initiation, as well as the count of adverse events for a follow-up periodof 90 days after treatment initiation. All patients in the sample had complete follow-up for at least 90 days.Previous research on this data [35] has focused on the comparative risk of adverse events for patients on protonversus photon radiation. One hypothesis is that the more targeted nature of proton therapy will lead to fewerunplanned adverse events. To address these questions, we conduct two analyses. In the ﬁrst, we estimate acausal incidence diﬀerence between proton and photon patients across cancer type strata using a hierarchicalPoisson GLM. In the second, we estimate of causal odds ratio nonparametrically using BART. In the processwe illustrate how the HBB can be combined with both parametric and nonparametric models for diﬀerentoutcome types. It can also be used to estimate diﬀerent marginal causal contrasts (incidence diﬀerences, oddsratios, risk ratios, etc). In this setting, our outcome is a count of adverse events over the 90-day follow-up, Y ∈ { } ∪ Z + . We observedata across K = 8 cancer types (e.g., lung, head and neck, and esophagus/gastric) indicated by V . Let A = 1 denote proton while A = 0 denote photon. Finally, let W be a vector of confounders including baselineage, race, sex, body-mass index (BMI), insurance plan, and charlson comorbidity index (a measure of baselinehealth status). We specify a conditional Poisson outcome model with the regression below. We adjust for race,sex, and insurance plan as categorical covariates, while using a cubic B-spline for age. BMI and charlson indexare included as continuous covariates. More details on speciﬁcation and prior choices are given in AppendixD. The model for each stratum can be written as E [ Y i | A i , W i , V i = v ] = exp { β v + W (cid:48) i η v + A i θ v } Though parametric, such models are common in practice. Note we allow coeﬃcients to vary across strata. Ourtarget of interest here is the causal incidence diﬀerence within each stratum Ψ( v ) = E [ Y | V = v ] − E [ Y | V = v ]. A negative value indicates lower incidence of adverse events due to proton therapy relative to photon.To obtain this, we integrate the above regression over various estimators of P v ( W ). Figure 4a displays resultsunder three diﬀerent estimates of P v - including the HBB (with M = 100), BB, and the empirical distributionof W in each stratum. While the estimates for Ψ( v ) are largely similar across strata, note the HBB intervalsare typically slightly shorter. Similarly, the point estimates are typically higher in these strata. This maypartially reﬂect the trading oﬀ of increasing biased for reduced variability, as demonstrated in the simulations.However, these simulation results were averages across many runs. In any single data analysis, HBB need notproduce narrower intervals.Interpreting posterior estimates of Ψ( v ) in Figure 4a, we see that the proton and photon therapies’ eﬀecton adverse event incidence are largely comparable across cancer type - with posterior distributions centered13 a) Hierarchical Poisson Outcome Model (b) Stratum-Speciﬁc BART Outcome ModelsFigure 4: Posterior mean and 95% credible interval estimates of stratum-speciﬁc causal contrasts under Poisson model(left) and BART (right). For both models, we set minimum desired sample size of M = 100. The abbreviationsare gynecological (gyn), pancreas/duodenum/hepatobiliary (p/d/h), esophagus/gastric (e/g), and head/neck (h&n).Similar strata deﬁnitions were used in previous clinical studies [35] and may be justiﬁed by anatomical closeness ofaﬀected organs. either near zero or very wide around 0 (as indicated by 95% credible intervals). Of course, these causalinterpretations are subject to the validity of the required identiﬁcation assumptions discussed earlier and inAppendix A. Moreover, these inferences are conditional on the very rigid parametric assumptions of the Poissonmodel. For instance, it assumes linear (on log-scale) and additive covariate eﬀects, in addition to the poissondistributional assumption. In the next section, we consider a nonparametric estimation via BART. Here we illustrate how the HBB can be used in conjunction with a nonparametric model for a binary outcometo obtain HTEs more robust to model misspeciﬁcation. In this context let Y ∈ { , } be a binary indicator ofany adverse event over the 90-day followup period. Then, we specify a conditional Bernoulli model for Y withregression E [ Y i | A i , W i , V = v ] = Φ (cid:0) f v ( W i , A i ) (cid:1) f v ∼ BART v = 1 , . . . , K

This is the probit speciﬁcation of BART outlined in Chipman et al. [36]. Above, Φ is the standard Normaldistribution function and f v ∼ BART is shorthand for the sum-of-trees model f v ( W i , A i ) = (cid:80) Jj =1 g vj ( W i , A i )with J trees, g j . BART is characterized by a prior on the structure of each tree, g j , consisting of terminal nodeparameters, splitting rules, and tree depth. Here we estimate stratum-speciﬁc models, with separate BARTpriors on each function. Thus, for each stratum v , we can get posterior draws of f v under each treatment A = a . In this case our target is the stratum-speciﬁc causal odds ratioΨ( v ) = E [ Y | V = v ] / (1 − E [ Y | V = v ]) E [ Y | V = v ] / (1 − E [ Y | V = v ])14alues of Ψ( v ) less than one indicate lower risk of any adverse event due to proton therapy, relative to photon.Using standardization, we can compute each expectation by integrating Φ( f v ( W, a )) over P v ( W ). Figure 4bdisplays posterior results for Ψ( v ) under three diﬀerent estimate of P v - including the HBB (with M=100), BB,and the empirical distribution of W in each stratum. We notice that while point and interval estimates aregenerally similar across strata, the HBB intervals are somewhat shorter. However, according to these results,there is little posterior evidence for a reduction of adverse event risk due to proton therapy. While pointestimates of the odds ratios are below one across strata, there is signiﬁcant posterior uncertainty about thedirection and magnitude of these eﬀects, as indicated by the wide 95% credible intervals mostly overlappingone. The confounder distribution is a key unknown that must be estimated ﬂexibly when making causal inferences.It is still more important in the context of HTEs where some strata are too sparse to allow reliable non-parametric estimation. In this paper we show that straightforward application of the Bayesian bootstrap canbe improved upon in these scenarios with the HBB. The proposed HBB shares covariate information acrossstrata to achieve more stable stratum-speciﬁc causal estimates. The approach is computationally tractable,compatible with arbitrary outcome models, and makes no parametric assumptions about the distributions. Asshown in the data analysis, it can be used to compute a variety of marginal causal contrasts. Interestingly, weshowed that the stratum-speciﬁc BB and the smooth bootstrap can be seen as special-cases of the HBB.We emphasize that potential applications of the HBB go beyond estimation of stratum-speciﬁc causal ef-fects. For instance, another popular causal estimand is the average treatment eﬀect on the treated (ATT). Thisis deﬁned as the average diﬀerence in potential outcomes among those assigned treatment. A Standardization-type procedure can be used here as well and requires integrating a regression over the distribution of con-founders among the treated, P ( W | A = 1). If there are too few treated subjects to get a reliable nonparamet-ric estimate of this distribution, it may be reasonable to borrow covariate information from untreated subjects, P ( W | A = 0), by shrinking towards the marginal via the HBB.Lastly, our discussion of the connection between the HBB and the smoothed bootstrap motivates anextension to a “smoothed HBB”. In Section 3.2, an HBB (0) prior on the mixing distribution corresponds to asmoothed bootstrap within a stratum but prevents borrowing of information. We could in principle set α > K h - thus borrowing information across stratawhile modeling the distribution as a smooth mixture. Posterior computation for such a mixture model can bedone but is signiﬁcantly more complicated and requires good default choices of K h , but it may be beneﬁcialto explore in future work. 15 cknowledgements We would like to thank Dr. James Metz and Dr. Justin Bekelman (Department of Radiation Oncology,Perelman School of Medicine, University of Pennsylvania) for access to the dataset used in this paper.

References [1] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference.

Journal of Computational andGraphical Statistics , 20(1):217–240, 2011.[2] Bret Zeldow, Vincent Lo Re III, and Jason Roy. A semiparametric modeling approach using bayesianadditive regression trees with an application to evaluate heterogeneous treatment eﬀects.

Ann. Appl.Stat. , 13(3):1989–2010, 09 2019.[3] P. Richard Hahn, Jared S. Murray, and Carlos M. Carvalho. Bayesian regression tree models for causalinference: Regularization, confounding, and heterogeneous eﬀects.

Bayesian Analysis , 2020. Advancepublication.[4] Nicholas C Henderson, Thomas A Louis, Gary L Rosner, and Ravi Varadhan. Individualized treatmenteﬀects with censored data via fully nonparametric Bayesian accelerated failure time models.

Biostatistics ,21(1):50–68, 07 2018.[5] Arman Oganisian, Nandita Mitra, and Jason A. Roy. A bayesian nonparametric model for zero-inﬂatedoutcomes: Prediction, clustering, and causal estimation.

Biometrics , n/a(n/a), 2020.[6] Jason Roy, Kirsten J. Lum, Bret Zeldow, Jordan D. Dworkin, Vincent Lo Re III, and Michael J. Daniels.Bayesian nonparametric generative models for causal inference with missing at random covariates.

Bio-metrics , 74(4):1193–1202, 2018.[7] Chanmin Kim, Michael J. Daniels, Bess H. Marcus, and Jason A. Roy. A framework for bayesian non-parametric inference for causal eﬀects of mediation.

Biometrics , 73(2):401–409, 2017.[8] Yanxun Xu, Daniel Scharfstein, Peter Mller, and Michael Daniels. A Bayesian nonparametric approachfor evaluating the causal eﬀect of treatment in randomized trials with semi-competing risks.

Biostatistics ,04 2020. kxaa008.[9] Dandan Xu, Michael J. Daniels, and Almut G. Winterstein. A bayesian nonparametric approach to causalinference on quantiles.

Biometrics , 74(3):986–996, 2018.[10] Chi Wang, Francesca Dominici, Giovanni Parmigiani, and Corwin Matthew Zigler. Accounting for un-certainty in confounder and eﬀect modiﬁer selection when estimating average causal eﬀects in generalizedlinear models.

Biometrics , 71(3):654–665, 2015.[11] Olli Saarela, David A. Stephens, Erica E. M. Moodie, and Marina B. Klein. On bayesian estimation ofmarginal structural models.

Biometrics , 71(2):279–288, 2015.[12] Rachel C. Nethery, Fabrizia Mealli, and Francesca Dominici. Estimating population average causal eﬀectsin the presence of non-overlap: The eﬀect of natural gas compressor station exposure on cancer mortality.

Ann. Appl. Stat. , 13(2):1242–1267, 06 2019. 1613] Donald B. Rubin. The bayesian bootstrap.

Ann. Statist. , 9(1):130–134, 01 1981.[14] Jason Roy, Kirsten J. Lum, and Michael J. Daniels. A Bayesian nonparametric approach to marginalstructural models for point treatments and a continuous or survival outcome.

Biostatistics , 18(1):32–47,06 2016.[15] Matt Taddy, Matt Gardner, Liyun Chen, and David Draper. A nonparametric bayesian analysis ofheterogenous treatment eﬀects in digital experimentation.

Journal of Business & Economic Statistics ,34(4):661–672, 2016.[16] Jeﬀrey A Boatman, David M Vock, and Joseph S Koopmeiners. Borrowing from supplemental sources toestimate causal eﬀects from a primary data source. arXiv preprint arXiv:2003.09680 , 2020.[17] Susanna Makela, Yajuan Si, and Andrew Gelman. Bayesian inference under cluster sampling with prob-ability proportional to size.

Statistics in Medicine , 37(26):3849–3868, 2018.[18] Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I. Jordan. A scalable bootstrap formassive data.

Journal of the Royal Statistical Society. Series B (Statistical Methodology) , 76(4):795–816,2014.[19] Andrs F. Barrientos and Vctor Pea. Bayesian bootstraps for massive data.

Bayesian Anal. , 15(2):363–388,06 2020.[20] Bradley Efron and Gail Gong. A leisurely look at the bootstrap, the jackknife, and cross-validation.

TheAmerican Statistician , 37(1):36–48, 1983.[21] B. W. Silverman and G. A. Young. The bootstrap: To smooth or not to smooth?

Biometrika , 74(3):469–479, 1987.[22] Suojin Wang. Optimizing the smoothed bootstrap.

Annals of the Institute of Statistical Mathematics ,47(1):65–80, 1995.[23] David L. Banks. Histospline smoothing the bayesian bootstrap.

Biometrika , 75(4):673–684, 1988.[24] D. B. Rubin. Estimating causal eﬀects of treatments in randomized and nonrandomized studies.

Journalof educational Psychology , 66(5):688–701, 1974.[25] James Robins. A new approach to causal inference in mortality studies with a sustained exposure period- application to control of the healthy worker survivor eﬀect.

Mathematical Modelling , 7(9):1393 – 1512,1986.[26] Peng Ding and Fan Li. Causal inference: A missing data perspective.

Statist. Sci. , 33(2):214–237, 052018.[27] Susanne M. Schennach. Bayesian exponentially tilted empirical likelihood.

Biometrika , 92(1):31–46, 2005.[28] Persi Diaconis. Bayesian numerical analysis.

Statistical Decision Theory and Related Topics IV , 1988.[29] A. O’Hagan. Some bayesian numerical analysis.

Bayesian statistics , 4:345–363, 1992.[30] Harrison Zhu, Xing Liu, Ruya Kang, Zhichao Shen, Seth Flaxman, and Francois Xavier Briol. Bayesianprobabilistic numerical integration with tree-based models, 2020.1731] Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical dirichlet processes.

Journal of the American Statistical Association , 101(476):1566–1581, 2006.[32] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation.

J. Mach. Learn. Res. ,3(null):993?1022, March 2003.[33] David Blackwell and James B. MacQueen. Ferguson distributions via polya urn schemes.

Ann. Statist. ,1(2):353–355, 03 1973.[34] Daniela de Angelis and G. Alastair Young. Smoothing the bootstrap.

International Statistical Review /Revue Internationale de Statistique , 60(1):45–56, 1992.[35] Brian C. Baumann, Nandita Mitra, Joanna G. Harton, Ying Xiao, Andrzej P. Wojcieszynski, Peter E.Gabriel, Haoyu Zhong, Huaizhi Geng, Abigail Doucette, Jenny Wei, Peter J. O?Dwyer, Justin E. Bekel-man, and James M. Metz. Comparative Eﬀectiveness of Proton vs Photon Therapy as Part of ConcurrentChemoradiotherapy for Locally Advanced Cancer.

JAMA Oncology , 6(2):237–246, 02 2020.[36] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bart: Bayesian additive regressiontrees.

Ann. Appl. Stat. , 4(1):266–298, 03 2010. 18

Identiﬁcation of HTE

Here we identify the HTE in the point-treatment setting discussed in the paper. Recall the HTE is the averagetreatment eﬀect within stratum v , Ψ( v ) = E [ Y | V = v ] − E [ Y | V = v ]. Consider the term E [ Y a | V = v ]and now iterate expectation over W : E [ Y a | V = v ] = (cid:90) W E [ Y a | W, V = v ] dP v ( W )Now we assume conditional ignorability. Speciﬁcally that within stratum v , once we condition on confounders W , treatment assignment is independent of potential outcome, Y a ⊥ A | W, V = v . This implies that E [ Y a | W, V = v ] = E [ Y a | A = a, W, V = v ], E [ Y a | V = v ] = (cid:90) W E [ Y a | A = a, W, V = v ] dP v ( W )Now, we assume consistency. That is, the outcome actually observed under treatment assignment A = a actually equals the outcome that would occur under treatment A = a , i.e. Y a = Y . This would be violated if,for instance, there is non-adherence to treatment assignment. This yields, E [ Y a | V = v ] = (cid:90) W E [ Y | A = a, W, V = v ] dP v ( W )So we have identiﬁed each term of Ψ( v ) as a regression averaged over P v ( W ) = P ( W | V = v ). Note thatwe implicitly make a positivity and non-adherence assumption. By conditioning on A = a within W and V ,we are assuming that treatment probability is bounded 0 < P ( A = 1 | W, V = v ) < W within stratum V where we only observed patients assigned to one of the two treatments.We cannot estimate a causal eﬀect in this region of the data without (likely incorrect) extrapolation. Moreover,for a particular sample we have assumed that each subjects potential outcome Y a i i is unaﬀected by others’treatment assignment. If subject j ’s treatment assignment impacts subject i ’s potential outcome, then wewould have had to index the potential outcome with this treatment as well, Y a i ,a j i . B Posterior Derivations

Here we provide a derivation of the posterior distribution of each P v using Dirichlet Distributions - the ﬁnite-dimensional analogue of the Dirichlet Process. This is to supplement the conjugacy results used in the maintext. Suppose our model for the conditional covariate distribution, P v ( W ) = P ( W | V = v ), is P v ( W | π v ) = n (cid:88) i =1 π vi · δ W i ( W )We have K such distributions for each of the K levels of V . Consider the Dirichlet prior on each π v =( π v , π v , . . . , π vn ) conditional on the π = ( π , π , . . . , π n ) and α mentioned in the text, π v | π, α ∼ Dir ( απ )19ote, we could do everything in terms of v -speciﬁc concentration parameters, but use a common α for com-pactness. Now place Dirichlet hyperprior on π : π | γ ∼ Dir ( γ n )Note that the HBB corresponds to setting γ = 0 and that α is user-speciﬁed but we will leave γ as it is fornow. So the joint posterior is p ( π , π , . . . π K , π | α, γ, W, V ) ∝ (cid:110) K (cid:89) v =1 { (cid:89) i ∈ S v P v ( W i | π v ) } p ( π v | π, α ) (cid:111) p ( π | γ ) ∝ (cid:110) { K (cid:89) v =1 n (cid:89) i =1 ( π vi ) δ v ( V i ) } Γ( (cid:80) ni =1 απ i ) (cid:81) ni =1 Γ( απ i ) n (cid:89) i =1 ( π vi ) απ i − (cid:111) p ( π | γ ) ∝ (cid:110) K (cid:89) v =1 Γ( (cid:80) ni =1 απ i ) (cid:81) ni =1 Γ( απ i ) n (cid:89) i =1 ( π vi ) απ i + δ v ( V i ) − (cid:111) p ( π | γ ) (B.1)The second line follows from simply re-expressing (cid:81) i ∈ S v π vi = (cid:81) ni =1 ( π vi ) δ v ( V i ) . That is, only subjects in stratum v contribute information for π v to the likelihood. The third line follows from combining the exponents of π vi .The objective is to sample the π v . To do this, we sample from the joint and simply ignore draws of π . Notethat the joint can be expressed as a marginal posterior for π and independent conditional posteriors for π v p ( π , π , . . . π K , π | α, γ, W, V ) = { K (cid:89) v =1 p ( π v | π, α, γ, W, V ) } p ( π | α, γ, W, V )Thus to sample from the joint, we ﬁrst sample π from the marginal posterior. Then conditional on π , we cansample the π v independently. These are exactly Step 1 and 2, respectively, in the algorithm of Section 3.1.We now derive this marginal posterior and then turn to the conditional posteriors of π v . To get the marginal,integrate out each of the π v in (B.1) p ( π | α, γ, W, V ) ∝ (cid:110) K (cid:89) v =1 (cid:90) Π v Γ( (cid:80) ni =1 απ i ) (cid:81) ni =1 Γ( απ i ) n (cid:89) i =1 ( π vi ) απ i + δ v ( V i ) − dπ v (cid:111) p ( π | γ ) ∝ (cid:110) K (cid:89) v =1 Γ( (cid:80) ni =1 απ i ) (cid:81) ni =1 Γ( απ i ) (cid:90) Π v n (cid:89) i =1 ( π vi ) απ i + δ v ( V i ) − dπ v (cid:111) p ( π | γ ) ∝ (cid:110) K (cid:89) v =1 Γ( α ) (cid:81) ni =1 Γ( απ i ) (cid:81) ni ∈ S v Γ( απ i + 1) (cid:81) ni/ ∈ S v Γ( απ i )Γ( α + n v ) (cid:111) p ( π | γ )Above, Π v is the n -dimensional simplex we integrate over. The third line follows because the integral is overthe kernel of a Dirichlet distribution, with concentration parameter vector απ i + δ v ( V i ) and recognizing that (cid:80) ni =1 απ i = α since π i sum to 1. Thus, (cid:90) Π v n (cid:89) i =1 ( π vi ) απ i + δ v ( V i ) − dπ v = (cid:81) ni =1 Γ( απ i + δ v ( V i ))Γ( (cid:80) ni =1 απ i + δ v ( V i )) = (cid:81) ni ∈ S v Γ( απ i + 1) (cid:81) ni/ ∈ S v Γ( απ i )Γ( α + n v )20ontinuing the derivation, we cancel like terms from the numerator and denominators and note that Γ( απ i +1) = απ i Γ( απ i ). Therefore, Γ( απ i +1)Γ( απ i ) = απ i and we have p ( π | α, γ, W, V ) ∝ (cid:110) K (cid:89) v =1 Γ( α ) (cid:81) ni ∈ S v Γ( απ i ) (cid:81) ni ∈ S v Γ( απ i + 1)Γ( α + n v ) (cid:111) p ( π | γ ) ∝ (cid:110) K (cid:89) v =1 Γ( α )Γ( α + n v ) α n v (cid:89) i ∈ S v π i (cid:111) p ( π | γ ) ∝ (cid:110) K (cid:89) v =1 Γ( α ) α n v Γ( α + n v ) (cid:111) K (cid:89) v =1 (cid:89) i ∈ S v π i (cid:111) p ( π | γ ) ∝ (cid:110) K (cid:89) v =1 Γ( α ) α n v Γ( α + n v ) (cid:111) ( n (cid:89) i =1 π i ) p ( π | γ )Now, note that in the last line the term in brackets is constant with respect to π , so we can eliminate it andmaintain proportionality. Then, substituting the prior p ( π | γ = 0) = Dir (0 n ) ∝ (cid:81) ni =1 π − i , p ( π | α, γ, W, V ) ∝ ( n (cid:89) i =1 π i ) n (cid:89) i =1 π − i ∝ n (cid:89) i =1 π − i This is the kernel of

Dir (1 n ) - the posterior of Rubin’s bootstrap. Thus, to draw from this marginal posterior,we can draw π ∼ Dir (1 n ). This is the distribution we sample from in Step 1 of the algorithm in Section 3.1.Now, the conditional posterior of each π v conditional on π is much simpler. Just absorb all terms notinvolving π vi in (B.1) into the proportionality constant and we have p ( π v | π, α, γ, W, V ) ∝ n (cid:89) i =1 ( π vi ) απ i + δ v ( V i ) − Which is proportional to a π v ∼ Dir (cid:16) απ + δ v ( V ) , απ + δ v ( V ) , . . . , απ n + δ v ( V n ) (cid:17) . This is the distributionwe sample from in Step 2 of the algorithm in Section 3.1. C Simulation Details

Here we provide details for the simulation study in Section 4. In each setting, we simulate 1000 data sets with n = 300 subjects as follows. For i = 1 , . . . , V i ∼ M ultinom (1; 410 , , ,

110 )The parameter vectors gives the probability of assignment to stratum 1, 2, 3, and 4, respectively.2. Simulate 10-dimensional confounder vector W i = ( W pi ) p =1:10 , W i | V i = v ∼ p ( W | V = v )The form of p ( W | V = v ) varies with simulation setting and is speciﬁed below.21. Simulate treatment assignment, A i , from Bernoulli with probability P ( A = 1 | W i , V i = v ) = expit ( η v + W (cid:48) i β )4. Simulate binary outcome, Y i , from a Bernoulli with probability P ( Y = 1 | W i , V i = v ) = expit ( − γ v + W (cid:48) i θ + α v A i )Note in the above that W i impacts both treatment assignment (via β ) and outcome (via θ ) - so it is aconfounder. Similarly, V i impacts both treatment assignment (via η v ) and outcome (via γ v ). Note that theconditional treatment eﬀect, α v , varies across stratum - so this is a complex scenario with treatment eﬀectheterogeneity across strata. This yields a simulated data set { Y i , A i , W i , V i } i =1: n . We simulate 1000 such datasets across four settings.The covariate distribution p ( W | V ) has a diﬀerent family governed by diﬀerent parameters in each of thefour settings, 1 − W pi ∼ N (0 ,

1) for all V = v .2. W pi | V = v ∼ N ( µ v ,

1) where µ v ∈ {− , , , } for v = 1 , . . .

4, respecting order. Marginal of V , thedistribution of W is a location mixture of normals.3. W pi | V = v ∼ Ber ( p v ) where p v ∈ { . , . , . , . } for v = 1 , . . .

4, respecting order.4. W pi | V = v ∼ Gam ( shape = τ v , rate = ). Here τ v ∈ { , , , } for v = 1 , . . .

4, respecting order.All settings share these simulation parameters: • Set β = θ = (1 , − , , − , , − , , − , , − • Set η v ∈ (0 , − . , . , .

5) for v = 1 , . . . , • γ v ∈ ( − . , − . , . , .

5) for v = 1 , . . . , • α v ∈ (1 , − . , , .

5) for v = 1 , . . . , P ( Y | A, W, V = v ) = expit (cid:16) ω + ω v + W (cid:48) ω W + ω ∗ v A (cid:17) Normal priors with mean zero and standard deviation 3 were placed on each parameter. We obtain M = 5000posterior samples { ω , ω ( m )1 , . . . , ω ( m )4 , ω ( m ) W , ω ∗ ( m )1 , . . . , ω ∗ ( m )4 } m =1: M after discarding the ﬁrst 5000 draws asburn-in. Sampling was done via hamiltonian monte carlo as implemented in Stan. These samples werecombined with HBB as described in Section 3.1. D Data Analysis Details

Here we provide additional details about the data analysis in the main text. In the parametric Poisson model,we include the following covariates for each stratum except gynecological cancer.22 treatment: binary with one indicating proton. • race: categorical with levels white, black, and other. • sex: binary with one indicating male. • insurance: categorical with levels medicare, private, and other. • body-mass index: normalized. • age: normalized • charlson index: logged.For gynecological cancer, there is no need to adjust for sex. We specify N (0 ,

1) priors on all covariates exceptin the following instances: in the models for E/G, brain, anal, and rectum, we use tighter N (0 , .

1) priorson the other race coeﬃcient. Similarly, for the P/D/H model we use a N (0 , .

1) prior on other insurance.The tight priors are to regularize coeﬃcients that explode due too little variation in insurance status or racein a particular stratum. Non-bayesian analyses typically omit such variables (equivalent to a prior that thecoeﬃcient is exactly 0), but we choose to include them with a tight prior around 0 as a compromise. Note thatthe N (0 ,

1) prior may seem overly informative, but on the log scale it is quite ﬂat. It puts suﬃcient volume atincident rate ratios within exp( ± .

96) or within ( . , . f v under particular treatmentswere obtained using the BayesTree R package. We retain 1000 posterior draws for inference after discardingthe ﬁrst 1000 as burn-in. For the BART hyperpriors, we increase the power parameter from the default of 2to 3. This is to favors more shallow trees which provides more regularization. After draws of f vv