Bayesian causal inference for count potential outcomes
Young Lee
Harvard University
Wicher P. Bergsma
London School of Economics
Marie-Abèle Bind
Harvard University
August 10, 2020
Abstract
The literature for count modeling provides useful tools to conduct causal inference when outcomes take non-negative integer values. Applied to the potential outcomes framework, we link the Bayesian causal inference literature to statistical models for count data. We discuss the general architectural considerations for constructing the predictive posterior of the missing potential outcomes. Special considerations for estimating average treatment effects are discussed, some generalizing certain relationships and some not yet encountered in the causal inference literature.
Statistical analyses with non-negative integer outcomes are encountered in many fields: for example, studies examining risk factors for seizure counts (Burneo, 2008), numbers of deaths (Baccini et al., 2017), counts of COVID-19 deaths (Wu et al., 2020), and numbers of drinks over a period of time (Horton et al., 2007). These outcomes have a particular domain, the set of non-negative integer values, Z_+. There is an extensive literature on statistical models for count data. Among the contributions to count regression models, El-Sayyad (1973); Lawless (1987); Diggle et al. (1998); Winkelmann (2008); Chan and Vasconcelos (2012); Kim et al. (2013) study estimation procedures for Poisson and negative binomial regression paradigms. Breslow (1984) and Agresti (2007) introduced the lognormal-Poisson regression and its variants. These models have mostly been used to directly regress the observed outcomes on the observed treatment and background covariates, particularly in environmental epidemiology (Gasparrini et al., 2009; Zigler and Dominici, 2014; Schwartz et al., 2015), as opposed to causal modeling. Leveraging the existing statistical literature, we link the Bayesian causal inference literature with models for count data. The fundamental problem of causal inference is one of missing data, and specifically of missing potential outcomes. One approach to handle the missing data problem of causal inference is multiple imputation, i.e., to 'fill in' missing data with credible values (Rubin, 1978, 1987) by repeatedly sampling from the predictive posterior of the missing potential outcomes.

The idea of using a count distribution to model the potential outcomes is not new; Gutman et al. (2017) proposed a procedure for estimating the causal effects of a nursing home bed-hold policy and assumed a Poisson distribution for the potential outcomes. Sommer et al.
(2018) examined the causal effects of heat and rain on the number of crimes in Boston, assuming directly a negative binomial distribution for the missing potential outcomes. Furthermore, these examples draw Bayesian inferences, which can be computationally intensive for large data sets. If the count model yields a likelihood function that is computationally expensive, the Bayesian perspective can quickly become prohibitive. One Markov chain Monte Carlo (MCMC) method that is suited to imputation problems is the two-step data augmentation algorithm of Tanner and Wong (1987). However, for more realistic count models, the second step is typically intractable, and at least a two-step MCMC procedure is still required to draw missing values; confer, e.g., Chapter 8.3 of Gelman and Hill (2007). Several threads of research have been devoted to examining strategies for approximating draws in an imputation model, which include large sample approximations and importance sampling (Rubin, 1987). However, almost all these studies inherently assume continuity of the potential outcome distributions (Schafer, 1997; Schafer and Yucel, 2002). To the best of our knowledge, little or no work has appeared on approximation strategies for an imputation model with count potential outcomes.

In this paper, we initiate a framework for drawing causal inferences within a Bayesian setting when the potential outcomes are counts. Although it may be possible to draw causal inferences using continuous or binary potential outcomes (Rubin, 2006; Hill, 2011; Gutman and Rubin, 2012, 2015; Imbens and Rubin, 2015), the paradigms of the aforementioned models are not generally suitable for count potential outcomes.
Here, we posit that the potential outcomes inherit the statistical properties of counts (i.e., discrete non-negative integers) and can be characterized with distributions that account for overdispersion (e.g., the variance can be greater than the mean), which can occur in settings with heterogeneous units or dependence between events. We remark that our proposed framework is flexible and can easily be parameterized to exploit a wide range of existing properties of count models, and can thereby be adapted to each causal inference application. In addition to proposing a class of count potential outcomes, we add some new aspects to the existing theory by introducing approximation strategies for drawing Bayesian causal inferences. Our approximation operates orders of magnitude faster than exact Hamiltonian MCMC (HMC). We derive an asymptotic expression for the accuracy of the approximation in terms of the total variation distance to the true posterior. Simulations show good finite sample performance.

Determining the causal effects in this setting involves the following key steps: (i) posit a suitable family of distributions for the potential outcomes and derive the conditional distribution of the missing potential outcomes given the observed data; (ii) compute the posterior distribution of the parameters of the potential outcomes. The posterior distribution of Poisson parameters with Gaussian priors is intractable; however, we show that a normal approximation to the posterior can be given, and we derive an asymptotic rate of convergence of the approximation to the posterior in terms of the total variation distance; (iii) evaluate the conditional distribution of the missing data given the observed data; and finally, (iv) the estimand of causal interest is immediate.

The paper is structured in the following manner. In Section 2, we introduce our non-negative potential outcomes framework. In Section 3, we develop a Bayesian imputation model for the missing potential outcomes of count type.
Of particular interest is the characterization of the approximation that is used to compute the posterior distribution towards the evaluation of causal estimands. Section 4 presents some numerical illustrations and Section 5 concludes.

The concept of potential outcomes was launched in Splawa-Neyman et al. (1923) for Neymanian inference in randomized experiments and later used by other researchers, including Kempthorne (1955) and Wilk (1955), for causal inference from randomized experiments. The concept was extended by Rubin (1974, 1975, 1976, 1977, 1978) from randomized experiments to other forms of causal inference, including observational studies (Rubin, 1974).

Following Rubin (1978), W_i denotes the treatment assignment for the i-th unit, with i = 1, ..., N, where W_i = 1 indicates the active treatment and W_i = 0 the control treatment. We also denote the potential outcomes had unit i been assigned to the active and control treatment by Y_i(1) and Y_i(0), respectively. We define the unit-level causal effect of the binary treatment W_i using the difference between Y_i(1) and Y_i(0). We can never observe both Y_i(1) and Y_i(0) for any unit i, because it is not possible to go back in time and expose the i-th unit to the other treatment. This is called the 'fundamental problem of causal inference' (Rubin, 1975; Holland, 1986). Put differently, we are implicitly trying to figure out what would have happened to an individual had he/she taken the other treatment condition. Therefore, the problem of inferring unit-level causal effects is a missing data problem. The Bayesian framework permits the estimation of unit-level and average treatment effects (Rubin, 1978). In this article, we focus on the finite population average treatment effect, defined as

\mathrm{ATE} := \frac{1}{N} \sum_{i=1}^{N} (Y_i(1) - Y_i(0))    (1)

and we use the fact that

Y_i^{mis} = (1 - W_i) \cdot Y_i(1) + W_i \cdot Y_i(0).    (2)

Non-negative integer valued potential outcomes.
We assume full compliance and the stable unit treatment value assumption (SUTVA) (Rubin, 1980), which means that the potential outcome of a particular unit depends only on the treatment combination it is assigned, rather than on the assignments of the remaining units, as well as that there are no hidden versions of the treatments not represented by the values of W. Under this assumption, the potential outcome Y_i of each unit i in an experiment only depends on whether it receives the treatment (W_i = 1) or not (W_i = 0). The potential outcomes of all N units in an experiment can be partitioned into two vectors of N components: Y(0) for all outcomes under control and Y(1) for all outcomes under treatment. The unit-level potential outcomes can also be expressed as the observed and missing outcomes. Therefore, half of all the potential outcomes are observed, denoted by Y^{obs}; the other half are unobserved, denoted by Y^{mis}.

We propose a count model that allows for overdispersion:

Y_i(0) \mid \beta^{[c]}, \epsilon_i^{[c]} \sim \mathrm{Pois}(\mu_i^{[c]}), \qquad Y_i(1) \mid \beta^{[t]}, \epsilon_i^{[t]} \sim \mathrm{Pois}(\mu_i^{[t]})    (3)

where Pois denotes the Poisson distribution, and

\mu_i^{[c]} := \exp(x_i^\top \beta^{[c]}) \, \epsilon_i^{[c]}, \qquad \mu_i^{[t]} := \exp(x_i^\top \beta^{[t]}) \, \epsilon_i^{[t]}.    (4)

The quantity \epsilon_i^{[\cdot]} is taken to be a non-negative multiplicative random-effect term to model individual heterogeneity (Long, 1997; Gelman and Hill, 2007; Winkelmann, 2008; Cameron and Trivedi, 2013; Gelman et al., 2013), and its hyperparameters are denoted by \vartheta^{[\cdot]}. Let \epsilon_i := (\epsilon_i^{[c]}, \epsilon_i^{[t]})^\top, \epsilon := (\epsilon_1, \epsilon_2, \ldots, \epsilon_N)^\top, and \vartheta := (\vartheta^{[c]}, \vartheta^{[t]})^\top. The covariates x_i = (1, x_{i1}, \ldots, x_{ik}) are the (k+1)-dimensional features for every i, and let X = (x_1^\top, \ldots, x_N^\top)^\top. We also have \beta^{[c]}, \beta^{[t]} \in \mathbb{R}^{k+1}.
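The data-generating process in (3)-(4) can be sketched as follows; the dimensions, coefficient values, and lognormal scales below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the lognormal-Poisson potential outcomes in (3)-(4):
# multiplicative random effects eps model unit-level heterogeneity.
N, k = 500, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])  # x_i = (1, x_i1, ..., x_ik)
beta_c = np.array([1.0, 0.3, -0.2])   # beta^[c] (illustrative)
beta_t = np.array([1.5, 0.3, -0.2])   # beta^[t] (illustrative)
sigma_c, sigma_t = 0.3, 0.3           # lognormal scales for eps (illustrative)

eps_c = rng.lognormal(mean=0.0, sigma=sigma_c, size=N)
eps_t = rng.lognormal(mean=0.0, sigma=sigma_t, size=N)
mu_c = np.exp(X @ beta_c) * eps_c     # mu_i^[c] in (4)
mu_t = np.exp(X @ beta_t) * eps_t     # mu_i^[t] in (4)
Y0 = rng.poisson(mu_c)                # Y_i(0) in (3)
Y1 = rng.poisson(mu_t)                # Y_i(1) in (3)
```

Marginally over the lognormal random effect, the counts are overdispersed: the variance of each Y_i exceeds its mean.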
We further assume a prior distribution \pi(\beta) for \beta:

\beta := (\beta^{[c]\top}, \beta^{[t]\top})^\top \sim N\big(0_{2(k+1)}, \, \sigma_\beta^2 \cdot I_{2(k+1)}\big)    (5)

with \sigma_\beta^2 being a fixed positive number, N denoting the Gaussian distribution, and \beta \in \mathbb{R}^{2(k+1)}. In this paper, we examine two types of potential outcomes: (1) when \epsilon = 1, we have the Poisson potential outcomes that do not account for overdispersion in the count data; (2) we also work with a model that incorporates overdispersion, i.e., when \epsilon^{[c]} and \epsilon^{[t]} take lognormal priors. We call this the lognormal-Poisson potential outcomes, in concert with the terminology introduced in the context of the regression model (Breslow, 1984; Agresti, 2007). Specifically, we let \epsilon_i^{[c]} \sim \mathrm{logN}(0, (\sigma^{[c]})^2) and \epsilon_i^{[t]} \sim \mathrm{logN}(0, (\sigma^{[t]})^2). The hyperparameters for this model are (\sigma^{[c]}, \sigma^{[t]}) and we place the priors

(\sigma^{[c]})^2 \sim \mathrm{IG}(\alpha^{[c]}, \nu^{[c]}) \quad \text{and} \quad (\sigma^{[t]})^2 \sim \mathrm{IG}(\alpha^{[t]}, \nu^{[t]})    (6)

where the shorthand IG denotes the inverse gamma distribution, which has density \nu^\alpha / \Gamma(\alpha) \cdot (1/x)^{\alpha+1} \exp(-\nu/x) with parameters \alpha and \nu over the support x > 0.

We remark that other forms of potential outcomes can be constructed in a similar manner; for example, negative binomial potential outcomes can be obtained by placing Gamma priors on \epsilon_i^{[\cdot]} (Lawless, 1987; Hilbe, 2007). Another possible avenue is to attribute an inverse-Gaussian prior to \epsilon_i^{[\cdot]}, which would result in heavy-tailed count behavior (Dean et al., 1989).

The assignment mechanism.
Drawing inferences for causal effects requires the specification of an assignment mechanism, that is, a probabilistic model for how experimental units are allocated to the treatment combinations given the potential outcomes and covariates. Let W denote the N-vector of treatment assignments, with i-th element W_i. We define N_c := \sum_{i=1}^{N} (1 - W_i) and N_t := \sum_{i=1}^{N} W_i as the numbers of units assigned to the control and active treatment, respectively, with N_c + N_t = N. We assume a completely randomized assignment mechanism, so by definition, the assignment mechanism is given by

p(W = w \mid Y(0), Y(1), X, \beta, \epsilon, \vartheta) = \binom{N}{N_t}^{-1}    (7)

for all W such that \sum_{i=1}^{N} W_i = N_t, and 0 otherwise. Any other ignorable assignment mechanism (i.e., p(W \mid X, Y^{obs}, Y^{mis}) = p(W \mid X, Y^{obs}); Rubin, 1978) could also be assumed.

We estimate the ATE by explicitly imputing the missing potential outcomes in a repeated fashion to account for the uncertainty in the imputation. Let \hat{Y}_i^{mis} be the imputed value corresponding to Y_i^{mis}; then the ATE in (1) can be estimated as

\widehat{\mathrm{ATE}} = \frac{1}{N} \sum_{i=1}^{N} (2W_i - 1)\big(Y_i^{obs} - \hat{Y}_i^{mis}\big).    (8)

We work within the model-based Bayesian causal inference framework (Rubin, 1975, 1978) and evaluate a posterior predictive distribution for at least half of the missing potential outcomes. The fundamental idea is to initiate an imputation model for the missing potential outcomes Y^{mis}, conditionally on the observed outcomes Y^{obs} and the observed assignment vector W. As we outline, the predictive posterior of Y^{mis} can be computed using the HMC algorithm in a package such as rstanarm (Gelman and Hill, 2007; Goodrich et al., 2020). However, this approach can be computationally expensive for large N.
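A minimal sketch of the completely randomized assignment (7) and the estimator (8). Here the "imputed" values are oracle placeholders (the true counterfactuals), so the estimator recovers the sample ATE exactly; in practice the imputations come from the posterior predictive developed below.

```python
import numpy as np

rng = np.random.default_rng(1)

# Completely randomized assignment: exactly N_t of N units treated, as in (7).
N, N_t = 100, 50
W = np.zeros(N, dtype=int)
W[rng.choice(N, size=N_t, replace=False)] = 1

# Synthetic potential outcomes (illustrative rates).
Y0 = rng.poisson(3.0, size=N)
Y1 = rng.poisson(5.0, size=N)
Y_obs = np.where(W == 1, Y1, Y0)        # Y_i^obs
Y_mis_hat = np.where(W == 1, Y0, Y1)    # oracle stand-in for imputed Y_i^mis

# Estimator (8): (1/N) * sum_i (2*W_i - 1) * (Y_i^obs - Yhat_i^mis)
ate_hat = np.mean((2 * W - 1) * (Y_obs - Y_mis_hat))
```

With oracle imputations, each summand equals Y_i(1) - Y_i(0), so ate_hat coincides with the finite population ATE in (1).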
To remedy this problem, we develop an approximation algorithm in this section, which is an order of magnitude faster and gives a good approximation for moderately large Y_i^{obs}. We first present some necessary tools required to evaluate the posterior distribution of the parameters governing the distribution of the potential outcomes. This step is computationally intractable, so instead of computing the posterior distribution exactly, we propose an approximation. In Section 3.1, we present results by Bartlett and Kendall (1946) and El-Sayyad (1973), and extend these to Lemma 1 and Corollary 1, which we will use in the next subsection to obtain a new expression for the asymptotic total variation distance between our approximation and the true posterior. In Section 3.2, we derive the posterior distribution of the causal estimand in four steps, and in Section 3.3, we compute the ATE for Poisson and lognormal-Poisson potential outcomes.

The main purpose of this section is to derive Corollary 1, which will be useful in the derivation of the predictive posterior of Y^{mis}. First, we note that if X is a Gamma random variable with density function

\mathrm{Ga}(r, s): \quad \frac{1}{\Gamma(r) s^r} x^{r-1} \exp\Big(-\frac{x}{s}\Big), \quad x > 0,

then the transformation Y := \log X yields the log-Gamma distribution, which has density

\frac{1}{\Gamma(r) s^r} \exp\Big(ry - \frac{e^y}{s}\Big), \quad y \in \mathbb{R}.

Denoting the log-Gamma distribution as Y \sim \mathrm{logGamma}(r, s), it is well known that Y is approximately Gaussian distributed with mean \log(rs) and variance r^{-1} for large r (Bartlett and Kendall, 1946). Leveraging this result, El-Sayyad (1973) noted that the variables and parameters of the log-Gamma and Poisson distributions are related in the following sense:

f_{\mathrm{Poisson}}(y \mid \mu \equiv e^\xi) = \frac{e^{-\mu} \mu^y}{y!} = \frac{1}{y} \cdot \frac{e^{-e^\xi} (e^\xi)^y}{\Gamma(y)} = \frac{1}{y} \cdot f_{\mathrm{logGamma}}(\xi \mid y, 1) \overset{(*)}{\approx} \frac{1}{y} \cdot f_{\mathrm{Normal}}(\xi \mid \log y, y^{-1}).

We now derive the total variation distance between the normal and log-Gamma densities as y \to \infty.
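The quality of the approximation (*) can be probed numerically before deriving the distance analytically; the grid and the values of y in the following sketch are illustrative.

```python
import numpy as np
from math import lgamma

# Numerical check of (*): the logGamma(y, 1) density of xi is close to
# Normal(log y, 1/y) for large y.  TV(P, Q) = 0.5 * integral |p - q|.
def tv_distance(y, n_grid=20001):
    # grid of +/- 8 approximate standard deviations around log y
    xi = np.linspace(np.log(y) - 8 / np.sqrt(y), np.log(y) + 8 / np.sqrt(y), n_grid)
    f_lg = np.exp(y * xi - np.exp(xi) - lgamma(y))                  # logGamma(y, 1)
    f_n = np.sqrt(y / (2 * np.pi)) * np.exp(-0.5 * y * (xi - np.log(y)) ** 2)
    dx = xi[1] - xi[0]
    return 0.5 * np.sum(np.abs(f_lg - f_n)) * dx                    # trapezoid-free TV

tv_10, tv_1000 = tv_distance(10), tv_distance(1000)
```

The distance shrinks as y grows, at the y^{-1/2} rate established below.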
The Kullback-Leibler divergence between f and g, denoted by D(f \| g), is given by \int f(x) \log(f(x)/g(x)) \, dx. Direct calculation shows that

D\big(f_{\mathrm{Normal}}(\xi \mid \log y, y^{-1}) \,\|\, f_{\mathrm{logGamma}}(\xi \mid y, 1)\big) = \log\Big(\Gamma(y) \sqrt{\tfrac{y}{2\pi}}\Big) - y \log y + y e^{1/(2y)} - \tfrac{1}{2}    (9)

as well as D(f_{\mathrm{Normal}}(\xi \mid \log y, y^{-1}) \,\|\, f_{\mathrm{logGamma}}(\xi \mid y, 1)) \to 0 as y \to \infty. We present the following:

Lemma 1. It holds true that

D\big(f_{\mathrm{Normal}}(\xi \mid \log y, y^{-1}) \,\|\, f_{\mathrm{logGamma}}(\xi \mid y, 1)\big) = \frac{5}{24y} + O(y^{-2}).

Proof.
Stirling's formula gives

\Gamma(z) = \sqrt{\frac{2\pi}{z}} \Big(\frac{z}{e}\Big)^z \big(1 + O(z^{-1})\big).    (10)

The Gamma function in (9) is expanded through Stirling's series (Uhler, 1942; Arfken, 1985) via (10); using the fact that y(e^{1/(2y)} - 1) \to 1/2 as y \to \infty yields the result.

An alternative measure of deviation between probability measures is the total variation (TV) distance (Levin et al., 2006), defined for measures P and Q as

\mathrm{TV}(P \| Q) = \sup_E |P(E) - Q(E)|

where the supremum is over all possible events E. Pinsker's inequality states that \mathrm{TV}(P \| Q) \le \sqrt{D(P \| Q)/2}. Using a Taylor expansion for the square root function, Lemma 1 implies:

Corollary 1.
It holds true that

\mathrm{TV}\big(f_{\mathrm{Normal}}(\xi \mid \log y, y^{-1}) \,\|\, f_{\mathrm{logGamma}}(\xi \mid y, 1)\big) \le \sqrt{\frac{5}{48y}} + O(y^{-3/2}).

Proof.
A Taylor expansion around zero gives \sqrt{1 + O(t)} = 1 + O(t) as t \to 0. Hence we have that \sqrt{t + O(t^2)} = \sqrt{t} \sqrt{1 + O(t)} = \sqrt{t} + O(t^{3/2}) as t \to 0. Therefore, \sqrt{5/(48y) + O(y^{-2})} = \sqrt{5/(48y)} + O(y^{-3/2}) as y \to \infty, and the corollary immediately follows from Lemma 1 using Pinsker's inequality.

We now return to our main topic, namely estimating the finite population average treatment effect. We adapt our convergence results to the setting of the proposed count potential outcomes framework. The derivation of the posterior distribution of the average treatment effect entails the following steps:

• Step (i). Evaluate the conditional distribution of Y^{mis} given Y^{obs}, X, W, \beta, \epsilon, and \vartheta.
• Step (ii). Evaluate the conditional joint distribution of the parameters \beta, \epsilon, \vartheta given Y^{obs}, W, and X.
• Step (iii). Evaluate the distribution of Y^{mis} \mid Y^{obs}, W, X, which is the desired imputation distribution, by marginalizing the posterior predictive distribution in step (ii) over \epsilon, the parameter vector \beta, and the hyperparameter vector \vartheta.
• Step (iv). Finally, the estimated average treatment effect is computed.

Using (2), the conditional distribution of Y_i^{mis} given Y^{obs}, W, X, \beta, \epsilon, \vartheta can be expressed as:

Y_i^{mis} \mid Y^{obs}, W, X, \beta, \epsilon, \vartheta \sim \mathrm{Pois}(\mu_i^{mis})    (11)

where \mu_i^{mis} := \exp(\xi_i^{mis}) with

\xi_i^{mis} := (1 - W_i) \cdot \{x_i^\top \beta^{[t]} + \log \epsilon_i^{[t]}\} + W_i \cdot \{x_i^\top \beta^{[c]} + \log \epsilon_i^{[c]}\}.    (12)

We immediately see from (11) and (12) that the conditional distribution of Y_i^{mis} given Y^{obs}, W, X, \beta, \epsilon, \vartheta is simply equal to the marginal distribution of Y_i^{mis} given W, X, \beta, \epsilon, \vartheta, due to the independence of the potential outcomes.
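A short sketch of the imputation distribution (11)-(12): given the parameters, each missing outcome is Poisson with the "opposite arm" rate. All parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of (11)-(12): control units (W_i = 0) get the treated-arm rate for
# their missing outcome, and vice versa.  Values are illustrative.
N, k = 6, 1
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
W = np.array([0, 1, 0, 1, 0, 1])
beta_c, beta_t = np.array([1.0, 0.2]), np.array([1.4, 0.2])
log_eps_c = rng.normal(0.0, 0.3, size=N)
log_eps_t = rng.normal(0.0, 0.3, size=N)

# (12): xi_i^mis swaps the arms relative to the observed outcome
xi_mis = (1 - W) * (X @ beta_t + log_eps_t) + W * (X @ beta_c + log_eps_c)
Y_mis = rng.poisson(np.exp(xi_mis))   # (11): Poisson draw at rate mu_i^mis
```

Unit 1 (treated) thus has its missing outcome drawn at the control rate exp(x_1' beta^[c]) eps_1^[c].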
Furthermore, due to the binary nature of W_i, we have the following equivalent expressions:

W_i \cdot e^{x_i^\top \beta^{[c]}} \epsilon_i^{[c]} + (1 - W_i) \cdot e^{x_i^\top \beta^{[t]}} \epsilon_i^{[t]} = \exp\big((1 - W_i) \cdot \{x_i^\top \beta^{[t]} + \log \epsilon_i^{[t]}\} + W_i \cdot \{x_i^\top \beta^{[c]} + \log \epsilon_i^{[c]}\}\big).

In order to carry out Step (ii), we first need to compute the conditional distribution of Y^{obs} given X, \beta, \epsilon, and \vartheta. Using the fact that Y_i^{obs} = (1 - W_i) \cdot Y_i(0) + W_i \cdot Y_i(1), we see that

Y_i^{obs} \mid W, X, \beta, \epsilon, \vartheta \sim \mathrm{Pois}(\mu_i^{obs})    (13)

where \mu_i^{obs} := \exp(\xi_i^{obs}) with

\xi_i^{obs} = (1 - W_i) \cdot \{x_i^\top \beta^{[c]} + \log \epsilon_i^{[c]}\} + W_i \cdot \{x_i^\top \beta^{[t]} + \log \epsilon_i^{[t]}\}.

With \otimes denoting the Kronecker product, it is readily computed that

\xi_i^{obs} = \big([(1 - W_i) \;\; W_i] \otimes x_i^\top\big) \begin{pmatrix} \beta^{[c]} \\ \beta^{[t]} \end{pmatrix} + [(1 - W_i) \;\; W_i] \begin{pmatrix} \log \epsilon_i^{[c]} \\ \log \epsilon_i^{[t]} \end{pmatrix}.

Defining \tilde{W}_i^{obs} := [(1 - W_i) \;\; W_i], \tilde{x}_i^{obs} := \tilde{W}_i^{obs} \otimes x_i^\top, \tilde{\epsilon}_i := (\log \epsilon_i^{[c]}, \log \epsilon_i^{[t]})^\top, and \tilde{m}_i^{obs} := \tilde{W}_i^{obs} \tilde{\epsilon}_i, we can re-express \xi_i^{obs} succinctly as follows:

\xi_i^{obs} = \tilde{x}_i^{obs} \beta + \tilde{m}_i^{obs}.    (14)

Due to the choice of our count potential outcomes paradigm, an analytical expression for the posterior of \beta is not available, owing to the lack of conjugacy between the Gaussian priors and the Poisson likelihood. However, we now give a result for the approximation of \pi(\beta \mid Y^{obs}, W, X, \epsilon, \vartheta), the posterior distribution of \beta given Y^{obs}, the assignment vector W, X, \epsilon and the hyperparameters \vartheta. Some additional notation is warranted. Let r_i := \log Y_i^{obs} - \tilde{m}_i^{obs}, and define

\tilde{X}^{obs} := \begin{pmatrix} \tilde{x}_1^{obs} \\ \vdots \\ \tilde{x}_N^{obs} \end{pmatrix}, \quad \tilde{R}^{obs} := \begin{pmatrix} r_1 \\ \vdots \\ r_N \end{pmatrix}, \quad \Sigma_y^{obs} := \mathrm{diag}\big[(Y_1^{obs})^{-1}, \ldots, (Y_N^{obs})^{-1}\big].    (15)

Lemma 2.
It holds true that

\mathrm{TV}\big(p(\beta \mid Y^{obs}, W, X, \epsilon, \vartheta), \; N(\mu_\beta^\epsilon, \Sigma_\beta)\big) = O\Big(\sum_{i=1}^{N} (Y_i^{obs})^{-1/2}\Big)

where

\mu_\beta^\epsilon = \Sigma_\beta (\tilde{X}^{obs})^\top (\Sigma_y^{obs})^{-1} \tilde{R}^{obs}, \qquad \Sigma_\beta^{-1} = (\tilde{X}^{obs})^\top (\Sigma_y^{obs})^{-1} \tilde{X}^{obs} + \sigma_\beta^{-2} I_{2(k+1)}

with \Sigma_y^{obs} = \mathrm{diag}[(Y_1^{obs})^{-1}, \ldots, (Y_N^{obs})^{-1}].

Proof.
From Corollary 1, there exists a function \delta_i with \int |\delta_i(\xi)| \, d\xi = O(y_i^{-1/2}) such that

f_{\mathrm{logGamma}}(\xi_i \mid y_i, 1) = f_{\mathrm{Normal}}(\xi_i \mid \log y_i, y_i^{-1}) + \delta_i(\xi_i).

Hence,

\pi(\beta \mid Y^{obs}, W, X, \epsilon, \vartheta) \propto p(Y^{obs} \mid W, X, \beta, \epsilon, \vartheta) \, \pi(\beta)
\overset{*}{=} \prod_{i=1}^{N} \big[f_{\mathrm{Normal}}(\xi_i^{obs} \mid \log y_i, y_i^{-1}) + \delta_i'(\xi_i^{obs})\big] \, \pi(\beta)
= f_{\mathrm{Normal}}(\beta \mid \mu_\beta^\epsilon, \Sigma_\beta) + O\Big(\sum_{i=1}^{N} y_i^{-1/2}\Big)

where * follows from Corollary 1 and \delta_i' is a quantity such that \int |\delta_i'(\xi)| \, d\xi = O(y_i^{-1/2}). Hence, the TV distance between the posterior \pi(\beta \mid \cdot) and the f_{\mathrm{Normal}}(\cdot) term is of order O\big(\sum_{i=1}^{N} y_i^{-1/2}\big).

We turn our attention to the conditional distribution of \mu_i^{mis} given Y^{obs}, W, X, \epsilon, \vartheta. This is important to aid the computations for Step (iv). Let \tilde{x}_i^{mis} := [W_i \;\; (1 - W_i)] \otimes x_i^\top and \tilde{m}_i^{mis} := [W_i \;\; (1 - W_i)] \, \tilde{\epsilon}_i. Then, we may write \xi_i^{mis} = \tilde{x}_i^{mis} \beta + \tilde{m}_i^{mis} and establish that

\xi_i^{mis} \mid Y^{obs}, W, X, \epsilon, \vartheta \sim N\big(\tilde{x}_i^{mis} \mu_\beta^\epsilon + \tilde{m}_i^{mis}, \; \tilde{x}_i^{mis} \Sigma_\beta (\tilde{x}_i^{mis})^\top\big)
by Lemma 2. Hence, \xi_i^{mis} \mid Y^{obs}, W, X, \epsilon, \vartheta is approximately log-Gamma by Lemma 1, with parameters \big(\tilde{x}_i^{mis} \Sigma_\beta (\tilde{x}_i^{mis})^\top\big)^{-1} and \tilde{x}_i^{mis} \Sigma_\beta (\tilde{x}_i^{mis})^\top \exp(\tilde{x}_i^{mis} \mu_\beta^\epsilon + \tilde{m}_i^{mis}). Since \mu_i^{mis} = \exp(\xi_i^{mis}), we see that the conditional distribution of \mu_i^{mis} given Y^{obs}, W, X, \epsilon, \vartheta is approximately Gamma distributed (\mathrm{Ga}(\cdot, \cdot)), i.e.,

\mu_i^{mis} \mid Y^{obs}, W, X, \epsilon, \vartheta \sim \mathrm{Ga}(\gamma_i, h_i)    (16)

where

\gamma_i := \big(\tilde{x}_i^{mis} \Sigma_\beta (\tilde{x}_i^{mis})^\top\big)^{-1}, \qquad h_i := \tilde{x}_i^{mis} \Sigma_\beta (\tilde{x}_i^{mis})^\top \exp(\tilde{x}_i^{mis} \mu_\beta^\epsilon + \tilde{m}_i^{mis}).    (17)

Procedure 1: Sampling (\beta, \epsilon, \vartheta).
Input: N units of training samples of the form (Y_i^{obs}, x_i, W_i), where x_i are covariates, Y_i^{obs} is the observed outcome, and W_i is the treatment assignment.
1. Initialize a list S of samples as empty.
2. Randomly initialize (\beta, \epsilon, \vartheta) and append to S.
3. Repeat the following steps until the maximum number of iterations is reached:
3.1. Draw a random sample of \beta using Lemma 2, conditioned on the latest sample of \epsilon and \vartheta in the list S.
3.2. Draw a random sample of \epsilon using the density in equation (19), conditioned on the latest sample of \beta and \vartheta in the list S.
3.3. Draw a random sample of \vartheta from the density in equation (19), conditioned on the latest sample of \beta and \epsilon in the list S.
4. Discard the first m burn-in samples in S so that |S| = R, which we call S_R.
5. Return the samples of the joint posterior S_R.

We now outline an MCMC scheme to sample from the posterior predictive of Y^{mis} conditional on Y^{obs}, W and its covariates. We can express the following:

p(Y_i^{mis} = y \mid Y^{obs}, W, X) = \iiint p(Y_i^{mis} = y \mid Y^{obs}, W, X, \epsilon, \beta, \vartheta) \cdot p(\epsilon, \beta, \vartheta \mid Y^{obs}, W, X) \, d\epsilon \, d\beta \, d\vartheta.    (18)

The quantities \beta, \epsilon and \vartheta are sampled from the joint posterior, denoted by p(\epsilon, \beta, \vartheta \mid Y^{obs}, W, X), using a Gibbs sampler on their conditional posteriors with densities

p(\beta \mid \epsilon, \vartheta, Y^{obs}, W, X), \quad p(\epsilon \mid \beta, \vartheta, Y^{obs}, W, X) \quad \text{and} \quad p(\vartheta \mid \epsilon)    (19)

where p(\beta \mid \epsilon, \vartheta, Y^{obs}, W, X) is Gaussian from Lemma 2 and the posterior distribution of the hyperparameters p(\vartheta \mid \epsilon) is known. We further remark that p(\vartheta \mid \epsilon) does not depend on \beta, but depends implicitly on Y^{obs}, W and X through \epsilon. Depending on the choice of \epsilon, p(\epsilon \mid \beta, \vartheta, Y^{obs}, W, X) may be tractable, as we shall see for the lognormal-Poisson potential outcomes in the next subsection. A summary of the steps needed to sample p(Y_i^{mis} = y \mid Y^{obs}, W, X, \epsilon, \beta, \vartheta) is given in Procedure 1. Once \beta, \epsilon and \vartheta are drawn, use these values to further sample Y_i^{mis} from a Poisson law as given in (11).

The special case of \epsilon = 1. Next we turn to a discussion of the case \epsilon = 1, i.e., the case of Poisson potential outcomes.
We show that an MCMC sampler is not needed in this special case, since an analytical form for p(Y_i^{mis} = y \mid Y^{obs}, W, X, \epsilon, \vartheta) exists. In the interest of generality, an expression that includes \epsilon and \vartheta is presented, as it offers insights through the potential outcomes paradigm as to why this formulation only works for \epsilon = 1 and not for general \epsilon and \vartheta.

Note that the conditional density of Y_i^{mis} = y given Y^{obs}, W can be expressed as:

p(Y_i^{mis} = y \mid Y^{obs}, W, X) = \iint \psi_{\epsilon, \vartheta}(y) \, p(\epsilon, \vartheta \mid Y^{obs}, W, X) \, d\epsilon \, d\vartheta

where

\psi_{\epsilon, \vartheta}(y) = \binom{\gamma_i + y - 1}{y} \Big(\frac{1}{1 + h_i}\Big)^{\gamma_i} \Big(1 - \frac{1}{1 + h_i}\Big)^{y}    (20)

with \gamma_i and h_i given in equation (17). To see why this is true, first observe that

p(Y_i^{mis} = y \mid Y^{obs}, W, X) = \iiint p(Y_i^{mis} = y \mid Y^{obs}, W, X, \epsilon, \mu_i^{mis}, \vartheta) \cdot p(\mu_i^{mis} \mid \epsilon, \vartheta, Y^{obs}, W, X) \cdot p(\epsilon, \vartheta \mid Y^{obs}, W, X) \, d\epsilon \, d\mu_i^{mis} \, d\vartheta    (21)

and define the function \psi as the inner integral with respect to \mu_i^{mis}:

\psi_{\epsilon, \vartheta}(y) := \int p(Y_i^{mis} = y \mid \mu_i^{mis}, \epsilon, \vartheta, Y^{obs}, W, X) \cdot p(\mu_i^{mis} \mid \epsilon, \vartheta, Y^{obs}, W, X) \, d\mu_i^{mis}.    (22)

Then, note that the first and second terms of the integrand for \psi_{\epsilon, \vartheta}(y) are Poisson and Gamma, respectively. Hence, \psi_{\epsilon, \vartheta}(y) is the (conditional) density of a negative binomial distribution, as a result of the Poisson-Gamma mixture property, which yields equation (20). Equations (18) and (21) differ in that the former has \beta marginalized while the latter has \mu_i^{mis} integrated out. The latter expression is difficult to sample from due to the construction of our count potential outcomes. To elucidate this point, observe that p(\epsilon, \vartheta \mid Y^{obs}, W, X) in equation (21) can be decomposed into the product of two components:

p(\epsilon, \vartheta \mid Y^{obs}, W, X) = p(\epsilon \mid Y^{obs}, W, X, \vartheta) \cdot p(\vartheta \mid Y^{obs}, W, X).    (23)

The first and second terms on the right hand side of equation (23) can be re-expressed as

p(\epsilon \mid Y^{obs}, W, X, \vartheta) \propto p(Y^{obs} \mid W, X, \vartheta, \epsilon) \cdot p(\epsilon \mid \vartheta)    (24)

as well as

p(\vartheta \mid Y^{obs}, W, X) \propto p(Y^{obs} \mid W, X, \vartheta) \cdot p(\vartheta),    (25)

respectively. Since we can determine the distribution of the overdispersion p(\epsilon \mid \vartheta) and its hyperparameters p(\vartheta), this makes it easier to sample from.
However, we do not know, a priori, the conditional distributions p(Y^{obs} \mid W, X, \vartheta, \epsilon) and p(Y^{obs} \mid W, X, \vartheta) in the first terms of equations (24) and (25), respectively. What we do have is the conditional distribution of the observed outcomes Y^{obs} given W, X, \epsilon, \vartheta, whose explicit form is given in equation (13). Even though it is evident from the preceding discussion that the decomposition in equation (21) is prohibitive for drawing samples for general quantities of \beta and \epsilon, it is particularly useful for evaluating the conditional distribution of Y^{mis} given Y^{obs}, W and the covariates for a Poisson potential outcomes model, where an explicit expression is available. Hence, the conditional density of Y_i^{mis} = y given Y^{obs}, W, X can be explicitly expressed as:

p(Y_i^{mis} = y \mid Y^{obs}, W, X) = \binom{\gamma_i + y - 1}{y} \Big(\frac{1}{1 + h_i}\Big)^{\gamma_i} \Big(1 - \frac{1}{1 + h_i}\Big)^{y}    (26)

whose values of \gamma_i and h_i are given in equation (17).

We complete this section by carrying out Step (iv) to compute the ATE as in (1). As discussed previously, for general specifications of \epsilon and \vartheta, the distribution of the ATE may not have an analytical form. Here, we explain the sampling methods for evaluating it. The key elements are Steps (i) and (ii). Given a draw of \vartheta, \epsilon and \beta (drawn using Procedure 1), we substitute these values into the conditional distribution of Y^{mis} from Step (i) to impute, independently, all the missing potential outcomes; that is, we draw Y^{mis} from a Poisson distribution by substituting \vartheta, \epsilon, and \beta into equation (11). Substituting the observed and imputed missing potential outcomes into (8) yields an estimate of the ATE based on the r-th imputation:

\widehat{\mathrm{ATE}}^{(r)} = \frac{1}{N} \sum_{i=1}^{N} \big((2W_i - 1) Y_i^{obs} + (1 - 2W_i) \hat{Y}_i^{(r)}\big)    (27)

where \hat{Y}_i^{(r)} is the r-th imputed value of Y_i^{mis} and we have used the identity in equation (2). To derive the full posterior distribution of the ATE, this procedure is repeated R times, and the average and variance of the imputed estimators \widehat{\mathrm{ATE}}^{(1)}, \widehat{\mathrm{ATE}}^{(2)}, \ldots,
\widehat{\mathrm{ATE}}^{(R)} are

\frac{1}{R} \sum_{r=1}^{R} \widehat{\mathrm{ATE}}^{(r)} =: \overline{\mathrm{ATE}}, \qquad \frac{1}{R - 1} \sum_{r=1}^{R} \big(\widehat{\mathrm{ATE}}^{(r)} - \overline{\mathrm{ATE}}\big)^2.    (28)

Procedure 2: Estimating the mean and variance of \widehat{\mathrm{ATE}}.
Input: A list S containing samples of (\beta, \epsilon, \vartheta) from the joint posterior, and N training examples of the form (Y_i^{obs}, x_i, W_i), where x_i are covariates, Y_i^{obs} is the observed outcome, and W_i is the treatment assignment.
1. Initialize an empty list T_R of length R.
2. For each r-th triplet (\beta, \epsilon, \vartheta) in the list S returned from Procedure 1:
2.1. Initialize an empty list L_N^{mis} of length N to store Y^{mis}.
2.2. For each i \in \{1, 2, \ldots, N\}:
2.2.1. Sample Y_i^{mis} \mid Y^{obs}, W, X, \beta, \epsilon, \vartheta using equation (11).
2.2.2. Append the above Y_i^{mis} sample to L_N^{mis}.
2.3. Compute \widehat{\mathrm{ATE}}^{(r)} using equation (27) and L_N^{mis}.
2.4. Append \widehat{\mathrm{ATE}}^{(r)} to T_R.
3. Compute the empirical mean and variance of \widehat{\mathrm{ATE}} from the list T_R using equation (28).
4. Return the empirical mean and variance of \widehat{\mathrm{ATE}}.

Procedure 2 summarizes the preceding discussion on sampling methods to compute average treatment effects in our paradigm.
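Procedure 2 can be sketched as follows for the Poisson case (\epsilon = 1); here the "posterior draws" are synthetic placeholders centred at the true coefficients rather than output of Procedure 1, and all dimensions and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sketch of Procedure 2: for each posterior draw of beta, impute Y^mis
# via (11), form the per-draw estimate (27), then pool with (28).
N, k, R = 200, 1, 50
X = np.column_stack([np.ones(N), rng.normal(size=(N, k))])
W = rng.permutation(np.repeat([0, 1], N // 2))            # completely randomized
beta_c, beta_t = np.array([1.0, 0.2]), np.array([1.3, 0.2])
Y_obs = rng.poisson(np.exp(np.where(W == 1, X @ beta_t, X @ beta_c)))

ate_draws = []
for _ in range(R):
    # stand-in posterior draws centred at the truth (not Procedure 1 output)
    b_c = beta_c + 0.05 * rng.normal(size=k + 1)
    b_t = beta_t + 0.05 * rng.normal(size=k + 1)
    xi_mis = np.where(W == 0, X @ b_t, X @ b_c)           # (12) with eps = 1
    Y_mis = rng.poisson(np.exp(xi_mis))                   # step 2.2: impute via (11)
    # step 2.3: per-draw estimate (27)
    ate_draws.append(np.mean((2 * W - 1) * Y_obs + (1 - 2 * W) * Y_mis))

ate_mean = np.mean(ate_draws)                             # pooled mean, (28)
ate_var = np.var(ate_draws, ddof=1)                       # pooled variance, (28)
```

The between-imputation variance in (28) propagates both the posterior uncertainty in the parameters and the Poisson sampling variability of the imputations.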
In the previous section, we developed a Bayesian imputation model for the missing potential outcomes, with an approximation architecture to bypass the non-conjugacy between the Gaussian priors and the Poisson likelihood. The result is a general strategy for drawing samples from the predictive posterior of Y^{mis}, which is then used to compute the finite population average treatment effect. In this section, we illustrate this with two examples: one which exhibits a closed form expression for the ATE (Poisson potential outcomes), and another for which the ATE can be found with an efficient simulation method due to tractable marginal posteriors (lognormal-Poisson potential outcomes).

Poisson potential outcomes.
We set \epsilon \equiv 1. Since there are no hyperparameters, the resulting conditional distribution of Y_i^{mis} given Y^{obs}, W and the covariates X is given by equation (26). The exact distribution of the finite population average treatment effect is given in terms of its conditional mean and variance as follows:

\frac{1}{N} \sum_{i=1}^{N} (2W_i - 1)\big(Y_i^{obs} - \gamma_i h_i\big), \qquad \frac{1}{N^2} \sum_{i=1}^{N} \gamma_i h_i (1 + h_i)    (29)

where \gamma_i and h_i are given in equation (17).

Lognormal-Poisson potential outcomes.
In this example, we take $\epsilon^{[c]} \sim \log\mathcal{N}(0, (\sigma^{[c]})^2)$ and $\epsilon^{[t]} \sim \log\mathcal{N}(0, (\sigma^{[t]})^2)$. The hyperparameters for this model are $\vartheta = ((\sigma^{[c]})^2, (\sigma^{[t]})^2)$ with priors $(\sigma^{[c]})^2 \sim \mathrm{IG}(\alpha^{[c]}, \gamma^{[c]})$ and $(\sigma^{[t]})^2 \sim \mathrm{IG}(\alpha^{[t]}, \gamma^{[t]})$ as in equation (6), described in Section 2 (repeated here for convenience). We now spell out the marginal posteriors in (19). It is readily computed that the posterior of $\epsilon$ is given by

$$p(\epsilon_i^{[c]} \mid Y^{\mathrm{obs}}, W, X, \beta, \vartheta) = \log\mathcal{N}\big(\epsilon_i^{[c]} \mid m_i^{[c]}, (s_i^{[c]})^2\big), \quad (s_i^{[c]})^2 = \big((1 - W_i)\,Y_i^{\mathrm{obs}} + (\sigma^{[c]})^{-2}\big)^{-1},$$
$$m_i^{[c]} = -(1 - W_i)\,Y_i^{\mathrm{obs}}\,\big(\tilde{x}_i^{\mathrm{obs}}\beta - \log Y_i^{\mathrm{obs}}\big)\,(s_i^{[c]})^2 \tag{30}$$

and

$$p(\epsilon_i^{[t]} \mid Y^{\mathrm{obs}}, W, \beta, \vartheta) = \log\mathcal{N}\big(\epsilon_i^{[t]} \mid m_i^{[t]}, (s_i^{[t]})^2\big), \quad (s_i^{[t]})^2 = \big(W_i\,Y_i^{\mathrm{obs}} + (\sigma^{[t]})^{-2}\big)^{-1}, \tag{31}$$
$$m_i^{[t]} = -W_i\,Y_i^{\mathrm{obs}}\,\big(\tilde{x}_i^{\mathrm{obs}}\beta - \log Y_i^{\mathrm{obs}}\big)\,(s_i^{[t]})^2,$$

where $\tilde{x}_i^{\mathrm{obs}}\beta$ is defined in (14). Finally, the posteriors of the hyperparameters are given by

$$p\big((\sigma^{[c]})^2 \mid \epsilon^{[c]}\big) = \mathrm{IG}\big((\sigma^{[c]})^2 \mid \tilde{\alpha}^{[c]}, \tilde{\gamma}^{[c]}\big), \quad \tilde{\alpha}^{[c]} = \alpha^{[c]} + \tfrac{1}{2}N, \quad \tilde{\gamma}^{[c]} = \gamma^{[c]} + \tfrac{1}{2}\sum_{i=1}^{N}\big(\log \epsilon_i^{[c]}\big)^2 \tag{32}$$

and

$$p\big((\sigma^{[t]})^2 \mid \epsilon^{[t]}\big) = \mathrm{IG}\big((\sigma^{[t]})^2 \mid \tilde{\alpha}^{[t]}, \tilde{\gamma}^{[t]}\big), \quad \tilde{\alpha}^{[t]} = \alpha^{[t]} + \tfrac{1}{2}N, \quad \tilde{\gamma}^{[t]} = \gamma^{[t]} + \tfrac{1}{2}\sum_{i=1}^{N}\big(\log \epsilon_i^{[t]}\big)^2. \tag{33}$$

The closed form for the marginal posterior of $\beta$ is guaranteed by Lemma 2. Moreover, the tractability of the marginal posteriors of the overdispersion $\epsilon$ and its hyperparameters $\vartheta$ is appealing in this setting, as the average treatment effect can then be computed efficiently using the steps outlined in Procedure 2.

We first assess the newly proposed approximation in a controlled setting.
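Before turning to the simulations, the tractable updates in equations (30)–(33) can be sketched in Python. This is an illustrative sketch under our own conventions, not the paper's code: `xb` stands for $\tilde{x}_i^{\mathrm{obs}}\beta$, the inverse-gamma draw is obtained as the reciprocal of a gamma draw, and the guard for $Y_i^{\mathrm{obs}} = 0$ is our addition (the $(1-W_i)Y_i^{\mathrm{obs}}$ factor vanishes there anyway).

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_eps_control(y_obs, W, xb, sigma_c2):
    """One draw of eps_i^[c] from its log-normal posterior, equation (30).
    xb stands for x~_i^obs * beta; units with W_i = 1 revert to the prior term."""
    s2 = 1.0 / ((1 - W) * y_obs + 1.0 / sigma_c2)            # (s_i^[c])^2
    # log(Y_obs) guarded for Y_obs = 0; the (1 - W_i) Y_i^obs factor is 0 there anyway
    m = -(1 - W) * y_obs * (xb - np.log(np.maximum(y_obs, 1))) * s2
    return np.exp(rng.normal(m, np.sqrt(s2)))                # log-normal: exp of a normal draw

def draw_sigma2(eps, alpha, gamma):
    """One draw of (sigma^[c])^2 from its inverse-gamma posterior, equation (32)."""
    n = len(eps)
    alpha_post = alpha + 0.5 * n                             # alpha-tilde
    gamma_post = gamma + 0.5 * np.sum(np.log(eps) ** 2)      # gamma-tilde
    return 1.0 / rng.gamma(alpha_post, 1.0 / gamma_post)     # IG draw via reciprocal Gamma
```

The treated-arm updates (31) and (33) are identical with $W_i$ in place of $1 - W_i$.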
Specifically, we illustrate how the sample size, the type of potential outcomes, and the model complexity affect the error of our approximation method. We compare the ATE estimated by an exact method with no approximation (HMC-Exact) against that of our proposed approximation method (Our-Approx). Working within the potential outcomes model in (3), our proposed approximation to estimate the finite-population average treatment effect uses Lemma 2 as well as Procedures 1 and 2. The exact method does not use any such approximations; instead, we draw exact samples from the posterior distributions using an MCMC scheme implemented in rstanarm (Gelman and Hill, 2007; Goodrich et al., 2020). We consider two types of potential outcomes: ones that do not account for overdispersion (Poisson, $\epsilon \equiv 1$) and ones that do (lognormal-Poisson, $\epsilon^{[c]} = \epsilon^{[t]} \sim \log\mathcal{N}(0, \sigma^2)$).
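For the Poisson case, the exact conditional moments of the ATE in equation (29) admit a direct implementation. In the sketch below, $\gamma_i$ and $h_i$ from equation (17) are simply passed in as arrays, and the variance term follows our reconstruction of equation (29):

```python
import numpy as np

def poisson_ate_moments(y_obs, W, gamma, h):
    """Conditional mean and variance of the finite-population ATE for
    Poisson potential outcomes, following equation (29). gamma_i and h_i
    are the quantities from equation (17), supplied here as arrays."""
    n = len(y_obs)
    mean = np.sum((2 * W - 1) * (y_obs - gamma)) / n
    var = np.sum(gamma * (1 - h) / h) / n ** 2   # variance term as reconstructed from (29)
    return mean, var
```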
Figure 1: MAE of ATE estimation with Poisson potential outcomes (overdispersion parameter $\epsilon \equiv 1$) and lognormal-Poisson potential outcomes (overdispersion parameter $\epsilon \sim \log\mathcal{N}(0, \sigma^2)$) for the Simple and Complex Models in equation (34). A comparison between our approximation (Our-Approx) and exact inference (HMC-Exact) is presented. Panels: (a) Poisson potential outcomes, Simple Model; (b) Poisson potential outcomes, Complex Model; (c) lognormal-Poisson potential outcomes, Simple Model; (d) lognormal-Poisson potential outcomes, Complex Model.
Setup.
We describe our simulation study in terms of the number of units $N$, where for each $N$ we compute the mean absolute error (MAE) of the ATE defined in (1) for both the Poisson and the lognormal-Poisson potential outcomes. Precisely, we evaluate the MAE, defined as $\|\mathrm{ATE}_{\text{Our-Approx}} - \mathrm{ATE}^{*}\|$ for each $N$, where $\mathrm{ATE}^{*}$ denotes the true ATE. We further set the number of treated and control units to $N_t = N_c = N/2$.

In evaluating the performance of our approximation on the ATE, we generate $N$ samples from the potential outcomes model with known $N_t = N_c$ and true parameters $\beta^{*} = (\beta^{[t]*}, \beta^{[c]*})^{\top}$ for each $\epsilon$. From these samples, we compute a parameter estimate $\hat{\beta}_{\text{Our-Approx}}$ using Procedure 1. To aid comparison, we further draw samples from the posterior of $\beta$ using an exact scheme in rstanarm to obtain the distribution of $\hat{\beta}_{\text{HMC-Exact}}$. A similar set of estimates is obtained for the hyperparameters $\vartheta^{*} = (\vartheta^{[t]*}, \vartheta^{[c]*})$. Given the inferred parameters $\hat{\beta}_{\text{HMC-Exact}}$ and $\hat{\beta}_{\text{Our-Approx}}$, as well as $\hat{\vartheta}_{\text{HMC-Exact}}$ and $\hat{\vartheta}_{\text{Our-Approx}}$, we compute the corresponding ATE for the two types of potential outcomes.

We consider two simulation models, which we term the "Simple Model" and the "Complex Model", whose true parameters $\beta^{*}$ are varied for each $\epsilon$ under consideration:

$$\text{Simple Model:} \quad \mu_i^{[c]} = \exp(3.\ldots + \ldots\, x_i)\,\epsilon_i^{[c]}, \qquad \mu_i^{[t]} = \exp(3.\ldots + \ldots\, x_i)\,\epsilon_i^{[t]},$$
$$\text{Complex Model:} \quad \mu_i^{[c]} = \exp(3.\ldots + \ldots\, x_{i1} + 0.\ldots\, x_{i2} + 1.\ldots\, x_{i3} + 0.\ldots\, x_{i4} + 0.\ldots\, x_{i5})\,\epsilon_i^{[c]},$$
$$\mu_i^{[t]} = \exp(3.\ldots + \ldots\, x_{i1} + 0.\ldots\, x_{i2} + 1.\ldots\, x_{i3} + 0.\ldots\, x_{i4} + 0.\ldots\, x_{i5})\,\epsilon_i^{[t]}. \tag{34}$$

The quantities $x_{id}$ are drawn independently from a uniform distribution on the interval $[-\ldots, \ldots]$ for each individual $i \in \{1, \ldots, N\}$ and dimension $d \in \{1, \ldots, 5\}$.

With the Poisson potential outcomes, the ATE has a closed-form solution (29). For the lognormal-Poisson case, we use the closed-form expressions for the posteriors of the hyperparameters in equations (32)–(33) and repeat this $R = 1000$ times, as explicated in Procedures 1 and 2, to obtain the ATE.

In addition, we offer some insight into the behavior and frequency of occurrence of the count potential outcomes had we used a continuous potential outcomes framework to model their count counterparts. To that end, using the simulated count potential outcomes of the Complex Model, we first use the identity $Y_i^{\mathrm{obs}} = (1 - W_i)\cdot Y_i(0) + W_i \cdot Y_i(1)$ and then use Bayesian Additive Regression Trees (Hill, 2011) to fit $Y_i^{\mathrm{obs}}$ against the covariates and $W_i$ to infer the parameters. With these, the potential outcomes $Y(0)$ and $Y(1)$ are simulated using the Bayesian Additive Regression Trees (confer the functions $f(W = 0, x)$ and $f(W = 1, x)$ in Section 3 of Hill (2011)). Results.
Figure 1 shows boxplots of the estimation error (MAE) of our proposed approximation versus the exact method. The estimates from our approximation have accuracy commensurate with exact inference across the various $N$: when our approximation is well specified, the MAEs it attains differ little from those of the exact method, and indeed recover the true ATE asymptotically.

N             HMC-Exact      Our-Approx
100,000       17 mins        0.1 mins
200,000       33 mins        0.2 mins
500,000       81 mins        …
…             > … hours      5.0 mins
10,000,000    > … hours      10.0 mins

Table 1: Computational time in minutes (mins) to estimate the ATE for the Complex Model with Poisson potential outcomes; see equation (34) with $\epsilon \equiv 1$.

N             HMC-Exact      Our-Approx
10,000        75 mins        …
…             > … hours      11.0 mins
1,000,000     > … hours      20.0 mins

Table 2: Computational time in minutes (mins) to estimate the ATE for the Complex Model with lognormal-Poisson potential outcomes; see equation (34) with overdispersion parameter $\epsilon \sim \log\mathcal{N}(0, 0.\ldots)$.

One incontrovertible advantage of our proposed approximation is its speed: it is orders of magnitude faster than drawing inferences using exact methods. Tables 1 and 2 provide the computation time required to estimate the ATE for the Complex Model. Our method requires substantially less computational time (a more than 100-fold reduction) than using rstanarm (Gelman and Hill, 2007; Goodrich et al., 2020) to approximate the same posterior distribution and subsequently compute the ATE.

Finally, Figure 2 offers some illustrative diagnostics showing that a naive evaluation of the average treatment effect using a continuous potential outcomes model, when the outcomes are in fact counts, may lead to nonsensical answers.
As can be seen from the lower panel of Figure 2, some potential outcomes are negative, owing to the incompatibility of a continuous real-valued potential outcomes framework with count outcomes. One possible remedy is to truncate the negative potential outcomes at zero, but it is unclear how this rounding should be handled in a principled manner, and whether it distorts the posterior predictive distribution when the true potential outcomes are of count type.
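The diagnostic point can be reproduced in a few lines: fitting a Gaussian model to simulated counts yields a non-negligible fraction of negative predictive draws, and truncating them at zero shifts the predictive mean. The specific numbers here (a Poisson mean of 2, the sample sizes) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# True outcomes are counts, but we (wrongly) model them as Gaussian:
counts = rng.poisson(2.0, 1000)
mu, sd = counts.mean(), counts.std()
draws = rng.normal(mu, sd, 10_000)        # continuous-model predictive draws

frac_negative = np.mean(draws < 0)        # impossible values under any count model
clipped = np.clip(draws, 0.0, None)       # ad hoc truncation at zero
shift = clipped.mean() - draws.mean()     # truncation biases the predictive mean upward
```

The upward shift in the mean illustrates why ad hoc truncation is not an innocuous fix.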
The data that we use to illustrate the approximation methods developed here come from the well-known National Supported Work (NSW) randomized evaluation, in which the authors examined the effectiveness of a job training program (the treatment) on individual earnings.
Figure 2: Histograms of the potential outcomes inferred with a continuous potential outcomes framework using BART (Hill, 2011) when the true $(Y(0), Y(1))$ are of count type, drawn from the Complex Model in (34) with lognormal-Poisson potential outcomes (overdispersion parameter $\epsilon \sim \log\mathcal{N}(0, \sigma^2)$). Panel (a): histograms of $Y(0)$ and $Y(1)$ drawn from the Complex Model. Panel (b): histograms of $Y(0)$ and $Y(1)$ drawn from the continuous (BART) framework when the true potential outcomes are counts. Note that some of the realizations in panel (b) go negative due to the normality assumption of the continuous potential outcomes framework.

Matching is performed with the MatchIt package (Ho et al., 2011), and after matching there are … observations.

Unit    $Y_i(0)$    $Y_i(1)$    $W_i$    $Y_i^{\mathrm{obs,continuous}}$    $Y_i^{\mathrm{obs,categorized}}$
…       ?           …           …        …                                  …
…       …           ?           …        …                                  …

Table 3: A subset of the data from the NSW evaluation. Note that the last column is obtained by mapping a value of 1 to earnings (real earnings in 1978) between $0–$5,000, 2 for $5,000–$10,000, etc. The question marks represent missing potential outcomes.
Drawing causal inferences typically involves sensitive information that needs to be kept private (e.g., salary), so it may be preferable to assign integer values in place of exact earnings. To demonstrate the suitability of our method, we assign integer values to the duplicated wage data. The outcome of interest is post-program labour earnings in 1978 (re78). We assign a value of 1 to earnings between $0–$5,000, 2 for $5,000–$10,000, 3 for $10,000–$15,000, etc. The maximum integer value in this data set is 13, which corresponds to unit NSW132 with post-program labour earnings of $60,308. The column $Y_i^{\mathrm{obs,categorized}}$ in Table 3 illustrates an excerpt of the integer values mapped from the actual wage earnings; this final, categorized observed outcome is the one used in our analysis. The integer-valued data are chosen such that their distributional properties roughly depict those of a Poisson random variable, i.e., $\epsilon \equiv 1$ in equation (3), for both the treated and control groups. The empirical mean and variance are approximately equal to … and … for the control group and … and … for the treated group, respectively.

Group (g)    Salary ('000)    $\mathrm{ATE}_g$    $n_g$    Weights$_g$
1            [0; 5]           +0.38               263      0.…
2            [5; 10]          +0.18                99      0.…
3            [10; 15]         −0.07                52      0.…
4            [15; 20]         +0.23                18      0.…
5            [20; 25]         +0.54                 7      0.…
6            [25; 30]         +4.24                 2      0.…
7            [30; 35]         +5.23                 1      0.…
8            [35; 40]         −0.20                 2      0.…
9            [40; 45]         --                   --      --
10           [45; 50]         --                   --      --
11           [60; 65]         +9.38                 1      0.…

Table 4: Group-wise average treatment effects for data from the NSW evaluation using real wages of 1978. Note that there are no units in Groups 9 and 10.

Using Procedure 2 with $\sigma_\beta = 1000$, the posterior mean of the average treatment effect is 0.41, and the posterior standard deviation is …. This average treatment effect translates to approximately a salary increase of $2,050 (i.e., $0.41 \times \$5{,}000$). As a comparison, we compute the average treatment effect on this data set using model-based inference that assumes a bivariate normal distribution for the potential outcomes given covariates (confer, e.g., Chapter 8.7 in Imbens and Rubin (2015)). We report that the Bayesian average treatment effect for the salary increase is approximately $1,962 under that model. Although the modeling assumptions differ, this indicates that our framework behaves reasonably and has accuracy comparable to the standard causal model whose potential outcomes are modeled by a bivariate normal distribution. It is important to note that even though the average treatment effects obtained under the two sets of assumptions are of similar magnitude, the posterior distribution of $Y^{\mathrm{mis}}$ estimated from the bivariate normal potential outcomes model can in fact take negative values, as explicated in Section 4.1. With regard to the stability of the average treatment effects across groups relative to the posterior mean value of 0.41, we note that there is heterogeneity across the groups, with different group-wise average treatment effects.
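The earnings categorization used in this analysis can be sketched as follows; the handling of bin boundaries is our own convention, since the text does not specify which bin the endpoints fall into.

```python
def categorize_earnings(re78):
    """Map 1978 earnings (dollars) to the integer bins used in the analysis:
    1 for $0-$5,000, 2 for $5,000-$10,000, and so on (boundary convention ours)."""
    return int(re78 // 5000) + 1
```

For example, `categorize_earnings(60308)` returns 13, matching the maximum integer value reported for unit NSW132; conversely, an ATE on this scale is converted back to dollars by multiplying by the $5,000 bin width.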
In Table 4, the columns $n_g$ and Weights$_g$ represent the number of participants in each group and their corresponding weights, calculated as the ratio of $n_g$ to the total number of participants, respectively.

We perform another analysis that measures the different levels of incremental earnings between 1974 and 1978, to compare and contrast with the above-mentioned average treatment effects for the single year 1978. In this complementary study, we assign a value of 1 to an incremental earnings difference smaller than −$20,001, 2 for an incremental earnings difference between [−$20,000; −$17,500], 3 for [−$17,500; −$15,000], 4 for [−$15,000; −$12,500], and so on, until we reach the value of 18 for a positive incremental earnings difference greater than +$20,000. The posterior mean of the average treatment effect is 0.68, and the posterior interval is …. This average treatment effect translates to approximately a salary increase of $1,700 ($0.68 \times \$2{,}500$). The group-wise ATEs are summarized in Table 5.

Group (g)    Salary ('000)      Estimated $\mathrm{ATE}_g$    Number of units $n_g$    Weights$_g$
1            < −20.0            ….96                            6                      0.…
2            [−20.0; −17.5]     +9.88                           2                      0.…
3            [−17.5; −15.0]     +8.41                           2                      0.…
4            [−15.0; −12.5]     +2.63                           3                      0.…
5            [−12.5; −10.0]     −0.95                           5                      0.…
6            [−10.0; −7.5]      −0.28                           8                      0.…
7            [−7.5; −5.0]       +2.37                          12                      0.…
8            [−5.0; −2.5]       −0.29                          13                      0.…
9            [−2.5; 0.0]        +0.52                          17                      0.…
10           +[0.0; 2.5]        +0.56                         170                      0.…
11           +[2.5; 5.0]        +0.23                          57                      0.…
12           +[5.0; 7.5]        −0.05                          45                      0.…
13           +[7.5; 10.0]       +0.31                          41                      0.…
14           +[10.0; 12.5]      −0.06                          26                      0.…
15           +[12.5; 15.0]      −0.40                          14                      0.…
16           +[15.0; 17.5]      −0.32                           7                      0.…
17           +[17.5; 20.0]      +4.09                           6                      0.…
18           > +20.0            ….16                           11                      0.…

Table 5: Estimated group-wise average treatment effects for data from the NSW evaluation using the difference between the two wages in the years 1974 and 1978.

We define $c_i$ and $p_i$ to be the number of confirmed cases and the population size for county $i$, respectively. As mentioned earlier, we focus on those counties whose expected number of COVID-19 cases per 100,000 exceeds …. Furthermore, the county-wise crude death rate per capita $\mathrm{cdr}_i := 10^q \times d_i / p_i$ is computed, where $d_i$ denotes the death count for county $i$. Similarly, the crude number of confirmed cases per capita in county $i$ is computed via $\mathrm{ccr}_i := 10^q \times c_i / p_i$. Throughout our analysis, we take $q = 5$. Define the treatment indicator $W_i$ to take the value 1 if $\mathrm{ccr}_i \geq h$ (a large expected number of confirmed cases per 100,000) and 0 otherwise (a low magnitude). Here we set $h \equiv$ …, and we remark that this quantity can be varied accordingly. With this, we can associate to each county $i$ two potential outcomes: $Y_i(1)$, the crude death rate in county $i$ had the exposure in county $i$ been larger than $h$, and $Y_i(0)$ otherwise. The observed crude mortality rate in county $i$ is denoted by $Y_i^{\mathrm{obs}}$. We refer to the counties with $W_i = 1$ as 'infected counties' and to the counties with $W_i = 0$ as 'less-infected counties' during the study period. The sample size under consideration is …, with … treated and … control counties.

We consider the following covariates: poverty (poverty), education levels (education), percentage of owner-occupied housing (pct_owner_occ), percentage of Hispanic people (hispanic), percentage of Asian people (pct_asian), percentage of native population (pct_native), percentage of white people (pct_white), and percentage of black population (pct_blk).
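The construction of the crude rates $\mathrm{cdr}_i$, $\mathrm{ccr}_i$ and the exposure indicator $W_i$ can be sketched as follows; the cutoff `h` is left as an argument since its value is not legible in this copy of the text.

```python
import numpy as np

def county_exposure(cases, deaths, pop, h, q=5):
    """Crude case and death rates per 10^q population (ccr_i, cdr_i), and the
    exposure indicator W_i = 1{ccr_i >= h}; h is the user-chosen cutoff."""
    cases, deaths, pop = (np.asarray(a, dtype=float) for a in (cases, deaths, pop))
    ccr = 10 ** q * cases / pop      # crude confirmed-case rate per 10^q
    cdr = 10 ** q * deaths / pop     # crude death rate per 10^q
    W = (ccr >= h).astype(int)       # 'infected' vs 'less-infected' counties
    return ccr, cdr, W
```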
Details on these variables can be found in Wu et al. (2020). We perform a one-to-one matching strategy, imposing a tolerance level on the maximum estimated propensity score distance (caliper), to create groups of exposed and control counties with similar covariate distributions (Rosenbaum and Rubin, 1983). The Love plots (e.g., Ahmed et al. (2007)) shown in Figure 3 schematically display covariate balance before and after matching. We see from Figure 3(a) that the covariates are imbalanced, especially pct_white and pct_blk. The difference in means of the covariates is standardized using the expression $(\bar{\theta}_t - \bar{\theta}_c) / \sqrt{v_t / n_t + v_c / n_c}$, where $\bar{\theta}_t$ ($v_t$) and $\bar{\theta}_c$ ($v_c$) denote the average (variance) of the corresponding treated and control covariates, with $n_t$ and $n_c$ being the number of treated and control counties, respectively.

With the parameter settings $\sigma_\beta = 100$ and $\epsilon = 1$, the posterior mean of the average treatment effect is … deaths from COVID-19 per 100,000, and its posterior standard deviation is …. These results suggest that, on average, more deaths occur in counties with more cases ($\mathrm{ccr} > h$). Our choice of $\sigma_\beta = 100$ is intended to impose little or no structure through our model assumptions. We further note that $h$ can be varied to analyze different scenarios given the data, which at the present moment may still be limited and very incomplete.
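The standardized mean difference used for the Love plots can be computed as:

```python
import numpy as np

def standardized_mean_difference(x_t, x_c):
    """Standardized difference in means between treated (x_t) and control (x_c):
    (theta_t - theta_c) / sqrt(v_t / n_t + v_c / n_c)."""
    n_t, n_c = len(x_t), len(x_c)
    v_t, v_c = np.var(x_t, ddof=1), np.var(x_c, ddof=1)   # sample variances
    return (np.mean(x_t) - np.mean(x_c)) / np.sqrt(v_t / n_t + v_c / n_c)
```

Identical samples give a standardized difference of zero; a positive value indicates that the treated group's covariate mean exceeds the control group's.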
Figure 3: Covariate balance before (panel a) and after (panel b) matching: Love plots of the standardized mean differences for poverty, education, pct_white, pct_owner_occ, pct_native, pct_blk, pct_asian, and hispanic.
In this paper, we have proposed a framework for conducting Bayesian causal inference when the potential outcomes are counts. Under this paradigm, standard causal models that handle continuous or binary potential outcomes (e.g., Rubin (2006); Hill (2011); Gutman and Rubin (2012, 2015)) are not directly applicable, since the potential outcomes take non-negative integer values. Presented within the Rubin causal framework, we argued that imputing the missing count potential outcomes dominates the commonly used approach of directly regressing the count outcomes on the observed treatment and background covariates, by allowing flexibility in drawing inferences.

Our proposed framework can be extended to address more complex causal questions with count data. Some examples are: (i) principal stratification in non-compliance settings, (ii) potential count time series, (iii) zero-inflated potential outcomes, (iv) nonparametric modeling of covariates with count potential outcomes, and (v) multifactorial designs with count outcomes. We therefore envisage that this article will open up a discussion of causal problems whose data come in the form of counts, which arise in almost all areas of statistics, health, social, and physical sciences.

References

Agresti, A. (2007).
An introduction to categorical data analysis. Wiley-Blackwell.

Ahmed, A., Rich, M., Sanders, P., Perry, G., Bakris, G., Zile, M., Love, T., Aban, I., and Shlipak, M. (2007). Chronic kidney disease associated mortality in diastolic versus systolic heart failure: A propensity matched study. The American Journal of Cardiology, 99:393–8.

Arfken, G. (1985). Stirling Series, 10.3. Mathematical Methods for Physicists. Academic Press, Inc., third edition.

Baccini, M., Mattei, A., Mealli, F., Bertazzi, P. A., and Carugno, M. (2017). Assessing the short term impact of air pollution on mortality: a matching approach. Environmental Health, 16.

Bartlett, M. S. and Kendall, D. G. (1946). The statistical analysis of variance-heterogeneity and the logarithmic transformation. Supplement to the Journal of the Royal Statistical Society, 8(1):128–138.

Breslow, N. E. (1984). Extra-Poisson variation in log-linear models. Journal of the Royal Statistical Society: Series C, 33(1):38–44.

Brown, N. and Borter, G. (2020). Speed of coronavirus deaths shock doctors as New York toll hits new high. Reuters.

Burneo, J. G. (2008). The real truth behind seizure count. Epilepsy Currents, 8(4):92–93.

Cameron, A. and Trivedi, P. (2013). Regression Analysis of Count Data. Cambridge University Press.

Chan, A. B. and Vasconcelos, N. (2012). Counting people with low-level features and Bayesian regression. IEEE Transactions on Image Processing, 21:2160–2177.

Dean, C., Lawless, J. F., and Willmot, G. E. (1989). A mixed Poisson-inverse-Gaussian regression model. Canadian Journal of Statistics, 17(2):171–181.

Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062.

Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics. Journal of the Royal Statistical Society: Series C, 47(3):299–350.

El-Sayyad, G. (1973). Bayesian and classical analysis of Poisson regression. Journal of the Royal Statistical Society: Series B (Methodological), pages 445–451.

Gasparrini, A., Gorini, G., and Barchielli, A. (2009). On the relationship between smoking bans and incidence of acute myocardial infarction. European Journal of Epidemiology, 24:597–602.

Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Gelman, A., Stern, H. S., Carlin, J. B., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.

Goodrich, B., Gabry, J., Ali, I., and Brilleman, S. (2020). rstanarm: Bayesian applied regression modeling via Stan. R package version 2.19.3.

Gutman, R., Intrator, O., and Lancaster, T. (2017). A Bayesian procedure for estimating the causal effects of nursing home bed-hold policy. Biostatistics, 19(4):444–460.

Gutman, R. and Rubin, D. (2012). Analyses that inform policy decisions. Biometrics, 68(3):671–675.

Gutman, R. and Rubin, D. (2015). Estimation of causal effects of binary treatments in unconfounded studies. Statistics in Medicine, 34.

Hilbe, J. (2007). Negative Binomial Regression. Cambridge University Press.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.

Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2011). MatchIt: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, 42(8):1–28.

Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960.

Horton, N., Kim, E., and Saitz, R. (2007). A cautionary note regarding count models of alcohol consumption in randomized controlled trials. BMC Medical Research Methodology, 7:9.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, USA.

Kempthorne, O. (1955). The randomization theory of experimental inference. Journal of the American Statistical Association, 50(271):946–967.

Kim, S., Chen, Z., Zhang, Z., Simons-Morton, B. G., and Albert, P. S. (2013). Bayesian hierarchical Poisson regression models: An application to a driving study with kinematic events. Journal of the American Statistical Association, 108(502):494–503.

LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4):604–620.

Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. Canadian Journal of Statistics, 15(3):209–225.

Levin, D. A., Peres, Y., and Wilmer, E. L. (2006). Markov Chains and Mixing Times. American Mathematical Society.

Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publications.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Rubin, D. B. (1975). Bayesian inference for causality: The importance of randomization. In ASA Proceedings of the Social Statistics Section, pages 233–239. American Statistical Association.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63:581–590.

Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate (corr: V3 p384). Journal of Educational and Behavioral Statistics, 2:1–26.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6:34–58.

Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.

Rubin, D. B. (2006). Causal inference through potential outcomes and principal stratification: Application to studies with "censoring" due to death. Statistical Science, 21(3):299–309.

Schafer, J. (1997). Analysis of Incomplete Multivariate Data. Chapman and Hall.

Schafer, J. L. and Yucel, R. M. (2002). Computational strategies for multivariate linear mixed-effects models with missing values. Journal of Computational and Graphical Statistics, 11(2):437–457.

Schwartz, J., Austin, E., Bind, M.-A., Zanobetti, A., and Koutrakis, P. (2015). Estimating causal associations of fine particles with daily deaths in Boston. American Journal of Epidemiology, 182.

Sommer, A., Lee, M., and Bind, M.-A. (2018). Comparing apples to apples: an environmental criminology analysis of the effects of heat and rain on violent crimes in Boston. Palgrave Communications, 4.

Splawa-Neyman, J., Dabrowska, D. M., and Speed, T. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, pages 465–472.

Tanner, M. A. and Wong, W. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82:528–540.

Uhler, H. S. (1942). The coefficients of Stirling's series for $\log\Gamma(z)$. Proceedings of the National Academy of Sciences, 28(2):59–62.

Wilk, M. B. (1955). The randomization analysis of a generalized randomized block design. Biometrika, 42(1/2):70–79.

Winkelmann, R. (2008). Econometric Analysis of Count Data. Springer.

Wu, X., Nethery, R. C., Sabath, B. M., Braun, D., and Dominici, F. (2020). Exposure to air pollution and COVID-19 mortality in the United States. medRxiv.

Zigler, C. and Dominici, F. (2014). Point: Clarifying policy evidence with potential-outcomes thinking: beyond exposure-response estimation in air pollution epidemiology.