Interpretable Sensitivity Analysis for Balancing Weights
Dan Soriano, Eli Ben-Michael, Peter J. Bickel, Avi Feller, Samuel D. Pimentel
UC Berkeley and Harvard University
March 1, 2021
Abstract
Assessing sensitivity to unmeasured confounding is an important step in observational studies, which typically estimate effects under the assumption that all confounders are measured. In this paper, we develop a sensitivity analysis framework for balancing weights estimators, an increasingly popular approach that solves an optimization problem to obtain weights that directly minimize covariate imbalance. In particular, we adapt a sensitivity analysis framework using the percentile bootstrap for a broad class of balancing weights estimators. We prove that the percentile bootstrap procedure can, with only minor modifications, yield valid confidence intervals for causal effects under restrictions on the level of unmeasured confounding. We also propose an amplification to allow for interpretable sensitivity parameters in the balancing weights framework. We illustrate our method through extensive real data examples.

∗ We would like to thank Skip Hirshberg for useful discussion and comments. This research was supported in part by the Hellman Family Fund at UC Berkeley, the Institute of Education Sciences, U.S. Department of Education, through Grant R305D200010, and the Two Sigma PhD fellowship. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

Introduction
Assessing the sensitivity of results to violations of causal assumptions is a critical part of the workflow for causal inference with observational studies. In such studies, the key assumption that all confounders are measured, sometimes known as ignorability or unconfoundedness, rarely holds in practice. A sensitivity analysis seeks to determine the magnitude of unobserved confounding required to alter a study's findings. If a large amount of confounding is needed, then the study is robust, enhancing its reliability.

In this paper, we develop a sensitivity analysis framework for balancing weights estimators. Building on classical methods from survey calibration, these estimators find weights that minimize covariate imbalance between a weighted average of the observed units and a given distribution, such as re-weighting control units to have a similar covariate distribution to the treated units. Balancing weights have become increasingly common within causal inference, with better finite sample properties than traditional inverse propensity score weighting (IPW). See Ben-Michael et al. (2020b) for a recent review.

Our proposed sensitivity analysis framework adapts the percentile bootstrap sensitivity analysis that Zhao et al. (2019) develop for traditional IPW. Specifically, for a specified sensitivity parameter, we compute the upper and lower bounds of our estimator for each bootstrap sample, and then form a confidence interval using percentiles across bootstrap samples. We prove that this approach yields valid confidence intervals for our proposed sensitivity analysis procedure over a broad class of balancing weights estimators.

To make the sensitivity analysis more interpretable, we propose a new amplification that expresses the error from confounding in terms of: (1) the imbalance in observed and unobserved covariates; and (2) the strength of the relationship between the outcome and the imbalanced covariates.
Researchers can then relate the results of our amplification to estimates from observed covariates. We demonstrate this approach via a numerical illustration and via several applications.

We consider an observational study setting with independently and identically distributed data (Y_i, X_i, Z_i), i ∈ {1, ..., n}, drawn from some joint distribution P(·), with outcome Y_i ∈ ℝ, covariates X_i ∈ 𝒳, and treatment assignment Z_i ∈ {0, 1}. We posit the existence of potential outcomes: the outcome had unit i received the treatment, Y_i(1), and the outcome had unit i received the control, Y_i(0) (Neyman, 1923; Rubin, 1974). We assume stable treatment and no interference between units (Rubin, 1980), so the observed outcome is Y_i = (1 − Z_i) Y_i(0) + Z_i Y_i(1). Our primary estimand of interest is the Population Average Treatment Effect (PATE):

τ = E[Y(1) − Y(0)] = µ_1 − µ_0,   (1)

where µ_1 = E[Y(1)] and µ_0 = E[Y(0)]. To simplify the exposition, we will focus on estimating µ_1; estimating µ_0 is symmetric. We consider alternative estimands in Appendix C.

A common set of identification assumptions in this setting, known as strong ignorability, assumes that conditioning on the covariates X sufficiently removes confounding between treatment Z and the potential outcomes Y(0), Y(1), and that treatment assignment is not deterministic given X (Rosenbaum and Rubin, 1983b).

Assumption 1 (Ignorability). Y(0), Y(1) ⊥⊥ Z | X.

Assumption 2 (Overlap). The propensity score π(x) ≡ P(Z = 1 | X = x) satisfies 0 < π(x) < 1 for all x ∈ 𝒳.

Under Assumptions 1 and 2, we can non-parametrically identify µ_1 solely with the outcomes from units receiving treatment:

µ_1 = E[ ZY / π(X) ],
(2)

In an observational setting, the researcher does not know the true treatment assignment mechanism, π(x, y) ≡ P(Z = 1 | X = x, Y(1) = y), which in general can depend on both the covariates X and the potential outcomes Y(1) and Y(0). A rich literature assesses the sensitivity of estimates to violations of the ignorability assumption. This approach dates back at least to Cornfield et al. (1959), who conducted a formal sensitivity analysis of the effect of smoking on lung cancer. More recent examples of sensitivity analysis include Rosenbaum and Rubin (1983a), Rosenbaum (2002), VanderWeele and Ding (2017), Franks et al. (2019), and Cinelli and Hazlett (2020). See Hong et al. (2020) for a recent discussion of weighting-based sensitivity methods.

Our proposed approach builds most directly on that of Zhao et al. (2019), who use the percentile bootstrap and linear programming to perform a sensitivity analysis for traditional IPW. Following their setup, we split the problem into two parts: sensitivity for the mean of the treated potential outcomes and sensitivity for the mean of the control potential outcomes; without loss of generality, we consider the mean of the treated potential outcomes. Since unbiased estimation of E[Y(1)] requires knowledge only of π(x, y) = P(Z = 1 | X = x, Y(1) = y) rather than the full propensity score that also conditions on Y(0), we can rewrite Assumption 1 as π(x, y) = π(x). For details on combining sensitivity analyses for E[Y(1)] and E[Y(0)] into a single sensitivity analysis for the ATE, see Section 5 of Zhao et al. (2019).

We now introduce a sensitivity model that relaxes the ignorability assumption so that the odds ratio between the two conditional probabilities π(x) and π(x, y) is bounded.

Assumption 3 (Marginal sensitivity model). For Λ ≥ 1, the true propensity score satisfies

π(x, y) ∈ E(Λ) ≡ { π(x, y) ∈ (0,
1) : Λ^{−1} ≤ OR(π(x), π(x, y)) ≤ Λ },

where OR(p_1, p_2) = [p_1/(1 − p_1)] / [p_2/(1 − p_2)] is the odds ratio.

Here, Λ is a sensitivity parameter, quantifying the difference between the true propensity score π(x, y) and the probability of treatment given X = x, π(x); when Λ = 1, the two probabilities are equivalent, and Assumption 1 holds. If, for example,
Λ = 2, Assumption 3 constrains the odds ratio between π(x) and π(x, y) to lie between 1/2 and 2. The modeled estimate of the propensity score ˆπ(x) could differ from the true treatment probabilities π(x, y) for many reasons, including model misspecification and unobserved confounding; the marginal sensitivity model in Assumption 3 is agnostic to the source of these differences.

Again following Zhao et al. (2019), we will consider an equivalent characterization of the set E(Λ) in terms of the log odds ratio h(x, y) = log OR(π(x), π(x, y)):

H(Λ) = { h : 𝒳 × ℝ → ℝ : ‖h‖_∞ ≤ log Λ },   (3)

where ‖h‖_∞ = sup_{x∈𝒳, y∈ℝ} |h(x, y)| is the supremum norm. For a particular h ∈ H(Λ), we can write the shifted inverse propensity score as

1/π^(h)(x, y) = 1 + (1/π(x) − 1) e^{h(x,y)},

and the shifted estimand as

µ_1^(h) = E[ Z/π^(h)(X, Y) ]^{−1} · E[ ZY/π^(h)(X, Y) ].   (4)

Under the marginal sensitivity model in Assumption 3, we then have a non-parametric partial identification bound, inf_{h∈H(Λ)} µ_1^(h) ≤ µ_1 ≤ sup_{h∈H(Λ)} µ_1^(h). In Section 3, we will construct confidence intervals that cover this partial identification set. In Section 4, we will consider a finite sample analog of the marginal sensitivity model in order to amplify, interpret, and calibrate our sensitivity analyses.

We estimate µ_1 via a weighted average of treated units' outcomes using weights ˆγ:

ˆµ_1 = (1/n) Σ_{i=1}^n Z_i ˆγ_i Y_i,   (5)

where Σ_{i=1}^n Z_i = n_1. Under strong ignorability (Assumptions 1 and 2), traditional Inverse Propensity Score Weighting (IPW) first models the propensity score, ˆπ(x), directly and then sets the weights to be ˆγ_i = 1/ˆπ(X_i). Thus, ˆµ_1 is a plug-in version of Equation (2).
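As a concrete illustration, here is a minimal sketch of the plug-in weighting estimator in Equation (5) on synthetic data; the data-generating process, and the use of the true propensity score in place of a fitted model ˆπ(x), are our own simplifying assumptions for illustration, not part of the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Synthetic observational data with a known propensity score
X = rng.normal(size=n)
pi = 0.1 + 0.8 / (1 + np.exp(-X))      # true propensity, bounded in (0.1, 0.9) for overlap
Z = rng.binomial(1, pi)
Y1 = 2.0 + X + rng.normal(size=n)      # treated potential outcome, so mu_1 = E[Y(1)] = 2
Y = np.where(Z == 1, Y1, 0.0)          # control outcomes play no role in estimating mu_1

# Plug-in IPW weights (here the oracle 1/pi stands in for a fitted 1/pi_hat)
gamma = 1.0 / pi
mu1_hat = np.mean(Z * gamma * Y)       # Equation (5)
print(mu1_hat)                         # approximately 2
```

With a well-specified propensity model the estimate recovers µ_1; the poor behavior under limited overlap discussed next comes from weights 1/ˆπ blowing up as ˆπ approaches 0.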
This approach can perform poorly in moderate to high dimensions or when there is poor overlap and either π(x) or ˆπ(x) is near 0 or 1 (Kang et al., 2007).

Balancing weights, by contrast, directly optimize for covariate balance; recent proposals include Hainmueller (2012); Zubizarreta (2015); Athey et al. (2018); Wang and Zubizarreta (2019); Hirshberg et al. (2019); Tan (2020), and such weights have a long history in survey calibration for non-response (Deville and Särndal, 1992; Deville et al., 1993). See Chattopadhyay et al. (2020) and Ben-Michael et al. (2020b) for recent reviews.

Most balancing weights estimators attempt to control the imbalance between the weighted treated sample and the full sample in some transformation of the covariates φ : 𝒳 → ℝ^d. For example, Zubizarreta (2015) proposes stable balancing weights (SBW) that find weights ˆγ that solve

min_{γ∈ℝ^n} Σ_{Z_i=1} γ_i²
subject to ‖ (1/n) Σ_{i=1}^n Z_i γ_i φ(X_i) − (1/n) Σ_{i=1}^n φ(X_i) ‖_∞ ≤ λ,
γ_i ≥ 0 ∀ i.   (6)

These are the weights of minimum variance that guarantee approximate balance: the worst-case imbalance in φ, the transformed covariates, is less than some hyper-parameter λ. There are many other choices of both the penalty on the weights and the measure of imbalance.¹ For instance, in low dimensions, setting λ = 0 guarantees exact balance on the covariates φ(X_i). Here we focus on the more common case in which achieving exact balance is infeasible; in that case, the particular choice of penalty function is less important.

¹ Other possibilities include soft balance penalties rather than hard constraints (e.g. Ben-Michael et al., 2020a; Keele et al., 2020) and non-parametric measures of balance (e.g. Hirshberg et al., 2019).

The balancing weights procedure is connected to the modeled IPW approach above through the Lagrangian dual formulation of optimization problem (6). The imbalance in the d transformations of the covariates is controlled by a vector of dual variables
β ∈ ℝ^d, and the Lagrangian dual is

min_{β∈ℝ^d}  (1/2n) Σ_{i=1}^n Z_i [β · φ(X_i)]_+²  −  (1/n) Σ_{i=1}^n β · φ(X_i)   [balancing loss]   +  λ ‖β‖_1   [regularization],   (7)

where [x]_+ = max{0, x}. The weights are recovered from the dual solution as ˆγ_i = [ˆβ · φ(X_i)]_+. As Zhao (2019) and Wang and Zubizarreta (2019) show, this is a regularized M-estimator of the propensity score when the inverse propensity score is of the form 1/π(x) = [β* · φ(x)]_+ for some true β*. Therefore, we can view β* · φ(x) as a natural parameter for the propensity score; different penalty functions will induce different link functions (see Wang and Zubizarreta, 2019). Similarly, different measures of balance will induce different forms of regularization on the propensity score parameters. In the succeeding sections, we will use this dual connection to show that the percentile bootstrap sensitivity procedure proposed by Zhao et al. (2019) for traditional IPW estimators in the marginal sensitivity model is valid with balancing weights estimators.

We now outline our procedure for extending the percentile bootstrap sensitivity analysis to balancing weights. We introduce the shifted balancing weights estimator, detail the bootstrap sampling procedure, and describe how to efficiently compute the confidence intervals. The key to constructing confidence intervals for the partial identification set will be to construct intervals for each sensitivity model h in the collection of sensitivity models H(Λ) in Equation (3). Each h represents a particular deviation from ignorability that remains in the set defined by the marginal sensitivity model. We show that the percentile bootstrap yields valid confidence intervals for each sensitivity model in H(Λ), resulting in a valid interval for the partial identification set.
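To make the dual connection concrete, the following sketch solves a problem of the form of the Lagrangian dual (7) by proximal gradient descent and recovers the weights as ˆγ_i = [ˆβ · φ(X_i)]_+. The synthetic data, step size, and iteration count are illustrative choices of ours, not part of the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 3
phi = rng.normal(loc=0.5, size=(n, d))              # transformed covariates phi(X_i)
Z = rng.binomial(1, 1 / (1 + np.exp(-phi[:, 0])))   # selection on the first covariate

target = phi.mean(axis=0)   # (1/n) sum_i phi(X_i): the full-sample means to match
lam = 0.01                  # balance tolerance, i.e. the dual l1 penalty
eta = 0.3                   # gradient step size

beta = np.zeros(d)
for _ in range(50_000):
    gamma = np.maximum(phi @ beta, 0.0)        # gamma_i = [beta . phi(X_i)]_+
    grad = (Z * gamma) @ phi / n - target      # gradient of the smooth balancing loss in (7)
    b = beta - eta * grad
    beta = np.sign(b) * np.maximum(np.abs(b) - eta * lam, 0.0)  # l1 proximal (soft-threshold) step

gamma = np.maximum(phi @ beta, 0.0)
imb0 = phi[Z == 1].mean(axis=0) - target       # imbalance before weighting
imb = (Z * gamma) @ phi / n - target           # imbalance after weighting: roughly within lam
```

At the dual optimum, the KKT conditions imply each coordinate of the weighted imbalance is within λ, which is exactly the approximate-balance guarantee of the primal problem (6).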
We provide guidance for interpreting our sensitivity analysis procedure in Section 4.

To construct the confidence intervals, we can first consider the case where we know the log odds function h(x, y) ∈ H(Λ). With h, we can shift the balancing weights estimator for the shifted estimand µ_1^(h) as

ˆµ_1^(h) = ( Σ_{Z_i=1} ˆγ_i^(h) )^{−1} Σ_{Z_i=1} ˆγ_i^(h) Y_i,   (8)

where ˆγ_i^(h) = 1 + (ˆγ_i − 1) e^{h(X_i, Y_i)} for i ∈ {i : Z_i = 1} are the shifted balancing weights. We then take B bootstrap samples of size n without conditioning on treatment assignment — so the number of units in the treatment and control groups may vary from sample to sample — and re-estimate the weights in each sample by solving the balancing weights optimization problem (6) using the bootstrapped data.

Then, for every h ∈ H(Λ), we can construct a confidence interval for µ_1^(h) using the percentile bootstrap as

[ L^(h), U^(h) ] = [ Q_{α/2}( ˆµ*_{1,b}^(h) ), Q_{1−α/2}( ˆµ*_{1,b}^(h) ) ].   (9)

Here Q_q( ˆµ*_{1,b}^(h) ) is the q-quantile of ˆµ*_{1,b}^(h) in the bootstrap distribution made up of the B bootstrap samples, and ˆµ*_{1,b}^(h) is the shifted balancing weights estimator (8) using bootstrap sample b ∈ {1, ..., B}; the * in ˆµ*_{1,b}^(h) indicates an estimate from bootstrap data, and b indexes the B bootstrap samples. The following theorem states that [L^(h), U^(h)] is an asymptotically valid confidence interval for µ_1^(h) with at least (1 − α)-coverage, under high-level assumptions in Appendix A.1 on how well the balancing weights estimate the propensity scores.

Theorem 1.
Under Assumption 4 in Appendix A.1, for every h ∈ H(Λ),

lim sup_{n→∞} P( µ_1^(h) < L^(h) ) ≤ α/2   and   lim sup_{n→∞} P( µ_1^(h) > U^(h) ) ≤ α/2,

where P denotes the probability under the joint distribution of the data P(·).

Since each of the confidence intervals [L^(h), U^(h)] is valid, we can use the union method to combine them into a single valid confidence interval [L_union, U_union] for µ_1 under Assumption 3, where

L_union = inf_{h∈H(Λ)} L^(h),   U_union = sup_{h∈H(Λ)} U^(h).   (10)

Finding [L_union, U_union] would require conducting a grid search over the space of log-odds functions H(Λ) and computing percentile bootstrap confidence intervals at each point; this is computationally infeasible. Instead, we can obtain a confidence interval [L, U] for µ_1 by using generalized minimax and maximin inequalities as

[L, U] = [ Q_{α/2}( inf_{h∈H(Λ)} ˆµ*_{1,b}^(h) ), Q_{1−α/2}( sup_{h∈H(Λ)} ˆµ*_{1,b}^(h) ) ].   (11)

Zhao et al. (2019) show that this interval will be conservative, in the sense of being too wide, since L ≤ L_union and U ≥ U_union.

The extrema of the point estimates can be solved efficiently by the following linear fractional programming problem:

min / max_{r∈ℝ^n}  ˆµ_1^(h) = [ Σ_{i=1}^n Z_i (1 + r_i (ˆγ_i − 1)) Y_i ] / [ Σ_{i=1}^n Z_i (1 + r_i (ˆγ_i − 1)) ]
subject to r_i ∈ [Λ^{−1}, Λ] for all i ∈ {1, ..., n},   (12)

where r_i = OR{ π(X_i), π(X_i, Y_i) } are the decision variables. The procedure to obtain the confidence interval [L, U] is then:

Step 1.
Obtain B bootstrap samples of the data of size n without conditioning on treatment assignment. Step 2.
For each bootstrap sample b = 1, ..., B, re-estimate the weights and compute the extrema inf_{h∈H(Λ)} ˆµ*_{1,b}^(h) and sup_{h∈H(Λ)} ˆµ*_{1,b}^(h) under the collection of sensitivity models H(Λ) by solving (12).

Step 3.
Obtain valid confidence intervals for the sensitivity analysis:

L = Q_{α/2}( inf_{h∈H(Λ)} ˆµ*_{1,b}^(h) ),   U = Q_{1−α/2}( sup_{h∈H(Λ)} ˆµ*_{1,b}^(h) ).   (13)

Replacing ˆγ_i in Equation (12) with the inverse of propensity scores estimated by a generalized linear model recovers the procedure from Zhao et al. (2019).

Finally, a researcher must compute a sensitivity value for a given study; see Rosenbaum (2002) for extensive discussion. Suppose the confidence interval for the PATE under ignorability (Λ = 1) does not contain zero, indicating a statistically significant effect. As Λ increases, allowing for stronger violations of ignorability, the confidence interval will widen and eventually cross zero. Of particular interest then is the minimum value of Λ for which the confidence interval contains zero; we denote this value as Λ*. Thus, we can interpret Λ* as the minimum difference in the odds ratio between the probability of treatment with and without conditioning on the treated potential outcome for which we no longer observe a significant treatment effect. This represents the degree of confounding required to change a study's causal conclusions, with larger values of Λ* representing more robust estimates.

Sensitivity analysis may also be useful in cases where the confidence interval under Λ = 1 is very small and includes zero, indicating no large effect in any direction. In this setting, a researcher may obtain a sensitivity value Λ* by defining a minimal effect size ι > 0 of practical interest and repeating the sensitivity analysis for larger and larger values of Λ until the confidence interval includes either −ι or ι, revealing the degree of confounding needed to mask a practically important effect. For examples of such sensitivity analyses for equivalence results, see Pimentel et al. (2015); Pimentel and Kelz (2020).
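The extrema in Step 2 can be computed without a generic LP solver: for fixed t, the optimal r_i in (12) is bang-bang, and the extremal ratio is the root of a monotone function of t, which bisection finds quickly. The sketch below is our own implementation under the simplifying assumption ˆγ_i ≥ 1 (so every shifted weight stays positive); for brevity, the Step 1–3 loop reuses fixed weights across bootstrap samples rather than re-solving (6) in each sample as the procedure prescribes.

```python
import numpy as np

def shifted_extrema(y, gamma, lam):
    """Extrema of sum(w*y)/sum(w) over w_i = 1 + r_i*(gamma_i - 1), r_i in [1/lam, lam].
    Assumes gamma_i >= 1, so every feasible w_i is positive."""
    a = gamma - 1.0

    def solve(maximize):
        def g(t):
            # best-case bang-bang choice of r_i for each term of sum w_i * (y_i - t)
            take_big = a * (y - t) > 0 if maximize else a * (y - t) < 0
            r = np.where(take_big, lam, 1.0 / lam)
            return np.sum((1 + r * a) * (y - t))
        lo, hi = y.min(), y.max()
        for _ in range(100):                 # g is decreasing in t; bisect for its root
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
        return 0.5 * (lo + hi)

    return solve(False), solve(True)

# Demo on synthetic treated-unit data
rng = np.random.default_rng(0)
m = 300
y = rng.normal(size=m) + 1.0
gamma = 1.0 / rng.uniform(0.2, 0.9, size=m)   # weights >= 1, as with 1/pi-type weights

hajek = np.sum(gamma * y) / np.sum(gamma)
lo1, hi1 = shifted_extrema(y, gamma, 1.0)     # Lambda = 1 recovers the point estimate
lo2, hi2 = shifted_extrema(y, gamma, 2.0)     # Lambda = 2 widens the range

# Simplified percentile bootstrap (Steps 1-3), holding the weights fixed
B, alpha = 200, 0.05
los, his = [], []
for _ in range(B):
    idx = rng.integers(0, m, size=m)
    l, h = shifted_extrema(y[idx], gamma[idx], 2.0)
    los.append(l); his.append(h)
L = np.percentile(los, 100 * alpha / 2)
U = np.percentile(his, 100 * (1 - alpha / 2))
```

With Λ = 1 the interval of extrema collapses to the Hájek point estimate, and increasing Λ widens both the extrema and the resulting percentile interval, mirroring Figures 2a and 3a.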
In this section, we provide guidance for interpreting the main sensitivity parameter Λ* by "amplifying" the sensitivity analysis into a constraint on the product of: (1) the level of remaining imbalance in confounders; and (2) the strength of the relationship between the imbalanced confounders and the treated potential outcome.

As a first step in the amplification procedure, we introduce a finite sample analog to the marginal sensitivity model in Assumption 3. Importantly, the sensitivity analysis procedure for the marginal sensitivity model outlined in Section 3 remains the same; we introduce this extension in order to amplify, interpret, and calibrate the sensitivity analysis outlined above.

Recall that the marginal sensitivity model constrains the difference between the true probability of treatment, conditional on the treated potential outcome and the covariates, and the propensity score that conditions only on the covariates. The true propensity score guarantees that, in expectation, the inverse probability weighted outcomes for the treated group are equal to µ_1, i.e., that

E[Y(1)] = E[ ZY / π(X, Y) ].   (14)

However, this does not guarantee that the two quantities will be the same in finite samples. In that case, we can instead consider oracle weights ˚γ_i that guarantee equality between the weighted average of treated group outcomes and the sample average treated outcome ˚µ_1 = (1/n) Σ_{i=1}^n Y_i(1):

min_{γ∈ℝ^n}  Σ_{Z_i=1} γ_i log( γ_i / ˆγ_i )
subject to  ( Σ_{Z_i=1} γ_i )^{−1} Σ_{Z_i=1} γ_i Y_i(1) = (1/n) Σ_{i=1}^n Y_i(1),   (15)

where ˆγ_i are the estimated weights from balancing the observed covariates.

These oracle weights satisfy two key properties. First, they exactly balance the treated potential outcomes between all units and the weighted treated units. Second, they are as close (in terms of entropy) as possible to the estimated weights.
The corresponding oracle weight estimator of the complete data mean ˚µ_1 is then:

˚µ_1 = Σ_{Z_i=1} [ ˚γ(X_i, Y_i) / Σ_{Z_j=1} ˚γ(X_j, Y_j) ] Y_i.   (16)

Extending the population sensitivity model in Assumption 3, we can now define an analogous finite sample sensitivity model that bounds the difference between the estimated weights and the oracle weights within the sample rather than in the population. For Λ ≥ 1, we consider the set of oracle weights ˚γ that satisfy:

E_˚γ(Λ) = { ˚γ : Λ^{−1} ≤ (˚γ_i − 1)/(ˆγ_i − 1) ≤ Λ, ∀ i = 1, ..., n }.   (17)

Rather than bounding the difference between the true probability of treatment π(x, y) and the propensity score conditioned only on the covariates π(x) in the population, we bound the difference between the estimated and oracle weights in the sample. We can therefore think of this model as the finite sample analogue of the super-population marginal sensitivity model; it is thus more consistent with notions of sensitivity common in the matching literature (e.g., Rosenbaum, 2002).

In order for a confounder to bias causal effect estimates, it must be associated with both the treatment and the outcome. An "amplification" enhances a sensitivity analysis's interpretability by allowing a researcher to instead interpret the results of the sensitivity analysis in terms of two parameters: one controlling the confounder's relationship with the treatment and the other controlling its relationship with the outcome (Rosenbaum and Silber, 2009). In our finite sample sensitivity model, the parameter Λ controls how far the estimated weights can be from oracle weights that exactly balance the treated potential outcome.
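Problem (15) has a convenient structure: its solution is an exponential tilt of the estimated weights, ˚γ_i ∝ ˆγ_i exp(η Y_i(1)), with the scalar η chosen so the tilted weighted mean hits the target. The sketch below is our own construction, on simulated data where every Y_i(1) is known — which is exactly what makes these weights "oracle" — and finds η by bisection.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
pi = 1 / (1 + np.exp(-X))
Z = rng.binomial(1, pi)
Y1 = X + rng.normal(size=n)          # treated potential outcomes, known only in simulation

y = Y1[Z == 1]
gamma_hat = 1.0 / pi[Z == 1]         # stand-in for estimated balancing weights
target = Y1.mean()                   # (1/n) sum_i Y_i(1): the constraint in (15)

def tilted_mean(eta):
    w = gamma_hat * np.exp(eta * (y - y.mean()))   # centering y for numerical stability
    return np.sum(w * y) / np.sum(w)

# tilted_mean is increasing in eta; bisect for the eta that matches the target
lo, hi = -30.0, 30.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if tilted_mean(mid) < target else (lo, mid)
eta = 0.5 * (lo + hi)

oracle = gamma_hat * np.exp(eta * (y - y.mean()))  # the oracle weights, up to scale
```

The tilted weights exactly balance the treated potential outcome (the first oracle property) while staying as close in entropy to ˆγ as the constraint allows (the second).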
To aid interpretation, we propose an amplification that expresses the results of our procedure in terms of the imbalance in confounders and the strength of the relationship between the confounders and the treated potential outcome.

To start, we define our error of interest, ˚µ_1 − ˆµ_1, to be the difference between the estimates of the complete data mean µ_1 using oracle weights and estimated weights. Therefore, this error represents the difference between what we would like to have in a finite sample, an estimate using weights that exactly balance the potential outcome under treatment, and what we have, an estimate using weights that only balance observed covariates. We can write the difference between the average treated potential outcome in the sample and ˆµ_1 in terms of the imbalance in the treated potential outcome Y(1):²

˚µ_1 − ˆµ_1 = Σ_{Z_i=1} [ ˚γ(X_i, Y_i) / Σ_{Z_j=1} ˚γ(X_j, Y_j) − ˆγ(X_i)/n ] Y_i(1).

To relate imbalance in Y(1) to observable quantities, we decompose Y(1) into two parts: (1) the linear projection onto covariates that the estimated balancing weights perfectly balance; and (2) the residual. Specifically, let U be the projection of the re-weighted covariates onto Z and let W be the orthogonal component. Therefore, U represents the parts of the observed and unobserved covariates that are not exactly balanced and W represents the parts that are exactly balanced. We write this decomposition as Y(1) = Wβ_w + Uβ_u; this linear model merely serves as a guide to interpretation, rather than a true relationship we are assuming in the primary causal analysis. One way to reason about the covariates that are not exactly balanced is as follows. Consider an imbalanced covariate A, and run a linear regression of A on treatment assignment Z. The fitted values would be included in U and the residuals in W, since the residuals represent the part of A that is linearly independent of Z. Because the W are exactly balanced by construction, they do not introduce any error. Finally, in the numerical examples (Section 5), we focus on standardized covariates.

With this decomposition, we can write the error as a product of two terms:

˚µ_1 − ˆµ_1 = β_u · ( (1/n) Σ_{i=1}^n U_i − (1/n) Σ_{Z_i=1} ˆγ_i U_i ) ≡ β_u · δ_u,   (18)

where δ_u is the imbalance in U. As in Section 3 above, we can use the fractional linear program (12) to find upper and lower bounds for the error in Equation (18):

( inf_{h∈H(Λ)} ˆµ_1^(h) ) − ˆµ_1 ≤ ˚µ_1 − ˆµ_1 ≤ ( sup_{h∈H(Λ)} ˆµ_1^(h) ) − ˆµ_1.   (19)

Therefore, we can constrain the product δ_u · β_u:

( inf_{h∈H(Λ)} ˆµ_1^(h) ) − ˆµ_1 ≤ ˚µ_1 − ˆµ_1 = δ_u · β_u ≤ ( sup_{h∈H(Λ)} ˆµ_1^(h) ) − ˆµ_1.   (20)

Now, for any value of our sensitivity parameter Λ, we can compute a corresponding bound on the error, ˚µ_1 − ˆµ_1, and can use this bound to decompose the error into different values of δ_u and β_u.

In practice, as in Equation (20), we first bound the error via the extrema under the balancing weights sensitivity model. We then set the error equal to the maximum absolute value of the upper and lower bounds in Equation (20) for Λ = Λ*; this value is the maximum absolute error possible under the balancing weights sensitivity model. Finally, we compute a curve that maps the value of the error to different combinations of δ_u and β_u for enhanced interpretation. For example, any pair with δ_u · β_u = 3, such as (δ_u, β_u) = (1, 3), is consistent with error ˚µ_1 − ˆµ_1 = 3. In Section 5, we illustrate our sensitivity analysis procedure and show how our amplification can produce more interpretable results.

² This definition of error as the difference between two sample estimates is similar to Cinelli and Hazlett (2020)'s formulation of bias in their sensitivity analysis framework for linear regression models.

Numerical examples
We now illustrate the sensitivity analysis and amplification procedures using two real data examples. We consider the situation in which a researcher uses balancing weights to estimate the Population Average Treatment Effect on the Treated (PATT) of a treatment on an outcome of interest; see Appendix C for an overview of the PATT in our setting. Based on domain knowledge, the researcher believes that the set of observed covariates includes most factors associated with the treatment assignment and the outcome, while leaving open the possibility that there remain relevant unobserved covariates.

To start, we compute Λ*, which represents the confounding required to alter a study's causal conclusions. To do so, we compute confidence intervals for a grid of values of Λ, starting with Λ = 1 and then considering larger values of Λ. If the confidence interval corresponding to Λ = 1 contains zero, then the effect estimate is not significant, even under ignorability. If the confidence interval for
Λ = 1 does not contain zero, increasing the value of Λ causes the confidence intervals to widen and eventually cross zero for some value of Λ. We set Λ* equal to the minimum value of Λ for which the confidence interval includes zero. Since the percentile bootstrap procedure induces randomness, this value of Λ* is computed with Monte Carlo error.

We fix the error equal to the maximum absolute value of the upper and lower bounds on the error in Equation (32). This value is the maximum absolute error possible under the balancing weights sensitivity model with Λ = Λ* and is therefore the error required to overturn the study's causal conclusion. We create contour plots with curves that map this particular value of the error to varying values of δ_u and β_u, allowing the error to be alternatively interpreted in terms of two sensitivity analysis parameters. We include standardized observed covariates on the contour plots, which serve as guides for reasoning about potential unobserved covariates. Blue points correspond to observed covariates with imbalance prior to weighting, while red points represent post-weighting imbalance. We view the post-weighting imbalance corresponding to the red points as a best-case scenario for potential unobserved covariates: in general, we expect to achieve better balance on the observed covariates that we directly target than on unobserved covariates. Conversely, the pre-weighting imbalance represented by the blue points may be more in line with our expectations for unobserved covariates.

Figure 1 illustrates three cases a researcher might encounter when making the contour plots. The first scenario, corresponding to the black curve, is when the error curve intersects with the shaded red region.
Weview this as indicative of results that are sensitive to violations of ignorability, as confounders comparableto the observed covariates after weighting can overturn the results.By contrast, we view the scenario depicted by the purple curve in Figure 1 as evidence that a studyis fairly robust. The horizontal and vertical dotted blue lines correspond to the maximum values amongobserved covariates of β u and pre-weighting imbalance, respectively. Since the curve is above and to theright of the intersection of the two lines, the effect estimate is robust to an unobserved confounder withstrength and pre-weighting imbalance as large as the maximum among the observed covariates.The final case occurs when the error curve lies in between the red region and the intersection of thetwo lines. The green curve in Figure 1 illustrates this scenario. We view this as an ambiguous resultthat requires additional domain knowledge to evaluate the feasibility of there being confounding of thismagnitude. Conditional on observing one of the cases outlined above, the actual value of Λ ∗ can provide10igure 1: Example contour plots. The black curve is the error for a highly sensitive study, green is ambiguous,and purple is robust. The blue and red points are observed covariates with imbalance before and afterweighting, respectively. The red shaded region is the convex hull of the set including the red points, theorigin, the point on the y-axis corresponding to the maximum β u value among the red points, and the pointon the x-axis corresponding to the maximum δ u value among the red points. The region represents theapproximate magnitude of the observed covariates with post-weighting imbalance.additional insight. For example, if we are in the ambiguous scenario depicted by the green curve in Figure1, we would be more skeptical of the study’s causal conclusion if Λ ∗ = 1 . 
rather than Λ ∗ = 2 , since smallerdifferences between the estimated and oracle weights could overturn the study’s results. We re-examine data analyzed by LaLonde (1986) from the National Supported Work Demonstration Program(NSW), a randomized job training program. Specifically, we use the subset of data from Dehejia and Wahba(1999) to form a treatment group and observational data from the Current Population Survey–Social SecurityAdministration file (CPS1) to form a control group. We consider estimating the effect of the job trainingprogram on 1978 real earnings. The covariates for each individual include their age, years of education, race,marital status, whether or not they graduated high school, and earnings and employment status in 1974 and1975. In total, there are 185 treated units and 15,992 control units.First, we use stable balancing weights in Equation (6) to estimate (cid:92)
the PATT. Our point estimate is in line with Wang and Zubizarreta (2019)'s estimate using slightly different approximate balancing weights. We then compute Λ*, which is close to one, indicating that even a slight difference between the estimated and oracle weights can negate the causal effect estimate. Figure 2a shows how the range of point estimates and the 95% confidence interval widen as Λ increases, with the confidence interval including zero for Λ*. The range of point estimates is obtained by computing the extrema of the point estimates for a particular Λ.

Figure 2: Sensitivity analysis results with the LaLonde data. (a) Point estimate and confidence intervals for the LaLonde data; dotted intervals are point estimate intervals and solid intervals are 95% confidence intervals. (b) Contour plot for the LaLonde data; the black curve is the error for Λ*, the blue and red points are observed covariates with imbalance before and after weighting, respectively, and the red shaded region represents the approximate magnitude of the observed covariates with post-weighting imbalance.

Figure 2b shows the contour plot for the LaLonde data. We observe that the error curve intersects the red region. Therefore, the effect estimate is not robust to a confounder with similar levels of imbalance and strength as those seen in the observed covariates after weighting. Based on this, we deem the estimated effect to be fairly sensitive to unmeasured confounding.

We now examine data analyzed by Zhao et al. (2018) and Zhao et al. (2019) from the National Health and Nutrition Examination Survey (NHANES) 2013–2014, containing information about fish consumption and blood mercury levels. We evaluate the sensitivity of estimating the effect of fish consumption on blood mercury levels using balancing weights. There are 234 treated units (consumption of more than 12 servings of fish or shellfish in the past month) and 873 control units (zero or one servings).
The outcome of interest is log(total blood mercury), measured in micrograms per liter; the covariates include gender, age, income, whether income is missing and imputed, race, education, smoking history, and the number of cigarettes smoked in the previous month.

To start, the stable balancing weights (6) estimate of the PATT is an increase of 2.1 in log(total blood mercury), and Λ* is approximately equal to 5.5 for the fish consumption data. We display the sensitivity analysis results for multiple values of Λ in Figure 3a.

Figure 3: Sensitivity analysis results with the fish data. (a) Point estimate and confidence intervals for the fish data; dotted intervals are point estimate intervals and solid intervals are 95% confidence intervals. (b) Contour plot for the fish data; the black curve is the error for Λ*, the blue and red points are observed covariates with imbalance before and after weighting, respectively, and the red shaded region represents the approximate magnitude of the observed covariates with post-weighting imbalance.

We observe that the confidence interval corresponding to no confounding (
Λ = 1) is far from zero and that the confidence interval for Λ* = 5.5 just begins to cross zero.

The contour plot (Figure 3b) for the fish data indicates an extremely robust causal effect estimate. As the error curve is far above the intersection of the dotted lines that represent the maximum strength and pre-weighting imbalance among the observed covariates, confounding significantly stronger than the observed covariates would be required to alter the causal conclusion. Comparing the contour plot for the fish data to the contour plot for the LaLonde data (Figure 2b), we observe a clear difference. While the error curve for the LaLonde data lies well within the range of the observed covariates, the error curve for the fish data is far from the observed covariates. From this contrast, we conclude that the causal effect estimate for the fish data is much more robust than that for the LaLonde data.

Balancing weights estimation is a popular approach for estimating treatment effects by weighting units to balance covariates. In this paper, we develop a framework for assessing the sensitivity of these estimators to unmeasured confounding. We then propose an amplification for enhanced interpretation and illustrate our method through real data examples.

We briefly outline potential directions for future work. First, we could extend our framework to include augmented balancing weights estimators, which use an outcome model to correct for bias due to inexact balance. Second, we could extend our sensitivity analysis framework to balancing weights in panel data settings. For example, we could adapt this framework to variants of the synthetic control method (Abadie and Gardeazabal, 2003; Ben-Michael et al., 2018), extending some recent proposals for sensitivity analysis from Firpo and Possebom (2018).

Additionally, Dorn and Guo (2021) recently proposed a modification to Zhao et al. (2019)'s procedure, using quantile balancing to obtain sharper sensitivity analysis intervals.
We could adapt their modification to our proposed balancing weights framework.

Finally, we could use our framework to provide guidance in the design stage of balancing weights estimators. When estimating treatment effects using balancing weights, researchers must make decisions including the specific dispersion function of the weights, the particular imbalance measure, and, in many cases, an acceptable level of imbalance. We could extend our sensitivity analysis procedure to help make these decisions to improve robustness and power in the presence of unmeasured confounding. For example, we could provide insight into the trade-off between achieving better (marginal) balance on a few covariates or worse balance on a richer set of covariates.

References
Abadie, A. and Gardeazabal, J. (2003). The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1):113–132.

Athey, S., Imbens, G. W., and Wager, S. (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623.

Ben-Michael, E., Feller, A., and Rothstein, J. (2018). The augmented synthetic control method. arXiv preprint arXiv:1811.04170.

Ben-Michael, E., Feller, A., and Rothstein, J. (2020a). Variation in impacts of letters of recommendation on college admissions decisions: Approximate balancing weights for treatment effect heterogeneity in observational studies.

Ben-Michael, E., Hirshberg, D., Feller, A., and Zubizarreta, J. (2020b). The balancing act for causal inference.

Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, pages 1196–1217.

Chattopadhyay, A., Hase, C. H., and Zubizarreta, J. R. (2020). Balancing versus modeling approaches to weighting in practice. Statistics in Medicine, 39(24):3227–3254.

Cinelli, C. and Hazlett, C. (2020). Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society: Series B, 82(1):39–67.

Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., and Wynder, E. L. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22(1):173–203.

Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062.

Deville, J. C. and Särndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382.

Deville, J. C., Särndal, C. E., and Sautory, O. (1993). Generalized raking procedures in survey sampling. Journal of the American Statistical Association, 88(423):1013–1020.

Dorn, J. and Guo, K. (2021). Sharp sensitivity analysis for inverse propensity weighting via quantile balancing. arXiv preprint arXiv:2102.04543.

Firpo, S. and Possebom, V. (2018). Synthetic control method: Inference, sensitivity analysis and confidence sets. Journal of Causal Inference, 6(2).

Franks, A., D'Amour, A., and Feller, A. (2019). Flexible sensitivity analysis for observational studies without observable implications. Journal of the American Statistical Association, pages 1–33.

Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46.

Hirshberg, D. A., Maleki, A., and Zubizarreta, J. (2019). Minimax linear estimation of the retargeted mean. arXiv preprint arXiv:1901.10296.

Hong, G., Yang, F., and Qin, X. (2020). Did you conduct a sensitivity analysis? A new weighting-based approach for evaluations of the average treatment effect for the treated. Journal of the Royal Statistical Society: Series A (Statistics in Society).

Kang, J. D. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539.

Keele, L., Ben-Michael, E., Feller, A., Kelz, R., and Miratrix, L. (2020). Hospital quality risk standardization via approximate balancing weights.

Klaassen, C. A. (1987). Consistent estimation of the influence function of locally asymptotically linear estimators. The Annals of Statistics, pages 1548–1562.

LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620.

Neyman, J. (1990 [1923]). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, pages 465–472.

Pimentel, S. D. and Kelz, R. R. (2020). Optimal tradeoffs in matched designs comparing US-trained and internationally trained surgeons. Journal of the American Statistical Association, 115(532):1675–1688.

Pimentel, S. D., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2015). Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527.

Rosenbaum, P. R. (2002). Observational Studies. Springer.

Rosenbaum, P. R. and Rubin, D. B. (1983a). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological), 45(2):212–218.

Rosenbaum, P. R. and Rubin, D. B. (1983b). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Rosenbaum, P. R. and Silber, J. H. (2009). Amplification of sensitivity analysis in matched observational studies. Journal of the American Statistical Association, 104(488):1398–1405.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593.

Tan, Z. (2020). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika, 107(1):137–158.

VanderWeele, T. J. and Ding, P. (2017). Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine, 167(4):268–274.

Wang, Y. and Zubizarreta, J. R. (2019). Minimal approximately balancing weights: asymptotic properties and practical considerations. arXiv preprint arXiv:1705.00998.

Zhao, Q. (2019). Covariate balancing propensity score by tailored loss functions. Annals of Statistics, 47(2):965–993.

Zhao, Q., Small, D. S., and Bhattacharya, B. B. (2019). Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Zhao, Q., Small, D. S., and Rosenbaum, P. R. (2018). Cross-screening in observational studies that test many hypotheses. Journal of the American Statistical Association, 113(523):1070–1084.

Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922.
A Proofs
A.1 Proof of Theorem 1
Proof.
We prove that, after centering, the difference between the mean computed by estimating and evaluating the function $\gamma$ on bootstrap data and the mean computed by using the true function $\gamma$ and evaluating on the actual data is of order $n^{-1/2}$.

For simplicity, we consider estimating the population mean from an independent and identically distributed random sample with missing outcome data. For unit $i$, let $Y_i$ be the outcome, $X_i$ be a vector of observed covariates, and $Z_i$ be a response indicator, where $Z_i = 1$ if we observe unit $i$'s outcome and $Z_i = 0$ otherwise. We consider using the estimator
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \hat{\gamma}(X_i) Z_i Y_i$$
to estimate $\mu = E[Y] = E\left[\frac{ZY}{\pi(X)}\right] = E[\gamma(X) Z Y]$ from observed data $O_i = (X_i, Z_i, Y_i Z_i)$, $i = 1, \ldots, n$.

We sample split to make the proof and arguments simpler and more transparent (see Klaassen, 1987). The proof can equivalently be done without sample splitting, but we sample split to avoid the associated complexities. We split the data into two equally sized samples, $i = 1, \ldots, m$ and $i = m+1, \ldots, n$. For both samples, we take an iid bootstrap sample of size $m$ from the respective empirical distribution to obtain data $O^*_i = (X^*_i, Z^*_i, Y^*_i Z^*_i)$, $i = 1, \ldots, m$, and $O^*_i = (X^*_i, Z^*_i, Y^*_i Z^*_i)$, $i = m+1, \ldots, n$. Let $\hat{\gamma}^*$ denote an estimate of $\gamma$ using bootstrap data. We estimate $\hat{\gamma}^*(X)$ in one bootstrap sample and evaluate it in the other bootstrap sample. We then switch roles and take a weighted average of the two estimates proportional to $\sum_{i=1}^{m} Z^*_i$ in both bootstrap samples to obtain an efficient estimate. This sample splitting approach with reversing roles and averaging yields the same estimate as without sample splitting to order $o(n^{-1/2})$. We demonstrate this through simulation (see Appendix B). We examine the case where we evaluate on the bootstrap sample from the second half of the data and estimate $\hat{\gamma}^*(X)$ from the bootstrap sample from the first half.

We make the following mild assumptions on how $\hat{\gamma}$ is constructed:

Assumption 4. Assume that the function $\tilde{\gamma}_m : \mathcal{X}^m \times \{0,1\}^m \times \mathcal{X} \to \mathbb{R}_+$ is defined for all possible empirical distributions from bootstrap samples of the data and obtained as the solution $\tilde{\gamma}_1, \ldots, \tilde{\gamma}_m$ of (6) with the form
$$\min_{\tilde{\gamma}} \; \frac{1}{m} \sum_{i=1}^{m} Z_i \tilde{\gamma}_i^2 \quad \text{subject to} \quad \left\| \frac{1}{m} \sum_{i=1}^{m} Z_i \tilde{\gamma}_i \phi(X_i) - \frac{1}{m} \sum_{i=1}^{m} \phi(X_i) \right\|_\infty \leq \lambda, \qquad \tilde{\gamma}_i \geq 0 \;\; \forall i, \tag{21}$$
where $0 < \sum_{i=1}^{m} Z_i = m_1 < m$. We then let $\hat{\gamma}^*(x) = \tilde{\gamma}_m(X^*_1, \ldots, X^*_m, Z^*_1, \ldots, Z^*_m, x)$ and $\hat{\gamma}(x) = \tilde{\gamma}_m(X_1, \ldots, X_m, Z_1, \ldots, Z_m, x)$, estimated on the bootstrap sample of the first half of the data and the actual first half of the data, respectively, be such that:

1. $\tilde{\gamma}_m$ is uniformly bounded in $m$ and $x$.
2. $\sup_x \left| E\left[ \hat{\gamma}^*(x) \mid X_1, \ldots, X_m \right] - \hat{\gamma}(x) \right| = o_p(1)$.
3. $\sup_x \left| \hat{\gamma}(x) - \gamma(x) \right| = o_p(1)$. (Wang and Zubizarreta (2019)'s Theorem 2 proves that this condition holds for a hard $L_\infty$ balance constraint on the weights.)

These assumptions together imply that $\hat{\gamma}^*$ is uniformly consistent for $\gamma$. Assumption 4 verifies
$$E_1\left[ \left( \sup_x \left| \hat{\gamma}^*(x) - \gamma(x) \right| \right)^2 Y_{m+1}^2 Z_{m+1} \right] = E_1\left[ \left( \sup_x \left| \hat{\gamma}^*(x) - \gamma(x) \right| \right)^2 \right] E\left[ Y_{m+1}^2 Z_{m+1} \right] = o_p(1),$$
where $E_1$ denotes the conditional expectation given the first sample. Note that the conditions in Assumption 4 are stronger than needed and could be relaxed.

We proceed conditional on the first sample $O_i = (X_i, Z_i, Y_i Z_i)$, $i = 1, \ldots, m$, and the first bootstrap sample $O^*_i = (X^*_i, Z^*_i, Y^*_i Z^*_i)$, $i = 1, \ldots, m$. Therefore, $\hat{\gamma}^*$ is a completely known function. Let $E^*$ denote the conditional expectation of the second bootstrap sample given the actual second sample. To show that the bootstrap can be validly applied, we show that
$$\frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X^*_i) Z^*_i Y^*_i - E^*\left[ \frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X^*_i) Z^*_i Y^*_i \right] = \frac{1}{m} \sum_{i=m+1}^{n} \gamma(X_i) Z_i Y_i - E\left[ \gamma(X_{m+1}) Z_{m+1} Y_{m+1} \right] + o_p(n^{-1/2}). \tag{22}$$
Since
$$E^*\left[ \frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X^*_i) Z^*_i Y^*_i \right] = \frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X_i) Z_i Y_i,$$
then, by Theorem 2.1 from Bickel and Freedman (1981),
$$\frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X^*_i) Z^*_i Y^*_i - \frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X_i) Z_i Y_i \tag{23}$$
and
$$\frac{1}{m} \sum_{i=m+1}^{n} \left( \hat{\gamma}^*(X_i) Z_i Y_i - E\left[ \hat{\gamma}^*(X_{m+1}) Z_{m+1} Y_{m+1} \right] \right) \tag{24}$$
have the same limiting distribution. Since (23) and (24) have the same limiting distribution, it suffices to show, instead of (22), that the difference between the mean with the true $\gamma$ and the mean with $\hat{\gamma}^*$ estimated on the bootstrap data is of order $n^{-1/2}$. Therefore, we show
$$\frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X_i) Z_i Y_i - E\left[ \hat{\gamma}^*(X_{m+1}) Z_{m+1} Y_{m+1} \right] = \frac{1}{m} \sum_{i=m+1}^{n} \gamma(X_i) Z_i Y_i - E\left[ \gamma(X_{m+1}) Z_{m+1} Y_{m+1} \right] + o_p(n^{-1/2}). \tag{25}$$
We have now reduced the problem to showing that the true function $\gamma$ can be replaced with $\hat{\gamma}^*$. To show this, we use properties of $\hat{\gamma}^*$ from Assumption 4. First, we let
$$\Delta(X_i, Y_i, Z_i) = \left( \hat{\gamma}^*(X_i) - \gamma(X_i) \right) Z_i Y_i - E\left[ \left( \hat{\gamma}^*(X_{m+1}) - \gamma(X_{m+1}) \right) Z_{m+1} Y_{m+1} \right].$$
Note that the difference between the terms on the left- and right-hand sides of (25) is equal to $\frac{1}{m} \sum_{i=m+1}^{n} \Delta(X_i, Y_i, Z_i)$. Additionally, note that $E[\Delta(X_i, Y_i, Z_i)] = 0$. Therefore,
$$E\left[ \left( \frac{1}{m} \sum_{i=m+1}^{n} \Delta(X_i, Y_i, Z_i) \right)^2 \right] = \frac{1}{m} E\left[ \Delta(X_{m+1}, Y_{m+1}, Z_{m+1})^2 \right].$$
Since $m = \Omega(n)$, by Assumption 4,
$$E\left[ \Delta(X_{m+1}, Y_{m+1}, Z_{m+1})^2 \right] = E\left[ \left( \left[ \hat{\gamma}^*(X_{m+1}) - \gamma(X_{m+1}) \right] Z_{m+1} Y_{m+1} \right)^2 \right] - \left( E\left[ \left[ \hat{\gamma}^*(X_{m+1}) - \gamma(X_{m+1}) \right] Z_{m+1} Y_{m+1} \right] \right)^2 \leq E\left[ \left( \left[ \hat{\gamma}^*(X_{m+1}) - \gamma(X_{m+1}) \right] Z_{m+1} Y_{m+1} \right)^2 \right] = o_p(1).$$
Therefore, (25) follows.
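To make the procedure that this result justifies concrete, the percentile bootstrap sensitivity analysis can be sketched in a few lines of code. This is a simplified illustration under assumptions not in the text: the fitted weights are treated as fixed attributes of each unit (rather than re-solved on every bootstrap sample), the estimand is a generic weighted mean, and the function names are our own. The per-sample extrema scale each weight by a factor in $[\Lambda^{-1}, \Lambda]$; the optimum of the resulting linear-fractional program is attained with every factor at a bound, switching at a single threshold in the sorted outcomes, so a scan over the $n + 1$ cut points suffices.

```python
import numpy as np

def extrema_weighted_mean(y, w, lam):
    """Extrema of sum(c*w*y)/sum(c*w) over per-unit factors c_i in [1/lam, lam].

    The optimum puts every c_i at a bound, switching at a single threshold
    in the sorted outcomes, so we scan all n + 1 cut points.
    Assumes the base weights w are positive.
    """
    order = np.argsort(y)
    y, w = y[order], w[order]
    lo, hi = w / lam, w * lam
    best_min, best_max = np.inf, -np.inf
    for k in range(len(y) + 1):
        # maximum: down-weight the k smallest outcomes, up-weight the rest
        c = np.concatenate([lo[:k], hi[k:]])
        best_max = max(best_max, c @ y / c.sum())
        # minimum: mirror image
        c = np.concatenate([hi[:k], lo[k:]])
        best_min = min(best_min, c @ y / c.sum())
    return best_min, best_max

def percentile_bootstrap_interval(y, w, lam, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: per-sample extrema, then percentiles across samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    lowers, uppers = np.empty(n_boot), np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # iid resample of units
        lowers[b], uppers[b] = extrema_weighted_mean(y[idx], w[idx], lam)
    return (np.percentile(lowers, 100 * alpha / 2),
            np.percentile(uppers, 100 * (1 - alpha / 2)))
```

At $\Lambda = 1$ both extrema collapse to the ordinary weighted mean, and the interval widens monotonically in $\Lambda$ because the feasible set of perturbations grows.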
B Simulation for sample splitting
We conduct simulations to demonstrate the validity of the sample splitting technique that we use to prove Theorem 1 in Appendix A.1. We show that the bootstrap distributions for the balancing weights estimates of $\mu$ with and without sample splitting are quite similar.

The setup of the simulations is as follows. We draw 10,000 iid samples in which the covariates $X_1$ and $X_2$ are drawn from standard normal distributions, the treatment indicator $Z_i$ is a Bernoulli random variable whose probability is a linear function of $X_{1i}$ and $X_{2i}$ plus mean-zero Gaussian noise $\epsilon_i$, and the outcome $Y_i$ is a linear function of $Z_i$, $X_{1i}$, and $X_{2i}$ plus mean-zero Gaussian noise $\delta_i$. We run 1,000 simulations and estimate $\mu$ with and without sample splitting using weights obtained by entropy balancing with exact balance from Hainmueller (2012). We observe in Figure 4 that the bootstrap distributions of the estimates with and without sample splitting are comparable.

Figure 4: Bootstrap distributions of estimates of $\mu$ with the full data and with sample splitting

C Average treatment effect on the treated
In many settings, researchers are interested in estimating the Population Average Treatment Effect on the Treated (PATT):
$$\tau_T = E[Y(1) - Y(0) \mid Z = 1] = \mu_{11} - \mu_{01}, \tag{26}$$
where $\mu_{11} = E[Y(1) \mid Z = 1]$ and $\mu_{01} = E[Y(0) \mid Z = 1]$. Since $\mu_{11}$ is identifiable from observed data, we primarily focus on estimating $\mu_{01}$.

Our procedure for performing sensitivity analysis outlined in Section 3 largely still holds. The primary details that differ for the PATT are as follows. First, for a particular $h \in \mathcal{H}(\Lambda)$, we can write the shifted estimand as
$$\mu_{01}^{(h)} = E\left[ \frac{(1 - Z)\, \pi^{(h)}(X, Y)}{1 - \pi^{(h)}(X, Y)} \right]^{-1} E\left[ \frac{(1 - Z)\, \pi^{(h)}(X, Y)}{1 - \pi^{(h)}(X, Y)}\, Y \right]. \tag{27}$$
The corresponding shifted estimator for $\mu_{01}^{(h)}$ is
$$\hat{\mu}_{01}^{(h)} = \left( \sum_{Z_i = 0} e^{-h(X_i, Y_i)}\, \hat{\gamma}_i \right)^{-1} \sum_{Z_i = 0} e^{-h(X_i, Y_i)}\, \hat{\gamma}_i\, Y_i. \tag{28}$$

Additionally, the oracle weights now exactly balance the control potential outcome $Y(0)$ between the treatment group and the weighted control group. The oracle weights $\mathring{\gamma}(X, Y)$ solve the optimization problem
$$\min_{\gamma \in \mathbb{R}^{n_0}} \; \sum_{Z_i = 0} \gamma_i \log \frac{\gamma_i}{\hat{\gamma}_i} \quad \text{subject to} \quad \frac{1}{\sum_{Z_i = 0} \gamma_i} \sum_{Z_i = 0} \gamma_i Y_i(0) = \frac{1}{n_1} \sum_{Z_i = 1} Y_i(0), \tag{29}$$
where $\hat{\gamma}_i$ are the estimated weights from balancing the observed covariates. The balancing weights estimate of $\mu_{01}$ using the oracle weights is
$$\mathring{\mu}_{01} = \sum_{Z_i = 0} \frac{\mathring{\gamma}(X_i, Y_i)}{\sum_{Z_j = 0} \mathring{\gamma}(X_j, Y_j)}\, Y_i. \tag{30}$$

For $\Lambda \geq 1$, the balancing weights sensitivity model for the PATT considers the set of oracle weights $\mathring{\gamma}$ that satisfy
$$\mathcal{E}_{\mathring{\gamma}}(\Lambda) = \left\{ \mathring{\gamma} : \Lambda^{-1} \leq \frac{\hat{\gamma}_i}{\mathring{\gamma}_i} \leq \Lambda, \;\; \forall i = 1, \ldots, n \right\}. \tag{31}$$

Finally, the amplification becomes
$$\left( \inf_{h \in \mathcal{H}(\Lambda)} \hat{\mu}_{01}^{(h)} \right) - \hat{\mu}_{01} \leq \mathring{\mu}_{01} - \hat{\mu}_{01} = \delta_u \cdot \beta_u \leq \left( \sup_{h \in \mathcal{H}(\Lambda)} \hat{\mu}_{01}^{(h)} \right) - \hat{\mu}_{01}, \tag{32}$$
where
$$\inf / \sup_{h \in \mathcal{H}(\Lambda)} \hat{\mu}_{01}^{(h)} = \min / \max_{r \in \mathbb{R}^{n_0}} \; \frac{\sum_{Z_i = 0} r_i^{-1}\, \hat{\gamma}(X_i)\, Y_i}{\sum_{Z_i = 0} r_i^{-1}\, \hat{\gamma}(X_i)} \quad \text{subject to} \quad \Lambda^{-1} \leq r_i \leq \Lambda \tag{33}$$
and $r_i = \hat{\gamma}(X_i) / \mathring{\gamma}(X_i, Y_i)$.
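The oracle problem (29) is a Kullback–Leibler projection of the estimated weights onto a single mean constraint. Assuming the weights are normalized to sum to one, its solution is an exponential tilt, $\gamma_i \propto \hat{\gamma}_i e^{t Y_i}$, with the scalar $t$ chosen so the tilted weighted mean hits the target. A minimal sketch with a one-dimensional root search; the function name is our own, and the target must lie strictly inside the range of the outcomes:

```python
import numpy as np
from scipy.optimize import brentq

def tilt_to_mean(base_w, y, target):
    """KL-project normalized weights base_w onto {g : sum g_i y_i = target}.

    Solves min_g sum_i g_i log(g_i / base_w_i) subject to sum g_i = 1 and
    sum g_i y_i = target; the KKT solution is g_i proportional to
    base_w_i * exp(t * y_i) for a scalar t found by root search.
    """
    yc = y - y.mean()  # center the exponent for numerical stability
    def gap(t):
        g = base_w * np.exp(t * yc)
        g /= g.sum()
        return g @ y - target
    t = brentq(gap, -50.0, 50.0)  # assumes the root is bracketed here
    g = base_w * np.exp(t * yc)
    return g / g.sum()
```

In the oracle construction, `y` would be the control potential outcomes $Y_i(0)$ among the control units and `target` the treated-group mean of $Y(0)$.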