Interpretable Sensitivity Analysis for Balancing Weights
Dan Soriano, Eli Ben-Michael, Peter J. Bickel, Avi Feller, Samuel D. Pimentel
UC Berkeley and Harvard University
March 1, 2021
Abstract
Assessing sensitivity to unmeasured confounding is an important step in observational studies, which typically estimate effects under the assumption that all confounders are measured. In this paper, we develop a sensitivity analysis framework for balancing weights estimators, an increasingly popular approach that solves an optimization problem to obtain weights that directly minimize covariate imbalance. In particular, we adapt a sensitivity analysis framework using the percentile bootstrap for a broad class of balancing weights estimators. We prove that the percentile bootstrap procedure can, with only minor modifications, yield valid confidence intervals for causal effects under restrictions on the level of unmeasured confounding. We also propose an amplification to allow for interpretable sensitivity parameters in the balancing weights framework. We illustrate our method through extensive real data examples.

∗ We would like to thank Skip Hirshberg for useful discussion and comments. This research was supported in part by the Hellman Family Fund at UC Berkeley, the Institute of Education Sciences, U.S. Department of Education, through Grant R305D200010, and the Two Sigma PhD fellowship. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

Introduction
Assessing the sensitivity of results to violations of causal assumptions is a critical part of the workflow for causal inference with observational studies. In such studies, the key assumption that all confounders are measured, sometimes known as ignorability or unconfoundedness, rarely holds in practice. A sensitivity analysis seeks to determine the magnitude of unobserved confounding required to alter a study's findings. If a large amount of confounding is needed, then the study is robust, enhancing its reliability.

In this paper, we develop a sensitivity analysis framework for balancing weights estimators. Building on classical methods from survey calibration, these estimators find weights that minimize covariate imbalance between a weighted average of the observed units and a given distribution, such as re-weighting control units to have a similar covariate distribution to the treated units. Balancing weights have become increasingly common within causal inference, with better finite sample properties than traditional inverse propensity score weighting (IPW). See Ben-Michael et al. (2020b) for a recent review.

Our proposed sensitivity analysis framework adapts the percentile bootstrap sensitivity analysis that Zhao et al. (2019) develop for traditional IPW. Specifically, for a specified sensitivity parameter, we compute the upper and lower bounds of our estimator for each bootstrap sample, and then form a confidence interval using percentiles across bootstrap samples. We prove that this approach yields valid confidence intervals for our proposed sensitivity analysis procedure over a broad class of balancing weights estimators.

To make the sensitivity analysis more interpretable, we propose a new amplification that expresses the error from confounding in terms of: (1) the imbalance in observed and unobserved covariates; and (2) the strength of the relationship between the outcome and the imbalanced covariates.
Researchers can then relate the results of our amplification to estimates from observed covariates. We demonstrate this approach via a numerical illustration and via several applications.

We consider an observational study setting with independently and identically distributed data (Y_i, X_i, Z_i), i ∈ {1, ..., n}, drawn from some joint distribution P(·), with outcome Y_i ∈ ℝ, covariates X_i ∈ 𝒳, and treatment assignment Z_i ∈ {0, 1}. We posit the existence of potential outcomes: the outcome had unit i received the treatment, Y_i(1), and the outcome had unit i received the control, Y_i(0) (Neyman, 1923; Rubin, 1974). We assume stable treatment and no interference between units (Rubin, 1980), so the observed outcome is Y_i = (1 − Z_i) Y_i(0) + Z_i Y_i(1). Our primary estimand of interest is the Population Average Treatment Effect (PATE):

τ = E[Y(1) − Y(0)] = µ_1 − µ_0,   (1)

where µ_1 = E[Y(1)] and µ_0 = E[Y(0)]. To simplify the exposition, we will focus on estimating µ_1; estimating µ_0 is symmetric. We consider alternative estimands in Appendix C.

A common set of identification assumptions in this setting, known as strong ignorability, assumes that conditioning on the covariates X sufficiently removes confounding between treatment Z and the potential outcomes Y(0), Y(1), and that treatment assignment is not deterministic given X (Rosenbaum and Rubin, 1983b).

Assumption 1 (Ignorability). Y(0), Y(1) ⊥⊥ Z | X.

Assumption 2 (Overlap). The propensity score π(x) ≡ P(Z = 1 | X = x) satisfies 0 < π(x) < 1 for all x ∈ 𝒳.

Under Assumptions 1 and 2, we can non-parametrically identify µ_1 solely with the outcomes from units receiving treatment:

µ_1 = E[ ZY / π(X) ],
(2)

In an observational setting, the researcher does not know the true treatment assignment mechanism, π(x, y) ≡ P(Z = 1 | X = x, Y(1) = y), which in general can depend on both the covariates X and the potential outcomes Y(1) and Y(0). A rich literature assesses the sensitivity of estimates to violations of the ignorability assumption. This approach dates back at least to Cornfield et al. (1959), who conducted a formal sensitivity analysis of the effect of smoking on lung cancer. More recent examples of sensitivity analysis include Rosenbaum and Rubin (1983a), Rosenbaum (2002), VanderWeele and Ding (2017), Franks et al. (2019), and Cinelli and Hazlett (2020). See Hong et al. (2020) for a recent discussion of weighting-based sensitivity methods.

Our proposed approach builds most directly on that of Zhao et al. (2019), who use the percentile bootstrap and linear programming to perform a sensitivity analysis for traditional IPW. Following their setup, we split the problem into two parts: sensitivity for the mean of the treated potential outcomes and sensitivity for the mean of the control potential outcomes; without loss of generality, we consider the mean of the treated potential outcomes. Since unbiased estimation of E[Y(1)] requires knowledge only of π(x, y) = P(Z = 1 | X = x, Y(1) = y) rather than the full propensity score that also conditions on Y(0), we can rewrite Assumption 1 as π(x, y) = π(x). For details on combining sensitivity analyses for E[Y(1)] and E[Y(0)] into a single sensitivity analysis for the ATE, see Section 5 of Zhao et al. (2019).

We now introduce a sensitivity model that relaxes the ignorability assumption so that the odds ratio between the two conditional probabilities π(x) and π(x, y) is bounded.

Assumption 3 (Marginal sensitivity model). For Λ ≥ 1, the true propensity score satisfies

π(x, y) ∈ E(Λ) ≡ { π(x, y) ∈ (0,
1) : Λ^{−1} ≤ OR(π(x), π(x, y)) ≤ Λ },

where OR(p_1, p_2) = [p_1/(1 − p_1)] / [p_2/(1 − p_2)] is the odds ratio.

Here, Λ is a sensitivity parameter, quantifying the difference between the true propensity score π(x, y) and the probability of treatment given X = x, π(x); when Λ = 1, the two probabilities are equivalent, and Assumption 1 holds. If, for example,
Λ = 2, Assumption 3 constrains the odds ratio between π(x) and π(x, y) to lie between 1/2 and 2. The modeled estimate of the propensity score ˆπ(x) could differ from the true treatment probabilities π(x, y) for many reasons, including model misspecification and unobserved confounding; the marginal sensitivity model in Assumption 3 is agnostic to the source of these differences.

Again following Zhao et al. (2019), we will consider an equivalent characterization of the set E(Λ) in terms of the log odds ratio h(x, y) = log OR(π(x), π(x, y)):

H(Λ) = { h : 𝒳 × ℝ → ℝ : ‖h‖_∞ ≤ log Λ },   (3)

where ‖h‖_∞ = sup_{x∈𝒳, y∈ℝ} |h(x, y)| is the supremum norm. For a particular h ∈ H(Λ), we can write the shifted inverse propensity score as

1/π^(h)(x, y) = 1 + (1/π(x) − 1) e^{h(x,y)},

and the shifted estimand as

µ_1^(h) = E[ Z/π^(h)(X, Y) ]^{−1} · E[ ZY/π^(h)(X, Y) ].   (4)

Under the marginal sensitivity model in Assumption 3, we then have a non-parametric partial identification bound, inf_{h∈H(Λ)} µ_1^(h) ≤ µ_1 ≤ sup_{h∈H(Λ)} µ_1^(h). In Section 3, we will construct confidence intervals that cover this partial identification set. In Section 4, we will consider a finite sample analog of the marginal sensitivity model in order to amplify, interpret, and calibrate our sensitivity analyses.

We estimate µ_1 via a weighted average of treated units' outcomes using weights ˆγ:

ˆµ_1 = (1/n) Σ_{i=1}^n Z_i ˆγ_i Y_i,   (5)

where Σ_{i=1}^n Z_i = n_1. Under strong ignorability (Assumptions 1 and 2), traditional Inverse Propensity Score Weighting (IPW) first models the propensity score, ˆπ(x), directly and then sets the weights to be ˆγ_i = 1/ˆπ(X_i). Thus, ˆµ_1 is a plug-in version of Equation (2).
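As a concrete illustration, here is a minimal sketch of the plug-in weighting estimator in Equation (5) on synthetic data; the data-generating process, and the use of the true propensity score in place of a fitted model ˆπ(x), are our own simplifying assumptions for illustration, not part of the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Synthetic observational data with a known propensity score
X = rng.normal(size=n)
pi = 0.1 + 0.8 / (1 + np.exp(-X))      # true propensity, bounded in (0.1, 0.9) for overlap
Z = rng.binomial(1, pi)
Y1 = 2.0 + X + rng.normal(size=n)      # treated potential outcome, so mu_1 = E[Y(1)] = 2
Y = np.where(Z == 1, Y1, 0.0)          # control outcomes play no role in estimating mu_1

# Plug-in IPW weights (here the oracle 1/pi stands in for a fitted 1/pi_hat)
gamma = 1.0 / pi
mu1_hat = np.mean(Z * gamma * Y)       # Equation (5)
print(mu1_hat)                         # approximately 2
```

With a well-specified propensity model the estimate recovers µ_1; the poor behavior under limited overlap discussed next comes from weights 1/ˆπ blowing up as ˆπ approaches 0.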
This approach can perform poorly in moderate to high dimensions or when there is poor overlap and either π(x) or ˆπ(x) is near 0 or 1 (Kang et al., 2007).

Balancing weights, by contrast, directly optimize for covariate balance; recent proposals include Hainmueller (2012); Zubizarreta (2015); Athey et al. (2018); Wang and Zubizarreta (2019); Hirshberg et al. (2019); Tan (2020), and such weights have a long history in survey calibration for non-response (Deville and Särndal, 1992; Deville et al., 1993). See Chattopadhyay et al. (2020) and Ben-Michael et al. (2020b) for recent reviews.

Most balancing weights estimators attempt to control the imbalance between the weighted treated sample and the full sample in some transformation of the covariates φ : 𝒳 → ℝ^d. For example, Zubizarreta (2015) proposes stable balancing weights (SBW) that find weights ˆγ that solve

min_{γ∈ℝ^n} Σ_{Z_i=1} γ_i²
subject to ‖ (1/n) Σ_{i=1}^n Z_i γ_i φ(X_i) − (1/n) Σ_{i=1}^n φ(X_i) ‖_∞ ≤ λ,
γ_i ≥ 0 ∀ i.   (6)

These are the weights of minimum variance that guarantee approximate balance: the worst-case imbalance in φ, the transformed covariates, is less than some hyper-parameter λ. There are many other choices of both the penalty on the weights and the measure of imbalance.¹ For instance, in low dimensions, setting λ = 0 guarantees exact balance on the covariates φ(X_i). Here we focus on the more common case in which achieving exact balance is infeasible; in that case, the particular choice of penalty function is less important.

¹ Other possibilities include soft balance penalties rather than hard constraints (e.g. Ben-Michael et al., 2020a; Keele et al., 2020) and non-parametric measures of balance (e.g. Hirshberg et al., 2019).

The balancing weights procedure is connected to the modeled IPW approach above through the Lagrangian dual formulation of optimization problem (6). The imbalance in the d transformations of the covariates is controlled by a vector of dual variables
β ∈ ℝ^d, and the Lagrangian dual is

min_{β∈ℝ^d}  (1/2n) Σ_{i=1}^n Z_i [β · φ(X_i)]_+²  −  (1/n) Σ_{i=1}^n β · φ(X_i)   [balancing loss]   +  λ ‖β‖_1   [regularization],   (7)

where [x]_+ = max{0, x}. The weights are recovered from the dual solution as ˆγ_i = [ˆβ · φ(X_i)]_+. As Zhao (2019) and Wang and Zubizarreta (2019) show, this is a regularized M-estimator of the propensity score when the inverse propensity score is of the form 1/π(x) = [β* · φ(x)]_+ for some true β*. Therefore, we can view β* · φ(x) as a natural parameter for the propensity score; different penalty functions will induce different link functions (see Wang and Zubizarreta, 2019). Similarly, different measures of balance will induce different forms of regularization on the propensity score parameters. In the succeeding sections, we will use this dual connection to show that the percentile bootstrap sensitivity procedure proposed by Zhao et al. (2019) for traditional IPW estimators in the marginal sensitivity model is valid with balancing weights estimators.

We now outline our procedure for extending the percentile bootstrap sensitivity analysis to balancing weights. We introduce the shifted balancing weights estimator, detail the bootstrap sampling procedure, and describe how to efficiently compute the confidence intervals. The key to constructing confidence intervals for the partial identification set will be to construct intervals for each sensitivity model h in the collection of sensitivity models H(Λ) in Equation (3). Each h represents a particular deviation from ignorability that remains in the set defined by the marginal sensitivity model. We show that the percentile bootstrap yields valid confidence intervals for each sensitivity model in H(Λ), resulting in a valid interval for the partial identification set.
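To make the dual connection concrete, the following sketch solves a problem of the form of the Lagrangian dual (7) by proximal gradient descent and recovers the weights as ˆγ_i = [ˆβ · φ(X_i)]_+. The synthetic data, step size, and iteration count are illustrative choices of ours, not part of the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 3
phi = rng.normal(loc=0.5, size=(n, d))              # transformed covariates phi(X_i)
Z = rng.binomial(1, 1 / (1 + np.exp(-phi[:, 0])))   # selection on the first covariate

target = phi.mean(axis=0)   # (1/n) sum_i phi(X_i): the full-sample means to match
lam = 0.01                  # balance tolerance, i.e. the dual l1 penalty
eta = 0.3                   # gradient step size

beta = np.zeros(d)
for _ in range(50_000):
    gamma = np.maximum(phi @ beta, 0.0)        # gamma_i = [beta . phi(X_i)]_+
    grad = (Z * gamma) @ phi / n - target      # gradient of the smooth balancing loss in (7)
    b = beta - eta * grad
    beta = np.sign(b) * np.maximum(np.abs(b) - eta * lam, 0.0)  # l1 proximal (soft-threshold) step

gamma = np.maximum(phi @ beta, 0.0)
imb0 = phi[Z == 1].mean(axis=0) - target       # imbalance before weighting
imb = (Z * gamma) @ phi / n - target           # imbalance after weighting: roughly within lam
```

At the dual optimum, the KKT conditions imply each coordinate of the weighted imbalance is within λ, which is exactly the approximate-balance guarantee of the primal problem (6).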
We provide guidance for interpreting our sensitivity analysis procedure in Section 4.

To construct the confidence intervals, we can first consider the case where we know the log odds function h(x, y) ∈ H(Λ). With h, we can shift the balancing weights estimator for the shifted estimand µ_1^(h) as

ˆµ_1^(h) = ( Σ_{Z_i=1} ˆγ_i^(h) )^{−1} Σ_{Z_i=1} ˆγ_i^(h) Y_i,   (8)

where ˆγ_i^(h) = 1 + (ˆγ_i − 1) e^{h(X_i, Y_i)} for i ∈ {i : Z_i = 1} are the shifted balancing weights. We then take B bootstrap samples of size n without conditioning on treatment assignment — so the number of units in the treatment and control groups may vary from sample to sample — and re-estimate the weights in each sample by solving the balancing weights optimization problem (6) using the bootstrapped data.

Then, for every h ∈ H(Λ), we can construct a confidence interval for µ_1^(h) using the percentile bootstrap as

[ L^(h), U^(h) ] = [ Q_{α/2}( ˆµ*_{1,b}^(h) ), Q_{1−α/2}( ˆµ*_{1,b}^(h) ) ].   (9)

Here Q_q( ˆµ*_{1,b}^(h) ) is the q-quantile of ˆµ*_{1,b}^(h) in the bootstrap distribution made up of the B bootstrap samples, and ˆµ*_{1,b}^(h) is the shifted balancing weights estimator (8) using bootstrap sample b ∈ {1, ..., B}; the * in ˆµ*_{1,b}^(h) indicates an estimate from bootstrap data, and b indexes the B bootstrap samples. The following theorem states that [L^(h), U^(h)] is an asymptotically valid confidence interval for µ_1^(h) with at least (1 − α)-coverage, under high-level assumptions in Appendix A.1 on how well the balancing weights estimate the propensity scores.

Theorem 1.
Under Assumption 4 in Appendix A.1, for every h ∈ H(Λ),

lim sup_{n→∞} P( µ_1^(h) < L^(h) ) ≤ α/2   and   lim sup_{n→∞} P( µ_1^(h) > U^(h) ) ≤ α/2,

where P denotes the probability under the joint distribution of the data P(·).

Since each of the confidence intervals [L^(h), U^(h)] is valid, we can use the union method to combine them into a single valid confidence interval [L_union, U_union] for µ_1 under Assumption 3, where

L_union = inf_{h∈H(Λ)} L^(h),   U_union = sup_{h∈H(Λ)} U^(h).   (10)

Finding [L_union, U_union] would require conducting a grid search over the space of log-odds functions H(Λ) and computing percentile bootstrap confidence intervals at each point; this is computationally infeasible. Instead, we can obtain a confidence interval [L, U] for µ_1 by using generalized minimax and maximin inequalities as

[L, U] = [ Q_{α/2}( inf_{h∈H(Λ)} ˆµ*_{1,b}^(h) ), Q_{1−α/2}( sup_{h∈H(Λ)} ˆµ*_{1,b}^(h) ) ].   (11)

Zhao et al. (2019) show that this interval will be conservative, in the sense of being too wide, since L ≤ L_union and U ≥ U_union.

The extrema of the point estimates can be solved efficiently by the following linear fractional programming problem:

min / max_{r∈ℝ^n}  ˆµ_1^(h) = [ Σ_{i=1}^n Z_i (1 + r_i (ˆγ_i − 1)) Y_i ] / [ Σ_{i=1}^n Z_i (1 + r_i (ˆγ_i − 1)) ]
subject to r_i ∈ [Λ^{−1}, Λ] for all i ∈ {1, ..., n},   (12)

where r_i = OR{ π(X_i), π(X_i, Y_i) } are the decision variables. The procedure to obtain the confidence interval [L, U] is then:

Step 1.
Obtain B bootstrap samples of the data of size n without conditioning on treatment assignment. Step 2.
For each bootstrap sample b = 1, ..., B, re-estimate the weights and compute the extrema inf_{h∈H(Λ)} ˆµ*_{1,b}^(h) and sup_{h∈H(Λ)} ˆµ*_{1,b}^(h) under the collection of sensitivity models H(Λ) by solving (12).

Step 3.
Obtain valid confidence intervals for the sensitivity analysis:

L = Q_{α/2}( inf_{h∈H(Λ)} ˆµ*_{1,b}^(h) ),   U = Q_{1−α/2}( sup_{h∈H(Λ)} ˆµ*_{1,b}^(h) ).   (13)

Replacing ˆγ_i in Equation (12) with the inverse of propensity scores estimated by a generalized linear model recovers the procedure from Zhao et al. (2019).

Finally, a researcher must compute a sensitivity value for a given study; see Rosenbaum (2002) for extensive discussion. Suppose the confidence interval for the PATE under ignorability (Λ = 1) does not contain zero, indicating a statistically significant effect. As Λ increases, allowing for stronger violations of ignorability, the confidence interval will widen and eventually cross zero. Of particular interest then is the minimum value of Λ for which the confidence interval contains zero; we denote this value as Λ*. Thus, we can interpret Λ* as the minimum difference in the odds ratio between the probability of treatment with and without conditioning on the treated potential outcome for which we no longer observe a significant treatment effect. This represents the degree of confounding required to change a study's causal conclusions, with larger values of Λ* representing more robust estimates.

Sensitivity analysis may also be useful in cases where the confidence interval under Λ = 1 is very small and includes zero, indicating no large effect in any direction. In this setting, a researcher may obtain a sensitivity value Λ* by defining a minimal effect size ι > 0 of practical interest and repeating the sensitivity analysis for larger and larger values of Λ until the confidence interval includes either −ι or ι, revealing the degree of confounding needed to mask a practically important effect. For examples of such sensitivity analyses for equivalence results, see Pimentel et al. (2015); Pimentel and Kelz (2020).
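The extrema in Step 2 can be computed without a generic LP solver: for fixed t, the optimal r_i in (12) is bang-bang, and the extremal ratio is the root of a monotone function of t, which bisection finds quickly. The sketch below is our own implementation under the simplifying assumption ˆγ_i ≥ 1 (so every shifted weight stays positive); for brevity, the Step 1–3 loop reuses fixed weights across bootstrap samples rather than re-solving (6) in each sample as the procedure prescribes.

```python
import numpy as np

def shifted_extrema(y, gamma, lam):
    """Extrema of sum(w*y)/sum(w) over w_i = 1 + r_i*(gamma_i - 1), r_i in [1/lam, lam].
    Assumes gamma_i >= 1, so every feasible w_i is positive."""
    a = gamma - 1.0

    def solve(maximize):
        def g(t):
            # best-case bang-bang choice of r_i for each term of sum w_i * (y_i - t)
            take_big = a * (y - t) > 0 if maximize else a * (y - t) < 0
            r = np.where(take_big, lam, 1.0 / lam)
            return np.sum((1 + r * a) * (y - t))
        lo, hi = y.min(), y.max()
        for _ in range(100):                 # g is decreasing in t; bisect for its root
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
        return 0.5 * (lo + hi)

    return solve(False), solve(True)

# Demo on synthetic treated-unit data
rng = np.random.default_rng(0)
m = 300
y = rng.normal(size=m) + 1.0
gamma = 1.0 / rng.uniform(0.2, 0.9, size=m)   # weights >= 1, as with 1/pi-type weights

hajek = np.sum(gamma * y) / np.sum(gamma)
lo1, hi1 = shifted_extrema(y, gamma, 1.0)     # Lambda = 1 recovers the point estimate
lo2, hi2 = shifted_extrema(y, gamma, 2.0)     # Lambda = 2 widens the range

# Simplified percentile bootstrap (Steps 1-3), holding the weights fixed
B, alpha = 200, 0.05
los, his = [], []
for _ in range(B):
    idx = rng.integers(0, m, size=m)
    l, h = shifted_extrema(y[idx], gamma[idx], 2.0)
    los.append(l); his.append(h)
L = np.percentile(los, 100 * alpha / 2)
U = np.percentile(his, 100 * (1 - alpha / 2))
```

With Λ = 1 the interval of extrema collapses to the Hájek point estimate, and increasing Λ widens both the extrema and the resulting percentile interval, mirroring Figures 2a and 3a.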
In this section, we provide guidance for interpreting the main sensitivity parameter Λ* by "amplifying" the sensitivity analysis into a constraint on the product of: (1) the level of remaining imbalance in confounders; and (2) the strength of the relationship between the imbalanced confounders and the treated potential outcome.

As a first step in the amplification procedure, we introduce a finite sample analog to the marginal sensitivity model in Assumption 3. Importantly, the sensitivity analysis procedure for the marginal sensitivity model outlined in Section 3 remains the same; we introduce this extension in order to amplify, interpret, and calibrate the sensitivity analysis outlined above.

Recall that the marginal sensitivity model constrains the difference between the true probability of treatment, conditional on the treated potential outcome and the covariates, and the propensity score that conditions only on the covariates. The true propensity score guarantees that, in expectation, the inverse probability weighted outcomes for the treated group are equal to µ_1, i.e., that

E[Y(1)] = E[ ZY / π(X, Y) ].   (14)

However, this does not guarantee that the two quantities will be the same in finite samples. In that case, we can instead consider oracle weights ˚γ_i that guarantee equality between the weighted average of treated group outcomes and the sample average treated outcome ˚µ_1 = (1/n) Σ_{i=1}^n Y_i(1):

min_{γ∈ℝ^n}  Σ_{Z_i=1} γ_i log( γ_i / ˆγ_i )
subject to  ( Σ_{Z_i=1} γ_i )^{−1} Σ_{Z_i=1} γ_i Y_i(1) = (1/n) Σ_{i=1}^n Y_i(1),   (15)

where ˆγ_i are the estimated weights from balancing the observed covariates.

These oracle weights satisfy two key properties. First, they exactly balance the treated potential outcomes between all units and the weighted treated units. Second, they are as close (in terms of entropy) as possible to the estimated weights.
The corresponding oracle weight estimator of the complete data mean ˚µ_1 is then:

˚µ_1 = Σ_{Z_i=1} [ ˚γ(X_i, Y_i) / Σ_{Z_j=1} ˚γ(X_j, Y_j) ] Y_i.   (16)

Extending the population sensitivity model in Assumption 3, we can now define an analogous finite sample sensitivity model that bounds the difference between the estimated weights and the oracle weights within the sample rather than in the population. For Λ ≥ 1, we consider the set of oracle weights ˚γ that satisfy:

E_˚γ(Λ) = { ˚γ : Λ^{−1} ≤ (˚γ_i − 1)/(ˆγ_i − 1) ≤ Λ, ∀ i = 1, ..., n }.   (17)

Rather than bounding the difference between the true probability of treatment π(x, y) and the propensity score conditioned only on the covariates π(x) in the population, we bound the difference between the estimated and oracle weights in the sample. We can therefore think of this model as the finite sample analogue of the super-population marginal sensitivity model; it is thus more consistent with notions of sensitivity common in the matching literature (e.g., Rosenbaum, 2002).

In order for a confounder to bias causal effect estimates, it must be associated with both the treatment and the outcome. An "amplification" enhances a sensitivity analysis's interpretability by allowing a researcher to instead interpret the results of the sensitivity analysis in terms of two parameters: one controlling the confounder's relationship with the treatment and the other controlling its relationship with the outcome (Rosenbaum and Silber, 2009). In our finite sample sensitivity model, the parameter Λ controls how far the estimated weights can be from oracle weights that exactly balance the treated potential outcome.
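Problem (15) has a convenient structure: its solution is an exponential tilt of the estimated weights, ˚γ_i ∝ ˆγ_i exp(η Y_i(1)), with the scalar η chosen so the tilted weighted mean hits the target. The sketch below is our own construction, on simulated data where every Y_i(1) is known — which is exactly what makes these weights "oracle" — and finds η by bisection.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
pi = 1 / (1 + np.exp(-X))
Z = rng.binomial(1, pi)
Y1 = X + rng.normal(size=n)          # treated potential outcomes, known only in simulation

y = Y1[Z == 1]
gamma_hat = 1.0 / pi[Z == 1]         # stand-in for estimated balancing weights
target = Y1.mean()                   # (1/n) sum_i Y_i(1): the constraint in (15)

def tilted_mean(eta):
    w = gamma_hat * np.exp(eta * (y - y.mean()))   # centering y for numerical stability
    return np.sum(w * y) / np.sum(w)

# tilted_mean is increasing in eta; bisect for the eta that matches the target
lo, hi = -30.0, 30.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if tilted_mean(mid) < target else (lo, mid)
eta = 0.5 * (lo + hi)

oracle = gamma_hat * np.exp(eta * (y - y.mean()))  # the oracle weights, up to scale
```

The tilted weights exactly balance the treated potential outcome (the first oracle property) while staying as close in entropy to ˆγ as the constraint allows (the second).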
To aid interpretation, we propose an amplification that expresses the results of our procedure in terms of the imbalance in confounders and the strength of the relationship between the confounders and the treated potential outcome.

To start, we define our error of interest, ˚µ_1 − ˆµ_1, to be the difference between the estimates of the complete data mean µ_1 using oracle weights and estimated weights. Therefore, this error represents the difference between what we would like to have in a finite sample, an estimate using weights that exactly balance the potential outcome under treatment, and what we have, an estimate using weights that only balance observed covariates. We can write the difference between the average treated potential outcome in the sample and ˆµ_1 in terms of the imbalance in the treated potential outcome Y(1):²

˚µ_1 − ˆµ_1 = Σ_{Z_i=1} [ ˚γ(X_i, Y_i) / Σ_{Z_j=1} ˚γ(X_j, Y_j) − ˆγ(X_i)/n ] Y_i(1).

To relate imbalance in Y(1) to observable quantities, we decompose Y(1) into two parts: (1) the linear projection onto covariates that the estimated balancing weights perfectly balance; and (2) the residual. Specifically, let U be the projection of the re-weighted covariates onto Z and let W be the orthogonal component. Therefore, U represents the parts of the observed and unobserved covariates that are not exactly balanced and W represents the parts that are exactly balanced. We write this decomposition as Y(1) = Wβ_w + Uβ_u; this linear model merely serves as a guide to interpretation, rather than a true relationship we are assuming in the primary causal analysis. One way to reason about the covariates that are not exactly balanced is as follows. Consider an imbalanced covariate A, and run a linear regression of A on treatment assignment Z. The fitted values would be included in U and the residuals in W, since the residuals represent the part of A that is linearly independent of Z. Because the W are exactly balanced by construction, they do not introduce any error. Finally, in the numerical examples (Section 5), we focus on standardized covariates.

With this decomposition, we can write the error as a product of two terms:

˚µ_1 − ˆµ_1 = β_u · ( (1/n) Σ_{i=1}^n U_i − (1/n) Σ_{Z_i=1} ˆγ_i U_i ) ≡ β_u · δ_u,   (18)

where δ_u is the imbalance in U. As in Section 3 above, we can use the fractional linear program (12) to find upper and lower bounds for the error in Equation (18):

( inf_{h∈H(Λ)} ˆµ_1^(h) ) − ˆµ_1 ≤ ˚µ_1 − ˆµ_1 ≤ ( sup_{h∈H(Λ)} ˆµ_1^(h) ) − ˆµ_1.   (19)

Therefore, we can constrain the product δ_u · β_u:

( inf_{h∈H(Λ)} ˆµ_1^(h) ) − ˆµ_1 ≤ ˚µ_1 − ˆµ_1 = δ_u · β_u ≤ ( sup_{h∈H(Λ)} ˆµ_1^(h) ) − ˆµ_1.   (20)

Now, for any value of our sensitivity parameter Λ, we can compute a corresponding bound on the error, ˚µ_1 − ˆµ_1, and can use this bound to decompose the error into different values of δ_u and β_u.

In practice, as in Equation (20), we first bound the error via the extrema under the balancing weights sensitivity model. We then set the error equal to the maximum absolute value of the upper and lower bounds in Equation (20) for Λ = Λ*; this value is the maximum absolute error possible under the balancing weights sensitivity model. Finally, we compute a curve that maps the value of the error to different combinations of δ_u and β_u for enhanced interpretation. For example, any pair with δ_u · β_u = 3, such as (δ_u, β_u) = (1, 3), is consistent with error ˚µ_1 − ˆµ_1 = 3. In Section 5, we illustrate our sensitivity analysis procedure and show how our amplification can produce more interpretable results.

² This definition of error as the difference between two sample estimates is similar to Cinelli and Hazlett (2020)'s formulation of bias in their sensitivity analysis framework for linear regression models.

Numerical examples
We now illustrate the sensitivity analysis and amplification procedures using two real data examples. We consider the situation in which a researcher uses balancing weights to estimate the Population Average Treatment Effect on the Treated (PATT) of a treatment on an outcome of interest; see Appendix C for an overview of the PATT in our setting. Based on domain knowledge, the researcher believes that the set of observed covariates includes most factors associated with the treatment assignment and the outcome, while leaving open the possibility that there remain relevant unobserved covariates.

To start, we compute Λ*, which represents the confounding required to alter a study's causal conclusions. To do so, we compute confidence intervals for a grid of values of Λ, starting with Λ = 1 and then considering larger values of Λ. If the confidence interval corresponding to Λ = 1 contains zero, then the effect estimate is not significant, even under ignorability. If the confidence interval for
Λ = 1 does not contain zero, increasing the value of Λ causes the confidence intervals to widen and eventually cross zero for some value of Λ. We set Λ* equal to the minimum value of Λ for which the confidence interval includes zero. Since the percentile bootstrap procedure induces randomness, this value of Λ* is computed with Monte Carlo error.

We fix the error equal to the maximum absolute value of the upper and lower bounds on the error in Equation (32). This value is the maximum absolute error possible under the balancing weights sensitivity model with Λ = Λ* and is therefore the error required to overturn the study's causal conclusion. We create contour plots with curves that map this particular value of the error to varying values of δ_u and β_u, allowing the error to be alternatively interpreted in terms of two sensitivity analysis parameters. We include standardized observed covariates on the contour plots, which serve as guides for reasoning about potential unobserved covariates. Blue points correspond to observed covariates with imbalance prior to weighting, while red points represent post-weighting imbalance. We view the post-weighting imbalance corresponding to the red points as a best-case scenario for potential unobserved covariates: in general, we expect to achieve better balance on the observed covariates that we directly target than on unobserved covariates. Conversely, the pre-weighting imbalance represented by the blue points may be more in line with our expectations for unobserved covariates.

Figure 1 illustrates three cases a researcher might encounter when making the contour plots. The first scenario, corresponding to the black curve, is when the error curve intersects with the shaded red region.
Weview this as indicative of results that are sensitive to violations of ignorability, as confounders comparableto the observed covariates after weighting can overturn the results.By contrast, we view the scenario depicted by the purple curve in Figure 1 as evidence that a studyis fairly robust. The horizontal and vertical dotted blue lines correspond to the maximum values amongobserved covariates of β u and pre-weighting imbalance, respectively. Since the curve is above and to theright of the intersection of the two lines, the effect estimate is robust to an unobserved confounder withstrength and pre-weighting imbalance as large as the maximum among the observed covariates.The final case occurs when the error curve lies in between the red region and the intersection of thetwo lines. The green curve in Figure 1 illustrates this scenario. We view this as an ambiguous resultthat requires additional domain knowledge to evaluate the feasibility of there being confounding of thismagnitude. Conditional on observing one of the cases outlined above, the actual value of Λ ∗ can provide10igure 1: Example contour plots. The black curve is the error for a highly sensitive study, green is ambiguous,and purple is robust. The blue and red points are observed covariates with imbalance before and afterweighting, respectively. The red shaded region is the convex hull of the set including the red points, theorigin, the point on the y-axis corresponding to the maximum β u value among the red points, and the pointon the x-axis corresponding to the maximum δ u value among the red points. The region represents theapproximate magnitude of the observed covariates with post-weighting imbalance.additional insight. For example, if we are in the ambiguous scenario depicted by the green curve in Figure1, we would be more skeptical of the study’s causal conclusion if Λ ∗ = 1 . 
rather than Λ ∗ = 2 , since smallerdifferences between the estimated and oracle weights could overturn the study’s results. We re-examine data analyzed by LaLonde (1986) from the National Supported Work Demonstration Program(NSW), a randomized job training program. Specifically, we use the subset of data from Dehejia and Wahba(1999) to form a treatment group and observational data from the Current Population Survey–Social SecurityAdministration file (CPS1) to form a control group. We consider estimating the effect of the job trainingprogram on 1978 real earnings. The covariates for each individual include their age, years of education, race,marital status, whether or not they graduated high school, and earnings and employment status in 1974 and1975. In total, there are 185 treated units and 15,992 control units.First, we use stable balancing weights in Equation (6) to estimate (cid:92)
the PATT. Our point estimate is in line with Wang and Zubizarreta (2019)'s estimate using slightly different approximate balancing weights. We then compute Λ*, which is close to one, indicating that even a slight difference between the estimated and oracle weights can negate the causal effect estimate. Figure 2a shows how the range of point estimates and the 95% confidence interval widen as Λ increases, with the confidence interval including zero for Λ*. The range of point estimates is obtained by computing the extrema of the point estimates for a particular Λ.

Figure 2: Sensitivity analysis results with the LaLonde data. (a) Point estimate and confidence intervals for the LaLonde data; dotted intervals are point estimate intervals and solid intervals are 95% confidence intervals. (b) Contour plot for the LaLonde data; the black curve is the error for Λ*, the blue and red points are observed covariates with imbalance before and after weighting, respectively, and the red shaded region represents the approximate magnitude of the observed covariates with post-weighting imbalance.

Figure 2b shows the contour plot for the LaLonde data. We observe that the error curve intersects the red region. Therefore, the effect estimate is not robust to a confounder with similar levels of imbalance and strength as those seen in the observed covariates after weighting. Based on this, we deem the estimated effect to be fairly sensitive to unmeasured confounding.

We now examine data analyzed by Zhao et al. (2018) and Zhao et al. (2019) from the National Health and Nutrition Examination Survey (NHANES) 2013–2014, containing information about fish consumption and blood mercury levels. We evaluate the sensitivity of estimating the effect of fish consumption on blood mercury levels using balancing weights. There are 234 treated units (consumption of more than 12 servings of fish or shellfish in the past month) and 873 control units (zero or one servings).
The outcome of interest is log(total blood mercury), measured in micrograms per liter; the covariates include gender, age, income, whether income is missing and imputed, race, education, smoking history, and the number of cigarettes smoked in the previous month.

To start, the stable balancing weights (6) estimate of the PATT is an increase of 2.1 in log(total blood mercury), and Λ* is approximately equal to 5.5 for the fish consumption data. We display the sensitivity analysis results for multiple values of Λ in Figure 3a.

Figure 3: Sensitivity analysis results with the fish data. (a) Point estimate and confidence intervals for the fish data; dotted intervals are point estimate intervals and solid intervals are 95% confidence intervals. (b) Contour plot for the fish data; the black curve is the error for Λ*, the blue and red points are observed covariates with imbalance before and after weighting, respectively, and the red shaded region represents the approximate magnitude of the observed covariates with post-weighting imbalance.

We observe that the confidence interval corresponding to no confounding (
Λ = 1) is far from zero and that the confidence interval for Λ* = 5.5 just begins to cross zero.

The contour plot (Figure 3b) for the fish data indicates an extremely robust causal effect estimate. As the error curve is far above the intersection of the dotted lines that represent the maximum strength and pre-weighting imbalance among the observed covariates, confounding significantly stronger than the observed covariates would be required to alter the causal conclusion. Comparing the contour plot for the fish data to the contour plot for the LaLonde data (Figure 2b), we observe a clear difference. While the error curve for the LaLonde data lies well within the range of the observed covariates, the error curve for the fish data is far from the observed covariates. From this contrast, we conclude that the causal effect estimate for the fish data is much more robust than that for the LaLonde data.

Balancing weights estimation is a popular approach for estimating treatment effects by weighting units to balance covariates. In this paper, we develop a framework for assessing the sensitivity of these estimators to unmeasured confounding. We then propose an amplification for enhanced interpretation and illustrate our method through real data examples.

We briefly outline potential directions for future work. First, we could extend our framework to include augmented balancing weights estimators, which use an outcome model to correct for bias due to inexact balance. Second, we could extend our sensitivity analysis framework to balancing weights in panel data settings. For example, we could adapt this framework to variants of the synthetic control method (Abadie and Gardeazabal, 2003; Ben-Michael et al., 2018), extending some recent proposals for sensitivity analysis from Firpo and Possebom (2018).

Additionally, Dorn and Guo (2021) recently proposed a modification to Zhao et al. (2019)'s procedure, using quantile balancing to obtain sharper sensitivity analysis intervals.
We could adapt their modification to our proposed balancing weights framework.

Finally, we could use our framework to provide guidance in the design stage of balancing weights estimators. When estimating treatment effects using balancing weights, researchers must make decisions including the specific dispersion function of the weights, the particular imbalance measure, and, in many cases, an acceptable level of imbalance. We could extend our sensitivity analysis procedure to help make these decisions to improve robustness and power in the presence of unmeasured confounding. For example, we could provide insight into the trade-off between achieving better (marginal) balance on a few covariates or worse balance on a richer set of covariates.

References
Abadie, A. and Gardeazabal, J. (2003). The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1):113–132.

Athey, S., Imbens, G. W., and Wager, S. (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623.

Ben-Michael, E., Feller, A., and Rothstein, J. (2018). The augmented synthetic control method. arXiv preprint arXiv:1811.04170.

Ben-Michael, E., Feller, A., and Rothstein, J. (2020a). Variation in impacts of letters of recommendation on college admissions decisions: Approximate balancing weights for treatment effect heterogeneity in observational studies.

Ben-Michael, E., Hirshberg, D., Feller, A., and Zubizarreta, J. (2020b). The balancing act for causal inference.

Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, pages 1196–1217.

Chattopadhyay, A., Hase, C. H., and Zubizarreta, J. R. (2020). Balancing versus modeling approaches to weighting in practice. Statistics in Medicine, 39(24):3227–3254.

Cinelli, C. and Hazlett, C. (2020). Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society: Series B, 82(1):39–67.

Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., and Wynder, E. L. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22(1):173–203.

Dehejia, R. H. and Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448):1053–1062.

Deville, J. C. and Särndal, C. E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382.

Deville, J. C., Särndal, C. E., and Sautory, O. (1993). Generalized raking procedures in survey sampling. Journal of the American Statistical Association, 88(423):1013–1020.

Dorn, J. and Guo, K. (2021). Sharp sensitivity analysis for inverse propensity weighting via quantile balancing. arXiv preprint arXiv:2102.04543.

Firpo, S. and Possebom, V. (2018). Synthetic control method: Inference, sensitivity analysis and confidence sets. Journal of Causal Inference, 6(2).

Franks, A., D'Amour, A., and Feller, A. (2019). Flexible sensitivity analysis for observational studies without observable implications. Journal of the American Statistical Association, pages 1–33.

Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46.

Hirshberg, D. A., Maleki, A., and Zubizarreta, J. (2019). Minimax linear estimation of the retargeted mean. arXiv preprint arXiv:1901.10296.

Hong, G., Yang, F., and Qin, X. (2020). Did you conduct a sensitivity analysis? A new weighting-based approach for evaluations of the average treatment effect for the treated. Journal of the Royal Statistical Society: Series A (Statistics in Society).

Kang, J. D. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539.

Keele, L., Ben-Michael, E., Feller, A., Kelz, R., and Miratrix, L. (2020). Hospital quality risk standardization via approximate balancing weights.

Klaassen, C. A. (1987). Consistent estimation of the influence function of locally asymptotically linear estimators. The Annals of Statistics, pages 1548–1562.

LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620.

Neyman, J. (1990 [1923]). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, pages 465–472.

Pimentel, S. D. and Kelz, R. R. (2020). Optimal tradeoffs in matched designs comparing US-trained and internationally trained surgeons. Journal of the American Statistical Association, 115(532):1675–1688.

Pimentel, S. D., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2015). Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. Journal of the American Statistical Association, 110(510):515–527.

Rosenbaum, P. R. (2002). Observational Studies. Springer.

Rosenbaum, P. R. and Rubin, D. B. (1983a). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological), 45(2):212–218.

Rosenbaum, P. R. and Rubin, D. B. (1983b). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Rosenbaum, P. R. and Silber, J. H. (2009). Amplification of sensitivity analysis in matched observational studies. Journal of the American Statistical Association, 104(488):1398–1405.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593.

Tan, Z. (2020). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika, 107(1):137–158.

VanderWeele, T. J. and Ding, P. (2017). Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine, 167(4):268–274.

Wang, Y. and Zubizarreta, J. R. (2019). Minimal approximately balancing weights: asymptotic properties and practical considerations. arXiv preprint arXiv:1705.00998.

Zhao, Q. (2019). Covariate balancing propensity score by tailored loss functions. Annals of Statistics, 47(2):965–993.

Zhao, Q., Small, D. S., and Bhattacharya, B. B. (2019). Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Zhao, Q., Small, D. S., and Rosenbaum, P. R. (2018). Cross-screening in observational studies that test many hypotheses. Journal of the American Statistical Association, 113(523):1070–1084.

Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922.
A Proofs
A.1 Proof of Theorem 1
Proof.
We prove that, after centering, the difference between the mean computed by estimating and evaluating the function $\gamma$ on bootstrap data and the mean computed by using the true function $\gamma$ and evaluating on the actual data is of order $n^{-1/2}$.

For simplicity, we consider estimating the population mean from an independent and identically distributed random sample with missing outcome data. For unit $i$, let $Y_i$ be the outcome, $X_i$ be a vector of observed covariates, and $Z_i$ be a response indicator, where $Z_i = 1$ if we observe unit $i$'s outcome and $Z_i = 0$ otherwise. We consider using the estimator
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \hat{\gamma}(X_i) Z_i Y_i$$
to estimate $\mu = E[Y] = E\left[\frac{ZY}{\pi(X)}\right] = E[\gamma(X) Z Y]$ from observed data $O_i = (X_i, Z_i, Y_i Z_i)$, $i = 1, \ldots, n$.

We sample split to make the proof and arguments simpler and more transparent (see Klaassen, 1987). The proof can equivalently be done without sample splitting, but we sample split to avoid the associated complexities. We split the data into two equally sized samples, $i = 1, \ldots, m$ and $i = m+1, \ldots, n$. For both samples, we take an iid bootstrap sample of size $m$ from the respective empirical distribution to obtain data $O^*_i = (X^*_i, Z^*_i, Y^*_i Z^*_i)$, $i = 1, \ldots, m$, and $O^*_i = (X^*_i, Z^*_i, Y^*_i Z^*_i)$, $i = m+1, \ldots, n$. Let $\hat{\gamma}^*$ denote an estimate of $\gamma$ using bootstrap data. We estimate $\hat{\gamma}^*(X)$ in one bootstrap sample and evaluate it in the other bootstrap sample. We then switch roles and take a weighted average of the two estimates proportional to $\sum_{i=1}^{m} Z^*_i$ in both bootstrap samples to obtain an efficient estimate. This sample splitting approach with reversing roles and averaging yields the same estimate as without sample splitting to order $o(n^{-1/2})$. We demonstrate this through simulation (see Appendix B). We examine the case where we evaluate on the bootstrap sample from the second half of the data and estimate $\hat{\gamma}^*(X)$ from the bootstrap sample from the first half.

We make the following mild assumptions on how $\hat{\gamma}$ is constructed:

Assumption 4. Assume that the function $\tilde{\gamma}_m : \mathcal{X}^m \times \{0,1\}^m \times \mathcal{X} \to \mathbb{R}_+$ is defined for all possible empirical distributions from bootstrap samples of the data and obtained as the solution $\tilde{\gamma}_1, \ldots, \tilde{\gamma}_m$ of (6) with the form
$$\min_{\tilde{\gamma}} \; \frac{1}{m} \sum_{i=1}^{m} Z_i \tilde{\gamma}_i^2 \quad \text{subject to} \quad \left\| \frac{1}{m} \sum_{i=1}^{m} Z_i \tilde{\gamma}_i \phi(X_i) - \frac{1}{m} \sum_{i=1}^{m} \phi(X_i) \right\|_\infty \leq \lambda, \qquad \tilde{\gamma}_i \geq 0 \;\; \forall i, \tag{21}$$
where $0 < \sum_{i=1}^{m} Z_i = m_1 < m$. We then let $\hat{\gamma}^*(x) = \tilde{\gamma}_m(X^*_1, \ldots, X^*_m, Z^*_1, \ldots, Z^*_m, x)$ and $\hat{\gamma}(x) = \tilde{\gamma}_m(X_1, \ldots, X_m, Z_1, \ldots, Z_m, x)$, estimated on the bootstrap sample of the first half of the data and the actual first half of the data, respectively, be such that:

1. $\tilde{\gamma}_m$ is uniformly bounded in $m$ and $x$.
2. $\sup_x \left| E\left[ \hat{\gamma}^*(x) \mid X_1, \ldots, X_m \right] - \hat{\gamma}(x) \right| = o_p(1)$.
3. $\sup_x \left| \hat{\gamma}(x) - \gamma(x) \right| = o_p(1)$. (Wang and Zubizarreta (2019)'s Theorem 2 proves that this condition holds for a hard $L_\infty$ balance constraint on the weights.)

These assumptions together imply that $\hat{\gamma}^*$ is uniformly consistent for $\gamma$. Assumption 4 verifies
$$E_1\left[ \left( \sup_x \left| \hat{\gamma}^*(x) - \gamma(x) \right| \right)^2 Y_{m+1}^2 Z_{m+1} \right] = E_1\left[ \left( \sup_x \left| \hat{\gamma}^*(x) - \gamma(x) \right| \right)^2 \right] E\left[ Y_{m+1}^2 Z_{m+1} \right] = o_p(1),$$
where $E_1$ denotes the conditional expectation given the first sample. Note that the conditions in Assumption 4 are stronger than needed and could be relaxed.

We proceed conditional on the first sample $O_i = (X_i, Z_i, Y_i Z_i)$, $i = 1, \ldots, m$, and the first bootstrap sample $O^*_i = (X^*_i, Z^*_i, Y^*_i Z^*_i)$, $i = 1, \ldots, m$. Therefore, $\hat{\gamma}^*$ is a completely known function. Let $E^*$ denote the conditional expectation of the second bootstrap sample given the actual second sample. To show that the bootstrap can be validly applied, we show that
$$\frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X^*_i) Z^*_i Y^*_i - E^*\left[ \frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X^*_i) Z^*_i Y^*_i \right] = \frac{1}{m} \sum_{i=m+1}^{n} \gamma(X_i) Z_i Y_i - E\left[ \gamma(X_{m+1}) Z_{m+1} Y_{m+1} \right] + o_p(n^{-1/2}). \tag{22}$$
Since
$$E^*\left[ \frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X^*_i) Z^*_i Y^*_i \right] = \frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X_i) Z_i Y_i,$$
then, by Theorem 2.1 from Bickel and Freedman (1981),
$$\frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X^*_i) Z^*_i Y^*_i - \frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X_i) Z_i Y_i \tag{23}$$
and
$$\frac{1}{m} \sum_{i=m+1}^{n} \left( \hat{\gamma}^*(X_i) Z_i Y_i - E\left[ \hat{\gamma}^*(X_{m+1}) Z_{m+1} Y_{m+1} \right] \right) \tag{24}$$
have the same limiting distribution. Since (23) and (24) have the same limiting distribution, it suffices to show, instead of (22), that the difference between the mean with the true $\gamma$ and the mean with $\hat{\gamma}^*$ estimated on the bootstrap data is of order $n^{-1/2}$. Therefore, we show
$$\frac{1}{m} \sum_{i=m+1}^{n} \hat{\gamma}^*(X_i) Z_i Y_i - E\left[ \hat{\gamma}^*(X_{m+1}) Z_{m+1} Y_{m+1} \right] = \frac{1}{m} \sum_{i=m+1}^{n} \gamma(X_i) Z_i Y_i - E\left[ \gamma(X_{m+1}) Z_{m+1} Y_{m+1} \right] + o_p(n^{-1/2}). \tag{25}$$
We have now reduced the problem to showing that the true function $\gamma$ can be replaced with $\hat{\gamma}^*$. To show this, we use properties of $\hat{\gamma}^*$ from Assumption 4. First, we let
$$\Delta(X_i, Y_i, Z_i) = \left( \hat{\gamma}^*(X_i) - \gamma(X_i) \right) Z_i Y_i - E\left[ \left( \hat{\gamma}^*(X_{m+1}) - \gamma(X_{m+1}) \right) Z_{m+1} Y_{m+1} \right].$$
Note that the difference between the terms on the left- and right-hand sides of (25) is equal to $\frac{1}{m} \sum_{i=m+1}^{n} \Delta(X_i, Y_i, Z_i)$. Additionally, note that $E[\Delta(X_i, Y_i, Z_i)] = 0$. Therefore,
$$E\left[ \left( \frac{1}{m} \sum_{i=m+1}^{n} \Delta(X_i, Y_i, Z_i) \right)^2 \right] = \frac{1}{m} E\left[ \Delta(X_{m+1}, Y_{m+1}, Z_{m+1})^2 \right].$$
Since $m = \Omega(n)$, by Assumption 4,
$$E\left[ \Delta(X_{m+1}, Y_{m+1}, Z_{m+1})^2 \right] = E\left[ \left( \left[ \hat{\gamma}^*(X_{m+1}) - \gamma(X_{m+1}) \right] Z_{m+1} Y_{m+1} \right)^2 \right] - \left( E\left[ \left[ \hat{\gamma}^*(X_{m+1}) - \gamma(X_{m+1}) \right] Z_{m+1} Y_{m+1} \right] \right)^2 \leq E\left[ \left( \left[ \hat{\gamma}^*(X_{m+1}) - \gamma(X_{m+1}) \right] Z_{m+1} Y_{m+1} \right)^2 \right] = o_p(1).$$
Therefore, (25) follows.
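To make the procedure that this result justifies concrete, the percentile bootstrap sensitivity analysis can be sketched in a few lines of code. This is a simplified illustration under assumptions not in the text: the fitted weights are treated as fixed attributes of each unit (rather than re-solved on every bootstrap sample), the estimand is a generic weighted mean, and the function names are our own. The per-sample extrema scale each weight by a factor in $[\Lambda^{-1}, \Lambda]$; the optimum of the resulting linear-fractional program is attained with every factor at a bound, switching at a single threshold in the sorted outcomes, so a scan over the $n + 1$ cut points suffices.

```python
import numpy as np

def extrema_weighted_mean(y, w, lam):
    """Extrema of sum(c*w*y)/sum(c*w) over per-unit factors c_i in [1/lam, lam].

    The optimum puts every c_i at a bound, switching at a single threshold
    in the sorted outcomes, so we scan all n + 1 cut points.
    Assumes the base weights w are positive.
    """
    order = np.argsort(y)
    y, w = y[order], w[order]
    lo, hi = w / lam, w * lam
    best_min, best_max = np.inf, -np.inf
    for k in range(len(y) + 1):
        # maximum: down-weight the k smallest outcomes, up-weight the rest
        c = np.concatenate([lo[:k], hi[k:]])
        best_max = max(best_max, c @ y / c.sum())
        # minimum: mirror image
        c = np.concatenate([hi[:k], lo[k:]])
        best_min = min(best_min, c @ y / c.sum())
    return best_min, best_max

def percentile_bootstrap_interval(y, w, lam, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: per-sample extrema, then percentiles across samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    lowers, uppers = np.empty(n_boot), np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # iid resample of units
        lowers[b], uppers[b] = extrema_weighted_mean(y[idx], w[idx], lam)
    return (np.percentile(lowers, 100 * alpha / 2),
            np.percentile(uppers, 100 * (1 - alpha / 2)))
```

At $\Lambda = 1$ both extrema collapse to the ordinary weighted mean, and the interval widens monotonically in $\Lambda$ because the feasible set of perturbations grows.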
B Simulation for sample splitting
We conduct simulations to demonstrate the validity of the sample splitting technique that we use to prove Theorem 1 in Appendix A.1. We show that the bootstrap distributions for the balancing weights estimates of $\mu$ with and without sample splitting are quite similar.

The setup of the simulations is as follows. We draw 10,000 iid samples in which the covariates $X_1$ and $X_2$ are drawn from standard normal distributions, the treatment indicator $Z_i$ is a Bernoulli random variable whose probability is a linear function of $X_{1i}$ and $X_{2i}$ plus mean-zero Gaussian noise $\epsilon_i$, and the outcome $Y_i$ is a linear function of $Z_i$, $X_{1i}$, and $X_{2i}$ plus mean-zero Gaussian noise $\delta_i$. We run 1,000 simulations and estimate $\mu$ with and without sample splitting using weights obtained by entropy balancing with exact balance from Hainmueller (2012). We observe in Figure 4 that the bootstrap distributions of the estimates with and without sample splitting are comparable.

Figure 4: Bootstrap distributions of estimates of $\mu$ with the full data and with sample splitting

C Average treatment effect on the treated
In many settings, researchers are interested in estimating the Population Average Treatment Effect on the Treated (PATT):
$$\tau_T = E[Y(1) - Y(0) \mid Z = 1] = \mu_{11} - \mu_{01}, \tag{26}$$
where $\mu_{11} = E[Y(1) \mid Z = 1]$ and $\mu_{01} = E[Y(0) \mid Z = 1]$. Since $\mu_{11}$ is identifiable from observed data, we primarily focus on estimating $\mu_{01}$.

Our procedure for performing sensitivity analysis outlined in Section 3 largely still holds. The primary details that differ for the PATT are as follows. First, for a particular $h \in \mathcal{H}(\Lambda)$, we can write the shifted estimand as
$$\mu_{01}^{(h)} = E\left[ \frac{(1 - Z)\, \pi^{(h)}(X, Y)}{1 - \pi^{(h)}(X, Y)} \right]^{-1} E\left[ \frac{(1 - Z)\, \pi^{(h)}(X, Y)}{1 - \pi^{(h)}(X, Y)}\, Y \right]. \tag{27}$$
The corresponding shifted estimator for $\mu_{01}^{(h)}$ is
$$\hat{\mu}_{01}^{(h)} = \left( \sum_{Z_i = 0} e^{-h(X_i, Y_i)}\, \hat{\gamma}_i \right)^{-1} \sum_{Z_i = 0} e^{-h(X_i, Y_i)}\, \hat{\gamma}_i\, Y_i. \tag{28}$$

Additionally, the oracle weights now exactly balance the control potential outcome $Y(0)$ between the treatment group and the weighted control group. The oracle weights $\mathring{\gamma}(X, Y)$ solve the optimization problem
$$\min_{\gamma \in \mathbb{R}^{n_0}} \; \sum_{Z_i = 0} \gamma_i \log \frac{\gamma_i}{\hat{\gamma}_i} \quad \text{subject to} \quad \frac{1}{\sum_{Z_i = 0} \gamma_i} \sum_{Z_i = 0} \gamma_i Y_i(0) = \frac{1}{n_1} \sum_{Z_i = 1} Y_i(0), \tag{29}$$
where $\hat{\gamma}_i$ are the estimated weights from balancing the observed covariates. The balancing weights estimate of $\mu_{01}$ using the oracle weights is
$$\mathring{\mu}_{01} = \sum_{Z_i = 0} \frac{\mathring{\gamma}(X_i, Y_i)}{\sum_{Z_j = 0} \mathring{\gamma}(X_j, Y_j)}\, Y_i. \tag{30}$$

For $\Lambda \geq 1$, the balancing weights sensitivity model for the PATT considers the set of oracle weights $\mathring{\gamma}$ that satisfy
$$\mathcal{E}_{\mathring{\gamma}}(\Lambda) = \left\{ \mathring{\gamma} : \Lambda^{-1} \leq \frac{\hat{\gamma}_i}{\mathring{\gamma}_i} \leq \Lambda, \;\; \forall i = 1, \ldots, n \right\}. \tag{31}$$

Finally, the amplification becomes
$$\left( \inf_{h \in \mathcal{H}(\Lambda)} \hat{\mu}_{01}^{(h)} \right) - \hat{\mu}_{01} \leq \mathring{\mu}_{01} - \hat{\mu}_{01} = \delta_u \cdot \beta_u \leq \left( \sup_{h \in \mathcal{H}(\Lambda)} \hat{\mu}_{01}^{(h)} \right) - \hat{\mu}_{01}, \tag{32}$$
where
$$\inf / \sup_{h \in \mathcal{H}(\Lambda)} \hat{\mu}_{01}^{(h)} = \min / \max_{r \in \mathbb{R}^{n_0}} \; \frac{\sum_{Z_i = 0} r_i^{-1}\, \hat{\gamma}(X_i)\, Y_i}{\sum_{Z_i = 0} r_i^{-1}\, \hat{\gamma}(X_i)} \quad \text{subject to} \quad \Lambda^{-1} \leq r_i \leq \Lambda \tag{33}$$
and $r_i = \hat{\gamma}(X_i) / \mathring{\gamma}(X_i, Y_i)$.
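The oracle problem (29) is a Kullback–Leibler projection of the estimated weights onto a single mean constraint. Assuming the weights are normalized to sum to one, its solution is an exponential tilt, $\gamma_i \propto \hat{\gamma}_i e^{t Y_i}$, with the scalar $t$ chosen so the tilted weighted mean hits the target. A minimal sketch with a one-dimensional root search; the function name is our own, and the target must lie strictly inside the range of the outcomes:

```python
import numpy as np
from scipy.optimize import brentq

def tilt_to_mean(base_w, y, target):
    """KL-project normalized weights base_w onto {g : sum g_i y_i = target}.

    Solves min_g sum_i g_i log(g_i / base_w_i) subject to sum g_i = 1 and
    sum g_i y_i = target; the KKT solution is g_i proportional to
    base_w_i * exp(t * y_i) for a scalar t found by root search.
    """
    yc = y - y.mean()  # center the exponent for numerical stability
    def gap(t):
        g = base_w * np.exp(t * yc)
        g /= g.sum()
        return g @ y - target
    t = brentq(gap, -50.0, 50.0)  # assumes the root is bracketed here
    g = base_w * np.exp(t * yc)
    return g / g.sum()
```

In the oracle construction, `y` would be the control potential outcomes $Y_i(0)$ among the control units and `target` the treated-group mean of $Y(0)$.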