Sharp Sensitivity Analysis for Inverse Propensity Weighting via Quantile Balancing
Jacob Dorn† (Princeton University)    Kevin Guo (Stanford University)

February 10, 2021
Abstract
Inverse propensity weighting (IPW) is a popular method for estimating treatment effects from observational data. However, its correctness relies on the untestable (and frequently implausible) assumption that all confounders have been measured. This paper introduces a robust sensitivity analysis for IPW that estimates the range of treatment effects compatible with a given amount of unobserved confounding. The estimated range converges to the narrowest possible interval (under the given assumptions) that must contain the true treatment effect. Our proposal is a refinement of the influential sensitivity analysis by Zhao, Small, and Bhattacharya (2019), which we show gives bounds that are too wide even asymptotically. This analysis is based on new partial identification results for Tan (2006)'s marginal sensitivity model.
Estimating treatment effects from observational data is difficult because "treated" and "control" samples typically differ on many characteristics besides treatment status. One popular tool for managing this problem is inverse propensity weighting (IPW) [4, 15, 16], which re-weights treated and untreated samples to be similar along all observed characteristics and then compares outcomes in the weighted samples. The crucial assumption underlying this approach is that the weighted samples do not systematically differ along important unobserved characteristics. This "unconfoundedness" assumption is untestable, and often implausible. This paper studies how much can be learned when unconfoundedness does not hold, but one can bound the plausible degree of unobserved confounding. In particular, given a "sensitivity assumption" controlling the degree of selection, we aim to answer two questions:

(1)
Sensitivity analysis. Can we bound how much our estimates might change if unobserved confounding were properly accounted for?

(2)
Partial identification. Can we characterize the most informative bounds that could possibly be obtained from the sensitivity assumption with even an infinite amount of observational data?

The specific sensitivity assumption used in this paper is the "marginal sensitivity model" of Tan [44], which extends the famous Rosenbaum model [37, 38, 39, 48] from matched-pairs studies to IPW. This sensitivity assumption is quite popular in causal inference; see [20, 21, 22, 23, 27, 40, 44, 49] for an incomplete list of references. As we will see, it lends itself to computationally-efficient sensitivity analyses which are simple enough to explain to any practitioner comfortable with IPW.

Recently, Zhao, Small, and Bhattacharya [49] (hereafter ZSB) introduced an interpretable IPW sensitivity analysis for the marginal sensitivity model. Their approach, based on linear fractional programming, has been largely responsible for the recent resurgence of interest in this sensitivity assumption. However, they did not answer question (2), leaving open the possibility that more informative bounds could be obtained from the same data and assumptions.

∗ We are grateful for comments from Guillaume Basse, Michal Kolesár, Qingyuan Zhao, and many others who provided valuable input at earlier stages of this project. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2039656. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
† Corresponding author: [email protected]
Indeed, there are no existing partial identification results for the marginal sensitivity model which can be used to benchmark a sensitivity analysis.

The first main contribution of this paper is to provide a complete answer to the optimality question (2). We derive closed-form expressions for the largest and smallest values of the "usual" estimands (e.g. average treatment effect) compatible with the marginal sensitivity assumption. These expressions show that the ZSB bounds are essentially always too conservative because they ignore an infinite collection of constraints implied by the distribution of observed characteristics. Tan [44] also identified these constraints, but deemed it intractable to incorporate them all in a sensitivity analysis. In contrast, our partial identification results show that this collection can actually be reduced to a single constraint which is easy to incorporate.

Our second main contribution is to introduce a new IPW sensitivity analysis, which we call the quantile balancing method. The method has several desirable features:

(i) The quantile balancing sensitivity interval is always a subset of the ZSB interval. Outside of knife-edge cases, it is a strict subset.

(ii) When the outcome's conditional quantiles can be estimated consistently, the bounds converge to the best possible bounds that can be obtained under the marginal sensitivity model. In the language of partial identification, quantile balancing is "sharp."

(iii) Under standard assumptions for IPW inference, the bounds can be converted into confidence intervals using the same bootstrap scheme proposed in [49].

(iv) When the estimated quantiles are inconsistent, the sensitivity interval is too wide rather than too narrow and the confidence intervals over-cover rather than under-cover. In other words, our intervals are guaranteed to be valid, regardless of the quality of the additional input we demand.
To our knowledge, no other standard method for estimating sharp bounds has the same robustness to inconsistent estimation of nuisance parameters [48, 28].

We apply the quantile balancing method in several simulated examples and one real-data application, and find that it can substantially tighten the ZSB bounds when the covariates are good predictors of the outcome. One shortcoming we will mention is that our statistical guarantees assume the outcome is continuously distributed. This seems to be inevitable as our sensitivity analysis relies on quantile regression. Since our partial identification results also apply to discrete outcomes, we conjecture that the quantile balancing procedure could be modified to give sharp bounds in that setting too.

We consider the Neyman-Rubin potential outcomes model with a binary treatment [34, 41]. Experimental units (X_i, Y_i(0), Y_i(1), Z_i) are sampled i.i.d. from a probability distribution P. The real-valued random variables (Y_i(0), Y_i(1)) are called "potential outcomes," Z_i ∈ {0, 1} is a binary treatment assignment indicator, and X_i ∈ 𝒳 ⊆ R^d is a vector of covariates. The statistician only observes {(X_i, Y_i, Z_i)}_{i ≤ n}, where Y_i = Z_i Y_i(1) + (1 − Z_i) Y_i(0) is called the "observed outcome."

The goal is to use the observed data to draw inferences about a causal estimand ψ. For the purposes of exposition, we initially focus on the counterfactual means ψ_T = E[Y(1)] and ψ_C = E[Y(0)], although the examples of most practical interest are the average treatment effect (ATE) and the average treatment effect on the treated (ATT):

  ψ_ATE = E[Y(1) − Y(0)],    ψ_ATT = E[Y(1) − Y(0) | Z = 1].

With minor modification, our identification results can also be applied to more complex estimands, including weighted average treatment effects and policy values of the type considered in [3, 22].
However, we do not present those extensions in this paper.

Under the "unconfoundedness" assumption (Y(0), Y(1)) ⊥ Z | X, all of the above quantities can be estimated using inverse propensity weighting. IPW estimators "work" by reweighting the observed sample by (some function of) the propensity score e(x) := P(Z = 1 | X = x). For example, if the estimand of interest is ψ_T, the (stabilized) IPW estimate is given by (1):

  ψ̂_T = [ Σ_{i=1}^n Y_i Z_i / ê(X_i) ] / [ Σ_{i=1}^n Z_i / ê(X_i) ].   (1)

Here, ê(X_i) is an estimate of the propensity score e(X_i). Related estimators for the other estimands considered will be denoted by ψ̂_C, ψ̂_ATE, and ψ̂_ATT. See the articles by Austin and Stuart [4] or Hirano and Imbens [15] for their exact formulas.

We will assume some conditions which are required for identification and estimation under unconfoundedness: 0 < e(X) < 1 and E[|Y|] < ∞. However, we will not assume unconfoundedness.

The "marginal sensitivity model" introduced by Tan [44] is a relaxation of unconfoundedness which has been applied in many causal inference problems [20, 21, 22, 23, 27, 40, 44, 49]. This one-parameter sensitivity assumption allows for the existence of unobserved confounders U, but limits the degree of selection bias that can be attributed to these confounders.

Assumption 1. (Marginal sensitivity model)
There exists a vector of unmeasured confounders U that, if measured, would lead to unconfoundedness: (Y(0), Y(1)) ⊥ Z | (X, U). However, within each stratum of the observed covariates, measuring U can only change the odds of treatment by at most a factor of Λ, i.e. if we set e(x, u) := P(Z = 1 | X = x, U = u), then (2) holds with probability one:

  Λ^{-1} ≤ [ e(X, U) / (1 − e(X, U)) ] / [ e(X) / (1 − e(X)) ] ≤ Λ.   (2)

To avoid confusion between e(x, u) and e(x), we will follow [23] and refer to e(x, u) as the "true propensity score" and e(x) as the "nominal propensity score."

Like the famous Rosenbaum model [37, 38, 39, 48] for matched-pairs studies, Assumption 1 controls the degree of unobserved confounding with a single parameter Λ. When Λ = 1, measuring additional confounders cannot change the odds of treatment at all, i.e. treatment assignment is unconfounded. As Λ increases, stronger forms of confounding are allowed. For advice on how to choose this parameter, see Kallus and Zhou [22]. We remark that the marginal sensitivity assumption is "nonparametric" in the following sense: no assumptions are needed about how e(x, u) depends on u. The dimension of U does not even need to be specified.

To see how Assumption 1 can be used for sensitivity analysis, begin by considering how an oracle statistician who observes the confounders U_i might estimate ψ_T. One strategy would be to use the IPW estimator (3), which is consistent under weak assumptions:

  ψ̂*_T = [ Σ_{i=1}^n Y_i Z_i / e(X_i, U_i) ] / [ Σ_{i=1}^n Z_i / e(X_i, U_i) ].   (3)

In reality, {U_i}_{i ≤ n} are not observed, but under Assumption 1, it is possible to bound the true propensity scores e(X_i, U_i). In particular, the vector (e(X_1, U_1), ..., e(X_n, U_n)) must belong to the set E_n(Λ) defined in (4).
  E_n(Λ) = { ē ∈ R^n : Λ^{-1} ≤ [ ē_i / (1 − ē_i) ] / [ e(X_i) / (1 − e(X_i)) ] ≤ Λ for all i ≤ n }.   (4)

(Footnote: We have presented a slightly different version of the marginal sensitivity model from the one given in [44] and [49], which requires the unobserved confounder U to be one or both of the potential outcomes. Most of the results in this paper also apply under that assumption [35]. See Section 3.3 for more.)

ZSB propose to report as a sensitivity interval the smallest and largest values of the IPW estimate as the propensity scores range over E_n(Λ):

  [ψ̂_{T,ZSB}^-, ψ̂_{T,ZSB}^+] = [ min_{ē ∈ E_n(Λ)} Σ_{i=1}^n Y_i Z_i / ē_i / Σ_{i=1}^n Z_i / ē_i ,  max_{ē ∈ E_n(Λ)} Σ_{i=1}^n Y_i Z_i / ē_i / Σ_{i=1}^n Z_i / ē_i ].   (5)

Since the interval (5) contains the consistent estimator ψ̂*_T, the distance between the true estimand ψ_T and the sensitivity interval must tend to zero. ZSB show that this conclusion holds even if the nominal propensity score e(x) is replaced by a suitably consistent estimate ê(x) in the definition of E_n(Λ), which is important for practical applications as e(x) is typically not known in observational studies.

This simple idea is intuitive enough to explain to any practitioner who is comfortable with IPW and has been extended to estimands other than ψ_T. ZSB also consider ψ_ATE and ψ_ATT, and related work by [22, 20, 21, 27] takes the idea substantially further. Tan [44] applied a similar idea to a different propensity-score-based estimator, and [1, 31, 45] used this approach in survey sampling problems.
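Because the objective in (5) is a ratio of two linear functions of the weights and the constraint set is a box, each endpoint is attained at a vertex that gives the largest weights to the largest outcomes, so it can be computed exactly by scanning threshold splits after sorting. The sketch below is our own illustration, not ZSB's reference implementation; `w_lo` and `w_hi` are the per-unit bounds on 1/ē_i implied by the odds-ratio constraint, namely 1 + Λ^{∓1}(1 − ê_i)/ê_i.

```python
import numpy as np

def zsb_upper(Y, w_lo, w_hi):
    """Maximize sum(Y*w)/sum(w) over w_i in [w_lo_i, w_hi_i].

    A linear-fractional objective over a box is maximized at a vertex that
    assigns the large weights to the large outcomes, so it suffices to scan
    the n+1 threshold splits after sorting by Y.
    """
    o = np.argsort(Y)
    Y, lo, hi = Y[o], w_lo[o], w_hi[o]
    c_ylo = np.concatenate(([0.0], np.cumsum(Y * lo)))  # prefix sums, low weights
    c_yhi = np.concatenate(([0.0], np.cumsum(Y * hi)))  # prefix sums, high weights
    c_lo = np.concatenate(([0.0], np.cumsum(lo)))
    c_hi = np.concatenate(([0.0], np.cumsum(hi)))
    # split k: units below the threshold get w_lo, units above get w_hi
    num = c_ylo + (c_yhi[-1] - c_yhi)
    den = c_lo + (c_hi[-1] - c_hi)
    return np.max(num / den)

def zsb_interval(Y, Z, e_hat, lam):
    """ZSB-style sensitivity interval for the mean treated outcome."""
    Yt, et = Y[Z == 1], e_hat[Z == 1]
    w_lo = 1.0 + (1.0 / lam) * (1.0 - et) / et  # smallest allowed 1/e_bar
    w_hi = 1.0 + lam * (1.0 - et) / et          # largest allowed 1/e_bar
    return -zsb_upper(-Yt, w_lo, w_hi), zsb_upper(Yt, w_lo, w_hi)
```

At Λ = 1 the box collapses to a point and both endpoints reduce to the stabilized IPW estimate (1).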
The aforementioned works do not address the asymptotic optimality of the interval [ψ̂_{T,ZSB}^-, ψ̂_{T,ZSB}^+]. Does it converge to a limiting set containing all values of ψ_T compatible with Assumption 1 and no others? Sensitivity analyses with this asymptotic optimality property are called "sharp" in the partial identification literature.

Given its intuitive definition, it may be surprising to learn that the ZSB sensitivity analysis is actually not sharp. Indeed, it can be arbitrarily conservative. To illustrate this, we only need to consider a very simple joint distribution of observables:

  X ∼ N(0, σ²),   Z | X ∼ Bernoulli(1/2),   Y | X, Z ∼ N(X, 1).   (6)

Suppose that a data analyst receives i.i.d. samples (X_i, Y_i, Z_i) from this distribution and is willing to posit that Assumption 1 is satisfied with Λ = 2. Let φ(·) and z_τ denote the density and τ-th quantile of the standard normal distribution, respectively. The following proposition, which we prove in Appendix B.1, writes the set of values of ψ_T compatible with Assumption 1 explicitly in terms of these quantities and shows that this "partially identified" set does not coincide with the ZSB interval.

Proposition 1. Let (X_i, Y_i, Z_i) be i.i.d. samples from the joint distribution (6).

(i) The set of values of ψ_T compatible with the bound Λ = 2 and the distribution (6) is the interval [± (3/4) φ(z_{2/3})] ≈ [±0.27].

(ii) However, with probability one, [±0.25 √(σ² + 1)] ⊆ [ψ̂_{T,ZSB}^-, ψ̂_{T,ZSB}^+] for all large n.

The precise meaning of (i) is the following: for any ψ ∈ [± (3/4) φ(z_{2/3})], it is possible to construct a distribution P for the full data (X, Y(0), Y(1), Z, U) which marginalizes to (6), satisfies Assumption 1 with Λ = 2, and has E_P[Y(1)] = ψ. On the other hand, for any ψ not in this interval, it is impossible to construct such a distribution.

Proposition 1 implies that the ZSB interval typically includes many values of ψ which cannot possibly be reconciled with the data. The explanation for this conservatism is that the odds-ratio bound (2) does not capture all of the restrictions on the true propensity score e(X, U). Additional information can be found in the marginal distribution of the observed characteristics. For example, the putative propensity score ē(x, u) = 1/3 + (1/3)·1{x ≥ 0} certainly satisfies the odds-ratio bound (2) — and is therefore a possible value of ē in the problem (5) — but it could not possibly be the true propensity score. If it were, this would imply P(Z = 1 | X ≥ 0) = 2/3, while (6) demands that P(Z = 1 | X ≥ 0) = 1/2. In other words, this choice of ē is allowed in the domain of the ZSB optimization problem but is incompatible with the distribution of observed data.

This example suggests that it should be possible to improve upon the ZSB bounds by only optimizing over the subset of E_n(Λ) which is "data compatible." However, data compatibility can be difficult to work with, because the observed data distribution actually imposes an infinite number of constraints on ē. For example, the true propensity score e(X, U) "balances" all integrable functions h : 𝒳 → R:

  E[h(X) Z / e(X, U)] = E[h(X) E[Z | X, U] / e(X, U)] = E[h(X) e(X, U) / e(X, U)] = E[h(X)].   (7)

Every such h gives rise to a testable balancing constraint (7) which can be used to rule out "incompatible" values of ē:

  Σ_{i=1}^n h(X_i) Z_i / ē_i / Σ_{i=1}^n Z_i / ē_i ≈ E[h(X)].   (8)

This calculation shows that any sharp sensitivity analysis must contend with an infinite number of balancing constraints, which is typically computationally intractable [6, 13]. To deal with this problem, Tan [44] suggested a conservative approach that optimizes an IPW-type estimator subject to the constraint that the estimated weights exactly balance a finite collection of functions {h_j}_{j ≤ J}. However, he did not offer any guidance on how these functions should be chosen. In a similar problem, Tudball et al. [45] suggested optimization subject to an approximate balance condition on the observed covariates. Although both of these strategies may improve upon the ZSB interval (5), it is apparent that neither yields sharp bounds in general.

In this section, we show that it is possible to characterize the sharp bounds for ψ ∈ { ψ_T , ψ_C , ψ_ATT , ψ
ATE } without ignoring or relaxing any of the (infinitely many) balancing constraints on the true propensity score. The results are developed at the "population" level, so they belong to the domain of partial identification. We apply them to finite-sample sensitivity analysis in Section 4.

Our main partial identification results are stated in Section 3.1: for the estimands ψ_T, ψ_C, and ψ_ATT, the infinitely-many balancing constraints imposed by data-compatibility can be reduced to a single (carefully chosen) balancing constraint; for the estimand ψ_ATE, a similar reduction can be obtained. Section 3.2 explains how these dramatic simplifications are possible for ψ ∈ { ψ_T, ψ_C, ψ_ATT }: both the original infinitely-constrained problem and the relaxed single-constraint problem can be solved explicitly using Dantzig and Wald's [11] generalization of the Neyman-Pearson Lemma. The solutions are identical. A byproduct of this analysis is an explicit formula for the largest and smallest values of ψ compatible with Assumption 1. Section 3.3 extends this proof strategy to ψ = ψ_ATE, which requires a novel argument. All formal proofs in this paper are deferred to Appendix B.
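As a numerical sanity check on the example in Proposition 1 (our own sketch, assuming Λ = 2 and, without loss of generality, σ = 1 in distribution (6)): the worst-case propensities put the largest inverse-propensity weight, 1/ē = 3, on outcomes above the conditional 2/3-quantile of Y and the smallest, 1/ē = 1.5, below it, which satisfies the data-compatibility condition E[Z/ē | X] = 1 when e(x) = 1/2. Averaging Y Z/ē over simulated draws should then recover the sharp bound (3/4)φ(z_{2/3}):

```python
import numpy as np

# Monte Carlo check of the sharp upper bound for psi_T in example (6)
# with Lambda = 2, sigma = 1, nominal propensity e(x) = 1/2.
rng = np.random.default_rng(0)
n = 2_000_000
z23 = 0.430727                       # standard normal 2/3 quantile
X = rng.normal(0.0, 1.0, n)
Y = X + rng.normal(0.0, 1.0, n)      # Y | X ~ N(X, 1)
w = np.where(Y < X + z23, 1.5, 3.0)  # worst-case inverse propensities 1/e_bar
mc = 0.5 * np.mean(Y * w)            # E[Y Z / e_bar], using P(Z = 1 | X, Y) = 1/2
exact = 0.75 * np.exp(-z23**2 / 2) / np.sqrt(2 * np.pi)  # (3/4) * phi(z_{2/3})
print(mc, exact)
```

The per-covariate bound is X + (3/4)φ(z_{2/3}), so the X-part averages out and the result does not depend on σ, consistent with Proposition 1(i).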
First, we characterize the partially identified sets for ψ ∈ { ψ_T, ψ_C, ψ_ATT } in terms of optimization problems with only a single balancing constraint.

To state these results formally, we need a few pieces of additional notation. Let F(y | x, z) := P(Y ≤ y | X = x, Z = z) denote the conditional distribution function of Y and, for each τ ∈ (0, 1), define the τ-th conditional quantile function by Q_τ(x, z) := inf{ q : F(q | x, z) ≥ τ }. We also introduce a "population" version of the constraint set E_n(Λ), where the vectors ē in R^n are replaced by random variables Ē defined on the same probability space as (X, Y, Z):

  E_∞(Λ) := { Ē : Λ^{-1} ≤ [ Ē / (1 − Ē) ] / [ e(X) / (1 − e(X)) ] ≤ Λ with probability one }.   (9)

The following theorem shows that to compute the partially identified set for ψ_T, one can simply minimize and maximize E[Y Z / Ē] over the set of Ē in E_∞(Λ) which balance a particular conditional quantile of Y.

Theorem 1. (Optimal bounds for ψ_T)

For any Λ ≥ 1, the set of values of ψ_T compatible with the observed data distribution and Assumption 1 is a closed interval [ψ_T^-, ψ_T^+]. Moreover, if we define τ = Λ/(Λ + 1), then the interval endpoints solve (10) and (11):

  ψ_T^- = min_{Ē ∈ E_∞(Λ)} E[Y Z / Ē]  subject to  E[Q_{1−τ}(X, 1) Z / Ē] = E[Q_{1−τ}(X, 1)]   (10)
  ψ_T^+ = max_{Ē ∈ E_∞(Λ)} E[Y Z / Ē]  subject to  E[Q_τ(X, 1) Z / Ē] = E[Q_τ(X, 1)].   (11)

We will highlight a few important takeaways from this theorem. First, even if one adds additional balancing constraints of the form E[h(X) Z / Ē] = E[h(X)] in (10) and (11), the value of these problems will not change since the true propensity score e(X, U) remains feasible. Thus for the purposes of computing bounds, the quantile balancing constraints capture all the information in the observed data. Second, this result shows that the ZSB sensitivity analysis can only be sharp when the conditional quantiles of Y do not depend on X at all.
Since this is quite pathological, this suggests there is room for improvement over the ZSB method in almost all applications. Third, it suggests that a variant of the Tan [44] sensitivity analysis could be sharp when Q_τ(·, 1) and Q_{1−τ}(·, 1) are in the span of the "balanced" functions {h_j}_{j ≤ J}.

We can extend the theorem to other estimands. To bound ψ_C, exchange the labels "treated" and "control" and apply Theorem 1. Sharp bounds on ψ_C can be translated into sharp bounds on ψ_ATT using the relation ψ_ATT = E[Y | Z = 1] − (ψ_C − E[Y(1 − Z)]) / P(Z = 1).

Corollary 1. (Optimal bounds for ψ_C and ψ_ATT)

In the setting of Theorem 1, the partially-identified set for ψ_C is the interval [ψ_C^-, ψ_C^+], where the interval endpoints solve (12) and (13):

  ψ_C^- = min_{Ē ∈ E_∞(Λ)} E[Y (1 − Z) / (1 − Ē)]  subject to  E[Q_{1−τ}(X, 0) (1 − Z) / (1 − Ē)] = E[Q_{1−τ}(X, 0)]   (12)
  ψ_C^+ = max_{Ē ∈ E_∞(Λ)} E[Y (1 − Z) / (1 − Ē)]  subject to  E[Q_τ(X, 0) (1 − Z) / (1 − Ē)] = E[Q_τ(X, 0)].   (13)

The partially identified set for ψ_ATT is the interval [ψ_ATT^-, ψ_ATT^+], where ψ_ATT^∓ = E[Y | Z = 1] − (ψ_C^± − E[Y(1 − Z)]) / P(Z = 1).

Finally, sharp bounds for ψ_ATE can be obtained by subtracting sharp bounds for ψ_T and ψ_C. Equivalently, these bounds can be obtained by solving optimization problems with two quantile balancing constraints. Although this result is superficially similar to Theorem 1 and Corollary 1, its proof requires a novel construction.

Theorem 2. (Optimal bounds for ψ_ATE)

For any Λ ≥ 1, the set of values of ψ_ATE compatible with the observed data distribution and Assumption 1 is a closed interval [ψ_ATE^-, ψ_ATE^+], where ψ_ATE^- = ψ_T^- − ψ_C^+ and ψ_ATE^+ = ψ_T^+ − ψ_C^-.

We find the results of Theorem 1 and Corollary 1 quite counterintuitive. After all, it is certainly not true that every random variable Ē ∈ E_∞(Λ) satisfying E[Q_τ(X, 1) Z / Ē] = E[Q_τ(X, 1)] could be the true propensity score e(X, U). Indeed, the constraints of the quantile-balancing optimization problems do not even enforce that E[Ē] = P(Z = 1).

To explain how these results are possible, we begin by characterizing which random variables Ē could plausibly be the true propensity score e(X, U).
The calculation (7) indicates that Ē should at least satisfy E[h(X) Z / Ē] = E[h(X)] for all integrable h, or equivalently, E[Z / Ē | X] = 1. Proposition 2 shows that this is actually the "only" constraint on Ē for the purposes of bounding ψ_T.

Proposition 2.
For any random variable Ē ∈ E_∞(Λ) satisfying E[Z / Ē | X] = 1, there is a distribution Q for (X, Y(0), Y(1), Z, U) with the following properties:

(i) The distribution of the observables (X, Y, Z) is the same under P and Q.

(ii) Q satisfies Assumption 1.

(iii) E_Q[Y(1)] = E_P[Y Z / Ē].
In the appendix, we prove Proposition 2 using a slight generalization of the data-compatibility characterizations in [7, 36, 44, 49]. This result implies that E[Y Z / Ē] is a plausible value of ψ_T whenever E[Z / Ē | X] = 1. This immediately gives variational formulas for the smallest and largest values of ψ_T compatible with Assumption 1.

Corollary 2.
The endpoints of the partially-identified interval for ψ_T solve:

  ψ_T^- = min_{Ē ∈ E_∞(Λ)} E[Y Z / Ē]  subject to  E[Z / Ē | X] = 1   (14)
  ψ_T^+ = max_{Ē ∈ E_∞(Λ)} E[Y Z / Ē]  subject to  E[Z / Ē | X] = 1.   (15)

This corollary translates the data-compatibility constraints of Proposition 2 into variational problems (14) and (15) which are easy to solve using Dantzig and Wald's [11] generalization of the Neyman-Pearson fundamental lemma. Roughly speaking, the "sufficiency" half of the Dantzig and Wald result says that any random variable Ē that takes on its minimal value when Y is "small" and its maximal value when Y is "large" solves (14), as long as it is also feasible. When applied to the singly-constrained problem (10), the same result says that any feasible solution to (10) which is a cutoff on Y − c Q_{1−τ}(X, 1) for some real number c is optimal in that problem. By direct calculation, one may check that the random variable Ē^- defined in Proposition 3 solves both of these problems.

Proposition 3. Let Ē^-, Ē^+ ∈ E_∞(Λ) satisfy E[Z / Ē^- | X] = E[Z / Ē^+ | X] = 1 and also (16) and (17) whenever Z = 1:

  Ē^- / (1 − Ē^-) = { Λ^{-1} · e(X) / (1 − e(X))  if Y < Q_{1−τ}(X, 1);   Λ^{+1} · e(X) / (1 − e(X))  if Y > Q_{1−τ}(X, 1) }   (16)

  Ē^+ / (1 − Ē^+) = { Λ^{-1} · e(X) / (1 − e(X))  if Y > Q_τ(X, 1);   Λ^{+1} · e(X) / (1 − e(X))  if Y < Q_τ(X, 1) }   (17)
Then Ē^- solves both (10) and (14), and Ē^+ solves both (11) and (15).

These worst-case formulas explain how the relaxation from an infinite number of balancing constraints to a single balancing constraint is possible: both problems have the same solution. The form of the propensity score Ē^+ is quite intuitive: in the "worst case," all observations with "high" values of Y are unlikely to be treated and thus receive large propensity weight, while all observations with "low" values of Y are likely to be treated and thus receive small propensity weight. The cutoff between "large" and "small" is chosen to satisfy the data-compatibility condition E[Z / Ē^+ | X] = 1. This argument extends immediately to ψ_C by relabeling "treatment" and "control," and it extends to ψ_ATT by the argument given in Section 3.1.
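The worst-case form (17) suggests a simple plug-in computation: given fitted nominal propensities and a fitted conditional τ-quantile, assign each treated unit the extreme propensity indicated by its side of the quantile and evaluate the stabilized IPW sum. The sketch below is our own simplification (the actual quantile balancing procedure of Section 4 additionally enforces the empirical balance constraint); `e_hat` and `q_hat` are hypothetical estimates supplied by the analyst.

```python
import numpy as np

def plugin_upper(Y, Z, e_hat, q_hat, lam):
    """Plug-in upper bound on E[Y(1)] from the worst-case propensities (17).

    Treated units with Y above the fitted tau-quantile get their treatment
    odds shrunk by 1/lam (hence the largest inverse-propensity weight);
    units below get their odds inflated by lam (the smallest weight).
    """
    odds = e_hat / (1.0 - e_hat)
    shifted = np.where(Y > q_hat, odds / lam, odds * lam)
    e_bar = shifted / (1.0 + shifted)  # back from odds to propensities
    w = Z / e_bar
    return np.sum(Y * w) / np.sum(w)   # stabilized (self-normalized) form
```

With `lam = 1` this reduces to the Hájek IPW estimate; the lower bound follows by flipping the inequality, as in (16), with the (1 − τ)-quantile.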
To extend the argument from Section 3.2 to the ATE requires additional care. Although ψ_ATE^+ = ψ_T^+ − ψ_C^- is certainly a valid upper bound for the partially identified set for ψ_ATE, it is not obviously a sharp one. Proposition 2 only implies that there exists a distribution Q matching the observed data law which has E_Q[Y(1)] = ψ_T^+ and another distribution Q′ which has E_{Q′}[Y(0)] = ψ_C^-, but these distributions need not be the same. In other words, the two bounds may not be simultaneously achievable.

Theorem 2 indicates that the worst-case bounds on the counterfactual means are simultaneously achievable in the marginal sensitivity model. This is actually a surprising result, given that simultaneous achievability is not expected to hold in two closely-related sensitivity models. The first is the Rosenbaum model, where Yadlowsky et al. [48] derived sharp bounds on ψ_T and ψ_C but required an extra symmetry assumption on the distribution of potential outcomes to establish sharpness of the resulting ATE bounds. The second is Tan's [44] original formulation of the marginal sensitivity model, which requires the unobserved confounders U to be the potential outcomes themselves. With Tan's added restriction, we have only been able to establish simultaneous achievability when Y | X, Z has a continuous distribution.

The key to our bounds on ψ_ATE is the following proposition, which strengthens Proposition 2.

Proposition 4.
For any random variable Ē ∈ E_∞(Λ) satisfying E[Z / Ē | X] = E[(1 − Z) / (1 − Ē) | X] = 1, there is a distribution Q for the full data (X, Y(0), Y(1), Z, U) with the following properties:

(i) The distribution of the observables (X, Y, Z) is the same under P and Q.

(ii) Q satisfies Assumption 1.

(iii) E_Q[Y(1) − Y(0)] = E_P[Y Z / Ē − Y (1 − Z) / (1 − Ē)].

Unlike Proposition 2, this result does not follow from the existing data-compatibility characterizations of [7, 36, 44, 49] and instead requires an original construction. Given this result, one can derive Theorem 2 as a consequence of Theorem 1 and Corollary 1.
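Since Theorem 2 says the counterfactual-mean bounds are simultaneously achievable, the ATE interval combines them directly. A minimal sketch (our own, assuming sharp intervals for ψ_T and ψ_C are already in hand):

```python
def ate_interval(t_bounds, c_bounds):
    """Sharp ATE interval per Theorem 2: the lower endpoint pairs the
    smallest treated mean with the largest control mean, and vice versa."""
    (t_lo, t_hi), (c_lo, c_hi) = t_bounds, c_bounds
    return t_lo - c_hi, t_hi - c_lo
```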
In this section, we give our proposal for translating the population-level partial identification results of Section 3 into a practical sensitivity analysis for IPW, which we call the quantile balancing method. Our proposal follows naturally from our partial identification results: on a high level, we modify the ZSB proposal described in Section 2 to incorporate the quantile-balancing constraints we derived in Theorem 1 and Corollary 1. Throughout this section, we take Λ ≥ 1 and set τ = Λ / (Λ + 1).

We begin by describing the quantile balancing bounds for the average treated outcome. Theorem 1 implies that the largest value of ψ_T compatible with Assumption 1 solves the optimization problem (18):

  ψ_T^+ = max_{Ē ∈ E_∞(Λ)} E[Y Z / Ē] / E[Z / Ē]  s.t.  ( E[Q_τ(X, 1) Z / Ē], E[Z / Ē] ) = ( E[Q_τ(X, 1) Z / e(X)], E[Z / e(X)] ).   (18)

Notice that we have included the additional constraint E[Z / Ē] = E[Z / e(X)] which does not appear in Theorem 1, but this does not affect the value of the optimization problem.

Our proposal is to estimate ψ_T^+ by replacing all of the unknown quantities in (18) with empirical counterparts. We estimate ψ_T^- by following the same principle. To translate these estimates into confidence intervals, we employ the same simple percentile bootstrap scheme as ZSB.

We will be concrete about what optimization problem we are proposing to solve. Let Q̂_τ(x, z) be an estimate of the conditional quantile function of Y obtained by some kind of quantile regression (e.g. [25, 43, 29, 2]). Let ê be the data analyst's estimate of the nominal propensity score e from their primary analysis. We define ψ̂_T^+ as the solution to the empirical maximization problem (19):

  ψ̂_T^+ = max_{ē ∈ E_n(Λ)} Σ_{i=1}^n Y_i Z_i / ē_i / Σ_{i=1}^n Z_i / ē_i  s.t.
  ( n^{-1} Σ_{i=1}^n Q̂_τ(X_i, 1) Z_i / ē_i , n^{-1} Σ_{i=1}^n Z_i / ē_i ) = ( n^{-1} Σ_{i=1}^n Q̂_τ(X_i, 1) Z_i / ê(X_i) , n^{-1} Σ_{i=1}^n Z_i / ê(X_i) ).   (19)

The lower bound ψ̂_T^- is defined similarly, but with maximization replaced by minimization and Q̂_τ(x, z) replaced by another quantile estimate Q̂_{1−τ}(x, z). The somewhat peculiar form of the right-hand side of the constraints in (19) ensures that ē_i = ê(X_i) is feasible, so ψ̂_T^+ and ψ̂_T^- always exist.

Several immediate properties of the quantile balancing bounds (19) are collected below:

(i) When Λ = 1 (i.e. no confounding is allowed), the quantile balancing bounds collapse to the usual IPW estimate of ψ_T under unconfoundedness.

(ii) The quantile balancing bounds are sample bounded, i.e. min_i Y_i ≤ ψ̂_T^- ≤ ψ̂_T^+ ≤ max_i Y_i.

(iii) The quantile balancing bounds are always a subset of the ZSB bounds, and outside of knife-edge cases, are a strict subset.

(iv) The optimization problem (19) is convex and can be solved efficiently. In fact, it reduces to a standard quantile regression problem. See Appendix A for implementation details.

The quantile balancing idea extends easily to other causal estimands. To compute bounds for ψ_C, one only needs to exchange the definitions of "treated" and "control" and solve the same optimization problem. Subtracting the bounds for ψ_T and ψ_C gives bounds for ψ_ATE, and bounds for ψ_ATT follow from a similar principle (see Appendix A for the exact formula).

To form confidence intervals based on quantile balancing, we follow ZSB [49] and propose using the percentile bootstrap. If [ψ̂_b^-, ψ̂_b^+] are quantile balancing bounds estimated in the b-th of B bootstrap samples, we report the quantile balancing 1 − α confidence interval as:

  CI(α) = [ Q_{α/2}( { ψ̂_b^- }_{b ∈ [B]} ) , Q_{1−α/2}( { ψ̂_b^+ }_{b ∈ [B]} ) ].
(20)

As is standard for bootstrap-based IPW inference, we require re-estimating the nominal propensity score separately in each bootstrap replication. That requirement does not extend to the conditional quantiles. While the conditional quantiles can be re-estimated within bootstraps, our inference results will also apply if they are taken from the main dataset. This is important to keep inference computationally tractable.

When the conditional quantiles are estimated using linear quantile regression (i.e. Q̂_t(x, z) = β̂_t(z)ᵀ h(x) for some "features" h : 𝒳 → R^k), one could consider directly "balancing" the features h rather than the fitted quantile Q̂_t, as in [44, 17, 9, 45]. Although this approach has some nice features and theoretical support, our simulations show the resulting inference is less reliable in small samples.

We now state some theoretical properties of the quantile balancing bounds [ψ̂^-, ψ̂^+] which apply when the outcome Y has a continuous distribution. In short, the bounds are sharp when quantiles are estimated consistently and are valid even when quantiles are estimated inconsistently. Moreover, the percentile bootstrap yields valid confidence intervals if standard IPW inference conditions are satisfied and quantiles are estimated parametrically.

To obtain these results, we need a few conditions. The first condition collects some standard IPW consistency requirements which we expect the data analyst to have already assumed in his or her "primary analysis."

Condition 1. (IPW assumptions)
The nominal propensity score e satisfies ε ≤ e(X) ≤ 1 − ε with probability one for some ε > 0. The estimated propensity score ê ≡ ê({Xᵢ, Zᵢ}_{i ≤ n}) is uniformly consistent, and the variance of Y is finite.

The second condition requires that the outcome Y has a bounded conditional density which is positive near the relevant conditional quantiles. This is a common identification condition for quantile regression [2, 5].

Condition 2. (Density)
The conditional distribution of Y | X, Z has a uniformly bounded density f(y | x, z). For each (x, z) ∈ X × {0, 1}, the map y ↦ f(y | x, z) is continuous and positive near Q_{1−τ}(x, z) and Q_τ(x, z).

Finally, we make some assumptions about how the quantiles are estimated. For the standard linear quantile regression method of Koenker and Bassett [25], one only needs to check that the regressors in the quantile regression have finite variance. We cover generic (possibly nonlinear) methods by requiring sample splitting to avoid overfitting. The specific form of sample splitting analyzed in our proofs is "cross-fitting" [42, 33, 10], but leave-one-out or out-of-bag quantile estimates perform similarly in simulations.
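The cross-fitting scheme just mentioned can be sketched in a few lines. Everything below is an illustrative implementation of generic cross-fitted quantile estimation, not the authors' code; the learner interface `fit_predict` is our own assumption and can wrap any quantile regression method.

```python
import numpy as np

def crossfit_quantiles(X, Y, Z, fit_predict, n_folds=5, seed=0):
    """Cross-fitted conditional quantile estimates.

    Each observation i receives a prediction from a model trained on
    same-arm observations in the *other* folds, so no point is predicted
    by a model that saw it (guarding against overfitting).
    fit_predict(X_train, Y_train, X_test) can wrap any quantile learner.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = rng.permutation(n) % n_folds          # random fold assignment
    out = np.full(n, np.nan)
    for z in (0, 1):
        arm = Z == z
        for f in range(n_folds):
            train = arm & (folds != f)
            test = arm & (folds == f)
            if train.any() and test.any():
                out[test] = fit_predict(X[train], Y[train], X[test])
    return out
```

Leave-one-out or out-of-bag predictions can be substituted by changing the fold logic; the downstream optimization is unaffected.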
Condition 3. (Quantile estimates)
For each t ∈ {1 − τ, τ}, one of the following holds:

(i) Q̂_t(x, z) = β̂_t(z)ᵀ h(x) for some fixed "features" h_j(X) with finite variance.
(ii) Q̂_t(x, z) is estimated using cross-fitting and satisfies Condition N in Appendix B.7.

Condition 3 is essentially "algorithmic," and neither (i) nor (ii) imposes any accuracy requirements on the estimated conditional quantiles. The appendix conditions in (ii) are technical to state but very mild. For example, they are satisfied by the random-forest-based quantile regression methods of Meinshausen [29] and Athey et al. [2] without any further assumptions.

Under these conditions, we have the following result on the asymptotic sharpness of the quantile balancing bounds.
Theorem 3. (Sharpness and robustness)
For any ψ ∈ {ψ_T, ψ_C, ψ_ATT, ψ_ATE}, let [ψ⁻, ψ⁺] be its partially identified interval under Assumption 1 and let [ψ̂⁻, ψ̂⁺] be the corresponding quantile balancing interval. Assume Conditions 1, 2, and 3.

(i) If the quantile regression estimates are consistent, then ψ̂⁻ →ₚ ψ⁻ and ψ̂⁺ →ₚ ψ⁺.
(ii) Even if the quantile models are misspecified, we still have ψ̂⁻ ≤ ψ⁻ + o_P(1) and ψ⁺ − o_P(1) ≤ ψ̂⁺.

The result (i) is expected, so we offer some intuition for (ii). The true worst-case propensity score Ē⁺ defined in Proposition 3 "balances" all (integrable) functions of X, so it is "almost" feasible in the optimization problem (19). Thus, even if the quantile regression model is misspecified, the IPW estimator based on Ē⁺ will "almost" be in the domain of (19). We should therefore expect ψ̂⁺ to be too large rather than too small. The robustness result (ii) shows that it eventually is.

The validity of the confidence interval (20) requires stronger parametric assumptions. We prove an inference result assuming the nominal propensity score is estimated by a correctly specified parametric model and the conditional quantiles are estimated by a (potentially misspecified) parametric model. These assumptions are not much stronger than what we expect the primary analysis to assume for inference under unconfoundedness [15, 49].

Theorem 4. (Inference)
Let [ψ⁻, ψ⁺] be as in Theorem 3, and let CI(α) be as in (20). Suppose Conditions 1, 2, and 3(i) are satisfied, and also that the nominal propensity score is estimated by a regular parametric model (e.g., logistic regression). Then we have

  lim inf_{n→∞} P( [ψ⁻, ψ⁺] ⊆ CI(α) ) ≥ 1 − α    (21)

for any α ∈ (0, 1).

Although we do not have theoretical support for the confidence interval CI(α) when the quantiles are estimated by a nonlinear model, we find that this approach performs reasonably well in the simulations of Section 5.

In this section, we illustrate the finite-sample performance of the quantile balancing method on several simulated datasets and one real-data example. We also compare several variants of the method against the ZSB method described in Section 2 on the simulated datasets.
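The percentile-bootstrap interval (20) used throughout these experiments is simple to implement. The sketch below is our own illustration, not the authors' code; the callback `estimate_bounds` is an assumed interface that must re-fit the nominal propensity model and return the bound estimates on each resampled dataset.

```python
import numpy as np

def percentile_bootstrap_ci(data, estimate_bounds, alpha=0.10, B=500, seed=0):
    """CI(alpha) from (20): pair the alpha/2 quantile of bootstrapped lower
    bounds with the 1 - alpha/2 quantile of bootstrapped upper bounds."""
    rng = np.random.default_rng(seed)
    n = len(data)
    lower, upper = np.empty(B), np.empty(B)
    for b in range(B):
        sample = data[rng.integers(0, n, size=n)]   # resample with replacement
        lower[b], upper[b] = estimate_bounds(sample)
    return np.quantile(lower, alpha / 2), np.quantile(upper, 1 - alpha / 2)
```

Because the two endpoints use opposite tails of the bootstrap distributions, the interval covers the whole identified set rather than a single point.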
We consider two data-generating processes (DGPs) in our simulated examples. The two DGPs differ in the distribution of the regression function E[Y | X, Z], but otherwise can be described as follows:

  X ∼ Uniform([−1, 1]ᵈ),   Z | X ∼ Bernoulli( 1/2 − Σ_{j=1}^d X_j / √· ),   Y | X, Z ∼ N(µ(X), 1)    (22)

In the first DGP, we use µ(x) = x₁ + ··· + x_d, and in the second DGP we use µ(x) = sign(x₁) + sign(x₂). The estimand of interest is the ATE, and we fix Λ = 2, i.e., unobserved confounders can double or halve the odds of treatment.

We compare four methods for obtaining bounds on ψ_ATE. Linear applies the quantile balancing method of Section 4 with quantiles estimated following [25].
Forest is similar, but Q̂_t(x, z) is fitted using out-of-bag estimates from the random forest method of Athey et al. [2]. Covariates directly balances the features X without first estimating quantiles. ZSB implements the unconstrained method described in Section 2. All four methods estimate the nominal propensity score by logistic regression.

Figure 1 shows the distribution of upper and lower bound point estimates from each of these four methods, estimated using 1,000 simulations with 500 observations each. Dashed lines indicate the true partially identified region. The results conform to the asymptotic predictions of Proposition 1 and Theorem 3: (i) when the quantile models are "correctly specified," the quantile balancing point estimates are nearly unbiased; (ii) under misspecification, the range of point estimates is too wide rather than too narrow; and (iii) the ZSB range of point estimates is too wide in both cases.
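For concreteness, a dataset from this family of DGPs can be drawn as follows. This is our own sketch: the dimension `d` and the propensity-score scaling are placeholder choices, since the exact constants in (22) are not reproduced here; only the overall shape of the design matches.

```python
import numpy as np

def simulate_dgp(n, piecewise=False, d=2, seed=0):
    """One draw shaped like (22): uniform covariates, a linear-in-covariates
    propensity, and a Gaussian outcome centered at mu(X)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    p = 0.5 - X.sum(axis=1) / (4.0 * np.sqrt(d))   # stays inside (0, 1) for small d
    Z = rng.binomial(1, p)
    # DGP1: additive linear mean; DGP2: piecewise-constant mean
    mu = np.sign(X[:, 0]) + np.sign(X[:, 1]) if piecewise else X.sum(axis=1)
    Y = rng.normal(mu, 1.0)
    return X, Z, Y
```

The piecewise-constant mean in DGP2 is exactly the kind of outcome model a linear quantile regression misspecifies but a forest-based method can fit.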
Figure 1:
Boxplots of the ATE upper and lower bound point estimates for both DGPs and all considered methods. The dashed line indicates the boundary of the true partially identified set. In DGP1, the Linear and Covariates methods are correctly specified and give the most accurate bounds. In DGP2, the Forest method is well-suited to the piecewise-constant outcome model and gives the most accurate bounds.
Confidence intervals based on the Linear and Covariates approaches exhibited some under-coverage at this sample size, at least in the "well-specified" setting of DGP1. The 90% bootstrap confidence intervals based on the Linear method had approximately 85.4% coverage of the identified set, whereas those based on the Covariates method had 81.3% coverage. This undercoverage is partially caused by a small bias in the original range of point estimates, which is not readily apparent from Figure 1. Meanwhile, Forest had 95.5% coverage, and ZSB had 100% coverage. In DGP2, the Forest method had 91.6% coverage, and all other methods had at least 99% coverage.
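The bounds themselves can also be computed directly, without the weighted quantile regression reduction of Appendix A, because the ratio-form balance constraint in (19) linearizes: requiring Σᵢ(Q̂ᵢ − r)wᵢ = 0, with r the ê-weighted average of Q̂ and wᵢ = 1/ēᵢ, reproduces the constraint. The sketch below is our own linear-programming illustration (function names are ours), not the authors' recommended implementation.

```python
import numpy as np
from scipy.optimize import linprog

def qb_upper_bound(Y, Z, e_hat, Q_hat, Lam):
    """Quantile balancing upper bound for psi_T as a direct LP sketch of (19).

    Q_hat[i] estimates the tau-quantile of Y | X_i, Z = 1, tau = Lam/(1+Lam).
    Decision variables: treated inverse weights w_i = 1 / e_bar_i.
    """
    t = Z == 1
    y, e, q = Y[t], e_hat[t], Q_hat[t]
    # Marginal sensitivity model: box constraints on each inverse weight.
    lo = 1 + (1 - e) / (e * Lam)
    hi = 1 + Lam * (1 - e) / e
    # Linearized form of constraint (19).
    r = np.sum(q / e) / np.sum(1 / e)
    res = linprog(-y, A_eq=[q - r], b_eq=[0.0],
                  bounds=list(zip(lo, hi)), method="highs")
    return -res.fun / np.sum(Z / e_hat)
```

At Λ = 1 the box collapses to wᵢ = 1/ê(Xᵢ) and the bound reduces to the usual Hájek-style IPW estimate, matching property (i) above.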
In this section, we apply our proposed sensitivity analysis to a subsample of data from the 1966-1981 National Longitudinal Survey (NLS) of Older and Young Men. We wish to estimate the impact of union membership on wages. Specifically, we consider the ATE of union membership on log wages. For illustrative purposes, we focus on the 1978 cross-section of Young Men and restrict our attention to craftsmen and laborers not enrolled in school. Our estimates are thus based on a sample of 668 respondents with measurements on wages, union membership, and eight other covariates.

For our "primary analysis," we use IPW to adjust for baseline imbalances in covariates between union and nonunion samples. Table 1 reports the covariate balance between the union and nonunion samples before and after weighting by the (estimated) inverse propensity score. On several important characteristics, inverse propensity weighting dramatically improves balance across the two samples.
Figure 2:
Point estimate ranges and 90% bootstrap confidence intervals for the ATE in the NLS dataset. For the quantile balancing method, conditional quantiles are estimated using the linear quantile regression method of Koenker and Bassett [25].
                    Unweighted            Weighted
  Covariate         Union   Nonunion     Union   Nonunion
  Age               30.1    30.0         30.0    30.0
  Black             24%     24%          23%     24%
  Metropolitan      74%     57%          66%     65%
  Southern          32%     53%          42%     42%
  Married           78%     75%          76%     76%
  Manufacturing     42%     32%          37%     38%
  Laborer           23%     15%          18%     18%
  Education         12.2    11.7         12.1    12.0

Table 1:
Covariate means among the nonunion and union subsamples, along with the means in the weighted samples. In red, we highlight particularly large imbalances. In the weighted samples, propensity weights are estimated using logistic regression.
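The weighted columns of a table like Table 1 are just inverse-propensity-weighted covariate means. A minimal sketch (the function and key names are our own):

```python
import numpy as np

def balance_table(X, Z, e_hat):
    """Raw and IPW-weighted covariate means by treatment arm.

    Weighting by 1/e_hat (treated) and 1/(1 - e_hat) (controls) maps both
    arms toward the covariate distribution of the full sample, which is why
    well-estimated weighted columns nearly agree."""
    w1 = Z / e_hat
    w0 = (1 - Z) / (1 - e_hat)
    wmean = lambda w: (X * w[:, None]).sum(axis=0) / w.sum()
    return {"treated_raw": X[Z == 1].mean(axis=0),
            "control_raw": X[Z == 0].mean(axis=0),
            "treated_weighted": wmean(w1),
            "control_weighted": wmean(w0)}
```

Comparing the two weighted columns is the standard visual check that the primary IPW analysis has removed observed imbalance.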
The IPW point estimate of the ATE is 0.23 with an associated 90% confidence interval of [0.··, 0.··]. We estimate Q̂_t(x, z) for quantile balancing using the standard linear quantile model of Koenker and Bassett [25].

Both sensitivity analyses show that the positive effect found in the "primary" analysis is fairly robust to unobserved confounding, but quantile balancing refines the ZSB interval. Even if the odds of union membership for "skilled" workers were double the odds for "typical" workers with the same observed covariates, the quantile balancing analysis would still find a statistically significant positive treatment effect. Meanwhile, when Λ = 2, the ZSB confidence intervals already include the null, although the range of point estimates (barely) excludes it.

To put these figures in context, we follow the advice of Kallus and Zhou [22] and compute the degree to which the (estimated) odds of union membership could change if measured confounders were omitted from the dataset. Omitting Black only changes the odds of treatment by a factor of 1.4, and omitting
Laborer only changes the odds of treatment by a factor of 1.9. In fact, no measured confounders except
South were able to double or halve the odds of union membership for any respondent.

Incidentally, longitudinal estimates of union wage effects (which control for individual-specific effects like "skill") come to similar conclusions as the one suggested by our sensitivity analysis. Although treatment effect estimates from longitudinal studies are generally smaller than those from cross-sectional studies, they still find evidence in favor of the "union premium" [8, 18, 14].
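The benchmarking exercise just described can be reproduced with any logistic-regression fit: drop one covariate, refit, and record the largest change in any respondent's estimated odds of treatment. The sketch below is our own illustration with a bare-bones Newton solver so it stays self-contained; all names are ours.

```python
import numpy as np

def _logistic_probs(X, z, iters=25, ridge=1e-6):
    """Newton-Raphson logistic regression; returns fitted P(Z=1 | X)."""
    A = np.column_stack([np.ones(len(z)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        H = A.T @ (A * (p * (1 - p))[:, None]) + ridge * np.eye(A.shape[1])
        beta += np.linalg.solve(H, A.T @ (z - p))
    return 1.0 / (1.0 + np.exp(-A @ beta))

def max_odds_change(X, z, j):
    """Largest multiplicative change in any unit's estimated treatment odds
    when covariate column j is omitted (a Lambda-style benchmark)."""
    odds = lambda p: p / (1 - p)
    full = odds(_logistic_probs(X, z))
    drop = odds(_logistic_probs(np.delete(X, j, axis=1), z))
    ratio = full / drop
    return float(np.max(np.maximum(ratio, 1.0 / ratio)))
```

If no measured covariate produces a factor near the chosen Λ, then Λ is a conservative calibration for the strength of a plausible unmeasured confounder.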
We have shown that quantile balancing, a simple modification of the popular ZSB [49] sensitivity analysis, is feasible, robust, and sharp. This new sensitivity analysis for IPW is based on novel partial identification results for Tan [44]'s marginal sensitivity model.

We point to several interesting directions for future work. While our partial identification results focus on counterfactual means and a few treatment effects, it should be possible to extend them to more complex estimands of the type considered in [20, 21, 22, 23, 27]. Perhaps a similarly compact sensitivity analysis could even apply to dynamic treatment regimes. In addition, while our identification arguments generalize to any sensitivity assumption that only restricts the propensity score in a pointwise fashion (i.e., e_min(x) ≤ e(x, u) ≤ e_max(x)), the practicality of our sensitivity analysis and its theoretical properties rely on the marginal sensitivity model quite heavily. It would be interesting to see if a practical and sharp sensitivity analysis could be developed for other sensitivity assumptions in this class.

References

[1] Peter M Aronow and Donald K K Lee. "Interval estimation of population means under unknown but bounded probabilities of sample selection". In: Biometrika
Annals of Statistics
Econometrica
Statistics in Medicine
Journalof Econometrics issn : 0304-4076.[6] Arie Beresteanu, Ilya Molchanov, and Francesca Molinari. “Sharp Identification Regions in Models withConvex Moment Predictions”. In:
Econometrica issn : 00129682, 14680262.[7] Jolene Birmingham, Andrea Rotnitzky, and Garrett M Fitzmaurice. “Pattern-mixture and selectionmodels for analysing longitudinal data with monotone missing patterns”. In:
Journal of the RoyalStatistical Society: Statistical Methodology: Series B
Journal of Econometrics issn : 0304-4076.[9] Kwun Chuen Gary Chan, Sheung Chi Phillip Yam, and Zheng Zhang. “Globally efficient non-parametricinference of average treatment effects by empirical balancing calibration weighting”. In:
Journal of theRoyal Statistical Society: Series B (Statistical Methodology)
Econometrics Journal
The Annals of Mathematical Statistics issn : 00034851.[12] D. A. Darling. “On a Class of Problems Related to the Random Division of an Interval”. In:
TheAnnals of Mathematical Statistics
A New Characterization of Identified sets in PartiallyIdentified Models . 2016. url : .[14] Richard B. Freeman. “Longitudinal Analyses of the Effects of Trade Unions”. In: Journal of LaborEconomics issn : 0734306X, 15375307.[15] Keisuke Hirano and Guido W Imbens. “Estimation of Causal Effects using Propensity Score Weighting:An Application to Data on Right Heart Catheterization”. In:
Health Services & Outcomes ResearchMethodology
Econometrica issn :00129682, 14680262.[17] Kosuke Imai and Marc Ratkovic. “Covariate balancing propensity score”. In:
Journal of the RoyalStatistical Society Series B
TheReview of Economic Studies issn : 00346527, 1467937X.[19] George Johnson. “Economic Analysis of Trade Unionism”. In:
American Economic Review
Interval Estimation of Individual-Level Causal EffectsUnder Unobserved Confounding . 2018. arXiv: .1421] Nathan Kallus and Angela Zhou.
Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforce-ment Learning . 2020. arXiv: .[22] Nathan Kallus and Angela Zhou. “Confounding-Robust Policy Improvement”. In:
Advances in NeuralInformation Processing Systems 31 . Ed. by S. Bengio et al. Curran Associates, Inc., 2018, pp. 9269–9279.[23] Nathan Kallus and Angela Zhou. “Minimax-Optimal Policy Learning Under Unobserved Confounding”.In:
Management Science (2020).[24] Roger Koenker.
Quantile Regression . Econometric Society Monographs. Cambridge University Press,2005.[25] Roger W Koenker and Gilbert Bassett. “Regression Quantiles”. In:
Econometrica
Introduction to empirical processes and semiparametric inference . Springer series instatistics. Springer, 2008. isbn : 9780387749778.[27] Kwonsang Lee, Falco J. Bargagli-Stoffi, and Francesca Dominici.
Causal Rule Ensemble: InterpretableInference of Heterogeneous Treatment Effects . 2020. arXiv: .[28] Matthew A. Masten, Alexandre Poirier, and Linqi Zhang.
Assessing Sensitivity to Unconfoundedness:Estimation and Inference . 2020. arXiv: .[29] Nicolai Meinshausen. “Quantile Regression Forests”. In:
J. Mach. Learn. Res. issn : 1532-4435.[30] Wesley Mellow. “Unionism and Wages: A Longitudinal Analysis”. In:
The Review of Economics andStatistics issn : 00346535, 15309142.[31] L W Miratrix, S Wager, and J R Zubizarreta. “Shape-constrained partial identification of a populationmean under unknown probabilities of sample selection”. In:
Biometrika
Cross-fitting and fast remainder rates for semiparamet-ric estimation . CeMMAP working papers CWP41/17. Centre for Microdata Methods and Practice,Institute for Fiscal Studies, 2017.[34] Jerzy Neyman. “On the Application of Probability Theory to Agricultural Experiments. Essay onPrinciples. Section 9”. In:
Statist. Sci.
Statistical Science issn : 08834237. url : .[36] James M. Robins, Andrea Rotnitzky, and Daniel O. Scharfstein. “Sensitivity Analysis for Selection biasand unmeasured Confounding in missing Data and Causal inference models”. In: Statistical Models inEpidemiology, the Environment, and Clinical Trials . Ed. by M. Elizabeth Halloran and Donald Berry.New York, NY: Springer New York, 2000, pp. 1–94. isbn : 978-1-4612-1284-3.[37] Paul R Rosenbaum.
Design of Observational Studies . New York: Springer, 2010.[38] Paul R Rosenbaum. “Sensitivity analysis for certain permutation inferences in matched observationalstudies”. In:
Biometrika
Statist. Sci.
Combining Observational and Experimental Datasets Using Shrinkage Estima-tors . 2020. arXiv: .[41] D.B. Rubin. “Estimating causal effects of treatments in randomized and nonrandomized studies”. In:
Journal of Educational Psychology
Ann. Statist.
Ann. Statist.
Journalof the American Statistical Association
An Interval Estimation Approach to Sample Selection Bias . 2019. arXiv: .[46] A. W. van der Vaart.
Asymptotic Statistics . Cambridge Series in Statistical and Probabilistic Mathe-matics. Cambridge University Press, 1998.[47] A. W. van der Vaart and J. Wellner.
Weak Convergence and Empirical Processes: With Applicationsto Statistics . Springer Series in Statistics. Springer, 1996. isbn : 9780387946405.[48] Steve Yadlowsky et al.
Bounds on the conditional and average treatment effect with unobserved con-founding factors . 2018. arXiv: .[49] Qingyuan Zhao, Dylan S Small, and Bhaswar B Bhattacharya. “Sensitivity analysis for inverse prob-ability weighting estimators via the percentile bootstrap”. In:
Journal of the Royal Statistical Society:Statistical Methodology: Series B
A Appendix: Computation
This appendix describes how the quantile balancing sensitivity analysis can be implemented using standard solvers for linear quantile regression [25, 24] (A.1). It also gives the formulas for the ATT bounds (A.2), which were omitted from the main text.

Throughout this appendix, Λ ≥ 1 and τ = Λ/(Λ + 1). For a function f ≡ f(x, y, z), we use the shorthand fᵢ = f(Xᵢ, Yᵢ, Zᵢ). We will also write Pₙ v = (1/n) Σᵢ₌₁ⁿ vᵢ for any vector v ∈ Rⁿ. For functions f, Pₙ f should be understood to mean (1/n) Σᵢ₌₁ⁿ fᵢ.

A.1 Computing bounds with weighted quantile regression
We begin by considering the computation of ψ̂⁺_T. We consider the more general optimization problem (23), for some function g: X → Rᵏ containing an "intercept." In the main text, we assumed g(x) = (1, Q̂_τ(x, 1)).

  max_{ē ∈ Eₙ(Λ)}  [ Pₙ Y Z/ē ] / [ Pₙ Z/ê(X) ]   subject to   Pₙ g(X) Z/ē = Pₙ g(X) Z/ê(X)    (23)

Readers familiar with quantile regression optimization (cf. Chapter 6 of [24]) may recognize the problem (23) from the dual of a certain weighted quantile regression problem. Specifically, let ρ_τ(u) = u(τ − I{u < 0}) be the quantile regression "check" function and define the weighted quantile regression objective as:

  Lₙ(γ) := Pₙ ρ_τ(Y − γᵀ g(X)) · Z (1 − ê(X))/ê(X).    (24)

The following proposition shows that any minimizer of Lₙ can be used to compute the solution of (23).

Proposition 5.
Suppose ê(Xᵢ) ∈ (0, 1) for all i, let γ̂ minimize Lₙ, and let V̂ᵢ = sign(Yᵢ − γ̂ᵀ g(Xᵢ)). Then the optimal objective value in (23) is:

  [ Pₙ (Y − γ̂ᵀ g(X)) Z (1 + Λ^{V̂} (1 − ê(X))/ê(X)) + Pₙ γ̂ᵀ g(X) Z/ê(X) ] / [ Pₙ Z/ê(X) ].

This same approach can be used to compute a lower bound for ψ_T by replacing Y with −Y, applying Proposition 5, and then negating the answer. Upper and lower bounds for ψ_C can then be obtained by replacing Z by 1 − Z and ê(X) by 1 − ê(X) and then applying the same procedure. Subtracting the upper and lower bounds for ψ_T and ψ_C (as in Theorem 2) gives bounds on ψ_ATE.

A.2 Bounds for the ATT
Next, we describe the standard quantile balancing bounds for ψ_ATT. Let Ȳ(1) be the average value of Yᵢ among treated observations. We define the quantile balancing upper bound for the ATT as the solution to the optimization problem (25), where g⁺(x) = (1, Q̂_{1−τ}(x, 0)):

  ψ̂⁺_ATT = max_{ē ∈ Eₙ(Λ)}  Ȳ(1) − [ Σ_{Zᵢ=0} Yᵢ ēᵢ/(1 − ēᵢ) ] / [ Σ_{Zᵢ=0} ēᵢ/(1 − ēᵢ) ]
  s.t.  Σ_{Zᵢ=0} g⁺(Xᵢ) ēᵢ/(1 − ēᵢ) = Σ_{Zᵢ=0} g⁺(Xᵢ) êᵢ/(1 − êᵢ)    (25)

The lower bound ψ̂⁻_ATT is defined similarly, but with maximization replaced by minimization and g⁺(x) replaced by g⁻(x) := (1, Q̂_τ(x, 0)).

B Appendix: Proofs
This appendix collects proofs of the results in the main text, along with supporting results. Throughout, we will use the following notation. For an integer n ≥ 1, [n] denotes the set {1, ..., n}. For symmetric matrices A and B, A ⪰ B and A ≻ B mean A − B is positive semidefinite and positive definite, respectively. If {aₙ} and {bₙ} are sequences of real numbers, then aₙ ≲ bₙ means aₙ = O(bₙ) and aₙ ∼ bₙ means aₙ/bₙ → 1. If {Aₙ} and {Bₙ} are sequences of random variables, then Aₙ ≲_P Bₙ means Aₙ = O_P(Bₙ) and Aₙ ∼_P Bₙ means Aₙ/Bₙ →ₚ 1. We adopt the convention that a/b = 0 when a and b are both zero.

We also make use of some standard empirical process notation. For a (possibly random) function f: X × R × {0, 1} → R, we will write P f := ∫ f dP and Pₙ f := (1/n) Σᵢ₌₁ⁿ f(Xᵢ, Yᵢ, Zᵢ). For any vector v = (v₁, ..., vₙ), we take Pₙ v = (1/n) Σᵢ₌₁ⁿ vᵢ. For any p ∈ [1, ∞), we define ||f||_{Lₚ(P)} = (P|f|ᵖ)^{1/p} and ||f||_{Lₚ(Pₙ)} = (Pₙ|f|ᵖ)^{1/p}. When p = ∞, we set ||f||_{L∞(P)} = inf{t : P(|f| ≤ t) = 1} and ||f||_{L∞(Pₙ)} = max_{i ≤ n} |f(Xᵢ, Yᵢ, Zᵢ)|.

B.1 Proof of Proposition 1
Proof.
First, we will compute the partially identified set for ψ_T for a substantially more general observed data distribution, since the proof is not any more difficult. Suppose the observed data law has the following factorization: X ∼ P_X, Z | X ∼ Bernoulli(e(X)), and Y | X, Z ∼ N(µ(X), σ²(X)). Here, P_X and e: X → (0, 1) are arbitrary, and the only requirement on µ, σ is that Y is integrable.

Let z_τ denote the τ-th quantile of the standard normal distribution. Since the conditional distribution of Y is continuous, Proposition 3 implies ψ⁺_T = E[YZ/Ē⁺], where Ē⁺ satisfies the following with probability one:

  1/Ē⁺ = 1 + [(1 − e(X))/e(X)] Λ      if Y ≥ µ(X) + σ(X) z_τ,
  1/Ē⁺ = 1 + [(1 − e(X))/e(X)] Λ⁻¹    if Y < µ(X) + σ(X) z_τ.

Let C(x) = µ(x) + σ(x) z_τ be the "cutoff" between low and high values of Ē⁺. Then the inverse Mills ratio formula for the expectation of a truncated Gaussian gives:

  ψ⁺_T = E[YZ/Ē⁺]
       = E[ e(X) E[Y/Ē⁺ | X, Z = 1] ]
       = E[ (e(X) + Λ⁻¹[1 − e(X)]) τ E[Y | Y < C(X), X, Z = 1] ] + E[ (e(X) + Λ[1 − e(X)])(1 − τ) E[Y | Y ≥ C(X), X, Z = 1] ]
       = E[ (e(X) + Λ⁻¹[1 − e(X)]) τ (µ(X) − σ(X) φ(z_τ)/τ) ] + E[ (e(X) + Λ[1 − e(X)])(1 − τ)(µ(X) + σ(X) φ(z_τ)/(1 − τ)) ]
       = E[µ(X)] + φ(z_τ)(Λ − Λ⁻¹) E[σ(X)(1 − e(X))].

Applying the same calculation to −Y gives ψ⁻_T = E[µ(X)] − φ(z_τ)(Λ − Λ⁻¹) E[σ(X)(1 − e(X))]. Specializing these answers to the distribution in Proposition 1 gives the stated answer.

We remark that the preceding calculation also gives the partially identified sets for ψ_C and ψ_ATE. Since the distribution of Y in this setting does not depend on Z, exchanging the roles of Z and 1 − Z in the above calculation amounts to replacing 1 − e(X) by e(X) in the formula for ψ±_T. Thus, [ψ⁻_C, ψ⁺_C] = E[µ(X)] ± φ(z_τ)(Λ − Λ⁻¹) E[σ(X)e(X)], and Theorem 2 implies [ψ⁻_ATE, ψ⁺_ATE] = ±φ(z_τ)(Λ − Λ⁻¹) E[σ(X)].

Next, we show that the ZSB interval is asymptotically too wide. We will only prove this result for the distribution (6), since the calculations simplify. Let ψ̂⁺_{T,ZSB} be as in (5). Let Ē∗ = ·· + I{Y ≤ 0.·· √(σ² + 1)}, and notice that Y | Z = 1 ∼ N(0, σ² + 1).
Then a straightforward calculation using the inverse Mills ratio formula gives:

  E[YZ/Ē∗] / E[Z/Ē∗] = φ(0.··) √(σ² + 1) / (2 − Φ(0.··)) > 0.·· √(σ² + 1).

The strong law of large numbers implies lim inf ψ̂⁺_{T,ZSB} ≥ lim inf (Pₙ YZ/Ē∗)/(Pₙ Z/Ē∗) > 0.·· √(σ² + 1) almost surely. The lower bound follows by symmetry.

B.2 Proof of Proposition 2
We instead show the following, more general result:

Proposition 2B.
Let (X, Y, Z) ∼ P, and let e_min, e_max: X → (0, 1) be any two functions. For any random variable Ē ∈ (0, 1) satisfying E[Z/Ē | X] = 1 and Z/e_max(X) ≤ Z/Ē ≤ Z/e_min(X), we can construct random variables (Y(0), Y(1), U) on the same space as (X, Y, Z, Ē), and an associated plausible propensity function ē(X, U) := E[Z | X, U], satisfying the following properties:

(i) Y = ZY(1) + (1 − Z)Y(0).
(ii) (Y(0), Y(1)) ⊥⊥ Z | (X, U) and e_min(X) ≤ ē(X, U) ≤ e_max(X).
(iii) Z/ē(X, U) = Z/Ē.

To recover the result of Proposition 2 from Proposition 2B, define e_min(x) = e(x)/(e(x) + [1 − e(x)]Λ) and e_max(x) = e(x)/(e(x) + [1 − e(x)]/Λ). Then let Q be the joint distribution of (X, Y(0), Y(1), Z, U). Item (i) implies Q is data compatible, item (ii) and Ē ∈ E∞(Λ) imply Q satisfies Assumption 1, and item (iii) implies E_Q[Y(1)] = E_Q[YZ/ē(X, U)] = E_P[YZ/Ē].

Proof.
Let (
X, Y, Z, ¯ E ) be as in the proposition, and suppose we have access to independent random variables V , V ∼ Uniform[0 ,
1] which are also jointly independent of (
X, Y, Z, ¯ E ). Define the following collection ofdistribution functions: F ( y | x, z ) = P ( Y ≤ y | X = x, Z = z ) G ( y | x, z, ¯ e ) = P ( Y ≤ y | X = x, Z = z, ¯ E = ¯ e ) H (¯ e | x, z ) = P ( ¯ E ≤ ¯ e | X = x, Z = z ) K ( u | x ) = (cid:90) u −∞ e ( x )1 − e ( x ) 1 − ¯ e ¯ e d H (¯ e | x, E [ Z/ ¯ E | X ] = 1 and ¯ E > K ( u | x ) is a proper CDF for each x . Using these functions, we define Y (1) , Y (0), and U by: Y (1) = ZY + (1 − Z ) G − ( V | X, , ¯ E ) Y (0) = ZF − ( V | X,
0) + (1 − Z ) YU = Z ¯ E + (1 − Z ) K − ( V | X ) . We adopt the convention that J − ( s ) := inf { t : J ( t ) ≥ s } whenever J is a distribution function, so that thesequantities are well-defined even when some of these conditional distribution functions are not invertible. Sincethe support of K ( ·| x ) is a subset of the support of H ( ·| x, Z/e max ( X ) ≤ ¯ E ≤ Z/e min ( X )implies e min ( X ) ≤ U ≤ e max ( X ) almost surely.The requirement (i) is immediate from the definition of Y (0) and Y (1).To verify the first part of (ii), we compute the distribution of ( Y (0) , Y (1)) given X, U, Z = 1 and thedistribution of ( Y (0) , Y (1)) given X, U, Z = 0. P ( Y (0) ≤ y , Y (1) ≤ y | X, U, Z = 1) = P ( F − ( V | X, ≤ y , Y ≤ y | X, ¯ E, Z = 1)= P ( F − ( V | X, ≤ y | X, ¯ E, Z = 1) G ( y | X, , ¯ E )= P ( F − ( V | X, ≤ y | X ) G ( y | X, , ¯ E )= F ( y | X, G ( y | X, , ¯ E ) P ( Y (0) ≤ y , Y (1) ≤ y | X, U, Z = 0) = P ( Y ≤ y , G − ( V | X, , ¯ E ) ≤ y | X, U, Z = 0)= P ( Y ≤ y | X, U, Z = 0) P ( G − ( V | X, , ¯ E ) ≤ y | X, U, Z = 0)= P ( Y ≤ y | X, Z = 0) P ( G − ( V | X, , ¯ E ) ≤ y | X )= F ( y | X, G ( y | X, , ¯ E ) . Since these are the same, ( Y (0) , Y (1)) | = Z | ( X, U ).A short calculation using Bayes’ theorem shows that ¯ e ( X, U ) = U . This will establish the second part of(ii), since e min ( X ) ≤ U ≤ e max ( X ).¯ e ( x, u ) = e ( x ) d P ( u | X = x, Z = 1)d P ( u | X = x ) = e ( x ) d P ( u | x, / d H ( u | x, P ( u | x ) / d H ( u | x,
1) = e ( x ) e ( x ) + (1 − e ( x )) e ( x )1 − e ( x ) 1 − uu = u The preceding calculation also verifies (iii), since Z/ ¯ e ( X, U ) =
Z/U = Z/Ē.

B.3 Proof of Proposition 3 and Theorem 1

Proof.
By symmetry, for Proposition 3 it suffices to show that Ē⁺ solves both (11) and (15).

First, we show that there exists Ē⁺ ∈ E∞(Λ) with the properties stated in Proposition 3. Define e_min(x) = e(x)/(e(x) + [1 − e(x)]Λ) and e_max(x) = e(x)/(e(x) + [1 − e(x)]/Λ). For any γ ∈ [e_min(x), e_max(x)], define ē_γ(x, y) by:

  ē_γ(x, y) = e_min(x) if y > Q_τ(x, 1);   e_max(x) if y < Q_τ(x, 1);   γ if y = Q_τ(x, 1).

We claim that for each x, there exists γ(x) ∈ [e_min(x), e_max(x)] solving E[Z/ē_γ(X, Y) | X = x] = 1. We will prove this by applying the intermediate value theorem to the continuous function w_x(γ) := E[Z/ē_γ(X, Y) | X = x]. If we took γ = e_max(x), then we would have:

  w_x(e_max(x)) = F(Q_τ(x,1) | x, 1)(e(x) + [1 − e(x)]/Λ) + (1 − F(Q_τ(x,1) | x, 1))(e(x) + [1 − e(x)]Λ)
               ≤ e(x) + (1 − e(x))(τ/Λ + (1 − τ)Λ)
               = 1,

and a similar calculation shows w_x(e_min(x)) ≥ 1. Thus, there is some γ(x) ∈ [e_min(x), e_max(x)] which solves E[Z/ē_{γ(x)}(X, Y) | X = x] = 1. For that choice of γ(x), Ē⁺ := ē_{γ(X)}(X, Y) belongs to E∞(Λ) and satisfies E[Z/Ē⁺ | X] = 1.

Now we show that any random variable Ē⁺ satisfying the requirements of the proposition solves the quantile balancing problem (11). It is easy to see that Ē⁺ is feasible in (11), since E[Q_τ(X,1) Z/Ē⁺] = E[Q_τ(X,1) E[Z/Ē⁺ | X]] = E[Q_τ(X,1)]. Moreover, for any other Ē ∈ E∞(Λ) which balances Q_τ, we may write:

  E[YZ/Ē] = E[Q_τ(X,1) Z/Ē] + E[(Y − Q_τ(X,1)) Z/Ē]
          ≤ E[Q_τ(X,1) Z/Ē] + E[(Y − Q_τ(X,1)) Z/Ē⁺]
          = E[Q_τ(X,1) Z/Ē⁺] + E[(Y − Q_τ(X,1)) Z/Ē⁺]
          = E[YZ/Ē⁺].

The inequality step follows because 1/Ē⁺ takes the maximum allowable value whenever (Y − Q_τ(X,1))Z is positive and the minimum allowable value whenever (Y − Q_τ(X,1))Z is negative, so (Y − Q_τ(X,1))Z/Ē⁺ is always at least as large as (Y − Q_τ(X,1))Z/Ē. Since Ē was arbitrary, this proves that Ē⁺ solves (11).

Finally, Ē⁺ solves the less constrained problem (11) and is feasible in the more constrained problem (15), so it solves (15) as well. This proves Proposition 3.

Solving (15) implies that Ē⁺ achieves the upper endpoint of the identified set for ψ_T. Symmetry implies that Ē⁻ achieves the lower endpoint. Proposition 2 and its obvious converse imply that the identified set is convex, so it is a closed interval. This proves Theorem 1.

B.4 Proof of Proposition 4 and Theorem 2
Proof.
We will divide the proof of Proposition 4 into several steps. Rather than explicitly constructing adistribution Q with E Q [ Y (1) − Y (0)] = E P [ Y Z/ ¯ E − Y (1 − Z ) / (1 − ¯ E )] for each ¯ E ∈ E ∞ (Λ), we will insteadconstruct the extremal distributions Q + and Q − that attain the endpoints of the partially identified set for ψ ATE . Then, we will show that the partially identified set is an interval. This will establish Proposition 4.Theorem 2 follows as a corollary.The first step is to exhibit the worst-case distribution Q + which attains the upper bound on the ATE.We will actually construct random variables Y (0) , Y (1) , U on the same probability space as ( X, Y, Z, ¯ E ),with associated plausible propensity score ¯ e ( X, U ) := E [ Z | X, U ], that satisfy the following requirements:(i) Y = Y (1) Z + Y (0)(1 − Z ).(ii) ( Y (0) , Y (1)) | = Z | ( X, U ).(iii) ¯ e ( X, U ) ∈ E ∞ (Λ). 20iv) E [ Y (1)] = ψ +T and E [ Y (0)] = ψ − C .We then take Q + to be the joint distribution of ( X, Y (0) , Y (1) , Z, U ).We start with the construction. Let (
Let (X, Y, Z) ∼ P and let (V₀, V₁) ∼ Uniform[0, 1]² be drawn independently of (X, Y, Z). Let F(y | x, z) = P(Y ≤ y | X = x, Z = z) and H̄(y | x, z) = P(Y = y | X = x, Z = z). Let T = τZ + (1 − τ)(1 − Z), and define the binary "confounder" U by:

 U = 1{Y > Q_T(X, Z)} + 1{Y = Q_T(X, Z), V₀ H̄(Y | X, Z) < F(Y | X, Z) − T}.

Define the conditional CDF of Y to sample from by G(y | x, z, u) = P(Y ≤ y | X = x, Z = z, U = u), and construct Y(0), Y(1) by:

 Y(1) = ZY + (1 − Z) G⁻¹(V₁ | X, Z = 1, U)
 Y(0) = Z G⁻¹(V₁ | X, Z = 0, U) + (1 − Z) Y.

It is immediate from the definition that Y = Y(1)Z + Y(0)(1 − Z), so the constructed random variables satisfy (i).

We prove (ii) by computing the joint distribution of (Y(0), Y(1)) given (X, U, Z = 1) and also the joint distribution of (Y(0), Y(1)) given (X, U, Z = 0):

 P(Y(0) ≤ y₀, Y(1) ≤ y₁ | X, U, Z = 1) = P(G⁻¹(V₁ | X, Z = 0, U) ≤ y₀, Y ≤ y₁ | X, U, Z = 1)
  = G(y₀ | X, Z = 0, U) P(Y ≤ y₁ | X, U, Z = 1)
  = G(y₀ | X, Z = 0, U) G(y₁ | X, Z = 1, U)

 P(Y(0) ≤ y₀, Y(1) ≤ y₁ | X, U, Z = 0) = P(Y ≤ y₀, G⁻¹(V₁ | X, Z = 1, U) ≤ y₁ | X, U, Z = 0)
  = P(Y ≤ y₀ | X, U, Z = 0) G(y₁ | X, Z = 1, U)
  = G(y₀ | X, Z = 0, U) G(y₁ | X, Z = 1, U).

Since these are the same, (Y(0), Y(1)) ⊥⊥ Z | (X, U).

Next, we establish (iii) by directly computing ē(X, U). First, observe that E[U | X, Z = 1] = 1 − τ:

 E[U | X, Z = 1] = P(Y > Q_τ(X, 1) | X, Z = 1) + P(Y = Q_τ(X, 1), V₀ < [F(Q_τ(X, 1) | X, 1) − τ] / H̄(Q_τ(X, 1) | X, 1) | X, Z = 1)
  = 1 − F(Q_τ(X, 1) | X, 1) + H̄(Q_τ(X, 1) | X, 1) × [F(Q_τ(X, 1) | X, 1) − τ] / H̄(Q_τ(X, 1) | X, 1)
  = 1 − τ.

A similar calculation shows E[U | X, Z = 0] = τ. Therefore, we may calculate ē for U = 1 by Bayes' rule:

 ē(x, 1) = e(x) P(U = 1 | X = x, Z = 1) / (e(x) P(U = 1 | X = x, Z = 1) + [1 − e(x)] P(U = 1 | X = x, Z = 0))
  = e(x)(1 − τ) / (e(x)(1 − τ) + [1 − e(x)]τ)
  = e(x) / (e(x) + [1 − e(x)]Λ),

where the last step uses τ/(1 − τ) = Λ. By similar reasoning, ē(x, 0) = e(x)/(e(x) + [1 − e(x)]Λ⁻¹). Both ē(x, 1) and ē(x, 0) satisfy the bounded odds ratio condition, so ē(X, U) ∈ E_∞(Λ).

Finally, we confirm item (iv), that E[Y(1)] = ψ_T^+ and E[Y(0)] = ψ_C^−. Let Ē⁺ be as in Proposition 3. The explicit formulas for ē(X, U) obtained in the proof of (iii) and the definition of U show that Z/ē(X, U) = Z/Ē⁺, except possibly when Y = Q_τ(X, 1). Moreover, E[Z/ē(X, U) | X] = 1:

 E[Z/ē(X, U) | X] = e(X) E[1/ē(X, U) | X, Z = 1]
  = e(X) (P(U = 0 | X, Z = 1)/ē(X, 0) + P(U = 1 | X, Z = 1)/ē(X, 1))
  = e(X) [τ(1 + (1 − e(X))/e(X) · Λ⁻¹) + (1 − τ)(1 + (1 − e(X))/e(X) · Λ)]
  = 1.
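Before moving on, the closed forms just derived are easy to check numerically. The following sketch (illustrative Python of our own; `plausible_propensities` and `check_construction` are hypothetical helper names, not code from the paper) verifies the bounded odds-ratio property and the identity E[Z/ē(X, U) | X] = 1.

```python
# Numeric sanity check of the construction above: for any nominal
# propensity e in (0, 1) and Lambda >= 1, the plausible propensities
#   e_bar(x, 1) = e / (e + (1 - e) * Lam)
#   e_bar(x, 0) = e / (e + (1 - e) / Lam)
# satisfy the Lambda odds-ratio bound and E[Z / e_bar(X, U) | X] = 1
# when P(U = 1 | X, Z = 1) = 1 - tau with tau = Lam / (1 + Lam).

def plausible_propensities(e, lam):
    """Worst-case propensity scores e_bar(x, u) for u in {0, 1}."""
    e_bar1 = e / (e + (1.0 - e) * lam)
    e_bar0 = e / (e + (1.0 - e) / lam)
    return e_bar0, e_bar1

def check_construction(e, lam, tol=1e-12):
    tau = lam / (1.0 + lam)
    e_bar0, e_bar1 = plausible_propensities(e, lam)
    # Odds ratio of e_bar relative to e stays within [1/lam, lam].
    odds = lambda p: p / (1.0 - p)
    assert 1.0 / lam - tol <= odds(e_bar0) / odds(e) <= lam + tol
    assert 1.0 / lam - tol <= odds(e_bar1) / odds(e) <= lam + tol
    # E[Z / e_bar(X, U) | X] = e * [tau/e_bar0 + (1 - tau)/e_bar1] = 1.
    mean_inv = e * (tau / e_bar0 + (1.0 - tau) / e_bar1)
    assert abs(mean_inv - 1.0) < 1e-9
    return mean_inv

for e in (0.1, 0.5, 0.9):
    for lam in (1.0, 2.0, 5.0):
        check_construction(e, lam)
```

Here ē(x, 1) tilts the odds of treatment down by exactly the factor Λ and ē(x, 0) tilts them up by the same factor, which is why both extremes of the odds-ratio band are attained.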
Therefore, E[YZ/ē(X, U)] = ψ_T^+ by Proposition 3. By exchanging the labels "treated" and "control" and applying Proposition 3, one can verify by a similar argument that E[Y(1 − Z)/(1 − ē(X, U))] = ψ_C^− as well.

To construct a distribution Q⁻ attaining the minimal value of the ATE, we use a negation trick. Let (X′, Y′, Z′) = (X, −Y, Z), and let P′ be the distribution of (X′, Y′, Z′). By the calculations above, there exists a distribution Q′ which simultaneously attains the largest value of E[Y′(1)] and the smallest value of E[Y′(0)] compatible with P′ and the marginal sensitivity model. As a result, −E_{Q′}[Y′(1)] is the smallest value of ψ_T compatible with P and −E_{Q′}[Y′(0)] is the largest value of ψ_C compatible with P. If we let Q⁻ be the distribution of (X, −Y′(0), −Y′(1), Z, U) when (X, Y′(0), Y′(1), Z, U) ∼ Q′, then Q⁻ attains the minimal value of the ATE.

Finally, we show that any value of ψ_ATE between ψ_ATE^− and ψ_ATE^+ can be realized by a distribution Q compatible with the observed data law and the marginal sensitivity model. In particular, this will prove that for any Ē ∈ E_∞(Λ), there exists a compatible distribution Q with E_Q[Y(1) − Y(0)] = E_P[YZ/Ē − Y(1 − Z)/(1 − Ē)].
Let λ ∈ [0, 1] be arbitrary. Let Q_λ be the distribution which is sampled as follows: first, sample B ∼ Bernoulli(λ). If B = 0, sample (X, Y(0), Y(1), Z, U) ∼ Q⁻. Otherwise, sample (X, Y(0), Y(1), Z, U) ∼ Q⁺. Finally, return (X, Y(0), Y(1), Z, U, B). It is clear that Q_λ matches the marginal distribution of (X, Y, Z) from P (since Q⁻ and Q⁺ both do) and satisfies Assumption 1 with the confounders (U, B). Moreover, the ATE under the distribution Q_λ is a convex combination of the extremal ATEs. Since this construction works for any λ, the identified set must be an interval.

This completes the proof of Proposition 4. Theorem 2 follows as a corollary.

B.5 Proof of Proposition 5
Proof.
If Λ = 1, the claim holds trivially, so we proceed assuming Λ > 1. Let Ŵᵢ = Zᵢ(1 − ê(Xᵢ))/ê(Xᵢ). Since L_n is convex, computing the subdifferential optimality criterion for γ̂ shows that there exists a vector ∆ ∈ [Λ⁻¹, Λ]ⁿ such that P_n Ŵ g(X)(∆ − 1) = 0 and ∆ᵢ = Λ^{sign(Yᵢ − γ̂⊤g(Xᵢ))} whenever Yᵢ ≠ γ̂⊤g(Xᵢ).

We will first show that ē*ᵢ := (1 + ∆ᵢ(1 − êᵢ)/êᵢ)⁻¹ solves (23). It is clear that ē*ᵢ belongs to E_n(Λ). Moreover, we have 0 = P_n Ŵ g(X)(∆ − 1) = P_n g(X)Z/ē* − P_n g(X)Z/ê(X). Therefore ē* is a feasible solution to (23).

Optimality of ē*ᵢ follows from Theorem 3.1 in [11]. The main technical requirement to apply that result is that P_n g(X)Z/ê(X) lies in the relative interior of {P_n g(X)Z/ẽ : ẽ ∈ E_n(Λ)}. If 0 < êᵢ < 1 for all i, then this condition is satisfied by the open mapping theorem and the fact that 1/ê is an interior point of 1/E_n(Λ).

Finally, we show the desired equivalence:

 P_n YZ/ē* / P_n Z/ê(X)
  =_i [P_n (Y − γ̂⊤g(X))Z/ē* + P_n γ̂⊤g(X)Z/ē*] / P_n Z/ê(X)
  =_ii [P_n (Y − γ̂⊤g(X))Z/ē* + P_n γ̂⊤g(X)Z/ê(X)] / P_n Z/ê(X)
  =_iii [P_n (Y − γ̂⊤g(X))Z(1 + Λ^V̂ (1 − ê(X))/ê(X)) + P_n γ̂⊤g(X)Z/ê(X)] / P_n Z/ê(X).

Here, step i adds and subtracts the term P_n γ̂⊤g(X)Z/ē* in the numerator, step ii uses the fact that ē* "balances" g(X), and step iii restates ē* in terms of V̂. Since P_n YZ/ē* / P_n Z/ê(X) is the objective value from (23), this proves Proposition 5.

B.6 Proofs of Theorems 3 and 4 for linear quantiles
In this section, we give the proofs of Theorems 3 and 4 under the assumption that Q̂_τ(x, z) = β̂(z)⊤h(x) for some "features" h : X → R^k with finite variance. We assume throughout that h contains an "intercept", i.e. h₁(x) ≡ 1. For simplicity, we only give the arguments for the estimator ψ̂_T^+. Results for the other quantile balancing bounds follow by essentially the same arguments. Since this estimator only involves a single estimated quantile function, we will lighten the notation by writing Q(x) and Q̂(x) in place of Q_τ(x, 1) and Q̂_τ(x, 1).

B.6.1 Supporting lemmas
The proofs will make use of several easy lemmas.
Lemma 1.
Assume that Conditions 1 and 2 hold, and also that Q(x) = β₀⊤h(x) for some β₀ ∈ R^k. Further suppose that E[h(X)h(X)⊤] is finite and nonsingular. Let γ̂ minimize the loss function

 L_n(γ) = P_n ρ_τ(Y − γ⊤h(X)) Z(1 − ê(X))/ê(X).

Then γ̂ →_p β₀.

Proof. This follows from general results on convex M-estimators, e.g. Theorem 2.7 in [32].
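The loss L_n above can be made concrete with a small example. The sketch below (our own illustrative Python with made-up data; the paper ships no code) works in the intercept-only case h(x) ≡ 1, where minimizing the weighted pinball loss reduces to computing a weighted τ-quantile of the treated outcomes.

```python
# Illustrative sketch: the loss in Lemma 1,
#   L_n(gamma) = P_n rho_tau(Y - gamma' h(X)) * Z * (1 - e_hat(X)) / e_hat(X),
# is a weighted pinball loss over the treated units. In the intercept-only
# case h(x) = 1, its minimizer is a weighted tau-quantile of {Y_i : Z_i = 1}.

def pinball(u, tau):
    """rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def weighted_quantile(ys, ws, tau):
    """Smallest y whose cumulative weight fraction reaches tau; this
    minimizes the weighted pinball loss."""
    pairs = sorted(zip(ys, ws))
    total = sum(ws)
    cum = 0.0
    for y, w in pairs:
        cum += w
        if cum >= tau * total:
            return y
    return pairs[-1][0]

def lemma1_loss(gamma, ys, zs, e_hats, tau):
    """L_n(gamma) for the intercept-only model."""
    n = len(ys)
    return sum(
        pinball(y - gamma, tau) * z * (1.0 - e) / e
        for y, z, e in zip(ys, zs, e_hats)
    ) / n

# Tiny worked example with Lambda = 3/2, so tau = 3/5 (hypothetical data).
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
zs = [1, 1, 1, 1, 0]          # only treated units enter the loss
e_hats = [0.5, 0.5, 0.25, 0.5, 0.5]
tau = 3.0 / 5.0
ws = [z * (1.0 - e) / e for z, e in zip(zs, e_hats)]
gamma_hat = weighted_quantile([y for y, z in zip(ys, zs) if z],
                              [w for w, z in zip(ws, zs) if z], tau)
# The weighted quantile agrees with a grid search over the loss itself.
best = min((lemma1_loss(g / 10.0, ys, zs, e_hats, tau), g / 10.0)
           for g in range(0, 61))
assert abs(best[1] - gamma_hat) < 1e-9
```

The weight Z(1 − ê)/ê is largest for treated units with small fitted propensity, which is exactly where a miscalibrated quantile would distort the worst-case bound the most.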
Lemma 2.
Let Ûᵢ = sign(Yᵢ − Q̂(Xᵢ)). Then we have the inequality:

 ψ̂_T^+ ≤ [P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − ê(X))/ê(X)) + P_n Q̂(X)Z/ê(X)] / P_n Z/ê(X).    (26)

Proof.
By Proposition 5, ψ̂_T^+ would be exactly equal to the right-hand side of (26) if Ûᵢ were replaced by V̂ᵢ = sign(Yᵢ − γ̂₀ − γ̂₁Q̂(Xᵢ)), where γ̂ = (γ̂₀, γ̂₁) comes from minimizing L_n with g(X) = (1, Q̂(X)). However, (Yᵢ − Q̂(Xᵢ))Λ^Ûᵢ is (weakly) larger than (Yᵢ − Q̂(Xᵢ))Λ^V̂ᵢ for every i, since Ûᵢ exactly matches the sign of Yᵢ − Q̂(Xᵢ) while V̂ᵢ might not. Making this replacement index-by-index gives (26).

Lemma 3.
Let Uᵢ = sign(Yᵢ − Q(Xᵢ)). Then we have the inequality:

 ψ̂_T^+ ≥ [P_n (Y − γ̂⊤h(X))Z(1 + Λ^U (1 − ê(X))/ê(X)) + P_n γ̂⊤h(X)Z/ê(X)] / P_n Z/ê(X),    (27)

where γ̂ is as in Lemma 1.

Proof. For the purposes of this proof, let ψ̄_T^+ be the solution to the "feature-balancing" problem:

 ψ̄_T^+ = max_{ē ∈ E_n(Λ)} (Σᵢ YᵢZᵢ/ēᵢ) / (Σᵢ Zᵢ/ēᵢ)  subject to  P_n h(X)Z/ē = P_n h(X)Z/ê(X).

It is clear that ψ̂_T^+ ≥ ψ̄_T^+: the feature-balancing problem has the same objective as the quantile-balancing problem but faces more constraints, since h contains an intercept and Q̂ is a linear combination of h. Proposition 5 implies that ψ̄_T^+ would be exactly equal to the right-hand side of (27) if we replaced Uᵢ by Ûᵢ = sign(Yᵢ − γ̂⊤h(Xᵢ)). However, (Yᵢ − γ̂⊤h(Xᵢ))Λ^Uᵢ is (weakly) smaller than (Yᵢ − γ̂⊤h(Xᵢ))Λ^Ûᵢ for every i, since Ûᵢ exactly matches the sign of Yᵢ − γ̂⊤h(Xᵢ) while Uᵢ might not. Making this replacement index-by-index gives (27).

B.6.2 Proof of main results
Finally, we are ready to prove the main results, which we restate to make the regularity assumptions moreprecise.
Theorem 3(i) (Sharpness for ψ_T^+). Assume Conditions 1, 2, and 3.(i). If Q(x) = β₀⊤h(x) for some β₀ ∈ R^k and β̂ →_p β₀, then ψ̂_T^+ = ψ_T^+ + o_P(1). However, even if Q(x) ≠ β⊤h(x) for any β, we still have ψ̂_T^+ ≥ ψ_T^+ − o_P(1).

Proof. We start by proving the upper bound ψ̂_T^+ ≤ ψ_T^+ + o_P(1) in the well-specified case. Lemma 2 gives the following upper bound on the quantile balancing estimator:

 ψ̂_T^+ ≤ [P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − ê(X))/ê(X)) + P_n Q̂(X)Z/ê(X)] / P_n Z/ê(X).

By Condition 1, P_n Z/ê(X) →_p 1, and the consistency of β̂ implies P_n Q̂(X)Z/ê(X) →_p E[Q(X)]. To establish the upper bound, it remains to show that P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − ê(X))/ê(X)) converges to ψ_T^+ − E[Q(X)].

The first step is to replace the estimated propensity score ê appearing in this quantity by the true nominal propensity score e. The Cauchy–Schwarz inequality and Condition 1 imply:

 P_n (Y − Q̂(X))Z Λ^Û [(1 − ê(X))/ê(X) − (1 − e(X))/e(X)]
  = O(‖Y − β̂⊤h(X)‖_{L²(P_n)} × ‖1/ê(X) − 1/e(X)‖_{L²(P_n)})
  = O_P((‖Y‖_{L²(P_n)} + ‖β̂⊤h(X)‖_{L²(P_n)}) × ε⁻² ‖ê(X) − e(X)‖_{L∞(P_n)})
  = O_P(‖Y‖_{L²(P_n)} + ‖Q(X)‖_{L²(P_n)}) × o_P(1)
  = o_P(1).

Thus, P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − ê(X))/ê(X)) = P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − e(X))/e(X)) + o_P(1).

The next step is to replace Û and Q̂(X) by U = sign(Y − Q(X)) and Q(X), respectively. For this, we employ a uniform convergence argument. For each β ∈ R^k, define the function f_β(x, y, z) by:

 f_β(x, y, z) = (y − β⊤h(x)) z (1 + Λ^{sign(y − β⊤h(x))} (1 − e(x))/e(x)).

Standard Glivenko–Cantelli (GC) preservation arguments (c.f. [26, 46, 47]) show that the class F = {f_β : ‖β − β₀‖ ≤ 1} is GC, so we have the uniform convergence sup_{f∈F} |P_n f − Pf| = o_P(1). Moreover, the map β ↦ Pf_β is continuous at β₀, which can be seen by noticing that as β → β₀, f_β(x, y, z) → f_{β₀}(x, y, z) for almost every (x, y, z) (exceptions occur when y = β₀⊤h(x), but Condition 2 implies this happens with probability zero) and then applying the dominated convergence theorem.
Thus, we have:

 P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − e(X))/e(X)) = P_n f_β̂(X, Y, Z)
  = P f_β̂ + o_P(1)
  = P f_β₀ + o_P(1)
  = E[(Y − Q(X))Z/Ē⁺] + o_P(1)
  = ψ_T^+ − E[Q(X)] + o_P(1).

Combining these various results gives ψ̂_T^+ ≤ ψ_T^+ + o_P(1). This establishes the upper bound in the well-specified case.

Now we turn to the lower bound, ψ̂_T^+ ≥ ψ_T^+ − o_P(1), beginning in the correctly-specified case. Lemma 3 lower bounds the quantile balancing estimator by a variant of the "feature balancing" estimator:

 ψ̂_T^+ ≥ [P_n (Y − γ̂⊤h(X))Z(1 + Λ^U (1 − ê(X))/ê(X)) + P_n γ̂⊤h(X)Z/ê(X)] / P_n Z/ê(X).

We will show that this lower bound is at least ψ_T^+ − o_P(1). We may assume without loss of generality that E[h(X)h(X)⊤] is full rank, since excising features that are linear combinations of other ones has no effect on the feature balancing estimator. In the preceding display, the denominator P_n Z/ê(X) converges to one, so we can focus on the two terms in the numerator.

Since Lemma 1 implies that γ̂ is consistent, exactly the same arguments from the upper bound show P_n γ̂⊤h(X)Z/ê(X) →_p E[Q(X)]. Moreover, the argument from the upper bound shows that ê can be replaced by e in the expression P_n (Y − γ̂⊤h(X))Z(1 + Λ^U (1 − ê(X))/ê(X)). Some manipulation shows that 1 + Λ^U (1 − e(X))/e(X) = 1/Ē⁺ almost surely, where Ē⁺ is the worst-case propensity score defined in Proposition 3. Therefore, we may write:

 P_n (Y − γ̂⊤h(X))Z(1 + Λ^U (1 − e(X))/e(X)) = P_n (Y − γ̂⊤h(X))Z/Ē⁺ + o_P(1)
  = P_n (Y − Q(X))Z/Ē⁺ + O_P(‖γ̂ − β₀‖) + o_P(1)
  = ψ_T^+ − E[Q(X)] + o_P(1).

Combining these various results gives ψ̂_T^+ ≥ ψ_T^+ − o_P(1). This establishes the lower bound in the well-specified case.

Finally, we extend the lower bound to the misspecified case.
If Q(x) ≠ β⊤h(x) for any β, then we can lower bound ψ̂_T^+ by the feature-balancing estimator that balances h(x) and the true quantile Q(x). This brings us back to the well-specified case, so the preceding arguments show ψ̂_T^+ ≥ ψ_T^+ − o_P(1).

Theorem 4(i) (Inference for ψ_T^+). Assume Conditions 1, 2, and 3.(i). Suppose that the estimated propensity score satisfies ê(x) = e(x, θ̂) for some parametric model {e(·, θ) : θ ∈ R^k} satisfying the following regularity conditions:

(i) There exists θ₀ ∈ R^k such that e(x) = e(x, θ₀).
(ii) θ̂ minimizes the function θ ↦ P_n ℓ(θ, X, Z), where θ ↦ ℓ(θ, x, z) is convex for each (x, z).
(iii) The function L(θ) = P ℓ(θ, X, Z) satisfies ∇L(θ₀) = 0 and ∇²L(θ₀) ≻ 0.
(iv) ‖e(·, θ) − e(·, θ₀)‖_∞ ≤ r(‖θ − θ₀‖) for some continuous function r with r(0) = 0.
(v) The function class {(x, z) ↦ ℓ(θ, x, z) : ‖θ − θ₀‖ ≤ 1} is P-Donsker.
(vi) The function class {e(·, θ) : ‖θ − θ₀‖ ≤ 1} has bounded uniform entropy integral.

Suppose the number of bootstrap samples B ≡ B_n tends to infinity. Then we have:

 lim inf_{n→∞} P(ψ_T^+ ≤ Q_{1−α}({ψ̂_b^+}_{b∈[B]})) ≥ 1 − α  for all α ∈ (0, 1).

Proof. Suppose for now that the linear quantile model is correctly specified, i.e. Q(x) = β₀⊤h(x) for some β₀ ∈ R^k. For each b ∈ [B_n], let {(X_i^b, Y_i^b, Z_i^b)}_{i≤n} denote the observations from the b-th bootstrap sample. For a function f, write P_n^b f = n⁻¹ Σ_{i=1}^n f(X_i^b, Y_i^b, Z_i^b).
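The resampling scheme just introduced can be sketched in a few lines (our own illustrative Python; `psi_hat` is a placeholder statistic standing in for the quantile balancing bound, not the paper's estimator):

```python
# Minimal percentile-bootstrap sketch: for each bootstrap sample b, the
# statistic is recomputed from scratch on the resampled rows, and the upper
# confidence bound is the (1 - alpha) quantile of the B replicates.
import random

def psi_hat(sample):
    """Placeholder statistic: a sample mean stands in for psi_hat_T^+."""
    return sum(sample) / len(sample)

def percentile_bootstrap_ucb(data, alpha=0.05, num_boot=2000, seed=0):
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(num_boot):
        boot = [data[rng.randrange(n)] for _ in range(n)]  # resample rows
        reps.append(psi_hat(boot))
    reps.sort()
    # (1 - alpha) empirical quantile of the bootstrap replicates.
    k = min(num_boot - 1, int((1.0 - alpha) * num_boot))
    return reps[k]

data = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.3]  # hypothetical data
ucb = percentile_bootstrap_ucb(data)
assert ucb >= psi_hat(data) - 0.2
```

In the paper's setting, recomputing the statistic on each bootstrap sample means refitting both the propensity model (θ̂_b) and the quantile coefficients (γ̂_b), as defined next.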
Define θ̂_b and γ̂_b by:

 θ̂_b = argmin_θ P_n^b ℓ(θ, X, Z)
 γ̂_b = argmin_γ P_n^b ρ_τ(Y − γ⊤h(X)) Z(1 − e(X, θ̂_b))/e(X, θ̂_b).

Then the proof of Lemma 3 implies that the following inequality holds deterministically for every b:

 ψ̂_b^+ ≥ [P_n^b (Y − γ̂_b⊤h(X))Z(1 + Λ^U (1 − e(X, θ̂_b))/e(X, θ̂_b)) + P_n^b γ̂_b⊤h(X)Z/e(X, θ̂_b)] / P_n^b Z/e(X, θ̂_b).    (28)

We will call the estimator on the right-hand side of (28) the lower bound ψ̃_b^+, and use ψ̃^+ to refer to the same estimator fit on the original sample {(X_i, Y_i, Z_i)}_{i≤n}. The bootstrap distribution of ψ̃^+ is only hypothetical, because it depends on the unknown function Q, but its quantiles must lie below the quantiles of the bootstrap distribution of ψ̂_T^+. Therefore, if we can show that the percentile bootstrap based on the hypothetical estimator ψ̃^+ is valid, then we will have proved that the more conservative feasible percentile bootstrap for ψ̂_T^+ is valid as well.

The validity of the percentile bootstrap for ψ̃^+ can be established using standard tools. We will briefly sketch how this can be done. Notice that (θ̂_b, γ̂_b, ψ̃_b^+) approximately solve P_n^b m_{θ̂_b, γ̂_b, ψ̃_b^+}(X, Y, Z) = 0, where the score function m is defined by:

 m_{θ,γ,ψ}(x, y, z) = ( ∇ℓ(θ, x, z),
  h(x)(τ − 1{y − γ⊤h(x) < 0}) z(1 − e(x, θ))/e(x, θ),
  (y − γ⊤h(x)) z (1 + Λ^{sign(y − Q(x))} (1 − e(x, θ))/e(x, θ)) + γ⊤h(x) z/e(x, θ) − ψ z/e(x, θ) ).

Therefore, one can apply the standard theory of bootstrap Z-estimators (in particular, Theorem 10.6 in [26]) to verify bootstrap consistency. The verification of the technical requirements for that result from our assumptions is routine, and convexity of the loss functions θ ↦ ℓ(θ, x, z) and γ ↦ ρ_τ(y − γ⊤h(x)) can be used in lieu of the global Glivenko–Cantelli requirement on the moment equations.
See Appendix C of [49] for the type of verification calculations that remain.

Finally, it remains to remove the assumption that the linear quantile model is correctly specified. If Q(x) ≠ β⊤h(x) for any β, then we may once again lower bound ψ̂_T^+ by the estimator that balances h(x) and the true quantile Q(x). This brings us back to the well-specified case, and the preceding arguments imply the validity of the bootstrap upper confidence bound.

B.7 Proof of Theorem 3 for nonlinear quantiles

In this section, we prove Theorem 3 when quantiles are estimated by a nonlinear model. As in the case of linear quantiles, we will give the argument for the estimator ψ̂_T^+. As such, we will continue to use Q̂(x) and Q(x) as shorthand for Q̂_τ(x, 1) and Q_τ(x, 1).

B.7.1 Regularity conditions
As alluded to in Condition 3, we require nonlinear models to be estimated using a form of sample splitting called "cross-fitting" [10, 42, 33]. We briefly describe the procedure, mostly to fix notation. The sample {(X_i, Y_i, Z_i)}_{i≤n} is divided into K disjoint "folds" F₁, ..., F_K of approximately equal size. For each k ∈ [K], a quantile estimate Q̂^{−k} is obtained using only observations not in F_k. Finally, we set Q̂_i = Σ_{k=1}^K Q̂^{−k}(X_i) 1{i ∈ F_k}. In this way, no observation is used to obtain its own quantile estimate. In the extreme case where K is equal to the sample size, this is simply "leave-one-out" estimation. However, in cross-fitting, K is taken to be a fixed constant.

We also require the fitted quantiles Q̂_i to satisfy an additional regularity condition.

Condition N.
For some α, β > 0, we have max_{i≤n} |Q̂_i| = o_P(n^α) and P(0 < |Q̂_i − Q̂_j| < n^{−β} for some (i, j)) → 0.

This condition rules out gross "outliers" among the Q̂_i which are difficult to balance. The condition max_i |Q̂_i| = o_P(n^α) alone is not sufficient for this, because it is not an affine-invariant assumption: one can take an arbitrarily poorly-behaved estimate Q̂ and rescale it to be bounded by one without changing the estimator ψ̂_T^+. The separation requirement rules out this trick.

It is not hard to find examples of estimators which satisfy this condition. For example, under Conditions 1 and 2, Condition N is satisfied by any estimator whose fitted values {Q̂_i} only take values in the observed outcomes {Y_i}. Examples include the nearest-neighbor quantile regression method of [43], as well as the random-forest methods of [29] and [2].

Proposition 6.
Assume Conditions 1 and 2. If {Q̂_i}_{i≤n} ⊆ {Y_j}_{j≤n} almost surely, then Condition N is satisfied with α = 1/2 and any β > 2.

Proof. The upper bound follows from the well-known fact that the maximum of n i.i.d. observations from a distribution with finite variance has magnitude o_P(n^{1/2}). Therefore, max_i |Q̂_i| ≤ max_j |Y_j| = o_P(n^{1/2}).

For the lower bound, it suffices to show that P(min_{i≠j} |Y_i − Y_j| < n^{−β}) → 0 whenever β > 2. Let F_Y(y) = P(Y ≤ y), and let B < ∞ be a uniform bound on F′_Y(·); this exists since f(y | x, z) is uniformly bounded by Condition 2. Then P(min_{i≠j} |Y_i − Y_j| < n^{−β}) ≤ P(n²∆ ≤ Bn^{−(β−2)}), where ∆ = min_{i≠j} |F_Y(Y_i) − F_Y(Y_j)|. Theorem 8.2 in [12] shows that n²∆ ⇝ Exponential(1), so P(n²∆ ≤ Bn^{−(β−2)}) → 0.

B.7.2 Supporting lemmas
To simplify the proof, we separate out a preliminary convergence result as a lemma. Throughout this proof and the next, we will use the following notation: for a function f, P_n^k f denotes the fold-k average |F_k|⁻¹ Σ_{i∈F_k} f(X_i, Y_i, Z_i).

Lemma 4.
Assume Condition 1. Suppose ‖Q̂^{−k} − Q‖_{L²(P)} →_p 0 for each k ∈ [K]. Then ‖Q̂ − Q‖_{L²(P_n)} = o_P(1) and P_n Q̂ Z/ê(X) = E[Q(X)] + o_P(1).

Proof. Start with the first claim. For any k ∈ [K], applying Markov's inequality conditionally on {(X_i, Y_i, Z_i)}_{i∉F_k} gives P_n^k (Q̂^{−k}(X) − Q(X))² = O_P(‖Q̂^{−k} − Q‖²_{L²(P)}) = o_P(1). Averaging over k ∈ [K] gives the desired result.

For the second claim, write:

 P_n Q̂ Z/ê(X) = P_n Q Z/e(X) + P_n (Q̂ − Q) Z/e(X) + O(‖Q̂‖_{L²(P_n)} ‖1/ê − 1/e‖_{L²(P_n)})
  = E[Q(X)] + O(‖Q̂ − Q‖_{L²(P_n)}/ε) + o_P(1)
  = E[Q(X)] + o_P(1).

B.7.3 Proof of main result

Now we are ready to prove Theorem 3 for nonlinear quantile models in the case of the estimand ψ_T^+. We restate the result to make the quantile consistency assumption precise.

Theorem 3(ii).
Assume Conditions 1, 2, 3.(ii), and N. If ‖Q̂^{−k} − Q‖_{L²(P)} = o_P(1) for each k ∈ [K], then ψ̂_T^+ = ψ_T^+ + o_P(1). However, even if ‖Q̂^{−k} − Q‖_{L²(P)} ↛ 0, we still have ψ̂_T^+ ≥ ψ_T^+ − o_P(1).

Proof. We start by proving ψ̂_T^+ ≤ ψ_T^+ + o_P(1) when the quantile model is consistent. This part of the proof follows roughly the same template as the corresponding proof in the linear case. Lemma 2 implies:

 ψ̂_T^+ ≤ [P_n (Y − Q̂)Z(1 + Λ^Û (1 − ê(X))/ê(X)) + P_n Q̂ Z/ê(X)] / P_n Z/ê(X),

where Û_i = sign(Y_i − Q̂_i). Since P_n Z/ê(X) →_p 1 and P_n Q̂ Z/ê(X) →_p E[Q(X)] by Lemma 4, it remains to show that P_n (Y − Q̂)Z(1 + Λ^Û (1 − ê(X))/ê(X)) converges to ψ_T^+ − E[Q(X)]. By the same reasoning as in the linear case, we may replace ê(X) by e(X) in this quantity without changing its value much. Thus, we may write:

 P_n (Y − Q̂)Z(1 + Λ^Û (1 − ê(X))/ê(X))
  = P_n (Y − Q̂)Z(1 + Λ^Û (1 − e(X))/e(X)) + o_P(1)
  =_i P_n (Y − Q(X))Z(1 + Λ^Û (1 − e(X))/e(X)) + O(ε⁻¹ ‖Q̂ − Q(X)‖_{L²(P_n)}) + o_P(1)
  =_ii P_n (Y − Q(X))Z(1 + Λ^Û (1 − e(X))/e(X)) + o_P(1)
  =_iii P_n (Y − Q(X))Z/Ē⁺ + O(‖Y − Q(X)‖_{L²(P_n)} ‖ZΛ^Û − ZΛ^U‖_{L²(P_n)}) + o_P(1)
  =_iv ψ_T^+ − E[Q(X)] + O_P(‖ZΛ^Û − ZΛ^U‖_{L²(P_n)}) + o_P(1).

Here, step i adds and subtracts a term then applies Cauchy–Schwarz, step ii applies Lemma 4 to conclude ‖Q̂ − Q‖_{L²(P_n)} = o_P(1), step iii adds and subtracts P_n (Y − Q(X))Z/Ē⁺ and applies Cauchy–Schwarz, and step iv holds by Proposition 3 and the law of large numbers.

It remains to prove that ‖ZΛ^Û − ZΛ^U‖_{L²(P_n)} = o_P(1), or equivalently (up to constants) that P_n Z 1{Û ≠ U} = o_P(1).
For each k ∈ [K], we may apply Chebyshev's inequality conditional on {(X_i, Y_i, Z_i)}_{i∉F_k} to conclude:

 |P_n^k Z 1{Û ≠ U} − ∫ z 1{sign(y − Q̂^{−k}(x)) ≠ sign(y − Q(x))} dP(x, y, z)| = o_P(1).

The integral in the preceding display tends to zero in probability. To see this, recall that Condition 2 requires the conditional density f(y | x, z) to be uniformly bounded by some B < ∞, so we may write:

 ∫ z 1{sign(y − Q̂^{−k}(x)) ≠ sign(y − Q(x))} dP(x, y, z) = ∫_X e(x) ∫_{Q̂^{−k}(x) ∧ Q(x)}^{Q̂^{−k}(x) ∨ Q(x)} f(y | x, 1) dy dP_X(x)
  ≤ ∫_X (1 − ε) B |Q̂^{−k}(x) − Q(x)| dP_X(x)
  ≲ ‖Q̂^{−k} − Q‖_{L¹(P)} ≤ ‖Q̂^{−k} − Q‖_{L²(P)} = o_P(1).

Thus, P_n^k Z 1{Û ≠ U} = o_P(1). Averaging over k gives P_n Z 1{Û ≠ U} = o_P(1), and so ψ̂_T^+ ≤ ψ_T^+ + o_P(1).

Now, we turn to the lower bound, which is substantially more difficult. We wish to show ψ̂_T^+ ≥ ψ_T^+ − o_P(1) whether or not Q̂^{−k} converges to Q. For each k ∈ [K], define ψ̂⁺(k) by:

 ψ̂⁺(k) = max_{ē_k ∈ E_{n,k}(Λ)} P_n^k YZ/ē_k  subject to  (P_n^k Q̂^{−k} Z/ē_k, P_n^k Z/ē_k) = (P_n^k Q̂^{−k} Z/ê(X), P_n^k Z/ê(X)),    (29)

where E_{n,k}(Λ) is the projection of E_n(Λ) onto the coordinates in F_k. Clearly, ψ̂_T^+ × P_n Z/ê(X) ≥ Σ_k ψ̂⁺(k) |F_k|/n, so it suffices to prove ψ̂⁺(k) ≥ ψ_T^+ − o_P(1) for each k.

We will make some notational simplifications. The remainder of the proof will focus on showing ψ̂⁺(1) ≥ ψ_T^+ − o_P(1). For convenience, we will assume F₁ = [n₁], where n₁ ∼ n/K. As an additional simplification, we will assume that ε/2 ≤ ê_i ≤ 1 − ε/2 for all i. Mechanically, this can always be arranged by "trimming" the estimated propensity score. Condition 1 implies the trimming has no effect in large samples, so it is only used as a theoretical device to simplify calculations. Finally, recall that we have defined Ŵ_i = Z_i(1 − ê_i)/ê_i.

We will construct a propensity vector ē* satisfying the constraints of (29) with the property that ψ̄ := P_n YZ/ē* converges to ψ_T^+. Since ψ̂⁺(1) ≥ ψ̄, this will show ψ̂⁺(1) ≥ ψ_T^+ − o_P(1). A natural first idea is to take the idealized propensity score ē*_i = (1 + θ_i(1 − ê(X_i))/ê(X_i))⁻¹, where θ_i = Λ^{U_i}. This mimics the true worst-case propensity score, but uses ê(X_i) in place of e(X_i) to satisfy the odds-ratio constraint. It is not hard to see that this would result in a sharp estimate of ψ_T^+ by classic IPW logic.
 P_n YZ(1 + θ(1 − ê(X))/ê(X)) = P_n YZ(1 + θ(1 − e(X))/e(X)) + O(‖YZ‖_{L²(P_n)} × ‖1/e − 1/ê‖_∞)
  = P_n YZ/Ē⁺ + o_P(1)
  = ψ_T^+ + o_P(1).    (30)

However, this choice of ē* is not guaranteed to satisfy the "balancing" constraints of (29). Our construction perturbs this "ideal" choice to gain feasibility.

Our construction will be convoluted, so it is worth taking a moment to explain the high-level idea. First, we discard a small number of gross "outliers" to produce a set of "inliers" I_{j*} whose fitted quantiles are relatively easy to balance. We then produce a feasible propensity ē* by assigning the outliers the nominal propensity score ê(X_i) and perturbing the inliers' idealized propensity score by a small amount. We show the resulting lower bound ψ̄ = P_n YZ/ē* is a consistent (albeit impractical) estimator of ψ_T^+.

We start by extracting a set of inliers I_{j*} ⊆ [n] in the following fashion: set I₁ = [n], and for 2 ≤ j ≤ 4β + 3, recursively define I_j by:

 I_j = {i ∈ I_{j−1} : |Q̂_i − Q̄_{j−1}| ≤ 4^{j−1} n^{α−(j−1)/4}},    (31)

where Q̄_{j−1} = (Σ_{i∈I_{j−1}} Ŵ_i Q̂_i)/(Σ_{i∈I_{j−1}} Ŵ_i) is the weighted average value of Q̂_i within I_{j−1}. We set I_{4β+4} = ∅. Let j* be the first stage in the above procedure at which at least an n^{−1/4} fraction of the "weight" in I_j comes from outliers:

 j* = min{j : (Σ_{i∈I_j \ I_{j+1}} Ŵ_i)/(Σ_{i∈I_j} Ŵ_i) ≥ n^{−1/4}}.    (32)

It is easy to verify that j* is well-defined (the set is not empty) whenever Z_i = 1 for some index i ≤ n. For completeness, we arbitrarily set j* = 4β + 3 when that does not happen.

With this definition of j*, we ensure the total "weight" on discarded outliers is asymptotically negligible. Since Σ_{i∈I_j \ I_{j+1}} Ŵ_i ≤ n^{−1/4} Σ_{i∈I_j} Ŵ_i for all j < j*, we have:

 Σ_{i∉I_{j*}} Ŵ_i = Σ_j