Sharp Sensitivity Analysis for Inverse Propensity Weighting via Quantile Balancing
Jacob Dorn† (Princeton University)    Kevin Guo (Stanford University)

February 10, 2021
Abstract
Inverse propensity weighting (IPW) is a popular method for estimating treatment effects from observational data. However, its correctness relies on the untestable (and frequently implausible) assumption that all confounders have been measured. This paper introduces a robust sensitivity analysis for IPW that estimates the range of treatment effects compatible with a given amount of unobserved confounding. The estimated range converges to the narrowest possible interval (under the given assumptions) that must contain the true treatment effect. Our proposal is a refinement of the influential sensitivity analysis by Zhao, Small, and Bhattacharya (2019), which we show gives bounds that are too wide even asymptotically. This analysis is based on new partial identification results for Tan (2006)'s marginal sensitivity model.
Estimating treatment effects from observational data is difficult because "treated" and "control" samples typically differ on many characteristics besides treatment status. One popular tool for managing this problem is inverse propensity weighting (IPW) [4, 15, 16], which re-weights treated and untreated samples to be similar along all observed characteristics and then compares outcomes in the weighted samples. The crucial assumption underlying this approach is that the weighted samples do not systematically differ along important unobserved characteristics. This "unconfoundedness" assumption is untestable, and often implausible. This paper studies how much can be learned when unconfoundedness does not hold, but one can bound the plausible degree of unobserved confounding. In particular, given a "sensitivity assumption" controlling the degree of selection, we aim to answer two questions:

(1)
Sensitivity analysis. Can we bound how much our estimates might change if unobserved confounding were properly accounted for?

(2)
Partial identification. Can we characterize the most informative bounds that could possibly be obtained from the sensitivity assumption with even an infinite amount of observational data?

The specific sensitivity assumption used in this paper is the "marginal sensitivity model" of Tan [44], which extends the famous Rosenbaum model [37, 38, 39, 48] from matched-pairs studies to IPW. This sensitivity assumption is quite popular in causal inference; see [20, 21, 22, 23, 27, 40, 44, 49] for an incomplete list of references. As we will see, it lends itself to computationally-efficient sensitivity analyses which are simple enough to explain to any practitioner comfortable with IPW.

Recently, Zhao, Small, and Bhattacharya [49] (hereafter ZSB) introduced an interpretable IPW sensitivity analysis for the marginal sensitivity model. Their approach, based on linear fractional programming, has been largely responsible for the recent resurgence of interest in this sensitivity assumption. However, they did not answer question (2), leaving open the possibility that more informative bounds could be obtained from the same data and assumptions.

∗ We are grateful for comments from Guillaume Basse, Michal Kolesár, Qingyuan Zhao, and many others who provided valuable input at earlier stages of this project. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-2039656. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
† Corresponding author: [email protected]
Indeed, there are no existing partial identification results for the marginal sensitivity model which can be used to benchmark a sensitivity analysis.

The first main contribution of this paper is to provide a complete answer to the optimality question (2). We derive closed-form expressions for the largest and smallest values of the "usual" estimands (e.g. average treatment effect) compatible with the marginal sensitivity assumption. These expressions show that the ZSB bounds are essentially always too conservative because they ignore an infinite collection of constraints implied by the distribution of observed characteristics. Tan [44] also identified these constraints, but deemed it intractable to incorporate them all in a sensitivity analysis. In contrast, our partial identification results show that this collection can actually be reduced to a single constraint which is easy to incorporate.

Our second main contribution is to introduce a new IPW sensitivity analysis, which we call the quantile balancing method. The method has several desirable features:

(i) The quantile balancing sensitivity interval is always a subset of the ZSB interval. Outside of knife-edge cases, it is a strict subset.

(ii) When the outcome's conditional quantiles can be estimated consistently, the bounds converge to the best possible bounds that can be obtained under the marginal sensitivity model. In the language of partial identification, quantile balancing is "sharp."

(iii) Under standard assumptions for IPW inference, the bounds can be converted into confidence intervals using the same bootstrap scheme proposed in [49].

(iv) When the estimated quantiles are inconsistent, the sensitivity interval is too wide rather than too narrow and the confidence intervals over-cover rather than under-cover. In other words, our intervals are guaranteed to be valid, regardless of the quality of the additional input we demand.
To our knowledge, no other standard method for estimating sharp bounds has the same robustness to inconsistent estimation of nuisance parameters [48, 28].

We apply the quantile balancing method in several simulated examples and one real-data application, and find that it can substantially tighten the ZSB bounds when the covariates are good predictors of the outcome. One shortcoming we will mention is that our statistical guarantees assume the outcome is continuously distributed. This seems to be inevitable as our sensitivity analysis relies on quantile regression. Since our partial identification results also apply to discrete outcomes, we conjecture that the quantile balancing procedure could be modified to give sharp bounds in that setting too.

We consider the Neyman-Rubin potential outcomes model with a binary treatment [34, 41]. Experimental units (X_i, Y_i(0), Y_i(1), Z_i) are sampled i.i.d. from a probability distribution P. The real-valued random variables (Y_i(0), Y_i(1)) are called "potential outcomes," Z_i ∈ {0, 1} is a binary treatment assignment indicator, and X_i ∈ 𝒳 ⊆ R^d is a vector of covariates. The statistician only observes {(X_i, Y_i, Z_i)}_{i ≤ n}, where Y_i = Z_i Y_i(1) + (1 − Z_i) Y_i(0) is called the "observed outcome."

The goal is to use the observed data to draw inferences about a causal estimand ψ. For the purposes of exposition, we initially focus on the counterfactual means ψ_T = E[Y(1)] and ψ_C = E[Y(0)], although the examples of most practical interest are the average treatment effect (ATE) and the average treatment effect on the treated (ATT):

  ψ_ATE = E[Y(1) − Y(0)],    ψ_ATT = E[Y(1) − Y(0) | Z = 1].

With minor modification, our identification results can also be applied to more complex estimands, including weighted average treatment effects and policy values of the type considered in [3, 22].
However, we do not present those extensions in this paper.

Under the "unconfoundedness" assumption (Y(0), Y(1)) ⊥ Z | X, all of the above quantities can be estimated using inverse propensity weighting. IPW estimators "work" by reweighting the observed sample by (some function of) the propensity score e(x) := P(Z = 1 | X = x). For example, if the estimand of interest is ψ_T, the (stabilized) IPW estimate is given by (1):

  ψ̂_T = [ Σ_{i=1}^n Y_i Z_i / ê(X_i) ] / [ Σ_{i=1}^n Z_i / ê(X_i) ].   (1)

Here, ê(X_i) is an estimate of the propensity score e(X_i). Related estimators for the other estimands considered will be denoted by ψ̂_C, ψ̂_ATE, and ψ̂_ATT. See the articles by Austin and Stuart [4] or Hirano and Imbens [15] for their exact formulas.

We will assume some conditions which are required for identification and estimation under unconfoundedness: 0 < e(X) < 1 and E[|Y|] < ∞. However, we will not assume unconfoundedness.

The "marginal sensitivity model" introduced by Tan [44] is a relaxation of unconfoundedness which has been applied in many causal inference problems [20, 21, 22, 23, 27, 40, 44, 49]. This one-parameter sensitivity assumption allows for the existence of unobserved confounders U, but limits the degree of selection bias that can be attributed to these confounders.

Assumption 1. (Marginal sensitivity model)
There exists a vector of unmeasured confounders U that, if measured, would lead to unconfoundedness: (Y(0), Y(1)) ⊥ Z | (X, U). However, within each stratum of the observed covariates, measuring U can only change the odds of treatment by at most a factor of Λ, i.e. if we set e(x, u) := P(Z = 1 | X = x, U = u), then (2) holds with probability one:

  Λ^{-1} ≤ [ e(X, U) / (1 − e(X, U)) ] / [ e(X) / (1 − e(X)) ] ≤ Λ.   (2)

To avoid confusion between e(x, u) and e(x), we will follow [23] and refer to e(x, u) as the "true propensity score" and e(x) as the "nominal propensity score."

Like the famous Rosenbaum model [37, 38, 39, 48] for matched-pairs studies, Assumption 1 controls the degree of unobserved confounding with a single parameter Λ. When Λ = 1, measuring additional confounders cannot change the odds of treatment at all, i.e. treatment assignment is unconfounded. As Λ increases, stronger forms of confounding are allowed. For advice on how to choose this parameter, see Kallus and Zhou [22]. We remark that the marginal sensitivity assumption is "nonparametric" in the following sense: no assumptions are needed about how e(x, u) depends on u. The dimension of U does not even need to be specified.

To see how Assumption 1 can be used for sensitivity analysis, begin by considering how an oracle statistician who observes the confounders U_i might estimate ψ_T. One strategy would be to use the IPW estimator (3), which is consistent under weak assumptions:

  ψ̂*_T = [ Σ_{i=1}^n Y_i Z_i / e(X_i, U_i) ] / [ Σ_{i=1}^n Z_i / e(X_i, U_i) ].   (3)

In reality, {U_i}_{i ≤ n} are not observed, but under Assumption 1, it is possible to bound the true propensity scores e(X_i, U_i). In particular, the vector (e(X_1, U_1), ..., e(X_n, U_n)) must belong to the set E_n(Λ) defined in (4).
  E_n(Λ) = { ē ∈ R^n : Λ^{-1} ≤ [ ē_i / (1 − ē_i) ] / [ e(X_i) / (1 − e(X_i)) ] ≤ Λ for all i ≤ n }.   (4)

(Footnote: We have presented a slightly different version of the marginal sensitivity model from the one given in [44] and [49], which requires the unobserved confounder U to be one or both of the potential outcomes. Most of the results in this paper also apply under that assumption [35]. See Section 3.3 for more.)

ZSB propose to report as a sensitivity interval the smallest and largest values of the IPW estimate as the propensity scores range over E_n(Λ):

  [ψ̂_{T,ZSB}^-, ψ̂_{T,ZSB}^+] = [ min_{ē ∈ E_n(Λ)} Σ_{i=1}^n Y_i Z_i / ē_i / Σ_{i=1}^n Z_i / ē_i ,  max_{ē ∈ E_n(Λ)} Σ_{i=1}^n Y_i Z_i / ē_i / Σ_{i=1}^n Z_i / ē_i ].   (5)

Since the interval (5) contains the consistent estimator ψ̂*_T, the distance between the true estimand ψ_T and the sensitivity interval must tend to zero. ZSB show that this conclusion holds even if the nominal propensity score e(x) is replaced by a suitably consistent estimate ê(x) in the definition of E_n(Λ), which is important for practical applications as e(x) is typically not known in observational studies.

This simple idea is intuitive enough to explain to any practitioner who is comfortable with IPW and has been extended to estimands other than ψ_T. ZSB also consider ψ_ATE and ψ_ATT, and related work by [22, 20, 21, 27] takes the idea substantially further. Tan [44] applied a similar idea to a different propensity-score-based estimator, and [1, 31, 45] used this approach in survey sampling problems.
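Because the objective in (5) is a ratio of two linear functions of the weights and the constraint set is a box, each endpoint is attained at a vertex that gives the largest weights to the largest outcomes, so it can be computed exactly by scanning threshold splits after sorting. The sketch below is our own illustration, not ZSB's reference implementation; `w_lo` and `w_hi` are the per-unit bounds on 1/ē_i implied by the odds-ratio constraint, namely 1 + Λ^{∓1}(1 − ê_i)/ê_i.

```python
import numpy as np

def zsb_upper(Y, w_lo, w_hi):
    """Maximize sum(Y*w)/sum(w) over w_i in [w_lo_i, w_hi_i].

    A linear-fractional objective over a box is maximized at a vertex that
    assigns the large weights to the large outcomes, so it suffices to scan
    the n+1 threshold splits after sorting by Y.
    """
    o = np.argsort(Y)
    Y, lo, hi = Y[o], w_lo[o], w_hi[o]
    c_ylo = np.concatenate(([0.0], np.cumsum(Y * lo)))  # prefix sums, low weights
    c_yhi = np.concatenate(([0.0], np.cumsum(Y * hi)))  # prefix sums, high weights
    c_lo = np.concatenate(([0.0], np.cumsum(lo)))
    c_hi = np.concatenate(([0.0], np.cumsum(hi)))
    # split k: units below the threshold get w_lo, units above get w_hi
    num = c_ylo + (c_yhi[-1] - c_yhi)
    den = c_lo + (c_hi[-1] - c_hi)
    return np.max(num / den)

def zsb_interval(Y, Z, e_hat, lam):
    """ZSB-style sensitivity interval for the mean treated outcome."""
    Yt, et = Y[Z == 1], e_hat[Z == 1]
    w_lo = 1.0 + (1.0 / lam) * (1.0 - et) / et  # smallest allowed 1/e_bar
    w_hi = 1.0 + lam * (1.0 - et) / et          # largest allowed 1/e_bar
    return -zsb_upper(-Yt, w_lo, w_hi), zsb_upper(Yt, w_lo, w_hi)
```

At Λ = 1 the box collapses to a point and both endpoints reduce to the stabilized IPW estimate (1).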
The aforementioned works do not address the asymptotic optimality of the interval [ψ̂_{T,ZSB}^-, ψ̂_{T,ZSB}^+]. Does it converge to a limiting set containing all values of ψ_T compatible with Assumption 1 and no others? Sensitivity analyses with this asymptotic optimality property are called "sharp" in the partial identification literature.

Given its intuitive definition, it may be surprising to learn that the ZSB sensitivity analysis is actually not sharp. Indeed, it can be arbitrarily conservative. To illustrate this, we only need to consider a very simple joint distribution of observables:

  X ∼ N(0, σ²),   Z | X ∼ Bernoulli(1/2),   Y | X, Z ∼ N(X, 1).   (6)

Suppose that a data analyst receives i.i.d. samples (X_i, Y_i, Z_i) from this distribution and is willing to posit that Assumption 1 is satisfied with Λ = 2. Let φ(·) and z_τ denote the density and τ-th quantile of the standard normal distribution, respectively. The following proposition, which we prove in Appendix B.1, writes the set of values of ψ_T compatible with Assumption 1 explicitly in terms of these quantities and shows that this "partially identified" set does not coincide with the ZSB interval.

Proposition 1. Let (X_i, Y_i, Z_i) be i.i.d. samples from the joint distribution (6).

(i) The set of values of ψ_T compatible with the bound Λ = 2 and the distribution (6) is the interval [± (3/4) φ(z_{2/3})] ≈ [±0.27].

(ii) However, with probability one, [±0.25 √(σ² + 1)] ⊆ [ψ̂_{T,ZSB}^-, ψ̂_{T,ZSB}^+] for all large n.

The precise meaning of (i) is the following: for any ψ ∈ [± (3/4) φ(z_{2/3})], it is possible to construct a distribution P for the full data (X, Y(0), Y(1), Z, U) which marginalizes to (6), satisfies Assumption 1 with Λ = 2, and has E_P[Y(1)] = ψ. On the other hand, for any ψ not in this interval, it is impossible to construct such a distribution.

Proposition 1 implies that the ZSB interval typically includes many values of ψ which cannot possibly be reconciled with the data. The explanation for this conservatism is that the odds-ratio bound (2) does not capture all of the restrictions on the true propensity score e(X, U). Additional information can be found in the marginal distribution of the observed characteristics. For example, the putative propensity score ē(x, u) = 1/3 + (1/3)·1{x ≥ 0} certainly satisfies the odds-ratio bound (2) — and is therefore a possible value of ē in the problem (5) — but it could not possibly be the true propensity score. If it were, this would imply P(Z = 1 | X ≥ 0) = 2/3, while (6) demands that P(Z = 1 | X ≥ 0) = 1/2. In other words, this choice of ē is allowed in the domain of the ZSB optimization problem but is incompatible with the distribution of observed data.

This example suggests that it should be possible to improve upon the ZSB bounds by only optimizing over the subset of E_n(Λ) which is "data compatible." However, data compatibility can be difficult to work with, because the observed data distribution actually imposes an infinite number of constraints on ē. For example, the true propensity score e(X, U) "balances" all integrable functions h : 𝒳 → R:

  E[h(X) Z / e(X, U)] = E[h(X) E[Z | X, U] / e(X, U)] = E[h(X) e(X, U) / e(X, U)] = E[h(X)].   (7)

Every such h gives rise to a testable balancing constraint (7) which can be used to rule out "incompatible" values of ē:

  Σ_{i=1}^n h(X_i) Z_i / ē_i / Σ_{i=1}^n Z_i / ē_i ≈ E[h(X)].   (8)

This calculation shows that any sharp sensitivity analysis must contend with an infinite number of balancing constraints, which is typically computationally intractable [6, 13]. To deal with this problem, Tan [44] suggested a conservative approach that optimizes an IPW-type estimator subject to the constraint that the estimated weights exactly balance a finite collection of functions {h_j}_{j ≤ J}. However, he did not offer any guidance on how these functions should be chosen. In a similar problem, Tudball et al. [45] suggested optimization subject to an approximate balance condition on the observed covariates. Although both of these strategies may improve upon the ZSB interval (5), it is apparent that neither yields sharp bounds in general.

In this section, we show that it is possible to characterize the sharp bounds for ψ ∈ { ψ_T , ψ_C , ψ_ATT , ψ
ATE } without ignoring or relaxing any of the (infinitely many) balancing constraints on the true propensity score. The results are developed at the "population" level, so they belong to the domain of partial identification. We apply them to finite-sample sensitivity analysis in Section 4.

Our main partial identification results are stated in Section 3.1: for the estimands ψ_T, ψ_C, and ψ_ATT, the infinitely-many balancing constraints imposed by data-compatibility can be reduced to a single (carefully chosen) balancing constraint; for the estimand ψ_ATE, a similar reduction can be obtained. Section 3.2 explains how these dramatic simplifications are possible for ψ ∈ { ψ_T, ψ_C, ψ_ATT }: both the original infinitely-constrained problem and the relaxed single-constraint problem can be solved explicitly using Dantzig and Wald's [11] generalization of the Neyman-Pearson Lemma. The solutions are identical. A byproduct of this analysis is an explicit formula for the largest and smallest values of ψ compatible with Assumption 1. Section 3.3 extends this proof strategy to ψ = ψ_ATE, which requires a novel argument. All formal proofs in this paper are deferred to Appendix B.
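As a numerical sanity check on the example in Proposition 1 (our own sketch, assuming Λ = 2 and, without loss of generality, σ = 1 in distribution (6)): the worst-case propensities put the largest inverse-propensity weight, 1/ē = 3, on outcomes above the conditional 2/3-quantile of Y and the smallest, 1/ē = 1.5, below it, which satisfies the data-compatibility condition E[Z/ē | X] = 1 when e(x) = 1/2. Averaging Y Z/ē over simulated draws should then recover the sharp bound (3/4)φ(z_{2/3}):

```python
import numpy as np

# Monte Carlo check of the sharp upper bound for psi_T in example (6)
# with Lambda = 2, sigma = 1, nominal propensity e(x) = 1/2.
rng = np.random.default_rng(0)
n = 2_000_000
z23 = 0.430727                       # standard normal 2/3 quantile
X = rng.normal(0.0, 1.0, n)
Y = X + rng.normal(0.0, 1.0, n)      # Y | X ~ N(X, 1)
w = np.where(Y < X + z23, 1.5, 3.0)  # worst-case inverse propensities 1/e_bar
mc = 0.5 * np.mean(Y * w)            # E[Y Z / e_bar], using P(Z = 1 | X, Y) = 1/2
exact = 0.75 * np.exp(-z23**2 / 2) / np.sqrt(2 * np.pi)  # (3/4) * phi(z_{2/3})
print(mc, exact)
```

The per-covariate bound is X + (3/4)φ(z_{2/3}), so the X-part averages out and the result does not depend on σ, consistent with Proposition 1(i).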
First, we characterize the partially identified sets for ψ ∈ { ψ_T, ψ_C, ψ_ATT } in terms of optimization problems with only a single balancing constraint.

To state these results formally, we need a few pieces of additional notation. Let F(y | x, z) := P(Y ≤ y | X = x, Z = z) denote the conditional distribution function of Y and, for each τ ∈ (0, 1), define the τ-th conditional quantile function by Q_τ(x, z) := inf{ q : F(q | x, z) ≥ τ }. We also introduce a "population" version of the constraint set E_n(Λ), where the vectors ē in R^n are replaced by random variables Ē defined on the same probability space as (X, Y, Z):

  E_∞(Λ) := { Ē : Λ^{-1} ≤ [ Ē / (1 − Ē) ] / [ e(X) / (1 − e(X)) ] ≤ Λ with probability one }.   (9)

The following theorem shows that to compute the partially identified set for ψ_T, one can simply minimize and maximize E[Y Z / Ē] over the set of Ē in E_∞(Λ) which balance a particular conditional quantile of Y.

Theorem 1. (Optimal bounds for ψ_T)

For any Λ ≥ 1, the set of values of ψ_T compatible with the observed data distribution and Assumption 1 is a closed interval [ψ_T^-, ψ_T^+]. Moreover, if we define τ = Λ/(Λ + 1), then the interval endpoints solve (10) and (11):

  ψ_T^- = min_{Ē ∈ E_∞(Λ)} E[Y Z / Ē]  subject to  E[Q_{1−τ}(X, 1) Z / Ē] = E[Q_{1−τ}(X, 1)]   (10)
  ψ_T^+ = max_{Ē ∈ E_∞(Λ)} E[Y Z / Ē]  subject to  E[Q_τ(X, 1) Z / Ē] = E[Q_τ(X, 1)].   (11)

We will highlight a few important takeaways from this theorem. First, even if one adds additional balancing constraints of the form E[h(X) Z / Ē] = E[h(X)] in (10) and (11), the value of these problems will not change since the true propensity score e(X, U) remains feasible. Thus for the purposes of computing bounds, the quantile balancing constraints capture all the information in the observed data. Second, this result shows that the ZSB sensitivity analysis can only be sharp when the conditional quantiles of Y do not depend on X at all.
Since this is quite pathological, this suggests there is room for improvement over the ZSB method in almost all applications. Third, it suggests that a variant of the Tan [44] sensitivity analysis could be sharp when Q_τ(·, 1) and Q_{1−τ}(·, 1) are in the span of the "balanced" functions {h_j}_{j ≤ J}.

We can extend the theorem to other estimands. To bound ψ_C, exchange the labels "treated" and "control" and apply Theorem 1. Sharp bounds on ψ_C can be translated into sharp bounds on ψ_ATT using the relation ψ_ATT = E[Y | Z = 1] − (ψ_C − E[Y(1 − Z)]) / P(Z = 1).

Corollary 1. (Optimal bounds for ψ_C and ψ_ATT)

In the setting of Theorem 1, the partially-identified set for ψ_C is the interval [ψ_C^-, ψ_C^+], where the interval endpoints solve (12) and (13):

  ψ_C^- = min_{Ē ∈ E_∞(Λ)} E[Y (1 − Z) / (1 − Ē)]  subject to  E[Q_{1−τ}(X, 0) (1 − Z) / (1 − Ē)] = E[Q_{1−τ}(X, 0)]   (12)
  ψ_C^+ = max_{Ē ∈ E_∞(Λ)} E[Y (1 − Z) / (1 − Ē)]  subject to  E[Q_τ(X, 0) (1 − Z) / (1 − Ē)] = E[Q_τ(X, 0)].   (13)

The partially identified set for ψ_ATT is the interval [ψ_ATT^-, ψ_ATT^+], where ψ_ATT^∓ = E[Y | Z = 1] − (ψ_C^± − E[Y(1 − Z)]) / P(Z = 1).

Finally, sharp bounds for ψ_ATE can be obtained by subtracting sharp bounds for ψ_T and ψ_C. Equivalently, these bounds can be obtained by solving optimization problems with two quantile balancing constraints. Although this result is superficially similar to Theorem 1 and Corollary 1, its proof requires a novel construction.

Theorem 2. (Optimal bounds for ψ_ATE)

For any Λ ≥ 1, the set of values of ψ_ATE compatible with the observed data distribution and Assumption 1 is a closed interval [ψ_ATE^-, ψ_ATE^+], where ψ_ATE^- = ψ_T^- − ψ_C^+ and ψ_ATE^+ = ψ_T^+ − ψ_C^-.

We find the results of Theorem 1 and Corollary 1 quite counterintuitive. After all, it is certainly not true that every random variable Ē ∈ E_∞(Λ) satisfying E[Q_τ(X, 1) Z / Ē] = E[Q_τ(X, 1)] could be the true propensity score e(X, U). Indeed, the constraints of the quantile-balancing optimization problems do not even enforce that E[Ē] = P(Z = 1).

To explain how these results are possible, we begin by characterizing which random variables Ē could plausibly be the true propensity score e(X, U).
The calculation (7) indicates that Ē should at least satisfy E[h(X) Z / Ē] = E[h(X)] for all integrable h, or equivalently, E[Z / Ē | X] = 1. Proposition 2 shows that this is actually the "only" constraint on Ē for the purposes of bounding ψ_T.

Proposition 2.
For any random variable Ē ∈ E_∞(Λ) satisfying E[Z / Ē | X] = 1, there is a distribution Q for (X, Y(0), Y(1), Z, U) with the following properties:

(i) The distribution of the observables (X, Y, Z) is the same under P and Q.

(ii) Q satisfies Assumption 1.

(iii) E_Q[Y(1)] = E_P[Y Z / Ē].
In the appendix, we prove Proposition 2 using a slight generalization of the data-compatibility characterizations in [7, 36, 44, 49]. This result implies that E[Y Z / Ē] is a plausible value of ψ_T whenever E[Z / Ē | X] = 1. This immediately gives variational formulas for the smallest and largest values of ψ_T compatible with Assumption 1.

Corollary 2.
The endpoints of the partially-identified interval for ψ_T solve:

  ψ_T^- = min_{Ē ∈ E_∞(Λ)} E[Y Z / Ē]  subject to  E[Z / Ē | X] = 1   (14)
  ψ_T^+ = max_{Ē ∈ E_∞(Λ)} E[Y Z / Ē]  subject to  E[Z / Ē | X] = 1.   (15)

This corollary translates the data-compatibility constraints of Proposition 2 into variational problems (14) and (15) which are easy to solve using Dantzig and Wald's [11] generalization of the Neyman-Pearson fundamental lemma. Roughly speaking, the "sufficiency" half of the Dantzig and Wald result says that any random variable Ē that takes on its minimal value when Y is "small" and its maximal value when Y is "large" solves (14), as long as it is also feasible. When applied to the singly-constrained problem (10), the same result says that any feasible solution to (10) which is a cutoff on Y − c Q_{1−τ}(X, 1) for some real number c is optimal in that problem. By direct calculation, one may check that the random variable Ē^- defined in Proposition 3 solves both of these problems.

Proposition 3. Let Ē^-, Ē^+ ∈ E_∞(Λ) satisfy E[Z / Ē^- | X] = E[Z / Ē^+ | X] = 1 and also (16) and (17) whenever Z = 1:

  Ē^- / (1 − Ē^-) = { Λ^{-1} · e(X) / (1 − e(X))  if Y < Q_{1−τ}(X, 1);   Λ^{+1} · e(X) / (1 − e(X))  if Y > Q_{1−τ}(X, 1) }   (16)

  Ē^+ / (1 − Ē^+) = { Λ^{-1} · e(X) / (1 − e(X))  if Y > Q_τ(X, 1);   Λ^{+1} · e(X) / (1 − e(X))  if Y < Q_τ(X, 1) }   (17)
Then Ē^- solves both (10) and (14), and Ē^+ solves both (11) and (15).

These worst-case formulas explain how the relaxation from an infinite number of balancing constraints to a single balancing constraint is possible: both problems have the same solution. The form of the propensity score Ē^+ is quite intuitive: in the "worst case," all observations with "high" values of Y are unlikely to be treated and thus receive large propensity weight, while all observations with "low" values of Y are likely to be treated and thus receive small propensity weight. The cutoff between "large" and "small" is chosen to satisfy the data-compatibility condition E[Z / Ē^+ | X] = 1. This argument extends immediately to ψ_C by relabeling "treatment" and "control," and it extends to ψ_ATT by the argument given in Section 3.1.
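The worst-case form (17) suggests a simple plug-in computation: given fitted nominal propensities and a fitted conditional τ-quantile, assign each treated unit the extreme propensity indicated by its side of the quantile and evaluate the stabilized IPW sum. The sketch below is our own simplification (the actual quantile balancing procedure of Section 4 additionally enforces the empirical balance constraint); `e_hat` and `q_hat` are hypothetical estimates supplied by the analyst.

```python
import numpy as np

def plugin_upper(Y, Z, e_hat, q_hat, lam):
    """Plug-in upper bound on E[Y(1)] from the worst-case propensities (17).

    Treated units with Y above the fitted tau-quantile get their treatment
    odds shrunk by 1/lam (hence the largest inverse-propensity weight);
    units below get their odds inflated by lam (the smallest weight).
    """
    odds = e_hat / (1.0 - e_hat)
    shifted = np.where(Y > q_hat, odds / lam, odds * lam)
    e_bar = shifted / (1.0 + shifted)  # back from odds to propensities
    w = Z / e_bar
    return np.sum(Y * w) / np.sum(w)   # stabilized (self-normalized) form
```

With `lam = 1` this reduces to the Hájek IPW estimate; the lower bound follows by flipping the inequality, as in (16), with the (1 − τ)-quantile.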
To extend the argument from Section 3.2 to the ATE requires additional care. Although ψ_ATE^+ = ψ_T^+ − ψ_C^- is certainly a valid upper bound for the partially identified set for ψ_ATE, it is not obviously a sharp one. Proposition 2 only implies that there exists a distribution Q matching the observed data law which has E_Q[Y(1)] = ψ_T^+ and another distribution Q′ which has E_{Q′}[Y(0)] = ψ_C^-, but these distributions need not be the same. In other words, the two bounds may not be simultaneously achievable.

Theorem 2 indicates that the worst-case bounds on the counterfactual means are simultaneously achievable in the marginal sensitivity model. This is actually a surprising result, given that simultaneous achievability is not expected to hold in two closely-related sensitivity models. The first is the Rosenbaum model, where Yadlowsky et al. [48] derived sharp bounds on ψ_T and ψ_C but required an extra symmetry assumption on the distribution of potential outcomes to establish sharpness of the resulting ATE bounds. The second is Tan's [44] original formulation of the marginal sensitivity model, which requires the unobserved confounders U to be the potential outcomes themselves. With Tan's added restriction, we have only been able to establish simultaneous achievability when Y | X, Z has a continuous distribution.

The key to our bounds on ψ_ATE is the following proposition, which strengthens Proposition 2.

Proposition 4.
For any random variable Ē ∈ E_∞(Λ) satisfying E[Z / Ē | X] = E[(1 − Z) / (1 − Ē) | X] = 1, there is a distribution Q for the full data (X, Y(0), Y(1), Z, U) with the following properties:

(i) The distribution of the observables (X, Y, Z) is the same under P and Q.

(ii) Q satisfies Assumption 1.

(iii) E_Q[Y(1) − Y(0)] = E_P[Y Z / Ē − Y (1 − Z) / (1 − Ē)].

Unlike Proposition 2, this result does not follow from the existing data-compatibility characterizations of [7, 36, 44, 49] and instead requires an original construction. Given this result, one can derive Theorem 2 as a consequence of Theorem 1 and Corollary 1.
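Since Theorem 2 says the counterfactual-mean bounds are simultaneously achievable, the ATE interval combines them directly. A minimal sketch (our own, assuming sharp intervals for ψ_T and ψ_C are already in hand):

```python
def ate_interval(t_bounds, c_bounds):
    """Sharp ATE interval per Theorem 2: the lower endpoint pairs the
    smallest treated mean with the largest control mean, and vice versa."""
    (t_lo, t_hi), (c_lo, c_hi) = t_bounds, c_bounds
    return t_lo - c_hi, t_hi - c_lo
```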
In this section, we give our proposal for translating the population-level partial identification results of Section 3 into a practical sensitivity analysis for IPW, which we call the quantile balancing method. Our proposal follows naturally from our partial identification results: on a high level, we modify the ZSB proposal described in Section 2 to incorporate the quantile-balancing constraints we derived in Theorem 1 and Corollary 1. Throughout this section, we take Λ ≥ 1 and set τ = Λ / (Λ + 1).

We begin by describing the quantile balancing bounds for the average treated outcome. Theorem 1 implies that the largest value of ψ_T compatible with Assumption 1 solves the optimization problem (18):

  ψ_T^+ = max_{Ē ∈ E_∞(Λ)} E[Y Z / Ē] / E[Z / Ē]  s.t.  ( E[Q_τ(X, 1) Z / Ē], E[Z / Ē] ) = ( E[Q_τ(X, 1) Z / e(X)], E[Z / e(X)] ).   (18)

Notice that we have included the additional constraint E[Z / Ē] = E[Z / e(X)] which does not appear in Theorem 1, but this does not affect the value of the optimization problem.

Our proposal is to estimate ψ_T^+ by replacing all of the unknown quantities in (18) with empirical counterparts. We estimate ψ_T^- by following the same principle. To translate these estimates into confidence intervals, we employ the same simple percentile bootstrap scheme as ZSB.

We will be concrete about what optimization problem we are proposing to solve. Let Q̂_τ(x, z) be an estimate of the conditional quantile function of Y obtained by some kind of quantile regression (e.g. [25, 43, 29, 2]). Let ê be the data analyst's estimate of the nominal propensity score e from their primary analysis. We define ψ̂_T^+ as the solution to the empirical maximization problem (19):

  ψ̂_T^+ = max_{ē ∈ E_n(Λ)} Σ_{i=1}^n Y_i Z_i / ē_i / Σ_{i=1}^n Z_i / ē_i  s.t.
  ( n^{-1} Σ_{i=1}^n Q̂_τ(X_i, 1) Z_i / ē_i , n^{-1} Σ_{i=1}^n Z_i / ē_i ) = ( n^{-1} Σ_{i=1}^n Q̂_τ(X_i, 1) Z_i / ê(X_i) , n^{-1} Σ_{i=1}^n Z_i / ê(X_i) ).   (19)

The lower bound ψ̂_T^- is defined similarly, but with maximization replaced by minimization and Q̂_τ(x, z) replaced by another quantile estimate Q̂_{1−τ}(x, z). The somewhat peculiar form of the right-hand side of the constraints in (19) ensures that ē_i = ê(X_i) is feasible, so ψ̂_T^+ and ψ̂_T^- always exist.

Several immediate properties of the quantile balancing bounds (19) are collected below:

(i) When Λ = 1 (i.e. no confounding is allowed), the quantile balancing bounds collapse to the usual IPW estimate of ψ_T under unconfoundedness.

(ii) The quantile balancing bounds are sample bounded, i.e. min_i Y_i ≤ ψ̂_T^- ≤ ψ̂_T^+ ≤ max_i Y_i.

(iii) The quantile balancing bounds are always a subset of the ZSB bounds, and outside of knife-edge cases, are a strict subset.

(iv) The optimization problem (19) is convex and can be solved efficiently. In fact, it reduces to a standard quantile regression problem. See Appendix A for implementation details.

The quantile balancing idea extends easily to other causal estimands. To compute bounds for ψ_C, one only needs to exchange the definitions of "treated" and "control" and solve the same optimization problem. Subtracting the bounds for ψ_T and ψ_C gives bounds for ψ_ATE, and bounds for ψ_ATT follow from a similar principle (see Appendix A for the exact formula).

To form confidence intervals based on quantile balancing, we follow ZSB [49] and propose using the percentile bootstrap. If [ψ̂_b^-, ψ̂_b^+] are quantile balancing bounds estimated in the b-th of B bootstrap samples, we report the quantile balancing 1 − α confidence interval as:

  CI(α) = [ Q_{α/2}( { ψ̂_b^- }_{b ∈ [B]} ) , Q_{1−α/2}( { ψ̂_b^+ }_{b ∈ [B]} ) ].
(20)

As is standard for bootstrap-based IPW inference, we require re-estimating the nominal propensity score separately in each bootstrap replication. That requirement does not extend to the conditional quantiles. While the conditional quantiles can be re-estimated within bootstraps, our inference results will also apply if they are taken from the main dataset. This is important to keep inference computationally tractable.

When the conditional quantiles are estimated using linear quantile regression (i.e. Q̂_t(x, z) = β̂_t(z)ᵀ h(x) for some "features" h : 𝒳 → R^k), one could consider directly "balancing" the features h rather than the fitted quantile Q̂_t, as in [44, 17, 9, 45]. Although this approach has some nice features and theoretical support, our simulations show the resulting inference is less reliable in small samples.

We now state some theoretical properties of the quantile balancing bounds [ψ̂^-, ψ̂^+] which apply when the outcome Y has a continuous distribution. In short, the bounds are sharp when quantiles are estimated consistently and are valid even when quantiles are estimated inconsistently. Moreover, the percentile bootstrap yields valid confidence intervals if standard IPW inference conditions are satisfied and quantiles are estimated parametrically.

To obtain these results, we need a few conditions. The first condition collects some standard IPW consistency requirements which we expect the data analyst to have already assumed in his or her "primary analysis."

Condition 1. (IPW assumptions)
The nominal propensity score e satisfies ε ≤ e(X) ≤ 1 − ε with probability one for some ε > 0. The estimated propensity score ê ≡ ê({Xᵢ, Zᵢ}_{i ≤ n}) is uniformly consistent, and the variance of Y is finite.

The second condition requires that the outcome Y has a bounded conditional density which is positive near the relevant conditional quantiles. This is a common identification condition for quantile regression [2, 5].

Condition 2. (Density)
The conditional distribution of Y | X, Z has a uniformly bounded density f(y | x, z). For each (x, z) ∈ X × {0, 1}, the map y ↦ f(y | x, z) is continuous and positive near Q_{1−τ}(x, z) and Q_τ(x, z).

Finally, we make some assumptions about how the quantiles are estimated. For the standard linear quantile regression method of Koenker and Bassett [25], one only needs to check that the regressors in the quantile regression have finite variance. We cover generic (possibly nonlinear) methods by requiring sample splitting to avoid overfitting. The specific form of sample splitting analyzed in our proofs is "cross-fitting" [42, 33, 10], but leave-one-out or out-of-bag quantile estimates perform similarly in simulations.
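The cross-fitting scheme just mentioned can be sketched in a few lines. Everything below is an illustrative implementation of generic cross-fitted quantile estimation, not the authors' code; the learner interface `fit_predict` is our own assumption and can wrap any quantile regression method.

```python
import numpy as np

def crossfit_quantiles(X, Y, Z, fit_predict, n_folds=5, seed=0):
    """Cross-fitted conditional quantile estimates.

    Each observation i receives a prediction from a model trained on
    same-arm observations in the *other* folds, so no point is predicted
    by a model that saw it (guarding against overfitting).
    fit_predict(X_train, Y_train, X_test) can wrap any quantile learner.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = rng.permutation(n) % n_folds          # random fold assignment
    out = np.full(n, np.nan)
    for z in (0, 1):
        arm = Z == z
        for f in range(n_folds):
            train = arm & (folds != f)
            test = arm & (folds == f)
            if train.any() and test.any():
                out[test] = fit_predict(X[train], Y[train], X[test])
    return out
```

Leave-one-out or out-of-bag predictions can be substituted by changing the fold logic; the downstream optimization is unaffected.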
Condition 3. (Quantile estimates)
For each t ∈ {1 − τ, τ}, one of the following holds:

(i) Q̂_t(x, z) = β̂_t(z)ᵀ h(x) for some fixed "features" h_j(X) with finite variance.
(ii) Q̂_t(x, z) is estimated using cross-fitting and satisfies Condition N in Appendix B.7.

Condition 3 is essentially "algorithmic," and neither (i) nor (ii) imposes any accuracy requirements on the estimated conditional quantiles. The appendix conditions in (ii) are technical to state but very mild. For example, they are satisfied by the random-forest-based quantile regression methods of Meinshausen [29] and Athey et al. [2] without any further assumptions.

Under these conditions, we have the following result on the asymptotic sharpness of the quantile balancing bounds.
Theorem 3. (Sharpness and robustness)
For any ψ ∈ {ψ_T, ψ_C, ψ_ATT, ψ_ATE}, let [ψ⁻, ψ⁺] be its partially identified interval under Assumption 1 and let [ψ̂⁻, ψ̂⁺] be the corresponding quantile balancing interval. Assume Conditions 1, 2, and 3.

(i) If the quantile regression estimates are consistent, then ψ̂⁻ →ₚ ψ⁻ and ψ̂⁺ →ₚ ψ⁺.
(ii) Even if the quantile models are misspecified, we still have ψ̂⁻ ≤ ψ⁻ + o_P(1) and ψ⁺ − o_P(1) ≤ ψ̂⁺.

The result (i) is expected, so we offer some intuition for (ii). The true worst-case propensity score Ē⁺ defined in Proposition 3 "balances" all (integrable) functions of X, so it is "almost" feasible in the optimization problem (19). Thus, even if the quantile regression model is misspecified, the IPW estimator based on Ē⁺ will "almost" be in the domain of (19). We should therefore expect ψ̂⁺ to be too large rather than too small. The robustness result (ii) shows that it eventually is.

The validity of the confidence interval (20) requires stronger parametric assumptions. We prove an inference result assuming the nominal propensity score is estimated by a correctly specified parametric model and the conditional quantiles are estimated by a (potentially misspecified) parametric model. These assumptions are not much stronger than what we expect the primary analysis to assume for inference under unconfoundedness [15, 49].

Theorem 4. (Inference)
Let [ψ⁻, ψ⁺] be as in Theorem 3, and let CI(α) be as in (20). Suppose Conditions 1, 2, and 3(i) are satisfied, and also that the nominal propensity score is estimated by a regular parametric model (e.g., logistic regression). Then we have

  lim inf_{n→∞} P( [ψ⁻, ψ⁺] ⊆ CI(α) ) ≥ 1 − α    (21)

for any α ∈ (0, 1).

Although we do not have theoretical support for the confidence interval CI(α) when the quantiles are estimated by a nonlinear model, we find that this approach performs reasonably well in the simulations of Section 5.

In this section, we illustrate the finite-sample performance of the quantile balancing method on several simulated datasets and one real-data example. We also compare several variants of the method against the ZSB method described in Section 2 on the simulated datasets.
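The percentile-bootstrap interval (20) used throughout these experiments is simple to implement. The sketch below is our own illustration, not the authors' code; the callback `estimate_bounds` is an assumed interface that must re-fit the nominal propensity model and return the bound estimates on each resampled dataset.

```python
import numpy as np

def percentile_bootstrap_ci(data, estimate_bounds, alpha=0.10, B=500, seed=0):
    """CI(alpha) from (20): pair the alpha/2 quantile of bootstrapped lower
    bounds with the 1 - alpha/2 quantile of bootstrapped upper bounds."""
    rng = np.random.default_rng(seed)
    n = len(data)
    lower, upper = np.empty(B), np.empty(B)
    for b in range(B):
        sample = data[rng.integers(0, n, size=n)]   # resample with replacement
        lower[b], upper[b] = estimate_bounds(sample)
    return np.quantile(lower, alpha / 2), np.quantile(upper, 1 - alpha / 2)
```

Because the two endpoints use opposite tails of the bootstrap distributions, the interval covers the whole identified set rather than a single point.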
We consider two data-generating processes (DGPs) in our simulated examples. The two DGPs differ in the distribution of the regression function E[Y | X, Z], but otherwise can be described as follows:

  X ∼ Uniform([−1, 1]ᵈ),   Z | X ∼ Bernoulli( 1/2 − Σ_{j=1}^d X_j / √· ),   Y | X, Z ∼ N(µ(X), 1)    (22)

In the first DGP, we use µ(x) = x₁ + ··· + x_d, and in the second DGP we use µ(x) = sign(x₁) + sign(x₂). The estimand of interest is the ATE, and we fix Λ = 2, i.e., unobserved confounders can double or halve the odds of treatment.

We compare four methods for obtaining bounds on ψ_ATE. Linear applies the quantile balancing method of Section 4 with quantiles estimated following [25].
Forest is similar, but Q̂_t(x, z) is fitted using out-of-bag estimates from the random forest method of Athey et al. [2]. Covariates directly balances the features X without first estimating quantiles. ZSB implements the unconstrained method described in Section 2. All four methods estimate the nominal propensity score by logistic regression.

Figure 1 shows the distribution of upper and lower bound point estimates from each of these four methods, estimated using 1,000 simulations with 500 observations each. Dashed lines indicate the true partially identified region. The results conform to the asymptotic predictions of Proposition 1 and Theorem 3: (i) when the quantile models are "correctly specified," the quantile balancing point estimates are nearly unbiased; (ii) under misspecification, the range of point estimates is too wide rather than too narrow; and (iii) the ZSB range of point estimates is too wide in both cases.
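For concreteness, a dataset from this family of DGPs can be drawn as follows. This is our own sketch: the dimension `d` and the propensity-score scaling are placeholder choices, since the exact constants in (22) are not reproduced here; only the overall shape of the design matches.

```python
import numpy as np

def simulate_dgp(n, piecewise=False, d=2, seed=0):
    """One draw shaped like (22): uniform covariates, a linear-in-covariates
    propensity, and a Gaussian outcome centered at mu(X)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    p = 0.5 - X.sum(axis=1) / (4.0 * np.sqrt(d))   # stays inside (0, 1) for small d
    Z = rng.binomial(1, p)
    # DGP1: additive linear mean; DGP2: piecewise-constant mean
    mu = np.sign(X[:, 0]) + np.sign(X[:, 1]) if piecewise else X.sum(axis=1)
    Y = rng.normal(mu, 1.0)
    return X, Z, Y
```

The piecewise-constant mean in DGP2 is exactly the kind of outcome model a linear quantile regression misspecifies but a forest-based method can fit.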
Figure 1:
Boxplots of the ATE upper and lower bound point estimates for both DGPs and all considered methods. The dashed line indicates the boundary of the true partially identified set. In DGP1, the Linear and Covariates methods are correctly specified and give the most accurate bounds. In DGP2, the Forest method is well-suited to the piecewise-constant outcome model and gives the most accurate bounds.
Confidence intervals based on the Linear and Covariates approaches exhibited some under-coverage at this sample size, at least in the "well-specified" setting of DGP1. The 90% bootstrap confidence intervals based on the Linear method had approximately 85.4% coverage of the identified set, whereas those based on the Covariates method had 81.3% coverage. This undercoverage is partially caused by a small bias in the original range of point estimates, which is not readily apparent from Figure 1. Meanwhile, Forest had 95.5% coverage, and ZSB had 100% coverage. In DGP2, the Forest method had 91.6% coverage, and all other methods had at least 99% coverage.
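The bounds themselves can also be computed directly, without the weighted quantile regression reduction of Appendix A, because the ratio-form balance constraint in (19) linearizes: requiring Σᵢ(Q̂ᵢ − r)wᵢ = 0, with r the ê-weighted average of Q̂ and wᵢ = 1/ēᵢ, reproduces the constraint. The sketch below is our own linear-programming illustration (function names are ours), not the authors' recommended implementation.

```python
import numpy as np
from scipy.optimize import linprog

def qb_upper_bound(Y, Z, e_hat, Q_hat, Lam):
    """Quantile balancing upper bound for psi_T as a direct LP sketch of (19).

    Q_hat[i] estimates the tau-quantile of Y | X_i, Z = 1, tau = Lam/(1+Lam).
    Decision variables: treated inverse weights w_i = 1 / e_bar_i.
    """
    t = Z == 1
    y, e, q = Y[t], e_hat[t], Q_hat[t]
    # Marginal sensitivity model: box constraints on each inverse weight.
    lo = 1 + (1 - e) / (e * Lam)
    hi = 1 + Lam * (1 - e) / e
    # Linearized form of constraint (19).
    r = np.sum(q / e) / np.sum(1 / e)
    res = linprog(-y, A_eq=[q - r], b_eq=[0.0],
                  bounds=list(zip(lo, hi)), method="highs")
    return -res.fun / np.sum(Z / e_hat)
```

At Λ = 1 the box collapses to wᵢ = 1/ê(Xᵢ) and the bound reduces to the usual Hájek-style IPW estimate, matching property (i) above.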
In this section, we apply our proposed sensitivity analysis to a subsample of data from the 1966-1981 National Longitudinal Survey (NLS) of Older and Young Men. We wish to estimate the impact of union membership on wages. Specifically, we consider the ATE of union membership on log wages. For illustrative purposes, we focus on the 1978 cross-section of Young Men and restrict our attention to craftsmen and laborers not enrolled in school. Our estimates are thus based on a sample of 668 respondents with measurements on wages, union membership, and eight other covariates.

For our "primary analysis," we use IPW to adjust for baseline imbalances in covariates between union and nonunion samples. Table 1 reports the covariate balance between the union and nonunion samples before and after weighting by the (estimated) inverse propensity score. On several important characteristics, inverse propensity weighting dramatically improves balance across the two samples.
Figure 2:
Point estimate ranges and 90% bootstrap confidence intervals for the ATE in the NLS dataset. For the quantile balancing method, conditional quantiles are estimated using the linear quantile regression method of Koenker and Bassett [25].
                    Unweighted            Weighted
  Covariate         Union   Nonunion     Union   Nonunion
  Age               30.1    30.0         30.0    30.0
  Black             24%     24%          23%     24%
  Metropolitan      74%     57%          66%     65%
  Southern          32%     53%          42%     42%
  Married           78%     75%          76%     76%
  Manufacturing     42%     32%          37%     38%
  Laborer           23%     15%          18%     18%
  Education         12.2    11.7         12.1    12.0

Table 1:
Covariate means among the nonunion and union subsamples, along with the means in the weighted samples. In red, we highlight particularly large imbalances. In the weighted samples, propensity weights are estimated using logistic regression.
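The weighted columns of a table like Table 1 are just inverse-propensity-weighted covariate means. A minimal sketch (the function and key names are our own):

```python
import numpy as np

def balance_table(X, Z, e_hat):
    """Raw and IPW-weighted covariate means by treatment arm.

    Weighting by 1/e_hat (treated) and 1/(1 - e_hat) (controls) maps both
    arms toward the covariate distribution of the full sample, which is why
    well-estimated weighted columns nearly agree."""
    w1 = Z / e_hat
    w0 = (1 - Z) / (1 - e_hat)
    wmean = lambda w: (X * w[:, None]).sum(axis=0) / w.sum()
    return {"treated_raw": X[Z == 1].mean(axis=0),
            "control_raw": X[Z == 0].mean(axis=0),
            "treated_weighted": wmean(w1),
            "control_weighted": wmean(w0)}
```

Comparing the two weighted columns is the standard visual check that the primary IPW analysis has removed observed imbalance.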
The IPW point estimate of the ATE is 0.23 with an associated 90% confidence interval of [0.··, 0.··]. We estimate Q̂_t(x, z) for quantile balancing using the standard linear quantile model of Koenker and Bassett [25].

Both sensitivity analyses show that the positive effect found in the "primary" analysis is fairly robust to unobserved confounding, but quantile balancing refines the ZSB interval. Even if the odds of union membership for "skilled" workers were double the odds for "typical" workers with the same observed covariates, the quantile balancing analysis would still find a statistically significant positive treatment effect. Meanwhile, when Λ = 2, the ZSB confidence intervals already include the null, although the range of point estimates (barely) excludes it.

To put these figures in context, we follow the advice of Kallus and Zhou [22] and compute the degree to which the (estimated) odds of union membership could change if measured confounders were omitted from the dataset. Omitting Black only changes the odds of treatment by a factor of 1.4, and omitting
Laborer only changes the odds of treatment by a factor of 1.9. In fact, no measured confounders except
South were able to double or halve the odds of union membership for any respondent.

Incidentally, longitudinal estimates of union wage effects (which control for individual-specific effects like "skill") come to similar conclusions as the one suggested by our sensitivity analysis. Although treatment effect estimates from longitudinal studies are generally smaller than those from cross-sectional studies, they still find evidence in favor of the "union premium" [8, 18, 14].
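The benchmarking exercise just described can be reproduced with any logistic-regression fit: drop one covariate, refit, and record the largest change in any respondent's estimated odds of treatment. The sketch below is our own illustration with a bare-bones Newton solver so it stays self-contained; all names are ours.

```python
import numpy as np

def _logistic_probs(X, z, iters=25, ridge=1e-6):
    """Newton-Raphson logistic regression; returns fitted P(Z=1 | X)."""
    A = np.column_stack([np.ones(len(z)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        H = A.T @ (A * (p * (1 - p))[:, None]) + ridge * np.eye(A.shape[1])
        beta += np.linalg.solve(H, A.T @ (z - p))
    return 1.0 / (1.0 + np.exp(-A @ beta))

def max_odds_change(X, z, j):
    """Largest multiplicative change in any unit's estimated treatment odds
    when covariate column j is omitted (a Lambda-style benchmark)."""
    odds = lambda p: p / (1 - p)
    full = odds(_logistic_probs(X, z))
    drop = odds(_logistic_probs(np.delete(X, j, axis=1), z))
    ratio = full / drop
    return float(np.max(np.maximum(ratio, 1.0 / ratio)))
```

If no measured covariate produces a factor near the chosen Λ, then Λ is a conservative calibration for the strength of a plausible unmeasured confounder.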
We have shown that quantile balancing, a simple modification of the popular ZSB [49] sensitivity analysis, is feasible, robust, and sharp. This new sensitivity analysis for IPW is based on novel partial identification results for Tan [44]'s marginal sensitivity model.

We point to several interesting directions for future work. While our partial identification results focus on counterfactual means and a few treatment effects, it should be possible to extend them to more complex estimands of the type considered in [20, 21, 22, 23, 27]. Perhaps a similarly compact sensitivity analysis could even apply to dynamic treatment regimes. In addition, while our identification arguments generalize to any sensitivity assumption that only restricts the propensity score in a pointwise fashion (i.e., e_min(x) ≤ e(x, u) ≤ e_max(x)), the practicality of our sensitivity analysis and its theoretical properties rely on the marginal sensitivity model quite heavily. It would be interesting to see if a practical and sharp sensitivity analysis could be developed for other sensitivity assumptions in this class.

References

[1] Peter M Aronow and Donald K K Lee. "Interval estimation of population means under unknown but bounded probabilities of sample selection". In: Biometrika
Annals of Statistics
Econometrica
Statistics in Medicine
Journalof Econometrics issn : 0304-4076.[6] Arie Beresteanu, Ilya Molchanov, and Francesca Molinari. “Sharp Identification Regions in Models withConvex Moment Predictions”. In:
Econometrica issn : 00129682, 14680262.[7] Jolene Birmingham, Andrea Rotnitzky, and Garrett M Fitzmaurice. “Pattern-mixture and selectionmodels for analysing longitudinal data with monotone missing patterns”. In:
Journal of the RoyalStatistical Society: Statistical Methodology: Series B
Journal of Econometrics issn : 0304-4076.[9] Kwun Chuen Gary Chan, Sheung Chi Phillip Yam, and Zheng Zhang. “Globally efficient non-parametricinference of average treatment effects by empirical balancing calibration weighting”. In:
Journal of theRoyal Statistical Society: Series B (Statistical Methodology)
Econometrics Journal
The Annals of Mathematical Statistics issn : 00034851.[12] D. A. Darling. “On a Class of Problems Related to the Random Division of an Interval”. In:
TheAnnals of Mathematical Statistics
A New Characterization of Identified sets in PartiallyIdentified Models . 2016. url : .[14] Richard B. Freeman. “Longitudinal Analyses of the Effects of Trade Unions”. In: Journal of LaborEconomics issn : 0734306X, 15375307.[15] Keisuke Hirano and Guido W Imbens. “Estimation of Causal Effects using Propensity Score Weighting:An Application to Data on Right Heart Catheterization”. In:
Health Services & Outcomes ResearchMethodology
Econometrica issn :00129682, 14680262.[17] Kosuke Imai and Marc Ratkovic. “Covariate balancing propensity score”. In:
Journal of the RoyalStatistical Society Series B
TheReview of Economic Studies issn : 00346527, 1467937X.[19] George Johnson. “Economic Analysis of Trade Unionism”. In:
American Economic Review
Interval Estimation of Individual-Level Causal EffectsUnder Unobserved Confounding . 2018. arXiv: .1421] Nathan Kallus and Angela Zhou.
Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforce-ment Learning . 2020. arXiv: .[22] Nathan Kallus and Angela Zhou. “Confounding-Robust Policy Improvement”. In:
Advances in NeuralInformation Processing Systems 31 . Ed. by S. Bengio et al. Curran Associates, Inc., 2018, pp. 9269–9279.[23] Nathan Kallus and Angela Zhou. “Minimax-Optimal Policy Learning Under Unobserved Confounding”.In:
Management Science (2020).[24] Roger Koenker.
Quantile Regression . Econometric Society Monographs. Cambridge University Press,2005.[25] Roger W Koenker and Gilbert Bassett. “Regression Quantiles”. In:
Econometrica
Introduction to empirical processes and semiparametric inference . Springer series instatistics. Springer, 2008. isbn : 9780387749778.[27] Kwonsang Lee, Falco J. Bargagli-Stoffi, and Francesca Dominici.
Causal Rule Ensemble: InterpretableInference of Heterogeneous Treatment Effects . 2020. arXiv: .[28] Matthew A. Masten, Alexandre Poirier, and Linqi Zhang.
Assessing Sensitivity to Unconfoundedness:Estimation and Inference . 2020. arXiv: .[29] Nicolai Meinshausen. “Quantile Regression Forests”. In:
J. Mach. Learn. Res. issn : 1532-4435.[30] Wesley Mellow. “Unionism and Wages: A Longitudinal Analysis”. In:
The Review of Economics andStatistics issn : 00346535, 15309142.[31] L W Miratrix, S Wager, and J R Zubizarreta. “Shape-constrained partial identification of a populationmean under unknown probabilities of sample selection”. In:
Biometrika
Cross-fitting and fast remainder rates for semiparamet-ric estimation . CeMMAP working papers CWP41/17. Centre for Microdata Methods and Practice,Institute for Fiscal Studies, 2017.[34] Jerzy Neyman. “On the Application of Probability Theory to Agricultural Experiments. Essay onPrinciples. Section 9”. In:
Statist. Sci.
Statistical Science issn : 08834237. url : .[36] James M. Robins, Andrea Rotnitzky, and Daniel O. Scharfstein. “Sensitivity Analysis for Selection biasand unmeasured Confounding in missing Data and Causal inference models”. In: Statistical Models inEpidemiology, the Environment, and Clinical Trials . Ed. by M. Elizabeth Halloran and Donald Berry.New York, NY: Springer New York, 2000, pp. 1–94. isbn : 978-1-4612-1284-3.[37] Paul R Rosenbaum.
Design of Observational Studies . New York: Springer, 2010.[38] Paul R Rosenbaum. “Sensitivity analysis for certain permutation inferences in matched observationalstudies”. In:
Biometrika
Statist. Sci.
Combining Observational and Experimental Datasets Using Shrinkage Estima-tors . 2020. arXiv: .[41] D.B. Rubin. “Estimating causal effects of treatments in randomized and nonrandomized studies”. In:
Journal of Educational Psychology
Ann. Statist.
Ann. Statist.
Journalof the American Statistical Association
An Interval Estimation Approach to Sample Selection Bias . 2019. arXiv: .[46] A. W. van der Vaart.
Asymptotic Statistics . Cambridge Series in Statistical and Probabilistic Mathe-matics. Cambridge University Press, 1998.[47] A. W. van der Vaart and J. Wellner.
Weak Convergence and Empirical Processes: With Applicationsto Statistics . Springer Series in Statistics. Springer, 1996. isbn : 9780387946405.[48] Steve Yadlowsky et al.
Bounds on the conditional and average treatment effect with unobserved con-founding factors . 2018. arXiv: .[49] Qingyuan Zhao, Dylan S Small, and Bhaswar B Bhattacharya. “Sensitivity analysis for inverse prob-ability weighting estimators via the percentile bootstrap”. In:
Journal of the Royal Statistical Society:Statistical Methodology: Series B
A Appendix: Computation
This appendix describes how the quantile balancing sensitivity analysis can be implemented using standard solvers for linear quantile regression [25, 24] (A.1). It also gives the formulas for the ATT bounds (A.2), which were omitted from the main text.

Throughout this appendix, Λ ≥ 1 and τ = Λ/(Λ + 1). For a function f ≡ f(x, y, z), we use the shorthand fᵢ = f(Xᵢ, Yᵢ, Zᵢ). We will also write Pₙ v = (1/n) Σᵢ₌₁ⁿ vᵢ for any vector v ∈ Rⁿ. For functions f, Pₙ f should be understood to mean (1/n) Σᵢ₌₁ⁿ fᵢ.

A.1 Computing bounds with weighted quantile regression
We begin by considering the computation of ψ̂⁺_T. We consider the more general optimization problem (23), for some function g: X → Rᵏ containing an "intercept." In the main text, we assumed g(x) = (1, Q̂_τ(x, 1)).

  max_{ē ∈ Eₙ(Λ)}  [ Pₙ Y Z/ē ] / [ Pₙ Z/ê(X) ]   subject to   Pₙ g(X) Z/ē = Pₙ g(X) Z/ê(X)    (23)

Readers familiar with quantile regression optimization (cf. Chapter 6 of [24]) may recognize the problem (23) from the dual of a certain weighted quantile regression problem. Specifically, let ρ_τ(u) = u(τ − I{u < 0}) be the quantile regression "check" function and define the weighted quantile regression objective as:

  Lₙ(γ) := Pₙ ρ_τ(Y − γᵀ g(X)) · Z (1 − ê(X))/ê(X).    (24)

The following proposition shows that any minimizer of Lₙ can be used to compute the solution of (23).

Proposition 5.
Suppose ê(Xᵢ) ∈ (0, 1) for all i, let γ̂ minimize Lₙ, and let V̂ᵢ = sign(Yᵢ − γ̂ᵀ g(Xᵢ)). Then the optimal objective value in (23) is:

  [ Pₙ (Y − γ̂ᵀ g(X)) Z (1 + Λ^{V̂} (1 − ê(X))/ê(X)) + Pₙ γ̂ᵀ g(X) Z/ê(X) ] / [ Pₙ Z/ê(X) ].

This same approach can be used to compute a lower bound for ψ_T by replacing Y with −Y, applying Proposition 5, and then negating the answer. Upper and lower bounds for ψ_C can then be obtained by replacing Z by 1 − Z and ê(X) by 1 − ê(X) and then applying the same procedure. Subtracting the upper and lower bounds for ψ_T and ψ_C (as in Theorem 2) gives bounds on ψ_ATE.

A.2 Bounds for the ATT
Next, we describe the standard quantile balancing bounds for ψ_ATT. Let Ȳ(1) be the average value of Yᵢ among treated observations. We define the quantile balancing upper bound for the ATT as the solution to the optimization problem (25), where g⁺(x) = (1, Q̂_{1−τ}(x, 0)):

  ψ̂⁺_ATT = max_{ē ∈ Eₙ(Λ)}  Ȳ(1) − [ Σ_{Zᵢ=0} Yᵢ ēᵢ/(1 − ēᵢ) ] / [ Σ_{Zᵢ=0} ēᵢ/(1 − ēᵢ) ]
  s.t.  Σ_{Zᵢ=0} g⁺(Xᵢ) ēᵢ/(1 − ēᵢ) = Σ_{Zᵢ=0} g⁺(Xᵢ) êᵢ/(1 − êᵢ)    (25)

The lower bound ψ̂⁻_ATT is defined similarly, but with maximization replaced by minimization and g⁺(x) replaced by g⁻(x) := (1, Q̂_τ(x, 0)).

B Appendix: Proofs
This appendix collects proofs of the results in the main text, along with supporting results. Throughout, we will use the following notation. For an integer n ≥ 1, [n] denotes the set {1, ..., n}. For symmetric matrices A and B, A ⪰ B and A ≻ B mean A − B is positive semidefinite and positive definite, respectively. If {aₙ} and {bₙ} are sequences of real numbers, then aₙ ≲ bₙ means aₙ = O(bₙ) and aₙ ∼ bₙ means aₙ/bₙ → 1. If {Aₙ} and {Bₙ} are sequences of random variables, then Aₙ ≲_P Bₙ means Aₙ = O_P(Bₙ) and Aₙ ∼_P Bₙ means Aₙ/Bₙ →ₚ 1. We adopt the convention that a/b = 0 when a and b are both zero.

We also make use of some standard empirical process notation. For a (possibly random) function f: X × R × {0, 1} → R, we will write P f := ∫ f dP and Pₙ f := (1/n) Σᵢ₌₁ⁿ f(Xᵢ, Yᵢ, Zᵢ). For any vector v = (v₁, ..., vₙ), we take Pₙ v = (1/n) Σᵢ₌₁ⁿ vᵢ. For any p ∈ [1, ∞), we define ||f||_{Lₚ(P)} = (P|f|ᵖ)^{1/p} and ||f||_{Lₚ(Pₙ)} = (Pₙ|f|ᵖ)^{1/p}. When p = ∞, we set ||f||_{L∞(P)} = inf{t : P(|f| ≤ t) = 1} and ||f||_{L∞(Pₙ)} = max_{i ≤ n} |f(Xᵢ, Yᵢ, Zᵢ)|.

B.1 Proof of Proposition 1
Proof.
First, we will compute the partially identified set for ψ_T for a substantially more general observed data distribution, since the proof is not any more difficult. Suppose the observed data law has the following factorization: X ∼ P_X, Z | X ∼ Bernoulli(e(X)), and Y | X, Z ∼ N(µ(X), σ²(X)). Here, P_X and e: X → (0, 1) are arbitrary, and the only requirement on µ, σ is that Y is integrable.

Let z_τ denote the τ-th quantile of the standard normal distribution. Since the conditional distribution of Y is continuous, Proposition 3 implies ψ⁺_T = E[YZ/Ē⁺], where Ē⁺ satisfies the following with probability one:

  1/Ē⁺ = 1 + [(1 − e(X))/e(X)] Λ      if Y ≥ µ(X) + σ(X) z_τ,
  1/Ē⁺ = 1 + [(1 − e(X))/e(X)] Λ⁻¹    if Y < µ(X) + σ(X) z_τ.

Let C(x) = µ(x) + σ(x) z_τ be the "cutoff" between low and high values of Ē⁺. Then the inverse Mills ratio formula for the expectation of a truncated Gaussian gives:

  ψ⁺_T = E[YZ/Ē⁺]
       = E[ e(X) E[Y/Ē⁺ | X, Z = 1] ]
       = E[ (e(X) + Λ⁻¹[1 − e(X)]) τ E[Y | Y < C(X), X, Z = 1] ] + E[ (e(X) + Λ[1 − e(X)])(1 − τ) E[Y | Y ≥ C(X), X, Z = 1] ]
       = E[ (e(X) + Λ⁻¹[1 − e(X)]) τ (µ(X) − σ(X) φ(z_τ)/τ) ] + E[ (e(X) + Λ[1 − e(X)])(1 − τ)(µ(X) + σ(X) φ(z_τ)/(1 − τ)) ]
       = E[µ(X)] + φ(z_τ)(Λ − Λ⁻¹) E[σ(X)(1 − e(X))].

Applying the same calculation to −Y gives ψ⁻_T = E[µ(X)] − φ(z_τ)(Λ − Λ⁻¹) E[σ(X)(1 − e(X))]. Specializing these answers to the distribution in Proposition 1 gives the stated answer.

We remark that the preceding calculation also gives the partially identified sets for ψ_C and ψ_ATE. Since the distribution of Y in this setting does not depend on Z, exchanging the roles of Z and 1 − Z in the above calculation amounts to replacing 1 − e(X) by e(X) in the formula for ψ±_T. Thus, [ψ⁻_C, ψ⁺_C] = E[µ(X)] ± φ(z_τ)(Λ − Λ⁻¹) E[σ(X)e(X)], and Theorem 2 implies [ψ⁻_ATE, ψ⁺_ATE] = ±φ(z_τ)(Λ − Λ⁻¹) E[σ(X)].

Next, we show that the ZSB interval is asymptotically too wide. We will only prove this result for the distribution (6), since the calculations simplify. Let ψ̂⁺_{T,ZSB} be as in (5). Let Ē∗ = ·· + I{Y ≤ 0.·· √(σ² + 1)}, and notice that Y | Z = 1 ∼ N(0, σ² + 1).
Then a straightforward calculation using the inverse Mills ratio formula gives:

  E[YZ/Ē∗] / E[Z/Ē∗] = φ(0.··) √(σ² + 1) / (2 − Φ(0.··)) > 0.·· √(σ² + 1).

The strong law of large numbers implies lim inf ψ̂⁺_{T,ZSB} ≥ lim inf (Pₙ YZ/Ē∗)/(Pₙ Z/Ē∗) > 0.·· √(σ² + 1) almost surely. The lower bound follows by symmetry.

B.2 Proof of Proposition 2
We instead show the following, more general result:

Proposition 2B.
Let (X, Y, Z) ∼ P, and let e_min, e_max: X → (0, 1) be any two functions. For any random variable Ē ∈ (0, 1) satisfying E[Z/Ē | X] = 1 and Z/e_max(X) ≤ Z/Ē ≤ Z/e_min(X), we can construct random variables (Y(0), Y(1), U) on the same space as (X, Y, Z, Ē), and an associated plausible propensity function ē(X, U) := E[Z | X, U], satisfying the following properties:

(i) Y = ZY(1) + (1 − Z)Y(0).
(ii) (Y(0), Y(1)) ⊥⊥ Z | (X, U) and e_min(X) ≤ ē(X, U) ≤ e_max(X).
(iii) Z/ē(X, U) = Z/Ē.

To recover the result of Proposition 2 from Proposition 2B, define e_min(x) = e(x)/(e(x) + [1 − e(x)]Λ) and e_max(x) = e(x)/(e(x) + [1 − e(x)]/Λ). Then let Q be the joint distribution of (X, Y(0), Y(1), Z, U). Item (i) implies Q is data compatible, item (ii) and Ē ∈ E∞(Λ) imply Q satisfies Assumption 1, and item (iii) implies E_Q[Y(1)] = E_Q[YZ/ē(X, U)] = E_P[YZ/Ē].

Proof.
Let (
X, Y, Z, ¯ E ) be as in the proposition, and suppose we have access to independent random variables V , V ∼ Uniform[0 ,
1] which are also jointly independent of (
X, Y, Z, ¯ E ). Define the following collection ofdistribution functions: F ( y | x, z ) = P ( Y ≤ y | X = x, Z = z ) G ( y | x, z, ¯ e ) = P ( Y ≤ y | X = x, Z = z, ¯ E = ¯ e ) H (¯ e | x, z ) = P ( ¯ E ≤ ¯ e | X = x, Z = z ) K ( u | x ) = (cid:90) u −∞ e ( x )1 − e ( x ) 1 − ¯ e ¯ e d H (¯ e | x, E [ Z/ ¯ E | X ] = 1 and ¯ E > K ( u | x ) is a proper CDF for each x . Using these functions, we define Y (1) , Y (0), and U by: Y (1) = ZY + (1 − Z ) G − ( V | X, , ¯ E ) Y (0) = ZF − ( V | X,
0) + (1 − Z ) YU = Z ¯ E + (1 − Z ) K − ( V | X ) . We adopt the convention that J − ( s ) := inf { t : J ( t ) ≥ s } whenever J is a distribution function, so that thesequantities are well-defined even when some of these conditional distribution functions are not invertible. Sincethe support of K ( ·| x ) is a subset of the support of H ( ·| x, Z/e max ( X ) ≤ ¯ E ≤ Z/e min ( X )implies e min ( X ) ≤ U ≤ e max ( X ) almost surely.The requirement (i) is immediate from the definition of Y (0) and Y (1).To verify the first part of (ii), we compute the distribution of ( Y (0) , Y (1)) given X, U, Z = 1 and thedistribution of ( Y (0) , Y (1)) given X, U, Z = 0. P ( Y (0) ≤ y , Y (1) ≤ y | X, U, Z = 1) = P ( F − ( V | X, ≤ y , Y ≤ y | X, ¯ E, Z = 1)= P ( F − ( V | X, ≤ y | X, ¯ E, Z = 1) G ( y | X, , ¯ E )= P ( F − ( V | X, ≤ y | X ) G ( y | X, , ¯ E )= F ( y | X, G ( y | X, , ¯ E ) P ( Y (0) ≤ y , Y (1) ≤ y | X, U, Z = 0) = P ( Y ≤ y , G − ( V | X, , ¯ E ) ≤ y | X, U, Z = 0)= P ( Y ≤ y | X, U, Z = 0) P ( G − ( V | X, , ¯ E ) ≤ y | X, U, Z = 0)= P ( Y ≤ y | X, Z = 0) P ( G − ( V | X, , ¯ E ) ≤ y | X )= F ( y | X, G ( y | X, , ¯ E ) . Since these are the same, ( Y (0) , Y (1)) | = Z | ( X, U ).A short calculation using Bayes’ theorem shows that ¯ e ( X, U ) = U . This will establish the second part of(ii), since e min ( X ) ≤ U ≤ e max ( X ).¯ e ( x, u ) = e ( x ) d P ( u | X = x, Z = 1)d P ( u | X = x ) = e ( x ) d P ( u | x, / d H ( u | x, P ( u | x ) / d H ( u | x,
1) = e ( x ) e ( x ) + (1 − e ( x )) e ( x )1 − e ( x ) 1 − uu = u The preceding calculation also verifies (iii), since Z/ ¯ e ( X, U ) =
Z/U = Z/Ē.

B.3 Proof of Proposition 3 and Theorem 1

Proof.
By symmetry, for Proposition 3 it suffices to show that Ē⁺ solves both (11) and (15).

First, we show that there exists Ē⁺ ∈ E∞(Λ) with the properties stated in Proposition 3. Define e_min(x) = e(x)/(e(x) + [1 − e(x)]Λ) and e_max(x) = e(x)/(e(x) + [1 − e(x)]/Λ). For any γ ∈ [e_min(x), e_max(x)], define ē_γ(x, y) by:

  ē_γ(x, y) = e_min(x) if y > Q_τ(x, 1);   e_max(x) if y < Q_τ(x, 1);   γ if y = Q_τ(x, 1).

We claim that for each x, there exists γ(x) ∈ [e_min(x), e_max(x)] solving E[Z/ē_γ(X, Y) | X = x] = 1. We will prove this by applying the intermediate value theorem to the continuous function w_x(γ) := E[Z/ē_γ(X, Y) | X = x]. If we took γ = e_max(x), then we would have:

  w_x(e_max(x)) = F(Q_τ(x,1) | x, 1)(e(x) + [1 − e(x)]/Λ) + (1 − F(Q_τ(x,1) | x, 1))(e(x) + [1 − e(x)]Λ)
               ≤ e(x) + (1 − e(x))(τ/Λ + (1 − τ)Λ)
               = 1,

and a similar calculation shows w_x(e_min(x)) ≥ 1. Thus, there is some γ(x) ∈ [e_min(x), e_max(x)] which solves E[Z/ē_{γ(x)}(X, Y) | X = x] = 1. For that choice of γ(x), Ē⁺ := ē_{γ(X)}(X, Y) belongs to E∞(Λ) and satisfies E[Z/Ē⁺ | X] = 1.

Now we show that any random variable Ē⁺ satisfying the requirements of the proposition solves the quantile balancing problem (11). It is easy to see that Ē⁺ is feasible in (11), since E[Q_τ(X,1) Z/Ē⁺] = E[Q_τ(X,1) E[Z/Ē⁺ | X]] = E[Q_τ(X,1)]. Moreover, for any other Ē ∈ E∞(Λ) which balances Q_τ, we may write:

  E[YZ/Ē] = E[Q_τ(X,1) Z/Ē] + E[(Y − Q_τ(X,1)) Z/Ē]
          ≤ E[Q_τ(X,1) Z/Ē] + E[(Y − Q_τ(X,1)) Z/Ē⁺]
          = E[Q_τ(X,1) Z/Ē⁺] + E[(Y − Q_τ(X,1)) Z/Ē⁺]
          = E[YZ/Ē⁺].

The inequality step follows because 1/Ē⁺ takes the maximum allowable value whenever (Y − Q_τ(X,1))Z is positive and the minimum allowable value whenever (Y − Q_τ(X,1))Z is negative, so (Y − Q_τ(X,1))Z/Ē⁺ is always at least as large as (Y − Q_τ(X,1))Z/Ē. Since Ē was arbitrary, this proves that Ē⁺ solves (11).

Finally, Ē⁺ solves the less constrained problem (11) and is feasible in the more constrained problem (15), so it solves (15) as well. This proves Proposition 3.

Solving (15) implies that Ē⁺ achieves the upper endpoint of the identified set for ψ_T. Symmetry implies that Ē⁻ achieves the lower endpoint. Proposition 2 and its obvious converse imply that the identified set is convex, so it is a closed interval. This proves Theorem 1.

B.4 Proof of Proposition 4 and Theorem 2
Proof.
We will divide the proof of Proposition 4 into several steps. Rather than explicitly constructing adistribution Q with E Q [ Y (1) − Y (0)] = E P [ Y Z/ ¯ E − Y (1 − Z ) / (1 − ¯ E )] for each ¯ E ∈ E ∞ (Λ), we will insteadconstruct the extremal distributions Q + and Q − that attain the endpoints of the partially identified set for ψ ATE . Then, we will show that the partially identified set is an interval. This will establish Proposition 4.Theorem 2 follows as a corollary.The first step is to exhibit the worst-case distribution Q + which attains the upper bound on the ATE.We will actually construct random variables Y (0) , Y (1) , U on the same probability space as ( X, Y, Z, ¯ E ),with associated plausible propensity score ¯ e ( X, U ) := E [ Z | X, U ], that satisfy the following requirements:(i) Y = Y (1) Z + Y (0)(1 − Z ).(ii) ( Y (0) , Y (1)) | = Z | ( X, U ).(iii) ¯ e ( X, U ) ∈ E ∞ (Λ). 20iv) E [ Y (1)] = ψ +T and E [ Y (0)] = ψ − C .We then take Q + to be the joint distribution of ( X, Y (0) , Y (1) , Z, U ).We start with the construction. Let (
Let (X, Y, Z) ∼ P and let (V₀, V₁) ∼ Uniform[0, 1]² be drawn independently of (X, Y, Z). Let F(y | x, z) = P(Y ≤ y | X = x, Z = z) and H̄(y | x, z) = P(Y = y | X = x, Z = z). Let T = τZ + (1 − τ)(1 − Z), and define the binary "confounder" U by:

 U = 1{Y > Q_T(X, Z)} + 1{Y = Q_T(X, Z), V₀ H̄(Y | X, Z) < F(Y | X, Z) − T}.

Define the conditional CDF of Y to sample from by G(y | x, z, u) = P(Y ≤ y | X = x, Z = z, U = u), and construct Y(0), Y(1) by:

 Y(1) = ZY + (1 − Z) G⁻¹(V₁ | X, Z = 1, U)
 Y(0) = Z G⁻¹(V₁ | X, Z = 0, U) + (1 − Z) Y.

It is immediate from the definition that Y = Y(1)Z + Y(0)(1 − Z), so the constructed random variables satisfy (i).

We prove (ii) by computing the joint distribution of (Y(0), Y(1)) given (X, U, Z = 1) and also the joint distribution of (Y(0), Y(1)) given (X, U, Z = 0):

 P(Y(0) ≤ y₀, Y(1) ≤ y₁ | X, U, Z = 1) = P(G⁻¹(V₁ | X, Z = 0, U) ≤ y₀, Y ≤ y₁ | X, U, Z = 1)
  = G(y₀ | X, Z = 0, U) P(Y ≤ y₁ | X, U, Z = 1)
  = G(y₀ | X, Z = 0, U) G(y₁ | X, Z = 1, U)

 P(Y(0) ≤ y₀, Y(1) ≤ y₁ | X, U, Z = 0) = P(Y ≤ y₀, G⁻¹(V₁ | X, Z = 1, U) ≤ y₁ | X, U, Z = 0)
  = P(Y ≤ y₀ | X, U, Z = 0) G(y₁ | X, Z = 1, U)
  = G(y₀ | X, Z = 0, U) G(y₁ | X, Z = 1, U).

Since these are the same, (Y(0), Y(1)) ⊥⊥ Z | (X, U).

Next, we establish (iii) by directly computing ē(X, U). First, observe that E[U | X, Z = 1] = 1 − τ:

 E[U | X, Z = 1] = P(Y > Q_τ(X, 1) | X, Z = 1) + P(Y = Q_τ(X, 1), V₀ < [F(Q_τ(X, 1) | X, 1) − τ] / H̄(Q_τ(X, 1) | X, 1) | X, Z = 1)
  = 1 − F(Q_τ(X, 1) | X, 1) + H̄(Q_τ(X, 1) | X, 1) × [F(Q_τ(X, 1) | X, 1) − τ] / H̄(Q_τ(X, 1) | X, 1)
  = 1 − τ.

A similar calculation shows E[U | X, Z = 0] = τ. Therefore, we may calculate ē for U = 1 by Bayes' rule:

 ē(x, 1) = e(x) P(U = 1 | X = x, Z = 1) / (e(x) P(U = 1 | X = x, Z = 1) + [1 − e(x)] P(U = 1 | X = x, Z = 0))
  = e(x)(1 − τ) / (e(x)(1 − τ) + [1 − e(x)]τ)
  = e(x) / (e(x) + [1 − e(x)]Λ),

where the last step uses τ/(1 − τ) = Λ. By similar reasoning, ē(x, 0) = e(x)/(e(x) + [1 − e(x)]Λ⁻¹). Both ē(x, 1) and ē(x, 0) satisfy the bounded odds ratio condition, so ē(X, U) ∈ E_∞(Λ).

Finally, we confirm item (iv), that E[Y(1)] = ψ_T^+ and E[Y(0)] = ψ_C^−. Let Ē⁺ be as in Proposition 3. The explicit formulas for ē(X, U) obtained in the proof of (iii) and the definition of U show that Z/ē(X, U) = Z/Ē⁺, except possibly when Y = Q_τ(X, 1). Moreover, E[Z/ē(X, U) | X] = 1:

 E[Z/ē(X, U) | X] = e(X) E[1/ē(X, U) | X, Z = 1]
  = e(X) (P(U = 0 | X, Z = 1)/ē(X, 0) + P(U = 1 | X, Z = 1)/ē(X, 1))
  = e(X) [τ(1 + (1 − e(X))/e(X) · Λ⁻¹) + (1 − τ)(1 + (1 − e(X))/e(X) · Λ)]
  = 1.
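Before moving on, the closed forms just derived are easy to check numerically. The following sketch (illustrative Python of our own; `plausible_propensities` and `check_construction` are hypothetical helper names, not code from the paper) verifies the bounded odds-ratio property and the identity E[Z/ē(X, U) | X] = 1.

```python
# Numeric sanity check of the construction above: for any nominal
# propensity e in (0, 1) and Lambda >= 1, the plausible propensities
#   e_bar(x, 1) = e / (e + (1 - e) * Lam)
#   e_bar(x, 0) = e / (e + (1 - e) / Lam)
# satisfy the Lambda odds-ratio bound and E[Z / e_bar(X, U) | X] = 1
# when P(U = 1 | X, Z = 1) = 1 - tau with tau = Lam / (1 + Lam).

def plausible_propensities(e, lam):
    """Worst-case propensity scores e_bar(x, u) for u in {0, 1}."""
    e_bar1 = e / (e + (1.0 - e) * lam)
    e_bar0 = e / (e + (1.0 - e) / lam)
    return e_bar0, e_bar1

def check_construction(e, lam, tol=1e-12):
    tau = lam / (1.0 + lam)
    e_bar0, e_bar1 = plausible_propensities(e, lam)
    # Odds ratio of e_bar relative to e stays within [1/lam, lam].
    odds = lambda p: p / (1.0 - p)
    assert 1.0 / lam - tol <= odds(e_bar0) / odds(e) <= lam + tol
    assert 1.0 / lam - tol <= odds(e_bar1) / odds(e) <= lam + tol
    # E[Z / e_bar(X, U) | X] = e * [tau/e_bar0 + (1 - tau)/e_bar1] = 1.
    mean_inv = e * (tau / e_bar0 + (1.0 - tau) / e_bar1)
    assert abs(mean_inv - 1.0) < 1e-9
    return mean_inv

for e in (0.1, 0.5, 0.9):
    for lam in (1.0, 2.0, 5.0):
        check_construction(e, lam)
```

Here ē(x, 1) tilts the odds of treatment down by exactly the factor Λ and ē(x, 0) tilts them up by the same factor, which is why both extremes of the odds-ratio band are attained.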
Therefore, E[YZ/ē(X, U)] = ψ_T^+ by Proposition 3. By exchanging the labels "treated" and "control" and applying Proposition 3, one can verify by a similar argument that E[Y(1 − Z)/(1 − ē(X, U))] = ψ_C^− as well.

To construct a distribution Q⁻ attaining the minimal value of the ATE, we use a negation trick. Let (X′, Y′, Z′) = (X, −Y, Z), and let P′ be the distribution of (X′, Y′, Z′). By the calculations above, there exists a distribution Q′ which simultaneously attains the largest value of E[Y′(1)] and the smallest value of E[Y′(0)] compatible with P′ and the marginal sensitivity model. As a result, −E_{Q′}[Y′(1)] is the smallest value of ψ_T compatible with P and −E_{Q′}[Y′(0)] is the largest value of ψ_C compatible with P. If we let Q⁻ be the distribution of (X, −Y′(0), −Y′(1), Z, U) when (X, Y′(0), Y′(1), Z, U) ∼ Q′, then Q⁻ attains the minimal value of the ATE.

Finally, we show that any value of ψ_ATE between ψ_ATE^− and ψ_ATE^+ can be realized by a distribution Q compatible with the observed data law and the marginal sensitivity model. In particular, this will prove that for any Ē ∈ E_∞(Λ), there exists a compatible distribution Q with E_Q[Y(1) − Y(0)] = E_P[YZ/Ē − Y(1 − Z)/(1 − Ē)].
Let λ ∈ [0, 1] be arbitrary. Let Q_λ be the distribution which is sampled as follows: first, sample B ∼ Bernoulli(λ). If B = 0, sample (X, Y(0), Y(1), Z, U) ∼ Q⁻. Otherwise, sample (X, Y(0), Y(1), Z, U) ∼ Q⁺. Finally, return (X, Y(0), Y(1), Z, U, B). It is clear that Q_λ matches the marginal distribution of (X, Y, Z) from P (since Q⁻ and Q⁺ both do) and satisfies Assumption 1 with the confounders (U, B). Moreover, the ATE under the distribution Q_λ is a convex combination of the extremal ATEs. Since this construction works for any λ, the identified set must be an interval.

This completes the proof of Proposition 4. Theorem 2 follows as a corollary.

B.5 Proof of Proposition 5
Proof.
If Λ = 1, the claim holds trivially, so we proceed assuming Λ > 1. Let Ŵᵢ = Zᵢ(1 − ê(Xᵢ))/ê(Xᵢ). Since L_n is convex, computing the subdifferential optimality criterion for γ̂ shows that there exists a vector ∆ ∈ [Λ⁻¹, Λ]ⁿ such that P_n Ŵ g(X)(∆ − 1) = 0 and ∆ᵢ = Λ^{sign(Yᵢ − γ̂⊤g(Xᵢ))} whenever Yᵢ ≠ γ̂⊤g(Xᵢ).

We will first show that ē*ᵢ := (1 + ∆ᵢ(1 − êᵢ)/êᵢ)⁻¹ solves (23). It is clear that ē*ᵢ belongs to E_n(Λ). Moreover, we have 0 = P_n Ŵ g(X)(∆ − 1) = P_n g(X)Z/ē* − P_n g(X)Z/ê(X). Therefore ē* is a feasible solution to (23).

Optimality of ē*ᵢ follows from Theorem 3.1 in [11]. The main technical requirement to apply that result is that P_n g(X)Z/ê(X) lies in the relative interior of {P_n g(X)Z/ẽ : ẽ ∈ E_n(Λ)}. If 0 < êᵢ < 1 for all i, then this condition is satisfied by the open mapping theorem and the fact that 1/ê is an interior point of 1/E_n(Λ).

Finally, we show the desired equivalence:

 P_n YZ/ē* / P_n Z/ê(X)
  =_i [P_n (Y − γ̂⊤g(X))Z/ē* + P_n γ̂⊤g(X)Z/ē*] / P_n Z/ê(X)
  =_ii [P_n (Y − γ̂⊤g(X))Z/ē* + P_n γ̂⊤g(X)Z/ê(X)] / P_n Z/ê(X)
  =_iii [P_n (Y − γ̂⊤g(X))Z(1 + Λ^V̂ (1 − ê(X))/ê(X)) + P_n γ̂⊤g(X)Z/ê(X)] / P_n Z/ê(X).

Here, step i adds and subtracts the term P_n γ̂⊤g(X)Z/ē* in the numerator, step ii uses the fact that ē* "balances" g(X), and step iii restates ē* in terms of V̂. Since P_n YZ/ē* / P_n Z/ê(X) is the objective value from (23), this proves Proposition 5.

B.6 Proofs of Theorems 3 and 4 for linear quantiles
In this section, we give the proofs of Theorems 3 and 4 under the assumption that Q̂_τ(x, z) = β̂(z)⊤h(x) for some "features" h : X → R^k with finite variance. We assume throughout that h contains an "intercept", i.e. h₁(x) ≡ 1. For simplicity, we only give the arguments for the estimator ψ̂_T^+. Results for the other quantile balancing bounds follow by essentially the same arguments. Since this estimator only involves a single estimated quantile function, we will lighten the notation by writing Q(x) and Q̂(x) in place of Q_τ(x, 1) and Q̂_τ(x, 1).

B.6.1 Supporting lemmas
The proofs will make use of several easy lemmas.
Lemma 1.
Assume that Conditions 1 and 2 hold, and also that Q(x) = β₀⊤h(x) for some β₀ ∈ R^k. Further suppose that E[h(X)h(X)⊤] is finite and nonsingular. Let γ̂ minimize the loss function

 L_n(γ) = P_n ρ_τ(Y − γ⊤h(X)) Z(1 − ê(X))/ê(X).

Then γ̂ →_p β₀.

Proof. This follows from general results on convex M-estimators, e.g. Theorem 2.7 in [32].
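The loss L_n above can be made concrete with a small example. The sketch below (our own illustrative Python with made-up data; the paper ships no code) works in the intercept-only case h(x) ≡ 1, where minimizing the weighted pinball loss reduces to computing a weighted τ-quantile of the treated outcomes.

```python
# Illustrative sketch: the loss in Lemma 1,
#   L_n(gamma) = P_n rho_tau(Y - gamma' h(X)) * Z * (1 - e_hat(X)) / e_hat(X),
# is a weighted pinball loss over the treated units. In the intercept-only
# case h(x) = 1, its minimizer is a weighted tau-quantile of {Y_i : Z_i = 1}.

def pinball(u, tau):
    """rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def weighted_quantile(ys, ws, tau):
    """Smallest y whose cumulative weight fraction reaches tau; this
    minimizes the weighted pinball loss."""
    pairs = sorted(zip(ys, ws))
    total = sum(ws)
    cum = 0.0
    for y, w in pairs:
        cum += w
        if cum >= tau * total:
            return y
    return pairs[-1][0]

def lemma1_loss(gamma, ys, zs, e_hats, tau):
    """L_n(gamma) for the intercept-only model."""
    n = len(ys)
    return sum(
        pinball(y - gamma, tau) * z * (1.0 - e) / e
        for y, z, e in zip(ys, zs, e_hats)
    ) / n

# Tiny worked example with Lambda = 3/2, so tau = 3/5 (hypothetical data).
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
zs = [1, 1, 1, 1, 0]          # only treated units enter the loss
e_hats = [0.5, 0.5, 0.25, 0.5, 0.5]
tau = 3.0 / 5.0
ws = [z * (1.0 - e) / e for z, e in zip(zs, e_hats)]
gamma_hat = weighted_quantile([y for y, z in zip(ys, zs) if z],
                              [w for w, z in zip(ws, zs) if z], tau)
# The weighted quantile agrees with a grid search over the loss itself.
best = min((lemma1_loss(g / 10.0, ys, zs, e_hats, tau), g / 10.0)
           for g in range(0, 61))
assert abs(best[1] - gamma_hat) < 1e-9
```

The weight Z(1 − ê)/ê is largest for treated units with small fitted propensity, which is exactly where a miscalibrated quantile would distort the worst-case bound the most.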
Lemma 2.
Let Ûᵢ = sign(Yᵢ − Q̂(Xᵢ)). Then we have the inequality:

 ψ̂_T^+ ≤ [P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − ê(X))/ê(X)) + P_n Q̂(X)Z/ê(X)] / P_n Z/ê(X).    (26)

Proof.
By Proposition 5, ψ̂_T^+ would be exactly equal to the right-hand side of (26) if Ûᵢ were replaced by V̂ᵢ = sign(Yᵢ − γ̂₀ − γ̂₁Q̂(Xᵢ)), where γ̂ = (γ̂₀, γ̂₁) comes from minimizing L_n with g(X) = (1, Q̂(X)). However, (Yᵢ − Q̂(Xᵢ))Λ^Ûᵢ is (weakly) larger than (Yᵢ − Q̂(Xᵢ))Λ^V̂ᵢ for every i, since Ûᵢ exactly matches the sign of Yᵢ − Q̂(Xᵢ) while V̂ᵢ might not. Making this replacement index-by-index gives (26).

Lemma 3.
Let Uᵢ = sign(Yᵢ − Q(Xᵢ)). Then we have the inequality:

 ψ̂_T^+ ≥ [P_n (Y − γ̂⊤h(X))Z(1 + Λ^U (1 − ê(X))/ê(X)) + P_n γ̂⊤h(X)Z/ê(X)] / P_n Z/ê(X),    (27)

where γ̂ is as in Lemma 1.

Proof. For the purposes of this proof, let ψ̄_T^+ be the solution to the "feature-balancing" problem:

 ψ̄_T^+ = max_{ē ∈ E_n(Λ)} (Σᵢ YᵢZᵢ/ēᵢ) / (Σᵢ Zᵢ/ēᵢ)  subject to  P_n h(X)Z/ē = P_n h(X)Z/ê(X).

It is clear that ψ̂_T^+ ≥ ψ̄_T^+: the feature-balancing problem has the same objective as the quantile-balancing problem but faces more constraints, since h contains an intercept and Q̂ is a linear combination of h. Proposition 5 implies that ψ̄_T^+ would be exactly equal to the right-hand side of (27) if we replaced Uᵢ by Ûᵢ = sign(Yᵢ − γ̂⊤h(Xᵢ)). However, (Yᵢ − γ̂⊤h(Xᵢ))Λ^Uᵢ is (weakly) smaller than (Yᵢ − γ̂⊤h(Xᵢ))Λ^Ûᵢ for every i, since Ûᵢ exactly matches the sign of Yᵢ − γ̂⊤h(Xᵢ) while Uᵢ might not. Making this replacement index-by-index gives (27).

B.6.2 Proof of main results
Finally, we are ready to prove the main results, which we restate to make the regularity assumptions moreprecise.
Theorem 3(i) (Sharpness for ψ_T^+). Assume Conditions 1, 2, and 3.(i). If Q(x) = β₀⊤h(x) for some β₀ ∈ R^k and β̂ →_p β₀, then ψ̂_T^+ = ψ_T^+ + o_P(1). However, even if Q(x) ≠ β⊤h(x) for any β, we still have ψ̂_T^+ ≥ ψ_T^+ − o_P(1).

Proof. We start by proving the upper bound ψ̂_T^+ ≤ ψ_T^+ + o_P(1) in the well-specified case. Lemma 2 gives the following upper bound on the quantile balancing estimator:

 ψ̂_T^+ ≤ [P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − ê(X))/ê(X)) + P_n Q̂(X)Z/ê(X)] / P_n Z/ê(X).

By Condition 1, P_n Z/ê(X) →_p 1, and the consistency of β̂ implies P_n Q̂(X)Z/ê(X) →_p E[Q(X)]. To establish the upper bound, it remains to show that P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − ê(X))/ê(X)) converges to ψ_T^+ − E[Q(X)].

The first step is to replace the estimated propensity score ê appearing in this quantity by the true nominal propensity score e. The Cauchy–Schwarz inequality and Condition 1 imply:

 P_n (Y − Q̂(X))Z Λ^Û [(1 − ê(X))/ê(X) − (1 − e(X))/e(X)]
  = O(‖Y − β̂⊤h(X)‖_{L²(P_n)} × ‖1/ê(X) − 1/e(X)‖_{L²(P_n)})
  = O_P((‖Y‖_{L²(P_n)} + ‖β̂⊤h(X)‖_{L²(P_n)}) × ε⁻² ‖ê(X) − e(X)‖_{L∞(P_n)})
  = O_P(‖Y‖_{L²(P_n)} + ‖Q(X)‖_{L²(P_n)}) × o_P(1)
  = o_P(1).

Thus, P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − ê(X))/ê(X)) = P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − e(X))/e(X)) + o_P(1).

The next step is to replace Û and Q̂(X) by U = sign(Y − Q(X)) and Q(X), respectively. For this, we employ a uniform convergence argument. For each β ∈ R^k, define the function f_β(x, y, z) by:

 f_β(x, y, z) = (y − β⊤h(x)) z (1 + Λ^{sign(y − β⊤h(x))} (1 − e(x))/e(x)).

Standard Glivenko–Cantelli (GC) preservation arguments (c.f. [26, 46, 47]) show that the class F = {f_β : ‖β − β₀‖ ≤ 1} is GC, so we have the uniform convergence sup_{f∈F} |P_n f − Pf| = o_P(1). Moreover, the map β ↦ Pf_β is continuous at β₀, which can be seen by noticing that as β → β₀, f_β(x, y, z) → f_{β₀}(x, y, z) for almost every (x, y, z) (exceptions occur when y = β₀⊤h(x), but Condition 2 implies this happens with probability zero) and then applying the dominated convergence theorem.
Thus, we have:

 P_n (Y − Q̂(X))Z(1 + Λ^Û (1 − e(X))/e(X)) = P_n f_β̂(X, Y, Z)
  = P f_β̂ + o_P(1)
  = P f_β₀ + o_P(1)
  = E[(Y − Q(X))Z/Ē⁺] + o_P(1)
  = ψ_T^+ − E[Q(X)] + o_P(1).

Combining these various results gives ψ̂_T^+ ≤ ψ_T^+ + o_P(1). This establishes the upper bound in the well-specified case.

Now we turn to the lower bound, ψ̂_T^+ ≥ ψ_T^+ − o_P(1), beginning in the correctly-specified case. Lemma 3 lower bounds the quantile balancing estimator by a variant of the "feature balancing" estimator:

 ψ̂_T^+ ≥ [P_n (Y − γ̂⊤h(X))Z(1 + Λ^U (1 − ê(X))/ê(X)) + P_n γ̂⊤h(X)Z/ê(X)] / P_n Z/ê(X).

We will show that this lower bound is at least ψ_T^+ − o_P(1). We may assume without loss of generality that E[h(X)h(X)⊤] is full rank, since excising features that are linear combinations of other ones has no effect on the feature balancing estimator. In the preceding display, the denominator P_n Z/ê(X) converges to one, so we can focus on the two terms in the numerator.

Since Lemma 1 implies that γ̂ is consistent, exactly the same arguments from the upper bound show P_n γ̂⊤h(X)Z/ê(X) →_p E[Q(X)]. Moreover, the argument from the upper bound shows that ê can be replaced by e in the expression P_n (Y − γ̂⊤h(X))Z(1 + Λ^U (1 − ê(X))/ê(X)). Some manipulation shows that 1 + Λ^U (1 − e(X))/e(X) = 1/Ē⁺ almost surely, where Ē⁺ is the worst-case propensity score defined in Proposition 3. Therefore, we may write:

 P_n (Y − γ̂⊤h(X))Z(1 + Λ^U (1 − e(X))/e(X)) = P_n (Y − γ̂⊤h(X))Z/Ē⁺ + o_P(1)
  = P_n (Y − Q(X))Z/Ē⁺ + O_P(‖γ̂ − β₀‖) + o_P(1)
  = ψ_T^+ − E[Q(X)] + o_P(1).

Combining these various results gives ψ̂_T^+ ≥ ψ_T^+ − o_P(1). This establishes the lower bound in the well-specified case.

Finally, we extend the lower bound to the misspecified case.
If Q(x) ≠ β⊤h(x) for any β, then we can lower bound ψ̂_T^+ by the feature-balancing estimator that balances h(x) and the true quantile Q(x). This brings us back to the well-specified case, so the preceding arguments show ψ̂_T^+ ≥ ψ_T^+ − o_P(1).

Theorem 4(i) (Inference for ψ_T^+). Assume Conditions 1, 2, and 3.(i). Suppose that the estimated propensity score satisfies ê(x) = e(x, θ̂) for some parametric model {e(·, θ) : θ ∈ R^k} satisfying the following regularity conditions:

(i) There exists θ₀ ∈ R^k such that e(x) = e(x, θ₀).
(ii) θ̂ minimizes the function θ ↦ P_n ℓ(θ, X, Z), where θ ↦ ℓ(θ, x, z) is convex for each (x, z).
(iii) The function L(θ) = P ℓ(θ, X, Z) satisfies ∇L(θ₀) = 0 and ∇²L(θ₀) ≻ 0.
(iv) ‖e(·, θ) − e(·, θ₀)‖_∞ ≤ r(‖θ − θ₀‖) for some continuous function r with r(0) = 0.
(v) The function class {(x, z) ↦ ℓ(θ, x, z) : ‖θ − θ₀‖ ≤ 1} is P-Donsker.
(vi) The function class {e(·, θ) : ‖θ − θ₀‖ ≤ 1} has bounded uniform entropy integral.

Suppose the number of bootstrap samples B ≡ B_n tends to infinity. Then we have:

 lim inf_{n→∞} P(ψ_T^+ ≤ Q_{1−α}({ψ̂_b^+}_{b∈[B]})) ≥ 1 − α  for all α ∈ (0, 1).

Proof. Suppose for now that the linear quantile model is correctly specified, i.e. Q(x) = β₀⊤h(x) for some β₀ ∈ R^k. For each b ∈ [B_n], let {(X_i^b, Y_i^b, Z_i^b)}_{i≤n} denote the observations from the b-th bootstrap sample. For a function f, write P_n^b f = n⁻¹ Σ_{i=1}^n f(X_i^b, Y_i^b, Z_i^b).
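The resampling scheme just introduced can be sketched in a few lines (our own illustrative Python; `psi_hat` is a placeholder statistic standing in for the quantile balancing bound, not the paper's estimator):

```python
# Minimal percentile-bootstrap sketch: for each bootstrap sample b, the
# statistic is recomputed from scratch on the resampled rows, and the upper
# confidence bound is the (1 - alpha) quantile of the B replicates.
import random

def psi_hat(sample):
    """Placeholder statistic: a sample mean stands in for psi_hat_T^+."""
    return sum(sample) / len(sample)

def percentile_bootstrap_ucb(data, alpha=0.05, num_boot=2000, seed=0):
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(num_boot):
        boot = [data[rng.randrange(n)] for _ in range(n)]  # resample rows
        reps.append(psi_hat(boot))
    reps.sort()
    # (1 - alpha) empirical quantile of the bootstrap replicates.
    k = min(num_boot - 1, int((1.0 - alpha) * num_boot))
    return reps[k]

data = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.3]  # hypothetical data
ucb = percentile_bootstrap_ucb(data)
assert ucb >= psi_hat(data) - 0.2
```

In the paper's setting, recomputing the statistic on each bootstrap sample means refitting both the propensity model (θ̂_b) and the quantile coefficients (γ̂_b), as defined next.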
Define θ̂_b and γ̂_b by:

 θ̂_b = argmin_θ P_n^b ℓ(θ, X, Z)
 γ̂_b = argmin_γ P_n^b ρ_τ(Y − γ⊤h(X)) Z(1 − e(X, θ̂_b))/e(X, θ̂_b).

Then the proof of Lemma 3 implies that the following inequality holds deterministically for every b:

 ψ̂_b^+ ≥ [P_n^b (Y − γ̂_b⊤h(X))Z(1 + Λ^U (1 − e(X, θ̂_b))/e(X, θ̂_b)) + P_n^b γ̂_b⊤h(X)Z/e(X, θ̂_b)] / P_n^b Z/e(X, θ̂_b).    (28)

We will call the estimator on the right-hand side of (28) the lower bound ψ̃_b^+, and use ψ̃^+ to refer to the same estimator fit on the original sample {(X_i, Y_i, Z_i)}_{i≤n}. The bootstrap distribution of ψ̃^+ is only hypothetical, because it depends on the unknown function Q, but its quantiles must lie below the quantiles of the bootstrap distribution of ψ̂_T^+. Therefore, if we can show that the percentile bootstrap based on the hypothetical estimator ψ̃^+ is valid, then we will have proved that the more conservative feasible percentile bootstrap for ψ̂_T^+ is valid as well.

The validity of the percentile bootstrap for ψ̃^+ can be established using standard tools. We will briefly sketch how this can be done. Notice that (θ̂_b, γ̂_b, ψ̃_b^+) approximately solve P_n^b m_{θ̂_b, γ̂_b, ψ̃_b^+}(X, Y, Z) = 0, where the score function m is defined by:

 m_{θ,γ,ψ}(x, y, z) = ( ∇ℓ(θ, x, z),
  h(x)(τ − 1{y − γ⊤h(x) < 0}) z(1 − e(x, θ))/e(x, θ),
  (y − γ⊤h(x)) z (1 + Λ^{sign(y − Q(x))} (1 − e(x, θ))/e(x, θ)) + γ⊤h(x) z/e(x, θ) − ψ z/e(x, θ) ).

Therefore, one can apply the standard theory of bootstrap Z-estimators (in particular, Theorem 10.6 in [26]) to verify bootstrap consistency. The verification of the technical requirements for that result from our assumptions is routine, and convexity of the loss functions θ ↦ ℓ(θ, x, z) and γ ↦ ρ_τ(y − γ⊤h(x)) can be used in lieu of the global Glivenko–Cantelli requirement on the moment equations.
See Appendix C of [49] for the type of verification calculations that remain.

Finally, it remains to remove the assumption that the linear quantile model is correctly specified. If Q(x) ≠ β⊤h(x) for any β, then we may once again lower bound ψ̂_T^+ by the estimator that balances h(x) and the true quantile Q(x). This brings us back to the well-specified case, and the preceding arguments imply the validity of the bootstrap upper confidence bound.

B.7 Proof of Theorem 3 for nonlinear quantiles

In this section, we prove Theorem 3 when quantiles are estimated by a nonlinear model. As in the case of linear quantiles, we will give the argument for the estimator ψ̂_T^+. As such, we will continue to use Q̂(x) and Q(x) as shorthand for Q̂_τ(x, 1) and Q_τ(x, 1).

B.7.1 Regularity conditions
As alluded to in Condition 3, we require nonlinear models to be estimated using a form of sample splitting called "cross-fitting" [10, 42, 33]. We briefly describe the procedure, mostly to fix notation. The sample {(X_i, Y_i, Z_i)}_{i≤n} is divided into K disjoint "folds" F₁, ..., F_K of approximately equal size. For each k ∈ [K], a quantile estimate Q̂^{−k} is obtained using only observations not in F_k. Finally, we set Q̂_i = Σ_{k=1}^K Q̂^{−k}(X_i) 1{i ∈ F_k}. In this way, no observation is used to obtain its own quantile estimate. In the extreme case where K is equal to the sample size, this is simply "leave-one-out" estimation. However, in cross-fitting, K is taken to be a fixed constant.

We also require the fitted quantiles Q̂_i to satisfy an additional regularity condition.

Condition N.
For some α, β > 0, we have max_{i≤n} |Q̂_i| = o_P(n^α) and P(0 < |Q̂_i − Q̂_j| < n^{−β} for some (i, j)) → 0.

This condition rules out gross "outliers" among the Q̂_i which are difficult to balance. The condition max_i |Q̂_i| = o_P(n^α) alone is not sufficient for this, because it is not an affine-invariant assumption: one can take an arbitrarily poorly-behaved estimate Q̂ and rescale it to be bounded by one without changing the estimator ψ̂_T^+. The separation requirement rules out this trick.

It is not hard to find examples of estimators which satisfy this condition. For example, under Conditions 1 and 2, Condition N is satisfied by any estimator whose fitted values {Q̂_i} only take values in the observed outcomes {Y_i}. Examples include the nearest-neighbor quantile regression method of [43], as well as the random-forest methods of [29] and [2].

Proposition 6.
Assume Conditions 1 and 2. If {Q̂_i}_{i≤n} ⊆ {Y_j}_{j≤n} almost surely, then Condition N is satisfied with α = 1/2 and any β > 2.

Proof. The upper bound follows from the well-known fact that the maximum of n i.i.d. observations from a distribution with finite variance has magnitude o_P(n^{1/2}). Therefore, max_i |Q̂_i| ≤ max_j |Y_j| = o_P(n^{1/2}).

For the lower bound, it suffices to show that P(min_{i≠j} |Y_i − Y_j| < n^{−β}) → 0 whenever β > 2. Let F_Y(y) = P(Y ≤ y), and let B < ∞ be a uniform bound on F′_Y(·); this exists since f(y | x, z) is uniformly bounded by Condition 2. Then P(min_{i≠j} |Y_i − Y_j| < n^{−β}) ≤ P(n²∆ ≤ Bn^{−(β−2)}), where ∆ = min_{i≠j} |F_Y(Y_i) − F_Y(Y_j)|. Theorem 8.2 in [12] shows that n²∆ ⇝ Exponential(1), so P(n²∆ ≤ Bn^{−(β−2)}) → 0.

B.7.2 Supporting lemmas
To simplify the proof, we separate out a preliminary convergence result as a lemma. Throughout this proof and the next, we will use the following notation: for a function f, P_n^k f denotes the fold-k average |F_k|⁻¹ Σ_{i∈F_k} f(X_i, Y_i, Z_i).

Lemma 4.
Assume Condition 1. Suppose ‖Q̂^{−k} − Q‖_{L²(P)} →_p 0 for each k ∈ [K]. Then ‖Q̂ − Q‖_{L²(P_n)} = o_P(1) and P_n Q̂ Z/ê(X) = E[Q(X)] + o_P(1).

Proof. Start with the first claim. For any k ∈ [K], applying Markov's inequality conditionally on {(X_i, Y_i, Z_i)}_{i∉F_k} gives P_n^k (Q̂^{−k}(X) − Q(X))² = O_P(‖Q̂^{−k} − Q‖²_{L²(P)}) = o_P(1). Averaging over k ∈ [K] gives the desired result.

For the second claim, write:

 P_n Q̂ Z/ê(X) = P_n Q Z/e(X) + P_n (Q̂ − Q) Z/e(X) + O(‖Q̂‖_{L²(P_n)} ‖1/ê − 1/e‖_{L²(P_n)})
  = E[Q(X)] + O(‖Q̂ − Q‖_{L²(P_n)}/ε) + o_P(1)
  = E[Q(X)] + o_P(1).

B.7.3 Proof of main result

Now we are ready to prove Theorem 3 for nonlinear quantile models in the case of the estimand ψ_T^+. We restate the result to make the quantile consistency assumption precise.

Theorem 3(ii).
Assume Conditions 1, 2, 3.(ii), and N. If ‖Q̂^{−k} − Q‖_{L²(P)} = o_P(1) for each k ∈ [K], then ψ̂_T^+ = ψ_T^+ + o_P(1). However, even if ‖Q̂^{−k} − Q‖_{L²(P)} ↛ 0, we still have ψ̂_T^+ ≥ ψ_T^+ − o_P(1).

Proof. We start by proving ψ̂_T^+ ≤ ψ_T^+ + o_P(1) when the quantile model is consistent. This part of the proof follows roughly the same template as the corresponding proof in the linear case. Lemma 2 implies:

 ψ̂_T^+ ≤ [P_n (Y − Q̂)Z(1 + Λ^Û (1 − ê(X))/ê(X)) + P_n Q̂ Z/ê(X)] / P_n Z/ê(X),

where Û_i = sign(Y_i − Q̂_i). Since P_n Z/ê(X) →_p 1 and P_n Q̂ Z/ê(X) →_p E[Q(X)] by Lemma 4, it remains to show that P_n (Y − Q̂)Z(1 + Λ^Û (1 − ê(X))/ê(X)) converges to ψ_T^+ − E[Q(X)]. By the same reasoning as in the linear case, we may replace ê(X) by e(X) in this quantity without changing its value much. Thus, we may write:

 P_n (Y − Q̂)Z(1 + Λ^Û (1 − ê(X))/ê(X))
  = P_n (Y − Q̂)Z(1 + Λ^Û (1 − e(X))/e(X)) + o_P(1)
  =_i P_n (Y − Q(X))Z(1 + Λ^Û (1 − e(X))/e(X)) + O(ε⁻¹ ‖Q̂ − Q(X)‖_{L²(P_n)}) + o_P(1)
  =_ii P_n (Y − Q(X))Z(1 + Λ^Û (1 − e(X))/e(X)) + o_P(1)
  =_iii P_n (Y − Q(X))Z/Ē⁺ + O(‖Y − Q(X)‖_{L²(P_n)} ‖ZΛ^Û − ZΛ^U‖_{L²(P_n)}) + o_P(1)
  =_iv ψ_T^+ − E[Q(X)] + O_P(‖ZΛ^Û − ZΛ^U‖_{L²(P_n)}) + o_P(1).

Here, step i adds and subtracts a term then applies Cauchy–Schwarz, step ii applies Lemma 4 to conclude ‖Q̂ − Q‖_{L²(P_n)} = o_P(1), step iii adds and subtracts P_n (Y − Q(X))Z/Ē⁺ and applies Cauchy–Schwarz, and step iv holds by Proposition 3 and the law of large numbers.

It remains to prove that ‖ZΛ^Û − ZΛ^U‖_{L²(P_n)} = o_P(1), or equivalently (up to constants) that P_n Z 1{Û ≠ U} = o_P(1).
For each k ∈ [K], we may apply Chebyshev's inequality conditional on {(X_i, Y_i, Z_i)}_{i∉F_k} to conclude:

 |P_n^k Z 1{Û ≠ U} − ∫ z 1{sign(y − Q̂^{−k}(x)) ≠ sign(y − Q(x))} dP(x, y, z)| = o_P(1).

The integral in the preceding display tends to zero in probability. To see this, recall that Condition 2 requires the conditional density f(y | x, z) to be uniformly bounded by some B < ∞, so we may write:

 ∫ z 1{sign(y − Q̂^{−k}(x)) ≠ sign(y − Q(x))} dP(x, y, z) = ∫_X e(x) ∫_{Q̂^{−k}(x) ∧ Q(x)}^{Q̂^{−k}(x) ∨ Q(x)} f(y | x, 1) dy dP_X(x)
  ≤ ∫_X (1 − ε) B |Q̂^{−k}(x) − Q(x)| dP_X(x)
  ≲ ‖Q̂^{−k} − Q‖_{L¹(P)} ≤ ‖Q̂^{−k} − Q‖_{L²(P)} = o_P(1).

Thus, P_n^k Z 1{Û ≠ U} = o_P(1). Averaging over k gives P_n Z 1{Û ≠ U} = o_P(1), and so ψ̂_T^+ ≤ ψ_T^+ + o_P(1).

Now, we turn to the lower bound, which is substantially more difficult. We wish to show ψ̂_T^+ ≥ ψ_T^+ − o_P(1) whether or not Q̂^{−k} converges to Q. For each k ∈ [K], define ψ̂⁺(k) by:

 ψ̂⁺(k) = max_{ē_k ∈ E_{n,k}(Λ)} P_n^k YZ/ē_k  subject to  (P_n^k Q̂^{−k} Z/ē_k, P_n^k Z/ē_k) = (P_n^k Q̂^{−k} Z/ê(X), P_n^k Z/ê(X)),    (29)

where E_{n,k}(Λ) is the projection of E_n(Λ) onto the coordinates in F_k. Clearly, ψ̂_T^+ × P_n Z/ê(X) ≥ Σ_k ψ̂⁺(k) |F_k|/n, so it suffices to prove ψ̂⁺(k) ≥ ψ_T^+ − o_P(1) for each k.

We will make some notational simplifications. The remainder of the proof will focus on showing ψ̂⁺(1) ≥ ψ_T^+ − o_P(1). For convenience, we will assume F₁ = [n₁], where n₁ ∼ n/K. As an additional simplification, we will assume that ε/2 ≤ ê_i ≤ 1 − ε/2 for all i. Mechanically, this can always be arranged by "trimming" the estimated propensity score. Condition 1 implies the trimming has no effect in large samples, so it is only used as a theoretical device to simplify calculations. Finally, recall that we have defined Ŵ_i = Z_i(1 − ê_i)/ê_i.

We will construct a propensity vector ē* satisfying the constraints of (29) with the property that ψ̄ := P_n YZ/ē* converges to ψ_T^+. Since ψ̂⁺(1) ≥ ψ̄, this will show ψ̂⁺(1) ≥ ψ_T^+ − o_P(1). A natural first idea is to take the idealized propensity score ē*_i = (1 + θ_i(1 − ê(X_i))/ê(X_i))⁻¹, where θ_i = Λ^{U_i}. This mimics the true worst-case propensity score, but uses ê(X_i) in place of e(X_i) to satisfy the odds-ratio constraint. It is not hard to see that this would result in a sharp estimate of ψ_T^+ by classic IPW logic.
 P_n YZ(1 + θ(1 − ê(X))/ê(X)) = P_n YZ(1 + θ(1 − e(X))/e(X)) + O(‖YZ‖_{L²(P_n)} × ‖1/e − 1/ê‖_∞)
  = P_n YZ/Ē⁺ + o_P(1)
  = ψ_T^+ + o_P(1).    (30)

However, this choice of ē* is not guaranteed to satisfy the "balancing" constraints of (29). Our construction perturbs this "ideal" choice to gain feasibility.

Our construction will be convoluted, so it is worth taking a moment to explain the high-level idea. First, we discard a small number of gross "outliers" to produce a set of "inliers" I_{j*} whose fitted quantiles are relatively easy to balance. We then produce a feasible propensity ē* by assigning the outliers the nominal propensity score ê(X_i) and perturbing the inliers' idealized propensity score by a small amount. We show the resulting lower bound ψ̄ = P_n YZ/ē* is a consistent (albeit impractical) estimator of ψ_T^+.

We start by extracting a set of inliers I_{j*} ⊆ [n] in the following fashion: set I₁ = [n], and for 2 ≤ j ≤ 4β + 3, recursively define I_j by:

 I_j = {i ∈ I_{j−1} : |Q̂_i − Q̄_{j−1}| ≤ 4^{j−1} n^{α−(j−1)/4}},    (31)

where Q̄_{j−1} = (Σ_{i∈I_{j−1}} Ŵ_i Q̂_i)/(Σ_{i∈I_{j−1}} Ŵ_i) is the weighted average value of Q̂_i within I_{j−1}. We set I_{4β+4} = ∅. Let j* be the first stage in the above procedure at which at least an n^{−1/4} fraction of the "weight" in I_j comes from outliers:

 j* = min{j : (Σ_{i∈I_j \ I_{j+1}} Ŵ_i)/(Σ_{i∈I_j} Ŵ_i) ≥ n^{−1/4}}.    (32)

It is easy to verify that j* is well-defined (the set is not empty) whenever Z_i = 1 for some index i ≤ n. For completeness, we arbitrarily set j* = 4β + 3 when that does not happen.

With this definition of j*, we ensure the total "weight" on discarded outliers is asymptotically negligible. Since Σ_{i∈I_j \ I_{j+1}} Ŵ_i ≤ n^{−1/4} Σ_{i∈I_j} Ŵ_i for all j < j*, we have:

 Σ_{i∉I_{j*}} Ŵ_i = Σ_j