Assessing Sensitivity to Unconfoundedness: Estimation and Inference
Matthew A. Masten†   Alexandre Poirier‡   Linqi Zhang§

December 31, 2020
Abstract
This paper provides a set of methods for quantifying the robustness of treatment effects estimated using the unconfoundedness assumption (also known as selection on observables or conditional independence). Specifically, we estimate and do inference on bounds on various treatment effect parameters, like the average treatment effect (ATE) and the average effect of treatment on the treated (ATT), under nonparametric relaxations of the unconfoundedness assumption indexed by a scalar sensitivity parameter $c$. These relaxations allow for limited selection on unobservables, depending on the value of $c$. For large enough $c$, these bounds equal the no assumptions bounds. Using a non-standard bootstrap method, we show how to construct confidence bands for these bound functions which are uniform over all values of $c$. We illustrate these methods with an empirical application to effects of the National Supported Work Demonstration program. We implement these methods in a companion Stata module for easy use in practice.

JEL classification:
C14; C18; C21; C51
Keywords:
Treatment Effects, Conditional Independence, Unconfoundedness, Selection on Observables, Sensitivity Analysis, Nonparametric Identification, Partial Identification

∗ This paper was presented at the 2018 Western Economic Association International Conference, the 2019 Stata Conference Chicago, the 2020 World Congress of the Econometric Society, the DC-MD-VA Econometrics Workshop 2020, University of Southern California, University of Toronto, and the 2020 SEA Conference. We thank participants at those seminars and conferences, as well as Karim Chalak, Toru Kitagawa, and John Pepper. We thank Paul Diegert for excellent research assistance. Masten thanks the National Science Foundation for research support under Grant No. 1943138.
† Department of Economics, Duke University, [email protected]
‡ Department of Economics, Georgetown University, [email protected]
§ Department of Economics, Boston College, [email protected]

1 Introduction
A core goal of causal inference is to identify and estimate effects of a treatment variable on an outcome variable. A common assumption used to identify such effects is unconfoundedness, which says that potential outcomes are independent of treatment conditional on covariates. This assumption is also known as conditional independence, selection on observables, ignorability, or exogenous selection; see Imbens (2004) for a survey. This assumption is not refutable, meaning that the data alone cannot tell us whether it is true. Nonetheless, empirical researchers may wonder: How important is this assumption in their analyses? Put differently: How sensitive are their results to failures of the unconfoundedness assumption?

A large literature on sensitivity analysis has developed to answer this question. Moreover, researchers widely acknowledge that answering this question is an important step in empirical research. For example, in their figure 1, Caliendo and Kopeinig (2008) describe the workflow of a standard analysis using selection on observables. Their fifth and final step in this workflow is to perform sensitivity analysis to the unconfoundedness assumption. Imbens and Wooldridge (2009, section 6.2), Imbens and Rubin (2015, chapter 22), and Athey and Imbens (2017) all also recommend that researchers conduct sensitivity analyses to assess the importance of non-refutable identifying assumptions. In particular, Athey and Imbens (2017) describe these methods as "a systematic way of doing the sensitivity analyses that are routinely done in empirical work, but often in an unsystematic way."

Most of the existing approaches to assessing unconfoundedness rely on strong auxiliary assumptions, however. For example, they often assume treatment effects are homogeneous and that all unobserved confounding arises due to a single unobserved variable whose distribution is parametrically specified, like a binary or normal distribution. They also often assume a parametric functional form for potential outcomes, like a logit model for binary potential outcomes or a linear model for continuous potential outcomes. These assumptions, which are not needed for identification of the baseline model when unconfoundedness holds, raise a new question: Are the findings of these sensitivity analyses themselves sensitive to these extra auxiliary assumptions?

In this paper, we provide a set of tools for assessing the sensitivity of the unconfoundedness assumption which do not rely on strong auxiliary assumptions that are not used for the baseline analysis. We do this by studying nonparametric relaxations of the unconfoundedness assumption. Specifically, we apply the identification results of Masten and Poirier (2018), who consider a class of assumptions called conditional $c$-dependence. This class measures relaxations of conditional independence by a single scalar parameter $c \in [0,1]$, where $c$ is the largest difference between the propensity score and the probability of treatment conditional on covariates and an unobserved potential outcome. Hence it has a straightforward interpretation as a deviation from conditional independence, as measured in probability units. For any positive $c$, conditional independence only partially holds, and so we cannot learn the exact value of our treatment effect parameters, like the average treatment effect (ATE) or the average effect of treatment on the treated (ATT). Instead, we only get bounds. Masten and Poirier (2018) derive closed-form expressions for these bounds as a function of $c$.
Setting $c = 0$ yields the baseline model where unconfoundedness holds. Setting $c = 1$ yields the other extreme where no assumptions on selection are made, and hence gives the no assumptions bounds as in Manski (1990). The bounds are monotonic in $c$, so that small values of $c$ give narrow bounds while larger values of $c$ give wider bounds. Just how wide these bounds are, and hence how sensitive one's results are, depends on the data.

While Masten and Poirier (2018) studied identification of treatment effects under nonparametric relaxations of unconfoundedness, they did not study estimation or inference. We do that in this paper. First we propose sample analog estimators of the bounds on the conditional quantile treatment effect (CQTE), the conditional average treatment effect (CATE), the ATE, and the ATT. We do this using flexible parametric first step estimators of the propensity score and the conditional quantile function of the observed outcomes given treatment and covariates. Although such parametric restrictions are not required for our identification theory, the analysis of inference is complicated and non-standard even with these parametric first step estimators. Doing inference based on fully nonparametric first step estimators will likely require deriving and applying more general asymptotic theory for non-Hadamard differentiable functionals than currently exists. Hence we leave that to future work. Moreover, note that our approach of using nonparametric identification results paired with flexible parametric estimators is analogous to what is commonly done in the baseline model which imposes unconfoundedness: Identification is shown nonparametrically, but many commonly used estimators are based on flexible parametric first step estimators. For example, see chapter 13 in Imbens and Rubin (2015).

We derive the asymptotic distributions of our bound estimators using the delta method for Hadamard directionally differentiable functionals from Fang and Santos (2019). We then show consistency of a non-standard bootstrap based on estimating the analytical Hadamard directional derivatives of our bound functionals. This step again involves using the recent results of Fang and Santos (2019). We show how to construct confidence bands for the bound functions which are uniform over all values of $c \in [0,1]$.

Related Literature
We conclude this section with a brief literature review. As mentioned earlier, there is a large existing literature that studies how to relax unconfoundedness. This includes Rosenbaum and Rubin (1983), Mauro (1990), Rosenbaum (1995, 2002), Robins, Rotnitzky, and Scharfstein (2000), Imbens (2003), Altonji, Elder, and Taber (2005, 2008), Ichino, Mealli, and Nannicini (2008), Hosman, Hansen, and Holland (2010), Krauth (2016), Kallus, Mao, and Zhou (2019), Oster (2019), and Cinelli and Hazlett (2020), among others. Here we discuss the most closely related work and several recent papers. For further details about the related literature, see section 1 in Masten and Poirier (2018) for identification and Appendix D in Masten and Poirier (2020) for estimation and inference.

A key feature of our results is that they are based on the fully nonparametric analysis of Masten and Poirier (2018). There are only a few other alternative nonparametric analyses available in the literature. The first is Ichino et al. (2008), who require that all variables are discretely distributed. In contrast, we allow for continuous outcomes, covariates, and unobservables. Their approach requires picking a vector of sensitivity parameters that determines the joint distribution of the discrete observable and unobservable variables. In contrast, our approach uses a scalar sensitivity parameter. Finally, unlike us, they do not provide any formal results for doing estimation or inference. The second is Rosenbaum (1995, 2002), who proposed a sensitivity analysis for unconfoundedness within the context of doing randomization inference based on the sharp null hypothesis of no unit level treatment effects for all units in the data set. Like our approach, he only uses a scalar sensitivity parameter and also does not rely on a parametric model for outcomes or treatment assignment probabilities. His approach, however, is based on finite sample randomization inference (for more discussion, see chapter 5 of Imbens and Rubin 2015). This approach to inference is conceptually distinct from the approach we use, which is based on repeated sampling from a large population. For this reason, we view these different approaches to inference in sensitivity analyses as complementary. Finally, Kallus et al. (2019) study bounds on CATE under the same nonparametric relaxations defined by Rosenbaum (1995, 2002). Unlike him, however, they take a large population view. They propose sample analog kernel estimators based on an implicit characterization of the identified set using extrema. They show consistency of these estimators, but they do not provide any inference results. As we discuss later, this is a key distinction because inference in this setting is non-standard.

A few recent papers provide methods for assessing unconfoundedness in parametric linear models. This includes Oster (2019) and Cinelli and Hazlett (2020). These results rely on the assumption that outcomes are linear functions of treatment and covariates, among other parametric assumptions. In contrast, we build on the selection on observables literature that has emphasized nonparametric identification. That literature emphasizes that identification by functional form is often implausible. Sensitivity analyses that rely on functional form assumptions are subject to the same criticism: Findings that one's results are robust to violations of unconfoundedness can be driven primarily by the parametric functional form restrictions.
To address this, our estimation and inference results are based on nonparametric sensitivity analyses that do not require parametric assumptions.

Finally, we discuss the relationship with our own previous work. As noted earlier, our paper provides estimation and inference results for population bounds derived in Masten and Poirier (2018). That paper did not provide any estimation or inference theory. Masten and Poirier (2020) builds on those results in several ways: First, they extend the identification analysis to identification of distributional treatment effect parameters, with a focus on assessing the importance of the rank invariance assumption. Second, they provide some asymptotic distributional results for sample analog estimators of the average treatment effect (ATE), the conditional average treatment effect (CATE), and the conditional quantile treatment effect (CQTE), among other results. Those results are limited in a variety of ways, which we discuss next.

Specifically, our paper differs from the results in Masten and Poirier (2020) in several important ways: (1) Our paper allows for both discrete and continuous covariates, whereas that paper focused on the case where all covariates are discrete. In particular, to allow for continuous covariates we develop a different estimator of the bound functions. This is important since many empirical applications, like ours in section 7, use continuous covariates. (2) Our results allow for all possible values of $c \in [0,1]$, whereas that paper restricted the allowed values of $c$ (see their assumption A2.1). This is also important for practice and requires a substantial amount of new theoretical work. (3) Our results use the Fang and Santos (2019) bootstrap based on estimators of analytical Hadamard directional derivatives to do inference. That paper instead used the numerical delta method bootstrap of Hong and Li (2018). Our approach allows us to avoid choosing the step size tuning parameter required for the numerical delta method bootstrap, although our estimators of the analytical Hadamard directional derivatives also have tuning parameters. (4) Unlike that paper, we also discuss inference on the average effect of treatment on the treated (ATT). (5) In this paper we provide a new companion Stata module implementing our results.

2 Setup

In this section we describe the model and review standard results on point identification of treatment effects under unconfoundedness. We then describe how we relax unconfoundedness. Finally, we review the bounds on treatment effects derived by Masten and Poirier (2018) when unconfoundedness is relaxed.

2.1 Model and Baseline Point Identification Results
We use the standard potential outcomes model. Let $X \in \{0,1\}$ be an observed binary treatment. Let $Y_1$ and $Y_0$ denote the unobserved potential outcomes. The observed outcome is
\[ Y = XY_1 + (1-X)Y_0. \tag{1} \]
Let $W \in \mathbb{R}^{d_W}$ denote a vector of observed covariates, which may be discrete, continuous, or mixed. Let $\mathcal{W} = \mathrm{supp}(W)$ denote the support of $W$. Let
\[ p_{x|w} = P(X = x \mid W = w) \]
denote the observed generalized propensity score. It is well known that the conditional distributions of potential outcomes $Y_1 \mid W$ and $Y_0 \mid W$ are point identified under the following two assumptions:

Unconfoundedness: $X \perp\!\!\!\perp Y_1 \mid W$ and $X \perp\!\!\!\perp Y_0 \mid W$.

Overlap: $p_{1|w} \in (0,1)$ for all $w \in \mathcal{W}$.

Consequently, any functional of the distributions of $Y_1 \mid W$ and $Y_0 \mid W$ is also point identified. We focus on two leading examples: the average treatment effect, $\mathrm{ATE} = E(Y_1 - Y_0)$, and the average treatment effect for the treated, $\mathrm{ATT} = E(Y_1 - Y_0 \mid X = 1)$. We also consider the conditional quantile treatment effects $\mathrm{CQTE}(\tau \mid w) = Q_{Y_1|W}(\tau \mid w) - Q_{Y_0|W}(\tau \mid w)$ and the conditional average treatment effect $\mathrm{CATE}(w) = E(Y_1 - Y_0 \mid W = w)$.
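Under unconfoundedness and overlap, these parameters can be estimated by plug-in. As a concrete illustration, here is a minimal Python sketch (ours, not the paper's companion Stata module) of a baseline ATE estimate via regression adjustment; the linear specification and function name are illustrative assumptions, and any flexible first step could be used instead.

```python
import numpy as np
import statsmodels.api as sm

def baseline_ate(y, x, w):
    """Plug-in ATE estimate under unconfoundedness via regression adjustment.

    Fits E(Y | X, W) with a linear model in (1, X, W), predicts both
    potential outcomes for every unit, and averages their difference
    over the empirical distribution of W.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), x, w])
    fit = sm.OLS(y, design).fit()
    d1 = np.column_stack([np.ones(n), np.ones(n), w])   # set X = 1
    d0 = np.column_stack([np.ones(n), np.zeros(n), w])  # set X = 0
    return np.mean(fit.predict(d1) - fit.predict(d0))
```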
2.2 Sensitivity Analysis: Relaxing Unconfoundedness

As discussed in section 1, the overlap assumption is refutable and hence can be directly verified from the data. The unconfoundedness assumption, however, is not refutable. Consequently, like much of the literature reviewed in section 1, we perform a sensitivity analysis. This entails replacing unconfoundedness with a weaker assumption and investigating how this changes the conclusions we can draw about our parameter of interest. Specifically, we define the following class of assumptions, which we call conditional $c$-dependence (Masten and Poirier 2018):
Definition 1. Let $x \in \{0,1\}$. Let $w \in \mathcal{W}$. Let $c$ be a scalar between 0 and 1. Say $X$ is conditionally $c$-dependent with $Y_x$ given $W$ if
\[ \sup_{y_x \in \mathrm{supp}(Y_x \mid W = w)} \big| P(X = 1 \mid Y_x = y_x, W = w) - P(X = 1 \mid W = w) \big| \le c \tag{2} \]
holds for all $w \in \mathcal{W}$.

When $c = 0$, conditional $c$-dependence is equivalent to $X \perp\!\!\!\perp Y_x \mid W$. For $c > 0$, however, we allow for violations of unconfoundedness by allowing the unobserved conditional probability $P(X = 1 \mid Y_x = y_x, W = w)$ to differ from the observed propensity score $P(X = 1 \mid W = w)$ by at most $c$. Thus we actually allow for some selection on unobservables, since treatment assignment may depend on $Y_x$, but in a constrained manner. For sufficiently large $c$, however, conditional $c$-dependence imposes no constraints on the relationship between $Y_x$ and $X$. This happens when $c \ge \overline{C}$ where $\overline{C} = \sup_{w \in \mathcal{W}} \max\{ p_{1|w}, p_{0|w} \}$. When $c \in (0, \overline{C})$, conditional $c$-dependence imposes some constraints on treatment assignment, but it does not require conditional independence to hold exactly. For this reason, we call it a conditional partial independence assumption. Thus our sensitivity analysis replaces unconfoundedness with

Conditional Partial Independence: $X$ is conditionally $c$-dependent with $Y_1$ and $Y_0$ given $W$.
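To make the definition concrete, the following small sketch (ours, for illustration only) computes the band that conditional $c$-dependence places on the unobserved treatment probability $P(X = 1 \mid Y_x = y_x, W = w)$ around a given propensity score.

```python
def c_dependence_band(p1_w, c):
    """Interval of values for P(X = 1 | Y_x = y_x, W = w) permitted by
    conditional c-dependence: [p1_w - c, p1_w + c], clipped to [0, 1]."""
    return max(p1_w - c, 0.0), min(p1_w + c, 1.0)

# With a propensity score of 0.6, c = 0.1 allows the unobserved
# probability to lie anywhere in [0.5, 0.7]; c = 0 collapses the band
# to the single point 0.6, which is unconfoundedness.
print(c_dependence_band(0.6, 0.1))
```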
2.3 Treatment Effect Bounds

By relaxing conditional independence, our main parameters of interest, ATE and ATT, are no longer point identified. Instead they are partially identified: We can bound them from above and from below. As $c$ gets close to zero, however, these bounds collapse to a point. Hence for small $c$ these bounds can be quite narrow. The goal of a sensitivity analysis is to understand how the shape and width of these bounds changes as $c$ varies from 0 to 1.

These bounds were derived in Masten and Poirier (2018), which we summarize here. Although that paper studied both continuous and binary outcomes, here we only summarize the results for continuous $Y_x$. All of our parameters of interest can be written in terms of bounds on the quantile regressions $Q_{Y_x|W}(\tau \mid w)$. Under the conditional partial independence assumption stated above and some regularity conditions, Masten and Poirier (2018) showed that $[\underline{Q}^c_{Y_x|W}(\tau \mid w), \overline{Q}^c_{Y_x|W}(\tau \mid w)]$ are sharp bounds on this quantile regression, uniformly in $\tau$, $x$, and $w$, where
\[ \overline{Q}^c_{Y_x|W}(\tau \mid w) = Q_{Y|X,W}\big( \overline{t}(\tau,x,w) \mid x, w \big) \tag{3} \]
where
\[ \overline{t}(\tau,x,w) = \min\left\{ \tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau}{p_{x|w}},\ 1 \right\} \]
and
\[ \underline{Q}^c_{Y_x|W}(\tau \mid w) = Q_{Y|X,W}\big( \underline{t}(\tau,x,w) \mid x, w \big) \tag{4} \]
where
\[ \underline{t}(\tau,x,w) = \max\left\{ \tau - \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau-1}{p_{x|w}} + 1,\ 0 \right\}. \]
Taking differences of these bounds at $x = 1$ and $x = 0$ yields sharp bounds on the conditional quantile treatment effect $\mathrm{CQTE}(\tau \mid w)$, uniformly in $\tau$ and $w$:
\[ \big[ \underline{\mathrm{CQTE}}^c(\tau \mid w), \overline{\mathrm{CQTE}}^c(\tau \mid w) \big] \equiv \big[ \underline{Q}^c_{Y_1|W}(\tau \mid w) - \overline{Q}^c_{Y_0|W}(\tau \mid w),\ \overline{Q}^c_{Y_1|W}(\tau \mid w) - \underline{Q}^c_{Y_0|W}(\tau \mid w) \big]. \]
Integrating these bounds over $\tau$ yields sharp bounds on $\mathrm{CATE}(w)$, uniformly in $w$:
\[ \big[ \underline{\mathrm{CATE}}^c(w), \overline{\mathrm{CATE}}^c(w) \big] \equiv \left[ \int_0^1 \underline{\mathrm{CQTE}}^c(\tau \mid w)\, d\tau,\ \int_0^1 \overline{\mathrm{CQTE}}^c(\tau \mid w)\, d\tau \right]. \]
Further integrating over the marginal distribution of $W$ yields sharp bounds on ATE:
\[ \big[ \underline{\mathrm{ATE}}^c, \overline{\mathrm{ATE}}^c \big] \equiv \big[ E\big( \underline{\mathrm{CATE}}^c(W) \big),\ E\big( \overline{\mathrm{CATE}}^c(W) \big) \big]. \]
To obtain bounds on ATT, let
\[ \underline{E}^c_x(w) = \int_0^1 \underline{Q}^c_{Y_x}(\tau \mid w)\, d\tau \quad \text{and} \quad \overline{E}^c_x(w) = \int_0^1 \overline{Q}^c_{Y_x}(\tau \mid w)\, d\tau \]
denote bounds on $E(Y_x \mid W = w)$. Averaging these over the marginal distribution of $W$ yields bounds on $E(Y_x)$, denoted by $\underline{E}^c_x = E\big( \underline{E}^c_x(W) \big)$ and $\overline{E}^c_x = E\big( \overline{E}^c_x(W) \big)$. This yields the following bounds on ATT:
\[ \left[ E(Y \mid X=1) - \frac{\overline{E}^c_0 - p_0 E(Y \mid X=0)}{p_1},\ E(Y \mid X=1) - \frac{\underline{E}^c_0 - p_0 E(Y \mid X=0)}{p_1} \right] \tag{5} \]
where $p_x = P(X = x)$ for $x \in \{0,1\}$. Finally, note that all of these bounds are sharp.
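The bound functions above translate directly into code. The sketch below is ours; the conditional quantile function `q_cond` and propensity score `p1_w` are assumed inputs. It implements the maps $\overline{t}$ and $\underline{t}$ from equations (3) and (4) and the resulting CQTE bounds.

```python
def t_upper(tau, c, p):
    """The map t-bar(tau, x, w) in equation (3); p = p_{x|w}."""
    return min(tau + (c / p) * min(tau, 1 - tau), tau / p, 1.0)

def t_lower(tau, c, p):
    """The map t-underline(tau, x, w) in equation (4); p = p_{x|w}."""
    return max(tau - (c / p) * min(tau, 1 - tau), (tau - 1) / p + 1, 0.0)

def cqte_bounds(tau, c, q_cond, p1_w):
    """Sharp CQTE bounds at quantile tau.  q_cond(t, x) evaluates the
    conditional quantile Q_{Y|X,W}(t | x, w); p1_w = P(X = 1 | W = w)."""
    p0_w = 1.0 - p1_w
    lower = q_cond(t_lower(tau, c, p1_w), 1) - q_cond(t_upper(tau, c, p0_w), 0)
    upper = q_cond(t_upper(tau, c, p1_w), 1) - q_cond(t_lower(tau, c, p0_w), 0)
    return lower, upper
```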
2.4 Breakdown Points

So far we've discussed sharp bounds on various parameters of interest as a function of the sensitivity parameter $c$. In addition to the bounds themselves, it is common to analyze breakdown points for various conclusions of interest. For example, suppose that under the baseline model ($c = 0$) we find that $\mathrm{ATE} > 0$. We then ask: How much can we relax unconfoundedness while still being able to conclude that the ATE is nonnegative? To answer this question, define the breakdown point for the conclusion that the ATE is nonnegative as
\[ c_{bp} = \sup\big\{ c \in [0,1] : \big[ \underline{\mathrm{ATE}}^c, \overline{\mathrm{ATE}}^c \big] \subseteq [0, \infty) \big\}. \tag{6} \]
This number is a quantitative measure of the robustness of the conclusion that the ATE is nonnegative to relaxations of the key identifying assumption of unconfoundedness. Breakdown points can be defined for other parameters and conclusions as well. See Masten and Poirier (2020) for more discussion and additional references.

2.5 Interpreting Conditional c-Dependence

We conclude this section by giving some suggestions for how to interpret conditional $c$-dependence in practice. In particular, what values of $c$ are large? What values are small? Here we summarize and extend the discussion on page 321 of Masten and Poirier (2018). We illustrate these interpretations in our empirical analysis in section 7.

Let $W_k$ denote a component of $W$. Denote the propensity score by
\[ p_{1|W}(w_{-k}, w_k) = P(X = 1 \mid W = (w_{-k}, w_k)). \]
Let
\[ p_{1|W_{-k}}(w_{-k}) = P(X = 1 \mid W_{-k} = w_{-k}) \]
denote the leave-out-variable-$k$ propensity score. This is just the proportion of the population who are treated, conditional on only $W_{-k}$. Consider the random variable
\[ \Delta_k = \big| p_{1|W}(W_{-k}, W_k) - p_{1|W_{-k}}(W_{-k}) \big|. \]
This difference is a measure of the impact on the observed propensity score of adding $W_k$, given that we already included $W_{-k}$. Conditional $c$-dependence is defined by a similar difference, except there we add the unobservable $Y_x$ given that we already included $W$. Hence we suggest using the distribution of $\Delta_k$ to calibrate values of $c$. For example, you could examine the 50th, 75th, and 90th quantiles of $\Delta_k$, along with the upper bound on the support, $\bar{c}_k = \max \mathrm{supp}(\Delta_k)$. You may also find it useful to plot an estimate of the density of $\Delta_k$. All of these reference values can be compared to the breakdown point $c_{bp}$ for a specific conclusion of interest. Specifically, if $c_{bp}$ is larger than the chosen reference value, then the conclusion of interest could be considered robust. In contrast, if $c_{bp}$ is smaller than the chosen reference value, then the conclusion of interest could be considered sensitive. You may also want to see where $c_{bp}$ lies relative to the distribution of $\Delta_k$. This can be done by computing $F_{\Delta_k}(c_{bp})$.

While you could do this for all covariates $k$, it may be helpful to restrict attention to covariates $k$ that have a sufficiently large impact on the baseline point estimates. For example, suppose we are interested in the ATE. Let $\mathrm{ATE}_{-k}$ denote the ATE estimand obtained in the baseline selection on observables model using only the covariates $W_{-k}$. Let ATE denote the ATE estimand obtained in the baseline model using all the covariates. Then
\[ \left| \frac{\mathrm{ATE} - \mathrm{ATE}_{-k}}{\mathrm{ATE}} \right| \]
measures the impact of dropping covariate $k$ on the ATE point estimand, as a percentage of the baseline estimand that uses all covariates in $W$. You may want to restrict attention to covariates $k$ for which this ratio is relatively large. We illustrate this approach in our empirical analysis in section 7.
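As an illustration of this calibration exercise, the sketch below (ours; it assumes logit propensity score models, as in section 3, and hypothetical helper names) estimates reference values for $c$ by comparing full and leave-out-variable-$k$ fitted propensity scores.

```python
import numpy as np
import statsmodels.api as sm

def delta_k_summaries(x, w, k, probs=(0.5, 0.75, 0.9)):
    """Reference values for c: quantiles and maximum of the estimated
    Delta_k = |p(W) - p(W_{-k})|, where both propensity scores are fit
    by logit.  x is the 0/1 treatment; w is the n x d_W covariate matrix."""
    n = len(x)
    full = np.column_stack([np.ones(n), w])
    left_out = np.delete(full, 1 + k, axis=1)  # drop covariate k
    p_full = sm.Logit(x, full).fit(disp=0).predict(full)
    p_out = sm.Logit(x, left_out).fit(disp=0).predict(left_out)
    delta = np.abs(p_full - p_out)
    return np.quantile(delta, probs), delta.max()
```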
3 Estimation

In the previous section we assumed the entire population distribution of $(Y, X, W)$ was known. In practice we only have a finite sample $\{(Y_i, X_i, W_i)\}_{i=1}^n$ from this distribution. In this section we explain how to use this finite sample data to estimate the population bounds of section 2. We give the corresponding asymptotic theory in section 4, where we obtain the joint limiting distribution of the treatment effect bounds. We describe how to perform bootstrap based inference on these bounds in section 5.

As shown in section 2, all of our bounds can be constructed from the marginal distribution of $W$ and the bounds on $Q_{Y_x|W}$ given in equations (3) and (4). These bounds on $Q_{Y_x|W}$, in turn, depend on just two features of the data:

1. The conditional quantile function $Q_{Y|X,W}(\tau \mid x, w)$.
2. The propensity score $p_{x|w} = P(X = x \mid W = w)$.

In both cases, we can use parametric, semiparametric, or nonparametric estimation methods. In this paper we focus on flexible parametric approaches. Even in this case the asymptotic distribution theory is non-standard and quite complicated. We discuss this point further in the conclusion, section 8. In section 3.1 we describe our first step estimators of these two functions. Given these estimators, we then construct sample analog estimates of our bound functions in a second step. We describe these estimators in section 3.2.

3.1 First Step Estimators

We estimate $Q_{Y|X,W}$ by a linear quantile regression of $Y$ on flexible functions of $(X, W)$ that we denote by $q(X, W) \in \mathbb{R}^{d_q}$. For example, $q(x, w)$ could be $(1, x, w)$, $(1, x, w, x \cdot w)$, or could contain additional interactions between the treatment indicator $X$ and functions of the covariates $W$. For $\tau \in (0,1)$, let
\[ \widehat{\gamma}(\tau) = \operatorname*{argmin}_{a \in \mathbb{R}^{d_q}} \sum_{i=1}^n \rho_\tau\big( Y_i - a' q(X_i, W_i) \big) \]
be the estimated coefficients from a linear quantile regression of $Y$ on $q(X, W)$ at the quantile $\tau$. Here $\rho_\tau(s) = s(\tau - \mathbb{1}(s < 0))$ is the check function, where $\mathbb{1}(\cdot)$ denotes the indicator function. Let $\widehat{Q}_{Y|X,W}(\tau \mid x, w) = q(x, w)' \widehat{\gamma}(\tau)$ denote this estimator.

We estimate the propensity score by maximum likelihood. In particular, specify the parametric model
\[ P(X = 1 \mid W = w) = F(r(w)'\beta) \]
where $F$ is a known cdf, $r(w)$ is a known vector function, and $\beta$ is an unknown constant vector. The functions $r(w)$ could simply be $(1, w)$ or may contain functions of $w$, like squared or interaction terms. For notational simplicity, we will assume throughout the paper that $r(w) = w$. Given this assumption, the dimension of $\beta$ is $d_W$, the length of $W$. Suppose $\beta$ lies in the parameter space $\mathcal{B} \subseteq \mathbb{R}^{d_W}$.

This specification for the propensity score includes the probit and logit estimators as special cases. Those estimators are commonly used in the literature; for example, see chapter 13 of Imbens and Rubin (2015). Let $\widehat{\beta}$ denote the maximum likelihood estimate of $\beta$:
\[ \widehat{\beta} = \operatorname*{argmax}_{\beta \in \mathcal{B}} \sum_{i=1}^n \log L(X_i, W_i'\beta) \]
where $L(x, w'\beta) = F(w'\beta)^x (1 - F(w'\beta))^{1-x}$. For each $x \in \{0,1\}$, let $\widehat{p}_{x|w} = L(x, w'\widehat{\beta})$ denote our propensity score estimator.
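A minimal sketch of these two first step estimators, assuming a logit propensity score ($F$ logistic) and $q(x, w) = (1, x, w)$; in practice richer specifications can be used, and the function name is an illustrative assumption.

```python
import numpy as np
import statsmodels.api as sm

def first_step(y, x, w, taus):
    """Logit propensity score and linear quantile regression coefficients
    of Y on q(X, W) = (1, X, W) over a grid of quantiles taus."""
    n = len(y)
    wmat = np.column_stack([np.ones(n), w])
    p1_hat = sm.Logit(x, wmat).fit(disp=0).predict(wmat)  # P(X = 1 | W_i)
    qmat = np.column_stack([np.ones(n), x, w])
    gamma_hat = np.column_stack(
        [sm.QuantReg(y, qmat).fit(q=tau).params for tau in taus]
    )  # one column of coefficients per quantile
    return p1_hat, gamma_hat
```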
3.2 Second Step Estimators

Given the first step estimators from section 3.1, we obtain the following sample analog estimators of the quantile bound functions defined in equations (3) and (4):
\[ \widehat{\overline{Q}}^c_{Y_x|W}(\tau \mid w) = \widehat{Q}_{Y|X,W}\big( \widehat{\overline{t}}(\tau,x,w) \mid x, w \big) \quad \text{where} \quad \widehat{\overline{t}}(\tau,x,w) = \min\left\{ \tau + \frac{c}{\widehat{p}_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau}{\widehat{p}_{x|w}},\ 1 \right\} \]
and
\[ \widehat{\underline{Q}}^c_{Y_x|W}(\tau \mid w) = \widehat{Q}_{Y|X,W}\big( \widehat{\underline{t}}(\tau,x,w) \mid x, w \big) \quad \text{where} \quad \widehat{\underline{t}}(\tau,x,w) = \max\left\{ \tau - \frac{c}{\widehat{p}_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau-1}{\widehat{p}_{x|w}} + 1,\ 0 \right\}. \]
As discussed in section 2, averaging these over $\tau \in (0,1)$ yields sample analog estimates of bounds on $\mathrm{CATE}(w)$, which we can then use to get bounds on ATE. This approach, however, requires estimation of extremal quantiles, that is, estimation for $\tau$'s close to 0 or 1. This is well known to be a delicate problem (see Chernozhukov, Fernández-Val, and Kaji 2017 for details). So in this paper we use a common solution: fixed trimming of the extremal quantiles. We do this by modifying the quantile bound estimators to ensure that the quantile index lies in $[\varepsilon, 1-\varepsilon]$ for some fixed and known $\varepsilon \in (0, 1/2)$:
\[ \widehat{\overline{Q}}^c_{Y_x|W}(\tau \mid w) = \widehat{Q}_{Y|X,W}\big( \max\{ \min\{ \widehat{\overline{t}}(\tau,x,w), 1-\varepsilon \}, \varepsilon \} \mid x, w \big) \tag{7} \]
and
\[ \widehat{\underline{Q}}^c_{Y_x|W}(\tau \mid w) = \widehat{Q}_{Y|X,W}\big( \max\{ \min\{ \widehat{\underline{t}}(\tau,x,w), 1-\varepsilon \}, \varepsilon \} \mid x, w \big). \tag{8} \]
We use these estimators for the rest of the paper. Common choices of $\varepsilon$ are 0.05 or 0.01. In our asymptotic analysis we hold $\varepsilon$ fixed with sample size. In principle we could generalize the results to allow $\varepsilon \to 0$ as $n \to \infty$, but this would complicate the analysis of inference, which is already non-standard for other reasons. Since we fix $\varepsilon$ throughout, we omit $\varepsilon$ from the notation for brevity, except when necessary.

We next estimate the CQTE bounds by taking differences of the quantile bound estimators:
\[ \big[ \widehat{\underline{\mathrm{CQTE}}}^c(\tau \mid w), \widehat{\overline{\mathrm{CQTE}}}^c(\tau \mid w) \big] \equiv \big[ \widehat{\underline{Q}}^c_{Y_1|W}(\tau \mid w) - \widehat{\overline{Q}}^c_{Y_0|W}(\tau \mid w),\ \widehat{\overline{Q}}^c_{Y_1|W}(\tau \mid w) - \widehat{\underline{Q}}^c_{Y_0|W}(\tau \mid w) \big]. \]
Since our CATE bounds are simply the integral of the CQTE bounds over all the quantiles $\tau$, we can estimate them by
\[ \big[ \widehat{\underline{\mathrm{CATE}}}^c(w), \widehat{\overline{\mathrm{CATE}}}^c(w) \big] \equiv \left[ \int_0^1 \widehat{\underline{\mathrm{CQTE}}}^c(\tau \mid w)\, d\tau,\ \int_0^1 \widehat{\overline{\mathrm{CQTE}}}^c(\tau \mid w)\, d\tau \right]. \]
A second integration over $w$ with respect to the marginal distribution of $W$ yields bounds on ATE. Like much of the literature, we use the empirical distribution of $W$ to estimate the marginal distribution of $W$. This yields the following estimator of our ATE bounds:
\[ \left[ \widehat{\underline{\mathrm{ATE}}}^c, \widehat{\overline{\mathrm{ATE}}}^c \right] = \left[ \frac{1}{n} \sum_{i=1}^n \widehat{\underline{\mathrm{CATE}}}^c(W_i),\ \frac{1}{n} \sum_{i=1}^n \widehat{\overline{\mathrm{CATE}}}^c(W_i) \right]. \]
Next consider the estimation of the ATT bounds. Let
\[ \widehat{\underline{E}}^c_0 = \frac{1}{n} \sum_{i=1}^n \int_0^1 \widehat{\underline{Q}}^c_{Y_0}(\tau \mid W_i)\, d\tau \quad \text{and} \quad \widehat{\overline{E}}^c_0 = \frac{1}{n} \sum_{i=1}^n \int_0^1 \widehat{\overline{Q}}^c_{Y_0}(\tau \mid W_i)\, d\tau. \]
For $x \in \{0,1\}$ let
\[ \widehat{E}(Y \mid X = x) = \frac{\sum_{i=1}^n Y_i \mathbb{1}(X_i = x)}{\sum_{i=1}^n \mathbb{1}(X_i = x)} \quad \text{and} \quad \widehat{p}_x = \frac{1}{n} \sum_{i=1}^n \mathbb{1}(X_i = x). \]
We can then estimate the ATT bounds by replacing the population quantities in (5) with the estimators we just defined.

For $c = 0$, our estimated upper and lower bounds are equal and give point estimates of the various parameters of interest.
For $c > 0$, our bounds have positive width. To use these bounds in a sensitivity analysis, we recommend producing the following plot: Pick a grid $\{c_1, \ldots, c_K\} \subseteq [0,1]$ of values of $c$. Compute our bound estimates on this grid and plot them against these values of $c$. Then compute and plot confidence bands for these bound estimates against $c$ as well; we describe how to compute these bands in section 5. We illustrate all of these steps in our empirical analysis of section 7.
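To illustrate how the pieces fit together, here is a sketch (ours; `q_hat` and `p1_hat` are assumed to come from the first step, and the integral over $\tau$ is approximated on a finite grid) of the trimmed sample analog ATE bounds at a single value of $c$. Evaluating it over a grid $\{c_1, \ldots, c_K\}$ produces the inputs for the recommended plot.

```python
import numpy as np

def ate_bounds_at_c(c, eps, taus, q_hat, p1_hat):
    """Trimmed sample analog ATE bounds at sensitivity parameter c.

    q_hat(t, x, i) evaluates the estimated quantile Q_{Y|X,W}(t | x, W_i);
    p1_hat[i] is the estimated propensity score of unit i; taus is a grid
    approximating the integral over (0, 1)."""
    clip = lambda t: min(max(t, eps), 1.0 - eps)
    lb_sum, ub_sum = 0.0, 0.0
    for i, p1 in enumerate(p1_hat):
        p0 = 1.0 - p1
        for tau in taus:
            up1 = min(tau + c / p1 * min(tau, 1 - tau), tau / p1, 1.0)
            lo1 = max(tau - c / p1 * min(tau, 1 - tau), (tau - 1) / p1 + 1, 0.0)
            up0 = min(tau + c / p0 * min(tau, 1 - tau), tau / p0, 1.0)
            lo0 = max(tau - c / p0 * min(tau, 1 - tau), (tau - 1) / p0 + 1, 0.0)
            lb_sum += q_hat(clip(lo1), 1, i) - q_hat(clip(up0), 0, i)
            ub_sum += q_hat(clip(up1), 1, i) - q_hat(clip(lo0), 0, i)
    scale = 1.0 / (len(p1_hat) * len(taus))
    return lb_sum * scale, ub_sum * scale
```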
4 Asymptotic Theory

In this section we provide formal results on the consistency and limiting distributions of the estimators we described in section 3. In section 5 we show how to use these results to do inference based on a non-standard bootstrap. In section 6 we provide sufficient conditions under which standard bootstrap inference is valid.

4.1 Convergence of the First Step Estimators

Throughout this paper we assume that we observe a random sample.
Assumption A1 (Random Sample). $\{(Y_i, X_i, W_i)\}_{i=1}^n$ are iid.

Our first step estimators are standard in the literature. Hence we only briefly review the main assumptions and results for these estimators. For completeness, we provide a formal analysis in appendix A. We assume that both the propensity score and quantile regression functions are correctly specified:
\[ P(X = x \mid W = w) = L(x, w'\beta_0) \quad \text{and} \quad Q_{Y|X,W}(\tau \mid x, w) = q(x, w)'\gamma_0(\tau) \]
for all $\tau \in [\varepsilon, 1-\varepsilon]$.

Since the first step estimators consist of linear quantile regression and maximum likelihood estimation, their $\sqrt{n}$-convergence to Gaussian elements can be shown under standard assumptions and arguments. For example, see Newey and McFadden (1994). Moreover, the convergence of $\widehat{\gamma}(\tau)$ to $\gamma_0(\tau)$ is uniform over $\tau \in [\varepsilon, 1-\varepsilon]$. Formally, as we show in appendix A lemma 1,
\[ \sqrt{n} \begin{pmatrix} \widehat{\beta} - \beta_0 \\ \widehat{\gamma}(\tau) - \gamma_0(\tau) \end{pmatrix} \rightsquigarrow Z(\tau), \]
where $Z(\cdot)$ is a mean-zero Gaussian process in $\mathbb{R}^{d_W} \times \ell^\infty([\varepsilon, 1-\varepsilon], \mathbb{R}^{d_q})$ with continuous paths. The covariance kernel of this process is defined in appendix A, equation (11). Also see appendix A for the formal assumptions under which this result holds.

4.2 Convergence of the Second Step Estimators

Next we consider the limiting distribution of our various second step estimators.
The CATE Bounds
We start with equations (7) and (8), our estimators of the conditional quantile bounds. The population conditional quantile bounds, equations (3) and (4), are known functions of $\theta_0 = (\beta_0, \gamma_0)$. Define
\[ \overline{\Gamma}_1(x, w, \tau, \theta) = q(x, w)'\gamma\left( \max\left\{ \min\left\{ \tau + \frac{c}{L(x, w'\beta)} \min\{\tau, 1-\tau\},\ \frac{\tau}{L(x, w'\beta)},\ 1-\varepsilon \right\},\ \varepsilon \right\} \right) \]
\[ \underline{\Gamma}_1(x, w, \tau, \theta) = q(x, w)'\gamma\left( \min\left\{ \max\left\{ \tau - \frac{c}{L(x, w'\beta)} \min\{\tau, 1-\tau\},\ \frac{\tau-1}{L(x, w'\beta)} + 1,\ \varepsilon \right\},\ 1-\varepsilon \right\} \right). \]
Throughout the paper, we let $\Gamma_j = (\underline{\Gamma}_j, \overline{\Gamma}_j)$ for $j \ge 1$. Evaluating these at $\theta_0$ gives the trimmed population conditional quantile bounds. Evaluating these at $\widehat{\theta} = (\widehat{\beta}, \widehat{\gamma})$ gives their sample analog estimators. Define
\[ \overline{\Gamma}_2(x, w, \theta) = \int_0^1 \overline{\Gamma}_1(x, w, \tau, \theta)\, d\tau \quad \text{and} \quad \underline{\Gamma}_2(x, w, \theta) = \int_0^1 \underline{\Gamma}_1(x, w, \tau, \theta)\, d\tau. \]
Then
\[ \big[ \underline{\mathrm{CATE}}^c_\varepsilon(w), \overline{\mathrm{CATE}}^c_\varepsilon(w) \big] = \big[ \underline{\Gamma}_2(1, w, \theta_0) - \overline{\Gamma}_2(0, w, \theta_0),\ \overline{\Gamma}_2(1, w, \theta_0) - \underline{\Gamma}_2(0, w, \theta_0) \big] \]
are the trimmed population CATE bounds. We estimate them by
\[ \big[ \widehat{\underline{\mathrm{CATE}}}^c(w), \widehat{\overline{\mathrm{CATE}}}^c(w) \big] \equiv \big[ \underline{\Gamma}_2(1, w, \widehat{\theta}) - \overline{\Gamma}_2(0, w, \widehat{\theta}),\ \overline{\Gamma}_2(1, w, \widehat{\theta}) - \underline{\Gamma}_2(0, w, \widehat{\theta}) \big]. \]
If these mappings were Hadamard differentiable in $\theta$ at $\theta_0$, we could use the functional delta method to show that the above estimators have limiting Gaussian distributions and converge at $\sqrt{n}$ rates. Because they depend on the min and max functions, these mappings are not Hadamard differentiable. They are, however, Hadamard directionally differentiable (HDD); see definition 2 in appendix B. It turns out that this weaker version of differentiability is sufficient to establish their (non-Gaussian) limiting distribution.

To formally derive the limiting distribution of the CATE estimators, we show that the mapping
\[ \Gamma_2(x, w, \cdot) : \mathbb{R}^{d_W} \times \ell^\infty([\varepsilon, 1-\varepsilon], \mathbb{R}^{d_q}) \to \mathbb{R}^2 \]
is Hadamard directionally differentiable at $\theta_0$ tangentially to $\mathbb{R}^{d_W} \times C([\varepsilon, 1-\varepsilon], \mathbb{R}^{d_q})$. Here $C(A, B)$ is the set of continuous functions from $A$ to $B$.

As a technical assumption, we restrict the complexity of the space that the quantile regression coefficient $\gamma_0(\cdot)$ lives in. Specifically, we assume that it is in a Hölder ball. To define this parameter space precisely, let $C^m(D)$ denote the set of $m$-times continuously differentiable functions $f : D \to \mathbb{R}$, where $m$ is an integer and $D$ is an open subset of $\mathbb{R}^d$. Denote the differential operator by
\[ \nabla^\lambda = \frac{\partial^{|\lambda|}}{\partial x_1^{\lambda_1} \cdots \partial x_d^{\lambda_d}} \]
where $\lambda = (\lambda_1, \ldots, \lambda_d)$ is a $d$-tuple of nonnegative integers and $|\lambda| = \lambda_1 + \cdots + \lambda_d$. Let $\nu \in (0, 1]$ and let $\|\cdot\|$ without any subscripts denote the Euclidean norm. Define the Hölder norm of $f : D \to \mathbb{R}$ by
\[ \|f\|_{m, \infty, \nu} = \max_{|\lambda| \le m} \sup_{x \in \mathrm{int}(D)} |\nabla^\lambda f(x)| + \max_{|\lambda| = m} \sup_{x, y \in \mathrm{int}(D),\ x \ne y} \frac{|\nabla^\lambda f(x) - \nabla^\lambda f(y)|}{\|x - y\|^\nu}. \]
For any $B > 0$, let $C^B_{m,\nu}(D) = \{ f \in C^m(D) : \|f\|_{m, \infty, \nu} \le B \}$ denote a Hölder ball.

Assumption A2 (Quantile regression regularity). Let $m \ge 1$ be an integer, let $\nu \in (0, 1]$, and let $B > 0$. Then $\gamma_0 \in \mathcal{G}$ where $\mathcal{G} \subseteq C^B_{m,\nu}([\varepsilon_{\mathrm{small}}, 1-\varepsilon_{\mathrm{small}}])^{d_q}$ for some $\varepsilon_{\mathrm{small}} \in (0, \varepsilon)$.

In this assumption we require $m \ge 1$, and hence that $\gamma_0(\cdot)$ is at least continuously differentiable. In appendix A we state several additional standard regularity conditions that we use to obtain asymptotic normality of the first step estimators; see assumptions A4–A6 starting on page 37. We continue to maintain these assumptions here. As a first preliminary result, we use these assumptions to derive the limiting distribution of the CQTE bound estimators; see proposition 5 in appendix B. Using that result, we can then derive the limiting distribution of the CATE bound estimators.

Proposition 1 (CATE convergence). Fix $w \in \mathcal{W}$. Suppose A1, A2, and A4–A6 hold. Fix $c \in [0,1]$. Then
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{\mathrm{CATE}}}^c(w) - \underline{\mathrm{CATE}}^c_\varepsilon(w) \\ \widehat{\overline{\mathrm{CATE}}}^c(w) - \overline{\mathrm{CATE}}^c_\varepsilon(w) \end{pmatrix} \xrightarrow{d} Z_{\mathrm{CATE}}(w), \]
where $Z_{\mathrm{CATE}}(w)$ is a random vector in $\mathbb{R}^2$ whose distribution is characterized in the proof.

In the statement of this result we deferred the full characterization of $Z_{\mathrm{CATE}}(w)$ to the proof. To get a brief idea of what it looks like, however, consider the first component. It is
\[ Z^{(1)}_{\mathrm{CATE}}(w) = \underline{\Gamma}'_{2,\theta_0}(1, w, Z) - \overline{\Gamma}'_{2,\theta_0}(0, w, Z) \]
where $\underline{\Gamma}'_{2,\theta_0}(x, w, Z)$ is the Hadamard directional derivative of $\underline{\Gamma}_2$ evaluated at $Z$, the limiting distribution of the first step estimators. See page 49 for the expression for $\underline{\Gamma}'_{2,\theta_0}$. Likewise, $\overline{\Gamma}'_{2,\theta_0}(x, w, Z)$ is the Hadamard directional derivative of $\overline{\Gamma}_2$ evaluated at $Z$. Although $Z$ is Gaussian, the HDDs are continuous but generally nonlinear functionals. Hence the distribution of $Z_{\mathrm{CATE}}(w)$ is non-Gaussian. In section 5 we show how to use a non-standard bootstrap to approximate its distribution.

The ATE Bounds

Next we derive the limiting distribution of our ATE bound estimators. Let
\[ \overline{\Gamma}_3(x, \theta) = \int_{\mathcal{W}} \overline{\Gamma}_2(x, w, \theta)\, dF_W(w) \quad \text{and} \quad \underline{\Gamma}_3(x, \theta) = \int_{\mathcal{W}} \underline{\Gamma}_2(x, w, \theta)\, dF_W(w). \]
Then
\[ [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] = \big[ \underline{\Gamma}_3(1, \theta_0) - \overline{\Gamma}_3(0, \theta_0),\ \overline{\Gamma}_3(1, \theta_0) - \underline{\Gamma}_3(0, \theta_0) \big] \]
are the trimmed population ATE bounds. We estimate them by
\[ \left[ \widehat{\underline{\mathrm{ATE}}}^c, \widehat{\overline{\mathrm{ATE}}}^c \right] \equiv \left[ \frac{1}{n} \sum_{i=1}^n \big( \underline{\Gamma}_2(1, W_i, \widehat{\theta}) - \overline{\Gamma}_2(0, W_i, \widehat{\theta}) \big),\ \frac{1}{n} \sum_{i=1}^n \big( \overline{\Gamma}_2(1, W_i, \widehat{\theta}) - \underline{\Gamma}_2(0, W_i, \widehat{\theta}) \big) \right]. \]
Unlike $\Gamma_2$, the $\Gamma_3$ mapping depends on $F_W$, which is unknown. Here we estimate it by the empirical distribution of $W$.

Next, let $\delta > 0$, define $\mathcal{B}_\delta = \{ \beta \in \mathcal{B} : \|\beta - \beta_0\| \le \delta \}$, and let
\[ L_\beta(x, w'\beta) = \frac{\partial}{\partial \beta} L(x, w'\beta). \]
The following assumption bounds the ratio of the derivative of the propensity score with respect to the parameter $\beta$ to the squared propensity score. This assumption holds under common parametric specifications for the propensity score, like logit or probit. It also holds if strong overlap holds. Moreover, under our other assumptions, note that strong overlap holds when $W$ has finite support.
Assumption A3. There is a $\delta > 0$ such that
\[ E\left( \sup_{\beta \in \mathcal{B}_\delta} \left\| \frac{L_\beta(x, W'\beta)}{L(x, W'\beta)^2} \right\| \right) < \infty \]
for each $x \in \{0,1\}$.

Under these assumptions, we show the following result.

Theorem 1 (ATE convergence). Suppose A1–A6 hold. Then
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{\mathrm{ATE}}}^c - \underline{\mathrm{ATE}}^c_\varepsilon \\ \widehat{\overline{\mathrm{ATE}}}^c - \overline{\mathrm{ATE}}^c_\varepsilon \end{pmatrix} \xrightarrow{d} Z_{\mathrm{ATE}}, \]
where $Z_{\mathrm{ATE}}$ is a random vector in $\mathbb{R}^2$ whose distribution is characterized in the proof.

Like the CATE bound estimators, the limiting distribution of the ATE bound estimators is non-Gaussian. To understand this limiting distribution, first recall that we denote our bounds on the means $E(Y_x)$ by
\[ \underline{E}^c_{x,\varepsilon} = \underline{\Gamma}_3(x, \theta_0) = E[\underline{\Gamma}_2(x, W, \theta_0)] \quad \text{and} \quad \overline{E}^c_{x,\varepsilon} = \overline{\Gamma}_3(x, \theta_0) = E[\overline{\Gamma}_2(x, W, \theta_0)]. \]
We estimate them by
\[ \widehat{\underline{E}}^c_x = \frac{1}{n} \sum_{i=1}^n \underline{\Gamma}_2(x, W_i, \widehat{\theta}) \quad \text{and} \quad \widehat{\overline{E}}^c_x = \frac{1}{n} \sum_{i=1}^n \overline{\Gamma}_2(x, W_i, \widehat{\theta}). \]
In the proof of theorem 1, we show that the following asymptotic expansion holds:
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{E}}^c_x - \underline{E}^c_{x,\varepsilon} \\ \widehat{\overline{E}}^c_x - \overline{E}^c_{x,\varepsilon} \end{pmatrix} = \Gamma'_{3,\theta_0}\big( x, \sqrt{n}(\widehat{\theta} - \theta_0) \big) + \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \Gamma_2(x, W_i, \theta_0) - E[\Gamma_2(x, W, \theta_0)] \big) + o_p(1), \tag{9} \]
where $\Gamma'_{3,\theta_0}$ is the Hadamard directional derivative of $\Gamma_3$, which we define in the proof of theorem 1.

The first term in this expansion comes from the sample variation in the first step estimators: the propensity score $\widehat{p}_{x|w} = L(x, w'\widehat{\beta})$ and the quantile function $\widehat{Q}_{Y|X,W}(\tau \mid x, w) = q(x, w)'\widehat{\gamma}(\tau)$. The functional $\Gamma'_{3,\theta_0}(x, \cdot)$ is nonlinear in $\widehat{\beta}$. Therefore, since $\sqrt{n}(\widehat{\beta} - \beta_0)$ converges in distribution to a Gaussian limit, the limiting distribution of this functional is non-Gaussian. If $\beta_0$, and hence the propensity score, were known, then this component would follow a Gaussian distribution, since the remaining component $\widehat{\gamma}$ is asymptotically Gaussian and enters $\Gamma'_{3,\theta_0}(x, \cdot)$ linearly.

The second term in this expansion comes from the variation of the CATE bounds over the values of the covariates $W$. It follows a limiting Gaussian distribution by the central limit theorem. This term is asymptotically independent of the sampling variation in the first step estimators $\widehat{\theta}$, since the influence function of $\widehat{\theta}$ is mean independent of $W$. Overall, we see that the limiting distribution of the ATE bounds is the sum of two independent random vectors, one Gaussian and one non-Gaussian. We approximate the distribution of these two random vectors using two separate bootstraps in section 5.
The ATT Bounds

Finally we study the limiting properties of our ATT bound estimators. Our trimmed population ATT bounds are
\[ [\underline{\mathrm{ATT}}^c_\varepsilon, \overline{\mathrm{ATT}}^c_\varepsilon] = \left[ E(Y \mid X=1) - \frac{\overline{E}^c_{0,\varepsilon} - p_0 E(Y \mid X=0)}{p_1},\ E(Y \mid X=1) - \frac{\underline{E}^c_{0,\varepsilon} - p_0 E(Y \mid X=0)}{p_1} \right]. \]
We estimate them by
\[ [\widehat{\underline{\mathrm{ATT}}}^c, \widehat{\overline{\mathrm{ATT}}}^c] = \left[ \widehat{E}(Y \mid X=1) - \frac{\widehat{\overline{E}}^c_0 - \widehat{p}_0 \widehat{E}(Y \mid X=0)}{\widehat{p}_1},\ \widehat{E}(Y \mid X=1) - \frac{\widehat{\underline{E}}^c_0 - \widehat{p}_0 \widehat{E}(Y \mid X=0)}{\widehat{p}_1} \right], \]
where $\widehat{E}(Y \mid X = x)$ and $\widehat{p}_x$ are defined in section 3.

Proposition 2 (ATT convergence). Suppose the assumptions of theorem 1 hold. Suppose further that $\mathrm{var}(Y \mathbb{1}(X = x)) < \infty$ for each $x \in \{0,1\}$. Then
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{\mathrm{ATT}}}^c - \underline{\mathrm{ATT}}^c_\varepsilon \\ \widehat{\overline{\mathrm{ATT}}}^c - \overline{\mathrm{ATT}}^c_\varepsilon \end{pmatrix} \xrightarrow{d} Z_{\mathrm{ATT}}, \]
where $Z_{\mathrm{ATT}}$ is a random vector in $\mathbb{R}^2$ whose distribution is characterized in the proof.

The estimators $\widehat{E}(Y \mid X = x)$ and $\widehat{p}_x$ are asymptotically Gaussian. Like our analysis of the ATE bounds, however, $\widehat{\underline{E}}^c_0$ and $\widehat{\overline{E}}^c_0$ have non-Gaussian asymptotic distributions. Overall, $Z_{\mathrm{ATT}}$, the asymptotic distribution of our ATT bound estimators, is a linear combination of Gaussian and non-Gaussian random variables.
5 Bootstrap Inference

We now show how to conduct inference on our bounds for CATE, ATE, and ATT. Earlier we noted that these bounds are generally not ordinary Hadamard differentiable mappings of the underlying parameters $\theta_0$. By corollary 3.1 in Fang and Santos (2019), this implies that standard bootstrap approaches cannot be used for these bounds. We instead use the non-standard bootstrap approach developed by Fang and Santos (2019). For brevity we focus on ATE and ATT in this section. We provide analogous results for CQTE and CATE in lemmas 6 and 7 in appendix E.1.

5.1 Bounds on Mean Potential Outcomes

The bounds for ATE and ATT can be written in terms of bounds on $E(Y_x)$. In this section we describe how to do inference on bounds for these means. We'll then use these results to do inference on our ATE and ATT bounds in the next subsection. Recall that our bounds on $E(Y_x)$ can be written as a functional of $\theta_0$. This functional is Hadamard directionally differentiable in $\theta$, but it is generally not ordinary Hadamard differentiable. Theorem 3.1 of Fang and Santos (2019) shows how to do bootstrap inference by consistently estimating the Hadamard directional derivative (HDD). This can be done by using analytical estimators or by using a numerical derivative as described in Hong and Li (2018). Here we use analytical estimates of the HDD. This approach explicitly uses the functional form of the HDD to estimate it. It allows us to avoid picking the numerical derivative step size, although other tuning parameters are used to estimate the HDDs analytically.

Setup
Next we define some general notation. Let $Z_i = (Y_i, X_i, W_i)$ and $Z^n = \{Z_1, \ldots, Z_n\}$. Let $\vartheta$ denote some parameter of interest and let $\widehat{\vartheta}$ be an estimator of $\vartheta$ based on the data $Z^n$. Let $A^*_n$ denote $\sqrt{n}(\widehat{\vartheta}^* - \widehat{\vartheta})$ where $\widehat{\vartheta}^*$ is a draw from the nonparametric bootstrap distribution of $\widehat{\vartheta}$. Suppose $A$ is the tight limiting process of $\sqrt{n}(\widehat{\vartheta} - \vartheta)$. Denote bootstrap consistency by $A^*_n \overset{P}{\rightsquigarrow} A$, where $\overset{P}{\rightsquigarrow}$ denotes weak convergence in probability, conditional on the data $Z^n$. Weak convergence in probability conditional on $Z^n$ is defined as
\[ \sup_{h \in \mathrm{BL}_1} \big| E[h(A^*_n) \mid Z^n] - E[h(A)] \big| = o_p(1) \]
where $\mathrm{BL}_1$ denotes the set of Lipschitz functions into $\mathbb{R}$ with Lipschitz constant no greater than 1. We leave the domain of these functions and its associated norm implicit.

We focus on the choices $\vartheta = \theta_0$ and $\widehat{\vartheta} = \widehat{\theta}$. For these choices, let $Z^*_n = \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})$. Let $Z$ denote the limiting distribution of $\sqrt{n}(\widehat{\theta} - \theta_0)$; see lemma 1 in appendix A. Theorem 3.6.1 of van der Vaart and Wellner (1996) implies that $Z^*_n \overset{P}{\rightsquigarrow} Z$. Our parameters of interest are all functionals $\Gamma$ of $\theta_0$. In particular, in section 4 we showed that
\[ \sqrt{n}\big( \Gamma(\widehat{\theta}) - \Gamma(\theta_0) \big) \rightsquigarrow \Gamma'_{\theta_0}(Z) \]
for a variety of functionals $\Gamma$. To do inference on these functionals, we therefore want to estimate the distribution of $\Gamma'_{\theta_0}(Z)$. Fang and Santos (2019) show that
\[ \widehat{\Gamma}'_{\theta_0}(Z^*_n) \overset{P}{\rightsquigarrow} \Gamma'_{\theta_0}(Z) \]
where $\widehat{\Gamma}'_{\theta_0}$ is a suitable estimator of the Hadamard directional derivative $\Gamma'_{\theta_0}$. In this section we construct the estimators $\widehat{\Gamma}'_{\theta_0}$ and show that they can be used in this bootstrap.

Main Result
Next, recall the asymptotic expansion in equation (9) on page 17. As we will show, the second term in this expansion can be approximated using standard bootstrap approaches, replacing $\theta_0$ by $\widehat{\theta}$. The first term requires estimating the HDDs $\underline{\Gamma}'_{3,\theta_0}$ and $\overline{\Gamma}'_{3,\theta_0}$. The formulas for our estimators of these HDDs are long, and so we describe them in appendix C. Denote these estimators by $\widehat{\underline{\Gamma}}'_{3,\theta_0}$ and $\widehat{\overline{\Gamma}}'_{3,\theta_0}$. They require choosing two scalar tuning parameters, $\kappa_n$ and $\eta_n$: $\kappa_n$ is a slackness parameter and $\eta_n$ is a step size parameter used to compute numerical derivatives of $\widehat{\gamma}(\cdot)$. Although not used in our proof, the asymptotic independence of the two components implies that approximating their respective marginal distributions is sufficient to obtain their joint distribution.

As we just mentioned, we'll use the standard nonparametric bootstrap to approximate the second term of equation (9). To formalize this, let $\mathbb{G}^*_n$ denote the nonparametric bootstrap empirical process:
\[ \mathbb{G}^*_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n (M_{n,i} - 1)\, \delta_{Z_i} \]
where $(M_{n,1}, \ldots, M_{n,n})$ is multinomially distributed with parameters $n$ and $(1/n, \ldots, 1/n)$, independently of $Z^n$, and where $\delta_{Z_i}$ is the distribution which assigns probability one to the value $Z_i$. Then for any function $g$,
\[ \mathbb{G}^*_n g(Z) = \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n g(Z^*_i) - \overline{g(Z)} \right) \]
where $Z^*_i$, $i = 1, \ldots, n$, are drawn independently with replacement from $\{Z_1, \ldots, Z_n\}$ and $\overline{g(Z)} = \frac{1}{n} \sum_{i=1}^n g(Z_i)$. In particular, we'll study the asymptotic distribution of
\[ \mathbb{G}^*_n \Gamma_2(x, W, \widehat{\theta}) = \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n \Gamma_2(x, W^*_i, \widehat{\theta}) - \frac{1}{n} \sum_{i=1}^n \Gamma_2(x, W_i, \widehat{\theta}) \right). \]
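For concreteness, one draw of $\mathbb{G}^*_n g$ can be computed with multinomial resampling weights, as in this sketch (ours; the function name is illustrative).

```python
import numpy as np

def bootstrap_process_draw(g_values, rng):
    """One draw of the bootstrap empirical process applied to g:
    sqrt(n) times (bootstrap mean of g minus sample mean of g),
    using multinomial resampling weights M_{n,i}."""
    n = len(g_values)
    weights = rng.multinomial(n, np.full(n, 1.0 / n))
    return np.sqrt(n) * (weights @ g_values / n - g_values.mean())

rng = np.random.default_rng(0)
draw = bootstrap_process_draw(rng.standard_normal(500), rng)
```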
The following proposition is our main bootstrap consistency result.

Proposition 3 (Analytical Bootstrap for Mean Potential Outcomes). Suppose the assumptions of theorem 1 hold. Let $\kappa_n \to 0$, $n\kappa_n \to \infty$, $\eta_n \to 0$, and $n\eta_n \to \infty$ as $n \to \infty$. Then
\[ \widehat{\Gamma}'_{3,\theta_0}\big( x, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta}) \big) + \mathbb{G}^*_n \Gamma_2(x, W, \widehat{\theta}) \overset{P}{\rightsquigarrow} Z(x), \]
where $Z(\cdot)$ is the limiting process of the expression given in equation (9), as characterized in the proof.

This result shows how to use the bootstrap to approximate the joint limiting distribution of the upper and lower bounds of $E(Y_x)$, $x \in \{0,1\}$. As we show in section 5.2 below, these approximations can be used to conduct pointwise or uniform-in-$c$ inference on the ATE bounds. As part of the proof, we show that $\widehat{\Gamma}'_{3,\theta_0}(x, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta}))$ weakly converges in probability conditional on the data to $\Gamma'_{3,\theta_0}(x, Z)$, a non-Gaussian vector which reflects the sample variation in the first step estimators. We also show weak convergence in probability conditional on the data of $\mathbb{G}^*_n \Gamma_2(x, W, \widehat{\theta})$ to $\mathbb{G} \Gamma_2(x, W, \theta_0) \sim N(0, \mathrm{var}(\Gamma_2(x, W, \theta_0)))$, a bivariate Gaussian vector which reflects the variation of the CATE bounds over $W$. This variation can be approximated using the standard nonparametric bootstrap. Hence the bounds' limiting distribution is approximated by a combination of standard and non-standard bootstraps. Note that the two bootstrap distributions can be computed from a single sequence of draws $Z^*_i$, and so this has the same computational burden as a single bootstrap.

5.2 Inference on the ATE Bounds

Next we show how to use proposition 3 to do inference on our ATE bounds $[\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon]$. We first consider inference pointwise in $c$. We then construct confidence bands that are uniform over $c$.
5.2.1 Pointwise in c Confidence Sets

An immediate corollary of proposition 3 is
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{\mathrm{ATE}}}^{c,*} - \widehat{\underline{\mathrm{ATE}}}^c \\ \widehat{\overline{\mathrm{ATE}}}^{c,*} - \widehat{\overline{\mathrm{ATE}}}^c \end{pmatrix} = \begin{pmatrix} \big( \widehat{\underline{\Gamma}}'_{3,\theta_0}(1, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \underline{\Gamma}_2(1, W, \widehat{\theta}) \big) - \big( \widehat{\overline{\Gamma}}'_{3,\theta_0}(0, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \overline{\Gamma}_2(0, W, \widehat{\theta}) \big) \\ \big( \widehat{\overline{\Gamma}}'_{3,\theta_0}(1, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \overline{\Gamma}_2(1, W, \widehat{\theta}) \big) - \big( \widehat{\underline{\Gamma}}'_{3,\theta_0}(0, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \underline{\Gamma}_2(0, W, \widehat{\theta}) \big) \end{pmatrix} \overset{P}{\rightsquigarrow} Z_{\mathrm{ATE}}. \tag{10} \]
Thus we can also use this specific bootstrap to approximate the asymptotic distribution of our ATE bound estimators. Given this result, we can construct a $100(1-\alpha)\%$ confidence set for the ATE identified set under $c$-dependence as follows. Let
\[ \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) = \left[ \widehat{\underline{\mathrm{ATE}}}^c - \frac{\widehat{d}_\alpha}{\sqrt{n}},\ \widehat{\overline{\mathrm{ATE}}}^c + \frac{\widehat{d}_\alpha}{\sqrt{n}} \right] \]
where
\[ \widehat{d}_\alpha = \inf\left\{ z \in \mathbb{R} : P\left( \sqrt{n}\big( \widehat{\underline{\mathrm{ATE}}}^{c,*} - \widehat{\underline{\mathrm{ATE}}}^c \big) \le z \text{ and } \sqrt{n}\big( \widehat{\overline{\mathrm{ATE}}}^{c,*} - \widehat{\overline{\mathrm{ATE}}}^c \big) \ge -z \;\middle|\; Z^n \right) \ge 1-\alpha \right\}. \]
The probability in this expression can be approximated by taking a large number of bootstrap draws according to equation (10). Proposition 3 then implies that
\[ \liminf_{n \to \infty} P\big( \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \big) \ge 1-\alpha. \]
Let
\[ d_\alpha = \inf\big\{ z \in \mathbb{R} : P\big( \underline{Z}_{\mathrm{ATE}} \le z \text{ and } \overline{Z}_{\mathrm{ATE}} \ge -z \big) \ge 1-\alpha \big\}, \]
where $\underline{Z}_{\mathrm{ATE}}$ and $\overline{Z}_{\mathrm{ATE}}$ denote the components of $Z_{\mathrm{ATE}}$ corresponding to the lower and upper bound estimators. If $P(\underline{Z}_{\mathrm{ATE}} \le z \text{ and } \overline{Z}_{\mathrm{ATE}} \ge -z)$ is continuous and strictly increasing in a neighborhood of $d_\alpha$, corollary 3.2 in Fang and Santos (2019) yields $\widehat{d}_\alpha = d_\alpha + o_p(1)$ and hence
\[ \lim_{n \to \infty} P\big( \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \big) = 1-\alpha. \]
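Given bootstrap draws of the two bound estimators, $\widehat{d}_\alpha$ has a simple closed form, since the event in its definition is equivalent to $z \ge \max\{\sqrt{n}(\widehat{\underline{\mathrm{ATE}}}^{c,*} - \widehat{\underline{\mathrm{ATE}}}^c),\ -\sqrt{n}(\widehat{\overline{\mathrm{ATE}}}^{c,*} - \widehat{\overline{\mathrm{ATE}}}^c)\}$. A sketch (ours, with assumed input names):

```python
import numpy as np

def critical_value(lb_star, ub_star, lb_hat, ub_hat, n, alpha=0.05):
    """Bootstrap critical value for the pointwise-in-c ATE confidence set.

    lb_star, ub_star: arrays of bootstrap draws of the lower and upper
    bound estimators; lb_hat, ub_hat: the original point estimates."""
    dl = np.sqrt(n) * (np.asarray(lb_star) - lb_hat)
    du = np.sqrt(n) * (np.asarray(ub_star) - ub_hat)
    # smallest z with empirical frequency of {dl <= z and du >= -z}
    # at least 1 - alpha, i.e. the (1 - alpha) quantile of max(dl, -du)
    return np.quantile(np.maximum(dl, -du), 1 - alpha)
```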
5.2.2 Uniform in c ATE Bands

We just described how to use proposition 3 to do inference on the ATE bounds for any fixed $c$. Those results can be immediately extended to do inference on the ATE bounds for any finite grid of $c$'s. In this section we show how to construct confidence bands that are uniform over all $c \in [0,1]$. The key is the monotonicity of the bounds in $c$. This lets us extrapolate bands that are uniform on a finite grid in such a way that they have uniform coverage. A related procedure is described in corollary 1 of Masten and Poirier (2020).

Although $\overline{\mathrm{ATE}}^c_\varepsilon$ is nondecreasing in $c$, its estimate $\widehat{\overline{\mathrm{ATE}}}^c$ may be nonmonotonic in $c$ because of the quantile crossing problem with linear quantile regression. In that case, we could monotonize the estimated ATE bound function by using the rearrangement procedure of Chernozhukov, Fernández-Val, and Galichon (2010), for example. As they show, the rearrangement operator is Hadamard directionally differentiable, and thus can be accommodated in our inferential results. Likewise, $\underline{\mathrm{ATE}}^c_\varepsilon$ is nonincreasing in $c$ and $\widehat{\underline{\mathrm{ATE}}}^c$ is also nonincreasing after applying a suitable rearrangement. From here on we assume our bound estimators have been monotonized. Note that this does not affect the asymptotic distribution of the estimators, by corollary 1 of Chernozhukov et al. (2010).

Next consider a grid of values $\mathcal{C} = \{c_1, \ldots, c_K\}$ such that $0 = c_1 < \cdots < c_K = 1$. All of our analysis works with $0 < c_1$ and $c_K < 1$, but typically researchers will want to include the endpoints of $[0,1]$, so we do that from here on. Using methods similar to those in section 5.2.1, for all $c \in \mathcal{C}$ let
\[ \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) = \left[ \widehat{\underline{\mathrm{ATE}}}^c - \frac{\widehat{d}_\alpha(c)}{\sqrt{n}},\ \widehat{\overline{\mathrm{ATE}}}^c + \frac{\widehat{d}_\alpha(c)}{\sqrt{n}} \right] \]
be $100(1-\alpha)\%$ confidence sets for $[\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon]$ where the critical values $\widehat{d}_\alpha(c)$ are chosen such that these sets are uniform over $c$ in the finite grid $\mathcal{C}$. That is,
\[ P\big( \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \text{ for all } c \in \mathcal{C} \big) \to 1-\alpha \]
as $n \to \infty$. Finally, let $\mathrm{cmin}(c) = \inf\{ c_k \in \mathcal{C} : c \le c_k \}$ denote the smallest element in the grid $\mathcal{C}$ that is at least as large as $c$. For $c \in [0,1]$ define
\[ \widehat{\mathrm{UB}}(c) = \widehat{\overline{\mathrm{ATE}}}^{\mathrm{cmin}(c)} + \frac{\widehat{d}_\alpha(\mathrm{cmin}(c))}{\sqrt{n}} \quad \text{and} \quad \widehat{\mathrm{LB}}(c) = \widehat{\underline{\mathrm{ATE}}}^{\mathrm{cmin}(c)} - \frac{\widehat{d}_\alpha(\mathrm{cmin}(c))}{\sqrt{n}}. \]
$\widehat{\mathrm{UB}}(c)$ is the greatest monotonic interpolation of the upper bounds of the confidence intervals on the grid $\mathcal{C}$. $\widehat{\mathrm{LB}}(c)$ is the least monotonic interpolation of the lower bounds of the confidence intervals on the grid $\mathcal{C}$. By the definition of these interpolated bands and by monotonicity of the population ATE bounds,
\[ P\big( [\widehat{\mathrm{LB}}(c), \widehat{\mathrm{UB}}(c)] \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \text{ for all } c \in [0,1] \big) = P\big( [\widehat{\mathrm{LB}}(c), \widehat{\mathrm{UB}}(c)] \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \text{ for all } c \in \mathcal{C} \big) \to 1-\alpha \]
as $n \to \infty$.

In this subsection we've shown that, although we cannot obtain the limiting distribution of the ATE bounds uniformly over $c \in [0,1]$, we can nonetheless construct uniform confidence bands by using the monotonicity of conditional $c$-dependence: conditional $c_1$-dependence implies conditional $c_2$-dependence when $c_1 \le c_2$. This kind of monotonicity is common in many other approaches to sensitivity analysis, and hence greatest and least monotonic interpolations can likely be used more broadly to construct uniform confidence bands.

5.3 Inference on the ATT Bounds

Bootstrap inference on the ATT bounds is quite similar to that on the ATE bounds. By examining the ATT bounds' limiting distribution (see the proof of proposition 2) we see that it depends on two types of terms:

1. One term comes from the limiting distribution of
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{E}}^c_0 - \underline{E}^c_{0,\varepsilon} \\ \widehat{\overline{E}}^c_0 - \overline{E}^c_{0,\varepsilon} \end{pmatrix}, \]
which is non-standard. We approximate the distribution of this term by using the non-standard bootstrap of proposition 3.

2. The other terms are due to the limiting distributions of
\[ \sqrt{n} \begin{pmatrix} \widehat{E}(Y \mid X = x) - E(Y \mid X = x) \\ \widehat{p}_x - p_x \end{pmatrix}, \]
which are standard and Gaussian. The distribution of these terms can be approximated by the nonparametric bootstrap. For example, standard arguments show that the limiting distribution of $\sqrt{n}(\widehat{E}(Y \mid X = x) - E(Y \mid X = x))$ is approximated by
\[ Z^*_{E(Y|X=x)} \equiv \frac{1}{\widehat{p}_x} \Big( \mathbb{G}^*_n Y \mathbb{1}(X = x) - \widehat{E}(Y \mid X = x) \cdot \mathbb{G}^*_n \mathbb{1}(X = x) \Big). \]
Similarly, $Z^*_{p_x} \equiv \mathbb{G}^*_n \mathbb{1}(X = x)$ converges weakly in probability conditional on $Z^n$ to the limiting distribution of $\sqrt{n}(\widehat{p}_x - p_x)$.

Combining all the terms gives
\[ \begin{pmatrix} Z^*_{E(Y|X=1)} - \dfrac{\widehat{\overline{\Gamma}}'_{3,\theta_0}(0, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \overline{\Gamma}_2(0, W, \widehat{\theta})}{\widehat{p}_1} + \dfrac{\widehat{p}_0}{\widehat{p}_1} Z^*_{E(Y|X=0)} + \dfrac{\widehat{E}(Y \mid X=0)}{\widehat{p}_1} Z^*_{p_0} + \dfrac{\widehat{\overline{E}}^c_0 - \widehat{p}_0 \widehat{E}(Y \mid X=0)}{\widehat{p}_1^2} Z^*_{p_1} \\[2ex] Z^*_{E(Y|X=1)} - \dfrac{\widehat{\underline{\Gamma}}'_{3,\theta_0}(0, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \underline{\Gamma}_2(0, W, \widehat{\theta})}{\widehat{p}_1} + \dfrac{\widehat{p}_0}{\widehat{p}_1} Z^*_{E(Y|X=0)} + \dfrac{\widehat{E}(Y \mid X=0)}{\widehat{p}_1} Z^*_{p_0} + \dfrac{\widehat{\underline{E}}^c_0 - \widehat{p}_0 \widehat{E}(Y \mid X=0)}{\widehat{p}_1^2} Z^*_{p_1} \end{pmatrix} \overset{P}{\rightsquigarrow} Z_{\mathrm{ATT}}. \]
We can use this result to construct pointwise confidence sets for the ATT bounds for a fixed $c$, or to construct confidence bands that are uniform on a finite grid $\mathcal{C}$. Like the ATE bounds, the ATT bounds are monotonic in $c$. Thus a similar interpolation can be used to construct confidence bands for the ATT bounds that are uniform over $c \in [0,1]$.

5.4 Inference on Breakdown Points

We conclude this section by showing how to use the confidence bands we just described to do inference on breakdown points. For brevity we focus on the breakdown point for the conclusion that the ATE is nonnegative, which we defined earlier in equation (6). Inference on other breakdown points for other conclusions can be done similarly.

Let $\mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha)$ be a pointwise-in-$c$ confidence band for the ATE bounds, as described in section 5.2.1. Define
\[ c_L = \sup\big\{ c \in [0,1] : \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) \subseteq [0, \infty) \big\}. \]
This is simply the value at which the confidence band first intersects the horizontal line at zero. By proposition S.2 in Appendix D of Masten and Poirier (2020),
\[ \lim_{n \to \infty} P(c_L \le c_{bp}) \ge 1-\alpha. \]
Thus $[c_L, 1]$ is a valid one-sided lower confidence interval for the breakdown point $c_{bp}$.
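On a finite grid of $c$ values, $c_L$ can be read off the lower confidence band directly, as in this sketch (ours; it assumes the band has already been computed and monotonized in $c$):

```python
import numpy as np

def breakdown_lower_bound(c_grid, lb_band):
    """One-sided lower confidence bound c_L for the breakdown point:
    the largest c on the grid at which the lower endpoint of
    CI^c_ATE(1 - alpha) is still nonnegative."""
    nonneg = np.asarray(lb_band) >= 0
    return c_grid[np.nonzero(nonneg)[0].max()] if nonneg.any() else 0.0
```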
6 Validity of the Standard Bootstrap

In the previous section we showed how to use a non-standard bootstrap method to conduct inference on the CATE, ATE, and ATT bounds. The key technical problem was that these bounds are not necessarily Hadamard differentiable functionals of the first step estimators; they are only Hadamard directionally differentiable. In this section, we provide simple sufficient conditions on the propensity score under which the CATE, ATE, and ATT bounds are in fact Hadamard differentiable. Under this condition, the methods in section 5 are still valid, but so is the standard nonparametric bootstrap. After stating the formal result, we discuss when this sufficient condition holds and when it does not.

First consider the average treatment effect. Recall that the ATE bounds depend on the functional $\Gamma_3(x, \theta)$. We will show that its Hadamard directional derivative $\Gamma'_{3,\theta_0}(x, h)$ is linear in $h$ under a condition on the value of $c$, the propensity score $p_{1|w}$, and the distribution of $W$. By proposition 2.1 in Fang and Santos (2019), this linearity is equivalent to Hadamard differentiability. By theorem 3.9.11 in van der Vaart and Wellner (1996), this linearity also implies that the bootstrap process $\sqrt{n}(\Gamma_3(x, \widehat{\theta}^*) - \Gamma_3(x, \widehat{\theta}))$ converges weakly in probability conditional on the data to $\Gamma'_{3,\theta_0}(x, Z)$, a Gaussian vector. In other words, we can conduct inference on the ATE bounds using the standard nonparametric bootstrap: Take $n$ independent draws from the data with replacement, compute the bound estimates in this bootstrap sample, and then use the distribution of these bound estimates across many such bootstrap samples to approximate the sampling distribution of the bound estimators. The following theorem provides the explicit sufficient condition for validity of this bootstrap. In this result, let $p_{1|W} = P(X = 1 \mid W)$ denote the random variable obtained by evaluating the propensity score at the random vector $W$.
Suppose the assumptions of theorem 1 hold. Suppose $P(p_{1|W} \in \{c, 1 - c\}) = 0$. Then
\[
\sqrt n \begin{pmatrix} \widehat{\overline{\mathrm{ATE}}}{}^{\,c*} - \widehat{\overline{\mathrm{ATE}}}{}^{\,c} \\[1ex] \widehat{\underline{\mathrm{ATE}}}{}^{\,c*} - \widehat{\underline{\mathrm{ATE}}}{}^{\,c} \end{pmatrix} \overset{P}{\rightsquigarrow} Z_{\mathrm{ATE}},
\]
where $(\widehat{\overline{\mathrm{ATE}}}{}^{\,c*}, \widehat{\underline{\mathrm{ATE}}}{}^{\,c*})$ are drawn from the nonparametric bootstrap distribution of $(\widehat{\overline{\mathrm{ATE}}}{}^{\,c}, \widehat{\underline{\mathrm{ATE}}}{}^{\,c})$.

In the proof of this result we show that when the propensity score does not have a point mass at either $c$ or $1 - c$, the mapping $\overline\Gamma(x, \theta)$ is Hadamard differentiable for $x \in \{0, 1\}$. Hence the nonparametric bootstrap is valid. Although it is not formally stated in the theorem, we conjecture that our sufficient condition for validity of the standard bootstrap is also a necessary condition. That is, we expect the standard bootstrap to be invalid when $c$ or $1 - c$ are point masses of the propensity score's distribution. From the proof of theorem 2, we see that when this condition on the propensity score fails, $\overline\Gamma{}'_\theta(x, w, \tau, h)$ is nonlinear in $h$ on a set of $(\tau, w)$ values of positive measure. Since the HDD of $\overline\Gamma(x, \cdot)$ is the integral over $(\tau, w)$ of the HDD of $\overline\Gamma(x, w, \tau, \cdot)$, we expect that $\overline\Gamma{}'_\theta(x, h)$ will also be nonlinear in $h$, a failure of Hadamard differentiability.

By further examining the proof of theorem 2, we can also show that the CATE bounds are Hadamard differentiable at covariate values $w$ and sensitivity parameter values $c$ such that $p_{1|w} \notin \{c, 1 - c\}$.
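In words, the bootstrap in theorem 2 is the usual pairs bootstrap applied to the entire two-step procedure. The following minimal sketch illustrates it; `estimate_ate_bounds` is a hypothetical stand-in for the bound estimator of sections 3 and 4, and the percentile construction is one simple choice of interval:

```python
import numpy as np

def bootstrap_ate_bounds(data, estimate_ate_bounds, c, B=999, alpha=0.05, seed=0):
    """Standard nonparametric bootstrap for the ATE bounds at a fixed c,
    valid under theorem 2's condition P(p_{1|W} in {c, 1-c}) = 0.
    `data` is an (n, k) array of observations, and `estimate_ate_bounds`
    maps a sample and c to (lower, upper) bound estimates."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    lo_hat, up_hat = estimate_ate_bounds(data, c)
    draws = np.empty((B, 2))
    for b in range(B):
        idx = rng.integers(n, size=n)  # draw n observations with replacement
        draws[b] = estimate_ate_bounds(data[idx], c)
    # percentile-type confidence interval for the identified set
    lo_ci = np.quantile(draws[:, 0], alpha / 2)
    up_ci = np.quantile(draws[:, 1], 1 - alpha / 2)
    return (lo_hat, up_hat), (lo_ci, up_ci)
```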
Finally, the following proposition gives a similar result for the ATT, using slightly weaker assumptions.

Proposition 4. Suppose the assumptions of theorem 1 hold. Suppose $\mathrm{var}(Y \mathbb 1(X = x)) < \infty$ for each $x \in \{0, 1\}$. Suppose $P(p_{1|W} = c) = 0$. Then
\[
\sqrt n \begin{pmatrix} \widehat{\overline{\mathrm{ATT}}}{}^{\,c*} - \widehat{\overline{\mathrm{ATT}}}{}^{\,c} \\[1ex] \widehat{\underline{\mathrm{ATT}}}{}^{\,c*} - \widehat{\underline{\mathrm{ATT}}}{}^{\,c} \end{pmatrix} \overset{P}{\rightsquigarrow} Z_{\mathrm{ATT}},
\]
where $(\widehat{\overline{\mathrm{ATT}}}{}^{\,c*}, \widehat{\underline{\mathrm{ATT}}}{}^{\,c*})$ are drawn from the nonparametric bootstrap distribution of $(\widehat{\overline{\mathrm{ATT}}}{}^{\,c}, \widehat{\underline{\mathrm{ATT}}}{}^{\,c})$.

The ATT bounds only depend on our bounds for $E(Y_0)$, and not our bounds for $E(Y_1)$. Hence we only need to examine $\overline\Gamma(x, \theta)$ and $\underline\Gamma(x, \theta)$ for $x = 0$. So the proof of this proposition proceeds by showing that these functionals are Hadamard differentiable at $x = 0$ when the propensity score does not have a point mass at $c$.

The sufficient conditions in theorem 2 and proposition 4 depend on the support of the propensity score $p_{1|W}$. If the propensity score's distribution is absolutely continuous, it contains no point masses and therefore these conditions hold. This happens when one covariate $W_k$ has a nonzero coefficient $\beta_{0,k}$ and has a continuous distribution conditional on the other covariates $W_{-k}$. However, if all covariates are discrete or mixed discrete-continuous, the support of the propensity score will generally contain point masses. The nonparametric bootstrap may not be valid whenever $c$ coincides with these points. Even when $p_{1|W}$ has point masses, however, the nonparametric bootstrap is valid for $c$ outside of these points. To use this bootstrap, one could in principle estimate the support of $p_{1|W}$ to determine at which values of $c$ inference might be invalid, and select sensitivity parameters outside of this support. The nonparametric bootstrap has the advantage of being computationally simple, and it does not require the choice of tuning parameters. While more involved, the bootstrap technique detailed in section 5 is valid regardless of the support of the propensity score. For example, in our empirical analysis in section 7, all of the covariates are either mixed or discrete. Given our analysis above, we therefore use the non-standard bootstrap in our empirical analysis, since the standard bootstrap may fail at some values of $c$.
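To illustrate this support check, one can fit the propensity score model and inspect the empirical distribution of the fitted values: with discrete covariates they pile up on a few points, which the sensitivity parameter $c$ should avoid. The logit fit below is a bare-bones stand-in for any MLE routine, and the data are simulated:

```python
import numpy as np

def logit_propensity(X, W, iters=500, lr=0.5):
    """Fit P(X=1|W) = logistic(b0 + W'b) by gradient ascent on the
    log likelihood (a bare-bones stand-in for any MLE routine)."""
    W1 = np.column_stack([np.ones(len(X)), W])
    b = np.zeros(W1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-W1 @ b))
        b += lr * W1.T @ (X - p) / len(X)
    return 1.0 / (1.0 + np.exp(-W1 @ b))

rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(2000, 2))               # two binary covariates
X = rng.binomial(1, 0.3 + 0.2 * W[:, 0] + 0.2 * W[:, 1])
phat = logit_propensity(X, W)
# with discrete covariates the fitted scores concentrate on a few values;
# choose c (and 1 - c) away from this estimated support
print(np.unique(np.round(phat, 3)))
```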
Empirical Application

In this section we illustrate our methods using data on the National Supported Work (NSW) demonstration project studied by LaLonde (1986). Since this is a highly studied and well-known program, we only briefly summarize it here. See, for example, Heckman, LaLonde, and Smith (1999) for further details. We use LaLonde's data as reconstructed by Dehejia and Wahba (1999). The NSW demonstration project randomly assigned participants to either receive a guaranteed job for 9 to 18 months along with frequent counselor meetings or to be left in the labor market by themselves. We use the Dehejia and Wahba (1999) sample, which consists of all males in LaLonde's NSW dataset whose earnings are observed in 1974, 1975, and 1978. This dataset has 445 people: 185 in the treatment group and 260 in the control group. Like Imbens (2003), we use this experimental sample primarily as an illustration; in experiments where treatment was truly randomized it is not necessary to assess sensitivity to unconfoundedness. Our results may be useful for assessing the impact of randomization failure in experiments, but that is not our focus here.

In addition to this experimental sample, we construct a sample using observational data. This sample combines the 185 people in the NSW treatment group with 2490 people in a control group constructed from the Panel Study of Income Dynamics (PSID). This control group, called PSID-1 by LaLonde, consists of all male household heads observed in all years between 1975 and 1978 who were less than 55 years old and who did not classify themselves as retired. We further drop observations with earnings above a fixed cutoff.

Table 1: Summary statistics.

                              Experimental dataset         Observational dataset
                              Control        Treatment     Control
Married                       0.15           0.19          0.78
                              (0.36)         (0.39)        (0.42)
Age                           25.05          25.82         38.61
                              (7.06)         (7.16)        (11.45)
Black                         0.83           0.84          0.27
                              (0.38)         (0.36)        (0.44)
Hispanic                      0.11           0.06          0.04
                              (0.31)         (0.24)        (0.20)
Education                     10.09          10.35         11.37
                              (1.61)         (2.01)        (3.40)
Earnings in 1974              2107.03        2095.57       765.75
                              (5687.91)      (4886.62)     (1399.79)
Earnings in 1975              1266.91        1532.06       650.54
                              (3102.98)      (3219.25)     (1332.89)
Positive earnings in 1974     0.25           0.29          0.29
                              (0.43)         (0.46)        (0.46)
Positive earnings in 1975     0.32           0.40          0.25
                              (0.47)         (0.49)        (0.43)
Sample size                   260            185           242
Variable mean is shown in each cell, with that variable’s standard deviation in parentheses.
Table 2:
Baseline treatment effect estimates (in 1982 dollars).

                         ATE      ATT      Sample size
Experimental dataset     1633     1738     445
                         (650)    (689)
Observational dataset    3337     4001     390
                         (769)    (762)
Standard errors in parentheses.
Baseline Estimates
Table 2 shows the baseline point estimates of both the ATE and the ATT under the unconfoundedness assumption in the two samples we consider. These estimates are all computed by inverse probability weighting (IPW) using a parametric logit propensity score estimator. We do not consider other estimators, since our goal is to illustrate sensitivity to identifying assumptions, rather than finite sample sensitivity to the choice of estimator.
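For concreteness, the IPW point estimators behind table 2 can be sketched as follows. This is a minimal illustration assuming fitted propensity scores `phat` from a logit, not the exact code we use:

```python
import numpy as np

def ipw_ate_att(y, x, phat):
    """IPW point estimates of the ATE and ATT given outcomes y, a binary
    treatment x, and fitted propensity scores phat."""
    ate = np.mean(x * y / phat - (1 - x) * y / (1 - phat))
    # ATT: reweight controls by phat / (1 - phat) to match the treated
    ey0_treated = np.mean((1 - x) * y * phat / (1 - phat)) / x.mean()
    att = y[x == 1].mean() - ey0_treated
    return ate, att
```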
Figure 1: Sensitivity of ATE (top) and ATT (bottom) estimates to relaxations of the selection on observables assumption. The solid lines are bounds computed using the observational dataset while the dashed lines are bounds computed using the experimental dataset. The light dotted lines are confidence bands for the observational dataset while the light dashed-dotted lines are confidence bands for the experimental dataset.
Relaxing Unconfoundedness
Figure 1 shows our main results. These are estimated treatment effect bounds under $c$-dependence, along with corresponding pointwise confidence bands, as described in sections 2–5. The top plot shows bounds on the ATE while the bottom plot shows bounds on the ATT. The solid lines are bounds for the observational dataset while the dashed lines are bounds for the experimental dataset. The light dotted lines are confidence bands for the observational dataset while the light dashed-dotted lines are confidence bands for the experimental dataset. These bands are constructed to have nominal 95% coverage probability pointwise in $c$, based on our non-standard bootstrap results in section 5. For the tuning parameters we use $\varepsilon = 0.{\cdots}$, $\eta_n = 0.{\cdots}\, n^{-1/\cdots}$, and $\kappa_n = n^{-1/\cdots}$. Note that the sufficient conditions for validity of the standard bootstrap that we gave in section 6 do not apply here, since the distribution of the propensity score variable $p_{1|W}$ has point masses. This occurs because seven of the nine covariates are discrete, while the other two are mixed discrete-continuous. The mixed variables are earnings in 1974 and earnings in 1975, which have point masses at zero since many people in the sample did not work in those years.
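Schematically, the bound curves in figure 1 are traced out by evaluating the bound estimator on a grid of sensitivity parameters; `estimate_ate_bounds` is again a hypothetical stand-in for the estimator described in the text:

```python
import numpy as np

def bound_curves(data, estimate_ate_bounds, c_grid):
    """Evaluate the estimated ATE bound functions on a grid of c values;
    at c = 0 both curves collapse to the baseline point estimate."""
    curves = np.array([estimate_ate_bounds(data, c) for c in c_grid])
    return curves[:, 0], curves[:, 1]  # lower and upper bound curves
```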
For both datasets, at $c = 0$ the bounds collapse to the baseline point estimate. When $c > 0$, we allow for some selection on unobservables. Comparing the shape of the bounds for both datasets, we see that the experimental data are substantially more robust to relaxations of the baseline assumptions than the observational data. Specifically, for most values of $c$ the bounds for the experimental data are substantially tighter than the bounds for the observational data. Even the no assumptions bounds ($c = 1$) are tighter for the experimental data than for the observational data.

A second way to measure robustness uses breakdown points. Masten and Poirier (2020) discuss these in detail and give additional references. In the current context, the breakdown point is simply the largest value of $c$ such that we can still draw a specific conclusion about some parameter. Specifically, in the next two subsections we consider two conclusions: The conclusion that the ATE is nonnegative, and the conclusion that the ATT is less than the per participant program cost.

Breakdown Points for Nonnegative ATE
First consider the conclusion that the ATE is nonnegative. Our point estimates support this conclusion, but does it still hold if the baseline unconfoundedness assumption fails? In the experimental dataset, the estimated breakdown point is the value of $c$ at which the lower bound function in figure 1 intersects the horizontal axis. For all $c$ up to this value, the estimated identified set for the ATE contains only nonnegative values; for larger $c$, it also contains negative values. The corresponding estimated breakdown point for the conclusion that the ATT is positive is 0.123 in the experimental dataset, while it is 0.049 for the observational dataset. By this measure, the conclusion that the ATT is positive is more than twice as robust using the experimental data compared to the observational data.

Thus far we have compared the robustness of results obtained from the experimental data with results obtained from the observational data. Next we discuss whether either of these results is robust in an absolute sense. To do this, we use the leave-out-variable-$k$ propensity score analysis discussed in section 2.

Table 3:
Variation in leave-out-variable-$k$ propensity scores, experimental data.

                              p50      p75      p90      max ($\bar c_k$)
Earnings in 1975              0.001    0.004    0.008    0.053
Black                         0.007    0.009    0.014    0.082
Positive earnings in 1974     0.002    0.010    0.018    0.034
Education                     0.012    0.022    0.031    0.087
Married                       0.006    0.012    0.032    0.042
Age                           0.015    0.024    0.034    0.099
Earnings in 1974              0.002    0.011    0.035    0.209
Positive earnings in 1975     0.013    0.017    0.062    0.082
Hispanic                      0.007    0.017    0.099    0.124

First consider table 3, which uses data from the experimental sample. For each variable $k$, listed in the rows of this table, we compute four summary statistics from the estimated distribution of
\[
\Delta_k = \big| p_{1|W}(W_{-k}, W_k) - p_{1|W_{-k}}(W_{-k}) \big|.
\]
Specifically, we estimate the 50th, 75th, and 90th percentiles of $\Delta_k$, along with the maximum observed value, denoted $\bar c_k$. As discussed in section 2, these quantities tell us about the marginal impact of covariate $k$ on treatment assignment. $c$-dependence constrains the maximum value of the marginal impact of the unobserved potential outcome on treatment assignment, above and beyond the observed covariates. Thus the values in table 3 can help us calibrate $c$. Specifically, we will compare the breakdown point to the values in this table. These values could be interpreted as upper bounds on the magnitude of selection on unobservables that we might think is present. Thus, for a given reference value from this table, if the breakdown point is larger than the reference value, we could consider the conclusion of interest to be robust to failure of unconfoundedness. In contrast, if the breakdown point is smaller than the reference value, we could consider the conclusion of interest to be sensitive to failure of unconfoundedness.

Recall from above the estimated breakdown point for the conclusion that the ATE is nonnegative in the experimental data. This value is larger than several of the nine $\bar c_k$ values and on the same order of magnitude as four more. If we look at a less stringent comparison, the 90th percentile, we see that the estimated breakdown point is larger than all but one of the rows, corresponding to the indicator for Hispanic. Let's examine this variable more closely. Figure 2 plots the density of $\Delta_k$ for $k$ equal to the Hispanic indicator. Here we see that there is a small proportion of mass at values larger than 0.1.
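The $\Delta_k$ summaries in table 3 are straightforward to compute by refitting the propensity score without covariate $k$; a minimal sketch, where `fit_propensity` is a hypothetical routine mapping $(X, W)$ to fitted scores:

```python
import numpy as np

def delta_k_summary(X, W, k, fit_propensity, qs=(0.5, 0.75, 0.9)):
    """Percentiles and maximum of Delta_k = |p_{1|W} - p_{1|W_{-k}}|:
    compare fitted propensity scores with and without covariate k."""
    full = fit_propensity(X, W)
    leave_out = fit_propensity(X, np.delete(W, k, axis=1))
    delta = np.abs(full - leave_out)
    return [float(np.quantile(delta, q)) for q in qs] + [float(delta.max())]
```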
Figure 2: Kernel density estimate of $\Delta_k$, the absolute difference between the propensity score and the leave-out-variable-$k$ propensity score, for $k$ equal to the Hispanic indicator, in the experimental dataset.

The leave-out-variable-$k$ propensity score analysis focuses on the relationship between observed covariates and treatment assignment. It does not use data on outcomes. A less conservative analysis is to only worry about covariates $k$ which have large values in table 3 and which also affect our outcomes in some way. Specifically, we next consider leave-out-variable-$k$ IPW estimates of the ATE under the baseline unconfoundedness assumption. Table 4 shows the effect of leaving out a single variable on the ATE point estimates for both datasets. Continue to consider just the experimental dataset. Here we first see that omitting any single covariate changes the point estimate by at most 5.4%. Moreover, recall the main variable we were concerned about before: the indicator for Hispanic. Omitting this variable only changes the ATE point estimate by 1.5%.

Overall, the leave-out-variable-$k$ analysis suggests that, on an absolute scale, the conclusion that the ATE is nonnegative using the experimental data is quite robust. A similar analysis applies to conclusions about the ATT.

Next consider the observational data. Table 5 shows the leave-out-variable-$k$ propensity score analysis. Recall that the estimated breakdown point for the conclusion that the ATE is nonnegative in the observational dataset is 0.037. By any of these measures the conclusion that the ATE is nonnegative is not robust. Suppose we only consider variables which also substantially change the point estimates, as shown in table 4. Even then we still find that the results are sensitive. For example, the indicator for Black changes the ATE point estimate by 14% and also has substantial marginal impact on the propensity score, with its 50th percentile in table 5 about 1.5 times as large as the estimated ATE breakdown point. Thus, using these as absolute measures of robustness, we find that the conclusion that the ATE is positive using the observational data is not robust.

Table 4: Magnitude of the effect of omitting a single variable on ATE point estimates (as a percentage of the baseline estimate).

                              Experimental dataset    Observational dataset
Earnings in 1975              0.07                    0.02
Married                       0.21                    14.27
Positive earnings in 1974     1.35                    10.20
Hispanic                      1.51                    1.01
Black                         2.91                    14.11
Positive earnings in 1975     3.32                    0.64
Age                           3.36                    6.49
Earnings in 1974              3.90                    0.34
Education                     5.39                    1.84
Table 5:
Variation in leave-out-variable-$k$ propensity scores, observational data.

                              p50      p75      p90      max ($\bar c_k$)
Earnings in 1974              0.000    0.001    0.009    0.065
Hispanic                      0.003    0.011    0.024    0.214
Education                     0.006    0.017    0.042    0.127
Earnings in 1975              0.002    0.010    0.057    0.276
Positive earnings in 1975     0.007    0.019    0.076    0.295
Positive earnings in 1974     0.012    0.028    0.099    0.423
Married                       0.028    0.079    0.172    0.314
Age                           0.035    0.093    0.205    0.508
Black                         0.053    0.143    0.266    0.477

This conclusion that findings based on the observational dataset are not robust contrasts with the sensitivity analysis of Imbens (2003), who finds that the same observational dataset yields relatively robust results. Imbens' analysis relied importantly on fully parametric assumptions about the joint distribution of the observables and unobservables. In particular, he assumed outcomes were normally distributed, that the treatment effect is homogeneous, and that any selection on unobservables arises due to an omitted binary variable. Our identification analysis does not require any of these assumptions. As discussed in section 3, we do impose some parametric assumptions to simplify estimation, but even these assumptions are substantially weaker than those used by Imbens. Given that we are making weaker auxiliary assumptions, it is not surprising that our analysis shows the findings to be more sensitive than the analysis in Imbens (2003). Nonetheless, even with these weaker assumptions, we continue to find that conclusions from the experimental dataset remain robust.

Finally, note that all of our discussion thus far has focused on the point estimates of the breakdown points. In section 5.4 we showed that the value at which the pointwise confidence band intersects the horizontal axis is a valid one-sided lower confidence interval for the breakdown point. For the ATE with experimental data, the lower endpoint of this confidence set is quite close to zero, substantially smaller than most of the leave-out-variable-$k$ propensity score values. This is not surprising though, given that there is a substantial amount of sampling uncertainty: even the lower bounds of the confidence intervals for the baseline estimates are quite close to zero.

Can Selection on Unobservables Help the Program Pass a Cost-Benefit Analysis?
In the previous subsection we studied the sensitivity of the conclusion that the ATE is nonnegative. In practice, however, this is not necessarily the most policy relevant conclusion. For example, Heckman and Smith (1998, section 3) give a model where the socially optimal decision whether to continue a small scale program or to shut it down can be computed by comparing the ATT with the program's per participant cost. In this subsection we show how our sensitivity analysis can be used in these kinds of cost-benefit analyses. Specifically, we consider the conclusion that the ATT is less than the per participant program cost. Under the model in Heckman and Smith (1998), the program should be shut down when this conclusion holds.

Chapter 8 of MDRC (1983) reports NSW per participant program costs. For males, these total costs exceed our baseline ATT estimates. First consider the experimental dataset. Is there a value of $c$ such that the identified set for the ATT includes values that are larger than the per participant cost? If we look at the bounds' point estimates, there are no values of $c$ under which the program is cost effective. Accounting for sampling uncertainty by examining the confidence bands, we need $c$ to be at least about 0.5 before it is possible that the program is cost effective. As we argued earlier, these are very large values, so it is unlikely that selection on unobservables is this strong.

Next consider the observational dataset. Here again the conclusion of interest holds for the baseline estimate: The ATT point estimate of $4001 is smaller than the per participant cost. There are likewise no values of $c$ under which the program is cost effective, based on the bounds' point estimates. This is largely because the uncertainty due to the impact of selection on unobservables is asymmetric in this example: The lower bound grows much faster in $c$ than the upper bound does. Hence conclusions about the largest possible value of the ATT are more robust to relaxations of unconfoundedness than conclusions about the smallest possible value of the ATT. If we account for sampling uncertainty by examining the confidence bands, then we need $c$ to be at least about 0.08 before the confidence intervals contain ATT values larger than the per participant costs. This is a relatively large value, although it is smaller than a decent number of the leave-out-variable-$k$ propensity score values in table 5. That, however, likely just reflects the large amount of sampling uncertainty in this data.

Overall, our analysis suggests that the program does not pass a cost-benefit analysis, even if we allow for a large amount of selection on unobservables. Hence the conclusion that the ATT is less than the per participant cost, and hence that the program should be shut down, is quite robust to failures of unconfoundedness.

Finally, note that our analysis here is primarily illustrative. A more comprehensive cost-benefit analysis would require examining many other program outcomes besides just short run post-program earnings. For example, see the analysis in chapter 8 of MDRC (1983) and section 10 of Heckman et al. (1999). Note, however, that given data on these additional outcomes, our methods could then be used to analyze the sensitivity of total program impacts to failures of unconfoundedness.

Conclusion

Identification, estimation, and inference on treatment effects under unconfoundedness have been widely studied and applied. This approach uses two assumptions: unconfoundedness and overlap. The overlap assumption is refutable, and many tools have been developed for checking this assumption in practice. For example, Stata's built-in package teffects has commands for checking overlap.
In this paper, we provide a complementary suite of tools for assessing the unconfoundedness assumption. There are two key distinctions between our results and the previous literature. First, we begin from fully nonparametric bounds. In contrast, most of the previous literature relies on parametric assumptions for its identification analysis. Second, we provide tools for inference. This is important because, just like baseline estimators, sensitivity analyses are also subject to sampling uncertainty.
We conclude by discussing several extensions and directions for future work. As we just mentioned, a key distinguishing feature of our sensitivity analysis is that we begin from fully nonparametric bounds. We then estimated these bounds using flexible parametric estimators of the propensity score and the quantile regression of outcomes on treatment and covariates. These estimators can include quadratic terms, cubic terms, and interactions, for example, but they are not fully nonparametric. We restricted attention to parametric estimators for one reason: Even in this case, the asymptotic distribution theory is non-standard, complicated, and at the frontier of current research. This difficulty comes from the fact that our estimands are not Hadamard differentiable. Extending our analysis to first step nonparametric estimators is an important next step, but doing so will likely require both deriving and applying more general asymptotic theory for non-Hadamard differentiable functionals than currently exists. Hence we leave that analysis to future work.

A second extension is to consider additional parameters of interest. In this paper we focus on estimation and inference on the ATE and ATT bounds. We also developed analogous results for the CQTE and CATE. The conditional average treatment effect for the treated, $\mathrm{CATT}(w) = E(Y_1 - Y_0 \mid X = 1, W = w)$, can be studied with the same tools we use in section 4. We omit that analysis for brevity. Masten and Poirier (2018) also derive sharp bounds on unconditional quantile treatment effects (QTEs). Estimation and inference on the QTE bounds is more complicated than for the ATE and ATT bounds. The reason is identical to the explanation van der Vaart (2000, page 307) gives when discussing inference on unconditional sample quantiles: "to derive the asymptotic normality of even a single quantile estimator $\widehat F_n^{-1}(p)$, we need to know that the estimators $\widehat F_n$ are asymptotically normal as a process, in a neighborhood of $F^{-1}(p)$." In our case, performing inference on the QTE bounds requires showing convergence of the corresponding bounds on the unconditional potential outcome cdfs as a process in a neighborhood of the quantile of interest. (Masten and Poirier (2020) prove some results along these lines; see their lemma 1. Those results are only valid for sufficiently small values of $c$ and with discrete $W$, which substantially simplifies the analysis.) For this reason, we leave estimation and inference on the QTE bounds to a separate paper.
References
Altonji, J. G., T. E. Elder, and C. R. Taber (2005): "Selection on observed and unobserved variables: Assessing the effectiveness of Catholic schools," Journal of Political Economy, 113, 151–184.

——— (2008): "Using selection on observed variables to assess bias from unobservables when evaluating Swan-Ganz catheterization," American Economic Review P&P, 98, 345–350.

Amemiya, T. (1985): Advanced Econometrics, Harvard University Press.

Angrist, J., V. Chernozhukov, and I. Fernández-Val (2006): "Quantile regression under misspecification, with an application to the US wage structure," Econometrica, 74, 539–563.

Athey, S. and G. W. Imbens (2017): "The state of applied econometrics: Causality and policy evaluation," Journal of Economic Perspectives, 31, 3–32.

Caliendo, M. and S. Kopeinig (2008): "Some practical guidance for the implementation of propensity score matching," Journal of Economic Surveys, 22, 31–72.

Chernozhukov, V., I. Fernández-Val, and A. Galichon (2010): "Quantile and probability curves without crossing," Econometrica, 78, 1093–1125.

Chernozhukov, V., I. Fernández-Val, and T. Kaji (2017): "Extremal quantile regression," Handbook of Quantile Regression.

Cinelli, C. and C. Hazlett (2020): "Making sense of sensitivity: Extending omitted variable bias," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 39–67.

Dehejia, R. H. and S. Wahba (1999): "Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs," Journal of the American Statistical Association, 94, 1053–1062.

Fang, Z. and A. Santos (2015): "Inference on directionally differentiable functions," Working paper.

——— (2019): "Inference on directionally differentiable functions," The Review of Economic Studies, 86, 377–412.

Heckman, J. J., R. J. LaLonde, and J. A. Smith (1999): "The economics and econometrics of active labor market programs," Handbook of Labor Economics, 3, 1865–2097.

Heckman, J. J. and J. Smith (1998): "Evaluating the welfare state," in Econometrics and Economic Theory in the 20th Century: The Ragnar Frisch Centennial Symposium, Cambridge University Press, 31, 241.

Hong, H. and J. Li (2018): "The numerical delta method," Journal of Econometrics, 206, 379–394.

Hosman, C. A., B. B. Hansen, and P. W. Holland (2010): "The sensitivity of linear regression coefficients' confidence limits to the omission of a confounder," The Annals of Applied Statistics, 4, 849–870.

Ichino, A., F. Mealli, and T. Nannicini (2008): "From temporary help jobs to permanent employment: What can we learn from matching estimators and their sensitivity?" Journal of Applied Econometrics, 23, 305–327.

Imbens, G. W. (2003): "Sensitivity to exogeneity assumptions in program evaluation," American Economic Review P&P, 126–132.

——— (2004): "Nonparametric estimation of average treatment effects under exogeneity: A review," The Review of Economics and Statistics, 86, 4–29.

Imbens, G. W. and D. B. Rubin (2015): Causal Inference for Statistics, Social, and Biomedical Sciences, Cambridge University Press.

Imbens, G. W. and J. M. Wooldridge (2009): "Recent developments in the econometrics of program evaluation," Journal of Economic Literature, 47, 5–86.

Kallus, N., X. Mao, and A. Zhou (2019): "Interval estimation of individual-level causal effects under unobserved confounding," in The 22nd International Conference on Artificial Intelligence and Statistics, 2281–2290.

Kosorok, M. R. (2008): Introduction to Empirical Processes and Semiparametric Inference, Springer Science & Business Media.

Krauth, B. (2016): "Bounding a linear causal effect using relative correlation restrictions," Journal of Econometric Methods, 5, 117–141.

LaLonde, R. J. (1986): "Evaluating the econometric evaluations of training programs with experimental data," The American Economic Review, 604–620.

Manpower Demonstration Research Corporation (MDRC) (1983): Summary and Findings of the National Supported Work Demonstration.

Manski, C. F. (1990): "Nonparametric bounds on treatment effects," American Economic Review P&P, 80, 319–323.

Masten, M. A. and A. Poirier (2018): "Identification of treatment effects under conditional partial independence," Econometrica, 86, 317–351.

——— (2020): "Inference on breakdown frontiers," Quantitative Economics, 11, 41–111.

Mauro, R. (1990): "Understanding LOVE (left out variables error): A method for estimating the effects of omitted variables," Psychological Bulletin, 108, 314.

Newey, W. K. and D. McFadden (1994): "Large sample estimation and hypothesis testing," Handbook of Econometrics, IV, ed. by R. F. Engle and D. L. McFadden, 2112–2245.

Oster, E. (2019): "Unobservable selection and coefficient stability: Theory and evidence," Journal of Business & Economic Statistics, 37, 187–204.

Robins, J. M., A. Rotnitzky, and D. O. Scharfstein (2000): "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, Springer, 1–94.

Rosenbaum, P. R. (1995): Observational Studies, Springer.

——— (2002): Observational Studies, Springer, second ed.

Rosenbaum, P. R. and D. B. Rubin (1983): "Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome," Journal of the Royal Statistical Society, Series B, 212–218.

van der Vaart, A. and J. Wellner (1996): Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Science & Business Media.

van der Vaart, A. W. (2000): Asymptotic Statistics, Cambridge University Press.
A Asymptotics for the First Step Estimators
In this appendix we formally state assumptions that ensure asymptotic normality of our first step estimators.
A.1 Assumptions
We begin with the propensity score, which we estimate by maximum likelihood.
Assumption A4 (Propensity Score).

1. (Correct specification) Let $\mathcal B \subseteq \mathbb R^{d_W}$ be compact. There is a $\beta_0 \in \mathrm{int}(\mathcal B)$ such that
\[
P(X = x \mid W = w) = F(w'\beta_0)^x (1 - F(w'\beta_0))^{1 - x} \equiv L(x, w'\beta_0)
\]
for all $x \in \{0, 1\}$ and $w \in \mathcal W$.

2. (Sufficient variation) There is no proper linear subspace $A$ of $\mathbb R^{d_W}$ such that $P(W \in A) = 1$.

3. (Regularity of link function) $F : \mathbb R \to (0, 1)$ is strictly increasing and twice continuously differentiable with uniformly bounded derivative.

This assumption requires our propensity score specification to be correct. It also imposes some standard assumptions on the parameter space $\mathcal B$, the link function $F(\cdot)$, and the distribution of the covariates $W$. Next, let
\[
\ell(x, w'\beta) = \log L(x, w'\beta)
\]
denote the log likelihood function. Let
\[
\ell_\beta(x, w'\beta) = \frac{\partial}{\partial \beta} \ell(x, w'\beta)
\qquad \text{and} \qquad
\ell_{\beta\beta}(x, w'\beta) = \frac{\partial^2}{\partial \beta\, \partial \beta'} \ell(x, w'\beta)
\]
denote its vector of first derivatives and its matrix of second derivatives, respectively. Recall from section 4 that $\mathcal B_\delta = \{\beta \in \mathcal B : \|\beta - \beta_0\| \leq \delta\}$. We impose the following assumptions on the propensity score as well.

Assumption A5 (Propensity Score Regularity). For each $x \in \{0, 1\}$:

1. We have
\[
E\Big( \sup_{\beta \in \mathcal B} |\ell(x, W'\beta)| \Big) < \infty.
\]

2. For some $\delta > 0$,
\[
\int_{\mathcal W} \sup_{\beta \in \mathcal B_\delta} \Big\| \frac{\partial}{\partial \beta} L(x, w'\beta) \Big\|\, dw < \infty
\qquad \text{and} \qquad
\int_{\mathcal W} \sup_{\beta \in \mathcal B_\delta} \Big\| \frac{\partial^2}{\partial \beta\, \partial \beta'} L(x, w'\beta) \Big\|\, dw < \infty.
\]

3. The matrix $V_\beta^{-1} = E[\ell_\beta(X, W'\beta_0)\, \ell_\beta(X, W'\beta_0)']$ exists and is nonsingular.

4. For some $\delta > 0$,
\[
E\Big( \sup_{\beta \in \mathcal B_\delta} \|\ell_{\beta\beta}(x, W'\beta)\| \Big) < \infty.
\]

These conditions are standard for maximum likelihood estimators. For example, see theorem 3.3 in Newey and McFadden (1994) along with their discussion. Note that the dominance conditions A5.1, A5.2, and A5.4 hold in standard parametric models like logit and probit. Although not necessary, A5.3 also holds when $E(\|W\|^2) < \infty$ and strong overlap holds; that is, when there exist $0 < \underline p \leq \overline p < 1$ such that $p_{1|w} \in [\underline p, \overline p]$ for all $w \in \mathcal W$.

Besides the propensity score, the other first step estimator is the conditional quantile function of $Y$ given $(X, W)$. We consider here a linear quantile regression of $Y$ on $q(X, W)$, a set of flexible functions of $(X, W)$. We make the following assumptions.

Assumption A6 (Quantile Regression). There exists an $\varepsilon_{\mathrm{smaller}} \in (0, \varepsilon)$ such that:

1. There is some $\gamma_0 \in C([\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}], \mathbb R^{d_q})$ such that
\[
Q_{Y|X,W}(\tau \mid x, w) = q(x, w)'\gamma_0(\tau)
\]
for every $\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]$.

2. The conditional density $f_{Y|q(X,W)}(y \mid q(x, w))$ exists and is bounded and uniformly continuous in $y$, uniformly in $q(x, w) \in \mathrm{supp}(q(X, W))$.

3. The matrix
\[
J(\tau) = E\big[ f_{Y|q(X,W)}\big( q(X, W)'\gamma_0(\tau) \mid q(X, W) \big)\, q(X, W)\, q(X, W)' \big]
\]
is positive definite for all $\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]$.

4. $E(\|q(x, W)\|^2) < \infty$ for $x \in \{0, 1\}$.

These are standard assumptions for obtaining limiting distributions of quantile regression processes indexed by $\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]$. For example, see theorem 3 in Angrist, Chernozhukov, and Fernández-Val (2006).

A.2 Convergence Results
We next prove two convergence results. The first is joint asymptotic normality of the first step estimators.
Lemma 1 (First step estimators). Suppose A1 and A4–A6 hold. Then
\[
\sqrt n \begin{pmatrix} \widehat\beta - \beta_0 \\ \widehat\gamma(\tau) - \gamma_0(\tau) \end{pmatrix} \rightsquigarrow Z_1(\tau),
\]
where $Z_1(\cdot)$ is a mean-zero Gaussian process in $\mathbb R^{d_W} \times \ell^\infty([\varepsilon, 1 - \varepsilon], \mathbb R^{d_q})$ with uniformly continuous paths. Moreover, its covariance kernel can be written in block form as
\[
E[Z_1(\tau_1) Z_1(\tau_2)'] = \begin{pmatrix} V_\beta & 0 \\ 0 & V_\gamma(\tau_1, \tau_2) \end{pmatrix} \tag{11}
\]
where
\[
V_\beta = E\left[ \frac{F'(W'\beta_0)^2}{F(W'\beta_0)(1 - F(W'\beta_0))}\, W W' \right]^{-1}
\]
and
\[
V_\gamma(\tau_1, \tau_2) = J(\tau_1)^{-1} (\min\{\tau_1, \tau_2\} - \tau_1 \tau_2)\, E[q(X, W) q(X, W)']\, J(\tau_2)^{-1}.
\]

Next we provide a convergence result for estimates of the derivatives of the quantile regression coefficients. We estimate these derivatives as follows. Let $\tau \in [\varepsilon, 1 - \varepsilon]$. Then
\[
\widehat\gamma{}'(\tau) = \frac{\widehat\gamma(\tau + \eta_n) - \widehat\gamma(\tau - \eta_n)}{2 \eta_n} \tag{12}
\]
where $\eta_n > 0$ is a bandwidth chosen small enough that $[\tau - \eta_n, \tau + \eta_n]$ is contained in $(\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}})$. The next result shows that these estimators are uniformly consistent.

Lemma 2 (Convergence of QR derivatives). Let $\eta_n \to 0$ and $n\eta_n^2 \to \infty$ as $n \to \infty$. Suppose the assumptions of lemma 1 hold. Suppose A2 holds. Then
\[
\sup_{\tau \in [\varepsilon, 1 - \varepsilon]} \|\widehat\gamma{}'(\tau) - \gamma_0'(\tau)\| = o_p(1).
\]
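Equation (12) is an ordinary central difference applied to the estimated coefficient process; a minimal sketch, with a hypothetical smooth process standing in for $\widehat\gamma$:

```python
import numpy as np

def qr_coef_derivative(gamma_hat, tau, eta_n):
    """Central-difference estimator (12) of the derivative of the quantile
    regression coefficient process; gamma_hat maps a quantile level to a
    coefficient vector (e.g. from separate quantile regressions)."""
    return (gamma_hat(tau + eta_n) - gamma_hat(tau - eta_n)) / (2.0 * eta_n)

# hypothetical smooth coefficient process for illustration
gamma = lambda t: np.array([t ** 2, np.log(1.0 + t)])
print(qr_coef_derivative(gamma, tau=0.5, eta_n=0.01))  # approx [1.0, 0.667]
```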
A.3 Proofs

We begin by examining each of the first step estimators separately.
Lemma 3 (Propensity score estimation). Suppose A1 and A4–A5 hold. Then
\[
\sqrt n(\widehat\beta - \beta_0) = \frac{1}{\sqrt n} \sum_{i=1}^n V_\beta \frac{F'(W_i'\beta_0)(X_i - F(W_i'\beta_0))\, W_i}{F(W_i'\beta_0)(1 - F(W_i'\beta_0))} + o_p(1)
\]
and hence $\sqrt n(\widehat\beta - \beta_0) \overset{d}{\longrightarrow} N(0, V_\beta)$.

Proof of lemma 3. This result follows from theorem 3.3 (asymptotic normality of MLEs) in Newey and McFadden (1994). So it suffices to verify that their assumptions hold.

1. Their theorem 3.3 begins by supposing the assumptions of their theorem 2.5 (consistency of MLEs) hold. So we verify those assumptions first. By A4.1, $\ell(x, w'\beta) = \ell(x, w'\widetilde\beta)$ for all $(x, w) \in \mathrm{supp}(X, W)$ implies that $w'\beta = w'\widetilde\beta$ for all $w \in \mathrm{supp}(W)$. By A4.2 this implies that $\beta = \widetilde\beta$. So assumption (i) of their theorem 2.5 holds. We directly assume that their assumptions (ii), (iii), and (iv) hold (via our A4 and A5.1). Finally, note that A1 is our assumption that $\{(Y_i, X_i, W_i)\}_{i=1}^n$ are iid. Thus all assumptions of their theorem 2.5 hold.

2. Next we consider the additional assumptions imposed in their theorem 3.3, (i)–(v). These are directly implied by our A4 and A5.

Thus all assumptions of their theorem 3.3 hold. This gives us $\sqrt n(\widehat\beta - \beta_0) \overset{d}{\to} N(0, V_\beta)$. The asymptotic linear representation holds by arguments in the proof of their theorem 3.1 and the discussion on pages 2142–2143.

Lemma 4 (Quantile regression estimation). Suppose A1 and A6 hold. Then
\[
\sqrt n(\widehat\gamma(\tau) - \gamma_0(\tau)) = J(\tau)^{-1} \frac{1}{\sqrt n} \sum_{i=1}^n \big( \tau - \mathbb 1(Y_i \leq q(X_i, W_i)'\gamma_0(\tau)) \big)\, q(X_i, W_i) + o_p(1) \rightsquigarrow J(\tau)^{-1} Z_\gamma(\tau),
\]
where $Z_\gamma(\cdot)$ is a mean-zero Gaussian process in $\ell^\infty([\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}], \mathbb R^{d_q})$ with continuous paths and covariance kernel equal to $\Sigma(\tau_1, \tau_2) = (\min\{\tau_1, \tau_2\} - \tau_1\tau_2)\, E[q(X, W) q(X, W)']$.

Proof of lemma 4. By Minkowski's inequality,
\[
E(\|q(X, W)\|^2)^{1/2} = E(\|X q(1, W) + (1 - X) q(0, W)\|^2)^{1/2} \leq E(|X|^2 \|q(1, W)\|^2)^{1/2} + E(|1 - X|^2 \|q(0, W)\|^2)^{1/2}.
\]
By A6.4 and $X \in \{0, 1\}$, it follows that $E(\|q(X, W)\|^2) < \infty$. The result then follows directly from theorem 3 in Angrist et al. (2006).

Proof of lemma 1. Note that the influence functions in lemmas 3 and 4 are Donsker. Therefore we can stack them to obtain joint weak convergence to $Z_1(\tau)$. Next, note that this result holds over $\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]$. This is a strict superset of $[\varepsilon, 1 - \varepsilon]$, so it holds on that set too.

Finally, we note that the off-diagonal element of the covariance kernel is zero. This off-diagonal element is
\[
C_{\beta,\gamma}(\tau) = \mathrm{cov}\left( V_\beta \frac{F'(W'\beta_0)(X - F(W'\beta_0))\, W}{F(W'\beta_0)(1 - F(W'\beta_0))},\ \big( \tau - \mathbb 1(Y \leq q(X, W)'\gamma_0(\tau)) \big)\, q(X, W)' J(\tau)^{-1} \right).
\]
By iterated expectations,
\[
E\big[ \big( \tau - \mathbb 1(Y \leq q(X, W)'\gamma_0(\tau)) \big)\, q(X, W)' J(\tau)^{-1} \big]
= E\big[ \big( \tau - E[\mathbb 1(Y \leq q(X, W)'\gamma_0(\tau)) \mid X, W] \big)\, q(X, W)' J(\tau)^{-1} \big] = 0
\]
since $P(Y \leq q(x, w)'\gamma_0(\tau) \mid X = x, W = w) = \tau$ by correct specification of the conditional quantile function. Also,
\[
E\left[ V_\beta \frac{F'(W'\beta_0)(X - F(W'\beta_0))\, W}{F(W'\beta_0)(1 - F(W'\beta_0))} \big( \tau - \mathbb 1(Y \leq q(X, W)'\gamma_0(\tau)) \big)\, q(X, W)' J(\tau)^{-1} \right] = 0
\]
by a similar argument, using iterated expectations conditional on $(X, W)$, by correct specification of the conditional quantile function, and since the first factor is deterministic conditional on $(X, W)$. Thus $C_{\beta,\gamma}(\tau) = 0$ by definition of the covariance.

Proof of lemma 2.
Without loss of generality, consider the convergence of $\widehat\gamma{}'_{(1)}$ to $\gamma'_{0,(1)}$, the first component of $\gamma_0'$. Since $\eta_n \to 0$, let $\eta_n$ be small enough that $\eta_n \in (0, \varepsilon - \varepsilon_{\mathrm{smaller}})$. Then
\begin{align*}
\sup_{\tau \in [\varepsilon, 1-\varepsilon]} |\widehat\gamma{}'_{(1)}(\tau) - \gamma'_{0,(1)}(\tau)|
&\leq \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \left| \frac{\widehat\gamma_{(1)}(\tau + \eta_n) - \widehat\gamma_{(1)}(\tau - \eta_n)}{2\eta_n} - \frac{\gamma_{0,(1)}(\tau + \eta_n) - \gamma_{0,(1)}(\tau - \eta_n)}{2\eta_n} \right| \\
&\quad + \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \left| \frac{\gamma_{0,(1)}(\tau + \eta_n) - \gamma_{0,(1)}(\tau - \eta_n)}{2\eta_n} - \gamma'_{0,(1)}(\tau) \right| \\
&\leq \frac{1}{2\eta_n} \left( \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \big| \widehat\gamma_{(1)}(\tau + \eta_n) - \gamma_{0,(1)}(\tau + \eta_n) \big| + \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \big| \widehat\gamma_{(1)}(\tau - \eta_n) - \gamma_{0,(1)}(\tau - \eta_n) \big| \right) \\
&\quad + \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \left| \frac{ \big( \gamma_{0,(1)}(\tau) + \gamma'_{0,(1)}(\tau)\eta_n + \tfrac12 \gamma''_{0,(1)}(\tau)\eta_n^2 + \tfrac16 \gamma'''_{0,(1)}(\tau^*_{1n})\eta_n^3 \big) - \big( \gamma_{0,(1)}(\tau) - \gamma'_{0,(1)}(\tau)\eta_n + \tfrac12 \gamma''_{0,(1)}(\tau)\eta_n^2 - \tfrac16 \gamma'''_{0,(1)}(\tau^*_{2n})\eta_n^3 \big) }{2\eta_n} - \gamma'_{0,(1)}(\tau) \right| \\
&\leq \frac{1}{2\eta_n} \left( 2 \sup_{\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]} \big| \widehat\gamma_{(1)}(\tau) - \gamma_{0,(1)}(\tau) \big| \right) + \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \left| \frac{\gamma'''_{0,(1)}(\tau^*_{1n})\eta_n^2}{12} + \frac{\gamma'''_{0,(1)}(\tau^*_{2n})\eta_n^2}{12} \right|.
\end{align*}
The first inequality follows by the triangle inequality and the definition of $\widehat\gamma{}'_{(1)}$. The second inequality follows by taking two third order Taylor expansions of $\gamma_{0,(1)}$, where $\tau^*_{1n} \in (\tau, \tau + \eta_n)$ and $\tau^*_{2n} \in (\tau - \eta_n, \tau)$. There we use A2 with $m \geq 3$. The last inequality follows for two reasons: In the first term, $\varepsilon_{\mathrm{smaller}} < \varepsilon$ and $\eta_n$ is chosen such that $[\tau - \eta_n, \tau + \eta_n] \subset (\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}})$, so we are taking the supremum over a larger set in this term in the last line. In the second term, the zeroth, first, and second order derivative terms of $\gamma_{0,(1)}$ all cancel, leaving only the third order derivatives remaining.

Finally,
\[
\sup_{\tau \in [\varepsilon, 1-\varepsilon]} |\widehat\gamma{}'_{(1)}(\tau) - \gamma'_{0,(1)}(\tau)| = \frac{1}{2\eta_n} O_p\left( \frac{1}{\sqrt n} \right) + B\, \eta_n^2 = o_p(1).
\]
The first equality follows by lemma 4, which shows that the first term is $(1/\eta_n) O_p(1/\sqrt n) = O_p(1/\sqrt{n\eta_n^2})$. In the second term, $B > 0$ is a constant that does not depend on $n$. This constant comes from A2, which implies the function $\gamma_0$ has uniformly bounded third derivatives. The last equality follows since $\eta_n \to 0$ and $n\eta_n^2 \to \infty$ as $n \to \infty$.

Repeating this argument across all components of $\widehat\gamma{}'(\tau) - \gamma_0'(\tau)$ shows that $\sup_{\tau \in [\varepsilon, 1-\varepsilon]} \|\widehat\gamma{}'(\tau) - \gamma_0'(\tau)\| = o_p(1)$, as desired.

B Proofs for Section 4
In this section we give the proofs for the results in section 4. We start with a preliminary result on the asymptotic distribution of the CQTE bound estimators. We use this for all of our later results. We then state and prove a useful lemma. Finally we give the proofs for our CATE, ATE, and ATT bound estimators.

All of these results rely on proving Hadamard directional differentiability of various functionals. For that reason, it is helpful to recall its definition.
Definition 2.
Let $\phi : \mathbb D_\phi \to \mathbb E$ where $\mathbb D, \mathbb E$ are Banach spaces and $\mathbb D_\phi \subseteq \mathbb D$. Say $\phi$ is Hadamard directionally differentiable at $\theta \in \mathbb D_\phi$ tangentially to $\mathbb D_0 \subseteq \mathbb D$ if there is a continuous map $\phi'_\theta : \mathbb D_0 \to \mathbb E$ such that
\[
\lim_{m \to \infty} \left\| \frac{\phi(\theta + t_m h_m) - \phi(\theta)}{t_m} - \phi'_\theta(h) \right\|_{\mathbb E} = 0
\]
for all sequences $\{h_m\} \subset \mathbb D$ and $t_m \searrow 0$ such that $h_m \to h \in \mathbb D_0$ as $m \to \infty$ and $\theta + t_m h_m \in \mathbb D_\phi$ for all $m$.

By proposition 2.1 in Fang and Santos (2019), the mapping $\phi$ is Hadamard differentiable at $\theta$ tangentially to $\mathbb D_0$ if and only if it is Hadamard directionally differentiable at $\theta$ tangentially to $\mathbb D_0$ and the mapping $\phi'_\theta$ is linear.

Throughout the proofs we let $\|\gamma\|_\infty = \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \|\gamma(\tau)\|$ denote the sup-norm in $\ell^\infty([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$.

B.1 The CQTE Bounds
We start with a preliminary result for our estimates of the CQTE bounds. All our other bounds are built from these, so it is helpful to understand them first.

Proposition 5 (CQTE convergence). Suppose A1, A2, and A4–A6 hold. Fix $\varepsilon > 0$, $w \in \mathcal W$, $c \in [0, 1]$, and $\tau \in (0, 1)$. Then
\[
\sqrt n \begin{pmatrix} \widehat{\overline{\mathrm{CQTE}}}{}^{\,c}(\tau \mid w) - \overline{\mathrm{CQTE}}{}^{\,c}_\varepsilon(\tau \mid w) \\[1ex] \widehat{\underline{\mathrm{CQTE}}}{}^{\,c}(\tau \mid w) - \underline{\mathrm{CQTE}}{}^{\,c}_\varepsilon(\tau \mid w) \end{pmatrix} \overset{d}{\longrightarrow} Z_{\mathrm{CQTE}}(w, \tau),
\]
where $Z_{\mathrm{CQTE}}$ is a random vector in $\mathbb R^2$ whose distribution is characterized in the proof.

Proof of proposition 5.
Part 1: The upper bound is HDD. Recall from section 4.2 that $\overline\Gamma(x, w, \tau, \theta)$ denotes our trimmed population conditional quantile upper bound. We write this parameter as a function of a few different pieces:
\[
\overline\Gamma(x, w, \tau, \theta) = q(x, w)'\gamma(\overline S(x, w, \tau, \beta))
\]
where $\overline S(x, w, \tau, \beta) = \max\{\overline S_0(x, w, \tau, \beta), \varepsilon\}$ and
\[
\overline S_0(x, w, \tau, \beta) = \min\left\{ \tau + \frac{c}{L(x, w'\beta)} \min\{\tau, 1 - \tau\},\ \frac{\tau}{L(x, w'\beta)},\ 1 - \varepsilon \right\}.
\]
For simplicity, we leave the dependence on $\varepsilon$ and $c$ implicit in our notation for $\overline S_0$ and $\overline S$. There are now three steps: We show Hadamard directional differentiability (HDD) of $\overline\Gamma(x, w, \tau, \cdot)$ at $\theta_0$ tangentially to $\mathbb R^{d_W} \times C([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$ by examining the two pieces $\overline S_0$ and $\overline S$ separately. We then combine these to show HDD of $\overline\Gamma$.

Step 1: HDD of $\overline S_0$. We first show $\overline S_0(x, w, \tau, \cdot)$ is HDD at $\beta_0$. Let $t_m \searrow 0$ and $h_m \to h \in \mathbb R^{d_W}$ as $m \to \infty$. Define the secant line
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = \frac{\overline S_0(x, w, \tau, \beta_0 + t_m h_m) - \overline S_0(x, w, \tau, \beta_0)}{t_m}.
\]
We will show that
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) \to \overline T_0(x, w, \tau, \beta_0, h, 0) \equiv \sum_{j=1}^7 \overline T_{0,j}(x, w, \tau, \beta_0, h)\, \overline{\mathbb 1}_{0,j}(x, w, \tau, \beta_0, 0)
\]
as $m \to \infty$, where the $\overline T_{0,j}$ are defined in appendix C below. To see this, we consider the seven cases associated with the indicators $\overline{\mathbb 1}_{0,j}$ for $j = 1, \ldots, 7$.

Suppose $\overline{\mathbb 1}_{0,1}(x, w, \tau, \beta_0, 0) = 1$. Then
\[
\overline S_0(x, w, \tau, \beta_0) = \tau + \frac{c}{L(x, w'\beta_0)} \min\{\tau, 1 - \tau\}.
\]
Moreover, for $m$ large enough and by continuity of $\overline S_0$ in $\beta$, $\overline{\mathbb 1}_{0,1}(x, w, \tau, \beta_0 + t_m h_m, 0) = 1$. Hence
\[
\overline S_0(x, w, \tau, \beta_0 + t_m h_m) = \tau + \frac{c}{L(x, w'(\beta_0 + t_m h_m))} \min\{\tau, 1 - \tau\}.
\]
So for $m$ large enough,
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = \frac{c \min\{\tau, 1 - \tau\}}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right) \to \overline T_{0,1}(x, w, \tau, \beta_0, h)
\]
by the definition of the directional derivative of $1/L(x, w'\beta)$ with respect to $\beta$ in the direction $h$ at $\beta_0$.

Similarly, if $\overline{\mathbb 1}_{0,2}(x, w, \tau, \beta_0, 0) = 1$ then
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = \frac{\tau}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right) \to \overline T_{0,2}(x, w, \tau, \beta_0, h),
\]
where the first equation holds for $m$ large enough and the convergence holds as $m \to \infty$. Likewise, if $\overline{\mathbb 1}_{0,3}(x, w, \tau, \beta_0, 0) = 1$ then
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = 0 \to \overline T_{0,3}(x, w, \tau, \beta_0, h),
\]
where again the first equation holds for $m$ large enough and the convergence is as $m \to \infty$.

If $\overline{\mathbb 1}_{0,4}(x, w, \tau, \beta_0, 0) = 1$ then
\[
\tau + \frac{c}{L(x, w'\beta_0)} \min\{\tau, 1 - \tau\} = \frac{\tau}{L(x, w'\beta_0)} < 1 - \varepsilon.
\]
For $m$ large enough,
\[
\overline S_0(x, w, \tau, \beta_0 + t_m h_m) = \min\left\{ \tau + \frac{c}{L(x, w'(\beta_0 + t_m h_m))} \min\{\tau, 1 - \tau\},\ \frac{\tau}{L(x, w'(\beta_0 + t_m h_m))} \right\}.
\]
Hence
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = \min\left\{ \frac{c \min\{\tau, 1-\tau\}}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ \frac{\tau}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right) \right\}
\]
for $m$ large enough. Similarly, for $m$ large enough,
\[
\overline T_{0m} = \min\left\{ \frac{c \min\{\tau, 1-\tau\}}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ 0 \right\},
\]
\[
\overline T_{0m} = \min\left\{ \frac{\tau}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ 0 \right\},
\]
and
\[
\overline T_{0m} = \min\left\{ \frac{c \min\{\tau, 1-\tau\}}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ \frac{\tau}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ 0 \right\}
\]
when $\overline{\mathbb 1}_{0,j} = 1$ for $j = 5, 6, 7$, respectively. Combining all seven cases, since $t_m \searrow 0$ and $h_m \to h$, and by examining $\overline T_{0m}$ for $\overline{\mathbb 1}_{0,j} = 1$, $j = 4, 5, 6, 7$, we see that
\[
\frac{\overline S_0(x, w, \tau, \beta_0 + t_m h_m) - \overline S_0(x, w, \tau, \beta_0)}{t_m} = \overline T_{0m}(x, w, \tau, \beta_0, h_m) \to \overline T_0(x, w, \tau, \beta_0, h, 0).
\]
Hence $\overline S_0(x, w, \tau, \cdot)$ is Hadamard directionally differentiable at $\beta_0$.

Step 2: HDD of $\overline S$. Recall that $\overline S(x, w, \tau, \beta) = \max\{\overline S_0(x, w, \tau, \beta), \varepsilon\}$. As before, let $t_m \searrow 0$ and $h_m \to h \in \mathbb R^{d_W}$ as $m \to \infty$. Define
\[
\overline T_m(x, w, \tau, \beta_0, h_m) = \frac{\overline S(x, w, \tau, \beta_0 + t_m h_m) - \overline S(x, w, \tau, \beta_0)}{t_m}.
\]
Substituting the functional form of $\overline S_0$ into the definition of $\overline S$ gives
\begin{align*}
\overline T_m(x, w, \tau, \beta_0, h_m) &= \frac{1}{t_m} \max\left\{ \min\left\{ \tau + \frac{c \min\{\tau, 1-\tau\}}{L(x, w'(\beta_0 + t_m h_m))},\ \frac{\tau}{L(x, w'(\beta_0 + t_m h_m))},\ 1 - \varepsilon \right\},\ \varepsilon \right\} \\
&\quad - \frac{1}{t_m} \max\left\{ \min\left\{ \tau + \frac{c \min\{\tau, 1-\tau\}}{L(x, w'\beta_0)},\ \frac{\tau}{L(x, w'\beta_0)},\ 1 - \varepsilon \right\},\ \varepsilon \right\}.
\end{align*}
As in step 1, we next characterize the value of this secant line by splitting it into three different cases.

1. If $\min\{\tau + c\min\{\tau,1-\tau\}/L(x, w'\beta_0),\ \tau/L(x, w'\beta_0),\ 1-\varepsilon\} > \varepsilon$ then, for large enough $m$, $\overline T_m$ equals the secant line of $\overline S_0$, namely $\overline T_{0m}$.

2. If $\min\{\tau + c\min\{\tau,1-\tau\}/L(x, w'\beta_0),\ \tau/L(x, w'\beta_0),\ 1-\varepsilon\} < \varepsilon$ then $\overline T_m(x, w, \tau, \beta_0, h_m) = 0$ for large enough $m$.

3. If $\min\{\tau + c\min\{\tau,1-\tau\}/L(x, w'\beta_0),\ \tau/L(x, w'\beta_0),\ 1-\varepsilon\} = \varepsilon$ then $\overline T_m = \max\{\overline T_{0m}, 0\}$ for large enough $m$.

Using similar arguments as in step 1, by examining each of the three cases we see that
\[
\overline T_m(x, w, \tau, \beta_0, h_m) \to \overline T(x, w, \tau, \beta_0, h, 0)
\]
as $m \to \infty$. Hence $\overline S(x, w, \tau, \cdot)$ is Hadamard directionally differentiable at $\beta_0$.

Step 3: HDD of $\overline\Gamma$. Next we show that $\overline\Gamma(x, w, \tau, \cdot)$ is HDD at $\theta_0$ tangentially to $\mathbb R^{d_W} \times C([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$. Let $t_m \searrow 0$, $h_{1m} \to h_1 \in \mathbb R^{d_W}$, and $h_{2m} \to h_2 \in C([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$ endowed with the sup norm, as $m \to \infty$. Let $h_m = (h_{1m}, h_{2m})$ and $h = (h_1, h_2)$. Then
\begin{align*}
\frac{\overline\Gamma(x, w, \tau, \theta_0 + t_m h_m) - \overline\Gamma(x, w, \tau, \theta_0)}{t_m}
&= \frac{q(x, w)'[\gamma_0 + t_m h_{2m}](\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'\gamma_0(\overline S(x, w, \tau, \beta_0))}{t_m} \\
&= \frac{q(x, w)'\big( \gamma_0(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - \gamma_0(\overline S(x, w, \tau, \beta_0)) \big)}{t_m} + q(x, w)' h_{2m}(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})).
\end{align*}
Consider the first term. By A2, $\gamma_0(u)$ is differentiable for any $u \in [\varepsilon, 1-\varepsilon]$. By the chain rule,
\[
\frac{q(x, w)'\big[ \gamma_0(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - \gamma_0(\overline S(x, w, \tau, \beta_0)) \big]}{t_m} \to q(x, w)'\gamma_0'(\overline S(x, w, \tau, \beta_0))\, \overline T(x, w, \tau, \beta_0, h_1, 0)
\]
as $m \to \infty$. Next consider the second term. We have
\begin{align*}
&|q(x, w)'h_{2m}(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'h_2(\overline S(x, w, \tau, \beta_0))| \\
&\quad \leq |q(x, w)'h_{2m}(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'h_2(\overline S(x, w, \tau, \beta_0 + t_m h_{1m}))| \\
&\qquad + |q(x, w)'h_2(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'h_2(\overline S(x, w, \tau, \beta_0))| \\
&\quad \leq \|q(x, w)\| \cdot \|h_{2m} - h_2\|_\infty + |q(x, w)'h_2(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'h_2(\overline S(x, w, \tau, \beta_0))|.
\end{align*}
The first inequality follows by the triangle inequality. Consider the second inequality. By continuity of $h_2(\cdot)$ and of $\overline S(x, w, \tau, \cdot)$, the second term converges to zero as $m \to \infty$. By sup-norm convergence of $h_{2m}$ to $h_2$, the first term also converges to zero as $m \to \infty$. Thus
\[
q(x, w)'h_{2m}(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) \to q(x, w)'h_2(\overline S(x, w, \tau, \beta_0))
\]
as $m \to \infty$. Putting the two terms together gives
\[
\frac{\overline\Gamma(x, w, \tau, \theta_0 + t_m h_m) - \overline\Gamma(x, w, \tau, \theta_0)}{t_m} \to q(x, w)'\gamma_0'(\overline S(x, w, \tau, \beta_0))\, \overline T(x, w, \tau, \beta_0, h_1, 0) + q(x, w)'h_2(\overline S(x, w, \tau, \beta_0)) \equiv \overline\Gamma{}'_\theta(x, w, \tau, h).
\]
Thus $\overline\Gamma(x, w, \tau, \cdot)$ is HDD at $\theta_0$ tangentially to $\mathbb R^{d_W} \times C([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$.

Part 2: The lower bound is HDD. An analogous argument applies to the lower bound $\underline\Gamma(x, w, \tau, \cdot)$. This gives
\[
\frac{\underline\Gamma(x, w, \tau, \theta_0 + t_m h_m) - \underline\Gamma(x, w, \tau, \theta_0)}{t_m} \to q(x, w)'\gamma_0'(\underline S(x, w, \tau, \beta_0))\, \underline T(x, w, \tau, \beta_0, h_1, 0) + q(x, w)'h_2(\underline S(x, w, \tau, \beta_0)) \equiv \underline\Gamma{}'_\theta(x, w, \tau, h)
\]
where
\[
\underline S_0(x, w, \tau, \beta) = \max\left\{ \tau - \frac{c}{L(x, w'\beta)} \min\{\tau, 1 - \tau\},\ \frac{\tau - 1}{L(x, w'\beta)} + 1,\ \varepsilon \right\},
\qquad
\underline S(x, w, \tau, \beta) = \min\{\underline S_0(x, w, \tau, \beta), 1 - \varepsilon\},
\]
and
\[
\underline T_0(x, w, \tau, \beta_0, h, 0) = \lim_{m \to \infty} \frac{\underline S_0(x, w, \tau, \beta_0 + t_m h_m) - \underline S_0(x, w, \tau, \beta_0)}{t_m},
\qquad
\underline T(x, w, \tau, \beta_0, h, 0) = \lim_{m \to \infty} \frac{\underline S(x, w, \tau, \beta_0 + t_m h_m) - \underline S(x, w, \tau, \beta_0)}{t_m}.
\]
We give explicit expressions for these limits in appendix C.
Part 3: Apply the delta method. We have shown that $\overline\Gamma(x, w, \tau, \cdot)$ and $\underline\Gamma(x, w, \tau, \cdot)$ are HDD at $\theta_0$. Moreover, by A1 and A4–A6, lemma 1 gives $\sqrt n(\widehat\theta - \theta_0) \rightsquigarrow Z_1$. Thus the delta method for HDD functionals (theorem 2.1 in Fang and Santos 2019) gives
\[
\sqrt n \begin{pmatrix} \overline\Gamma(x, w, \tau, \widehat\theta) - \overline\Gamma(x, w, \tau, \theta_0) \\[1ex] \underline\Gamma(x, w, \tau, \widehat\theta) - \underline\Gamma(x, w, \tau, \theta_0) \end{pmatrix} \overset{d}{\longrightarrow} \begin{pmatrix} \overline\Gamma{}'_\theta(x, w, \tau, Z_1) \\[1ex] \underline\Gamma{}'_\theta(x, w, \tau, Z_1) \end{pmatrix} \equiv Z_2(x, w, \tau).
\]
This convergence is uniform in $x \in \{0, 1\}$.

Finally, the CQTE bounds are just differences of certain conditional quantile function bounds. Thus we immediately get
\[
\sqrt n \begin{pmatrix} \widehat{\overline{\mathrm{CQTE}}}{}^{\,c}(\tau \mid w) - \overline{\mathrm{CQTE}}{}^{\,c}_\varepsilon(\tau \mid w) \\[1ex] \widehat{\underline{\mathrm{CQTE}}}{}^{\,c}(\tau \mid w) - \underline{\mathrm{CQTE}}{}^{\,c}_\varepsilon(\tau \mid w) \end{pmatrix} \overset{d}{\longrightarrow} \begin{pmatrix} Z_2^{(1)}(1, w, \tau) - Z_2^{(2)}(0, w, \tau) \\[1ex] Z_2^{(2)}(1, w, \tau) - Z_2^{(1)}(0, w, \tau) \end{pmatrix} \equiv Z_{\mathrm{CQTE}}(w, \tau).
\]
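For intuition, the trimmed quantile shifts $\overline S$ and $\underline S$ used throughout this proof are simple scalar maps; the sketch below transcribes the definitions above, with `L` the treatment probability $L(x, w'\beta)$:

```python
def s_upper(tau, L, c, eps):
    """Trimmed upper quantile shift: max{min{tau + c*min(tau, 1-tau)/L,
    tau/L, 1-eps}, eps}, following the definitions in this proof."""
    s0 = min(tau + c * min(tau, 1.0 - tau) / L, tau / L, 1.0 - eps)
    return max(s0, eps)

def s_lower(tau, L, c, eps):
    """Trimmed lower quantile shift: min{max{tau - c*min(tau, 1-tau)/L,
    (tau - 1)/L + 1, eps}, 1-eps}."""
    s0 = max(tau - c * min(tau, 1.0 - tau) / L, (tau - 1.0) / L + 1.0, eps)
    return min(s0, 1.0 - eps)

# at c = 0 both shifts leave tau unchanged, so the bounds collapse
assert abs(s_upper(0.3, L=0.6, c=0.0, eps=0.01) - 0.3) < 1e-12
assert abs(s_lower(0.3, L=0.6, c=0.0, eps=0.01) - 0.3) < 1e-12
```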
B.2 A Useful Lemma

The following is a technical lemma that we will use a few times in the upcoming proofs.

Lemma 5 (Min and Max are Lipschitz). The following hold for any $(x_1, \ldots, x_n), (y_1, \ldots, y_n) \in \mathbb R^n$:
\[
|\min\{x_1, \ldots, x_n\} - \min\{y_1, \ldots, y_n\}| \leq \sum_{i=1}^n |x_i - y_i|
\qquad \text{and} \qquad
|\max\{x_1, \ldots, x_n\} - \max\{y_1, \ldots, y_n\}| \leq \sum_{i=1}^n |x_i - y_i|.
\]

Proof of lemma 5. We proceed by induction over $n \geq 1$. The inequalities trivially hold for $n = 1$. First, consider the minimum function and let $n = 2$. Consider the case where $x_1 \leq x_2$ and $y_1 \leq y_2$. Then
\[
|\min\{x_1, x_2\} - \min\{y_1, y_2\}| = |x_1 - y_1| \leq |x_1 - y_1| + |x_2 - y_2|.
\]
Now consider the case where $x_1 \leq x_2$ and $y_1 \geq y_2$. Then
\[
\min\{x_1, x_2\} - \min\{y_1, y_2\} = x_1 - y_2 \leq x_2 - y_2 \leq |x_1 - y_1| + |x_2 - y_2|
\]
and
\[
\min\{x_1, x_2\} - \min\{y_1, y_2\} = x_1 - y_2 \geq x_1 - y_1 \geq -|x_1 - y_1| - |x_2 - y_2|.
\]
Hence
\[
|\min\{x_1, x_2\} - \min\{y_1, y_2\}| \leq |x_1 - y_1| + |x_2 - y_2|.
\]
To exhaust all cases, we also consider the cases where ($x_1 \geq x_2$, $y_1 \geq y_2$) and where ($x_1 \geq x_2$, $y_1 \leq y_2$). By symmetry across cases, the Lipschitz inequality for the minimum holds when $n = 2$. Now suppose it holds for $n - 1$. Then
\begin{align*}
|\min\{x_1, \ldots, x_n\} - \min\{y_1, \ldots, y_n\}|
&= |\min\{\min\{x_1, \ldots, x_{n-1}\}, x_n\} - \min\{\min\{y_1, \ldots, y_{n-1}\}, y_n\}| \\
&\leq |\min\{x_1, \ldots, x_{n-1}\} - \min\{y_1, \ldots, y_{n-1}\}| + |x_n - y_n| \\
&\leq \sum_{i=1}^{n-1} |x_i - y_i| + |x_n - y_n| = \sum_{i=1}^n |x_i - y_i|.
\end{align*}
Therefore it holds for all $n \geq 1$. Noting that $\max\{x_1, \ldots, x_n\} = -\min\{-x_1, \ldots, -x_n\}$, this inequality applies to the maximum as well.

B.3 The CATE Bounds
Proof of proposition 1.
Part 1: The upper bound is HDD.
We first show that the mappingΓ ( x, w, · ) : R d W × (cid:96) ∞ ([ ε, − ε ] , R d q ) → R
48s HDD at θ tangentially to R d W × C ([ ε, − ε ] , R d q ). Recall its definition:Γ ( x, w, θ ) = (cid:90) Γ ( x, w, τ, θ ) dτ. We will use the dominated convergence theorem to show thatΓ (cid:48) ,θ ( x, w, h ) = (cid:90) Γ (cid:48) ,θ ( x, w, τ, h ) dτ. For δ > G δ = { γ ∈ G : (cid:107) γ − γ (cid:107) ∞ ≤ δ } and Θ δ = B δ × G δ . To show dominated convergence can be applied, we first show that the mapping Γ ( x, w, τ, θ ) isLipschitz in θ ∈ Θ δ for some δ >
0. To see this, let (cid:101) θ, θ ∈ Θ δ . Then | Γ ( x, w, τ, (cid:101) θ ) − Γ ( x, w, τ, θ ) | = (cid:12)(cid:12)(cid:12) q ( x, w ) (cid:48) (cid:16)(cid:101) γ ( S ( x, w, τ, (cid:101) β )) − γ ( S ( x, w, τ, (cid:101) β )) (cid:17) + q ( x, w ) (cid:48) (cid:16) γ ( S ( x, w, τ, (cid:101) β )) − γ ( S ( x, w, τ, β )) (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:107) q ( x, w ) (cid:107) · (cid:107) (cid:101) γ − γ (cid:107) ∞ + (cid:107) q ( x, w ) (cid:48) γ (cid:48) ( ¯ S ) (cid:107) · (cid:107) S ( x, w, τ, (cid:101) β ) − S ( x, w, τ, β ) (cid:107) . The last line follows by a Taylor expansion, where ¯ S is on the line segment connecting S ( x, w, τ, (cid:101) β )and S ( x, w, τ, β ). By γ ∈ C Bm,ν ([ ε, − ε ]) d q for m ≥ (cid:107) γ (cid:48) (cid:107) ∞ ≤ B . Hence (cid:107) q ( x, w ) (cid:48) γ (cid:48) ( ¯ S )) (cid:107) ≤ (cid:107) q ( x, w ) (cid:107) · (cid:107) γ (cid:48) ( ¯ S ) (cid:107) ≤ (cid:107) q ( x, w ) (cid:107) B. Next, | S ( x, w, τ, (cid:101) β ) − S ( x, w, τ, β ) | = | max { S ( x, w, τ, (cid:101) β ) , ε } − max { S ( x, w, τ, β ) , ε }|≤ | S ( x, w, τ, (cid:101) β ) − S ( x, w, τ, β ) |≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) τ + c min { τ, − τ } L ( x, w (cid:48) (cid:101) β ) − (cid:18) τ + c min { τ, − τ } L ( x, w (cid:48) β ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) τL ( x, w (cid:48) (cid:101) β ) − τL ( x, w (cid:48) β ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = ( τ + c min { τ, − τ } ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) L ( x, w (cid:48) (cid:101) β ) − L ( x, w (cid:48) β ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ( τ + c min { τ, − τ } ) sup β ∈B δ (cid:13)(cid:13)(cid:13)(cid:13) L β ( x, w (cid:48) β ) L ( x, w (cid:48) β ) (cid:13)(cid:13)(cid:13)(cid:13) (cid:107) (cid:101) β − β (cid:107) . The second and third lines follow from lemma 5. The last line follows by a Taylor expansion. Tosee that sup β ∈B δ (cid:13)(cid:13) L β ( x, w (cid:48) β ) /L ( x, w (cid:48) β ) (cid:13)(cid:13) < ∞ , writesup β ∈B δ (cid:13)(cid:13)(cid:13)(cid:13) L β ( x, w (cid:48) β ) L ( x, w (cid:48) β ) (cid:13)(cid:13)(cid:13)(cid:13) = sup β ∈B δ (cid:13)(cid:13)(cid:13)(cid:13) (2 x − F (cid:48) ( w (cid:48) β ) wxF ( w (cid:48) β ) + (1 − x )(1 − F ( w (cid:48) β )) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) w (cid:107) sup a ∈ R | F (cid:48) ( a ) | xF (inf β ∈ β w (cid:48) β ) + (1 − x )(1 − F (sup β ∈ β w (cid:48) β )) < ∞ . { w (cid:48) β : β ∈ B δ } is bounded for fixed w , and since F (cid:48) ( a ) is uniformlybounded by assumption A4.3. Thus | Γ ( x, w, τ, (cid:101) θ ) − Γ ( x, w, τ, θ ) |≤ (cid:107) q ( x, w ) (cid:107) · (cid:107) (cid:101) γ − γ (cid:107) ∞ + (cid:107) q ( x, w ) (cid:107) B ( τ + c min { τ, − τ } ) sup β ∈B δ (cid:13)(cid:13)(cid:13)(cid:13) L β ( x, w (cid:48) β ) L ( x, w (cid:48) β ) (cid:13)(cid:13)(cid:13)(cid:13) (cid:107) (cid:101) β − β (cid:107) . Hence Γ ( x, w, τ, θ ) is Lipschitz in θ . 
Therefore,
\[ \frac{\Gamma_1^{(1)}(x,w,\tau,\theta_0 + t_m h_m) - \Gamma_1^{(1)}(x,w,\tau,\theta_0)}{t_m} \]
is dominated by
\[
\|q(x,w)\| \cdot \|h_{2m}\|_\infty + \|q(x,w)\| B (\tau + c\min\{\tau,1-\tau\}) \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| \, \|h_{1m}\|
\le \|q(x,w)\| (\|h_2\|_\infty + \lambda) + \|q(x,w)\| B (\tau + c\min\{\tau,1-\tau\}) \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| (\|h_1\| + \lambda) < \infty.
\]
In the second line, $\lambda > 0$ is a fixed constant; the inequality holds for $m$ sufficiently large, since $h_m$ converges to $h$. Moreover, note that this dominating function is integrable over $\tau \in (0,1)$, and the integrand converges pointwise (in $\tau$) to $\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$. Hence, by the dominated convergence theorem,
\[
\frac{\Gamma_2^{(1)}(x,w,\theta_0 + t_m h_m) - \Gamma_2^{(1)}(x,w,\theta_0)}{t_m}
= \int_0^1 \frac{\Gamma_1^{(1)}(x,w,\tau,\theta_0 + t_m h_m) - \Gamma_1^{(1)}(x,w,\tau,\theta_0)}{t_m}\, d\tau
\to \int_0^1 \Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)\, d\tau
= q(x,w)' \int_0^1 h_2(S^{(1)}(x,w,\tau,\beta_0))\, d\tau + q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta_0))\, \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0)\, d\tau
\equiv \Gamma_{2,\theta_0}'^{(1)}(x,w,h).
\]
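Before turning to the lower bound, here is a minimal numerical sketch (not the paper's code) of what HDD delivers: the difference quotient of $\Gamma_2^{(1)}(x,w,\theta)$ along a fixed direction $h = (h_1, h_2)$ settles down as $t \to 0$. The logistic link, scalar covariate, $q(x,w) = 1$, and $\gamma_0(u) = u^2$ are all illustrative assumptions.

```python
import numpy as np

c, eps = 0.1, 0.05
F = lambda a: 1.0 / (1.0 + np.exp(-a))  # assumed logistic link

def S1(x, w, tau, beta):
    # Upper-bound rank: min of the c-dependence bound, the no-assumptions
    # bound, and 1 - eps, floored at eps (as in the proof above).
    p = F(w * beta) if x == 1 else 1.0 - F(w * beta)
    s = np.minimum(np.minimum(tau + c * np.minimum(tau, 1 - tau) / p, tau / p),
                   1 - eps)
    return np.maximum(s, eps)

def Gamma2(x, w, beta, gamma, n_tau=200_000):
    tau = (np.arange(n_tau) + 0.5) / n_tau   # midpoint rule on (0, 1)
    return np.mean(gamma(S1(x, w, tau, beta)))  # q(x, w) = 1 here

x, w, beta0 = 1, 0.5, 0.3
gamma0 = lambda u: u ** 2
h1, h2 = 1.0, np.sin                          # directions for (beta, gamma)

base = Gamma2(x, w, beta0, gamma0)
for t in [1e-2, 1e-3, 1e-4]:
    pert = Gamma2(x, w, beta0 + t * h1, lambda u: gamma0(u) + t * h2(u))
    print(t, (pert - base) / t)               # quotients stabilize at the HDD
```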
Part 2: The lower bound is HDD.

We can similarly show that
\[
\frac{\Gamma_2^{(2)}(x,w,\theta_0 + t_m h_m) - \Gamma_2^{(2)}(x,w,\theta_0)}{t_m}
\to q(x,w)' \int_0^1 h_2(S^{(2)}(x,w,\tau,\beta_0))\, d\tau + q(x,w)' \int_0^1 \gamma_0'(S^{(2)}(x,w,\tau,\beta_0))\, \bar T^{(2)}(x,w,\tau,\beta_0,h_1,0)\, d\tau
\equiv \Gamma_{2,\theta_0}'^{(2)}(x,w,h).
\]
Part 3: Apply the delta method.

The functional delta method for HDD functionals now implies that, uniformly in $x \in \{0,1\}$,
\[
\sqrt{n}\begin{pmatrix} \Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0) \\ \Gamma_2^{(2)}(x,w,\widehat\theta) - \Gamma_2^{(2)}(x,w,\theta_0) \end{pmatrix}
\xrightarrow{d} \begin{pmatrix} \Gamma_{2,\theta_0}'^{(1)}(x,w,Z_1) \\ \Gamma_{2,\theta_0}'^{(2)}(x,w,Z_1) \end{pmatrix} \equiv Z_3(x,w)
\]
and hence
\[
\sqrt{n}\begin{pmatrix} \widehat{\overline{\mathrm{CATE}}}{}^c(w) - \overline{\mathrm{CATE}}{}^c_\varepsilon(w) \\ \widehat{\underline{\mathrm{CATE}}}{}^c(w) - \underline{\mathrm{CATE}}{}^c_\varepsilon(w) \end{pmatrix}
\xrightarrow{d} \begin{pmatrix} Z_3^{(1)}(1,w) - Z_3^{(2)}(0,w) \\ Z_3^{(2)}(1,w) - Z_3^{(1)}(0,w) \end{pmatrix} \equiv Z_{\mathrm{CATE}}(w).
\]
B.4 The ATE Bounds

Proof of theorem 1.
Part 1: The expectation upper bound. Write
\[
\sqrt{n}(\widehat{\overline{E}}{}^c_x - \overline{E}{}^c_{x,\varepsilon})
= \sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n \Gamma_2^{(1)}(x,W_i,\widehat\theta) - \int_{\mathcal{W}} \Gamma_2^{(1)}(x,w,\theta_0)\, dF_W(w) \Big)
= \sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n \big(\Gamma_2^{(1)}(x,W_i,\widehat\theta) - \Gamma_2^{(1)}(x,W_i,\theta_0)\big) - \int_{\mathcal{W}} \big(\Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0)\big)\, dF_W(w) \Big)
+ \frac{1}{\sqrt{n}}\sum_{i=1}^n \big( \Gamma_2^{(1)}(x,W_i,\theta_0) - E[\Gamma_2^{(1)}(x,W,\theta_0)] \big)
+ \sqrt{n}\big( \Gamma_3^{(1)}(x,\widehat\theta) - \Gamma_3^{(1)}(x,\theta_0) \big),
\]
where $\Gamma_3^{(1)}(x,\theta) = \int_{\mathcal{W}} \Gamma_2^{(1)}(x,w,\theta)\, dF_W(w)$. There are three terms here. We'll show that the first is $o_p(1)$ and that the second and third contribute to the asymptotic distribution.

Step 1.
We'll begin by showing that the first term is $o_p(1)$. For some $\delta > 0$, consider the class of functions
\[ \mathcal{F} = \{ \Gamma_2^{(1)}(x,w,\theta) : \theta \in \Theta_\delta \}. \]
As in the proof of proposition 1, we will show that $\Gamma_2^{(1)}(x,w,\theta)$ is Lipschitz in $\theta$. Let $\tilde\theta, \theta \in \Theta_\delta$. Then
\[
|\Gamma_2^{(1)}(x,w,\tilde\theta) - \Gamma_2^{(1)}(x,w,\theta)|
= \Big| \int_0^1 \Gamma_1^{(1)}(x,w,\tau,\tilde\theta)\, d\tau - \int_0^1 \Gamma_1^{(1)}(x,w,\tau,\theta)\, d\tau \Big|
\le \|q(x,w)\| \int_0^1 \|\tilde\gamma - \gamma\|_\infty\, d\tau + \|q(x,w)\| B \int_0^1 (\tau + c\min\{\tau,1-\tau\})\, d\tau\, \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| \, \|\tilde\beta - \beta\|
= \|q(x,w)\| \Big( 1 + B\frac{2+c}{4} \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| \Big) \big( \|\tilde\gamma - \gamma\|_\infty + \|\tilde\beta - \beta\| \big)
\equiv K(w)\, \|\tilde\theta - \theta\|_\Theta.
\]
The second line follows by our derivations in the proof of proposition 1. In the last line we let $\|\theta\|_\Theta = \|\beta\| + \|\gamma\|_\infty$ and defined
\[ K(w) = \|q(x,w)\| \Big( 1 + B\frac{2+c}{4} \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| \Big). \]
Assumption A3 says that
\[ E\Big( \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2}\Big\|^4 \Big) < \infty \quad \text{and} \quad E(\|q(x,W)\|^4) < \infty. \]
These assumptions imply that $K(W)$ has a bounded second moment:
\[
E[K(W)^2] = E\Big[ \|q(x,W)\|^2 \Big( 1 + B\frac{2+c}{4} \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2}\Big\| \Big)^2 \Big]
\le E\big[\|q(x,W)\|^4\big]^{1/2} \times E\Big[ \Big( 1 + B\frac{2+c}{4} \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2}\Big\| \Big)^4 \Big]^{1/2} < \infty,
\]
where the second line follows by the Cauchy-Schwarz inequality. Thus $\Gamma_2^{(1)}(x,w,\theta)$ is Lipschitz in $\theta$. This lets us apply theorem 2.7.11 in van der Vaart and Wellner (1996) to see that the bracketing number $N_{[\,]}(2\epsilon E[K(W)^2]^{1/2}, \mathcal{F}, L_2(P))$ is bounded above by
\[ N(\epsilon, \mathcal{B}_\delta \times \mathcal{G}_\delta, \|\cdot\|_\Theta) \le N(\epsilon, \mathcal{B}_\delta, \|\cdot\|) + N(\epsilon, \mathcal{G}_\delta, \|\cdot\|_\infty). \]
By example 19.7 of van der Vaart (2000), $N(\epsilon, \mathcal{B}_\delta, \|\cdot\|) \lesssim \epsilon^{-d_W}$ for small enough $\epsilon$. By theorem 2.7.1 in van der Vaart and Wellner (1996), $N(\epsilon, \mathcal{G}_\delta, \|\cdot\|_\infty) \lesssim \exp(\epsilon^{-1/(m+\nu)})$. So
\[ \int_0^\delta \sqrt{\log N_{[\,]}(2\epsilon E[K(W)^2]^{1/2}, \mathcal{F}, L_2(P))}\, d\epsilon < \infty. \]
Hence $\mathcal{F}$ is Donsker. By convergence of $\widehat\theta$ to $\theta_0$ (lemma 1) and the Lipschitz property of $\Gamma_2^{(1)}$, we have
\[ \int_{\mathcal{W}} \big| \Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0) \big|^2\, dF_W(w) \le E[K(W)^2] \cdot \|\widehat\theta - \theta_0\|_\Theta^2 = o_p(1). \]
Therefore, by lemma 19.24 in van der Vaart (2000),
\[ \sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n \big(\Gamma_2^{(1)}(x,W_i,\widehat\theta) - \Gamma_2^{(1)}(x,W_i,\theta_0)\big) - \int_{\mathcal{W}} \big(\Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0)\big)\, dF_W(w) \Big) = o_p(1). \]

Step 2.
Next consider the second term. First note that
\[
E[\Gamma_2^{(1)}(x,W,\theta_0)^2]
= E\Big[ \Big| \int_0^1 q(x,W)'\gamma_0(S^{(1)}(x,W,\tau,\beta_0))\, d\tau \Big|^2 \Big]
\le E\Big[ \Big( \int_0^1 \|q(x,W)\| \cdot \|\gamma_0(S^{(1)}(x,W,\tau,\beta_0))\|\, d\tau \Big)^2 \Big]
\le E\Big[ \Big( \int_0^1 \|q(x,W)\| \sup_{u \in [\varepsilon, 1-\varepsilon]} \|\gamma_0(u)\|\, d\tau \Big)^2 \Big]
\le E(\|q(x,W)\|^2) B^2
< \infty.
\]
The first line follows by definition of $\Gamma_2^{(1)}$. The second line follows by the Cauchy-Schwarz inequality. The third line follows from the fact that $S^{(1)}$ lies between $\varepsilon$ and $1-\varepsilon$. The fourth line follows by A2. The last line follows by A6.4.

This result lets us apply a CLT to the second term. Combining that with the fact that the influence functions for the two first step estimators are Donsker (lemmas 3 and 4), we get
\[
\begin{pmatrix} \sqrt{n}(\widehat\theta - \theta_0) \\ \dfrac{1}{\sqrt{n}}\sum_{i=1}^n \big( \Gamma_2(x,W_i,\theta_0) - E[\Gamma_2(x,W,\theta_0)] \big) \end{pmatrix}
\rightsquigarrow \begin{pmatrix} Z_1 \\ \tilde Z_4(x) \end{pmatrix},
\]
a mean-zero Gaussian process in $\mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R}^2$. Notice here we use $\Gamma_2 = (\Gamma_2^{(1)}, \Gamma_2^{(2)})$, not just $\Gamma_2^{(1)}$, as preparation for parts 2 and 3.

Step 3.
Next, consider the third term. For this step we'll show that the mapping $\Gamma_3^{(1)}(x,\theta)$ is HDD at $\theta_0$ tangentially to $\mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$. This will let us apply the delta method for HDD functionals in the last step. To see that this functional is HDD, note that
\[ \frac{\Gamma_3^{(1)}(x,\theta_0 + t_m h_m) - \Gamma_3^{(1)}(x,\theta_0)}{t_m} = \int_{\mathcal{W}} \int_0^1 \frac{\Gamma_1^{(1)}(x,w,\tau,\theta_0 + t_m h_m) - \Gamma_1^{(1)}(x,w,\tau,\theta_0)}{t_m}\, d\tau\, dF_W(w). \]
By proposition 1, for $m$ large enough, the integrand is dominated by
\[ \|q(x,w)\| (\|h_2\|_\infty + \lambda) + \|q(x,w)\| B (\tau + c\min\{\tau,1-\tau\}) \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| (\|h_1\| + \lambda). \]
This expression has finite integral over $(\tau,w) \in (0,1) \times \mathcal{W}$ because $\|h\|_\Theta < \infty$, $E(\|q(x,W)\|^2) < \infty$, and
\[ E\Big( \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2}\Big\|^2 \Big) < \infty. \]
So we can apply dominated convergence to see that $\Gamma_3^{(1)}$ is HDD:
\[
\frac{\Gamma_3^{(1)}(x,\theta_0 + t_m h_m) - \Gamma_3^{(1)}(x,\theta_0)}{t_m}
\to \int_{\mathcal{W}} \Big[ q(x,w)' \int_0^1 h_2(S^{(1)}(x,w,\tau,\beta_0))\, d\tau + q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta_0))\, \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0)\, d\tau \Big] dF_W(w)
\equiv \Gamma_{3,\theta_0}'^{(1)}(x,h).
\]

Step 4.
Finally, putting all the previous steps together and applying the delta method for HDD functionals gives that, uniformly in $x \in \{0,1\}$,
\[
\sqrt{n}(\widehat{\overline{E}}{}^c_x - \overline{E}{}^c_{x,\varepsilon})
= o_p(1) + \frac{1}{\sqrt{n}}\sum_{i=1}^n \big( \Gamma_2^{(1)}(x,W_i,\theta_0) - E[\Gamma_2^{(1)}(x,W,\theta_0)] \big) + \sqrt{n}\big( \Gamma_3^{(1)}(x,\widehat\theta) - \Gamma_3^{(1)}(x,\theta_0) \big)
\xrightarrow{d} \tilde Z_4^{(1)}(x) + \Gamma_{3,\theta_0}'^{(1)}(x, Z_1) \equiv Z_4^{(1)}(x).
\]

Part 2: The expectation lower bound.
An identical argument can be applied to show that
\[ \sqrt{n}(\widehat{\underline{E}}{}^c_x - \underline{E}{}^c_{x,\varepsilon}) \xrightarrow{d} \tilde Z_4^{(2)}(x) + \Gamma_{3,\theta_0}'^{(2)}(x, Z_1) \equiv Z_4^{(2)}(x), \]
where
\[ \Gamma_{3,\theta_0}'^{(2)}(x,h) = \int_{\mathcal{W}} \Gamma_{2,\theta_0}'^{(2)}(x,w,h)\, dF_W(w). \]

Part 3: Putting them together.
Finally, the analysis in parts 1 and 2 can be combined to obtain joint convergence:
\[
\sqrt{n}\begin{pmatrix} \widehat{\overline{\mathrm{ATE}}}{}^c - \overline{\mathrm{ATE}}{}^c_\varepsilon \\ \widehat{\underline{\mathrm{ATE}}}{}^c - \underline{\mathrm{ATE}}{}^c_\varepsilon \end{pmatrix}
\xrightarrow{d} \begin{pmatrix} Z_4^{(1)}(1) - Z_4^{(2)}(0) \\ Z_4^{(2)}(1) - Z_4^{(1)}(0) \end{pmatrix} \equiv Z_{\mathrm{ATE}}.
\]
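For concreteness, here is a sketch of the plug-in bound estimator whose asymptotics theorem 1 describes: average over the $W_i$ of the $\tau$-integral of $q(x,W_i)'\widehat\gamma$ evaluated at the shifted rank $S^{(1)}$. The stand-in first-step estimates (logit coefficients `beta_hat` and a toy rank-to-coefficient map `gamma_hat`) are assumptions for illustration, not the paper's actual estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, eps = 500, 0.1, 0.05
W = np.column_stack([np.ones(n), rng.normal(size=n)])  # covariates incl. constant
beta_hat = np.array([0.2, 0.8])                         # stand-in logit fit
gamma_hat = lambda u: np.column_stack([np.ones_like(u), u])  # toy d_q = 2 map

F = lambda a: 1.0 / (1.0 + np.exp(-a))

def S1(x, w_beta, tau):
    # Upper-bound rank, floored at eps, as in the preceding proofs.
    p = F(w_beta) if x == 1 else 1.0 - F(w_beta)
    s = np.minimum(np.minimum(tau + c * np.minimum(tau, 1 - tau) / p, tau / p),
                   1 - eps)
    return np.maximum(s, eps)

def E_upper(x):
    # (1/n) sum_i Gamma_2^(1)(x, W_i, theta_hat), integral by midpoint rule
    tau = (np.arange(2000) + 0.5) / 2000
    q = W                                               # q(x, w) = w (assumption)
    total = 0.0
    for i in range(n):
        g = gamma_hat(S1(x, W[i] @ beta_hat, tau))      # (2000, d_q)
        total += np.mean(g @ q[i])                      # integral over tau
    return total / n

print("upper bound component for E[Y(1)]:", E_upper(1))
```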
B.5 The ATT Bounds

Proof of proposition 2.
Note that in this proof we'll use several of the results we derived in the proof of proposition 1. By $\mathrm{var}(Y \mathbb{1}(X=x)) < \infty$ and $\mathrm{var}(\mathbb{1}(X=x)) < \infty$, we have that
\[
\sqrt{n}\big( \widehat E(Y \mid X=x) - E(Y \mid X=x) \big)
= \frac{1}{p_x}\sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n Y_i \mathbb{1}(X_i = x) - E(Y\mathbb{1}(X=x)) \Big)
- \frac{E(Y \mid X=x)}{p_x}\sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n \mathbb{1}(X_i = x) - p_x \Big) + o_p(1)
\xrightarrow{d} Z_{E(Y|X=x)},
\]
a mean-zero Gaussian variable. Also, $\sqrt{n}(\widehat p_x - p_x) \xrightarrow{d} Z_{p_x}$, another mean-zero Gaussian variable. As before the influence functions are Donsker and hence we can stack them to obtain
\[
\begin{pmatrix} \sqrt{n}(\widehat\theta - \theta_0) \\ \dfrac{1}{\sqrt{n}}\sum_{i=1}^n \big(\Gamma_2(x,W_i,\theta_0) - E[\Gamma_2(x,W,\theta_0)]\big) \\ \sqrt{n}\big(\widehat E(Y \mid X=x) - E(Y \mid X=x)\big) \\ \sqrt{n}(\widehat p_x - p_x) \end{pmatrix}
\rightsquigarrow \begin{pmatrix} Z_1 \\ \tilde Z_4(x) \\ Z_{E(Y|X=x)} \\ Z_{p_x} \end{pmatrix},
\]
a mean-zero Gaussian process in $\mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R}^4$, for $x \in \{0,1\}$ and whose covariance kernel can be calculated as in the proof of theorem 1. By the delta method,
\[
\sqrt{n}\begin{pmatrix} \widehat{\overline{\mathrm{ATT}}}{}^c - \overline{\mathrm{ATT}}{}^c_\varepsilon \\ \widehat{\underline{\mathrm{ATT}}}{}^c - \underline{\mathrm{ATT}}{}^c_\varepsilon \end{pmatrix}
\xrightarrow{d}
\begin{pmatrix}
Z_{E(Y|X=1)} - \dfrac{Z_4^{(2)}(0)}{p_1} + \dfrac{p_0}{p_1} Z_{E(Y|X=0)} + \dfrac{E(Y \mid X=0)}{p_1} Z_{p_0} + \dfrac{\underline{E}{}^c_{0,\varepsilon} - E(Y \mid X=0)p_0}{p_1^2} Z_{p_1} \\
Z_{E(Y|X=1)} - \dfrac{Z_4^{(1)}(0)}{p_1} + \dfrac{p_0}{p_1} Z_{E(Y|X=0)} + \dfrac{E(Y \mid X=0)}{p_1} Z_{p_0} + \dfrac{\overline{E}{}^c_{0,\varepsilon} - E(Y \mid X=0)p_0}{p_1^2} Z_{p_1}
\end{pmatrix}
\equiv Z_{\mathrm{ATT}},
\]
a random vector in $\mathbb{R}^2$.

C Estimating the Analytical Hadamard Directional Derivatives
In this section we give formulas for our estimators of the analytical Hadamard directional derivatives (HDD) used in the CQTE, CATE, ATE, and ATT functionals. In these estimators we use a tuning parameter $\kappa_n \ge 0$. This parameter acts as a slackness value, which lets us estimate when certain equalities hold in the population, but which may not hold exactly in finite samples. We assume $\kappa_n \to 0$ and $\sqrt{n}\kappa_n \to \infty$ as $n \to \infty$.

We estimate $\Gamma_{1,\theta_0}'(x,w,\tau,h)$ by $\widehat\Gamma_{1,\theta_0}'(x,w,\tau,h)$, where
\[
\widehat\Gamma_{1,\theta_0}'(x,w,\tau,h) = \begin{pmatrix}
q(x,w)' h_2(S^{(1)}(x,w,\tau,\widehat\beta)) + q(x,w)' \widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta)) \cdot \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \\
q(x,w)' h_2(S^{(2)}(x,w,\tau,\widehat\beta)) + q(x,w)' \widehat\gamma'(S^{(2)}(x,w,\tau,\widehat\beta)) \cdot \bar T^{(2)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)
\end{pmatrix}.
\]
Note that $q(x,w)'$ refers to the transpose of $q(x,w)$ while $\widehat\gamma'(\cdot)$ refers to our estimator of the derivative of $\gamma_0(\cdot)$, defined in equation (12) on page 39. We defined the functions $S^{(1)}$ and $S^{(2)}$ in the proof of proposition 5. We define $\bar T^{(1)}$ and $\bar T^{(2)}$ below. Estimate $\Gamma_{2,\theta_0}'(x,w,h)$ by
\[ \widehat\Gamma_{2,\theta_0}'(x,w,h) = \begin{pmatrix} \int_0^1 \widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)\, d\tau \\ \int_0^1 \widehat\Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h)\, d\tau \end{pmatrix}. \]
Estimate $\Gamma_{3,\theta_0}'(x,h)$ by
\[ \widehat\Gamma_{3,\theta_0}'(x,h) = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^n \widehat\Gamma_{2,\theta_0}'^{(1)}(x,W_i,h) \\ \frac{1}{n}\sum_{i=1}^n \widehat\Gamma_{2,\theta_0}'^{(2)}(x,W_i,h) \end{pmatrix}. \]
Throughout the paper we will also use the vector notation $\widehat\Gamma_{j,\theta_0}' = (\widehat\Gamma_{j,\theta_0}'^{(1)}, \widehat\Gamma_{j,\theta_0}'^{(2)})$ and $\Gamma_{j,\theta_0}' = (\Gamma_{j,\theta_0}'^{(1)}, \Gamma_{j,\theta_0}'^{(2)})$ for $j = 1, 2, 3$.
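Schematically, the three estimators are nested: $\widehat\Gamma_1'$ is pointwise in $(\tau, w)$, $\widehat\Gamma_2'$ integrates it over $\tau$, and $\widehat\Gamma_3'$ averages over the sample. A minimal Python sketch of this nesting, where `gamma1_prime` is a placeholder for the formula displayed above:

```python
import numpy as np

def gamma2_prime(gamma1_prime, x, w, h, n_tau=1000):
    # Integrate the pointwise HDD estimator over tau by the midpoint rule.
    tau = (np.arange(n_tau) + 0.5) / n_tau
    return np.mean([gamma1_prime(x, w, t, h) for t in tau], axis=0)

def gamma3_prime(gamma1_prime, x, W_sample, h):
    # Average the tau-integrated estimator over the observed W_i.
    return np.mean([gamma2_prime(gamma1_prime, x, w, h) for w in W_sample],
                   axis=0)
```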
Next we define the functions $T^{(1)}$ and $\bar T^{(1)}$. To do this, we'll also define functions $T^{(2)}$ and $\bar T^{(2)}$.

The function $T^{(1)}$

We first define $T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$. We do this by splitting it into seven different mutually exclusive cases. This lets us write
\[ T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) = \sum_{j=1}^{7} T^{(1)}_j(x,w,\tau,\beta,h_1) \cdot \mathbb{1}^{(1)}_j(x,w,\tau,\beta,\kappa_n). \qquad (13) \]
For these cases, it is helpful to recall our notation $L(x,w'\beta) = F(w'\beta)^x (1 - F(w'\beta))^{1-x}$. These seven cases are defined as follows:

1. Let
\[ \mathbb{1}^{(1)}_1(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} < \min\Big\{ \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big) \]
and
\[ T^{(1)}_1(x,w,\tau,\beta,h_1) = -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}. \]
2. Let
\[ \mathbb{1}^{(1)}_2(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \frac{\tau}{L(x,w'\beta)} < \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big) \]
and
\[ T^{(1)}_2(x,w,\tau,\beta,h_1) = -\tau\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}. \]

3. Let
\[ \mathbb{1}^{(1)}_3(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( 1-\varepsilon < \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)} \Big\} - \kappa_n \Big) \]
and $T^{(1)}_3(x,w,\tau,\beta,h_1) = 0$.

4. Let
\[ \mathbb{1}^{(1)}_4(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \frac{\tau}{L(x,w'\beta)} \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \max\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)} \Big\} < 1-\varepsilon - \kappa_n \Big) \]
and
\[ T^{(1)}_4(x,w,\tau,\beta,h_1) = \min\Big\{ -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; -\tau\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2} \Big\}. \]

5. Let
\[ \mathbb{1}^{(1)}_5(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - (1-\varepsilon) \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \max\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, 1-\varepsilon \Big\} < \frac{\tau}{L(x,w'\beta)} - \kappa_n \Big) \]
and
\[ T^{(1)}_5(x,w,\tau,\beta,h_1) = \min\Big\{ -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

6. Let
\[ \mathbb{1}^{(1)}_6(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \frac{\tau}{L(x,w'\beta)} - (1-\varepsilon) \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \max\Big\{ \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} < \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \kappa_n \Big) \]
and
\[ T^{(1)}_6(x,w,\tau,\beta,h_1) = \min\Big\{ -\tau\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

7. Let $\mathbb{1}^{(1)}_7(x,w,\tau,\beta,\kappa_n)$ equal 1 if at least two of the three following hold:
\[ \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - (1-\varepsilon) \Big| \le \kappa_n, \qquad \Big| \frac{\tau}{L(x,w'\beta)} - (1-\varepsilon) \Big| \le \kappa_n, \qquad \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \frac{\tau}{L(x,w'\beta)} \Big| \le \kappa_n, \]
and zero otherwise. Let
\[ T^{(1)}_7(x,w,\tau,\beta,h_1) = \min\Big\{ -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; -\tau\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

The function $\bar T^{(1)}$

Next we define $\bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ in terms of $T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ and two indicator functions. Recall from the proof of proposition 5 that
\[ S^{(1)}(x,w,\tau,\beta) = \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\}. \]
Then
\[ \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) = T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) \cdot \mathbb{1}\big( S^{(1)}(x,w,\tau,\beta) > \varepsilon + \kappa_n \big) + \max\{ T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n), 0 \} \cdot \mathbb{1}\big( |S^{(1)}(x,w,\tau,\beta) - \varepsilon| \le \kappa_n \big). \]

The function $T^{(2)}$

Next we define $T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n)$. Again we split this into seven cases, which lets us write this function as
\[ T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n) = \sum_{j=1}^{7} T^{(2)}_j(x,w,\tau,\beta,h_1) \cdot \mathbb{1}^{(2)}_j(x,w,\tau,\beta,\kappa_n), \qquad (14) \]
where the seven cases are defined as follows:

1. Let
\[ \mathbb{1}^{(2)}_1(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} > \max\Big\{ \frac{\tau-1}{L(x,w'\beta)} + 1, \varepsilon \Big\} + \kappa_n \Big) \]
and
\[ T^{(2)}_1(x,w,\tau,\beta,h_1) = c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}. \]
2. Let
\[ \mathbb{1}^{(2)}_2(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \frac{\tau-1}{L(x,w'\beta)} + 1 > \max\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \varepsilon \Big\} + \kappa_n \Big) \]
and
\[ T^{(2)}_2(x,w,\tau,\beta,h_1) = (1-\tau)\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}. \]

3. Let
\[ \mathbb{1}^{(2)}_3(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \varepsilon > \max\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau-1}{L(x,w'\beta)} + 1 \Big\} + \kappa_n \Big) \]
and $T^{(2)}_3(x,w,\tau,\beta,h_1) = 0$.

4. Let
\[ \mathbb{1}^{(2)}_4(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \Big( \frac{\tau-1}{L(x,w'\beta)} + 1 \Big) \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \min\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau-1}{L(x,w'\beta)} + 1 \Big\} > \varepsilon + \kappa_n \Big) \]
and
\[ T^{(2)}_4(x,w,\tau,\beta,h_1) = \max\Big\{ c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; (1-\tau)\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2} \Big\}. \]

5. Let
\[ \mathbb{1}^{(2)}_5(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \varepsilon \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \min\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \varepsilon \Big\} > \frac{\tau-1}{L(x,w'\beta)} + 1 + \kappa_n \Big) \]
and
\[ T^{(2)}_5(x,w,\tau,\beta,h_1) = \max\Big\{ c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

6. Let
\[ \mathbb{1}^{(2)}_6(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \frac{\tau-1}{L(x,w'\beta)} + 1 - \varepsilon \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \min\Big\{ \frac{\tau-1}{L(x,w'\beta)} + 1, \varepsilon \Big\} > \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} + \kappa_n \Big) \]
and
\[ T^{(2)}_6(x,w,\tau,\beta,h_1) = \max\Big\{ (1-\tau)\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

7. Let $\mathbb{1}^{(2)}_7(x,w,\tau,\beta,\kappa_n)$ equal 1 if at least two of the three following hold:
\[ \Big| \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \Big( \frac{\tau-1}{L(x,w'\beta)} + 1 \Big) \Big| \le \kappa_n, \qquad \Big| \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \varepsilon \Big| \le \kappa_n, \qquad \Big| \frac{\tau-1}{L(x,w'\beta)} + 1 - \varepsilon \Big| \le \kappa_n, \]
and zero otherwise. Let
\[ T^{(2)}_7(x,w,\tau,\beta,h_1) = \max\Big\{ c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; (1-\tau)\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

The function $\bar T^{(2)}$

Finally we define $\bar T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n)$ in terms of $T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n)$ and two indicator functions. Recall from the proof of proposition 5 that
\[ S^{(2)}(x,w,\tau,\beta) = \max\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau-1}{L(x,w'\beta)} + 1, \varepsilon \Big\}. \]
Then
\[ \bar T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n) = T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n) \cdot \mathbb{1}\big( S^{(2)}(x,w,\tau,\beta) < 1-\varepsilon - \kappa_n \big) + \min\{ T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n), 0 \} \cdot \mathbb{1}\big( |S^{(2)}(x,w,\tau,\beta) - (1-\varepsilon)| \le \kappa_n \big). \]
The sketch below illustrates this case selection.
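The seven mutually exclusive cases implement a $\kappa_n$-approximate argmin selection: when one branch of the min defining $S^{(1)}$ is strictly smallest by margin $\kappa_n$, its derivative is used; when branches tie within $\kappa_n$, the minimum of the tied derivatives is used. A compact sketch under that reading (the pairwise $\kappa_n$ comparisons above are replaced by a $\kappa_n$ margin around the minimum, which matches the seven cases up to boundary configurations; all inputs are placeholders consistent with the definitions above):

```python
import numpy as np

def T1(x, w, tau, beta, h1, kappa_n, F, Fprime, c, eps):
    # Directional derivative of S1 = min{a1, a2, 1 - eps} in direction h1,
    # with ties detected up to the slackness kappa_n.
    u = w @ beta
    p = F(u) if x == 1 else 1.0 - F(u)            # L(x, w'beta)
    Lb = (2 * x - 1) * Fprime(u) * w              # L_beta(x, w'beta)
    m = min(tau, 1 - tau)
    branches = [tau + c * m / p, tau / p, 1 - eps]  # arguments of the min
    derivs = [-c * m * (Lb @ h1) / p ** 2,          # d/dbeta of each branch
              -tau * (Lb @ h1) / p ** 2,
              0.0]
    smallest = min(branches)
    active = [d for b, d in zip(branches, derivs) if b <= smallest + kappa_n]
    return min(active)                              # cases 1-7 collapse to this
```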
D Proofs for Section 5

In this appendix we give the proofs for section 5. We start with a lemma about the HDD of $\Gamma_1$.

Lemma 6 ($\widehat\Gamma_{1,\theta_0}'(x,w,\tau,h)$ is Lipschitz in $h$). Suppose the assumptions of proposition 5 hold. Let $\kappa_n \to 0$, $\sqrt{n}\kappa_n \to \infty$, $\eta_n \to 0$, and $\sqrt{n}\eta_n \to \infty$ as $n \to \infty$. Then
\[ \big\| \widehat\Gamma_{1,\theta_0}'(x,w,\tau,\tilde h) - \widehat\Gamma_{1,\theta_0}'(x,w,\tau,h) \big\| \le K(x,w,\widehat\theta) \cdot \|\tilde h - h\|_\Theta \]
for a function $K(x,w,\widehat\theta) = O_p(1)$ defined below.
We will show that $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ is Lipschitz in $h$. The proof for the lower bound is similar. Let $h, \tilde h \in \mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$. Then
\[
|\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,\tilde h) - \widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)|
\le \|q(x,w)\| \cdot \|\tilde h_2(S^{(1)}(x,w,\tau,\widehat\beta)) - h_2(S^{(1)}(x,w,\tau,\widehat\beta))\| + |q(x,w)'\widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta))| \cdot |\bar T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|
\le \|q(x,w)\| \cdot \|\tilde h_2 - h_2\|_\infty + |q(x,w)'\widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta))| \cdot |\bar T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|. \qquad (15)
\]
The first line follows by the definition of $\widehat\Gamma_{1,\theta_0}'$, the triangle inequality, and the Cauchy-Schwarz inequality. Next,
\[ |q(x,w)'\widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta))| \le \|q(x,w)\| \cdot \|\widehat\gamma'\|_\infty \le \|q(x,w)\| \big( \|\gamma_0'\|_\infty + \|\widehat\gamma' - \gamma_0'\|_\infty \big) = O_p(1). \]
The last line follows since $\|\gamma_0'\|_\infty \le B$ (by $\gamma_0 \in \mathcal{G}$) and $\|\widehat\gamma' - \gamma_0'\|_\infty = o_p(1)$ (by lemma 2). Thus it suffices to show that $\bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)$ is Lipschitz in $h_1$. To see this, write
\[
|\bar T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|
\le |T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)| \cdot \mathbb{1}\big( S^{(1)}(x,w,\tau,\widehat\beta) > \varepsilon + \kappa_n \big)
+ |\max\{T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n),0\} - \max\{T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n),0\}| \cdot \mathbb{1}\big( |S^{(1)}(x,w,\tau,\widehat\beta) - \varepsilon| \le \kappa_n \big)
\le |T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)| + |\max\{T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n),0\} - \max\{T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n),0\}|
\le 2 \cdot |T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|.
\]
The first inequality follows by the definition of $\bar T^{(1)}$ and the triangle inequality. The last inequality follows from lemma 5. Thus it suffices to show that $T^{(1)}$ is Lipschitz in $h_1$. To see this, consider
\[
|T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|
\le \sum_{j=1}^{7} |T^{(1)}_j(x,w,\tau,\widehat\beta,\tilde h_1) - T^{(1)}_j(x,w,\tau,\widehat\beta,h_1)|
\le 4\Big| c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\widehat\beta)'(\tilde h_1 - h_1)}{L(x,w'\widehat\beta)^2} \Big| + 4\Big| \tau\frac{L_\beta(x,w'\widehat\beta)'(\tilde h_1 - h_1)}{L(x,w'\widehat\beta)^2} \Big|
\le 8\Big\| \frac{L_\beta(x,w'\widehat\beta)}{L(x,w'\widehat\beta)^2} \Big\| \cdot \|\tilde h_1 - h_1\|. \qquad (16)
\]
The second line uses the definition of the $T^{(1)}_j$ and repeated applications of lemma 5. The last line follows from $\tau \le 1$, $c\min\{\tau,1-\tau\} \le 1$, and the Cauchy-Schwarz inequality. We have
\[ \Big\| \frac{L_\beta(x,w'\widehat\beta)}{L(x,w'\widehat\beta)^2} \Big\| = O_p(1) \]
since $\sup_{\beta\in\mathcal{B}_\delta}\|L_\beta(x,w'\beta)/L(x,w'\beta)^2\| < \infty$ and $P(\widehat\beta \in \mathcal{B}_\delta) \to 1$. Thus $T^{(1)}$ is Lipschitz in $h_1$.

Overall, we have shown that $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ is Lipschitz in $h$ with Lipschitz constant equal to
\[ K^{(1)}(x,w,\widehat\theta) = \|q(x,w)\| \Big( 1 + 16 \cdot \|\widehat\gamma'\|_\infty \Big\| \frac{L_\beta(x,w'\widehat\beta)}{L(x,w'\widehat\beta)^2} \Big\| \Big) = O_p(1). \]
A similar argument can be used to show that $\widehat\Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h)$ is Lipschitz in $h$ with the same constant. Setting $K(x,w,\widehat\theta) = \big( K^{(1)}(x,w,\widehat\theta)^2 + K^{(2)}(x,w,\widehat\theta)^2 \big)^{1/2} = \sqrt{2} \cdot K^{(1)}(x,w,\widehat\theta)$ concludes the proof.

Next we prove proposition 3. This is our main result on the analytical bootstrap for mean potential outcomes. In section 5 we use this result to do bootstrap inference on our ATE bounds. There we discussed how the asymptotic distribution of our mean potential outcome bounds comes from two terms. The first term requires using HDDs while the second term is standard. We consider each term one at a time in the following two lemmas.

Lemma 7 (Non-standard component). Suppose the assumptions of proposition 1 hold. Suppose $\kappa_n \to 0$, $\sqrt{n}\kappa_n \to \infty$, $\eta_n \to 0$, and $\sqrt{n}\eta_n \to \infty$ as $n \to \infty$. Then
\[ \widehat\Gamma_{3,\theta_0}'\big( x, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \overset{P}{\rightsquigarrow} \Gamma_{3,\theta_0}'(x, Z_1). \]

Proof of lemma 7.
We prove this by applying theorem 3.2 in Fang and Santos (2019). To do this we must verify their assumptions 1-4.

1. Their assumption 1 requires $\Gamma_3(x,\theta)$ to be HDD. We showed this in the proof of theorem 1.
2. Their assumption 2 is about the asymptotic distribution of the first step estimator $\widehat\theta$. This holds by our lemma 1 in appendix A.
3. Their assumption 3 is about validity of the bootstrap for $\widehat\theta$. This holds by theorem 3.6.1 in van der Vaart and Wellner (1996).

Finally, in their remark 3.4, they note that sufficient conditions for their assumption 4 are:

1. (Smoothness) $\widehat\Gamma_{3,\theta_0}'(x,h)$ is Lipschitz in $h$.
2. (Consistency) $\|\widehat\Gamma_{3,\theta_0}'(x,h) - \Gamma_{3,\theta_0}'(x,h)\| = o_p(1)$ for any $h$.

These are properties of the HDD estimator $\widehat\Gamma_{3,\theta_0}'(x,h)$. We finish this proof by verifying that these properties hold in our setting.

Part 1: (Smoothness) $\widehat\Gamma_{3,\theta_0}'(x,h)$ is Lipschitz in $h$. Recall that
\[ \widehat\Gamma_{3,\theta_0}'^{(1)}(x,h) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \widehat\Gamma_{1,\theta_0}'^{(1)}(x,W_i,\tau,h)\, d\tau. \]
So
\[
|\widehat\Gamma_{3,\theta_0}'^{(1)}(x,\tilde h) - \widehat\Gamma_{3,\theta_0}'^{(1)}(x,h)|
\le \frac{1}{n}\sum_{i=1}^n \int_0^1 |\widehat\Gamma_{1,\theta_0}'^{(1)}(x,W_i,\tau,\tilde h) - \widehat\Gamma_{1,\theta_0}'^{(1)}(x,W_i,\tau,h)|\, d\tau
\le \Big( \frac{1}{n}\sum_{i=1}^n \int_0^1 K(x,W_i,\widehat\theta)\, d\tau \Big) \|\tilde h - h\|_\Theta
= \Big( \frac{1}{n}\sum_{i=1}^n K(x,W_i,\widehat\theta) \Big) \|\tilde h - h\|_\Theta.
\]
The second line follows by lemma 6. Next we'll show that $\frac{1}{n}\sum_{i=1}^n K(x,W_i,\widehat\theta) = O_p(1)$. Since $K = \sqrt{2}\, K^{(1)}$, it suffices to consider $K^{(1)}$. We have
\[
\frac{1}{n}\sum_{i=1}^n K^{(1)}(x,W_i,\widehat\theta)
= \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\| \Big( 1 + 16 \cdot \|\widehat\gamma'\|_\infty \Big\| \frac{L_\beta(x,W_i'\widehat\beta)}{L(x,W_i'\widehat\beta)^2} \Big\| \Big)
\le \Big( \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\| \Big) + 16 \cdot \|\widehat\gamma'\|_\infty \Big( \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\|^2 \Big)^{1/2} \Big( \frac{1}{n}\sum_{i=1}^n \Big\| \frac{L_\beta(x,W_i'\widehat\beta)}{L(x,W_i'\widehat\beta)^2} \Big\|^2 \Big)^{1/2}.
\]
The first line follows by the definition of $K^{(1)}$. The second line follows by the Cauchy-Schwarz inequality. By $E(\|q(x,W)\|^2) < \infty$, by $P(\widehat\beta \in \mathcal{B}_\delta) \to 1$, and by
\[ E\Big( \sup_{\beta\in\mathcal{B}_\delta}\Big\| \frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2} \Big\|^2 \Big) < \infty, \]
we have $\frac{1}{n}\sum_{i=1}^n \int_0^1 K(x,W_i,\widehat\theta)\, d\tau = O_p(1)$. Thus $\widehat\Gamma_{3,\theta_0}'^{(1)}(x,h)$ is Lipschitz in $h$. It can be similarly shown that $\widehat\Gamma_{3,\theta_0}'^{(2)}(x,h)$ is Lipschitz in $h$.

Part 2: Consistency of $\widehat\Gamma_{3,\theta_0}'(x,h)$. Next we show that $\widehat\Gamma_{3,\theta_0}'^{(1)}(x,h) \xrightarrow{p} \Gamma_{3,\theta_0}'^{(1)}(x,h)$. To do this we use the triangle inequality to decompose their difference into four different terms:
\[ |\widehat\Gamma_{3,\theta_0}'^{(1)}(x,h) - \Gamma_{3,\theta_0}'^{(1)}(x,h)| \le R_1 + R_2 + R_3 + R_4 \]
where
\[ R_1 = \Big| \frac{1}{n}\sum_{i=1}^n q(x,W_i)' \int_0^1 h_2(S^{(1)}(x,W_i,\tau,\widehat\beta))\, d\tau - E\Big[ q(x,W)' \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta_0))\, d\tau \Big] \Big| \]
\[ R_2 = \Big| \frac{1}{n}\sum_{i=1}^n \Big[ q(x,W_i)' \int_0^1 \widehat\gamma'(S^{(1)}(x,W_i,\tau,\widehat\beta))\, \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau - q(x,W_i)' \int_0^1 \gamma_0'(S^{(1)}(x,W_i,\tau,\widehat\beta))\, \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau \Big] \Big| \]
\[ R_3 = \Big| \frac{1}{n}\sum_{i=1}^n q(x,W_i)' \int_0^1 \gamma_0'(S^{(1)}(x,W_i,\tau,\widehat\beta))\, \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau - E\Big[ q(x,W)' \int_0^1 \gamma_0'(S^{(1)}(x,W,\tau,\widehat\beta))\, \bar T^{(1)}(x,W,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau \Big] \Big| \]
\[ R_4 = \Big| E\Big[ q(x,W)' \int_0^1 \gamma_0'(S^{(1)}(x,W,\tau,\widehat\beta))\, \bar T^{(1)}(x,W,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau \Big] - E\Big[ q(x,W)' \int_0^1 \gamma_0'(S^{(1)}(x,W,\tau,\beta_0))\, \bar T^{(1)}(x,W,\tau,\beta_0,h_1,0)\, d\tau \Big] \Big|. \]

Convergence of $R_1$. $h_2$ is continuous on the compact domain $[\varepsilon,1-\varepsilon]$. Hence $h_2$ is bounded on $[\varepsilon,1-\varepsilon]$. Since $S^{(1)}$ lies in $[\varepsilon,1-\varepsilon]$, the composite function $h_2(S^{(1)}(x,w,\tau,\beta))$ is thus bounded uniformly over $(x,w,\tau) \in \{0,1\} \times \mathcal{W} \times (0,1)$. It is also continuous in $\beta$ for any $(x,w,\tau)$, since $S^{(1)}$ is continuous in $\beta$. Therefore, by the dominated convergence theorem,
\[ \int_0^1 h_2(S^{(1)}(x,w,\tau,\beta))\, d\tau \]
is continuous in $\beta$ for all $(x,w) \in \{0,1\} \times \mathcal{W}$. Moreover, it has a bounded envelope:
\[ E\Big( \sup_{\beta\in\mathcal{B}} \Big| \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta))\, d\tau \Big| \Big) < \infty. \]
These properties plus compactness of $\mathcal{B}$ imply that
\[ \Big\{ \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta))\, d\tau : x \in \{0,1\}, \beta \in \mathcal{B} \Big\} \]
is Glivenko-Cantelli, by example 19.8 in van der Vaart (2000). By $E(\|q(x,W)\|) < \infty$ (A6.4) and by corollary 9.27 part (ii) in Kosorok (2008), the class of functions
\[ \Big\{ q(x,W)' \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta))\, d\tau : x \in \{0,1\}, \beta \in \mathcal{B} \Big\} \]
is also Glivenko-Cantelli. Hence
\[
R_1 \le \sup_{\beta\in\mathcal{B}} \Big| \frac{1}{n}\sum_{i=1}^n q(x,W_i)' \int_0^1 h_2(S^{(1)}(x,W_i,\tau,\beta))\, d\tau - E\Big[ q(x,W)' \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta))\, d\tau \Big] \Big|
+ \Big| \int_{\mathcal{W}} q(x,w)' \int_0^1 h_2(S^{(1)}(x,w,\tau,\widehat\beta))\, d\tau\, dF_W(w) - \int_{\mathcal{W}} q(x,w)' \int_0^1 h_2(S^{(1)}(x,w,\tau,\beta_0))\, d\tau\, dF_W(w) \Big|
= o_p(1) + o_p(1) = o_p(1).
\]
The second line follows by the triangle inequality. The first term in that line is $o_p(1)$ by the Glivenko-Cantelli property. The second term is $o_p(1)$ by its continuity in $\beta$ and $\widehat\beta \xrightarrow{p} \beta_0$.

Convergence of $R_2$.
We have
\[
R_2 \le \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\| \int_0^1 \big\| \widehat\gamma'(S^{(1)}(x,W_i,\tau,\widehat\beta)) - \gamma_0'(S^{(1)}(x,W_i,\tau,\widehat\beta)) \big\| \, \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau
\le \|\widehat\gamma' - \gamma_0'\|_\infty \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\| \int_0^1 \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau
\le o_p(1) \times \Big( \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\|^2 \Big)^{1/2} \times \Big( \frac{1}{n}\sum_{i=1}^n \Big( \int_0^1 \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau \Big)^2 \Big)^{1/2}.
\]
The first line follows from the definition of $R_2$, the triangle inequality, and the Cauchy-Schwarz inequality. The last line follows by uniform convergence of $\widehat\gamma'$ to $\gamma_0'$ (lemma 2) and the Cauchy-Schwarz inequality. By A6.4,
\[ \Big( \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\|^2 \Big)^{1/2} = O_p(1). \]
Also,
\[
\int_0^1 \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau
\le \int_0^1 \big| T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau
\le \int_0^1 \big( c\min\{\tau,1-\tau\} + \tau \big)\, d\tau \, \Big| \frac{L_\beta(x,W_i'\widehat\beta)'h_1}{L(x,W_i'\widehat\beta)^2} \Big|
= \frac{c+2}{4} \Big| \frac{L_\beta(x,W_i'\widehat\beta)'h_1}{L(x,W_i'\widehat\beta)^2} \Big|.
\]
The first inequality follows by the definition of $\bar T^{(1)}$, since $|\max\{T^{(1)},0\}| \le |T^{(1)}|$ and the two indicators are mutually exclusive. The second follows by the definition of $T^{(1)}$: the largest absolute value among the quantities $T^{(1)}$ can take is an upper bound for it. By A4,
\[ \frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2} \]
is continuous in $\beta$ for any $w \in \mathcal{W}$. Moreover, this term has a bounded envelope in a neighborhood of $\beta_0$ by our assumption that
\[ E\Big( \sup_{\beta\in\mathcal{B}_\delta}\Big\| \frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2} \Big\|^2 \Big) < \infty. \]
Hence
\[ \frac{1}{n}\sum_{i=1}^n \Big( \int_0^1 \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau \Big)^2 \le \frac{(c+2)^2}{16} \frac{1}{n}\sum_{i=1}^n \Big\| \frac{L_\beta(x,W_i'\widehat\beta)}{L(x,W_i'\widehat\beta)^2} \Big\|^2 \|h_1\|^2 = O_p(1), \]
where the last line follows by the uniform law of large numbers, as in example 19.8 in van der Vaart (2000). Thus $R_2 = o_p(1)$.

Convergence of $R_3$. For fixed $h$ and $x$, let
\[ g_n(w,\beta) = q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)\, d\tau. \]
Define
\[ R_3(\beta) = \Big| \frac{1}{n}\sum_{i=1}^n g_n(W_i,\beta) - E[g_n(W,\beta)] \Big|. \]
Then $R_3 = R_3(\widehat\beta)$. We want to show that $R_3 = o_p(1)$. For any $\epsilon > 0$,
\[
P(|R_3| \ge \epsilon) = P(|R_3| \ge \epsilon, \widehat\beta \in \mathcal{B}_\delta) + P(|R_3| \ge \epsilon, \widehat\beta \notin \mathcal{B}_\delta)
\le P\Big( \sup_{\beta\in\mathcal{B}_\delta} |R_3(\beta)| \ge \epsilon \Big) + P(\widehat\beta \notin \mathcal{B}_\delta).
\]
The second term converges to zero by consistency of $\widehat\beta$. Thus it suffices to show that the first term converges to zero. That is, we want to show that
\[ \sup_{\beta\in\mathcal{B}_\delta} \Big| \frac{1}{n}\sum_{i=1}^n g_n(W_i,\beta) - E[g_n(W,\beta)] \Big| = o_p(1). \]
This follows from a uniform law of large numbers. Specifically, we use theorem 4.2.2 in Amemiya (1985). There are two main properties required to apply this theorem:

1. A dominance condition: $E\big( \sup_{\beta\in\mathcal{B}_\delta} |g_n(W,\beta)| \big) < \infty$.
2. A continuity condition: $g_n(w,\beta)$ is continuous at any $\beta \in \mathcal{B}_\delta$ for all $w \in \mathcal{W}$.

So we conclude the proof by verifying these two properties.

The dominance condition. We have
\[
E\Big( \sup_{\beta\in\mathcal{B}_\delta} |g_n(W,\beta)| \Big)
= E\Big( \sup_{\beta\in\mathcal{B}_\delta} \Big| q(x,W)' \int_0^1 \gamma_0'(S^{(1)}(x,W,\tau,\beta))\, \bar T^{(1)}(x,W,\tau,\beta,h_1,\kappa_n)\, d\tau \Big| \Big)
\le E\Big( \|q(x,W)\| \sup_{\beta\in\mathcal{B}_\delta} \int_0^1 \|\gamma_0'(S^{(1)}(x,W,\tau,\beta))\| \, |\bar T^{(1)}(x,W,\tau,\beta,h_1,\kappa_n)|\, d\tau \Big)
\le E\Big( \|q(x,W)\|\, B\, \frac{c+2}{4} \sup_{\beta\in\mathcal{B}_\delta} \Big| \frac{L_\beta(x,W'\beta)'h_1}{L(x,W'\beta)^2} \Big| \Big)
\le B\, \frac{c+2}{4}\, E\big( \|q(x,W)\|^2 \big)^{1/2} E\Big( \sup_{\beta\in\mathcal{B}_\delta} \Big\| \frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2} \Big\|^2 \Big)^{1/2} \|h_1\| < \infty.
\]
The second and fourth lines follow by the Cauchy-Schwarz inequality.
The continuity condition.
Fix $w \in \mathcal{W}$. We next show continuity of
\[ g_n(w,\beta) = q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)\, d\tau \]
in $\beta$. First note that, for any sequence $\beta_m \to \beta$,
\[ \lim_{m\to\infty} g_n(w,\beta_m) = q(x,w)' \Big( \lim_{m\to\infty} \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta_m))\, \bar T^{(1)}(x,w,\tau,\beta_m,h_1,\kappa_n)\, d\tau \Big). \]
We now show that we can bring the limit inside the integral by applying the dominated convergence theorem. First note that the integrand satisfies a dominance condition, similar to our analysis above. The other condition is pointwise convergence of the integrand for all $\tau \in (0,1)$ except possibly on a set of Lebesgue measure zero. Thus we need to show that
\[ \lim_{m\to\infty} \gamma_0'(S^{(1)}(x,w,\tau,\beta_m))\, \bar T^{(1)}(x,w,\tau,\beta_m,h_1,\kappa_n) = \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) \]
for all $\tau \in (0,1)$ except possibly a set of Lebesgue measure zero. We do this by showing that
\[ \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) \]
is continuous in $\beta$, for all $\tau \in (0,1)$ except possibly a set of Lebesgue measure zero. This term is the product of two pieces, so it suffices to show that each piece separately is continuous. After we do this, the overall proof will be complete.

Piece 1: By $\gamma_0 \in \mathcal{G}$ and by continuity of $S^{(1)}(x,w,\tau,\beta)$ in $\beta$, the function $\gamma_0'(S^{(1)}(x,w,\tau,\beta))$ is continuous.

Piece 2: Next we'll show that $\bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ is continuous in $\beta$ for all $\tau \in (0,1)$ except a set of Lebesgue measure zero. Recall that, omitting the arguments of the functions,
\[ \bar T^{(1)} = T^{(1)} \cdot \mathbb{1}(S^{(1)} - \varepsilon > \kappa_n) + \max\{T^{(1)}, 0\} \cdot \mathbb{1}(|S^{(1)} - \varepsilon| \le \kappa_n). \]
So we'll start by studying continuity of $T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ in $\beta$. Recall equation (13):
\[ T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) = \sum_{j=1}^{7} T^{(1)}_j(x,w,\tau,\beta,h_1) \cdot \mathbb{1}^{(1)}_j(x,w,\tau,\beta,\kappa_n). \]
Given our assumptions on the propensity score (A4), $T^{(1)}_j(x,w,\tau,\beta,h_1)$ is a composition of functions that are continuous in $\beta$. Thus it is also continuous in $\beta$. This holds for all $j = 1,\dots,7$. For example,
\[ T^{(1)}_1(x,w,\tau,\beta,h_1) = -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \]
which is continuous in $\beta$ by continuity of $L(x,\cdot)$. This holds for all $\tau \in (0,1)$. Next consider the indicators $\mathbb{1}^{(1)}_j(x,w,\tau,\beta,\kappa_n)$. Fix $\beta \in \mathcal{B}_\delta$. We will show that these functions are constant, and therefore continuous at $\beta$, except on a set of $\tau$'s of Lebesgue measure zero.

First consider
\[ \mathbb{1}^{(1)}_1(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} < \min\Big\{ \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big). \]
Recall that $w \in \mathcal{W}$ is fixed, along with $x$ and $\kappa_n > 0$. Let
\[ \mathcal{T}_1 = \Big\{ \tau \in (0,1) : \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} = \min\Big\{ \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big\}. \]
Then $\mathbb{1}^{(1)}_1(x,w,\tau,\beta,\kappa_n)$ is continuous at $\beta$ for any $\tau \notin \mathcal{T}_1$. Moreover, $\mathcal{T}_1$ has Lebesgue measure zero. To see this, let $\bar\tau = (1-\varepsilon)L(x,w'\beta)$. Suppose $\bar\tau \le 1/2$. Then
\[
\mathcal{T}_1 = \Big\{ \tau \in (0,\bar\tau] : \tau\Big( 1 - \frac{1-c}{L(x,w'\beta)} \Big) = -\kappa_n \Big\}
\cup \Big\{ \tau \in (\bar\tau, 1/2] : \tau\Big( 1 + \frac{c}{L(x,w'\beta)} \Big) = 1-\varepsilon-\kappa_n \Big\}
\cup \Big\{ \tau \in (1/2, 1) : \tau\Big( 1 - \frac{c}{L(x,w'\beta)} \Big) = 1-\varepsilon-\frac{c}{L(x,w'\beta)}-\kappa_n \Big\}.
\]
Since $\kappa_n \ne 0$, the first set in this union contains at most one point. Since
\[ 1 + \frac{c}{L(x,w'\beta)} \ne 0, \]
the second set in this union also contains at most one point. Finally, the last set contains more than one point only if $L(x,w'\beta) = c$ and
\[ 1-\varepsilon-\frac{c}{L(x,w'\beta)}-\kappa_n = 0, \]
which yields a contradiction since $-\varepsilon - \kappa_n < 0$. Therefore, this is the union of at most three points and hence has measure zero. If $\bar\tau \ge 1/2$, then
\[
\mathcal{T}_1 = \Big\{ \tau \in (0,1/2] : \tau\Big( 1 - \frac{1-c}{L(x,w'\beta)} \Big) = -\kappa_n \Big\}
\cup \Big\{ \tau \in (1/2, \bar\tau] : \tau\Big( 1 - \frac{1+c}{L(x,w'\beta)} \Big) = -\frac{c}{L(x,w'\beta)}-\kappa_n \Big\}
\cup \Big\{ \tau \in (\bar\tau, 1) : \tau\Big( 1 - \frac{c}{L(x,w'\beta)} \Big) = 1-\varepsilon-\frac{c}{L(x,w'\beta)}-\kappa_n \Big\}.
\]
Once again this is the union of at most three points and hence has measure zero. Thus we've shown that for any $\beta \in \mathcal{B}_\delta$ and for all $\tau$ except a set of Lebesgue measure zero,
\[ T^{(1)}_1(x,w,\tau,\beta,h_1) \cdot \mathbb{1}^{(1)}_1(x,w,\tau,\beta,\kappa_n) \]
is continuous at $\beta$. A similar argument holds for the other indicators. Specifically, let $\mathcal{T}_j$ denote the set of $\tau$'s at which $\mathbb{1}^{(1)}_j(x,w,\tau,\beta,\kappa_n)$ is discontinuous at $\beta$. Then
\[ \mathcal{T}_2 = \Big\{ \tau \in (0,1) : \frac{\tau}{L(x,w'\beta)} = \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big\} \]
\[ \mathcal{T}_3 = \Big\{ \tau \in (0,1) : 1-\varepsilon = \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)} \Big\} - \kappa_n \Big\} \]
\[ \mathcal{T}_4 \subseteq \Big\{ \tau \in (0,1) : \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \frac{\tau}{L(x,w'\beta)} \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} = 1-\varepsilon-\kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \frac{\tau}{L(x,w'\beta)} = 1-\varepsilon-\kappa_n \Big\} \]
\[ \mathcal{T}_5 \subseteq \Big\{ \tau \in (0,1) : \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - (1-\varepsilon) \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} = \frac{\tau}{L(x,w'\beta)} - \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : 1-\varepsilon = \frac{\tau}{L(x,w'\beta)} - \kappa_n \Big\} \]
\[ \mathcal{T}_6 \subseteq \Big\{ \tau \in (0,1) : \Big| \frac{\tau}{L(x,w'\beta)} - (1-\varepsilon) \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \frac{\tau}{L(x,w'\beta)} = \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : 1-\varepsilon = \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \kappa_n \Big\} \]
\[ \mathcal{T}_7 \subseteq \Big\{ \tau \in (0,1) : \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - (1-\varepsilon) \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \Big| \frac{\tau}{L(x,w'\beta)} - (1-\varepsilon) \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \frac{\tau}{L(x,w'\beta)} \Big| = \kappa_n \Big\}. \]
Using similar arguments to the $j = 1$ case, we see that all of these sets have Lebesgue measure zero. So $\bigcup_{j=1}^{7}\mathcal{T}_j$ has Lebesgue measure zero. Hence the function $T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ is continuous at all $\beta \in \mathcal{B}_\delta$ for all $\tau \in (0,1)$ except a set of Lebesgue measure zero.

Now let's return to $\bar T^{(1)}$. This function is continuous in $\beta$ for all $\tau \in (0,1)$ except possibly on the set
\[ \mathcal{T} = \bigcup_{j=1}^{7}\mathcal{T}_j \cup \Big\{ \tau \in (0,1) : \Big| \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \varepsilon \Big| = \kappa_n \Big\}. \]
The second term here comes from the indicators $\mathbb{1}(|S^{(1)} - \varepsilon| \le \kappa_n)$ and $\mathbb{1}(S^{(1)} - \varepsilon > \kappa_n)$. We can see that this set has Lebesgue measure zero using similar arguments as above. Thus the overall set $\mathcal{T}$ has Lebesgue measure zero. Hence we've shown that, for a fixed $(x,w,h_1,\kappa_n)$ and for any $\beta \in \mathcal{B}_\delta$, $\bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ is continuous at $\beta$ for all $\tau \in (0,1)$ except a set of Lebesgue measure zero. As noted earlier, this is sufficient to complete the proof that $R_3 = o_p(1)$.

Convergence of $R_4$. This part is the difference between two expectations, one evaluated at $\widehat\beta$ and the other at $\beta_0$. Note that the expectations are over $W$, not $\widehat\beta$. We'll show that $R_4 = o_p(1)$ by applying the dominated convergence theorem and then using the fact that $\widehat\beta \xrightarrow{p} \beta_0$. The $R_4$ term is similar to $R_3$, and so our proof here will use some of our derivations from our proof that $R_3 = o_p(1)$. The main difference is that $R_4$ is a function of $\bar T^{(1)}$ evaluated at $(\widehat\beta, \kappa_n)$ rather than $(\beta_0, 0)$. So we'll spend most of our time on that. $\bar T^{(1)}$, in turn, depends on $T^{(1)}$. So we'll begin by showing that
\[ T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
By the definition of $T^{(1)}$, this convergence holds if
\[ T^{(1)}_j(x,w,\tau,\widehat\beta,h_1) \xrightarrow{p} T^{(1)}_j(x,w,\tau,\beta_0,h_1) \quad \text{and} \quad \mathbb{1}^{(1)}_j(x,w,\tau,\widehat\beta,\kappa_n) \xrightarrow{p} \mathbb{1}^{(1)}_j(x,w,\tau,\beta_0,0) \]
for all $j = 1,\dots,7$.
By continuity of $T^{(1)}_j(x,w,\tau,\beta,h_1)$ in $\beta$ for all $j = 1,\dots,7$, and by the consistency of $\widehat\beta$, the $T^{(1)}_j(x,w,\tau,\widehat\beta,h_1)$ terms are consistent. The indicator functions are slightly trickier. Given $(x,w,\tau)$, the value of $\beta_0$ determines which of the seven cases we are in. That is, which of the indicators $\mathbb{1}^{(1)}_j(x,w,\tau,\beta_0,0)$ is 1; the other six are all zero. We'll consider each case separately.

First suppose that
\[ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} < \min\Big\{ \frac{\tau}{L(x,w'\beta_0)}, 1-\varepsilon \Big\}. \]
Thus we have $\mathbb{1}^{(1)}_1(x,w,\tau,\beta_0,0) = 1$. By $\widehat\beta \xrightarrow{p} \beta_0$ and $\kappa_n \to 0$,
\[
\mathbb{1}^{(1)}_1(x,w,\tau,\widehat\beta,\kappa_n) = \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} < \min\Big\{ \frac{\tau}{L(x,w'\widehat\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big)
\xrightarrow{p} \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} < \min\Big\{ \frac{\tau}{L(x,w'\beta_0)}, 1-\varepsilon \Big\} \Big) = \mathbb{1}^{(1)}_1(x,w,\tau,\beta_0,0).
\]
Moreover, by taking complements, we see that for these values of $(x,w,\tau,\beta_0)$ all the other indicators converge (to zero) as well.

Next suppose $\mathbb{1}^{(1)}_2(x,w,\tau,\beta_0,0) = 1$ or $\mathbb{1}^{(1)}_3(x,w,\tau,\beta_0,0) = 1$. In either of these cases, we can similarly show that $\mathbb{1}^{(1)}_j(x,w,\tau,\widehat\beta,\kappa_n) \xrightarrow{p} \mathbb{1}^{(1)}_j(x,w,\tau,\beta_0,0)$ for $j = 2,3$.

Next suppose
\[ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} = \frac{\tau}{L(x,w'\beta_0)} < 1-\varepsilon, \]
which puts us in the $\mathbb{1}^{(1)}_4(x,w,\tau,\beta_0,0) = 1$ case. This case is more delicate, and shows the second place where the $\kappa_n$'s are important. $\mathbb{1}^{(1)}_4(x,w,\tau,\widehat\beta,\kappa_n)$ can be viewed as the product of three indicator functions. Two of them are handled like the $j = 1,2,3$ cases:
\[ \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} < 1-\varepsilon-\kappa_n \Big) \xrightarrow{p} \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} < 1-\varepsilon \Big) \]
and
\[ \mathbb{1}\Big( \frac{\tau}{L(x,w'\widehat\beta)} < 1-\varepsilon-\kappa_n \Big) \xrightarrow{p} \mathbb{1}\Big( \frac{\tau}{L(x,w'\beta_0)} < 1-\varepsilon \Big) \]
by $\widehat\beta \xrightarrow{p} \beta_0$ and $\kappa_n \to 0$. The third indicator requires a different argument:
\[
\mathbb{1}\Big( \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} - \frac{\tau}{L(x,w'\widehat\beta)} \Big| \le \kappa_n \Big)
= \mathbb{1}\Big( \frac{1}{\sqrt{n}\kappa_n} \cdot \sqrt{n}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} - \frac{\tau}{L(x,w'\widehat\beta)} \Big) \in [-1,1] \Big)
\to 1
\]
since $\sqrt{n}\kappa_n \to \infty$ and
\[ \sqrt{n}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} - \frac{\tau}{L(x,w'\widehat\beta)} \Big) = O_p(1). \]
This term is $O_p(1)$ since we're looking at the case where
\[ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} - \frac{\tau}{L(x,w'\widehat\beta)} \xrightarrow{p} \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} - \frac{\tau}{L(x,w'\beta_0)} = 0, \]
and by the delta method. Combining the consistency of these three indicator functions, we have that $\mathbb{1}^{(1)}_4(x,w,\tau,\widehat\beta,\kappa_n) \xrightarrow{p} \mathbb{1}^{(1)}_4(x,w,\tau,\beta_0,0)$. As before, by taking complements we see that all of the other indicator functions are also consistent in this case. Notice that here $\kappa_n$ is a slackness parameter that we introduced to allow the indicator $\mathbb{1}^{(1)}_4(x,w,\tau,\widehat\beta,\kappa_n)$ to be 1 even if the equality
\[ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} = \frac{\tau}{L(x,w'\widehat\beta)} \]
does not hold exactly in finite samples.

The last three cases are all similar to the $\mathbb{1}^{(1)}_4(x,w,\tau,\beta_0,0) = 1$ case that we just studied. Thus, putting all of these cases together gives
\[ T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
By similar arguments, we can also show that
\[ \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
By continuity of $\gamma_0'(\cdot)$ and of $S^{(1)}(x,w,\tau,\beta)$ in $\beta$, this implies that
\[ q(x,w)'\gamma_0'(S^{(1)}(x,w,\tau,\widehat\beta))\, \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} q(x,w)'\gamma_0'(S^{(1)}(x,w,\tau,\beta_0))\, \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
Finally, note that
\[ q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)\, d\tau \]
is continuous in $\beta$ and has a bounded envelope (which can be seen using arguments similar to those in our proof for $R_3$). Thus we can apply the dominated convergence theorem, which gives $R_4 = o_p(1)$.

Putting the four pieces together. We've shown that $R_1, \dots, R_4$ are all $o_p(1)$. Thus we've shown that
\[ |\widehat\Gamma_{3,\theta_0}'^{(1)}(x,h) - \Gamma_{3,\theta_0}'^{(1)}(x,h)| = o_p(1). \]
A similar argument can be used to show that
\[ |\widehat\Gamma_{3,\theta_0}'^{(2)}(x,h) - \Gamma_{3,\theta_0}'^{(2)}(x,h)| = o_p(1). \]
As discussed at the beginning of the proof, this consistency of $\widehat\Gamma_{3,\theta_0}'(x,h)$ was all that we had left to show, so we are done.

Lemma 8 (Standard component). Suppose the assumptions of theorem 1 hold. Then
\[ \mathbb{G}_n^* \Gamma_2(x,W,\widehat\theta) \overset{P}{\rightsquigarrow} \mathbb{G}\Gamma_2(x,W,\theta_0) \equiv \tilde Z_4(x), \]
a mean-zero Gaussian vector in $\mathbb{R}^2$.

Proof of lemma 8.
Write
\[ \mathbb{G}_n^* \Gamma_2(x,W,\widehat\theta) = \mathbb{G}_n^* \Gamma_2(x,W,\theta_0) + \mathbb{G}_n^* \big( \Gamma_2(x,W,\widehat\theta) - \Gamma_2(x,W,\theta_0) \big). \]
The first term converges to $\mathbb{G}\Gamma_2(x,W,\theta_0)$ by consistency of the standard nonparametric bootstrap. So it suffices to show that the second term converges to zero. We do this using an argument similar to that in the proof of lemma 19.24 in van der Vaart (2000). We'll give the proof for the upper bound, the first component of $\Gamma_2$. The proof for the lower bound is analogous.

1. By the proof of theorem 1, $\mathcal{F} = \{\Gamma_2^{(1)}(x,W,\theta) : \theta \in \Theta_\delta\}$ is Donsker with finite envelope function. So theorem 23.7 in van der Vaart (2000) gives
\[ \mathbb{G}_n^* \Gamma_2^{(1)}(x,W,\cdot) \overset{P}{\rightsquigarrow} \mathbb{G}\Gamma_2^{(1)}(x,W,\cdot), \]
where $\mathbb{G}$ is a Gaussian process indexed by $\mathcal{F}$.

2. Endow $\mathcal{F}$ with the $L_2(P)$ semi-metric. Note that $\Gamma_2^{(1)}(x,\cdot,\widehat\theta) \xrightarrow{p} \Gamma_2^{(1)}(x,\cdot,\theta_0)$ in this semi-metric. This follows since, from the proof of theorem 1,
\[ |\Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0)| \le K(w)\|\widehat\theta - \theta_0\|_\Theta \]
where $E(K(W)^2) < \infty$. Thus
\[ \int_{\mathcal{W}} |\Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0)|^2\, dF_W(w) \le E(K(W)^2)\, \|\widehat\theta - \theta_0\|_\Theta^2, \]
which converges to zero in probability by consistency of $\widehat\theta$ for $\theta_0$.

These two points imply that $(\mathbb{G}_n^*, \Gamma_2^{(1)}(x,\cdot,\widehat\theta)) \overset{P}{\rightsquigarrow} (\mathbb{G}, \Gamma_2^{(1)}(x,\cdot,\theta_0))$ in the space $\ell^\infty(\mathcal{F}) \times \mathcal{F}$ by Slutsky's theorem. Define the function $\phi: \ell^\infty(\mathcal{F}) \times \mathcal{F} \to \mathbb{R}$ by
\[ \phi(g, \Gamma_2^{(1)}(x,\cdot,\theta)) = g(\Gamma_2^{(1)}(x,\cdot,\theta)) - g(\Gamma_2^{(1)}(x,\cdot,\theta_0)). \]
Since $\mathbb{G}$ has continuous paths (lemma 18.15 in van der Vaart 2000) almost surely, the function $\phi$ is continuous at almost every $(\mathbb{G}, \Gamma_2^{(1)}(x,\cdot,\theta))$. So the continuous mapping theorem (e.g., theorem 10.8 in Kosorok 2008) implies that
\[ \mathbb{G}_n^* \big( \Gamma_2^{(1)}(x,W,\widehat\theta) - \Gamma_2^{(1)}(x,W,\theta_0) \big) = \phi(\mathbb{G}_n^*, \Gamma_2^{(1)}(x,\cdot,\widehat\theta)) \overset{P}{\rightsquigarrow} \phi(\mathbb{G}, \Gamma_2^{(1)}(x,\cdot,\theta_0)) = 0. \]

Proof of proposition 3. First, since the influence functions for the first step estimators are Donsker (lemmas 3 and 4), since the standard nonparametric bootstrap for those estimators is valid (theorem 3.6.1 in van der Vaart and Wellner 1996), and along with our analysis in the proof of lemma 8, we have
\[ \begin{pmatrix} \sqrt{n}(\widehat\theta^* - \widehat\theta) \\ \mathbb{G}_n^* \Gamma_2(x,W,\widehat\theta) \end{pmatrix} \overset{P}{\rightsquigarrow} \begin{pmatrix} Z_1 \\ \tilde Z_4(x) \end{pmatrix}, \]
a mean-zero process in $\mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R}^2$.

Next, define $\Lambda: \mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R} \to \mathbb{R}$ by
\[ \Lambda(\theta, u) = \Gamma_3^{(1)}(x,\theta) + u, \]
where $\theta \in \mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$ and $u \in \mathbb{R}$. By the proof of theorem 1, the mapping $\Gamma_3^{(1)}(x,\theta)$ is HDD at $\theta_0$ tangentially to $\mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$. So, for any $u \in \mathbb{R}$, $\Lambda(\theta,u)$ is HDD at $(\theta_0, u)$ tangentially to $\mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R}$. Its HDD is
\[ \Lambda_{\theta_0}'(h_1, h_2, h_3) = \Gamma_{3,\theta_0}'^{(1)}(x, (h_1,h_2)) + h_3, \]
where $h_3 \in \mathbb{R}$. Estimate it by $\widehat\Lambda_{\theta_0}'(h) = \widehat\Gamma_{3,\theta_0}'^{(1)}(x, (h_1,h_2)) + h_3$. By the proof of lemma 7, this HDD estimator satisfies assumption 4 of Fang and Santos (2019). Thus we can apply their theorem 3.2 to get
\[
\widehat\Gamma_{3,\theta_0}'^{(1)}\big( x, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) + \mathbb{G}_n^* \Gamma_2^{(1)}(x,W,\widehat\theta)
= \widehat\Lambda_{\theta_0}'\big( \sqrt{n}(\widehat\theta^* - \widehat\theta), \mathbb{G}_n^* \Gamma_2^{(1)}(x,W,\widehat\theta) \big)
\overset{P}{\rightsquigarrow} \Lambda_{\theta_0}'(Z_1, \tilde Z_4^{(1)}(x))
= \Gamma_{3,\theta_0}'^{(1)}(x, Z_1) + \tilde Z_4^{(1)}(x) \equiv Z_4^{(1)}(x).
\]
E Analytical Bootstrap Results for the CQTE and CATE

In this section we formally derive bootstrap consistency for the CQTE and CATE.

Proposition 6 (CQTE Bootstrap). Suppose the assumptions of proposition 5 hold. Let $\kappa_n \to 0$, $\sqrt{n}\kappa_n \to \infty$, $\eta_n \to 0$, and $\sqrt{n}\eta_n \to \infty$ as $n \to \infty$. Then
\[ \widehat\Gamma_{1,\theta_0}'\big( x, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \overset{P}{\rightsquigarrow} \Gamma_{1,\theta_0}'(x, w, \tau, Z_1). \]

This proposition implies that the asymptotic distribution of the CQTE bounds can be approximated by the bootstrap distribution of
\[ \begin{pmatrix} \widehat\Gamma_{1,\theta_0}'^{(1)}\big( 1, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) - \widehat\Gamma_{1,\theta_0}'^{(2)}\big( 0, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \\ \widehat\Gamma_{1,\theta_0}'^{(2)}\big( 1, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) - \widehat\Gamma_{1,\theta_0}'^{(1)}\big( 0, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \end{pmatrix}. \]
We also show the bootstrap consistency for the CATE.
Proposition 7 (CATE Bootstrap). Suppose the assumptions of proposition 1 hold. Let $\kappa_n \to 0$, $\sqrt{n}\kappa_n \to \infty$, $\eta_n \to 0$, and $\sqrt{n}\eta_n \to \infty$ as $n \to \infty$. Then
\[ \widehat\Gamma_{2,\theta_0}'\big( x, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \overset{P}{\rightsquigarrow} \Gamma_{2,\theta_0}'(x, w, Z_1). \]

Like with the CQTE, the asymptotic distribution of the CATE bounds can be approximated by the bootstrap distribution of
\[ \begin{pmatrix} \widehat\Gamma_{2,\theta_0}'^{(1)}\big( 1, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) - \widehat\Gamma_{2,\theta_0}'^{(2)}\big( 0, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \\ \widehat\Gamma_{2,\theta_0}'^{(2)}\big( 1, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) - \widehat\Gamma_{2,\theta_0}'^{(1)}\big( 0, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \end{pmatrix}. \]

E.1 Proofs
Proof of proposition 6.
As in the proof of lemma 7 we'll use theorem 3.2 in Fang and Santos (2019). To do this we must verify their assumptions 1-4. Their assumptions 1-3 hold as in the proof of lemma 7. By their remark 3.4, sufficient conditions for their assumption 4 are:

1. A smoothness condition: $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ is Lipschitz in $h$. This holds by lemma 6.
2. A consistency condition: $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ converges in probability to $\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ for any $h \in \mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$. To see that this holds, recall that in the proof of lemma 7 we showed that
\[ q(x,w)'h_2(S^{(1)}(x,w,\tau,\widehat\beta)) \xrightarrow{p} q(x,w)'h_2(S^{(1)}(x,w,\tau,\beta_0)) \]
and
\[ q(x,w)'\widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta))\, \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} q(x,w)'\gamma_0'(S^{(1)}(x,w,\tau,\beta_0))\, \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
By the definition of $\widehat\Gamma_{1,\theta_0}'$, these two results imply $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h) \xrightarrow{p} \Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$.

Similar arguments can be used for the lower bound, to show that $\widehat\Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h)$ is Lipschitz in $h$, and that $\widehat\Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h) \xrightarrow{p} \Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h)$.
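As a usage sketch, bootstrap draws of $\widehat\Gamma_{1,\theta_0}'(1,w,\tau,\cdot) - \widehat\Gamma_{1,\theta_0}'(0,w,\tau,\cdot)$ can be turned into a pointwise confidence interval for a CQTE bound; `cqte_draws` below is a placeholder array of such draws across replications.

```python
import numpy as np

def cqte_ci(cqte_hat, cqte_draws, n, alpha=0.05):
    # Percentile-type interval for one (w, tau): invert the bootstrap
    # approximation of the law of sqrt(n) * (estimator - estimand).
    lo, hi = np.quantile(cqte_draws, [alpha / 2, 1 - alpha / 2])
    return cqte_hat - hi / np.sqrt(n), cqte_hat - lo / np.sqrt(n)
```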
Proof of proposition 7.

The proof of this result is similar to the proof of proposition 6. Like there, we show that the two sufficient conditions for assumption 4 in theorem 3.2 of Fang and Santos (2019) hold.

1. A smoothness condition: $\widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,h)$ is Lipschitz in $h$. To see this, write
\[
|\widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,\tilde h) - \widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,h)|
\le \int_0^1 |\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,\tilde h) - \widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)|\, d\tau
\le \int_0^1 K(x,w,\widehat\theta)\, d\tau \, \|\tilde h - h\|_\Theta
= K(x,w,\widehat\theta) \cdot \|\tilde h - h\|_\Theta,
\]
where the second line follows by the proof of lemma 6, and where $K(x,w,\widehat\theta)$ is defined in lemma 6 and is shown to be $O_p(1)$. So $\widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,h)$ is Lipschitz in $h$.
2. A consistency condition: $\widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,h) \xrightarrow{p} \Gamma_{2,\theta_0}'^{(1)}(x,w,h)$. This result follows from arguments similar to those in the proof of proposition 1 and the dominated convergence theorem.

Similar arguments can be used for the lower bound, to show that $\widehat\Gamma_{2,\theta_0}'^{(2)}(x,w,h)$ is Lipschitz in $h$, and that $\widehat\Gamma_{2,\theta_0}'^{(2)}(x,w,h) \xrightarrow{p} \Gamma_{2,\theta_0}'^{(2)}(x,w,h)$.

F Proofs for Section 6
Proof of theorem 2.
Recall that our ATE bounds depend on the functionals $\overline{\Gamma}_1(x, \theta)$. We will show that $P(p_{x|W} \in \{c, 1-c\}) = 0$ implies that $\overline{\Gamma}'_{1,\theta}(x, h)$ is linear in $h$. This, in turn, implies that $\overline{\Gamma}_1(x, \theta)$ is Hadamard differentiable at $\theta$. Consistency of the standard bootstrap then follows from the delta method for the bootstrap; see theorem 3.9.11 in van der Vaart and Wellner (1996). Thus it suffices to show that $\overline{\Gamma}'_{1,\theta}(x, h)$ is linear in $h$. We'll show this in three steps.

Step 1. First we show that
$$\overline{\Gamma}'_{1,\theta}(x, w, \tau, h) = q(x,w)' \gamma'\big(\overline{S}(x, w, \tau, \beta_0)\big)\, \overline{T}(x, w, \tau, \beta_0, h_1, 0) + q(x,w)' h_2\big(\overline{S}(x, w, \tau, \beta_0)\big)$$
is linear in $h$. First note that it is trivially linear in $h_2$. It is linear in $h_1$ if and only if $\overline{T}(x, w, \tau, \beta_0, h_1, 0)$ is linear in $h_1$. Recall the definition of $\overline{T}$:
$$\overline{T}(x, w, \tau, \beta_0, h_1, 0) = T(x, w, \tau, \beta_0, h_1, 0) \cdot \mathbb{1}\big(\overline{S}(x, w, \tau, \beta_0) > \varepsilon\big) + \max\{T(x, w, \tau, \beta_0, h_1, 0),\, 0\} \cdot \mathbb{1}\big(\overline{S}(x, w, \tau, \beta_0) = \varepsilon\big),$$
where
$$\overline{S}(x, w, \tau, \beta) = \min\left\{\tau + \frac{c}{L(x, w'\beta)} \min\{\tau, 1-\tau\},\ \frac{\tau}{L(x, w'\beta)},\ 1 - \varepsilon\right\}.$$
There are two parts of $\overline{T}$. We'll consider each of them separately.
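As a concrete reference point, here is a direct numerical implementation of $\overline{S}$, treating the propensity score value $p = L(x, w'\beta)$ as a given number. It is a minimal sketch; the function name and arguments are not from the paper's companion code.

```python
def S_bar(tau, p, c, eps):
    """Censored quantile position from the display above:
    the minimum of the two bound branches, capped at 1 - eps."""
    return min(tau + (c / p) * min(tau, 1 - tau), tau / p, 1 - eps)
```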
The second part. For given $(x, w, \beta_0, c, \varepsilon)$, the set $\{\tau \in (0,1) : \overline{S}(x, w, \tau, \beta_0) = \varepsilon\}$ has Lebesgue measure zero. This is the case since $\overline{S}(x, w, \tau, \beta_0)$ is strictly increasing in $\tau$ whenever $\overline{S}(x, w, \tau, \beta_0) < 1 - \varepsilon$, and $\varepsilon < 1 - \varepsilon$ by assuming $\varepsilon < 1/2$. Therefore, although $\max\{T(x, w, \tau, \beta_0, h_1, 0), 0\}$ may be nonlinear in $h_1$, the term
$$\max\{T(x, w, \tau, \beta_0, h_1, 0),\, 0\} \cdot \mathbb{1}\big(\overline{S}(x, w, \tau, \beta_0) = \varepsilon\big)$$
is nonlinear only on a measure zero set of $\tau$'s.

The first part. Next we study linearity of $T(x, w, \tau, \beta_0, h_1, 0)$ in $h_1$. Recall from appendix C that it can be written as
$$T(x, w, \tau, \beta_0, h_1, 0) = \sum_{j=1}^{7} T_{1,j}(x, w, \tau, \beta_0, h_1) \cdot \mathbb{1}_{1,j}(x, w, \tau, \beta_0, 0).$$
By examining the specific functional forms given in appendix C, we immediately see that the functions $T_{1,j}(x, w, \tau, \beta_0, h_1)$ are linear in $h_1$ for $j \in \{1,2,3\}$ and nonlinear for $j \in \{4,5,6,7\}$. The main question is for how many values of $(\tau, w)$ the indicators $\mathbb{1}_{1,j}$ equal 1 for $j \in \{4,5,6,7\}$. We'll show that these indicators equal 1 only on a set of $\tau$'s and $w$'s that has measure zero under the product measure with the Lebesgue measure and $F_W$ as the marginals.

Define
$$\mathcal{S}_j(p_{x|w}, c) \equiv \{\tau \in (0,1) : \mathbb{1}_{1,j}(x, w, \tau, \beta_0, 0) = 1\}$$
for $j \in \{4,5,6,7\}$. For $j = 4$, by the definition of $\mathbb{1}_{1,4}$ in appendix C we have
$$\mathcal{S}_4(p_{x|w}, c) = \left\{\tau \in (0,1) : \tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\} - \frac{\tau}{p_{x|w}} = 0\right\} \cap \left\{\tau \in (0,1) : \max\left\{\tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau}{p_{x|w}}\right\} < 1 - \varepsilon\right\} \equiv \mathcal{S}_{4,a}(p_{x|w}, c) \cap \mathcal{S}_{4,b}(p_{x|w}, c).$$
We write the first set as
$$\mathcal{S}_{4,a}(p_{x|w}, c) = \left\{\tau \in (0, 1/2] : \tau\left(1 + \frac{c}{p_{x|w}} - \frac{1}{p_{x|w}}\right) = 0\right\} \cup \left\{\tau \in (1/2, 1) : \tau\left(1 - \frac{c}{p_{x|w}} - \frac{1}{p_{x|w}}\right) = -\frac{c}{p_{x|w}}\right\} = \begin{cases} \varnothing & \text{if } c < 1 - p_{x|w} \\ (0, 1/2] & \text{if } c = 1 - p_{x|w} \\ \left\{\dfrac{c}{c + 1 - p_{x|w}}\right\} & \text{if } c > 1 - p_{x|w}. \end{cases}$$
Thus if $c \neq 1 - p_{x|w}$ then $\mathcal{S}_{4,a}(p_{x|w}, c)$ has Lebesgue measure zero. Consequently $\mathcal{S}_4(p_{x|w}, c)$ also has Lebesgue measure zero in this case.
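To make the $j = 4$ case concrete, the following check (with the hypothetical value $p_{x|w} = 0.6$, so that $1 - p_{x|w} = 0.4$) counts the crossings of the two branches of $\overline{S}$ on a fine grid. As the derivation above predicts, there are no crossings when $c < 1 - p_{x|w}$ and exactly one, at $\tau^* = c/(c + 1 - p_{x|w})$, when $c > 1 - p_{x|w}$.

```python
import numpy as np

p = 0.6                                  # hypothetical value of p_{x|w}
tau = np.linspace(1e-6, 1 - 1e-6, 1_000_001)

for c in (0.2, 0.7):                     # c < 1 - p and c > 1 - p
    A = tau + (c / p) * np.minimum(tau, 1 - tau)   # first branch
    B = tau / p                                    # second branch
    n_cross = np.count_nonzero(np.diff(np.sign(A - B)))
    tau_star = c / (c + 1 - p) if c > 1 - p else None
    print(f"c = {c}: {n_cross} crossing(s), predicted tau* = {tau_star}")
# prints 0 crossings for c = 0.2 and 1 crossing near tau* ≈ 0.636 for c = 0.7
```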
Next consider $j = 5$. As before, by the definition of $\mathbb{1}_{1,5}$ we have
$$\mathcal{S}_5(p_{x|w}, c) = \left\{\tau \in (0,1) : \tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\} - (1 - \varepsilon) = 0\right\} \cap \left\{\tau \in (0,1) : \max\left\{\tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\},\ 1 - \varepsilon\right\} < \frac{\tau}{p_{x|w}}\right\} \equiv \mathcal{S}_{5,a}(p_{x|w}, c) \cap \mathcal{S}_{5,b}(p_{x|w}, c).$$
Write the first set as
$$\mathcal{S}_{5,a}(p_{x|w}, c) = \left\{\tau \in (0, 1/2] : \tau\left(1 + \frac{c}{p_{x|w}}\right) = 1 - \varepsilon\right\} \cup \left\{\tau \in (1/2, 1) : \tau\left(1 - \frac{c}{p_{x|w}}\right) = 1 - \varepsilon - \frac{c}{p_{x|w}}\right\}.$$
Since $1 + \frac{c}{p_{x|w}} \neq 0$, the set $\left\{\tau \in (0, 1/2] : \tau\left(1 + \frac{c}{p_{x|w}}\right) = 1 - \varepsilon\right\}$ contains at most one point. Likewise,
$$\left\{\tau \in (1/2, 1) : \tau\left(1 - \frac{c}{p_{x|w}}\right) = 1 - \varepsilon - \frac{c}{p_{x|w}}\right\}$$
contains at most one point whenever $c \neq p_{x|w}$. When $c = p_{x|w}$ this set equals $\{\tau \in (1/2, 1) : 0 = -\varepsilon\}$, which is empty since $\varepsilon > 0$. Thus $\mathcal{S}_{5,a}(p_{x|w}, c)$ has Lebesgue measure zero. Consequently, $\mathcal{S}_5(p_{x|w}, c)$ also has Lebesgue measure zero.

Next consider $j = 6$. We have
$$\mathcal{S}_6(p_{x|w}, c) = \left\{\tau \in (0,1) : \frac{\tau}{p_{x|w}} - (1 - \varepsilon) = 0\right\} \cap \left\{\tau \in (0,1) : \max\left\{\frac{\tau}{p_{x|w}},\ 1 - \varepsilon\right\} < \tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\}\right\} \subseteq \left\{\tau \in (0,1) : \tau = (1 - \varepsilon) p_{x|w}\right\}.$$
The first line follows from the definition of $\mathbb{1}_{1,6}$. The last line follows from looking at the first set in the intersection in the first line. This last set is a singleton and hence has Lebesgue measure zero. Thus $\mathcal{S}_6(p_{x|w}, c)$ has Lebesgue measure zero.

Finally, consider $j = 7$. This case is a combination of the above cases. Hence we can show that $\mathcal{S}_7(p_{x|w}, c)$ has Lebesgue measure zero by repeating some of the above steps.

Step 2. From step 1 we see that for any $w \in \mathcal{W}$ such that $p_{x|w} \neq 1 - c$, the mapping $\overline{\Gamma}'_{1,\theta}(x, w, \tau, h)$ is linear in $h$ for all $\tau \in (0,1)$ except for a Lebesgue measure zero set. Denote this set by $\mathcal{T}$. Then
$$\overline{\Gamma}'_{1,\theta}(x, w, h) = \int_0^1 \overline{\Gamma}'_{1,\theta}(x, w, \tau, h)\, d\tau = \int_{\tau \notin \mathcal{T}} \overline{\Gamma}'_{1,\theta}(x, w, \tau, h)\, d\tau + \int_{\tau \in \mathcal{T}} \overline{\Gamma}'_{1,\theta}(x, w, \tau, h)\, d\tau = \int_{\tau \notin \mathcal{T}} \overline{\Gamma}'_{1,\theta}(x, w, \tau, h)\, d\tau.$$
The first equality follows by the definition of $\overline{\Gamma}_1$. The last follows since $\mathcal{T}$ has Lebesgue measure zero. Since integrals are linear operators, we see that $\overline{\Gamma}'_{1,\theta}(x, w, h)$ is linear in $h$ for any $w \in \mathcal{W}$ such that $p_{x|w} \neq 1 - c$.

Step 3. We have
$$\overline{\Gamma}'_{1,\theta}(x, h) = \int_{\mathcal{W}} \overline{\Gamma}'_{1,\theta}(x, w, h)\, dF_W(w) = \int_{\{w \in \mathcal{W} :\, p_{x|w} \notin \{c, 1-c\}\}} \overline{\Gamma}'_{1,\theta}(x, w, h)\, dF_W(w) + \int_{\{w \in \mathcal{W} :\, p_{x|w} \in \{c, 1-c\}\}} \overline{\Gamma}'_{1,\theta}(x, w, h)\, dF_W(w) = \int_{\{w \in \mathcal{W} :\, p_{x|w} \notin \{c, 1-c\}\}} \overline{\Gamma}'_{1,\theta}(x, w, h)\, dF_W(w).$$
The first equality follows by the definition of $\overline{\Gamma}_1$. The third follows since we assumed $p_{x|W} \in \{c, 1-c\}$ occurs with probability zero. By step 2, $\overline{\Gamma}'_{1,\theta}(x, w, h)$ is linear in $h$ on the set over which we are integrating in the last expression. Since integrals are linear operators, this implies $\overline{\Gamma}'_{1,\theta}(x, h)$ is linear in $h$. Similar calculations for the lower bound show that $\underline{\Gamma}'_{1,\theta}(x, h)$ is linear when $P(p_{x|W} \in \{c, 1-c\}) = 0$. Thus we have shown that $\overline{\Gamma}'_{1,\theta}(x, h)$ is linear in $h$, as desired.
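The kink logic behind this condition can be seen in a one-dimensional analogue, sketched below. The directional derivative of $\beta \mapsto \min\{a'\beta, b'\beta\}$ at a point where the two pieces tie is $h \mapsto \min\{a'h, b'h\}$, which is positively homogeneous but not additive; this is exactly the type of non-linearity that breaks the standard bootstrap (cf. Fang and Santos 2019). The vectors `a` and `b` below are arbitrary illustrative choices, not objects from the paper.

```python
import numpy as np

# Directional derivative of f(beta) = min(a'beta, b'beta) at a kink point
# beta0 with a'beta0 = b'beta0: the derivative map is h -> min(a'h, b'h).
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
deriv = lambda h: min(a @ h, b @ h)

h = np.array([1.0, -1.0])
print(deriv(h) + deriv(-h))   # -2.0, not 0: the derivative map is not linear
```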
Proof of proposition 4. By the proof of theorem 2, the mapping $\overline{\Gamma}_1(0, \theta)$ is Hadamard differentiable when $P(p_{0|W} = 1 - c) = P(p_{0|W} = c) = 0$.