Assessing Sensitivity to Unconfoundedness: Estimation and Inference
Matthew A. Masten†   Alexandre Poirier‡   Linqi Zhang§

December 31, 2020
Abstract
This paper provides a set of methods for quantifying the robustness of treatment effects estimated using the unconfoundedness assumption (also known as selection on observables or conditional independence). Specifically, we estimate and do inference on bounds on various treatment effect parameters, like the average treatment effect (ATE) and the average effect of treatment on the treated (ATT), under nonparametric relaxations of the unconfoundedness assumption indexed by a scalar sensitivity parameter $c$. These relaxations allow for limited selection on unobservables, depending on the value of $c$. For large enough $c$, these bounds equal the no assumptions bounds. Using a non-standard bootstrap method, we show how to construct confidence bands for these bound functions which are uniform over all values of $c$. We illustrate these methods with an empirical application to effects of the National Supported Work Demonstration program. We implement these methods in a companion Stata module for easy use in practice.

JEL classification:
C14; C18; C21; C51
Keywords:
Treatment Effects, Conditional Independence, Unconfoundedness, Selection on Observables, Sensitivity Analysis, Nonparametric Identification, Partial Identification

∗ This paper was presented at the 2018 Western Economic Association International Conference, the 2019 Stata Conference Chicago, the 2020 World Congress of the Econometric Society, the DC-MD-VA Econometrics Workshop 2020, University of Southern California, University of Toronto, and the 2020 SEA Conference. We thank participants at those seminars and conferences, as well as Karim Chalak, Toru Kitagawa, and John Pepper. We thank Paul Diegert for excellent research assistance. Masten thanks the National Science Foundation for research support under Grant No. 1943138.
† Department of Economics, Duke University, [email protected]
‡ Department of Economics, Georgetown University, [email protected]
§ Department of Economics, Boston College, [email protected]

1 Introduction
A core goal of causal inference is to identify and estimate effects of a treatment variable on an outcome variable. A common assumption used to identify such effects is unconfoundedness, which says that potential outcomes are independent of treatment conditional on covariates. This assumption is also known as conditional independence, selection on observables, ignorability, or exogenous selection; see Imbens (2004) for a survey. This assumption is not refutable, meaning that the data alone cannot tell us whether it is true. Nonetheless, empirical researchers may wonder: How important is this assumption in their analyses? Put differently: How sensitive are their results to failures of the unconfoundedness assumption?

A large literature on sensitivity analysis has developed to answer this question. Moreover, researchers widely acknowledge that answering this question is an important step in empirical research. For example, in their figure 1, Caliendo and Kopeinig (2008) describe the workflow of a standard analysis using selection on observables. Their fifth and final step in this workflow is to perform sensitivity analysis to the unconfoundedness assumption. Imbens and Wooldridge (2009, section 6.2), Imbens and Rubin (2015, chapter 22), and Athey and Imbens (2017) all also recommend that researchers conduct sensitivity analyses to assess the importance of non-refutable identifying assumptions. In particular, Athey and Imbens (2017) describe these methods as "a systematic way of doing the sensitivity analyses that are routinely done in empirical work, but often in an unsystematic way."

Most of the existing approaches to assessing unconfoundedness rely on strong auxiliary assumptions, however. For example, they often assume treatment effects are homogeneous and that all unobserved confounding arises due to a single unobserved variable whose distribution is parametrically specified, like a binary or normal distribution. They also often assume a parametric functional form for potential outcomes, like a logit model for binary potential outcomes or a linear model for continuous potential outcomes. These assumptions, which are not needed for identification of the baseline model when unconfoundedness holds, raise a new question: Are the findings of these sensitivity analyses themselves sensitive to these extra auxiliary assumptions?

In this paper, we provide a set of tools for assessing the sensitivity of the unconfoundedness assumption which do not rely on strong auxiliary assumptions that are not used for the baseline analysis. We do this by studying nonparametric relaxations of the unconfoundedness assumption. Specifically, we apply the identification results of Masten and Poirier (2018), who consider a class of assumptions called conditional $c$-dependence. This class measures relaxations of conditional independence by a single scalar parameter $c \in [0,1]$, where $c$ is the largest difference between the propensity score and the probability of treatment conditional on covariates and an unobserved potential outcome. Hence it has a straightforward interpretation as a deviation from conditional independence, as measured in probability units. For any positive $c$, conditional independence only partially holds, and so we cannot learn the exact value of our treatment effect parameters, like the average treatment effect (ATE) or the average effect of treatment on the treated (ATT). Instead, we only get bounds. Masten and Poirier (2018) derive closed-form expressions for these bounds as a function of $c$.
Setting $c = 0$ yields the baseline model where unconfoundedness holds. Setting $c = 1$ yields the other extreme where no assumptions on selection are made, and hence gives the no assumptions bounds as in Manski (1990). The bounds are monotonic in $c$, so that small values of $c$ give narrow bounds while larger values of $c$ give wider bounds. Just how wide these bounds are, and hence how sensitive one's results are, depends on the data.

While Masten and Poirier (2018) studied identification of treatment effects under nonparametric relaxations of unconfoundedness, they did not study estimation or inference. We do that in this paper. First we propose sample analog estimators of the bounds on the conditional quantile treatment effect (CQTE), the conditional average treatment effect (CATE), the ATE, and the ATT. We do this using flexible parametric first step estimators of the propensity score and the conditional quantile function of the observed outcomes given treatment and covariates. Although such parametric restrictions are not required for our identification theory, the analysis of inference is complicated and non-standard even with these parametric first step estimators. Doing inference based on fully nonparametric first step estimators will likely require deriving and applying more general asymptotic theory for non-Hadamard differentiable functionals than currently exists. Hence we leave that to future work. Moreover, note that our approach of using nonparametric identification results paired with flexible parametric estimators is analogous to what is commonly done in the baseline model which imposes unconfoundedness: Identification is shown nonparametrically, but many commonly used estimators are based on flexible parametric first step estimators. For example, see chapter 13 in Imbens and Rubin (2015).

We derive the asymptotic distributions of our bound estimators using the delta method for Hadamard directionally differentiable functionals from Fang and Santos (2019). We then show consistency of a non-standard bootstrap based on estimating the analytical Hadamard directional derivatives of our bound functionals. This step again involves using the recent results of Fang and Santos (2019). We show how to construct confidence bands for the bound functions which are uniform over all values of $c \in [0,1]$.

Related Literature
We conclude this section with a brief literature review. As mentioned earlier, there is a large existing literature that studies how to relax unconfoundedness. This includes Rosenbaum and Rubin (1983), Mauro (1990), Rosenbaum (1995, 2002), Robins, Rotnitzky, and Scharfstein (2000), Imbens (2003), Altonji, Elder, and Taber (2005, 2008), Ichino, Mealli, and Nannicini (2008), Hosman, Hansen, and Holland (2010), Krauth (2016), Kallus, Mao, and Zhou (2019), Oster (2019), and Cinelli and Hazlett (2020), among others. Here we discuss the most closely related work and several recent papers. For further details about the related literature, see section 1 in Masten and Poirier (2018) for identification and Appendix D in Masten and Poirier (2020) for estimation and inference.

A key feature of our results is that they are based on the fully nonparametric analysis of Masten and Poirier (2018). There are only a few other alternative nonparametric analyses available in the literature. The first is Ichino et al. (2008), who require that all variables are discretely distributed. In contrast, we allow for continuous outcomes, covariates, and unobservables. Their approach requires picking a vector of sensitivity parameters that determines the joint distribution of the discrete observable and unobservable variables. In contrast, our approach uses a scalar sensitivity parameter. Finally, unlike us, they do not provide any formal results for doing estimation or inference. The second is Rosenbaum (1995, 2002), who proposed a sensitivity analysis for unconfoundedness within the context of doing randomization inference based on the sharp null hypothesis of no unit level treatment effects for all units in the data set. Like our approach, he only uses a scalar sensitivity parameter and also does not rely on a parametric model for outcomes or treatment assignment probabilities. His approach, however, is based on finite sample randomization inference (for more discussion, see chapter 5 of Imbens and Rubin 2015). This approach to inference is conceptually distinct from the approach we use, which is based on repeated sampling from a large population. For this reason, we view these different approaches to inference in sensitivity analyses as complementary. Finally, Kallus et al. (2019) study bounds on CATE under the same nonparametric relaxations defined by Rosenbaum (1995, 2002). Unlike him, however, they take a large population view. They propose sample analog kernel estimators based on an implicit characterization of the identified set using extrema. They show consistency of these estimators, but they do not provide any inference results. As we discuss later, this is a key distinction because inference in this setting is non-standard.

A few recent papers provide methods for assessing unconfoundedness in parametric linear models. This includes Oster (2019) and Cinelli and Hazlett (2020). These results rely on the assumption that outcomes are linear functions of treatment and covariates, among other parametric assumptions. In contrast, we build on the selection on observables literature that has emphasized nonparametric identification. That literature emphasizes that identification by functional form is often implausible. Sensitivity analyses that rely on functional form assumptions are subject to the same criticism: Findings that one's results are robust to violations of unconfoundedness can be driven primarily by the parametric functional form restrictions.
To address this, our estimation and inference results are based on nonparametric sensitivity analyses that do not require parametric assumptions.

Finally, we discuss the relationship with our own previous work. As noted earlier, our paper provides estimation and inference results for population bounds derived in Masten and Poirier (2018). That paper did not provide any estimation or inference theory. Masten and Poirier (2020) builds on those results in several ways: First, they extend the identification analysis to identification of distributional treatment effect parameters, with a focus on assessing the importance of the rank invariance assumption. Second, they provide some asymptotic distributional results for sample analog estimators of the average treatment effect (ATE), the conditional average treatment effect (CATE), and the conditional quantile treatment effect (CQTE), among other results. Those results are limited in a variety of ways, which we discuss next.

Specifically, our paper differs from the results in Masten and Poirier (2020) in several important ways: (1) Our paper allows for both discrete and continuous covariates, whereas that paper focused on the case where all covariates are discrete. In particular, to allow for continuous covariates we develop a different estimator of the bound functions. This is important since many empirical applications, like ours in section 7, use continuous covariates. (2) Our results allow for all possible values of $c \in [0,1]$, whereas that paper restricted the allowed values of $c$ (see their assumption A2.1). This is also important for practice and requires a substantial amount of new theoretical work. (3) Our results use the Fang and Santos (2019) bootstrap based on estimators of analytical Hadamard directional derivatives to do inference. That paper instead used the numerical delta method bootstrap of Hong and Li (2018). Our approach allows us to avoid choosing the step size tuning parameter required for the numerical delta method bootstrap, although our estimators of the analytical Hadamard directional derivatives also have tuning parameters. (4) Unlike that paper, we also discuss inference on the average effect of treatment on the treated (ATT). (5) In this paper we provide a new companion Stata module implementing our results.

2 Setup

In this section we describe the model and review standard results on point identification of treatment effects under unconfoundedness. We then describe how we relax unconfoundedness. Finally, we review the bounds on treatment effects derived by Masten and Poirier (2018) when unconfoundedness is relaxed.

2.1 Model and Baseline Point Identification Results
We use the standard potential outcomes model. Let $X \in \{0,1\}$ be an observed binary treatment. Let $Y_1$ and $Y_0$ denote the unobserved potential outcomes. The observed outcome is
\[ Y = XY_1 + (1-X)Y_0. \tag{1} \]
Let $W \in \mathbb{R}^{d_W}$ denote a vector of observed covariates, which may be discrete, continuous, or mixed. Let $\mathcal{W} = \mathrm{supp}(W)$ denote the support of $W$. Let
\[ p_{x|w} = P(X = x \mid W = w) \]
denote the observed generalized propensity score. It is well known that the conditional distributions of potential outcomes $Y_1 \mid W$ and $Y_0 \mid W$ are point identified under the following two assumptions:

Unconfoundedness: $X \perp\!\!\!\perp Y_1 \mid W$ and $X \perp\!\!\!\perp Y_0 \mid W$.

Overlap: $p_{1|w} \in (0,1)$ for all $w \in \mathcal{W}$.

Consequently, any functional of the distributions of $Y_1 \mid W$ and $Y_0 \mid W$ is also point identified. We focus on two leading examples: the average treatment effect, $\mathrm{ATE} = E(Y_1 - Y_0)$, and the average treatment effect for the treated, $\mathrm{ATT} = E(Y_1 - Y_0 \mid X = 1)$. We also consider the conditional quantile treatment effects $\mathrm{CQTE}(\tau \mid w) = Q_{Y_1|W}(\tau \mid w) - Q_{Y_0|W}(\tau \mid w)$ and the conditional average treatment effect $\mathrm{CATE}(w) = E(Y_1 - Y_0 \mid W = w)$.
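Under unconfoundedness and overlap, these parameters can be estimated by plug-in. As a concrete illustration, here is a minimal Python sketch (ours, not the paper's companion Stata module) of a baseline ATE estimate via regression adjustment; the linear specification and function name are illustrative assumptions, and any flexible first step could be used instead.

```python
import numpy as np
import statsmodels.api as sm

def baseline_ate(y, x, w):
    """Plug-in ATE estimate under unconfoundedness via regression adjustment.

    Fits E(Y | X, W) with a linear model in (1, X, W), predicts both
    potential outcomes for every unit, and averages their difference
    over the empirical distribution of W.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), x, w])
    fit = sm.OLS(y, design).fit()
    d1 = np.column_stack([np.ones(n), np.ones(n), w])   # set X = 1
    d0 = np.column_stack([np.ones(n), np.zeros(n), w])  # set X = 0
    return np.mean(fit.predict(d1) - fit.predict(d0))
```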
2.2 Sensitivity Analysis: Relaxing Unconfoundedness

As discussed in section 1, the overlap assumption is refutable and hence can be directly verified from the data. The unconfoundedness assumption, however, is not refutable. Consequently, like much of the literature reviewed in section 1, we perform a sensitivity analysis. This entails replacing unconfoundedness with a weaker assumption and investigating how this changes the conclusions we can draw about our parameter of interest. Specifically, we define the following class of assumptions, which we call conditional $c$-dependence (Masten and Poirier 2018):
Definition 1. Let $x \in \{0,1\}$. Let $w \in \mathcal{W}$. Let $c$ be a scalar between 0 and 1. Say $X$ is conditionally $c$-dependent with $Y_x$ given $W$ if
\[ \sup_{y_x \in \mathrm{supp}(Y_x \mid W = w)} \big| P(X = 1 \mid Y_x = y_x, W = w) - P(X = 1 \mid W = w) \big| \le c \tag{2} \]
holds for all $w \in \mathcal{W}$.

When $c = 0$, conditional $c$-dependence is equivalent to $X \perp\!\!\!\perp Y_x \mid W$. For $c > 0$, however, we allow for violations of unconfoundedness by allowing the unobserved conditional probability $P(X = 1 \mid Y_x = y_x, W = w)$ to differ from the observed propensity score $P(X = 1 \mid W = w)$ by at most $c$. Thus we actually allow for some selection on unobservables, since treatment assignment may depend on $Y_x$, but in a constrained manner. For sufficiently large $c$, however, conditional $c$-dependence imposes no constraints on the relationship between $Y_x$ and $X$. This happens when $c \ge \overline{C}$ where $\overline{C} = \sup_{w \in \mathcal{W}} \max\{ p_{1|w}, p_{0|w} \}$. When $c \in (0, \overline{C})$, conditional $c$-dependence imposes some constraints on treatment assignment, but it does not require conditional independence to hold exactly. For this reason, we call it a conditional partial independence assumption. Thus our sensitivity analysis replaces unconfoundedness with

Conditional Partial Independence: $X$ is conditionally $c$-dependent with $Y_1$ and $Y_0$ given $W$.
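To make the definition concrete, the following small sketch (ours, for illustration only) computes the band that conditional $c$-dependence places on the unobserved treatment probability $P(X = 1 \mid Y_x = y_x, W = w)$ around a given propensity score.

```python
def c_dependence_band(p1_w, c):
    """Interval of values for P(X = 1 | Y_x = y_x, W = w) permitted by
    conditional c-dependence: [p1_w - c, p1_w + c], clipped to [0, 1]."""
    return max(p1_w - c, 0.0), min(p1_w + c, 1.0)

# With a propensity score of 0.6, c = 0.1 allows the unobserved
# probability to lie anywhere in [0.5, 0.7]; c = 0 collapses the band
# to the single point 0.6, which is unconfoundedness.
print(c_dependence_band(0.6, 0.1))
```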
2.3 Treatment Effect Bounds

By relaxing conditional independence, our main parameters of interest, ATE and ATT, are no longer point identified. Instead they are partially identified: We can bound them from above and from below. As $c$ gets close to zero, however, these bounds collapse to a point. Hence for small $c$ these bounds can be quite narrow. The goal of a sensitivity analysis is to understand how the shape and width of these bounds changes as $c$ varies from 0 to 1.

These bounds were derived in Masten and Poirier (2018), which we summarize here. Although that paper studied both continuous and binary outcomes, here we only summarize the results for continuous $Y_x$. All of our parameters of interest can be written in terms of bounds on the quantile regressions $Q_{Y_x|W}(\tau \mid w)$. Under the conditional partial independence assumption stated above and some regularity conditions, Masten and Poirier (2018) showed that $[\underline{Q}^c_{Y_x|W}(\tau \mid w), \overline{Q}^c_{Y_x|W}(\tau \mid w)]$ are sharp bounds on this quantile regression, uniformly in $\tau$, $x$, and $w$, where
\[ \overline{Q}^c_{Y_x|W}(\tau \mid w) = Q_{Y|X,W}\big( \overline{t}(\tau,x,w) \mid x, w \big) \tag{3} \]
where
\[ \overline{t}(\tau,x,w) = \min\left\{ \tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau}{p_{x|w}},\ 1 \right\} \]
and
\[ \underline{Q}^c_{Y_x|W}(\tau \mid w) = Q_{Y|X,W}\big( \underline{t}(\tau,x,w) \mid x, w \big) \tag{4} \]
where
\[ \underline{t}(\tau,x,w) = \max\left\{ \tau - \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau-1}{p_{x|w}} + 1,\ 0 \right\}. \]
Taking differences of these bounds at $x = 1$ and $x = 0$ yields sharp bounds on the conditional quantile treatment effect $\mathrm{CQTE}(\tau \mid w)$, uniformly in $\tau$ and $w$:
\[ \big[ \underline{\mathrm{CQTE}}^c(\tau \mid w), \overline{\mathrm{CQTE}}^c(\tau \mid w) \big] \equiv \big[ \underline{Q}^c_{Y_1|W}(\tau \mid w) - \overline{Q}^c_{Y_0|W}(\tau \mid w),\ \overline{Q}^c_{Y_1|W}(\tau \mid w) - \underline{Q}^c_{Y_0|W}(\tau \mid w) \big]. \]
Integrating these bounds over $\tau$ yields sharp bounds on $\mathrm{CATE}(w)$, uniformly in $w$:
\[ \big[ \underline{\mathrm{CATE}}^c(w), \overline{\mathrm{CATE}}^c(w) \big] \equiv \left[ \int_0^1 \underline{\mathrm{CQTE}}^c(\tau \mid w)\, d\tau,\ \int_0^1 \overline{\mathrm{CQTE}}^c(\tau \mid w)\, d\tau \right]. \]
Further integrating over the marginal distribution of $W$ yields sharp bounds on ATE:
\[ \big[ \underline{\mathrm{ATE}}^c, \overline{\mathrm{ATE}}^c \big] \equiv \big[ E\big( \underline{\mathrm{CATE}}^c(W) \big),\ E\big( \overline{\mathrm{CATE}}^c(W) \big) \big]. \]
To obtain bounds on ATT, let
\[ \underline{E}^c_x(w) = \int_0^1 \underline{Q}^c_{Y_x}(\tau \mid w)\, d\tau \quad \text{and} \quad \overline{E}^c_x(w) = \int_0^1 \overline{Q}^c_{Y_x}(\tau \mid w)\, d\tau \]
denote bounds on $E(Y_x \mid W = w)$. Averaging these over the marginal distribution of $W$ yields bounds on $E(Y_x)$, denoted by $\underline{E}^c_x = E\big( \underline{E}^c_x(W) \big)$ and $\overline{E}^c_x = E\big( \overline{E}^c_x(W) \big)$. This yields the following bounds on ATT:
\[ \left[ E(Y \mid X=1) - \frac{\overline{E}^c_0 - p_0 E(Y \mid X=0)}{p_1},\ E(Y \mid X=1) - \frac{\underline{E}^c_0 - p_0 E(Y \mid X=0)}{p_1} \right] \tag{5} \]
where $p_x = P(X = x)$ for $x \in \{0,1\}$. Finally, note that all of these bounds are sharp.
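The bound functions above translate directly into code. The sketch below is ours; the conditional quantile function `q_cond` and propensity score `p1_w` are assumed inputs. It implements the maps $\overline{t}$ and $\underline{t}$ from equations (3) and (4) and the resulting CQTE bounds.

```python
def t_upper(tau, c, p):
    """The map t-bar(tau, x, w) in equation (3); p = p_{x|w}."""
    return min(tau + (c / p) * min(tau, 1 - tau), tau / p, 1.0)

def t_lower(tau, c, p):
    """The map t-underline(tau, x, w) in equation (4); p = p_{x|w}."""
    return max(tau - (c / p) * min(tau, 1 - tau), (tau - 1) / p + 1, 0.0)

def cqte_bounds(tau, c, q_cond, p1_w):
    """Sharp CQTE bounds at quantile tau.  q_cond(t, x) evaluates the
    conditional quantile Q_{Y|X,W}(t | x, w); p1_w = P(X = 1 | W = w)."""
    p0_w = 1.0 - p1_w
    lower = q_cond(t_lower(tau, c, p1_w), 1) - q_cond(t_upper(tau, c, p0_w), 0)
    upper = q_cond(t_upper(tau, c, p1_w), 1) - q_cond(t_lower(tau, c, p0_w), 0)
    return lower, upper
```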
2.4 Breakdown Points

So far we've discussed sharp bounds on various parameters of interest as a function of the sensitivity parameter $c$. In addition to the bounds themselves, it is common to analyze breakdown points for various conclusions of interest. For example, suppose that under the baseline model ($c = 0$) we find that $\mathrm{ATE} > 0$. We then ask: How much can we relax unconfoundedness while still being able to conclude that the ATE is nonnegative? To answer this question, define the breakdown point for the conclusion that the ATE is nonnegative as
\[ c_{bp} = \sup\big\{ c \in [0,1] : \big[ \underline{\mathrm{ATE}}^c, \overline{\mathrm{ATE}}^c \big] \subseteq [0, \infty) \big\}. \tag{6} \]
This number is a quantitative measure of the robustness of the conclusion that the ATE is nonnegative to relaxations of the key identifying assumption of unconfoundedness. Breakdown points can be defined for other parameters and conclusions as well. See Masten and Poirier (2020) for more discussion and additional references.

2.5 Interpreting Conditional c-Dependence

We conclude this section by giving some suggestions for how to interpret conditional $c$-dependence in practice. In particular, what values of $c$ are large? What values are small? Here we summarize and extend the discussion on page 321 of Masten and Poirier (2018). We illustrate these interpretations in our empirical analysis in section 7.

Let $W_k$ denote a component of $W$. Denote the propensity score by
\[ p_{1|W}(w_{-k}, w_k) = P(X = 1 \mid W = (w_{-k}, w_k)). \]
Let
\[ p_{1|W_{-k}}(w_{-k}) = P(X = 1 \mid W_{-k} = w_{-k}) \]
denote the leave-out-variable-$k$ propensity score. This is just the proportion of the population who are treated, conditional on only $W_{-k}$. Consider the random variable
\[ \Delta_k = \big| p_{1|W}(W_{-k}, W_k) - p_{1|W_{-k}}(W_{-k}) \big|. \]
This difference is a measure of the impact on the observed propensity score of adding $W_k$, given that we already included $W_{-k}$. Conditional $c$-dependence is defined by a similar difference, except there we add the unobservable $Y_x$ given that we already included $W$. Hence we suggest using the distribution of $\Delta_k$ to calibrate values of $c$. For example, you could examine the 50th, 75th, and 90th quantiles of $\Delta_k$, along with the upper bound on the support, $\bar{c}_k = \max \mathrm{supp}(\Delta_k)$. You may also find it useful to plot an estimate of the density of $\Delta_k$. All of these reference values can be compared to the breakdown point $c_{bp}$ for a specific conclusion of interest. Specifically, if $c_{bp}$ is larger than the chosen reference value, then the conclusion of interest could be considered robust. In contrast, if $c_{bp}$ is smaller than the chosen reference value, then the conclusion of interest could be considered sensitive. You may also want to see where $c_{bp}$ lies relative to the distribution of $\Delta_k$. This can be done by computing $F_{\Delta_k}(c_{bp})$.

While you could do this for all covariates $k$, it may be helpful to restrict attention to covariates $k$ that have a sufficiently large impact on the baseline point estimates. For example, suppose we are interested in the ATE. Let $\mathrm{ATE}_{-k}$ denote the ATE estimand obtained in the baseline selection on observables model using only the covariates $W_{-k}$. Let ATE denote the ATE estimand obtained in the baseline model using all the covariates. Then
\[ \left| \frac{\mathrm{ATE} - \mathrm{ATE}_{-k}}{\mathrm{ATE}} \right| \]
measures the impact of dropping covariate $k$ on the ATE point estimand, as a percentage of the baseline estimand that uses all covariates in $W$. You may want to restrict attention to covariates $k$ for which this ratio is relatively large. We illustrate this approach in our empirical analysis in section 7.
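As an illustration of this calibration exercise, the sketch below (ours; it assumes logit propensity score models, as in section 3, and hypothetical helper names) estimates reference values for $c$ by comparing full and leave-out-variable-$k$ fitted propensity scores.

```python
import numpy as np
import statsmodels.api as sm

def delta_k_summaries(x, w, k, probs=(0.5, 0.75, 0.9)):
    """Reference values for c: quantiles and maximum of the estimated
    Delta_k = |p(W) - p(W_{-k})|, where both propensity scores are fit
    by logit.  x is the 0/1 treatment; w is the n x d_W covariate matrix."""
    n = len(x)
    full = np.column_stack([np.ones(n), w])
    left_out = np.delete(full, 1 + k, axis=1)  # drop covariate k
    p_full = sm.Logit(x, full).fit(disp=0).predict(full)
    p_out = sm.Logit(x, left_out).fit(disp=0).predict(left_out)
    delta = np.abs(p_full - p_out)
    return np.quantile(delta, probs), delta.max()
```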
3 Estimation

In the previous section we assumed the entire population distribution of $(Y, X, W)$ was known. In practice we only have a finite sample $\{(Y_i, X_i, W_i)\}_{i=1}^n$ from this distribution. In this section we explain how to use this finite sample data to estimate the population bounds of section 2. We give the corresponding asymptotic theory in section 4, where we obtain the joint limiting distribution of the treatment effect bounds. We describe how to perform bootstrap based inference on these bounds in section 5.

As shown in section 2, all of our bounds can be constructed from the marginal distribution of $W$ and the bounds on $Q_{Y_x|W}$ given in equations (3) and (4). These bounds on $Q_{Y_x|W}$, in turn, depend on just two features of the data:

1. The conditional quantile function $Q_{Y|X,W}(\tau \mid x, w)$.
2. The propensity score $p_{x|w} = P(X = x \mid W = w)$.

In both cases, we can use parametric, semiparametric, or nonparametric estimation methods. In this paper we focus on flexible parametric approaches. Even in this case the asymptotic distribution theory is non-standard and quite complicated. We discuss this point further in the conclusion, section 8. In section 3.1 we describe our first step estimators of these two functions. Given these estimators, we then construct sample analog estimates of our bound functions in a second step. We describe these estimators in section 3.2.

3.1 First Step Estimators

We estimate $Q_{Y|X,W}$ by a linear quantile regression of $Y$ on flexible functions of $(X, W)$ that we denote by $q(X, W) \in \mathbb{R}^{d_q}$. For example, $q(x, w)$ could be $(1, x, w)$, $(1, x, w, x \cdot w)$, or could contain additional interactions between the treatment indicator $X$ and functions of the covariates $W$. For $\tau \in (0,1)$, let
\[ \widehat{\gamma}(\tau) = \operatorname*{argmin}_{a \in \mathbb{R}^{d_q}} \sum_{i=1}^n \rho_\tau\big( Y_i - a' q(X_i, W_i) \big) \]
be the estimated coefficients from a linear quantile regression of $Y$ on $q(X, W)$ at the quantile $\tau$. Here $\rho_\tau(s) = s(\tau - \mathbb{1}(s < 0))$ is the check function, where $\mathbb{1}(\cdot)$ denotes the indicator function. Let $\widehat{Q}_{Y|X,W}(\tau \mid x, w) = q(x, w)' \widehat{\gamma}(\tau)$ denote this estimator.

We estimate the propensity score by maximum likelihood. In particular, specify the parametric model
\[ P(X = 1 \mid W = w) = F(r(w)'\beta) \]
where $F$ is a known cdf, $r(w)$ is a known vector function, and $\beta$ is an unknown constant vector. The functions $r(w)$ could simply be $(1, w)$ or may contain functions of $w$, like squared or interaction terms. For notational simplicity, we will assume throughout the paper that $r(w) = w$. Given this assumption, the dimension of $\beta$ is $d_W$, the length of $W$. Suppose $\beta$ lies in the parameter space $\mathcal{B} \subseteq \mathbb{R}^{d_W}$.

This specification for the propensity score includes the probit and logit estimators as special cases. Those estimators are commonly used in the literature; for example, see chapter 13 of Imbens and Rubin (2015). Let $\widehat{\beta}$ denote the maximum likelihood estimate of $\beta$:
\[ \widehat{\beta} = \operatorname*{argmax}_{\beta \in \mathcal{B}} \sum_{i=1}^n \log L(X_i, W_i'\beta) \]
where $L(x, w'\beta) = F(w'\beta)^x (1 - F(w'\beta))^{1-x}$. For each $x \in \{0,1\}$, let $\widehat{p}_{x|w} = L(x, w'\widehat{\beta})$ denote our propensity score estimator.
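A minimal sketch of these two first step estimators, assuming a logit propensity score ($F$ logistic) and $q(x, w) = (1, x, w)$; in practice richer specifications can be used, and the function name is an illustrative assumption.

```python
import numpy as np
import statsmodels.api as sm

def first_step(y, x, w, taus):
    """Logit propensity score and linear quantile regression coefficients
    of Y on q(X, W) = (1, X, W) over a grid of quantiles taus."""
    n = len(y)
    wmat = np.column_stack([np.ones(n), w])
    p1_hat = sm.Logit(x, wmat).fit(disp=0).predict(wmat)  # P(X = 1 | W_i)
    qmat = np.column_stack([np.ones(n), x, w])
    gamma_hat = np.column_stack(
        [sm.QuantReg(y, qmat).fit(q=tau).params for tau in taus]
    )  # one column of coefficients per quantile
    return p1_hat, gamma_hat
```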
3.2 Second Step Estimators

Given the first step estimators from section 3.1, we obtain the following sample analog estimators of the quantile bound functions defined in equations (3) and (4):
\[ \widehat{\overline{Q}}^c_{Y_x|W}(\tau \mid w) = \widehat{Q}_{Y|X,W}\big( \widehat{\overline{t}}(\tau,x,w) \mid x, w \big) \quad \text{where} \quad \widehat{\overline{t}}(\tau,x,w) = \min\left\{ \tau + \frac{c}{\widehat{p}_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau}{\widehat{p}_{x|w}},\ 1 \right\} \]
and
\[ \widehat{\underline{Q}}^c_{Y_x|W}(\tau \mid w) = \widehat{Q}_{Y|X,W}\big( \widehat{\underline{t}}(\tau,x,w) \mid x, w \big) \quad \text{where} \quad \widehat{\underline{t}}(\tau,x,w) = \max\left\{ \tau - \frac{c}{\widehat{p}_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau-1}{\widehat{p}_{x|w}} + 1,\ 0 \right\}. \]
As discussed in section 2, averaging these over $\tau \in (0,1)$ yields sample analog estimates of bounds on $\mathrm{CATE}(w)$, which we can then use to get bounds on ATE. This approach, however, requires estimation of extremal quantiles, that is, estimation for $\tau$'s close to 0 or 1. This is well known to be a delicate problem (see Chernozhukov, Fernández-Val, and Kaji 2017 for details). So in this paper we use a common solution: fixed trimming of the extremal quantiles. We do this by modifying the quantile bound estimators to ensure that the quantile index lies in $[\varepsilon, 1-\varepsilon]$ for some fixed and known $\varepsilon \in (0, 1/2)$:
\[ \widehat{\overline{Q}}^c_{Y_x|W}(\tau \mid w) = \widehat{Q}_{Y|X,W}\big( \max\{ \min\{ \widehat{\overline{t}}(\tau,x,w), 1-\varepsilon \}, \varepsilon \} \mid x, w \big) \tag{7} \]
and
\[ \widehat{\underline{Q}}^c_{Y_x|W}(\tau \mid w) = \widehat{Q}_{Y|X,W}\big( \max\{ \min\{ \widehat{\underline{t}}(\tau,x,w), 1-\varepsilon \}, \varepsilon \} \mid x, w \big). \tag{8} \]
We use these estimators for the rest of the paper. Common choices of $\varepsilon$ are 0.05 or 0.01. In our asymptotic analysis we hold $\varepsilon$ fixed with sample size. In principle we could generalize the results to allow $\varepsilon \to 0$ as $n \to \infty$, but this would complicate the analysis of inference, which is already non-standard for other reasons. Since we fix $\varepsilon$ throughout, we omit $\varepsilon$ from the notation for brevity, except when necessary.

We next estimate the CQTE bounds by taking differences of the quantile bound estimators:
\[ \big[ \widehat{\underline{\mathrm{CQTE}}}^c(\tau \mid w), \widehat{\overline{\mathrm{CQTE}}}^c(\tau \mid w) \big] \equiv \big[ \widehat{\underline{Q}}^c_{Y_1|W}(\tau \mid w) - \widehat{\overline{Q}}^c_{Y_0|W}(\tau \mid w),\ \widehat{\overline{Q}}^c_{Y_1|W}(\tau \mid w) - \widehat{\underline{Q}}^c_{Y_0|W}(\tau \mid w) \big]. \]
Since our CATE bounds are simply the integral of the CQTE bounds over all the quantiles $\tau$, we can estimate them by
\[ \big[ \widehat{\underline{\mathrm{CATE}}}^c(w), \widehat{\overline{\mathrm{CATE}}}^c(w) \big] \equiv \left[ \int_0^1 \widehat{\underline{\mathrm{CQTE}}}^c(\tau \mid w)\, d\tau,\ \int_0^1 \widehat{\overline{\mathrm{CQTE}}}^c(\tau \mid w)\, d\tau \right]. \]
A second integration over $w$ with respect to the marginal distribution of $W$ yields bounds on ATE. Like much of the literature, we use the empirical distribution of $W$ to estimate the marginal distribution of $W$. This yields the following estimator of our ATE bounds:
\[ \left[ \widehat{\underline{\mathrm{ATE}}}^c, \widehat{\overline{\mathrm{ATE}}}^c \right] = \left[ \frac{1}{n} \sum_{i=1}^n \widehat{\underline{\mathrm{CATE}}}^c(W_i),\ \frac{1}{n} \sum_{i=1}^n \widehat{\overline{\mathrm{CATE}}}^c(W_i) \right]. \]
Next consider the estimation of the ATT bounds. Let
\[ \widehat{\underline{E}}^c_0 = \frac{1}{n} \sum_{i=1}^n \int_0^1 \widehat{\underline{Q}}^c_{Y_0}(\tau \mid W_i)\, d\tau \quad \text{and} \quad \widehat{\overline{E}}^c_0 = \frac{1}{n} \sum_{i=1}^n \int_0^1 \widehat{\overline{Q}}^c_{Y_0}(\tau \mid W_i)\, d\tau. \]
For $x \in \{0,1\}$ let
\[ \widehat{E}(Y \mid X = x) = \frac{\sum_{i=1}^n Y_i \mathbb{1}(X_i = x)}{\sum_{i=1}^n \mathbb{1}(X_i = x)} \quad \text{and} \quad \widehat{p}_x = \frac{1}{n} \sum_{i=1}^n \mathbb{1}(X_i = x). \]
We can then estimate the ATT bounds by replacing the population quantities in (5) with the estimators we just defined.

For $c = 0$, our estimated upper and lower bounds are equal and give point estimates of the various parameters of interest.
For $c > 0$, our bounds have positive width. To use these bounds in a sensitivity analysis, we recommend producing the following plot: Pick a grid $\{c_1, \ldots, c_K\} \subseteq [0,1]$ of values of $c$. Compute our bound estimates on this grid and plot them against these values of $c$. Then compute and plot confidence bands for these bound estimates against $c$ as well; we describe how to compute these bands in section 5. We illustrate all of these steps in our empirical analysis of section 7.
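To illustrate how the pieces fit together, here is a sketch (ours; `q_hat` and `p1_hat` are assumed to come from the first step, and the integral over $\tau$ is approximated on a finite grid) of the trimmed sample analog ATE bounds at a single value of $c$. Evaluating it over a grid $\{c_1, \ldots, c_K\}$ produces the inputs for the recommended plot.

```python
import numpy as np

def ate_bounds_at_c(c, eps, taus, q_hat, p1_hat):
    """Trimmed sample analog ATE bounds at sensitivity parameter c.

    q_hat(t, x, i) evaluates the estimated quantile Q_{Y|X,W}(t | x, W_i);
    p1_hat[i] is the estimated propensity score of unit i; taus is a grid
    approximating the integral over (0, 1)."""
    clip = lambda t: min(max(t, eps), 1.0 - eps)
    lb_sum, ub_sum = 0.0, 0.0
    for i, p1 in enumerate(p1_hat):
        p0 = 1.0 - p1
        for tau in taus:
            up1 = min(tau + c / p1 * min(tau, 1 - tau), tau / p1, 1.0)
            lo1 = max(tau - c / p1 * min(tau, 1 - tau), (tau - 1) / p1 + 1, 0.0)
            up0 = min(tau + c / p0 * min(tau, 1 - tau), tau / p0, 1.0)
            lo0 = max(tau - c / p0 * min(tau, 1 - tau), (tau - 1) / p0 + 1, 0.0)
            lb_sum += q_hat(clip(lo1), 1, i) - q_hat(clip(up0), 0, i)
            ub_sum += q_hat(clip(up1), 1, i) - q_hat(clip(lo0), 0, i)
    scale = 1.0 / (len(p1_hat) * len(taus))
    return lb_sum * scale, ub_sum * scale
```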
4 Asymptotic Theory

In this section we provide formal results on the consistency and limiting distributions of the estimators we described in section 3. In section 5 we show how to use these results to do inference based on a non-standard bootstrap. In section 6 we provide sufficient conditions under which standard bootstrap inference is valid.

4.1 Convergence of the First Step Estimators

Throughout this paper we assume that we observe a random sample.
Assumption A1 (Random Sample). $\{(Y_i, X_i, W_i)\}_{i=1}^n$ are iid.

Our first step estimators are standard in the literature. Hence we only briefly review the main assumptions and results for these estimators. For completeness, we provide a formal analysis in appendix A. We assume that both the propensity score and quantile regression functions are correctly specified:
\[ P(X = x \mid W = w) = L(x, w'\beta_0) \quad \text{and} \quad Q_{Y|X,W}(\tau \mid x, w) = q(x, w)'\gamma_0(\tau) \]
for all $\tau \in [\varepsilon, 1-\varepsilon]$.

Since the first step estimators consist of linear quantile regression and maximum likelihood estimation, their $\sqrt{n}$-convergence to Gaussian elements can be shown under standard assumptions and arguments. For example, see Newey and McFadden (1994). Moreover, the convergence of $\widehat{\gamma}(\tau)$ to $\gamma_0(\tau)$ is uniform over $\tau \in [\varepsilon, 1-\varepsilon]$. Formally, as we show in appendix A lemma 1,
\[ \sqrt{n} \begin{pmatrix} \widehat{\beta} - \beta_0 \\ \widehat{\gamma}(\tau) - \gamma_0(\tau) \end{pmatrix} \rightsquigarrow Z(\tau), \]
where $Z(\cdot)$ is a mean-zero Gaussian process in $\mathbb{R}^{d_W} \times \ell^\infty([\varepsilon, 1-\varepsilon], \mathbb{R}^{d_q})$ with continuous paths. The covariance kernel of this process is defined in appendix A, equation (11). Also see appendix A for the formal assumptions under which this result holds.

4.2 Convergence of the Second Step Estimators

Next we consider the limiting distribution of our various second step estimators.
The CATE Bounds
We start with equations (7) and (8), our estimators of the conditional quantile bounds. The population conditional quantile bounds, equations (3) and (4), are known functions of $\theta_0 = (\beta_0, \gamma_0)$. Define
\[ \overline{\Gamma}_1(x, w, \tau, \theta) = q(x, w)'\gamma\left( \max\left\{ \min\left\{ \tau + \frac{c}{L(x, w'\beta)} \min\{\tau, 1-\tau\},\ \frac{\tau}{L(x, w'\beta)},\ 1-\varepsilon \right\},\ \varepsilon \right\} \right) \]
\[ \underline{\Gamma}_1(x, w, \tau, \theta) = q(x, w)'\gamma\left( \min\left\{ \max\left\{ \tau - \frac{c}{L(x, w'\beta)} \min\{\tau, 1-\tau\},\ \frac{\tau-1}{L(x, w'\beta)} + 1,\ \varepsilon \right\},\ 1-\varepsilon \right\} \right). \]
Throughout the paper, we let $\Gamma_j = (\underline{\Gamma}_j, \overline{\Gamma}_j)$ for $j \ge 1$. Evaluating these at $\theta_0$ gives the trimmed population conditional quantile bounds. Evaluating these at $\widehat{\theta} = (\widehat{\beta}, \widehat{\gamma})$ gives their sample analog estimators. Define
\[ \overline{\Gamma}_2(x, w, \theta) = \int_0^1 \overline{\Gamma}_1(x, w, \tau, \theta)\, d\tau \quad \text{and} \quad \underline{\Gamma}_2(x, w, \theta) = \int_0^1 \underline{\Gamma}_1(x, w, \tau, \theta)\, d\tau. \]
Then
\[ \big[ \underline{\mathrm{CATE}}^c_\varepsilon(w), \overline{\mathrm{CATE}}^c_\varepsilon(w) \big] = \big[ \underline{\Gamma}_2(1, w, \theta_0) - \overline{\Gamma}_2(0, w, \theta_0),\ \overline{\Gamma}_2(1, w, \theta_0) - \underline{\Gamma}_2(0, w, \theta_0) \big] \]
are the trimmed population CATE bounds. We estimate them by
\[ \big[ \widehat{\underline{\mathrm{CATE}}}^c(w), \widehat{\overline{\mathrm{CATE}}}^c(w) \big] \equiv \big[ \underline{\Gamma}_2(1, w, \widehat{\theta}) - \overline{\Gamma}_2(0, w, \widehat{\theta}),\ \overline{\Gamma}_2(1, w, \widehat{\theta}) - \underline{\Gamma}_2(0, w, \widehat{\theta}) \big]. \]
If these mappings were Hadamard differentiable in $\theta$ at $\theta_0$, we could use the functional delta method to show that the above estimators have limiting Gaussian distributions and converge at $\sqrt{n}$ rates. Because they depend on the min and max functions, these mappings are not Hadamard differentiable. They are, however, Hadamard directionally differentiable (HDD); see definition 2 in appendix B. It turns out that this weaker version of differentiability is sufficient to establish their (non-Gaussian) limiting distribution.

To formally derive the limiting distribution of the CATE estimators, we show that the mapping
\[ \Gamma_2(x, w, \cdot) : \mathbb{R}^{d_W} \times \ell^\infty([\varepsilon, 1-\varepsilon], \mathbb{R}^{d_q}) \to \mathbb{R}^2 \]
is Hadamard directionally differentiable at $\theta_0$ tangentially to $\mathbb{R}^{d_W} \times C([\varepsilon, 1-\varepsilon], \mathbb{R}^{d_q})$. Here $C(A, B)$ is the set of continuous functions from $A$ to $B$.

As a technical assumption, we restrict the complexity of the space that the quantile regression coefficient $\gamma_0(\cdot)$ lives in. Specifically, we assume that it is in a Hölder ball. To define this parameter space precisely, let $C^m(D)$ denote the set of $m$-times continuously differentiable functions $f : D \to \mathbb{R}$, where $m$ is an integer and $D$ is an open subset of $\mathbb{R}^d$. Denote the differential operator by
\[ \nabla^\lambda = \frac{\partial^{|\lambda|}}{\partial x_1^{\lambda_1} \cdots \partial x_d^{\lambda_d}} \]
where $\lambda = (\lambda_1, \ldots, \lambda_d)$ is a $d$-tuple of nonnegative integers and $|\lambda| = \lambda_1 + \cdots + \lambda_d$. Let $\nu \in (0, 1]$ and let $\|\cdot\|$ without any subscripts denote the Euclidean norm. Define the Hölder norm of $f : D \to \mathbb{R}$ by
\[ \|f\|_{m, \infty, \nu} = \max_{|\lambda| \le m} \sup_{x \in \mathrm{int}(D)} |\nabla^\lambda f(x)| + \max_{|\lambda| = m} \sup_{x, y \in \mathrm{int}(D),\ x \ne y} \frac{|\nabla^\lambda f(x) - \nabla^\lambda f(y)|}{\|x - y\|^\nu}. \]
For any $B > 0$, let $C^B_{m,\nu}(D) = \{ f \in C^m(D) : \|f\|_{m, \infty, \nu} \le B \}$ denote a Hölder ball.

Assumption A2 (Quantile regression regularity). Let $m \ge 1$ be an integer, let $\nu \in (0, 1]$, and let $B > 0$. Then $\gamma_0 \in \mathcal{G}$ where $\mathcal{G} \subseteq C^B_{m,\nu}([\varepsilon_{\mathrm{small}}, 1-\varepsilon_{\mathrm{small}}])^{d_q}$ for some $\varepsilon_{\mathrm{small}} \in (0, \varepsilon)$.

In this assumption we require $m \ge 1$, and hence that $\gamma_0(\cdot)$ is at least continuously differentiable. In appendix A we state several additional standard regularity conditions that we use to obtain asymptotic normality of the first step estimators; see assumptions A4–A6 starting on page 37. We continue to maintain these assumptions here. As a first preliminary result, we use these assumptions to derive the limiting distribution of the CQTE bound estimators; see proposition 5 in appendix B. Using that result, we can then derive the limiting distribution of the CATE bound estimators.

Proposition 1 (CATE convergence). Fix $w \in \mathcal{W}$. Suppose A1, A2, and A4–A6 hold. Fix $c \in [0,1]$. Then
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{\mathrm{CATE}}}^c(w) - \underline{\mathrm{CATE}}^c_\varepsilon(w) \\ \widehat{\overline{\mathrm{CATE}}}^c(w) - \overline{\mathrm{CATE}}^c_\varepsilon(w) \end{pmatrix} \xrightarrow{d} Z_{\mathrm{CATE}}(w), \]
where $Z_{\mathrm{CATE}}(w)$ is a random vector in $\mathbb{R}^2$ whose distribution is characterized in the proof.

In the statement of this result we deferred the full characterization of $Z_{\mathrm{CATE}}(w)$ to the proof. To get a brief idea of what it looks like, however, consider the first component. It is
\[ Z^{(1)}_{\mathrm{CATE}}(w) = \underline{\Gamma}'_{2,\theta_0}(1, w, Z) - \overline{\Gamma}'_{2,\theta_0}(0, w, Z) \]
where $\underline{\Gamma}'_{2,\theta_0}(x, w, Z)$ is the Hadamard directional derivative of $\underline{\Gamma}_2$ evaluated at $Z$, the limiting distribution of the first step estimators. See page 49 for the expression for $\underline{\Gamma}'_{2,\theta_0}$. Likewise, $\overline{\Gamma}'_{2,\theta_0}(x, w, Z)$ is the Hadamard directional derivative of $\overline{\Gamma}_2$ evaluated at $Z$. Although $Z$ is Gaussian, the HDDs are continuous but generally nonlinear functionals. Hence the distribution of $Z_{\mathrm{CATE}}(w)$ is non-Gaussian. In section 5 we show how to use a non-standard bootstrap to approximate its distribution.

The ATE Bounds

Next we derive the limiting distribution of our ATE bound estimators. Let
\[ \overline{\Gamma}_3(x, \theta) = \int_{\mathcal{W}} \overline{\Gamma}_2(x, w, \theta)\, dF_W(w) \quad \text{and} \quad \underline{\Gamma}_3(x, \theta) = \int_{\mathcal{W}} \underline{\Gamma}_2(x, w, \theta)\, dF_W(w). \]
Then
\[ [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] = \big[ \underline{\Gamma}_3(1, \theta_0) - \overline{\Gamma}_3(0, \theta_0),\ \overline{\Gamma}_3(1, \theta_0) - \underline{\Gamma}_3(0, \theta_0) \big] \]
are the trimmed population ATE bounds. We estimate them by
\[ \left[ \widehat{\underline{\mathrm{ATE}}}^c, \widehat{\overline{\mathrm{ATE}}}^c \right] \equiv \left[ \frac{1}{n} \sum_{i=1}^n \big( \underline{\Gamma}_2(1, W_i, \widehat{\theta}) - \overline{\Gamma}_2(0, W_i, \widehat{\theta}) \big),\ \frac{1}{n} \sum_{i=1}^n \big( \overline{\Gamma}_2(1, W_i, \widehat{\theta}) - \underline{\Gamma}_2(0, W_i, \widehat{\theta}) \big) \right]. \]
Unlike $\Gamma_2$, the $\Gamma_3$ mapping depends on $F_W$, which is unknown. Here we estimate it by the empirical distribution of $W$.

Next, let $\delta > 0$, define $\mathcal{B}_\delta = \{ \beta \in \mathcal{B} : \|\beta - \beta_0\| \le \delta \}$, and let
\[ L_\beta(x, w'\beta) = \frac{\partial}{\partial \beta} L(x, w'\beta). \]
The following assumption bounds the ratio of the derivative of the propensity score with respect to the parameter $\beta$ to the squared propensity score. This assumption holds under common parametric specifications for the propensity score, like logit or probit. It also holds if strong overlap holds. Moreover, under our other assumptions, note that strong overlap holds when $W$ has finite support.
Assumption A3. There is a $\delta > 0$ such that
\[ E\left( \sup_{\beta \in \mathcal{B}_\delta} \left\| \frac{L_\beta(x, W'\beta)}{L(x, W'\beta)^2} \right\| \right) < \infty \]
for each $x \in \{0,1\}$.

Under these assumptions, we show the following result.

Theorem 1 (ATE convergence). Suppose A1–A6 hold. Then
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{\mathrm{ATE}}}^c - \underline{\mathrm{ATE}}^c_\varepsilon \\ \widehat{\overline{\mathrm{ATE}}}^c - \overline{\mathrm{ATE}}^c_\varepsilon \end{pmatrix} \xrightarrow{d} Z_{\mathrm{ATE}}, \]
where $Z_{\mathrm{ATE}}$ is a random vector in $\mathbb{R}^2$ whose distribution is characterized in the proof.

Like the CATE bound estimators, the limiting distribution of the ATE bound estimators is non-Gaussian. To understand this limiting distribution, first recall that we denote our bounds on the means $E(Y_x)$ by
\[ \underline{E}^c_{x,\varepsilon} = \underline{\Gamma}_3(x, \theta_0) = E[\underline{\Gamma}_2(x, W, \theta_0)] \quad \text{and} \quad \overline{E}^c_{x,\varepsilon} = \overline{\Gamma}_3(x, \theta_0) = E[\overline{\Gamma}_2(x, W, \theta_0)]. \]
We estimate them by
\[ \widehat{\underline{E}}^c_x = \frac{1}{n} \sum_{i=1}^n \underline{\Gamma}_2(x, W_i, \widehat{\theta}) \quad \text{and} \quad \widehat{\overline{E}}^c_x = \frac{1}{n} \sum_{i=1}^n \overline{\Gamma}_2(x, W_i, \widehat{\theta}). \]
In the proof of theorem 1, we show that the following asymptotic expansion holds:
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{E}}^c_x - \underline{E}^c_{x,\varepsilon} \\ \widehat{\overline{E}}^c_x - \overline{E}^c_{x,\varepsilon} \end{pmatrix} = \Gamma'_{3,\theta_0}\big( x, \sqrt{n}(\widehat{\theta} - \theta_0) \big) + \frac{1}{\sqrt{n}} \sum_{i=1}^n \big( \Gamma_2(x, W_i, \theta_0) - E[\Gamma_2(x, W, \theta_0)] \big) + o_p(1), \tag{9} \]
where $\Gamma'_{3,\theta_0}$ is the Hadamard directional derivative of $\Gamma_3$, which we define in the proof of theorem 1.

The first term in this expansion comes from the sample variation in the first step estimators: the propensity score $\widehat{p}_{x|w} = L(x, w'\widehat{\beta})$ and the quantile function $\widehat{Q}_{Y|X,W}(\tau \mid x, w) = q(x, w)'\widehat{\gamma}(\tau)$. The functional $\Gamma'_{3,\theta_0}(x, \cdot)$ is nonlinear in $\widehat{\beta}$. Therefore, since $\sqrt{n}(\widehat{\beta} - \beta_0)$ converges in distribution to a Gaussian limit, the limiting distribution of this functional is non-Gaussian. If $\beta_0$, and hence the propensity score, were known, then this component would follow a Gaussian distribution, since the remaining component $\widehat{\gamma}$ is asymptotically Gaussian and enters $\Gamma'_{3,\theta_0}(x, \cdot)$ linearly.

The second term in this expansion comes from the variation of the CATE bounds over the values of the covariates $W$. It follows a limiting Gaussian distribution by the central limit theorem. This term is asymptotically independent of the sampling variation in the first step estimators $\widehat{\theta}$, since the influence function of $\widehat{\theta}$ is mean independent of $W$. Overall, we see that the limiting distribution of the ATE bounds is the sum of two independent random vectors, one Gaussian and one non-Gaussian. We approximate the distribution of these two random vectors using two separate bootstraps in section 5.
The ATT Bounds

Finally we study the limiting properties of our ATT bound estimators. Our trimmed population ATT bounds are
\[ [\underline{\mathrm{ATT}}^c_\varepsilon, \overline{\mathrm{ATT}}^c_\varepsilon] = \left[ E(Y \mid X=1) - \frac{\overline{E}^c_{0,\varepsilon} - p_0 E(Y \mid X=0)}{p_1},\ E(Y \mid X=1) - \frac{\underline{E}^c_{0,\varepsilon} - p_0 E(Y \mid X=0)}{p_1} \right]. \]
We estimate them by
\[ [\widehat{\underline{\mathrm{ATT}}}^c, \widehat{\overline{\mathrm{ATT}}}^c] = \left[ \widehat{E}(Y \mid X=1) - \frac{\widehat{\overline{E}}^c_0 - \widehat{p}_0 \widehat{E}(Y \mid X=0)}{\widehat{p}_1},\ \widehat{E}(Y \mid X=1) - \frac{\widehat{\underline{E}}^c_0 - \widehat{p}_0 \widehat{E}(Y \mid X=0)}{\widehat{p}_1} \right], \]
where $\widehat{E}(Y \mid X = x)$ and $\widehat{p}_x$ are defined in section 3.

Proposition 2 (ATT convergence). Suppose the assumptions of theorem 1 hold. Suppose further that $\mathrm{var}(Y \mathbb{1}(X = x)) < \infty$ for each $x \in \{0,1\}$. Then
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{\mathrm{ATT}}}^c - \underline{\mathrm{ATT}}^c_\varepsilon \\ \widehat{\overline{\mathrm{ATT}}}^c - \overline{\mathrm{ATT}}^c_\varepsilon \end{pmatrix} \xrightarrow{d} Z_{\mathrm{ATT}}, \]
where $Z_{\mathrm{ATT}}$ is a random vector in $\mathbb{R}^2$ whose distribution is characterized in the proof.

The estimators $\widehat{E}(Y \mid X = x)$ and $\widehat{p}_x$ are asymptotically Gaussian. Like our analysis of the ATE bounds, however, $\widehat{\underline{E}}^c_0$ and $\widehat{\overline{E}}^c_0$ have non-Gaussian asymptotic distributions. Overall, $Z_{\mathrm{ATT}}$, the asymptotic distribution of our ATT bound estimators, is a linear combination of Gaussian and non-Gaussian random variables.
5 Bootstrap Inference

We now show how to conduct inference on our bounds for CATE, ATE, and ATT. Earlier we noted that these bounds are generally not ordinary Hadamard differentiable mappings of the underlying parameters $\theta_0$. By corollary 3.1 in Fang and Santos (2019), this implies that standard bootstrap approaches cannot be used for these bounds. We instead use the non-standard bootstrap approach developed by Fang and Santos (2019). For brevity we focus on ATE and ATT in this section. We provide analogous results for CQTE and CATE in lemmas 6 and 7 in appendix E.1.

5.1 Bounds on Mean Potential Outcomes

The bounds for ATE and ATT can be written in terms of bounds on $E(Y_x)$. In this section we describe how to do inference on bounds for these means. We'll then use these results to do inference on our ATE and ATT bounds in the next subsection. Recall that our bounds on $E(Y_x)$ can be written as a functional of $\theta_0$. This functional is Hadamard directionally differentiable in $\theta$, but it is generally not ordinary Hadamard differentiable. Theorem 3.1 of Fang and Santos (2019) shows how to do bootstrap inference by consistently estimating the Hadamard directional derivative (HDD). This can be done by using analytical estimators or by using a numerical derivative as described in Hong and Li (2018). Here we use analytical estimates of the HDD. This approach explicitly uses the functional form of the HDD to estimate it. It allows us to avoid picking the numerical derivative step size, although other tuning parameters are used to estimate the HDDs analytically.

Setup
Next we define some general notation. Let $Z_i = (Y_i, X_i, W_i)$ and $Z^n = \{Z_1, \ldots, Z_n\}$. Let $\vartheta$ denote some parameter of interest and let $\widehat{\vartheta}$ be an estimator of $\vartheta$ based on the data $Z^n$. Let $A^*_n$ denote $\sqrt{n}(\widehat{\vartheta}^* - \widehat{\vartheta})$ where $\widehat{\vartheta}^*$ is a draw from the nonparametric bootstrap distribution of $\widehat{\vartheta}$. Suppose $A$ is the tight limiting process of $\sqrt{n}(\widehat{\vartheta} - \vartheta)$. Denote bootstrap consistency by $A^*_n \overset{P}{\rightsquigarrow} A$, where $\overset{P}{\rightsquigarrow}$ denotes weak convergence in probability, conditional on the data $Z^n$. Weak convergence in probability conditional on $Z^n$ is defined as
\[ \sup_{h \in \mathrm{BL}_1} \big| E[h(A^*_n) \mid Z^n] - E[h(A)] \big| = o_p(1) \]
where $\mathrm{BL}_1$ denotes the set of Lipschitz functions into $\mathbb{R}$ with Lipschitz constant no greater than 1. We leave the domain of these functions and its associated norm implicit.

We focus on the choices $\vartheta = \theta_0$ and $\widehat{\vartheta} = \widehat{\theta}$. For these choices, let $Z^*_n = \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})$. Let $Z$ denote the limiting distribution of $\sqrt{n}(\widehat{\theta} - \theta_0)$; see lemma 1 in appendix A. Theorem 3.6.1 of van der Vaart and Wellner (1996) implies that $Z^*_n \overset{P}{\rightsquigarrow} Z$. Our parameters of interest are all functionals $\Gamma$ of $\theta_0$. In particular, in section 4 we showed that
\[ \sqrt{n}\big( \Gamma(\widehat{\theta}) - \Gamma(\theta_0) \big) \rightsquigarrow \Gamma'_{\theta_0}(Z) \]
for a variety of functionals $\Gamma$. To do inference on these functionals, we therefore want to estimate the distribution of $\Gamma'_{\theta_0}(Z)$. Fang and Santos (2019) show that
\[ \widehat{\Gamma}'_{\theta_0}(Z^*_n) \overset{P}{\rightsquigarrow} \Gamma'_{\theta_0}(Z) \]
where $\widehat{\Gamma}'_{\theta_0}$ is a suitable estimator of the Hadamard directional derivative $\Gamma'_{\theta_0}$. In this section we construct the estimators $\widehat{\Gamma}'_{\theta_0}$ and show that they can be used in this bootstrap.

Main Result
Next, recall the asymptotic expansion in equation (9) on page 17. As we will show, the second term in this expansion can be approximated using standard bootstrap approaches, replacing $\theta_0$ by $\widehat{\theta}$. The first term requires estimating the HDDs $\underline{\Gamma}'_{3,\theta_0}$ and $\overline{\Gamma}'_{3,\theta_0}$. The formulas for our estimators of these HDDs are long, and so we describe them in appendix C. Denote these estimators by $\widehat{\underline{\Gamma}}'_{3,\theta_0}$ and $\widehat{\overline{\Gamma}}'_{3,\theta_0}$. They require choosing two scalar tuning parameters, $\kappa_n$ and $\eta_n$: $\kappa_n$ is a slackness parameter and $\eta_n$ is a step size parameter used to compute numerical derivatives of $\widehat{\gamma}(\cdot)$. Although not used in our proof, the asymptotic independence of the two components implies that approximating their respective marginal distributions is sufficient to obtain their joint distribution.

As we just mentioned, we'll use the standard nonparametric bootstrap to approximate the second term of equation (9). To formalize this, let $\mathbb{G}^*_n$ denote the nonparametric bootstrap empirical process:
\[ \mathbb{G}^*_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n (M_{n,i} - 1)\, \delta_{Z_i} \]
where $(M_{n,1}, \ldots, M_{n,n})$ is multinomially distributed with parameters $n$ and $(1/n, \ldots, 1/n)$, independently of $Z^n$, and where $\delta_{Z_i}$ is the distribution which assigns probability one to the value $Z_i$. Then for any function $g$,
\[ \mathbb{G}^*_n g(Z) = \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n g(Z^*_i) - \overline{g(Z)} \right) \]
where $Z^*_i$, $i = 1, \ldots, n$, are drawn independently with replacement from $\{Z_1, \ldots, Z_n\}$ and $\overline{g(Z)} = \frac{1}{n} \sum_{i=1}^n g(Z_i)$. In particular, we'll study the asymptotic distribution of
\[ \mathbb{G}^*_n \Gamma_2(x, W, \widehat{\theta}) = \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^n \Gamma_2(x, W^*_i, \widehat{\theta}) - \frac{1}{n} \sum_{i=1}^n \Gamma_2(x, W_i, \widehat{\theta}) \right). \]
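For concreteness, one draw of $\mathbb{G}^*_n g$ can be computed with multinomial resampling weights, as in this sketch (ours; the function name is illustrative).

```python
import numpy as np

def bootstrap_process_draw(g_values, rng):
    """One draw of the bootstrap empirical process applied to g:
    sqrt(n) times (bootstrap mean of g minus sample mean of g),
    using multinomial resampling weights M_{n,i}."""
    n = len(g_values)
    weights = rng.multinomial(n, np.full(n, 1.0 / n))
    return np.sqrt(n) * (weights @ g_values / n - g_values.mean())

rng = np.random.default_rng(0)
draw = bootstrap_process_draw(rng.standard_normal(500), rng)
```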
The following proposition is our main bootstrap consistency result.

Proposition 3 (Analytical Bootstrap for Mean Potential Outcomes). Suppose the assumptions of theorem 1 hold. Let $\kappa_n \to 0$, $n\kappa_n \to \infty$, $\eta_n \to 0$, and $n\eta_n \to \infty$ as $n \to \infty$. Then
\[ \widehat{\Gamma}'_{3,\theta_0}\big( x, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta}) \big) + \mathbb{G}^*_n \Gamma_2(x, W, \widehat{\theta}) \overset{P}{\rightsquigarrow} Z(x), \]
where $Z(\cdot)$ is the limiting process of the expression given in equation (9), as characterized in the proof.

This result shows how to use the bootstrap to approximate the joint limiting distribution of the upper and lower bounds of $E(Y_x)$, $x \in \{0,1\}$. As we show in section 5.2 below, these approximations can be used to conduct pointwise or uniform-in-$c$ inference on the ATE bounds. As part of the proof, we show that $\widehat{\Gamma}'_{3,\theta_0}(x, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta}))$ weakly converges in probability conditional on the data to $\Gamma'_{3,\theta_0}(x, Z)$, a non-Gaussian vector which reflects the sample variation in the first step estimators. We also show weak convergence in probability conditional on the data of $\mathbb{G}^*_n \Gamma_2(x, W, \widehat{\theta})$ to $\mathbb{G} \Gamma_2(x, W, \theta_0) \sim N(0, \mathrm{var}(\Gamma_2(x, W, \theta_0)))$, a bivariate Gaussian vector which reflects the variation of the CATE bounds over $W$. This variation can be approximated using the standard nonparametric bootstrap. Hence the bounds' limiting distribution is approximated by a combination of standard and non-standard bootstraps. Note that the two bootstrap distributions can be computed from a single sequence of draws $Z^*_i$, and so this has the same computational burden as a single bootstrap.

5.2 Inference on the ATE Bounds

Next we show how to use proposition 3 to do inference on our ATE bounds $[\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon]$. We first consider inference pointwise in $c$. We then construct confidence bands that are uniform over $c$.
5.2.1 Pointwise in c Confidence Sets

An immediate corollary of proposition 3 is
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{\mathrm{ATE}}}^{c,*} - \widehat{\underline{\mathrm{ATE}}}^c \\ \widehat{\overline{\mathrm{ATE}}}^{c,*} - \widehat{\overline{\mathrm{ATE}}}^c \end{pmatrix} = \begin{pmatrix} \big( \widehat{\underline{\Gamma}}'_{3,\theta_0}(1, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \underline{\Gamma}_2(1, W, \widehat{\theta}) \big) - \big( \widehat{\overline{\Gamma}}'_{3,\theta_0}(0, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \overline{\Gamma}_2(0, W, \widehat{\theta}) \big) \\ \big( \widehat{\overline{\Gamma}}'_{3,\theta_0}(1, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \overline{\Gamma}_2(1, W, \widehat{\theta}) \big) - \big( \widehat{\underline{\Gamma}}'_{3,\theta_0}(0, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \underline{\Gamma}_2(0, W, \widehat{\theta}) \big) \end{pmatrix} \overset{P}{\rightsquigarrow} Z_{\mathrm{ATE}}. \tag{10} \]
Thus we can also use this specific bootstrap to approximate the asymptotic distribution of our ATE bound estimators. Given this result, we can construct a $100(1-\alpha)\%$ confidence set for the ATE identified set under $c$-dependence as follows. Let
\[ \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) = \left[ \widehat{\underline{\mathrm{ATE}}}^c - \frac{\widehat{d}_\alpha}{\sqrt{n}},\ \widehat{\overline{\mathrm{ATE}}}^c + \frac{\widehat{d}_\alpha}{\sqrt{n}} \right] \]
where
\[ \widehat{d}_\alpha = \inf\left\{ z \in \mathbb{R} : P\left( \sqrt{n}\big( \widehat{\underline{\mathrm{ATE}}}^{c,*} - \widehat{\underline{\mathrm{ATE}}}^c \big) \le z \text{ and } \sqrt{n}\big( \widehat{\overline{\mathrm{ATE}}}^{c,*} - \widehat{\overline{\mathrm{ATE}}}^c \big) \ge -z \;\middle|\; Z^n \right) \ge 1-\alpha \right\}. \]
The probability in this expression can be approximated by taking a large number of bootstrap draws according to equation (10). Proposition 3 then implies that
\[ \liminf_{n \to \infty} P\big( \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \big) \ge 1-\alpha. \]
Let
\[ d_\alpha = \inf\big\{ z \in \mathbb{R} : P\big( \underline{Z}_{\mathrm{ATE}} \le z \text{ and } \overline{Z}_{\mathrm{ATE}} \ge -z \big) \ge 1-\alpha \big\}, \]
where $\underline{Z}_{\mathrm{ATE}}$ and $\overline{Z}_{\mathrm{ATE}}$ denote the components of $Z_{\mathrm{ATE}}$ corresponding to the lower and upper bound estimators. If $P(\underline{Z}_{\mathrm{ATE}} \le z \text{ and } \overline{Z}_{\mathrm{ATE}} \ge -z)$ is continuous and strictly increasing in a neighborhood of $d_\alpha$, corollary 3.2 in Fang and Santos (2019) yields $\widehat{d}_\alpha = d_\alpha + o_p(1)$ and hence
\[ \lim_{n \to \infty} P\big( \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \big) = 1-\alpha. \]
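Given bootstrap draws of the two bound estimators, $\widehat{d}_\alpha$ has a simple closed form, since the event in its definition is equivalent to $z \ge \max\{\sqrt{n}(\widehat{\underline{\mathrm{ATE}}}^{c,*} - \widehat{\underline{\mathrm{ATE}}}^c),\ -\sqrt{n}(\widehat{\overline{\mathrm{ATE}}}^{c,*} - \widehat{\overline{\mathrm{ATE}}}^c)\}$. A sketch (ours, with assumed input names):

```python
import numpy as np

def critical_value(lb_star, ub_star, lb_hat, ub_hat, n, alpha=0.05):
    """Bootstrap critical value for the pointwise-in-c ATE confidence set.

    lb_star, ub_star: arrays of bootstrap draws of the lower and upper
    bound estimators; lb_hat, ub_hat: the original point estimates."""
    dl = np.sqrt(n) * (np.asarray(lb_star) - lb_hat)
    du = np.sqrt(n) * (np.asarray(ub_star) - ub_hat)
    # smallest z with empirical frequency of {dl <= z and du >= -z}
    # at least 1 - alpha, i.e. the (1 - alpha) quantile of max(dl, -du)
    return np.quantile(np.maximum(dl, -du), 1 - alpha)
```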
5.2.2 Uniform in c ATE Bands

We just described how to use proposition 3 to do inference on the ATE bounds for any fixed $c$. Those results can be immediately extended to do inference on the ATE bounds for any finite grid of $c$'s. In this section we show how to construct confidence bands that are uniform over all $c \in [0,1]$. The key is the monotonicity of the bounds in $c$. This lets us extrapolate bands that are uniform on a finite grid in such a way that they have uniform coverage. A related procedure is described in corollary 1 of Masten and Poirier (2020).

Although $\overline{\mathrm{ATE}}^c_\varepsilon$ is nondecreasing in $c$, its estimate $\widehat{\overline{\mathrm{ATE}}}^c$ may be nonmonotonic in $c$ because of the quantile crossing problem with linear quantile regression. In that case, we could monotonize the estimated ATE bound function by using the rearrangement procedure of Chernozhukov, Fernández-Val, and Galichon (2010), for example. As they show, the rearrangement operator is Hadamard directionally differentiable, and thus can be accommodated in our inferential results. Likewise, $\underline{\mathrm{ATE}}^c_\varepsilon$ is nonincreasing in $c$ and $\widehat{\underline{\mathrm{ATE}}}^c$ is also nonincreasing after applying a suitable rearrangement. From here on we assume our bound estimators have been monotonized. Note that this does not affect the asymptotic distribution of the estimators, by corollary 1 of Chernozhukov et al. (2010).

Next consider a grid of values $\mathcal{C} = \{c_1, \ldots, c_K\}$ such that $0 = c_1 < \cdots < c_K = 1$. All of our analysis works with $0 < c_1$ and $c_K < 1$, but typically researchers will want to include the endpoints of $[0,1]$, so we do that from here on. Using methods similar to those in section 5.2.1, for all $c \in \mathcal{C}$ let
\[ \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) = \left[ \widehat{\underline{\mathrm{ATE}}}^c - \frac{\widehat{d}_\alpha(c)}{\sqrt{n}},\ \widehat{\overline{\mathrm{ATE}}}^c + \frac{\widehat{d}_\alpha(c)}{\sqrt{n}} \right] \]
be $100(1-\alpha)\%$ confidence sets for $[\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon]$ where the critical values $\widehat{d}_\alpha(c)$ are chosen such that these sets are uniform over $c$ in the finite grid $\mathcal{C}$. That is,
\[ P\big( \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \text{ for all } c \in \mathcal{C} \big) \to 1-\alpha \]
as $n \to \infty$. Finally, let $\mathrm{cmin}(c) = \inf\{ c_k \in \mathcal{C} : c \le c_k \}$ denote the smallest element in the grid $\mathcal{C}$ that is at least as large as $c$. For $c \in [0,1]$ define
\[ \widehat{\mathrm{UB}}(c) = \widehat{\overline{\mathrm{ATE}}}^{\mathrm{cmin}(c)} + \frac{\widehat{d}_\alpha(\mathrm{cmin}(c))}{\sqrt{n}} \quad \text{and} \quad \widehat{\mathrm{LB}}(c) = \widehat{\underline{\mathrm{ATE}}}^{\mathrm{cmin}(c)} - \frac{\widehat{d}_\alpha(\mathrm{cmin}(c))}{\sqrt{n}}. \]
$\widehat{\mathrm{UB}}(c)$ is the greatest monotonic interpolation of the upper bounds of the confidence intervals on the grid $\mathcal{C}$. $\widehat{\mathrm{LB}}(c)$ is the least monotonic interpolation of the lower bounds of the confidence intervals on the grid $\mathcal{C}$. By the definition of these interpolated bands and by monotonicity of the population ATE bounds,
\[ P\big( [\widehat{\mathrm{LB}}(c), \widehat{\mathrm{UB}}(c)] \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \text{ for all } c \in [0,1] \big) = P\big( [\widehat{\mathrm{LB}}(c), \widehat{\mathrm{UB}}(c)] \supseteq [\underline{\mathrm{ATE}}^c_\varepsilon, \overline{\mathrm{ATE}}^c_\varepsilon] \text{ for all } c \in \mathcal{C} \big) \to 1-\alpha \]
as $n \to \infty$.

In this subsection we've shown that, although we cannot obtain the limiting distribution of the ATE bounds uniformly over $c \in [0,1]$, we can nonetheless construct uniform confidence bands by using the monotonicity of conditional $c$-dependence: conditional $c_1$-dependence implies conditional $c_2$-dependence when $c_1 \le c_2$. This kind of monotonicity is common in many other approaches to sensitivity analysis, and hence greatest and least monotonic interpolations can likely be used more broadly to construct uniform confidence bands.

5.3 Inference on the ATT Bounds

Bootstrap inference on the ATT bounds is quite similar to that on the ATE bounds. By examining the ATT bounds' limiting distribution (see the proof of proposition 2) we see that it depends on two types of terms:

1. One term comes from the limiting distribution of
\[ \sqrt{n} \begin{pmatrix} \widehat{\underline{E}}^c_0 - \underline{E}^c_{0,\varepsilon} \\ \widehat{\overline{E}}^c_0 - \overline{E}^c_{0,\varepsilon} \end{pmatrix}, \]
which is non-standard. We approximate the distribution of this term by using the non-standard bootstrap of proposition 3.

2. The other terms are due to the limiting distributions of
\[ \sqrt{n} \begin{pmatrix} \widehat{E}(Y \mid X = x) - E(Y \mid X = x) \\ \widehat{p}_x - p_x \end{pmatrix}, \]
which are standard and Gaussian. The distribution of these terms can be approximated by the nonparametric bootstrap. For example, standard arguments show that the limiting distribution of $\sqrt{n}(\widehat{E}(Y \mid X = x) - E(Y \mid X = x))$ is approximated by
\[ Z^*_{E(Y|X=x)} \equiv \frac{1}{\widehat{p}_x} \Big( \mathbb{G}^*_n Y \mathbb{1}(X = x) - \widehat{E}(Y \mid X = x) \cdot \mathbb{G}^*_n \mathbb{1}(X = x) \Big). \]
Similarly, $Z^*_{p_x} \equiv \mathbb{G}^*_n \mathbb{1}(X = x)$ converges weakly in probability conditional on $Z^n$ to the limiting distribution of $\sqrt{n}(\widehat{p}_x - p_x)$.

Combining all the terms gives
\[ \begin{pmatrix} Z^*_{E(Y|X=1)} - \dfrac{\widehat{\overline{\Gamma}}'_{3,\theta_0}(0, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \overline{\Gamma}_2(0, W, \widehat{\theta})}{\widehat{p}_1} + \dfrac{\widehat{p}_0}{\widehat{p}_1} Z^*_{E(Y|X=0)} + \dfrac{\widehat{E}(Y \mid X=0)}{\widehat{p}_1} Z^*_{p_0} + \dfrac{\widehat{\overline{E}}^c_0 - \widehat{p}_0 \widehat{E}(Y \mid X=0)}{\widehat{p}_1^2} Z^*_{p_1} \\[2ex] Z^*_{E(Y|X=1)} - \dfrac{\widehat{\underline{\Gamma}}'_{3,\theta_0}(0, \sqrt{n}(\widehat{\theta}^* - \widehat{\theta})) + \mathbb{G}^*_n \underline{\Gamma}_2(0, W, \widehat{\theta})}{\widehat{p}_1} + \dfrac{\widehat{p}_0}{\widehat{p}_1} Z^*_{E(Y|X=0)} + \dfrac{\widehat{E}(Y \mid X=0)}{\widehat{p}_1} Z^*_{p_0} + \dfrac{\widehat{\underline{E}}^c_0 - \widehat{p}_0 \widehat{E}(Y \mid X=0)}{\widehat{p}_1^2} Z^*_{p_1} \end{pmatrix} \overset{P}{\rightsquigarrow} Z_{\mathrm{ATT}}. \]
We can use this result to construct pointwise confidence sets for the ATT bounds for a fixed $c$, or to construct confidence bands that are uniform on a finite grid $\mathcal{C}$. Like the ATE bounds, the ATT bounds are monotonic in $c$. Thus a similar interpolation can be used to construct confidence bands for the ATT bounds that are uniform over $c \in [0,1]$.

5.4 Inference on Breakdown Points

We conclude this section by showing how to use the confidence bands we just described to do inference on breakdown points. For brevity we focus on the breakdown point for the conclusion that the ATE is nonnegative, which we defined earlier in equation (6). Inference on other breakdown points for other conclusions can be done similarly.

Let $\mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha)$ be a pointwise-in-$c$ confidence band for the ATE bounds, as described in section 5.2.1. Define
\[ c_L = \sup\big\{ c \in [0,1] : \mathrm{CI}^c_{\mathrm{ATE}}(1-\alpha) \subseteq [0, \infty) \big\}. \]
This is simply the value at which the confidence band first intersects the horizontal line at zero. By proposition S.2 in Appendix D of Masten and Poirier (2020),
\[ \lim_{n \to \infty} P(c_L \le c_{bp}) \ge 1-\alpha. \]
Thus $[c_L, 1]$ is a valid one-sided lower confidence interval for the breakdown point $c_{bp}$.
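On a finite grid of $c$ values, $c_L$ can be read off the lower confidence band directly, as in this sketch (ours; it assumes the band has already been computed and monotonized in $c$):

```python
import numpy as np

def breakdown_lower_bound(c_grid, lb_band):
    """One-sided lower confidence bound c_L for the breakdown point:
    the largest c on the grid at which the lower endpoint of
    CI^c_ATE(1 - alpha) is still nonnegative."""
    nonneg = np.asarray(lb_band) >= 0
    return c_grid[np.nonzero(nonneg)[0].max()] if nonneg.any() else 0.0
```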
6 Validity of the Standard Bootstrap

In the previous section we showed how to use a non-standard bootstrap method to conduct inference on the CATE, ATE, and ATT bounds. The key technical problem was that these bounds are not necessarily Hadamard differentiable functionals of the first step estimators; they are only Hadamard directionally differentiable. In this section, we provide simple sufficient conditions on the propensity score under which the CATE, ATE, and ATT bounds are in fact Hadamard differentiable. Under this condition, the methods in section 5 are still valid, but so is the standard nonparametric bootstrap. After stating the formal result, we discuss when this sufficient condition holds and when it does not.

First consider the average treatment effect. Recall that the ATE bounds depend on the functional $\Gamma_3(x, \theta)$. We will show that its Hadamard directional derivative $\Gamma'_{3,\theta_0}(x, h)$ is linear in $h$ under a condition on the value of $c$, the propensity score $p_{1|w}$, and the distribution of $W$. By proposition 2.1 in Fang and Santos (2019), this linearity is equivalent to Hadamard differentiability. By theorem 3.9.11 in van der Vaart and Wellner (1996), this linearity also implies that the bootstrap process $\sqrt{n}(\Gamma_3(x, \widehat{\theta}^*) - \Gamma_3(x, \widehat{\theta}))$ converges weakly in probability conditional on the data to $\Gamma'_{3,\theta_0}(x, Z)$, a Gaussian vector. In other words, we can conduct inference on the ATE bounds using the standard nonparametric bootstrap: Take $n$ independent draws from the data with replacement, compute the bound estimates in this bootstrap sample, and then use the distribution of these bound estimates across many such bootstrap samples to approximate the sampling distribution of the bound estimators. The following theorem provides the explicit sufficient condition for validity of this bootstrap. In this result, let $p_{1|W} = P(X = 1 \mid W)$ denote the random variable obtained by evaluating the propensity score at the random vector $W$.
Suppose the assumptions of theorem 1 hold. Suppose $P(p_{1|W} \in \{c, 1 - c\}) = 0$. Then
\[
\sqrt n \begin{pmatrix} \widehat{\overline{\mathrm{ATE}}}{}^{\,c*} - \widehat{\overline{\mathrm{ATE}}}{}^{\,c} \\[1ex] \widehat{\underline{\mathrm{ATE}}}{}^{\,c*} - \widehat{\underline{\mathrm{ATE}}}{}^{\,c} \end{pmatrix} \overset{P}{\rightsquigarrow} Z_{\mathrm{ATE}},
\]
where $(\widehat{\overline{\mathrm{ATE}}}{}^{\,c*}, \widehat{\underline{\mathrm{ATE}}}{}^{\,c*})$ are drawn from the nonparametric bootstrap distribution of $(\widehat{\overline{\mathrm{ATE}}}{}^{\,c}, \widehat{\underline{\mathrm{ATE}}}{}^{\,c})$.

In the proof of this result we show that when the propensity score does not have a point mass at either $c$ or $1 - c$, the mapping $\overline\Gamma(x, \theta)$ is Hadamard differentiable for $x \in \{0, 1\}$. Hence the nonparametric bootstrap is valid. Although it is not formally stated in the theorem, we conjecture that our sufficient condition for validity of the standard bootstrap is also a necessary condition. That is, we expect the standard bootstrap to be invalid when $c$ or $1 - c$ are point masses of the propensity score's distribution. From the proof of theorem 2, we see that when this condition on the propensity score fails, $\overline\Gamma{}'_\theta(x, w, \tau, h)$ is nonlinear in $h$ on a set of $(\tau, w)$ values of positive measure. Since the HDD of $\overline\Gamma(x, \cdot)$ is the integral over $(\tau, w)$ of the HDD of $\overline\Gamma(x, w, \tau, \cdot)$, we expect that $\overline\Gamma{}'_\theta(x, h)$ will also be nonlinear in $h$, a failure of Hadamard differentiability.

By further examining the proof of theorem 2, we can also show that the CATE bounds are Hadamard differentiable at covariate values $w$ and sensitivity parameter values $c$ such that $p_{1|w} \notin \{c, 1 - c\}$.
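In words, the bootstrap in theorem 2 is the usual pairs bootstrap applied to the entire two-step procedure. The following minimal sketch illustrates it; `estimate_ate_bounds` is a hypothetical stand-in for the bound estimator of sections 3 and 4, and the percentile construction is one simple choice of interval:

```python
import numpy as np

def bootstrap_ate_bounds(data, estimate_ate_bounds, c, B=999, alpha=0.05, seed=0):
    """Standard nonparametric bootstrap for the ATE bounds at a fixed c,
    valid under theorem 2's condition P(p_{1|W} in {c, 1-c}) = 0.
    `data` is an (n, k) array of observations, and `estimate_ate_bounds`
    maps a sample and c to (lower, upper) bound estimates."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    lo_hat, up_hat = estimate_ate_bounds(data, c)
    draws = np.empty((B, 2))
    for b in range(B):
        idx = rng.integers(n, size=n)  # draw n observations with replacement
        draws[b] = estimate_ate_bounds(data[idx], c)
    # percentile-type confidence interval for the identified set
    lo_ci = np.quantile(draws[:, 0], alpha / 2)
    up_ci = np.quantile(draws[:, 1], 1 - alpha / 2)
    return (lo_hat, up_hat), (lo_ci, up_ci)
```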
Finally, the following proposition gives a similar result for the ATT, using slightly weaker assumptions.

Proposition 4. Suppose the assumptions of theorem 1 hold. Suppose $\mathrm{var}(Y \mathbb 1(X = x)) < \infty$ for each $x \in \{0, 1\}$. Suppose $P(p_{1|W} = c) = 0$. Then
\[
\sqrt n \begin{pmatrix} \widehat{\overline{\mathrm{ATT}}}{}^{\,c*} - \widehat{\overline{\mathrm{ATT}}}{}^{\,c} \\[1ex] \widehat{\underline{\mathrm{ATT}}}{}^{\,c*} - \widehat{\underline{\mathrm{ATT}}}{}^{\,c} \end{pmatrix} \overset{P}{\rightsquigarrow} Z_{\mathrm{ATT}},
\]
where $(\widehat{\overline{\mathrm{ATT}}}{}^{\,c*}, \widehat{\underline{\mathrm{ATT}}}{}^{\,c*})$ are drawn from the nonparametric bootstrap distribution of $(\widehat{\overline{\mathrm{ATT}}}{}^{\,c}, \widehat{\underline{\mathrm{ATT}}}{}^{\,c})$.

The ATT bounds only depend on our bounds for $E(Y_0)$, and not our bounds for $E(Y_1)$. Hence we only need to examine $\overline\Gamma(x, \theta)$ and $\underline\Gamma(x, \theta)$ for $x = 0$. So the proof of this proposition proceeds by showing that these functionals are Hadamard differentiable at $x = 0$ when the propensity score does not have a point mass at $c$.

The sufficient conditions in theorem 2 and proposition 4 depend on the support of the propensity score $p_{1|W}$. If the propensity score's distribution is absolutely continuous, it contains no point masses and therefore these conditions hold. This happens when one covariate $W_k$ has a nonzero coefficient $\beta_{0,k}$ and has a continuous distribution conditional on the other covariates $W_{-k}$. However, if all covariates are discrete or mixed discrete-continuous, the support of the propensity score will generally contain point masses. The nonparametric bootstrap may not be valid whenever $c$ coincides with these points. Even when $p_{1|W}$ has point masses, however, the nonparametric bootstrap is valid for $c$ outside of these points. To use this bootstrap, one could in principle estimate the support of $p_{1|W}$ to determine at which values of $c$ inference might be invalid, and select sensitivity parameters outside of this support. The nonparametric bootstrap has the advantage of being computationally simple, and it does not require the choice of tuning parameters. While more involved, the bootstrap technique detailed in section 5 is valid regardless of the support of the propensity score. For example, in our empirical analysis in section 7, all of the covariates are either mixed or discrete. Given our analysis above, we therefore use the non-standard bootstrap in our empirical analysis, since the standard bootstrap may fail at some values of $c$.
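To illustrate this support check, one can fit the propensity score model and inspect the empirical distribution of the fitted values: with discrete covariates they pile up on a few points, which the sensitivity parameter $c$ should avoid. The logit fit below is a bare-bones stand-in for any MLE routine, and the data are simulated:

```python
import numpy as np

def logit_propensity(X, W, iters=500, lr=0.5):
    """Fit P(X=1|W) = logistic(b0 + W'b) by gradient ascent on the
    log likelihood (a bare-bones stand-in for any MLE routine)."""
    W1 = np.column_stack([np.ones(len(X)), W])
    b = np.zeros(W1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-W1 @ b))
        b += lr * W1.T @ (X - p) / len(X)
    return 1.0 / (1.0 + np.exp(-W1 @ b))

rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(2000, 2))               # two binary covariates
X = rng.binomial(1, 0.3 + 0.2 * W[:, 0] + 0.2 * W[:, 1])
phat = logit_propensity(X, W)
# with discrete covariates the fitted scores concentrate on a few values;
# choose c (and 1 - c) away from this estimated support
print(np.unique(np.round(phat, 3)))
```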
Empirical Application

In this section we illustrate our methods using data on the National Supported Work (NSW) demonstration project studied by LaLonde (1986). Since this is a highly studied and well-known program, we only briefly summarize it here. See, for example, Heckman, LaLonde, and Smith (1999) for further details. We use LaLonde's data as reconstructed by Dehejia and Wahba (1999). The NSW demonstration project randomly assigned participants to either receive a guaranteed job for 9 to 18 months along with frequent counselor meetings or to be left in the labor market by themselves. We use the Dehejia and Wahba (1999) sample, which consists of all males in LaLonde's NSW dataset whose earnings are observed in 1974, 1975, and 1978. This dataset has 445 people: 185 in the treatment group and 260 in the control group. Like Imbens (2003), we use this experimental sample primarily as an illustration; in experiments where treatment was truly randomized it is not necessary to assess sensitivity to unconfoundedness. Our results may be useful for assessing the impact of randomization failure in experiments, but that is not our focus here.

In addition to this experimental sample, we construct a sample using observational data. This sample combines the 185 people in the NSW treatment group with 2490 people in a control group constructed from the Panel Study of Income Dynamics (PSID). This control group, called PSID-1 by LaLonde, consists of all male household heads observed in all years between 1975 and 1978 who were less than 55 years old and who did not classify themselves as retired. We further drop observations with earnings above a fixed cutoff.

Table 1: Summary statistics.

                              Experimental dataset         Observational dataset
                              Control        Treatment     Control
Married                       0.15           0.19          0.78
                              (0.36)         (0.39)        (0.42)
Age                           25.05          25.82         38.61
                              (7.06)         (7.16)        (11.45)
Black                         0.83           0.84          0.27
                              (0.38)         (0.36)        (0.44)
Hispanic                      0.11           0.06          0.04
                              (0.31)         (0.24)        (0.20)
Education                     10.09          10.35         11.37
                              (1.61)         (2.01)        (3.40)
Earnings in 1974              2107.03        2095.57       765.75
                              (5687.91)      (4886.62)     (1399.79)
Earnings in 1975              1266.91        1532.06       650.54
                              (3102.98)      (3219.25)     (1332.89)
Positive earnings in 1974     0.25           0.29          0.29
                              (0.43)         (0.46)        (0.46)
Positive earnings in 1975     0.32           0.40          0.25
                              (0.47)         (0.49)        (0.43)
Sample size                   260            185           242
Variable mean is shown in each cell, with that variable’s standard deviation in parentheses.
Table 2:
Baseline treatment effect estimates (in 1982 dollars).

                         ATE      ATT      Sample size
Experimental dataset     1633     1738     445
                         (650)    (689)
Observational dataset    3337     4001     390
                         (769)    (762)
Standard errors in parentheses.
Baseline Estimates
Table 2 shows the baseline point estimates of both the ATE and the ATT under the unconfoundedness assumption in the two samples we consider. These estimates are all computed by inverse probability weighting (IPW) using a parametric logit propensity score estimator. We do not consider other estimators, since our goal is to illustrate sensitivity to identifying assumptions, rather than finite sample sensitivity to the choice of estimator.
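For concreteness, the IPW point estimators behind table 2 can be sketched as follows. This is a minimal illustration assuming fitted propensity scores `phat` from a logit, not the exact code we use:

```python
import numpy as np

def ipw_ate_att(y, x, phat):
    """IPW point estimates of the ATE and ATT given outcomes y, a binary
    treatment x, and fitted propensity scores phat."""
    ate = np.mean(x * y / phat - (1 - x) * y / (1 - phat))
    # ATT: reweight controls by phat / (1 - phat) to match the treated
    ey0_treated = np.mean((1 - x) * y * phat / (1 - phat)) / x.mean()
    att = y[x == 1].mean() - ey0_treated
    return ate, att
```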
Figure 1: Sensitivity of ATE (top) and ATT (bottom) estimates to relaxations of the selection on observables assumption. The solid lines are bounds computed using the observational dataset while the dashed lines are bounds computed using the experimental dataset. The light dotted lines are confidence bands for the observational dataset while the light dashed-dotted lines are confidence bands for the experimental dataset.
Relaxing Unconfoundedness
Figure 1 shows our main results. These are estimated treatment effect bounds under $c$-dependence, along with corresponding pointwise confidence bands, as described in sections 2–5. The top plot shows bounds on the ATE while the bottom plot shows bounds on the ATT. The solid lines are bounds for the observational dataset while the dashed lines are bounds for the experimental dataset. The light dotted lines are confidence bands for the observational dataset while the light dashed-dotted lines are confidence bands for the experimental dataset. These bands are constructed to have nominal 95% coverage probability pointwise in $c$, based on our non-standard bootstrap results in section 5. For the tuning parameters we use $\varepsilon = 0.{\cdots}$, $\eta_n = 0.{\cdots}\, n^{-1/\cdots}$, and $\kappa_n = n^{-1/\cdots}$. Note that the sufficient conditions for validity of the standard bootstrap that we gave in section 6 do not apply here, since the distribution of the propensity score variable $p_{1|W}$ has point masses. This occurs because seven of the nine covariates are discrete, while the other two are mixed discrete-continuous. The mixed variables are earnings in 1974 and earnings in 1975, which have point masses at zero since many people in the sample did not work in those years.
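Schematically, the bound curves in figure 1 are traced out by evaluating the bound estimator on a grid of sensitivity parameters; `estimate_ate_bounds` is again a hypothetical stand-in for the estimator described in the text:

```python
import numpy as np

def bound_curves(data, estimate_ate_bounds, c_grid):
    """Evaluate the estimated ATE bound functions on a grid of c values;
    at c = 0 both curves collapse to the baseline point estimate."""
    curves = np.array([estimate_ate_bounds(data, c) for c in c_grid])
    return curves[:, 0], curves[:, 1]  # lower and upper bound curves
```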
For both datasets, at $c = 0$ the bounds collapse to the baseline point estimate. When $c > 0$, we allow for some selection on unobservables. Comparing the shape of the bounds for both datasets, we see that the experimental data are substantially more robust to relaxations of the baseline assumptions than the observational data. Specifically, for most values of $c$ the bounds for the experimental data are substantially tighter than the bounds for the observational data. Even the no assumptions bounds ($c = 1$) are tighter for the experimental data than for the observational data.

A second way to measure robustness uses breakdown points. Masten and Poirier (2020) discuss these in detail and give additional references. In the current context, the breakdown point is simply the largest value of $c$ such that we can still draw a specific conclusion about some parameter. Specifically, in the next two subsections we consider two conclusions: The conclusion that the ATE is nonnegative, and the conclusion that the ATT is less than the per participant program cost.

Breakdown Points for Nonnegative ATE
First consider the conclusion that the ATE is nonnegative. Our point estimates support this conclusion, but does it still hold if the baseline unconfoundedness assumption fails? In the experimental dataset, the estimated breakdown point is the value of $c$ at which the lower bound function in figure 1 intersects the horizontal axis. For all $c$ up to this value, the estimated identified set for the ATE contains only nonnegative values; for larger $c$, it also contains negative values. The corresponding estimated breakdown point for the conclusion that the ATT is positive is 0.123 in the experimental dataset, while it is 0.049 for the observational dataset. By this measure, the conclusion that the ATT is positive is more than twice as robust using the experimental data compared to the observational data.

Thus far we have compared the robustness of results obtained from the experimental data with results obtained from the observational data. Next we discuss whether either of these results is robust in an absolute sense. To do this, we use the leave-out-variable-$k$ propensity score analysis discussed in section 2.

Table 3:
Variation in leave-out-variable-$k$ propensity scores, experimental data.

                              p50      p75      p90      max ($\bar c_k$)
Earnings in 1975              0.001    0.004    0.008    0.053
Black                         0.007    0.009    0.014    0.082
Positive earnings in 1974     0.002    0.010    0.018    0.034
Education                     0.012    0.022    0.031    0.087
Married                       0.006    0.012    0.032    0.042
Age                           0.015    0.024    0.034    0.099
Earnings in 1974              0.002    0.011    0.035    0.209
Positive earnings in 1975     0.013    0.017    0.062    0.082
Hispanic                      0.007    0.017    0.099    0.124

First consider table 3, which uses data from the experimental sample. For each variable $k$, listed in the rows of this table, we compute four summary statistics from the estimated distribution of
\[
\Delta_k = \big| p_{1|W}(W_{-k}, W_k) - p_{1|W_{-k}}(W_{-k}) \big|.
\]
Specifically, we estimate the 50th, 75th, and 90th percentiles of $\Delta_k$, along with the maximum observed value, denoted $\bar c_k$. As discussed in section 2, these quantities tell us about the marginal impact of covariate $k$ on treatment assignment. $c$-dependence constrains the maximum value of the marginal impact of the unobserved potential outcome on treatment assignment, above and beyond the observed covariates. Thus the values in table 3 can help us calibrate $c$. Specifically, we will compare the breakdown point to the values in this table. These values could be interpreted as upper bounds on the magnitude of selection on unobservables that we might think is present. Thus, for a given reference value from this table, if the breakdown point is larger than the reference value, we could consider the conclusion of interest to be robust to failure of unconfoundedness. In contrast, if the breakdown point is smaller than the reference value, we could consider the conclusion of interest to be sensitive to failure of unconfoundedness.

Recall from above the estimated breakdown point for the conclusion that the ATE is nonnegative in the experimental data. This value is larger than several of the nine $\bar c_k$ values and on the same order of magnitude as four more. If we look at a less stringent comparison, the 90th percentile, we see that the estimated breakdown point is larger than all but one of the rows, corresponding to the indicator for Hispanic. Let's examine this variable more closely. Figure 2 plots the density of $\Delta_k$ for $k$ equal to the Hispanic indicator. Here we see that there is a small proportion of mass at values larger than 0.1.
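The $\Delta_k$ summaries in table 3 are straightforward to compute by refitting the propensity score without covariate $k$; a minimal sketch, where `fit_propensity` is a hypothetical routine mapping $(X, W)$ to fitted scores:

```python
import numpy as np

def delta_k_summary(X, W, k, fit_propensity, qs=(0.5, 0.75, 0.9)):
    """Percentiles and maximum of Delta_k = |p_{1|W} - p_{1|W_{-k}}|:
    compare fitted propensity scores with and without covariate k."""
    full = fit_propensity(X, W)
    leave_out = fit_propensity(X, np.delete(W, k, axis=1))
    delta = np.abs(full - leave_out)
    return [float(np.quantile(delta, q)) for q in qs] + [float(delta.max())]
```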
Figure 2: Kernel density estimate of $\Delta_k$, the absolute difference between the propensity score and the leave-out-variable-$k$ propensity score, for $k$ equal to the Hispanic indicator, in the experimental dataset.

The leave-out-variable-$k$ propensity score analysis focuses on the relationship between observed covariates and treatment assignment. It does not use data on outcomes. A less conservative analysis is to only worry about covariates $k$ which have large values in table 3 and which also affect our outcomes in some way. Specifically, we next consider leave-out-variable-$k$ IPW estimates of the ATE under the baseline unconfoundedness assumption. Table 4 shows the effect of leaving out a single variable on the ATE point estimates for both datasets. Continue to consider just the experimental dataset. Here we first see that omitting any single covariate changes the point estimate by at most 5.4%. Moreover, recall the main variable we were concerned about before: the indicator for Hispanic. Omitting this variable only changes the ATE point estimate by 1.5%.

Overall, the leave-out-variable-$k$ analysis suggests that, on an absolute scale, the conclusion that the ATE is nonnegative using the experimental data is quite robust. A similar analysis applies to conclusions about the ATT.

Next consider the observational data. Table 5 shows the leave-out-variable-$k$ propensity score analysis. Recall that the estimated breakdown point for the conclusion that the ATE is nonnegative in the observational dataset is 0.037. By any of these measures the conclusion that the ATE is nonnegative is not robust. Suppose we only consider variables which also substantially change the point estimates, as shown in table 4. Even then we still find that the results are sensitive. For example, the indicator for Black changes the ATE point estimate by 14% and also has substantial marginal impact on the propensity score, with its 50th percentile in table 5 about 1.5 times as large as the estimated ATE breakdown point. Thus, using these as absolute measures of robustness, we find that the conclusion that the ATE is positive using the observational data is not robust.

Table 4: Magnitude of the effect of omitting a single variable on ATE point estimates (as a percentage of the baseline estimate).

                              Experimental dataset    Observational dataset
Earnings in 1975              0.07                    0.02
Married                       0.21                    14.27
Positive earnings in 1974     1.35                    10.20
Hispanic                      1.51                    1.01
Black                         2.91                    14.11
Positive earnings in 1975     3.32                    0.64
Age                           3.36                    6.49
Earnings in 1974              3.90                    0.34
Education                     5.39                    1.84
Table 5:
Variation in leave-out-variable-$k$ propensity scores, observational data.

                              p50      p75      p90      max ($\bar c_k$)
Earnings in 1974              0.000    0.001    0.009    0.065
Hispanic                      0.003    0.011    0.024    0.214
Education                     0.006    0.017    0.042    0.127
Earnings in 1975              0.002    0.010    0.057    0.276
Positive earnings in 1975     0.007    0.019    0.076    0.295
Positive earnings in 1974     0.012    0.028    0.099    0.423
Married                       0.028    0.079    0.172    0.314
Age                           0.035    0.093    0.205    0.508
Black                         0.053    0.143    0.266    0.477

This conclusion that findings based on the observational dataset are not robust contrasts with the sensitivity analysis of Imbens (2003), who finds that the same observational dataset yields relatively robust results. Imbens' analysis relied importantly on fully parametric assumptions about the joint distribution of the observables and unobservables. In particular, he assumed outcomes were normally distributed, that the treatment effect is homogeneous, and that any selection on unobservables arises due to an omitted binary variable. Our identification analysis does not require any of these assumptions. As discussed in section 3, we do impose some parametric assumptions to simplify estimation, but even these assumptions are substantially weaker than those used by Imbens. Given that we are making weaker auxiliary assumptions, it is not surprising that our analysis shows the findings to be more sensitive than the analysis in Imbens (2003). Nonetheless, even with these weaker assumptions, we continue to find that conclusions from the experimental dataset remain robust.

Finally, note that all of our discussion thus far has focused on the point estimates of the breakdown points. In section 5.4 we showed that the value at which the pointwise confidence band intersects the horizontal axis is a valid one-sided lower confidence interval for the breakdown point. For the ATE with experimental data, the lower endpoint of this confidence set is quite close to zero, substantially smaller than most of the leave-out-variable-$k$ propensity score values. This is not surprising though, given that there is a substantial amount of sampling uncertainty: even the lower bounds of the confidence intervals for the baseline estimates are quite close to zero.

Can Selection on Unobservables Help the Program Pass a Cost-Benefit Analysis?
In the previous subsection we studied the sensitivity of the conclusion that the ATE is nonnegative. In practice, however, this is not necessarily the most policy relevant conclusion. For example, Heckman and Smith (1998, section 3) give a model where the socially optimal decision whether to continue a small scale program or to shut it down can be computed by comparing the ATT with the program's per participant cost. In this subsection we show how our sensitivity analysis can be used in these kinds of cost-benefit analyses. Specifically, we consider the conclusion that the ATT is less than the per participant program cost. Under the model in Heckman and Smith (1998), the program should be shut down when this conclusion holds.

Chapter 8 of MDRC (1983) reports NSW per participant program costs. For males, these total costs exceed our baseline ATT estimates. First consider the experimental dataset. Is there a value of $c$ such that the identified set for the ATT includes values that are larger than the per participant cost? If we look at the bounds' point estimates, there are no values of $c$ under which the program is cost effective. Accounting for sampling uncertainty by examining the confidence bands, we need $c$ to be at least about 0.5 before it is possible that the program is cost effective. As we argued earlier, these are very large values, so it is unlikely that selection on unobservables is this strong.

Next consider the observational dataset. Here again the conclusion of interest holds for the baseline estimate: The ATT point estimate of $4001 is smaller than the per participant cost. There are likewise no values of $c$ under which the program is cost effective, based on the bounds' point estimates. This is largely because the uncertainty due to the impact of selection on unobservables is asymmetric in this example: The lower bound grows much faster in $c$ than the upper bound does. Hence conclusions about the largest possible value of the ATT are more robust to relaxations of unconfoundedness than conclusions about the smallest possible value of the ATT. If we account for sampling uncertainty by examining the confidence bands, then we need $c$ to be at least about 0.08 before the confidence intervals contain ATT values larger than the per participant costs. This is a relatively large value, although it is smaller than a decent number of the leave-out-variable-$k$ propensity score values in table 5. That, however, likely just reflects the large amount of sampling uncertainty in this data.

Overall, our analysis suggests that the program does not pass a cost-benefit analysis, even if we allow for a large amount of selection on unobservables. Hence the conclusion that the ATT is less than the per participant cost, and hence that the program should be shut down, is quite robust to failures of unconfoundedness.

Finally, note that our analysis here is primarily illustrative. A more comprehensive cost-benefit analysis would require examining many other program outcomes besides just short run post-program earnings. For example, see the analysis in chapter 8 of MDRC (1983) and section 10 of Heckman et al. (1999). Note, however, that given data on these additional outcomes, our methods could then be used to analyze the sensitivity of total program impacts to failures of unconfoundedness.

Conclusion

Identification, estimation, and inference on treatment effects under unconfoundedness have been widely studied and applied. This approach uses two assumptions: unconfoundedness and overlap. The overlap assumption is refutable, and many tools have been developed for checking this assumption in practice. For example, Stata's built-in package teffects has commands for checking overlap.
In this paper, we provide a complementary suite of tools for assessing the unconfoundedness assumption. There are two key distinctions between our results and the previous literature. First, we begin from fully nonparametric bounds. In contrast, most of the previous literature relies on parametric assumptions for its identification analysis. Second, we provide tools for inference. This is important because, just like baseline estimators, sensitivity analyses are also subject to sampling uncertainty.
We conclude by discussing several extensions and directions for future work. As we just mentioned, a key distinguishing feature of our sensitivity analysis is that we begin from fully nonparametric bounds. We then estimated these bounds using flexible parametric estimators of the propensity score and the quantile regression of outcomes on treatment and covariates. These estimators can include quadratic terms, cubic terms, and interactions, for example, but they are not fully nonparametric. We restricted attention to parametric estimators for one reason: Even in this case, the asymptotic distribution theory is non-standard, complicated, and at the frontier of current research. This difficulty comes from the fact that our estimands are not Hadamard differentiable. Extending our analysis to first step nonparametric estimators is an important next step, but doing so will likely require both deriving and applying more general asymptotic theory for non-Hadamard differentiable functionals than currently exists. Hence we leave that analysis to future work.

A second extension is to consider additional parameters of interest. In this paper we focus on estimation and inference on the ATE and ATT bounds. We also developed analogous results for the CQTE and CATE. The conditional average treatment effect for the treated, $\mathrm{CATT}(w) = E(Y_1 - Y_0 \mid X = 1, W = w)$, can be studied with the same tools we use in section 4. We omit that analysis for brevity. Masten and Poirier (2018) also derive sharp bounds on unconditional quantile treatment effects (QTEs). Estimation and inference on the QTE bounds is more complicated than for the ATE and ATT bounds. The reason is identical to the explanation van der Vaart (2000, page 307) gives when discussing inference on unconditional sample quantiles: "to derive the asymptotic normality of even a single quantile estimator $\widehat F_n^{-1}(p)$, we need to know that the estimators $\widehat F_n$ are asymptotically normal as a process, in a neighborhood of $F^{-1}(p)$." In our case, performing inference on the QTE bounds requires showing convergence of the corresponding bounds on the unconditional potential outcome cdfs as a process in a neighborhood of the quantile of interest. (Masten and Poirier (2020) prove some results along these lines; see their lemma 1. Those results are only valid for sufficiently small values of $c$ and with discrete $W$, which substantially simplifies the analysis.) For this reason, we leave estimation and inference on the QTE bounds to a separate paper.
References
Altonji, J. G., T. E. Elder, and C. R. Taber (2005): "Selection on observed and unobserved variables: Assessing the effectiveness of Catholic schools," Journal of Political Economy, 113, 151–184.

——— (2008): "Using selection on observed variables to assess bias from unobservables when evaluating Swan-Ganz catheterization," American Economic Review P&P, 98, 345–350.

Amemiya, T. (1985): Advanced Econometrics, Harvard University Press.

Angrist, J., V. Chernozhukov, and I. Fernández-Val (2006): "Quantile regression under misspecification, with an application to the US wage structure," Econometrica, 74, 539–563.

Athey, S. and G. W. Imbens (2017): "The state of applied econometrics: Causality and policy evaluation," Journal of Economic Perspectives, 31, 3–32.

Caliendo, M. and S. Kopeinig (2008): "Some practical guidance for the implementation of propensity score matching," Journal of Economic Surveys, 22, 31–72.

Chernozhukov, V., I. Fernández-Val, and A. Galichon (2010): "Quantile and probability curves without crossing," Econometrica, 78, 1093–1125.

Chernozhukov, V., I. Fernández-Val, and T. Kaji (2017): "Extremal quantile regression," Handbook of Quantile Regression.

Cinelli, C. and C. Hazlett (2020): "Making sense of sensitivity: Extending omitted variable bias," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 39–67.

Dehejia, R. H. and S. Wahba (1999): "Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs," Journal of the American Statistical Association, 94, 1053–1062.

Fang, Z. and A. Santos (2015): "Inference on directionally differentiable functions," Working paper.

——— (2019): "Inference on directionally differentiable functions," The Review of Economic Studies, 86, 377–412.

Heckman, J. J., R. J. LaLonde, and J. A. Smith (1999): "The economics and econometrics of active labor market programs," Handbook of Labor Economics, 3, 1865–2097.

Heckman, J. J. and J. Smith (1998): "Evaluating the welfare state," in Econometrics and Economic Theory in the 20th Century: The Ragnar Frisch Centennial Symposium, Cambridge University Press, 31, 241.

Hong, H. and J. Li (2018): "The numerical delta method," Journal of Econometrics, 206, 379–394.

Hosman, C. A., B. B. Hansen, and P. W. Holland (2010): "The sensitivity of linear regression coefficients' confidence limits to the omission of a confounder," The Annals of Applied Statistics, 4, 849–870.

Ichino, A., F. Mealli, and T. Nannicini (2008): "From temporary help jobs to permanent employment: What can we learn from matching estimators and their sensitivity?" Journal of Applied Econometrics, 23, 305–327.

Imbens, G. W. (2003): "Sensitivity to exogeneity assumptions in program evaluation," American Economic Review P&P, 126–132.

——— (2004): "Nonparametric estimation of average treatment effects under exogeneity: A review," The Review of Economics and Statistics, 86, 4–29.

Imbens, G. W. and D. B. Rubin (2015): Causal Inference for Statistics, Social, and Biomedical Sciences, Cambridge University Press.

Imbens, G. W. and J. M. Wooldridge (2009): "Recent developments in the econometrics of program evaluation," Journal of Economic Literature, 47, 5–86.

Kallus, N., X. Mao, and A. Zhou (2019): "Interval estimation of individual-level causal effects under unobserved confounding," in The 22nd International Conference on Artificial Intelligence and Statistics, 2281–2290.

Kosorok, M. R. (2008): Introduction to Empirical Processes and Semiparametric Inference, Springer Science & Business Media.

Krauth, B. (2016): "Bounding a linear causal effect using relative correlation restrictions," Journal of Econometric Methods, 5, 117–141.

LaLonde, R. J. (1986): "Evaluating the econometric evaluations of training programs with experimental data," The American Economic Review, 604–620.

Manpower Demonstration Research Corporation (MDRC) (1983): Summary and Findings of the National Supported Work Demonstration.

Manski, C. F. (1990): "Nonparametric bounds on treatment effects," American Economic Review P&P, 80, 319–323.

Masten, M. A. and A. Poirier (2018): "Identification of treatment effects under conditional partial independence," Econometrica, 86, 317–351.

——— (2020): "Inference on breakdown frontiers," Quantitative Economics, 11, 41–111.

Mauro, R. (1990): "Understanding LOVE (left out variables error): A method for estimating the effects of omitted variables," Psychological Bulletin, 108, 314.

Newey, W. K. and D. McFadden (1994): "Large sample estimation and hypothesis testing," Handbook of Econometrics, IV, ed. by R. F. Engle and D. L. McFadden, 2112–2245.

Oster, E. (2019): "Unobservable selection and coefficient stability: Theory and evidence," Journal of Business & Economic Statistics, 37, 187–204.

Robins, J. M., A. Rotnitzky, and D. O. Scharfstein (2000): "Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models," in Statistical Models in Epidemiology, the Environment, and Clinical Trials, Springer, 1–94.

Rosenbaum, P. R. (1995): Observational Studies, Springer.

——— (2002): Observational Studies, Springer, second ed.

Rosenbaum, P. R. and D. B. Rubin (1983): "Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome," Journal of the Royal Statistical Society, Series B, 212–218.

van der Vaart, A. and J. Wellner (1996): Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Science & Business Media.

van der Vaart, A. W. (2000): Asymptotic Statistics, Cambridge University Press.
A Asymptotics for the First Step Estimators
In this appendix we formally state assumptions that ensure asymptotic normality of our first step estimators.
A.1 Assumptions
We begin with the propensity score, which we estimate by maximum likelihood.
Assumption A4 (Propensity Score).

1. (Correct specification) Let $\mathcal B \subseteq \mathbb R^{d_W}$ be compact. There is a $\beta_0 \in \mathrm{int}(\mathcal B)$ such that
\[
P(X = x \mid W = w) = F(w'\beta_0)^x (1 - F(w'\beta_0))^{1 - x} \equiv L(x, w'\beta_0)
\]
for all $x \in \{0, 1\}$ and $w \in \mathcal W$.

2. (Sufficient variation) There is no proper linear subspace $A$ of $\mathbb R^{d_W}$ such that $P(W \in A) = 1$.

3. (Regularity of link function) $F : \mathbb R \to (0, 1)$ is strictly increasing and twice continuously differentiable with uniformly bounded derivative.

This assumption requires our propensity score specification to be correct. It also imposes some standard assumptions on the parameter space $\mathcal B$, the link function $F(\cdot)$, and the distribution of the covariates $W$. Next, let
\[
\ell(x, w'\beta) = \log L(x, w'\beta)
\]
denote the log likelihood function. Let
\[
\ell_\beta(x, w'\beta) = \frac{\partial}{\partial \beta} \ell(x, w'\beta)
\qquad \text{and} \qquad
\ell_{\beta\beta}(x, w'\beta) = \frac{\partial^2}{\partial \beta\, \partial \beta'} \ell(x, w'\beta)
\]
denote its vector of first derivatives and its matrix of second derivatives, respectively. Recall from section 4 that $\mathcal B_\delta = \{\beta \in \mathcal B : \|\beta - \beta_0\| \leq \delta\}$. We impose the following assumptions on the propensity score as well.

Assumption A5 (Propensity Score Regularity). For each $x \in \{0, 1\}$:

1. We have
\[
E\Big( \sup_{\beta \in \mathcal B} |\ell(x, W'\beta)| \Big) < \infty.
\]

2. For some $\delta > 0$,
\[
\int_{\mathcal W} \sup_{\beta \in \mathcal B_\delta} \Big\| \frac{\partial}{\partial \beta} L(x, w'\beta) \Big\|\, dw < \infty
\qquad \text{and} \qquad
\int_{\mathcal W} \sup_{\beta \in \mathcal B_\delta} \Big\| \frac{\partial^2}{\partial \beta\, \partial \beta'} L(x, w'\beta) \Big\|\, dw < \infty.
\]

3. The matrix $V_\beta^{-1} = E[\ell_\beta(X, W'\beta_0)\, \ell_\beta(X, W'\beta_0)']$ exists and is nonsingular.

4. For some $\delta > 0$,
\[
E\Big( \sup_{\beta \in \mathcal B_\delta} \|\ell_{\beta\beta}(x, W'\beta)\| \Big) < \infty.
\]

These conditions are standard for maximum likelihood estimators. For example, see theorem 3.3 in Newey and McFadden (1994) along with their discussion. Note that the dominance conditions A5.1, A5.2, and A5.4 hold in standard parametric models like logit and probit. Although not necessary, A5.3 also holds when $E(\|W\|^2) < \infty$ and strong overlap holds; that is, when there exist $0 < \underline p \leq \overline p < 1$ such that $p_{1|w} \in [\underline p, \overline p]$ for all $w \in \mathcal W$.

Besides the propensity score, the other first step estimator is the conditional quantile function of $Y$ given $(X, W)$. We consider here a linear quantile regression of $Y$ on $q(X, W)$, a set of flexible functions of $(X, W)$. We make the following assumptions.

Assumption A6 (Quantile Regression). There exists an $\varepsilon_{\mathrm{smaller}} \in (0, \varepsilon)$ such that:

1. There is some $\gamma_0 \in C([\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}], \mathbb R^{d_q})$ such that
\[
Q_{Y|X,W}(\tau \mid x, w) = q(x, w)'\gamma_0(\tau)
\]
for every $\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]$.

2. The conditional density $f_{Y|q(X,W)}(y \mid q(x, w))$ exists and is bounded and uniformly continuous in $y$, uniformly in $q(x, w) \in \mathrm{supp}(q(X, W))$.

3. The matrix
\[
J(\tau) = E\big[ f_{Y|q(X,W)}\big( q(X, W)'\gamma_0(\tau) \mid q(X, W) \big)\, q(X, W)\, q(X, W)' \big]
\]
is positive definite for all $\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]$.

4. $E(\|q(x, W)\|^2) < \infty$ for $x \in \{0, 1\}$.

These are standard assumptions for obtaining limiting distributions of quantile regression processes indexed by $\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]$. For example, see theorem 3 in Angrist, Chernozhukov, and Fernández-Val (2006).

A.2 Convergence Results
We next prove two convergence results. The first is joint asymptotic normality of the first step estimators.
Lemma 1 (First step estimators). Suppose A1 and A4–A6 hold. Then
\[
\sqrt n \begin{pmatrix} \widehat\beta - \beta_0 \\ \widehat\gamma(\tau) - \gamma_0(\tau) \end{pmatrix} \rightsquigarrow Z_1(\tau),
\]
where $Z_1(\cdot)$ is a mean-zero Gaussian process in $\mathbb R^{d_W} \times \ell^\infty([\varepsilon, 1 - \varepsilon], \mathbb R^{d_q})$ with uniformly continuous paths. Moreover, its covariance kernel can be written in block form as
\[
E[Z_1(\tau_1) Z_1(\tau_2)'] = \begin{pmatrix} V_\beta & 0 \\ 0 & V_\gamma(\tau_1, \tau_2) \end{pmatrix} \tag{11}
\]
where
\[
V_\beta = E\left[ \frac{F'(W'\beta_0)^2}{F(W'\beta_0)(1 - F(W'\beta_0))}\, W W' \right]^{-1}
\]
and
\[
V_\gamma(\tau_1, \tau_2) = J(\tau_1)^{-1} (\min\{\tau_1, \tau_2\} - \tau_1 \tau_2)\, E[q(X, W) q(X, W)']\, J(\tau_2)^{-1}.
\]

Next we provide a convergence result for estimates of the derivatives of the quantile regression coefficients. We estimate these derivatives as follows. Let $\tau \in [\varepsilon, 1 - \varepsilon]$. Then
\[
\widehat\gamma{}'(\tau) = \frac{\widehat\gamma(\tau + \eta_n) - \widehat\gamma(\tau - \eta_n)}{2 \eta_n} \tag{12}
\]
where $\eta_n > 0$ is a bandwidth chosen small enough that $[\tau - \eta_n, \tau + \eta_n]$ is contained in $(\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}})$. The next result shows that these estimators are uniformly consistent.

Lemma 2 (Convergence of QR derivatives). Let $\eta_n \to 0$ and $n\eta_n^2 \to \infty$ as $n \to \infty$. Suppose the assumptions of lemma 1 hold. Suppose A2 holds. Then
\[
\sup_{\tau \in [\varepsilon, 1 - \varepsilon]} \|\widehat\gamma{}'(\tau) - \gamma_0'(\tau)\| = o_p(1).
\]
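Equation (12) is an ordinary central difference applied to the estimated coefficient process; a minimal sketch, with a hypothetical smooth process standing in for $\widehat\gamma$:

```python
import numpy as np

def qr_coef_derivative(gamma_hat, tau, eta_n):
    """Central-difference estimator (12) of the derivative of the quantile
    regression coefficient process; gamma_hat maps a quantile level to a
    coefficient vector (e.g. from separate quantile regressions)."""
    return (gamma_hat(tau + eta_n) - gamma_hat(tau - eta_n)) / (2.0 * eta_n)

# hypothetical smooth coefficient process for illustration
gamma = lambda t: np.array([t ** 2, np.log(1.0 + t)])
print(qr_coef_derivative(gamma, tau=0.5, eta_n=0.01))  # approx [1.0, 0.667]
```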
A.3 Proofs

We begin by examining each of the first step estimators separately.
Lemma 3 (Propensity score estimation). Suppose A1 and A4–A5 hold. Then
\[
\sqrt n(\widehat\beta - \beta_0) = \frac{1}{\sqrt n} \sum_{i=1}^n V_\beta \frac{F'(W_i'\beta_0)(X_i - F(W_i'\beta_0))\, W_i}{F(W_i'\beta_0)(1 - F(W_i'\beta_0))} + o_p(1)
\]
and hence $\sqrt n(\widehat\beta - \beta_0) \overset{d}{\longrightarrow} N(0, V_\beta)$.

Proof of lemma 3. This result follows from theorem 3.3 (asymptotic normality of MLEs) in Newey and McFadden (1994). So it suffices to verify that their assumptions hold.

1. Their theorem 3.3 begins by supposing the assumptions of their theorem 2.5 (consistency of MLEs) hold. So we verify those assumptions first. By A4.1, $\ell(x, w'\beta) = \ell(x, w'\widetilde\beta)$ for all $(x, w) \in \mathrm{supp}(X, W)$ implies that $w'\beta = w'\widetilde\beta$ for all $w \in \mathrm{supp}(W)$. By A4.2 this implies that $\beta = \widetilde\beta$. So assumption (i) of their theorem 2.5 holds. We directly assume that their assumptions (ii), (iii), and (iv) hold (via our A4 and A5.1). Finally, note that A1 is our assumption that $\{(Y_i, X_i, W_i)\}_{i=1}^n$ are iid. Thus all assumptions of their theorem 2.5 hold.

2. Next we consider the additional assumptions imposed in their theorem 3.3, (i)–(v). These are directly implied by our A4 and A5.

Thus all assumptions of their theorem 3.3 hold. This gives us $\sqrt n(\widehat\beta - \beta_0) \overset{d}{\to} N(0, V_\beta)$. The asymptotic linear representation holds by arguments in the proof of their theorem 3.1 and the discussion on pages 2142–2143.

Lemma 4 (Quantile regression estimation). Suppose A1 and A6 hold. Then
\[
\sqrt n(\widehat\gamma(\tau) - \gamma_0(\tau)) = J(\tau)^{-1} \frac{1}{\sqrt n} \sum_{i=1}^n \big( \tau - \mathbb 1(Y_i \leq q(X_i, W_i)'\gamma_0(\tau)) \big)\, q(X_i, W_i) + o_p(1) \rightsquigarrow J(\tau)^{-1} Z_\gamma(\tau),
\]
where $Z_\gamma(\cdot)$ is a mean-zero Gaussian process in $\ell^\infty([\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}], \mathbb R^{d_q})$ with continuous paths and covariance kernel equal to $\Sigma(\tau_1, \tau_2) = (\min\{\tau_1, \tau_2\} - \tau_1\tau_2)\, E[q(X, W) q(X, W)']$.

Proof of lemma 4. By Minkowski's inequality,
\[
E(\|q(X, W)\|^2)^{1/2} = E(\|X q(1, W) + (1 - X) q(0, W)\|^2)^{1/2} \leq E(|X|^2 \|q(1, W)\|^2)^{1/2} + E(|1 - X|^2 \|q(0, W)\|^2)^{1/2}.
\]
By A6.4 and $X \in \{0, 1\}$, it follows that $E(\|q(X, W)\|^2) < \infty$. The result then follows directly from theorem 3 in Angrist et al. (2006).

Proof of lemma 1. Note that the influence functions in lemmas 3 and 4 are Donsker. Therefore we can stack them to obtain joint weak convergence to $Z_1(\tau)$. Next, note that this result holds over $\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]$. This is a strict superset of $[\varepsilon, 1 - \varepsilon]$, so it holds on that set too.

Finally, we note that the off-diagonal element of the covariance kernel is zero. This off-diagonal element is
\[
C_{\beta,\gamma}(\tau) = \mathrm{cov}\left( V_\beta \frac{F'(W'\beta_0)(X - F(W'\beta_0))\, W}{F(W'\beta_0)(1 - F(W'\beta_0))},\ \big( \tau - \mathbb 1(Y \leq q(X, W)'\gamma_0(\tau)) \big)\, q(X, W)' J(\tau)^{-1} \right).
\]
By iterated expectations,
\[
E\big[ \big( \tau - \mathbb 1(Y \leq q(X, W)'\gamma_0(\tau)) \big)\, q(X, W)' J(\tau)^{-1} \big]
= E\big[ \big( \tau - E[\mathbb 1(Y \leq q(X, W)'\gamma_0(\tau)) \mid X, W] \big)\, q(X, W)' J(\tau)^{-1} \big] = 0
\]
since $P(Y \leq q(x, w)'\gamma_0(\tau) \mid X = x, W = w) = \tau$ by correct specification of the conditional quantile function. Also,
\[
E\left[ V_\beta \frac{F'(W'\beta_0)(X - F(W'\beta_0))\, W}{F(W'\beta_0)(1 - F(W'\beta_0))} \big( \tau - \mathbb 1(Y \leq q(X, W)'\gamma_0(\tau)) \big)\, q(X, W)' J(\tau)^{-1} \right] = 0
\]
by a similar argument, using iterated expectations conditional on $(X, W)$, by correct specification of the conditional quantile function, and since the first factor is deterministic conditional on $(X, W)$. Thus $C_{\beta,\gamma}(\tau) = 0$ by definition of the covariance.

Proof of lemma 2.
Without loss of generality, consider the convergence of $\widehat\gamma{}'_{(1)}$ to $\gamma'_{0,(1)}$, the first component of $\gamma_0'$. Since $\eta_n \to 0$, let $\eta_n$ be small enough that $\eta_n \in (0, \varepsilon - \varepsilon_{\mathrm{smaller}})$. Then
\begin{align*}
\sup_{\tau \in [\varepsilon, 1-\varepsilon]} |\widehat\gamma{}'_{(1)}(\tau) - \gamma'_{0,(1)}(\tau)|
&\leq \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \left| \frac{\widehat\gamma_{(1)}(\tau + \eta_n) - \widehat\gamma_{(1)}(\tau - \eta_n)}{2\eta_n} - \frac{\gamma_{0,(1)}(\tau + \eta_n) - \gamma_{0,(1)}(\tau - \eta_n)}{2\eta_n} \right| \\
&\quad + \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \left| \frac{\gamma_{0,(1)}(\tau + \eta_n) - \gamma_{0,(1)}(\tau - \eta_n)}{2\eta_n} - \gamma'_{0,(1)}(\tau) \right| \\
&\leq \frac{1}{2\eta_n} \left( \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \big| \widehat\gamma_{(1)}(\tau + \eta_n) - \gamma_{0,(1)}(\tau + \eta_n) \big| + \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \big| \widehat\gamma_{(1)}(\tau - \eta_n) - \gamma_{0,(1)}(\tau - \eta_n) \big| \right) \\
&\quad + \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \left| \frac{ \big( \gamma_{0,(1)}(\tau) + \gamma'_{0,(1)}(\tau)\eta_n + \tfrac12 \gamma''_{0,(1)}(\tau)\eta_n^2 + \tfrac16 \gamma'''_{0,(1)}(\tau^*_{1n})\eta_n^3 \big) - \big( \gamma_{0,(1)}(\tau) - \gamma'_{0,(1)}(\tau)\eta_n + \tfrac12 \gamma''_{0,(1)}(\tau)\eta_n^2 - \tfrac16 \gamma'''_{0,(1)}(\tau^*_{2n})\eta_n^3 \big) }{2\eta_n} - \gamma'_{0,(1)}(\tau) \right| \\
&\leq \frac{1}{2\eta_n} \left( 2 \sup_{\tau \in [\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}}]} \big| \widehat\gamma_{(1)}(\tau) - \gamma_{0,(1)}(\tau) \big| \right) + \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \left| \frac{\gamma'''_{0,(1)}(\tau^*_{1n})\eta_n^2}{12} + \frac{\gamma'''_{0,(1)}(\tau^*_{2n})\eta_n^2}{12} \right|.
\end{align*}
The first inequality follows by the triangle inequality and the definition of $\widehat\gamma{}'_{(1)}$. The second inequality follows by taking two third order Taylor expansions of $\gamma_{0,(1)}$, where $\tau^*_{1n} \in (\tau, \tau + \eta_n)$ and $\tau^*_{2n} \in (\tau - \eta_n, \tau)$. There we use A2 with $m \geq 3$. The last inequality follows for two reasons: In the first term, $\varepsilon_{\mathrm{smaller}} < \varepsilon$ and $\eta_n$ is chosen such that $[\tau - \eta_n, \tau + \eta_n] \subset (\varepsilon_{\mathrm{smaller}}, 1 - \varepsilon_{\mathrm{smaller}})$, so we are taking the supremum over a larger set in this term in the last line. In the second term, the zeroth, first, and second order derivative terms of $\gamma_{0,(1)}$ all cancel, leaving only the third order derivatives remaining.

Finally,
\[
\sup_{\tau \in [\varepsilon, 1-\varepsilon]} |\widehat\gamma{}'_{(1)}(\tau) - \gamma'_{0,(1)}(\tau)| = \frac{1}{2\eta_n} O_p\left( \frac{1}{\sqrt n} \right) + B\, \eta_n^2 = o_p(1).
\]
The first equality follows by lemma 4, which shows that the first term is $(1/\eta_n) O_p(1/\sqrt n) = O_p(1/\sqrt{n\eta_n^2})$. In the second term, $B > 0$ is a constant that does not depend on $n$. This constant comes from A2, which implies the function $\gamma_0$ has uniformly bounded third derivatives. The last equality follows since $\eta_n \to 0$ and $n\eta_n^2 \to \infty$ as $n \to \infty$.

Repeating this argument across all components of $\widehat\gamma{}'(\tau) - \gamma_0'(\tau)$ shows that $\sup_{\tau \in [\varepsilon, 1-\varepsilon]} \|\widehat\gamma{}'(\tau) - \gamma_0'(\tau)\| = o_p(1)$, as desired.

B Proofs for Section 4
In this section we give the proofs for the results in section 4. We start with a preliminary result on the asymptotic distribution of the CQTE bound estimators. We use this for all of our later results. We then state and prove a useful lemma. Finally we give the proofs for our CATE, ATE, and ATT bound estimators.

All of these results rely on proving Hadamard directional differentiability of various functionals. For that reason, it is helpful to recall its definition.
Definition 2.
Let $\phi : \mathbb D_\phi \to \mathbb E$ where $\mathbb D, \mathbb E$ are Banach spaces and $\mathbb D_\phi \subseteq \mathbb D$. Say $\phi$ is Hadamard directionally differentiable at $\theta \in \mathbb D_\phi$ tangentially to $\mathbb D_0 \subseteq \mathbb D$ if there is a continuous map $\phi'_\theta : \mathbb D_0 \to \mathbb E$ such that
\[
\lim_{m \to \infty} \left\| \frac{\phi(\theta + t_m h_m) - \phi(\theta)}{t_m} - \phi'_\theta(h) \right\|_{\mathbb E} = 0
\]
for all sequences $\{h_m\} \subset \mathbb D$ and $t_m \searrow 0$ such that $h_m \to h \in \mathbb D_0$ as $m \to \infty$ and $\theta + t_m h_m \in \mathbb D_\phi$ for all $m$.

By proposition 2.1 in Fang and Santos (2019), the mapping $\phi$ is Hadamard differentiable at $\theta$ tangentially to $\mathbb D_0$ if and only if it is Hadamard directionally differentiable at $\theta$ tangentially to $\mathbb D_0$ and the mapping $\phi'_\theta$ is linear.

Throughout the proofs we let $\|\gamma\|_\infty = \sup_{\tau \in [\varepsilon, 1-\varepsilon]} \|\gamma(\tau)\|$ denote the sup-norm in $\ell^\infty([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$.

B.1 The CQTE Bounds
We start with a preliminary result for our estimates of the CQTE bounds. All our other bounds are built from these, so it is helpful to understand them first.

Proposition 5 (CQTE convergence). Suppose A1, A2, and A4–A6 hold. Fix $\varepsilon > 0$, $w \in \mathcal W$, $c \in [0, 1]$, and $\tau \in (0, 1)$. Then
\[
\sqrt n \begin{pmatrix} \widehat{\overline{\mathrm{CQTE}}}{}^{\,c}(\tau \mid w) - \overline{\mathrm{CQTE}}{}^{\,c}_\varepsilon(\tau \mid w) \\[1ex] \widehat{\underline{\mathrm{CQTE}}}{}^{\,c}(\tau \mid w) - \underline{\mathrm{CQTE}}{}^{\,c}_\varepsilon(\tau \mid w) \end{pmatrix} \overset{d}{\longrightarrow} Z_{\mathrm{CQTE}}(w, \tau),
\]
where $Z_{\mathrm{CQTE}}$ is a random vector in $\mathbb R^2$ whose distribution is characterized in the proof.

Proof of proposition 5.
Part 1: The upper bound is HDD. Recall from section 4.2 that $\overline\Gamma(x, w, \tau, \theta)$ denotes our trimmed population conditional quantile upper bound. We write this parameter as a function of a few different pieces:
\[
\overline\Gamma(x, w, \tau, \theta) = q(x, w)'\gamma(\overline S(x, w, \tau, \beta))
\]
where $\overline S(x, w, \tau, \beta) = \max\{\overline S_0(x, w, \tau, \beta), \varepsilon\}$ and
\[
\overline S_0(x, w, \tau, \beta) = \min\left\{ \tau + \frac{c}{L(x, w'\beta)} \min\{\tau, 1 - \tau\},\ \frac{\tau}{L(x, w'\beta)},\ 1 - \varepsilon \right\}.
\]
For simplicity, we leave the dependence on $\varepsilon$ and $c$ implicit in our notation for $\overline S_0$ and $\overline S$. There are now three steps: We show Hadamard directional differentiability (HDD) of $\overline\Gamma(x, w, \tau, \cdot)$ at $\theta_0$ tangentially to $\mathbb R^{d_W} \times C([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$ by examining the two pieces $\overline S_0$ and $\overline S$ separately. We then combine these to show HDD of $\overline\Gamma$.

Step 1: HDD of $\overline S_0$. We first show $\overline S_0(x, w, \tau, \cdot)$ is HDD at $\beta_0$. Let $t_m \searrow 0$ and $h_m \to h \in \mathbb R^{d_W}$ as $m \to \infty$. Define the secant line
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = \frac{\overline S_0(x, w, \tau, \beta_0 + t_m h_m) - \overline S_0(x, w, \tau, \beta_0)}{t_m}.
\]
We will show that
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) \to \overline T_0(x, w, \tau, \beta_0, h, 0) \equiv \sum_{j=1}^7 \overline T_{0,j}(x, w, \tau, \beta_0, h)\, \overline{\mathbb 1}_{0,j}(x, w, \tau, \beta_0, 0)
\]
as $m \to \infty$, where the $\overline T_{0,j}$ are defined in appendix C below. To see this, we consider the seven cases associated with the indicators $\overline{\mathbb 1}_{0,j}$ for $j = 1, \ldots, 7$.

Suppose $\overline{\mathbb 1}_{0,1}(x, w, \tau, \beta_0, 0) = 1$. Then
\[
\overline S_0(x, w, \tau, \beta_0) = \tau + \frac{c}{L(x, w'\beta_0)} \min\{\tau, 1 - \tau\}.
\]
Moreover, for $m$ large enough and by continuity of $\overline S_0$ in $\beta$, $\overline{\mathbb 1}_{0,1}(x, w, \tau, \beta_0 + t_m h_m, 0) = 1$. Hence
\[
\overline S_0(x, w, \tau, \beta_0 + t_m h_m) = \tau + \frac{c}{L(x, w'(\beta_0 + t_m h_m))} \min\{\tau, 1 - \tau\}.
\]
So for $m$ large enough,
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = \frac{c \min\{\tau, 1 - \tau\}}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right) \to \overline T_{0,1}(x, w, \tau, \beta_0, h)
\]
by the definition of the directional derivative of $1/L(x, w'\beta)$ with respect to $\beta$ in the direction $h$ at $\beta_0$.

Similarly, if $\overline{\mathbb 1}_{0,2}(x, w, \tau, \beta_0, 0) = 1$ then
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = \frac{\tau}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right) \to \overline T_{0,2}(x, w, \tau, \beta_0, h),
\]
where the first equation holds for $m$ large enough and the convergence holds as $m \to \infty$. Likewise, if $\overline{\mathbb 1}_{0,3}(x, w, \tau, \beta_0, 0) = 1$ then
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = 0 \to \overline T_{0,3}(x, w, \tau, \beta_0, h),
\]
where again the first equation holds for $m$ large enough and the convergence is as $m \to \infty$.

If $\overline{\mathbb 1}_{0,4}(x, w, \tau, \beta_0, 0) = 1$ then
\[
\tau + \frac{c}{L(x, w'\beta_0)} \min\{\tau, 1 - \tau\} = \frac{\tau}{L(x, w'\beta_0)} < 1 - \varepsilon.
\]
For $m$ large enough,
\[
\overline S_0(x, w, \tau, \beta_0 + t_m h_m) = \min\left\{ \tau + \frac{c}{L(x, w'(\beta_0 + t_m h_m))} \min\{\tau, 1 - \tau\},\ \frac{\tau}{L(x, w'(\beta_0 + t_m h_m))} \right\}.
\]
Hence
\[
\overline T_{0m}(x, w, \tau, \beta_0, h_m) = \min\left\{ \frac{c \min\{\tau, 1-\tau\}}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ \frac{\tau}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right) \right\}
\]
for $m$ large enough. Similarly, for $m$ large enough,
\[
\overline T_{0m} = \min\left\{ \frac{c \min\{\tau, 1-\tau\}}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ 0 \right\},
\]
\[
\overline T_{0m} = \min\left\{ \frac{\tau}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ 0 \right\},
\]
and
\[
\overline T_{0m} = \min\left\{ \frac{c \min\{\tau, 1-\tau\}}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ \frac{\tau}{t_m} \left( \frac{1}{L(x, w'(\beta_0 + t_m h_m))} - \frac{1}{L(x, w'\beta_0)} \right),\ 0 \right\}
\]
when $\overline{\mathbb 1}_{0,j} = 1$ for $j = 5, 6, 7$, respectively. Combining all seven cases, since $t_m \searrow 0$ and $h_m \to h$, and by examining $\overline T_{0m}$ for $\overline{\mathbb 1}_{0,j} = 1$, $j = 4, 5, 6, 7$, we see that
\[
\frac{\overline S_0(x, w, \tau, \beta_0 + t_m h_m) - \overline S_0(x, w, \tau, \beta_0)}{t_m} = \overline T_{0m}(x, w, \tau, \beta_0, h_m) \to \overline T_0(x, w, \tau, \beta_0, h, 0).
\]
Hence $\overline S_0(x, w, \tau, \cdot)$ is Hadamard directionally differentiable at $\beta_0$.

Step 2: HDD of $\overline S$. Recall that $\overline S(x, w, \tau, \beta) = \max\{\overline S_0(x, w, \tau, \beta), \varepsilon\}$. As before, let $t_m \searrow 0$ and $h_m \to h \in \mathbb R^{d_W}$ as $m \to \infty$. Define
\[
\overline T_m(x, w, \tau, \beta_0, h_m) = \frac{\overline S(x, w, \tau, \beta_0 + t_m h_m) - \overline S(x, w, \tau, \beta_0)}{t_m}.
\]
Substituting the functional form of $\overline S_0$ into the definition of $\overline S$ gives
\begin{align*}
\overline T_m(x, w, \tau, \beta_0, h_m) &= \frac{1}{t_m} \max\left\{ \min\left\{ \tau + \frac{c \min\{\tau, 1-\tau\}}{L(x, w'(\beta_0 + t_m h_m))},\ \frac{\tau}{L(x, w'(\beta_0 + t_m h_m))},\ 1 - \varepsilon \right\},\ \varepsilon \right\} \\
&\quad - \frac{1}{t_m} \max\left\{ \min\left\{ \tau + \frac{c \min\{\tau, 1-\tau\}}{L(x, w'\beta_0)},\ \frac{\tau}{L(x, w'\beta_0)},\ 1 - \varepsilon \right\},\ \varepsilon \right\}.
\end{align*}
As in step 1, we next characterize the value of this secant line by splitting it into three different cases.

1. If $\min\{\tau + c\min\{\tau,1-\tau\}/L(x, w'\beta_0),\ \tau/L(x, w'\beta_0),\ 1-\varepsilon\} > \varepsilon$ then, for large enough $m$, $\overline T_m$ equals the secant line of $\overline S_0$, namely $\overline T_{0m}$.

2. If $\min\{\tau + c\min\{\tau,1-\tau\}/L(x, w'\beta_0),\ \tau/L(x, w'\beta_0),\ 1-\varepsilon\} < \varepsilon$ then $\overline T_m(x, w, \tau, \beta_0, h_m) = 0$ for large enough $m$.

3. If $\min\{\tau + c\min\{\tau,1-\tau\}/L(x, w'\beta_0),\ \tau/L(x, w'\beta_0),\ 1-\varepsilon\} = \varepsilon$ then $\overline T_m = \max\{\overline T_{0m}, 0\}$ for large enough $m$.

Using similar arguments as in step 1, by examining each of the three cases we see that
\[
\overline T_m(x, w, \tau, \beta_0, h_m) \to \overline T(x, w, \tau, \beta_0, h, 0)
\]
as $m \to \infty$. Hence $\overline S(x, w, \tau, \cdot)$ is Hadamard directionally differentiable at $\beta_0$.

Step 3: HDD of $\overline\Gamma$. Next we show that $\overline\Gamma(x, w, \tau, \cdot)$ is HDD at $\theta_0$ tangentially to $\mathbb R^{d_W} \times C([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$. Let $t_m \searrow 0$, $h_{1m} \to h_1 \in \mathbb R^{d_W}$, and $h_{2m} \to h_2 \in C([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$ endowed with the sup norm, as $m \to \infty$. Let $h_m = (h_{1m}, h_{2m})$ and $h = (h_1, h_2)$. Then
\begin{align*}
\frac{\overline\Gamma(x, w, \tau, \theta_0 + t_m h_m) - \overline\Gamma(x, w, \tau, \theta_0)}{t_m}
&= \frac{q(x, w)'[\gamma_0 + t_m h_{2m}](\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'\gamma_0(\overline S(x, w, \tau, \beta_0))}{t_m} \\
&= \frac{q(x, w)'\big( \gamma_0(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - \gamma_0(\overline S(x, w, \tau, \beta_0)) \big)}{t_m} + q(x, w)' h_{2m}(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})).
\end{align*}
Consider the first term. By A2, $\gamma_0(u)$ is differentiable for any $u \in [\varepsilon, 1-\varepsilon]$. By the chain rule,
\[
\frac{q(x, w)'\big[ \gamma_0(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - \gamma_0(\overline S(x, w, \tau, \beta_0)) \big]}{t_m} \to q(x, w)'\gamma_0'(\overline S(x, w, \tau, \beta_0))\, \overline T(x, w, \tau, \beta_0, h_1, 0)
\]
as $m \to \infty$. Next consider the second term. We have
\begin{align*}
&|q(x, w)'h_{2m}(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'h_2(\overline S(x, w, \tau, \beta_0))| \\
&\quad \leq |q(x, w)'h_{2m}(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'h_2(\overline S(x, w, \tau, \beta_0 + t_m h_{1m}))| \\
&\qquad + |q(x, w)'h_2(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'h_2(\overline S(x, w, \tau, \beta_0))| \\
&\quad \leq \|q(x, w)\| \cdot \|h_{2m} - h_2\|_\infty + |q(x, w)'h_2(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) - q(x, w)'h_2(\overline S(x, w, \tau, \beta_0))|.
\end{align*}
The first inequality follows by the triangle inequality. Consider the second inequality. By continuity of $h_2(\cdot)$ and of $\overline S(x, w, \tau, \cdot)$, the second term converges to zero as $m \to \infty$. By sup-norm convergence of $h_{2m}$ to $h_2$, the first term also converges to zero as $m \to \infty$. Thus
\[
q(x, w)'h_{2m}(\overline S(x, w, \tau, \beta_0 + t_m h_{1m})) \to q(x, w)'h_2(\overline S(x, w, \tau, \beta_0))
\]
as $m \to \infty$. Putting the two terms together gives
\[
\frac{\overline\Gamma(x, w, \tau, \theta_0 + t_m h_m) - \overline\Gamma(x, w, \tau, \theta_0)}{t_m} \to q(x, w)'\gamma_0'(\overline S(x, w, \tau, \beta_0))\, \overline T(x, w, \tau, \beta_0, h_1, 0) + q(x, w)'h_2(\overline S(x, w, \tau, \beta_0)) \equiv \overline\Gamma{}'_\theta(x, w, \tau, h).
\]
Thus $\overline\Gamma(x, w, \tau, \cdot)$ is HDD at $\theta_0$ tangentially to $\mathbb R^{d_W} \times C([\varepsilon, 1-\varepsilon], \mathbb R^{d_q})$.

Part 2: The lower bound is HDD. An analogous argument applies to the lower bound $\underline\Gamma(x, w, \tau, \cdot)$. This gives
\[
\frac{\underline\Gamma(x, w, \tau, \theta_0 + t_m h_m) - \underline\Gamma(x, w, \tau, \theta_0)}{t_m} \to q(x, w)'\gamma_0'(\underline S(x, w, \tau, \beta_0))\, \underline T(x, w, \tau, \beta_0, h_1, 0) + q(x, w)'h_2(\underline S(x, w, \tau, \beta_0)) \equiv \underline\Gamma{}'_\theta(x, w, \tau, h)
\]
where
\[
\underline S_0(x, w, \tau, \beta) = \max\left\{ \tau - \frac{c}{L(x, w'\beta)} \min\{\tau, 1 - \tau\},\ \frac{\tau - 1}{L(x, w'\beta)} + 1,\ \varepsilon \right\},
\qquad
\underline S(x, w, \tau, \beta) = \min\{\underline S_0(x, w, \tau, \beta), 1 - \varepsilon\},
\]
and
\[
\underline T_0(x, w, \tau, \beta_0, h, 0) = \lim_{m \to \infty} \frac{\underline S_0(x, w, \tau, \beta_0 + t_m h_m) - \underline S_0(x, w, \tau, \beta_0)}{t_m},
\qquad
\underline T(x, w, \tau, \beta_0, h, 0) = \lim_{m \to \infty} \frac{\underline S(x, w, \tau, \beta_0 + t_m h_m) - \underline S(x, w, \tau, \beta_0)}{t_m}.
\]
We give explicit expressions for these limits in appendix C.
Part 3: Apply the delta method. We have shown that $\overline\Gamma(x, w, \tau, \cdot)$ and $\underline\Gamma(x, w, \tau, \cdot)$ are HDD at $\theta_0$. Moreover, by A1 and A4–A6, lemma 1 gives $\sqrt n(\widehat\theta - \theta_0) \rightsquigarrow Z_1$. Thus the delta method for HDD functionals (theorem 2.1 in Fang and Santos 2019) gives
\[
\sqrt n \begin{pmatrix} \overline\Gamma(x, w, \tau, \widehat\theta) - \overline\Gamma(x, w, \tau, \theta_0) \\[1ex] \underline\Gamma(x, w, \tau, \widehat\theta) - \underline\Gamma(x, w, \tau, \theta_0) \end{pmatrix} \overset{d}{\longrightarrow} \begin{pmatrix} \overline\Gamma{}'_\theta(x, w, \tau, Z_1) \\[1ex] \underline\Gamma{}'_\theta(x, w, \tau, Z_1) \end{pmatrix} \equiv Z_2(x, w, \tau).
\]
This convergence is uniform in $x \in \{0, 1\}$.

Finally, the CQTE bounds are just differences of certain conditional quantile function bounds. Thus we immediately get
\[
\sqrt n \begin{pmatrix} \widehat{\overline{\mathrm{CQTE}}}{}^{\,c}(\tau \mid w) - \overline{\mathrm{CQTE}}{}^{\,c}_\varepsilon(\tau \mid w) \\[1ex] \widehat{\underline{\mathrm{CQTE}}}{}^{\,c}(\tau \mid w) - \underline{\mathrm{CQTE}}{}^{\,c}_\varepsilon(\tau \mid w) \end{pmatrix} \overset{d}{\longrightarrow} \begin{pmatrix} Z_2^{(1)}(1, w, \tau) - Z_2^{(2)}(0, w, \tau) \\[1ex] Z_2^{(2)}(1, w, \tau) - Z_2^{(1)}(0, w, \tau) \end{pmatrix} \equiv Z_{\mathrm{CQTE}}(w, \tau).
\]
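For intuition, the trimmed quantile shifts $\overline S$ and $\underline S$ used throughout this proof are simple scalar maps; the sketch below transcribes the definitions above, with `L` the treatment probability $L(x, w'\beta)$:

```python
def s_upper(tau, L, c, eps):
    """Trimmed upper quantile shift: max{min{tau + c*min(tau, 1-tau)/L,
    tau/L, 1-eps}, eps}, following the definitions in this proof."""
    s0 = min(tau + c * min(tau, 1.0 - tau) / L, tau / L, 1.0 - eps)
    return max(s0, eps)

def s_lower(tau, L, c, eps):
    """Trimmed lower quantile shift: min{max{tau - c*min(tau, 1-tau)/L,
    (tau - 1)/L + 1, eps}, 1-eps}."""
    s0 = max(tau - c * min(tau, 1.0 - tau) / L, (tau - 1.0) / L + 1.0, eps)
    return min(s0, 1.0 - eps)

# at c = 0 both shifts leave tau unchanged, so the bounds collapse
assert abs(s_upper(0.3, L=0.6, c=0.0, eps=0.01) - 0.3) < 1e-12
assert abs(s_lower(0.3, L=0.6, c=0.0, eps=0.01) - 0.3) < 1e-12
```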
B.2 A Useful Lemma

The following is a technical lemma that we will use a few times in the upcoming proofs.

Lemma 5 (Min and Max are Lipschitz). The following hold for any $(x_1, \ldots, x_n), (y_1, \ldots, y_n) \in \mathbb R^n$:
\[
|\min\{x_1, \ldots, x_n\} - \min\{y_1, \ldots, y_n\}| \leq \sum_{i=1}^n |x_i - y_i|
\qquad \text{and} \qquad
|\max\{x_1, \ldots, x_n\} - \max\{y_1, \ldots, y_n\}| \leq \sum_{i=1}^n |x_i - y_i|.
\]

Proof of lemma 5. We proceed by induction over $n \geq 1$. The inequalities trivially hold for $n = 1$. First, consider the minimum function and let $n = 2$. Consider the case where $x_1 \leq x_2$ and $y_1 \leq y_2$. Then
\[
|\min\{x_1, x_2\} - \min\{y_1, y_2\}| = |x_1 - y_1| \leq |x_1 - y_1| + |x_2 - y_2|.
\]
Now consider the case where $x_1 \leq x_2$ and $y_1 \geq y_2$. Then
\[
\min\{x_1, x_2\} - \min\{y_1, y_2\} = x_1 - y_2 \leq x_2 - y_2 \leq |x_1 - y_1| + |x_2 - y_2|
\]
and
\[
\min\{x_1, x_2\} - \min\{y_1, y_2\} = x_1 - y_2 \geq x_1 - y_1 \geq -|x_1 - y_1| - |x_2 - y_2|.
\]
Hence
\[
|\min\{x_1, x_2\} - \min\{y_1, y_2\}| \leq |x_1 - y_1| + |x_2 - y_2|.
\]
To exhaust all cases, we also consider the cases where ($x_1 \geq x_2$, $y_1 \geq y_2$) and where ($x_1 \geq x_2$, $y_1 \leq y_2$). By symmetry across cases, the Lipschitz inequality for the minimum holds when $n = 2$. Now suppose it holds for $n - 1$. Then
\begin{align*}
|\min\{x_1, \ldots, x_n\} - \min\{y_1, \ldots, y_n\}|
&= |\min\{\min\{x_1, \ldots, x_{n-1}\}, x_n\} - \min\{\min\{y_1, \ldots, y_{n-1}\}, y_n\}| \\
&\leq |\min\{x_1, \ldots, x_{n-1}\} - \min\{y_1, \ldots, y_{n-1}\}| + |x_n - y_n| \\
&\leq \sum_{i=1}^{n-1} |x_i - y_i| + |x_n - y_n| = \sum_{i=1}^n |x_i - y_i|.
\end{align*}
Therefore it holds for all $n \geq 1$. Noting that $\max\{x_1, \ldots, x_n\} = -\min\{-x_1, \ldots, -x_n\}$, this inequality applies to the maximum as well.

B.3 The CATE Bounds
Proof of proposition 1.
Part 1: The upper bound is HDD.
We first show that the mappingΓ ( x, w, · ) : R d W × (cid:96) ∞ ([ ε, − ε ] , R d q ) → R
48s HDD at θ tangentially to R d W × C ([ ε, − ε ] , R d q ). Recall its definition:Γ ( x, w, θ ) = (cid:90) Γ ( x, w, τ, θ ) dτ. We will use the dominated convergence theorem to show thatΓ (cid:48) ,θ ( x, w, h ) = (cid:90) Γ (cid:48) ,θ ( x, w, τ, h ) dτ. For δ > G δ = { γ ∈ G : (cid:107) γ − γ (cid:107) ∞ ≤ δ } and Θ δ = B δ × G δ . To show dominated convergence can be applied, we first show that the mapping Γ ( x, w, τ, θ ) isLipschitz in θ ∈ Θ δ for some δ >
0. To see this, let (cid:101) θ, θ ∈ Θ δ . Then | Γ ( x, w, τ, (cid:101) θ ) − Γ ( x, w, τ, θ ) | = (cid:12)(cid:12)(cid:12) q ( x, w ) (cid:48) (cid:16)(cid:101) γ ( S ( x, w, τ, (cid:101) β )) − γ ( S ( x, w, τ, (cid:101) β )) (cid:17) + q ( x, w ) (cid:48) (cid:16) γ ( S ( x, w, τ, (cid:101) β )) − γ ( S ( x, w, τ, β )) (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:107) q ( x, w ) (cid:107) · (cid:107) (cid:101) γ − γ (cid:107) ∞ + (cid:107) q ( x, w ) (cid:48) γ (cid:48) ( ¯ S ) (cid:107) · (cid:107) S ( x, w, τ, (cid:101) β ) − S ( x, w, τ, β ) (cid:107) . The last line follows by a Taylor expansion, where ¯ S is on the line segment connecting S ( x, w, τ, (cid:101) β )and S ( x, w, τ, β ). By γ ∈ C Bm,ν ([ ε, − ε ]) d q for m ≥ (cid:107) γ (cid:48) (cid:107) ∞ ≤ B . Hence (cid:107) q ( x, w ) (cid:48) γ (cid:48) ( ¯ S )) (cid:107) ≤ (cid:107) q ( x, w ) (cid:107) · (cid:107) γ (cid:48) ( ¯ S ) (cid:107) ≤ (cid:107) q ( x, w ) (cid:107) B. Next, | S ( x, w, τ, (cid:101) β ) − S ( x, w, τ, β ) | = | max { S ( x, w, τ, (cid:101) β ) , ε } − max { S ( x, w, τ, β ) , ε }|≤ | S ( x, w, τ, (cid:101) β ) − S ( x, w, τ, β ) |≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) τ + c min { τ, − τ } L ( x, w (cid:48) (cid:101) β ) − (cid:18) τ + c min { τ, − τ } L ( x, w (cid:48) β ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) τL ( x, w (cid:48) (cid:101) β ) − τL ( x, w (cid:48) β ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = ( τ + c min { τ, − τ } ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) L ( x, w (cid:48) (cid:101) β ) − L ( x, w (cid:48) β ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ( τ + c min { τ, − τ } ) sup β ∈B δ (cid:13)(cid:13)(cid:13)(cid:13) L β ( x, w (cid:48) β ) L ( x, w (cid:48) β ) (cid:13)(cid:13)(cid:13)(cid:13) (cid:107) (cid:101) β − β (cid:107) . The second and third lines follow from lemma 5. The last line follows by a Taylor expansion. Tosee that sup β ∈B δ (cid:13)(cid:13) L β ( x, w (cid:48) β ) /L ( x, w (cid:48) β ) (cid:13)(cid:13) < ∞ , writesup β ∈B δ (cid:13)(cid:13)(cid:13)(cid:13) L β ( x, w (cid:48) β ) L ( x, w (cid:48) β ) (cid:13)(cid:13)(cid:13)(cid:13) = sup β ∈B δ (cid:13)(cid:13)(cid:13)(cid:13) (2 x − F (cid:48) ( w (cid:48) β ) wxF ( w (cid:48) β ) + (1 − x )(1 − F ( w (cid:48) β )) (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:107) w (cid:107) sup a ∈ R | F (cid:48) ( a ) | xF (inf β ∈ β w (cid:48) β ) + (1 − x )(1 − F (sup β ∈ β w (cid:48) β )) < ∞ . { w (cid:48) β : β ∈ B δ } is bounded for fixed w , and since F (cid:48) ( a ) is uniformlybounded by assumption A4.3. Thus | Γ ( x, w, τ, (cid:101) θ ) − Γ ( x, w, τ, θ ) |≤ (cid:107) q ( x, w ) (cid:107) · (cid:107) (cid:101) γ − γ (cid:107) ∞ + (cid:107) q ( x, w ) (cid:107) B ( τ + c min { τ, − τ } ) sup β ∈B δ (cid:13)(cid:13)(cid:13)(cid:13) L β ( x, w (cid:48) β ) L ( x, w (cid:48) β ) (cid:13)(cid:13)(cid:13)(cid:13) (cid:107) (cid:101) β − β (cid:107) . Hence Γ ( x, w, τ, θ ) is Lipschitz in θ . 
Therefore,
\[ \frac{\Gamma_1^{(1)}(x,w,\tau,\theta_0 + t_m h_m) - \Gamma_1^{(1)}(x,w,\tau,\theta_0)}{t_m} \]
is dominated by
\[
\|q(x,w)\| \cdot \|h_{2m}\|_\infty + \|q(x,w)\| B (\tau + c\min\{\tau,1-\tau\}) \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| \, \|h_{1m}\|
\le \|q(x,w)\| (\|h_2\|_\infty + \lambda) + \|q(x,w)\| B (\tau + c\min\{\tau,1-\tau\}) \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| (\|h_1\| + \lambda) < \infty.
\]
In the second line, $\lambda > 0$ is a fixed constant; the inequality holds for $m$ sufficiently large, since $h_m$ converges to $h$. Moreover, note that this dominating function is integrable over $\tau \in (0,1)$, and the integrand converges pointwise (in $\tau$) to $\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$. Hence, by the dominated convergence theorem,
\[
\frac{\Gamma_2^{(1)}(x,w,\theta_0 + t_m h_m) - \Gamma_2^{(1)}(x,w,\theta_0)}{t_m}
= \int_0^1 \frac{\Gamma_1^{(1)}(x,w,\tau,\theta_0 + t_m h_m) - \Gamma_1^{(1)}(x,w,\tau,\theta_0)}{t_m}\, d\tau
\to \int_0^1 \Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)\, d\tau
= q(x,w)' \int_0^1 h_2(S^{(1)}(x,w,\tau,\beta_0))\, d\tau + q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta_0))\, \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0)\, d\tau
\equiv \Gamma_{2,\theta_0}'^{(1)}(x,w,h).
\]
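Before turning to the lower bound, here is a minimal numerical sketch (not the paper's code) of what HDD delivers: the difference quotient of $\Gamma_2^{(1)}(x,w,\theta)$ along a fixed direction $h = (h_1, h_2)$ settles down as $t \to 0$. The logistic link, scalar covariate, $q(x,w) = 1$, and $\gamma_0(u) = u^2$ are all illustrative assumptions.

```python
import numpy as np

c, eps = 0.1, 0.05
F = lambda a: 1.0 / (1.0 + np.exp(-a))  # assumed logistic link

def S1(x, w, tau, beta):
    # Upper-bound rank: min of the c-dependence bound, the no-assumptions
    # bound, and 1 - eps, floored at eps (as in the proof above).
    p = F(w * beta) if x == 1 else 1.0 - F(w * beta)
    s = np.minimum(np.minimum(tau + c * np.minimum(tau, 1 - tau) / p, tau / p),
                   1 - eps)
    return np.maximum(s, eps)

def Gamma2(x, w, beta, gamma, n_tau=200_000):
    tau = (np.arange(n_tau) + 0.5) / n_tau   # midpoint rule on (0, 1)
    return np.mean(gamma(S1(x, w, tau, beta)))  # q(x, w) = 1 here

x, w, beta0 = 1, 0.5, 0.3
gamma0 = lambda u: u ** 2
h1, h2 = 1.0, np.sin                          # directions for (beta, gamma)

base = Gamma2(x, w, beta0, gamma0)
for t in [1e-2, 1e-3, 1e-4]:
    pert = Gamma2(x, w, beta0 + t * h1, lambda u: gamma0(u) + t * h2(u))
    print(t, (pert - base) / t)               # quotients stabilize at the HDD
```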
Part 2: The lower bound is HDD.

We can similarly show that
\[
\frac{\Gamma_2^{(2)}(x,w,\theta_0 + t_m h_m) - \Gamma_2^{(2)}(x,w,\theta_0)}{t_m}
\to q(x,w)' \int_0^1 h_2(S^{(2)}(x,w,\tau,\beta_0))\, d\tau + q(x,w)' \int_0^1 \gamma_0'(S^{(2)}(x,w,\tau,\beta_0))\, \bar T^{(2)}(x,w,\tau,\beta_0,h_1,0)\, d\tau
\equiv \Gamma_{2,\theta_0}'^{(2)}(x,w,h).
\]
Part 3: Apply the delta method.

The functional delta method for HDD functionals now implies that, uniformly in $x \in \{0,1\}$,
\[
\sqrt{n}\begin{pmatrix} \Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0) \\ \Gamma_2^{(2)}(x,w,\widehat\theta) - \Gamma_2^{(2)}(x,w,\theta_0) \end{pmatrix}
\xrightarrow{d} \begin{pmatrix} \Gamma_{2,\theta_0}'^{(1)}(x,w,Z_1) \\ \Gamma_{2,\theta_0}'^{(2)}(x,w,Z_1) \end{pmatrix} \equiv Z_3(x,w)
\]
and hence
\[
\sqrt{n}\begin{pmatrix} \widehat{\overline{\mathrm{CATE}}}{}^c(w) - \overline{\mathrm{CATE}}{}^c_\varepsilon(w) \\ \widehat{\underline{\mathrm{CATE}}}{}^c(w) - \underline{\mathrm{CATE}}{}^c_\varepsilon(w) \end{pmatrix}
\xrightarrow{d} \begin{pmatrix} Z_3^{(1)}(1,w) - Z_3^{(2)}(0,w) \\ Z_3^{(2)}(1,w) - Z_3^{(1)}(0,w) \end{pmatrix} \equiv Z_{\mathrm{CATE}}(w).
\]
B.4 The ATE Bounds

Proof of theorem 1.
Part 1: The expectation upper bound. Write
\[
\sqrt{n}(\widehat{\overline{E}}{}^c_x - \overline{E}{}^c_{x,\varepsilon})
= \sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n \Gamma_2^{(1)}(x,W_i,\widehat\theta) - \int_{\mathcal{W}} \Gamma_2^{(1)}(x,w,\theta_0)\, dF_W(w) \Big)
= \sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n \big(\Gamma_2^{(1)}(x,W_i,\widehat\theta) - \Gamma_2^{(1)}(x,W_i,\theta_0)\big) - \int_{\mathcal{W}} \big(\Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0)\big)\, dF_W(w) \Big)
+ \frac{1}{\sqrt{n}}\sum_{i=1}^n \big( \Gamma_2^{(1)}(x,W_i,\theta_0) - E[\Gamma_2^{(1)}(x,W,\theta_0)] \big)
+ \sqrt{n}\big( \Gamma_3^{(1)}(x,\widehat\theta) - \Gamma_3^{(1)}(x,\theta_0) \big),
\]
where $\Gamma_3^{(1)}(x,\theta) = \int_{\mathcal{W}} \Gamma_2^{(1)}(x,w,\theta)\, dF_W(w)$. There are three terms here. We'll show that the first is $o_p(1)$ and that the second and third contribute to the asymptotic distribution.

Step 1.
We'll begin by showing that the first term is $o_p(1)$. For some $\delta > 0$, consider the class of functions
\[ \mathcal{F} = \{ \Gamma_2^{(1)}(x,w,\theta) : \theta \in \Theta_\delta \}. \]
As in the proof of proposition 1, we will show that $\Gamma_2^{(1)}(x,w,\theta)$ is Lipschitz in $\theta$. Let $\tilde\theta, \theta \in \Theta_\delta$. Then
\[
|\Gamma_2^{(1)}(x,w,\tilde\theta) - \Gamma_2^{(1)}(x,w,\theta)|
= \Big| \int_0^1 \Gamma_1^{(1)}(x,w,\tau,\tilde\theta)\, d\tau - \int_0^1 \Gamma_1^{(1)}(x,w,\tau,\theta)\, d\tau \Big|
\le \|q(x,w)\| \int_0^1 \|\tilde\gamma - \gamma\|_\infty\, d\tau + \|q(x,w)\| B \int_0^1 (\tau + c\min\{\tau,1-\tau\})\, d\tau\, \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| \, \|\tilde\beta - \beta\|
= \|q(x,w)\| \Big( 1 + B\frac{2+c}{4} \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| \Big) \big( \|\tilde\gamma - \gamma\|_\infty + \|\tilde\beta - \beta\| \big)
\equiv K(w)\, \|\tilde\theta - \theta\|_\Theta.
\]
The second line follows by our derivations in the proof of proposition 1. In the last line we let $\|\theta\|_\Theta = \|\beta\| + \|\gamma\|_\infty$ and defined
\[ K(w) = \|q(x,w)\| \Big( 1 + B\frac{2+c}{4} \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| \Big). \]
Assumption A3 says that
\[ E\Big( \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2}\Big\|^4 \Big) < \infty \quad \text{and} \quad E(\|q(x,W)\|^4) < \infty. \]
These assumptions imply that $K(W)$ has a bounded second moment:
\[
E[K(W)^2] = E\Big[ \|q(x,W)\|^2 \Big( 1 + B\frac{2+c}{4} \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2}\Big\| \Big)^2 \Big]
\le E\big[\|q(x,W)\|^4\big]^{1/2} \times E\Big[ \Big( 1 + B\frac{2+c}{4} \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2}\Big\| \Big)^4 \Big]^{1/2} < \infty,
\]
where the second line follows by the Cauchy-Schwarz inequality. Thus $\Gamma_2^{(1)}(x,w,\theta)$ is Lipschitz in $\theta$. This lets us apply theorem 2.7.11 in van der Vaart and Wellner (1996) to see that the bracketing number $N_{[\,]}(2\epsilon E[K(W)^2]^{1/2}, \mathcal{F}, L_2(P))$ is bounded above by
\[ N(\epsilon, \mathcal{B}_\delta \times \mathcal{G}_\delta, \|\cdot\|_\Theta) \le N(\epsilon, \mathcal{B}_\delta, \|\cdot\|) + N(\epsilon, \mathcal{G}_\delta, \|\cdot\|_\infty). \]
By example 19.7 of van der Vaart (2000), $N(\epsilon, \mathcal{B}_\delta, \|\cdot\|) \lesssim \epsilon^{-d_W}$ for small enough $\epsilon$. By theorem 2.7.1 in van der Vaart and Wellner (1996), $N(\epsilon, \mathcal{G}_\delta, \|\cdot\|_\infty) \lesssim \exp(\epsilon^{-1/(m+\nu)})$. So
\[ \int_0^\delta \sqrt{\log N_{[\,]}(2\epsilon E[K(W)^2]^{1/2}, \mathcal{F}, L_2(P))}\, d\epsilon < \infty. \]
Hence $\mathcal{F}$ is Donsker. By convergence of $\widehat\theta$ to $\theta_0$ (lemma 1) and the Lipschitz property of $\Gamma_2^{(1)}$, we have
\[ \int_{\mathcal{W}} \big| \Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0) \big|^2\, dF_W(w) \le E[K(W)^2] \cdot \|\widehat\theta - \theta_0\|_\Theta^2 = o_p(1). \]
Therefore, by lemma 19.24 in van der Vaart (2000),
\[ \sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n \big(\Gamma_2^{(1)}(x,W_i,\widehat\theta) - \Gamma_2^{(1)}(x,W_i,\theta_0)\big) - \int_{\mathcal{W}} \big(\Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0)\big)\, dF_W(w) \Big) = o_p(1). \]

Step 2.
Next consider the second term. First note that
\[
E[\Gamma_2^{(1)}(x,W,\theta_0)^2]
= E\Big[ \Big| \int_0^1 q(x,W)'\gamma_0(S^{(1)}(x,W,\tau,\beta_0))\, d\tau \Big|^2 \Big]
\le E\Big[ \Big( \int_0^1 \|q(x,W)\| \cdot \|\gamma_0(S^{(1)}(x,W,\tau,\beta_0))\|\, d\tau \Big)^2 \Big]
\le E\Big[ \Big( \int_0^1 \|q(x,W)\| \sup_{u \in [\varepsilon, 1-\varepsilon]} \|\gamma_0(u)\|\, d\tau \Big)^2 \Big]
\le E(\|q(x,W)\|^2) B^2
< \infty.
\]
The first line follows by definition of $\Gamma_2^{(1)}$. The second line follows by the Cauchy-Schwarz inequality. The third line follows from the fact that $S^{(1)}$ lies between $\varepsilon$ and $1-\varepsilon$. The fourth line follows by A2. The last line follows by A6.4.

This result lets us apply a CLT to the second term. Combining that with the fact that the influence functions for the two first step estimators are Donsker (lemmas 3 and 4), we get
\[
\begin{pmatrix} \sqrt{n}(\widehat\theta - \theta_0) \\ \dfrac{1}{\sqrt{n}}\sum_{i=1}^n \big( \Gamma_2(x,W_i,\theta_0) - E[\Gamma_2(x,W,\theta_0)] \big) \end{pmatrix}
\rightsquigarrow \begin{pmatrix} Z_1 \\ \tilde Z_4(x) \end{pmatrix},
\]
a mean-zero Gaussian process in $\mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R}^2$. Notice here we use $\Gamma_2 = (\Gamma_2^{(1)}, \Gamma_2^{(2)})$, not just $\Gamma_2^{(1)}$, as preparation for parts 2 and 3.

Step 3.
Next, consider the third term. For this step we'll show that the mapping $\Gamma_3^{(1)}(x,\theta)$ is HDD at $\theta_0$ tangentially to $\mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$. This will let us apply the delta method for HDD functionals in the last step. To see that this functional is HDD, note that
\[ \frac{\Gamma_3^{(1)}(x,\theta_0 + t_m h_m) - \Gamma_3^{(1)}(x,\theta_0)}{t_m} = \int_{\mathcal{W}} \int_0^1 \frac{\Gamma_1^{(1)}(x,w,\tau,\theta_0 + t_m h_m) - \Gamma_1^{(1)}(x,w,\tau,\theta_0)}{t_m}\, d\tau\, dF_W(w). \]
By proposition 1, for $m$ large enough, the integrand is dominated by
\[ \|q(x,w)\| (\|h_2\|_\infty + \lambda) + \|q(x,w)\| B (\tau + c\min\{\tau,1-\tau\}) \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,w'\beta)}{L(x,w'\beta)^2}\Big\| (\|h_1\| + \lambda). \]
This expression has finite integral over $(\tau,w) \in (0,1) \times \mathcal{W}$ because $\|h\|_\Theta < \infty$, $E(\|q(x,W)\|^2) < \infty$, and
\[ E\Big( \sup_{\beta\in\mathcal{B}_\delta}\Big\|\frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2}\Big\|^2 \Big) < \infty. \]
So we can apply dominated convergence to see that $\Gamma_3^{(1)}$ is HDD:
\[
\frac{\Gamma_3^{(1)}(x,\theta_0 + t_m h_m) - \Gamma_3^{(1)}(x,\theta_0)}{t_m}
\to \int_{\mathcal{W}} \Big[ q(x,w)' \int_0^1 h_2(S^{(1)}(x,w,\tau,\beta_0))\, d\tau + q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta_0))\, \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0)\, d\tau \Big] dF_W(w)
\equiv \Gamma_{3,\theta_0}'^{(1)}(x,h).
\]

Step 4.
Finally, putting all the previous steps together and applying the delta method for HDD functionals gives that, uniformly in $x \in \{0,1\}$,
\[
\sqrt{n}(\widehat{\overline{E}}{}^c_x - \overline{E}{}^c_{x,\varepsilon})
= o_p(1) + \frac{1}{\sqrt{n}}\sum_{i=1}^n \big( \Gamma_2^{(1)}(x,W_i,\theta_0) - E[\Gamma_2^{(1)}(x,W,\theta_0)] \big) + \sqrt{n}\big( \Gamma_3^{(1)}(x,\widehat\theta) - \Gamma_3^{(1)}(x,\theta_0) \big)
\xrightarrow{d} \tilde Z_4^{(1)}(x) + \Gamma_{3,\theta_0}'^{(1)}(x, Z_1) \equiv Z_4^{(1)}(x).
\]

Part 2: The expectation lower bound.
An identical argument can be applied to show that
\[ \sqrt{n}(\widehat{\underline{E}}{}^c_x - \underline{E}{}^c_{x,\varepsilon}) \xrightarrow{d} \tilde Z_4^{(2)}(x) + \Gamma_{3,\theta_0}'^{(2)}(x, Z_1) \equiv Z_4^{(2)}(x), \]
where
\[ \Gamma_{3,\theta_0}'^{(2)}(x,h) = \int_{\mathcal{W}} \Gamma_{2,\theta_0}'^{(2)}(x,w,h)\, dF_W(w). \]

Part 3: Putting them together.
Finally, the analysis in parts 1 and 2 can be combined to obtain joint convergence:
\[
\sqrt{n}\begin{pmatrix} \widehat{\overline{\mathrm{ATE}}}{}^c - \overline{\mathrm{ATE}}{}^c_\varepsilon \\ \widehat{\underline{\mathrm{ATE}}}{}^c - \underline{\mathrm{ATE}}{}^c_\varepsilon \end{pmatrix}
\xrightarrow{d} \begin{pmatrix} Z_4^{(1)}(1) - Z_4^{(2)}(0) \\ Z_4^{(2)}(1) - Z_4^{(1)}(0) \end{pmatrix} \equiv Z_{\mathrm{ATE}}.
\]
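For concreteness, here is a sketch of the plug-in bound estimator whose asymptotics theorem 1 describes: average over the $W_i$ of the $\tau$-integral of $q(x,W_i)'\widehat\gamma$ evaluated at the shifted rank $S^{(1)}$. The stand-in first-step estimates (logit coefficients `beta_hat` and a toy rank-to-coefficient map `gamma_hat`) are assumptions for illustration, not the paper's actual estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, eps = 500, 0.1, 0.05
W = np.column_stack([np.ones(n), rng.normal(size=n)])  # covariates incl. constant
beta_hat = np.array([0.2, 0.8])                         # stand-in logit fit
gamma_hat = lambda u: np.column_stack([np.ones_like(u), u])  # toy d_q = 2 map

F = lambda a: 1.0 / (1.0 + np.exp(-a))

def S1(x, w_beta, tau):
    # Upper-bound rank, floored at eps, as in the preceding proofs.
    p = F(w_beta) if x == 1 else 1.0 - F(w_beta)
    s = np.minimum(np.minimum(tau + c * np.minimum(tau, 1 - tau) / p, tau / p),
                   1 - eps)
    return np.maximum(s, eps)

def E_upper(x):
    # (1/n) sum_i Gamma_2^(1)(x, W_i, theta_hat), integral by midpoint rule
    tau = (np.arange(2000) + 0.5) / 2000
    q = W                                               # q(x, w) = w (assumption)
    total = 0.0
    for i in range(n):
        g = gamma_hat(S1(x, W[i] @ beta_hat, tau))      # (2000, d_q)
        total += np.mean(g @ q[i])                      # integral over tau
    return total / n

print("upper bound component for E[Y(1)]:", E_upper(1))
```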
B.5 The ATT Bounds

Proof of proposition 2.
Note that in this proof we'll use several of the results we derived in the proof of proposition 1. By $\mathrm{var}(Y \mathbb{1}(X=x)) < \infty$ and $\mathrm{var}(\mathbb{1}(X=x)) < \infty$, we have that
\[
\sqrt{n}\big( \widehat E(Y \mid X=x) - E(Y \mid X=x) \big)
= \frac{1}{p_x}\sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n Y_i \mathbb{1}(X_i = x) - E(Y\mathbb{1}(X=x)) \Big)
- \frac{E(Y \mid X=x)}{p_x}\sqrt{n}\Big( \frac{1}{n}\sum_{i=1}^n \mathbb{1}(X_i = x) - p_x \Big) + o_p(1)
\xrightarrow{d} Z_{E(Y|X=x)},
\]
a mean-zero Gaussian variable. Also, $\sqrt{n}(\widehat p_x - p_x) \xrightarrow{d} Z_{p_x}$, another mean-zero Gaussian variable. As before the influence functions are Donsker and hence we can stack them to obtain
\[
\begin{pmatrix} \sqrt{n}(\widehat\theta - \theta_0) \\ \dfrac{1}{\sqrt{n}}\sum_{i=1}^n \big(\Gamma_2(x,W_i,\theta_0) - E[\Gamma_2(x,W,\theta_0)]\big) \\ \sqrt{n}\big(\widehat E(Y \mid X=x) - E(Y \mid X=x)\big) \\ \sqrt{n}(\widehat p_x - p_x) \end{pmatrix}
\rightsquigarrow \begin{pmatrix} Z_1 \\ \tilde Z_4(x) \\ Z_{E(Y|X=x)} \\ Z_{p_x} \end{pmatrix},
\]
a mean-zero Gaussian process in $\mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R}^4$, for $x \in \{0,1\}$ and whose covariance kernel can be calculated as in the proof of theorem 1. By the delta method,
\[
\sqrt{n}\begin{pmatrix} \widehat{\overline{\mathrm{ATT}}}{}^c - \overline{\mathrm{ATT}}{}^c_\varepsilon \\ \widehat{\underline{\mathrm{ATT}}}{}^c - \underline{\mathrm{ATT}}{}^c_\varepsilon \end{pmatrix}
\xrightarrow{d}
\begin{pmatrix}
Z_{E(Y|X=1)} - \dfrac{Z_4^{(2)}(0)}{p_1} + \dfrac{p_0}{p_1} Z_{E(Y|X=0)} + \dfrac{E(Y \mid X=0)}{p_1} Z_{p_0} + \dfrac{\underline{E}{}^c_{0,\varepsilon} - E(Y \mid X=0)p_0}{p_1^2} Z_{p_1} \\
Z_{E(Y|X=1)} - \dfrac{Z_4^{(1)}(0)}{p_1} + \dfrac{p_0}{p_1} Z_{E(Y|X=0)} + \dfrac{E(Y \mid X=0)}{p_1} Z_{p_0} + \dfrac{\overline{E}{}^c_{0,\varepsilon} - E(Y \mid X=0)p_0}{p_1^2} Z_{p_1}
\end{pmatrix}
\equiv Z_{\mathrm{ATT}},
\]
a random vector in $\mathbb{R}^2$.

C Estimating the Analytical Hadamard Directional Derivatives
In this section we give formulas for our estimators of the analytical Hadamard directional derivatives (HDD) used in the CQTE, CATE, ATE, and ATT functionals. In these estimators we use a tuning parameter $\kappa_n \ge 0$. This parameter acts as a slackness value, which lets us estimate when certain equalities hold in the population, but which may not hold exactly in finite samples. We assume $\kappa_n \to 0$ and $\sqrt{n}\kappa_n \to \infty$ as $n \to \infty$.

We estimate $\Gamma_{1,\theta_0}'(x,w,\tau,h)$ by $\widehat\Gamma_{1,\theta_0}'(x,w,\tau,h)$, where
\[
\widehat\Gamma_{1,\theta_0}'(x,w,\tau,h) = \begin{pmatrix}
q(x,w)' h_2(S^{(1)}(x,w,\tau,\widehat\beta)) + q(x,w)' \widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta)) \cdot \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \\
q(x,w)' h_2(S^{(2)}(x,w,\tau,\widehat\beta)) + q(x,w)' \widehat\gamma'(S^{(2)}(x,w,\tau,\widehat\beta)) \cdot \bar T^{(2)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)
\end{pmatrix}.
\]
Note that $q(x,w)'$ refers to the transpose of $q(x,w)$ while $\widehat\gamma'(\cdot)$ refers to our estimator of the derivative of $\gamma_0(\cdot)$, defined in equation (12) on page 39. We defined the functions $S^{(1)}$ and $S^{(2)}$ in the proof of proposition 5. We define $\bar T^{(1)}$ and $\bar T^{(2)}$ below. Estimate $\Gamma_{2,\theta_0}'(x,w,h)$ by
\[ \widehat\Gamma_{2,\theta_0}'(x,w,h) = \begin{pmatrix} \int_0^1 \widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)\, d\tau \\ \int_0^1 \widehat\Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h)\, d\tau \end{pmatrix}. \]
Estimate $\Gamma_{3,\theta_0}'(x,h)$ by
\[ \widehat\Gamma_{3,\theta_0}'(x,h) = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^n \widehat\Gamma_{2,\theta_0}'^{(1)}(x,W_i,h) \\ \frac{1}{n}\sum_{i=1}^n \widehat\Gamma_{2,\theta_0}'^{(2)}(x,W_i,h) \end{pmatrix}. \]
Throughout the paper we will also use the vector notation $\widehat\Gamma_{j,\theta_0}' = (\widehat\Gamma_{j,\theta_0}'^{(1)}, \widehat\Gamma_{j,\theta_0}'^{(2)})$ and $\Gamma_{j,\theta_0}' = (\Gamma_{j,\theta_0}'^{(1)}, \Gamma_{j,\theta_0}'^{(2)})$ for $j = 1, 2, 3$.
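Schematically, the three estimators are nested: $\widehat\Gamma_1'$ is pointwise in $(\tau, w)$, $\widehat\Gamma_2'$ integrates it over $\tau$, and $\widehat\Gamma_3'$ averages over the sample. A minimal Python sketch of this nesting, where `gamma1_prime` is a placeholder for the formula displayed above:

```python
import numpy as np

def gamma2_prime(gamma1_prime, x, w, h, n_tau=1000):
    # Integrate the pointwise HDD estimator over tau by the midpoint rule.
    tau = (np.arange(n_tau) + 0.5) / n_tau
    return np.mean([gamma1_prime(x, w, t, h) for t in tau], axis=0)

def gamma3_prime(gamma1_prime, x, W_sample, h):
    # Average the tau-integrated estimator over the observed W_i.
    return np.mean([gamma2_prime(gamma1_prime, x, w, h) for w in W_sample],
                   axis=0)
```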
Next we define the functions $T^{(1)}$ and $\bar T^{(1)}$. To do this, we'll also define functions $T^{(2)}$ and $\bar T^{(2)}$.

The function $T^{(1)}$

We first define $T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$. We do this by splitting it into seven different mutually exclusive cases. This lets us write
\[ T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) = \sum_{j=1}^{7} T^{(1)}_j(x,w,\tau,\beta,h_1) \cdot \mathbb{1}^{(1)}_j(x,w,\tau,\beta,\kappa_n). \qquad (13) \]
For these cases, it is helpful to recall our notation $L(x,w'\beta) = F(w'\beta)^x (1 - F(w'\beta))^{1-x}$. These seven cases are defined as follows:

1. Let
\[ \mathbb{1}^{(1)}_1(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} < \min\Big\{ \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big) \]
and
\[ T^{(1)}_1(x,w,\tau,\beta,h_1) = -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}. \]
2. Let
\[ \mathbb{1}^{(1)}_2(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \frac{\tau}{L(x,w'\beta)} < \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big) \]
and
\[ T^{(1)}_2(x,w,\tau,\beta,h_1) = -\tau\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}. \]

3. Let
\[ \mathbb{1}^{(1)}_3(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( 1-\varepsilon < \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)} \Big\} - \kappa_n \Big) \]
and $T^{(1)}_3(x,w,\tau,\beta,h_1) = 0$.

4. Let
\[ \mathbb{1}^{(1)}_4(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \frac{\tau}{L(x,w'\beta)} \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \max\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)} \Big\} < 1-\varepsilon - \kappa_n \Big) \]
and
\[ T^{(1)}_4(x,w,\tau,\beta,h_1) = \min\Big\{ -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; -\tau\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2} \Big\}. \]

5. Let
\[ \mathbb{1}^{(1)}_5(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - (1-\varepsilon) \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \max\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, 1-\varepsilon \Big\} < \frac{\tau}{L(x,w'\beta)} - \kappa_n \Big) \]
and
\[ T^{(1)}_5(x,w,\tau,\beta,h_1) = \min\Big\{ -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

6. Let
\[ \mathbb{1}^{(1)}_6(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \frac{\tau}{L(x,w'\beta)} - (1-\varepsilon) \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \max\Big\{ \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} < \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \kappa_n \Big) \]
and
\[ T^{(1)}_6(x,w,\tau,\beta,h_1) = \min\Big\{ -\tau\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

7. Let $\mathbb{1}^{(1)}_7(x,w,\tau,\beta,\kappa_n)$ equal 1 if at least two of the three following hold:
\[ \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - (1-\varepsilon) \Big| \le \kappa_n, \qquad \Big| \frac{\tau}{L(x,w'\beta)} - (1-\varepsilon) \Big| \le \kappa_n, \qquad \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \frac{\tau}{L(x,w'\beta)} \Big| \le \kappa_n, \]
and zero otherwise. Let
\[ T^{(1)}_7(x,w,\tau,\beta,h_1) = \min\Big\{ -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; -\tau\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

The function $\bar T^{(1)}$

Next we define $\bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ in terms of $T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ and two indicator functions. Recall from the proof of proposition 5 that
\[ S^{(1)}(x,w,\tau,\beta) = \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\}. \]
Then
\[ \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) = T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) \cdot \mathbb{1}\big( S^{(1)}(x,w,\tau,\beta) > \varepsilon + \kappa_n \big) + \max\{ T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n), 0 \} \cdot \mathbb{1}\big( |S^{(1)}(x,w,\tau,\beta) - \varepsilon| \le \kappa_n \big). \]

The function $T^{(2)}$

Next we define $T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n)$. Again we split this into seven cases, which lets us write this function as
\[ T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n) = \sum_{j=1}^{7} T^{(2)}_j(x,w,\tau,\beta,h_1) \cdot \mathbb{1}^{(2)}_j(x,w,\tau,\beta,\kappa_n), \qquad (14) \]
where the seven cases are defined as follows:

1. Let
\[ \mathbb{1}^{(2)}_1(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} > \max\Big\{ \frac{\tau-1}{L(x,w'\beta)} + 1, \varepsilon \Big\} + \kappa_n \Big) \]
and
\[ T^{(2)}_1(x,w,\tau,\beta,h_1) = c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}. \]
2. Let
\[ \mathbb{1}^{(2)}_2(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \frac{\tau-1}{L(x,w'\beta)} + 1 > \max\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \varepsilon \Big\} + \kappa_n \Big) \]
and
\[ T^{(2)}_2(x,w,\tau,\beta,h_1) = (1-\tau)\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}. \]

3. Let
\[ \mathbb{1}^{(2)}_3(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \varepsilon > \max\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau-1}{L(x,w'\beta)} + 1 \Big\} + \kappa_n \Big) \]
and $T^{(2)}_3(x,w,\tau,\beta,h_1) = 0$.

4. Let
\[ \mathbb{1}^{(2)}_4(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \Big( \frac{\tau-1}{L(x,w'\beta)} + 1 \Big) \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \min\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau-1}{L(x,w'\beta)} + 1 \Big\} > \varepsilon + \kappa_n \Big) \]
and
\[ T^{(2)}_4(x,w,\tau,\beta,h_1) = \max\Big\{ c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; (1-\tau)\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2} \Big\}. \]

5. Let
\[ \mathbb{1}^{(2)}_5(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \varepsilon \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \min\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \varepsilon \Big\} > \frac{\tau-1}{L(x,w'\beta)} + 1 + \kappa_n \Big) \]
and
\[ T^{(2)}_5(x,w,\tau,\beta,h_1) = \max\Big\{ c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

6. Let
\[ \mathbb{1}^{(2)}_6(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \Big| \frac{\tau-1}{L(x,w'\beta)} + 1 - \varepsilon \Big| \le \kappa_n \Big) \cdot \mathbb{1}\Big( \min\Big\{ \frac{\tau-1}{L(x,w'\beta)} + 1, \varepsilon \Big\} > \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} + \kappa_n \Big) \]
and
\[ T^{(2)}_6(x,w,\tau,\beta,h_1) = \max\Big\{ (1-\tau)\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

7. Let $\mathbb{1}^{(2)}_7(x,w,\tau,\beta,\kappa_n)$ equal 1 if at least two of the three following hold:
\[ \Big| \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \Big( \frac{\tau-1}{L(x,w'\beta)} + 1 \Big) \Big| \le \kappa_n, \qquad \Big| \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \varepsilon \Big| \le \kappa_n, \qquad \Big| \frac{\tau-1}{L(x,w'\beta)} + 1 - \varepsilon \Big| \le \kappa_n, \]
and zero otherwise. Let
\[ T^{(2)}_7(x,w,\tau,\beta,h_1) = \max\Big\{ c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; (1-\tau)\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \; 0 \Big\}. \]

The function $\bar T^{(2)}$

Finally we define $\bar T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n)$ in terms of $T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n)$ and two indicator functions. Recall from the proof of proposition 5 that
\[ S^{(2)}(x,w,\tau,\beta) = \max\Big\{ \tau - \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau-1}{L(x,w'\beta)} + 1, \varepsilon \Big\}. \]
Then
\[ \bar T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n) = T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n) \cdot \mathbb{1}\big( S^{(2)}(x,w,\tau,\beta) < 1-\varepsilon - \kappa_n \big) + \min\{ T^{(2)}(x,w,\tau,\beta,h_1,\kappa_n), 0 \} \cdot \mathbb{1}\big( |S^{(2)}(x,w,\tau,\beta) - (1-\varepsilon)| \le \kappa_n \big). \]
The sketch below illustrates this case selection.
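The seven mutually exclusive cases implement a $\kappa_n$-approximate argmin selection: when one branch of the min defining $S^{(1)}$ is strictly smallest by margin $\kappa_n$, its derivative is used; when branches tie within $\kappa_n$, the minimum of the tied derivatives is used. A compact sketch under that reading (the pairwise $\kappa_n$ comparisons above are replaced by a $\kappa_n$ margin around the minimum, which matches the seven cases up to boundary configurations; all inputs are placeholders consistent with the definitions above):

```python
import numpy as np

def T1(x, w, tau, beta, h1, kappa_n, F, Fprime, c, eps):
    # Directional derivative of S1 = min{a1, a2, 1 - eps} in direction h1,
    # with ties detected up to the slackness kappa_n.
    u = w @ beta
    p = F(u) if x == 1 else 1.0 - F(u)            # L(x, w'beta)
    Lb = (2 * x - 1) * Fprime(u) * w              # L_beta(x, w'beta)
    m = min(tau, 1 - tau)
    branches = [tau + c * m / p, tau / p, 1 - eps]  # arguments of the min
    derivs = [-c * m * (Lb @ h1) / p ** 2,          # d/dbeta of each branch
              -tau * (Lb @ h1) / p ** 2,
              0.0]
    smallest = min(branches)
    active = [d for b, d in zip(branches, derivs) if b <= smallest + kappa_n]
    return min(active)                              # cases 1-7 collapse to this
```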
D Proofs for Section 5

In this appendix we give the proofs for section 5. We start with a lemma about the HDD of $\Gamma_1$.

Lemma 6 ($\widehat\Gamma_{1,\theta_0}'(x,w,\tau,h)$ is Lipschitz in $h$). Suppose the assumptions of proposition 5 hold. Let $\kappa_n \to 0$, $\sqrt{n}\kappa_n \to \infty$, $\eta_n \to 0$, and $\sqrt{n}\eta_n \to \infty$ as $n \to \infty$. Then
\[ \big\| \widehat\Gamma_{1,\theta_0}'(x,w,\tau,\tilde h) - \widehat\Gamma_{1,\theta_0}'(x,w,\tau,h) \big\| \le K(x,w,\widehat\theta) \cdot \|\tilde h - h\|_\Theta \]
for a function $K(x,w,\widehat\theta) = O_p(1)$ defined below.
We will show that $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ is Lipschitz in $h$. The proof for the lower bound is similar. Let $h, \tilde h \in \mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$. Then
\[
|\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,\tilde h) - \widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)|
\le \|q(x,w)\| \cdot \|\tilde h_2(S^{(1)}(x,w,\tau,\widehat\beta)) - h_2(S^{(1)}(x,w,\tau,\widehat\beta))\| + |q(x,w)'\widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta))| \cdot |\bar T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|
\le \|q(x,w)\| \cdot \|\tilde h_2 - h_2\|_\infty + |q(x,w)'\widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta))| \cdot |\bar T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|. \qquad (15)
\]
The first line follows by the definition of $\widehat\Gamma_{1,\theta_0}'$, the triangle inequality, and the Cauchy-Schwarz inequality. Next,
\[ |q(x,w)'\widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta))| \le \|q(x,w)\| \cdot \|\widehat\gamma'\|_\infty \le \|q(x,w)\| \big( \|\gamma_0'\|_\infty + \|\widehat\gamma' - \gamma_0'\|_\infty \big) = O_p(1). \]
The last line follows since $\|\gamma_0'\|_\infty \le B$ (by $\gamma_0 \in \mathcal{G}$) and $\|\widehat\gamma' - \gamma_0'\|_\infty = o_p(1)$ (by lemma 2). Thus it suffices to show that $\bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)$ is Lipschitz in $h_1$. To see this, write
\[
|\bar T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|
\le |T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)| \cdot \mathbb{1}\big( S^{(1)}(x,w,\tau,\widehat\beta) > \varepsilon + \kappa_n \big)
+ |\max\{T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n),0\} - \max\{T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n),0\}| \cdot \mathbb{1}\big( |S^{(1)}(x,w,\tau,\widehat\beta) - \varepsilon| \le \kappa_n \big)
\le |T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)| + |\max\{T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n),0\} - \max\{T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n),0\}|
\le 2 \cdot |T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|.
\]
The first inequality follows by the definition of $\bar T^{(1)}$ and the triangle inequality. The last inequality follows from lemma 5. Thus it suffices to show that $T^{(1)}$ is Lipschitz in $h_1$. To see this, consider
\[
|T^{(1)}(x,w,\tau,\widehat\beta,\tilde h_1,\kappa_n) - T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n)|
\le \sum_{j=1}^{7} |T^{(1)}_j(x,w,\tau,\widehat\beta,\tilde h_1) - T^{(1)}_j(x,w,\tau,\widehat\beta,h_1)|
\le 4\Big| c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\widehat\beta)'(\tilde h_1 - h_1)}{L(x,w'\widehat\beta)^2} \Big| + 4\Big| \tau\frac{L_\beta(x,w'\widehat\beta)'(\tilde h_1 - h_1)}{L(x,w'\widehat\beta)^2} \Big|
\le 8\Big\| \frac{L_\beta(x,w'\widehat\beta)}{L(x,w'\widehat\beta)^2} \Big\| \cdot \|\tilde h_1 - h_1\|. \qquad (16)
\]
The second line uses the definition of the $T^{(1)}_j$ and repeated applications of lemma 5. The last line follows from $\tau \le 1$, $c\min\{\tau,1-\tau\} \le 1$, and the Cauchy-Schwarz inequality. We have
\[ \Big\| \frac{L_\beta(x,w'\widehat\beta)}{L(x,w'\widehat\beta)^2} \Big\| = O_p(1) \]
since $\sup_{\beta\in\mathcal{B}_\delta}\|L_\beta(x,w'\beta)/L(x,w'\beta)^2\| < \infty$ and $P(\widehat\beta \in \mathcal{B}_\delta) \to 1$. Thus $T^{(1)}$ is Lipschitz in $h_1$.

Overall, we have shown that $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ is Lipschitz in $h$ with Lipschitz constant equal to
\[ K^{(1)}(x,w,\widehat\theta) = \|q(x,w)\| \Big( 1 + 16 \cdot \|\widehat\gamma'\|_\infty \Big\| \frac{L_\beta(x,w'\widehat\beta)}{L(x,w'\widehat\beta)^2} \Big\| \Big) = O_p(1). \]
A similar argument can be used to show that $\widehat\Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h)$ is Lipschitz in $h$ with the same constant. Setting $K(x,w,\widehat\theta) = \big( K^{(1)}(x,w,\widehat\theta)^2 + K^{(2)}(x,w,\widehat\theta)^2 \big)^{1/2} = \sqrt{2} \cdot K^{(1)}(x,w,\widehat\theta)$ concludes the proof.

Next we prove proposition 3. This is our main result on the analytical bootstrap for mean potential outcomes. In section 5 we use this result to do bootstrap inference on our ATE bounds. There we discussed how the asymptotic distribution of our mean potential outcome bounds comes from two terms. The first term requires using HDDs while the second term is standard. We consider each term one at a time in the following two lemmas.

Lemma 7 (Non-standard component). Suppose the assumptions of proposition 1 hold. Suppose $\kappa_n \to 0$, $\sqrt{n}\kappa_n \to \infty$, $\eta_n \to 0$, and $\sqrt{n}\eta_n \to \infty$ as $n \to \infty$. Then
\[ \widehat\Gamma_{3,\theta_0}'\big( x, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \overset{P}{\rightsquigarrow} \Gamma_{3,\theta_0}'(x, Z_1). \]

Proof of lemma 7.
We prove this by applying theorem 3.2 in Fang and Santos (2019). To do this we must verify their assumptions 1-4.

1. Their assumption 1 requires $\Gamma_3(x,\theta)$ to be HDD. We showed this in the proof of theorem 1.
2. Their assumption 2 is about the asymptotic distribution of the first step estimator $\widehat\theta$. This holds by our lemma 1 in appendix A.
3. Their assumption 3 is about validity of the bootstrap for $\widehat\theta$. This holds by theorem 3.6.1 in van der Vaart and Wellner (1996).

Finally, in their remark 3.4, they note that sufficient conditions for their assumption 4 are:

1. (Smoothness) $\widehat\Gamma_{3,\theta_0}'(x,h)$ is Lipschitz in $h$.
2. (Consistency) $\|\widehat\Gamma_{3,\theta_0}'(x,h) - \Gamma_{3,\theta_0}'(x,h)\| = o_p(1)$ for any $h$.

These are properties of the HDD estimator $\widehat\Gamma_{3,\theta_0}'(x,h)$. We finish this proof by verifying that these properties hold in our setting.

Part 1: (Smoothness) $\widehat\Gamma_{3,\theta_0}'(x,h)$ is Lipschitz in $h$. Recall that
\[ \widehat\Gamma_{3,\theta_0}'^{(1)}(x,h) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \widehat\Gamma_{1,\theta_0}'^{(1)}(x,W_i,\tau,h)\, d\tau. \]
So
\[
|\widehat\Gamma_{3,\theta_0}'^{(1)}(x,\tilde h) - \widehat\Gamma_{3,\theta_0}'^{(1)}(x,h)|
\le \frac{1}{n}\sum_{i=1}^n \int_0^1 |\widehat\Gamma_{1,\theta_0}'^{(1)}(x,W_i,\tau,\tilde h) - \widehat\Gamma_{1,\theta_0}'^{(1)}(x,W_i,\tau,h)|\, d\tau
\le \Big( \frac{1}{n}\sum_{i=1}^n \int_0^1 K(x,W_i,\widehat\theta)\, d\tau \Big) \|\tilde h - h\|_\Theta
= \Big( \frac{1}{n}\sum_{i=1}^n K(x,W_i,\widehat\theta) \Big) \|\tilde h - h\|_\Theta.
\]
The second line follows by lemma 6. Next we'll show that $\frac{1}{n}\sum_{i=1}^n K(x,W_i,\widehat\theta) = O_p(1)$. Since $K = \sqrt{2}\, K^{(1)}$, it suffices to consider $K^{(1)}$. We have
\[
\frac{1}{n}\sum_{i=1}^n K^{(1)}(x,W_i,\widehat\theta)
= \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\| \Big( 1 + 16 \cdot \|\widehat\gamma'\|_\infty \Big\| \frac{L_\beta(x,W_i'\widehat\beta)}{L(x,W_i'\widehat\beta)^2} \Big\| \Big)
\le \Big( \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\| \Big) + 16 \cdot \|\widehat\gamma'\|_\infty \Big( \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\|^2 \Big)^{1/2} \Big( \frac{1}{n}\sum_{i=1}^n \Big\| \frac{L_\beta(x,W_i'\widehat\beta)}{L(x,W_i'\widehat\beta)^2} \Big\|^2 \Big)^{1/2}.
\]
The first line follows by the definition of $K^{(1)}$. The second line follows by the Cauchy-Schwarz inequality. By $E(\|q(x,W)\|^2) < \infty$, by $P(\widehat\beta \in \mathcal{B}_\delta) \to 1$, and by
\[ E\Big( \sup_{\beta\in\mathcal{B}_\delta}\Big\| \frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2} \Big\|^2 \Big) < \infty, \]
we have $\frac{1}{n}\sum_{i=1}^n \int_0^1 K(x,W_i,\widehat\theta)\, d\tau = O_p(1)$. Thus $\widehat\Gamma_{3,\theta_0}'^{(1)}(x,h)$ is Lipschitz in $h$. It can be similarly shown that $\widehat\Gamma_{3,\theta_0}'^{(2)}(x,h)$ is Lipschitz in $h$.

Part 2: Consistency of $\widehat\Gamma_{3,\theta_0}'(x,h)$. Next we show that $\widehat\Gamma_{3,\theta_0}'^{(1)}(x,h) \xrightarrow{p} \Gamma_{3,\theta_0}'^{(1)}(x,h)$. To do this we use the triangle inequality to decompose their difference into four different terms:
\[ |\widehat\Gamma_{3,\theta_0}'^{(1)}(x,h) - \Gamma_{3,\theta_0}'^{(1)}(x,h)| \le R_1 + R_2 + R_3 + R_4 \]
where
\[ R_1 = \Big| \frac{1}{n}\sum_{i=1}^n q(x,W_i)' \int_0^1 h_2(S^{(1)}(x,W_i,\tau,\widehat\beta))\, d\tau - E\Big[ q(x,W)' \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta_0))\, d\tau \Big] \Big| \]
\[ R_2 = \Big| \frac{1}{n}\sum_{i=1}^n \Big[ q(x,W_i)' \int_0^1 \widehat\gamma'(S^{(1)}(x,W_i,\tau,\widehat\beta))\, \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau - q(x,W_i)' \int_0^1 \gamma_0'(S^{(1)}(x,W_i,\tau,\widehat\beta))\, \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau \Big] \Big| \]
\[ R_3 = \Big| \frac{1}{n}\sum_{i=1}^n q(x,W_i)' \int_0^1 \gamma_0'(S^{(1)}(x,W_i,\tau,\widehat\beta))\, \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau - E\Big[ q(x,W)' \int_0^1 \gamma_0'(S^{(1)}(x,W,\tau,\widehat\beta))\, \bar T^{(1)}(x,W,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau \Big] \Big| \]
\[ R_4 = \Big| E\Big[ q(x,W)' \int_0^1 \gamma_0'(S^{(1)}(x,W,\tau,\widehat\beta))\, \bar T^{(1)}(x,W,\tau,\widehat\beta,h_1,\kappa_n)\, d\tau \Big] - E\Big[ q(x,W)' \int_0^1 \gamma_0'(S^{(1)}(x,W,\tau,\beta_0))\, \bar T^{(1)}(x,W,\tau,\beta_0,h_1,0)\, d\tau \Big] \Big|. \]

Convergence of $R_1$. $h_2$ is continuous on the compact domain $[\varepsilon,1-\varepsilon]$. Hence $h_2$ is bounded on $[\varepsilon,1-\varepsilon]$. Since $S^{(1)}$ lies in $[\varepsilon,1-\varepsilon]$, the composite function $h_2(S^{(1)}(x,w,\tau,\beta))$ is thus bounded uniformly over $(x,w,\tau) \in \{0,1\} \times \mathcal{W} \times (0,1)$. It is also continuous in $\beta$ for any $(x,w,\tau)$, since $S^{(1)}$ is continuous in $\beta$. Therefore, by the dominated convergence theorem,
\[ \int_0^1 h_2(S^{(1)}(x,w,\tau,\beta))\, d\tau \]
is continuous in $\beta$ for all $(x,w) \in \{0,1\} \times \mathcal{W}$. Moreover, it has a bounded envelope:
\[ E\Big( \sup_{\beta\in\mathcal{B}} \Big| \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta))\, d\tau \Big| \Big) < \infty. \]
These properties plus compactness of $\mathcal{B}$ imply that
\[ \Big\{ \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta))\, d\tau : x \in \{0,1\}, \beta \in \mathcal{B} \Big\} \]
is Glivenko-Cantelli, by example 19.8 in van der Vaart (2000). By $E(\|q(x,W)\|) < \infty$ (A6.4) and by corollary 9.27 part (ii) in Kosorok (2008), the class of functions
\[ \Big\{ q(x,W)' \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta))\, d\tau : x \in \{0,1\}, \beta \in \mathcal{B} \Big\} \]
is also Glivenko-Cantelli. Hence
\[
R_1 \le \sup_{\beta\in\mathcal{B}} \Big| \frac{1}{n}\sum_{i=1}^n q(x,W_i)' \int_0^1 h_2(S^{(1)}(x,W_i,\tau,\beta))\, d\tau - E\Big[ q(x,W)' \int_0^1 h_2(S^{(1)}(x,W,\tau,\beta))\, d\tau \Big] \Big|
+ \Big| \int_{\mathcal{W}} q(x,w)' \int_0^1 h_2(S^{(1)}(x,w,\tau,\widehat\beta))\, d\tau\, dF_W(w) - \int_{\mathcal{W}} q(x,w)' \int_0^1 h_2(S^{(1)}(x,w,\tau,\beta_0))\, d\tau\, dF_W(w) \Big|
= o_p(1) + o_p(1) = o_p(1).
\]
The second line follows by the triangle inequality. The first term in that line is $o_p(1)$ by the Glivenko-Cantelli property. The second term is $o_p(1)$ by its continuity in $\beta$ and $\widehat\beta \xrightarrow{p} \beta_0$.

Convergence of $R_2$.
We have
\[
R_2 \le \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\| \int_0^1 \big\| \widehat\gamma'(S^{(1)}(x,W_i,\tau,\widehat\beta)) - \gamma_0'(S^{(1)}(x,W_i,\tau,\widehat\beta)) \big\| \, \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau
\le \|\widehat\gamma' - \gamma_0'\|_\infty \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\| \int_0^1 \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau
\le o_p(1) \times \Big( \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\|^2 \Big)^{1/2} \times \Big( \frac{1}{n}\sum_{i=1}^n \Big( \int_0^1 \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau \Big)^2 \Big)^{1/2}.
\]
The first line follows from the definition of $R_2$, the triangle inequality, and the Cauchy-Schwarz inequality. The last line follows by uniform convergence of $\widehat\gamma'$ to $\gamma_0'$ (lemma 2) and the Cauchy-Schwarz inequality. By A6.4,
\[ \Big( \frac{1}{n}\sum_{i=1}^n \|q(x,W_i)\|^2 \Big)^{1/2} = O_p(1). \]
Also,
\[
\int_0^1 \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau
\le \int_0^1 \big| T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau
\le \int_0^1 \big( c\min\{\tau,1-\tau\} + \tau \big)\, d\tau \, \Big| \frac{L_\beta(x,W_i'\widehat\beta)'h_1}{L(x,W_i'\widehat\beta)^2} \Big|
= \frac{c+2}{4} \Big| \frac{L_\beta(x,W_i'\widehat\beta)'h_1}{L(x,W_i'\widehat\beta)^2} \Big|.
\]
The first inequality follows by the definition of $\bar T^{(1)}$, since $|\max\{T^{(1)},0\}| \le |T^{(1)}|$ and the two indicators are mutually exclusive. The second follows by the definition of $T^{(1)}$: the largest absolute value among the quantities $T^{(1)}$ can take is an upper bound for it. By A4,
\[ \frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2} \]
is continuous in $\beta$ for any $w \in \mathcal{W}$. Moreover, this term has a bounded envelope in a neighborhood of $\beta_0$ by our assumption that
\[ E\Big( \sup_{\beta\in\mathcal{B}_\delta}\Big\| \frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2} \Big\|^2 \Big) < \infty. \]
Hence
\[ \frac{1}{n}\sum_{i=1}^n \Big( \int_0^1 \big| \bar T^{(1)}(x,W_i,\tau,\widehat\beta,h_1,\kappa_n) \big| \, d\tau \Big)^2 \le \frac{(c+2)^2}{16} \frac{1}{n}\sum_{i=1}^n \Big\| \frac{L_\beta(x,W_i'\widehat\beta)}{L(x,W_i'\widehat\beta)^2} \Big\|^2 \|h_1\|^2 = O_p(1), \]
where the last line follows by the uniform law of large numbers, as in example 19.8 in van der Vaart (2000). Thus $R_2 = o_p(1)$.

Convergence of $R_3$. For fixed $h$ and $x$, let
\[ g_n(w,\beta) = q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)\, d\tau. \]
Define
\[ R_3(\beta) = \Big| \frac{1}{n}\sum_{i=1}^n g_n(W_i,\beta) - E[g_n(W,\beta)] \Big|. \]
Then $R_3 = R_3(\widehat\beta)$. We want to show that $R_3 = o_p(1)$. For any $\epsilon > 0$,
\[
P(|R_3| \ge \epsilon) = P(|R_3| \ge \epsilon, \widehat\beta \in \mathcal{B}_\delta) + P(|R_3| \ge \epsilon, \widehat\beta \notin \mathcal{B}_\delta)
\le P\Big( \sup_{\beta\in\mathcal{B}_\delta} |R_3(\beta)| \ge \epsilon \Big) + P(\widehat\beta \notin \mathcal{B}_\delta).
\]
The second term converges to zero by consistency of $\widehat\beta$. Thus it suffices to show that the first term converges to zero. That is, we want to show that
\[ \sup_{\beta\in\mathcal{B}_\delta} \Big| \frac{1}{n}\sum_{i=1}^n g_n(W_i,\beta) - E[g_n(W,\beta)] \Big| = o_p(1). \]
This follows from a uniform law of large numbers. Specifically, we use theorem 4.2.2 in Amemiya (1985). There are two main properties required to apply this theorem:

1. A dominance condition: $E\big( \sup_{\beta\in\mathcal{B}_\delta} |g_n(W,\beta)| \big) < \infty$.
2. A continuity condition: $g_n(w,\beta)$ is continuous at any $\beta \in \mathcal{B}_\delta$ for all $w \in \mathcal{W}$.

So we conclude the proof by verifying these two properties.

The dominance condition. We have
\[
E\Big( \sup_{\beta\in\mathcal{B}_\delta} |g_n(W,\beta)| \Big)
= E\Big( \sup_{\beta\in\mathcal{B}_\delta} \Big| q(x,W)' \int_0^1 \gamma_0'(S^{(1)}(x,W,\tau,\beta))\, \bar T^{(1)}(x,W,\tau,\beta,h_1,\kappa_n)\, d\tau \Big| \Big)
\le E\Big( \|q(x,W)\| \sup_{\beta\in\mathcal{B}_\delta} \int_0^1 \|\gamma_0'(S^{(1)}(x,W,\tau,\beta))\| \, |\bar T^{(1)}(x,W,\tau,\beta,h_1,\kappa_n)|\, d\tau \Big)
\le E\Big( \|q(x,W)\|\, B\, \frac{c+2}{4} \sup_{\beta\in\mathcal{B}_\delta} \Big| \frac{L_\beta(x,W'\beta)'h_1}{L(x,W'\beta)^2} \Big| \Big)
\le B\, \frac{c+2}{4}\, E\big( \|q(x,W)\|^2 \big)^{1/2} E\Big( \sup_{\beta\in\mathcal{B}_\delta} \Big\| \frac{L_\beta(x,W'\beta)}{L(x,W'\beta)^2} \Big\|^2 \Big)^{1/2} \|h_1\| < \infty.
\]
The second and fourth lines follow by the Cauchy-Schwarz inequality.
The continuity condition.
Fix $w \in \mathcal{W}$. We next show continuity of
\[ g_n(w,\beta) = q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)\, d\tau \]
in $\beta$. First note that, for any sequence $\beta_m \to \beta$,
\[ \lim_{m\to\infty} g_n(w,\beta_m) = q(x,w)' \Big( \lim_{m\to\infty} \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta_m))\, \bar T^{(1)}(x,w,\tau,\beta_m,h_1,\kappa_n)\, d\tau \Big). \]
We now show that we can bring the limit inside the integral by applying the dominated convergence theorem. First note that the integrand satisfies a dominance condition, similar to our analysis above. The other condition is pointwise convergence of the integrand for all $\tau \in (0,1)$ except possibly on a set of Lebesgue measure zero. Thus we need to show that
\[ \lim_{m\to\infty} \gamma_0'(S^{(1)}(x,w,\tau,\beta_m))\, \bar T^{(1)}(x,w,\tau,\beta_m,h_1,\kappa_n) = \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) \]
for all $\tau \in (0,1)$ except possibly a set of Lebesgue measure zero. We do this by showing that
\[ \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) \]
is continuous in $\beta$, for all $\tau \in (0,1)$ except possibly a set of Lebesgue measure zero. This term is the product of two pieces, so it suffices to show that each piece separately is continuous. After we do this, the overall proof will be complete.

Piece 1: By $\gamma_0 \in \mathcal{G}$ and by continuity of $S^{(1)}(x,w,\tau,\beta)$ in $\beta$, the function $\gamma_0'(S^{(1)}(x,w,\tau,\beta))$ is continuous.

Piece 2: Next we'll show that $\bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ is continuous in $\beta$ for all $\tau \in (0,1)$ except a set of Lebesgue measure zero. Recall that, omitting the arguments of the functions,
\[ \bar T^{(1)} = T^{(1)} \cdot \mathbb{1}(S^{(1)} - \varepsilon > \kappa_n) + \max\{T^{(1)}, 0\} \cdot \mathbb{1}(|S^{(1)} - \varepsilon| \le \kappa_n). \]
So we'll start by studying continuity of $T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ in $\beta$. Recall equation (13):
\[ T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n) = \sum_{j=1}^{7} T^{(1)}_j(x,w,\tau,\beta,h_1) \cdot \mathbb{1}^{(1)}_j(x,w,\tau,\beta,\kappa_n). \]
Given our assumptions on the propensity score (A4), $T^{(1)}_j(x,w,\tau,\beta,h_1)$ is a composition of functions that are continuous in $\beta$. Thus it is also continuous in $\beta$. This holds for all $j = 1,\dots,7$. For example,
\[ T^{(1)}_1(x,w,\tau,\beta,h_1) = -c\min\{\tau,1-\tau\}\frac{L_\beta(x,w'\beta)'h_1}{L(x,w'\beta)^2}, \]
which is continuous in $\beta$ by continuity of $L(x,\cdot)$. This holds for all $\tau \in (0,1)$. Next consider the indicators $\mathbb{1}^{(1)}_j(x,w,\tau,\beta,\kappa_n)$. Fix $\beta \in \mathcal{B}_\delta$. We will show that these functions are constant, and therefore continuous at $\beta$, except on a set of $\tau$'s of Lebesgue measure zero.

First consider
\[ \mathbb{1}^{(1)}_1(x,w,\tau,\beta,\kappa_n) = \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} < \min\Big\{ \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big). \]
Recall that $w \in \mathcal{W}$ is fixed, along with $x$ and $\kappa_n > 0$. Let
\[ \mathcal{T}_1 = \Big\{ \tau \in (0,1) : \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} = \min\Big\{ \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big\}. \]
Then $\mathbb{1}^{(1)}_1(x,w,\tau,\beta,\kappa_n)$ is continuous at $\beta$ for any $\tau \notin \mathcal{T}_1$. Moreover, $\mathcal{T}_1$ has Lebesgue measure zero. To see this, let $\bar\tau = (1-\varepsilon)L(x,w'\beta)$. Suppose $\bar\tau \le 1/2$. Then
\[
\mathcal{T}_1 = \Big\{ \tau \in (0,\bar\tau] : \tau\Big( 1 - \frac{1-c}{L(x,w'\beta)} \Big) = -\kappa_n \Big\}
\cup \Big\{ \tau \in (\bar\tau, 1/2] : \tau\Big( 1 + \frac{c}{L(x,w'\beta)} \Big) = 1-\varepsilon-\kappa_n \Big\}
\cup \Big\{ \tau \in (1/2, 1) : \tau\Big( 1 - \frac{c}{L(x,w'\beta)} \Big) = 1-\varepsilon-\frac{c}{L(x,w'\beta)}-\kappa_n \Big\}.
\]
Since $\kappa_n \ne 0$, the first set in this union contains at most one point. Since
\[ 1 + \frac{c}{L(x,w'\beta)} \ne 0, \]
the second set in this union also contains at most one point. Finally, the last set contains more than one point only if $L(x,w'\beta) = c$ and
\[ 1-\varepsilon-\frac{c}{L(x,w'\beta)}-\kappa_n = 0, \]
which yields a contradiction since $-\varepsilon - \kappa_n < 0$. Therefore, this is the union of at most three points and hence has measure zero. If $\bar\tau \ge 1/2$, then
\[
\mathcal{T}_1 = \Big\{ \tau \in (0,1/2] : \tau\Big( 1 - \frac{1-c}{L(x,w'\beta)} \Big) = -\kappa_n \Big\}
\cup \Big\{ \tau \in (1/2, \bar\tau] : \tau\Big( 1 - \frac{1+c}{L(x,w'\beta)} \Big) = -\frac{c}{L(x,w'\beta)}-\kappa_n \Big\}
\cup \Big\{ \tau \in (\bar\tau, 1) : \tau\Big( 1 - \frac{c}{L(x,w'\beta)} \Big) = 1-\varepsilon-\frac{c}{L(x,w'\beta)}-\kappa_n \Big\}.
\]
Once again this is the union of at most three points and hence has measure zero. Thus we've shown that for any $\beta \in \mathcal{B}_\delta$ and for all $\tau$ except a set of Lebesgue measure zero,
\[ T^{(1)}_1(x,w,\tau,\beta,h_1) \cdot \mathbb{1}^{(1)}_1(x,w,\tau,\beta,\kappa_n) \]
is continuous at $\beta$. A similar argument holds for the other indicators. Specifically, let $\mathcal{T}_j$ denote the set of $\tau$'s at which $\mathbb{1}^{(1)}_j(x,w,\tau,\beta,\kappa_n)$ is discontinuous at $\beta$. Then
\[ \mathcal{T}_2 = \Big\{ \tau \in (0,1) : \frac{\tau}{L(x,w'\beta)} = \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big\} \]
\[ \mathcal{T}_3 = \Big\{ \tau \in (0,1) : 1-\varepsilon = \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)} \Big\} - \kappa_n \Big\} \]
\[ \mathcal{T}_4 \subseteq \Big\{ \tau \in (0,1) : \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \frac{\tau}{L(x,w'\beta)} \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} = 1-\varepsilon-\kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \frac{\tau}{L(x,w'\beta)} = 1-\varepsilon-\kappa_n \Big\} \]
\[ \mathcal{T}_5 \subseteq \Big\{ \tau \in (0,1) : \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - (1-\varepsilon) \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} = \frac{\tau}{L(x,w'\beta)} - \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : 1-\varepsilon = \frac{\tau}{L(x,w'\beta)} - \kappa_n \Big\} \]
\[ \mathcal{T}_6 \subseteq \Big\{ \tau \in (0,1) : \Big| \frac{\tau}{L(x,w'\beta)} - (1-\varepsilon) \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \frac{\tau}{L(x,w'\beta)} = \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : 1-\varepsilon = \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \kappa_n \Big\} \]
\[ \mathcal{T}_7 \subseteq \Big\{ \tau \in (0,1) : \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - (1-\varepsilon) \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \Big| \frac{\tau}{L(x,w'\beta)} - (1-\varepsilon) \Big| = \kappa_n \Big\} \cup \Big\{ \tau \in (0,1) : \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)} - \frac{\tau}{L(x,w'\beta)} \Big| = \kappa_n \Big\}. \]
Using similar arguments to the $j = 1$ case, we see that all of these sets have Lebesgue measure zero. So $\bigcup_{j=1}^{7}\mathcal{T}_j$ has Lebesgue measure zero. Hence the function $T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ is continuous at all $\beta \in \mathcal{B}_\delta$ for all $\tau \in (0,1)$ except a set of Lebesgue measure zero.

Now let's return to $\bar T^{(1)}$. This function is continuous in $\beta$ for all $\tau \in (0,1)$ except possibly on the set
\[ \mathcal{T} = \bigcup_{j=1}^{7}\mathcal{T}_j \cup \Big\{ \tau \in (0,1) : \Big| \min\Big\{ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta)}, \frac{\tau}{L(x,w'\beta)}, 1-\varepsilon \Big\} - \varepsilon \Big| = \kappa_n \Big\}. \]
The second term here comes from the indicators $\mathbb{1}(|S^{(1)} - \varepsilon| \le \kappa_n)$ and $\mathbb{1}(S^{(1)} - \varepsilon > \kappa_n)$. We can see that this set has Lebesgue measure zero using similar arguments as above. Thus the overall set $\mathcal{T}$ has Lebesgue measure zero. Hence we've shown that, for a fixed $(x,w,h_1,\kappa_n)$ and for any $\beta \in \mathcal{B}_\delta$, $\bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)$ is continuous at $\beta$ for all $\tau \in (0,1)$ except a set of Lebesgue measure zero. As noted earlier, this is sufficient to complete the proof that $R_3 = o_p(1)$.

Convergence of $R_4$. This part is the difference between two expectations, one evaluated at $\widehat\beta$ and the other at $\beta_0$. Note that the expectations are over $W$, not $\widehat\beta$. We'll show that $R_4 = o_p(1)$ by applying the dominated convergence theorem and then using the fact that $\widehat\beta \xrightarrow{p} \beta_0$. The $R_4$ term is similar to $R_3$, and so our proof here will use some of our derivations from our proof that $R_3 = o_p(1)$. The main difference is that $R_4$ is a function of $\bar T^{(1)}$ evaluated at $(\widehat\beta, \kappa_n)$ rather than $(\beta_0, 0)$. So we'll spend most of our time on that. $\bar T^{(1)}$, in turn, depends on $T^{(1)}$. So we'll begin by showing that
\[ T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
By the definition of $T^{(1)}$, this convergence holds if
\[ T^{(1)}_j(x,w,\tau,\widehat\beta,h_1) \xrightarrow{p} T^{(1)}_j(x,w,\tau,\beta_0,h_1) \quad \text{and} \quad \mathbb{1}^{(1)}_j(x,w,\tau,\widehat\beta,\kappa_n) \xrightarrow{p} \mathbb{1}^{(1)}_j(x,w,\tau,\beta_0,0) \]
for all $j = 1,\dots,7$.
By continuity of $T^{(1)}_j(x,w,\tau,\beta,h_1)$ in $\beta$ for all $j = 1,\dots,7$, and by the consistency of $\widehat\beta$, the $T^{(1)}_j(x,w,\tau,\widehat\beta,h_1)$ terms are consistent. The indicator functions are slightly trickier. Given $(x,w,\tau)$, the value of $\beta_0$ determines which of the seven cases we are in. That is, which of the indicators $\mathbb{1}^{(1)}_j(x,w,\tau,\beta_0,0)$ is 1; the other six are all zero. We'll consider each case separately.

First suppose that
\[ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} < \min\Big\{ \frac{\tau}{L(x,w'\beta_0)}, 1-\varepsilon \Big\}. \]
Thus we have $\mathbb{1}^{(1)}_1(x,w,\tau,\beta_0,0) = 1$. By $\widehat\beta \xrightarrow{p} \beta_0$ and $\kappa_n \to 0$,
\[
\mathbb{1}^{(1)}_1(x,w,\tau,\widehat\beta,\kappa_n) = \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} < \min\Big\{ \frac{\tau}{L(x,w'\widehat\beta)}, 1-\varepsilon \Big\} - \kappa_n \Big)
\xrightarrow{p} \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} < \min\Big\{ \frac{\tau}{L(x,w'\beta_0)}, 1-\varepsilon \Big\} \Big) = \mathbb{1}^{(1)}_1(x,w,\tau,\beta_0,0).
\]
Moreover, by taking complements, we see that for these values of $(x,w,\tau,\beta_0)$ all the other indicators converge (to zero) as well.

Next suppose $\mathbb{1}^{(1)}_2(x,w,\tau,\beta_0,0) = 1$ or $\mathbb{1}^{(1)}_3(x,w,\tau,\beta_0,0) = 1$. In either of these cases, we can similarly show that $\mathbb{1}^{(1)}_j(x,w,\tau,\widehat\beta,\kappa_n) \xrightarrow{p} \mathbb{1}^{(1)}_j(x,w,\tau,\beta_0,0)$ for $j = 2,3$.

Next suppose
\[ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} = \frac{\tau}{L(x,w'\beta_0)} < 1-\varepsilon, \]
which puts us in the $\mathbb{1}^{(1)}_4(x,w,\tau,\beta_0,0) = 1$ case. This case is more delicate, and shows the second place where the $\kappa_n$'s are important. $\mathbb{1}^{(1)}_4(x,w,\tau,\widehat\beta,\kappa_n)$ can be viewed as the product of three indicator functions. Two of them are handled like the $j = 1,2,3$ cases:
\[ \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} < 1-\varepsilon-\kappa_n \Big) \xrightarrow{p} \mathbb{1}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} < 1-\varepsilon \Big) \]
and
\[ \mathbb{1}\Big( \frac{\tau}{L(x,w'\widehat\beta)} < 1-\varepsilon-\kappa_n \Big) \xrightarrow{p} \mathbb{1}\Big( \frac{\tau}{L(x,w'\beta_0)} < 1-\varepsilon \Big) \]
by $\widehat\beta \xrightarrow{p} \beta_0$ and $\kappa_n \to 0$. The third indicator requires a different argument:
\[
\mathbb{1}\Big( \Big| \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} - \frac{\tau}{L(x,w'\widehat\beta)} \Big| \le \kappa_n \Big)
= \mathbb{1}\Big( \frac{1}{\sqrt{n}\kappa_n} \cdot \sqrt{n}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} - \frac{\tau}{L(x,w'\widehat\beta)} \Big) \in [-1,1] \Big)
\to 1
\]
since $\sqrt{n}\kappa_n \to \infty$ and
\[ \sqrt{n}\Big( \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} - \frac{\tau}{L(x,w'\widehat\beta)} \Big) = O_p(1). \]
This term is $O_p(1)$ since we're looking at the case where
\[ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} - \frac{\tau}{L(x,w'\widehat\beta)} \xrightarrow{p} \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\beta_0)} - \frac{\tau}{L(x,w'\beta_0)} = 0, \]
and by the delta method. Combining the consistency of these three indicator functions, we have that $\mathbb{1}^{(1)}_4(x,w,\tau,\widehat\beta,\kappa_n) \xrightarrow{p} \mathbb{1}^{(1)}_4(x,w,\tau,\beta_0,0)$. As before, by taking complements we see that all of the other indicator functions are also consistent in this case. Notice that here $\kappa_n$ is a slackness parameter that we introduced to allow the indicator $\mathbb{1}^{(1)}_4(x,w,\tau,\widehat\beta,\kappa_n)$ to be 1 even if the equality
\[ \tau + \frac{c\min\{\tau,1-\tau\}}{L(x,w'\widehat\beta)} = \frac{\tau}{L(x,w'\widehat\beta)} \]
does not hold exactly in finite samples.

The last three cases are all similar to the $\mathbb{1}^{(1)}_4(x,w,\tau,\beta_0,0) = 1$ case that we just studied. Thus, putting all of these cases together gives
\[ T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
By similar arguments, we can also show that
\[ \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
By continuity of $\gamma_0'(\cdot)$ and of $S^{(1)}(x,w,\tau,\beta)$ in $\beta$, this implies that
\[ q(x,w)'\gamma_0'(S^{(1)}(x,w,\tau,\widehat\beta))\, \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} q(x,w)'\gamma_0'(S^{(1)}(x,w,\tau,\beta_0))\, \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
Finally, note that
\[ q(x,w)' \int_0^1 \gamma_0'(S^{(1)}(x,w,\tau,\beta))\, \bar T^{(1)}(x,w,\tau,\beta,h_1,\kappa_n)\, d\tau \]
is continuous in $\beta$ and has a bounded envelope (which can be seen using arguments similar to those in our proof for $R_3$). Thus we can apply the dominated convergence theorem, which gives $R_4 = o_p(1)$.

Putting the four pieces together. We've shown that $R_1, \dots, R_4$ are all $o_p(1)$. Thus we've shown that
\[ |\widehat\Gamma_{3,\theta_0}'^{(1)}(x,h) - \Gamma_{3,\theta_0}'^{(1)}(x,h)| = o_p(1). \]
A similar argument can be used to show that
\[ |\widehat\Gamma_{3,\theta_0}'^{(2)}(x,h) - \Gamma_{3,\theta_0}'^{(2)}(x,h)| = o_p(1). \]
As discussed at the beginning of the proof, this consistency of $\widehat\Gamma_{3,\theta_0}'(x,h)$ was all that we had left to show, so we are done.

Lemma 8 (Standard component). Suppose the assumptions of theorem 1 hold. Then
\[ \mathbb{G}_n^* \Gamma_2(x,W,\widehat\theta) \overset{P}{\rightsquigarrow} \mathbb{G}\Gamma_2(x,W,\theta_0) \equiv \tilde Z_4(x), \]
a mean-zero Gaussian vector in $\mathbb{R}^2$.

Proof of lemma 8.
Write
\[ \mathbb{G}_n^* \Gamma_2(x,W,\widehat\theta) = \mathbb{G}_n^* \Gamma_2(x,W,\theta_0) + \mathbb{G}_n^* \big( \Gamma_2(x,W,\widehat\theta) - \Gamma_2(x,W,\theta_0) \big). \]
The first term converges to $\mathbb{G}\Gamma_2(x,W,\theta_0)$ by consistency of the standard nonparametric bootstrap. So it suffices to show that the second term converges to zero. We do this using an argument similar to that in the proof of lemma 19.24 in van der Vaart (2000). We'll give the proof for the upper bound, the first component of $\Gamma_2$. The proof for the lower bound is analogous.

1. By the proof of theorem 1, $\mathcal{F} = \{\Gamma_2^{(1)}(x,W,\theta) : \theta \in \Theta_\delta\}$ is Donsker with finite envelope function. So theorem 23.7 in van der Vaart (2000) gives
\[ \mathbb{G}_n^* \Gamma_2^{(1)}(x,W,\cdot) \overset{P}{\rightsquigarrow} \mathbb{G}\Gamma_2^{(1)}(x,W,\cdot), \]
where $\mathbb{G}$ is a Gaussian process indexed by $\mathcal{F}$.

2. Endow $\mathcal{F}$ with the $L_2(P)$ semi-metric. Note that $\Gamma_2^{(1)}(x,\cdot,\widehat\theta) \xrightarrow{p} \Gamma_2^{(1)}(x,\cdot,\theta_0)$ in this semi-metric. This follows since, from the proof of theorem 1,
\[ |\Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0)| \le K(w)\|\widehat\theta - \theta_0\|_\Theta \]
where $E(K(W)^2) < \infty$. Thus
\[ \int_{\mathcal{W}} |\Gamma_2^{(1)}(x,w,\widehat\theta) - \Gamma_2^{(1)}(x,w,\theta_0)|^2\, dF_W(w) \le E(K(W)^2)\, \|\widehat\theta - \theta_0\|_\Theta^2, \]
which converges to zero in probability by consistency of $\widehat\theta$ for $\theta_0$.

These two points imply that $(\mathbb{G}_n^*, \Gamma_2^{(1)}(x,\cdot,\widehat\theta)) \overset{P}{\rightsquigarrow} (\mathbb{G}, \Gamma_2^{(1)}(x,\cdot,\theta_0))$ in the space $\ell^\infty(\mathcal{F}) \times \mathcal{F}$ by Slutsky's theorem. Define the function $\phi: \ell^\infty(\mathcal{F}) \times \mathcal{F} \to \mathbb{R}$ by
\[ \phi(g, \Gamma_2^{(1)}(x,\cdot,\theta)) = g(\Gamma_2^{(1)}(x,\cdot,\theta)) - g(\Gamma_2^{(1)}(x,\cdot,\theta_0)). \]
Since $\mathbb{G}$ has continuous paths (lemma 18.15 in van der Vaart 2000) almost surely, the function $\phi$ is continuous at almost every $(\mathbb{G}, \Gamma_2^{(1)}(x,\cdot,\theta))$. So the continuous mapping theorem (e.g., theorem 10.8 in Kosorok 2008) implies that
\[ \mathbb{G}_n^* \big( \Gamma_2^{(1)}(x,W,\widehat\theta) - \Gamma_2^{(1)}(x,W,\theta_0) \big) = \phi(\mathbb{G}_n^*, \Gamma_2^{(1)}(x,\cdot,\widehat\theta)) \overset{P}{\rightsquigarrow} \phi(\mathbb{G}, \Gamma_2^{(1)}(x,\cdot,\theta_0)) = 0. \]

Proof of proposition 3. First, since the influence functions for the first step estimators are Donsker (lemmas 3 and 4), since the standard nonparametric bootstrap for those estimators is valid (theorem 3.6.1 in van der Vaart and Wellner 1996), and along with our analysis in the proof of lemma 8, we have
\[ \begin{pmatrix} \sqrt{n}(\widehat\theta^* - \widehat\theta) \\ \mathbb{G}_n^* \Gamma_2(x,W,\widehat\theta) \end{pmatrix} \overset{P}{\rightsquigarrow} \begin{pmatrix} Z_1 \\ \tilde Z_4(x) \end{pmatrix}, \]
a mean-zero process in $\mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R}^2$.

Next, define $\Lambda: \mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R} \to \mathbb{R}$ by
\[ \Lambda(\theta, u) = \Gamma_3^{(1)}(x,\theta) + u, \]
where $\theta \in \mathbb{R}^{d_W} \times \ell^\infty([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$ and $u \in \mathbb{R}$. By the proof of theorem 1, the mapping $\Gamma_3^{(1)}(x,\theta)$ is HDD at $\theta_0$ tangentially to $\mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$. So, for any $u \in \mathbb{R}$, $\Lambda(\theta,u)$ is HDD at $(\theta_0, u)$ tangentially to $\mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q}) \times \mathbb{R}$. Its HDD is
\[ \Lambda_{\theta_0}'(h_1, h_2, h_3) = \Gamma_{3,\theta_0}'^{(1)}(x, (h_1,h_2)) + h_3, \]
where $h_3 \in \mathbb{R}$. Estimate it by $\widehat\Lambda_{\theta_0}'(h) = \widehat\Gamma_{3,\theta_0}'^{(1)}(x, (h_1,h_2)) + h_3$. By the proof of lemma 7, this HDD estimator satisfies assumption 4 of Fang and Santos (2019). Thus we can apply their theorem 3.2 to get
\[
\widehat\Gamma_{3,\theta_0}'^{(1)}\big( x, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) + \mathbb{G}_n^* \Gamma_2^{(1)}(x,W,\widehat\theta)
= \widehat\Lambda_{\theta_0}'\big( \sqrt{n}(\widehat\theta^* - \widehat\theta), \mathbb{G}_n^* \Gamma_2^{(1)}(x,W,\widehat\theta) \big)
\overset{P}{\rightsquigarrow} \Lambda_{\theta_0}'(Z_1, \tilde Z_4^{(1)}(x))
= \Gamma_{3,\theta_0}'^{(1)}(x, Z_1) + \tilde Z_4^{(1)}(x) \equiv Z_4^{(1)}(x).
\]
E Analytical Bootstrap Results for the CQTE and CATE

In this section we formally derive bootstrap consistency for the CQTE and CATE.

Proposition 6 (CQTE Bootstrap). Suppose the assumptions of proposition 5 hold. Let $\kappa_n \to 0$, $\sqrt{n}\kappa_n \to \infty$, $\eta_n \to 0$, and $\sqrt{n}\eta_n \to \infty$ as $n \to \infty$. Then
\[ \widehat\Gamma_{1,\theta_0}'\big( x, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \overset{P}{\rightsquigarrow} \Gamma_{1,\theta_0}'(x, w, \tau, Z_1). \]

This proposition implies that the asymptotic distribution of the CQTE bounds can be approximated by the bootstrap distribution of
\[ \begin{pmatrix} \widehat\Gamma_{1,\theta_0}'^{(1)}\big( 1, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) - \widehat\Gamma_{1,\theta_0}'^{(2)}\big( 0, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \\ \widehat\Gamma_{1,\theta_0}'^{(2)}\big( 1, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) - \widehat\Gamma_{1,\theta_0}'^{(1)}\big( 0, w, \tau, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \end{pmatrix}. \]
We also show the bootstrap consistency for the CATE.
Proposition 7 (CATE Bootstrap). Suppose the assumptions of proposition 1 hold. Let $\kappa_n \to 0$, $\sqrt{n}\kappa_n \to \infty$, $\eta_n \to 0$, and $\sqrt{n}\eta_n \to \infty$ as $n \to \infty$. Then
\[ \widehat\Gamma_{2,\theta_0}'\big( x, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \overset{P}{\rightsquigarrow} \Gamma_{2,\theta_0}'(x, w, Z_1). \]

Like with the CQTE, the asymptotic distribution of the CATE bounds can be approximated by the bootstrap distribution of
\[ \begin{pmatrix} \widehat\Gamma_{2,\theta_0}'^{(1)}\big( 1, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) - \widehat\Gamma_{2,\theta_0}'^{(2)}\big( 0, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \\ \widehat\Gamma_{2,\theta_0}'^{(2)}\big( 1, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) - \widehat\Gamma_{2,\theta_0}'^{(1)}\big( 0, w, \sqrt{n}(\widehat\theta^* - \widehat\theta) \big) \end{pmatrix}. \]

E.1 Proofs
Proof of proposition 6.
As in the proof of lemma 7 we'll use theorem 3.2 in Fang and Santos (2019). To do this we must verify their assumptions 1-4. Their assumptions 1-3 hold as in the proof of lemma 7. By their remark 3.4, sufficient conditions for their assumption 4 are:

1. A smoothness condition: $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ is Lipschitz in $h$. This holds by lemma 6.
2. A consistency condition: $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ converges in probability to $\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$ for any $h \in \mathbb{R}^{d_W} \times C([\varepsilon,1-\varepsilon],\mathbb{R}^{d_q})$. To see that this holds, recall that in the proof of lemma 7 we showed that
\[ q(x,w)'h_2(S^{(1)}(x,w,\tau,\widehat\beta)) \xrightarrow{p} q(x,w)'h_2(S^{(1)}(x,w,\tau,\beta_0)) \]
and
\[ q(x,w)'\widehat\gamma'(S^{(1)}(x,w,\tau,\widehat\beta))\, \bar T^{(1)}(x,w,\tau,\widehat\beta,h_1,\kappa_n) \xrightarrow{p} q(x,w)'\gamma_0'(S^{(1)}(x,w,\tau,\beta_0))\, \bar T^{(1)}(x,w,\tau,\beta_0,h_1,0). \]
By the definition of $\widehat\Gamma_{1,\theta_0}'$, these two results imply $\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h) \xrightarrow{p} \Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)$.

Similar arguments can be used for the lower bound, to show that $\widehat\Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h)$ is Lipschitz in $h$, and that $\widehat\Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h) \xrightarrow{p} \Gamma_{1,\theta_0}'^{(2)}(x,w,\tau,h)$.
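As a usage sketch, bootstrap draws of $\widehat\Gamma_{1,\theta_0}'(1,w,\tau,\cdot) - \widehat\Gamma_{1,\theta_0}'(0,w,\tau,\cdot)$ can be turned into a pointwise confidence interval for a CQTE bound; `cqte_draws` below is a placeholder array of such draws across replications.

```python
import numpy as np

def cqte_ci(cqte_hat, cqte_draws, n, alpha=0.05):
    # Percentile-type interval for one (w, tau): invert the bootstrap
    # approximation of the law of sqrt(n) * (estimator - estimand).
    lo, hi = np.quantile(cqte_draws, [alpha / 2, 1 - alpha / 2])
    return cqte_hat - hi / np.sqrt(n), cqte_hat - lo / np.sqrt(n)
```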
Proof of proposition 7.

The proof of this result is similar to the proof of proposition 6. Like there, we show that the two sufficient conditions for assumption 4 in theorem 3.2 of Fang and Santos (2019) hold.

1. A smoothness condition: $\widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,h)$ is Lipschitz in $h$. To see this, write
\[
|\widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,\tilde h) - \widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,h)|
\le \int_0^1 |\widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,\tilde h) - \widehat\Gamma_{1,\theta_0}'^{(1)}(x,w,\tau,h)|\, d\tau
\le \int_0^1 K(x,w,\widehat\theta)\, d\tau \, \|\tilde h - h\|_\Theta
= K(x,w,\widehat\theta) \cdot \|\tilde h - h\|_\Theta,
\]
where the second line follows by the proof of lemma 6, and where $K(x,w,\widehat\theta)$ is defined in lemma 6 and is shown to be $O_p(1)$. So $\widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,h)$ is Lipschitz in $h$.
2. A consistency condition: $\widehat\Gamma_{2,\theta_0}'^{(1)}(x,w,h) \xrightarrow{p} \Gamma_{2,\theta_0}'^{(1)}(x,w,h)$. This result follows from arguments similar to those in the proof of proposition 1 and the dominated convergence theorem.

Similar arguments can be used for the lower bound, to show that $\widehat\Gamma_{2,\theta_0}'^{(2)}(x,w,h)$ is Lipschitz in $h$, and that $\widehat\Gamma_{2,\theta_0}'^{(2)}(x,w,h) \xrightarrow{p} \Gamma_{2,\theta_0}'^{(2)}(x,w,h)$.

F Proofs for Section 6
Proof of theorem 2.
Recall that our ATE bounds depend on the functionals $\overline{\Gamma}_1(x, \theta)$. We will show that $P(p_{x|W} \in \{c, 1-c\}) = 0$ implies that $\overline{\Gamma}'_{1,\theta}(x, h)$ is linear in $h$. This, in turn, implies that $\overline{\Gamma}_1(x, \theta)$ is Hadamard differentiable at $\theta$. Consistency of the standard bootstrap then follows from the delta method for the bootstrap; see theorem 3.9.11 in van der Vaart and Wellner (1996). Thus it suffices to show that $\overline{\Gamma}'_{1,\theta}(x, h)$ is linear in $h$. We'll show this in three steps.

Step 1. First we show that
$$\overline{\Gamma}'_{1,\theta}(x, w, \tau, h) = q(x,w)' \gamma'\big(\overline{S}(x, w, \tau, \beta_0)\big)\, \overline{T}(x, w, \tau, \beta_0, h_1, 0) + q(x,w)' h_2\big(\overline{S}(x, w, \tau, \beta_0)\big)$$
is linear in $h$. First note that it is trivially linear in $h_2$. It is linear in $h_1$ if and only if $\overline{T}(x, w, \tau, \beta_0, h_1, 0)$ is linear in $h_1$. Recall the definition of $\overline{T}$:
$$\overline{T}(x, w, \tau, \beta_0, h_1, 0) = T(x, w, \tau, \beta_0, h_1, 0) \cdot \mathbb{1}\big(\overline{S}(x, w, \tau, \beta_0) > \varepsilon\big) + \max\{T(x, w, \tau, \beta_0, h_1, 0),\, 0\} \cdot \mathbb{1}\big(\overline{S}(x, w, \tau, \beta_0) = \varepsilon\big),$$
where
$$\overline{S}(x, w, \tau, \beta) = \min\left\{\tau + \frac{c}{L(x, w'\beta)} \min\{\tau, 1-\tau\},\ \frac{\tau}{L(x, w'\beta)},\ 1 - \varepsilon\right\}.$$
There are two parts of $\overline{T}$. We'll consider each of them separately.
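As a concrete reference point, here is a direct numerical implementation of $\overline{S}$, treating the propensity score value $p = L(x, w'\beta)$ as a given number. It is a minimal sketch; the function name and arguments are not from the paper's companion code.

```python
def S_bar(tau, p, c, eps):
    """Censored quantile position from the display above:
    the minimum of the two bound branches, capped at 1 - eps."""
    return min(tau + (c / p) * min(tau, 1 - tau), tau / p, 1 - eps)
```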
The second part. For given $(x, w, \beta_0, c, \varepsilon)$, the set $\{\tau \in (0,1) : \overline{S}(x, w, \tau, \beta_0) = \varepsilon\}$ has Lebesgue measure zero. This is the case since $\overline{S}(x, w, \tau, \beta_0)$ is strictly increasing in $\tau$ whenever $\overline{S}(x, w, \tau, \beta_0) < 1 - \varepsilon$, and $\varepsilon < 1 - \varepsilon$ by assuming $\varepsilon < 1/2$. Therefore, although $\max\{T(x, w, \tau, \beta_0, h_1, 0), 0\}$ may be nonlinear in $h_1$, the term
$$\max\{T(x, w, \tau, \beta_0, h_1, 0),\, 0\} \cdot \mathbb{1}\big(\overline{S}(x, w, \tau, \beta_0) = \varepsilon\big)$$
is nonlinear only on a measure zero set of $\tau$'s.

The first part. Next we study linearity of $T(x, w, \tau, \beta_0, h_1, 0)$ in $h_1$. Recall from appendix C that it can be written as
$$T(x, w, \tau, \beta_0, h_1, 0) = \sum_{j=1}^{7} T_{1,j}(x, w, \tau, \beta_0, h_1) \cdot \mathbb{1}_{1,j}(x, w, \tau, \beta_0, 0).$$
By examining the specific functional forms given in appendix C, we immediately see that the functions $T_{1,j}(x, w, \tau, \beta_0, h_1)$ are linear in $h_1$ for $j \in \{1,2,3\}$ and nonlinear for $j \in \{4,5,6,7\}$. The main question is for how many values of $(\tau, w)$ the indicators $\mathbb{1}_{1,j}$ equal 1 for $j \in \{4,5,6,7\}$. We'll show that these indicators equal 1 only on a set of $\tau$'s and $w$'s that has measure zero under the product measure with the Lebesgue measure and $F_W$ as the marginals.

Define
$$\mathcal{S}_j(p_{x|w}, c) \equiv \{\tau \in (0,1) : \mathbb{1}_{1,j}(x, w, \tau, \beta_0, 0) = 1\}$$
for $j \in \{4,5,6,7\}$. For $j = 4$, by the definition of $\mathbb{1}_{1,4}$ in appendix C we have
$$\mathcal{S}_4(p_{x|w}, c) = \left\{\tau \in (0,1) : \tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\} - \frac{\tau}{p_{x|w}} = 0\right\} \cap \left\{\tau \in (0,1) : \max\left\{\tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\},\ \frac{\tau}{p_{x|w}}\right\} < 1 - \varepsilon\right\} \equiv \mathcal{S}_{4,a}(p_{x|w}, c) \cap \mathcal{S}_{4,b}(p_{x|w}, c).$$
We write the first set as
$$\mathcal{S}_{4,a}(p_{x|w}, c) = \left\{\tau \in (0, 1/2] : \tau\left(1 + \frac{c}{p_{x|w}} - \frac{1}{p_{x|w}}\right) = 0\right\} \cup \left\{\tau \in (1/2, 1) : \tau\left(1 - \frac{c}{p_{x|w}} - \frac{1}{p_{x|w}}\right) = -\frac{c}{p_{x|w}}\right\} = \begin{cases} \varnothing & \text{if } c < 1 - p_{x|w} \\ (0, 1/2] & \text{if } c = 1 - p_{x|w} \\ \left\{\dfrac{c}{c + 1 - p_{x|w}}\right\} & \text{if } c > 1 - p_{x|w}. \end{cases}$$
Thus if $c \neq 1 - p_{x|w}$ then $\mathcal{S}_{4,a}(p_{x|w}, c)$ has Lebesgue measure zero. Consequently $\mathcal{S}_4(p_{x|w}, c)$ also has Lebesgue measure zero in this case.
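To make the $j = 4$ case concrete, the following check (with the hypothetical value $p_{x|w} = 0.6$, so that $1 - p_{x|w} = 0.4$) counts the crossings of the two branches of $\overline{S}$ on a fine grid. As the derivation above predicts, there are no crossings when $c < 1 - p_{x|w}$ and exactly one, at $\tau^* = c/(c + 1 - p_{x|w})$, when $c > 1 - p_{x|w}$.

```python
import numpy as np

p = 0.6                                  # hypothetical value of p_{x|w}
tau = np.linspace(1e-6, 1 - 1e-6, 1_000_001)

for c in (0.2, 0.7):                     # c < 1 - p and c > 1 - p
    A = tau + (c / p) * np.minimum(tau, 1 - tau)   # first branch
    B = tau / p                                    # second branch
    n_cross = np.count_nonzero(np.diff(np.sign(A - B)))
    tau_star = c / (c + 1 - p) if c > 1 - p else None
    print(f"c = {c}: {n_cross} crossing(s), predicted tau* = {tau_star}")
# prints 0 crossings for c = 0.2 and 1 crossing near tau* ≈ 0.636 for c = 0.7
```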
Next consider $j = 5$. As before, by the definition of $\mathbb{1}_{1,5}$ we have
$$\mathcal{S}_5(p_{x|w}, c) = \left\{\tau \in (0,1) : \tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\} - (1 - \varepsilon) = 0\right\} \cap \left\{\tau \in (0,1) : \max\left\{\tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\},\ 1 - \varepsilon\right\} < \frac{\tau}{p_{x|w}}\right\} \equiv \mathcal{S}_{5,a}(p_{x|w}, c) \cap \mathcal{S}_{5,b}(p_{x|w}, c).$$
Write the first set as
$$\mathcal{S}_{5,a}(p_{x|w}, c) = \left\{\tau \in (0, 1/2] : \tau\left(1 + \frac{c}{p_{x|w}}\right) = 1 - \varepsilon\right\} \cup \left\{\tau \in (1/2, 1) : \tau\left(1 - \frac{c}{p_{x|w}}\right) = 1 - \varepsilon - \frac{c}{p_{x|w}}\right\}.$$
Since $1 + \frac{c}{p_{x|w}} \neq 0$, the set $\left\{\tau \in (0, 1/2] : \tau\left(1 + \frac{c}{p_{x|w}}\right) = 1 - \varepsilon\right\}$ contains at most one point. Likewise,
$$\left\{\tau \in (1/2, 1) : \tau\left(1 - \frac{c}{p_{x|w}}\right) = 1 - \varepsilon - \frac{c}{p_{x|w}}\right\}$$
contains at most one point whenever $c \neq p_{x|w}$. When $c = p_{x|w}$ this set equals $\{\tau \in (1/2, 1) : 0 = -\varepsilon\}$, which is empty since $\varepsilon > 0$. Thus $\mathcal{S}_{5,a}(p_{x|w}, c)$ has Lebesgue measure zero. Consequently, $\mathcal{S}_5(p_{x|w}, c)$ also has Lebesgue measure zero.

Next consider $j = 6$. We have
$$\mathcal{S}_6(p_{x|w}, c) = \left\{\tau \in (0,1) : \frac{\tau}{p_{x|w}} - (1 - \varepsilon) = 0\right\} \cap \left\{\tau \in (0,1) : \max\left\{\frac{\tau}{p_{x|w}},\ 1 - \varepsilon\right\} < \tau + \frac{c}{p_{x|w}} \min\{\tau, 1-\tau\}\right\} \subseteq \left\{\tau \in (0,1) : \tau = (1 - \varepsilon) p_{x|w}\right\}.$$
The first line follows from the definition of $\mathbb{1}_{1,6}$. The last line follows from looking at the first set in the intersection in the first line. This last set is a singleton and hence has Lebesgue measure zero. Thus $\mathcal{S}_6(p_{x|w}, c)$ has Lebesgue measure zero.

Finally, consider $j = 7$. This case is a combination of the above cases. Hence we can show that $\mathcal{S}_7(p_{x|w}, c)$ has Lebesgue measure zero by repeating some of the above steps.

Step 2. From step 1 we see that for any $w \in \mathcal{W}$ such that $p_{x|w} \neq 1 - c$, the mapping $\overline{\Gamma}'_{1,\theta}(x, w, \tau, h)$ is linear in $h$ for all $\tau \in (0,1)$ except for a Lebesgue measure zero set. Denote this set by $\mathcal{T}$. Then
$$\overline{\Gamma}'_{1,\theta}(x, w, h) = \int_0^1 \overline{\Gamma}'_{1,\theta}(x, w, \tau, h)\, d\tau = \int_{\tau \notin \mathcal{T}} \overline{\Gamma}'_{1,\theta}(x, w, \tau, h)\, d\tau + \int_{\tau \in \mathcal{T}} \overline{\Gamma}'_{1,\theta}(x, w, \tau, h)\, d\tau = \int_{\tau \notin \mathcal{T}} \overline{\Gamma}'_{1,\theta}(x, w, \tau, h)\, d\tau.$$
The first equality follows by the definition of $\overline{\Gamma}_1$. The last follows since $\mathcal{T}$ has Lebesgue measure zero. Since integrals are linear operators, we see that $\overline{\Gamma}'_{1,\theta}(x, w, h)$ is linear in $h$ for any $w \in \mathcal{W}$ such that $p_{x|w} \neq 1 - c$.

Step 3. We have
$$\overline{\Gamma}'_{1,\theta}(x, h) = \int_{\mathcal{W}} \overline{\Gamma}'_{1,\theta}(x, w, h)\, dF_W(w) = \int_{\{w \in \mathcal{W} :\, p_{x|w} \notin \{c, 1-c\}\}} \overline{\Gamma}'_{1,\theta}(x, w, h)\, dF_W(w) + \int_{\{w \in \mathcal{W} :\, p_{x|w} \in \{c, 1-c\}\}} \overline{\Gamma}'_{1,\theta}(x, w, h)\, dF_W(w) = \int_{\{w \in \mathcal{W} :\, p_{x|w} \notin \{c, 1-c\}\}} \overline{\Gamma}'_{1,\theta}(x, w, h)\, dF_W(w).$$
The first equality follows by the definition of $\overline{\Gamma}_1$. The third follows since we assumed $p_{x|W} \in \{c, 1-c\}$ occurs with probability zero. By step 2, $\overline{\Gamma}'_{1,\theta}(x, w, h)$ is linear in $h$ on the set over which we are integrating in the last expression. Since integrals are linear operators, this implies $\overline{\Gamma}'_{1,\theta}(x, h)$ is linear in $h$. Similar calculations for the lower bound show that $\underline{\Gamma}'_{1,\theta}(x, h)$ is linear when $P(p_{x|W} \in \{c, 1-c\}) = 0$. Thus we have shown that $\overline{\Gamma}'_{1,\theta}(x, h)$ is linear in $h$, as desired.
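The kink logic behind this condition can be seen in a one-dimensional analogue, sketched below. The directional derivative of $\beta \mapsto \min\{a'\beta, b'\beta\}$ at a point where the two pieces tie is $h \mapsto \min\{a'h, b'h\}$, which is positively homogeneous but not additive; this is exactly the type of non-linearity that breaks the standard bootstrap (cf. Fang and Santos 2019). The vectors `a` and `b` below are arbitrary illustrative choices, not objects from the paper.

```python
import numpy as np

# Directional derivative of f(beta) = min(a'beta, b'beta) at a kink point
# beta0 with a'beta0 = b'beta0: the derivative map is h -> min(a'h, b'h).
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
deriv = lambda h: min(a @ h, b @ h)

h = np.array([1.0, -1.0])
print(deriv(h) + deriv(-h))   # -2.0, not 0: the derivative map is not linear
```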
Proof of proposition 4. By the proof of theorem 2, the mapping $\overline{\Gamma}_1(0, \theta)$ is Hadamard differentiable when $P(p_{0|W} = 1 - c) = P(p_{0|W} = c) = 0$.