Identification and Estimation of Unconditional Policy Effects of an Endogenous Binary Treatment
Julián Martínez-Iriarte and Yixiao Sun*
Department of Economics, UC San Diego
October 27, 2020
Abstract
This paper studies the identification and estimation of unconditional policy effects when the treatment is binary and endogenous. We first characterize the asymptotic bias of the unconditional regression estimator that ignores the endogeneity and elaborate on the channels through which the endogeneity can render the unconditional regression estimator inconsistent. We show that even if the treatment status is exogenous, the unconditional regression estimator can still be inconsistent when there are common covariates affecting both the treatment status and the outcome variable. We introduce a new class of marginal treatment effects (MTE) based on the influence function of the functional underlying the policy target. We show that an unconditional policy effect can be represented as a weighted average of the newly defined MTEs over the individuals at the margin of indifference. Point identification is achieved using the local instrumental variable approach. Furthermore, the unconditional policy effects are shown to include the marginal policy-relevant treatment effect in the literature as a special case. Methods of estimation and inference for the unconditional policy effects are provided. In the empirical application, we estimate the effect of changing college enrollment status, induced by a higher tuition subsidy, on the quantiles of the wage distribution.
Keywords: unconditional quantile regressions, unconditional policy effect, selection models, instrumental variables, marginal treatment effect, marginal policy-relevant treatment effect.

*Email: [email protected]; [email protected]. For helpful and constructive comments, we thank seminar participants at UCSD, especially Xinwei Ma and Kaspar Wuthrich. This research was conducted with restricted access to Bureau of Labor Statistics (BLS) data. The views expressed here do not necessarily reflect the views of the BLS.
Introduction
An unconditional policy effect is the effect of a change in the (unconditional) distribution of a target covariate on the (unconditional) distribution of an outcome variable of interest. When the target covariate has a continuous distribution, we may be interested in shifting its location and evaluating the effect on the distribution of the outcome variable. For example, we may consider increasing the number of years of education for every worker in order to improve the median of the wage distribution. When the change of the covariate distribution is small, such an effect may be referred to as the marginal unconditional policy effect. To estimate the unconditional quantile effect (UQE), a particular unconditional policy effect, Firpo, Fortin and Lemieux (2009) develop the method of unconditional quantile regressions (UQR).

In this paper, we consider a binary target covariate that indicates the treatment status. In this case, a location shift is not possible, and the only way to change its distribution is to change the proportion of treated individuals. How does such a change affect the distribution of an outcome variable? In particular, how do the quantiles of the outcome variable change if more people are induced to take up the treatment? Firpo, Fortin and Lemieux (2009) address this question and achieve identification by assuming a form of distributional invariance, which is unlikely to hold when the treatment status is endogenous. As the main departures of this paper from the literature, we allow the binary treatment to be endogenous, and we allow other covariates to affect the treatment status and the outcome variable. We focus on the identification and estimation of unconditional quantile effects, but we also extend our results to cover general unconditional policy effects.

The first contribution of this paper is to show that the UQR estimator can be severely biased for the UQE if endogeneity is neglected.
One may ask if, as in a linear model, the asymptotic bias can be signed based on a correlation coefficient. The answer to this question is negative. The reason is that the asymptotic bias can be non-uniform across quantiles. We could have a positive bias for the 10th quantile but a negative bias for the 11th quantile. Strong assumptions have to be imposed on the data generating process to sign the bias a priori.

For the case with a binary treatment, the UQR of Firpo, Fortin and Lemieux (2009) does not explicitly account for the presence of additional covariates. We show that in some cases, even if the treatment status is exogenous, the presence of other covariates can render the UQR estimator inconsistent. For example, this can occur in situations where the treatment status is partly determined by the covariates that also affect the outcome variable, that is, when the selection is based on these covariates.

There are three groups of variables in our study. The target covariate is the variable whose distribution we aim to intervene on. We often refer to the target covariate as the treatment or treatment variable. The outcome variable is the variable that a policy maker ultimately cares about. A policy maker hopes to change the distribution of the target covariate in order to achieve a desired effect on the distribution of the outcome variable. The third group of variables consists of other covariates whose distributions will not be manipulated. Mukhin (2019) generalizes Firpo, Fortin and Lemieux (2009) to allow for non-marginal changes in continuous covariates. See the last paragraph of Section 3 in Firpo, Fortin and Lemieux (2009) or Corollary 3 in the working paper version, Firpo, Fortin and Lemieux (2007).

To account for the endogeneity, we follow Heckman and Vytlacil (1999, 2001a, 2005) and use a threshold-crossing model for treatment selection. In this case, an instrumental variable affects the selection, but it is absent from the outcome equation.
We show that the threshold-crossing model implies that individuals who are indifferent between taking up the treatment and not taking up the treatment will drive the unconditional quantile effect. We introduce a new class of marginal treatment effects (MTEs) and show that the unconditional quantile effect can be represented as a weighted average of these MTEs. The MTE we introduce is based on the influence function of the quantile functional. It is related to the marginal treatment effect introduced by Bjorklund and Moffitt (1987) and further studied by Heckman (1997) but is also distinctly different. Identification is achieved using the local instrumental variable approach as in Carneiro and Lee (2009).

The third contribution of this paper is to show that the unconditional quantile effect and the marginal policy-relevant treatment effect (MPRTE) of Carneiro, Heckman and Vytlacil (2010) belong to the same family of parameters. To the best of our knowledge, this was not previously recognized in either literature. This stems from the fact that our method is more general and allows us to estimate the effect on any (well-behaved) functional of the outcome distribution. In this case, we just need to work with the influence function of this functional. Common examples of a general functional include the quantiles, the mean, and the Gini coefficient.

The fourth contribution of the paper is to develop methods of statistical inference on the UQE when the binary treatment is endogenous. We take a nonparametric approach but allow the propensity score function to be either parametric or nonparametric. We establish the asymptotic distribution of the UQE estimator. This is a formidable task, as the UQE estimator is a four-step estimator, and we have to pin down the estimation error from each step.
Perhaps surprisingly, we show that the error from estimating the propensity score function, either parametrically or nonparametrically, does not affect the asymptotic variance of our UQE estimator.

We are not the first to consider unconditional quantile regressions under endogeneity. Kasy (2016) focuses on the ranking of counterfactual policies and, for the case of discrete regressors, allows for endogeneity. However, one key difference from our approach is that the counterfactual policies analyzed in Kasy (2016) are randomly assigned conditional on a covariate vector. In our setting, selection into treatment follows a threshold-crossing model, where we use the exogenous variation of an instrument to obtain different counterfactual scenarios. Our goal is not to rank potential policies, although our method can be used to rank the class of policies, each of which changes a different instrumental variable. For the case of continuous endogenous covariates, Rothe (2010) shows that the control function approach of Imbens and Newey (2009) can be used to achieve identification. Unlike Rothe (2010), we do not make the unconfoundedness assumption here.

In related work, Kaplan (2019) considers a setting where conditional independence holds, that is, treatment is independent of potential outcomes given some covariates, and characterizes which set of policies does not affect the conditional distribution of the outcome variable given the treatment and the covariates. The distributional invariance condition, which will be more formally explained in Section 2, is precisely the assumption we relax. We maintain neither conditional independence nor distributional invariance in this paper.

Our general treatment of the problem using functionals is closely related to that of Rothe (2012). Rothe (2012) analyzes the effect of an arbitrary unconditional change in the distribution of a target covariate, either continuous or discrete, on some feature of the distribution of the outcome variable.
By assuming a form of conditional independence, for the case of continuous target covariates, Rothe (2012) generalizes the approach of Firpo, Fortin and Lemieux (2009). However, for the case of a discrete treatment, bounds are obtained by assuming that either the highest-ranked or lowest-ranked individuals enter the program under the new policy.

More recently, Zhou and Xie (2019) provide a new form of the marginal treatment effect parameter that conditions on the propensity score instead of on the whole vector of covariates. Their results allow them to obtain an easy representation of the mean effect when individuals are at the margin of indifference. While our results are also driven by individuals at the margin of indifference, we do not use their redefinition of the marginal treatment effect parameter.

A word on notation: for any generic random variable W1, we denote its CDF and pdf by F_{W1}(·) and f_{W1}(·), respectively. We denote its conditional CDF and pdf conditional on a second random variable W2 by F_{W1|W2}(·|·) and f_{W1|W2}(·|·), respectively.

The paper proceeds as follows: Section 2 introduces the unconditional quantile effect under exogeneity. Section 3 presents a model for studying the unconditional quantile effect under endogeneity. Section 4 develops the marginal treatment effect approach and discusses identification. Section 5 considers the case of a general functional. Section 6 formally establishes the link between the unconditional quantile effect and the marginal policy-relevant treatment effect. We consider estimation and inference under a parametric propensity score in Section 7 and under a nonparametric propensity score in Section 8. We revisit the empirical application of Carneiro, Heckman and Vytlacil (2011) and focus on unconditional quantile effects in Section 9. Section 10 concludes. We relegate all proofs to the appendix.

Unconditional Quantile Effect with a Binary Treatment
In this section, we present the basic setting and motivate our study. Suppose we have the following causal or structural model

y = r(d, u),  (1)

where d is a binary variable indicating the treatment status and u is a vector that consists of all other unobserved causal factors. The model is causal or structural in the sense that if we set (d, u) equal to some target value (d_o, u_o), then we would observe the outcome of interest y_o = r(d_o, u_o). The function r(·, ·) characterizes the potential outcomes under all possible settings of (d, u). In an observational study with a simple random sample, the settings of (d, u) can be regarded as i.i.d. draws from a certain population. For each draw (D, U), we observe Y according to

Y = r(D, U).  (2)

Note that the map from (D, U) to Y is deterministic. The randomness comes from the randomness of (D, U). The population distribution of Y follows from that of (D, U).

In the case when D and U are independent, the joint distribution of (D, U) is characterized by their respective marginal distributions. For the binary variable D, the marginal distribution boils down to the probability of treatment. We consider two different probabilities: p and p + δ. We can think of them as the probabilities of treatment under the existing and new policy regimes. To avoid any possibility of confusion, we now use D_δ to denote the treatment status under the new policy regime. The outcome is now given by

Y_δ = r(D_δ, U),  (3)

where Pr(D_δ = 1) = p + δ.

Let y_τ be the τ-quantile of Y, and let y_{τ,δ} be the τ-quantile of Y_δ. That is, Pr(Y ≤ y_τ) = Pr(Y_δ ≤ y_{τ,δ}) = τ. We are interested in the behavior of y_{τ,δ} as δ → 0.

Definition 1.
Unconditional Quantile Effect
The unconditional quantile effect is defined as

Π_τ := lim_{δ→0} (y_{τ,δ} − y_τ)/δ

whenever this limit exists.

By definition, we have

Pr(Y_δ ≤ y) = Pr(D_δ = 1) Pr(Y_δ ≤ y | D_δ = 1) + Pr(D_δ = 0) Pr(Y_δ ≤ y | D_δ = 0).

We still use D to denote the treatment status and Y to denote the outcome variable under the original policy regime. Suppose the distributional invariance assumption holds:

Pr(Y_δ ≤ y | D_δ = d) = Pr(Y ≤ y | D = d) for d = 0, 1.  (4)

By the distributional invariance assumption, we get

Pr(Y_δ ≤ y) = Pr(D_δ = 1) Pr(Y ≤ y | D = 1) + Pr(D_δ = 0) Pr(Y ≤ y | D = 0).

But, by definition, we also have

Pr(Y ≤ y) = Pr(D = 1) Pr(Y ≤ y | D = 1) + Pr(D = 0) Pr(Y ≤ y | D = 0).

Taking the difference of these two equations, we obtain

Pr(Y_δ ≤ y) − Pr(Y ≤ y) = δ [Pr(Y ≤ y | D = 1) − Pr(Y ≤ y | D = 0)].

Setting y = y_{τ,δ} and using Pr(Y_δ ≤ y_{τ,δ}) = Pr(Y ≤ y_τ) = τ, we have

[Pr(Y ≤ y_{τ,δ}) − Pr(Y ≤ y_τ)]/δ = Pr(Y > y_{τ,δ} | D = 1) − Pr(Y > y_{τ,δ} | D = 0).

Letting δ → 0 so that y_{τ,δ} → y_τ, we obtain

Π_τ = [Pr(Y > y_τ | D = 1) − Pr(Y > y_τ | D = 0)] / f_Y(y_τ)  (5)

under some mild conditions.

A key condition for obtaining the above identification result is the distributional invariance of Y given in equation (4). That is, the distributions of Y conditioning on the two different policy regimes, D and D_δ, are the same. If treatments are randomly assigned under both policy regimes, then U is clearly independent of D and D_δ, and the distributional invariance assumption is satisfied.

When we allow D and D_δ to be correlated with U, however, the distributional invariance of Y may not hold. This is the main departure of this paper and motivates our study.

Unconditional Quantile Effect under Endogeneity

In this section, we consider inducing a change in the participation rate via a certain selection rule. We characterize the unconditional quantile effect when D is endogenous. This allows us to show that ignoring the endogeneity will give rise to an asymptotically biased estimator. In addition, we provide a formula for the asymptotic bias.
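The identification result in (5) suggests a simple plug-in estimator under distributional invariance: estimate the quantile and the density from the full sample, and the two conditional survival probabilities from the treated and untreated subsamples. The sketch below is ours, not the paper's estimator; the function name and the toy data generating process are hypothetical, and the density is estimated with a Gaussian kernel for convenience.

```python
import numpy as np
from scipy.stats import gaussian_kde

def uqe_exogenous(y, d, tau):
    """Plug-in version of eq. (5):
    Pi_tau = [Pr(Y > y_tau | D = 1) - Pr(Y > y_tau | D = 0)] / f_Y(y_tau)."""
    y_tau = np.quantile(y, tau)            # unconditional tau-quantile of Y
    f_y = gaussian_kde(y)(y_tau)[0]        # kernel estimate of f_Y(y_tau)
    s1 = np.mean(y[d == 1] > y_tau)        # Pr(Y > y_tau | D = 1)
    s0 = np.mean(y[d == 0] > y_tau)        # Pr(Y > y_tau | D = 0)
    return (s1 - s0) / f_y

# Toy DGP with randomly assigned treatment, so distributional invariance holds.
rng = np.random.default_rng(0)
n = 200_000
d = rng.binomial(1, 0.5, n)
y = d + rng.normal(size=n)
est = uqe_exogenous(y, d, 0.5)
print(est)
```

For this DGP the estimate should be close to [Φ(0.5) − Φ(−0.5)]/φ(0.5) ≈ 1.09, the population value implied by (5) at the median.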
An Ill-Posed Problem

When we allow a correlation between D, the treatment status, and the unobservable U, it is necessary to be explicit about the selection mechanism. For example, we could write

D = 1{V ≤ 0},

where V is uniformly distributed on [−0.5, 0.5] and is correlated with U. Note that in this case p := Pr(D = 1) = 0.5. For δ > 0, in order to shift the participation rate to p + δ, consider the following two possibilities:

D_{δ,1} = 1{V ≤ δ},
D_{δ,2} = 1{V ≤ 0} + 1{V ≥ 0.5 − δ}.

Under D_{δ,1}, individuals for whom V ∈ (0, δ] will change their treatment status. Under D_{δ,2}, individuals for whom V ∈ [0.5 − δ, 0.5] will change their treatment status. Even though the participation rate change is the same under the treatment selection rules D_{δ,1} and D_{δ,2}, the two rules will incentivize different sets of individuals to change their treatment status. Given that V is correlated with U, the distribution of U, and hence that of Y, will be different for the new participants under D_{δ,1} than for those under D_{δ,2}. As a result, the (unconditional) treatment effect will be different under different treatment selection rules.

There are (infinitely) many possibilities for altering the selection rule to achieve the same marginal change in the participation rate. Each one of them will generate a different unconditional quantile effect. Therefore, in the presence of endogeneity, analyzing the effect of marginally shifting the participation rate is an ill-posed problem. The right question is, what is the effect of marginally altering the participation rate in a given way, say by using D_{δ,1} or D_{δ,2}? That is, we must be explicit about how we alter the participation rate. In practice, when we use a particular policy tool to induce a change in the participation rate, the mechanism under which the participation rate is altered is indeed well defined and is given a priori.

The Policy Intervention and Its Unconditional Quantile Effect

This subsection defines the policy change we are interested in and examines its unconditional quantile effect. We employ the potential outcomes framework. For each individual, there are two potential outcomes: Y(0) and Y(1), where Y(0) is the outcome had she received no treatment and Y(1) is the outcome had she received treatment.
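Returning to the ill-posed problem above, a small simulation makes the point concrete. The DGP below is hypothetical (ours, not the paper's): V is uniform on [−0.5, 0.5], the treatment gain depends on V, and two selection rules raise the participation rate by the same δ yet move the median of Y very differently.

```python
import numpy as np

# Hypothetical DGP: the gain Y(1) - Y(0) = -2V falls with V,
# so individuals with high selection cost V gain the least.
rng = np.random.default_rng(1)
n = 500_000
v = rng.uniform(-0.5, 0.5, n)
eps = rng.normal(size=n)
y0 = eps
y1 = -2.0 * v + eps

def median_effect(d_base, d_new, delta):
    """Finite-difference effect on the median of observed Y of moving
    from the baseline selection rule d_base to the new rule d_new."""
    m_base = np.median(np.where(d_base, y1, y0))
    m_new = np.median(np.where(d_new, y1, y0))
    return (m_new - m_base) / delta

delta = 0.05
d = v <= 0.0                    # baseline rule: p = Pr(D = 1) = 0.5
d1 = v <= delta                 # rule D_{delta,1}: entrants have V in (0, delta]
d2 = d | (v >= 0.5 - delta)     # rule D_{delta,2}: entrants have V in [0.5-delta, 0.5]

e1 = median_effect(d, d1, delta)
e2 = median_effect(d, d2, delta)
print(e1, e2)   # same participation increase, very different quantile effects
```

Entrants under the first rule have gains near zero, while entrants under the second rule have gains near −1, so the second rule pushes the median down much more even though δ is identical.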
Depending on the individual's actual choice of treatment, denoted by D, we observe either Y(0) or Y(1), but we can never observe both. The observed outcome is denoted by Y:

Y = (1 − D) Y(0) + D Y(1).  (6)

We assume that the potential outcomes are given by

Y(0) = r_0(W, U_0), Y(1) = r_1(W, U_1),  (7)

for a pair of unknown functions r_0 and r_1. The vector W consists of observables, and (U_0, U_1) consists of unobservables.

Following Heckman and Vytlacil (1999, 2001a, 2005), we assume that selection into treatment is determined by a threshold-crossing equation

D = 1{V ≤ μ(W)},  (8)

where μ(W) can be regarded as the benefit from the treatment and V as the cost of the treatment. Individuals decide to take up the treatment if and only if its benefit outweighs its cost. Alternatively, we can think of μ(W) as the utility and V as the disutility from participating in the program. While we observe (D, W, Y), we observe neither U := (U_0, U_1) nor V. Also, we do not restrict the dependence among U, W, and V; hence, D could be endogenous.

Define the propensity score to be P(w) := Pr(D = 1 | W = w), which can be represented as

P(w) = Pr(V ≤ μ(W) | W = w) = F_{V|W}(μ(w) | w).

If the conditional CDF F_{V|W}(·|w) is a strictly increasing function for all w ∈ 𝒲, the support of W, we have

D = 1{V ≤ μ(W)} = 1{F_{V|W}(V | W) ≤ F_{V|W}(μ(W) | W)} = 1{U_D ≤ P(W)},

where U_D = F_{V|W}(V | W).

We can show that U_D is uniform on [0, 1] and is independent of W. To this end, we denote F_{V|W}(v | w) := Pr(V ≤ v | W = w) by G_w(v). We use the notation G_w(v) when we view F_{V|W}(v | w) as a function of v for a given w. We have

Pr(U_D ≤ u | W = w) = Pr(F_{V|W}(V | W) ≤ u | W = w) = Pr(F_{V|W}(V | w) ≤ u | W = w)
= Pr(G_w(V) ≤ u | W = w) = Pr(V ≤ G_w^{-1}(u) | W = w) = G_w(G_w^{-1}(u)) = u.

By the law of iterated expectations, we have Pr(U_D ≤ u) = u. Also, Pr(U_D ≤ u | W = w) does not depend on w, so U_D is indeed uniform on [0, 1] and is independent of W. Note that, in general, if U_D is a deterministic function of both V and W, then U_D is not independent of (V, W), but this does not rule out the possibility that U_D is independent of W. Conditioning on W, U_D is a deterministic function of V only, and hence U_D and V are dependent conditioning on W.

If we let

r(D, W, U) = (1 − D) r_0(W, U_0) + D r_1(W, U_1),

then we have Y = r(D, W, U). Thus the potential outcomes framework can be cast into a structural modeling framework with a special causal function r. From this point of view, the model is more general than the model in (1), as an additional covariate vector W has been included in both the outcome equation and the treatment selection equation.

Unlike the case of a continuous causal variable, a location shift D + δ is not implementable. Instead, we consider a policy intervention that changes the propensity score from P(w) to P_δ(w) such that the participation rate is increased (in expectation) by δ. That is, P_δ(w) is a function of w only and satisfies

E[P_δ(W)] = E[P(W)] + δ.

Naturally, P_0(w) = P(w). Under the new policy, our model becomes

Y_δ = r(D_δ, W, U) = (1 − D_δ) Y(0) + D_δ Y(1), D_δ = 1{U_D ≤ P_δ(W)}.  (9)

Note that the model described by (9) coincides with the model given in (6) and (8) if we set δ = 0. For notational convenience, when δ = 0, we drop the subscript, and we write Y, D, and P(W) for Y_0, D_0, and P_0(W), respectively. It is important to highlight that, regardless of the value of δ, the dependence pattern between U and V given W is the same. More precisely, the conditional distribution of (U, V) given W is invariant to the value of δ. Equivalently, the conditional distribution of (U, U_D) given W under δ = δ_o for any δ_o is the same as that under δ = 0. One example of such an intervention shifts the benefit function by an amount s_μ(δ), with the normalization that s_μ(0) = 0. In this case, we would have

P_δ(w) = Pr[V ≤ μ(w) + s_μ(δ) | W = w] = F_{V|W}(μ(w) + s_μ(δ) | w).  (10)

Effectively, we induce a location change in the benefit function (or the cost function) while keeping intact the dependence between U and V (or U_D) given W. In this section, however, we are concerned only with a generic policy that changes the propensity score from P(w) to P_δ(w). We are agnostic about how the policy intervention achieves this change; we impose only the restriction that E[P_δ(W)] = p + δ.

In order to find the unconditional quantile treatment effect, we first make a support assumption.

Assumption 1.
Support Assumption

For d = 0, 1, the support of Y(d) given (U_D, W) does not depend on (U_D, W).

Note that D_δ is a function of (U_D, W). Under the above assumption, the support of Y(d) given D_δ also does not depend on D_δ. We denote the support of Y(d) by 𝒴(d), which is the common support regardless of whether we condition on (U_D, W). It is also the common support regardless of whether we condition on D_δ.

For some ε > 0, define N_ε := {δ : |δ| ≤ ε}. For every δ ∈ N_ε, we have

F_{Y_δ}(y_{τ,δ}) = (p + δ) Pr(Y(1) ≤ y_{τ,δ} | D_δ = 1) + (1 − p − δ) Pr(Y(0) ≤ y_{τ,δ} | D_δ = 0)
= ∫_{𝒴(1)} 1{y ≤ y_{τ,δ}} (p + δ) f_{Y(1)|D_δ}(y | 1) dy + ∫_{𝒴(0)} 1{y ≤ y_{τ,δ}} (1 − p − δ) f_{Y(0)|D_δ}(y | 0) dy,  (11)

where we have used the support assumption so that the support of Y(d) given D_δ = d is still 𝒴(d).

To expand (p + δ) f_{Y(1)|D_δ}(y | 1) and (1 − p − δ) f_{Y(0)|D_δ}(y | 0) around δ = 0, we make some technical assumptions.
Assumption 2. Regularity Conditions

(a) For d = 0, 1, the random variables (Y(d), U_D, W) are absolutely continuous with joint density given by f_{Y(d),U_D|W} f_W.

(b) (i) For d = 0, 1, u ↦ f_{Y(d)|U_D,W}(y | u, w) is continuous for almost all y ∈ 𝒴(d) and almost all w ∈ 𝒲. (ii) For d = 0, 1, for almost all y ∈ 𝒴(d), sup_{w∈𝒲} sup_{δ∈N_ε} f_{Y(d)|U_D,W}(y | P_δ(w), w) < ∞.

(c) (i) p = E[P(W)] ∈ (0, 1). (ii) For each w ∈ 𝒲, the map δ ↦ P_δ(w) is continuously differentiable on N_ε. (iii) sup_{w∈𝒲} sup_{δ∈N_ε} |∂P_δ(w)/∂δ| < ∞.

Assumption 3. Domination Conditions

For d = 0, 1,

∫_{𝒴(d)} sup_{δ∈N_ε} f_{Y(d)|D_δ}(y | d) dy < ∞, ∫_{𝒴(d)} sup_{δ∈N_ε} |∂f_{Y(d)|D_δ}(y | d)/∂δ| dy < ∞.

In Assumption 2, the supremum over w ∈ 𝒲 can be replaced by the essential supremum over w ∈ 𝒲.

Lemma 1.
Let Assumptions 1 and 2 hold. For d = 0, 1, the map δ ↦ f_{Y(d)|D_δ}(y | d) is continuously differentiable on N_ε for almost all y ∈ 𝒴(d).

Using Lemma 1, we expand F_{Y_δ}(y) in (11) around δ = 0.

Lemma 2.

Let Assumptions 1–3 hold. Then

F_{Y_δ}(y) = F_Y(y) + δ E[{F_{Y(1)|U_D,W}(y | P(W), W) − F_{Y(0)|U_D,W}(y | P(W), W)} Ṗ(W)] + o(|δ|),

uniformly over y ∈ 𝒴 := 𝒴(0) ∪ 𝒴(1) as δ → 0, where Ṗ(w) := ∂P_δ(w)/∂δ evaluated at δ = 0.

Remark 1.

Lemma 2 provides a linear approximation to F_{Y_δ}, the CDF of the outcome variable under D_δ. Essentially, it says that the proportion of individuals with outcome below y under the new selection rule, that is, F_{Y_δ}(y), will be equal to the proportion of individuals with outcome below y under the original selection rule, that is, F_Y(y), plus an adjustment given by the marginal entrants. Consider δ > 0 and P_δ(w) > P(w) for all w ∈ 𝒲 as an example. In this case, because of the policy intervention, the individuals who are on the margin, namely those with u_D = P(w), will switch their treatment status from 0 to 1. Such a switch contributes to F_{Y_δ}(y) by the amount F_{Y(1)|U_D,W}(y | P(w), w) − F_{Y(0)|U_D,W}(y | P(w), w), averaged over the distribution of W for a certain subpopulation. We will show later that the subpopulation is exactly the group of individuals who are on the margin under the existing policy regime.

Theorem 1.
Let Assumptions 1–3 hold. Assume further that f_Y(y_τ) > 0. Then

Π_τ ≡ lim_{δ→0} (y_{τ,δ} − y_τ)/δ
= (1/f_Y(y_τ)) ∫_𝒲 E[1{Y(0) ≤ y_τ} | U_D = P(w), W = w] Ṗ(w) f_W(w) dw
− (1/f_Y(y_τ)) ∫_𝒲 E[1{Y(1) ≤ y_τ} | U_D = P(w), W = w] Ṗ(w) f_W(w) dw.  (12)

Theorem 1 shows that, among the individuals with W = w, only those for whom u_D = P(w) will contribute to the unconditional quantile effect. Among the group defined by W = w, there is a subgroup of individuals who are indifferent between participating and not participating: those for whom u_D = P(w), that is, those for whom v satisfies F_{V|W}(v | w) = P(w). A small incentive will induce a change in the treatment status for only this subgroup of individuals. It is the change in the treatment status, and hence the change in the composition of Y(0) and Y(1) in the observed outcome Y, that changes the unconditional quantiles of Y.

Theorem 1 also shows that the unconditional quantile effect depends on Ṗ(w). Under Assumption 2(c), we have

∫_𝒲 (∂P_δ(w)/∂δ) f_W(w) dw = (∂/∂δ) ∫_𝒲 P_δ(w) f_W(w) dw = (∂/∂δ)(p + δ) = 1 for all δ ∈ N_ε,

and hence ∫_𝒲 Ṗ(w) f_W(w) dw = 1. Thus the integrals in (12) can be regarded as a weighted mean with the weight given by Ṗ(w). Note that Ṗ(w) depends on how we choose to modify the propensity score, that is, it depends on who the marginal entrants are. Different propensity score interventions can result in different sets of marginal entrants and different unconditional quantile effects.

For intuition on this, consider the case where δ > 0 and P_δ(w) ≥ P(w) for all w ∈ 𝒲. Then we have

Ṗ(w) = lim_{δ→0} [Pr(U_D ≤ P_δ(w)) − Pr(U_D ≤ P(w))]/δ = lim_{δ→0} Pr(P(W) < U_D ≤ P_δ(W) | W = w)/δ.

Thus, Ṗ(w) measures the relative contribution to the overall improvement in the participation rate (i.e., δ) for the individuals with W = w. For each value of W, only individuals on the margin ("the marginal individuals") will change their treatment status and contribute to the overall improvement in the participation rate. The relative "thickness" of the margin depends on w and is measured by Ṗ(w).

We can use Figure 1 to convey the intuition behind Ṗ(w). The figure illustrates the marginal individuals under the existing and new policy regimes. The marginal individuals are those with u_D = P_δ(w). Under the existing policy regime, the marginal individuals lie on the 45-degree line in the (P(w), u_D)-plane. For easy reference, we call it the marginal curve, which is the set of points {(P(w), u_D) : u_D = P(w)}. Under the new policy regime, the marginal curve is now {(P(w), u_D) : u_D = P_δ(w)}. Note that we can rewrite u_D = P_δ(w) as u_D = P(w) + [P_δ(w) − P(w)]. Thus the new marginal curve can be obtained by shifting every point on the original marginal curve up by P_δ(w) − P(w). The magnitude of the upward shift is approximately Ṗ(w)δ, which is, in general, different for different values of w.
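The normalization ∫ Ṗ(w) f_W(w) dw = 1 can be checked numerically for a concrete intervention. The sketch below is illustrative (our example, not the paper's): a probit-type propensity score P(w) = Φ(βw) is shifted to P_δ(w) = Φ(βw + s(δ)), in the spirit of (10), with s(δ) solved so that E[P_δ(W)] = E[P(W)] + δ; Ṗ(w) is then approximated by a finite difference.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

rng = np.random.default_rng(2)
w = rng.normal(size=200_000)          # Monte Carlo draws from f_W
beta = 0.8
p_w = norm.cdf(beta * w)              # baseline propensity score P(w)

def s_of(delta):
    """Index shift s(delta), s(0) = 0, that raises participation by delta."""
    return brentq(lambda s: np.mean(norm.cdf(beta * w + s)) - p_w.mean() - delta,
                  -2.0, 2.0)

delta = 1e-3
# Finite-difference approximation to P-dot(w) = dP_delta(w)/d(delta) at 0.
p_dot = (norm.cdf(beta * w + s_of(delta)) - p_w) / delta
print(p_dot.mean())   # Monte Carlo version of the integral, equal to 1
```

Plotting `p_dot` against `w` would show where the margin is "thick": for this intervention the weight is proportional to φ(βw), so observations with propensity scores near one half receive the most weight.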
The integral of the difference of the two marginal curves (i.e., the area of the gray region in Figure 1) weighted by the marginal density f_W(·) of W is equal to δ.

To understand the weight f_W(w)Ṗ(w) that appears in Theorem 1, let ǫ be a small positive number. Then f_W(w)ǫ measures the proportion of individuals for whom W is in [w − ǫ/2, w + ǫ/2]. Note that for W ∈ [w − ǫ/2, w + ǫ/2], the propensity scores under D and D_δ are approximately P(w) and P_δ(w). The proportion of the individuals for whom W ∈ [w − ǫ/2, w + ǫ/2] and who have switched their treatment status from 0 to 1 is then equal to f_W(w) · [P_δ(w) − P(w)] · ǫ. Scaling this by δ, which is the overall proportion of the individuals who have switched the treatment status, we obtain f_W(w) · [P_δ(w) − P(w)]/δ · ǫ. Thus f_W(w) · [P_δ(w) − P(w)]/δ can be regarded as the density function of W among those who have switched the treatment status from 0 to 1.

Note that the two marginal curves coincide in the limit as δ → 0, and so in the limit we can define the marginal individuals as those with U_D = P(W). In our discussions, "the marginal individuals" may refer to the group of individuals with U_D = P(W) or the group of individuals with U_D = P_δ(W). Which group we refer to should be clear from the context.

Figure 1: Marginal individuals under different policies. [Figure: the (P(w), u_D)-plane, showing the marginal curve u_D = P(w) under D = 1{U_D ≤ P(W)}, the shifted marginal curve u_D = P_δ(w) under D_δ = 1{U_D ≤ P_δ(W)}, and the shaded region between them.]

More formally, we have

Pr(W ∈ [w − ǫ/2, w + ǫ/2] | D = 0, D_δ = 1)/ǫ
= Pr(W ∈ [w − ǫ/2, w + ǫ/2], D = 0, D_δ = 1)/(ǫ · Pr(D = 0, D_δ = 1))
= Pr(W ∈ [w − ǫ/2, w + ǫ/2], U_D ∈ (P(W), P_δ(W)])/(ǫ · δ),

so taking the limit as ǫ → 0, we obtain

lim_{ǫ→0} Pr(W ∈ [w − ǫ/2, w + ǫ/2] | D = 0, D_δ = 1)/ǫ = f_W(w) [P_δ(w) − P(w)]/δ.

Thus f_W(w)[P_δ(w) − P(w)]/δ is the density of W among those who respond positively to the policy intervention, that is, those with D = 0 and D_δ = 1. Graphically, f_W(w)[P_δ(w) − P(w)]/δ is the conditional density of W conditional on (P(W), U_D) being in the gray region in Figure 1. Letting δ → 0, we obtain

lim_{δ→0} f_W(w) [P_δ(w) − P(w)]/δ = f_W(w) Ṗ(w).

That is, f_W(w)Ṗ(w) is the limit of the density of W among those with D = 0 and D_δ = 1. We can therefore refer to f_W(w)Ṗ(w) as the density of the distribution of W over the marginal subpopulation that consists of all marginal individuals.

In view of the above interpretation of f_W(w)Ṗ(w), Theorem 1 shows that the unconditional quantile effect is equal to the change in the influence functions for the marginal individuals, weighted by the density of the distribution of W over those marginal individuals.

Noting that f_W(w) is the density of the distribution of W over the entire population, we can regard Ṗ(w) as the Radon–Nikodym (RN) derivative of the subpopulation distribution with respect to the population distribution. Even if Ṗ(w) is not positive for all w ∈ 𝒲, the Radon–Nikodym interpretation is still valid. In this case, the distribution with density f_W(w)Ṗ(w) with respect to the Lebesgue measure is a signed measure.

Corollary 1.
Let the assumptions in Theorem 1 hold. Then $\Pi_\tau = A_\tau - B_\tau$, where
$$
A_\tau = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} E\left[1\{Y \le y_\tau\} \mid D = 0, W = w\right] f_W(w)\,dw - \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} E\left[1\{Y \le y_\tau\} \mid D = 1, W = w\right] f_W(w)\,dw, \quad (13)
$$
and $B_\tau = B_{1\tau} + B_{2\tau}$, for
$$
B_{1\tau} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[F_{Y|D,W}(y_\tau \mid 0, w) - F_{Y|D,W}(y_\tau \mid 1, w)\right] \left[1 - \dot P(w)\right] f_W(w)\,dw
$$
$$
= \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[F_{Y|D,W}(y_\tau \mid 0, w) - F_{Y|D,W}(y_\tau \mid 1, w)\right] f_W(w)\,dw - \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[F_{Y|D,W}(y_\tau \mid 0, w) - F_{Y|D,W}(y_\tau \mid 1, w)\right] \dot P(w) f_W(w)\,dw
$$
and
$$
B_{2\tau} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[F_{Y(0)|D,W}(y_\tau \mid 0, w) - F_{Y(0)|U_D,W}(y_\tau \mid P(w), w)\right] \dot P(w) f_W(w)\,dw - \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[F_{Y(1)|D,W}(y_\tau \mid 1, w) - F_{Y(1)|U_D,W}(y_\tau \mid P(w), w)\right] \dot P(w) f_W(w)\,dw. \quad (14)
$$

To facilitate understanding of Corollary 1, we can define and organize the average influence functions (AIF) in a table:
$$
\begin{array}{c|ccc}
 & \text{AIF for } Y(0) & \text{AIF for } Y(1) & \text{Difference} \\ \hline
U_D & E_w[\psi_\tau(Y(0)) \mid U_D = P(w)] & E_w[\psi_\tau(Y(1)) \mid U_D = P(w)] & E_w[\psi_\tau(Y(1)) - \psi_\tau(Y(0)) \mid U_D = P(w)] \\
D & E_w[\psi_\tau(Y(0)) \mid D = 0] & E_w[\psi_\tau(Y(1)) \mid D = 1] & E_w[\psi_\tau(Y(1)) \mid D = 1] - E_w[\psi_\tau(Y(0)) \mid D = 0]
\end{array}
$$
where
$$
\psi_\tau(Y(d)) = \frac{\tau - 1\{Y(d) \le y_\tau\}}{f_Y(y_\tau)}.
$$
In the above, $E_w[\,\cdot\,]$ stands for the conditional mean operator given $W = w$. For example, $E_w[\psi_\tau(Y(1)) \mid D = 1] = E[\psi_\tau(Y(1)) \mid D = 1, W = w]$. Let
$$
\psi_{\Delta, U_D}(w) := E_w[\psi_\tau(Y(1)) - \psi_\tau(Y(0)) \mid U_D = P(w)], \qquad
\psi_{\Delta, D}(w) := E_w[\psi_\tau(Y(1)) \mid D = 1] - E_w[\psi_\tau(Y(0)) \mid D = 0].
$$
The unconditional quantile effect $\Pi_\tau$ is the average of the difference $\psi_{\Delta, U_D}(w)$ with respect to the distribution of $W$ over the marginal subpopulation. The average apparent effect $A_\tau$ is the average of the difference $\psi_{\Delta, D}(w)$ with respect to the distribution of $W$ over the whole population.
It is also equal to the probability limit of the unconditional quantile regression estimator of Firpo, Fortin and Lemieux (2009), where the endogeneity of the treatment selection is ignored. The discrepancy between $\Pi_\tau$ and $A_\tau$ gives rise to the asymptotic bias $B_\tau$:
$$
B_\tau = A_\tau - \Pi_\tau = E[\psi_{\Delta,D}(W)] - E\left[\psi_{\Delta,U_D}(W)\, \dot P(W)\right]
= \underbrace{E\left\{\psi_{\Delta,D}(W)\left[1 - \dot P(W)\right]\right\}}_{B_{1\tau}} + \underbrace{E\left\{\left[\psi_{\Delta,D}(W) - \psi_{\Delta,U_D}(W)\right] \dot P(W)\right\}}_{B_{2\tau}}. \quad (15)
$$
It is easy to show that the $B_\tau$ given here is identical to that given in (14).

Equation (15) decomposes the asymptotic bias into two components. The first one, $B_{1\tau}$, captures the heterogeneity of the apparent effects averaged over two different subpopulations. For every $w$, $\psi_{\Delta,D}(w)$ is the average effect of $D$ on $[\tau - 1\{Y \le y_\tau\}]/f_Y(y_\tau)$ for the individuals with $W = w$. These effects are averaged over two different distributions of $W$: the distribution of $W$ for the marginal subpopulation (i.e., $\dot P(w) f_W(w)$) and the distribution of $W$ for the whole population (i.e., $f_W(w)$). $B_{1\tau}$ is equal to the difference of these two average effects. If the effect $\psi_{\Delta,D}(w)$ does not depend on $w$, then $B_{1\tau} = 0$. If $\dot P(w) = 1$, then the distribution of $W$ over the whole population is the same as that over the marginal subpopulation, and hence $B_{1\tau} = 0$.

$B_{2\tau}$ has a difference-in-differences interpretation. Each of $\psi_{\Delta,D}(\cdot)$ and $\psi_{\Delta,U_D}(\cdot)$ is the difference in the average influence functions associated with the counterfactual outcomes $Y(1)$ and $Y(0)$. However, $\psi_{\Delta,D}(\cdot)$ is the difference over the two subpopulations who actually choose $D = 1$ and $D = 0$, while $\psi_{\Delta,U_D}(\cdot)$ is the difference over the marginal subpopulation. So $\psi_{\Delta,D}(\cdot) - \psi_{\Delta,U_D}(\cdot)$ is a difference in differences. $B_{2\tau}$ is simply the average of this difference in differences with respect to the distribution of $W$ over the marginal subpopulation. This term arises because the changes in the distributions of $Y$ for those with $D = 1$ and $D = 0$ can differ from the changes for those whose $U_D$ is just above $P(w)$ and those whose $U_D$ is just below $P(w)$. Thus we can label $B_{2\tau}$ as a marginal selection bias.

If $\psi_{\Delta,D}(w) = \psi_{\Delta,U_D}(w)$ for almost all $w \in \mathcal{W}$, then $B_{2\tau} = 0$. The condition $\psi_{\Delta,D}(w) = \psi_{\Delta,U_D}(w)$ is the same as
$$
E_w[\psi_\tau(Y(1)) - \psi_\tau(Y(0)) \mid U_D = P(w)] = E_w[\psi_\tau(Y(1)) \mid D = 1] - E_w[\psi_\tau(Y(0)) \mid D = 0].
$$
Equivalently,
$$
E_w[\psi_\tau(Y(1)) \mid U_D = P(w)] - E_w[\psi_\tau(Y(1)) \mid D = 1] = E_w[\psi_\tau(Y(0)) \mid U_D = P(w)] - E_w[\psi_\tau(Y(0)) \mid D = 0].
$$
The condition resembles the parallel-paths assumption or the constant-bias assumption in a difference-in-differences analysis. If $U_D$ is independent of $(U_0, U_1)$ given $W$, then this condition holds and $B_{2\tau} = 0$. In general, when $U_D$ is not independent of $(U_0, U_1)$ given $W$ and $W$ enters the selection equation, we have $B_{1\tau} \neq 0$ and $B_{2\tau} \neq 0$, and hence $\Pi_\tau \neq A_\tau$. If $\dot P(w)$ is not identified, then $B_{1\tau}$ is not identified. In general, $B_{2\tau}$ is not identified without an instrument. Therefore, without an instrument, the asymptotic bias cannot be eliminated and $\Pi_\tau$ is not identified.

It is not surprising that in the presence of endogeneity, the unconditional quantile estimator of Firpo, Fortin and Lemieux (2009) is asymptotically biased. The virtue of Corollary 1 is that it provides a closed-form characterization and clear interpretations of the asymptotic bias. To the best of our knowledge, this bias formula is new in the literature. From a broad perspective, the asymptotic bias $B_\tau$ is the unconditional quantile counterpart of the endogeneity bias of the OLS estimator in a linear regression framework.

The bias decomposition is not unique. Corollary 1 gives only one possibility. We can also write
$$
B_\tau = \underbrace{E\left\{\psi_{\Delta,U_D}(W)\left[1 - \dot P(W)\right]\right\}}_{\tilde B_{1\tau}} + \underbrace{E\left[\psi_{\Delta,D}(W) - \psi_{\Delta,U_D}(W)\right]}_{\tilde B_{2\tau}}.
$$
The interpretations of $\tilde B_{1\tau}$ and $\tilde B_{2\tau}$ are similar to those of $B_{1\tau}$ and $B_{2\tau}$ with obvious and minor modifications. The non-uniqueness of the decomposition when two or more quantities change simultaneously is well expected.

Remark 2.
Consider a setting of full independence: $V \perp U_0 \perp U_1 \perp W$ (i.e., every subset is independent of its complement). In this case, $B_{2\tau} = 0$ and by equation (12), the UQE is
$$
\Pi_\tau = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} E[1\{Y(0) \le y_\tau\} \mid W = w]\, \dot P(w) f_W(w)\,dw - \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} E[1\{Y(1) \le y_\tau\} \mid W = w]\, \dot P(w) f_W(w)\,dw.
$$
Following (13), the apparent effect is
$$
A_\tau = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} E[1\{Y(0) \le y_\tau\} \mid W = w] f_W(w)\,dw - \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} E[1\{Y(1) \le y_\tau\} \mid W = w] f_W(w)\,dw.
$$
Here, $\psi_{\Delta,U_D}(\cdot)$ changes to $\psi_{\Delta,D}(\cdot)$ and $\dot P(w) f_W(w)$ changes to $f_W(w)$. In general, therefore, we will still have a bias term given by $B_{1\tau}$ unless $\dot P(w) = 1$ or the difference $E[1\{Y(0) \le y_\tau\} \mid W = w] - E[1\{Y(1) \le y_\tau\} \mid W = w]$ does not depend on $w$. In general, both conditions fail if the covariate $W$ enters both the outcome equations and the selection equation. In this case, the usual unconditional quantile regression estimator will be asymptotically biased.

On the other hand, under full independence, there will be no asymptotic bias if the treatment has a constant effect (i.e., $E[1\{Y(0) \le y_\tau\} \mid W = w] - E[1\{Y(1) \le y_\tau\} \mid W = w]$ does not depend on $w$) or if the distribution of $W$ over the whole population is the same as that over the marginal subpopulation. Note that if $W$ does not enter the outcome equations, then $E[1\{Y(d) \le y_\tau\} \mid W = w]$ does not depend on $w$ under the condition $(U_0, U_1) \perp W$. As a result, there will be no bias. If $W$ does not enter the selection equation so that $\mu(W) = \mu_0$ for a constant $\mu_0$, then $\dot P(w) = 1$ and the distribution of $W$ over the whole population is the same as that over the marginal subpopulation. As a result, there will be no bias in that case, either.

In this subsection, we modify the propensity score by inducing a location shift in the cost or benefit function. Note that an increase in the benefit has the same effect as reducing the cost by the same amount.
It is innocuous to focus on a location shift in the benefit function, which is one of many ways to change the propensity score. From the perspective of policy design, we ask and address the following question: given the dependence between $U = (U_0, U_1)$ and $V$ in the population, how will the unconditional quantile of $Y$ change if we manage to improve the benefit from participating in the program by $s_\mu(\delta)$ for each individual in the population? While $U$ and $V$ are dependent, the incremental change $s_\mu(\delta)$ is the same for all individuals and hence is exogenous.

Recall that $P_\delta(w)$ was given in (10):
$$
P^\mu_\delta(w) = \Pr\left[V \le \mu(w) + s_\mu(\delta) \mid W = w\right] = F_{V|W}(\mu(w) + s_\mu(\delta) \mid w). \quad (16)
$$
To emphasize that the change is induced on $\mu(\cdot)$, we have added a superscript "$\mu$" to $P_\delta(w)$. We will use $P^\mu_\delta(w)$ exclusively for this case. Note that $s_\mu(\delta)$ is determined implicitly by the equation $E[P^\mu_\delta(W)] = p + \delta$. The following lemma characterizes how $s_\mu(\delta)$ and $P^\mu_\delta(w)$ will change in response to a small change in $\delta$.

Lemma 3.
Assume that (i) $(V, W)$ are absolutely continuous random variables with joint density $f_{V,W}(v, w)$ given by $f_{V|W}(v \mid w) f_W(w)$; (ii) $f_{V|W}(v \mid w)$ is continuous in $v$ for almost all $w \in \mathcal{W}$; (iii) $\int_{\mathcal{W}} \sup_{\delta \in N_\varepsilon} f_{V|W}(\mu(w) + s_\mu(\delta) \mid w) f_W(w)\,dw < \infty$; and (iv) $\int_{\mathcal{W}} f_{V|W}(\mu(w) + s_\mu(\delta) \mid w) f_W(w)\,dw \neq 0$ for all $\delta \in N_\varepsilon$. Then for all $\delta \in N_\varepsilon$,
$$
\frac{\partial s_\mu(\delta)}{\partial \delta} = \frac{1}{\int_{\mathcal{W}} f_{V|W}(\mu(w) + s_\mu(\delta) \mid w) f_W(w)\,dw}, \qquad
\frac{\partial P^\mu_\delta(w)}{\partial \delta} = \frac{f_{V|W}(\mu(w) + s_\mu(\delta) \mid w)}{\int_{\mathcal{W}} f_{V|W}(\mu(w) + s_\mu(\delta) \mid w) f_W(w)\,dw}.
$$
In particular,
$$
\left.\frac{\partial s_\mu(\delta)}{\partial \delta}\right|_{\delta = 0} = \frac{1}{\int_{\mathcal{W}} f_{V|W}(\mu(w) \mid w) f_W(w)\,dw}, \qquad
\dot P^\mu(w) = \left.\frac{\partial P^\mu_\delta(w)}{\partial \delta}\right|_{\delta = 0} = \frac{f_{V|W}(\mu(w) \mid w)}{\int_{\mathcal{W}} f_{V|W}(\mu(w) \mid w) f_W(w)\,dw}.
$$
(Footnote: to see this, note that when $P_\delta(w)$ does not depend on $w$, we have that $P_\delta = p + \delta$, which implies that $\dot P = 1$.)

Combining Theorem 1 and Lemma 3, we obtain a representation of the unconditional quantile effect under a location shift in the benefit function. We use $\Pi_{\tau,\mu}$ to denote this effect.
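The derivative formulas in Lemma 3 can be verified numerically. The sketch below uses an illustrative specification of our own ($W$ standard normal, $V \mid W = w \sim N(0,1)$, and $\mu(w) = e^w$), solves $E[P^\mu_\delta(W)] = p + \delta$ for $s_\mu(\delta)$ by bisection, and compares the finite-difference derivative with the closed form:

```python
# Finite-difference check of Lemma 3 at delta = 0 (illustrative design):
# ds_mu/d delta |_0 = 1 / E[ f_{V|W}(mu(W) | W) ] with mu(w) = exp(w),
# W ~ N(0,1), and V | W ~ N(0,1).
import math
from statistics import NormalDist

nd = NormalDist()

m = 2001                                   # quadrature grid for W on [-6, 6]
ws = [-6 + 12 * i / (m - 1) for i in range(m)]
dw = 12 / (m - 1)
fw = [nd.pdf(w) for w in ws]

def mean_P(s):
    # E[P_delta^mu(W)] = E[Phi(mu(W) + s)]
    return sum(nd.cdf(math.exp(w) + s) * f for w, f in zip(ws, fw)) * dw

p = mean_P(0.0)

def s_mu(target):
    lo, hi = -5.0, 5.0                     # bisection: mean_P is increasing in s
    for _ in range(60):
        mid = (lo + hi) / 2
        if mean_P(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

h = 1e-3
fd = (s_mu(p + h) - s_mu(p - h)) / (2 * h)                    # finite difference
analytic = 1.0 / (sum(nd.pdf(math.exp(w)) * f for w, f in zip(ws, fw)) * dw)
print(fd, analytic)
```

The same grid also checks the formula for $\dot P^\mu(w)$, since $\partial P^\mu_\delta(w)/\partial \delta = f_{V|W}(\mu(w) + s_\mu(\delta) \mid w)\, \partial s_\mu(\delta)/\partial \delta$ by the chain rule.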
Corollary 2. Let the assumptions in Theorem 1 hold. Then under the location shift in the benefit function given by (16), we have
$$
\Pi_{\tau,\mu} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} E[1\{Y(0) \le y_\tau\} \mid U_D = P(w), W = w]\, \dot P^\mu(w) f_W(w)\,dw - \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} E[1\{Y(1) \le y_\tau\} \mid U_D = P(w), W = w]\, \dot P^\mu(w) f_W(w)\,dw, \quad (17)
$$
where
$$
\dot P^\mu(w) = \left.\frac{\partial P^\mu_\delta(w)}{\partial \delta}\right|_{\delta=0} = \frac{f_{V|W}(\mu(w) \mid w)}{\int_{\mathcal{W}} f_{V|W}(\mu(w) \mid w) f_W(w)\,dw}.
$$
To bring our setting closer to that of Firpo, Fortin and Lemieux (2009) and shed more light on Theorem 1, we consider the special case where $\mu(w) = \mu_0$ for a constant $\mu_0$ and $V$ is independent of $W$. In this case, the selection is based on unobservables only. We have
$$
P^\mu_\delta(w) = F_{V|W}(\mu_0 + s_\mu(\delta) \mid w) = F_V(\mu_0 + s_\mu(\delta)),
$$
which does not depend on $w$. In particular, $P^\mu_0(w) = F_V(\mu_0) = p$. The selection rule becomes $D = 1\{U_D \le p\}$, where $U_D$ is uniform on $[0, 1]$. Using Lemma 3, we find that
$$
\dot P^\mu = \left.\frac{\partial P^\mu_\delta(w)}{\partial \delta}\right|_{\delta=0} = \frac{f_V(\mu_0)}{\int_{\mathcal{W}} f_V(\mu_0) f_W(w)\,dw} = 1,
$$
and, in fact, $P^\mu_\delta(w) = p + \delta$. It then follows that the distribution of $W$ over the whole population is the same as that over the marginal subpopulation. As a result, one of the asymptotic bias terms, namely $B_{1\tau}$, disappears.

Using Theorem 1 and Corollary 1, we obtain the following corollary.

Corollary 3. Let the assumptions in Theorem 1 hold. If $\mu(w) = \mu_0$ for some constant $\mu_0$ and $V$ is independent of $W$, then
$$
\Pi_{\tau,\mu} = \frac{1}{f_Y(y_\tau)} E\left[1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\} \mid U_D = p\right]. \quad (18)
$$
In addition, $\Pi_{\tau,\mu} = A_{\tau,\mu} - B_{\tau,\mu}$ for
$$
A_{\tau,\mu} = \frac{1}{f_Y(y_\tau)} \int E[1\{Y \le y_\tau\} \mid D = 0, W = w] f_W(w)\,dw - \frac{1}{f_Y(y_\tau)} \int E[1\{Y \le y_\tau\} \mid D = 1, W = w] f_W(w)\,dw \quad (19)
$$
$$
= \frac{1}{f_Y(y_\tau)} \left(E[1\{Y \le y_\tau\} \mid D = 0] - E[1\{Y \le y_\tau\} \mid D = 1]\right) \quad (20)
$$
and
$$
B_{\tau,\mu} = \frac{1}{f_Y(y_\tau)} \left[F_{Y(0)|D}(y_\tau \mid 0) - F_{Y(0)|U_D}(y_\tau \mid p)\right] - \frac{1}{f_Y(y_\tau)} \left[F_{Y(1)|D}(y_\tau \mid 1) - F_{Y(1)|U_D}(y_\tau \mid p)\right].
$$
The representation of $\Pi_{\tau,\mu}$ in (18) shows that the unconditional quantile effect is equal to the difference of the influence functions averaged over the conditional distribution of the outcome variable given $U_D = p$. That is, the unconditional quantile effect is driven by the individuals for whom $U_D = p$. These individuals are ex-ante indifferent between not participating (i.e., $D = 0$) and participating (i.e., $D = 1$).

If $U_D \perp (U_0, U_1)$, then $F_{Y(d)|D}(y \mid d) = F_{Y(d)|U_D}(y \mid p) = F_{Y(d)}(y)$ for any $y \in \mathcal{Y}(d)$ and for $d = 0$ and $d = 1$. As a result, $B_{\tau,\mu} = 0$. Thus the asymptotic bias $B_{\tau,\mu}$ vanishes in the absence of endogeneity. This is, of course, well expected.

In the presence of endogeneity, it is not easy to evaluate or even sign the asymptotic bias. In general, the joint distribution of $(U, U_D)$ given the covariate $W$ is needed for this purpose. This is not atypical. For a nonlinear estimator such as the unconditional quantile estimator, its asymptotic properties often depend on the full data generating process in a non-trivial way. This is in sharp contrast with a linear estimator such as the OLS in a linear regression model whose properties depend on only the first few moments of the data. Next, we present some examples of asymptotic bias.
Example 1 (Non-uniformity of Endogeneity Bias across Quantiles).

[Figure 2: Non-uniform bias: the solid green and dashed blue curves represent $F_{U|D}(u \mid 0)$ and $F_{U|D}(u \mid 1)$. The vertical distance between them is the (rescaled) bias, with $B_{\tau,\mu} < 0$ at some quantile levels and $B_{\tau,\mu} > 0$ at others.]

Consider the model
$$
Y(d) = U \in \mathbb{R} \text{ for } d = 0, 1, \qquad Y = D Y(1) + (1 - D) Y(0), \qquad D = 1\{V \le \mu_0\},
$$
where $U$ and $V$ are correlated. Noting that $Y(1) = Y(0)$, the treatment has no effect on the outcome, and so $1\{Y(1) \le y_\tau\} - 1\{Y(0) \le y_\tau\} = 0$. As a result, $\Pi_{\tau,\mu} = 0$. By Corollary 3, $A_{\tau,\mu} = B_{\tau,\mu}$; that is, the estimator of Firpo, Fortin and Lemieux (2009) is an estimator of the asymptotic bias only. In this case, since $Y = U$, we have
$$
B_{\tau,\mu} = \frac{1}{f_U(u_\tau)} \left[F_{U|D}(u_\tau \mid 0) - F_{U|D}(u_\tau \mid 1)\right],
$$
where $u_\tau$ is the $\tau$-quantile of $U$: $F_U(u_\tau) = \tau$. This shows that the sign of the asymptotic bias depends on $F_{U|D}(u_\tau \mid 0) - F_{U|D}(u_\tau \mid 1)$. In the presence of endogeneity, the two distribution functions $F_{U|D}(\cdot \mid 0)$ and $F_{U|D}(\cdot \mid 1)$ are not the same. Unless one distribution function first-order stochastically dominates the other, it is necessarily true that $B_{\tau,\mu}$ is positive for some quantile levels and negative for others.

Figure 2 shows a case where the bias is positive for higher quantiles and negative for lower quantiles. Thus it is not sufficient to use the sign of the correlation between $U$ and $V$ to sign the asymptotic bias for the unconditional quantile effect at all quantile levels.
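The sign pattern can be explored by simulation. The sketch below uses our own illustrative parameterization ($(U, V)$ bivariate normal with correlation $0.5$ and $\mu_0 = 0$) and estimates $F_{U|D}(u_\tau \mid 0) - F_{U|D}(u_\tau \mid 1)$ across quantile levels. In this jointly normal case one conditional distribution first-order stochastically dominates the other, so the estimated bias keeps a single sign at every $\tau$; the sign reversal depicted in Figure 2 requires conditional distributions that are not stochastically ordered:

```python
# Monte Carlo for Example 1: Y = U, D = 1{V <= 0}, corr(U, V) = 0.5.
# The (rescaled) bias is F_{U|D}(u_tau | 0) - F_{U|D}(u_tau | 1).
import bisect
import random

random.seed(1)
rho = 0.5
n = 200_000
u_treated, u_control = [], []
for _ in range(n):
    v = random.gauss(0.0, 1.0)
    u = rho * v + (1 - rho ** 2) ** 0.5 * random.gauss(0.0, 1.0)
    (u_treated if v <= 0.0 else u_control).append(u)

u_treated.sort()
u_control.sort()
u_all = sorted(u_treated + u_control)

def ecdf(sorted_sample, x):
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

diffs = []
for tau in (0.1, 0.25, 0.5, 0.75, 0.9):
    u_tau = u_all[int(tau * len(u_all))]          # tau-quantile of U
    diffs.append(ecdf(u_control, u_tau) - ecdf(u_treated, u_tau))
print(diffs)   # one-signed here: the treated (V <= 0) have stochastically smaller U
```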
Example 2 (Asymptotic Bias with Exogenous Treatment). Consider the model
$$
Y(0) = q(W) + U_0, \qquad Y(1) = q(W) + \beta + U_1, \qquad D = 1\{V \le \mu(W)\}, \qquad Y = (1 - D) Y(0) + D Y(1),
$$
where $q(\cdot)$ and $\mu(\cdot)$ are functions of $W$, and $W$ is independent of $(U_0, U_1, V)$. By Theorem 1, we have
$$
\Pi_{\tau,\mu} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[\int_{-\infty}^{y_\tau - q(w)} f_{U_0|V}(u \mid \mu(w))\,du\right] \tilde f_W(w)\,dw - \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[\int_{-\infty}^{y_\tau - q(w) - \beta} f_{U_1|V}(u \mid \mu(w))\,du\right] \tilde f_W(w)\,dw,
$$
where
$$
\tilde f_W(w) = \frac{f_V(\mu(w)) f_W(w)}{\int_{\mathcal{W}} f_V(\mu(\check w)) f_W(\check w)\,d\check w}.
$$
It follows from Corollary 1 that the apparent effect is
$$
A_{\tau,\mu} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \frac{\int_{\mu(w)}^{\infty} \left[\int_{-\infty}^{y_\tau - q(w)} f_{U_0|V}(u \mid v)\,du\right] f_V(v)\,dv}{1 - F_V(\mu(w))}\, f_W(w)\,dw - \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \frac{\int_{-\infty}^{\mu(w)} \left[\int_{-\infty}^{y_\tau - q(w) - \beta} f_{U_1|V}(u \mid v)\,du\right] f_V(v)\,dv}{F_V(\mu(w))}\, f_W(w)\,dw.
$$
Details of this and the expression for $f_Y(y_\tau)$ can be found in the supplementary appendix.

To compute $\Pi_{\tau,\mu}$ and $A_{\tau,\mu}$ numerically, we set $\beta = 1$ and $q(w) = \mu(w) = e^w$, and we assume that $W$ is standard normal and that $(U_0, U_1, V)$ is jointly normal with mean zero and unit variances, where $\rho$ is the correlation between $U_0$ and $V$, and between $U_1$ and $V$. Different values of $\rho$ lead to different degrees of endogeneity. When $\rho = 0$, the treatment selection is exogenous.

Figure 3 plots the asymptotic bias $B_{\tau,\mu} := A_{\tau,\mu} - \Pi_{\tau,\mu}$ as a function of $\tau$ for $\rho = 0, 0.25, 0.5, 0.75, 0.9$. As in Example 1, we can see that the asymptotic bias is not uniform across quantiles. Hence, any attempt to sign the bias based on the "sign" or degree of the endogeneity (i.e., the sign or magnitude of $\rho$) is futile. It is intriguing to see that, even in the case of exogenous treatment selection (i.e., $\rho = 0$), we still have an asymptotic bias. When $\rho = 0$, the asymptotic bias from the second source is zero, as
$$
F_{Y(0)|D,W}(y_\tau \mid 0, w) = F_{Y(0)|U_D,W}(y_\tau \mid P(w), w) = F_{Y(0)|W}(y_\tau \mid w), \qquad
F_{Y(1)|D,W}(y_\tau \mid 1, w) = F_{Y(1)|U_D,W}(y_\tau \mid P(w), w) = F_{Y(1)|W}(y_\tau \mid w).
$$
The asymptotic bias from the first source is
$$
B_{1\tau,\mu} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \left[F_{Y|D,W}(y_\tau \mid 0, w) - F_{Y|D,W}(y_\tau \mid 1, w)\right] \left[1 - \dot P(w)\right] f_W(w)\,dw.
$$
In this example, $\dot P(w) \neq 1$, and
$$
F_{Y|D,W}(y_\tau \mid 0, w) - F_{Y|D,W}(y_\tau \mid 1, w) = \Pr(e^w + U_0 < y_\tau) - \Pr(e^w + \beta + U_1 < y_\tau) \neq 0 \text{ for all } \beta \neq 0.
$$
So the asymptotic bias from the first source, $B_{1\tau,\mu}$, is not equal to $0$ when $\beta \neq 0$.

Why is there a bias when the treatment is exogenous? To improve the overall participation rate from $p$ to $p + \delta$, we change the benefit function by the same amount $s_\mu(\delta)$, that is, from $\mu(w)$ to $\mu(w) + s_\mu(\delta)$. Depending on the value of $w$, such a change in the benefit function will have a differential effect on the treatment rate. In other words, $P_\delta(w) - P(w)$, and hence $\dot P(w)$, will depend on $w$. As a result, the distribution of $W$ over the whole population will not be the same as that over the marginal subpopulation. This difference creates a wedge between the unconditional quantile effect $\Pi_{\tau,\mu}$ and the average apparent effect $A_{\tau,\mu}$ when the "apparent" difference $F_{Y|D,W}(y_\tau \mid 0, w) - F_{Y|D,W}(y_\tau \mid 1, w)$ depends on $w$.

Figure 3:
Asymptotic bias for ρ =
0, 0.25, 0.5, 0.75, 0.921
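The exogenous-selection bias can be reproduced numerically without simulation error. The sketch below uses the design of Example 2 with $\rho = 0$ (and our own choice of $\tau = 0.9$), computes by quadrature the integrals entering $A_{\tau,\mu}$ and $\Pi_{\tau,\mu}$ with the common factor $1/f_Y(y_\tau)$ dropped, and shows that the two differ:

```python
# Example 2 at rho = 0: q(w) = mu(w) = exp(w), beta = 1, W ~ N(0,1), and
# (U0, U1, V) independent standard normals.  With rho = 0,
#   h(w) = F_{Y|D,W}(y_tau|0,w) - F_{Y|D,W}(y_tau|1,w)
#        = Phi(y_tau - exp(w)) - Phi(y_tau - exp(w) - beta),
# and the (rescaled) bias is E_{f_W}[h(W)] minus the average of h over the
# marginal subpopulation, whose density is proportional to f_V(mu(w)) f_W(w).
import math
from statistics import NormalDist

nd = NormalDist()
beta = 1.0
tau = 0.9

m = 2001
ws = [-6 + 12 * i / (m - 1) for i in range(m)]
dw = 12 / (m - 1)
fw = [nd.pdf(w) for w in ws]

def F_Y(y):
    # Pr(Y <= y) = E[(1 - Phi(mu(W))) Phi(y - q(W)) + Phi(mu(W)) Phi(y - q(W) - beta)]
    tot = 0.0
    for w, f in zip(ws, fw):
        mu = math.exp(w)
        tot += ((1 - nd.cdf(mu)) * nd.cdf(y - mu) + nd.cdf(mu) * nd.cdf(y - mu - beta)) * f
    return tot * dw

lo, hi = -10.0, 30.0                     # bisection for the tau-quantile of Y
for _ in range(80):
    mid = (lo + hi) / 2
    if F_Y(mid) < tau:
        lo = mid
    else:
        hi = mid
y_tau = (lo + hi) / 2

def h(w):
    mu = math.exp(w)
    return nd.cdf(y_tau - mu) - nd.cdf(y_tau - mu - beta)

A_star = sum(h(w) * f for w, f in zip(ws, fw)) * dw                    # whole population
norm = sum(nd.pdf(math.exp(w)) * f for w, f in zip(ws, fw)) * dw
Pi_star = sum(h(w) * nd.pdf(math.exp(w)) * f for w, f in zip(ws, fw)) * dw / norm
print(y_tau, A_star, Pi_star)
```

Even though selection is exogenous ($\rho = 0$), the two averages differ because the marginal subpopulation reweights $W$ by $f_V(\mu(w))$; at upper quantiles the gap is pronounced.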
Unconditional Quantile Regressions with an Instrument
In the previous section, we have shown that the estimator of Firpo, Fortin and Lemieux (2009) will be asymptotically biased under endogeneity. It is hard to sign the bias, but more importantly, the bias may not be uniform across quantiles, as shown in Figure 3. However, if a special covariate is available, the unconditional quantile effect can be point identified and consistently estimated. We consider the same model as before, given in equations (6)–(9). We partition the covariates $W$ into two parts: $W = (Z, X)$. We assume that $Z \in \mathbb{R}$ is a special variable that does not enter the potential outcome equations. In addition, we make the following assumptions, taken directly from Heckman and Vytlacil (1999, 2001a, 2005).

Assumption 4.
(Relevance and Exogeneity) (a) $\mu(Z, X)$ is a non-degenerate random variable conditional on $X$. (b) $(U_0, U_1, V)$ is independent of $Z$ conditional on $X$.

Assumption 4(a) is a relevance assumption: for any given level of $X$, the variable $Z$ can induce some variation in $D$. Assumption 4(b) is referred to as an exogeneity assumption. The two assumptions are essentially the conditions for a valid instrumental variable, hence we will refer to $Z$ as the instrumental variable.

Assumption 4(b) allows us to write $U_D = F_{V|Z,X}(V \mid Z, X) = F_{V|X}(V \mid X)$. Assumption 4(b) then implies that $(U_0, U_1, U_D)$ is independent of $Z$ conditional on $X$. Based on the value $U_D = u$, we define a marginal treatment effect for the $\tau$-quantile, which will be a basic building block for the unconditional quantile effect.

Definition 2.
The marginal treatment effect for the $\tau$-quantile is defined as
$$
\mathrm{MTE}_\tau(u, x) = E\left[1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\} \mid U_D = u, X = x\right],
$$
where $y_\tau$ is the $\tau$-quantile of $Y = D Y(1) + (1 - D) Y(0)$, that is, $\Pr(Y \le y_\tau) = \tau$.

We could also define
$$
\mathrm{MTE}_\tau(u, x) = E\left[\left.\frac{\tau - 1\{Y(1) \le y_\tau\}}{f_Y(y_\tau)} - \frac{\tau - 1\{Y(0) \le y_\tau\}}{f_Y(y_\tau)} \right| U_D = u, X = x\right] = \frac{1}{f_Y(y_\tau)} E\left[1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\} \mid U_D = u, X = x\right],
$$
but we omit the multiplicative factor $1/f_Y(y_\tau)$ for notational simplicity.
To aid in the understanding of $\mathrm{MTE}_\tau$, we compare it to the marginal treatment effect of Heckman and Vytlacil (1999, 2001a, 2005): $\mathrm{MTE}(u, x) := E[Y(1) - Y(0) \mid U_D = u, X = x]$. For a given individual, $Y(1) - Y(0)$ is the (individual-level) treatment effect, so that the $\mathrm{MTE}(u, x)$ is the average treatment effect for individuals with characteristics $U_D = u$, $X = x$.

Define $\Delta(y_\tau) := 1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\}$, which is the argument of the conditional expectation in our definition of $\mathrm{MTE}_\tau$. The random variable $\Delta(y_\tau)$ can take three values:
$$
\Delta(y_\tau) = \begin{cases} 1 & \text{if } Y(0) \le y_\tau \text{ and } Y(1) > y_\tau, \\ 0 & \text{if } \left[Y(0) > y_\tau \text{ and } Y(1) > y_\tau\right] \text{ or } \left[Y(0) \le y_\tau \text{ and } Y(1) \le y_\tau\right], \\ -1 & \text{if } Y(0) > y_\tau \text{ and } Y(1) \le y_\tau. \end{cases}
$$
For a given individual, $\Delta(y_\tau) = 1$ indicates that the treatment moves the individual's outcome across the $\tau$-quantile $y_\tau$ of $Y$ from below, and $\Delta(y_\tau) = -1$ indicates that the treatment moves the outcome across the $\tau$-quantile $y_\tau$ of $Y$ from above. In the first case the individual benefits from the treatment while in the second case the treatment harms her. The intermediate case, $\Delta(y_\tau) = 0$, occurs when the outcome remains on the same side of $y_\tau$ under both treatment statuses. Thus $E[\Delta(y_\tau)]$ equals the difference between the proportion of individuals who benefit from the treatment and the proportion of individuals who are harmed by it. For the UQE, whether the treatment is beneficial or harmful is measured in terms of quantile crossing. Among the individuals with characteristics $U_D = u$, $X = x$, $\mathrm{MTE}_\tau(u, x)$ is then the difference between the proportion of individuals who benefit from the treatment and the proportion of individuals who are harmed by it. Thus, $\mathrm{MTE}_\tau(u, x)$ is positive if more individuals increase their outcome above $y_\tau$, and it is negative if more individuals decrease their outcome below $y_\tau$.

$\mathrm{MTE}_\tau(u, x)$ is different from the quantile analogue of the marginal treatment effect of Carneiro and Lee (2009), which is defined as $F^{-1}_{Y(1)|U_D,X}(\tau \mid u, x) - F^{-1}_{Y(0)|U_D,X}(\tau \mid u, x)$.
$\mathrm{MTE}_\tau(u, x)$ is proportional to the difference of (the conditional expectations of) the influence functions for the $\tau$-quantile of $Y(1)$ and $Y(0)$. The proportionality factor is $f_Y(y_\tau)$, the unconditional density of $Y$ evaluated at the $\tau$-quantile $y_\tau$ of $Y$; see footnote 8.

In the new setting with an instrument, it is worthwhile revisiting Corollary 2. Note that if we induce a location shift in the benefit function, the unconditional quantile effect $\Pi_{\tau,\mu}$ is still given by Corollary 2 as long as the same assumptions hold for $W = (Z, X)$. Using Assumption 4(b), we can obtain a representation of $\Pi_{\tau,\mu}$ in terms of the marginal treatment effect for the $\tau$-quantile. This representation will be useful for identification, as we show later in Section 4.3.

Theorem 2.
Let Assumptions 1–3 and Assumption 4(b) hold. Assume further that $f_Y(y_\tau) > 0$. Then under the location shift in the benefit function given by (16), we have
$$
\Pi_{\tau,\mu} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \mathrm{MTE}_\tau(P(w), x)\, \dot P^\mu(w) f_W(w)\,dw, \qquad \text{where } \dot P^\mu(w) = \frac{f_{V|X}(\mu(w) \mid x)}{E\left[f_{V|X}(\mu(W) \mid X)\right]}.
$$
The main difference between Theorem 2 and Corollary 2 is that conditioning on $W = w$ has been replaced by conditioning on $X = x$ only. This is possible because of Assumption 4(b). When $(U_0, U_1, V)$ is independent of $Z$ conditional on $X$, we know that $(U_0, U_1)$ is independent of $Z$ given $(U_D, X)$, as
$$
f_{U, Z | U_D, X}(u, z \mid u_D, x) = \frac{f_{U, U_D, Z | X}(u, u_D, z \mid x)}{f_{U_D | X}(u_D \mid x)} = \frac{f_{U, U_D | X}(u, u_D \mid x)}{f_{U_D | X}(u_D \mid x)} \cdot f_{Z | X}(z \mid x) = f_{U | U_D, X}(u \mid u_D, x) \cdot f_{Z | U_D, X}(z \mid u_D, x). \quad (21)
$$
Given this and the assumption that $Z$ does not enter $Y(1)$ or $Y(0)$, we have, for $w = (z, x)$,
$$
E[1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\} \mid U_D = u, W = w] = E[1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\} \mid U_D = u, Z = z, X = x] = E[1\{Y(0) \le y_\tau\} - 1\{Y(1) \le y_\tau\} \mid U_D = u, X = x].
$$
Thus we can indeed replace "conditioning on $W = w$" by "conditioning on $X = x$."

In this subsection, we consider a different kind of manipulation of the propensity score: we shift the location of the instrumental variable $Z$. To be precise, suppose we shift $Z$ to $Z_\delta = Z + g(W) s_z(\delta)$, where $g(\cdot)$ is a user-specified measurable function. Recall that we partition the covariates as $W = (Z, X)$. Note that while $s_z(\delta)$ is the same for all individuals, $g(W)$ depends on the value of $W$ and hence is individual specific. Thus, we allow the intervention to be heterogeneous.

A simple homogeneous intervention is obtained by setting $g(\cdot) \equiv 1$, in which case we have an additive shift in $Z$. An example of a heterogeneous intervention is given by $g(W) = Z$, in which case we obtain a multiplicative shift of order $1 + s_z(\delta)$. Both of these cases have been studied by Carneiro, Heckman and Vytlacil (2010).

Selection into treatment is now governed by
$$
D_\delta = 1\{V \le \mu(Z + g(W) s_z(\delta), X)\}, \quad (22)
$$
where, for a given $g(\cdot)$, the choice of $s_z(\delta)$ guarantees that $\Pr(D_\delta = 1) = p + \delta$.

Lemma 4. Assume that (i) $(V, X)$ are absolutely continuous random variables with joint density $f_{V,X}(v, x)$ given by $f_{V|X}(v \mid x) f_X(x)$; (ii) $f_{V|X}(v \mid x)$ is continuous in $v$ for almost all $x \in \mathcal{X}$; (iii) $\mu(z, x)$ is continuously differentiable in $z$ for almost all $x \in \mathcal{X}$; (iv) letting $\mu'_z(z, x)$ be the partial derivative of $\mu(z, x)$ with respect to $z$, we have
$$
E \sup_{\delta \in N_\varepsilon} \left|f_{V|X}(\mu(Z + g(W) s_z(\delta), X) \mid X)\, \mu'_z(Z + g(W) s_z(\delta), X)\, g(W)\right| < \infty;
$$
and (v) for each $\delta \in N_\varepsilon$, $E\left[f_{V|X}(\mu(Z + g(W) s_z(\delta), X) \mid X)\, \mu'_z(Z + g(W) s_z(\delta), X)\, g(W)\right] \neq 0$. Then
$$
\left.\frac{\partial s_z(\delta)}{\partial \delta}\right|_{\delta=0} = \frac{1}{E\left[f_{V|W}(\mu(W) \mid W)\, \mu'_z(W)\, g(W)\right]}, \qquad
\left.\frac{\partial P_\delta(z, x)}{\partial \delta}\right|_{\delta=0} = \frac{f_{V|W}(\mu(w) \mid w)\, \mu'_z(w)\, g(w)}{E\left[f_{V|W}(\mu(W) \mid W)\, \mu'_z(W)\, g(W)\right]}.
$$
Theorem 3.
Let Assumptions 1–3 and 4(b), and the assumptions of Lemma 4 hold. Assume further that $f_Y(y_\tau) > 0$. Then, the unconditional quantile effect of the shift in $Z$ given in (22) is
$$
\Pi_{\tau,z} = \frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \mathrm{MTE}_\tau(P(w), x)\, \dot P^z(w) f_W(w)\,dw, \qquad \text{where } \dot P^z(w) = \left.\frac{\partial P_\delta(w)}{\partial \delta}\right|_{\delta=0} = \frac{f_{V|X}(\mu(w) \mid x)\, \mu'_z(w)\, g(w)}{E\left[f_{V|X}(\mu(W) \mid X)\, \mu'_z(W)\, g(W)\right]}.
$$
It can be seen that the main difference between $\Pi_{\tau,\mu}$ in Theorem 2 and $\Pi_{\tau,z}$ in Theorem 3 lies in the adjustment given by the derivative of the modified propensity score $P_\delta(w)$. For $\Pi_{\tau,z}$, the adjustment to the population weights includes the derivative of the benefit function, $\mu'_z(\cdot)$, and the function $g(\cdot)$. Different adjustments lead to different distributions of $W$ over the marginal subpopulation.

Note that if $\mu'_z(w)$ is known and different from 0 for all $w \in \mathcal{W}$, then choosing $g(w) = 1/\mu'_z(w)$ yields $\Pi_{\tau,z} = \Pi_{\tau,\mu}$. In the special case where $g(w) = \mu'_z(w) = 1$ for all $w \in \mathcal{W}$, the two effects coincide. Note that the condition $\mu'_z(w) = 1$ for all $w \in \mathcal{W}$ holds only if $\mu(z, x) = z + \tilde\mu(x)$ for some function $\tilde\mu(\cdot)$. So, if we think of $\mu(\cdot, \cdot)$ as the utility function, then the utility function is required to take a quasilinear form with $z$ as the numeraire. Shifts to the benefit function and shifts to the instrument (upon choosing $g(\cdot) \equiv 1$) are then equivalent.

4.3 Identification of $\mathrm{MTE}_\tau$

To investigate the identifiability of $\Pi_{\tau,z}$, we study the identifiability of $\mathrm{MTE}_\tau$ and the weight function or the RN derivative $\left.\partial P_\delta(w)/\partial \delta\right|_{\delta=0}$ separately. The proposition below shows that $\mathrm{MTE}_\tau(u, x)$ is identified for every $u = P(w)$ for some $w \in \mathcal{W}$.

Proposition 1.
Let Assumptions 2(a), 2(b), and 4(b) hold. Then, for every $u = P(w)$ with $w \in \mathcal{W}$, we have
$$
\mathrm{MTE}_\tau(u, x) = -\frac{\partial E[1\{Y \le y_\tau\} \mid P(W) = u, X = x]}{\partial u}. \quad (23)
$$
Proposition 1 can be proved using Theorem 1 in Carneiro and Lee (2009). In the supplementary appendix, we provide a self-contained proof that is directly connected to the idea of shifting the propensity score.

The key results that we use to establish Proposition 1 are
$$
E[G(Y(1)) \mid P(W) = P(w), U_D \le P(w), X = x] = E[G(Y(1)) \mid U_D \le P(w), X = x],
$$
$$
E[G(Y(0)) \mid P(W) = P(w), U_D > P(w), X = x] = E[G(Y(0)) \mid U_D > P(w), X = x],
$$
where $G(\cdot)$ is a bounded function with $G(\cdot) = 1\{\cdot \le y_\tau\}$ as a special case. These results hold under the assumption of instrument exogeneity. Without them, we have only that
$$
\frac{\partial E[G(Y) \mid P(W) = u, X = x]}{\partial u} = E[G(Y(1)) \mid P(W) = u, U_D = u, X = x] - E[G(Y(0)) \mid P(W) = u, U_D = u, X = x] + \int_0^u \frac{\partial E[G(Y(1)) \mid P(W) = u, U_D = \tilde u, X = x]}{\partial u}\,d\tilde u + \int_u^1 \frac{\partial E[G(Y(0)) \mid P(W) = u, U_D = \tilde u, X = x]}{\partial u}\,d\tilde u. \quad (24)
$$
Under the assumption of instrument exogeneity (and the assumption that $Z$ does not affect the potential outcomes directly), we have that $\partial E[G(Y(1)) \mid P(W) = u, U_D = \tilde u, X = x]/\partial u = 0$ for $\tilde u \le u$, and that $\partial E[G(Y(0)) \mid P(W) = u, U_D = \tilde u, X = x]/\partial u = 0$ for $\tilde u > u$. Hence, the last two terms in (24) disappear, and conditioning on $P(W) = u$ in the first two terms can be dropped. Therefore, a key identification assumption for $\mathrm{MTE}_\tau$ is the assumption of instrument exogeneity.
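Equation (23) can be illustrated in a stylized model of our own (not from the paper) in which the conditional expectation has a closed form: let $U_D \sim U[0,1]$, let the instrument generate $P(W) = \Phi(Z)$ with $Z \perp U_D$, and let the outcomes be fully selection-driven, $Y(0) = U_D$ and $Y(1) = U_D + 0.5$. Then $E[1\{Y \le y\} \mid P(W) = u] = \int_0^u 1\{t + 0.5 \le y\}\,dt + \int_u^1 1\{t \le y\}\,dt$, and Proposition 1 can be checked by finite differences away from the kinks:

```python
# Check of Proposition 1, MTE_tau(u) = -d/du E[1{Y <= y_tau} | P(W) = u],
# in a stylized design: Y(0) = U_D, Y(1) = U_D + 0.5, U_D ~ U[0,1].
def m(u, y, n=200_000):
    # midpoint-rule evaluation of
    # int_0^u 1{t + 0.5 <= y} dt + int_u^1 1{t <= y} dt
    treated = sum(1 for i in range(n) if (i + 0.5) / n < u and (i + 0.5) / n + 0.5 <= y)
    control = sum(1 for i in range(n) if (i + 0.5) / n >= u and (i + 0.5) / n <= y)
    return (treated + control) / n

y_tau = 0.8            # a fixed evaluation point; kinks of m(., y_tau) are at 0.3 and 0.8
h = 0.01
checks = []
for u in (0.1, 0.45, 0.7, 0.95):       # interior points away from the kinks
    mte = (1 if u <= y_tau else 0) - (1 if u + 0.5 <= y_tau else 0)   # MTE_tau(u)
    liv = -(m(u + h, y_tau) - m(u - h, y_tau)) / (2 * h)              # local IV slope
    checks.append((u, mte, liv))
print(checks)
```

Here the treatment effect is a pure location shift driven by $U_D$, so the local IV derivative switches between $0$ and $1$ exactly where the quantile-crossing indicator does.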
4.4 Identification of the RN Derivative

In this subsection, we investigate the identification of the RN derivative $\dot P^z(w)$ in the representation of $\Pi_{\tau,z}$. By Lemma 4 together with Assumption 4(b), we have
$$
\dot P^z(w) = \left.\frac{\partial P_\delta(w)}{\partial \delta}\right|_{\delta=0} = \frac{f_{V|X}(\mu(w) \mid x)\, \mu'_z(w)\, g(w)}{E\left[f_{V|X}(\mu(W) \mid X)\, \mu'_z(W)\, g(W)\right]},
$$
where $\mu'_z(w) := \partial \mu(z, x)/\partial z$. Under Assumption 4(b), the propensity score becomes
$$
P(w) = \Pr(V \le \mu(W) \mid W = w) = F_{V|Z,X}(\mu(z, x) \mid z, x) = F_{V|X}(\mu(z, x) \mid x). \quad (25)
$$
Therefore,
$$
\frac{\partial P(w)}{\partial z} = f_{V|X}(\mu(z, x) \mid x)\, \mu'_z(z, x) = f_{V|X}(\mu(w) \mid x)\, \mu'_z(w).
$$
It is now clear that $\dot P^z(w)$ can be represented using $\partial P(w)/\partial z$ and $g(w)$. We formalize this in the following proposition.

Proposition 2.
Let Assumption 4(b) and the assumptions in Lemma 4 hold. Then
$$
\dot P^z(w) = \left.\frac{\partial P_\delta(w)}{\partial \delta}\right|_{\delta=0} = \frac{\frac{\partial P(w)}{\partial z}\, g(w)}{E\left[\frac{\partial P(W)}{\partial z}\, g(W)\right]}. \quad (26)
$$
Since $g(w)$ is known and $\partial P(w)/\partial z$ is identified, $\dot P^z(w)$ is also identified. As in the case of $\mathrm{MTE}_\tau$ identification, Assumption 4(b) plays a key role in identifying $\dot P^z(w)$. Without the assumption that $V$ is independent of $Z$ conditional on $X$, we can have only that
$$
\dot P^z(w) = \frac{f_{V|W}(\mu(w) \mid w)\, \mu'_z(w)\, g(w)}{E\left[f_{V|W}(\mu(W) \mid W)\, \mu'_z(W)\, g(W)\right]} \quad \text{and} \quad \frac{\partial P(w)}{\partial z} = f_{V|Z,X}(\mu(w) \mid w)\, \mu'_z(w) + \left.\frac{\partial F_{V|Z,X}(\mu(w) \mid \tilde z, x)}{\partial \tilde z}\right|_{\tilde z = z}.
$$
The presence of the second term in the above equation invalidates the identification result in (26).

Using Propositions 1 and 2, we can represent $\Pi_{\tau,z}$ as
$$
\Pi_{\tau,z} = -\frac{1}{f_Y(y_\tau)} \int_{\mathcal{W}} \frac{\partial E[1\{Y \le y_\tau\} \mid P(W) = P(w), X = x]}{\partial P(w)} \cdot \frac{\frac{\partial P(w)}{\partial z}\, g(w)}{E\left[\frac{\partial P(W)}{\partial z}\, g(W)\right]}\, f_W(w)\,dw. \quad (27)
$$
All objects in the above are point identified, hence $\Pi_{\tau,z}$ is point identified.

We note that, in general, $\Pi_{\tau,\mu}$ is not point identified. Even if there is a valid instrument such that $\mathrm{MTE}_\tau(u, x)$ is identified, $\dot P^\mu(w)$, the RN derivative in the definition of $\Pi_{\tau,\mu}$, may not be point identified. Observing that
$$
\dot P^\mu(w) = \left.\frac{\partial P^\mu_\delta(w)}{\partial \delta}\right|_{\delta=0} = \frac{f_{V|X}(\mu(w) \mid x)}{E\left[f_{V|X}(\mu(W) \mid X)\right]},
$$
we see that, in general, $\dot P^\mu(w)$ can be identified using the instrument in only the special case where $\mu'_z(w) = 1$ for all $w \in \mathcal{W}$. In this special case, $\dot P^\mu(w) = \dot P^z(w)$. When $\dot P^\mu(w)$ is not identified, $\Pi_{\tau,\mu}$ is also not identified. The most we can do in such a case is to bound the unconditional quantile effect. Suppose $\dot P^\mu(w)$ does not change sign, so that $\dot P^\mu(w) \ge 0$ for all $w \in \mathcal{W}$.
Then we have
$$\Pi_{\tau,\mu} = \int_{\mathcal W}\mathrm{MTE}_\tau(P(w), x)\,\dot P_\mu(w)\,f_W(w)\,dw \in \left[\inf_{w\in\mathcal W}\mathrm{MTE}_\tau(P(w),x),\ \sup_{w\in\mathcal W}\mathrm{MTE}_\tau(P(w),x)\right]$$
because $\int_{\mathcal W}\dot P_\mu(w)\,f_W(w)\,dw = 1$. We leave the details of the bound approach under partial identification for future research.
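The bracketing argument above can be illustrated numerically: any nonnegative weighting that integrates to one against $f_W$ yields a weighted average of the MTE values lying inside the stated bounds. A minimal sketch in Python (the MTE values and the candidate weights below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values of MTE_tau(P(w), x) on a grid of w (illustration only).
mte = np.sin(np.linspace(0.0, np.pi, 200)) - 0.3

# Any nonnegative weights summing to one play the role of the unidentified
# Radon-Nikodym derivative P_mu-dot(w) times f_W(w) on the grid.
for _ in range(100):
    weights = rng.random(200)
    weights /= weights.sum()                  # normalize to integrate to one
    candidate = float(np.dot(weights, mte))   # a candidate value of Pi_{tau, mu}
    assert mte.min() <= candidate <= mte.max()
```

Point identification fails because the weights are unknown, but every admissible weighting stays within $[\inf_w \mathrm{MTE}_\tau, \sup_w \mathrm{MTE}_\tau]$.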
Let $\mathcal F^*$ be the space of finite signed measures $\nu$ on $\mathcal Y \subseteq \mathbb R$ with distribution function $F_\nu(y) = \nu(-\infty, y]$ for $y \in \mathcal Y$. We endow $\mathcal F^*$ with the usual supremum norm: for two distribution functions $F_{\nu_1}$ and $F_{\nu_2}$ associated with the signed measures $\nu_1$ and $\nu_2$ on $\mathcal Y$, we define $\|F_{\nu_1} - F_{\nu_2}\| = \sup_{y\in\mathcal Y}|F_{\nu_1}(y) - F_{\nu_2}(y)|$. In this section, we consider a general functional $\rho: \mathcal F^* \to \mathbb R$ and study the general unconditional effect.

The baseline model is the same as in Section 3. As before, we modify the propensity score to improve the treatment take-up rate from $p$ to $p+\delta$. The (general) unconditional policy effect is defined to be the marginal change in $\rho(F_{Y_\delta})$ in the limit as $\delta$ goes to 0.

Definition 3. General Unconditional Policy Effect
The general unconditional policy effect for the functional $\rho$ is defined as
$$\Pi_\rho = \lim_{\delta\to 0}\frac{\rho(F_{Y_\delta}) - \rho(F_Y)}{\delta}$$
whenever this limit exists.

The definition above is the same as that of the marginal partial distributional policy effect defined in Rothe (2012). The term we use is closer to those in Firpo, Fortin and Lemieux (2009). (Given that $\int_{\mathcal W}\dot P_\mu(w)\,f_W(w)\,dw = 1$, it is impossible that $\dot P_\mu(w) \le 0$ for all $w \in \mathcal W$.)

Characterization of the General Unconditional Policy Effect

We first consider a Hadamard differentiable functional $\rho$. For completeness, we provide the definition of Hadamard differentiability below.

Definition 4. $\rho: \mathcal F^* \to \mathbb R$ is Hadamard differentiable at $F \in \mathcal F^*$ if there exists a linear and continuous functional $\dot\rho_F: \mathcal F^* \to \mathbb R$ such that for any $G \in \mathcal F^*$ and $G_\delta \in \mathcal F^*$ with
$$\lim_{\delta\to 0}\sup_{y\in\mathcal Y}|G_\delta(y) - G(y)| = 0,$$
we have
$$\lim_{\delta\to 0}\frac{\rho(F_Y + \delta G_\delta) - \rho(F_Y)}{\delta} = \dot\rho_F(G).$$

Recall that by Lemma 2 we have the expansion
$$F_{Y_\delta}(y) = F_Y(y) + \delta\,E\Big[\big\{F_{Y(1)|U_D,W}(y|P(W),W) - F_{Y(0)|U_D,W}(y|P(W),W)\big\}\dot P(W)\Big] + R_F(\delta; y),$$
where $\sup_{y\in\mathcal Y}|R_F(\delta;y)| = o(|\delta|)$ as $\delta \to 0$. Taking
$$G_\delta(y) = \frac{1}{\delta}\left[F_{Y_\delta}(y) - F_Y(y)\right] \quad\text{and}\quad G(y) = E\Big[\big\{F_{Y(1)|U_D,W}(y|P(W),W) - F_{Y(0)|U_D,W}(y|P(W),W)\big\}\dot P(W)\Big], \tag{28}$$
we have $\lim_{\delta\to 0}\sup_{y\in\mathcal Y}|G_\delta(y) - G(y)| = 0$. Applying the Hadamard differentiability of $\rho$, we obtain
$$\Pi_\rho = \lim_{\delta\to 0}\frac{\rho(F_{Y_\delta}) - \rho(F_Y)}{\delta} = \lim_{\delta\to 0}\frac{\rho(F_Y + \delta G_\delta) - \rho(F_Y)}{\delta} = \dot\rho_F(G) = \int_{\mathcal Y}\psi(y, \rho, F_Y)\,dG(y),$$
where $\psi(y, \rho, F_Y)$ is the influence function of $\rho$ at $F_Y$. Plugging (28) into this result yields the following theorem.

Theorem 4.
Let the assumptions in Lemma 2 (i.e., Assumptions 1–3) hold. Assume further that $\rho: \mathcal F^* \to \mathbb R$ is Hadamard differentiable. Then
$$\Pi_\rho = \int_{\mathcal Y}\psi(y,\rho,F_Y)\,E\Big[\big\{f_{Y(1)|U_D,W}(y|P(W),W) - f_{Y(0)|U_D,W}(y|P(W),W)\big\}\dot P(W)\Big]\,dy. \tag{29}$$

Define
$$\mathrm{MTE}_\rho(u, w) = E\left[\psi(Y(1), \rho, F_Y) - \psi(Y(0), \rho, F_Y)\,|\,U_D = u, W = w\right]. \tag{30}$$
Then
$$\Pi_\rho = \int_{\mathcal W}\mathrm{MTE}_\rho(P(w), w)\,\dot P(w)\,f_W(w)\,dw. \tag{31}$$
Hence, the general unconditional policy effect $\Pi_\rho$ can be represented as a weighted average of $\mathrm{MTE}_\rho(u,w)$ over the marginal subpopulation.

Consider the quantile functional $\rho_\tau(F) = F^{-1}(\tau)$. It is well known that this functional $\rho_\tau(\cdot)$ is Hadamard differentiable, and its influence function is
$$\psi(y, \rho_\tau, F_Y) = \frac{\tau - \mathbf 1\{y \le y_\tau\}}{f_Y(y_\tau)}. \tag{32}$$
Plugging this into (30) yields
$$\mathrm{MTE}_\tau(u, w) = -\frac{1}{f_Y(y_\tau)}E\left[\mathbf 1\{Y(1) \le y_\tau\} - \mathbf 1\{Y(0) \le y_\tau\}\,|\,U_D = u, W = w\right].$$
This is the same as $\mathrm{MTE}_\tau$ in Definition 2 except for the scaling factor $f_Y(y_\tau)$ and the absence of the instrument in the conditioning set. The representation in Theorem 4 is then exactly the same as the representation in Theorem 1.

Following Firpo, Fortin and Lemieux (2009), we may construct an unconditional regression using $\psi(y, \rho_\tau, F_Y)$ as the dependent variable and $D$ and other covariates as the independent variables. Like the UQR estimator, such an estimator will be inconsistent for $\Pi_\rho$, and its asymptotic bias can be similarly decomposed into two sources. The identification of $\Pi_\rho$ under instrument intervention for a general $\rho$ can be established in the same way as that for the quantile functional. We omit the details here.

While Theorem 4 covers general functionals, it does not cover the mean functional $\rho(F) = \int_{\mathcal Y} y\,dF(y)$ unless $\mathcal Y$ is a bounded set. When $\mathcal Y$ is unbounded, the mean functional is not continuous on $(\mathcal F^*, \|\cdot\|_\infty)$ and hence is not Hadamard differentiable (see Exercise 7 in Chapter 20 of van der Vaart (2000)).
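The role of the influence function can be checked numerically for the quantile functional: along a mixture path $F_\varepsilon = (1-\varepsilon)F_Y + \varepsilon G$, the derivative of $\rho_\tau$ at $\varepsilon = 0$ equals $\int\psi(y,\rho_\tau,F_Y)\,dG(y) = (\tau - G(y_\tau))/f_Y(y_\tau)$. A minimal Python sketch, with $F_Y = N(0,1)$ and $G = N(0.5, 1)$ as illustrative assumptions:

```python
import math

def Phi(x):      # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):      # standard normal pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

tau = 0.5
F = Phi                            # baseline distribution F_Y = N(0, 1)
G = lambda y: Phi(y - 0.5)         # direction: CDF of N(0.5, 1)

# Influence-function prediction of the derivative at the median y_tau = 0:
y_tau = 0.0
deriv_if = (tau - G(y_tau)) / phi(y_tau)

def quantile(cdf, t, lo=-10.0, hi=10.0):
    # bisection for the generalized inverse of a monotone CDF
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

eps = 1e-5
F_eps = lambda y: (1 - eps) * F(y) + eps * G(y)
deriv_fd = (quantile(F_eps, tau) - y_tau) / eps   # finite-difference derivative

assert abs(deriv_fd - deriv_if) < 1e-3
```

The finite-difference derivative of the quantile along the path matches the influence-function formula (32) up to numerical error.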
In such a case, we opt for a direct approach by showing that
$$\lim_{\delta\to 0}\frac{1}{\delta}\int_{\mathcal Y} y\,dR_F(\delta; y) = 0, \tag{33}$$
so that
$$\Pi_\rho = \int_{\mathcal Y} y\,E\Big[\big\{f_{Y(1)|U_D,W}(y|P(W),W) - f_{Y(0)|U_D,W}(y|P(W),W)\big\}\dot P(W)\Big]\,dy + \lim_{\delta\to 0}\frac1\delta\int_{\mathcal Y} y\,dR_F(\delta;y) = \int_{\mathcal Y} y\,E\Big[\big\{f_{Y(1)|U_D,W}(y|P(W),W) - f_{Y(0)|U_D,W}(y|P(W),W)\big\}\dot P(W)\Big]\,dy.$$
The result in (33) holds if the following stronger version of Assumption 3 holds:

Assumption 5. Stronger Domination Conditions
For $d = 0, 1$,
$$\int_{\mathcal Y(d)}\sup_{\delta\in N_\varepsilon}|y|\,f_{Y(d)|D_\delta}(y|d)\,dy < \infty \quad\text{and}\quad \int_{\mathcal Y(d)}\sup_{\delta\in N_\varepsilon}\left|y\,\frac{\partial f_{Y(d)|D_\delta}(y|d)}{\partial\delta}\right|dy < \infty.$$

Corollary 4.
Let Assumptions 1, 2, and 5 hold. Then for the mean functional, we have
$$\Pi_\rho = E\Big\{E\left[Y(1) - Y(0)\,|\,U_D = P(W), W\right]\dot P(W)\Big\}. \tag{34}$$

In this section we take a closer look at the relationship between general unconditional effects under endogeneity and marginal policy relevant treatment effects. A policy relevant treatment effect is a comparison of two policies with different incentives for participation. It is assumed that the potential outcomes remain the same. Using the notation of the previous sections, consider a baseline policy where $\delta = 0$ versus an alternative policy where $\delta > 0$. As before, $p = \Pr(D = 1)$, whereas $p + \delta = \Pr(D_\delta = 1)$. Here, $D_\delta$ is the treatment status under the alternative policy. Suppose we are interested in the effect of this policy change on the unconditional mean of the observed outcomes. To this end, Heckman and Vytlacil (2001b, 2005) consider the policy relevant treatment effect defined as
$$\mathrm{PRTE}_\delta = \frac{E(Y_\delta) - E(Y)}{E(D_\delta) - E(D)} = \frac{E(Y_\delta) - E(Y)}{\delta}. \tag{35}$$
Taking the limit as $\delta \to 0$ yields the marginal policy relevant treatment effect (MPRTE) of Carneiro, Heckman and Vytlacil (2010): $\mathrm{MPRTE} = \lim_{\delta\to 0}\mathrm{PRTE}_\delta$.

Following Heckman and Vytlacil (2001b, 2005), we can show that the MPRTE can be represented in terms of the following marginal treatment effect:
$$\mathrm{MTE}(u, x) := E\left[Y(1) - Y(0)\,|\,U_D = u, X = x\right].$$
We note that $\mathrm{MTE}(u,x)$ is different from $\mathrm{MTE}_\tau(u,x)$ in Definition 2. To simplify the notation, we drop the covariate $X$. Let $P$ and $P_\delta$ be the propensity scores, that is, $P = \Pr(D = 1|Z)$ and $P_\delta = \Pr(D_\delta = 1|Z)$. This new notation for the propensity scores, $P$ and $P_\delta$, suppresses their dependence on $Z$ and highlights that $P$ and $P_\delta$ are themselves random variables. Let $f_{P_\delta}(\cdot)$ and $F_{P_\delta}(\cdot)$ be the pdf and CDF of $P_\delta$, respectively. When $\delta = 0$, we denote the pdf and CDF of $P$ by $f_P(\cdot)$ and $F_P(\cdot)$, respectively. Then
$$\mathrm{MPRTE} = -\int_0^1\mathrm{MTE}(u)\left.\frac{\partial F_{P_\delta}(u)}{\partial\delta}\right|_{\delta=0}du. \tag{36}$$
A proof is given in the supplementary appendix.

To obtain an expression for $\partial F_{P_\delta}(u)/\partial\delta|_{\delta=0}$, we consider the special case $\mu(Z) = \gamma Z$ with $\gamma > 0$. This linear $\mu(\cdot)$ is a simplified version of Assumption B-1 in Carneiro, Heckman and Vytlacil (2010). In this case, $P = F_V(\gamma Z)$ and $P_\delta = F_V(\gamma Z + \gamma s_z(\delta))$. Consider a constant shift of magnitude $s_z(\delta)$ in $Z$ (i.e., $g(\cdot) = 1$). Under this shift, the participation rate increases by $\delta$ so that
$$E(P_\delta) = \Pr(D_\delta = 1) = \Pr(D = 1) + \delta = E(P) + \delta.$$
We have
$$F_{P_\delta}(u) = \Pr(P_\delta \le u) = \Pr(F_V(\gamma Z + \gamma s_z(\delta)) \le u) = \Pr(\gamma Z + \gamma s_z(\delta) \le F_V^{-1}(u)) = \Pr(F_V(\gamma Z) \le F_V(F_V^{-1}(u) - \gamma s_z(\delta))) = F_P(F_V(F_V^{-1}(u) - \gamma s_z(\delta))). \tag{37}$$
Differentiating (37) with respect to $\delta$, we get
$$\frac{\partial F_{P_\delta}(u)}{\partial\delta} = -f_P(F_V(F_V^{-1}(u) - \gamma s_z(\delta)))\,f_V(F_V^{-1}(u) - \gamma s_z(\delta))\,\gamma\,\frac{\partial s_z(\delta)}{\partial\delta}. \tag{38}$$
Using Lemma 4 and setting $g(\cdot) = 1$, we have
$$\left.\frac{\partial s_z(\delta)}{\partial\delta}\right|_{\delta=0} = \frac{1}{\int_{\mathcal Z} f_V(\gamma z)\,\gamma\,f_Z(z)\,dz}. \tag{39}$$
Evaluating (38) at $\delta = 0$ yields
$$\left.\frac{\partial F_{P_\delta}(u)}{\partial\delta}\right|_{\delta=0} = -\frac{f_P(u)\,f_V(F_V^{-1}(u))}{\int_{\mathcal Z} f_V(\gamma z)\,f_Z(z)\,dz}. \tag{40}$$
Now, the marginal policy relevant treatment effect is
$$\mathrm{MPRTE} = \int_0^1\mathrm{MTE}(u)\,\frac{f_P(u)\,f_V(F_V^{-1}(u))}{\int_{\mathcal Z} f_V(\gamma z)\,f_Z(z)\,dz}\,du.$$
Consider the change of variable $u = P(z)$, where $P(z) = F_V(\gamma z)$ in this particular case. Then $du = \gamma f_V(\gamma z)\,dz$ and $F_V^{-1}(u) = \gamma z$. Note also that
$$F_P(u) = \Pr(P \le u) = \Pr(F_V(\gamma Z) \le u) = F_Z(\gamma^{-1}F_V^{-1}(u)).$$
Therefore, the density $f_P(u)$ is
$$f_P(u) = \frac{f_Z(\gamma^{-1}F_V^{-1}(u))\,\gamma^{-1}}{f_V(F_V^{-1}(u))} = \frac{\gamma^{-1}f_Z(z)}{f_V(F_V^{-1}(u))},$$
and the MPRTE becomes
$$\mathrm{MPRTE} = \int_{\mathcal Z}\mathrm{MTE}(P(z))\,\frac{f_V(\gamma z)}{\int_{\mathcal Z} f_V(\gamma\tilde z)\,f_Z(\tilde z)\,d\tilde z}\,f_Z(z)\,dz = \int_{\mathcal Z}\mathrm{MTE}(P(z))\,\dot P(z)\,f_Z(z)\,dz.$$
This result is precisely the one in (34) in Corollary 4 after dropping the covariate $X$. This is also Example 2 in Carneiro, Heckman and Vytlacil (2010) for the case where, in their notation, $q_\alpha(t) = t + \alpha$. A formal proof of the equivalence of the MPRTE to $\Pi_\rho$ when $\rho$ is the mean functional (as in Corollary 4) in more general cases is sketched in the appendix. While our results cover the MPRTE as a special case, they also cover more general unconditional policy effects.

This section is devoted to the estimation and inference of the UQE under instrument intervention, as described in Section 4.
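The change of variables behind the last display can be verified numerically. The sketch below assumes, purely for illustration, that $V \sim N(0,1)$, $Z \sim N(0,1)$, $\gamma = 1$, and a hypothetical $\mathrm{MTE}(u) = u$; it checks that the limiting weights integrate to one against $f_Z$ and that the $u$-space and $z$-space expressions for the MPRTE agree:

```python
import numpy as np
from math import erf

gamma = 1.0
f = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)   # f_V = f_Z = phi
Phi = lambda t: 0.5 * (1.0 + np.array([erf(s / np.sqrt(2.0)) for s in t]))

def trapz(y, x):
    # simple trapezoidal rule (avoids version-specific numpy helpers)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

z = np.linspace(-8.0, 8.0, 20001)
C = trapz(f(gamma * z) * f(z), z)            # E[f_V(gamma Z)], normalizing constant

# The weights P-dot(z) = f_V(gamma z) / C integrate to one against f_Z.
P_dot = f(gamma * z) / C
assert abs(trapz(P_dot * f(z), z) - 1.0) < 1e-6

MTE = lambda u: u                            # hypothetical MTE(u)
mprte_z = trapz(MTE(Phi(gamma * z)) * P_dot * f(z), z)   # z-space formula

# u-space formula with f_P(u) = f_Z(F_V^{-1}(u)/gamma) / (gamma f_V(F_V^{-1}(u))).
u = np.linspace(1e-4, 1.0 - 1e-4, 20001)
Finv = np.interp(u, Phi(gamma * z), gamma * z)           # F_V^{-1}(u) on a grid
f_P = f(Finv / gamma) / (gamma * f(Finv))
mprte_u = trapz(MTE(u) * f_P * f(Finv) / C, u)

assert abs(mprte_u - mprte_z) < 1e-3
```

Both integration routes return the same value, as the change of variable $u = P(z)$ requires.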
We assume that the propensity score function is parametric, and we leave the case with a nonparametric propensity score to the next section.

Letting $m(y_\tau, P(w), x) := E[\mathbf 1\{Y \le y_\tau\}\,|\,P(W) = P(w), X = x]$ and using (27), we have
$$\Pi_{\tau,z} = \int_{\mathcal W}\mathrm{MTE}_\tau(P(w), x)\,\dot P_z(w)\,f_W(w)\,dw = -\frac{1}{f_Y(y_\tau)}\left\{E\left[\frac{\partial P(W)}{\partial z}g(W)\right]\right\}^{-1}\int_{\mathcal W}\frac{\partial m(y_\tau, P(w), x)}{\partial z}\,g(w)\,f_W(w)\,dw.$$
Note that the presence of $g(\cdot)$ amounts to a change of measure (from a measure with density $f_W(w)$ to a measure with density $g(w)f_W(w)$). In order to simplify the notation, we set $g(\cdot) \equiv 1$ in the rest of this section; a general $g(\cdot)$ can be handled with straightforward modifications.

With $g(\cdot) = 1$, we write the parameter $\Pi_{\tau,z}$ as
$$\Pi_{\tau,z} = -\frac{1}{f_Y(y_\tau)}\,E\left[\frac{\partial P(W)}{\partial z}\right]^{-1}E\left[\frac{\partial m(y_\tau, P(W), X)}{\partial z}\right]. \tag{41}$$
$\Pi_{\tau,z}$ consists of two average derivatives and a density evaluated at a point. As shown by Newey (1994), the two average derivatives are $\sqrt n$-estimable. However, the density at a point cannot be estimated at the usual $\sqrt n$ rate unless a parametric model is imposed.

First, we make an assumption concerning the dimensions of the variables.

Assumption 6.
The covariate vector $X$ is an element of $\mathbb R^{d_X}$, and $Z \in \mathbb R$.

In empirical applications, we may have a few exogenous variables that affect the treatment choice but not the outcome of interest directly. The instrument $Z$ can be any one of these exogenous variables, and the rest of the exogenous variables become part of the covariate vector $X$. The unconditional effect is specific to the instrumental variable $Z$ that we choose to intervene on in order to improve the treatment adoption rate.

The rest of this section is structured as follows: in Section 7.1 we establish the rate of convergence of the two-step estimator of $f_Y(y_\tau)$; in Section 7.2 we find the asymptotic distribution of the terms associated with the propensity score; and in Section 7.3 we establish the asymptotic distribution of $\hat\Pi_{\tau,z}$ and construct a pivotal test statistic for testing the null of a zero effect.

For a given sample $\{O_i = (Y_i, Z_i, X_i, D_i)\}_{i=1}^n$, we use $\mathbb P_n$ to denote the empirical measure. The expectation of a function $\chi(O)$ with respect to $\mathbb P_n$ is $\mathbb P_n\chi = n^{-1}\sum_{i=1}^n\chi(O_i)$.

For a given $\tau$, we estimate $y_\tau$ using the (generalized) inverse of the empirical distribution function of $Y$:
$$\hat y_\tau = \inf\{y : F_n(y) \ge \tau\}, \quad\text{where}\quad F_n(y) := \frac1n\sum_{i=1}^n\mathbf 1\{Y_i \le y\}.$$
Serfling (1980) establishes the following asymptotic result.

Lemma 5. If the density $f_Y(\cdot)$ of $Y$ is positive and continuous at $y_\tau$, then
$$\hat y_\tau - y_\tau = \mathbb P_n\psi_Q(y_\tau) + o_p(n^{-1/2}), \quad\text{where}\quad \psi_Q(y_\tau) := \frac{\tau - \mathbf 1\{Y \le y_\tau\}}{f_Y(y_\tau)}.$$

We use a kernel density estimator to estimate $f_Y(y)$. We maintain the following assumption on the kernel function.

Assumption 7.
Kernel Assumption
The kernel function $K(\cdot)$ satisfies (i) $\int_{-\infty}^{\infty}K(u)\,du = 1$, (ii) $\int_{-\infty}^{\infty}u^2K(u)\,du < \infty$, and (iii) $K(u) = K(-u)$. In addition, $K$ is twice differentiable with Lipschitz-continuous second-order derivative $K''(u)$ satisfying (i) $\int_{-\infty}^{\infty}|K''(u)\,u|\,du < \infty$ and (ii) there exist positive constants $C_1$ and $C_2$ such that $|K''(u_1) - K''(u_2)| \le C_1|u_1 - u_2|$ for $|u_1 - u_2| \ge C_2$.

(For Lemma 5, see Section 2.5.1 of Serfling (1980); in fact, Serfling (1980) provides a better rate for the remainder.)
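One familiar kernel satisfying these conditions is the Gaussian kernel $K(u) = \phi(u)$, whose second derivative is $K''(u) = (u^2 - 1)\phi(u)$. A quick numerical check of the moment conditions (a sketch, not a proof):

```python
import numpy as np

K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel
K2 = lambda u: (u**2 - 1.0) * K(u)                         # its second derivative

u = np.linspace(-12.0, 12.0, 240001)
du = u[1] - u[0]
integral = lambda y: float(np.sum(0.5 * (y[1:] + y[:-1])) * du)   # trapezoid rule

assert abs(integral(K(u)) - 1.0) < 1e-6          # (i)  K integrates to one
assert abs(integral(u**2 * K(u)) - 1.0) < 1e-4   # (ii) second moment is finite (= 1)
assert np.allclose(K(u), K(-u))                  # (iii) symmetry
assert np.isfinite(integral(np.abs(K2(u) * u)))  # |K''(u) u| is integrable
```

The Lipschitz condition on $K''$ also holds for the Gaussian kernel since $K'''$ is bounded.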
We also need the following rate assumption on the bandwidth.

Assumption 8. Rate Assumption
Assume that $n \uparrow \infty$ and $h \downarrow 0$ such that $nh^2 \uparrow \infty$ and $nh^5 \downarrow 0$.

The non-standard condition $nh^2 \uparrow \infty$ is due to the estimation of $y_\tau$. Since we need to expand $\hat f_Y(\hat y_\tau) - \hat f_Y(y_\tau)$, the derivative of $\hat f_Y(y)$ will entail a slower decay for $h$. The details can be found in the proof of Lemma 6. We note, however, that $nh^2 \uparrow \infty$ implies that $nh \uparrow \infty$.

The estimator of $f_Y(y)$ is then given by
$$\hat f_Y(y) = \frac1n\sum_{i=1}^n K_h(Y_i - y), \quad\text{where}\quad K_h(u) = K(u/h)/h.$$

Lemma 6.
Let Assumptions 7 and 8 hold. Then
$$\hat f_Y(y) - f_Y(y) = \mathbb P_n\psi_{f_Y}(y) + B_{f_Y}(y) + o_p(h^2),$$
where
$$\psi_{f_Y}(y) := K_h(Y - y) - E[K_h(Y - y)], \quad\text{with}\quad \mathbb P_n\psi_{f_Y}(y) = O_p(n^{-1/2}h^{-1/2}),$$
and
$$B_{f_Y}(y) = \frac{h^2}{2}f''_Y(y)\int_{-\infty}^{\infty}u^2K(u)\,du.$$
Furthermore, for the quantile estimator $\hat y_\tau$ of $y_\tau$ that satisfies Lemma 5, we have
$$\hat f_Y(\hat y_\tau) - \hat f_Y(y_\tau) = f_Y(\hat y_\tau) - f_Y(y_\tau) + R_{f_Y} = f'_Y(y_\tau)\,\mathbb P_n\psi_Q(y_\tau) + R_{f_Y},$$
where $R_{f_Y} = o_p(n^{-1/2}h^{-1/2})$.

In order to isolate the contributions of $\hat f_Y$ and $\hat y_\tau$, we can use Lemma 6 to write
$$\hat f_Y(\hat y_\tau) - f_Y(y_\tau) = \left[\hat f_Y(y_\tau) - f_Y(y_\tau)\right] + \left[f_Y(\hat y_\tau) - f_Y(y_\tau)\right] + R_{f_Y}. \tag{42}$$
The first pair of terms on the right-hand side of (42) represents the dominant term and reflects the uncertainty in the estimation of $f_Y$. The second pair of terms reflects the error from estimating $y_\tau$. In order to ensure that $R_{f_Y} = o_p(n^{-1/2}h^{-1/2})$, we need $nh^2 \uparrow \infty$, as stated in Assumption 8. We will use (42) repeatedly.

In this subsection we assume that the propensity score is known up to a finite-dimensional vector $\alpha_0$. For example, the propensity score is a logit function of $W = (Z, X)$. In this case, we estimate $\alpha_0$ using the maximum likelihood estimator.

Assumption 9. The propensity score is known up to a finite-dimensional vector $\alpha_0 \in \mathbb R^{d_\alpha}$.

We denote the propensity score by $P(Z, X, \alpha_0)$. Under Assumption 9, the parameter $\Pi_{\tau,z}$ can be written as
$$\Pi_{\tau,z} = -\frac{1}{f_Y(y_\tau)}\cdot\frac{T_2}{T_1}, \tag{43}$$
where
$$T_1 = E\left[\frac{\partial P(Z, X, \alpha_0)}{\partial z}\right] \quad\text{and}\quad T_2 = E\left[\frac{\partial m(y_\tau, P(Z, X, \alpha_0), X)}{\partial z}\right].$$
First, we estimate $T_1$, the average value of the derivative of the propensity score, by
$$T_{1n}(\hat\alpha) := \frac1n\sum_{i=1}^n\left.\frac{\partial P(z, x, \hat\alpha)}{\partial z}\right|_{(z,x) = (Z_i, X_i)}.$$
To save space, we slightly abuse notation and write $T_{1n}(\hat\alpha)$ as
$$T_{1n}(\hat\alpha) = \frac1n\sum_{i=1}^n\frac{\partial P(Z_i, X_i, \hat\alpha)}{\partial z}. \tag{44}$$
We adopt this convention in the rest of the paper.

Lemma 7.
Suppose that
(i) $\hat\alpha$ admits the linear representation
$$\hat\alpha - \alpha_0 = \mathbb P_n\psi_\alpha + o_p(n^{-1/2}); \tag{45}$$
(ii) the variance of $\partial P(Z,X,\alpha_0)/\partial z$ is finite;
(iii) the derivative $\partial^2 P(Z,X,\alpha)/\partial\alpha\,\partial z$ exists in an open neighborhood around $\alpha_0$;
(iv) the following uniform law of large numbers holds:
$$\sup_{\alpha\in\mathcal A}\left\|\frac1n\sum_{i=1}^n\frac{\partial^2 P(Z_i,X_i,\alpha)}{\partial z\,\partial\alpha} - E\left[\frac{\partial^2 P(Z,X,\alpha)}{\partial z\,\partial\alpha}\right]\right\| \overset{p}{\to} 0,$$
where $\mathcal A$ is a neighborhood around $\alpha_0$.
Then $T_{1n}(\hat\alpha)$ can be represented as
$$T_{1n}(\hat\alpha) - T_1 = E\left[\frac{\partial^2 P(Z,X,\alpha_0)}{\partial z\,\partial\alpha'}\right]\mathbb P_n\psi_\alpha + \mathbb P_n\psi_{\partial P} + o_p(n^{-1/2}),$$
where
$$\psi_{\partial P} := \frac{\partial P(Z,X,\alpha_0)}{\partial z} - E\left[\frac{\partial P(Z,X,\alpha_0)}{\partial z}\right].$$

We can rewrite the main result of Lemma 7 as
$$T_{1n}(\hat\alpha) - T_1 = \left[T_{1n}(\alpha_0) - T_1\right] + \left[T_{1n}(\hat\alpha) - T_{1n}(\alpha_0)\right] = \left[T_{1n}(\alpha_0) - T_1\right] + \left\{E\left[\frac{\partial P(Z,X,\hat\alpha)}{\partial z}\right] - E\left[\frac{\partial P(Z,X,\alpha_0)}{\partial z}\right]\right\} + o_p(n^{-1/2}). \tag{46}$$
Equation (46) has the same interpretation as equation (42). It consists of a pair of leading terms that ignores the estimation uncertainty in $\hat\alpha$ but accounts for the variability of the sample mean, and another pair that accounts for the uncertainty in $\hat\alpha$ but ignores the variability of the sample mean.

We estimate the second average derivative $T_2$ by
$$T_{2n}(\hat y_\tau, \hat m, \hat\alpha) := \frac1n\sum_{i=1}^n\frac{\partial\hat m(\hat y_\tau, P(Z_i,X_i,\hat\alpha), X_i)}{\partial z}. \tag{47}$$
This can be regarded as a four-step estimator. The first step estimates $y_\tau$, the second step estimates $\alpha_0$, the third step estimates the conditional expectation $m(y_\tau, P(Z,X,\alpha_0), X)$ using the generated regressor $P(Z,X,\hat\alpha)$, and the fourth step averages the derivative (with respect to $z$) over $X$ and the generated regressor $P(Z,X,\hat\alpha)$.

We use the series method to estimate $m$. To alleviate the notation, define the vectors
$$\tilde w(\alpha) := (P(z,x,\alpha), x)' \quad\text{and}\quad \tilde W_i(\alpha) := (P(Z_i,X_i,\alpha), X_i)'.$$
Both $\tilde w(\alpha)$ and $\tilde W_i(\alpha)$ are in $\mathbb R^{d_X+1}$.

Let $\phi^J(\tilde w(\alpha)) = (\phi_{1J}(\tilde w(\alpha)), \ldots, \phi_{JJ}(\tilde w(\alpha)))'$ be a vector of $J$ basis functions of $\tilde w(\alpha)$ with finite second moments, where each $\phi_{jJ}(\cdot)$ is a differentiable basis function. Then, the conditional expectation estimator is
$$\hat m(\hat y_\tau, \tilde w(\hat\alpha)) = \phi^J(\tilde w(\hat\alpha))'\,\hat b(\hat\alpha, \hat y_\tau), \tag{48}$$
where $\hat b(\hat\alpha, \hat y_\tau)$ is the least squares estimate:
$$\hat b(\hat\alpha, \hat y_\tau) = \left(\sum_{i=1}^n\phi^J(\tilde W_i(\hat\alpha))\phi^J(\tilde W_i(\hat\alpha))'\right)^{-1}\sum_{i=1}^n\phi^J(\tilde W_i(\hat\alpha))\,\mathbf 1\{Y_i \le \hat y_\tau\}. \tag{49}$$
The estimator of the derivative is then
$$\frac{\partial\hat m(\hat y_\tau, \tilde w(\hat\alpha))}{\partial z} = \frac{\partial\phi^J(\tilde w(\hat\alpha))'}{\partial z}\,\hat b(\hat\alpha, \hat y_\tau), \tag{50}$$
and the estimator of the average derivative becomes
$$T_{2n}(\hat y_\tau, \hat m, \hat\alpha) = \frac1n\sum_{i=1}^n\frac{\partial\phi^J(\tilde W_i(\hat\alpha))'}{\partial z}\,\hat b(\hat\alpha, \hat y_\tau). \tag{51}$$

We use the path derivative approach of Newey (1994) to obtain a decomposition of $T_{2n}(\hat y_\tau, \hat m, \hat\alpha) - T_2$, which is similar to that in Section 2.1 of Hahn and Ridder (2013). To describe the idea, let $O := (Y, Z, X, D)$ be the vector of observations, and let $\{F_\theta\}$ be a path of distributions indexed by $\theta \in \mathbb R$ such that $F_{\theta_0}$ is the true distribution of $O$. The parametric assumption on the propensity score need not be imposed on the path. The score of the parametric submodel is
$$S(O) = \left.\frac{\partial\log dF_\theta(O)}{\partial\theta}\right|_{\theta=\theta_0}.$$
For any $\theta$, we define
$$T_{2\theta} = E_\theta\left[\frac{\partial m_\theta(y_{\tau,\theta}, \tilde W(\alpha_\theta))}{\partial z}\right],$$
where $m_\theta$, $y_{\tau,\theta}$, and $\alpha_\theta$ are the probability limits of $\hat m$, $\hat y_\tau$, and $\hat\alpha$, respectively, when the distribution of $O$ is $F_\theta$. Note that when $\theta = \theta_0$, we have $T_{2\theta_0} = T_2$. Suppose the set of scores $\{S(O)\}$ for all parametric submodels $\{F_\theta\}$ can approximate in the mean square any zero-mean, finite-variance function of $O$.
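The least squares construction in (48)–(51) is straightforward to implement. The following sketch (Python; the data-generating process, the quadratic basis, and the sample size are all illustrative assumptions) estimates $m$ and the average derivative on simulated data with a known propensity score, using the chain rule $\partial\phi^J/\partial z = (\partial\phi^J/\partial P)(\partial P/\partial z)$:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Simulated design (all functional forms here are illustrative assumptions).
z = rng.normal(size=n)
x = rng.normal(size=n)
P = 1.0 / (1.0 + np.exp(-(0.5 * z + 0.5 * x)))   # parametric propensity score
dP_dz = 0.5 * P * (1.0 - P)                      # its analytic derivative in z

m_true = 0.2 + 0.5 * P                            # toy E[1{Y <= y_tau} | P, X]
R = rng.binomial(1, m_true).astype(float)         # binary responses 1{Y_i <= y_tau}

# Quadratic basis phi_J(P, X) and its derivative with respect to z
# (the chain rule runs through P only, since X does not depend on z).
Phi = np.column_stack([np.ones(n), P, x, P**2, x**2, P * x])
dPhi_dz = np.column_stack([np.zeros(n), dP_dz, np.zeros(n),
                           2.0 * P * dP_dz, np.zeros(n), x * dP_dz])

b, *_ = np.linalg.lstsq(Phi, R, rcond=None)       # least squares coefficients
m_hat = Phi @ b                                   # fitted m at the sample points
T2_hat = float((dPhi_dz @ b).mean())              # estimated average derivative
T2_true = float((0.5 * dP_dz).mean())             # truth in this toy model

assert np.sqrt(np.mean((m_hat - m_true) ** 2)) < 0.02
assert abs(T2_hat - T2_true) < 0.02
```

Because the true $m$ lies in the span of the basis here, the fitted conditional expectation and its averaged derivative recover the truth up to sampling error.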
If the function $\theta \mapsto T_{2\theta}$ is differentiable at $\theta_0$ and we can write
$$\left.\frac{\partial T_{2\theta}}{\partial\theta}\right|_{\theta=\theta_0} = E\left[\Gamma(O)S(O)\right] \tag{52}$$
for some mean-zero and finite second-moment function $\Gamma(\cdot)$ and any path $F_\theta$, then, by Theorem 2.1 of Newey (1994), the asymptotic variance of $T_{2n}(\hat y_\tau, \hat m, \hat\alpha)$ is $E[\Gamma(O)^2]$.

In the next lemma, we will show that $\theta \mapsto T_{2\theta}$ is differentiable at $\theta_0$. Suppose for the moment this is the case. Then, by the chain rule, we can write
$$\left.\frac{\partial T_{2\theta}}{\partial\theta}\right|_{\theta=\theta_0} = \left.\frac{\partial}{\partial\theta}E_\theta\left[\frac{\partial m_\theta(y_{\tau,\theta}, \tilde W(\alpha_\theta))}{\partial z}\right]\right|_{\theta=\theta_0} = \left.\frac{\partial}{\partial\theta}E_\theta\left[\frac{\partial m(y_\tau, \tilde W(\alpha_0))}{\partial z}\right]\right|_{\theta=\theta_0} + \left.\frac{\partial}{\partial\theta}E\left[\frac{\partial m_\theta(y_\tau, \tilde W(\alpha_0))}{\partial z}\right]\right|_{\theta=\theta_0} + \left.\frac{\partial}{\partial\theta}E\left[\frac{\partial m(y_{\tau,\theta}, \tilde W(\alpha_0))}{\partial z}\right]\right|_{\theta=\theta_0} + \left.\frac{\partial}{\partial\theta}E\left[\frac{\partial m(y_\tau, \tilde W(\alpha_\theta))}{\partial z}\right]\right|_{\theta=\theta_0}. \tag{53}$$
(As we show later, the error from estimating the propensity score does not affect the asymptotic variance of $T_{2n}(\hat y_\tau, \hat m, \hat\alpha)$. The mean-square approximation property of the scores is the "generality" requirement on the family of distributions in Newey (1994).)

To use Theorem 2.1 of Newey (1994), we need to write all these terms in an outer-product form, namely the form of the right-hand side of (52). To find the required function $\Gamma(\cdot)$, we can examine one component of $T_{2\theta}$ at a time, treating the remaining components as known, an observation due to Newey (1994).

The next lemma provides the conditions under which we can ignore the error from estimating the propensity score in our asymptotic analysis.

Lemma 8. Assume that
(i) $(Z,X)$ is absolutely continuous with density $f_{ZX}(z,x)$ satisfying (a) $f_{ZX}(z,x)$ is continuously differentiable with respect to $z$ on $\mathcal Z\times\mathcal X$; (b) for each $x \in \mathcal X$, $f_{ZX}(z,x) = 0$ for any $z$ on the boundary of its support $\mathcal Z(x)$; (c) $\partial\log f_{Z,X}(Z,X)/\partial z$ has finite second moments;
(ii) the following conditional mean independence holds:
$$E\left[\frac{\partial\log f_W(W)}{\partial z}\,\Big|\,\tilde W(\alpha_0), \frac{\partial P(W,\alpha_0)}{\partial\alpha}\right] = E\left[\frac{\partial\log f_W(W)}{\partial z}\,\Big|\,\tilde W(\alpha_0)\right],$$
$$E\left[\mathbf 1\{Y \le y_\tau\}\,\Big|\,\tilde W(\alpha_0), \frac{\partial P(W,\alpha_0)}{\partial\alpha}\right] = E\left[\mathbf 1\{Y \le y_\tau\}\,\Big|\,\tilde W(\alpha_0)\right];$$
(iii) $m(y_\tau, \tilde w(\alpha))$ is continuously differentiable with respect to $z$ of all orders, and for a neighborhood $\Theta$ of $\theta_0$ the following hold:
$$E\sup_{\theta\in\Theta}\left|\frac{\partial}{\partial\alpha_\theta}\frac{\partial m(y_\tau, \tilde W(\alpha_\theta))}{\partial z}\right| < \infty, \quad E\sup_{\theta\in\Theta}\left|\frac{\partial}{\partial\alpha_\theta}E\left[\frac{\partial\log f_W(W)}{\partial z}\,\Big|\,\tilde W(\alpha_\theta)\right]\right| < \infty,$$
$$E\sup_{\theta\in\Theta}\left|\frac{\partial}{\partial\alpha_\theta}\left\{m(y_\tau, \tilde W(\alpha_\theta))\,E\left[\frac{\partial\log f_W(W)}{\partial z}\,\Big|\,\tilde W(\alpha_\theta)\right]\right\}\right| < \infty.$$
Then
$$\left.\frac{\partial}{\partial\theta}E\left[\frac{\partial m(y_\tau, \tilde W(\alpha_\theta))}{\partial z}\right]\right|_{\theta=\theta_0} = 0.$$

Condition (ii) deserves a comment. Suppose the equation system $P(W,\alpha_0) = P(w,\alpha_0)$ and $X = x$, with $W = (Z,X)$ as the unknown, has a unique solution $Z = z$, so that the equation system implies $W = w$. In this case, conditioning on $\partial P(W,\alpha_0)/\partial\alpha$ becomes redundant. In general, the equation system may not have a unique solution for $Z$; Condition (ii) will still hold if knowing that $\partial P(W,\alpha_0)/\partial\alpha = \partial P(w,\alpha_0)/\partial\alpha$ does not change the solution set. For example, suppose there is no $X$ and consider
$$P(Z,\alpha) = \frac{\exp(Z^2\alpha)}{1 + \exp(Z^2\alpha)}, \quad\text{so that}\quad \frac{\partial P(Z,\alpha)}{\partial\alpha} = \frac{Z^2\exp(Z^2\alpha)}{\left[1 + \exp(Z^2\alpha)\right]^2}.$$
For any $z > 0$, $P(Z,\alpha_0) = P(z,\alpha_0)$ has two solutions: $Z = z$ and $Z = -z$. The solution set will not change if we also know that $\partial P(Z,\alpha_0)/\partial\alpha = \partial P(z,\alpha_0)/\partial\alpha$. Thus Condition (ii) holds.

The next lemma establishes a stochastic approximation of $T_{2n}(\hat y_\tau, \hat m, \hat\alpha) - T_2$ and provides the influence function as well. The assumptions of the lemma are adapted from Newey (1994). These assumptions are not necessarily the weakest possible.

Lemma 9.
Suppose that
(i) the support $\mathcal Z\times\mathcal X$ of $(Z,X)$ is $[z_l, z_u] \times [x_{1l}, x_{1u}] \times \cdots \times [x_{d_X l}, x_{d_X u}]$, $f_{ZX}(z,x)$ is bounded below by
$$C \times (z - z_l)^{\kappa}(z_u - z)^{\kappa}\left(\prod_{j=1}^{d_X}(x_j - x_{jl})^{\kappa}(x_{ju} - x_j)^{\kappa}\right)$$
for some $C > 0$ and $\kappa > 0$, and
$$\int_{\mathcal Z\times\mathcal X}\sup_{\theta\in\Theta}\left|\frac{\partial f_{ZX}(z,x;\theta)}{\partial\theta}\right|dz\,dx < \infty;$$
(ii) there is a constant $C$ such that $|\partial^a m(y_\tau, \tilde w(\alpha_0))/\partial z^a| \le C^a$ for all $a \in \mathbb N$;
(iii) the number of series terms, $J$, satisfies $J(n) = O(n^{\varrho})$ for some $\varrho > 0$, and $J^{2+\kappa} = O(n)$;
(iv) the map $(y, z) \mapsto m(y, \tilde w(\alpha_0))$ is differentiable, and
$$E\sup_{y\in\mathcal Y}\left|\frac{\partial^2 m(y, \tilde W(\alpha_0))}{\partial y\,\partial z}\right| < \infty;$$
(v) the following stochastic equicontinuity conditions hold:
$$E\left[\frac{\partial\hat m(y_\tau, \tilde W(\alpha_0))}{\partial z} - \frac{\partial m(y_\tau, \tilde W(\alpha_0))}{\partial z}\right] = \frac1n\sum_{i=1}^n\left[\frac{\partial\hat m(y_\tau, \tilde W_i(\alpha_0))}{\partial z} - \frac{\partial m(y_\tau, \tilde W_i(\alpha_0))}{\partial z}\right] + o_p(n^{-1/2}),$$
$$E\left[\frac{\partial m(\hat y_\tau, \tilde W(\alpha_0))}{\partial z} - \frac{\partial m(y_\tau, \tilde W(\alpha_0))}{\partial z}\right] = \frac1n\sum_{i=1}^n\left[\frac{\partial m(\hat y_\tau, \tilde W_i(\alpha_0))}{\partial z} - \frac{\partial m(y_\tau, \tilde W_i(\alpha_0))}{\partial z}\right] + o_p(n^{-1/2});$$
(vi) the assumptions of Lemma 8 hold.
Then, we have the decomposition
$$T_{2n}(\hat y_\tau, \hat m, \hat\alpha) - T_2 = \left[T_{2n}(y_\tau, m, \alpha_0) - T_2\right] + \left[T_{2n}(y_\tau, \hat m, \alpha_0) - T_{2n}(y_\tau, m, \alpha_0)\right] + \left[T_{2n}(\hat y_\tau, m, \alpha_0) - T_{2n}(y_\tau, m, \alpha_0)\right] + o_p(n^{-1/2}). \tag{54}$$
Additionally,
$$T_{2n}(\hat y_\tau, \hat m, \hat\alpha) - T_2 = \mathbb P_n\psi_{\partial m} - \mathbb P_n\psi_m + \mathbb P_n\tilde\psi_Q(y_\tau) + o_p(n^{-1/2}),$$
where
$$\psi_{\partial m} := \frac{\partial m(y_\tau, \tilde W(\alpha_0))}{\partial z} - T_2, \qquad \psi_m := \left[\mathbf 1\{Y \le y_\tau\} - m(y_\tau, \tilde W(\alpha_0))\right] \times E\left[\frac{\partial\log f_W(W)}{\partial z}\,\Big|\,\tilde W(\alpha_0)\right],$$
and
$$\tilde\psi_Q(y_\tau) := E\left[\frac{\partial f_{Y|\tilde W(\alpha_0)}(y_\tau|\tilde W(\alpha_0))}{\partial z}\right]\psi_Q(y_\tau) = E\left[\frac{\partial f_{Y|\tilde W(\alpha_0)}(y_\tau|\tilde W(\alpha_0))}{\partial z}\right]\frac{\tau - \mathbf 1\{Y \le y_\tau\}}{f_Y(y_\tau)}.$$

Lemma 9 characterizes the contribution of each stage to the final influence function.
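The multi-step construction behind the point estimate, a quantile, a kernel density, an average propensity derivative, and a series regression combined as in (43), can be assembled end to end. A minimal Python sketch on simulated data (all design choices are illustrative assumptions; for brevity the true propensity coefficients are plugged in rather than the MLE $\hat\alpha$, and the outcome is drawn independently of $(Z, X)$ so that the true effect is zero):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
tau = 0.5

# Toy data: outcome independent of (Z, X), so the true effect is zero.
z = rng.normal(size=n)
x = rng.normal(size=n)
P = 1.0 / (1.0 + np.exp(-(0.5 * z + 0.5 * x)))   # known parametric propensity
dP_dz = 0.5 * P * (1.0 - P)                       # analytic derivative in z
y = rng.normal(size=n)

# Step 1: empirical tau-quantile of Y.
y_tau = np.quantile(y, tau)

# Step 2: kernel density estimate of f_Y at y_tau (Gaussian kernel,
# rule-of-thumb bandwidth as an illustrative choice).
h = 1.06 * y.std() * n ** (-0.2)
f_hat = float(np.mean(np.exp(-0.5 * ((y - y_tau) / h) ** 2))) / (h * np.sqrt(2 * np.pi))

# Step 3: T_1n, the average derivative of the propensity score.
T1 = float(dP_dz.mean())

# Step 4: T_2n via a series regression of 1{Y <= y_tau} on (P, X).
R = (y <= y_tau).astype(float)
B = np.column_stack([np.ones(n), P, x, P**2, x**2, P * x])
dB = np.column_stack([np.zeros(n), dP_dz, np.zeros(n),
                      2.0 * P * dP_dz, np.zeros(n), x * dP_dz])
b, *_ = np.linalg.lstsq(B, R, rcond=None)
T2 = float((dB @ b).mean())

# Plug-in estimate of the form -(1 / f_hat) * T2 / T1.
Pi_hat = -T2 / (f_hat * T1)

assert abs(f_hat - 1.0 / np.sqrt(2 * np.pi)) < 0.02   # f_Y(y_tau) = phi(0) here
assert abs(Pi_hat) < 0.5                              # true effect is zero
```

With the outcome independent of the instrument, $T_{2n}$ is within sampling error of zero and so is the resulting effect estimate, consistent with the decomposition in Lemma 9.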
The contribution of the estimation of $m$, given by $\mathbb P_n\psi_m$, corresponds to the one in Proposition 5 of Newey (1994).

We estimate the UQE by
$$\hat\Pi_{\tau,z}(\hat y_\tau, \hat f_Y, \hat m, \hat\alpha) = -\frac{1}{\hat f_Y(\hat y_\tau)}\frac{T_{2n}(\hat y_\tau, \hat m, \hat\alpha)}{T_{1n}(\hat\alpha)}. \tag{55}$$
With the asymptotic linear representations of the arguments $\hat f_Y(\hat y_\tau)$, $T_{1n}(\hat\alpha)$, and $T_{2n}(\hat y_\tau, \hat m, \hat\alpha)$, we can obtain the asymptotic linear representation of $\hat\Pi_{\tau,z}(\hat y_\tau, \hat f_Y, \hat m, \hat\alpha)$. The next theorem follows from combining Lemmas 6, 7, and 9.

Theorem 5. Under the assumptions of Lemmas 6, 7, and 9, we have
$$\hat\Pi_{\tau,z} - \Pi_{\tau,z} = \frac{T_2}{f_Y(y_\tau)^2\,T_1}\left[\mathbb P_n\psi_{f_Y}(y_\tau) + B_{f_Y}(y_\tau)\right] + \frac{T_2}{f_Y(y_\tau)^2\,T_1}f'_Y(y_\tau)\,\mathbb P_n\psi_Q(y_\tau) + \frac{T_2}{f_Y(y_\tau)\,T_1^2}\,\mathbb P_n\psi_{\partial P} + \frac{T_2}{f_Y(y_\tau)\,T_1^2}\,E\left[\frac{\partial^2 P(Z,X,\alpha_0)}{\partial z\,\partial\alpha'}\right]\mathbb P_n\psi_\alpha - \frac{1}{f_Y(y_\tau)\,T_1}\,\mathbb P_n\psi_{\partial m} + \frac{1}{f_Y(y_\tau)\,T_1}\,\mathbb P_n\psi_m - \frac{1}{f_Y(y_\tau)\,T_1}\,\mathbb P_n\tilde\psi_Q(y_\tau) + R_\Pi, \tag{56}$$
where
$$R_\Pi = O_p\left(\big|\hat f_Y(\hat y_\tau) - f_Y(y_\tau)\big|^2\right) + O_p(n^{-1}) + O_p\left(n^{-1/2}\big|\hat f_Y(\hat y_\tau) - f_Y(y_\tau)\big|\right) + O_p(|R_{f_Y}|) + o_p(n^{-1/2}) + o_p(h^2). \tag{57}$$
Furthermore, under Assumption 8, $\sqrt{nh}\,R_\Pi = o_p(1)$.

Equation (56) consists of six influence functions and a bias term. The bias term $B_{f_Y}(y_\tau)$ arises from estimating the density and is of order $O(h^2)$. The influence functions reflect the impact of each estimation stage. The rate of convergence of $\hat\Pi_{\tau,z}$ is slowed down through $\mathbb P_n\psi_{f_Y}(y_\tau)$, which is of order $O_p(n^{-1/2}h^{-1/2})$. We can summarize the results of Theorem 5 in a single equation:
$$\hat\Pi_{\tau,z} - \Pi_{\tau,z} = \mathbb P_n\psi_{\Pi_\tau} + \tilde B_{f_Y}(y_\tau) + o_p(n^{-1/2}h^{-1/2}),$$
where $\psi_{\Pi_\tau}$ collects all the influence functions in (56) except for the bias, and
$$\tilde B_{f_Y}(y_\tau) := \frac{T_2}{f_Y(y_\tau)^2\,T_1}B_{f_Y}(y_\tau).$$
The bias term is $o(n^{-1/2}h^{-1/2})$ by Assumption 8. The following corollary provides the asymptotic distribution of $\hat\Pi_{\tau,z}$.

Corollary 5.
Under the assumptions of Theorem 5,
$$\sqrt{nh}\left(\hat\Pi_{\tau,z} - \Pi_{\tau,z}\right) = \sqrt n\,\mathbb P_n\left(\sqrt h\,\psi_{\Pi_\tau}\right) + o_p(1) \Rightarrow N(0, V_\tau), \quad\text{where}\quad V_\tau = \lim_{h\downarrow 0}E\left[h\,\psi_{\Pi_\tau}^2\right]. \tag{58}$$

From the perspective of asymptotic theory, $\sqrt{nh}\,\mathbb P_n\psi_Q(y_\tau)$, $\sqrt{nh}\,\mathbb P_n\psi_{\partial P}$, $\sqrt{nh}\,\mathbb P_n\psi_\alpha$, $\sqrt{nh}\,\mathbb P_n\psi_{\partial m}$, and $\sqrt{nh}\,\mathbb P_n\psi_m$ are all of order $O_p(h^{1/2}) = o_p(1)$ and hence can be ignored in large samples. The asymptotic variance is then given by
$$V_\tau = \frac{T_2^2}{f_Y(y_\tau)^4\,T_1^2}\lim_{h\downarrow 0}E\left[h\,\psi_{f_Y}(y_\tau)^2\right] = \frac{T_2^2}{f_Y(y_\tau)^3\,T_1^2}\int_{-\infty}^{\infty}K(u)^2\,du.$$
However, this $V_\tau$ ignores all estimation uncertainties except that in $\hat f_Y(y_\tau)$, and we do not expect it to reflect the finite-sample variability of $\sqrt{nh}(\hat\Pi_{\tau,z} - \Pi_{\tau,z})$ well. To improve the finite-sample performance, we keep the dominating term from each source of estimation error and employ a sample counterpart of $E[h\,\psi_{\Pi_\tau}^2]$ to estimate $V_\tau$. This is done in the next subsection.

The asymptotic variance $V_\tau$ in (58) can be estimated by the plug-in estimator
$$\hat V_\tau = \frac hn\sum_{i=1}^n\hat\psi_{\Pi_\tau,i}^2, \tag{59}$$
where, by Theorem 5,
$$\hat\psi_{\Pi_\tau,i} = \frac{\hat T_{2n}}{\hat f_Y(\hat y_\tau)^2\,\hat T_{1n}}\hat\psi_{f,i}(\hat y_\tau) + \frac{\hat T_{2n}}{\hat f_Y(\hat y_\tau)^2\,\hat T_{1n}}\hat f'_Y(\hat y_\tau)\,\hat\psi_{Q,i}(\hat y_\tau) + \frac{\hat T_{2n}}{\hat f_Y(\hat y_\tau)\,\hat T_{1n}^2}\hat\psi_{\partial P,i} + \frac{\hat T_{2n}}{\hat f_Y(\hat y_\tau)\,\hat T_{1n}^2}\left[\frac1n\sum_{j=1}^n\frac{\partial^2 P(W_j, \hat\alpha)}{\partial z\,\partial\alpha'}\right]\hat\psi_{\alpha,i} - \frac{1}{\hat f_Y(\hat y_\tau)\,\hat T_{1n}}\hat\psi_{\partial m,i} + \frac{1}{\hat f_Y(\hat y_\tau)\,\hat T_{1n}}\hat\psi_{m,i} - \frac{1}{\hat f_Y(\hat y_\tau)\,\hat T_{1n}}\hat E\left[\frac{\partial f_{Y|\tilde W(\alpha)}(\hat y_\tau|\tilde W(\hat\alpha))}{\partial z}\right]\hat\psi_{Q,i}(\hat y_\tau).$$
In this equation, $\hat T_{1n} = T_{1n}(\hat\alpha)$, $\hat T_{2n} = T_{2n}(\hat y_\tau, \hat m, \hat\alpha)$,
$$\hat\psi_{f,i}(\hat y_\tau) = K_h(Y_i - \hat y_\tau) - \frac1n\sum_{j=1}^nK_h(Y_j - \hat y_\tau), \qquad \hat\psi_{Q,i}(\hat y_\tau) = \frac{\tau - \mathbf 1\{Y_i \le \hat y_\tau\}}{\hat f_Y(\hat y_\tau)},$$
$$\hat\psi_{\partial P,i} = \frac{\partial P(W_i, \hat\alpha)}{\partial z} - \frac1n\sum_{j=1}^n\frac{\partial P(W_j, \hat\alpha)}{\partial z},$$
$$\hat\psi_{\alpha,i} = \left(\frac1n\sum_{j=1}^n\frac{\left[P_\partial(W_j'\hat\alpha)\right]^2W_jW_j'}{P(W_j, \hat\alpha)\left[1 - P(W_j, \hat\alpha)\right]}\right)^{-1}\frac{P_\partial(W_i'\hat\alpha)\,W_i\left[D_i - P(W_i, \hat\alpha)\right]}{P(W_i, \hat\alpha)\left[1 - P(W_i, \hat\alpha)\right]},$$
$$\hat\psi_{\partial m,i} = \frac{\partial\hat m(\hat y_\tau, \tilde W_i(\hat\alpha))}{\partial z} - \hat T_{2n}, \qquad \hat\psi_{m,i} = \left(\mathbf 1\{Y_i \le \hat y_\tau\} - \hat m(\hat y_\tau, \tilde W_i(\hat\alpha))\right)\times\hat E\left[\frac{\partial\log f_W(W_i)}{\partial z}\,\Big|\,\tilde W_i(\hat\alpha)\right],$$
and
$$\hat E\left[\frac{\partial f_{Y|\tilde W(\alpha)}(\hat y_\tau|\tilde W(\hat\alpha))}{\partial z}\right] = \frac1n\sum_{j=1}^n\left.\frac{\partial}{\partial z}\frac{\sum_{i\ne j}K_h(Y_i - \hat y_\tau)\,K_h(\tilde W_i(\hat\alpha) - \tilde w(\hat\alpha))}{\sum_{i\ne j}K_h(\tilde W_i(\hat\alpha) - \tilde w(\hat\alpha))}\right|_{w = W_j}$$
for the rescaled kernel function $K_h(\cdot)$ defined by $K_h([\chi_1, \ldots, \chi_\ell]) = \prod_{j=1}^{\ell}K_h(\chi_j)$ with $[\chi_1, \ldots, \chi_\ell] \in \mathbb R^{\ell}$.

Most of these plug-in estimates are self-explanatory. For example, $\hat\psi_{\alpha,i}$ is the estimated influence function for the MLE when the propensity score takes the linear index form $P(W_i, \hat\alpha) = P(W_i'\hat\alpha)$ and $P_\partial(a) = \partial P(a)/\partial a$. If the propensity score function does not take a linear index form, then we need to make some adjustment to $\hat\psi_{\alpha,i}$. The only thing we need to do is find the influence function for the MLE, which is an easy task, and then plug $\hat\alpha$ into the influence function.

The only remaining quantity that needs some explanation is $\hat\psi_{m,i}$, which involves a nonparametric regression of $\partial\log f_W(W_i)/\partial z$ on $\tilde W_i(\hat\alpha) := (P(Z_i, X_i, \hat\alpha), X_i)'$. We let
$$\hat E\left[\frac{\partial\log f_W(W_i)}{\partial z}\,\Big|\,\tilde W_i(\hat\alpha)\right] = -\phi^J(\tilde W_i(\hat\alpha))'\left(\sum_{\ell=1}^n\phi^J(\tilde W_\ell(\hat\alpha))\phi^J(\tilde W_\ell(\hat\alpha))'\right)^{-1}\sum_{\ell=1}^n\frac{\partial\phi^J(\tilde W_\ell(\hat\alpha))}{\partial z}.$$
To see why this may be consistent for $E[\partial\log f_W(W_i)/\partial z\,|\,\tilde W_i(\alpha_0)]$, we note that under some conditions the following hold:
$$\left\|\frac1n\sum_{\ell=1}^n\frac{\partial\phi^J(\tilde W_\ell(\alpha_0))}{\partial z} - E\left[\frac{\partial\phi^J(\tilde W(\alpha_0))}{\partial z}\right]\right\| = o_p(1), \qquad \left\|\frac1n\sum_{\ell=1}^n\phi^J(\tilde W_\ell(\alpha_0))\frac{\partial\log f_W(W_\ell)}{\partial z} - E\left[\phi^J(\tilde W(\alpha_0))\frac{\partial\log f_W(W)}{\partial z}\right]\right\| = o_p(1).$$
But
$$E\left[\frac{\partial\phi^J(\tilde W(\alpha_0))}{\partial z}\right] = \int_{\mathcal W}\frac{\partial\phi^J(\tilde w(\alpha_0))}{\partial z}f_W(w)\,dw = -\int_{\mathcal Z\times\mathcal X}\phi^J(\tilde w(\alpha_0))\frac{\partial f_W(w)}{\partial z}\,dz\,dx = -\int_{\mathcal W}\phi^J(\tilde w(\alpha_0))\frac{\partial\log f_W(w)}{\partial z}f_W(w)\,dw = -E\left[\phi^J(\tilde W(\alpha_0))\frac{\partial\log f_W(W)}{\partial z}\right].$$
Hence
$$\left\|\frac1n\sum_{\ell=1}^n\frac{\partial\phi^J(\tilde W_\ell(\alpha_0))}{\partial z} + \frac1n\sum_{\ell=1}^n\phi^J(\tilde W_\ell(\alpha_0))\frac{\partial\log f_W(W_\ell)}{\partial z}\right\| = o_p(1),$$
and $\hat E[\partial\log f_W(W_i)/\partial z\,|\,\tilde W_i(\hat\alpha)]$ is approximately equal to
$$\phi^J(\tilde W_i(\hat\alpha))'\left(\sum_{\ell=1}^n\phi^J(\tilde W_\ell(\hat\alpha))\phi^J(\tilde W_\ell(\hat\alpha))'\right)^{-1}\sum_{\ell=1}^n\phi^J(\tilde W_\ell(\hat\alpha))\frac{\partial\log f_W(W_\ell)}{\partial z},$$
which is just a series approximation to $E[\partial\log f_W(W_i)/\partial z\,|\,\tilde W_i(\alpha_0)]$.

The consistency of $\hat V_\tau$ can be established by using the uniform law of large numbers. The arguments are standard but tedious. We omit the details here.

We can use Corollary 5 for hypothesis testing on $\Pi_{\tau,z}$. Since $\hat\Pi_{\tau,z}$ converges to $\Pi_{\tau,z}$ at a nonparametric rate, in general the test will have power only against a departure of a nonparametric rate. However, if we are interested in testing the null of a zero effect, that is, $H_0: \Pi_{\tau,z} = 0$ against $H_1: \Pi_{\tau,z} \ne$
0, we can detect a parametric rate of departure from the null. The reason is that, by (43), Π_{τ,z} = 0 if and only if T_2 =
0, and T_2 can be estimated at the usual parametric rate. Hence, instead of testing H_0: Π_{τ,z} = 0 against H_1: Π_{τ,z} ≠
0, we test the equivalent hypotheses H_0: T_2 = 0 against H_1: T_2 ≠ 0 using the estimator T_{2,n}(ŷ_τ, m̂, α̂) of T_2. In view of its influence function given in Lemma 9, we can estimate the asymptotic variance of T_{2,n}(ŷ_τ, m̂, α̂) by

V̂_2 = (1/n) Σ_{i=1}^n ψ̂_{2,i}^2,

where

ψ̂_{2,i} := [∂φ_J(W̃_i(α̂))/∂z]' b̂(α̂, ŷ_τ) − T_{2,n}(ŷ_τ, m̂, α̂)
− [ (1/n) Σ_{ℓ=1}^n ∂φ_J(W̃_ℓ(α̂))'/∂z ] [ (1/n) Σ_{ℓ=1}^n φ_J(W̃_ℓ(α̂)) φ_J(W̃_ℓ(α̂))' ]^{−1} × φ_J(W̃_i(α̂)) ( 1{Y_i ≤ ŷ_τ} − φ_J(W̃_i(α̂))' b̂(α̂, ŷ_τ) )
+ Ê[ ∂ log f_W(W_i)/∂z | W̃_i(α̂) ] · (τ − 1{Y_i ≤ ŷ_τ}) / f̂_Y(ŷ_τ).

We can then form the test statistic

T := √n T_{2,n}(ŷ_τ, m̂, α̂) / √V̂_2.

By Lemma 9 and using standard arguments, we can show that T ⇒ N(
0, 1). To save space, we omit the details here.

8 Estimation and Inference under a Nonparametric Propensity Score
In this section, we drop Assumption 9 and estimate the propensity score nonparametrically using the series method. Relative to the results of the previous sections, we only need to modify Lemma 7, since Lemma 9 shows that we do not need to account for the error from estimating the propensity score.

Let P̂(w) denote the nonparametric series estimator of P(w). The estimator of T_1 := E[∂P(W)/∂z] is now

T_{1,n}(P̂) := (1/n) Σ_{i=1}^n ∂P̂(w)/∂z |_{w = W_i}.

The estimator of T_2 is the same as in (47) but with P(W_i, α̂) replaced by P̂(W_i):

T_{2,n}(ŷ_τ, m̂, P̂) := (1/n) Σ_{i=1}^n ∂m̂(ŷ_τ, P̂(W_i), X_i)/∂z,  (60)

where, as in (47), m̂ is the series estimator of m. The formula is the same as before; we only need to replace P(W_i, α̂) by P̂(W_i). The UQE estimator becomes

Π̂_{τ,z}(ŷ_τ, f̂_Y, m̂, P̂) := −(1/f̂_Y(ŷ_τ)) [ (1/n) Σ_{i=1}^n ∂P̂(W_i)/∂z ]^{−1} (1/n) Σ_{i=1}^n ∂m̂(ŷ_τ, P̂(W_i), X_i)/∂z = −(1/f̂_Y(ŷ_τ)) T_{2,n}(ŷ_τ, m̂, P̂)/T_{1,n}(P̂).  (61)

The following lemma follows directly from Theorem 7.2 of Newey (1994).

Lemma 10.
Let Assumption (i) of Lemma 8 and Assumptions (i) and (iii) of Lemma 9 hold. Assume further that P(z, x) is continuously differentiable with respect to z of all orders, and that there is a constant C such that |∂^a P(z, x)/∂z^a| ≤ C^a for all a ∈ N. Then

T_{1,n}(P̂) − T_1 = P_n ψ_{∂P,s} − P_n ψ_{P,s} + o_p(n^{−1/2}),

where we define ψ_{∂P,s} := ∂P(W)/∂z − T_1 and ψ_{P,s} := (D − P(W)) × ∂ log f_W(W)/∂z.

Using a proof similar to that of Lemma 9, we can show that T_{2,n}(ŷ_τ, m̂, P̂) and T_{2,n}(ŷ_τ, m̂, P) have the same influence function. That is, we have

T_{2,n}(ŷ_τ, m̂, P̂) − T_2 = P_n ψ_{∂m} − P_n ψ_m + P_n ψ̃_Q(y_τ) + o_p(n^{−1/2}),

where

ψ_{∂m} := ∂m(y_τ, P(W), X)/∂z − T_2,

ψ_m := ( 1{Y ≤ y_τ} − m(y_τ, P(W), X) ) × E[ ∂ log f_W(W)/∂z | P(W), X ],

and

ψ̃_Q(y_τ) = E[ ∂f_{Y|P(W),X}(y_τ | P(W), X)/∂z ] ψ_Q(y_τ).

Given the asymptotic linear representations of T_{1,n}(P̂) − T_1 and T_{2,n}(ŷ_τ, m̂, P̂) − T_2, we can directly use Lemma 9, together with Lemma 6, to obtain an asymptotic linear representation of Π̂_{τ,z}(ŷ_τ, f̂_Y, m̂, P̂).

Theorem 6.
Under the assumptions of Lemmas 6, 9, and 10, we have

Π̂_{τ,z} − Π_{τ,z} = [T_2/(f_Y(y_τ)^2 T_1)] [ P_n ψ_{f_Y}(y_τ) + B_{f_Y}(y_τ) ] + [T_2 f′_Y(y_τ)/(f_Y(y_τ)^2 T_1)] P_n ψ_Q(y_τ)
+ [T_2/(f_Y(y_τ) T_1^2)] P_n ψ_{∂P,s} − [T_2/(f_Y(y_τ) T_1^2)] P_n ψ_{P,s}
− [1/(f_Y(y_τ) T_1)] P_n ψ_{∂m} + [1/(f_Y(y_τ) T_1)] P_n ψ_m − [1/(f_Y(y_τ) T_1)] P_n ψ̃_Q(y_τ) + R_Π,  (62)

where

R_Π = O_p( |f̂_Y(ŷ_τ) − f_Y(y_τ)|^2 ) + O_p(n^{−1}) + O_p( n^{−1/2} |f̂_Y(ŷ_τ) − f_Y(y_τ)| ) + O_p(|R_{f_Y}|) + o_p(n^{−1/2}) + o_p(h^2).  (63)

Furthermore, under Assumption 8, √(nh) R_Π = o_p(1).

We summarize the results of Theorem 6 in a single equation:

Π̂_{τ,z} − Π_{τ,z} = P_n ψ_{Π,τ} + B̃_{f_Y}(y_τ) + o_p(n^{−1/2} h^{−1/2}),

where ψ_{Π,τ} collects all the influence functions in (62) except for the bias, R_Π is absorbed into the o_p(n^{−1/2} h^{−1/2}) term, and B̃_{f_Y}(y_τ) := [T_2/(f_Y(y_τ)^2 T_1)] B_{f_Y}(y_τ). The bias term is o_p(n^{−1/2} h^{−1/2}) by Assumption 8. The following corollary provides the asymptotic distribution of Π̂_{τ,z}.

Corollary 6. Under the assumptions of Theorem 6,

√(nh) ( Π̂_{τ,z} − Π_{τ,z} ) = √n P_n √h ψ_{Π,τ} + o_p(1) ⇒ N(0, V_τ),

where V_τ = lim_{h↓0} E[ h ψ_{Π,τ}^2 ].

The asymptotic variance takes the same form as the asymptotic variance in Corollary 5. Estimating the asymptotic variance and testing for a zero unconditional effect are entirely similar to the case with a parametric propensity score; we omit the details to avoid repetition. From the perspective of implementation, there is no substantive difference between the parametric and nonparametric approaches to propensity score estimation.

9 Empirical Application

We estimate the unconditional quantile effect of expanding college enrollment on (log) wages. The outcome variable Y is the log wage, and the binary treatment is the college enrollment status. Thus p = Pr(D = 1) is the proportion of individuals who ever enrolled in a college.
Arguably, the cost of tuition (Z) is an important factor that affects the college enrollment status but not the wage. In order to alter the proportion of enrolled individuals, we consider a policy that subsidizes tuition by a certain amount. The UQE is the effect of this policy on the different quantiles of the unconditional distribution of wages as the subsidy goes to zero. This is the effect that we denote UQE under instrument intervention in Section 4. The policy shifts Z, the tuition, to Z_δ = Z + s_z(δ) for some s_z(δ), which is the same for all individuals, and induces college enrollment to increase from p to p + δ. Note that we do not need to specify s_z(δ) because we look at the limiting version as δ → 0.

The outcome variable Y is the log wage in 1991. The treatment indicator D is equal to 1 if the individual had ever enrolled in college by 1991, and 0 otherwise. The other covariates are the AFQT score, mother's education, number of siblings, average log earnings 1979–2000 in the county of residence at age 17, average unemployment 1979–2000 in the state of residence at age 17, an urban residence dummy at age 14, cohort dummies, years of experience in 1991, average local log earnings in 1991, and local unemployment in 1991. We collect these variables into a vector and denote it by X_O.

We assume that the following four variables (denoted by Z_1, Z_2, Z_3, Z_4) enter the selection equation but not the outcome equation: presence of a four-year college in the county of residence at age 14, local earnings at age 17, local unemployment at age 17, and tuition at local public four-year colleges at age 17. The total sample size is 1747, of which 882 individuals had never enrolled in a college (D =
0) by 1991, and 865 individuals had enrolled in a college by 1991 (D = 1). We compute the UQE of a marginal shift in the tuition at local public four-year colleges at age 17. So, in our notation, Z = Z_4, X = (Z_1, Z_2, Z_3, X_O), and W = (Z, X).

To estimate the propensity score, we use a parametric logistic specification. To estimate the conditional expectation, we use a series regression with both the estimated propensity score and the covariates X as regressors. Due to the large number of variables involved, a ridge-type penalization λ was imposed on the L_2-norm of the coefficients, excluding the constant term, as in ridge regressions. We compute the UQE at a grid of quantile levels τ; for each τ, we also construct the 95% (pointwise) confidence interval.

Figure 4 presents the results. The UQE ranges between 0.22 and 0.47 across the quantiles and averages 0.37. When we estimate the unconditional mean effect, we obtain an estimate of 0.21, which is somewhat consistent with the quantile results. We interpret these estimates as follows: a marginal additive change in tuition produces a marginal change in college enrollment, which in turn produces a marginal change in (log) wages of between 0.22 and 0.47 across quantiles.

Figure 4:
Solid line: UQE; Dashed line: 95% confidence intervals.
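The two-step procedure described above, a parametric logistic propensity score followed by a ridge-penalized series regression of the outcome indicator on the estimated propensity score and the covariates, can be sketched as follows. This is a minimal illustration on simulated data, not the code used in the application; the data-generating values, the basis, and the penalty value lam=1e-4 are all hypothetical.

```python
import numpy as np

def fit_logit(W, D, iters=25):
    """Logistic propensity score P(D=1|W) = 1/(1+exp(-W @ a)), fit by Newton's method."""
    a = np.zeros(W.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-W @ a))
        a += np.linalg.solve((W * (p * (1 - p))[:, None]).T @ W, W.T @ (D - p))
    return a

def ridge_series(R, y, lam):
    """Regression of y on the basis R with an L2 penalty on the coefficients,
    excluding the constant term (first column), as in ridge regression."""
    pen = lam * np.eye(R.shape[1])
    pen[0, 0] = 0.0  # do not penalize the intercept
    return np.linalg.solve(R.T @ R + pen, R.T @ y)

# Simulated data standing in for the actual sample.
rng = np.random.default_rng(0)
n = 5000
Z = rng.standard_normal(n)                      # excluded instrument (tuition)
X = rng.standard_normal(n)                      # covariates
W = np.column_stack([np.ones(n), Z, X])
D = (rng.random(n) < 1 / (1 + np.exp(-(0.2 - Z + 0.5 * X)))).astype(float)

a_hat = fit_logit(W, D)                         # step 1: propensity score
P_hat = 1 / (1 + np.exp(-W @ a_hat))

y_tau = 0.5                                     # a hypothetical quantile level y_tau
Y = 0.4 * D + X + rng.standard_normal(n)
R = np.column_stack([np.ones(n), P_hat, X])     # basis in (P_hat, X)
b_hat = ridge_series(R, (Y <= y_tau).astype(float), lam=1e-4)
```

Here R @ b_hat plays the role of m̂(ŷ_τ, P̂(W), X), an estimate of Pr(Y ≤ y_τ | P(W), X); in the application a richer basis of functions of (P̂, X) would be used.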
10 Conclusion
In this paper we study the unconditional policy effect with an endogenous binary treatment. For concreteness, we focus on the unconditional quantile effect, but the basic ideas and insights are applicable when the policy target is another feature of the unconditional distribution of the outcome variable. We find that the unconditional quantile regression estimator that neglects endogeneity can be severely biased. The bias may not be uniform across quantiles, and any attempt to sign the bias a priori requires very strong assumptions on the data generating process. More intriguingly, the unconditional quantile regression estimator can be inconsistent even if the treatment status is exogenously determined. This happens when the treatment selection is partly determined by covariates that also influence the outcome variable.

When an instrumental variable is available, it is possible to recover the unconditional effect through an application of the local instrumental variable technique. Framing the selection equation as a threshold-crossing model allows us to introduce a new class of marginal treatment effects and to represent the unconditional effect as a weighted average of these marginal treatment effects. We find that the unconditional quantile effect and the marginal policy-relevant treatment effect belong to the same family of effects; the latter can be viewed as a robust version of the former. Both are examples of a general unconditional policy effect. To the best of our knowledge, this connection has not been established in either literature.
References
Bjorklund, Anders, and Robert Moffitt. 1987. "The Estimation of Wage Gains and Welfare Gains in Self-Selection Models." The Review of Economics and Statistics, 69(1): 42–49.

Carneiro, Pedro, and Sokbae Lee. 2009. "Estimating Distributions of Potential Outcomes Using Local Instrumental Variables with an Application to Changes in College Enrollment and Wage Inequality." Journal of Econometrics, 149(2): 191–208.

Carneiro, Pedro, James J. Heckman, and Edward Vytlacil. 2010. "Evaluating Marginal Policy Changes and the Average Effect of Treatment for Individuals at the Margin." Econometrica, 78(1): 377–394.

Carneiro, Pedro, James J. Heckman, and Edward Vytlacil. 2011. "Estimating Marginal Returns to Education." American Economic Review, 101(6): 2754–2781.

Firpo, Sergio, Nicole M. Fortin, and Thomas Lemieux. 2007. "Unconditional Quantile Regressions." NBER Technical Working Paper 339.

Firpo, Sergio, Nicole M. Fortin, and Thomas Lemieux. 2009. "Unconditional Quantile Regressions." Econometrica, 77(3): 953–973.
Hahn, Jinyong, and Geert Ridder. 2013. "Asymptotic Variance of Semiparametric Estimators with Generated Regressors." Econometrica, 81(1): 315–340.

Heckman, James J. 1997. "Instrumental Variables: A Study of Implicit Behavioral Assumptions Used in Making Program Evaluations." The Journal of Human Resources, 32(3): 441–462.

Heckman, James J., and Edward Vytlacil. 1999. "Local Instrumental Variables and Latent Variable Models for Identifying and Bounding Treatment Effects." Proceedings of the National Academy of Sciences of the United States of America, 96(8): 4730–4734.

Heckman, James J., and Edward Vytlacil. 2001a. "Local Instrumental Variables." In Nonlinear Statistical Modeling: Proceedings of the Thirteenth International Symposium in Economic Theory and Econometrics: Essays in Honor of Takeshi Amemiya, ed. C. Hsiao, K. Morimune and J. Powell, 1–46. Cambridge, UK: Cambridge University Press.

Heckman, James J., and Edward Vytlacil. 2001b. "Policy Relevant Treatment Effects." American Economic Review, 91(2): 107–111.

Heckman, James J., and Edward Vytlacil. 2005. "Structural Equations, Treatment Effects, and Econometric Policy Evaluation." Econometrica, 73(3): 669–738.
Ibragimov, I. A., and R. Z. Hasminskii. 1981. Statistical Estimation: Asymptotic Theory. New York: Springer-Verlag.

Imbens, Guido W., and Whitney K. Newey. 2009. "Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity." Econometrica, 77(5): 1481–1512.

Kaplan, David M.

Kasy, Maximilian. 2016. "Partial Identification, Distributional Preferences, and the Welfare Ranking of Policies." The Review of Economics and Statistics, 98(March): 111–131.
Mukhin, Yaroslav.

Newey, Whitney K. 1994. "The Asymptotic Variance of Semiparametric Estimators." Econometrica, 62(6): 1349–1382.

Pagan, Adrian, and Aman Ullah. 1999. Nonparametric Econometrics. Cambridge: Cambridge University Press.

Rothe, Christoph. 2010. "Identification of Unconditional Partial Effects in Nonseparable Models." Economics Letters, 109(3): 171–174.

Rothe, Christoph. 2012. "Partial Distributional Policy Effects." Econometrica, 80(5): 2269–2301.
Rudin, Walter. 1976. Principles of Mathematical Analysis. New York: McGraw-Hill.

Serfling, Robert J. 1980. Approximation Theorems of Mathematical Statistics. New York: Wiley.

van der Vaart, A. W. 1998. Asymptotic Statistics. Cambridge: Cambridge University Press.

Zhou, Xiang, and Yu Xie. 2019. "Marginal Treatment Effects from a Propensity Score Perspective." Journal of Political Economy, 127(6).

Appendices
A Proofs
Proof of Lemma 1.
Using the selection equation D_δ = 1{U_D ≤ P_δ(W)}, we have

F_{Y(1)|D_δ}(y|1) = Pr(Y(1) ≤ y | D_δ = 1) = Pr(Y(1) ≤ y | U_D ≤ P_δ(W)) = Pr(Y(1) ≤ y, U_D ≤ P_δ(W)) / Pr(U_D ≤ P_δ(W))
= ∫_W Pr(Y(1) ≤ y, U_D ≤ P_δ(w) | W = w) f_W(w) dw / (p + δ)
= ∫_W F_{Y(1),U_D|W}(y, P_δ(w) | w) f_W(w) dw / (p + δ)
= (1/(p + δ)) ∫_W ∫_{−∞}^y ∫_{−∞}^{P_δ(w)} f_{Y(1),U_D|W}(ỹ, ũ | w) f_W(w) dũ dỹ dw
= (1/(p + δ)) ∫_{−∞}^y ∫_W [ ∫_{−∞}^{P_δ(w)} f_{Y(1),U_D|W}(ỹ, ũ | w) dũ ] f_W(w) dw dỹ,

where the order of integration can be switched because the integrands are non-negative. It then follows that

f_{Y(1)|D_δ}(y|1) = (1/(p + δ)) ∫_W [ ∫_{−∞}^{P_δ(w)} f_{Y(1),U_D|W}(y, ũ | w) dũ ] f_W(w) dw.  (A.1)

Under Assumptions 2(b) and 2(c), we can differentiate both sides of (A.1) with respect to δ under the integral sign to get

∂f_{Y(1)|D_δ}(y|1)/∂δ = (1/(p + δ)) ∫_W f_{Y(1),U_D|W}(y, P_δ(w) | w) [∂P_δ(w)/∂δ] f_W(w) dw − (1/(p + δ)^2) ∫_W [ ∫_{−∞}^{P_δ(w)} f_{Y(1),U_D|W}(y, ũ | w) dũ ] f_W(w) dw
= (1/(p + δ)) ∫_W f_{Y(1)|U_D,W}(y | P_δ(w), w) [∂P_δ(w)/∂δ] f_W(w) dw − f_{Y(1)|D_δ}(y|1)/(p + δ),  (A.2)

where the last line follows from (A.1). Under Assumptions 2(b.i) and 2(c.ii), f_{Y(1)|U_D,W}(y | P_δ(w), w) ∂P_δ(w)/∂δ is continuous in δ for each y ∈ Y(1) and w ∈ W.
In view of Assumptions 2(b.ii) and 2(c.iii), we can invoke the dominated convergence theorem to show that the map δ ↦ ∂f_{Y(1)|D_δ}(y|1)/∂δ is continuous for each y ∈ Y(1).

For the case of f_{Y(0)|D_δ}(y|0), we have ∂f_{Y(0)|D_δ}(y|0)/∂δ = (∂/∂δ)(∂F_{Y(0)|D_δ}(y|0)/∂y). Using the selection equation, we can write F_{Y(0)|D_δ}(y|0) as

F_{Y(0)|D_δ}(y|0) = Pr(Y(0) ≤ y | D_δ = 0) = Pr(Y(0) ≤ y | U_D > P_δ(W)) = Pr(Y(0) ≤ y, U_D > P_δ(W)) / (1 − p − δ)
= [ Pr(Y(0) ≤ y) − Pr(Y(0) ≤ y, U_D ≤ P_δ(W)) ] / (1 − p − δ)
= [ F_{Y(0)}(y) − ∫_W F_{Y(0),U_D|W}(y, P_δ(w) | w) f_W(w) dw ] / (1 − p − δ)
= (1/(1 − p − δ)) [ F_{Y(0)}(y) − ∫_W ∫_{−∞}^y ∫_{−∞}^{P_δ(w)} f_{Y(0),U_D|W}(ỹ, ũ | w) f_W(w) dũ dỹ dw ]
= (1/(1 − p − δ)) [ F_{Y(0)}(y) − ∫_{−∞}^y ∫_W ∫_{−∞}^{P_δ(w)} f_{Y(0),U_D|W}(ỹ, ũ | w) f_W(w) dũ dw dỹ ],

where the orders of integration can be switched because the integrands are non-negative. Therefore,

f_{Y(0)|D_δ}(y|0) = (1/(1 − p − δ)) [ f_{Y(0)}(y) − ∫_W ∫_{−∞}^{P_δ(w)} f_{Y(0),U_D|W}(y, ũ | w) f_W(w) dũ dw ].  (A.3)

Using Assumptions 2(b) and 2(c), we have

∂f_{Y(0)|D_δ}(y|0)/∂δ = f_{Y(0)|D_δ}(y|0)/(1 − p − δ) − (1/(1 − p − δ)) ∫_W f_{Y(0),U_D|W}(y, P_δ(w) | w) [∂P_δ(w)/∂δ] f_W(w) dw.  (A.4)

The continuity of δ ↦ ∂f_{Y(0)|D_δ}(y|0)/∂δ follows from the same arguments as the continuity of δ ↦ ∂f_{Y(1)|D_δ}(y|1)/∂δ. Therefore, we have established that δ ↦ f_{Y(d)|D_δ}(y|d) is continuously differentiable.

Proof of Lemma 2.
For any δ in N_ε, we have

F_{Y_δ}(y) = Pr(Y_δ ≤ y) = Pr( (1 − D_δ) Y(0) + D_δ Y(1) ≤ y )
= (p + δ) Pr(Y(1) ≤ y | D_δ = 1) + (1 − p − δ) Pr(Y(0) ≤ y | D_δ = 0)
= ∫_{Y(1)} 1{ỹ ≤ y} (p + δ) f_{Y(1)|D_δ}(ỹ|1) dỹ + ∫_{Y(0)} 1{ỹ ≤ y} (1 − p − δ) f_{Y(0)|D_δ}(ỹ|0) dỹ.  (A.5)

We proceed to take a first-order Taylor expansion of δ ↦ (p + δ) f_{Y(1)|D_δ} and δ ↦ (1 − p − δ) f_{Y(0)|D_δ} around δ = 0, which is possible by Lemma 1: f_{Y(d)|D_δ} is continuously differentiable with respect to δ. We have

(p + δ) f_{Y(1)|D_δ}(ỹ|1) = p f_{Y(1)|D}(ỹ|1) + δ · [ p ∂f_{Y(1)|D_δ}(ỹ|1)/∂δ |_{δ=0} + f_{Y(1)|D}(ỹ|1) ] + R(δ; ỹ, 1),  (A.6)

where

R(δ; ỹ, 1) := δ · [ p ∂f_{Y(1)|D_δ}(ỹ|1)/∂δ |_{δ=δ̃} − p ∂f_{Y(1)|D_δ}(ỹ|1)/∂δ |_{δ=0} ] + δ · [ f_{Y(1)|D_δ̃}(ỹ|1) − f_{Y(1)|D}(ỹ|1) ]  (A.7)

and 0 ≤ δ̃ ≤ δ. The middle point δ̃ depends on δ. For the case of d = 0, we have a similar expansion:

(1 − p − δ) f_{Y(0)|D_δ}(ỹ|0) = (1 − p) f_{Y(0)|D}(ỹ|0) + δ · [ (1 − p) ∂f_{Y(0)|D_δ}(ỹ|0)/∂δ |_{δ=0} − f_{Y(0)|D}(ỹ|0) ] + R(δ; ỹ, 0),  (A.8)

where

R(δ; ỹ, 0) := δ · [ (1 − p) ∂f_{Y(0)|D_δ}(ỹ|0)/∂δ |_{δ=δ̃} − (1 − p) ∂f_{Y(0)|D_δ}(ỹ|0)/∂δ |_{δ=0} ] + δ · [ f_{Y(0)|D}(ỹ|0) − f_{Y(0)|D_δ̃}(ỹ|0) ]  (A.9)

and 0 ≤ δ̃ ≤ δ. The middle point δ̃ depends on δ.

Consider the first-order derivative that appears in (A.6). When δ = 0, using (A.2) we have

∂f_{Y(1)|D_δ}(ỹ|1)/∂δ |_{δ=0} = (1/p) ∫_W f_{Y(1),U_D|W}(ỹ, P(w) | w) Ṗ(w) f_W(w) dw − f_{Y(1)|D}(ỹ|1)/p
= (1/p) [ ∫_W f_{Y(1),U_D|W}(ỹ, P(w) | w) Ṗ(w) f_W(w) dw − f_{Y(1)|D}(ỹ|1) ]
= (1/p) [ ∫_W f_{Y(1)|U_D,W}(ỹ | P(w), w) f_{U_D|W}(P(w) | w) Ṗ(w) f_W(w) dw − f_{Y(1)|D}(ỹ|1) ]
= (1/p) [ ∫_W f_{Y(1)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw − f_{Y(1)|D}(ỹ|1) ],  (A.10)

where we define Ṗ(w) = ∂P_δ(w)/∂δ |_{δ=0}. Note that we have used that U_D | W is uniform on [0, 1].

Now we substitute (A.10) into (A.6) to get

(p + δ) f_{Y(1)|D_δ}(ỹ|1) = p f_{Y(1)|D}(ỹ|1) + δ ∫_W f_{Y(1)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw + R(δ; ỹ, 1).  (A.11)

The first derivative in (A.8) can be handled similarly. For δ = 0,

∂f_{Y(0)|D_δ}(ỹ|0)/∂δ |_{δ=0} = −∫_W f_{Y(0)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw / (1 − p) + f_{Y(0)|D}(ỹ|0)/(1 − p).  (A.12)

Plugging (A.12) into (A.8), we get

(1 − p − δ) f_{Y(0)|D_δ}(ỹ|0) = (1 − p) f_{Y(0)|D}(ỹ|0) − δ ∫_W f_{Y(0)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw + R(δ; ỹ, 0).  (A.13)

Now we substitute (A.11) and (A.13) into (A.5), leading to

F_{Y_δ}(y) = ∫_{Y(1)} 1{ỹ ≤ y} [ p f_{Y(1)|D}(ỹ|1) + δ ∫_W f_{Y(1)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw ] dỹ
+ ∫_{Y(0)} 1{ỹ ≤ y} [ (1 − p) f_{Y(0)|D}(ỹ|0) − δ ∫_W f_{Y(0)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw ] dỹ + R_F(δ; y)
= F_Y(y) + δ ∫_{Y(1)} ∫_W 1{ỹ ≤ y} f_{Y(1)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw dỹ − δ ∫_{Y(0)} ∫_W 1{ỹ ≤ y} f_{Y(0)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw dỹ + R_F(δ; y),  (A.14)

where the remainder R_F(δ; y) is

R_F(δ; y) := ∫_{Y(1)} 1{ỹ ≤ y} R(δ; ỹ, 1) dỹ + ∫_{Y(0)} 1{ỹ ≤ y} R(δ; ỹ, 0) dỹ.  (A.15)

The next step is to show that the remainder in (A.15) is o(|δ|) uniformly over y ∈ Y = Y(0) ∪ Y(1) as δ → 0, that is,

lim_{δ→0} sup_{y∈Y} | R_F(δ; y)/δ | = 0.

We have

sup_{y∈Y} | R_F(δ; y)/δ | ≤ p ∫_{Y(1)} | ∂f_{Y(1)|D_δ}(ỹ|1)/∂δ |_{δ=δ̃} − ∂f_{Y(1)|D_δ}(ỹ|1)/∂δ |_{δ=0} | dỹ
+ (1 − p) ∫_{Y(0)} | ∂f_{Y(0)|D_δ}(ỹ|0)/∂δ |_{δ=δ̃} − ∂f_{Y(0)|D_δ}(ỹ|0)/∂δ |_{δ=0} | dỹ
+ ∫_{Y(1)} | f_{Y(1)|D}(ỹ|1) − f_{Y(1)|D_δ̃}(ỹ|1) | dỹ + ∫_{Y(0)} | f_{Y(0)|D}(ỹ|0) − f_{Y(0)|D_δ̃}(ỹ|0) | dỹ.

Assumption 3 allows us to take the limit δ → 0 inside the integrals; by Lemma 1, f_{Y(d)|D_δ}(ỹ|d) and ∂f_{Y(d)|D_δ}(ỹ|d)/∂δ are continuous in δ. Therefore, lim_{δ→0} sup_{y∈Y} |R_F(δ; y)/δ| = 0, and we get the desired result:

F_{Y_δ}(y) = F_Y(y) + δ ∫_{Y(1)} ∫_W 1{ỹ ≤ y} f_{Y(1)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw dỹ − δ ∫_{Y(0)} ∫_W 1{ỹ ≤ y} f_{Y(0)|U_D,W}(ỹ | P(w), w) Ṗ(w) f_W(w) dw dỹ + o(|δ|)  (A.16)

uniformly over y ∈ Y as δ → 0. This can be written more compactly as

F_{Y_δ}(y) = F_Y(y) + δ E[ F_{Y(1)|U_D,W}(y | P(W), W) Ṗ(W) ] − δ E[ F_{Y(0)|U_D,W}(y | P(W), W) Ṗ(W) ] + o(|δ|)

uniformly over y ∈ Y as δ → 0.

Proof of Theorem 1.
Using Lemma 2, we have

F_{Y_δ}(y_{τ,δ}) = F_Y(y_{τ,δ}) + δ E[ F_{Y(1)|U_D,W}(y_{τ,δ} | P(W), W) Ṗ(W) ] − δ E[ F_{Y(0)|U_D,W}(y_{τ,δ} | P(W), W) Ṗ(W) ] + o(|δ|).

Noting that F_{Y_δ}(y_{τ,δ}) = F_Y(y_τ) = τ, we have

F_Y(y_τ) = F_Y(y_{τ,δ}) + δ E[ F_{Y(1)|U_D,W}(y_{τ,δ} | P(W), W) Ṗ(W) ] − δ E[ F_{Y(0)|U_D,W}(y_{τ,δ} | P(W), W) Ṗ(W) ] + o(|δ|).  (A.17)

Note that

| E[ F_{Y(d)|U_D,W}(y_{τ,δ} | P(W), W) Ṗ(W) ] | ≤ E|Ṗ(W)| < ∞.

Letting δ → 0 in (A.17), we obtain lim_{δ→0} F_Y(y_{τ,δ}) = F_Y(y_τ). Under the assumption that f_Y(y_τ) > 0, F_Y(·) is continuous and strictly increasing at y_τ. Combining this with the above limit result, we conclude that lim_{δ→0} y_{τ,δ} = y_τ.

Going back to (A.17), we have

lim_{δ→0} [ F_Y(y_{τ,δ}) − F_Y(y_τ) ] / δ = lim_{δ→0} E[ F_{Y(0)|U_D,W}(y_{τ,δ} | P(W), W) Ṗ(W) ] − lim_{δ→0} E[ F_{Y(1)|U_D,W}(y_{τ,δ} | P(W), W) Ṗ(W) ]
= E[ F_{Y(0)|U_D,W}(y_τ | P(W), W) Ṗ(W) ] − E[ F_{Y(1)|U_D,W}(y_τ | P(W), W) Ṗ(W) ].

Therefore, the unconditional quantile effect is

lim_{δ→0} (y_{τ,δ} − y_τ)/δ = (1/f_Y(y_τ)) { E[ F_{Y(0)|U_D,W}(y_τ | P(W), W) Ṗ(W) ] − E[ F_{Y(1)|U_D,W}(y_τ | P(W), W) Ṗ(W) ] }.  (A.18)

Proof of Corollary 1.
For each d = 0, 1,

(1/f_Y(y_τ)) ∫_W E[ 1{Y(d) ≤ y_τ} | U_D = P(w), W = w ] Ṗ(w) f_W(w) dw
= (1/f_Y(y_τ)) ∫_W E[ 1{Y(d) ≤ y_τ} | D = d, W = w ] Ṗ(w) f_W(w) dw
+ (1/f_Y(y_τ)) ∫_W E[ 1{Y(d) ≤ y_τ} | U_D = P(w), W = w ] Ṗ(w) f_W(w) dw
− (1/f_Y(y_τ)) ∫_W E[ 1{Y(d) ≤ y_τ} | D = d, W = w ] Ṗ(w) f_W(w) dw
= (1/f_Y(y_τ)) ∫_W E[ 1{Y(d) ≤ y_τ} | D = d, W = w ] f_W(w) dw
− (1/f_Y(y_τ)) ∫_W E[ 1{Y(d) ≤ y_τ} | D = d, W = w ] f_W(w) [ 1 − Ṗ(w) ] dw
− (1/f_Y(y_τ)) ∫_W [ F_{Y(d)|D,W}(y_τ | d, w) − F_{Y(d)|U_D,W}(y_τ | P(w), w) ] Ṗ(w) f_W(w) dw
:= A_τ(d) − B_{1τ}(d) − B_{2τ}(d),

where

A_τ(d) = (1/f_Y(y_τ)) ∫_W E[ 1{Y(d) ≤ y_τ} | D = d, W = w ] f_W(w) dw,

B_{1τ}(d) = (1/f_Y(y_τ)) ∫_W E[ 1{Y(d) ≤ y_τ} | D = d, W = w ] f_W(w) [ 1 − Ṗ(w) ] dw,

B_{2τ}(d) = (1/f_Y(y_τ)) ∫_W [ F_{Y(d)|D,W}(y_τ | d, w) − F_{Y(d)|U_D,W}(y_τ | P(w), w) ] Ṗ(w) f_W(w) dw.

So

Π_τ = A_τ(0) − B_{1τ}(0) − B_{2τ}(0) − [ A_τ(1) − B_{1τ}(1) − B_{2τ}(1) ] = A_τ − B_{1τ} − B_{2τ},

where

A_τ = A_τ(0) − A_τ(1) = (1/f_Y(y_τ)) ∫_W E[ 1{Y ≤ y_τ} | D = 0, W = w ] f_W(w) dw − (1/f_Y(y_τ)) ∫_W E[ 1{Y ≤ y_τ} | D = 1, W = w ] f_W(w) dw,

B_{1τ} = B_{1τ}(0) − B_{1τ}(1) = (1/f_Y(y_τ)) ∫_W [ F_{Y(0)|D,W}(y_τ | 0, w) − F_{Y(1)|D,W}(y_τ | 1, w) ] [ 1 − Ṗ(w) ] f_W(w) dw,

and

B_{2τ} = B_{2τ}(0) − B_{2τ}(1) = (1/f_Y(y_τ)) ∫_W [ F_{Y(0)|D,W}(y_τ | 0, w) − F_{Y(0)|U_D,W}(y_τ | P(w), w) ] Ṗ(w) f_W(w) dw
+ (1/f_Y(y_τ)) ∫_W [ F_{Y(1)|U_D,W}(y_τ | P(w), w) − F_{Y(1)|D,W}(y_τ | 1, w) ] Ṗ(w) f_W(w) dw.

Proof of Lemma 3.
For a given δ, s_µ(δ) satisfies Pr(D_δ = 1) = p + δ. But

Pr(D_δ = 1) = E[P_δ(W)] = ∫_W F_{V|W}(µ(w) + s_µ(δ) | w) f_W(w) dw,

and so

p + δ = ∫_W F_{V|W}(µ(w) + s_µ(δ) | w) f_W(w) dw.  (A.19)

Note that s_µ(0) = 0. We need to find the derivative of the implicit function s_µ(δ) with respect to δ. Define

t(δ, s) = p + δ − ∫_W F_{V|W}(µ(w) + s | w) f_W(w) dw.  (A.20)

By Theorem 9.28 in Rudin (1976), we need to show that t is continuously differentiable in a neighborhood around (0, 0) of (δ, s). We do this by showing that the partial derivatives of (A.20) with respect to δ and s exist and are continuous (see Theorem 9.21 in Rudin (1976)).

For the partial derivative with respect to δ, we have ∂t(δ, s)/∂δ = 1, which is obviously continuous in (δ, s). For the partial derivative with respect to s, we use Assumption (iii) in the lemma to obtain

∂t(δ, s)/∂s = −∫_W f_{V|W}(µ(w) + s | w) f_W(w) dw.

The function is trivially continuous in δ. In view of the continuity of f_{V|W}(v|w) in v for almost all w, the dominated convergence theorem implies that ∂t(δ, s)/∂s is also continuous in s. Therefore, we can apply the implicit function theorem to obtain s′_µ(δ) in a neighborhood of δ = 0. Taking the derivative of (A.19) with respect to δ, we get

∂s_µ(δ)/∂δ = s′_µ(δ) = 1 / ∫_W f_{V|W}(µ(w) + s_µ(δ) | w) f_W(w) dw.

In particular, for δ = 0, we have

s′_µ(0) = 1 / ∫_W f_{V|W}(µ(w) | w) f_W(w) dw.

For the propensity score, we have P_{µδ}(w) = F_{V|W}(µ(w) + s_µ(δ) | w), and so

∂P_{µδ}(w)/∂δ = f_{V|W}(µ(w) + s_µ(δ) | w) ∂s_µ(δ)/∂δ = f_{V|W}(µ(w) + s_µ(δ) | w) / ∫_W f_{V|W}(µ(w̃) + s_µ(δ) | w̃) f_W(w̃) dw̃.

Evaluating the above at δ = 0, we obtain

Ṗ_µ(w) = f_{V|W}(µ(w) | w) / ∫_W f_{V|W}(µ(w̃) | w̃) f_W(w̃) dw̃.

Proof of Corollary 3. It follows from Theorem 1 that

Π_{τ,µ} = (1/f_Y(y_τ)) ∫_W E[ 1{Y(0) ≤ y_τ} | U_D = P(w), W = w ] Ṗ_µ(w) f_W(w) dw − (1/f_Y(y_τ)) ∫_W E[ 1{Y(1) ≤ y_τ} | U_D = P(w), W = w ] Ṗ_µ(w) f_W(w) dw.  (A.21)

Since the propensity score does not depend on w because µ(·) = µ, a constant, U_D is conditioned at P(w) = p, where p := F_V(µ) = Pr(D = 1). But U_D is independent of W, so

Π_{τ,µ} = (1/f_Y(y_τ)) ∫_W E[ 1{Y(0) ≤ y_τ} | U_D = p, W = w ] f_{W|U_D}(w | p) dw − (1/f_Y(y_τ)) ∫_W E[ 1{Y(1) ≤ y_τ} | U_D = p, W = w ] f_{W|U_D}(w | p) dw
= (1/f_Y(y_τ)) E[ 1{Y(0) ≤ y_τ} − 1{Y(1) ≤ y_τ} | U_D = p ].  (A.22)

In the decomposition Π_{τ,µ} = A_{τ,µ} − B_{τ,µ}, the formula for A_{τ,µ} follows from the same argument as above, and the formula for B_{τ,µ} follows from Corollary 1 because B_{1τ} = 0 and B_{2τ} simplifies to the given expression.

Proof of Theorem 2.
Follows directly from an application of Assumption 4(b) to Corollary 2.
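The implicit-function computation in the proof of Lemma 3 can be checked numerically in the simplest case, where µ(w) = µ is constant and V ~ N(0, 1) is independent of W, so that s′_µ(0) = 1/∫ f_{V|W}(µ(w)|w) f_W(w) dw reduces to 1/φ(µ). The sketch below solves (A.19) for s_µ(δ) by bisection and compares a finite-difference derivative with the analytic one; the choice µ = 0.3 is arbitrary.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def s_mu(mu, delta):
    """Solve Pr(V <= mu + s) = Pr(V <= mu) + delta for s by bisection,
    i.e. equation (A.19) with constant mu and V ~ N(0,1) independent of W."""
    target = Phi(mu) + delta
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if Phi(mu + mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu, eps = 0.3, 1e-6
numeric = (s_mu(mu, eps) - s_mu(mu, -eps)) / (2 * eps)  # finite-difference s'_mu(0)
analytic = 1.0 / phi(mu)                                # Lemma 3: 1 / E[f_V(mu)]
assert abs(numeric - analytic) < 1e-3
```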
Proof of Lemma 4.
For a given δ, s_z(δ) satisfies Pr(D_δ = 1) = p + δ. But

Pr(D_δ = 1) = E[P_δ(W)] = ∫_W F_{V|W}(µ(z + g(w) s_z(δ), x) | w) f_W(w) dw,

and so

p + δ = ∫_W F_{V|W}(µ(z + g(w) s_z(δ), x) | w) f_W(w) dw.  (A.23)

Note that s_z(0) = 0. We need to find the derivative of the implicit function s_z(δ) with respect to δ. Define

t(δ, s) = p + δ − ∫_W F_{V|W}(µ(z + g(w) s, x) | w) f_W(w) dw.  (A.24)

By Theorem 9.28 in Rudin (1976), we need to show that t is continuously differentiable in a neighborhood around (0, 0) of (δ, s). We do this by showing that the partial derivatives of (A.24) with respect to δ and s exist and are continuous (see Theorem 9.21 in Rudin (1976)).

For the partial derivative with respect to δ, we have ∂t(δ, s)/∂δ = 1, which is obviously continuous in (δ, s). For the partial derivative with respect to s, we use Assumption (iii) in the lemma to obtain

∂t(δ, s)/∂s = −∫_W f_{V|W}(µ(z + g(w) s, x) | w) µ′_z(z + g(w) s, x) g(w) f_W(w) dw.

The function is trivially continuous in δ. In view of the continuity of f_{V|W}(v|w) in v for almost all w, the dominated convergence theorem implies that ∂t(δ, s)/∂s is also continuous in s. Therefore, we can apply the implicit function theorem to obtain s′_z(δ) in a neighborhood of δ = 0. Taking the derivative of (A.23) with respect to δ, we get

∂s_z(δ)/∂δ = 1 / ∫_W f_{V|W}(µ(z + g(w) s_z(δ), x) | w) µ′_z(z + g(w) s_z(δ), x) g(w) f_W(w) dw = 1 / E[ f_{V|W}(µ(Z + g(W) s_z(δ), X) | W) µ′_z(Z + g(W) s_z(δ), X) g(W) ].

Next, we have

P_δ(z, x) = Pr(D_δ = 1 | Z = z, X = x) = F_{V|W}(µ(z + g(w) s_z(δ), x) | w).

So

∂P_δ(z, x)/∂δ = f_{V|W}(µ(z + g(w) s_z(δ), x) | w) µ′_z(z + g(w) s_z(δ), x) g(w) ∂s_z(δ)/∂δ
= f_{V|W}(µ(z + g(w) s_z(δ), x) | w) µ′_z(z + g(w) s_z(δ), x) g(w) / E[ f_{V|W}(µ(Z + g(W) s_z(δ), X) | W) µ′_z(Z + g(W) s_z(δ), X) g(W) ].

It then follows that

∂s_z(δ)/∂δ |_{δ=0} = 1 / E[ f_{V|W}(µ(W) | W) µ′_z(W) g(W) ],

and

∂P_δ(z, x)/∂δ |_{δ=0} = f_{V|W}(µ(w) | w) µ′_z(w) g(w) / E[ f_{V|W}(µ(W) | W) µ′_z(W) g(W) ].

Proof of Theorem 3.
The theorem follows directly from an application of Lemma 4 to Theorem 1.
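The limit formula (A.18) can also be illustrated numerically. In a stylized model with no covariates, U_D ~ U[0, 1], a constant propensity score p (so Ṗ = 1), and Y(1) = 1 + ε_1, Y(0) = ε_0 with standard normal ε_d independent of U_D, the distribution F_{Y_δ}(y) = (p + δ)Φ(y − 1) + (1 − p − δ)Φ(y) is available in closed form, and a finite-difference quantile derivative can be compared with (A.18). A sketch, with the arbitrary choices τ = 0.5 and p = 0.5:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def quantile(tau, p, delta):
    """tau-quantile of Y_delta, where F(y) = (p+delta)Phi(y-1) + (1-p-delta)Phi(y)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (p + delta) * Phi(mid - 1.0) + (1.0 - p - delta) * Phi(mid) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

tau, p, eps = 0.5, 0.5, 1e-6
y_tau = quantile(tau, p, 0.0)
numeric = (quantile(tau, p, eps) - quantile(tau, p, -eps)) / (2 * eps)
# (A.18) with P(w) = p and Pdot = 1: [F_{Y(0)}(y_tau) - F_{Y(1)}(y_tau)] / f_Y(y_tau)
f_Y = p * phi(y_tau - 1.0) + (1.0 - p) * phi(y_tau)
analytic = (Phi(y_tau) - Phi(y_tau - 1.0)) / f_Y
assert abs(numeric - analytic) < 1e-3
```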
Proof of Corollary 4.
It is easy to see that

∫_Y y E[ { f_{Y(1)|U_D,W}(y | P(W), W) − f_{Y(0)|U_D,W}(y | P(W), W) } Ṗ(W) ] dy = E{ E[ Y(1) − Y(0) | U_D = P(W), W ] Ṗ(W) }.

Hence it suffices to show that lim_{δ→0} δ^{−1} ∫_Y y dR_F(δ; y) = 0. Under Assumption 5, we have

| δ^{−1} ∫_Y y dR_F(δ; y) | ≤ p ∫_{Y(1)} |ỹ| · | ∂f_{Y(1)|D_δ}(ỹ|1)/∂δ |_{δ=δ̃} − ∂f_{Y(1)|D_δ}(ỹ|1)/∂δ |_{δ=0} | dỹ
+ (1 − p) ∫_{Y(0)} |ỹ| · | ∂f_{Y(0)|D_δ}(ỹ|0)/∂δ |_{δ=δ̃} − ∂f_{Y(0)|D_δ}(ỹ|0)/∂δ |_{δ=0} | dỹ
+ ∫_{Y(1)} |ỹ| · | f_{Y(1)|D}(ỹ|1) − f_{Y(1)|D_δ̃}(ỹ|1) | dỹ + ∫_{Y(0)} |ỹ| · | f_{Y(0)|D}(ỹ|0) − f_{Y(0)|D_δ̃}(ỹ|0) | dỹ.

As in the proof of Lemma 2, each term in the above upper bound converges to zero as δ → 0. Hence

lim_{δ→0} δ^{−1} ∫_Y y dR_F(δ; y) = 0.

Proof of Lemma 6.
We haveˆ f Y ( y ) − f Y ( y ) = nh n ∑ i = K (cid:18) Y i − yh (cid:19) − f Y ( y )= n n ∑ i = h K (cid:18) Y i − yh (cid:19) − E ˆ f Y ( y ) + E ˆ f Y ( y ) − f Y ( y )= n n ∑ i = h K (cid:18) Y i − yh (cid:19) − E ˆ f Y ( y ) + B f Y ( y ) + o p ( h ) ,where B f Y ( y ) = h f ′′ Y ( y ) ˆ ∞ − ∞ u K ( u ) du .We write this concisely asˆ f Y ( y ) − f Y ( y ) = n n ∑ i = ψ f Y , i ( y , h ) + B f Y ( y ) + o p ( h ) ,where ψ f Y , i ( y , h ) : = h K (cid:18) Y i − yh (cid:19) − E h K (cid:18) Y i − yh (cid:19) = O p ( n − h − ) .Since K ( u ) is twice continuously differentiable, we use a Taylor expansion to obtainˆ f Y ( ˆ y τ ) − ˆ f Y ( y τ ) = ˆ f ′ Y ( y τ )( ˆ y τ − y τ ) +
12 ˆ f ′′ ( ˜ y τ ) ( ˆ y τ − y τ ) (A.25)62or some ˜ y τ between ˆ y τ and y τ . The first and second derivatives areˆ f ′ Y ( y ) = − nh n ∑ i = K ′ (cid:18) Y i − yh (cid:19) , ˆ f ′′ Y ( y ) = nh n ∑ i = K ′′ (cid:18) Y i − yh (cid:19) .To find the order of ˆ f ′′ Y ( y ) , we calculate its mean and variance. We have E h ˆ f ′′ Y ( y ) i = nh n ∑ i = K ′′ (cid:18) Y i − yh (cid:19) = O (cid:18) h (cid:19) , var h ˆ f ′′ Y ( y ) i ≤ n ( nh ) E (cid:20) K ′′ (cid:18) Y i − yh (cid:19)(cid:21) = O (cid:18) nh (cid:19) .Therefore, when nh → ∞ , ˆ f ′′ Y ( y ) = O p (cid:0) h − (cid:1) ,for any y . That is, for any ǫ >
0, there exists an M > (cid:18) h (cid:12)(cid:12)(cid:12) ˆ f ′′ Y ( y τ ) (cid:12)(cid:12)(cid:12) > M (cid:19) < ǫ n is large enough.Suppose we choose M so large that we also havePr (cid:0) √ n | ˜ y τ − y τ | > M (cid:1) < ǫ n is large enough. Then, when n is large enough,Pr (cid:18) h ˆ f ′′ Y ( ˜ y τ ) > M (cid:19) ≤ Pr (cid:18) h (cid:12)(cid:12)(cid:12) ˆ f ′′ Y ( ˜ y τ ) − ˆ f ′′ Y ( y τ ) (cid:12)(cid:12)(cid:12) > M (cid:19) + Pr (cid:18) h (cid:12)(cid:12)(cid:12) ˆ f ′′ Y ( y τ ) (cid:12)(cid:12)(cid:12) > M (cid:19) ≤ Pr (cid:18) h h ˆ f ′′ Y ( ˜ y τ ) − ˆ f ′′ Y ( y τ ) i > M (cid:19) + ǫ ≤ Pr (cid:18) h h ˆ f ′′ Y ( ˜ y τ ) − ˆ f ′′ Y ( y τ ) i > M √ n | ˜ y τ − y τ | < M (cid:19) + ǫ . (A.26)When √ n | ˜ y τ − y τ | < √ h , we have h (cid:12)(cid:12)(cid:12) ˆ f ′′ Y ( ˜ y τ ) − ˆ f ′′ Y ( y τ ) (cid:12)(cid:12)(cid:12) ≤ nh n ∑ i = (cid:12)(cid:12)(cid:12)(cid:12) K ′′ (cid:18) Y i − y τ h (cid:19) − K ′′ (cid:18) Y i − ˜ y τ h (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ L K · h √ hM √ n = L K · M √ nh by the Lipschitz continuity of K ′′ ( · ) with Lipschitz constant L K . When √ h ≤ √ n | ˜ y τ − y τ | < M ,we have (cid:12)(cid:12)(cid:12)(cid:12) Y i − y τ h − Y i − ˜ y τ h (cid:12)(cid:12)(cid:12)(cid:12) = √ n | ˜ y τ − y τ | h > √ h → ∞ .63sing the second condition on K ′′ ( · ) , we have, for √ h ≤ √ n | ˜ y τ − y τ | < M , (cid:12)(cid:12)(cid:12)(cid:12) K ′′ (cid:18) Y i − y τ h (cid:19) − K ′′ (cid:18) Y i − ˜ y τ h (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C Mnh ,and h (cid:12)(cid:12)(cid:12) ˆ f ′′ Y ( ˜ y τ ) − ˆ f ′′ Y ( y τ ) (cid:12)(cid:12)(cid:12) ≤ nh n ∑ i = (cid:12)(cid:12)(cid:12)(cid:12) K ′′ (cid:18) Y i − y τ h (cid:19) − K ′′ (cid:18) Y i − ˜ y τ h (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C nh = O (cid:18) √ nh (cid:19) .Hence, in both cases, h (cid:12)(cid:12)(cid:12) ˆ f ′′ Y ( ˜ y τ ) − ˆ f ′′ Y ( y τ ) (cid:12)(cid:12)(cid:12) = O p (cid:0) n − h − (cid:1) . 
As a result,
$$\Pr\left(h^2\left[\hat f''_Y(\tilde y_\tau)-\hat f''_Y(y_\tau)\right] > M,\ \sqrt n\,|\tilde y_\tau-y_\tau| < M\right) \to 0,$$
so that $h^2\hat f''_Y(\tilde y_\tau) = O_p(1)$ and
$$\hat f''_Y(\tilde y_\tau)(\hat y_\tau-y_\tau)^2 = O_p\!\left(n^{-1}h^{-2}\right).$$
In view of (A.25), we then have
$$\hat f_Y(\hat y_\tau) - \hat f_Y(y_\tau) = \hat f'_Y(y_\tau)(\hat y_\tau-y_\tau) + O_p\!\left(n^{-1}h^{-2}\right).$$
Now, using Lemma 5, we can write
$$\begin{aligned}\hat f_Y(\hat y_\tau)-\hat f_Y(y_\tau) &= f'_Y(y_\tau)(\hat y_\tau-y_\tau) + \left[\hat f'_Y(y_\tau)-f'_Y(y_\tau)\right](\hat y_\tau-y_\tau) + O_p\!\left(n^{-1}h^{-2}\right)\\ &= f'_Y(y_\tau)\frac{1}{n}\sum_{i=1}^n\psi_{Q,i}(y_\tau) + R_{f_Y},\end{aligned}$$
where
$$R_{f_Y} := \left[\hat f'_Y(y_\tau)-f'_Y(y_\tau)\right](\hat y_\tau-y_\tau) + o_p\!\left(n^{-1/2}\right) + O_p\!\left(n^{-1}h^{-2}\right)$$
and the $o_p(n^{-1/2})$ term is the error of the linear asymptotic representation of $\hat y_\tau-y_\tau$. In order to obtain the order of $R_{f_Y}$, we use the following results:
$$\hat f'_Y(y) = f'_Y(y) + O_p\!\left(\frac{1}{\sqrt{nh^3}}+h^2\right), \qquad \hat y_\tau = y_\tau + O_p\!\left(\frac{1}{\sqrt n}\right).$$
The rate on the derivative of the density can be found on page 56 of Pagan and Ullah (1999). Therefore,
$$\begin{aligned}R_{f_Y} &= o_p\!\left(n^{-1/2}\right) + O_p\!\left(\frac{1}{\sqrt{nh^3}}+h^2\right)O_p\!\left(\frac{1}{\sqrt n}\right) + O_p\!\left(n^{-1}h^{-2}\right)\\
&= o_p\!\left(n^{-1/2}\right) + O_p\!\left(n^{-1}h^{-3/2}\right) + O_p\!\left(n^{-1/2}h^2\right) + O_p\!\left(n^{-1}h^{-2}\right)\\
&= o_p\!\left(n^{-1/2}\right) + O_p\!\left(n^{-1}h^{-3/2}\right) + O_p\!\left(n^{-1}h^{-2}\right),\end{aligned}$$
because, by Assumption 8, $h\downarrow 0$, so $O_p(n^{-1/2}h^2) = o_p(n^{-1/2})$. We need to show that $\sqrt{nh}\,R_{f_Y} = o_p(1)$. We do this term by term. First, $\sqrt{nh}\times o_p(n^{-1/2}) = o_p(h^{1/2}) = o_p(1)$ because $h\downarrow 0$. Second, $\sqrt{nh}\times O_p(n^{-1}h^{-3/2}) = O_p(n^{-1/2}h^{-1}) = o_p(1)$ as long as $nh^2\uparrow\infty$, which is guaranteed by Assumption 8, since it is implied by $nh^3\uparrow\infty$. Finally, $\sqrt{nh}\times O_p(n^{-1}h^{-2}) = O_p(n^{-1/2}h^{-3/2}) = o_p(1)$ since by Assumption 8 $nh^3\uparrow\infty$. Therefore, $\sqrt{nh}\,R_{f_Y} = o_p(1)$.

Proof of Lemma 7.
We have the following decomposition:
$$\frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\hat\alpha)}{\partial z} - E\left[\frac{\partial P(Z,X,\alpha_0)}{\partial z}\right] = \left\{\frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\hat\alpha)}{\partial z} - \frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\alpha_0)}{\partial z}\right\} + \left\{\frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\alpha_0)}{\partial z} - E\left[\frac{\partial P(Z,X,\alpha_0)}{\partial z}\right]\right\}.$$
Under the assumption of finite variance for $\partial P(Z,X,\alpha_0)/\partial z$, we have
$$\frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\alpha_0)}{\partial z} - E\left[\frac{\partial P(Z,X,\alpha_0)}{\partial z}\right] = \frac{1}{n}\sum_{i=1}^n\psi_{\partial P,i} = O_p\!\left(n^{-1/2}\right),$$
where
$$\psi_{\partial P,i} = \frac{\partial P(Z_i,X_i,\alpha_0)}{\partial z} - E\left[\frac{\partial P(Z,X,\alpha_0)}{\partial z}\right].$$
For the first term, a mean value expansion gives
$$\frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\hat\alpha)}{\partial z} - \frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\alpha_0)}{\partial z} = \left(\frac{1}{n}\sum_{i=1}^n\frac{\partial^2 P(Z_i,X_i,\tilde\alpha)}{\partial\alpha'\,\partial z}\right)(\hat\alpha-\alpha_0),$$
where $\tilde\alpha$ is a vector with (not necessarily equal) coordinates between $\alpha_0$ and $\hat\alpha$, and $\partial^2 P(Z_i,X_i,\tilde\alpha)/(\partial\alpha'\,\partial z)$ is a row vector of dimension $1\times d_\alpha$. Under the uniform law of large numbers given in the lemma, we have
$$\frac{1}{n}\sum_{i=1}^n\frac{\partial^2 P(Z_i,X_i,\tilde\alpha)}{\partial\alpha\,\partial z} \stackrel{p}{\to} E\left[\frac{\partial^2 P(Z,X,\alpha_0)}{\partial\alpha\,\partial z}\right]. \quad (A.27)$$
Using (A.27) together with (45), we obtain
$$\frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\hat\alpha)}{\partial z} - \frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\alpha_0)}{\partial z} = \left\{E\left[\frac{\partial^2 P(Z,X,\alpha_0)}{\partial\alpha'\,\partial z}\right]+o_p(1)\right\}\left\{\mathbb P_n\psi_\alpha + o_p\!\left(n^{-1/2}\right)\right\} = E\left[\frac{\partial^2 P(Z,X,\alpha_0)}{\partial\alpha'\,\partial z}\right]\mathbb P_n\psi_\alpha + o_p\!\left(n^{-1/2}\right).$$
The decomposition is then
$$T_n(\hat\alpha) - T_2 = \frac{1}{n}\sum_{i=1}^n\frac{\partial P(Z_i,X_i,\hat\alpha)}{\partial z} - E\left[\frac{\partial P(Z,X,\alpha_0)}{\partial z}\right] = E\left[\frac{\partial^2 P(Z,X,\alpha_0)}{\partial z\,\partial\alpha'}\right]\mathbb P_n\psi_\alpha + \mathbb P_n\psi_{\partial P} + o_p\!\left(n^{-1/2}\right).$$

Proof of Lemma 8.
Recall that
$$m(y_\tau,\tilde w(\alpha_\theta)) := m(y_\tau,P(w,\alpha_\theta),x) = E\left[1\{Y\le y_\tau\}\,|\,P(W,\alpha_\theta)=P(w,\alpha_\theta),X=x\right].$$
In order to emphasize the dual roles of $\alpha_\theta$, we define
$$\tilde m(y_\tau,u,x;P(\cdot,\alpha_{\theta_2})) = E\left[1\{Y\le y_\tau\}\,|\,P(W,\alpha_{\theta_2})=u,X=x\right].$$
Since $y_\tau$ is fixed, we regard $\tilde m$ as a function of $(u,x)$ that depends on the function $P(\cdot,\alpha_{\theta_2})$. Then
$$\tilde m(y_\tau,P(w,\alpha_{\theta_1}),x;P(\cdot,\alpha_{\theta_2}))\big|_{\theta_1=\theta_2=\theta_0} = E\left[1\{Y\le y_\tau\}\,|\,P(W,\alpha_{\theta_0})=P(w,\alpha_{\theta_0}),X=x\right] = m(y_\tau,P(w,\alpha_{\theta_0}),x).$$
As in Hahn and Ridder (2013), we employ $\tilde m(y_\tau,u,x;P(\cdot,\alpha_{\theta_2}))$ as an expositional device only. The functional of interest is
$$H[m] := H[m(y_\tau,P(\cdot,\alpha_\theta),\cdot)] = \int_{\mathcal W}\frac{\partial m(y_\tau,P(w,\alpha_\theta),x)}{\partial z}f_W(w)\,dw = \int_{\mathcal W}\frac{\partial\tilde m(y_\tau,P(w,\alpha_{\theta_1}),x;P(\cdot,\alpha_{\theta_2}))}{\partial z}\bigg|_{\theta_1=\theta_2=\theta}f_W(w)\,dw.$$
Under Condition (iii) of the lemma, we can exchange $\partial/\partial\alpha_\theta$ with $E$ and obtain
$$\frac{\partial}{\partial\alpha_\theta}H[m]\bigg|_{\theta=\theta_0} = \int_{\mathcal W}\frac{\partial}{\partial z}\frac{\partial\tilde m(y_\tau,P(w,\alpha_{\theta_1}),x;P(\cdot,\alpha_{\theta_2}))}{\partial\alpha_{\theta_1}}\bigg|_{\theta_1=\theta_2=\theta_0}f_W(w)\,dw + \int_{\mathcal W}\frac{\partial}{\partial z}\frac{\partial\tilde m(y_\tau,P(w,\alpha_{\theta_1}),x;P(\cdot,\alpha_{\theta_2}))}{\partial\alpha_{\theta_2}}\bigg|_{\theta_1=\theta_2=\theta_0}f_W(w)\,dw = \int_{\mathcal W}\frac{\partial}{\partial z}\left[\tilde m'_\alpha(y_\tau,P(w,\alpha_{\theta_0}),x;P(\cdot,\alpha_{\theta_0}))\right]f_W(w)\,dw,$$
where
$$\tilde m'_\alpha(y_\tau,P(w,\alpha_{\theta_0}),x;P(\cdot,\alpha_{\theta_0})) = \frac{\partial\tilde m(y_\tau,P(w,\alpha_{\theta_1}),x;P(\cdot,\alpha_{\theta_2}))}{\partial\alpha_{\theta_1}}\bigg|_{\theta_1=\theta_2=\theta_0} + \frac{\partial\tilde m(y_\tau,P(w,\alpha_{\theta_1}),x;P(\cdot,\alpha_{\theta_2}))}{\partial\alpha_{\theta_2}}\bigg|_{\theta_1=\theta_2=\theta_0}.$$
Under Condition (i) of the lemma, integration by parts in $z$ gives
$$\begin{aligned}\int_{\mathcal W}\frac{\partial}{\partial z}\left[\tilde m'_\alpha\right]f_W(w)\,dw &= \int_{\mathcal X}\int_{z_L(x)}^{z_U(x)}\frac{\partial}{\partial z}\left[\tilde m'_\alpha\right]f_{Z|X}(z|x)\,dz\cdot f_X(x)\,dx\\
&= \int_{\mathcal X}\tilde m'_\alpha\, f_{Z|X}(z|x)\Big|_{z_L(x)}^{z_U(x)}f_X(x)\,dx - \int_{\mathcal W}\tilde m'_\alpha\,\frac{\partial\log f_{Z|X}(z|x)}{\partial z}f_W(w)\,dw\\
&= -\int_{\mathcal W}\tilde m'_\alpha\,\frac{\partial\log f_W(w)}{\partial z}f_W(w)\,dw.\end{aligned}$$
Define
$$\nu(u,x;P(\cdot,\alpha_\theta)) = E\left[\frac{\partial\log f_W(W)}{\partial z}\,\Big|\,P(W,\alpha_\theta)=u,X=x\right].$$
By the law of iterated expectations, we have
$$\int_{\mathcal W}\tilde m(y_\tau,P(w,\alpha_\theta),x;P(\cdot,\alpha_\theta))\,\nu(P(w,\alpha_\theta),x;P(\cdot,\alpha_\theta))f_W(w)\,dw = E\left[1\{Y\le y_\tau\}\,\nu(P(W,\alpha_\theta),X;P(\cdot,\alpha_\theta))\right].$$
Differentiating the above with respect to $\alpha_\theta$ and evaluating the resulting equation at $\theta=\theta_0$, we have
$$E\left[\frac{\partial\tilde m}{\partial\alpha_{\theta_1}}\bigg|_{\theta_1=\theta_2=\theta_0}\nu(P(W,\alpha_{\theta_0}),X;P(\cdot,\alpha_{\theta_0}))\right] + E\left[\frac{\partial\tilde m}{\partial\alpha_{\theta_2}}\bigg|_{\theta_1=\theta_2=\theta_0}\nu(P(W,\alpha_{\theta_0}),X;P(\cdot,\alpha_{\theta_0}))\right] = E\left\{\left[1\{Y\le y_\tau\}-m(y_\tau,P(W,\alpha_{\theta_0}),X)\right]\frac{\partial\nu(P(W,\alpha_\theta),X;P(\cdot,\alpha_\theta))}{\partial\alpha_\theta}\bigg|_{\theta=\theta_0}\right\}, \quad (A.28)$$
where we have used Condition (iii) to exchange the differentiation with the expectation. Using (A.28) and Condition (ii) of the lemma, we have
$$\begin{aligned}\int_{\mathcal W}\frac{\partial}{\partial z}\left[\tilde m'_\alpha\right]f_W(w)\,dw &= -\int_{\mathcal W}\tilde m'_\alpha\,\frac{\partial\log f_W(w)}{\partial z}f_W(w)\,dw = -\int_{\mathcal W}\tilde m'_\alpha\,\nu(P(w,\alpha_{\theta_0}),x;P(\cdot,\alpha_{\theta_0}))f_W(w)\,dw\\
&= -E\left\{\left[1\{Y\le y_\tau\}-m(y_\tau,P(W,\alpha_{\theta_0}),X)\right]\frac{\partial\nu(P(W,\alpha_\theta),X;P(\cdot,\alpha_\theta))}{\partial\alpha_\theta}\bigg|_{\theta=\theta_0}\right\} = 0.\end{aligned}$$
That is,
$$\frac{\partial}{\partial\theta}E\left[\frac{\partial m(y_\tau,P(Z,X,\alpha_\theta),X)}{\partial z}\right]\bigg|_{\theta=\theta_0} = 0.$$

Proof of Lemma 9.
First, we prove that the decomposition in (54) is valid. We start by showing that
$$T_\theta = E_\theta\left[\frac{\partial m_\theta(y_{\tau,\theta},\tilde W(\alpha_\theta))}{\partial z}\right]$$
is differentiable at $\theta_0$. For this, it suffices to show that each of the four derivatives below exists at $\theta=\theta_0$:
$$\frac{\partial}{\partial\theta}E_\theta\left[\frac{\partial m(y_\tau,\tilde W(\alpha_0))}{\partial z}\right];\quad \frac{\partial}{\partial\theta}E\left[\frac{\partial m_\theta(y_\tau,\tilde W(\alpha_0))}{\partial z}\right];\quad \frac{\partial}{\partial\theta}E\left[\frac{\partial m(y_{\tau,\theta},\tilde W(\alpha_0))}{\partial z}\right];\quad \frac{\partial}{\partial\theta}E\left[\frac{\partial m(y_\tau,\tilde W(\alpha_\theta))}{\partial z}\right]. \quad (A.29)$$
By Lemma 8, the last derivative exists and is equal to zero at $\theta=\theta_0$. We deal with the remaining three derivatives in (A.29) one at a time. Consider the first derivative. Under Conditions (i) and (ii) of the lemma, $E_\theta[\partial m(y_\tau,\tilde W(\alpha_0))/\partial z]$ is differentiable in $\theta$ and
$$\frac{\partial}{\partial\theta}E_\theta\left[\frac{\partial m(y_\tau,\tilde W(\alpha_0))}{\partial z}\right]\bigg|_{\theta=\theta_0} = E\left[\frac{\partial m(y_\tau,\tilde W(\alpha_0))}{\partial z}S(O)\right].$$
Hence, the contribution associated with the first derivative is simply the influence function of $T_n(y_\tau,m,\alpha_0)-T_1$.

Now, for the second derivative in (A.29), Theorem 7.2 in Newey (1994) shows that the assumptions of the lemma imply the following:

1. There is a function $\gamma_m(o)$ and a measure $\hat F_m$ such that $E[\gamma_m(O)] = 0$, $E[\gamma_m(O)^2] < \infty$, and for all $\hat m$ such that $\|\hat m-m\|$ is small enough,
$$E\left[\frac{\partial\hat m(y_\tau,P(Z,X,\alpha_0),X)}{\partial z} - \frac{\partial m(y_\tau,P(Z,X,\alpha_0),X)}{\partial z}\right] = \int\gamma_m(o)\,d\hat F_m(o).$$

2. The following approximation holds:
$$\int\gamma_m(o)\,d\hat F_m(o) = \frac{1}{n}\sum_{i=1}^n\gamma_m(O_i) + o_p\!\left(n^{-1/2}\right).$$

For a parametric submodel $F_\theta$, we then have, when $\theta$ is close enough to $\theta_0$,
$$\frac{1}{\theta-\theta_0}E\left[\frac{\partial m_\theta(y_\tau,P(Z,X,\alpha_0),X)}{\partial z} - \frac{\partial m(y_\tau,P(Z,X,\alpha_0),X)}{\partial z}\right] = \frac{1}{\theta-\theta_0}\int\gamma_m(o)\,d\left[F_\theta(o)-F_{\theta_0}(o)\right],$$
since $E[\gamma_m(O)] = \int\gamma_m(o)\,dF_{\theta_0}(o) = 0$. If $\int\gamma_m(o)^2\,dF_\theta(o)$ is bounded in a neighborhood of $\theta=\theta_0$, then, by Lemma 7.2 in Ibragimov and Hasminskii (1981), the second derivative exists and satisfies
$$\frac{\partial}{\partial\theta}E\left[\frac{\partial m_\theta(y_\tau,P(Z,X,\alpha_0),X)}{\partial z}\right]\bigg|_{\theta=\theta_0} = \frac{\partial}{\partial\theta}\int\gamma_m(o)\,dF_\theta(o)\bigg|_{\theta=\theta_0} = E[\gamma_m(O)S(O)].$$
This shows that $\gamma_m(o)$ is the influence function of $E[\partial\hat m(y_\tau,P(Z,X,\alpha_0),X)/\partial z]$. That is,
$$E\left[\frac{\partial\hat m(y_\tau,P(Z,X,\alpha_0),X)}{\partial z}\right] - E\left[\frac{\partial m(y_\tau,P(Z,X,\alpha_0),X)}{\partial z}\right] = \frac{1}{n}\sum_{i=1}^n\gamma_m(O_i) + o_p\!\left(n^{-1/2}\right).$$
This, combined with the stochastic equicontinuity assumption, implies that
$$T_n(y_\tau,\hat m,\alpha_0) - T_n(y_\tau,m,\alpha_0) = \frac{1}{n}\sum_{i=1}^n\gamma_m(O_i) + o_p\!\left(n^{-1/2}\right),$$
and $\gamma_m(o)$ is the influence function of $T_n(y_\tau,\hat m,\alpha_0)-T_n(y_\tau,m,\alpha_0)$.

Now, the dominating condition in Condition (iv) ensures that the third derivative in (A.29) exists and
$$\frac{\partial}{\partial\theta}E\left[\frac{\partial m(y_{\tau,\theta},\tilde W(\alpha_0))}{\partial z}\right]\bigg|_{\theta=\theta_0} = E\left[\frac{\partial^2 m(y_\tau,\tilde W(\alpha_0))}{\partial y_\tau\,\partial z}\right]\frac{\partial y_{\tau,\theta}}{\partial\theta}\bigg|_{\theta=\theta_0}.$$
Given the approximation $\hat y_\tau - y_\tau = \mathbb P_n\psi_Q(y_\tau) + o_p(n^{-1/2})$ from Lemma 5, we have
$$\frac{\partial y_{\tau,\theta}}{\partial\theta}\bigg|_{\theta=\theta_0} = E[\psi_Q(y_\tau)S(O)].$$
Hence,
$$\frac{\partial}{\partial\theta}E\left[\frac{\partial m(y_{\tau,\theta},\tilde W(\alpha_0))}{\partial z}\right]\bigg|_{\theta=\theta_0} = E\left[\frac{\partial^2 m(y_\tau,\tilde W(\alpha_0))}{\partial y_\tau\,\partial z}\right]E[\psi_Q(y_\tau)S(O)].$$
This gives us the contribution from the estimation of $y_\tau$. Alternatively, this expression gives us the influence function of
$$E\left[\frac{\partial m(\hat y_\tau,\tilde W(\alpha_0))}{\partial z}\right] - E\left[\frac{\partial m(y_\tau,\tilde W(\alpha_0))}{\partial z}\right],$$
because
$$E\left[\frac{\partial m(\hat y_\tau,\tilde W(\alpha_0))}{\partial z}\right] - E\left[\frac{\partial m(y_\tau,\tilde W(\alpha_0))}{\partial z}\right] = E\left[\frac{\partial^2 m(y_\tau,\tilde W(\alpha_0))}{\partial y_\tau\,\partial z}\right](\hat y_\tau-y_\tau) + o_p\!\left(n^{-1/2}\right).$$
Using the stochastic equicontinuity assumption, we then get that
$$T_n(\hat y_\tau,m,\alpha_0) - T_n(y_\tau,m,\alpha_0) = E\left[\frac{\partial^2 m(y_\tau,\tilde W(\alpha_0))}{\partial y_\tau\,\partial z}\right](\hat y_\tau-y_\tau) + o_p\!\left(n^{-1/2}\right) = E\left[\frac{\partial f_{Y|\tilde W(\alpha_0)}(y_\tau|\tilde W(\alpha_0))}{\partial z}\right]\mathbb P_n\psi_Q(y_\tau) + o_p\!\left(n^{-1/2}\right).$$
To sum up, we have shown that
$$T_n(\hat y_\tau,\hat m,\hat\alpha) - T_1 = \left[T_n(y_\tau,m,\alpha_0)-T_1\right] + \left[T_n(y_\tau,\hat m,\alpha_0)-T_n(y_\tau,m,\alpha_0)\right] + \left[T_n(\hat y_\tau,m,\alpha_0)-T_n(y_\tau,m,\alpha_0)\right] + o_p\!\left(n^{-1/2}\right).$$
To obtain the influence function of the first and second terms on the right-hand side of (54), we just need to invoke Theorem 7.2 in Newey (1994) to obtain
$$T_n(y_\tau,\hat m,\alpha_0) - T_1 = \frac{1}{n}\sum_{i=1}^n\left[\frac{\partial m(y_\tau,\tilde W_i(\alpha_0))}{\partial z} - T_1\right] - \mathbb P_n\left\{\left[1\{Y\le y_\tau\}-m(y_\tau,\tilde W(\alpha_0))\right]E\left[\frac{\partial\log f_W(W)}{\partial z}\,\bigg|\,\tilde W(\alpha_0)\right]\right\} + o_p\!\left(n^{-1/2}\right), \quad (A.30)$$
because $T_n(y_\tau,\hat m,\alpha_0)-T_1 = [T_n(y_\tau,m,\alpha_0)-T_1] + [T_n(y_\tau,\hat m,\alpha_0)-T_n(y_\tau,m,\alpha_0)]$. The first term in (A.30) is simply the influence function of a sample mean. The second term in (A.30) is the adjustment due to the estimation of $m$.

Proof of Theorem 5.
Consider the following difference:
$$\begin{aligned}\Pi_{\tau,z} - \hat\Pi_{\tau,z} &= \frac{T_n(\hat y_\tau,\hat m,\hat\alpha)}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)} - \frac{T_1}{f_Y(y_\tau)T_2} = \frac{T_n(\hat y_\tau,\hat m,\hat\alpha)f_Y(y_\tau)T_2 - \hat f_Y(\hat y_\tau)T_n(\hat\alpha)T_1}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)f_Y(y_\tau)T_2}\\
&= \frac{f_Y(y_\tau)T_2}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)f_Y(y_\tau)T_2}\left[T_n(\hat y_\tau,\hat m,\hat\alpha)-T_1\right] - T_1\frac{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)-f_Y(y_\tau)T_2}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)f_Y(y_\tau)T_2}\\
&= \frac{f_Y(y_\tau)T_2}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)f_Y(y_\tau)T_2}\left[T_n(\hat y_\tau,\hat m,\hat\alpha)-T_1\right] - \frac{T_1 T_n(\hat\alpha)}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)f_Y(y_\tau)T_2}\left[\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right] - \frac{T_1 f_Y(y_\tau)}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)f_Y(y_\tau)T_2}\left[T_n(\hat\alpha)-T_2\right]. \quad (A.31)\end{aligned}$$
We can rearrange (A.31) as
$$\hat\Pi_{\tau,z} - \Pi_{\tau,z} = \frac{T_1}{\hat f_Y(\hat y_\tau)f_Y(y_\tau)T_2}\left[\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right] + \frac{T_1}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)T_2}\left[T_n(\hat\alpha)-T_2\right] - \frac{1}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)}\left[T_n(\hat y_\tau,\hat m,\hat\alpha)-T_1\right]. \quad (A.32)$$
By appropriately defining the remainders, we can express (A.32) as
$$\hat\Pi_{\tau,z} - \Pi_{\tau,z} = \frac{T_1}{f_Y(y_\tau)^2 T_2}\left[\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right] + \frac{T_1}{f_Y(y_\tau)T_2^2}\left[T_n(\hat\alpha)-T_2\right] - \frac{1}{f_Y(y_\tau)T_2}\left[T_n(\hat y_\tau,\hat m,\hat\alpha)-T_1\right] + R_1 + R_2 + R_3. \quad (A.33)$$
The definitions of $R_1$, $R_2$ and $R_3$ can be found below. Now we are ready to separate the contribution of each stage of the estimation. We shall use equations (42), (46), and (54):
$$\begin{aligned}\hat\Pi_{\tau,z} - \Pi_{\tau,z} &= \frac{T_1}{f_Y(y_\tau)^2 T_2}\left[\hat f_Y(y_\tau)-f_Y(y_\tau)\right] + \frac{T_1}{f_Y(y_\tau)^2 T_2}\left[f_Y(\hat y_\tau)-f_Y(y_\tau)\right]\\
&\quad + \frac{T_1}{f_Y(y_\tau)T_2^2}\left[T_n(\alpha_0)-T_2\right] + \frac{T_1}{f_Y(y_\tau)T_2^2}\left\{E\left[\frac{\partial P(W,\hat\alpha)}{\partial z}\right]-T_2\right\}\\
&\quad - \frac{1}{f_Y(y_\tau)T_2}\left[T_n(y_\tau,m,\alpha_0)-T_1\right] - \frac{1}{f_Y(y_\tau)T_2}\left[T_n(y_\tau,\hat m,\alpha_0)-T_n(y_\tau,m,\alpha_0)\right] - \frac{1}{f_Y(y_\tau)T_2}\left[T_n(\hat y_\tau,m,\alpha_0)-T_n(y_\tau,m,\alpha_0)\right]\\
&\quad + R_1 + R_2 + R_3 + \frac{T_1}{f_Y(y_\tau)^2 T_2}R_{f_Y} + o_p\!\left(n^{-1/2}\right). \quad (A.34)\end{aligned}$$
Finally, we establish the rate for the remainders $R_1$, $R_2$ and $R_3$ in (A.33). We deal with each component of the remainder separately. The first remainder is
$$\begin{aligned}R_1 &= \frac{T_1}{\hat f_Y(\hat y_\tau)f_Y(y_\tau)T_2}\left[\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right] - \frac{T_1}{f_Y(y_\tau)^2 T_2}\left[\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right]\\
&= \frac{T_1}{f_Y(y_\tau)T_2}\left[\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right]\left[\frac{1}{\hat f_Y(\hat y_\tau)}-\frac{1}{f_Y(y_\tau)}\right]\\
&= -\frac{T_1}{f_Y(y_\tau)^2 T_2\,\hat f_Y(\hat y_\tau)}\left[\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right]^2 = O_p\!\left(\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|^2\right). \quad (A.35)\end{aligned}$$
The second remainder is
$$\begin{aligned}R_2 &= \frac{T_1}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)T_2}\left[T_n(\hat\alpha)-T_2\right] - \frac{T_1}{f_Y(y_\tau)T_2^2}\left[T_n(\hat\alpha)-T_2\right]\\
&= \frac{T_1}{T_2}\left[T_n(\hat\alpha)-T_2\right]\left[\frac{1}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)}-\frac{1}{f_Y(y_\tau)T_2}\right]\\
&= \frac{T_1}{T_2}\left[T_n(\hat\alpha)-T_2\right]\frac{f_Y(y_\tau)T_2-\hat f_Y(\hat y_\tau)T_n(\hat\alpha)}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)f_Y(y_\tau)T_2}\\
&= \frac{T_1}{T_2}\left[T_n(\hat\alpha)-T_2\right]\frac{f_Y(y_\tau)\left(T_2-T_n(\hat\alpha)\right)-T_n(\hat\alpha)\left(\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right)}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)f_Y(y_\tau)T_2}\\
&= O_p\!\left(\left|T_n(\hat\alpha)-T_2\right|^2\right) + O_p\!\left(\left|T_n(\hat\alpha)-T_2\right|\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|\right)\\
&= O_p\!\left(n^{-1}\right) + O_p\!\left(n^{-1/2}\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|\right). \quad (A.36)\end{aligned}$$
The third remainder is
$$\begin{aligned}R_3 &= \frac{1}{f_Y(y_\tau)T_2}\left[T_n(\hat y_\tau,\hat m,\hat\alpha)-T_1\right] - \frac{1}{\hat f_Y(\hat y_\tau)T_n(\hat\alpha)}\left[T_n(\hat y_\tau,\hat m,\hat\alpha)-T_1\right]\\
&= O_p\!\left(\left|T_n(\hat y_\tau,\hat m,\hat\alpha)-T_1\right|\left|T_n(\hat\alpha)-T_2\right|\right) + O_p\!\left(\left|T_n(\hat y_\tau,\hat m,\hat\alpha)-T_1\right|\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|\right)\\
&= O_p\!\left(n^{-1}\right) + O_p\!\left(n^{-1/2}\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|\right), \quad (A.37)\end{aligned}$$
because it has the same denominator as $R_2$ in (A.36). Finally, we compute the rate for $\hat f_Y(\hat y_\tau)-f_Y(y_\tau)$. To do so, we use the results in Lemma 6. Equation (42) tells us
$$\begin{aligned}\hat f_Y(\hat y_\tau)-f_Y(y_\tau) &= \hat f_Y(y_\tau)-f_Y(y_\tau) + f_Y(\hat y_\tau)-f_Y(y_\tau) + R_{f_Y}\\
&= O_p\!\left(n^{-1/2}h^{-1/2}\right) + O(h^2) + o_p(h^2) + O_p\!\left(n^{-1/2}\right) + O_p(|R_{f_Y}|)\\
&= O_p\!\left(n^{-1/2}h^{-1/2}\right) + O(h^2) + o_p\!\left(n^{-1/2}\right) + O_p(|R_{f_Y}|). \quad (A.38)\end{aligned}$$
Thus we have
$$O_p\!\left(\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|^2\right) = O_p\!\left(n^{-1}h^{-1}\right) + O(h^4) + o_p\!\left(n^{-1}\right) + O_p\!\left(|R_{f_Y}|^2\right).$$
The remainder $R_\Pi$ is defined as
$$R_\Pi := R_1 + R_2 + R_3 + \frac{T_1}{f_Y(y_\tau)^2 T_2}R_{f_Y} + o_p\!\left(n^{-1/2}\right) + o_p(h^2).$$
So,
$$\begin{aligned}R_\Pi &= O_p\!\left(\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|^2\right) + O_p\!\left(n^{-1}\right) + O_p\!\left(n^{-1/2}\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|\right) + O_p(|R_{f_Y}|) + o_p\!\left(n^{-1/2}\right) + o_p(h^2)\\
&= O_p\!\left(\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|^2\right) + O_p\!\left(n^{-1/2}\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|\right) + O_p(|R_{f_Y}|) + o_p\!\left(n^{-1/2}\right) + o_p(h^2), \quad (A.39)\end{aligned}$$
because $O_p(n^{-1})$ is $o_p(n^{-1/2})$ as $n\to\infty$. Now we show that $\sqrt{nh}\,R_\Pi = o_p(1)$ under Assumption 8. We do this term by term in (A.39):
$$\sqrt{nh}\,o_p(h^2) = o_p\!\left(n^{1/2}h^{5/2}\right) = o_p(1)\ \text{as}\ nh^5\downarrow 0; \qquad \sqrt{nh}\,o_p\!\left(n^{-1/2}\right) = o_p\!\left(h^{1/2}\right) = o_p(1)\ \text{as}\ h\downarrow 0;$$
$$\sqrt{nh}\,O_p(|R_{f_Y}|) = o_p(1)\ \text{by Lemma 6}; \qquad \sqrt{nh}\,O_p\!\left(n^{-1/2}\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|\right) = h^{1/2}O_p\!\left(\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|\right) = o_p(1),$$
since $\hat f_Y(\hat y_\tau)-f_Y(y_\tau) = o_p(1)$. Finally,
$$\sqrt{nh}\,O_p\!\left(\left|\hat f_Y(\hat y_\tau)-f_Y(y_\tau)\right|^2\right) = o_p(1),$$
since $\hat f_Y(\hat y_\tau)-f_Y(y_\tau) = O_p\!\left(n^{-1/2}h^{-1/2} + h^2 + n^{-1/2} + R_{f_Y}\right)$ by (42).

B Supplementary Appendix
Details of Example 2.
The propensity score under the new policy regime is
$$P^{\mu_\delta}(w) = \Pr\left[V\le\mu(w)+s_\mu(\delta)\,|\,W=w\right] = F_{V|W}(\mu(w)+s_\mu(\delta)|w),$$
and so
$$\frac{\partial P^{\mu_\delta}(w)}{\partial\delta}\bigg|_{\delta=0} = \frac{f_V(\mu(w))}{E(f_V(\mu(W)))}.$$
By Theorem 1, we have
$$\Pi_{\tau,\mu} = \frac{1}{E(f_V(\mu(W)))f_Y(y_\tau)}\int_{\mathcal W}E[1\{Y(0)\le y_\tau\}|V=\mu(w),W=w]f_V(\mu(w))f_W(w)\,dw - \frac{1}{E(f_V(\mu(W)))f_Y(y_\tau)}\int_{\mathcal W}E[1\{Y(1)\le y_\tau\}|V=\mu(w),W=w]f_V(\mu(w))f_W(w)\,dw.$$
Using the potential outcome equations, we get
$$E[1\{Y(0)\le y_\tau\}|V=\mu(w),W=w] = E[1\{q(w)+U_0\le y_\tau\}|V=\mu(w)] = \Pr(U_0\le y_\tau-q(w)|V=\mu(w)) = F_{U_0|V}(y_\tau-q(w)|\mu(w))$$
and
$$E[1\{Y(1)\le y_\tau\}|V=\mu(w),W=w] = E[1\{q(w)+\beta+U_1\le y_\tau\}|V=\mu(w)] = \Pr(U_1\le y_\tau-q(w)-\beta|V=\mu(w)) = F_{U_1|V}(y_\tau-q(w)-\beta|\mu(w)).$$
Hence,
$$\begin{aligned}\Pi_{\tau,\mu} &= \frac{1}{f_Y(y_\tau)}\int_{\mathcal W}F_{U_0|V}(y_\tau-q(w)|\mu(w))\frac{f_V(\mu(w))f_W(w)}{\int_{\mathcal W}f_V(\mu(w))f_W(w)\,dw}\,dw - \frac{1}{f_Y(y_\tau)}\int_{\mathcal W}F_{U_1|V}(y_\tau-q(w)-\beta|\mu(w))\frac{f_V(\mu(w))f_W(w)}{\int_{\mathcal W}f_V(\mu(w))f_W(w)\,dw}\,dw\\
&= \frac{1}{f_Y(y_\tau)}\int_{\mathcal W}\left[\int_{-\infty}^{y_\tau-q(w)}f_{U_0|V}(u|\mu(w))\,du\right]\tilde f_W(w)\,dw - \frac{1}{f_Y(y_\tau)}\int_{\mathcal W}\left[\int_{-\infty}^{y_\tau-q(w)-\beta}f_{U_1|V}(u|\mu(w))\,du\right]\tilde f_W(w)\,dw,\end{aligned}$$
where $\tilde f_W(w) := f_V(\mu(w))f_W(w)\big/\int_{\mathcal W}f_V(\mu(w))f_W(w)\,dw$. It follows from Corollary 1 that the apparent effect is
$$A_{\tau,\mu} = \frac{1}{f_Y(y_\tau)}\int_{-\infty}^{\infty}E[1\{Y\le y_\tau\}|D=0,W=w]f_W(w)\,dw - \frac{1}{f_Y(y_\tau)}\int_{-\infty}^{\infty}E[1\{Y\le y_\tau\}|D=1,W=w]f_W(w)\,dw.$$
Using the explicit forms of the potential outcomes, we get
$$\begin{aligned}E[1\{Y\le y_\tau\}|D=0,W=w] &= E[1\{Y(0)\le y_\tau\}|D=0,W=w] = E[1\{q(W)+U_0\le y_\tau\}|D=0,W=w]\\
&= E[1\{q(W)+U_0\le y_\tau\}|V>\mu(w),W=w] = \Pr(U_0\le y_\tau-q(w)|V>\mu(w))\\
&= \frac{\int_{\mu(w)}^{\infty}\left[\int_{-\infty}^{y_\tau-q(w)}f_{U_0|V}(u|v)\,du\right]f_V(v)\,dv}{1-F_V(\mu(w))}\end{aligned}$$
and
$$\begin{aligned}E[1\{Y\le y_\tau\}|D=1,W=w] &= E[1\{Y(1)\le y_\tau\}|D=1,W=w] = E[1\{q(W)+\beta+U_1\le y_\tau\}|D=1,W=w]\\
&= E[1\{q(W)+\beta+U_1\le y_\tau\}|V\le\mu(W),W=w] = \Pr(U_1\le y_\tau-q(w)-\beta|V\le\mu(w))\\
&= \frac{\int_{-\infty}^{\mu(w)}\left[\int_{-\infty}^{y_\tau-q(w)-\beta}f_{U_1|V}(u|v)\,du\right]f_V(v)\,dv}{F_V(\mu(w))}.\end{aligned}$$
Hence,
$$A_{\tau,\mu} = \frac{1}{f_Y(y_\tau)}\int_{\mathcal W}\frac{\int_{\mu(w)}^{\infty}\left[\int_{-\infty}^{y_\tau-q(w)}f_{U_0|V}(u|v)\,du\right]f_V(v)\,dv}{1-F_V(\mu(w))}f_W(w)\,dw - \frac{1}{f_Y(y_\tau)}\int_{\mathcal W}\frac{\int_{-\infty}^{\mu(w)}\left[\int_{-\infty}^{y_\tau-q(w)-\beta}f_{U_1|V}(u|v)\,du\right]f_V(v)\,dv}{F_V(\mu(w))}f_W(w)\,dw.$$
Now, for the scaling factor $f_Y(\cdot)$, we have a mixture
$$f_Y(y_\tau) = f_{Y(0)|D}(y_\tau|0)\Pr(D=0) + f_{Y(1)|D}(y_\tau|1)\Pr(D=1).$$
The mixing weights are $\Pr(D=1) = \Pr(V\le\mu(W))$ and $\Pr(D=0) = 1-\Pr(D=1)$. To obtain the mixing densities, we note that
$$\begin{aligned}F_{Y(0)|D}(y_\tau|0) &= \Pr(Y(0)\le y_\tau|D=0) = \frac{\Pr(Y(0)\le y_\tau,D=0)}{\Pr(D=0)} = \frac{1}{\Pr(D=0)}\Pr(q(W)+U_0\le y_\tau,V>\mu(W))\\
&= \frac{1}{\Pr(D=0)}\int_{\mathcal W}\Pr(q(w)+U_0\le y_\tau,V>\mu(w))f_W(w)\,dw\\
&= \frac{1}{\Pr(D=0)}\int_{\mathcal W}\left[F_{U_0}(y_\tau-q(w)) - F_{U_0,V}(y_\tau-q(w),\mu(w))\right]f_W(w)\,dw.\end{aligned}$$
Hence, the density $f_{Y(0)|D}(y_\tau|0)$ is
$$f_{Y(0)|D}(y_\tau|0) = \frac{1}{\Pr(D=0)}\int_{\mathcal W}\left[f_{U_0}(y_\tau-q(w)) - \int_{-\infty}^{\mu(w)}f_{U_0,V}(y_\tau-q(w),v)\,dv\right]f_W(w)\,dw.$$
For the other case, we have
$$\begin{aligned}F_{Y(1)|D}(y_\tau|1) &= \Pr(Y(1)\le y_\tau|D=1) = \frac{\Pr(Y(1)\le y_\tau,D=1)}{\Pr(D=1)} = \frac{1}{\Pr(D=1)}\Pr(q(W)+\beta+U_1\le y_\tau,V\le\mu(W))\\
&= \frac{1}{\Pr(D=1)}\int_{\mathcal W}\Pr(q(w)+\beta+U_1\le y_\tau,V\le\mu(w))f_W(w)\,dw\\
&= \frac{1}{\Pr(D=1)}\int_{\mathcal W}F_{U_1,V}(y_\tau-q(w)-\beta,\mu(w))f_W(w)\,dw,\end{aligned}$$
and so the density $f_{Y(1)|D}(y_\tau|1)$ is
$$f_{Y(1)|D}(y_\tau|1) = \frac{1}{\Pr(D=1)}\int_{\mathcal W}\left[\int_{-\infty}^{\mu(w)}f_{U_1,V}(y_\tau-q(w)-\beta,v)\,dv\right]f_W(w)\,dw.$$
The density $f_Y(y_\tau)$ is then
$$f_Y(y_\tau) = \int_{\mathcal W}\left[f_{U_0}(y_\tau-q(w)) - \int_{-\infty}^{\mu(w)}f_{U_0,V}(y_\tau-q(w),v)\,dv\right]f_W(w)\,dw + \int_{\mathcal W}\left[\int_{-\infty}^{\mu(w)}f_{U_1,V}(y_\tau-q(w)-\beta,v)\,dv\right]f_W(w)\,dw.$$

Proof of Proposition 1.
Note that for any bounded function $G(\cdot)$, we have
$$E[G(Y)|P(W)=P(w),X=x] = E[G(Y(1))|D=1,P(W)=P(w),X=x]\Pr(D=1|P(W)=P(w),X=x) + E[G(Y(0))|D=0,P(W)=P(w),X=x]\Pr(D=0|P(W)=P(w),X=x).$$
But, using $D = 1\{U_D\le P(W)\}$, we have
$$\Pr(D=1|P(W)=P(w),X=x) = \Pr(U_D\le P(W)|P(W)=P(w),X=x) = \Pr(U_D\le P(w)|P(W)=P(w),X=x) = P(w),$$
because $U_D$ is independent of $W$. So,
$$\begin{aligned}E[G(Y)|P(W)=P(w),X=x] &= E[G(Y(1))|D=1,P(W)=P(w),X=x]P(w) + E[G(Y(0))|D=0,P(W)=P(w),X=x](1-P(w))\\
&= E[G(Y(1))|U_D\le P(w),P(W)=P(w),X=x]P(w) + E[G(Y(0))|U_D>P(w),P(W)=P(w),X=x](1-P(w))\\
&= E[G(Y(1))|U_D\le P(w),X=x]P(w) + E[G(Y(0))|U_D>P(w),X=x](1-P(w)),\end{aligned}$$
where the last line follows because $U = (U_0,U_1)$ is independent of $Z$ given $X$ and $U_D$ (see (21)). Now
$$E[G(Y(1))|U_D\le P(w),X=x] = E\left\{E[G(Y(1))|U_D,X=x]\,|\,U_D\le P(w),X=x\right\} = E\left\{E[G(Y(1))|U_D,X=x]\,|\,U_D\le P(w)\right\} = \frac{\int_0^{P(w)}E[G(Y(1))|U_D=u,X=x]\,du}{P(w)},$$
where the first equality uses the law of iterated expectations, the second equality uses the independence of $U_D$ from $X$, and the last equality uses $U_D\sim$ uniform on $[0,1]$. Similarly,
$$E[G(Y(0))|U_D>P(w),X=x] = \frac{\int_{P(w)}^1 E[G(Y(0))|U_D=u,X=x]\,du}{1-P(w)}.$$
So we have
$$E[G(Y)|P(W)=P(w),X=x] = \int_0^{P(w)}E[G(Y(1))|U_D=u,X=x]\,du + \int_{P(w)}^1 E[G(Y(0))|U_D=u,X=x]\,du.$$
By taking $G(\cdot) = 1\{\cdot\le y_\tau\}$, we have
$$E[1\{Y\le y_\tau\}|P(W)=P(w),X=x] = \int_0^{P(w)}E[1\{Y(1)\le y_\tau\}|U_D=u,X=x]\,du + \int_{P(w)}^1 E[1\{Y(0)\le y_\tau\}|U_D=u,X=x]\,du.$$
Under Assumptions 2(a) and 2(b), we can invoke the fundamental theorem of calculus to obtain
$$\frac{\partial E[1\{Y\le y_\tau\}|P(W)=P(w),X=x]}{\partial P(w)} = E[1\{Y(1)\le y_\tau\}|U_D=P(w),X=x] - E[1\{Y(0)\le y_\tau\}|U_D=P(w),X=x] = \mathrm{MTE}_\tau(P(w),x). \quad (B.1)$$
That is,
$$\mathrm{MTE}_\tau(u,x) = \frac{\partial E[1\{Y\le y_\tau\}|P(W)=u,X=x]}{\partial u}$$
for any $u$ such that there is a $w\in\mathcal W$ satisfying $P(w)=u$.

Proof of Equation (36).
Note that
$$E(Y_\delta) = \int_0^1 E(Y_\delta|\mathcal P_\delta=t)f_{\mathcal P_\delta}(t)\,dt. \quad (B.2)$$
Now, consider $E(Y_\delta|\mathcal P_\delta=t)$. Depending on the value of $\mathcal P_\delta$ relative to the index $U_D$, we observe the potential outcome $Y(1)$ or $Y(0)$. By the independence of $\mathcal P_\delta$ from $U_D$ and the law of iterated expectations, we have
$$E(Y_\delta|\mathcal P_\delta=t) = \int_0^1 E(Y_\delta|\mathcal P_\delta=t,U_D=u)\,du = \int_0^t E(Y(1)|U_D=u)\,du + \int_t^1 E(Y(0)|U_D=u)\,du = \int_0^1\left[1\{0\le u\le t\}E(Y(1)|U_D=u) + 1\{t<u\le 1\}E(Y(0)|U_D=u)\right]du. \quad (B.3)$$
Plugging (B.3) back into (B.2), we get
$$\begin{aligned}E(Y_\delta) &= \int_0^1\left\{\int_0^1\left[1\{0\le u\le t\}E(Y(1)|U_D=u) + 1\{t<u\le 1\}E(Y(0)|U_D=u)\right]du\right\}f_{\mathcal P_\delta}(t)\,dt\\
&= \int_0^1\left\{\int_0^1\left[1\{u\le t\le 1\}E(Y(1)|U_D=u) + 1\{0\le t<u\}E(Y(0)|U_D=u)\right]f_{\mathcal P_\delta}(t)\,dt\right\}du\\
&= \int_0^1\left[\left(1-F_{\mathcal P_\delta}(u)\right)E(Y(1)|U_D=u) + F_{\mathcal P_\delta}(u)E(Y(0)|U_D=u)\right]du. \quad (B.4)\end{aligned}$$
Going back to (35) using (B.4), we get
$$\begin{aligned}\mathrm{PRTE} &= \frac{1}{\delta}\int_0^1\left[\left(1-F_{\mathcal P_\delta}(u)\right)E(Y(1)|U_D=u) + F_{\mathcal P_\delta}(u)E(Y(0)|U_D=u)\right]du\\
&\quad - \frac{1}{\delta}\int_0^1\left[\left(1-F_{\mathcal P}(u)\right)E(Y(1)|U_D=u) + F_{\mathcal P}(u)E(Y(0)|U_D=u)\right]du\\
&= \frac{1}{\delta}\int_0^1\mathrm{MTE}(u)\left(F_{\mathcal P}(u)-F_{\mathcal P_\delta}(u)\right)du. \quad (B.5)\end{aligned}$$
Taking the limit in (B.5) as $\delta\to 0$,
$$\mathrm{MPRTE} = \lim_{\delta\to 0}-\frac{1}{\delta}\int_0^1\mathrm{MTE}(u)\left(F_{\mathcal P_\delta}(u)-F_{\mathcal P}(u)\right)du = -\int_0^1\mathrm{MTE}(u)\frac{\partial F_{\mathcal P_\delta}(u)}{\partial\delta}\bigg|_{\delta=0}\,du.$$

Proof of the Equivalence of MPRTE to $\Pi_\rho$ for the Mean Functional. The proof is heuristic, and we do not claim to deal rigorously with all the mathematical issues that arise, even though the proof can be made rigorous with additional technical details.
It is hoped that the heuristic proof given here, combined with the rigorous proof for a special case given in the main text, is enough to convince a reader that the representation we establish here coincides with the existing presentation of the MPRTE. Note that
$$F_{\mathcal P_\delta}(u) = \Pr(\mathcal P_\delta\le u) = \Pr\left(F_{V|X}(\mu(Z+s_z(\delta),X)|X)\le u\right) = \int_{\mathcal W}1\left\{F_{V|X}(\mu(z+s_z(\delta),x)|x)\le u\right\}f_W(w)\,dw.$$
Let $G_\sigma$ be a smooth CDF such that, as $\sigma\to 0$, $G_\sigma(u)$ approaches the step function $1\{u\ge 0\}$ and its derivative $G'_\sigma$ approaches the Dirac delta function $G'_0$. We have
$$\begin{aligned}\lim_{\delta\to 0}\frac{1}{\delta}\left[F_{\mathcal P_\delta}(u)-F_{\mathcal P}(u)\right] &= \lim_{\delta\to 0}\int_{\mathcal W}\frac{1\left\{F_{V|X}(\mu(z,x)|x)-u>0\right\} - 1\left\{F_{V|X}(\mu(z+s_z(\delta),x)|x)-u>0\right\}}{\delta}f_W(w)\,dw\\
&= \lim_{\delta\to 0}\int_{\mathcal W}\lim_{\sigma\to 0}\frac{G_\sigma\left[F_{V|X}(\mu(z,x)|x)-u\right] - G_\sigma\left[F_{V|X}(\mu(z+s_z(\delta),x)|x)-u\right]}{\delta}f_W(w)\,dw\\
&= -\int_{\mathcal W}G'_0\left(F_{V|X}(\mu(z,x)|x)-u\right)f_{V|X}(\mu(z,x)|x)\mu_z(z,x)\frac{\partial s_z(\delta)}{\partial\delta}\bigg|_{\delta=0}f_W(w)\,dw.\end{aligned}$$
Therefore,
$$\begin{aligned}-\int_0^1\mathrm{MTE}(u,x)\lim_{\delta\to 0}\frac{1}{\delta}\left[F_{\mathcal P_\delta}(u)-F_{\mathcal P}(u)\right]du &= \int_0^1\int_{\mathcal W}\mathrm{MTE}(u,x)G'_0\left(F_{V|X}(\mu(z,x)|x)-u\right)f_{V|X}(\mu(z,x)|x)\mu_z(z,x)\frac{\partial s_z(\delta)}{\partial\delta}\bigg|_{\delta=0}f_W(w)\,dw\,du\\
&= \int_{\mathcal W}\mathrm{MTE}(P(w),x)f_{V|X}(\mu(z,x)|x)\mu_z(z,x)\frac{\partial s_z(\delta)}{\partial\delta}\bigg|_{\delta=0}f_W(w)\,dw\\
&= \int_{\mathcal W}\mathrm{MTE}(P(w),x)\dot P(w)f_W(w)\,dw\\
&= E\left\{E\left[Y(1)-Y(0)\,|\,U_D=P(W),X\right]\dot P(W)\right\} = \Pi_\rho.\end{aligned}$$
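The threshold-crossing representation used repeatedly in this appendix (in the proof of Proposition 1 and of Equation (36)) is easy to verify by simulation. Below is a minimal numerical sketch (ours, not part of the paper); the data-generating process, $Y(0) = U_D + \varepsilon_0$ and $Y(1) = -U_D + \varepsilon_1$ with standard normal errors independent of $U_D$, is invented purely for illustration:

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def trapezoid(vals, grid):
    # Simple trapezoidal rule (avoids numpy version differences in its name)
    vals, grid = np.asarray(vals, dtype=float), np.asarray(grid, dtype=float)
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid)))

rng = np.random.default_rng(1)
n, u, y = 200_000, 0.6, 0.25

# Threshold-crossing selection: D = 1{U_D <= P(w)}, propensity fixed at P(w) = u.
UD = rng.uniform(size=n)
D = UD <= u
Y0 = UD + rng.standard_normal(n)    # E[1{Y(0) <= y} | U_D = t] = Phi(y - t)
Y1 = -UD + rng.standard_normal(n)   # E[1{Y(1) <= y} | U_D = t] = Phi(y + t)
Y = np.where(D, Y1, Y0)

mc = float(np.mean(Y <= y))         # Monte Carlo E[1{Y <= y} | P(W) = u]

# Representation from the proof of Proposition 1:
#   int_0^u Phi(y + t) dt + int_u^1 Phi(y - t) dt
t1 = np.linspace(0.0, u, 2001)
t2 = np.linspace(u, 1.0, 2001)
rep = trapezoid([Phi(y + t) for t in t1], t1) + \
      trapezoid([Phi(y - t) for t in t2], t2)
```

In this toy model, differentiating the representation with respect to $u$ recovers $\Phi(y+u)-\Phi(y-u)$, which is exactly the $\mathrm{MTE}_\tau(u)$ of Proposition 1: the distributional effect for the individual at the margin $U_D = u$.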