Regression Discontinuity Design with Many Thresholds
Marinho Bertanha

This version: September 16, 2019. First version: November 7, 2014.
Abstract
Numerous empirical studies employ regression discontinuity designs with multiple cutoffs and heterogeneous treatments. A common practice is to normalize all the cutoffs to zero and estimate one effect. This procedure identifies the average treatment effect (ATE) on the observed distribution of individuals local to existing cutoffs. However, researchers often want to make inferences on more meaningful ATEs, computed over general counterfactual distributions of individuals, rather than simply the observed distribution of individuals local to existing cutoffs. This paper proposes a consistent and asymptotically normal estimator for such ATEs when heterogeneity follows a non-parametric function of cutoff characteristics in the sharp case. The proposed estimator converges at the minimax optimal rate of root-$n$ for a specific choice of tuning parameters. Identification in the fuzzy case, with multiple cutoffs, is impossible unless heterogeneity follows a finite-dimensional function of cutoff characteristics. Under parametric heterogeneity, this paper proposes an ATE estimator for the fuzzy case that optimally combines observations to maximize its precision.
Keywords: Regression Discontinuity, Multiple Cutoffs, Average Treatment Effect, Peer-effects
JEL Classification:
C14, C21, C52, I21.

1 Introduction
Applications of regression discontinuity design (RDD) have become increasingly popular in economics since the late 1990s (Black (1999), Angrist and Lavy (1999), and Van der Klaauw (2002)). One of RDD's main advantages is identification of a local causal effect under minimal functional form assumptions. More recently, with increasing availability of richer data sets, there have been many applications with multiple cutoffs and treatments (for example, Black et al. (2007), Egger and Koethenbuerger (2010), De La Mata (2012), Pop-Eleches and Urquiola (2013)). Existing one-cutoff RDD methods applied to each individual cutoff produce many local effects that are estimated using only a few observations near each cutoff. Researchers often prefer one takeaway summary effect that is more precisely estimated by pooling all the data. The meaning of a summary effect crucially depends on heterogeneity assumptions and weights imposed on the different local effects.

Applied studies with multiple cutoffs often normalize all cutoffs to zero and use the one-cutoff estimator. This normalization procedure estimates an average of local treatment effects weighted by the relative density of individuals near each of the cutoffs (Cattaneo et al. (2016), Proposition 3). Such an average effect would be a meaningful summary measure only in two cases: (i) local treatment effects are all identical and the weighting scheme does not matter; or (ii) local treatment effects are heterogeneous but the researcher is only interested in the average effect on the individuals near the existing cutoffs. However, researchers are often interested in combining observed data with assumptions weaker than (i) to make inferences on counterfactual scenarios more general than (ii).

This paper proposes a novel estimation procedure for average treatment effects (ATE). These ATEs are more valuable summary measures than the average effect estimated by the normalization procedure described above for two reasons. First, the researcher explicitly chooses the counterfactual distribution of the ATE, and this distribution may include individuals at or between existing cutoffs. Second, the researcher does not need to assume any specific functional form for the heterogeneity of treatment effects across different cutoffs. As an example of an application, suppose we are interested in estimating the effect of Medicaid benefits on health care utilization. Medicaid eligibility is triggered by income cutoffs that vary across states. Existing one-cutoff RDD methods identify the average effect on individuals with income equal to the income cutoffs. However, most interesting policy questions require the average effect over the entire range of income values in the data.

The framework for RDD with many thresholds is introduced here using a simple example based on the work of Pop-Eleches and Urquiola (2013), PU from now on. Using a wealth of variation of cutoffs from high school assignments in Romania, PU provide rigorous evidence of the impacts of school quality on student outcomes.

In a RDD setting with multiple cutoffs and treatments, it is unreasonable to expect that different local treatment effects are always identical. For example, Pop-Eleches and Urquiola (2013) find that the impact of going to a better high school on academic achievement is heterogeneous across students with different ability levels. Another example is De La Mata (2012), who finds that eligibility for Medicaid benefits decreases the probability of having private health insurance more strongly for lower income individuals.
Although I allow for heterogeneous effects across cutoffs, counterfactual analysis requires a pooling and a policy invariance assumption (Section 2).

Each student $i$ submits her score $X_i$ (forcing variable) to the central planner who, based on the entire distribution of scores, determines a minimum test score $c_j$ (cutoff) for admission to each high school $j$. The quality of high school $j$ is denoted $d_j$ (treatment dose). The RDD assignment is assumed sharp for now. That is, students attend the best high school available to them based on their score and the cutoffs that apply to them. As the test score crosses an admission threshold $c_j$, the quality of the school the student attends changes from $d_{j-1}$ to $d_j$. Local average effects are denoted by $E[Y_i(d_j) - Y_i(d_{j-1}) \mid X_i = c_j] = \beta(c_j, d_{j-1}, d_j)$, where $Y_i(d)$ is the potential academic achievement student $i$ has if attending a high school of quality $d$, and $\beta(c, d, d')$ is the treatment effect function. Heterogeneity of local effects comes from values of cutoffs and treatment doses that change across the different cutoffs. PU give a particularly illustrative application, because it exhibits sufficient variation in cutoffs and treatment doses to generate ATEs with substantially greater economic relevance than the typical average based on normalizing all of the cutoffs to zero.

Numerous other examples of RDD with multiple cutoffs and treatments exist in different fields of economics. For instance, Egger and Koethenbuerger (2010) study the effect of the size of city government councils on municipal expenditures, where council size is determined by population cutoffs. De La Mata (2012) estimates the effects of Medicaid benefits on health care utilization, where Medicaid eligibility is triggered by income cutoffs that vary across states. Agarwal et al. (2017) and De Giorgi et al. (2017) look at multiple cutoffs on credit scores, used by banks to make credit decisions. Education economics also provides a variety of applications. Angrist and Lavy (1999) and Hoxby (2000) use class size rules to estimate the impact of class size on student achievement. Hoxby (2000) utilizes variation in cutoff values from specific school district class size rules. Several researchers exploit different school starting dates to estimate the impact of educational attainment on various outcomes, for example, Dobkin and Ferreira (2010), and McCrary and Royer (2011). Duflo et al. (2011) analyze school cohorts that are split into low and high-achieving classes based on test scores, where each school has its own cutoff score. Garibaldi et al. (2012) look at different income cutoffs that determine tuition subsidies to study the impact of tuition payment on the probability of late graduation from university. In short, despite many applications with variation in cutoffs and treatment doses, a lack of theory on how to combine observations from all cutoffs impedes our ability to estimate economically-relevant average effects.

Whether local effects can be combined into an average effect depends on how comparable the researcher believes these effects are. The comparability of local treatment effects essentially depends on the heterogeneity of treatment doses and on the heterogeneity of the treatment effect function $\beta(c, d, d')$. This paper considers two types of assumptions regarding these two aspects of heterogeneity. The first heterogeneity assumption says that treatment doses are credibly quantifiable by some variable $d$.
For example, PU find behavioral evidence that average student performance at each school is a good summary measure for school quality. Another example is the case of a single treatment being triggered by varying cutoffs, as when each state has its own income threshold for Medicaid coverage. The second heterogeneity assumption specifies a parametric functional form for $\beta(c, d, d')$ guided by economic theory or a priori knowledge of the researcher. For example, in a class size application like Hoxby's (2000), a functional form based on Lazear's (2001) model of achievement can be derived as a function of class size. Another example is given by Bajari et al. (2017), who present a principal-agent model to study how insurers reimburse hospitals. The marginal reimbursement rate is discontinuous in health expenditures.

This paper proposes a consistent and asymptotically normal estimator for the ATE of a counterfactual distribution of treatment assignments specified by the researcher. A counterfactual policy scenario specifies the distribution of $(c, d, d')$, and the ATE is the integral of $\beta(c, d, d')$ weighted by such a distribution. The ability to predict effects of counterfactual policies depends crucially on assuming that the distribution of potential outcomes $Y_i(d)$ does not depend on the initial schedule of cutoff-dose values. This policy invariance assumption, along with the first heterogeneity assumption, allows the researcher to choose counterfactual distributions with support more general than the discrete set of cutoff-dose values observed in the data.

The estimator proposed in this paper approximates the ATE integral by averaging estimates of $\beta(c, d, d')$ at existing cutoffs using a proper weighting scheme. Under the first heterogeneity assumption with $\beta(c, d, d')$ non-parametric, the proposed ATE estimator is shown to be consistent and asymptotically normal. This result is novel, because estimation of the non-parametric function $\beta(c, d, d')$ is only possible at deterministic points of the domain, and that creates an additional source of bias. Asymptotic normality requires both the number of observations and cutoffs to grow to infinity, and I provide sufficient conditions on their rate of growth. I demonstrate that the minimax rate of ATE estimation in this setting is root-$n$, and that the proposed estimator attains the minimax optimal rate for a specific choice of tuning parameters. This extends the previous literature on minimax optimality of non-parametric estimation of regression functions at a boundary point to estimation of averages of these regression functions.

Many applications of RDD with multiple cutoffs are, in fact, fuzzy rather than sharp. In the high school assignment example, a student may choose to attend a high school other than the school she is originally eligible to attend. Multiple treatments result in multiple compliance behaviors, and one-cutoff identification results do not apply. Building on classic definitions of compliance behaviors (Imbens and Rubin (1997)), I define compliance groups in terms of changes in treatment eligibility and receipt. "Ever-compliers" are those whose treatment received changes if and only if it changes to the treatment dose for which they become eligible.
I assume that individuals never change into a treatment dose different from the dose of eligibility, a "no-defiance" condition. In the high school example, if the test score of a student currently in school B increases so as to grant her access to school A, no-defiance implies she either chooses to attend school A or stays at school B, and that she is not triggered to attend some other school C.

This paper shows that even local identification in fuzzy RDD with finite multiple treatments is impossible unless the class of treatment effect functions of ever-compliers is restricted to a finite-dimensional class. Important empirical analyses of fuzzy RDD with multiple treatments include those of Angrist and Lavy (1999), Chen and Van der Klaauw (2008), and Hoekstra (2009); nevertheless, this is the first paper to define compliance and study causal identification in a general framework for multi-cutoff fuzzy RDD. This framework lays out conditions for the interpretation of two-stage least squares (2SLS) estimates in applications of multi-cutoff fuzzy RDD, a common practice in applied work. The second heterogeneity assumption states that the treatment effect function is of a parametric class. This assumption allows for consistent and asymptotically normal estimation of ATEs on ever-compliers. It also results in efficiency gains, because observations are optimally combined across cutoffs to minimize the mean squared error (MSE) of the ATE estimator.

The rapid growth in the number of applications of RDD in economics in the late 1990s was accompanied by substantial theoretical contributions for inference in the one-cutoff case. Identification and estimation in the sharp and fuzzy cases were formalized by Hahn et al. (2001). Fan and Gijbels (1996) and Porter (2003) demonstrated low-order bias and rate optimality of the local polynomial estimator. Recent theoretical contributions have addressed the optimal bandwidth choice (Imbens and Kalyanaraman (2012)), alternative asymptotic approximations with better finite sample properties (Calonico et al. (2014)), quantile treatment effects (Frandsen et al. (2012)), kink treatment effects (Dong (2018b)), and the difficulty of uniform inference (Bertanha and Moreira (2019)).

The contribution of this paper is more closely related to the study of treatment effect extrapolation of Angrist (2004), Bertanha and Imbens (2019), Dong and Lewbel (2015), Angrist and Rokkanen (2015), and Rokkanen (2015). These last two authors use observations on additional covariates. They restrict the heterogeneity of treatment effects after conditioning on these covariates to obtain identification away from the cutoff. This paper differs from these other contributions because the variation of multiple cutoffs and doses identifies ATEs over distributions of individuals both between and at cutoffs, without additional covariates.

The remainder of this paper is organized as follows. Section 2 presents the notation and lays out basic assumptions. Section 3 describes the ATE estimator for the sharp case and proves asymptotic normality. It is divided into two sub-sections. Section 3.1 treats ATEs of discrete counterfactual distributions, which is a straightforward generalization of one-cutoff RDD. Section 3.2 is novel; it studies ATEs of continuous counterfactual distributions under the first heterogeneity assumption. Section 4 analyzes the fuzzy case. Appendix A contains all proofs. Supplemental Appendix B (available online at ∼mbertanh) collects auxiliary results to the proofs in Appendix A.
2 Setup

This section sets up the framework for RDD with multiple cutoffs. There are $P$ sub-populations
of individuals indexed by $p = 1, \ldots, P$. An example of a sub-population may be a town-year in the high school application, or a state in the Medicaid example. Each individual $i$ in sub-population $p$ is fully characterized by a vector of random variables $(X_{i,p}, U_{i,p})$ drawn iid across $i$ from each sub-population. The forcing variable $X_{i,p}$ is a scalar score that governs eligibility for treatment, and it lives in a compact interval $\mathcal{X} = [\underline{X}, \overline{X}]$; $U_{i,p}$ is a vector of unobserved heterogeneity. Individual $(i, p)$ receives a treatment dose $D_{i,p}$ from a set of possible treatments $\mathcal{D}$. The outcome variable $Y_{i,p}$ is determined by a function $Y$ of the individual characteristics and treatment,

$Y_{i,p} = Y(X_{i,p}, D_{i,p}, U_{i,p})$. (1)

I start with the simpler sharp RDD setting and defer the fuzzy RDD case to Section 4. In the sharp case, the treatment received by the individual is a deterministic function of the forcing variable. For an individual with forcing variable $X_{i,p}$ close to a cutoff $c$, the treatment dose is $d$ if $X_{i,p} < c$, or $d'$ if $X_{i,p} \geq c$. Hahn et al. (2001) demonstrate that continuity of the conditional mean of outcomes is sufficient to identify average causal effects for individuals local to the cutoff $c$.
Lemma 1. Assume that $E[Y(X_{i,p}, d, U_{i,p}) \mid X_{i,p} = x]$ is a continuous function of $x$ for the treatment doses $d$ and $d'$ in the neighborhood of the cutoff $c$. Then, the average causal effect for individuals with $X_{i,p} = c$ is identified:

$E[Y(X_{i,p}, d', U_{i,p}) - Y(X_{i,p}, d, U_{i,p}) \mid X_{i,p} = c] = \lim_{e \downarrow 0} \{ E[Y_{i,p} \mid X_{i,p} = c + e] - E[Y_{i,p} \mid X_{i,p} = c - e] \}$. (2)

Lemma 1 generalizes to the case of multiple cutoffs and treatments under the assumption of continuity of $E[Y(X_{i,p}, d, U_{i,p}) \mid X_{i,p} = x]$ as a function of $x$ for every $d \in \mathcal{D}$. Many cutoffs arise because data sets may have many sub-populations with few cutoffs (e.g., Medicaid benefits with one cutoff per state, many states), or few sub-populations with many cutoffs (e.g., Romanian high schools with one town and many schools). The ability to exploit variation in cutoff-dose values relies on the following pooling assumption.

Assumption 1 (Pooling). For any $\{d, d'\} \subset \mathcal{D}$, the conditional expectation $E[Y(X_{i,p}, d', U_{i,p}) - Y(X_{i,p}, d, U_{i,p}) \mid X_{i,p} = x]$ as a function of $x$ does not depend on $p$.

Assumption 1 does not restrict average outcomes to be the same across different sub-populations. It is less restrictive than common specifications for pooling data in applied work, for example, time trends and sub-population fixed effects. The pooling assumption says that individuals with the same forcing variable that undergo the same change in treatment have the same average response across different sub-populations. The rest of the paper builds on Assumption 1, and it becomes irrelevant to distinguish sub-populations. Thus, I drop the subscript $p$ and focus on the case of one population with multiple cutoffs.

The cutoffs are ordered such that $c_1 < c_2 < \ldots < c_K$. Sharp RDD means that an individual with forcing variable $X_i$ is deterministically assigned to a treatment dose $D_i = D(X_i)$ according to the following rule:

$D(x) = d_0$ if $c_0 \leq x < c_1$; $d_1$ if $c_1 \leq x < c_2$; $\ldots$; $d_K$ if $c_K \leq x \leq c_{K+1}$, (3)

where $c_0 = \underline{X}$ and $c_{K+1} = \overline{X}$. Each cutoff is characterized by three variables: the scalar threshold $c_j$; the treatment dose $d_{j-1}$ the individual receives if $c_{j-1} \leq X_i < c_j$; and the treatment dose $d_j$ the individual receives if $c_j \leq X_i < c_{j+1}$. Let $\mathbf{c}_j = (c_j, d_{j-1}, d_j)$. The schedule of cutoffs and treatment doses is given by the non-random set $\mathcal{C}_K = \{\mathbf{c}_j\}_{j=1}^K$. The richness of set $\mathcal{C}_K$ increases as the researcher collects more data.

The data generating process is summarized as follows. Values for the forcing variable $X_i$ and heterogeneity $U_i$ are drawn iid, $i = 1, \ldots, n$, from a joint distribution. Given $D(x)$, these $n$ individuals are assigned to different treatment doses $D_i = D(X_i)$. The observed outcome is determined by $Y_i = Y(X_i, D_i, U_i)$. The econometrician observes the schedule of cutoffs and treatment doses $D(x)$ and $(Y_i, X_i, D_i)$ for $i = 1, \ldots, n$. Following Rubin's model of potential outcomes, let $Y_i(d) = Y(X_i, d, U_i)$, and assume continuity of $E[Y_i(d) \mid X_i = x]$ for every $d \in \mathcal{D}$. A simple extension of Lemma 1 identifies average effects at every cutoff $\mathbf{c} \in \mathcal{C}_K$:

$\beta(\mathbf{c}) = E[Y_i(d') - Y_i(d) \mid X_i = c] = \lim_{e \downarrow 0} \{ E[Y_i \mid X_i = c + e] - E[Y_i \mid X_i = c - e] \}$. (4)

Data with multiple cutoff-dose values allow the researcher to learn the causal effect of a variety of dose changes applied to individuals at various levels of the forcing variable.
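As a concrete illustration of the assignment rule in Equation 3, the following minimal Python sketch (with hypothetical cutoff and dose values, not from the paper) maps a forcing variable into its treatment dose:

```python
import numpy as np

def assigned_dose(x, cutoffs, doses):
    """Sharp assignment rule D(x) of Equation 3: doses[j] applies on
    [c_j, c_{j+1}), with doses[0] assigned below the first cutoff."""
    # searchsorted with side='right' counts how many cutoffs are <= x
    return np.asarray(doses)[np.searchsorted(cutoffs, x, side="right")]

# Hypothetical schedule: two cutoffs, three doses
cutoffs = np.array([5.0, 7.5])
doses = np.array([1.0, 2.0, 3.0])
print(assigned_dose(np.array([4.9, 5.0, 8.2]), cutoffs, doses))  # [1. 2. 3.]
```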
This fact opens the possibility of using observed data to estimate the effect of new policy changes. The individual response function $Y$ may well depend on the initial assignment of treatments $D_i$, and it could potentially change under counterfactual policies. Unless such dependence is restricted, it becomes impossible to use existing data to infer the effect of new policies. The remainder of this paper relies on the following policy-invariance assumption.

Assumption 2 (Policy Invariance). Regardless of the distribution of $(X_i, D_i, U_i)$ in a counterfactual policy, individual outcomes are always generated by a fixed response function $Y$, that is, $Y_i = Y(X_i, D_i, U_i)$.

The validity of the RDD depends crucially on exogeneity of cutoffs and no manipulation of the forcing variable $X$ by individuals. See McCrary (2008) for a test of forcing variable manipulation. Bajari et al. (2017) present a modified RDD estimator that is consistent under forcing variable manipulation in a class of structural models.

Consider a counterfactual policy in which each individual $i$ is assigned to a change in treatment dose from $D^*_i$ to $D^{**}_i$, where the distribution of $(D^*_i, D^{**}_i)$ is independent of $U_i$ after conditioning on $X_i$. Under Assumption 2, the average causal effect of such an experiment is

$\mu = E[Y(X_i, D^{**}_i, U_i) - Y(X_i, D^*_i, U_i)]$
$\phantom{\mu} = E\{ E[ Y(X_i, D^{**}_i, U_i) - Y(X_i, D^*_i, U_i) \mid D^{**}_i, D^*_i, X_i ] \}$
$\phantom{\mu} = E[ \beta(X_i, D^*_i, D^{**}_i) ]$, (5)

where the last equality uses the conditional independence of $(D^*_i, D^{**}_i)$ and $U_i$ given $X_i$, along with the definition of $\beta$ in Equation 4. The average effect $\mu$ equals an average of the $\beta$ function over the counterfactual distribution of $(X_i, D^*_i, D^{**}_i)$. The inference methods of this paper first identify $\beta$ from RDD with many cutoffs, then identify the average of $\beta$ under a counterfactual distribution pre-specified by the researcher. In a similar setting, Cattaneo et al. (2016) study identification under conditions equivalent to Assumptions 1 and 2 (respectively, their Assumptions 5a and 5b).

The definition of $\mu$ captures both the direct effect of changing $D$ and the composition effect of a change in the distribution of $D$ conditional on $X$. To investigate the direct effects of $D$, Rothe (2012) proposes methods for inference on partial policy effects that keep the distribution of ranks of $(D, X)$ unchanged, thus controlling for composition effects. Although not the focus of this paper, Rothe's methods may be combined with the RDD identification strategy to study partial policy effects.
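To make Equation 5 concrete, here is a minimal sketch that approximates $\mu$ by averaging $\beta$ over draws of $(X_i, D^*_i, D^{**}_i)$; the $\beta$ function and the counterfactual policy below are hypothetical placeholders, assuming $\beta$ has already been identified:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical treatment effect function beta(c, d, d') for illustration only
def beta(c, d, d_prime):
    return 0.5 * (d_prime - d) * (1.0 + 0.1 * c)

# Counterfactual policy: (X, D*, D**) drawn independently of U given X
n = 100_000
X = rng.uniform(0.0, 10.0, n)            # forcing variable
D_star = np.where(X < 5.0, 1.0, 2.0)     # status-quo doses
D_star2 = D_star + 1.0                   # policy raises every dose by one unit

# Equation 5: mu = E[ beta(X, D*, D**) ]
mu = beta(X, D_star, D_star2).mean()
print(mu)  # approximately 0.5 * (1 + 0.1 * 5) = 0.75
```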
3 Sharp RDD

This section investigates estimation and inference of averages of the non-parametric function $\beta$ under sharp RDD with many cutoffs. First, I treat the case of qualitative treatment doses. This is a straightforward extension of single-cutoff RDDs, which identifies ATEs of discrete counterfactual distributions with support contained in $\mathcal{C}_K$. Second, I treat the case of quantitative treatment doses, that is, the first heterogeneity assumption. Substantial variation in cutoff-dose values allows for novel methods that estimate ATEs with support more general than $\mathcal{C}_K$.

3.1 Discrete Counterfactual Distributions

Consider applications of RDD where the treatment dose variable has a qualitative nature and is not credibly summarized by a real-valued metric. For example, Hastings et al. (2013) study the assignment of students into different degree programs in universities in Chile. There are multiple cutoffs on a test score, but different cutoffs switch students to completely different programs, e.g., physics, engineering, economics, etc. This limits the ability to combine local effects across cutoffs, which restricts ATEs to counterfactual distributions with discrete support contained in $\mathcal{C}_K$. In this section, it is not possible to identify effects of policies that place weight on cutoff-dose combinations $(c, d, d')$ that are not in $\mathcal{C}_K$.

The focus is on discrete counterfactual distributions with probability mass function $\omega^d(\mathbf{c})$, where $\omega^d_j = \omega^d(\mathbf{c}_j)$ for every $j$. For example, in the high school assignment application, a new policy may reallocate students with test scores marginally across the existing cutoffs. The weight $\omega^d_j$ represents the probability mass of students with test score equal to $c_j$ that undergo a change in school quality from $d_{j-1}$ to $d_j$ in the reallocation policy.

The parameter of interest is the average effect on these students, which is a weighted average of local effects at the existing cutoffs:

$\mu^d = \sum_{j=1}^K \omega^d_j \, \beta(\mathbf{c}_j)$.

(The common practice of normalizing all cutoffs to zero and estimating only one effect produces an estimator consistent for $\mu^d$ with weights $\omega^d_j = f(c_j)/\sum_l f(c_l)$, where $f$ is the probability density function of $X$.)

Identification follows from Equation 4. Estimation is conducted in two steps. The first step uses local polynomial regressions (LPR) near each cutoff $c_j$ to non-parametrically estimate

$B_j = \lim_{e \downarrow 0} \{ E[Y_i \mid X_i = c_j + e] - E[Y_i \mid X_i = c_j - e] \}$. (6)

The researcher chooses a bandwidth parameter $h_j > 0$, a kernel function $k(\cdot)$, and the order of the polynomial regression $\rho \in \mathbb{Z}_+$. A polynomial in $X$ is fitted on each side of the cutoff, and the estimator $\hat{B}_j$ is the difference between the intercepts of these two polynomial regressions:

$\hat{B}_j = \hat{a}^+_j - \hat{a}^-_j$ (7)

$(\hat{a}^+_j, \hat{b}^+_j) = \operatorname{argmin}_{(a, b)} \sum_{i=1}^n k\!\left(\frac{X_i - c_j}{h_j}\right) v^{j+}_i \left[ Y_i - a - b_1 (X_i - c_j) - \ldots - b_\rho (X_i - c_j)^\rho \right]^2$ (8)

$(\hat{a}^-_j, \hat{b}^-_j) = \operatorname{argmin}_{(a, b)} \sum_{i=1}^n k\!\left(\frac{X_i - c_j}{h_j}\right) v^{j-}_i \left[ Y_i - a - b_1 (X_i - c_j) - \ldots - b_\rho (X_i - c_j)^\rho \right]^2$ (9)

where

$v^{j+}_i = I\{c_j \leq X_i < c_j + h_j\}$, $v^{j-}_i = I\{c_j - h_j < X_i < c_j\}$, (10)

and $b = (b_1, \ldots, b_\rho)$. The estimator $\hat{B}_j$ uses observations with $X_i$ in the estimation window $[c_j - h_j, c_j + h_j]$. The choice of bandwidths may allow the windows to overlap at consecutive cutoffs. However, it must be the case that $c_j + h_j < c_{j+1}$ and $c_j \leq c_{j+1} - h_{j+1}$ for $j = 1, \ldots, K-1$, so that $Y_i = Y_i(d_j)$ for $X_i \in [c_j, c_j + h_j]$, and $Y_i = Y_i(d_{j-1})$ for $X_i \in [c_j - h_j, c_j)$.
In the second step, the researcher averages out $\hat{B}_j$ to obtain the estimator $\hat{\mu}^d$:

$\hat{\mu}^d = \sum_{j=1}^K \omega^d_j \hat{B}_j$. (11)

(This is the first-step estimation procedure for one sub-population with $K$ cutoffs. In many settings, the data have many sub-populations $p = 1, \ldots, P$ with one or more cutoffs $j = 1, \ldots, K(p)$ in each sub-population. In that case, the researcher first estimates $\hat{B}_{j,p}$ for every $j$ in each sub-population $p$. Then, Assumption 1 allows for pooling of $\hat{B}_{j,p}$ across $p$ in the second step.)

For the case of one cutoff, Hahn et al. (2001) and Porter (2003) derive the asymptotic normal distribution of the LPR estimator $\hat{B}_j$. I build on their arguments to derive the asymptotic distribution of $\hat{\mu}^d$ under the assumptions listed below.
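To fix ideas, here is a minimal sketch of the two-step estimator in Equations 7-11, using the edge kernel; the data, bandwidths, and weights are placeholder inputs, and the code is an illustration under the stated assumptions rather than the paper's implementation:

```python
import numpy as np

def edge_kernel(u):
    # Edge kernel k(u) = 1{|u| <= 1} (1 - |u|)
    return np.where(np.abs(u) <= 1.0, 1.0 - np.abs(u), 0.0)

def lpr_intercept(Y, X, c, h, rho, side):
    """Weighted local polynomial fit on one side of cutoff c;
    returns the intercept a-hat of Equations 8-9."""
    v = (X >= c) & (X < c + h) if side == "+" else (X > c - h) & (X < c)
    w = edge_kernel((X[v] - c) / h)
    Z = np.vander(X[v] - c, N=rho + 1, increasing=True)   # [1, (X-c), ...]
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * Z, sw * Y[v], rcond=None)
    return coef[0]

def B_hat(Y, X, c, h, rho=1):
    """First-step RD jump estimate at cutoff c (Equation 7)."""
    return (lpr_intercept(Y, X, c, h, rho, "+")
            - lpr_intercept(Y, X, c, h, rho, "-"))

def mu_d_hat(Y, X, cutoffs, bandwidths, weights, rho=1):
    """Second step (Equation 11): weighted average of the B_j-hat."""
    return sum(w * B_hat(Y, X, c, h, rho)
               for c, h, w in zip(cutoffs, bandwidths, weights))
```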
Assumption 3. The kernel density function $k: \mathbb{R} \to \mathbb{R}$ is symmetric around zero, has compact support $[-M, M]$ for some $M \in (0, \infty)$, and is Lipschitz continuous.
Assumption 4. (a) The distribution of $X_i$ has a probability density function $f(x)$ that is continuous and has bounded support $\mathcal{X} = [\underline{X}, \overline{X}]$; (b) $f(x)$ is differentiable with bounded derivative $\nabla_x f(x)$.
Assumption 5. Let $\rho \in \mathbb{Z}_+$ be the order of the first-step LPR. For arbitrary $d \in \mathcal{D}$: (a) $R(x, d) = E[Y_i(d) \mid X_i = x]$ is $\rho + 1$ times continuously differentiable wrt $x$; its $(\rho+1)$-th partial derivative wrt $x$ is denoted as $\nabla^{\rho+1}_x R(x, d)$; (b) $\sigma^2(x, d) = V[Y_i(d) \mid X_i = x]$, where $V$ is the variance operator; $\sigma^2(x, d)$ is continuously differentiable wrt $x$; its partial derivative wrt $x$ is denoted as $\nabla_x \sigma^2(x, d)$; $\sigma^2(x, d)$ is bounded away from zero, and $E[\,|Y_i(d) - R(X_i, d)|^3 \mid X_i]$ is bounded.
Theorem 1. Suppose Assumptions 3-5 hold. Let $\underline{h} = \min_j h_j$ and $\bar{h} = \max_j h_j$. As $n \to \infty$, assume that $\bar{h} \to 0$, $\bar{h}/\underline{h} = O(1)$, $n\underline{h} \to \infty$, and $(n\bar{h})^{1/2} \bar{h}^{\rho+1} = O(1)$. Then,

$\dfrac{\hat{\mu}^d - B^d_n - \mu^d}{(V^d_n)^{1/2}} \overset{d}{\to} N(0, 1)$,

where the bias $B^d_n$ and variance $V^d_n$ terms are characterized as follows:

$B^d_n = \dfrac{1}{(\rho+1)!} \sum_{j=1}^K \omega^d_j \, h_j^{\rho+1} f(c_j) \left[ \nabla^{\rho+1}_x R(c_j, d_j) \, e_1' \, E[G^{j+}_n]^{-1} - \nabla^{\rho+1}_x R(c_j, d_{j-1}) \, e_1' \, E[G^{j-}_n]^{-1} \right] \gamma^*$ (12)

$V^d_n = n \, E\!\left[ \varepsilon_i^2 \left( \sum_{j=1}^K \dfrac{\omega^d_j}{n h_j} \, k\!\left(\dfrac{X_i - c_j}{h_j}\right) e_1' \left( v^{j+}_i E[G^{j+}_n]^{-1} - v^{j-}_i E[G^{j-}_n]^{-1} \right) \tilde{H}^j_i \right)^{\!2} \right]$, (13)

with $\varepsilon_i = Y_i - E[Y_i \mid X_i]$; $H(u) = [1, u, u^2, \ldots, u^\rho]'$ is a $(\rho+1) \times 1$ vector-valued function; $\tilde{H}^j_i = H(h_j^{-1}(X_i - c_j))$; and $G^{j\pm}_n = (n h_j)^{-1} \sum_{i=1}^n v^{j\pm}_i \, k(h_j^{-1}(X_i - c_j)) \, \tilde{H}^j_i \tilde{H}^{j\prime}_i$ is a $(\rho+1) \times (\rho+1)$ matrix; $v^{j\pm}_i$ are defined in Equation 10; $\gamma^* = [\gamma_{\rho+1} \ldots \gamma_{2\rho+1}]'$, for $\gamma_d = \int k(u) u^d \, du$; and $e_1$ is the $(\rho+1) \times 1$ vector with one in its first coordinate and zero otherwise. Furthermore, $(V^d_n)^{-1/2} = O\big((n\bar{h})^{1/2}\big)$, and $(V^d_n)^{-1/2} B^d_n = O_P\big((n\bar{h})^{1/2} \bar{h}^{\rho+1}\big)$.

The variance of $\hat{\mu}^d$ is consistently estimated by

$\hat{V}^d_n = \sum_{i=1}^n \hat{\varepsilon}_i^2 \left( \sum_{j=1}^K \dfrac{\omega^d_j}{n h_j} \, k\!\left(\dfrac{X_i - c_j}{h_j}\right) e_1' \left( v^{j+}_i (G^{j+}_n)^{-1} - v^{j-}_i (G^{j-}_n)^{-1} \right) \tilde{H}^j_i \right)^{\!2}$. (14)

The squared residuals $\hat{\varepsilon}_i^2$ are computed by a nearest-neighbor matching estimator, as suggested by Calonico et al. (2014) (CCT from now on):

$\hat{\varepsilon}_i^2 = \dfrac{3}{4} \left( Y_i - \dfrac{1}{3} \sum_{l=1}^3 Y_{\ell(i,l)} \right)^{\!2}$, (15)

and $\ell(i, l)$ is the index of the $l$-th closest $X$ to $X_i$ that lies within the same cutoffs $c_j$ and $c_{j+1}$ that $X_i$ does. CCT's Theorem A3 demonstrates that $\hat{V}^d_n / V^d_n \overset{p}{\to} 1$.

If $(V^d_n)^{-1/2} B^d_n$ differs from zero asymptotically, then inference must be done using a bias-corrected estimator. A practical way of doing bias correction is to increase the order of the polynomial from $\rho$ to $\rho + 1$ and compute $\hat{\mu}^{d\prime}$ and $\hat{V}^{d\prime}_n$ using the same bandwidth choices as $\hat{\mu}^d$ and $\hat{V}^d_n$. It follows that $(\hat{V}^{d\prime}_n)^{-1/2} (\hat{\mu}^{d\prime} - \mu^d) \overset{d}{\to} N(0, 1)$.

When the estimation windows of consecutive cutoffs overlap, that is, $c_j + h_j > c_{j+1} - h_{j+1}$, the estimator $\hat{B}_j$ uses some of the same observations that the estimator $\hat{B}_{j+1}$ does. (Following Equation (7), $COV(\hat{B}_j, \hat{B}_{j+1}) = COV(\hat{a}^+_j - \hat{a}^-_j,\, \hat{a}^+_{j+1} - \hat{a}^-_{j+1}) = COV(\hat{a}^+_j, -\hat{a}^-_{j+1}) < 0$, because $\hat{a}^+_j$ and $\hat{a}^-_{j+1}$ use some of the same observations in the case of overlap.) In theory, a finite number of cutoffs with shrinking bandwidths leads to non-overlapping estimation windows in large samples. As a consequence, the asymptotic variance of $\sqrt{n\bar{h}}\,(\hat{\mu}^d - B^d_n - \mu^d)$ may not approximate its finite-sample variance well in case of overlap. Instead, the variance term in (13) takes overlap into account because its formula is constructed based on the finite-sample variance.
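A minimal sketch of the nearest-neighbor residuals in Equation 15, matching three neighbors within the pair of cutoffs that bracket each observation (a sketch under the stated assumptions, not the paper's code):

```python
import numpy as np

def nn_residuals_sq(Y, X, cutoffs, J=3):
    """Squared residuals of Equation 15: for each i, average the J
    nearest neighbors of X_i lying between the same two cutoffs."""
    edges = np.concatenate(([-np.inf], cutoffs, [np.inf]))
    segment = np.searchsorted(edges, X, side="right")
    eps_sq = np.empty(len(Y), dtype=float)
    for i in range(len(Y)):
        same = np.flatnonzero((segment == segment[i])
                              & (np.arange(len(Y)) != i))
        nn = same[np.argsort(np.abs(X[same] - X[i]))[:J]]
        eps_sq[i] = (J / (J + 1)) * (Y[i] - Y[nn].mean()) ** 2
    return eps_sq
```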
In practice, implementation of $\hat{\mu}^d$ requires the researcher to choose bandwidths $h_j > 0$, the polynomial order $\rho \in \mathbb{Z}_+$, and a kernel density function $k(\cdot)$. In the one-cutoff case, common choices in applied work include the edge kernel $k(u) = I\{|u| \leq 1\}(1 - |u|)$, local linear regression ($\rho = 1$), and a bandwidth choice that minimizes the mean squared error (MSE) of estimation. Recent work by Imbens and Kalyanaraman (2012) (IK from now on) provides a practical data-driven rule for choosing the bandwidth in the case of one cutoff. With multiple cutoffs, an interesting aspect of the optimal bandwidth problem is the variance reduction from overlapping estimation windows. A formal investigation of optimal bandwidths in the multi-cutoff case is deferred to future work. A simple recommendation to implement Theorem 1 is to use the IK bandwidth based on local linear regression ($\rho = 1$) at each cutoff. Then, use local quadratic regression ($\rho = 2$) with the edge kernel and the same bandwidths as before to compute the consistent bias-corrected estimator $\hat{\mu}^{d\prime}$ and its variance $\hat{V}^{d\prime}_n$.
COV ( b B j , b B j +1 ) = COV ( b a + j − b a − j , b a + j +1 − b a − j +1 ) = COV ( b a + j , − b a − j +1 ) < b a + j and b a − j +1 use some of the same observations in the case of overlap. ρ = 2) with the edge kernel and the same bandwidths asbefore to compute the consistent bias-corrected estimator b µ d ′ and its variance b V d ′ n . Calonico et al.(2018) propose shrinking MSE-optimal bandwidths as a rule of thumb to improve finite samplecoverage of confidence intervals. As means of a robustness check, the researcher may shrink the IKbandwidths by multiplying them by n − / , and examine the resulting confidence intervals (Section4.1, Calonico et al. (2018)). The first heterogeneity assumption allows the researcher to identify counterfactual ATEs withsupport more general than C K . An empirical application satisfies the first heterogeneity assumptionif the treatment dose is credibly quantifiable in a real-valued variable d . For example, in the highschool assignment of PU, the treatment dose is a quality measure for each school. Possible measuresof school quality include the average test score of peers, the average number of teachers, or fundingper student. An infinite amount of data gives rise to a countably-infinite set of cutoff-dose values C ∞ . In terms of the high school assignment example, a large number of towns and years producesubstantial variation in cutoff-dose values. Define C to be the convex hull of C ∞ . If variation incutoff-dose values is sufficiently rich, then ATEs with counterfactual distributions supported in C are identified (Lemma 2).I focus on scalar treatment doses d and counterfactual distributions with continuous probabilitydensity function ω c ( c ). Minor changes to the setup can accommodate multivariate d and discreteor mixed counterfactual distributions. The ATE is defined as µ c = Z C ω c ( c ) β ( c ) d ( c ) . (16) Lemma 2.
Lemma 2. Assume that an infinite amount of data has sufficient variation such that (i) $\mathcal{C}_\infty$ is dense in its convex hull $\mathcal{C}$; and that (ii) $\beta(\mathbf{c})$ is a continuous function over $\mathcal{C}$. Then, $\mu^c$ is identified.

The researcher may impose further heterogeneity restrictions to reduce the dimension of $\beta(\mathbf{c})$ and increase the set of possible counterfactual distributions. For instance, linear returns to school quality say that $\beta(c, d, d')$ depends on $(c, d' - d)$ instead of $(c, d, d')$. This implies that $\beta(\mathbf{c}) = \phi(c)(d' - d)$ for a smooth function $\phi(c)$, and changes the dimension of set $\mathcal{C}_K$. See Figure 2 in Section 6 for an empirical illustration. Medicaid coverage is an example of a binary treatment that is triggered by various income cutoffs across states. In the case of binary treatment, the treatment effect function depends only on the cutoff value, that is, $\beta(c, d, d') = \phi(c)$. Identification of averages of $\beta(\mathbf{c})$ requires identification of averages of $\phi(c)$, which relies on infinitely many cutoff values that cover a compact interval on the real line. For example, such variation identifies the average effect of giving Medicaid benefits to an entire neighborhood of individuals within the range of income cutoffs seen in the data. (For the Medicaid example, De La Mata (2012) has many income cutoffs that differ by state, age, and year. De La Mata's Table I suggests variation between US$ … and US$ …, so that identification of averages of $\beta$ is possible for a range of dose changes at a few cutoff values.)

The parameter $\mu^c$ is estimated in two steps. The first step is identical to the procedure described in Equations 7-9. That is, LPRs produce estimates $\hat{B}_j$, $j = 1, \ldots, K$. The second step computes a weighted average of the first-step estimates, using specially designed weights $\{\Delta_j\}_{j=1}^K$ that I call "correction weights":

$\hat{\mu}^c = \sum_{j=1}^K \Delta_j \hat{B}_j$. (17)

Unlike the intuition of the discrete case, the correction weight $\Delta_j$ is not necessarily equal or proportional to $\omega^c_j = \omega^c(\mathbf{c}_j)$. An analytical expression for $\Delta_j$ is given below in Equation 22, and it is constructed as follows. The correction weight $\Delta_j$ is the contribution of estimate $\hat{B}_j$ to the integral $\int_{\mathcal{C}} \omega^c(\mathbf{c}) \hat{\beta}(\mathbf{c}) \, d(\mathbf{c})$, where $\hat{\beta}(\mathbf{c})$ is a non-parametric estimate of $\beta(\mathbf{c})$. A weighted regression of $\hat{B}_j$ on polynomial functions of $\mathbf{c}_j$ centered at $\mathbf{c}$ produces the estimate $\hat{\beta}(\mathbf{c})$. The researcher specifies the order of the polynomials $\rho_2 \in \mathbb{Z}_+$ and a bandwidth $h_2 > 0$ for every $\mathbf{c} \in \mathcal{C}$. The estimate $\hat{\beta}(\mathbf{c})$ is the intercept of the following weighted least squares regression:

$\hat{\eta} = \operatorname{argmin}_\eta \; \big( \hat{B} - E(\mathbf{c}) \eta \big)' \, \Omega(\mathbf{c}; h_2) \, \big( \hat{B} - E(\mathbf{c}) \eta \big)$ (18)

where

$\hat{B} = [\hat{B}_1, \ldots, \hat{B}_K]'$ is a $K \times 1$ vector; (19)

$\Omega(\mathbf{c}; h_2) = \operatorname{diag}\{\Omega_j(\mathbf{c}; h_2)\}_{j=1}^K$ is a $K \times K$ matrix, with (20)
$\Omega_j(\mathbf{c}; h_2) = k\!\left(\dfrac{c_j - c}{h_2}\right) k\!\left(\dfrac{d_{j-1} - d}{h_2}\right) k\!\left(\dfrac{d_j - d'}{h_2}\right)$;

$E(\mathbf{c}) = [E_1(\mathbf{c}), \ldots, E_K(\mathbf{c})]'$ is a $K \times J$ matrix, where (21)
$E_j(\mathbf{c})$ is a $J \times 1$ vector with all polynomials of the form
$p_\gamma(\mathbf{c}_j - \mathbf{c}) = (c_j - c)^{\gamma_1} (d_{j-1} - d)^{\gamma_2} (d_j - d')^{\gamma_3}$
for $\gamma = (\gamma_1, \gamma_2, \gamma_3) \in \mathbb{Z}^3_+$, $\gamma_1 + \gamma_2 + \gamma_3 \leq \rho_2$, $\min\{\gamma_2, \gamma_3\} = 0$,
$J = 2\,\dfrac{(\rho_2 + 2)!}{2! \, \rho_2!} - (\rho_2 + 1)$, where $!$ denotes factorial,
and the first element of $E_j(\mathbf{c})$ is $1$.
The expression for $\Delta_j$ comes from integrating $\omega^c(\mathbf{c}) \hat{\beta}(\mathbf{c})$:

$\displaystyle \int_{\mathcal{C}} \omega^c(\mathbf{c}) \hat{\beta}(\mathbf{c}) \, d\mathbf{c} = \int_{\mathcal{C}} \omega^c(\mathbf{c}) \, e_1' \big( E(\mathbf{c})' \Omega(\mathbf{c}; h_2) E(\mathbf{c}) \big)^{-1} \sum_j \Omega_j(\mathbf{c}; h_2) E_j(\mathbf{c}) \hat{B}_j \, d(\mathbf{c})$
$\displaystyle = \sum_j \int_{\mathcal{C}} \omega^c(\mathbf{c}) \, e_1' \big( E(\mathbf{c})' \Omega(\mathbf{c}; h_2) E(\mathbf{c}) \big)^{-1} \Omega_j(\mathbf{c}; h_2) E_j(\mathbf{c}) \, d(\mathbf{c}) \, \hat{B}_j$
$\displaystyle = \sum_j \underbrace{\int_{\mathcal{C}} \omega^c(\mathbf{c}) \, \frac{\det\big( E(\mathbf{c})' \Omega(\mathbf{c}; h_2) E_{1 \leftarrow e_j}(\mathbf{c}) \big)}{\det\big( E(\mathbf{c})' \Omega(\mathbf{c}; h_2) E(\mathbf{c}) \big)} \, d(\mathbf{c})}_{\equiv \, \Delta_j} \, \hat{B}_j$ (22)
$\displaystyle = \sum_j \Delta_j \hat{B}_j$, (23)

where the third equality uses Cramer's rule, and $E_{1 \leftarrow e_j}(\mathbf{c})$ is a $K \times J$ matrix equal to $E(\mathbf{c})$ except for the first column, which is replaced by the $K \times 1$ vector $e_j$. The vector $e_j$ has one in its $j$-th entry and zero otherwise.

The main contribution of this paper concerns inference on $\mu^c$, where $\beta(\mathbf{c})$ is estimated non-parametrically and then averaged across cutoffs. This is not the first paper to study estimation of averages of non-parametric functions; for example, see Newey (1994). The novelty here is that the non-parametric estimation step only occurs at $K$ fixed boundary points $\mathbf{c}_j$. A necessary condition for consistency of $\hat{\mu}^c$ is an "infill" type of asymptotics, that is, $K$ grows large with the sample size $n$, and $\mathcal{C}_K$ becomes dense in its convex hull $\mathcal{C}$. Assumption 6 makes the dependence of $K$, $h_j$, $h_2$, and $\mathbf{c}_j$ on $n$ explicit with a subscript. The main text omits the subscript $n$ whenever possible to simplify notation.
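A minimal sketch of the correction weights, specialized to the binary-treatment case where $\beta(\mathbf{c}) = \phi(c)$ depends only on the scalar cutoff (so $E_j(\mathbf{c})$ collects powers of $c_j - c$); the integration grid, counterfactual density, and bandwidth are placeholders:

```python
import numpy as np

def edge_kernel(u):
    # Edge kernel k(u) = 1{|u| <= 1} (1 - |u|)
    return np.where(np.abs(u) <= 1.0, 1.0 - np.abs(u), 0.0)

def correction_weights(cutoffs, grid, omega, h2, rho2=1):
    """Correction weights Delta_j of Equation 22 for beta(c) = phi(c):
    Delta_j is the contribution of B_j-hat to the integral of
    omega(c) * phi_hat(c), where phi_hat(c) is a local polynomial
    fit of the B_j-hat on the cutoff values."""
    K = len(cutoffs)
    Delta = np.zeros(K)
    dc = grid[1] - grid[0]              # uniform grid spacing
    for c, w in zip(grid, omega):
        E = np.vander(cutoffs - c, N=rho2 + 1, increasing=True)  # K x (rho2+1)
        Om = edge_kernel((cutoffs - c) / h2)                     # Omega_j(c; h2)
        A = E.T @ (Om[:, None] * E)
        # First row of A^{-1} E' Omega maps B-hat into the intercept phi_hat(c)
        lin = np.linalg.solve(A, (Om[:, None] * E).T)[0]
        Delta += w * lin * dc           # integrate against omega(c)
    return Delta

# Usage: mu_c_hat = correction_weights(cutoffs, grid, omega, h2) @ B_hat_vec
```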
Assumption 6. (a) The schedule of cutoffs and doses comes from a triangular array of fixed constants $\mathcal{C}_{K_n} = \{\mathbf{c}_{j,n}\}_{j=1}^{K_n}$ that depends on the sample size $n$; $\mathcal{C}_{K_n}$ converges to a countably infinite set $\mathcal{C}_\infty$ as $n \to \infty$; $\mathcal{C}_\infty$ is dense in its convex hull $\mathcal{C}$; (b) given the first-step bandwidth sequences $h_{j,n}$, assume that $c_{j,n} + h_{j,n} < c_{j+1,n}$ and $c_{j,n} \leq c_{j+1,n} - h_{j+1,n}$ for all $j = 1, \ldots, K_n - 1$; and (c) given the second-step bandwidth sequence $h_{2,n}$ and polynomial order $\rho_2$, define $E_n(\mathbf{c})$ and $\Omega_n(\mathbf{c}; h_{2,n})$ as in Equations (20)-(21) for each $n$. Assume there exists a positive definite $J \times J$ matrix $Q$ such that

$\sup_{\mathbf{c} \in \mathcal{C}} \left\| K_n h_{2,n}^3 \left[ E_n(\mathbf{c}/h_{2,n})' \, \Omega_n(\mathbf{c}; h_{2,n}) \, E_n(\mathbf{c}/h_{2,n}) \right]^{-1} - Q \right\| = o(1)$.

For large $K$, cutoff-dose values must be uniformly distributed on the domain $\mathcal{C}$ such that $E(\mathbf{c}/h_2)' \Omega(\mathbf{c}; h_2) E(\mathbf{c}/h_2)$ is invertible and of magnitude $K h_2^3$, that is, $K$ times the volume of every $h_2$-neighborhood of $\mathbf{c}$, for every $\mathbf{c}$ in $\mathcal{C}$. These conditions are satisfied in a variety of examples of triangular arrays of points. In Section B.3 of the supplemental appendix, these conditions are verified for one example of a triangular array. Asymptotic normality also relies on additional smoothness conditions on the moments of the data.

Assumption 7. (a) $R(x, d) = E[Y_i(d) \mid X_i = x]$ is a $\bar{\rho}$ times continuously differentiable function with $\bar{\rho} = \max\{\rho_1 + 2, \rho_2 + 2\}$, where $\rho_1$ and $\rho_2$ are the polynomial degrees in the first and second steps; the $\bar{\rho}$-th partial derivative of $R(x, d)$ with respect to $x$ is denoted $\nabla^{\bar{\rho}}_x R(x, d)$; (b) $\sigma^2(x, d) = V[Y_i(d) \mid X_i = x]$ is a continuous function bounded away from zero; and (c) $\exists M \in (0, \infty)$ such that $P[\,|Y_i(d) - R(X_i, d)| < M\,] = 1$ for all $d \in \mathcal{D}$.

Theorem 2 states the rate conditions under which the estimator $\hat{\mu}^c$ has an asymptotically normal distribution. Estimation of the ATE consists of approximating the integral of the treatment effect function by a weighted sum of the values of that function at a finite number of points in its domain. The approximation error converges to zero as the number of points grows large. Function evaluations $B_j$ are estimated by $\hat{B}_j$. The correction weights guarantee that the integral approximation error converges to zero faster than the estimation error.
Theorem 2. Suppose Assumptions 3-7 hold. As $n \to \infty$, assume that $K \to \infty$, $\bar{h} \to 0$, $\bar{h}/\underline{h} = O(1)$, and $h_2 \to 0$ such that (i) $(K n \bar{h})^{1/2} \bar{h}^{\rho_1 + 1} = O(1)$; (ii) $K^{1/2} \log n / (n \underline{h})^{1/2} = o(1)$, and $K\bar{h} = O(1)$; and (iii) $(K n \bar{h})^{1/2} h_2^{\rho_2 + 1} = O(1)$, and $1/(K h_2^3) = O(1)$. Then,

$\dfrac{\hat{\mu}^c - B^c_{1n} - B^c_{2n} - \mu^c}{(V^c_n)^{1/2}} \overset{d}{\to} N(0, 1)$. (24)

The first-step bias $B^c_{1n}$ and variance $V^c_n$ terms are defined as in Equations 12-13 except that $\Delta_j$ replaces $\omega^d_j$; the second-step bias $B^c_{2n}$ is characterized as follows:

$B^c_{2n} = \displaystyle\int_{\mathcal{C}} \omega^c(\mathbf{c}) \sum_{(\gamma_1, \gamma_2, \gamma_3)} \sum_{j=1}^K \left\{ \frac{(c_j - c)^{\gamma_1} (d_{j-1} - d)^{\gamma_2} (d_j - d')^{\gamma_3}}{\gamma_1! \, \gamma_2! \, \gamma_3!} \, \nabla^{\gamma_1}_c \nabla^{\gamma_2}_d \nabla^{\gamma_3}_{d'} \beta(c, d, d') \, \frac{\det\big( E(\mathbf{c})' \Omega(\mathbf{c}; h_2) E_{1 \leftarrow e_j}(\mathbf{c}) \big)}{\det\big( E(\mathbf{c})' \Omega(\mathbf{c}; h_2) E(\mathbf{c}) \big)} \right\} d\mathbf{c}$, (25)

where the first sum runs over all triplets $(\gamma_1, \gamma_2, \gamma_3) \in \mathbb{Z}^3_+$ such that $\gamma_1 + \gamma_2 + \gamma_3 = \rho_2 + 1$ and $\min\{\gamma_2, \gamma_3\} = 0$. Furthermore, $(V^c_n)^{-1/2} = O\big((K n \bar{h})^{1/2}\big)$, $(V^c_n)^{-1/2} B^c_{1n} = O_P\big((K n \bar{h})^{1/2} \bar{h}^{\rho_1 + 1}\big)$, and $(V^c_n)^{-1/2} B^c_{2n} = O\big((K n \bar{h})^{1/2} h_2^{\rho_2 + 1}\big)$.

A consistent estimator for $V^c_n$ is

$\hat{V}^c_n = \displaystyle\sum_{i=1}^n \hat{\varepsilon}_i^2 \left( \sum_{j=1}^K \frac{\Delta_j}{n h_j} \, k\!\left(\frac{X_i - c_j}{h_j}\right) e_1' \left( v^{j+}_i (G^{j+}_n)^{-1} - v^{j-}_i (G^{j-}_n)^{-1} \right) \tilde{H}^j_i \right)^{\!2}$, (26)

where $\hat{\varepsilon}_i^2$ is computed using Equation 15. Lemma B.10 in the supplemental appendix's Section B.4 demonstrates that $\hat{V}^c_n / V^c_n \overset{p}{\to} 1$, provided $(K\bar{h})^{-1} = O(1)$. If the bandwidth choices are such that the standardized bias term $(V^c_n)^{-1/2} (B^c_{1n} + B^c_{2n})$ differs from zero asymptotically, then inference must be done using a bias-corrected estimator. A practical way of performing bias correction is to increase the order of the polynomials from $(\rho_1, \rho_2)$ to $(\rho_1 + 1, \rho_2 + 1)$, and to compute $\hat{\mu}^{c\prime}$ and $\hat{V}^{c\prime}_n$ using the same bandwidth choices as $\hat{\mu}^c$ and $\hat{V}^c_n$. It follows that $(\hat{V}^{c\prime}_n)^{-1/2} (\hat{\mu}^{c\prime} - \mu^c) \overset{d}{\to} N(0, 1)$.

Compactness of the set $\mathcal{C}$, along with the asymptotic behavior of the schedule of cutoff-doses (Assumption 6), is crucial for the numerical integration error to vanish sufficiently quickly, as required by Theorem 2. Continuity of $\omega^c(\mathbf{c})$ implies that the boundary of $\mathcal{C}$ has zero probability under the counterfactual distribution. Therefore, the convergence rate of $\hat{\mu}^c$ is not affected by the value of $\omega^c(\mathbf{c})$ over the boundary of $\mathcal{C}$. In finite samples, local polynomial estimates $\hat{\beta}(\mathbf{c})$ may be noisy for values of $\mathbf{c}$ at the boundary of the convex hull of $\mathcal{C}_K$. Researchers should take that into account when specifying the support of the counterfactual distribution $\omega^c(\mathbf{c})$.

A simple example illustrates the three rate conditions of Theorem 2. Suppose $h_j = n^{-\lambda_1}$ for all $j$, $h_2 = n^{-\lambda_2}$, and $K = n^\theta$. The first-step estimation uses local-linear regression ($\rho_1 = 1$), and the second step, local cubic regression ($\rho_2 = 3$). The first rate condition says the first-step bandwidths have to converge to zero fast enough to control the asymptotic bias. That is, $(K n \bar{h})^{1/2} \bar{h}^{\rho_1 + 1} = O(1)$; in terms of the example, this condition becomes $\lambda_1 \geq (1 + \theta)/(3 + 2\rho_1)$. The second rate condition restricts how fast the number of cutoffs grows with $n$. It cannot grow too fast, to ensure having enough observations around the cutoffs for uniform consistency of the first-step estimates. The second condition has two parts: (a) $K^{1/2} \log n / (n \underline{h})^{1/2} = o(1) \Leftrightarrow \lambda_1 < 1 - \theta$; and (b) $K\bar{h} = O(1) \Leftrightarrow \lambda_1 \geq \theta$. The third rate condition limits how slowly $K$ grows, relative to the sample size, to ensure that the integral approximation error vanishes faster than the estimation variance.
Part (a) of the third condition says $(K n \bar{h})^{1/2} h_2^{\rho_2 + 1} = O(1) \Leftrightarrow \lambda_1 \geq 1 + \theta - 2\lambda_2(\rho_2 + 1)$; part (b) is $1/(K h_2^3) = O(1) \Leftrightarrow \lambda_2 \leq \theta/3$.

Figure 1 illustrates these conditions. Panel (a) shows the conditions in terms of $(\lambda_1, \theta)$ assuming $\lambda_2 = \theta/3$, that is, $h_2 = K^{-1/3}$, which satisfies part (b) of the third condition. Panel (b) depicts the same conditions in terms of $(\lambda_1, \lambda_2)$, assuming $\theta = 0.4$. The feasible set is well-defined as long as $K$ grows no faster than $\sqrt{n}$, that is, $\theta < 0.5$. In addition, $\rho_2 \geq 3$ is required for the feasible set to be non-empty. The fastest rate of convergence of $\hat{\mu}^c$ is $\sqrt{n}$, and it is reached along the dashed line 2(b). (Section B.3 in the supplemental appendix gives an example of a schedule of cutoff-dose values that satisfies Assumption 6 for feasible choices of $(\bar{h}, h_2)$ in this example.)

[Figure 1 here.] Notes: The diagram shows the rate conditions of Theorem 2 applied to the case where $h_j = n^{-\lambda_1}$ for all $j$, $h_2 = n^{-\lambda_2}$, and $K = n^\theta$. Condition 1, that is, $(K n \bar{h})^{1/2} \bar{h}^{\rho_1 + 1} = O(1)$, is equivalent to $\lambda_1 \geq (1 + \theta)/(3 + 2\rho_1)$; condition 2(a): $K^{1/2} \log n / (n \underline{h})^{1/2} = o(1) \Leftrightarrow \lambda_1 < 1 - \theta$; condition 2(b): $K\bar{h} = O(1) \Leftrightarrow \lambda_1 \geq \theta$; condition 3(a): $(K n \bar{h})^{1/2} h_2^{\rho_2 + 1} = O(1) \Leftrightarrow \lambda_1 \geq 1 + \theta - 2\lambda_2(\rho_2 + 1)$; and condition 3(b): $1/(K h_2^3) = O(1) \Leftrightarrow \lambda_2 \leq \theta/3$. Panel (a) illustrates the rate conditions on the first-step bandwidth and number of cutoffs $(\lambda_1, \theta)$ for $\rho_1 = 1$, $\rho_2 = 3$, and $\lambda_2 = \theta/3$, so that $h_2 = K^{-1/3}$ and condition 3(b) is satisfied. Panel (b) displays the rate conditions on the bandwidths $(\lambda_1, \lambda_2)$ given $\theta = 0.4$, $\rho_1 = 1$, and $\rho_2 = 3$.
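The algebra of these conditions is easy to check numerically; the sketch below (using the example's parametrization, with log factors ignored in condition 2(a)) tests whether a choice of rates is feasible:

```python
def rates_feasible(lam1, lam2, theta, rho1=1, rho2=3):
    """Rate conditions of Theorem 2 for h_j = n^{-lam1}, h2 = n^{-lam2},
    and K = n^{theta}; log factors are ignored in condition 2(a)."""
    c1 = lam1 >= (1 + theta) / (3 + 2 * rho1)        # condition 1
    c2a = lam1 < 1 - theta                           # condition 2(a)
    c2b = lam1 >= theta                              # condition 2(b)
    c3a = lam1 >= 1 + theta - 2 * lam2 * (rho2 + 1)  # condition 3(a)
    c3b = lam2 <= theta / 3                          # condition 3(b)
    return all([c1, c2a, c2b, c3a, c3b])

# Example from Figure 1, panel (b): theta = 0.4 and lam2 = theta/3
print(rates_feasible(lam1=0.45, lam2=0.4 / 3, theta=0.4))  # True
print(rates_feasible(lam1=0.20, lam2=0.4 / 3, theta=0.4))  # False: too slow
```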
Implementation of Theorem 2 requires the researcher to choose $\rho_1 \in \mathbb{Z}_+$, $h_j > 0$ for all $j$, $\rho_2 \in \mathbb{Z}_+$, $h_2 > 0$, and $k(\cdot)$. A theory of optimal choice of these tuning parameters is beyond the goals of this paper. Optimal choice of bandwidths is an interesting topic for future research, because optimality in the multi-cutoff case would account for: (i) the interaction between first- and second-stage bandwidths; (ii) the variance reduction from overlapping estimation windows at consecutive cutoffs; and (iii) the recent advances in robust bias-corrected inference and coverage-error optimal bandwidths by Calonico et al. (2018).

The IK bandwidth formula may produce first-step bandwidths with an incorrect rate of convergence. For example, if $\rho_1 = 1$, these bandwidths converge to zero at $n^{-0.2}$, which is not fast enough if $\rho_2 = 3$ (Figure 1). A simple way to correct for this is to adjust the bandwidths by multiplying them by $n^{0.2 - \lambda_1}$ for $\lambda_1 \geq \theta$, so that their rate becomes $n^{-\lambda_1}$. Conditions 2(a) and (b) imply that $\theta$ is never bigger than $0.5$, regardless of $\rho_1$, $\rho_2$, and $\lambda_2$. Thus, the smallest value for $\lambda_1$ consistent with these restrictions is $0.5$.
The same idea applies to the coverage-error optimal bandwidths by Calonico et al. (2018), which converge to zero at rate $n^{-0.25}$, and need to be adjusted.

In certain cases, the $\beta$ function may depend on fewer than the three arguments $(c, d, d')$. For example, in the Medicaid application, the treatment is binary and $\beta$ is only a function of $c$. This is a particular case of the theory in this section. The only rate condition that changes is condition 3(b). It becomes $1/(K h_2) = O(1)$, or $\lambda_2 \leq \theta$ in terms of Figure 1. A non-empty feasible set of bandwidth choices requires
$\rho_2 \geq 1$, as opposed to $\rho_2 \geq 3$ in the three-argument case.

A practical recommendation is to choose the second-step bandwidth $h_2$ that minimizes the MSE of estimation. First, using observations pertaining to each cutoff $j$, compute the IK bandwidth $h^{ik}_j$ for sharp RD and local-linear regression; adjust the rate of the bandwidths so that $h_j = h^{ik}_j \times n^{-0.3}$. Second, create a grid of possible values for $h_2$. For each value on the grid, compute $\hat{\mu}^c(h_2)$ using the edge kernel, the choices of $h_j$ given above, $\rho_1 = 1$, and $\rho_2 = 3$ (or $\rho_2 = 1$ in the binary treatment case). Similarly, compute $\hat{\mu}^{c\prime}(h_2)$ using the edge kernel, the choices of $h_j$ given above, $\rho_1 = 2$, and $\rho_2 = 4$ (or $\rho_2 = 2$ in the binary treatment case). Use Equation 26 to estimate the variance of $\hat{\mu}^c(h_2)$ and call it $\hat{V}^c_n(h_2)$. Evaluate the approximated MSE of $\hat{\mu}^c(h_2)$ by $(\hat{\mu}^c(h_2) - \hat{\mu}^{c\prime}(h_2))^2 + \hat{V}^c_n(h_2)$. Choose the bandwidth value on the grid that minimizes the MSE and call it $h_2^*$. The bias-corrected estimate is $\hat{\mu}^{c\prime}(h_2^*)$, and its variance estimate is $\hat{V}^{c\prime}_n(h_2^*)$.
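A minimal sketch of this bandwidth search; the callables mu_c_hat and V_c_hat are user-supplied estimators following Equations 17 and 26 (assumptions here, not the paper's code):

```python
def select_h2(Y, X, cutoffs, h_first, grid, mu_c_hat, V_c_hat):
    """Grid search for the second-step bandwidth h2 minimizing the
    approximate MSE (mu_c(h2) - mu_c'(h2))^2 + V_c(h2)."""
    best = None
    for h2 in grid:
        mu = mu_c_hat(Y, X, cutoffs, h_first, h2, rho1=1, rho2=3)
        mu_bc = mu_c_hat(Y, X, cutoffs, h_first, h2, rho1=2, rho2=4)
        mse = (mu - mu_bc) ** 2 + V_c_hat(Y, X, cutoffs, h_first, h2)
        if best is None or mse < best[0]:
            best = (mse, h2, mu_bc)
    mse_star, h2_star, mu_bc_star = best
    return h2_star, mu_bc_star  # bias-corrected estimate at h2*
```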
It may not be immediately clear that root-$n$ is the fastest estimation rate achievable in a setting where both $K$ and $n$ grow large. The double asymptotic setting is conceptually different from the usual asymptotic setting, where only $n \to \infty$ and non-parametric averages are estimable at root-$n$. Estimation rates depend not only on bandwidth choices, but also on how fast $K$ grows relative to $n$. Similar examples in econometrics include panels with a large number of observations and time periods, and asymptotics with many instruments. The following theorem demonstrates that the minimax optimal rate of estimation of $\mu^c$ is indeed root-$n$, as long as first-step bandwidths converge to zero at $1/K$ rate.

Theorem 3. Let $\mathcal{P}$ be the class of models generating potential outcomes $\{Y_i(d)\}_{d \in \mathcal{D}}$ and forcing variables $X_i$. For a schedule of cutoffs and doses $\{\mathbf{c}_j\}_{j=1}^K$, observed data $(Y_i, X_i, D_i)$ are generated iid from $P \in \mathcal{P}$ as described in Section 3.1. Assume that (i) each model $P \in \mathcal{P}$ satisfies Assumptions 4-7; (ii) $f(x)$ and $\sigma^2(x, d)$ are bounded away from zero uniformly in $\mathcal{P}$; (iii) the following functions are bounded uniformly in $\mathcal{P}$: $\nabla^\rho_x \sigma^2(x, d)$ $\forall \rho \leq 1$, $\nabla^\rho_x f(x)$ $\forall \rho \leq 1$, $\nabla^\rho_x R(x, d)$ $\forall \rho \leq \bar{\rho}$, $\nabla^\rho_d R(x, d)$ $\forall \rho \leq \bar{\rho}$, where $\bar{\rho} = \max\{\rho_1 + 2, \rho_2 + 2\}$; and (iv) there exists $M \in (0, \infty)$ such that $P[\,|Y_i(d) - R(X_i, d)| < M\,] = 1$ $\forall d \in \mathcal{D}$ uniformly in $\mathcal{P}$. Then, for any $\epsilon > 0$, there exists $\eta > 0$ such that

$\inf_{\tilde{\mu}} \sup_{P \in \mathcal{P}} \mathbb{P}_P\!\left[ \sqrt{n} \, |\tilde{\mu} - \mu^c(P)| > \epsilon/2 \right] \geq \eta$ for large $n$. (27)

The $\inf$ is taken over all estimators $\tilde{\mu}$ built using the observed data $(Y_i, X_i, D_i)$, $i = 1, \ldots, n$; $\mu^c(P) = \int \omega^c(\mathbf{c}) \beta(\mathbf{c}; P) \, d\mathbf{c}$ with $\beta(\mathbf{c}; P) = E_P[Y_i(d') - Y_i(d) \mid X_i = c]$; and $\mathbb{P}_P$ and $E_P$ denote the probability and expectation under model $P \in \mathcal{P}$.

Assume the conditions of Theorem 2, and that first-step bandwidths satisfy $\bar{h} = O(K^{-1})$. Consider the estimator $\hat{\mu}^c$ defined in Equation 17. For any small $\delta > 0$, there exists large $\epsilon \in (0, \infty)$ such that

$\sup_{P \in \mathcal{P}} \mathbb{P}_P\!\left[ \sqrt{n} \, |\hat{\mu}^c - \mu^c(P)| > \epsilon \right] < \delta$ for large $n$. (28)

Equation 27 shows that no estimator converges faster than $\sqrt{n}$ uniformly over $\mathcal{P}$. Equation 28 says the estimator proposed in Theorem 2 converges at root-$n$ uniformly over $\mathcal{P}$ as long as first-step bandwidths converge to zero at $1/K$ rate. Therefore, root-$n$ is the minimax optimal rate of convergence in the non-parametric estimation of ATE in RDD with many thresholds. Authors have previously analyzed minimax optimality of non-parametric estimators of a regression function at a boundary point, for example, Cheng et al. (1997) and Sun (2005). Theorem 3 is novel because it combines boundary points to estimate averages of non-parametric regression functions.

4 Fuzzy RDD

This section relaxes the sharp assignment mechanism of previous sections and studies the fuzzy RDD case. The analysis focuses on multiple cutoffs, but $K$ is finite, as opposed to approaching infinity as in Section 3.2. This makes the exercise more tractable, because the number of compliance cases grows super-exponentially with the number of cutoffs. In contrast to the sharp case, non-parametric identification of local effects in the fuzzy case is impossible. As a result, inference methods in this section rely on a second heterogeneity assumption, namely, the treatment effect function is assumed parametric. Section B.5.2, in the supplemental appendix, provides practical guidelines to compute an MSE-optimal ATE estimator, and demonstrates asymptotic normality.

In the sharp RDD case, all individuals with forcing variable equal to $x$ receive the same treatment $D(x)$ (Equation 3). In the fuzzy RDD case, many of these individuals may receive treatments different from $D(x)$. In the high school assignment example, students may choose to go to a school that is not the best school for which they are eligible. For instance, a student may want to attend the same high school as a certain friend or sibling. Another example is given by Garibaldi et al. (2012). In their study, a schedule of tuition subsidies applies to most students at Bocconi University, but the university reserves the right to grant certain students different subsidies after reassessing their ability to pay.
The fuzzy RDD case is modeled in terms of a potential treatment assignment framework. A potential treatment assignment function $U: \mathcal{X} \to \mathcal{D}$ describes the treatment received for every value of the forcing variable $x \in \mathcal{X}$. For simplicity, these functions are assumed to belong to the following class:

$\mathcal{U}^* = \left\{ U: \mathcal{X} \to \mathcal{D} \,:\, U(x) = \sum_{j=0}^K u_j \, I\{c_j \leq x < c_{j+1}\} \text{ for some } u_j \in \{d_0, \ldots, d_K\},\ j = 0, \ldots, K \right\}$. (29)

Sharp RDD is the particular case where the individual potential treatment assignment function $U_i$ is the same for every individual $i$, that is, $U_i(x) = D(x)$ $\forall i$, with $D(x)$ defined in Equation 3.

(The source of fuzziness varies across applications. One example is the case where the assignment of individuals into different treatments is made through a matching mechanism, and the econometrician does not observe all the individual characteristics used in the matching algorithm. This is the reason why the RDD of PU is fuzzy: based on the entire distribution of test scores and preferences, the central planner ranks students by their test scores and assigns each one to her preferred school among schools with vacancies.)
In the fuzzy case, $U_i$ is sampled iid from a distribution of functions with support in $\mathcal{U}^*$. Potential treatment functions $U_i(x)$ are unobserved, but the treatments received are observed and given by

$D_i = \sum_{j=0}^K U_i(c_j) \, I\{c_j \leq X_i < c_{j+1}\}$.

Using classic definitions of compliance behaviors (Imbens and Rubin (1997)), three types of compliance groups are defined in terms of changes in treatment eligibility. "Never-changers" are those whose treatment received never changes when eligibility changes. The treatment received by "ever-compliers" or "ever-defiers" changes at least once when eligibility changes. Ever-compliers are those whose treatment received changes if and only if it changes to the treatment dose for which they become eligible. Ever-defiers change to a treatment dose different from the one for which they become eligible. In the case of one cutoff and two treatments, the definition of ever-complier (ever-defier) is equivalent to the classic definition of complier (defier) of Imbens and Lemieux (2008). The three compliance groups are measurable events that partition the population of individuals, with $G_{nc}$ denoting never-changers, $G_{ec}$ ever-compliers, and $G_{ed}$ ever-defiers:

$G_{nc} = \left\{ U_i \in \mathcal{U}^* : \{ j : U_i(c_{j-1}) \neq U_i(c_j) \} = \emptyset \right\}$ (30)
$G_{ec} = \left\{ U_i \in \mathcal{U}^* : \{ j : U_i(c_j) = D(c_j) \} \supseteq \{ j : U_i(c_{j-1}) \neq U_i(c_j) \} \neq \emptyset \right\}$ (31)
$G_{ed} = \left\{ U_i \in \mathcal{U}^* : \{ j : U_i(c_j) \neq D(c_j) \} \cap \{ j : U_i(c_{j-1}) \neq U_i(c_j) \} \neq \emptyset \right\}$ (32)

where $\emptyset$ denotes the empty set.

In the high school assignment case, an example of a never-changer is a student who strongly prefers the high school with the lowest admission cutoff and attends that high school even if she is admitted to better schools. An example of an ever-complier is a student who attends the best school into which she is admitted, or a student who chooses the best school among the nearby schools. Suppose a student has rational preferences and is never indifferent. Assume her choice set is equal to those schools with admission cutoffs that are less than or equal to her test score. Then, such a student is never an ever-defier. In other words, as her test score increases, a new school is added to her choice set of schools; she either chooses to go to the new school for which she becomes eligible, or she stays at the school which she preferred prior to the increase in her choice set. Thus, it seems natural to rule out "ever-defiers" in this and other applications.

Never-changers do not produce changes in treatments, so there is no identification on them. For ever-compliers, there are multiple possible changes in treatment at a given cutoff, and ever-compliers may differ in terms of the treatments they comply with. For example, the student who is willing to attend the best school possible complies with all changes in treatment eligibility. On the other hand, the student who is willing to attend the best possible school within a certain distance complies with only some of those changes.

(These definitions allow for non-monotonic treatment schedules; for example, the average class size varies non-monotonically across cutoffs on enrollment (Angrist and Lavy (1999)). Table B.1 in Section B.5.1 of the supplemental appendix illustrates these definitions of compliance groups using a simple example with 3 treatments and 2 cutoffs.)
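The classification in Equations 30-32 is mechanical; a minimal sketch, treating a potential assignment $U \in \mathcal{U}^*$ as the vector $(u_0, \ldots, u_K)$ of doses on the $K+1$ intervals and the sharp schedule as $(d_0, \ldots, d_K)$ (illustrative code, not the paper's notation):

```python
def compliance_group(u, d):
    """Classify a potential assignment u = (u_0, ..., u_K) against the
    eligibility schedule d = (d_0, ..., d_K); Equations 30-32."""
    # Cutoffs j = 1..K where the received treatment changes
    changes = {j for j in range(1, len(u)) if u[j - 1] != u[j]}
    if not changes:
        return "never-changer"
    # Ever-complier: every change is into the dose of eligibility d_j
    if all(u[j] == d[j] for j in changes):
        return "ever-complier"
    return "ever-defier"

# Example with 3 doses and 2 cutoffs: eligibility schedule d = (0, 1, 2)
print(compliance_group((0, 0, 0), (0, 1, 2)))  # never-changer
print(compliance_group((0, 1, 1), (0, 1, 2)))  # ever-complier
print(compliance_group((0, 2, 2), (0, 1, 2)))  # ever-defier
```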
Assumption 8. (a) There are no ever-defiers: $P[G_{ed}] = 0$; (b) for arbitrary $d \in \mathcal{D}$ and $\bar{U} \in \mathcal{U}^*$, $E[Y_i(d) \mid X_i = x, U_i = \bar{U}]$ and $P[U_i = \bar{U} \mid X_i = x]$ are continuous and bounded functions of $x$; (c) there exists a function $\beta^{ec}(\mathbf{c})$ such that $E[Y_i(d') - Y_i(d) \mid X_i = c, U_i = \bar{U}] = \beta^{ec}(\mathbf{c})$ for every $\mathbf{c} = (c, d, d') \in \mathcal{C}$ and $\bar{U} \in G_{ec}$.

A fuzzy assignment produces several different treatment changes at each cutoff, even after ruling out ever-defiers. The researcher only observes one aggregate change in $Y_i$ at each cutoff, but there are several treatment effects on ever-compliers to be identified at that cutoff. Theorem 4 below shows that identification of these effects is not possible without further restricting the class of functions $\beta^{ec}(\mathbf{c})$. Economic theory or a priori knowledge guides the choice of a functional form that credibly summarizes the heterogeneity of treatment effects. For example, the principal-agent model of Bajari et al. (2017) yields a functional form to study reimbursement of hospitals by insurers. The second heterogeneity assumption (Assumption 9) restricts the treatment effect function on ever-compliers to a finite-dimensional vector space of functions.
Assumption 9. Let $W(c, d) = [W_1(c, d), \ldots, W_q(c, d)]'$ be a vector-valued function $W : \mathcal{X} \times \mathcal{D} \to \mathbb{R}^{q \times 1}$ known to the researcher and such that (a) $E_F[W(c, d') - W(c, d)]$ is well-defined for the counterfactual distribution $F$; and (b) $W_j(c, d') - W_j(c, d)$, $j = 1, \ldots, q$, are linearly independent functions. The treatment effect function $\beta_{ec}(\mathbf{c})$ is assumed to belong to the following class of functions:

\[ \mathcal{H} = \left\{ \beta : \mathcal{C} \to \mathbb{R} \;:\; \beta(c, d, d') = \left[ W(c, d') - W(c, d) \right]' \theta, \text{ for } \theta \in \mathbb{R}^q \right\}. \]

In this case, the ATE on ever-compliers is a linear combination of the true parameter vector $\theta_{ec}$. For a counterfactual distribution $F$ chosen by the researcher,

\[ \mu_{ec}(F) = \int \beta(\mathbf{c}; \theta_{ec}) \, dF(\mathbf{c}) \tag{33} \]
\[ = \underbrace{\int \left[ W(c, d') - W(c, d) \right]' dF(\mathbf{c})}_{\equiv\, Z(F)} \,\theta_{ec} \tag{34} \]
\[ = Z(F)\, \theta_{ec}. \tag{35} \]

Theorem 4 shows that the observed change in average outcome at a given cutoff is a weighted average of treatment effects on the ever-compliers who switch from various doses into the dose of eligibility at that cutoff. Assumption 9 and variation in cutoff characteristics are sufficient conditions for identification. Conversely, identification on ever-compliers implies that $\beta_{ec}(\mathbf{c})$ belongs to a finite-dimensional class of functions.
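As a concrete illustration of Equations (33)-(35), the row vector $Z(F)$ can be approximated by simulation from the researcher-chosen counterfactual $F$. The sketch below is hypothetical: the specification $W(c, d) = [d,\, cd,\, c^2 d]'$ and all numerical values are assumptions of this example, not choices made in the paper.

```python
import numpy as np

# With W(c, d) = [d, c*d, c^2*d]', W(c, d') - W(c, d) = (d' - d) * [1, c, c^2]'.
rng = np.random.default_rng(0)

c = rng.uniform(6.5, 8.5, size=10_000)      # scores drawn from a hypothetical F
u = np.full_like(c, 0.5)                    # dose change d' - d under F
diff = np.column_stack([u, c * u, c ** 2 * u])
Z_F = diff.mean(axis=0)                     # Monte Carlo estimate of Z(F)

theta_ec_hat = np.array([1.0, 0.2, -0.01])  # placeholder value of theta_ec
mu_ec_hat = Z_F @ theta_ec_hat              # Equation (35): mu_ec = Z(F) theta_ec
```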
Theorem 4. Under Assumption 8, for $j = 1, \ldots, K$,

\[ B_j = \sum_{l=0,\, l \neq j}^{K} \omega_{j,l}\, \beta_{ec}(c_j, d_l, d_j), \]

where $B_j$ is defined in Equation 6, and

\[ \omega_{j,l} = \lim_{e \downarrow 0} \left\{ P[D_i = d_l \mid X_i = c_j - e] - P[D_i = d_l \mid X_i = c_j + e] \right\}, \quad l = 0, 1, \ldots, K,\; l \neq j. \]

Moreover, suppose $\beta_{ec}$ belongs to the class of functions $\mathcal{H}$ defined in Assumption 9 with $q \le K$. Define

\[ \widetilde W_j = \sum_{l=0,\, l \neq j}^{K} \omega_{j,l} \left[ W(c_j, d_j) - W(c_j, d_l) \right] \tag{36} \]

for the vector-valued function $W(c, d)$ of Assumption 9; build a $K \times q$ matrix $\widetilde W$ by stacking the rows $\widetilde W_j'$, and $B$ by stacking $B_j$. If $\widetilde W' \widetilde W$ is invertible, then $\beta_{ec}(\mathbf{c})$ is identified and equal to

\[ \beta_{ec}(\mathbf{c}) = \left[ W(c, d') - W(c, d) \right]' \left( \widetilde W' \widetilde W \right)^{-1} \widetilde W' B. \]

Conversely, suppose $\beta_{ec}$ belongs to some class of functions $\widetilde{\mathcal{H}}$, and treatment effects on ever-compliers are identified at the $p > K$ cutoff-dose values $\{ \tilde{\mathbf{c}} : \tilde{\mathbf{c}} = (c_j, d_l, d_j) \text{ with } \omega_{j,l} > 0 \}$ of every possible fuzzy assignment generated from the given schedule of cutoffs $\{c_j\}_{j=1}^K$. Then, the class of functions $\widetilde{\mathcal{H}}$ is "finite dimensional" in the sense that

\[ G = \left\{ \left( \beta(\tilde{\mathbf{c}}_1), \ldots, \beta(\tilde{\mathbf{c}}_p) \right) : \beta \in \widetilde{\mathcal{H}} \right\} \subseteq \mathbb{R}^p \]

has $\dim G \le K$ for every fuzzy assignment $\{\tilde{\mathbf{c}}_j\}_{j=1}^p$ generated from $\{c_j\}_{j=1}^K$.

Theorem 4 reveals that stronger functional form assumptions on $\beta_{ec}(\mathbf{c})$ are required even for identification of local effects in the fuzzy case with a finite number of multiple cutoffs. For example, identification is not possible when $\widetilde{\mathcal{H}}$ is the class of all smooth functions studied in the non-parametric case of Section 3.2. The result is striking because non-parametric identification of local effects is possible both in the sharp case with a finite number of cutoffs and in the fuzzy case with a single cutoff. It is likely possible to obtain non-parametric identification of $\beta_{ec}(\mathbf{c})$ under a large variation of cutoff-dose values. The function $\beta_{ec}(\mathbf{c})$ may be approximated by a sequence of parametric functions from Assumption 9, where $q$ grows to infinity more slowly than $K$, so as to keep $\dim G \le K$ as $K \to \infty$. In this paper, the number of cutoffs is kept finite for simplicity, and the case with large $K$ is deferred to future work.

Theorem 4 also clarifies the interpretation of two-stage least squares (2SLS) estimates in applications of fuzzy RD with multiple cutoffs, a common practice in applied work. The practice consists of using $D(X_i)$ as an instrument for $D_i$ in the regression of $Y_i$ on a constant, $D_i$, and $X_i$. See Angrist and Pischke (2008) for a discussion. In the single-cutoff case, both the non-parametric RD estimator and 2SLS applied to a neighborhood of the cutoff are consistent for the average treatment effect on compliers (Hahn et al. (2001)). To my knowledge, such an equivalence has never been studied in the multiple-cutoff case. Nevertheless, many important applications have multiple fuzzy cutoffs and use 2SLS; for example, Angrist and Lavy (1999), Chen and Van der Klaauw (2008), and Hoekstra (2009). The 2SLS estimator is consistent for a data-driven weighted average of treatment effects on ever-compliers as long as a sufficiently flexible specification is used; for example, cutoff fixed-effects or varying slopes. The economic meaning of the 2SLS estimands depends crucially on the choice of such a weighting scheme.
Unless a parametric functional form is imposed on $\beta_{ec}(\mathbf{c})$, or there is large variation in cutoff-doses, only a data-driven weighted average of $\beta_{ec}(\mathbf{c})$ is identified. In other words, if $\beta_{ec}(\mathbf{c})$ is non-parametric and there are only a few cutoffs, the researcher does not have control over the weighting scheme, and 2SLS estimates do not have a clear interpretation.

Theorem 4 leads to a two-step estimation procedure for $\theta_{ec}$ and $\mu_{ec}$. The mechanics are similar to the previous sections, so I omit the details from the main text for brevity. In the first step, the researcher estimates the jump discontinuity of the vector $[Y_i \;\; W(X_i, D_i)']'$ using LPRs at each cutoff to obtain $[\widehat B_j \;\; \widehat{\widetilde W}_j']'$. In the second step, a regression of $\widehat B_j$ on $\widehat{\widetilde W}_j'$ obtains $\widehat\theta_{ec}$. The ATE estimator is $\widehat\mu_{ec} = Z(F)\, \widehat\theta_{ec}$. Estimation precision varies across cutoffs, and the parametric form of $\beta_{ec}$ allows us to optimally combine different cutoffs to minimize the MSE of $\widehat\theta_{ec}$: the researcher can simply re-weight the second-step regression by the inverse of the MSE matrix of the first-step estimators. Section B.5.2 in the supplemental appendix delineates the estimation and inference procedures for $\theta_{ec}$ and $\mu_{ec}$ with practical steps.
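A minimal sketch of the second step, assuming the first-step jump estimates and regressors have already been computed by local polynomial regressions at each cutoff; the argument names (B_hat, W_tilde, mse) are hypothetical, and the inverse-MSE weights implement the re-weighting described above.

```python
import numpy as np

def second_step(B_hat, W_tilde, mse=None):
    """Second-step regression of first-step jump estimates on regressors.

    B_hat:   (K,) estimated jump discontinuities, one per cutoff
    W_tilde: (K, q) stacked regressor vectors
    mse:     optional (K,) first-step MSEs used as inverse weights
    """
    w = np.ones_like(B_hat) if mse is None else 1.0 / np.asarray(mse)
    WtW = W_tilde.T @ (W_tilde * w[:, None])   # weighted normal equations
    WtB = W_tilde.T @ (B_hat * w)
    return np.linalg.solve(WtW, WtB)           # estimate of theta_ec

# given Z_F as in Equations (33)-(35):
# mu_ec_hat = Z_F @ second_step(B_hat, W_tilde, mse)
```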
5 Monte Carlo Simulations

In this section, Monte Carlo simulations illustrate the finite-sample behavior of the ATE estimator proposed in Section 3.2. The analysis considers estimation precision and coverage of confidence intervals for different choices of tuning parameters and a non-linear specification for $\beta$. As predicted by Theorem 2, an incorrect choice of the second-step polynomial degree leads to severe bias and extremely poor coverage of confidence intervals. Moreover, first-step bandwidths that imply overlapping estimation windows produce lower MSE than cases with no overlap, regardless of other tuning parameters.

The DGP draws $n$ iid observations of $(X_i, \varepsilon_i)$, where $X_i$ is uniformly distributed over $[0, 1]$, $\varepsilon_i$ is normally distributed with zero mean and unit variance, and these variables are independent of each other. There are $K$ cutoffs $c_j = j/(K+1)$, $j = 1, \ldots, K$, on the unit interval $[0, 1]$, and $K = \lfloor n^{0.4} \rfloor$, where $\lfloor a \rfloor$ denotes the largest integer smaller than or equal to $a$. An individual with forcing variable $X_i$ receives a treatment dose equal to $D(X_i)$ as in Equation 3. The dose increases by one unit at each cutoff, starting at $d_0 = 1$ and ending at $d_K = K + 1$. The outcome variable is $Y_i = \phi(X_i) D(X_i) + \varepsilon_i$, where $\phi$ is a cubic polynomial in $X_i$. This gives $\beta(\mathbf{c}) = \phi(c)(d' - d)$, which falls into the binary treatment case (see discussion on page 17). Consider a counterfactual policy that uniformly increases treatment doses by one unit. The ATE parameter $\mu$ is the integral of $\phi(c)$ over $c \in [0, 1]$.

For given choices of the first- and second-step bandwidths $h_1$ and $h_2$, I compare the ATE estimator $\widehat\mu$ that uses $\rho_1 = \rho_2 = 1$ to the bias-corrected ATE estimator $\widehat\mu_{bc}$ that uses $\rho_1 = \rho_2 = 2$. To emphasize the importance of the second step, I also compute a naive ATE estimator that simply averages the first-step estimates. The naive and bias-corrected naive estimators, respectively $\widetilde\mu$ and $\widetilde\mu_{bc}$, are constructed as $\widehat\mu$ and $\widehat\mu_{bc}$ except for the tuning parameters in the second step: both naive estimators use $\rho_2 = 0$ and $h_2 = \infty$. To examine the effect of overlapping estimation windows in the first step, I compare estimators for two choices of $h_1$. The first choice is the largest possible bandwidth, $h_1 = 1/(K+1)$, which leads to maximum overlap. The second choice is the largest possible bandwidth with no overlap, that is, $h_1 = 0.5/(K+1)$. Finally, I study the effects of ten different choices for the second-step bandwidth, $h_2 \in \{3/(K+1), \ldots, 12/(K+1)\}$. All choices of tuning parameters satisfy the rate conditions of Theorem 2 and produce a convergence rate of root-$n$ for $\widehat\mu$ and $\widehat\mu_{bc}$. The Monte Carlo experiment simulates 10,000 draws of an iid sample for five sample sizes $n$ (the two smallest being $n = 1{,}789$ and $n = 10{,}120$) and respective numbers of cutoffs $K \in \{20, 40, \ldots\}$. Section B.7 in the supplemental appendix repeats the experiment with data-driven bandwidth choices, following the bandwidth rules proposed on page 17 (Section 3.2).
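For concreteness, a minimal sketch of this DGP follows; the coefficients of the cubic $\phi$ below are placeholders of my own, not the exact values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1789
K = int(np.floor(n ** 0.4))                  # K = 20 cutoffs
cutoffs = np.arange(1, K + 1) / (K + 1)      # c_j = j/(K+1) on [0, 1]

X = rng.uniform(0.0, 1.0, size=n)            # forcing variable
eps = rng.standard_normal(n)                 # N(0, 1) noise

# dose schedule: D(X) = j + 1 for c_j <= X < c_{j+1}, from d_0 = 1 to d_K = K + 1
D = 1 + np.searchsorted(cutoffs, X, side="right")

def phi(x):
    # cubic treatment-effect function; placeholder coefficients
    return 15 * x ** 3 + 7.5 * x ** 2 - 2.5 * x + 2.5

Y = phi(X) * D + eps                         # outcome equation of the DGP
```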
Table 1: Precision of Estimators - Choice of $h_1$. Notes: The table reports simulated bias, variance, and mean squared error (MSE) for four estimators ($\widehat\mu$, $\widehat\mu_{bc}$, $\widetilde\mu$, $\widetilde\mu_{bc}$), two choices of first-step bandwidth (overlap and no overlap), and five sample sizes $n$ and respective numbers of cutoffs $K$. The second-step bandwidth is set to $h_2 = 3/(K+1)$, which minimizes the MSE of $\widehat\mu$. Refer to Table 2 for different choices of $h_2$. The number of simulations is 10,000.

The bias and variance of all estimators converge to zero as the sample size increases, regardless of the choice of $h_1$ (Table 1). The bias correction of $\widehat\mu_{bc}$ eliminates almost all the bias of $\widehat\mu$, at the cost of a higher variance. The naive estimator $\widetilde\mu$ oversmooths the second step beyond the conditions of Theorem 2. As a result, the bias of $\widetilde\mu$ is substantially larger than that of $\widehat\mu$. Simply correcting for bias in the first step does not solve the problem, as the difference in bias between $\widetilde\mu$ and $\widetilde\mu_{bc}$ is small. First-step bandwidths that produce overlap (Table 1, rows 1-5) yield approximately the same bias, but substantially smaller variance, compared to first-step bandwidths that produce no overlap (Table 1, rows 6-10).

Next, I study how the choice of $h_2$ affects the precision of $(\widehat\mu, \widehat\mu_{bc})$ for a fixed choice of $h_1 = 1/(K+1)$ (Table 2). The smallest value for $h_2$ is $3/(K+1)$; this defines a second-step estimation window with at least three cutoffs, which ensures invertibility of the matrices in the regressions. The bias of $\widehat\mu$ is substantially smaller when $h_2$ is set to its smallest value. All other measures are practically unaffected across different $h_2$.

Table 2: Precision of Estimators - Choice of $h_2$, for $(n, K) = (1789, 20)$ and $(n, K) = (10120, 40)$. Notes: The table reports simulated bias, variance, and mean squared error (MSE) for two estimators ($\widehat\mu$, $\widehat\mu_{bc}$), ten choices of second-step bandwidth ($h_2 \in \{3/(K+1), \ldots, 12/(K+1)\}$), and the two smallest sample sizes $n$ and respective numbers of cutoffs $K$. The first-step bandwidth is set to $h_1 = 1/(K+1)$ (overlap). The naive estimators ($\widetilde\mu$, $\widetilde\mu_{bc}$) are not in this table because they are not affected by the choice of $h_2$. The number of simulations is 10,000.

The significant bias of the naive ATE estimators $\widetilde\mu$ and $\widetilde\mu_{bc}$ decreases the coverage of 95% confidence intervals as the sample increases (Table 3). The naive estimators oversmooth in the second step, and Theorem 2 implies the bias grows faster than root-$n$. For each of the four estimators, the confidence intervals equal the estimator plus or minus 1.96 times its standard error. The variances of the estimators are obtained as described in Equation 26. The bias-corrected ATE estimator $\widehat\mu_{bc}$ produces confidence intervals with correct coverage for all sample sizes. Although $\widehat\mu$ yields intervals with average length smaller than $\widehat\mu_{bc}$, the bias of $\widehat\mu$ leads to a slightly lower coverage.
Table 3: Coverage and Length of 95% Confidence Intervals. Notes: The table reports the simulated percentage of correct coverage and the average length of 95% confidence intervals. Confidence intervals are constructed using four estimators ($\widehat\mu$, $\widehat\mu_{bc}$, $\widetilde\mu$, $\widetilde\mu_{bc}$); they equal an estimator plus or minus its estimated standard deviation multiplied by 1.96. Coverage and average length are computed for five sample sizes $n$ and respective numbers of cutoffs $K$. The first-step bandwidth is set to $h_1 = 1/(K+1)$ (overlap), and the second-step bandwidth is set to $h_2 = 3/(K+1)$, which minimizes the MSE of $\widehat\mu$. The number of simulations is 10,000.

6 Empirical Application

In this section, the methods proposed in this paper are illustrated using the data from PU on high school assignments in Romania. (The data set is available online in the supplemental materials of PU on the website of the American Economic Review.) Many policy questions demand an ATE over a continuous counterfactual distribution of treatments, and this section provides an example of such a policy question. The estimators designed for the sharp RDD case are consistent for "Intent-to-Treat" (ITT) average effects when applied to the fuzzy data of PU. In this application, the ITT effect measures the impact of being assigned to a better school but not necessarily attending it. The parametric methods of Section 4 yield noticeable efficiency gains in the estimation of the ATE on ever-compliers. Treatment effects for ever-compliers reveal a heterogeneity pattern unlike the heterogeneity of ITT effects.

The administrative data from Romania cover 3 cohorts of 9th-grade students for the years 2001, 2002, and 2003, with a total of 334,137 observations. The essential elements of the high school assignment in Romania are described below. The assignment to high school is nationally centralized by the Ministry of Education. At the end of grade 8, students submit a transition score and a complete ranking of preferences for high schools. The transition score is an average of the student's performance on a national exam taken in grade 8 and the student's grade point average during grades 5-8. The Ministry of Education ranks students by their transition score and no other criteria. The mechanism assigns the student ranked first to her most preferred school, the student ranked second to her most preferred school among those with remaining capacity, and so on. Students cannot decline their assignment, and they have incentives to truthfully reveal their preference rankings.

The observed variables are the town and year of student $i$, the transition score $X_i$, the school the student is assigned to, and the student's score on the "baccalaureate exam." This is an exam taken at the end of high school, and the grade on the exam is the outcome variable $Y_i$. The quality of school $j$ (treatment dose $d_j$) is measured by the average transition score of the students attending that school. The cutoff $c_j$ for admission into school $j$ is equal to the minimum transition score among the students that are assigned to that school. The students' preferences over high schools are not observed in the data, which makes the RDD fuzzy. For example, a student may have a score greater than the cutoff for the best school in her town, but still be assigned to a different school because of her personal preferences. For a transition score $X_i$, the treatment dose of eligibility $D(X_i)$ is equal to the largest $d$ among those schools with admission cutoff $c$ less than $X_i$. The treatment dose received $D_i$ coincides with the treatment dose of eligibility $D(X_i)$ for 40% of the students in the sample. Thus, the assignment is fuzzy, and causal inference beyond ITT effects requires the methods of Section 4. Following PU, I drop observations with missing values for $Y_i$. I also drop cutoffs without enough observations around them to carry out the matrix inversions of the local polynomial regressions. The dropping of cutoffs leaves the empirical distribution of outcomes, forcing variable, cutoffs, and treatment doses practically unchanged. The estimation sample has 588 cutoffs with a total of 179,995 individuals from 769 schools in 121 towns and 3 years.
The variation of cutoff and dose values is displayed in Figure 2.

Figure 2: Variation in Cutoff and Dose Values. (a) $d$ - doses; (b) $u$ - dose changes. Notes: Scatter plot with cutoff values on the x-axis and dose values on the y-axis for the $K = 588$ cutoffs in the Romanian data. Panel (a) shows the doses before and after each cutoff, that is, $d_{j-1}$ in black and $d_j$ in gray. Panel (b) displays dose-change values on the y-axis, that is, $u_j = d_j - d_{j-1}$. The peer quality of school $j$ (treatment dose $d_j$) is measured by the average transition score of the students attending that school. The cutoff $c_j$ for admission into school $j$ is equal to the minimum transition score among the students assigned to that school.

Non-parametric identification of $\beta(\mathbf{c})$ is limited to the set $\mathcal{C}$, which is the convex hull of $\mathcal{C}_\infty$. The set $\mathcal{C}_\infty$ is not entirely observed, and the researcher relies on the observation of $\mathcal{C}_K$ (Figure 2(a)). For the sake of simplicity, I restrict $\beta$ to be a function of dose changes ($u = d' - d$) instead of doses before and after ($d$ and $d'$). The restriction greatly simplifies the visualization and estimation of $\beta(c, d, d')$, because it implies that $\beta(c, d, d') = \phi(c)(d' - d)$, where $\phi$ is a continuously differentiable function. Figure 2(b) illustrates the variation of cutoff and dose-change values and defines the limits on identification of policy counterfactuals. For example, it is not possible to identify the effects of randomly assigning students with grades between 8 and 9 to a change in treatment dose of 2: the support of such a counterfactual distribution falls outside the observed variation of cutoff and dose-change values. On the other hand, it is possible to identify the ATE of randomly assigning students with grades between 6.5 and 8.5 to dose increases between 0 and 1.5.

The following policy question illustrates the ATE estimator proposed in this paper. Suppose a new charter school is constructed in one of the towns in Romania. The new charter school has more autonomy and better management than traditional public schools, and admitted students experience an increase in school quality as if they were admitted to a school with better peers. More specifically, the policy counterfactual is to give a 0.5 increase in peer quality to a uniform distribution of scores between 6.5 and 8.5. The ATE parameter is defined as

\[ \mu = \frac{1}{2} \int_{6.5}^{8.5} \phi(c) \, dc. \tag{37} \]

I follow the estimation procedure suggested in Section 3.2 and take into account the restriction $\beta(c, d, d') = \phi(c)(d' - d)$. As in the binary treatment case, the restriction lowers the polynomial degree requirement in the second-step estimation to $\rho_2 = 1$; see Figure 1 and the discussion that follows it. The grid for $h_2$ has 32 equally spaced points between 0.1 and 3.6 (respectively, the smallest bandwidth for which the estimator is computable and the maximum distance between two different cutoffs). The MSE-optimal bandwidth choice $h_2^*$ lies between 1 and 2. Figure 3(a) plots the estimated $0.5\,\widehat\phi(c)$, that is, the effect of a 0.5 increase in peer quality at each value of the transition score $c$. The graph reveals heterogeneous marginal effects of ability on returns to school quality. The heterogeneity of treatment effects is a priori unknown, and the ATE estimator proposed in this paper is consistent for $\mu$ regardless of the shape of $\phi(c)$. This highlights the empirical relevance of Theorem 2 and the importance of the second-step estimation. In other words, the common strategy of normalizing all cutoffs to zero and estimating one discontinuity using the pooled data is not consistent for $\mu$ when $\phi(c)$ has such heterogeneity.
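Once $\widehat\phi$ has been computed on a grid of scores, the ATE in Equation (37) is a one-dimensional integral that can be evaluated numerically. A minimal sketch, with a placeholder in place of the actual second-step estimates:

```python
import numpy as np

grid = np.linspace(6.5, 8.5, 201)            # counterfactual support of scores
phi_hat = 1.0 + 0.1 * (grid - 7.5) ** 2      # placeholder for the estimated phi

# trapezoidal rule for mu = (1/2) * integral of phi over [6.5, 8.5]
mu_hat = 0.5 * np.sum((phi_hat[1:] + phi_hat[:-1]) / 2 * np.diff(grid))
```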
In other words, the commonstrategy of normalizing all cutoffs to zero and estimating one discontinuity using the pooled datais not consistent for µ when φ ( c ) has such heterogeneity.28igure 3: Treatment Effect Function(a) . ph i ( c ) (b) . ph i ( c ) Notes: Estimated average treatment effect function for a 0 . c . The figure plots b β ( c, d, d + 0 .
5) = 0 . b φ ( c ) for c ∈ [6 . , . φ function is estimated non-parametrically withbias correction following Section 3.2 (sharp case). Panel (b) displays the effect on ever-compliers of the same uniformchange in treatment dose. The φ function is estimated parametrically with bias correction following the iteratedprocedure of Section B.5.2 in the supplemental appendix (fuzzy case). Estimation of treatment effects on ever-compliers requires a parametric functional form on β ec (Theorem 4). I assume β ec ( c, d, d ′ ) = θ ( d ′ − d )+ θ c ( d ′ − d )+ θ c ( d ′ − d )+ θ c ( d ′ − d ) and carry outthe iterative estimation procedure described in the supplemental appendix’s Section B.5.2. Thealgorithm achieves convergence of θ s within 30 iterations. The iterated bias-corrected ATE onever-compliers equals 0 .
107 with standard error of 0 . .
5. Compared to ITT effects in Figure 3(a), the return of better schooling onever-compliers is also positive, but much less heterogeneous across ability levels.
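Under the cubic specification above, the second-step regressors at each cutoff take a simple form. A sketch follows (the arrays c, u, and B_hat are hypothetical, the $\omega_{j,l}$ weighting across multiple complier dose switches is suppressed for simplicity, and the full iterated procedure of Section B.5.2 is not reproduced):

```python
import numpy as np

def W_tilde_cubic(c, u):
    """c: (K,) cutoff values; u: (K,) dose changes d_j - d_{j-1}.
    Row j is the second-step regressor for beta_ec(c_j, d_{j-1}, d_j) under
    the cubic spec theta_0*u + theta_1*c*u + theta_2*c^2*u + theta_3*c^3*u."""
    return np.column_stack([u, c * u, c ** 2 * u, c ** 3 * u])

# with jump estimates B_hat, a single (non-iterated) second step would be:
# theta_hat = np.linalg.lstsq(W_tilde_cubic(c, u), B_hat, rcond=None)[0]
```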
7 Conclusion

The difficulty of gathering experimental data in many fields within the social sciences makes quasi-experimental techniques such as RDD extremely important for evaluating policies and social programs. RDD has been used in a wide range of applications in economics since the late 1990s. More recently, there has been an increasing number of applications with one forcing variable and multiple cutoffs assigning individuals to heterogeneous treatments. The demand for multi-cutoff RDD methods is constantly growing as richer data sets become ever more available.

This paper states conditions under which multiple RDD effects are combined to infer ATEs over the entire range of cutoff values. The proposed estimator is consistent and asymptotically normal for ATEs over the entire support of variation in cutoffs and treatment doses. Asymptotic results are derived under a large number of observations and cutoffs in the sharp case of non-parametric treatment effect functions. Sufficient conditions on the rate of growth of the number of cutoffs, relative to the number of observations, are given. These rate conditions determine the feasible choice set of tuning parameters. This paper also shows that non-parametric identification in fuzzy RDD with multiple cutoffs is impossible unless the treatment effect function is finite-dimensional, or there is large variation of cutoff-dose values. A parametric specification provides an MSE-optimal ATE estimator for the fuzzy case that is consistent and asymptotically normal.

The relevance of the ATE estimators proposed in this paper is illustrated with the data of Pop-Eleches and Urquiola (2013) on high school assignment in Romania. Of interest is the effect of high school quality on the academic performance of students. I find strong evidence of non-linearities in the returns to better schooling as a function of students' ability level. Monte Carlo simulations demonstrate that such non-linearities severely bias a naive average of local effects that does not use the correct weighting scheme proposed in this paper. Applying the fuzzy RDD methods to the Romanian data reveals causal effects on ever-compliers that are smaller and less heterogeneous than ITT effects.

The proposed estimator converges at the minimax optimal rate of root-$n$, as long as first-step bandwidths converge to zero at the $1/K$ rate. It would be interesting to learn about the efficiency properties of the ATE estimator. Theoretical tools commonly employed to derive efficiency lower bounds may not be immediately applicable to the setting of this paper: these tools are designed for regular estimators, and for data drawn from a population where the parameter of interest is identified. In contrast, the fixed-cutoff RDD design relies on an "identification at infinity" argument, and I wonder about the sufficient conditions that would yield regularity of the ATE estimator. A possibility for future work is to generalize the uniform convergence tools from this paper to arrive at such conditions.

I am indebted to Han Hong, Caroline Hoxby, and Guido Imbens for invaluable advice. The paper also benefited from feedback received from seminar participants at Stanford, Boston University, Cambridge, Iowa, Notre Dame, CORE-UcLouvain, UCSD, UC Davis, FGV-EESP, FGV-EPGE, Insper, PUC-Rio, Toulouse, UIUC, and at various conferences. I thank Tim Bresnahan, Arun Chandrasekhar, Michael Dinerstein, Ivan Fernandez-Val, Ivan Korolev, Michael Leung, Huiyu Li, Jessie Li, Petra Moser, Stephen Terry, Xiaowei Yu, and anonymous referees for suggestions and comments.
I gratefully acknowledge the financial support received from the B.F. Haley and E.S. Shaw Fellowship at SIEPR-Stanford, CORE-UcLouvain, ISLA-Notre Dame, and while visiting the Kenneth C. Griffin Department of Economics at the University of Chicago.
References
Agarwal, Sumit, Souphala Chomsisengphet, Neale Mahoney, and Johannes Stroebel (2017) "Do Banks Pass Through Credit Expansions to Consumers Who Want to Borrow?" Quarterly Journal of Economics, Vol. 133, No. 1, pp. 129-190.
Andrews, D.W.K. (1994) "Empirical Process Methods in Econometrics," in Engle, R.F. and D.L. McFadden eds. Handbook of Econometrics, Vol. 4: North Holland, Chap. 37, pp. 2247-2294.
Angrist, J.D. (2004) "Treatment Effect Heterogeneity in Theory and Practice," Economic Journal, Vol. 114, pp. C52-C83.
Angrist, J.D. and V. Lavy (1999) "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement," Quarterly Journal of Economics, Vol. 114, No. 2, pp. 533-575.
Angrist, J.D. and Miikka Rokkanen (2015) "Wanna Get Away? Regression Discontinuity Estimation of Exam School Effects Away from the Cutoff," Journal of the American Statistical Association, Vol. 110, No. 512, pp. 1331-1344.
Angrist, Joshua D and Jörn-Steffen Pischke (2008) Mostly Harmless Econometrics: An Empiricist's Companion: Princeton University Press.
Bajari, Patrick, Han Hong, Minjung Park, and Robert Town (2017) "Estimating Price Sensitivity of Economic Agents Using Discontinuity in Nonlinear Contracts," Quantitative Economics, Vol. 8, No. 2, pp. 397-433.
Bertanha, Marinho and Guido Imbens (2019) "External Validity in Fuzzy Regression Discontinuity Designs," Journal of Business and Economic Statistics, forthcoming.
Bertanha, Marinho and Marcelo J Moreira (2019) "Impossible Inference in Econometrics: Theory and Applications," Journal of Econometrics, forthcoming.
Black, Dan A, Jose Galdo, and Jeffrey A Smith (2007) "Evaluating the Worker Profiling and Reemployment Services System Using a Regression Discontinuity Approach," American Economic Review, Vol. 97, No. 2, pp. 104-107.
Black, S.E. (1999) "Do Better Schools Matter? Parental Valuation of Elementary Education," Quarterly Journal of Economics, Vol. 114, No. 2, pp. 577-599.
Calonico, Sebastian, Matias D Cattaneo, and Max H Farrell (2018) "Optimal Bandwidth Choice for Robust Bias Corrected Inference in Regression Discontinuity Designs," arXiv preprint arXiv:1809.00236.
Calonico, Sebastian, Matias D Cattaneo, and Rocio Titiunik (2014) "Robust Nonparametric Confidence Intervals for Regression-discontinuity Designs," Econometrica, Vol. 82, No. 6, pp. 2295-2326.
Cattaneo, Matias D, Rocio Titiunik, Gonzalo Vazquez-Bare, and Luke Keele (2016) "Interpreting Regression Discontinuity Designs with Multiple Cutoffs," Journal of Politics, Vol. 78, No. 4, pp. 1229-1248.
Chen, Susan and Wilbert Van der Klaauw (2008) "The Work Disincentive Effects of the Disability Insurance Program in the 1990s," Journal of Econometrics, Vol. 142, No. 2, pp. 757-784.
Cheng, Ming-Yen, Jianqing Fan, and James S Marron (1997) "On Automatic Boundary Corrections," Annals of Statistics, Vol. 25, No. 4, pp. 1691-1708.
De Giorgi, Giacomo, Andres Drenik, and Enrique Seira (2017) "Sequential Banking: Direct and Externality Effects on Delinquency," CEPR Discussion Paper No. DP12280.
De La Mata, Dolores (2012) "The Effect of Medicaid Eligibility on Coverage, Utilization, and Children's Health," Health Economics, Vol. 21, No. 9, pp. 1061-1079.
Dobkin, Carlos and Fernando Ferreira (2010) "Do School Entry Laws Affect Educational Attainment and Labor Market Outcomes?" Economics of Education Review, Vol. 29, No. 1, pp. 40-54.
Dong, Yingying (2018a) "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs," Oxford Bulletin of Economics and Statistics, Vol. 80, No. 5, pp. 1020-1027.
Dong, Yingying (2018b) "Jump or Kink? Regression Probability Jump and Kink Design for Treatment Effect Evaluation," Working Paper, University of California, Irvine.
Dong, Yingying and Arthur Lewbel (2015) "Identifying the Effect of Changing the Policy Threshold in Regression Discontinuity Models," Review of Economics and Statistics, Vol. 97, No. 5, pp. 1081-1092.
Duflo, E., P. Dupas, and M. Kremer (2011) "Peer Effects, Teacher Incentives, and the Impact of Tracking: Evidence from a Randomized Evaluation in Kenya," American Economic Review, Vol. 101, No. 5, pp. 1739-1774.
Egger, Peter and Marko Koethenbuerger (2010) "Government Spending and Legislative Organization: Quasi-experimental Evidence from Germany," American Economic Journal: Applied Economics, Vol. 2, No. 4, pp. 200-212.
Fan, J. and I. Gijbels (1996) Local Polynomial Modelling and Its Applications, Chapman & Hall/CRC Monographs on Statistics & Applied Probability: Taylor & Francis.
Frandsen, R., M. Frölich, and B. Melly (2012) "Quantile Treatment Effects in the Regression Discontinuity Design," Journal of Econometrics, Vol. 168, No. 2, pp. 382-395.
Garibaldi, P., F. Giavazzi, A. Ichino, and E. Rettore (2012) "College Cost and Time to Obtain a Degree: Evidence from Tuition Discontinuities," Review of Economics and Statistics, Vol. 94, No. 3, pp. 699-711.
Hahn, J., P. Todd, and W. Van der Klaauw (2001) "Identification and Estimation of Treatment Effects with a Regression-discontinuity Design," Econometrica, Vol. 69, No. 1, pp. 201-209.
Hastings, Justine S, Christopher A Neilson, and Seth D Zimmerman (2013) "Are Some Degrees Worth More Than Others? Evidence From College Admission Cutoffs In Chile," NBER Working Paper 19241.
Hoekstra, Mark (2009) "The Effect of Attending the Flagship State University on Earnings: a Discontinuity-based Approach," Review of Economics and Statistics, Vol. 91, No. 4, pp. 717-724.
Hoxby, C.M. (2000) "The Effects of Class Size on Student Achievement: New Evidence from Population Variation," Quarterly Journal of Economics, Vol. 115, No. 4, pp. 1239-1285.
Imbens, Guido and Karthik Kalyanaraman (2012) "Optimal Bandwidth Choice For The Regression Discontinuity Estimator," Review of Economic Studies, Vol. 79, No. 3, pp. 933-959.
Imbens, Guido W and Thomas Lemieux (2008) "Regression Discontinuity Designs: a Guide to Practice," Journal of Econometrics, Vol. 142, No. 2, pp. 615-635.
Imbens, Guido W and Donald B Rubin (1997) "Estimating Outcome Distributions for Compliers in Instrumental Variables Models," Review of Economic Studies, Vol. 64, No. 4, pp. 555-574.
Van der Klaauw, Wilbert (2002) "Estimating the Effect of Financial Aid Offers on College Enrollment: A Regression-discontinuity Approach," International Economic Review, Vol. 43, No. 4, pp. 1249-1287.
Lazear, E. (2001) "Educational Production," Quarterly Journal of Economics, Vol. 116, No. 3, pp. 777-803.
Lipman, Yaron, Daniel Cohen-Or, and David Levin (2006) "Error Bounds and Optimal Neighborhoods for MLS Approximation," in Polthier, Konrad and Alla Sheffer eds. Eurographics Symposium on Geometry Processing.
McCrary, J. (2008) "Manipulation of the Running Variable in the Regression Discontinuity Design: a Density Test," Journal of Econometrics, Vol. 142, No. 2, pp. 698-714.
McCrary, Justin and Heather Royer (2011) "The Effect of Female Education on Fertility and Infant Health: Evidence from School Entry Policies Using Exact Date of Birth," American Economic Review, Vol. 101, No. 1, pp. 158-195.
Newey, Whitney K (1994) "Kernel Estimation of Partial Means and a General Variance Estimator," Econometric Theory, Vol. 10, No. 02, pp. 1-21.
Pollard, D. (1984) Convergence of Stochastic Processes: Springer.
Pop-Eleches, C. and M. Urquiola (2013) "Going to a Better School: Effects and Behavioral Responses," American Economic Review, Vol. 103, No. 4, pp. 1289-1324.
Porter, J. (2003) "Estimation in the Regression Discontinuity Model," Unpublished Manuscript, University of Wisconsin, Madison.
Rokkanen, Miikka (2015) "Exam Schools, Ability, and the Effects of Affirmative Action: Latent Factor Extrapolation in the Regression Discontinuity Design," Unpublished Manuscript, Columbia University.
Rothe, Christoph (2012) "Partial Distributional Policy Effects," Econometrica, Vol. 80, No. 5, pp. 2269-2301.
Sun, Yixiao (2005) "Adaptive Estimation of the Regression Discontinuity Model," Working Paper Available at SSRN: 739151.
Tsybakov, Alexandre B (2009) Introduction to Nonparametric Estimation: Translated from French by Vladimir Zaiats. Springer Series in Statistics, New York.
Van Der Vaart, Aad W and Jon A Wellner (1996) Weak Convergence and Empirical Processes: Springer.
A Appendix
Throughout the appendices, $M$ is used as a generic finite and positive constant in the proofs. For a $p \times q$ matrix $A$, the norm of $A$ is induced by the Euclidean norm $\|\cdot\|$, i.e., $\|A\| = \max_{x \in \mathbb{R}^q,\, x \neq 0} \|Ax\| / \|x\|$. The determinant of matrix $A$ is denoted $\det(A)$. References to the supplemental appendix include B in the numbering; for example, Lemma B.1 or Table B.2.

A.1 Proof of Theorem 1
Lemma B.1 derives the asymptotic normality of the bias-corrected jump-discontinuity estimator at one cutoff, based on local polynomial regressions of a vector $\mathbf{Y}_i$ on a scalar forcing variable $X_i$. The proof of Theorem 1 is a straightforward generalization of Lemma B.1 in the particular case of a scalar $Y_i$. As the sample size increases and the number of cutoffs remains fixed, the jump-discontinuity estimators are independent across cutoffs. First apply Lemma B.1 to each cutoff individually, and then aggregate over cutoffs. □

A.2 Proof of Lemma 2
Define $\mathcal{C} = [\underline X, \bar X] \times [\underline D, \bar D] \times [\underline D, \bar D]$. Consider the partition of $\mathcal{C}$ made of the set of non-intersecting cubicles $T_n = \{C_1, \ldots, C_M\}$ with $M = n^3$, $n \in \{1, 2, \ldots\}$. Each $C_j$ is a half-open cubicle of the form $[x_{l-1}, x_l) \times [y_{m-1}, y_m) \times [z_{o-1}, z_o)$ with sides of lengths equal to $(\bar X - \underline X)/n$, $(\bar D - \underline D)/n$, and $(\bar D - \underline D)/n$. Define the sub-collection $U_n = \{C \in T_n : C \subset \mathcal{C}\} = \{A_1, \ldots, A_Q\}$. Since $\mathcal{C}_\infty$ is dense in $\mathcal{C}$, for every $A_j \in U_n$, find a point $\mathbf{c}_j \in \mathcal{C}_\infty \cap A_j$ for which $\beta(\mathbf{c}_j)$ is known. The sum $\mu_n = \sum_{j=1}^{Q} \omega(\mathbf{c}_j)\, \beta(\mathbf{c}_j)\, (\bar X - \underline X)(\bar D - \underline D)^2 / n^3$ converges to $\mu_c$ as $n \to \infty$ because $\omega(\mathbf{c})\beta(\mathbf{c})$ is Riemann integrable on $\mathcal{C}$. □
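Numerically, the approximation in this proof behaves like any Riemann sum. An illustrative one-dimensional check, with hypothetical choices of $\omega$ and $\beta$:

```python
import numpy as np

omega = lambda c: 0.5                  # hypothetical counterfactual density
beta = lambda c: 1.0 + 0.1 * c         # hypothetical treatment-effect function

for n in (10, 100, 1000):
    edges = np.linspace(6.5, 8.5, n + 1)   # partition into n "cubicles"
    mids = (edges[:-1] + edges[1:]) / 2    # one known point per cell
    mu_n = np.sum(omega(mids) * beta(mids) * (2.0 / n))
    print(n, mu_n)                         # approaches the Riemann integral
```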
A.3 Proof of Theorem 2

The proof combines arguments from the proof of Lemma B.1 with lemmas on the uniform convergence of empirical processes from Sections B.2 and B.3. Define $\mu^*$, $\tilde\mu$, and $\mu_n$ as follows:

\[ \mu^* = \sum_{j=1}^{K} \Delta_j \left\{ e_1' E[G_n^{j+}] \frac{1}{n h_{1j}} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j+} Y_i \widetilde H_{ji} - e_1' E[G_n^{j-}] \frac{1}{n h_{1j}} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j-} Y_i \widetilde H_{ji} \right\} \tag{A.1} \]
\[ = \sum_{j=1}^{K} \frac{\Delta_j}{n h_{1j}} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) Y_i\, e_1' \left( v_i^{j+} E[G_n^{j+}] - v_i^{j-} E[G_n^{j-}] \right) \widetilde H_{ji} \tag{A.2} \]
\[ \tilde\mu = \sum_{j=1}^{K} \Delta_j \left\{ e_1' G_n^{j+} E\!\left[ \frac{1}{n h_{1j}} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j+} \overline Y_i^{j+} \widetilde H_{ji} \right] - e_1' G_n^{j-} E\!\left[ \frac{1}{n h_{1j}} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j-} \overline Y_i^{j-} \widetilde H_{ji} \right] \right\} \]
\[ \mu_n = \sum_{j=1}^{K} \Delta_j B_j. \tag{A.3} \]

Write

\[ \frac{\widehat\mu - B_{1n} - B_{2n} - \mu_c}{(V_n^c)^{1/2}} = \frac{\mu^* - E[\mu^* \mid \mathcal{X}_n]}{(V_n^c)^{1/2}} \tag{A.4} \]
\[ + \frac{\tilde\mu - B_{1n}}{(V_n^c)^{1/2}} \tag{A.5} \]
\[ + \frac{\mu_n - B_{2n} - \mu_c}{(V_n^c)^{1/2}} \tag{A.6} \]
\[ + \frac{\widehat\mu - E[\widehat\mu \mid \mathcal{X}_n] - (\mu^* - E[\mu^* \mid \mathcal{X}_n])}{(V_n^c)^{1/2}} \tag{A.7} \]
\[ + \frac{E[\widehat\mu - \mu_n \mid \mathcal{X}_n] - \tilde\mu}{(V_n^c)^{1/2}}. \tag{A.8} \]

The proof in this appendix applies a central limit theorem (CLT) to show that part (A.4) converges in distribution to a standard normal; it demonstrates that $B_{1n}$ approximates the first-step bias, that is, that part (A.5) converges in probability to zero; and it shows that $B_{2n}$ approximates the second-step bias (integration error), that is, that part (A.6) converges to zero. Lemma B.7 shows that parts (A.7) and (A.8) converge in probability to zero.

Part (A.4)

First, find the rate at which $(V_n^c)^{-1/2}$ grows. Define $\phi_n$ and rewrite $V_n^c$ as follows:

\[ \phi_n(X_i) = \sum_{j=1}^{K} \frac{\Delta_j}{n h_{1j}}\, k\!\left(\frac{X_i - c_j}{h_{1j}}\right) e_1' \left( v_i^{j+} E[G_n^{j+}] - v_i^{j-} E[G_n^{j-}] \right) \widetilde H_{ji} \tag{A.9} \]
\[ V_n^c = \sum_{i=1}^{n} E\!\left[ \varepsilon_i^2\, \phi_n(X_i)^2 \right]. \tag{A.10} \]

Choose alternative bandwidths $h_{1j}^*$, $j = 1, \ldots, K$, such that (i) there exists $\delta > 0$ (independent of $n$) such that $\delta < h_{1j}^*/h_{1j} \le 1$ for all $j$; and (ii) $[c_j - h_{1j}^*, c_j + h_{1j}^*] \cap [c_{j'} - h_{1j'}^*, c_{j'} + h_{1j'}^*] = \emptyset$ for any $j \neq j'$. Then

\[ V_n^c = n\, E\!\left[ \zeta(X_i)\, \phi_n(X_i)^2 \right] \ge n \sum_{j=1}^{K} \int_{c_j - h_{1j}^*}^{c_j + h_{1j}^*} \zeta(x)\, \phi_n(x)^2 f(x)\, dx \tag{A.11} \]
\[ = n \sum_{j=1}^{K} \int_{c_j - h_{1j}^*}^{c_j + h_{1j}^*} \zeta(x) \left( \frac{\Delta_j}{n h_{1j}}\, k\!\left(\frac{x - c_j}{h_{1j}}\right) e_1' \left( I\{x \ge c_j\} E[G_n^{j+}] - I\{x < c_j\} E[G_n^{j-}] \right) H\!\left(\frac{x - c_j}{h_{1j}}\right) \right)^2 f(x)\, dx \tag{A.12} \]
\[ = \frac{1}{K n} \frac{1}{K} \sum_{j=1}^{K} \frac{(K \Delta_j)^2}{h_{1j}} \int_{-h_{1j}^*/h_{1j}}^{h_{1j}^*/h_{1j}} \zeta(c_j + u h_{1j})\, k(u)^2 \left( e_1' \left( I\{u \ge 0\} E[G_n^{j+}] - I\{u < 0\} E[G_n^{j-}] \right) H(u) \right)^2 f(c_j + u h_{1j})\, du \ge \frac{M}{K n \bar h_1}, \tag{A.13} \]

where the first inequality follows from the integrand being positive and $\cup_{j=1}^K [c_j - h_{1j}^*, c_j + h_{1j}^*] \subseteq \cup_{j=1}^K [c_j - h_{1j}, c_j + h_{1j}]$; the third equality uses the change of variables $u = (x - c_j)/h_{1j}$; and the last inequality follows because (a) $h_{1j} \le \bar h_1$; (b) $K \Delta_j$ is bounded away from zero uniformly over $j$ (Lemma B.9); and (c) each integral is bounded away from zero over $j$ because the integration limits, $\zeta(c_j + u h_{1j})$, $E[G_n^{j\pm}]$, and $f(c_j + u h_{1j})$ are uniformly close to quantities that are positive definite uniformly over $j$ (see Lemma B.6, and recall that $f$ and $\zeta$ are bounded away from zero because of Assumptions 4 and 7).
The inequality in (A.13) implies that $(V_n^c)^{-1} = O(K n \bar h_1)$, where $K n \bar h_1 \to \infty$.

Second, write part (A.4) as a weighted sum across $i$:

\[ \mu^* = \sum_{i=1}^{n} Y_i \underbrace{\sum_{j=1}^{K} \frac{\Delta_j}{n h_{1j}}\, k\!\left(\frac{X_i - c_j}{h_{1j}}\right) e_1' \left( v_i^{j+} E[G_n^{j+}] - v_i^{j-} E[G_n^{j-}] \right) \widetilde H_{ji}}_{\equiv\, \phi_n(X_i)} = \sum_{i=1}^{n} Y_i\, \phi_n(X_i), \tag{A.14} \]

so that

\[ \frac{\mu^* - E[\mu^* \mid \mathcal{X}_n]}{(V_n^c)^{1/2}} = \frac{\sum_{i=1}^{n} (Y_i - E[Y_i \mid X_i])\, \phi_n(X_i)}{(V_n^c)^{1/2}} = \frac{\sum_{i=1}^{n} \varepsilon_i\, \phi_n(X_i)}{(V_n^c)^{1/2}}. \tag{A.15} \]

Equation A.15 is a sum of iid random variables with zero mean, where $V_n^c$ is the variance of the numerator. The Lindeberg condition is verified next. Take an arbitrary $\delta > 0$:

\[ \sum_{i=1}^{n} E\!\left[ (V_n^c)^{-1} \varepsilon_i^2 \phi_n(X_i)^2\, I\!\left\{ \left| (V_n^c)^{-1/2} \varepsilon_i \phi_n(X_i) \right| > \delta \right\} \right] \tag{A.16} \]
\[ \le \sum_{i=1}^{n} E\!\left[ M K n \bar h_1\, \phi_n(X_i)^2\, I\!\left\{ M' (K n \bar h_1)^{1/2} |\phi_n(X_i)| > \delta \right\} \right] \tag{A.17} \]
\[ \le \sum_{i=1}^{n} E\!\left[ M K n \bar h_1\, (K n \underline h_1)^{-2}\, I\!\left\{ M' (K n \bar h_1)^{1/2} (K n \underline h_1)^{-1} > \delta \right\} \right] \tag{A.18} \]
\[ \le M (K \underline h_1)^{-1}\, I\!\left\{ M' (K n \underline h_1)^{-1/2} > \delta \right\} = o(1), \tag{A.19} \]

where the first inequality relies on the fact that $\varepsilon_i$ is a.s. bounded (Assumption 7) and that $(V_n^c)^{-1} = O(K n \bar h_1)$ (Equation A.13). The second inequality uses that $\phi_n(x) = O((K n \underline h_1)^{-1})$ uniformly over $x$; in fact, $\phi_n(x)$ is a sum of $K$ components of which at most two are non-zero, $\Delta_j = O(K^{-1})$ uniformly over $j$ (Lemma B.9), $k(\cdot)$ is bounded (Assumption 3), and $E[G_n^{j\pm}]$ is uniformly close to $G^{j\pm}$, whose norm is bounded away from zero (Lemma B.6). The last inequality relies on the rate condition $\bar h_1 / \underline h_1 = O(1)$, and the indicator becomes zero for large $n$. The Lindeberg-Feller CLT says that Equation A.15, and thus part (A.4), converges in distribution to a standard normal.

Part (A.5)

First consider

\[ E\!\left[ \frac{1}{h_{1j}} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j+} \widetilde H_{ji}\, E\!\left[ \overline Y_i^{j+} \mid X_i \right] \right] \tag{A.20} \]
\[ = E\!\left[ \frac{1}{h_{1j}} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j+} \widetilde H_{ji}\, \frac{\nabla_x^{(\rho_1+1)} R(c_j, d_j)}{(\rho_1 + 1)!} \left(\frac{X_i - c_j}{h_{1j}}\right)^{\rho_1+1} h_{1j}^{\rho_1+1} \right] \tag{A.21} \]
\[ + E\!\left[ \frac{1}{h_{1j}} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j+} \widetilde H_{ji}\, \frac{\nabla_x^{(\rho_1+2)} R(c_j^*, d_j)}{(\rho_1 + 2)!} \left(\frac{X_i - c_j}{h_{1j}}\right)^{\rho_1+2} h_{1j}^{\rho_1+2} \right] \tag{A.22} \]
\[ = h_{1j}^{\rho_1+1}\, \frac{\nabla_x^{(\rho_1+1)} R(c_j, d_j)}{(\rho_1 + 1)!}\, f(c_j)\, \gamma^* + O\!\left( \bar h_1^{\rho_1+2} \right), \tag{A.23} \]

where $E[\overline Y_i^{j+} \mid X_i]$ is the difference between $E[Y_i \mid X_i]$ and its $\rho_1$-th order Taylor expansion around $X_i = c_j$ (see Equations B.37 and B.38). The expectations in Equations A.21 and A.22, without the $h_{1j}^{\rho_1+1}$ and $h_{1j}^{\rho_1+2}$ terms, are bounded over $j$ because the kernel, derivatives, and polynomials are bounded functions of $u = (x - c_j) h_{1j}^{-1}$ (Assumptions 3 and 7). The remainder term $O(\bar h_1^{\rho_1+2})$ is uniform over $j$.

Next,

\[ \frac{\tilde\mu - B_{1n}}{(V_n^c)^{1/2}} = (V_n^c)^{-1/2} \sum_{j=1}^{K} \Delta_j\, e_1' G_n^{j+} E\!\left[ \frac{1}{n h_{1j}} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j+} E\!\left[ \overline Y_i^{j+} \mid X_i \right] \widetilde H_{ji} \right] - (V_n^c)^{-1/2} B_{1n}^{+} \tag{A.24} \]
\[ - (V_n^c)^{-1/2} \sum_{j=1}^{K} \Delta_j\, e_1' G_n^{j-} E\!\left[ \frac{1}{n h_{1j}} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j-} E\!\left[ \overline Y_i^{j-} \mid X_i \right] \widetilde H_{ji} \right] + (V_n^c)^{-1/2} B_{1n}^{-}, \tag{A.25} \]

where

\[ B_{1n} = B_{1n}^{+} - B_{1n}^{-} \tag{A.26} \]
\[ B_{1n}^{+} = ((\rho_1 + 1)!)^{-1} \sum_{j=1}^{K} h_{1j}^{\rho_1+1} \Delta_j f(c_j)\, \nabla_x^{\rho_1+1} R(c_j, d_j)\, e_1' G_n^{j+} \gamma^* \tag{A.27} \]
\[ B_{1n}^{-} = ((\rho_1 + 1)!)^{-1} \sum_{j=1}^{K} h_{1j}^{\rho_1+1} \Delta_j f(c_j)\, \nabla_x^{\rho_1+1} R(c_j, d_{j-1})\, e_1' G_n^{j-} \gamma^*. \tag{A.28} \]

Consider part (A.24).
Part (A.25) follows a symmetric argument.

\[ (A.24) = (V_n^c)^{-1/2} \sum_{j=1}^{K} \Delta_j\, e_1' G_n^{j+} E\!\left[ \frac{1}{n h_{1j}} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_{1j}}\right) v_i^{j+} E\!\left[ \overline Y_i^{j+} \mid X_i \right] \widetilde H_{ji} \right] - (V_n^c)^{-1/2} B_{1n}^{+} \tag{A.29} \]
\[ = (V_n^c)^{-1/2} \sum_{j=1}^{K} \Delta_j\, e_1' G_n^{j+}\, h_{1j}^{\rho_1+1}\, \frac{\nabla_x^{(\rho_1+1)} R(c_j, d_j)}{(\rho_1+1)!}\, f(c_j)\, \gamma^* - (V_n^c)^{-1/2} B_{1n}^{+} \tag{A.30} \]
\[ + (V_n^c)^{-1/2} \sum_{j=1}^{K} \Delta_j\, e_1' G_n^{j+}\, O\!\left( \bar h_1^{\rho_1+2} \right) \tag{A.31} \]
\[ = 0 \tag{A.32} \]
\[ + O\!\left( (K n \bar h_1)^{1/2} \right) K\, O(K^{-1})\, O_P(1)\, O\!\left( \bar h_1^{\rho_1+2} \right) = o_P(1), \tag{A.33} \]

where the second equality uses the expansion in Equation A.23, and the third equality uses the definition of $B_{1n}^{+}$, that $\Delta_j = O(K^{-1})$ uniformly over $j$, and that $G_n^{j+} = O_P(1)$. These terms are $o_P(1)$ because of the rate condition $(K n \bar h_1)^{1/2}\, \bar h_1^{\rho_1+1} = O(1)$.

Part (A.6)

\[ \frac{\mu_n - B_{2n} - \mu_c}{(V_n^c)^{1/2}} = O\!\left( (K n \bar h_1)^{1/2} \right) \left[ \sum_{j=1}^{K} \Delta_j B_j - B_{2n} - \int_{\mathcal{C}} \omega(\mathbf{c})\, \beta(\mathbf{c})\, d(\mathbf{c}) \right] \tag{A.34} \]
\[ = O\!\left( (K n \bar h_1)^{1/2} \right) O\!\left( h_2^{\rho_2+2} \right) = O(1)\, O(h_2) = o(1), \tag{A.35} \]

where the first equality uses the rate on $(V_n^c)^{-1/2}$ (Equation A.13). The second equality applies Lemma B.9 and relies on Assumption 6 (asymptotic behavior of $\{c_j\}_j$) and Assumption 7 (smoothness of $\beta(\mathbf{c})$). The third equality uses the rate condition $(K n \bar h_1)^{1/2} h_2^{\rho_2+1} = O(1)$. Lemma B.9 also shows that $B_{2n} = O(h_2^{\rho_2+1})$, which yields $(V_n^c)^{-1/2} B_{2n} = O((K n \bar h_1)^{1/2} h_2^{\rho_2+1}) = O(1)$.

Lemma B.7 shows that parts (A.7) and (A.8) converge in probability to zero, which concludes the proof. □

A.4 Proof of Theorem 3
Part (27)

First, consider the ideal setting where estimators $\mu^*$ are functions of data observed from $\{Y_i(d)\}_{d \in \mathcal{D}}$ and $X_i$. For a choice of loss function $L(\mu, \mu')$, the minimax risk of estimating the parameter $\mu_c(P)$ is defined as $\inf_{\mu^*} \sup_{P \in \mathcal{P}} E_P[L(\mu^*, \mu_c(P))]$. Here, the 0-1 loss function is used, that is, $L_n(\mu, \mu') = I\{n^r |\mu - \mu'| > \epsilon\}$, for a positive rate $r$ and $\epsilon$. In this case, $E_P[L_n(\mu^*, \mu_c(P))] = P_P[n^r |\mu^* - \mu_c(P)| > \epsilon]$. The minimax risk is the supremum probability over $\mathcal{P}$ of an estimator being farther than $\epsilon n^{-r}$ from the truth, minimized over all possible estimators $\mu^*$. The rate $r$ is an upper bound on the rate of convergence if, for small $\epsilon > 0$, there exists $L \in (0, 1)$ such that $\inf_{\mu^*} \sup_{P \in \mathcal{P}} P_P[n^r |\mu^* - \mu_c(P)| > \epsilon] \ge L$ for large $n$. The rate $r$ is the minimax optimal rate if it is an upper bound and achievable; that is, if there exists an estimator $\widehat\mu$ that converges at rate $r$ uniformly. The estimator $\widehat\mu$ converges at rate $r$ uniformly if, for any small $\delta > 0$, there exists a large $\epsilon \in (0, \infty)$ such that $\sup_{P \in \mathcal{P}} P_P[n^r |\widehat\mu - \mu_c(P)| > \epsilon] < \delta$ for large $n$. See the discussion in Chapter 2 of Tsybakov (2009).

One common approach to compute lower bounds for the minimax risk is to use Le Cam's method. For $\epsilon > 0$, choose two models $P, Q \in \mathcal{P}$ such that $|\mu_c(P) - \mu_c(Q)| > \epsilon n^{-r}$. Le Cam's method leads to the following inequality:

\[ \inf_{\mu^*} \sup_{P \in \mathcal{P}} P_P\!\left[ n^r |\mu^* - \mu_c(P)| > \epsilon/2 \right] \ge \tfrac{1}{4}\, e^{-n\, KL(P,Q)}, \]

where $KL(P, Q)$ is the Kullback-Leibler divergence between $P$ and $Q$. See Equations (2.7) and (2.9) and Theorem 2.2(iii) of Tsybakov (2009). This inequality is used to prove part (27) with $r = 1/2$.

The researcher must choose a counterfactual density $\omega_c(\mathbf{c})$ such that its marginal densities $\int \omega_c(c, d, d')\, d(d')$ and $\int \omega_c(c, d, d')\, d(d)$ are different functions; otherwise, $\mu_c = 0$. Construct an infinitely differentiable bounded function $g(c, d) \ge 0$ such that $\int [g(c, d') - g(c, d)]\, \omega(\mathbf{c})\, d\mathbf{c} = 1$. Construct two models $P, Q \in \mathcal{P}$ as follows. Let $\varepsilon_i \sim N(0, 1)$ and $X_i \sim U[0, 1]$, iid and independent of each other. Pick $\xi > 2\sqrt{\pi}\, \epsilon > 0$. For model $P$, define $Y_i(d) = \Phi\!\left( \xi n^{-1/2} g(X_i, d) + \varepsilon_i \right)$, where $\Phi$ is the standard normal cdf. For model $Q$, define $Y_i(d) = \Phi(\varepsilon_i)$. The expectation of $Y_i(d)$ conditional on $X_i = c$, that is, $R(c, d)$, is an infinitely differentiable function. The variables have bounded support, and models $P$ and $Q$ satisfy all the conditions to be in $\mathcal{P}$. Under model $P$,

\[ \beta(\mathbf{c}; P) = E_P[Y_i(d') - Y_i(d) \mid X_i = c] = E_P\!\left[ \Phi\!\left( \xi n^{-1/2} g(X_i, d') + \varepsilon_i \right) - \Phi\!\left( \xi n^{-1/2} g(X_i, d) + \varepsilon_i \right) \mid X_i = c \right] = E_P\!\left[ \phi(\varepsilon_i^*)\, \xi n^{-1/2} (g(c, d') - g(c, d)) \mid X_i = c \right] = E_P[\phi(\varepsilon_i^*)]\, \xi n^{-1/2} (g(c, d') - g(c, d)), \]

where $\phi$ is the standard normal pdf, and $\varepsilon_i^*$ is in between $\varepsilon_i + \xi n^{-1/2} g(c, d')$ and $\varepsilon_i + \xi n^{-1/2} g(c, d)$. As $n$ grows large, $E_P[\phi(\varepsilon_i^*)] = E_P[\phi(\varepsilon_i)] + o(1) = \frac{1}{2\sqrt{\pi}} + o(1)$, where the $o(1)$ term is uniform over $(c, d, d')$. Then, $\mu_c(P) = \frac{1}{2\sqrt{\pi}}\, \xi n^{-1/2} \int (g(c, d') - g(c, d))\, \omega(\mathbf{c})\, d\mathbf{c} + o(n^{-1/2}) = \frac{1}{2\sqrt{\pi}}\, \xi n^{-1/2} + o(n^{-1/2})$. Under model $Q$, $\beta(\mathbf{c}; Q) = 0$. Therefore,

\[ \mu_c(P) - \mu_c(Q) = \frac{1}{2\sqrt{\pi}}\, \xi n^{-1/2} + o(n^{-1/2}) > \epsilon n^{-1/2} \]

for large $n$, because $\frac{1}{2\sqrt{\pi}}\, \xi > \epsilon$.

Next, we use the following inequality:

\[ \inf_{\mu^*} \sup_{P \in \mathcal{P}} P_P\!\left[ n^{1/2} |\mu^* - \mu_c(P)| > \epsilon/2 \right] \ge \tfrac{1}{4}\, e^{-n\, KL(P,Q)}. \]

Let $d^*$ be such that $g(\cdot, d^*) > 0$. For simple models like $P$ and $Q$, any function of the variables $\{Y_i(d)\}_{d \in \mathcal{D}}$ and $X_i$ can be rewritten as a function of $Y_i(d^*)$ and $X_i$, because $Y_i(d)$ is a deterministic function of $Y_i(d^*)$ and $X_i$ for any $d$. It thus suffices to look at the distribution of $(Y_i(d^*), X_i)$ instead of the distribution of $\{Y_i(d)\}_{d \in \mathcal{D}}$ and $X_i$. Consider the Kullback-Leibler divergence for the distributions $P$ and $Q$ of $(Y_i(d^*), X_i)$, $KL(P, Q) = \int \log\!\left[ \frac{p(y,x)}{q(y,x)} \right] p(y, x)\, dy\, dx$, where $p(y, x)$ and $q(y, x)$ are the pdfs of $(Y_i(d^*), X_i)$ under $P$ and $Q$, respectively. Define $\widetilde Y_i = \xi n^{-1/2} g(X_i, d^*) + \varepsilon_i$ under $P$, and $\widetilde Y_i = \varepsilon_i$ under $Q$. It follows that $(Y_i(d^*), X_i) = (\Phi(\widetilde Y_i), X_i)$ under both $P$ and $Q$. The Kullback-Leibler divergence is invariant to such a transformation of variables: $KL(P, Q) = \int \log\!\left[ \frac{\tilde p(y,x)}{\tilde q(y,x)} \right] \tilde p(y, x)\, dy\, dx$, where $\tilde p(y, x) = \phi\!\left( y - \xi n^{-1/2} g(x, d^*) \right)$ and $\tilde q(y, x) = \phi(y)$ are the pdfs of $(\widetilde Y_i, X_i)$ under $P$ and $Q$, respectively. Then

\[ KL(P, Q) = \int \log\!\left[ \frac{\exp\{ -(1/2)(y - \xi n^{-1/2} g(x, d^*))^2 \}}{\exp\{ -(1/2) y^2 \}} \right] \tilde p(y, x)\, dy\, dx = \int \left[ y\, \xi n^{-1/2} g(x, d^*) - (1/2)\, \xi^2 n^{-1} g(x, d^*)^2 \right] \tilde p(y, x)\, dy\, dx = (1/2)\, \xi^2 n^{-1} \int g(x, d^*)^2\, dx. \]

Pick $\eta > 1$ such that $(1/2)\, \xi^2 \int g(x, d^*)^2\, dx < \log(\eta)$. Then, $\tfrac{1}{4} e^{-n\, KL(P,Q)} > 1/(4\eta) > 0$, and

\[ \inf_{\mu^*} \sup_{P \in \mathcal{P}} P_P\!\left[ n^{1/2} |\mu^* - \mu_c(P)| > \epsilon/2 \right] \ge \frac{1}{4\eta}. \]

This is a minimax lower bound for estimators $\mu^*$ that are functions of an ideal sample of $\{Y_i(d)\}_{d \in \mathcal{D}}$ and $X_i$. In practice, only part of these variables is observed, according to the schedule of cutoff-doses $\{\mathbf{c}_j\}_{j=1}^K$. The set of all estimators $\tilde\mu$ that are functions of the observed variables $(Y_i, X_i)$ is a subset of the set of all estimators $\mu^*$. Therefore, the lower bound above is also a minimax lower bound for all estimators $\tilde\mu$:

\[ \inf_{\tilde\mu} \sup_{P \in \mathcal{P}} P_P\!\left[ n^{1/2} |\tilde\mu - \mu_c(P)| > \epsilon/2 \right] \ge \frac{1}{4\eta}. \]

Part (28)

Let $\widehat\mu$ denote $\widehat\mu_c$, and let $\mu = \mu_c(P)$, for notational ease. The goal is to show that, for any small $\delta > 0$, there exists a large $\epsilon \in (0, \infty)$ such that $\sup_{P \in \mathcal{P}} P_P[n^{1/2} |\widehat\mu - \mu| > \epsilon] < \delta$ for large $n$. The choice of $h_1$, plus the discussion preceding Equation A.13, leads to $(V_n^c)^{-1/2} \ge M n^{1/2}$ for large $n$. Thus, $P_P[n^{1/2} |\widehat\mu - \mu| > \epsilon/M] \le P_P[(V_n^c)^{-1/2} |\widehat\mu - \mu| > \epsilon]$ uniformly over $P$ for large $n$. Theorem 2 breaks $(V_n^c)^{-1/2} |\widehat\mu - \mu|$ into four components: the CLT component $N_n$, which converges in distribution to a standard normal (part (A.4)); the first-step bias component $B_n$, which converges in probability to zero (part (A.5)); the integration error component $I_n$, which converges to zero (part (A.6)); and the remainder terms $R_n$, which converge in probability to zero (parts (A.7) and (A.8)). It is true that

\[ P_P\!\left( (V_n^c)^{-1/2} |\widehat\mu - \mu| > \epsilon \right) \le P_P(|N_n| > \epsilon/4) + P_P(|B_n| > \epsilon/4) + P_P(|I_n| > \epsilon/4) + P_P(|R_n| > \epsilon/4). \]

For any $\delta > 0$, and for large $n$ and large $\epsilon > 0$, each of the four probabilities is less than $\delta/4$ uniformly over $P$. The restrictions placed on the class of models $\mathcal{P}$, along with the proof of Theorem 2, give the result.

$N_n$-term: part (A.4) has zero mean and unit variance (see Equation A.15). Chebyshev's inequality implies that the supremum probability of the absolute value of part (A.4) being greater than $\epsilon/4$ is less than $16/\epsilon^2$ uniformly over $P$.

$B_n$-term: $B_n$ is the sum of $B_n^+$ (part (A.24)) and $B_n^-$ (part (A.25)). $B_n^+$ converges in probability to zero uniformly over $P$ because the approximations of Lemma B.6, the bounds on the derivatives of $R(x, d)$, on $f(x)$, on $\sigma^2(x, d)$, and on the rate of $(V_n^c)^{-1/2}$ hold uniformly over $P$. The weights $\Delta_j$ do not depend on $P$. The same idea applies to $B_n^-$. Thus, for $\epsilon > 0$, $\sup_{P \in \mathcal{P}} P_P(|B_n| > \epsilon/4)$ converges to zero.

$I_n$-term: uniform bounds on the partial derivatives of $\beta(\mathbf{c})$ yield a uniform bound on the approximation error of the numerical integral; see Lemma B.9. The bounds on the rate of $(V_n^c)^{-1/2}$ also hold uniformly over $P$. For every $\epsilon > 0$, there exists a large $n$ for which $|I_n| \le \epsilon/4$ uniformly over $P$.

$R_n$-term: $R_n$ is the sum of $R_n^a$ (part (A.7)) and $R_n^b$ (part (A.8)). Lemma B.7 shows that both converge in probability to zero. They also converge in probability to zero uniformly over $P$, for the same reasons that the $B_n$-term above does. Therefore, for $\epsilon > 0$, $\sup_{P \in \mathcal{P}} P_P(|R_n| > \epsilon/4)$ converges to zero. □
A.5 Proof of Theorem 4
Define $\delta_{j,l} = I\{U_i(c_j) = d_l\}$. Assumption 8 (no ever-defiers) implies the following facts: (i) $P[\delta_{j-1,l} = 0, \delta_{j,l} = 1] = 0$ for all $l \neq j$; (ii) $P[\delta_{j-1,l} = 1, \delta_{j,l} = 0] = 0$ for $l = j$; (iii) $P[\delta_{j-1,l} = 1, \delta_{j,l} = 0, \delta_{j,u} = 1] = 0$ for all $u \neq j$ and $u \neq l$.

Fix a small $e > 0$ and write

\begin{align*}
E[Y_i \mid X_i = c_j + e] &= \textstyle\sum_{l=0}^{K} E[\delta_{j,l}\, Y_i(d_l) \mid X_i = c_j + e] \\
&= \textstyle\sum_{l=0}^{K} E[Y_i(d_l) \mid X_i = c_j + e, \delta_{j,l} = 1, \delta_{j-1,l} = 1]\, P[\delta_{j,l} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j + e] \\
&\quad + \textstyle\sum_{l=0}^{K} E[Y_i(d_l) \mid X_i = c_j + e, \delta_{j,l} = 1, \delta_{j-1,l} = 0]\, P[\delta_{j,l} = 1, \delta_{j-1,l} = 0 \mid X_i = c_j + e] \\
&= \textstyle\sum_{l=0}^{K} E[Y_i(d_l) \mid X_i = c_j + e, \delta_{j,l} = 1, \delta_{j-1,l} = 1]\, P[\delta_{j,l} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j + e] \\
&\quad + E[Y_i(d_j) \mid X_i = c_j + e, \delta_{j,j} = 1, \delta_{j-1,j} = 0]\, P[\delta_{j,j} = 1, \delta_{j-1,j} = 0 \mid X_i = c_j + e],
\end{align*}

where the last equality uses fact (i). Take the limit as $e \downarrow 0$. Use that $\{\delta_{j,l} = 1, \delta_{j-1,l} = 1\}$ and $\{\delta_{j,j} = 1, \delta_{j-1,j} = 0\}$ are finite unions of measurable sets of the form $\{U_i = \bar U\}$, $\bar U \in \mathcal{U}^*$. The conditional expectation and probability are continuous functions of $x$ conditional on these sets (Assumption 8). Then

\begin{align*}
\lim_{e \downarrow 0} E[Y_i \mid X_i = c_j + e] &= \textstyle\sum_{l=0}^{K} E[Y_i(d_l) \mid X_i = c_j, \delta_{j,l} = 1, \delta_{j-1,l} = 1]\, P[\delta_{j,l} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j] \\
&\quad + \textstyle\sum_{l=0,\, l \neq j}^{K} E[Y_i(d_j) \mid X_i = c_j, \delta_{j,j} = 1, \delta_{j-1,l} = 1]\, P[\delta_{j,j} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j].
\end{align*}

Similarly, use fact (ii) for the left-hand-side limit:

\begin{align*}
\lim_{e \downarrow 0} E[Y_i \mid X_i = c_j - e] &= \textstyle\sum_{l=0}^{K} E[Y_i(d_l) \mid X_i = c_j, \delta_{j,l} = 1, \delta_{j-1,l} = 1]\, P[\delta_{j,l} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j] \\
&\quad + \textstyle\sum_{l=0,\, l \neq j}^{K} E[Y_i(d_l) \mid X_i = c_j, \delta_{j,l} = 0, \delta_{j-1,l} = 1]\, P[\delta_{j,l} = 0, \delta_{j-1,l} = 1 \mid X_i = c_j],
\end{align*}

and use fact (iii) to rewrite the last sum as $\sum_{l=0,\, l \neq j}^{K} E[Y_i(d_l) \mid X_i = c_j, \delta_{j,j} = 1, \delta_{j-1,l} = 1]\, P[\delta_{j,j} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j]$. The difference between the right and left side limits is

\[ B_j = \sum_{l=0,\, l \neq j}^{K} E[Y_i(d_j) - Y_i(d_l) \mid X_i = c_j, \delta_{j,j} = 1, \delta_{j-1,l} = 1]\, P[\delta_{j,j} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j] = \sum_{l=0,\, l \neq j}^{K} \beta_{ec}(c_j, d_l, d_j)\, P[\delta_{j,j} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j]. \]

Next, it is shown that $P[\delta_{j,j} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j] = \omega_{j,l}$ for $l \neq j$:

\begin{align*}
P[\delta_{j,j} = 1, \delta_{j-1,l} = 1 \mid X_i = c_j] &= P[\delta_{j,l} = 0, \delta_{j-1,l} = 1 \mid X_i = c_j] \\
&= P[\delta_{j,l} = 0 \mid X_i = c_j] - P[\delta_{j-1,l} = 0 \mid X_i = c_j] \\
&= \lim_{e \downarrow 0} \left\{ P[U_i(c_j) \neq d_l \mid X_i = c_j + e] - P[U_i(c_{j-1}) \neq d_l \mid X_i = c_j - e] \right\} \\
&= \lim_{e \downarrow 0} \left\{ P[D_i = d_l \mid X_i = c_j - e] - P[D_i = d_l \mid X_i = c_j + e] \right\},
\end{align*}

where facts (i) and (ii) are used. This proves the first part of the theorem.

If $\beta_{ec}$ belongs to the class of functions of Assumption 9, then $B_j = \widetilde W_j' \theta$. If the matrix $\widetilde W' \widetilde W = \sum_j \widetilde W_j \widetilde W_j'$ is invertible, then the second part of the theorem follows.

Conversely, suppose that the $p > K$ elements in $\{\beta_{ec}(c_j, d_l, d_j) \text{ for } (j, l) : \omega_{j,l} > 0\}$ are identified for the fuzzy assignment $\tilde{\mathbf{c}}_1 = (c_1, d_{l_1}, d_1), \ldots, \tilde{\mathbf{c}}_p = (c_K, d_{l_p}, d_K)$. Identification means that there is a unique solution to the constrained linear system

\[ \begin{bmatrix} B_1 \\ \vdots \\ B_K \end{bmatrix} = \Omega \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} \quad \text{such that } (\beta_1, \ldots, \beta_p) \in G, \]

where $\Omega$ is the $K \times p$ matrix whose $j$-th row collects the weights $\omega_{j,l}$, $l \neq j$, in the columns corresponding to the cutoff-dose values at cutoff $j$, with zeros elsewhere. The $K \times p$ matrix of coefficients has rank equal to $K$ because the assignment is fuzzy. Since $p > K$, the unconstrained system has infinitely many solutions of the form $b = b^p + \sum_{m=1}^{p-K} \lambda_m b_m^s$ for any $(\lambda_1, \ldots, \lambda_{p-K}) \in \mathbb{R}^{p-K}$, where $\{b_m^s\}_{m=1}^{p-K}$ are the basis vectors of the null space of the unconstrained system, and $b^p$ is a particular solution. By assumption, the constrained system has one unique solution $b^* \in G$, so $b^* + b_m^s \notin G$ for all $m$. This implies that $b_m^s \notin G$ for all $m$, because $G$ is a vector subspace of $\mathbb{R}^p$. This gives a set of $p - K$ linearly independent vectors in $\mathbb{R}^p$ that are not in $G$. Therefore, $\dim G \le p - (p - K) = K$, and the third part of the theorem follows. □
B Supplemental Appendix
B.1 Local Polynomial Regressions
The first lemma is a straightforward generalization of Porter (2003)'s Theorem 3(a). It derives the asymptotic distribution of the Local Polynomial Regression (LPR) estimator for the difference in side limits of a conditional mean. The lemma considers the mean of the $q \times 1$ vector $\mathbf{Y}_i$ rather than a scalar $Y_i$ in order to cover the CLT proof in the fuzzy case (Theorem B.1) as a special case with $\mathbf{Y}_i = [Y_i \;\; W(c_j, D_i)']'$. At a cutoff $c_j$, the difference in conditional mean is $\mathbf{J}_j$, for $j = 1, \ldots, K$. Given a choice of a bandwidth $h_j > 0$ around $c_j$, a kernel density function $k(u)$, and a polynomial order $\rho \in \mathbb{Z}_+$, the $l$-th coordinate of $\mathbf{J}_j$ is denoted $J_{j,l}$ and estimated as follows:

\[ \widehat J_{j,l} = \widehat a_{j+,l} - \widehat a_{j-,l} \tag{B.1} \]
\[ (\widehat a_{j+,l}, \widehat b_{j+,l}) = \operatorname*{argmin}_{(a,\, b)} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_j}\right) v_i^{j+} \left[ \mathbf{Y}_{i,l} - a - b_1 (X_i - c_j) - \ldots - b_\rho (X_i - c_j)^\rho \right]^2 \tag{B.2} \]
\[ (\widehat a_{j-,l}, \widehat b_{j-,l}) = \operatorname*{argmin}_{(a,\, b)} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_j}\right) v_i^{j-} \left[ \mathbf{Y}_{i,l} - a - b_1 (X_i - c_j) - \ldots - b_\rho (X_i - c_j)^\rho \right]^2 \tag{B.3} \]

where

\[ v_i^{j+} = I\{c_j \le X_i < c_j + h_j\} \tag{B.4} \]
\[ v_i^{j-} = I\{c_j - h_j < X_i < c_j\} \tag{B.5} \]
\[ b = (b_1, \ldots, b_\rho). \tag{B.6} \]
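Equations (B.1)-(B.3) are two weighted polynomial regressions, one on each side of the cutoff. A minimal Python sketch for scalar outcomes follows; the triangular kernel is an assumption of this sketch, chosen for concreteness.

```python
import numpy as np

def lpr_jump(x, y, c, h, rho=1):
    """Jump-discontinuity estimate at cutoff c via one-sided local polynomial
    regressions of order rho; triangular kernel chosen for concreteness."""
    k = lambda u: np.maximum(1.0 - np.abs(u), 0.0)

    def side_intercept(mask):
        w = k((x[mask] - c) / h)
        H = np.vander(x[mask] - c, rho + 1, increasing=True)  # [1, x-c, ...]
        WH = H * w[:, None]
        coef = np.linalg.solve(WH.T @ H, WH.T @ y[mask])
        return coef[0]                    # intercept = estimated side limit

    a_plus = side_intercept((x >= c) & (x < c + h))    # Equation (B.2)
    a_minus = side_intercept((x > c - h) & (x < c))    # Equation (B.3)
    return a_plus - a_minus                            # Equation (B.1)
```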
Lemma B.1. For each $j = 1, \ldots, K$, assume the following conditions hold:

(i) The kernel density function $k : \mathbb{R} \to \mathbb{R}$ is symmetric around zero, has compact support $[-M, M]$ for some $M \in (0, \infty)$, and is Lipschitz continuous;

(ii) The distribution of $X_i$ has probability density function $f(x)$ that is continuous and has bounded support $\mathcal{X} = [\underline X, \bar X]$; the cutoff $c_j$ belongs to $(\underline X, \bar X)$;

(iii) Define

\[ \mathbf{J}_j = \lim_{e \downarrow 0} \left\{ E[\mathbf{Y}_i \mid X_i = c_j + e] - E[\mathbf{Y}_i \mid X_i = c_j - e] \right\} \tag{B.7} \]
\[ \mathbf{m}(x) = E[\mathbf{Y}_i \mid X_i = x] - \sum_{j=1}^{K} I\{c_j \le x\}\, \mathbf{J}_j. \tag{B.8} \]

The function $\mathbf{m}(x)$ is at least $\rho + 1$ times continuously differentiable wrt $x$ for all $x$ in a compact interval centered at $c_j$ except for $x = c_j$; there exist left and right side derivatives at $x = c_j$ up to the same order; its $(\rho+1)$-th partial derivative wrt $x$ is denoted $\nabla_x^{(\rho+1)} \mathbf{m}(x)$, and the side limits of the derivatives are denoted $\lim_{x \to c_j^\pm} \nabla_x^{(\rho+1)} \mathbf{m}(x) = \nabla_x^{(\rho+1)} \mathbf{m}(c_j^\pm)$;

(iv) Define

\[ \boldsymbol\varepsilon_i = \mathbf{Y}_i - E[\mathbf{Y}_i \mid X_i] \tag{B.9} \]
\[ \zeta(x) = E[\boldsymbol\varepsilon_i \boldsymbol\varepsilon_i' \mid X_i = x], \tag{B.10} \]

and assume $E[\|\boldsymbol\varepsilon_i\|^3 \mid X_i]$ is bounded. The matrix-valued function $\zeta(x)$ is continuous wrt $x$ for all $x$ in a compact interval centered at $c_j$ except for $x = c_j$; there exist left and right side limits at $x = c_j$, denoted $\lim_{x \to c_j^\pm} \zeta(x) = \zeta(c_j^\pm)$, where $\zeta(c_j^\pm)$ is positive-definite;

(v) As $n \to \infty$ and $h_j \to 0$, assume $n h_j \to \infty$ and $\sqrt{n h_j}\, h_j^{\rho+1} \to C \in [0, \infty)$.

Then, for each $j$,

\[ (V_{nj})^{-1/2} \left( \widehat{\mathbf{J}}_j - \mathbf{B}_{nj} - \mathbf{J}_j \right) \overset{d}{\to} N(\mathbf{0}, \mathbf{I}) \tag{B.11} \]

with $(V_{nj})^{-1/2}$ being the inverse of the square root of the symmetric and positive-definite matrix $V_{nj}$, $(V_{nj})^{-1/2} = O\!\left( (n h_j)^{1/2} \right)$, $(V_{nj})^{-1/2} \mathbf{B}_{nj} = O_P\!\left( (n h_j)^{1/2} h_j^{\rho+1} \right)$, where $\mathbf{0}$ is the $q \times 1$ vector of zeros, and $\mathbf{I}$ is the $q \times q$ identity matrix. The bias $\mathbf{B}_{nj}$ and variance $V_{nj}$ terms are characterized as follows:

\[ \mathbf{B}_{nj} = \frac{h_j^{\rho+1} f(c_j)}{(\rho+1)!}\, \mathbf{e}_1' \left[ \mathbf{G}_n^{j+} \boldsymbol\gamma^* \nabla_x^{\rho+1} \mathbf{m}(c_j^+) - \mathbf{G}_n^{j-} \boldsymbol\gamma^* \nabla_x^{\rho+1} \mathbf{m}(c_j^-) \right] \tag{B.12} \]
\[ V_{nj} = n\, E\!\left\{ \left[ \frac{1}{n h_j} k\!\left(\frac{X_i - c_j}{h_j}\right) \right]^2 \left[ \mathbf{e}_1' \left( v_i^{j+} E[\mathbf{G}_n^{j+}] - v_i^{j-} E[\mathbf{G}_n^{j-}] \right) \widetilde{\mathbf{H}}_{ji} \right] \boldsymbol\varepsilon_i \boldsymbol\varepsilon_i' \left[ \widetilde{\mathbf{H}}_{ji}' \left( v_i^{j+} E[\mathbf{G}_n^{j+}]' - v_i^{j-} E[\mathbf{G}_n^{j-}]' \right) \mathbf{e}_1 \right] \right\} \tag{B.13} \]

where $\boldsymbol\varepsilon_i = \mathbf{Y}_i - E[\mathbf{Y}_i \mid X_i]$, and

\[ \gamma^* = [\gamma_{\rho+1} \; \ldots \; \gamma_{2\rho+1}]' \tag{B.14} \]
\[ \boldsymbol\gamma^* = \mathbf{I}_q \otimes \gamma^*, \text{ where } \mathbf{I}_q \text{ is the } q \times q \text{ identity matrix, and } \otimes \text{ denotes the Kronecker product;} \tag{B.15} \]
\[ \mathbf{e}_1 = \mathbf{I}_q \otimes e_1, \text{ where } e_1 \text{ is the } (\rho+1) \times 1 \text{ vector } e_1 = [1 \; 0 \; 0 \; \cdots \; 0]' \tag{B.16} \]
\[ \gamma_d = \int k(u)\, u^d\, du \tag{B.17} \]
\[ H(u) = [1 \; u \; \ldots \; u^\rho]' \tag{B.18} \]
\[ H_{ji} = H(X_i - c_j) \tag{B.19} \]
\[ \widetilde H_{ji} = H\!\left(\frac{X_i - c_j}{h_j}\right) \tag{B.20} \]
\[ \mathbf{H}(u) = \mathbf{I}_q \otimes H(u) \tag{B.21} \]
\[ \mathbf{H}_{ji} = \mathbf{H}(X_i - c_j) \tag{B.22} \]
\[ \widetilde{\mathbf{H}}_{ji} = \mathbf{H}\!\left(\frac{X_i - c_j}{h_j}\right) \tag{B.23} \]
\[ \mathbf{G}_n^{j\pm} = \left[ \frac{1}{n h_j} \sum_{i=1}^{n} k\!\left(\frac{X_i - c_j}{h_j}\right) v_i^{j\pm} \widetilde{\mathbf{H}}_{ji} \widetilde{\mathbf{H}}_{ji}' \right]^{-1} \tag{B.24} \]
\[ \mathbf{G}^{j\pm} = f(c_j)^{-1}\, \mathbf{I}_q \otimes \Gamma_\pm^{-1} \tag{B.25} \]
\[ \Gamma_+ = \Gamma, \quad \Gamma_- = \{ (-1)^{j+l}\, \Gamma_{j,l} \}_{j,l} \tag{B.26} \]
\[ \Gamma = \begin{bmatrix} \gamma_0 & \ldots & \gamma_\rho \\ \vdots & \ddots & \vdots \\ \gamma_\rho & \ldots & \gamma_{2\rho} \end{bmatrix} \tag{B.27} \]

where $\mathrm{vec}(A_{m \times n}) = [a_{1,1}, \ldots, a_{m,1}, a_{1,2}, \ldots, a_{m,n}]'$, which makes $\boldsymbol\varphi^{j\pm}$ a $q(\rho+1) \times 1$ vector. Moreover,

\[ \left( \mathbf{G}_n^{j\pm} - \mathbf{G}^{j\pm} \right) = O_P\!\left( \frac{1}{\sqrt{n h_j}} \right) \tag{B.28} \]
\[ E[\mathbf{G}_n^{j\pm}] = \mathbf{G}^{j\pm} + O(h_j). \tag{B.29} \]

Proof.
Proof. Following Porter (2003), the jump estimator equals $\hat J_j = \hat a_{+j} - \hat a_{-j}$, where $\hat a_{\pm j} = [\hat a_{\pm j,1}, \ldots, \hat a_{\pm j,q}]'$. For brevity, write $k_{ji} = k\big( (X_i - c_j)/h_{1j} \big)$. Then

\[ \hat a_{\pm j} = \bar e_1' \left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j\pm} \mathbf H_{ji} \mathbf H_{ji}' \right]^{-1} \left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j\pm} \mathbf H_{ji} Y_i \right] \tag{B.30} \]
\[ = \bar e_1' G_n^{j\pm} \left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j\pm} \widetilde{\mathbf H}_{ji} Y_i \right] \tag{B.31} \]
\[ \hat J_j = \bar e_1' G_n^{j+} \left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} Y_i \right] - \bar e_1' G_n^{j-} \left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji} Y_i \right] \tag{B.32} \]

and note that $\mathbf H_{ji}$ changes to $\widetilde{\mathbf H}_{ji}$ because $\bar e_1$ only takes the first element of each of the $q$ stacked $(\rho+1)$-vectors, and the intercept is invariant to rescaling the regressors. Define $J_j^*$ and $\widetilde J_j$ as follows:

\[ J_j^* = \bar e_1' E[G_n^{j+}] \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} Y_i - \bar e_1' E[G_n^{j-}] \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji} Y_i \tag{B.33} \]
\[ = \sum_{i=1}^n \underbrace{ \frac{1}{n h_{1j}} k_{ji}\, \bar e_1' \left( v_i^{j+} E[G_n^{j+}] - v_i^{j-} E[G_n^{j-}] \right) \widetilde{\mathbf H}_{ji} }_{\varphi_n(X_i)} Y_i \tag{B.34} \]
\[ = \sum_{i=1}^n \varphi_n(X_i) Y_i \tag{B.35} \]
\[ \widetilde J_j = \bar e_1' G_n^{j+} E\left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} Y_i^{j+} \right] - \bar e_1' G_n^{j-} E\left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji} Y_i^{j-} \right] \tag{B.36} \]

where the $Y_i^{j\pm}$ are defined by

\[ Y_i^{j+} = Y_i - \mathbf H_{ji}' \varphi^{j+} = m(X_i) + \sum_{l=1}^K \mathbb I\{c_l \le X_i\} J_l + \varepsilon_i - \mathbf H_{ji}' \varphi^{j+} \tag{B.37} \]
\[ Y_i^{j-} = Y_i - \mathbf H_{ji}' \varphi^{j-} = m(X_i) + \sum_{l=1}^K \mathbb I\{c_l \le X_i\} J_l + \varepsilon_i - \mathbf H_{ji}' \varphi^{j-} \tag{B.38} \]
\[ \varphi^{j+} = \mathrm{vec}\left( \left[ m(c_j) + \textstyle\sum_{l=1}^{j} J_l \;\;\; \nabla_x m(c_j^+) \;\; \ldots \;\; \nabla_x^\rho m(c_j^+)/\rho! \right]' \right) \tag{B.39} \]
\[ \varphi^{j-} = \mathrm{vec}\left( \left[ m(c_j) + \textstyle\sum_{l=1}^{j-1} J_l \;\;\; \nabla_x m(c_j^-) \;\; \ldots \;\; \nabla_x^\rho m(c_j^-)/\rho! \right]' \right). \tag{B.40} \]

Write

\[ (V_{nj})^{-1/2}\big( \hat J_j - B_{nj} - J_j \big) = (V_{nj})^{-1/2}\big( J_j^* - E[J_j^* \mid \mathcal X_n] \big) \tag{B.41} \]
\[ \quad + (V_{nj})^{-1/2}\big( \widetilde J_j - B_{nj} \big) \tag{B.42} \]
\[ \quad + (V_{nj})^{-1/2}\big( \hat J_j - E[\hat J_j \mid \mathcal X_n] - ( J_j^* - E[J_j^* \mid \mathcal X_n] ) \big) \tag{B.43} \]
\[ \quad + (V_{nj})^{-1/2}\big( E[\hat J_j - J_j \mid \mathcal X_n] - \widetilde J_j \big). \tag{B.44} \]

The proof applies a central limit theorem (CLT) to show that part (B.41) converges in distribution to a standard normal; it demonstrates that $B_{nj}$ approximates the first-order bias, that is, part (B.42) converges in probability to zero; and that parts (B.43) and (B.44) converge in probability to zero.

Part (B.41). First, find the rate at which $(V_{nj})^{-1/2}$ grows. Use the change of variables $u = (x - c_j)/h_{1j}$ to evaluate the expectation:

\[ V_{nj} = \frac{1}{n h_{1j}} \int k(u)^2 \left[ \bar e_1' \big( \mathbb I\{u \ge 0\} E[G_n^{j+}] - \mathbb I\{u < 0\} E[G_n^{j-}] \big) \mathbf H(u) \right] \zeta(c_j + u h_{1j}) \left[ \mathbf H(u)' \big( \mathbb I\{u \ge 0\} E[G_n^{j+}]' - \mathbb I\{u < 0\} E[G_n^{j-}]' \big) \bar e_1 \right] f(c_j + u h_{1j})\, du \tag{B.45} \]
\[ \| V_{nj} \| \ge \frac{M}{n h_{1j}}, \tag{B.46} \]

because $E[G_n^{j\pm}]$ is approximately equal to the positive-definite matrix $G^{j\pm}$, so that the integral evaluates to a positive-definite matrix. Second,

\[ (V_{nj})^{-1/2}\big( J_j^* - E[J_j^* \mid \mathcal X_n] \big) = (V_{nj})^{-1/2} \sum_{i=1}^n \varphi_n(X_i)\, (Y_i - E[Y_i \mid X_i]) \tag{B.47} \]
\[ = (V_{nj})^{-1/2} \sum_{i=1}^n \varphi_n(X_i)\, \varepsilon_i. \tag{B.48} \]

Equation B.48 is a sum of iid random vectors with zero mean, where $V_{nj}$ is the variance of the sum. The Lindeberg condition is verified next.
Take an arbitrary $\delta > 0$:

\[ \sum_{i=1}^n E\left[ \|V_{nj}\|^{-1} \|\varphi_n(X_i)\varepsilon_i\|^2 \, \mathbb I\left\{ \|V_{nj}\|^{-1/2} \|\varphi_n(X_i)\varepsilon_i\| > \delta \right\} \right] \tag{B.49} \]
\[ = n\, E\left[ \|V_{nj}\|^{-1} \|\varphi_n(X_i)\varepsilon_i\|^2 \, \mathbb I\left\{ \|\varphi_n(X_i)\varepsilon_i\| > \delta \|V_{nj}\|^{1/2} \right\} \right] \tag{B.50} \]
\[ \le n\, E\left[ \|V_{nj}\|^{-3/2} \|\varphi_n(X_i)\varepsilon_i\|^3 \right] \delta^{-1} \tag{B.51} \]
\[ \le M n (n h_{1j})^{3/2}\, E\left[ \left\| \frac{1}{n h_{1j}} k_{ji}\, \bar e_1' \big( v_i^{j+} E[G_n^{j+}] - v_i^{j-} E[G_n^{j-}] \big) \widetilde{\mathbf H}_{ji}\, \varepsilon_i \right\|^3 \right] \delta^{-1} \tag{B.52} \]
\[ \le M (n h_{1j})^{-1/2}\, E\left[ \frac{1}{h_{1j}} |k_{ji}|^3 \left\| v_i^{j+} E[G_n^{j+}] - v_i^{j-} E[G_n^{j-}] \right\|^3 \left\| \mathbf H\!\left( \frac{X_i - c_j}{h_{1j}} \right) \right\|^3 E\big[ \|\varepsilon_i\|^3 \mid X_i \big] \right] \delta^{-1} \tag{B.53} \]
\[ = M (n h_{1j})^{-1/2} \int \left[ |k(u)|^3 \left\| \mathbb I\{u \ge 0\} E[G_n^{j+}] - \mathbb I\{u < 0\} E[G_n^{j-}] \right\|^3 \| \mathbf H(u) \|^3 \right] f(c_j + u h_{1j})\, du \tag{B.54} \]
\[ \le M (n h_{1j})^{-1/2} = o(1), \tag{B.55} \]

where the inequality $x^2 \mathbb I\{|x| > \delta\} \le x^3 \delta^{-1}$, boundedness of $E[\|\varepsilon_i\|^3 \mid X_i]$, and the rate of $V_{nj}^{-1}$ are used. The multivariate Lindeberg–Feller CLT then implies that Equation B.48, and thus part (B.41), converges in distribution to a standard normal.

Part (B.42). First consider

\[ E\left[ \frac{1}{h_{1j}} k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji}\, E[Y_i^{j+} \mid X_i] \right] \tag{B.56} \]
\[ = E\left[ \frac{1}{h_{1j}} k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} \frac{\nabla_x^{\rho+1} m(c_j^+)}{(\rho+1)!} \left( \frac{X_i - c_j}{h_{1j}} \right)^{\rho+1} \right] h_{1j}^{\rho+1} \tag{B.57} \]
\[ \quad + E\left[ \frac{1}{h_{1j}} k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} \frac{\nabla_x^{\rho+2} m(c_j^*)}{(\rho+2)!} \left( \frac{X_i - c_j}{h_{1j}} \right)^{\rho+2} \right] h_{1j}^{\rho+2} \tag{B.58} \]
\[ = h_{1j}^{\rho+1} f(c_j)\, \bar\gamma^* \frac{\nabla_x^{\rho+1} m(c_j^+)}{(\rho+1)!} + O\big( h_{1j}^{\rho+2} \big), \tag{B.59} \]

and

\[ B_{nj} = B_{nj}^+ - B_{nj}^- \tag{B.60} \]
\[ B_{nj}^+ = \frac{h_{1j}^{\rho+1} f(c_j)}{(\rho+1)!}\, \bar e_1' G_n^{j+} \bar\gamma^* \nabla_x^{\rho+1} m(c_j^+) \tag{B.61} \]
\[ B_{nj}^- = \frac{h_{1j}^{\rho+1} f(c_j)}{(\rho+1)!}\, \bar e_1' G_n^{j-} \bar\gamma^* \nabla_x^{\rho+1} m(c_j^-) \tag{B.62} \]

where $E[Y_i^{j+} \mid X_i]$ is the difference between $E[Y_i \mid X_i]$ and its $\rho$-th order Taylor expansion around $X_i = c_j$ (see Equations B.37 and B.38). The expectations in Equations B.57 and B.58, without the $h_{1j}^{\rho+1}$ and $h_{1j}^{\rho+2}$ terms, are bounded over $j$ because the kernel, the derivatives, and the polynomials are bounded functions of $u = (x - c_j) h_{1j}^{-1}$. Next,

\[ (V_{nj})^{-1/2}\big( \widetilde J_j - B_{nj} \big) = (V_{nj})^{-1/2}\, \bar e_1' G_n^{j+} E\left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} Y_i^{j+} \right] - (V_{nj})^{-1/2} B_{nj}^+ \tag{B.63} \]
\[ \quad - (V_{nj})^{-1/2}\, \bar e_1' G_n^{j-} E\left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji} Y_i^{j-} \right] + (V_{nj})^{-1/2} B_{nj}^- \tag{B.64} \]

Consider part (B.63); part (B.64) follows a symmetric argument. Use (B.59) and write
\[ (B.63) = (V_{nj})^{-1/2} \left[ \bar e_1' G_n^{j+} \bar\gamma^* h_{1j}^{\rho+1} \frac{\nabla_x^{\rho+1} m(c_j^+)}{(\rho+1)!} f(c_j) - B_{nj}^+ \right] \tag{B.65} \]
\[ \quad + (V_{nj})^{-1/2}\, \bar e_1' G_n^{j+}\, O\big( h_{1j}^{\rho+2} \big) \tag{B.66} \]
\[ = 0 + O_P\big( \sqrt{n h_{1j}}\, h_{1j}^{\rho+2} \big) = o_P(1), \tag{B.67} \]

where the second equality uses the definition of $B_{nj}^+$, the fact that $G_n^{j+} = O_P(1)$, and the rate condition $(n h_{1j})^{1/2} h_{1j}^{\rho+1} = O(1)$.

Part (B.43).

\[ (B.43) = (V_{nj})^{-1/2} \left[ \bar e_1' G_n^{j+} \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} \varepsilon_i - \bar e_1' G_n^{j-} \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji} \varepsilon_i \right] \tag{B.68} \]
\[ \quad - (V_{nj})^{-1/2} \left[ \bar e_1' E[G_n^{j+}] \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} \varepsilon_i - \bar e_1' E[G_n^{j-}] \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji} \varepsilon_i \right] \tag{B.69} \]
\[ = (V_{nj})^{-1/2}\, \bar e_1' \big[ G_n^{j+} - E[G_n^{j+}] \big] \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} \varepsilon_i \tag{B.70} \]
\[ \quad - (V_{nj})^{-1/2}\, \bar e_1' \big[ G_n^{j-} - E[G_n^{j-}] \big] \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji} \varepsilon_i \tag{B.71} \]
\[ = O\big( (n h_{1j})^{1/2} \big)\, O_P\big( (n h_{1j})^{-1/2} \big)\, o_P(1) \tag{B.72} \]
\[ \quad + O\big( (n h_{1j})^{1/2} \big)\, O_P\big( (n h_{1j})^{-1/2} \big)\, o_P(1) \tag{B.73} \]
\[ = o_P(1), \tag{B.74} \]

because $\| G_n^{j\pm} - E[G_n^{j\pm}] \| = O_P\big( (n h_{1j})^{-1/2} \big)$, and the zero-mean terms $(n h_{1j})^{-1} \sum_{i=1}^n k_{ji} v_i^{j\pm} \widetilde{\mathbf H}_{ji} \varepsilon_i$ converge in probability to zero since their variances are $O\big( (n h_{1j})^{-1} \big)$.

Part (B.44). Use the definitions of $\varphi^{j\pm}$ and $Y_i^{j\pm}$ to write

\[ \hat J_j - J_j = \hat a_{+j} - \hat a_{-j} - J_j \tag{B.75} \]
\[ = \bar e_1' \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \mathbf H_{ji} \mathbf H_{ji}' \right]^{-1} \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \mathbf H_{ji} Y_i \right] - \bar e_1' \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j-} \mathbf H_{ji} \mathbf H_{ji}' \right]^{-1} \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j-} \mathbf H_{ji} Y_i \right] - \bar e_1' \big( \varphi^{j+} - \varphi^{j-} \big) \tag{B.76} \]
\[ = \bar e_1' \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \mathbf H_{ji} \mathbf H_{ji}' \right]^{-1} \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \mathbf H_{ji} Y_i^{j+} \right] - \bar e_1' \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j-} \mathbf H_{ji} \mathbf H_{ji}' \right]^{-1} \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j-} \mathbf H_{ji} Y_i^{j-} \right] \tag{B.77} \]
\[ = \bar e_1' G_n^{j+} \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji} Y_i^{j+} \right] - \bar e_1' G_n^{j-} \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji} Y_i^{j-} \right]. \tag{B.78} \]

Thus, part (B.44) becomes

\[ (B.44) = (V_{nj})^{-1/2}\, \bar e_1' G_n^{j+} \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji}\, E[Y_i^{j+} \mid X_i] \right] - (V_{nj})^{-1/2}\, \bar e_1' G_n^{j+} E\left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji}\, E[Y_i^{j+} \mid X_i] \right] \tag{B.79} \]
\[ \quad - (V_{nj})^{-1/2}\, \bar e_1' G_n^{j-} \left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji}\, E[Y_i^{j-} \mid X_i] \right] + (V_{nj})^{-1/2}\, \bar e_1' G_n^{j-} E\left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j-} \widetilde{\mathbf H}_{ji}\, E[Y_i^{j-} \mid X_i] \right] \tag{B.80} \]

The next steps show that part (B.79) converges in probability to zero.
A symmetric proof shows that part (B.80) also converges in probability to zero.

\[ (B.79) = (V_{nj})^{-1/2}\, \bar e_1' G_n^{j+} \left\{ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji}\, E[Y_i^{j+} \mid X_i] - E\left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji}\, E[Y_i^{j+} \mid X_i] \right] \right\} \tag{B.81} \]
\[ = (V_{nj})^{-1/2}\, \bar e_1' G_n^{j+} h_{1j}^{\rho+1} \left\{ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji}\, E[Y_i^{j+} \mid X_i]\, h_{1j}^{-(\rho+1)} - E\left[ \frac{1}{n h_{1j}} \sum_i k_{ji} v_i^{j+} \widetilde{\mathbf H}_{ji}\, E[Y_i^{j+} \mid X_i]\, h_{1j}^{-(\rho+1)} \right] \right\} \tag{B.82} \]
\[ = O\big( (n h_{1j})^{1/2} \big)\, O_P(1)\, h_{1j}^{\rho+1}\, O_P\big( (n h_{1j})^{-1/2} \big) = o_P(1), \tag{B.83} \]

where the zero-mean term in curly brackets is normalized by $h_{1j}^{\rho+1}$ (see Equation B.59), and its variance after the normalization decreases at rate $(n h_{1j})^{-1}$. $\square$
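The conclusion (B.11) can be checked by simulation: under undersmoothing, so that the bias term $B_{nj}$ is negligible, the studentized jump estimates should be approximately standard normal. Below is a minimal Monte Carlo sketch reusing the lpr_jump helper above; the bandwidth rule $h = n^{-1/3}$ (which satisfies $\sqrt{nh}\, h^{\rho+1} \to 0$ for $\rho = 1$) and the data-generating process are assumptions made for illustration.

```python
# Monte Carlo check of (B.11) under undersmoothing (so B_nj is negligible).
import numpy as np

draws = []
for s in range(500):
    rng = np.random.default_rng(s)
    n = 4000
    x = rng.uniform(-1, 1, n)
    y = np.sin(x) + 2.0 * (x >= 0) + rng.normal(scale=0.3, size=n)
    h = n ** (-1 / 3)  # rho = 1: sqrt(n h) h^2 = n^{-1/3} -> 0
    draws.append(lpr_jump(x, y, c=0.0, h=h))
draws = np.array(draws)
z = (draws - 2.0) / draws.std()    # studentize against the true jump of 2
print(np.mean(np.abs(z) > 1.96))   # rejection rate should be near 0.05
```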
B.2 Uniformity with Large Number of Cutoffs

A class of sets $\mathcal S$ of a space $\Omega$ is said to shatter an $n$-point subset $D_n$ of $\Omega$ if, for every subset $D_n^{(i)}$ of $D_n$, there exists a set $S \in \mathcal S$ such that $S \cap D_n = D_n^{(i)}$. A class of sets $\mathcal S$ is said to be a VC class if there exists a finite non-negative integer $v$ such that no $v$-point set $D_v$ is shattered by $\mathcal S$; in this case, the index of the VC class is the smallest such $v$. For a class $\mathcal F$ of functions from $\Omega$ to $\mathbb R$, call the class of graphs of $\mathcal F$: $g_{\mathcal F} = \{ (x,t) \in \Omega \times \mathbb R : 0 \le t \le f(x) \text{ or } f(x) \le t \le 0, \text{ for } f \in \mathcal F \}$. A class of functions $\mathcal F$ is called a VC-subgraph class if $g_{\mathcal F}$ is a VC class. (One may define VC subgraph using alternative definitions of the class of graphs, but those lead to definitions of VC subgraph that are equivalent to ours; see Van der Vaart and Wellner (1996), Problem 2.6.11.) The class $\mathcal F$ is enveloped by a function $F$ if $|f(x)| \le F(x)$ for all $f \in \mathcal F$. Let $(\Omega, \mathcal A, Q)$ be a probability space. A covering number $N(\varepsilon, Q, \mathcal F)$ is defined to be the smallest non-negative integer $m$ for which there exist functions $f_1, \ldots, f_m$ in $\mathcal F$ such that $\min_j E_Q |f - f_j| \le \varepsilon$ for every $f \in \mathcal F$.

It is possible to build a complex VC-subgraph class by combining basic VC-subgraph classes. Any class of functions made of a finite union or intersection of VC-subgraph classes is also VC subgraph (Pollard (1984), Lemma 2.15). Let $\varphi: \mathbb R \to \mathbb R$ be a monotone function, and define the class of translations of $\varphi$, that is, $\mathcal F = \{ f: \mathbb R \to \mathbb R \text{ with } f(x) = \varphi(x - c), \; c \in \mathbb R \}$. Then $\mathcal F$ is a VC-subgraph class with index equal to 2 (Van der Vaart and Wellner (1996), Lemma 2.6.16). Moreover, if $\mathcal G$ is VC subgraph, then $\varphi \circ \mathcal G = \{ \varphi(g) : g \in \mathcal G \}$ is VC subgraph (Van der Vaart and Wellner (1996), Lemma 2.6.18). A VC-subgraph class $\mathcal F$ of uniformly bounded functions has covering number $N(\varepsilon, Q, \mathcal F) \le A \varepsilon^{-W}$, where the constants $A, W$ depend only on the VC index of the class of functions and on the uniform bound (Pollard (1984), Lemma 2.25). The next lemma lists more properties.
Lemma B.2.
Let $\mathcal F$ and $\mathcal G$ be VC-subgraph classes of functions uniformly bounded by a constant $0 < M < \infty$. Define $\mathcal H_+ = \{ f + g : f \in \mathcal F, g \in \mathcal G \}$ and $\mathcal H_\times = \{ fg : f \in \mathcal F, g \in \mathcal G \}$. For a fixed Lipschitz continuous function $\varphi$ with Lipschitz constant $C$, define $\mathcal H_\varphi = \{ \varphi(f) : f \in \mathcal F \}$. Then:

1. $N(\varepsilon, Q, \mathcal H_+) \le N(\varepsilon/2, Q, \mathcal F)\, N(\varepsilon/2, Q, \mathcal G)$;
2. $N(\varepsilon, Q, \mathcal H_\times) \le N(\varepsilon/2M, Q, \mathcal F)\, N(\varepsilon/2M, Q, \mathcal G)$;
3. $N(\varepsilon, Q, \mathcal H_\varphi) \le N(\varepsilon/C, Q, \mathcal F)$.
Proof. Slightly modified from Theorem 3 in Andrews (1994). Fix $\varepsilon > 0$ and pick any $h \in \mathcal H_+$, so that $h = f + g$. Use $f_i + g_j$ to approximate $f + g$, where $E_Q|f - f_i| \le \varepsilon/2$ and $E_Q|g - g_j| \le \varepsilon/2$, $1 \le i \le N(\varepsilon/2, Q, \mathcal F)$, $1 \le j \le N(\varepsilon/2, Q, \mathcal G)$. These two covering numbers are finite since $\mathcal F$ and $\mathcal G$ are VC subgraph. Call $h_l = f_i + g_j$, with $1 \le l \le N(\varepsilon/2, Q, \mathcal F)\, N(\varepsilon/2, Q, \mathcal G)$. Then

\[ E_Q|h - h_l| = E_Q|f + g - (f_i + g_j)| \le E_Q|f - f_i| + E_Q|g - g_j| \le \varepsilon. \]

Therefore, $N(\varepsilon, Q, \mathcal H_+) \le N(\varepsilon/2, Q, \mathcal F)\, N(\varepsilon/2, Q, \mathcal G)$.

Now pick any $h \in \mathcal H_\times$, so that $h = fg$. Use $f_i g_j$ to approximate $fg$, where $E_Q|f - f_i| \le \varepsilon/2M$ and $E_Q|g - g_j| \le \varepsilon/2M$, $1 \le i \le N(\varepsilon/2M, Q, \mathcal F) < \infty$, $1 \le j \le N(\varepsilon/2M, Q, \mathcal G) < \infty$. Call $h_l = f_i g_j$, with $1 \le l \le N(\varepsilon/2M, Q, \mathcal F)\, N(\varepsilon/2M, Q, \mathcal G)$. Then

\[ E_Q|h - h_l| = E_Q|fg - f_i g_j| = E_Q|fg - f_i g_j - f_i g + f_i g| \le E_Q|f - f_i|\,|g| + E_Q|g_j - g|\,|f_i| \le M\big( E_Q|f - f_i| + E_Q|g_j - g| \big) \le \varepsilon. \]

Therefore, $N(\varepsilon, Q, \mathcal H_\times) \le N(\varepsilon/2M, Q, \mathcal F)\, N(\varepsilon/2M, Q, \mathcal G)$.

Lastly, pick $h \in \mathcal H_\varphi$, so that $h = \varphi(f)$ for some $f \in \mathcal F$. Use $f_i$ to approximate $f$, where $E_Q|f - f_i| \le \varepsilon/C$, $1 \le i \le N(\varepsilon/C, Q, \mathcal F) < \infty$. Call $h_i = \varphi(f_i)$ for each $i$. Then $E_Q|h - h_i| = E_Q|\varphi(f) - \varphi(f_i)| \le C\, E_Q|f - f_i| \le \varepsilon$. Therefore, $N(\varepsilon, Q, \mathcal H_\varphi) \le N(\varepsilon/C, Q, \mathcal F)$. $\square$

Consider a set of $K + 2$ positive bandwidth sequences $\underline h_1$, $h_{1j}$, $j = 1, \ldots, K$, and $\bar h_1$ that depend on $n$. Assume $\underline h_1 \le h_{1j} \le \bar h_1$ for every $j$, and that both $\underline h_1$ and $\bar h_1$ converge to zero at the same rate. Define $v^+_{c,h}(x) = \mathbb I\{c \le x < c + h\}$ and $v^-_{c,h}(x) = \mathbb I\{c - h < x < c\}$ for any $c \in \mathbb R$ and $h > 0$, so that $v_i^{j\pm}$ (used in the main text) becomes $v^\pm_{c_j, h_{1j}}(X_i)$.
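The polynomial covering-number bounds used below can be visualized empirically. The sketch computes, by a greedy packing argument, an $L^1(Q_n)$-cover of the class $\{ v^+_{c,h} : c \in \mathcal X, h \in [\underline h_1, \bar h_1] \}$ under an empirical measure; its size grows only polynomially in $1/\varepsilon$. The finite grid of $(c, h)$ pairs and the greedy construction are illustrative assumptions, not part of the proofs.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.uniform(0, 1, 1000)  # draws from the empirical measure Q_n

# finite grid standing in for {v^+_{c,h} : c in X, h in [h_lo, h_hi]}
fs = [((xs >= c) & (xs < c + h)).astype(float)
      for c in np.linspace(0, 1, 40)
      for h in np.linspace(0.05, 0.3, 15)]

def cover_size(eps):
    """Greedy maximal eps-packing in L1(Q_n); a maximal packing is a cover."""
    kept = []
    for f in fs:
        if all(np.mean(np.abs(f - g)) > eps for g in kept):
            kept.append(f)
    return len(kept)

for eps in (0.2, 0.1, 0.05, 0.025):
    print(eps, cover_size(eps))  # grows roughly like a power of 1/eps
```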
Lemma B.3. Consider the classes of functions $\mathcal F_j^\pm$, $j = 1, \ldots, 4$, defined below. They depend on $n$ because the bandwidth sequences $\underline h_1$, $h_{1j}$, $j = 1, \ldots, K$, and $\bar h_1$ enter their definitions.

1. $\mathcal F_1^\pm = \big\{ f_{c,h} : \mathcal X \to \mathbb R \text{ s.t. } f_{c,h}(x) = v^\pm_{c,h}(x)\, k\big(\tfrac{x-c}{h}\big), \; c \in \mathcal X, h \in [\underline h_1, \bar h_1] \big\}$ for a kernel density function $k(\cdot)$ that satisfies Assumption 3;
2. $\mathcal F_2^\pm = \big\{ f_{c,h} : \mathcal X \times [-M, M] \to \mathbb R \text{ s.t. } f_{c,h}(x, y) = v^\pm_{c,h}(x)\, k\big(\tfrac{x-c}{h}\big)\, y, \; c \in \mathcal X, h \in [\underline h_1, \bar h_1] \big\}$ for any $M \in (0, \infty)$;
3. $\mathcal F_3^\pm = \big\{ f_{c,h} : \mathcal X \to \mathbb R \text{ s.t. } f_{c,h}(x) = v^\pm_{c,h}(x)\, k\big(\tfrac{x-c}{h}\big)\, r^\pm(x), \; c \in \mathcal X, h \in [\underline h_1, \bar h_1] \big\}$, where $r^\pm(x) = \sum_{j=1}^K v^\pm_{c_j, h_{1j}}(x)\, E[Y_i^{j\pm} \mid X_i = x]$ and $Y_i^{j\pm}$ is defined in Lemma B.1 (scalar case);
4. $\mathcal F_4^\pm = \big\{ f_{c,h} : \mathcal X \to \mathbb R \text{ s.t. } f_{c,h}(x) = v^\pm_{c,h}(x)\, k\big(\tfrac{x-c}{h}\big) \big(\tfrac{x-c}{h}\big)^l, \; c \in \mathcal X, h \in [\underline h_1, \bar h_1] \big\}$ for any positive integer $l \in \mathbb Z_+$.

If Assumptions 3, 5, and 7 hold, these functions are bounded for large $n$. The covering number of each of these classes satisfies $N(\varepsilon, Q, \mathcal F_j^\pm) \le A_j \varepsilon^{-W_j}$, $j = 1, 2, 3, 4$, where the positive constants $A_j$ and $W_j$ are independent of $n$ and $Q$.

Proof. First, note that all these functions are bounded. The functions in the first two classes are bounded because the kernel and the indicator functions are bounded. For the third class of functions,

\[ \big| r^+(x) \big| = \bigg| \sum_{j=1}^K v^+_{c_j, h_{1j}}(x)\, E[Y_i^{j+} \mid X_i = x] \bigg| \le \max_j \big| E[Y_i^{j+} \mid X_i = x] \big| = \max_j \big| \big[ \nabla_x^{\rho+1} R(c_j^*(x), d_j)/(\rho+1)! \big] (x - c_j)^{\rho+1} \big| \]

where $c_j^*(x) \in (c_j, x)$. The function $r^+(x)$ is bounded because $\nabla_x^{\rho+1} R(\cdot)$ is bounded (Assumption 7). An analogous argument bounds $r^-(x)$. For the fourth class of functions, $0 \le v^+_{c,h}(x) \big(\tfrac{x-c}{h}\big)^l < 1$ and $-1 < v^-_{c,h}(x) \big(\tfrac{x-c}{h}\big)^l \le 0$.

Each of the four classes is built from the following basic classes:

1. $\mathcal G_1^\pm = \big\{ f_{c,h}(x) = v^\pm_{c,h}(x) \big(\tfrac{x-c}{h}\big)^l, \; c \in \mathcal X, h \in [\underline h_1, \bar h_1] \big\}$;
2. $\mathcal G_2^\pm = \big\{ f_{c,h}(x) = v^\pm_{c,h}(x), \; c \in \mathcal X, h \in [\underline h_1, \bar h_1] \big\}$;
3. $\mathcal G_3 = \{ f: [-M, M] \to \mathbb R \text{ s.t. } f(y) = y \}$, that is, only one function $f$;
4. $\mathcal G_4^\pm = \{ f: \mathcal X \to \mathbb R \text{ s.t. } f(x) = r^\pm(x) \}$, that is, only one function $r^\pm$;
5. $\mathcal G_5^\pm = \big\{ f_{c,h}(x) = v^\pm_{c,h}(x)\, k\big(\tfrac{x-c}{h}\big), \; c \in \mathcal X, h \in [\underline h_1, \bar h_1] \big\}$.

Lemma B.2 says that it suffices to show that each of these classes has a polynomial bound on the covering number with constants that are independent of $n$ and $Q$.

$\mathcal G_1^\pm$: take $\mathcal G_1^+$ WLOG. A function $v^+_{c,h}(x)\big(\tfrac{x-c}{h}\big)$ is a line connecting the point $(c, 0)$ to $(c + h, 1)$ with support $[c, c + h)$. The class of functions $\mathcal G^* = \big\{ f_{c,h}(x) = v^+_{c,h}(x)\big(\tfrac{x-c}{h}\big), \; c \in \mathcal X, h \in [\underline h_1, \bar h_1] \big\}$ is VC subgraph because no 4-point set is shattered. It has covering number $N(\varepsilon, Q, \mathcal G^*)$ bounded by a polynomial in $\varepsilon$ whose constants do not depend on $Q$ or $n$. The function $\varphi(x) = x^l$ defined over $[0, 1]$ is Lipschitz continuous with constant equal to $l$. Since $\mathcal G_1^+ = \{ \varphi(g) : g \in \mathcal G^* \}$, Lemma B.2 says $\mathcal G_1^+$ has covering number bounded above by $A_1 \varepsilon^{-W_1}$ with $A_1, W_1$ independent of $n$ or $Q$.

$\mathcal G_2^\pm$: for either $v^+_{c,h}$ or $v^-_{c,h}$, no 3-point set is shattered by the graphs of either $\mathcal G_2^+$ or $\mathcal G_2^-$. Hence $\mathcal G_2^\pm$ is VC subgraph with covering number bounded above by $A_2^\pm \varepsilon^{-W_2^\pm}$, where $A_2^\pm, W_2^\pm$ are independent of $n$ or $Q$.

$\mathcal G_3$: it is straightforward to see that the class of graphs of this class of functions is VC with index 2. Therefore, the covering number of $\mathcal G_3$ is bounded above by $A_3 \varepsilon^{-W_3}$ with $A_3, W_3$ independent of $n$ or $Q$.

$\mathcal G_4^\pm$: consider $\mathcal G_4^+$ WLOG. For each $n$, $r^+(x) = \sum_{j=1}^K v^+_{c_j, h_{1j}}(x)\, E[Y_i^{j+} \mid X_i = x]$ is a fixed function. Similar to $\mathcal G_3$, the covering number of $\mathcal G_4^+$ is bounded above by $A_4 \varepsilon^{-W_4}$ with $A_4, W_4$ independent of $n$ or $Q$.

$\mathcal G_5^\pm$: take $\mathcal G_5^+$ WLOG. Define $\mathcal G^{**} = \{ f = k(g),\; g \in \mathcal G^* \}$, where $\mathcal G^*$ is the VC-subgraph class of functions defined above. Given that $k(\cdot)$ is Lipschitz continuous (Assumption 3), Lemma B.2 says that $\mathcal G^{**}$ has covering number $N(\varepsilon, Q, \mathcal G^{**})$ bounded by a polynomial in $\varepsilon$ whose constants do not depend on $Q$ or $n$. Note that $\mathcal G_5^+ = \{ gh : g \in \mathcal G_2^+, h \in \mathcal G^{**} \}$ because $v^+_{c,h}(x)\, k\big( v^+_{c,h}(x) \big(\tfrac{x-c}{h}\big) \big) = v^+_{c,h}(x)\, k\big(\tfrac{x-c}{h}\big)$. Therefore, $\mathcal G_5^+$ has covering number bounded above by $A_5 \varepsilon^{-W_5}$ with $A_5, W_5$ independent of $n$ or $Q$ (Lemma B.2). $\square$

Lemma B.4 below is a slightly modified version of Pollard (1984)'s Theorem 2.37.
Lemma B.4. For each $n$, let $\mathcal F_n$ be a class of uniformly bounded functions whose covering numbers satisfy $\sup_Q N(\varepsilon, Q, \mathcal F_n) \le A \varepsilon^{-W}$ for $0 < \varepsilon < 1$, with constants $A$ and $W$ not depending on $n$. Let $\delta_n$ be a positive decreasing sequence such that $\frac{\log n}{n \delta_n^2} \to 0$. If $[E(f^2)]^{1/2} \le \delta_n$ for all $f \in \mathcal F_n$, then

\[ \sup_{f \in \mathcal F_n} | E_n(f) - E(f) | = O_P\left( \delta_n^2 \sqrt{ \frac{\log n}{n \delta_n^2} } \right) \]

where $E_n(f)$ is the expected value of $f$ wrt the empirical distribution of the variables in the domain of $f$.

Proof. The proof is almost the same as that of Pollard (1984)'s Theorem 2.37, with two main differences. First, Pollard has an arbitrary sequence $\alpha_n$ that weakly decreases to zero such that $n \delta_n^2 \alpha_n^2 / \log n \to \infty$, and I take this sequence to be $\alpha_n^2 = \frac{\log n}{n \delta_n^2} \to 0$; this choice of $\alpha_n$ does not satisfy $n \delta_n^2 \alpha_n^2 / \log n \to \infty$, but that is not needed here. Second, he shows almost sure convergence, and I only show the expression to be bounded in probability.

That said, it is to be shown that for every $\gamma > 0$ there exist $M_\gamma > 0$ and $n_\gamma$ such that

\[ P\left\{ \sup_{f \in \mathcal F_n} | E_n(f) - E(f) | > M_\gamma \delta_n^2 \alpha_n \right\} < \gamma \qquad \text{for } n \ge n_\gamma. \]

Taking $\varepsilon_n = \varepsilon \delta_n^2 \alpha_n$,

\[ \frac{ V(E_n(f)) }{ (4\varepsilon_n)^2 } \le \frac{ E(f^2) }{ 16\, n \varepsilon^2 \delta_n^4 \alpha_n^2 } \le \frac{1}{ 16\, \varepsilon^2 n \delta_n^2 \alpha_n^2 } = \frac{1}{ 16\, \varepsilon^2 \log n }. \]

For large $n$, this is smaller than $1/2$, so that Equation (30) on page 31 of Pollard (1984) is used to get

\[ P\left\{ \sup_{f \in \mathcal F_n} | E_n(f) - E(f) | > \varepsilon \delta_n^2 \alpha_n \right\} \le 4\, P\left\{ \sup_{f \in \mathcal F_n} | E_n^\circ(f) | > \varepsilon_n \right\} \tag{B.84} \]

where $E_n^\circ(f)$ is the symmetrized signed measure defined there. Using the same approximation argument that led to Equation (31) on page 31, for approximating functions $g_j \in \mathcal F_n$,

\[ P\left\{ \sup_{f \in \mathcal F_n} | E_n^\circ(f) | > \varepsilon_n \right\} \le 2\, E\left[ N(\varepsilon_n, P_n, \mathcal F_n) \exp\left( -\frac{1}{2}\, \frac{ n \varepsilon_n^2 }{ \max_j E_n(g_j^2) } \right) \right] \]

where $P_n$ is the probability measure that weights each observation by $1/n$. This inequality is used to rewrite the right-hand side of Equation (B.84):

\[ 4\, P\left\{ \sup_f | E_n^\circ(f) | > \varepsilon_n \right\} = 4\, P\left\{ \sup_f | E_n^\circ(f) | > \varepsilon_n,\; \sup_f E_n(f^2) \le 4\delta_n^2 \right\} + 4\, P\left\{ \sup_f | E_n^\circ(f) | > \varepsilon_n,\; \sup_f E_n(f^2) > 4\delta_n^2 \right\} \]
\[ \le 8\, E\big[ N(\varepsilon_n, P_n, \mathcal F_n) \big] \exp\left[ -\frac{ n \varepsilon_n^2 }{ 8 \delta_n^2 } \right] \tag{B.85} \]
\[ \quad + 4\, P\left\{ \sup_f E_n(f^2) > 4\delta_n^2 \right\} \tag{B.86} \]

For part (B.85), use the fact that $N(\varepsilon_n, P_n, \mathcal F_n) \le A \varepsilon_n^{-W}$, and rearrange it into

\[ (B.85) \le A' \varepsilon^{-W} \exp\left[ W \log\left( \frac{1}{\delta_n^2 \alpha_n} \right) - \frac{ \varepsilon^2 n \delta_n^2 \alpha_n^2 }{ 8 } \right]. \tag{B.87} \]

For part (B.86), Pollard's argument bounds it by

\[ (B.86) \le 16\, E\left[ \min\left\{ N(\delta_n/2, P_n, \mathcal F_n) \exp(-n\delta_n^2);\; 1 \right\} \right] \le 16 \min\left\{ A \left( \frac{\delta_n}{2} \right)^{-W} \exp(-n\delta_n^2);\; 1 \right\} = 16 \min\left\{ A 2^W \exp\left[ -(W \log \delta_n + n \delta_n^2) \right];\; 1 \right\}. \]

Hence,

\[ P\left\{ \sup_{f \in \mathcal F_n} | E_n(f) - E(f) | > \varepsilon \delta_n^2 \alpha_n \right\} < A' \varepsilon^{-W} \exp\left[ W \log\left( \frac{1}{\delta_n^2 \alpha_n} \right) - \frac{\varepsilon^2 \log n}{8} \right] + 16 \min\left\{ A 2^W \exp\left[ -(W \log \delta_n + n \delta_n^2) \right];\; 1 \right\}. \tag{B.88} \]

It suffices to show that there is an $\varepsilon$ such that the sum of the two bounds in (B.88) converges to zero as $n \to \infty$. For the first bound, note that $n\delta_n^2 \to \infty$ implies $\delta_n \ge n^{-1/2}$ for large $n$, so that $\delta_n^2 \alpha_n = \delta_n \sqrt{\log n/n} \ge n^{-1}$ and therefore $\log\big( 1/(\delta_n^2 \alpha_n) \big) \le \log n$. Using this, the first bound is at most $A' \varepsilon^{-W} \exp\big[ (W - \varepsilon^2/8) \log n \big]$, which goes to zero once $\varepsilon$ is made large enough. For the second bound, $\frac{\log n}{n\delta_n^2} \to 0$ and $\delta_n \ge n^{-1/2}$ imply, for big $n$: (i) $\log \delta_n \ge -\frac{1}{2}\log n$ and (ii) $n\delta_n^2 \ge (W+1)\log n$. Hence the second bound is at most $16 \min\{ A 2^W \exp(-\log n); 1 \} \to 0$. $\square$

Unless stated otherwise, the Euclidean norm $\|\cdot\|$ is used with real-valued vectors. For matrices, the norm is the one induced by the Euclidean norm; that is, for a $p \times q$ matrix $A$, $\|A\| = \sup_{x \in \mathbb R^q, \|x\| = 1} \|Ax\|$. Such a matrix norm has the following properties: (i) for a matrix $A$ and a vector $x$, $\|Ax\| \le \|A\|\|x\|$; (ii) for matrices $A$ and $B$ such that $AB$ is defined, $\|AB\| \le \|A\|\|B\|$; (iii) for $A$ invertible, $\|A\|^{-1} \le \|A^{-1}\|$. The determinant of a matrix $A$ is denoted $\det(A)$. Another useful result is that (iv) convergence in the matrix norm is equivalent to convergence of all elements of the matrix.
Lemma B.5. Consider a random process $X_n(c)$ in $\mathbb R^{q \times q}$ and a fixed (non-random) function $X(c)$, also in $\mathbb R^{q \times q}$. Suppose $\sup_c \|X(c)\| \le L_1 < \infty$ and $\inf_c |\det(X(c))| \ge L_2 > 0$. If, for some sequence $\alpha_n \downarrow 0$,

\[ \sup_c \| X_n(c) - X(c) \| = O_P(\alpha_n), \]

then

\[ \sup_c \big\| X_n(c)^{-1} - X(c)^{-1} \big\| = O_P(\alpha_n). \]
Proof. Consider the compact subset of $\mathbb R^{q \times q}$: $\mathcal A = \{ X \in \mathbb R^{q \times q} : \|X\| \le 2L_1,\; |\det(X)| \ge L_2/2 \}$. Note that $X(c) \in \mathcal A$ for all $c$, and that any continuous function on $\mathcal A$ is uniformly continuous because $\mathcal A$ is a compact set. The function $f: \mathcal A \to \mathbb R^{q \times q}$, $f(X) = X^{-1}$, is uniformly continuous. For any $\gamma > 0$, find $M_\gamma > 0$ such that $P\big\{ \sup_c \alpha_n^{-1} \| X_n(c)^{-1} - X(c)^{-1} \| > M_\gamma \big\} < \gamma$:

\[ P\left\{ \alpha_n^{-1} \sup_c \big\| X_n(c)^{-1} - X(c)^{-1} \big\| > M_\gamma \right\} \le P\left\{ \sup_c \big\| X_n(c)^{-1} - X(c)^{-1} \big\| > \alpha_n M_\gamma,\; X_n(c) \in \mathcal A \;\forall c \right\} \tag{B.89} \]
\[ \quad + P\left\{ X_n(c) \notin \mathcal A \text{ for some } c \right\} \tag{B.90} \]

Part (B.89). Since $f(X) = X^{-1}$ is uniformly continuous on $\mathcal A$, for any choice of $M_\gamma > 0$ and for a given sample size, there exists a $\delta(\alpha_n M_\gamma) > 0$ such that, for all $X_n(c), X(c) \in \mathcal A$,

\[ \big\| X_n(c)^{-1} - X(c)^{-1} \big\| > \alpha_n M_\gamma \;\Rightarrow\; \| X_n(c) - X(c) \| > \delta(\alpha_n M_\gamma) \quad \forall n. \]

Hence

\[ (B.89) \le P\left\{ \sup_c \| X_n(c) - X(c) \| > \delta(\alpha_n M_\gamma),\; X_n(c) \in \mathcal A \;\forall c \right\} \le P\left\{ \sup_c \| X_n(c) - X(c) \| > \delta(\alpha_n M_\gamma) \right\} \tag{B.91} \]

By assumption, it is possible to find $M^*$ such that $P\{ \sup_c \| X_n(c) - X(c) \| > \alpha_n M^* \} < \gamma/2$ for large $n$. So pick $M_\gamma$ such that $\delta(\alpha_n M_\gamma) \ge \alpha_n M^*$, which makes $(B.89) \le \gamma/2$.

Part (B.90).

\[ (B.90) \le P\{ \|X_n(c)\| > 2L_1 \text{ for some } c \} + P\{ |\det(X_n(c))| < L_2/2 \text{ for some } c \} \]
\[ \le P\{ \|X_n(c) - X(c)\| > L_1 \text{ for some } c \} + P\{ |\det(X_n(c)) - \det(X(c))| > L_2/2 \text{ for some } c \}, \]

where the second inequality uses $\sup_c \|X(c)\| \le L_1$ and $\inf_c |\det(X(c))| \ge L_2$. Each of these two probabilities is made smaller than $\gamma/4$ for large $n$, since $X_n(c)$ converges in probability to $X(c)$ uniformly over $c$, and so does $\det(X_n(c))$ to $\det(X(c))$. Therefore, $(B.89) + (B.90) \le \gamma$. $\square$

An application of Lemma B.4 to the classes of functions in Lemma B.3 gives the rates at which certain terms in the proof of Theorem 2 are uniformly bounded in probability.
Lemma B.6. Consider the definitions of $G^{j\pm}$, $G_n^{j\pm}$, $\widetilde H_{ji}$, $H$, and $Y_i^{j\pm}$ from Lemma B.1 (scalar case). Suppose Assumptions 3, 4, 5, and 7 hold, and assume the rate conditions of Theorem 2. Then:

\[ \max_j \big\| G^{j\pm} - E[G_n^{j\pm}] \big\| = O(\bar h_1) \tag{B.92} \]
\[ \max_j \big\| G_n^{j\pm} - E[G_n^{j\pm}] \big\| = O_P\left( \sqrt{ \frac{\log n}{n \underline h_1} } \right) \tag{B.93} \]
\[ \max_j \left\| \frac{1}{n h_{1j}} \sum_{i=1}^n v_i^{j\pm} k\!\left( \frac{X_i - c_j}{h_{1j}} \right) \widetilde H_{ji}\, \varepsilon_i \right\| = O_P\left( \sqrt{ \frac{\log n}{n \underline h_1} } \right) \tag{B.94} \]
\[ \max_j \left\| \frac{1}{n h_{1j}} \sum_{i=1}^n \left\{ v_i^{j\pm} k\!\left( \frac{X_i - c_j}{h_{1j}} \right) \widetilde H_{ji}\, E[Y_i^{j\pm} \mid X_i] - E\left[ v_i^{j\pm} k\!\left( \frac{X_i - c_j}{h_{1j}} \right) \widetilde H_{ji}\, Y_i^{j\pm} \right] \right\} \right\| = O_P\left( \sqrt{ \frac{\log n}{n \underline h_1} } \right) \tag{B.95} \]

Proof.
Consider the positive parts with $v_i^{j+} = v^+_{c_j, h_{1j}}(X_i)$, $G_n^{j+}$, and $G^{j+}$, WLOG.
Part (B.92). First, we show that $\max_j \big\| E[G_n^{j+}] - G^{j+} \big\| = O(\bar h_1)$ using Lemma B.5. We have that $G^{j+} = f(c_j)^{-1}\, \Gamma_+^{-1}$ is a bounded function of $j$ and has a determinant uniformly bounded away from zero. Using Lemma B.5, it suffices to show

\[ \max_j \left\| \big[ E(G_n^{j+}) \big]^{-1} - \big( G^{j+} \big)^{-1} \right\| = O(\bar h_1). \]

Also, convergence in the matrix norm is equivalent to convergence of each element of the matrix. Hence, it suffices to show that

\[ \max_j \left| \frac{1}{h_{1j}}\, E\left[ v_i^{j+} k\!\left( \frac{X_i - c_j}{h_{1j}} \right) \left( \frac{X_i - c_j}{h_{1j}} \right)^l \right] - f(c_j) \gamma_l \right| = O(\bar h_1). \]

The LHS above is bounded by

\[ \sup_{c \in \mathcal X,\, h \in [\underline h_1, \bar h_1]} \left| \frac{1}{h}\, E\left[ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right) \left( \frac{X_i - c}{h} \right)^l \right] - f(c) \gamma_l \right|, \]

which we show to be $O(\bar h_1)$. Take an arbitrary sequence $h \in [\underline h_1, \bar h_1]$:

\[ \left| \frac{1}{h}\, E\left[ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right) \left( \frac{X_i - c}{h} \right)^l \right] - f(c) \gamma_l \right| = \left| \int_0^1 k(u)\, u^l f(c + uh)\, du - f(c) \gamma_l \right| = h \left| \int_0^1 k(u)\, u^{l+1} \nabla_x f(c^*_u)\, du \right| \le M h = O(\bar h_1), \]

where Assumption 4 bounds the derivative of $f$. Therefore, the supremum above is $O(\bar h_1)$, and the result follows.

Part (B.93). The goal is to show that $\max_j \big\| G_n^{j\pm} - E[G_n^{j\pm}] \big\| = O_P\big( \sqrt{\log n/(n \underline h_1)} \big)$. Note that part (B.92) implies that $E[G_n^{j+}]$ is a bounded function of $j$ and has a determinant uniformly bounded away from zero for large $n$. Using Lemma B.5, it suffices to show that $\max_j \big\| (G_n^{j+})^{-1} - \big( E[G_n^{j+}] \big)^{-1} \big\| = O_P\big( \sqrt{\log n/(n \underline h_1)} \big)$. In fact, it suffices to show uniform convergence of each element of the matrix:

\[ \max_j \left| \frac{1}{n h_{1j}} \sum_{i=1}^n \left\{ v_i^{j+} k\!\left( \frac{X_i - c_j}{h_{1j}} \right) \left( \frac{X_i - c_j}{h_{1j}} \right)^l - E\left[ v_i^{j+} k\!\left( \frac{X_i - c_j}{h_{1j}} \right) \left( \frac{X_i - c_j}{h_{1j}} \right)^l \right] \right\} \right| = O_P\left( \sqrt{ \frac{\log n}{n \underline h_1} } \right) \]

for an arbitrary $l$. The LHS is bounded by

\[ \frac{1}{\underline h_1} \sup_{c \in \mathcal X,\, h \in [\underline h_1, \bar h_1]} \left| \frac{1}{n} \sum_{i=1}^n \left\{ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right) \left( \frac{X_i - c}{h} \right)^l - E\left[ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right) \left( \frac{X_i - c}{h} \right)^l \right] \right\} \right| \tag{B.96} \]

and we apply Lemma B.4 to this part. Lemma B.3 says that the class of functions (over which the sup is being taken) satisfies the conditions of Lemma B.4. For the second-moment bound $\delta_n$, take an arbitrary sequence $h \in [\underline h_1, \bar h_1]$ and note that

\[ E\left[ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right)^2 \left( \frac{X_i - c}{h} \right)^{2l} \right] = h \int_0^1 k(u)^2 u^{2l} f(c + uh)\, du \le M h \le M \bar h_1, \]

where $f(\cdot)$ and $k(\cdot)$ are uniformly bounded (Assumptions 3 and 4). Hence, for the purposes of Lemma B.4, $\delta_n^2 = M \bar h_1$, which satisfies $\frac{\log n}{n \delta_n^2} \to 0$ because $\frac{\sqrt K \log n}{\sqrt{n \bar h_1}} \to 0$. Lemma B.4 applied to (B.96) then yields

\[ \frac{1}{\underline h_1}\, O_P\left( \sqrt{ \frac{ \bar h_1 \log n }{ n } } \right) = O_P\left( \sqrt{ \frac{\log n}{n \underline h_1} } \right), \]

because $\bar h_1 / \underline h_1 = O(1)$.
Part (B.94). Similar to above, convergence in the matrix norm is equivalent to convergence of each element of the matrix, so it suffices to show that

\[ \frac{1}{\underline h_1} \sup_{c \in \mathcal X,\, h \in [\underline h_1, \bar h_1]} \left| \frac{1}{n} \sum_{i=1}^n v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right) \left( \frac{X_i - c}{h} \right)^l \varepsilon_i \right| = O_P\left( \sqrt{ \frac{\log n}{n \underline h_1} } \right) \]

for an arbitrary $l$. Take an arbitrary sequence $h \in [\underline h_1, \bar h_1]$:

\[ E\left[ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right)^2 \left( \frac{X_i - c}{h} \right)^{2l} \varepsilon_i^2 \right] \le M\, E\left[ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right)^2 \left( \frac{X_i - c}{h} \right)^{2l} \right] = M h \int_0^1 k(u)^2 u^{2l} f(c + uh)\, du \le M \bar h_1, \]

where it is used that $\varepsilon_i = Y_i - R(X_i, D_i)$ is a.s. uniformly bounded (Assumption 7). Hence, $\delta_n^2 = M \bar h_1$. The expectation $E\big[ v^+_{c,h}(X_i)\, k\big( \tfrac{X_i - c}{h} \big) \big( \tfrac{X_i - c}{h} \big)^l \varepsilon_i \big] = 0$, and the sup is over a class of functions that satisfies the conditions of Lemma B.4, which gives the result.

Part (B.95). It suffices to show that

\[ \frac{1}{\underline h_1} \sup_{c \in \mathcal X,\, h \in [\underline h_1, \bar h_1]} \left| \frac{1}{n} \sum_{i=1}^n \left\{ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right) \left( \frac{X_i - c}{h} \right)^l E[Y_i^{j+} \mid X_i] - E\left[ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right) \left( \frac{X_i - c}{h} \right)^l Y_i^{j+} \right] \right\} \right| \]
\[ = \frac{1}{\underline h_1} \sup_{c \in \mathcal X,\, h \in [\underline h_1, \bar h_1]} \left| \frac{1}{n} \sum_{i=1}^n \left\{ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right) \left( \frac{X_i - c}{h} \right)^l r^+(X_i) - E\left[ v^+_{c,h}(X_i)\, k\!\left( \frac{X_i - c}{h} \right) \left( \frac{X_i - c}{h} \right)^l r^+(X_i) \right] \right\} \right| = O_P\left( \sqrt{ \frac{\log n}{n \underline h_1} } \right) \]

for any positive integer $l$. Choose $\delta_n$ similarly as before. The sup is over a class of functions that satisfies the conditions of Lemma B.4, which gives the result. $\square$
Lemma B.7. Assume the conditions of Theorem 2 hold. Then, parts (A.7) and (A.8) in the proof of Theorem 2 (Section A.3) converge in probability to zero.

Proof.
Part (A.7).

\[ \left| \frac{ \hat\mu - E[\hat\mu \mid \mathcal X_n] - ( \mu^* - E[\mu^* \mid \mathcal X_n] ) }{ (V_n^c)^{1/2} } \right| \tag{B.97} \]
\[ \le O\big( (K n \bar h_1)^{1/2} \big) \left| \sum_{j=1}^K \Delta_j \left\{ e_1' \big( G_n^{j+} - E[G_n^{j+}] \big) \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} \varepsilon_i \widetilde H_{ji} - e_1' \big( G_n^{j-} - E[G_n^{j-}] \big) \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} \varepsilon_i \widetilde H_{ji} \right\} \right| \tag{B.98} \]
\[ \le O\big( (K n \bar h_1)^{1/2} \big) \sum_{j=1}^K |\Delta_j| \left[ \big\| G_n^{j+} - E[G_n^{j+}] \big\| \left\| \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} \varepsilon_i \widetilde H_{ji} \right\| + \big\| G_n^{j-} - E[G_n^{j-}] \big\| \left\| \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} \varepsilon_i \widetilde H_{ji} \right\| \right] \tag{B.99} \]
\[ \le O\big( (K n \bar h_1)^{1/2} \big)\, K\, O(K^{-1})\, O_P\left( \left( \frac{\log n}{n \underline h_1} \right)^{1/2} \right) O_P\left( \left( \frac{\log n}{n \underline h_1} \right)^{1/2} \right) \tag{B.100} \]
\[ = O_P\left( \frac{ K^{1/2} \log n }{ (n \underline h_1)^{1/2} } \right) = o_P(1), \tag{B.101} \]

where the first inequality uses the rate of $(V_n^c)^{-1/2}$ (Equation A.13); the third inequality relies on the uniform convergence rates of Lemma B.6 and on $\Delta_j = O(K^{-1})$ uniformly over $j$ (Lemma B.9); the last equality uses the rate condition $K^{1/2} \log n\, (n \underline h_1)^{-1/2} = o(1)$.

Part (A.8).

\[ \frac{ E[\hat\mu - \mu_n \mid \mathcal X_n] - \widetilde\mu }{ (V_n^c)^{1/2} } \tag{B.102} \]
\[ = (V_n^c)^{-1/2} \sum_{j=1}^K \Delta_j e_1' G_n^{j+} \left\{ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} E[Y_i^{j+} \mid X_i] \widetilde H_{ji} - E\left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} Y_i^{j+} \widetilde H_{ji} \right] \right\} \tag{B.103} \]
\[ \quad - (V_n^c)^{-1/2} \sum_{j=1}^K \Delta_j e_1' G_n^{j-} \left\{ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} E[Y_i^{j-} \mid X_i] \widetilde H_{ji} - E\left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j-} Y_i^{j-} \widetilde H_{ji} \right] \right\} \tag{B.104} \]

where

\[ (B.103) = (V_n^c)^{-1/2} \sum_{j=1}^K \Delta_j e_1' E[G_n^{j+}] \left\{ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} E[Y_i^{j+} \mid X_i] \widetilde H_{ji} - E\left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} Y_i^{j+} \widetilde H_{ji} \right] \right\} \tag{B.105} \]
\[ \quad + (V_n^c)^{-1/2} \sum_{j=1}^K \Delta_j e_1' \big( G_n^{j+} - E[G_n^{j+}] \big) \left\{ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} E[Y_i^{j+} \mid X_i] \widetilde H_{ji} - E\left[ \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} Y_i^{j+} \widetilde H_{ji} \right] \right\} \tag{B.106} \]

Part (B.105) is $o_P(1)$ because it has zero mean and zero limiting variance:
\[ V[(B.105)] = (V_n^c)^{-1} \sum_{j=1}^K \frac{\Delta_j^2}{n}\, V\left\{ \frac{1}{h_{1j}} k_{ji} v_i^{j+} E[Y_i^{j+} \mid X_i] \big( e_1' E[G_n^{j+}] \widetilde H_{ji} \big) - E\left[ \frac{1}{h_{1j}} k_{ji} v_i^{j+} Y_i^{j+} \big( e_1' E[G_n^{j+}] \widetilde H_{ji} \big) \right] \right\} \tag{B.107} \]
\[ \le O(K n \bar h_1) \sum_{j=1}^K \frac{\Delta_j^2}{n}\, E\left[ \frac{1}{h_{1j}^2} k_{ji}^2 v_i^{j+} E[Y_i^{j+} \mid X_i]^2 \big( e_1' E[G_n^{j+}] \widetilde H_{ji} \big)^2 \right] \tag{B.108} \]
\[ = O(K \bar h_1) \sum_{j=1}^K O(K^{-2})\, E\left[ \frac{1}{h_{1j}^2} k_{ji}^2 v_i^{j+} O\big( h_{1j}^{2(\rho+1)} \big) \Big( e_1' \big[ G^{j+} + O(h_{1j}) \big] \widetilde H_{ji} \Big)^2 \right] \tag{B.109} \]
\[ = O\big( \bar h_1^{\rho+2} \big) = o(1), \tag{B.110} \]

where the rate of $(V_n^c)^{-1}$ (Equation A.13) is used; $\Delta_j = O(K^{-1})$ holds uniformly over $j$ (Lemma B.9); expansion (A.23) applies; $E[G_n^{j+}]$ is uniformly close to $G^{j+}$ (Lemma B.6); and the expected value in (B.109), without the $O\big( h_{1j}^{\rho+1} \big)$ term, is a bounded quantity.

Part (B.106) is $o_P(1)$ because

\[ |(B.106)| \le O\big( (K n \bar h_1)^{1/2} \big) \sum_{j=1}^K |\Delta_j| \left\| G_n^{j+} - E[G_n^{j+}] \right\| \left\| \frac{1}{n h_{1j}} \sum_{i=1}^n k_{ji} v_i^{j+} E[Y_i^{j+} \mid X_i] \widetilde H_{ji} - E\left[ \frac{1}{h_{1j}} k_{ji} v_i^{j+} Y_i^{j+} \widetilde H_{ji} \right] \right\| \tag{B.111} \]
\[ = O\big( (K n \bar h_1)^{1/2} \big)\, K O(K^{-1})\, O_P\big( (\log n)^{1/2} (n \underline h_1)^{-1/2} \big)\, O_P\big( (\log n)^{1/2} (n \underline h_1)^{-1/2} \big) \tag{B.112} \]
\[ = O_P\big( K^{1/2} (\log n)\, (n \underline h_1)^{-1/2} \big) = o_P(1), \tag{B.113} \]

which relies on the rate of $(V_n^c)^{-1/2}$ (Equation A.13), on $\Delta_j = O(K^{-1})$ uniformly over $j$ (Lemma B.9), and on parts (B.93) and (B.95) of Lemma B.6. Therefore, (B.103) is $o_P(1)$, and a symmetric proof shows that (B.104) is $o_P(1)$. Hence, part (A.8) is $o_P(1)$. $\square$

B.3 Integral Approximation
This section proves results on the error of approximated integrals. Let $R: \mathbb R^2 \to \mathbb R$ be a Riemann integrable function; for an open and convex set $\mathcal C \subset \mathbb R^3$, define $\beta: \mathcal C \to \mathbb R$ such that $\beta(x) = R(x_1, x_3) - R(x_2, x_3)$ (i.e., the treatment effect function of the main text). There are observations of the value of the $\beta(\cdot)$ function for $K$ points $c_1, \ldots, c_K$, that is, $\beta_1 = \beta(c_1), \ldots, \beta_K = \beta(c_K)$, for $c_j = (c_{1,j}, c_{2,j}, c_{3,j})$. Interest lies in the integral $\mu = \int_{\mathcal C} \beta(x)\, d(x)$, which is approximated by a finite weighted sum $\hat\mu = \sum_j \Delta_j \beta_j$. A procedure to compute the integral approximation is given below. More importantly, there is a result that gives the rate of decay of the approximation error of this procedure as the number of points $K \to \infty$. The procedure consists of using a multivariate local polynomial regression in a first step to obtain an approximated function $\hat\beta(x)$. The second step integrates $\hat\beta(x)$ over the set $\mathcal C$ to obtain an approximated integral $\hat\mu$.

For the first step, run a weighted regression of the $\beta_j$'s on $J_2 \times 1$ vectors $E_j(x)$. Each $E_j(x)$ is made of polynomials evaluated at $(x - c_j)$ of order $\rho_2$ at most. To define $E_j(x)$ and $J_2$, first consider the multi-index notation for vectors: for $x = (x_1, x_2, x_3) \in \mathbb R^3$ and $\gamma = (\gamma_1, \gamma_2, \gamma_3) \in \mathbb Z_+^3$, let

\[ |\gamma| = \sum_{i=1}^3 \gamma_i, \qquad \gamma! = \prod_{i=1}^3 \gamma_i!, \qquad x^\gamma = \prod_{i=1}^3 x_i^{\gamma_i}, \qquad \nabla^{|\gamma|} \beta(x) = \frac{ \partial^{|\gamma|} }{ \partial x_1^{\gamma_1} \partial x_2^{\gamma_2} \partial x_3^{\gamma_3} }\, \beta(x). \]

Each entry in $E_j(x)$ is a polynomial of the form $p_\gamma(x - c_j) = \prod_{i=1}^3 (x_i - c_{i,j})^{\gamma_i}$, with $\gamma$ such that $|\gamma| \le \rho_2$ and $\min\{\gamma_1, \gamma_2\} = 0$. There is no $\gamma$ with both $\gamma_1 > 0$ and $\gamma_2 > 0$ because $\beta(x)$ is the difference $R(x_1, x_3) - R(x_2, x_3)$, whose polynomial approximation does not include interactions between $x_1$ and $x_2$. The dimension of $E_j(x)$ is $J_2 \times 1$, where $J_2 = 2\binom{\rho_2 + 2}{2} - (\rho_2 + 1)$, and the first entry in $E_j(x)$ is the polynomial of degree zero (i.e., $p_0(x - c_j) = 1$). Next, stack $E_1(x)', \ldots, E_K(x)'$ into the $K \times J_2$ matrix $E(x)$, and $\beta_1, \ldots, \beta_K$ into the $K \times 1$ vector $B$. The regression of $B$ on $E$ is kernel weighted depending on the distance between a fixed point $x \in \mathcal C$ and $c_j$.
For a choice of bandwidth $h_2 > 0$ and a kernel density function that satisfies Assumption 3, the $K \times K$ matrix $\Omega(x; h_2)$ is the diagonal matrix of kernel weights:

\[ \Omega(x; h_2) = \mathrm{diag}\{ \Omega_j(x; h_2) \}_j = \mathrm{diag}\left\{ \prod_{i=1}^3 k\!\left( \frac{x_i - c_{i,j}}{h_2} \right) \right\}_j. \]

The first-step regression consists of solving the following problem:

\[ \hat\eta = \arg\min_\eta \; \big( B - E(x)\eta \big)'\, \Omega(x; h_2)\, \big( B - E(x)\eta \big), \qquad \hat\beta(x) = e_1'\hat\eta = \hat\eta_1, \]

where $\eta$ is a $J_2 \times 1$ vector and $\eta_1$ is the first coordinate of the vector $\eta$ (the intercept coefficient).

In the second step, integrate the estimated function $\hat\beta(x)$ over $\mathcal C$. Note that the approximated integral $\hat\mu$ is written as a weighted sum of the $\beta_j$:

\[ \int_{\mathcal C} \hat\beta(x)\, dx = \int_{\mathcal C} e_1' \big( E(x)'\Omega(x;h_2)E(x) \big)^{-1} \sum_j \Omega_j(x;h_2) E_j(x)\, \beta_j \, d(x) = \sum_j \int_{\mathcal C} e_1' \big( E(x)'\Omega(x;h_2)E(x) \big)^{-1} \Omega_j(x;h_2) E_j(x)\, d(x)\; \beta_j = \sum_j \Delta_j \beta_j. \]

The expression for the correction weight $\Delta_j$ is

\[ \Delta_j = \int_{\mathcal C} e_1' \big( E(x)'\Omega(x;h_2)E(x) \big)^{-1} \Omega_j(x;h_2) E_j(x)\, d(x) = \int_{\mathcal C} \frac{ \det\big( E(x)'\Omega(x;h_2)\, E_{\leftarrow e_j}(x) \big) }{ \det\big( E(x)'\Omega(x;h_2)E(x) \big) }\, d(x), \]

where Cramer's rule is used in the second equality, and $E_{\leftarrow e_j}(x)$ is the matrix-valued function $E(x)$ except for its first column, which is replaced by the $K \times 1$ vector $e_j$ that is zero everywhere except for the $j$-th entry, which is equal to 1.

The approximation error of such a procedure is well behaved if $R(x, y)$ is a continuously differentiable function of order up to $\rho_2 + 1$ on $\mathcal C$. This implies that $\nabla^{|\gamma|}\beta(x)$ is a continuous function for every $\gamma$ such that $|\gamma| = \rho_2 + 1$. Lemma B.8 below states the approximation error of using a multivariate local polynomial regression on a finite number of points to obtain $\hat\beta(x)$. This result is Theorem 3.1 of Lipman et al. (2006); here, account is taken of the fact that $\beta$ is the difference of two functions.
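The two-step construction of the correction weights $\Delta_j$ can be sketched compactly. For readability, the code below works with a one-dimensional $x$ and a local-linear basis rather than the trivariate basis $E_j(x)$; the kernel, the grid approximating $\mathcal C$, and the bandwidth are illustrative assumptions.

```python
import numpy as np

def delta_weights(cutoffs, grid, h2, rho2=1):
    """Weights Delta_j such that the integral of beta_hat over C equals
    sum_j Delta_j * beta_j (one-dimensional analogue of the procedure)."""
    k = lambda u: np.maximum(1.0 - np.abs(u), 0.0)  # triangular kernel (assumed)
    Delta = np.zeros(len(cutoffs))
    dx = grid[1] - grid[0]
    for x in grid:
        E = np.vander(cutoffs - x, N=rho2 + 1, increasing=True)  # E_j(x)
        Om = k((x - cutoffs) / h2)                               # Omega_j(x; h2)
        M = E.T @ (E * Om[:, None])                              # E' Omega E
        row = np.linalg.solve(M, (E * Om[:, None]).T)[0]  # e1'(E'OmE)^{-1}E'Om
        Delta += row * dx                                 # integrate over C
    return Delta

cutoffs = np.linspace(0.1, 0.9, 15)
grid = np.linspace(0.2, 0.8, 400)          # interior region C
Delta = delta_weights(cutoffs, grid, h2=0.25)
beta = np.sin(cutoffs)                     # stand-in for the K jump values
print(Delta @ beta)                        # approx. integral of sin over C
print(np.cos(0.2) - np.cos(0.8))           # exact value, for comparison
```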
Let
C ⊂ R be open and convex. Let R : R → R be a ρ + 1 times continuouslydifferentiable function on C , and define β ( x ) = R ( x , x ) − R ( x , x ) . For x ∈ C , assume b β ( x ) is constructed as above, and that the matrix E ( x ) ′ Ω ( x ; h ) E ( x ) is invertible for some choice of > . Then, there exists ξ j ∈ (0 , j = 1 , . . . , K , such that b β ( x ) − β ( x ) = X | γ | = ρ +1min { γ ,γ } =0 K X j =1 ( γ ! ( c j − x ) γ ∇ | γ | β (cid:18) ξ j ( c j − x ) + x (cid:19) det (cid:0) E ( x ) ′ Ω ( x ; h ) E ← e j ( x ) (cid:1) det ( E ( x ) ′ Ω ( x ; h ) E ( x )) ) (B.114) where the E ( x ) , E ← e j ( x ) , and Ω ( x ; h ) matrices have been described above.Proof. See the proof of Theorem 3.1 in Lipman et al. (2006) and use the fact that ∇ | γ | β ( x ) = 0 if γ > γ > c j arethought as coming from a triangular array indexed by K : { c j,K } j . The approximation error b β ( x ) − β ( x ) decreases to zero as K grows large. Lemma B.9 below uses regularity conditions on thefunction β and on the triangular array of points to determine the rate at which the approximationerror converges to zero. Lemma B.9.
Lemma B.9. Assume the conditions of Lemma B.8 hold. Furthermore, assume that:

(i) $K \to \infty$, $h_2 \to 0$, and $1/(K h_2^3) = O(1)$;

(ii) there exists a positive-definite $J_2 \times J_2$ matrix $Q$ such that $\sup_{x \in \mathcal C} \big\| K h_2^3\, [E(x/h_2)'\Omega(x;h_2)E(x/h_2)]^{-1} - Q \big\| = o(1)$; and

(iii) the function $\beta(x)$ has bounded derivatives on $\mathcal C$ of order up to $\rho_2 + 2$.

Then,

\[ \int_{\mathcal C} \hat\beta(x) - \beta(x)\, dx - B_K = O\big( h_2^{\rho_2 + 2} \big) \tag{B.115} \]

where

\[ B_K = \int_{\mathcal C} \sum_{\substack{|\gamma| = \rho_2 + 1 \\ \min\{\gamma_1, \gamma_2\} = 0}} \sum_{j=1}^K \left\{ \frac{1}{\gamma!} (c_j - x)^\gamma\, \nabla^{|\gamma|}\beta(x)\, \frac{ \det\big( E(x)'\Omega(x;h_2)\, E_{\leftarrow e_j}(x) \big) }{ \det\big( E(x)'\Omega(x;h_2)E(x) \big) } \right\} dx \tag{B.116} \]

and $B_K = O\big( h_2^{\rho_2 + 1} \big)$. Moreover, there exists a $J_2 \times 1$ vector $\Theta$ such that $e_1' Q \Theta > 0$, and

\[ \max_{1 \le j \le K} \left\| h_2^{-3} \int_{\mathcal C} \Omega_j(x; h_2) E_j(x/h_2)\, dx - \Theta \right\| = o(1) \tag{B.117} \]
\[ \max_{1 \le j \le K} \big| K \Delta_j - e_1' Q \Theta \big| = o(1). \tag{B.118} \]
Proof. Parts (B.115) and (B.116): start with Equation B.114. Do a first-order Taylor expansion of $\nabla^{|\gamma|}\beta(\xi_j(c_j - x) + x)$ around $x$ and substitute into Equation B.114 to obtain

\[ \hat\beta(x) - \beta(x) = \sum_{\substack{|\gamma| = \rho_2+1 \\ \min\{\gamma_1,\gamma_2\}=0}} \sum_{j=1}^K \left\{ \frac{1}{\gamma!} (c_j - x)^\gamma\, \nabla^{|\gamma|}\beta(x)\, \frac{ \det\big( E(x)'\Omega(x;h_2)\, E_{\leftarrow e_j}(x) \big) }{ \det\big( E(x)'\Omega(x;h_2)E(x) \big) } \right\} \tag{B.119} \]
\[ \quad + \sum_{\substack{|\gamma| = \rho_2+1 \\ \min\{\gamma_1,\gamma_2\}=0}} \sum_{j=1}^K \left\{ \frac{1}{\gamma!} (c_j - x)^\gamma \sum_{|\eta| = 1} \nabla^{|\gamma+\eta|}\beta\big( \delta_j (c_j - x) + x \big)\, (c_j - x)^\eta\, \frac{ \det\big( E(x)'\Omega(x;h_2)\, E_{\leftarrow e_j}(x) \big) }{ \det\big( E(x)'\Omega(x;h_2)E(x) \big) } \right\}. \tag{B.120} \]

The integral over the set $\mathcal C$ is

\[ \int_{\mathcal C} \hat\beta(x) - \beta(x)\, dx = B_K \tag{B.121} \]
\[ \quad + \int_{\mathcal C} \sum_{\substack{|\gamma| = \rho_2+1, \; |\eta| = 1 \\ \min\{\gamma_1,\gamma_2\}=0}} \sum_{j=1}^K \left\{ \frac{1}{\gamma!} (c_j - x)^{\gamma+\eta}\, \nabla^{|\gamma+\eta|}\beta\big( \delta_j (c_j - x) + x \big)\, \frac{ \det\big( E(x)'\Omega(x;h_2)\, E_{\leftarrow e_j}(x) \big) }{ \det\big( E(x)'\Omega(x;h_2)E(x) \big) } \right\} dx. \tag{B.122} \]

The absolute value of the expression inside the integral in Equation B.122 is bounded by

\[ \sum_{\substack{|\gamma| = \rho_2+1, \; |\eta| = 1 \\ \min\{\gamma_1,\gamma_2\}=0}} \sum_{j=1}^K \left\{ \frac{1}{\gamma!} |c_j - x|^{\gamma+\eta} \left| \nabla^{|\gamma+\eta|}\beta\big( \delta_j (c_j - x) + x \big) \right| \left| e_1' \big( E(x)'\Omega(x;h_2)E(x) \big)^{-1} \Omega_j(x;h_2) E_j(x) \right| \right\} \tag{B.123–B.124} \]
\[ = \sum_{\substack{|\gamma| = \rho_2+1, \; |\eta| = 1 \\ \min\{\gamma_1,\gamma_2\}=0}} \sum_{j=1}^K \left\{ \frac{1}{\gamma!} |c_j - x|^{\gamma+\eta} \left| \nabla^{|\gamma+\eta|}\beta\big( \delta_j (c_j - x) + x \big) \right| \left| e_1' \big( E(x/h_2)'\Omega(x;h_2)E(x/h_2) \big)^{-1} \Omega_j(x;h_2) E_j(x/h_2) \right| \right\} \tag{B.125–B.126} \]
\[ \le M h_2^{\rho_2+2} \left\| K h_2^3 \big( E(x/h_2)'\Omega(x;h_2)E(x/h_2) \big)^{-1} \right\| \frac{1}{K h_2^3} \sum_{j=1}^K \big\| \Omega_j(x;h_2) E_j(x/h_2) \big\| \tag{B.127} \]
\[ \le M h_2^{\rho_2+2}\, O(1)\, O(1) = O\big( h_2^{\rho_2+2} \big), \tag{B.128} \]

where it is used that the derivatives of $\beta$ are bounded; that $|(c_j - x)^{\gamma+\eta}| \le h_2^{\rho_2+2}$ on the support of $\Omega_j(x;h_2)$; that the norm of the inverse of $\frac{1}{K h_2^3} E(x/h_2)'\Omega(x;h_2)E(x/h_2)$ is bounded over $x$ and $n$ (Assumption (ii)); and the fact that $\sum_j \| \Omega_j(x;h_2) E_j(x/h_2) \| \le M K h_2^3$. It follows that Equation B.122 is $O\big( h_2^{\rho_2+2} \big)$. A similar argument yields $B_K = O\big( h_2^{\rho_2+1} \big)$.

Part (B.117): assume WLOG that the support of the kernel is $[-1, 1]$ (Assumption 3).
Define $F(x) = E_j(x + c_j)$; $\mathcal C_h = \{ x \in \mathbb R^3 : \prod_{i=1}^3 (x_i \pm h_2) \subseteq \mathcal C \}$, where $\prod$ is used to denote the Cartesian product; and $\Theta = \int_{[-1,1]^3} k(u_1) k(u_2) k(u_3) F(u)\, du$, where $u = (u_1, u_2, u_3)$. Then

\[ 0 \le \max_{j: c_j \in \mathcal C_h} \left\| h_2^{-3} \int_{\mathcal C} \Omega_j(x;h_2) E_j(x/h_2)\, dx - \Theta \right\| \le \sup_{c \in \mathcal C_h} \left\| h_2^{-3} \int_{c \pm h_2} \prod_{i=1}^3 k\big( (x_i - c_i)/h_2 \big)\, F\big( (x - c)/h_2 \big)\, dx - \Theta \right\| = \sup_{c \in \mathcal C_h} \left\| \int_{[-1,1]^3} k(u_1)k(u_2)k(u_3) F(u)\, du - \Theta \right\| = 0, \]

where the transformation $u = (x - c)/h_2$ is used. The result follows from the fact that $\mathcal C_h \uparrow \mathcal C$.

Part (B.118): using the formula for the correction weights $\Delta_j$,

\[ \big| K\Delta_j - e_1' Q \Theta \big| = \left| K \int_{\mathcal C} e_1' \big( E(x/h_2)'\Omega(x;h_2)E(x/h_2) \big)^{-1} \Omega_j(x;h_2) E_j(x/h_2)\, d(x) - e_1' Q \Theta \right| \]
\[ = \left| \int_{\mathcal C} e_1' \Big[ K h_2^3 \big( E(x/h_2)'\Omega(x;h_2)E(x/h_2) \big)^{-1} \Big]\, h_2^{-3} \Omega_j(x;h_2) E_j(x/h_2)\, d(x) - e_1' Q \Theta \right| \]
\[ \le \left| \int_{\mathcal C} e_1' \Big[ K h_2^3 \big( E(x/h_2)'\Omega(x;h_2)E(x/h_2) \big)^{-1} - Q \Big]\, h_2^{-3} \Omega_j(x;h_2) E_j(x/h_2)\, d(x) \right| + \left| \int_{\mathcal C} e_1' Q\, h_2^{-3} \Omega_j(x;h_2) E_j(x/h_2)\, d(x) - e_1' Q \Theta \right| \]
\[ \le \int_{\mathcal C} \left\| K h_2^3 \big( E(x/h_2)'\Omega(x;h_2)E(x/h_2) \big)^{-1} - Q \right\| h_2^{-3} \big\| \Omega_j(x;h_2) E_j(x/h_2) \big\|\, d(x) + \left| e_1' Q \left[ \int_{\mathcal C} h_2^{-3} \Omega_j(x;h_2) E_j(x/h_2)\, d(x) - \Theta \right] \right| \]
\[ = o(1)\, O(1) + o(1) = o(1). \]

Next,

\[ \left| \int_{\mathcal C} dc - e_1' Q \Theta \right| \le \left| \int_{\mathcal C} dc - \sum_j \Delta_j \right| + \left| \sum_j \Delta_j - e_1' Q \Theta \right| \le o(1) + \frac{1}{K} \sum_j \big| K\Delta_j - e_1' Q \Theta \big| \le o(1) + \max_j \big| K\Delta_j - e_1' Q \Theta \big| = o(1), \]

which shows that $e_1' Q \Theta = \int_{\mathcal C} dc > 0$. $\square$

Remark 1.
Lemma B.9 also applies to weighted integrals of the form

\[ \mu = \int_{\mathcal C} \omega(x)\, \beta(x)\, d(x) \]

where $\omega(x)$ is a probability density function that is continuous, bounded, and bounded away from zero. There are three main differences between unweighted integrals (treated above) and weighted integrals (considered in the main text): (i) the formula for the weights $\Delta_j$ changes to

\[ \Delta_j = \int_{\mathcal C} \omega(x)\, e_1' \big( E(x)'\Omega(x;h_2)E(x) \big)^{-1} \Omega_j(x;h_2) E_j(x)\, d(x) = \int_{\mathcal C} \omega(x)\, \frac{ \det\big( E(x)'\Omega(x;h_2)\, E_{\leftarrow e_j}(x) \big) }{ \det\big( E(x)'\Omega(x;h_2)E(x) \big) }\, d(x); \]

(ii) the formula for the bias changes to

\[ B_K = \int_{\mathcal C} \omega(x) \sum_{\substack{|\gamma| = \rho_2+1 \\ \min\{\gamma_1,\gamma_2\}=0}} \sum_{j=1}^K \left\{ \frac{1}{\gamma!} (c_j - x)^\gamma\, \nabla^{|\gamma|}\beta(x)\, \frac{ \det\big( E(x)'\Omega(x;h_2)\, E_{\leftarrow e_j}(x) \big) }{ \det\big( E(x)'\Omega(x;h_2)E(x) \big) } \right\} dx; \]

and (iii) conclusion B.118 of Lemma B.9 changes to $\max_{1 \le j \le K} \big| K\Delta_j/\omega(c_j) - e_1' Q \Theta \big| = o(1)$.

Lemma B.9 states a condition on the asymptotic behavior of the triangular array of points $\{c_j\}_{j=1}^K$. For large $K$, the points must cover the domain $\mathcal C$ uniformly enough that $E(x)'\Omega(x;h_2)E(x)$ is invertible and of magnitude $K h_2^3$, that is, $K$ times the volume of every $h_2$-neighborhood of $x$, for every $x$ in $\mathcal C$. These conditions are satisfied in a variety of examples of triangular arrays of points that cover $\mathcal C$ uniformly well for large $K$.

To be clearer, this assumption is illustrated in a simple example. In the main text, the conditions of Lemma B.9 are restated in Assumption 6(c) and in the rate conditions of Theorem 2. The choice of $\bar h_1$, $h_2$, $\rho_2$ needs to satisfy both the conditions in Assumption 6 and the rate conditions of Theorem 2. Pick the choices given in the example of Figure 1, for which $\bar h_1 = K^{-\lambda_2/\lambda_1}$, $h_2 = K^{-1/6}$, and $\rho_2 = 3$. Let $c \in \mathbb R^3$, $x \in \mathbb R^3$, $k(u) = 0.5\, \mathbb I\{|u| \le 1\}$, and $\mathcal C = (0,1)^3$. Define $N$ points for each $l$-th coordinate of $c = (c_1, c_2, c_3)$ as $c_{l,j,N} = j/(N+1)$, $j = 1, \ldots, N$, $l = 1, 2, 3$.
In this case, $K = N^3$ and $h_2 = 1/N^{1/2}$. Assumption 6(b) requires the distance $c_{1,j+1,K} - c_{1,j,K} = 1/(N+1) = 1/(K^{1/3}+1)$ to be greater than the order of $\bar h_1 = K^{-\lambda_2/\lambda_1}$. This is equivalent to $\lambda_2 > \lambda_1/3$. Let $\widetilde K = \sum_{(l_1,l_2,l_3)} \mathbb I\big\{ \Omega_{(l_1,l_2,l_3)}(x;h_2) > 0 \big\}$, where $(l_1,l_2,l_3)$ indexes the point $c_{(l_1,l_2,l_3)} = (c_{l_1}, c_{l_2}, c_{l_3})$. For each $x$, the number of points $c_{(l_1,l_2,l_3)}$ in the $h_2$-neighborhood of $x$ grows to infinity at the $K^{1/2} = K h_2^3$ rate, so $\widetilde K = O(K h_2^3)$. This rate of growth is uniform over $x \in \mathcal C$. The vector of polynomials $E_j(x)$ is written as $E_j(x) = F\big( x - c_{(l_1,l_2,l_3)} \big)$, where

\[ F(u) = \left[ u^{(0,0,0)} \;\; u^{(1,0,0)} \;\; u^{(0,1,0)} \;\; \ldots \;\; u^{(2,0,1)} \right]', \]

that is, all polynomials $u^{(\gamma_1,\gamma_2,\gamma_3)}$ such that $\gamma_i \in \mathbb Z_+$ for all $i$, $0 \le \gamma_1 + \gamma_2 + \gamma_3 \le 3$, and $\min\{\gamma_1, \gamma_2\} = 0$.
Then, $F(u)$ is a $J_2 \times 1$ vector with $J_2 = 16$. Now, fix $x \in (0,1)^3$ and a large $K$. Consider a uniform discrete random vector $\tilde u$ taking values on $(-1,1)^3$ according to $u_{(l_1,l_2,l_3)} = \big( x - c_{(l_1,l_2,l_3)} \big)/h_2$ for all $c_{(l_1,l_2,l_3)} \in (x \pm h_2)$. It turns out that

\[ \frac{1}{\widetilde K}\, E(x/h_2)'\Omega(x;h_2)E(x/h_2) = \frac{1}{2^3 \widetilde K} \sum_{(l_1,l_2,l_3)} \mathbb I\big\{ \Omega_{(l_1,l_2,l_3)}(x;h_2) > 0 \big\}\, F\big( (x - c_{(l_1,l_2,l_3)})/h_2 \big) F\big( (x - c_{(l_1,l_2,l_3)})/h_2 \big)' = \frac{1}{2^3}\, E\big[ F(\tilde u) F(\tilde u)' \big]. \]

This is approximately equal to a constant multiple of $\int_{u \in [-1,1]^3} F(u)F(u)'\, du$, uniformly in $x$. Simply call $Q$ the inverse of this matrix, a positive-definite matrix. Finally,

\[ \sup_{x \in \mathcal C} \left\| \left[ \frac{1}{K h_2^3}\, E(x/h_2)'\Omega(x;h_2)E(x/h_2) \right]^{-1} - Q \right\| = o(1). \]
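Condition (ii) of Lemma B.9 can be verified numerically in a one-dimensional analogue of this grid example: equally spaced points, a uniform kernel, and the polynomial basis $F(u) = [1\; u\; u^2\; u^3]'$. The sizes $N$ and $h_2$ below are arbitrary illustrative choices.

```python
import numpy as np

N, h2 = 2000, 0.05
cutoffs = np.arange(1, N + 1) / (N + 1)     # equally spaced grid points
x0 = 0.5
u = (x0 - cutoffs) / h2
u = u[np.abs(u) <= 1]                       # points in the h2-neighborhood of x0
F = np.vander(u, N=4, increasing=True)      # F(u) at each nearby point
emp = F.T @ F / len(u)                      # (1/K_tilde) E' Omega E, up to scale

# population analogue: E[F(u)F(u)'] for u ~ Uniform[-1, 1]
g = np.linspace(-1, 1, 200001)
Fg = np.vander(g, N=4, increasing=True)
pop = Fg.T @ Fg / len(g)
print(np.max(np.abs(emp - pop)))            # small, and shrinks as N grows
```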
B.4 Consistent Estimation of Standard Errors

This section demonstrates that the estimator for the variance of $\hat\mu^c$ proposed in Section 3.2 is a consistent estimator. For the nearest-neighbor matching, the distribution of $X_i$ is continuous, so assume $X_1 < \ldots < X_n$ WLOG. For a fixed number of neighbors $N \in \mathbb Z_+$, define

\[ c: \mathcal X \to \{0, c_1, \ldots, c_K\}, \;\text{where } c(x) = \max_{1 \le j \le K} \{ c_j : c_j \le x \}, \text{ set to } 0 \text{ when no cutoff lies weakly below } x \tag{B.129} \]
\[ \ell: \{1, \ldots, n\} \times \mathbb Z_+ \to \{1, \ldots, n\}, \;\text{where } \ell(i, N) \text{ is such that } \sum_{\substack{v = 1 \\ v \ne i}}^n \mathbb I\left\{ |X_v - X_i| \le |X_{\ell(i,N)} - X_i|,\; c(X_v) = c(X_i) \right\} = N \tag{B.130} \]
\[ \hat\varepsilon_i = \sqrt{ \frac{N}{N+1} } \left( Y_i - \frac{1}{N} \sum_{l=1}^N Y_{\ell(i,l)} \right) \tag{B.131} \]

The expression for the variance estimator in the continuous case is given in Equation 26.
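A sketch of the residuals (B.129)–(B.131): each observation is matched to its $N$ nearest neighbors in $X$ within the same cell $c(X_i)$ (same closest cutoff from below). The cutoffs, the choice $N = 3$, and the toy data are assumptions for illustration; the $\sqrt{N/(N+1)}$ factor makes $E[\hat\varepsilon_i^2]$ approximately equal to the conditional variance.

```python
import numpy as np

def nn_residuals(x, y, cutoffs, N=3):
    """Nearest-neighbor residuals (B.131), matching within the cell
    c(X_i) = largest cutoff weakly below X_i (index 0 below all cutoffs)."""
    cell = np.searchsorted(np.sort(cutoffs), x, side="right")
    eps = np.empty_like(y)
    idx = np.arange(len(x))
    for i in idx:
        same = idx[(cell == cell[i]) & (idx != i)]
        nn = same[np.argsort(np.abs(x[same] - x[i]))[:N]]  # N nearest in X
        eps[i] = np.sqrt(N / (N + 1)) * (y[i] - y[nn].mean())
    return eps

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 1000)
y = x ** 2 + rng.normal(scale=0.5, size=x.size)
eps = nn_residuals(x, y, cutoffs=np.array([0.3, 0.6]))
print(np.mean(eps ** 2))  # roughly Var(Y | X) = 0.25
```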
Lemma B.10. Assume the conditions of Theorem 2 and $(K \underline h_1)^{-1} = O(1)$ hold. Then, $\hat V_n^c / V_n^c \stackrel{p}{\to} 1$.

Proof. The proof extends the arguments of Theorem A3 of CCT to the case where the number of cutoffs grows to infinity. Define $\phi_n$ and $\hat\phi_n$ by

\[ \phi_n(X_i) = \sum_{j=1}^K \frac{\Delta_j}{n h_{1j}}\, k_{ji}\, e_1' \left( v_i^{j+} E[G_n^{j+}] - v_i^{j-} E[G_n^{j-}] \right) \widetilde H_{ji} \tag{B.132} \]
\[ \hat\phi_n(X_i) = \sum_{j=1}^K \frac{\Delta_j}{n h_{1j}}\, k_{ji}\, e_1' \left( v_i^{j+} G_n^{j+} - v_i^{j-} G_n^{j-} \right) \widetilde H_{ji}. \tag{B.133} \]

Use them to rewrite $V_n^c$ and $\hat V_n^c$ as

\[ V_n^c = n\, E\big[ \varepsilon_i^2\, \phi_n(X_i)^2 \big] \tag{B.134} \]
\[ \hat V_n^c = n \sum_{i=1}^n \hat\varepsilon_i^2\, \hat\phi_n(X_i)^2. \tag{B.135} \]

In order to show $\hat V_n^c / V_n^c \stackrel{p}{\to} 1$, it suffices to show that $(K n \bar h_1)( \hat V_n^c - V_n^c ) \stackrel{p}{\to} 0$, since $(V_n^c)^{-1} = O(K n \bar h_1)$.

\[ (K n \bar h_1)( \hat V_n^c - V_n^c ) = (K n \bar h_1)\, n \sum_{i=1}^n \hat\varepsilon_i^2\, \hat\phi_n(X_i)^2 - (K n \bar h_1)\, n \sum_{i=1}^n \varepsilon_i^2\, \phi_n(X_i)^2 \tag{B.136} \]
\[ = (K n \bar h_1)\, n \sum_{i=1}^n \hat\varepsilon_i^2 \big( \hat\phi_n(X_i) - \phi_n(X_i) + \phi_n(X_i) \big)^2 - (K n \bar h_1)\, n \sum_{i=1}^n \varepsilon_i^2\, \phi_n(X_i)^2 \tag{B.137} \]
\[ = (K n \bar h_1)\, n \sum_{i=1}^n \hat\varepsilon_i^2 \big( \hat\phi_n(X_i) - \phi_n(X_i) \big)^2 \tag{B.138} \]
\[ \quad + 2 (K n \bar h_1)\, n \sum_{i=1}^n \hat\varepsilon_i^2 \big( \hat\phi_n(X_i) - \phi_n(X_i) \big) \phi_n(X_i) \tag{B.139} \]
\[ \quad + (K n \bar h_1)\, n \sum_{i=1}^n \big( \hat\varepsilon_i^2 - \varepsilon_i^2 \big)\, \phi_n(X_i)^2 \tag{B.140} \]

The rest of the proof shows that parts (B.138)–(B.140) converge in probability to zero.

Part (B.138). First, for arbitrary $x \in \mathcal X$,

\[ \big| \hat\phi_n(x) - \phi_n(x) \big| \le \sum_{j=1}^K \left\{ \frac{|\Delta_j|}{n h_{1j}} \left| k\!\left( \frac{x - c_j}{h_{1j}} \right) \right| \left| e_1' \left( v^+_{c_j,h_{1j}}(x) \big( G_n^{j+} - E[G_n^{j+}] \big) - v^-_{c_j,h_{1j}}(x) \big( G_n^{j-} - E[G_n^{j-}] \big) \right) H\!\left( \frac{x - c_j}{h_{1j}} \right) \right| \right\} \tag{B.141} \]
\[ \le 2 \max_{1 \le j \le K} \left\{ \frac{|\Delta_j|}{n h_{1j}} \left| k\!\left( \frac{x - c_j}{h_{1j}} \right) \right| \left| e_1' \left( v^+_{c_j,h_{1j}}(x) \big( G_n^{j+} - E[G_n^{j+}] \big) - v^-_{c_j,h_{1j}}(x) \big( G_n^{j-} - E[G_n^{j-}] \big) \right) H\!\left( \frac{x - c_j}{h_{1j}} \right) \right| \right\} \tag{B.142} \]
\[ = O\!\left( \frac{1}{K n \underline h_1} \right) O_P\!\left( \sqrt{ \frac{\log n}{n \underline h_1} } \right) = O_P\!\left( \frac{1}{K n \underline h_1} \sqrt{ \frac{\log n}{n \underline h_1} } \right), \tag{B.143} \]

where the second inequality uses the fact that at most two elements of the sum over $j$ are non-zero for each value of $x$; the first equality relies on $\max_j |\Delta_j| = O(K^{-1})$ (Lemma B.9), on $h_{1j}^{-1} \le \underline h_1^{-1}$, on the fact that the kernel is bounded (Assumption 3), that $v^\pm_{c_j,h_{1j}}(x)\, H\big( h_{1j}^{-1}(x - c_j) \big)$ is bounded, and that $\max_j \big\| G_n^{j\pm} - E[G_n^{j\pm}] \big\| = O_P\big( (\log n/(n \underline h_1))^{1/2} \big)$ (Lemma B.6); the last equality uses the rate condition $\bar h_1/\underline h_1 = O(1)$. The rate in (B.143) is uniform over $x \in \mathcal X$. Then, it follows that

\[ |(B.138)| \le (K n \bar h_1)\, n \left( \frac{1}{n} \sum_{i=1}^n \hat\varepsilon_i^2 \right) \max_x \big| \hat\phi_n(x) - \phi_n(x) \big|^2 \tag{B.144} \]
\[ = (K n \bar h_1)\, n\, O_P(1)\, O_P\!\left( \left( \frac{1}{K n \underline h_1} \right)^2 \frac{\log n}{n \underline h_1} \right) = \frac{1}{K \underline h_1}\, \frac{\log n}{n \underline h_1}\, O_P(1) = o_P(1), \tag{B.145} \]

where the first equality uses the rate derived in (B.143) and the fact that $\hat\varepsilon_i$ is a.s. bounded because $\varepsilon_i$ is a.s. bounded (Assumption 7); the last equality relies on the rate conditions $(K \underline h_1)^{-1} = O(1)$ and $\log n\, (n \underline h_1)^{-1} = o(1)$.
Part (B.139). First, for arbitrary $x \in \mathcal X$,

\[ |\phi_n(x)| \le \sum_{j=1}^K \left\{ \frac{|\Delta_j|}{n h_{1j}} \left| k\!\left( \frac{x - c_j}{h_{1j}} \right) \right| \left| e_1' \left( v^+_{c_j,h_{1j}}(x)\, E[G_n^{j+}] - v^-_{c_j,h_{1j}}(x)\, E[G_n^{j-}] \right) H\!\left( \frac{x - c_j}{h_{1j}} \right) \right| \right\} \tag{B.146} \]
\[ \le 2 \max_{1 \le j \le K} \left\{ \frac{|\Delta_j|}{n h_{1j}} \left| k\!\left( \frac{x - c_j}{h_{1j}} \right) \right| \left| e_1' \left( v^+_{c_j,h_{1j}}(x)\, E[G_n^{j+}] - v^-_{c_j,h_{1j}}(x)\, E[G_n^{j-}] \right) H\!\left( \frac{x - c_j}{h_{1j}} \right) \right| \right\} \tag{B.147} \]
\[ = O\!\left( \frac{1}{K n \underline h_1} \right), \tag{B.148} \]

where the second inequality uses the fact that at most two elements of the sum over $j$ are non-zero for each value of $x$; the equality relies on $\max_j |\Delta_j| = O(K^{-1})$ (Lemma B.9), on $h_{1j}^{-1} \le \underline h_1^{-1}$, on the boundedness of the kernel (Assumption 3) and of $v^\pm_{c_j,h_{1j}}(x)\, H\big( h_{1j}^{-1}(x - c_j) \big)$, and on the fact that $E[G_n^{j\pm}]$ is approximately equal to a positive-definite matrix with determinant bounded away from zero (Lemma B.6); the rate uses $\bar h_1/\underline h_1 = O(1)$ and is uniform over $x \in \mathcal X$. Then, it follows that

\[ |(B.139)| \le 2 (K n \bar h_1)\, n \left( \frac{1}{n} \sum_{i=1}^n \hat\varepsilon_i^2 \right) \max_x \big| \hat\phi_n(x) - \phi_n(x) \big| \max_x |\phi_n(x)| \tag{B.149} \]
\[ = (K n \bar h_1)\, n\, O_P(1)\, O_P\!\left( \frac{1}{K n \underline h_1} \sqrt{ \frac{\log n}{n \underline h_1} } \right) O\!\left( \frac{1}{K n \underline h_1} \right) = \frac{1}{K \underline h_1} \sqrt{ \frac{\log n}{n \underline h_1} }\, O_P(1) = o_P(1), \tag{B.150} \]

where the first equality uses the rates derived in (B.143) and (B.148) and the fact that $\hat\varepsilon_i$ is a.s. bounded because $\varepsilon_i$ is a.s. bounded (Assumption 7); the last equality relies on the rate conditions $(K \underline h_1)^{-1} = O(1)$ and $\log n\, (n \underline h_1)^{-1} = o(1)$.

Part (B.140). First, expand $\hat\varepsilon_i^2$ around $\varepsilon_i^2$. To simplify notation, abbreviate $E[Y_i \mid X_i] = R(X_i, D_i)$ to $R_i$:

\[ \hat\varepsilon_i^2 = \frac{N}{N+1} \left( Y_i - \frac{1}{N} \sum_{l=1}^N Y_{\ell(i,l)} \right)^2 \tag{B.151} \]
\[ = \frac{N}{N+1} \left( R_i + \varepsilon_i - \frac{1}{N} \sum_{l=1}^N ( R_{\ell(i,l)} + \varepsilon_{\ell(i,l)} ) \right)^2 \tag{B.152} \]
\[ = \frac{N}{N+1} \left( \varepsilon_i - \frac{1}{N} \sum_{l=1}^N \varepsilon_{\ell(i,l)} \right)^2 + \frac{N}{N+1} \left( \frac{1}{N} \sum_{l=1}^N ( R_i - R_{\ell(i,l)} ) \right)^2 + \frac{2N}{N+1} \left( \varepsilon_i - \frac{1}{N} \sum_{l=1}^N \varepsilon_{\ell(i,l)} \right) \left( \frac{1}{N} \sum_{l=1}^N ( R_i - R_{\ell(i,l)} ) \right) \tag{B.153} \]
\[ = \varepsilon_i^2 - \frac{2\varepsilon_i}{N+1} \sum_{l=1}^N \varepsilon_{\ell(i,l)} + \frac{1}{N(N+1)} \sum_{l=1}^N ( \varepsilon_{\ell(i,l)}^2 - \varepsilon_i^2 ) + \frac{2}{N(N+1)} \sum_{l=1}^N \sum_{v > l}^N \varepsilon_{\ell(i,l)} \varepsilon_{\ell(i,v)} + \frac{N}{N+1} \left( \frac{1}{N} \sum_{l=1}^N ( R_i - R_{\ell(i,l)} ) \right)^2 + \frac{2\varepsilon_i}{N+1} \sum_{l=1}^N ( R_i - R_{\ell(i,l)} ) - \frac{2}{N(N+1)} \sum_{l=1}^N \varepsilon_{\ell(i,l)} \sum_{v=1}^N ( R_i - R_{\ell(i,v)} ) \tag{B.154} \]

Then, substitute the expression for $\hat\varepsilon_i^2 - \varepsilon_i^2$ derived above into part (B.140):

\[ (B.140) = (K n \bar h_1)\, n \sum_{i=1}^n \left\{ -\frac{2\varepsilon_i}{N+1} \sum_{l=1}^N \varepsilon_{\ell(i,l)} \right\} \phi_n(X_i)^2 \tag{B.155} \]
\[ \quad + (K n \bar h_1)\, n \sum_{i=1}^n \left\{ \frac{1}{N(N+1)} \sum_{l=1}^N ( \varepsilon_{\ell(i,l)}^2 - \varepsilon_i^2 ) \right\} \phi_n(X_i)^2 \tag{B.156} \]
\[ \quad + (K n \bar h_1)\, n \sum_{i=1}^n \left\{ \frac{2}{N(N+1)} \sum_{l=1}^N \sum_{v > l}^N \varepsilon_{\ell(i,l)} \varepsilon_{\ell(i,v)} \right\} \phi_n(X_i)^2 \tag{B.157} \]
\[ \quad + (K n \bar h_1)\, n \sum_{i=1}^n \left\{ \frac{N}{N+1} \left( \frac{1}{N} \sum_{l=1}^N ( R_i - R_{\ell(i,l)} ) \right)^2 \right\} \phi_n(X_i)^2 \tag{B.158} \]
\[ \quad + (K n \bar h_1)\, n \sum_{i=1}^n \left\{ \frac{2\varepsilon_i}{N+1} \sum_{l=1}^N ( R_i - R_{\ell(i,l)} ) \right\} \phi_n(X_i)^2 \tag{B.159} \]
\[ \quad + (K n \bar h_1)\, n \sum_{i=1}^n \left\{ -\frac{2}{N(N+1)} \sum_{l=1}^N \varepsilon_{\ell(i,l)} \sum_{v=1}^N ( R_i - R_{\ell(i,v)} ) \right\} \phi_n(X_i)^2 \tag{B.160} \]

The steps below demonstrate that parts (B.155)–(B.160) converge in probability to zero.

Part (B.155): the expected value $E[(B.155) \mid \mathcal X_n] = 0$. To compute the variance of (B.155) centered at $E[(B.155) \mid \mathcal X_n] = 0$, abbreviate $N^{-1} \sum_{l=1}^N \varepsilon_{\ell(i,l)}$ to $\bar\varepsilon_i$, $\phi_n(X_i)$ to $\phi_{ni}$, $\mathbb I\{ c(X_i) = c(X_j) \}$ to $I^{=}_{ij}$, and $\mathbb I\{ c(X_i) \ne c(X_j) \}$ to $I^{\ne}_{ij}$.
Then,

\[ E\Big[ \big( (\text{B.155}) - E[(\text{B.155}) \mid \mathcal X_n] \big)^2 \Big] = M (Kn\underline{h})^2\, E \sum_{i=1}^n \sum_{j=1}^n \big( I^{=}_{ij} + I^{\neq}_{ij} \big)\, \varepsilon_i \bar\varepsilon_i \phi_{ni}^2\, \varepsilon_j \bar\varepsilon_j \phi_{nj}^2 \tag{B.161} \]
\[ = M (Kn\underline{h})^2\, E \sum_{i=1}^n \sum_{j=1}^n I^{=}_{ij}\, \varepsilon_i \bar\varepsilon_i \phi_{ni}^2\, \varepsilon_j \bar\varepsilon_j \phi_{nj}^2 \tag{B.162} \]
\[ = M (Kn\underline{h})^2 \sum_{i=1}^n \sum_{j=1}^n E\big[ I^{=}_{ij} \big]\, O_P\big( (Kn\underline{h})^{-4} \big) \tag{B.163} \]
\[ = M (Kn\underline{h})^{-2} n^2\, O_P\big( K^{-1} \big) = o_P(1) \tag{B.164} \]

where $M$ is a positive constant. The second equality uses that the expected value of $I^{\neq}_{ij}\, \varepsilon_i \bar\varepsilon_i \phi_{ni}^2\, \varepsilon_j \bar\varepsilon_j \phi_{nj}^2$ conditional on $\mathcal X_n$ is zero, because if $I^{\neq}_{ij} = 1$, then $\varepsilon_i \bar\varepsilon_i$ is independent of $\varepsilon_j \bar\varepsilon_j$, and $E[\varepsilon_i \bar\varepsilon_i \mid \mathcal X_n] = E[\varepsilon_i \mid \mathcal X_n]\, E[\bar\varepsilon_i \mid \mathcal X_n] = 0$; the third equality relies on the fact that $\varepsilon_i$ and $\bar\varepsilon_i$ are a.s. bounded (Assumption 7), and on $\phi_{ni}$ being $O\big( (Kn\underline{h})^{-1} \big)$ (Equation B.148); the fourth equality uses that $E[I^{=}_{ij}] = E\big[ P\big( c(X_i) = c(X_j) \mid X_j \big) \big] = O(K^{-1})$.
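For readers who want to replicate the nearest-neighbor residuals $\widehat\varepsilon_i^2$ of Equation B.151, the following is a minimal sketch under simplifying assumptions: it matches each observation to its $N$ nearest neighbors on the running variable over the whole sample, whereas the estimator of Section 3.1 matches within narrower cells around each cutoff; the function name is illustrative, not the paper's code.

```python
import numpy as np

def nn_residuals_squared(x, y, n_neighbors):
    """Squared residuals via N-nearest-neighbor matching, as in Eq. (B.151):
    eps2_i = N/(N+1) * (Y_i - mean of Y over the N nearest neighbors of i)^2.
    Matching here is on |X_i - X_j| over the whole sample, a simplification
    of the cutoff-cell matching used in the paper."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    eps2 = np.empty(len(x))
    for i in range(len(x)):
        dist = np.abs(x - x[i])
        dist[i] = np.inf                        # exclude observation i itself
        nbrs = np.argsort(dist)[:n_neighbors]   # indices ell(i,1),...,ell(i,N)
        ybar = y[nbrs].mean()
        eps2[i] = n_neighbors / (n_neighbors + 1) * (y[i] - ybar) ** 2
    return eps2
```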
B.5.1 Example of Compliance Behaviors
Here is a simple example with three different treatments and two cutoffs (that is, 3 schools, $K = 2$) to illustrate the different compliance behaviors. Table B.1 below lists all possible combinations of treatment eligibility and assignment produced by $\mathcal U_i(x)$.

Table B.1: Different Compliance Behaviors
[All possible realizations of the random function $\mathcal U_i(x)$, grouped by eligibility and classified as ever-defiers, never-changers, or ever-compliers.]
Notes: All possible realizations of the random function $\mathcal U_i(x)$ for values of $x$ such that $D(x) \in \{d_1, d_2, d_3\}$.
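A quick enumeration clarifies the counting behind Table B.1. The sketch below lists all $3^3 = 27$ possible assignment functions over the three eligibility regions and picks out the never-changers (constant assignment); the split of the remaining behaviors into ever-compliers and ever-defiers follows the class $\mathcal U^*$ of Eq. (29) in the main text, which this illustration does not encode.

```python
from itertools import product

treatments = ["d1", "d2", "d3"]   # the three schools
regions = 3                       # x < c1, c1 <= x < c2, x >= c2

# One compliance behavior is one map from eligibility region to assigned school.
behaviors = list(product(treatments, repeat=regions))
print(len(behaviors))             # 27 possible realizations of U_i(x)

# Never-changers are immediate to classify: assignment is constant across regions.
never_changers = [b for b in behaviors if len(set(b)) == 1]
print(never_changers)  # [('d1','d1','d1'), ('d2','d2','d2'), ('d3','d3','d3')]
```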
B.5.2 Estimation and Inference

Theorem 4 in the main text suggests a two-step estimation procedure for $\theta^{ec}$. In the first step, obtain $\widehat B_j$ as in Section 3.1, and compute estimates $\widehat{\widetilde W}_j$ using LPRs of $W(X_i, D_i)$ on $X_i$ at each side of the cutoff $c_j$. For each $j = 1, \ldots, K$, and each $l$-th coordinate of the vector $\widetilde W_j$, $l = 1, \ldots, q$, the researcher computes

\[ \widehat{\widetilde W}_{j,l} = \widehat a^{+}_{j,l} - \widehat a^{-}_{j,l} \tag{B.179} \]
\[ (\widehat a^{+}_{j,l}, \widehat b^{+}_{j,l}) = \operatorname*{argmin}_{(a, b)} \sum_{i=1}^n \left\{ k\!\left( \frac{X_i - c_j}{h_j} \right) v^{j+}_i \left[ e_l' W(X_i, D_i) - a - b_1 (X_i - c_j) - \ldots - b_\rho (X_i - c_j)^\rho \right]^2 \right\} \tag{B.180} \]
\[ (\widehat a^{-}_{j,l}, \widehat b^{-}_{j,l}) = \operatorname*{argmin}_{(a, b)} \sum_{i=1}^n \left\{ k\!\left( \frac{X_i - c_j}{h_j} \right) v^{j-}_i \left[ e_l' W(X_i, D_i) - a - b_1 (X_i - c_j) - \ldots - b_\rho (X_i - c_j)^\rho \right]^2 \right\} \tag{B.181} \]

where $e_l$ is the $q \times 1$ vector of zeros except for the $l$-th coordinate, which equals one. The $q \times 1$ vector $\widehat{\widetilde W}_j$ is constructed by stacking the $q$ estimates, $\widehat{\widetilde W}_j = \big[ \widehat{\widetilde W}_{j,1}, \ldots, \widehat{\widetilde W}_{j,q} \big]'$.

In the second step, regress $\widehat B_j$ on $\widehat{\widetilde W}_j$ to obtain an estimate for $\theta^{ec}$. More specifically, stack all $q \times 1$ vectors $\widehat{\widetilde W}_j$ into the $K \times q$ matrix $\widehat{\widetilde W}$, and the $\widehat B_j$ into the $K \times 1$ vector $\widehat B$. Choose a $K \times K$ symmetric and positive-definite weighting matrix $\Omega$. The estimator $\widehat\theta^{ec}$ is the solution to the following weighted least-squares problem:

\[ \widehat\theta^{ec} = \operatorname*{argmin}_{\theta} \big( \widehat B - \widehat{\widetilde W} \theta \big)'\, \Omega\, \big( \widehat B - \widehat{\widetilde W} \theta \big). \tag{B.182} \]

The estimator for the ATE on ever-compliers $\mu^{ec}$ is a linear combination of $\widehat\theta^{ec}$,

\[ \widehat\mu^{ec} = Z(F)\, \widehat\theta^{ec} \tag{B.183} \]

where $Z(F)$ is defined in Equation 35.

Asymptotic normality of $\widehat\theta^{ec}$ relies on smoothness assumptions on the conditional moments of $Y_i$ and on the probabilities of treatment for the different compliance behaviors. The sample size grows large, while the number of cutoffs remains fixed.
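To fix ideas, here is a minimal sketch of the second step in Equation B.182, assuming the first-step estimates $\widehat B$ ($K \times 1$) and $\widehat{\widetilde W}$ ($K \times q$) have already been computed; the function name and NumPy implementation are illustrative, not the paper's code.

```python
import numpy as np

def wls_theta(B_hat, W_hat, Omega):
    """Weighted least squares of Eq. (B.182):
    theta_hat = (W' Omega W)^{-1} W' Omega B.
    B_hat: (K,) first-step jump estimates; W_hat: (K, q); Omega: (K, K)."""
    WtO = W_hat.T @ Omega
    theta_hat = np.linalg.solve(WtO @ W_hat, WtO @ B_hat)
    return theta_hat

# The ever-complier ATE of Eq. (B.183) is then the linear combination
# mu_ec_hat = Z_F @ theta_hat, with Z_F the counterfactual weight vector Z(F).
```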
Assumption 10. For any $d \in \mathcal D$, and any $\bar{\mathcal U}$ in the class $\mathcal U^*$ defined in Eq. (29),
(a) $E[Y_i(d) \mid X_i = x, \mathcal U_i = \bar{\mathcal U}]$ is a $\rho + 1$ times continuously differentiable function of $x$ with bounded $(\rho+1)$-th derivative $\nabla^{\rho+1}_x E[Y_i(d) \mid X_i = x, \mathcal U_i = \bar{\mathcal U}]$;
(b) $V[Y_i(d) \mid X_i = x, \mathcal U_i = \bar{\mathcal U}]$ is a continuous function of $x$, and $E\big[ \big| Y_i(d) - E[Y_i(d) \mid X_i = x, \mathcal U_i = \bar{\mathcal U}] \big|^3 \,\big|\, X_i = x, \mathcal U_i = \bar{\mathcal U} \big]$ is bounded;
(c) for $\widetilde W$ defined in Eq. (36), $\widetilde W' \widetilde W$ is invertible;
(d) $P[\mathcal U_i = \bar{\mathcal U} \mid X_i = x]$ is a $\rho + 1$ times continuously differentiable function of $x$ with bounded $(\rho+1)$-th derivative $\nabla^{\rho+1}_x P[\mathcal U_i = \bar{\mathcal U} \mid X_i = x]$.
Theorem B.1. Suppose Assumptions 3-4 and 8-10 hold, and that the number of cutoffs $K$ is fixed. Let $\underline{h} = \min_j \{h_j\}$ and $\overline{h} = \max_j \{h_j\}$. As $n \to \infty$, assume that $\overline{h} \to 0$, $\overline{h}/\underline{h} = O(1)$, $n\underline{h} \to \infty$, and $(n\underline{h})^{1/2}\, \overline{h}^{\,\rho+1} = O(1)$. Then,

\[ \big( V^{\theta^{ec}}_n \big)^{-1/2} \big( \widehat\theta^{ec} - B^{\theta^{ec}}_n - \theta^{ec} \big) \xrightarrow{d} N(0, I) \tag{B.184} \]
\[ \big( V^{\mu^{ec}}_n \big)^{-1/2} \big( \widehat\mu^{ec} - B^{\mu^{ec}}_n - \mu^{ec} \big) \xrightarrow{d} N(0, 1) \tag{B.185} \]

where $0$ denotes the $q \times 1$ vector of zeros, and $I$ is the $q \times q$ identity matrix. The bias and variance terms are characterized as follows:

\[ B^{\theta^{ec}}_n = \big( \widehat{\widetilde W}{}' \Omega \widehat{\widetilde W} \big)^{-1} \widehat{\widetilde W}{}' \Omega\, B^{ec}_n, \quad \text{a } q \times 1 \text{ vector;} \tag{B.186} \]
\[ B^{\mu^{ec}}_n = Z(F) \big( \widehat{\widetilde W}{}' \Omega \widehat{\widetilde W} \big)^{-1} \widehat{\widetilde W}{}' \Omega\, B^{ec}_n, \quad \text{a scalar;} \tag{B.187} \]
\[ V^{\theta^{ec}}_n = \big( \widehat{\widetilde W}{}' \Omega \widehat{\widetilde W} \big)^{-1} \widehat{\widetilde W}{}' \Omega\, V^{ec}_n\, \Omega \widehat{\widetilde W} \big( \widehat{\widetilde W}{}' \Omega \widehat{\widetilde W} \big)^{-1}, \quad \text{a } q \times q \text{ matrix;} \tag{B.188} \]
\[ V^{\mu^{ec}}_n = Z(F) \big( \widehat{\widetilde W}{}' \Omega \widehat{\widetilde W} \big)^{-1} \widehat{\widetilde W}{}' \Omega\, V^{ec}_n\, \Omega \widehat{\widetilde W} \big( \widehat{\widetilde W}{}' \Omega \widehat{\widetilde W} \big)^{-1} Z(F)', \quad \text{a scalar.} \tag{B.189} \]

These terms depend on $B^{ec}_n$ (a $K \times 1$ vector) and $V^{ec}_n$ (a $K \times K$ matrix), defined below:

\[ B^{ec}_n = [B^{ec}_{n1}, \ldots, B^{ec}_{nK}]', \text{ where for each } j, \tag{B.190} \]
\[ B^{ec}_{nj} = \frac{h_j^{\rho+1} f(c_j)}{(\rho+1)!} \big[ 1 \;\; -\theta^{ec\prime} \big]\, \tilde e' \left( G^{j+}_n \boldsymbol\gamma^* \nabla^{\rho+1}_x \begin{bmatrix} R(c_j, d_j) \\ W(c_j, d_j) \end{bmatrix} - G^{j-}_n \boldsymbol\gamma^* \nabla^{\rho+1}_x \begin{bmatrix} R(c_j, d_{j-1}) \\ W(c_j, d_{j-1}) \end{bmatrix} \right); \tag{B.191} \]
\[ V^{ec}_n = \begin{bmatrix} V^{ec}_{n11} & \cdots & V^{ec}_{n1K} \\ \vdots & \ddots & \vdots \\ V^{ec}_{nK1} & \cdots & V^{ec}_{nKK} \end{bmatrix}, \text{ where } V^{ec}_{njl} = 0 \text{ if } |j - l| > 1, \text{ and otherwise} \tag{B.192} \]
\[ V^{ec}_{njl} = n\, E\left\{ \frac{1}{n h_j} k\!\left( \frac{X_i - c_j}{h_j} \right) \frac{1}{n h_l} k\!\left( \frac{X_i - c_l}{h_l} \right) \big[ 1 \;\; -\theta^{ec\prime} \big]\, \tilde e' \big( v^{j+}_i E[G^{j+}_n] - v^{j-}_i E[G^{j-}_n] \big) \widetilde H_{ji}\, \boldsymbol\varepsilon_i \boldsymbol\varepsilon_i'\, \widetilde H_{li}' \big( v^{l+}_i E[G^{l+}_n]' - v^{l-}_i E[G^{l-}_n]' \big) \tilde e\, \big[ 1 \;\; -\theta^{ec\prime} \big]' \right\}, \tag{B.193} \]

where $\boldsymbol\varepsilon_i = [Y_i \;\; W(X_i, D_i)']' - E\big\{ [Y_i \;\; W(X_i, D_i)']' \mid X_i \big\}$, a $(q+1) \times 1$ vector; $\tilde e = I_{q+1} \otimes e_1$, where $I_{q+1}$ is the $(q+1) \times (q+1)$ identity matrix, $\otimes$ denotes the Kronecker product, and $e_1$ is a $(\rho+1) \times 1$ vector of zeros except for the first coordinate, which equals $1$; $\boldsymbol\gamma^* = I_{q+1} \otimes \gamma^*$ for $\gamma^*$ defined in Theorem 1; $\widetilde H_{ji} = I_{q+1} \otimes H\big( (X_i - c_j)/h_j \big)$ for $H(u)$ defined in Theorem 1; and $G^{j\pm}_n = \big[ (nh_j)^{-1} \sum_{i=1}^n k\big( \frac{X_i - c_j}{h_j} \big) v^{j\pm}_i \widetilde H_{ji} \widetilde H_{ji}' \big]^{-1}$ for $v^{j\pm}_i$ defined in Equation 10.

Furthermore, $\big( V^{\theta^{ec}}_n \big)^{-1/2} = O_P\big( (n\underline{h})^{1/2} \big)$, and $\big( V^{\theta^{ec}}_n \big)^{-1/2} B^{\theta^{ec}}_n = O_P\big( (n\underline{h})^{1/2}\, \overline{h}^{\,\rho+1} \big)$, where $A^{-1/2}$ denotes the inverse of the square root of a positive-definite matrix $A$. Similarly, $\big( V^{\mu^{ec}}_n \big)^{-1/2} = O_P\big( (n\underline{h})^{1/2} \big)$, and $\big( V^{\mu^{ec}}_n \big)^{-1/2} B^{\mu^{ec}}_n = O_P\big( (n\underline{h})^{1/2}\, \overline{h}^{\,\rho+1} \big)$. The approximate MSE of either $\widehat\theta^{ec}$ or $\widehat\mu^{ec}$ is minimized by setting $\Omega = \big( B^{ec}_n B^{ec\prime}_n + V^{ec}_n \big)^{-1}$.

The proof of Theorem B.1 is in Section B.5.3 of the supplemental appendix. The variance terms in Equations B.188 and B.189 contain the matrix $V^{ec}_n$, which needs to be estimated. The elements $V^{ec}_{njl}$ of that matrix are consistently estimated by

\[ \widehat V^{ec}_{njl} = \sum_{i=1}^n \left\{ \frac{1}{n h_j} k\!\left( \frac{X_i - c_j}{h_j} \right) \frac{1}{n h_l} k\!\left( \frac{X_i - c_l}{h_l} \right) \big[ 1 \;\; -\widehat\theta^{ec\prime} \big]\, \tilde e' \big( v^{j+}_i G^{j+}_n - v^{j-}_i G^{j-}_n \big) \widetilde H_{ji}\, \widehat{\boldsymbol\varepsilon}_i \widehat{\boldsymbol\varepsilon}_i'\, \widetilde H_{li}' \big( v^{l+}_i G^{l+\prime}_n - v^{l-}_i G^{l-\prime}_n \big) \tilde e\, \big[ 1 \;\; -\widehat\theta^{ec\prime} \big]' \right\}, \tag{B.194} \]

where $\widehat\theta^{ec}$ is a consistent estimator of $\theta^{ec}$, and the vector of residuals is estimated by a nearest-neighbor matching estimator, analogously to Section 3.1's Equation 15: for $\mathbf Y_i = [Y_i \;\; W(X_i, D_i)']'$,

\[ \widehat{\boldsymbol\varepsilon}_i \widehat{\boldsymbol\varepsilon}_i' = \frac{3}{4} \left( \mathbf Y_i - \frac{1}{3} \sum_{l=1}^{3} \mathbf Y_{\ell(i,l)} \right) \left( \mathbf Y_i - \frac{1}{3} \sum_{l=1}^{3} \mathbf Y_{\ell(i,l)} \right)'. \tag{B.195} \]
If the bandwidth choices are such that the standardized bias term $\big( V^{\theta^{ec}}_n \big)^{-1/2} B^{\theta^{ec}}_n$ differs from zero asymptotically, then inference must be done using a bias-corrected estimator. A practical way of doing bias correction is to increase the order of the polynomial to $\rho + 1$ and compute $\widehat\theta^{ec\prime}$ and $\widehat V^{ec\prime}_n$ using the same bandwidth choices. It follows that $\big( \widehat V^{\theta^{ec}\prime}_n \big)^{-1/2} \big( \widehat\theta^{ec\prime} - \theta^{ec} \big) \xrightarrow{d} N(0, I)$. Similar to Theorems 1 and 2, Theorem B.1 allows for bandwidth choices that produce overlapping estimation windows across cutoffs. The variance estimator in (B.194) takes account of overlap by allowing $V_{njl}$ to be non-zero for $j \neq l$.

The following steps are a practical recommendation to implement MSE-optimal and bias-corrected estimates; a code sketch of the iteration appears after the list. The source of MSE in estimation is $B^{ec}_{nj}$ and $V^{ec}_{njl}$, which come from the regression of $Y_i - W(X_i, D_i)'\theta^{ec}$ on $X_i$ at each cutoff $c_j$; thus, it makes sense to choose MSE-optimal bandwidths for these regressions.

0. Take initial values $\widehat\theta^{ec(0)}$ and $\Omega^{(0)}$;
1. Compute first-step IK bandwidths $h^{(0)}_{1j}$ for sharp RD of $Y_i - W(X_i, D_i)'\widehat\theta^{ec(0)}$ on $X_i$ at each cutoff $c_j$. Use local-linear regression ($\rho = 1$) and the edge kernel;
2. Obtain bias-corrected estimates $\widehat B^{(0)}_j$ for each cutoff $j$ using sharp RD of $Y_i$ on $X_i$ with local-quadratic regression, the edge kernel, and bandwidth $h^{(0)}_{1j}$; do the same for each coordinate of $W(X_i, D_i)$ to compute $\widehat{\widetilde W}{}^{(0)}_j$ for each $j$; stack the estimates into $\widehat B^{(0)}$ and $\widehat{\widetilde W}{}^{(0)}$;
3. Update $\widehat\theta^{ec}$: compute $\widehat\theta^{ec(1)}$ using Equation B.182 with $\Omega^{(0)}$, $\widehat B^{(0)}$, and $\widehat{\widetilde W}{}^{(0)}$;
4. Estimate the variance of $\widehat B^{(0)} - \widehat{\widetilde W}{}^{(0)} \theta^{ec}$ using Equation B.194 with $\rho = 2$, $\widehat\theta^{ec(1)}$, and bandwidths $h^{(0)}_{1j}$; call the estimated variance $\widehat V^{ec(1)}_n$;
5. Update $\Omega$: compute $\Omega^{(1)} = \big( \widehat V^{ec(1)}_n \big)^{-1}$;
6. Update $\widehat\theta^{ec}$: compute $\widehat\theta^{ec(2)}$ using Equation B.182 with $\Omega^{(1)}$, $\widehat B^{(0)}$, and $\widehat{\widetilde W}{}^{(0)}$;
7. Repeat Steps 4-6 starting with $\widehat\theta^{ec(2)}$ in the place of $\widehat\theta^{ec(1)}$. Iterate these three steps until convergence of $\widehat\theta^{ec}$. Call $\widehat\theta^{ec(3)}$ and $\Omega^{(3)}$ the iterated values of, respectively, $\widehat\theta^{ec}$ and $\Omega$;
8. Repeat Steps 1-7 starting with $\widehat\theta^{ec(3)}$ in the place of $\widehat\theta^{ec(0)}$, and with $\Omega^{(3)}$ in the place of $\Omega^{(0)}$. Iterate these seven steps until the difference between the $\theta$s of Step 3 and Step 7 converges to zero. Call $\widehat\theta^{ec(4)}$, $\Omega^{(4)}$, and $\widehat{\widetilde W}{}^{(4)}$ the iterated values of, respectively, $\widehat\theta^{ec}$, $\Omega$, and $\widehat{\widetilde W}$;
9. Estimate the variance of $\widehat\theta^{ec}$ using $\widehat V^{\theta^{ec}}_n = \big( \widehat{\widetilde W}{}^{(4)\prime} \Omega^{(4)} \widehat{\widetilde W}{}^{(4)} \big)^{-1}$; compute $\widehat\mu^{ec}$ using Equation B.183 with $\widehat\theta^{ec(4)}$ and $Z(F)$ given by the counterfactual policy of interest; estimate the variance of $\widehat\mu^{ec}$ using $\widehat V^{\mu^{ec}}_n = Z(F)\, \widehat V^{\theta^{ec}}_n\, Z(F)'$.
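The sketch below illustrates the nested iteration of Steps 0-9, assuming the user supplies functions for the first-step RD estimates and the variance matrix; `rd_estimates`, `variance_matrix`, and the convergence tolerance are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

def iterate_theta_ec(rd_estimates, variance_matrix, theta0, Omega0,
                     tol=1e-8, max_iter=100):
    """Sketch of the iterative scheme in Steps 0-9.
    rd_estimates(theta): returns (B_hat (K,), W_hat (K, q)) from bias-corrected
        sharp-RD fits with IK bandwidths for Y - W'theta (Steps 1-2).
    variance_matrix(theta, B_hat, W_hat): returns the (K, K) estimate of V^ec_n
        from Eq. (B.194) (Step 4)."""
    theta = np.asarray(theta0, dtype=float)
    Omega = np.asarray(Omega0, dtype=float)
    for _ in range(max_iter):                       # outer loop: Steps 1-8
        B_hat, W_hat = rd_estimates(theta)          # Steps 1-2: refit the RDs
        theta_outer = theta.copy()
        for _ in range(max_iter):                   # inner loop: Steps 3-7
            WtO = W_hat.T @ Omega
            theta_new = np.linalg.solve(WtO @ W_hat, WtO @ B_hat)  # Eq. B.182
            Omega = np.linalg.inv(variance_matrix(theta_new, B_hat, W_hat))
            converged = np.max(np.abs(theta_new - theta)) < tol
            theta = theta_new
            if converged:
                break
        if np.max(np.abs(theta - theta_outer)) < tol:   # Step 8 stopping rule
            break
    V_theta = np.linalg.inv(W_hat.T @ Omega @ W_hat)    # Step 9: variance
    return theta, V_theta
```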
B.5.3 Proof of Theorem B.1

This proof relies heavily on Lemma B.1, which is a CLT for the LPR estimator of the difference in side-limits of the conditional mean of the vector $\mathbf Y_i$ given $X_i$ at $X_i = c_j$. The lemma is applied to $\mathbf Y_i = [Y_i \;\; W(X_i, D_i)']'$ to arrive at

\[ (V_{nj})^{-1/2} \big( \widehat J_j - B_{nj} - J_j \big) \xrightarrow{d} N(0, I) \tag{B.196} \]

for each $j$.

The assumptions of Theorem B.1 satisfy the assumptions of Lemma B.1. In fact, the conditions on the rates, on the distribution of $X_i$, and on the kernel density in Lemma B.1 are simply restated in the conditions of Theorem B.1. It remains to verify the other two sufficient conditions of Lemma B.1: (a) $m(x)$ has continuous derivatives with respect to $x$ of order $\rho + 1$ in a compact interval centered at $c_j$ but excluding $c_j$, and side limits exist at $c_j$; and (b) $\zeta(x)$ is continuous with respect to $x$ in a compact interval centered at $c_j$ but excluding $c_j$, side limits exist at $c_j$, and the third moment conditional on $X_i$ is bounded.

For (a), note that, in the fuzzy case, the mean of $Y_i$ and $W(X_i, D_i)$ conditional on $X_i$ is a sum of the means of potential outcomes $Y_i(d)$ and of $W(c_j, d)$ for various dosages $d$ conditional on sets of the form $\{\mathcal U_i(c_j) = d\}$, weighted by conditional probabilities of the same sets (see the proof of Theorem 4). Assumption 10 implies that such conditional means and conditional probabilities are smooth functions of $x$ and that side-limits exist at $x = c_j$. Similarly, for (b), the conditional covariance of $[Y_i, W(X_i, D_i)']'$ is a function of sums of the first and second moments of potential outcomes $Y_i(d)$ for various dosages conditional on sets of the form $\{\mathcal U_i(c_j) = d\}$, weighted by conditional probabilities of the same sets. Assumption 10 ensures continuity of $\zeta(x)$ with respect to $x$ and existence of side-limits at $c_j$. A similar argument bounds the third centered moment of $Y_i(d)$, and Lemma B.1 applies.

Next, note that

\[ \lim_{e \downarrow 0} \Big\{ E[W(c_j, D_i) \mid X_i = c_j + e] - E[W(c_j, D_i) \mid X_i = c_j - e] \Big\} \tag{B.197} \]
\[ = \sum_{l=0,\, l \neq j}^{K} \big\{ W(c_j, d_j) - W(c_j, d_l) \big\}\, \omega_{j,l} = \widetilde W_j \tag{B.198} \]

which means that $J_j = [B_j \;\; \widetilde W_j']'$. Call $\alpha = [1 \;\; -\theta^{ec\prime}]'$, a $(q+1) \times 1$ vector. Then, (B.196) implies

\[ \big( \alpha' V_{nj} \alpha \big)^{-1/2} \big( \alpha' \widehat J_j - \alpha' B_{nj} - \alpha' J_j \big) \xrightarrow{d} N(0, 1) \tag{B.199} \]
\[ \big( V^{ec}_{njj} \big)^{-1/2} \big( \widehat B_j - \theta^{ec\prime} \widehat{\widetilde W}_j - B^{ec}_{nj} \big) \xrightarrow{d} N(0, 1) \tag{B.200} \]

where $\alpha' J_j = 0$ by Assumption 9, and the definitions (B.191) and (B.193) are used. Stacking across cutoffs gives

\[ \big( \mathrm{diag}\{V^{ec}_{njj}\}_j \big)^{-1/2} \big( \widehat B - \widehat{\widetilde W} \theta^{ec} - B^{ec}_n \big) \xrightarrow{d} N(0, I) \tag{B.201} \]
\[ \big( V^{ec}_n \big)^{-1/2} \big( \widehat B - \widehat{\widetilde W} \theta^{ec} - B^{ec}_n \big) \xrightarrow{d} N(0, I) \tag{B.202} \]

where $\big( V^{ec}_n \big)^{1/2} \big( \mathrm{diag}\{V^{ec}_{njj}\}_j \big)^{-1/2} \to I$ because the covariances (off-diagonal terms) converge to zero, since the estimation windows do not overlap in the limit. Define $\Gamma = \big( \widetilde W' \Omega \widetilde W \big)^{-1} \widetilde W' \Omega$. Then,

\[ \big( \Gamma V^{ec}_n \Gamma' \big)^{-1/2} \big( \Gamma \widehat B - \Gamma \widehat{\widetilde W} \theta^{ec} - \Gamma B^{ec}_n \big) \xrightarrow{d} N(0, I) \tag{B.203} \]

Define $\widehat\Gamma = \big( \widehat{\widetilde W}{}' \Omega \widehat{\widetilde W} \big)^{-1} \widehat{\widetilde W}{}' \Omega$, and write

\[ \big( V^{\theta^{ec}}_n \big)^{-1/2} \big( \widehat\theta^{ec} - B^{\theta^{ec}}_n - \theta^{ec} \big) = \big( \widehat\Gamma V^{ec}_n \widehat\Gamma' \big)^{-1/2} \big( \widehat\Gamma \widehat B - \widehat\Gamma \widehat{\widetilde W} \theta^{ec} - \widehat\Gamma B^{ec}_n \big) \tag{B.204} \]
\[ = \big( \widehat\Gamma V^{ec}_n \widehat\Gamma' \big)^{-1/2} \big( \Gamma \widehat B - \Gamma \widehat{\widetilde W} \theta^{ec} - \Gamma B^{ec}_n \big) \tag{B.205} \]
\[ \quad + \big( \widehat\Gamma V^{ec}_n \widehat\Gamma' \big)^{-1/2} \big( \widehat\Gamma - \Gamma \big) \big( \widehat B - \widehat{\widetilde W} \theta^{ec} - B^{ec}_n \big) \tag{B.206} \]
\[ = \big( \widehat\Gamma V^{ec}_n \widehat\Gamma' \big)^{-1/2} \big( \Gamma V^{ec}_n \Gamma' \big)^{1/2} \big( \Gamma V^{ec}_n \Gamma' \big)^{-1/2} \big( \Gamma \widehat B - \Gamma \widehat{\widetilde W} \theta^{ec} - \Gamma B^{ec}_n \big) \tag{B.207} \]
\[ \quad + O_P\big( (n\underline{h})^{1/2} \big)\, o_P(1)\, O_P\big( (n\underline{h})^{-1/2} \big) \tag{B.208} \]

which converges in distribution to $N(0, I)$ because of (B.203), the fact that $\big( \widehat\Gamma V^{ec}_n \widehat\Gamma' \big)^{-1/2} \big( \Gamma V^{ec}_n \Gamma' \big)^{1/2} \xrightarrow{p} I$, and that $\widehat\Gamma \xrightarrow{p} \Gamma$. $\Box$

B.6 Estimation of Counterfactual Distributions
This section considers applications where the counterfactual distribution is estimated, as opposed to being known by the researcher. For brevity, I focus on the setting of Section 3.1, that is, sharp RD with discrete counterfactual and fixed $K$. The analysis for the other settings of the paper follows similar arguments. In what follows, I derive the limiting distribution of $\widehat\mu_d$ and propose a consistent variance estimator.

In the first step, the researcher estimates the counterfactual probability mass function $\omega_d(c)$ for every $c \in \mathcal C_K$ using iid observations $Z_i = (Y_i, X_i)$, $i = 1, \ldots, n$. There is a variety of ways to obtain the estimates $\widehat\omega_d(c)$. For example, one may estimate the distribution of $X_i$ non-parametrically, obtain $\widehat f_X(c_j)$ for every $j$, and construct $\widehat\omega_{dj} = \widehat f_X(c_j) / \sum_{l=1}^K \widehat f_X(c_l)$. Another way is to specify a parametric distribution and estimate its parameters. To keep the analysis general, assume

\[ \widehat\omega_{dj} - \omega_{dj} = \sum_{i=1}^n \eta_{nj}(Z_i) + o_P\big( r_n^{-1/2} \big) \tag{B.209} \]

for every $j$, where $\eta_{nj}(Z_i)$ has zero mean and finite variance for each $n$ and $j$, and $r_n$ is a sequence that converges to infinity. The exact forms of the function $\eta_{nj}(Z_i)$ and of $r_n$ depend on the type of estimator used to obtain $\widehat\omega_{dj}$. The sequence $r_n$ represents the rate at which the inverse of the variance of $\widehat\omega_{dj}$ grows; namely, $1/\mathrm{VAR}\big[ \sum_{i=1}^n \eta_{nj}(Z_i) \big] = O(r_n)$. For example, if $\omega_d(c)$ is estimated parametrically by maximum likelihood, then $\eta_{nj}(Z_i)$ is a function of the Hessian matrix times the score function, and $r_n = n$; if $\widehat\omega_{dj}$ is based on a kernel estimator for the density of $X_i$ with bandwidth $h_\omega$, then $\eta_{nj}(Z_i) = (n h_\omega)^{-1} k\big( (X_i - c_j)/h_\omega \big) / \sum_{l=1}^K f_X(c_l)$ and $r_n = n h_\omega$.

The second step consists of estimating $\mu_d$,

\[ \widehat\mu_d = \sum_{j=1}^K \widehat\omega_{dj} \widehat B_j. \tag{B.210} \]

Rewrite $\widehat\mu_d$ as

\[ \widehat\mu_d - \mu_d = \sum_{j=1}^K \omega_{dj} \big( \widehat B_j - B_j \big) + \sum_{j=1}^K B_j \big( \widehat\omega_{dj} - \omega_{dj} \big) + \sum_{j=1}^K \big( \widehat B_j - B_j \big) \big( \widehat\omega_{dj} - \omega_{dj} \big). \tag{B.211} \]

Suppose $\widehat B_j$ has no first-order asymptotic bias (i.e., it is bias-corrected). The proofs of Lemma B.1 and Theorem 2 imply that

\[ \sum_{j=1}^K \omega_{dj} \big( \widehat B_j - B_j \big) = \sum_{i=1}^n \varphi_n(Z_i) + o_P\big( (n\underline{h})^{-1/2} \big), \tag{B.212} \]

where $1/\mathrm{VAR}\big[ \sum_{i=1}^n \varphi_n(Z_i) \big] = O(n\underline{h})$, and $n\underline{h} \to \infty$.
Similarly, the sum across $j$ of (B.209) times $B_j$ gives

\[ \sum_{j=1}^K B_j \big( \widehat\omega_{dj} - \omega_{dj} \big) = \sum_{i=1}^n \underbrace{\sum_{j=1}^K B_j \eta_{nj}(Z_i)}_{\equiv\, \eta_n(Z_i)} + o_P\big( r_n^{-1/2} \big) \tag{B.213} \]
\[ = \sum_{i=1}^n \eta_n(Z_i) + o_P\big( r_n^{-1/2} \big), \tag{B.214} \]

where $1/\mathrm{VAR}\big[ \sum_{i=1}^n \eta_n(Z_i) \big] = O(r_n)$.

Next, substitute (B.212) and (B.214) into Equation B.211,

\[ \widehat\mu_d - \mu_d = \sum_{i=1}^n \varphi_n(Z_i) + \sum_{i=1}^n \eta_n(Z_i) + o_P\big( r_n^{-1/2} \big) + o_P\big( (n\underline{h})^{-1/2} \big) + O_P\big( (n\underline{h})^{-1/2} r_n^{-1/2} \big) \tag{B.215} \]
\[ = \sum_{i=1}^n \big\{ \varphi_n(Z_i) + \eta_n(Z_i) \big\} + o_P\big( r_n^{-1/2} \big) + o_P\big( (n\underline{h})^{-1/2} \big) \tag{B.216} \]

where the second equality relies on $\widehat B_j - B_j = O_P\big( (n\underline{h})^{-1/2} \big)$, $\widehat\omega_{dj} - \omega_{dj} = O_P\big( r_n^{-1/2} \big)$, and on the fact that $(n\underline{h})^{-1/2} r_n^{-1/2}$ converges to zero faster than each of $(n\underline{h})^{-1/2}$ and $r_n^{-1/2}$.

Define $V_{\omega n} = \mathrm{VAR}\big[ \sum_{i=1}^n \varphi_n(Z_i) + \eta_n(Z_i) \big]$, and note that $(V_{\omega n})^{-1/2} = O\big( (\max\{ r_n^{-1}, (n\underline{h})^{-1} \})^{-1/2} \big) = O\big( \min\{ r_n^{1/2}, (n\underline{h})^{1/2} \} \big)$. Then,

\[ (V_{\omega n})^{-1/2} \big( \widehat\mu_d - \mu_d \big) = (V_{\omega n})^{-1/2} \sum_{i=1}^n \big\{ \varphi_n(Z_i) + \eta_n(Z_i) \big\} \tag{B.217} \]
\[ \quad + O\big( \min\{ r_n^{1/2}, (n\underline{h})^{1/2} \} \big)\, o_P\big( r_n^{-1/2} \big) \tag{B.218} \]
\[ \quad + O\big( \min\{ r_n^{1/2}, (n\underline{h})^{1/2} \} \big)\, o_P\big( (n\underline{h})^{-1/2} \big) \tag{B.219} \]
\[ = (V_{\omega n})^{-1/2} \sum_{i=1}^n \big\{ \varphi_n(Z_i) + \eta_n(Z_i) \big\} + o_P(1) \tag{B.220} \]
\[ \xrightarrow{d} N(0, 1). \tag{B.221} \]

A consistent estimator for the variance is

\[ \widehat V_{\omega n} = \sum_{i=1}^n \big\{ \widehat\varphi_n(Z_i) + \widehat\eta_n(Z_i) \big\}^2, \tag{B.222} \]

with $\widehat\varphi_n(Z_i)$ constructed as in Equation (14) of Section 3.1,

\[ \widehat\varphi_n(Z_i) = \widehat\varepsilon_i \sum_{j=1}^K \frac{\widehat\omega_{dj}}{n h_j} k\!\left( \frac{X_i - c_j}{h_j} \right) e_1' \big( v^{j+}_i G^{j+}_n - v^{j-}_i G^{j-}_n \big) H_{ji}, \tag{B.223} \]

and the formula for $\widehat\eta_n(Z_i)$ depends on the form of the estimator of $\omega_d$. In the kernel density example,

\[ \widehat\eta_n(Z_i) = \frac{ \sum_{j=1}^K \widehat B_j (n h_\omega)^{-1} k\big( (X_i - c_j)/h_\omega \big) }{ \sum_{l=1}^K (n h_\omega)^{-1} \sum_{m=1}^n k\big( (X_m - c_l)/h_\omega \big) }. \tag{B.224} \]

An interesting particular case occurs when $r_n$ grows faster than $n\underline{h}$. This is the case if $\omega_d(c)$ is assumed to be in a parametric class, or if $\widehat\omega_d(c)$ is based on a kernel density estimator with a bandwidth that converges to zero more slowly than $\underline{h}$. Let $V_{dn} = \mathrm{VAR}\big[ \sum_{i=1}^n \varphi_n(Z_i) \big]$, as defined in Theorem 1. It follows that

\[ (V_{dn})^{-1/2} \big( \widehat\mu_d - \mu_d \big) = (V_{dn})^{-1/2} \sum_{i=1}^n \big\{ \varphi_n(Z_i) + \eta_n(Z_i) \big\} \tag{B.225} \]
\[ \quad + o_P\big( (n\underline{h})^{1/2} r_n^{-1/2} \big) + o_P\big( (n\underline{h})^{1/2} (n\underline{h})^{-1/2} \big) \tag{B.226} \]
\[ = (V_{dn})^{-1/2} \sum_{i=1}^n \big\{ \varphi_n(Z_i) + \eta_n(Z_i) \big\} + o_P(1) \tag{B.227} \]
\[ \xrightarrow{d} N(0, 1), \tag{B.228} \]

where $(V_{dn})^{-1/2} = O\big( (n\underline{h})^{1/2} \big)$ from Theorem 1, and $(V_{dn})^{-1}\, \mathrm{VAR}\big[ \sum_{i=1}^n \varphi_n(Z_i) + \eta_n(Z_i) \big] \to 1$. In other words, when $\omega_d(c)$ is estimated at a faster rate than $\beta(c)$, the asymptotic distribution and variance estimator provided in Section 3.1 remain valid.
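As an illustration of the kernel-based first step described above, the sketch below computes $\widehat\omega_{dj} = \widehat f_X(c_j) / \sum_l \widehat f_X(c_l)$; the triangular kernel and the function name are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def omega_hat(x, cutoffs, h_omega):
    """Kernel-density counterfactual weights: omega_j = f_X(c_j) / sum_l f_X(c_l).
    Uses a triangular kernel k(u) = (1 - |u|) 1{|u| <= 1}, so that
    f_hat(c) = (n h)^{-1} sum_i k((X_i - c)/h)."""
    x = np.asarray(x, dtype=float)
    f_hat = np.empty(len(cutoffs))
    for j, c in enumerate(cutoffs):
        u = (x - c) / h_omega
        f_hat[j] = np.mean(np.clip(1 - np.abs(u), 0, None)) / h_omega
    return f_hat / f_hat.sum()

# Plugging into Eq. (B.210): mu_hat = (omega_hat(x, cutoffs, h) * B_hat).sum()
```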
B.7 Monte Carlo Simulations with Data-driven Bandwidths

This section revisits the simulations of Section 5 with data-driven bandwidth choices. Both first- and second-step bandwidths follow the rules for practical implementation suggested in Section 3.2 (refer to page 17, the paragraph starting with "A simple recommendation to implement Theorem 2"). The rest of the simulation design remains the same as that of Section 5.

Table B.2 compares the estimation precision of $\widehat\mu$ and $\widehat\mu^{bc}$ across five sample sizes $n$, with respective numbers of cutoffs $K$. Table B.3 analyzes coverage of confidence intervals. Overall, the finite-sample properties are consistent with those of Section 5, where bandwidths are non-random. The randomness of bandwidths increases the variance and bias, but both decrease with $n$
at approximately the same rate as before. Bias correction eliminates most of the bias and produces confidence intervals with correct finite-sample coverage.

Table B.2: Precision of Estimators
[Columns: $n$, $K$, and the bias, variance, and MSE of each of $\widehat\mu$ and $\widehat\mu^{bc}$.]
Notes: The table reports simulated bias, variance, and mean squared error (MSE) for two estimators ($\widehat\mu$, $\widehat\mu^{bc}$) and five sample sizes $n$, with respective numbers of cutoffs $K$. Following Section 3.2, the first-step bandwidths are picked by the IK algorithm and adjusted to be of order $1/K$. The second-step bandwidth is chosen on a grid of multiples of $1/(K+1)$ to minimize the estimated MSE of $\widehat\mu$. The number of simulations is 10,000.

Table B.3: Coverage of 95% Confidence Intervals
[Columns: $n$, $K$, and the percentage of correct coverage and average length for each of $\widehat\mu$ and $\widehat\mu^{bc}$.]
Notes: The table reports the simulated percentage of correct coverage and the average length of 95% confidence intervals. Confidence intervals are constructed using the two estimators ($\widehat\mu$, $\widehat\mu^{bc}$); they equal an estimator plus or minus its estimated standard deviation multiplied by 1.96. Coverage and average length are computed for five sample sizes $n$ and respective numbers of cutoffs $K$. Following Section 3.2, the first-step bandwidths are picked by the IK algorithm and adjusted to be of order $1/K$. The second-step bandwidth is chosen on a grid of multiples of $1/(K+1)$ to minimize the estimated MSE of $\widehat\mu$. The number of simulations is 10,000.
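For concreteness, here is a minimal sketch of how coverage numbers like those in Table B.3 are typically computed: an interval $\widehat\mu \pm 1.96\,\widehat{se}$ is formed in each simulation draw and checked against the true $\mu$. The function `simulate_once` is a hypothetical placeholder standing in for one full estimation run of the simulation design.

```python
import numpy as np

def coverage_of_ci(simulate_once, mu_true, n_sims=10_000, z=1.96):
    """Monte Carlo coverage and average length of 95% confidence intervals.
    simulate_once(): returns (mu_hat, se_hat) for one simulated data set."""
    covered = np.zeros(n_sims, dtype=bool)
    lengths = np.zeros(n_sims)
    for s in range(n_sims):
        mu_hat, se_hat = simulate_once()
        lo, hi = mu_hat - z * se_hat, mu_hat + z * se_hat
        covered[s] = lo <= mu_true <= hi
        lengths[s] = hi - lo
    return covered.mean(), lengths.mean()
```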