[PDF] Regression Discontinuity Design with Many Thresholds

Abstract

Numerous empirical studies employ regression discontinuity designs with multiple cutoffs and heterogeneous treatments. A common practice is to normalize all the cutoffs to zero and estimate one effect. This procedure identifies the average treatment effect (ATE) on the observed distribution of individuals local to existing cutoffs. However, researchers often want to make inferences on more meaningful ATEs, computed over general counterfactual distributions of individuals, rather than simply the observed distribution of individuals local to existing cutoffs. This paper proposes a consistent and asymptotically normal estimator for such ATEs when heterogeneity follows a non-parametric function of cutoff characteristics in the sharp case. The proposed estimator converges at the minimax optimal rate of root-n for a specific choice of tuning parameters. Identification in the fuzzy case, with multiple cutoffs, is impossible unless heterogeneity follows a finite-dimensional function of cutoff characteristics. Under parametric heterogeneity, this paper proposes an ATE estimator for the fuzzy case that optimally combines observations to maximize its precision.

Full PDF

aa r X i v : . [ ec on . E M ] J a n Regression Discontinuity Design with Many Thresholds

Marinho Bertanha a This version: September 16, 2019First version: November 7, 2014

Abstract

Numerous empirical studies employ regression discontinuity designs with multiple cutoﬀsand heterogeneous treatments. A common practice is to normalize all the cutoﬀs to zero andestimate one eﬀect. This procedure identiﬁes the average treatment eﬀect (ATE) on the observeddistribution of individuals local to existing cutoﬀs. However, researchers often want to makeinferences on more meaningful ATEs, computed over general counterfactual distributions ofindividuals, rather than simply the observed distribution of individuals local to existing cutoﬀs.This paper proposes a consistent and asymptotically normal estimator for such ATEs whenheterogeneity follows a non-parametric function of cutoﬀ characteristics in the sharp case. Theproposed estimator converges at the minimax optimal rate of root- n for a speciﬁc choice oftuning parameters. Identiﬁcation in the fuzzy case, with multiple cutoﬀs, is impossible unlessheterogeneity follows a ﬁnite-dimensional function of cutoﬀ characteristics. Under parametricheterogeneity, this paper proposes an ATE estimator for the fuzzy case that optimally combinesobservations to maximize its precision. Keywords:

Regression Discontinuity, Multiple Cutoﬀs, Average Treatment Eﬀect, Peer-eﬀects

JEL Classiﬁcation:

C14, C21, C52, I21. a ∼ mbertanh. Introduction

Applications of regression discontinuity design (RDD) have become increasingly popular in eco-nomics since the late 1990s (Black (1999), Angrist and Lavy (1999), and Van der Klaauw (2002)).One of RDD’s main advantages is identiﬁcation of a local causal eﬀect under minimal functionalform assumptions. More recently, with increasing availability of richer data sets, there havebeen many applications with multiple cutoﬀs and treatments (for example, Black et al. (2007),Egger and Koethenbuerger (2010), De La Mata (2012), Pop-Eleches and Urquiola (2013)). Exist-ing one-cutoﬀ RDD methods applied to each individual cutoﬀ produce many local eﬀects that areestimated using only a few observations near each cutoﬀ. Researchers often prefer one takeawaysummary eﬀect that is more precisely estimated by pooling all the data. The meaning of a summaryeﬀect crucially depends on heterogeneity assumptions and weights imposed on the diﬀerent localeﬀects.Applied studies with multiple cutoﬀs often normalize all cutoﬀs to zero and use the one-cutoﬀestimator. This normalization procedure estimates an average of local treatment eﬀects weightedby the relative density of individuals near each of the cutoﬀs (Cattaneo et al. (2016), Proposition3). Such an average eﬀect would be a meaningful summary measure only in two cases: (i) localtreatment eﬀects are all identical and the weighting scheme does not matter; or (ii) local treatmenteﬀects are heterogeneous but the researcher is only interested in the average eﬀect on the individualsnear the existing cutoﬀs. However, researchers are often interested in combining observed data withassumptions weaker than (i) to make inferences on counterfactual scenarios more general than (ii). This paper proposes a novel estimation procedure for average treatment eﬀects (ATE). TheseATEs are more valuable summary measures than the average eﬀect estimated by the normalizationprocedure described above for two reasons. First, the researcher explicitly chooses the counterfac-tual distribution of the ATE, and this distribution may include individuals at or between existingcutoﬀs. Second, the researcher does not need to assume any speciﬁc functional form for the het-erogeneity of treatment eﬀects across diﬀerent cutoﬀs. As an example of an application, supposewe are interested in estimating the eﬀect of Medicaid beneﬁts on health care utilization. Medicaideligibility is triggered by income cutoﬀs that vary across states. Existing one-cutoﬀ RDD methodsidentify the average eﬀect on individuals with income equal to the income cutoﬀs. However, mostinteresting policy questions require the average eﬀect over the entire range of income values in thedata.The framework for RDD with many thresholds is introduced here using a simple example basedon the work of Pop-Eleches and Urquiola (2013), PU from now on. Using a wealth of variation ofcutoﬀs from high school assignments in Romania, PU provide rigorous evidence of the impacts of In a RDD setting with multiple cutoﬀs and treatments, it is unreasonable to expect that diﬀerent local treatmenteﬀects are always identical. For example, Pop-Eleches and Urquiola (2013) ﬁnd that the impact of going to a betterhigh school on academic achievement is heterogeneous across students with diﬀerent ability levels. Another exampleis De La Mata (2012), who ﬁnds that the eligibility for Medicaid beneﬁts decreases the probability of having privatehealth insurance more strongly for lower income individuals. Although I allow for heterogeneous eﬀects across cutoﬀs,counterfactual analysis requires a pooling and a policy invariance assumption (Section 2). i submits her score X i (forcing variable) to the central planner who, basedon the entire distribution of scores, determines a minimum test score c j (cutoﬀ) for admission toeach high school j . The quality of high school j is denoted d j (treatment dose).The RDD assignment is assumed sharp for now. That is, students attend the best high schoolavailable to them based on their score and the cutoﬀs that apply to them. As the test score crossesan admission threshold c j , the quality of the school the student attends changes from d j − to d j .Local average eﬀects are denoted by E [ Y i ( d j ) − Y i ( d j − ) | X i = c j ] = β ( c j , d j − , d j ), where Y i ( d )is the potential academic achievement student i has if attending a high school of quality d , and β ( c, d, d ′ ) is the treatment eﬀect function. Heterogeneity of local eﬀects comes from values of cutoﬀsand treatment doses that change across the diﬀerent cutoﬀs. PU give a particularly illustrativeapplication, because it exhibits suﬃcient variation in cutoﬀ and treatment doses to generate ATEswith substantially greater economic relevance than the typical average based on normalizing all ofthe cutoﬀs to zero.Numerous other examples of RDD with multiple cutoﬀs and treatments exist in diﬀerent ﬁeldsof economics. For instance, Egger and Koethenbuerger (2010) study the eﬀect of the size of citygovernment councils on municipal expenditures, where council size is determined by populationcutoﬀs. De La Mata (2012) estimates the eﬀects of Medicaid beneﬁts on health care utilization,where Medicaid eligibility is triggered by income cutoﬀs that vary across states. Agarwal et al.(2017) and De Giorgi et al. (2017) look at multiple cutoﬀs on credit scores, used by banks to makecredit decisions. Education economics also provides a variety of applications. Angrist and Lavy(1999) and Hoxby (2000) use class size rules to estimate the impact of class size on student achieve-ment. Hoxby (2000) utilizes variation in cutoﬀ values from speciﬁc school district class size rules.Several researchers exploit diﬀerent school starting dates to estimate the impact of educational at-tainment on various outcomes, for example, Dobkin and Ferreira (2010), and McCrary and Royer(2011). Duﬂo et al. (2011) analyze school cohorts that are split into low and high-achieving classesbased on test scores, where each school has its own cutoﬀ score. Garibaldi et al. (2012) look atdiﬀerent income cutoﬀs that determine tuition subsidies to study the impact of tuition paymenton the probability of late graduation from university. In short, despite many applications withvariation in cutoﬀs and treatment doses, a lack of theory on how to combine observations from allcutoﬀs impedes our ability to estimate economically-relevant average eﬀects.Whether local eﬀects can be combined into an average eﬀect depends on how comparable the re-searcher believes these eﬀects are. The comparability of local treatment eﬀects essentially dependson the heterogeneity of treatment doses and on the heterogeneity of the treatment eﬀect function β ( c, d, d ′ ). This paper considers two types of assumptions regarding these two aspects of hetero-geneity. The ﬁrst heterogeneity assumption says that treatment doses are credibly quantiﬁable by3ome variable d . For example, PU ﬁnd behavioral evidence that average student performance ateach school is a good summary measure for school quality. Another example is the case of a singletreatment being triggered by varying cutoﬀs, as when each state has its own income threshold forMedicaid coverage. The second heterogeneity assumption speciﬁes a parametric functional form for β ( c, d, d ′ ) guided by economic theory or a priori knowledge of the researcher. For example, in a classsize application like Hoxby’s (2000), a functional form based on Lazear’s (2001) model of achieve-ment can be derived as a function of class size. Another example is given by Bajari et al. (2017)who present a principal-agent model to study how insurers reimburse hospitals. The marginalreimbursement rate is discontinuous on health expenditures.This paper proposes a consistent and asymptotically normal estimator for the ATE of a counter-factual distribution of treatment assignments speciﬁed by the researcher. A counterfactual policyscenario speciﬁes the distribution of ( c, d, d ′ ), and the ATE is the integral of β ( c, d, d ′ ) weighted bysuch a distribution. The ability to predict eﬀects of counterfactual policies depends crucially onassuming that the distribution of potential outcomes Y i ( d ) does not depend on the initial scheduleof cutoﬀ-dose values. This policy invariance assumption, along with the ﬁrst heterogeneity assump-tion, allows the researcher to choose counterfactual distributions with support more general thanthe discrete set of cutoﬀ-dose values observed in the data.The estimator proposed in this paper approximates the ATE integral by averaging estimatesof β ( c, d, d ′ ) at existing cutoﬀs using a proper weighting scheme. Under the ﬁrst heterogeneityassumption with β ( c, d, d ′ ) non-parametric, the proposed ATE estimator is shown to be consistentand asymptotically normal. This result is novel, because estimation of the non-parametric function β ( c, d, d ′ ) is only possible at deterministic points of the domain, and that creates an additionalsource of bias. Asymptotic normality requires both the number of observations and cutoﬀs togrow to inﬁnity, and I provide suﬃcient conditions on their rate of growth. I demonstrate thatthe minimax rate of ATE estimation in this setting is root- n , and that the proposed estimatorattains the minimax optimal rate for a speciﬁc choice of tuning parameters. This extends theprevious literature on minimax optimality of non-parametric estimation of regression functions ata boundary point to estimation of averages of these regression functions.Many applications of RDD with multiple cutoﬀs are, in fact, fuzzy rather than sharp. In thehigh school assignment example, a student may choose to attend a high school other than the schoolshe is originally eligible to attend. Multiple treatments result in multiple compliance behaviors,and one-cutoﬀ identiﬁcation results do not apply. Building on classic deﬁnitions of compliancebehaviors (Imbens and Rubin (1997)), I deﬁne compliance groups in terms of changes in treatmenteligibility and receipt. “Ever-compliers” are those whose treatment received changes if and onlyif it changes to the treatment dose for which they become eligible for. I assume that individualsnever change into a treatment dose diﬀerent from the dose of eligibility, a “no-deﬁance” condition.In the high school example, if the test score of a student currently in school B increases so as togrant her access to school A, no-deﬁance implies she either chooses to attend school A or stay atschool B, and that she is not triggered to attend some other school C.4his paper shows that even local identiﬁcation in fuzzy RDD with ﬁnite multiple treatments isimpossible unless the class of treatment eﬀect functions of ever-compliers is restricted to a ﬁnite-dimensional class. Important empirical analyses of fuzzy RDD with multiple treatments includethose of Angrist and Lavy (1999), Chen and Van der Klaauw (2008), and Hoekstra (2009); nev-ertheless, this is the ﬁrst paper to deﬁne compliance and study causal identiﬁcation in a generalframework for multi-cutoﬀ fuzzy RDD. This framework lays out conditions for the interpretationof two-stage least squares (2SLS) estimates in applications of multi-cutoﬀ fuzzy RDD, a commonpractice in applied work. The second heterogeneity assumption states that the treatment eﬀectfunction is of a parametric class. This assumption allows for consistent and asymptotically normalestimation of ATEs on ever-compliers. It also results in eﬃciency gains, because observations areoptimally combined across cutoﬀs to minimize the mean squared error (MSE) of the ATE estimator.The rapid growth in the number of applications of RDD in economics in the late 1990s wasaccompanied by substantial theoretical contributions for inference in the one-cutoﬀ case. Iden-tiﬁcation and estimation in the sharp and fuzzy cases were formalized by Hahn et al. (2001).Fan and Gijbels (1996) and Porter (2003) demonstrated low-order bias and rate optimality of thelocal polynomial estimator. Recent theoretical contributions have addressed the optimal bandwidthchoice (Imbens and Kalyanaraman (2012)), alternative asymptotic approximations with better ﬁ-nite sample properties (Calonico et al. (2014)), quantile treatment eﬀects (Frandsen et al. (2012)),kink treatment eﬀects (Dong (2018b)), and the diﬃculty of uniform inference (Bertanha and Moreira(2019)).The contribution of this paper is more closely related to the study of treatment eﬀect extrapola-tion of Angrist (2004), Bertanha and Imbens (2019), Dong and Lewbel (2015), Angrist and Rokkanen(2015), and Rokkanen (2015). These last two authors use observations on additional covariates.They restrict the relationship between the heterogeneity of treatment eﬀects after conditioning onthese covariates to obtain identiﬁcation away from the cutoﬀ. This paper diﬀers from these othercontributions, because the variation of multiple cutoﬀs and doses identify ATEs over distributionsof individuals both between and at cutoﬀs, without additional covariates.The remainder of this paper is organized as follows. Section 2 presents the notation and lays outbasic assumptions. Section 3 describes the ATE estimator for the sharp case and proves asymptoticnormality. It is divided into two sub-sections. Section 3.1 treats ATEs of discrete counterfactualdistributions, which is a straightforward generalization of one-cutoﬀ RDD. Section 3.2 is novel; itstudies ATEs of continuous counterfactual distributions under the ﬁrst heterogeneity assumption.Section 4 analyzes the fuzzy case. Appendix A contains all proofs. Supplemental Appendix Bcollects auxiliary results to the proofs in Appendix A. This section sets up the framework for RDD with multiple cutoﬀs. There are P sub-populations Appendix B is available online at ∼ mbertanh .

5f individuals indexed by p = 1 , . . . , P . An example of a sub-population may be a town-year in thehigh school application, or a state in the Medicaid example. Each individual i in sub-population p is fully characterized by a vector of random variables ( X i,p , U i,p ) drawn iid across i from eachsub-population. The forcing variable X i,p is a scalar score that governs eligibility for treatment, andit lives in a compact interval X = [ X , X ]; U i,p is a vector of unobserved heterogeneity. Individual( i, p ) receives a treatment dose D i,p from a set of possible treatments D . The outcome variable Y i,p is determined by a function Y of the individual characteristics and treatment, Y i,p = Y ( X i,p , D i,p , U i,p ) . (1)I start with the simpler sharp RDD setting and defer the fuzzy RDD case to Section 4. Inthe sharp case, the treatment received by the individual is a deterministic function of the forcingvariable. For an individual with forcing variable X i,p close to a cutoﬀ c , the treatment dose is d if X i,p < c , or d ′ if X i,p ≥ c . Hahn et al. (2001) demonstrate that continuity of the conditional meanof outcomes is suﬃcient to identify average causal eﬀects for individuals local to the cutoﬀ c . Lemma 1.

Assume that E [ Y ( X i,p , d, U i,p ) | X i,p = x ] is a continuous function of x for the treatmentdoses d and d ′ in the neighborhood of the cutoﬀ c . Then, the average causal eﬀect for individualswith X i,p = c is identiﬁed: E (cid:2) Y ( X i,p , d ′ , U i,p ) − Y ( X i,p , d, U i,p ) (cid:12)(cid:12) X i,p = c (cid:3) = lim e ↓ n E [ Y i,p | X i,p = c + e ] − E [ Y i,p | X i,p = c − e ] o . (2)Lemma 1 generalizes to the case of multiple cutoﬀs and treatments under the assumption ofcontinuity of E [ Y ( X i,p , d, U i,p ) | X i,p = x ] as a function of x for every d ∈ D . Many cutoﬀs arisebecause data sets may have many sub-populations with few cutoﬀs (e.g. Medicaid beneﬁt withone cutoﬀ per state, many states); or few sub-populations with many cutoﬀs (e.g. Romanian highschools with one town and many schools). The ability to exploit variation in cutoﬀ-dose valuesrelies on the following pooling assumption. Assumption 1 (Pooling) . For any { d, d ′ } ⊂ D , the conditional expectation E (cid:2) Y ( X i,p , d ′ , U i,p ) − Y ( X i,p , d, U i,p ) | X i,p = x (cid:3) as a function of x does not depend on p . Assumption 1 does not restrict average outcomes to be the same across diﬀerent sub-populations.It is less restrictive than common speciﬁcations for pooling data in applied work, for example, time-trends and sub-population ﬁxed eﬀects. The pooling assumption says that individuals with the sameforcing variable that undergo the same change in treatment have the same average response acrossdiﬀerent sub-populations. The rest of the paper builds on Assumption 1, and it becomes irrelevant6o distinguish sub-populations. Thus, I drop the subscript p and focus on the case of one populationwith multiple cutoﬀs.The cutoﬀs are ordered such that c < c < . . . < c K . Sharp RDD means that an individualwith forcing variable X i is deterministically assigned to a treatment dose D i = D ( X i ) according tothe following rule: D ( x ) =  d if c ≤ x < c d if c ≤ x < c ... d K if c K ≤ x ≤ c K +1 (3)where c = X , and c K +1 = X . Each cutoﬀ is characterized by three variables: the scalar threshold c j ; the treatment dose d j − the individual receives if c j − ≤ X i < c j ; and the treatment dose d j the individual receives if c j ≤ X i < c j +1 . Let c j = ( c j , d j − , d j ). The schedule of cutoﬀs andtreatment doses is given by the non-random set C K = { c j } Kj =1 . The richness of set C K increases asthe researcher collects more data. The data generating process is summarized as follows. Values for the forcing variable X i andheterogeneity U i are drawn iid i = 1 , ..., n from a joint distribution. Given D ( x ), these n individualsare assigned to diﬀerent treatment doses D i = D ( X i ). The observed outcome is determined by Y i = Y ( X i , D i , U i ). The econometrician observes the schedule of cutoﬀs and treatment doses D ( x ) and( Y i , X i , D i ) for i = 1 , ..., n . Following Rubin’s model of potential outcomes, let Y i ( d ) = Y ( X i , d, U i ),and assume continuity of E [ Y i ( d ) | X i = x ] for every d ∈ D . A simple extension of Lemma 1 identiﬁesaverage eﬀects at every cutoﬀ c ∈ C K , β ( c ) = E [ Y i ( d ′ ) − Y i ( d ) | X i = c ]= lim e ↓ n E [ Y i | X i = c + e ] − E [ Y i | X i = c − e ] o . (4)Data with multiple cutoﬀ-dose values allow the researcher to learn the causal eﬀect of a varietyof dose changes applied to individuals at various levels of the forcing variables. This fact opensthe possibility of using observed data to estimate the eﬀect of new policy changes. The individualresponse function Y may well depend on the initial assignment of treatments D i , and it couldpotentially change under counterfactual policies. Unless such dependence is restricted, it becomesimpossible to use existing data to infer the eﬀect of new policies. The remainder of this paper relieson the following policy-invariance assumption. Assumption 2 (Policy Invariance) . Regardless of the distribution of ( X i , D i , U i ) in a counterfactualpolicy, individual outcomes are always generated by a ﬁxed response function Y , that is, Y i = Y ( X i , D i , U i ) . The validity of the RDD depends crucially on exogeneity of cutoﬀs and no manipulation of the forcing variable X by individuals. See McCrary (2008) for a test of forcing variable manipulation. Bajari et al. (2017) present amodiﬁed RDD estimator that is consistent under forcing variable manipulation in a class of structural models. i is assigned to a change in treatment dose from D ∗ i to D ∗∗ i , where the distribution of( D ∗ i , D ∗∗ i ) is independent of U i after conditioning on X i . Under Assumption 2, the average causaleﬀect of such an experiment is µ = E [ Y ( X i , D ∗∗ i , U i ) − Y ( X i , D ∗ i , U i ) ]= E [ E ( Y ( X i , D ∗∗ i , U i ) − Y ( X i , D ∗ i , U i ) | D ∗∗ i , D ∗ i , X i ) ]= E [ E ( Y ( X i , D ∗∗ i , U i ) − Y ( X i , D ∗ i , U i ) | X i ) ]= E [ β ( X i , D ∗ i , D ∗∗ i ) ] (5)where the last equality uses the deﬁnition of β in Equation 4. The average eﬀect µ equals an averageof the β function over the counterfactual distribution of ( X i , D ∗ i , D ∗∗ i ). The inference methods ofthis paper ﬁrst identify β from RDD with many cutoﬀs, then identify the average of β under acounterfactual distribution pre-speciﬁed by the researcher. In a similar setting, Cattaneo et al.(2016) study identiﬁcation under conditions equivalent to Assumptions 1 and 2 (respectively, theirAssumptions 5a and 5b).The deﬁnition of µ captures both the direct eﬀect of changing D , and the composition eﬀect ofa change in the distribution of D conditional on X . To investigate the direct eﬀects of D , Rothe(2012) proposes methods for inference on partial policy eﬀects that preserve the distribution ofranks of ( D, X ) unchanged, thus controlling for composition eﬀects. Although not the focus of thispaper, Rothe’s methods may be combined with the RDD identiﬁcation strategy to study partialpolicy eﬀects.

This section investigates estimation and inference of averages of the non-parametric function β under sharp RDD with many cutoﬀs. First, I treat the case of qualitative treatment doses. This isa straightforward extension of single-cutoﬀ RDDs which identify ATEs of discrete counterfactualdistributions, with support contained in C K . Second, I treat the case of quantitative treatmentdoses, that is, the ﬁrst heterogeneity assumption. Substantial variation in cutoﬀ-dose values allowsfor novel methods that estimate ATEs with support more general than C K . Consider applications of RDD where the treatment dose variable has a qualitative nature, andis not credibly summarized by a real-valued metric. For example, Hastings et al. (2013) study theassignment of students into diﬀerent degree programs in universities in Chile. There are multiplecutoﬀs on a test score, but diﬀerent cutoﬀs switch students to completely diﬀerent programs, e.g.8hysics, engineering, economics, etc. This limits the ability to combine local eﬀects across cutoﬀs,which restricts ATEs to counterfactual distributions with discrete support contained in C K . In thissection, it is not possible to identify eﬀects of policies that places weight on cutoﬀ-dose combinations( c, d, d ′ ) that are not in C K .The focus is on discrete counterfactual distributions with probability mass function ω d ( c ) where ω dj = ω d ( c j ) for every j . For example, in the high school assignment application, a new policy mayreallocate students with test scores marginally across the existing cutoﬀs. The weight ω dj representsthe probability mass of students with test score equal to c j that undergo a change in school qualityfrom d j − to d j in the reallocation policy.The parameter of interest is the average eﬀect on these students, which is a weighted averageof local eﬀects at the existing cutoﬀs: µ d = K X j =1 ω dj β ( c j ) . Identiﬁcation follows from Equation 4. Estimation is conducted in two steps. The ﬁrst stepuses local polynomial regressions (LPR) near each cutoﬀ c j to non-parametrically estimate B j = lim e ↓ { E [ Y i | X i = c j + e ] − E [ Y i | X i = c j − e ] } . (6)The researcher chooses a bandwidth parameter h j > k ( . ), and the order of the polynomial regression ρ ∈ Z + . A polynomial in X is ﬁtted on each sideof the cutoﬀ, and the estimator ˆ B j is the diﬀerence between the intercepts of these two polynomialregressions: ˆ B j = ˆ a + j − ˆ a − j (7)(ˆ a + j , ˆ b + j ) = argmin ( a, b ) n X i =1 (cid:26) k (cid:18) X i − c j h j (cid:19) v j + i (cid:2) Y i − a − b ( X i − c j ) − . . . − b ρ ( X i − c j ) ρ (cid:3) (cid:27) (8)(ˆ a − j , ˆ b − j ) = argmin ( a, b ) n X i =1 (cid:26) k (cid:18) X i − c j h j (cid:19) v j − i (cid:2) Y i − a − b ( X i − c j ) − . . . − b ρ ( X i − c j ) ρ (cid:3) (cid:27) (9)where v j + i = I { c j ≤ X i < c j + h j } , v j − i = I { c j − h j < X i < c j } , (10) The common practice of normalizing all cutoﬀs to zero and estimating only one eﬀect produces an estimatorconsistent for µ d with weights ω dj = f ( c j ) / P l f ( c l ) where f is the probability density function of X . b = ( b , . . . , b ρ ). The estimator b B j uses observations with X i in the estimation window[ c j − h j , c j + h j ]. The choice of bandwidths may allow the windows to overlap at consecutivecutoﬀs. However, it must be the case that c j + h j < c j +1 and c j ≤ c j +1 − h j +1 for j = 1 , . . . , K − Y i = Y i ( d j ) for X i ∈ [ c j , c j + h j ], and Y i = Y i ( d j − ) for X i ∈ [ c j − h j , c j ). In the second step, the researcher averages out ˆ B j to obtain the estimator b µ d : b µ d = K X j =1 ω dj ˆ B j . (11)For the case of one cutoﬀ, Hahn et al. (2001) and Porter (2003) derive the asymptotic nor-mal distribution of the LPR estimator ˆ B j . I build on their arguments to derive the asymptoticdistribution of b µ d under the assumptions listed below. Assumption 3.

The kernel density function k : R → R is symmetric around zero, has compactsupport [ − M, M ] for some M ∈ (0 , ∞ ) , and is Lipschitz continuous. Assumption 4. (a)

The distribution of X i has probability density function f ( x ) that is continuousand has bounded support X = [ X , X ] ; (b) f ( x ) is diﬀerentiable with bounded derivative ∇ x f ( x ) . Assumption 5.

Let ρ ∈ Z + be the order of the ﬁrst-step LPR. For arbitrary d ∈ D , (a) R ( x, d ) = E [ Y i ( d ) | X i = x ] is ρ + 1 times continuously diﬀerentiable wrt x ; its ( ρ + 1) -th partial derivativewrt x is denoted as ∇ ρ +1 x R ( x, d ) ; (b) σ ( x, d ) = V [ Y i ( d ) | X i = x ] where V is the variance operator; σ ( x, d ) is continuously diﬀerentiable wrt x ; its partial derivative wrt x is denoted as ∇ x σ ( x, d ) ; σ ( x, d ) is bounded away from zero, and E [ | Y i ( d ) − R ( X i , d ) | | X i ] is bounded. Theorem 1.

Suppose Assumptions 3-5 hold. Let h = min j h j and h = max j h j . As n → ∞ ,assume that h → , h /h = O (1) , nh → ∞ , and ( nh ) / h ρ +11 = O (1) . Then, b µ d − B dn − µ d ( V dn ) / d → N (0 , where the bias B dn and variance V dn terms are characterized as follows: B dn = 1( ρ + 1)! K X j =1 h ρ +11 j f ( c j ) (cid:2) ∇ ρ +1 R ( c j , d j ) e ′ G j + n − ∇ ρ +1 R ( c j , d j − ) e ′ G j − n (cid:3) γ ∗ (12) V dn = n E  ε i  K X j =1 ω dj nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:17) e H ji   , (13) with ε i = Y i − E [ Y i | X i ] ; H ( u ) = [ u , u , . . . , u ρ ] ′ is a ( ρ + 1) × vector-valued function, e H ji = H ( h − j ( X i − c j )) , and G j ± n = ( nh j ) − P ni =1 v j ± i k ( h − j ( X i − c j )) e H ji e H j ′ i is a ( ρ + 1) × ( ρ + 1) This is the ﬁrst-step estimation procedure for one sub-population with K cutoﬀs. In many settings, the datahave many sub-populations p = 1 , . . . , P with one or more cutoﬀs j = 1 , . . . , K ( p ) in each sub-population. In thatcase, the researcher ﬁrst estimates b B j,p for every j in each sub-population p . Then, Assumption 1 allows for poolingof b B j,p across p in the second step. atrix; v j ± i are deﬁned in Equation 10; γ ∗ = [ γ ρ +1 . . . γ ρ +1 ] ′ , for γ d = R k ( u ) u d du ; and e is the ( ρ + 1 × vector with one in its ﬁrst coordinate and zero otherwise. Furthermore, (cid:0) V dn (cid:1) − / = O (cid:16)(cid:0) nh (cid:1) / (cid:17) , and (cid:0) V dn (cid:1) − / B dn = O P (cid:16)(cid:0) nh (cid:1) / h ρ +11 (cid:17) . The variance of b µ d is consistently estimated by b V dn = n X i =1 b ε i  K X j =1 ω dj nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i G j + n − v j − i G j − n (cid:17) e H ji   . (14)The squared residuals b ε i are computed by a nearest-neighbor matching estimator, as suggested byCalonico et al. (2014) (CCT from now on): b ε i = 34 Y i − X l =1 Y ℓ ( i,l ) ! , (15)and ℓ ( i, l ) is the index of the l -th closest X to X i that lies within the same cutoﬀs c j and c j +1 that X i does. CCT’s Theorem A3 demonstrates that b V dn / V dn p → (cid:0) V dn (cid:1) − / B dn diﬀers from zero asymp-totically, then inference must be done using a bias-corrected estimator. A practical way of doingbias correction is to increase the order of the polynomial from ρ to ρ + 1 and compute b µ d ′ and b V d ′ n using the same bandwidth choices as b µ d and b V dn . It follows that ( b V d ′ n ) − / ( b µ d ′ − µ d ) d → N (0 , c + h > c − h , the estimator b B usessome of the same observations that the estimator b B does. In theory, a ﬁnite number of cutoﬀswith shrinking bandwidths leads to non-overlapping estimation windows in large samples. As aconsequence, the asymptotic variance of p nh ( b µ d − B dn − µ d ) may not approximate its ﬁnite-sample variance well in case of overlap. Instead, the variance term in (13) takes into accountoverlap because its formula is constructed based on the ﬁnite-sample variance.In practice, implementation of b µ d requires the researcher to choose bandwidths h j >

0, thepolynomial order ρ ∈ Z + , and a kernel density function k ( · ). In the one-cutoﬀ case, commonchoices in applied work include the edge kernel k ( u ) = I {| u | ≤ } (1 − u ), local linear regression ρ = 1, and a bandwidth choice that minimizes the mean squared error (MSE) of estimation. Recentwork by Imbens and Kalyanaraman (2012) (IK from now on) provides a practical data-driven rulefor choosing the bandwidth in the case of one cutoﬀ. With multiple cutoﬀs, an interesting aspectof the optimal bandwidth problem is the variance reduction from overlapping estimation windows. A formal investigation on optimal bandwidths in the multi-cutoﬀ case is deferred to future work.A simple recommendation to implement Theorem 1 is to use the IK bandwidth based on local Following Equation (7),

COV ( b B j , b B j +1 ) = COV ( b a + j − b a − j , b a + j +1 − b a − j +1 ) = COV ( b a + j , − b a − j +1 ) < b a + j and b a − j +1 use some of the same observations in the case of overlap. ρ = 2) with the edge kernel and the same bandwidths asbefore to compute the consistent bias-corrected estimator b µ d ′ and its variance b V d ′ n . Calonico et al.(2018) propose shrinking MSE-optimal bandwidths as a rule of thumb to improve ﬁnite samplecoverage of conﬁdence intervals. As means of a robustness check, the researcher may shrink the IKbandwidths by multiplying them by n − / , and examine the resulting conﬁdence intervals (Section4.1, Calonico et al. (2018)). The ﬁrst heterogeneity assumption allows the researcher to identify counterfactual ATEs withsupport more general than C K . An empirical application satisﬁes the ﬁrst heterogeneity assumptionif the treatment dose is credibly quantiﬁable in a real-valued variable d . For example, in the highschool assignment of PU, the treatment dose is a quality measure for each school. Possible measuresof school quality include the average test score of peers, the average number of teachers, or fundingper student. An inﬁnite amount of data gives rise to a countably-inﬁnite set of cutoﬀ-dose values C ∞ . In terms of the high school assignment example, a large number of towns and years producesubstantial variation in cutoﬀ-dose values. Deﬁne C to be the convex hull of C ∞ . If variation incutoﬀ-dose values is suﬃciently rich, then ATEs with counterfactual distributions supported in C are identiﬁed (Lemma 2).I focus on scalar treatment doses d and counterfactual distributions with continuous probabilitydensity function ω c ( c ). Minor changes to the setup can accommodate multivariate d and discreteor mixed counterfactual distributions. The ATE is deﬁned as µ c = Z C ω c ( c ) β ( c ) d ( c ) . (16) Lemma 2.

Assume that an inﬁnite amount of data has suﬃcient variation such that (i) C ∞ isdense in its convex hull C ; and that (ii) β ( c ) is a continuous function over C . Then, µ c is identiﬁed. The researcher may impose further heterogeneity restrictions to reduce the dimension of β ( c )and increase the set of possible counterfactual distributions. For instance, linear returns to schoolquality say that β ( c, d, d ′ ) depends on ( c, d ′ − d ) instead of ( c, d, d ′ ). This implies that β ( c ) = φ ( c )( d ′ − d ) for a smooth function φ ( c ), and changes the dimension of set C K . See Figure 2 inSection 6 for an empirical illustration. Medicaid coverage is an example of binary treatment thatis triggered by various income cutoﬀs across states. In the case of binary treatment, the treatmenteﬀect function depends only on the cutoﬀ value, that is, β ( c, d, d ′ ) = φ ( c ). Identiﬁcation of averagesof β ( c ) requires identiﬁcation of averages of φ ( c ), which relies on inﬁnitely many cutoﬀ values thatcover a compact interval on the real line. For example, such variation identiﬁes the average eﬀectof giving Medicaid beneﬁts to an entire neighborhood of individuals within the range of income12utoﬀs seen in the data. The parameter µ c is estimated in two steps. The ﬁrst step is identical to the procedure describedin Equations 7-9. That is, LPRs produce estimates b B j , j = 1 , . . . , K . The second step computesa weighted average of the ﬁrst-step estimates, using specially designed weights { ∆ j } Kj =1 that I call“correction weights,” b µ c = K X j =1 ∆ j ˆ B j . (17)Unlike the intuition of the discrete case, the correction weight ∆ j is not necessarily equal orproportional to ω cj = ω c ( c j ). An analytical expression for ∆ j is given below in Equation 22, andconstructed as follows. The correction weight ∆ j is the contribution of estimate b B j to the integral R C ω c ( c ) b β ( c ) d ( c ), where b β ( c ) is a non-parametric estimate of β ( c ). A weighted regression of b B j onpolynomial functions of c j centered at c produces the estimate b β ( c ). The researcher speciﬁes theorder of the polynomials ρ ∈ Z + , and a bandwidth h > c ∈ C . The estimate b β ( c ) is the intercept of the following weighted least squares regression: b η = argmin η (cid:16) b B − E ( c ) η (cid:17) ′ Ω ( c ; h ) (cid:16) b B − E ( c ) η (cid:17) (18)where b B = h b B , . . . , b B K i ′ is a K × Ω ( c ; h ) = diag { Ω j ( c ; h ) } Kj =1 is a K × K matrix, with (20)Ω j ( c ; h ) = k (cid:18) c j − ch (cid:19) k (cid:18) d j − − dh (cid:19) k (cid:18) d j − − d ′ h (cid:19) ; E ( c ) = [ E ( c ) , . . . , E K ( c )] ′ is a K × J matrix, where (21) E j ( c ) is a J × vector with all polynomials of the form p γ ( c j − c ) = ( c j − c ) γ ( d j − − d ) γ ( d j − d ′ ) γ for γ = ( γ , γ , γ ) ∈ Z , γ + γ + γ ≤ ρ , min { γ , γ } = 0 ,J = 2 ( ρ + 2)!2! ρ ! − ( ρ + 1) , where ! denotes factorial,and the ﬁrst element of E j ( c ) is . For the Medicaid example, De La Mata (2012) has many income cutoﬀs that diﬀer by state, age, and year. DeLa Mata’s Table I suggests variation between US $ $ β is possible for a range of dose changes at a few cutoﬀ values. j comes from integrating ω c ( c ) b β ( c ): Z C ω c ( c ) b β ( c ) d c = Z C ω c ( c ) e ′ (cid:0) E ( c ) ′ Ω ( c ; h ) E ( c ) (cid:1) − X j Ω j ( c ; h ) E j ( c ) b B j d ( c )= X j Z C ω c ( c ) e ′ (cid:0) E ( c ) ′ Ω ( c ; h ) E ( c ) (cid:1) − Ω j ( c ; h ) E j ( c ) d ( c ) b B j = X j Z C ω c ( c ) det (cid:0) E ( c ) ′ Ω ( c ; h ) E ← e j ( c ) (cid:1) det ( E ( c ) ′ Ω ( c ; h ) E ( c )) d ( c ) | {z } ≡ ∆ j b B j (22)= X j ∆ j b B j , (23)where the third equality uses the Cramer rule, and E ← e j ( c ) is a K × J matrix equal to E ( c ) exceptfor the ﬁrst column, which is replaced by the K × e j . The vector e j has one in its j -thentry and zero otherwise.The main contribution of this paper concerns inference on µ c where β ( c ) is estimated non-parametrically and then averaged across cutoﬀs. This is not the ﬁrst paper to study estimation ofaverages of non-parametric functions; for example, see Newey (1994). The novelty here is that thenon-parametric estimation step only occurs at K ﬁxed boundary points c j . A necessary conditionfor consistency of ˆ µ c is an “inﬁll type of asymptotics,” that is, K grows large with the sample size n , and C K becomes dense in its convex hull C . Assumption 6 makes the dependence of K , h j , h ,and c j on n explicit with a subscript. The main text omits the subscript n whenever possible tosimplify notation. Assumption 6. (a)

The schedule of cutoﬀs and doses comes from a triangular array of ﬁxedconstants C K n = { c j,n } K n j =1 that depends on the sample size n ; C K n converges to a countably inﬁniteset C ∞ as n → ∞ ; C ∞ is dense in its convex hull C ; (b) given the ﬁrst-step bandwidth sequences h j,n , assume that c j,n + h j,n < c j +1 ,n and c j,n ≤ c j +1 ,n − h j +1 ,n for all j = 1 , . . . , K n − ;and (c) given the second-step bandwidth sequence h ,n and polynomial order ρ , deﬁne E n ( c ) and Ω n ( c ; h ,n ) as in Equations (20) - (21) for each n . Assume there exists a positive deﬁnite J × J matrix Q such that sup c ∈C (cid:13)(cid:13)(cid:13) K n h ,n (cid:2) E n ( c /h ,n ) ′ Ω n ( c ; h ,n ) E n ( c /h ,n ) (cid:3) − − Q (cid:13)(cid:13)(cid:13) = o (1) . For large K , cutoﬀ-dose values must be uniformly distributed on the domain C such that E ( c /h ) ′ Ω ( x ; h ) E ( c /h ) is invertible and of magnitude Kh , that is, K times the volume of every h -neighborhood of c , for every c in C . These conditions are satisﬁed in a variety of examplesof triangular arrays of points. In Section B.3 of the supplemental appendix, these conditions areveriﬁed for one example of a triangular array. Asymptotic normality also relies on additionalsmoothness conditions on the moments of the data.14 ssumption 7. (a) R ( x, d ) = E [ Y i ( d ) | X i = x ] is a ¯ ρ times continuously diﬀerentiable functionwith ¯ ρ = max { ρ + 2 , ρ + 2 } , where ρ and ρ are polynomial degrees in the ﬁrst and secondsteps; the ¯ ρ -th partial derivative of R ( x, d ) with respect to x is denoted ∇ ¯ ρx R ( x, d ) ; (b) σ ( x, d ) = V [ Y i ( d ) | X i = x ] is a continuous function bounded away from zero; and (c) ∃ M ∈ (0 , ∞ ) such that P [ | Y i ( d ) − R ( X i , d ) | < M ] = 1 for ∀ d ∈ D . Theorem 2 states the rate conditions under which the estimator b µ c has an asymptotic normaldistribution. Estimation of the ATE consists of approximating the integral of the treatment eﬀectfunction by a weighted sum of the values of such function at a ﬁnite number of points in its domain.The approximation error converges to zero as the number of points grows large. Function evalua-tions B j are estimated by b B j . The correction weights guarantee that the integral approximationerror converges to zero faster than the estimation error. Theorem 2.

Suppose Assumptions 3-7 hold. As n → ∞ , assume that K → ∞ , h → , h /h = O (1) , and h → such that (i) (cid:0) Knh (cid:1) / h ρ +11 = O (1) ; (ii) K / log n/ (cid:0) nh (cid:1) / = o (1) , and Kh = O (1) ; and (iii) (cid:0) Knh (cid:1) / h ρ +12 = O (1) , and /Kh = O (1) . Then, b µ c − B c n − B c n − µ c ( V cn ) / d → N (0 , . (24) The ﬁrst-step bias B c n and variance V cn terms are deﬁned as in Equations 12-13 except that ∆ j replaces ω dj ; the second-step bias B c n is characterized as follows: B c n = Z C ω c ( c ) X ( γ ,γ ,γ ) K X j =1 ( ( c j − c ) γ ( d j − − d ) γ ( d j − d ′ ) γ γ ! γ ! γ ! ∇ γ c ∇ γ d ∇ γ d ′ β ( c, d, d ′ ) det (cid:0) E ( c ) ′ Ω ( c ; h ) E ← e j ( c ) (cid:1) det ( E ( c ) ′ Ω ( c ; h ) E ( c )) ) d c , (25) where the ﬁrst sum runs over all triplets ( γ , γ , γ ) ∈ Z such that γ + γ + γ = ρ + 1 , and min { γ , γ } = 0 . Furthermore, ( V cn ) − / = O (cid:16)(cid:0) Knh (cid:1) / (cid:17) , ( V cn ) − / B c n = O P (cid:16)(cid:0) Knh (cid:1) / h ρ +11 (cid:17) ,and ( V cn ) − / B c n = O (cid:16)(cid:0) Knh (cid:1) / h ρ +12 (cid:17) . A consistent estimator for V cn is b V cn = n X i =1 b ε i  K X j =1 ∆ j nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i G j + n − v j − i G j − n (cid:17) e H ji   , (26)where b ε i is computed using Equation 15. Lemma B.10 in the supplemental appendix’s Section B.4demonstrates that b V cn / V cn p → Kh ) − = O (1). If the bandwidth choicesare such that the standardized bias term ( V cn ) − / ( B c n + B c n ) diﬀers from zero asymptotically,then inference must be done using a bias-corrected estimator. A practical way of performing biascorrection is to increase the order of the polynomials from ( ρ , ρ ) to ( ρ +1 , ρ +1), and to compute15 µ c ′ and b V c ′ n using the same bandwidth choices as b µ c and b V cn . It follows that ( b V c ′ n ) − / ( b µ c ′ − µ c ) d → N (0 , C , along with the asymptotic behavior of the schedule of cutoﬀ-doses (Assumption6), is crucial for the numerical integration error to vanish suﬃciently quickly, as required by Theorem2. Continuity of ω c ( c ) implies that the boundary of C has zero probability under the counterfactualdistribution. Therefore, the convergence rate of b µ c is not aﬀected by the value of ω c ( c ) over theboundary of C . In ﬁnite samples, local polynomial estimates of b β ( c ) may be noisy for values of c atthe boundary of the convex-hull of C K . Researchers should take that into account when specifyingthe support of the counterfactual distribution ω c ( c ).A simple example illustrates the three rate conditions of Theorem 2. Suppose h j = n − λ forall j , h = n − λ , and K = n θ . The ﬁrst-step estimation uses local-linear regression ( ρ = 1),and the second step, local cubic regression ( ρ = 3). The ﬁrst rate condition says the ﬁrst-step bandwidths have to converge to zero fast enough to control the asymptotic bias. That is, (cid:0) Knh (cid:1) / h ρ +11 = O (1); in terms of the example, this condition becomes λ ≥ (1 + θ ) / (3 + 2 ρ ).The second rate condition restricts how fast the number of cutoﬀs grows with n . It cannot grow toofast to ensure having enough observations around the cutoﬀs for uniform consistency of ﬁrst-stepestimates. The second condition has two parts: (a) K / log n/ (cid:0) nh (cid:1) / = o (1) ⇔ λ < − θ ; and(b) Kh = O (1) ⇔ λ ≥ θ . The third rate condition limits how slowly K grows, relative to thesample size, to ensure that the integral approximation error vanishes faster than the estimationvariance. Part (a) of the third condition says (cid:0) Knh (cid:1) / h ρ +12 = O (1) ⇔ λ ≥ θ − λ ( ρ + 1);part (b) is 1 /Kh = O (1) ⇔ λ ≤ θ/ Panel (a) shows the conditions in terms of ( λ , θ ) assuming λ = θ/ h = K − / , whichsatisﬁes part (b) of the third condition. Panel (b) depicts the same conditions in terms of ( λ , λ ),assuming θ = 0 .

4. The feasible set is well-deﬁned as long as K grows no faster than √ n , that is, θ < .

5. In addition, ρ ≥ √ n , and it is reached along the dashed line 2(b). Section B.3 in the supplemental appendix gives an example of a schedule of cutoﬀ-dose values that satisﬁesAssumption 6 for feasible choices of ( h , h ) in this example. Notes: The diagram shows the rate conditions of Theorem 2 applied to the case where h j = n − λ for all j , h = n − λ , and K = n θ . Condition 1, that is (cid:0) Knh (cid:1) / h ρ +11 = O (1), is equivalent to λ ≥ (1 + θ ) / (3 + 2 ρ );condition 2(a): K / log n/ (cid:0) nh (cid:1) / = o (1) ⇔ λ < − θ ; condition 2(b): Kh = O (1) ⇔ λ ≥ θ ; condition 3(a): (cid:0) Knh (cid:1) / h ρ +12 = O (1) ⇔ λ ≥ θ − λ ( ρ + 1); and condition 3(b): 1 /Kh = O (1) ⇔ λ ≤ θ/

3. Panel(a) illustrates the rate conditions on the ﬁrst-step bandwidth and number of cutoﬀs ( λ , θ ) for ρ = 1, ρ = 3, and λ = θ/

3, so that h = K − / and condition 3(b) is satisﬁed. Panel (b) displays the rate conditions on the bandwidths( λ , λ ) given θ = 0 . ρ = 1, and ρ = 3. Implementation of Theorem 2 requires the researcher to choose ρ ∈ Z + , h j > ∀ j , ρ ∈ Z + , h >

0, and k ( · ). A theory of optimal choice of these tuning parameters is beyond the goalsof this paper. Optimal choice of bandwidths is an interesting topic for future research, becauseoptimality in the multi-cutoﬀ case would account for: (i) the interaction between ﬁrst and second-stage bandwidths; (ii) the variance reduction from overlapping estimation windows at consecutivecutoﬀs; and (iii) the recent advances of robust bias-corrected inference and coverage-error optimalbandwidths by Calonico et al. (2018).The IK bandwidth formula may produce ﬁrst-step bandwidths with an incorrect rate of conver-gence. For example, if ρ = 1, these bandwidths converge to zero at n − . , which is not fast enoughif ρ = 3 (Figure 1). A simple way to correct for this is to adjust the bandwidths by multiplyingthem by n . − λ for λ ≥ θ , so that their rate becomes n − λ . Conditions 2(a) and (b) imply that θ is never bigger than 0 . ρ , ρ , and λ . Thus, the smallest value for λ consistentwith these restrictions is 0 .

5. The same idea applies to the coverage-error optimal bandwidths byCalonico et al. (2018), which converge to zero at rate n − . , and need to be adjusted.In certain cases, the β function may depend on less than the three arguments ( c, d, d ′ ). Forexample, in the Medicaid application, the treatment is binary and β is only a function of c . Thisis a particular case of the theory in this section. The only rate condition that changes is condition3(b). It becomes 1 /Kh = O (1), or λ ≤ θ in terms of Figure 1. A non-empty feasible set ofbandwidth choices requires ρ ≥

1, as opposed to ρ ≥ h that minimizes the MSE17f estimation. First, use observations pertaining to each cutoﬀ j , compute the IK bandwidth h ik j forsharp RD and local-linear regression; adjust the rate of the bandwidths so that h j = h ik j × n − . .Second, create a grid of possible values for h . For each value on the grid, compute b µ c ( h ) using theedge kernel, the choices of h j given above, ρ = 1, and ρ = 3 (or ρ = 1 in the binary treatmentcase). Similarly, compute b µ c ′ ( h ) using the edge kernel, the choices of h j given above, ρ = 2,and ρ = 4 (or ρ = 2 in the binary treatment case). Use Equation 26 to estimate the variance of b µ c ( h ) and call it b V cn ( h ). Evaluate the approximated MSE of b µ c ( h ) by ( b µ c ( h ) − b µ c ′ ( h )) + b V cn ( h ).Choose the bandwidth value on the grid that minimizes the MSE and call it h ∗ . The bias-correctedestimate is b µ c ′ ( h ∗ ), and its variance estimate is b V c ′ n ( h ∗ ).It may not be immediately clear that root- n is the fastest estimation rate achievable in a settingwhere both K and n grow large. The double asymptotic setting is conceptually diﬀerent from theusual asymptotic setting where only n → ∞ and non-parametric averages are estimable at root- n .Estimation rates depend not only on bandwidth choices, but also on how fast K grows, relative to n . Similar examples in econometrics include panels with a large number of observations and timeperiods, and asymptotics with many instruments. The following theorem demonstrates that theminimax optimal rate of estimation of µ c is indeed root- n , as long as ﬁrst-step bandwidths convergeto zero at 1 /K rate. Theorem 3.

Let P be the class of models generating potential outcomes { Y i ( d ) } d ∈D and forcingvariables X i . For a schedule of cutoﬀs and doses, { c j } Kj =1 , observed data ( Y i , X i , D i ) are generatediid from P ∈ P as described in Section 3.1. Assume that (i) each model P ∈ P satisﬁes Assump-tions 4-7; (ii) f ( x ) and σ ( x, d ) are bounded away from zero uniformly in P ; (iii) the followingfunctions are bounded uniformly in P : ∇ ρx σ ( x, d ) ∀ ρ ≤ , ∇ ρx f ( x ) ∀ ρ ≤ , ∇ ρx R ( x, d ) ∀ ρ ≤ ¯ ρ , ∇ ρd R ( x, d ) ∀ ρ ≤ ¯ ρ , where ¯ ρ = max { ρ + 2 , ρ + 2 } ; and (iv) there exists M ∈ (0 , ∞ ) such that P [ | Y i ( d ) − R ( X i , d ) | < M ] = 1 ∀ d ∈ D uniformly in P . Then, for any ǫ > , there exists η > such that inf e µ sup P ∈P P P (cid:2) √ n | e µ − µ c ( P ) | > ǫ/ (cid:3) ≥ η for large n. (27) The inf is taken over all estimators e µ built using the observed data ( Y i , X i , D i ) , i = 1 , . . . , n ; µ c ( P ) = R ω c ( c ) β ( c ; P ) d c with β ( c ; P ) = E P [ Y i ( d ′ ) − Y i ( d ) | X i = c ] ; and P P and E P denote theprobability and expectation under model P ∈ P .Assume the conditions of Theorem 2, and that ﬁrst-step bandwidths satisfy h = O ( K − ) .Consider the estimator b µ c deﬁned in Equation 17. For any small δ > , there exists large ǫ ∈ (0 , ∞ ) such that sup P ∈P P P (cid:2) √ n | b µ c − µ c ( P ) | > ǫ (cid:3) < δ for large n. (28)Equation 27 shows that no estimator converges faster than √ n uniformly over P . Equation28 says the estimator proposed in Theorem 2 converges at root- n uniformly over P as long as18rst-step bandwidths converge to zero at 1 /K rate. Therefore, root- n is the minimax optimal rateof convergence in the non-parametric estimation of ATE in RDD with many thresholds. Authorshave previously analyzed minimax optimality of non-parametric estimators of a regression functionat a boundary point, for example, Cheng et al. (1997) and Sun (2005). Theorem 3 is novel becauseit combines boundary points to estimate averages of non-parametric regression functions. This section relaxes the sharp assignment mechanism of previous sections and studies the fuzzyRDD case. The analysis focuses on multiple cutoﬀs, but K is ﬁnite as opposed to approachinginﬁnity as in Section 3.2. This makes the exercise more tractable, because the number of compliancecases grows super-exponentially with the number of cutoﬀs. In contrast to the sharp case, non-parametric identiﬁcation of local eﬀects in the fuzzy case is impossible. As a result, inferencemethods in this section rely on a second heterogeneity assumption, namely, the treatment eﬀectfunction is assumed parametric. Section B.5.2, in the supplemental appendix, provides practicalguidelines to compute an MSE-optimal ATE estimator, and demonstrates asymptotic normality.In the sharp RDD case, all individuals with forcing variable equal to x receive the same treatment D ( x ) (Equation 3). In the fuzzy RDD case, many of these individuals may receive treatmentsdiﬀerent from D ( x ). In the high school assignment example, students may choose to go to a schoolthat is not the best school for which they are eligible. For instance, a student may want to attendthe same high school as a certain friend or sibling. Another example is given by Garibaldi et al.(2012). In their study, a schedule of tuition subsidies applies to most students at Bocconi University,but the university reserves the right to grant certain students diﬀerent subsidies after reassessingtheir ability to pay. The fuzzy RDD case is modeled in terms of a potential treatment assignment framework. Apotential treatment assignment function U : X → D describes the treatment received for everyvalue of the forcing variable x ∈ X . For simplicity, these functions are assumed to belong to thefollowing class: U ∗ = ( U : X → D : U ( x ) = K X j =0 u j I { c j ≤ x < c j +1 } for some u j ∈ { d , . . . , d K } , j = 0 , . . . , K ) . (29)Sharp RDD is the particular case where the individual potential treatment assignment function U i is the same for every individual i , that is, U i ( x ) = D ( x ) ∀ i with D ( x ) deﬁned in Equation 3. The source of fuzziness varies across applications. One example is the case where the assignment of individualsinto diﬀerent treatments is made through a matching mechanism, and the econometrician does not observe all theindividual characteristics used in the matching algorithm. This is the reason why the RDD of PU is fuzzy: basedon the entire distribution of test scores and preferences, the central planner ranks students by their test scores andassigns each one to her preferred school among schools with vacancies.

19n the fuzzy case, U i is sampled iid from a distribution of functions with support in U ∗ . Potentialtreatment functions U i ( x ) are unobserved, but the treatments received are observed and given by D i = K X j =0 U i ( c j ) I { c j ≤ X i < c j +1 } . Using classic deﬁnitions of compliance behaviors (Imbens and Rubin (1997)), three types ofcompliance groups are deﬁned in terms of changes in treatment eligibility. “Never-changers” arethose whose treatment received never changes when eligibility changes. The treatment received by“ever-compliers” or “ever-deﬁers” changes at least once when eligibility changes. Ever-compliersare those whose treatment received changes if and only if it changes to the treatment dose for whichthey become eligible. Ever-deﬁers change to a treatment dose diﬀerent from the one for which theybecome eligible. In the case of one cutoﬀ and two treatments, the deﬁnition of ever-complier (ever-deﬁer) is equivalent to the classic deﬁnition of complier (deﬁer) of Imbens and Lemieux (2008).The three compliance groups are measurable events that partition the population of individualswith G nc denoting never-changers, G ec ever-compliers, and G ed ever-deﬁers. G nc = n U i ∈ U ∗ : { j : U i ( c j − ) = U i ( c j ) } = ∅ o (30) G ec = n U i ∈ U ∗ : { j : U i ( c j ) = D ( c j ) } ⊇ { j : U i ( c j − ) = U i ( c j ) } 6 = ∅ o (31) G ed = n U i ∈ U ∗ : { { j : U i ( c j ) = D ( c j ) } ∩ { j : U i ( c j − ) = U i ( c j ) } } 6 = ∅ o (32)where ∅ denotes empty set.In the high school assignment case, an example of a never-changer is a student who stronglyprefers the high school with the lowest admission cutoﬀ and attends that high school even if sheis admitted to better schools. An example of an ever-complier is a student who attends the bestschool into which she is admitted, or a student who chooses the best school among the nearbyschools. Suppose a student has rational preferences and is never indiﬀerent. Assume her choice setis equal to those schools with admission cutoﬀs that are less than or equal to her test score. Then,such a student is never an ever-deﬁer. In other words, as her test score increases, a new school isadded to her choice-set of schools; she either chooses to go to the new school for which she becomeseligible, or she stays at the school which she preferred prior to the increase in her choice-set. Thus,it seems natural to rule out “ever-deﬁers” in this and other applications.Never-changers do not produce changes in treatments, so there is no identiﬁcation on them.For ever-compliers, there are multiple possible changes in treatment at a given cutoﬀ, and ever-compliers may diﬀer in terms of the treatments they comply with. For example, the student who iswilling to attend the best school possible complies with all changes in treatment eligibility. On theother hand, the student who is willing to attend the best possible school within a certain distance These deﬁnitions allow for non-monotonic treatment schedules; for example, the average class-size varies non-monotonically across cutoﬀs on enrollment (Angrist and Lavy (1999)). Table B.1 in Section B.5.1 of the supplementalappendix illustrates these deﬁnitions of compliance groups using a simple example with 3 treatments and 2 cutoﬀs.

Assumption 8. (a)

There are no ever-deﬁers: P [ G ed ] = 0 ; (b) for arbitrary d ∈ D , and ¯ U ∈ U ∗ , E [ Y i ( d ) | X i = x, U i = ¯ U ] and P (cid:2) U i = ¯ U | X i = x (cid:3) are continuous and bounded functions of x ; (c) thereexists a function β ec ( c ) such that E [ Y i ( d ′ ) − Y i ( d ) | X i = c, U i = ¯ U ] = β ec ( c ) for every c = ( c, d, d ′ ) ∈ C and ¯ U ∈ G ec . A fuzzy assignment produces several diﬀerent treatment changes at each cutoﬀ, even afterruling out ever-deﬁers. The researcher only observes one aggregate change in Y i at each cutoﬀ,but there are several treatment eﬀects on ever-compliers to be identiﬁed at that cutoﬀ. Theorem4 below shows that identiﬁcation of these eﬀects is not possible without further restricting theclass of functions β ec ( c ). Economic theory or a priori knowledge guides the choice of a functionalform that credibly summarizes the heterogeneity of treatment eﬀects. For example, the principal-agent model of Bajari et al. (2017) yields a functional form to study reimbursement of hospitalsby insurers. The second heterogeneity assumption (Assumption 9) restricts the treatment eﬀectfunction on ever-compliers to a ﬁnite-dimensional vector space of functions. Assumption 9.

Let W ( c, d ) = [ W ( c, d ) , . . . , W q ( c, d )] ′ be a vector-valued function W : X × D → R q × known to the researcher and such that (a) E F [ W ( c, d ′ ) − W ( c, d )] is well-deﬁned for thecounterfactual distribution F ; and (b) W j ( c, d ′ ) − W j ( c, d ) , j = 1 , . . . , q , are linearly indepen-dent functions. The treatment eﬀect function β ec ( c ) is assumed to belong to the following class offunctions: H = (cid:26) β : C → R : β ( c, d, d ′ ) = (cid:2) W ( c, d ′ ) − W ( c, d ) (cid:3) ′ θ , for θ ∈ R q (cid:27) . In this case, the ATE on ever-compliers is a linear combination of the true parameter vector θ ec . For a counterfactual distribution F chosen by the researcher, µ ec ( F ) = Z β ( c ; θ ec ) dF ( c ) (33)= Z (cid:2) W ( c, d ′ ) − W ( c, d ) (cid:3) ′ dF ( c ) | {z } ≡ Z ( F ) θ ec (34)= Z ( F ) θ ec . (35)Theorem 4 shows that the observed change in average outcome at a given cutoﬀ is a weightedaverage of treatment eﬀects on ever-compliers who switch from various doses into the dose of eligi-bility at that cutoﬀ. Assumption 9 and variation in cutoﬀ characteristics are suﬃcient conditions21or identiﬁcation. Conversely, identiﬁcation on ever-compliers implies that β ec ( c ) belongs to aﬁnite-dimensional class of functions. Theorem 4.

Under Assumption 8, for j = 1 , . . . , K , B j = K X l =0 ,l = j ω j,l β ec ( c j , d l , d j ) where B j is deﬁned in Equation 6, and ω j,l = lim e ↓ { P [ D i = d l | X i = c j − e ] − P [ D i = d l | X i = c j + e ] } , for l = 0 , , . . . , K, l = j .Moreover, suppose β ec belongs to the class of functions H deﬁned in Assumption 9 with q ≤ K .Deﬁne f W j = K X l =0 ,l = j ω j,l [ W ( c j , d j ) − W ( c j , d l )] (36) for the vector-valued function W ( c, d ) of Assumption 9; build a K × q matrix f W by stacking f W j ,and B by stacking B j . If f W ′ f W is invertible, then β ec ( c ) is identiﬁed and equal to β ec ( c ) = (cid:2) W ( c, d ′ ) − W ( c, d ) (cid:3) ′ (cid:16) f W ′ f W (cid:17) − f W ′ B . Conversely, suppose β ec belongs to some class of functions e H , and treatment eﬀects on ever-compliers are identiﬁed at the p > K cutoﬀ-dose values { ˜ c : ˜ c = ( c j , d l , d j ) with ω j,l > } of everypossible fuzzy assignment generated from the given schedule of cutoﬀs { c j } Kj =1 . Then, the class offunctions e H is “ﬁnite dimensional” in the sense that G = n(cid:16) β (˜ c ) , . . . , β (˜ c p ) (cid:17) : for β ∈ e H o ⊆ R p has dim G ≤ K for every fuzzy assignment { ˜ c j } pj =1 generated from { c j } Kj =1 . Theorem 4 reveals the requirement of stronger functional form assumptions on β ec ( c ) evenfor identiﬁcation of local eﬀects in the fuzzy case with a ﬁnite number of multiple cutoﬀs. Forexample, identiﬁcation is not possible when e H is the class of all smooth functions studied in thenon-parametric case of Section 3.2. The result is striking because non-parametric identiﬁcation oflocal eﬀects is possible both in the sharp case with a ﬁnite number of cutoﬀs and in the fuzzy casewith a single cutoﬀ. It is likely possible to obtain non-parametric identiﬁcation of β ec ( c ) under alarge variation of cutoﬀ-dose values. The function β ec ( c ) may be approximated by a sequence ofparametric functions from Assumption 9, where q grows to inﬁnity more slowly than K , so to keep dim G ≤ K as K → ∞ . In this paper, the number of cutoﬀs is kept ﬁnite for simplicity, and the22ase with large K is deferred to future work.Theorem 4 also clariﬁes the interpretation of two-stage least squares (2SLS) estimates in applica-tions of fuzzy RD with multiple cutoﬀs, a common practice in applied work. The practice consistsof using D ( X i ) as an instrument for D i in the regression of Y i on a constant, D i , and X i . SeeAngrist and Pischke (2008) for a discussion. In the single-cutoﬀ case, both the non-parametric RDestimator and 2SLS applied to a neighborhood of the cutoﬀ are consistent to the average treatmenteﬀect on compliers (Hahn et al. (2001)). To my knowledge, such an equivalence has never beenstudied in the multiple-cutoﬀ case. Nevertheless, many important applications have multiple-fuzzycutoﬀs and use 2SLS; for example, Angrist and Lavy (1999), Chen and Van der Klaauw (2008), andHoekstra (2009). The 2SLS estimator is consistent for a data-driven weighted average of treatmenteﬀects on ever-compliers as long as a suﬃciently ﬂexible speciﬁcation is used; for example, cutoﬀﬁxed-eﬀects or varying slopes. The economic meaning of the 2SLS estimands depends crucially onthe choice of such a weighting scheme. Unless a parametric functional form is imposed on β ec ( c ), orthere is large variation in cutoﬀ-doses, only a data-driven weighted average of β ec ( c ) is identiﬁed.In other words, if β ec ( c ) is non-parametric and there are only a few cutoﬀs, the researcher does nothave control over the weighting scheme, and 2SLS estimates don’t have a clear interpretation.Theorem 4 leads to a two-step estimation procedure for θ ec and µ ec . The mechanics are similarto the previous sections, so I omit the details from the main text for brevity. In the ﬁrst step,the researcher estimates the jump discontinuity of the vector [ Y i W ( X i , D i ) ′ ] ′ using LPRs at eachcutoﬀ to obtain [ b B j cf W ′ j ] ′ . In the second step, a regression of b B j on cf W ′ j obtains b θ ec . The ATEestimator is b µ ec = Z ( F ) b θ ec . Estimation precision varies across cutoﬀs, and the parametric form of β ec allows us to optimally combine diﬀerent cutoﬀs to minimize the MSE of b θ ec . The researchercan simply re-weight the second-step regression by the inverse of the MSE matrix of the ﬁrst-stepestimators. Section B.5.2 in the supplemental appendix delineates the estimation and inferenceprocedures of θ ec and µ ec with practical steps. In this section, Monte Carlo simulations illustrate the ﬁnite sample behavior of the ATE estima-tor proposed in Section 3.2. The analysis considers estimation precision and coverage of conﬁdenceintervals for diﬀerent choices of tuning parameters and a non-linear speciﬁcation for β . As pre-dicted by Theorem 2, an incorrect choice of the second-step polynomial degree leads to severe biasand extremely poor coverage of conﬁdence intervals. Moreover, ﬁrst-step bandwidths that implyoverlapping estimation windows produce lower MSE than cases with no overlap, regardless of othertuning parameters.The DGP draws n iid observations of ( X i , ε i ) where X i is uniformly distributed over [0 , ε i is normally distributed with zero mean and unit variance, and these variables are independent ofeach other. There are K cutoﬀs c j = j/ ( K + 1), j = 1 , . . . , K , on the unit interval [0 , K = ⌊ n . ⌋ , where ⌊ a ⌋ denotes the largest integer smaller than or equal to a .23able 1: Precision of Estimators - Choice of h n K b µ b µ bc e µ e µ bc b µ b µ bc e µ e µ bc b µ b µ bc e µ e µ bc Overlap

No Overlap b µ, b µ bc , e µ, e µ bc ), two choicesof ﬁrst-step bandwidth (overlap and no overlap), and ﬁve sample sizes n and respective numbers of cutoﬀs K . The second-stepbandwidth is set to h = 3 / ( K + 1), which minimizes MSE of b µ . Refer to Table 2 for diﬀerent choices of h . The number ofsimulations is 10,000. An individual with forcing variable X i receives a treatment dose equal to D ( X i ) as in Equation 3.The dose increases by one unit at each cutoﬀ, starting at d = 1 and ending at d K = K + 1. Theoutcome variable is Y i = φ ( X i ) D ( X i ) + ε i where φ ( X i ) = 15 X i + 7 . X i − . X i + 2 . β ( c ) = φ ( c )( d ′ − d ), which falls into the binary treatment case (see discussion on page17). Consider a counterfactual policy that uniformly increases treatment doses by one unit. TheATE parameter µ is the integral of φ ( c ) over c ∈ [0 , − h and h , I compare the ATE estimator b µ that uses ρ = ρ = 1, to the bias-corrected ATE estimator b µ bc that uses ρ = ρ = 2. To emphasize the importance of the second step, I also compute a naiveATE estimator that simply averages the ﬁrst-step estimates. The naive and bias-corrected naiveestimators, respectively e µ and e µ bc , are constructed as b µ and b µ bc except for the tuning parameters inthe second step. Both naive estimators use ρ = 0 and h = ∞ . To examine the eﬀect of overlappingestimation windows in the ﬁrst step, I compare estimators for two choices of h . The ﬁrst choice isthe largest possible bandwidth h = 1 / ( K +1), which leads to maximum overlap. The second choiceis the largest possible bandwidth with no overlap, that is, h = 0 . / ( K + 1). Finally, I study theeﬀects of ten diﬀerent choices for the second-step bandwidth, h ∈ { / ( K + 1) , . . . , / ( K + 1) } . Allchoices of tuning parameters satisfy the rate conditions of Theorem 2 and produce a convergencerate of root- n for b µ and b µ bc . The Monte Carlo experiment simulates 10,000 draws of an iid samplewith n ∈ { , , , , } and K ∈ { , , , , } , respectively. SectionB.7 in the supplemental appendix repeats the experiment with data-driven bandwidth choices,following the bandwidth rules proposed on page 17 (Section 3.2).The bias and variance of all estimators converge to zero as the sample size increases, regardlessof the choice of h (Table 1). The bias-correction of b µ bc eliminates almost all the bias of b µ , at thecost of a higher variance. The naive estimator e µ oversmooths the second step beyond the conditions24able 2: Precision of Estimators - Choice of h ( n, K ) = (1789 ,

20) ( n, K ) = (10120 , h · ( K + 1) b µ b µ bc b µ b µ bc b µ b µ bc b µ b µ bc b µ b µ bc b µ b µ bc Notes: The table reports simulated bias, variance, and mean squared error (MSE) for two estimators ( b µ, b µ bc ), ten choices ofsecond-step bandwidth ( h ∈ { / ( K + 1) , . . . , / ( K + 1) } ), and the two smallest sample sizes n and respective numbers ofcutoﬀs K . The ﬁrst-step bandwidth is set to h = 1 / ( K + 1) (overlap). Naive estimators ( e µ, e µ bc ) are not in this table becausethey are not aﬀected by the choice of h . The number of simulations is 10,000. of Theorem 2. As a result, the bias of e µ is substantially larger than that of b µ . Simply correctingfor bias in the ﬁrst step does not solve the problem, as the diﬀerence in bias between e µ and e µ bc is small. First-step bandwidths that produce overlap (Table 1, rows 1-5) yield approximately thesame bias, but substantially smaller variance, compared to ﬁrst-step bandwidths that produce nooverlap (Table 1, rows 6-10).Next, I study how the choice of h aﬀects precision of ( b µ, b µ bc ) for a ﬁxed choice of h = 1 / ( K +1)(Table 2). The smallest value for h is 3 / ( K + 1). This deﬁnes a second-step estimation windowwith at least three cutoﬀs to ensure invertibility of matrices in the regressions. The bias of b µ is substantially smaller when h is set to its smallest value. All other measures are practicallyunaﬀected across diﬀerent h .The signiﬁcant bias of the naive ATE estimators e µ and e µ bc decreases the coverage of 95% con-ﬁdence intervals as the sample increases (Table 3). The naive estimators oversmooth in the secondstep, and Theorem 2 implies the bias grows faster than root- n . For each of the four estimators, theconﬁdence intervals equal the estimator plus or minus 1 .

96 times its standard error. The varianceof estimators are obtained as described in Equation 26. The bias-corrected ATE estimator b µ bc pro-duces conﬁdence intervals with correct coverage for all samples sizes. Although b µ yields intervalswith average length smaller than b µ bc , the bias of b µ leads to a slightly lower coverage. In this section, the methods proposed in this paper are illustrated using the data from PU onhigh school assignments in Romania. Many policy questions demand an ATE of a continuous The data set is available online in the supplemental materials of PU on the website of the

American EconomicReview . n K b µ b µ bc e µ e µ bc b µ b µ bc e µ e µ bc Notes: The table reports simulated percentage of correct coverage and average length of 95% conﬁdence intervals. Conﬁdenceintervals are constructed using four estimators ( b µ, b µ bc , e µ, e µ bc ). They equal an estimator plus or minus its estimated standarddeviation multiplied by 1 .

96. Coverage and average length are computed for ﬁve sample sizes n and respective numbers ofcutoﬀs K . The ﬁrst-step bandwidth is set to h = 1 / ( K + 1) (overlap), and the second-step bandwidth is set to h = 3 / ( K + 1),which minimizes MSE of b µ . The number of simulations is 10,000. counterfactual distribution of treatments, and this section provides an example of such a policyquestion. The estimators designed for the sharp RDD case are consistent for “Intent-to-Treat”(ITT) average eﬀects when applied to the fuzzy data of PU. In this application, the ITT eﬀectmeasures the impact of being assigned to a better school but not necessarily attending it. Theparametric methods of Section 4 yield noticeable eﬃciency gains in the estimation of the ATEon ever-compliers. Treatment eﬀects for ever-compliers reveal a heterogeneity pattern unlike theheterogeneity of ITT eﬀects.The administrative data from Romania cover 3 cohorts of 9th grade students for the years2001, 2002, and 2003, with a total of 334,137 observations. The essential elements of the highschool assignment in Romania are described below. The assignment to high school is nationallycentralized by the Ministry of Education. At the end of grade 8, students submit a transition scoreand a complete ranking of preferences for high schools. The transition score is an average of thestudent’s performance on a national exam taken in grade 8 and the student’s grade point averageduring grades 5-8. The Ministry of Education ranks students by their transition score and no othercriteria. The mechanism assigns the student ranked ﬁrst to her most preferred school, the studentranked second to her most preferred school, etc. Students cannot decline their assignment, andthey have incentives to truthfully reveal their preference rankings.The observed variables are the town and year of student i , the transition score X i , the schoolthe student is assigned to, and the student’s score on the “baccalaureate exam.” This is an examtaken at the end of high school, and the grade on the exam is the outcome variable Y i . The qualityof school j (treatment dose d j ) is measured by the average transition score of the students attendingthat school j . The cutoﬀ c j for admission into a school j is equal to the minimum transition scoreamong the students that are assigned to that school j . The student’s preferences in high schoolsare not observed in the data, which makes the RDD fuzzy. For example, a student may have a scoregreater than the cutoﬀ for the best school in her town, but still be assigned to a diﬀerent schoolbecause of her personal preferences. For a transition score X i , the treatment dose of eligibility26 ( X i ) is equal to the largest d among those schools with admission cutoﬀ c less than X i . Thetreatment dose received D i coincides with the treatment dose of eligibility D ( X i ) for 40% of thestudents in the sample. Thus, the assignment is fuzzy, and causal inference beyond ITT eﬀectsrequires the methods of Section 4. Following PU, I drop observations with missing values for Y i .I also drop cutoﬀs without enough observations around them to carry out the matrix inversionsof the local polynomial regressions. The dropping of cutoﬀs leaves the empirical distribution ofoutcomes, forcing variable, cutoﬀs, and treatment doses practically unchanged. The estimationsample has 588 cutoﬀs with a total of 179,995 individuals from 769 schools in 121 towns and 3years. The variation of cutoﬀ and dose values is displayed in Figure 2.Figure 2: Variation in Cutoﬀ and Dose Values(a) d - do s e s (b) u - do s e c hange s Notes: Scatter plot with cutoﬀ values on the x-axis and dose values on the y-axis for the K = 588 cutoﬀs in theRomanian data. Panel (a) shows both doses before and after the cutoﬀ, that is, d j − in black and d j in gray. Panel(b) displays dose-change values on the y-axis, that is, u j = d j − d j − . The peer-quality of school j (treatment dose d j ) is measured by the average transition score of the students attending that school j . The cutoﬀ c j for admissioninto a school j is equal to the minimum transition score among the students assigned to school j . Non-parametric identiﬁcation of β ( c ) is limited to the set C , which is the convex-hull of C ∞ . Theset C ∞ is not entirely observed, and the researcher relies on the observation of C K (Figure 2(a)).For the sake of simplicity, I restrict β to be a function of dose changes ( u = d ′ − d ) instead of dosesbefore and after ( d and d ′ ). The restriction greatly simpliﬁes the visualization and estimation of β ( c, d, d ′ ), because it implies that β ( c, d, d ′ ) = φ ( c )( d ′ − d ), where φ is a continuously diﬀerentiablefunction. Figure 2(b) illustrates the variation of cutoﬀ and dose-change values and deﬁnes the limitson identiﬁcation of policy counterfactuals. For example, it is not possible to identify the eﬀectsof randomly assigning students with grades between 8 and 9 to a change in treatment dose of 2.The support of such counterfactual distribution falls outside the observed variation of cutoﬀ anddose-change values. On the other hand, it is possible to identify the ATE of randomly assigningstudents with grades between 6.5 and 8.5 to dose increases between 0 and 1.5.27he following policy question illustrates the ATE estimator proposed in this paper. Supposea new charter school is constructed in one of the towns in Romania. The new charter school hasmore autonomy and better management than traditional public schools, and admitted studentsexperience an increase in school quality as if they were admitted to a school with better peers.More speciﬁcally, the policy counterfactual is to give a 0.5 increase in peer-quality to a uniformdistribution of scores between 6.5 and 8.5. The ATE parameter is deﬁned as µ = 12 Z . . φ ( c ) dc. (37)I follow the estimation procedure suggested in Section 3.2 and take into account the restriction β ( c, d, d ′ ) = φ ( c )( d ′ − d ). As in the binary treatment case, the restriction lowers the polynomialdegree requirement on the second-step estimation to ρ = 1. See Figure 1 and the discussionthat follows it. The grid for h has 32 equally-spaced points between 0.1 and 3.6, respectively, thesmallest bandwidth for which the estimator is computable, and the maximum distance between twodiﬀerent cutoﬀs. The MSE-optimal bandwidth choice is h ∗ = 1 . . . . b φ ( c ), that is, the eﬀect of a 0 . c . The graph reveals heterogeneous marginal eﬀects of ability on returns to school quality.Heterogeneity of treatment eﬀects is a priori unknown, and the ATE estimator proposed in thispaper is consistent for µ regardless of the shape of φ ( c ). This highlights the empirical relevanceof Theorem 2 and the importance of the second-step estimation. In other words, the commonstrategy of normalizing all cutoﬀs to zero and estimating one discontinuity using the pooled datais not consistent for µ when φ ( c ) has such heterogeneity.28igure 3: Treatment Eﬀect Function(a) . ph i ( c ) (b) . ph i ( c ) Notes: Estimated average treatment eﬀect function for a 0 . c . The ﬁgure plots b β ( c, d, d + 0 .

5) = 0 . b φ ( c ) for c ∈ [6 . , . φ function is estimated non-parametrically withbias correction following Section 3.2 (sharp case). Panel (b) displays the eﬀect on ever-compliers of the same uniformchange in treatment dose. The φ function is estimated parametrically with bias correction following the iteratedprocedure of Section B.5.2 in the supplemental appendix (fuzzy case). Estimation of treatment eﬀects on ever-compliers requires a parametric functional form on β ec (Theorem 4). I assume β ec ( c, d, d ′ ) = θ ( d ′ − d )+ θ c ( d ′ − d )+ θ c ( d ′ − d )+ θ c ( d ′ − d ) and carry outthe iterative estimation procedure described in the supplemental appendix’s Section B.5.2. Thealgorithm achieves convergence of θ s within 30 iterations. The iterated bias-corrected ATE onever-compliers equals 0 .

107 with standard error of 0 . .

5. Compared to ITT eﬀects in Figure 3(a), the return of better schooling onever-compliers is also positive, but much less heterogeneous across ability levels.

Diﬃculty in gathering experimental data in many ﬁelds within the social sciences makes quasi-experimental techniques such as RDD extremely important to evaluate policies and social programs.RDD has been used in a wide range of applications in economics since the late 1990s. More recently,there has been an increasing number of applications with one forcing variable and multiple cutoﬀs,assigning individuals to heterogeneous treatments. The demand for multi-cutoﬀ RDD methods isconstantly growing, as richer data sets become ever more available.This paper states conditions under which multiple RDD eﬀects are combined to infer ATE overthe entire range of cutoﬀ values. The proposed estimator is consistent and asymptotically normalfor ATEs over the entire support of variation in cutoﬀs and treatment doses. Asymptotic resultsare derived under a large number of observations and cutoﬀs in the sharp case of non-parametrictreatment eﬀect functions. Suﬃcient conditions on the rate of growth of the number of cutoﬀs,29elative to the number of observations, are given. These rate conditions determine the feasiblechoice set of tuning parameters. This paper also shows that non-parametric identiﬁcation in fuzzyRDD with multiple cutoﬀs is impossible unless the treatment eﬀect function is ﬁnite-dimensional, orthere is large variation of cutoﬀ-dose values. A parametric speciﬁcation provides an MSE-optimalATE estimator for the fuzzy case that is consistent and asymptotically normal.The relevance of the ATE estimators proposed in this paper is illustrated with the data ofPop-Eleches and Urquiola (2013) on high school assignment in Romania. Of interest is the eﬀect ofhigh school quality on academic performance of students. I ﬁnd strong evidence of non-linearitiesin the returns to better schooling, as a function of students’ ability level. Monte Carlo simulationsdemonstrate that such non-linearities severely bias a naive average of local eﬀects that does notuse the correction weighting scheme proposed in this paper. Applying the fuzzy RDD methods tothe Romanian data reveals causal eﬀects on ever-compliers that are smaller and less heterogeneousthan ITT eﬀects.The proposed estimator converges at the minimax optimal rate of root- n , as long as ﬁrst-stepbandwidths converge to zero at 1 /K rate. It would be interesting to learn about eﬃciency propertiesof the ATE estimator. Theoretical tools commonly employed to derive eﬃciency lower-bounds maynot be immediately applicable to the setting of this paper. These tools are designed for regularestimators, and for data drawn from a population where the parameter of interest is identiﬁed.In contrast, the ﬁxed-cutoﬀ RDD design relies on an “identiﬁcation at inﬁnity argument”, andI wonder about the suﬃcient conditions that would obtain regularity of the ATE estimator. Apossibility for future work is to the generalize the uniform convergence tools from this paper toarrive at such conditions. I am indebted to Han Hong, Caroline Hoxby, and Guido Imbens for invaluable advice. The paperalso beneﬁted from feedback received from seminar participants at Stanford, Boston University,Cambridge, Iowa, Notre Dame, CORE-UcLouvain, UCSD, UCDavis, FGV-EESP, FGV-EPGE,Insper, PUC-Rio, Toulouse, UIUC, and at various conferences. I thank Tim Bresnahan, ArunChandrasekhar, Michael Dinerstein, Ivan Fernandez-Val, Ivan Korolev, Michael Leung, Huiyu Li,Jessie Li, Petra Moser, Stephen Terry, Xiaowei Yu, and anonymous referees for suggestions andcomments. I gratefully acknowledge the ﬁnancial support received from the B.F. Haley and E.S.Shaw Fellowship at SIEPR-Stanford, CORE-UcLouvain, ISLA-Notre Dame, and while visiting theKenneth C. Griﬃn Department of Economics at the University of Chicago.

References

Agarwal, Sumit, Souphala Chomsisengphet, Neale Mahoney, and Johannes Stroebel (2017) “DoBanks Pass Through Credit Expansions to Consumers Who Want to Borrow?”

Quarterly Journal f Economics , Vol. 133, No. 1, pp. 129–190.Andrews, D.W.K. (1994) “Empirical Process Methods in Econometrics,” in Engle, R.F. and D.L.McFadden eds. Handbook of Econometrics , Vol. 4: North Holland, Chap. 37, pp. 2247–2294.Angrist, J.D. (2004) “Treatment Eﬀect Heterogeneity in Theory and Practice,”

Economic Journal ,Vol. 114, pp. C52–C83.Angrist, J.D. and V. Lavy (1999) “Using Maimonides’ Rule to Estimate the Eﬀect of Class Size onScholastic Achievement,”

Quarterly Journal of Economics , Vol. 114, No. 2, pp. 533–575.Angrist, J.D. and Miikka Rokkanen (2015) “Wanna Get Away? Regression Discontinuity Esti-mation of Exam School Eﬀects Away from the Cutoﬀ,”

Journal of the American StatisticalAssociation , Vol. 110, No. 512, pp. 1331–1344.Angrist, Joshua D and J¨orn-Steﬀen Pischke (2008)

Mostly Harmless Econometrics: An Empiricist’sCompanion : Princeton University Press.Bajari, Patrick, Han Hong, Minjung Park, and Robert Town (2017) “Estimating Price Sensitivityof Economic Agents Using Discontinuity in Nonlinear Contracts,”

Quantitative Economics , Vol.8, No. 2, pp. 397–433.Bertanha, Marinho and Guido Imbens (2019) “External Validity in Fuzzy Regression DiscontinuityDesigns,”

Journal of Business and Economic Statistics, forthcoming .Bertanha, Marinho and Marcelo J Moreira (2019) “Impossible Inference in Econometrics: Theoryand Applications,”

Journal of Econometrics, forthcoming .Black, Dan A, Jose Galdo, and Jeﬀrey A Smith (2007) “Evaluating the Worker Proﬁling andReemployment Services System Using a Regression Discontinuity Approach,”

American Eco-nomic Review , Vol. 97, No. 2, pp. 104–107.Black, S.E. (1999) “Do Better Schools Matter? Parental Valuation of Elementary Education,”

Quarterly Journal of Economics , Vol. 114, No. 2, pp. 577–599.Calonico, Sebastian, Matias D Cattaneo, and Max H Farrell (2018) “Optimal Bandwidth Choicefor Robust Bias Corrected Inference in Regression Discontinuity Designs,” arXiv preprintarXiv:1809.00236 .Calonico, Sebastian, Matias D Cattaneo, and Rocio Titiunik (2014) “Robust Nonparametric Con-ﬁdence Intervals for Regression-discontinuity Designs,”

Econometrica , Vol. 82, No. 6, pp. 2295–2326.Cattaneo, Matias D, Rocio Titiunik, Gonzalo Vazquez-Bare, and Luke Keele (2016) “InterpretingRegression Discontinuity Designs with Multiple Cutoﬀs,”

Journal of Politics , Vol. 78, No. 4, pp.1229–1248.Chen, Susan and Wilbert Van der Klaauw (2008) “The Work Disincentive Eﬀects of the DisabilityInsurance Program in the 1990s,”

Journal of Econometrics , Vol. 142, No. 2, pp. 757–784.Cheng, Ming-Yen, Jianqing Fan, and James S Marron (1997) “On Automatic Boundary Correc-tions,”

Annals of Statistics , Vol. 25, No. 4, pp. 1691–1708.De Giorgi, Giacomo, Andres Drenik, and Enrique Seira (2017) “Sequential Banking: Direct andExternality Eﬀects on Delinquency,”

CEPR Discussion Paper No. DP12280 .De La Mata, Dolores (2012) “The Eﬀect of Medicaid Eligibility on Coverage, Utilization, andChildren’s Health,”

Health Economics , Vol. 21, No. 9, pp. 1061–1079.Dobkin, Carlos and Fernando Ferreira (2010) “Do School Entry Laws Aﬀect Educational Attain-ment and Labor Market Outcomes?”

Economics of Education Review , Vol. 29, No. 1, pp. 40–54.Dong, Yingying (2018a) “Alternative Assumptions to Identify LATE in Fuzzy Regression Discon-tinuity Designs,”

Oxford Bulletin of Economics and Statistics , Vol. 80, No. 5, pp. 1020 – 1027.312018b) “Jump or Kink? Regression Probability Jump and Kink Design for TreatmentEﬀect Evaluation,”

Working Paper, University of California, Irvine .Dong, Yingying and Arthur Lewbel (2015) “Identifying the Eﬀect of Changing the Policy Thresholdin Regression Discontinuity Models,”

Review of Economics and Statistics , Vol. 97, No. 5, pp.1081–1092.Duﬂo, E., P. Dupas, and M. Kremer (2011) “Peer Eﬀects, Teacher Incentives, and the Impact ofTracking: Evidence from a Randomized Evaluation in Kenya,”

American Economic Review , Vol.101, No. 5, pp. 1739–1774.Egger, Peter and Marko Koethenbuerger (2010) “Government Spending and Legislative Organi-zation: Quasi-experimental Evidence from Germany,”

American Economic Journal: AppliedEconomics , Vol. 2, No. 4, pp. 200–212.Fan, J. and I. Gijbels (1996)

Local Polynomial Modelling and Its Applications , Chapman &Hall/CRC Monographs on Statistics & Applied Probability: Taylor & Francis.Frandsen, R., M. Fr¨olich, and B. Melly (2012) “Quantile Treatment Eﬀects in the RegressionDiscontinuity Design,”

Journal of Econometrics , Vol. 168, No. 2, pp. 382–395.Garibaldi, P., F. Giavazzi, A. Ichino, and E. Rettore (2012) “College Cost and Time to Obtaina Degree: Evidence from Tuition Discontinuities,”

Review of Economics and Statistics , Vol. 94,No. 3, pp. 699–711.Hahn, J., P. Todd, and W. Van der Klaauw (2001) “Identiﬁcation and Estimation of TreatmentEﬀects with a Regression-discontinuity Design,”

Econometrica , Vol. 69, No. 1, pp. 201–209.Hastings, Justine S, Christopher A Neilson, and Seth D Zimmerman (2013) “Are Some DegreesWorth More Than Others? Evidence From College Admission Cutoﬀs In Chile,”

NBER WorkingPaper 19241 .Hoekstra, Mark (2009) “The Eﬀect of Attending the Flagship State University on Earnings: aDiscontinuity-based Approach,”

Review of Economics and Statistics , Vol. 91, No. 4, pp. 717–724.Hoxby, C.M. (2000) “The Eﬀects of Class Size on Student Achievement: New Evidence fromPopulation Variation,”

Quarterly Journal of Economics , Vol. 115, No. 4, pp. 1239–1285.Imbens, Guido and Karthik Kalyanaraman (2012) “Optimal Bandwidth Choice For The RegressionDiscontinuity Estimator,”

Review of Economic Studies , Vol. 79, No. 3, pp. 933–959.Imbens, Guido W and Thomas Lemieux (2008) “Regression Discontinuity Designs: a Guide toPractice,”

Journal of Econometrics , Vol. 142, No. 2, pp. 615–635.Imbens, Guido W and Donald B Rubin (1997) “Estimating Outcome Distributions for Compliersin Instrumental Variables Models,”

Review of Economic Studies , Vol. 64, No. 4, pp. 555–574.Van der Klaauw, Wilbert (2002) “Estimating the Eﬀect of Financial Aid Oﬀers on College Enroll-ment: A Regression-discontinuity Approach,”

International Economic Review , Vol. 43, No. 4,pp. 1249–1287.Lazear, E. (2001) “Educational Production,”

Quarterly Journal of Economics , Vol. 116, No. 3, pp.777–803.Lipman, Yaron, Daniel Cohen-Or, and David Levin (2006) “Error Bounds and Optimal Neighbor-hoods for MLS Approximation,” in Polthier, Konrad and Alla Sheﬀer eds.

Eurographics Sympo-sium on Geometry Processing .McCrary, J. (2008) “Manipulation of the Running Variable in the Regression Discontinuity Design:a Density Test,”

Journal of Econometrics , Vol. 142, No. 2, pp. 698–714.McCrary, Justin and Heather Royer (2011) “The Eﬀect of Female Education on Fertility and InfantHealth: Evidence from School Entry Policies Using Exact Date of Birth,”

American Economic eview , Vol. 101, No. 1, pp. 158–195.Newey, Whitney K (1994) “Kernel Estimation of Partial Means and a General Variance Estimator,” Econometric Theory , Vol. 10, No. 02, pp. 1–21.Pollard, D. (1984)

Convergence of Stochastic Processes : Springer.Pop-Eleches, C. and M. Urquiola (2013) “Going to a Better School: Eﬀects and Behavioral Re-sponses,”

American Economic Review , Vol. 103, No. 4, pp. 1289–1324.Porter, J. (2003) “Estimation in the Regression Discontinuity Model,”

Unpublished Manuscript,University of Wisconsin, Madison .Rokkanen, Miikka (2015) “Exam Schools, Ability, and the Eﬀects of Aﬃrmative Action: Latent Fac-tor Extrapolation in the Regression Discontinuity Design,”

Unpublished Manuscript, ColumbiaUniversity .Rothe, Christoph (2012) “Partial Distributional Policy Eﬀects,”

Econometrica , Vol. 80, No. 5, pp.2269–2301.Sun, Yixiao (2005) “Adaptive Estimation of the Regression Discontinuity Model,”

Working PaperAvailable at SSRN: 739151 .Tsybakov, Alexandre B (2009)

Introduction to Nonparametric Estimation : Translated from Frenchby Vladimir Zaiats. Springer Series in Statistics, New York.Van Der Vaart, Aad W and Jon A Wellner (1996)

Weak Convergence and Empirical Processes :Springer.

A Appendix

Throughout the appendices, M is used as a generic ﬁnite and positive constant in the proofs.For a p × q matrix A, the norm of A is induced by the Euclidean norm k · k , i.e. k A k =max x ∈ R q ,x =0 k Ax k / k x k . The determinant of matrix A is denoted det( A ). References to the supple-mental appendix include B in the numbering; for example, Lemma B.1, or Table B.2. A.1 Proof of Theorem 1

Lemma B.1 derives asymptotic normality of the bias-corrected jump-discontinuity estimator at onecutoﬀ based on local polynomial regressions of a vector Y i on a scalar forcing variable X i . The proofof Theorem 1 is a straightforward generalization of Lemma B.1 in the particular case of a scalar Y i . As the sample size increases and the number of cutoﬀ remains ﬁxed, the jump-discontinuityestimators are independent across cutoﬀs. First apply Lemma B.1 to each cutoﬀ individually, andthen aggregate over cutoﬀs. (cid:3) A.2 Proof of Lemma 2

Deﬁne C = [ X , X ] × [ D , D ] × [ D , D ]. Consider the partition of C made of the set of non-intersectingcubicles T n = { C , . . . , C M } with M = n , n = { , , . . . } . Each C j is a half-open cubicle ofthe form [ x l − , x l ) × [ y m − , y m ) × [ z o − , z o ) with sides of lengths equal to ( X − X ) /n , ( D − D ) /n ,and ( D − D ) /n . Deﬁne the sub-collection U n = { C ∈ T n : C ⊂ C} = { A , . . . , A Q } . Since C ∞ is dense in C , for every A j ∈ U n , ﬁnd a point c j ∈ C ∞ ∩ A j for which β ( c j ) is known. Thesum µ n = P Qj =1 ω ( c j ) β ( c j ) (cid:0) X − X (cid:1) (cid:0)

D − D (cid:1) /n converges to µ c as n → ∞ because ω ( c ) β ( c ) isRiemann integrable on C . 33 A.3 Proof of Theorem 2

The proof combines arguments from the proof of Lemma B.1 with lemmas on the uniform conver-gence of empirical processes from Sections B.2 and B.3. Deﬁne µ ∗ , e µ , and µ n as follows: µ ∗ = K X j =1 ∆ j n e ′ E [ G j + n ] 1 nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i Y i e H ji − e ′ E [ G j − n ] 1 nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i Y i e H ji o (A.1)= K X j =1 ∆ j nh j n X i =1 k (cid:18) X i − c j h j (cid:19) Y i e ′ (cid:16) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:17) e H ji (A.2) e µ = K X j =1 ∆ j ( e ′ G j + n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i Y j + i e H ji − e ′ G j − n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i Y j − i e H ji µ n = K X j =1 ∆ j B j . (A.3)Write b µ − B c n − B c n − µ ( V cn ) / = µ ∗ − E [ µ ∗ |X n ]( V cn ) / (A.4)+ e µ − B c n ( V cn ) / (A.5)+ µ n − B c n − µ ( V cn ) / (A.6)+ b µ − E [ b µ |X n ] − ( µ ∗ − E [ µ ∗ |X n ])( V cn ) / (A.7)+ E [ b µ − µ n |X n ] − e µ ( V cn ) / . (A.8)The proof in this appendix applies a central limit theorem (CLT) to show that Part (A.4)converges in distribution to a standard normal; it demonstrates that B n approximates the ﬁrst-stepbias, that is, that part (A.5) converges in probability to zero; and it shows that B n approximatesthe second-step bias (integration error), that is, that part (A.6) converges to zero. Lemma B.7shows that parts (A.7) and (A.8) converge in probability to zero.34 art (A.4)First, ﬁnd the rate that ( V cn ) − / grows. Deﬁne φ n and rewrite V cn as follows: φ n ( X i ) = K X j =1 ∆ j nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:17) e H ji (A.9) V cn = n X i =1 E (cid:2) ε i φ n ( X i ) (cid:3) . (A.10)Choose alternative bandwidths h ∗ j , j = 1 , . . . , K , such that (i) there exists δ > n ) such that δ < h ∗ j /h j ≤ ∀ j ; and (ii) [ c j − h ∗ j , c j + h ∗ j ] ∩ [ c j ′ − h ∗ j ′ , c j ′ + h ∗ j ′ ] = ∅ for any j = j ′ . V cn = n E (cid:2) ζ ( X i ) φ n ( X i ) (cid:3) ≥ n K X j =1 Z c j + h ∗ j c j − h ∗ j ζ ( x ) φ n ( x ) f ( x ) dx (A.11)= n K X j =1 Z c j + h ∗ j c j − h ∗ j ζ ( x ) ∆ j nh j k (cid:18) x − c j h j (cid:19) e ′ (cid:0) I { x ≥ } E [ G j + n ] − I { x < } E [ G j − n ] (cid:1) H (cid:18) x − c j h j (cid:19) ! f ( x ) dx (A.12)= 1 Kn K K X j =1 K ∆ j h j Z h ∗ j /h j − h ∗ j /h j ζ ( c j + uh j ) k ( u ) e ′ (cid:0) I { u ≥ } E [ G j + n ] − I { u < } E [ G j − n ] (cid:1) H ( u ) ! f ( c j + uh j ) du ≥ MKnh (A.13)where the ﬁrst inequality follows from the integrand being positive and ∪ Kj =1 [ c j − h ∗ j , c j + h ∗ j ] ⊆∪ Kj =1 [ c j − h j , c j + h j ]; the third equality uses a change of variables u = ( x − c j ) /h j ; and thelast inequality follows because (a) h j ≤ h ; (b) K ∆ j is bounded away from zero uniformly over j (Lemma B.9); and (c) each integral is bounded away from zero over j because the integrationlimits, ζ ( c j + uh j ), E [ G j ± n ], and f ( c j + uh j ) are uniformly close to quantities that are positivedeﬁnite uniformly over j (see Lemma B.6 and recall that f and ζ are bounded away from zerobecause of Assumptions 4 and 7). The inequality in (A.13) implies that ( V cn ) − = O ( Knh ) where Knh → ∞ .Second, write part (A.4) as a weighted sum across i : µ ∗ = n X i =1 Y i K X j =1 ∆ j nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:17) e H ji | {z } ≡ φ n ( X i ) = n X i =1 Y i φ n ( X i ) (A.14)35o that µ ∗ − E [ µ ∗ |X n ]( V cn ) / = P ni =1 ( Y i − E [ Y i | X i ]) φ n ( X i )( V cn ) / = P ni =1 ε i φ n ( X i )( V cn ) / . (A.15)Equation A.15 is a sum of iid random variables with zero mean, where V cn is the variance of thenumerator. The Lindeberg condition is veriﬁed next. Take an arbitrary δ > n X i =1 E h ( V cn ) − ε i φ n ( X i ) I n(cid:12)(cid:12)(cid:12) ( V cn ) − / ε i φ n ( X i ) (cid:12)(cid:12)(cid:12) > δ oi (A.16) ≤ n X i =1 E h M Knh φ n ( X i ) I n M ′ (cid:0) Knh (cid:1) / | φ n ( X i ) | > δ oi (A.17) ≤ n X i =1 E h M Knh ( Knh ) − I n M ′ (cid:0) Knh (cid:1) / ( Knh ) − > δ oi (A.18) ≤ ( Kh ) − I n M ′ ( Knh ) − / > δ o = o (1) (A.19)where the ﬁrst inequality relies on the fact that ε i is a.s. bounded (Assumption 7), and that( V cn ) − = O (cid:0) Knh (cid:1) (Equation A.13). The second inequality uses that φ n ( x ) = O ( Knh ) − uniformly over x . In fact, φ n ( x ) is a sum of K components of which at most two are non-zero,∆ j = O (cid:0) K − (cid:1) uniformly over j (Lemma B.9), k ( · ) is bounded (Assumption 3), E [ G j ± n ] is uniformlyclose to G j ± whose norm is bounded away from zero (Lemma B.6). The last inequality relies on therate condition h /h = O (1), and that the indicator becomes zero for large n . The Lindeberg-FellerCLT says that Equation A.15, and thus part (A.4), converges in distribution to a standard normal. Part (A.5)First consider E (cid:20) h j k (cid:18) X i − c j h j (cid:19) v j + i e H ji E h Y j + i | X i i(cid:21) (A.20)= E " h j k (cid:18) X i − c j h j (cid:19) v j + i e H ji ∇ ( ρ +1) R ( c j , d j )( ρ + 1)! (cid:18) X i − c j h j (cid:19) ρ +1 h ρ +11 j (A.21)+ E " h j k (cid:18) X i − c j h j (cid:19) v j + i e H ji ∇ ( ρ +2) R ( c ∗ j , d j )( ρ + 2)! (cid:18) X i − c j h j (cid:19) ρ +2 h ρ +21 j (A.22)= h ρ +11 j ∇ ( ρ +1) R ( c j , d j )( ρ + 1)! f ( c j ) γ ∗ + O (cid:16) h ρ +21 (cid:17) (A.23)where E h Y j + i | X i i is the diﬀerence between E [ Y i | X i ] and its ρ -th order Taylor expansion around X i = c j (see Equations B.37 and B.38). The expectations in Equations A.21 and A.22, withoutthe h ρ +11 j and h ρ +21 j terms, are bounded over j because the kernel, derivatives, and polynomialsare bounded functions of u = ( x − c j ) h − j (Assumptions 3 and 7). The remainder term O (cid:16) h ρ +21 (cid:17) is uniform over j . 36ext, e µ − B n ( V cn ) / =( V cn ) − / K X j =1 ∆ j e ′ G j + n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i E h Y j + i | X i i e H ji − ( V cn ) − / B +1 n (A.24) − ( V cn ) − / K X j =1 ∆ j e ′ G j − n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i E h Y j − i | X i i e H ji + ( V cn ) − / B − n (A.25)where B n = B +1 n − B − n (A.26) B +1 n =(( ρ + 1)!) − K X j =1 h ρ +11 j ∆ j f ( c j ) ∇ ρ +1 x R ( c j , d j ) e ′ G j + n γ ∗ (A.27) B − n =(( ρ + 1)!) − K X j =1 h ρ +11 j ∆ j f ( c j ) ∇ ρ +1 x R ( c j , d j − ) e ′ G j − n γ ∗ . (A.28)Consider part (A.24). Part (A.25) follows a symmetric argument.(A.24) =( V cn ) − / K X j =1 ∆ j e ′ G j + n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i E h Y j + i | X i i e H ji − ( V cn ) − / B +1 n (A.29)=( V cn ) − /  K X j =1 ∆ j e ′ G j + n h ρ +11 j ∇ ( ρ +1) R ( c j , d j )( ρ + 1)! f ( c j ) γ ∗ − B +1 n  (A.30)+ ( V cn ) − /  K X j =1 ∆ j e ′ G j + n O (cid:16) h ρ +21 (cid:17) (A.31)=0 (A.32)+ O (cid:16)(cid:0) Knh (cid:1) / (cid:17) KO ( K − ) O P (1) O (cid:16) h ρ +21 (cid:17) = o P (1) (A.33)where the second equality uses the expansion in Equation A.23. The third equality uses thedeﬁnition of B +1 n , that ∆ j = O ( K − ) uniformly over j , that G j + n = O P (1). These terms are o P (1)because of the rate condition (cid:0) Knh (cid:1) / h ρ +11 = O (1). Part (A.6) µ n − B n − µ ( V cn ) / = O (cid:16)(cid:0) Knh (cid:1) / (cid:17)  K X j =1 ∆ j B j − B n − Z C ω ( c ) β ( c ) d ( c )  (A.34)= O (cid:16)(cid:0) Knh (cid:1) / (cid:17) O (cid:16) h ρ +22 (cid:17) = O (1) O ( h ) = o (1) (A.35)37here the ﬁrst equality uses the rate on ( V cn ) − / (Equation A.13). The second equality appliesLemma B.9 and relies on Assumption 6 (asymptotic behavior of { c j } j ) and Assumption 7 (smooth-ness of β ( c )). The third equality uses the rate condition (cid:0) Knh (cid:1) / h ρ +12 = O (1). Lemma B.9 alsoshows that B n = O (cid:16) h ρ +12 (cid:17) , which yields ( V cn ) − / B n = O (cid:16)(cid:0) Knh (cid:1) / h ρ +12 (cid:17) = O (1).Lemma B.7 shows that parts (A.7) and (A.8) converge in probability to zero, which concludesthe proof. (cid:3) A.4 Proof of Theorem 3

Part (27)First, consider the ideal setting where estimators µ ∗ are functions of data observed from { Y i ( d ) } d ∈D and X i . For a choice of loss function L ( µ, µ ′ ), the minimax risk of estimating theparameter µ c ( P ) is deﬁned as inf µ ∗ sup P ∈P E P [ L ( µ ∗ , µ c ( P ))]. Here, the 0-1 loss function is used,that is, L n ( µ, µ ′ ) = I { n r | µ − µ ′ | > ǫ } , for a positive rate r and ǫ . In this case, E P [ L n ( µ ∗ , µ c ( P ))] = P P [ n r | µ ∗ − µ c ( P ) | > ǫ ]. The minimax risk is the supremum probability over P of an estimatorbeing farther than ǫn − r from the truth minimized over all possible estimators µ ∗ . The rate r is anupper bound on the rate of convergence if for small ǫ > L ∈ (0 , µ ∗ sup P ∈P P P [ n r | µ ∗ − µ c ( P ) | > ǫ ] ≥ L for large n . The rate r is the minimax optimalrate if it is an upper bound and achievable; that is, if there exists an estimator b µ that convergesat rate r uniformly. The estimator b µ converges at rate r uniformly if, for any small δ >

0, thereexists large ǫ ∈ (0 , ∞ ) such that sup P ∈P P P [ n r | b µ − µ c ( P ) | > ǫ ] < δ for large n . See discussion inChapter 2 of Tsybakov (2009).One common approach to compute lower bounds for the minimax risk is to use Le Cam’smethod. For ǫ >

0, choose two models

P, Q ∈ P such that | µ c ( P ) − µ c ( Q ) | > ǫn − r . Le Cam’smethod leads to the following inequality: inf µ ∗ sup P ∈P P P [ n r | µ ∗ − µ c ( P ) | > ǫ/ ≥ e − nKL ( P,Q ) / KL ( P, Q ) is the Kullback-Leibler divergence between P and Q . See Equations (2.7), (2.9),and Theorem 2.2(iii) of Tsybakov (2009). This inequality is used to prove part (27) with r = 1 / ω c ( c ). The researcher must be choose a counter-factual density such that its marginal densities R ω c ( c, d, d ′ ) d ( d ′ ) and R ω c ( c, d, d ′ ) d ( d ) are diﬀerentfunctions; otherwise, µ c = 0. Construct an inﬁnitely diﬀerentiable bounded function g ( c, d ) ≥ R [ g ( c, d ′ ) − g ( c, d )] ω ( c ) d c = 1.Construct two models P, Q ∈ P as follows. Let ε i ∼ N (0 ,

1) and X i ∼ U [0 ,

1] iid and inde-pendent of each other. Pick ξ > √ πǫ >

0. For model P , deﬁne Y i ( d ) = Φ (cid:0) ξn − / g ( X i , d ) + ε i (cid:1) ,where Φ is the standard normal cdf. For model Q , deﬁne Y i ( d ) = Φ ( ε i ). The expectation of Y i ( d )conditional on X i = c , that is, R ( c, d ), is an inﬁnitely diﬀerentiable function. The variables havebounded support, and models P and Q satisfy all the conditions to be in P . Under model P , β ( c ; P ) = E P [ Y i ( d ′ ) − Y i ( d ) | X i = c ]= E P [Φ (cid:0) ξn − / g ( X i , d ′ ) + ε i (cid:1) − Φ (cid:0) ξn − / g ( X i , d ) + ε i (cid:1) | X i = c ]= E P [ φ ( ε ∗ i ) ξn − / ( g ( c, d ′ ) − g ( c, d )) | X i = c ]= E P [ φ ( ε ∗ i )] ξn − / ( g ( c, d ′ ) − g ( c, d ))where φ is the standard normal pdf, and ε ∗ i is in between ε i + ξn − / g ( c, d ′ ) and ε i + ξn − / g ( c, d ).As n grows large, E P [ φ ( ε ∗ i )] = E P [ φ ( ε i )] + o (1) = √ π + o (1) where the o (1) term is uniform over( c, d, d ′ ). Then, µ c ( P ) = √ π ξn − / R ( g ( c, d ′ ) − g ( c, d )) ω ( c ) d c + o (cid:0) n − / (cid:1) = √ π ξn − / + o (cid:0) n − / (cid:1) . Under model Q , β ( c ; Q ) = 0. Therefore, 38 c ( P ) − µ c ( Q ) = √ π ξn − / + o (cid:0) n − / (cid:1) > ǫn − / for large n , because √ π ξ > ǫ .Next, we use the following inequality to show that r = 1 / µ ∗ sup P ∈P P P (cid:2) n / | µ ∗ − µ c ( P ) | > ǫ/ (cid:3) ≥ e − nKL ( P,Q ) / d ∗ be such that g ( c, d ∗ ) > c . For simple models like P and Q , any function ofthe variables { Y i ( d ) } d ∈D and X i can be rewritten as functions of Y i ( d ∗ ) and X i because Y i ( d ) is adeterministic function of Y i ( d ∗ ) and X i for any d. It suﬃces to look at the distribution of Y i ( d ∗ )and X i instead of the distribution of { Y i ( d ) } d ∈D and X i . Consider the Kullback-Leibler divergencefor the distributions P and Q of ( Y i ( d ∗ ) , X i ), KL ( P, Q ) = R log h p ( y,x ) q ( y,x ) i p ( y, x ) dydx ,where p ( y, x ) and q ( y, x ) are the pdfs of ( Y i ( d ∗ ) , X i ) under P and Q respectively. Deﬁne e Y i = ξn − / g ( X i , d ∗ ) + ε i under P , and e Y i = ε i under Q . It follows that ( Y i ( d ∗ ) , X i ) = (Φ( e Y i ) , X i ) underboth P and Q . The Kullback-Leibler divergence is invariant to such a transformation of variables. KL ( P, Q ) = R log h e p ( y,x ) e q ( y,x ) i e p ( y, x ) dydx ,where e p ( y, x ) = φ (cid:0) y − ξn − / g ( x, d ∗ ) (cid:1) and e q ( y, x ) = φ ( y ) are the pdfs of ( e Y i , X i ) under P and Q respectively. KL ( P, Q ) = R log " exp n − (1 / ( y − ξn − / g ( x,d ∗ ) ) o exp {− (1 / y } p ( y, x ) dydx = R log (cid:2) exp (cid:8) yξn − / g ( x, d ∗ ) − (1 / ξ n − g ( x, d ∗ ) (cid:9)(cid:3) e p ( y, x ) dydx = R (cid:2) yξn − / g ( x, d ∗ ) − (1 / ξ n − g ( x, d ∗ ) (cid:3) e p ( y, x ) dydx = R (1 / ξ n − g ( x, d ∗ ) dx = (1 / ξ n − R g ( x, d ∗ ) dx > η > / ξ R g ( x, d ∗ ) dx < log( η ). Then, e − nKL ( P,Q ) / > / (4 η ) >

0, andinf µ ∗ sup P ∈P P P (cid:2) n / | µ ∗ − µ c ( P ) | > ǫ/ (cid:3) ≥ η .This is a minimax lower bound for estimators µ ∗ that are functions of an ideal sample of { Y i ( d ) } d ∈D and X i . In practice, only part of these variables are observed according to the scheduleof cutoﬀ-doses { c j } Kj =1 . The set of all estimators e µ that are functions of the observed variables( Y i , X i ) is a subset of the set of all estimators µ ∗ . Therefore, the lower bound above is also aminimax lower bound for all estimators e µ :inf e µ sup P ∈P P P (cid:2) n / | e µ − µ c ( P ) | > ǫ/ (cid:3) ≥ η . Part (28)Let b µ denote b µ c and µ = µ c ( P ) for notational ease. The goal is to show that, for any small δ >

0, there exists large ǫ ∈ (0 , ∞ ) such that sup P ∈P P P (cid:2) n / | b µ − µ | > ǫ (cid:3) < δ for large n . Thechoice of h plus the discussion preceding Equation A.13 lead to ( V cn ) − / ≥ M n / for large n .Thus, P P (cid:2) n / | b µ − µ | > ǫ/M (cid:3) ≤ P P (cid:2) ( V cn ) − / | b µ − µ | > ǫ (cid:3) uniformly over P for large n . Theorem 2breaks ( V cn ) − / | b µ − µ | into four components: the CLT component N n , that converges in distributionto a standard normal (part (A.4)); the ﬁrst-step bias component B n , that converges in probabilityto zero (part (A.5)); the integration error component I n , that converges in probability to zero (part39A.6)); and the remainder terms R n , that converge in probability to zero (parts (A.7) and (A.8)).It is true that P P (cid:0) ( V cn ) − / | b µ − µ | > ǫ (cid:1) ≤ P P ( | N n | > ǫ/ P P ( | B n | > ǫ/ P P ( | I n | > ǫ/ P P ( | R n | > ǫ/ δ > n and large ǫ > P is less than δ . Therestrictions placed in the class of models P along with the proof of Theorem 2 give the result. N n - term : part (A.4) has zero mean and unit variance (see Equation A.15). Chebyshev’sinequality implies that the supremum probability of the absolute value of part (A.4) being greaterthan ǫ/ /ǫ uniformly over P . B n - term : B n is the sum of B + n (part (A.24)), and B − n (part (A.25)). B + n converges in probabilityto zero uniformly over P because the approximations of Lemma B.6, the bounds on the derivativesof R ( x, d ), on f ( x ), on σ ( x, d ), and on the rate of ( V cn ) − / hold uniformly over P . The weights∆ j do not depend on P . The same idea applies to B − n . Thus, for ǫ >

0, sup P ∈P P P ( | B n | > ǫ/ I n - term : uniform bounds on the partial derivatives of β ( c ) yield a uniform bound on theapproximation error of the numerical integral. See Lemma B.9. The bounds on the rate of ( V cn ) − / also hold uniformly over P . For every ǫ > n for which | I n | ≤ ǫ/ P . R n - term : R n is the sum of R an (part (A.7)) and R bn (part (A.8)). Lemma B.7 shows that bothconverge in probability to zero. They also converge in probability to zero uniformly over P for thesame reasons that the B n -term above does. Therefore, for ǫ >

0, sup P ∈P P P ( | R n | > ǫ/

4) convergesto zero. (cid:3)

A.5 Proof of Theorem 4

Deﬁne δ j,l = I {U i ( c j ) = d l } . Assumption 8 (no ever-deﬁers) implies the following facts: (i) P [ δ j − ,l = 0 , δ j,l = 1] = 0 for ∀ l = j ; (ii) P [ δ j − ,l = 1 , δ j,l = 0] = 0 for l = j ;(iii) P [ δ j − ,l = 1 , δ j,l = 0 , δ j,u = 1] = 0 for ∀ u = j and u = l .Fix a small e > E [ Y i | X i = c j + e ] = P Kl =0 E [ δ j,l Y i ( d l ) | X i = c j + e ]= P Kl =0 E [ Y i ( d l ) | X i = c j + e, δ j,l = 1 , δ j − ,l = 1] P [ δ j,l = 1 , δ j − ,l = 1 | X i = c j + e ]+ P Kl =0 E [ Y i ( d l ) | X i = c j + e, δ j,l = 1 , δ j − ,l = 0] P [ δ j,l = 1 , δ j − ,l = 0 | X i = c j + e ]= P Kl =0 E [ Y i ( d l ) | X i = c j + e, δ j,l = 1 , δ j − ,l = 1] P [ δ j,l = 1 , δ j − ,l = 1 | X i = c j + e ]+ E [ Y i ( d j ) | X i = c j + e, δ j,j = 1 , δ j − ,j = 0] P [ δ j,j = 1 , δ j − ,j = 0 | X i = c j + e ] . Take the limit as e ↓

0. Use that { δ j,l = 1 , δ j − ,l = 1 } and { δ j,j = 1 , δ j − ,j = 0 } are ﬁnite unionsof measurable sets of the form {U i = ¯ U} , ¯ U ∈ U ∗ . The conditional expectation and probability arecontinuous functions of x conditional on these sets (Assumption 8).lim e ↓ E [ Y i | X i = c j + e ]= P Kl =0 E [ Y i ( d l ) | X i = c j , δ j,l = 1 , δ j − ,l = 1] P [ δ j,l = 1 , δ j − ,l = 1 | X i = c j ]+ E [ Y i ( d j ) | X i = c j , δ j,j = 1 , δ j − ,j = 0] P [ δ j,j = 1 , δ j − ,j = 0 | X i = c j ]= P Kl =0 E [ Y i ( d l ) | X i = c j , δ j,l = 1 , δ j − ,l = 1] P [ δ j,l = 1 , δ j − ,l = 1 | X i = c j ]+ P Kl =0 ,l = j E [ Y i ( d j ) | X i = c j , δ j,j = 1 , δ j − ,l = 1] P [ δ j,j = 1 , δ j − ,l = 1 | X i = c j ]Similarly, use fact (ii) for the left-hand-side limit, lim e ↓ E [ Y i | X i = c j − e ]= P Kl =0 E [ Y i ( d l ) | X i = c j , δ j,l = 1 , δ j − ,l = 1] P [ δ j,l = 1 , δ j − ,l = 1 | X i = c j ]+ P Kl =0 ,l = j E [ Y i ( d j ) | X i = c j , δ j,l = 0 , δ j − ,l = 1] P [ δ j,l = 0 , δ j − ,l = 1 | X i = c j ]Use fact (iii) to get 40 P Kl =0 E [ Y i ( d l ) | X i = c j , δ j,l = 1 , δ j − ,l = 1] P [ δ j,l = 1 , δ j − ,l = 1 | X i = c j ]+ P Kl =0 ,l = j E [ Y i ( d l ) | X i = c j , δ j,j = 1 , δ j − ,l = 1] P [ δ j,j = 1 , δ j − ,l = 1 | X i = c j ] . The diﬀerence between right and left hand side limits is B j = P Kl =0 ,l = j E [ Y i ( d j ) − Y i ( d l ) | X i = c j , δ j,j = 1 , δ j − ,l = 1] P [ δ j,j = 1 , δ j − ,l = 1 | X i = c j ]= K P l =0 ,l = j β ec ( c j , d l , d j ) P [ δ j,j = 1 , δ j − ,l = 1 | X i = c j ] . Next, it is shown that P [ δ j,j = 1 , δ j − ,l = 1 | X i = c j ] = ω j,l , for l = j . P [ δ j,j = 1 , δ j − ,l = 1 | X i = c j ] = P [ δ j,l = 0 , δ j − ,l = 1 | X i = c j ]= P [ δ j,l = 0 | X i = c j ] − P [ δ j − ,l = 0 | X i = c j ]= lim e ↓ { P [ U i ( c j ) = d l | X i = c j + e ] − P [ U i ( c j − ) = d l | X i = c j − e ] } = lim e ↓ { P [ D i = d l | X i = c j − e ] − P [ D i = d l | X i = c j + e ] } where facts (i) and (ii) are used. This proves the ﬁrst part of the theorem.If β ec belongs to the class of functions of Assumption 9, then B j = f W j θ . If the matrix f W ′ f W = P j f W j f W ′ j is invertible, then the second part of the theorem follows.Conversely, suppose that the p > K elements in { β ec ( c j , d l , d j ) for ( j, l ) : ω j,l > } are identiﬁedfor every fuzzy assignment ˜ c = ( c , d , d ), ..., ˜ c p = ( c K , d K − , d K ). Identiﬁcation means thatthere is an unique solution to the following constrained linear system:  B ... B K  =  ω , . . . ω ,K . . . . . . . . . ω , . . . ω ,K . . . . . . . . . . . . . . . ω K, . . . ω K,K −   β β ... β p  such that ( β , . . . , β p ) ∈ G . The K × p matrix of coeﬃcients has rank equal to K because the assignment is fuzzy. Since p > K , the unconstrained system has inﬁnitely many nonzero solutions of the form b = b p + P p − Km =1 λ m b sm for any ( λ , . . . , λ p − K ) ∈ R p − K , where { b sm } p − Km =1 are the basis vectors of the null-space of the unconstrained system, and b p is a particular solution. By assumption, the constrainedsystem has one unique solution b ∗ ∈ G , so b ∗ + b sm

6∈ G ∀ m . This implies that b sm

6∈ G ∀ m because G is a vector subspace of R p . This is a set of p − K linearly independent vectors in R p not in G .Therefore, the dim G ≤ p − ( p − K ) = K , and the third part of the theorem follows. (cid:3) Regression Discontinuity Design with Many Thresholds”

Marinho Bertanha

B Supplemental Appendix

B.1 Local Polynomial Regressions

The ﬁrst lemma is a straightforward generalization of Porter (2003)’s Theorem 3(a). It derivesthe asymptotic distribution of the Local Polynomial Regression (LPR) estimator for the diﬀerencein side-limits of a conditional mean. The lemma considers the mean of the q × Y i ratherthan a scalar Y i in order to cover the CLT proof in the fuzzy case (Theorem B.1) as a specialcase with Y i = [ Y i W ( c j , D i ) ′ ] ′ . At a cutoﬀ c j , the diﬀerence in conditional mean is J j , for j = 1 , . . . , K . Given a choice of a bandwidth h j > c j , a kernel density function k ( u ), and a polynomial order ρ ∈ Z + , the l -coordinate of J j is denoted J j,l and estimated asfollows: b J j,l = b a j + ,l − ˆ a j − ,l (B.1)( b a j + ,l , b b j + ,l ) = argmin ( a, b ) n X i =1 ( k (cid:18) X i − c j h j (cid:19) v j + i (cid:20) Y j,l − a − b ( X i − c j ) − . . . − b ρ ( X i − c j ) ρ (cid:21) ) (B.2)(ˆ a j − ,l , ˆ b j − ,l ) = argmin ( a, b ) n X i =1 ( k (cid:18) X i − c j h j (cid:19) v j − i (cid:20) Y j,l − a − b ( X i − c j ) − . . . − b ρ ( X i − c j ) ρ (cid:21) ) (B.3)where v j + i = I { c j ≤ X i < c j + h j } (B.4) v j − i = I { c j − h j < X i < c j } (B.5) b =( b , . . . , b ρ ) . (B.6) Lemma B.1.

For each j = 1 , . . . , K , assume the following conditions hold:(i) The kernel density function k : R → R is symmetric around zero, has compact support [ − M, M ] for some M ∈ (0 , ∞ ) , and it is Lipschitz continuous; ii) The distribution of X i has probability density function f ( x ) that is continuous and has boundedsupport X = [ X , X ] ; the cutoﬀ c j belongs to ( X , X ) ;(iii) Deﬁne J j = lim e ↓ { E [ Y i | X i = c j + e ] − E [ Y i | X i = c j − e ] } (B.7) m ( x ) = E [ Y i | X i = x ] − K X j =1 I { c j ≤ x } J j (B.8) The function m ( x ) is at least ρ + 1 times continuously diﬀerentiable wrt x for all x in acompact interval centered at c j except for x = c j ; there exists left and right side derivatives at x = c j up to the same order; its ρ + 1 -th partial derivative wrt x is denoted as ∇ ( ρ +1) x m ( x ) and the side limits of the derivatives are denoted as lim x → c ± j ∇ ( ρ +1) x m ( x ) = ∇ ( ρ +1) x m ( c ± j ) ;(iv) Deﬁne ε i = Y i − E [ Y i | X i ] (B.9) ζ ( x ) = E (cid:2) ε i ε ′ i | X i = x (cid:3) , (B.10) and assume E [ k ε i k | X i ] is bounded. The matrix-valued function ζ ( x ) is continuous wrt x forall x in a compact interval centered at c j except for x = c j ; there exists left and right sidelimits at x = c j denoted lim x → c ± j ζ ( x ) = ζ ( c ± j ) , where ζ ( c ± j ) is positive-deﬁnite;(v) As n → ∞ and h j → , assume nh j → ∞ and p nh j h p +11 j → C ∈ [0 , ∞ ) Then, for each j ( V nj ) − / (cid:16)b J j − B nj − J j (cid:17) d → N ( , I ) (B.11) with ( V nj ) − / being the inverse of the square root of the symmetric and positive-deﬁnite matrix V nj , ( V nj ) / = O (cid:16) ( nh j ) / (cid:17) , ( V nj ) / B nj = O P (cid:16) ( nh j ) / h ρ +11 j (cid:17) , where is the q × vectorof zeros, and I is the q × q identity matrix. The bias B nj and variance V nj terms are characterizedas follows, B nj = h ρ +11 j f ( c j )( ρ + 1)! e ′ h G j + n γ ∗ ∇ ρ +1 x m ( c + j ) − G j − n γ ∗ ∇ ρ +1 x m ( c − j ) i (B.12) V nj = n E ( (cid:20) nh j k (cid:18) X i − c j h j (cid:19)(cid:21) h e ′ (cid:16) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:17) e H ji i ε i ′ i (cid:20) e H j ′ i (cid:16) v j + i E [ G j + ′ n ] − v j − i E [ G j − ′ n ] (cid:17) e (cid:21) ) (B.13) where ε i = Y i − E [ Y i | X i ] ; and γ ∗ =[ γ ρ +1 . . . γ ρ +1 ] ′ (B.14) γ ∗ = I q ⊗ γ ∗ , where I q is the q × q identity matrix, and (B.15) ⊗ denotes the Kronecker product; e = I q ⊗ e , where (B.16) e is the ( ρ + 1 × vector e = [1 0 0 · · · ′ γ d = Z k ( u ) u d du (B.17) H ( u ) = [1 u . . . u ρ ] ′ (B.18) H ji = H ( X i − c j ) (B.19) e H ji = H (cid:18) X i − c j h j (cid:19) (B.20) H ( u ) = I q ⊗ H ( u ) (B.21) H ji = H ( X i − c j ) (B.22) e H ji = H (cid:18) X i − c j h j (cid:19) (B.23) G j ± n = " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j ± i e H ji e H j ′ i − (B.24) G j ± = f ( c j ) − I q ⊗ Γ − ± (B.25)Γ + =Γ , Γ − = { ( − j + l Γ j,l } j,l (B.26)Γ =  γ . . . γ ρ ... ... ... γ ρ . . . γ ρ  (B.27) where vec ( A m × n ) = [ a , , . . . , a m, , a , , . . . , a m,n ] ′ which makes ϕ j ± a q ( ρ + 1) × vector.Moreover, (cid:0) G j ± n − G j ± (cid:1) = O P p nh j ! (B.28) E (cid:2) G j ± n (cid:3) = G j ± + O ( h j ) (B.29) Proof.

Following Porter (2003), the jump estimator is equal to b J j = b a + j − b a − j , where b a ± j =[ b a ± j, , . . . , b a ± j,q ] ′ . 3 a ± j = e ′ " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j ± i H ji H j ′ i − " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j ± i H ji Y i (B.30)= e ′ G j ± n " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j ± i e H ji Y i (B.31) b J j = e ′ G j + n " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji Y i − e ′ G j − n " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji Y i (B.32)and note that H ji changes to e H ji because e only takes the ﬁrst elements of each of the q stacked ρ + 1 vectors. Deﬁne J ∗ j , e J j as follows, J ∗ j = e ′ E [ G j + n ] 1 nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji Y i − e ′ E [ G j − n ] 1 nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji Y i o (B.33)= n X i =1 nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:17) e H ji | {z } ϕ n ( X i ) Y i (B.34)= n X i =1 ϕ n ( X i ) Y i (B.35) e J j = e ′ G j + n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji Y j + i − e ′ G j − n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji Y j − i (B.36)where Y j ± i are deﬁned by Y j + i = Y i − H j ′ i ϕ j + = m ( X i ) + K X l =1 I { c l ≤ X i } J l + ε i − H j ′ i ϕ j + (B.37) Y j − i = Y i − H j ′ i ϕ j − = m ( X i ) + K X l =1 I { c l ≤ X i } J l + ε i − H j ′ i ϕ j − (B.38) ϕ j + = vec " m ( c j ) + j X l =1 J l ∇ x m ( c + j ) . . . ∇ ρ x m ( c + j ) /ρ ! ′ (B.39)4 j − = vec " m ( c j ) + j − X l =1 J l ∇ x m ( c − j ) . . . ∇ ρ x m ( c − j ) /ρ ! ′ . (B.40)Write( V nj ) − / (cid:16)b J j − B nj − J j (cid:17) =( V nj ) − / (cid:0) J ∗ j − E [ J ∗ j |X n ] (cid:1) (B.41)+( V nj ) − / (cid:16)e J j − B nj (cid:17) (B.42)+( V nj ) − / (cid:16)b J j − E [ b J j |X n ] − (cid:0) J ∗ j − E [ J ∗ j |X n ] (cid:1)(cid:17) (B.43)+( V nj ) − / (cid:16) E [ b J j − J j |X n ] − e J j (cid:17) . (B.44)The proof applies a central limit theorem (CLT) to show that part (B.41) converges in distribu-tion to a standard normal; it demonstrates that B nj approximates the ﬁrst-order bias, that is, part(B.42) converges in probability to zero; and that parts (B.43) and (B.44) converge in probabilityto zero. Part (B.41)First, ﬁnd the rate that ( V nj ) − / grows. Use the change of variables u = ( x − c j ) /h j toevaluate the expectation: V nj = 1 nh j Z − k ( u ) (cid:2) e ′ (cid:0) I { u ≥ } E [ G j + n ] − I { u < } E [ G j − n ] (cid:1) H ( u ) (cid:3) ζ ( c j + uh j ) h H ( u ) (cid:16) I { u ≥ } E [ G j + ′ n ] − I { u < } E [ G j − ′ n ] (cid:17) e i f ( c j + uh j ) du (B.45) k V nj k > Mnh j . (B.46)because E [ G j ± n ] is approximately equal to a positive-deﬁnite matrix G j ± so that the integral eval-uates to a positive-deﬁnite matrix.Second, ( V nj ) − / (cid:0) J ∗ j − E [ J ∗ j |X n ] (cid:1) = ( V nj ) − / n X i =1 ϕ n ( X i ) ( Y i − E [ Y i | X i ]) (B.47)= ( V nj ) − / n X i =1 ϕ n ( X i ) ε i (B.48)Equation B.48 is a sum of iid random vectors with zero mean, where V nj is the variance of the5umerator. The Lindeberg condition is veriﬁed next. Take an arbitrary δ > n X i =1 E h k V n k − k ϕ n ( X i ) ε i k I n k V n k − / k ϕ n ( X i ) ε i k > δ oi (B.49)= n E h k V n k − k ϕ n ( X i ) ε i k I n k ϕ n ( X i ) ε i k > δ k V n k / oi (B.50) ≤ n E h k V n k − k ϕ n ( X i ) ε i k δ − k V n k − / i (B.51) ≤ M n ( nh j ) / E "(cid:13)(cid:13)(cid:13)(cid:13) nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:17) e H ji ε i (cid:13)(cid:13)(cid:13)(cid:13) δ − (B.52) ≤ M n ( nh j ) / n h j E " h j (cid:12)(cid:12)(cid:12)(cid:12) k (cid:18) X i − c j h j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) (cid:13)(cid:13)(cid:13) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:13)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13)(cid:13) H (cid:18) X i − c j h j (cid:19)(cid:13)(cid:13)(cid:13)(cid:13) E [ k ε i k | X i ] δ − (B.53)= M ( nh j ) − / Z − " | k ( u ) | (cid:13)(cid:13) I { u ≥ } E [ G j + n ] − I { u < } E [ G j − n ] (cid:13)(cid:13) k H ( u ) k f ( c j + uh j ) du (B.54) ≤ M ( nh j ) − / = o (1) (B.55)where the inequality x I {| x | > δ } ≤ x δ − , boundedness of E [ k ε i k | X i ], and the rate of V − nj are used. The multivariate Lindeberg-Feller CLT says that Equation B.48, and thus part (B.41),converges in distribution to a standard normal. Part (B.42)First consider, E (cid:20) h j k (cid:18) X i − c j h j (cid:19) v j + i e H ji E h Y j + i | X i i(cid:21) (B.56)= E " h j k (cid:18) X i − c j h j (cid:19) v j + i e H ji ∇ ρ +1 x m ( c + j )( ρ + 1)! (cid:18) X i − c j h j (cid:19) ρ +1 h ρ +11 j (B.57)+ E " h j k (cid:18) X i − c j h j (cid:19) v j + i e H ji ∇ ρ +2 x m ( c ∗ j )( ρ + 2)! (cid:18) X i − c j h j (cid:19) ρ +2 h ρ +21 j (B.58)= h ρ +11 j f ( c j ) γ ∗ ∇ ρ +1 x m ( c + j )( ρ + 1)! + O (cid:16) h ρ +21 j (cid:17) , (B.59)and B nj = B + nj − B − nj (B.60) B + nj = h ρ +11 j f ( c j )( ρ + 1)! e ′ G j + n γ ∗ ∇ ρ +1 x m ( c + j ) (B.61)6 − nj = h ρ +11 j f ( c j )( ρ + 1)! e ′ G j − n γ ∗ ∇ ρ +1 x m ( c − j ) (B.62)where E h Y j + i | X i i is the diﬀerence between E [ Y i | X i ] and its ρ -th order Taylor expansion around X i = c j (see Equations B.37 and B.38). The expectations in Equations B.57 and B.58, without the h ρ +11 j and h ρ +21 j terms, are bounded over j because the kernel, derivatives, and polynomials arebounded functions of u = ( x − c j ) h − j .Next, ( V nj ) − / (cid:16)e J j − B nj (cid:17) =( V nj ) − / e ′ G j + n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji Y j + i − ( V nj ) − / B + nj (B.63) − ( V nj ) − / e ′ G j − n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji Y j − i +( V nj ) − / B − nj (B.64)Consider part (B.63). Part (B.64) follows a symmetric argument. Use (B.59) and write( B.

63) =( V nj ) − / " e ′ G j + n γ ∗ h ρ +11 j ∇ ρ +1 m ( c + j )( ρ + 1)! f ( c j ) − B + nj (B.65)+( V nj ) − / e ′ G j + n O (cid:16) h ρ +21 j (cid:17) (B.66)=0 + O P (cid:16)p nh j h ρ +21 j (cid:17) = o P (1) (B.67)where the second equality uses the deﬁnition of B + nj , the fact that G j + n = O P (1), and the ratecondition ( nh j ) / h ρ +11 j = O (1). Part (B.43)(B.43) =( V nj ) − / " e ′ G j + n nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji ε i − e ′ G j − n nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji ε i (B.68) − ( V nj ) − / " e ′ E (cid:2) G j + n (cid:3) nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji ε i − e ′ E (cid:2) G j − n (cid:3) nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji ε i (B.69)7( V nj ) − / e ′ (cid:2) G j + n − E (cid:2) G j + n (cid:3)(cid:3) nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji ε i (B.70) − ( V nj ) − / e ′ (cid:2) G j − n − E (cid:2) G j − n (cid:3)(cid:3) nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji ε i (B.71)= O (cid:16) ( nh j ) / (cid:17) O P (cid:16) ( nh j ) − / (cid:17) o P (1) (B.72)+ O (cid:16) ( nh j ) / (cid:17) O P (cid:16) ( nh j ) − / (cid:17) o P (1) (B.73)= o P (1) (B.74)because of (cid:2) G j ± n − E (cid:2) G j ± n (cid:3)(cid:3) = O P (cid:0) ( nh j ) − / (cid:1) , and the fact that the zero mean terms ( nh j ) − P ni =1 k (cid:16) X i − c j h j (cid:17) v j ± i e H ji ε i converge in probability to zero since their variances are O (cid:0) ( nh j ) − (cid:1) . Part (B.44)Use the deﬁnitions of ϕ j ± and Y j ± i to write: b J j − J j = b a + j − b a − j − J j (B.75)= e ′ " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i H ji H j ′ i − " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i H ji Y i − e ′ " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i H ji H j ′ i − " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i H ji Y i − e ′ (cid:0) ϕ j + − ϕ j − (cid:1) (B.76)= e ′ " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i H ji H j ′ i − " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i H ji Y j + i − e ′ " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i H ji H j ′ i − " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i H ji Y j − i (B.77)= e ′ G j + n " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji Y j + i − e ′ G j − n " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji Y j − i (B.78)Thus, part (B.44) becomes(B.44) =( V nj ) − / e ′ G j + n " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji E [ Y j + i | X i ] − ( V nj ) − / e ′ G j + n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji E [ Y j + i | X i ] (B.79)8 ( V nj ) − / e ′ G j − n " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji E [ Y j − i | X i ] +( V nj ) − / e ′ G j − n E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i e H ji E [ Y j − i | X i ] (B.80)The next steps show that part (B.79) converges in probability to zero. A symmetric proof showsthat part (B.80) also converges in probability to zero.(B.79) =( V nj ) − / e ′ G j + n ( nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji E [ Y j + i | X i ] − E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji E [ Y j + i | X i ] (B.81)=( V nj ) − / e ′ G j + n h ρ +11 j ( nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji E [ Y j + i | X i ] h − ( ρ +1)1 j − E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i e H ji E [ Y j + i | X i ] h − ( ρ +1)1 j ) (B.82)= O (cid:16) ( nh j ) / (cid:17) O P (1) h ρ +11 j O P (cid:16) ( nh j ) − / (cid:17) = o P (1) (B.83)where the zero mean term in curly brackets is normalized by h ρ +11 j (see Equation B.59), and itsvariance after the normalization decreases at ( nh j ) − . B.2 Uniformity with Large Number of Cutoﬀs

A class of sets S of a space Ω is said to shatter a n -point subset of Ω, D n , if for every subset of D n , D ( i ) n , there exists a set in S , S , such that S ∩ D ( i ) n = D ( i ) n . A class of sets S is said to be a VCclass if there exists a ﬁnite non-negative integer v such that no v -point set D v is shattered by S . Inthis case, the index of the VC class is v . For a class of functions from Ω to R , F , call the class ofgraphs of F , g F = { ( x, t ) ∈ Ω × R : t ≤ f ( x ) ≤ ≤ t ≤ f ( x ) for f ∈ F } . A class of functions F is called a VC-subgraph class if gF is a VC class. The class F is enveloped by function F if ∀ f ∈ F , | f ( x ) | ≤ F ( x ). Let (Ω , A , Q ) be a probability space. A covering number N ( ε, Q, F ) isdeﬁned to be the smallest non-negative integer m for which there exists functions f , . . . , f m in F such that min j E Q | f − f j | ≤ ε for every f ∈ F .It is possible to build a complex VC-subgraph class by combining basic VC-subgraph classes.Any class of functions made of a ﬁnite union or intersection of VC-subgraph classes is also VCsubgraph (Pollard (1984)’s Lemma 2.15). Let φ : R → R be a monotone function. Deﬁne theclass of functions which consists of translations of this monotone function φ . That is, F = { f : One may deﬁne VC subgraph using alternative deﬁnitions of class of graphs, but those lead to deﬁnitions of VCsubgraph that are equivalent to ours. See Van Der Vaart and Wellner (1996)’s Problem 2.6.11. → R with f ( x ) = φ ( x − c ) ∀ c ∈ R } . Then, F is a VC-subgraph class with index equal to 2(Van Der Vaart and Wellner (1996)’s Lemma 2.6.16). Moreover, if G is VC subgraph, then φ ◦ G = { φ ( g ) : g ∈ G} is VC subgraph (Van Der Vaart and Wellner (1996)’s Lemma 2.6.18). A VC-subgraph class F of uniformly bounded functions has covering number N ( ε, Q, F ) ≤ Aε − W , wherethe constants A, W depend only on the VC index of the class of functions and on the uniformbound (Pollard (1984)’s Lemma 2.25). The next lemma lists more properties.

Lemma B.2.

Let F and G be VC-subgraph classes of functions uniformly bounded by a constant < M < ∞ . Deﬁne H + = { f + g : f ∈ F , g ∈ G} and H × = { f g : f ∈ F , g ∈ G} . For a ﬁxedLipschitz continuous function φ with Lipschitz constant C , deﬁne H φ = { φ ( f ) : f ∈ F } . Then,1. N ( ε, Q, H + ) ≤ N ( ε/ , Q, F ) N ( ε/ , Q, G ) N ( ε, Q, H × ) ≤ N ( ε/ M, Q, F ) N ( ε/ M, Q, G ) N ( ε, Q, H φ ) ≤ N ( ε/C, Q, F ) Proof.

Slightly modiﬁed from Theorem 3 in Andrews (1994).Fix ε >

0, pick any h ∈ H + . It is known that h = f + g .Use f i + g j to approximate f + g , where E Q | f − f i | ≤ ε/ E Q | g − g j | ≤ ε/

2, 1 ≤ i ≤ N ( ε/ , Q, F ), 1 ≤ j ≤ N ( ε/ , Q, G ). It is known that these two covering numbers are ﬁnite since F and G are VC-subgraph. Call h l = f i + g j , with 1 ≤ l ≤ N ( ε/ , Q, F ) N ( ε/ , Q, G ). E Q | h − h l | = E Q | f + g − ( f i + g j ) | ≤ E Q | f − f i | + E Q | g − g j | ≤ ε Therefore, N ( ε, Q, H + ) ≤ N ( ε/ , Q, F ) N ( ε/ , Q, G ).Now, pick any h ∈ H × . It is known that h = f g .Use f i g j to approximate f g , where E Q | f − f i | ≤ ε/ M and E Q | g − g j | ≤ ε/ M , 1 ≤ i ≤ N ( ε/ M, Q, F ) < ∞ , 1 ≤ j ≤ N ( ε/ M, Q, G ) < ∞ . Call h l = f i g j , with 1 ≤ l ≤ N ( ε/ M, Q, F ) N ( ε/ M, Q, G ). E Q | h − h l | = E Q | f g − f i g j | = E Q | f g − f i g j − f i g + f i g |≤ E Q | f − f i || g | + E Q | g j − g || f i | ≤ M ( E Q | f − f i | + E Q | g j − g | ) ≤ ε Therefore, N ( ε, Q, H × ) ≤ N ( ε/ M, Q, F ) N ( ε/ M, Q, G ).Lastly, pick h ∈ H φ , so that h = φ ( f ) for some f ∈ F . Use f i to approximate f , where E Q | f − f i | ≤ ε/C , 1 ≤ i ≤ N ( ε/C, Q, F ) < ∞ . Call h i = φ ( f i ) for each i . E Q | h − h i | = E Q | φ ( f ) − φ ( f i ) | ≤ CE Q | f − f i | ≤ ε .Therefore, N ( ε, Q, H φ ) ≤ N ( ε/C, Q, F ). 10onsider a set of K + 2 positive bandwidth sequences h , h j , j = 1 , . . . , K , h that depend on n . Assume h ≤ h j ≤ h for every j , and that both h and h converge to zero at the same rate.Deﬁne v + c,h ( x ) = I { c ≤ x < c + h } , and v − c,h ( x ) = I { c − h < x < c } for any c ∈ R and h >

0, so that v j ± i (used in the main text) becomes v ± c j ,h j ( X i ). Lemma B.3.

Consider the classes of functions deﬁned below F ± j , j = 1 , . . . , . They depend on n because the bandwidth sequences h , h j , j = 1 , . . . , K , and h enter their deﬁnitions.1. F ± = n f c,h : X → R st f c,h ( x ) = v ± c,h ( x ) k (cid:0) x − ch (cid:1) , c ∈ X , h ∈ [ h , h ] o for a kernel densityfunction k ( · ) that satisﬁes Assumption 3;2. F ± = n f c,h : X × [ − M ; M ] → R st f c,h ( x, y ) = v ± c,h ( x ) k (cid:0) x − ch (cid:1) y , c ∈ X , h ∈ [ h , h ] o for any M ∈ (0 , ∞ ) ;3. F ± = n f c,h : X → R st f c,h ( x ) = v ± c,h ( x ) k (cid:0) x − ch (cid:1) r ± ( x )] , c ∈ X , h ∈ [ h , h ] o where r ± ( x ) = P Kj =1 v ± c j ,h j ( x ) E [ Y j ± i | X i = x ] and Y j ± i is deﬁned in Lemma B.1 (scalar case);4. F ± = n f c,h : X → R st f c,h ( x ) = v ± c,h ( x ) k (cid:0) x − ch (cid:1) (cid:0) x − ch (cid:1) l , c ∈ X , h ∈ [ h , h ] o for any positiveinteger l ∈ Z + .If Assumptions 3, 5, and 7 hold, these functions are bounded for large n . The covering numberof each of these classes satisﬁes N ( ε, Q, F j ) ≤ A j ε − W j , j = 1 , , , , where the positive constants A j and W j are independent of n and Q .Proof. First, note that all these functions are bounded. The functions in the ﬁrst two classesare bounded because the kernel and the indicator functions are bounded. For the third class offunctions, (cid:12)(cid:12) r + ( x ) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K X j =1 v + c j ,h j ( x ) E [ Y j + i | X i = x ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max j (cid:12)(cid:12)(cid:12) E [ Y j + i | X i = x ] (cid:12)(cid:12)(cid:12) = max j (cid:12)(cid:12)(cid:2) ∇ ρ +1 x R ( c ∗ j ( x ) , d j ) / ( ρ + 1)! (cid:3) ( x − c j ) ρ +1 (cid:12)(cid:12) where c ∗ j ( x ) ∈ ( c j , x ). The function r + ( x ) is bounded because ∇ ρ +1 x R ( · ) is bounded (Assumption7). An analogous argument bounds r − ( x ). For the fourth class of functions,0 ≤ v + c,h (cid:18) x − ch (cid:19) l < − < v − c,h (cid:18) x − ch (cid:19) l < G ± = n f c,h : X → R st f c,h ( x ) = v ± c,h ( x ) (cid:0) x − ch (cid:1) l , c ∈ X , h ∈ [ h , h ] o ;11. G ± = n f c,h : X → R st f c,h ( x ) = v ± c,h ( x ) , c ∈ X , h ∈ [ h , h ] o ;3. G = { f : [ − M, M ] → R st f ( y ) = y } , that is, only one function f ;4. G ± = { f : X → R st f ( x ) = r ± ( x ) } , that is, only one function r ± ;5. G ± = n f c,h : X → R st f c,h ( x ) = v ± c,h ( x ) k (cid:0) x − ch (cid:1) , c ∈ X , h ∈ [ h , h ] o .Lemma B.2 says that it suﬃces to show that each of these classes has a polynomial bound onthe covering number with constants that are independent of n and Q . G ± = n f c,h : X → R st f c,h ( x ) = v ± c,h ( x ) (cid:0) x − ch (cid:1) l , c ∈ X , h ∈ [ h , h ] o Take G +1 WLOG. A function v + c,h ( x ) (cid:0) x − ch (cid:1) is a line connecting the point ( c,

0) to ( c + h,

1) withsupport [ c, c + h ). The class of functions G ∗ = n f c,h : X → R st f c,h ( x ) = v + c,h ( x ) (cid:0) x − ch (cid:1) , c ∈ X , h ∈ [ h , h ] o is VC subgraph because no 4-point set is shattered. It has covering number N ( ε, Q, G ∗ )bounded by a polynomial in ε whose constants do not depend on Q or n . The function φ ( x ) = x l deﬁned over [0 ,

1] is Lipschitz continuous with constant equal to l . Since G +1 = n φ ( g ) : g ∈ G ∗ o ,Lemma B.2 says G +1 has covering number bounded above by A ε − W with A , W independent of n or Q . G ± = n f c,h : X → R st f c,h ( x ) = v ± c,h ( x ) , c ∈ X , h ∈ [ h , h ] o For either v + c,h or v − c,h , no 3-point set is shattered by the graphs of either G +2 or G − . Hence, G ± isVC subgraph with covering number bounded above by A ± ε − W ± where A ± , W ± are independentof n or Q . G = { f : [ − M, M ] → R st f ( y ) = y } It is straightforward to see that the graphs of this class of functions is VC with index 2. Therefore,the covering number of G is bounded above by A ε − W with A , W independent of n or Q . G ± = { f : X → R st f ( x ) = r ± ( x ) } Consider G +4 WLOG. For each n , r + ( x ) = P Kj =1 v + c j ,h j ( x ) E [ Y j + i | X i = x ] is a ﬁxed function. Similarto G , the covering number of G +4 is bounded above by A ε − W with A , W independent of n or Q . G ± = n f c,h : X → R st f c,h ( x ) = v ± c,h ( x ) k (cid:0) x − ch (cid:1) , c ∈ X , h ∈ [ h , h ] o Take G +5 WLOG. Deﬁne G ∗ = n f = k ( g ) , g ∈ G ∗ o , where G ∗ is the VC subgraph class of functionsdeﬁned above. Given that k ( · ) is Lipschitz continuous (Assumption 3), Lemma B.2 says that G ∗ has covering number N ( ε, Q, G ∗ ) bounded by a polynomial in ε whose constants do not depend on Q or n . Note that G = n gh : g ∈ G +2 , h ∈ G ∗ o because v + c,h ( x ) k (cid:16) v + c,h ( x ) (cid:0) x − ch (cid:1)(cid:17) = v + c,h ( x ) k (cid:0) x − ch (cid:1) .Therefore, G has covering number bounded above by A ε − W with A , W independent of n or Q (Lemma B.2). 12emma B.4 below is a slightly modiﬁed version of Pollard (1984)’s Theorem 2.37. Lemma B.4.

For each n , let F n be a class of uniformly bounded functions whose covering numberssatisfy sup Q N ( ε, Q, F n ) ≤ Aε − W for < ε < with constants A and W not depending on n . Let δ n be a positive decreasing sequence such that log nnδ n → . If [ E ( f )] / ≤ δ n for ∀ f ∈ F n , then sup f ∈F n | E n ( f ) − E ( f ) | = O P δ n s log nnδ n ! where E n ( f ) is the expected value of f wrt the empirical distribution of the variables in the domainof f .Proof. The proof is almost the same as Pollard (1984)’s Theorem 2.37. There are two main dif-ferences. The ﬁrst, he has an arbitrary sequence α n that weakly decreases to zero such that nδ n α n log n → ∞ , and I take this sequence to be α n = log nnδ n →

0. Note that this sequence of α n doesnot satisfy nδ n α n log n → ∞ , but this is not needed here. The second diﬀerence, he shows almost sureconvergence, and I only show the expression to be bounded in probability.That said, it is to be shown that for ∀ γ >

0, there exists M γ > n γ such that P ( sup f ∈F n | E n ( f ) − E ( f ) | > M γ δ n α n ) < γ for n ≥ n γ Taking ε n = εδ n α n , V ( E n ( f ))(4 ε n ) ≤ E ( f )16 nε δ n α n ≤ M ε nδ n α n = 116 ε log n For large n , this is smaller than 1 /

2, so that Equation (30) on page 31 of Pollard (1984) is usedto get: P ( sup f ∈F n | E n ( f ) − E ( f ) | > εδ n α n ) ≤ P ( sup f ∈F n | E ◦ n ( f ) | > ε n ) (B.84)where E ◦ n ( f ) is the signed measure deﬁned there. Using the same approximation argument that13ed to Equation (31) on page 31 for functions g j ∈ F n : P ( sup f ∈F n | E ◦ n ( f ) | > ε n ) ≤ N ( ε n , P n , F n ) exp  ( − / nε n max j E n ( g j )  where P n is the probability measure that weights each observation by 1 /n . This inequality is usedto rewrite the right-hand side of Equation B.84):4 P ( sup f ∈F n | E ◦ n ( f ) | > ε n ) =4 P ( sup f ∈F n | E ◦ n ( f ) | > ε n , sup f ∈F n | E n ( f ) | ≤ δ n ) + 4 P ( sup f ∈F n | E ◦ n ( f ) | > ε n , sup f ∈F n | E n ( f ) | > δ n ) ≤ N ( ε n , P n , F n ) exp (cid:20) ( − / nε n δ n (cid:21) (B.85)+ 4 P ( sup f ∈F n | E n ( f ) | > δ n ) (B.86)For part (B.85)), use the fact that N ( ε n , P n , F n ) ≤ Aε − W , and rearrange it into B. ≤ Aε − W exp[ W log(1 /δ n α n ) − nε δ n α n / B. ≤ E (cid:2) min (cid:8) N ( δ n , P n , F n ) exp( − nδ n ) ; 1 (cid:9)(cid:3) ≤ E (cid:2) min (cid:8) N ( δ n / , P n , F n ) exp( − nδ n ) ; 1 (cid:9)(cid:3) ≤

16 min ( A (cid:18) δ n (cid:19) − W exp( − nδ n ) ; 1 ) = 16 min (cid:8) A W exp (cid:2) − ( W log δ n + nδ n ) (cid:3) ; 1 (cid:9) Hence, P ( sup f ∈F n | E n ( f ) − E ( f ) | > εδ n α n ) < Aε − W exp[ W log(1 /δ n α n ) − nε δ n α n / (cid:8) A W exp (cid:2) − ( W log δ n + nδ n ) (cid:3) ; 1 (cid:9) (B.88)For the case here, it suﬃces to show that there is a ε such that the sum of the two bounds above14B.87) and (B.88) converge to zero as n → ∞ . For part (B.87), note that for large n , n log n ≥ nα n ,since α n is decreasing. Then,log (cid:18) δ n α n (cid:19) = log (cid:18) nα n log n (cid:19) ≤ log (cid:18) n log n log n (cid:19) = log n Using this, ( B. ≤ Aε − W exp h(cid:16) W − ε (cid:17) log n i . If ε is made small enough, this expression goesto zero.For part (B.88), log nnδ n → δ n n − = nδ n → ∞ . For big n , these imply: (i) log( δ n ) ≥ log n − = − log n and (ii) nδ n ≥ ( W + 1) log n . Hence,( B. ≤

16 min (cid:8) A W exp( − log n ) ; 1 (cid:9) → k · k with real-valued vectors. For matrices, the normis induced by the Euclidean norm. That is, for a p × q matrix A , k A k = sup x ∈ R q , k x k =1 k Ax k . Such amatrix norm has the following properties: (i) for a matrix A and a vector x , k Ax k ≤ k A kk x k ; (ii) formatrices A and B such that AB is deﬁned, k AB k ≤ k A kk B k ; (iii) for A invertible, k A k − ≤ k A − k .The determinant of matrix A is denoted det( A ). Another useful result is that (iv) convergence inthe matrix norm is equivalent to convergence of all elements of the matrix. Lemma B.5.

Consider a random process X n ( c ) in R q × q , and a ﬁxed (non-random) function X ( c ) also in R q × q . Suppose sup c k X ( c ) k ≤ L < ∞ and inf c | det( X ( c )) | ≥ L > .If for some sequence α n ↓ c k X n ( c ) − X ( c ) k = O P ( α n ) then sup c (cid:13)(cid:13) X n ( c ) − − X ( c ) − (cid:13)(cid:13) = O P ( α n ) Proof.

Consider the compact subset of R q × q : A = (cid:8) X ∈ R q × q : k X k ≤ L , | det( X ) | ≥ L (cid:9) Note that X ( c ) ∈ A ∀ ( c ), and that any continuous function on A is uniformly continuous because A is a compact set. The function f : A → R q × q , f ( X ) = X − is uniformly continuous.For any γ >

0, ﬁnd M γ > P (cid:26) sup c α − n (cid:13)(cid:13) X n ( c ) − − X ( c ) − (cid:13)(cid:13) > M γ (cid:27) < γ (cid:26) α − n sup c (cid:13)(cid:13) X n ( c ) − − X ( c ) − (cid:13)(cid:13) > M γ (cid:27) ≤ P (cid:26) sup c (cid:13)(cid:13) X n ( c ) − − X ( c ) − (cid:13)(cid:13) > α n M γ , X n ( c ) ∈ A ∀ c (cid:27) (B.89)+ P { X n ( c ) / ∈ A for some c } (B.90) Part (B.89)Since f ( X ) = X − is uniformly continuous in A , for any choice of M γ >

0, and for a givensample size, there exists a δ ( α n M γ ) > ∀ X n ( t ) , X ( t ) ∈ A, k X n ( t ) − − X ( t ) − k > α n M γ ⇒ k X n ( t ) − X ( t ) k > δ ( α n M γ ) ∀ n ( B. ≤ P (cid:26) sup c,p k X n ( c ) − X ( c ) k > δ ( α n M γ ) , X n ( c ) ∈ A ∀ ( c ) (cid:27) ≤ P (cid:26) sup c,p k X n ( c ) − X ( c ) k > δ ( α n M γ ) (cid:27) (B.91)By assumption, it is possible to ﬁnd M ∗ such that P (cid:26) sup c,p k X n ( c ) − X ( c ) k > α n M ∗ (cid:27) < γ/ n . So pick M γ to be such that δ ( α n M γ ) ≥ α n M ∗ which makes ( B. ≤ γ/ Part (B.90) ( B. ≤ P {k X n ( c ) k > L for some c } + P {| det( X n ( c )) | < L for some c }≤ P {k X n ( c ) − X ( c ) k > L for some c } + P {k X ( c ) k > L for some c } + P {| det( X n ( c )) − det( X ( c )) | < L / c } + P {| det( X ( c )) | < L / c } = P {k X n ( c ) − X ( c ) k > L for some c } + P {| det( X n ( c )) − det( X ( c )) | < L / c } which is made smaller than γ/ n since X n ( c ) converges in probability to X ( c ), and sodoes det( X n ( c )) to det( X ( c )). Therefore, ( B.

89) + ( B. ≤ γ .16n application of Lemma B.4 to the classes of functions in Lemma B.3 gives the rates at whichcertain terms in the proof of Theorem 2 are uniformly bounded in probability. Lemma B.6.

Consider the deﬁnitions of G j ± , G j ± n , e H ji , H , and Y j ± i from Lemma B.1 (scalarcase). Suppose Assumptions 3, 4, 5, and 7 hold. Assume the rate conditions of Theorem 2. Then, max j (cid:13)(cid:13) G j ± − E (cid:2) G j ± n (cid:3)(cid:12)(cid:12) = O ( h ) (B.92)max j (cid:13)(cid:13) G j ± n − E (cid:2) G j ± n (cid:3)(cid:13)(cid:13) = O P s log nnh ! (B.93)max j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) nh j n X i =1 v j ± i k (cid:18) X i − c j h j (cid:19) e H ji ε i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = O P s log nnh ! (B.94)max j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) nh j n X i =1 (cid:26) v j ± i k (cid:18) X i − c j h j (cid:19) e H ji E [ Y j ± i | X i ] − E (cid:20) v j ± i k (cid:18) X i − c j h j (cid:19) e H ji Y j ± i (cid:21) (cid:27) (cid:13)(cid:13)(cid:13)(cid:13) = O P s log nnh ! (B.95) Proof.

Consider the positive parts with v j + i = v + c j ,h j ( X i ), G j + n , and G j + WLOG.

Part (B.92)First, we show that max j (cid:13)(cid:13)(cid:13) E h G j + n i − G j + (cid:13)(cid:13)(cid:13) = O ( h ) using Lemma B.5.We have that G j + = f ( c j ) − Γ − is a bounded function of j and has a determinant uniformlybounded away from zero. Using Lemma B.5, it suﬃces to showmax j (cid:13)(cid:13)(cid:13)(cid:13)h E (cid:16) G j + n (cid:17)i − − (cid:0) G j + (cid:1) − (cid:13)(cid:13)(cid:13)(cid:13) = O ( h ).Also, convergence in the matrix norm is equivalent to convergence in each element of the matrix.Hence, it suﬃces to show thatmax j (cid:12)(cid:12)(cid:12)(cid:12) h j E (cid:20) v j + i k (cid:16) X i − c j h j (cid:17) (cid:16) X i − c j h j (cid:17) l (cid:21) − f ( c j ) γ l (cid:12)(cid:12)(cid:12)(cid:12) = O ( h ).The LHS above is bounded bysup c ∈X ,h ∈ [ h ,h ] (cid:12)(cid:12)(cid:12)(cid:12) h E (cid:20) v c,h ( X i ) k (cid:16) X i − ch (cid:17) (cid:16) X i − ch (cid:17) l (cid:21) − f ( c ) γ l (cid:12)(cid:12)(cid:12)(cid:12) ,which we show to be O ( h ).Take an arbitrary sequence h ∈ [ h , h ], (cid:12)(cid:12)(cid:12)(cid:12) h E (cid:20) v c,h ( X i ) k (cid:16) X i − ch (cid:17) (cid:16) X i − ch (cid:17) l (cid:21) − f ( c ) γ l (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)R k ( u ) u l f ( c + uh ) du − f ( c ) γ l (cid:12)(cid:12)(cid:12) = h (cid:12)(cid:12)(cid:12)R k ( u ) u l +1 ∇ x f ( c ∗ uh ) du (cid:12)(cid:12)(cid:12) ≤ M h = O ( h )17here Assumption 4 bounds the derivative of f . Therefore, the supremum above is O ( h ), andthe result follows. Part (B.93)The goal is to show that:max j (cid:13)(cid:13)(cid:13) G j ± n − E h G j ± n i(cid:13)(cid:13)(cid:13) = O P (cid:16)q log nnh (cid:17) .Note that part (B.92) implies that E h G j + n i is a bounded function of j and has a determi-nant uniformly bounded away from zero for large n . Using Lemma B.5, it suﬃces to show thatmax j (cid:13)(cid:13)(cid:13)(cid:13) G j + n − − E h G j + n i − (cid:13)(cid:13)(cid:13)(cid:13) = O P (cid:16)q log nnh (cid:17) . In fact, it suﬃces to show uniform convergence of eachpoint of the matrix:max j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) nh j n X i =1 ( v j + i k (cid:18) X i − c j h j (cid:19) (cid:18) X i − c j h j (cid:19) l − E " v j + i k (cid:18) X i − ch j (cid:19) (cid:18) X i − ch j (cid:19) l = O P s log nnh ! for an arbitrary l . The LHS is bounded by h sup c ∈X , h ∈ [ h ,h ] (cid:12)(cid:12)(cid:12)(cid:12) n P ni =1 (cid:26) v + c,h ( X i ) k (cid:16) X i − ch (cid:17) (cid:16) X i − ch (cid:17) l − E (cid:20) v + c,h ( X i ) k (cid:16) X i − ch (cid:17) (cid:16) X i − ch (cid:17) l (cid:21) (cid:27) (cid:12)(cid:12)(cid:12)(cid:12) (B.96)and we apply Lemma B.4 to this part. Lemma B.3 says that the class of functions (over which thesup is being taken) satisﬁes the conditions of Lemma B.4. For the second moment bound δ n , takean arbitrary sequence h ∈ [ h , h ] and note that E " v + c,h ( X i ) k (cid:18) X i − ch (cid:19) (cid:18) X i − ch (cid:19) l  = h Z k ( u ) u l f ( c + uh ) du ≤ M h ≤ M h where f ( · ) and k ( · ) are uniformly bounded (Assumptions 3 and 4). Hence, for the purposes ofLemma B.4, δ n = M h , which satisﬁes log nnδ n → √ K log n √ nh → h − O P (cid:16) h q log nnh (cid:17) = O P (cid:16)q log nnh (cid:17) ,because h − h = O (1). Part (B.94)Similar to above, convergence in the matrix norm is equivalent to convergence in each elementof the matrix, so it suﬃces to show that1 h sup c ∈X ,h ∈ [ h ,h ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 v + c,h ( X i ) k (cid:18) X i − ch (cid:19) (cid:18) X i − ch (cid:19) l ε i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O P s log nnh ! l . Take an arbitrary sequence h ∈ [ h , h ] E " v + c,h ( X i ) k (cid:18) X i − ch (cid:19) (cid:18) X i − ch (cid:19) l ε i  ≤ M E " v + c,h ( X i ) k (cid:18) X i − ch (cid:19) (cid:18) X i − ch (cid:19) l = M h Z k ( u ) u l f ( c + uh ) du ≤ M h where it is used that ε i = Y i − R ( X i , D i ) is a.s. uniformly bounded (Assumption 7). Hence, δ n = M h . The expectation E (cid:20) v + c,h ( X i ) k (cid:16) X i − ch (cid:17) (cid:16) X i − ch (cid:17) l ε i (cid:21) = 0, and the sup is over a class offunctions that satisﬁes the conditions of Lemma B.4, which gives the result. Part (B.95)It suﬃces to show that1 h sup c ∈X , h ∈ [ h ,h ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( v + c,h ( X i ) k (cid:18) X i − ch (cid:19) (cid:18) X i − ch (cid:19) l E [ Y j + i | X i ] − E " v + c,h ( X i ) k (cid:18) X i − ch (cid:19) (cid:18) X i − ch (cid:19) l Y j ± i = 1 h sup c ∈X , h ∈ [ h ,h ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n X i =1 ( v + c,h ( X i ) k (cid:18) X i − ch (cid:19) (cid:18) X i − ch (cid:19) l r + ( X i ) − E " v + c,h ( X i ) k (cid:18) X i − ch (cid:19) (cid:18) X i − ch (cid:19) l r + ( X i ) = O P s log nnh ! for any positive integer l . Choose δ n similarly as before. The sup is over a class of functions thatsatisﬁes the conditions of Lemma B.4, which gives the result. Lemma B.7.

Assume the conditions of Theorem 2 hold. Then, parts (A.7) and (A.8) in the proofof Theorem 2 (Section A.3) converge in probability to zero.Proof.

Part (A.7)19 (cid:12)(cid:12)(cid:12) b µ − E [ b µ |X n ] − ( µ ∗ − E [ µ ∗ |X n ])( V cn ) / (cid:12)(cid:12)(cid:12)(cid:12) (B.97) ≤ O (cid:16)(cid:0) Knh (cid:1) / (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K X j =1 ∆ j n e ′ (cid:0) G j + n − E (cid:2) G j + n (cid:3)(cid:1) nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i ε i e H ji − e ′ (cid:0) G j − n − E (cid:2) G j − n (cid:3)(cid:1) nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i ε i e H ji o(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (B.98) ≤ O (cid:16)(cid:0) Knh (cid:1) / (cid:17) K X j =1 | ∆ j | (cid:13)(cid:13) G j + n − E (cid:2) G j + n (cid:3)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i ε i e H ji (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) + (cid:13)(cid:13) G j − n − E (cid:2) G j − n (cid:3)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i ε i e H ji (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (B.99) ≤ O (cid:16)(cid:0) Knh (cid:1) / (cid:17) KO (cid:0) K − (cid:1) O P (cid:18) log nnh (cid:19) / ! O P (cid:18) log nnh (cid:19) / ! (B.100)= O P K / log n (cid:0) nh (cid:1) / ! = o P (1) (B.101)where the ﬁrst inequality uses the rate on ( V cn ) − / (Equation A.13); the third inequality relies onthe uniform convergence rates of Lemma B.6, and that ∆ j = O ( K − ) uniformly over j (LemmaB.9); the last equality uses the rate condition K / log n (cid:0) nh (cid:1) − / = o (1). Part (A.8) E [ b µ − µ n |X n ] − e µ ( V cn ) / (B.102)=( V cn ) − / K X j =1 ∆ j e ′ G j + n ( nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i E [ Y j + i | X i ] e H ji − E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i Y j + i e H ji (B.103) − ( V cn ) − / K X j =1 ∆ j e ′ G j − n ( nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i E [ Y j − i | X i ] e H ji − E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j − i Y j − i e H ji (B.104)where 20 B. V cn ) − / K X j =1 ∆ j e ′ E (cid:2) G j + n (cid:3) ( nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i E [ Y j + i | X i ] e H ji − E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i Y j + i e H ji (B.105)=( V cn ) − / K X j =1 ∆ j e ′ (cid:0) G j + n − E (cid:2) G j + n (cid:3)(cid:1) ( nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i E [ Y j + i | X i ] e H ji − E " nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i Y j + i e H ji (B.106)Part (B.105) is o P (1) because it has zero mean and zero limiting variance, V [( B. V cn ) − K X j =1 ∆ j n n V (cid:26) h j k (cid:18) X i − c j h j (cid:19) v j + i E [ Y j + i | X i ] (cid:16) e ′ E (cid:2) G j + n (cid:3) e H ji (cid:17) − E (cid:20) h j k (cid:18) X i − c j h j (cid:19) v j + i Y j + i (cid:16) e ′ E (cid:2) G j + n (cid:3) e H ji (cid:17)(cid:21)(cid:27) (B.107) ≤ O ( Knh ) K X j =1 ∆ j n E " h j k (cid:18) X i − c j h j (cid:19) v j + i E [ Y j + i | X i ] (cid:16) e ′ E (cid:2) G j + n (cid:3) e H ji (cid:17) (B.108)= O ( K ) K X j =1 O ( K − ) E " h j k (cid:18) X i − c j h j (cid:19) v j + i O (cid:16) h ρ +11 j (cid:17) (cid:16) e ′ (cid:2) G j + + O ( h j ) (cid:3) e H ji (cid:17) (B.109)= O (cid:16) h ρ +21 (cid:17) = o (1) (B.110)where it is used the rate on ( V cn ) − (Equation A.13); that ∆ j = O ( K − ) holds uniformly over j (Lemma B.9); expansion (A.23); E h G j + n i is uniformly close to G j + (Lemma B.6); and that theexpected value in (B.109) without the O (cid:16) h ρ +11 j (cid:17) term is a bounded quantity.Part (B.106) is o P (1) because | ( B. | ≤ O (cid:16)(cid:0) Knh (cid:1) / (cid:17) K X j =1 | ∆ j | (cid:13)(cid:13) G j + n − E (cid:2) G j + n (cid:3)(cid:13)(cid:13) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) nh j n X i =1 k (cid:18) X i − c j h j (cid:19) v j + i E [ Y j + i | X i ] e H ji − E (cid:20) h j k (cid:18) X i − c j h j (cid:19) v j + i Y j + i e H ji (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) (B.111)= O (cid:16)(cid:0) Knh (cid:1) / (cid:17) KO (cid:0) K − (cid:1) O P (cid:16) (log n ) / (cid:0) nh (cid:1) − / (cid:17) O P (cid:16) (log n ) / (cid:0) nh (cid:1) − / (cid:17) (B.112)= O P (cid:16) K / (log n ) (cid:0) nh (cid:1) − / (cid:17) = o P (1) (B.113)21hich relies on the rate conditions of ( V cn ) − / (Equation A.13), that ∆ j = O ( K − ) uniformly over j (Lemma B.9), and on the rate conditions of Lemma B.6’s parts (B.93) and (B.95). Therefore,(B.103) is o P (1), and a symmetric proof shows that (B.104) is o P (1). Hence, part (A.8) is o P (1). B.3 Integral Approximation

This section proves results on the error of approximated integrals. Let R : R → R be aRiemann integrable function; for an open and convex set C ⊂ R , deﬁne β : C → R such that β ( x ) = R ( x , x ) − R ( x , x ) (i.e. treatment eﬀect function on the main text). There are observations ofthe value of the β ( . ) function for K points c , . . . , c K , that is, β = β ( c ) , . . . , β K = β ( c K ), for c j = ( c ,j , c ,j , c ,j ). Interest lies on the integral µ = R C β ( x ) d ( x ) which is approximated by a ﬁniteweighted sum b µ = P j ∆ j β j . A procedure to compute the integral approximation is given below.More importantly, there is a result that gives the rate of decay of the approximation error of thisprocedure as the number of points K → ∞ . The procedure consists of using a multivariate localpolynomial regression in a ﬁrst step to obtain an approximated function b β ( x ). The second stepintegrates b β ( x ) over set C to obtain an approximated integral b µ .For the ﬁrst step, run a weighted regression of β j s on J × E j ( x ). Each E j ( x ) is madeof polynomials evaluated at ( x − c j ) of order ρ at most. To deﬁne E j ( x ) and J , ﬁrst consider themulti-index notation for vectors: for x = ( x , x , x ) ∈ R and γ = ( γ , γ , γ ) ∈ Z , let | γ | = X i =1 γ i γ ! = Y i =1 γ i ! x γ = Y i =1 x γ i i ∇ | γ | β ( x ) = ∂ | γ | ∂x γ ∂x γ ∂x γ β ( x )Each entry in E j ( x ) is a polynomial of the form p γ ( x − c j ) = Q i =1 ( x i − c i,j ) γ i with γ such that | γ | ≤ ρ and min { γ , γ } = 0. There is no γ , γ > β ( x ) is the diﬀerence R ( x , x ) − R ( x , x )whose polynomial approximation does not include interactions between x and x . The dimensionof E j ( x ) is J × J = 2 (cid:0) ρ +22 (cid:1) − ( ρ + 1), and the ﬁrst entry in E j ( x ) is the polynomial ofdegree zero (i.e. p ( x − c j ) = 1). Next, stack E ( x ) ′ . . . E K ( x ) ′ into the K × J matrix E ( x ), and β , . . . , β K into the K × B . The regression of B on E is kernel weighted depending onthe distance between a ﬁxed point x ∈ C and c j . For a choice of bandwidth h >

0, and a kerneldensity function that satisﬁes Assumption 3, the K × K matrix Ω ( x ; h ) is the diagonal matrix ofkernel weights: 22 ( x ; h ) = diag { Ω j ( x ; h ) } j = diag ( Y i =1 k (cid:18) x i − c i,j h (cid:19)) j The ﬁrst-step regression consists of solving the following problem. b η = argmin η ( B − E ( x ) η ) ′ Ω ( x ; h ) ( B − E ( x ) η ) b β ( x ) = e ′ b η = b η where η is a J × η is the ﬁrst coordinate of the vector η (interceptcoeﬃcient).In the second step, integrate the estimated function b β ( x ) over C . Note that the approximatedintegral b µ is written as a weighted sum of β j . Z C b β ( x ) d x = Z C e ′ (cid:0) E ( x ) ′ Ω ( x ; h ) E ( x ) (cid:1) − X j Ω j ( x ; h ) E j ( x ) β j d ( x )= X j Z C e ′ (cid:0) E ( x ) ′ Ω ( x ; h ) E ( x ) (cid:1) − Ω j ( x ; h ) E j ( x ) d ( x ) β j = X j ∆ j β j The expression for the correction weight ∆ j is∆ j = Z C e ′ (cid:0) E ( x ) ′ Ω ( x ; h ) E ( x ) (cid:1) − Ω j ( x ; h ) E j ( x ) d ( x )= Z C det (cid:0) E ( x ) ′ Ω ( x ; h ) E ← e j ( x ) (cid:1) det ( E ( x ) ′ Ω ( x ; h ) E ( x )) d ( x )where the Cramer rule is used in the second equality, and E ← e j ( x ) is the matrix valued function E ( x ) except for the ﬁrst column which is replaced by the K × e j that is zero everywhereexcept for the j -th entry which is equal to 1.The approximation error of such a procedure is well-behaved if R ( x, y ) is a continuously diﬀer-entiable function of order up to ρ + 1 on C . This implies that, ∇ | γ | β ( x ) is a continuous functionfor every γ such that | γ | = ρ + 1. Lemma B.8 below states the approximation error of using amultivariate local polynomial regression on a ﬁnite number of points to obtain b β ( x ). This result isTheorem 3.1 of Lipman et al. (2006), and here account is given to the fact that β is the diﬀerenceof two functions. Lemma B.8.

Let

C ⊂ R be open and convex. Let R : R → R be a ρ + 1 times continuouslydiﬀerentiable function on C , and deﬁne β ( x ) = R ( x , x ) − R ( x , x ) . For x ∈ C , assume b β ( x ) is constructed as above, and that the matrix E ( x ) ′ Ω ( x ; h ) E ( x ) is invertible for some choice of > . Then, there exists ξ j ∈ (0 , j = 1 , . . . , K , such that b β ( x ) − β ( x ) = X | γ | = ρ +1min { γ ,γ } =0 K X j =1 ( γ ! ( c j − x ) γ ∇ | γ | β (cid:18) ξ j ( c j − x ) + x (cid:19) det (cid:0) E ( x ) ′ Ω ( x ; h ) E ← e j ( x ) (cid:1) det ( E ( x ) ′ Ω ( x ; h ) E ( x )) ) (B.114) where the E ( x ) , E ← e j ( x ) , and Ω ( x ; h ) matrices have been described above.Proof. See the proof of Theorem 3.1 in Lipman et al. (2006) and use the fact that ∇ | γ | β ( x ) = 0 if γ > γ > c j arethought as coming from a triangular array indexed by K : { c j,K } j . The approximation error b β ( x ) − β ( x ) decreases to zero as K grows large. Lemma B.9 below uses regularity conditions on thefunction β and on the triangular array of points to determine the rate at which the approximationerror converges to zero. Lemma B.9.

Assume the conditions of Lemma B.8 hold. Furthermore, assume that(i) K → ∞ , h → , / ( Kh ) = O (1) ;(ii) there exists a positive deﬁnite J × J matrix Q such that sup x ∈C (cid:13)(cid:13)(cid:13) Kh [ E ( x /h ) ′ Ω ( x ; h ) E ( x /h )] − − Q (cid:13)(cid:13)(cid:13) = o (1) ; and(iii) the function β ( x ) has bounded derivatives on C of order up to ρ + 2 .Then, Z C b β ( x ) − β ( x ) d x − B K = O (cid:16) h ρ +22 (cid:17) (B.115) where B K = Z C X | γ | = ρ +1min { γ ,γ } =0 K X j =1 ( γ ! ( c j − x ) γ ∇ | γ | β ( x ) det (cid:0) E ( x ) ′ Ω ( x ; h ) E ← e j ( x ) (cid:1) det ( E ( x ) ′ Ω ( x ; h ) E ( x )) ) d x . (B.116) and B K = O (cid:16) h ρ +12 (cid:17) .Moreover, there exists a J × vector Θ such that e ′ Q Θ > , and max ≤ j ≤ K (cid:13)(cid:13)(cid:13)(cid:13) h − Z C Ω j ( x ; h ) E j ( x /h ) d x − Θ (cid:13)(cid:13)(cid:13)(cid:13) = o (1) (B.117)24ax ≤ j ≤ K | K ∆ j − e ′ Q Θ | = o (1) . (B.118) Proof.

Parts (B.115) and (B.116) : Start with Equation B.114. Do a ﬁrst-order Taylor expansion of ∇ | γ | β ( ξ j ( c j − x ) + x ) around x and substitute in Equation B.114 to obtain b β ( x ) − β ( x ) = X | γ | = ρ +1min { γ ,γ } =0 K X j =1 ( γ ! ( c j − x ) γ ∇ | γ | β ( x ) det (cid:0) E ( x ) ′ Ω ( x ; h ) E ← e j ( x ) (cid:1) det ( E ( x ) ′ Ω ( x ; h ) E ( x )) ) (B.119)+ X | γ | = ρ +1min { γ ,γ } =0 K X j =1 ( γ ! ( c j − x ) γ X | η | =1 ∇ | γ + η | β (cid:18) δ j ( c j − x ) + x (cid:19) ( c j − x ) η det (cid:0) E ( x ) ′ Ω ( x ; h ) E ← e j ( x ) (cid:1) det ( E ( x ) ′ Ω ( x ; h ) E ( x )) ) . (B.120)The integral over set C is Z C b β ( x ) − β ( x ) d x = B K (B.121)+ Z C X | γ | = ρ +1min { γ ,γ } =0 | η | =1 K X j =1 ( γ ! ( c j − x ) γ + η ∇ | γ + η | β (cid:18) δ j ( c j − x ) + x (cid:19) det (cid:0) E ( x ) ′ Ω ( x ; h ) E ← e j ( x ) (cid:1) det ( E ( x ) ′ Ω ( x ; h ) E ( x )) ) d x . (B.122)The absolute value of the expression inside the integral in Equation B.122 is bounded by X | γ | = ρ +1min { γ ,γ } =0 | η | =1 K X j =1 ( γ ! | c j − x | γ + η (cid:12)(cid:12)(cid:12)(cid:12) ∇ | γ + η | β (cid:18) δ j ( c j − x ) + x (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) (B.123) (cid:12)(cid:12)(cid:12) e ′ (cid:0) E ( x ) ′ Ω ( x ; h ) E ( x ) (cid:1) − Ω j ( x ; h ) E j ( x ) (cid:12)(cid:12)(cid:12) ) (B.124)= X | γ | = ρ +1min { γ ,γ } =0 | η | =1 K X j =1 ( γ ! | c j − x | γ + η (cid:12)(cid:12)(cid:12)(cid:12) ∇ | γ + η | β (cid:18) δ j ( c j − x ) + x (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) (B.125) (cid:12)(cid:12)(cid:12) e ′ (cid:0) E ( x /h ) ′ Ω ( x ; h ) E ( x /h ) (cid:1) − Ω j ( x ; h ) E j ( x /h ) (cid:12)(cid:12)(cid:12) ) (B.126)25 M h ρ +22 (cid:13)(cid:13)(cid:13) Kh (cid:0) E ( x /h ) ′ Ω ( x ; h ) E ( x /h ) (cid:1) − (cid:13)(cid:13)(cid:13) K K X j =1 (cid:13)(cid:13)(cid:13)(cid:13) h Ω j ( x ; h ) E j ( x /h ) (cid:13)(cid:13)(cid:13)(cid:13) (B.127) ≤ M h ρ +22 O (1) O (1) = O (cid:16) h ρ +22 (cid:17) . (B.128)where it is used that the derivatives of β are bounded; that | ( c j − x ) γ + η | ≤ h ρ +22 ; that the normof the inverse of Kh ( E ( x /h ) ′ Ω ( x ; h ) E ( x /h )) is bounded over x and n (Assumption (ii)); andthe fact that P j k Ω j ( x ; h ) E j ( x /h ) k ≤ M Kh . It follows that Equation B.122 is O (cid:16) h ρ +22 (cid:17) . Asimilar argument yields B K = O (cid:16) h ρ +12 (cid:17) . Part (B.117) : Assume WLOG the support of the kernel is [ − ,

1] (Assumption 3). Deﬁne: F ( x ) = E j ( x + c j ); C h = { x ∈ R : Q i =1 ( x i ± h ) ⊆ C} , where Q is used to denote the Cartesian product;Θ = R [ − , k ( u ) k ( u ) k ( u ) F ( u ) d u , where u = ( u , u , u ).0 ≤ max j : c j ∈C h (cid:13)(cid:13)(cid:13)(cid:13) h − Z C Ω j ( x ; h ) E j ( x /h ) d x − Θ (cid:13)(cid:13)(cid:13)(cid:13) ≤ sup c ∈C h (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) h − Z c ± h Y i =1 k (( x i − c i ) /h ) F (( x − c ) /h ) d x − Θ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = sup c ∈C h (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)Z [ − , k ( u ) k ( u ) k ( u ) F ( u ) d u − Θ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = 0where the transformation u = ( x − c ) /h is used. The result follows from the fact that C h ↑ C . Part (B.118) : Using the formula for the correction weights ∆ j (cid:12)(cid:12) K ∆ j − e ′ Q Θ (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) K Z C e ′ (cid:0) E ( x /h ) ′ Ω ( x ; h ) E ( x /h ) (cid:1) − Ω j ( x ; h ) E j ( x /h ) d ( x ) − e ′ Q Θ (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)Z C e ′ h Kh (cid:0) E ( x /h ) ′ Ω ( x ; h ) E ( x /h ) (cid:1) − i h − Ω j ( x ; h ) E j ( x /h ) d ( x ) − e ′ Q Θ (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)Z C e ′ h Kh (cid:0) E ( x /h ) ′ Ω ( x ; h ) E ( x /h ) (cid:1) − − Q i h − Ω j ( x ; h ) E j ( x /h ) d ( x ) (cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)Z C e ′ Q h − Ω j ( x ; h ) E j ( x /h ) d ( x ) − e ′ Q Θ (cid:12)(cid:12)(cid:12)(cid:12) ≤ Z C (cid:13)(cid:13)(cid:13) Kh (cid:0) E ( x /h ) ′ Ω ( x ; h ) E ( x /h ) (cid:1) − − Q (cid:13)(cid:13)(cid:13) h − k Ω j ( x ; h ) E j ( x /h ) k d ( x )+ (cid:12)(cid:12)(cid:12)(cid:12) e ′ Q (cid:20)Z C h − Ω j ( x ; h ) E j ( x /h ) d ( x ) − Θ (cid:21)(cid:12)(cid:12)(cid:12)(cid:12) = o (1) O (1) + o (1) = o (1)26ext, (cid:12)(cid:12)(cid:12)(cid:12)Z C d c − e ′ Q Θ (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)Z C d c − X j ∆ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)X j ∆ j − e ′ Q Θ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ o (1) + 1 K X j (cid:12)(cid:12) K ∆ j − e ′ Q Θ (cid:12)(cid:12) ≤ o (1) + max j (cid:12)(cid:12) K ∆ j − e ′ Q Θ (cid:12)(cid:12) = o (1)shows that e ′ Q Θ = R C d c > Remark 1.

Lemma B.9 also applies to weighted integrals of the form µ = Z C ω ( x ) β ( x ) d ( x ) where ω ( x ) is a probability density function that is continuous, bounded and bounded away fromzero. There are three main diﬀerences between unweighted integrals (treated above) and weightedintegrals (considered in the main text) : (i) the formula for the weights ∆ j changes to ∆ j = Z C ω ( x ) e ′ (cid:0) E ( x ) ′ Ω ( x ; h ) E ( x ) (cid:1) − Ω j ( x ; h ) E j ( x ) d ( x )= Z C ω ( x ) det (cid:0) E ( x ) ′ Ω ( x ; h ) E ← e j ( x ) (cid:1) det ( E ( x ) ′ Ω ( x ; h ) E ( x )) d ( x ) , (ii) the formula for the bias changes to B K = Z C ω ( x ) X | γ | = ρ +1min { γ ,γ } =0 K X j =1 ( γ ! ( c j − x ) γ ∇ | γ | β ( x ) det (cid:0) E ( x ) ′ Ω ( x ; h ) E ← e j ( x ) (cid:1) det ( E ( x ) ′ Ω ( x ; h ) E ( x )) ) d x , and (iii) conclusion B.118 of Lemma B.9 changes to max ≤ j ≤ K | K ∆ j /ω ( c j ) − e ′ Q Θ | = o (1) . Lemma B.9 states a condition on the asymptotic behavior of the triangular array of points { c j } Kj =1 . For large K , the observations must be uniformly distributed on the domain C such that E ( x ) ′ Ω ( x ; h ) E ( x ) is invertible and of magnitude Kh , that is, K times the volume of every h -neighborhood of x , for every x in C . These conditions are satisﬁed in a variety of examples oftriangular arrays of points that cover C uniformly well for large K .27o be clearer, this assumption is illustrated in a simple example. In the main text, the conditionsof Lemma B.9 are restated in Assumption 6(c) and in the rate conditions of Theorem 2. The choiceof h , h , ρ needs to satisfy both the conditions in Assumption 6 and the rate conditions of Theorem2. Pick the choices given in the example of Figure 1, for which, h = K − λ /λ , h = K − / , and ρ = 3. Let c ∈ R , x ∈ R , k ( u ) = . I {| u | ≤ } , and C = (0 , . Deﬁne N points for each l th coordinate of c = ( c , c , c ) as c l,j,N = j/ ( N + 1), j = 1 , . . . , N , l = 1 , ,

3. In this case, K = N , and h = 1 /N / . Assumption 6(b) requires the distance c ,j +1 ,K − c ,j,K = 1 / ( N + 1) =1 / ( K / + 1), to be greater than the order of h = K − λ /λ . This is equivalent to λ > λ / K = P K ( l ,l ,l ) I (cid:8) Ω ( l ,l ,l ) ( x ; h ) > (cid:9) , where ( l , l , l ) indexes point c ( l ,l ,l ) = ( c l , c l , c l ).For each x , the number of c ( l ,l ,l ) in the h -neighborhood of x grows to inﬁnity at K / = Kh rate, so ˜ K = O ( Kh ). This rate of growth is uniform over x ∈ C . The vector of polynomials E j ( x )is written as E j ( x ) = F (cid:0) x − c ( l ,l ,l ) (cid:1) where F ( u ) = h u (0 , , , u (1 , , , u (0 , , , . . . , u (2 , , i ′ that is, all polynomials u ( γ ,γ ,γ ) such that γ i ∈ Z + ∀ i , 0 ≤ γ + γ + γ ≤

3, and min { γ , γ } = 0.Then, F ( u ) is a J × J = 16 . Now, ﬁx x ∈ (0 , and a large K . Consider an uniform discrete random vector ˜ u taking valueson ( − , according to u ( l ,l ,l ) = (cid:0) x − c ( l ,l ,l ) (cid:1) /h for all c ( l ,l ,l ) ∈ ( x ± h ). It turns out that1˜ K E ( x /h ) ′ Ω ( x ; h ) E ( x /h )= 12 ˜ K X ( l ,l ,l ) I (cid:8) Ω ( l ,l ,l ) ( x ; h ) > (cid:9) F (cid:0) ( x − c ( l ,l ,l ) ) /h (cid:1) F (cid:0) ( x − c ( l ,l ,l ) ) /h (cid:1) ′ = 12 E (cid:2) F (˜ u ) F (˜ u ) ′ (cid:3) This is approximately equal to R u ∈ [ − , F ( u ) F ( u ) ′ d u uniformly in x . Simply call Q the inverseof this integral, a positive deﬁnite matrix. Finally,sup x ∈C (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:20) Kh E ( x /h ) ′ Ω ( x ; h ) E ( x /h ) (cid:21) − − Q (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = o (1) B.4 Consistent Estimation of Standard Errors

This section demonstrates that the estimator for the variance of b µ c proposed in Section 3.2 isa consistent estimator. For the nearest-neighbor matching, the distribution of X i is continuous, so28 assume X < . . . < X n WLOG. For a ﬁxed number of neighbors N ∈ Z + , deﬁne c : X → { , , . . . , K } , where c ( x ) = max ≤ j ≤ K { c j : c j ≤ x } (B.129) ℓ : { , . . . , n } × Z + → { , . . . , n } , where ℓ ( i, N ) is such that n X v =1 v = i,v = ℓ ( i,N ) I (cid:8) | X v − X i | ≤ | X ℓ ( i,N ) − X i | , c ( X v ) = c ( X i ) (cid:9) = N (B.130) b ε i = NN + 1 Y i − N N X l =1 Y ℓ ( i,l ) ! (B.131)The expression for the variance estimator in the continuous case is given in Equation 26. Lemma B.10.

Assume the conditions of Theorem 2 and ( Kh ) − = O (1) hold. Then, b V cn / V cn p → .Proof. The proof extends the arguments of Theorem A3 by CCT to the case where the number ofcutoﬀs grows to inﬁnity.Deﬁne φ n and b φ n by φ n ( X i ) = K X j =1 ∆ j nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:17) e H ji (B.132) b φ n ( X i ) = K X j =1 ∆ j nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i G j + n − v j − i G j − n (cid:17) e H ji . (B.133)Use them to rewrite V cn and b V cn as V cn = n E (cid:2) ε i φ n ( X i ) (cid:3) (B.134) b V cn = n X i =1 b ε i b φ n ( X i ) . (B.135)In order to show b V cn / V cn p →

1, it suﬃces to show that (

Knh )( b V cn − V cn ) p → V cn ) − = O ( Knh ).( Knh )( b V cn − V cn ) =( Knh ) n X i =1 b ε i b φ n ( X i ) − ( Knh ) n X i =1 ε i φ n ( X i ) (B.136)=( Knh ) n X i =1 b ε i (cid:16) b φ n ( X i ) − φ n ( X i ) + φ n ( X i ) (cid:17) − ( Knh ) n X i =1 ε i φ n ( X i ) (B.137)=( Knh ) n X i =1 b ε i (cid:16) b φ n ( X i ) − φ n ( X i ) (cid:17) (B.138)29 2( Knh ) n X i =1 b ε i (cid:16) b φ n ( X i ) − φ n ( X i ) (cid:17) φ n ( X i ) (B.139)+ ( Knh ) n X i =1 (cid:0)b ε i − ε i (cid:1) φ n ( X i ) (B.140)The rest of the proof shows that parts (B.138) - (B.140) converge in probability to zero. Part (B.138)First, for arbitrary x ∈ X (cid:12)(cid:12)(cid:12) b φ n ( x ) − φ n ( x ) (cid:12)(cid:12)(cid:12) ≤ K X j =1 ( | ∆ j | nh j (cid:12)(cid:12)(cid:12)(cid:12) k (cid:18) x − c j h j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) e ′ (cid:16) v + c j ,h j ( x ) (cid:0) G j + n − E [ G j + n ] (cid:1) − v − c j ,h j ( x ) (cid:0) G j − n − E [ G j − n ] (cid:1)(cid:17) H (cid:18) x − c j h j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ) (B.141) ≤ ≤ j ≤ K ( | ∆ j | nh j (cid:12)(cid:12)(cid:12)(cid:12) k (cid:18) x − c j h j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) e ′ (cid:16) v + c j ,h j ( x ) (cid:0) G j + n − E [ G j + n ] (cid:1) − v − c j ,h j ( x ) (cid:0) G j − n − E [ G j − n ] (cid:1)(cid:17) H (cid:18) x − c j h j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ) (B.142)= O (cid:18) Knh (cid:19) O P s log nnh ! = O P Knh s log nnh ! (B.143)where the second inequality uses the fact that at most two elements of the sum over j are non-zerofor each value of x ; the ﬁrst equality relies on max j | ∆ j | = O ( K − ) (Lemma B.9), on h − j ≤ h − ,on the fact that the kernel is bounded (Assumption 3), that v ± c j ,h j ( x ) H ( h − j ( x − c j )) is bounded,and that max j (cid:13)(cid:13)(cid:13) G j ± n − E [ G j ± n ] (cid:13)(cid:13)(cid:13) = O P ((log n/Kh ) / ) (Lemma B.6); the last equality uses the ratecondition h /h = O (1). The rate in (B.143) is uniform over x ∈ X .Then, it follows that | (B.138) | ≤ ( Knh ) n n n X i =1 b ε i max x (cid:12)(cid:12)(cid:12) b φ n ( x ) − φ n ( x ) (cid:12)(cid:12)(cid:12) (B.144)=( Knh ) nO P (1) O P (cid:18) Knh ) log nnh (cid:19) = 1 Kh log nnh O P (1) = o P (1) (B.145)where the ﬁrst equality used the rate derived in (B.143), and the fact that b ε i is a.s. bounded because ε i is a.s. bounded (Assumption 7); the last equality relied on the rate conditions ( Kh ) − = O (1)and log n ( nh ) − = o (1). 30 art (B.139)First, for arbitrary x ∈ X| φ n ( x ) | ≤ K X j =1 ( | ∆ j | nh j (cid:12)(cid:12)(cid:12)(cid:12) k (cid:18) x − c j h j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) e ′ (cid:16) v + c j ,h j ( x ) E [ G j + n ] − v − c j ,h j ( x ) E [ G j − n ] (cid:17) H (cid:18) x − c j h j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ) (B.146) ≤ ≤ j ≤ K ( | ∆ j | nh j (cid:12)(cid:12)(cid:12)(cid:12) k (cid:18) x − c j h j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) e ′ (cid:16) v + c j ,h j ( x ) E [ G j + n ] − v − c j ,h j ( x ) E [ G j − n ] (cid:17) H (cid:18) x − c j h j (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ) (B.147)= O (cid:18) Knh (cid:19) = O (cid:18) Knh (cid:19) (B.148)where the second inequality uses the fact that at most two elements of the sum over j are non-zerofor each value of x ; the ﬁrst equality relies on max j | ∆ j | = O ( K − ) (Lemma B.9), on h − j ≤ h − ,on the fact that the kernel is bounded (Assumption 3), that v ± c j ,h j ( x ) H ( h − j ( x − c j )) is bounded,and that E [ G j ± n ] is approximately equal to a positive deﬁnite matrix with determinant boundedaway from zero (Lemma B.6); the last equality uses the rate condition h /h = O (1). The rate in(B.148) is uniform over x ∈ X .Then, it follows that | (B.139) | ≤ ( Knh ) n n n X i =1 b ε i max x (cid:12)(cid:12)(cid:12) b φ n ( x ) − φ n ( x ) (cid:12)(cid:12)(cid:12) max x | φ n ( x ) | (B.149)=( Knh ) nO P (1) O P Knh s log nnh ! O (cid:18) Knh (cid:19) = 1 Kh s log nnh O P (1) = o P (1)(B.150)where the ﬁrst equality used the rate derived in (B.143) and (B.148), and the fact that b ε i is a.s.bounded because ε i is a.s. bounded (Assumption 7); the last equality relied on the rate conditions( Kh ) − = O (1) and log n ( nh ) − = o (1). 31 art (B.140)First, expand b ε i around ε i . To simplify notation, abbreviate E [ Y i | X i ] = R ( X i , D i ) to R i . b ε i = NN + 1 Y i − N N X l =1 Y ℓ ( i,l ) ! (B.151)= NN + 1 R i + ε i − N N X l =1 ( R ℓ ( i,l ) + ε ℓ ( i,l ) ) ! (B.152)= NN + 1 (cid:18) ε i − N ε ℓ ( i,l ) (cid:19) + NN + 1 N N X l =1 ( R i − R ℓ ( i,l ) ) ! + 2 NN + 1 (cid:18) ε i − N ε ℓ ( i,l ) (cid:19) N N X l =1 ( R i − R ℓ ( i,l ) ) ! (B.153)= ε i − ε i N + 1 N X l =1 ε ℓ ( i,l ) + 1 N ( N + 1) N X l =1 ( ε ℓ ( i,l ) − ε i )+ 2 N ( N + 1) N X l =1 N X v =1 ,v>l ε ℓ ( i,l ) ε ℓ ( i,v ) + 1 N ( N + 1) N X l =1 R i − R ℓ ( i,l ) ! + 2 ε i N + 1 N X l =1 (cid:0) R i − R ℓ ( i,l ) (cid:1) − N ( N + 1) N X l =1 ε ℓ ( i,l ) N X v =1 ( R i − R ℓ ( i,v ) ) (B.154)Then, substitute the expression for b ε i − ε i derived above in part (B.140):( B. Knh ) n X i =1 ( − ε i N + 1 N X l =1 ε ℓ ( i,l ) ) φ n ( X i ) (B.155)+ ( Knh ) n X i =1 ( N ( N + 1) N X l =1 ( ε ℓ ( i,l ) − ε i ) ) φ n ( X i ) (B.156)+ ( Knh ) n X i =1  N ( N + 1) N X l =1 N X v =1 ,v>l ε ℓ ( i,l ) ε ℓ ( i,v )  φ n ( X i ) (B.157)+ ( Knh ) n X i =1  N ( N + 1) N X l =1 R i − R ℓ ( i,l ) !  φ n ( X i ) (B.158)+ ( Knh ) n X i =1 ( ε i N + 1 N X l =1 (cid:0) R i − R ℓ ( i,l ) (cid:1)) φ n ( X i ) (B.159)+ ( Knh ) n X i =1 ( − N ( N + 1) N X l =1 ε ℓ ( i,l ) N X v =1 ( R i − R ℓ ( i,v ) ) ) φ n ( X i ) (B.160)32he steps below demonstrate that parts (B.155) - (B.160) converge in probability to zero. Part (B.155) : the expected value E [(B.155) |X n ] = 0. To compute the variance of (B.155)centered at E [(B.155) |X n ] = 0, abbreviate N − P Nl =1 ε ℓ ( i,l ) to ε i , φ n ( X i ) to φ ni , I { c ( X i ) = c ( X j ) } to I = ij , and I { c ( X i ) = c ( X j ) } to I = ij . Then, E h ((B.155) − E [(B.155) |X n ]) i = M ( Knh ) E  n X i =1 n X j =1 ( I = ij + I = ij ) ε i ε i φ ni ε j ε j φ nj  (B.161)= M ( Knh ) E  n X i =1 n X j =1 I = ij ε i ε i φ ni ε j ε j φ nj  (B.162)= M ( Knh ) n X i =1 n X j =1 E (cid:2) I = ij (cid:3) O P (cid:0) ( Knh ) − (cid:1) (B.163)= M ( Knh ) − n O P (cid:0) K − (cid:1) = o P (1) (B.164)where the second equality uses that the expected value of I = ij ε i ε i φ ni ε j ε j φ nj conditional on X n iszero because if I = ij = 1, then ε i ε i is independent of ε j ε j , and E [ ε i ε i |X n ] = E [ ε i |X n ] E [ ε i |X n ] = 0;the third equality relies on the fact that ε i and ε i are a.s. bounded (Assumption 7), and that φ ni is O (cid:0) ( Knh ) − (cid:1) (Equation B.148); the fourth equality uses that E [ I = ij ] = E [ P ( c j ≤ X j l ε ℓ ( i,l ) ε ℓ ( i,v ) ) φ ni ) (B.171)= M E  ( Knh ) n X i =1 n X j =1 ( I = ij + I = ij ) ( N X v>l ε ℓ ( i,l ) ε ℓ ( i,v ) ) φ ni ( N X v>l ε ℓ ( j,l ) ε ℓ ( j,v ) ) φ nj  = o P (1)(B.172)where the expected value of I = ij P v>l ε ℓ ( i,l ) ε ℓ ( i,v ) φ ni P v>l ε ℓ ( j,l ) ε ℓ ( j,v ) φ nj conditional on X n is zerobecause if I = ij = 1, then ε ℓ ( i,l ) ε ℓ ( i,v ) is independent of ε ℓ ( j,l ) ε ℓ ( j,v ) , and E [ ε ℓ ( i,l ) ε ℓ ( i,v ) |X n ] = 0; the restfollows arguments similar to the ones used in part (B.155). The Chebyshev’s inequality yields that(B.157) = o P (1). Part (B.158) : for x ∗ il between X i and X ℓ ( i,l ) :(B.158) = M ( Knh ) n X i =1 N X l =1 ∇ x R ( x ∗ il , D i )( X i − X ℓ ( i,l ) ) ! φ ni (B.173)=( Knh ) nO ( K − ) O (cid:0) ( Knh ) − (cid:1) = O ( K − ) O (cid:0) ( Kh ) − (cid:1) = o (1) (B.174)which uses the fact that ∇ x R ( x ∗ il , D i ) is bounded (Assumption 5), that ( X i − X ℓ ( i,l ) ) = O ( K − ),that φ ni is O (cid:0) ( Knh ) − (cid:1) (Equation B.148), and the rate condition ( Kh ) − = O (1). Part (B.159) : for x ∗ il between X i and X ℓ ( i,l ) :(B.159) = M ( Knh ) n X i =1 ( ε i N X l =1 ∇ x R ( x ∗ il , D i )( X i − X ℓ ( i,l ) ) ) φ ni (B.175)=( Knh ) nO P ( K − ) O (cid:0) ( Kh ) − (cid:1) = o P (1) (B.176)34hich follows from the same arguments as the ones in part (B.158) plus the fact that ε i is a.s.bounded (Assumption 7). Part (B.160) : for x ∗ iv between X i and X ℓ ( i,v ) :(B.160) = M ( Knh ) n X i =1 ( N X l =1 ε ℓ ( i,l ) N X v =1 ∇ x R ( x ∗ iv , D i )( X i − X ℓ ( i,v ) ) ) φ ni (B.177)=( Knh ) nO P ( K − ) O (cid:0) ( Kh ) − (cid:1) = o P (1) (B.178)as seen in part (B.159).Therefore, (B.140) = o P (1), which concludes the proof. B.5 Fuzzy RDD with Multiple Cutoﬀs

B.5.1 Example of Compliance Behaviors

Here is a simple example with three diﬀerent treatments and two cutoﬀs (that is, 3 schools, K =2) to illustrate the diﬀerent compliance behaviors. Table B.1 below lists all possible combinationsof treatment eligibility and assignment produced by U i ( x ).35able B.1: Diﬀerent Compliance Behaviors Eligibility Type d d d d d d ever-deﬁers d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d Eligibility Type d d d d d d never-changers d d d d d d d d d ever-compliers d d d d d d d d d d d d d d d Notes: All possible realizations of the random function U i ( x ) for values of x such that D ( x ) ∈ { d , d , d } . B.5.2 Estimation and Inference

Theorem 4 in the main text suggests a two-step estimation procedure for θ ec . In the ﬁrst step,obtain b B j as in Section 3.1, and compute estimates cf W j using LPRs of W ( X i , D i ) on X i at eachside of the cutoﬀ c j . For each j = 1 , . . . , K , and l -th coordinate of the vector f W j , l = 1 , . . . , q , theresearcher computes cf W j,l =ˆ a + j,l − ˆ a − j,l (B.179)(ˆ a + j,l , ˆ b + j,l ) = argmin ( a, b ) n X i =1 ( k (cid:18) X i − c j h j (cid:19) v j + i (cid:20) e ′ l W ( X i , D i ) − a − b ( X i − c j ) − . . . − b ρ ( X i − c j ) ρ (cid:21) ) (B.180)(ˆ a − j,l , ˆ b − j,l ) = argmin ( a, b ) n X i =1 ( k (cid:18) X i − c j h j (cid:19) v j − i (cid:20) e ′ l W ( X i , D i ) − a − b ( X i − c j ) − . . . − b ρ ( X i − c j ) ρ (cid:21) ) (B.181)36here e l is the q × l -th coordinate. The q × cf W j isconstructed by stacking the q estimates cf W j = (cid:20)cf W j, , . . . , cf W j,q (cid:21) ′ .In the second step, regress b B j on cf W j to obtain an estimate for θ ec . More speciﬁcally, stackall q × cf W j into the K × q matrix cf W , and b B j into the K × b B . Choose a K × K symmetric and positive-deﬁnite weighting matrix Ω. The estimator b θ ec is the solution tothe following weighted least-squares problem: b θ ec = argmin θ (cid:18) b B − cf W θ (cid:19) ′ Ω (cid:18) b B − cf W θ (cid:19) . (B.182)The estimator for the ATE on ever-compliers µ ec is a linear combination of b θ ec , b µ ec = Z ( F ) b θ ec (B.183)where Z ( F ) is deﬁned in Equation 35.Asymptotic normality of b θ ec relies on smoothness assumptions on the conditional moments of Y i and the probabilities of treatment for diﬀerent compliance behaviors. The sample size growslarge, while the number of cutoﬀs remains ﬁxed. Assumption 10.

For any d ∈ D , and any ¯ U in the class U ∗ deﬁned in Eq. (29) , (a) E [ Y i ( d ) | X i = x, U i = ¯ U ] is a ρ + 1 times continuously diﬀerentiable function of x with ( ρ + 1) -th derivative ∇ ρ +1 x E [ Y i ( d ) | X i = x, U i = ¯ U ] ; (b) V [ Y i ( d ) | X i = x, U i = ¯ U ] is a continuous function of x , and E h(cid:12)(cid:12) Y i ( d ) − E [ Y i ( d ) | X i = x, U i = ¯ U ] (cid:12)(cid:12) | X i = x, U i = ¯ U i is bounded; (c) for f W deﬁned in Eq. (36) , f W ′ f W is invertible; (d) P [ U i = ¯ U | X i = x ] is a ρ + 1 times continuously diﬀerentiable function of x with ( ρ + 1) -th derivative ∇ ρ +1 x P [ U i = ¯ U | X i = x ] . Theorem B.1.

Suppose Assumptions 3-4 and 8-10 hold, and that the number of cutoﬀs K is ﬁxed.Let h = min j { h j } and h = max j { h j } . As n → ∞ , assume that h → , h /h = O (1) , nh → ∞ , and ( nh ) / h ρ +11 = O (1) . Then, ( V θ ec n ) − / (cid:16)b θ ec − B θ ec n − θ ec (cid:17) d → N ( , I ) (B.184) b µ ec − B µ ec n − µ ec ( V µ ec n ) / d → N (0 , , (B.185) where denotes the q × vector of zeros, and I is the q × q identity matrix.The bias and variance terms are characterized as follows, B θ ec n = (cid:18) cf W ′ Ω cf W (cid:19) − cf W ′ Ω B ecn , a q × vector; (B.186) B µ ec n = Z ( F ) (cid:18) cf W ′ Ω cf W (cid:19) − cf W ′ Ω B ecn , a scalar; (B.187)37 θ ec n = (cid:18) cf W ′ Ω cf W (cid:19) − cf W ′ Ω V ecn Ω cf W (cid:18) cf W ′ Ω cf W (cid:19) − , a q × q matrix; (B.188) V µ ec n = Z ( F ) (cid:18) cf W ′ Ω cf W (cid:19) − cf W ′ Ω V ecn Ω cf W (cid:18) cf W ′ Ω cf W (cid:19) − Z ( F ) ′ , a scalar. (B.189) These terms depend on B ecn ( K × vector) and V ecn ( K × K matrix) that are deﬁned below B ecn = [ B ecn , . . . , B ecnK ] ′ , where for each j (B.190) B ecnj = h ρ +11 j f ( c j )( ρ + 1)! h − θ ec ′ i e ′ ( G j + n γ ∗ ∇ ρ +1 x " R ( c j , d j ) W ( c j , d j ) − G j − n γ ∗ ∇ ρ +1 x " R ( c j , d j − ) W ( c j , d j − ) ;(B.191) V ecn =  V ecn . . . V ecn K ... . . . ... V ecnK . . . V ecnKK  , where V ecnjl = 0 if | j − l | > , otherwise (B.192) V ecnjl = n E ( nh j k (cid:18) X i − c j h j (cid:19) nh l k (cid:18) X i − c l h l (cid:19)h − θ ec ′ i e ′ (cid:16) v j + i E [ G j + n ] − v j − i E [ G j − n ] (cid:17) e H ji ε i ε ′ i e H l ′ i (cid:16) v l + i E [ G l + n ] ′ − v l − i E [ G l − n ] ′ (cid:17) e h − θ ec ′ i ′ ) , (B.193) where ε i = [ Y i W ( X i , D i ) ′ ] ′ − E (cid:8) [ Y i W ( X i , D i ) ′ ] ′ | X i (cid:9) , a ( q + 1) × vector; e = I q ⊗ e where I q is the q × q identity matrix, ⊗ denotes the Kronecker product, and e is a ( ρ + 1) × vector of ze-ros except for the ﬁrst coordinate that equals ; γ ∗ = I q ⊗ γ ∗ for γ ∗ deﬁned in Theorem 1; e H ji = I q ⊗ H (( X i − c j ) /h j ) for H ( u ) deﬁned in Theorem 1; and G j ± n = [( nh j ) − P ni =1 k (cid:16) X i − c j h j (cid:17) v j ± i e H ji e H j ′ i ] − for v j ± i deﬁned in Equation 10.Furthermore, ( V θ ec n ) − / = O P (cid:16)(cid:0) nh (cid:1) / (cid:17) , and ( V θ ec n ) − / B θ ec n = O P (cid:16)(cid:0) nh (cid:1) / h ρ +11 (cid:17) , where A − / denotes the inverse of the square root of a positive-deﬁnite matrix A . Similarly, ( V µ ec n ) − / = O P (cid:16)(cid:0) nh (cid:1) / (cid:17) , and ( V µ ec n ) − / B µ ec n = O P (cid:16)(cid:0) nh (cid:1) / h ρ +11 (cid:17) .The approximate MSE of either b θ ec or b µ ec is minimized by setting Ω = (cid:16) B ecn B ec ′ n + V ecn (cid:17) − . The proof of Theorem B.1 is in Section B.5.3 of the supplemental appendix. The variance termsin Equations B.188 and B.189 contain the matrix V ecn that needs to be estimated. The elements V ecnjl of such matrix are consistently estimated by b V ecnjl = n X i =1 ( nh j k (cid:18) X i − c j h j (cid:19) nh l k (cid:18) X i − c l h l (cid:19) − b θ ec ′ (cid:21) e ′ (cid:16) v j + i G j + n − v j − i G j − n (cid:17) e H ji b ε i b ε ′ i e H l ′ i (cid:16) v l + i G l + ′ n − v l − i G l − ′ n (cid:17) e (cid:20) − b θ ec ′ (cid:21) ′ ) . (B.194)where b θ ec is a consistent estimator of θ ec , and the vector of residuals is estimated by a nearest-neighbor matching estimator analogously to Section 3.1’s Equation 15 : b ε i b ε ′ i = 34 Y i − X l =1 Y ℓ ( i,l ) ! Y i − X l =1 Y ℓ ( i,l ) ! ′ . (B.195)If the bandwidth choices are such that the standardized bias term (cid:0) V θ ec n (cid:1) − / B θ ec n diﬀers from zeroasymptotically, then inference must be done using a bias-corrected estimator. A practical way ofdoing bias correction is to increase the order of the polynomial to ρ + 1 and compute b θ ec ′ and b V ec ′ n using the same bandwidth choices. It follows that ( b V θ ec ′ n ) − / (cid:18)b θ ec ′ − θ ec (cid:19) d → N ( , I ). Similar toTheorems 1 and 2, Theorem B.1 allows for bandwidth choices that produce overlapping estimationwindows across cutoﬀs. The variance estimator in (B.194) takes account of overlap by allowing V njl to be non-zero for j = l .The following steps are a practical recommendation to implement MSE-optimal and bias-corrected estimates. The source of MSE in estimation is B ecnj and V ecnjl , which come from theregression of Y i − W ( X i , D i ) ′ θ ec on X i at each cutoﬀ c j ; thus, it makes sense to choose MSE-optimal bandwidths for these regressions.0. Take initial values b θ ec (0) and Ω (0) ;1. Compute ﬁrst-step IK bandwidths h (0)1 j for sharp RD of Y i − W ( X i , D i ) ′ θ ec (0) on X i at eachcutoﬀ c j . Use local-linear regression ( ρ = 1) and edge kernel;2. Obtain bias-corrected estimates b B (0) j for each cutoﬀ j using sharp RD of Y i on X i using local-quadratic regression, edge kernel, and bandwidth h (0)1 j ; do the same for each coordinate of W ( X i , D i ) to compute cf W (0) j for each j ; stack estimates into b B (0) and cf W (0) ;3. Update b θ ec : compute b θ ec (1) using Equation B.182 with Ω (0) , b B (0) , and cf W (0) ;4. Estimate the variance of b B (0) − cf W (0) θ ec using Equation B.194 with ρ = 2, b θ ec (1) , andbandwidths h (0)1 j ; call the estimated variance b V ec (1) n .5. Update Ω: compute Ω (1) = (cid:16) b V ec (1) n (cid:17) − ;6. Update b θ ec : compute b θ ec (2) using Equation B.182 with Ω (1) , b B (0) , and cf W (0) ;39. Repeat Steps 4-6 starting with b θ ec (2) in the place of b θ ec (1) . Iterate these three steps untilconvergence of b θ ec . Call b θ ec (3) and Ω (3) , the iterated values of, respectively, b θ ec and Ω;8. Repeat Steps 1-7 starting with b θ ec (3) in the place of b θ ec (0) , and with Ω (3) in the place of Ω (0) .Iterate these 7 steps until the diﬀerence between the θ s of Step 3 and Step 7 converges tozero. Call b θ ec (4) , Ω (4) , and cf W (4) the iterated values of, respectively, b θ ec , Ω, and cf W ;9. Estimate the variance of b θ ec using b V θ ec n = cf W (4) ′ Ω (4) cf W (4) ! − ; compute b µ ec using EquationB.183 with b θ ec (4) and Z ( F ) given by the counterfactual policy of interest; estimate the varianceof b µ ec using b V µ ec n = Z ( F ) b V θ ec n Z ( F ) ′ . B.5.3 Proof of Theorem B.1

This proof relies heavily on Lemma B.1, which is a CLT for the LPR estimator of the diﬀerencein side-limits of a conditional mean function of the vector Y i given X i at X i = c j . Such lemma isapplied to Y i = [ Y i W ( X i , D i ) ′ ] ′ to arrive at( V nj ) − / (cid:16)b J j − B nj − J j (cid:17) d → N ( ; I ) (B.196)for each j .The assumptions of Theorem B.1 satisfy the assumptions of Lemma B.1. In fact, the conditionson the rates, on the distribution of X i , and on the kernel density in Lemma B.1 are simply restatedin the conditions of Theorem B.1. It remains to verify the other two suﬃcient conditions of LemmaB.1: (a) m ( x ) has continuous derivatives wrt x of order ρ + 1 in a compact interval centered at c j but excluding c j , and existence of side limits at c j ; and (b) continuity of ζ ( x ) wrt x in a compactinterval centered at c j but excluding c j , existence of side limits at c j , and boundedness of the thirdmoment conditional on X i .For (a), note that, in the fuzzy case, the mean of Y i and W ( X i , D i ) conditional on X i is asum of the means of potential outcomes Y i ( d ) and W ( c j , d ) for various dosages d conditional onsets of the form {U i ( c j ) = d } weighted by conditional probabilities of the same sets (see proof ofTheorem 4). Assumption 10 implies that such conditional means and conditional probabilities aresmooth functions of x and side-limits exist at x = c j . Similarly, for (b), the conditional covarianceof [ Y i , W ( X i , D i ) ′ ] ′ is a function of sums of the ﬁrst and second moments of potential outcomes Y i ( d ) for various dosages conditional on sets of the form {U i ( c j ) = d } weighted by conditionalprobabilities of the same sets. Assumption 10 ensures continuity of ζ ( x ) wrt x and existence ofside-limits at c j . A similar argument bounds the third centered moment of Y i ( d ) , and Lemma B.1applies. 40ext, note thatlim e ↓ ( E [ W ( c j , D i ) | X i = c j + e ] − E [ W ( c j , D i ) | X i = c j − e ] (cid:27) (B.197)= K X l =0 ,l = j n W ( c j , d j ) − W ( c j , d l ) o ω j,l = f W j (B.198)which means that J j = [ B j f W ′ j ] ′ . Call α = [1 − θ ec ′ ] ′ , a (( q + 1) ×

1) vector. Then, (B.196)implies (cid:0) α ′ V nj α (cid:1) − / (cid:16) α ′ b J j − α ′ B nj − α ′ J j (cid:17) d → N (0 ,

1) (B.199) (cid:0) V ecnjj (cid:1) − / (cid:18) b B j − θ ec ′ cf W j − B ecnj (cid:19) d → N (0 ,

1) (B.200)where α ′ J j = 0 by Assumption 9, and the deﬁnitions (B.191) and (B.193) are used. Stacking acrosscutoﬀs gives (cid:0) diag {V ecnjj } j (cid:1) − / (cid:18) b B − cf W θ ec − B ecn (cid:19) d → N ( , I ) (B.201)( V ecn ) − / (cid:18) b B − cf W θ ec − B ecn (cid:19) d → N ( , I ) (B.202)where ( V ecn ) / (cid:16) diag {V ecnjj } j (cid:17) − / → I because the covariances (oﬀ-diagonal terms) converge tozero since the estimation windows do not overlap in the limit. Deﬁne Γ = (cid:16) f W ′ Ω f W (cid:17) − f W ′ Ω.Then, (cid:0) Γ V ecn Γ ′ (cid:1) − / (cid:18) Γ b B − Γ cf W θ ec − Γ B ecn (cid:19) d → N ( , I ) (B.203)Deﬁne b Γ = (cid:18) cf W ′ Ω cf W (cid:19) − cf W ′ Ω, and write( V θ ec n ) − / (cid:16)b θ ec − B θ ec n − θ ec (cid:17) = (cid:16)b Γ V ecn b Γ ′ (cid:17) − / (cid:18)b Γ b B − b Γ cf W θ ec − b Γ B ecn (cid:19) (B.204)= (cid:16)b Γ V ecn b Γ ′ (cid:17) − / (cid:18) Γ b B − Γ cf W θ ec − Γ B ecn (cid:19) (B.205)+ (cid:16)b Γ V ecn b Γ ′ (cid:17) − / (cid:16)b Γ − Γ (cid:17) (cid:18) b B − cf W θ ec − B ecn (cid:19) (B.206)= (cid:16)b Γ V ecn b Γ ′ (cid:17) − / (cid:0) Γ V ecn Γ ′ (cid:1) / (cid:0) Γ V ecn Γ ′ (cid:1) − / (cid:18) Γ b B − Γ cf W θ ec − Γ B ecn (cid:19) (B.207)+ O P (cid:16) ( nh ) / (cid:17) o P (1) O P (cid:16) ( nh ) − / (cid:17) (B.208)41hich converges in distribution to N ( , I ) because of (B.203), the fact that (cid:16)b Γ V ecn b Γ ′ (cid:17) − / ( Γ V ecn Γ ′ ) / p → I , and that b Γ p → Γ . (cid:3) B.6 Estimation of Counterfactual Distributions

This section considers applications where the counterfactual distribution is estimated as opposedto being known by the researcher. For brevity, I focus on the setting of Section 3.1, that is, sharpRD with discrete counterfactual and ﬁxed K . The analysis for the other settings of the paperfollows similar arguments. In what follows, I derive the limiting distribution for b µ d and propose aconsistent variance estimator.In the ﬁrst step, the researcher estimates the counterfactual probability mass function ω d ( c ) forevery c ∈ C K using iid observations Z i = ( Y i , X i ), i = 1 , . . . , n . There is a variety of ways to obtainestimates for b ω d ( c ). For example, one may estimate the distribution of X i non-parametrically,obtain b f X ( c j ) for every j , and construct b ω dj = b f X ( c j ) / P Kl =1 b f X ( c l ). Another way is to specify aparametric distribution and estimate its parameters. To keep the analysis general, assume b ω dj − ω dj = n X i =1 η nj ( Z i ) + o P ( r − / n ) (B.209)for every j , where η nj ( Z i ) has zero mean and ﬁnite variance for each n and j , and r n is a sequencethat converges to inﬁnity. The exact forms of the function η nj ( Z i ) and r n depend on the typeof estimator used to obtain b ω dj . The sequence r n represents the rate at which the inverse of thevariance of b ω dj grows. Namely, 1 /V AR [ P ni =1 η nj ( Z i )] = O ( r n ). For example, if ω d ( c ) is estimatedparametrically by maximum likelihood, then η nj ( Z i ) will be a function of the Hessian matrix timesthe score function, and r n = n ; if b ω dj is based of a kernel estimator for the density of X i withbandwidth h ω , then η nj ( Z i ) = ( nh ω ) − k (( X i − c j ) /h ω ) / P Kl =1 f X ( c l ) and r n = nh ω .The second step consists of estimating µ d , b µ d = K X j =1 b ω dj b B j . (B.210)Rewrite b µ d as b µ d − µ d = K X j =1 ω dj ( b B j − B j ) + K X j =1 B j ( b ω dj − ω dj ) + K X j =1 ( b B j − B j )( b ω dj − ω dj ) . (B.211)Suppose b B j has no ﬁrst-order asymptotic bias (i.e. bias-corrected). The proofs of Lemma B.142nd Theorem 2 imply that K X j =1 ω dj ( b B j − B j ) = n X i =1 ϕ n ( Z i ) + o P (cid:16) ( nh ) − / (cid:17) , (B.212)where 1 /V AR [ P ni =1 ϕ n ( Z i )] = O (cid:0) nh (cid:1) , and nh → ∞ . Similarly, the sum across j of (B.209)times B j gives K X j =1 B j ( b ω dj − ω dj ) = n X i =1 K X j =1 B j η nj ( Z i ) | {z } ≡ η n ( Z i ) + o P ( r − / n ) (B.213)= n X i =1 η n ( Z i ) + o P ( r − / n ) , (B.214)where 1 /V AR [ P ni =1 η n ( Z i )] = O ( r n ).Next, substitute (B.212) and (B.214) into Equation B.211, b µ d − µ d = n X i =1 ϕ n ( Z i ) + n X i =1 η n ( Z i ) + o P ( r − / n ) + o P (cid:16) ( nh ) − / (cid:17) + O P (cid:16) ( nh ) − / r − / n (cid:17) (B.215)= n X i =1 { ϕ n ( Z i ) + η n ( Z i ) } + o P ( r − / n ) + o P (cid:16) ( nh ) − / (cid:17) (B.216)where the second equality relies on b B j − B j = O P (cid:0) ( nh ) − / (cid:1) , b ω dj − ω dj = O P (cid:16) r − / n (cid:17) , and on thefact that ( nh ) − / r − / n converges to zero faster than each of ( nh ) − / and r − / n .Deﬁne V ωn = V AR [ P ni =1 ϕ n ( Z i )+ η n ( Z i )], and note that ( V ωn ) − / = O (cid:0) max { r − n , ( nh ) − } − / (cid:1) = O (cid:16) min { r / n , ( nh ) / } (cid:17) .Then, ( V ωn ) − / (cid:16)b µ d − µ d (cid:17) = ( V ωn ) − / n X i =1 { ϕ n ( Z i ) + η n ( Z i ) } (B.217)+ O (cid:16) min { r / n , ( nh ) / } (cid:17) o P (cid:16) r − / n (cid:17) (B.218)+ O (cid:16) min { r / n , ( nh ) / } (cid:17) o P (cid:16) ( nh ) − / (cid:17) (B.219)= ( V ωn ) − / n X i =1 { ϕ n ( Z i ) + η n ( Z i ) } + o P (1) (B.220) d → N (0 , . (B.221)43 consistent estimator for the variance is: b V ωn = n X i =1 { b ϕ n ( Z i ) + b η n ( Z i ) } , (B.222)with b ϕ n ( Z i ) constructed as in Equation (14), Section 3.1, b ϕ n ( Z i ) = b ε i K X j =1 b ω dj nh j k (cid:18) X i − c j h j (cid:19) e ′ (cid:16) v j + i G j + n − v j − i G j − n (cid:17) e H ji , (B.223)and the formula for b η n ( Z i ) depends on the form of the estimator ω d . In the kernel density example, b η n ( Z i ) = P Kj =1 c B j ( nh ω ) − k (( X i − c j ) /h ω ) P Kl =1 ( nh ω ) − P nm =1 k (( X m − c l ) /h ω ) . (B.224)An interesting particular case occurs when r n grows faster than nh . This is the case if ω d ( c )is assumed to be in a parametric class; or if b ω d ( c ) is based of a kernel density estimator with abandwidth that converges to zero more slowly than h . Let V dn = V AR [ P ni =1 ϕ n ( Z i )] as deﬁned inTheorem 1. It follows that, (cid:16) V dn (cid:17) − / (cid:16)b µ d − µ d (cid:17) = (cid:16) V dn (cid:17) − / n X i =1 { ϕ n ( Z i ) + η n ( Z i ) } (B.225)+ o P (cid:16) ( nh ) / r − / n (cid:17) + o P (cid:16) ( nh ) / ( nh ) − / (cid:17) (B.226)= (cid:16) V dn (cid:17) − / n X i =1 { ϕ n ( Z i ) + η n ( Z i ) } + o P (1) (B.227) d → N (0 , , (B.228)where (cid:0) V dn (cid:1) − / = O (cid:0) ( nh ) / (cid:1) from Theorem 1, and (cid:0) V dn (cid:1) − / V AR [ P ni =1 ϕ n ( Z i ) + η n ( Z i )] → ω d ( c ) is estimated at a faster rate than β ( c ), the asymptotic distribution andvariance estimator provided in Section 3.1 remain valid. B.7 Monte Carlo Simulations with Data-driven Bandwidths

This section revisits the simulations in Section 5 with data-driven bandwidth choices. Both ﬁrstand second-step bandwidths follow the rules for practical implementation suggested in Section 3.2(refer to page 17, paragraph starting with “

A simple recommendation to implement Theorem 2 ”).The rest of the simulation design remains the same as that of Section 5.Table B.2 compares the estimation precision of b µ and b µ bc across ﬁve sample sizes n , withrespective numbers of cutoﬀs K . Table B.3 analyzes coverage of conﬁdence intervals. Overall,the ﬁnite sample properties are consistent with those of Section 5, when bandwidths are non-random. The randomness of bandwidths increases the variance and bias, but they decrease with n

44t approximately the same rate as before. Bias correction eliminates most of the bias, and producesconﬁdence intervals with correct ﬁnite sample coverage.Table B.2: Precision of EstimatorsBias Variance MSE n K b µ b µ bc b µ b µ bc b µ b µ bc Notes: The table reports simulated bias, variance, and mean squared error (MSE) for two estimators ( b µ, b µ bc ),and ﬁve sample sizes n , with respective numbers of cutoﬀs K . Following Section 3.2, the ﬁrst-step bandwidthsare picked by the IK algorithm, and adjusted to be of order 1 /K . The second-step bandwidth is chosen onthe grid h ∈ { / ( K + 1) , . . . , / ( K + 1) } to minimize the estimated MSE of b µ . The number of simulationsis 10,000. Table B.3: Coverage of 95% Conﬁdence Intervals% Coverage Avg. Length n K b µ b µ bc b µ b µ bc Notes: The table reports simulated percentage of correct coverage, and average length of 95% conﬁdenceintervals. Conﬁdence intervals are constructed using two estimators ( b µ, b µ bc ). They equal an estimator plusor minus its estimated standard deviation multiplied by 1 .

96. Coverage and average length are computedfor ﬁve sample sizes n and respective numbers of cutoﬀs K . Following Section 3.2, the ﬁrst-step bandwidthsare picked by the IK algorithm, and adjusted to be of order 1 /K . The second-step bandwidth is chosen onthe grid h ∈ { / ( K + 1) , . . . , / ( K + 1) } to minimize the estimated MSE of b µ . The number of simulationsis 10,000.. The number of simulationsis 10,000.