Inference in Regression Discontinuity Designs under Monotonicity
Koohyun Kwon†  Soonwoo Kwon‡

November 23, 2020
Abstract
We provide an inference procedure for the sharp regression discontinuity design (RDD) under monotonicity, with possibly multiple running variables. Specifically, we consider the case where the true regression function is monotone with respect to (all or some of) the running variables and is assumed to lie in a Lipschitz smoothness class. Such a monotonicity condition is natural in many empirical contexts, and the Lipschitz constant has an intuitive interpretation. We propose a minimax two-sided confidence interval (CI) and an adaptive one-sided CI. For the two-sided CI, the researcher is required to choose a Lipschitz class in which she believes the true regression function to lie. This is the only tuning parameter, and the resulting CI has uniform coverage and attains the minimax optimal length. The one-sided CI can be constructed to maintain coverage over all monotone functions, providing maximum credibility in terms of the choice of the Lipschitz constant. Moreover, the monotonicity makes it possible for the (excess) length of the CI to adapt to the true Lipschitz constant of the unknown regression function. Overall, the proposed procedures make it easy to see under what conditions on the underlying regression function the given estimates are significant, which can add more transparency to research using RDD methods.

∗ We thank our advisors Donald Andrews and Timothy Armstrong for continuous guidance. We are also grateful to participants at the Yale Econometrics Prospectus Lunch for insightful discussions.
† Department of Economics, Yale University, [email protected]
‡ Department of Economics, Yale University, [email protected]

1 Introduction
Recently, there has been growing interest in honest and minimax optimal inference methods in regression discontinuity designs (Armstrong and Kolesár, 2018a; Armstrong and Kolesár, 2020b; Imbens and Wager, 2019; Kolesár and Rothe, 2018; Noack and Rothe, 2020). This approach requires a researcher to specify a function space in which she believes the regression function to lie, and the inference procedures follow once this function space is chosen. The methods proposed in the literature essentially use bounds on the second derivative to specify the function space. This is motivated by the popularity of local linear regression methods in practice, which is often justified by imposing local bounds on the second derivative of the regression function. However, choosing a reasonable bound on the second derivative can be difficult in practice.

We address this concern by considering the problem of conducting inference on the sharp regression discontinuity (RD) parameter under monotonicity and a Lipschitz condition. Specifically, the regression function is assumed to be monotone in all or some of the running variables, with a bounded first derivative. Monotonicity naturally arises in many regression discontinuity design (RDD) contexts, as is well documented by Babii and Kumar (2020). The Lipschitz constant, or the bound on the first derivative, has an intuitive interpretation, since it bounds how much the outcome can change when the running variable changes by a single unit. Hence, if the researcher reports the inference results along with the Lipschitz constant used to run the proposed procedure, it is easy to see under what (interpretable) conditions on the regression function the researcher has obtained those results. We exploit the combination of the monotonicity and Lipschitz continuity restrictions to construct a confidence interval (CI) that is efficient and maintains correct coverage uniformly over a potentially large and more interpretable function space.

We provide a minimax two-sided CI and an adaptive one-sided CI. For the two-sided CI, the researcher is required to choose a bound on the first derivative of the true regression function. The bound is the only tuning parameter, and the resulting CI has uniform coverage and attains the minimax optimal length over the class of regression functions under consideration. Moreover, by exploiting monotonicity, the CI is significantly shorter than minimax CIs constructed under no such shape restriction. To our knowledge, this paper is the first to consider a minimax optimal procedure when the regression function is assumed to be monotone.

The one-sided CI can be constructed to maintain coverage over all monotone functions, providing maximum credibility in terms of the choice of the Lipschitz constant. Due to monotonicity, the resulting CI still has finite excess length as long as the true regression function has a bounded first derivative, where this bound is allowed to be arbitrarily large and unknown. This is in contrast with minimax CIs constructed without the monotonicity condition, whose length must be infinite to cover all functions. Moreover, our proposed one-sided CI adapts to the underlying smoothness class, resulting in a shorter CI when the true regression function has a smaller first-derivative bound. This enables the researcher to conduct non-conservative inference on the RD parameter while maintaining honest coverage over a significantly larger space of regression functions.
(This requires the regression function to be monotone in all the running variables.) The cost of such adaptation is that we can only construct either a one-sided lower or upper CI, depending on the treatment allocation rule, but not both. We characterize this relationship between the treatment allocation rules and the direction of the adaptive one-sided CI we can construct.

Our approach, especially the two-sided CI, is closely related to the literature on honest inference in RDDs. By working with second-derivative bounds, the inference procedures in that literature are based on local linear regression estimators, in line with the more conventional methods used in RDD settings. However, it is rather difficult to evaluate the validity of a second-derivative bound specified by a researcher. While it seems rather innocuous to ignore regression functions with kinks, and thus with infinite second derivatives, it is not clear how large or how small the second derivative should be to be considered "too large" or "too small". For this reason, the literature often recommends a sensitivity analysis to strengthen the credibility of the inference results. However, the credibility gain from the sensitivity analysis is limited when the smoothness parameter is not easy to interpret. For example, it is hard to judge whether the maximum value considered in the sensitivity analysis is large enough.

In contrast, the bound on the first derivative can be chosen based on more straightforward empirical reasoning, since the first derivative has the intuitive interpretation of a partial effect. For example, if an outcome variable y and a running variable x are current and previous test scores, the class of regression functions whose values change by no more than a given amount of y in response to a 1/10 standard deviation increase in x can be regarded as reasonable. Armstrong and Kolesár (2020a) take a similar approach of specifying Lipschitz constants in their empirical application, in an inference problem for average treatment effects under a different setting. By imposing a bound on the first derivative, our procedure is based on a Nadaraya–Watson type estimator, with the boundary bias correctly accounted for.

The possibility of forming an adaptive one-sided CI in RDD settings under monotonicity was first considered in Armstrong (2015) and Armstrong and Kolesár (2018b). The difference is that these papers are concerned with adapting to Hölder exponents β ∈ (0, 1] while fixing the Lipschitz constant.
Here, we fix β = 1 and adapt to the Lipschitz constant. When β = 1, what governs the performance of an adaptive CI is the size of the constant multiplying the rate of convergence, not the rate itself. This is in contrast to the setting considered in Armstrong (2015) and Armstrong and Kolesár (2018b), which primarily discuss rate-adaptation. In this paper, we provide a procedure that makes the magnitude of the multiplying constant reasonably small.

Babii and Kumar (2020) also consider an RDD setting with a monotone regression function. They introduce an inference procedure based on an isotonic regression estimator, conveying a message similar to ours: that the monotonicity restriction can lead to a more efficient inference procedure. Our approach differs from theirs in several key aspects: 1) we explicitly focus on maintaining uniform coverage and optimizing the length of the CI, 2) we consider the Lipschitz class while they consider the Hölder class with exponent β > 1/2, and 3) our procedure can be used in settings with multiple running variables.
The general treatment of the dimension of the running variables in this paper allows a researcher to use our procedure in settings with more than one running variable, referred to as the multi-score RDD by Cattaneo et al. (2020). This setting has also been considered by Imbens and Wager (2019). Our paper is the first to consider the space of monotone regression functions in this setting, and the gain from monotonicity is especially significant in the case with multiple running variables, as we show later in the paper. We also allow the regression function to be monotone with respect to only some of the variables, which broadens the scope of application
of our procedure. (In fact, our procedure can be easily adapted to the case where β ∈ (0, 1]. We focus on β = 1 mainly due to the interpretability of the Lipschitz constant, and because assuming bounded first derivatives seems rather innocuous in empirical contexts.)

The rest of the paper is organized as follows. Section 2 describes the setting and the form of the optimal kernels and bandwidths under this setting. Section 3 introduces the minimax two-sided CI, and Section 4 the adaptive one-sided CI. Section 5 provides results from simulation studies to demonstrate the efficacy of the proposed procedures. Section 6 revisits the empirical analysis of Lee (2008). Section 7 concludes by discussing possible extensions.
2 Setting

We observe i.i.d. observations {(y_i, x_i)}_{i=1}^n, where y_i ∈ R is an outcome variable and x_i ∈ X ⊂ R^d is either a scalar or a vector of running variables. We take X to be a hyperrectangle in R^d, and let X_t and X_c be connected sets with nonempty interiors that form a partition of X. The subscripts t and c correspond to the "treatment" and "control" groups, respectively, throughout this paper. We write x_i ∈ X_t and x_i ∈ X_c to indicate that individual i belongs to the treatment and the control group, respectively. When d = 1, our setting corresponds to the standard sharp RD design with a single cutoff point.

Let 1{·} denote the indicator function. Then, our setting can be written as the nonparametric regression model

  y_i = f(x_i) + u_i,  f(x) = f_t(x)·1{x ∈ X_t} + f_c(x)·1{x ∈ X_c},  (1)

where the random variable u_i is independent across i. Here, f_t and f_c denote the mean outcome functions for the treated and the control groups, respectively, so f corresponds to the mean outcome function for the observed outcome, which we refer to as the "regression function" throughout the paper.

Our parameter of interest is a treatment effect parameter at a boundary point x_0 ∈ B := ∂X_t ∩ ∂X_c, defined as

  L_RD f := lim_{x → x_0, x ∈ X_t} f(x) − lim_{x → x_0, x ∈ X_c} f(x).

When d = 1, L_RD f corresponds to the conventional sharp RD parameter. On the other hand, when d > 1, L_RD f is the sharp RD parameter at a particular cutoff point in B. This type of parameter is also considered in Imbens and Wager (2019) and Cattaneo et al. (2020) when they analyze RD designs with multiple running variables. Without loss of generality, we set x_0 = 0, since we can always relabel x̃_i = x_i − x_0.
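For concreteness, the following minimal sketch (ours, not from the paper) simulates data from model (1) for d = 1 with X_t = [−1, 0), and computes the implied RD parameter; the particular choices of f_t, f_c, and θ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 500, 0.5

x = rng.uniform(-1.0, 1.0, n)           # running variable, X = [-1, 1]
treated = x < 0                         # X_t = [-1, 0), X_c = [0, 1]
f_t = lambda z: 0.8 * z + theta         # increasing mean outcome, treated
f_c = lambda z: 0.8 * z                 # increasing mean outcome, control
f = np.where(treated, f_t(x), f_c(x))   # f = f_t 1{x in X_t} + f_c 1{x in X_c}
y = f + rng.standard_normal(n)          # u_i ~ N(0, 1), independent across i

L_RD = f_t(0.0) - f_c(0.0)              # sharp RD parameter: the jump at x_0 = 0
print(L_RD)                             # equals theta here
```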
Remark 1 (More than two treatment statuses). Our framework can also handle settings in which individuals are assigned to more than two treatments based on the values of the multiple running variables, as analyzed by Papay et al. (2011). For example, when d = 2 with two cutoff points c_1, c_2, four different treatment statuses are possible depending on whether x_i^{(s)} ≥ c_s for s = 1, 2. Let {X_j}_{j=1}^4 be the corresponding partition of X. To compare, say, the first two treatment statuses, set X̃ = X_1 ∪ X_2 and apply our method with X̃ in place of X, taking X_t = X_1 and X_c = X_2.

We consider a framework in which a researcher is willing to specify some C > 0 such that
  |f_q(x) − f_q(z)| ≤ C·||x − z|| for all x, z ∈ X_q, for each q ∈ {t, c},  (2)

for some norm ||·|| on R^d. In other words, the mean potential outcome functions are Lipschitz continuous with Lipschitz constant C with respect to the norm ||·||. While the norm can be understood as the absolute value when d = 1, different choices of the norm give different interpretations when d > 1.
We allow a general class of norms, requiring only that ||x|| is increasing in the absolute value of each element of x. We say the regression function f defined in (1) has Lipschitz constant C if f_t and f_c satisfy (2).

In addition to the Lipschitz continuity, the researcher assumes that the mean potential outcome functions are monotone with respect to the running variables. Formally, letting z^{(s)} denote the s-th component of the running variable z ∈ R^d, and letting V ⊆ {1, ..., d} be an index set for the monotone variables, the researcher assumes that

  f_q(x) ≥ f_q(z) if x^{(s)} ≥ z^{(s)} for all s ∈ V and x^{(t)} = z^{(t)} for all t ∉ V,  (3)

for each q ∈ {t, c}. We use F(C) to denote the space of functions on R^d that satisfy (2) and (3) separately on X_t and X_c.

We now discuss the practical implications of our framework. First, there are abundant settings where such a monotonicity condition is reasonable. See Appendix C of our paper, as well as Appendix A.3 of Babii and Kumar (2020), for examples of RDDs with monotone running variables. One reason for the prevalence of monotonicity is the nature of policy design. For example, students with lower test scores are assigned to summer school because policymakers worry that students with lower current test scores will show lower academic achievement in the future; that is, they believe the average future academic achievement (an outcome variable) is monotone in current test scores (running variables).

Next, our framework with Lipschitz continuity differs from previous approaches that specify a bound on the second derivative. For example, Armstrong and Kolesár (2018a) consider the following locally smooth function space:

  f_t, f_c ∈ F_{T,p}(C) := { g : |g(x) − Σ_{j=0}^{p−1} g^{(j)}(0)·x^j/j!| ≤ C|x|^p for all x ∈ X },

and set p = 2 in their empirical application. Similarly, Imbens and Wager (2019) impose a global smoothness assumption that ||∇²f_q(x)|| ≤ C for q ∈ {t, c}, where ||∇²f_q(x)|| denotes the operator norm of the Hessian matrix. In contrast, we consider situations where a researcher has a belief about the size of the first, rather than the second, derivative.

The main advantage of working with the first derivative is the interpretability of the function space. The function space over which the coverage is uniform should be easy to interpret, in the sense that the researcher herself, or a policymaker evaluating the inference procedure, can judge whether the functions excluded from the function space can be safely disregarded as "extreme".

For this purpose, the size of the first derivative provides a reasonable criterion. To be concrete, consider Lee (2008), who analyzes the effect of incumbency on election outcomes; the outcome variable is the difference in the percentage of votes between two parties in the current election, and the running variable is the same quantity in the previous election. A mean potential outcome function whose maximum slope is as large as C = 50 seems unreasonable: it roughly implies that a very small increase in the vote percentage difference in an election, say 0.1%, predicts a large increase, 5%, in the vote percentage difference in the subsequent election. Similarly, if a researcher presents a CI for the incumbency effect parameter that has valid coverage over a space of functions whose first derivatives are bounded by C = 0.5,
the policymaker evaluating the analysis might find the function space too restrictive.

In comparison, it seems relatively more difficult to evaluate the validity of a given bound on the second derivative. Previous papers have proposed heuristic arguments for setting such a bound. For example, Armstrong and Kolesár (2018a) choose the smoothness parameter so that the reduction in the prediction MSE from using the true conditional mean rather than its Taylor approximation is not too large. While this gives an alternative interpretation of their smoothness constant, the prediction MSE does not have an interpretation that connects to the empirical examples being considered. Imbens and Wager (2019) suggest estimating the curvatures of f_t and f_c using quadratic functions and multiplying the estimates by a constant such as 2 or 3, but we can expect that this procedure would not yield uniform coverage without further restrictions on the function space, as pointed out in their paper. Armstrong and Kolesár (2020b) formally derive an additional condition on the function space that enables data-driven estimation of the smoothness parameter, but they warn that this additional assumption may be difficult to justify. Instead of setting a single bound, one may conduct a sensitivity analysis, as recommended by Armstrong and Kolesár (2018a) and Imbens and Wager (2019). However, a sensitivity analysis is more meaningful when the smoothness parameter being varied has an intuitive meaning.

A possible drawback of having to specify a Lipschitz constant is that our procedure does not ensure coverage when the mean potential outcome functions are linear, or close to linear, with slopes larger than the smoothness constant set a priori. For the minimax two-sided CI, we can view this as the price we pay for maintaining validity for functions with arbitrarily large second derivatives. On the other hand, our adaptive procedure can be used to construct a one-sided CI that adapts to the degree of Lipschitz smoothness. The adaptive procedure enables the researcher to set a very large value of the smoothness parameter, or even set it to infinity (so that the coverage is over all monotone functions), and to obtain a shorter CI if the true regression function has a smaller first-derivative bound. This is possible due to the monotonicity assumption, which is plausible in many RDD applications.

Remark 2 (Lipschitz continuity under general dimension).
When d > 1, it can be more reasonable to assume that a researcher has a belief about the size of the partial derivatives of the mean potential outcome functions. That is, there exist C_1, ..., C_d > 0 such that

  |f_q(x) − f_q(z)| ≤ C_s·|x^{(s)} − z^{(s)}| for all x, z ∈ X such that x_{−s} = z_{−s},  (4)

for each q ∈ {c, t} and s ∈ {1, ..., d}. Here, x_{−s} denotes the elements of x ∈ R^d excluding its s-th component. It is easy to show that under (4), the original Lipschitz continuity assumption (2) holds with C = 1 and ||·|| being the weighted ℓ1 norm on R^d given by ||z|| = Σ_{s=1}^d C_s|z^{(s)}|. Moreover, (2) holding with C = 1 and this weighted ℓ1 norm also implies (4). Therefore, a researcher assuming (4) can equivalently assume (2) with this weighted ℓ1 norm. This approach is also used in Armstrong and Kolesár (2020a) in the context of inference for average treatment effects under unconfoundedness.

Remark 3 (RDDs without monotonicity). By taking V = ∅, our procedure can be used to construct a minimax CI for the RD parameter without the monotonicity assumption. While other alternatives such as Armstrong and Kolesár (2018a) and Imbens and Wager (2019) can be used in this setting, our procedure is still useful to researchers who prefer imposing bounds on the first derivative rather than the second derivative, perhaps due to better interpretability.

Our procedures depend on certain kernel functions and bandwidths that depend on the Lipschitz parameter. We first introduce some notation. Given some z ∈ R^d and the index set V, we define (z)_+^V to be the element of R^d whose s-th component is given by

  (z)_{+,(s)}^V := max{z^{(s)}, 0}·1(s ∈ V) + z^{(s)}·1(s ∉ V), s = 1, ..., d.

Similarly, we define (z)_−^V := −(−z)_+^V. When a is a scalar, we use square brackets [a]_+ to denote max{a, 0}. In addition, we define σ(x_i) := Var^{1/2}[u_i | x_i], and given an estimator L̂ of L_RD f, we write bias_f(L̂) to denote E_f[L̂ − L_RD f] and sd(L̂) to denote Var^{1/2}(L̂).

The minimax procedure is based on the following kernel function:

  K(z) := [1 − (||(z)_+^V|| + ||(z)_−^V||)]_+.  (5)

In the adaptive procedure, different bandwidths are used for each coordinate, depending on the signs of the coordinates. To make the use of different bandwidths clear, for h = (h_1, h_2) we define

  K(z, h) := [1 − (||(z/h_1)_+^V|| + ||(z/h_2)_−^V||)]_+.

The optimal bandwidths used in estimation are based on the two functions ω_t(δ_t; C_1, C_2) and ω_c(δ_c; C_1, C_2), defined as the solutions to the equation

  Σ_{x_i ∈ X_q} [ω_q(δ_q; C_1, C_2) − C_1·||(x_i)_+^V|| − C_2·||(x_i)_−^V||]_+² / σ²(x_i) = δ_q², for each q ∈ {t, c},

given a pair of non-negative numbers (δ_t, δ_c), where σ²(x) := Var[u_i | x_i = x].
Moreover, given some scalar δ ≥ 0, we define δ*_t(δ; C_1, C_2) and δ*_c(δ; C_1, C_2) to be the solutions to

  sup_{δ*_t ≥ 0, δ*_c ≥ 0, (δ*_t)² + (δ*_c)² = δ²} ω_t(δ*_t; C_1, C_2) + ω_c(δ*_c; C_1, C_2).

Based on these definitions, we introduce the following shorthand notation, used throughout the paper: for q ∈ {t, c}, δ ≥ 0,
(C_1, C_2) ∈ R_+², and h ∈ R_+², we define

  ω*_q(δ; C_1, C_2) := ω_q(δ*_q(δ; C_1, C_2); C_1, C_2),

  a_q(h; C_1, C_2) := (1/2) · [Σ_{x_i∈X_q} K(x_i, h)·(C_1·||(x_i)_+^V|| − C_2·||(x_i)_−^V||)/σ²(x_i)] / [Σ_{x_i∈X_q} K(x_i, h)/σ²(x_i)],

  s(δ, h; C_1, C_2, σ(·)) := δ / [ω*_t(δ; C_1, C_2) · Σ_{x_i∈X_t} K(x_i, h)/σ²(x_i)].

When a single Lipschitz constant and a common scalar bandwidth are used, for q ∈ {t, c}, δ ≥ 0,
and C, h ∈ R_+, we define

  ω*_q(δ; C) := ω*_q(δ; C, C),
  a_q(h; C) := a_q((h, h); C, C),
  s(δ, h; C, σ(·)) := s(δ, (h, h); C, C, σ(·)).

The forms of the optimal kernel and bandwidth presented in this section result from solving the modulus-of-continuity problems considered by Donoho and Liu (1991) and Donoho (1994) in the context of minimax optimal inference, and by Cai and Low (2004) and Armstrong and Kolesár (2018a) in the context of adaptive inference. While we make the connection specific to our setting only when we prove the validity of our procedures in Appendix A, interested readers may refer to the aforementioned papers for a more general discussion.
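The objects defined above are straightforward to compute numerically. The sketch below is our own illustration, not the paper's implementation: it assumes the ℓ1 norm, and all function names and bracketing choices are ours. It implements the kernel K(z, h) and solves the defining equation for ω_q(δ; C_1, C_2) by root-finding:

```python
import numpy as np
from scipy.optimize import brentq

def pos_part(z, V):
    """(z)_+^V: apply max{., 0} to the coordinates in V, leave the rest as-is."""
    out = np.array(z, dtype=float, copy=True)
    out[:, V] = np.maximum(out[:, V], 0.0)
    return out

def l1(z):
    """An l1 norm; any norm increasing in each |z^(s)| is allowed."""
    return np.abs(z).sum(axis=1)

def kernel(z, h1, h2, V, norm=l1):
    """K(z, h) = [1 - (||(z/h1)_+^V|| + ||(z/h2)_-^V||)]_+ ; K(z) sets h1 = h2 = 1."""
    zp = norm(pos_part(z / h1, V))
    zm = norm(pos_part(-z / h2, V))   # ||(w)_-^V|| = ||(-w)_+^V|| for any norm
    return np.maximum(1.0 - (zp + zm), 0.0)

def omega(delta, C1, C2, x_q, sig2_q, V, norm=l1):
    """omega_q(delta; C1, C2): solve, for w,
    sum_i [w - C1||(x_i)_+^V|| - C2||(x_i)_-^V||]_+^2 / sigma^2(x_i) = delta^2."""
    g = C1 * norm(pos_part(x_q, V)) + C2 * norm(pos_part(-x_q, V))
    eq = lambda w: np.sum(np.maximum(w - g, 0.0) ** 2 / sig2_q) - delta ** 2
    hi = g.min() + delta * np.sqrt(sig2_q.min()) + 1.0
    while eq(hi) < 0:                 # expand the bracket until eq changes sign
        hi *= 2.0
    return brentq(eq, 0.0, hi)        # eq(0) = -delta^2 < 0 for delta > 0
```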
3 Minimax Two-sided CI

We first consider the case where the researcher is comfortable specifying a Lipschitz constant and/or the empirical context requires a two-sided CI. In this case, we recommend a minimax affine CI, the CI whose worst-case expected length is the shortest among all CIs based on affine estimators (Donoho, 1994). We refer to such a CI as the minimax CI for brevity. (Donoho (1994) shows that focusing on affine estimators is reasonable, in the sense that the gain from considering non-affine estimators is limited.)

The minimax CI is constructed from an affine estimator L̂_mm := a + Σ_{i=1}^n w_i·y_i with non-negative weights (w_i)_{i=1}^n and half-length χ_mm, i.e.,

  CI_mm = [L̂_mm − χ_mm, L̂_mm + χ_mm].

The half-length χ_mm is non-random and calibrated to maintain correct coverage uniformly over the function space F(C):

  inf_{f∈F(C)} P_f(L_RD f ∈ CI_mm) ≥ 1 − α.  (6)

Note that

  P_f(L_RD f ∈ CI_mm) = P_f( |(L̂_mm − L_RD f)/sd(L̂_mm)| ≤ χ_mm/sd(L̂_mm) ).

Under the assumption that the error term u_i has a Gaussian distribution, the random variable (L̂_mm − L_RD f)/sd(L̂_mm) is normally distributed with mean bias_f(L̂_mm)/sd(L̂_mm) and unit variance. Hence, the quantiles of |(L̂_mm − L_RD f)/sd(L̂_mm)| are maximized over f ∈ F(C) when |bias_f(L̂_mm)| is largest. Therefore, if we define cv_α(t) to be the 1 − α quantile of |Z| with Z ∼ N(t, 1), the smallest half-length χ_mm that guarantees the coverage requirement (6) is given by

  χ_mm = cv_α( sup_{f∈F(C)} |bias_f(L̂_mm)| / sd(L̂_mm) ) · sd(L̂_mm).  (7)

It remains to derive the form of the estimator L̂_mm such that χ_mm is minimized. We take L̂_mm to be the difference between two re-centered kernel regression estimators, L̂_mm = L̂_mm,t − L̂_mm,c, where

  L̂_mm,q := [Σ_{x_i∈X_q} K(x_i/h_q)·y_i/σ²(x_i)] / [Σ_{x_i∈X_q} K(x_i/h_q)/σ²(x_i)] − a_q, for each q ∈ {t, c},  (8)

for the kernel function K(·) defined in (5). Note that L̂_mm,t and L̂_mm,c correspond to estimators of f_t(0) and f_c(0), respectively. Regarding the form of the optimal kernel K: when d = 1, K(z) is the usual triangular kernel, whose optimality is discussed in Donoho (1994) and Armstrong and Kolesár (2020b). Here, we derive the optimal kernel for multi-dimensional cases as well, for any given norm and under partial or full monotonicity.

A notable difference from previous inference methods in RDDs is that the estimator L̂_mm,q is a Nadaraya–Watson estimator instead of a local linear estimator. This difference naturally arises because we work under the assumption of bounded first derivatives. In general, the local linear estimator is preferred due to the well-known issue of bias at the boundary for Nadaraya–Watson type estimators. In the context of honest inference, however, the worst-case bias is explicitly corrected for.
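The critical value cv_α(t) in (7) is the quantile of a folded normal distribution; it has no closed form but is cheap to compute numerically. A minimal sketch (ours):

```python
from scipy.optimize import brentq
from scipy.stats import norm

def cv(t, alpha=0.05):
    """cv_alpha(t): the 1 - alpha quantile of |Z| with Z ~ N(t, 1).

    Solves P(|Z| <= c) = Phi(c - t) - Phi(-c - t) = 1 - alpha for c."""
    t = abs(t)  # |Z| has the same distribution under means t and -t
    f = lambda c: norm.cdf(c - t) - norm.cdf(-c - t) - (1 - alpha)
    return brentq(f, 0.0, t + 10.0)

print(cv(0.0), cv(1.0))  # about 1.96 and 2.65 for alpha = 0.05
```

At t = 0 this reduces to the usual two-sided normal critical value; it grows with the worst-case bias-to-standard-deviation ratio.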
For the optimal choices of bandwidths (h_t, h_c) and centering terms (a_t, a_c), we can show that the minimax two-sided CI is obtained by taking

  h_q = h_q(δ) = ω*_q(δ; C)/C for each q ∈ {t, c},  (9)

for a suitable choice of δ ≥ 0, by applying the result of Donoho (1994) to our setting. Next, from the form of (7), the centering terms should be chosen such that

  sup_{f∈F(C)} bias_f(L̂_mm) = − inf_{f∈F(C)} bias_f(L̂_mm),

to minimize χ_mm; i.e., the worst-case negative and positive biases should be balanced. We can show that this is achieved by choosing

  a_q := a_q(h_q; C) for each q ∈ {t, c}.  (10)

Note that this quantity depends on δ whenever (h_t, h_c) depends on δ.

Now, let L̂_mm(δ) be the above kernel regression estimator with bandwidths (h_t(δ), h_c(δ)) and centering terms (a_t, a_c) as defined above. Under these choices, we have

  sd(L̂_mm(δ)) = s(δ, h_t(δ); C, σ(·)),
  sup_{f∈F(C)} bias_f(L̂_mm(δ)) = − inf_{f∈F(C)} bias_f(L̂_mm(δ)) = (1/2)·( C·(h_t(δ) + h_c(δ)) − δ·sd(L̂_mm(δ)) ).  (11)

Then, we choose the optimal value of δ by plugging (11) into (7) and computing the value of δ that minimizes the half-length χ_mm, say δ*, which yields the shortest half-length. Plugging δ = δ* into the bandwidth and centering-term formulas also yields the form of the estimator corresponding to this half-length. Procedure 1 summarizes our discussion of the construction of the minimax CI.

The following is the main theoretical result regarding the minimax procedure. While we consider an idealized setting with Gaussian errors and known conditional variances, such exact finite-sample results can be translated into asymptotic results under non-Gaussian errors with unknown variances, following arguments similar to those in Armstrong and Kolesár (2018c). In Section 5, we discuss in more detail how to plug in consistent estimators of the conditional variances.

Procedure 1 (Minimax Affine CI).
1. Choose a value C such that the Lipschitz continuity condition (2) is satisfied, with a suitable choice of a norm ||·|| on R^d when d > 1.
2. For each candidate δ ≥ 0, compute the bandwidths (9) and centering terms (10), which give the worst-case bias and standard deviation in (11) as functions of δ.
3. Using (11), find the value of δ that minimizes the half-length (7), say δ*.
4. Calculate the value of the estimator and the half-length by plugging in δ*, which gives the final form of the CI.

Assumption 1. {x_i}_{i=1}^n is nonrandom and u_i ∼ N(0, σ²(x_i)), where σ²(·) is known.

Theorem 3.1.
Under Assumption 1, we have

  inf_{f∈F(C)} P_f(L_RD f ∈ CI_mm) = 1 − α.

Moreover, CI_mm is the shortest among all (fixed-length) affine CIs with uniform coverage.

From (9), we see that there is a one-to-one relationship between the size of the bandwidth and the Lipschitz constant C chosen by the researcher. So choosing C is not necessarily an additional burden on the researcher if a bandwidth has to be chosen anyway. While various data-driven bandwidth choice methods exist, our way of choosing the bandwidth makes clear the relationship between the bandwidth and the function space over which the resulting CI has uniform coverage, while at the same time achieving the minimax optimal length.

It is useful to discuss the case d = 1 to illustrate the role of the monotonicity restriction in minimax optimal inference. Intuitively, under monotonicity we do not have to worry about the bias caused by functions with negative slopes, so it is optimal to use a larger bandwidth than in the case without monotonicity in order to reduce the standard error. Using our kernel function and bandwidth formulas above, we can calculate how much larger the bandwidth should be under monotonicity. When V = {1}, the kernel function in (5) is K(z) = [1 − |z|]_+, while when V = ∅ it is K_0(z) = [1 − 2|z|]_+ = K(z/(1/2)). Since K_0(z) = K(2z), the bandwidth ratio between the one under V = {1} and the one under V = ∅ is given by

  2·ω*_q(δ; C, {1}) / ω*_q(δ; C, ∅), for each q ∈ {t, c}, δ ≥ 0,

where ω*_q(δ; C, V) denotes the value of ω*_q(δ; C) when the monotonicity restriction holds for the index set V. Following Kwon and Kwon (2020), we can show that this quantity is approximately 2^{2/3} ≈ 1.59 for large n. So if we believe the mean potential outcome functions are monotone, it is optimal to use a bandwidth about 60% larger than what should be used without monotonicity.

By a similar argument, the length of the minimax CI becomes shorter when we only consider the space of monotone functions. Since the length of the CI is a fixed quantity, we can easily compare its length under the shape restriction to the one without it. Figure 1 considers the cases d = 1 and 2, for the treatment design where individuals are treated when the values of all the running variables are negative. The CI that does not take monotonicity into account can be as much as 30% longer for d = 1 and 40% longer for d = 2. Considering the prevalence of RDDs with monotone regression functions, this efficiency gain demonstrates the importance of incorporating the shape restriction when constructing minimax CIs.

Figure 1: Comparison of the minimax lengths with and without monotonicity. (Two panels plot CI length against the Lipschitz coefficient C ∈ [0, 5] for d = 1 and d = 2, comparing the monotone and non-monotone procedures.) For the design of the running variable(s), 500 observations were generated from the uniform distribution over [−1, 1]^d, for d = 1, 2. When d = 2, we set V = {1, 2} and used the ℓ1 norm. The lengths were calculated for σ(x) = 1.
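To make the construction concrete, the following sketch assembles Procedure 1 for d = 1, V = {1}, and treatment assigned when x_i < 0. It reuses omega() and cv() from the earlier sketches; the optimization bounds and the sine/cosine parametrization of the split (δ_t, δ_c) are our own implementation choices, not the paper's:

```python
import numpy as np
from scipy.optimize import minimize_scalar
# assumes omega() and cv() from the sketches in Sections 2 and 3 are in scope

def minimax_ci(x, sig2, C, alpha=0.05):
    """Procedure 1 sketch for d = 1, V = {1}, treated iff x < 0."""
    t, c, V = x < 0, x >= 0, [0]
    def pieces(delta, theta):
        # split delta_t = delta sin(theta), delta_c = delta cos(theta)
        wt = omega(delta * np.sin(theta), C, C, x[t, None], sig2[t], V)
        wc = omega(delta * np.cos(theta), C, C, x[c, None], sig2[c], V)
        return wt, wc
    def chi(delta):
        # choose the split maximizing the total modulus omega_t + omega_c
        th = minimize_scalar(lambda th: -sum(pieces(delta, th)),
                             bounds=(1e-3, np.pi / 2 - 1e-3), method="bounded").x
        wt, wc = pieces(delta, th)
        ht, hc = wt / C, wc / C                          # bandwidths, (9)
        Kt = np.maximum(1 - np.abs(x[t]) / ht, 0.0)      # triangular kernel
        sd = delta / (wt * np.sum(Kt / sig2[t]))         # sd(L_mm(delta))
        bias = 0.5 * (C * (ht + hc) - delta * sd)        # worst-case bias, (11)
        return cv(bias / sd, alpha) * sd                 # half-length, (7)
    opt = minimize_scalar(chi, bounds=(0.5, 50.0), method="bounded")
    return opt.x, opt.fun                                # (delta*, chi_mm)
```

The estimator itself then follows from (8), using the bandwidths h_q(δ*) and centering terms a_q(h_q(δ*); C).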
4 Adaptive One-sided CI

A prominent feature of the minimax two-sided CI is that it is a fixed-length CI, in the sense that its length is determined before observing the realization of the outcomes {y_i}_{i=1}^n. The length of the CI depends on the Lipschitz constant C chosen by the researcher: the larger the value of C, the longer the CI. Hence, the minimax CI may become too wide to use when a researcher wishes to strengthen the credibility of the inference results by setting a conservative value of the Lipschitz constant. In the extreme case where a researcher sets C = ∞, the CI is necessarily the entire real line, providing no information.

To deal with this issue, we provide a method to construct an adaptive one-sided CI. The CI can be made to maintain coverage over a function space with a large Lipschitz constant C, even allowing C = ∞. Moreover, as long as the true regression function lies in a smoother class, the length of the CI does not depend on C when d = 1, and nearly so when d > 1.
In Section 4.1, we discuss this property in more detail and provide a simple condition on the relationship between the treatment allocation rule and the direction of monotonicity under which it holds. This property ensures that a researcher can strengthen the credibility of the inference results by considering a large function space without ending up with an uninformative CI.

Furthermore, the one-sided CI can be made to utilize the information about the smoothness of the unknown regression function f contained in the observed outcomes {y_i}_{i=1}^n. Using this information, the length of the CI is adjusted accordingly, unlike the minimax optimal CI whose length is fixed regardless of the realized outcomes. CIs possessing this type of property are called adaptive CIs, following Cai and Low (2004). This property further improves the usefulness of the proposed one-sided CI: its length is not only (nearly) independent of C but also shrinks when the true regression function has a smaller first-derivative bound.

We focus our discussion on the construction of adaptive lower CIs and the related treatment allocation rules. In many RDD applications, researchers are interested in how significantly larger than 0 the true treatment effect is; in this context, a lower one-sided CI provides useful information. Upper CIs can be dealt with in an analogous manner but require different (or "opposite") treatment allocation rules. Lastly, the "length" of a one-sided CI refers to the distance between the true parameter L_RD f and its endpoint, referred to as the "excess length" in Armstrong and Kolesár (2018a).

4.1 Treatment allocation rules

In this section, we describe in more detail the treatment allocation rules under which it is possible to construct a lower CI with uniform coverage over F(C) but with length that does not necessarily increase with C. Specifically, given a smaller Lipschitz constant C′ such that 0 ≤ C′ < C, we ask when it is possible to construct a lower CI whose worst-case expected length over F(C′) does not grow with C. This property ensures that more credibility does not necessarily lead to wider CIs.

To give intuition, we first describe the argument in a simple setting with a single running variable and cutoff point x_0 = 0. Consider a lower CI [L̂ − χ_L, ∞) constructed by subtracting a constant χ_L from a linear estimator L̂. In order to maintain uniform coverage over the function space F(C), we must have

  χ_L = sup_{f∈F(C)} bias_f(L̂) + sd(L̂)·z_{1−α}.

Note that the estimator L̂ we consider is the difference between estimators of f_t(0) and f_c(0), say f̂_t(0) − f̂_c(0).

Now, suppose individuals receive the treatment if and only if x_i < 0.
A key property under this design is that f̂_t(0) always has a negative bias if f_t(x) is increasing, since f̂_t(0) is calculated only from observations with x_i < 0. Similarly, f̂_c(0) always has a positive bias when f_c(x) is increasing. Therefore, the bias of L̂ over f ∈ F(C) is always negative, regardless of the value of C specified by the researcher. Hence, a one-sided lower CI that maintains uniform coverage over F(C) can be formed to be independent of C. We can also easily see that this argument no longer holds when individual i is treated if and only if x_i ≥ 0, in which case the maximum bias over f ∈ F(C) increases with C.
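This sign argument is easy to check by simulation. In the toy sketch below (ours), the Nadaraya–Watson estimate of f_t(0) computed from treated observations x_i < 0 is systematically below the true value f_t(0) = 0 when f_t is increasing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, h = 20_000, 1.0, 0.3
x = rng.uniform(-1.0, 0.0, n)           # treated observations only: x_i < 0
y = C * x + rng.standard_normal(n)      # increasing f_t(x) = C x, so f_t(0) = 0
w = np.maximum(1.0 - np.abs(x) / h, 0)  # triangular kernel weights at the cutoff
print((w @ y) / w.sum())                # about -C*h/3 = -0.1: the bias is negative
```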
We now state this idea formally, including the case d > 1. Given some C̃ > 0, let L_α(C̃) denote the set of lower CIs (as functions of {y_i}_{i=1}^n and {x_i}_{i=1}^n) with uniform coverage probability 1 − α over F(C̃), i.e.,

  CI ∈ L_α(C̃) ⟺ inf_{f∈F(C̃)} P(L_RD f ∈ CI) ≥ 1 − α.

Define

  ℓ(C′; C) := inf_{[ĉ, ∞) ∈ L_α(C)} sup_{f∈F(C′)} E[L_RD f − ĉ],  (12)

where C′ ≤ C. This is the worst-case expected length of the CI that 1) has correct uniform coverage over F(C) and 2) has the smallest worst-case expected length over F(C′). Armstrong and Kolesár (2018a) calculate this quantity in terms of the modulus of continuity defined in Appendix A. The question we ask here is under what conditions the quantity ℓ(C′; C) can be viewed as independent of C. In other words, we characterize when it is possible to construct a one-sided lower CI whose length does not grow with C when the regression function belongs to a smoother function space.

Lemma 4.1.
Suppose both X_t and X_c are non-empty and Assumption 1 holds. Then, given a pair of Lipschitz constants (C′, C) with C′ < C, there exists a constant A(C′), independent of C, such that ℓ(C′; C) ≤ A(C′), if and only if the following hold: 1) there exists x_i ∈ X_t such that x_i ∈ (−∞, 0]^d, 2) there exists x_i ∈ X_c such that x_i ∈ [0, ∞)^d, and 3) the regression function is fully monotone, i.e., V = {1, ..., d}.

Under the conditions of Lemma 4.1, the researcher can construct a one-sided CI whose worst-case expected length is not too large if the true regression function belongs to the smoother space F(C′), while maintaining a stringent coverage requirement by specifying a large C. When d = 1, it is possible to take A(C′) = ℓ(C′; C′), with the inequality in Lemma 4.1 holding with equality, as suggested by the intuition discussed above. That is, C does not affect the size of ℓ(C′; C) at all when d = 1.
On the other hand, when d > 1, the same property does not hold unless we have observations only over [0, ∞)^d and (−∞, 0]^d, which is usually not the case in practice. Therefore, specifying a larger C translates into a longer CI (by giving less weight to observations outside [0, ∞)^d ∪ (−∞, 0]^d) when d > 1.
However, there is an upper bound on how much this length can grow with C. This upper bound is independent of C, and thus the worst-case length of the CI is nearly independent of C.

In words, the conditions in Lemma 4.1 imply that "the more disadvantaged group should get treated." Such RDD settings are easily found in the education literature, where students or schools with lower academic performance receive some kind of support (Chay et al., 2005; Chiang, 2009; Jacob and Lefgren, 2004; Leuven et al., 2007; Matsudaira, 2008); in the environmental economics literature, where counties with high pollution levels are exposed to environmental regulations (Chay and Greenstone, 2005; Greenstone and Gallagher, 2008); and in poverty programs that provide funds to those in need (Ludwig and Miller, 2007). Lastly, we note that when the mean potential outcome functions are decreasing rather than increasing, we may simply switch the conditions for X_t and X_c in Lemma 4.1.
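In practice, the conditions of Lemma 4.1 are simple to check from the data. A minimal sketch (ours; the argument conventions are hypothetical):

```python
import numpy as np

def lemma41_conditions(x, treated, V):
    """Check conditions 1)-3) of Lemma 4.1 for an adaptive lower CI.

    x: (n, d) running variables; treated: (n,) boolean; V: monotone indices."""
    n, d = x.shape
    cond1 = bool(np.any(np.all(x[treated] <= 0, axis=1)))   # some x_i in X_t with x_i <= 0
    cond2 = bool(np.any(np.all(x[~treated] >= 0, axis=1)))  # some x_i in X_c with x_i >= 0
    cond3 = sorted(V) == list(range(d))                     # full monotonicity
    return cond1 and cond2 and cond3
```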
4.2 Construction of the adaptive CI

Under the treatment allocation rules considered in the previous section, it is possible to construct a one-sided CI that is optimal for a single function space F(C′) for some C′, with its length independent or nearly independent of the larger Lipschitz constant C. Ideally, we would use a one-sided CI that performs well over a range of function spaces, corresponding to a range of Lipschitz constants C′ ∈ [C̲, C̄] for some specified bounds C̲ and C̄. To this end, we define a class of adaptive procedures based on multiple CIs that solve the optimization problem (12) for different values of the Lipschitz constant, C′ ∈ {C_j}_{j=1}^J, with C̲ ≤ C_1 < ⋯ < C_J ≤ C̄. For now, we take (C_j)_{j=1}^J as given; we discuss later how to choose this sequence in an optimal way.

To introduce our procedure, we first characterize the solution to (12), applying the general result of Armstrong and Kolesár (2018a) to our setting. Given a value of C′, it turns out that the lower CI solving (12) is based on a linear estimator L̂(C′) := a(C′) + Σ_{i=1}^n w_i(C′)·y_i with non-negative weights w_i(C′) and a centering constant a(C′). Given some τ ∈ (0, 1), the optimal lower CI, which maintains coverage 1 − τ over F(C), takes the form

  ĉ_{L,τ}(C′) = L̂_τ(C′) − sup_{f∈F(C)} bias_f(L̂_τ(C′)) − z_{1−τ}·sd(L̂_τ(C′)).  (13)

Here, we make the dependence on τ explicit because we later choose a suitable τ < α when J > 1.

For the estimator, we take L̂_τ(C′) to be the difference between two kernel regression estimators, L̂_τ(C′) = L̂_{t,τ}(C′) − L̂_{c,τ}(C′), where

  L̂_{q,τ}(C′) := [Σ_{x_i∈X_q} K(x_i, h_{q,τ}(C′))·y_i/σ²(x_i)] / [Σ_{x_i∈X_q} K(x_i, h_{q,τ}(C′))/σ²(x_i)], for each q ∈ {t, c}.

The optimal bandwidths are given by

  h_{t,τ}(C′) = ω*_t(z_{1−τ}; C, C′)·(1/C, 1/C′) and h_{c,τ}(C′) = ω*_c(z_{1−τ}; C′, C)·(1/C′, 1/C),  (14)

which completes the definition of the estimator L̂_τ(C′). The worst-case bias and the standard deviation of L̂_τ(C′) are given by

  sd(L̂_τ(C′)) = s(z_{1−τ}, h_{t,τ}(C′); C, C′, σ(·)),
  sup_{f∈F(C)} bias_f(L̂_τ(C′)) = a_t(h_{t,τ}(C′); C, C′) − a_c(h_{c,τ}(C′); C′, C) + (1/2)·[ω*_t(z_{1−τ}; C, C′) + ω*_c(z_{1−τ}; C′, C) − z_{1−τ}·sd(L̂_τ(C′))],  (15)

which completes the description of the one-sided CI solving (12).

The CI [ĉ_{L,τ}(C′), ∞) is optimal in terms of worst-case excess length over F(C′) for a single Lipschitz constant C′. We construct a CI that performs well over a collection of function spaces by taking the intersection of CIs of the form [ĉ_{L,τ}(C′), ∞) for different values of C′. Note that taking the intersection "picks out" the shortest CI among the multiple CIs formed. This is roughly equivalent to inferring from the data the function space to which the true regression function belongs and using the CI that performs well over that space.

The value of τ must be calibrated so that the resulting CI maintains correct coverage probability 1 − α after taking the intersection. Suppose we have a collection of J CIs {[ĉ_{L,τ}(C_j), ∞)}_{j=1}^J. A simple way to intersect them is a Bonferroni procedure with τ = α/J. This procedure, however, is conservative, since the correlations among the estimators L̂_τ(C_1), ..., L̂_τ(C_J) are positive, and highly so when C_j and C_k are close. To calibrate τ taking this positive correlation into account, let (V_{j,τ})_{j=1}^J denote a J-dimensional multivariate normal random vector with zero means, unit variances, and covariances given by

  Cov(V_{j,τ}, V_{k,τ}) = [Σ_{x_i∈X_t} K(x_i, h_{t,τ}(C_j))·K(x_i, h_{t,τ}(C_k))/σ²(x_i)] / [z²_{1−τ}/(ω*_t(z_{1−τ}; C, C_j)·ω*_t(z_{1−τ}; C, C_k))]
    + [Σ_{x_i∈X_c} K(x_i, h_{c,τ}(C_j))·K(x_i, h_{c,τ}(C_k))/σ²(x_i)] / [z²_{1−τ}/(ω*_c(z_{1−τ}; C_j, C)·ω*_c(z_{1−τ}; C_k, C))].  (16)

Then, we can show that if we take τ* to be the value of τ that solves

  P( max_{1≤j≤J} V_{j,τ} > z_{1−τ} ) = α,  (17)

the CI obtained by taking the intersection has correct coverage. Regarding the solution to this equation, we can in fact show that Cov(V_{j,τ}, V_{k,τ}) does not depend on τ as n → ∞, which implies that max_{1≤j≤J} V_{j,τ*} converges in distribution to a random variable V_max that does not depend on τ. (In a simulation exercise not reported, we find that this asymptotic approximation works well for moderate sample sizes such as n = 100.)
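Solving (17) is straightforward by Monte Carlo once the covariance matrix (16) has been computed, since the equation says exactly that z_{1−τ*} is the (1 − α)-quantile of max_j V_j. The sketch below (ours) exploits this, and also forms the intersected endpoint of Theorem 4.2 from precomputed estimates, worst-case biases, and standard deviations:

```python
import numpy as np
from scipy.stats import norm

def tau_star(Sigma, alpha=0.05, n_draws=200_000, seed=0):
    """Solve (17) by Monte Carlo: P(max_j V_j > z_{1-tau}) = alpha.

    Sigma: the J x J correlation matrix of (V_1, ..., V_J) from (16),
    assumed positive definite so the Cholesky factor exists."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    vmax = (rng.standard_normal((n_draws, Sigma.shape[0])) @ L.T).max(axis=1)
    # (17) holds iff z_{1-tau} equals the (1-alpha)-quantile of max_j V_j
    return 1.0 - norm.cdf(np.quantile(vmax, 1.0 - alpha))

def intersect_lower_cis(L_hat, worst_bias, sd, tau):
    """(13) and Theorem 4.2: per-C_j lower endpoints, then take the largest."""
    endpoints = L_hat - worst_bias - norm.ppf(1.0 - tau) * sd  # arrays over j
    return endpoints.max()
```

As sanity checks: with J = 1 this gives τ* = α, while with nearly uncorrelated estimators it approaches the Bonferroni choice α/J.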
The following is the main theoretical result for our intersection CI.

Theorem 4.2.
Given {C_1, ..., C_J}, C, and α ∈ (0, 1), let τ* be defined as in (17). Then, under Assumption 1, if we let ĉ := max_{1≤j≤J} ĉ_{L,τ*}(C_j), we have

  inf_{f∈F(C)} P_f(L_RD f ∈ [ĉ, ∞)) ≥ 1 − α.

The next result, which is immediate from Armstrong and Kolesár (2018a), shows that the class of adaptive procedures we consider is a reasonable one. It states that each of the CIs [ĉ_{L,τ*}(C_j), ∞) is an optimal CI when f ∈ F(C_j), except that it covers the true parameter with probability 1 − τ* instead of 1 − α, which is the price we pay to adapt to multiple Lipschitz classes.

Corollary 4.3.
Under Assumption 1, [ĉ_{L,τ*}(C_j), ∞) solves

  inf_{[ĉ, ∞) ∈ L_{τ*}(C)} sup_{f∈F(C_j)} E_f[L_RD f − ĉ].

Again, considering the simple case where d = 1 and V = {1} provides intuition for our adaptive procedure. In this case, L̂_{q,τ}(C′) is simply a kernel regression estimator with the triangular kernel K(z) = [1 − |z|]_+ and bandwidths

  h_{t,τ}(C′) = ω*_t(z_{1−τ}; C, C′)/C′,  h_{c,τ}(C′) = ω*_c(z_{1−τ}; C′, C)/C′.  (18)

Applying the results of Kwon and Kwon (2020), we can show that these quantities are approximately c·(C′)^{−2/3}, for some constant c, when n is large. We thus construct estimators with varying bandwidths and compare the lengths of the resulting one-sided CIs. The estimator with a smaller bandwidth is the one constructed to perform well when the Lipschitz constant of the regression function is large, and vice versa for the estimator with a larger bandwidth. This is because, when the Lipschitz constant of the regression function is large, the excess length of a one-sided CI is reduced by taking a smaller bandwidth to decrease the absolute size of the bias; when the Lipschitz constant is small, reducing the standard deviation by taking a larger bandwidth matters more for the excess length. This idea is similar to the bandwidth snooping procedure suggested by Armstrong and Kolesár (2018b), whose results focus on the case of a single running variable and adaptation to the Hölder exponent.

Remark 4 (Specifying C). When d = 1, the worst-case bias is 0 under the treatment allocation rules in Section 4.1. Therefore, a researcher can set C = ∞ and not worry about correct coverage. On the other hand, when d > 1,
the size of C governs how large the weights on observations outside [0, ∞)^d and (−∞, 0]^d should be. A larger C puts smaller weights on those observations, leading to a wider CI, with C = ∞ corresponding to using only the observations in [0, ∞)^d and (−∞, 0]^d. However, the length of the adaptive CI shrinks when the true regression function has a smaller first-derivative bound even if C is set to a large number, alleviating the concern that a large C might lead to a less informative CI.

Remark 5 (Adaptation in multi-dimensional RDDs). Consider the setting of Remark 2, where the Lipschitz continuity is specified by the weighted ℓ1 norm with Lipschitz constants C_1, ..., C_d. Then, our adaptive procedure allows one to consider different values of C other than C = 1. Note that this takes the ratios C_1/C_s as given for s = 2, ..., d. Adapting to different values of C_1/C_s is an interesting extension that we do not pursue here.

4.3 Choice of an adaptive procedure

In this section, we discuss the choice of the Lipschitz constants (C_j)_{j=1}^J, which concludes our definition of the adaptive one-sided CI. The choice of function spaces to adapt to is especially relevant in our setting, unlike in the previous literature on adaptive inference. The previous literature has mostly focused on rate-adaptation, the problem of constructing a CI that shrinks at an optimal rate. For example, Armstrong (2015) and Armstrong and Kolesár (2018b) discuss adaptive testing and the construction of adaptive CIs for the RD parameter that adapt to Hölder exponents β ∈ (0, 1]. Since we fix β = 1 and adapt to Lipschitz constants, the convergence rate is always n^{−1/(2+d)}, and what matters is the actual length of the CIs, not their rate of convergence.

The optimal but infeasible adaptive CI is a CI = [ĉ, ∞) such that

  sup_{f∈F(C′)} E[L_RD f − ĉ] = ℓ(C′; C) for all C′ ∈ [C̲, C̄],

where ℓ(C′; C) is defined in (12). This is infeasible since the form of the CI that is optimal over F(C′) differs from the one that is optimal over F(C″) for C′ ≠ C″. Instead, our aim is to construct a CI [ĉ, ∞) such that

  ℓ_adpt(C′) := sup_{f∈F(C′)} E[L_RD f − ĉ]

is close to ℓ(C′; C) over C′ ∈ [C̲, C̄], given some (C̲, C̄),
when ℓ_adpt(C′) and ℓ(C′; C) are viewed as functions of C′.

By restricting the class of one-sided CIs to those considered in Section 4.2, we can show that there is a measure of distance between ℓ_adpt(·) and ℓ(·; C) that is both reasonable and easy to calculate. To be specific, let C = (C_j)_{j=1}^J denote a sequence of Lipschitz constants used to construct our adaptive CI, and denote the endpoint of the resulting CI by ĉ(C). Then write

  ℓ_adpt(C′; C) := sup_{f∈F(C′)} E[L_RD f − ĉ(C)],

and consider the following quantity measuring the "distance" between ℓ_adpt(·; C) and ℓ(·; C):

  ∆(C) := sup_{C′∈[C̲, C̄]} ℓ_adpt(C′; C) / ℓ(C′; C).  (19)

Note that this criterion is consistent with the previous literature, which compares the performance of different confidence intervals using a ratio measure (Cai and Low, 2004; Armstrong and Kolesár, 2018a). Specifically, ∆(C) satisfies

  ℓ_adpt(C′; C) ≤ ∆(C)·ℓ(C′; C) for all C′ ∈ [C̲, C̄].

This is precisely the notion of adaptive CIs introduced in Cai and Low (2004). Since the rates of ℓ_adpt(C′; C) and ℓ(C′; C) are the same in our setting, the choice of an adaptive procedure should be based on the size of the constant ∆(C).

An advantage of focusing on the class of adaptive CIs proposed in Section 4.2 is that ∆(C) is easy to evaluate, as shown in the following proposition.

Proposition 4.4. We have ℓ_adpt(C′; C) = E[min_{j≤J} U_j], where U := (U_1, ..., U_J)′ is a Gaussian random vector with known mean and variance. Furthermore, we have

  ℓ(C′; C) = ω*_t(z_{1−α}; C, C′) + ω*_c(z_{1−α}; C′, C).

(The expressions for the mean and variance of U are given in the proof.)

Based on the discussion so far, we recommend choosing {C_j}_{j=1}^J based on the value of ∆(C). When calculating this value, searching over all possible sequences C is infeasible in practice. Instead, we suggest taking C = C(J) to be the equidistant grid of size J on [C̲, C̄] and calculating ∆(C(J)) for different values of J. Our simulation study (not reported) suggests that the gain from increasing J becomes very small beyond some threshold. Hence, a computationally attractive procedure is to increase J until the additional gain from using J + 1 instead of J is smaller than a tolerance parameter.

Procedure 2 (Adaptive One-sided CI).
1. Choose C so that the Lipschitz continuity condition (2) is satisfied, with a suitable choice of a norm ||·|| on R^d when d > 1. When d = 1, we can set C = ∞; see Remark 4 in Section 4.2.
2. Choose values of C̲ and C̄ such that 0 ≤ C̲ < C̄ ≤ C; this is the region where the adaptive CI will be close to optimal. (A reasonable value of C̲ can be estimated; see Appendix B.)
3. Let C(J) be the equidistant grid of size J over [C̲, C̄]. Starting from J = 2, increase J by 1 until |∆(C(J)) − ∆(C(J − 1))| ≤ ε, for a tolerance level ε.
4. Let J* be the value of J at which Step 3 stops. Use C(J*) = (C_j)_{j=1}^{J*} as the sequence of Lipschitz constants for the adaptive CI.
5. Using (13), (14), (15), and (17), obtain the adaptive one-sided CI [max_{1≤j≤J*} ĉ_{L,τ*}(C_j), ∞).
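Step 3 of Procedure 2 is a simple stopping rule. A sketch (ours), with ∆ supplied as a callable that evaluates (19), e.g., by simulating E[min_j U_j] as in Proposition 4.4:

```python
import numpy as np

def choose_grid(Delta, C_lo, C_hi, eps=0.01, J_max=50):
    """Step 3 of Procedure 2: grow the equidistant grid until Delta(C(J)) stabilizes."""
    prev, grid = None, None
    for J in range(2, J_max + 1):
        grid = np.linspace(C_lo, C_hi, J)   # equidistant grid of size J
        cur = Delta(grid)
        if prev is not None and abs(cur - prev) <= eps:
            return grid                      # stop: J* = J
        prev = cur
    return grid                              # fall back to the largest grid tried
```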
In the simulation designs of Section 5, the largest value of ∆(C(J*)) across all scenarios is less than 1.07, where J* is the value of J obtained by the method described above. That is, in the data generating processes considered in Section 5, the worst-case length of the adaptive CI is at most 7% longer than the worst-case length of the CI we would have used had we known the true Lipschitz constant.

Note that our procedure requires a researcher to specify the values of C̲ and C̄. For C̲, while it can be set to 0, a reasonable value can be estimated from the data, as discussed in Appendix B. The value of C̄ defines the region of Lipschitz constants over which the adaptive CI is intended to perform well; a researcher may therefore choose C̄ to be some non-conservative potential value of the Lipschitz constant of the regression function. As discussed above, the coverage probability is not affected by the value of C̄, so choosing C̄ is not a burdensome task. Procedure 2 summarizes the steps for constructing the adaptive one-sided CI, including the choice of the Lipschitz spaces to adapt to.

5 Simulations

We investigate the performance of our minimax and adaptive procedures via a simulation study. We focus on the case where d = 1, with individuals treated if x_i < 0.
We restrict the support of x_i to [−1, 1]. Writing θ ≡ L_RD f, we consider the following designs:

1. Linear design: f_c(x) = Cx, f_t(x) = f_c(x) + θ. We have |f′_q(x)| = C and |f″_q(x)| = 0 for all x ∈ [−1, 1] and q ∈ {t, c}.

2. Modified specification of Armstrong and Kolesár (2018c): given some "knots" (b_1, b_2) such that 0 < b_1 < b_2, define

  f_c(x) = (3C/2)·(x² − 2[(x − b_1)_+]² + 2[(x − b_2)_+]²),  f_t(x) = −f_c(−x) + θ.

If b_1 ≥ b_2/2, both functions are increasing. Taking (b_1, b_2) = (1/3, 2/3) gives |f′_q(x)| ≤ C and |f″_q(x)| = 3C for all x ∈ [−1, 1] and q ∈ {t, c}. We also have |f′_q(0)| = 0.

3. Modified specification of Babii and Kumar (2020): define

  f_c(x) = C·(x³ + x)/4,  f_t(x) = f_c(x) + θ.

We have |f′_q(x)| ≤ C and |f″_q(x)| = 3C|x|/2 for all x ∈ [−1, 1] and q ∈ {t, c}. We also have |f′_q(0)| = C/4 and |f″_q(0)| = 0.

4. Nonzero first and second derivatives at 0: define

  f_c(x) = C·((x + 1)³ − 1)/3,  f_t(x) = −f_c(−x) + θ.

We have |f′_q(x)| ≤ 4C and |f″_q(x)| ≤ 4C for all x ∈ [−1, 1] and q ∈ {t, c}. We also have |f′_q(0)| = C and |f″_q(0)| = 2C.
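Collecting the four designs as code (our own sketch; the scale constants here are pinned down by the derivative properties stated above, and the helper names are ours):

```python
import numpy as np

def pp(z):  # positive part [z]_+
    return np.maximum(z, 0.0)

def f1_c(x, C): return C * x
def f2_c(x, C, b1=1/3, b2=2/3):
    return 1.5 * C * (x**2 - 2 * pp(x - b1)**2 + 2 * pp(x - b2)**2)
def f3_c(x, C): return C * (x**3 + x) / 4
def f4_c(x, C): return C * ((x + 1)**3 - 1) / 3

def make_f(f_c, theta, flip):
    """Build f on [-1, 1]: f_t = f_c + theta, or f_t(x) = -f_c(-x) + theta if flip."""
    f_t = (lambda x, C: -f_c(-x, C) + theta) if flip else (lambda x, C: f_c(x, C) + theta)
    return lambda x, C: np.where(x < 0, f_t(x, C), f_c(x, C))  # treated iff x < 0

theta = 0.5  # illustrative jump size
f1, f2 = make_f(f1_c, theta, flip=False), make_f(f2_c, theta, flip=True)
f3, f4 = make_f(f3_c, theta, flip=False), make_f(f4_c, theta, flip=True)
```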
              x_i ∼ unif(−1, 1)   Constant variance   C is small
  Design 1          YES                 YES               YES
  Design 2          NO                  YES               YES
  Design 3          YES                 NO                YES
  Design 4          NO                  NO                YES
  Design 5          YES                 YES               NO
  Design 6          NO                  YES               NO
  Design 7          YES                 NO                NO
  Design 8          NO                  NO                NO

Table 1: Simulation design specifications

                        f = f_1                              f = f_2
             Length:      RBC       Minimax      Length:      RBC       Minimax
             RBC/MM     coverage    coverage     RBC/MM     coverage    coverage
  Design 1    1.094       0.925       0.968       1.097       0.924       0.942
  Design 2    1.112       0.938       0.979       1.115       0.941       0.949
  Design 3    1.082       0.926       0.968       1.085       0.924       0.942
  Design 4    1.099       0.936       0.979       1.103       0.940       0.950
  Design 5    1.128       0.925       0.919       1.148       0.930       0.945
  Design 6    1.137       0.938       0.934       1.162       0.941       0.951
  Design 7    1.117       0.926       0.920       1.139       0.927       0.945
  Design 8    1.125       0.936       0.935       1.153       0.941       0.952

Table 2: Comparison between RBC and minimax; f ∈ {f_1, f_2}
For the running variables, we consider $x_i \sim \mathrm{unif}(-1, 1)$ and $x_i \sim 2 \times \mathrm{Beta}(2, \cdot) - 1$, a rescaled Beta distribution supported on $[-1, 1]$. For the conditional standard deviation, we consider $\sigma_1(x) = 1$ and $\sigma_2(x) = \phi(x)/\phi(0)$, where $\phi(x)$ is the standard normal pdf. The sample size is $n = 500$. Figure 4 in Appendix D provides plots for the four regression functions.

We estimate the conditional variance based on local constant kernel regression, where the initial bandwidth is chosen based on Silverman's rule of thumb. This is to avoid using a bandwidth selection method based on local linear regression, to ensure that our proposed method works even when the second derivative is very large. After the estimators are calculated based on the estimated conditional variance, we construct the CIs following Armstrong and Kolesár (2020b), who use a simple way to estimate the variance by
$$\hat{\sigma}^2(x_i) = \frac{J}{J + 1}\left( y_i - \frac{1}{J}\sum_{m=1}^J y_{j_m(i)} \right)^2,$$
for some fixed $J$, where $j_m(i)$ denotes the index of the $m$-th closest observation to $i$ (with the same treatment status). The default value in their implementation is $J = 3$, which we follow.
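A minimal sketch of this nearest-neighbor variance estimator for $d = 1$ is given below, assuming the inputs are the running variable and outcome within a single treatment group; the helper name `nn_variance` is ours.

import numpy as np

def nn_variance(x, y, J=3):
    # Nearest-neighbor variance estimator of Armstrong and Kolesár (2020b):
    # sigma^2(x_i) = J/(J+1) * (y_i - mean of the J closest y's)^2,
    # computed within a treatment group (pass each group's data separately).
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sig2 = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        d[i] = np.inf                    # exclude the observation itself
        nbrs = np.argsort(d)[:J]         # indices of the J closest observations
        sig2[i] = J / (J + 1) * (y[i] - y[nbrs].mean())**2
    return sig2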
                     f = f_3                           f = f_4
          Length:     RBC       Minimax     Length:     RBC       Minimax
          RBC/MM     coverage   coverage    RBC/MM     coverage   coverage
Design 1   1.090      0.925      0.949       1.093      0.924      0.968
Design 2   1.109      0.937      0.951       1.111      0.938      0.977
Design 3   1.078      0.925      0.949       1.081      0.924      0.968
Design 4   1.096      0.936      0.953       1.098      0.936      0.977
Design 5   1.094      0.923      0.962       1.119      0.920      0.930
Design 6   1.112      0.937      0.971       1.132      0.935      0.947
Design 7   1.082      0.924      0.961       1.107      0.924      0.930
Design 8   1.099      0.936      0.971       1.119      0.933      0.947

Table 3: Comparison between RBC and minimax; f ∈ {f_3, f_4}

First, we investigate the performance of the minimax procedure described in Section 3. We consider different combinations of specifications for the running variable, the variance function, and the value of $C$: 1) $x_i$ follows either a uniform or a Beta distribution, 2) the variance function is given by either $\sigma_0 \times \sigma_1(x)$ or $\sigma_0 \times \sigma_2(x)$, where $\sigma_0$ is a fixed scale constant, and 3) $C$ is either small ($C = 1$) or large ($C = 3$).

For the minimax two-sided CI, we consider the case where we correctly specify the Lipschitz constant in all cases, setting the bound equal to the true value of $C$. The RBC CIs are constructed using the R package rdrobust.

Tables 2 and 3 display the results. For each regression function, the first column shows the ratio of the lengths of the CIs, and the second and third columns show the coverage probabilities of the RBC and the minimax procedures. Despite the fact that the minimax CIs are shorter than the RBC CIs, the coverage probabilities of the minimax CIs are closer to the nominal level than those of the RBC CIs. However, the length comparison here should not be interpreted as establishing the superiority of our procedure over the RBC method, since the lengths are sensitive to the choice of $C$ and to the form of the true regression function. Rather, as we have discussed so far, the relative advantage of our procedure is the use of a more transparent function space, and the incorporation of the monotonicity of the regression function.

[Figure 2: Performance comparison of one-sided CIs. Each panel plots the excess length ratio against the true $C$ over $[0.5, 2]$, for $f = f_1, \ldots, f_4$, for the adaptive and minimax procedures.]
Next, we compare the performance of the one-sided CI to two benchmarks, the minimax and the oracle one-sided CIs. Specifically, we consider the regression functions $f_1, \ldots, f_4$ and vary the value of $C \in [C_l, C_u]$, which determines their smoothness. For each $C \in [C_l, C_u]$, the oracle procedure adapts to the single Lipschitz constant $C$, as if we knew the true Lipschitz constant, which is infeasible in practice. On the other hand, the minimax procedure adapts to the largest Lipschitz constant $C_u$; this procedure is nearly optimal (among feasible procedures) when we do not have monotonicity, as shown in Armstrong and Kolesár (2018a). Both the oracle and the minimax procedures take monotonicity into account.

We take $(C_l, C_u) = (1/2, 2)$ and consider only Design 1. For the adaptive procedure, we take $(\underline{C}, \bar{C}) = (1/2, 2)$. As Figure 2 shows, the adaptive CI performs relatively better over $C \in [1/2, 1]$ than over $C \in [1, 2]$. For $f = f_2$ and $f = f_3$, the adaptive CIs are sometimes even shorter than the oracle. This is because the derivatives of such regression functions near $x = 0$ are smaller than $C$, although their maximum derivatives equal $C$ globally. This shows that the adaptive CI adapts to the local smoothness of the regression function, which is another advantage of using the adaptive procedure.

6 Empirical Application

In this section, we revisit the analysis of Lee (2008). The running variable $x_i \in [-1, 1]$ is the Democratic margin of victory in a US House election, and the outcome $y_i \in [0, 1]$ is the Democratic vote share in the next election. Since $x_i$ is the vote margin instead of the vote share, a one-for-one bound in terms of the vote share translates to setting $C = 1/2$ in terms of the margin, because a unit change in the vote share corresponds to a two-unit change in the margin. We report results for $C \in [0.4, 1]$, with our preferred specification being $C = 1/2$.
We also include CIs using the robust bias correction (RBC) method of Calonico et al. (2015) and the minimax CI using a second derivative bound as in Armstrong and Kolesár (2018a). For the latter, the bound on the second derivative is set as $M = 1/10$, which is the largest bound used in their empirical analysis. The nominal coverage probability is 0.95 for all CIs.

[Figure 3: Lee (2008) example. The red line (Minimax two-sided) plots our minimax optimal CI, while the blue line (RBC) and the green line (AK) plot the CIs constructed using the methods of Calonico et al. (2015) and Armstrong and Kolesár (2018a), respectively. The purple line (Adaptive one-sided) plots the adaptive one-sided upper CI with $C = \infty$. The horizontal axis is the Lipschitz constant $C$, and the vertical axis is the electoral advantage (%). The vertical dotted line indicates $C = 1/2$, our preferred specification.]

The CIs obtained by the robust bias correction and the second derivative bound are given by [3.…, ….38] and [2.…, ….…], respectively. Under our preferred specification $C = 1/2$, our CI is given by [5.…, ….…], and the estimated electoral advantage remains significant unless $C$ is larger than 16. The first derivative bound $C = 16$ roughly implies that a one unit increase in the previous election vote share can predict as much as a 32 unit increase in the next election vote share, which is quite a large amount. Therefore, we conclude that the significance of the incumbency effect is robust over a reasonable set of assumptions on the unknown regression function.

Given that the various CIs considered here are obtained from different assumptions on the regression function, it would be an interesting exercise to see what we can infer about the RD parameter under a minimal assumption. To this end, we construct an adaptive one-sided CI which maintains coverage over all monotone functions. The discussion in Section 4.1 implies that we can construct an adaptive upper CI in this empirical setting. We take $(\underline{C}, \bar{C}) = (0.…, ….5)$.
To make the one-sided CI comparable with the other two-sided CIs, the nominal coverage probability of the upper CI is set as 0.975. The resulting CI corresponds to the purple line in Figure 3.

7 Conclusion

In this paper, we proposed a minimax two-sided CI and an adaptive one-sided CI when the regression function is assumed to be monotone and to have a bounded first derivative. We showed that our procedure achieves uniform coverage under easy-to-interpret conditions and can be used to construct either a two-sided CI with minimax optimal length or a one-sided CI whose excess length adapts to the smoothness of the unknown regression function. There are two extensions that we find interesting.
Fuzzy RDDs.
There are various RDD applications where compliance with the treatment status is only partial. Therefore, it would be of interest to extend our approach to the fuzzy RDD setting, for example, by making monotonicity and Lipschitz continuity assumptions on the treatment propensity $p(x) = P(t_i = 1 \mid x_i = x)$, where $t_i$ is the treatment indicator for individual $i$. This approach would complement the minimax optimal approaches to fuzzy RDDs using second derivative bounds by Armstrong and Kolesár (2020b) and Noack and Rothe (2020).

Weighted CATE.
In multi-score RDDs, Imbens and Wager (2019) suggest estimating a weighted average of conditional treatment effects over different boundary points to make inference more precise. Since the weighted average parameter is a linear functional of the regression function, we can also adjust our framework to conduct inference on this parameter. On the other hand, a closed-form solution might not exist, and how to computationally construct the confidence interval for the weighted average parameter under our setting seems to be an interesting research question.

References
Armstrong, T. (2015): "Adaptive testing on a regression function at a point," The Annals of Statistics, 43, 2086–2101.

Armstrong, T. and M. Kolesár (2020a): "Finite-sample optimal estimation and inference on average treatment effects under unconfoundedness."

Armstrong, T. B. and M. Kolesár (2016): "Optimal inference in a class of regression models," working paper.

——— (2018a): "Optimal inference in a class of regression models," Econometrica, 86, 655–683.

——— (2018b): "A simple adjustment for bandwidth snooping," The Review of Economic Studies, 85, 732–765.

——— (2018c): "Supplement to 'Optimal inference in a class of regression models'," Econometrica Supplemental Material, 85.

——— (2020b): "Simple and honest confidence intervals in nonparametric regression," Quantitative Economics, 11, 1–39.

Babii, A. and R. Kumar (2020): "Isotonic regression discontinuity designs."

Cai, T. T. and M. G. Low (2004): "An adaptation theory for nonparametric confidence intervals," The Annals of Statistics, 32, 1805–1840.

Calonico, S., M. D. Cattaneo, and R. Titiunik (2015): "rdrobust: An R Package for Robust Nonparametric Inference in Regression-Discontinuity Designs," The R Journal, 7, 38.

Cattaneo, M. D., N. Idrobo, and R. Titiunik (2020): A Practical Introduction to Regression Discontinuity Designs: Extensions, Cambridge University Press (to appear).

Chay, K. Y. and M. Greenstone (2005): "Does air quality matter? Evidence from the housing market," Journal of Political Economy, 113, 376–424.

Chay, K. Y., P. J. McEwan, and M. Urquiola (2005): "The central role of noise in evaluating interventions that use test scores to rank schools," American Economic Review, 95, 1237–1258.

Chiang, H. (2009): "How accountability pressure on failing schools affects student achievement," Journal of Public Economics, 93, 1045–1057.

Dell, M. (2010): "The persistent effects of Peru's mining mita," Econometrica, 78, 1863–1903.

Donoho, D. L. (1994): "Statistical estimation and optimal recovery," The Annals of Statistics, 22, 238–270.

Donoho, D. L. and R. C. Liu (1991): "Geometrizing rates of convergence, III," The Annals of Statistics, 668–701.

Greenstone, M. and J. Gallagher (2008): "Does hazardous waste matter? Evidence from the housing market and the superfund program," The Quarterly Journal of Economics, 123, 951–1003.

Imbens, G. and S. Wager (2019): "Optimized regression discontinuity designs," Review of Economics and Statistics, 101, 264–278.

Jacob, B. A. and L. Lefgren (2004): "Remedial education and student achievement: A regression-discontinuity analysis," Review of Economics and Statistics, 86, 226–244.

Kane, T. J. (2003): "A quasi-experimental estimate of the impact of financial aid on college-going," Tech. rep., National Bureau of Economic Research.

Keele, L. J. and R. Titiunik (2015): "Geographic boundaries as regression discontinuities," Political Analysis, 23, 127–155.

Kolesár, M. and C. Rothe (2018): "Inference in regression discontinuity designs with a discrete running variable," American Economic Review, 108, 2277–2304.

Kwon, K. and S. Kwon (2020): "Adaptive inference in multivariate nonparametric regression models under monotonicity," working paper.

Lee, D. S. (2008): "Randomized experiments from non-random selection in US House elections," Journal of Econometrics, 142, 675–697.

Leuven, E., M. Lindahl, H. Oosterbeek, and D. Webbink (2007): "The effect of extra funding for disadvantaged pupils on achievement," The Review of Economics and Statistics, 89, 721–736.

Ludwig, J. and D. L. Miller (2007): "Does Head Start improve children's life chances? Evidence from a regression discontinuity design," The Quarterly Journal of Economics, 122, 159–208.

Matsudaira, J. D. (2008): "Mandatory summer school and student achievement," Journal of Econometrics, 142, 829–850.

Noack, C. and C. Rothe (2020): "Bias-aware inference in fuzzy regression discontinuity designs," arXiv preprint arXiv:1906.04631.

Papay, J. P., R. J. Murnane, and J. B. Willett (2010): "The consequences of high school exit examinations for low-performing urban students: Evidence from Massachusetts," Educational Evaluation and Policy Analysis, 32, 5–23.

Papay, J. P., J. B. Willett, and R. J. Murnane (2011): "Extending the regression-discontinuity approach to multiple assignment variables," Journal of Econometrics, 161, 203–207.

Van der Klaauw, W. (2002): "Estimating the effect of financial aid offers on college enrollment: A regression–discontinuity approach," International Economic Review, 43, 1249–1287.

Wong, V. C., P. M. Steiner, and T. D. Cook (2013): "Analyzing regression-discontinuity designs with multiple assignment variables: A comparative study of four estimation methods," Journal of Educational and Behavioral Statistics, 38, 107–141.
A Lemmas and Proofs
In this section, we collect auxiliary lemmas and omitted proofs. Before presenting the results, we state the following definition.
Definition 1.
Given two Lipschitz constants $C_1$ and $C_2$, and some constant $\delta \geq 0$, we define
$$\omega(\delta; C_1, C_2) := \sup_{f_1 \in \mathcal{F}(C_1),\, f_2 \in \mathcal{F}(C_2)} L_{\mathrm{RD}}f_2 - L_{\mathrm{RD}}f_1 \quad \text{s.t.} \quad \sum_{i=1}^n \left( \frac{f_2(x_i) - f_1(x_i)}{\sigma(x_i)} \right)^2 \leq \delta^2. \tag{20}$$
The quantity $\omega(\delta; C_1, C_2)$ is called the ordered modulus of continuity of $\mathcal{F}(C_1)$ and $\mathcal{F}(C_2)$ for the parameter $L_{\mathrm{RD}}f$.

A.1 Lemmas
Lemma A.1.
Given some pair of numbers $C, C' \geq 0$ and $\delta \geq 0$, define $\omega_q(\delta; C, C')$ for each $q \in \{t, c\}$ to be the solution to the following equation:
$$\sum_{x_i \in \mathcal{X}_q} \left[ \omega_q(\delta; C, C') - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\| \right]_+^2 / \sigma^2(x_i) = \delta^2.$$
Then, we have
$$\omega_q(\delta; C, C') = \sup_{f_1 \in \mathcal{F}(C),\, f_2 \in \mathcal{F}(C')} f_2(0) - f_1(0) \quad \text{s.t.} \quad \sum_{x_i \in \mathcal{X}_q} \left( \frac{f_2(x_i) - f_1(x_i)}{\sigma(x_i)} \right)^2 \leq \delta^2. \tag{21}$$

Proof.
See Kwon and Kwon (2020).
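Computationally, $\omega_q(\delta; C, C')$ is the root of a scalar equation whose left-hand side is increasing in $\omega_q$, so it can be found by bisection. The sketch below assumes the arrays `v_plus`, `v_minus`, and `sigma2` hold $\|(x_i)_{V+}\|$, $\|(x_i)_{V-}\|$, and $\sigma^2(x_i)$ for the observations in $\mathcal{X}_q$; the function name is ours.

import numpy as np

def omega_q(delta, C, Cp, v_plus, v_minus, sigma2, tol=1e-10):
    # Solve for omega_q(delta; C, C') in Lemma A.1 by bisection:
    # sum_i [omega - C ||(x_i)_{V+}|| - C' ||(x_i)_{V-}||]_+^2 / sigma^2(x_i) = delta^2.
    def lhs(omega):
        w = np.maximum(omega - C * v_plus - Cp * v_minus, 0.0)
        return np.sum(w**2 / sigma2)
    lo, hi = 0.0, 1.0
    while lhs(hi) < delta**2:      # bracket the root (lhs is increasing in omega)
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if lhs(mid) < delta**2 else (lo, mid)
    return 0.5 * (lo + hi)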
Lemma A.2.
Given some pair $C_1, C_2 \geq 0$ and $\delta \geq 0$, we have
$$\omega(\delta; C_1, C_2) = \omega_t(\delta^*_t(\delta; C_1, C_2); C_1, C_2) + \omega_c(\delta^*_c(\delta; C_1, C_2); C_2, C_1).$$

Proof of Lemma A.2. Noting that
$$L_{\mathrm{RD}}f_2 - L_{\mathrm{RD}}f_1 = (f_{2,t}(0) - f_{2,c}(0)) - (f_{1,t}(0) - f_{1,c}(0)) = (f_{2,t}(0) - f_{1,t}(0)) + (f_{1,c}(0) - f_{2,c}(0)),$$
$\omega(\delta; C_1, C_2)$ is obtained by solving the following problem:
$$\sup_{f_{1,t}, f_{1,c}, f_{2,t}, f_{2,c}} (f_{2,t}(0) - f_{1,t}(0)) + (f_{1,c}(0) - f_{2,c}(0))$$
$$\text{s.t.} \quad \sum_{i=1}^n \left( 1\{x_i \in \mathcal{X}_t\}\left( \frac{f_{2,t}(x_i) - f_{1,t}(x_i)}{\sigma(x_i)} \right)^2 + 1\{x_i \in \mathcal{X}_c\}\left( \frac{f_{2,c}(x_i) - f_{1,c}(x_i)}{\sigma(x_i)} \right)^2 \right) \leq \delta^2,$$
$$f_{1,t}, f_{1,c} \in \Lambda^{+,V}(C_1), \quad f_{2,t}, f_{2,c} \in \Lambda^{+,V}(C_2).$$
Using the definitions of $\omega_t(\delta_t; C_1, C_2)$ and $\omega_c(\delta_c; C_2, C_1)$, and Lemma A.1, we can write
$$\omega(\delta; C_1, C_2) = \sup_{\delta_t \geq 0,\, \delta_c \geq 0,\, \delta_t^2 + \delta_c^2 = \delta^2} \omega_t(\delta_t; C_1, C_2) + \omega_c(\delta_c; C_2, C_1),$$
which concludes the proof.

Lemma A.3.
Given some $(C_1, C_2) \in \mathbb{R}^2_+$ and $\delta \geq 0$, write $\delta^*_t := \delta^*_t(\delta; C_1, C_2)$, $\delta^*_c := \delta^*_c(\delta; C_1, C_2)$. Then, we can find $(f^*_{\delta,1}, f^*_{\delta,2}) \in \mathcal{F}(C_1) \times \mathcal{F}(C_2)$ which satisfy the following three conditions: (i) $L_{\mathrm{RD}}f^*_{\delta,2} - L_{\mathrm{RD}}f^*_{\delta,1} = \omega(\delta; C_1, C_2)$, (ii) $\left\| \frac{f^*_{\delta,2} - f^*_{\delta,1}}{\sigma} \right\| = \delta$, and (iii) when we write
$$f^*_{\delta,1} = f^*_{\delta,1,t}\,1\{x \in \mathcal{X}_t\} + f^*_{\delta,1,c}\,1\{x \in \mathcal{X}_c\}, \qquad f^*_{\delta,2} = f^*_{\delta,2,t}\,1\{x \in \mathcal{X}_t\} + f^*_{\delta,2,c}\,1\{x \in \mathcal{X}_c\},$$
the pairs $(f^*_{\delta,1,t}, f^*_{\delta,2,t})$ and $(f^*_{\delta,1,c}, f^*_{\delta,2,c})$ satisfy
$$f^*_{\delta,2,t} - f^*_{\delta,1,t} = [\omega_t(\delta^*_t; C_1, C_2) - C_1\|(x)_{V+}\| - C_2\|(x)_{V-}\|]_+ \tag{22}$$
$$f^*_{\delta,1,c} - f^*_{\delta,2,c} = [\omega_c(\delta^*_c; C_2, C_1) - C_1\|(x)_{V-}\| - C_2\|(x)_{V+}\|]_+, \tag{23}$$
$$f^*_{\delta,1,t}(0) + f^*_{\delta,2,t}(0) = \omega_t(\delta^*_t; C_1, C_2) \tag{24}$$
$$f^*_{\delta,1,c}(0) + f^*_{\delta,2,c}(0) = \omega_c(\delta^*_c; C_2, C_1), \tag{25}$$
and
$$(f^*_{\delta,2,t} - f^*_{\delta,1,t})(f^*_{\delta,2,t} + f^*_{\delta,1,t}) = (f^*_{\delta,2,t} - f^*_{\delta,1,t})(\omega_t(\delta^*_t; C_1, C_2) + C_1\|(x)_{V+}\| - C_2\|(x)_{V-}\|) \tag{26}$$
$$(f^*_{\delta,1,c} - f^*_{\delta,2,c})(f^*_{\delta,1,c} + f^*_{\delta,2,c}) = (f^*_{\delta,1,c} - f^*_{\delta,2,c})(\omega_c(\delta^*_c; C_2, C_1) + C_2\|(x)_{V+}\| - C_1\|(x)_{V-}\|). \tag{27}$$
Kwon and Kwon (2020) show that if we let
$$f^*_{\delta,1,t}(x) = \begin{cases} C_1\|(x)_{V+}\| & \text{if } C_1 \leq C_2 \\ \min\{\omega_t(\delta_t; C_1, C_2) - C_2\|(x)_{V-}\|,\; C_1\|(x)_{V+}\|\} & \text{otherwise}, \end{cases}$$
$$f^*_{\delta,2,t}(x) = \begin{cases} \max\{\omega_t(\delta_t; C_1, C_2) - C_2\|(x)_{V-}\|,\; C_1\|(x)_{V+}\|\} & \text{if } C_1 \leq C_2 \\ \omega_t(\delta_t; C_1, C_2) - C_2\|(x)_{V-}\| & \text{otherwise}, \end{cases}$$
we have $(f^*_{\delta,1,t}, f^*_{\delta,2,t}) \in \mathcal{F}(C_1) \times \mathcal{F}(C_2)$, and these functions solve (21) for $q = t$, $C = C_1$, and $C' = C_2$. Likewise, if we let
$$f^*_{\delta,1,c}(x) = \begin{cases} \max\{\omega_c(\delta_c; C_2, C_1) - C_1\|(x)_{V-}\|,\; C_2\|(x)_{V+}\|\} & \text{if } C_2 \leq C_1 \\ \omega_c(\delta_c; C_2, C_1) - C_1\|(x)_{V-}\| & \text{otherwise}, \end{cases}$$
$$f^*_{\delta,2,c}(x) = \begin{cases} C_2\|(x)_{V+}\| & \text{if } C_2 \leq C_1 \\ \min\{\omega_c(\delta_c; C_2, C_1) - C_1\|(x)_{V-}\|,\; C_2\|(x)_{V+}\|\} & \text{otherwise}, \end{cases}$$
we have $(f^*_{\delta,1,c}, f^*_{\delta,2,c}) \in \mathcal{F}(C_1) \times \mathcal{F}(C_2)$, and these functions solve (21) for $q = c$, $C = C_2$, and $C' = C_1$. Then, the equations (22)–(27) in the statement of this lemma follow from the above formulas and Lemma A.2.

Lemma A.4.
Given some $(C_1, C_2) \in \mathbb{R}^2_+$ and $\delta \geq 0$, let $\omega'(\delta; C_1, C_2) = \frac{\partial}{\partial \delta}\omega(\delta; C_1, C_2)$ and write $\delta^*_t := \delta^*_t(\delta; C_1, C_2)$, $\delta^*_c := \delta^*_c(\delta; C_1, C_2)$. Then, we have
$$\omega'(\delta; C_1, C_2) = \frac{\delta}{\sum_{x_i \in \mathcal{X}_t} [\omega_t(\delta^*_t; C_1, C_2) - C_1\|(x_i)_{V+}\| - C_2\|(x_i)_{V-}\|]_+ / \sigma^2(x_i)} = \frac{\delta}{\sum_{x_i \in \mathcal{X}_c} [\omega_c(\delta^*_c; C_2, C_1) - C_1\|(x_i)_{V-}\| - C_2\|(x_i)_{V+}\|]_+ / \sigma^2(x_i)}.$$
Note that $f \in \mathcal{F}(C)$ implies $f + z \in \mathcal{F}(C)$ for any $z \in \mathbb{R}$ and $C \geq 0$. Moreover, letting $\iota_t(x) := 1\{x \in \mathcal{X}_t\}$, we have $L_{\mathrm{RD}}(\iota_t) = 1$. Then, Lemma B.3 in Armstrong and Kolesár (2016) implies that
$$\frac{\omega'(\delta; C_1, C_2)}{\delta} = \frac{1}{\sum_{i=1}^n (f^*_{\delta,2}(x_i) - f^*_{\delta,1}(x_i))\,1\{x_i \in \mathcal{X}_t\}/\sigma^2(x_i)} = \frac{1}{\sum_{x_i \in \mathcal{X}_t} (f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))/\sigma^2(x_i)},$$
where $f^*_{\delta,j}(x_i)$ and $f^*_{\delta,j,t}(x_i)$ for $j = 1, 2$ are as defined in Lemma A.3. Likewise, letting $\iota_c(x) = -1\{x \in \mathcal{X}_c\}$, we have $L_{\mathrm{RD}}(\iota_c) = 1$, and Lemma B.3 in Armstrong and Kolesár (2016) implies that
$$\frac{\omega'(\delta; C_1, C_2)}{\delta} = \frac{1}{\sum_{x_i \in \mathcal{X}_c} (f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))/\sigma^2(x_i)},$$
where $f^*_{\delta,j,c}(x_i)$ for $j = 1, 2$ are as defined in Lemma A.3. The result then follows from (22) and (23) in Lemma A.3.

Lemma A.5.
Given some pair $(C_1, C_2)$ such that $C_1 \geq C_2 \geq 0$, and for some $\delta > 0$, write $\delta^*_t := \delta^*_t(\delta; C_1, C_2)$, $\delta^*_c := \delta^*_c(\delta; C_1, C_2)$, and
$$h_{t,\delta} = \omega^*_t(\delta; C_1, C_2) \cdot \left( \frac{1}{C_1}, \frac{1}{C_2} \right), \qquad h_{c,\delta} = \omega^*_c(\delta; C_2, C_1) \cdot \left( \frac{1}{C_2}, \frac{1}{C_1} \right).$$
Then, we have
$$\sum_{x_i \in \mathcal{X}_t} K^2(x_i, h_{t,\delta})/\sigma^2(x_i) = (\delta^*_t/\omega^*_t(\delta; C_1, C_2))^2, \qquad \sum_{x_i \in \mathcal{X}_c} K^2(x_i, h_{c,\delta})/\sigma^2(x_i) = (\delta^*_c/\omega^*_c(\delta; C_2, C_1))^2.$$

Proof.
Using the notation of Lemma A.3, we can write
$$\sum_{x_i \in \mathcal{X}_t} K^2(x_i, h_{t,\delta})/\sigma^2(x_i) = \frac{1}{(\omega^*_t(\delta; C_1, C_2))^2} \sum_{x_i \in \mathcal{X}_t} (f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))^2/\sigma^2(x_i).$$
Now, by the definitions of $f^*_{\delta,1,t}$ and $f^*_{\delta,2,t}$, we have $\sum_{x_i \in \mathcal{X}_t} (f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))^2/\sigma^2(x_i) = (\delta^*_t)^2$, which gives the desired result for the first equation of this lemma. The second equation follows from analogous reasoning.

A.2 Proofs of main results
Proof of Lemma 4.1.
Consider a lower confidence interval $[\hat{c}, \infty) \in \mathcal{L}_\alpha(C)$. Then, Theorem 3.1 of Armstrong and Kolesár (2018a) implies that
$$\sup_{f \in \mathcal{F}(C')} E[L_{\mathrm{RD}}f - \hat{c}] \geq \omega(z_{1-\alpha}; C, C')$$
when Assumption 1 holds. The same theorem also implies that there exists $[\hat{c}, \infty) \in \mathcal{L}_\alpha(C)$ that exactly achieves the lower bound above. Therefore, it remains to analyze the conditions under which $\omega(z_{1-\alpha}; C, C')$ is bounded by some $A(C')$.

Now, Lemma A.2 implies that we can write
$$\omega(z_{1-\alpha}; C, C') = \omega_t(\delta^*_t; C, C') + \omega_c(\delta^*_c; C', C),$$
where $(\delta^*_t, \delta^*_c)$ solve
$$\sup_{\delta_t \geq 0,\, \delta_c \geq 0,\, \delta_t^2 + \delta_c^2 = z_{1-\alpha}^2} \omega_t(\delta_t; C, C') + \omega_c(\delta_c; C', C).$$
Due to Lemma A.1, $b_t = \omega_t(\delta^*_t; C, C')$ solves
$$\sum_{x_i \in \mathcal{X}_t} [b_t - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2 / \sigma^2(x_i) = (\delta^*_t)^2. \tag{28}$$
We first consider sufficiency. Define $\tilde{\mathcal{X}}_t := (-\infty, 0]^d \cap \mathcal{X}_t$. Notice that for any given $b \in \mathbb{R}$, we have
$$\sum_{x_i \in \mathcal{X}_t} [b - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i) \geq \sum_{x_i \in \tilde{\mathcal{X}}_t} [b - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i).$$
Therefore, given some $\delta_t \geq 0$, if $b'_t = b'_t(\delta_t)$ solves
$$\sum_{x_i \in \tilde{\mathcal{X}}_t} [b'_t - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i) = \delta_t^2,$$
then $b'_t \geq \omega_t(\delta_t; C, C')$ will hold. This implies that
$$\omega_t(\delta^*_t; C, C') \leq \omega_t(z_{1-\alpha}; C, C') \leq b'_t(z_{1-\alpha}),$$
where the first inequality is due to the constraint that $\delta^*_t \geq 0$, $\delta^*_c \geq 0$ and $(\delta^*_t)^2 + (\delta^*_c)^2 = z_{1-\alpha}^2$. Note that $x_i \in \tilde{\mathcal{X}}_t$ implies
$$\sum_{x_i \in \tilde{\mathcal{X}}_t} [b'_t - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i) = \sum_{x_i \in \tilde{\mathcal{X}}_t} [b'_t - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i),$$
using the assumption that $V = \{1, \ldots, d\}$. This implies that $b'_t(z_{1-\alpha})$ is a function of the data and $C'$ only. Likewise, if we define $\tilde{\mathcal{X}}_c := [0, \infty)^d \cap \mathcal{X}_c$ and $b'_c = b'_c(\delta_c)$ to be the solution to
$$\sum_{x_i \in \tilde{\mathcal{X}}_c} [b'_c - C'\|(x_i)_{V+}\|]_+^2/\sigma^2(x_i) = \delta_c^2,$$
then $b'_c(z_{1-\alpha})$ is a function of the data and $C'$ only. Therefore, we can set $A(C') = b'_t(z_{1-\alpha}) + b'_c(z_{1-\alpha})$.

Next, let us consider necessity. First, suppose $(-\infty, 0]^d \cap \mathcal{X}_t = \emptyset$. Then, since every term in the summation in (28) depends on $C$, increasing $C$ necessarily increases the size of $b_t$. Therefore, $\omega(z_{1-\alpha}; C, C')$ cannot be bounded by a term independent of $C$. The same reasoning applies to the case with $[0, \infty)^d \cap \mathcal{X}_c = \emptyset$, and to the case where $V \subsetneq \{1, \ldots, d\}$, which concludes the proof. (To be more precise, the reasoning above holds only if $\delta^*_t, \delta^*_c > 0$, and only if there is no observation located exactly on the axis; these cases can be ignored when $n$ is sufficiently large and when the running variables have a continuous distribution.)

Proof of Theorem 3.1.
First, given some $\delta \geq 0$, consider the optimization problem
$$\omega(\delta; C) := \sup_{f_1, f_2 \in \mathcal{F}(C)} L_{\mathrm{RD}}f_2 - L_{\mathrm{RD}}f_1 \quad \text{s.t.} \quad \left\| \frac{f_2 - f_1}{\sigma} \right\| \leq \delta, \tag{29}$$
and let $(f^*_{\delta,1}, f^*_{\delta,2})$ denote its solution. Define
$$\tilde{L}(\delta) := \frac{\omega'(\delta; C, C)}{\delta} \sum_{i=1}^n \frac{(f^*_{\delta,2}(x_i) - f^*_{\delta,1}(x_i))\,y_i}{\sigma^2(x_i)} - \frac{\omega'(\delta; C, C)}{2\delta} \sum_{i=1}^n \frac{(f^*_{\delta,2}(x_i) - f^*_{\delta,1}(x_i))(f^*_{\delta,2}(x_i) + f^*_{\delta,1}(x_i))}{\sigma^2(x_i)} + L_{\mathrm{RD}}\left( \frac{f^*_{\delta,1} + f^*_{\delta,2}}{2} \right).$$
Under these definitions, we can show
$$\mathrm{sd}(\tilde{L}(\delta)) = \omega'(\delta; C, C), \qquad \sup_{f \in \mathcal{F}(C)} \mathrm{bias}(\tilde{L}(\delta)) = \frac{1}{2}\left(\omega(\delta; C, C) - \delta\,\omega'(\delta; C, C)\right).$$
Define
$$\tilde{\chi}(\delta) = \mathrm{cv}_\alpha\!\left( \frac{\sup_{f \in \mathcal{F}(C)} \mathrm{bias}(\tilde{L}(\delta))}{\mathrm{sd}(\tilde{L}(\delta))} \right) \cdot \mathrm{sd}(\tilde{L}(\delta)).$$
Then, the result of Donoho (1994) implies that the minimax affine optimal CI is given by $[\tilde{L}(\delta) - \tilde{\chi}(\delta),\, \tilde{L}(\delta) + \tilde{\chi}(\delta)]$ when $\delta$ is chosen to minimize $\tilde{\chi}(\delta)$. Thus, the proof is done if we can show that 1) $\tilde{L}(\delta)$ has the same form as $\hat{L}_{\mathrm{mm}}$, and 2) $\mathrm{sd}(\tilde{L}(\delta))$ and $\sup_{f \in \mathcal{F}(C)} \mathrm{bias}(\tilde{L}(\delta))$ are as given in (11).

For the first claim, recall that we can write, for $j = 1, 2$,
$$f^*_{\delta,j}(x) = f^*_{\delta,j,t}(x)\,1\{x \in \mathcal{X}_t\} + f^*_{\delta,j,c}(x)\,1\{x \in \mathcal{X}_c\}.$$
Then,
$$\tilde{L}(\delta) = \frac{\omega'(\delta; C, C)}{\delta} \sum_{x_i \in \mathcal{X}_t} \frac{(f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))\,y_i}{\sigma^2(x_i)} - \frac{\omega'(\delta; C, C)}{\delta} \sum_{x_i \in \mathcal{X}_c} \frac{(f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))\,y_i}{\sigma^2(x_i)}$$
$$- \frac{\omega'(\delta; C, C)}{2\delta} \sum_{x_i \in \mathcal{X}_t} \frac{(f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))(f^*_{\delta,2,t}(x_i) + f^*_{\delta,1,t}(x_i))}{\sigma^2(x_i)} + \frac{\omega'(\delta; C, C)}{2\delta} \sum_{x_i \in \mathcal{X}_c} \frac{(f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))(f^*_{\delta,1,c}(x_i) + f^*_{\delta,2,c}(x_i))}{\sigma^2(x_i)}$$
$$+ \frac{f^*_{\delta,1,t}(0) + f^*_{\delta,2,t}(0)}{2} - \frac{f^*_{\delta,1,c}(0) + f^*_{\delta,2,c}(0)}{2}.$$
Defining
$$\tilde{L}_t(\delta) = \frac{\omega'(\delta; C, C)}{\delta} \sum_{x_i \in \mathcal{X}_t} \frac{(f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))\,y_i}{\sigma^2(x_i)} - \frac{\omega'(\delta; C, C)}{2\delta} \sum_{x_i \in \mathcal{X}_t} \frac{(f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))(f^*_{\delta,2,t}(x_i) + f^*_{\delta,1,t}(x_i))}{\sigma^2(x_i)} + \frac{f^*_{\delta,1,t}(0) + f^*_{\delta,2,t}(0)}{2}$$
and
$$\tilde{L}_c(\delta) = \frac{\omega'(\delta; C, C)}{\delta} \sum_{x_i \in \mathcal{X}_c} \frac{(f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))\,y_i}{\sigma^2(x_i)} - \frac{\omega'(\delta; C, C)}{2\delta} \sum_{x_i \in \mathcal{X}_c} \frac{(f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))(f^*_{\delta,1,c}(x_i) + f^*_{\delta,2,c}(x_i))}{\sigma^2(x_i)} + \frac{f^*_{\delta,1,c}(0) + f^*_{\delta,2,c}(0)}{2},$$
we can see that $\tilde{L}(\delta) = \tilde{L}_t(\delta) - \tilde{L}_c(\delta)$ holds.

Now, using equations (22), (24) and (26) in Lemma A.3, and the first representation of $\omega'(\delta; C, C)/\delta$ in Lemma A.4, we get
$$\tilde{L}_t(\delta) = \frac{\sum_{x_i \in \mathcal{X}_t} [\omega_t(\delta^*_t; C, C) - C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|]_+\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} [\omega_t(\delta^*_t; C, C) - C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|]_+/\sigma^2(x_i)}$$
$$- \frac{1}{2}\sum_{x_i \in \mathcal{X}_t} \frac{[\omega_t(\delta^*_t; C, C) - C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|]_+\,(\omega_t(\delta^*_t; C, C) + C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} [\omega_t(\delta^*_t; C, C) - C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|]_+/\sigma^2(x_i)} + \frac{\omega_t(\delta^*_t; C, C)}{2}$$
$$= \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i/h_t)\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i/h_t)/\sigma^2(x_i)} + a_t(\delta).$$
Similarly, for $\tilde{L}_c(\delta)$, using the equations (23), (25), and (27) in Lemma A.3, and the second representation of $\omega'(\delta; C, C)/\delta$ in Lemma A.4, we get
$$\tilde{L}_c(\delta) = \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i/h_c)\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i/h_c)/\sigma^2(x_i)} + a_c(\delta),$$
which establishes the claim that $\tilde{L}(\delta)$ has the same form as $\hat{L}_{\mathrm{mm}}$.

Next, for the second claim about the standard deviation and the worst-case bias, it is sufficient to show
$$\omega'(\delta; C, C) = \frac{\delta/(Ch_t(\delta))}{\sum_{x_i \in \mathcal{X}_t} K(x_i/h_t(\delta))/\sigma^2(x_i)}, \qquad \omega(\delta; C, C) = C(h_t(\delta) + h_c(\delta)).$$
The first equation follows from Lemma A.4. Next, $\omega(\delta; C, C) = \omega_t(\delta^*_t(\delta; C); C) + \omega_c(\delta^*_c(\delta; C); C)$ follows from Lemma A.2, which implies the second equation and concludes the proof.

Proof of Theorem 4.2. Given some $f \in \mathcal{F}(C)$ and $j \in \{1, \ldots, J\}$, we can write
$$P_f(L_{\mathrm{RD}}f < \hat{c}^L_\tau(C_j)) = P_f(\hat{c}^L_\tau(C_j) - L_{\mathrm{RD}}f > 0) = P_f\left( \frac{\hat{c}^L_\tau(C_j) - L_{\mathrm{RD}}f}{\omega'(z_{1-\tau}; C, C_j)} > 0 \right) = P_f\left( \frac{\hat{c}^L_\tau(C_j) - L_{\mathrm{RD}}f}{\omega'(z_{1-\tau}; C, C_j)} + z_{1-\tau} > z_{1-\tau} \right) \equiv P_f(\tilde{V}_{j,\tau} > z_{1-\tau}),$$
where $\omega'(z_{1-\tau}; C, C_j)$ is as defined earlier. Therefore, defining
$$\mathrm{CI}^L(\tau) := \left[ \max_{1 \leq j \leq J} \hat{c}^L_\tau(C_j),\, \infty \right),$$
we have
$$P_f(L_{\mathrm{RD}}f \notin \mathrm{CI}^L(\tau)) = P_f(\exists j \text{ s.t. } L_{\mathrm{RD}}f < \hat{c}^L_\tau(C_j)) = P_f\left( \max_{1 \leq j \leq J} \tilde{V}_{j,\tau} > z_{1-\tau} \right).$$
Now, we want to find the largest value of $\tau$ such that
$$\sup_{f \in \mathcal{F}(C)} P_f\left( \max_{1 \leq j \leq J} \tilde{V}_{j,\tau} > z_{1-\tau} \right) \leq \alpha, \tag{30}$$
therefore satisfying the coverage requirement. First, it is easy to see that $(\tilde{V}_{1,\tau}, \ldots, \tilde{V}_{J,\tau})$ has a multivariate normal distribution.
Next, note that the quantiles of $\max_{1 \leq j \leq J} \tilde{V}_{j,\tau}$ are increasing in each of $E_f\tilde{V}_{1,\tau}, \ldots, E_f\tilde{V}_{J,\tau}$. Moreover, the variances and covariances of $(\tilde{V}_{1,\tau}, \ldots, \tilde{V}_{J,\tau})$ do not depend on the true regression function $f$, by the construction of $\hat{c}^L_\tau(C_j)$. This means that the quantiles of $\max_{1 \leq j \leq J} \tilde{V}_{j,\tau}$ are smaller than those of $\max_{1 \leq j \leq J} V_{j,\tau}$, where $(V_{1,\tau}, \ldots, V_{J,\tau})$ has a multivariate normal distribution with mean given by $EV_{j,\tau} = \sup_{f \in \mathcal{F}(C)} E_f\tilde{V}_{j,\tau}$ for $j = 1, \ldots, J$, and with the same covariance matrix as $(\tilde{V}_{1,\tau}, \ldots, \tilde{V}_{J,\tau})$. Therefore, if we take $\tau$ so that $z_{1-\tau}$ is the $(1-\alpha)$th quantile of $\max_{1 \leq j \leq J} V_{j,\tau}$, (30) is satisfied. It remains to show that $(V_{j,\tau})_{j=1}^J$ has a multivariate normal distribution with zero means, unit variances, and covariances as given in (16).

For the mean, by the definition of $\hat{c}^L_\tau(C_j)$, we have
$$\sup_{f \in \mathcal{F}(C)} E_f[\hat{c}^L_\tau(C_j) - L_{\mathrm{RD}}f] = -z_{1-\tau}\,\mathrm{sd}(\hat{L}_\tau(C_j)) = -z_{1-\tau}\,\omega'(z_{1-\tau}; C, C_j),$$
where the last equality follows from the discussion in Armstrong and Kolesár (2018a). Hence, we get $EV_{j,\tau} = 0$. Moreover, this also implies $\mathrm{Var}(V_{j,\tau}) = 1$. For the covariance, we have
$$\mathrm{Cov}(\hat{L}_\tau(C_j), \hat{L}_\tau(C_k)) = \mathrm{Cov}(\hat{L}_{t,\tau}(C_j) - \hat{L}_{c,\tau}(C_j),\, \hat{L}_{t,\tau}(C_k) - \hat{L}_{c,\tau}(C_k)) = \mathrm{Cov}(\hat{L}_{t,\tau}(C_j), \hat{L}_{t,\tau}(C_k)) + \mathrm{Cov}(\hat{L}_{c,\tau}(C_j), \hat{L}_{c,\tau}(C_k)),$$
where the last equality follows from the independence assumption, and we can show
$$\mathrm{Cov}\left( \frac{\hat{L}_{t,\tau}(C_j)}{\omega'(z_{1-\tau}; C, C_j)},\, \frac{\hat{L}_{t,\tau}(C_k)}{\omega'(z_{1-\tau}; C, C_k)} \right) = \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau}(C_j)) K(x_i, h_{t,\tau}(C_k))/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau}(C_j))/\sigma^2(x_i)\, \sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau}(C_k))/\sigma^2(x_i)\; \omega'(z_{1-\tau}; C, C_j)\,\omega'(z_{1-\tau}; C, C_k)}$$
$$= \frac{\omega^*_t(z_{1-\tau}; C, C_j)\,\omega^*_t(z_{1-\tau}; C, C_k)}{z_{1-\tau}^2} \sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau}(C_j)) K(x_i, h_{t,\tau}(C_k))/\sigma^2(x_i),$$
where the last equality follows from Lemma A.4 and the definition of $K(x_i, h_{t,\tau}(C_j))$. Likewise, we can show
$$\mathrm{Cov}\left( \frac{\hat{L}_{c,\tau}(C_j)}{\omega'(z_{1-\tau}; C, C_j)},\, \frac{\hat{L}_{c,\tau}(C_k)}{\omega'(z_{1-\tau}; C, C_k)} \right) = \frac{\omega^*_c(z_{1-\tau}; C_j, C)\,\omega^*_c(z_{1-\tau}; C_k, C)}{z_{1-\tau}^2} \sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau}(C_j)) K(x_i, h_{c,\tau}(C_k))/\sigma^2(x_i),$$
which proves that the covariance term has the same form as in (16).
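The calibration of $\tau$ in the proof above can be carried out by simulation: draw from the Gaussian distribution of $(V_{1,\tau}, \ldots, V_{J,\tau})$ and match $z_{1-\tau}$ to the $(1-\alpha)$-quantile of the maximum. Since the covariance matrix itself depends on $\tau$, the sketch below iterates this to a fixed point; both the fixed-point iteration and the callable `cov_of_tau` (returning the covariance matrix in (16) for a given $\tau$) are our assumptions.

import numpy as np
from scipy.stats import norm

def calibrate_tau(cov_of_tau, alpha=0.05, n_sim=100_000, n_iter=50, seed=0):
    # Find tau such that z_{1-tau} equals the (1-alpha)-quantile of
    # max_j V_{j,tau}, where (V_{1,tau},...,V_{J,tau}) ~ N(0, Sigma(tau))
    # with unit variances. `cov_of_tau` returns Sigma(tau) via (16).
    rng = np.random.default_rng(seed)
    tau = alpha
    for _ in range(n_iter):
        Sigma = cov_of_tau(tau)
        draws = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n_sim)
        q = np.quantile(draws.max(axis=1), 1 - alpha)  # quantile of the max
        tau_new = 1 - norm.cdf(q)                      # tau with z_{1-tau} = q
        if abs(tau_new - tau) < 1e-6:
            break
        tau = tau_new
    return tau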
The last statement is immediate from Lemma A.2 of our paper and Theorem 3.1 of Armstrong and Kolesár (2018a), so we focus on the former statement.

First, we state the means and the covariance matrix of the $U_j$'s. Given $\mathcal{C} = \{C_1, \ldots, C_J\}$, let $\tau^* = \tau^*(\mathcal{C})$ be the value of $\tau$ that solves (17). Then, we have
$$EU_j = \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,C'\|(x_i)_{V-}\|/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} + \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,C'\|(x_i)_{V+}\|/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)} + z_{1-\tau^*}\,\mathrm{sd}(\hat{L}_{\tau^*}(C_j)) + \sup_{f \in \mathcal{F}(C)} \mathrm{bias}(\hat{L}_{\tau^*}(C_j)),$$
and
$$\mathrm{Cov}(U_j, U_k) = \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j)) K(x_i, h_{t,\tau^*}(C_k))/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)\, \sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_k))/\sigma^2(x_i)} + \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j)) K(x_i, h_{c,\tau^*}(C_k))/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)\, \sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_k))/\sigma^2(x_i)}.$$
To prove the statement in the proposition, define
$$f^*_{C'}(x) := -C'\|(x)_{V-}\|\,1(x \in \mathcal{X}_t) + C'\|(x)_{V+}\|\,1(x \in \mathcal{X}_c).$$
Then, we show
$$\ell_{\mathrm{adpt}}(C'; \mathcal{C}) = \sup_{f \in \mathcal{F}(C')} E_f \min_{1 \leq j \leq J} (L_{\mathrm{RD}}f - \hat{c}^L_{\tau^*}(C_j)) =: \sup_{f \in \mathcal{F}(C')} E_f \min_{1 \leq j \leq J} Z_j(f) = E_{f^*_{C'}} \min_{1 \leq j \leq J} Z_j(f^*_{C'}) = E_{f^*_{C'}} \min_{1 \leq j \leq J} \left( -\hat{c}^L_{\tau^*}(C_j) \right).$$
The first and the last equalities hold by definition (since $L_{\mathrm{RD}}f^*_{C'} = 0$), so what we have to show is the second-to-last equality.

It is sufficient to show that for each $j \in \{1, \ldots, J\}$,
$$\sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{c}^L_{\tau^*}(C_j)) = E_{f^*_{C'}}(L_{\mathrm{RD}}f^*_{C'} - \hat{c}^L_{\tau^*}(C_j)) = E_{f^*_{C'}}(-\hat{c}^L_{\tau^*}(C_j)).$$
Indeed, $(Z_j(f))_{j=1}^J$ has a multivariate normal distribution with some mean vector $(\mu_j(f))_{j=1}^J$ and a variance matrix that does not depend on $f$. This implies that if $f^*_{C'}$ maximizes $\mu_j(f)$ for all $j = 1, \ldots, J$, it also maximizes $E_f \min_{1 \leq j \leq J} Z_j(f)$.

For shorthand notation, define
$$\chi^L_\tau(C') := \sup_{f \in \mathcal{F}(C)} \mathrm{bias}_f(\hat{L}_\tau(C')) + z_{1-\tau}\,\mathrm{sd}(\hat{L}_\tau(C')).$$
Now, note that we can write
$$\sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{c}^L_{\tau^*}(C_j)) = \sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{L}_{\tau^*}(C_j) + \chi^L_{\tau^*}(C_j)) = \sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{L}_{\tau^*}(C_j)) + \chi^L_{\tau^*}(C_j),$$
since $\chi^L_{\tau^*}(C_j)$ is a fixed quantity.
The first term in the last line can be written as
$$\sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{L}_{\tau^*}(C_j)) = \sup_{f_t, f_c \in \mathcal{F}(C')} E_f\left[ \left( f_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} \right) - \left( f_c(0) - \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)} \right) \right]$$
$$= \sup_{f_t \in \mathcal{F}(C')} \left( f_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,f_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} \right) - \inf_{f_c \in \mathcal{F}(C')} \left( f_c(0) - \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,f_c(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)} \right).$$
Note that $f_t(0)$ and $f_c(0)$ can be normalized to 0, since if we consider $\tilde{f}_t = f_t + v$ for some $v \in \mathbb{R}$,
$$\tilde{f}_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,\tilde{f}_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} = f_t(0) + v - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))(f_t(x_i) + v)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} = f_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,f_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)},$$
and likewise for $f_c$. Therefore, we have
$$\sup_{f_t \in \mathcal{F}(C')} \left( f_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,f_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} \right) = -\inf_{f_t \in \mathcal{F}(C'),\, f_t(0) = 0} \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,f_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)}.$$
Kwon and Kwon (2020) show that the minimizer of this problem is given by $f^*_t(x) = -C'\|(x)_{V-}\|$. Likewise, we have
$$-\inf_{f_c \in \mathcal{F}(C')} \left( f_c(0) - \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,f_c(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)} \right) = \sup_{f_c \in \mathcal{F}(C'),\, f_c(0) = 0} \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,f_c(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)}.$$
Again by Kwon and Kwon (2020), the maximizer is given by $f^*_c(x) = C'\|(x)_{V+}\|$. Therefore, the worst-case expected length is achieved when $f = f^*_{C'}$. Lastly, it is easy to see that the $-\hat{c}^L_{\tau^*}(C_j)$'s are jointly normally distributed with the mean and covariance given above when $f = f^*_{C'}$, which establishes our claim.

B Unbiased estimator for C

As in Armstrong and Kolesár (2018c), we can estimate a lower bound on $C$ using the data. We first explain the idea for the case with $d = 1$. Suppose $\mathcal{X}_t = [x_{\min}, 0)$ and $\mathcal{X}_c = (0, x_{\max}]$.
First, note that for any $x_2 \geq x_1 \geq 0$, we have
$$f_c(x_2) - f_c(x_1) \leq C(x_2 - x_1). \tag{31}$$
Let $n_c := \sum_{i=1}^n 1\{x_i \in \mathcal{X}_c\}$, and set $a_{cm} > 0$ such that $\sum_{i=1}^n 1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} = n_c/2$. Then, we have
$$\frac{1}{n_c/2}\left[ \sum_{i=1}^n f_c(x_i)\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n f_c(x_i)\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right] \tag{32}$$
$$\leq \frac{C}{n_c/2}\left[ \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right], \tag{33}$$
so that we have
$$C \geq \frac{(n_c/2)^{-1}\left[ \sum_{i=1}^n f_c(x_i)\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n f_c(x_i)\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right]}{(n_c/2)^{-1}\left[ \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right]}. \tag{34}$$
We estimate the RHS of (34), denoted by $\mu_c$, by
$$\hat{\mu}_c := \frac{(n_c/2)^{-1}\left[ \sum_{i=1}^n y_i\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n y_i\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right]}{(n_c/2)^{-1}\left[ \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right]}. \tag{35}$$
We can easily see that $\hat{\mu}_c$ is an unbiased estimator of $\mu_c$.

Likewise, we can form a lower bound estimator using the data in $\mathcal{X}_t$ by
$$\hat{\mu}_t := \frac{(n_t/2)^{-1}\left[ \sum_{i=1}^n y_i\,1\{x_i \in \mathcal{X}_t, x_i > a_{tm}\} - \sum_{i=1}^n y_i\,1\{x_i \in \mathcal{X}_t, x_i \leq a_{tm}\} \right]}{(n_t/2)^{-1}\left[ \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_t, x_i > a_{tm}\} - \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_t, x_i \leq a_{tm}\} \right]}, \tag{36}$$
where $n_t := \sum_{i=1}^n 1\{x_i \in \mathcal{X}_t\}$ and $a_{tm} < 0$ is such that $\sum_{i=1}^n 1\{x_i \in \mathcal{X}_t, x_i \leq a_{tm}\} = n_t/2$. In the dataset of Lee (2008), it turns out that $\hat{\mu}_c = 0.353$ and $\hat{\mu}_t = 0.…$.
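A minimal sketch of the estimator $\hat{\mu}_c$ in (35) (equivalently $\hat{\mu}_t$ in (36), applied to the other side of the cutoff) follows; it splits one side's observations at the within-side median of $x_i$, which plays the role of $a_{cm}$.

import numpy as np

def mu_hat(x, y):
    # Lower-bound estimator for C from (35): ratio of the difference in
    # mean outcomes above/below the within-side median of x to the
    # corresponding difference in mean x. Pass one side's data at a time.
    x, y = np.asarray(x, float), np.asarray(y, float)
    am = np.median(x)                  # plays the role of a_cm (or a_tm)
    hi, lo = x > am, x <= am
    return (y[hi].mean() - y[lo].mean()) / (x[hi].mean() - x[lo].mean())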
The same approach can be used when $d > 1$. For example, when $d = d_+$, we use the inequality
$$f_c(x_2) - f_c(x_1) \leq C\|x_2 - x_1\|, \tag{37}$$
for any $x_1, x_2$ such that $x_{2r} \geq x_{1r}$ for all $r = 1, \ldots, d$. Find two subsets of $\mathcal{X}_c$, say $\mathcal{X}_{c1}$ and $\mathcal{X}_{c2}$, so that 1) $x_1 \in \mathcal{X}_{c1}$ and $x_2 \in \mathcal{X}_{c2}$ implies $x_{2r} \geq x_{1r}$ for all $r = 1, \ldots, d$, and 2) the number of observations in each set is equal to $\tilde{n}_c/2$. Index the $x_i$'s so that $x_1, \ldots, x_{\tilde{n}_c/2} \in \mathcal{X}_{c1}$ and $x_{\tilde{n}_c/2 + 1}, \ldots, x_{\tilde{n}_c} \in \mathcal{X}_{c2}$. Define some one-to-one mapping $j$ from $\{1, 2, \ldots, \tilde{n}_c/2\}$ to $\{\tilde{n}_c/2 + 1, \tilde{n}_c/2 + 2, \ldots, \tilde{n}_c\}$. Then, we have
$$C \geq \frac{\sum_{i=1}^{\tilde{n}_c/2} \left[ f_c(x_{j(i)}) - f_c(x_i) \right]}{\sum_{i=1}^{\tilde{n}_c/2} \left\| x_{j(i)} - x_i \right\|}. \tag{38}$$
Again, the lower bound can be estimated by replacing $f_c(x_i)$ by $y_i$.
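For $d > 1$, the construction above requires a coordinate-wise dominating pairing. The sketch below uses one simple, hypothetical choice (sort by coordinate sum, match halves in order, and drop non-dominating pairs); the text only requires some one-to-one dominating map $j$.

import numpy as np

def mu_hat_multi(x, y):
    # Multivariate lower-bound estimator following (38): pair each
    # observation in a "lower" subset with one in an "upper" subset that
    # dominates it coordinate-wise, then take the ratio of summed outcome
    # differences to summed distances. x has shape (n, d).
    x, y = np.asarray(x, float), np.asarray(y, float)
    order = np.argsort(x.sum(axis=1))          # crude ordering by coordinate sum
    m = len(order) // 2
    lower, upper = order[:m], order[-m:]
    keep = np.all(x[upper] >= x[lower], axis=1)  # keep dominating pairs only
    num = np.sum(y[upper][keep] - y[lower][keep])
    den = np.sum(np.linalg.norm(x[upper][keep] - x[lower][keep], axis=1))
    return num / den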
C Examples of Multi-score RDD

In this section, we discuss RDD applications with monotone multiple running variables. When presenting these empirical applications, we categorize the class of RD designs we consider into the following three cases, depending on how the running variables relate to the assignment of the treatment. While our framework can potentially cover a much larger class of models, these three settings seem to be the ones that appear most frequently in empirical studies. We let $x_i \in \mathbb{R}^d$ be the value of the running variable for an individual $i$, and denote the $s$th element of $x_i$ by $x_i(s)$ for $s = 1, \ldots, d$.

1. MRO (Multiple Running variables with "OR" conditions):
An indi-vidual i is treated if there exists some s ∈ { , ..., d } such that x i ( s ) > ≤ MRA (Multiple Running variables with “AND” conditions):
An indi-vidual i is treated if x i ( s ) > ≤
0) for all s = 1 , ..., d .3. WAV (Weighted AVerage of multiple running variables):
There are multiple running variables, and the treatment status is determined by some weighted average of those running variables. Hence, an individual $i$ is treated if $\sum_{s=1}^d w_s x_i(s) > 0$ (or $\leq 0$) for some positive weights $w_1, \ldots, w_d$. While we can view this design as an RD design with a single running variable $\tilde{x}_i = \sum_{s=1}^d w_s x_i(s)$, we may obtain richer information by considering it as a multi-dimensional RD design. For example, consider an RD design where an individual $i$ is treated if the average of the normalized math and reading scores is greater than 0. Then, we can consider treatment effect parameters at different cutoff points, e.g., (math = 0, reading = 0), (math = −1, reading = 1), or (math = 1, reading = −1).

Refer to Appendix A.3 of Babii and Kumar (2020) for RDD applications with univariate monotone running variables. Our categorization excludes, for example, geographic RD designs, such as those analyzed in Dell (2010), Keele and Titiunik (2015), and Imbens and Wager (2019). While our general framework can incorporate such models, we do not consider them, since the monotonicity assumption is unlikely to hold in such contexts.

D Auxiliary Figure for Section 5

[Figure 4: Regression function values on x ∈ [−1, 1] for the four regression functions f_1, ..., f_4.]