Inference in Regression Discontinuity Designs under Monotonicity
Koohyun Kwon†  Soonwoo Kwon‡

November 23, 2020
Abstract
We provide an inference procedure for the sharp regression discontinuity design (RDD) under monotonicity, with possibly multiple running variables. Specifically, we consider the case where the true regression function is monotone with respect to (all or some of) the running variables and is assumed to lie in a Lipschitz smoothness class. Such a monotonicity condition is natural in many empirical contexts, and the Lipschitz constant has an intuitive interpretation. We propose a minimax two-sided confidence interval (CI) and an adaptive one-sided CI. For the two-sided CI, the researcher is required to choose a Lipschitz class in which she believes the true regression function to lie. This is the only tuning parameter, and the resulting CI has uniform coverage and attains the minimax optimal length. The one-sided CI can be constructed to maintain coverage over all monotone functions, providing maximum credibility in terms of the choice of the Lipschitz constant. Moreover, the monotonicity makes it possible for the (excess) length of the CI to adapt to the true Lipschitz constant of the unknown regression function. Overall, the proposed procedures make it easy to see under what conditions on the underlying regression function the given estimates are significant, which can add more transparency to research using RDD methods.

∗ We thank our advisors Donald Andrews and Timothy Armstrong for continuous guidance. We are also grateful to participants at the Yale Econometrics Prospectus Lunch for insightful discussions.
† Department of Economics, Yale University, [email protected]
‡ Department of Economics, Yale University, [email protected]

1 Introduction
Recently, there has been growing interest in honest and minimax optimal inference methods in regression discontinuity designs (Armstrong and Kolesár, 2018a; Armstrong and Kolesár, 2020b; Imbens and Wager, 2019; Kolesár and Rothe, 2018; Noack and Rothe, 2020). This approach requires a researcher to specify a function space in which she believes the regression function to lie, and the inference procedures follow once this function space is chosen. The methods proposed in the literature essentially use bounds on the second derivative to specify the function space. This is motivated by the popularity of local linear regression methods in practice, which is often justified by imposing local bounds on the second derivative of the regression function. However, choosing a reasonable bound on the second derivative can be difficult in practice.

We address this concern by considering the problem of conducting inference on the sharp regression discontinuity (RD) parameter under monotonicity and a Lipschitz condition. Specifically, the regression function is assumed to be monotone in all or some of the running variables, with a bounded first derivative. Monotonicity naturally arises in many regression discontinuity design (RDD) contexts, as is well documented by Babii and Kumar (2020). The Lipschitz constant, or the bound on the first derivative, has an intuitive interpretation, since it bounds how much the outcome can change when the running variable changes by a single unit. Hence, if the researcher reports the inference results along with the Lipschitz constant used to run the proposed procedure, it is easy to see under what (interpretable) conditions on the regression function the researcher has obtained those results. We exploit the combination of the monotonicity and Lipschitz continuity restrictions to construct a confidence interval (CI) that is efficient and maintains correct coverage uniformly over a potentially large and more interpretable function space.

We provide a minimax two-sided CI and an adaptive one-sided CI. For the two-sided CI, the researcher is required to choose a bound on the first derivative of the true regression function. The bound is the only tuning parameter, and the resulting CI has uniform coverage and attains the minimax optimal length over the class of regression functions under consideration. Moreover, by exploiting monotonicity, the CI is significantly shorter than minimax CIs constructed under no such shape restriction. To our knowledge, this paper is the first to consider a minimax optimal procedure when the regression function is assumed to be monotone.

The one-sided CI can be constructed to maintain coverage over all monotone functions, providing maximum credibility in terms of the choice of the Lipschitz constant. Due to monotonicity, the resulting CI still has finite excess length as long as the true regression function has a bounded first derivative, where this bound is allowed to be arbitrarily large and unknown. This is in contrast with minimax CIs constructed without the monotonicity condition, whose length must be infinite to cover all functions. Moreover, our proposed one-sided CI adapts to the underlying smoothness class, resulting in a shorter CI when the true regression function has a smaller first-derivative bound. This enables the researcher to conduct non-conservative inference on the RD parameter while maintaining honest coverage over a significantly larger space of regression functions.
(This requires the regression function to be monotone in all the running variables.) The cost of such adaptation is that we can only construct either a one-sided lower or upper CI, depending on the treatment allocation rule, but not both. We characterize this relationship between the treatment allocation rules and the direction of the adaptive one-sided CI we can construct.

Our approach, especially the two-sided CI, is closely related to the literature on honest inference in RDDs. By working with second-derivative bounds, the inference procedures in that literature are based on local linear regression estimators, in line with the more conventional methods used in RDD settings. However, it is rather difficult to evaluate the validity of a second-derivative bound specified by a researcher. While it seems rather innocuous to ignore regression functions with kinks, and thus with infinite second derivatives, it is not clear how large or how small the second derivative should be to be considered "too large" or "too small". For this reason, the literature often recommends a sensitivity analysis to strengthen the credibility of the inference results. However, the credibility gain from the sensitivity analysis is limited when the smoothness parameter is not easy to interpret. For example, it is hard to judge whether the maximum value considered in the sensitivity analysis is large enough.

In contrast, the bound on the first derivative can be chosen based on more straightforward empirical reasoning, since the first derivative has the intuitive interpretation of a partial effect. For example, if an outcome variable y and a running variable x are current and previous test scores, the class of regression functions whose values change by no more than a given amount of y in response to a 1/10 standard deviation increase in x can be regarded as reasonable. Armstrong and Kolesár (2020a) take a similar approach of specifying Lipschitz constants in their empirical application, in an inference problem for average treatment effects under a different setting. By imposing a bound on the first derivative, our procedure is based on a Nadaraya–Watson type estimator, with the boundary bias correctly accounted for.

The possibility of forming an adaptive one-sided CI in RDD settings under monotonicity was first considered in Armstrong (2015) and Armstrong and Kolesár (2018b). The difference is that these papers are concerned with adapting to Hölder exponents β ∈ (0, 1] while fixing the Lipschitz constant.
Here, we fix β = 1 and adapt to the Lipschitz constant. When β = 1, what governs the performance of an adaptive CI is the size of the constant multiplying the rate of convergence, not the rate itself. This is in contrast to the setting considered in Armstrong (2015) and Armstrong and Kolesár (2018b), which primarily discuss rate-adaptation. In this paper, we provide a procedure that makes the magnitude of the multiplying constant reasonably small.

Babii and Kumar (2020) also consider an RDD setting with a monotone regression function. They introduce an inference procedure based on an isotonic regression estimator, conveying a message similar to ours: that the monotonicity restriction can lead to a more efficient inference procedure. Our approach differs from theirs in several key aspects: 1) we explicitly focus on maintaining uniform coverage and optimizing the length of the CI, 2) we consider the Lipschitz class while they consider the Hölder class with exponent β > 1/2, and 3) our procedure can be used in settings with multiple running variables.
The general treatment of the dimension of the running variables in this paper allows a researcher to use our procedure in settings with more than one running variable, referred to as the multi-score RDD by Cattaneo et al. (2020). This setting has also been considered by Imbens and Wager (2019). Our paper is the first to consider the space of monotone regression functions in this setting, and the gain from monotonicity is especially significant in the case with multiple running variables, as we show later in the paper. We also allow the regression function to be monotone with respect to only some of the variables, which broadens the scope of application
of our procedure. (In fact, our procedure can be easily adapted to the case where β ∈ (0, 1]. We focus on β = 1 mainly due to the interpretability of the Lipschitz constant, and because assuming bounded first derivatives seems rather innocuous in empirical contexts.)

The rest of the paper is organized as follows. Section 2 describes the setting and the form of the optimal kernels and bandwidths under this setting. Section 3 introduces the minimax two-sided CI, and Section 4 the adaptive one-sided CI. Section 5 provides results from simulation studies to demonstrate the efficacy of the proposed procedures. Section 6 revisits the empirical analysis of Lee (2008). Section 7 concludes by discussing possible extensions.
2 Setting

We observe i.i.d. observations {(y_i, x_i)}_{i=1}^n, where y_i ∈ R is an outcome variable and x_i ∈ X ⊂ R^d is either a scalar or a vector of running variables. We take X to be a hyperrectangle in R^d, and let X_t and X_c be connected sets with nonempty interiors that form a partition of X. The subscripts t and c correspond to the "treatment" and "control" groups, respectively, throughout this paper. We write x_i ∈ X_t and x_i ∈ X_c to indicate that individual i belongs to the treatment and the control group, respectively. When d = 1, our setting corresponds to the standard sharp RD design with a single cutoff point.

Let 1{·} denote the indicator function. Then, our setting can be written as the nonparametric regression model

  y_i = f(x_i) + u_i,  f(x) = f_t(x)·1{x ∈ X_t} + f_c(x)·1{x ∈ X_c},  (1)

where the random variable u_i is independent across i. Here, f_t and f_c denote the mean outcome functions for the treated and the control groups, respectively, so f corresponds to the mean outcome function for the observed outcome, which we refer to as the "regression function" throughout the paper.

Our parameter of interest is a treatment effect parameter at a boundary point x_0 ∈ B := ∂X_t ∩ ∂X_c, defined as

  L_RD f := lim_{x → x_0, x ∈ X_t} f(x) − lim_{x → x_0, x ∈ X_c} f(x).

When d = 1, L_RD f corresponds to the conventional sharp RD parameter. On the other hand, when d > 1, L_RD f is the sharp RD parameter at a particular cutoff point in B. This type of parameter is also considered in Imbens and Wager (2019) and Cattaneo et al. (2020) when they analyze RD designs with multiple running variables. Without loss of generality, we set x_0 = 0, since we can always relabel x̃_i = x_i − x_0.
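For concreteness, the following minimal sketch (ours, not from the paper) simulates data from model (1) for d = 1 with X_t = [−1, 0), and computes the implied RD parameter; the particular choices of f_t, f_c, and θ are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 500, 0.5

x = rng.uniform(-1.0, 1.0, n)           # running variable, X = [-1, 1]
treated = x < 0                         # X_t = [-1, 0), X_c = [0, 1]
f_t = lambda z: 0.8 * z + theta         # increasing mean outcome, treated
f_c = lambda z: 0.8 * z                 # increasing mean outcome, control
f = np.where(treated, f_t(x), f_c(x))   # f = f_t 1{x in X_t} + f_c 1{x in X_c}
y = f + rng.standard_normal(n)          # u_i ~ N(0, 1), independent across i

L_RD = f_t(0.0) - f_c(0.0)              # sharp RD parameter: the jump at x_0 = 0
print(L_RD)                             # equals theta here
```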
Remark 1 (More than two treatment statuses). Our framework can also handle settings in which individuals are assigned to more than two treatments based on the values of the multiple running variables, as analyzed by Papay et al. (2011). For example, when d = 2 with two cutoff points c_1, c_2, four different treatment statuses are possible depending on whether x_i^{(s)} ≥ c_s for s = 1, 2. Let {X_j}_{j=1}^4 be the corresponding partition of X. To compare, say, the first two treatment statuses, set X̃ = X_1 ∪ X_2 and apply our method with X̃ in place of X, taking X_t = X_1 and X_c = X_2.

We consider a framework in which a researcher is willing to specify some C > 0 such that
  |f_q(x) − f_q(z)| ≤ C·||x − z|| for all x, z ∈ X_q, for each q ∈ {t, c},  (2)

for some norm ||·|| on R^d. In other words, the mean potential outcome functions are Lipschitz continuous with Lipschitz constant C with respect to the norm ||·||. While the norm can be understood as the absolute value when d = 1, different choices of the norm give different interpretations when d > 1.
We allow a general class of norms, requiring only that ||x|| is increasing in the absolute value of each element of x. We say the regression function f defined in (1) has Lipschitz constant C if f_t and f_c satisfy (2).

In addition to the Lipschitz continuity, the researcher assumes that the mean potential outcome functions are monotone with respect to the running variables. Formally, letting z^{(s)} denote the s-th component of the running variable z ∈ R^d, and letting V ⊆ {1, ..., d} be an index set for the monotone variables, the researcher assumes that

  f_q(x) ≥ f_q(z) if x^{(s)} ≥ z^{(s)} for all s ∈ V and x^{(t)} = z^{(t)} for all t ∉ V,  (3)

for each q ∈ {t, c}. We use F(C) to denote the space of functions on R^d that satisfy (2) and (3) separately on X_t and X_c.

We now discuss the practical implications of our framework. First, there are abundant settings where such a monotonicity condition is reasonable. See Appendix C of our paper, as well as Appendix A.3 of Babii and Kumar (2020), for examples of RDDs with monotone running variables. One reason for the prevalence of monotonicity is the nature of policy design. For example, students with lower test scores are assigned to summer school because policymakers worry that students with lower current test scores will show lower academic achievement in the future; that is, they believe the average future academic achievement (an outcome variable) is monotone in current test scores (running variables).

Next, our framework with Lipschitz continuity differs from previous approaches that specify a bound on the second derivative. For example, Armstrong and Kolesár (2018a) consider the following locally smooth function space:

  f_t, f_c ∈ F_{T,p}(C) := { g : |g(x) − Σ_{j=0}^{p−1} g^{(j)}(0)·x^j/j!| ≤ C|x|^p for all x ∈ X },

and set p = 2 in their empirical application. Similarly, Imbens and Wager (2019) impose a global smoothness assumption that ||∇²f_q(x)|| ≤ C for q ∈ {t, c}, where ||∇²f_q(x)|| denotes the operator norm of the Hessian matrix. In contrast, we consider situations where a researcher has a belief about the size of the first, rather than the second, derivative.

The main advantage of working with the first derivative is the interpretability of the function space. The function space over which the coverage is uniform should be easy to interpret, in the sense that the researcher herself, or a policymaker evaluating the inference procedure, can judge whether the functions excluded from the function space can be safely disregarded as "extreme".

For this purpose, the size of the first derivative provides a reasonable criterion. To be concrete, consider Lee (2008), who analyzes the effect of incumbency on election outcomes; the outcome variable is the difference in the percentage of votes between two parties in the current election, and the running variable is the same quantity in the previous election. A mean potential outcome function whose maximum slope is as large as C = 50 seems unreasonable: it roughly implies that a very small increase in the vote percentage difference in an election, say 0.1%, predicts a large increase, 5%, in the vote percentage difference in the subsequent election. Similarly, if a researcher presents a CI for the incumbency effect parameter that has valid coverage over a space of functions whose first derivatives are bounded by C = 0.5,
the policymaker evaluating the analysis might find the function space too restrictive.

In comparison, it seems relatively more difficult to evaluate the validity of a given bound on the second derivative. Previous papers have proposed heuristic arguments for setting such a bound. For example, Armstrong and Kolesár (2018a) choose the smoothness parameter so that the reduction in the prediction MSE from using the true conditional mean rather than its Taylor approximation is not too large. While this gives an alternative interpretation of their smoothness constant, the prediction MSE does not have an interpretation that connects to the empirical examples being considered. Imbens and Wager (2019) suggest estimating the curvatures of f_t and f_c using quadratic functions and multiplying the estimates by a constant such as 2 or 3, but we can expect that this procedure would not yield uniform coverage without further restrictions on the function space, as pointed out in their paper. Armstrong and Kolesár (2020b) formally derive an additional condition on the function space that enables data-driven estimation of the smoothness parameter, but they warn that this additional assumption may be difficult to justify. Instead of setting a single bound, one may conduct a sensitivity analysis, as recommended by Armstrong and Kolesár (2018a) and Imbens and Wager (2019). However, a sensitivity analysis is more meaningful when the smoothness parameter being varied has an intuitive meaning.

A possible drawback of having to specify a Lipschitz constant is that our procedure does not ensure coverage when the mean potential outcome functions are linear, or close to linear, with slopes larger than the smoothness constant set a priori. For the minimax two-sided CI, we can view this as the price we pay for maintaining validity for functions with arbitrarily large second derivatives. On the other hand, our adaptive procedure can be used to construct a one-sided CI that adapts to the degree of Lipschitz smoothness. The adaptive procedure enables the researcher to set a very large value of the smoothness parameter, or even set it to infinity (so that the coverage is over all monotone functions), and to obtain a shorter CI if the true regression function has a smaller first-derivative bound. This is possible due to the monotonicity assumption, which is plausible in many RDD applications.

Remark 2 (Lipschitz continuity under general dimension).
When d > 1, it can be more reasonable to assume that a researcher has a belief about the size of the partial derivatives of the mean potential outcome functions. That is, there exist C_1, ..., C_d > 0 such that

  |f_q(x) − f_q(z)| ≤ C_s·|x^{(s)} − z^{(s)}| for all x, z ∈ X such that x_{−s} = z_{−s},  (4)

for each q ∈ {c, t} and s ∈ {1, ..., d}. Here, x_{−s} denotes the elements of x ∈ R^d excluding its s-th component. It is easy to show that under (4), the original Lipschitz continuity assumption (2) holds with C = 1 and ||·|| being the weighted ℓ1 norm on R^d given by ||z|| = Σ_{s=1}^d C_s|z^{(s)}|. Moreover, (2) holding with C = 1 and this weighted ℓ1 norm also implies (4). Therefore, a researcher assuming (4) can equivalently assume (2) with this weighted ℓ1 norm. This approach is also used in Armstrong and Kolesár (2020a) in the context of inference for average treatment effects under unconfoundedness.

Remark 3 (RDDs without monotonicity). By taking V = ∅, our procedure can be used to construct a minimax CI for the RD parameter without the monotonicity assumption. While other alternatives such as Armstrong and Kolesár (2018a) and Imbens and Wager (2019) can be used in this setting, our procedure is still useful to researchers who prefer imposing bounds on the first derivative rather than the second derivative, perhaps due to better interpretability.

Our procedures depend on certain kernel functions and bandwidths that depend on the Lipschitz parameter. We first introduce some notation. Given some z ∈ R^d and the index set V, we define (z)_+^V to be the element of R^d whose s-th component is given by

  (z)_{+,(s)}^V := max{z^{(s)}, 0}·1(s ∈ V) + z^{(s)}·1(s ∉ V), s = 1, ..., d.

Similarly, we define (z)_−^V := −(−z)_+^V. When a is a scalar, we use square brackets [a]_+ to denote max{a, 0}. In addition, we define σ(x_i) := Var^{1/2}[u_i | x_i], and given an estimator L̂ of L_RD f, we write bias_f(L̂) to denote E_f[L̂ − L_RD f] and sd(L̂) to denote Var^{1/2}(L̂).

The minimax procedure is based on the following kernel function:

  K(z) := [1 − (||(z)_+^V|| + ||(z)_−^V||)]_+.  (5)

In the adaptive procedure, different bandwidths are used for each coordinate, depending on the signs of the coordinates. To make the use of different bandwidths clear, for h = (h_1, h_2) we define

  K(z, h) := [1 − (||(z/h_1)_+^V|| + ||(z/h_2)_−^V||)]_+.

The optimal bandwidths used in estimation are based on the two functions ω_t(δ_t; C_1, C_2) and ω_c(δ_c; C_1, C_2), defined as the solutions to the equation

  Σ_{x_i ∈ X_q} [ω_q(δ_q; C_1, C_2) − C_1·||(x_i)_+^V|| − C_2·||(x_i)_−^V||]_+² / σ²(x_i) = δ_q², for each q ∈ {t, c},

given a pair of non-negative numbers (δ_t, δ_c), where σ²(x) := Var[u_i | x_i = x].
Moreover, given some scalar δ ≥ 0, we define δ*_t(δ; C_1, C_2) and δ*_c(δ; C_1, C_2) to be the solutions to

  sup_{δ*_t ≥ 0, δ*_c ≥ 0, (δ*_t)² + (δ*_c)² = δ²} ω_t(δ*_t; C_1, C_2) + ω_c(δ*_c; C_1, C_2).

Based on these definitions, we introduce the following shorthand notation, used throughout the paper: for q ∈ {t, c}, δ ≥ 0,
(C_1, C_2) ∈ R_+², and h ∈ R_+², we define

  ω*_q(δ; C_1, C_2) := ω_q(δ*_q(δ; C_1, C_2); C_1, C_2),

  a_q(h; C_1, C_2) := (1/2) · [Σ_{x_i∈X_q} K(x_i, h)·(C_1·||(x_i)_+^V|| − C_2·||(x_i)_−^V||)/σ²(x_i)] / [Σ_{x_i∈X_q} K(x_i, h)/σ²(x_i)],

  s(δ, h; C_1, C_2, σ(·)) := δ / [ω*_t(δ; C_1, C_2) · Σ_{x_i∈X_t} K(x_i, h)/σ²(x_i)].

When a single Lipschitz constant and a common scalar bandwidth are used, for q ∈ {t, c}, δ ≥ 0,
and C, h ∈ R_+, we define

  ω*_q(δ; C) := ω*_q(δ; C, C),
  a_q(h; C) := a_q((h, h); C, C),
  s(δ, h; C, σ(·)) := s(δ, (h, h); C, C, σ(·)).

The forms of the optimal kernel and bandwidth presented in this section result from solving the modulus-of-continuity problems considered by Donoho and Liu (1991) and Donoho (1994) in the context of minimax optimal inference, and by Cai and Low (2004) and Armstrong and Kolesár (2018a) in the context of adaptive inference. While we make the connection specific to our setting only when we prove the validity of our procedures in Appendix A, interested readers may refer to the aforementioned papers for a more general discussion.
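The objects defined above are straightforward to compute numerically. The sketch below is our own illustration, not the paper's implementation: it assumes the ℓ1 norm, and all function names and bracketing choices are ours. It implements the kernel K(z, h) and solves the defining equation for ω_q(δ; C_1, C_2) by root-finding:

```python
import numpy as np
from scipy.optimize import brentq

def pos_part(z, V):
    """(z)_+^V: apply max{., 0} to the coordinates in V, leave the rest as-is."""
    out = np.array(z, dtype=float, copy=True)
    out[:, V] = np.maximum(out[:, V], 0.0)
    return out

def l1(z):
    """An l1 norm; any norm increasing in each |z^(s)| is allowed."""
    return np.abs(z).sum(axis=1)

def kernel(z, h1, h2, V, norm=l1):
    """K(z, h) = [1 - (||(z/h1)_+^V|| + ||(z/h2)_-^V||)]_+ ; K(z) sets h1 = h2 = 1."""
    zp = norm(pos_part(z / h1, V))
    zm = norm(pos_part(-z / h2, V))   # ||(w)_-^V|| = ||(-w)_+^V|| for any norm
    return np.maximum(1.0 - (zp + zm), 0.0)

def omega(delta, C1, C2, x_q, sig2_q, V, norm=l1):
    """omega_q(delta; C1, C2): solve, for w,
    sum_i [w - C1||(x_i)_+^V|| - C2||(x_i)_-^V||]_+^2 / sigma^2(x_i) = delta^2."""
    g = C1 * norm(pos_part(x_q, V)) + C2 * norm(pos_part(-x_q, V))
    eq = lambda w: np.sum(np.maximum(w - g, 0.0) ** 2 / sig2_q) - delta ** 2
    hi = g.min() + delta * np.sqrt(sig2_q.min()) + 1.0
    while eq(hi) < 0:                 # expand the bracket until eq changes sign
        hi *= 2.0
    return brentq(eq, 0.0, hi)        # eq(0) = -delta^2 < 0 for delta > 0
```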
3 Minimax Two-sided CI

We first consider the case where the researcher is comfortable specifying a Lipschitz constant and/or the empirical context requires a two-sided CI. In this case, we recommend a minimax affine CI, the CI whose worst-case expected length is the shortest among all CIs based on affine estimators (Donoho, 1994). We refer to such a CI as the minimax CI for brevity. (Donoho (1994) shows that focusing on affine estimators is reasonable, in the sense that the gain from considering non-affine estimators is limited.)

The minimax CI is constructed from an affine estimator L̂_mm := a + Σ_{i=1}^n w_i·y_i with non-negative weights (w_i)_{i=1}^n and half-length χ_mm, i.e.,

  CI_mm = [L̂_mm − χ_mm, L̂_mm + χ_mm].

The half-length χ_mm is non-random and calibrated to maintain correct coverage uniformly over the function space F(C):

  inf_{f∈F(C)} P_f(L_RD f ∈ CI_mm) ≥ 1 − α.  (6)

Note that

  P_f(L_RD f ∈ CI_mm) = P_f( |(L̂_mm − L_RD f)/sd(L̂_mm)| ≤ χ_mm/sd(L̂_mm) ).

Under the assumption that the error term u_i has a Gaussian distribution, the random variable (L̂_mm − L_RD f)/sd(L̂_mm) is normally distributed with mean bias_f(L̂_mm)/sd(L̂_mm) and unit variance. Hence, the quantiles of |(L̂_mm − L_RD f)/sd(L̂_mm)| are maximized over f ∈ F(C) when |bias_f(L̂_mm)| is largest. Therefore, if we define cv_α(t) to be the 1 − α quantile of |Z| with Z ∼ N(t, 1), the smallest half-length χ_mm that guarantees the coverage requirement (6) is given by

  χ_mm = cv_α( sup_{f∈F(C)} |bias_f(L̂_mm)| / sd(L̂_mm) ) · sd(L̂_mm).  (7)

It remains to derive the form of the estimator L̂_mm such that χ_mm is minimized. We take L̂_mm to be the difference between two re-centered kernel regression estimators, L̂_mm = L̂_mm,t − L̂_mm,c, where

  L̂_mm,q := [Σ_{x_i∈X_q} K(x_i/h_q)·y_i/σ²(x_i)] / [Σ_{x_i∈X_q} K(x_i/h_q)/σ²(x_i)] − a_q, for each q ∈ {t, c},  (8)

for the kernel function K(·) defined in (5). Note that L̂_mm,t and L̂_mm,c correspond to estimators of f_t(0) and f_c(0), respectively. Regarding the form of the optimal kernel K: when d = 1, K(z) is the usual triangular kernel, whose optimality is discussed in Donoho (1994) and Armstrong and Kolesár (2020b). Here, we derive the optimal kernel for multi-dimensional cases as well, for any given norm and under partial or full monotonicity.

A notable difference from previous inference methods in RDDs is that the estimator L̂_mm,q is a Nadaraya–Watson estimator instead of a local linear estimator. This difference naturally arises because we work under the assumption of bounded first derivatives. In general, the local linear estimator is preferred due to the well-known issue of bias at the boundary for Nadaraya–Watson type estimators. In the context of honest inference, however, the worst-case bias is explicitly corrected for.
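The critical value cv_α(t) in (7) is the quantile of a folded normal distribution; it has no closed form but is cheap to compute numerically. A minimal sketch (ours):

```python
from scipy.optimize import brentq
from scipy.stats import norm

def cv(t, alpha=0.05):
    """cv_alpha(t): the 1 - alpha quantile of |Z| with Z ~ N(t, 1).

    Solves P(|Z| <= c) = Phi(c - t) - Phi(-c - t) = 1 - alpha for c."""
    t = abs(t)  # |Z| has the same distribution under means t and -t
    f = lambda c: norm.cdf(c - t) - norm.cdf(-c - t) - (1 - alpha)
    return brentq(f, 0.0, t + 10.0)

print(cv(0.0), cv(1.0))  # about 1.96 and 2.65 for alpha = 0.05
```

At t = 0 this reduces to the usual two-sided normal critical value; it grows with the worst-case bias-to-standard-deviation ratio.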
For the optimal choices of bandwidths (h_t, h_c) and centering terms (a_t, a_c), we can show that the minimax two-sided CI is obtained by taking

  h_q = h_q(δ) = ω*_q(δ; C)/C for each q ∈ {t, c},  (9)

for a suitable choice of δ ≥ 0, by applying the result of Donoho (1994) to our setting. Next, from the form of (7), the centering terms should be chosen such that

  sup_{f∈F(C)} bias_f(L̂_mm) = − inf_{f∈F(C)} bias_f(L̂_mm),

to minimize χ_mm; i.e., the worst-case negative and positive biases should be balanced. We can show that this is achieved by choosing

  a_q := a_q(h_q; C) for each q ∈ {t, c}.  (10)

Note that this quantity depends on δ whenever (h_t, h_c) depends on δ.

Now, let L̂_mm(δ) be the above kernel regression estimator with bandwidths (h_t(δ), h_c(δ)) and centering terms (a_t, a_c) as defined above. Under these choices, we have

  sd(L̂_mm(δ)) = s(δ, h_t(δ); C, σ(·)),
  sup_{f∈F(C)} bias_f(L̂_mm(δ)) = − inf_{f∈F(C)} bias_f(L̂_mm(δ)) = (1/2)·( C·(h_t(δ) + h_c(δ)) − δ·sd(L̂_mm(δ)) ).  (11)

Then, we choose the optimal value of δ by plugging (11) into (7) and computing the value of δ that minimizes the half-length χ_mm, say δ*, which yields the shortest half-length. Plugging δ = δ* into the bandwidth and centering-term formulas also yields the form of the estimator corresponding to this half-length. Procedure 1 summarizes our discussion of the construction of the minimax CI.

The following is the main theoretical result regarding the minimax procedure. While we consider an idealized setting with Gaussian errors and known conditional variances, such exact finite-sample results can be translated into asymptotic results under non-Gaussian errors with unknown variances, following arguments similar to those in Armstrong and Kolesár (2018c). In Section 5, we discuss in more detail how to plug in consistent estimators of the conditional variances.

Procedure 1 (Minimax Affine CI).
1. Choose a value C such that the Lipschitz continuity condition (2) is satisfied, with a suitable choice of a norm ||·|| on R^d when d > 1.
2. For each candidate δ ≥ 0, compute the bandwidths (9) and centering terms (10), which give the worst-case bias and standard deviation in (11) as functions of δ.
3. Using (11), find the value of δ that minimizes the half-length (7), say δ*.
4. Calculate the value of the estimator and the half-length by plugging in δ*, which gives the final form of the CI.

Assumption 1. {x_i}_{i=1}^n is nonrandom and u_i ∼ N(0, σ²(x_i)), where σ²(·) is known.

Theorem 3.1.
Under Assumption 1, we have

  inf_{f∈F(C)} P_f(L_RD f ∈ CI_mm) = 1 − α.

Moreover, CI_mm is the shortest among all (fixed-length) affine CIs with uniform coverage.

From (9), we see that there is a one-to-one relationship between the size of the bandwidth and the Lipschitz constant C chosen by the researcher. So choosing C is not necessarily an additional burden on the researcher if a bandwidth has to be chosen anyway. While various data-driven bandwidth choice methods exist, our way of choosing the bandwidth makes clear the relationship between the bandwidth and the function space over which the resulting CI has uniform coverage, while at the same time achieving the minimax optimal length.

It is useful to discuss the case d = 1 to illustrate the role of the monotonicity restriction in minimax optimal inference. Intuitively, under monotonicity we do not have to worry about the bias caused by functions with negative slopes, so it is optimal to use a larger bandwidth than in the case without monotonicity in order to reduce the standard error. Using our kernel function and bandwidth formulas above, we can calculate how much larger the bandwidth should be under monotonicity. When V = {1}, the kernel function in (5) is K(z) = [1 − |z|]_+, while when V = ∅ it is K_0(z) = [1 − 2|z|]_+ = K(z/(1/2)). Since K_0(z) = K(2z), the bandwidth ratio between the one under V = {1} and the one under V = ∅ is given by

  2·ω*_q(δ; C, {1}) / ω*_q(δ; C, ∅), for each q ∈ {t, c}, δ ≥ 0,

where ω*_q(δ; C, V) denotes the value of ω*_q(δ; C) when the monotonicity restriction holds for the index set V. Following Kwon and Kwon (2020), we can show that this quantity is approximately 2^{2/3} ≈ 1.59 for large n. So if we believe the mean potential outcome functions are monotone, it is optimal to use a bandwidth about 60% larger than what should be used without monotonicity.

By a similar argument, the length of the minimax CI becomes shorter when we only consider the space of monotone functions. Since the length of the CI is a fixed quantity, we can easily compare its length under the shape restriction to the one without it. Figure 1 considers the cases d = 1 and 2, for the treatment design where individuals are treated when the values of all the running variables are negative. The CI that does not take monotonicity into account can be as much as 30% longer for d = 1 and 40% longer for d = 2. Considering the prevalence of RDDs with monotone regression functions, this efficiency gain demonstrates the importance of incorporating the shape restriction when constructing minimax CIs.

Figure 1: Comparison of the minimax lengths with and without monotonicity. (Two panels plot CI length against the Lipschitz coefficient C ∈ [0, 5] for d = 1 and d = 2, comparing the monotone and non-monotone procedures.) For the design of the running variable(s), 500 observations were generated from the uniform distribution over [−1, 1]^d, for d = 1, 2. When d = 2, we set V = {1, 2} and used the ℓ1 norm. The lengths were calculated for σ(x) = 1.
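To make the construction concrete, the following sketch assembles Procedure 1 for d = 1, V = {1}, and treatment assigned when x_i < 0. It reuses omega() and cv() from the earlier sketches; the optimization bounds and the sine/cosine parametrization of the split (δ_t, δ_c) are our own implementation choices, not the paper's:

```python
import numpy as np
from scipy.optimize import minimize_scalar
# assumes omega() and cv() from the sketches in Sections 2 and 3 are in scope

def minimax_ci(x, sig2, C, alpha=0.05):
    """Procedure 1 sketch for d = 1, V = {1}, treated iff x < 0."""
    t, c, V = x < 0, x >= 0, [0]
    def pieces(delta, theta):
        # split delta_t = delta sin(theta), delta_c = delta cos(theta)
        wt = omega(delta * np.sin(theta), C, C, x[t, None], sig2[t], V)
        wc = omega(delta * np.cos(theta), C, C, x[c, None], sig2[c], V)
        return wt, wc
    def chi(delta):
        # choose the split maximizing the total modulus omega_t + omega_c
        th = minimize_scalar(lambda th: -sum(pieces(delta, th)),
                             bounds=(1e-3, np.pi / 2 - 1e-3), method="bounded").x
        wt, wc = pieces(delta, th)
        ht, hc = wt / C, wc / C                          # bandwidths, (9)
        Kt = np.maximum(1 - np.abs(x[t]) / ht, 0.0)      # triangular kernel
        sd = delta / (wt * np.sum(Kt / sig2[t]))         # sd(L_mm(delta))
        bias = 0.5 * (C * (ht + hc) - delta * sd)        # worst-case bias, (11)
        return cv(bias / sd, alpha) * sd                 # half-length, (7)
    opt = minimize_scalar(chi, bounds=(0.5, 50.0), method="bounded")
    return opt.x, opt.fun                                # (delta*, chi_mm)
```

The estimator itself then follows from (8), using the bandwidths h_q(δ*) and centering terms a_q(h_q(δ*); C).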
4 Adaptive One-sided CI

A prominent feature of the minimax two-sided CI is that it is a fixed-length CI, in the sense that its length is determined before observing the realization of the outcomes {y_i}_{i=1}^n. The length of the CI depends on the Lipschitz constant C chosen by the researcher: the larger the value of C, the longer the CI. Hence, the minimax CI may become too wide to use when a researcher wishes to strengthen the credibility of the inference results by setting a conservative value of the Lipschitz constant. In the extreme case where a researcher sets C = ∞, the CI is necessarily the entire real line, providing no information.

To deal with this issue, we provide a method to construct an adaptive one-sided CI. The CI can be made to maintain coverage over a function space with a large Lipschitz constant C, even allowing C = ∞. Moreover, as long as the true regression function lies in a smoother class, the length of the CI does not depend on C when d = 1, and nearly so when d > 1.
In Section 4.1, we discuss this property in more detail and provide a simple condition on the relationship between the treatment allocation rule and the direction of monotonicity under which it holds. This property ensures that a researcher can strengthen the credibility of the inference results by considering a large function space without ending up with an uninformative CI.

Furthermore, the one-sided CI can be made to utilize the information about the smoothness of the unknown regression function f contained in the observed outcomes {y_i}_{i=1}^n. Using this information, the length of the CI is adjusted accordingly, unlike the minimax optimal CI whose length is fixed regardless of the realized outcomes. CIs possessing this type of property are called adaptive CIs, following Cai and Low (2004). This property further improves the usefulness of the proposed one-sided CI: its length is not only (nearly) independent of C but also shrinks when the true regression function has a smaller first-derivative bound.

We focus our discussion on the construction of adaptive lower CIs and the related treatment allocation rules. In many RDD applications, researchers are interested in how significantly larger than 0 the true treatment effect is; in this context, a lower one-sided CI provides useful information. Upper CIs can be dealt with in an analogous manner but require different (or "opposite") treatment allocation rules. Lastly, the "length" of a one-sided CI refers to the distance between the true parameter L_RD f and its endpoint, referred to as the "excess length" in Armstrong and Kolesár (2018a).

4.1 Treatment allocation rules

In this section, we describe in more detail the treatment allocation rules under which it is possible to construct a lower CI with uniform coverage over F(C) but with length that does not necessarily increase with C. Specifically, given a smaller Lipschitz constant C′ such that 0 ≤ C′ < C, we ask when it is possible to construct a lower CI whose worst-case expected length over F(C′) does not grow with C. This property ensures that more credibility does not necessarily lead to wider CIs.

To give intuition, we first describe the argument in a simple setting with a single running variable and cutoff point x_0 = 0. Consider a lower CI [L̂ − χ_L, ∞) constructed by subtracting a constant χ_L from a linear estimator L̂. In order to maintain uniform coverage over the function space F(C), we must have

  χ_L = sup_{f∈F(C)} bias_f(L̂) + sd(L̂)·z_{1−α}.

Note that the estimator L̂ we consider is the difference between estimators of f_t(0) and f_c(0), say f̂_t(0) − f̂_c(0).

Now, suppose individuals receive the treatment if and only if x_i < 0.
A key property under this design is that f̂_t(0) always has a negative bias if f_t(x) is increasing, since f̂_t(0) is calculated only from observations with x_i < 0. Similarly, f̂_c(0) always has a positive bias when f_c(x) is increasing. Therefore, the bias of L̂ over f ∈ F(C) is always negative, regardless of the value of C specified by the researcher. Hence, a one-sided lower CI that maintains uniform coverage over F(C) can be formed to be independent of C. We can also easily see that this argument no longer holds when individual i is treated if and only if x_i ≥ 0, in which case the maximum bias over f ∈ F(C) increases with C.
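This sign argument is easy to check by simulation. In the toy sketch below (ours), the Nadaraya–Watson estimate of f_t(0) computed from treated observations x_i < 0 is systematically below the true value f_t(0) = 0 when f_t is increasing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, h = 20_000, 1.0, 0.3
x = rng.uniform(-1.0, 0.0, n)           # treated observations only: x_i < 0
y = C * x + rng.standard_normal(n)      # increasing f_t(x) = C x, so f_t(0) = 0
w = np.maximum(1.0 - np.abs(x) / h, 0)  # triangular kernel weights at the cutoff
print((w @ y) / w.sum())                # about -C*h/3 = -0.1: the bias is negative
```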
We now state this idea formally, including the case d > 1. Given some C̃ > 0, let L_α(C̃) denote the set of lower CIs (as functions of {y_i}_{i=1}^n and {x_i}_{i=1}^n) with uniform coverage probability 1 − α over F(C̃), i.e.,

  CI ∈ L_α(C̃) ⟺ inf_{f∈F(C̃)} P(L_RD f ∈ CI) ≥ 1 − α.

Define

  ℓ(C′; C) := inf_{[ĉ, ∞) ∈ L_α(C)} sup_{f∈F(C′)} E[L_RD f − ĉ],  (12)

where C′ ≤ C. This is the worst-case expected length of the CI that 1) has correct uniform coverage over F(C) and 2) has the smallest worst-case expected length over F(C′). Armstrong and Kolesár (2018a) calculate this quantity in terms of the modulus of continuity defined in Appendix A. The question we ask here is under what conditions the quantity ℓ(C′; C) can be viewed as independent of C. In other words, we characterize when it is possible to construct a one-sided lower CI whose length does not grow with C when the regression function belongs to a smoother function space.

Lemma 4.1.
Suppose both X_t and X_c are non-empty and Assumption 1 holds. Then, given a pair of Lipschitz constants (C′, C) with C′ < C, there exists a constant A(C′), independent of C, such that ℓ(C′; C) ≤ A(C′), if and only if the following hold: 1) there exists x_i ∈ X_t such that x_i ∈ (−∞, 0]^d, 2) there exists x_i ∈ X_c such that x_i ∈ [0, ∞)^d, and 3) the regression function is fully monotone, i.e., V = {1, ..., d}.

Under the conditions of Lemma 4.1, the researcher can construct a one-sided CI whose worst-case expected length is not too large if the true regression function belongs to the smoother space F(C′), while maintaining a stringent coverage requirement by specifying a large C. When d = 1, it is possible to take A(C′) = ℓ(C′; C′), with the inequality in Lemma 4.1 holding with equality, as suggested by the intuition discussed above. That is, C does not affect the size of ℓ(C′; C) at all when d = 1.
On the other hand, when d > 1, the same property does not hold unless we have observations only over [0, ∞)^d and (−∞, 0]^d, which is usually not the case in practice. Therefore, specifying a larger C translates into a longer CI (by giving less weight to observations outside [0, ∞)^d ∪ (−∞, 0]^d) when d > 1.
However, there is an upper bound on how much this length can grow with C. This upper bound is independent of C, and thus the worst-case length of the CI is nearly independent of C.

In words, the conditions in Lemma 4.1 imply that "the more disadvantaged group should get treated." Such RDD settings are easily found in the education literature, where students or schools with lower academic performance receive some kind of support (Chay et al., 2005; Chiang, 2009; Jacob and Lefgren, 2004; Leuven et al., 2007; Matsudaira, 2008); in the environmental economics literature, where counties with high pollution levels are exposed to environmental regulations (Chay and Greenstone, 2005; Greenstone and Gallagher, 2008); and in poverty programs that provide funds to those in need (Ludwig and Miller, 2007). Lastly, we note that when the mean potential outcome functions are decreasing rather than increasing, we may simply switch the conditions for X_t and X_c in Lemma 4.1.
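In practice, the conditions of Lemma 4.1 are simple to check from the data. A minimal sketch (ours; the argument conventions are hypothetical):

```python
import numpy as np

def lemma41_conditions(x, treated, V):
    """Check conditions 1)-3) of Lemma 4.1 for an adaptive lower CI.

    x: (n, d) running variables; treated: (n,) boolean; V: monotone indices."""
    n, d = x.shape
    cond1 = bool(np.any(np.all(x[treated] <= 0, axis=1)))   # some x_i in X_t with x_i <= 0
    cond2 = bool(np.any(np.all(x[~treated] >= 0, axis=1)))  # some x_i in X_c with x_i >= 0
    cond3 = sorted(V) == list(range(d))                     # full monotonicity
    return cond1 and cond2 and cond3
```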
4.2 Construction of the adaptive CI

Under the treatment allocation rules considered in the previous section, it is possible to construct a one-sided CI that is optimal for a single function space F(C′) for some C′, with its length independent or nearly independent of the larger Lipschitz constant C. Ideally, we would use a one-sided CI that performs well over a range of function spaces, corresponding to a range of Lipschitz constants C′ ∈ [C̲, C̄] for some specified bounds C̲ and C̄. To this end, we define a class of adaptive procedures based on multiple CIs that solve the optimization problem (12) for different values of the Lipschitz constant, C′ ∈ {C_j}_{j=1}^J, with C̲ ≤ C_1 < ⋯ < C_J ≤ C̄. For now, we take (C_j)_{j=1}^J as given; we discuss later how to choose this sequence in an optimal way.

To introduce our procedure, we first characterize the solution to (12), applying the general result of Armstrong and Kolesár (2018a) to our setting. Given a value of C′, it turns out that the lower CI solving (12) is based on a linear estimator L̂(C′) := a(C′) + Σ_{i=1}^n w_i(C′)·y_i with non-negative weights w_i(C′) and a centering constant a(C′). Given some τ ∈ (0, 1), the optimal lower CI, which maintains coverage 1 − τ over F(C), takes the form

  ĉ_{L,τ}(C′) = L̂_τ(C′) − sup_{f∈F(C)} bias_f(L̂_τ(C′)) − z_{1−τ}·sd(L̂_τ(C′)).  (13)

Here, we make the dependence on τ explicit because we later choose a suitable τ < α when J > 1.

For the estimator, we take L̂_τ(C′) to be the difference between two kernel regression estimators, L̂_τ(C′) = L̂_{t,τ}(C′) − L̂_{c,τ}(C′), where

  L̂_{q,τ}(C′) := [Σ_{x_i∈X_q} K(x_i, h_{q,τ}(C′))·y_i/σ²(x_i)] / [Σ_{x_i∈X_q} K(x_i, h_{q,τ}(C′))/σ²(x_i)], for each q ∈ {t, c}.

The optimal bandwidths are given by

  h_{t,τ}(C′) = ω*_t(z_{1−τ}; C, C′)·(1/C, 1/C′) and h_{c,τ}(C′) = ω*_c(z_{1−τ}; C′, C)·(1/C′, 1/C),  (14)

which completes the definition of the estimator L̂_τ(C′). The worst-case bias and the standard deviation of L̂_τ(C′) are given by

  sd(L̂_τ(C′)) = s(z_{1−τ}, h_{t,τ}(C′); C, C′, σ(·)),
  sup_{f∈F(C)} bias_f(L̂_τ(C′)) = a_t(h_{t,τ}(C′); C, C′) − a_c(h_{c,τ}(C′); C′, C) + (1/2)·[ω*_t(z_{1−τ}; C, C′) + ω*_c(z_{1−τ}; C′, C) − z_{1−τ}·sd(L̂_τ(C′))],  (15)

which completes the description of the one-sided CI solving (12).

The CI [ĉ_{L,τ}(C′), ∞) is optimal in terms of worst-case excess length over F(C′) for a single Lipschitz constant C′. We construct a CI that performs well over a collection of function spaces by taking the intersection of CIs of the form [ĉ_{L,τ}(C′), ∞) for different values of C′. Note that taking the intersection "picks out" the shortest CI among the multiple CIs formed. This is roughly equivalent to inferring from the data the function space to which the true regression function belongs and using the CI that performs well over that space.

The value of τ must be calibrated so that the resulting CI maintains correct coverage probability 1 − α after taking the intersection. Suppose we have a collection of J CIs {[ĉ_{L,τ}(C_j), ∞)}_{j=1}^J. A simple way to intersect them is a Bonferroni procedure with τ = α/J. This procedure, however, is conservative, since the correlations among the estimators L̂_τ(C_1), ..., L̂_τ(C_J) are positive, and highly so when C_j and C_k are close. To calibrate τ taking this positive correlation into account, let (V_{j,τ})_{j=1}^J denote a J-dimensional multivariate normal random vector with zero means, unit variances, and covariances given by

  Cov(V_{j,τ}, V_{k,τ}) = [Σ_{x_i∈X_t} K(x_i, h_{t,τ}(C_j))·K(x_i, h_{t,τ}(C_k))/σ²(x_i)] / [z²_{1−τ}/(ω*_t(z_{1−τ}; C, C_j)·ω*_t(z_{1−τ}; C, C_k))]
    + [Σ_{x_i∈X_c} K(x_i, h_{c,τ}(C_j))·K(x_i, h_{c,τ}(C_k))/σ²(x_i)] / [z²_{1−τ}/(ω*_c(z_{1−τ}; C_j, C)·ω*_c(z_{1−τ}; C_k, C))].  (16)

Then, we can show that if we take τ* to be the value of τ that solves

  P( max_{1≤j≤J} V_{j,τ} > z_{1−τ} ) = α,  (17)

the CI obtained by taking the intersection has correct coverage. Regarding the solution to this equation, we can in fact show that Cov(V_{j,τ}, V_{k,τ}) does not depend on τ as n → ∞, which implies that max_{1≤j≤J} V_{j,τ*} converges in distribution to a random variable V_max that does not depend on τ. (In a simulation exercise not reported, we find that this asymptotic approximation works well for moderate sample sizes such as n = 100.)
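Solving (17) is straightforward by Monte Carlo once the covariance matrix (16) has been computed, since the equation says exactly that z_{1−τ*} is the (1 − α)-quantile of max_j V_j. The sketch below (ours) exploits this, and also forms the intersected endpoint of Theorem 4.2 from precomputed estimates, worst-case biases, and standard deviations:

```python
import numpy as np
from scipy.stats import norm

def tau_star(Sigma, alpha=0.05, n_draws=200_000, seed=0):
    """Solve (17) by Monte Carlo: P(max_j V_j > z_{1-tau}) = alpha.

    Sigma: the J x J correlation matrix of (V_1, ..., V_J) from (16),
    assumed positive definite so the Cholesky factor exists."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Sigma)
    vmax = (rng.standard_normal((n_draws, Sigma.shape[0])) @ L.T).max(axis=1)
    # (17) holds iff z_{1-tau} equals the (1-alpha)-quantile of max_j V_j
    return 1.0 - norm.cdf(np.quantile(vmax, 1.0 - alpha))

def intersect_lower_cis(L_hat, worst_bias, sd, tau):
    """(13) and Theorem 4.2: per-C_j lower endpoints, then take the largest."""
    endpoints = L_hat - worst_bias - norm.ppf(1.0 - tau) * sd  # arrays over j
    return endpoints.max()
```

As sanity checks: with J = 1 this gives τ* = α, while with nearly uncorrelated estimators it approaches the Bonferroni choice α/J.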
The following is the main theoretical result for our intersection CI.

Theorem 4.2.
Given {C_1, ..., C_J}, C, and α ∈ (0, 1), let τ* be defined as in (17). Then, under Assumption 1, if we let ĉ := max_{1≤j≤J} ĉ_{L,τ*}(C_j), we have

  inf_{f∈F(C)} P_f(L_RD f ∈ [ĉ, ∞)) ≥ 1 − α.

The next result, which is immediate from Armstrong and Kolesár (2018a), shows that the class of adaptive procedures we consider is a reasonable one. It states that each of the CIs [ĉ_{L,τ*}(C_j), ∞) is an optimal CI when f ∈ F(C_j), except that it covers the true parameter with probability 1 − τ* instead of 1 − α, which is the price we pay to adapt to multiple Lipschitz classes.

Corollary 4.3.
Under Assumption 1, [ĉ_{L,τ*}(C_j), ∞) solves

  inf_{[ĉ, ∞) ∈ L_{τ*}(C)} sup_{f∈F(C_j)} E_f[L_RD f − ĉ].

Again, considering the simple case where d = 1 and V = {1} provides intuition for our adaptive procedure. In this case, L̂_{q,τ}(C′) is simply a kernel regression estimator with the triangular kernel K(z) = [1 − |z|]_+ and bandwidths

  h_{t,τ}(C′) = ω*_t(z_{1−τ}; C, C′)/C′,  h_{c,τ}(C′) = ω*_c(z_{1−τ}; C′, C)/C′.  (18)

Applying the results of Kwon and Kwon (2020), we can show that these quantities are approximately c·(C′)^{−2/3}, for some constant c, when n is large. We thus construct estimators with varying bandwidths and compare the lengths of the resulting one-sided CIs. The estimator with a smaller bandwidth is the one constructed to perform well when the Lipschitz constant of the regression function is large, and vice versa for the estimator with a larger bandwidth. This is because, when the Lipschitz constant of the regression function is large, the excess length of a one-sided CI is reduced by taking a smaller bandwidth to decrease the absolute size of the bias; when the Lipschitz constant is small, reducing the standard deviation by taking a larger bandwidth matters more for the excess length. This idea is similar to the bandwidth snooping procedure suggested by Armstrong and Kolesár (2018b), whose results focus on the case of a single running variable and adaptation to the Hölder exponent.

Remark 4 (Specifying C). When d = 1, the worst-case bias is 0 under the treatment allocation rules in Section 4.1. Therefore, a researcher can set C = ∞ and not worry about correct coverage. On the other hand, when d > 1,
the size of C governs how large the weights on observations outside [0, ∞)^d and (−∞, 0]^d should be. A larger C puts smaller weights on those observations, leading to a wider CI, with C = ∞ corresponding to using only the observations in [0, ∞)^d and (−∞, 0]^d. However, the length of the adaptive CI shrinks when the true regression function has a smaller first-derivative bound even if C is set to a large number, alleviating the concern that a large C might lead to a less informative CI.

Remark 5 (Adaptation in multi-dimensional RDDs). Consider the setting of Remark 2, where the Lipschitz continuity is specified by the weighted ℓ1 norm with Lipschitz constants C_1, ..., C_d. Then, our adaptive procedure allows one to consider different values of C other than C = 1. Note that this takes the ratios C_1/C_s as given for s = 2, ..., d. Adapting to different values of C_1/C_s is an interesting extension that we do not pursue here.

4.3 Choice of an adaptive procedure

In this section, we discuss the choice of the Lipschitz constants (C_j)_{j=1}^J, which concludes our definition of the adaptive one-sided CI. The choice of function spaces to adapt to is especially relevant in our setting, unlike in the previous literature on adaptive inference. The previous literature has mostly focused on rate-adaptation, the problem of constructing a CI that shrinks at an optimal rate. For example, Armstrong (2015) and Armstrong and Kolesár (2018b) discuss adaptive testing and the construction of adaptive CIs for the RD parameter that adapt to Hölder exponents β ∈ (0, 1]. Since we fix β = 1 and adapt to Lipschitz constants, the convergence rate is always n^{−1/(2+d)}, and what matters is the actual length of the CIs, not their rate of convergence.

The optimal but infeasible adaptive CI is a CI = [ĉ, ∞) such that

  sup_{f∈F(C′)} E[L_RD f − ĉ] = ℓ(C′; C) for all C′ ∈ [C̲, C̄],

where ℓ(C′; C) is defined in (12). This is infeasible since the form of the CI that is optimal over F(C′) differs from the one that is optimal over F(C″) for C′ ≠ C″. Instead, our aim is to construct a CI [ĉ, ∞) such that

  ℓ_adpt(C′) := sup_{f∈F(C′)} E[L_RD f − ĉ]

is close to ℓ(C′; C) over C′ ∈ [C̲, C̄], given some (C̲, C̄),
when ℓ_adpt(C′) and ℓ(C′; C) are viewed as functions of C′.

By restricting the class of one-sided CIs to those considered in Section 4.2, we can show that there is a measure of distance between ℓ_adpt(·) and ℓ(·; C) that is both reasonable and easy to calculate. To be specific, let C = (C_j)_{j=1}^J denote a sequence of Lipschitz constants used to construct our adaptive CI, and denote the endpoint of the resulting CI by ĉ(C). Then write

  ℓ_adpt(C′; C) := sup_{f∈F(C′)} E[L_RD f − ĉ(C)],

and consider the following quantity measuring the "distance" between ℓ_adpt(·; C) and ℓ(·; C):

  ∆(C) := sup_{C′∈[C̲, C̄]} ℓ_adpt(C′; C) / ℓ(C′; C).  (19)

Note that this criterion is consistent with the previous literature, which compares the performance of different confidence intervals using a ratio measure (Cai and Low, 2004; Armstrong and Kolesár, 2018a). Specifically, ∆(C) satisfies

  ℓ_adpt(C′; C) ≤ ∆(C)·ℓ(C′; C) for all C′ ∈ [C̲, C̄].

This is precisely the notion of adaptive CIs introduced in Cai and Low (2004). Since the rates of ℓ_adpt(C′; C) and ℓ(C′; C) are the same in our setting, the choice of an adaptive procedure should be based on the size of the constant ∆(C).

An advantage of focusing on the class of adaptive CIs proposed in Section 4.2 is that ∆(C) is easy to evaluate, as shown in the following proposition.

Proposition 4.4. We have ℓ_adpt(C′; C) = E[min_{j≤J} U_j], where U := (U_1, ..., U_J)′ is a Gaussian random vector with known mean and variance. Furthermore, we have

  ℓ(C′; C) = ω*_t(z_{1−α}; C, C′) + ω*_c(z_{1−α}; C′, C).

(The expressions for the mean and variance of U are given in the proof.)

Based on the discussion so far, we recommend choosing {C_j}_{j=1}^J based on the value of ∆(C). When calculating this value, searching over all possible sequences C is infeasible in practice. Instead, we suggest taking C = C(J) to be the equidistant grid of size J on [C̲, C̄] and calculating ∆(C(J)) for different values of J. Our simulation study (not reported) suggests that the gain from increasing J becomes very small beyond some threshold. Hence, a computationally attractive procedure is to increase J until the additional gain from using J + 1 instead of J is smaller than a tolerance parameter.

Procedure 2 (Adaptive One-sided CI).
1. Choose C so that the Lipschitz continuity condition (2) is satisfied, with a suitable choice of a norm ||·|| on R^d when d > 1. When d = 1, we can set C = ∞; see Remark 4 in Section 4.2.
2. Choose values of C̲ and C̄ such that 0 ≤ C̲ < C̄ ≤ C; this is the region where the adaptive CI will be close to optimal. (A reasonable value of C̲ can be estimated; see Appendix B.)
3. Let C(J) be the equidistant grid of size J over [C̲, C̄]. Starting from J = 2, increase J by 1 until |∆(C(J)) − ∆(C(J − 1))| ≤ ε, for a tolerance level ε.
4. Let J* be the value of J at which Step 3 stops. Use C(J*) = (C_j)_{j=1}^{J*} as the sequence of Lipschitz constants for the adaptive CI.
5. Using (13), (14), (15), and (17), obtain the adaptive one-sided CI [max_{1≤j≤J*} ĉ_{L,τ*}(C_j), ∞).
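Step 3 of Procedure 2 is a simple stopping rule. A sketch (ours), with ∆ supplied as a callable that evaluates (19), e.g., by simulating E[min_j U_j] as in Proposition 4.4:

```python
import numpy as np

def choose_grid(Delta, C_lo, C_hi, eps=0.01, J_max=50):
    """Step 3 of Procedure 2: grow the equidistant grid until Delta(C(J)) stabilizes."""
    prev, grid = None, None
    for J in range(2, J_max + 1):
        grid = np.linspace(C_lo, C_hi, J)   # equidistant grid of size J
        cur = Delta(grid)
        if prev is not None and abs(cur - prev) <= eps:
            return grid                      # stop: J* = J
        prev = cur
    return grid                              # fall back to the largest grid tried
```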
In the simulation designs of Section 5, the largest value of ∆(C(J*)) across all scenarios is less than 1.07, where J* is the value of J obtained by the method described above. That is, in the data generating processes considered in Section 5, the worst-case length of the adaptive CI is at most 7% longer than the worst-case length of the CI we would have used had we known the true Lipschitz constant.

Note that our procedure requires a researcher to specify the values of C̲ and C̄. For C̲, while it can be set to 0, a reasonable value can be estimated from the data, as discussed in Appendix B. The value of C̄ defines the region of Lipschitz constants over which the adaptive CI is intended to perform well; a researcher may therefore choose C̄ to be some non-conservative potential value of the Lipschitz constant of the regression function. As discussed above, the coverage probability is not affected by the value of C̄, so choosing C̄ is not a burdensome task. Procedure 2 summarizes the steps for constructing the adaptive one-sided CI, including the choice of the Lipschitz spaces to adapt to.

5 Simulations

We investigate the performance of our minimax and adaptive procedures via a simulation study. We focus on the case where d = 1, with individuals treated if x_i < 0.
We restrict the support of x_i to [−1, 1]. Writing θ ≡ L_RD f, we consider the following designs:

1. Linear design: f_c(x) = Cx, f_t(x) = f_c(x) + θ. We have |f′_q(x)| = C and |f″_q(x)| = 0 for all x ∈ [−1, 1] and q ∈ {t, c}.

2. Modified specification of Armstrong and Kolesár (2018c): given some "knots" (b_1, b_2) such that 0 < b_1 < b_2, define

  f_c(x) = (3C/2)·(x² − 2[(x − b_1)_+]² + 2[(x − b_2)_+]²),  f_t(x) = −f_c(−x) + θ.

If b_1 ≥ b_2/2, both functions are increasing. Taking (b_1, b_2) = (1/3, 2/3) gives |f′_q(x)| ≤ C and |f″_q(x)| = 3C for all x ∈ [−1, 1] and q ∈ {t, c}. We also have |f′_q(0)| = 0.

3. Modified specification of Babii and Kumar (2020): define

  f_c(x) = C·(x³ + x)/4,  f_t(x) = f_c(x) + θ.

We have |f′_q(x)| ≤ C and |f″_q(x)| = 3C|x|/2 for all x ∈ [−1, 1] and q ∈ {t, c}. We also have |f′_q(0)| = C/4 and |f″_q(0)| = 0.

4. Nonzero first and second derivatives at 0: define

  f_c(x) = C·((x + 1)³ − 1)/3,  f_t(x) = −f_c(−x) + θ.

We have |f′_q(x)| ≤ 4C and |f″_q(x)| ≤ 4C for all x ∈ [−1, 1] and q ∈ {t, c}. We also have |f′_q(0)| = C and |f″_q(0)| = 2C.
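Collecting the four designs as code (our own sketch; the scale constants here are pinned down by the derivative properties stated above, and the helper names are ours):

```python
import numpy as np

def pp(z):  # positive part [z]_+
    return np.maximum(z, 0.0)

def f1_c(x, C): return C * x
def f2_c(x, C, b1=1/3, b2=2/3):
    return 1.5 * C * (x**2 - 2 * pp(x - b1)**2 + 2 * pp(x - b2)**2)
def f3_c(x, C): return C * (x**3 + x) / 4
def f4_c(x, C): return C * ((x + 1)**3 - 1) / 3

def make_f(f_c, theta, flip):
    """Build f on [-1, 1]: f_t = f_c + theta, or f_t(x) = -f_c(-x) + theta if flip."""
    f_t = (lambda x, C: -f_c(-x, C) + theta) if flip else (lambda x, C: f_c(x, C) + theta)
    return lambda x, C: np.where(x < 0, f_t(x, C), f_c(x, C))  # treated iff x < 0

theta = 0.5  # illustrative jump size
f1, f2 = make_f(f1_c, theta, flip=False), make_f(f2_c, theta, flip=True)
f3, f4 = make_f(f3_c, theta, flip=False), make_f(f4_c, theta, flip=True)
```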
              x_i ∼ unif(−1, 1)   Constant variance   C is small
  Design 1          YES                 YES               YES
  Design 2          NO                  YES               YES
  Design 3          YES                 NO                YES
  Design 4          NO                  NO                YES
  Design 5          YES                 YES               NO
  Design 6          NO                  YES               NO
  Design 7          YES                 NO                NO
  Design 8          NO                  NO                NO

Table 1: Simulation design specifications

                        f = f_1                              f = f_2
             Length:      RBC       Minimax      Length:      RBC       Minimax
             RBC/MM     coverage    coverage     RBC/MM     coverage    coverage
  Design 1    1.094       0.925       0.968       1.097       0.924       0.942
  Design 2    1.112       0.938       0.979       1.115       0.941       0.949
  Design 3    1.082       0.926       0.968       1.085       0.924       0.942
  Design 4    1.099       0.936       0.979       1.103       0.940       0.950
  Design 5    1.128       0.925       0.919       1.148       0.930       0.945
  Design 6    1.137       0.938       0.934       1.162       0.941       0.951
  Design 7    1.117       0.926       0.920       1.139       0.927       0.945
  Design 8    1.125       0.936       0.935       1.153       0.941       0.952

Table 2: Comparison between RBC and minimax; f ∈ {f_1, f_2}
For the running variables, we consider $x_i \sim \mathrm{unif}(-1, 1)$ and $x_i \sim 2 \times \mathrm{Beta}(2, \cdot) - 1$, a rescaled Beta distribution supported on $[-1, 1]$. For the conditional standard deviation, we consider $\sigma_1(x) = 1$ and $\sigma_2(x) = \phi(x)/\phi(0)$, where $\phi(x)$ is the standard normal pdf. The sample size is $n = 500$. Figure 4 in Appendix D provides plots for the four regression functions.

We estimate the conditional variance based on local constant kernel regression, where the initial bandwidth is chosen based on Silverman's rule of thumb. This is to avoid using a bandwidth selection method based on local linear regression, to ensure that our proposed method works even when the second derivative is very large. After the estimators are calculated based on the estimated conditional variance, we construct the CIs following Armstrong and Kolesár (2020b), who use a simple way to estimate the variance by
$$\hat{\sigma}^2(x_i) = \frac{J}{J + 1}\left( y_i - \frac{1}{J}\sum_{m=1}^J y_{j_m(i)} \right)^2,$$
for some fixed $J$, where $j_m(i)$ denotes the index of the $m$-th closest observation to $i$ (with the same treatment status). The default value in their implementation is $J = 3$, which we follow.
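A minimal sketch of this nearest-neighbor variance estimator for $d = 1$ is given below, assuming the inputs are the running variable and outcome within a single treatment group; the helper name `nn_variance` is ours.

import numpy as np

def nn_variance(x, y, J=3):
    # Nearest-neighbor variance estimator of Armstrong and Kolesár (2020b):
    # sigma^2(x_i) = J/(J+1) * (y_i - mean of the J closest y's)^2,
    # computed within a treatment group (pass each group's data separately).
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sig2 = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        d[i] = np.inf                    # exclude the observation itself
        nbrs = np.argsort(d)[:J]         # indices of the J closest observations
        sig2[i] = J / (J + 1) * (y[i] - y[nbrs].mean())**2
    return sig2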
                     f = f_3                           f = f_4
          Length:     RBC       Minimax     Length:     RBC       Minimax
          RBC/MM     coverage   coverage    RBC/MM     coverage   coverage
Design 1   1.090      0.925      0.949       1.093      0.924      0.968
Design 2   1.109      0.937      0.951       1.111      0.938      0.977
Design 3   1.078      0.925      0.949       1.081      0.924      0.968
Design 4   1.096      0.936      0.953       1.098      0.936      0.977
Design 5   1.094      0.923      0.962       1.119      0.920      0.930
Design 6   1.112      0.937      0.971       1.132      0.935      0.947
Design 7   1.082      0.924      0.961       1.107      0.924      0.930
Design 8   1.099      0.936      0.971       1.119      0.933      0.947

Table 3: Comparison between RBC and minimax; f ∈ {f_3, f_4}

First, we investigate the performance of the minimax procedure described in Section 3. We consider different combinations of specifications for the running variable, the variance function, and the value of $C$: 1) $x_i$ follows either a uniform or a Beta distribution, 2) the variance function is given by either $\sigma_0 \times \sigma_1(x)$ or $\sigma_0 \times \sigma_2(x)$, where $\sigma_0$ is a fixed scale constant, and 3) $C$ is either small ($C = 1$) or large ($C = 3$).

For the minimax two-sided CI, we consider the case where we correctly specify the Lipschitz constant in all cases, setting the bound equal to the true value of $C$. The RBC CIs are constructed using the R package rdrobust.

Tables 2 and 3 display the results. For each regression function, the first column shows the ratio of the lengths of the CIs, and the second and third columns show the coverage probabilities of the RBC and the minimax procedures. Despite the fact that the minimax CIs are shorter than the RBC CIs, the coverage probabilities of the minimax CIs are closer to the nominal level than those of the RBC CIs. However, the length comparison here should not be interpreted as establishing the superiority of our procedure over the RBC method, since the lengths are sensitive to the choice of $C$ and to the form of the true regression function. Rather, as we have discussed so far, the relative advantage of our procedure is the use of a more transparent function space, and the incorporation of the monotonicity of the regression function.

[Figure 2: Performance comparison of one-sided CIs. Each panel plots the excess length ratio against the true $C$ over $[0.5, 2]$, for $f = f_1, \ldots, f_4$, for the adaptive and minimax procedures.]
Next, we compare the performance of the one-sided CI to two benchmarks, the minimax and the oracle one-sided CIs. Specifically, we consider the regression functions $f_1, \ldots, f_4$ and vary the value of $C \in [C_l, C_u]$, which determines their smoothness. For each $C \in [C_l, C_u]$, the oracle procedure adapts to the single Lipschitz constant $C$, as if we knew the true Lipschitz constant, which is infeasible in practice. On the other hand, the minimax procedure adapts to the largest Lipschitz constant $C_u$; this procedure is nearly optimal (among feasible procedures) when we do not have monotonicity, as shown in Armstrong and Kolesár (2018a). Both the oracle and the minimax procedures take monotonicity into account.

We take $(C_l, C_u) = (1/2, 2)$ and consider only Design 1. For the adaptive procedure, we take $(\underline{C}, \bar{C}) = (1/2, 2)$. As Figure 2 shows, the adaptive CI performs relatively better over $C \in [1/2, 1]$ than over $C \in [1, 2]$. For $f = f_2$ and $f = f_3$, the adaptive CIs are sometimes even shorter than the oracle. This is because the derivatives of such regression functions near $x = 0$ are smaller than $C$, although their maximum derivatives equal $C$ globally. This shows that the adaptive CI adapts to the local smoothness of the regression function, which is another advantage of using the adaptive procedure.

6 Empirical Application

In this section, we revisit the analysis of Lee (2008). The running variable $x_i \in [-1, 1]$ is the Democratic margin of victory in a US House election, and the outcome $y_i \in [0, 1]$ is the Democratic vote share in the next election. Since $x_i$ is the vote margin instead of the vote share, a one-for-one bound in terms of the vote share translates to setting $C = 1/2$ in terms of the margin, because a unit change in the vote share corresponds to a two-unit change in the margin. We report results for $C \in [0.4, 1]$, with our preferred specification being $C = 1/2$.
We also include CIs using the robust bias correction (RBC) method of Calonico et al. (2015) and the minimax CI using a second derivative bound as in Armstrong and Kolesár (2018a). For the latter, the bound on the second derivative is set as $M = 1/10$, which is the largest bound used in their empirical analysis. The nominal coverage probability is 0.95 for all CIs.

[Figure 3: Lee (2008) example. The red line (Minimax two-sided) plots our minimax optimal CI, while the blue line (RBC) and the green line (AK) plot the CIs constructed using the methods of Calonico et al. (2015) and Armstrong and Kolesár (2018a), respectively. The purple line (Adaptive one-sided) plots the adaptive one-sided upper CI with $C = \infty$. The horizontal axis is the Lipschitz constant $C$, and the vertical axis is the electoral advantage (%). The vertical dotted line indicates $C = 1/2$, our preferred specification.]

The CIs obtained by the robust bias correction and the second derivative bound are given by [3.…, ….38] and [2.…, ….…], respectively. Under our preferred specification $C = 1/2$, our CI is given by [5.…, ….…], and the estimated electoral advantage remains significant unless $C$ is larger than 16. The first derivative bound $C = 16$ roughly implies that a one unit increase in the previous election vote share can predict as much as a 32 unit increase in the next election vote share, which is quite a large amount. Therefore, we conclude that the significance of the incumbency effect is robust over a reasonable set of assumptions on the unknown regression function.

Given that the various CIs considered here are obtained from different assumptions on the regression function, it would be an interesting exercise to see what we can infer about the RD parameter under a minimal assumption. To this end, we construct an adaptive one-sided CI which maintains coverage over all monotone functions. The discussion in Section 4.1 implies that we can construct an adaptive upper CI in this empirical setting. We take $(\underline{C}, \bar{C}) = (0.…, ….5)$.
To make the one-sided CI comparable with the other two-sided CIs, the nominal coverage probability of the upper CI is set as 0.975. The resulting CI corresponds to the purple line in Figure 3.

7 Conclusion

In this paper, we proposed a minimax two-sided CI and an adaptive one-sided CI when the regression function is assumed to be monotone and to have a bounded first derivative. We showed that our procedure achieves uniform coverage under easy-to-interpret conditions and can be used to construct either a two-sided CI with minimax optimal length or a one-sided CI whose excess length adapts to the smoothness of the unknown regression function. There are two extensions that we find interesting.
Fuzzy RDDs.
There are various RDD applications where compliance with the treatment status is only partial. Therefore, it would be of interest to extend our approach to the fuzzy RDD setting, for example, by making monotonicity and Lipschitz continuity assumptions on the treatment propensity $p(x) = P(t_i = 1 \mid x_i = x)$, where $t_i$ is the treatment indicator for individual $i$. This approach would complement the minimax optimal approaches to fuzzy RDDs using second derivative bounds by Armstrong and Kolesár (2020b) and Noack and Rothe (2020).

Weighted CATE.
In multi-score RDDs, Imbens and Wager (2019) suggest estimating a weighted average of conditional treatment effects over different boundary points to make inference more precise. Since the weighted average parameter is a linear functional of the regression function, we can also adjust our framework to conduct inference on this parameter. On the other hand, a closed-form solution might not exist, and how to computationally construct the confidence interval for the weighted average parameter under our setting seems to be an interesting research question.

References
Armstrong, T. (2015): "Adaptive testing on a regression function at a point," The Annals of Statistics, 43, 2086–2101.

Armstrong, T. and M. Kolesár (2020a): "Finite-sample optimal estimation and inference on average treatment effects under unconfoundedness."

Armstrong, T. B. and M. Kolesár (2016): "Optimal inference in a class of regression models," working paper.

——— (2018a): "Optimal inference in a class of regression models," Econometrica, 86, 655–683.

——— (2018b): "A simple adjustment for bandwidth snooping," The Review of Economic Studies, 85, 732–765.

——— (2018c): "Supplement to 'Optimal inference in a class of regression models'," Econometrica Supplemental Material, 85.

——— (2020b): "Simple and honest confidence intervals in nonparametric regression," Quantitative Economics, 11, 1–39.

Babii, A. and R. Kumar (2020): "Isotonic regression discontinuity designs."

Cai, T. T. and M. G. Low (2004): "An adaptation theory for nonparametric confidence intervals," The Annals of Statistics, 32, 1805–1840.

Calonico, S., M. D. Cattaneo, and R. Titiunik (2015): "rdrobust: An R Package for Robust Nonparametric Inference in Regression-Discontinuity Designs," The R Journal, 7, 38.

Cattaneo, M. D., N. Idrobo, and R. Titiunik (2020): A Practical Introduction to Regression Discontinuity Designs: Extensions, Cambridge University Press (to appear).

Chay, K. Y. and M. Greenstone (2005): "Does air quality matter? Evidence from the housing market," Journal of Political Economy, 113, 376–424.

Chay, K. Y., P. J. McEwan, and M. Urquiola (2005): "The central role of noise in evaluating interventions that use test scores to rank schools," American Economic Review, 95, 1237–1258.

Chiang, H. (2009): "How accountability pressure on failing schools affects student achievement," Journal of Public Economics, 93, 1045–1057.

Dell, M. (2010): "The persistent effects of Peru's mining mita," Econometrica, 78, 1863–1903.

Donoho, D. L. (1994): "Statistical estimation and optimal recovery," The Annals of Statistics, 22, 238–270.

Donoho, D. L. and R. C. Liu (1991): "Geometrizing rates of convergence, III," The Annals of Statistics, 668–701.

Greenstone, M. and J. Gallagher (2008): "Does hazardous waste matter? Evidence from the housing market and the superfund program," The Quarterly Journal of Economics, 123, 951–1003.

Imbens, G. and S. Wager (2019): "Optimized regression discontinuity designs," Review of Economics and Statistics, 101, 264–278.

Jacob, B. A. and L. Lefgren (2004): "Remedial education and student achievement: A regression-discontinuity analysis," Review of Economics and Statistics, 86, 226–244.

Kane, T. J. (2003): "A quasi-experimental estimate of the impact of financial aid on college-going," Tech. rep., National Bureau of Economic Research.

Keele, L. J. and R. Titiunik (2015): "Geographic boundaries as regression discontinuities," Political Analysis, 23, 127–155.

Kolesár, M. and C. Rothe (2018): "Inference in regression discontinuity designs with a discrete running variable," American Economic Review, 108, 2277–2304.

Kwon, K. and S. Kwon (2020): "Adaptive inference in multivariate nonparametric regression models under monotonicity," working paper.

Lee, D. S. (2008): "Randomized experiments from non-random selection in US House elections," Journal of Econometrics, 142, 675–697.

Leuven, E., M. Lindahl, H. Oosterbeek, and D. Webbink (2007): "The effect of extra funding for disadvantaged pupils on achievement," The Review of Economics and Statistics, 89, 721–736.

Ludwig, J. and D. L. Miller (2007): "Does Head Start improve children's life chances? Evidence from a regression discontinuity design," The Quarterly Journal of Economics, 122, 159–208.

Matsudaira, J. D. (2008): "Mandatory summer school and student achievement," Journal of Econometrics, 142, 829–850.

Noack, C. and C. Rothe (2020): "Bias-aware inference in fuzzy regression discontinuity designs," arXiv preprint arXiv:1906.04631.

Papay, J. P., R. J. Murnane, and J. B. Willett (2010): "The consequences of high school exit examinations for low-performing urban students: Evidence from Massachusetts," Educational Evaluation and Policy Analysis, 32, 5–23.

Papay, J. P., J. B. Willett, and R. J. Murnane (2011): "Extending the regression-discontinuity approach to multiple assignment variables," Journal of Econometrics, 161, 203–207.

Van der Klaauw, W. (2002): "Estimating the effect of financial aid offers on college enrollment: A regression–discontinuity approach," International Economic Review, 43, 1249–1287.

Wong, V. C., P. M. Steiner, and T. D. Cook (2013): "Analyzing regression-discontinuity designs with multiple assignment variables: A comparative study of four estimation methods," Journal of Educational and Behavioral Statistics, 38, 107–141.
A Lemmas and Proofs
In this section, we collect auxiliary lemmas and omitted proofs. Before presenting the results, we state the following definition.
Definition 1.
Given two Lipschitz constants $C_1$ and $C_2$, and some constant $\delta \geq 0$, we define
$$\omega(\delta; C_1, C_2) := \sup_{f_1 \in \mathcal{F}(C_1),\, f_2 \in \mathcal{F}(C_2)} L_{\mathrm{RD}}f_2 - L_{\mathrm{RD}}f_1 \quad \text{s.t.} \quad \sum_{i=1}^n \left( \frac{f_2(x_i) - f_1(x_i)}{\sigma(x_i)} \right)^2 \leq \delta^2. \tag{20}$$
The quantity $\omega(\delta; C_1, C_2)$ is called the ordered modulus of continuity of $\mathcal{F}(C_1)$ and $\mathcal{F}(C_2)$ for the parameter $L_{\mathrm{RD}}f$.

A.1 Lemmas
Lemma A.1.
Given some pair of numbers $C, C' \geq 0$ and $\delta \geq 0$, define $\omega_q(\delta; C, C')$ for each $q \in \{t, c\}$ to be the solution to the following equation:
$$\sum_{x_i \in \mathcal{X}_q} \left[ \omega_q(\delta; C, C') - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\| \right]_+^2 / \sigma^2(x_i) = \delta^2.$$
Then, we have
$$\omega_q(\delta; C, C') = \sup_{f_1 \in \mathcal{F}(C),\, f_2 \in \mathcal{F}(C')} f_2(0) - f_1(0) \quad \text{s.t.} \quad \sum_{x_i \in \mathcal{X}_q} \left( \frac{f_2(x_i) - f_1(x_i)}{\sigma(x_i)} \right)^2 \leq \delta^2. \tag{21}$$

Proof.
See Kwon and Kwon (2020).
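Computationally, $\omega_q(\delta; C, C')$ is the root of a scalar equation whose left-hand side is increasing in $\omega_q$, so it can be found by bisection. The sketch below assumes the arrays `v_plus`, `v_minus`, and `sigma2` hold $\|(x_i)_{V+}\|$, $\|(x_i)_{V-}\|$, and $\sigma^2(x_i)$ for the observations in $\mathcal{X}_q$; the function name is ours.

import numpy as np

def omega_q(delta, C, Cp, v_plus, v_minus, sigma2, tol=1e-10):
    # Solve for omega_q(delta; C, C') in Lemma A.1 by bisection:
    # sum_i [omega - C ||(x_i)_{V+}|| - C' ||(x_i)_{V-}||]_+^2 / sigma^2(x_i) = delta^2.
    def lhs(omega):
        w = np.maximum(omega - C * v_plus - Cp * v_minus, 0.0)
        return np.sum(w**2 / sigma2)
    lo, hi = 0.0, 1.0
    while lhs(hi) < delta**2:      # bracket the root (lhs is increasing in omega)
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if lhs(mid) < delta**2 else (lo, mid)
    return 0.5 * (lo + hi)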
Lemma A.2.
Given some pair $C_1, C_2 \geq 0$ and $\delta \geq 0$, we have
$$\omega(\delta; C_1, C_2) = \omega_t(\delta^*_t(\delta; C_1, C_2); C_1, C_2) + \omega_c(\delta^*_c(\delta; C_1, C_2); C_2, C_1).$$

Proof of Lemma A.2. Noting that
$$L_{\mathrm{RD}}f_2 - L_{\mathrm{RD}}f_1 = (f_{2,t}(0) - f_{2,c}(0)) - (f_{1,t}(0) - f_{1,c}(0)) = (f_{2,t}(0) - f_{1,t}(0)) + (f_{1,c}(0) - f_{2,c}(0)),$$
$\omega(\delta; C_1, C_2)$ is obtained by solving the following problem:
$$\sup_{f_{1,t}, f_{1,c}, f_{2,t}, f_{2,c}} (f_{2,t}(0) - f_{1,t}(0)) + (f_{1,c}(0) - f_{2,c}(0))$$
$$\text{s.t.} \quad \sum_{i=1}^n \left( 1\{x_i \in \mathcal{X}_t\}\left( \frac{f_{2,t}(x_i) - f_{1,t}(x_i)}{\sigma(x_i)} \right)^2 + 1\{x_i \in \mathcal{X}_c\}\left( \frac{f_{2,c}(x_i) - f_{1,c}(x_i)}{\sigma(x_i)} \right)^2 \right) \leq \delta^2,$$
$$f_{1,t}, f_{1,c} \in \Lambda^{+,V}(C_1), \quad f_{2,t}, f_{2,c} \in \Lambda^{+,V}(C_2).$$
Using the definitions of $\omega_t(\delta_t; C_1, C_2)$ and $\omega_c(\delta_c; C_2, C_1)$, and Lemma A.1, we can write
$$\omega(\delta; C_1, C_2) = \sup_{\delta_t \geq 0,\, \delta_c \geq 0,\, \delta_t^2 + \delta_c^2 = \delta^2} \omega_t(\delta_t; C_1, C_2) + \omega_c(\delta_c; C_2, C_1),$$
which concludes the proof.

Lemma A.3.
Given some $(C_1, C_2) \in \mathbb{R}^2_+$ and $\delta \geq 0$, write $\delta^*_t := \delta^*_t(\delta; C_1, C_2)$, $\delta^*_c := \delta^*_c(\delta; C_1, C_2)$. Then, we can find $(f^*_{\delta,1}, f^*_{\delta,2}) \in \mathcal{F}(C_1) \times \mathcal{F}(C_2)$ which satisfy the following three conditions: (i) $L_{\mathrm{RD}}f^*_{\delta,2} - L_{\mathrm{RD}}f^*_{\delta,1} = \omega(\delta; C_1, C_2)$, (ii) $\left\| \frac{f^*_{\delta,2} - f^*_{\delta,1}}{\sigma} \right\| = \delta$, and (iii) when we write
$$f^*_{\delta,1} = f^*_{\delta,1,t}\,1\{x \in \mathcal{X}_t\} + f^*_{\delta,1,c}\,1\{x \in \mathcal{X}_c\}, \qquad f^*_{\delta,2} = f^*_{\delta,2,t}\,1\{x \in \mathcal{X}_t\} + f^*_{\delta,2,c}\,1\{x \in \mathcal{X}_c\},$$
the pairs $(f^*_{\delta,1,t}, f^*_{\delta,2,t})$ and $(f^*_{\delta,1,c}, f^*_{\delta,2,c})$ satisfy
$$f^*_{\delta,2,t} - f^*_{\delta,1,t} = [\omega_t(\delta^*_t; C_1, C_2) - C_1\|(x)_{V+}\| - C_2\|(x)_{V-}\|]_+ \tag{22}$$
$$f^*_{\delta,1,c} - f^*_{\delta,2,c} = [\omega_c(\delta^*_c; C_2, C_1) - C_1\|(x)_{V-}\| - C_2\|(x)_{V+}\|]_+, \tag{23}$$
$$f^*_{\delta,1,t}(0) + f^*_{\delta,2,t}(0) = \omega_t(\delta^*_t; C_1, C_2) \tag{24}$$
$$f^*_{\delta,1,c}(0) + f^*_{\delta,2,c}(0) = \omega_c(\delta^*_c; C_2, C_1), \tag{25}$$
and
$$(f^*_{\delta,2,t} - f^*_{\delta,1,t})(f^*_{\delta,2,t} + f^*_{\delta,1,t}) = (f^*_{\delta,2,t} - f^*_{\delta,1,t})(\omega_t(\delta^*_t; C_1, C_2) + C_1\|(x)_{V+}\| - C_2\|(x)_{V-}\|) \tag{26}$$
$$(f^*_{\delta,1,c} - f^*_{\delta,2,c})(f^*_{\delta,1,c} + f^*_{\delta,2,c}) = (f^*_{\delta,1,c} - f^*_{\delta,2,c})(\omega_c(\delta^*_c; C_2, C_1) + C_2\|(x)_{V+}\| - C_1\|(x)_{V-}\|). \tag{27}$$
Kwon and Kwon (2020) show that if we let
$$f^*_{\delta,1,t}(x) = \begin{cases} C_1\|(x)_{V+}\| & \text{if } C_1 \leq C_2 \\ \min\{\omega_t(\delta_t; C_1, C_2) - C_2\|(x)_{V-}\|,\; C_1\|(x)_{V+}\|\} & \text{otherwise}, \end{cases}$$
$$f^*_{\delta,2,t}(x) = \begin{cases} \max\{\omega_t(\delta_t; C_1, C_2) - C_2\|(x)_{V-}\|,\; C_1\|(x)_{V+}\|\} & \text{if } C_1 \leq C_2 \\ \omega_t(\delta_t; C_1, C_2) - C_2\|(x)_{V-}\| & \text{otherwise}, \end{cases}$$
we have $(f^*_{\delta,1,t}, f^*_{\delta,2,t}) \in \mathcal{F}(C_1) \times \mathcal{F}(C_2)$, and these functions solve (21) for $q = t$, $C = C_1$, and $C' = C_2$. Likewise, if we let
$$f^*_{\delta,1,c}(x) = \begin{cases} \max\{\omega_c(\delta_c; C_2, C_1) - C_1\|(x)_{V-}\|,\; C_2\|(x)_{V+}\|\} & \text{if } C_2 \leq C_1 \\ \omega_c(\delta_c; C_2, C_1) - C_1\|(x)_{V-}\| & \text{otherwise}, \end{cases}$$
$$f^*_{\delta,2,c}(x) = \begin{cases} C_2\|(x)_{V+}\| & \text{if } C_2 \leq C_1 \\ \min\{\omega_c(\delta_c; C_2, C_1) - C_1\|(x)_{V-}\|,\; C_2\|(x)_{V+}\|\} & \text{otherwise}, \end{cases}$$
we have $(f^*_{\delta,1,c}, f^*_{\delta,2,c}) \in \mathcal{F}(C_1) \times \mathcal{F}(C_2)$, and these functions solve (21) for $q = c$, $C = C_2$, and $C' = C_1$. Then, the equations (22)–(27) in the statement of this lemma follow from the above formulas and Lemma A.2.

Lemma A.4.
Given some $(C_1, C_2) \in \mathbb{R}^2_+$ and $\delta \geq 0$, let $\omega'(\delta; C_1, C_2) = \frac{\partial}{\partial \delta}\omega(\delta; C_1, C_2)$ and write $\delta^*_t := \delta^*_t(\delta; C_1, C_2)$, $\delta^*_c := \delta^*_c(\delta; C_1, C_2)$. Then, we have
$$\omega'(\delta; C_1, C_2) = \frac{\delta}{\sum_{x_i \in \mathcal{X}_t} [\omega_t(\delta^*_t; C_1, C_2) - C_1\|(x_i)_{V+}\| - C_2\|(x_i)_{V-}\|]_+ / \sigma^2(x_i)} = \frac{\delta}{\sum_{x_i \in \mathcal{X}_c} [\omega_c(\delta^*_c; C_2, C_1) - C_1\|(x_i)_{V-}\| - C_2\|(x_i)_{V+}\|]_+ / \sigma^2(x_i)}.$$
Note that $f \in \mathcal{F}(C)$ implies $f + z \in \mathcal{F}(C)$ for any $z \in \mathbb{R}$ and $C \geq 0$. Moreover, letting $\iota_t(x) := 1\{x \in \mathcal{X}_t\}$, we have $L_{\mathrm{RD}}(\iota_t) = 1$. Then, Lemma B.3 in Armstrong and Kolesár (2016) implies that
$$\frac{\omega'(\delta; C_1, C_2)}{\delta} = \frac{1}{\sum_{i=1}^n (f^*_{\delta,2}(x_i) - f^*_{\delta,1}(x_i))\,1\{x_i \in \mathcal{X}_t\}/\sigma^2(x_i)} = \frac{1}{\sum_{x_i \in \mathcal{X}_t} (f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))/\sigma^2(x_i)},$$
where $f^*_{\delta,j}(x_i)$ and $f^*_{\delta,j,t}(x_i)$ for $j = 1, 2$ are as defined in Lemma A.3. Likewise, letting $\iota_c(x) = -1\{x \in \mathcal{X}_c\}$, we have $L_{\mathrm{RD}}(\iota_c) = 1$, and Lemma B.3 in Armstrong and Kolesár (2016) implies that
$$\frac{\omega'(\delta; C_1, C_2)}{\delta} = \frac{1}{\sum_{x_i \in \mathcal{X}_c} (f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))/\sigma^2(x_i)},$$
where $f^*_{\delta,j,c}(x_i)$ for $j = 1, 2$ are as defined in Lemma A.3. The result then follows from (22) and (23) in Lemma A.3.

Lemma A.5.
Given some pair $(C_1, C_2)$ such that $C_1 \geq C_2 \geq 0$, and for some $\delta > 0$, write $\delta^*_t := \delta^*_t(\delta; C_1, C_2)$, $\delta^*_c := \delta^*_c(\delta; C_1, C_2)$, and
$$h_{t,\delta} = \omega^*_t(\delta; C_1, C_2) \cdot \left( \frac{1}{C_1}, \frac{1}{C_2} \right), \qquad h_{c,\delta} = \omega^*_c(\delta; C_2, C_1) \cdot \left( \frac{1}{C_2}, \frac{1}{C_1} \right).$$
Then, we have
$$\sum_{x_i \in \mathcal{X}_t} K^2(x_i, h_{t,\delta})/\sigma^2(x_i) = (\delta^*_t/\omega^*_t(\delta; C_1, C_2))^2, \qquad \sum_{x_i \in \mathcal{X}_c} K^2(x_i, h_{c,\delta})/\sigma^2(x_i) = (\delta^*_c/\omega^*_c(\delta; C_2, C_1))^2.$$

Proof.
Using the notation of Lemma A.3, we can write
$$\sum_{x_i \in \mathcal{X}_t} K^2(x_i, h_{t,\delta})/\sigma^2(x_i) = \frac{1}{(\omega^*_t(\delta; C_1, C_2))^2} \sum_{x_i \in \mathcal{X}_t} (f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))^2/\sigma^2(x_i).$$
Now, by the definitions of $f^*_{\delta,1,t}$ and $f^*_{\delta,2,t}$, we have $\sum_{x_i \in \mathcal{X}_t} (f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))^2/\sigma^2(x_i) = (\delta^*_t)^2$, which gives the desired result for the first equation of this lemma. The second equation follows from analogous reasoning.

A.2 Proofs of main results
Proof of Lemma 4.1.
Consider a lower confidence interval $[\hat{c}, \infty) \in \mathcal{L}_\alpha(C)$. Then, Theorem 3.1 of Armstrong and Kolesár (2018a) implies that
$$\sup_{f \in \mathcal{F}(C')} E[L_{\mathrm{RD}}f - \hat{c}] \geq \omega(z_{1-\alpha}; C, C')$$
when Assumption 1 holds. The same theorem also implies that there exists $[\hat{c}, \infty) \in \mathcal{L}_\alpha(C)$ that exactly achieves the lower bound above. Therefore, it remains to analyze the conditions under which $\omega(z_{1-\alpha}; C, C')$ is bounded by some $A(C')$.

Now, Lemma A.2 implies that we can write
$$\omega(z_{1-\alpha}; C, C') = \omega_t(\delta^*_t; C, C') + \omega_c(\delta^*_c; C', C),$$
where $(\delta^*_t, \delta^*_c)$ solve
$$\sup_{\delta_t \geq 0,\, \delta_c \geq 0,\, \delta_t^2 + \delta_c^2 = z_{1-\alpha}^2} \omega_t(\delta_t; C, C') + \omega_c(\delta_c; C', C).$$
Due to Lemma A.1, $b_t = \omega_t(\delta^*_t; C, C')$ solves
$$\sum_{x_i \in \mathcal{X}_t} [b_t - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2 / \sigma^2(x_i) = (\delta^*_t)^2. \tag{28}$$
We first consider sufficiency. Define $\tilde{\mathcal{X}}_t := (-\infty, 0]^d \cap \mathcal{X}_t$. Notice that for any given $b \in \mathbb{R}$, we have
$$\sum_{x_i \in \mathcal{X}_t} [b - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i) \geq \sum_{x_i \in \tilde{\mathcal{X}}_t} [b - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i).$$
Therefore, given some $\delta_t \geq 0$, if $b'_t = b'_t(\delta_t)$ solves
$$\sum_{x_i \in \tilde{\mathcal{X}}_t} [b'_t - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i) = \delta_t^2,$$
then $b'_t \geq \omega_t(\delta_t; C, C')$ will hold. This implies that
$$\omega_t(\delta^*_t; C, C') \leq \omega_t(z_{1-\alpha}; C, C') \leq b'_t(z_{1-\alpha}),$$
where the first inequality is due to the constraint that $\delta^*_t \geq 0$, $\delta^*_c \geq 0$ and $(\delta^*_t)^2 + (\delta^*_c)^2 = z_{1-\alpha}^2$. Note that $x_i \in \tilde{\mathcal{X}}_t$ implies
$$\sum_{x_i \in \tilde{\mathcal{X}}_t} [b'_t - C\|(x_i)_{V+}\| - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i) = \sum_{x_i \in \tilde{\mathcal{X}}_t} [b'_t - C'\|(x_i)_{V-}\|]_+^2/\sigma^2(x_i),$$
using the assumption that $V = \{1, \ldots, d\}$. This implies that $b'_t(z_{1-\alpha})$ is a function of the data and $C'$ only. Likewise, if we define $\tilde{\mathcal{X}}_c := [0, \infty)^d \cap \mathcal{X}_c$ and $b'_c = b'_c(\delta_c)$ to be the solution to
$$\sum_{x_i \in \tilde{\mathcal{X}}_c} [b'_c - C'\|(x_i)_{V+}\|]_+^2/\sigma^2(x_i) = \delta_c^2,$$
then $b'_c(z_{1-\alpha})$ is a function of the data and $C'$ only. Therefore, we can set $A(C') = b'_t(z_{1-\alpha}) + b'_c(z_{1-\alpha})$.

Next, let us consider necessity. First, suppose $(-\infty, 0]^d \cap \mathcal{X}_t = \emptyset$. Then, since every term in the summation in (28) depends on $C$, increasing $C$ necessarily increases the size of $b_t$. Therefore, $\omega(z_{1-\alpha}; C, C')$ cannot be bounded by a term independent of $C$. The same reasoning applies to the case with $[0, \infty)^d \cap \mathcal{X}_c = \emptyset$, and to the case where $V \subsetneq \{1, \ldots, d\}$, which concludes the proof. (To be more precise, the reasoning above holds only if $\delta^*_t, \delta^*_c > 0$, and only if there is no observation located exactly on the axis; these cases can be ignored when $n$ is sufficiently large and when the running variables have a continuous distribution.)

Proof of Theorem 3.1.
First, given some $\delta \geq 0$, consider the optimization problem
$$\omega(\delta; C) := \sup_{f_1, f_2 \in \mathcal{F}(C)} L_{\mathrm{RD}}f_2 - L_{\mathrm{RD}}f_1 \quad \text{s.t.} \quad \left\| \frac{f_2 - f_1}{\sigma} \right\| \leq \delta, \tag{29}$$
and let $(f^*_{\delta,1}, f^*_{\delta,2})$ denote its solution. Define
$$\tilde{L}(\delta) := \frac{\omega'(\delta; C, C)}{\delta} \sum_{i=1}^n \frac{(f^*_{\delta,2}(x_i) - f^*_{\delta,1}(x_i))\,y_i}{\sigma^2(x_i)} - \frac{\omega'(\delta; C, C)}{2\delta} \sum_{i=1}^n \frac{(f^*_{\delta,2}(x_i) - f^*_{\delta,1}(x_i))(f^*_{\delta,2}(x_i) + f^*_{\delta,1}(x_i))}{\sigma^2(x_i)} + L_{\mathrm{RD}}\left( \frac{f^*_{\delta,1} + f^*_{\delta,2}}{2} \right).$$
Under these definitions, we can show
$$\mathrm{sd}(\tilde{L}(\delta)) = \omega'(\delta; C, C), \qquad \sup_{f \in \mathcal{F}(C)} \mathrm{bias}(\tilde{L}(\delta)) = \frac{1}{2}\left(\omega(\delta; C, C) - \delta\,\omega'(\delta; C, C)\right).$$
Define
$$\tilde{\chi}(\delta) = \mathrm{cv}_\alpha\!\left( \frac{\sup_{f \in \mathcal{F}(C)} \mathrm{bias}(\tilde{L}(\delta))}{\mathrm{sd}(\tilde{L}(\delta))} \right) \cdot \mathrm{sd}(\tilde{L}(\delta)).$$
Then, the result of Donoho (1994) implies that the minimax affine optimal CI is given by $[\tilde{L}(\delta) - \tilde{\chi}(\delta),\, \tilde{L}(\delta) + \tilde{\chi}(\delta)]$ when $\delta$ is chosen to minimize $\tilde{\chi}(\delta)$. Thus, the proof is done if we can show that 1) $\tilde{L}(\delta)$ has the same form as $\hat{L}_{\mathrm{mm}}$, and 2) $\mathrm{sd}(\tilde{L}(\delta))$ and $\sup_{f \in \mathcal{F}(C)} \mathrm{bias}(\tilde{L}(\delta))$ are as given in (11).

For the first claim, recall that we can write, for $j = 1, 2$,
$$f^*_{\delta,j}(x) = f^*_{\delta,j,t}(x)\,1\{x \in \mathcal{X}_t\} + f^*_{\delta,j,c}(x)\,1\{x \in \mathcal{X}_c\}.$$
Then,
$$\tilde{L}(\delta) = \frac{\omega'(\delta; C, C)}{\delta} \sum_{x_i \in \mathcal{X}_t} \frac{(f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))\,y_i}{\sigma^2(x_i)} - \frac{\omega'(\delta; C, C)}{\delta} \sum_{x_i \in \mathcal{X}_c} \frac{(f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))\,y_i}{\sigma^2(x_i)}$$
$$- \frac{\omega'(\delta; C, C)}{2\delta} \sum_{x_i \in \mathcal{X}_t} \frac{(f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))(f^*_{\delta,2,t}(x_i) + f^*_{\delta,1,t}(x_i))}{\sigma^2(x_i)} + \frac{\omega'(\delta; C, C)}{2\delta} \sum_{x_i \in \mathcal{X}_c} \frac{(f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))(f^*_{\delta,1,c}(x_i) + f^*_{\delta,2,c}(x_i))}{\sigma^2(x_i)}$$
$$+ \frac{f^*_{\delta,1,t}(0) + f^*_{\delta,2,t}(0)}{2} - \frac{f^*_{\delta,1,c}(0) + f^*_{\delta,2,c}(0)}{2}.$$
Defining
$$\tilde{L}_t(\delta) = \frac{\omega'(\delta; C, C)}{\delta} \sum_{x_i \in \mathcal{X}_t} \frac{(f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))\,y_i}{\sigma^2(x_i)} - \frac{\omega'(\delta; C, C)}{2\delta} \sum_{x_i \in \mathcal{X}_t} \frac{(f^*_{\delta,2,t}(x_i) - f^*_{\delta,1,t}(x_i))(f^*_{\delta,2,t}(x_i) + f^*_{\delta,1,t}(x_i))}{\sigma^2(x_i)} + \frac{f^*_{\delta,1,t}(0) + f^*_{\delta,2,t}(0)}{2}$$
and
$$\tilde{L}_c(\delta) = \frac{\omega'(\delta; C, C)}{\delta} \sum_{x_i \in \mathcal{X}_c} \frac{(f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))\,y_i}{\sigma^2(x_i)} - \frac{\omega'(\delta; C, C)}{2\delta} \sum_{x_i \in \mathcal{X}_c} \frac{(f^*_{\delta,1,c}(x_i) - f^*_{\delta,2,c}(x_i))(f^*_{\delta,1,c}(x_i) + f^*_{\delta,2,c}(x_i))}{\sigma^2(x_i)} + \frac{f^*_{\delta,1,c}(0) + f^*_{\delta,2,c}(0)}{2},$$
we can see that $\tilde{L}(\delta) = \tilde{L}_t(\delta) - \tilde{L}_c(\delta)$ holds.

Now, using equations (22), (24) and (26) in Lemma A.3, and the first representation of $\omega'(\delta; C, C)/\delta$ in Lemma A.4, we get
$$\tilde{L}_t(\delta) = \frac{\sum_{x_i \in \mathcal{X}_t} [\omega_t(\delta^*_t; C, C) - C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|]_+\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} [\omega_t(\delta^*_t; C, C) - C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|]_+/\sigma^2(x_i)}$$
$$- \frac{1}{2}\sum_{x_i \in \mathcal{X}_t} \frac{[\omega_t(\delta^*_t; C, C) - C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|]_+\,(\omega_t(\delta^*_t; C, C) + C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} [\omega_t(\delta^*_t; C, C) - C\|(x_i)_{V+}\| - C\|(x_i)_{V-}\|]_+/\sigma^2(x_i)} + \frac{\omega_t(\delta^*_t; C, C)}{2}$$
$$= \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i/h_t)\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i/h_t)/\sigma^2(x_i)} + a_t(\delta).$$
Similarly, for $\tilde{L}_c(\delta)$, using the equations (23), (25), and (27) in Lemma A.3, and the second representation of $\omega'(\delta; C, C)/\delta$ in Lemma A.4, we get
$$\tilde{L}_c(\delta) = \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i/h_c)\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i/h_c)/\sigma^2(x_i)} + a_c(\delta),$$
which establishes the claim that $\tilde{L}(\delta)$ has the same form as $\hat{L}_{\mathrm{mm}}$.

Next, for the second claim about the standard deviation and the worst-case bias, it is sufficient to show
$$\omega'(\delta; C, C) = \frac{\delta/(Ch_t(\delta))}{\sum_{x_i \in \mathcal{X}_t} K(x_i/h_t(\delta))/\sigma^2(x_i)}, \qquad \omega(\delta; C, C) = C(h_t(\delta) + h_c(\delta)).$$
The first equation follows from Lemma A.4. Next, $\omega(\delta; C, C) = \omega_t(\delta^*_t(\delta; C); C) + \omega_c(\delta^*_c(\delta; C); C)$ follows from Lemma A.2, which implies the second equation and concludes the proof.

Proof of Theorem 4.2. Given some $f \in \mathcal{F}(C)$ and $j \in \{1, \ldots, J\}$, we can write
$$P_f(L_{\mathrm{RD}}f < \hat{c}^L_\tau(C_j)) = P_f(\hat{c}^L_\tau(C_j) - L_{\mathrm{RD}}f > 0) = P_f\left( \frac{\hat{c}^L_\tau(C_j) - L_{\mathrm{RD}}f}{\omega'(z_{1-\tau}; C, C_j)} > 0 \right) = P_f\left( \frac{\hat{c}^L_\tau(C_j) - L_{\mathrm{RD}}f}{\omega'(z_{1-\tau}; C, C_j)} + z_{1-\tau} > z_{1-\tau} \right) \equiv P_f(\tilde{V}_{j,\tau} > z_{1-\tau}),$$
where $\omega'(z_{1-\tau}; C, C_j)$ is as defined earlier. Therefore, defining
$$\mathrm{CI}^L(\tau) := \left[ \max_{1 \leq j \leq J} \hat{c}^L_\tau(C_j),\, \infty \right),$$
we have
$$P_f(L_{\mathrm{RD}}f \notin \mathrm{CI}^L(\tau)) = P_f(\exists j \text{ s.t. } L_{\mathrm{RD}}f < \hat{c}^L_\tau(C_j)) = P_f\left( \max_{1 \leq j \leq J} \tilde{V}_{j,\tau} > z_{1-\tau} \right).$$
Now, we want to find the largest value of $\tau$ such that
$$\sup_{f \in \mathcal{F}(C)} P_f\left( \max_{1 \leq j \leq J} \tilde{V}_{j,\tau} > z_{1-\tau} \right) \leq \alpha, \tag{30}$$
therefore satisfying the coverage requirement. First, it is easy to see that $(\tilde{V}_{1,\tau}, \ldots, \tilde{V}_{J,\tau})$ has a multivariate normal distribution.
Next, note that the quantiles of $\max_{1 \leq j \leq J} \tilde{V}_{j,\tau}$ are increasing in each of $E_f\tilde{V}_{1,\tau}, \ldots, E_f\tilde{V}_{J,\tau}$. Moreover, the variances and covariances of $(\tilde{V}_{1,\tau}, \ldots, \tilde{V}_{J,\tau})$ do not depend on the true regression function $f$, by the construction of $\hat{c}^L_\tau(C_j)$. This means that the quantiles of $\max_{1 \leq j \leq J} \tilde{V}_{j,\tau}$ are smaller than those of $\max_{1 \leq j \leq J} V_{j,\tau}$, where $(V_{1,\tau}, \ldots, V_{J,\tau})$ has a multivariate normal distribution with mean given by $EV_{j,\tau} = \sup_{f \in \mathcal{F}(C)} E_f\tilde{V}_{j,\tau}$ for $j = 1, \ldots, J$, and with the same covariance matrix as $(\tilde{V}_{1,\tau}, \ldots, \tilde{V}_{J,\tau})$. Therefore, if we take $\tau$ so that $z_{1-\tau}$ is the $(1-\alpha)$th quantile of $\max_{1 \leq j \leq J} V_{j,\tau}$, (30) is satisfied. It remains to show that $(V_{j,\tau})_{j=1}^J$ has a multivariate normal distribution with zero means, unit variances, and covariances as given in (16).

For the mean, by the definition of $\hat{c}^L_\tau(C_j)$, we have
$$\sup_{f \in \mathcal{F}(C)} E_f[\hat{c}^L_\tau(C_j) - L_{\mathrm{RD}}f] = -z_{1-\tau}\,\mathrm{sd}(\hat{L}_\tau(C_j)) = -z_{1-\tau}\,\omega'(z_{1-\tau}; C, C_j),$$
where the last equality follows from the discussion in Armstrong and Kolesár (2018a). Hence, we get $EV_{j,\tau} = 0$. Moreover, this also implies $\mathrm{Var}(V_{j,\tau}) = 1$. For the covariance, we have
$$\mathrm{Cov}(\hat{L}_\tau(C_j), \hat{L}_\tau(C_k)) = \mathrm{Cov}(\hat{L}_{t,\tau}(C_j) - \hat{L}_{c,\tau}(C_j),\, \hat{L}_{t,\tau}(C_k) - \hat{L}_{c,\tau}(C_k)) = \mathrm{Cov}(\hat{L}_{t,\tau}(C_j), \hat{L}_{t,\tau}(C_k)) + \mathrm{Cov}(\hat{L}_{c,\tau}(C_j), \hat{L}_{c,\tau}(C_k)),$$
where the last equality follows from the independence assumption, and we can show
$$\mathrm{Cov}\left( \frac{\hat{L}_{t,\tau}(C_j)}{\omega'(z_{1-\tau}; C, C_j)},\, \frac{\hat{L}_{t,\tau}(C_k)}{\omega'(z_{1-\tau}; C, C_k)} \right) = \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau}(C_j)) K(x_i, h_{t,\tau}(C_k))/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau}(C_j))/\sigma^2(x_i)\, \sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau}(C_k))/\sigma^2(x_i)\; \omega'(z_{1-\tau}; C, C_j)\,\omega'(z_{1-\tau}; C, C_k)}$$
$$= \frac{\omega^*_t(z_{1-\tau}; C, C_j)\,\omega^*_t(z_{1-\tau}; C, C_k)}{z_{1-\tau}^2} \sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau}(C_j)) K(x_i, h_{t,\tau}(C_k))/\sigma^2(x_i),$$
where the last equality follows from Lemma A.4 and the definition of $K(x_i, h_{t,\tau}(C_j))$. Likewise, we can show
$$\mathrm{Cov}\left( \frac{\hat{L}_{c,\tau}(C_j)}{\omega'(z_{1-\tau}; C, C_j)},\, \frac{\hat{L}_{c,\tau}(C_k)}{\omega'(z_{1-\tau}; C, C_k)} \right) = \frac{\omega^*_c(z_{1-\tau}; C_j, C)\,\omega^*_c(z_{1-\tau}; C_k, C)}{z_{1-\tau}^2} \sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau}(C_j)) K(x_i, h_{c,\tau}(C_k))/\sigma^2(x_i),$$
which proves that the covariance term has the same form as in (16).
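The calibration of $\tau$ in the proof above can be carried out by simulation: draw from the Gaussian distribution of $(V_{1,\tau}, \ldots, V_{J,\tau})$ and match $z_{1-\tau}$ to the $(1-\alpha)$-quantile of the maximum. Since the covariance matrix itself depends on $\tau$, the sketch below iterates this to a fixed point; both the fixed-point iteration and the callable `cov_of_tau` (returning the covariance matrix in (16) for a given $\tau$) are our assumptions.

import numpy as np
from scipy.stats import norm

def calibrate_tau(cov_of_tau, alpha=0.05, n_sim=100_000, n_iter=50, seed=0):
    # Find tau such that z_{1-tau} equals the (1-alpha)-quantile of
    # max_j V_{j,tau}, where (V_{1,tau},...,V_{J,tau}) ~ N(0, Sigma(tau))
    # with unit variances. `cov_of_tau` returns Sigma(tau) via (16).
    rng = np.random.default_rng(seed)
    tau = alpha
    for _ in range(n_iter):
        Sigma = cov_of_tau(tau)
        draws = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n_sim)
        q = np.quantile(draws.max(axis=1), 1 - alpha)  # quantile of the max
        tau_new = 1 - norm.cdf(q)                      # tau with z_{1-tau} = q
        if abs(tau_new - tau) < 1e-6:
            break
        tau = tau_new
    return tau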
The last statement is immediate from Lemma A.2 of our paper and Theorem 3.1 of Armstrong and Kolesár (2018a), so we focus on the former statement.

First, we state the means and the covariance matrix of the $U_j$'s. Given $\mathcal{C} = \{C_1, \ldots, C_J\}$, let $\tau^* = \tau^*(\mathcal{C})$ be the value of $\tau$ that solves (17). Then, we have
$$EU_j = \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,C'\|(x_i)_{V-}\|/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} + \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,C'\|(x_i)_{V+}\|/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)} + z_{1-\tau^*}\,\mathrm{sd}(\hat{L}_{\tau^*}(C_j)) + \sup_{f \in \mathcal{F}(C)} \mathrm{bias}(\hat{L}_{\tau^*}(C_j)),$$
and
$$\mathrm{Cov}(U_j, U_k) = \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j)) K(x_i, h_{t,\tau^*}(C_k))/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)\, \sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_k))/\sigma^2(x_i)} + \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j)) K(x_i, h_{c,\tau^*}(C_k))/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)\, \sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_k))/\sigma^2(x_i)}.$$
To prove the statement in the proposition, define
$$f^*_{C'}(x) := -C'\|(x)_{V-}\|\,1(x \in \mathcal{X}_t) + C'\|(x)_{V+}\|\,1(x \in \mathcal{X}_c).$$
Then, we show
$$\ell_{\mathrm{adpt}}(C'; \mathcal{C}) = \sup_{f \in \mathcal{F}(C')} E_f \min_{1 \leq j \leq J} (L_{\mathrm{RD}}f - \hat{c}^L_{\tau^*}(C_j)) =: \sup_{f \in \mathcal{F}(C')} E_f \min_{1 \leq j \leq J} Z_j(f) = E_{f^*_{C'}} \min_{1 \leq j \leq J} Z_j(f^*_{C'}) = E_{f^*_{C'}} \min_{1 \leq j \leq J} \left( -\hat{c}^L_{\tau^*}(C_j) \right).$$
The first and the last equalities hold by definition (since $L_{\mathrm{RD}}f^*_{C'} = 0$), so what we have to show is the second-to-last equality.

It is sufficient to show that for each $j \in \{1, \ldots, J\}$,
$$\sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{c}^L_{\tau^*}(C_j)) = E_{f^*_{C'}}(L_{\mathrm{RD}}f^*_{C'} - \hat{c}^L_{\tau^*}(C_j)) = E_{f^*_{C'}}(-\hat{c}^L_{\tau^*}(C_j)).$$
Indeed, $(Z_j(f))_{j=1}^J$ has a multivariate normal distribution with some mean vector $(\mu_j(f))_{j=1}^J$ and a variance matrix that does not depend on $f$. This implies that if $f^*_{C'}$ maximizes $\mu_j(f)$ for all $j = 1, \ldots, J$, it also maximizes $E_f \min_{1 \leq j \leq J} Z_j(f)$.

For shorthand notation, define
$$\chi^L_\tau(C') := \sup_{f \in \mathcal{F}(C)} \mathrm{bias}_f(\hat{L}_\tau(C')) + z_{1-\tau}\,\mathrm{sd}(\hat{L}_\tau(C')).$$
Now, note that we can write
$$\sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{c}^L_{\tau^*}(C_j)) = \sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{L}_{\tau^*}(C_j) + \chi^L_{\tau^*}(C_j)) = \sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{L}_{\tau^*}(C_j)) + \chi^L_{\tau^*}(C_j),$$
since $\chi^L_{\tau^*}(C_j)$ is a fixed quantity.
The first term in the last line can be written as
$$\sup_{f \in \mathcal{F}(C')} E_f(L_{\mathrm{RD}}f - \hat{L}_{\tau^*}(C_j)) = \sup_{f_t, f_c \in \mathcal{F}(C')} E_f\left[ \left( f_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} \right) - \left( f_c(0) - \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,y_i/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)} \right) \right]$$
$$= \sup_{f_t \in \mathcal{F}(C')} \left( f_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,f_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} \right) - \inf_{f_c \in \mathcal{F}(C')} \left( f_c(0) - \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,f_c(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)} \right).$$
Note that $f_t(0)$ and $f_c(0)$ can be normalized to 0, since if we consider $\tilde{f}_t = f_t + v$ for some $v \in \mathbb{R}$,
$$\tilde{f}_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,\tilde{f}_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} = f_t(0) + v - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))(f_t(x_i) + v)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} = f_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,f_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)},$$
and likewise for $f_c$. Therefore, we have
$$\sup_{f_t \in \mathcal{F}(C')} \left( f_t(0) - \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,f_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)} \right) = -\inf_{f_t \in \mathcal{F}(C'),\, f_t(0) = 0} \frac{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))\,f_t(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_t} K(x_i, h_{t,\tau^*}(C_j))/\sigma^2(x_i)}.$$
Kwon and Kwon (2020) show that the minimizer of this problem is given by $f^*_t(x) = -C'\|(x)_{V-}\|$. Likewise, we have
$$-\inf_{f_c \in \mathcal{F}(C')} \left( f_c(0) - \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,f_c(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)} \right) = \sup_{f_c \in \mathcal{F}(C'),\, f_c(0) = 0} \frac{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))\,f_c(x_i)/\sigma^2(x_i)}{\sum_{x_i \in \mathcal{X}_c} K(x_i, h_{c,\tau^*}(C_j))/\sigma^2(x_i)}.$$
Again by Kwon and Kwon (2020), the maximizer is given by $f^*_c(x) = C'\|(x)_{V+}\|$. Therefore, the worst-case expected length is achieved when $f = f^*_{C'}$. Lastly, it is easy to see that the $-\hat{c}^L_{\tau^*}(C_j)$'s are jointly normally distributed with the mean and covariance given above when $f = f^*_{C'}$, which establishes our claim.

B Unbiased estimator for C

As in Armstrong and Kolesár (2018c), we can estimate a lower bound on $C$ using the data. We first explain the idea for the case with $d = 1$. Suppose $\mathcal{X}_t = [x_{\min}, 0)$ and $\mathcal{X}_c = (0, x_{\max}]$.
First, note that for any $x_2 \geq x_1 \geq 0$, we have
$$f_c(x_2) - f_c(x_1) \leq C(x_2 - x_1). \tag{31}$$
Let $n_c := \sum_{i=1}^n 1\{x_i \in \mathcal{X}_c\}$, and set $a_{cm} > 0$ such that $\sum_{i=1}^n 1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} = n_c/2$. Then, we have
$$\frac{1}{n_c/2}\left[ \sum_{i=1}^n f_c(x_i)\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n f_c(x_i)\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right] \tag{32}$$
$$\leq \frac{C}{n_c/2}\left[ \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right], \tag{33}$$
so that we have
$$C \geq \frac{(n_c/2)^{-1}\left[ \sum_{i=1}^n f_c(x_i)\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n f_c(x_i)\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right]}{(n_c/2)^{-1}\left[ \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right]}. \tag{34}$$
We estimate the RHS of (34), denoted by $\mu_c$, by
$$\hat{\mu}_c := \frac{(n_c/2)^{-1}\left[ \sum_{i=1}^n y_i\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n y_i\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right]}{(n_c/2)^{-1}\left[ \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i > a_{cm}\} - \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_c, x_i \leq a_{cm}\} \right]}. \tag{35}$$
We can easily see that $\hat{\mu}_c$ is an unbiased estimator of $\mu_c$.

Likewise, we can form a lower bound estimator using the data in $\mathcal{X}_t$ by
$$\hat{\mu}_t := \frac{(n_t/2)^{-1}\left[ \sum_{i=1}^n y_i\,1\{x_i \in \mathcal{X}_t, x_i > a_{tm}\} - \sum_{i=1}^n y_i\,1\{x_i \in \mathcal{X}_t, x_i \leq a_{tm}\} \right]}{(n_t/2)^{-1}\left[ \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_t, x_i > a_{tm}\} - \sum_{i=1}^n x_i\,1\{x_i \in \mathcal{X}_t, x_i \leq a_{tm}\} \right]}, \tag{36}$$
where $n_t := \sum_{i=1}^n 1\{x_i \in \mathcal{X}_t\}$ and $a_{tm} < 0$ is such that $\sum_{i=1}^n 1\{x_i \in \mathcal{X}_t, x_i \leq a_{tm}\} = n_t/2$. In the dataset of Lee (2008), it turns out that $\hat{\mu}_c = 0.353$ and $\hat{\mu}_t = 0.…$.
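A minimal sketch of the estimator $\hat{\mu}_c$ in (35) (equivalently $\hat{\mu}_t$ in (36), applied to the other side of the cutoff) follows; it splits one side's observations at the within-side median of $x_i$, which plays the role of $a_{cm}$.

import numpy as np

def mu_hat(x, y):
    # Lower-bound estimator for C from (35): ratio of the difference in
    # mean outcomes above/below the within-side median of x to the
    # corresponding difference in mean x. Pass one side's data at a time.
    x, y = np.asarray(x, float), np.asarray(y, float)
    am = np.median(x)                  # plays the role of a_cm (or a_tm)
    hi, lo = x > am, x <= am
    return (y[hi].mean() - y[lo].mean()) / (x[hi].mean() - x[lo].mean())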
The same approach can be used when $d > 1$. For example, when $d = d_+$, we use the inequality
$$f_c(x_2) - f_c(x_1) \leq C\|x_2 - x_1\|, \tag{37}$$
for any $x_1, x_2$ such that $x_{2r} \geq x_{1r}$ for all $r = 1, \ldots, d$. Find two subsets of $\mathcal{X}_c$, say $\mathcal{X}_{c1}$ and $\mathcal{X}_{c2}$, so that 1) $x_1 \in \mathcal{X}_{c1}$ and $x_2 \in \mathcal{X}_{c2}$ implies $x_{2r} \geq x_{1r}$ for all $r = 1, \ldots, d$, and 2) the number of observations in each set is equal to $\tilde{n}_c/2$. Index the $x_i$'s so that $x_1, \ldots, x_{\tilde{n}_c/2} \in \mathcal{X}_{c1}$ and $x_{\tilde{n}_c/2 + 1}, \ldots, x_{\tilde{n}_c} \in \mathcal{X}_{c2}$. Define some one-to-one mapping $j$ from $\{1, 2, \ldots, \tilde{n}_c/2\}$ to $\{\tilde{n}_c/2 + 1, \tilde{n}_c/2 + 2, \ldots, \tilde{n}_c\}$. Then, we have
$$C \geq \frac{\sum_{i=1}^{\tilde{n}_c/2} \left[ f_c(x_{j(i)}) - f_c(x_i) \right]}{\sum_{i=1}^{\tilde{n}_c/2} \left\| x_{j(i)} - x_i \right\|}. \tag{38}$$
Again, the lower bound can be estimated by replacing $f_c(x_i)$ by $y_i$.
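For $d > 1$, the construction above requires a coordinate-wise dominating pairing. The sketch below uses one simple, hypothetical choice (sort by coordinate sum, match halves in order, and drop non-dominating pairs); the text only requires some one-to-one dominating map $j$.

import numpy as np

def mu_hat_multi(x, y):
    # Multivariate lower-bound estimator following (38): pair each
    # observation in a "lower" subset with one in an "upper" subset that
    # dominates it coordinate-wise, then take the ratio of summed outcome
    # differences to summed distances. x has shape (n, d).
    x, y = np.asarray(x, float), np.asarray(y, float)
    order = np.argsort(x.sum(axis=1))          # crude ordering by coordinate sum
    m = len(order) // 2
    lower, upper = order[:m], order[-m:]
    keep = np.all(x[upper] >= x[lower], axis=1)  # keep dominating pairs only
    num = np.sum(y[upper][keep] - y[lower][keep])
    den = np.sum(np.linalg.norm(x[upper][keep] - x[lower][keep], axis=1))
    return num / den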
C Examples of Multi-score RDD

In this section, we discuss RDD applications with monotone multiple running variables. When presenting these empirical applications, we categorize the class of RD designs we consider into the following three cases, depending on how the running variables relate to the assignment of the treatment. While our framework can potentially cover a much larger class of models, these three settings seem to be the ones that appear most frequently in empirical studies. We let $x_i \in \mathbb{R}^d$ be the value of the running variable for an individual $i$, and denote the $s$th element of $x_i$ by $x_i(s)$ for $s = 1, \ldots, d$.

1. MRO (Multiple Running variables with "OR" conditions):
An indi-vidual i is treated if there exists some s ∈ { , ..., d } such that x i ( s ) > ≤ MRA (Multiple Running variables with “AND” conditions):
An indi-vidual i is treated if x i ( s ) > ≤
0) for all s = 1 , ..., d .3. WAV (Weighted AVerage of multiple running variables):
There are multiple running variables, and the treatment status is determined by some weighted average of those running variables. Hence, an individual $i$ is treated if $\sum_{s=1}^d w_s x_i(s) > 0$ (or $\leq 0$) for some positive weights $w_1, \ldots, w_d$. While we can view this design as an RD design with a single running variable $\tilde{x}_i = \sum_{s=1}^d w_s x_i(s)$, we may obtain richer information by considering it as a multi-dimensional RD design. For example, consider an RD design where an individual $i$ is treated if the average of the normalized math and reading scores is greater than 0. Then, we can consider treatment effect parameters at different cutoff points, e.g., (math = 0, reading = 0), (math = −1, reading = 1), or (math = 1, reading = −1).

Refer to Appendix A.3 of Babii and Kumar (2020) for RDD applications with univariate monotone running variables. Our categorization excludes, for example, geographic RD designs, such as those analyzed in Dell (2010), Keele and Titiunik (2015), and Imbens and Wager (2019). While our general framework can incorporate such models, we do not consider them, since the monotonicity assumption is unlikely to hold in such contexts.

D Auxiliary Figure for Section 5

[Figure 4: Regression function values on x ∈ [−1, 1] for the four regression functions f_1, ..., f_4.]