Local linear tie-breaker designs
Dan M. Kluger, Stanford University
Art B. Owen, Stanford University
January 2021
Abstract
Tie-breaker experimental designs are hybrids of Randomized Controlled Trials (RCTs) and Regression Discontinuity Designs (RDDs) in which subjects with moderate scores are placed in an RCT while subjects with extreme scores are deterministically assigned to the treatment or control group. The design maintains the benefits of randomization for causal estimation while avoiding the possibility of excluding the most deserving recipients from the treatment group. The causal effect in a tie-breaker design can be estimated by fitting local linear regressions for both the treatment and control groups, as is typically done for RDDs. We study the statistical efficiency of such local linear regression-based causal estimators as a function of ∆, the radius of the interval in which treatment randomization occurs. In particular, we determine the efficiency of the estimator as a function of ∆ for a fixed, arbitrary bandwidth under the assumption of a uniform assignment variable. To generalize beyond uniform assignment variables and asymptotic regimes, we also demonstrate on the Angrist and Lavy (1999) classroom size dataset that, prior to conducting an experiment, an experimental designer can estimate the efficiency for various choices of the experimental radius by Monte Carlo, as long as they have access to the distribution of the assignment variable. For both uniform and triangular kernels, we show that increasing the radius of the randomized experiment interval increases the efficiency until the radius reaches the size of the local linear regression bandwidth, after which no additional efficiency benefits are conferred.
1 Introduction

The regression discontinuity design (RDD) introduced by Thistlethwaite and Campbell (1960) has become a mainstay of causal inference in recent years, especially in econometrics (Angrist and Pischke, 2014) and the social sciences (Imbens and Rubin, 2015). The logic in an RDD is as follows. Some different treatment is offered to subject i depending on whether the value x_i of an assignment variable (also called a running variable) exceeds a threshold t, or not. Then if the future value of a quantity Y_i has a different expected value for x_i just barely larger than t than for x_i just barely smaller than t, it becomes quite credible to interpret the difference causally, assuming that there is no a priori reason for E(Y | x) to have a step discontinuity at x = t.

In most uses of RDD, the assignment to treatment or control cannot be influenced by the investigator. In some settings, however, the investigator can inject randomness into the treatment assignments around the threshold value of x = t. The goal is to better measure the effect of the treatment variable Z on Y. This design is called a tie-breaker design. For instance, letting x be a measure of high-school performance and the treatment be the awarding of a university scholarship, Angrist et al. (2014) use a tie-breaker design to measure the impact of the scholarship on students' academic outcomes. The top performing students all get the scholarship, the bottom performing students do not, and there is a group in the middle where a random selection is used. Aiken et al. (1998) use a tie-breaker design to offer remedial English classes based on a test given to students prior to matriculation. Companies offering loyalty programs to their best customers can also include some randomness in the reward, to better learn the impact of those rewards.
Tie-breaker designs are also known as cutoff designs in clinical trial settings (Trochim and Cappelleri, 1992).

In our setting it is convenient to represent the treatment by Z ∈ {−1, 1}, with the level −1 denoting control. In outline, the investigator will:

1. collect assignment variable values x_i for subjects i = 1, . . . , N,
2. determine a distribution for treatment variables Z_i ∈ {−1, 1},
3. assign treatments Z_i to subjects,
4. observe corresponding Y_i, and
5. infer the treatment effect from the (x_i, Z_i, Y_i) values.

The Y_i values are not available to the investigator at the time the distribution for Z_i is chosen. As a result the treatment decisions have to be chosen in part based on a guess as to what model might be used to fit the Y_i values. We are interested in settings where Y_i does not become available fast enough to employ bandit methods. For instance there might be a year long delay in the customer loyalty setting or six years in the educational attainment setting.

Owen and Varian (2020) study the statistical efficiency of tie-breaker designs. In the simplest formulation, the assignment variables are sorted and scaled so that x_i = (2i − N − 1)/N ∈ (−1, 1), and for an experimental radius 0 ≤ ∆ ≤ 1,

    Pr(Z_i = 1 | x_i) = 0,    x_i ≤ −∆,
                      = 1/2,  |x_i| < ∆,
                      = 1,    x_i ≥ ∆.     (1)

This tie-breaker design interpolates between an RDD with t = 0 when ∆ = 0 and an RCT for ∆ = 1. They consider a regression model

    Y_i = β_0 + β_1 x_i + β_2 Z_i + β_3 x_i Z_i + ε_i     (2)

where the ε_i are IID random variables with mean 0 and variance σ². They find that statistical efficiency improves with ∆. We will describe below how efficiency is defined when we introduce our version. They also find no efficiency benefit from using a sliding scale for Pr(Z = 1 | x) instead of using just the three levels in (1). Model (2) is simple enough to analyze; they also describe algorithmic ways to study more general alternative models at step 2 when there may be vectors x_i available for each subject.

A weakness of model (2) is that it is global over all subjects.
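Steps 2–4 of the outline above, with the three-level assignment rule (1) and an ordinary least squares fit of the global model (2), can be sketched as follows. This is our illustration only: the coefficient values, noise level and sample size are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, Delta = 1000, 0.4
x = (2 * np.arange(1, N + 1) - N - 1) / N        # equispaced scores in (-1, 1)

# Three-level tie-breaker rule (1): treat for x >= Delta,
# control for x <= -Delta, randomize with probability 1/2 in between.
p = np.where(x >= Delta, 1.0, np.where(x <= -Delta, 0.0, 0.5))
Z = np.where(rng.random(N) < p, 1, -1)

# Simulate responses from the working model (2); beta and sigma are invented.
beta = np.array([1.0, 2.0, 0.5, 0.25])
Y = beta[0] + beta[1] * x + beta[2] * Z + beta[3] * x * Z + rng.normal(0, 0.1, N)

# Step 5: ordinary least squares fit of model (2).
X = np.column_stack([np.ones(N), x, Z, x * Z])
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
print(beta_hat)  # close to beta; the treatment effect at x = 0 is 2 * beta_hat[2]
```

The fitted coefficient on Z recovers β_2, so the estimated treatment effect at the threshold is 2β̂_2, matching the estimand discussed below.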
In RDDs, it is now more common to use nonparametric regression methods (Hahn et al., 2001; Calonico et al., 2014). Let µ₊(x) = E(Y | x, Z = 1) and µ₋(x) = E(Y | x, Z = −1), both assumed to be smooth in a neighborhood of x = t. The RDD is used to estimate µ₊(t) − µ₋(t). A kernel regression µ̂₋ is fit to the (x_i, Y_i) pairs with Z_i = −1, which have x_i ≤ t, and extrapolated to µ̂₋(t). Similarly, a kernel regression estimate µ̂₊(t) based on the (x_i, Y_i) data with x_i ≥ t is produced, and then the causal impact of treatment at x = t is estimated by µ̂₊(t) − µ̂₋(t).

Our goal in this paper is to show that the tie-breaker design is still superior to the RDD even when a local regression is used. The local regression will use a bandwidth parameter h chosen at step 5 above. That bandwidth is not known to the investigator at step 2 where ∆ is chosen. To handle this, we show that the tie-breaker has a statistical advantage at all h > 0. The efficiency depends on the ratio δ = ∆/h of the size of the experimental region to the bandwidth used. Efficiency ranges from 1 at δ = 0 to 4 at δ = 1 (for the boxcar kernel) or to 3.6 at δ = 1 for the triangular kernel. The efficiency is monotone in δ. To get an interpretable theory we have worked with a uniformly spaced assignment variable. Section 5 shows how one can compute efficiency empirically using one's actual assignment variables, focusing on the Israeli classroom size data from Angrist and Lavy (1999) as an example. The efficiency curves are quite similar to the ones we have for uniformly spaced assignment variables but they show slightly greater efficiency gains than the ones in our theorems for uniform spacings. Section 6 presents a discussion and an Appendix contains one of our proofs.

2 Kernels and problem formulation
In keeping with current RDD practice we will use a kernel smoothed version of (2). The parameter vector β is estimated by

    β̂ = argmin_{β ∈ R⁴} Σ_{i=1}^{N} K((x_i − t)/h) (Y_i − (β_0 + β_1 x_i + β_2 Z_i + β_3 x_i Z_i))²     (3)

for a bandwidth h > 0 and a kernel function K(·) ≥ 0 that is non-negative and symmetric. We have special interest in the uniform (boxcar) kernel K_BC(x) = 1{|x| ≤ 1} because it gives a local version of the regression model (2). We are also interested in the triangular spike kernel K_TS(x) = (1 − |x|)₊, where z₊ = max(0, z). This triangular kernel was shown by Cheng et al. (1997) to optimize a bias-variance tradeoff for extrapolation from x_i > t to E(Y | x = t) and has been advocated for RDD analysis by Imbens and Kalyanaraman (2012) and Calonico et al. (2014) among others.

In the model (2), the treatment effect at x is

    (β_0 + β_1 x + β_2 + β_3 x) − (β_0 + β_1 x − β_2 − β_3 x) = 2(β_2 + β_3 x).

An RDD is used for x = t and there the treatment effect is 2(β_2 + β_3 t). We will shift the assignment variable so as to make t = 0 and then focus on β_2.

The kernel regression estimator from (3) has a bias and a variance that both depend on the bandwidth h. Larger h typically brings greater bias because the true regression is not precisely linear over a region centered on t. Smaller h brings greater variance because then fewer data points are in the regression. Calonico et al. (2014) advocate choosing a smaller h than is mean square optimal. Making the bias negligible compared to the standard deviation has the effect of making it much easier to get a confidence interval for the treatment effect. Confidence intervals are important when the estimate is to be used for policy purposes where quantifying uncertainty is critical. For this reason, we will study the variance of β̂_2 given h and not consider the bias. That is, we are assuming that the user will purposefully undersmooth the regression as recommended by Calonico et al.
(2014).

The design matrix for the regression is X ∈ R^{N×4} with i'th row (1, x_i, Z_i, x_i Z_i). The response is Y = (Y_1, . . . , Y_N)ᵀ. With t = 0, the kernel weights are K(x_i/h) and we let W = W(h) ∈ R^{N×N} = diag(K(x_i/h)). Then

    β̂ = β̂(∆) = (XᵀWX)⁻¹ XᵀWY     (4)

and under the assumption that var(Y | X) = σ² I_N we have

    var(β̂ | X; ∆) = (XᵀWX)⁻¹ XᵀW²X (XᵀWX)⁻¹ σ².     (5)

Formula (4) for β̂ matches the familiar generalized least squares formula for the case where var(Y | X) = W⁻¹σ². Here, however, W arises from weights that are not of inverse variance type and hence the formula for var(β̂ | X; ∆) involves a W². The boxcar kernel is special in that each weight K(x_i/h) ∈ {0, 1} equals its own square. In that case var(β̂ | X; ∆) = (XᵀWX)⁻¹σ².

The estimand in regression discontinuity is 2β_2. Therefore we study var(β̂_2 | X; ∆) under a tie-breaker design as a function of ∆, using the expression in (5). Our present analysis takes place before step 1 from the outline in the introduction: we want to study var(β̂_2 | X; ∆), but we do not yet have the x_i. Given any list of x_i, some numerical methods described in Owen and Varian (2020) can be adapted to the kernel regression setting, but that does not give theoretical insight. Without access to the x_i we will study the uniformly spaced setting with x_i = (2i − N − 1)/N. This case is simple enough to illustrate the gains from a tie-breaker design. In practice we could apply it by replacing the x_i by their centered and scaled ranks. In Section 6 we explain using asymptotic theory from Fan and Gijbels (1996) why this case models the most important features of the problem, and in Section 5 we show numerical results on a real data set that is not equispaced.

For x_i = (2i − N − 1)/N, the matrices XᵀWX/N and XᵀW²X/N contain elements that can be approximated by integrals of the form

    I_rst = I_rst(∆, h, K) ≡ (1/2) ∫_{−1}^{1} x^r E(Z^s | x; ∆) K(x/h)^t dx     (6)

for integer exponents r, s and t.
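The objects in (4) and (5) are easy to form numerically. The sketch below (ours; design parameters are illustrative) builds the sandwich variance (5) for an arbitrary kernel and confirms the boxcar simplification noted above.

```python
import numpy as np

N, Delta, h = 801, 0.3, 0.5
x = (2 * np.arange(1, N + 1) - N - 1) / N      # equispaced in (-1, 1)

rng = np.random.default_rng(1)
p = np.where(x >= Delta, 1.0, np.where(x <= -Delta, 0.0, 0.5))
Z = np.where(rng.random(N) < p, 1, -1)

X = np.column_stack([np.ones(N), x, Z, x * Z])

def sandwich_var(K):
    """var(beta_hat | X; Delta) / sigma^2 from formula (5) for kernel K."""
    w = K(x / h)                                # diagonal of W
    A = X.T @ (w[:, None] * X)                  # X'WX
    B = X.T @ ((w ** 2)[:, None] * X)           # X'W^2X
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv

boxcar = lambda u: (np.abs(u) <= 1).astype(float)
triangular = lambda u: np.maximum(0.0, 1.0 - np.abs(u))

# For the boxcar kernel W^2 = W, so (5) collapses to (X'WX)^{-1} sigma^2.
V_sand = sandwich_var(boxcar)
w = boxcar(x / h)
V_simple = np.linalg.inv(X.T @ (w[:, None] * X))
print(np.allclose(V_sand, V_simple))  # True
```

The (3, 3) entry of the returned matrix (index 2 in zero-based Python) is var(β̂_2 | X; ∆)/σ², the quantity studied throughout the paper.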
Our expressions will simplify somewhat because Z² = 1, making every I_{r,2,t} = I_{r,0,t}, and also because both x and E(Z | x; ∆) are antisymmetric functions of x, making them orthogonal to K(x/h), which we have assumed to be symmetric. The error in those moment approximations is O_p(N^{−1/2}) if the Z_i are independent random variables. The error can be much less with other sampling schemes. For instance, we could use stratified sampling, forming pairs of subjects (i, i + 1) in the experimental region and randomly setting Z_i = ±1 with Z_{i+1} = −Z_i. We will use ≈ to describe approximations that are O_p(N^{−1/2}) or better.

Applying first Z² = 1 and then using symmetry and antisymmetry, at most six distinct integrals appear among the 32 components of the two matrices XᵀWX/N and XᵀW²X/N. With rows and columns ordered as (1, x, z, xz), we rewrite them, beginning with

    (1/N) XᵀWX ≈ [ ν_0     0       0       φ(∆) ]
                 [ 0       ν_2     φ(∆)    0    ]
                 [ 0       φ(∆)    ν_0     0    ]
                 [ φ(∆)    0       0       ν_2  ]     (7)

where

    ν_0 = (1/2) ∫_{−1}^{1} K(x/h) dx,    ν_2 = (1/2) ∫_{−1}^{1} x² K(x/h) dx,    and

    φ(∆) = (1/2) ∫_{−1}^{−∆} (−x) K(x/h) dx + (1/2) ∫_{∆}^{1} x K(x/h) dx = ∫_{∆}^{1} x K(x/h) dx.     (8)

Note that ν_0 and ν_2 may depend on h but they do not depend on ∆. A similar argument, using that K(·)² is also a symmetric function, shows that

    (1/N) XᵀW²X ≈ [ π_0     0       0       ψ(∆) ]
                  [ 0       π_2     ψ(∆)    0    ]
                  [ 0       ψ(∆)    π_0     0    ]
                  [ ψ(∆)    0       0       π_2  ]     (9)

for

    π_0 = (1/2) ∫_{−1}^{1} K(x/h)² dx,    π_2 = (1/2) ∫_{−1}^{1} x² K(x/h)² dx,    and    ψ(∆) = ∫_{∆}^{1} x K(x/h)² dx.     (10)

Now we are ready to describe the asymptotic variance of β̂_2.

Theorem 1.
Let x_i = (2i − N − 1)/N and select Z_i ∈ {−1, 1} by the tie-breaker rule (1). Let the Y_i be uncorrelated random variables with common variance σ², conditionally on X = ((1, x_1, Z_1, x_1 Z_1), · · · , (1, x_N, Z_N, x_N Z_N)). Next, for a symmetric kernel K(·) ≥ 0 with 0 < ∫_{−∞}^{∞} x² K(x) dx < ∞ and a bandwidth h > 0, let β̂ be estimated by the kernel weighted regression (3). Then

    N var(β̂_2 | X; ∆) = σ² (ν_2² π_0 − 2 ν_2 φ(∆) ψ(∆) + π_2 φ(∆)²) / (ν_0 ν_2 − φ(∆)²)² + O_p(1/√N),     (11)

where ν_0, ν_2 and φ(∆) are defined in (8) and π_0, π_2 and ψ(∆) are defined in (10).

Proof. Reordering the components of β, we find after substituting equations (7) and (9) into (5) that √N (β̂_0, β̂_3, β̂_1, β̂_2) has variance

    M⁻¹ B M⁻¹ σ² + O_p(1/√N),    where

    M = [ ν_0   φ(∆)  0     0    ]        B = [ π_0   ψ(∆)  0     0    ]
        [ φ(∆)  ν_2   0     0    ]            [ ψ(∆)  π_2   0     0    ]
        [ 0     0     ν_2   φ(∆) ]            [ 0     0     π_2   ψ(∆) ]
        [ 0     0     φ(∆)  ν_0  ]            [ 0     0     ψ(∆)  π_0  ].

Now (11) follows directly by matrix inversion and multiplication. ∎

The variance formula in Theorem 1 does not require the linear model (2) to hold. When it does not hold there will generally be some bias, with E(2β̂_2 | X; ∆) ≠ µ₊(0) − µ₋(0). We suppose that the user will choose an undersmoothed h making the bias smaller than the standard error, as recommended by Calonico et al. (2014). That undersmoothing takes place in our step 5 above and is not available when ∆ is chosen.

We are primarily interested in comparing the asymptotic variance of τ̂ = 2β̂_2 for various choices of ∆. We especially want to compare the efficiency of tie-breaker designs with ∆ > 0 to that of the RDD with ∆ = 0, via

    Eff^(N)(∆) ≡ var(β̂_2 | X; 0) / var(β̂_2 | X; ∆).     (12)

Using Theorem 1, Eff^(N)(∆) converges in probability to the asymptotic efficiency ratio

    Eff(∆) = [(ν_2² π_0 − 2 ν_2 φ(0) ψ(0) + π_2 φ(0)²)(ν_0 ν_2 − φ(∆)²)²] / [(ν_2² π_0 − 2 ν_2 φ(∆) ψ(∆) + π_2 φ(∆)²)(ν_0 ν_2 − φ(0)²)²]     (13)

using quantities that we defined at (8) and (10).
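The moments in (8) and (10) and the ratio (13) can be evaluated by quadrature for any kernel. The sketch below (ours) does so with scipy and checks the boxcar values against the closed-form efficiencies quoted in the introduction.

```python
from scipy.integrate import quad

def eff(Delta, h, K):
    """Asymptotic efficiency ratio (13) for kernel K, bandwidth h, radius Delta."""
    def q(f, lo, hi):
        # Pass kernel break points to quad when they fall inside (lo, hi).
        pts = [p for p in (-h, h) if lo < p < hi]
        return quad(f, lo, hi, points=pts or None)[0]

    nu0 = 0.5 * q(lambda x: K(x / h), -1, 1)                 # (8)
    nu2 = 0.5 * q(lambda x: x ** 2 * K(x / h), -1, 1)
    pi0 = 0.5 * q(lambda x: K(x / h) ** 2, -1, 1)            # (10)
    pi2 = 0.5 * q(lambda x: x ** 2 * K(x / h) ** 2, -1, 1)

    def nvar(D):
        # N var(beta2_hat)/sigma^2 from (11), without the O_p term.
        phi = q(lambda x: x * K(x / h), D, 1)
        psi = q(lambda x: x * K(x / h) ** 2, D, 1)
        return (nu2 ** 2 * pi0 - 2 * nu2 * phi * psi + pi2 * phi ** 2) \
               / (nu0 * nu2 - phi ** 2) ** 2

    return nvar(0.0) / nvar(Delta)

boxcar = lambda u: 1.0 * (abs(u) <= 1)
triangular = lambda u: max(0.0, 1.0 - abs(u))

h = 0.5
for delta in (0.0, 0.5, 1.0):
    # at delta = 1 the ratios approach 4.0 (boxcar) and 3.6 (triangular)
    print(eff(delta * h, h, boxcar), eff(delta * h, h, triangular))
```

The printed boxcar values match 1 + 6δ² − 3δ⁴ from the next section, which is a useful cross-check on both the quadrature and the algebra.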
In this section we present the efficiency ratios under the conditions of Theorem 1 for the two kernels of greatest interest: the boxcar kernel and the triangular kernel. We work with x_i = (2i − N − 1)/N throughout this section.

For the boxcar kernel K_BC(x) = 1{|x| ≤ 1}, we can assume without loss of generality that h ≤ 1: points with |x_i − t| = |x_i| > h get weight zero, so any h > 1 gives the same fit as h = 1. We find for this kernel that

    ν_0 = π_0 = h,    ν_2 = π_2 = h³/3,    and    φ(∆) = ψ(∆) = (h² − ∆²)₊ / 2.     (14)

Using some foresight, we define the local tie-breaker constant δ = ∆/h. This is the fraction of the local regression region in which the treatment was assigned at random.

Proposition 1.
Under the conditions of Theorem 1 and using the boxcar kernel K_BC, the asymptotic efficiency of the tie-breaker design is

    Eff_BC = 1 + 6δ² − 3δ⁴     (15)

for δ = ∆/h ≤ 1. If δ > 1, then Eff_BC = 4.

Proof. Because many quantities from (14) are identical, substituting them into (13) produces numerous simplifications that yield

    Eff_BC = (ν_0 ν_2 − φ(∆)²) / (ν_0 ν_2 − φ(0)²) = (h⁴/3 − (h² − ∆²)²/4) / (h⁴/3 − h⁴/4) = 4 − 3(1 − δ²)².

For 0 ≤ δ ≤ 1 this equals 1 + 6δ² − 3δ⁴, while for δ > 1 we have φ(∆) = 0 and the ratio equals 4. ∎

Taking δ = 1 with h = 1 makes the local regression a global one. We then get the same efficiency ratio as in equation (6) from Owen and Varian (2020). By taking derivatives it is easy to show that the efficiency ratio in (15) is strictly increasing as the local amount of experimentation δ varies over the interval 0 < δ < 1. The top panel of Figure 1 shows Eff_BC versus δ.

The triangular spike kernel K_TS(x) = (1 − |x|)₊ (triangular kernel for short) is more complicated than the boxcar kernel because for it, K² is not proportional to K. Once again, we assume that h ∈ (0, 1]. For this kernel

    ν_0 = h/2,    ν_2 = h³/12,    π_0 = h/3,    and    π_2 = h³/30,

and writing δ = ∆/h, we get

    φ(∆) = (h²/6)(1 − 3δ² + 2δ³)    and    ψ(∆) = (h²/12)(1 − 6δ² + 8δ³ − 3δ⁴).

Proposition 2.
Under the conditions of Theorem 1 and using the triangular kernel K_TS, the asymptotic efficiency of the tie-breaker design is

    Eff_TS = 2(3 − 2(1 − 3δ² + 2δ³)²)² / (5 − 5(1 − 3δ² + 2δ³)(1 − 6δ² + 8δ³ − 3δ⁴) + 2(1 − 3δ² + 2δ³)²)     (16)

for δ = ∆/h ≤ 1.

Proof. When we substitute values into the efficiency formula (13) we get some simplifications from π_0 = (2/3)ν_0 and π_2 = (2/5)ν_2. The numerator in N var(β̂_2)/σ² becomes

    (2/3)ν_0 ν_2² − 2ν_2 φ(∆)ψ(∆) + (2/5)ν_2 φ(∆)² = (2ν_2/15)(5ν_0 ν_2 − 15φ(∆)ψ(∆) + 3φ(∆)²),

so that the efficiency ratio is

    Eff_TS = [5ν_0ν_2 − 15φ(0)ψ(0) + 3φ(0)²] / [5ν_0ν_2 − 15φ(∆)ψ(∆) + 3φ(∆)²] × (ν_0ν_2 − φ(∆)²)² / (ν_0ν_2 − φ(0)²)²     (17)

after cancelling the common factor of 2ν_2/15. Next, 5ν_0ν_2 − 15φ(∆)ψ(∆) + 3φ(∆)² equals

    (5h⁴/24) − (5h⁴/24)(1 − 3δ² + 2δ³)(1 − 6δ² + 8δ³ − 3δ⁴) + (h⁴/12)(1 − 3δ² + 2δ³)²

and so the first factor in (17) is

    2 / (5 − 5(1 − 3δ² + 2δ³)(1 − 6δ² + 8δ³ − 3δ⁴) + 2(1 − 3δ² + 2δ³)²).

Turning to the second factor,

    ν_0ν_2 − φ(∆)² = h⁴/24 − (h⁴/36)(1 − 3δ² + 2δ³)² = (h⁴/72)(3 − 2(1 − 3δ² + 2δ³)²)

and so the second factor equals (3 − 2(1 − 3δ² + 2δ³)²)², establishing (16). ∎

The second panel in Figure 1 shows Eff_TS versus the local experiment size δ. The efficiency curve has a similar monotone increasing shape to the one we saw for the boxcar kernel. The maximum efficiency ratio, at δ = 1, is 18/5 = 3.6. The efficiency ratio (16) is a rational function of δ with a numerator of degree 12 and a denominator of degree 7. It is strictly increasing on the interval 0 < δ < 1:

Proposition 3.
The derivative of Eff_TS with respect to δ is positive for 0 < δ < 1.

Proof. See the Appendix. ∎
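A quick numerical check (ours) of (16) and of Propositions 2 and 3: Eff_TS runs from 1 at δ = 0 to 18/5 at δ = 1 and is increasing in between.

```python
import numpy as np

def eff_ts(delta):
    """Asymptotic efficiency (16) for the triangular kernel at delta = Delta/h."""
    a = 1 - 3 * delta ** 2 + 2 * delta ** 3
    b = 1 - 6 * delta ** 2 + 8 * delta ** 3 - 3 * delta ** 4
    return 2 * (3 - 2 * a ** 2) ** 2 / (5 - 5 * a * b + 2 * a ** 2)

d = np.linspace(0, 1, 1001)
vals = eff_ts(d)
print(vals[0], vals[-1])                # 1.0 at delta = 0 and 18/5 = 3.6 at delta = 1
print(bool(np.all(np.diff(vals) > 0)))  # strictly increasing, per Proposition 3
```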
We explored the efficiency ratio for the tie-breaker design for x_i with a uniform distribution. While that can be arranged by using ranks, in other situations we might prefer to use the original values of a running variable, and those might not be uniformly distributed. We demonstrate how this would be done using a dataset from Angrist and Lavy (1999) on classroom sizes.

Angrist and Lavy (1999) studied the causal effect of classroom size on test performance of elementary school students in Israel. In Israel, the Maimonides rule mandates that elementary school classes cannot exceed 40 students. If a school has 41 students enrolled in a particular grade, that grade must be split into two classes. Note that grades that have 40 or fewer enrolled students are allowed to split into multiple classes and that grades with slightly more than 40 students occasionally violate the Maimonides rule and do not split into multiple classes. Despite this, we can consider this a setting for RDD where the treatment variable is whether or not the school is legally mandated to split a particular grade into smaller classes.

Figure 1: The top panel shows the efficiency of the local tie-breaker design for uniform x_i and a boxcar kernel as a function of δ = ∆/h. The lower panel shows this efficiency for a triangular kernel.

The dataset, published on the Harvard Dataverse (Angrist and Lavy, 2009), has verbal and math scores for 3rd, 4th and 5th graders across Israel. We chose to focus exclusively on 4th grade verbal scores as our response variable and 4th grade enrollments as our assignment variable because Angrist and Lavy (1999) suggest that a slightly significant effect of the treatment on 4th grade verbal scores exists.
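As a small illustration (ours, with invented enrollment counts standing in for the Dataverse file), the mandated-split treatment indicator and the centered running variable used below can be formed as:

```python
import numpy as np

# Hypothetical 4th grade enrollment counts; the real analysis uses the
# Angrist and Lavy (2009) Dataverse data instead.
enrollment = np.array([21, 35, 40, 41, 52, 63, 78])

# A grade is legally mandated to split once enrollment exceeds 40 students,
# so Z = +1 (mandated split, i.e. smaller classes) iff enrollment >= 41.
Z = np.where(enrollment >= 41, 1, -1)

# Center the running variable so the threshold between 40 and 41 moves to 0.
x = enrollment - 40.5

print(list(zip(x, Z)))
```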
A case could be made for a tie-breaker design in this setting because, had one been used, the treatment effect might have been more accurately estimated.

Figure 2: A histogram of 4th grade enrollments for our filtered dataset (with schools exceeding 80 4th grade students or three 4th grade classes removed).

To simplify the analysis, we removed from the dataset all schools that either had more than 80 students or more than two 4th-grade classes. We further removed all schools that had NA entries for either class size or verbal scores, leaving N = 711 schools in our filtered dataset. See Figure 2 for a visualization of the distribution of the 4th grade enrollments and Figure 3 for visualizations of the local linear regression-based RDD on this dataset using boxcar and triangular kernels. We use the bandwidths h_IK given by the Imbens and Kalyanaraman (2012) procedure. The apparent benefit from smaller classrooms is positive but small and, it turns out, not statistically significant in this analysis. The 95% confidence interval (assuming homoscedastic errors) for the effect size at the boundary of the local linear regression-based RDD contained zero when a boxcar kernel with bandwidth h_IK,BC = 14.18 was used. The 95% confidence interval for the effect size at the boundary of this RDD again contained zero when a triangular kernel with bandwidth h_IK,TS = 9.02 was used.

Next we illustrate how an investigator can estimate the efficiency of tie-breaker designs as a function of ∆ on sample values of the assignment variable. First we center the data, replacing x_i by x_i − 40.5, which moves the treatment threshold from t = 40.5 to t = 0. Next, for each ∆ of interest we use 1000 Monte Carlo samples to estimate var(β̂_2 | X; ∆) and also var(β̂_2 | X; 0), both up to a constant σ². That gives us 1000 efficiency ratios Eff^(N)(∆) = var(β̂_2 | X; 0)/var(β̂_2 | X; ∆) for each ∆. In each of our 1000 samples, we simulate random assignments for a tie-breaker design at the given experimental radius ∆. The random assignments are stratified: in each consecutive pair of classroom sizes in the experimental region, one was randomly chosen to have Z = 1 and the other got Z = −1. These Z_i let us compute the matrices X and W defined in the beginning of Section 2, from which we compute a non-asymptotic var(β̂_2 | X; ∆). Note that we do not need to simulate any Y values to do this because, in this initial analysis, we are retaining the bandwidths from the Imbens and Kalyanaraman (2012) procedure on the original data.

Figure 3: RDD fit to the 4th grader verbal scores from the Angrist and Lavy (2009) dataset when using a boxcar kernel (top) and a triangular kernel (bottom). For these two fits, the bandwidths h_IK,BC and h_IK,TS were chosen by the procedure in Imbens and Kalyanaraman (2012).

Figure 4 shows boxplots of 1000 simulated Eff^(N)(∆) values for various integer choices of ∆, to plot the full efficiency curve. It is clear from Figure 4 that with stratified allocations the efficiency is very reproducible. Figure 5 shows results for different bandwidths, given by scalar multiples of h_IK. Because the efficiencies are so reproducible given the bandwidth, we just plot curves of the mean and standard deviations of the estimated Eff values.

Figure 4: Boxplots of the Monte Carlo efficiency ratio estimates for various integer values of ∆ when using a boxcar kernel (top) and a triangular kernel (bottom). For both kernels, we used the same bandwidths as in Figure 3, namely h_IK,BC = 14.18 and h_IK,TS = 9.02.

For the boxcar kernel we see that the tie-breaker design is reproducibly more efficient than the RDD as δ = ∆/h increases, for all h we studied. For the triangular kernel we see much the same thing, apart from one value of δ at the smallest bandwidth where the tie-breaker comes out less efficient than the RDD. For that point the experimental region consisted of just 17 data points, 8 with a class of 40 students and 9 with a class of 41 students.

For a further discussion of the Maimonides rule, see Angrist et al. (2019). They consider different data sets and also investigate the possibility that the class sizes are sometimes manipulated to be above the threshold triggering a classroom split.

Figure 5: Monte Carlo based estimates of the expected value (left) and standard deviation (right) of Eff^(N)(∆) versus ∆/h for the Angrist and Lavy (2009) dataset of 4th grader verbal scores. For these plots a boxcar kernel (top) and a triangular kernel (bottom) were used. The bandwidths plotted are different scalar multiples of h_IK,BC and h_IK,TS from the procedure of Imbens and Kalyanaraman (2012). The legends for the plots on the right are the same as those for the expected efficiency curves. Note that the curves are not smooth because, to avoid redundancy, only points that corresponded to integer values of ∆ were used.

Owen and Varian (2020) found an efficiency advantage for the tie-breaker in a global regression, wherein the estimation variance decreased monotonically with the amount of experimentation. This paper provides a comparable finding for the now more standard local linear regression approach. For any fixed bandwidth h, we see a theoretical efficiency that increases with the amount ∆ of experimentation. We have not investigated the effect of ∆ on the subsequent choice of h.

Our theoretical analysis is for a uniformly spaced assignment variable. Imbens and Wager (2019) consider how to optimally tune kernel weights in a regression discontinuity problem for a given set of data.
Owen and Varian (2020) consider numerical optimization of tie-breaker designs on given data.

Here we offer one explanation for why the empirical efficiencies on nonuniformly distributed data look so similar to the theoretical ones for uniformly distributed data. We use some results about nonparametric regression from Fan and Gijbels (1996, Table 2.1). Nonparametric regression estimates µ̂(t) typically have an asymptotic variance whose leading term is proportional to 1/f(t), where f is the probability density of the x_i. This arises because the local sample size is asymptotically proportional to f(t). Hence, when considering nonuniform distributions, the 1/f(t) factors in the leading order variance terms will cancel out when computing the efficiency ratios. Some nonparametric regression estimators, such as the Nadaraya-Watson estimator, have a lead term in their bias that depends on the derivative f′(t), and while f′(t) = 0 for uniformly distributed data it is not zero in general. Kernel weighted least squares (with symmetric K(·)) does not have a dependency on f′(t) in its bias. There is a curvature bias from µ′′(t), but that is not related to the sampling distribution of the x_i. The lead terms in bias and variance for kernel regressions do not distinguish between distributions with the same value of f(t) but different f′(t).

We close by comparing the tie-breaker setting to some other similar sounding ones. In fuzzy RDDs (Campbell, 1969) the threshold varies, perhaps randomly, because it depends on some additional variables that are not available to the data analyst. There are also settings where the assignment variable is subject to manipulation. For instance, if a passing grade is 50% there may be no candidates with recorded scores in the interval from 47% to 50%. See McCrary (2008) for more on manipulation and Rosenman and Rajkumar (2019) for a mitigation strategy.
The tie-breaker setting is special because the investigator is able to control the treatment allocation.

Acknowledgments
This work was supported by the U.S. National Science Foundation under grant IIS-1837931. We thank Hal Varian for commenting on the paper.
References
Aiken, L. S., West, S. G., Schwalm, D. E., Carroll, J. L., and Hsiung, S. (1998). Comparison of a randomized and two quasi-experimental designs in a single outcome evaluation: Efficacy of a university-level remedial writing program. Evaluation Review, 22(2):207–244.

Angrist, J., Hudson, S., and Pallais, A. (2014). Leveling up: Early results from a randomized evaluation of post-secondary aid. Technical report, National Bureau of Economic Research.

Angrist, J. D. and Lavy, V. (1999). Using Maimonides' rule to estimate the effect of class size on scholastic achievement. The Quarterly Journal of Economics, 114(2):533–575.

Angrist, J. D. and Lavy, V. (2009). Replication data for: Using Maimonides' Rule to Estimate the Effect of Class Size on Student Achievement.

Angrist, J. D., Lavy, V., Leder-Luis, J., and Shany, A. (2019). Maimonides' rule redux. American Economic Review: Insights, 1(3):309–24.

Angrist, J. D. and Pischke, J.-S. (2014). Mastering Metrics. Princeton University Press, Princeton.

Borchers, H. W. (2019). pracma: Practical Numerical Math Functions. R package version 2.2.9.

Calonico, S., Cattaneo, M. D., and Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6):2295–2326.

Campbell, D. T. (1969). Reforms as experiments. American Psychologist, 24(4):409.

Cheng, M.-Y., Fan, J., and Marron, J. S. (1997). On automatic boundary corrections. The Annals of Statistics, 25(4):1691–1708.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66, volume 66. CRC Press, Boca Raton, FL.

Hahn, J., Todd, P., and Van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1):201–209.

Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, second edition.

Imbens, G. and Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. The Review of Economic Studies, 79:933–959.

Imbens, G. and Wager, S. (2019). Optimized regression discontinuity designs. Review of Economics and Statistics, 101(2):264–278.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2):698–714.

Owen, A. B. and Varian, H. (2020). Optimizing the tie-breaker regression discontinuity design. Electronic Journal of Statistics, 14(2):4004–4027.

Rosenman, E. and Rajkumar, K. (2019). Optimized partial identification bounds for regression discontinuity designs with manipulation. Technical Report arXiv:1910.02170, Stanford University.

Thistlethwaite, D. L. and Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, 51(6):309.

Trochim, W. M. K. and Cappelleri, J. C. (1992). Cutoff assignment strategies for enhancing randomized clinical trials. Controlled Clinical Trials, pages 190–212.
Appendix: Proof of Proposition 3
We want to show that the function

    Eff_TS(δ) = 2(3 − 2(1 − 3δ² + 2δ³)²)² / (5 − 5(1 − 3δ² + 2δ³)(1 − 6δ² + 8δ³ − 3δ⁴) + 2(1 − 3δ² + 2δ³)²)

has a positive derivative for 0 < δ < 1. The numerator has degree 12 and the denominator has degree 7. The customary formula for the derivative of a rational function produces a rational function with a non-negative denominator and a numerator of degree 18. We will work through a sequence of steps reducing the degree of this polynomial to show that the numerator must be positive on (0, 1), establishing the monotonicity of Eff_TS(δ), which is visually apparent.

It is convenient to work instead with x = 1 − δ. Then 1 − 3δ² + 2δ³ = 3x² − 2x³ and 1 − 6δ² + 8δ³ − 3δ⁴ = 4x³ − 3x⁴. Therefore Eff_TS(δ) = f(1 − δ) where f is the function given by

    f(x) = 2(3 − 2(3x² − 2x³)²)² / (5 − 5(3x² − 2x³)(4x³ − 3x⁴) + 2(3x² − 2x³)²)
         = 2(3 − 2x⁴(3 − 2x)²)² / (5 − 5x⁵(3 − 2x)(4 − 3x) + 2x⁴(3 − 2x)²)
         = 2(3 − 2g(x)(3 − 2x))² / (5 + g(x)(6 − 24x + 15x²))

for g(x) = x⁴(3 − 2x). Having replaced δ by x = 1 − δ, we will show that f′(x) < 0 for 0 < x < 1. The derivative f′(x) has numerator

    n_1(x) = 4(3 − 2g(x)(3 − 2x))(4g(x) − 2g′(x)(3 − 2x))(5 + g(x)(6 − 24x + 15x²))
           − 2(3 − 2g(x)(3 − 2x))²(g′(x)(6 − 24x + 15x²) + g(x)(−24 + 30x)).

Now 0 ≤ g(x)(3 − 2x) ≤ 1 for x ∈ [0, 1], and so 3 − 2g(x)(3 − 2x) > 0. As a result, the sign of n_1(x) is preserved by dividing it by 2(3 − 2g(x)(3 − 2x)), yielding

    n_2(x) = 2(4g(x) − 2g′(x)(3 − 2x))(5 + g(x)(6 − 24x + 15x²))
           − (3 − 2g(x)(3 − 2x))(g′(x)(6 − 24x + 15x²) + g(x)(−24 + 30x)).

Now since g(x) = x³[x(3 − 2x)] and g′(x) = x³(12 − 10x) and x ∈ (0, 1), we can divide n_2(x) by 6x³ to get

    n_3(x) = −8(1 − x)(3 − 2x)(5 + g(x)(6 − 24x + 15x²)) − (3 − 2g(x)(3 − 2x))(−35x³ + 93x² − 70x + 12)
           = −8(1 − x)(3 − 2x)(5 + g(x)(6 − 24x + 15x²)) − (3 − 2g(x)(3 − 2x))(1 − x)(35x² − 58x + 12).

We can divide n_3(x) by −(1 − x), getting a polynomial n_4(x) with the opposite sign from n_3. This yields

    n_4(x) = 8(3 − 2x)(5 + g(x)(6 − 24x + 15x²)) + (3 − 2g(x)(3 − 2x))(35x² − 58x + 12)
           = 8(3 − 2x)(−30x⁷ + 93x⁶ − 84x⁵ + 18x⁴ + 5) + (−8x⁶ + 24x⁵ − 18x⁴ + 3)(35x² − 58x + 12)
           = 8(60x⁸ − 276x⁷ + 447x⁶ − 288x⁵ + 54x⁴ − 10x + 15)
             + (−280x⁸ + 1304x⁷ − 2118x⁶ + 1332x⁵ − 216x⁴ + 105x² − 174x + 36)
           = 200x⁸ − 904x⁷ + 1458x⁶ − 972x⁵ + 216x⁴ + 0x³ + 105x² − 254x + 156.

Note that the coefficient of x³ in n_4 is a zero that must not be left out when entering the coefficients into symbolic differentiation codes.

We want to show that n_4(x) > 0 for x ∈ (0, 1), which then makes n_1(x) < 0 and hence f′(x) < 0, so that Eff′_TS(δ) > 0 for δ = 1 − x ∈ (0, 1). The first step is to show that n_4′′(x) > 0 for x ∈ [0, 1], where

    n_4′′(x) = 11200x⁶ − 37968x⁵ + 43740x⁴ − 19440x³ + 2592x² + 210.

Graphing n_4′′(x) versus x and evaluating it numerically makes it clear that 148 < n_4′′(x) on [0, 1], but below we verify the positivity of n_4′′ rigorously. Some readers might prefer to skip that and go instead to the subsection marked "Conclusion of the proof".

Positivity of n_4′′

We begin by writing

    n_4′′′(x) = 48x(1400x⁴ − 3955x³ + 3645x² − 1215x + 108),

so that

    |n_4′′′(x)| ≤ 48(1400 + 3955 + 3645 + 1215 + 108) = 495,504 ≤ 2¹⁹     (18)

for all x ∈ [0, 1]. For k ∈ {1, 2, . . . , 2²⁰ + 1} define x_k = (k − 1)2⁻²⁰. For each k we let n̂_4′′(x_k) be the numerical evaluation of the polynomial n_4′′(x_k) computed using Horner's method with double precision in R, implemented with the 'horner' function in the pracma R package of Borchers (2019).

By formula (5.3) in Higham (2002), the absolute error in Horner's method for the polynomial Σ_{r=0}^{n} a_r x^r is at most γ_{2n} p̃(|x|), where γ_n ≡ nu/(1 − nu), u is the unit roundoff, and p̃(|x|) = Σ_{r=0}^{n} |a_r| |x|^r.

Applying that bound to our 6th degree polynomial n_4′′(x), and noting that for each x ∈ [0, 1], p̃(|x|) will be at most the sum of the absolute values of the coefficients, we find that

    max_{1≤k≤2²⁰+1} |n_4′′(x_k) − n̂_4′′(x_k)| ≤ γ₁₂ × p̃(|x_k|) ≤ (12u/(1 − 12u)) × 115,150.     (19)

We need not worry about floating point error induced by evaluating x_k, because each x_k is a floating point number. For double precision in R, the unit roundoff is u = 2⁻⁵³, from which 12u/(1 − 12u) ≤ 10⁻¹⁴, and so the maximum error in (19) is at most 10⁻⁸. The smallest value of n̂_4′′(x_k) among all 2²⁰ + 1 evaluation points x_k exceeded 148, and so min_k n_4′′(x_k) ≥ 148 − 10⁻⁸.

Now for any x ∈ [0, 1] there is some k* ∈ {1, 2, . . . , 2²⁰ + 1} with |x − x_{k*}| ≤ 2⁻²¹. By (18) we know that n_4′′ is Lipschitz continuous on [0, 1] with Lipschitz constant 2¹⁹. Therefore

    n_4′′(x) ≥ n_4′′(x_{k*}) − |n_4′′(x_{k*}) − n_4′′(x)| ≥ 148 − 10⁻⁸ − 2¹⁹ × 2⁻²¹ > 0

holds for all x ∈ [0, 1].

Conclusion of the proof

We have shown above that n_4′′(x) > 0 for x ∈ [0, 1]. Since n_4′(1) = −20 < 0 and n_4′′(x) > 0 on [0, 1], we have n_4′(x) < 0 for all x ∈ [0, 1]. Then since n_4(1) = 5 > 0 and n_4′(x) < 0 on [0, 1], we have n_4(x) ≥ 5 > 0 for all x ∈ [0, 1]. Tracing back through the sign-preserving divisions, n_4 > 0 on (0, 1) gives n_3 < 0, then n_2 < 0 and n_1 < 0 there, so f′(x) < 0 and hence Eff′_TS(δ) > 0 for 0 < δ < 1. ∎
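The grid evaluation above is easy to reproduce. The sketch below (ours, in Python rather than the pracma R package used in the paper, and on a coarser 2¹⁴ + 1 point grid to keep it fast) applies Horner's method to n_4′′ and checks that it stays above 148.

```python
import numpy as np

def horner(coeffs, x):
    """Evaluate a polynomial; coefficients are listed from highest to lowest degree."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc

# Coefficients of n4''(x), highest degree first; note the zero x^1 coefficient.
coeffs = [11200.0, -37968.0, 43740.0, -19440.0, 2592.0, 0.0, 210.0]

# Equispaced grid in [0, 1]; the paper's verification uses 2^20 + 1 points.
grid = np.linspace(0.0, 1.0, 2 ** 14 + 1)
vals = np.array([horner(coeffs, x) for x in grid])
print(vals.min() > 148)  # the grid minimum stays above 148, as in the proof
```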