Local linear tie-breaker designs
Dan M. Kluger, Stanford University
Art B. Owen, Stanford University
January 2021
Abstract
Tie-breaker experimental designs are hybrids of Randomized Controlled Trials (RCTs) and Regression Discontinuity Designs (RDDs) in which subjects with moderate scores are placed in an RCT while subjects with extreme scores are deterministically assigned to the treatment or control group. The design maintains the benefits of randomization for causal estimation while avoiding the possibility of excluding the most deserving recipients from the treatment group. The causal effect in a tie-breaker design can be estimated by fitting local linear regressions for both the treatment and control groups, as is typically done for RDDs. We study the statistical efficiency of such local linear regression-based causal estimators as a function of ∆, the radius of the interval in which treatment randomization occurs. In particular, we determine the efficiency of the estimator as a function of ∆ for a fixed, arbitrary bandwidth under the assumption of a uniform assignment variable. To generalize beyond uniform assignment variables and asymptotic regimes, we also demonstrate on the Angrist and Lavy (1999) classroom size dataset that, prior to conducting an experiment, an experimental designer can estimate the efficiency for various choices of the experimental radius by Monte Carlo, as long as they have access to the distribution of the assignment variable. For both uniform and triangular kernels, we show that increasing the radius of the randomized experiment interval increases the efficiency until the radius reaches the size of the local linear regression bandwidth, after which no additional efficiency benefits are conferred.
1 Introduction

The regression discontinuity design (RDD) introduced by Thistlethwaite and Campbell (1960) has become a mainstay of causal inference in recent years, especially in econometrics (Angrist and Pischke, 2014) and the social sciences (Imbens and Rubin, 2015). The logic in an RDD is as follows. Some different treatment is offered to subject i depending on whether the value x_i of an assignment variable (also called a running variable) exceeds a threshold t, or not. Then if the future value of a quantity Y_i has a different expected value for x_i just barely larger than t than for x_i just barely smaller than t, it becomes quite credible to interpret the difference causally, assuming that there is no a priori reason for E(Y | x) to have a step discontinuity at x = t.

In most uses of RDD, the assignment to treatment or control cannot be influenced by the investigator. In some settings, however, the investigator can inject randomness into the treatment assignments around the threshold value of x = t. The goal is to better measure the effect of the treatment variable Z on Y. This design is called a tie-breaker design. For instance, letting x be a measure of high-school performance and the treatment be the awarding of a university scholarship, Angrist et al. (2014) use a tie-breaker design to measure the impact of the scholarship on students' academic outcomes. The top performing students all get the scholarship, the bottom performing students do not, and there is a group in the middle where a random selection is used. Aiken et al. (1998) use a tie-breaker design to offer remedial English classes based on a test given to students prior to matriculation. Companies offering loyalty programs to their best customers can also include some randomness in the reward, to better learn the impact of those rewards.
Tie-breaker designs are also known as cutoff designs in clinical trial settings (Trochim and Cappelleri, 1992).

In our setting it is convenient to represent the treatment by Z ∈ {−1, 1}, with the level −1 denoting control. In outline, the investigator will:

1. collect assignment variable values x_i for subjects i = 1, . . . , N,
2. determine a distribution for treatment variables Z_i ∈ {−1, 1},
3. assign treatments Z_i to subjects,
4. observe corresponding Y_i, and
5. infer the treatment effect from the (x_i, Z_i, Y_i) values.

The Y_i values are not available to the investigator at the time the distribution for Z_i is chosen. As a result the treatment decisions have to be chosen in part based on a guess as to what model might be used to fit the Y_i values. We are interested in settings where Y_i does not become available fast enough to employ bandit methods. For instance there might be a year long delay in the customer loyalty setting or six years in the educational attainment setting.

Owen and Varian (2020) study the statistical efficiency of tie-breaker designs. In the simplest formulation, the assignment variables are sorted and scaled so that x_i = (2i − N − 1)/N ∈ (−1, 1), and for an experimental radius 0 ≤ ∆ ≤ 1,

    Pr(Z_i = 1 | x_i) = 0,    x_i ≤ −∆,
                      = 1/2,  |x_i| < ∆,
                      = 1,    x_i ≥ ∆.     (1)

This tie-breaker design interpolates between an RDD with t = 0 when ∆ = 0 and an RCT for ∆ = 1. They consider a regression model

    Y_i = β_0 + β_1 x_i + β_2 Z_i + β_3 x_i Z_i + ε_i     (2)

where the ε_i are IID random variables with mean 0 and variance σ². They find that statistical efficiency improves with ∆. We will describe below how efficiency is defined when we introduce our version. They also find no efficiency benefit from using a sliding scale for Pr(Z = 1 | x) instead of using just the three levels in (1). Model (2) is simple enough to analyze; they also describe algorithmic ways to study more general alternative models at step 2 when there may be vectors x_i available for each subject.

A weakness of model (2) is that it is global over all subjects.
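Steps 2–4 of the outline above, with the three-level assignment rule (1) and an ordinary least squares fit of the global model (2), can be sketched as follows. This is our illustration only: the coefficient values, noise level and sample size are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, Delta = 1000, 0.4
x = (2 * np.arange(1, N + 1) - N - 1) / N        # equispaced scores in (-1, 1)

# Three-level tie-breaker rule (1): treat for x >= Delta,
# control for x <= -Delta, randomize with probability 1/2 in between.
p = np.where(x >= Delta, 1.0, np.where(x <= -Delta, 0.0, 0.5))
Z = np.where(rng.random(N) < p, 1, -1)

# Simulate responses from the working model (2); beta and sigma are invented.
beta = np.array([1.0, 2.0, 0.5, 0.25])
Y = beta[0] + beta[1] * x + beta[2] * Z + beta[3] * x * Z + rng.normal(0, 0.1, N)

# Step 5: ordinary least squares fit of model (2).
X = np.column_stack([np.ones(N), x, Z, x * Z])
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
print(beta_hat)  # close to beta; the treatment effect at x = 0 is 2 * beta_hat[2]
```

The fitted coefficient on Z recovers β_2, so the estimated treatment effect at the threshold is 2β̂_2, matching the estimand discussed below.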
In RDDs, it is now more common to use nonparametric regression methods (Hahn et al., 2001; Calonico et al., 2014). Let µ₊(x) = E(Y | x, Z = 1) and µ₋(x) = E(Y | x, Z = −1), both assumed to be smooth in a neighborhood of x = t. The RDD is used to estimate µ₊(t) − µ₋(t). A kernel regression µ̂₋ is fit to the (x_i, Y_i) pairs with Z_i = −1, which have x_i ≤ t, and extrapolated to µ̂₋(t). Similarly, a kernel regression estimate µ̂₊(t) based on the (x_i, Y_i) data with x_i ≥ t is produced, and then the causal impact of treatment at x = t is estimated by µ̂₊(t) − µ̂₋(t).

Our goal in this paper is to show that the tie-breaker design is still superior to the RDD even when a local regression is used. The local regression will use a bandwidth parameter h chosen at step 5 above. That bandwidth is not known to the investigator at step 2 where ∆ is chosen. To handle this, we show that the tie-breaker has a statistical advantage at all h > 0. The efficiency depends on the ratio δ = ∆/h of the size of the experimental region to the bandwidth used. Efficiency ranges from 1 at δ = 0 to 4 at δ = 1 (for the boxcar kernel) or to 3.6 at δ = 1 for the triangular kernel. The efficiency is monotone in δ. To get an interpretable theory we have worked with a uniformly spaced assignment variable. Section 5 shows how one can compute efficiency empirically using one's actual assignment variables, focusing on the Israeli classroom size data from Angrist and Lavy (1999) as an example. The efficiency curves are quite similar to the ones we have for uniformly spaced assignment variables but they show slightly greater efficiency gains than the ones in our theorems for uniform spacings. Section 6 presents a discussion and an Appendix contains one of our proofs.

2 Kernels and problem formulation
In keeping with current RDD practice we will use a kernel smoothed version of (2). The parameter vector β is estimated by

    β̂ = argmin_{β ∈ R⁴} Σ_{i=1}^{N} K((x_i − t)/h) (Y_i − (β_0 + β_1 x_i + β_2 Z_i + β_3 x_i Z_i))²     (3)

for a bandwidth h > 0 and a kernel function K(·) ≥ 0 that is non-negative and symmetric. We have special interest in the uniform (boxcar) kernel K_BC(x) = 1{|x| ≤ 1} because it gives a local version of the regression model (2). We are also interested in the triangular spike kernel K_TS(x) = (1 − |x|)₊, where z₊ = max(0, z). This triangular kernel was shown by Cheng et al. (1997) to optimize a bias-variance tradeoff for extrapolation from x_i > t to E(Y | x = t) and has been advocated for RDD analysis by Imbens and Kalyanaraman (2012) and Calonico et al. (2014) among others.

In the model (2), the treatment effect at x is

    (β_0 + β_1 x + β_2 + β_3 x) − (β_0 + β_1 x − β_2 − β_3 x) = 2(β_2 + β_3 x).

An RDD is used for x = t and there the treatment effect is 2(β_2 + β_3 t). We will shift the assignment variable so as to make t = 0 and then focus on β_2.

The kernel regression estimator from (3) has a bias and a variance that both depend on the bandwidth h. Larger h typically brings greater bias because the true regression is not precisely linear over a region centered on t. Smaller h brings greater variance because then fewer data points are in the regression. Calonico et al. (2014) advocate choosing a smaller h than is mean square optimal. Making the bias negligible compared to the standard deviation has the effect of making it much easier to get a confidence interval for the treatment effect. Confidence intervals are important when the estimate is to be used for policy purposes where quantifying uncertainty is critical. For this reason, we will study the variance of β̂_2 given h and not consider the bias. That is, we are assuming that the user will purposefully undersmooth the regression as recommended by Calonico et al.
(2014).

The design matrix for the regression is X ∈ R^{N×4} with i'th row (1, x_i, Z_i, x_i Z_i). The response is Y = (Y_1, . . . , Y_N)ᵀ. With t = 0, the kernel weights are K(x_i/h) and we let W = W(h) ∈ R^{N×N} = diag(K(x_i/h)). Then

    β̂ = β̂(∆) = (XᵀWX)⁻¹ XᵀWY     (4)

and under the assumption that var(Y | X) = σ² I_N we have

    var(β̂ | X; ∆) = (XᵀWX)⁻¹ XᵀW²X (XᵀWX)⁻¹ σ².     (5)

Formula (4) for β̂ matches the familiar generalized least squares formula for the case where var(Y | X) = W⁻¹σ². Here, however, W arises from weights that are not of inverse variance type and hence the formula for var(β̂ | X; ∆) involves a W². The boxcar kernel is special in that each weight K(x_i/h) ∈ {0, 1} equals its own square. In that case var(β̂ | X; ∆) = (XᵀWX)⁻¹σ².

The estimand in regression discontinuity is 2β_2. Therefore we study var(β̂_2 | X; ∆) under a tie-breaker design as a function of ∆, using the expression in (5). Our present analysis takes place before step 1 from the outline in the introduction: we want to study var(β̂_2 | X; ∆), but we do not yet have the x_i. Given any list of x_i, some numerical methods described in Owen and Varian (2020) can be adapted to the kernel regression setting, but that does not give theoretical insight. Without access to the x_i we will study the uniformly spaced setting with x_i = (2i − N − 1)/N. This case is simple enough to illustrate the gains from a tie-breaker design. In practice we could apply it by replacing the x_i by their centered and scaled ranks. In Section 6 we explain using asymptotic theory from Fan and Gijbels (1996) why this case models the most important features of the problem, and in Section 5 we show numerical results on a real data set that is not equispaced.

For x_i = (2i − N − 1)/N, the matrices XᵀWX/N and XᵀW²X/N contain elements that can be approximated by integrals of the form

    I_rst = I_rst(∆, h, K) ≡ (1/2) ∫_{−1}^{1} x^r E(Z^s | x; ∆) K(x/h)^t dx     (6)

for integer exponents r, s and t.
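The objects in (4) and (5) are easy to form numerically. The sketch below (ours; design parameters are illustrative) builds the sandwich variance (5) for an arbitrary kernel and confirms the boxcar simplification noted above.

```python
import numpy as np

N, Delta, h = 801, 0.3, 0.5
x = (2 * np.arange(1, N + 1) - N - 1) / N      # equispaced in (-1, 1)

rng = np.random.default_rng(1)
p = np.where(x >= Delta, 1.0, np.where(x <= -Delta, 0.0, 0.5))
Z = np.where(rng.random(N) < p, 1, -1)

X = np.column_stack([np.ones(N), x, Z, x * Z])

def sandwich_var(K):
    """var(beta_hat | X; Delta) / sigma^2 from formula (5) for kernel K."""
    w = K(x / h)                                # diagonal of W
    A = X.T @ (w[:, None] * X)                  # X'WX
    B = X.T @ ((w ** 2)[:, None] * X)           # X'W^2X
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv

boxcar = lambda u: (np.abs(u) <= 1).astype(float)
triangular = lambda u: np.maximum(0.0, 1.0 - np.abs(u))

# For the boxcar kernel W^2 = W, so (5) collapses to (X'WX)^{-1} sigma^2.
V_sand = sandwich_var(boxcar)
w = boxcar(x / h)
V_simple = np.linalg.inv(X.T @ (w[:, None] * X))
print(np.allclose(V_sand, V_simple))  # True
```

The (3, 3) entry of the returned matrix (index 2 in zero-based Python) is var(β̂_2 | X; ∆)/σ², the quantity studied throughout the paper.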
Our expressions will simplify somewhat because Z² = 1, making every I_{r,2,t} = I_{r,0,t}, and also because both x and E(Z | x; ∆) are antisymmetric functions of x, making them orthogonal to K(x/h), which we have assumed to be symmetric. The error in those moment approximations is O_p(N^{−1/2}) if the Z_i are independent random variables. The error can be much less with other sampling schemes. For instance, we could use stratified sampling, forming pairs of subjects (i, i + 1) in the experimental region and randomly setting Z_i = ±1 with Z_{i+1} = −Z_i. We will use ≈ to describe approximations that are O_p(N^{−1/2}) or better.

Applying first Z² = 1 and then using symmetry and antisymmetry, at most six distinct integrals appear among the 32 components of the two matrices XᵀWX/N and XᵀW²X/N. With rows and columns ordered as (1, x, z, xz), we rewrite them, beginning with

    (1/N) XᵀWX ≈ [ ν_0     0       0       φ(∆) ]
                 [ 0       ν_2     φ(∆)    0    ]
                 [ 0       φ(∆)    ν_0     0    ]
                 [ φ(∆)    0       0       ν_2  ]     (7)

where

    ν_0 = (1/2) ∫_{−1}^{1} K(x/h) dx,    ν_2 = (1/2) ∫_{−1}^{1} x² K(x/h) dx,    and

    φ(∆) = (1/2) ∫_{−1}^{−∆} (−x) K(x/h) dx + (1/2) ∫_{∆}^{1} x K(x/h) dx = ∫_{∆}^{1} x K(x/h) dx.     (8)

Note that ν_0 and ν_2 may depend on h but they do not depend on ∆. A similar argument, using that K(·)² is also a symmetric function, shows that

    (1/N) XᵀW²X ≈ [ π_0     0       0       ψ(∆) ]
                  [ 0       π_2     ψ(∆)    0    ]
                  [ 0       ψ(∆)    π_0     0    ]
                  [ ψ(∆)    0       0       π_2  ]     (9)

for

    π_0 = (1/2) ∫_{−1}^{1} K(x/h)² dx,    π_2 = (1/2) ∫_{−1}^{1} x² K(x/h)² dx,    and    ψ(∆) = ∫_{∆}^{1} x K(x/h)² dx.     (10)

Now we are ready to describe the asymptotic variance of β̂_2.

Theorem 1.
Let x_i = (2i − N − 1)/N and select Z_i ∈ {−1, 1} by the tie-breaker rule (1). Let the Y_i be uncorrelated random variables with common variance σ², conditionally on X = ((1, x_1, Z_1, x_1 Z_1), · · · , (1, x_N, Z_N, x_N Z_N)). Next, for a symmetric kernel K(·) ≥ 0 with 0 < ∫_{−∞}^{∞} x² K(x) dx < ∞ and a bandwidth h > 0, let β̂ be estimated by the kernel weighted regression (3). Then

    N var(β̂_2 | X; ∆) = σ² (ν_2² π_0 − 2 ν_2 φ(∆) ψ(∆) + π_2 φ(∆)²) / (ν_0 ν_2 − φ(∆)²)² + O_p(1/√N),     (11)

where ν_0, ν_2 and φ(∆) are defined in (8) and π_0, π_2 and ψ(∆) are defined in (10).

Proof. Reordering the components of β, we find after substituting equations (7) and (9) into (5) that √N (β̂_0, β̂_3, β̂_1, β̂_2) has variance

    M⁻¹ B M⁻¹ σ² + O_p(1/√N),    where

    M = [ ν_0   φ(∆)  0     0    ]        B = [ π_0   ψ(∆)  0     0    ]
        [ φ(∆)  ν_2   0     0    ]            [ ψ(∆)  π_2   0     0    ]
        [ 0     0     ν_2   φ(∆) ]            [ 0     0     π_2   ψ(∆) ]
        [ 0     0     φ(∆)  ν_0  ]            [ 0     0     ψ(∆)  π_0  ].

Now (11) follows directly by matrix inversion and multiplication. ∎

The variance formula in Theorem 1 does not require the linear model (2) to hold. When it does not hold there will generally be some bias, with E(2β̂_2 | X; ∆) ≠ µ₊(0) − µ₋(0). We suppose that the user will choose an undersmoothed h making the bias smaller than the standard error, as recommended by Calonico et al. (2014). That undersmoothing takes place in our step 5 above and is not available when ∆ is chosen.

We are primarily interested in comparing the asymptotic variance of τ̂ = 2β̂_2 for various choices of ∆. We especially want to compare the efficiency of tie-breaker designs with ∆ > 0 to that of the RDD with ∆ = 0, via

    Eff^(N)(∆) ≡ var(β̂_2 | X; 0) / var(β̂_2 | X; ∆).     (12)

Using Theorem 1, Eff^(N)(∆) converges in probability to the asymptotic efficiency ratio

    Eff(∆) = [(ν_2² π_0 − 2 ν_2 φ(0) ψ(0) + π_2 φ(0)²)(ν_0 ν_2 − φ(∆)²)²] / [(ν_2² π_0 − 2 ν_2 φ(∆) ψ(∆) + π_2 φ(∆)²)(ν_0 ν_2 − φ(0)²)²]     (13)

using quantities that we defined at (8) and (10).
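The moments in (8) and (10) and the ratio (13) can be evaluated by quadrature for any kernel. The sketch below (ours) does so with scipy and checks the boxcar values against the closed-form efficiencies quoted in the introduction.

```python
from scipy.integrate import quad

def eff(Delta, h, K):
    """Asymptotic efficiency ratio (13) for kernel K, bandwidth h, radius Delta."""
    def q(f, lo, hi):
        # Pass kernel break points to quad when they fall inside (lo, hi).
        pts = [p for p in (-h, h) if lo < p < hi]
        return quad(f, lo, hi, points=pts or None)[0]

    nu0 = 0.5 * q(lambda x: K(x / h), -1, 1)                 # (8)
    nu2 = 0.5 * q(lambda x: x ** 2 * K(x / h), -1, 1)
    pi0 = 0.5 * q(lambda x: K(x / h) ** 2, -1, 1)            # (10)
    pi2 = 0.5 * q(lambda x: x ** 2 * K(x / h) ** 2, -1, 1)

    def nvar(D):
        # N var(beta2_hat)/sigma^2 from (11), without the O_p term.
        phi = q(lambda x: x * K(x / h), D, 1)
        psi = q(lambda x: x * K(x / h) ** 2, D, 1)
        return (nu2 ** 2 * pi0 - 2 * nu2 * phi * psi + pi2 * phi ** 2) \
               / (nu0 * nu2 - phi ** 2) ** 2

    return nvar(0.0) / nvar(Delta)

boxcar = lambda u: 1.0 * (abs(u) <= 1)
triangular = lambda u: max(0.0, 1.0 - abs(u))

h = 0.5
for delta in (0.0, 0.5, 1.0):
    # at delta = 1 the ratios approach 4.0 (boxcar) and 3.6 (triangular)
    print(eff(delta * h, h, boxcar), eff(delta * h, h, triangular))
```

The printed boxcar values match 1 + 6δ² − 3δ⁴ from the next section, which is a useful cross-check on both the quadrature and the algebra.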
In this section we present the efficiency ratios under the conditions of Theorem 1 for the two kernels of greatest interest: the boxcar kernel and the triangular kernel. We work with x_i = (2i − N − 1)/N throughout this section.

For the boxcar kernel K_BC(x) = 1{|x| ≤ 1}, we can assume without loss of generality that h ≤ 1: points with |x_i − t| = |x_i| > h get weight zero, so any h > 1 gives the same fit as h = 1. We find for this kernel that

    ν_0 = π_0 = h,    ν_2 = π_2 = h³/3,    and    φ(∆) = ψ(∆) = (h² − ∆²)₊ / 2.     (14)

Using some foresight, we define the local tie-breaker constant δ = ∆/h. This is the fraction of the local regression region in which the treatment was assigned at random.

Proposition 1.
Under the conditions of Theorem 1 and using the boxcar kernel K_BC, the asymptotic efficiency of the tie-breaker design is

    Eff_BC = 1 + 6δ² − 3δ⁴     (15)

for δ = ∆/h ≤ 1. If δ > 1, then Eff_BC = 4.

Proof. Because many quantities from (14) are identical, substituting them into (13) produces numerous simplifications that yield

    Eff_BC = (ν_0 ν_2 − φ(∆)²) / (ν_0 ν_2 − φ(0)²) = (h⁴/3 − (h² − ∆²)²/4) / (h⁴/3 − h⁴/4) = 4 − 3(1 − δ²)².

For 0 ≤ δ ≤ 1 this equals 1 + 6δ² − 3δ⁴, while for δ > 1 we have φ(∆) = 0 and the ratio equals 4. ∎

Taking δ = 1 with h = 1 makes the local regression a global one. We then get the same efficiency ratio as in equation (6) from Owen and Varian (2020). By taking derivatives it is easy to show that the efficiency ratio in (15) is strictly increasing as the local amount of experimentation δ varies over the interval 0 < δ < 1. The top panel of Figure 1 shows Eff_BC versus δ.

The triangular spike kernel K_TS(x) = (1 − |x|)₊ (triangular kernel for short) is more complicated than the boxcar kernel because for it, K² is not proportional to K. Once again, we assume that h ∈ (0, 1]. For this kernel

    ν_0 = h/2,    ν_2 = h³/12,    π_0 = h/3,    and    π_2 = h³/30,

and writing δ = ∆/h, we get

    φ(∆) = (h²/6)(1 − 3δ² + 2δ³)    and    ψ(∆) = (h²/12)(1 − 6δ² + 8δ³ − 3δ⁴).

Proposition 2.
Under the conditions of Theorem 1 and using the triangular kernel K_TS, the asymptotic efficiency of the tie-breaker design is

    Eff_TS = 2(3 − 2(1 − 3δ² + 2δ³)²)² / (5 − 5(1 − 3δ² + 2δ³)(1 − 6δ² + 8δ³ − 3δ⁴) + 2(1 − 3δ² + 2δ³)²)     (16)

for δ = ∆/h ≤ 1.

Proof. When we substitute values into the efficiency formula (13) we get some simplifications from π_0 = (2/3)ν_0 and π_2 = (2/5)ν_2. The numerator in N var(β̂_2)/σ² becomes

    (2/3)ν_0 ν_2² − 2ν_2 φ(∆)ψ(∆) + (2/5)ν_2 φ(∆)² = (2ν_2/15)(5ν_0 ν_2 − 15φ(∆)ψ(∆) + 3φ(∆)²),

so that the efficiency ratio is

    Eff_TS = [5ν_0ν_2 − 15φ(0)ψ(0) + 3φ(0)²] / [5ν_0ν_2 − 15φ(∆)ψ(∆) + 3φ(∆)²] × (ν_0ν_2 − φ(∆)²)² / (ν_0ν_2 − φ(0)²)²     (17)

after cancelling the common factor of 2ν_2/15. Next, 5ν_0ν_2 − 15φ(∆)ψ(∆) + 3φ(∆)² equals

    (5h⁴/24) − (5h⁴/24)(1 − 3δ² + 2δ³)(1 − 6δ² + 8δ³ − 3δ⁴) + (h⁴/12)(1 − 3δ² + 2δ³)²

and so the first factor in (17) is

    2 / (5 − 5(1 − 3δ² + 2δ³)(1 − 6δ² + 8δ³ − 3δ⁴) + 2(1 − 3δ² + 2δ³)²).

Turning to the second factor,

    ν_0ν_2 − φ(∆)² = h⁴/24 − (h⁴/36)(1 − 3δ² + 2δ³)² = (h⁴/72)(3 − 2(1 − 3δ² + 2δ³)²)

and so the second factor equals (3 − 2(1 − 3δ² + 2δ³)²)², establishing (16). ∎

The second panel in Figure 1 shows Eff_TS versus the local experiment size δ. The efficiency curve has a similar monotone increasing shape to the one we saw for the boxcar kernel. The maximum efficiency ratio, at δ = 1, is 18/5 = 3.6. The efficiency ratio (16) is a rational function of δ with a numerator of degree 12 and a denominator of degree 7. It is strictly increasing on the interval 0 < δ < 1:

Proposition 3.
The derivative of Eff_TS with respect to δ is positive for 0 < δ < 1.

Proof. See the Appendix. ∎
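A quick numerical check (ours) of (16) and of Propositions 2 and 3: Eff_TS runs from 1 at δ = 0 to 18/5 at δ = 1 and is increasing in between.

```python
import numpy as np

def eff_ts(delta):
    """Asymptotic efficiency (16) for the triangular kernel at delta = Delta/h."""
    a = 1 - 3 * delta ** 2 + 2 * delta ** 3
    b = 1 - 6 * delta ** 2 + 8 * delta ** 3 - 3 * delta ** 4
    return 2 * (3 - 2 * a ** 2) ** 2 / (5 - 5 * a * b + 2 * a ** 2)

d = np.linspace(0, 1, 1001)
vals = eff_ts(d)
print(vals[0], vals[-1])                # 1.0 at delta = 0 and 18/5 = 3.6 at delta = 1
print(bool(np.all(np.diff(vals) > 0)))  # strictly increasing, per Proposition 3
```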
We explored the efficiency ratio for the tie-breaker design for x_i with a uniform distribution. While that can be arranged by using ranks, in other situations we might prefer to use the original values of a running variable, and those might not be uniformly distributed. We demonstrate how this would be done using a dataset from Angrist and Lavy (1999) on classroom sizes.

Angrist and Lavy (1999) studied the causal effect of classroom size on test performance of elementary school students in Israel. In Israel, the Maimonides rule mandates that elementary school classes cannot exceed 40 students. If a school has 41 students enrolled in a particular grade, that grade must be split into two classes. Note that grades that have 40 or fewer enrolled students are allowed to split into multiple classes and that grades with slightly more than 40 students occasionally violate the Maimonides rule and do not split into multiple classes. Despite this, we can consider this a setting for RDD where the treatment variable is whether or not the school is legally mandated to split a particular grade into smaller classes.

Figure 1: The top panel shows the efficiency of the local tie-breaker design for uniform x_i and a boxcar kernel as a function of δ = ∆/h. The lower panel shows this efficiency for a triangular kernel.

The dataset, published on the Harvard Dataverse (Angrist and Lavy, 2009), has verbal and math scores for 3rd, 4th and 5th graders across Israel. We chose to focus exclusively on 4th grade verbal scores as our response variable and 4th grade enrollments as our assignment variable because Angrist and Lavy (1999) suggest that a slightly significant effect of the treatment on 4th grade verbal scores exists.
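As a small illustration (ours, with invented enrollment counts standing in for the Dataverse file), the mandated-split treatment indicator and the centered running variable used below can be formed as:

```python
import numpy as np

# Hypothetical 4th grade enrollment counts; the real analysis uses the
# Angrist and Lavy (2009) Dataverse data instead.
enrollment = np.array([21, 35, 40, 41, 52, 63, 78])

# A grade is legally mandated to split once enrollment exceeds 40 students,
# so Z = +1 (mandated split, i.e. smaller classes) iff enrollment >= 41.
Z = np.where(enrollment >= 41, 1, -1)

# Center the running variable so the threshold between 40 and 41 moves to 0.
x = enrollment - 40.5

print(list(zip(x, Z)))
```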
A case could be made for a tie-breaker design in this setting because, had one been used, the treatment effect might have been more accurately estimated.

Figure 2: A histogram of 4th grade enrollments for our filtered dataset (with schools exceeding 80 4th grade students or three 4th grade classes removed).

To simplify the analysis, we removed from the dataset all schools that either had more than 80 students or more than two 4th-grade classes. We further removed all schools that had NA entries for either class size or verbal scores, leaving N = 711 schools in our filtered dataset. See Figure 2 for a visualization of the distribution of the 4th grade enrollments and Figure 3 for visualizations of the local linear regression-based RDD on this dataset using boxcar and triangular kernels. We use the bandwidths h_IK given by the Imbens and Kalyanaraman (2012) procedure. The apparent benefit from smaller classrooms is positive but small and, it turns out, not statistically significant in this analysis. The 95% confidence interval (assuming homoscedastic errors) for the effect size at the boundary of the local linear regression-based RDD contained zero when a boxcar kernel with bandwidth h_IK,BC = 14.18 was used. The 95% confidence interval for the effect size at the boundary of this RDD again contained zero when a triangular kernel with bandwidth h_IK,TS = 9.02 was used.

Next we illustrate how an investigator can estimate the efficiency of tie-breaker designs as a function of ∆ on sample values of the assignment variable. First we center the data, replacing x_i by x_i − 40.5, which moves the treatment threshold from t = 40.5 to t = 0. Next, for each ∆ of interest we use 1000 Monte Carlo samples to estimate var(β̂_2 | X; ∆) and also var(β̂_2 | X; 0), both up to a constant σ². That gives us 1000 efficiency ratios Eff^(N)(∆) = var(β̂_2 | X; 0)/var(β̂_2 | X; ∆) for each ∆. In each of our 1000 samples, we simulate random assignments for a tie-breaker design at the given experimental radius ∆. The random assignments are stratified: in each consecutive pair of classroom sizes in the experimental region, one was randomly chosen to have Z = 1 and the other got Z = −1. These Z_i let us compute the matrices X and W defined in the beginning of Section 2, from which we compute a non-asymptotic var(β̂_2 | X; ∆). Note that we do not need to simulate any Y values to do this because, in this initial analysis, we are retaining the bandwidths from the Imbens and Kalyanaraman (2012) procedure on the original data.

Figure 3: RDD fit to the 4th grader verbal scores from the Angrist and Lavy (2009) dataset when using a boxcar kernel (top) and a triangular kernel (bottom). For these two fits, the bandwidths h_IK,BC and h_IK,TS were chosen by the procedure in Imbens and Kalyanaraman (2012).

Figure 4 shows boxplots of 1000 simulated Eff^(N)(∆) values for various integer choices of ∆, to plot the full efficiency curve. It is clear from Figure 4 that with stratified allocations the efficiency is very reproducible. Figure 5 shows results for different bandwidths, given by scalar multiples of h_IK. Because the efficiencies are so reproducible given the bandwidth, we just plot curves of the mean and standard deviations of the estimated Eff values.

Figure 4: Boxplots of the Monte Carlo efficiency ratio estimates for various integer values of ∆ when using a boxcar kernel (top) and a triangular kernel (bottom). For both kernels, we used the same bandwidths as in Figure 3, namely h_IK,BC = 14.18 and h_IK,TS = 9.02.

For the boxcar kernel we see that the tie-breaker design is reproducibly more efficient than the RDD as δ = ∆/h increases, for all h we studied. For the triangular kernel we see much the same thing, apart from one value of δ at the smallest bandwidth where the tie-breaker comes out less efficient than the RDD. For that point the experimental region consisted of just 17 data points, 8 with a class of 40 students and 9 with a class of 41 students.

For a further discussion of the Maimonides rule, see Angrist et al. (2019). They consider different data sets and also investigate the possibility that the class sizes are sometimes manipulated to be above the threshold triggering a classroom split.

Figure 5: Monte Carlo based estimates of the expected value (left) and standard deviation (right) of Eff^(N)(∆) versus ∆/h for the Angrist and Lavy (2009) dataset of 4th grader verbal scores. For these plots a boxcar kernel (top) and a triangular kernel (bottom) were used. The bandwidths plotted are different scalar multiples of h_IK,BC and h_IK,TS from the procedure of Imbens and Kalyanaraman (2012). The legends for the plots on the right are the same as those for the expected efficiency curves. Note that the curves are not smooth because, to avoid redundancy, only points that corresponded to integer values of ∆ were used.

Owen and Varian (2020) found an efficiency advantage for the tie-breaker in a global regression, wherein the estimation variance decreased monotonically with the amount of experimentation. This paper provides a comparable finding for the now more standard local linear regression approach. For any fixed bandwidth h, we see a theoretical efficiency that increases with the amount ∆ of experimentation. We have not investigated the effect of ∆ on the subsequent choice of h.

Our theoretical analysis is for a uniformly spaced assignment variable. Imbens and Wager (2019) consider how to optimally tune kernel weights in a regression discontinuity problem for a given set of data.
Owen and Varian (2020) consider numerical optimization of tie-breaker designs on given data.

Here we offer one explanation for why the empirical efficiencies on nonuniformly distributed data look so similar to the theoretical ones for uniformly distributed data. We use some results about nonparametric regression from Fan and Gijbels (1996, Table 2.1). Nonparametric regression estimates µ̂(t) typically have an asymptotic variance whose leading term is proportional to 1/f(t), where f is the probability density of the x_i. This arises because the local sample size is asymptotically proportional to f(t). Hence, when considering nonuniform distributions, the 1/f(t) factors in the leading order variance terms will cancel out when computing the efficiency ratios. Some nonparametric regression estimators, such as the Nadaraya-Watson estimator, have a lead term in their bias that depends on the derivative f′(t), and while f′(t) = 0 for uniformly distributed data it is not zero in general. Kernel weighted least squares (with symmetric K(·)) does not have a dependency on f′(t) in its bias. There is a curvature bias from µ′′(t), but that is not related to the sampling distribution of the x_i. The lead terms in bias and variance for kernel regressions do not distinguish between distributions with the same value of f(t) but different f′(t).

We close by comparing the tie-breaker setting to some other similar sounding ones. In fuzzy RDDs (Campbell, 1969) the threshold varies, perhaps randomly, because it depends on some additional variables that are not available to the data analyst. There are also settings where the assignment variable is subject to manipulation. For instance, if a passing grade is 50% there may be no candidates with recorded scores in the interval from 47% to 50%. See McCrary (2008) for more on manipulation and Rosenman and Rajkumar (2019) for a mitigation strategy.
The tie-breaker setting is special because the investigator is able to control the treatment allocation.

Acknowledgments
This work was supported by the U.S. National Science Foundation under grant IIS-1837931. We thank Hal Varian for commenting on the paper.
References
Aiken, L. S., West, S. G., Schwalm, D. E., Carroll, J. L., and Hsiung, S. (1998). Comparison of a randomized and two quasi-experimental designs in a single outcome evaluation: Efficacy of a university-level remedial writing program. Evaluation Review, 22(2):207–244.

Angrist, J., Hudson, S., and Pallais, A. (2014). Leveling up: Early results from a randomized evaluation of post-secondary aid. Technical report, National Bureau of Economic Research.

Angrist, J. D. and Lavy, V. (1999). Using Maimonides' rule to estimate the effect of class size on scholastic achievement. The Quarterly Journal of Economics, 114(2):533–575.

Angrist, J. D. and Lavy, V. (2009). Replication data for: Using Maimonides' Rule to Estimate the Effect of Class Size on Student Achievement.

Angrist, J. D., Lavy, V., Leder-Luis, J., and Shany, A. (2019). Maimonides' rule redux. American Economic Review: Insights, 1(3):309–24.

Angrist, J. D. and Pischke, J.-S. (2014). Mastering Metrics. Princeton University Press, Princeton.

Borchers, H. W. (2019). pracma: Practical Numerical Math Functions. R package version 2.2.9.

Calonico, S., Cattaneo, M. D., and Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6):2295–2326.

Campbell, D. T. (1969). Reforms as experiments. American Psychologist, 24(4):409.

Cheng, M.-Y., Fan, J., and Marron, J. S. (1997). On automatic boundary corrections. The Annals of Statistics, 25(4):1691–1708.

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66, volume 66. CRC Press, Boca Raton, FL.

Hahn, J., Todd, P., and Van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1):201–209.

Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, second edition.

Imbens, G. and Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. The Review of Economic Studies, 79:933–959.

Imbens, G. and Wager, S. (2019). Optimized regression discontinuity designs. Review of Economics and Statistics, 101(2):264–278.

Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2):698–714.

Owen, A. B. and Varian, H. (2020). Optimizing the tie-breaker regression discontinuity design. Electronic Journal of Statistics, 14(2):4004–4027.

Rosenman, E. and Rajkumar, K. (2019). Optimized partial identification bounds for regression discontinuity designs with manipulation. Technical Report arXiv:1910.02170, Stanford University.

Thistlethwaite, D. L. and Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, 51(6):309.

Trochim, W. M. K. and Cappelleri, J. C. (1992). Cutoff assignment strategies for enhancing randomized clinical trials. Controlled Clinical Trials, pages 190–212.
Appendix: Proof of Proposition 3
We want to show that the function

    Eff_TS(δ) = 2(3 − 2(1 − 3δ² + 2δ³)²)² / (5 − 5(1 − 3δ² + 2δ³)(1 − 6δ² + 8δ³ − 3δ⁴) + 2(1 − 3δ² + 2δ³)²)

has a positive derivative for 0 < δ < 1. The numerator has degree 12 and the denominator has degree 7. The customary formula for the derivative of a rational function produces a rational function with a non-negative denominator and a numerator of degree 18. We will work through a sequence of steps reducing the degree of this polynomial to show that the numerator must be positive on (0, 1), establishing the monotonicity of Eff_TS(δ), which is visually apparent.

It is convenient to work instead with x = 1 − δ. Then 1 − 3δ² + 2δ³ = 3x² − 2x³ and 1 − 6δ² + 8δ³ − 3δ⁴ = 4x³ − 3x⁴. Therefore Eff_TS(δ) = f(1 − δ) where f is the function given by

    f(x) = 2(3 − 2(3x² − 2x³)²)² / (5 − 5(3x² − 2x³)(4x³ − 3x⁴) + 2(3x² − 2x³)²)
         = 2(3 − 2x⁴(3 − 2x)²)² / (5 − 5x⁵(3 − 2x)(4 − 3x) + 2x⁴(3 − 2x)²)
         = 2(3 − 2g(x)(3 − 2x))² / (5 + g(x)(6 − 24x + 15x²))

for g(x) = x⁴(3 − 2x). Having replaced δ by x = 1 − δ, we will show that f′(x) < 0 for 0 < x < 1. The derivative f′(x) has numerator

    n_1(x) = 4(3 − 2g(x)(3 − 2x))(4g(x) − 2g′(x)(3 − 2x))(5 + g(x)(6 − 24x + 15x²))
           − 2(3 − 2g(x)(3 − 2x))²(g′(x)(6 − 24x + 15x²) + g(x)(−24 + 30x)).

Now 0 ≤ g(x)(3 − 2x) ≤ 1 for x ∈ [0, 1], and so 3 − 2g(x)(3 − 2x) > 0. As a result, the sign of n_1(x) is preserved by dividing it by 2(3 − 2g(x)(3 − 2x)), yielding

    n_2(x) = 2(4g(x) − 2g′(x)(3 − 2x))(5 + g(x)(6 − 24x + 15x²))
           − (3 − 2g(x)(3 − 2x))(g′(x)(6 − 24x + 15x²) + g(x)(−24 + 30x)).

Now since g(x) = x³[x(3 − 2x)] and g′(x) = x³(12 − 10x) and x ∈ (0, 1), we can divide n_2(x) by 6x³ to get

    n_3(x) = −8(1 − x)(3 − 2x)(5 + g(x)(6 − 24x + 15x²)) − (3 − 2g(x)(3 − 2x))(−35x³ + 93x² − 70x + 12)
           = −8(1 − x)(3 − 2x)(5 + g(x)(6 − 24x + 15x²)) − (3 − 2g(x)(3 − 2x))(1 − x)(35x² − 58x + 12).

We can divide n_3(x) by −(1 − x), getting a polynomial n_4(x) with the opposite sign from n_3. This yields

    n_4(x) = 8(3 − 2x)(5 + g(x)(6 − 24x + 15x²)) + (3 − 2g(x)(3 − 2x))(35x² − 58x + 12)
           = 8(3 − 2x)(−30x⁷ + 93x⁶ − 84x⁵ + 18x⁴ + 5) + (−8x⁶ + 24x⁵ − 18x⁴ + 3)(35x² − 58x + 12)
           = 8(60x⁸ − 276x⁷ + 447x⁶ − 288x⁵ + 54x⁴ − 10x + 15)
             + (−280x⁸ + 1304x⁷ − 2118x⁶ + 1332x⁵ − 216x⁴ + 105x² − 174x + 36)
           = 200x⁸ − 904x⁷ + 1458x⁶ − 972x⁵ + 216x⁴ + 0x³ + 105x² − 254x + 156.

Note that the coefficient of x³ in n_4 is a zero that must not be left out when entering the coefficients into symbolic differentiation codes.

We want to show that n_4(x) > 0 for x ∈ (0, 1), which then makes n_1(x) < 0 and hence f′(x) < 0, so that Eff′_TS(δ) > 0 for δ = 1 − x ∈ (0, 1). The first step is to show that n_4′′(x) > 0 for x ∈ [0, 1], where

    n_4′′(x) = 11200x⁶ − 37968x⁵ + 43740x⁴ − 19440x³ + 2592x² + 210.

Graphing n_4′′(x) versus x and evaluating it numerically makes it clear that 148 < n_4′′(x) on [0, 1], but below we verify the positivity of n_4′′ rigorously. Some readers might prefer to skip that and go instead to the subsection marked "Conclusion of the proof".

Positivity of n_4′′

We begin by writing

    n_4′′′(x) = 48x(1400x⁴ − 3955x³ + 3645x² − 1215x + 108),

so that

    |n_4′′′(x)| ≤ 48(1400 + 3955 + 3645 + 1215 + 108) = 495,504 ≤ 2¹⁹     (18)

for all x ∈ [0, 1]. For k ∈ {1, 2, . . . , 2²⁰ + 1} define x_k = (k − 1)2⁻²⁰. For each k we let n̂_4′′(x_k) be the numerical evaluation of the polynomial n_4′′(x_k) computed using Horner's method with double precision in R, implemented with the 'horner' function in the pracma R package of Borchers (2019).

By formula (5.3) in Higham (2002), the absolute error in Horner's method for the polynomial Σ_{r=0}^{n} a_r x^r is at most γ_{2n} p̃(|x|), where γ_n ≡ nu/(1 − nu), u is the unit roundoff, and p̃(|x|) = Σ_{r=0}^{n} |a_r| |x|^r.

Applying that bound to our 6th degree polynomial n_4′′(x), and noting that for each x ∈ [0, 1], p̃(|x|) will be at most the sum of the absolute values of the coefficients, we find that

    max_{1≤k≤2²⁰+1} |n_4′′(x_k) − n̂_4′′(x_k)| ≤ γ₁₂ × p̃(|x_k|) ≤ (12u/(1 − 12u)) × 115,150.     (19)

We need not worry about floating point error induced by evaluating x_k, because each x_k is a floating point number. For double precision in R, the unit roundoff is u = 2⁻⁵³, from which 12u/(1 − 12u) ≤ 10⁻¹⁴, and so the maximum error in (19) is at most 10⁻⁸. The smallest value of n̂_4′′(x_k) among all 2²⁰ + 1 evaluation points x_k exceeded 148, and so min_k n_4′′(x_k) ≥ 148 − 10⁻⁸.

Now for any x ∈ [0, 1] there is some k* ∈ {1, 2, . . . , 2²⁰ + 1} with |x − x_{k*}| ≤ 2⁻²¹. By (18) we know that n_4′′ is Lipschitz continuous on [0, 1] with Lipschitz constant 2¹⁹. Therefore

    n_4′′(x) ≥ n_4′′(x_{k*}) − |n_4′′(x_{k*}) − n_4′′(x)| ≥ 148 − 10⁻⁸ − 2¹⁹ × 2⁻²¹ > 0

holds for all x ∈ [0, 1].

Conclusion of the proof

We have shown above that n_4′′(x) > 0 for x ∈ [0, 1]. Since n_4′(1) = −20 < 0 and n_4′′(x) > 0 on [0, 1], we have n_4′(x) < 0 for all x ∈ [0, 1]. Then since n_4(1) = 5 > 0 and n_4′(x) < 0 on [0, 1], we have n_4(x) ≥ 5 > 0 for all x ∈ [0, 1]. Tracing back through the sign-preserving divisions, n_4 > 0 on (0, 1) gives n_3 < 0, then n_2 < 0 and n_1 < 0 there, so f′(x) < 0 and hence Eff′_TS(δ) > 0 for 0 < δ < 1. ∎
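The grid evaluation above is easy to reproduce. The sketch below (ours, in Python rather than the pracma R package used in the paper, and on a coarser 2¹⁴ + 1 point grid to keep it fast) applies Horner's method to n_4′′ and checks that it stays above 148.

```python
import numpy as np

def horner(coeffs, x):
    """Evaluate a polynomial; coefficients are listed from highest to lowest degree."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc

# Coefficients of n4''(x), highest degree first; note the zero x^1 coefficient.
coeffs = [11200.0, -37968.0, 43740.0, -19440.0, 2592.0, 0.0, 210.0]

# Equispaced grid in [0, 1]; the paper's verification uses 2^20 + 1 points.
grid = np.linspace(0.0, 1.0, 2 ** 14 + 1)
vals = np.array([horner(coeffs, x) for x in grid])
print(vals.min() > 148)  # the grid minimum stays above 148, as in the proof
```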