A Nonparametric Bayesian Methodology for Regression Discontinuity Designs
Zach Branson∗
Department of Statistics, Harvard University

Maxime Rischard
Department of Statistics, Harvard University

Luke Bornn
Department of Statistics and Actuarial Science, Simon Fraser University

Luke Miratrix
Graduate School of Education, Harvard University

October 2, 2018
Abstract
One of the most popular methodologies for estimating the average treatment effect at the threshold in a regression discontinuity design is local linear regression (LLR), which places larger weight on units closer to the threshold. We propose a Gaussian process regression methodology that acts as a Bayesian analog to LLR for regression discontinuity designs. Our methodology provides a flexible fit for treatment and control responses by placing a general prior on the mean response functions. Furthermore, unlike LLR, our methodology can incorporate uncertainty in how units are weighted when estimating the treatment effect. We prove our method is consistent in estimating the average treatment effect at the threshold. Furthermore, we find via simulation that our method exhibits promising coverage, interval length, and mean squared error properties compared to standard LLR and state-of-the-art LLR methodologies. Finally, we explore the performance of our method on a real-world example by studying the impact of being a first-round draft pick on the performance and playing time of basketball players in the National Basketball Association.
Keywords:
Regression discontinuity designs, Gaussian process regression, Bayesian nonparametrics, coverage, posterior consistency

∗This research was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1144152, by the National Science Foundation under Grant No. 1461435, by DARPA under Grant No. FA8750-14-2-0117, by ARO under Grant No. W911NF-15-1-0172, and by NSERC. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, DARPA, ARO, or NSERC.

1 Introduction
Recently there has been a renewed interest in regression discontinuity designs (RDDs), which originated with Thistlethwaite & Campbell (1960). In an RDD, the treatment assignment is discontinuous at a certain covariate value, or "threshold," such that only units whose covariate is above the threshold will receive treatment. There are many examples of RDDs in the applied econometrics literature: the United States providing additional funding to only the 300 poorest counties for the Head Start education program (Ludwig & Miller, 2007); schools mandating students to attend summer school if their exam scores are below a threshold (Matsudaira, 2008); colleges offering financial aid to students whose academic ability is above a cutoff (Van der Klaauw, 2002); and Medicare increasing insurance coverage after age 65 (Card, Dobkin, & Maestas, 2004). The main goal of an RDD is to estimate a treatment effect while addressing likely confounding by the covariate that determines treatment assignment.

One of the most popular methodologies for estimating the average treatment effect at the threshold in an RDD is local linear regression (LLR), which places larger weight on units closer to the threshold. Implementation of LLR is straightforward and there is a wide literature on its theoretical properties. However, recent works have found that LLR can exhibit poor inferential properties—such as confidence intervals that tend to undercover—which has motivated a strand of literature started by Calonico et al. (2014) that modifies LLR to improve coverage and other inferential properties.

Adding to this literature, we propose a nonparametric regression approach that acts as a Bayesian analog to LLR for sharp regression discontinuity designs. Our approach utilizes Gaussian process regression (GPR) to provide a flexible fit for treatment and control responses by placing a general prior on the mean response functions.
While GPR has been widely used in the machine learning and statistics literature, it has not previously been proposed for estimating treatment effects in RDDs. Thus, our main contribution is outlining how to use Gaussian processes to make causal inferences in RDDs and assessing how such a methodology compares to current LLR methodologies.

In the remainder of this section, we review RDDs and LLR methodologies for estimating the average treatment effect at the threshold. In Section 2, we outline GPR for sharp RDDs and note various analogies to LLR, which builds intuition for implementing our method. In Section 3, we establish that our method is consistent in estimating the average treatment effect at the boundary. In Section 4, we show via simulation that our method exhibits promising coverage, interval length, and mean squared error properties compared to standard LLR and state-of-the-art LLR methodologies. In Section 5, we use GPR on data from the National Basketball Association (NBA) to estimate the effect of being a first-round versus a second-round pick on basketball player performance and playing time, and we find that GPR detects treatment effects that are more in line with previous results in the sports literature than do LLR methodologies. In Section 6, we conclude by discussing extensions to our methodology to tackle problems beyond sharp RDDs.
We follow the notation of Imbens & Lemieux (2008) and discuss RDDs within the potential outcomes framework: For each unit i, there are two potential outcomes, Y_i(1) and Y_i(0), corresponding to treatment and control, respectively, and a covariate X_i. Only one of these two potential outcomes is observed, but X_i is always observed. Let W_i denote the assignment for unit i, where W_i = 1 if unit i is assigned to treatment and W_i = 0 if unit i is assigned to control. The observed outcome for unit i is then

y_i = W_i Y_i(1) + (1 − W_i) Y_i(0).    (1)

Often, one wants to estimate the average treatment effect E[Y_i(1) − Y_i(0)], but usually this treatment effect is confounded by X (and possibly other unobserved covariates), such that examining the difference in mean response between treatment and control is not appropriate. In these cases, methods such as stratification, matching, and regression are often employed to address covariate confounding when estimating the average treatment effect. However, such methods are only appropriate when there is sufficient overlap in the treatment and control covariate distributions, i.e.,

0 < P(W_i = 1 | X_i = x) < 1,    (2)

where the relationship between W_i and X_i is estimated and then accounted for during the analysis. In an RDD, the relationship between the treatment assignment W_i and the covariates is known. More specifically, it is known that the function P(W = 1 | X = x) is discontinuous at some threshold or boundary X = b. In this paper we focus on a special case, sharp RDDs, where the treatment assignment for a unit i is

W_i = 1 if X_i ≥ b, and W_i = 0 if X_i < b.    (3)

The covariate X that determines treatment assignment in an RDD is often called the "running variable" in order to not confuse it with other background covariates that do not necessarily determine treatment assignment.
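To make the assignment rule (3) and the observed-outcome equation (1) concrete, the following Python sketch simulates a sharp RDD dataset; the threshold, sample size, and potential-outcome model are hypothetical choices for illustration, not part of the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 200, 0.0                 # sample size and threshold (hypothetical values)

x = rng.uniform(-1, 1, n)                      # running variable X_i
y0 = 1.0 + 0.5 * x + rng.normal(0, 0.1, n)     # control potential outcomes Y_i(0)
y1 = y0 + 0.3                                  # treatment potential outcomes Y_i(1)

w = (x >= b).astype(int)        # sharp assignment rule (3): W_i = 1{X_i >= b}
y = w * y1 + (1 - w) * y0       # observed outcomes, equation (1)
```

Only one of the two potential outcomes is recorded in y for each unit, while x is always observed.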
The more general case, where P(W = 1 | X = x) is discontinuous at X = b but is not necessarily 0 or 1, is known as a fuzzy RDD.

Rubin (1977) states that when treatment assignment depends on one covariate, estimating E[Y_i(1) | X_i] and E[Y_i(0) | X_i] is essential for estimating the average treatment effect. Furthermore, treatment effect estimates are particularly sensitive to model specification of E[Y_i(1) | X_i] and E[Y_i(0) | X_i] when there is not substantial covariate overlap, as in a sharp RDD. Aware of this sensitivity, RDD analyses typically do not attempt to estimate the average treatment effect. Instead, they focus on the estimand that requires the least amount of extrapolation to overcome this lack of covariate overlap: the average treatment effect at the boundary b. Defining the conditional expectations for treatment and control as

µ_T(x) ≡ E[Y_i(1) | X_i = x],   µ_C(x) ≡ E[Y_i(0) | X_i = x],    (4)

the treatment effect at the boundary b is (Imbens & Lemieux, 2008)

τ = lim_{x↓b} E[y_i | X_i = x] − lim_{x↑b} E[y_i | X_i = x] = µ_T(b) − µ_C(b).    (5)

The notation µ_T(x) and µ_C(x) emphasizes that the goal of an RDD requires estimating two unknown mean functions. Hahn et al. (2001) showed that sufficient conditions for τ to be identifiable in a sharp RDD are that µ_C(x) and µ_T(x) − µ_C(x) are continuous at b. They further state that "we can use any nonparametric estimator to estimate" µ_T(x) and µ_C(x), and recommended local linear regression (hereafter called LLR), which is currently the most popular methodology for estimating the average treatment effect at the boundary in an RDD.

The goal of an RDD is to estimate τ defined in (5), i.e., to estimate µ_T(b) and µ_C(b).
LLR estimates µ_T(b) as

µ̂_T(b) = X_b (X_T^T W_T X_T)^{-1} X_T^T W_T Y_T,    (6)

where X_b ≡ (1, b), X_T is the n_T × 2 design matrix whose columns are a vector of ones and the running variable X for treated units, Y_T is the n_T-dimensional column vector of treated units' responses, and W_T is an n_T × n_T diagonal weight matrix whose entries are

(W_T)_ii ≡ K((x_i − b)/h),   i = 1, …, n_T,    (7)

for some kernel K(·) and bandwidth h. The estimator µ̂_C(b) is analogously defined for the control. Then, the estimator for the treatment effect is τ̂ = µ̂_T(b) − µ̂_C(b). Here, µ̂_T(b) and µ̂_C(b) are weighted least squares estimators, where the weights depend on units' closeness to the boundary b. To perform local polynomial regression, one appends higher orders of X to the design matrices X_T and X_C.

The RDD literature has focused on LLR largely because of its boundary bias properties. For example, Hahn et al. (2001) recommend LLR over alternatives like kernel regression because Fan (1992) showed that LLR exhibits better bias properties for boundary points than kernel regression. For more details on the bias comparison between kernel regression and LLR, see Imbens & Lemieux (2008, pp. 624–625) as well as Fan & Gijbels (1992) and Porter (2003).

Furthermore, LLR's implementation is straightforward once a kernel K(·) and bandwidth h are chosen. The most common choice of K(·) is the rectangular or triangular kernel; Imbens & Lemieux (2008) argue that more complicated kernels rarely make a difference in estimation. Much more attention has been given to the bandwidth choice h, largely because the bias is characterized by h.
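A minimal Python sketch of the weighted least squares estimator (6) with kernel weights (7); the triangular kernel and the bandwidth used below are illustrative choices, and the function is a simplified stand-in for a full LLR implementation:

```python
import numpy as np

def llr_at_boundary(x, y, b, h):
    """Local linear estimate of the mean response at the threshold b,
    i.e. equation (6), using triangular-kernel weights as in (7)."""
    k = np.maximum(1.0 - np.abs((x - b) / h), 0.0)        # K((x_i - b)/h)
    keep = k > 0                                          # units with nonzero weight
    X = np.column_stack([np.ones(keep.sum()), x[keep]])   # design matrix [1, x]
    W = np.diag(k[keep])                                  # diagonal weight matrix
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[keep])
    return np.array([1.0, b]) @ beta                      # X_b beta, with X_b = (1, b)
```

The treatment effect estimate τ̂ is then the difference of two such fits, one using the treated units and one using the control units.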
In the 2000s, choosing an appropriate h for LLR in RDDs was an open problem: For example, Ludwig & Miller (2007) stated that "there is currently no widely-agreed-upon method for selection of optimal bandwidths...so our strategy is to present results for a broad range of candidate bandwidths." One widely-used bandwidth selection method is that of Imbens & Kalyanaraman (2012), who derived a data-driven, MSE-optimal bandwidth for LLR estimators. This provided practitioners with clear guidelines for implementing LLR for RDDs, which made its use very popular.

The bandwidth is arguably the most important choice to be made in the LLR methodology for RDDs, because the treatment effect is often sensitive to the bandwidth choice. This motivates sensitivity checks such as that in Ludwig & Miller (2007), where the treatment effect is estimated several times with different bandwidths to ensure that estimates do not vary too greatly. Some have noted that confidence intervals from LLR have a tendency to undercover when a single bandwidth is chosen for inference and the treatment effect is sensitive to the bandwidth choice (Armstrong & Kolesár, 2017; Gelman & Imbens, 2018). A recent extension to the LLR methodology—that of Calonico et al. (2014), hereafter called "robust LLR"—was one of the first methods to address the undercoverage issue of LLR by incorporating a bias correction and inflated confidence intervals corresponding to the uncertainty in estimating the bias correction. Because of its promising inference properties, Calonico et al. (2014) has arguably become the state-of-the-art for conducting inference for the average treatment effect at the boundary in an RDD.

Other methodologies besides LLR have been proposed for estimating the average treatment effect at the boundary in a sharp RDD.
For example, many practitioners have used high-order global polynomials to estimate µ_T(x) and µ_C(x): Matsudaira (2008) argued for a global third-order polynomial regression, and also considered fourth- and fifth-order polynomials as a sensitivity check; similarly, Van der Klaauw (2002) used a global third-order polynomial and noted that LLR could have been an alternative; finally, Card et al. (2004) argued for using a global third-order polynomial regression instead of LLR because the running variable, age, was discrete. However, in recent years many have argued against the use of high-order polynomials in RDDs because of their tendency to yield point estimates and confidence intervals that are highly sensitive to the order of the polynomial (Calonico et al., 2015; Gelman & Imbens, 2018).

Others have focused on local randomization methodologies, where units within a window around the boundary are viewed as-if randomized to treatment and control. For example, Cattaneo et al. (2015) recommend a series of covariate balance tests to decide the window around the boundary such that the as-if randomized assumption is most plausible. Li et al. (2015) extended these ideas to develop a notion of a local overlap assumption and used a Bayesian hierarchical modeling approach for deciding the window around the boundary where this assumption is most plausible. Cattaneo et al. (2017) compared the local randomization approach to local polynomial estimators, and they extended the local randomization approach to incorporate adjustments via parametric models as well.

Finally, others have developed Bayesian methodologies for RDDs. Li et al. (2015) propose a principal stratification approach that provides alternative identification assumptions based on a formal definition of local randomization. Geneletti et al. (2015) propose a Bayesian methodology that incorporates prior information in the treatment effect. Chib & Greenberg (2014) use Bayesian splines to estimate treatment effects in RDDs.
Chib & Jacobi (2015) propose a Bayesian methodology specific to fuzzy RDDs.

We propose a methodology that utilizes Gaussian process regression (GPR), which is one of the most popular nonparametric methodologies in the machine learning and Bayesian modeling literature for estimating unknown functions (Rasmussen & Williams, 2006). The notion of using GPR for RDDs is very much in line with the claim in Hahn et al. (2001) that any nonparametric estimator can be used to estimate the treatment and control response in an RDD. However, to our knowledge, GPR has not been previously proposed for RDDs. Similar to LLR, our GPR methodology provides a flexible fit to the mean functions µ_T(x) and µ_C(x). Furthermore, our methodology can incorporate both prior knowledge and uncertainty in various parameters in the RDD problem—such as how units are weighted near the boundary—which is not necessarily as straightforward with current LLR methodologies. Finally, our GPR methodology can be used in conjunction with a local randomization perspective. Our methodology adds to the strand of literature started by Calonico et al. (2014) that addresses the undercoverage of standard LLR, as well as the strand of literature on Bayesian methodologies for RDDs.

First we review notation for GPR and how GPR is used to estimate a single unknown function. We then discuss GPR models that estimate the two unknown mean functions in sharp RDDs.
Define a dataset {x_i, y_i}_{i=1}^n of responses y = (y_1, …, y_n) that vary around some unknown function of the covariates x = (x_1, …, x_n):

y = f(x) + ε,    (8)

where ε ∼ N_n(0, σ_y² I_n) and f(x) ≡ (f(x_1), …, f(x_n)). If the goal is to well-estimate E[f(x*)] for a particular covariate value x*, one option is to specify a functional form for f(x) and then predict E[f(x*)] from this specified model, such as local linear regression, as discussed in Section 1. Instead of specifying a functional form for f(x), we consider nonparametrically inferring f(x) by placing a prior on f(x). A Gaussian process is one such prior:

f(x) ∼ GP(m(x), K(x, x′)),    (9)

for some unknown mean function m(x) and covariance function K(x, x′). The notation f(x) ∼ GP(m(x), K(x, x′)) denotes a Gaussian process prior on the unknown function f(x), which states that, for any (x_1, …, x_n), the joint distribution (f(x_1), …, f(x_n)) is an n-dimensional multivariate normal distribution with mean vector (m(x_1), …, m(x_n)) and covariance matrix K(x, x′) whose (i, j) entries are K(x_i, x_j).

There are many choices one could make for the mean and covariance functions. A common choice for the mean function is m(x) = 0; a common choice for the covariance function is the squared exponential covariance function, whose entries are

K(x_i, x_j) ≡ σ_GP² exp(−(x_i − x_j)² / (2ℓ²)).    (10)

Placing a Gaussian process prior with a squared exponential covariance function on f(x) assumes that f(x) is infinitely differentiable, which is similar to other assumptions in the RDD literature (e.g., Assumption 3.3 of Imbens & Kalyanaraman 2012 and Assumption 1 of Calonico et al. 2014). The covariance parameters σ_GP² and ℓ are called the variance and lengthscale, respectively.
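The squared exponential covariance (10) is straightforward to compute; this sketch illustrates how the lengthscale controls the correlation between nearby inputs (the parameter values are arbitrary):

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma_gp, ell):
    """Squared exponential covariance (10):
    K(x_i, x_j) = sigma_gp^2 * exp(-(x_i - x_j)^2 / (2 * ell^2))."""
    d = np.asarray(x1)[:, None] - np.asarray(x2)[None, :]
    return sigma_gp**2 * np.exp(-d**2 / (2.0 * ell**2))

x = np.array([0.0, 0.1, 1.0])
K_long = sq_exp_kernel(x, x, sigma_gp=1.0, ell=1.0)    # long lengthscale
K_short = sq_exp_kernel(x, x, sigma_gp=1.0, ell=0.1)   # short lengthscale
# under the shorter lengthscale, correlation between nearby inputs decays faster,
# corresponding to a function f(x) that changes more rapidly
```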
The variance determines the amplitude of f(x), i.e., how much the function varies from its mean. The lengthscale determines the smoothness of the function: Small lengthscales correspond to f(x) changing rapidly. Most importantly, the covariance function assumes that the response at a particular covariate value f(x*) will be similar to the response at covariate values close to x*.

Given the Gaussian process prior with mean function m(x) and covariance function K(x, x′), as well as their parameters, the posterior for f(x*) at any particular covariate value x* can be obtained via standard conditional multivariate normal theory (for an exposition, see Rasmussen & Williams 2006, pages 16–17):

f(x*) | x, y ∼ N(µ*, σ_GP² − Σ*),  where    (11)

µ* ≡ m(x*) + K(x*, x)[K(x, x) + σ_y² I_n]^{-1}(y − m(x)),
Σ* ≡ K(x*, x)[K(x, x) + σ_y² I_n]^{-1} K(x, x*).    (12)

The above posterior can thus be used to obtain point estimates and credible intervals for the value of an unknown function at a particular covariate value x*. In practice, the covariance parameters are estimated from the data, such as through maximum likelihood or cross-validation (Rasmussen & Williams 2006, Chapter 5). In Section 2.2 we assume that the covariance parameters are fixed, and in Section 2.3 we extend to a full Bayesian approach that places priors on ℓ, σ_GP², and σ_y².

The notion of using GPR to estimate the average treatment effect at the boundary in an RDD suggests a class of models that has not previously been considered in the RDD literature. We focus on two GPR models, which correspond to different assumptions placed on the unknown response functions µ_T(x) and µ_C(x). For each model we show the resulting posterior for the average treatment effect at the boundary and compare it to its analogous LLR model.
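Before turning to those two models, note that the single-function posterior (11)-(12) can be computed in a few lines; this sketch assumes a zero mean function m(x) = 0 and fixed, hypothetical covariance parameter values:

```python
import numpy as np

def gp_posterior_at(x, y, x_star, sigma_gp, ell, sigma_y):
    """Posterior mean and variance of f(x*) from (11)-(12), assuming a
    zero mean function and the squared exponential covariance (10)."""
    def K(a, c):
        d = a[:, None] - c[None, :]
        return sigma_gp**2 * np.exp(-d**2 / (2.0 * ell**2))
    A = K(x, x) + sigma_y**2 * np.eye(len(x))       # K(x, x) + sigma_y^2 I_n
    k_star = K(np.array([x_star]), x)               # K(x*, x), a 1 x n row
    mu_star = (k_star @ np.linalg.solve(A, y)).item()            # mu* in (12)
    Sigma_star = (k_star @ np.linalg.solve(A, k_star.T)).item()  # Sigma* in (12)
    return mu_star, sigma_gp**2 - Sigma_star        # posterior variance from (11)
```

The returned mean and variance give point estimates and credible intervals for f(x*), as described above.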
In Section 2.3 we discuss how—unlike LLR methodologies—the uncertainty in how units are weighted can be incorporated into these GPR models.

For both GPR models, we assume that the treatment response Y_i(1) and the control response Y_i(0) have the following relationship with the running variable x_i:

Y_i(1) = µ_T(x_i) + ε_iT, and Y_i(0) = µ_C(x_i) + ε_iC, where µ_T(x_i) ⊥⊥ µ_C(x_i), ε_iT ∼iid N(0, σ_y²), and ε_iC ∼iid N(0, σ_y²).    (13)

Thus, intuitively, the procedure outlined in Section 2.1 can simply be performed twice—once for µ_T(x) and once for µ_C(x). However, there are assumptions on µ_T(x) and µ_C(x) that, if true, can simplify our GPR models and make inference more precise.

In particular, assumptions can be placed on the covariance structure of µ_T(x) and µ_C(x). The two models we present correspond to two different sets of assumptions—the first assumes that the covariance structures of µ_T(x) and µ_C(x) are the same, while the second allows them to be different. In both models, we assume that µ_T(x) and µ_C(x) are stationary processes, i.e., the covariance parameters of µ_T(x) and µ_C(x) do not vary with the running variable X. We discuss cases when this stationarity assumption is inappropriate in Sections 4 and 6.

Same Covariance Assumption: Cov(µ_T(x)) = Cov(µ_C(x)), and µ_T(x) and µ_C(x) are stationary processes.

If the Same Covariance Assumption holds, a natural LLR procedure is to fit local linear regressions on both sides of the boundary with the same bandwidth but different intercepts and slopes. This is largely the standard practice in the RDD literature (Imbens & Lemieux, 2008).
Analogously, we place Gaussian process priors on µ_T(x) and µ_C(x) for given mean functions m_T(x) and m_C(x) and covariance function K(x, x′):

µ_T(x) ∼ GP(m_T(x), K(x, x′)),
µ_C(x) ∼ GP(m_C(x), K(x, x′)).    (14)

Then, estimates µ̂_T(b) and µ̂_C(b) are obtained, which results in a treatment effect estimate τ̂ = µ̂_T(b) − µ̂_C(b). We now outline how such estimates µ̂_T(b) and µ̂_C(b) are obtained. Using standard conditional multivariate normal theory as in Section 2.1, we have the following posteriors for µ_T(b) and µ_C(b):

µ_T(b) | x, y ∼ N(µ_{b|T}, σ_GP² − Σ_{b|T}),
µ_C(b) | x, y ∼ N(µ_{b|C}, σ_GP² − Σ_{b|C}),  where    (15)

µ_{b|T} ≡ m_T(b) + K(b, x_T)[K(x_T, x_T) + σ_y² I]^{-1}(y_T − m_T(x_T)),
µ_{b|C} ≡ m_C(b) + K(b, x_C)[K(x_C, x_C) + σ_y² I]^{-1}(y_C − m_C(x_C)),
Σ_{b|T} ≡ K(b, x_T)[K(x_T, x_T) + σ_y² I]^{-1} K(x_T, b),
Σ_{b|C} ≡ K(b, x_C)[K(x_C, x_C) + σ_y² I]^{-1} K(x_C, b).    (16)

Here, µ_{b|T} and µ_{b|C} denote the posterior means for µ_T(b) and µ_C(b), respectively, which are in the definition of the treatment effect τ defined in (5). Note that µ_{b|T} and µ_{b|C} are weighted averages of the observed response, where the weights K(b, x)[K(x, x) + σ_y² I]^{-1} depend on the covariance parameters ℓ, σ_GP², and σ_y², as well as x. For more discussion on the behavior of the weights K(b, x)[K(x, x) + σ_y² I]^{-1}, see Rasmussen & Williams (2006, Section 2.6).

The posterior for the treatment effect under the Same Covariance Assumption is then

τ ≡ µ_T(b) − µ_C(b) | x, y ∼ N(µ_{b|T} − µ_{b|C}, 2σ_GP² − Σ_{b|T} − Σ_{b|C}),    (17)

where we have also used the independence of µ_T(x) and µ_C(x) stated in (13).
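Under the Same Covariance Assumption with fixed covariance parameters, the posterior (17) for τ follows from two independent GP computations; a sketch assuming zero mean functions and hypothetical parameter values:

```python
import numpy as np

def gp_mean_and_sigma_at_b(x, y, b, sigma_gp, ell, sigma_y):
    """mu_{b|.} and Sigma_{b|.} from (16), with a zero mean function and
    the squared exponential covariance (10)."""
    K = lambda a, c: sigma_gp**2 * np.exp(-(a[:, None] - c[None, :])**2
                                          / (2.0 * ell**2))
    A = K(x, x) + sigma_y**2 * np.eye(len(x))
    k_b = K(np.array([b]), x)
    mu_b = (k_b @ np.linalg.solve(A, y)).item()
    Sigma_b = (k_b @ np.linalg.solve(A, k_b.T)).item()
    return mu_b, Sigma_b

def tau_posterior(x_t, y_t, x_c, y_c, b, sigma_gp, ell, sigma_y):
    """Posterior mean and variance of tau = mu_T(b) - mu_C(b), equation (17)."""
    mu_t, S_t = gp_mean_and_sigma_at_b(x_t, y_t, b, sigma_gp, ell, sigma_y)
    mu_c, S_c = gp_mean_and_sigma_at_b(x_c, y_c, b, sigma_gp, ell, sigma_y)
    return mu_t - mu_c, 2.0 * sigma_gp**2 - S_t - S_c
```

Replacing the shared kernel with separate kernels for treatment and control gives the Stationary Assumption variant described next.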
If the Same Covariance Assumption does not hold, one can still assume that the mean treatment and control response processes are stationary, but allow both the mean and covariance to vary on either side of the boundary.

Stationary Assumption: µ_T(x) and µ_C(x) are stationary processes.

The posterior in this case would be identical to (17), except the means µ_{b|T} and µ_{b|C} and covariances Σ_{b|T} and Σ_{b|C} are instead defined as

µ_{b|T} ≡ m_T(b) + K_T(b, x_T)[K_T(x_T, x_T) + σ_y² I]^{-1}(y_T − m_T(x_T)),
µ_{b|C} ≡ m_C(b) + K_C(b, x_C)[K_C(x_C, x_C) + σ_y² I]^{-1}(y_C − m_C(x_C)),
Σ_{b|T} ≡ K_T(b, x_T)[K_T(x_T, x_T) + σ_y² I]^{-1} K_T(x_T, b),
Σ_{b|C} ≡ K_C(b, x_C)[K_C(x_C, x_C) + σ_y² I]^{-1} K_C(x_C, b),    (18)

i.e., the shared covariance K(·, ·) is replaced with K_T(·, ·) for units receiving treatment and K_C(·, ·) for units receiving control. The analogous LLR methodology would be to allow different intercepts, slopes, and bandwidths on either side of the boundary. However, using different bandwidths on either side of the boundary is rarely done in practice. For example, Imbens & Lemieux (2008) argue that if the curvature of µ_T(x) and µ_C(x) is the same, then the large-sample optimal bandwidths should be the same; and, furthermore, there is additional variance in estimating two optimal bandwidths rather than one, due to the smaller sample used to estimate each bandwidth. Thus, a benefit of the Same Covariance Assumption is that it allows researchers to use the entire dataset to estimate one set of covariance parameters, instead of estimating two separate sets of covariance parameters for treatment and control.
However, when fitting our GPR model, we do not recommend sharing information between µ_T(x) and µ_C(x) beyond estimating their covariance structure—this follows the general practice in the RDD literature to fit separate regression functions (that may nonetheless share the same bandwidth) for the treatment and control groups.

The above posteriors for these two GPR models assume fixed mean and covariance parameters. In practice, maximum likelihood or cross-validation can be used for estimating these parameters. In Section 2.3, we extend the above GPR models to a full-Bayesian approach that incorporates uncertainty in the mean and covariance parameters.

The GPR models in Section 2.2 assume that µ_T(b) and µ_C(b) will be similar to µ_T(x) and µ_C(x), respectively, for x near b. The extent of this similarity is determined by the covariance function K(x, x′) and its parameters. In particular, recall that in Section 2.2 we showed that the posterior mean of the average treatment effect for GPR is characterized by a difference of two weighted averages, where the weights are of the form K(b, x)[K(x, x) + σ_y² I]^{-1}. Thus, incorporating uncertainty in the covariance parameters in turn incorporates uncertainty in how units are weighted when estimating the average treatment effect.

Denote the mean function parameters by θ_m and the covariance function parameters by θ_K. For example, consider the mean functions

m_T(x) = h(x)^T β_T,   m_C(x) = h(x)^T β_C,    (19)

where h(x) = (1, x, …, x^{p−1}), and β_T and β_C are the corresponding p-dimensional column vectors. In this case, θ_m = (β_T, β_C). For the squared exponential covariance function defined in (10), θ_K = (ℓ, σ_GP², σ_y²).

In order to incorporate uncertainty in θ_m and θ_K, one can first obtain draws 1, …, D from the joint posterior of (θ_m, θ_K), rather than obtaining point estimates θ̂_m and θ̂_K. Then, for each draw (θ_m, θ_K)_1, …
, (θ_m, θ_K)_D, one draws from the posterior for τ, defined in (17).

Section 2.2 already defines the likelihood for the GPR models, so all that remains is to specify priors for (θ_m, θ_K) in order to obtain draws from the joint posterior of (θ_m, θ_K). Priors for (θ_m, θ_K) will depend on the choice of mean and covariance functions. For example, for the mean functions defined in (19), we recommend N(0, B) as a prior for β_T and β_C, where B is a p × p diagonal matrix with reasonably large entries (Rasmussen & Williams 2006, pages 28–29). For the squared exponential covariance function given by (10), we recommend half-Cauchy priors for the covariance parameters ℓ and σ_GP and noise σ_y, following advice from Gelman (2006) about stable priors for variance parameters.

Now we prove that our GPR methodology is consistent in estimating the average treatment effect at the boundary. First we establish consistency when the covariance parameters are fixed, and then we consider the case where priors are placed on the covariance parameters.

Gaussian processes are known to exhibit posterior consistency under minimal assumptions. Ghosal & Roy (2006) proved posterior consistency of binary GPR for fixed covariance parameters, and Choi & Schervish (2007) proved posterior consistency of GPR when the response is continuous. More generally, van der Vaart & van Zanten (2008) studied the contraction rate for Gaussian process priors for density estimation and regression problems, and van der Vaart & van Zanten (2009) extended these results to when a prior is placed on the lengthscale of a Gaussian process.

Here we evaluate our Bayesian methodology from a frequentist point of view, which assumes a fixed treatment effect at the boundary.
The GPR models in Section 2.2 estimate the treatment effect as the difference between two Gaussian process regressions; thus, our posterior of the treatment effect is consistent if the separate GPRs on either side of the discontinuity are consistent. First we establish posterior consistency assuming the covariance parameters are fixed, as in Section 2.2, and then we extend these results to when a prior is placed on the covariance parameters, as in Section 2.3.

We prove posterior consistency assuming the Stationary Assumption in Section 2.2, but the results also hold for the Same Covariance Assumption. Furthermore, we assume the mean functions m_T(x) = m_C(x) = 0.

Theorem 1: Assume that the Stationary Assumption holds, the covariance functions K_T(x, x′) and K_C(x, x′) are fixed, and Assumptions A1, A2, and A3 given in the Appendix hold. Denote the true average treatment effect at the boundary as τ* = µ*_T(b) − µ*_C(b), where µ*_T(x) and µ*_C(x) are the true mean treatment and control response functions in the model (13). Let Π(τ | x_1, …, x_n) denote the posterior distribution of the average treatment effect at the boundary, defined in (17). Then, this posterior is consistent, in the sense that

Π(τ : h(τ, τ*) ≥ M ε_n | x_1, …, x_n) → 0 in P_{τ*}-probability    (20)

for sufficiently large M, where h is the Hellinger distance, and ε_n is the rate at which the posterior of τ contracts to the true τ*.

The proof of Theorem 1, as well as a discussion about the nature of the contraction rate, is given in the Appendix. Theorem 2 establishes posterior consistency when a prior is placed on the lengthscale parameter ℓ, instead of it being held fixed (see Section 2.3). A discussion about posterior consistency when an additional prior is placed on σ_GP² is in the Appendix.
Theorem 2: Assume that the Stationary Assumption holds, the σ_GP² parameters in K_T(x, x′) and K_C(x, x′) are fixed, and Assumptions A1, A2, A3, and A4 given in the Appendix hold. Then, Theorem 1 holds.

The proof of Theorem 2 is given in the Appendix. A corollary follows from the proofs of Theorems 1 and 2.
Corollary 1: Theorems 1 and 2 hold if the Same Covariance Assumption holds instead of the Stationary Assumption.
Here we investigate how our Gaussian process model compares to standard LLR and the robust LLR method introduced in Calonico et al. (2014). We choose these two methods because the former is the standard in both applied work and the RDD literature at large, and the latter is a recent method that attempts to solve the undercoverage issue of standard LLR. We focus on the GPR model assuming the Same Covariance Assumption in Section 2.2 and use the mean functions

m_T(x) = β_{T,0} + β_{T,1} x,   m_C(x) = β_{C,0} + β_{C,1} x,    (21)

and the squared exponential covariance function given by (10). These assumptions are analogous to the LLR procedure of fitting separate local linear regressions in treatment and control with differing slopes but the same bandwidth. Specification of the mean function in the Gaussian process prior is typically not consequential for estimation; however, as discussed in Rasmussen & Williams (2006, Section 2.7), there can be some benefits to specifying a mean function, as we do here. In particular, the above specification allows GPR predictions to pull towards a global linear trend instead of a global mean (which would be the case if we used a zero mean function—a common choice in the literature—for the Gaussian process prior). This can be useful within the context of extrapolation towards the boundary, as in an RDD. In the Appendix in Table 3, we present simulation results for our GPR methodology using a zero mean function instead of the above linear mean function for the Gaussian process prior. The results for that case are largely the same as the results presented here, which suggests that our results are insensitive to specification of the mean function in the Gaussian process prior.

As discussed in Section 2.3, we took a full-Bayesian approach to our GPR methodology and placed independent Normal priors on the mean function parameters in (21) and independent half-Cauchy priors on the covariance parameters.
These choices for the prior distributions are in line with common recommendations in the Bayesian data analysis literature: the choice of Normal priors on the mean function parameters follows recommendations from Rasmussen & Williams (2006), and the choice of half-Cauchy priors on the covariance parameters follows recommendations from Gelman (2006), Polson & Scott (2012), and Gelman et al. (2013, Chapter 5). We used the R package rstan (Carpenter et al., 2016) to sample from the posterior of these parameters. In the Appendix in Table 3, we discuss simulation results for GPR when we instead plug in the MLE for the covariance parameters; however, we found that the full-Bayesian approach is preferable in terms of inferential properties, which suggests that it is beneficial to incorporate uncertainty in the covariance parameters for our GPR method.

We conduct a simulation study based on simulations from Imbens & Kalyanaraman (2012) and Calonico et al. (2014). In all simulations, we generated 1,000 datasets of 500 observations {(x_i, y_i, ε_i) : i = 1, ..., 500}, where x_i ∼ 2·Beta(2, 4) − 1, ε_i ∼ N(0, 0.1295²), and y_i = µ_j(x_i) + ε_i for different mean functions µ_j(x_i).

We consider seven different mean functions (see Figure 1), which we call Lee, Quad, Cate 1, Cate 2, Ludwig, Curvature, and Cubic. Lee, Quad, Cate 1, and Cate 2 were used in Imbens & Kalyanaraman (2012), and Lee, Ludwig, and Curvature were used in Calonico et al. (2014); details about these datasets can be found in Imbens & Kalyanaraman 2012 (Page 18) and Calonico et al. 2014 (Page 20). We also introduce the Cubic mean function as a comparison to the Quad mean function, because in the Quad mean function the linear trends on either side of b are of opposite sign, whereas those for the Cubic mean function are of the same sign.

The boundary for each dataset is b = 0. The treatment effect is τ = 0.04 for Lee and Curvature, τ = 0 for Quad, Cubic, Cate 1, and Cate 2, and τ = −3.35 for Ludwig. Also displayed in Figure 1 are a set of sample points {(x_i, y_i), i = 1, ..., 500} for each mean function, which shows what one dataset looks like for each mean function. One can see that—although ε_i ∼ N(0, 0.1295²) for all mean functions—the relative noise varies across mean functions.
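For instance, one replication under the Lee design can be generated as follows; the polynomial coefficients are those commonly reported for this design (Imbens & Kalyanaraman, 2012), and the function names are ours.

```python
import numpy as np

def mu_lee(x):
    """Lee mean function; coefficients as reported in
    Imbens & Kalyanaraman (2012)."""
    left  = 0.48 + 1.27*x + 7.18*x**2 + 20.21*x**3 + 21.54*x**4 + 7.33*x**5
    right = 0.52 + 0.84*x - 3.00*x**2 + 7.99*x**3 - 9.01*x**4 + 3.56*x**5
    return np.where(x < 0, left, right)

def simulate_dataset(mu, n=500, sigma=0.1295, rng=None):
    """One simulated RDD dataset: x_i ~ 2*Beta(2,4) - 1, y_i = mu(x_i) + eps_i."""
    rng = np.random.default_rng() if rng is None else rng
    x = 2 * rng.beta(2, 4, size=n) - 1
    y = mu(x) + rng.normal(0.0, sigma, size=n)
    return x, y

x, y = simulate_dataset(mu_lee, rng=np.random.default_rng(1))
# Treatment effect at the boundary b = 0 is the jump in intercepts:
tau = float(mu_lee(1e-12) - mu_lee(-1e-12))   # 0.52 - 0.48 = 0.04
```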
Figure 1: Mean functions for the datasets used in our simulation study. Lee, Quad, Cate 1, and Cate 2 were used in Imbens & Kalyanaraman (2012), and Lee, Ludwig, and Curvature were used in Calonico et al. (2014). Displayed in gray are a set of sample points {(x_i, y_i), i = 1, ..., 500} for each mean function.

For standard LLR and robust LLR we used the rdrobust R package (Calonico et al., 2017). For both methods, we used the MSE-optimal bandwidth that is the default in rdrobust. Simulation results using the bandwidth introduced in Imbens & Kalyanaraman (2012)—also known as the IK bandwidth, which has been widely used in practice—are provided in the Appendix in Table 4, and results using the coverage error rate optimal bandwidth—an alternative bandwidth choice within the rdrobust R package that is also discussed in Calonico et al. (2018a)—are provided in the Appendix in Table 5. The results using those bandwidths are largely the same as the results presented here. Simulation results for other bandwidth choices appear in Calonico et al. (2014).

Imbens & Kalyanaraman (2012) ran a simulation study that focused on the Lee, Quad, Cate 1, and Cate 2 mean functions, and they compared different bandwidth selectors for LLR in terms of bias and root mean squared error (RMSE). Similarly, Calonico et al. (2014) ran a simulation study that focused on the Lee, Ludwig, and Curvature mean functions, and they compared different bandwidth selectors for LLR and their methodology in terms of coverage and mean interval length (IL). We synthesize these simulation studies and report in Figure 2 how LLR, robust LLR, and two versions of our GPR methodology perform on the seven mean functions in Figure 1 in terms of coverage, IL, absolute bias, and RMSE. The numbers plotted in Figure 2 are in Tables 2 and 3 in the Appendix.
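Given the point estimates and 95% intervals from the replications of a design, the four reported metrics reduce to a few lines of code. This is a generic sketch with hypothetical toy inputs (the function name is ours), showing the definitions used.

```python
import numpy as np

def summarize_replications(est, lo, hi, tau):
    """Empirical coverage, mean interval length, bias, and RMSE of a
    treatment-effect estimator over simulated replications.

    est, lo, hi: point estimates and 95% interval endpoints, one per
    replication; tau: the true treatment effect of the design."""
    est, lo, hi = (np.asarray(a, dtype=float) for a in (est, lo, hi))
    coverage = np.mean((lo <= tau) & (tau <= hi))   # empirical coverage (EC)
    mean_il = np.mean(hi - lo)                      # mean interval length (IL)
    bias = np.mean(est) - tau
    rmse = np.sqrt(np.mean((est - tau) ** 2))
    return coverage, mean_il, bias, rmse

# Three toy "replications" with true tau = 0.04: the third interval
# misses tau, so coverage is 2/3.
cov, il, bias, rmse = summarize_replications(
    est=[0.05, 0.03, 0.10],
    lo=[-0.01, -0.03, 0.05],
    hi=[0.11, 0.09, 0.15],
    tau=0.04,
)
```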
Point estimates and 95% confidence intervals for LLR and robust LLR were obtained from rdrobust. Point estimates and 95% credible intervals for our methodology corresponded to the mean and the 2.5% and 97.5% quantiles, respectively, of the posterior of the average treatment effect, shown in (17).

Robust LLR is meant to improve the coverage of LLR, and indeed it does for all datasets. The better coverage is in part due to wider confidence intervals (see the systematically higher mean interval length at the top right of Figure 2) and in part due to better bias properties (see the bottom left of Figure 2). Robust LLR also tends to exhibit worse RMSE than LLR (see the bottom right of Figure 2).

Our primary method ("GPR") tends to exhibit narrower intervals than both LLR and robust LLR. GPR also exhibits better coverage than both methods, except for the Lee and Ludwig datasets. Furthermore, our method tends to exhibit lower RMSE than both LLR and robust LLR. However, our method always exhibits more bias than robust LLR, which explicitly uses a bias correction.
Figure 2: The coverage, mean interval length (IL), absolute bias, and root mean squared error (RMSE) for LLR, robust LLR, and our GPR method.

For the Lee and Ludwig datasets, our method did worse than robust LLR in terms of coverage and bias, and this may be because it is inappropriate to assume that µ_T(x) and µ_C(x) are stationary processes—i.e., that the covariance parameters do not vary across the running variable X—in these cases. By assuming stationarity, our GPR model uses data both close to and far from the boundary to estimate the single variance σ_GP and lengthscale ℓ. This assumption is related to the stability of the second derivative of µ_T(x) and µ_C(x), because the covariance parameters of a Gaussian process dictate its derivative processes; see Wang (2012) for a further discussion of this relationship. Figure 3 displays the absolute second derivative of the seven mean functions (in blue). The Lee and Ludwig datasets are characterized by the absolute second derivative rapidly increasing near the boundary. Our GPR methodology likely does not do well for these datasets because we are using data far from the boundary to estimate the overall curvature of the mean function, which leads us to underestimate the curvature of the Lee and Ludwig mean functions at the boundary.

Figure 3 also displays σ̂_GP/ℓ̂, the ratio of the maximum-likelihood estimates of the covariance parameters, which was computed within a sliding window of a noiseless version of the seven mean functions. Although there is not a one-to-one correspondence between the absolute second derivative and σ̂_GP/ℓ̂, their behavior is notably similar, which reinforces the idea that both the variance σ_GP and lengthscale ℓ play a role in capturing the curvature of the mean function.
Furthermore, this connection between the second derivative and the covariance parameters further suggests a similarity between the covariance parameters in our GPR methodology and the bandwidth in the LLR methodology, because the IK bandwidth is estimated as a nonlinear function of the estimated second derivative at the boundary b (Imbens & Kalyanaraman, 2012).

Figure 3: The absolute second derivative |f''(x)| (blue) of the seven mean functions shown in Figure 1, and the ratio σ̂_GP/ℓ̂ of the maximum-likelihood estimates of the variance and lengthscale (orange), computed within a sliding window of a noiseless version of the seven mean functions. The sliding window was a fixed-width interval slid across the control group (left-hand side) and the treatment group (right-hand side).

If one did not believe the Stationary Assumption held, one alternative would be to only use data close to the boundary when fitting our GPR model. This is our second method, "GPR (Cut)," whose coverage, mean IL, absolute bias, and RMSE are also displayed in Figure 2. For each of the 1,000 replications of the seven mean functions, we first estimated the IK bandwidth with a rectangular kernel; then, we fit our GPR model within this estimated bandwidth. This procedure improves upon our GPR model for, and only for, the Lee and Ludwig datasets—the coverage increased to 94.3% and 88.9%, respectively, and the bias improved for the Lee and Ludwig datasets while staying the same for the other datasets. While the coverage also improved for the Cate 1 and Cate 2 datasets, this is likely due to the increase in the interval length.
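The preprocessing step behind "GPR (Cut)" is simply to discard data outside a window around the boundary before fitting. A minimal sketch, with a fixed bandwidth h standing in for the estimated IK bandwidth:

```python
import numpy as np

def cut_to_bandwidth(x, y, b=0.0, h=0.25):
    """Restrict a sample to observations within a bandwidth h of the
    boundary b. In the paper the IK bandwidth (rectangular kernel) is
    estimated from the data; here h is a fixed illustrative value."""
    keep = np.abs(x - b) <= h
    return x[keep], y[keep]

# Toy data with the same running-variable distribution as the simulations.
rng = np.random.default_rng(2)
x = 2 * rng.beta(2, 4, size=500) - 1
y = x + rng.normal(0.0, 0.1295, size=500)

x_cut, y_cut = cut_to_bandwidth(x, y)
# The GPR model is then fit to (x_cut, y_cut) exactly as before, so the
# covariance parameters are estimated only from data near the boundary.
```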
Because results improved only for these two datasets, this further demonstrates that using our GPR model on the whole dataset can be beneficial when µ_T(x) and µ_C(x) are stationary processes; otherwise, it may be preferable to only fit our GPR model to data close to the boundary.

Furthermore, this suggests that our method can be combined with a local randomization perspective for RDDs (e.g., Li et al. 2015): One can first determine the window around the boundary where units are "as-if randomized" by using covariate balance tests such as those of Cattaneo et al. (2015) and Li et al. (2015), and then use our GPR methodology within this window around the boundary. This approach is similar to that of Cattaneo et al. (2017), who combined the local randomization perspective with parametric models for estimating the average treatment effect at the boundary.

Overall, GPR performs well compared to LLR and robust LLR. In particular, our GPR method tends to yield better interval length and RMSE properties than LLR and robust LLR, and it also yields better coverage when the underlying mean functions are stationary across the running variable X. The issue of undercoverage in LLR methodologies has been relatively unaddressed in the RDD literature, except for robust LLR (Calonico, Cattaneo, & Titiunik, 2014), and so our GPR method can be viewed as the second method to yield promising coverage properties for RDD analyses while also providing a flexible fit to the underlying mean functions µ_T(x) and µ_C(x). When µ_T(x) and µ_C(x) are nonstationary across X, our GPR methodology could be extended to incorporate a lengthscale function ℓ(x) instead of a single lengthscale ℓ. However, such a lengthscale function would either need to be specified (see Rasmussen and Williams 2006, Chapter 4, for an example) or estimated via another Gaussian process prior (Plagemann, Kersting, & Burgard, 2008).
In a similar vein, there has also been work on using dimension expansion to model nonstationary processes (Bornn et al., 2012). However, in an RDD, we only need a good estimate of the covariance parameters near the boundary, rather than across the entire mean functions µ_T(x) and µ_C(x). More work needs to be done to determine the optimal amount of data to include in our GPR model for estimating these parameters and the average treatment effect at the boundary in the case of nonstationary processes.

Now we compare how LLR, robust LLR, and GPR perform on a real dataset from the National Basketball Association. The National Basketball Association (NBA) draft, held annually, is divided into two rounds, where each NBA team gets one selection per round to draft a player of their choice. Because players are picked sequentially, there is no reason to believe there is a marked skill difference between the last pick of the first round and the first pick of the second round. However, because of the difference in the perceived value of first-round versus second-round picks, as well as differing contract structures between the two rounds, we suspect that first-round picks are treated more favorably and given more playing time than their second-round colleagues, above and beyond what can be explained by differences in skill. As such, we seek to explore if there is a difference between first- and second-round picks in both skill and playing time.

We want to estimate the treatment effect of having a second-round contract instead of a first-round contract on four basketball player outcomes: box plus-minus, win shares played, number of minutes played, and number of games played. The first two are overall measures of player performance (Kubatko, 2009; Myers, 2015), while the latter two are measures of playing time. Our data include the pick number and the four aforementioned outcomes for 1,238 NBA basketball players drafted between 1995 and 2016.
Due to anomalies created by the NBA expanding from 29 teams to 30 teams in 2004, as well as some years with teams forfeiting picks, we shifted the pick numbers to ensure that the boundary b falls between the 30th and 31st picks.

Figure 4: Four basketball-player outcomes—number of minutes played, box plus-minus, win shares played, and number of games played—across picks.

Even though the running variable in this case is discrete—which can cause complications in some regression discontinuity analyses (Lee & Card, 2008; Kolesár & Rothe, 2018)—we nonetheless apply LLR, robust LLR, and our GPR method to these data to compare how they perform in practice. As in Section 4, we used the rdrobust R package to implement LLR and robust LLR using the default MSE-optimal bandwidth. For GPR, we used the squared exponential covariance function assuming the Same Covariance Assumption; furthermore, we took the full-Bayesian approach to our GPR methodology and—similar to Section 4—placed independent Normal priors on the mean function parameters and half-Cauchy priors on the covariance parameters.

Figure 5 shows the estimated mean functions and corresponding confidence intervals for LLR and GPR, and Table 1 shows the treatment effect point estimates and corresponding confidence intervals for LLR, robust LLR, and GPR. The estimated mean functions for LLR and GPR are quite similar to each other for all outcomes, with GPR yielding slightly wider confidence intervals, as expected. All three methods find the number of games played for second-round picks to be significantly lower than that of first-round picks. Furthermore, LLR and robust LLR find the box plus-minus for second-round picks to be significantly lower than that of first-round picks; meanwhile, GPR finds this difference to be borderline insignificant. Out of these three methods, the results from our GPR method are most in line with previous reports—both qualitative (Barber, 2016) and quantitative (Koenig, 2012)—on the difference between first- and second-round NBA basketball players, which have claimed that there is a difference in attention given to first-round picks (e.g., games played) but not a difference in player ability (e.g., box plus-minus and win shares).

Robust LLR yields an inflated confidence interval for the treatment effect specifically, which depends on a bias correction at the boundary. Thus, robust LLR does not yield confidence intervals for the entire mean functions, because the bias correction is boundary-specific. Consequently, only LLR and GPR are displayed in Figure 5, while LLR, robust LLR, and GPR are all discussed in Table 1. Furthermore, for robust LLR in Table 1, we report the bias-corrected point estimate given by the rdrobust package.
Figure 5: The estimated mean functions (solid lines) and corresponding confidence intervals (dashed lines) for LLR (black lines) and GPR (red lines). The lines for LLR were produced by the rdd R package (Dimmery, 2013), but using the bandwidth estimated by the rdrobust R package. The two treatment groups are the first round of picks (picks 30 and below) and the second round of picks (picks 31 and above). We set the boundary b between the 30th and 31st picks.

Outcome           LLR Estimate [95% CI]        Robust LLR Estimate [95% CI]   GPR Estimate [95% CI]
Box Plus-Minus    -3.06 [-5.20, -0.92]         -3.37 [-5.86, -0.88]           -2.42 [-5.02, 0.01]
Win Shares        -1.58 [-4.09, 0.93]          -1.73 [-4.76, 1.29]            -1.08 [-3.04, 0.88]
Minutes Played    -446.29 [-1042.81, 150.23]   -406.63 [-1123.99, 310.72]     -640.75 [-1259.65, 5.51]
Games Played      -29.71 [-40.20, -19.22]      -28.16 [-40.81, -15.51]        -32.00 [-45.69, -18.14]
Table 1: Point estimates and 95% confidence intervals for the treatment effect on each of the four outcomes: box plus-minus, win shares played, number of minutes played, and number of games played. Statistically significant point estimates are in bold.

Overall, we find a decrease in the playing time given to second-round players relative to their first-round counterparts beyond that explainable by the natural drop-off in playing time between successive picks. Furthermore, we find that there is not a significant difference in player ability between first- and second-round basketball players at the boundary between the 30th and 31st picks that divides the first and second rounds of the NBA draft.
Local linear regression (LLR) and its variants are currently the most common methodologies for estimating the average treatment effect at the boundary in RDDs. These methods are popular because they are easy to implement and there is a large literature on their theoretical properties. However, recent works have noted that LLR tends to yield confidence intervals that undercover, and new methodologies—namely that of Calonico et al. (2014)—have tried to address this issue. As an alternative to LLR, we proposed a Gaussian process regression (GPR) methodology that flexibly fits the treatment and control response functions by placing a general prior on the mean response functions. We showed via simulation that our GPR methodology tends to outperform standard LLR and the state-of-the-art methodology of Calonico et al. (2014) in terms of coverage, interval length, and RMSE. Furthermore, we used our GPR methodology on a real-world sharp RDD in the National Basketball Association (NBA) draft and found that GPR yielded results that were more in line with previous reports on the NBA draft than were results from LLR methods. Overall, our methodology addresses the undercoverage issue commonly seen in RDDs without sacrificing too much power to detect treatment effects, thereby adding to the growing literatures on improving inference for RDDs (Calonico et al., 2014, 2018b) and on Bayesian methods for RDDs (Chib & Greenberg, 2014; Chib & Jacobi, 2015; Geneletti et al., 2015; Li et al., 2015).

Our methodology focuses on flexibly fitting the mean treatment and control responses while also improving coverage properties; however, there are many other issues of interest in the RDD literature. For example, we only consider sharp RDDs; Li et al. (2015) and Chib & Jacobi (2015) provide Bayesian methodologies for fuzzy RDDs, which could likely be combined with our Gaussian process approach.
Furthermore, we focused on RDDs that only have one background covariate—the running variable—but other covariates could be included in our GPR methodology to improve the precision of our treatment effect estimator (e.g., Calonico et al. 2016). See Imbens & Lemieux (2008) for a review of these other RDD concerns.

Moreover, while our method can incorporate prior knowledge and uncertainty in various parameters that are typically discussed in the RDD literature, it is not necessarily clear when this should be done for our GPR model. For example, Hall & Kang (2001) found that incorporating the uncertainty in the bandwidth—which is somewhat analogous to the covariance parameters in our GPR model—for density estimators via bootstrapping can be inconsequential or even detrimental in some cases. Although our simulation study suggests that it is beneficial to take a full-Bayesian approach and propagate uncertainty in the GPR model parameters when estimating treatment effects in RDDs, more work needs to be done to determine when it is most appropriate to incorporate uncertainty in these parameters. We also only focused on the squared exponential covariance function for Gaussian processes, but other covariance functions should be considered. Because estimating the average treatment effect at the boundary in an RDD is fundamentally an extrapolation issue, covariance functions whose purpose is to extrapolate well (e.g., Wilson & Adams 2013) may be particularly suitable for RDDs.

Additionally, although our GPR methodology exhibits arguably better coverage, interval length, and RMSE properties than standard methodologies in the literature, there are ways our methodology could be improved, even for the case of a sharp RDD with only one covariate. Our methodology does not perform well when the mean response functions µ_T(x) and µ_C(x) are nonstationary across the running variable X.
More work needs to be done to model nonstationary processes within the context of RDDs, such as by estimating a lengthscale function ℓ(x) or by determining the optimal amount of data to use when estimating the covariance parameters for our GPR methodology. Relatedly, our simulation results suggest that our GPR methodology can be used in combination with a local randomization framework for RDDs (such as that seen in Li et al. 2015).

Finally, a promising avenue for future research is extending our GPR methodology beyond one-dimensional sharp RDDs. In particular, recent work has explored geographic RDDs, where spatial variation in outcomes must be accounted for when estimating the treatment effect along a geographic boundary (Keele & Titiunik, 2014; Keele et al., 2015). Because GPR has been widely used in spatial statistics (Banerjee et al., 2014; Cressie, 2015), it may be particularly suitable for geographic RDDs. We explore the use of our GPR methodology for geographic RDDs in Rischard et al. (2018).

Appendix
Table 2: Simulation results for n = 500 (MSE-optimal bandwidth)

             LLR                          Robust LLR                   GPR
Dataset      EC      IL    Bias   RMSE    EC      IL    Bias   RMSE    EC      IL    Bias   RMSE
Lee          89.2%   0.21   0.02  0.06    91.7%   0.25   0.01  0.07    82.3%   0.19   0.05  0.07
Quad         92.8%   0.21   0.00  0.06    93.9%   0.25   0.00  0.07    97.0%   0.18   0.00  0.04
Cubic        92.5%   0.22  -0.01  0.06    93.0%   0.25   0.00  0.07    96.3%   0.20  -0.01  0.05
Cate 1       92.4%   0.24  -0.01  0.07    92.4%   0.28   0.00  0.08    94.4%   0.25  -0.01  0.07
Cate 2       92.4%   0.24  -0.01  0.07    92.4%   0.28   0.00  0.08    94.4%   0.25  -0.01  0.07
Ludwig       85.1%   0.32   0.05  0.10    93.1%   0.35   0.01  0.09    75.0%   0.25   0.08  0.10
Curvature    91.2%   0.22  -0.01  0.06    92.7%   0.25   0.00  0.07    96.3%   0.21  -0.01  0.05
Simulations assessing the empirical coverage (EC), mean interval length (IL), bias, and root mean squared error (RMSE) for local linear regression (LLR), robust LLR, and our Gaussian process regression (GPR) method, where we used the MSE-optimal default bandwidth in rdrobust when implementing LLR and robust LLR. These methods were performed on 1,000 replications of seven different datasets, which were also used in Imbens & Kalyanaraman (2012) and Calonico et al. (2014). A plot of these numbers is shown in Figure 2.

Table 3: Simulation Results for GPR, GPR (Cut), GPR (Zero Mean), and GPR (MLE)
             GPR                          GPR (Cut)                    GPR (Zero Mean)              GPR (MLE)
Dataset      EC      IL    Bias   RMSE    EC      IL    Bias   RMSE    EC      IL    Bias   RMSE    EC      IL    Bias   RMSE
Lee          82.3%   0.19   0.05  0.07    94.3%   0.22   0.03  0.06    81.6%   0.18   0.05  0.07    76.4%   0.15   0.04  0.06
Quad         97.0%   0.18   0.00  0.04    97.2%   0.23  -0.00  0.05    97.0%   0.19   0.01  0.04    95.9%   0.18   0.00  0.05
Cubic        96.3%   0.20  -0.01  0.05    96.0%   0.35  -0.01  0.08    96.7%   0.20   0.00  0.05    91.1%   0.19  -0.03  0.06
Cate 1       94.4%   0.25  -0.01  0.07    96.9%   0.29  -0.01  0.07    95.2%   0.26   0.00  0.07    93.8%   0.25   0.00  0.07
Cate 2       94.4%   0.25  -0.01  0.07    97.2%   0.29  -0.01  0.07    95.4%   0.26   0.00  0.07    94.2%   0.25   0.00  0.07
Ludwig       75.0%   0.25   0.08  0.10    88.9%   0.34   0.06  0.10    64.5%   0.24   0.10  0.12    52.0%   0.24   0.11  0.13
Curvature    96.3%   0.21  -0.01  0.05    96.1%   0.26  -0.01  0.06    96.9%   0.20   0.00  0.05    89.6%   0.20  -0.03  0.06
Simulation results assessing the empirical coverage (EC), mean interval length (IL), bias, and root mean squared error (RMSE) for GPR (which uses the full-Bayesian approach discussed in Section 2.3), GPR using only data within the IK bandwidth with a rectangular kernel (called GPR (Cut)), GPR using a zero mean function in the prior (19) instead of a linear trend (called GPR (Zero Mean)), and GPR plugging in the MLE for the covariance parameters (called GPR (MLE)). Note that the GPR columns are the same as those in Table 2. GPR (Cut) performs much better than GPR for the Lee and Ludwig datasets, and GPR (Cut) performs marginally better than GPR for the Cate 1 and Cate 2 datasets. Otherwise, GPR (Cut) is equal to or worse than GPR. This demonstrates that GPR on the whole dataset can be beneficial when µ_T(x) and µ_C(x) are stationary processes; otherwise, it may be preferable to only fit our GPR model to data close to the boundary. Furthermore, GPR (Zero Mean) performs similarly to GPR for most of the datasets; this suggests that our results are generally insensitive to specification of the mean function in the Gaussian process prior. However, GPR does perform better than GPR (Zero Mean) for the Ludwig dataset. This is likely because, as discussed in Section 4, when GPR extrapolates to the boundary, its predictions are pulled towards the global linear trend, while predictions from GPR (Zero Mean) are pulled towards the global mean. This also explains why, for the Ludwig dataset, the bias for GPR (Zero Mean) is higher than the bias for GPR. Finally, GPR that simply plugs in the MLE of the covariance parameters tends to perform worse than the full-Bayesian GPR approach, especially in terms of coverage. This further suggests that incorporating uncertainty in the covariance parameters for our GPR method can lead to promising inferential properties.

Table 4: Simulation results for n = 500 (IK bandwidth)
             LLR                          Robust LLR                   GPR
Dataset      EC      IL    Bias   RMSE    EC      IL    Bias   RMSE    EC      IL    Bias   RMSE
Lee          82.7%   0.15   0.04  0.05    91.2%   0.27   0.01  0.08    82.3%   0.19   0.05  0.07
Quad         94.6%   0.15   0.00  0.04    93.2%   0.24   0.00  0.07    97.0%   0.18   0.00  0.04
Cubic        93.7%   0.19  -0.01  0.05    93.8%   0.24   0.01  0.06    96.3%   0.20  -0.01  0.05
Cate 1       91.2%   0.21  -0.01  0.06    92.9%   0.26   0.01  0.07    94.4%   0.25  -0.01  0.07
Cate 2       92.1%   0.22  -0.01  0.06    92.8%   0.26   0.01  0.07    94.4%   0.25  -0.01  0.07
Ludwig       31.5%   0.23   0.15  0.16    89.8%   0.27   0.04  0.08    75.0%   0.25   0.08  0.10
Curvature    84.4%   0.19  -0.03  0.06    94.8%   0.23   0.00  0.06    96.3%   0.21  -0.01  0.05
Simulations assessing the empirical coverage (EC), mean interval length (IL), bias, and root mean squared error (RMSE) for local linear regression (LLR), robust LLR, and our Gaussian process regression (GPR) method, where we used the bandwidth of Imbens & Kalyanaraman (2012) when implementing LLR and robust LLR. These results are largely the same as the results presented in Table 2: Our method performed best in terms of coverage, except for the Lee and Ludwig datasets. Furthermore, our method tended to yield narrower intervals and lower RMSE than robust LLR. Our method also tended to yield more bias than robust LLR.

Table 5: Simulation results for n = 500 (CER-optimal bandwidth)
             LLR                          Robust LLR                   GPR
Dataset      EC      IL    Bias   RMSE    EC      IL    Bias   RMSE    EC      IL    Bias   RMSE
Lee          91.1%   0.25   0.01  0.07    91.8%   0.27   0.01  0.08    82.3%   0.19   0.05  0.07
Quad         92.3%   0.24  -0.00  0.07    92.7%   0.27  -0.00  0.08    97.0%   0.18   0.00  0.04
Cubic        92.2%   0.25  -0.00  0.07    92.1%   0.27   0.00  0.08    96.3%   0.20  -0.01  0.05
Cate 1       91.9%   0.28  -0.00  0.08    92.3%   0.30   0.00  0.08    94.4%   0.25  -0.01  0.07
Cate 2       92.0%   0.28  -0.00  0.08    92.3%   0.30   0.00  0.08    94.4%   0.25  -0.01  0.07
Ludwig       91.5%   0.39   0.02  0.10    92.9%   0.41   0.00  0.10    75.0%   0.25   0.08  0.10
Curvature    91.4%   0.26  -0.00  0.07    93.3%   0.28   0.00  0.08    96.3%   0.21  -0.01  0.05
Simulations assessing the empirical coverage (EC), mean interval length (IL), bias, and root mean squared error (RMSE) for local linear regression (LLR), robust LLR, and our Gaussian process regression (GPR) method, where we used the coverage error rate (CER) optimal bandwidth—an alternative bandwidth choice within the rdrobust R package that is also discussed in Calonico et al. (2018a). These results are largely the same as the results presented in Tables 2 and 4: Our method performed best in terms of coverage, except for the Lee and Ludwig datasets. Furthermore, our method tended to yield narrower intervals and lower RMSE than robust LLR. Our method also tended to yield more bias than robust LLR.

Assumptions for Posterior Consistency Proofs
Assumptions on the Running Variable X (Assumptions A1)
1. Assume the control running variable values {x_i}_{i=1}^{n_C} are known elements of [b_C, b] and the treatment running variable values {x_i}_{i=1}^{n_T} are known elements of [b, b_T], for some b_C, b_T, and boundary b.

Assumptions on the Response Function (Assumptions A2)
1. Let C^α[b, b_T] and C^α[b_C, b] be Hölder spaces of α-smooth functions f : [b, b_T] → R and f : [b_C, b] → R, respectively. Assume µ_T(x) ∈ C^α[b, b_T] and µ_C(x) ∈ C^α[b_C, b], where µ_T(x) and µ_C(x) are the treatment and control response functions.

2. The treatment and control responses {y_i}_{i=1}^{n_T} and {y_i}_{i=1}^{n_C} have the following relationships with the running variable:

    y_i = µ_T(x_i) + ε_i,    i = 1, ..., n_T    (22)
    y_i = µ_C(x_i) + ε_i,    i = 1, ..., n_C    (23)

for mean response functions µ_T(x) and µ_C(x) and independent errors {ε_i}_{i=1}^{n_C} ∼ N(0, σ²_{y,C}) and {ε_i}_{i=1}^{n_T} ∼ N(0, σ²_{y,T}).

Assumptions on the Noise (Assumptions A3)
1. The priors on σ²_{y,T} and σ²_{y,C} have support on compact intervals that are subsets of (0, ∞) which contain the true error variances σ²_{y,T} and σ²_{y,C}, respectively.

Assumptions on the Prior for ℓ (Assumptions A4)
1. The lengthscale ℓ has a prior distribution κ such that, for positive constants C_1, D_1, C_2, D_2, nonnegative constants p, q, and every sufficiently large a > 0,

    C_1 a^p exp(−D_1 a log^q a) ≤ κ(a) ≤ C_2 a^p exp(−D_2 a log^q a)    (24)

Proof of Theorems 1 and 2

Proof of Theorem 1: Assume that the Stationary Assumption holds, the covariance functions K_T(x, x′) and K_C(x, x′) are fixed, and Assumptions A1, A2, and A3 hold. Then, according to Theorem 2.2 in van der Vaart & van Zanten (2009), for sufficiently large M_T and M_C,

    Π(µ_T : h(µ_T, µ*_T) ≥ M_T ε_{n_T} | x_1, ..., x_{n_T}) → 0 in P_{µ*_T}-probability, and
    Π(µ_C : h(µ_C, µ*_C) ≥ M_C ε_{n_C} | x_1, ..., x_{n_C}) → 0 in P_{µ*_C}-probability    (25)

for some ε_{n_T} and ε_{n_C}, where h is the Hellinger distance. The nature of the contraction rates ε_{n_T} and ε_{n_C} is discussed in van der Vaart & van Zanten (2009), and depends on the differentiability and smoothness of the true µ*_T(x) and µ*_C(x). Note that the only covariate value for which both of these hold is x = b, because by Assumption A1, the intersection of the supports of {x_i}_{i=1}^{n_C} and {x_i}_{i=1}^{n_T} is only the boundary b.

Because the Hellinger distance is symmetric and satisfies the triangle inequality,

    h(τ, τ*) = h(µ_T(b) − µ_C(b), µ*_T(b) − µ*_C(b))    (26)
             ≤ h(µ_T(b), µ*_T(b)) + h(µ_C(b), µ*_C(b))    (27)
             ≤ M_T ε_{n_T} + M_C ε_{n_C}    (28)

where the last inequality holds with posterior probability 1, by (25). Therefore,

    Π(τ : h(τ, τ*) ≥ M ε_n | x_1, ..., x_n) → 0 in P_{τ*}-probability

for M ≡ √2 · max(M_T, M_C) and ε_n ≡ √2 · max(ε_{n_T}, ε_{n_C}), so that M ε_n ≥ M_T ε_{n_T} + M_C ε_{n_C}.
∎

Proof of Theorem 2: Assume that the Stationary Assumption holds, the σ_GP parameters in K_T(x, x′) and K_C(x, x′) are fixed, and Assumptions A1, A2, A3, and A4 hold. By Theorem 3.1 of van der Vaart & van Zanten (2009), (25) holds, and then the proof of Theorem 2 is identical to that of Theorem 1. ∎

Extending Theorems 1 and 2 to Other Mean and Covariance Functions and Random Variance

Rasmussen & Williams 2006 (Section 2.7) discuss other choices of mean functions besides m(x) = 0, particularly mean functions of the form m(x) = h(x)^T β for some fixed basis functions h(x). However, Rasmussen & Williams (2006) argue that different choices of m(x) are more for interpretability than predictive accuracy, because choices of m(x) other than m(x) = 0 likely do not affect the consistency results in Section 3. To the best of our knowledge, the posterior consistency literature has focused primarily on the choice m(x) = 0. Furthermore, previous results place a prior on the lengthscale ℓ but not on the variance σ_GP, as in van der Vaart & van Zanten (2009). To the best of our knowledge, Choi (2007) is the only work to consider posterior consistency when priors are placed on both ℓ and σ_GP; however, these results only hold for binary Gaussian process regression, and their necessary assumptions are more restrictive than those presented in this paper. We leave posterior consistency of our GPR method when priors are placed on both ℓ and σ_GP as future work.

References

Armstrong, T. B., & Kolesár, M. (2017). A simple adjustment for bandwidth snooping.
The Review of Economic Studies, (2), 732–765.
Banerjee, S., Carlin, B. P., & Gelfand, A. E. (2014). Hierarchical Modeling and Analysis for Spatial Data. CRC Press.
Barber, H. (2016). Where are all the second round picks? Nobody knows and the NBA doesn't seem to care. Accessed: 2017-11-21.
Bornn, L., Shaddick, G., & Zidek, J. V. (2012). Modeling nonstationary processes through dimension expansion. Journal of the American Statistical Association, (497), 281–289.
Calonico, S., Cattaneo, M. D., & Farrell, M. H. (2018a). On the effect of bias estimation on coverage accuracy in nonparametric inference. Journal of the American Statistical Association, (pp. 1–13).
Calonico, S., Cattaneo, M. D., & Farrell, M. H. (2018b). Optimal bandwidth choice for robust bias corrected inference in regression discontinuity designs. arXiv preprint arXiv:1809.00236.
Calonico, S., Cattaneo, M. D., Farrell, M. H., & Titiunik, R. (2016). Regression discontinuity designs using covariates. Review of Economics and Statistics, (0).
Calonico, S., Cattaneo, M. D., Farrell, M. H., & Titiunik, R. (2017). rdrobust: Software for regression discontinuity designs. Stata Journal, (2), 372–404.
Calonico, S., Cattaneo, M. D., & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, (6), 2295–2326.
Calonico, S., Cattaneo, M. D., & Titiunik, R. (2015). Optimal data-driven regression discontinuity plots. Journal of the American Statistical Association, (512), 1753–1769.
Card, D., Dobkin, C., & Maestas, N. (2004). The impact of nearly universal insurance coverage on health care utilization and health: Evidence from Medicare. Tech. rep., National Bureau of Economic Research.
Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2016). Stan: A probabilistic programming language. Journal of Statistical Software.
Cattaneo, M. D., Frandsen, B. R., & Titiunik, R. (2015). Randomization inference in the regression discontinuity design: An application to party advantages in the US Senate. Journal of Causal Inference, (1), 1–24.
Cattaneo, M. D., Titiunik, R., & Vazquez-Bare, G. (2017). Comparing inference approaches for RD designs: A reexamination of the effect of Head Start on child mortality. Journal of Policy Analysis and Management.
Chib, S., & Greenberg, E. (2014). Nonparametric Bayes analysis of the sharp and fuzzy regression discontinuity designs. Technical Report, Washington University in St. Louis.
Chib, S., & Jacobi, L. (2015). Bayesian fuzzy regression discontinuity analysis and returns to compulsory schooling. Journal of Applied Econometrics.
Choi, T. (2007). Alternative posterior consistency results in nonparametric binary regression using Gaussian process priors. Journal of Statistical Planning and Inference, (9), 2975–2983.
Choi, T., & Schervish, M. J. (2007). On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis, (10), 1969–1987.
Cressie, N. (2015). Statistics for Spatial Data. John Wiley & Sons.
Dimmery, D. (2013). rdd: Regression discontinuity estimation. R package version 0.54.
Fan, J. (1992). Design-adaptive nonparametric regression. Journal of the American Statistical Association, (420), 998–1004.
Fan, J., & Gijbels, I. (1992). Variable bandwidth and local linear regression smoothers. The Annals of Statistics, (pp. 2008–2036).
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, (3), 515–534.
Gelman, A., & Imbens, G. (2018). Why high-order polynomials should not be used in regression discontinuity designs. Journal of Business & Economic Statistics, (pp. 1–10).
Gelman, A., Stern, H. S., Carlin, J. B., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
Geneletti, S., O'Keeffe, A. G., Sharples, L. D., Richardson, S., & Baio, G. (2015). Bayesian regression discontinuity designs: Incorporating clinical knowledge in the causal analysis of primary care data. Statistics in Medicine, (15), 2334–2352.
Ghosal, S., & Roy, A. (2006). Posterior consistency of Gaussian process prior for nonparametric binary regression. The Annals of Statistics, (pp. 2413–2429).
Hahn, J., Todd, P., & Van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, (1), 201–209.
Hall, P., & Kang, K.-H. (2001). Bootstrapping nonparametric density estimators with empirically chosen bandwidths. Annals of Statistics, (pp. 1443–1468).
Imbens, G., & Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. The Review of Economic Studies, (3), 933–959.
Imbens, G. W., & Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, (2), 615–635.
Keele, L., Titiunik, R., & Zubizarreta, J. R. (2015). Enhancing a geographic regression discontinuity design through matching to estimate the effect of ballot initiatives on voter turnout. Journal of the Royal Statistical Society: Series A (Statistics in Society), (1), 223–239.
Keele, L. J., & Titiunik, R. (2014). Geographic boundaries as regression discontinuities. Political Analysis, (1), 127–155.
Koenig, A. (2012). Coming up just short: The marginal effect of being a first round pick in the NBA draft.
Kolesár, M., & Rothe, C. (2018). Inference in regression discontinuity designs with a discrete running variable. American Economic Review, (8), 2277–2304.
Kubatko, J. (2009). Calculating win shares. Basketball-Reference.com.
Lee, D. S., & Card, D. (2008). Regression discontinuity inference with specification error. Journal of Econometrics, (2), 655–674.
Li, F., Mattei, A., Mealli, F., et al. (2015). Evaluating the causal effect of university grants on student dropout: Evidence from a regression discontinuity design using principal stratification. The Annals of Applied Statistics, (4), 1906–1931.
Ludwig, J., & Miller, D. L. (2007). Does Head Start improve children's life chances? Evidence from a regression discontinuity design. The Quarterly Journal of Economics, (1), 159–208.
Matsudaira, J. D. (2008). Mandatory summer school and student achievement. Journal of Econometrics, (2), 829–850.
Myers, D. (2015). About box plus/minus (BPM). Basketball-Reference.com.
Plagemann, C., Kersting, K., & Burgard, W. (2008). Nonstationary Gaussian process regression using point estimates of local smoothness. Machine Learning and Knowledge Discovery in Databases, (pp. 204–219).
Polson, N. G., & Scott, J. G. (2012). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, (4), 887–902.
Porter, J. (2003). Estimation in the regression discontinuity model. Unpublished manuscript, Department of Economics, University of Wisconsin at Madison.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
Rischard, M., Branson, Z., Miratrix, L., & Bornn, L. (2018). A Bayesian nonparametric approach to geographic regression discontinuity designs: Do school districts affect NYC house prices? arXiv preprint arXiv:1807.04516.
Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational Statistics, (1), 1–26.
Silver, N. (2014). How much is winning the NBA draft lottery really worth? https://fivethirtyeight.com/features/how-much-is-winning-the-nba-draft-lottery-really-worth/. Accessed: 2017-11-21.
Thistlethwaite, D. L., & Campbell, D. T. (1960). Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, (6), 309–317.
Van der Klaauw, W. (2002). Estimating the effect of financial aid offers on college enrollment: A regression-discontinuity approach. International Economic Review, (4), 1249–1287.
van der Vaart, A. W., & van Zanten, J. H. (2008). Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, (pp. 1435–1463).
van der Vaart, A. W., & van Zanten, J. H. (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. The Annals of Statistics, (pp. 2655–2675).
Wang, X. (2012). Bayesian Modeling Using Latent Structures. Ph.D. thesis.
Wilson, A. G., & Adams, R. P. (2013). Gaussian process kernels for pattern discovery and extrapolation. In