Efficient Estimation for Staggered Rollout Designs
Jonathan Roth† Pedro H.C. Sant'Anna‡

February 3, 2021
Abstract
Researchers are often interested in the causal effect of treatments that are rolled out to different units at different points in time. This paper studies how to efficiently estimate a variety of causal parameters in such staggered rollout designs when treatment timing is (as-if) randomly assigned. We solve for the most efficient estimator in a class of estimators that nests two-way fixed effects models as well as several popular generalized difference-in-differences methods. The efficient estimator is not feasible in practice because it requires knowledge of the optimal weights to be placed on pre-treatment outcomes. However, the optimal weights can be estimated from the data, and in large datasets the plug-in estimator that uses the estimated weights has similar properties to the "oracle" efficient estimator. We illustrate the performance of the plug-in efficient estimator in simulations and in an application to Wood, Tyler and Papachristos (2020a)'s study of the staggered rollout of a procedural justice training program for police officers. We find that confidence intervals based on the plug-in efficient estimator have good coverage and can be as much as five times shorter than confidence intervals based on existing methods. As an empirical contribution of independent interest, our application provides the most precise estimates to date on the effectiveness of procedural justice training programs for police officers.

∗We are grateful to Brantly Callaway, Emily Owens, Ryan Hill, Ashesh Rambachan, Evan Rose, Adrienne Sabety, Jesse Shapiro, Yotam Shem-Tov, and Ariella Kahn-Lang Spitzer for helpful comments and conversations.
†Microsoft. [email protected]
‡Vanderbilt University. [email protected]

1 Introduction
Researchers are often interested in the causal effects of a treatment that has a staggered rollout, meaning that it is first implemented for different units at different times. For instance, social scientists may be interested in the causal effect of a policy that is adopted in different states at different times. Businesses may likewise be interested in the causal effect of a new feature or advertising campaign that is introduced to different customers over time. In many cases, the timing of the rollout is controlled by the researcher and can be explicitly randomized. In others, researchers argue that the timing of the treatment is as-if randomly assigned.

In these settings, researchers often estimate treatment effects using methods that extend the simple two-period difference-in-differences estimator to the staggered setting. It is common practice to estimate causal effects using two-way fixed effects (TWFE) models that control for both time and unit fixed effects (e.g. Xiong, Athey, Bayati and Imbens, 2019). Recent work has shown, however, that the estimand of TWFE models may be difficult to interpret under treatment effect heterogeneity (see Related Literature below). The literature has therefore proposed a variety of alternative procedures that yield more easily-interpretable estimands under heterogeneous treatment effects (Callaway and Sant'Anna, 2020; de Chaisemartin and D'Haultfœuille, 2020; Sun and Abraham, 2020). All of these procedures exploit a generalized "parallel trends" assumption for estimation. However, the assumption of random treatment timing is stronger than parallel trends. This suggests that it might be possible to obtain more precise estimates by more fully exploiting the random timing of treatment.

This paper considers efficient estimation of causal effects in settings where the timing of treatment is (as-if) randomly assigned. We begin by introducing a design-based framework that formalizes the notion that treatment timing is (as-if) randomly assigned.
We consider estimation of a variety of causal estimands using a class of estimators that nests canonical two-way fixed effects models as well as the alternative estimators discussed above as special cases. We then solve for the most efficient estimator in this class. The efficient estimator more fully exploits the implications of random treatment timing, which is stronger than the generalized parallel trends assumption on which conventional estimators are based. As a result, the efficient estimator we derive asymptotically dominates conventional estimation approaches for the same estimand in terms of efficiency, with large gains in Monte Carlo simulations. We therefore recommend use of the efficient estimator in settings where treatment timing is either random by design or assumed to be quasi-random.

For clarity of exposition and to connect our results to previous work, we begin by analyzing the canonical two-period difference-in-differences model under randomized treatment. All units are untreated in the first period, and a subset of units are randomly assigned to begin treatment in the second period. We consider estimators of the average treatment effect (ATE) of the form $\hat{\theta}_\beta = (\bar{Y}_{1,1} - \bar{Y}_{0,1}) - \beta (\bar{Y}_{1,0} - \bar{Y}_{0,0})$, where $\bar{Y}_{d,t}$ is the sample mean of the outcome for treatment group $d$ in period $t$. These estimators take the simple difference in means in period 1 and then adjust linearly for the difference in means in period 0. The canonical difference-in-differences estimator corresponds with the special case of $\beta = 1$. Under the assumption that period 0 outcomes are unaffected by treatment status in period 1 (i.e. there are no anticipatory effects of treatment), the period 0 outcomes are isomorphic to fixed covariates in a random experiment.
We can then apply results from Lin (2013) on covariate adjustment in random experiments to (i) show that $\hat{\theta}_\beta$ is unbiased for the ATE for all $\beta$, and (ii) solve for the variance-minimizing value $\beta^*$, which depends on the covariance between the (treated and untreated) potential outcomes in period 1 and the pre-treatment outcomes. In general, the efficient value $\beta^*$ will not be equal to 1, and thus the DiD estimator will be inefficient. Although the "oracle" $\beta^*$ will generally not be known, as in Lin (2013), a plug-in estimator based on a sample analog of $\beta^*$ will achieve the efficient variance in large populations.

We next consider the more practically relevant case in which there is staggered timing across multiple periods. There are $T$ periods, and unit $i$ is first treated in period $G_i \in \mathcal{G} \subseteq \{1, \ldots, T, \infty\}$, with $G_i = \infty$ denoting that $i$ is never treated. There are many possible ways of aggregating treatment effects across cohorts and time periods in the staggered treatment setting, and so we consider a broad class of estimands that encompasses many possible aggregation schemes. Specifically, we define $\tau_{t,gg'}$ to be the average effect on the outcome in period $t$ of changing the initial treatment date from $g'$ to $g$. We then consider the class of estimands that are linear combinations of these building blocks, $\theta = \sum_{t,g,g'} a_{t,gg'} \tau_{t,gg'}$. Our framework thus accommodates a variety of summary measures of dynamic treatment effects, including several aggregation schemes proposed in the recent literature.

We consider the class of estimators that start with a sample analog to the target parameter and then adjust by a linear combination of pre-treatment outcomes.
More precisely, we consider estimators of the form $\hat{\theta}_\beta = \sum_{t,g,g'} a_{t,gg'} \hat{\tau}_{t,gg'} - \hat{X}'\beta$, where the first term replaces the $\tau_{t,gg'}$ in the definition of $\theta$ with their sample analogs, and the second term adjusts for a linear combination of $\hat{X}$, where $\hat{X}$ is a vector that compares outcomes for cohorts treated at different dates at points in time before either was treated. We show that a variety of estimation procedures are part of this class for an appropriately defined $\hat{X}$, including the classical TWFE model as well as recent procedures proposed by Sun and Abraham (2020), de Chaisemartin and D'Haultfœuille (2020), and Callaway and Sant'Anna (2020). All estimators of this form are unbiased for $\theta$ under the assumptions of random treatment timing and no anticipation.

We then derive the most efficient estimator in this class. The optimal coefficient $\beta^*$ depends on covariances between the potential outcomes over time, and thus in general will not coincide with the fixed coefficients in any of the previously proposed procedures discussed above. As in the two-period case, the "oracle" value $\beta^*$ will typically not be known ex ante, and will need to be replaced with a sample analog $\hat{\beta}^*$. Similar to the two-period case, we show that the plug-in estimator is asymptotically unbiased and efficient under large-population asymptotics, exploiting the generalized finite-population central limit theorems in Li and Ding (2017). In a Monte Carlo study calibrated to our application, we find that confidence intervals based on the plug-in efficient estimator have good coverage properties and are substantially shorter than those based on the procedures of Callaway and Sant'Anna (2020) and de Chaisemartin and D'Haultfœuille (2020).

As an illustration of our method and a standalone empirical contribution, we re-examine the effectiveness of procedural justice training for police officers. We use data from Wood et al.
(2020a), who studied the randomized rollout of a procedural justice training program in Chicago. The original study by Wood et al. (2020a) found that the program produced large and highly statistically significant reductions in (sustained) complaints against police officers and officer use of force. These findings have been influential in policy discussions about police reform (e.g. Doleac, 2020). However, an earlier version of our analysis revealed a statistical error in the analysis of Wood et al. (2020a), which did not account for the fact that cohorts trained on different days were of different sizes, leading to spuriously large estimates. In Wood, Tyler, Papachristos, Roth and Sant'Anna (2020b), we worked with the authors of the original paper to re-analyze the data using the state-of-the-art tools in Callaway and Sant'Anna (2020). Our re-analysis found no significant effects on complaints or sustained complaints, and borderline significant effects on police use of force, although the confidence intervals for all three outcomes were wide and included both economically small and meaningful effects. In this paper, we re-analyze the data using our proposed methodology.

We find that our proposed methodology allows us to obtain substantially more precise estimates of the effect of the training program, reducing standard errors by a factor of between 1.3 and 5.6 depending on the specification. We again find limited evidence of a meaningful effect of the program on complaints or sustained complaints, and borderline significant overall effects on use of force. Our revised estimates have much greater precision than in our previous analysis, however.
For example, our baseline estimate for the overall average effect on complaints (using a simple aggregation across time periods and cohorts) is -2% relative to the pre-treatment mean with a 95% CI of [-11, 6], compared with an estimate of -10% and CI of [-26, 5] using the procedure of Callaway and Sant'Anna (2020). Likewise, we find a borderline significant effect on use of force of -15% (CI: [-29, 0]), compared with our previous estimate of -22% (CI: [-43, -2]). We caution, however, that the marginally significant results for use of force are not significant after adjusting for testing hypotheses on multiple outcomes.

Related Literature.
This paper contributes to an active literature on difference-in-differences and related methods with staggered treatment timing. Several recent papers have illustrated that the estimand of standard TWFE models may not have an intuitive causal interpretation when there are heterogeneous treatment effects, and new estimators for more sensible causal estimands have been introduced (Athey and Imbens, 2018; Borusyak and Jaravel, 2016; Callaway and Sant'Anna, 2020; de Chaisemartin and D'Haultfœuille, 2020; Goodman-Bacon, 2018; Imai and Kim, 2020; Meer and West, 2016; Słoczyński, 2020; Sun and Abraham, 2020). In contrast to most of the previous literature, we consider the efficiency of various procedures under random treatment timing. This assumption is stronger than the generalized parallel trends assumptions considered in previous work, and thus our proposed method will not be applicable in settings where the researcher is confident in parallel trends but not in random treatment timing. On the other hand, under suitable regularity conditions our proposed plug-in efficient estimator is at least as efficient in large populations as the methods proposed in previous work when treatment timing is (as-if) randomly assigned, and will often be substantially more precise.

Additionally, most of the pre-existing literature has adopted a sampling-based perspective, where uncertainty in the data arises from the sample drawn from a superpopulation. By contrast, we adopt a design-based framework in which the population is fixed and uncertainty arises from the randomness of treatment timing. This framework is useful for formalizing the notion of random treatment timing, and may be especially appealing in settings where the superpopulation is not clear, such as when the researcher has access to all counties in the United States (Manski and Pepper, 2018).
Athey and Imbens (2018) adopt a design-based framework similar to ours, but consider the interpretation of the estimand of two-way fixed effects models rather than efficient estimation. Shaikh and Toulis (2019) consider inference on sharp null hypotheses in a design-based model where treatment timing is random conditional on observables; by contrast, we consider inference on average causal effects under unconditional random treatment timing. (Relatedly, in Roth and Sant'Anna (2021), we show that if treatment timing is not random, then the parallel trends assumption will be sensitive to functional form without strong assumptions on the full distribution of potential outcomes.)

Several papers in both the economics and biostatistics literatures study the efficiency of estimators related to the oracle coefficient $\beta^*$ in a sampling-based model; however, they do not consider estimation or inference when the oracle is unknown. None of the aforementioned papers considers a design-based framework, nor do they study the common case of staggered treatment timing as we do in Section 3.

Our work is also related to Xiong et al. (2019) and Basse, Ding and Toulis (2020), who consider how to optimally design a staggered rollout experiment to maximize the efficiency of a fixed estimator. By contrast, we solve for the most efficient estimator given a fixed experimental design. Ding and Li (2019) show a bracketing relationship between the biases of difference-in-differences and other estimators in the class we consider when treatment timing is not random, but do not consider efficiency under random treatment timing.

Finally, we contribute to the literature on the effectiveness of procedural justice training programs for police officers. Previous work has studied the program in Chicago that we study (Wood et al., 2020a,b) and a smaller pilot evaluation in Seattle (Owens, Weisburd, Amendola and Alpert, 2018).
Although qualitatively in line with previous findings in the literature, our analysis provides by far the most precise estimates from a randomized evaluation. For example, the standard error for our estimate of the effect on citizen complaints, measured as a percentage of the pre-treatment mean, is 1.9 times smaller than the estimate in Wood et al. (2020b) and over 3 times smaller than that in Owens et al. (2018).

We begin by developing intuition for our more general results by considering a canonical two-period difference-in-differences model. All of the results in this section can be viewed as special cases of the more general results for staggered rollouts in Section 3. We provide proofs where we think they will aid in developing intuition, but defer some of the more technical proofs to the theorems in the following section.
There is a finite population of $N$ units. We observe data for 2 periods, $t = 0, 1$. All units are untreated in $t = 0$, and some units receive a treatment of interest in $t = 1$. We denote by $Y_{it}(1), Y_{it}(0)$ the potential outcomes for unit $i$ in period $t$ under treatment and control, respectively, and we observe the outcome $Y_{it} = D_i Y_{it}(1) + (1 - D_i) Y_{it}(0)$, where $D_i$ is an indicator for whether unit $i$ is treated. Following Neyman (1923) for randomized experiments and Athey and Imbens (2018) and Rambachan and Roth (2020) for DiD designs, we treat as fixed (or condition on) the potential outcomes and the number of treated and untreated units ($N_1$ and $N_0$); the only source of uncertainty in our model comes from the vector of treatment assignments $D = (D_1, \ldots, D_N)'$, which is stochastic. All expectations ($\mathbb{E}[\cdot]$) and probability statements ($\mathbb{P}(\cdot)$) are taken over the distribution of $D$ conditional on the number of treated units ($N_1$) and the potential outcomes, although we suppress this conditioning unless needed for clarity. For a non-stochastic attribute $W_i$ (e.g. a function of the potential outcomes), we denote by $\mathbb{E}_f[W_i] = N^{-1} \sum_i W_i$ and $Var_f[W_i] = (N - 1)^{-1} \sum_i (W_i - \mathbb{E}_f[W_i])(W_i - \mathbb{E}_f[W_i])'$ the finite-population expectation and variance of $W_i$.

The target parameter of interest is the average treatment effect in $t = 1$,

$$\tau := \frac{1}{N} \sum_i \left( Y_{i,t=1}(1) - Y_{i,t=1}(0) \right).$$

We now introduce two assumptions that we will maintain throughout our analysis. We first assume that the assignment of treatment status is random.

Assumption 1 (Random treatment assignment (2 periods)). $\mathbb{P}(D = d) = 1 / \binom{N}{N_1}$ if $\sum_i d_i = N_1$, and 0 otherwise.

We also assume that treatment status has no effect on outcomes in $t = 0$, before treatment is implemented. This assumption is plausible in many contexts, but may be violated if individuals learn of treatment status beforehand and adjust their behavior in anticipation (Malani and Reif, 2015).

Assumption 2 (No anticipation (2 periods)).
For all $i$, $Y_{i,t=0}(1) = Y_{i,t=0}(0)$.

(In Section 3, we will index potential outcomes by the date of treatment timing, so $Y_{it}(0)$ in this section corresponds with $Y_{it}(\infty)$ in the notation of Section 3. We use the notation $Y_{it}(0)$ here to make the connections to the literature on randomized experiments more explicit. Note also that because we condition on the number of treated units ($N_1$), in contrast to standard sampling-based approaches, unit $i$'s treatment status $D_i$ is correlated with that of unit $j$.)

2.2 Efficient Estimation and Comparison to DiD

The canonical difference-in-differences estimator is

$$\hat{\tau}_{DiD} = (\bar{Y}_{1,t=1} - \bar{Y}_{0,t=1}) - (\bar{Y}_{1,t=0} - \bar{Y}_{0,t=0}), \quad (1)$$

where $\bar{Y}_{1,t} = N_1^{-1} \sum_i D_i Y_{it}$ and $\bar{Y}_{0,t} = N_0^{-1} \sum_i (1 - D_i) Y_{it}$ are the sample means for the treated and untreated groups in period $t$.

Note that $\hat{\tau}_{DiD}$ is a special case of the class of estimators of the form

$$\hat{\tau}_\beta = (\bar{Y}_{1,t=1} - \bar{Y}_{0,t=1}) - \beta (\bar{Y}_{1,t=0} - \bar{Y}_{0,t=0}).$$

The estimator $\hat{\tau}_\beta$ takes the simple difference-in-means between the treated and control groups in period $t = 1$, and then adjusts by a factor $\beta$ times the difference in means in the pre-treatment period.

We now draw connections between estimators of the form $\hat{\tau}_\beta$ and estimators that apply covariate adjustments in cross-sectional random experiments. Note that under Assumption 2, $Y_{i,t=0} = Y_{i,t=0}(0)$ regardless of $i$'s treatment status. Our setting is thus isomorphic to a cross-sectional randomized experiment in which the outcome of interest is $Y_i = Y_{i,t=1}$ and we have fixed pre-treatment covariates $X_i = Y_{i,t=0}(0)$. In the cross-sectional setting, Lin (2013) and Li and Ding (2017) consider estimators of the form

$$\hat{\tau}(b_1, b_0) = \bar{Y}_1 - \bar{Y}_0 - (\bar{X}_1 - \bar{X}) b_1 + (\bar{X}_0 - \bar{X}) b_0,$$

where $\bar{Y}_1 = N_1^{-1} \sum_i D_i Y_i$ and the other terms are defined analogously. Observe, however, that the unconditional mean $\bar{X} = N^{-1} \sum_i X_i$ is a weighted average of $\bar{X}_1$ and $\bar{X}_0$, i.e. $\bar{X} = (N_1 / N) \bar{X}_1 + (N_0 / N) \bar{X}_0$.
It then follows from some straightforward algebra that

$$\hat{\tau}(b_1, b_0) = (\bar{Y}_1 - \bar{Y}_0) - \left( \frac{N_0}{N} b_1 + \frac{N_1}{N} b_0 \right) (\bar{X}_1 - \bar{X}_0).$$

The estimator $\hat{\tau}(b_1, b_0)$ is thus equivalent to $\hat{\tau}_\beta$ with $\beta = (N_0 / N) b_1 + (N_1 / N) b_0$.

With this equivalence in hand, it is straightforward to apply the results in Lin (2013) and Li and Ding (2017) to (i) show that $\hat{\tau}_\beta$ is unbiased for the ATE for all $\beta$, and (ii) solve for the efficient coefficient $\beta^*$ that minimizes the variance of $\hat{\tau}_\beta$.

Proposition 2.1 (Unbiasedness of $\hat{\tau}_\beta$). Under Assumptions 1 and 2, $\mathbb{E}[\hat{\tau}_\beta] = \tau$ for all $\beta$.

Proof.
The proof is immediate from the results in Lin (2013) and Li and Ding (2017) given the analogy to covariate adjustment in randomized experiments, but we provide a short proof for completeness. Observe that $\bar{Y}_{1,t=1} = N_1^{-1} \sum_i D_i Y_{i,t=1}(1)$. By Assumption 1, $\mathbb{E}[D_i] = N_1 / N$, so

$$\mathbb{E}\left[ \bar{Y}_{1,t=1} \right] = \frac{1}{N_1} \sum_i \mathbb{E}[D_i] \, Y_{i,t=1}(1) = \frac{1}{N} \sum_i Y_{i,t=1}(1).$$

By analogous arguments for the other terms, we have that

$$\mathbb{E}\left[ \hat{\tau}_\beta \right] = \left( \frac{1}{N} \sum_i Y_{i,t=1}(1) - \frac{1}{N} \sum_i Y_{i,t=1}(0) \right) - \beta \left( \frac{1}{N} \sum_i Y_{i,t=0}(1) - \frac{1}{N} \sum_i Y_{i,t=0}(0) \right) = \tau - \beta \left( \frac{1}{N} \sum_i Y_{i,t=0}(1) - \frac{1}{N} \sum_i Y_{i,t=0}(0) \right).$$

However, Assumption 2 implies that the second term in the previous display is zero, which gives the desired result.
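Proposition 2.1 can also be checked numerically. The following is a minimal simulation sketch (not from the paper; the data-generating process, sample sizes, and seed are all hypothetical) illustrating that $\hat{\tau}_\beta$ is unbiased for the finite-population ATE for any choice of $\beta$, including the DiD case $\beta = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, N1 = 200, 80                                 # population size, number treated (hypothetical)
# Fixed potential outcomes; period-0 outcome is unaffected by treatment (no anticipation)
Y_t0 = rng.normal(0.0, 1.0, N)                  # Y_{i,t=0}(0) = Y_{i,t=0}(1)
Y1_t1 = 0.7 * Y_t0 + rng.normal(1.0, 1.0, N)    # Y_{i,t=1}(1)
Y0_t1 = 0.4 * Y_t0 + rng.normal(0.0, 1.0, N)    # Y_{i,t=1}(0)
tau = np.mean(Y1_t1 - Y0_t1)                    # finite-population ATE

def draw_D():
    """Completely random assignment of exactly N1 treated units (Assumption 1)."""
    D = np.zeros(N, dtype=bool)
    D[rng.choice(N, N1, replace=False)] = True
    return D

def tau_hat(D, beta):
    """tau_hat_beta: difference in means in period 1, minus beta times the period-0 difference."""
    dm1 = Y1_t1[D].mean() - Y0_t1[~D].mean()    # observed outcomes in period 1
    dm0 = Y_t0[D].mean() - Y_t0[~D].mean()      # observed (pre-treatment) outcomes in period 0
    return dm1 - beta * dm0

# Monte Carlo means across re-randomizations, for several beta (beta = 1 is canonical DiD)
mc_means = {beta: np.mean([tau_hat(draw_D(), beta) for _ in range(4000)])
            for beta in (0.0, 1.0, 2.5)}
```

Across re-randomizations of $D$, each Monte Carlo mean is close to $\tau$ regardless of $\beta$; only the variance of $\hat{\tau}_\beta$ differs across choices of $\beta$.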
Proposition 2.2.
Let $\beta_d$ be the coefficient on $Y_{i,t=0}(0)$ from a regression of $Y_{i,t=1}(d)$ on $Y_{i,t=0}(0)$ and a constant. Let $\beta^* = (N_0 / N) \beta_1 + (N_1 / N) \beta_0$. If Assumptions 1 and 2 hold, then $Var[\hat{\tau}_{\beta^*}] \le Var[\hat{\tau}_\beta]$ for all $\beta \in \mathbb{R}$, with strict inequality for any $\beta \ne \beta^*$ if $Var_f[Y_{i,t=0}] > 0$.

Proof. We have shown that the estimator $\hat{\tau}_{\beta^*}$ is equivalent to the estimator $\hat{\tau}(\beta_1, \beta_0)$ considered in Lin (2013) and Li and Ding (2017), with $Y_i = Y_{i,t=1}$ and $X_i = Y_{i,t=0}(0)$. It then follows immediately from the results in Lin (2013) and Li and Ding (2017) that $\hat{\tau}_{\beta^*}$ has minimal variance. Further, Li and Ding (2017) show that for any $(b_1, b_0)$, $Var[\hat{\tau}(\beta_1, \beta_0)] < Var[\hat{\tau}(b_1, b_0)]$ unless $Var[\hat{\tau}(\beta_1, \beta_0) - \hat{\tau}(b_1, b_0)] = 0$. Thus, $Var[\hat{\tau}_{\beta^*}] < Var[\hat{\tau}_{\tilde{\beta}}]$ for any $\tilde{\beta}$ unless $Var[\hat{\tau}_{\beta^*} - \hat{\tau}_{\tilde{\beta}}] = 0$. However,

$$\hat{\tau}_{\beta^*} - \hat{\tau}_{\tilde{\beta}} = \hat{\tau}\left(0, \tfrac{N}{N_1} \beta^*\right) - \hat{\tau}\left(0, \tfrac{N}{N_1} \tilde{\beta}\right) = \tfrac{N}{N_0} (\beta^* - \tilde{\beta}) (\bar{X} - \bar{X}_1).$$

Since $\bar{X}_1 = N_1^{-1} \sum_i D_i X_i$ is the mean of a simple random sample of size $N_1$, it has positive variance if $Var_f[X_i] > 0$. Thus, $Var[\hat{\tau}_{\beta^*}] < Var[\hat{\tau}_{\tilde{\beta}}]$ if $Var_f[X_i] > 0$ and $\beta^* \ne \tilde{\beta}$.

Proposition 2.2 implies that unless the potential outcomes happen to be such that $(N_0 / N) \beta_1 + (N_1 / N) \beta_0 = 1$, the variance of $\hat{\tau}_{DiD}$ is dominated by that of $\hat{\tau}_{\beta^*}$.

2.3 The plug-in efficient estimator

In practical settings, however, the "oracle" coefficient $\beta^*$ is not known. Mirroring Lin (2013) in the cross-sectional case, we now show that $\beta^*$ can be approximated by a plug-in estimate $\hat{\beta}^*$, and the resulting estimator $\hat{\tau}_{\hat{\beta}^*}$ has similar properties to the "oracle" estimator $\hat{\tau}_{\beta^*}$.

We first describe the construction of the plug-in estimator. Consider a regression of $Y_{i,t=1}$ on $Y_{i,t=0}$ and a constant among units with $D_i = 1$. Let $\hat{\beta}_1$ denote the coefficient on $Y_{i,t=0}$, i.e.

$$\hat{\beta}_1 = \left( \sum_i D_i \dot{Y}_{i,t=0}^2 \right)^{-1} \left( \sum_i D_i \dot{Y}_{i,t=0} Y_{i,t=1} \right),$$

where $\dot{Y}_{i,t=0}$ is the pre-treatment outcome de-meaned within the relevant treatment group.
Define $\hat{\beta}_0$ to be the coefficient from the analogous regression among $D_i = 0$ units,

$$\hat{\beta}_0 = \left( \sum_i (1 - D_i) \dot{Y}_{i,t=0}^2 \right)^{-1} \left( \sum_i (1 - D_i) \dot{Y}_{i,t=0} Y_{i,t=1} \right).$$

Letting $\hat{\beta}^* = (N_0 / N) \hat{\beta}_1 + (N_1 / N) \hat{\beta}_0$, the estimator $\hat{\tau}_{\hat{\beta}^*}$ is then a feasible approximation to $\hat{\tau}_{\beta^*}$. It is straightforward to show that the estimator $\hat{\tau}_{\hat{\beta}^*}$ is equivalent to the coefficient on $D_i$ in the "interacted" ordinary least squares (OLS) regression considered in Lin (2013),

$$Y_{i,t=1} = \beta_0 + \beta_1 D_i + \beta_2 \dot{Y}_{i,t=0} + \beta_3 D_i \times \dot{Y}_{i,t=0} + \epsilon_i, \quad (2)$$

where in (2) the pre-treatment outcome is de-meaned by the overall sample mean.

We will now show that when the population is sufficiently large, $\hat{\tau}_{\hat{\beta}^*}$ is approximately unbiased for $\tau$ and achieves the same variance as the oracle estimator $\hat{\tau}_{\beta^*}$. As in Lin (2013), Li and Ding (2017), and other papers, we consider sequences of populations indexed by $m$ where $N_{1,m}$ and $N_{0,m}$ grow large. For ease of notation, we leave the index $m$ implicit for the remainder of the paper. We assume the sequence of populations satisfies the following regularity conditions.

Assumption 3 (Sequences of populations). Let $Y_i(d) = (Y_{i,t=0}(d), Y_{i,t=1}(d))'$ for $d = 0, 1$. Let $S_1^2$, $S_0^2$, and $S_{10}$ denote the finite-population variances and covariance of $Y_i(1)$ and $Y_i(0)$:

$$S_d^2 = \frac{1}{N - 1} \sum_i (Y_i(d) - \bar{Y}(d))(Y_i(d) - \bar{Y}(d))', \qquad S_{10} = \frac{1}{N - 1} \sum_i (Y_i(1) - \bar{Y}(1))(Y_i(0) - \bar{Y}(0))',$$

where $\bar{Y}(d) = N^{-1} \sum_i Y_i(d)$. We assume:

(i) $N_1 / N \to p \in (0, 1)$.
(ii) $S_1^2$, $S_0^2$, and $S_{10}$ have finite limiting values, denoted $S_1^{2*} > 0$, $S_0^{2*} > 0$, and $S_{10}^*$.
(iii) $\max_i \| Y_i(d) - \bar{Y}(d) \|^2 / N \to 0$ for $d = 0, 1$.

Part (i) of Assumption 3 states that the fraction of treated units converges to a constant strictly between 0 and 1. Part (ii) states that the variances and covariances of the potential outcomes have limits. Part (iii) requires that no single observation dominates the variance of the potential outcomes, and is thus analogous to the familiar Lindeberg condition in sampling contexts.
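To make the construction concrete, here is a minimal sketch (hypothetical data, not the paper's code) of the plug-in estimator $\hat{\tau}_{\hat{\beta}^*}$, together with a numerical check of its equivalence to the coefficient on $D_i$ in the interacted OLS regression (2):

```python
import numpy as np

rng = np.random.default_rng(1)
N, N1 = 500, 200
N0 = N - N1
X = rng.normal(0.0, 1.0, N)                       # pre-treatment outcome Y_{i,t=0}
D = np.zeros(N, dtype=int)
D[rng.choice(N, N1, replace=False)] = 1
# Observed period-1 outcome (hypothetical DGP with slopes that differ by arm)
Y = np.where(D == 1, 1.0 + 0.8 * X + rng.normal(0, 1, N),
                     0.3 * X + rng.normal(0, 1, N))

def slope(y, x):
    """OLS slope from a regression of y on x and a constant."""
    xc = x - x.mean()
    return (xc @ y) / (xc @ xc)

beta1 = slope(Y[D == 1], X[D == 1])               # regression among treated units
beta0 = slope(Y[D == 0], X[D == 0])               # regression among untreated units
beta_star_hat = (N0 / N) * beta1 + (N1 / N) * beta0

tau_plugin = (Y[D == 1].mean() - Y[D == 0].mean()
              - beta_star_hat * (X[D == 1].mean() - X[D == 0].mean()))

# Equivalent: coefficient on D in the interacted regression, with X de-meaned overall
Xd = X - X.mean()
Z = np.column_stack([np.ones(N), D, Xd, D * Xd])
coef = np.linalg.lstsq(Z, Y, rcond=None)[0]       # coef[1] is the coefficient on D
```

Here `coef[1]` and `tau_plugin` agree up to floating-point error, which is the regression representation used in (2).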
With these assumptions in hand, we can now formally state the sense in which $\hat{\tau}_{\hat{\beta}^*}$ is asymptotically unbiased and as efficient as $\hat{\tau}_{\beta^*}$.

Proposition 2.3.
Under Assumptions 1, 2, and 3,

$$\sqrt{N} \left( \hat{\tau}_{\hat{\beta}^*} - \tau \right) \to_d \mathcal{N}\left( 0, \sigma^{2*} \right), \quad (3)$$

where $\sigma^{2*} = \lim_{N \to \infty} N \, Var[\hat{\tau}_{\beta^*}]$.

Proof. Since $\hat{\tau}_{\hat{\beta}^*}$ is equivalent to the coefficient on $D_i$ from (2), the result follows immediately from Theorem 1 in Lin (2013). Alternatively, this can be viewed as a special case of Proposition 3.2 below.

To develop intuition for how $\hat{\tau}_{\hat{\beta}^*}$ achieves the same asymptotic variance as $\hat{\tau}_{\beta^*}$, observe that we can write

$$\hat{\tau}_{\hat{\beta}^*} = \hat{\tau}_{\beta^*} - (\hat{\beta}^* - \beta^*)(\bar{Y}_{1,t=0} - \bar{Y}_{0,t=0}) = \hat{\tau}_{\beta^*} - (\hat{\beta}^* - \beta^*) \left( (\bar{Y}_{1,t=0} - \bar{Y}_{t=0}) - (\bar{Y}_{0,t=0} - \bar{Y}_{t=0}) \right),$$

where $\bar{Y}_{t=0} = N^{-1} \sum_i Y_{i,t=0}(0)$. By standard arguments for finite populations, $\hat{\beta}^* - \beta^*$ and $\bar{Y}_{d,t=0} - \bar{Y}_{t=0}$ are $O_p(N^{-1/2})$. It follows that $\hat{\tau}_{\hat{\beta}^*} - \hat{\tau}_{\beta^*}$ is the product of $O_p(N^{-1/2})$ terms, and hence is $O_p(N^{-1})$, whereas $\hat{\tau}_{\beta^*} - \tau$ is $O_p(N^{-1/2})$. We thus see that the error induced from estimating $\beta^*$ is of higher order than the variation of $\hat{\tau}_{\beta^*}$. In sufficiently large populations, the noise induced by estimating $\beta^*$ is thus negligible. A similar analysis applies to the finite-sample bias, which as in Lin (2013) is also $O_p(N^{-1})$.

Remark 1.
Building on results in Frison and Pocock (1992), McKenzie (2012) proposes using the coefficient $\gamma_1$ from the OLS regression

$$Y_{i,t=1} = \gamma_0 + \gamma_1 D_i + \gamma_2 Y_{i,t=0} + \epsilon_i, \quad (4)$$

which is sometimes referred to as the Analysis of Covariance (ANCOVA). This differs from the regression representation of the efficient plug-in estimator in (2) in that it omits the interaction term $D_i \times Y_{i,t=0}$. Treating $Y_{i,t=0}$ as a fixed pre-treatment covariate, the coefficient $\hat{\gamma}_1$ from (4) is equivalent to the estimator studied in Freedman (2008b,a). The results in Lin (2013) therefore imply that McKenzie (2012)'s estimator will have the same asymptotic efficiency as $\hat{\tau}_{\hat{\beta}^*}$ under constant treatment effects. Intuitively, this is because the coefficient on the interaction term in (2) converges in probability to 0. However, the results in Freedman (2008b,a) imply that under heterogeneous treatment effects McKenzie (2012)'s estimator may even be less efficient than the unadjusted difference-in-means $\hat{\tau}_0$, which in turn is (weakly) less efficient than $\hat{\tau}_{\hat{\beta}^*}$. Relatedly, Wan (2020) proves that the coefficient on $D_i$ from (2) is asymptotically at least as efficient as $\hat{\gamma}_1$ from (4) in a sampling-based model that assumes normally distributed potential outcomes.

To form confidence sets for $\tau$ based on (3), one needs to estimate the variance $\sigma^{2*}$. As is typical in finite-population settings, it is not possible to obtain a consistent variance estimate under treatment effect heterogeneity. We show, however, that one can obtain a consistent estimator for an upper bound on the asymptotic variance. The variance estimator is less conservative than the conventional Neyman estimator in that it accounts for heterogeneity that is explained by lagged outcomes.

We begin with the following decomposition of the variance.

Lemma 2.1.
Under Assumptions 1 and 2,

$$Var[\hat{\tau}_{\beta^*}] = \frac{1}{N_1} \tilde{S}_1^2 + \frac{1}{N_0} \tilde{S}_0^2 - \frac{1}{N} \tilde{S}_\tau^2,$$

where $\tilde{S}_1^2$ is the finite-population variance of $Y_{i,t=1}(1) - \beta_1 Y_{i,t=0}(0)$; $\tilde{S}_0^2$ is the finite-population variance of $Y_{i,t=1}(0) - \beta_0 Y_{i,t=0}(0)$; and $\tilde{S}_\tau^2$ is the finite-population variance of $Y_{i,t=1}(1) - Y_{i,t=1}(0) - (\beta_1 - \beta_0) Y_{i,t=0}(0)$.

Proof. Immediate from Example 9.1 in Li and Ding (2017).
Proposition 2.4.
Let

$$\tilde{s}_1^2 = \frac{1}{N_1} \sum_i D_i \left( \dot{Y}_{i,t=1} - \dot{Y}_{i,t=0} \hat{\beta}_1 \right)^2, \qquad \tilde{s}_0^2 = \frac{1}{N_0} \sum_i (1 - D_i) \left( \dot{Y}_{i,t=1} - \dot{Y}_{i,t=0} \hat{\beta}_0 \right)^2,$$

where dotted outcomes are de-meaned within the relevant treatment group. Under Assumptions 1, 2, and 3,

$$\frac{N}{N_1} \tilde{s}_1^2 + \frac{N}{N_0} \tilde{s}_0^2 \to_p \sigma^{2*} + \tilde{S}_\tau^{2*},$$

where $\tilde{S}_\tau^{2*} = \lim_{N \to \infty} \tilde{S}_\tau^2$.

Proof. Follows as a special case of Lemma 3.7 below.

Proposition 2.4 shows that the variance estimate $\tilde{s}^2 = (N / N_1) \tilde{s}_1^2 + (N / N_0) \tilde{s}_0^2$ is asymptotically conservative. It is strictly conservative if $\tilde{S}_\tau^{2*} > 0$, meaning that there is positive asymptotic variance of the "adjusted" treatment effects $\tau_i - (\beta_1 - \beta_0) Y_{i,t=0}(0)$, i.e. heterogeneous treatment effects that are not linear functions of lagged outcomes. We note that in a completely randomized experiment, the typical Neyman variance estimator is conservative by the variance of $\tau_i$. Since $Var_f[\tau_i - (\beta_1 - \beta_0) Y_{i,t=0}(0)] = Var_f[\tau_i] - (\beta_1 - \beta_0)^2 Var_f[Y_{i,t=0}(0)]$, the variance estimator here is less conservative than the usual Neyman variance estimator.

We now extend the results above to the more complex setting in which there is staggered treatment timing across multiple periods.
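Before moving on, the two-period conservative variance estimate and the resulting confidence interval can be sketched as follows (a hypothetical numerical example, not the paper's code; the data-generating process is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, N1 = 500, 200
N0 = N - N1
X = rng.normal(0.0, 1.0, N)                       # pre-treatment outcome Y_{i,t=0}
D = np.zeros(N, dtype=int)
D[rng.choice(N, N1, replace=False)] = 1
Y = np.where(D == 1, 1.0 + 0.8 * X + rng.normal(0, 1, N),
                     0.3 * X + rng.normal(0, 1, N))

def fit(y, x):
    """Group-level regression of y on x and a constant; returns slope and residuals."""
    xc, yc = x - x.mean(), y - y.mean()
    b = (xc @ yc) / (xc @ xc)
    return b, yc - b * xc

b1, r1 = fit(Y[D == 1], X[D == 1])
b0, r0 = fit(Y[D == 0], X[D == 0])
beta_star_hat = (N0 / N) * b1 + (N1 / N) * b0
tau_hat = (Y[D == 1].mean() - Y[D == 0].mean()
           - beta_star_hat * (X[D == 1].mean() - X[D == 0].mean()))

# Conservative variance: within-group residual variances of the group regressions;
# the (unidentified) S_tau^2 / N term is dropped, so the estimate is weakly too large
s2_1 = (r1 @ r1) / (N1 - 1)
s2_0 = (r0 @ r0) / (N0 - 1)
se = np.sqrt(s2_1 / N1 + s2_0 / N0)
ci = (tau_hat - 1.96 * se, tau_hat + 1.96 * se)   # asymptotically conservative 95% CI
```

Because the variance estimate adjusts for heterogeneity explained by the lagged outcome, the resulting interval is (weakly) shorter than one based on the usual Neyman variance.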
There is again a finite population of $N$ units. We observe data for $T$ periods, $t = 1, \ldots, T$. A unit's treatment status is indexed by $G_i \in \mathcal{G} \subseteq \{1, \ldots, T, \infty\}$, where $G_i$ corresponds with the first period in which unit $i$ is treated (and $G_i = \infty$ denotes that a unit is never treated). We assume that treatment is an absorbing state. We denote by $Y_{it}(g)$ the potential outcome for unit $i$ in period $t$ when treatment starts at time $g$, and define the vector $Y_i(g) = (Y_{i1}(g), \ldots, Y_{iT}(g))' \in \mathbb{R}^T$. We let $D_{ig} = 1[G_i = g]$. The observed vector of outcomes for unit $i$ is then $Y_i = \sum_g D_{ig} Y_i(g)$. We treat as fixed the number of units first treated at each time $g$, denoted $N_g$, and assume that the timing of treatment is random.

Assumption 4 (Random treatment timing). Let $D$ be the random $N \times |\mathcal{G}|$ matrix with $(i, g)$-th element $D_{ig}$. Then $\mathbb{P}(D = d) = \left( \prod_{g \in \mathcal{G}} N_g! \right) / N!$ if $\sum_i d_{ig} = N_g$ for all $g$, and zero otherwise.

Remark 2 (Stratified treatment assignment). For simplicity, we consider the case of unconditional random treatment timing. In some settings, the treatment timing may be randomized among units with some shared observable characteristics (e.g. counties within a state). In such cases, the methodology developed below can be applied to form efficient estimators for each stratum, and the stratum-level estimates can then be pooled to form aggregate estimates for the population.

As in the two-period model, we also assume that the treatment has no causal impact on the outcome in periods before it is implemented.
Assumption 5 (No anticipation). For all $i$, $Y_{it}(g) = Y_{it}(g')$ for all $g, g' > t$.

Note that this assumption does not restrict treatment effects once treatment has begun: we allow $Y_{it}(g) \ne Y_{it}(g')$ whenever $t \ge \min(g, g')$. Rather, we only require that, say, a unit's outcome in period 1 does not depend on whether it was ultimately treated in period 2 or period 3.

Following Athey and Imbens (2018), we define $\tau_{it,gg'} = Y_{it}(g) - Y_{it}(g')$ to be the causal effect of switching the treatment date from $g'$ to $g$ on unit $i$'s outcome in period $t$. We define $\tau_{t,gg'} = N^{-1} \sum_i \tau_{it,gg'}$ to be the average treatment effect (ATE) of switching treatment from $g'$ to $g$ on outcomes in period $t$. We will consider scalar estimands of the form

$$\theta = \sum_{t,g,g'} a_{t,gg'} \tau_{t,gg'}, \quad (5)$$

i.e. weighted sums of the average treatment effects of switching the treatment date from $g'$ to $g$. Researchers will often be interested in weighted averages of the $\tau_{t,gg'}$, in which case the $a_{t,gg'}$ will sum to 1, although our results allow for general $a_{t,gg'}$. The results extend easily to vector-valued $\theta$'s where each component is of the form in the previous display; we focus on the scalar case for ease of notation. The no-anticipation assumption (Assumption 5) implies that $\tau_{t,gg'} = 0$ if $t < \min(g, g')$, and so without loss of generality we make the normalization that $a_{t,gg'} = 0$ if $t < \min(g, g')$.

Researchers are often interested in the effect of receiving treatment at a particular time relative to not receiving treatment at all. We will define $ATE(t, g) := \tau_{t,g\infty}$ to be the average treatment effect on the outcome in period $t$ of being first treated in period $g$ relative to not being treated at all.
The $ATE(t,g)$ is a close analog of the cohort average treatment effects on the treated considered in Callaway and Sant'Anna (2020) and Sun and Abraham (2020). The main difference is that those papers do not assume random treatment timing, and thus consider treatment effects on the treated population rather than average treatment effects (in a sampling-based framework).

Our framework incorporates a variety of possible summary measures that aggregate the $ATE(t,g)$ across different cohorts and time periods. The following definitions mirror those proposed in Callaway and Sant'Anna (2020) for the $ATT(t,g)$. (Recall that our results allow for general $a_{t,gg'}$; this permits, for instance, $\theta$ to represent the difference between long-run and short-run effects, so that some of the $a_{t,gg'}$ are negative.) We define the simple weighted ATE to be the weighted average of the $ATE(t,g)$, where each $ATE(t,g)$ is weighted by the cohort size $N_g$,

\[ \theta_{simple} = \frac{1}{\sum_t \sum_{g: g \le t} N_g} \sum_t \sum_{g: g \le t} N_g\, ATE(t,g). \]

Likewise, we define the time- and cohort-specific weighted averages as

\[ \theta_t = \frac{1}{\sum_{g: g \le t} N_g} \sum_{g: g \le t} N_g\, ATE(t,g) \quad \text{and} \quad \theta_g = \frac{1}{T - g + 1} \sum_{t: t \ge g} ATE(t,g), \]

and introduce the summary parameters

\[ \theta_{calendar} = \frac{1}{T}\sum_t \theta_t \quad \text{and} \quad \theta_{cohort} = \frac{1}{|\mathcal{G} \setminus \{\infty\}|} \sum_{g: g \neq \infty} \theta_g, \]

where $|A|$ denotes the cardinality of a set $A$. Finally, we introduce "event-study" parameters that aggregate the treatment effects at a given lag $l$ since treatment began,

\[ \theta_{ES}^{l} = \frac{1}{\sum_{g: g+l \le T} N_g} \sum_{g: g+l \le T} N_g\, ATE(g+l, g). \]

Note that the instantaneous parameter $\theta_{ES}^{0}$ is analogous to the estimand considered in de Chaisemartin and D'Haultfœuille (2020) in settings like ours where treatment is an absorbing state (although their framework also extends to the more general setting where treatment turns on and off).

We now introduce the class of estimators we will consider. Let $\hat{\bar Y}_g = N_g^{-1}\sum_i D_{ig} Y_i$ be the sample mean of the outcome vector for treatment group $g$, and let $\hat\tau_{t,gg'} = \hat{\bar Y}_{g,t} - \hat{\bar Y}_{g',t}$ be the sample analog of $\tau_{t,gg'}$.
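Before turning to estimation, the aggregation schemes above can be illustrated with hypothetical numbers (an illustrative sketch of ours; the $ATE(t,g)$ values and cohort sizes below are invented, not from the paper):

```python
import numpy as np

# Hypothetical ATE(t, g) values and cohort sizes: T = 3 periods, cohorts
# first treated in g = 2 or g = 3. Keys of ATE are (t, g) pairs with t >= g.
ATE = {(2, 2): 1.0, (3, 2): 2.0, (3, 3): 0.5}
N_g = {2: 30, 3: 70}
T = 3

# Simple weighted ATE: each ATE(t, g) weighted by its cohort size N_g.
theta_simple = (sum(N_g[g] * ATE[t, g] for (t, g) in ATE)
                / sum(N_g[g] for (t, g) in ATE))

# Cohort-specific averages theta_g (unweighted over t >= g), and their
# unweighted mean theta_cohort across treated cohorts.
theta_g = {g: np.mean([ATE[t, g] for t in range(g, T + 1)]) for g in N_g}
theta_cohort = np.mean(list(theta_g.values()))

# Event-study parameter at lag l = 0: cohort-size-weighted average of ATE(g, g).
elig = [g for g in N_g if g + 0 <= T]
theta_ES0 = sum(N_g[g] * ATE[g, g] for g in elig) / sum(N_g[g] for g in elig)
```

Each summary parameter is just a different weighting of the same underlying $ATE(t,g)$ building blocks, which is why all of them fit the generic form (5).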
We define

\[ \hat\theta_0 = \sum_{t,g,g'} a_{t,gg'}\, \hat\tau_{t,gg'}, \]

which replaces the population means in the definition of $\theta$ with their sample analogs. We will consider estimators of the form

\[ \hat\theta_\beta = \hat\theta_0 - \hat X'\beta, \tag{6} \]

where, intuitively, $\hat X$ is a vector of differences-in-means that are guaranteed to be mean-zero under the assumptions of random treatment timing and no anticipation. Formally, we consider $M$-dimensional vectors $\hat X$ where each element of $\hat X$ takes the form

\[ \hat X_j = \sum_{(t,g,g'):\, g, g' > t} a^{j}_{t,gg'}\, \hat\tau_{t,gg'}. \]

There are many possible choices of $\hat X$ that satisfy these restrictions. For example, $\hat X$ could be a vector in which each component equals $\hat\tau_{t,gg'}$ for a different combination of $(t,g,g')$ with $t < g, g'$. Alternatively, $\hat X$ could be a scalar that takes a weighted average of such differences. The choice of $\hat X$ is analogous to the choice of which variables to control for in a simple randomized experiment: in principle, including more covariates (a higher-dimensional $\hat X$) can improve asymptotic precision, yet including "too many" covariates may lead to over-fitting and thus to poor performance in practice. For now, we suppose the researcher has chosen a fixed $\hat X$, and we consider the optimal choice of $\beta$ for a given $\hat X$. We return to the choice of $\hat X$ in the discussion of our Monte Carlo results in Section 4 below.

Several estimators proposed in the literature can be viewed as special cases of the class of estimators we consider, with $\beta = 1$ and an appropriately-defined scalar $\hat X$.

Example 1 (Callaway and Sant'Anna (2020)). For settings where there is a never-treated group (i.e., $\infty \in \mathcal G$), Callaway and Sant'Anna (2020) consider the estimator

\[ \hat\tau^{CS}_{tg} = \hat\tau_{t,g\infty} - \hat\tau_{g-1,g\infty}, \]

i.e., a difference-in-differences that compares outcomes between periods $t$ and $g-1$ for the cohort first treated in period $g$ relative to the never-treated cohort. It is clear that $\hat\tau^{CS}_{tg}$ can be viewed as an estimator of
$ATE(t,g)$ of the form given in (6), with $\hat X = \hat\tau_{g-1,g\infty}$ and $\beta = 1$. Likewise, Callaway and Sant'Anna (2020) consider estimators that aggregate the $\hat\tau^{CS}_{tg}$, say $\hat\tau^{CS}_w = \sum_{t,g} w_{t,g}\,\hat\tau^{CS}_{t,g}$, which can be viewed as an estimator of the parameter $\theta_w = \sum_{t,g} w_{t,g}\, ATE(t,g)$ of the form (6) with $\hat X = \sum_{t,g} w_{t,g}\,\hat\tau_{g-1,g\infty}$ and $\beta = 1$. (This could also be viewed as an estimator of the form (6) if $\hat X$ were a vector with elements corresponding with the $\hat\tau_{g-1,g\infty}$ and $\beta$ a vector with elements corresponding with the $w_{t,g}$.) Similarly, Callaway and Sant'Anna (2020) consider an estimator that replaces the never-treated group with an average over cohorts not yet treated in period $t$,

\[ \hat\tau^{CS'}_{tg} = \sum_{g' > t}\frac{N_{g'}}{\sum_{g'' > t}N_{g''}}\,\hat\tau_{t,gg'} \;-\; \sum_{g' > t}\frac{N_{g'}}{\sum_{g'' > t}N_{g''}}\,\hat\tau_{g-1,gg'}, \qquad \text{for } t \ge g. \]

(In principle, the vector $\hat X$ could also include pre-treatment differences in means of non-linear transformations of the outcome as well; see Guo and Basse (2020) for related results on non-linear covariate adjustments in randomized experiments.)
It is again apparent that this estimator can be written as an estimator of
$ATE(t,g)$ of the form in (6), with $\hat X$ now corresponding with a weighted average of the $\hat\tau_{g-1,gg'}$ and $\beta$ again equal to 1.

Example 2 (Sun and Abraham (2020)). Sun and Abraham (2020) consider an estimator that is equivalent to that in Callaway and Sant'Anna (2020) in the case where there is a never-treated cohort. When there is no never-treated group, Sun and Abraham (2020) propose using the last cohort to be treated as the control. Formally, they consider the estimator of
$ATE(t,g)$ of the form

\[ \hat\tau^{SA}_{tg} = \hat\tau_{t,g g_{max}} - \hat\tau_{s,g g_{max}}, \]

where $g_{max} = \max \mathcal G$ is the last period in which units begin receiving treatment and $s < g$ is some reference period before $g$ (e.g., $g - 1$). It is clear that $\hat\tau^{SA}_{tg}$ takes the form (6), with $\hat X = \hat\tau_{s,g g_{max}}$ and $\beta = 1$. Weighted averages of the $\hat\tau^{SA}_{tg}$ can likewise be expressed in the form (6), analogous to the Callaway and Sant'Anna (2020) estimators.
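To illustrate how Examples 1 and 2 fit the form (6), the following sketch (our own illustration on simulated placeholder data, not the authors' code) computes a Callaway and Sant'Anna (2020) style estimate of $ATE(2, 2)$ as $\hat\theta_0 - \hat X\beta$ with $\beta = 1$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy panel: N = 60 units, T = 3 periods (columns 0..2 hold t = 1..3),
# cohort g = 2 treated from period 2 onward, plus a never-treated group.
N, T = 60, 3
G = np.array([2] * 20 + [np.inf] * 40)
Y = rng.normal(size=(N, T))
Y[G == 2, 1:] += 1.0  # add a unit treatment effect in periods t >= 2

def tau_hat(t, g, g2):
    """Sample analog tau_hat_{t,gg'} = Ybar_{g,t} - Ybar_{g',t}."""
    return Y[G == g, t - 1].mean() - Y[G == g2, t - 1].mean()

theta0 = tau_hat(2, 2, np.inf)   # post-period difference in means
X_hat = tau_hat(1, 2, np.inf)    # pre-period difference in means (mean zero)
tau_cs = theta0 - 1.0 * X_hat    # form (6) with beta = 1: the DiD comparison
```

Setting the coefficient on `X_hat` to 0 instead of 1 would give the simple difference in means, which previews the choice of $\beta$ studied below.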
Example 3 (de Chaisemartin and D'Haultfœuille (2020)). de Chaisemartin and D'Haultfœuille (2020) propose an estimator of the instantaneous effect of a treatment. Although their estimator extends to settings where treatment turns on and off, in a setting like ours where treatment is an absorbing state, their estimator can be written as a linear combination of the $\hat\tau^{CS'}_{tg}$. In particular, they consider a weighted average of the treatment effect estimates for the first period in which a unit was treated,

\[ \hat\tau^{dCH} = \sum_{g: g \le T} \frac{N_g}{\sum_{g': g' \le T} N_{g'}}\, \hat\tau^{CS'}_{g,g}. \]

It is thus immediate from the previous examples that their estimator can also be written in the form (6).
Example 4 (Two-period DiD). It may be instructive to consider how the estimators from the two-period model in Section 2 fit into the general framework. That model corresponds with $T = 2$ and $\mathcal G = \{2, \infty\}$. $\hat X$ is simply the difference in sample means between the treatment groups in the pre-treatment period, $\hat\tau_{1,2\infty}$. The DiD estimator in the two-period model thus corresponds with $\hat\theta_1$ (i.e., $\beta = 1$), while the simple difference in means corresponds with $\hat\theta_0$.

Example 5 (TWFE models). Athey and Imbens (2018) consider the setting with $\mathcal G = \{1, \ldots, T, \infty\}$. Let $D_{it} = 1[G_i \le t]$ be an indicator for whether unit $i$ has been treated by period $t$. (Note that the potential outcomes $Y_{it}(\infty)$ and $Y_{it}(g)$ used in this section correspond with the more traditional $Y_{it}(0)$ and $Y_{it}(1)$ used in Section 2.) They show that the estimated coefficient $\hat\theta_{TWFE}$ on $D_{it}$ from the two-way fixed effects specification

\[ Y_{it} = \alpha_i + \lambda_t + D_{it}\theta_{TWFE} + \epsilon_{it} \tag{7} \]

can be decomposed as

\[ \hat\theta_{TWFE} = \sum_t \sum_{(g,g'):\, \min(g,g') \le t} \gamma_{t,gg'}\, \hat\tau_{t,gg'} + \sum_t \sum_{(g,g'):\, \min(g,g') > t} \gamma_{t,gg'}\, \hat\tau_{t,gg'} \tag{8} \]

for weights $\gamma_{t,gg'}$ that depend only on the $N_g$ and $N$, and thus are non-stochastic in our framework. Thus, $\hat\theta_{TWFE}$ can be viewed as an estimator of the form (6) for the parameter $\theta_{TWFE} = \sum_t \sum_{(g,g'): \min(g,g') \le t} \gamma_{t,gg'}\, \tau_{t,gg'}$, with $\hat X = -\sum_t \sum_{(g,g'): \min(g,g') > t} \gamma_{t,gg'}\, \hat\tau_{t,gg'}$ and $\beta = 1$.

Notation.
Recall that the sample treatment effect estimates $\hat\tau_{t,gg'}$ are themselves differences in sample means, $\hat\tau_{t,gg'} = \hat{\bar Y}_{g,t} - \hat{\bar Y}_{g',t}$. It follows that we can write

\[ \hat\theta_0 = \sum_g A_{\theta,g}\, \hat{\bar Y}_g \quad \text{and} \quad \hat X = \sum_g A_{0,g}\, \hat{\bar Y}_g \]

for appropriately defined matrices $A_{\theta,g}$ and $A_{0,g}$ of dimension $1 \times T$ and $M \times T$, respectively. These definitions will be useful in deriving our theoretical results below.

We now consider the problem of finding the best estimator $\hat\theta_\beta$ of the form introduced in (6). First, it is straightforward to show that $\hat\theta_\beta$ is unbiased for any fixed $\beta$.

Lemma 3.1 ($\hat\theta_\beta$ unbiased). Under Assumptions 4 and 5, $E[\hat\theta_\beta] = \theta$ for any $\beta \in \mathbb{R}^M$.

Proof. By Assumption 4, $E[D_{ig}] = N_g/N$. Hence,

\[ E[\hat\theta_0] = E\Big[\sum_g A_{\theta,g}\frac{1}{N_g}\sum_i D_{ig} Y_i\Big] = \sum_g A_{\theta,g}\frac{1}{N_g}\sum_i E[D_{ig}]\, Y_i(g) = \sum_g A_{\theta,g}\frac{1}{N}\sum_i Y_i(g) = \theta. \]

Likewise,

\[ E[\hat X] = E\Big[\sum_g A_{0,g}\frac{1}{N_g}\sum_i D_{ig} Y_i\Big] = \sum_g A_{0,g}\frac{1}{N}\sum_i Y_i(g) = \frac{1}{N}\sum_i \sum_g A_{0,g}\, Y_i(g) = 0, \]

since $\sum_g A_{0,g} Y_i(g) = 0$ for each $i$ by Assumption 5. The result follows immediately from the previous two displays. □

We now turn our attention to deriving the variance of $\hat\theta_\beta$ and solving for the most efficient $\beta$. We first introduce some notation. Let $S_g = (N-1)^{-1}\sum_i (Y_i(g) - \bar Y(g))(Y_i(g) - \bar Y(g))'$ be the finite-population variance of $Y_i(g)$, and let $S_{gg'} = (N-1)^{-1}\sum_i (Y_i(g) - \bar Y(g))(Y_i(g') - \bar Y(g'))'$ be the finite-population covariance. To derive the variance of $\hat\theta_\beta$, it will be useful to first solve for the joint variance of $(\hat\theta_0, \hat X')'$.

Lemma 3.2.
Under Assumptions 4 and 5,

\[ Var\left[\begin{pmatrix}\hat\theta_0 \\ \hat X\end{pmatrix}\right] = \begin{pmatrix} \sum_g N_g^{-1} A_{\theta,g} S_g A_{\theta,g}' - S_\theta & \sum_g N_g^{-1} A_{\theta,g} S_g A_{0,g}' \\ \sum_g N_g^{-1} A_{0,g} S_g A_{\theta,g}' & \sum_g N_g^{-1} A_{0,g} S_g A_{0,g}' \end{pmatrix} =: \begin{pmatrix} V_{\hat\theta} & V_{\hat\theta,X} \\ V_{X,\hat\theta} & V_X \end{pmatrix}, \]

where $S_\theta = Var_f\big[\sum_g A_{\theta,g} Y_i(g)\big]$.

Proof. Let $A_{\tau,g} = \begin{pmatrix} A_{\theta,g} \\ A_{0,g} \end{pmatrix}$. Then we can write

\[ \hat\tau := \sum_g A_{\tau,g}\, \hat{\bar Y}_g = \begin{pmatrix}\hat\theta_0 \\ \hat X\end{pmatrix}. \]

Since Assumption 4 holds, we can appeal to Theorem 3 in Li and Ding (2017), which implies that $Var[\hat\tau] = \sum_g N_g^{-1} A_{\tau,g} S_g A_{\tau,g}' - S_\tau$, where $S_\tau = Var_f\big[\sum_g A_{\tau,g} Y_i(g)\big]$. The result then follows immediately from expanding this variance, together with the observation that

\[ S_\tau = \begin{pmatrix} S_\theta & 0 \\ 0 & 0 \end{pmatrix}, \]

which follows from the fact that $\sum_g A_{0,g} Y_i(g) = E_f\big[\sum_g A_{0,g} Y_i(g)\big] = 0$ for all $i$ by Assumption 5. □

The variance of $\hat\theta_\beta$ then follows immediately.

Corollary 3.1.
Under Assumptions 4 and 5, $Var[\hat\theta_\beta] = V_{\hat\theta} + \beta' V_X \beta - 2\, V_{\hat\theta,X}\,\beta$.

Having solved for $Var[\hat\theta_\beta]$, we now derive the $\beta^*$ that minimizes the variance.

Proposition 3.1.
Suppose Assumptions 4 and 5 hold and that $V_X$ is full rank. Let $\beta^* = V_X^{-1} V_{X,\hat\theta}$, for $V_X$ and $V_{X,\hat\theta}$ as defined in Lemma 3.2. Then

\[ Var[\hat\theta_{\beta^*}] = V_{\hat\theta} - V_{\hat\theta,X} V_X^{-1} V_{X,\hat\theta} \le Var[\hat\theta_\beta] \quad \text{for all } \beta \in \mathbb{R}^M. \]

Proof. First, note that

\[ Var[\hat\theta_{\beta^*}] = Var[\hat\theta_0] + \beta^{*\prime}\, Var[\hat X]\, \beta^* - 2\, Cov[\hat\theta_0, \hat X]'\beta^* = V_{\hat\theta} + V_{\hat\theta,X}V_X^{-1}V_{X,\hat\theta} - 2\,V_{\hat\theta,X}V_X^{-1}V_{X,\hat\theta} = V_{\hat\theta} - V_{\hat\theta,X}V_X^{-1}V_{X,\hat\theta}. \]

Next, note that for any $\beta$,

\[ Var[\hat\theta_\beta] = Var[\hat\theta_0 - \hat X'\beta] = E\big[(\hat\theta_0 - \theta - \hat X'\beta)^2\big], \]

where we use the fact that $E[\hat\theta_0] = \theta$ from Lemma 3.1 and that $E[\hat X] = 0$ from the construction of $\hat X$ along with Assumption 5. It is then immediate that $\beta^*$ is optimal if and only if it solves the least-squares problem $\min_\beta E[(\hat\theta_0 - \theta - \hat X'\beta)^2]$. The solution is

\[ \beta^* = E[\hat X \hat X']^{-1} E[\hat X(\hat\theta_0 - \theta)] = V_X^{-1} V_{X,\hat\theta}, \]

as needed. □

As in Section 2, the efficient estimator $\hat\theta_{\beta^*}$ is not of practical use, since the "oracle" coefficient $\beta^*$ is not known. We now show that in large populations a feasible plug-in estimator $\hat\theta_{\hat\beta^*}$ has similar properties to the oracle estimator. In particular, let

\[ \hat S_g = \frac{1}{N_g - 1}\sum_i D_{ig}\,(Y_i - \hat{\bar Y}_g)(Y_i - \hat{\bar Y}_g)' \]

and let $\hat V_{X,\hat\theta}$ and $\hat V_X$ be the analogs to $V_{X,\hat\theta}$ and $V_X$ that replace $S_g$ with $\hat S_g$ in the definitions. We then define $\hat\beta^* = \hat V_X^{-1}\hat V_{X,\hat\theta}$. We will now show that the feasible plug-in estimator $\hat\theta_{\hat\beta^*}$ is asymptotically unbiased and as efficient as the oracle estimator $\hat\theta_{\beta^*}$.

We again consider a sequence of finite populations satisfying certain regularity conditions, analogous to the exercise in Section 2.

Assumption 6. (i) For all $g \in \mathcal G$, $N_g/N \to p_g \in (0,1)$. (ii) For all $g, g'$, $S_g$ and $S_{gg'}$ have limiting values, denoted $S^*_g$ and $S^*_{gg'}$ respectively, with $S^*_g$ positive definite. (iii) $\max_{i,g} \|Y_i(g) - \bar Y(g)\|_2^2 / N \to 0$.

Assumption 6 is analogous to Assumption 3 for the two-period case. Part (i) requires that the probability that treatment begins in period $g \in \mathcal G$ converges to a constant strictly between 0 and 1.
Part (ii) requires that the variances and covariances of the potential outcomes converge to constants. Part (iii) requires that no single observation dominates the finite-population variance of the potential outcomes.

We now provide two lemmas that characterize the asymptotic joint distribution of $(\hat\theta_0, \hat X')'$ and show that $\hat S_g$ is consistent for $S^*_g$ under Assumption 6. Both results are direct consequences of the general asymptotic results in Li and Ding (2017) for multi-valued treatments in randomized experiments.

Lemma 3.3.
Under Assumptions 4, 5, and 6,

\[ \sqrt{N}\begin{pmatrix}\hat\theta_0 - \theta \\ \hat X\end{pmatrix} \to_d \mathcal N(0, V^*), \]

where

\[ V^* = \begin{pmatrix} \sum_g p_g^{-1} A_{\theta,g} S^*_g A_{\theta,g}' - S^*_\theta & \sum_g p_g^{-1} A_{\theta,g} S^*_g A_{0,g}' \\ \sum_g p_g^{-1} A_{0,g} S^*_g A_{\theta,g}' & \sum_g p_g^{-1} A_{0,g} S^*_g A_{0,g}' \end{pmatrix} =: \begin{pmatrix} V^*_{\hat\theta} & V^*_{\hat\theta,X} \\ V^*_{X,\hat\theta} & V^*_X \end{pmatrix}, \]

and $S^*_\theta = \lim_{N\to\infty} S_\theta$ (where $S_\theta$ is defined in Lemma 3.2).

Proof. As in the proof of Lemma 3.2, we can write

\[ \hat\tau = \sum_g A_{\tau,g}\,\hat{\bar Y}_g = \begin{pmatrix}\hat\theta_0 \\ \hat X\end{pmatrix}. \]

The result then follows from Theorem 5 in Li and Ding (2017), combined with the observation noted in the proof of Lemma 3.2 that

\[ S_\tau = \begin{pmatrix} S_\theta & 0 \\ 0 & 0 \end{pmatrix}, \quad \text{and hence} \quad S_\tau \to \begin{pmatrix} S^*_\theta & 0 \\ 0 & 0 \end{pmatrix}. \quad \Box \]

Lemma 3.4.
Under Assumptions 4, 5, and 6, $\hat S_g \to_p S^*_g$ for all $g$.

Proof. Follows immediately from Proposition 3 in Li and Ding (2017). □

It is now straightforward to derive the limiting distribution of $\hat\theta_{\hat\beta^*}$.

Proposition 3.2.
Under Assumptions 4, 5, and 6,

\[ \sqrt{N}\,(\hat\theta_{\hat\beta^*} - \theta) \to_d \mathcal N(0, \sigma^{*2}), \quad \text{where} \quad \sigma^{*2} = V^*_{\hat\theta} - V^{*\prime}_{X,\hat\theta}\,(V^*_X)^{-1}\,V^*_{X,\hat\theta} = \lim_{N\to\infty} N\, Var[\hat\theta_{\beta^*}]. \]

Proof.
Recall that $\hat\beta^* = \hat V_X^{-1}\hat V_{X,\hat\theta}$. It is clear that $\hat\beta^*$ is a continuous function of $\hat V_X$ and $\hat V_{X,\hat\theta}$, and that $\hat V_X$ and $\hat V_{X,\hat\theta}$ are continuous functions of the $\hat S_g$. From Lemma 3.4 along with the continuous mapping theorem, we obtain that $\hat\beta^* \to_p (V^*_X)^{-1}V^*_{X,\hat\theta}$. Lemma 3.3 together with Slutsky's lemma then gives that $\sqrt N(\hat\theta_{\hat\beta^*} - \theta) \to_d \mathcal N\big(0,\, V^*_{\hat\theta} - V^{*\prime}_{X,\hat\theta}(V^*_X)^{-1}V^*_{X,\hat\theta}\big)$. From Proposition 3.1, it is apparent that the asymptotic variance of $\hat\theta_{\hat\beta^*}$ is equal to the limit of $N\,Var[\hat\theta_{\beta^*}]$, which completes the proof. □

To construct confidence intervals using Proposition 3.2, one requires an estimate of $\sigma^{*2}$. We first introduce a simple Neyman-style variance estimator that is conservative under treatment effect heterogeneity. We then introduce a refinement to this estimator that adjusts for the part of the heterogeneity explained by $\hat X$.

From Proposition 3.2, as well as the definition of $V^*$, we have that

\[ \sigma^{*2} = \Big(\sum_g p_g^{-1} A_{\theta,g} S^*_g A_{\theta,g}' - S^*_\theta\Big) - \Big(\sum_g p_g^{-1} A_{\theta,g} S^*_g A_{0,g}'\Big)\Big(\sum_g p_g^{-1} A_{0,g} S^*_g A_{0,g}'\Big)^{-1}\Big(\sum_g p_g^{-1} A_{0,g} S^*_g A_{\theta,g}'\Big). \]

Since $S^*_g$ is consistently estimable (Lemma 3.4), a natural conservative (Neyman-style) variance estimator replaces $S^*_g$ with $\hat S_g$ and ignores the $-S^*_\theta$ term. That is, we consider

\[ \hat\sigma^{*2} = \Big(\sum_g \frac{N}{N_g} A_{\theta,g} \hat S_g A_{\theta,g}'\Big) - \Big(\sum_g \frac{N}{N_g} A_{\theta,g} \hat S_g A_{0,g}'\Big)\Big(\sum_g \frac{N}{N_g} A_{0,g} \hat S_g A_{0,g}'\Big)^{-1}\Big(\sum_g \frac{N}{N_g} A_{0,g} \hat S_g A_{\theta,g}'\Big). \]

Lemma 3.5.
Under Assumptions 4, 5, and 6, $\hat\sigma^{*2} \to_p \sigma^{*2} + S^*_\theta \ge \sigma^{*2}$.

Proof. Immediate from Lemma 3.4 combined with the continuous mapping theorem. □

Intuitively, the Neyman-style variance estimator proposed above is conservative when there is treatment effect heterogeneity. (Aronow, Green and Lee (2014) provide sharp bounds on the variance of the difference-in-means estimator in randomized experiments, although these bounds are difficult to extend to other estimators and settings like those considered here.) When the estimand $\theta$ does not involve any treatment effects for the cohort treated in period 1, the estimator $\hat\sigma^{*2}$ can be improved by using outcomes from earlier periods. Intuitively, the refined estimator lower-bounds the heterogeneity in treatment effects by the part of the heterogeneity that is explained by the outcomes in earlier periods. The construction of this refined estimator mirrors the refinements using fixed covariates in randomized experiments considered in Lin (2013) and Abadie, Athey, Imbens and Wooldridge (2020), with lagged outcomes playing a role similar to that of fixed covariates.

Lemma 3.6. Suppose that $A_{\theta,g} = 0$ for all $g < g_{min}$. If Assumption 5 holds, then

\[ S_\theta = Var_f[\tilde\theta_i] + \Big(\sum_{g \ge g_{min}} \beta_g\Big)'\big(M S_{g_{min}} M'\big)\Big(\sum_{g \ge g_{min}} \beta_g\Big), \tag{9} \]

where $M$ is the matrix that selects the rows of $Y_i$ corresponding with $t < g_{min}$; $\beta_g = (M S_g M')^{-1} M S_g A_{\theta,g}'$ is the coefficient from projecting $A_{\theta,g} Y_i(g)$ on $M Y_i(g)$; and $\tilde\theta_i = \sum_{g \ge g_{min}} A_{\theta,g} Y_i(g) - \sum_{g \ge g_{min}} \big(M Y_i(g)\big)'\beta_g$.

Proof. For any functions of the potential outcomes $X_i \in \mathbb{R}^K$ and $Z_i \in \mathbb{R}$, let $\dot X_i = X_i - E_f[X_i]$, $\dot Z_i = Z_i - E_f[Z_i]$, and $\beta_{XZ} = Var_f[X_i]^{-1} E_f[\dot X_i \dot Z_i]$. We claim that

\[ Var_f[Z_i - \beta_{XZ}'X_i] = Var_f[Z_i] - \beta_{XZ}'\, Var_f[X_i]\, \beta_{XZ}. \]
Indeed,

\[ Var_f[Z_i - \beta_{XZ}'X_i] = E_f\big[(\dot Z_i - \beta_{XZ}'\dot X_i)^2\big] = E_f[\dot Z_i^2] + \beta_{XZ}'\,E_f[\dot X_i \dot X_i']\,\beta_{XZ} - 2\,\beta_{XZ}'\,E_f[\dot X_i \dot Z_i] \]
\[ = Var_f[Z_i] + \beta_{XZ}'\,Var_f[X_i]\,\beta_{XZ} - 2\,\beta_{XZ}'\,Var_f[X_i]\,\beta_{XZ} = Var_f[Z_i] - \beta_{XZ}'\,Var_f[X_i]\,\beta_{XZ}. \]

The result then follows from setting $Z_i = \sum_{g \ge g_{min}} A_{\theta,g} Y_i(g)$ and $X_i = M Y_i(g_{min})$, and noting that under Assumption 5, $M Y_i(g_{min}) = M Y_i(g)$ for all $g \ge g_{min}$, and hence $Var_f[M Y_i(g_{min})] = M S_{g_{min}} M' = M S_g M' = Var_f[M Y_i(g)]$. □

The second term on the right-hand side of (9) is consistently estimable, which allows us to obtain a tighter variance estimate. In particular, let

\[ \hat\sigma^{**2} = \hat\sigma^{*2} - \Big(\sum_{g \ge g_{min}} \hat\beta_g\Big)'\big(M \hat S_{g_{min}} M'\big)\Big(\sum_{g \ge g_{min}} \hat\beta_g\Big), \]

where $\hat\beta_g = (M \hat S_g M')^{-1} M \hat S_g A_{\theta,g}'$. Then $\hat\sigma^{**2}$ is consistent for a tighter upper bound on $\sigma^{*2}$.

Lemma 3.7.
Suppose that $A_{\theta,g} = 0$ for all $g < g_{min}$ and that Assumptions 4-6 hold. Then $\hat\sigma^{**2} \to_p \sigma^{*2} + S^*_{\tilde\theta}$, where $S^*_{\tilde\theta} = \lim_{N\to\infty} Var_f[\tilde\theta_i]$ for $\tilde\theta_i$ defined in Lemma 3.6, and $S^*_{\tilde\theta} \le S^*_\theta$.

(Assumption 5 implies that $M S_g M' = M S_{g_{min}} M'$ for all $g \ge g_{min}$. The term $M \hat S_{g_{min}} M'$ can thus be replaced by any convex combination of the $M \hat S_g M'$ for $g \ge g_{min}$; this has no effect on the asymptotic results, but may improve finite-sample performance.)

Proof. Note that $\hat\beta_g$ is a continuous function of $\hat S_g$. Lemma 3.4 together with the continuous mapping theorem implies that

\[ \Big(\sum_{g \ge g_{min}} \hat\beta_g\Big)'\big(M \hat S_{g_{min}} M'\big)\Big(\sum_{g \ge g_{min}} \hat\beta_g\Big) - \Big(\sum_{g \ge g_{min}} \beta_g\Big)'\big(M S_{g_{min}} M'\big)\Big(\sum_{g \ge g_{min}} \beta_g\Big) \to_p 0. \]

The result is then immediate from Lemmas 3.5 and 3.6. □
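The variance decomposition used in the proof of Lemma 3.6 is a finite-population projection identity, and it can be checked numerically. The sketch below (our own, on simulated data) verifies that $Var_f[Z_i - \beta_{XZ}'X_i] = Var_f[Z_i] - \beta_{XZ}'\,Var_f[X_i]\,\beta_{XZ}$ when $\beta_{XZ}$ is the projection coefficient:

```python
import numpy as np

rng = np.random.default_rng(5)

# A finite population of N = 500 units with scalar Z_i and 2-dimensional X_i
# (simulated, purely illustrative).
N = 500
X = rng.normal(size=(N, 2))
Z = X @ np.array([0.7, -0.3]) + rng.normal(size=N)

def var_f(W):
    """Finite-population (co)variance, with a common normalization."""
    Wc = W - W.mean(axis=0)
    return Wc.T @ Wc / (N - 1)

# Projection coefficient beta_XZ = Var_f[X]^{-1} Cov_f[X, Z].
cov_XZ = (X - X.mean(axis=0)).T @ (Z - Z.mean()) / (N - 1)
beta = np.linalg.solve(var_f(X), cov_XZ)

lhs = var_f(Z - X @ beta)                 # Var_f[Z - beta'X]
rhs = var_f(Z) - beta @ var_f(X) @ beta   # Var_f[Z] - beta' Var_f[X] beta
```

The identity holds exactly (up to floating point) because the cross term $2\beta' Cov_f[X,Z]$ equals $2\beta' Var_f[X]\beta$ by construction of $\beta$, which is the same cancellation used in the proof.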
We now discuss the implications of our results for estimators previously proposed in the literature. As discussed in Examples 1-3 above, the estimators of Callaway and Sant'Anna (2020), Sun and Abraham (2020), and de Chaisemartin and D'Haultfœuille (2020) correspond with the estimator $\hat\theta_1$ for an appropriately defined $\hat X$. Our results thus imply that, unless $\beta^* = 1$, the estimator $\hat\theta_{\beta^*}$ is unbiased for the same estimand and has strictly lower variance under random treatment timing. Since the optimal $\beta^*$ depends on the potential outcomes, we do not generically expect $\beta^* = 1$, and thus the previously-proposed estimators will generically be dominated in terms of efficiency. Although the optimal $\beta^*$ will typically not be known, our results imply that the plug-in estimator $\hat\theta_{\hat\beta^*}$ will have similar properties in large populations, and thus will be more efficient than the previously-proposed estimators in large populations. We note, however, that the estimators in the aforementioned papers are valid for the ATT in settings where parallel trends holds but treatment timing is not random, whereas randomization of treatment timing is necessary for the validity of the efficient estimator. We thus view the results on the efficient estimator as complementary to the estimators considered in previous work.

Similarly, in light of Example 5, our results imply that the TWFE estimator will generally not be the most efficient estimator for the TWFE estimand, $\theta_{TWFE}$. Previous work has argued that the estimand $\theta_{TWFE}$ may not be the most economically interesting estimand and may be difficult to interpret (e.g., Athey and Imbens (2018); Borusyak and Jaravel (2016); Goodman-Bacon (2018); de Chaisemartin and D'Haultfœuille (2020)). Our results provide a new and complementary critique of the TWFE specification: even if $\theta_{TWFE}$ is the target estimand, estimation via (7) will generally be inefficient in large populations under random treatment timing and no anticipation.
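To summarize the estimation and inference procedure developed in this section, here is a compact numerical sketch (our own illustration on simulated two-period data in the spirit of Example 4; all variable names are ours, and this is not the authors' code). It computes the plug-in coefficient $\hat\beta^*$, the efficient estimate $\hat\theta_{\hat\beta^*}$, and a conservative Neyman-style confidence interval based on $\hat\sigma^{*2}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated two-period design: periods t = 1,2; cohort g = 2 is treated in
# period 2 and cohort inf is never treated. Illustrative only.
N1 = N0 = 100
N, rho = N1 + N0, 0.9
Y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N)
G = np.array([2.0] * N1 + [np.inf] * N0)
cohorts = [2.0, np.inf]

# A-matrices: theta_hat_0 is the period-2 difference in means, and the scalar
# covariate X_hat is the period-1 (pre-treatment) difference in means.
A_theta = {2.0: np.array([[0.0, 1.0]]), np.inf: np.array([[0.0, -1.0]])}
A_0 = {2.0: np.array([[1.0, 0.0]]), np.inf: np.array([[-1.0, 0.0]])}

N_g = {g: int((G == g).sum()) for g in cohorts}
Ybar = {g: Y[G == g].mean(axis=0) for g in cohorts}
S_hat = {g: np.cov(Y[G == g], rowvar=False) for g in cohorts}  # S_hat_g

theta0 = sum(A_theta[g] @ Ybar[g] for g in cohorts)
X_hat = sum(A_0[g] @ Ybar[g] for g in cohorts)

V_X = sum(A_0[g] @ S_hat[g] @ A_0[g].T / N_g[g] for g in cohorts)
V_Xtheta = sum(A_0[g] @ S_hat[g] @ A_theta[g].T / N_g[g] for g in cohorts)

beta_star = np.linalg.solve(V_X, V_Xtheta)       # plug-in beta_hat*
theta_eff = (theta0 - X_hat @ beta_star).item()  # plug-in efficient estimate

# Conservative Neyman-style variance (ignores the -S*_theta term); note the
# N/N_g scaling, so the CI divides sigma2_hat by N.
def qform(A, B):
    return sum((N / N_g[g]) * A[g] @ S_hat[g] @ B[g].T for g in cohorts)

sigma2_hat = (qform(A_theta, A_theta)
              - qform(A_theta, A_0) @ np.linalg.inv(qform(A_0, A_0))
              @ qform(A_0, A_theta)).item()
ci = (theta_eff - 1.96 * np.sqrt(sigma2_hat / N),
      theta_eff + 1.96 * np.sqrt(sigma2_hat / N))
```

With highly autocorrelated outcomes, $\hat\beta^*$ lands near 1 and the plug-in estimator behaves like DiD; with uncorrelated outcomes it lands near 0 and behaves like the simple difference in means.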
(The estimator of de Chaisemartin and D'Haultfœuille (2020) can also be applied in settings where treatment turns on and off over time.)

Monte Carlo Results
We present two sets of Monte Carlo results. In Section 4.1, we conduct simulations in a stylized two-period setting like that in Section 2, to illustrate how the efficient estimator compares to the classical difference-in-differences (DiD) and simple difference-in-means (DiM) estimators. Section 4.2 presents a more realistic set of simulations with staggered treatment timing, calibrated to the data from Wood et al. (2020a) that we use in our application.
Specification.
We follow the model in Section 2, in which there are two periods ($t = 0, 1$) and some units are treated in period 1. We first generate the potential outcomes as follows. For each unit $i$ in the population, we draw $Y_i(0) = (Y_{i,t=0}(0), Y_{i,t=1}(0))'$ from a $\mathcal N(0, \Sigma_\rho)$ distribution, where $\Sigma_\rho$ has 1s on the diagonal and $\rho$ on the off-diagonal. The parameter $\rho$ is the correlation between the untreated potential outcomes in period $t=0$ and period $t=1$. We then set $Y_{i,t=1}(1) = Y_{i,t=1}(0) + \tau_i$, where $\tau_i = \gamma\,\big(Y_{i,t=0}(0) - E_f[Y_{i,t=0}(0)]\big)$. The parameter $\gamma$ governs the degree of heterogeneity of treatment effects: if $\gamma = 0$, then there is no treatment effect heterogeneity, whereas if $\gamma$ is positive then individuals with larger untreated outcomes in $t=0$ have larger treatment effects. We center at $E_f[Y_{i,t=0}(0)]$ so that the treatment effects are 0 on average. We generate the potential outcomes once, and treat the population as fixed throughout our simulations. Our simulation draws then differ only in the draw of the treatment assignment vector. For simplicity, we set $N_1 = N_0 = N/2$, and in each simulation draw we randomly select which units are treated in period 1. We conduct 1000 simulations for each combination of $N \in \{50, 2000\}$, $\rho \in \{0, 0.5, 0.99\}$, and $\gamma \in \{0, 0.5\}$.

Results.
Table 1 shows the bias, standard deviation, and coverage of 95% confidence intervals based on the plug-in efficient estimator $\hat\theta_{\hat\beta^*}$, difference-in-differences $\hat\theta_{DiD} = \hat\theta_1$, and the simple difference-in-means $\hat\theta_{DiM} = \hat\theta_0$. Confidence intervals are constructed as $\hat\theta_{\hat\beta^*} \pm 1.96\,\hat\sigma^{**}/\sqrt{N}$ for the efficient estimator, and analogously for the other estimators. (For $\hat\theta_\beta$, we use an analog to $\hat\sigma^{**2}$ in which the unrefined estimate $\hat\sigma^{*2}$ for the efficient estimator is replaced with the sample analog of the expression for $Var[\hat\theta_\beta]$ given in Corollary 3.1.) For all specifications and estimators, the estimated bias is quite small, and coverage is close to the nominal level. Table 2 facilitates comparison of the standard deviations of the different estimators by showing their ratios relative to the plug-in estimator. The standard deviation of the plug-in efficient estimator is weakly smaller than that of either DiD or DiM in nearly all cases, and is never more than 2% larger than that of either. The standard deviation of the plug-in efficient estimator is similar to that of DiD when the autocorrelation of $Y(0)$ is high ($\rho = 0.99$) and there is no treatment effect heterogeneity ($\gamma = 0$), so that $\beta^* \approx 1$ and DiD is (nearly) optimal in the class we consider. Likewise, it is similar to DiM when there is no autocorrelation ($\rho = 0$) and no treatment effect heterogeneity ($\gamma = 0$), so that $\beta^* = 0$ and DiM is optimal in the class we consider. The plug-in efficient estimator is substantially more precise than DiD and DiM in many other specifications: in the worst such specification, the standard deviation of DiD is as much as 1.7 times larger than that of the plug-in efficient estimator, and the standard deviation of DiM can be as much as 7 times larger. These simulations thus illustrate how the plug-in efficient estimator can improve on DiD or DiM in cases where they are suboptimal, while retaining nearly identical performance when DiD or DiM is optimal.
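A scaled-down version of this experiment can be reproduced in a few lines. The sketch below is our own (not the authors' simulation code): the sizes and parameter values are chosen for speed rather than to match the paper, and it compares the standard deviations of DiM, DiD, and the plug-in efficient estimator across random assignment draws:

```python
import numpy as np

rng = np.random.default_rng(6)

# Scaled-down two-period Monte Carlo (hypothetical settings: N1 = N0 = 100,
# rho = 0.9, gamma = 0.5, 200 assignment draws).
N1 = N0 = 100
N, rho, gamma, n_sims = N1 + N0, 0.9, 0.5, 200

# Fixed finite population of potential outcomes, drawn once.
Y0 = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N)
tau = gamma * (Y0[:, 0] - Y0[:, 0].mean())   # heterogeneous effects, mean 0
Y1_treated = Y0[:, 1] + tau                  # treated outcome in t = 1

est = {"DiD": [], "DiM": [], "PlugIn": []}
for _ in range(n_sims):
    treat = np.zeros(N, dtype=bool)
    treat[rng.choice(N, N1, replace=False)] = True    # one assignment draw
    y_pre = Y0[:, 0]                                  # t = 0, unaffected
    y_post = np.where(treat, Y1_treated, Y0[:, 1])    # t = 1, observed
    dim = y_post[treat].mean() - y_post[~treat].mean()
    x = y_pre[treat].mean() - y_pre[~treat].mean()    # mean-zero covariate
    v_x = y_pre[treat].var(ddof=1) / N1 + y_pre[~treat].var(ddof=1) / N0
    v_xt = (np.cov(y_pre[treat], y_post[treat])[0, 1] / N1
            + np.cov(y_pre[~treat], y_post[~treat])[0, 1] / N0)
    est["DiM"].append(dim)                            # beta = 0
    est["DiD"].append(dim - x)                        # beta = 1
    est["PlugIn"].append(dim - (v_xt / v_x) * x)      # beta = beta_hat*

sds = {k: float(np.std(v)) for k, v in est.items()}   # true ATE is 0 here
```

Comparing the entries of `sds` mirrors the ratio comparisons reported in Table 2: with high autocorrelation, DiM is far noisier than the other two, and the plug-in estimator weakly improves on DiD.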
Table 1: Bias, Standard Deviation, and Coverage for $\hat\theta_{\hat\beta^*}$ (PlugIn), $\hat\theta_{DiD}$, and $\hat\theta_{DiM}$ in 2-period simulations. [Table entries omitted; the columns report Bias, SD, and Coverage for the PlugIn, DiD, and DiM estimators, with rows indexed by $N_1$, $N_0$, $\rho \in \{0.99, 0.50, 0.00\}$, and $\gamma \in \{0.0, 0.5\}$.]
To evaluate the performance of our proposed methods in a more realistic setting, we conduct simulations calibrated to our application to Wood et al. (2020a) in Section 5. The outcome of interest $Y_{it}$ is the number of complaints against police officer $i$ in month $t$, for police officers in Chicago. Police officers were randomly assigned to first receive a procedural justice training in period $G_i$. See Section 5 for more background on the application.

Simulation specification.
Table 2: Ratio of standard deviations for $\hat\theta_{DiD}$ and $\hat\theta_{DiM}$ relative to $\hat\theta_{\hat\beta^*}$ in 2-period simulations. [Table entries omitted; rows are indexed by $N_1$, $N_0$, $\rho$, and $\gamma$ as in Table 1.]

We calibrate our baseline specification as follows. The number of observations and time periods exactly matches the data from Wood et al. (2020a) used in our application. We set the untreated potential outcomes $Y_{it}(\infty)$ to match the observed outcomes $Y_i$ in the data (which would exactly match the true potential outcomes if there were no treatment effect on any units). In our baseline simulation specification, there is no causal effect of treatment, so that $Y_{it}(g) = Y_{it}(\infty)$ for all $g$. (We describe an alternative simulation design with heterogeneous treatment effects in Appendix Section A.) In each simulation draw $s$, we randomly draw a vector of treatment dates $G^s = (G^s_1, \ldots, G^s_N)$ such that the number of units first treated in each period $g$ matches that observed in the data (i.e., $\sum_i 1[G^s_i = g] = N_g$ for all $g$). In total, there are 72 months of data on 7785 officers. (As in our application, we restrict attention to police officers who remained in the police force throughout the sample period.) There are 48 distinct values of $g$, with the cohort size $N_g$ ranging from 6 to 642. In an alternative specification, we collapse the data to the yearly level, so that there are 6 time periods and 5 cohorts.

For each simulated data set, we calculate the plug-in efficient estimator $\hat\theta_{\hat\beta^*}$ for four estimands: the simple weighted average ATE ($\theta_{simple}$); the calendar- and cohort-weighted average treatment effects ($\theta_{calendar}$ and $\theta_{cohort}$); and the instantaneous event-study parameter ($\theta_{ES}^0$). (See Section 3.2 for the formal definitions of these estimands.) In our baseline specification, we use as $\hat X$ the scalar weighted combination of pre-treatment differences used by the Callaway and Sant'Anna (2020, CS) estimator for the appropriate estimand (see Example 1). In the appendix, we also present results for an alternative specification in which $\hat X$ is a vector containing $\hat\tau_{t,gg'}$ for all pairs $g, g' > t$.
For comparison, we also compute the CS estimator for the same estimand, using the not-yet-treated units as the control group (since all units are eventually treated). Recall that for $\theta_{ES}^0$, the CS estimator coincides with the estimator proposed in de Chaisemartin and D'Haultfœuille (2020) in our setting, since treatment is an absorbing state. Confidence intervals are calculated as $\hat\theta_{\hat\beta^*} \pm 1.96\,\hat\sigma^{**}/\sqrt{N}$ for the plug-in efficient estimator, and analogously for the CS estimator.

Baseline simulation results.
The results for our baseline specification are shown in Tables 3 and 4. As seen in Table 3, the plug-in efficient estimator is approximately unbiased, and 95% confidence intervals based on our standard errors have coverage rates close to the nominal level for all of the estimands, with size distortions no larger than 3% across our specifications. The CS estimator is likewise approximately unbiased and has excellent coverage for all of the estimands. Table 4 shows that there are large efficiency gains from using the plug-in efficient estimator relative to the CS estimator: it compares the standard deviation of the plug-in efficient estimator to that of the CS estimator. Remarkably, using the plug-in efficient estimator reduces the standard deviation by a factor of nearly two for the calendar-weighted average, and by a factor between 1.36 and 1.67 for the other estimands. Since standard errors are inversely proportional to the square root of the sample size, these results suggest that using the plug-in efficient estimator is roughly equivalent to multiplying the sample size by a factor of four for the calendar-weighted average.
Extensions.
In Appendix A, we present simulations from an alternative specification where the monthly data are collapsed to the yearly level, leading to fewer time periods and fewer (but larger) cohorts. Both the efficient and CS estimators have very good coverage and minimal bias. The efficient estimator again dominates the CS estimator in efficiency, although the gains are smaller (24 to 30% reductions in standard deviation). The smaller efficiency gains in this specification are intuitive: the CS estimator overweights the pre-treatment periods (relative to the efficient estimator) in our setting, but the penalty for doing so is smaller in the collapsed data, where the pre-treatment outcomes are averaged over more months and thus have lower variance.

In the appendix, we also present results from a modification of our baseline DGP with heterogeneous treatment effects. We again find that the plug-in efficient estimator performs well, with qualitative findings similar to those in the baseline specification, although the standard errors are somewhat conservative, as expected.

In the appendix, we also conduct simulations using a modified version of the plug-in efficient estimator in which $\hat X$ is a vector containing all possible comparisons of cohorts $g$ and $g'$ in periods $t < \min(g, g')$. We find poor coverage for this estimator in the monthly specification, where the dimension of $\hat X$ is large relative to the sample size (1987, compared with $N = 7785$), and thus the normal approximation derived in Proposition 3.2 is poor. By contrast, when the data are collapsed to the yearly level, so that the dimension of $\hat X$ constructed in this way is more modest (10), the coverage of this estimator is good, and it offers modest efficiency gains over the scalar $\hat X$ considered in the main text. These findings align with the results in Lei and Ding (2020), who show that covariate adjustment in cross-sectional experiments yields asymptotically normal estimators when the dimension of the covariates is $o(N^{1/2})$ (and certain regularity conditions are satisfied). We thus recommend using the version of $\hat X$ with all potential comparisons only when its dimension is small relative to the square root of the sample size.

Finally, we repeat the same exercise for the other outcomes used in our application (use of force and sustained complaints). We again find that the plug-in efficient estimator has minimal bias, good coverage properties, and is substantially more precise than the CS estimator for nearly all specifications (with reductions in standard deviations by a factor of over 3 for some specifications). The one exception to the good performance of the plug-in efficient estimator is the calendar-weighted average for sustained complaints when using the monthly data: the coverage of CIs based on the plug-in efficient estimator is only 79% in this specification.

Estimator   Estimand   Bias    Coverage   Mean SE   SD
PlugIn      calendar    0.00     0.93       0.27     0.29
PlugIn      cohort      0.00     0.92       0.24     0.24
PlugIn      ES0         0.01     0.94       0.26     0.27
PlugIn      simple      0.00     0.92       0.22     0.22
CS          calendar    0.00     0.94       0.55     0.55
CS          cohort     -0.01     0.95       0.41     0.41
CS/CdH      ES0         0.01     0.94       0.36     0.36
CS          simple     -0.01     0.96       0.41     0.40

Table 3: Results for Simulations Calibrated to Wood et al. (2020a)

Note: This table shows results for the plug-in efficient and Callaway and Sant'Anna (2020) estimators in simulations calibrated to Wood et al. (2020a). The estimands considered are the calendar-, cohort-, and simple-weighted average treatment effects, as well as the instantaneous event-study effect (ES0). The Callaway and Sant'Anna (2020) estimator for ES0 corresponds with the estimator in de Chaisemartin and D'Haultfœuille (2020). Coverage refers to the fraction of the time a nominal 95% confidence interval includes the true parameter. Mean SE refers to the average estimated standard error, and SD refers to the actual standard deviation of the estimator. The bias, Mean SE, and SD are all multiplied by 100 for ease of readability.

Table 4: Comparison of Standard Deviations - Callaway and Sant'Anna (2020) versus Plug-in Efficient Estimator. [Entries omitted; for each estimand (calendar, cohort, ES0, simple), the table reports the ratio of the standard deviation of the Callaway and Sant'Anna (2020) estimator relative to that of the plug-in efficient estimator, based on the simulation results in Table 3.]
Two distinguishing features of this specification are that the outcome is very rare (pre-treatment mean 0.004) and the aggregation scheme places the largest weight on the earliest three cohorts, which were small (sizes 17, 15, and 26). This finding aligns with the well-known fact that the central limit theorem may be a poor approximation in finite samples with a binary outcome that is very rare. The plug-in efficient estimator again has good coverage (94%) when considering the annualized data, where the cohort sizes are larger. We thus urge some caution in using the plug-in efficient estimator (or any procedure based on a normal approximation) when cohort sizes are small (<30) and the outcome is rare; in such settings, we recommend collapsing the data to a higher level of aggregation before using the plug-in estimator.

Reducing police misconduct and use of force is an important policy objective. Wood et al. (2020a) studied the Chicago Police Department’s staggered rollout of a procedural justice training program, which taught police officers strategies for emphasizing respect, neutrality, and transparency in the exercise of authority. Officers were randomly assigned a date for training. Wood et al. (2020a) found large and statistically significant impacts of the program on complaints and sustained complaints against police officers and on officer use of force. However, in the process of preparing the analysis for this paper, we discovered a statistical error in the cohort-level analysis of Wood et al. (2020a).

See the Supplement to Wood et al. (2020a) for discussion of some concerns regarding non-compliance, particularly towards the end of the sample. We explore robustness to dropping officers trained in the last year in Appendix Figure 4. The results are qualitatively similar, although with smaller estimated effects on use of force.
We use the same data as in our re-analysis in Wood et al. (2020b), which extends the data used in the original analysis of Wood et al. (2020a) through December 2016. As in Wood et al. (2020b), we restrict attention to the balanced panel of 7,785 officers who remained in the police force throughout the study period. The data contain the outcome measures (complaints, sustained complaints, and use of force) at a monthly level for a period of 72 months (6 years), with the first cohort trained in month 13 and the final cohort trained in the last month of the sample. The data also contain the date on which each officer was trained.
We apply our proposed plug-in efficient estimator to estimate the effects of the procedural justice training program on the three outcomes of interest. We estimate the simple-, cohort-, and calendar-weighted average effects described in Section 3.2 and used in our Monte Carlo study. We also estimate the average dynamic effect for the first 24 months after treatment, which includes the instantaneous event-study effect studied in our Monte Carlo as a special case (for event-time 0). For comparison, we also estimate the Callaway and Sant’Anna (2020) estimator as we did in our re-analysis in Wood et al. (2020b). (Recall that for the instantaneous event-study effect, the Callaway and Sant’Anna (2020) and de Chaisemartin and D’Haultfœuille (2020) estimators coincide.)
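To make the three aggregation schemes concrete, here is a toy sketch of how simple-, cohort-, and calendar-weighted averages combine cohort-by-period building-block effects ATT(g, t). The ATT values and cohort sizes are invented, and the weighting conventions shown (size-weighting within cells, in the style of Callaway and Sant’Anna (2020) aggregations) reflect our reading rather than the paper's exact formulas:

```python
# Toy sketch of the three aggregation schemes applied to building-block
# effects ATT(g, t) for cohort g in post-treatment period t >= g.
# The ATT values and cohort sizes below are invented for illustration.
att = {(2, 2): 0.1, (2, 3): 0.3, (3, 3): 0.2}   # {(cohort, period): effect}
size = {2: 50, 3: 100}                           # cohort sizes (invented)

def simple_avg(att, size):
    """Weight every treated (cohort, period) cell by cohort size."""
    w = sum(size[g] for g, t in att)
    return sum(size[g] * v for (g, t), v in att.items()) / w

def cohort_avg(att, size):
    """Average within each cohort over its treated periods, then take a
    size-weighted average across cohorts."""
    per_cohort = {}
    for (g, t), v in att.items():
        per_cohort.setdefault(g, []).append(v)
    w = sum(size.values())
    return sum(size[g] * sum(vs) / len(vs) for g, vs in per_cohort.items()) / w

def calendar_avg(att, size):
    """Size-weight across treated cohorts within each period, then average
    the per-period effects equally over calendar time."""
    per_period = {}
    for (g, t), v in att.items():
        per_period.setdefault(t, []).append((size[g], v))
    period_effects = [sum(s * v for s, v in cells) / sum(s for s, _ in cells)
                      for cells in per_period.values()]
    return sum(period_effects) / len(period_effects)
```

Even with identical building blocks the schemes can disagree: in the toy numbers above, the simple and cohort averages both equal 0.2, while the calendar average weights each period equally regardless of how many cohorts are treated in it and so differs.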
Figure 1 shows the results of our analysis for the three aggregate summary parameters. Table 5 compares the magnitudes of these estimates and their 95% confidence intervals (CIs) to the mean of the outcome in the 12 months before treatment began. The estimates using the plug-in efficient estimator are substantially more precise than those using the Callaway and Sant’Anna (2020, CS) estimator, with the standard errors ranging from 1.3 to 5.6 times smaller (see final column of Table 5).

Figure 1: Effect of Procedural Justice Training Using the Plug-In Efficient and Callaway and Sant’Anna (2020) Estimators
Note: This figure shows point estimates and 95% CIs for the effects of procedural justice training on complaints, force, and sustained complaints using the CS and plug-in efficient estimators. Results are shown for the calendar-, cohort-, and simple-weighted averages.
As in our re-analysis, we find no significant impact on complaints using any of the aggregations. Our bounds on the magnitude of the treatment effect are substantially tighter than before, however. For instance, using the simple aggregation we can now rule out reductions in complaints of more than 11%, compared with a bound of 26% using the CS estimator. For use of force, the point estimates are somewhat smaller than when using the CS estimator and the upper bounds of the confidence intervals are all nearly exactly 0. Although precision is substantially higher than when using the CS estimator, the CIs for force still include effects between near-zero and 29% of the pre-treatment mean. For sustained complaints, all of the point estimates are near zero and the CIs are substantially narrower than when using the CS estimator, although the plug-in efficient estimate using the calendar aggregation is marginally significant. If we were to Bonferroni-adjust all of the CIs in Figure 1 for testing nine hypotheses (three outcomes times three aggregations), none of the confidence intervals would rule out zero.

Recall that the calendar aggregation for sustained complaints was the one specification for which CIs based on the plug-in efficient estimator substantially undercovered (79%), and thus the significant result should be interpreted with some caution.

Note: This table shows the pre-treatment means for the three outcomes. It also displays the estimates and 95% CIs in Figure 1 as percentages of these means. The final column shows the ratio of the CI length using the CS estimator relative to the plug-in efficient estimator.

Figure 2 shows event-time estimates for the first two years using the plug-in efficient estimator. (To conserve space, we place the analogous results for the CS estimator in the appendix.) In dark blue, we present point estimates and pointwise confidence intervals, and in light blue we present simultaneous confidence bands calculated via Bonferroni adjustment. It has been argued that simultaneous confidence bands are more appropriate for event-study analyses since they control size over the full dynamic path of treatment effects (Freyaldenhoven, Hansen and Shapiro, 2019; Callaway and Sant’Anna, 2020). The figure shows that the simultaneous confidence bands include zero for nearly all periods for all three outcomes. Inspecting the results for force more closely, we see that the point estimates are positive (although typically not significant) for most of the first year after treatment, but become consistently negative around the start of the second year after treatment. This suggests that the negative point estimates in the aggregate summary statistics are driven mainly by months after the first year. Although it is possible that the treatment effects grow over time, this runs counter to the common finding of fadeout in educational programs in general (Bailey, Duncan, Cunha, Foorman and Yeager, 2020) and anti-bias training in particular (Forscher and Devine, 2017).

Figure 2: Event-Time Average Effects Using the Plug-In Efficient Estimator

Finally, in Appendix Figure 4, we present results analogous to those in Figure 1 except removing officers who were treated in the last 12 months of the data. The reason for this is that, as discussed in the supplement to Wood et al. (2020a), there was some non-compliance towards the end of the study period wherein officers who had not already been trained could volunteer to take the training at a particular date.
The qualitative patterns after dropping these observations are similar, although the estimates for the effect on use of force are smaller and not statistically significant at conventional levels.
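The Bonferroni adjustments referenced above (for the nine aggregate CIs and for the simultaneous event-study bands) amount to replacing the pointwise normal critical value with one at level α/m for m hypotheses. A minimal sketch, with placeholder estimate and standard error:

```python
from statistics import NormalDist

def ci(estimate, se, alpha=0.05, m=1):
    """Two-sided normal CI; with m > 1 the level is Bonferroni-adjusted so
    that all m intervals cover jointly with probability at least 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return estimate - z * se, estimate + z * se

theta, se = -0.5, 0.3              # placeholder point estimate and SE
lo_pt, hi_pt = ci(theta, se)       # pointwise 95% interval
lo_bf, hi_bf = ci(theta, se, m=9)  # adjusted for 9 hypotheses
```

With m = 9 the critical value rises from roughly 1.96 to roughly 2.77, so each interval widens by about 40%, which is why significance can disappear after the adjustment.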
This paper considers efficient estimation in settings where the timing of treatment for different units is randomly assigned. Although this random assignment assumption is stronger than the typical parallel trends assumption, it can be ensured by design when the researcher controls the timing of treatment, and is often the justification given for parallel trends in quasi-experimental contexts. We derive the most efficient estimator in a large class of estimators that nests many existing approaches. Although the “oracle” efficient estimator is not known in practice, we show that a plug-in sample analog has similar properties in large populations, and we derive a valid variance estimator for construction of confidence intervals. We find in simulations that the proposed plug-in efficient estimator is approximately unbiased, yields CIs with good coverage, and substantially increases precision relative to existing methods. We apply our proposed methodology to obtain the most precise estimates to date of the causal effects of procedural justice training programs for police officers.

References
Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey M. Wooldridge, “Sampling-Based versus Design-Based Uncertainty in Regression Analysis,” Econometrica, 2020, (1), 265–296.

Aronow, Peter M., Donald P. Green, and Donald K. K. Lee, “Sharp bounds on the variance in randomized experiments,” The Annals of Statistics, June 2014, (3), 850–871.

Athey, Susan and Guido Imbens, “Design-Based Analysis in Difference-In-Differences Settings with Staggered Adoption,” arXiv:1808.05293 [cs, econ, math, stat], August 2018.

Bailey, Drew H., Greg J. Duncan, Flávio Cunha, Barbara R. Foorman, and David S. Yeager, “Persistence and Fade-Out of Educational-Intervention Effects: Mechanisms and Potential Solutions,” Psychological Science in the Public Interest, October 2020.

Basse, Guillaume, Yi Ding, and Panos Toulis, “Minimax designs for causal effects in temporal experiments with treatment habituation,” arXiv:1908.03531 [stat], June 2020.

Borusyak, Kirill and Xavier Jaravel, “Revisiting Event Study Designs,” SSRN Scholarly Paper ID 2826228, Social Science Research Network, Rochester, NY, August 2016.

Callaway, Brantly and Pedro H. C. Sant’Anna, “Difference-in-Differences with multiple time periods,” Journal of Econometrics, December 2020.

de Chaisemartin, Clément and Xavier D’Haultfœuille, “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects,” American Economic Review, September 2020, (9), 2964–2996.

Ding, Peng and Fan Li, “A bracketing relationship between difference-in-differences and lagged-dependent-variable adjustment,” Political Analysis, 2019, (4), 605–615.

Doleac, Jennifer, “How to Fix Policing,” Niskanen Center, 2020.

Forscher, Patrick S and Patricia G Devine, “Knowledge-based interventions are more likely to reduce legal disparities than are implicit bias interventions,” 2017.

Freedman, David A., “On Regression Adjustments in Experiments with Several Treatments,” The Annals of Applied Statistics, 2008, (1), 176–196.

Freedman, David A., “On regression adjustments to experimental data,” Advances in Applied Mathematics, 2008, (2), 180–193.

Freyaldenhoven, Simon, Christian Hansen, and Jesse Shapiro, “Pre-event Trends in the Panel Event-study Design,” American Economic Review, 2019, (9), 3307–3338.

Frison, L. and S. J. Pocock, “Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design,” Statistics in Medicine, September 1992, (13), 1685–1704.

Funatogawa, Takashi, Ikuko Funatogawa, and Yu Shyr, “Analysis of covariance with pre-treatment measurements in randomized trials under the cases that covariances and post-treatment variances differ between groups,” Biometrical Journal, May 2011, (3), 512–524.

Goodman-Bacon, Andrew, “Difference-in-Differences with Variation in Treatment Timing,” Working Paper 25018, National Bureau of Economic Research, September 2018.

Guo, Kevin and Guillaume Basse, “The Generalized Oaxaca-Blinder Estimator,” arXiv:2004.11615 [math, stat], April 2020.

Imai, Kosuke and In Song Kim, “On the Use of Two-way Fixed Effects Regression Models for Causal Inference with Panel Data,” Political Analysis, 2020, (Forthcoming).

Lei, Lihua and Peng Ding, “Regression adjustment in completely randomized experiments with a diverging number of covariates,” Biometrika, December 2020, (Forthcoming).

Li, Xinran and Peng Ding, “General Forms of Finite Population Central Limit Theorems with Applications to Causal Inference,” Journal of the American Statistical Association, October 2017, (520), 1759–1769.

Lin, Winston, “Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique,” Annals of Applied Statistics, March 2013, (1), 295–318.

Malani, Anup and Julian Reif, “Interpreting pre-trends as anticipation: Impact on estimated treatment effects from tort reform,” Journal of Public Economics, April 2015, 1–17.

Manski, Charles F. and John V. Pepper, “How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions,” Review of Economics and Statistics, 2018, (2), 232–244.

McKenzie, David, “Beyond baseline and follow-up: The case for more T in experiments,” Journal of Development Economics, 2012, (2), 210–221.

Meer, Jonathan and Jeremy West, “Effects of the Minimum Wage on Employment Dynamics,” Journal of Human Resources, 2016, (2), 500–522.

Neyman, Jerzy, “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.,” Statistical Science, 1923, (4), 465–472.

Owens, Emily, David Weisburd, Karen L. Amendola, and Geoffrey P. Alpert, “Can You Build a Better Cop?,” Criminology & Public Policy, 2018, (1), 41–87.

Rambachan, Ashesh and Jonathan Roth, “Design-Based Uncertainty for Quasi-Experiments,” arXiv:2008.00602 [econ, stat], August 2020.

Roth, Jonathan and Pedro H. C. Sant’Anna, “When Is Parallel Trends Sensitive to Functional Form?,” arXiv:2010.04814 [econ, stat], January 2021.

Shaikh, Azeem and Panos Toulis, “Randomization Tests in Observational Studies with Staggered Adoption of Treatment,” arXiv:1912.10610 [stat], December 2019.

Sun, Liyang and Sarah Abraham, “Estimating dynamic treatment effects in event studies with heterogeneous treatment effects,” Journal of Econometrics, December 2020.

Słoczyński, Tymon, “Interpreting OLS Estimands When Treatment Effects Are Heterogeneous: Smaller Groups Get Larger Weights,” Review of Economics and Statistics, 2020, (Forthcoming).

Wan, Fei, “Analyzing pre-post designs using the analysis of covariance models with and without the interaction term in a heterogeneous study population,” Statistical Methods in Medical Research, January 2020, (1), 189–204.

Wood, George, Tom R. Tyler, and Andrew V. Papachristos, “Procedural justice training reduces police use of force and complaints against officers,” Proceedings of the National Academy of Sciences, May 2020, (18), 9815–9821.

Wood, George, Tom R. Tyler, Andrew V. Papachristos, Jonathan Roth, and Pedro H.C. Sant’Anna, “Revised Findings for ‘Procedural justice training reduces police use of force and complaints against officers’,” Working Paper, 2020.

Xiong, Ruoxuan, Susan Athey, Mohsen Bayati, and Guido Imbens, “Optimal Experimental Design for Staggered Rollouts,” arXiv:1911.03764 [econ, stat], November 2019.
Additional Simulation Results
This section presents results from extensions to the simulations in Section 4.
Other outcomes.
Tables 6-9 show results analogous to those in the main text, except using the other two outcomes considered in our application (use of force and sustained complaints).
Annualized data.
Tables 10-15 show versions of our simulation results (for all three outcomes) when the data is collapsed to the annual level, so that there are 6 total time periods and 5 cohorts.
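A minimal sketch of the collapsing step, assuming within-year averaging of the monthly outcomes (whether the outcomes are averaged or summed within years is not stated here, so averaging is an assumption), with invented data:

```python
# Sketch of collapsing a 72-month outcome path to 6 yearly observations
# by averaging within each year. The monthly values are invented placeholders.
monthly = [0.01 * (m % 5) for m in range(72)]  # one unit's monthly outcomes

def collapse_to_years(monthly, months_per_year=12):
    """Average each consecutive block of 12 months into one yearly value."""
    assert len(monthly) % months_per_year == 0
    return [sum(monthly[i:i + months_per_year]) / months_per_year
            for i in range(0, len(monthly), months_per_year)]

yearly = collapse_to_years(monthly)
print(len(yearly))  # 6 yearly observations
```

Averaging over 12 months shrinks the variance of each pre-treatment outcome, which is the mechanism behind the smaller efficiency gains of the efficient estimator in the annualized specification.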
Augmented X̂. Table 16 shows results for an alternative version of the efficient estimator where X̂ is now a vector that contains the difference in means between cohorts g and g′ in all periods t < min(g, g′). This vector is large relative to the sample size in the monthly specification (dim(X̂) = 1987), which leads to bias and severe undercoverage for the modified plug-in efficient estimator. In the annualized data, the dimension of the modified X̂ is modest (10), and the modified efficient estimator has good coverage and yields small efficiency gains (up to 3%) relative to the plug-in efficient estimator considered in the main text.

Heterogeneous Treatment Effects.
Tables 17 and 18 show simulation results for a modification of our baseline specification in which there are heterogeneous treatment effects. In the baseline specification, Y_it(g) = Y_it(∞) for all g. In the modification, we set Y_it(g) = Y_it(∞) + 1[t ≥ g] · u_i. The u_i are mean-zero draws from a normal distribution with standard deviation equal to half the standard deviation of the untreated potential outcomes. We draw the u_i once and hold them fixed throughout the simulations, which differ only in the assignment of treatment timing. The results are similar to those for the main specification, although, as expected, the standard errors are somewhat conservative (i.e., the mean standard error exceeds the standard deviation of the estimator).

Estimator  Estimand  Bias  Coverage  Mean SE  SD
PlugIn     calendar  0.03  0.94      0.30     0.32
PlugIn     cohort    0.02  0.92      0.28     0.29
PlugIn     ES0       0.01  0.96      0.28     0.28
PlugIn     simple    0.01  0.93      0.26     0.27
CS         calendar  0.03  0.95      0.59     0.60
CS         cohort    0.01  0.96      0.45     0.44
CS/CdH     ES0       0.01  0.96      0.37     0.37
CS         simple    0.01  0.96      0.45     0.44

Table 6: Results for Simulations Calibrated to Wood et al. (2020a) – Use of Force

Note: This table shows results analogous to Table 3, except using Use of Force rather than Complaints as the outcome.
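The heterogeneous-effects DGP described above can be sketched as follows. The untreated potential outcomes here are invented placeholders; only the u_i construction follows the text (mean-zero normal draws with SD equal to half that of the untreated outcomes, drawn once and held fixed):

```python
import random

random.seed(0)

# Sketch of the heterogeneous-effects modification:
#   Y_it(g) = Y_it(infinity) + 1[t >= g] * u_i,
# with u_i mean-zero normal, SD equal to half the SD of the untreated outcomes.
n_units, n_periods = 100, 6
# Untreated potential outcomes: invented N(0, 1) placeholders.
y_untreated = [[random.gauss(0, 1) for _ in range(n_periods)]
               for _ in range(n_units)]

sd_untreated = 1.0  # SD of the untreated outcomes, known here by construction
u = [random.gauss(0, 0.5 * sd_untreated) for _ in range(n_units)]  # drawn once

def potential_outcome(i, t, g):
    """Y_it(g): untreated outcome plus the fixed unit-level effect u_i once treated."""
    return y_untreated[i][t] + (u[i] if t >= g else 0.0)
```

Because the u_i are held fixed across simulation draws, only treatment timing is re-randomized, matching the design-based framework in which potential outcomes are non-stochastic.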
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 7: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Use of Force
Note: This table shows results analogous to Table 4, except using Use of Force rather than Complaints as the outcome.
Estimator  Estimand  Bias  Coverage  Mean SE  SD
PlugIn     calendar  0.00  0.79      0.06     0.07
PlugIn     cohort    0.00  0.92      0.03     0.03
PlugIn     ES0       0.01  0.95      0.08     0.08
PlugIn     simple    0.00  0.92      0.03     0.03
CS         calendar  0.01  0.95      0.14     0.17
CS         cohort    0.01  0.95      0.11     0.11
CS/CdH     ES0       0.01  0.94      0.11     0.12
CS         simple    0.01  0.96      0.11     0.12

Table 8: Results for Simulations Calibrated to Wood et al. (2020a) – Sustained Complaints
Note: This table shows results analogous to Table 3, except using Sustained Complaints rather than Complaints as the outcome.

Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 9: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Sustained Complaints
Note: This table shows results analogous to Table 4, except using Sustained Complaints rather than Complaints as the outcome.
Estimator  Estimand  Bias  Coverage  Mean SE  SD
PlugIn     calendar  0.11  0.95      1.99     1.96
PlugIn     cohort    0.15  0.95      2.53     2.49
PlugIn     ES0       0.03  0.96      1.65     1.60
PlugIn     simple    0.14  0.95      2.41     2.37
CS         calendar  0.20  0.96      2.65     2.56
CS         cohort    0.26  0.96      3.24     3.13
CS/CdH     ES0       0.04  0.96      2.05     1.98
CS         simple    0.27  0.96      3.17     3.05

Table 10: Results for Simulations Calibrated to Wood et al. (2020a) – Annualized Data
Note: This table shows results analogous to Table 3, except the data is collapsed to the annual level.
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 11: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Annualized Data
Note: This table shows results analogous to Table 4, except the data is collapsed to the annual level.
Table 12: Results for Simulations Calibrated to Wood et al. (2020a) – Use of Force & Annualized Data

Note: This table shows results analogous to Table 3, except using Use of Force rather than Complaints as the outcome, and in simulations where data is collapsed to the annual level.
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 13: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Use of Force & Annualized Data
Note: This table shows results analogous to Table 4, except using Use of Force rather than Complaints asthe outcome, and in simulations where data is collapsed to the annual level.
Table 14: Results for Simulations Calibrated to Wood et al. (2020a) – Sustained Complaints & Annualized Data

Note: This table shows results analogous to Table 3, except using Sustained Complaints rather than Complaints as the outcome, and in simulations where data is collapsed to the annual level.
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 15: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Sustained Complaints & Annualized Data
Note: This table shows results analogous to Table 4, except using Sustained Complaints rather than Complaints as the outcome, and in simulations where data is collapsed to the annual level.

Table 16: Comparison of the Plug-in Efficient Estimator with Scalar and Augmented X̂

Note: This table shows the bias, coverage, mean standard error, and standard deviation of two versions of the plug-in efficient estimator. The estimator with the label “Long X” uses an augmented version of X̂ that includes the difference in means between all cohorts g, g′ in periods t < min(g, g′). The estimator labeled PlugIn uses a scalar X̂ such that the CS estimator corresponds with β = 0, as in the main text. The simulation specification in panel (a) is the baseline specification considered in the main text; in panel (b), the data is collapsed to the annual level.

Table 17: Results for Simulations Calibrated to Wood et al. (2020a) – Heterogeneous Treatment Effects

Note: This table shows results analogous to Table 3, except the DGP adds heterogeneous treatment effects as described in Section A.
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 18: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Heterogeneous Treatment Effects
Note: This table shows results analogous to Table 4, except the DGP adds heterogeneous treatment effects as described in Section A.

Additional Tables and Figures
Figure 3: Event-Time Average Effects Using the CS Estimator
Note: This figure is analogous to Figure 2, except it uses the CS estimator rather than the plug-in efficient estimator.
Figure 4: Effect of Procedural Justice Training Using the Plug-In Efficient and Callaway and Sant’Anna (2020) Estimators – Dropping Late-Trained Officers