Efficient Estimation for Staggered Rollout Designs
Jonathan Roth† Pedro H.C. Sant'Anna‡

February 3, 2021
Abstract
Researchers are often interested in the causal effect of treatments that are rolled out to different units at different points in time. This paper studies how to efficiently estimate a variety of causal parameters in such staggered rollout designs when treatment timing is (as-if) randomly assigned. We solve for the most efficient estimator in a class of estimators that nests two-way fixed effects models as well as several popular generalized difference-in-differences methods. The efficient estimator is not feasible in practice because it requires knowledge of the optimal weights to be placed on pre-treatment outcomes. However, the optimal weights can be estimated from the data, and in large datasets the plug-in estimator that uses the estimated weights has similar properties to the "oracle" efficient estimator. We illustrate the performance of the plug-in efficient estimator in simulations and in an application to Wood, Tyler and Papachristos (2020a)'s study of the staggered rollout of a procedural justice training program for police officers. We find that confidence intervals based on the plug-in efficient estimator have good coverage and can be as much as five times shorter than confidence intervals based on existing methods. As an empirical contribution of independent interest, our application provides the most precise estimates to date on the effectiveness of procedural justice training programs for police officers.

∗We are grateful to Brantly Callaway, Emily Owens, Ryan Hill, Ashesh Rambachan, Evan Rose, Adrienne Sabety, Jesse Shapiro, Yotam Shem-Tov, and Ariella Kahn-Lang Spitzer for helpful comments and conversations.
†Microsoft. [email protected]
‡Vanderbilt University. [email protected]

1 Introduction
Researchers are often interested in the causal effects of a treatment that has a staggered rollout, meaning that it is first implemented for different units at different times. For instance, social scientists may be interested in the causal effect of a policy that is adopted in different states at different times. Businesses may likewise be interested in the causal effect of a new feature or advertising campaign that is introduced to different customers over time. In many cases, the timing of the rollout is controlled by the researcher and can be explicitly randomized. In others, researchers argue that the timing of the treatment is as-if randomly assigned.

In these settings, researchers often estimate treatment effects using methods that extend the simple two-period difference-in-differences estimator to the staggered setting. It is common practice to estimate causal effects using two-way fixed effects (TWFE) models that control for both time and unit fixed effects (e.g. Xiong, Athey, Bayati and Imbens, 2019). Recent work has shown, however, that the estimand of TWFE models may be difficult to interpret under treatment effect heterogeneity (see Related Literature below). The literature has therefore proposed a variety of alternative procedures that yield more easily-interpretable estimands under heterogeneous treatment effects (Callaway and Sant'Anna, 2020; de Chaisemartin and D'Haultfœuille, 2020; Sun and Abraham, 2020). All of these procedures exploit a generalized "parallel trends" assumption for estimation. However, the assumption of random treatment timing is stronger than parallel trends. This suggests that it might be possible to obtain more precise estimates by more fully exploiting the random timing of treatment.

This paper considers efficient estimation of causal effects in settings where the timing of treatment is (as-if) randomly assigned. We begin by introducing a design-based framework that formalizes the notion that treatment timing is (as-if) randomly assigned.
We consider estimation of a variety of causal estimands using a class of estimators that nests canonical two-way fixed effects models as well as the alternative estimators discussed above as special cases. We then solve for the most efficient estimator in this class. The efficient estimator more fully exploits the implications of random treatment timing, which is stronger than the generalized parallel trends assumption on which conventional estimators are based. As a result, the efficient estimator we derive asymptotically dominates conventional estimation approaches for the same estimand in terms of efficiency, with large gains in Monte Carlo simulations. We therefore recommend use of the efficient estimator in settings where treatment timing is either random by design or assumed to be quasi-random.

For clarity of exposition and to connect our results to previous work, we begin by analyzing the canonical two-period difference-in-differences model under randomized treatment. All units are untreated in the first period, and a subset of units are randomly assigned to begin treatment in the second period. We consider estimators of the average treatment effect (ATE) of the form $\hat{\theta}_\beta = (\bar{Y}_{1,1} - \bar{Y}_{0,1}) - \beta (\bar{Y}_{1,0} - \bar{Y}_{0,0})$, where $\bar{Y}_{d,t}$ is the sample mean of the outcome for treatment group $d$ in period $t$. These estimators take the simple difference in means in period 1 and then adjust linearly for the difference in means in period 0. The canonical difference-in-differences estimator corresponds with the special case of $\beta = 1$. Under the assumption that period 0 outcomes are unaffected by treatment status in period 1 (i.e. there are no anticipatory effects of treatment), the period 0 outcomes are isomorphic to fixed covariates in a random experiment.
We can then apply results from Lin (2013) on covariate adjustment in random experiments to (i) show that $\hat{\theta}_\beta$ is unbiased for the ATE for all $\beta$, and (ii) solve for the variance-minimizing value $\beta^*$, which depends on the covariance between the (treated and untreated) potential outcomes in period 1 and the pre-treatment outcomes. In general, the efficient value $\beta^*$ will not be equal to 1, and thus the DiD estimator will be inefficient. Although the "oracle" $\beta^*$ will generally not be known, as in Lin (2013), a plug-in estimator based on a sample analog of $\beta^*$ will achieve the efficient variance in large populations.

We next consider the more practically relevant case in which there is staggered timing across multiple periods. There are $T$ periods, and unit $i$ is first treated in period $G_i \in \mathcal{G} \subseteq \{1, \ldots, T, \infty\}$, with $G_i = \infty$ denoting that $i$ is never treated. There are many possible ways of aggregating treatment effects across cohorts and time periods in the staggered treatment setting, and so we consider a broad class of estimands that encompasses many possible aggregation schemes. Specifically, we define $\tau_{t,gg'}$ to be the average effect on the outcome in period $t$ of changing the initial treatment date from $g'$ to $g$. We then consider the class of estimands that are linear combinations of these building blocks, $\theta = \sum_{t,g,g'} a_{t,gg'} \tau_{t,gg'}$. Our framework thus accommodates a variety of summary measures of dynamic treatment effects, including several aggregation schemes proposed in the recent literature.

We consider the class of estimators that start with a sample analog to the target parameter and then adjust by a linear combination of pre-treatment outcomes.
More precisely, we consider estimators of the form $\hat{\theta}_\beta = \sum_{t,g,g'} a_{t,gg'} \hat{\tau}_{t,gg'} - \hat{X}'\beta$, where the first term replaces the $\tau_{t,gg'}$ in the definition of $\theta$ with their sample analogs, and the second term adjusts for a linear combination of $\hat{X}$, where $\hat{X}$ is a vector that compares outcomes for cohorts treated at different dates at points in time before either was treated. We show that a variety of estimation procedures are part of this class for an appropriately defined $\hat{X}$, including the classical TWFE model as well as recent procedures proposed by Sun and Abraham (2020), de Chaisemartin and D'Haultfœuille (2020), and Callaway and Sant'Anna (2020). All estimators of this form are unbiased for $\theta$ under the assumptions of random treatment timing and no anticipation.

We then derive the most efficient estimator in this class. The optimal coefficient $\beta^*$ depends on covariances between the potential outcomes over time, and thus in general will not coincide with the fixed coefficients in any of the previously proposed procedures discussed above. As in the two-period case, the "oracle" value $\beta^*$ will typically not be known ex ante, and will need to be replaced with a sample analog $\hat{\beta}^*$. Similar to the two-period case, we show that the plug-in estimator is asymptotically unbiased and efficient under large-population asymptotics, exploiting the generalized finite-population central limit theorems in Li and Ding (2017). In a Monte Carlo study calibrated to our application, we find that confidence intervals based on the plug-in efficient estimator have good coverage properties and are substantially shorter than those based on the procedures of Callaway and Sant'Anna (2020) and de Chaisemartin and D'Haultfœuille (2020).

As an illustration of our method and a standalone empirical contribution, we re-examine the effectiveness of procedural justice training for police officers. We use data from Wood et al.
(2020a), who studied the randomized rollout of a procedural justice training program in Chicago. The original study by Wood et al. (2020a) found that the program produced large and highly statistically significant reductions in (sustained) complaints against police officers and officer use of force. These findings have been influential in policy discussions about police reform (e.g. Doleac, 2020). However, an earlier version of our analysis revealed a statistical error in the analysis of Wood et al. (2020a), which did not account for the fact that cohorts trained on different days were of different sizes, leading to spuriously large estimates. In Wood, Tyler, Papachristos, Roth and Sant'Anna (2020b), we worked with the authors of the original paper to re-analyze the data using the state-of-the-art tools in Callaway and Sant'Anna (2020). Our re-analysis found no significant effects on complaints or sustained complaints, and borderline significant effects on police use of force, although the confidence intervals for all three outcomes were wide and included both economically small and meaningful effects. In this paper, we re-analyze the data using our proposed methodology.

We find that our proposed methodology allows us to obtain substantially more precise estimates of the effect of the training program, reducing standard errors by a factor of between 1.3 and 5.6 depending on the specification. We again find limited evidence of a meaningful effect of the program on complaints or sustained complaints, and borderline significant overall effects on use of force. Our revised estimates have much greater precision than in our previous analysis, however.
For example, our baseline estimate for the overall average effect on complaints (using a simple aggregation across time periods and cohorts) is -2% relative to the pre-treatment mean with a 95% CI of [-11, 6], compared with an estimate of -10% and CI of [-26, 5] using the procedure of Callaway and Sant'Anna (2020). Likewise, we find a borderline significant effect on use of force of -15% (CI: [-29, 0]), compared with our previous estimate of -22% (CI: [-43, -2]). We caution, however, that the marginally significant results for use of force are not significant after adjusting for testing hypotheses on multiple outcomes.

Related Literature.
This paper contributes to an active literature on difference-in-differences and related methods with staggered treatment timing. Several recent papers have illustrated that the estimand of standard TWFE models may not have an intuitive causal interpretation when there are heterogeneous treatment effects, and new estimators for more sensible causal estimands have been introduced (Athey and Imbens, 2018; Borusyak and Jaravel, 2016; Callaway and Sant'Anna, 2020; de Chaisemartin and D'Haultfœuille, 2020; Goodman-Bacon, 2018; Imai and Kim, 2020; Meer and West, 2016; Słoczyński, 2020; Sun and Abraham, 2020). In contrast to most of the previous literature, we consider the efficiency of various procedures under random treatment timing. This assumption is stronger than the generalized parallel trends assumptions considered in previous work, and thus our proposed method will not be applicable in settings where the researcher is confident in parallel trends but not in random treatment timing. On the other hand, under suitable regularity conditions our proposed plug-in efficient estimator is at least as efficient in large populations as the methods proposed in previous work when treatment timing is (as-if) randomly assigned, and will often be substantially more precise.

Additionally, most of the pre-existing literature has adopted a sampling-based perspective, where uncertainty in the data arises from the sample drawn from a superpopulation. By contrast, we adopt a design-based framework in which the population is fixed and uncertainty arises from the randomness of treatment timing. This framework is useful for formalizing the notion of random treatment timing, and may be especially appealing in settings where the superpopulation is not clear, such as when the researcher has access to all counties in the United States (Manski and Pepper, 2018).
Athey and Imbens (2018) adopt a design-based framework similar to ours, but consider the interpretation of the estimand of two-way fixed effects models rather than efficient estimation. Shaikh and Toulis (2019) consider inference on sharp null hypotheses in a design-based model where treatment timing is random conditional on observables; by contrast, we consider inference on average causal effects under unconditional random treatment timing. (Relatedly, in Roth and Sant'Anna (2021), we show that if treatment timing is not random, then the parallel trends assumption will be sensitive to functional form without strong assumptions on the full distribution of potential outcomes.)

Several papers in both the economics and biostatistics literatures study the efficiency of estimators related to the oracle coefficient $\beta^*$ in a sampling-based model; however, they do not consider estimation or inference when the oracle is unknown. None of the aforementioned papers considers a design-based framework, nor do they study the common case of staggered treatment timing as we do in Section 3.

Our work is also related to Xiong et al. (2019) and Basse, Ding and Toulis (2020), who consider how to optimally design a staggered rollout experiment to maximize the efficiency of a fixed estimator. By contrast, we solve for the most efficient estimator given a fixed experimental design. Ding and Li (2019) show a bracketing relationship between the biases of difference-in-differences and other estimators in the class we consider when treatment timing is not random, but do not consider efficiency under random treatment timing.

Finally, we contribute to the literature on the effectiveness of procedural justice training programs for police officers. Previous work has studied the program in Chicago that we study (Wood et al., 2020a,b) and a smaller pilot evaluation in Seattle (Owens, Weisburd, Amendola and Alpert, 2018).
Although qualitatively in line with previous findings in the literature, our analysis provides by far the most precise estimates from a randomized evaluation. For example, the standard error for our estimate of the effect on citizen complaints, measured as a percentage of the pre-treatment mean, is 1.9 times smaller than the estimate in Wood et al. (2020b) and over 3 times smaller than that in Owens et al. (2018).

We begin by developing intuition for our more general results by considering a canonical two-period difference-in-differences model. All of the results in this section can be viewed as special cases of the more general results for staggered rollouts in Section 3. We provide proofs where we think they will aid in developing intuition, but defer some of the more technical proofs to the theorems in the following section.
There is a finite population of $N$ units. We observe data for 2 periods, $t = 0, 1$. All units are untreated in $t = 0$, and some units receive a treatment of interest in $t = 1$. We denote by $Y_{it}(1), Y_{it}(0)$ the potential outcomes for unit $i$ in period $t$ under treatment and control, respectively, and we observe the outcome $Y_{it} = D_i Y_{it}(1) + (1 - D_i) Y_{it}(0)$, where $D_i$ is an indicator for whether unit $i$ is treated. Following Neyman (1923) for randomized experiments and Athey and Imbens (2018) and Rambachan and Roth (2020) for DiD designs, we treat as fixed (or condition on) the potential outcomes and the number of treated and untreated units ($N_1$ and $N_0$); the only source of uncertainty in our model comes from the vector of treatment assignments $D = (D_1, \ldots, D_N)'$, which is stochastic. All expectations ($\mathbb{E}[\cdot]$) and probability statements ($\mathbb{P}(\cdot)$) are taken over the distribution of $D$ conditional on the number of treated units ($N_1$) and the potential outcomes, although we suppress this conditioning unless needed for clarity. For a non-stochastic attribute $W_i$ (e.g. a function of the potential outcomes), we denote by $\mathbb{E}_f[W_i] = N^{-1} \sum_i W_i$ and $Var_f[W_i] = (N - 1)^{-1} \sum_i (W_i - \mathbb{E}_f[W_i])(W_i - \mathbb{E}_f[W_i])'$ the finite-population expectation and variance of $W_i$.

The target parameter of interest is the average treatment effect in $t = 1$,

$$\tau := \frac{1}{N} \sum_i \left( Y_{i,t=1}(1) - Y_{i,t=1}(0) \right).$$

We now introduce two assumptions that we will maintain throughout our analysis. We first assume that the assignment of treatment status is random.

Assumption 1 (Random treatment assignment (2 periods)). $\mathbb{P}(D = d) = 1 / \binom{N}{N_1}$ if $\sum_i d_i = N_1$, and 0 otherwise.

We also assume that treatment status has no effect on outcomes in $t = 0$, before treatment is implemented. This assumption is plausible in many contexts, but may be violated if individuals learn of treatment status beforehand and adjust their behavior in anticipation (Malani and Reif, 2015).

Assumption 2 (No anticipation (2 periods)).
For all $i$, $Y_{i,t=0}(1) = Y_{i,t=0}(0)$.

(In Section 3, we will index potential outcomes by the date of treatment timing, so $Y_{it}(0)$ in this section corresponds with $Y_{it}(\infty)$ in the notation of Section 3. We use the notation $Y_{it}(0)$ here to make the connections to the literature on randomized experiments more explicit. Note also that because we condition on the number of treated units ($N_1$), in contrast to standard sampling-based approaches, unit $i$'s treatment status $D_i$ is correlated with that of unit $j$.)

2.2 Efficient Estimation and Comparison to DiD

The canonical difference-in-differences estimator is

$$\hat{\tau}_{DiD} = (\bar{Y}_{1,t=1} - \bar{Y}_{0,t=1}) - (\bar{Y}_{1,t=0} - \bar{Y}_{0,t=0}), \quad (1)$$

where $\bar{Y}_{1,t} = N_1^{-1} \sum_i D_i Y_{it}$ and $\bar{Y}_{0,t} = N_0^{-1} \sum_i (1 - D_i) Y_{it}$ are the sample means for the treated and untreated groups in period $t$.

Note that $\hat{\tau}_{DiD}$ is a special case of the class of estimators of the form

$$\hat{\tau}_\beta = (\bar{Y}_{1,t=1} - \bar{Y}_{0,t=1}) - \beta (\bar{Y}_{1,t=0} - \bar{Y}_{0,t=0}).$$

The estimator $\hat{\tau}_\beta$ takes the simple difference-in-means between the treated and control groups in period $t = 1$, and then adjusts by a factor $\beta$ times the difference in means in the pre-treatment period.

We now draw connections between estimators of the form $\hat{\tau}_\beta$ and estimators that apply covariate adjustments in cross-sectional random experiments. Note that under Assumption 2, $Y_{i,t=0} = Y_{i,t=0}(0)$ regardless of $i$'s treatment status. Our setting is thus isomorphic to a cross-sectional randomized experiment in which the outcome of interest is $Y_i = Y_{i,t=1}$ and we have fixed pre-treatment covariates $X_i = Y_{i,t=0}(0)$. In the cross-sectional setting, Lin (2013) and Li and Ding (2017) consider estimators of the form

$$\hat{\tau}(b_1, b_0) = \bar{Y}_1 - \bar{Y}_0 - (\bar{X}_1 - \bar{X}) b_1 + (\bar{X}_0 - \bar{X}) b_0,$$

where $\bar{Y}_1 = N_1^{-1} \sum_i D_i Y_i$ and the other terms are defined analogously. Observe, however, that the unconditional mean $\bar{X} = N^{-1} \sum_i X_i$ is a weighted average of $\bar{X}_1$ and $\bar{X}_0$, i.e. $\bar{X} = (N_1 / N) \bar{X}_1 + (N_0 / N) \bar{X}_0$.
It then follows from some straightforward algebra that

$$\hat{\tau}(b_1, b_0) = (\bar{Y}_1 - \bar{Y}_0) - \left( \frac{N_0}{N} b_1 + \frac{N_1}{N} b_0 \right) (\bar{X}_1 - \bar{X}_0).$$

The estimator $\hat{\tau}(b_1, b_0)$ is thus equivalent to $\hat{\tau}_\beta$ with $\beta = (N_0 / N) b_1 + (N_1 / N) b_0$.

With this equivalence in hand, it is straightforward to apply the results in Lin (2013) and Li and Ding (2017) to (i) show that $\hat{\tau}_\beta$ is unbiased for the ATE for all $\beta$, and (ii) solve for the efficient coefficient $\beta^*$ that minimizes the variance of $\hat{\tau}_\beta$.

Proposition 2.1 (Unbiasedness of $\hat{\tau}_\beta$). Under Assumptions 1 and 2, $\mathbb{E}[\hat{\tau}_\beta] = \tau$ for all $\beta$.

Proof.
The proof is immediate from the results in Lin (2013) and Li and Ding (2017) given the analogy to covariate adjustment in randomized experiments, but we provide a short proof for completeness. Observe that $\bar{Y}_{1,t=1} = N_1^{-1} \sum_i D_i Y_{i,t=1}(1)$. By Assumption 1, $\mathbb{E}[D_i] = N_1 / N$, so

$$\mathbb{E}\left[ \bar{Y}_{1,t=1} \right] = \frac{1}{N_1} \sum_i \mathbb{E}[D_i] \, Y_{i,t=1}(1) = \frac{1}{N} \sum_i Y_{i,t=1}(1).$$

By analogous arguments for the other terms, we have that

$$\mathbb{E}\left[ \hat{\tau}_\beta \right] = \left( \frac{1}{N} \sum_i Y_{i,t=1}(1) - \frac{1}{N} \sum_i Y_{i,t=1}(0) \right) - \beta \left( \frac{1}{N} \sum_i Y_{i,t=0}(1) - \frac{1}{N} \sum_i Y_{i,t=0}(0) \right) = \tau - \beta \left( \frac{1}{N} \sum_i Y_{i,t=0}(1) - \frac{1}{N} \sum_i Y_{i,t=0}(0) \right).$$

However, Assumption 2 implies that the second term in the previous display is zero, which gives the desired result.
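Proposition 2.1 can also be checked numerically. The following is a minimal simulation sketch (not from the paper; the data-generating process, sample sizes, and seed are all hypothetical) illustrating that $\hat{\tau}_\beta$ is unbiased for the finite-population ATE for any choice of $\beta$, including the DiD case $\beta = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, N1 = 200, 80                                 # population size, number treated (hypothetical)
# Fixed potential outcomes; period-0 outcome is unaffected by treatment (no anticipation)
Y_t0 = rng.normal(0.0, 1.0, N)                  # Y_{i,t=0}(0) = Y_{i,t=0}(1)
Y1_t1 = 0.7 * Y_t0 + rng.normal(1.0, 1.0, N)    # Y_{i,t=1}(1)
Y0_t1 = 0.4 * Y_t0 + rng.normal(0.0, 1.0, N)    # Y_{i,t=1}(0)
tau = np.mean(Y1_t1 - Y0_t1)                    # finite-population ATE

def draw_D():
    """Completely random assignment of exactly N1 treated units (Assumption 1)."""
    D = np.zeros(N, dtype=bool)
    D[rng.choice(N, N1, replace=False)] = True
    return D

def tau_hat(D, beta):
    """tau_hat_beta: difference in means in period 1, minus beta times the period-0 difference."""
    dm1 = Y1_t1[D].mean() - Y0_t1[~D].mean()    # observed outcomes in period 1
    dm0 = Y_t0[D].mean() - Y_t0[~D].mean()      # observed (pre-treatment) outcomes in period 0
    return dm1 - beta * dm0

# Monte Carlo means across re-randomizations, for several beta (beta = 1 is canonical DiD)
mc_means = {beta: np.mean([tau_hat(draw_D(), beta) for _ in range(4000)])
            for beta in (0.0, 1.0, 2.5)}
```

Across re-randomizations of $D$, each Monte Carlo mean is close to $\tau$ regardless of $\beta$; only the variance of $\hat{\tau}_\beta$ differs across choices of $\beta$.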
Proposition 2.2.
Let $\beta_d$ be the coefficient on $Y_{i,t=0}(0)$ from a regression of $Y_{i,t=1}(d)$ on $Y_{i,t=0}(0)$ and a constant. Let $\beta^* = (N_0 / N) \beta_1 + (N_1 / N) \beta_0$. If Assumptions 1 and 2 hold, then $Var[\hat{\tau}_{\beta^*}] \le Var[\hat{\tau}_\beta]$ for all $\beta \in \mathbb{R}$, with strict inequality for any $\beta \ne \beta^*$ if $Var_f[Y_{i,t=0}] > 0$.

Proof. We have shown that the estimator $\hat{\tau}_{\beta^*}$ is equivalent to the estimator $\hat{\tau}(\beta_1, \beta_0)$ considered in Lin (2013) and Li and Ding (2017), with $Y_i = Y_{i,t=1}$ and $X_i = Y_{i,t=0}(0)$. It then follows immediately from the results in Lin (2013) and Li and Ding (2017) that $\hat{\tau}_{\beta^*}$ has minimal variance. Further, Li and Ding (2017) show that for any $(b_1, b_0)$, $Var[\hat{\tau}(\beta_1, \beta_0)] < Var[\hat{\tau}(b_1, b_0)]$ unless $Var[\hat{\tau}(\beta_1, \beta_0) - \hat{\tau}(b_1, b_0)] = 0$. Thus, $Var[\hat{\tau}_{\beta^*}] < Var[\hat{\tau}_{\tilde{\beta}}]$ for any $\tilde{\beta}$ unless $Var[\hat{\tau}_{\beta^*} - \hat{\tau}_{\tilde{\beta}}] = 0$. However,

$$\hat{\tau}_{\beta^*} - \hat{\tau}_{\tilde{\beta}} = \hat{\tau}\left(0, \tfrac{N}{N_1} \beta^*\right) - \hat{\tau}\left(0, \tfrac{N}{N_1} \tilde{\beta}\right) = \tfrac{N}{N_0} (\beta^* - \tilde{\beta}) (\bar{X} - \bar{X}_1).$$

Since $\bar{X}_1 = N_1^{-1} \sum_i D_i X_i$ is the mean of a simple random sample of size $N_1$, it has positive variance if $Var_f[X_i] > 0$. Thus, $Var[\hat{\tau}_{\beta^*}] < Var[\hat{\tau}_{\tilde{\beta}}]$ if $Var_f[X_i] > 0$ and $\beta^* \ne \tilde{\beta}$.

Proposition 2.2 implies that unless the potential outcomes happen to be such that $(N_0 / N) \beta_1 + (N_1 / N) \beta_0 = 1$, the variance of $\hat{\tau}_{DiD}$ is dominated by that of $\hat{\tau}_{\beta^*}$.

2.3 The plug-in efficient estimator

In practical settings, however, the "oracle" coefficient $\beta^*$ is not known. Mirroring Lin (2013) in the cross-sectional case, we now show that $\beta^*$ can be approximated by a plug-in estimate $\hat{\beta}^*$, and the resulting estimator $\hat{\tau}_{\hat{\beta}^*}$ has similar properties to the "oracle" estimator $\hat{\tau}_{\beta^*}$.

We first describe the construction of the plug-in estimator. Consider a regression of $Y_{i,t=1}$ on $Y_{i,t=0}$ and a constant among units with $D_i = 1$. Let $\hat{\beta}_1$ denote the coefficient on $Y_{i,t=0}$, i.e.

$$\hat{\beta}_1 = \left( \sum_i D_i \dot{Y}_{i,t=0}^2 \right)^{-1} \left( \sum_i D_i \dot{Y}_{i,t=0} Y_{i,t=1} \right),$$

where $\dot{Y}_{i,t=0}$ is the pre-treatment outcome de-meaned within the relevant treatment group.
Define $\hat{\beta}_0$ to be the coefficient from the analogous regression among $D_i = 0$ units,

$$\hat{\beta}_0 = \left( \sum_i (1 - D_i) \dot{Y}_{i,t=0}^2 \right)^{-1} \left( \sum_i (1 - D_i) \dot{Y}_{i,t=0} Y_{i,t=1} \right).$$

Letting $\hat{\beta}^* = (N_0 / N) \hat{\beta}_1 + (N_1 / N) \hat{\beta}_0$, the estimator $\hat{\tau}_{\hat{\beta}^*}$ is then a feasible approximation to $\hat{\tau}_{\beta^*}$. It is straightforward to show that the estimator $\hat{\tau}_{\hat{\beta}^*}$ is equivalent to the coefficient on $D_i$ in the "interacted" ordinary least squares (OLS) regression considered in Lin (2013),

$$Y_{i,t=1} = \beta_0 + \beta_1 D_i + \beta_2 \dot{Y}_{i,t=0} + \beta_3 D_i \times \dot{Y}_{i,t=0} + \epsilon_i, \quad (2)$$

where in (2) the pre-treatment outcome is de-meaned by the overall sample mean.

We will now show that when the population is sufficiently large, $\hat{\tau}_{\hat{\beta}^*}$ is approximately unbiased for $\tau$ and achieves the same variance as the oracle estimator $\hat{\tau}_{\beta^*}$. As in Lin (2013), Li and Ding (2017), and other papers, we consider sequences of populations indexed by $m$ where $N_{1,m}$ and $N_{0,m}$ grow large. For ease of notation, we leave the index $m$ implicit for the remainder of the paper. We assume the sequence of populations satisfies the following regularity conditions.

Assumption 3 (Sequences of populations). Let $Y_i(d) = (Y_{i,t=0}(d), Y_{i,t=1}(d))'$ for $d = 0, 1$. Let $S_1^2$, $S_0^2$, and $S_{10}$ denote the finite-population variances and covariance of $Y_i(1)$ and $Y_i(0)$:

$$S_d^2 = \frac{1}{N - 1} \sum_i (Y_i(d) - \bar{Y}(d))(Y_i(d) - \bar{Y}(d))', \qquad S_{10} = \frac{1}{N - 1} \sum_i (Y_i(1) - \bar{Y}(1))(Y_i(0) - \bar{Y}(0))',$$

where $\bar{Y}(d) = N^{-1} \sum_i Y_i(d)$. We assume:

(i) $N_1 / N \to p \in (0, 1)$.
(ii) $S_1^2$, $S_0^2$, and $S_{10}$ have finite limiting values, denoted $S_1^{2*} > 0$, $S_0^{2*} > 0$, and $S_{10}^*$.
(iii) $\max_i \| Y_i(d) - \bar{Y}(d) \|^2 / N \to 0$ for $d = 0, 1$.

Part (i) of Assumption 3 states that the fraction of treated units converges to a constant strictly between 0 and 1. Part (ii) states that the variances and covariances of the potential outcomes have limits. Part (iii) requires that no single observation dominates the variance of the potential outcomes, and is thus analogous to the familiar Lindeberg condition in sampling contexts.
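To make the construction concrete, here is a minimal sketch (hypothetical data, not the paper's code) of the plug-in estimator $\hat{\tau}_{\hat{\beta}^*}$, together with a numerical check of its equivalence to the coefficient on $D_i$ in the interacted OLS regression (2):

```python
import numpy as np

rng = np.random.default_rng(1)
N, N1 = 500, 200
N0 = N - N1
X = rng.normal(0.0, 1.0, N)                       # pre-treatment outcome Y_{i,t=0}
D = np.zeros(N, dtype=int)
D[rng.choice(N, N1, replace=False)] = 1
# Observed period-1 outcome (hypothetical DGP with slopes that differ by arm)
Y = np.where(D == 1, 1.0 + 0.8 * X + rng.normal(0, 1, N),
                     0.3 * X + rng.normal(0, 1, N))

def slope(y, x):
    """OLS slope from a regression of y on x and a constant."""
    xc = x - x.mean()
    return (xc @ y) / (xc @ xc)

beta1 = slope(Y[D == 1], X[D == 1])               # regression among treated units
beta0 = slope(Y[D == 0], X[D == 0])               # regression among untreated units
beta_star_hat = (N0 / N) * beta1 + (N1 / N) * beta0

tau_plugin = (Y[D == 1].mean() - Y[D == 0].mean()
              - beta_star_hat * (X[D == 1].mean() - X[D == 0].mean()))

# Equivalent: coefficient on D in the interacted regression, with X de-meaned overall
Xd = X - X.mean()
Z = np.column_stack([np.ones(N), D, Xd, D * Xd])
coef = np.linalg.lstsq(Z, Y, rcond=None)[0]       # coef[1] is the coefficient on D
```

Here `coef[1]` and `tau_plugin` agree up to floating-point error, which is the regression representation used in (2).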
With these assumptions in hand, we can now formally state the sense in which $\hat{\tau}_{\hat{\beta}^*}$ is asymptotically unbiased and as efficient as $\hat{\tau}_{\beta^*}$.

Proposition 2.3.
Under Assumptions 1, 2, and 3,

$$\sqrt{N} \left( \hat{\tau}_{\hat{\beta}^*} - \tau \right) \to_d \mathcal{N}\left( 0, \sigma^{2*} \right), \quad (3)$$

where $\sigma^{2*} = \lim_{N \to \infty} N \, Var[\hat{\tau}_{\beta^*}]$.

Proof. Since $\hat{\tau}_{\hat{\beta}^*}$ is equivalent to the coefficient on $D_i$ from (2), the result follows immediately from Theorem 1 in Lin (2013). Alternatively, this can be viewed as a special case of Proposition 3.2 below.

To develop intuition for how $\hat{\tau}_{\hat{\beta}^*}$ achieves the same asymptotic variance as $\hat{\tau}_{\beta^*}$, observe that we can write

$$\hat{\tau}_{\hat{\beta}^*} = \hat{\tau}_{\beta^*} - (\hat{\beta}^* - \beta^*)(\bar{Y}_{1,t=0} - \bar{Y}_{0,t=0}) = \hat{\tau}_{\beta^*} - (\hat{\beta}^* - \beta^*) \left( (\bar{Y}_{1,t=0} - \bar{Y}_{t=0}) - (\bar{Y}_{0,t=0} - \bar{Y}_{t=0}) \right),$$

where $\bar{Y}_{t=0} = N^{-1} \sum_i Y_{i,t=0}(0)$. By standard arguments for finite populations, $\hat{\beta}^* - \beta^*$ and $\bar{Y}_{d,t=0} - \bar{Y}_{t=0}$ are $O_p(N^{-1/2})$. It follows that $\hat{\tau}_{\hat{\beta}^*} - \hat{\tau}_{\beta^*}$ is the product of $O_p(N^{-1/2})$ terms, and hence is $O_p(N^{-1})$, whereas $\hat{\tau}_{\beta^*} - \tau$ is $O_p(N^{-1/2})$. We thus see that the error induced from estimating $\beta^*$ is of higher order than the variation of $\hat{\tau}_{\beta^*}$. In sufficiently large populations, the noise induced by estimating $\beta^*$ is thus negligible. A similar analysis applies to the finite-sample bias, which as in Lin (2013) is also $O_p(N^{-1})$.

Remark 1.
Building on results in Frison and Pocock (1992), McKenzie (2012) proposes using the coefficient $\gamma_1$ from the OLS regression

$$Y_{i,t=1} = \gamma_0 + \gamma_1 D_i + \gamma_2 Y_{i,t=0} + \epsilon_i, \quad (4)$$

which is sometimes referred to as the Analysis of Covariance (ANCOVA). This differs from the regression representation of the efficient plug-in estimator in (2) in that it omits the interaction term $D_i \times Y_{i,t=0}$. Treating $Y_{i,t=0}$ as a fixed pre-treatment covariate, the coefficient $\hat{\gamma}_1$ from (4) is equivalent to the estimator studied in Freedman (2008b,a). The results in Lin (2013) therefore imply that McKenzie (2012)'s estimator will have the same asymptotic efficiency as $\hat{\tau}_{\hat{\beta}^*}$ under constant treatment effects. Intuitively, this is because the coefficient on the interaction term in (2) converges in probability to 0. However, the results in Freedman (2008b,a) imply that under heterogeneous treatment effects McKenzie (2012)'s estimator may even be less efficient than the unadjusted difference-in-means $\hat{\tau}_0$, which in turn is (weakly) less efficient than $\hat{\tau}_{\hat{\beta}^*}$. Relatedly, Wan (2020) proves that the coefficient on $D_i$ from (2) is asymptotically at least as efficient as $\hat{\gamma}_1$ from (4) in a sampling-based model that assumes normally distributed potential outcomes.

To form confidence sets for $\tau$ based on (3), one needs to estimate the variance $\sigma^{2*}$. As is typical in finite-population settings, it is not possible to obtain a consistent variance estimate under treatment effect heterogeneity. We show, however, that one can obtain a consistent estimator for an upper bound on the asymptotic variance. The variance estimator is less conservative than the conventional Neyman estimator in that it accounts for heterogeneity that is explained by lagged outcomes.

We begin with the following decomposition of the variance.

Lemma 2.1.
Under Assumptions 1 and 2,

$$Var[\hat{\tau}_{\beta^*}] = \frac{1}{N_1} \tilde{S}_1^2 + \frac{1}{N_0} \tilde{S}_0^2 - \frac{1}{N} \tilde{S}_\tau^2,$$

where $\tilde{S}_1^2$ is the finite-population variance of $Y_{i,t=1}(1) - \beta_1 Y_{i,t=0}(0)$; $\tilde{S}_0^2$ is the finite-population variance of $Y_{i,t=1}(0) - \beta_0 Y_{i,t=0}(0)$; and $\tilde{S}_\tau^2$ is the finite-population variance of $Y_{i,t=1}(1) - Y_{i,t=1}(0) - (\beta_1 - \beta_0) Y_{i,t=0}(0)$.

Proof. Immediate from Example 9.1 in Li and Ding (2017).
Proposition 2.4.
Let

$$\tilde{s}_1^2 = \frac{1}{N_1} \sum_i D_i \left( \dot{Y}_{i,t=1} - \dot{Y}_{i,t=0} \hat{\beta}_1 \right)^2, \qquad \tilde{s}_0^2 = \frac{1}{N_0} \sum_i (1 - D_i) \left( \dot{Y}_{i,t=1} - \dot{Y}_{i,t=0} \hat{\beta}_0 \right)^2,$$

where dotted outcomes are de-meaned within the relevant treatment group. Under Assumptions 1, 2, and 3,

$$\frac{N}{N_1} \tilde{s}_1^2 + \frac{N}{N_0} \tilde{s}_0^2 \to_p \sigma^{2*} + \tilde{S}_\tau^{2*},$$

where $\tilde{S}_\tau^{2*} = \lim_{N \to \infty} \tilde{S}_\tau^2$.

Proof. Follows as a special case of Lemma 3.7 below.

Proposition 2.4 shows that the variance estimate $\tilde{s}^2 = (N / N_1) \tilde{s}_1^2 + (N / N_0) \tilde{s}_0^2$ is asymptotically conservative. It is strictly conservative if $\tilde{S}_\tau^{2*} > 0$, meaning that there is positive asymptotic variance of the "adjusted" treatment effects $\tau_i - (\beta_1 - \beta_0) Y_{i,t=0}(0)$, i.e. heterogeneous treatment effects that are not linear functions of lagged outcomes. We note that in a completely randomized experiment, the typical Neyman variance estimator is conservative by the variance of $\tau_i$. Since $Var_f[\tau_i - (\beta_1 - \beta_0) Y_{i,t=0}(0)] = Var_f[\tau_i] - (\beta_1 - \beta_0)^2 Var_f[Y_{i,t=0}(0)]$, the variance estimator here is less conservative than the usual Neyman variance estimator.

We now extend the results above to the more complex setting in which there is staggered treatment timing across multiple periods.
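Before moving on, the two-period conservative variance estimate and the resulting confidence interval can be sketched as follows (a hypothetical numerical example, not the paper's code; the data-generating process is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, N1 = 500, 200
N0 = N - N1
X = rng.normal(0.0, 1.0, N)                       # pre-treatment outcome Y_{i,t=0}
D = np.zeros(N, dtype=int)
D[rng.choice(N, N1, replace=False)] = 1
Y = np.where(D == 1, 1.0 + 0.8 * X + rng.normal(0, 1, N),
                     0.3 * X + rng.normal(0, 1, N))

def fit(y, x):
    """Group-level regression of y on x and a constant; returns slope and residuals."""
    xc, yc = x - x.mean(), y - y.mean()
    b = (xc @ yc) / (xc @ xc)
    return b, yc - b * xc

b1, r1 = fit(Y[D == 1], X[D == 1])
b0, r0 = fit(Y[D == 0], X[D == 0])
beta_star_hat = (N0 / N) * b1 + (N1 / N) * b0
tau_hat = (Y[D == 1].mean() - Y[D == 0].mean()
           - beta_star_hat * (X[D == 1].mean() - X[D == 0].mean()))

# Conservative variance: within-group residual variances of the group regressions;
# the (unidentified) S_tau^2 / N term is dropped, so the estimate is weakly too large
s2_1 = (r1 @ r1) / (N1 - 1)
s2_0 = (r0 @ r0) / (N0 - 1)
se = np.sqrt(s2_1 / N1 + s2_0 / N0)
ci = (tau_hat - 1.96 * se, tau_hat + 1.96 * se)   # asymptotically conservative 95% CI
```

Because the variance estimate adjusts for heterogeneity explained by the lagged outcome, the resulting interval is (weakly) shorter than one based on the usual Neyman variance.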
There is again a finite population of $N$ units. We observe data for $T$ periods, $t = 1, \ldots, T$. A unit's treatment status is indexed by $G_i \in \mathcal{G} \subseteq \{1, \ldots, T, \infty\}$, where $G_i$ corresponds with the first period in which unit $i$ is treated (and $G_i = \infty$ denotes that a unit is never treated). We assume that treatment is an absorbing state. We denote by $Y_{it}(g)$ the potential outcome for unit $i$ in period $t$ when treatment starts at time $g$, and define the vector $Y_i(g) = (Y_{i1}(g), \ldots, Y_{iT}(g))' \in \mathbb{R}^T$. We let $D_{ig} = 1[G_i = g]$. The observed vector of outcomes for unit $i$ is then $Y_i = \sum_g D_{ig} Y_i(g)$. We treat as fixed the number of units first treated at each time $g$, denoted $N_g$, and assume that the timing of treatment is random.

Assumption 4 (Random treatment timing). Let $D$ be the random $N \times |\mathcal{G}|$ matrix with $(i, g)$-th element $D_{ig}$. Then $\mathbb{P}(D = d) = \left( \prod_{g \in \mathcal{G}} N_g! \right) / N!$ if $\sum_i d_{ig} = N_g$ for all $g$, and zero otherwise.

Remark 2 (Stratified treatment assignment). For simplicity, we consider the case of unconditional random treatment timing. In some settings, the treatment timing may be randomized among units with some shared observable characteristics (e.g. counties within a state). In such cases, the methodology developed below can be applied to form efficient estimators for each stratum, and the stratum-level estimates can then be pooled to form aggregate estimates for the population.

As in the two-period model, we also assume that the treatment has no causal impact on the outcome in periods before it is implemented.
Assumption 5 (No anticipation). For all $i$, $Y_{it}(g) = Y_{it}(g')$ for all $g, g' > t$.

Note that this assumption does not restrict treatment effects once treatment has begun: we allow $Y_{it}(g) \ne Y_{it}(g')$ whenever $t \ge \min(g, g')$. Rather, we only require that, say, a unit's outcome in period 1 does not depend on whether it was ultimately treated in period 2 or period 3.

Following Athey and Imbens (2018), we define $\tau_{it,gg'} = Y_{it}(g) - Y_{it}(g')$ to be the causal effect of switching the treatment date from $g'$ to $g$ on unit $i$'s outcome in period $t$. We define $\tau_{t,gg'} = N^{-1} \sum_i \tau_{it,gg'}$ to be the average treatment effect (ATE) of switching treatment from $g'$ to $g$ on outcomes in period $t$. We will consider scalar estimands of the form

$$\theta = \sum_{t,g,g'} a_{t,gg'} \tau_{t,gg'}, \quad (5)$$

i.e. weighted sums of the average treatment effects of switching the treatment date from $g'$ to $g$. Researchers will often be interested in weighted averages of the $\tau_{t,gg'}$, in which case the $a_{t,gg'}$ will sum to 1, although our results allow for general $a_{t,gg'}$. The results extend easily to vector-valued $\theta$'s where each component is of the form in the previous display; we focus on the scalar case for ease of notation. The no-anticipation assumption (Assumption 5) implies that $\tau_{t,gg'} = 0$ if $t < \min(g, g')$, and so without loss of generality we make the normalization that $a_{t,gg'} = 0$ if $t < \min(g, g')$.

Researchers are often interested in the effect of receiving treatment at a particular time relative to not receiving treatment at all. We will define $ATE(t, g) := \tau_{t,g\infty}$ to be the average treatment effect on the outcome in period $t$ of being first treated in period $g$ relative to not being treated at all.
The $ATE(t,g)$ is a close analog of the cohort average treatment effects on the treated considered in Callaway and Sant'Anna (2020) and Sun and Abraham (2020). The main difference is that those papers do not assume random treatment timing, and thus consider treatment effects on the treated population rather than average treatment effects (in a sampling-based framework).

Our framework incorporates a variety of possible summary measures that aggregate the $ATE(t,g)$ across different cohorts and time periods. The following definitions mirror those proposed in Callaway and Sant'Anna (2020) for the $ATT(t,g)$. (Recall that our results allow for general $a_{t,gg'}$; this permits, for instance, $\theta$ to represent the difference between long-run and short-run effects, so that some of the $a_{t,gg'}$ are negative.) We define the simple weighted ATE to be the weighted average of the $ATE(t,g)$, where each $ATE(t,g)$ is weighted by the cohort size $N_g$,

\[ \theta_{simple} = \frac{1}{\sum_t \sum_{g: g \le t} N_g} \sum_t \sum_{g: g \le t} N_g\, ATE(t,g). \]

Likewise, we define the time- and cohort-specific weighted averages as

\[ \theta_t = \frac{1}{\sum_{g: g \le t} N_g} \sum_{g: g \le t} N_g\, ATE(t,g) \quad \text{and} \quad \theta_g = \frac{1}{T - g + 1} \sum_{t: t \ge g} ATE(t,g), \]

and introduce the summary parameters

\[ \theta_{calendar} = \frac{1}{T}\sum_t \theta_t \quad \text{and} \quad \theta_{cohort} = \frac{1}{|\mathcal{G} \setminus \{\infty\}|} \sum_{g: g \neq \infty} \theta_g, \]

where $|A|$ denotes the cardinality of a set $A$. Finally, we introduce "event-study" parameters that aggregate the treatment effects at a given lag $l$ since treatment began,

\[ \theta_{ES}^{l} = \frac{1}{\sum_{g: g+l \le T} N_g} \sum_{g: g+l \le T} N_g\, ATE(g+l, g). \]

Note that the instantaneous parameter $\theta_{ES}^{0}$ is analogous to the estimand considered in de Chaisemartin and D'Haultfœuille (2020) in settings like ours where treatment is an absorbing state (although their framework also extends to the more general setting where treatment turns on and off).

We now introduce the class of estimators we will consider. Let $\hat{\bar Y}_g = N_g^{-1}\sum_i D_{ig} Y_i$ be the sample mean of the outcome vector for treatment group $g$, and let $\hat\tau_{t,gg'} = \hat{\bar Y}_{g,t} - \hat{\bar Y}_{g',t}$ be the sample analog of $\tau_{t,gg'}$.
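Before turning to estimation, the aggregation schemes above can be illustrated with hypothetical numbers (an illustrative sketch of ours; the $ATE(t,g)$ values and cohort sizes below are invented, not from the paper):

```python
import numpy as np

# Hypothetical ATE(t, g) values and cohort sizes: T = 3 periods, cohorts
# first treated in g = 2 or g = 3. Keys of ATE are (t, g) pairs with t >= g.
ATE = {(2, 2): 1.0, (3, 2): 2.0, (3, 3): 0.5}
N_g = {2: 30, 3: 70}
T = 3

# Simple weighted ATE: each ATE(t, g) weighted by its cohort size N_g.
theta_simple = (sum(N_g[g] * ATE[t, g] for (t, g) in ATE)
                / sum(N_g[g] for (t, g) in ATE))

# Cohort-specific averages theta_g (unweighted over t >= g), and their
# unweighted mean theta_cohort across treated cohorts.
theta_g = {g: np.mean([ATE[t, g] for t in range(g, T + 1)]) for g in N_g}
theta_cohort = np.mean(list(theta_g.values()))

# Event-study parameter at lag l = 0: cohort-size-weighted average of ATE(g, g).
elig = [g for g in N_g if g + 0 <= T]
theta_ES0 = sum(N_g[g] * ATE[g, g] for g in elig) / sum(N_g[g] for g in elig)
```

Each summary parameter is just a different weighting of the same underlying $ATE(t,g)$ building blocks, which is why all of them fit the generic form (5).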
We define

\[ \hat\theta_0 = \sum_{t,g,g'} a_{t,gg'}\, \hat\tau_{t,gg'}, \]

which replaces the population means in the definition of $\theta$ with their sample analogs. We will consider estimators of the form

\[ \hat\theta_\beta = \hat\theta_0 - \hat X'\beta, \tag{6} \]

where, intuitively, $\hat X$ is a vector of differences-in-means that are guaranteed to be mean-zero under the assumptions of random treatment timing and no anticipation. Formally, we consider $M$-dimensional vectors $\hat X$ where each element of $\hat X$ takes the form

\[ \hat X_j = \sum_{(t,g,g'):\, g, g' > t} a^{j}_{t,gg'}\, \hat\tau_{t,gg'}. \]

There are many possible choices of $\hat X$ that satisfy these restrictions. For example, $\hat X$ could be a vector in which each component equals $\hat\tau_{t,gg'}$ for a different combination of $(t,g,g')$ with $t < g, g'$. Alternatively, $\hat X$ could be a scalar that takes a weighted average of such differences. The choice of $\hat X$ is analogous to the choice of which variables to control for in a simple randomized experiment: in principle, including more covariates (a higher-dimensional $\hat X$) can improve asymptotic precision, yet including "too many" covariates may lead to over-fitting and thus to poor performance in practice. For now, we suppose the researcher has chosen a fixed $\hat X$, and we consider the optimal choice of $\beta$ for a given $\hat X$. We return to the choice of $\hat X$ in the discussion of our Monte Carlo results in Section 4 below.

Several estimators proposed in the literature can be viewed as special cases of the class of estimators we consider, with $\beta = 1$ and an appropriately-defined scalar $\hat X$.

Example 1 (Callaway and Sant'Anna (2020)). For settings where there is a never-treated group (i.e., $\infty \in \mathcal G$), Callaway and Sant'Anna (2020) consider the estimator

\[ \hat\tau^{CS}_{tg} = \hat\tau_{t,g\infty} - \hat\tau_{g-1,g\infty}, \]

i.e., a difference-in-differences that compares outcomes between periods $t$ and $g-1$ for the cohort first treated in period $g$ relative to the never-treated cohort. It is clear that $\hat\tau^{CS}_{tg}$ can be viewed as an estimator of
$ATE(t,g)$ of the form given in (6), with $\hat X = \hat\tau_{g-1,g\infty}$ and $\beta = 1$. Likewise, Callaway and Sant'Anna (2020) consider estimators that aggregate the $\hat\tau^{CS}_{tg}$, say $\hat\tau^{CS}_w = \sum_{t,g} w_{t,g}\,\hat\tau^{CS}_{t,g}$, which can be viewed as an estimator of the parameter $\theta_w = \sum_{t,g} w_{t,g}\, ATE(t,g)$ of the form (6) with $\hat X = \sum_{t,g} w_{t,g}\,\hat\tau_{g-1,g\infty}$ and $\beta = 1$. (This could also be viewed as an estimator of the form (6) if $\hat X$ were a vector with elements corresponding with the $\hat\tau_{g-1,g\infty}$ and $\beta$ a vector with elements corresponding with the $w_{t,g}$.) Similarly, Callaway and Sant'Anna (2020) consider an estimator that replaces the never-treated group with an average over cohorts not yet treated in period $t$,

\[ \hat\tau^{CS'}_{tg} = \sum_{g' > t}\frac{N_{g'}}{\sum_{g'' > t}N_{g''}}\,\hat\tau_{t,gg'} \;-\; \sum_{g' > t}\frac{N_{g'}}{\sum_{g'' > t}N_{g''}}\,\hat\tau_{g-1,gg'}, \qquad \text{for } t \ge g. \]

(In principle, the vector $\hat X$ could also include pre-treatment differences in means of non-linear transformations of the outcome as well; see Guo and Basse (2020) for related results on non-linear covariate adjustments in randomized experiments.)
It is again apparent that this estimator can be written as an estimator of
$ATE(t,g)$ of the form in (6), with $\hat X$ now corresponding with a weighted average of the $\hat\tau_{g-1,gg'}$ and $\beta$ again equal to 1.

Example 2 (Sun and Abraham (2020)). Sun and Abraham (2020) consider an estimator that is equivalent to that in Callaway and Sant'Anna (2020) in the case where there is a never-treated cohort. When there is no never-treated group, Sun and Abraham (2020) propose using the last cohort to be treated as the control. Formally, they consider the estimator of
$ATE(t,g)$ of the form

\[ \hat\tau^{SA}_{tg} = \hat\tau_{t,g g_{max}} - \hat\tau_{s,g g_{max}}, \]

where $g_{max} = \max \mathcal G$ is the last period in which units begin receiving treatment and $s < g$ is some reference period before $g$ (e.g., $g - 1$). It is clear that $\hat\tau^{SA}_{tg}$ takes the form (6), with $\hat X = \hat\tau_{s,g g_{max}}$ and $\beta = 1$. Weighted averages of the $\hat\tau^{SA}_{tg}$ can likewise be expressed in the form (6), analogous to the Callaway and Sant'Anna (2020) estimators.
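To illustrate how Examples 1 and 2 fit the form (6), the following sketch (our own illustration on simulated placeholder data, not the authors' code) computes a Callaway and Sant'Anna (2020) style estimate of $ATE(2, 2)$ as $\hat\theta_0 - \hat X\beta$ with $\beta = 1$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy panel: N = 60 units, T = 3 periods (columns 0..2 hold t = 1..3),
# cohort g = 2 treated from period 2 onward, plus a never-treated group.
N, T = 60, 3
G = np.array([2] * 20 + [np.inf] * 40)
Y = rng.normal(size=(N, T))
Y[G == 2, 1:] += 1.0  # add a unit treatment effect in periods t >= 2

def tau_hat(t, g, g2):
    """Sample analog tau_hat_{t,gg'} = Ybar_{g,t} - Ybar_{g',t}."""
    return Y[G == g, t - 1].mean() - Y[G == g2, t - 1].mean()

theta0 = tau_hat(2, 2, np.inf)   # post-period difference in means
X_hat = tau_hat(1, 2, np.inf)    # pre-period difference in means (mean zero)
tau_cs = theta0 - 1.0 * X_hat    # form (6) with beta = 1: the DiD comparison
```

Setting the coefficient on `X_hat` to 0 instead of 1 would give the simple difference in means, which previews the choice of $\beta$ studied below.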
Example 3 (de Chaisemartin and D'Haultfœuille (2020)). de Chaisemartin and D'Haultfœuille (2020) propose an estimator of the instantaneous effect of a treatment. Although their estimator extends to settings where treatment turns on and off, in a setting like ours where treatment is an absorbing state, their estimator can be written as a linear combination of the $\hat\tau^{CS'}_{tg}$. In particular, they consider a weighted average of the treatment effect estimates for the first period in which a unit was treated,

\[ \hat\tau^{dCH} = \sum_{g: g \le T} \frac{N_g}{\sum_{g': g' \le T} N_{g'}}\, \hat\tau^{CS'}_{g,g}. \]

It is thus immediate from the previous examples that their estimator can also be written in the form (6).
Example 4 (Two-period DiD). It may be instructive to consider how the estimators from the two-period model in Section 2 fit into the general framework. That model corresponds with $T = 2$ and $\mathcal G = \{2, \infty\}$. $\hat X$ is simply the difference in sample means between the treatment groups in the pre-treatment period, $\hat\tau_{1,2\infty}$. The DiD estimator in the two-period model thus corresponds with $\hat\theta_1$ (i.e., $\beta = 1$), while the simple difference in means corresponds with $\hat\theta_0$.

Example 5 (TWFE models). Athey and Imbens (2018) consider the setting with $\mathcal G = \{1, \ldots, T, \infty\}$. Let $D_{it} = 1[G_i \le t]$ be an indicator for whether unit $i$ has been treated by period $t$. (Note that the potential outcomes $Y_{it}(\infty)$ and $Y_{it}(g)$ used in this section correspond with the more traditional $Y_{it}(0)$ and $Y_{it}(1)$ used in Section 2.) They show that the estimated coefficient $\hat\theta_{TWFE}$ on $D_{it}$ from the two-way fixed effects specification

\[ Y_{it} = \alpha_i + \lambda_t + D_{it}\theta_{TWFE} + \epsilon_{it} \tag{7} \]

can be decomposed as

\[ \hat\theta_{TWFE} = \sum_t \sum_{(g,g'):\, \min(g,g') \le t} \gamma_{t,gg'}\, \hat\tau_{t,gg'} + \sum_t \sum_{(g,g'):\, \min(g,g') > t} \gamma_{t,gg'}\, \hat\tau_{t,gg'} \tag{8} \]

for weights $\gamma_{t,gg'}$ that depend only on the $N_g$ and $N$, and thus are non-stochastic in our framework. Thus, $\hat\theta_{TWFE}$ can be viewed as an estimator of the form (6) for the parameter $\theta_{TWFE} = \sum_t \sum_{(g,g'): \min(g,g') \le t} \gamma_{t,gg'}\, \tau_{t,gg'}$, with $\hat X = -\sum_t \sum_{(g,g'): \min(g,g') > t} \gamma_{t,gg'}\, \hat\tau_{t,gg'}$ and $\beta = 1$.

Notation.
Recall that the sample treatment effect estimates $\hat\tau_{t,gg'}$ are themselves differences in sample means, $\hat\tau_{t,gg'} = \hat{\bar Y}_{g,t} - \hat{\bar Y}_{g',t}$. It follows that we can write

\[ \hat\theta_0 = \sum_g A_{\theta,g}\, \hat{\bar Y}_g \quad \text{and} \quad \hat X = \sum_g A_{0,g}\, \hat{\bar Y}_g \]

for appropriately defined matrices $A_{\theta,g}$ and $A_{0,g}$ of dimension $1 \times T$ and $M \times T$, respectively. These definitions will be useful in deriving our theoretical results below.

We now consider the problem of finding the best estimator $\hat\theta_\beta$ of the form introduced in (6). First, it is straightforward to show that $\hat\theta_\beta$ is unbiased for any fixed $\beta$.

Lemma 3.1 ($\hat\theta_\beta$ unbiased). Under Assumptions 4 and 5, $E[\hat\theta_\beta] = \theta$ for any $\beta \in \mathbb{R}^M$.

Proof. By Assumption 4, $E[D_{ig}] = N_g/N$. Hence,

\[ E[\hat\theta_0] = E\Big[\sum_g A_{\theta,g}\frac{1}{N_g}\sum_i D_{ig} Y_i\Big] = \sum_g A_{\theta,g}\frac{1}{N_g}\sum_i E[D_{ig}]\, Y_i(g) = \sum_g A_{\theta,g}\frac{1}{N}\sum_i Y_i(g) = \theta. \]

Likewise,

\[ E[\hat X] = E\Big[\sum_g A_{0,g}\frac{1}{N_g}\sum_i D_{ig} Y_i\Big] = \sum_g A_{0,g}\frac{1}{N}\sum_i Y_i(g) = \frac{1}{N}\sum_i \sum_g A_{0,g}\, Y_i(g) = 0, \]

since $\sum_g A_{0,g} Y_i(g) = 0$ for each $i$ by Assumption 5. The result follows immediately from the previous two displays. □

We now turn our attention to deriving the variance of $\hat\theta_\beta$ and solving for the most efficient $\beta$. We first introduce some notation. Let $S_g = (N-1)^{-1}\sum_i (Y_i(g) - \bar Y(g))(Y_i(g) - \bar Y(g))'$ be the finite-population variance of $Y_i(g)$, and let $S_{gg'} = (N-1)^{-1}\sum_i (Y_i(g) - \bar Y(g))(Y_i(g') - \bar Y(g'))'$ be the finite-population covariance. To derive the variance of $\hat\theta_\beta$, it will be useful to first solve for the joint variance of $(\hat\theta_0, \hat X')'$.

Lemma 3.2.
Under Assumptions 4 and 5,

\[ Var\left[\begin{pmatrix}\hat\theta_0 \\ \hat X\end{pmatrix}\right] = \begin{pmatrix} \sum_g N_g^{-1} A_{\theta,g} S_g A_{\theta,g}' - S_\theta & \sum_g N_g^{-1} A_{\theta,g} S_g A_{0,g}' \\ \sum_g N_g^{-1} A_{0,g} S_g A_{\theta,g}' & \sum_g N_g^{-1} A_{0,g} S_g A_{0,g}' \end{pmatrix} =: \begin{pmatrix} V_{\hat\theta} & V_{\hat\theta,X} \\ V_{X,\hat\theta} & V_X \end{pmatrix}, \]

where $S_\theta = Var_f\big[\sum_g A_{\theta,g} Y_i(g)\big]$.

Proof. Let $A_{\tau,g} = \begin{pmatrix} A_{\theta,g} \\ A_{0,g} \end{pmatrix}$. Then we can write

\[ \hat\tau := \sum_g A_{\tau,g}\, \hat{\bar Y}_g = \begin{pmatrix}\hat\theta_0 \\ \hat X\end{pmatrix}. \]

Since Assumption 4 holds, we can appeal to Theorem 3 in Li and Ding (2017), which implies that $Var[\hat\tau] = \sum_g N_g^{-1} A_{\tau,g} S_g A_{\tau,g}' - S_\tau$, where $S_\tau = Var_f\big[\sum_g A_{\tau,g} Y_i(g)\big]$. The result then follows immediately from expanding this variance, together with the observation that

\[ S_\tau = \begin{pmatrix} S_\theta & 0 \\ 0 & 0 \end{pmatrix}, \]

which follows from the fact that $\sum_g A_{0,g} Y_i(g) = E_f\big[\sum_g A_{0,g} Y_i(g)\big] = 0$ for all $i$ by Assumption 5. □

The variance of $\hat\theta_\beta$ then follows immediately.

Corollary 3.1.
Under Assumptions 4 and 5, $Var[\hat\theta_\beta] = V_{\hat\theta} + \beta' V_X \beta - 2\, V_{\hat\theta,X}\,\beta$.

Having solved for $Var[\hat\theta_\beta]$, we now derive the $\beta^*$ that minimizes the variance.

Proposition 3.1.
Suppose Assumptions 4 and 5 hold and that $V_X$ is full rank. Let $\beta^* = V_X^{-1} V_{X,\hat\theta}$, for $V_X$ and $V_{X,\hat\theta}$ as defined in Lemma 3.2. Then

\[ Var[\hat\theta_{\beta^*}] = V_{\hat\theta} - V_{\hat\theta,X} V_X^{-1} V_{X,\hat\theta} \le Var[\hat\theta_\beta] \quad \text{for all } \beta \in \mathbb{R}^M. \]

Proof. First, note that

\[ Var[\hat\theta_{\beta^*}] = Var[\hat\theta_0] + \beta^{*\prime}\, Var[\hat X]\, \beta^* - 2\, Cov[\hat\theta_0, \hat X]'\beta^* = V_{\hat\theta} + V_{\hat\theta,X}V_X^{-1}V_{X,\hat\theta} - 2\,V_{\hat\theta,X}V_X^{-1}V_{X,\hat\theta} = V_{\hat\theta} - V_{\hat\theta,X}V_X^{-1}V_{X,\hat\theta}. \]

Next, note that for any $\beta$,

\[ Var[\hat\theta_\beta] = Var[\hat\theta_0 - \hat X'\beta] = E\big[(\hat\theta_0 - \theta - \hat X'\beta)^2\big], \]

where we use the fact that $E[\hat\theta_0] = \theta$ from Lemma 3.1 and that $E[\hat X] = 0$ from the construction of $\hat X$ along with Assumption 5. It is then immediate that $\beta^*$ is optimal if and only if it solves the least-squares problem $\min_\beta E[(\hat\theta_0 - \theta - \hat X'\beta)^2]$. The solution is

\[ \beta^* = E[\hat X \hat X']^{-1} E[\hat X(\hat\theta_0 - \theta)] = V_X^{-1} V_{X,\hat\theta}, \]

as needed. □

As in Section 2, the efficient estimator $\hat\theta_{\beta^*}$ is not of practical use, since the "oracle" coefficient $\beta^*$ is not known. We now show that in large populations a feasible plug-in estimator $\hat\theta_{\hat\beta^*}$ has similar properties to the oracle estimator. In particular, let

\[ \hat S_g = \frac{1}{N_g - 1}\sum_i D_{ig}\,(Y_i - \hat{\bar Y}_g)(Y_i - \hat{\bar Y}_g)' \]

and let $\hat V_{X,\hat\theta}$ and $\hat V_X$ be the analogs to $V_{X,\hat\theta}$ and $V_X$ that replace $S_g$ with $\hat S_g$ in the definitions. We then define $\hat\beta^* = \hat V_X^{-1}\hat V_{X,\hat\theta}$. We will now show that the feasible plug-in estimator $\hat\theta_{\hat\beta^*}$ is asymptotically unbiased and as efficient as the oracle estimator $\hat\theta_{\beta^*}$.

We again consider a sequence of finite populations satisfying certain regularity conditions, analogous to the exercise in Section 2.

Assumption 6. (i) For all $g \in \mathcal G$, $N_g/N \to p_g \in (0,1)$. (ii) For all $g, g'$, $S_g$ and $S_{gg'}$ have limiting values, denoted $S^*_g$ and $S^*_{gg'}$ respectively, with $S^*_g$ positive definite. (iii) $\max_{i,g} \|Y_i(g) - \bar Y(g)\|_2^2 / N \to 0$.

Assumption 6 is analogous to Assumption 3 for the two-period case. Part (i) requires that the probability that treatment begins in period $g \in \mathcal G$ converges to a constant strictly between 0 and 1.
Part (ii) requires that the variances and covariances of the potential outcomes converge to constants. Part (iii) requires that no single observation dominates the finite-population variance of the potential outcomes.

We now provide two lemmas that characterize the asymptotic joint distribution of $(\hat\theta_0, \hat X')'$ and show that $\hat S_g$ is consistent for $S^*_g$ under Assumption 6. Both results are direct consequences of the general asymptotic results in Li and Ding (2017) for multi-valued treatments in randomized experiments.

Lemma 3.3.
Under Assumptions 4, 5, and 6,

\[ \sqrt{N}\begin{pmatrix}\hat\theta_0 - \theta \\ \hat X\end{pmatrix} \to_d \mathcal N(0, V^*), \]

where

\[ V^* = \begin{pmatrix} \sum_g p_g^{-1} A_{\theta,g} S^*_g A_{\theta,g}' - S^*_\theta & \sum_g p_g^{-1} A_{\theta,g} S^*_g A_{0,g}' \\ \sum_g p_g^{-1} A_{0,g} S^*_g A_{\theta,g}' & \sum_g p_g^{-1} A_{0,g} S^*_g A_{0,g}' \end{pmatrix} =: \begin{pmatrix} V^*_{\hat\theta} & V^*_{\hat\theta,X} \\ V^*_{X,\hat\theta} & V^*_X \end{pmatrix}, \]

and $S^*_\theta = \lim_{N\to\infty} S_\theta$ (where $S_\theta$ is defined in Lemma 3.2).

Proof. As in the proof of Lemma 3.2, we can write

\[ \hat\tau = \sum_g A_{\tau,g}\,\hat{\bar Y}_g = \begin{pmatrix}\hat\theta_0 \\ \hat X\end{pmatrix}. \]

The result then follows from Theorem 5 in Li and Ding (2017), combined with the observation noted in the proof of Lemma 3.2 that

\[ S_\tau = \begin{pmatrix} S_\theta & 0 \\ 0 & 0 \end{pmatrix}, \quad \text{and hence} \quad S_\tau \to \begin{pmatrix} S^*_\theta & 0 \\ 0 & 0 \end{pmatrix}. \quad \Box \]

Lemma 3.4.
Under Assumptions 4, 5, and 6, $\hat S_g \to_p S^*_g$ for all $g$.

Proof. Follows immediately from Proposition 3 in Li and Ding (2017). □

It is now straightforward to derive the limiting distribution of $\hat\theta_{\hat\beta^*}$.

Proposition 3.2.
Under Assumptions 4, 5, and 6,

\[ \sqrt{N}\,(\hat\theta_{\hat\beta^*} - \theta) \to_d \mathcal N(0, \sigma^{*2}), \quad \text{where} \quad \sigma^{*2} = V^*_{\hat\theta} - V^{*\prime}_{X,\hat\theta}\,(V^*_X)^{-1}\,V^*_{X,\hat\theta} = \lim_{N\to\infty} N\, Var[\hat\theta_{\beta^*}]. \]

Proof.
Recall that $\hat\beta^* = \hat V_X^{-1}\hat V_{X,\hat\theta}$. It is clear that $\hat\beta^*$ is a continuous function of $\hat V_X$ and $\hat V_{X,\hat\theta}$, and that $\hat V_X$ and $\hat V_{X,\hat\theta}$ are continuous functions of the $\hat S_g$. From Lemma 3.4 along with the continuous mapping theorem, we obtain that $\hat\beta^* \to_p (V^*_X)^{-1}V^*_{X,\hat\theta}$. Lemma 3.3 together with Slutsky's lemma then gives that $\sqrt N(\hat\theta_{\hat\beta^*} - \theta) \to_d \mathcal N\big(0,\, V^*_{\hat\theta} - V^{*\prime}_{X,\hat\theta}(V^*_X)^{-1}V^*_{X,\hat\theta}\big)$. From Proposition 3.1, it is apparent that the asymptotic variance of $\hat\theta_{\hat\beta^*}$ is equal to the limit of $N\,Var[\hat\theta_{\beta^*}]$, which completes the proof. □

To construct confidence intervals using Proposition 3.2, one requires an estimate of $\sigma^{*2}$. We first introduce a simple Neyman-style variance estimator that is conservative under treatment effect heterogeneity. We then introduce a refinement to this estimator that adjusts for the part of the heterogeneity explained by $\hat X$.

From Proposition 3.2, as well as the definition of $V^*$, we have that

\[ \sigma^{*2} = \Big(\sum_g p_g^{-1} A_{\theta,g} S^*_g A_{\theta,g}' - S^*_\theta\Big) - \Big(\sum_g p_g^{-1} A_{\theta,g} S^*_g A_{0,g}'\Big)\Big(\sum_g p_g^{-1} A_{0,g} S^*_g A_{0,g}'\Big)^{-1}\Big(\sum_g p_g^{-1} A_{0,g} S^*_g A_{\theta,g}'\Big). \]

Since $S^*_g$ is consistently estimable (Lemma 3.4), a natural conservative (Neyman-style) variance estimator replaces $S^*_g$ with $\hat S_g$ and ignores the $-S^*_\theta$ term. That is, we consider

\[ \hat\sigma^{*2} = \Big(\sum_g \frac{N}{N_g} A_{\theta,g} \hat S_g A_{\theta,g}'\Big) - \Big(\sum_g \frac{N}{N_g} A_{\theta,g} \hat S_g A_{0,g}'\Big)\Big(\sum_g \frac{N}{N_g} A_{0,g} \hat S_g A_{0,g}'\Big)^{-1}\Big(\sum_g \frac{N}{N_g} A_{0,g} \hat S_g A_{\theta,g}'\Big). \]

Lemma 3.5.
Under Assumptions 4, 5, and 6, $\hat\sigma^{*2} \to_p \sigma^{*2} + S^*_\theta \ge \sigma^{*2}$.

Proof. Immediate from Lemma 3.4 combined with the continuous mapping theorem. □

Intuitively, the Neyman-style variance estimator proposed above is conservative when there is treatment effect heterogeneity. (Aronow, Green and Lee (2014) provide sharp bounds on the variance of the difference-in-means estimator in randomized experiments, although these bounds are difficult to extend to other estimators and settings like those considered here.) When the estimand $\theta$ does not involve any treatment effects for the cohort treated in period 1, the estimator $\hat\sigma^{*2}$ can be improved by using outcomes from earlier periods. Intuitively, the refined estimator lower-bounds the heterogeneity in treatment effects by the part of the heterogeneity that is explained by the outcomes in earlier periods. The construction of this refined estimator mirrors the refinements using fixed covariates in randomized experiments considered in Lin (2013) and Abadie, Athey, Imbens and Wooldridge (2020), with lagged outcomes playing a role similar to that of fixed covariates.

Lemma 3.6. Suppose that $A_{\theta,g} = 0$ for all $g < g_{min}$. If Assumption 5 holds, then

\[ S_\theta = Var_f[\tilde\theta_i] + \Big(\sum_{g \ge g_{min}} \beta_g\Big)'\big(M S_{g_{min}} M'\big)\Big(\sum_{g \ge g_{min}} \beta_g\Big), \tag{9} \]

where $M$ is the matrix that selects the rows of $Y_i$ corresponding with $t < g_{min}$; $\beta_g = (M S_g M')^{-1} M S_g A_{\theta,g}'$ is the coefficient from projecting $A_{\theta,g} Y_i(g)$ on $M Y_i(g)$; and $\tilde\theta_i = \sum_{g \ge g_{min}} A_{\theta,g} Y_i(g) - \sum_{g \ge g_{min}} \big(M Y_i(g)\big)'\beta_g$.

Proof. For any functions of the potential outcomes $X_i \in \mathbb{R}^K$ and $Z_i \in \mathbb{R}$, let $\dot X_i = X_i - E_f[X_i]$, $\dot Z_i = Z_i - E_f[Z_i]$, and $\beta_{XZ} = Var_f[X_i]^{-1} E_f[\dot X_i \dot Z_i]$. We claim that

\[ Var_f[Z_i - \beta_{XZ}'X_i] = Var_f[Z_i] - \beta_{XZ}'\, Var_f[X_i]\, \beta_{XZ}. \]
Indeed,

\[ Var_f[Z_i - \beta_{XZ}'X_i] = E_f\big[(\dot Z_i - \beta_{XZ}'\dot X_i)^2\big] = E_f[\dot Z_i^2] + \beta_{XZ}'\,E_f[\dot X_i \dot X_i']\,\beta_{XZ} - 2\,\beta_{XZ}'\,E_f[\dot X_i \dot Z_i] \]
\[ = Var_f[Z_i] + \beta_{XZ}'\,Var_f[X_i]\,\beta_{XZ} - 2\,\beta_{XZ}'\,Var_f[X_i]\,\beta_{XZ} = Var_f[Z_i] - \beta_{XZ}'\,Var_f[X_i]\,\beta_{XZ}. \]

The result then follows from setting $Z_i = \sum_{g \ge g_{min}} A_{\theta,g} Y_i(g)$ and $X_i = M Y_i(g_{min})$, and noting that under Assumption 5, $M Y_i(g_{min}) = M Y_i(g)$ for all $g \ge g_{min}$, and hence $Var_f[M Y_i(g_{min})] = M S_{g_{min}} M' = M S_g M' = Var_f[M Y_i(g)]$. □

The second term on the right-hand side of (9) is consistently estimable, which allows us to obtain a tighter variance estimate. In particular, let

\[ \hat\sigma^{**2} = \hat\sigma^{*2} - \Big(\sum_{g \ge g_{min}} \hat\beta_g\Big)'\big(M \hat S_{g_{min}} M'\big)\Big(\sum_{g \ge g_{min}} \hat\beta_g\Big), \]

where $\hat\beta_g = (M \hat S_g M')^{-1} M \hat S_g A_{\theta,g}'$. Then $\hat\sigma^{**2}$ is consistent for a tighter upper bound on $\sigma^{*2}$.

Lemma 3.7.
Suppose that $A_{\theta,g} = 0$ for all $g < g_{min}$ and that Assumptions 4-6 hold. Then $\hat\sigma^{**2} \to_p \sigma^{*2} + S^*_{\tilde\theta}$, where $S^*_{\tilde\theta} = \lim_{N\to\infty} Var_f[\tilde\theta_i]$ for $\tilde\theta_i$ defined in Lemma 3.6, and $S^*_{\tilde\theta} \le S^*_\theta$.

(Assumption 5 implies that $M S_g M' = M S_{g_{min}} M'$ for all $g \ge g_{min}$. The term $M \hat S_{g_{min}} M'$ can thus be replaced by any convex combination of the $M \hat S_g M'$ for $g \ge g_{min}$; this has no effect on the asymptotic results, but may improve finite-sample performance.)

Proof. Note that $\hat\beta_g$ is a continuous function of $\hat S_g$. Lemma 3.4 together with the continuous mapping theorem implies that

\[ \Big(\sum_{g \ge g_{min}} \hat\beta_g\Big)'\big(M \hat S_{g_{min}} M'\big)\Big(\sum_{g \ge g_{min}} \hat\beta_g\Big) - \Big(\sum_{g \ge g_{min}} \beta_g\Big)'\big(M S_{g_{min}} M'\big)\Big(\sum_{g \ge g_{min}} \beta_g\Big) \to_p 0. \]

The result is then immediate from Lemmas 3.5 and 3.6. □
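The variance decomposition used in the proof of Lemma 3.6 is a finite-population projection identity, and it can be checked numerically. The sketch below (our own, on simulated data) verifies that $Var_f[Z_i - \beta_{XZ}'X_i] = Var_f[Z_i] - \beta_{XZ}'\,Var_f[X_i]\,\beta_{XZ}$ when $\beta_{XZ}$ is the projection coefficient:

```python
import numpy as np

rng = np.random.default_rng(5)

# A finite population of N = 500 units with scalar Z_i and 2-dimensional X_i
# (simulated, purely illustrative).
N = 500
X = rng.normal(size=(N, 2))
Z = X @ np.array([0.7, -0.3]) + rng.normal(size=N)

def var_f(W):
    """Finite-population (co)variance, with a common normalization."""
    Wc = W - W.mean(axis=0)
    return Wc.T @ Wc / (N - 1)

# Projection coefficient beta_XZ = Var_f[X]^{-1} Cov_f[X, Z].
cov_XZ = (X - X.mean(axis=0)).T @ (Z - Z.mean()) / (N - 1)
beta = np.linalg.solve(var_f(X), cov_XZ)

lhs = var_f(Z - X @ beta)                 # Var_f[Z - beta'X]
rhs = var_f(Z) - beta @ var_f(X) @ beta   # Var_f[Z] - beta' Var_f[X] beta
```

The identity holds exactly (up to floating point) because the cross term $2\beta' Cov_f[X,Z]$ equals $2\beta' Var_f[X]\beta$ by construction of $\beta$, which is the same cancellation used in the proof.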
We now discuss the implications of our results for estimators previously proposed in the literature. As discussed in Examples 1-3 above, the estimators of Callaway and Sant'Anna (2020), Sun and Abraham (2020), and de Chaisemartin and D'Haultfœuille (2020) correspond with the estimator $\hat\theta_1$ for an appropriately defined $\hat X$. Our results thus imply that, unless $\beta^* = 1$, the estimator $\hat\theta_{\beta^*}$ is unbiased for the same estimand and has strictly lower variance under random treatment timing. Since the optimal $\beta^*$ depends on the potential outcomes, we do not generically expect $\beta^* = 1$, and thus the previously-proposed estimators will generically be dominated in terms of efficiency. Although the optimal $\beta^*$ will typically not be known, our results imply that the plug-in estimator $\hat\theta_{\hat\beta^*}$ will have similar properties in large populations, and thus will be more efficient than the previously-proposed estimators in large populations. We note, however, that the estimators in the aforementioned papers are valid for the ATT in settings where parallel trends holds but treatment timing is not random, whereas randomization of treatment timing is necessary for the validity of the efficient estimator. We thus view the results on the efficient estimator as complementary to the estimators considered in previous work.

Similarly, in light of Example 5, our results imply that the TWFE estimator will generally not be the most efficient estimator for the TWFE estimand, $\theta_{TWFE}$. Previous work has argued that the estimand $\theta_{TWFE}$ may not be the most economically interesting estimand and may be difficult to interpret (e.g., Athey and Imbens (2018); Borusyak and Jaravel (2016); Goodman-Bacon (2018); de Chaisemartin and D'Haultfœuille (2020)). Our results provide a new and complementary critique of the TWFE specification: even if $\theta_{TWFE}$ is the target estimand, estimation via (7) will generally be inefficient in large populations under random treatment timing and no anticipation.
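To summarize the estimation and inference procedure developed in this section, here is a compact numerical sketch (our own illustration on simulated two-period data in the spirit of Example 4; all variable names are ours, and this is not the authors' code). It computes the plug-in coefficient $\hat\beta^*$, the efficient estimate $\hat\theta_{\hat\beta^*}$, and a conservative Neyman-style confidence interval based on $\hat\sigma^{*2}$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated two-period design: periods t = 1,2; cohort g = 2 is treated in
# period 2 and cohort inf is never treated. Illustrative only.
N1 = N0 = 100
N, rho = N1 + N0, 0.9
Y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N)
G = np.array([2.0] * N1 + [np.inf] * N0)
cohorts = [2.0, np.inf]

# A-matrices: theta_hat_0 is the period-2 difference in means, and the scalar
# covariate X_hat is the period-1 (pre-treatment) difference in means.
A_theta = {2.0: np.array([[0.0, 1.0]]), np.inf: np.array([[0.0, -1.0]])}
A_0 = {2.0: np.array([[1.0, 0.0]]), np.inf: np.array([[-1.0, 0.0]])}

N_g = {g: int((G == g).sum()) for g in cohorts}
Ybar = {g: Y[G == g].mean(axis=0) for g in cohorts}
S_hat = {g: np.cov(Y[G == g], rowvar=False) for g in cohorts}  # S_hat_g

theta0 = sum(A_theta[g] @ Ybar[g] for g in cohorts)
X_hat = sum(A_0[g] @ Ybar[g] for g in cohorts)

V_X = sum(A_0[g] @ S_hat[g] @ A_0[g].T / N_g[g] for g in cohorts)
V_Xtheta = sum(A_0[g] @ S_hat[g] @ A_theta[g].T / N_g[g] for g in cohorts)

beta_star = np.linalg.solve(V_X, V_Xtheta)       # plug-in beta_hat*
theta_eff = (theta0 - X_hat @ beta_star).item()  # plug-in efficient estimate

# Conservative Neyman-style variance (ignores the -S*_theta term); note the
# N/N_g scaling, so the CI divides sigma2_hat by N.
def qform(A, B):
    return sum((N / N_g[g]) * A[g] @ S_hat[g] @ B[g].T for g in cohorts)

sigma2_hat = (qform(A_theta, A_theta)
              - qform(A_theta, A_0) @ np.linalg.inv(qform(A_0, A_0))
              @ qform(A_0, A_theta)).item()
ci = (theta_eff - 1.96 * np.sqrt(sigma2_hat / N),
      theta_eff + 1.96 * np.sqrt(sigma2_hat / N))
```

With highly autocorrelated outcomes, $\hat\beta^*$ lands near 1 and the plug-in estimator behaves like DiD; with uncorrelated outcomes it lands near 0 and behaves like the simple difference in means.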
(The estimator of de Chaisemartin and D'Haultfœuille (2020) can also be applied in settings where treatment turns on and off over time.)

Monte Carlo Results
We present two sets of Monte Carlo results. In Section 4.1, we conduct simulations in a stylized two-period setting like that in Section 2, to illustrate how the efficient estimator compares to the classical difference-in-differences (DiD) and simple difference-in-means (DiM) estimators. Section 4.2 presents a more realistic set of simulations with staggered treatment timing, calibrated to the data from Wood et al. (2020a) that we use in our application.
Specification.
We follow the model in Section 2, in which there are two periods ($t = 0, 1$) and some units are treated in period 1. We first generate the potential outcomes as follows. For each unit $i$ in the population, we draw $Y_i(0) = (Y_{i,t=0}(0), Y_{i,t=1}(0))'$ from a $\mathcal N(0, \Sigma_\rho)$ distribution, where $\Sigma_\rho$ has 1s on the diagonal and $\rho$ on the off-diagonal. The parameter $\rho$ is the correlation between the untreated potential outcomes in period $t=0$ and period $t=1$. We then set $Y_{i,t=1}(1) = Y_{i,t=1}(0) + \tau_i$, where $\tau_i = \gamma\,\big(Y_{i,t=0}(0) - E_f[Y_{i,t=0}(0)]\big)$. The parameter $\gamma$ governs the degree of heterogeneity of treatment effects: if $\gamma = 0$, then there is no treatment effect heterogeneity, whereas if $\gamma$ is positive then individuals with larger untreated outcomes in $t=0$ have larger treatment effects. We center at $E_f[Y_{i,t=0}(0)]$ so that the treatment effects are 0 on average. We generate the potential outcomes once, and treat the population as fixed throughout our simulations. Our simulation draws then differ only in the draw of the treatment assignment vector. For simplicity, we set $N_1 = N_0 = N/2$, and in each simulation draw we randomly select which units are treated in period 1. We conduct 1000 simulations for each combination of $N \in \{50, 2000\}$, $\rho \in \{0, 0.5, 0.99\}$, and $\gamma \in \{0, 0.5\}$.

Results.
Table 1 shows the bias, standard deviation, and coverage of 95% confidence intervals based on the plug-in efficient estimator $\hat\theta_{\hat\beta^*}$, difference-in-differences $\hat\theta_{DiD} = \hat\theta_1$, and the simple difference-in-means $\hat\theta_{DiM} = \hat\theta_0$. Confidence intervals are constructed as $\hat\theta_{\hat\beta^*} \pm 1.96\,\hat\sigma^{**}/\sqrt{N}$ for the efficient estimator, and analogously for the other estimators. (For $\hat\theta_\beta$, we use an analog to $\hat\sigma^{**2}$ in which the unrefined estimate $\hat\sigma^{*2}$ for the efficient estimator is replaced with the sample analog of the expression for $Var[\hat\theta_\beta]$ given in Corollary 3.1.) For all specifications and estimators, the estimated bias is quite small, and coverage is close to the nominal level. Table 2 facilitates comparison of the standard deviations of the different estimators by showing their ratios relative to the plug-in estimator. The standard deviation of the plug-in efficient estimator is weakly smaller than that of either DiD or DiM in nearly all cases, and is never more than 2% larger than that of either. The standard deviation of the plug-in efficient estimator is similar to that of DiD when the autocorrelation of $Y(0)$ is high ($\rho = 0.99$) and there is no treatment effect heterogeneity ($\gamma = 0$), so that $\beta^* \approx 1$ and DiD is (nearly) optimal in the class we consider. Likewise, it is similar to DiM when there is no autocorrelation ($\rho = 0$) and no treatment effect heterogeneity ($\gamma = 0$), so that $\beta^* = 0$ and DiM is optimal in the class we consider. The plug-in efficient estimator is substantially more precise than DiD and DiM in many other specifications: in the worst such specification, the standard deviation of DiD is as much as 1.7 times larger than that of the plug-in efficient estimator, and the standard deviation of DiM can be as much as 7 times larger. These simulations thus illustrate how the plug-in efficient estimator can improve on DiD or DiM in cases where they are suboptimal, while retaining nearly identical performance when DiD or DiM is optimal.
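A scaled-down version of this experiment can be reproduced in a few lines. The sketch below is our own (not the authors' simulation code): the sizes and parameter values are chosen for speed rather than to match the paper, and it compares the standard deviations of DiM, DiD, and the plug-in efficient estimator across random assignment draws:

```python
import numpy as np

rng = np.random.default_rng(6)

# Scaled-down two-period Monte Carlo (hypothetical settings: N1 = N0 = 100,
# rho = 0.9, gamma = 0.5, 200 assignment draws).
N1 = N0 = 100
N, rho, gamma, n_sims = N1 + N0, 0.9, 0.5, 200

# Fixed finite population of potential outcomes, drawn once.
Y0 = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N)
tau = gamma * (Y0[:, 0] - Y0[:, 0].mean())   # heterogeneous effects, mean 0
Y1_treated = Y0[:, 1] + tau                  # treated outcome in t = 1

est = {"DiD": [], "DiM": [], "PlugIn": []}
for _ in range(n_sims):
    treat = np.zeros(N, dtype=bool)
    treat[rng.choice(N, N1, replace=False)] = True    # one assignment draw
    y_pre = Y0[:, 0]                                  # t = 0, unaffected
    y_post = np.where(treat, Y1_treated, Y0[:, 1])    # t = 1, observed
    dim = y_post[treat].mean() - y_post[~treat].mean()
    x = y_pre[treat].mean() - y_pre[~treat].mean()    # mean-zero covariate
    v_x = y_pre[treat].var(ddof=1) / N1 + y_pre[~treat].var(ddof=1) / N0
    v_xt = (np.cov(y_pre[treat], y_post[treat])[0, 1] / N1
            + np.cov(y_pre[~treat], y_post[~treat])[0, 1] / N0)
    est["DiM"].append(dim)                            # beta = 0
    est["DiD"].append(dim - x)                        # beta = 1
    est["PlugIn"].append(dim - (v_xt / v_x) * x)      # beta = beta_hat*

sds = {k: float(np.std(v)) for k, v in est.items()}   # true ATE is 0 here
```

Comparing the entries of `sds` mirrors the ratio comparisons reported in Table 2: with high autocorrelation, DiM is far noisier than the other two, and the plug-in estimator weakly improves on DiD.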
Table 1: Bias, Standard Deviation, and Coverage for $\hat\theta_{\hat\beta^*}$ (PlugIn), $\hat\theta_{DiD}$, and $\hat\theta_{DiM}$ in 2-period simulations. [Table entries omitted; the columns report Bias, SD, and Coverage for the PlugIn, DiD, and DiM estimators, with rows indexed by $N_1$, $N_0$, $\rho \in \{0.99, 0.50, 0.00\}$, and $\gamma \in \{0.0, 0.5\}$.]
To evaluate the performance of our proposed methods in a more realistic setting, we conduct simulations calibrated to our application to Wood et al. (2020a) in Section 5. The outcome of interest $Y_{it}$ is the number of complaints against police officer $i$ in month $t$, for police officers in Chicago. Police officers were randomly assigned to first receive a procedural justice training in period $G_i$. See Section 5 for more background on the application.

Simulation specification.
Table 2: Ratio of standard deviations for $\hat\theta_{DiD}$ and $\hat\theta_{DiM}$ relative to $\hat\theta_{\hat\beta^*}$ in 2-period simulations. [Table entries omitted; rows are indexed by $N_1$, $N_0$, $\rho$, and $\gamma$ as in Table 1.]

We calibrate our baseline specification as follows. The number of observations and time periods exactly matches the data from Wood et al. (2020a) used in our application. We set the untreated potential outcomes $Y_{it}(\infty)$ to match the observed outcomes $Y_i$ in the data (which would exactly match the true potential outcomes if there were no treatment effect on any units). In our baseline simulation specification, there is no causal effect of treatment, so that $Y_{it}(g) = Y_{it}(\infty)$ for all $g$. (We describe an alternative simulation design with heterogeneous treatment effects in Appendix Section A.) In each simulation draw $s$, we randomly draw a vector of treatment dates $G^s = (G^s_1, \ldots, G^s_N)$ such that the number of units first treated in each period $g$ matches that observed in the data (i.e., $\sum_i 1[G^s_i = g] = N_g$ for all $g$). In total, there are 72 months of data on 7785 officers. (As in our application, we restrict attention to police officers who remained in the police force throughout the sample period.) There are 48 distinct values of $g$, with the cohort size $N_g$ ranging from 6 to 642. In an alternative specification, we collapse the data to the yearly level, so that there are 6 time periods and 5 cohorts.

For each simulated data set, we calculate the plug-in efficient estimator $\hat\theta_{\hat\beta^*}$ for four estimands: the simple weighted average ATE ($\theta_{simple}$); the calendar- and cohort-weighted average treatment effects ($\theta_{calendar}$ and $\theta_{cohort}$); and the instantaneous event-study parameter ($\theta_{ES}^0$). (See Section 3.2 for the formal definitions of these estimands.) In our baseline specification, we use as $\hat X$ the scalar weighted combination of pre-treatment differences used by the Callaway and Sant'Anna (2020, CS) estimator for the appropriate estimand (see Example 1). In the appendix, we also present results for an alternative specification in which $\hat X$ is a vector containing $\hat\tau_{t,gg'}$ for all pairs $g, g' > t$.
For comparison, we also compute the CS estimator for the same estimand, using the not-yet-treated units as the control group (since all units are eventually treated). Recall that for $\theta_{ES}^0$, the CS estimator coincides with the estimator proposed in de Chaisemartin and D'Haultfœuille (2020) in our setting, since treatment is an absorbing state. Confidence intervals are calculated as $\hat\theta_{\hat\beta^*} \pm 1.96\,\hat\sigma^{**}/\sqrt{N}$ for the plug-in efficient estimator, and analogously for the CS estimator.

Baseline simulation results.
The results for our baseline specification are shown in Tables 3 and 4. As seen in Table 3, the plug-in efficient estimator is approximately unbiased, and 95% confidence intervals based on our standard errors have coverage rates close to the nominal level for all of the estimands, with size distortions no larger than 3% across our specifications. The CS estimator is likewise approximately unbiased and has excellent coverage for all of the estimands. Table 4 shows that there are large efficiency gains from using the plug-in efficient estimator relative to the CS estimator: it compares the standard deviation of the plug-in efficient estimator to that of the CS estimator. Remarkably, using the plug-in efficient estimator reduces the standard deviation by a factor of nearly two for the calendar-weighted average, and by a factor between 1.36 and 1.67 for the other estimands. Since standard errors are inversely proportional to the square root of the sample size, these results suggest that using the plug-in efficient estimator is roughly equivalent to multiplying the sample size by a factor of four for the calendar-weighted average.
Extensions.
In Appendix A, we present simulations from an alternative specification where the monthly data are collapsed to the yearly level, leading to fewer time periods and fewer (but larger) cohorts. Both the efficient and CS estimators have very good coverage and minimal bias. The efficient estimator again dominates the CS estimator in efficiency, although the gains are smaller (24 to 30% reductions in standard deviation). The smaller efficiency gains in this specification are intuitive: the CS estimator overweights the pre-treatment periods (relative to the efficient estimator) in our setting, but the penalty for doing so is smaller in the collapsed data, where the pre-treatment outcomes are averaged over more months and thus have lower variance.

In the appendix, we also present results from a modification of our baseline DGP with heterogeneous treatment effects. We again find that the plug-in efficient estimator performs well, with qualitative findings similar to those in the baseline specification, although the standard errors are somewhat conservative, as expected.

In the appendix, we also conduct simulations using a modified version of the plug-in efficient estimator in which $\hat X$ is a vector containing all possible comparisons of cohorts $g$ and $g'$ in periods $t < \min(g, g')$. We find poor coverage for this estimator in the monthly specification, where the dimension of $\hat X$ is large relative to the sample size (1987, compared with $N = 7785$), and thus the normal approximation derived in Proposition 3.2 is poor. By contrast, when the data are collapsed to the yearly level, so that the dimension of $\hat X$ constructed in this way is more modest (10), the coverage of this estimator is good, and it offers modest efficiency gains over the scalar $\hat X$ considered in the main text. These findings align with the results in Lei and Ding (2020), who show that covariate adjustment in cross-sectional experiments yields asymptotically normal estimators when the dimension of the covariates is $o(N^{1/2})$ (and certain regularity conditions are satisfied). We thus recommend using the version of $\hat X$ with all potential comparisons only when its dimension is small relative to the square root of the sample size.

Finally, we repeat the same exercise for the other outcomes used in our application (use of force and sustained complaints). We again find that the plug-in efficient estimator has minimal bias, good coverage properties, and is substantially more precise than the CS estimator for nearly all specifications (with reductions in standard deviations by a factor of over 3 for some specifications). The one exception to the good performance of the plug-in efficient estimator is the calendar-weighted average for sustained complaints when using the monthly data: the coverage of CIs based on the plug-in efficient estimator is only 79% in this specification.

Estimator   Estimand   Bias    Coverage   Mean SE   SD
PlugIn      calendar    0.00     0.93       0.27     0.29
PlugIn      cohort      0.00     0.92       0.24     0.24
PlugIn      ES0         0.01     0.94       0.26     0.27
PlugIn      simple      0.00     0.92       0.22     0.22
CS          calendar    0.00     0.94       0.55     0.55
CS          cohort     -0.01     0.95       0.41     0.41
CS/CdH      ES0         0.01     0.94       0.36     0.36
CS          simple     -0.01     0.96       0.41     0.40

Table 3: Results for Simulations Calibrated to Wood et al. (2020a)

Note: This table shows results for the plug-in efficient and Callaway and Sant'Anna (2020) estimators in simulations calibrated to Wood et al. (2020a). The estimands considered are the calendar-, cohort-, and simple-weighted average treatment effects, as well as the instantaneous event-study effect (ES0). The Callaway and Sant'Anna (2020) estimator for ES0 corresponds with the estimator in de Chaisemartin and D'Haultfœuille (2020). Coverage refers to the fraction of the time a nominal 95% confidence interval includes the true parameter. Mean SE refers to the average estimated standard error, and SD refers to the actual standard deviation of the estimator. The bias, Mean SE, and SD are all multiplied by 100 for ease of readability.

Table 4: Comparison of Standard Deviations - Callaway and Sant'Anna (2020) versus Plug-in Efficient Estimator. [Entries omitted; for each estimand (calendar, cohort, ES0, simple), the table reports the ratio of the standard deviation of the Callaway and Sant'Anna (2020) estimator relative to that of the plug-in efficient estimator, based on the simulation results in Table 3.]
Two distinguishing features of this specification are that the outcome is very rare (pre-treatment mean 0.004) and the aggregation scheme places the largest weight on the earliest three cohorts, which were small (sizes 17, 15, and 26). This finding aligns with the well-known fact that the central limit theorem may be a poor approximation in finite samples with a binary outcome that is very rare. The plug-in efficient estimator again has good coverage (94%) when considering the annualized data, where the cohort sizes are larger. We thus urge some caution in using the plug-in efficient estimator (or any procedure based on a normal approximation) when cohort sizes are small (<30) and the outcome is rare; in such settings, we recommend collapsing the data to a higher level of aggregation before using the plug-in estimator.

Reducing police misconduct and use of force is an important policy objective. Wood et al. (2020a) studied the Chicago Police Department’s staggered rollout of a procedural justice training program, which taught police officers strategies for emphasizing respect, neutrality, and transparency in the exercise of authority. Officers were randomly assigned a date for training. Wood et al. (2020a) found large and statistically significant impacts of the program on complaints and sustained complaints against police officers and on officer use of force. However, in the process of preparing the analysis for this paper, we discovered a statistical error in the cohort-level analysis of Wood et al. (2020a).

See the Supplement to Wood et al. (2020a) for discussion of some concerns regarding non-compliance, particularly towards the end of the sample. We explore robustness to dropping officers trained in the last year in Appendix Figure 4. The results are qualitatively similar, although with smaller estimated effects on use of force.
We use the same data as in our re-analysis in Wood et al. (2020b), which extends the data used in the original analysis of Wood et al. (2020a) through December 2016. As in Wood et al. (2020b), we restrict attention to the balanced panel of 7,785 officers who remained in the police force throughout the study period. The data contain the outcome measures (complaints, sustained complaints, and use of force) at a monthly level for a period of 72 months (6 years), with the first cohort trained in month 13 and the final cohort trained in the last month of the sample. The data also contain the date on which each officer was trained.
We apply our proposed plug-in efficient estimator to estimate the effects of the procedural justice training program on the three outcomes of interest. We estimate the simple-, cohort-, and calendar-weighted average effects described in Section 3.2 and used in our Monte Carlo study. We also estimate the average dynamic effect for the first 24 months after treatment, which includes the instantaneous event-study effect studied in our Monte Carlo as a special case (for event-time 0). For comparison, we also estimate the Callaway and Sant’Anna (2020) estimator as we did in our re-analysis in Wood et al. (2020b). (Recall that for the instantaneous event-study effect, the Callaway and Sant’Anna (2020) and de Chaisemartin and D’Haultfœuille (2020) estimators coincide.)
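To make the three aggregation schemes concrete, here is a toy sketch of how simple-, cohort-, and calendar-weighted averages combine cohort-by-period building-block effects ATT(g, t). The ATT values and cohort sizes are invented, and the weighting conventions shown (size-weighting within cells, in the style of Callaway and Sant’Anna (2020) aggregations) reflect our reading rather than the paper's exact formulas:

```python
# Toy sketch of the three aggregation schemes applied to building-block
# effects ATT(g, t) for cohort g in post-treatment period t >= g.
# The ATT values and cohort sizes below are invented for illustration.
att = {(2, 2): 0.1, (2, 3): 0.3, (3, 3): 0.2}   # {(cohort, period): effect}
size = {2: 50, 3: 100}                           # cohort sizes (invented)

def simple_avg(att, size):
    """Weight every treated (cohort, period) cell by cohort size."""
    w = sum(size[g] for g, t in att)
    return sum(size[g] * v for (g, t), v in att.items()) / w

def cohort_avg(att, size):
    """Average within each cohort over its treated periods, then take a
    size-weighted average across cohorts."""
    per_cohort = {}
    for (g, t), v in att.items():
        per_cohort.setdefault(g, []).append(v)
    w = sum(size.values())
    return sum(size[g] * sum(vs) / len(vs) for g, vs in per_cohort.items()) / w

def calendar_avg(att, size):
    """Size-weight across treated cohorts within each period, then average
    the per-period effects equally over calendar time."""
    per_period = {}
    for (g, t), v in att.items():
        per_period.setdefault(t, []).append((size[g], v))
    period_effects = [sum(s * v for s, v in cells) / sum(s for s, _ in cells)
                      for cells in per_period.values()]
    return sum(period_effects) / len(period_effects)
```

Even with identical building blocks the schemes can disagree: in the toy numbers above, the simple and cohort averages both equal 0.2, while the calendar average weights each period equally regardless of how many cohorts are treated in it and so differs.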
Figure 1 shows the results of our analysis for the three aggregate summary parameters. Table 5 compares the magnitudes of these estimates and their 95% confidence intervals (CIs) to the mean of the outcome in the 12 months before treatment began. The estimates using the plug-in efficient estimator are substantially more precise than those using the Callaway and Sant’Anna (2020, CS) estimator, with the standard errors ranging from 1.3 to 5.6 times smaller (see final column of Table 5).

Figure 1: Effect of Procedural Justice Training Using the Plug-In Efficient and Callaway and Sant’Anna (2020) Estimators
Note: This figure shows point estimates and 95% CIs for the effects of procedural justice training on complaints, force, and sustained complaints using the CS and plug-in efficient estimators. Results are shown for the calendar-, cohort-, and simple-weighted averages.
As in our re-analysis, we find no significant impact on complaints using any of the aggregations. Our bounds on the magnitude of the treatment effect are substantially tighter than before, however. For instance, using the simple aggregation we can now rule out reductions in complaints of more than 11%, compared with a bound of 26% using the CS estimator. For use of force, the point estimates are somewhat smaller than when using the CS estimator and the upper bounds of the confidence intervals are all nearly exactly 0. Although precision is substantially higher than when using the CS estimator, the CIs for force still include effects between near-zero and 29% of the pre-treatment mean. For sustained complaints, all of the point estimates are near zero and the CIs are substantially narrower than when using the CS estimator, although the plug-in efficient estimate using the calendar aggregation is marginally significant. If we were to Bonferroni-adjust all of the CIs in Figure 1 for testing nine hypotheses (three outcomes times three aggregations), none of the confidence intervals would rule out zero.

Recall that the calendar aggregation for sustained complaints was the one specification for which CIs based on the plug-in efficient estimator substantially undercovered (79%), and thus the significant result should be interpreted with some caution.

Note: This table shows the pre-treatment means for the three outcomes. It also displays the estimates and 95% CIs in Figure 1 as percentages of these means. The final column shows the ratio of the CI length using the CS estimator relative to the plug-in efficient estimator.

Figure 2 shows event-time estimates for the first two years using the plug-in efficient estimator. (To conserve space, we place the analogous results for the CS estimator in the appendix.) In dark blue, we present point estimates and pointwise confidence intervals, and in light blue we present simultaneous confidence bands calculated via Bonferroni adjustment. It has been argued that simultaneous confidence bands are more appropriate for event-study analyses since they control size over the full dynamic path of treatment effects (Freyaldenhoven, Hansen and Shapiro, 2019; Callaway and Sant’Anna, 2020). The figure shows that the simultaneous confidence bands include zero for nearly all periods for all three outcomes. Inspecting the results for force more closely, we see that the point estimates are positive (although typically not significant) for most of the first year after treatment, but become consistently negative around the start of the second year after treatment. This suggests that the negative point estimates in the aggregate summary statistics are driven mainly by months after the first year. Although it is possible that the treatment effects grow over time, this runs counter to the common finding of fadeout in educational programs in general (Bailey, Duncan, Cunha, Foorman and Yeager, 2020) and anti-bias training in particular (Forscher and Devine, 2017).

Figure 2: Event-Time Average Effects Using the Plug-In Efficient Estimator

Finally, in Appendix Figure 4, we present results analogous to those in Figure 1 except removing officers who were treated in the last 12 months of the data. The reason for this is that, as discussed in the supplement to Wood et al. (2020a), there was some non-compliance towards the end of the study period wherein officers who had not already been trained could volunteer to take the training at a particular date.
The qualitative patterns after dropping these observations are similar, although the estimates for the effect on use of force are smaller and not statistically significant at conventional levels.
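The Bonferroni adjustments referenced above (for the nine aggregate CIs and for the simultaneous event-study bands) amount to replacing the pointwise normal critical value with one at level α/m for m hypotheses. A minimal sketch, with placeholder estimate and standard error:

```python
from statistics import NormalDist

def ci(estimate, se, alpha=0.05, m=1):
    """Two-sided normal CI; with m > 1 the level is Bonferroni-adjusted so
    that all m intervals cover jointly with probability at least 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return estimate - z * se, estimate + z * se

theta, se = -0.5, 0.3              # placeholder point estimate and SE
lo_pt, hi_pt = ci(theta, se)       # pointwise 95% interval
lo_bf, hi_bf = ci(theta, se, m=9)  # adjusted for 9 hypotheses
```

With m = 9 the critical value rises from roughly 1.96 to roughly 2.77, so each interval widens by about 40%, which is why significance can disappear after the adjustment.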
This paper considers efficient estimation in settings where the timing of treatment for different units is randomly assigned. Although this random assignment assumption is stronger than the typical parallel trends assumption, it can be ensured by design when the researcher controls the timing of treatment, and is often the justification given for parallel trends in quasi-experimental contexts. We derive the most efficient estimator in a large class of estimators that nests many existing approaches. Although the “oracle” efficient estimator is not known in practice, we show that a plug-in sample analog has similar properties in large populations, and we derive a valid variance estimator for construction of confidence intervals. We find in simulations that the proposed plug-in efficient estimator is approximately unbiased, yields CIs with good coverage, and substantially increases precision relative to existing methods. We apply our proposed methodology to obtain the most precise estimates to date of the causal effects of procedural justice training programs for police officers.

References
Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey M. Wooldridge, “Sampling-Based versus Design-Based Uncertainty in Regression Analysis,” Econometrica, 2020, (1), 265–296.

Aronow, Peter M., Donald P. Green, and Donald K. K. Lee, “Sharp bounds on the variance in randomized experiments,” The Annals of Statistics, June 2014, (3), 850–871.

Athey, Susan and Guido Imbens, “Design-Based Analysis in Difference-In-Differences Settings with Staggered Adoption,” arXiv:1808.05293 [cs, econ, math, stat], August 2018.

Bailey, Drew H., Greg J. Duncan, Flávio Cunha, Barbara R. Foorman, and David S. Yeager, “Persistence and Fade-Out of Educational-Intervention Effects: Mechanisms and Potential Solutions,” Psychological Science in the Public Interest, October 2020.

Basse, Guillaume, Yi Ding, and Panos Toulis, “Minimax designs for causal effects in temporal experiments with treatment habituation,” arXiv:1908.03531 [stat], June 2020.

Borusyak, Kirill and Xavier Jaravel, “Revisiting Event Study Designs,” SSRN Scholarly Paper ID 2826228, Social Science Research Network, Rochester, NY, August 2016.

Callaway, Brantly and Pedro H. C. Sant’Anna, “Difference-in-Differences with multiple time periods,” Journal of Econometrics, December 2020.

de Chaisemartin, Clément and Xavier D’Haultfœuille, “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects,” American Economic Review, September 2020, (9), 2964–2996.

Ding, Peng and Fan Li, “A bracketing relationship between difference-in-differences and lagged-dependent-variable adjustment,” Political Analysis, 2019, (4), 605–615.

Doleac, Jennifer, “How to Fix Policing,” Niskanen Center, 2020.

Forscher, Patrick S and Patricia G Devine, “Knowledge-based interventions are more likely to reduce legal disparities than are implicit bias interventions,” 2017.

Freedman, David A., “On Regression Adjustments in Experiments with Several Treatments,” The Annals of Applied Statistics, 2008, (1), 176–196.

Freedman, David A., “On regression adjustments to experimental data,” Advances in Applied Mathematics, 2008, (2), 180–193.

Freyaldenhoven, Simon, Christian Hansen, and Jesse Shapiro, “Pre-event Trends in the Panel Event-study Design,” American Economic Review, 2019, (9), 3307–3338.

Frison, L. and S. J. Pocock, “Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design,” Statistics in Medicine, September 1992, (13), 1685–1704.

Funatogawa, Takashi, Ikuko Funatogawa, and Yu Shyr, “Analysis of covariance with pre-treatment measurements in randomized trials under the cases that covariances and post-treatment variances differ between groups,” Biometrical Journal, May 2011, (3), 512–524.

Goodman-Bacon, Andrew, “Difference-in-Differences with Variation in Treatment Timing,” Working Paper 25018, National Bureau of Economic Research, September 2018.

Guo, Kevin and Guillaume Basse, “The Generalized Oaxaca-Blinder Estimator,” arXiv:2004.11615 [math, stat], April 2020.

Imai, Kosuke and In Song Kim, “On the Use of Two-way Fixed Effects Regression Models for Causal Inference with Panel Data,” Political Analysis, 2020, (Forthcoming).

Lei, Lihua and Peng Ding, “Regression adjustment in completely randomized experiments with a diverging number of covariates,” Biometrika, December 2020, (Forthcoming).

Li, Xinran and Peng Ding, “General Forms of Finite Population Central Limit Theorems with Applications to Causal Inference,” Journal of the American Statistical Association, October 2017, (520), 1759–1769.

Lin, Winston, “Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique,” Annals of Applied Statistics, March 2013, (1), 295–318.

Malani, Anup and Julian Reif, “Interpreting pre-trends as anticipation: Impact on estimated treatment effects from tort reform,” Journal of Public Economics, April 2015, 1–17.

Manski, Charles F. and John V. Pepper, “How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions,” Review of Economics and Statistics, 2018, (2), 232–244.

McKenzie, David, “Beyond baseline and follow-up: The case for more T in experiments,” Journal of Development Economics, 2012, (2), 210–221.

Meer, Jonathan and Jeremy West, “Effects of the Minimum Wage on Employment Dynamics,” Journal of Human Resources, 2016, (2), 500–522.

Neyman, Jerzy, “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.,” Statistical Science, 1923, (4), 465–472.

Owens, Emily, David Weisburd, Karen L. Amendola, and Geoffrey P. Alpert, “Can You Build a Better Cop?,” Criminology & Public Policy, 2018, (1), 41–87.

Rambachan, Ashesh and Jonathan Roth, “Design-Based Uncertainty for Quasi-Experiments,” arXiv:2008.00602 [econ, stat], August 2020.

Roth, Jonathan and Pedro H. C. Sant’Anna, “When Is Parallel Trends Sensitive to Functional Form?,” arXiv:2010.04814 [econ, stat], January 2021.

Shaikh, Azeem and Panos Toulis, “Randomization Tests in Observational Studies with Staggered Adoption of Treatment,” arXiv:1912.10610 [stat], December 2019.

Sun, Liyang and Sarah Abraham, “Estimating dynamic treatment effects in event studies with heterogeneous treatment effects,” Journal of Econometrics, December 2020.

Słoczyński, Tymon, “Interpreting OLS Estimands When Treatment Effects Are Heterogeneous: Smaller Groups Get Larger Weights,” Review of Economics and Statistics, 2020, (Forthcoming).

Wan, Fei, “Analyzing pre-post designs using the analysis of covariance models with and without the interaction term in a heterogeneous study population,” Statistical Methods in Medical Research, January 2020, (1), 189–204.

Wood, George, Tom R. Tyler, and Andrew V. Papachristos, “Procedural justice training reduces police use of force and complaints against officers,” Proceedings of the National Academy of Sciences, May 2020, (18), 9815–9821.

Wood, George, Tom R. Tyler, Andrew V. Papachristos, Jonathan Roth, and Pedro H.C. Sant’Anna, “Revised Findings for ‘Procedural justice training reduces police use of force and complaints against officers’,” Working Paper, 2020.

Xiong, Ruoxuan, Susan Athey, Mohsen Bayati, and Guido Imbens, “Optimal Experimental Design for Staggered Rollouts,” arXiv:1911.03764 [econ, stat], November 2019.
Additional Simulation Results
This section presents results from extensions to the simulations in Section 4.
Other outcomes.
Tables 6-9 show results analogous to those in the main text, except using the other two outcomes considered in our application (use of force and sustained complaints).
Annualized data.
Tables 10-15 show versions of our simulation results (for all three outcomes) when the data is collapsed to the annual level, so that there are 6 total time periods and 5 cohorts.
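A minimal sketch of the collapsing step, assuming within-year averaging of the monthly outcomes (whether the outcomes are averaged or summed within years is not stated here, so averaging is an assumption), with invented data:

```python
# Sketch of collapsing a 72-month outcome path to 6 yearly observations
# by averaging within each year. The monthly values are invented placeholders.
monthly = [0.01 * (m % 5) for m in range(72)]  # one unit's monthly outcomes

def collapse_to_years(monthly, months_per_year=12):
    """Average each consecutive block of 12 months into one yearly value."""
    assert len(monthly) % months_per_year == 0
    return [sum(monthly[i:i + months_per_year]) / months_per_year
            for i in range(0, len(monthly), months_per_year)]

yearly = collapse_to_years(monthly)
print(len(yearly))  # 6 yearly observations
```

Averaging over 12 months shrinks the variance of each pre-treatment outcome, which is the mechanism behind the smaller efficiency gains of the efficient estimator in the annualized specification.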
Augmented X̂. Table 16 shows results for an alternative version of the efficient estimator where X̂ is now a vector that contains the difference in means between cohorts g and g′ in all periods t < min(g, g′). This vector is large relative to the sample size in the monthly specification (dim(X̂) = 1987), which leads to bias and severe undercoverage for the modified plug-in efficient estimator. In the annualized data, the dimension of the modified X̂ is modest (10), and the modified efficient estimator has good coverage and yields small efficiency gains (up to 3%) relative to the plug-in efficient estimator considered in the main text.

Heterogeneous Treatment Effects.
Tables 17 and 18 show simulation results for a modification of our baseline specification in which there are heterogeneous treatment effects. In the baseline specification, Y_it(g) = Y_it(∞) for all g. In the modification, we set Y_it(g) = Y_it(∞) + 1[t ≥ g] · u_i. The u_i are mean-zero draws from a normal distribution with standard deviation equal to half the standard deviation of the untreated potential outcomes. We draw the u_i once and hold them fixed throughout the simulations, which differ only in the assignment of treatment timing. The results are similar to those for the main specification, although, as expected, the standard errors are somewhat conservative (i.e., the mean standard error exceeds the standard deviation of the estimator).

Estimator  Estimand  Bias  Coverage  Mean SE  SD
PlugIn     calendar  0.03  0.94      0.30     0.32
PlugIn     cohort    0.02  0.92      0.28     0.29
PlugIn     ES0       0.01  0.96      0.28     0.28
PlugIn     simple    0.01  0.93      0.26     0.27
CS         calendar  0.03  0.95      0.59     0.60
CS         cohort    0.01  0.96      0.45     0.44
CS/CdH     ES0       0.01  0.96      0.37     0.37
CS         simple    0.01  0.96      0.45     0.44

Table 6: Results for Simulations Calibrated to Wood et al. (2020a) – Use of Force

Note: This table shows results analogous to Table 3, except using Use of Force rather than Complaints as the outcome.
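The heterogeneous-effects DGP described above can be sketched as follows. The untreated potential outcomes here are invented placeholders; only the u_i construction follows the text (mean-zero normal draws with SD equal to half that of the untreated outcomes, drawn once and held fixed):

```python
import random

random.seed(0)

# Sketch of the heterogeneous-effects modification:
#   Y_it(g) = Y_it(infinity) + 1[t >= g] * u_i,
# with u_i mean-zero normal, SD equal to half the SD of the untreated outcomes.
n_units, n_periods = 100, 6
# Untreated potential outcomes: invented N(0, 1) placeholders.
y_untreated = [[random.gauss(0, 1) for _ in range(n_periods)]
               for _ in range(n_units)]

sd_untreated = 1.0  # SD of the untreated outcomes, known here by construction
u = [random.gauss(0, 0.5 * sd_untreated) for _ in range(n_units)]  # drawn once

def potential_outcome(i, t, g):
    """Y_it(g): untreated outcome plus the fixed unit-level effect u_i once treated."""
    return y_untreated[i][t] + (u[i] if t >= g else 0.0)
```

Because the u_i are held fixed across simulation draws, only treatment timing is re-randomized, matching the design-based framework in which potential outcomes are non-stochastic.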
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 7: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Use of Force
Note: This table shows results analogous to Table 4, except using Use of Force rather than Complaints as the outcome.
Estimator  Estimand  Bias  Coverage  Mean SE  SD
PlugIn     calendar  0.00  0.79      0.06     0.07
PlugIn     cohort    0.00  0.92      0.03     0.03
PlugIn     ES0       0.01  0.95      0.08     0.08
PlugIn     simple    0.00  0.92      0.03     0.03
CS         calendar  0.01  0.95      0.14     0.17
CS         cohort    0.01  0.95      0.11     0.11
CS/CdH     ES0       0.01  0.94      0.11     0.12
CS         simple    0.01  0.96      0.11     0.12

Table 8: Results for Simulations Calibrated to Wood et al. (2020a) – Sustained Complaints
Note: This table shows results analogous to Table 3, except using Sustained Complaints rather than Complaints as the outcome.

Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 9: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Sustained Complaints
Note: This table shows results analogous to Table 4, except using Sustained Complaints rather than Complaints as the outcome.
Estimator  Estimand  Bias  Coverage  Mean SE  SD
PlugIn     calendar  0.11  0.95      1.99     1.96
PlugIn     cohort    0.15  0.95      2.53     2.49
PlugIn     ES0       0.03  0.96      1.65     1.60
PlugIn     simple    0.14  0.95      2.41     2.37
CS         calendar  0.20  0.96      2.65     2.56
CS         cohort    0.26  0.96      3.24     3.13
CS/CdH     ES0       0.04  0.96      2.05     1.98
CS         simple    0.27  0.96      3.17     3.05

Table 10: Results for Simulations Calibrated to Wood et al. (2020a) – Annualized Data
Note: This table shows results analogous to Table 3, except the data is collapsed to the annual level.
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 11: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Annualized Data
Note: This table shows results analogous to Table 4, except the data is collapsed to the annual level.
Table 12: Results for Simulations Calibrated to Wood et al. (2020a) – Use of Force & Annualized Data

Note: This table shows results analogous to Table 3, except using Use of Force rather than Complaints as the outcome, and in simulations where data is collapsed to the annual level.
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 13: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Use of Force & Annualized Data
Note: This table shows results analogous to Table 4, except using Use of Force rather than Complaints asthe outcome, and in simulations where data is collapsed to the annual level.
Table 14: Results for Simulations Calibrated to Wood et al. (2020a) – Sustained Complaints & Annualized Data

Note: This table shows results analogous to Table 3, except using Sustained Complaints rather than Complaints as the outcome, and in simulations where data is collapsed to the annual level.
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 15: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Sustained Complaints & Annualized Data
Note: This table shows results analogous to Table 4, except using Sustained Complaints rather than Complaints as the outcome, and in simulations where data is collapsed to the annual level.

Table 16: Comparison of the Plug-in Efficient Estimator with Scalar and Augmented X̂

Note: This table shows the bias, coverage, mean standard error, and standard deviation of two versions of the plug-in efficient estimator. The estimator with the label “Long X” uses an augmented version of X̂ that includes the difference in means between all cohorts g, g′ in periods t < min(g, g′). The estimator labeled PlugIn uses a scalar X̂ such that the CS estimator corresponds with β = 0, as in the main text. The simulation specification in panel (a) is the baseline specification considered in the main text; in panel (b), the data is collapsed to the annual level.

Table 17: Results for Simulations Calibrated to Wood et al. (2020a) – Heterogeneous Treatment Effects

Note: This table shows results analogous to Table 3, except the DGP adds heterogeneous treatment effects as described in Section A.
Estimand   Ratio of SDs
calendar   .
cohort     .
ES0        .
simple     .

Table 18: Comparison of Standard Deviations – Callaway and Sant’Anna (2020) versus Plug-in Efficient Estimator – Heterogeneous Treatment Effects
Note: This table shows results analogous to Table 4, except the DGP adds heterogeneous treatment effects as described in Section A.

Additional Tables and Figures
Figure 3: Event-Time Average Effects Using the CS Estimator
Note: This figure is analogous to Figure 2, except it uses the CS estimator rather than the plug-in efficient estimator.
Figure 4: Effect of Procedural Justice Training Using the Plug-In Efficient and Callaway and Sant’Anna (2020) Estimators – Dropping Late-Trained Officers