The role of parallel trends in event study settings: An application to environmental economics
Michelle Marcus (Vanderbilt University)
Pedro H. C. Sant’Anna (Vanderbilt University)
September 3, 2020
Abstract

Difference-in-Differences (DID) research designs usually rely on variation in treatment timing such that, after making an appropriate parallel trends assumption, one can identify, estimate, and make inference about causal effects. In practice, however, different DID procedures rely on different parallel trends assumptions (PTAs) and recover different causal parameters. In this paper, we focus on staggered DID (also referred to as event studies) and discuss the role played by the PTA in the identification and estimation of causal parameters. We document a “robustness” vs. “efficiency” trade-off in terms of the strength of the underlying PTA, and argue that practitioners should be explicit about these trade-offs whenever using DID procedures. We propose new DID estimators that reflect these trade-offs and derive their large sample properties. We illustrate the practical relevance of these results by assessing whether the transition from federal to state management of the Clean Water Act affects compliance rates.

∗ First version: January 10, 2020. We thank Brantly Callaway, Jonathan Roth, Julia Schmieder, the Editor, Daniel Millimet, and two anonymous referees for comments and suggestions.

1 Introduction
Researchers and policy makers are often interested in evaluating the causal effect of a given treatment/intervention on an outcome of interest. When data from randomized control trials related to the causal question of interest are not available, researchers often rely on “natural experiments” and make use of difference-in-differences (DID) methods to estimate the effect of a given policy. The canonical DID method presumes the existence of two groups, the treated and the comparison group, and two time periods, the pre-treatment and post-treatment periods, such that the comparison group is not treated in either time period, and the treated group is only treated in the post-treatment period. Then, one estimates the average treatment effect among the treated units by comparing the average difference in pre- and post-treatment outcomes of the two groups, or, equivalently, by using a two-way fixed effects regression model with a group and a time fixed effect; see, e.g., Section 2 of Lechner (2010) for details about the history of DID procedures.

It is worth stressing that the causal interpretation of the two groups, two time periods (henceforth 2 × 2) DID procedure relies on a so-called parallel trends assumption (PTA): in the absence of the treatment, the average outcome for the treated and comparison groups would have evolved in parallel. Such an assumption is well understood; see, e.g., Chapter 5 of Angrist and Pischke (2009), Chapter 10 of Cunningham (2018), and Section 2 of Sant’Anna and Zhao (2020). Importantly, it restricts the average counterfactual outcome for the treated units at the post-treatment period had they not been subject to the treatment, but it does not directly impose restrictions on the outcome in pre-treatment periods.
In addition, it is worth mentioning that the PTA is untestable in this 2 × 2 setup; see, e.g., Chapter 10 of Cunningham (2018) and Section 4 of Callaway and Sant’Anna (2020).

Although most of the aforementioned points are well understood in the 2 × 2 setup, in many DID applications there are more than two time periods, and units can be treated at different points in time, which leads to multiple treatment groups as well. This many periods, many groups DID setup is substantially more challenging than the canonical 2 × 2 one. For instance, Sun and Abraham (2020) (henceforth S&A), Callaway and Sant’Anna (2020) (henceforth C&S), de Chaisemartin and D’Haultfœuille (2020) (henceforth dC&D), and Goodman-Bacon (2019) study DID procedures with multiple periods and multiple groups, and each of these papers relies on different types of parallel trends assumptions and/or proposes different estimators for different causal parameters of interest. This is in sharp contrast with the 2 × 2 setup, where there is only one type of PTA. (In the 2 × 2 DID setup, the only variation in PTA one observes is whether it holds unconditionally, or only after conditioning on a vector of observed characteristics; see, e.g., Heckman et al. (1998), Abadie (2005), and Sant’Anna and Zhao (2020). This is not the type of variation of the PTA we are referring to.)

We exclusively focus on DID settings with staggered adoption designs and binary treatments. By doing so, we can compare the PTAs and parameters of interest discussed in S&A, C&S, and dC&D in a more direct manner. We show that the PTA invoked by S&A and dC&D (i) not only restricts counterfactual trends after the treatment, but also imposes parallel pre-treatment trends, and (ii) imposes that every individual group that is not yet treated by time t can be used as a valid comparison group, at time t, for those earlier-treated units.
C&S, on the other hand, consider two different PTAs: one that relies on using “never-treated” units as a comparison group, and one that uses not-yet-treated units as valid comparison groups for the earlier-treated units. Interestingly, both PTAs considered by C&S are, at least technically speaking, weaker than the PTA invoked by S&A and dC&D, as they either do not restrict pre-treatment trends, or, when they do, these restrictions are potentially less demanding. Although these PTAs differ in their “strength”, we show that they can all be used to recover the same variety of average treatment effect measures. (See also Athey and Imbens (2018), Goodman-Bacon (2019), Arkhangelsky et al. (2018), Borusyak and Jaravel (2017), Ferman and Pinto (2019), and Rambachan and Roth (2019) for other recent contributions to the DID literature.)

Overall, we argue that, in practice, one should be explicit about the type of PTA invoked in the DID analysis. On top of adding transparency and objectivity to the analysis, see, e.g., Rubin (2007, 2008), we stress that the choice of the parallel trends assumption can also help in selecting the most appropriate estimator for a given parameter of interest. For instance, in situations where one is comfortable with a “stronger” PTA, we show that one can exploit overidentification, and then use the generalized method of moments (GMM) framework to form more efficient treatment effect estimators than those currently available in the literature; see Proposition 4.1. Another consequence of adopting the GMM framework is that it is relatively straightforward to test the credibility of a “stronger” parallel trends assumption by conducting a classical Hansen-Sargan J-test.
To the best of our knowledge, this paper is the first to make this simple but important observation.

In many other situations, however, we expect that researchers will not be a priori comfortable with a “stronger” version of the PTA, as it may impose more restrictions on the data than those strictly required for identification of treatment effect parameters. Indeed, when the number of groups and time periods is moderate, the number of restrictions implied by the “stronger” PTA can be close to the number of observations available in the data. In such cases, it may be reasonable to favor “weaker” versions of the PTA. When a sufficiently large “never-treated” group is available, researchers can use the easy-to-implement nonparametric DID estimators based on sample means proposed by C&S. When an appropriate “never-treated” group is unavailable, we show that one can rely on an alternative “weaker” PTA and use a simple plug-in DID estimator that differs from the ones considered by S&A, C&S, and dC&D. We show that this new DID estimator is consistent and asymptotically normal, and we also describe a bootstrap procedure to conduct inference that is robust against multiple-testing problems. Interestingly, this new DID estimator does not rely on restricting pre-treatment trends and, at the same time, exploits data from all available groups in the given application. On the other hand, both this newly proposed DID estimator and the one proposed by C&S are, in general, less efficient than the GMM estimator, which relies on more stringent assumptions. To the best of our knowledge, we are the first to document this “robustness” versus “efficiency” trade-off in terms of the strength of the underlying PTA invoked in DID setups.

We illustrate the practical relevance of the aforementioned observations by revisiting Grooms (2015). We examine the effect of the transition from federal to state management of the Clean Water Act (CWA) on violation rates.
Similarly to Grooms (2015), we find that the transition from federal to state control has little to no effect on violation rates — this result is robust across different parallel trends assumptions and different causal parameters of interest.

Next, like Grooms (2015), we also analyze whether states with a long prevalence of corruption see a larger decrease in the violation rate after authorization relative to states without corruption. Grooms (2015) uses a dynamic TWFE (event-study) linear regression model, and finds strong evidence that violation rates decreased more in more corrupt states than in less corrupt states after the transition to state control. However, given that Grooms (2015) focuses exclusively on TWFE-type estimators, it is not clear what kind of PTA is actually being made in the analysis. Here, we show how it can be beneficial to separate the analysis into two steps: (i) identification and the relevance of the PTA, and (ii) data analysis and estimation procedures. By proceeding in this manner, we find that the conclusion that violation rates dropped more in more corrupt states than in less corrupt states depends on the type of PTA imposed. For instance, when one assumes that, in the absence of treatment, the counterfactual outcome trends differ depending on whether a state has a long prevalence of corruption or not (“corruption-specific trends”), we find essentially no evidence that the treatment effects vary depending on whether a state is more or less corrupt. On the other hand, if one assumes an alternative PTA such that one can use averages of both corrupt and non-corrupt states as valid comparison groups, we find evidence that more corrupt states see a larger decrease in the violation rate after authorization than less corrupt states, just like the original findings of Grooms (2015). As “corruption” is not randomly assigned, we believe that allowing for corruption-specific trends is the most natural identification setup in this context.
These conflicting findings highlight the importance of explicitly stating the underlying PTA invoked in the exercise.

The rest of this paper is organized as follows. In Section 2, we present the general framework, compare the different PTAs using a stylized example, and describe the different parameters of interest considered by S&A, C&S, and dC&D. In Section 3, we discuss the testability of the PTAs and the practical considerations a researcher might take into account when choosing a PTA and a DID estimator. Section 4 describes how one can use the generalized method of moments (GMM) framework to form more efficient treatment effect estimators when the chosen PTA leads to overidentification. Section 5 presents a new easy-to-compute DID estimator based on an alternative “weaker” PTA than what has been previously seen in the literature. Finally, Section 6 presents the empirical application, and we conclude in Section 7. Proofs and additional results are available in the Web Appendix, at https://pedrohcgs.github.io/files/Marcus_SantAnna_2020_webAppendix.pdf.
We first introduce the notation we use throughout the paper, which resembles that adopted by C&S. We consider the case with T periods and denote a particular time period by t, where t = 1, . . . , T. In the canonical DID setup, T = 2 and no one is treated in period 1. Let D_t be a binary variable equal to one if a unit is treated in period t and equal to zero otherwise. Also, define G_g to be a dummy variable that is equal to one if a unit is first treated in period g, and define C as a dummy variable that is equal to one for units that are not treated in any period t = 1, . . . , T. For each unit, exactly one of the G_g or C is equal to one. Finally, let Y_t(1) and Y_t(0) be the potential outcomes at time t with and without treatment, respectively. The observed outcome in each period can be expressed as

Y_t = D_t Y_t(1) + (1 − D_t) Y_t(0).

Henceforth, we refer to “groups” as the groups associated with the time a unit is first treated. Throughout the paper, we maintain the following assumptions.
Assumption 2.1 (Sampling). {Y_{i1}, Y_{i2}, . . . , Y_{iT}, D_{i1}, D_{i2}, . . . , D_{iT}}_{i=1}^{n} is independent and identically distributed (iid).

Assumption 2.2 (Staggered treatment design). For t = 2, . . . , T, D_{t−1} = 1 implies that D_t = 1.

Assumption 2.3 (No anticipation). For all t = 1, . . . , T and g = 2, . . . , T such that t < g, E[Y_{it} | G_g = 1] = E[Y_{it}(0) | G_g = 1].

Assumption 2.4 (Overlap). P(G_1 = 1) = 0 and, for some ε > 0 and all g = 2, . . . , T, P(G_g = 1) > ε.

Assumption 2.1 implies that we are considering the case of panel data. The discussion related to the case where only repeated cross-section data are available follows similar arguments and is omitted to avoid repetition. Assumption 2.1 does not restrict the temporal dependence across outcomes, though it relies on “large n, fixed T” panel data. Assumption 2.1 also rules out covariates; we only impose this simplification to allow for a more direct comparison between the proposals of S&A, C&S, and dC&D; we refer the reader to C&S for a detailed discussion about flexibly accommodating covariates into the DID analysis.

Assumption 2.2 imposes that treatment is “irreversible”, i.e., once a unit is treated at time t − 1, it is “forever” treated. This assumption is usually referred to as staggered treatment adoption in the DID literature. We interpret this assumption as if units that experience treatment are forever affected by this experience, and do not “forget” about it. We emphasize that, by imposing Assumption 2.2, we are able to directly compare the DID contributions of S&A, C&S, and dC&D.

Assumption 2.3 implies that there is no anticipatory response to treatment for those units that are eventually treated. This assumption is standard in the DID literature. When treatment can “turn on” and later “turn off”, one usually needs to augment the potential outcome notation to analyze the effect of a given treatment path; see, e.g., Han (2019).
When one is interested only in an average of the instantaneous effects of the policy among all units that switch treatments, one can bypass some of these complications by imposing a “no carryover assumption”; see, e.g., dC&D.

Assumption 2.4 imposes that no unit is treated in the first time period, and that a new set of units is treated in each time period with a strictly positive probability. If there is an “always treated” group, i.e., units that are already treated in the first time period, we drop those observations because neither the data nor parallel trends assumptions for Y_t(0) provide information to identify the average treatment effect for this group. We assume that new sets of units are treated in each time period only for notational convenience. Also, note that Assumption 2.4 accommodates, but does not require, that there is a “never-treated” group available.

Next, we revisit S&A, C&S, and dC&D, paying particular attention to their PTAs, underlying parameters of interest, and how one can estimate these parameters. When presenting these results, we impose Assumptions 2.1-2.4, which may result in slight changes of notation when compared to their original statements. In terms of notation, we follow C&S and, whenever possible, attempt to express the different parameters of interest in terms of functionals of “group-time average treatment effects”, i.e., the average treatment effect at time t for those units first treated at time g,

ATT(g, t) ≡ E[Y_{it}(1) | G_g = 1] − E[Y_{it}(0) | G_g = 1] = α_{g,t}(1) − α_{g,t}(0).    (2.1)

To convey the discussion in an easy-to-understand manner, we consider a simple, stylized example. Assume that we observe Y_{it} for a sample of units i = 1, . . . , n in four time periods, t = 1, 2, 3, 4. Some units are first treated at time 3 (G_{i3} = 1), others at time 4 (G_{i4} = 1), and the remaining units are not treated in the entire observation window (C_i = 1).
Once a unit i is treated at time g, it remains treated for all time periods t ≥ g. Let W = (Y_1, Y_2, Y_3, Y_4, G_3, G_4, C)′, and assume that we observe a random sample {W_i}_{i=1}^{n} of W.

When the researcher is worried about anticipatory effects, this can be circumvented by simply redefining g to denote the period in which anticipatory effects begin. However, this may require strengthening other assumptions; see C&S for a discussion.

In this paper, we attempt to only make parallel trends assumptions about the evolution of Y_t(0), and remain agnostic about the trends for Y_t(1). When one is willing to impose parallel trends for Y_t(1), too, then we can leverage the existence of an “always treated” group to form alternative parameters of interest, though such an assumption further restricts treatment effect heterogeneity. We leave a detailed discussion about this case for future research.

S&A refers to
ATT(g, t) as a cohort-specific average treatment effect on the treated, though they express it in terms of event time t − g, i.e., the time elapsed since treatment started.

2.2 The different parallel trends assumptions

In this subsection, we present the three different parallel trends assumptions considered by S&A, C&S, and dC&D. We start by describing each PTA conceptually and then make use of the stylized example to highlight the key differences between these assumptions. We first present the PTA invoked by S&A and dC&D (see Assumption 1 in S&A and Assumption 5 in dC&D).
Assumption 2.5 (Parallel trends assumption across all time periods and all groups). For all t = 2, . . . , T and all g = 2, . . . , T,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1] = E[Y_t(0) − Y_{t−1}(0) | C = 1] = E[Y_t(0) − Y_{t−1}(0)],

where the first equality holds only when there exists a “never-treated” group.

Assumption 2.5 states that, in the absence of treatment, the expectation of the outcome of interest follows the same path in all groups and in all time periods available in the data. Although fairly intuitive, such an assumption imposes important restrictions on the data (when combined with Assumptions 2.1-2.4). In particular, Assumption 2.5 imposes a parallel pre-trends condition across all treatment groups and, as a consequence, allows one to use any individual group that has not yet been treated by time t (units with G_s = 1, s > t) as a valid comparison group for those units already treated by time t.

To visualize these restrictions, let us consider our stylized example. Under Assumptions 2.1-2.4, the PTA 2.5 can be written as the following seven moment conditions:

α_{3,3}(0) = E[Y_3 − Y_2 | C = 1] + E[Y_2 | G_3 = 1],    (2.2)
α_{3,3}(0) = E[Y_3 − Y_2 | G_4 = 1] + E[Y_2 | G_3 = 1],    (2.3)
α_{3,4}(0) = E[Y_4 − Y_3 | C = 1] + α_{3,3}(0),    (2.4)
α_{4,4}(0) = E[Y_4 − Y_3 | C = 1] + E[Y_3 | G_4 = 1],    (2.5)
E[Y_2 − Y_1 | G_3 = 1] = E[Y_2 − Y_1 | C = 1],    (2.6)
E[Y_2 − Y_1 | G_3 = 1] = E[Y_2 − Y_1 | G_4 = 1],    (2.7)
E[Y_3 − Y_2 | G_4 = 1] = E[Y_3 − Y_2 | C = 1].    (2.8)

These moment conditions make explicit all of the restrictions that the PTA 2.5 imposes on the data. First, the moment conditions (2.2) and (2.3) formalize the notion that the evolution of the outcome for the “never-treated” and “late-treated” units can be used to identify α_{3,3}(0), which, in turn, would allow one to identify ATT(3, 3). Recall that, for t ≥ g, ATT(g, t) = α_{g,t}(1) − α_{g,t}(0) = E[Y_t | G_g = 1] − E[Y_t(0) | G_g = 1].
Thus, given that E[Y_t | G_g = 1] is estimable from the data, one only needs to identify α_{g,t}(0) in order to recover ATT(g, t) from the data. An empirically important implication is that any linear combination of E[Y_3 − Y_2 | C = 1] and E[Y_3 − Y_2 | G_4 = 1] can be used to impute α_{3,3}(0). Given that the “never-treated” units are the only units that have not yet experienced treatment at time t = 4, they form the only group that can be used to recover α_{3,4}(0) and α_{4,4}(0) — this notion is formalized by the moment restrictions (2.4) and (2.5). Finally, (2.6)-(2.8) impose a parallel “pre-trends” condition, i.e., that the evolution of the outcome before treatment occurs is the same across all groups. Note that the moment condition (2.8) is a linear combination of the moment conditions (2.2) and (2.3), so (2.8) is redundant in the aforementioned system of equations. Nonetheless, this observation allows one to conclude that, by assuming that both never-treated units and the units that are not yet treated at time t = 3 can be used as valid comparison groups for the units first treated at time t = 3, one is imposing the parallel pre-trends condition (2.8). Of course, the reverse argument is also true, highlighting that parallel pre-trends across groups may have important identification content; we further discuss this in Section 3.

Given that we now have a better understanding of the PTA invoked by S&A and dC&D, we turn our attention to the PTAs invoked by C&S. In fact, C&S consider two different PTAs depending on whether a “never-treated” group is available or not (see Assumptions 4 and 5 in C&S).

Assumption 2.6 (Parallel trends assumption based on “never-treated” units). For all g, t = 2, . . . , T such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1] = E[Y_t(0) − Y_{t−1}(0) | C = 1].

Assumption 2.7 (Parallel trends assumption based on “not-yet-treated” units). For all g, s, t = 2, . . . , T such that t ≥ g and s ≥ t,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1] = E[Y_t(0) − Y_{t−1}(0) | D_s = 0].
The difference between the parallel trends assumptions 2.6 and 2.7 is that the former uses the “never-treated” units as a fixed comparison group, whereas the latter allows one to use averages of different groups of units that are not yet treated by time t as a comparison group. At first sight, it may not be clear whether the PTAs 2.6 and 2.7 also restrict pre-trends as the PTA 2.5 does.

In order to compare these PTAs, it is illustrative to focus our attention on the stylized example, where we again pre-impose Assumptions 2.1-2.4. In this context, the PTA 2.6 imposes the following three moment restrictions:

α_{3,3}(0) = E[Y_3 − Y_2 | C = 1] + E[Y_2 | G_3 = 1],    (2.9)
α_{3,4}(0) = E[Y_4 − Y_3 | C = 1] + α_{3,3}(0),    (2.10)
α_{4,4}(0) = E[Y_4 − Y_3 | C = 1] + E[Y_3 | G_4 = 1].    (2.11)

As is evident from (2.9)-(2.11), the PTA 2.6 does not restrict pre-trends across groups, and does not presume that “late-treated” units can be used as a valid comparison group for “early-treated” units. Although the moment conditions (2.9), (2.10), and (2.11) are respectively the same as (2.2), (2.4), and (2.5), the PTA 2.6 does not impose the moment restrictions (2.3), (2.6), and (2.7) imposed by the PTA 2.5. Therefore, one can reasonably argue that the PTA 2.6 is “weaker” than the PTA 2.5.

Next, we describe the PTA 2.7, which, in the context of our stylized example, imposes the following moment restrictions:

α_{3,3}(0) = E[Y_3 − Y_2 | D_3 = 0] + E[Y_2 | G_3 = 1],    (2.12)
α_{3,3}(0) = E[Y_3 − Y_2 | D_4 = 0] + E[Y_2 | G_3 = 1],    (2.13)
α_{3,4}(0) = E[Y_4 − Y_3 | D_4 = 0] + α_{3,3}(0),    (2.14)
α_{4,4}(0) = E[Y_4 − Y_3 | D_4 = 0] + E[Y_3 | G_4 = 1],    (2.15)

where D_{it} = 1 if a unit i is treated by time t, and D_{it} = 0 otherwise. In the context of our stylized example, D_4 = 0 if and only if C = 1, implying that (2.13)-(2.15) are equivalent to (2.9)-(2.11), respectively.
Thus, from this simple observation, we can conclude that the PTA 2.7 is “stronger” than the PTA 2.6, as the latter does not involve the moment restriction (2.12).

To compare the PTA 2.7 with the PTA 2.5, we need to understand the implications of adding the moment restriction (2.12) to the other moment restrictions implied by the PTA 2.6. Note that when we combine (2.12) with (2.13), we have that E[Y_3 − Y_2 | D_3 = 0] = E[Y_3 − Y_2 | D_4 = 0], which in our example is the same as E[Y_3 − Y_2 | D_3 = 0] = E[Y_3 − Y_2 | C = 1]. Given that D_3 = 0 if and only if either G_4 = 1 or C = 1, it follows that

E[Y_3 − Y_2 | D_3 = 0] = E[Y_3 − Y_2 | C = 1] ⟺ E[Y_3 − Y_2 | G_4 = 1] = E[Y_3 − Y_2 | C = 1].

Thus, by exploiting this simple but subtle observation, we can conclude that the moment restrictions implied by the PTA 2.7, (2.12)-(2.15), are equivalent to (2.2)-(2.5), a subset of the moment restrictions implied by the PTA 2.5. Importantly, and in contrast with the PTA 2.6, the PTA 2.7 does rule out non-parallel pre-trends for some groups and pre-treatment periods, though, technically, it is still weaker than the PTA 2.5, as the latter completely rules out any type of non-parallel pre-trends. (The PTA 2.7 does not restrict pre-trends involving time periods before the first unit is treated, and does not restrict pre-trends for the earliest treatment group.)
In summary, from the discussion presented above, one can conclude that the PTA 2.6 does not restrict pre-trends and is weaker than the PTAs 2.5 and 2.7, though it requires the existence of a “never-treated” group. In addition, the PTA 2.7 is arguably weaker than the PTA 2.5, as the latter restricts all pre-trends in all pre-treatment periods, while the former does not restrict pre-trends involving time periods before the first unit is treated. This can be practically relevant in applications where data are available on many time periods before the first group of units is treated.
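To make the comparison concrete, the stylized example can be simulated. The sketch below is our own illustration (the simulated design, group shares, and trend gap are assumptions, not taken from the paper): the late-treated group G_4 follows a steeper untreated trend, so the PTA 2.5 fails while the PTA 2.6 still holds for the comparison of G_3 with the never-treated units. Imputing α_{3,3}(0) via the never-treated moment (2.2) then recovers ATT(3, 3), whereas the G_4-based moment (2.3) does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# Stylized design: first treated at t = 3 (G3), at t = 4 (G4), or never (coded 0).
group = rng.choice([3, 4, 0], size=n, p=[0.3, 0.3, 0.4])
G3, G4, C = group == 3, group == 4, group == 0

t = np.arange(1, 5)                               # t = 1, 2, 3, 4
Y0 = rng.normal(size=(n, 4)) + 0.5 * t            # common untreated trend
Y0[G4] += 0.3 * t                                 # extra trend for G4: violates PTA 2.5
D = (t >= group[:, None]) & (group[:, None] > 0)  # staggered, irreversible treatment
Y = Y0 + 1.0 * D                                  # true ATT(g, t) = 1 everywhere

# Impute alpha_{3,3}(0) with the never-treated comparison, as in (2.2) ...
att_33_never = (Y[G3, 2] - Y[G3, 1]).mean() - (Y[C, 2] - Y[C, 1]).mean()
# ... and with the late-treated comparison, as in (2.3), which requires PTA 2.5.
att_33_g4 = (Y[G3, 2] - Y[G3, 1]).mean() - (Y[G4, 2] - Y[G4, 1]).mean()

print(att_33_never)   # close to the true ATT(3, 3) = 1.0
print(att_33_g4)      # biased by the trend gap: close to 1.0 - 0.3 = 0.7
```

With 200,000 simulated units, the never-treated comparison is close to the true effect of 1.0, while the G_4-based comparison is biased toward 0.7, the true effect minus the 0.3 trend gap.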
2.3 The parameters of interest

In this subsection, we discuss the different parameters of interest that may arise when one deviates from the canonical 2 × 2 DID setting. Before presenting the parameters of interest considered by S&A, C&S, and dC&D, it is worth stressing the potential pitfalls associated with the commonly used TWFE regression specifications.
As Borusyak and Jaravel (2017), dC&D, and Goodman-Bacon (2019) point out, one of the most popular specifications in this many periods, many groups DID setting is the following TWFE regression specification,

Y_{it} = α_g + α_t + β^{fe} D_{it} + u_{it},    (2.16)

where α_g and α_t are group and time fixed effects, respectively, and u_{it} is an idiosyncratic error term. Although practitioners often consider β^{fe} to be a main parameter of interest, the aforementioned papers show that, when treatment effects are allowed to be heterogeneous across groups and time periods, β^{fe} can only be interpreted as a weighted average of treatment effects, and, perhaps even more problematic, some of these weights can be negative; see also Laporte and Windmeijer (2005), Wooldridge (2005), Chernozhukov et al. (2013), and Gibbons et al. (2018) for earlier related results based on (one-way) fixed-effects estimators. As such, interpreting estimates of β^{fe} as sensible causal summary parameters can lead to misleading conclusions about the policy's effectiveness.

Moreover, the negative (and non-intuitive) weighting problem is not specific to (2.16). dC&D show that it also applies to the first-difference specification. In addition, S&A show that the non-convex weighting problem extends to many variations of the dynamic TWFE (event-study) regression specification,

Y_{it} = α_i + α_t + Σ_{e=−K}^{−1} β_e 1{t − G_i + 1 = e} + Σ_{e=1}^{L} β_e 1{t − G_i + 1 = e} + v_{it},    (2.17)

where G_i is the time a unit i is first treated (equal to infinity if unit i is “never treated”), t − G_i + 1 is the “event time”, i.e., the number of time periods a unit has been treated, and 1{t − G_i + 1 = e} is an indicator for unit i having been treated for e time periods. (All the results for (2.16) remain the same if one replaces α_g with α_i, a unit-specific fixed effect. We prefer to include α_g as it closely resembles the canonical DID regression specification.)
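The negative-weighting problem can be made concrete with a deterministic toy example of our own construction (not from the paper): two units, three periods, untreated outcomes identically zero so that parallel trends holds exactly, and every treatment effect strictly positive, yet the TWFE coefficient β^{fe} from (2.16) (here with unit fixed effects) is exactly zero:

```python
import numpy as np

# Two units: E first treated at t = 2, L first treated at t = 3; T = 3.
# Untreated outcomes are identically zero, so parallel trends holds exactly.
# Treatment effects are all positive: ATT(E, 2) = 1, ATT(E, 3) = 3, ATT(L, 3) = 1.
Y = np.array([0., 1., 3.,     # unit E at t = 1, 2, 3
              0., 0., 1.])    # unit L at t = 1, 2, 3
D = np.array([0., 1., 1.,
              0., 0., 1.])
unit = np.repeat([0, 1], 3)
time = np.tile([0, 1, 2], 2)

# TWFE design matrix: intercept, unit dummy, two time dummies, treatment dummy.
X = np.column_stack([np.ones(6), unit == 1, time == 1, time == 2, D]).astype(float)
beta_fe = np.linalg.lstsq(X, Y, rcond=None)[0][-1]
print(beta_fe)   # 0.0 (up to floating point), despite all effects being in {1, 3}
```

The estimate is a non-convex weighted average of the three positive effects: the negative weight on the early group's later-period effect exactly cancels the positive contributions.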
Taken together, these results suggest that the common practice of attaching a causal interpretation to the coefficients of TWFE regression models is not, in general, warranted.

Given the potential pitfalls associated with traditional estimation procedures, which do not recover easy-to-interpret causal parameters without further restricting treatment effect heterogeneity, S&A, C&S, and dC&D propose different estimators for different treatment effect parameters. In this subsection, we review these procedures and highlight their differences.

dC&D focuses on an instantaneous treatment effect measure across all “ever-treated” groups. More precisely, dC&D is mainly interested in estimating

δ^S ≡ E[ (Σ_{i=1}^{n} Σ_{t=2}^{T} G_{it} · (Y_{it}(1) − Y_{it}(0))) / (Σ_{i=1}^{n} Σ_{t=2}^{T} G_{it}) ],    (2.18)

the average of the treatment effect at the time when a group starts receiving the treatment, across all groups that become treated at some point (see Section 4 of dC&D).

dC&D also proposes an easy-to-implement estimator for δ^S. To better understand this estimator, let

ÂTT_ny(g, t) = [n^{−1} Σ_{i=1}^{n} G_{ig} (Y_{it} − Y_{i,g−1})] / [n^{−1} Σ_{i=1}^{n} G_{ig}] − [n^{−1} Σ_{i=1}^{n} (1 − D_{it})(1 − G_{ig})(Y_{it} − Y_{i,g−1})] / [n^{−1} Σ_{i=1}^{n} (1 − D_{it})(1 − G_{ig})]    (2.19)

be a DID estimator for (2.1) that uses the units not yet treated by time t as a comparison group for treatment group g at time t. Consider the estimator of the probability of a unit being in group g given that it is among the units that are treated for at least e = t − g + 1 periods,

ŵ(g; e) ≡ P̂(G_g = 1 | treated for ≥ e periods) = N_{g∩≥e} / N_{≥e}.    (2.20)

This is the case with staggered treatment adoption.
In more general treatment adoption setups, the parameter of interest considered by dC&D differs from δ^S as defined above.

Here, N_{g∩≥e} denotes the number of units in group g among those units that have been treated for at least e periods, and N_{≥e} is the number of units that have been treated for at least e periods. dC&D then show that, under the PTA 2.5 and some additional regularity conditions,

δ̂^S = Σ_{g=2}^{T} P̂(G_g = 1 | treated for ≥ 1 period) · ÂTT_ny(g, g),    (2.21)

is an unbiased estimator of δ^S, and, as the (effective) sample size grows, δ̂^S is also consistent and asymptotically normal.

From the discussion above, it is evident that (2.21) is a well-defined estimator for the easy-to-interpret causal parameter of interest δ^S as defined in (2.18). On the other hand, δ̂^S is, by design, only suitable to summarize instantaneous treatment effects. Hence, when one is interested in treatment effect dynamics, one needs to consider alternative causal parameters of interest.

A particular way of considering more general parameters of interest that are able to capture richer sources of treatment effect heterogeneity is to follow C&S, and center the analysis on the average treatment effect at time t for those units first treated at time g, ATT(g, t), as defined in (2.1). By doing so, one can highlight different sources of treatment effect heterogeneity. For instance, one can look at how ATT(g, t) for a particular group g evolves over time, which allows one to study group-specific treatment effect dynamics. Alternatively, one can form different weighted averages of the ATT(g, t) that are able to summarize overall treatment effects.
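As a sketch of how (2.19)-(2.21) fit together, the simulation below (our own illustration; the design, group shares, and group-specific effects are assumptions) computes the not-yet-treated estimator ÂTT_ny(g, g) for each group and aggregates it with the weights (2.20) at e = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 100_000, 4
# Groups first treated at t = 2, 3, 4, plus a never-treated group (coded 0).
group = rng.choice([2, 3, 4, 0], size=n, p=[0.25, 0.25, 0.25, 0.25])
t = np.arange(1, T + 1)
D = (t >= group[:, None]) & (group[:, None] > 0)
# Parallel trends holds; the treatment effect for group g is g (heterogeneous).
Y = rng.normal(size=(n, T)) + 0.2 * t + np.where(D, group[:, None], 0.0)

def att_ny(g, s):
    """Sample analogue of (2.19): not-yet-treated comparison group, ATT(g, t = s)."""
    treat = group == g
    comp = ~D[:, s - 1] & ~treat            # (1 - D_it)(1 - G_ig) at time t = s
    diff = Y[:, s - 1] - Y[:, g - 2]        # Y_t - Y_{g-1}, 0-indexed columns
    return diff[treat].mean() - diff[comp].mean()

# (2.21): aggregate the instantaneous effects ATT_ny(g, g) with the weights
# P_hat(G_g = 1 | treated for >= 1 period), i.e., group shares among ever-treated.
ever = group > 0
delta_S = sum((group == g).sum() / ever.sum() * att_ny(g, g) for g in (2, 3, 4))
print(delta_S)   # estimand: (2 + 3 + 4) / 3 = 3, so the estimate is close to 3.0
```

In this design the instantaneous effect of group g equals g, so δ^S is (2 + 3 + 4)/3 = 3, and the plug-in estimate lands close to that value.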
Examples of these weighted averages include (i) a “simple” average of the ATT(g, t),

ATT_simple = [Σ_{g=2}^{T} Σ_{t=2}^{T} 1{g ≤ t} P(G = g) · ATT(g, t)] / [Σ_{g=2}^{T} Σ_{t=2}^{T} 1{g ≤ t} P(G = g)],    (2.22)

(ii) the “event-study-type” causal parameter

δ^{es}(e) = Σ_{g=2}^{T} Σ_{t=2}^{T} 1{t − g + 1 = e} P(G_g = 1 | treated for ≥ e periods) ATT(g, t),    (2.23)

which provides the average treatment effect for units that have been treated for e periods, and (iii) the average of δ^{es}(e) over all possible (positive) values of e,

δ^{e,avg} = (1 / (T − 1)) Σ_{e=1}^{T−1} δ^{es}(e).    (2.24)

Note that all these weighted averages of the ATT(g, t) are easy to interpret, less “data hungry” than the disaggregated ATT(g, t), and can be used to summarize short-, medium-, and long-run effects of a given policy. In fact, one can show that (2.18) is equal to δ^{es}(e) with e = 1.

The key challenge in estimating all these causal parameters of interest is to show that one can indeed nonparametrically point-identify all the ATT(g, t)'s with t ≥ g. C&S show that one can bypass such challenges by imposing either the PTA 2.6 or 2.7, though each assumption leads to a different estimand. More precisely, C&S show that, for t ≥ g, ATT(g, t) is nonparametrically identified by

ATT_never(g, t) = E[Y_t − Y_{g−1} | G_g = 1] − E[Y_t − Y_{g−1} | C = 1],    (2.25)
ATT_ny(g, t) = E[Y_t − Y_{g−1} | G_g = 1] − E[Y_t − Y_{g−1} | D_t = 0, G_g = 0],    (2.26)

when one respectively imposes either the PTA 2.6 or 2.7.
These quantities can be straightforwardly estimated by
$$\widehat{ATT}_{never}(g,t) = \frac{n^{-1}\sum_{i=1}^{n} G_{ig}\,(Y_{it}-Y_{ig-1})}{n^{-1}\sum_{i=1}^{n} G_{ig}} - \frac{n^{-1}\sum_{i=1}^{n} C_i\,(Y_{it}-Y_{ig-1})}{n^{-1}\sum_{i=1}^{n} C_i} \qquad (2.27)$$
and by $\widehat{ATT}_{ny}(g,t)$ as defined in (2.19).

With either $\widehat{ATT}_{never}(g,t)$ or $\widehat{ATT}_{ny}(g,t)$ in hand, one can then form the more aggregated parameters by replacing $\widehat{ATT}(g,t)$ with either of these estimators, and by replacing the weights with their natural (plug-in) estimators. For instance, depending on whether one imposes the parallel trends assumption 2.6 or 2.7, one can naturally estimate $\delta^{es}(e)$ by
$$\widehat{\delta}^{es}_{never}(e) = \sum_{g=2}^{T}\sum_{t=2}^{T}\mathbf{1}\{t-g+1=e\}\,\widehat{w}(g;e)\,\widehat{ATT}_{never}(g,t),$$
$$\widehat{\delta}^{es}_{ny}(e) = \sum_{g=2}^{T}\sum_{t=2}^{T}\mathbf{1}\{t-g+1=e\}\,\widehat{w}(g;e)\,\widehat{ATT}_{ny}(g,t),$$
respectively, where $e\geq 1$ and $\widehat{w}(g;e)$ is defined as in (2.20). The aggregated estimators for $ATT_{simple}$ and for $\delta^{e,avg}$ are formed analogously.

C&S derive the large sample properties of all the aforementioned estimators and propose bootstrap procedures to construct simultaneous confidence bands for these treatment effect measures. They emphasize the practical importance of using simultaneous inference procedures when one estimates multiple parameters of interest (e.g., when one estimates $\delta^{es}(e)$ for multiple $e$'s), as failing to account for multiple testing usually leads to misleading inference.

We conclude this subsection by noting that S&A are mainly interested in recovering the event-study-type parameter $\delta^{es}(e)$. More precisely, S&A propose the following interaction-weighted estimator for $\delta^{es}(e)$ (see Section 4 of S&A).
In the first step, they use the linear two-way fixed effects specification that interacts relative-time indicators with treatment-group indicators,
$$Y_{it} = \lambda_i + \lambda_t + \sum_{g=2}^{T-1}\sum_{e\neq 0}\delta_{ge}\cdot G_{ig}\,\mathbf{1}\{t - G_i + 1 = e\} + v_{it}, \qquad (2.28)$$
on observations from $t=1,\ldots,T-1$, where the last time period $T$ is dropped in order to accommodate the case where there is no "never treated" group; if a never-treated group is available, dropping data from time period $T$ is unnecessary. S&A show that, under the PTA 2.5, the estimator $\widehat{\delta}_{ge}$ is consistent for $ATT(g,t)$, $t-g+1=e$. Here, it is important to emphasize that, when a "never treated" group is not available, only the units treated at the last time period are used as comparison units when computing $\widehat{\delta}_{ge}$, which differs from the C&S proposed estimator (2.19) that uses not-yet-treated units as comparison units. When a "never treated" group is available, though, the interaction-weighted estimator $\widehat{\delta}_{ge}$ is equivalent to $\widehat{ATT}_{never}(g,t)$ once one maps event time to calendar time (or vice versa).

Armed with $\widehat{\delta}_{ge}$, S&A then propose to estimate $\delta^{es}(e)$ by
$$\widehat{\delta}^{es}_{S\&A}(e) = \sum_{g=2}^{T}\sum_{t=2}^{T}\mathbf{1}\{t-g+1=e\}\,\widehat{w}(g;e)\,\widehat{\delta}_{gt},$$
where $\widehat{w}(g;e)$ is defined as in (2.20). S&A establish the large sample properties of $\widehat{\delta}^{es}_{S\&A}(e)$ and provide valid pointwise inference procedures for $\delta^{es}(e)$.

Remark 2.1.
Many times one wishes to check for the existence of non-parallel pre-trends as a way to assess the credibility of the DID setup. We note that one can use $\widehat{\delta}^{es}_{never}(e)$ and/or $\widehat{\delta}^{es}_{ny}(e)$ with $e<1$ as estimators of pre-trends, though, when $e$ is negative, one must replace the estimated weights $\widehat{w}(g;e)$ as defined in (2.20) with
$$\widehat{w}(g;e^-) \equiv \widehat{P}\left(G_g=1 \mid \text{at least } |e| \text{ pre-treatment periods available}\right) = \frac{N_{g\cap\geq e^-}}{N_{\geq e^-}},$$
where $N_{g\cap\geq e^-}$ denotes the number of observations in group $g$ among those units that have at least $|e|$ pre-treatment time periods of data available, and $N_{\geq e^-}$ is the number of units that have at least $|e|$ pre-treatment time periods of data available. Importantly, these event-study estimators avoid the pitfalls associated with using the dynamic TWFE specification to assess the credibility of parallel trends; see S&A for a detailed discussion of this important issue.

From the discussion in the previous section, it is clear that in DID designs with staggered treatment adoption one can make different parallel trends assumptions and can estimate different parameters of interest. The discussion in Section 2 also indicates that, despite their peculiarities, these different parameters of interest can be estimated using weighted averages of estimators for the
$ATT(g,t)$'s. From this observation, one can reasonably argue that identifying (and estimating) the $ATT(g,t)$'s from the data is the most challenging step of the analysis, and that the different PTAs help researchers to overcome it. Once this is done, constructing event-study-type estimators, for instance, becomes straightforward.

In practice, however, researchers must choose and justify the use of a given PTA. In this section, we aim to highlight some practical consequences of adopting different versions of the PTA. In order to simplify the discussion, we exploit the stylized example introduced in Section 2 whenever possible, and we implicitly impose Assumptions 2.1-2.4.

Practitioners routinely use estimates of pre-treatment event-study coefficients to assess the credibility of an underlying PTA. Can these tests for parallel pre-treatment trends be interpreted as direct tests for the validity of the underlying PTA, or should they be interpreted as "placebo/falsification"-type tests? With the help of the stylized example, we show that the answer depends on the chosen PTA.

Let us first consider the case of Assumption 2.5. As is evident from (2.2)-(2.8), the PTA 2.5 imposes six linearly independent moment restrictions to recover three counterfactual parameters, $\alpha_{3,3}(0)$, $\alpha_{3,4}(0)$, and $\alpha_{4,4}(0)$. That is, imposing the PTA 2.5 leads to an overidentified system of equations and, consequently, we can directly test for the validity of the PTA 2.5. Indeed, it is easy to see that the PTA 2.5 implies parallel pre-treatment trends across every group, see, e.g., (2.6)-(2.8), and such restrictions can be directly assessed from the data, for instance, by testing whether pre-treatment event-study-type estimates of (2.23) are all equal to zero. Thus, under the PTA 2.5, non-zero pre-treatment event-study estimates should be interpreted as direct evidence against the identifying assumptions.
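As a concrete illustration of such a direct assessment, one can compare average pre-treatment outcome changes across two groups with a standard two-sample test. The sketch below is illustrative only; `pretrend_test` and its inputs are hypothetical names, not part of the paper's procedures:

```python
import numpy as np
from scipy.stats import ttest_ind

def pretrend_test(dy_group_a, dy_group_b):
    """Welch t-test of H0: equal mean pre-treatment outcome changes
    across two groups.

    dy_group_a, dy_group_b : arrays of pre-treatment changes (e.g., Y_2 - Y_1)
    for the units in each group; hypothetical inputs for illustration.
    """
    stat, pval = ttest_ind(np.asarray(dy_group_a), np.asarray(dy_group_b),
                           equal_var=False)  # Welch correction: unequal variances
    return stat, pval
```

A rejection points to non-parallel pre-trends between the two groups; whether that constitutes direct or only placebo-type evidence against the invoked PTA depends on which PTA is maintained, as discussed in the text.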
A somewhat similar conclusion is reached when one relies on the PTA 2.7: (2.12)-(2.15) suggest four linearly independent moment restrictions to recover three counterfactual parameters, which also leads to an overidentified system of equations. Recalling that (2.12)-(2.15) is equivalent to (2.2)-(2.5), we can then see from (2.2)-(2.3) that the PTA 2.7 also imposes parallel pre-trends among the "never treated" ($C=1$) and the "later treated" ($G_4=1$) from time $t=2$ to $t=3$, but does not restrict pre-trends of these two groups from $t=1$ to $t=2$, nor the pre-trends of the early-treated group ($G_3=1$). Moving from the stylized example to the general case, we have that the PTA 2.7 imposes parallel pre-treatment trends from $t=g_{\min}-1$ to $t=T$ for all groups except the first-treated group (which is treated at time $g_{\min}$). Interestingly, because estimators of (2.23) with $e<1$ exploit all pre-treatment trends (including the one for group $G_3=1$ in our stylized example), non-zero pre-treatment estimates cannot, at least strictly speaking, be interpreted as direct tests of the identifying assumptions, but rather as placebo-type tests.

As stressed by S&A, one should not use pre-treatment coefficients from TWFE event-study-type regressions to assess the credibility of the PTA, as these coefficients can be contaminated by post-treatment effects. Using estimates of (2.23) with $e<1$, on the other hand, does not suffer from these pitfalls. Please refer to S&A for a detailed discussion of these issues. If Assumption 2.3 is violated, though, it is possible that violations of Assumption 2.3 "offset" violations of Assumption 2.5, and the test based on (2.6)-(2.8) would not capture such violations. This should always be taken into account. In addition, failing to reject these tests should not be interpreted as evidence in favor of the identifying assumptions, as it may be the case that the test lacks power to detect some non-trivial deviations from the null.
Nonetheless, one can easily bypass this limitation by constructing alternative tests of the identifying assumptions. For instance, in the context of our stylized example, one can directly test whether $E[Y_3 - Y_2\mid C=1] = E[Y_3 - Y_2\mid G_4=1]$ using a standard t-test. Rejecting the null hypothesis would provide direct evidence against the identifying assumptions.

Finally, note that the conclusion is very different when one imposes the PTA 2.6: (2.9)-(2.11) suggest we have a just-identified system of equations, implying that the PTA 2.6 cannot be directly tested. Indeed, as we discussed in Section 2.2, the PTA 2.6 does not restrict pre-treatment trends, and, therefore, event-study estimates for pre-treatment periods provide, at best, placebo-type evidence against the PTA 2.6.

The discussion above highlights that whether tests for parallel pre-treatment trends provide direct or indirect evidence against the invoked identifying assumptions crucially depends on the invoked PTA. This is very different from the case where treatment adoption does not vary across time. In that case, tests for non-parallel pre-treatment trends always provide only indirect evidence against the adopted design.

In this section, we discuss some potential trade-offs one may face when adopting different PTAs, as they can lead to different DID estimators.
The PTA 2.6 is the weakest PTA among the three we have considered so far, as it does not impose any restriction on pre-treatment trends across groups. Given that this PTA leads to a just-identified system of equations, in situations where researchers are not willing to impose additional restrictions on the data, $\widehat{ATT}_{never}(g,t)$ as defined in (2.27) is the only suitable estimator for the $ATT(g,t)$ and their different functionals, such as the event-study parameters (2.23).

Of course, in order to rely on the PTA 2.6 and use the DID estimator (2.27), we must have a set of units that do not experience treatment in the time window we want to analyze. When such a group of units is available but its relative size is small, inference procedures based on (2.27) may not be as precise as one wishes. However, it is important to stress that this potential "loss of efficiency" is a direct consequence of not exploiting restrictions on pre-treatment trends across groups.

In practice, we foresee researchers taking this "robustness" versus "efficiency" trade-off into account when deciding whether the PTA 2.6 is the most suitable for the given application. In situations where there is a "reasonably large" number of units that cannot be treated because of some application-specific institutional detail, we expect the gains in robustness to dominate the potential gains in efficiency associated with using other PTAs and DID estimators. The same holds true if researchers are not comfortable with a priori ruling out non-parallel pre-trends. In these cases, we foresee researchers favoring the PTA 2.6 and the DID estimator $\widehat{ATT}_{never}(g,t)$ over the other alternatives.

In many situations, a "never-treated" group is not available, implying that the PTA 2.6 does not provide any identifying restriction that can be used to estimate the
$ATT(g,t)$'s. In other cases, the "never-treated" group may be "too small" to be of practical use, and/or researchers may a priori be comfortable restricting pre-treatment trends and using "not-yet-treated" units as valid comparison groups for the "earlier-treated" ones. In such cases, researchers can choose between the PTA 2.5 and the PTA 2.7. In both cases, though, they can use the DID estimator $\widehat{ATT}_{ny}(g,t)$, as defined in (2.19), to study policy effectiveness.

Although (2.19) can be used under either PTA, we still recommend that researchers explicitly specify which PTA they are invoking, for at least three reasons. First, being explicit about the identifying assumption adds transparency to the analysis, which is always desirable. Second, the interpretation of pre-tests based on event-study-type estimates can vary depending on the assumptions, as we discussed in Section 3.1. Third, the choice of the PTA has an important impact on what other estimators one could use instead of (2.19). This is particularly important because (2.19) does not fully exploit all the restrictions imposed by either PTA. Being aware of the exact PTA invoked allows researchers to adopt an alternative estimation procedure that fully exploits all these moment restrictions, resulting in estimators that are more efficient than (2.19). Here, we stress that the gains in efficiency will vary depending on the underlying PTA, as the PTA 2.5 imposes more restrictions on the data than the PTA 2.7.

Before describing how one can exploit these additional moment restrictions to form a more efficient DID estimator, it is worth describing situations where researchers may favor either the PTA 2.5 or the PTA 2.7. Recall that the main difference between the PTA 2.5 and the PTA 2.7 is that the former imposes parallel pre-treatment trends across all groups and all time periods, whereas the latter only restricts pre-treatment trends since the time the first group is treated.
These differences can be meaningful in applications where data on multiple time periods before the first group of units is treated are available, and the economic environment in these "early periods" was potentially different from the "later periods". In these cases, the outcomes of the different groups may evolve in a non-parallel manner during the "early periods", perhaps because the groups were exposed to different shocks, but these non-parallel trends become less of a concern as time passes. In such cases, we expect researchers to favor the PTA 2.7 over the PTA 2.5. In other situations, though, researchers may prefer to impose the PTA 2.5, allowing them to enjoy some potential gains in efficiency if they use estimators that exploit the additional restrictions imposed by the PTA 2.5 relative to the PTA 2.7. Again, the "robustness" versus "efficiency" trade-off should be taken into account when deciding which PTA is more appropriate for the specific application.

As we described in Section 3.2, in situations where researchers are comfortable with imposing either the PTA 2.5 or the PTA 2.7, the DID estimator (2.19) is not efficient, as it does not fully exploit all the restrictions implied by these PTAs. In this section, we describe how one can exploit all the restrictions implied by the identifying assumptions to form efficient DID estimators by casting the problem into the familiar GMM framework (Hansen, 1982). In what follows, we provide a step-by-step description of how one can form these efficient GMM DID estimators. To avoid repetition, we focus on the case where researchers impose the PTA 2.7; the implementation based on the PTA 2.5 is completely analogous.

The key to implementing the GMM is to list all the moment restrictions we are imposing to recover the
$ATT(g,t)$'s, which involves not only the moment restrictions implied by the PTA 2.7, but also the observational restrictions that, for all $t\geq g$, $\alpha_{g,t}(1) \equiv E[Y_t(1)\mid G_g=1] = E[Y_t\mid G_g=1]$, $\alpha^{prop}_g \equiv E[G_g]$, and $\alpha^{prop}_C \equiv E[C]$. We can then use these "augmented" moment restrictions (consisting of the observational restrictions and all the moment restrictions implied by the PTA) to efficiently estimate all the unknown parameters involved in our problem by following Hansen (1982).

To gain more intuition on how to implement the efficient GMM, we turn our attention to our stylized example. In this case, the unknown parameters consist of $\alpha \equiv (\alpha_{3,3}(1), \alpha_{3,3}(0), \alpha_{3,4}(1), \alpha_{3,4}(0), \alpha_{4,4}(1), \alpha_{4,4}(0), \alpha^{prop}_C, \alpha^{prop}_3, \alpha^{prop}_4)'$, which can be efficiently estimated by
$$\widehat{\alpha}_{gmm} = \arg\min_{\alpha\in\Theta}\ \bar{g}_{\alpha}(W)'\,\widehat{\Sigma}^{-1}_{\check{\alpha},gmm}\,\bar{g}_{\alpha}(W), \qquad (4.1)$$
where $\bar{g}_{\alpha}(W)$ is the sample average of the augmented moment conditions, $n^{-1}\sum_{i=1}^{n} g_{\alpha}(W_i)$, with $g_{\alpha}(W_i)$ combining all (linearly independent) moment conditions, and
$$\widehat{\Sigma}_{\check{\alpha},gmm} = \frac{1}{n}\sum_{i=1}^{n} g_{\check{\alpha}}(W_i)\,g_{\check{\alpha}}(W_i)',$$
$\check{\alpha}$ being a preliminary consistent estimator for $\alpha$, say the minimizer of (4.1) with $\widehat{\Sigma}_{\check{\alpha},gmm}$ replaced by the identity matrix.

With $\widehat{\alpha}_{gmm}$, one can then efficiently estimate the parameters of interest, $ATT(3,3)$, $ATT(3,4)$, and $ATT(4,4)$, by
$$\begin{pmatrix}\widehat{ATT}_{gmm}(3,3)\\ \widehat{ATT}_{gmm}(3,4)\\ \widehat{ATT}_{gmm}(4,4)\end{pmatrix} = \begin{pmatrix}\widehat{\alpha}_{gmm,3,3}(1) - \widehat{\alpha}_{gmm,3,3}(0)\\ \widehat{\alpha}_{gmm,3,4}(1) - \widehat{\alpha}_{gmm,3,4}(0)\\ \widehat{\alpha}_{gmm,4,4}(1) - \widehat{\alpha}_{gmm,4,4}(0)\end{pmatrix}. \qquad (4.2)$$
In what follows, we establish the asymptotic properties of $\widehat{\alpha}_{gmm}$. The asymptotic properties of $\widehat{ATT}_{gmm}(g,t)$ follow directly from the delta method. Let $\Delta Y_t = Y_t - Y_{t-1}$.
Define $\Sigma_{\alpha,gmm}$ as the probability limit of $\widehat{\Sigma}_{\check{\alpha},gmm}$, and
$$\Psi = E\left[\frac{\partial g_{\alpha}(W)}{\partial \alpha'}\right].$$
Let the vector of scores associated with the efficient GMM estimator be defined as
$$\phi^{gmm}_{\alpha}(W_i) = -\left(\Psi'\Sigma^{-1}_{\alpha,gmm}\Psi\right)^{-1}\Psi'\Sigma^{-1}_{\alpha,gmm}\cdot g_{\alpha}(W_i).$$

Proposition 4.1.
Assume that all random variables have finite second moments, that $\Sigma_{\alpha,gmm}$ is positive definite, and that Assumptions 2.1-2.4 hold. Then, when the parallel trends assumption 2.7 holds, we have that:

(i) As $n\to\infty$,
$$\sqrt{n}\left(\widehat{\alpha}_{gmm}-\alpha\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi^{gmm}_{\alpha}(W_i) + o_p(1) \xrightarrow{d} N\left(0,\left(\Psi'\Sigma^{-1}_{\alpha,gmm}\Psi\right)^{-1}\right).$$

(ii) The GMM estimator $\widehat{\alpha}_{gmm}$ is semiparametrically efficient.

Proof. The proof is presented in Web Appendix C.

Proposition 4.1 has important practical implications, which we illustrate in the context of our stylized example. First, and perhaps most importantly, it implies that
$$\sqrt{n}\left(\widehat{ATT}_{gmm} - ATT\right) \xrightarrow{d} N(0,\Omega),$$
where $\widehat{ATT}_{gmm}$ and $ATT$ stack the $(3,3)$, $(3,4)$, and $(4,4)$ entries (see Web Appendix A for the details about $g_{\alpha}(W_i)$), $\Omega = A\left(\Psi'\Sigma^{-1}_{\alpha,gmm}\Psi\right)^{-1}A'$, $A$ is a "selection matrix" (see Web Appendix A for its formal definition), and $\Omega$ is equal to the semiparametric efficiency bound for the $ATT(g,t)$ under the PTA 2.7. As such, $\widehat{ATT}_{gmm}$ exploits all available information in the data to estimate the
$ATT(g,t)$'s, which, in general, translates into tighter confidence intervals. In fact, under the PTA 2.7, the GMM DID estimator $\widehat{ATT}_{gmm}(g,t)$ is, in general, more efficient than $\widehat{ATT}_{ny}(g,t)$ or $\widehat{ATT}_{never}(g,t)$ as defined in (2.19) and (2.27), respectively, or those based on the "interaction-weighted" regression (2.28). This is a main advantage of the GMM DID estimator when compared to the other available estimators.

A second implication of Proposition 4.1 is that, given that we have an overidentified system of equations, one can directly use the Sargan-Hansen J-test as a test for the validity of the PTA 2.7. More precisely, under the null hypothesis that the PTA 2.7 is true,
$$J = n\cdot\left(\bar{g}_{\widehat{\alpha}_{gmm}}(W)'\,\widehat{\Sigma}^{-1}_{\widehat{\alpha}_{gmm},gmm}\,\bar{g}_{\widehat{\alpha}_{gmm}}(W)\right) \xrightarrow{d} \chi^2_{df} \quad \text{as } n\to\infty,$$
where $df$ is the number of overidentifying restrictions. If the PTA 2.7 holds, any deviation of $J$ from zero should be within the range of sampling error, whereas if the PTA 2.7 is violated, $J$ should be "large." Thus, the Sargan-Hansen J-test can be useful for detecting violations of the PTA 2.7.

At this point, one may wonder about the situations in which one may favor the GMM DID estimator (4.2) over the simpler DID estimator (2.19). Given that (4.2) is more efficient than, and as robust as, (2.19), the only obstacle we see to its widespread adoption is its implementation: whenever the number of treatment groups and/or time periods is large, the number of moment conditions in the efficient GMM can be fairly large. In our application, for example, where we have 16 treatment groups and 33 time periods, the GMM involves 780 moments with
195 overidentifying restrictions, whereas the sample size (state-year pairs) is equal to 759. In such cases, we expect researchers to favor the simpler, but inefficient, DID estimator (2.19). However, when the number of groups and/or time periods is moderate, such that implementation of the efficient GMM is not challenging, we would recommend using it.
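For readers who want to see the mechanics, the two-step efficient GMM behind (4.1) and the J statistic can be sketched generically. The `moments` argument below is a placeholder: the actual $g_{\alpha}(W_i)$ for this setting is spelled out in Web Appendix A, so this is a hedged illustration of Hansen's (1982) recipe, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def two_step_gmm(moments, data, alpha0):
    """Generic two-step GMM: identity weighting first (preliminary estimate),
    then weighting by the inverse of the estimated moment covariance.
    `moments(alpha, data)` must return an (n, q) array with rows g_alpha(W_i)."""
    def gbar(a):
        return moments(a, data).mean(axis=0)

    def obj(a, W):
        g = gbar(a)
        return g @ W @ g

    n, q = moments(alpha0, data).shape
    # Step 1: identity weighting matrix yields a preliminary consistent estimate
    step1 = minimize(obj, alpha0, args=(np.eye(q),), method="BFGS")
    g_i = moments(step1.x, data)
    Sigma = g_i.T @ g_i / n                      # estimated moment covariance
    W2 = np.linalg.inv(Sigma)
    # Step 2: efficient weighting by the inverse covariance
    step2 = minimize(obj, step1.x, args=(W2,), method="BFGS")
    # Sargan-Hansen J statistic; df = (number of moments) - (number of parameters)
    J = n * obj(step2.x, W2)
    pval = chi2.sf(J, q - len(np.atleast_1d(alpha0)))
    return step2.x, J, pval
```

The first step pins down the preliminary $\check{\alpha}$ used to build the weighting matrix, mirroring the description around (4.1); the returned p-value corresponds to the overidentification test discussed above.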
As highlighted in the previous section, the main attractive feature of the GMM estimation procedure is that it leads to efficient estimators that fully exploit all the available information compatible with the underlying identifying assumptions. On the other hand, the implementation of such GMM DID estimators is not always straightforward. In this section, we describe an alternative DID estimator for the
$ATT(g,t)$'s. Although this estimator does not fully exploit all the available identifying restrictions, and is therefore not as efficient as the GMM estimator of Section 4, it is very easy to implement. It relies on the following parallel trends assumption.
Assumption 5.1 ("Weaker" parallel trends assumption based on "not-yet-treated" units). For all $g,t = 2,\ldots,T$ such that $t\geq g$,
$$E[Y_t(0) - Y_{t-1}(0)\mid G_g=1] = E[Y_t(0) - Y_{t-1}(0)\mid D_t=0].$$

The PTA 5.1 imposes that the evolution of the outcome at time $t$ among those units that have not yet experienced treatment by time $t$ can help us identify the $ATT(g,t)$'s. Unlike the PTA 2.7, it does not impose that every individual not-yet-treated group can be used as a comparison group, which, in turn, suggests that the $ATT(g,t)$, $t\geq g$, are nonparametrically just-identified. We formalize this result in the next proposition. Let $\Delta Y_t = Y_t - Y_{t-1}$ denote the first difference of $Y_t$.

Proposition 5.1.
Assume that Assumptions 2.1-2.4 hold. Then, when the parallel trends assumption 5.1 holds, it follows that, for $2\leq g\leq t\leq T$, $ATT(g,t) = ATT_{ny+}(g,t)$, where
$$ATT_{ny+}(g,t) \equiv E[Y_t - Y_{g-1}\mid G_g=1] - \left(\sum_{s=g}^{t} E[\Delta Y_s\mid D_s=0,\, G_g=0]\right). \qquad (5.1)$$

Proof.
The proof is presented in Web Appendix C.

To better grasp how the PTA 5.1 allows us to use
$ATT_{ny+}(g,t)$ as an estimand for the $ATT(g,t)$, $t\geq g$, it is illustrative to go back to our stylized example. In this specific context, we have that, under Assumptions 2.1-2.4, the PTA 5.1 is equivalent to the following restrictions:
$$\alpha_{3,3}(0) = E[Y_3 - Y_2\mid D_3=0] + E[Y_2\mid G_3=1], \qquad (5.2)$$
$$\alpha_{3,4}(0) = E[Y_4 - Y_3\mid D_4=0] + \alpha_{3,3}(0), \qquad (5.3)$$
$$\alpha_{4,4}(0) = E[Y_4 - Y_3\mid D_4=0] + E[Y_3\mid G_4=1]. \qquad (5.4)$$
By listing these restrictions we can now see how we arrive at $ATT_{ny+}(g,t)$, as defined in (5.1). First, from (5.2) and (5.4), it follows that, when $g=t$, $\alpha_{g,t}(0)$ can be written explicitly as functionals of observable data (and not potential outcomes). As such, $ATT(g,t)$ is identified by (5.1). Interestingly, in this case with $g=t$, (5.1) reduces to $ATT_{ny}(g,t)$, as defined in (2.26). When one moves away from the "instantaneous average treatment effects", though, these two estimands differ. Indeed, by exploiting the moment restrictions (5.2) and (5.3), we can see that $ATT(3,4)$ is nonparametrically identified by $ATT_{ny+}(3,$
$4) = E[Y_4 - Y_2\mid G_3=1] - \left(E[Y_3 - Y_2\mid D_3=0] + E[Y_4 - Y_3\mid D_4=0]\right)$. Note that $ATT_{ny+}(3,4)$ uses data from all groups, $G_3=1$, $G_4=1$, and $C=1$, whereas $ATT_{ny}(3,4)$ only uses data from $G_3=1$ and $C=1$. Hence, one may expect estimators based on (5.1) to be more precise than (2.19) because they utilize more data. Furthermore, because the PTA 5.1 does not restrict pre-trends, i.e., it does not impose that $E[Y_3 - Y_2\mid D_3=0] = E[Y_3 - Y_2\mid D_4=0]$ as implied by (2.12) and (2.13), one can also expect additional gains in "robustness" from exploiting (5.1) instead of (2.26).

Next, we discuss how one can exploit Proposition 5.1 to estimate the $ATT(g,t)$'s. Here, the most natural way to proceed is to use the sample analogue of (5.1):
$$\widehat{ATT}_{ny+}(g,t) = \frac{n^{-1}\sum_{i=1}^{n} G_{ig}\,(Y_{it}-Y_{ig-1})}{n^{-1}\sum_{i=1}^{n} G_{ig}} - \sum_{s=g}^{t}\left(\frac{n^{-1}\sum_{i=1}^{n}(1-D_{is})(1-G_{ig})\,\Delta Y_{is}}{n^{-1}\sum_{i=1}^{n}(1-D_{is})(1-G_{ig})}\right). \qquad (5.5)$$
Note that (5.5) is very easy to compute, as it only involves combinations of sample means. Next, we show that these DID estimators also enjoy good asymptotic properties. More precisely, we prove that they are $\sqrt{n}$-consistent and establish their joint asymptotic distribution. Before we present the results, we need to introduce some additional notation. For each $(g,t)$-pair, let $\phi_{ny+}(W_i;g,t)$ be the influence function of $\widehat{ATT}_{ny+}(g,t)$,
$$\phi_{ny+}(W_i;g,t) = \frac{G_{ig}}{E[G_g]}\left((Y_{it}-Y_{ig-1}) - \frac{E[G_g\cdot(Y_t - Y_{g-1})]}{E[G_g]}\right) - \sum_{s=g}^{t}\frac{(1-D_{is})(1-G_{ig})}{E[(1-D_s)(1-G_g)]}\left(\Delta Y_{is} - \frac{E[(1-D_s)(1-G_g)\cdot\Delta Y_s]}{E[(1-D_s)(1-G_g)]}\right).$$
Finally, let
$\widehat{ATT}_{ny+}(t\geq g)$ and $ATT(t\geq g)$ denote the vectors of the $\widehat{ATT}_{ny+}(g,t)$ and $ATT(g,t)$, respectively, for all $g,t = 2,\ldots,T$ with $t\geq g$. Analogously, let $\Phi_{ny+}(W_i; t\geq g)$ denote the collection of the $\phi_{ny+}(W_i;g,t)$ across all periods $t$ and groups $g$ such that $t\geq g$.

Proposition 5.2.
Assume that Assumptions 2.1-2.4 and Assumption 5.1 hold. Then, as $n\to\infty$,
$$\sqrt{n}\left(\widehat{ATT}_{ny+} - ATT\right)(g,t) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi_{ny+}(W_i;g,t) + o_p(1). \qquad (5.6)$$
Furthermore,
$$\sqrt{n}\left(\widehat{ATT}_{ny+}(t\geq g) - ATT(t\geq g)\right) \xrightarrow{d} N(0,V), \qquad (5.7)$$
with $V = E\left(\Phi_{ny+}(W;t\geq g)\,\Phi_{ny+}(W;t\geq g)'\right)$.

Proof. The proof is presented in Web Appendix C.

We restrict our attention to $t\geq g$ simply because these are the post-treatment periods, which presumably are the periods of main interest for the analysis. However, our results naturally extend to the case where one considers all possible $(g,t)$'s, with the caveat that $ATT_{ny+}(g,t)$ may differ from $ATT(g,t)$ for $t<g$, as the PTA 5.1 does not explicitly restrict pre-trends.

There are different ways to conduct inference about the $ATT(g,t)$'s. The first, and perhaps more standard, approach is to use the analogy principle and directly estimate $V$, which leads directly to standard errors and pointwise confidence intervals. However, it is worth stressing that when one is interested in making inference about multiple $ATT(g,t)$'s, inference procedures based on this standard approach, such as those based on traditional t-tests and/or individual confidence intervals, are usually inappropriate, as they do not account for the fact that one is (implicitly) conducting multiple hypothesis tests. As a direct consequence, significant treatment effects may emerge simply by chance, even when all the $ATT(g,t)$'s are equal to zero; see, e.g., Romano and Wolf (2005), Anderson (2008), and Section 8 of Romano et al. (2010).

An alternative path to conduct asymptotically valid inference for multiple parameters of interest that is robust against the multiple-testing problem is to leverage the asymptotic linear representation (5.6) to construct computationally simple bootstrapped simultaneous confidence intervals for the multiple $ATT(g,t)$'s.
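To fix ideas, the sample analogue (5.5) and the multiplier-bootstrap idea can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the data layout is assumed, and Rademacher multipliers are one common choice of random weight.

```python
import numpy as np

def att_ny_plus(Y, G_g, D, g, t):
    """Sample analogue (5.5) of ATT_{ny+}(g, t).

    Y   : (n, T) outcomes; column s-1 holds Y_s (periods are 1-indexed)
    G_g : (n,) 0/1 indicator of first treatment at period g
    D   : (n, T) with D[i, s-1] = 1 once unit i has been treated by period s
    """
    treated = (Y[G_g == 1, t - 1] - Y[G_g == 1, g - 2]).mean()
    trend = 0.0
    for s in range(g, t + 1):            # accumulate E[dY_s | D_s = 0, G_g = 0]
        comp = (D[:, s - 1] == 0) & (G_g == 0)
        trend += (Y[comp, s - 1] - Y[comp, s - 2]).mean()
    return treated - trend


def simultaneous_band(phi, att_hat, n_boot=999, level=0.95, seed=0):
    """Multiplier-bootstrap simultaneous band: perturb the influence functions
    phi (an (n, k) array, one column per ATT(g, t)) by random weights instead
    of re-estimating the ATT(g, t)'s at every draw."""
    rng = np.random.default_rng(seed)
    n, k = phi.shape
    se = phi.std(axis=0, ddof=1) / np.sqrt(n)     # analytic pointwise std. errors
    sup_t = np.empty(n_boot)
    for b in range(n_boot):
        v = rng.choice([-1.0, 1.0], size=n)       # Rademacher multipliers
        sup_t[b] = np.max(np.abs((v[:, None] * phi).mean(axis=0)) / se)
    c = np.quantile(sup_t, level)                 # sup-t critical value
    return att_hat - c * se, att_hat + c * se
```

Because the sup-t critical value exceeds the pointwise one, the resulting band is wider than individual confidence intervals but controls the family-wise coverage across all the $ATT(g,t)$'s considered.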
The idea of this bootstrap procedure is fairly simple: each bootstrap iteration amounts to "perturbing" the asymptotic linear representation of the $\widehat{ATT}_{ny+}(g,t)$'s by a random weight $V$, and it does not require re-estimating the $ATT(g,t)$'s at each bootstrap draw. In Web Appendix B, we provide a step-by-step description of how one can implement such a procedure.

Remark 5.1.
It is worth stressing that the
$ATT_{ny+}(g,t)$ estimand defined in (5.1) is only suitable for post-treatment periods, i.e., for $t\geq g$. Hence, in contrast to (2.25) and (2.26), we cannot use the same estimand to analyze pre-treatment periods $t<g$. To address this issue, we suggest using the estimand
$$ATT^{pre}_{ny+}(g,t) \equiv E[\Delta Y_t\mid G_g=1] - E[\Delta Y_t\mid D_t=0,\, G_g=0], \quad \text{for } t<g, \qquad (5.8)$$
which should be equal to zero under Assumptions 2.1-2.4 and a stronger version of the PTA 5.1 that holds for both pre- and post-treatment periods (and not only for post-treatment periods, as the PTA 5.1 does). One can then use estimates of (5.8) to provide indirect evidence for the PTA 5.1, as the PTA 5.1 cannot be directly tested. We stress that (5.8) should not be directly compared with (2.25) and (2.26) for $t<g$, as (5.8) measures "local deviations" (from time $t-1$ to $t$) of a zero pre-treatment trends condition, whereas (2.25) and (2.26) capture "cumulative deviations" (from time $t$ until $g-1$) of zero pre-treatment trends conditions.

Remark 5.2.
It is straightforward to build on $\widehat{ATT}_{ny+}(g,t)$ to construct event-study estimators for $\delta^{es}(e)$ as defined in (2.23). Following the same steps described in Section 2.3.2, a natural estimator for $\delta^{es}(e)$ is
$$\widehat{\delta}^{es}_{ny+}(e) = \sum_{g=2}^{T}\sum_{t=2}^{T}\mathbf{1}\{t-g+1=e\}\,\widehat{w}(g;e)\,\widehat{ATT}_{ny+}(g,t),$$
where the weights $\widehat{w}(g;e)$ are defined as in (2.20). By building on Proposition 5.2 and the fact that the weights admit an asymptotic linear representation, it is easy to show that $\widehat{\delta}^{es}_{ny+}(e)$ is consistent and asymptotically normal; see, e.g., C&S.

We conclude this section by highlighting situations where we foresee (5.5) being favored over the other available DID estimators. First, we envision researchers favoring (5.5) over (2.19) in situations where they are not comfortable explicitly restricting pre-trends and/or when they want to use data from all groups to estimate the
$ATT(g,t)$'s. This can be particularly relevant when one wants to conduct cluster-robust inference and only a moderate number of clusters is available. We also expect researchers to favor (5.5) over the efficient GMM estimator when implementation of the latter is challenging. In this case, we expect (5.5)'s ease of use to dominate the potential efficiency gains of the GMM. Finally, we expect researchers to favor (5.5) over (2.27) when the "never-treated" group is relatively small, though we stress that these two estimators rely on non-nested PTAs.

Remark 5.3.
Given that different estimators (and PTAs) have different implications for robustness and efficiency, it may be tempting to engage in a "specific-to-general" specification search: start the analysis with estimators that rely on "stronger" assumptions and then test the validity of these assumptions; if one does not reject them, one stops and uses the "more efficient" estimators, but if one rejects the invoked PTA, one then chooses a "more robust" but "less efficient" DID estimator. Although fairly intuitive, this strategy is dangerous and should not be used in practice: the specification search is based on a multiple-testing procedure, and, as such, inference procedures that treat the "final" estimator (or the "winner") as "true" can be severely distorted; see, e.g., Roth (2020) for a detailed discussion of this issue. Hence, we argue that researchers should select the PTA taking into account the "robustness" versus "efficiency" trade-off, and that these considerations should be based on external, context-specific information, and not on pre-tests.
To illustrate the inherent trade-offs described above, we replicate Katherine Grooms' (2015) analysis of the transition from federal to state management of the Clean Water Act (CWA). Environmental policy mandated at the federal level is often implemented at the state level. Yet, there exists variation in the level of enforcement across states. Grooms (2015) exploits the staggered timing of the transfer from federal to state monitoring and enforcement of the CWA. Using TWFE specifications akin to (2.16) and (2.17), she finds that the state-level prevalence of corruption plays an important role in the enforcement of, and compliance with, environmental regulation after transitioning to state control.

We begin by describing the data, and then we discuss the practical relevance of the key assumptions and specifications we use for the analysis given our context. Next, we show the baseline and corruption-specific results for both the TWFE specification and the new DID estimators that rely on the different PTAs discussed above. Finally, we discuss the implications of the findings and the importance of choosing an appropriate PTA.
We follow the data construction from Grooms (2015) as closely as possible. Table D.1 in Web Appendix D replicates key summary statistics from Grooms (2015) and provides additional detail on data sources and construction. As described further in Web Appendix D, we follow Grooms (2015) to construct a measure of the fraction of total facilities with at least one inspection, violation, or enforcement action in a state and year.

The timing of state authorization is distributed fairly evenly throughout our sample period, with the exception that 27 states received authorization prior to the sample period, between 1973 and 1975. Given that neither the data nor the parallel trends assumptions for Y_t(0) provide information to identify the average treatment effect for these "always treated" states, these states are dropped from the analysis. Figure 6.1 highlights the year that each of the remaining 23 states started treatment, i.e., the year in which the state was authorized to administer individual NPDES permits. The bottom four states are what we call the "never-treated" units, i.e., the states that remain unauthorized to administer individual NPDES permits through the entire sample period. Figure 6.1 also allows one to visualize which states form each treatment group (those states whose colors turn to dark blue in the same year), and who the "not-yet-treated" states are at any point in time (those units that are colored light blue in a given year).

(Footnote: Figure D.1 in Web Appendix D shows the distribution of the timing of state authorization across years. As many states receive authorization for the first four phases in the same year, we define the year of authorization as the year in which the state was authorized to perform the first phase of the program, administering individual NPDES permits.)

Finally, we follow Grooms (2015) in defining states with above-median federal public corruption convictions across all years as "corrupt" states. Figure 6.2 shows corrupt states
(Footnote: As of 2008, four states remained unauthorized to administer individual NPDES permits. Idaho received authorization in 2018, outside of the sample period used here to be consistent with Grooms (2015). See Web Appendix D for additional detail.)
[Figure 6.1: Timing of treatment adoption by state (NM, NH, MA, ID, AK, AZ, ME, TX, OK, LA, FL, SD, UT, AR, RI, KY, WV, NJ, AL, PA, IA, TN, IL), 1976-2006. Legend: never-treated; treated (before state authorization); treated (after state authorization). Notes: Shows the timing of treatment adoption, where treatment is defined as the year in which the state was authorized to administer individual NPDES permits.]

in red and non-corrupt states in blue. "Always-treated" states are shown in grey. Based on this measure, "corrupt" states are mostly from the mid-Atlantic and southern regions, while "non-corrupt" states tend to appear in New England and the west.
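This above-median classification is straightforward to implement. The sketch below illustrates the idea with hypothetical conviction data; the state abbreviations and numbers are made up for illustration and are not the Grooms (2015) measure itself:

```python
import numpy as np

def classify_corrupt(states, avg_convictions_per_capita):
    """Label a state "corrupt" when its average federal public corruption
    convictions per capita (across all sample years) exceeds the
    cross-state median, mirroring the median-split definition above."""
    conv = np.asarray(avg_convictions_per_capita, dtype=float)
    cutoff = np.median(conv)
    return {s: bool(c > cutoff) for s, c in zip(states, conv)}

# Hypothetical averages for four states (illustrative values only)
labels = classify_corrupt(["AL", "ME", "NJ", "UT"], [0.9, 0.2, 0.8, 0.3])
# AL and NJ fall above the median of 0.55 and are labeled "corrupt"
```

Note that a median split always classifies half of the states as "corrupt" by construction, so the label is relative to the sample rather than an absolute corruption threshold.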
Like Grooms (2015), the starting point of our exercise is to examine the impact of authorization on compliance outcomes — for the sake of brevity, we focus on violation rates, though results for inspection rates and enforcement rates are available upon request. (Footnote: Overall, we find essentially zero effect on these other outcomes, regardless of the PTA and model specification used. This is in line with the results in Grooms (2015).)

Since we are particularly interested in treatment effect dynamics, we estimate event-study-type parameters using four different procedures. First, we replicate the dynamic TWFE specification from Grooms (2015). The exact specification we use is the following:

Y_it = λ_i + λ_t + Σ_{e=−31, e∉{−1,0}}^{32} β_e 1{t − G_i + 1 = e} + v_it,    (6.1)

which includes 30 treatment lead indicators (all the indicators associated with β_e with e < 0) and 32 treatment lag indicators (all the indicators associated with β_e with e > 0). We follow Borusyak and Jaravel (2017) and omit the treatment lead indicators associated with e = 0 and with e = −1. Like Grooms (2015), our specifications are weighted by total facilities in a state, and all standard errors are clustered at the state level.

[Figure 6.2 notes: Corrupt states, shown in red, are those above the median of average convictions per capita across all years. Non-corrupt states are shown in blue. Grey states are "already treated" prior to the sample window and are not included in the analysis.]

Second, we make specific PTAs and use the new estimators described previously. Because our empirical application includes a set of "never-treated" states, we estimate event-study-type parameters based on the PTA (2.6) and use δ̂^es_never(e) as an estimator for δ^es(e). We also leverage the PTA (2.7) and use δ̂^es_ny(e) as an estimator for δ^es(e). Finally, we employ the PTA (5.1) and use δ̂^es_ny+(e) as an estimator for δ^es(e). We do not use the event-study estimates based on the GMM framework discussed in Section 4, since, in our specific application, the GMM associated with the PTA 2.5 involves 780 moments with
195 overidentification restrictions, whereas the sample size (state-year pairs) is equal to 759.

Next, we analyze whether the effect of state authorization on violation rates varies depending on whether a state has a long prevalence of corruption. To do so, we follow Grooms (2015) and consider the following TWFE specification:

Y_it = α_i + α_t + Σ_{e=−31, e∉{−1,0}}^{32} β_e 1{t − G_i + 1 = e} + Σ_{e=−31, e∉{−1,0}}^{32} β_ce (1{t − G_i + 1 = e} × Corrupt_i) + v_it,    (6.2)

where the β_ce's are considered to be a measure of how treatment effects vary depending on whether a state is "corrupt" or not: positive (negative) point estimates suggest that the violation rates increased (decreased) more in corrupt states than in non-corrupt states.

At this stage, two important questions arise. First, what type of parallel trends assumption is actually being invoked to justify attaching a causal interpretation to the β_ce's in (6.2)? Second, is (6.2) susceptible to the potential pitfalls discussed in Section 2.3.1? Answering these questions is inherently hard, as TWFE is a model specification and not a "research design." An alternative, and perhaps more constructive, way of approaching this problem is to construct event-study-type estimators that explicitly rely on a particular PTA, and that, by design, avoid the potential lack of a clear interpretation associated with the TWFE specification (6.2). We follow this latter path.

With respect to the PTA, there are two natural variants of each of the PTAs 2.6, 2.7, and 5.1 that one can invoke to highlight treatment effect heterogeneity with respect to whether a state is corrupt or not. These variants differ from each other depending on whether or not one allows for different counterfactual trends between corrupt and non-corrupt states. One may be concerned, for example, that, in the absence of treatment, the evolution of the violation rate could differ between corrupt and non-corrupt states.
In this case, one would prefer a "weaker" assumption that allows for corruption-specific trends. In the context of our application, "corruption" is not randomly assigned, and the geographic clustering of corrupt and non-corrupt states may lead us to prefer a PTA that permits corruption-specific trends if we think, for example, that there may be regional trends that differ across these states. We believe this is the most natural identification setup, as this is how one would proceed if one were to separately estimate counterfactuals for corrupt states and non-corrupt states, and only later compare their difference. We formalize these six different PTAs below.

Assumption 6.1 (Parallel trends assumption based on "never treated" units, with corruption-specific trends). For c = 0, 1, and all g, t = 2, …, T, such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | C = 1, Corr = c].

Assumption 6.2 (Parallel trends assumption based on "not-yet treated" units, with corruption-specific trends). For c = 0, 1, and all g, s, t = 2, …, T, such that t ≥ g, s ≥ t,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | D_s = 0, Corr = c].

Assumption 6.3 ("Weaker" parallel trends assumption based on "not-yet treated" units, with corruption-specific trends). For c = 0, 1, and all g, t = 2, …, T, such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | D_t = 0, Corr = c].

Assumption 6.4 (Parallel trends assumption based on "never treated" units, without corruption-specific trends). For c = 0, 1, and all g, t = 2, …, T, such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | C = 1].

Assumption 6.5 (Parallel trends assumption based on "not-yet treated" units, without corruption-specific trends). For c = 0, 1, and all g, s, t = 2, …, T, such that t ≥ g, s ≥ t,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | D_s = 0].

Assumption 6.6 ("Weaker" parallel trends assumption based on "not-yet treated" units, without corruption-specific trends). For c = 0, 1, and all g, t = 2, …, T, such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | D_t = 0].

The difference between these PTAs depends on whether one uses the "never-treated", some "not-yet-treated", or all "not-yet-treated" units as valid comparison groups, and whether one only uses states with the same corruption status (corrupt or non-corrupt) as valid comparison groups. Assumptions 6.1-6.3 do not assume that the evolution of the violation rate is the same between corrupt and non-corrupt states. These three assumptions are the analogues of Assumptions 2.6, 2.7 and 5.1 when one restricts attention to the subset of units with corruption status equal to c. Assumptions 6.4-6.6, on the other hand, assume that, in the absence of treatment, the evolution of the violation rate is the same for corrupt and non-corrupt states, i.e., they rule out corruption-specific trends. As such, one may argue that Assumptions 6.1-6.3 are "weaker" than Assumptions 6.4-6.6.

Next, one can easily leverage any of these PTAs to identify and estimate sensible treatment effect parameters by following the same steps described in Section 2.3.2. The first step toward this goal is to show that the ATT(g, t)'s for the units with corruption status equal to c, c = 0, 1, defined by

ATT(g, t; c) ≡ E[Y_t(1) − Y_t(0) | G_g = 1, Corr = c],

are nonparametrically point-identified for all t ≥ g. However, given the results in Theorem 1 of C&S and the discussion in Sections 2.3.2 and 5, this is a straightforward task.
Indeed, one can easily show that for all t ≥ g, the ATT(g, t; c)'s are nonparametrically identified by

ATT_never(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − E[Y_t − Y_{g−1} | C = 1, Corr = c],   (6.3)
ATT_ny(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − E[Y_t − Y_{g−1} | D_t = 0, Corr = c],   (6.4)
ATT_ny+(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − Σ_{s=g}^{t} E[ΔY_s | D_s = 0, Corr = c],   (6.5)
ATT_never(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − E[Y_t − Y_{g−1} | C = 1],   (6.6)
ATT_ny(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − E[Y_t − Y_{g−1} | D_t = 0],   (6.7)
ATT_ny+(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − Σ_{s=g}^{t} E[ΔY_s | D_s = 0],   (6.8)

where (6.3)-(6.5) rely on Assumptions 6.1-6.3, respectively, and (6.6)-(6.8) rely on Assumptions 6.4-6.6, respectively. These results are analogous to (2.25), (2.26) and (5.1). Likewise, all the aforementioned quantities can be estimated using the analogy principle, i.e., by replacing population expectations with sample expectations.

Armed with these estimators, one can form different summary measures for the overall treatment effect following the same steps described in Section 2.3.2. To explicitly show how one can form event-study-type estimators, let
ATT_generic(g, t; c) be a generic notation for the estimands in (6.3)-(6.8), and denote its plug-in estimator by ÂTT_generic(g, t; c). Then, one can estimate the average treatment effect for units with corruption status equal to c that have been treated for e periods by

δ̂^es_generic(e; c) = Σ_{g=2}^{T} Σ_{t=2}^{T} 1{t − g + 1 = e} ŵ(g; e, c) ÂTT_generic(g, t; c),   (6.9)

where the weights are given by

ŵ(g; e, c) ≡ P̂(G_g = 1 | treated for ≥ e periods, Corr = c) = N_{g∩≥e∩c} / N_{≥e∩c},

N_{g∩≥e∩c} denotes the number of observations in group g among those units with corruption status c that have been treated for at least e periods, and N_{≥e∩c} is the number of units with corruption status c that have been treated for at least e periods. Given that our main goal is to compare the evolution of treatment effects between corrupt and non-corrupt states, we can simply compute the difference between δ̂^es_generic(e; 1) and δ̂^es_generic(e; 0). Denote this (generic) estimator by δ̂^es_generic(e; 1 −
0) = δ̂^es_generic(e; 1) − δ̂^es_generic(e; 0).   (6.10)

Here, we stress that, regardless of which of the six different estimators for (6.10) one adopts, they are all directly and explicitly tied to a given PTA, and, by design, they bypass the potential pitfalls associated with the TWFE specification. (Footnote: Although the PTAs 6.4-6.6 lead to overidentification, for the sake of simplicity we do not fully exploit all these restrictions when proposing the aforementioned estimands. When e < 0, we replace ŵ(g; e, c) in (6.9) with its pre-treatment analogue, defined analogously to the weights in Remark 2.1.)

In addition to the event-study estimates, we further aggregate these treatment effect curves into scalar, easy-to-interpret parameters. Toward this end, we report the plug-in estimators for the following two aggregated treatment effect parameters proposed by C&S:

ATT_simple,1−0 = ATT_simple;1 − ATT_simple;0,   (6.11)

where, for c = 0, 1, ATT_simple;c is defined analogously to (2.22), i.e.,

ATT_simple,c = [Σ_{g=2}^{T} Σ_{t=2}^{T} 1{g ≤ t} P(G = g, Corr = c) · ATT(g, t; c)] / [Σ_{g=2}^{T} Σ_{t=2}^{T} 1{g ≤ t} P(G = g, Corr = c)],

and the average of δ^es(e; 1 − 0) over all possible (positive) values of e,

δ^e,avg,1−0 = δ^e,avg;1 − δ^e,avg;0,   (6.12)

where, for c = 0, 1, δ^e,avg;c is defined analogously to (2.24), i.e.,

δ^e,avg;c = (1/(T − 1)) Σ_{e=1}^{T−1} δ^es(e; c).

We report estimators for these functionals that rely on the PTAs 6.1-6.6, respectively. For the sake of comparison, we also report the OLS estimate of β_cfe associated with the following TWFE specification,

Y_it = α_i + α_t + β_fe D_it + β_cfe D_it × Corrupt_i + u_it,   (6.13)

though these estimates are also subject to the pitfalls briefly described in Section 2.3.1.
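To fix ideas, the following sketch implements a plug-in analogue of the "never-treated" estimand (6.3) and the event-study aggregation (6.9) on a small balanced panel. The data layout and array names are hypothetical, and the sketch abstracts from the facility weights and the bootstrap inference used in the actual application:

```python
import numpy as np

def att_never(y, group, corr, g, t, c):
    """Sample analogue of (6.3): the change Y_t - Y_{g-1} for cohort g,
    minus the same change for never-treated units, within corruption
    status c. y is (n_units, T) with columns for periods 1..T; group is
    the first treatment period (0 = never treated); corr is 0/1."""
    treated = (group == g) & (corr == c)
    never = (group == 0) & (corr == c)
    # columns are 0-indexed, so period t is column t-1 and period g-1 is g-2
    return (y[treated, t - 1] - y[treated, g - 2]).mean() - \
           (y[never, t - 1] - y[never, g - 2]).mean()

def event_study(y, group, corr, e, c):
    """Event-study aggregation (6.9): average ATT(g, g+e-1; c) across
    cohorts, weighting each cohort g by its share among status-c units
    that have been treated for at least e periods."""
    T = y.shape[1]
    cohorts = [g for g in np.unique(group) if g > 0 and g + e - 1 <= T]
    n = np.array([((group == g) & (corr == c)).sum() for g in cohorts], float)
    atts = np.array([att_never(y, group, corr, g, g + e - 1, c) for g in cohorts])
    return float(n @ atts / n.sum())

# Hypothetical panel: 4 states, 4 periods; states 0-1 adopt in period 2 and
# their outcome jumps by 1 from then on; states 2-3 are never treated.
y = np.array([[0.0, 2.0, 3.0, 4.0],
              [0.5, 2.5, 3.5, 4.5],
              [1.0, 2.0, 3.0, 4.0],
              [1.5, 2.5, 3.5, 4.5]])
group = np.array([2, 2, 0, 0])
corr = np.array([0, 0, 0, 0])
est = event_study(y, group, corr, e=1, c=0)  # recovers the built-in effect of 1.0
```

The corruption-heterogeneity estimator (6.10) is then just `event_study(..., c=1) - event_study(..., c=0)`, computed on data containing both corruption statuses.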
Figure 6.3 displays the results based on the TWFE specification (6.1), and those based on the event-study estimators δ̂^es_never(e), δ̂^es_ny(e), and δ̂^es_ny+(e). We report the point estimates associated with 20 treatment leads and 20 treatment lags (red line), their associated 90% pointwise confidence intervals (dark-shaded area), and 90% simultaneous confidence intervals (light-shaded area) — we do not report simultaneous confidence intervals for the TWFE specification, as these are usually not reported by practitioners who adopt such specifications.

It is important to emphasize that, in each of the panels in Figure 6.3, we have 40 different estimates, one for each considered e. Pointwise inference procedures proceed "as if" one were conducting a single hypothesis test, and report a standard confidence interval for each e. Failing to account for the fact that one is performing 40 different hypothesis tests may lead to significant treatment effects and/or pre-trends that emerge simply by chance. Simultaneous confidence intervals, on the other hand, account for this multiple-testing problem, and asymptotically cover the entire event-study curve with probability 1 − α, where α is the significance level. As such, simultaneous confidence intervals are suitable for analyzing global properties of the event-study curve, such as monotonicity and the presence of statistically nonzero effects. In practice, one simply has to replace the commonly used critical value (say, 1.645 for a 90% confidence interval) with one simulated via a bootstrap procedure akin to Algorithm B.1; see Section 4 of C&S for additional details.

The results shown in Figure 6.3 suggest that, regardless of the PTA and the estimator used, there is little to no evidence that the transition to state control decreased violation rates. Despite the similarity in terms of conclusions, we find that comparing the results from each specification highlights some interesting practical features.
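The sup-t idea behind these simultaneous bands can be sketched in a few lines: given centered bootstrap draws of the event-study curve, one takes the 1 − α quantile of the maximal absolute t-statistic across event times as the critical value. This is a generic illustration of the principle, not the exact Algorithm B.1:

```python
import numpy as np

def sup_t_critical_value(boot_draws, se, alpha=0.10):
    """boot_draws: (n_boot, n_event_times) centered bootstrap draws of the
    event-study curve; se: (n_event_times,) standard errors. Returns the
    critical value c* so that [estimate +/- c* * se] covers the entire
    curve with asymptotic probability 1 - alpha."""
    max_abs_t = np.max(np.abs(boot_draws) / se, axis=1)
    return float(np.quantile(max_abs_t, 1 - alpha))

# Toy example: 40 event-time coefficients (as in Figure 6.3) with
# independent standard-normal bootstrap draws
rng = np.random.default_rng(0)
draws = rng.standard_normal((1000, 40))
crit = sup_t_critical_value(draws, draws.std(axis=0))
# crit exceeds the pointwise 90% critical value of roughly 1.645,
# widening the bands to account for the 40 simultaneous tests
```

Because the critical value is simulated from the joint distribution of the draws, it automatically adapts to the correlation across event times; with strongly correlated coefficients it is closer to 1.645, and with nearly independent ones it is substantially larger.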
For instance, the point estimates associated with the TWFE specification (Panel (a)) and with the estimator that uses the "all not-yet-treated" states as a comparison group (Panel (d)) are close to each other, whereas using "never-treated" states as a comparison group (Panel (b)) suggests a slightly stronger long-run effect. Furthermore, when using "all not-yet-treated" states as a comparison group (Panel (d)), the (simultaneous) confidence interval is tighter, suggesting, as we discussed in Section 5, that it makes more efficient use of the available data. In terms of interpreting the pre-treatment coefficients, the pre-trend point estimates when using the "not-yet-treated" comparison group in Panel (c) are closer to zero than when one uses the "never-treated" states as a comparison group in Panel (b). It is also very noticeable that the pre-treatment trends in Panel (d) are very precisely estimated zeros. However, it is important to recall from Remark 5.1 that these pre-treatment coefficients should not be directly compared to the other pre-treatment trends, as they measure "local deviations" from zero pre-treatment trends rather than "cumulative deviations" of pre-treatment trends.

Although these estimators lead to similar conclusions, they are not (a priori) "made equal". As highlighted by S&A and discussed in Section 2.3.1, the β_e's associated with the TWFE specification (6.1) are not guaranteed to have a clear causal interpretation, even when one invokes the PTA 2.5, which, in our application, imposes 195 overidentifying restrictions on the evolution of violation rates across states.
The estimators in Panels (b), (c), and (d), on the other hand, are designed to bypass the potential pitfalls of the TWFE specification, and rely on clearly stated parallel trends assumptions (Assumptions 2.6, 2.7, and 5.1, respectively).

As discussed in Section 2.3.2, there are multiple sensible measures that one can use to summarize the overall effect of state authorization on violation rates across all treated states. For instance, one can use the "simple" average of all the ATT(g, t) with t ≥ g, ATT_simple, as defined in (2.22), or the average of the event-study-type estimands δ^es(e) over the positive values of e, δ^e,avg, as defined in (2.24). Table 6.1 shows the estimates of these parameters when one adopts Assumption 2.6 (Column (1)), Assumption 2.7 (Column (2)), or Assumption 5.1 (Column (3)). For the sake of comparison, we also report the OLS estimate of β_fe (Column (4)). Standard errors, clustered at the state level, are reported in parentheses, and 90% confidence intervals are reported in brackets. Essentially, all these summary measures indicate that state authorization has close to zero effect on violation

[Figure 6.3: Event-study analysis of violation rate: baseline results. Panels: (a) event study based on TWFE specification; (b) event study using never-treated units as comparison group; (c) event study using not-yet-treated units as comparison group; (d) event study using all not-yet-treated units as comparison group.]

Notes: Red line displays the point estimate, dark-shaded area the 90% pointwise confidence interval, and the light-shaded area the 90% simultaneous confidence band.
Panel (a) displays the ordinary least squares (OLS) estimates of the β_e associated with the two-way fixed-effects linear regression specification (6.1); Panel (b) displays the results based on (2.23) that uses (2.27) as an estimator for ATT(g, t); Panel (c) displays the results based on (2.23) that uses (2.19) as an estimator for ATT(g, t); Panel (d) displays the results based on (2.23) that uses (5.5) as an estimator for ATT(g, t). All standard errors are clustered at the state level, though the standard errors in Panel (a) are based on analytical results, whereas those in Panels (b)-(d) are based on the multiplicative bootstrap procedure discussed in Algorithm B.1 and in C&S (we use 1,000 bootstrap draws). The critical value for the simultaneous confidence bands is computed using Algorithm B.1 (which is akin to the one proposed by C&S).

Table 6.1:

Summary measures   Never-treated   Not-yet-treated   All not-yet-treated   TWFE
                   (1)             (2)               (3)                   (4)
ATT_simple         −0.017            −0.010            −0.003            —
                   (0.009)           (0.009)           (0.006)           —
                   [−0.032, 0.001]   [−0.024, 0.004]   [−0.014, 0.008]   —
δ^e,avg            −0.015            −0.008            −0.003            —
                   (0.007)           (0.006)           (0.004)           —
                   [−0.027, −0.002]  [−0.017, 0.002]   [−0.010, 0.004]   —
TWFE               —                 —                 —                 −0.003
                                                                         (0.010)
                                                                         [−0.019, 0.013]

Notes: The table reports point estimates, cluster-robust standard errors (in parentheses), and 90% confidence intervals (in brackets) for the effect of state authorization on violation rates.
ATT_simple is as defined in (2.22) and denotes the weighted average of all post-treatment ATT(g, t)'s. δ^e,avg is as defined in (2.24) and denotes the time-average of all event-study parameters δ^es(e), e > 0. TWFE refers to the ordinary least squares estimate of β_fe in the TWFE linear regression specification (2.16), which is invariant to the comparison group being used. Column (1) displays the results that use (2.27) as an estimator for ATT(g, t), Column (2) displays the results that use (2.19) as an estimator for ATT(g, t), and Column (3) displays the results that use (5.5) as an estimator for ATT(g, t). Column (4) displays the result using the TWFE regression specification. Standard errors are clustered at the state level and, with the exception of the TWFE summary measure, are computed using the multiplicative bootstrap procedure described in Algorithm B.1, which is akin to the one proposed by C&S. We use 1,000 bootstrap draws.

rates, which is in line with the findings from Grooms (2015).
Next, we analyze whether the effect of state authorization on violation rates varies depending on whether a state has a long prevalence of corruption. Panel (a) of Figure 6.4 displays the OLS estimates of the β_ce's, together with the 90% pointwise confidence intervals. All standard errors are clustered at the state level. Consistent with the findings from Grooms (2015), the results suggest that states with high levels of corruption have a lower violation rate after authorization relative to non-corrupt states, and the relative drop in the violation rate appears to increase with elapsed treatment time.

Panels (b), (c), and (d) of Figure 6.4 present the event-study estimates (6.10) based on the PTAs 6.1, 6.2, and 6.3 that allow for corruption-specific trends, whereas Panels (b), (c), and (d) of Figure 6.5 present the event-study estimates (6.10) based on the PTAs 6.4, 6.5, and 6.6 that do not allow for corruption-specific trends. For comparison purposes, Panel (a) of Figures 6.4 and 6.5 displays the OLS estimates of the β_ce's associated with the TWFE specification (6.2). Like before, all estimators are weighted by total facilities in a state, all standard errors are clustered at the state level, and we report both pointwise and simultaneous 90% confidence intervals.

The results in Figures 6.4 and 6.5 reveal the practical relevance of being explicit about the underlying PTA in a given application.
For instance, Figure 6.4 suggests that when one invokes Assumption 6.1 (Panel (b)), Assumption 6.2 (Panel (c)), or Assumption 6.3 (Panel

[Figure 6.4: Event-study analysis of violation rate: difference between corrupt and non-corrupt states, allowing different counterfactual trends between corrupt and non-corrupt states. Panels: (a) event study based on TWFE specification; (b) event study using the never-treated units as comparison group, allowing for corruption-specific trends; (c) event study using the not-yet-treated units as comparison group, allowing for corruption-specific trends; (d) event study using all not-yet-treated units as comparison group, allowing for corruption-specific trends.]

Notes: Red line displays the point estimate, dark-shaded area the 90% pointwise confidence interval, and the light-shaded area the 90% simultaneous confidence band. Panel (a) displays the results based on the OLS estimates of the β_ce's in the TWFE specification (6.2); Panel (b) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.1; Panel (c) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.2; Panel (d) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.3. All standard errors are clustered at the state level, though the standard errors in Panel (a) are based on analytical results, whereas those in Panels (b)-(d) are based on the multiplicative bootstrap procedure discussed in Algorithm B.1, which is similar to the C&S proposal (we use 1,000 bootstrap draws).
The critical value for the simultaneous confidence bands is computed using Algorithm B.1.

[Figure 6.5: Event-study analysis of violation rate: difference between corrupt and non-corrupt states, not allowing different counterfactual trends between corrupt and non-corrupt states. Panels: (a) event study based on TWFE specification; (b) event study using the never-treated units as comparison group, not allowing for corruption-specific trends; (c) event study using the not-yet-treated units as comparison group, not allowing for corruption-specific trends; (d) event study using all not-yet-treated units as comparison group, not allowing for corruption-specific trends.]

Notes: Red line displays the point estimate, dark-shaded area the 90% pointwise confidence interval, and the light-shaded area the 90% simultaneous confidence band. Panel (a) displays the results based on the OLS estimates of the β_ce's in the TWFE specification (6.2); Panel (b) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.4; Panel (c) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.5; Panel (d) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.6. All standard errors are clustered at the state level, though the standard errors in Panel (a) are based on analytical results, whereas those in Panels (b)-(d) are based on the multiplicative bootstrap procedure discussed in Algorithm B.1, which is similar to the C&S proposal (we use 1,000 bootstrap draws). The critical value for the simultaneous confidence bands is computed using Algorithm B.1.

β_cfe associated with the TWFE specification, shown in (6.13).

Table 6.2: Effect of authorization on violation rate: corrupt vs. not corrupt states.
                   Allowing for corruption-specific trends                    Not allowing for corruption-specific trends
Summary measures   Never-treated   Not-yet-treated   All not-yet-treated  |  Never-treated   Not-yet-treated   All not-yet-treated  |  TWFE
                   (1)             (2)               (3)                  |  (4)             (5)               (6)                  |  (7)
ATT_simple,1−0     −0.007            −0.008            −0.014            −0.035             −0.035             −0.033             —
                   (0.014)           (0.014)           (0.013)           (0.012)            (0.012)            (0.010)            —
                   [−0.030, 0.017]   [−0.031, 0.016]   [−0.036, 0.008]   [−0.054, −0.016]   [−0.054, −0.015]   [−0.049, −0.017]   —
δ^e,avg,1−0        −0.001            −0.002            −0.009            −0.024             −0.025             −0.028             —
                   (0.014)           (0.016)           (0.014)           (0.013)            (0.012)            (0.011)            —
                   [−0.024, 0.022]   [−0.028, 0.024]   [−0.031, 0.013]   [−0.045, −0.003]   [−0.045, −0.005]   [−0.047, −0.009]   —
TWFE               —                 —                 —                 —                  —                  —                  −0.037
                                                                                                                                  (0.010)
                                                                                                                                  [−0.054, −0.020]

Notes: The table reports point estimates, cluster-robust standard errors (in parentheses), and 90% confidence intervals (in brackets) for the effect of state authorization on violation rates.
ATT_simple,1−0 is as defined in (6.11) and denotes the difference in the weighted average of all post-treatment ATT(g, t; c)'s between corrupt and non-corrupt states. δ^e,avg,1−0 is as defined in (6.12) and denotes the difference in the time-average of all event-study parameters δ^es(e; c), e > 0, between corrupt and non-corrupt states. TWFE refers to the ordinary least squares estimate of β_cfe in the TWFE linear regression specification (6.13), which is invariant to the comparison group being used. Columns (1)-(6) display the results that rely on the PTAs 6.1-6.6, respectively. Standard errors are clustered at the state level and, with the exception of the TWFE summary measure, are computed using the multiplicative bootstrap procedure presented in Algorithm B.1, which is akin to the C&S proposal. We use 1,000 bootstrap draws.

The results in Table 6.2 reinforce the message from Figures 6.4 and 6.5: when one allows for corruption-specific trends and relies on PTAs 6.1, 6.2, or 6.3 (Columns (1), (2), and (3), respectively), one finds essentially no evidence that the effect of state authorization on violation rates varies by state corruption. On the other hand, when one relies on the "stronger" PTAs 6.4, 6.5, or 6.6 (Columns (4), (5), and (6), respectively), one finds evidence that corrupt states experienced a large decrease in violation rates after state authorization relative to non-corrupt states. This latter result is in agreement with the TWFE specification, whereas the former is not.
Conclusion
In this paper, we have highlighted the important role played by the parallel trends assumption in event-study settings in terms of identification, estimation, and summary of different treatment effect parameters. We first showed that, when there is variation in treatment timing, researchers may adopt different types of parallel trends assumptions and identify/estimate different treatment effect parameters. Next, we discussed the practical implications of adopting different parallel trends assumptions, and discussed how one constructs estimators that make use of all the restrictions implied by the underlying PTA. Here, we documented an interesting "robustness" vs. "efficiency" trade-off in terms of the strength of the underlying PTA, and argued that one should take this into consideration whenever employing a DID-type analysis. Importantly, we advocate that one should always attempt to be explicit about the parallel trends assumption invoked in the study, as this usually translates into a more transparent and objective analysis. We showed how one can form semiparametrically efficient DID estimators by fully exploiting all the empirical content of the underlying PTA via the traditional GMM approach. We also proposed an alternative, simpler-to-use DID estimator that does not restrict pre-treatment trends when one wants to use "not-yet-treated" units as a comparison group, and, at the same time, makes use of more groups than other available DID estimators. Finally, we illustrated the practical importance of being explicit about the PTA via an empirical application about the effect of the transition from federal to state management of the Clean Water Act on compliance rates. Our results suggest that the conclusion that corrupt states see a decline in the violation rate after program authorization relative to non-corrupt treated states depends on the type of PTA adopted.
References
Abadie, Alberto, "Semiparametric Difference-in-Differences Estimators," Review of Economic Studies, 2005.

Anderson, Michael L., "Multiple inference and gender differences in the effects of early intervention: A reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects," Journal of the American Statistical Association, 2008, (484).

Angrist, Joshua D. and Jörn-Steffen Pischke, Mostly Harmless Econometrics: An Empiricist's Companion, Princeton, NJ: Princeton University Press, 2009.

Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager, "Synthetic Difference in Differences," arXiv preprint arXiv:1812.09970, 2018.

Athey, Susan and Guido W. Imbens, "Design-based Analysis in Difference-in-Differences Settings with Staggered Adoption," arXiv preprint arXiv:1808.05293, 2018.

Borusyak, Kirill and Xavier Jaravel, "Revisiting Event Study Designs," Unpublished Manuscript, Department of Economics, Harvard University, 2017.

Callaway, Brantly and Pedro H. C. Sant'Anna, "Difference-in-Differences with Multiple Time Periods," arXiv preprint arXiv:1803.09015, 2020.

Chernozhukov, Victor, Iván Fernández-Val, Jinyong Hahn, and Whitney Newey, "Average and Quantile Effects in Nonseparable Panel Models," Econometrica, 2013, (2).

Cunningham, Scott, Causal Inference: The Mixtape, v.1.7.

de Chaisemartin, Clément and Xavier D'Haultfœuille, "Two-way Fixed Effects Estimators with Heterogeneous Treatment Effects," American Economic Review, 2020, (9).

Ferman, Bruno and Cristine Pinto, "Inference in Differences-in-Differences with Few Treated Groups and Heteroskedasticity," The Review of Economics and Statistics, 2019, (3).

Gibbons, Charles E., Juan Carlos Suárez Serrato, and Michael B. Urbancic, "Broken or Fixed Effects?," Journal of Econometric Methods, 2018, (1).

Goodman-Bacon, Andrew, "Difference-in-Differences with Variation in Treatment Timing," NBER Working Paper No. 25018, 2019.

Grooms, Katherine K., "Enforcing the Clean Water Act: The effect of state-level corruption on compliance," Journal of Environmental Economics and Management, 2015.

Han, Sukjin, "Identification in Nonparametric Models for Dynamic Treatment Effects," Journal of Econometrics, 2019, Forthcoming.

Hansen, Lars Peter, "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 1982, (4).

Heckman, James J., Hidehiko Ichimura, Jeffrey Smith, and Petra Todd, "Characterizing Selection Bias using Experimental Data," Econometrica, 1998, (5).

Laporte, Audrey and Frank Windmeijer, "Estimation of panel data models with binary indicators when treatment effects are not constant over time," Economics Letters, 2005, (3).

Lechner, Michael, "The Estimation of Causal Effects by Difference-in-Difference Methods," Foundations and Trends in Econometrics, 2010, (3).

Rambachan, Ashesh and Jonathan Roth, "An Honest Approach to Parallel Trends," Working Paper, Department of Economics, Harvard University, 2019.

Romano, Joseph P. and Michael Wolf, "Stepwise multiple testing as formalized data snooping," Econometrica, 2005, (4).

Romano, Joseph P., Azeem M. Shaikh, and Michael Wolf, "Hypothesis Testing in Econometrics," Annual Review of Economics, 2010, (1).

Roth, Jonathan, "Pre-test with Caution: Event-study Estimates After Testing for Parallel Trends," Working Paper, Department of Economics, Harvard University, 2020.

Rubin, Donald B., "The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials," Statistics in Medicine, 2007, (1).

Rubin, Donald B., "For objective causal inference, design trumps analysis," Annals of Applied Statistics, 2008, (3).

Sant'Anna, Pedro H. C. and Jun B. Zhao, "Doubly Robust Difference-in-Differences Estimators," Journal of Econometrics, 2020, Forthcoming.

Sun, Liyang and Sarah Abraham, "Estimating Dynamic Treatment Effects in Event Studies With Heterogeneous Treatment Effects," Working Paper, Department of Economics, MIT, 2020.

Wooldridge, Jeffrey M., "Fixed-Effects and Related Estimators for Correlated Random-Coefficient and Treatment-Effect Panel Data Models," Review of Economics and Statistics, 2005, 87.