The role of parallel trends in event study settings: An application to environmental economics
Michelle Marcus (Vanderbilt University)
Pedro H. C. Sant’Anna (Vanderbilt University)
September 3, 2020
Abstract

Difference-in-Differences (DID) research designs usually rely on variation in treatment timing such that, after making an appropriate parallel trends assumption, one can identify, estimate, and make inference about causal effects. In practice, however, different DID procedures rely on different parallel trends assumptions (PTAs) and recover different causal parameters. In this paper, we focus on staggered DID (also referred to as event studies) and discuss the role played by the PTA in the identification and estimation of causal parameters. We document a “robustness” vs. “efficiency” trade-off in terms of the strength of the underlying PTA, and argue that practitioners should be explicit about these trade-offs whenever using DID procedures. We propose new DID estimators that reflect these trade-offs and derive their large sample properties. We illustrate the practical relevance of these results by assessing whether the transition from federal to state management of the Clean Water Act affects compliance rates.

∗ First version: January 10, 2020. We thank Brantly Callaway, Jonathan Roth, Julia Schmieder, the Editor, Daniel Millimet, and two anonymous referees for comments and suggestions.

1 Introduction
Researchers and policy makers are often interested in evaluating the causal effect of a given treatment/intervention on an outcome of interest. When data from randomized control trials related to the causal question of interest are not available, researchers often rely on “natural experiments” and make use of difference-in-differences (DID) methods to estimate the effect of a given policy. The canonical DID method presumes the existence of two groups, the treated and the comparison group, and two time periods, the pre-treatment and post-treatment periods, such that the comparison group is not treated in either time period, and the treated group is only treated in the post-treatment period. Then, one estimates the average treatment effect among the treated units by comparing the average difference in pre- and post-treatment outcomes of the two groups, or, equivalently, by using a two-way fixed effects regression model with a group and a time fixed effect; see, e.g., Section 2 of Lechner (2010) for details about the history of DID procedures.

It is worth stressing that the causal interpretation of the two groups, two time periods (henceforth 2 × 2) DID procedure relies on a so-called parallel trends assumption (PTA): in the absence of the treatment, the average outcome for the treated and comparison groups would have evolved in parallel. Such an assumption is well understood; see, e.g., Chapter 5 of Angrist and Pischke (2009), Chapter 10 of Cunningham (2018), and Section 2 of Sant’Anna and Zhao (2020). Importantly, it restricts the average counterfactual outcome for the treated units at the post-treatment period had they not been subject to the treatment, but it does not directly impose restrictions on the outcome in pre-treatment periods.
In addition, it is worth mentioning that the PTA is untestable in this 2 × 2 setup; see, e.g., Chapter 10 of Cunningham (2018) and Section 4 of Callaway and Sant’Anna (2020).

Although most of the aforementioned points are well understood in the 2 × 2 setup, in many DID applications there are more than two time periods, and units can be treated at different points in time, which leads to multiple treatment groups as well. This many periods, many groups DID setup is substantially more challenging than the canonical 2 × 2 one. For instance, Sun and Abraham (2020) (henceforth S&A), Callaway and Sant’Anna (2020) (henceforth C&S), de Chaisemartin and D’Haultfœuille (2020) (henceforth dC&D), and Goodman-Bacon (2019) study DID procedures with multiple periods and multiple groups, and each of these papers relies on different types of parallel trends assumptions and/or proposes different estimators for different causal parameters of interest. This is in sharp contrast with the 2 × 2 setup, where there is only one type of PTA. (In the 2 × 2 DID setup, the only variation in PTA one observes is whether it holds unconditionally, or only after conditioning on a vector of observed characteristics; see, e.g., Heckman et al. (1998), Abadie (2005), and Sant’Anna and Zhao (2020). This is not the type of variation of the PTA we are referring to.)

We exclusively focus on DID settings with staggered adoption designs and binary treatments. By doing so, we can compare the PTAs and parameters of interest discussed in S&A, C&S, and dC&D in a more direct manner. We show that the PTA invoked by S&A and dC&D (i) not only restricts counterfactual trends after the treatment, but also imposes parallel pre-treatment trends, and (ii) imposes that every individual group that is not yet treated by time t can be used as a valid comparison group, at time t, for those earlier-treated units.
C&S, on the other hand, consider two different PTAs: one that relies on using “never-treated” units as a comparison group, and one that uses not-yet-treated units as valid comparison groups for the earlier-treated units. Interestingly, both PTAs considered by C&S are, at least technically speaking, weaker than the PTA invoked by S&A and dC&D, as they either do not restrict pre-treatment trends, or, when they do, these restrictions are potentially less demanding. Although these PTAs differ in their “strength”, we show that they can all be used to recover the same variety of average treatment effect measures. (See also Athey and Imbens (2018), Goodman-Bacon (2019), Arkhangelsky et al. (2018), Borusyak and Jaravel (2017), Ferman and Pinto (2019), and Rambachan and Roth (2019) for other recent contributions to the DID literature.)

Overall, we argue that, in practice, one should be explicit about the type of PTA invoked in the DID analysis. On top of adding transparency and objectivity to the analysis, see, e.g., Rubin (2007, 2008), we stress that the choice of the parallel trends assumption can also help in selecting the most appropriate estimator for a given parameter of interest. For instance, in situations where one is comfortable with a “stronger” PTA, we show that one can exploit overidentification, and then use the generalized method of moments (GMM) framework to form more efficient treatment effect estimators than those currently available in the literature; see Proposition 4.1. Another consequence of adopting the GMM framework is that it is relatively straightforward to test the credibility of a “stronger” parallel trends assumption by conducting a classical Hansen-Sargan J-test.
To the best of our knowledge, this paper is the first to make this simple but important observation.

In many other situations, however, we expect that researchers will not be a priori comfortable with a “stronger” version of the PTA, as it may impose more restrictions on the data than those strictly required for identification of treatment effect parameters. Indeed, when the number of groups and time periods is moderate, the number of restrictions implied by the “stronger” PTA can be close to the number of observations available in the data. In such cases, it may be reasonable to favor “weaker” versions of the PTA. When a sufficiently large “never-treated” group is available, researchers can use the easy-to-implement nonparametric DID estimators based on sample means proposed by C&S. When an appropriate “never-treated” group is unavailable, we show that one can rely on an alternative “weaker” PTA and use a simple plug-in DID estimator that differs from the ones considered by S&A, C&S, and dC&D. We show that this new DID estimator is consistent and asymptotically normal, and we also describe a bootstrap procedure to conduct inference that is robust against multiple-testing problems. Interestingly, this new DID estimator does not rely on restricting pre-treatment trends and, at the same time, exploits data from all available groups in the given application. On the other hand, both this newly proposed DID estimator and the one proposed by C&S are, in general, less efficient than the GMM estimator, which relies on more stringent assumptions. To the best of our knowledge, we are the first to document this “robustness” versus “efficiency” trade-off in terms of the strength of the underlying PTA invoked in DID setups.

We illustrate the practical relevance of the aforementioned observations by revisiting Grooms (2015). We examine the effect of the transition from federal to state management of the Clean Water Act (CWA) on violation rates.
Similarly to Grooms (2015), we find that the transition from federal to state control has little to no effect on violation rates — this result is robust across different parallel trends assumptions and different causal parameters of interest.

Next, like Grooms (2015), we also analyze whether states with a long prevalence of corruption see a larger decrease in the violation rate after authorization relative to states without corruption. Grooms (2015) uses a dynamic TWFE (event-study) linear regression model, and finds strong evidence that violation rates decreased more in more corrupt states than in less corrupt states after the transition to state control. However, given that Grooms (2015) focuses exclusively on TWFE-type estimators, it is not clear what kind of PTA is actually being made in the analysis. Here, we show how it can be beneficial to separate the analysis into two steps: (i) identification and the relevance of the PTA, and (ii) data analysis and estimation procedures. By proceeding in this manner, we find that the conclusion that violation rates dropped more in more corrupt states than in less corrupt states depends on the type of PTA imposed. For instance, when one assumes that, in the absence of treatment, the counterfactual outcome trends differ depending on whether a state has a long prevalence of corruption or not (“corruption-specific trends”), we find essentially no evidence that the treatment effects vary depending on whether a state is more or less corrupt. On the other hand, if one assumes an alternative PTA such that one can use averages of both corrupt and non-corrupt states as valid comparison groups, we find evidence that more corrupt states see a larger decrease in the violation rate after authorization than less corrupt states, just like the original findings of Grooms (2015). As “corruption” is not randomly assigned, we believe that allowing for corruption-specific trends is the most natural identification setup in this context.
These conflicting findings highlight the importance of explicitly stating the underlying PTA invoked in the exercise.

The rest of this paper is organized as follows. In Section 2, we present the general framework, compare the different PTAs using a stylized example, and describe the different parameters of interest considered by S&A, C&S, and dC&D. In Section 3, we discuss the testability of the PTAs and the practical considerations a researcher might take into account when choosing a PTA and a DID estimator. Section 4 describes how one can use the generalized method of moments (GMM) framework to form more efficient treatment effect estimators when the chosen PTA leads to overidentification. Section 5 presents a new easy-to-compute DID estimator based on an alternative “weaker” PTA than what has been previously seen in the literature. Finally, Section 6 presents the empirical application, and we conclude in Section 7. Proofs and additional results are available in the Web Appendix, at https://pedrohcgs.github.io/files/Marcus_SantAnna_2020_webAppendix.pdf.
We first introduce the notation we use throughout the paper, which resembles that adopted by C&S. We consider the case with T periods and denote a particular time period by t, where t = 1, . . . , T. In the canonical DID setup, T = 2 and no one is treated in period 1. Let D_t be a binary variable equal to one if a unit is treated in period t and equal to zero otherwise. Also, define G_g to be a dummy variable that is equal to one if a unit is first treated in period g, and define C as a dummy variable that is equal to one for units that are not treated in any period t = 1, . . . , T. For each unit, exactly one of the G_g or C is equal to one. Finally, let Y_t(1) and Y_t(0) be the potential outcomes at time t with and without treatment, respectively. The observed outcome in each period can be expressed as

Y_t = D_t Y_t(1) + (1 − D_t) Y_t(0).

Henceforth, we refer to “groups” as the groups associated with the time a unit is first treated. Throughout the paper, we maintain the following assumptions.
Assumption 2.1 (Sampling). {Y_{i1}, Y_{i2}, . . . , Y_{iT}, D_{i1}, D_{i2}, . . . , D_{iT}}_{i=1}^{n} is independent and identically distributed (iid).

Assumption 2.2 (Staggered treatment design). For t = 2, . . . , T, D_{t−1} = 1 implies that D_t = 1.

Assumption 2.3 (No anticipation). For all t = 1, . . . , T and g = 2, . . . , T such that t < g, E[Y_{it} | G_g = 1] = E[Y_{it}(0) | G_g = 1].

Assumption 2.4 (Overlap). P(G_1 = 1) = 0 and, for some ε > 0 and all g = 2, . . . , T, P(G_g = 1) > ε.

Assumption 2.1 implies that we are considering the case of panel data. The discussion related to the case where only repeated cross-section data are available follows similar arguments and is omitted to avoid repetition. Assumption 2.1 does not restrict the temporal dependence across outcomes, though it relies on “large n, fixed T” panel data. Assumption 2.1 also rules out covariates; we only impose this simplification to allow for a more direct comparison between the proposals of S&A, C&S, and dC&D; we refer the reader to C&S for a detailed discussion about flexibly accommodating covariates into the DID analysis.

Assumption 2.2 imposes that treatment is “irreversible”, i.e., once a unit is treated at time t − 1, it is “forever” treated. This assumption is usually referred to as staggered treatment adoption in the DID literature. We interpret this assumption as if units that experience treatment are forever affected by this experience, and do not “forget” about it. We emphasize that, by imposing Assumption 2.2, we are able to directly compare the DID contributions of S&A, C&S, and dC&D.

Assumption 2.3 implies that there is no anticipatory response to treatment for those units that are eventually treated. This assumption is standard in the DID literature. When treatment can “turn on” and later “turn off”, one usually needs to augment the potential outcome notation to analyze the effect of a given treatment path; see, e.g., Han (2019).
When one is interested only in an average of the instantaneous effects of the policy among all units that switch treatments, one can bypass some of these complications by imposing a “no carryover assumption”; see, e.g., dC&D.

Assumption 2.4 imposes that no unit is treated in the first time period, and that a new set of units is treated in each time period with a strictly positive probability. If there is an “always treated” group, i.e., units that are already treated in the first time period, we drop those observations because neither the data nor parallel trends assumptions for Y_t(0) provide information to identify the average treatment effect for this group. We assume that new sets of units are treated in each time period only for notational convenience. Also, note that Assumption 2.4 accommodates, but does not require, that there is a “never-treated” group available.

Next, we revisit S&A, C&S, and dC&D, paying particular attention to their PTAs, underlying parameters of interest, and how one can estimate these parameters. When presenting these results, we impose Assumptions 2.1-2.4, which may result in slight changes of notation when compared to their original statements. In terms of notation, we follow C&S and, whenever possible, attempt to express the different parameters of interest in terms of functionals of “group-time average treatment effects”, i.e., the average treatment effect at time t for those units first treated at time g,

ATT(g, t) ≡ E[Y_{it}(1) | G_g = 1] − E[Y_{it}(0) | G_g = 1] = α_{g,t}(1) − α_{g,t}(0).    (2.1)

To convey the discussion in an easy-to-understand manner, we consider a simple, stylized example. Assume that we observe Y_{it} for a sample of units i = 1, . . . , n in four time periods, t = 1, 2, 3, 4. Some units are first treated at time 3 (G_{i3} = 1), others at time 4 (G_{i4} = 1), and the remaining units are not treated in the entire observation window (C_i = 1).
Once a unit i is treated at time g, it remains treated for all time periods t ≥ g. Let W = (Y_1, Y_2, Y_3, Y_4, G_3, G_4, C)′, and assume that we observe a random sample {W_i}_{i=1}^{n} of W.

When the researcher is worried about anticipatory effects, this can be circumvented by simply redefining g to denote the period in which anticipatory effects begin. However, this may require strengthening other assumptions; see C&S for a discussion.

In this paper, we attempt to only make parallel trends assumptions about the evolution of Y_t(0), and remain agnostic about the trends for Y_t(1). When one is willing to impose parallel trends for Y_t(1), too, then we can leverage the existence of an “always treated” group to form alternative parameters of interest, though such an assumption further restricts treatment effect heterogeneity. We leave a detailed discussion about this case for future research.

S&A refers to
ATT(g, t) as a cohort-specific average treatment effect on the treated, though they express it in terms of event time t − g, i.e., the time elapsed since treatment started.

2.2 The different parallel trends assumptions

In this subsection, we present the three different parallel trends assumptions considered by S&A, C&S, and dC&D. We start by describing each PTA conceptually and then make use of the stylized example to highlight the key differences between these assumptions. We first present the PTA invoked by S&A and dC&D (see Assumption 1 in S&A and Assumption 5 in dC&D).
Assumption 2.5 (Parallel trends assumption across all time periods and all groups). For all t = 2, . . . , T and all g = 2, . . . , T,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1] = E[Y_t(0) − Y_{t−1}(0) | C = 1] = E[Y_t(0) − Y_{t−1}(0)],

where the first equality holds only when there exists a “never-treated” group.

Assumption 2.5 states that, in the absence of treatment, the expectation of the outcome of interest follows the same path in all groups and in all time periods available in the data. Although fairly intuitive, such an assumption imposes important restrictions on the data (when combined with Assumptions 2.1-2.4). In particular, Assumption 2.5 imposes a parallel pre-trends condition across all treatment groups and, as a consequence, allows one to use any individual group that has not yet been treated by time t (units with G_s = 1, s > t) as a valid comparison group for those units already treated by time t.

To visualize these restrictions, let us consider our stylized example. Under Assumptions 2.1-2.4, the PTA 2.5 can be written as the following seven moment conditions:

α_{3,3}(0) = E[Y_3 − Y_2 | C = 1] + E[Y_2 | G_3 = 1],    (2.2)
α_{3,3}(0) = E[Y_3 − Y_2 | G_4 = 1] + E[Y_2 | G_3 = 1],    (2.3)
α_{3,4}(0) = E[Y_4 − Y_3 | C = 1] + α_{3,3}(0),    (2.4)
α_{4,4}(0) = E[Y_4 − Y_3 | C = 1] + E[Y_3 | G_4 = 1],    (2.5)
E[Y_2 − Y_1 | G_3 = 1] = E[Y_2 − Y_1 | C = 1],    (2.6)
E[Y_2 − Y_1 | G_3 = 1] = E[Y_2 − Y_1 | G_4 = 1],    (2.7)
E[Y_3 − Y_2 | G_4 = 1] = E[Y_3 − Y_2 | C = 1].    (2.8)

These moment conditions make explicit all of the restrictions that the PTA 2.5 imposes on the data. First, the moment conditions (2.2) and (2.3) formalize the notion that the evolution of the outcome for the “never-treated” and “late-treated” units can be used to identify α_{3,3}(0), which, in turn, would allow one to identify ATT(3, 3). Recall that, for t ≥ g, ATT(g, t) = α_{g,t}(1) − α_{g,t}(0) = E[Y_t | G_g = 1] − E[Y_t(0) | G_g = 1].
Thus, given that E[Y_t | G_g = 1] is estimable from the data, one only needs to identify α_{g,t}(0) in order to recover ATT(g, t) from the data. An empirically important implication is that any linear combination of E[Y_3 − Y_2 | C = 1] and E[Y_3 − Y_2 | G_4 = 1] can be used to impute α_{3,3}(0). Given that the “never-treated” units are the only units that have not yet experienced treatment at time t = 4, they form the only group that can be used to recover α_{3,4}(0) and α_{4,4}(0) — this notion is formalized by the moment restrictions (2.4) and (2.5). Finally, (2.6)-(2.8) impose a parallel “pre-trends” condition, i.e., that the evolution of the outcome before treatment occurs is the same across all groups. Note that the moment condition (2.8) is a linear combination of the moment conditions (2.2) and (2.3), so (2.8) is redundant in the aforementioned system of equations. Nonetheless, this observation allows one to conclude that, by assuming that both never-treated units and the units that are not yet treated at time t = 3 can be used as valid comparison groups for the units first treated at time t = 3, one is imposing the parallel pre-trends condition (2.8). Of course, the reverse argument is also true, highlighting that parallel pre-trends across groups may have important identification content; we further discuss this in Section 3.

Given that we now have a better understanding of the PTA invoked by S&A and dC&D, we turn our attention to the PTAs invoked by C&S. In fact, C&S consider two different PTAs depending on whether a “never-treated” group is available or not (see Assumptions 4 and 5 in C&S).

Assumption 2.6 (Parallel trends assumption based on “never-treated” units). For all g, t = 2, . . . , T such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1] = E[Y_t(0) − Y_{t−1}(0) | C = 1].

Assumption 2.7 (Parallel trends assumption based on “not-yet-treated” units). For all g, s, t = 2, . . . , T such that t ≥ g and s ≥ t,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1] = E[Y_t(0) − Y_{t−1}(0) | D_s = 0].
The difference between the parallel trends assumptions 2.6 and 2.7 is that the former uses the “never-treated” units as a fixed comparison group, whereas the latter allows one to use averages of different groups of units that are not yet treated by time t as a comparison group. At first sight, it may not be clear whether the PTAs 2.6 and 2.7 also restrict pre-trends as the PTA 2.5 does.

In order to compare these PTAs, it is illustrative to focus our attention on the stylized example, where we again pre-impose Assumptions 2.1-2.4. In this context, the PTA 2.6 imposes the following three moment restrictions:

α_{3,3}(0) = E[Y_3 − Y_2 | C = 1] + E[Y_2 | G_3 = 1],    (2.9)
α_{3,4}(0) = E[Y_4 − Y_3 | C = 1] + α_{3,3}(0),    (2.10)
α_{4,4}(0) = E[Y_4 − Y_3 | C = 1] + E[Y_3 | G_4 = 1].    (2.11)

As is evident from (2.9)-(2.11), the PTA 2.6 does not restrict pre-trends across groups, and does not presume that “late-treated” units can be used as a valid comparison group for “early-treated” units. Although the moment conditions (2.9), (2.10), and (2.11) are respectively the same as (2.2), (2.4), and (2.5), the PTA 2.6 does not impose the moment restrictions (2.3), (2.6), and (2.7) imposed by the PTA 2.5. Therefore, one can reasonably argue that the PTA 2.6 is “weaker” than the PTA 2.5.

Next, we describe the PTA 2.7, which, in the context of our stylized example, imposes the following moment restrictions:

α_{3,3}(0) = E[Y_3 − Y_2 | D_3 = 0] + E[Y_2 | G_3 = 1],    (2.12)
α_{3,3}(0) = E[Y_3 − Y_2 | D_4 = 0] + E[Y_2 | G_3 = 1],    (2.13)
α_{3,4}(0) = E[Y_4 − Y_3 | D_4 = 0] + α_{3,3}(0),    (2.14)
α_{4,4}(0) = E[Y_4 − Y_3 | D_4 = 0] + E[Y_3 | G_4 = 1],    (2.15)

where D_{it} = 1 if a unit i is treated by time t, and D_{it} = 0 otherwise. In the context of our stylized example, D_4 = 0 if and only if C = 1, implying that (2.13)-(2.15) are equivalent to (2.9)-(2.11), respectively.
Thus, from this simple observation, we can conclude that the PTA 2.7 is “stronger” than the PTA 2.6, as the latter does not involve the moment restriction (2.12).

To compare the PTA 2.7 with the PTA 2.5, we need to understand the implications of adding the moment restriction (2.12) to the other moment restrictions implied by the PTA 2.6. Note that when we combine (2.12) with (2.13), we have that E[Y_3 − Y_2 | D_3 = 0] = E[Y_3 − Y_2 | D_4 = 0], which in our example is the same as E[Y_3 − Y_2 | D_3 = 0] = E[Y_3 − Y_2 | C = 1]. Given that D_3 = 0 if and only if either G_4 = 1 or C = 1, it follows that

E[Y_3 − Y_2 | D_3 = 0] = E[Y_3 − Y_2 | C = 1] ⟺ E[Y_3 − Y_2 | G_4 = 1] = E[Y_3 − Y_2 | C = 1].

Thus, by exploiting this simple but subtle observation, we can conclude that the moment restrictions implied by the PTA 2.7, (2.12)-(2.15), are equivalent to (2.2)-(2.5), a subset of the moment restrictions implied by the PTA 2.5. Importantly, and in contrast with the PTA 2.6, the PTA 2.7 does rule out non-parallel pre-trends for some groups and pre-treatment periods, though, technically, it is still weaker than the PTA 2.5, as the latter completely rules out any type of non-parallel pre-trends. (The PTA 2.7 does not restrict pre-trends involving time periods before the first unit is treated, and does not restrict pre-trends for the earliest treatment group.)
In summary, from the discussion presented above, one can conclude that the PTA 2.6 does not restrict pre-trends and is weaker than the PTAs 2.5 and 2.7, though it requires the existence of a “never-treated” group. In addition, the PTA 2.7 is arguably weaker than the PTA 2.5, as the latter restricts all pre-trends in all pre-treatment periods, while the former does not restrict pre-trends involving time periods before the first unit is treated. This can be practically relevant in applications where data are available on many time periods before the first group of units is treated.
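To make the comparison concrete, the stylized example can be simulated. The sketch below is our own illustration (the simulated design, group shares, and trend gap are assumptions, not taken from the paper): the late-treated group G_4 follows a steeper untreated trend, so the PTA 2.5 fails while the PTA 2.6 still holds for the comparison of G_3 with the never-treated units. Imputing α_{3,3}(0) via the never-treated moment (2.2) then recovers ATT(3, 3), whereas the G_4-based moment (2.3) does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# Stylized design: first treated at t = 3 (G3), at t = 4 (G4), or never (coded 0).
group = rng.choice([3, 4, 0], size=n, p=[0.3, 0.3, 0.4])
G3, G4, C = group == 3, group == 4, group == 0

t = np.arange(1, 5)                               # t = 1, 2, 3, 4
Y0 = rng.normal(size=(n, 4)) + 0.5 * t            # common untreated trend
Y0[G4] += 0.3 * t                                 # extra trend for G4: violates PTA 2.5
D = (t >= group[:, None]) & (group[:, None] > 0)  # staggered, irreversible treatment
Y = Y0 + 1.0 * D                                  # true ATT(g, t) = 1 everywhere

# Impute alpha_{3,3}(0) with the never-treated comparison, as in (2.2) ...
att_33_never = (Y[G3, 2] - Y[G3, 1]).mean() - (Y[C, 2] - Y[C, 1]).mean()
# ... and with the late-treated comparison, as in (2.3), which requires PTA 2.5.
att_33_g4 = (Y[G3, 2] - Y[G3, 1]).mean() - (Y[G4, 2] - Y[G4, 1]).mean()

print(att_33_never)   # close to the true ATT(3, 3) = 1.0
print(att_33_g4)      # biased by the trend gap: close to 1.0 - 0.3 = 0.7
```

With 200,000 simulated units, the never-treated comparison is close to the true effect of 1.0, while the G_4-based comparison is biased toward 0.7, the true effect minus the 0.3 trend gap.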
2.3 The parameters of interest

In this subsection, we discuss the different parameters of interest that may arise when one deviates from the canonical 2 × 2 DID setting. Before presenting the parameters of interest considered by S&A, C&S, and dC&D, it is worth stressing the potential pitfalls associated with the commonly used TWFE regression specifications.
As Borusyak and Jaravel (2017), dC&D, and Goodman-Bacon (2019) point out, one of the most popular specifications in this many periods, many groups DID setting is the following TWFE regression specification,

Y_{it} = α_g + α_t + β^{fe} D_{it} + u_{it},    (2.16)

where α_g and α_t are group and time fixed effects, respectively, and u_{it} is an idiosyncratic error term. Although practitioners often consider β^{fe} to be a main parameter of interest, the aforementioned papers show that, when treatment effects are allowed to be heterogeneous across groups and time periods, β^{fe} can only be interpreted as a weighted average of treatment effects, and, perhaps even more problematic, some of these weights can be negative; see also Laporte and Windmeijer (2005), Wooldridge (2005), Chernozhukov et al. (2013), and Gibbons et al. (2018) for earlier related results based on (one-way) fixed-effects estimators. As such, interpreting estimates of β^{fe} as sensible causal summary parameters can lead to misleading conclusions about the policy's effectiveness.

Moreover, the negative (and non-intuitive) weighting problem is not specific to (2.16). dC&D show that it also applies to the first-difference specification. In addition, S&A show that the non-convex weighting problem extends to many variations of the dynamic TWFE (event-study) regression specification,

Y_{it} = α_i + α_t + Σ_{e=−K}^{−1} β_e 1{t − G_i + 1 = e} + Σ_{e=1}^{L} β_e 1{t − G_i + 1 = e} + v_{it},    (2.17)

where G_i is the time a unit i is first treated (equal to infinity if unit i is “never treated”), t − G_i + 1 is the “event time”, i.e., the number of time periods a unit has been treated, and 1{t − G_i + 1 = e} is an indicator for unit i having been treated for e time periods. (All the results for (2.16) remain the same if one replaces α_g with α_i, a unit-specific fixed effect. We prefer to include α_g as it closely resembles the canonical DID regression specification.)
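The negative-weighting problem can be made concrete with a deterministic toy example of our own construction (not from the paper): two units, three periods, untreated outcomes identically zero so that parallel trends holds exactly, and every treatment effect strictly positive, yet the TWFE coefficient β^{fe} from (2.16) (here with unit fixed effects) is exactly zero:

```python
import numpy as np

# Two units: E first treated at t = 2, L first treated at t = 3; T = 3.
# Untreated outcomes are identically zero, so parallel trends holds exactly.
# Treatment effects are all positive: ATT(E, 2) = 1, ATT(E, 3) = 3, ATT(L, 3) = 1.
Y = np.array([0., 1., 3.,     # unit E at t = 1, 2, 3
              0., 0., 1.])    # unit L at t = 1, 2, 3
D = np.array([0., 1., 1.,
              0., 0., 1.])
unit = np.repeat([0, 1], 3)
time = np.tile([0, 1, 2], 2)

# TWFE design matrix: intercept, unit dummy, two time dummies, treatment dummy.
X = np.column_stack([np.ones(6), unit == 1, time == 1, time == 2, D]).astype(float)
beta_fe = np.linalg.lstsq(X, Y, rcond=None)[0][-1]
print(beta_fe)   # 0.0 (up to floating point), despite all effects being in {1, 3}
```

The estimate is a non-convex weighted average of the three positive effects: the negative weight on the early group's later-period effect exactly cancels the positive contributions.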
Taken together, these results suggest that the common practice of attaching a causal interpretation to the coefficients of TWFE regression models is not, in general, warranted.

Given the potential pitfalls associated with traditional estimation procedures, which do not recover easy-to-interpret causal parameters without further restricting treatment effect heterogeneity, S&A, C&S, and dC&D propose different estimators for different treatment effect parameters. In this subsection, we review these procedures and highlight their differences.

dC&D focuses on an instantaneous treatment effect measure across all “ever-treated” groups. More precisely, dC&D is mainly interested in estimating

δ^S ≡ E[ (Σ_{i=1}^{n} Σ_{t=2}^{T} G_{it} · (Y_{it}(1) − Y_{it}(0))) / (Σ_{i=1}^{n} Σ_{t=2}^{T} G_{it}) ],    (2.18)

the average of the treatment effect at the time when a group starts receiving the treatment, across all groups that become treated at some point (see Section 4 of dC&D).

dC&D also proposes an easy-to-implement estimator for δ^S. To better understand this estimator, let

ÂTT_ny(g, t) = [n^{−1} Σ_{i=1}^{n} G_{ig} (Y_{it} − Y_{i,g−1})] / [n^{−1} Σ_{i=1}^{n} G_{ig}] − [n^{−1} Σ_{i=1}^{n} (1 − D_{it})(1 − G_{ig})(Y_{it} − Y_{i,g−1})] / [n^{−1} Σ_{i=1}^{n} (1 − D_{it})(1 − G_{ig})]    (2.19)

be a DID estimator for (2.1) that uses the units not yet treated by time t as a comparison group for treatment group g at time t. Consider the estimator of the probability of a unit being in group g given that it is among the units that are treated for at least e = t − g + 1 periods,

ŵ(g; e) ≡ P̂(G_g = 1 | treated for ≥ e periods) = N_{g∩≥e} / N_{≥e}.    (2.20)

This is the case with staggered treatment adoption.
In more general treatment adoption setups, the parameter of interest considered by dC&D differs from δ^S as defined above.

Here, N_{g∩≥e} denotes the number of units in group g among those units that have been treated for at least e periods, and N_{≥e} is the number of units that have been treated for at least e periods. dC&D then show that, under the PTA 2.5 and some additional regularity conditions,

δ̂^S = Σ_{g=2}^{T} P̂(G_g = 1 | treated for ≥ 1 period) · ÂTT_ny(g, g),    (2.21)

is an unbiased estimator of δ^S, and, as the (effective) sample size grows, δ̂^S is also consistent and asymptotically normal.

From the discussion above, it is evident that (2.21) is a well-defined estimator for the easy-to-interpret causal parameter of interest δ^S as defined in (2.18). On the other hand, δ̂^S is, by design, only suitable to summarize instantaneous treatment effects. Hence, when one is interested in treatment effect dynamics, one needs to consider alternative causal parameters of interest.

A particular way of considering more general parameters of interest that are able to capture richer sources of treatment effect heterogeneity is to follow C&S, and center the analysis on the average treatment effect at time t for those units first treated at time g, ATT(g, t), as defined in (2.1). By doing so, one can highlight different sources of treatment effect heterogeneity. For instance, one can look at how ATT(g, t) for a particular group g evolves over time, which allows one to study group-specific treatment effect dynamics. Alternatively, one can form different weighted averages of the ATT(g, t) that are able to summarize overall treatment effects.
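As a sketch of how (2.19)-(2.21) fit together, the simulation below (our own illustration; the design, group shares, and group-specific effects are assumptions) computes the not-yet-treated estimator ÂTT_ny(g, g) for each group and aggregates it with the weights (2.20) at e = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 100_000, 4
# Groups first treated at t = 2, 3, 4, plus a never-treated group (coded 0).
group = rng.choice([2, 3, 4, 0], size=n, p=[0.25, 0.25, 0.25, 0.25])
t = np.arange(1, T + 1)
D = (t >= group[:, None]) & (group[:, None] > 0)
# Parallel trends holds; the treatment effect for group g is g (heterogeneous).
Y = rng.normal(size=(n, T)) + 0.2 * t + np.where(D, group[:, None], 0.0)

def att_ny(g, s):
    """Sample analogue of (2.19): not-yet-treated comparison group, ATT(g, t = s)."""
    treat = group == g
    comp = ~D[:, s - 1] & ~treat            # (1 - D_it)(1 - G_ig) at time t = s
    diff = Y[:, s - 1] - Y[:, g - 2]        # Y_t - Y_{g-1}, 0-indexed columns
    return diff[treat].mean() - diff[comp].mean()

# (2.21): aggregate the instantaneous effects ATT_ny(g, g) with the weights
# P_hat(G_g = 1 | treated for >= 1 period), i.e., group shares among ever-treated.
ever = group > 0
delta_S = sum((group == g).sum() / ever.sum() * att_ny(g, g) for g in (2, 3, 4))
print(delta_S)   # estimand: (2 + 3 + 4) / 3 = 3, so the estimate is close to 3.0
```

In this design the instantaneous effect of group g equals g, so δ^S is (2 + 3 + 4)/3 = 3, and the plug-in estimate lands close to that value.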
Examples of these weighted averages include (i) a “simple” average of the ATT(g, t),

ATT_simple = [Σ_{g=2}^{T} Σ_{t=2}^{T} 1{g ≤ t} P(G = g) · ATT(g, t)] / [Σ_{g=2}^{T} Σ_{t=2}^{T} 1{g ≤ t} P(G = g)],    (2.22)

(ii) the “event-study-type” causal parameter

δ^{es}(e) = Σ_{g=2}^{T} Σ_{t=2}^{T} 1{t − g + 1 = e} P(G_g = 1 | treated for ≥ e periods) ATT(g, t),    (2.23)

which provides the average treatment effect for units that have been treated for e periods, and (iii) the average of δ^{es}(e) over all possible (positive) values of e,

δ^{e,avg} = (1 / (T − 1)) Σ_{e=1}^{T−1} δ^{es}(e).    (2.24)

Note that all these weighted averages of the ATT(g, t) are easy to interpret, less “data hungry” than the disaggregated ATT(g, t), and can be used to summarize short-, medium-, and long-run effects of a given policy. In fact, one can show that (2.18) is equal to δ^{es}(e) with e = 1.

The key challenge in estimating all these causal parameters of interest is to show that one can indeed nonparametrically point-identify all the ATT(g, t)'s with t ≥ g. C&S show that one can bypass such challenges by imposing either the PTA 2.6 or 2.7, though each assumption leads to a different estimand. More precisely, C&S show that, for t ≥ g, ATT(g, t) is nonparametrically identified by

ATT_never(g, t) = E[Y_t − Y_{g−1} | G_g = 1] − E[Y_t − Y_{g−1} | C = 1],    (2.25)
ATT_ny(g, t) = E[Y_t − Y_{g−1} | G_g = 1] − E[Y_t − Y_{g−1} | D_t = 0, G_g = 0],    (2.26)

when one respectively imposes either the PTA 2.6 or 2.7.
These quantities can be straightforwardly estimated by
$$\widehat{ATT}_{never}(g,t) = \frac{n^{-1}\sum_{i=1}^{n} G_{ig}\,(Y_{it}-Y_{ig-1})}{n^{-1}\sum_{i=1}^{n} G_{ig}} - \frac{n^{-1}\sum_{i=1}^{n} C_i\,(Y_{it}-Y_{ig-1})}{n^{-1}\sum_{i=1}^{n} C_i} \qquad (2.27)$$
and by $\widehat{ATT}_{ny}(g,t)$ as defined in (2.19).

With either $\widehat{ATT}_{never}(g,t)$ or $\widehat{ATT}_{ny}(g,t)$ in hand, one can then form the more aggregated parameters by replacing $\widehat{ATT}(g,t)$ with either of these estimators, and by replacing the weights with their natural (plug-in) estimators. For instance, depending on whether one imposes the parallel trends assumption 2.6 or 2.7, one can naturally estimate $\delta^{es}(e)$ by
$$\widehat{\delta}^{es}_{never}(e) = \sum_{g=2}^{T}\sum_{t=2}^{T}\mathbf{1}\{t-g+1=e\}\,\widehat{w}(g;e)\,\widehat{ATT}_{never}(g,t),$$
$$\widehat{\delta}^{es}_{ny}(e) = \sum_{g=2}^{T}\sum_{t=2}^{T}\mathbf{1}\{t-g+1=e\}\,\widehat{w}(g;e)\,\widehat{ATT}_{ny}(g,t),$$
respectively, where $e\geq 1$ and $\widehat{w}(g;e)$ is defined as in (2.20). The aggregated estimators for $ATT_{simple}$ and for $\delta^{e,avg}$ are formed analogously.

C&S derive the large sample properties of all the aforementioned estimators and propose bootstrap procedures to construct simultaneous confidence bands for these treatment effect measures. They emphasize the practical importance of using simultaneous inference procedures when one estimates multiple parameters of interest (e.g., when one estimates $\delta^{es}(e)$ for multiple $e$'s), as failing to account for multiple testing usually leads to misleading inference.

We conclude this subsection by noting that S&A are mainly interested in recovering the event-study-type parameter $\delta^{es}(e)$. More precisely, S&A propose the following interaction-weighted estimator for $\delta^{es}(e)$ (see Section 4 of S&A).
In the first step, they use the linear two-way fixed effects specification that interacts relative-time indicators with treatment-group indicators,
$$Y_{it} = \lambda_i + \lambda_t + \sum_{g=2}^{T-1}\sum_{e\neq 0}\delta_{ge}\cdot G_{ig}\,\mathbf{1}\{t - G_i + 1 = e\} + v_{it}, \qquad (2.28)$$
on observations from $t=1,\ldots,T-1$, where the last time period $T$ is dropped in order to accommodate the case where there is no "never treated" group; if a never-treated group is available, dropping data from time period $T$ is unnecessary. S&A show that, under the PTA 2.5, the estimator $\widehat{\delta}_{ge}$ is consistent for $ATT(g,t)$, $t-g+1=e$. Here, it is important to emphasize that, when a "never treated" group is not available, only the units treated at the last time period are used as comparison units when computing $\widehat{\delta}_{ge}$, which differs from the C&S proposed estimator (2.19) that uses not-yet-treated units as comparison units. When a "never treated" group is available, though, the interaction-weighted estimator $\widehat{\delta}_{ge}$ is equivalent to $\widehat{ATT}_{never}(g,t)$ once one maps event time to calendar time (or vice versa).

Armed with $\widehat{\delta}_{ge}$, S&A then propose to estimate $\delta^{es}(e)$ by
$$\widehat{\delta}^{es}_{S\&A}(e) = \sum_{g=2}^{T}\sum_{t=2}^{T}\mathbf{1}\{t-g+1=e\}\,\widehat{w}(g;e)\,\widehat{\delta}_{gt},$$
where $\widehat{w}(g;e)$ is defined as in (2.20). S&A establish the large sample properties of $\widehat{\delta}^{es}_{S\&A}(e)$ and provide valid pointwise inference procedures for $\delta^{es}(e)$.

Remark 2.1.
Many times one wishes to check for the existence of non-parallel pre-trends as a way to assess the credibility of the DID setup. We note that one can use $\widehat{\delta}^{es}_{never}(e)$ and/or $\widehat{\delta}^{es}_{ny}(e)$ with $e<1$ as estimators of pre-trends, though, when $e$ is negative, one must replace the estimated weights $\widehat{w}(g;e)$ as defined in (2.20) with
$$\widehat{w}(g;e^-) \equiv \widehat{P}\left(G_g=1 \mid \text{at least } |e| \text{ pre-treatment periods available}\right) = \frac{N_{g\cap\geq e^-}}{N_{\geq e^-}},$$
where $N_{g\cap\geq e^-}$ denotes the number of observations in group $g$ among those units that have at least $|e|$ pre-treatment time periods of data available, and $N_{\geq e^-}$ is the number of units that have at least $|e|$ pre-treatment time periods of data available. Importantly, these event-study estimators avoid the pitfalls associated with using the dynamic TWFE specification to assess the credibility of parallel trends; see S&A for a detailed discussion of this important issue.

From the discussion in the previous section, it is clear that in DID designs with staggered treatment adoption one can make different parallel trends assumptions and can estimate different parameters of interest. The discussion in Section 2 also indicates that, despite their peculiarities, these different parameters of interest can be estimated using weighted averages of estimators for the
$ATT(g,t)$'s. From this observation, one can reasonably argue that identifying (and estimating) the $ATT(g,t)$'s from the data is the most challenging step of the analysis, and that the different PTAs help researchers to overcome it. Once this is done, constructing event-study-type estimators, for instance, becomes straightforward.

In practice, however, researchers must choose and justify the use of a given PTA. In this section, we aim to highlight some practical consequences of adopting different versions of the PTA. In order to simplify the discussion, we exploit the stylized example introduced in Section 2 whenever possible, and we implicitly impose Assumptions 2.1-2.4.

Practitioners routinely use estimates of pre-treatment event-study coefficients to assess the credibility of an underlying PTA. Can these tests for parallel pre-treatment trends be interpreted as direct tests for the validity of the underlying PTA, or should they be interpreted as "placebo/falsification"-type tests? With the help of the stylized example, we show that the answer depends on the chosen PTA.

Let us first consider the case of Assumption 2.5. As is evident from (2.2)-(2.8), the PTA 2.5 imposes six linearly independent moment restrictions to recover three counterfactual parameters, $\alpha_{3,3}(0)$, $\alpha_{3,4}(0)$, and $\alpha_{4,4}(0)$. That is, imposing the PTA 2.5 leads to an overidentified system of equations and, consequently, we can directly test for the validity of the PTA 2.5. Indeed, it is easy to see that the PTA 2.5 implies parallel pre-treatment trends across every group, see, e.g., (2.6)-(2.8), and such restrictions can be directly assessed from the data, for instance, by testing whether pre-treatment event-study-type estimates of (2.23) are all equal to zero. Thus, under the PTA 2.5, non-zero pre-treatment event-study estimates should be interpreted as direct evidence against the identifying assumptions.
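As a concrete illustration of such a direct assessment, one can compare average pre-treatment outcome changes across two groups with a standard two-sample test. The sketch below is illustrative only; `pretrend_test` and its inputs are hypothetical names, not part of the paper's procedures:

```python
import numpy as np
from scipy.stats import ttest_ind

def pretrend_test(dy_group_a, dy_group_b):
    """Welch t-test of H0: equal mean pre-treatment outcome changes
    across two groups.

    dy_group_a, dy_group_b : arrays of pre-treatment changes (e.g., Y_2 - Y_1)
    for the units in each group; hypothetical inputs for illustration.
    """
    stat, pval = ttest_ind(np.asarray(dy_group_a), np.asarray(dy_group_b),
                           equal_var=False)  # Welch correction: unequal variances
    return stat, pval
```

A rejection points to non-parallel pre-trends between the two groups; whether that constitutes direct or only placebo-type evidence against the invoked PTA depends on which PTA is maintained, as discussed in the text.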
A somewhat similar conclusion is reached when one relies on the PTA 2.7: (2.12)-(2.15) suggest four linearly independent moment restrictions to recover three counterfactual parameters, which also leads to an overidentified system of equations. Recalling that (2.12)-(2.15) is equivalent to (2.2)-(2.5), we can then see from (2.2)-(2.3) that the PTA 2.7 also imposes parallel pre-trends among the "never treated" ($C=1$) and the "later treated" ($G_4=1$) from time $t=2$ to $t=3$, but does not restrict pre-trends of these two groups from $t=1$ to $t=2$, nor the pre-trends of the early-treated group ($G_3=1$). Moving from the stylized example to the general case, we have that the PTA 2.7 imposes parallel pre-treatment trends from $t=g_{\min}-1$ to $t=T$ for all groups except the first-treated group (which is treated at time $g_{\min}$). Interestingly, because estimators of (2.23) with $e<1$ exploit all pre-treatment trends (including the one for group $G_3=1$ in our stylized example), non-zero pre-treatment estimates cannot, at least strictly speaking, be interpreted as direct tests of the identifying assumptions, but rather as placebo-type tests.

As stressed by S&A, one should not use pre-treatment coefficients from TWFE event-study-type regressions to assess the credibility of the PTA, as these coefficients can be contaminated by post-treatment effects. Using estimates of (2.23) with $e<1$, on the other hand, does not suffer from these pitfalls. Please refer to S&A for a detailed discussion of these issues. If Assumption 2.3 is violated, though, it is possible that violations of Assumption 2.3 "offset" violations of Assumption 2.5, and the test based on (2.6)-(2.8) would not capture such violations. This should always be taken into account. In addition, failing to reject these tests should not be interpreted as evidence in favor of the identifying assumptions, as it may be the case that the test lacks power to detect some non-trivial deviations from the null.
Nonetheless, one can easily bypass this limitation by constructing alternative tests of the identifying assumptions. For instance, in the context of our stylized example, one can directly test whether $E[Y_3 - Y_2\mid C=1] = E[Y_3 - Y_2\mid G_4=1]$ using a standard t-test. Rejecting the null hypothesis would provide direct evidence against the identifying assumptions.

Finally, note that the conclusion is very different when one imposes the PTA 2.6: (2.9)-(2.11) suggest we have a just-identified system of equations, implying that the PTA 2.6 cannot be directly tested. Indeed, as we discussed in Section 2.2, the PTA 2.6 does not restrict pre-treatment trends, and, therefore, event-study estimates for pre-treatment periods provide, at best, placebo-type evidence against the PTA 2.6.

The discussion above highlights that whether tests for parallel pre-treatment trends provide direct or indirect evidence against the invoked identifying assumptions crucially depends on the invoked PTA. This is very different from the case where treatment adoption does not vary across time. In that case, tests for non-parallel pre-treatment trends always provide only indirect evidence against the adopted design.

In this section, we discuss some potential trade-offs one may face when adopting different PTAs, as they can lead to different DID estimators.
The PTA 2.6 is the weakest PTA among the three we have considered so far, as it does not impose any restriction on pre-treatment trends across groups. Given that this PTA leads to a just-identified system of equations, in situations where researchers are not willing to impose additional restrictions on the data, $\widehat{ATT}_{never}(g,t)$ as defined in (2.27) is the only suitable estimator for the $ATT(g,t)$ and their different functionals, such as the event-study parameters (2.23).

Of course, in order to rely on the PTA 2.6 and use the DID estimator (2.27), we must have a set of units that do not experience treatment in the time window we want to analyze. When such a group of units is available but its relative size is small, inference procedures based on (2.27) may not be as precise as one wishes. However, it is important to stress that this potential "loss of efficiency" is a direct consequence of not exploiting restrictions on pre-treatment trends across groups.

In practice, we foresee researchers taking this "robustness" versus "efficiency" trade-off into account when deciding whether the PTA 2.6 is the most suitable for the given application. In situations where there is a "reasonably large" number of units that cannot be treated because of some application-specific institutional detail, we expect the gains in robustness to dominate the potential gains in efficiency associated with using other PTAs and DID estimators. The same holds true if researchers are not comfortable with a priori ruling out non-parallel pre-trends. In these cases, we foresee researchers favoring the PTA 2.6 and the DID estimator $\widehat{ATT}_{never}(g,t)$ over the other alternatives.

In many situations, a "never-treated" group is not available, implying that the PTA 2.6 does not provide any identifying restriction that can be used to estimate the
$ATT(g,t)$'s. In other cases, the "never-treated" group may be "too small" to be of practical use, and/or researchers may a priori be comfortable restricting pre-treatment trends and using "not-yet-treated" units as valid comparison groups for the "earlier-treated" ones. In such cases, researchers can choose between the PTA 2.5 and the PTA 2.7. In both cases, though, they can use the DID estimator $\widehat{ATT}_{ny}(g,t)$, as defined in (2.19), to study policy effectiveness.

Although (2.19) can be used under either PTA, we still recommend that researchers explicitly specify which PTA they are invoking, for at least three reasons. First, being explicit about the identifying assumption adds transparency to the analysis, which is always desirable. Second, the interpretation of pre-tests based on event-study-type estimates can vary depending on the assumptions, as we discussed in Section 3.1. Third, the choice of the PTA has an important impact on what other estimators one could use instead of (2.19). This is particularly important because (2.19) does not fully exploit all the restrictions imposed by either PTA. Being aware of the exact PTA invoked allows researchers to adopt an alternative estimation procedure that fully exploits all these moment restrictions, resulting in estimators that are more efficient than (2.19). Here, we stress that the gains in efficiency will vary depending on the underlying PTA, as the PTA 2.5 imposes more restrictions on the data than the PTA 2.7.

Before describing how one can exploit these additional moment restrictions to form a more efficient DID estimator, it is worth describing situations where researchers may favor either the PTA 2.5 or the PTA 2.7. Recall that the main difference between the PTA 2.5 and the PTA 2.7 is that the former imposes parallel pre-treatment trends across all groups and all time periods, whereas the latter only restricts pre-treatment trends since the time the first group is treated.
These differences can be meaningful in applications where data on multiple time periods before the first group of units is treated are available, and the economic environment in these "early periods" was potentially different from the "later periods". In these cases, the outcomes of the different groups may evolve in a non-parallel manner during the "early periods", perhaps because the groups were exposed to different shocks, but these non-parallel trends become less of a concern as time passes. In such cases, we expect researchers to favor the PTA 2.7 over the PTA 2.5. In other situations, though, researchers may prefer to impose the PTA 2.5, allowing them to enjoy some potential gains in efficiency if they use estimators that exploit the additional restrictions imposed by the PTA 2.5 relative to the PTA 2.7. Again, the "robustness" versus "efficiency" trade-off should be taken into account when deciding which PTA is more appropriate for the specific application.

As we described in Section 3.2, in situations where researchers are comfortable with imposing either the PTA 2.5 or the PTA 2.7, the DID estimator (2.19) is not efficient, as it does not fully exploit all the restrictions implied by these PTAs. In this section, we describe how one can exploit all the restrictions implied by the identifying assumptions to form efficient DID estimators by casting the problem into the familiar GMM framework (Hansen, 1982). In what follows, we provide a step-by-step description of how one can form these efficient GMM DID estimators. To avoid repetition, we focus on the case where researchers impose the PTA 2.7; the implementation based on the PTA 2.5 is completely analogous.

The key to implementing the GMM is to list all the moment restrictions we are imposing to recover the
$ATT(g,t)$'s, which involves not only the moment restrictions implied by the PTA 2.7, but also the observational restrictions that, for all $t\geq g$, $\alpha_{g,t}(1) \equiv E[Y_t(1)\mid G_g=1] = E[Y_t\mid G_g=1]$, $\alpha^{prop}_g \equiv E[G_g]$, and $\alpha^{prop}_C \equiv E[C]$. We can then use these "augmented" moment restrictions (consisting of the observational restrictions and all the moment restrictions implied by the PTA) to efficiently estimate all the unknown parameters involved in our problem by following Hansen (1982).

To gain more intuition on how to implement the efficient GMM, we turn our attention to our stylized example. In this case, the unknown parameters consist of $\alpha \equiv (\alpha_{3,3}(1), \alpha_{3,3}(0), \alpha_{3,4}(1), \alpha_{3,4}(0), \alpha_{4,4}(1), \alpha_{4,4}(0), \alpha^{prop}_C, \alpha^{prop}_3, \alpha^{prop}_4)'$, which can be efficiently estimated by
$$\widehat{\alpha}_{gmm} = \arg\min_{\alpha\in\Theta}\ \bar{g}_{\alpha}(W)'\,\widehat{\Sigma}^{-1}_{\check{\alpha},gmm}\,\bar{g}_{\alpha}(W), \qquad (4.1)$$
where $\bar{g}_{\alpha}(W)$ is the sample average of the augmented moment conditions, $n^{-1}\sum_{i=1}^{n} g_{\alpha}(W_i)$, with $g_{\alpha}(W_i)$ combining all (linearly independent) moment conditions, and
$$\widehat{\Sigma}_{\check{\alpha},gmm} = \frac{1}{n}\sum_{i=1}^{n} g_{\check{\alpha}}(W_i)\,g_{\check{\alpha}}(W_i)',$$
$\check{\alpha}$ being a preliminary consistent estimator for $\alpha$, say the minimizer of (4.1) with $\widehat{\Sigma}_{\check{\alpha},gmm}$ replaced by the identity matrix.

With $\widehat{\alpha}_{gmm}$, one can then efficiently estimate the parameters of interest, $ATT(3,3)$, $ATT(3,4)$, and $ATT(4,4)$, by
$$\begin{pmatrix}\widehat{ATT}_{gmm}(3,3)\\ \widehat{ATT}_{gmm}(3,4)\\ \widehat{ATT}_{gmm}(4,4)\end{pmatrix} = \begin{pmatrix}\widehat{\alpha}_{gmm,3,3}(1) - \widehat{\alpha}_{gmm,3,3}(0)\\ \widehat{\alpha}_{gmm,3,4}(1) - \widehat{\alpha}_{gmm,3,4}(0)\\ \widehat{\alpha}_{gmm,4,4}(1) - \widehat{\alpha}_{gmm,4,4}(0)\end{pmatrix}. \qquad (4.2)$$
In what follows, we establish the asymptotic properties of $\widehat{\alpha}_{gmm}$. The asymptotic properties of $\widehat{ATT}_{gmm}(g,t)$ follow directly from the delta method. Let $\Delta Y_t = Y_t - Y_{t-1}$.
Define $\Sigma_{\alpha,gmm}$ as the probability limit of $\widehat{\Sigma}_{\check{\alpha},gmm}$, and
$$\Psi = E\left[\frac{\partial g_{\alpha}(W)}{\partial \alpha'}\right].$$
Let the vector of scores associated with the efficient GMM estimator be defined as
$$\phi^{gmm}_{\alpha}(W_i) = -\left(\Psi'\Sigma^{-1}_{\alpha,gmm}\Psi\right)^{-1}\Psi'\Sigma^{-1}_{\alpha,gmm}\cdot g_{\alpha}(W_i).$$

Proposition 4.1.
Assume that all random variables have finite second moments, that $\Sigma_{\alpha,gmm}$ is positive definite, and that Assumptions 2.1-2.4 hold. Then, when the parallel trends assumption 2.7 holds, we have that:

(i) As $n\to\infty$,
$$\sqrt{n}\left(\widehat{\alpha}_{gmm}-\alpha\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi^{gmm}_{\alpha}(W_i) + o_p(1) \xrightarrow{d} N\left(0,\left(\Psi'\Sigma^{-1}_{\alpha,gmm}\Psi\right)^{-1}\right).$$

(ii) The GMM estimator $\widehat{\alpha}_{gmm}$ is semiparametrically efficient.

Proof. The proof is presented in Web Appendix C.

Proposition 4.1 has important practical implications, which we illustrate in the context of our stylized example. First, and perhaps most importantly, it implies that
$$\sqrt{n}\left(\widehat{ATT}_{gmm} - ATT\right) \xrightarrow{d} N(0,\Omega),$$
where $\widehat{ATT}_{gmm}$ and $ATT$ stack the $(3,3)$, $(3,4)$, and $(4,4)$ entries (see Web Appendix A for the details about $g_{\alpha}(W_i)$), $\Omega = A\left(\Psi'\Sigma^{-1}_{\alpha,gmm}\Psi\right)^{-1}A'$, $A$ is a "selection matrix" (see Web Appendix A for its formal definition), and $\Omega$ is equal to the semiparametric efficiency bound for the $ATT(g,t)$ under the PTA 2.7. As such, $\widehat{ATT}_{gmm}$ exploits all available information in the data to estimate the
$ATT(g,t)$'s, which, in general, translates into tighter confidence intervals. In fact, under the PTA 2.7, the GMM DID estimator $\widehat{ATT}_{gmm}(g,t)$ is, in general, more efficient than $\widehat{ATT}_{ny}(g,t)$ or $\widehat{ATT}_{never}(g,t)$ as defined in (2.19) and (2.27), respectively, or those based on the "interaction-weighted" regression (2.28). This is a main advantage of the GMM DID estimator when compared to the other available estimators.

A second implication of Proposition 4.1 is that, given that we have an overidentified system of equations, one can directly use the Sargan-Hansen J-test as a test for the validity of the PTA 2.7. More precisely, under the null hypothesis that the PTA 2.7 is true,
$$J = n\cdot\left(\bar{g}_{\widehat{\alpha}_{gmm}}(W)'\,\widehat{\Sigma}^{-1}_{\widehat{\alpha}_{gmm},gmm}\,\bar{g}_{\widehat{\alpha}_{gmm}}(W)\right) \xrightarrow{d} \chi^2_{df} \quad \text{as } n\to\infty,$$
where $df$ is the number of overidentifying restrictions. If the PTA 2.7 holds, any deviation of $J$ from zero should be within the range of sampling error, whereas if the PTA 2.7 is violated, $J$ should be "large." Thus, the Sargan-Hansen J-test can be useful for detecting violations of the PTA 2.7.

At this point, one may wonder about the situations in which one may favor the GMM DID estimator (4.2) over the simpler DID estimator (2.19). Given that (4.2) is more efficient than, and as robust as, (2.19), the only obstacle we see to its widespread adoption is its implementation: whenever the number of treatment groups and/or time periods is large, the number of moment conditions in the efficient GMM can be fairly large. In our application, for example, where we have 16 treatment groups and 33 time periods, the GMM involves 780 moments with
195 overidentifying restrictions, whereas the sample size (state-year pairs) is equal to 759. In such cases, we expect researchers to favor the simpler, but inefficient, DID estimator (2.19). However, when the number of groups and/or time periods is moderate, such that implementation of the efficient GMM is not challenging, we would recommend using it.
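For readers who want to see the mechanics, the two-step efficient GMM behind (4.1) and the J statistic can be sketched generically. The `moments` argument below is a placeholder: the actual $g_{\alpha}(W_i)$ for this setting is spelled out in Web Appendix A, so this is a hedged illustration of Hansen's (1982) recipe, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def two_step_gmm(moments, data, alpha0):
    """Generic two-step GMM: identity weighting first (preliminary estimate),
    then weighting by the inverse of the estimated moment covariance.
    `moments(alpha, data)` must return an (n, q) array with rows g_alpha(W_i)."""
    def gbar(a):
        return moments(a, data).mean(axis=0)

    def obj(a, W):
        g = gbar(a)
        return g @ W @ g

    n, q = moments(alpha0, data).shape
    # Step 1: identity weighting matrix yields a preliminary consistent estimate
    step1 = minimize(obj, alpha0, args=(np.eye(q),), method="BFGS")
    g_i = moments(step1.x, data)
    Sigma = g_i.T @ g_i / n                      # estimated moment covariance
    W2 = np.linalg.inv(Sigma)
    # Step 2: efficient weighting by the inverse covariance
    step2 = minimize(obj, step1.x, args=(W2,), method="BFGS")
    # Sargan-Hansen J statistic; df = (number of moments) - (number of parameters)
    J = n * obj(step2.x, W2)
    pval = chi2.sf(J, q - len(np.atleast_1d(alpha0)))
    return step2.x, J, pval
```

The first step pins down the preliminary $\check{\alpha}$ used to build the weighting matrix, mirroring the description around (4.1); the returned p-value corresponds to the overidentification test discussed above.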
As highlighted in the previous section, the main attractive feature of the GMM estimation procedure is that it leads to efficient estimators that fully exploit all the available information compatible with the underlying identifying assumptions. On the other hand, the implementation of such GMM DID estimators is not always straightforward. In this section, we describe an alternative DID estimator for the
$ATT(g,t)$'s. Although this estimator does not fully exploit all the available identifying restrictions, and is therefore not as efficient as the GMM estimator of Section 4, it is very easy to implement. It relies on the following parallel trends assumption.
Assumption 5.1 ("Weaker" parallel trends assumption based on "not-yet-treated" units). For all $g,t = 2,\ldots,T$ such that $t\geq g$,
$$E[Y_t(0) - Y_{t-1}(0)\mid G_g=1] = E[Y_t(0) - Y_{t-1}(0)\mid D_t=0].$$

The PTA 5.1 imposes that the evolution of the outcome at time $t$ among those units that have not yet experienced treatment by time $t$ can help us identify the $ATT(g,t)$'s. Unlike the PTA 2.7, it does not impose that every individual not-yet-treated group can be used as a comparison group, which, in turn, suggests that the $ATT(g,t)$, $t\geq g$, are nonparametrically just-identified. We formalize this result in the next proposition. Let $\Delta Y_t = Y_t - Y_{t-1}$ denote the first difference of $Y_t$.

Proposition 5.1.
Assume that Assumptions 2.1-2.4 hold. Then, when the parallel trends assumption 5.1 holds, it follows that, for $2\leq g\leq t\leq T$, $ATT(g,t) = ATT_{ny+}(g,t)$, where
$$ATT_{ny+}(g,t) \equiv E[Y_t - Y_{g-1}\mid G_g=1] - \left(\sum_{s=g}^{t} E[\Delta Y_s\mid D_s=0,\, G_g=0]\right). \qquad (5.1)$$

Proof.
The proof is presented in Web Appendix C.

To better grasp how the PTA 5.1 allows us to use
$ATT_{ny+}(g,t)$ as an estimand for the $ATT(g,t)$, $t\geq g$, it is illustrative to go back to our stylized example. In this specific context, we have that, under Assumptions 2.1-2.4, the PTA 5.1 is equivalent to the following restrictions:
$$\alpha_{3,3}(0) = E[Y_3 - Y_2\mid D_3=0] + E[Y_2\mid G_3=1], \qquad (5.2)$$
$$\alpha_{3,4}(0) = E[Y_4 - Y_3\mid D_4=0] + \alpha_{3,3}(0), \qquad (5.3)$$
$$\alpha_{4,4}(0) = E[Y_4 - Y_3\mid D_4=0] + E[Y_3\mid G_4=1]. \qquad (5.4)$$
By listing these restrictions we can now see how we arrive at $ATT_{ny+}(g,t)$, as defined in (5.1). First, from (5.2) and (5.4), it follows that, when $g=t$, $\alpha_{g,t}(0)$ can be written explicitly as functionals of observable data (and not potential outcomes). As such, $ATT(g,t)$ is identified by (5.1). Interestingly, in this case with $g=t$, (5.1) reduces to $ATT_{ny}(g,t)$, as defined in (2.26). When one moves away from the "instantaneous average treatment effects", though, these two estimands differ. Indeed, by exploiting the moment restrictions (5.2) and (5.3), we can see that $ATT(3,4)$ is nonparametrically identified by $ATT_{ny+}(3,$
$4) = E[Y_4 - Y_2\mid G_3=1] - \left(E[Y_3 - Y_2\mid D_3=0] + E[Y_4 - Y_3\mid D_4=0]\right)$. Note that $ATT_{ny+}(3,4)$ uses data from all groups, $G_3=1$, $G_4=1$, and $C=1$, whereas $ATT_{ny}(3,4)$ only uses data from $G_3=1$ and $C=1$. Hence, one may expect estimators based on (5.1) to be more precise than (2.19) because they utilize more data. Furthermore, because the PTA 5.1 does not restrict pre-trends, i.e., it does not impose that $E[Y_3 - Y_2\mid D_3=0] = E[Y_3 - Y_2\mid D_4=0]$ as implied by (2.12) and (2.13), one can also expect additional gains in "robustness" from exploiting (5.1) instead of (2.26).

Next, we discuss how one can exploit Proposition 5.1 to estimate the $ATT(g,t)$'s. Here, the most natural way to proceed is to use the sample analogue of (5.1):
$$\widehat{ATT}_{ny+}(g,t) = \frac{n^{-1}\sum_{i=1}^{n} G_{ig}\,(Y_{it}-Y_{ig-1})}{n^{-1}\sum_{i=1}^{n} G_{ig}} - \sum_{s=g}^{t}\left(\frac{n^{-1}\sum_{i=1}^{n}(1-D_{is})(1-G_{ig})\,\Delta Y_{is}}{n^{-1}\sum_{i=1}^{n}(1-D_{is})(1-G_{ig})}\right). \qquad (5.5)$$
Note that (5.5) is very easy to compute, as it only involves combinations of sample means. Next, we show that these DID estimators also enjoy good asymptotic properties. More precisely, we prove that they are $\sqrt{n}$-consistent and establish their joint asymptotic distribution. Before we present the results, we need to introduce some additional notation. For each $(g,t)$-pair, let $\phi_{ny+}(W_i;g,t)$ be the influence function of $\widehat{ATT}_{ny+}(g,t)$,
$$\phi_{ny+}(W_i;g,t) = \frac{G_{ig}}{E[G_g]}\left((Y_{it}-Y_{ig-1}) - \frac{E[G_g\cdot(Y_t - Y_{g-1})]}{E[G_g]}\right) - \sum_{s=g}^{t}\frac{(1-D_{is})(1-G_{ig})}{E[(1-D_s)(1-G_g)]}\left(\Delta Y_{is} - \frac{E[(1-D_s)(1-G_g)\cdot\Delta Y_s]}{E[(1-D_s)(1-G_g)]}\right).$$
Finally, let
$\widehat{ATT}_{ny+}(t\geq g)$ and $ATT(t\geq g)$ denote the vectors of the $\widehat{ATT}_{ny+}(g,t)$ and $ATT(g,t)$, respectively, for all $g,t = 2,\ldots,T$ with $t\geq g$. Analogously, let $\Phi_{ny+}(W_i; t\geq g)$ denote the collection of the $\phi_{ny+}(W_i;g,t)$ across all periods $t$ and groups $g$ such that $t\geq g$.

Proposition 5.2.
Assume that Assumptions 2.1-2.4 and Assumption 5.1 hold. Then, as $n\to\infty$,
$$\sqrt{n}\left(\widehat{ATT}_{ny+} - ATT\right)(g,t) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi_{ny+}(W_i;g,t) + o_p(1). \qquad (5.6)$$
Furthermore,
$$\sqrt{n}\left(\widehat{ATT}_{ny+}(t\geq g) - ATT(t\geq g)\right) \xrightarrow{d} N(0,V), \qquad (5.7)$$
with $V = E\left(\Phi_{ny+}(W;t\geq g)\,\Phi_{ny+}(W;t\geq g)'\right)$.

Proof. The proof is presented in Web Appendix C.

We restrict our attention to $t\geq g$ simply because these are the post-treatment periods, which presumably are the periods of main interest for the analysis. However, our results naturally extend to the case where one considers all possible $(g,t)$'s, with the caveat that $ATT_{ny+}(g,t)$ may differ from $ATT(g,t)$ for $t<g$, as the PTA 5.1 does not explicitly restrict pre-trends.

There are different ways to conduct inference about the $ATT(g,t)$'s. The first, and perhaps more standard, approach is to use the analogy principle and directly estimate $V$, which leads directly to standard errors and pointwise confidence intervals. However, it is worth stressing that when one is interested in making inference about multiple $ATT(g,t)$'s, inference procedures based on this standard approach, such as those based on traditional t-tests and/or individual confidence intervals, are usually inappropriate, as they do not account for the fact that one is (implicitly) conducting multiple hypothesis tests. As a direct consequence, significant treatment effects may emerge simply by chance, even when all the $ATT(g,t)$'s are equal to zero; see, e.g., Romano and Wolf (2005), Anderson (2008), and Section 8 of Romano et al. (2010).

An alternative path to conduct asymptotically valid inference for multiple parameters of interest that is robust against the multiple-testing problem is to leverage the asymptotic linear representation (5.6) to construct computationally simple bootstrapped simultaneous confidence intervals for the multiple $ATT(g,t)$'s.
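To fix ideas, the sample analogue (5.5) and the multiplier-bootstrap idea can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the data layout is assumed, and Rademacher multipliers are one common choice of random weight.

```python
import numpy as np

def att_ny_plus(Y, G_g, D, g, t):
    """Sample analogue (5.5) of ATT_{ny+}(g, t).

    Y   : (n, T) outcomes; column s-1 holds Y_s (periods are 1-indexed)
    G_g : (n,) 0/1 indicator of first treatment at period g
    D   : (n, T) with D[i, s-1] = 1 once unit i has been treated by period s
    """
    treated = (Y[G_g == 1, t - 1] - Y[G_g == 1, g - 2]).mean()
    trend = 0.0
    for s in range(g, t + 1):            # accumulate E[dY_s | D_s = 0, G_g = 0]
        comp = (D[:, s - 1] == 0) & (G_g == 0)
        trend += (Y[comp, s - 1] - Y[comp, s - 2]).mean()
    return treated - trend


def simultaneous_band(phi, att_hat, n_boot=999, level=0.95, seed=0):
    """Multiplier-bootstrap simultaneous band: perturb the influence functions
    phi (an (n, k) array, one column per ATT(g, t)) by random weights instead
    of re-estimating the ATT(g, t)'s at every draw."""
    rng = np.random.default_rng(seed)
    n, k = phi.shape
    se = phi.std(axis=0, ddof=1) / np.sqrt(n)     # analytic pointwise std. errors
    sup_t = np.empty(n_boot)
    for b in range(n_boot):
        v = rng.choice([-1.0, 1.0], size=n)       # Rademacher multipliers
        sup_t[b] = np.max(np.abs((v[:, None] * phi).mean(axis=0)) / se)
    c = np.quantile(sup_t, level)                 # sup-t critical value
    return att_hat - c * se, att_hat + c * se
```

Because the sup-t critical value exceeds the pointwise one, the resulting band is wider than individual confidence intervals but controls the family-wise coverage across all the $ATT(g,t)$'s considered.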
The idea of this bootstrap procedure is fairly simple: each bootstrap iteration amounts to "perturbing" the asymptotic linear representation of the $\widehat{ATT}_{ny+}(g,t)$'s by a random weight $V$, and it does not require re-estimating the $ATT(g,t)$'s at each bootstrap draw. In Web Appendix B, we provide a step-by-step description of how one can implement such a procedure.

Remark 5.1.
It is worth stressing that the
$ATT_{ny+}(g,t)$ estimand defined in (5.1) is only suitable for post-treatment periods, i.e., for $t\geq g$. Hence, in contrast to (2.25) and (2.26), we cannot use the same estimand to analyze pre-treatment periods $t<g$. To address this issue, we suggest using the estimand
$$ATT^{pre}_{ny+}(g,t) \equiv E[\Delta Y_t\mid G_g=1] - E[\Delta Y_t\mid D_t=0,\, G_g=0], \quad \text{for } t<g, \qquad (5.8)$$
which should be equal to zero under Assumptions 2.1-2.4 and a stronger version of the PTA 5.1 that holds for both pre- and post-treatment periods (and not only for post-treatment periods, as the PTA 5.1 does). One can then use estimates of (5.8) to provide indirect evidence for the PTA 5.1, as the PTA 5.1 cannot be directly tested. We stress that (5.8) should not be directly compared with (2.25) and (2.26) for $t<g$, as (5.8) measures "local deviations" (from time $t-1$ to $t$) of a zero pre-treatment trends condition, whereas (2.25) and (2.26) capture "cumulative deviations" (from time $t$ until $g-1$) of zero pre-treatment trends conditions.

Remark 5.2.
It is straightforward to build on $\widehat{ATT}_{ny+}(g,t)$ to construct event-study estimators for $\delta^{es}(e)$ as defined in (2.23). Following the same steps described in Section 2.3.2, a natural estimator for $\delta^{es}(e)$ is
$$\widehat{\delta}^{es}_{ny+}(e) = \sum_{g=2}^{T}\sum_{t=2}^{T}\mathbf{1}\{t-g+1=e\}\,\widehat{w}(g;e)\,\widehat{ATT}_{ny+}(g,t),$$
where the weights $\widehat{w}(g;e)$ are defined as in (2.20). By building on Proposition 5.2 and the fact that the weights admit an asymptotic linear representation, it is easy to show that $\widehat{\delta}^{es}_{ny+}(e)$ is consistent and asymptotically normal; see, e.g., C&S.

We conclude this section by highlighting situations where we foresee (5.5) being favored over the other available DID estimators. First, we envision researchers favoring (5.5) over (2.19) in situations where they are not comfortable explicitly restricting pre-trends and/or when they want to use data from all groups to estimate the
$ATT(g,t)$'s. This can be particularly relevant when one wants to conduct cluster-robust inference and only a moderate number of clusters is available. We also expect researchers to favor (5.5) over the efficient GMM estimator when implementation of the latter is challenging. In this case, we expect (5.5)'s ease of use to dominate the potential efficiency gains of the GMM. Finally, we expect researchers to favor (5.5) over (2.27) when the "never-treated" group is relatively small, though we stress that these two estimators rely on non-nested PTAs.

Remark 5.3.
Given that different estimators (and PTAs) have different implications for robustness and efficiency, it may be tempting to engage in a "specific-to-general" specification search: start the analysis with estimators that rely on "stronger" assumptions and then test the validity of these assumptions; if one does not reject them, one stops and uses the "more efficient" estimators, but if one rejects the invoked PTA, one then chooses a "more robust" but "less efficient" DID estimator. Although fairly intuitive, this strategy is dangerous and should not be used in practice: the specification search is based on a multiple-testing procedure, and, as such, inference procedures that treat the "final" estimator (or the "winner") as "true" can be severely distorted; see, e.g., Roth (2020) for a detailed discussion of this issue. Hence, we argue that researchers should select the PTA taking into account the "robustness" versus "efficiency" trade-off, and that these considerations should be based on external, context-specific information, and not on pre-tests.
To illustrate the inherent trade-offs described above, we replicate Katherine Grooms' (2015) analysis of the transition from federal to state management of the Clean Water Act (CWA). Environmental policy mandated at the federal level is often implemented at the state level. Yet, there exists variation in the level of enforcement across states. Grooms (2015) exploits the staggered timing of the transfer from federal to state monitoring and enforcement of the CWA. Using TWFE specifications akin to (2.16) and (2.17), she finds that the state-level prevalence of corruption plays an important role in the enforcement of, and compliance with, environmental regulation after transitioning to state control.

We begin by describing the data, and then we discuss the practical relevance of the key assumptions and specifications we use for the analysis given our context. Next, we show the baseline and corruption-specific results for both the TWFE specification and the new DID estimators that rely on the different PTAs discussed above. Finally, we discuss the implications of the findings and the importance of choosing an appropriate PTA.
We follow the data construction from Grooms (2015) as closely as possible. Table D.1 in Web Appendix D replicates key summary statistics from Grooms (2015) and provides additional detail on data sources and construction. As described further in Web Appendix D, we follow Grooms (2015) to construct a measure of the fraction of total facilities with at least one inspection, violation, or enforcement action in a state and year.

The timing of state authorization is distributed fairly evenly throughout our sample period, with the exception that 27 states received authorization prior to the sample period, between 1973 and 1975. Given that neither the data nor the parallel trends assumptions for Y_t(0) provide information to identify the average treatment effect for these "always treated" states, these states are dropped from the analysis. Figure 6.1 highlights the year that each of the remaining 23 states started treatment, i.e., the year in which the state was authorized to administer individual NPDES permits. The bottom four states are what we call the "never-treated" units, i.e., the states that remain unauthorized to administer individual NPDES permits through the entire sample period. Figure 6.1 also allows one to visualize which states form each treatment group (those states whose colors turn to dark blue in the same year), and who the "not-yet-treated" states are at any point in time (those units that are colored light blue in a given year).

(Footnote: Figure D.1 in Web Appendix D shows the distribution of the timing of state authorization across years. As many states receive authorization for the first four phases in the same year, we define the year of authorization as the year in which the state was authorized to perform the first phase of the program, administering individual NPDES permits.)

Finally, we follow Grooms (2015) in defining states with above-median federal public corruption convictions across all years as "corrupt" states. Figure 6.2 shows corrupt states
(Footnote: As of 2008, four states remained unauthorized to administer individual NPDES permits. Idaho received authorization in 2018, outside of the sample period used here to be consistent with Grooms (2015). See Web Appendix D for additional detail.)
[Figure 6.1: Timing of treatment adoption by state (NM, NH, MA, ID, AK, AZ, ME, TX, OK, LA, FL, SD, UT, AR, RI, KY, WV, NJ, AL, PA, IA, TN, IL), 1976-2006. Legend: never-treated; treated (before state authorization); treated (after state authorization). Notes: Shows the timing of treatment adoption, where treatment is defined as the year in which the state was authorized to administer individual NPDES permits.]

in red and non-corrupt states in blue. "Always-treated" states are shown in grey. Based on this measure, "corrupt" states are mostly from the mid-Atlantic and southern regions, while "non-corrupt" states tend to appear in New England and the west.
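This above-median classification is straightforward to implement. The sketch below illustrates the idea with hypothetical conviction data; the state abbreviations and numbers are made up for illustration and are not the Grooms (2015) measure itself:

```python
import numpy as np

def classify_corrupt(states, avg_convictions_per_capita):
    """Label a state "corrupt" when its average federal public corruption
    convictions per capita (across all sample years) exceeds the
    cross-state median, mirroring the median-split definition above."""
    conv = np.asarray(avg_convictions_per_capita, dtype=float)
    cutoff = np.median(conv)
    return {s: bool(c > cutoff) for s, c in zip(states, conv)}

# Hypothetical averages for four states (illustrative values only)
labels = classify_corrupt(["AL", "ME", "NJ", "UT"], [0.9, 0.2, 0.8, 0.3])
# AL and NJ fall above the median of 0.55 and are labeled "corrupt"
```

Note that a median split always classifies half of the states as "corrupt" by construction, so the label is relative to the sample rather than an absolute corruption threshold.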
Like Grooms (2015), the starting point of our exercise is to examine the impact of authorization on compliance outcomes — for the sake of brevity, we focus on violation rates, though results for inspection rates and enforcement rates are available upon request. (Footnote: Overall, we find essentially zero effect on these other outcomes, regardless of the PTA and model specification used. This is in line with the results in Grooms (2015).)

Since we are particularly interested in treatment effect dynamics, we estimate event-study-type parameters using four different procedures. First, we replicate the dynamic TWFE specification from Grooms (2015). The exact specification we use is the following:

Y_it = λ_i + λ_t + Σ_{e=−31, e∉{−1,0}}^{32} β_e 1{t − G_i + 1 = e} + v_it,    (6.1)

which includes 30 treatment lead indicators (all the indicators associated with β_e with e < 0) and 32 treatment lag indicators (all the indicators associated with β_e with e > 0). We follow Borusyak and Jaravel (2017) and omit the treatment lead indicators associated with e = 0 and with e = −1. Like Grooms (2015), our specifications are weighted by total facilities in a state, and all standard errors are clustered at the state level.

[Figure 6.2 notes: Corrupt states, shown in red, are those above the median of average convictions per capita across all years. Non-corrupt states are shown in blue. Grey states are "already treated" prior to the sample window and are not included in the analysis.]

Second, we make specific PTAs and use the new estimators described previously. Because our empirical application includes a set of "never-treated" states, we estimate event-study-type parameters based on the PTA (2.6) and use δ̂^es_never(e) as an estimator for δ^es(e). We also leverage the PTA (2.7) and use δ̂^es_ny(e) as an estimator for δ^es(e). Finally, we employ the PTA (5.1) and use δ̂^es_ny+(e) as an estimator for δ^es(e). We do not use the event-study estimates based on the GMM framework discussed in Section 4, since, in our specific application, the GMM associated with the PTA 2.5 involves 780 moments with
195 overidentification restrictions, whereas the sample size (state-year pairs) is equal to 759.

Next, we analyze whether the effect of state authorization on violation rates varies depending on whether a state has a long prevalence of corruption. To do so, we follow Grooms (2015) and consider the following TWFE specification:

Y_it = α_i + α_t + Σ_{e=−31, e∉{−1,0}}^{32} β_e 1{t − G_i + 1 = e} + Σ_{e=−31, e∉{−1,0}}^{32} β_ce (1{t − G_i + 1 = e} × Corrupt_i) + v_it,    (6.2)

where the β_ce's are considered to be a measure of how treatment effects vary depending on whether a state is "corrupt" or not: positive (negative) point estimates suggest that the violation rates increased (decreased) more in corrupt states than in non-corrupt states.

At this stage, two important questions arise. First, what type of parallel trends assumption is actually being invoked to justify attaching a causal interpretation to the β_ce's in (6.2)? Second, is (6.2) susceptible to the potential pitfalls discussed in Section 2.3.1? Answering these questions is inherently hard, as TWFE is a model specification and not a "research design." An alternative, and perhaps more constructive, way of approaching this problem is to construct event-study-type estimators that explicitly rely on a particular PTA, and that, by design, avoid the potential lack of a clear interpretation associated with the TWFE specification (6.2). We follow this latter path.

With respect to the PTA, there are two natural variants of each of the PTAs 2.6, 2.7, and 5.1 that one can invoke to highlight treatment effect heterogeneity with respect to whether a state is corrupt or not. These variants differ from each other depending on whether or not one allows for different counterfactual trends between corrupt and non-corrupt states. One may be concerned, for example, that, in the absence of treatment, the evolution of the violation rate could differ between corrupt and non-corrupt states.
In this case, one would prefer a "weaker" assumption that allows for corruption-specific trends. In the context of our application, "corruption" is not randomly assigned, and the geographic clustering of corrupt and non-corrupt states may lead us to prefer a PTA that permits corruption-specific trends if we think, for example, that there may be regional trends that differ across these states. We believe this is the most natural identification setup, as this is how one would proceed if one were to separately estimate counterfactuals for corrupt states and non-corrupt states, and only later compare their difference. We formalize these six different PTAs below.

Assumption 6.1 (Parallel trends assumption based on "never treated" units, with corruption-specific trends). For c = 0, 1, and all g, t = 2, …, T, such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | C = 1, Corr = c].

Assumption 6.2 (Parallel trends assumption based on "not-yet treated" units, with corruption-specific trends). For c = 0, 1, and all g, s, t = 2, …, T, such that t ≥ g, s ≥ t,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | D_s = 0, Corr = c].

Assumption 6.3 ("Weaker" parallel trends assumption based on "not-yet treated" units, with corruption-specific trends). For c = 0, 1, and all g, t = 2, …, T, such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | D_t = 0, Corr = c].

Assumption 6.4 (Parallel trends assumption based on "never treated" units, without corruption-specific trends). For c = 0, 1, and all g, t = 2, …, T, such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | C = 1].

Assumption 6.5 (Parallel trends assumption based on "not-yet treated" units, without corruption-specific trends). For c = 0, 1, and all g, s, t = 2, …, T, such that t ≥ g, s ≥ t,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | D_s = 0].

Assumption 6.6 ("Weaker" parallel trends assumption based on "not-yet treated" units, without corruption-specific trends). For c = 0, 1, and all g, t = 2, …, T, such that t ≥ g,

E[Y_t(0) − Y_{t−1}(0) | G_g = 1, Corr = c] = E[Y_t(0) − Y_{t−1}(0) | D_t = 0].

The difference between these PTAs depends on whether one uses the "never-treated", some "not-yet-treated", or all "not-yet-treated" units as valid comparison groups, and whether one only uses states with the same corruption status (corrupt or non-corrupt) as valid comparison groups. Assumptions 6.1-6.3 do not assume that the evolution of the violation rate is the same between corrupt and non-corrupt states. These three assumptions are the analogues of Assumptions 2.6, 2.7 and 5.1 when one restricts attention to the subset of units with corruption status equal to c. Assumptions 6.4-6.6, on the other hand, assume that, in the absence of treatment, the evolution of the violation rate is the same for corrupt and non-corrupt states, i.e., they rule out corruption-specific trends. As such, one may argue that Assumptions 6.1-6.3 are "weaker" than Assumptions 6.4-6.6.

Next, one can easily leverage any of these PTAs to identify and estimate sensible treatment effect parameters by following the same steps described in Section 2.3.2. The first step toward this goal is to show that the ATT(g, t)'s for the units with corruption status equal to c, c = 0, 1, defined by

ATT(g, t; c) ≡ E[Y_t(1) − Y_t(0) | G_g = 1, Corr = c],

are nonparametrically point-identified for all t ≥ g. However, given the results in Theorem 1 of C&S and the discussion in Sections 2.3.2 and 5, this is a straightforward task.
Indeed, one can easily show that for all t ≥ g, the ATT(g, t; c)'s are nonparametrically identified by

ATT_never(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − E[Y_t − Y_{g−1} | C = 1, Corr = c],   (6.3)
ATT_ny(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − E[Y_t − Y_{g−1} | D_t = 0, Corr = c],   (6.4)
ATT_ny+(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − Σ_{s=g}^{t} E[ΔY_s | D_s = 0, Corr = c],   (6.5)
ATT_never(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − E[Y_t − Y_{g−1} | C = 1],   (6.6)
ATT_ny(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − E[Y_t − Y_{g−1} | D_t = 0],   (6.7)
ATT_ny+(g, t; c) = E[Y_t − Y_{g−1} | G_g = 1, Corr = c] − Σ_{s=g}^{t} E[ΔY_s | D_s = 0],   (6.8)

where (6.3)-(6.5) rely on Assumptions 6.1-6.3, respectively, and (6.6)-(6.8) rely on Assumptions 6.4-6.6, respectively. These results are analogous to (2.25), (2.26) and (5.1). Likewise, all the aforementioned quantities can be estimated using the analogy principle, i.e., by replacing population expectations with sample expectations.

Armed with these estimators, one can form different summary measures for the overall treatment effect following the same steps described in Section 2.3.2. To explicitly show how one can form event-study-type estimators, let
ATT_generic(g, t; c) be a generic notation for the estimands in (6.3)-(6.8), and denote its plug-in estimator by ÂTT_generic(g, t; c). Then, one can estimate the average treatment effect for units with corruption status equal to c that have been treated for e periods by

δ̂^es_generic(e; c) = Σ_{g=2}^{T} Σ_{t=2}^{T} 1{t − g + 1 = e} ŵ(g; e, c) ÂTT_generic(g, t; c),   (6.9)

where the weights are given by

ŵ(g; e, c) ≡ P̂(G_g = 1 | treated for ≥ e periods, Corr = c) = N_{g∩≥e∩c} / N_{≥e∩c},

N_{g∩≥e∩c} denotes the number of observations in group g among those units with corruption status c that have been treated for at least e periods, and N_{≥e∩c} is the number of units with corruption status c that have been treated for at least e periods. Given that our main goal is to compare the evolution of treatment effects between corrupt and non-corrupt states, we can simply compute the difference between δ̂^es_generic(e; 1) and δ̂^es_generic(e; 0). Denote this (generic) estimator by δ̂^es_generic(e; 1 −
0) = δ̂^es_generic(e; 1) − δ̂^es_generic(e; 0).   (6.10)

Here, we stress that, regardless of which of the six different estimators for (6.10) one adopts, they are all directly and explicitly tied to a given PTA, and, by design, they bypass the potential pitfalls associated with the TWFE specification. (Footnote: Although the PTAs 6.4-6.6 lead to overidentification, for the sake of simplicity we do not fully exploit all these restrictions when proposing the aforementioned estimands. When e < 0, we replace ŵ(g; e, c) in (6.9) with its pre-treatment analogue, defined analogously to the weights in Remark 2.1.)

In addition to the event-study estimates, we further aggregate these treatment effect curves into scalar, easy-to-interpret parameters. Toward this end, we report the plug-in estimators for the following two aggregated treatment effect parameters proposed by C&S:

ATT_simple,1−0 = ATT_simple;1 − ATT_simple;0,   (6.11)

where, for c = 0, 1, ATT_simple;c is defined analogously to (2.22), i.e.,

ATT_simple,c = [Σ_{g=2}^{T} Σ_{t=2}^{T} 1{g ≤ t} P(G = g, Corr = c) · ATT(g, t; c)] / [Σ_{g=2}^{T} Σ_{t=2}^{T} 1{g ≤ t} P(G = g, Corr = c)],

and the average of δ^es(e; 1 − 0) over all possible (positive) values of e,

δ^e,avg,1−0 = δ^e,avg;1 − δ^e,avg;0,   (6.12)

where, for c = 0, 1, δ^e,avg;c is defined analogously to (2.24), i.e.,

δ^e,avg;c = (1/(T − 1)) Σ_{e=1}^{T−1} δ^es(e; c).

We report estimators for these functionals that rely on the PTAs 6.1-6.6, respectively. For the sake of comparison, we also report the OLS estimate of β_cfe associated with the following TWFE specification,

Y_it = α_i + α_t + β_fe D_it + β_cfe D_it × Corrupt_i + u_it,   (6.13)

though these estimates are also subject to the pitfalls briefly described in Section 2.3.1.
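To fix ideas, the following sketch implements a plug-in analogue of the "never-treated" estimand (6.3) and the event-study aggregation (6.9) on a small balanced panel. The data layout and array names are hypothetical, and the sketch abstracts from the facility weights and the bootstrap inference used in the actual application:

```python
import numpy as np

def att_never(y, group, corr, g, t, c):
    """Sample analogue of (6.3): the change Y_t - Y_{g-1} for cohort g,
    minus the same change for never-treated units, within corruption
    status c. y is (n_units, T) with columns for periods 1..T; group is
    the first treatment period (0 = never treated); corr is 0/1."""
    treated = (group == g) & (corr == c)
    never = (group == 0) & (corr == c)
    # columns are 0-indexed, so period t is column t-1 and period g-1 is g-2
    return (y[treated, t - 1] - y[treated, g - 2]).mean() - \
           (y[never, t - 1] - y[never, g - 2]).mean()

def event_study(y, group, corr, e, c):
    """Event-study aggregation (6.9): average ATT(g, g+e-1; c) across
    cohorts, weighting each cohort g by its share among status-c units
    that have been treated for at least e periods."""
    T = y.shape[1]
    cohorts = [g for g in np.unique(group) if g > 0 and g + e - 1 <= T]
    n = np.array([((group == g) & (corr == c)).sum() for g in cohorts], float)
    atts = np.array([att_never(y, group, corr, g, g + e - 1, c) for g in cohorts])
    return float(n @ atts / n.sum())

# Hypothetical panel: 4 states, 4 periods; states 0-1 adopt in period 2 and
# their outcome jumps by 1 from then on; states 2-3 are never treated.
y = np.array([[0.0, 2.0, 3.0, 4.0],
              [0.5, 2.5, 3.5, 4.5],
              [1.0, 2.0, 3.0, 4.0],
              [1.5, 2.5, 3.5, 4.5]])
group = np.array([2, 2, 0, 0])
corr = np.array([0, 0, 0, 0])
est = event_study(y, group, corr, e=1, c=0)  # recovers the built-in effect of 1.0
```

The corruption-heterogeneity estimator (6.10) is then just `event_study(..., c=1) - event_study(..., c=0)`, computed on data containing both corruption statuses.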
Figure 6.3 displays the results based on the TWFE specification (6.1), and those based on the event-study estimators δ̂^es_never(e), δ̂^es_ny(e), and δ̂^es_ny+(e). We report the point estimates associated with 20 treatment leads and 20 treatment lags (red line), their associated 90% pointwise confidence intervals (dark-shaded area), and 90% simultaneous confidence intervals (light-shaded area) — we do not report simultaneous confidence intervals for the TWFE specification, as these are usually not reported by practitioners who adopt such specifications.

It is important to emphasize that, in each of the panels in Figure 6.3, we have 40 different estimates, one for each considered e. Pointwise inference procedures proceed "as if" one were conducting a single hypothesis test, and report a standard confidence interval for each e. Failing to account for the fact that one is performing 40 different hypothesis tests may lead to significant treatment effects and/or pre-trends that emerge simply by chance. Simultaneous confidence intervals, on the other hand, account for this multiple-testing problem, and asymptotically cover the entire event-study curve with probability 1 − α, where α is the significance level. As such, simultaneous confidence intervals are suitable for analyzing global properties of the event-study curve, such as monotonicity and the presence of statistically nonzero effects. In practice, one simply has to replace the commonly used critical value (say, 1.645 for a 90% confidence interval) with one simulated via a bootstrap procedure akin to Algorithm B.1; see Section 4 of C&S for additional details.

The results shown in Figure 6.3 suggest that, regardless of the PTA and the estimator used, there is little to no evidence that the transition to state control decreased violation rates. Despite the similarity in terms of conclusions, we find that comparing the results from each specification highlights some interesting practical features.
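The sup-t idea behind these simultaneous bands can be sketched in a few lines: given centered bootstrap draws of the event-study curve, one takes the 1 − α quantile of the maximal absolute t-statistic across event times as the critical value. This is a generic illustration of the principle, not the exact Algorithm B.1:

```python
import numpy as np

def sup_t_critical_value(boot_draws, se, alpha=0.10):
    """boot_draws: (n_boot, n_event_times) centered bootstrap draws of the
    event-study curve; se: (n_event_times,) standard errors. Returns the
    critical value c* so that [estimate +/- c* * se] covers the entire
    curve with asymptotic probability 1 - alpha."""
    max_abs_t = np.max(np.abs(boot_draws) / se, axis=1)
    return float(np.quantile(max_abs_t, 1 - alpha))

# Toy example: 40 event-time coefficients (as in Figure 6.3) with
# independent standard-normal bootstrap draws
rng = np.random.default_rng(0)
draws = rng.standard_normal((1000, 40))
crit = sup_t_critical_value(draws, draws.std(axis=0))
# crit exceeds the pointwise 90% critical value of roughly 1.645,
# widening the bands to account for the 40 simultaneous tests
```

Because the critical value is simulated from the joint distribution of the draws, it automatically adapts to the correlation across event times; with strongly correlated coefficients it is closer to 1.645, and with nearly independent ones it is substantially larger.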
For instance, the point estimates associated with the TWFE specification (Panel (a)) and with the estimator that uses the "all not-yet-treated" states as a comparison group (Panel (d)) are close to each other, whereas using "never-treated" states as a comparison group (Panel (b)) suggests a slightly stronger long-run effect. Furthermore, when using "all not-yet-treated" states as a comparison group (Panel (d)), the (simultaneous) confidence interval is tighter, suggesting, as we discussed in Section 5, that it makes more efficient use of the available data. In terms of interpreting the pre-treatment coefficients, the pre-trend point estimates when using the "not-yet-treated" comparison group in Panel (c) are closer to zero than when one uses the "never-treated" states as a comparison group in Panel (b). It is also very noticeable that the pre-treatment trends in Panel (d) are very precisely estimated zeros. However, it is important to recall from Remark 5.1 that these pre-treatment coefficients should not be directly compared to the other pre-treatment trends, as they measure "local deviations" from zero pre-treatment trends rather than "cumulative deviations" of pre-treatment trends.

Although these estimators lead to similar conclusions, they are not (a priori) "made equal". As highlighted by S&A and discussed in Section 2.3.1, the β_e's associated with the TWFE specification (6.1) are not guaranteed to have a clear causal interpretation, even when one invokes the PTA 2.5, which, in our application, imposes 195 overidentifying restrictions on the evolution of violation rates across states.
The estimators in Panels (b), (c), and (d), on the other hand, are designed to bypass the potential pitfalls of the TWFE specification, and rely on clearly stated parallel trends assumptions (Assumptions 2.6, 2.7, and 5.1, respectively).

As discussed in Section 2.3.2, there are multiple sensible measures that one can use to summarize the overall effect of state authorization on violation rates across all treated states. For instance, one can use the "simple" average of all the ATT(g, t) with t ≥ g, ATT_simple, as defined in (2.22), or the average of the event-study-type estimands δ^es(e) over the positive values of e, δ^e,avg, as defined in (2.24). Table 6.1 shows the estimates of these parameters when one adopts Assumption 2.6 (Column (1)), Assumption 2.7 (Column (2)), or Assumption 5.1 (Column (3)). For the sake of comparison, we also report the OLS estimate of β_fe (Column (4)). Standard errors, clustered at the state level, are reported in parentheses, and 90% confidence intervals are reported in brackets. Essentially, all these summary measures indicate that state authorization has close to zero effect on violation

[Figure 6.3: Event-study analysis of violation rate: baseline results. Panels: (a) event study based on TWFE specification; (b) event study using never-treated units as comparison group; (c) event study using not-yet-treated units as comparison group; (d) event study using all not-yet-treated units as comparison group.]

Notes: Red line displays the point estimate, dark-shaded area the 90% pointwise confidence interval, and the light-shaded area the 90% simultaneous confidence band.
Panel (a) displays the ordinary least squares (OLS) estimates of the β_e associated with the two-way fixed-effects linear regression specification (6.1); Panel (b) displays the results based on (2.23) that uses (2.27) as an estimator for ATT(g, t); Panel (c) displays the results based on (2.23) that uses (2.19) as an estimator for ATT(g, t); Panel (d) displays the results based on (2.23) that uses (5.5) as an estimator for ATT(g, t). All standard errors are clustered at the state level, though the standard errors in Panel (a) are based on analytical results, whereas those in Panels (b)-(d) are based on the multiplicative bootstrap procedure discussed in Algorithm B.1 and in C&S (we use 1,000 bootstrap draws). The critical value for the simultaneous confidence bands is computed using Algorithm B.1 (which is akin to the one proposed by C&S).

Table 6.1:

Summary measures   Never-treated   Not-yet-treated   All not-yet-treated   TWFE
                   (1)             (2)               (3)                   (4)
ATT_simple         −0.017            −0.010            −0.003            —
                   (0.009)           (0.009)           (0.006)           —
                   [−0.032, 0.001]   [−0.024, 0.004]   [−0.014, 0.008]   —
δ^e,avg            −0.015            −0.008            −0.003            —
                   (0.007)           (0.006)           (0.004)           —
                   [−0.027, −0.002]  [−0.017, 0.002]   [−0.010, 0.004]   —
TWFE               —                 —                 —                 −0.003
                                                                         (0.010)
                                                                         [−0.019, 0.013]

Notes: The table reports point estimates, cluster-robust standard errors (in parentheses), and 90% confidence intervals (in brackets) for the effect of state authorization on violation rates.
ATT_simple is as defined in (2.22) and denotes the weighted average of all post-treatment ATT(g, t)'s. δ^e,avg is as defined in (2.24) and denotes the time-average of all event-study parameters δ^es(e), e > 0. TWFE refers to the ordinary least squares estimate of β_fe in the TWFE linear regression specification (2.16), which is invariant to the comparison group being used. Column (1) displays the results that use (2.27) as an estimator for ATT(g, t), Column (2) displays the results that use (2.19) as an estimator for ATT(g, t), and Column (3) displays the results that use (5.5) as an estimator for ATT(g, t). Column (4) displays the result using the TWFE regression specification. Standard errors are clustered at the state level and, with the exception of the TWFE summary measure, are computed using the multiplicative bootstrap procedure described in Algorithm B.1, which is akin to the one proposed by C&S. We use 1,000 bootstrap draws.

rates, which is in line with the findings from Grooms (2015).
Next, we analyze whether the effect of state authorization on violation rates varies depending on whether a state has a long prevalence of corruption. Panel (a) of Figure 6.4 displays the OLS estimates of the β_ce's, together with the 90% pointwise confidence intervals. All standard errors are clustered at the state level. Consistent with the findings from Grooms (2015), the results suggest that states with high levels of corruption have a lower violation rate after authorization relative to non-corrupt states, and the relative drop in the violation rate appears to increase with elapsed treatment time.

Panels (b), (c), and (d) of Figure 6.4 present the event-study estimates (6.10) based on the PTAs 6.1, 6.2, and 6.3 that allow for corruption-specific trends, whereas Panels (b), (c), and (d) of Figure 6.5 present the event-study estimates (6.10) based on the PTAs 6.4, 6.5, and 6.6 that do not allow for corruption-specific trends. For comparison purposes, Panel (a) of Figures 6.4 and 6.5 displays the OLS estimates of the β_ce's associated with the TWFE specification (6.2). Like before, all estimators are weighted by total facilities in a state, all standard errors are clustered at the state level, and we report both pointwise and simultaneous 90% confidence intervals.

The results in Figures 6.4 and 6.5 reveal the practical relevance of being explicit about the underlying PTA in a given application.
For instance, Figure 6.4 suggests that when one invokes Assumption 6.1 (Panel (b)), Assumption 6.2 (Panel (c)), or Assumption 6.3 (Panel

[Figure 6.4: Event-study analysis of violation rate: difference between corrupt and non-corrupt states, allowing different counterfactual trends between corrupt and non-corrupt states. Panels: (a) event study based on TWFE specification; (b) event study using the never-treated units as comparison group, allowing for corruption-specific trends; (c) event study using the not-yet-treated units as comparison group, allowing for corruption-specific trends; (d) event study using all not-yet-treated units as comparison group, allowing for corruption-specific trends.]

Notes: Red line displays the point estimate, dark-shaded area the 90% pointwise confidence interval, and the light-shaded area the 90% simultaneous confidence band. Panel (a) displays the results based on the OLS estimates of the β_ce's in the TWFE specification (6.2); Panel (b) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.1; Panel (c) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.2; Panel (d) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.3. All standard errors are clustered at the state level, though the standard errors in Panel (a) are based on analytical results, whereas those in Panels (b)-(d) are based on the multiplicative bootstrap procedure discussed in Algorithm B.1, which is similar to the C&S proposal (we use 1,000 bootstrap draws).
The critical value for the simultaneous confidence bands is computed using Algorithm B.1.

[Figure 6.5: Event-study analysis of violation rate: difference between corrupt and non-corrupt states, not allowing different counterfactual trends between corrupt and non-corrupt states. Panels: (a) event study based on TWFE specification; (b) event study using the never-treated units as comparison group, not allowing for corruption-specific trends; (c) event study using the not-yet-treated units as comparison group, not allowing for corruption-specific trends; (d) event study using all not-yet-treated units as comparison group, not allowing for corruption-specific trends.]

Notes: Red line displays the point estimate, dark-shaded area the 90% pointwise confidence interval, and the light-shaded area the 90% simultaneous confidence band. Panel (a) displays the results based on the OLS estimates of the β_ce's in the TWFE specification (6.2); Panel (b) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.4; Panel (c) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.5; Panel (d) displays the results based on the event-study estimator (6.9) that relies on the PTA 6.6. All standard errors are clustered at the state level, though the standard errors in Panel (a) are based on analytical results, whereas those in Panels (b)-(d) are based on the multiplicative bootstrap procedure discussed in Algorithm B.1, which is similar to the C&S proposal (we use 1,000 bootstrap draws). The critical value for the simultaneous confidence bands is computed using Algorithm B.1.

β_cfe associated with the TWFE specification, shown in (6.13).

Table 6.2: Effect of authorization on violation rate: corrupt vs. not corrupt states.
                   Allowing for corruption-specific trends                    Not allowing for corruption-specific trends
Summary measures   Never-treated   Not-yet-treated   All not-yet-treated  |  Never-treated   Not-yet-treated   All not-yet-treated  |  TWFE
                   (1)             (2)               (3)                  |  (4)             (5)               (6)                  |  (7)
ATT_simple,1−0     −0.007            −0.008            −0.014            −0.035             −0.035             −0.033             —
                   (0.014)           (0.014)           (0.013)           (0.012)            (0.012)            (0.010)            —
                   [−0.030, 0.017]   [−0.031, 0.016]   [−0.036, 0.008]   [−0.054, −0.016]   [−0.054, −0.015]   [−0.049, −0.017]   —
δ^e,avg,1−0        −0.001            −0.002            −0.009            −0.024             −0.025             −0.028             —
                   (0.014)           (0.016)           (0.014)           (0.013)            (0.012)            (0.011)            —
                   [−0.024, 0.022]   [−0.028, 0.024]   [−0.031, 0.013]   [−0.045, −0.003]   [−0.045, −0.005]   [−0.047, −0.009]   —
TWFE               —                 —                 —                 —                  —                  —                  −0.037
                                                                                                                                  (0.010)
                                                                                                                                  [−0.054, −0.020]

Notes: The table reports point estimates, cluster-robust standard errors (in parentheses), and 90% confidence intervals (in brackets) for the effect of state authorization on violation rates.
ATT_simple,1−0 is as defined in (6.11) and denotes the difference in the weighted average of all post-treatment ATT(g, t; c)'s between corrupt and non-corrupt states. δ^e,avg,1−0 is as defined in (6.12) and denotes the difference in the time-average of all event-study parameters δ^es(e; c), e > 0, between corrupt and non-corrupt states. TWFE refers to the ordinary least squares estimate of β_cfe in the TWFE linear regression specification (6.13), which is invariant to the comparison group being used. Columns (1)-(6) display the results that rely on the PTAs 6.1-6.6, respectively. Standard errors are clustered at the state level and, with the exception of the TWFE summary measure, are computed using the multiplicative bootstrap procedure presented in Algorithm B.1, which is akin to the C&S proposal. We use 1,000 bootstrap draws.

The results in Table 6.2 reinforce the message from Figures 6.4 and 6.5: when one allows for corruption-specific trends and relies on PTAs 6.1, 6.2, or 6.3 (Columns (1), (2), and (3), respectively), one finds essentially no evidence that the effect of state authorization on violation rates varies by state corruption. On the other hand, when one relies on the "stronger" PTAs 6.4, 6.5, or 6.6 (Columns (4), (5), and (6), respectively), one finds evidence that corrupt states experienced a large decrease in violation rates after state authorization relative to non-corrupt states. This latter result is in agreement with the TWFE specification, whereas the former is not.
Conclusion
In this paper, we have highlighted the important role played by the parallel trends assumption in event-study settings in terms of identification, estimation, and summary of different treatment effect parameters. We first showed that, when there is variation in treatment timing, researchers may adopt different types of parallel trends assumptions and identify/estimate different treatment effect parameters. Next, we discussed the practical implications of adopting different parallel trends assumptions, and discussed how one constructs estimators that make use of all the restrictions implied by the underlying PTA. Here, we documented an interesting "robustness" vs. "efficiency" trade-off in terms of the strength of the underlying PTA, and argued that one should take this into consideration whenever employing a DID-type analysis. Importantly, we advocate that one should always attempt to be explicit about the parallel trends assumption invoked in the study, as this usually translates into a more transparent and objective analysis. We showed how one can form semiparametrically efficient DID estimators by fully exploiting all the empirical content of the underlying PTA via the traditional GMM approach. We also proposed an alternative, simpler-to-use DID estimator that does not restrict pre-treatment trends when one wants to use "not-yet-treated" units as a comparison group, and, at the same time, makes use of more groups than other available DID estimators. Finally, we illustrated the practical importance of being explicit about the PTA via an empirical application about the effect of the transition from federal to state management of the Clean Water Act on compliance rates. Our results suggest that the conclusion that corrupt states see a decline in the violation rate after program authorization relative to non-corrupt treated states depends on the type of PTA adopted.
References
Abadie, Alberto, "Semiparametric Difference-in-Differences Estimators," Review of Economic Studies, 2005.

Anderson, Michael L., "Multiple inference and gender differences in the effects of early intervention: A reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects," Journal of the American Statistical Association, 2008, (484).

Angrist, Joshua D. and Jörn-Steffen Pischke, Mostly Harmless Econometrics: An Empiricist's Companion, Princeton, NJ: Princeton University Press, 2009.

Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager, "Synthetic Difference in Differences," arXiv preprint arXiv:1812.09970, 2018.

Athey, Susan and Guido W. Imbens, "Design-based Analysis in Difference-in-Differences Settings with Staggered Adoption," arXiv preprint arXiv:1808.05293, 2018.

Borusyak, Kirill and Xavier Jaravel, "Revisiting Event Study Designs," Unpublished Manuscript, Department of Economics, Harvard University, 2017.

Callaway, Brantly and Pedro H. C. Sant'Anna, "Difference-in-Differences with Multiple Time Periods," arXiv preprint arXiv:1803.09015, 2020.

Chernozhukov, Victor, Iván Fernández-Val, Jinyong Hahn, and Whitney Newey, "Average and Quantile Effects in Nonseparable Panel Models," Econometrica, 2013, (2).

Cunningham, Scott, Causal Inference: The Mixtape, v.1.7.

de Chaisemartin, Clément and Xavier D'Haultfœuille, "Two-way Fixed Effects Estimators with Heterogeneous Treatment Effects," American Economic Review, 2020, (9).

Ferman, Bruno and Cristine Pinto, "Inference in Differences-in-Differences with Few Treated Groups and Heteroskedasticity," The Review of Economics and Statistics, 2019, (3).

Gibbons, Charles E., Juan Carlos Suárez Serrato, and Michael B. Urbancic, "Broken or Fixed Effects?," Journal of Econometric Methods, 2018, (1).

Goodman-Bacon, Andrew, "Difference-in-Differences with Variation in Treatment Timing," NBER Working Paper No. 25018, 2019.

Grooms, Katherine K., "Enforcing the Clean Water Act: The effect of state-level corruption on compliance," Journal of Environmental Economics and Management, 2015.

Han, Sukjin, "Identification in Nonparametric Models for Dynamic Treatment Effects," Journal of Econometrics, 2019, Forthcoming.

Hansen, Lars Peter, "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 1982, (4).

Heckman, James J., Hidehiko Ichimura, Jeffrey Smith, and Petra Todd, "Characterizing Selection Bias using Experimental Data," Econometrica, 1998, (5).

Laporte, Audrey and Frank Windmeijer, "Estimation of panel data models with binary indicators when treatment effects are not constant over time," Economics Letters, 2005, (3).

Lechner, Michael, "The Estimation of Causal Effects by Difference-in-Difference Methods," Foundations and Trends in Econometrics, 2010, (3).

Rambachan, Ashesh and Jonathan Roth, "An Honest Approach to Parallel Trends," Working Paper, Department of Economics, Harvard University, 2019.

Romano, Joseph P. and Michael Wolf, "Stepwise multiple testing as formalized data snooping," Econometrica, 2005, (4).

Romano, Joseph P., Azeem M. Shaikh, and Michael Wolf, "Hypothesis Testing in Econometrics," Annual Review of Economics, 2010, (1).

Roth, Jonathan, "Pre-test with Caution: Event-study Estimates After Testing for Parallel Trends," Working Paper, Department of Economics, Harvard University, 2020.

Rubin, Donald B., "The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials," Statistics in Medicine, 2007, (1).

Rubin, Donald B., "For objective causal inference, design trumps analysis," Annals of Applied Statistics, 2008, (3).

Sant'Anna, Pedro H. C. and Jun B. Zhao, "Doubly Robust Difference-in-Differences Estimators," Journal of Econometrics, 2020, Forthcoming.

Sun, Liyang and Sarah Abraham, "Estimating Dynamic Treatment Effects in Event Studies With Heterogeneous Treatment Effects," Working Paper, Department of Economics, MIT, 2020.

Wooldridge, Jeffrey M., "Fixed-Effects and Related Estimators for Correlated Random-Coefficient and Treatment-Effect Panel Data Models," Review of Economics and Statistics, 2005, 87.