Using Multiple Pre-treatment Periods to Improve Difference-in-Differences and Staggered Adoption Designs

Abstract

While difference-in-differences (DID) was originally developed with one pre-treatment and one post-treatment period, data from additional pre-treatment periods are often available. How can researchers improve the DID design with such multiple pre-treatment periods, and under what conditions? We first use potential outcomes to clarify three benefits of multiple pre-treatment periods: (1) assessing the parallel trends assumption, (2) improving estimation accuracy, and (3) allowing for a more flexible parallel trends assumption. We then propose a new estimator, double DID, which combines all the benefits through the generalized method of moments and contains the two-way fixed effects regression as a special case. In a wide range of applications where several pre-treatment periods are available, the double DID improves upon the standard DID both in terms of identification and estimation accuracy. We also generalize the double DID to the staggered adoption design where different units can receive the treatment in different time periods. We illustrate the proposed method with two empirical applications, covering both the basic DID and staggered adoption designs. We offer an open-source R package that implements the proposed methodologies.


Naoki Egami † and Soichiro Yamauchi ‡

This version: February 22, 2021. First draft: December 6, 2019.

∗ The methods proposed in this article can be implemented via the open-source R package DIDdesign, available at https://github.com/naoki-egami/DIDdesign . We are grateful to Edmund Malesky, Cuong Viet Nguyen, and Anh Tran for providing us with data and answering our questions. We also thank Adam Glynn, Chad Hazlett, Shiro Kuriwaki, Ian Lundberg, John Marshall, Xiang Zhou, and participants of the 2019 Summer Meetings of the Political Methodology Society and the 2019 American Political Science Association Annual Conference for helpful comments and discussions.

† Assistant Professor, Department of Political Science, Columbia University, New York NY 10027. URL: https://naokiegami.com

‡ PhD Candidate, Department of Government, Harvard University, Cambridge MA 02138. URL: https://soichiroy.github.io

1 Introduction

Over the last few decades, social scientists have developed and applied various approaches to make credible causal inferences from observational data. One of the most popular and successful is the difference-in-differences (DID) design (Bertrand, Duflo and Mullainathan, 2004; Angrist and Pischke, 2008). In its most basic form, we compare treatment and control groups over two time periods: one before and the other after the treatment assignment.

In practice, it is common to apply the DID method with additional pre-treatment periods. However, in contrast to the basic two-time-period case, there are a number of different ways to analyze the DID design with multiple pre-treatment periods. One popular approach is to apply the two-way fixed effects regression to all time periods and supplement it with various robustness checks (e.g., Dube, Dube and García-Ponce, 2013; Truex, 2014; Earle and Gehlbach, 2015; Hall, 2016; Larreguy and Marshall, 2017). Another is to stick with the two-time-period DID and limit the use of additional pre-treatment periods to the assessment of pre-treatment trends (e.g., Ladd and Lenz, 2009; Bechtel and Hainmueller, 2011; Bullock and Clinton, 2011; Keele and Minozzi, 2013; Garfias, 2018).

This variation of approaches raises an important practical question: how should analysts incorporate multiple pre-treatment periods into the DID design, and under what assumptions? In Section 2, we begin by examining three benefits of multiple pre-treatment periods using potential outcomes (Neyman, 1923; Rubin, 1974): (1) assessing the parallel trends assumption, (2) improving estimation accuracy, and (3) allowing for a more flexible parallel trends assumption. While these benefits have been discussed in the literature, we revisit them to clarify that each benefit requires different assumptions and different estimators. As a result, in practice, researchers tend to enjoy only a subset of the three benefits they can exploit from multiple pre-treatment periods.
This methodological synthesis serves as a foundation to develop our new approach.

Our main contribution is to propose a new, simple estimator that achieves all three benefits together. We use the generalized method of moments (GMM) framework (Hansen, 1982) to develop the double difference-in-differences (double DID). At its core, we combine two popular DID estimators: the standard DID estimator, which relies on the canonical parallel- [...]

Footnote: In our literature review of American Political Science Review and American Journal of Political Science between 2015 and 2019, we found that about 63% of the papers that use the DID design have more than one pre-treatment period. See Appendix A for details about our literature review.

[...] pre-treatment periods to further relax the parallel trends assumption (Section 3.4). This is important because researchers might be worried not only about time-varying unmeasured confounders that linearly change over time, but also about more general forms of time-varying unmeasured confounders. When there exist K pre-treatment periods, our proposed approach can accommodate time-varying unmeasured confounders that follow (K − 1)th-order polynomial functions. Thus, we can account for more flexible forms of time-varying unmeasured confounding as we have more pre-treatment periods. Second, we also incorporate any number of post-treatment periods so that researchers can estimate not only short-term causal effects but also longer-term causal effects (Section 3.4). Finally, we generalize our double DID estimator to the staggered adoption design where different units can receive the treatment in different time periods (Section 4). This design is increasingly popular in political science and in the social sciences in general (e.g., Athey and Imbens, 2018; Ben-Michael, Feller and Rothstein, 2019). We show that the double DID estimator achieves the same three benefits together even in this more general pattern of the treatment assignment.

We offer a companion R package DIDdesign that implements the proposed methods. We illustrate our proposed methods in Section 6, where we provide two empirical applications. The first application revisits Malesky, Nguyen and Tran (2014), which studies how the abolition of elected councils affects local public services. This serves as an example of the basic DID design where treatment assignment happens only once.
Our second application is a reanalysis of Paglayan (2019), which examines the effect of granting collective bargaining rights to teachers' unions on educational expenditures and teachers' salaries. This is an example of the staggered adoption design where different units can receive the treatment in different time periods.

Related Literature.

This paper builds on the large literature on time-series cross-sectional data (e.g., De Boef and Keele, 2008; Beck and Katz, 2011; Blackwell and Glynn, 2018) and is connected to other popular methodologies for making causal inferences. Generalizing the well-known case of two periods and two groups (e.g., Abadie, 2005), recent papers use potential outcomes to unpack the nonparametric connection between the DID and two-way fixed effects regression estimators, thereby proposing extensions to relax strong parametric and causal assumptions (e.g., Imai and Kim, 2011; Athey and Imbens, 2018; Goodman-Bacon, 2018; Strezhnev, 2018; Imai and Kim, 2019, 2020). Our paper also uses potential outcomes to clarify nonparametric foundations on the use of multiple pre-treatment periods. The key difference is that, while this recent literature mainly considers identification under the parallel trends assumption, we study both estimation accuracy and identification under more flexible assumptions of trends. We do so both in the basic DID setup and in the staggered adoption design.

Our double DID estimator contains the sequential DID estimator (e.g., Lee, 2016; Mora and Reggio, 2019) as a special case. There are two key advantages over existing approaches. First, when the parallel trends assumption holds, the double DID optimally combines the standard DID and the sequential DID to improve efficiency, and thus is not equal to the sequential DID. Importantly, this avoids a dilemma of the sequential DID: it is consistent under assumptions weaker than the parallel trends, but is less efficient when the parallel trends assumption holds. Second, while the sequential DID has only been developed for the basic DID design where treatment assignment happens only once, we generalize it to the staggered adoption design, and further incorporate it into our proposed staggered-adoption double DID estimator.

Another class of popular methods is the synthetic control method (Abadie, Diamond and Hainmueller, 2010) and its recent extensions (e.g., Xu, 2017; Athey et al., 2017; Ben-Michael, Feller and Rothstein, 2018; Hazlett and Xu, 2018), which estimate a weighted average of control units to approximate a treated unit. As carefully noted in those papers, such methodologies require long pre-treatment periods to accurately estimate a pre-treatment trajectory of the treated unit (Abadie, Diamond and Hainmueller, 2010); for example, Xu (2017) recommends collecting more than ten pre-treatment periods. In contrast, the proposed double DID can be applied as long as there is more than one pre-treatment period, and is better suited when there are a small to moderate number of pre-treatment periods. However, we also show in Section 6.2 that the double DID can achieve performance comparable to variants of synthetic control methods even when there are a large number of pre-treatment periods. We offer additional discussions about relationships between our proposed approach and synthetic control methods in Section 5.3.

The difference-in-differences (DID) design is one of the most widely used methods to make causal inference from observational studies (Imbens and Wooldridge, 2009). At its most basic, the DID design consists of treatment and control groups measured at two time periods, before and after the treatment assignment. While this basic DID design only requires data from one post- and one pre-treatment period, additional pre-treatment periods are often available in applied contexts. Unfortunately, however, assumptions behind different uses of multiple pre-treatment periods have often remained unstated.

In this section, we use potential outcomes to discuss three well-known practical benefits of multiple pre-treatment periods: (1) assessing the parallel trends assumption, (2) improving estimation accuracy, and (3) allowing for a more flexible parallel trends assumption. This section serves as a methodological foundation for developing a new approach in Sections 3 and 4.

As our running example, we focus on a study of how the abolition of elected councils affects local public services. Malesky, Nguyen and Tran (2014) use the DID design to examine the effect of recentralization efforts in Vietnam. The abolition of elected councils, the main treatment of interest, was implemented in 2009 in about 12% of all the communes, which are the smallest administrative units that the paper considers. For each commune, a variety of outcomes related to public services, such as the quality of infrastructure, were measured in 2006, 2008 and 2010: two pre-treatment periods and one post-treatment period. With this DID design, Malesky, Nguyen and Tran (2014) aim to estimate the causal effect of abolishing elected councils on various measures of local public services. To introduce the setup of the design, we focus only on the basic aspects of the study here and discuss further details when we reanalyze it in Section 6.

Footnote: In our literature review of APSR and AJPS, we found that most DID applications have fewer than 10 pre-treatment periods. The median number of pre-treatment periods is [...] and the mean number of pre-treatment periods is about [...] after removing one unique study that has more than [...] pre-treatment periods. See Appendix A for more details about our literature review.

To begin with, let $D_{it}$ denote the binary treatment for unit $i$ in time period $t$ so that $D_{it} = 1$ if the unit is treated in time period $t$, and $D_{it} = 0$ otherwise. In this section, we consider two pre-treatment time periods $t \in \{0, 1\}$ and one post-treatment period $t = 2$. We choose this setup here because it is sufficient for examining benefits of multiple pre-treatment periods, but we also generalize our discussions and methods to an arbitrary number of pre- and post-treatment periods (Section 3.4), and to the staggered adoption design (Section 4), to increase the applicability of our methods in practice. In our running example, the two pre-treatment periods are 2006 and 2008, and the post-treatment period is 2010. Thus, the treatment group receives the treatment only at time $t = 2$, that is, $D_{i0} = D_{i1} = 0$ and $D_{i2} = 1$, whereas units in the control group never get treated, $D_{i0} = D_{i1} = D_{i2} = 0$. We refer to the treatment group as $G_i = 1$ and the control group as $G_i = 0$. Outcome $Y_{it}$ is measured at time $t \in \{0, 1, 2\}$.
In addition to panel data where the same units are measured over time, the DID design accommodates repeated cross-sectional data as in our running example, in which different communes are sampled at three time periods.

To define causal effects of interest, we rely on the potential outcomes framework (Neyman, 1923; Rubin, 1974). For each time period, $Y_{it}(1)$ represents the quality of infrastructure that commune $i$ would achieve in time period $t$ if commune $i$ had abolished elected councils. $Y_{it}(0)$ is similarly defined. For an individual commune, the causal effect of abolishing elected councils on the quality of infrastructure in time period $t$ is the difference $Y_{it}(1) - Y_{it}(0)$. As the treatment is assigned in the second time period, causal effects are defined only for time $t = 2$, $Y_{i2}(1) - Y_{i2}(0)$. In the DID design, we are interested in estimating the average treatment effect for treated units (ATT) (Angrist and Pischke, 2008):

$$ \tau = E[Y_{i2}(1) - Y_{i2}(0) \mid G_i = 1], \tag{1} $$

where the expectation is over units in the treatment group $G_i = 1$ so that this causal estimand is the average of individual causal effects for units that receive the treatment.

DID with One Pre-Treatment Period

Before we discuss benefits of multiple pre-treatment periods from Section 2.2, we briefly review the DID with one pre-treatment period to fix ideas for settings with multiple pre-treatment periods.

In the most basic DID design where we have only one pre-treatment period (i.e., $t = 1$ is the pre-treatment period and $t = 2$ is the post-treatment period), researchers can identify the ATT based on the widely-used assumption of parallel trends: if the treatment group had not received the treatment in the second period, its outcome trend would have been the same as the trend of the outcome in the control group (Angrist and Pischke, 2008).

Assumption 1 (Parallel Trends).

$$ E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 1] = E[Y_{i2}(0) \mid G_i = 0] - E[Y_{i1}(0) \mid G_i = 0], \tag{2} $$

where the left-hand side is the trend in outcomes for the treatment group $G_i = 1$ and the right-hand side is the one for the control group $G_i = 0$.

Under the parallel trends assumption, we estimate the ATT via the difference-in-differences estimator:

$$ \hat{\tau}_{\text{DID}} = \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right), \tag{3} $$

where $n_{1t}$ and $n_{0t}$ are the number of units in the treatment and control groups at time $t \in \{1, 2\}$, respectively.

In practice, we can compute the DID estimator via a linear regression. We regress the outcome $Y_{it}$ on an intercept, treatment group indicator $G_i$, time indicator $I_t$ (equal to 1 if post-treatment and 0 otherwise) and the interaction between the treatment group indicator and the time indicator $G_i \times I_t$:

$$ Y_{it} \sim \alpha + \theta G_i + \gamma I_t + \beta (G_i \times I_t), \tag{4} $$

where $(\alpha, \theta, \gamma, \beta)$ are the corresponding coefficients. In this case, the coefficient of the interaction term $\beta$ is numerically equal to the DID estimator, $\hat{\tau}_{\text{DID}}$. Importantly, the linear regression is used here only to compute the nonparametric DID estimator (equation (3)), and thus it does not require any parametric modeling assumption such as constant treatment effects. Furthermore, when we analyze panel data in which the same units are observed repeatedly over time, we obtain exactly the same estimate via a linear regression with unit and time fixed effects. This numerical equivalence in the two-time-period case is often the justification of the two-way fixed effects regression as the DID design (Angrist and Pischke, 2008). The above equivalence is formally shown in Appendix B.1 for completeness.
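The numerical equivalence between the difference of group means (equation (3)) and the interaction coefficient in the regression (equation (4)) can be checked directly. The following is a minimal sketch, not the authors' implementation: the simulated data, the function names `did_from_means` and `did_from_regression`, and the coefficient values used to generate `Y` are all illustrative assumptions.

```python
import numpy as np

# Hypothetical balanced repeated cross-section (illustrative data, not from the paper):
# G = treatment-group indicator, I = post-period indicator (1 for t = 2, 0 for t = 1).
rng = np.random.default_rng(0)
n = 400
G = np.repeat([0, 1], n // 2)
I = np.tile([0, 1], n // 2)
Y = 1.0 + 0.5 * G + 0.3 * I + 2.0 * G * I + rng.normal(size=n)


def did_from_means(Y, G, I):
    """Equation (3): difference between the two groups' before/after mean changes."""
    def cell_mean(g, t):
        return Y[(G == g) & (I == t)].mean()
    return (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))


def did_from_regression(Y, G, I):
    """Equation (4): OLS coefficient on the interaction G x I."""
    X = np.column_stack([np.ones(len(Y)), G, I, G * I])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef[3]


tau_means = did_from_means(Y, G, I)
tau_ols = did_from_regression(Y, G, I)
print(np.isclose(tau_means, tau_ols))
```

Because the regression is saturated in the four group-by-period cells, the two numbers agree up to floating-point error regardless of how the outcome was generated, which is the sense in which the regression imposes no parametric modeling assumption here.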

We now consider how researchers can exploit multiple pre-treatment periods, while clarifying underlying necessary assumptions.

The first and the most common use of multiple pre-treatment periods is to assess the identification assumption of parallel trends. Because the validity of the DID design rests on the parallel trends assumption, it is critical to evaluate its plausibility in any application. However, the parallel trends assumption itself involves counterfactual outcomes, and thus analysts cannot empirically test it directly. Instead, we often investigate whether trends for treatment and control groups are parallel in pre-treatment periods (Angrist and Pischke, 2008). For example, researchers assess whether trends in the infrastructure quality from 2006 to 2008, before the treatment is implemented in 2009, are the same for treatment and control communes.

Thus, researchers often estimate the DID for the pre-treatment periods:

$$ \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right). \tag{5} $$

We then check whether the DID estimate on pre-treatment periods is statistically distinguishable from zero. For example, we can apply the DID estimator to 2006 and 2008 as if 2008 were the post-treatment period, and assess whether the estimate would be close to zero. In Figure 1, a DID estimate on the pre-treatment periods would be close to zero for the left panel, while it would be negative for the right panel where two groups have different pre-treatment trends. In Appendix B.4, we show that a robustness check about leads effects (Angrist and Pischke, 2008), which incorporates leads of the treatment variable into the two-way fixed effects regression and checks whether their coefficients are zero, is equivalent to this DID on the pre-treatment periods.

What are the underlying assumptions behind this test of pre-treatment trends?
The basic idea is that if trends are parallel from 2006 to 2008, it is more likely that the parallel trends assumption holds for 2008 and 2010. Hence, instead of considering parallel trends only from 2008 to 2010, the test evaluates the two related parallel trends together. By doing so, this popular test tries to make the DID design falsifiable.

[Figure 1 panels: "Extended Parallel Trends" (left) and "Extended Parallel Trends Violated" (right). Horizontal axis: t = 0 (before), t = 1 (before), t = 2 (after). Vertical axis: Outcome. Legend: Trend of Treatment Group, Trend of Control Group, Counterfactual.]

Figure 1: Parallel Pre-treatment Trends (left) and Non-Parallel Pre-treatment Trends (right).

At its core, this approach does not test the parallel trends assumption itself (Assumption 1), which is untestable due to counterfactual outcomes. Instead, it tests the extended parallel trends assumption: the parallel trends hold for pre-treatment periods, from $t = 0$ to $t = 1$, as well as from a pre-treatment period $t = 1$ to a post-treatment period $t = 2$.

Assumption 2 (Extended Parallel Trends).

$$ \begin{cases} E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 1] = E[Y_{i2}(0) \mid G_i = 0] - E[Y_{i1}(0) \mid G_i = 0] \\ E[Y_{i1}(0) \mid G_i = 1] - E[Y_{i0}(0) \mid G_i = 1] = E[Y_{i1}(0) \mid G_i = 0] - E[Y_{i0}(0) \mid G_i = 0] \end{cases} \tag{6} $$

Here, the first line is the same as the standard parallel trends assumption (equation (2)), and the second line is the parallel trends for pre-treatment periods. This assumption means that treatment and control groups have parallel trends of the infrastructure quality from 2008 to 2010 as well as in pre-treatment periods from 2006 to 2008. Because outcome trends are observable in pre-treatment periods, the test of pre-treatment trends (equation (5)) directly tests this assumption.

Therefore, many DID studies that exploit the test on pre-treatment trends can be seen as the DID design under the extended parallel trends assumption. Because the extended parallel trends assumption naturally implies the conventional parallel trends assumption, Assumption 2 is also sufficient for identifying the ATT, and we can estimate it via the same DID estimator (equation (3)).

In summary, the first benefit is that researchers can assess the extended parallel trends assumption using the pre-treatment-trends test (equation (5)).

2.3 Benefit 2: Improving Estimation Accuracy

As we discussed above, many existing DID studies that utilize the test of pre-treatment trends can be viewed as the DID design with the extended parallel trends assumption.
However, this extended parallel trends assumption is often made implicitly and thus, it is used only for assessing the parallel trends assumption. Fortunately, if the extended parallel trends assumption holds, we can also estimate the ATT with higher accuracy, resulting in smaller standard errors.

This additional benefit becomes clear by simply restating the extended parallel trends assumption as follows.

$$ \begin{cases} E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 1] = E[Y_{i2}(0) \mid G_i = 0] - E[Y_{i1}(0) \mid G_i = 0] \\ E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i0}(0) \mid G_i = 1] = E[Y_{i2}(0) \mid G_i = 0] - E[Y_{i0}(0) \mid G_i = 0] \end{cases} \tag{7} $$

Under the extended parallel trends assumption, there are two natural DID estimators for the ATT. The first is the same as before: the DID on $t = 1$ and $t = 2$. The second is similar but with the additional pre-treatment period: the DID on $t = 0$ and $t = 2$. In our running example, this means that we have one DID estimator using data from 2008 and 2010 and another using data from 2006 and 2010.

$$ \begin{aligned} \hat{\tau}_{\text{DID}} &= \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right), \\ \hat{\tau}_{\text{DID}(2,0)} &= \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right). \end{aligned} \tag{8} $$

Under the extended parallel trends assumption, both estimators are unbiased and consistent for the ATT. Thus, we can increase estimation accuracy by combining the two estimators, for example, by simply averaging them.

$$ \hat{\tau}_{\text{e-DID}} = \frac{1}{2} \hat{\tau}_{\text{DID}} + \frac{1}{2} \hat{\tau}_{\text{DID}(2,0)}. \tag{9} $$

Intuitively, this extended DID estimator is more efficient because we have more observations to estimate counterfactual outcomes for the treatment group $E[Y_{i2}(0) \mid G_i = 1]$.

In the panel data settings, we show that this extended DID estimator $\hat{\tau}_{\text{e-DID}}$ is numerically equivalent to the coefficient of the treatment variable in the two-way fixed effects estimator fitted to the three time periods $t \in \{0, 1, 2\}$. We also present more general results about nonparametric relationships between the extended DID and the two-way fixed effects estimator in Appendix B.2.

To summarize, the second benefit of multiple pre-treatment periods is that researchers can use the extended DID estimator (equivalent to the two-way fixed effects estimator in the panel data) to increase statistical efficiency when the extended parallel trends assumption holds.

In this section, we consider scenarios in which the extended parallel trends assumption may not be plausible. Multiple pre-treatment periods are also useful in accounting for some deviation from the parallel trends assumption. We discuss a popular generalization of the difference-in-differences estimator, a sequential DID estimator, which removes bias due to certain violations of the parallel trends assumption (e.g., Lee, 2016; Mora and Reggio, 2019). We clarify an assumption behind this simple method and relate it to the parallel trends assumption.

To introduce the sequential DID estimator, we begin with the extended parallel trends assumption. As we described in Section 2.2, when the extended parallel trends assumption holds, a DID estimator applied to pre-treatment periods $t = 0$ and $t = 1$ should be zero in expectation. In contrast, when trends of treatment and control groups are not parallel, a DID estimate on pre-treatment periods would be non-zero. The sequential DID estimator uses this DID estimate from pre-treatment periods to adjust for bias in the standard DID estimator. In particular, it subtracts the DID estimator on pre-treatment periods from the usual DID estimator that uses pre- and post-treatment periods $t = 1$ and $t = 2$.

$$ \begin{aligned} \hat{\tau}_{\text{s-DID}} = {} & \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) \right\} \\ & - \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right) \right\}, \end{aligned} \tag{10} $$

where the first four terms are equal to the standard DID estimator (equation (3)) and the last four terms are the DID estimator applied to pre-treatment periods $t = 0$ and $t = 1$. In our running example, we can use this sequential DID estimator by first estimating the DID using 2008 and 2010 and then subtracting the DID based on 2006 and 2008. The idea behind this approach is that the DID estimator on pre-treatment periods captures deviation from the parallel trends assumption, and thus we subtract this bias from the usual DID estimator.

Although we would expect that this estimator only requires an assumption weaker than the extended parallel trends, what exact assumption do we need for this sequential DID estimator?
At its core, the parallel trends assumption means that differences between treatment and control groups due to unobserved confounders are constant over time. Instead of assuming this constant unmeasured confounding, the sequential DID estimator rests on the parallel trends-in-trends assumption: unobserved confounding increases or decreases over time but at some constant rate. Thus, the sequential DID estimator accounts for linear time-varying unmeasured confounding. For example, researchers might be worried that some treated communes have higher motivation for reforms, which is not measured, and the infrastructure qualities differ between treated and control communes due to this unobserved motivation. The parallel trends assumption means that the difference in the infrastructure qualities due to this unobserved confounder does not grow or decline over time. In contrast, the parallel trends-in-trends assumption accommodates a simple yet important case in which the unobserved difference in the infrastructure qualities does grow or decline at some fixed rate, which analysts do not need to specify.

This parallel trends-in-trends assumption is a generalization of the conventional parallel trends assumption and is formally written as follows.

Assumption 3 (Parallel Trends-in-Trends).

$$ \begin{aligned} & \{ E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i2}(0) \mid G_i = 0] \} - \{ E[Y_{i1}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 0] \} \\ & \quad = \{ E[Y_{i1}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 0] \} - \{ E[Y_{i0}(0) \mid G_i = 1] - E[Y_{i0}(0) \mid G_i = 0] \} \end{aligned} \tag{11} $$

Here, the left-hand side represents how the unobserved difference between treatment and control groups changes over time from $t = 1$ to $t = 2$. The right-hand side quantifies the same time-varying unmeasured differences from $t = 0$ to $t = 1$.
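A short worked derivation may help show why subtracting the pre-treatment DID removes the bias under Assumption 3. This is only a heuristic sketch at the population level (sample means replaced by expectations). The shorthand $b_t$ is introduced here for illustration and is not the paper's notation, and the sketch assumes that observed outcomes equal $Y_{it}(0)$ for every unit before treatment and for control units at $t = 2$.

```latex
\begin{align*}
% Shorthand (an assumption of this sketch): group difference in untreated outcomes
b_t &\equiv E[Y_{it}(0) \mid G_i = 1] - E[Y_{it}(0) \mid G_i = 0],
  \qquad t \in \{0, 1, 2\} \\
% Assumption 3 (equation (11)) written in this shorthand
b_2 - b_1 &= b_1 - b_0 \\
% Population analogue of the standard DID on t = 1, 2 (equation (3))
\text{DID on } (1, 2) &: \quad \tau + (b_2 - b_1) \\
% Population analogue of the pre-treatment DID on t = 0, 1 (equation (5))
\text{DID on } (0, 1) &: \quad b_1 - b_0 \\
% Their difference is the population analogue of the sequential DID (equation (10))
\{\tau + (b_2 - b_1)\} - (b_1 - b_0) &= \tau
\end{align*}
```

In this shorthand, the conventional parallel trends assumption is the special case $b_2 = b_1$, and the extended version adds $b_1 = b_0$; Assumption 3 only requires the two successive changes in $b_t$ to be equal.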
Because the trend of time-varying unmeasured confounding is estimated from pre-treatment periods $t = 0$ to $t = 1$, researchers do not need to know the rate at which time-varying unobserved confounding increases or decreases. Under the parallel trends-in-trends assumption, the sequential DID estimator is unbiased and consistent for the ATT.

Figure 2 illustrates the difference between the extended parallel trends assumption (left panel) and the parallel trends-in-trends assumption (middle panel). We can see in the second row of the figure that the parallel trends-in-trends assumption allows for a linear change in bias over time, whereas the bias is constant over time in the extended parallel trends.

[Figure 2 panels (columns): "Extended Parallel Trends", "Parallel Trends-in-Trends", "Both are Violated". Top row vertical axis: Outcome. Bottom row vertical axis: Bias (Difference in Groups). Horizontal axis: t = 0 (before), t = 1 (before), t = 2 (after). Legend: Trend of Treatment Group, Trend of Control Group, Counterfactual.]

Figure 2: Comparing Extended Parallel Trends Assumption and Parallel Trends-in-Trends Assumption. Note: The extended parallel trends assumption (left column) means that the difference in the treatment and control groups (bias) is constant over time. The parallel trends-in-trends assumption (middle column) allows for linear time-varying unmeasured confounding. Both assumptions are violated in the right column.

The sequential DID estimator is again connected to a widely used regression estimator. In particular, the sequential DID estimator (equation (10)) can be computed as a linear regression in which we replace the outcome $Y_{it}$ with the transformed outcome that adjusts for the lagged outcome in a particular way.

$$ \Delta Y_{it} \sim \alpha_s + \theta_s G_i + \gamma_s I_t + \beta_s (G_i \times I_t), \tag{12} $$

where $\Delta Y_{it} = Y_{it} - \left( \sum_{i: G_i = 1} Y_{i,t-1} \right) / n_{1,t-1}$ if $G_i = 1$ and $\Delta Y_{it} = Y_{it} - \left( \sum_{i: G_i = 0} Y_{i,t-1} \right) / n_{0,t-1}$ if $G_i = 0$. Coefficients are denoted by $(\alpha_s, \theta_s, \gamma_s, \beta_s)$. In this case, the coefficient of the interaction term $\beta_s$ is numerically identical to the sequential DID estimator. We provide the proof of this equivalence in Appendix B.3.

In time-series econometrics, it is common to take the difference in outcomes in order to remove linear time trends before running regressions (Wooldridge, 2010). We also demonstrate that a common robustness check of including group- or unit-specific time trends (Angrist and Pischke, 2008) is nonparametrically equivalent to the sequential DID estimator (see Appendix B.3). Within the potential outcomes framework, we clarified that these common techniques are justified under the parallel trends-in-trends assumption.

In summary, the third benefit of multiple pre-treatment periods is that researchers can use the sequential DID estimator (equation (10)) under the more flexible parallel trends-in-trends assumption, even when the extended parallel trends assumption is violated (and, therefore, the two-way fixed effects estimator is biased).

Remark.

Researchers might be worried not only about time-varying unmeasured confounders that change linearly over time, but also about more general forms of time-varying unmeasured confounding. Fortunately, when we have more than two pre-treatment periods, we can further generalize the sequential DID estimator. In general, when we have $K$ pre-treatment periods, it can account for unmeasured confounding that follows a polynomial function of order $K - 1$. However, we have to be aware of a key tradeoff: as we incorporate more flexible forms of confounding, standard errors of the sequential DID estimator get much larger. Another powerful approach to address unobserved time-varying confounding is based on a form of partial identification, such as sensitivity analysis (Keele et al., 2019) and the bracketing bounds (Angrist and Pischke, 2008; Ding and Li, 2019; Ye et al., 2020).

While the most basic form of the DID design only requires one pre-treatment period, we saw in the previous section that multiple pre-treatment periods provide three related benefits. We have clarified that each benefit requires different assumptions and different estimators, and as a result, in practice, researchers tend to enjoy only a subset of the three benefits they can exploit from multiple pre-treatment periods. In this section, we propose a new, simple estimator, which we call the double difference-in-differences (double DID), that blends all three benefits of multiple pre-treatment periods in a single framework. Here, we introduce the double DID in the setting with two pre-treatment periods.

We also provide two extensions. First, we generalize the proposed method to any number of pre- and post-treatment periods in the DID design (Section 3.4 and Appendix C). Second, we extend it to the staggered adoption design, where the timing of the treatment assignment can vary across units, in Section 4.
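The two building blocks that the double DID draws on, the standard DID and the sequential DID, can be sketched numerically. The following is a minimal illustration, assuming a balanced panel with two pre-treatment periods (t = 0, 1) and one post-treatment period (t = 2); the function names and the simulated data are hypothetical and are not taken from the authors' R package. It also checks the equivalence stated around equation (12), namely that the interaction coefficient from the regression on differenced outcomes coincides with the sequential DID.

```python
import numpy as np

def did_estimators(y, g):
    """Standard and sequential DID from a balanced panel.

    y: (n, 3) outcomes at t = 0, 1 (pre-treatment) and t = 2 (post-treatment).
    g: (n,) group indicator, 1 = treatment group, 0 = control group.
    """
    y = np.asarray(y, float)
    g = np.asarray(g).astype(bool)
    gap = y[g].mean(axis=0) - y[~g].mean(axis=0)  # treated-minus-control gap per period
    did = gap[2] - gap[1]                         # standard DID
    s_did = did - (gap[1] - gap[0])               # subtract the pre-treatment DID
    return did, s_did

def sequential_did_by_regression(y, g):
    """Interaction coefficient from regressing Delta Y on G, I, and G x I."""
    y = np.asarray(y, float)
    g = np.asarray(g, float)
    rows, resp = [], []
    for t in (1, 2):
        post = float(t == 2)                      # I_t = 1 in the post-treatment period
        for grp in (0.0, 1.0):
            mask = g == grp
            lag_mean = y[mask, t - 1].mean()      # group mean at t - 1
            for y_it in y[mask, t]:
                rows.append([1.0, grp, post, grp * post])
                resp.append(y_it - lag_mean)      # Delta Y_it
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(resp), rcond=None)
    return coef[3]

# Hypothetical data: common trend, a group gap that grows linearly (0.5 * t),
# and a treatment effect of 1.0 in the post-treatment period.
rng = np.random.default_rng(0)
n = 200
g = rng.integers(0, 2, n)
t = np.arange(3)
y = t + 0.5 * np.outer(g, t) + rng.normal(0.0, 1.0, (n, 3))
y[:, 2] += 1.0 * g

did, s_did = did_estimators(y, g)
beta_s = sequential_did_by_regression(y, g)
print(did, s_did, beta_s)
```

In this simulated setting the group gap grows linearly, so only the sequential DID targets the effect of 1.0; the standard DID additionally absorbs the 0.5 drift.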

We propose the double DID estimator within a framework of the generalized method of moments (GMM) (Hansen, 1982). In particular, we combine the standard DID estimator and the sequential DID estimator via the GMM:

$$\widehat{\tau}_{\text{d-DID}} = \operatorname*{argmin}_{\tau} \begin{pmatrix} \tau - \widehat{\tau}_{\text{DID}} \\ \tau - \widehat{\tau}_{\text{s-DID}} \end{pmatrix}^{\top} W \begin{pmatrix} \tau - \widehat{\tau}_{\text{DID}} \\ \tau - \widehat{\tau}_{\text{s-DID}} \end{pmatrix} \tag{13}$$

where $W$ is a user-specified weight matrix of dimension $2 \times 2$.

[Table 1: for each of the standard DID, extended DID, and sequential DID estimators, the weight matrix $W$ that recovers it.]

Table 1: Double DID as Generalization of Popular DID Estimators.

The important property of the proposed double DID estimator is that it contains all of the popular estimators that we considered in the previous sections as special cases. By specifying an appropriate weight matrix $W$, we can recover the standard DID, the extended DID, and the sequential DID estimators (see Table 1).

The advantage of the double DID estimator is that we can choose the optimal weight matrix by considering the identification assumption and estimation within the unifying framework of the GMM. The double DID estimator proceeds with the following two steps.

Step 1: Assessing the Underlying Assumptions

The first step is to assess the underlying assumptions. We check the extended parallel trends assumption by applying the DID estimator on pre-treatment periods and testing whether the estimate is statistically distinguishable from zero at a conventional level (equation (5)). Importantly, this first step of the double DID is equivalent to the over-identification test in the GMM framework, where we assume that the sequential DID estimator is correctly specified and test the null hypothesis that the standard DID estimator is correctly specified. When there are more than two pre-treatment periods, we can diagnose the parallel trends-in-trends assumption by applying the sequential DID estimator to pre-treatment periods.
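As a rough illustration of this first step, the pre-treatment DID and a test of whether it differs from zero can be sketched as follows. This is a simplified sketch, assuming a balanced panel with independent units and a normal approximation; the function name `pretrend_test` and the toy inputs are hypothetical, and the standard error here is the usual two-sample formula rather than the block bootstrap used in the empirical application.

```python
import numpy as np
from statistics import NormalDist

def pretrend_test(y_pre0, y_pre1, g):
    """DID on two pre-treatment periods of a balanced panel, with a z-test.

    y_pre0, y_pre1: outcomes of the same units in the two pre-treatment periods.
    g: 1 for the treatment group, 0 for the control group.
    """
    change = np.asarray(y_pre1, float) - np.asarray(y_pre0, float)
    g = np.asarray(g).astype(bool)
    est = change[g].mean() - change[~g].mean()        # pre-treatment DID
    se = np.sqrt(change[g].var(ddof=1) / g.sum()
                 + change[~g].var(ddof=1) / (~g).sum())
    p_value = 2 * (1 - NormalDist().cdf(abs(est / se)))  # H0: pre-treatment DID = 0
    return est, se, p_value

# Hypothetical toy panel: both groups change by the same amount on average.
est, se, p_value = pretrend_test([0, 0, 0, 0, 0, 0], [1, 2, 3, 1, 2, 3],
                                 [1, 1, 1, 0, 0, 0])
print(est, se, p_value)
```

A large p-value here only means that a difference in pre-treatment trends was not detected, which is why the text goes on to pair this test with an equivalence approach.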

Equivalence Approach.

The standard hypothesis testing approach has a risk of conflating evidence for parallel trends and statistical inefficiency. For example, when sample size is small, even if pre-treatment trends of the treatment and control groups differ, a test of the difference might not be statistically significant due to a large standard error, and analysts might "pass" the pre-treatment-trends test. To mitigate such concerns, we also incorporate an equivalence approach (Wellek, 2010; Hartman and Hidalgo, 2018) in which we evaluate the null hypothesis that trends of two groups are not parallel in pre-treatment periods. By us-

Footnote: Liu, Wang and Xu (2020) propose a similar test for a different class of estimators, which they refer to as "counterfactual estimators" and which share many properties with synthetic control methods. We discuss relationships between our method and synthetic control methods in Section 5.3.

Step 2: Estimation of the ATT

The second step is the estimation of the ATT. When the extended parallel trends assumption is plausible, we can estimate the optimal weight matrix $W$ building on the theory of the efficient GMM (Hansen, 1982). Specifically, the optimal weight matrix that minimizes the variance is the inverse of the variance-covariance matrix of the DID estimators:

$$\widehat{W} = \begin{pmatrix} \widehat{\mathrm{Var}}(\widehat{\tau}_{\text{DID}}) & \widehat{\mathrm{Cov}}(\widehat{\tau}_{\text{DID}}, \widehat{\tau}_{\text{s-DID}}) \\ \widehat{\mathrm{Cov}}(\widehat{\tau}_{\text{DID}}, \widehat{\tau}_{\text{s-DID}}) & \widehat{\mathrm{Var}}(\widehat{\tau}_{\text{s-DID}}) \end{pmatrix}^{-1} \tag{14}$$

This optimal weight matrix allows us to compute the weighted average of the standard DID and the sequential DID estimators such that the resulting variance is the smallest.

Remark. Importantly, under the extended parallel trends assumption, both the standard DID and the sequential DID estimators are consistent for the ATT, and thus, any weighted average is a consistent estimator. The optimal weight matrix chooses the most efficient estimator among all of these consistent estimators. As we clarify below, we do not use the weighted average of the standard DID and the sequential DID when the extended parallel trends assumption is violated.

When only the parallel trends-in-trends assumption is plausible, the double DID contains only one moment condition, $\tau - \widehat{\tau}_{\text{s-DID}} = 0$, and thus it is equal to the sequential DID estimator. This is equivalent to choosing the weight matrix $W$ with $W_{11} = W_{12} = W_{21} = 0$ and $W_{22} = 1$ (the third column in Table 1).

When both assumptions are implausible, there is no credible estimator without making further stringent assumptions. However, when there are more than two pre-treatment periods, researchers can also use the proposed generalized $K$-DID (discussed in Section 3.4) to further relax the parallel trends-in-trends assumption.

3.2 Double DID Enjoys Three Benefits

The proposed double DID estimator naturally enjoys the three benefits of multiple pre-treatment periods within a unified framework.

1. Assessing Underlying Assumptions

The double DID incorporates the assessment of underlying assumptions in its first step as the over-identification test. When the trends in pre-treatment periods are not parallel, researchers have to pay the most careful attention to research design and use domain knowledge to assess the parallel trends-in-trends assumption.

2. Improving Estimation Accuracy

When the extended parallel trends assumption holds, researchers can combine two DIDs with equal weights (i.e., the extended DID estimator, which is numerically equivalent to the two-way fixed effects regression) to increase estimation accuracy (Section 2.3). In this setting, the double DID further improves estimation accuracy because it selects the optimal weights as the GMM estimator. In Section F, we use simulations to demonstrate that the double DID achieves even smaller standard errors than the extended DID estimator.

3. Allowing For A More Flexible Parallel Trends Assumption

Under the parallel trends-in-trends assumption, the double DID estimator converges to the sequential DID estimator. However, when the extended parallel trends assumption holds, the double DID uses optimal weights and is not equal to the sequential DID. Thus, the double DID estimator avoids a dilemma of the sequential DID: the sequential DID is consistent under the weaker parallel trends-in-trends assumption, but it is less efficient when the extended parallel trends assumption holds. By naturally changing the weight matrix in the GMM framework, the double DID achieves high estimation accuracy under the extended parallel trends assumption and, at the same time, allows for more flexible time-varying unmeasured confounding under the parallel trends-in-trends assumption.
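To make the role of the weight matrix concrete, the GMM objective in equation (13) has a closed-form minimizer when the parameter is a scalar, which the sketch below uses. This is an illustration rather than the authors' implementation: the function names, the point estimates, and the variance-covariance matrix are hypothetical. The matrix used for the extended DID, diag(3, -1), is derived here under the assumption that the extended DID is the equally weighted average of the DID comparing t = 2 with t = 1 and the DID comparing t = 2 with t = 0; it is one matrix that reproduces that average, not a transcription of Table 1.

```python
import numpy as np

def double_did(tau_did, tau_sdid, W):
    """Closed-form minimizer of (tau*1 - est)' W (tau*1 - est) over scalar tau.

    The objective is convex in tau only when the sum of the entries of W is
    positive, which holds for every weight matrix used below.
    """
    W = np.asarray(W, float)
    W = (W + W.T) / 2.0                      # only the symmetric part matters
    ones = np.ones(2)
    est = np.array([tau_did, tau_sdid], float)
    return float(ones @ W @ est) / float(ones @ W @ ones)

def efficient_weight(V):
    """Inverse of the variance-covariance matrix of the two estimators (equation (14))."""
    return np.linalg.inv(np.asarray(V, float))

# Hypothetical point estimates.
tau_did, tau_sdid = 1.5, 1.0

standard = double_did(tau_did, tau_sdid, [[1, 0], [0, 0]])    # weight on DID only
sequential = double_did(tau_did, tau_sdid, [[0, 0], [0, 1]])  # weight on s-DID only
extended = double_did(tau_did, tau_sdid, [[3, 0], [0, -1]])   # 1.5 * DID - 0.5 * s-DID (assumed form)

# Hypothetical variance-covariance matrix of (tau_did, tau_sdid).
V = np.array([[0.04, 0.03],
              [0.03, 0.09]])
W_opt = efficient_weight(V)
optimal = double_did(tau_did, tau_sdid, W_opt)
var_opt = 1.0 / float(np.ones(2) @ W_opt @ np.ones(2))       # variance of the efficient combination

print(standard, sequential, extended, optimal, var_opt)
```

With these hypothetical inputs the efficient combination has a variance below that of either input estimator, which is the efficiency gain described in the second benefit.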

Like other DID estimators, the double DID estimator is nicely connected to a widely-used regression approach. This connection is particularly useful when researchers would like to control for pre-treatment covariates to make the DID design more robust and efficient.

To introduce the regression-based double DID estimator, we begin with the standard DID. As discussed in Section 2.1, the standard DID estimator is equivalent to a coefficient in the linear regression of equation (4). Inspired by this connection, researchers often adjust for additional pre-treatment covariates as:

$$Y_{it} \sim \alpha + \theta G_i + \gamma I_t + \beta (G_i \times I_t) + X_{it}^{\top} \rho, \tag{15}$$

where we adjust for the additional pre-treatment covariates $X_{it}$. The coefficient on the interaction term, $\widehat{\beta}$, is the standard DID regression estimator for the ATT. Here, we make the parallel trends assumption conditional on covariates $X_{it}$. The idea is that even when the parallel trends assumption might not hold without controlling for any covariates, trends of the two groups might be parallel after adjusting for observed covariates. For example, the conditional parallel trends assumption means that treatment and control groups have the same trends of the infrastructure quality after controlling for population size and GDP per capita.

The estimated coefficient $\widehat{\beta}$ is consistent for the ATT when this conditional parallel trends assumption holds and the parametric model is correctly specified. This parametric assumption might be strong, but it is common to all regression strategies, including non-causal settings, and can be assessed via usual model diagnostics.

The sequential DID estimator is extended similarly. Based on the connection to the linear regression of equation (12), we can adjust for additional pre-treatment covariates as:

$$\Delta Y_{it} \sim \alpha_s + \theta_s G_i + \gamma_s I_t + \beta_s (G_i \times I_t) + X_{it}^{\top} \rho_s, \tag{16}$$

where $\Delta Y_{it} = Y_{it} - \big(\sum_{i: G_i = 1} Y_{i,t-1}\big)/n_{1,t-1}$ if $G_i = 1$ and $\Delta Y_{it} = Y_{it} - \big(\sum_{i: G_i = 0} Y_{i,t-1}\big)/n_{0,t-1}$ if $G_i = 0$. The estimated coefficient $\widehat{\beta}_s$ is consistent for the ATT under the conditional parallel trends-in-trends assumption and the conventional assumption of correct specification.

The double DID regression combines the two regression estimators via the GMM:

$$\widehat{\beta}_{\text{d-DID}} = \operatorname*{argmin}_{\beta_d} \begin{pmatrix} \beta_d - \widehat{\beta} \\ \beta_d - \widehat{\beta}_s \end{pmatrix}^{\top} W \begin{pmatrix} \beta_d - \widehat{\beta} \\ \beta_d - \widehat{\beta}_s \end{pmatrix} \tag{17}$$

where $W$ is a user-specified weight matrix of dimension $2 \times 2$.

Thus, like the double DID estimator without covariates, the double DID regression also has two steps. The first step is to assess the underlying assumptions. Here, instead of using the standard DID estimator, we use the standard DID regression on pre-treatment periods to assess the conditional extended parallel trends assumption. The second step is to estimate the ATT while adjusting for pre-treatment covariates. Instead of using the double DID estimator without covariates, we implement the regression-based double DID estimator (equation (17)).

3.4 Generalized K-Difference-in-Differences

We generalize the proposed method to any number of pre- and post-treatment periods in Appendix C. This generalization has two central benefits. First, it enables researchers to use longer pre-treatment periods to allow for even more flexible forms of unmeasured time-varying confounding. While the parallel trends-in-trends assumption (Assumption 3) accounts for linear time-varying unmeasured confounding, we can allow for time-varying unmeasured confounding that follows a $(K-1)$th order polynomial function when we have $K$ pre-treatment periods. Second, we also allow for any number of post-treatment periods, and therefore, researchers can estimate not only short-term causal effects but also longer-term causal effects with this generalization.

In this section, we extend the proposed double DID estimator to the staggered adoption design where the timing of the treatment assignment can vary across units (Athey and Imbens, 2018; Strezhnev, 2018; Ben-Michael, Feller and Rothstein, 2019).

In the staggered adoption (SA) design, different units can receive the treatment in different time periods. Once they receive the treatment, they remain exposed to the treatment afterwards. Therefore, $D_{it} = 1$ if $D_{im} = 1$ where $m < t$. We can thus summarize the information about the treatment assignment by the timing of the treatment $A_i$, where $A_i \equiv \min\{t : D_{it} = 1\}$. When unit $i$ never receives the treatment until the end of time $T$, we let $A_i = \infty$. For example, in many applications where researchers are interested in the causal effect of state- or local-level policies, units adopt policies at different time points and remain exposed to such policies once they introduce them. In Section 6.2, we provide an example based on Paglayan (2019). See Figure 3 for a visualization of the SA design.

[Figure 3: grid of treatment indicators (0/1) by state and year.]

Figure 3: Example of the Staggered Adoption Design. Note: We use gray cells of "1" to denote the treated observation and white cells of "0" to denote the control observation. The main feature of the SA design is that once units receive the treatment, they remain exposed to the treatment.

Following the recent literature on the SA design, we make two standard assumptions in the SA design: the no anticipation assumption and the invariance to history assumption (Athey and Imbens, 2018; Abadie, 2019; Ben-Michael, Feller and Rothstein, 2019; Imai and Kim, 2019). This implies that, for unit $i$ in period $t$, the potential outcome $Y_{it}(1)$ represents the outcome of unit $i$ that would realize in period $t$ if unit $i$ receives the treatment at or before period $t$. Similarly, $Y_{it}(0)$ represents the outcome of unit $i$ that would realize in period $t$ if unit $i$ does not receive the treatment by period $t$. Finally, we generalize the group indicator $G$ as follows.

$$G_{it} = \begin{cases} 1 & \text{if } A_i = t \\ 0 & \text{if } A_i > t \\ -1 & \text{if } A_i < t \end{cases} \tag{18}$$

where $G_{it} = 1$ represents units who receive the treatment at time $t$, and $G_{it} = 0$ ($G_{it} = -1$) indicates units who receive the treatment after (before) time $t$.

Under the SA design, the staggered adoption ATT (SA-ATT) at time $t$ is defined as follows.

$$\tau^{\mathrm{SA}}(t) = E[Y_{it}(1) - Y_{it}(0) \mid G_{it} = 1],$$

which represents the causal effect of the treatment in period $t$ on units with $G_{it} = 1$, who receive the treatment at time $t$. This is a straightforward extension of the standard ATT (equation (1)) in the basic DID setting. Researchers might also be interested in the time-average staggered adoption ATT (time-average SA-ATT).

$$\overline{\tau}^{\mathrm{SA}} = \sum_{t \in \mathcal{T}} \pi_t \, \tau^{\mathrm{SA}}(t),$$

where $\mathcal{T}$ represents a set of the time periods for which researchers want to estimate the ATT. For example, if a researcher is interested in estimating the ATT for the entire sample period, one can take $\mathcal{T} = \{1, \ldots, T\}$. The SA-ATT in period $t$, $\tau^{\mathrm{SA}}(t)$, is weighted by the proportion of units who receive the treatment at time $t$: $\pi_t = \sum_{i=1}^{n} \mathbf{1}\{A_i = t\} / \sum_{i=1}^{n} \mathbf{1}\{A_i \in \mathcal{T}\}$.

4.2 Double DID for Staggered Adoption Design

Under what assumptions can we identify the SA-ATT and the time-average SA-ATT? Here, we first extend the standard DID estimator under the parallel trends assumption and the sequential DID estimator under the parallel trends-in-trends assumption to the SA design.
Formally, we define the standard DID estimator for the SA-ATT at time $t$ as

$$\widehat{\tau}^{\mathrm{SA}}_{\text{DID}}(t) = \left( \frac{\sum_{i: G_{it} = 1} Y_{it}}{n_{1t}} - \frac{\sum_{i: G_{it} = 1} Y_{i,t-1}}{n_{1,t-1}} \right) - \left( \frac{\sum_{i: G_{it} = 0} Y_{it}}{n_{0t}} - \frac{\sum_{i: G_{it} = 0} Y_{i,t-1}}{n_{0,t-1}} \right),$$

which is consistent for the SA-ATT under the following parallel trends assumption in period $t$ under the SA design:

$$E[Y_{it}(0) \mid G_{it} = 1] - E[Y_{i,t-1}(0) \mid G_{it} = 1] = E[Y_{it}(0) \mid G_{it} = 0] - E[Y_{i,t-1}(0) \mid G_{it} = 0].$$

Similarly, we can define the sequential DID estimator for the SA-ATT at time $t$ as

$$\begin{aligned} \widehat{\tau}^{\mathrm{SA}}_{\text{s-DID}}(t) = {} & \left\{ \left( \frac{\sum_{i: G_{it} = 1} Y_{it}}{n_{1t}} - \frac{\sum_{i: G_{it} = 1} Y_{i,t-1}}{n_{1,t-1}} \right) - \left( \frac{\sum_{i: G_{it} = 0} Y_{it}}{n_{0t}} - \frac{\sum_{i: G_{it} = 0} Y_{i,t-1}}{n_{0,t-1}} \right) \right\} \\ & - \left\{ \left( \frac{\sum_{i: G_{it} = 1} Y_{i,t-1}}{n_{1,t-1}} - \frac{\sum_{i: G_{it} = 1} Y_{i,t-2}}{n_{1,t-2}} \right) - \left( \frac{\sum_{i: G_{it} = 0} Y_{i,t-1}}{n_{0,t-1}} - \frac{\sum_{i: G_{it} = 0} Y_{i,t-2}}{n_{0,t-2}} \right) \right\}, \end{aligned}$$

which is consistent for the SA-ATT under the following parallel trends-in-trends assumption in period $t$ under the SA design:

$$\begin{aligned} & \{ E[Y_{it}(0) \mid G_{it} = 1] - E[Y_{it}(0) \mid G_{it} = 0] \} - \{ E[Y_{i,t-1}(0) \mid G_{it} = 1] - E[Y_{i,t-1}(0) \mid G_{it} = 0] \} \\ & \quad = \{ E[Y_{i,t-1}(0) \mid G_{it} = 1] - E[Y_{i,t-1}(0) \mid G_{it} = 0] \} - \{ E[Y_{i,t-2}(0) \mid G_{it} = 1] - E[Y_{i,t-2}(0) \mid G_{it} = 0] \}. \end{aligned}$$

Finally, combining the standard and sequential DID estimators, we can extend the double DID to the SA design as follows.

$$\widehat{\tau}^{\mathrm{SA}}_{\text{d-DID}}(t) = \operatorname*{argmin}_{\tau^{\mathrm{SA}}(t)} \begin{pmatrix} \tau^{\mathrm{SA}}(t) - \widehat{\tau}^{\mathrm{SA}}_{\text{DID}}(t) \\ \tau^{\mathrm{SA}}(t) - \widehat{\tau}^{\mathrm{SA}}_{\text{s-DID}}(t) \end{pmatrix}^{\top} W(t) \begin{pmatrix} \tau^{\mathrm{SA}}(t) - \widehat{\tau}^{\mathrm{SA}}_{\text{DID}}(t) \\ \tau^{\mathrm{SA}}(t) - \widehat{\tau}^{\mathrm{SA}}_{\text{s-DID}}(t) \end{pmatrix}$$

where $W(t)$ is a user-specified weight matrix.
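The staggered adoption quantities defined so far (the group indicator of equation (18), the period-specific standard and sequential DID estimators, and the adoption-share weights for the time average) can be sketched as follows. This is a minimal illustration under stated assumptions: a balanced panel stored as an array with columns indexed by period, adoption times coded as `np.inf` for never-treated units, and hypothetical function names and toy data that do not come from the authors' R package.

```python
import numpy as np

def group_indicator(adopt, t):
    """G_it: 1 if A_i == t, 0 if A_i > t, -1 if A_i < t."""
    adopt = np.asarray(adopt, float)
    return np.where(adopt == t, 1, np.where(adopt > t, 0, -1))

def sa_did(y, adopt, t):
    """Standard and sequential DID for the SA-ATT at time t (requires t >= 2).

    y: (n, T) outcomes, column s holds period s; adopt: adoption period A_i,
    with np.inf for units that are never treated.
    """
    y = np.asarray(y, float)
    g = group_indicator(adopt, t)
    # Gap between units treated at t and units not yet treated at t, per period.
    gap = y[g == 1].mean(axis=0) - y[g == 0].mean(axis=0)
    did = gap[t] - gap[t - 1]
    s_did = did - (gap[t - 1] - gap[t - 2])
    return did, s_did

def time_average(y, adopt, periods):
    """Weighted averages of the period-specific estimators with weights pi_t."""
    adopt = np.asarray(adopt, float)
    counts = np.array([(adopt == t).sum() for t in periods], float)
    pi = counts / counts.sum()                 # share of units adopting in each period
    est = np.array([sa_did(y, adopt, t) for t in periods])
    return pi @ est                            # (time-average DID, time-average s-DID)

# Hypothetical toy panel: common linear trend, effect of 1.0 from adoption onward.
adopt = np.array([2.0, 3.0, np.inf, np.inf])
t_grid = np.arange(5)
y = t_grid + 1.0 * (t_grid[None, :] >= adopt[:, None])

print(sa_did(y, adopt, 2), sa_did(y, adopt, 3), time_average(y, adopt, [2, 3]))
```

Units with G_it = -1 are already treated at time t and are excluded from both group means, which mirrors the restriction to units with G_it of at least 0.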
Under the SA design as well, the standard DID and sequential DID estimators are special cases of our proposed double DID estimator with specific weight matrices.

Like the basic double DID estimator that we have proposed in Section 3.1, the double DID for the SA design also has two steps. The first step is to assess the underlying assumptions using the standard DID for the SA design with two time points $\{t-2, t-1\}$ for units who are not yet treated at time $t-1$, that is, $\{i : G_{it} \geq 0\}$. This is a generalization of the pre-treatment-trends test in the basic DID setup (Section 2.2). The second step is to estimate the SA-ATT at time $t$. When only the parallel trends-in-trends assumption is plausible, we choose the weight matrix $W(t)$ where $W_{11}(t) = W_{12}(t) = W_{21}(t) = 0$ and $W_{22}(t) = 1$, which yields the sequential DID under the SA design. When the extended parallel trends assumption is plausible, we use the optimal weight matrix defined as $W(t) = \mathrm{Var}(\widehat{\tau}^{\mathrm{SA}}_{(1:2)}(t))^{-1}$, where $\mathrm{Var}(\cdot)$ is the variance-covariance matrix and $\widehat{\tau}^{\mathrm{SA}}_{(1:2)}(t) = (\widehat{\tau}^{\mathrm{SA}}_{\text{DID}}(t), \widehat{\tau}^{\mathrm{SA}}_{\text{s-DID}}(t))^{\top}$. This optimal weight matrix provides us with the most efficient estimator (i.e., the smallest standard error). We provide further details on the implementation in Appendix D.

To estimate the time-average SA-ATT, we extend the double DID as follows.

$$\widehat{\overline{\tau}}^{\mathrm{SA}}_{\text{d-DID}} = \operatorname*{argmin}_{\overline{\tau}^{\mathrm{SA}}} \begin{pmatrix} \overline{\tau}^{\mathrm{SA}} - \widehat{\overline{\tau}}^{\mathrm{SA}}_{\text{DID}} \\ \overline{\tau}^{\mathrm{SA}} - \widehat{\overline{\tau}}^{\mathrm{SA}}_{\text{s-DID}} \end{pmatrix}^{\top} \overline{W} \begin{pmatrix} \overline{\tau}^{\mathrm{SA}} - \widehat{\overline{\tau}}^{\mathrm{SA}}_{\text{DID}} \\ \overline{\tau}^{\mathrm{SA}} - \widehat{\overline{\tau}}^{\mathrm{SA}}_{\text{s-DID}} \end{pmatrix}$$

where

$$\widehat{\overline{\tau}}^{\mathrm{SA}}_{\text{DID}} = \sum_{t \in \mathcal{T}} \pi_t \, \widehat{\tau}^{\mathrm{SA}}_{\text{DID}}(t), \quad \text{and} \quad \widehat{\overline{\tau}}^{\mathrm{SA}}_{\text{s-DID}} = \sum_{t \in \mathcal{T}} \pi_t \, \widehat{\tau}^{\mathrm{SA}}_{\text{s-DID}}(t).$$

The optimal weight matrix $\overline{W}$ is equal to $\mathrm{Var}(\widehat{\overline{\tau}}^{\mathrm{SA}}_{(1:2)})^{-1}$ where $\widehat{\overline{\tau}}^{\mathrm{SA}}_{(1:2)} = (\widehat{\overline{\tau}}^{\mathrm{SA}}_{\text{DID}}, \widehat{\overline{\tau}}^{\mathrm{SA}}_{\text{s-DID}})^{\top}$.

We now extend the double DID regression to the SA design setting.
We first extend the standard DID regression (Section 3.3) to the SA design. In particular, to estimate the SA-ATT at time $t$, we can fit the following regression for units who are not yet treated at time $t-1$, that is, $\{i : G_{it} \geq 0\}$.

$$Y_{iv} \sim \alpha + \theta G_{it} + \gamma I_v + \beta^{\mathrm{SA}}(t)(G_{it} \times I_v) + X_{iv}^{\top} \rho,$$

where $v \in \{t-1, t\}$ and $I_v$ is the time indicator (equal to 1 if $v = t$ and 0 if $v = t-1$). Note that $G_{it}$ defines the treatment and control groups at time $t$, and thus, it does not depend on the time index $v$. The estimated coefficient $\widehat{\beta}^{\mathrm{SA}}(t)$ is consistent for the SA-ATT under the conditional parallel trends assumption.

Similarly, we can extend the sequential DID regression to the SA design. Using the connection to the linear regression of equation (12), we can adjust for additional pre-treatment covariates as:

$$\Delta Y_{iv} \sim \alpha_s + \theta_s G_{it} + \gamma_s I_v + \beta^{\mathrm{SA}}_s(t)(G_{it} \times I_v) + X_{iv}^{\top} \rho_s,$$

where $v \in \{t-1, t\}$, $\Delta Y_{iv} = Y_{iv} - \big(\sum_{i: G_{it} = 1} Y_{i,v-1}\big)/n_{1,v-1}$ if $G_{it} = 1$, and $\Delta Y_{iv} = Y_{iv} - \big(\sum_{i: G_{it} = 0} Y_{i,v-1}\big)/n_{0,v-1}$ if $G_{it} = 0$. The estimated coefficient $\widehat{\beta}^{\mathrm{SA}}_s(t)$ is consistent for the SA-ATT under the conditional parallel trends-in-trends assumption.

Therefore, the double DID regression for the SA design combines the two regression estimators via the GMM:

$$\widehat{\beta}^{\mathrm{SA}}_{\text{d-DID}}(t) = \operatorname*{argmin}_{\beta^{\mathrm{SA}}_d(t)} \begin{pmatrix} \beta^{\mathrm{SA}}_d(t) - \widehat{\beta}^{\mathrm{SA}}(t) \\ \beta^{\mathrm{SA}}_d(t) - \widehat{\beta}^{\mathrm{SA}}_s(t) \end{pmatrix}^{\top} W(t) \begin{pmatrix} \beta^{\mathrm{SA}}_d(t) - \widehat{\beta}^{\mathrm{SA}}(t) \\ \beta^{\mathrm{SA}}_d(t) - \widehat{\beta}^{\mathrm{SA}}_s(t) \end{pmatrix}$$

where the choice of the weight matrix follows the same two-step procedure as Section 4.2. We also provide further details in Appendix D. The optimal weight matrix $W(t)$ is equal to $\mathrm{Var}(\widehat{\beta}^{\mathrm{SA}}_{(1:2)}(t))^{-1}$ where $\widehat{\beta}^{\mathrm{SA}}_{(1:2)}(t) = (\widehat{\beta}^{\mathrm{SA}}(t), \widehat{\beta}^{\mathrm{SA}}_s(t))^{\top}$.

To estimate the time-average SA-ATT, we extend the double DID regression as follows.
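$$\widehat{\overline{\beta}}^{\mathrm{SA}}_{\text{d-DID}} = \operatorname*{argmin}_{\overline{\beta}^{\mathrm{SA}}_d} \begin{pmatrix} \overline{\beta}^{\mathrm{SA}}_d - \widehat{\overline{\beta}}^{\mathrm{SA}} \\ \overline{\beta}^{\mathrm{SA}}_d - \widehat{\overline{\beta}}^{\mathrm{SA}}_s \end{pmatrix}^{\top} \overline{W} \begin{pmatrix} \overline{\beta}^{\mathrm{SA}}_d - \widehat{\overline{\beta}}^{\mathrm{SA}} \\ \overline{\beta}^{\mathrm{SA}}_d - \widehat{\overline{\beta}}^{\mathrm{SA}}_s \end{pmatrix}$$

where

$$\widehat{\overline{\beta}}^{\mathrm{SA}} = \sum_{t \in \mathcal{T}} \pi_t \, \widehat{\beta}^{\mathrm{SA}}(t), \quad \text{and} \quad \widehat{\overline{\beta}}^{\mathrm{SA}}_s = \sum_{t \in \mathcal{T}} \pi_t \, \widehat{\beta}^{\mathrm{SA}}_s(t).$$

The optimal weight matrix $\overline{W}$ is equal to $\mathrm{Var}(\widehat{\overline{\beta}}^{\mathrm{SA}}_{(1:2)})^{-1}$ where $\widehat{\overline{\beta}}^{\mathrm{SA}}_{(1:2)} = (\widehat{\overline{\beta}}^{\mathrm{SA}}, \widehat{\overline{\beta}}^{\mathrm{SA}}_s)^{\top}$.

This section clarifies relationships between our proposed double DID and three existing methods: the two-way fixed effects estimator, the sequential DID estimator, and synthetic control methods.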

While we contrast the double DID with the two-way fixed effects estimator throughout the paper, we summarize our discussion here. First, in the basic DID design, the two-way fixed effects estimator is a special case of the double DID with a specific choice of the weight matrix $W$ (see Table 1). Therefore, whenever the two-way fixed effects estimator is consistent for the ATT, the double DID is a more efficient, consistent estimator of the ATT. This is because the double DID can choose the optimal weight matrix via the GMM, while the two-way fixed effects estimator uses pre-determined equal weights over time. Second, in the SA design, a large number of recent papers show that the widely-used two-way fixed effects estimator is in general inconsistent for the ATT due to treatment effect heterogeneity and implicit parametric assumptions (Abraham and Sun, 2018; Athey and Imbens, 2018; Strezhnev, 2018; Imai and Kim, 2020). In contrast, the proposed double DID in the SA design generalizes nonparametric DID estimators to allow for treatment effect heterogeneity, and thus, it does not suffer from the same problem.

Our double DID estimator contains the sequential DID estimator (e.g., Lee, 2016; Mora and Reggio, 2019) as a special case. Our proposed double DID improves over the sequential DID estimator in two ways. First, when the parallel trends assumption holds, the double DID optimally combines the standard DID and the sequential DID to improve efficiency, and it is not equal to the sequential DID. Therefore, it avoids a dilemma of the sequential DID: the sequential DID is consistent under the parallel trends-in-trends assumption (weaker than the parallel trends assumption), but it is less efficient when the parallel trends assumption holds. Second, while the sequential DID estimator has only been available for the basic DID design where treatment assignment happens only once, we generalize it to the staggered adoption design and further incorporate it into our staggered-adoption double DID estimator (Section 4).

Another relevant and popular class of methods is the synthetic control methods. While the method was originally designed to estimate the causal effect on a single treated unit, recent extensions allow for multiple treated units and the staggered adoption design (e.g., Xu, 2017; Athey et al., 2017; Ben-Michael, Feller and Rothstein, 2018; Hazlett and Xu, 2018). Despite a wide variety of innovative extensions, they all share the same core feature: they require long pre-treatment periods to accurately estimate a pre-treatment trajectory of the treated units. For example, Xu (2017) recommends collecting more than ten pre-treatment periods. In contrast, the proposed double DID can be applied as long as there is more than one pre-treatment period, and it is better suited when there is a small to moderate number of pre-treatment periods.

When there is a large number of pre-treatment periods (i.e., long enough to apply the synthetic control methods), we recommend applying both the synthetic control methods and the proposed double DID, and evaluating robustness across those approaches. This is important because they rely on different identification assumptions. In fact, we show in Section 6.2 that the double DID can recover credible estimates similar to more flexible variants of synthetic control methods even when there is a large number of pre-treatment periods. This robustness provides researchers with additional credibility for their causal estimates and underlying assumptions.

Malesky, Nguyen and Tran (2014) utilize the basic DID design to study how the abolition of elected councils affects local public services in Vietnam. To estimate the causal effects of the institutional change, the original authors rely on data from 2008 and 2010, which are before and after the abolition of elected councils in 2009. Then, they supplement the main analysis by assessing trends in pre-treatment periods from 2006 to 2008. In this section, we apply the proposed method and illustrate how to improve this basic DID design.

Although Malesky, Nguyen and Tran (2014) employ the exact same DID design for all of the thirty outcomes they consider, each outcome might require different assumptions, as noted in the original paper. Here, we focus on reanalyzing three outcomes that have different patterns of pre-treatment periods. By doing so, we clarify how researchers can use the double DID method to transparently assess underlying assumptions and employ appropriate DID estimators under different settings. We provide an analysis of all the thirty outcomes in Appendix G.

The first step of the DID design is to visualize trends of treatment and control groups. Figure 4 shows trends of three different outcomes: "Education and Cultural Program," "Tap Water," and "Agricultural Center." Although the original analysis uses the same DID design for all of them, they have distinct trends in the pre-treatment periods. The first outcome of "Education and Cultural Program" has parallel trends in pre-treatment periods. For the other two outcomes, trends do not look parallel in either case. While the trends for the second outcome ("Tap Water") have similar directions, trends for the third outcome ("Agricultural Center") have opposite signs. This visualization of trends serves as a transparent first step to assess the underlying assumptions necessary for the DID estimation.

Footnote: "Education and Cultural Program" (binary): This variable takes one if there is a program that invests in culture and education in the commune. "Tap Water" (binary): What is the main source of drinking/cooking water for most people in this commune? "Agricultural Center" (binary): Is there any agriculture extension center in a given commune? Please see Malesky, Nguyen and Tran (2014) for further details.

[Figure 4: three panels, "Education and Cultural Program", "Tap Water", and "Agricultural Center"; horizontal axis: Year (2006, 2008, 2010); vertical axis: Mean of Outcomes; series: Treatment and Control.]

Figure 4: Visualizing Trends of Treatment and Control Groups. Note: We report trends for the treatment group (blue solid line with solid circles) and the control group (gray dashed line with hollow circles). Two pre-treatment periods are 2006 and 2008. One post-treatment period, 2010, is indicated by the gray shaded area.

The next step is to formally assess underlying assumptions. As in the original study, it is common to incorporate additional covariates to make the parallel trends assumption more plausible. Based on detailed domain knowledge, Malesky, Nguyen and Tran (2014) include four control variables: area size of each commune, population size, whether it is a national-level city or not, and regional fixed effects. Thus, we assess the conditional extended parallel trends assumption by fitting the DID regression (equation (15)) to pre-treatment periods from 2006 to 2008, where $X_{it}$ includes the four control variables. If the conditional extended parallel trends assumption holds, estimates of the DID regression on pre-treatment trends should be close to zero.

While a traditional approach is to assess whether estimates are statistically distinguishable from zero at the conventional 5% or 10% level, we also report results based on an equivalence approach that we recommend in Section 3. Specifically, we compute the 95% standardized equivalence confidence interval, which quantifies the smallest equivalence range supported by the observed data (Hartman and Hidalgo, 2018). In the context of this application, the equivalence confidence interval is standardized based on the mean and standard deviation of the control group in 2006. For example, if the 95% standardized equivalence confidence interval is $[-\nu, \nu]$, this means that the equivalence test rejects the hypothesis that the DID estimate (standardized with respect to the baseline control outcome) on pre-treatment periods is larger than $\nu$ or smaller than $-\nu$ at the 5% level. Thus, the conditional extended parallel trends assumption is more plausible when the equivalence confidence interval is shorter.

The results are summarized in Table 2. Standard errors are computed via block-bootstrap at the district level, where we take 2000 bootstrap iterations.
For the first outcome, as the graphical presentation in Figure 4 suggests, a statistical test suggests that the extended parallel trends assumption is plausible. The test of the conditional extended parallel trends yields the p-value of . (the third column), and similarly, the 95% standardized equivalence confidence interval is [− . , . ] (the fourth column), which is shorter than those of the other two outcomes discussed below. For the second outcome, the test of the parallel trends produces the p-value of 0.048, and the 95% standardized equivalence confidence interval, [− . , . ], reveals that the parallel trends assumption is less plausible for this outcome than for the first outcome. Finally, for the third outcome, "Agricultural Center," both traditional and equivalence approaches provide little evidence for parallel trends, as is graphically clear in Figure 4. The test of the parallel trends is rejected at the 5% level (p-value = 0.037) and the 95% standardized equivalence confidence interval is relatively large, [− . , . ]. Although we only have two pre-treatment periods as in the original analysis, if more than two pre-treatment periods are available, researchers can assess the extended parallel trends-in-trends assumption in a similar way by applying the sequential DID estimator to pre-treatment periods. After assessing the underlying parallel trends assumptions, we now proceed to the estimation of the ATT via the double DID.

                                  Estimate   Std. Error   p-value   95% Std. Equivalence CI
Education and Cultural Program    − .                               [− . , . ]
Tap Water                         0.166      0.084        0.048     [− . , . ]
Agricultural Center               0.198      0.095        0.037     [− . , . ]

Table 2: Assessing Underlying Assumptions Using the Pre-treatment Outcomes. Note: We evaluate the conditional extended parallel trends assumption for three different outcomes. The table reports DID estimates on pre-treatment trends, standard errors, p-values, and the 95% standardized equivalence confidence intervals.
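The two diagnostics reported in Table 2 can be related to an estimate and its standard error with a short calculation. The sketch below is illustrative only. It assumes a normal approximation for the two-sided p-value, and it assumes that the equivalence confidence interval is obtained by inverting two one-sided tests, so that the bound is the absolute estimate plus the one-sided critical value times the standard error; that construction is an assumption made here for illustration, not a formula quoted from this section. The function names are hypothetical, and the inputs are the estimates and standard errors that remain legible in Table 2.

```python
from statistics import NormalDist

def two_sided_p(est, se):
    """Two-sided p-value for H0: estimate = 0 under a normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(est / se)))

def equivalence_bound(est, se, alpha=0.05):
    """Smallest nu such that both one-sided tests of |effect| >= nu reject at level alpha.

    Assumes the interval [-nu, nu] is built by inverting two one-sided tests.
    """
    return abs(est) + NormalDist().inv_cdf(1 - alpha) * se

# Estimates and standard errors as they appear in Table 2.
rows = [("Tap Water", 0.166, 0.084), ("Agricultural Center", 0.198, 0.095)]
for name, est, se in rows:
    print(name, round(two_sided_p(est, se), 3), round(equivalence_bound(est, se), 3))
```

Under the normal approximation, the computed p-values agree with the 0.048 and 0.037 shown in the third column of Table 2 after rounding to three decimals. The bound is on the same scale as the inputs, so it is a standardized bound only if the reported estimates are themselves standardized.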

Within the double DID framework, we select appropriate DID estimators after the empirical assessment of underlying assumptions. For the first outcome of “Education and Cultural Program,” diagnostics in the previous section suggest that the extended parallel trends assumption is plausible. In such settings, the double DID is expected to produce similar point estimates with smaller standard errors compared to the conventional DID estimator. The first plot of Figure 5 clearly shows this pattern. In the figure, we report point estimates as well as 90% confidence intervals following the original paper (see Figure 3 in Malesky, Nguyen and Tran (2014)). Using the standard DID estimator, the original estimate of the ATT on “Education and Cultural Program” was . (90% CI = [ − . , . ]). Using the double DID estimator, the estimate is instead . (90% CI = [ . , . ]). By using the double DID estimator, we shrink standard errors by about 10%. Although we only have two pre-treatment periods here, when there are more pre-treatment periods, the efficiency gain of the double DID can be even larger.

Figure 5: Estimating Causal Effects of Abolishing Elected Councils. (Panels: Education and Cultural Program; Tap Water. Vertical axis: ATT with 90% confidence interval. Estimators compared: DID and double DID.)

Note: We compare estimates from the standard DID and the proposed double DID. For the first plot where the extended parallel trends assumption is plausible, the double DID produces a similar point estimate with smaller standard errors. For the second plot where only the parallel trends-in-trends assumption is plausible, the double DID estimator can still estimate the ATT, while the standard DID estimate is likely to be biased.

For the second outcome of “Tap Water,” we did not have enough evidence to support the extended parallel trends assumption. Thus, instead of using the standard DID as in the original analysis, we rely on the parallel trends-in-trends assumption. In this case, the double DID estimates the ATT by allowing for linear time-varying unmeasured confounding in contrast to the standard DID that still assumes constant unmeasured confounders. The second plot of Figure 5 shows the important difference between the two methods. While the standard DID estimate is − . (90% CI = [ − . , . ]), the double DID estimate is − . (90% CI = [ − . , − . ]). Given that the extended parallel trends assumption is not plausible, this result suggests that the standard DID suffers from substantial bias (the bias of . corresponds to more than 50% of the original point estimate). By incorporating non-parallel pre-treatment trends, the double DID shows that the original DID estimate was underestimated by a large amount.
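The contrast between the two assumptions can be illustrated with a small numerical sketch. It compares the standard DID with the sequential DID (the DID between periods 2 and 1 minus the DID between periods 1 and 0, as in Appendix B.3) when the groups differ by a linear trend. Our reading is that this is the component the double DID relies on when only the parallel trends-in-trends assumption is plausible; the group-period means below are hypothetical and are not taken from the application.

```python
def did(means, post, pre):
    """DID of group-period means between periods `pre` and `post`."""
    treated_change = means[(1, post)] - means[(1, pre)]
    control_change = means[(0, post)] - means[(0, pre)]
    return treated_change - control_change


def sequential_did(means):
    """DID(2, 1) minus DID(1, 0): removes a linear difference in trends."""
    return did(means, 2, 1) - did(means, 1, 0)


# Hypothetical group-period means, keyed by (group, period).
# The control group is flat; the treated group's untreated outcome rises by
# 0.5 per period; the true effect in period 2 is -0.2, so the observed
# treated mean in period 2 is 3.0 - 0.2 = 2.8.
means = {
    (0, 0): 1.0, (0, 1): 1.0, (0, 2): 1.0,
    (1, 0): 2.0, (1, 1): 2.5, (1, 2): 2.8,
}

print("standard DID:  ", round(did(means, 2, 1), 3))
print("sequential DID:", round(sequential_did(means), 3))
```

In this constructed example the standard DID absorbs the 0.5 per-period divergence into its estimate, while the sequential DID subtracts the pre-treatment DID and recovers the effect of -0.2.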
Finally, for the third outcome (“Agricultural Center”), the previous diagnostics suggest that the extended parallel trends assumption is implausible. It is possible to use the double DID under the parallel trends-in-trends assumption. However, trends of the treatment and control groups have opposite signs, implying that the double DID estimates are highly sensitive to the parallel trends-in-trends assumption. Given that the parallel trends-in-trends assumption is also difficult to justify here, there is no credible estimator of the ATT without making additional stringent assumptions. While we mainly focused on the three outcomes here, the double DID improves upon the standard DID in a similar way for the other outcomes as well (see Appendix G).

In this section, we apply the proposed double DID estimator to revisit Paglayan (2019), which uses the staggered adoption (SA) design to study the effect of granting collective bargaining rights to teachers' unions on educational expenditures and teachers' salaries. Paglayan (2019) applies the standard two-way fixed effects models to estimate the effect of the introduction of the mandatory bargaining law in the US states on the two outcomes. The original author exploits the variation induced by the different introduction timing of the law: A few states introduced the law as early as the mid 1960's, while some states, such as Arizona or Kentucky, never introduced the mandate. Among the states that granted the bargaining rights, the introduction timing varies from the mid 1960's to the mid 1980's (Nebraska was the last state that adopted the law).

We apply the proposed double DID for the SA design to panel data consisting of state-year observations. A state is treated in a particular year if the state passes the law or has already passed the law of mandatory bargaining. Following the original study, we study two outcomes: per-pupil expenditure and annual teacher salary, both on a log scale. There are 2,058 observations, containing 49 states (excluding Washington DC and Wisconsin, due to the short availability of the pre-treatment outcomes) and spanning from 1959 through 2000.

Figure 6 shows the variation of the treatment across states and over time. Cells in gray indicate state-year observations that are not treated and blue cells indicate the treated observations. We can observe that there are 14 unique treatment timings (the earliest is 1965 and the latest is 1987) where the number of states at each treatment timing varies from one to six (the average number of states at a treatment timing is 2.3). We can also see that there is no reversal of a treatment status in that once a state adopts the policy, the state has never abolished it during the sample period.

Figure 6: Treatment Variation Plot. (Horizontal axis: years 1959 to 2000. Rows, top to bottom: CT, MI, MA, RI, NJ, NY, MD, ND, NV, SD, AK, DE, HI, KS, ME, VT, ID, OK, PA, MN, IN, FL, OR, IA, MT, CA, NH, WA, TN, IL, OH, NE, AL, AR, AZ, CO, GA, KY, LA, MO, MS, NC, NM, SC, TX, UT, VA, WV, WY. Legend: Treated, Control.)

Note: Cells in gray are state-year observations that are not treated (i.e., the mandatory bargaining law is not implemented), while cells in blue are observations that are under the treatment condition. Rows are sorted such that states that adopt the policy at earlier years are shown near the top, while states that never adopt the policy are shown near the bottom. The figure indicates that there are variations across states in adoption timings, and that some states never adopt the policy.

We assess the underlying parallel trends assumption for the SA design by utilizing the pre-treatment outcome. As in the pre-treatment-trends test in the basic DID design, we apply the
standard DID estimator for the SA design to pre-treatment periods. For example, to test the pre-treatment trends from t − to t for units who receive the treatment at time t, we estimate the SA-ATT using the outcome from t − and t − (see Section 4.2 for more details). To further facilitate interpretation, we standardize the outcome by the mean and standard deviation of the baseline control group, so that the effect can be interpreted relative to the control group.

Figure 7: Assessing Underlying Assumptions Using the Pre-treatment Outcomes (Left: logged expenditure; Right: logged teacher salary). (Horizontal axis: time relative to treatment assignment, −5 to −1. Vertical axis: 95% standardized equivalence CI.)

Note: We report the 95% standardized equivalence confidence intervals.

Figure 7 shows 95% standardized equivalence confidence intervals for the two outcomes of interest (see Section 3.1 for details on the standardization procedure). It shows that for both outcomes, the equivalence confidence intervals are within 0.2 standard deviation from the means of the baseline control groups through t − 5 to t − 1. This suggests that the extended parallel trends assumption is plausible for both outcomes.

We apply the double DID for the SA design as described in Section 4. The standard errors are computed by conducting the block bootstrap where the block is taken at the state level and we take 2000 bootstrap iterations. Analyses for the two outcomes are conducted separately. In addition to the proposed method, we apply two existing variants of synthetic control methods that can handle the staggered adoption design: the generalized synthetic control method, gsynth (Xu, 2017), and the augmented synthetic control method, augsynth (Ben-Michael, Feller and Rothstein, 2019). While the proposed double DID is better suited for settings where
there are a small to moderate number of pre-treatment periods, we evaluate, in the setting of long pre-treatment periods, whether it can achieve comparable performance to these variants of synthetic control methods that are primarily designed to deal with long pre-treatment periods (see more discussions in Section 5.3).

Figure 8: Plot of the Average Treatment Effect on the Treated on Two Outcomes. (Rows: expenditure, salary. Columns: double DID, gsynth, augsynth. Horizontal axis: time, −2 to 8. Vertical axis: SA-ATT.)

Note: We compare estimates from the double DID, the generalized synthetic control method, and the augmented synthetic control method. The causal estimates are similar across methods for both outcomes and treatment effects are not statistically significant at the conventional 5% level for most of the time periods.

Figure 8 shows the estimates of the treatment effect on the per-pupil expenditure (the first row) and the teacher's salary (the second row), where both effects are on a log scale. We estimated the average treatment effect on the two outcomes ℓ periods after the treatment assignment where ℓ = { , , . . . , }. Note that ℓ = 0 corresponds to the contemporaneous effect. Each column corresponds to a different estimator. The first column shows the proposed double DID estimator for the staggered adoption design, whereas the second (third) column shows estimates based on the generalized synthetic control method (the augmented synthetic control method). We can see that estimates are similar across methods for both outcomes and treatment effects are not statistically significant at the 5% level for most of the time periods. This result is consistent with the original finding of Paglayan (2019) that granting collective bargaining rights did not increase the level of resources devoted to education.

As in this example, when there are a large number of pre-treatment periods, it is important to apply both synthetic control methods and the proposed double DID and evaluate robustness across those approaches. This is critical because they rely on different identification assumptions. We found such robustness in this application, which provides us with additional credibility.

While the most basic form of the DID only requires two time periods, one before and the other after treatment assignment, researchers can often collect data from several additional pre-treatment periods in a wide range of applications. In this article, we show that such multiple pre-treatment periods can help improve the basic DID design and the staggered adoption design in three ways: (1) assessing underlying assumptions about parallel trends, (2) improving estimation accuracy, and (3) enabling more flexible DID estimators. We use the potential outcomes framework to clarify assumptions required to enjoy each benefit.

We then propose a simple method, the double DID, to combine all three benefits within the GMM framework. Importantly, the double DID contains the popular two-way fixed effects regression and nonparametric DID estimators as special cases, and it uses the GMM to further improve with respect to identification and estimation accuracy. Finally, we provide two key extensions. First, we accommodate any number of pre- and post-treatment periods, which allows for even more flexible forms of unmeasured time-varying confounding. Second, we generalize the double DID estimator to the staggered adoption design where the timing of the treatment assignment can vary across units.

References

Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” The Review of Economic Studies.

Journal of Economic Literature.

Abadie, Alberto, Alexis Diamond and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association.

https://arxiv.org/abs/1804.05785.

Angrist, Joshua D and Jörn-Steffen Pischke. 2008. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.

Athey, Susan and Guido W Imbens. 2018. “Design-based Analysis in Difference-in-Differences Settings with Staggered Adoption.” National Bureau of Economic Research.

Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens and Khashayar Khosravi. 2017. “Matrix Completion Methods for Causal Panel Data Models.” Available at https://arxiv.org/abs/1710.10251.

Bechtel, Michael M and Jens Hainmueller. 2011. “How Lasting is Voter Gratitude? An Analysis of the Short- and Long-Term Electoral Returns to Beneficial Policy.” American Journal of Political Science.

Annual Review of Political Science. https://arxiv.org/abs/1811.04170.

Ben-Michael, Eli, Avi Feller and Jesse Rothstein. 2019. “Synthetic Controls and Weighted Event Studies with Staggered Adoption.” arXiv preprint arXiv:1912.03290.

Bertrand, Marianne, Esther Duflo and Sendhil Mullainathan. 2004. “How Much Should We Trust Differences-in-Differences Estimates?” The Quarterly Journal of Economics.

American Political Science Review.

The Journal of Politics.

American Journal of Political Science.

Political Analysis.

American Political Science Review.

American Journal of Political Science.

American Political Science Review.

Political Science Research and Methods.

Econometrica.

American Journal of Political Science. https://ssrn.com/abstract=3214231.

Imai, Kosuke and In Song Kim. 2011. “Understanding and Improving Linear Fixed Effects Regression Models for Causal Inference.” Available at https://imai.fas.harvard.edu/research/files/FEmatchOld.pdf.

Imai, Kosuke and In Song Kim. 2019. “When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science.

Political Analysis.

Imbens, Guido W and Jeffrey M Wooldridge. 2009. “Recent Developments in the Econometrics of Program Evaluation.” Journal of Economic Literature.

https://arxiv.org/abs/1901.01869.

Keele, Luke and William Minozzi. 2013. “How Much is Minnesota like Wisconsin? Assumptions and Counterfactuals in Causal Inference with Observational Data.” Political Analysis.

American Journal of Political Science.

Review of Economics and Statistics.

Sociological Methods & Research.

Available at SSRN 3555463.

Malesky, Edmund J, Cuong Viet Nguyen and Anh Tran. 2014. “The Impact of Recentralization on Public Services: A Difference-in-Differences Analysis of the Abolition of Elected Councils in Vietnam.” American Political Science Review.

https://e-archivo.uc3m.es/bitstream/handle/10016/16065/we1233.pdf?sequence=1.

Mora, Ricardo and Iliana Reggio. 2019. “Alternative Diff-in-Diffs Estimators with Several Pretreatment Periods.” Econometric Reviews.

Statistical Science.

American Journal of Political Science.

Journal of Educational Psychology.

American Political Science Review.

Testing Statistical Hypotheses of Equivalence and Noninferiority. Chapman and Hall/CRC.

Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press.

Xu, Yiqing. 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis.

arXiv preprint arXiv:2006.02423.

Review of Papers in APSR and AJPS

We conduct a review of the literature to assess current practices of the difference-in-differences (DID) design. Specifically, we search articles published in American Political Science Review and American Journal of Political Science from 2015 to 2019. Some of the papers we reviewed were accepted in 2019 and were officially published in 2020. Using Google Scholar, we find articles that contain any of the following keywords: “two-way fixed effect”, “two-way fixed effects”, “difference in difference” or “difference in differences.” We then manually select articles from the list that use the basic DID design or the staggered adoption design (see the main text for details about these two designs). This procedure left us with a total of 25 articles, 11 from APSR and 14 from AJPS. Tables 3 and 4 show the articles in the list published in APSR and AJPS, respectively.

To determine the number of pre-treatment periods, we manually assess the listed articles. Among the 25 articles, 20 articles use the basic DID design, and 5 articles use the staggered adoption design. When a paper uses the basic DID design, we can determine the length of the pre-treatment periods from the data description and the time of the treatment assignment. On the other hand, the pre-treatment periods for the staggered adoption and the general design are set to the total number of time periods available in the data, as the length of pre-treatment periods varies across units.

Table 3: DID papers on APSR.

Authors | Year | Title
O’Brien, D. Z., & Rickne, J. | 2016 | Gender Quotas And Women’s Political Leadership
Garfias, F. | 2018 | Elite Competition and State Capacity Development: Theory and Evidence From Post-Revolutionary Mexico
Martin, G. J., & McCrain, J. | 2019 | Local News And National Politics
Blom-Hansen, J., Houlberg, K., Serritzlew, S., & Treisman, D. | 2016 | Jurisdiction Size and Local Government Policy Expenditure: Assessing The Effect of Municipal Amalgamation
Clinton, J. D., & Sances, M. W. | 2018 | The Politics of Policy: The Initial Mass Political Effects of Medicaid Expansion in The States
Malesky, E. J., Nguyen, C. V., & Tran, A. | 2014 | The Impact of Recentralization on Public Services: A Difference-in-Differences Analysis of the Abolition of Elected Councils in Vietnam
Larsen, M. V., Hjorth, F., Dinesen, P. T., & Sønderskov, K. M. | 2019 | When Do Citizens Respond Politically to The Local Economy? Evidence From Registry Data on Local Housing Markets
Becher, M., & González, I. M. | 2019 | Electoral Reform and Trade-Offs in Representation
Selb, P., & Munzert, S. | 2018 | Examining A Most Likely Case for Strong Campaign Effects
Enos, R. D., Kaufman, A. R., & Sands, M. L. | 2019 | Can Violent Protest Change Local Policy Support?
Vasiliki Fouka | 2019 | How Do Immigrants Respond to Discrimination?

Table 4: DID papers on AJPS.

Authors | Year | Title
Bechtel, M. M., Hangartner, D., & Schmid, L. | 2016 | Does compulsory voting increase support for leftist policy?
Bisgaard, M., & Slothuus, R. | 2018 | Partisan elites as culprits? How party cues shape partisan perceptual gaps.
Bischof, D., & Wagner, M. | 2019 | Do voters polarize when radical parties enter parliament?
Dewan, T., Meriläinen, J., & Tukiainen, J. | 2020 | Victorian voting: The origins of party orientation and class alignment.
Earle, J. S., & Gehlbach, S. | 2015 | The Productivity Consequences of Political Turnover: Firm-Level Evidence from Ukraine’s Orange Revolution.
Enos, R. D. | 2016 | What the demolition of public housing teaches us about the impact of racial threat on political behavior.
Gingerich, D. W. | 2019 | Ballot Reform as Suffrage Restriction: Evidence from Brazil’s Second Republic.
Hainmueller, J., & Hangartner, D. | 2019 | Does direct democracy hurt immigrant minorities? Evidence from naturalization decisions in Switzerland.
Holbein, J. B., & Hillygus, D. S. | 2016 | Making young voters: the impact of preregistration on youth turnout.
Jäger, K. | 2020 | When Do Campaign Effects Persist for Years? Evidence from a Natural Experiment.
Lindgren, K. O., Oskarsson, S., & Dawes, C. T. | 2017 | Can Political Inequalities Be Educated Away? Evidence from a Large-Scale Reform.
Lopes da Fonseca, M. | 2017 | Identifying the source of incumbency advantage through a constitutional reform.
Paglayan, A. S. | 2019 | Public-Sector Unions and the Size of Government
Pardos-Prado, S., & Xena, C. | 2019 | Skill specificity and attitudes toward immigration.

Nonparametric Equivalence to Regression Estimators

In this section, we provide results on the nonparametric connection between regression estimators and the three DID estimators we discussed in the paper. This section provides methodological foundations for our main methodological contributions, which we prove in Sections C and D.

B.1 Standard DID

B.1.1 Repeated Cross-Sectional Data

For later use in this Appendix, we report the well-known result that the standard DID estimator $\widehat{\tau}_{\text{DID}}$ (equation (3)) is equivalent to the coefficient $\widehat{\beta}$ in the regression estimator (equation (4)) (Abadie, 2005).

We define $O_{it}$ to be an indicator variable taking the value 1 when individual $i$ is observed in time period $t$, and we write $n_{gt}$ for the number of observed units with $G_i = g$ in time period $t$. Using this notation, we prove the following result.

Result 1 (Nonparametric Equivalence of the Standard DID and Regression Estimator). We write the linear regression estimator (equation (4)) as a solution to the following least squares problem.

$$
(\widehat{\alpha}, \widehat{\theta}, \widehat{\gamma}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=1}^{2} O_{it} \left\{ Y_{it} - \alpha - \theta G_i - \gamma I_t - \beta (G_i \times I_t) \right\}^2 .
$$

Then, $\widehat{\tau}_{\text{DID}} = \widehat{\beta}$.

Proof. By solving the least squares problem, we obtain the following solutions:

$$
\begin{aligned}
\widehat{\alpha} &= \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}}, \\
\widehat{\theta} &= \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}}, \\
\widehat{\gamma} &= \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}}, \\
\widehat{\beta} &= \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right),
\end{aligned}
$$

which completes the proof.

B.1.2 Panel Data
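The panel-data equivalence treated in this subsection (Result 2 below) can be checked numerically. The sketch below is only an illustration on a hypothetical two-period panel: it fits the two-way fixed effects regression with unit dummies by exact least squares, using rational arithmetic so that the comparison is free of rounding, and compares the coefficient on $D_{it}$ with the DID of mean changes. The helper names `solve` and `ols` and all data values are assumptions made for this example.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A must be nonsingular)."""
    n = len(A)
    M = [list(map(Fraction, row)) + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def ols(X, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)


# Hypothetical two-period panel: G is the group indicator, Y1 and Y2 the outcomes.
G = [1, 1, 0, 0, 0]
Y1 = [3, 5, 2, 4, 1]
Y2 = [7, 6, 3, 4, 2]
n = len(G)

X, y = [], []
for i in range(n):
    for t, outcome in ((1, Y1[i]), (2, Y2[i])):
        unit_dummies = [1 if j == i else 0 for j in range(n)]
        post = 1 if t == 2 else 0
        X.append(unit_dummies + [post, G[i] * post])  # last column is D_it
        y.append(outcome)

beta_twfe = ols(X, y)[-1]

n1 = sum(G)
n0 = n - n1
tau_did = (Fraction(sum(g * (b - a) for g, a, b in zip(G, Y1, Y2)), n1)
           - Fraction(sum((1 - g) * (b - a) for g, a, b in zip(G, Y1, Y2)), n0))

print("TWFE coefficient:", beta_twfe)
print("DID of means:    ", tau_did)
```

The unit dummies play the role of $\alpha_i$ and the single period dummy the role of $\delta_t$; one unit dummy per unit together with one period dummy keeps the design matrix full rank.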

Again, for later use in the Appendix, we report the well-known result that the standard DID estimator $\widehat{\tau}_{\text{DID}}$ (equation (3)) is equivalent to the coefficient $\widehat{\beta}$ in the two-way fixed effects regression estimator in the panel data setting (Abadie, 2005).

Result 2 (Nonparametric Equivalence of the Standard DID and Two-way Fixed Effects Regression Estimator). We can write the two-way fixed effects regression estimator as a solution to the following least squares problem.

$$
(\widehat{\alpha}, \widehat{\delta}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=1}^{2} (Y_{it} - \alpha_i - \delta_t - \beta D_{it})^2 .
$$

Then, $\widehat{\tau}_{\text{DID}} = \widehat{\beta}$.

Proof. First we define the demeaned treatment and outcome variables, $\overline{Y}_i = \sum_{t=1}^{2} Y_{it}/2$, $\overline{Y}_t = \sum_{i=1}^{n} Y_{it}/n$, $\overline{Y} = \sum_{i=1}^{n} \sum_{t=1}^{2} Y_{it}/2n$, $\overline{D}_i = \sum_{t=1}^{2} D_{it}/2$, $\overline{D}_t = \sum_{i=1}^{n} D_{it}/n$, and $\overline{D} = \sum_{i=1}^{n} \sum_{t=1}^{2} D_{it}/2n$. Given these transformed variables, we can transform the least squares problem into a well-known demeaned form.

$$
\widehat{\beta} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} \sum_{t=1}^{2} (\widetilde{Y}_{it} - \beta \widetilde{D}_{it})^2
$$

where $\widetilde{Y}_{it} = Y_{it} - \overline{Y}_i - \overline{Y}_t + \overline{Y}$ and $\widetilde{D}_{it} = D_{it} - \overline{D}_i - \overline{D}_t + \overline{D}$. Using this notation, we can express $\widehat{\beta}$ as

$$
\widehat{\beta} = \frac{\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it} \widetilde{Y}_{it}}{\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it}^2}
$$

where $\widetilde{D}_{it}$ takes the following form,

$$
\widetilde{D}_{it} =
\begin{cases}
(1/2) \cdot n_0/n & \text{if } G_i = 1, t = 2 \\
-(1/2) \cdot n_0/n & \text{if } G_i = 1, t = 1 \\
-(1/2) \cdot n_1/n & \text{if } G_i = 0, t = 2 \\
(1/2) \cdot n_1/n & \text{if } G_i = 0, t = 1,
\end{cases}
$$

where $n_1 = \sum_{i=1}^{n} G_i$ and $n_0 = \sum_{i=1}^{n} (1 - G_i)$.

Then, the numerator can be written as

$$
\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it} \widetilde{Y}_{it} = \frac{n_0}{2n} \left\{ \sum_{i=1}^{n} G_i \widetilde{Y}_{i2} - \sum_{i=1}^{n} G_i \widetilde{Y}_{i1} \right\} - \frac{n_1}{2n} \left\{ \sum_{i=1}^{n} (1 - G_i) \widetilde{Y}_{i2} - \sum_{i=1}^{n} (1 - G_i) \widetilde{Y}_{i1} \right\}
$$

and the denominator is given as

$$
\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it}^2 = 2 n_1 \left( \frac{n_0}{2n} \right)^2 + 2 n_0 \left( \frac{n_1}{2n} \right)^2 = \frac{n_0 n_1}{2n} .
$$

Combining both terms, we get

$$
\begin{aligned}
\widehat{\beta} &= \frac{\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it} \widetilde{Y}_{it}}{\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it}^2} \\
&= \frac{1}{n_1} \left\{ \sum_{i=1}^{n} G_i \widetilde{Y}_{i2} - \sum_{i=1}^{n} G_i \widetilde{Y}_{i1} \right\} - \frac{1}{n_0} \left\{ \sum_{i=1}^{n} (1 - G_i) \widetilde{Y}_{i2} - \sum_{i=1}^{n} (1 - G_i) \widetilde{Y}_{i1} \right\} \\
&= \frac{1}{n_1} \sum_{i=1}^{n} G_i (Y_{i2} - Y_{i1}) - \frac{1}{n_0} \sum_{i=1}^{n} (1 - G_i)(Y_{i2} - Y_{i1}) \\
&= \widehat{\tau}_{\text{DID}},
\end{aligned}
$$

which concludes the proof.

B.2 Extended DID

B.2.1 Repeated Cross-Sectional Data
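The weighted-average representation established in this subsection (Result 3 below) can also be checked numerically. The sketch below is an illustration on a hypothetical repeated cross-section with unequal cell sizes: it fits the regression of the outcome on the group indicator, period dummies, and $D_{it}$ by exact least squares and compares the coefficient with $\lambda \widehat{\tau}_{\text{DID}} + (1 - \lambda) \widehat{\tau}_{\text{DID(2,0)}}$, where $\lambda$ is computed from the cell sizes. The helper names and data values are assumptions made for this example.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A must be nonsingular)."""
    n = len(A)
    M = [list(map(Fraction, row)) + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def ols(X, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)


# Hypothetical repeated cross-section: outcomes by (group g, period t) cell.
cells = {
    (1, 0): [2, 4], (0, 0): [1],
    (1, 1): [6],    (0, 1): [2, 2, 5],
    (1, 2): [9, 7], (0, 2): [4],
}

X, y = [], []
for (g, t), outcomes in cells.items():
    for value in outcomes:
        time_dummies = [1 if t == s else 0 for s in (0, 1, 2)]
        d = g if t == 2 else 0  # treated group in the post-treatment period
        X.append([g] + time_dummies + [d])
        y.append(value)

beta = ols(X, y)[-1]

size = {cell: len(v) for cell, v in cells.items()}
mean = {cell: Fraction(sum(v), len(v)) for cell, v in cells.items()}


def did(post, pre):
    """DID of cell means between periods `pre` and `post`."""
    return (mean[(1, post)] - mean[(1, pre)]) - (mean[(0, post)] - mean[(0, pre)])


a = size[(1, 1)] * size[(0, 1)] * (size[(1, 0)] + size[(0, 0)])
b = size[(1, 0)] * size[(0, 0)] * (size[(1, 1)] + size[(0, 1)])
lam = Fraction(a, a + b)
combo = lam * did(2, 1) + (1 - lam) * did(2, 0)

print("regression coefficient:", beta)
print("lambda:", lam, " weighted DID:", combo)
```

With equal cell sizes across periods the same code yields $\lambda = 1/2$, the equal-weight case discussed in the text.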

We consider a case in which there are two pre-treatment periods $t = \{0, 1\}$ and one post-treatment period $t = 2$. Using this notation, we report the following result.

Result 3 (Nonparametric Equivalence of the Extended DID and Regression Estimator). We focus on a linear regression estimator that is a solution to the following least squares problem.

$$
(\widehat{\theta}, \widehat{\gamma}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{2} O_{it} (Y_{it} - \theta G_i - \gamma_t - \beta D_{it})^2 .
$$

Then, $\widehat{\beta} = \lambda \widehat{\tau}_{\text{DID}} + (1 - \lambda) \widehat{\tau}_{\text{DID(2,0)}}$ where

$$
\lambda = \frac{n_{11} n_{01} (n_{10} + n_{00})}{n_{11} n_{01} (n_{10} + n_{00}) + n_{10} n_{00} (n_{11} + n_{01})}, \qquad
1 - \lambda = \frac{n_{10} n_{00} (n_{11} + n_{01})}{n_{11} n_{01} (n_{10} + n_{00}) + n_{10} n_{00} (n_{11} + n_{01})} .
$$

When the sample size of each group is fixed over time, i.e., $n_{11} = n_{10}$ and $n_{01} = n_{00}$, $\lambda = 1/2$ and therefore, $\widehat{\beta}$ is equivalent to the extended DID estimator of equal weights in equation (9).

Proof. By solving the least squares problem, we obtain

$$
\begin{aligned}
\widehat{\theta} &= \lambda \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) + (1 - \lambda) \left( \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right), \\
\widehat{\gamma}_2 &= \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}}, \\
\widehat{\gamma}_1 &= \frac{\sum_{i: G_i = 1} Y_{i1} + \sum_{i: G_i = 0} Y_{i1}}{n_{11} + n_{01}} - \frac{n_{11}}{n_{11} + n_{01}} \widehat{\theta}, \\
\widehat{\gamma}_0 &= \frac{\sum_{i: G_i = 1} Y_{i0} + \sum_{i: G_i = 0} Y_{i0}}{n_{10} + n_{00}} - \frac{n_{10}}{n_{10} + n_{00}} \widehat{\theta}, \\
\widehat{\beta} &= \lambda \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) \right\} \\
&\quad + (1 - \lambda) \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right) \right\},
\end{aligned}
$$

which completes the proof.

B.2.2 Panel Data

We report that the extended DID estimator $\widehat{\tau}_{\text{e-DID}}$ (equation (9)) (equal weights: $\lambda = 1/2$) is equivalent to the estimated coefficient $\widehat{\beta}$ in the two-way fixed effects regression estimator in the panel data setting with $t = \{0, 1, 2\}$.

Result 4 (Nonparametric Equivalence of the Extended DID and Two-way Fixed Effects Regression Estimator). We can write the two-way fixed effects regression estimator as a solution to the following least squares problem.

$$
(\widehat{\alpha}, \widehat{\delta}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{2} (Y_{it} - \alpha_i - \delta_t - \beta D_{it})^2 .
$$

Then, $\widehat{\tau}_{\text{e-DID}} = \widehat{\beta}$.

Proof. First we define $\overline{Y}_i = \sum_{t=0}^{2} Y_{it}/3$, $\overline{Y}_t = \sum_{i=1}^{n} Y_{it}/n$, $\overline{Y} = \sum_{i=1}^{n} \sum_{t=0}^{2} Y_{it}/3n$, $\overline{D}_i = \sum_{t=0}^{2} D_{it}/3$, $\overline{D}_t = \sum_{i=1}^{n} D_{it}/n$, and $\overline{D} = \sum_{i=1}^{n} \sum_{t=0}^{2} D_{it}/3n$. Then, we can write the two-way fixed effects estimator as a two-way demeaned estimator,

$$
\widehat{\beta} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} \sum_{t=0}^{2} (\widetilde{Y}_{it} - \beta \widetilde{D}_{it})^2 = \frac{\sum_{i=1}^{n} \sum_{t=0}^{2} \widetilde{D}_{it} \widetilde{Y}_{it}}{\sum_{i=1}^{n} \sum_{t=0}^{2} \widetilde{D}_{it}^2},
$$

as in Result 2, where $\widetilde{Y}_{it} = Y_{it} - \overline{Y}_i - \overline{Y}_t + \overline{Y}$ and $\widetilde{D}_{it} = D_{it} - \overline{D}_i - \overline{D}_t + \overline{D}$. Importantly, $\widetilde{D}_{it}$ takes the following form:

$$
\widetilde{D}_{it} =
\begin{cases}
(2/3) \cdot n_0/n & \text{if } G_i = 1, t = 2 \\
-(1/3) \cdot n_0/n & \text{if } G_i = 1, t = 0, 1 \\
-(2/3) \cdot n_1/n & \text{if } G_i = 0, t = 2 \\
(1/3) \cdot n_1/n & \text{if } G_i = 0, t = 0, 1,
\end{cases}
$$

where $n_1 = \sum_{i=1}^{n} G_i$ and $n_0 = \sum_{i=1}^{n} (1 - G_i)$. Then, the numerator can be written as

$$
\begin{aligned}
\sum_{i=1}^{n} \sum_{t=0}^{2} \widetilde{D}_{it} \widetilde{Y}_{it}
&= \sum_{i=1}^{n} G_i \left( \frac{2 n_0}{3n} \right) \widetilde{Y}_{i2} - \sum_{i=1}^{n} \sum_{t=0}^{1} G_i \left( \frac{n_0}{3n} \right) \widetilde{Y}_{it} + \sum_{i=1}^{n} (1 - G_i) \left( -\frac{2 n_1}{3n} \right) \widetilde{Y}_{i2} + \sum_{i=1}^{n} \sum_{t=0}^{1} (1 - G_i) \left( \frac{n_1}{3n} \right) \widetilde{Y}_{it} \\
&= \sum_{i=1}^{n} G_i \left( \frac{n_0}{3n} \right) \{ \widetilde{Y}_{i2} - \widetilde{Y}_{i1} \} + \sum_{i=1}^{n} G_i \left( \frac{n_0}{3n} \right) \{ \widetilde{Y}_{i2} - \widetilde{Y}_{i0} \} \\
&\quad - \left\{ \sum_{i=1}^{n} (1 - G_i) \left( \frac{n_1}{3n} \right) \{ \widetilde{Y}_{i2} - \widetilde{Y}_{i1} \} + \sum_{i=1}^{n} (1 - G_i) \left( \frac{n_1}{3n} \right) \{ \widetilde{Y}_{i2} - \widetilde{Y}_{i0} \} \right\} \\
&= \frac{n_0}{3n} \left\{ \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i1} \} + \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i0} \} \right\} - \frac{n_1}{3n} \left\{ \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i1} \} + \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i0} \} \right\} .
\end{aligned}
$$

The denominator can be written as

$$
\sum_{i=1}^{n} \sum_{t=0}^{2} \widetilde{D}_{it}^2 = \frac{n_0 n_1}{3n} \cdot 2 .
$$

Combining the two terms, we have

$$
\begin{aligned}
\widehat{\beta} &= \frac{1}{2 n_1} \left\{ \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i1} \} + \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i0} \} \right\} - \frac{1}{2 n_0} \left\{ \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i1} \} + \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i0} \} \right\} \\
&= \frac{1}{2} \left\{ \frac{1}{n_1} \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i1} \} - \frac{1}{n_0} \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i1} \} \right\} + \frac{1}{2} \left\{ \frac{1}{n_1} \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i0} \} - \frac{1}{n_0} \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i0} \} \right\} \\
&= \frac{1}{2} \widehat{\tau}_{\text{DID}} + \frac{1}{2} \widehat{\tau}_{\text{DID(2,0)}},
\end{aligned}
$$

which completes the proof.

B.3 Sequential DID
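As a concrete reference point for this section, the sketch below checks on hypothetical data that the sequential DID, computed from cell means as the DID between periods 2 and 1 minus the DID between periods 1 and 0, coincides with the coefficient on $D_{it}$ in a regression with a group-specific linear time trend (Result 6 below). The helper names and data values are assumptions made for this example, and the regression is solved by exact least squares so the comparison involves no rounding.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A must be nonsingular)."""
    n = len(A)
    M = [list(map(Fraction, row)) + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def ols(X, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)


# Hypothetical repeated cross-section: outcomes by (group g, period t) cell.
cells = {
    (1, 0): [2, 4], (0, 0): [1],
    (1, 1): [6],    (0, 1): [2, 2, 5],
    (1, 2): [9, 9], (0, 2): [4],
}

X, y = [], []
for (g, t), outcomes in cells.items():
    for value in outcomes:
        time_dummies = [1 if t == s else 0 for s in (0, 1, 2)]
        d = g if t == 2 else 0
        # Columns: G, G x t (group-specific trend), period dummies, D.
        X.append([g, g * t] + time_dummies + [d])
        y.append(value)

beta = ols(X, y)[-1]

mean = {cell: Fraction(sum(v), len(v)) for cell, v in cells.items()}


def did(post, pre):
    """DID of cell means between periods `pre` and `post`."""
    return (mean[(1, post)] - mean[(1, pre)]) - (mean[(0, post)] - mean[(0, pre)])


s_did = did(2, 1) - did(1, 0)

print("regression coefficient:", beta)
print("sequential DID:        ", s_did)
```

With three periods and two groups this regression has as many coefficients as there are cell means, so it fits every cell mean exactly, which is why the two quantities agree without any parametric assumption.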

B.3.1 Repeated Cross-Sectional Data

We clarify that the sequential DID estimator $\widehat{\tau}_{\text{s-DID}}$ (equation (10)) is equivalent to a coefficient in a regression estimator with transformed outcomes.

Result 5 (Nonparametric Equivalence of the Sequential DID and Regression Estimator). We focus on a linear regression estimator with a transformed outcome.

$$
(\widehat{\alpha}, \widehat{\theta}, \widehat{\gamma}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=1}^{2} O_{it} \left\{ \Delta Y_{it} - \alpha - \theta G_i - \gamma I_t - \beta (G_i \times I_t) \right\}^2,
$$

where

$$
\Delta Y_{it} =
\begin{cases}
Y_{i2} - \dfrac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} & \text{if } G_i = 1, t = 2 \\
Y_{i1} - \dfrac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} & \text{if } G_i = 1, t = 1 \\
Y_{i2} - \dfrac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} & \text{if } G_i = 0, t = 2 \\
Y_{i1} - \dfrac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} & \text{if } G_i = 0, t = 1 .
\end{cases}
$$

Then, $\widehat{\tau}_{\text{s-DID}} = \widehat{\beta}$.

Proof. Using Result 1, we obtain

$$
\begin{aligned}
\widehat{\beta} &= \left( \frac{\sum_{i: G_i = 1} \Delta Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} \Delta Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} \Delta Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} \Delta Y_{i1}}{n_{01}} \right) \\
&= \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) \right\} \\
&\quad - \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right) \right\},
\end{aligned}
$$

which completes the proof.

Next, we clarify that the sequential DID estimator $\widehat{\tau}_{\text{s-DID}}$ (equation (10)) is also equivalent to a coefficient in a regression estimator with group-specific time trends. Mora and Reggio (2019) derive similar results by making the parametric assumption of the conditional expectations. We prove nonparametric equivalence without making any assumptions about conditional expectations.

Result 6 (Nonparametric Equivalence of the Sequential DID and Regression Estimator with Group-Specific Time Trends). We focus on a linear regression estimator with group-specific time trends.

$$
(\widehat{\theta}, \widehat{\gamma}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{2} O_{it} \left\{ Y_{it} - \theta_0 G_i - \theta_1 (G_i \times t) - \gamma_t - \beta D_{it} \right\}^2 .
$$

Then, $\widehat{\tau}_{\text{s-DID}} = \widehat{\beta}$.

Proof. By solving the least squares problem, we obtain

$$
\begin{aligned}
\widehat{\theta}_0 &= \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}}, \\
\widehat{\theta}_1 &= \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) - \left( \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right), \\
\widehat{\gamma}_0 &= \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}}, \quad \widehat{\gamma}_1 = \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}}, \quad \widehat{\gamma}_2 = \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}}, \\
\widehat{\beta} &= \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) \right\} \\
&\quad - \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right) \right\},
\end{aligned}
$$

which completes the proof.

B.4 Connection to the Leads Test

Here we formally prove the connection between the test of pre-treatment periods discussed inSection 2.2 and the well known leads test (Angrist and Pischke, 2008). The leads test includes D i,t +1 into a linear regression and check whether a coeﬃcient of D i,t +1 is zero. B.4.1 Repeated Cross-Sectional Data

In the repeated cross-sectional data setting, the leads test considers the following linear regression:

$$
(\widehat{\theta}, \widehat{\gamma}, \widehat{\beta}, \widehat{\zeta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{1} O_{it} \left( Y_{it} - \theta G_i - \gamma_t - \beta D_{it} - \zeta D_{i,t+1} \right)^2 .
$$

Then, because $D_{it} = 0$ for all units in $t \in \{0, 1\}$, this least squares problem is the same as

$$
(\widehat{\theta}, \widehat{\gamma}, \widehat{\zeta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{1} O_{it} \left( Y_{it} - \theta G_i - \gamma_t - \zeta D_{i,t+1} \right)^2 .
$$

Finally, using Result 1, we have

$$
\widehat{\zeta} = \left( \frac{\sum_{i: G_i=1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i=1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i=0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i=0} Y_{i0}}{n_{00}} \right),
$$

which is the standard DID estimator applied to the pre-treatment periods $t = 0, 1$.

B.4.2 Panel Data

In the panel data setting, the leads test considers the following two-way fixed effects regression:

$$
(\widehat{\alpha}, \widehat{\delta}, \widehat{\beta}, \widehat{\zeta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{1} \left( Y_{it} - \alpha_i - \delta_t - \beta D_{it} - \zeta D_{i,t+1} \right)^2 .
$$

Again, this least squares problem is the same as

$$
(\widehat{\alpha}, \widehat{\delta}, \widehat{\zeta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{1} \left( Y_{it} - \alpha_i - \delta_t - \zeta D_{i,t+1} \right)^2 ,
$$

and we obtain

$$
\widehat{\zeta} = \left( \frac{\sum_{i: G_i=1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i=1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i=0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i=0} Y_{i0}}{n_{00}} \right),
$$

which is the standard DID estimator applied to the pre-treatment periods $t = 0, 1$.

C Generalized K-Difference-in-Differences

In this section, we propose the generalized $K$-DID, which extends the double DID in Section 3 to an arbitrary number of pre- and post-treatment periods in the basic DID setting. We consider the staggered adoption design in Section 4.

C.1 The Setup and Causal Quantities of Interest

We first extend the setup to account for an arbitrary number of pre- and post-treatment periods. Suppose we observe outcome $Y_{it}$ for $i \in \{1, \ldots, n\}$ and $t \in \{0, 1, \ldots, T\}$. We define the binary treatment variable to be $D_{it} \in \{0, 1\}$. The treatment is assigned right before time period $T^*$, and thus time periods $t \in \{T^*, \ldots, T\}$ are the post-treatment periods and time periods $t \in \{0, \ldots, T^* - 1\}$ are the pre-treatment periods. As in Section 2.1, we denote the treatment group as $G_i = 1$ and $G_i = 0$ otherwise. Note that $D_{it} = 0$ for $t \in \{0, \ldots, T^* - 1\}$ for all units.

We are interested in the causal effect at post-treatment time $T^* + s$ where $s \geq 0$. When $s = 0$, this corresponds to the contemporaneous treatment effect. By specifying different values of $s > 0$, researchers can study a variety of long-term causal effects of the treatment. Formally, our quantity of interest is the average treatment effect on the treated (ATT) at post-treatment time $T^* + s$:

$$
\tau(s) \equiv \mathbb{E}[Y_{i,T^*+s}(1) - Y_{i,T^*+s}(0) \mid G_i = 1].
$$

For example, when $s = 3$, this could mean the causal effect of the policy three years after its initial introduction. This definition is a generalization of the standard ATT: when $s = 0$, this quantity is equal to the ATT defined in equation (1).

C.2 Generalize Parallel Trends Assumptions

What assumptions do we need to identify the ATT at post-treatment time $T^* + s$? Here, we provide a generalization of the parallel trends assumption, which incorporates both the standard parallel trends assumption and the parallel trends-in-trends assumption.

Assumption 4 ($k$-th Order Parallel Trends). For some integer $k$ such that $1 \leq k \leq T^*$,

$$
\Delta^k_s \left( \mathbb{E}[Y_{i,T^*+s}(0) \mid G_i = 1] \right) = \Delta^k_s \left( \mathbb{E}[Y_{i,T^*+s}(0) \mid G_i = 0] \right),
$$

where $\Delta^k_s$ is the $k$-th order difference operator defined recursively as follows. For $g \in \{0, 1\}$,

$$
\Delta^1_s \left( \mathbb{E}[Y_{i,T^*+s}(0) \mid G_i = g] \right) \equiv \mathbb{E}[Y_{i,T^*+s}(0) \mid G_i = g] - \mathbb{E}[Y_{i,T^*-1}(0) \mid G_i = g]
$$

when $k = 1$ and, in general,

$$
\begin{aligned}
\Delta^k_s \left( \mathbb{E}[Y_{i,T^*+s}(0) \mid G_i = g] \right) &\equiv \Delta^{k-1}_s \left( \mathbb{E}[Y_{i,T^*+s}(0) \mid G_i = g] \right) - M^k_s \, \Delta^{k-1} \left( \mathbb{E}[Y_{i,T^*-1}(0) \mid G_i = g] \right) \\
&= \mathbb{E}[Y_{i,T^*+s}(0) \mid G_i = g] - \mathbb{E}[Y_{i,T^*-1}(0) \mid G_i = g] - \sum_{j=1}^{k-1} M^{j+1}_s \, \Delta^{j} \left( \mathbb{E}[Y_{i,T^*-1}(0) \mid G_i = g] \right),
\end{aligned}
$$

where $M^{\ell}_s = \prod_{j=1}^{\ell-1} (s + j) \big/ \prod_{j=1}^{\ell-1} j$ for $\ell \geq 2$. $\Delta^k(\mathbb{E}[Y_{i,T^*-1}(0) \mid G_i = g])$ is also recursively defined as

$$
\Delta^k \left( \mathbb{E}[Y_{i,T^*-1}(0) \mid G_i = g] \right) \equiv \Delta^{k-1} \left( \mathbb{E}[Y_{i,T^*-1}(0) \mid G_i = g] \right) - \Delta^{k-1} \left( \mathbb{E}[Y_{i,T^*-2}(0) \mid G_i = g] \right),
$$

and $\Delta^1(\mathbb{E}[Y_{i,T^*-m}(0) \mid G_i = g]) = \mathbb{E}[Y_{i,T^*-m}(0) \mid G_i = g] - \mathbb{E}[Y_{i,T^*-m-1}(0) \mid G_i = g]$ for $m \in \{1, 2\}$.

The standard parallel trends assumption and the parallel trends-in-trends assumption are both special cases of this assumption. The $k$-th order parallel trends assumption reduces to the standard parallel trends assumption (Assumption 1) when $s = 0$ and $k = 1$, and to the parallel trends-in-trends assumption (Assumption 3) when $s = 0$ and $k = 2$.

To further clarify the meaning of Assumption 4, we can consider a simpler but stronger condition. In particular, the $k$-th order parallel trends assumption (Assumption 4) is implied by the following polynomial model of confounding of degree $k - 1$:

$$
\mathbb{E}[Y_{it}(0) \mid G_i = 1] - \mathbb{E}[Y_{it}(0) \mid G_i = 0] = \alpha + \sum_{p=1}^{k-1} \Gamma_p t^p,
$$

with unknown parameters $\alpha$ and $\Gamma$. Here, the left-hand side of the equality captures the difference between the two groups (treatment and control) in terms of the mean of potential outcomes under the control condition. This representation shows that the standard parallel trends assumption (Assumption 1) is implied by time-invariant confounding; the parallel trends-in-trends assumption (Assumption 3) is implied by linear time-varying confounding; and in general, the $k$-th order parallel trends assumption is implied by polynomial confounding of degree $k - 1$.

C.3 Estimate ATT with Multiple Pre- and Post-Treatment Periods

We consider the identification and estimation of the ATT at post-treatment time $T^* + s$. Under the $k$-th order parallel trends assumption (Assumption 4), the ATT is identified as follows:

$$
\tau(s) = \Delta^k_s \left( \mathbb{E}[Y_{i,T^*+s} \mid G_i = 1] \right) - \Delta^k_s \left( \mathbb{E}[Y_{i,T^*+s} \mid G_i = 0] \right).
$$

Because each conditional expectation can be consistently estimated via its sample analogue,

$$
\widehat{\tau}_k(s) = \Delta^k_s \left( \frac{\sum_{i: G_i=1} Y_{i,T^*+s}}{n_{1,T^*+s}} \right) - \Delta^k_s \left( \frac{\sum_{i: G_i=0} Y_{i,T^*+s}}{n_{0,T^*+s}} \right)
$$

is a consistent estimator for the ATT at time $T^* + s$ under the $k$-th order parallel trends assumption. When $s = 0$ and $k = 1$, this estimator corresponds to the standard DID estimator (equation (3)). When $s = 0$ and $k = 2$, this is equal to the sequential DID estimator (equation (10)). While existing approaches (e.g., Angrist and Pischke, 2008; Mora and Reggio, 2012; Lee, 2016; Mora and Reggio, 2019) consider each estimator separately, we propose combining multiple DID estimators within the GMM framework.

In general, the generalized double DID combines $K$ moment conditions, where $K$ is the number of pre-treatment periods researchers use. When there are more than two pre-treatment periods, we can naturally combine more than two DID estimators, improving upon the double DID in Section 3. Formally, the generalized double DID is defined as

$$
\widehat{\tau}(s) = \operatorname*{argmin}_{\tau} \; g(\tau)^\top \widehat{W} g(\tau),
$$

where $g(\tau) = (\tau - \widehat{\tau}_1(s), \ldots, \tau - \widehat{\tau}_K(s))^\top$. Based on the theory of efficient GMM (Hansen, 1982), the optimal weight matrix is $\widehat{W} = \mathrm{Var}(\widehat{\tau}_{(1:K)}(s))^{-1}$, where $\mathrm{Var}(\cdot)$ is the variance-covariance matrix and $\widehat{\tau}_{(1:K)}(s) = (\widehat{\tau}_1(s), \ldots, \widehat{\tau}_K(s))^\top$. When $T^* = 2$, this reduces to the standard DID estimator (equation (3)). When $T^* = 3$, this corresponds to the basic form of the double DID estimator (equation (13)).
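As a rough illustration of the two steps above, the following Python sketch applies the $k$-th order difference operator to a sequence of group means and then combines several $\widehat{\tau}_k(s)$ with the optimal weight matrix. The function names (`backward_diff`, `kth_order_diff`, `did_k`, `gmm_combine`) are hypothetical, and the variance-covariance matrix `V` is assumed to be supplied by the user (for example, from a bootstrap); this is not the interface of the authors' R package. The sketch uses two facts that follow from the definitions: $M^{j+1}_s$ equals the binomial coefficient $\binom{s+j}{j}$, and the quadratic GMM objective has the closed-form minimizer $\widehat{\tau} = (\mathbf{1}^\top \widehat{W} \mathbf{1})^{-1} \mathbf{1}^\top \widehat{W} \widehat{\tau}_{(1:K)}$.

```python
import numpy as np
from math import comb


def backward_diff(m, t, j):
    """j-th order backward difference of the sequence m at index t (Delta^j)."""
    if j == 0:
        return m[t]
    return backward_diff(m, t, j - 1) - backward_diff(m, t - 1, j - 1)


def kth_order_diff(m, t_star, s, k):
    """Delta^k_s applied to a sequence of group means m[0], ..., m[T].

    Requires 1 <= k <= t_star so that all needed pre-treatment means exist.
    """
    out = m[t_star + s] - m[t_star - 1]
    for j in range(1, k):
        # M_s^{j+1} = prod_{i=1}^{j} (s + i) / j! = C(s + j, j)
        out -= comb(s + j, j) * backward_diff(m, t_star - 1, j)
    return out


def did_k(m_treated, m_control, t_star, s, k):
    """k-th order DID estimate from treated and control group means."""
    return (kth_order_diff(m_treated, t_star, s, k)
            - kth_order_diff(m_control, t_star, s, k))


def gmm_combine(tau_hats, V):
    """Combine K estimates with W = V^{-1}; returns (estimate, variance, weights)."""
    tau_hats = np.asarray(tau_hats, dtype=float)
    ones = np.ones_like(tau_hats)
    W = np.linalg.inv(np.asarray(V, dtype=float))
    denom = ones @ W @ ones
    weights = (W @ ones) / denom          # weights sum to one
    return float(weights @ tau_hats), float(1.0 / denom), weights
```

For instance, if the control means are linear in time and the treated means differ from them by a linear confounding term plus a constant effect from $T^*$ onward, `did_k(..., k=2)` recovers the effect exactly, while `k=1` is off by the slope of the confounding term.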
Within the GMM framework, we can select moment conditions using the J-statistic (Hansen, 1982). We can similarly generalize the double DID regression.

To assess the extended parallel trends assumption, we can apply the generalized double DID to pre-treatment periods $t \in \{0, \ldots, T^* - 1\}$ as if the last pre-treatment period $T^* - 1$ is the target time period. The moments are $g(\tau) = (\tau - \widehat{\tau}_1(0), \ldots, \tau - \widehat{\tau}_K(0))^\top$, where

$$
\widehat{\tau}_k(0) = \Delta^k_s \left( \frac{\sum_{i: G_i=1} Y_{i,T^*-1}}{n_{1,T^*-1}} \right) - \Delta^k_s \left( \frac{\sum_{i: G_i=0} Y_{i,T^*-1}}{n_{0,T^*-1}} \right).
$$

Similarly, to assess the extended parallel trends-in-trends assumption, we can apply the generalized double DID to pre-treatment periods with moments $g(\tau) = (\tau - \widehat{\tau}_2(0), \ldots, \tau - \widehat{\tau}_K(0))^\top$.

D Generalized K-DID for Staggered Adoption Design

Combining the setup introduced in Section C.1 and the one in Section 4.1, we propose the generalized $K$-DID for the SA design, which allows researchers to estimate long-term causal effects in the SA design. We focus on the SA-ATT at post-treatment time $t + s$, where $t$ is the timing of the treatment assignment and $s \geq 0$ represents how far into the future we want to estimate the ATT. We first redefine the group indicator $G$ to estimate the long-term SA-ATT at post-treatment time $t + s$. In particular, we define

$$
G_{its} =
\begin{cases}
1 & \text{if } A_i = t \\
0 & \text{if } A_i > t + s \\
-1 & \text{otherwise}
\end{cases}
$$

where $G_{its} = 1$ represents units who receive the treatment at time $t$, and $G_{its} = 0$ indicates units who do not receive the treatment by time $t + s$. $G_{its} = -1$ includes all other units, that is, those who receive the treatment before time $t$ or between $t + 1$ and $t + s$. When $s = 0$, this definition corresponds to the group indicator in equation (18).

Formally, our first quantity of interest is the staggered-adoption average treatment effect on the treated (SA-ATT) at post-treatment time $t + s$:

$$
\tau^{\mathrm{SA}}(s, t) \equiv \mathbb{E}[Y_{i,t+s}(1) - Y_{i,t+s}(0) \mid G_{its} = 1].
$$
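The group indicator $G_{its}$ can be sketched in a few lines of Python. This is only an illustration of the definition above: the function name `group_indicator` is hypothetical, and encoding never-treated units with an adoption time of `np.inf` is an assumption made here for convenience, not something specified in the text.

```python
import numpy as np


def group_indicator(A, t, s):
    """Return G_its for each unit given adoption times A.

     1: treated exactly at time t
     0: not yet treated by time t + s (never-treated units assumed to have A = inf)
    -1: all other units (treated before t, or between t + 1 and t + s)
    """
    A = np.asarray(A, dtype=float)
    return np.where(A == t, 1, np.where(A > t + s, 0, -1))
```

With adoption times `[2, 3, 4, np.inf, 1]`, `t = 2`, and `s = 1`, the unit adopting at time 3 is excluded (coded -1) because it becomes treated inside the window between $t + 1$ and $t + s$, whereas with `s = 0` the same unit serves as a control.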
By averaging over time, we can also define the time-average staggered-adoption average treatment effect on the treated (time-average SA-ATT) at $s$ periods after treatment onset:

$$
\tau^{\mathrm{SA}}(s) \equiv \sum_{t \in \mathcal{T}} \pi_t \, \tau^{\mathrm{SA}}(s, t),
$$

where $\mathcal{T}$ represents the set of time periods for which researchers want to estimate the ATT. The SA-ATT in period $t$, $\tau^{\mathrm{SA}}(s, t)$, is weighted by the proportion of units who receive the treatment at time $t$: $\pi_t = \sum_{i=1}^{n} \mathbf{1}\{A_i = t\} \big/ \sum_{i=1}^{n} \mathbf{1}\{A_i \in \mathcal{T}\}$.

Here, we provide a generalization of the parallel trends assumption, which incorporates both the standard parallel trends assumption and the parallel trends-in-trends assumption.

Assumption 5 ($k$-th Order Parallel Trends for Staggered Adoption Design). For some integer $k$ such that $1 \leq k \leq T$, and for $k \leq t \leq T - s$,

$$
\Delta^k_s \left( \mathbb{E}[Y_{i,t+s}(0) \mid G_{its} = 1] \right) = \Delta^k_s \left( \mathbb{E}[Y_{i,t+s}(0) \mid G_{its} = 0] \right),
$$

where $\Delta^k_s$ is the $k$-th order difference operator defined in Assumption 4.

Under Assumption 5, the SA-ATT at post-treatment time $t + s$ is identified as follows:

$$
\tau^{\mathrm{SA}}(s, t) = \Delta^k_s \left( \mathbb{E}[Y_{i,t+s} \mid G_{its} = 1] \right) - \Delta^k_s \left( \mathbb{E}[Y_{i,t+s} \mid G_{its} = 0] \right).
$$

Since conditional expectations can be consistently estimated via the sample analogue,

$$
\widehat{\tau}^{\mathrm{SA}}_k(s, t) = \Delta^k_s \left( \frac{\sum_{i: G_{its}=1} Y_{i,t+s}}{n_{1,t+s}} \right) - \Delta^k_s \left( \frac{\sum_{i: G_{its}=0} Y_{i,t+s}}{n_{0,t+s}} \right)
$$

is a consistent estimator for the SA-ATT at post-treatment time $t + s$ under Assumption 5.

In general, we combine $K$ DID estimators to obtain the generalized $K$-DID for the SA-ATT at post-treatment time $t + s$ as follows:

$$
\widehat{\tau}^{\mathrm{SA}}(s, t) = \operatorname*{argmin}_{\tau^{\mathrm{SA}}} \; g(\tau^{\mathrm{SA}})^\top \widehat{W} g(\tau^{\mathrm{SA}}),
$$

where $g(\tau^{\mathrm{SA}}) = (\tau^{\mathrm{SA}} - \widehat{\tau}^{\mathrm{SA}}_1(s, t), \ldots, \tau^{\mathrm{SA}} - \widehat{\tau}^{\mathrm{SA}}_K(s, t))^\top$. The optimal weight matrix is $\widehat{W} = \mathrm{Var}(\widehat{\tau}^{\mathrm{SA}}_{(1:K)}(s, t))^{-1}$, where $\widehat{\tau}^{\mathrm{SA}}_{(1:K)}(s, t) = (\widehat{\tau}^{\mathrm{SA}}_1(s, t), \ldots, \widehat{\tau}^{\mathrm{SA}}_K(s, t))^\top$.

To estimate the time-average SA-ATT, we first define the time-average $k$-th order DID estimator as

$$
\widehat{\tau}^{\mathrm{SA}}_k(s) = \sum_{t \in \mathcal{T}} \pi_t \, \widehat{\tau}^{\mathrm{SA}}_k(s, t).
$$

Finally, the generalized $K$-DID combines $K$ moment conditions as follows:

$$
\widehat{\tau}^{\mathrm{SA}}(s) = \operatorname*{argmin}_{\tau^{\mathrm{SA}}} \; g(\tau^{\mathrm{SA}})^\top \widehat{W} g(\tau^{\mathrm{SA}}),
$$

where $g(\tau^{\mathrm{SA}}) = (\tau^{\mathrm{SA}} - \widehat{\tau}^{\mathrm{SA}}_1(s), \ldots, \tau^{\mathrm{SA}} - \widehat{\tau}^{\mathrm{SA}}_K(s))^\top$. The optimal weight matrix is $\widehat{W} = \mathrm{Var}(\widehat{\tau}^{\mathrm{SA}}_{(1:K)}(s))^{-1}$, where $\widehat{\tau}^{\mathrm{SA}}_{(1:K)}(s) = (\widehat{\tau}^{\mathrm{SA}}_1(s), \ldots, \widehat{\tau}^{\mathrm{SA}}_K(s))^\top$.

E Equivalence Approach

Here, we provide technical details on the equivalence approach introduced in Section 3.1. In standard hypothesis testing, researchers usually evaluate the two-sided null hypothesis $H_0: \theta = 0$, where $\theta = \{\mathbb{E}[Y_{i1}(0) \mid G_i = 1] - \mathbb{E}[Y_{i0}(0) \mid G_i = 1]\} - \{\mathbb{E}[Y_{i1}(0) \mid G_i = 0] - \mathbb{E}[Y_{i0}(0) \mid G_i = 0]\}$, when conducting the pre-treatment-trends test. However, this approach risks conflating evidence for parallel trends with statistical inefficiency. For example, when the sample size is small, even if the pre-treatment trends of the treatment and control groups differ (i.e., the null hypothesis is false), a test of the difference might not be statistically significant because of a large standard error. Analysts might then "pass" the pre-treatment-trends test simply by not finding enough evidence for the difference.

The equivalence approach can mitigate this concern by flipping the null hypothesis, so that rejecting the null is evidence for parallel trends. In particular, we consider two one-sided tests:

$$
H_0: \theta \geq \gamma_U, \ \text{or} \ \theta \leq \gamma_L,
$$

where $(\gamma_L, \gamma_U)$ is a user-specified equivalence range. By rejecting this null hypothesis, researchers can provide statistical evidence for the alternative hypothesis:

$$
H_1: \gamma_L < \theta < \gamma_U,
$$

which means that $\theta$ (i.e., the difference in pre-treatment trends across treatment and control groups) is within the interval $(\gamma_L, \gamma_U)$.

One difficulty of the equivalence approach is that researchers have to choose this equivalence range $(\gamma_L, \gamma_U)$, which might not be straightforward in practice. To overcome this challenge, we follow Hartman and Hidalgo (2018) and estimate the 95% equivalence confidence interval, which is the smallest equivalence range supported by the observed data.
Suppose we obtain $[-c, c]$ as the symmetric 95% equivalence confidence interval, where $c > 0$ is some positive constant. Then, if researchers consider an absolute value of $\theta$ smaller than $c$ substantively negligible, the 5% equivalence test rejects the null hypothesis and provides evidence for parallel pre-treatment trends. In contrast, if researchers consider an absolute value of $\theta$ equal to $c$ too large a bias in practice, the 5% equivalence test fails to reject the null hypothesis and cannot provide evidence for parallel pre-treatment trends. In sum, when the equivalence confidence interval is reported, readers of the analysis can decide how much evidence for parallel pre-treatment trends exists in the observed data.

Researchers can estimate the 95% equivalence confidence interval in two general steps. First, estimate the 90% confidence interval, which we denote by $[b_L, b_U]$. Second, obtain the symmetric 95% equivalence confidence interval as $[-b, b]$, where $b = \max\{|b_L|, |b_U|\}$. See Wellek (2010) and Hartman and Hidalgo (2018) for more details.

Figure 9: Figure 1 from Hartman and Hidalgo (2018) on the difference between standard hypothesis testing and equivalence testing.
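The two-step construction of the equivalence confidence interval can be sketched as follows. This is a minimal illustration that assumes the estimate of $\theta$ is approximately normal, so that the 90% confidence interval is the estimate plus or minus 1.645 standard errors; the function name `equivalence_ci` and its arguments are hypothetical.

```python
from statistics import NormalDist


def equivalence_ci(estimate, std_error, level=0.95):
    """Symmetric equivalence confidence interval [-b, b].

    Step 1: form the (1 - 2 * alpha) confidence interval [b_L, b_U]
            (a 90% interval when level = 0.95), using a normal approximation.
    Step 2: set b = max(|b_L|, |b_U|).
    """
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha)   # about 1.645 for level = 0.95
    b_lower = estimate - z * std_error
    b_upper = estimate + z * std_error
    b = max(abs(b_lower), abs(b_upper))
    return -b, b
```

If the resulting bound `b` is smaller than the largest deviation a researcher considers substantively negligible, the 5% equivalence test rejects the null of non-equivalence for that range.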

F Simulation Study

We conduct a simulation study to compare the performance of the various DID estimators discussed in this paper. We demonstrate two key results. First, the double DID is unbiased under the extended parallel trends assumption or under the parallel trends-in-trends assumption. Second, the double DID has the smallest standard errors among unbiased DID estimators. In particular, standard errors of the double DID are smaller than those of the extended DID (i.e., the two-way fixed effects estimator) even under the extended parallel trends assumption.

We compare three DID estimators (the double DID, the extended DID, and the sequential DID) using two scenarios. In the first scenario, the extended parallel trends assumption (Assumption 2) holds, where the difference between potential outcomes under control, $\mathbb{E}[Y_{it}(0) \mid G_i = 1] - \mathbb{E}[Y_{it}(0) \mid G_i = 0]$, is constant over time. This corresponds to time-invariant unmeasured confounding, and we expect that all the DID estimators are unbiased in this scenario. The second scenario represents the parallel trends-in-trends assumption (Assumption 3), where unmeasured confounding varies over time linearly. Here, we expect that the double DID and the sequential DID are unbiased, whereas the extended DID is biased.

For each of the two scenarios, we consider balanced panel data with $n$ units and five time periods, where the treatment is assigned in the last time period. We vary the number of units ($n$) from 100 to 1000 and evaluate the quality of estimators by absolute bias and standard errors over 2000 Monte Carlo simulations. We describe the details of the simulation setup next.

F.1 Simulation Design

We consider balanced panel data with $T = 5$ ($t \in \{0, 1, 2, 3, 4\}$), where the last period ($t = 4$) is treated as the post-treatment period. We vary the number of units at each time period as $n \in \{100, 250, 500, 1000\}$. Thus, the total number of observations is $nT \in \{500, 1250, 2500, 5000\}$. We compare three estimators: the double DID, the extended DID, and the sequential DID.

Note that we consider four pre-treatment periods here, and thus the generalized double DID is not equal to the sequential DID even under the parallel trends-in-trends assumption, because it combines two other moments and optimally weights them (see Appendix C). The equivalence between the sequential DID and the double DID holds only when there are two pre-treatment periods. We see below that the generalized double DID improves upon the sequential DID even under the parallel trends-in-trends assumption, as it optimally weights observations from different time periods.

We study two scenarios: one under the extended parallel trends assumption (Assumption 2) and the other under the parallel trends-in-trends assumption (Assumption 3). In the first scenario, the difference between potential outcomes under control, $\mathbb{E}[Y_{it}(0) \mid G_i = 1] - \mathbb{E}[Y_{it}(0) \mid G_i = 0]$, is constant over time. In particular, we set

$$
\mathbb{E}[Y_{it}(0) \mid G_i = g] = \alpha_t + 0.\ \times g \qquad (19)
$$

where $(\alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4) = (1, \ , \ , \ , \ )$. In the second scenario, we allow for linear time-varying confounding. In particular, we set

$$
\mathbb{E}[Y_{it}(0) \mid G_i = g] = \alpha_t + 0.\ \times g \times (t + 1) \qquad (20)
$$

where $(\alpha_0, \alpha_1, \alpha_2, \alpha_3, \alpha_4) = (1, \ , \ , \ , \ )$. The potential outcomes are $Y_{it}(0) = \mathbb{E}[Y_{it}(0) \mid G_i] + \epsilon_{it}$, where $\epsilon_{it}$ follows the AR(1) process with autocorrelation parameter $\rho$. That is,

$$
\epsilon_{it} = \rho \, \epsilon_{i,t-1} + \xi_{it}, \qquad \epsilon_{i0} \sim N(0, \ / (1 - \rho^2)), \qquad \xi_{it} \sim N(0, \ ).
$$

The causal effect is denoted by $\tau$, and thus $Y_{it}(1) = \tau + Y_{it}(0)$, where we set $\tau = 0.\ $. Finally, $Y_{it} = Y_{it}(0)$ for $t \leq 3$ (pre-treatment periods) and $Y_{it} = G_i Y_{it}(1) + (1 - G_i) Y_{it}(0)$ for $t = 4$ (post-treatment period). Half of the sample is in the treatment group ($G_i = 1$) and the other half is in the control group ($G_i = 0$).

In Figure 10, we set the autocorrelation parameter $\rho = 0.\ $. This value is similar to the autocorrelation parameter used in the well-known simulation studies of Bertrand, Duflo and Mullainathan (2004) ($\rho = 0.\ $). We pick a smaller value to make our simulations harder, as we see below. In Figure 11, we also provide additional results where we consider a full range of autocorrelation parameters $\rho \in \{0, 0.2, 0.4, 0.6, 0.8\}$ (the same positive autocorrelation values considered in Bertrand, Duflo and Mullainathan (2004)). Both figures show the absolute bias and the standard errors, which are defined as

$$
\text{absolute bias} = \left| \frac{1}{M} \sum_{m=1}^{M} (\widehat{\tau}_m - \tau) \right| \quad \text{and} \quad \text{standard error} = \sqrt{ \frac{1}{M} \sum_{m=1}^{M} (\widehat{\tau}_m - \tau)^2 },
$$

where $M$ is the total number of Monte Carlo iterations. Note that this standard error is the true standard error over the sampling distribution.

F.2 Results

Figure 10 shows the results when the autocorrelation parameter $\rho = 0.\ $. To begin with the absolute bias, visualized in the first row, all estimators have little bias under the extended parallel trends assumption (Scenario 1), as expected from theoretical results. In contrast, under the parallel trends-in-trends assumption (Scenario 2), the extended DID (white circle with dotted line) is biased, while the double DID (black circle with solid line) and the sequential DID (white triangle with dotted line) are unbiased.

The second row represents the standard errors of each estimator. Under the extended parallel trends assumption (the first column), the double DID estimator has the smallest standard error, smaller than that of the extended DID estimator (i.e., the two-way fixed effects estimator). This efficiency gain comes from the fact that the double DID uses the GMM framework to optimally weight observations from different time periods, whereas the two-way fixed effects estimator gives equal weights to all pre-treatment periods.

Under the parallel trends-in-trends assumption (the second row; the second column), the double DID has almost the same standard error as the sequential DID. This shows that the double DID changes weights according to scenarios and solves a practical dilemma of the sequential DID: it is unbiased under the weaker assumption of parallel trends-in-trends, but not efficient under the extended parallel trends.

[Figure 10 panels: rows show absolute bias and standard errors; columns show Scenario 1 (Extended Parallel Trends) and Scenario 2 (Parallel Trends-in-Trends); x-axis: sample size (100, 250, 500, 1000); legend: Extended DID, Sequential DID, Double DID.]

Figure 10: Comparing DID estimators in terms of the absolute bias and the standard errors. The first row shows that the double DID estimator (black circle with solid line) is unbiased under both scenarios. The second row demonstrates that the double DID has the smallest standard errors among unbiased DID estimators.

In Figure 11, we provide additional results where we consider a full range of autocorrelation parameters $\rho \in \{0, 0.2, 0.4, 0.6, 0.8\}$ (the same positive autocorrelation values considered in Bertrand, Duflo and Mullainathan (2004)). We find that when the autocorrelation of errors is small, standard errors of the double DID are smaller than those of the sequential DID even under the parallel trends-in-trends assumption.

The first row of Figure 11 shows that our results on the (absolute) bias do not change with the autocorrelation of errors. In particular, the double DID is unbiased under the extended parallel trends assumption (the first column) or under the parallel trends-in-trends assumption (the second column). In terms of the standard errors (the second row), two results are important. First, under the extended parallel trends assumption (the first column), the standard errors of the double DID are the smallest for all values of $\rho$, and the efficiency gain relative to the extended DID (i.e., the two-way fixed effects estimator) is large when autocorrelation is high (i.e., $\rho$ is large). Second, under the parallel trends-in-trends assumption (the second column), the standard errors of the double DID are the smallest among unbiased DID estimators (the extended DID is biased). The efficiency gain relative to the sequential DID is large when $\rho$ is small.
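The simulation design of Section F.1 can be sketched in Python as follows. The sketch is deliberately partial: it covers the extended DID and the sequential DID only (the double DID additionally requires an estimated weight matrix), and every numerical default below (`rho`, `tau`, `conf`, the time effects `alpha`, and the unit innovation variance) is an illustrative placeholder assumed here, not a value taken from the paper. It relies on the fact that, in a balanced panel with a single post-treatment period, the two-way fixed effects coefficient equals the treated-minus-control gap in the last period minus the average gap over the pre-treatment periods.

```python
import numpy as np


def simulate_panel(n, scenario, rng, rho=0.5, tau=0.5, conf=0.1):
    """One balanced panel with T = 5; all numeric defaults are assumptions."""
    T = 5
    alpha = np.arange(1.0, T + 1.0)            # assumed time effects
    G = np.repeat([1, 0], n // 2)              # half treated, half control (n even)
    t = np.arange(T)
    # Scenario 1: constant group gap; Scenario 2: gap grows linearly in (t + 1)
    trend = np.ones(T) if scenario == 1 else (t + 1.0)
    mean = alpha[None, :] + conf * G[:, None] * trend[None, :]
    eps = np.empty((n, T))
    eps[:, 0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - rho ** 2)), size=n)
    for s in range(1, T):                      # AR(1) errors
        eps[:, s] = rho * eps[:, s - 1] + rng.normal(0.0, 1.0, size=n)
    Y = mean + eps
    Y[:, T - 1] += tau * G                     # treatment effect in the last period
    return Y, G


def group_gap(Y, G):
    """Treated-minus-control difference in means, one entry per period."""
    return Y[G == 1].mean(axis=0) - Y[G == 0].mean(axis=0)


def extended_did(Y, G):
    """Two-way fixed effects estimate: last-period gap minus average pre-period gap."""
    d = group_gap(Y, G)
    return d[-1] - d[:-1].mean()


def sequential_did(Y, G):
    """DID on the last two periods minus DID on the two periods before them."""
    d = group_gap(Y, G)
    return (d[-1] - d[-2]) - (d[-2] - d[-3])


def monte_carlo(estimator, scenario, n=200, reps=500, tau=0.5, seed=0):
    """Return (absolute bias, Monte Carlo standard deviation) of an estimator."""
    rng = np.random.default_rng(seed)
    est = np.array([estimator(*simulate_panel(n, scenario, rng, tau=tau))
                    for _ in range(reps)])
    return abs(est.mean() - tau), est.std()
```

Under these assumed values, the extended DID should show a bias close to `2.5 * conf` in Scenario 2 (the last-period gap `5 * conf` minus the average pre-period gap `2.5 * conf`), while the sequential DID should remain approximately unbiased, which mirrors the qualitative pattern described in the text.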

[Figure 11 panels: rows show absolute bias and standard errors; columns show Scenario 1 (Extended Parallel Trends) and Scenario 2 (Parallel Trends-in-Trends); x-axis: autocorrelation (0, 0.2, 0.4, 0.6, 0.8); legend: Extended DID, Sequential DID, Double DID.]

Figure 11: Comparing DID estimators in terms of the absolute bias and the standard errors according to the autocorrelation of errors. Note: The first row shows that the double DID estimator (black circle with solid line) is unbiased under both scenarios. The second row demonstrates that the double DID has the smallest standard errors among unbiased DID estimators. Under the extended parallel trends assumption (the first column), the efficiency gain relative to the extended DID (i.e., the two-way fixed effects estimator) is large when the autocorrelation parameter $\rho$ is large. Under the parallel trends-in-trends assumption (the second column), the efficiency gain relative to the sequential DID is large when $\rho$ is small.

G Empirical Application

In Section 6, we focused on three outcomes to illustrate the advantage of the double DID estimator. In this section, we provide results for all thirty outcomes analyzed in the original paper.

To assess the underlying parallel trends assumptions, we combine visualization and formal tests, as recommended in the main text. The assessment suggests that we can make the extended parallel trends assumption for fifteen outcomes. Specifically, for those fifteen outcomes, p-values for the null of pre-treatment parallel trends are above 0.10 (i.e., we fail to reject the null at the conventional level), and the 95% standardized equivalence confidence interval is contained in the interval $[-0.2, 0.2]$. This means that the deviation from parallel trends in the pre-treatment periods is less than 0.2 standard deviations of the control mean in 2006.

Figure 12 shows estimated treatment effects under the extended parallel trends assumption. As in Section 6, the double DID estimates are similar to those from the standard DID, and yet standard errors are smaller because the double DID effectively uses pre-treatment periods within the GMM. Here, we only have two pre-treatment periods, but when there are more pre-treatment periods, the efficiency gain of the double DID can be even larger.

We rely on the parallel trends-in-trends assumption for eight of the fifteen remaining outcomes. These outcomes have a 95% standardized equivalence confidence interval wider than $[-0.2, 0.2]$, but the treatment and control groups' pre-treatment trends have the same sign. The same sign of the pre-treatment trends suggests that the parallel trends-in-trends assumption, which can account for a linear time-varying unmeasured confounder, can be plausible for these outcomes, even though the stronger parallel trends assumption is possibly violated.

Figure 13 shows results under the parallel trends-in-trends assumption. As in Section 6, the double DID estimates are often different from those of the standard DID because the extended parallel trends assumption is implausible for these outcomes. Importantly, standard errors of the double DID are often larger than those of the standard DID. This is because the double DID needs to adjust for biases in the standard DID by using pre-treatment trends.

For the remaining seven outcomes, for which the treatment and control groups' pre-treatment trends have opposite signs, it is difficult to justify either the extended parallel trends or the parallel trends-in-trends assumption without additional information. Thus, there is no credible estimator for the ATT without making stronger assumptions. When there are more than two pre-treatment periods, researchers can apply the sequential DID estimator to pre-treatment periods in order to formally assess the extended parallel trends-in-trends assumption. We emphasize that, although we use the equivalence range of $[-0.2, 0.2]$ as a cutoff for illustration, it is recommended to base this decision on substantive domain knowledge whenever possible in practice.

[Figure 12 panels, each comparing DID and Double-DID: Education and Cultural Program; Irrigation Plants; Market or Inter-commune Market; Paved Road; Periodic Market; Post Office; Prop. Households w/ Agricultural Extension; Prop. Households w/ Supported Credit; Prop. Households w/ Supported Healthcare; Prop. Households w/ Supported Tuition; Radio Broadcast; Socio-Dev't/Infra. Project; Staff to Cure Animal; Upper Secondary School; Village w/ Post Office. y-axis: ATT estimates with confidence intervals. Title: Estimates under Extended Parallel Trends Assumption.]

Figure 12: Comparing Standard DID and Double DID under Extended Parallel Trends Assumption. The double DID estimates are similar to those from the standard DID, and yet standard errors are smaller because the double DID effectively uses pre-treatment periods within the GMM.

[Figure 13 panels, each comparing DID and Double-DID: Daily Market; Nonfarm Business; Prop. Households w/ Business Tax Exemption; Prop. Households w/ Supported Crop; Public Health Project; Public Transport; Staff to Support Crops; Tap Water. y-axis: ATT estimates with confidence intervals. Title: Estimates under Parallel Trends-in-Trends.]