Using Multiple Pre-treatment Periods to Improve Difference-in-Differences and Staggered Adoption Designs ∗
Naoki Egami † Soichiro Yamauchi ‡
This version: February 22, 2021
First draft: December 6, 2019
Abstract
While difference-in-differences (DID) was originally developed with one pre- and one post-treatment period, data from additional pre-treatment periods is often available. How can researchers improve the DID design with such multiple pre-treatment periods, and under what conditions? We first use potential outcomes to clarify three benefits of multiple pre-treatment periods: (1) assessing the parallel trends assumption, (2) improving estimation accuracy, and (3) allowing for a more flexible parallel trends assumption. We then propose a new estimator, double DID, which combines all the benefits through the generalized method of moments and contains the two-way fixed effects regression as a special case. In a wide range of applications where several pre-treatment periods are available, the double DID improves upon the standard DID both in terms of identification and estimation accuracy. We also generalize the double DID to the staggered adoption design where different units can receive the treatment in different time periods. We illustrate the proposed method with two empirical applications, covering both the basic DID and staggered adoption designs. We offer an open-source R package that implements the proposed methodologies.

∗ The methods proposed in this article can be implemented via the open-source statistical software R package DIDdesign available at https://github.com/naoki-egami/DIDdesign . We are grateful to Edmund Malesky, Cuong Viet Nguyen, and Anh Tran for providing us with data and answering our questions. We also thank Adam Glynn, Chad Hazlett, Shiro Kuriwaki, Ian Lundberg, John Marshall, Xiang Zhou, and participants of the 2019 Summer Meetings of the Political Methodology Society and the 2019 American Political Science Association Annual Conference for helpful comments and discussions.
† Assistant Professor, Department of Political Science, Columbia University, New York NY 10027. Email: [email protected] ; URL: https://naokiegami.com .
‡ PhD Candidate, Department of Government, Harvard University, Cambridge MA 02138. Email: [email protected] ; URL: https://soichiroy.github.io
arXiv [stat.AP]

Introduction
Over the last few decades, social scientists have developed and applied various approaches to make credible causal inferences from observational data. One of the most popular and successful is a difference-in-differences (DID) design (Bertrand, Duflo and Mullainathan, 2004; Angrist and Pischke, 2008). In its most basic form, we compare treatment and control groups over two time periods: one before and the other after the treatment assignment.

In practice, it is common to apply the DID method with additional pre-treatment periods. However, in contrast to the basic two-time-period case, there are a number of different ways to analyze the DID design with multiple pre-treatment periods. One popular approach is to apply the two-way fixed effects regression to all time periods and supplement it with various robustness checks (e.g., Dube, Dube and García-Ponce, 2013; Truex, 2014; Earle and Gehlbach, 2015; Hall, 2016; Larreguy and Marshall, 2017). Another is to stick with the two-time-period DID and limit the use of additional pre-treatment periods only to the assessment of pre-treatment trends (e.g., Ladd and Lenz, 2009; Bechtel and Hainmueller, 2011; Bullock and Clinton, 2011; Keele and Minozzi, 2013; Garfias, 2018).

This variation of approaches raises an important practical question: how should analysts incorporate multiple pre-treatment periods into the DID design, and under what assumptions? In Section 2, we begin by examining three benefits of multiple pre-treatment periods using potential outcomes (Neyman, 1923; Rubin, 1974): (1) assessing the parallel trends assumption, (2) improving estimation accuracy, and (3) allowing for a more flexible parallel trends assumption. While these benefits have been discussed in the literature, we revisit them to clarify that each benefit requires different assumptions and different estimators. As a result, in practice, researchers tend to enjoy only a subset of the three benefits they can exploit from multiple pre-treatment periods.
This methodological synthesis serves as a foundation to develop our new approach.

Our main contribution is to propose a new, simple estimator that achieves all three benefits together. We use the generalized method of moments (GMM) framework (Hansen, 1982) to develop the double difference-in-differences (double DID). At its core, we combine two popular DID estimators: the standard DID estimator, which relies on the canonical parallel-

Footnote: In our literature review of American Political Science Review and American Journal of Political Science between 2015 and 2019, we found that about 63% of the papers that use the DID design have more than one pre-treatment period. See Appendix A for details about our literature review.

pre-treatment periods to further relax the parallel trends assumption (Section 3.4). This is important because researchers might be worried not only about time-varying unmeasured confounders that linearly change over time, but also about more general forms of time-varying unmeasured confounders. When there exist $K$ pre-treatment periods, our proposed approach can accommodate time-varying unmeasured confounders that follow $(K-1)$th-order polynomial functions. Thus, we can account for more flexible forms of time-varying unmeasured confounding as we have more pre-treatment periods. Second, we also incorporate any number of post-treatment periods such that researchers can estimate not only short-term causal effects but also longer-term causal effects (Section 3.4). Finally, we generalize our double DID estimator to the staggered adoption design where different units can receive the treatment in different time periods (Section 4). This design is increasingly popular in political science and in the social sciences in general (e.g., Athey and Imbens, 2018; Ben-Michael, Feller and Rothstein, 2019). We show that the double DID estimator achieves the same three benefits together even in this more general pattern of the treatment assignment.

We offer a companion R package DIDdesign that implements the proposed methods. We illustrate our proposed methods in Section 6 where we provide two empirical applications. The first application revisits Malesky, Nguyen and Tran (2014), which studies how the abolition of elected councils affects local public services. This serves as an example of the basic DID design where treatment assignment happens only once.
Our second application is a reanalysis of Paglayan (2019), which examines the effect of granting collective bargaining rights to teachers' unions on educational expenditures and teachers' salaries. This is an example of the staggered adoption design where different units can receive the treatment in different time periods.
Related Literature.
This paper builds on the large literature of time-series cross-sectional data (e.g., De Boef and Keele, 2008; Beck and Katz, 2011; Blackwell and Glynn, 2018) and is connected to other popular methodologies for making causal inferences. Generalizing the well-known case of two periods and two groups (e.g., Abadie, 2005), recent papers use potential outcomes to unpack the nonparametric connection between the DID and two-way fixed effects regression estimators, thereby proposing extensions to relax strong parametric and causal assumptions (e.g., Imai and Kim, 2011; Athey and Imbens, 2018; Goodman-Bacon, 2018; Strezhnev, 2018; Imai and Kim, 2019, 2020). Our paper also uses potential outcomes to clarify nonparametric foundations on the use of multiple pre-treatment periods. The key difference is that, while this recent literature mainly considers the identification under the parallel trends assumption, we study both estimation accuracy and identification under more flexible assumptions of trends. We do so both in the basic DID setup and in the staggered adoption design.

Our double DID estimator contains the sequential DID estimator (e.g., Lee, 2016; Mora and Reggio, 2019) as a special case. There are two key advantages over existing approaches.
First, when the parallel trends assumption holds, the double DID optimally combines the standard DID and the sequential DID to improve efficiency, and thus is not equal to the sequential DID. Importantly, this avoids a dilemma of the sequential DID: it is consistent under assumptions weaker than the parallel trends, but is less efficient when the parallel trends assumption holds. Second, while the sequential DID has only been developed for the basic DID design where treatment assignment happens only once, we generalize it to the staggered adoption design, and further incorporate it into our proposed staggered-adoption double DID estimator.

Another class of popular methods is the synthetic control method (Abadie, Diamond and Hainmueller, 2010) and its recent extensions (e.g., Xu, 2017; Athey et al., 2017; Ben-Michael, Feller and Rothstein, 2018; Hazlett and Xu, 2018) that estimate a weighted average of control units to approximate a treated unit. As carefully noted in those papers, such methodologies require long pre-treatment periods to accurately estimate a pre-treatment trajectory of the treated unit (Abadie, Diamond and Hainmueller, 2010); for example, Xu (2017) recommends collecting more than ten pre-treatment periods. In contrast, the proposed double DID can be applied as long as there is more than one pre-treatment period, and is better suited when there are a small to moderate number of pre-treatment periods. However, we also show in Section 6.2 that the double DID can achieve performance comparable to variants of synthetic control methods even when there are a large number of pre-treatment periods. We offer additional discussions about relationships between our proposed approach and synthetic control methods in Section 5.3.
The difference-in-differences (DID) design is one of the most widely used methods to make causal inference from observational studies (Imbens and Wooldridge, 2009). At its most basic, the DID design consists of treatment and control groups measured at two time periods, before and after the treatment assignment. While this basic DID design only requires data from one post- and one pre-treatment period, additional pre-treatment periods are often available in applied contexts. Unfortunately, however, assumptions behind different uses of multiple pre-treatment periods have often remained unstated.

In this section, we use potential outcomes to discuss three well-known practical benefits of multiple pre-treatment periods: (1) assessing the parallel trends assumption, (2) improving estimation accuracy, and (3) allowing for a more flexible parallel trends assumption. This section serves as a methodological foundation for developing a new approach in Sections 3 and 4.
As our running example, we focus on a study of how the abolition of elected councils affects local public services. Malesky, Nguyen and Tran (2014) use the DID design to examine the effect of recentralization efforts in Vietnam. The abolition of elected councils, the main treatment of interest, was implemented in 2009 in about 12% of all the communes, which are the smallest administrative units that the paper considers. For each commune, a variety of outcomes related to public services, such as the quality of infrastructure, were measured in 2006, 2008 and 2010: two pre-treatment periods and one post-treatment period. With this DID design, Malesky, Nguyen and Tran (2014) aim to estimate the causal effect of abolishing elected councils on various measures of local public services. To introduce the setup of the design, we focus only on the basic aspects of the study here and discuss further details when we reanalyze it in Section 6.

Footnote: In our literature review of APSR and AJPS, we found that most DID applications have fewer than 10 pre-treatment periods. The median number of pre-treatment periods is [...] and the mean number of pre-treatment periods is about [...] after removing one unique study that has more than [...] pre-treatment periods. See Appendix A for more details about our literature review.

To begin with, let $D_{it}$ denote the binary treatment for unit $i$ in time period $t$ so that $D_{it} = 1$ if the unit is treated in time period $t$, and $D_{it} = 0$ otherwise. In this section, we consider two pre-treatment time periods $t \in \{0, 1\}$ and one post-treatment period $t = 2$. We choose this setup here because it is sufficient for examining benefits of multiple pre-treatment periods, but we also generalize our discussions and methods to an arbitrary number of pre- and post-treatment periods (Section 3.4), and to the staggered adoption design (Section 4) to increase the applicability of our methods in practice. In our running example, the two pre-treatment periods are 2006 and 2008, and the one post-treatment period is 2010. Thus, the treatment group receives the treatment only at time $t = 2$; $D_{i0} = D_{i1} = 0$ and $D_{i2} = 1$, whereas units in the control group never get treated, $D_{i0} = D_{i1} = D_{i2} = 0$. We refer to the treatment group as $G_i = 1$ and the control group as $G_i = 0$. Outcome $Y_{it}$ is measured at time $t \in \{0, 1, 2\}$.
In addition to panel data where the same units are measured over time, the DID design accommodates repeated cross-sectional data as in our running example, in which different communes are sampled at three time periods.

To define causal effects of interest, we rely on the potential outcomes framework (Neyman, 1923; Rubin, 1974). For each time period, $Y_{it}(1)$ represents the quality of infrastructure that commune $i$ would achieve in time period $t$ if commune $i$ had abolished elected councils. $Y_{it}(0)$ is similarly defined. For an individual commune, the causal effect of abolishing elected councils on the quality of infrastructure in time period $t$ is the difference $Y_{it}(1) - Y_{it}(0)$. As the treatment is assigned in the second time period, causal effects are defined only for time $t = 2$, $Y_{i2}(1) - Y_{i2}(0)$. In the DID design, we are interested in estimating the average treatment effect for treated units (ATT) (Angrist and Pischke, 2008):
\[
\tau = E[Y_{i2}(1) - Y_{i2}(0) \mid G_i = 1], \tag{1}
\]
where the expectation is over units in the treatment group $G_i = 1$ so that this causal estimand is the average of individual causal effects for units who receive the treatment.

DID with One Pre-Treatment Period
Before we discuss benefits of multiple pre-treatment periods from Section 2.2, we briefly review the DID with one pre-treatment period to fix ideas for settings with multiple pre-treatment periods.

In the most basic DID design where we have only one pre-treatment period (i.e., $t = 1$ is the pre-treatment period and $t = 2$ is the post-treatment period), researchers can identify the ATT based on the widely used assumption of parallel trends: if the treatment group had not received the treatment in the second period, its outcome trend would have been the same as the trend of the outcome in the control group (Angrist and Pischke, 2008).

Assumption 1 (Parallel Trends).
\[
E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 1] = E[Y_{i2}(0) \mid G_i = 0] - E[Y_{i1}(0) \mid G_i = 0], \tag{2}
\]
where the left-hand side is the trend in outcomes for the treatment group $G_i = 1$ and the right-hand side is the one for the control group $G_i = 0$.

Under the parallel trends assumption, we estimate the ATT via the difference-in-differences estimator:
\[
\widehat{\tau}_{\text{DID}} = \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right), \tag{3}
\]
where $n_{1t}$ and $n_{0t}$ are the numbers of units in the treatment and control groups at time $t \in \{1, 2\}$, respectively.

In practice, we can compute the DID estimator via a linear regression. We regress the outcome $Y_{it}$ on an intercept, the treatment group indicator $G_i$, the time indicator $I_t$ (equal to 1 if post-treatment and 0 otherwise), and the interaction between the treatment group indicator and the time indicator $G_i \times I_t$:
\[
Y_{it} \sim \alpha + \theta G_i + \gamma I_t + \beta (G_i \times I_t), \tag{4}
\]
where $(\alpha, \theta, \gamma, \beta)$ are the corresponding coefficients. In this case, the coefficient of the interaction term $\beta$ is numerically equal to the DID estimator, $\widehat{\tau}_{\text{DID}}$.
Importantly, the linear regression is used here only to compute the nonparametric DID estimator (equation (3)), and thus it does not require any parametric modeling assumption such as constant treatment effects. Furthermore, when we analyze panel data in which the same units are observed repeatedly over time, we obtain exactly the same estimate via a linear regression with unit and time fixed effects. This numerical equivalence in the two-time-period case is often the justification of the two-way fixed effects regression as the DID design (Angrist and Pischke, 2008). The above equivalence is formally shown in Appendix B.1 for completeness.
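The numerical equivalence between the difference in group means (equation (3)) and the interaction coefficient in regression (4) can be checked on a small example. The following is a minimal sketch in Python using only the standard library; the toy data and the helper names `did_from_means` and `ols` are hypothetical and introduced purely for illustration, and the sketch uses repeated cross-sectional data with unequal cell sizes.

```python
from statistics import mean


def did_from_means(rows):
    """Equation (3): rows are (y, g, post) with g=1 treated, post=1 after treatment."""
    def cell(g, p):
        return mean(y for y, gi, pi in rows if gi == g and pi == p)
    return (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))


def ols(X, y):
    """Least squares via the normal equations (X'X) b = X'y, Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[a] * r[b] for r in X) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(X, y))]
         for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [v - f * w for v, w in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical toy data: (outcome Y_it, group G_i, time indicator I_t)
rows = [
    (2.0, 1, 0), (3.0, 1, 0), (6.0, 1, 1), (7.0, 1, 1), (5.5, 1, 1),
    (1.0, 0, 0), (2.0, 0, 0), (1.5, 0, 0), (2.5, 0, 1), (3.5, 0, 1),
]

# Regression (4): intercept, G_i, I_t, and the interaction G_i x I_t
X = [[1.0, g, p, g * p] for _, g, p in rows]
y = [row[0] for row in rows]
alpha, theta, gamma, beta = ols(X, y)

tau_did = did_from_means(rows)
print(tau_did, beta)  # the two numbers agree up to floating-point error
```

Because regression (4) is saturated in the two indicators, its fitted values are the four cell means, which is why the interaction coefficient reproduces the difference of differences without any modeling assumption.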
We now consider how researchers can exploit multiple pre-treatment periods, while clarifying underlying necessary assumptions.

The first and the most common use of multiple pre-treatment periods is to assess the identification assumption of parallel trends. Because the validity of the DID design rests on the parallel trends assumption, it is critical to evaluate its plausibility in any application. However, the parallel trends assumption itself involves counterfactual outcomes, and thus analysts cannot empirically test it directly. Instead, we often investigate whether trends for treatment and control groups are parallel in pre-treatment periods (Angrist and Pischke, 2008). For example, researchers assess whether trends in the infrastructure quality from 2006 to 2008, before the treatment is implemented in 2009, are the same for treatment and control communes.

Thus, researchers often estimate the DID for the pre-treatment periods:
\[
\left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right). \tag{5}
\]
We then check whether the DID estimate on pre-treatment periods is statistically distinguishable from zero. For example, we can apply the DID estimator to 2006 and 2008 as if 2008 were the post-treatment period, and assess whether the estimate would be close to zero. In Figure 1, a DID estimate on the pre-treatment periods would be close to zero for the left panel, while it would be negative for the right panel where two groups have different pre-treatment trends. In Appendix B.4, we show that a robustness check about leads effects (Angrist and Pischke, 2008), which incorporates leads of the treatment variable into the two-way fixed effects regression and checks whether their coefficients are zero, is equivalent to this DID on the pre-treatment periods.

What are the underlying assumptions behind this test of pre-treatment trends?
The basic idea is that if trends are parallel from 2006 to 2008, it is more likely that the parallel trends assumption holds for 2008 and 2010. Hence, instead of considering parallel trends only from 2008 to 2010, the test evaluates the two related parallel trends together. By doing so, this popular test tries to make the DID design falsifiable.

[Figure 1: Parallel Pre-treatment Trends (left) and Non-Parallel Pre-treatment Trends (right). Panel titles: Extended Parallel Trends; Extended Parallel Trends Violated. Horizontal axis: t = 0 (before), t = 1 (before), t = 2 (after). Vertical axis: Outcome. Legend: Trend of Treatment Group, Trend of Control Group, Counterfactual.]

At its core, this approach does not test the parallel trends assumption itself (Assumption 1), which is untestable due to counterfactual outcomes. Instead, it tests the extended parallel trends assumption: the parallel trends hold for pre-treatment periods, from $t = 0$ to $t = 1$, as well as from a pre-treatment period $t = 1$ to a post-treatment period $t = 2$.

Assumption 2 (Extended Parallel Trends).
\[
\begin{aligned}
E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 1] &= E[Y_{i2}(0) \mid G_i = 0] - E[Y_{i1}(0) \mid G_i = 0] \\
E[Y_{i1}(0) \mid G_i = 1] - E[Y_{i0}(0) \mid G_i = 1] &= E[Y_{i1}(0) \mid G_i = 0] - E[Y_{i0}(0) \mid G_i = 0]
\end{aligned}
\tag{6}
\]
Here, the first line is the same as the standard parallel trends assumption (equation (2)), and the second line is the parallel trends for pre-treatment periods. This assumption means that treatment and control groups have parallel trends of the infrastructure quality from 2008 to 2010 as well as in pre-treatment periods from 2006 to 2008. Because outcome trends are observable in pre-treatment periods, the test of pre-treatment trends (equation (5)) directly tests this assumption.

Therefore, many DID studies that exploit the test on pre-treatment trends can be seen as the DID design under the extended parallel trends assumption. Because the extended parallel trends assumption naturally implies the conventional parallel trends assumption, Assumption 2 is also sufficient for identifying the ATT, and we can estimate it via the same DID estimator (equation (3)).

In summary, the first benefit is that researchers can assess the extended parallel trends assumption using the pre-treatment-trends test (equation (5)).

Benefit 2: Improving Estimation Accuracy

As we discussed above, many existing DID studies that utilize the test of pre-treatment trends can be viewed as the DID design with the extended parallel trends assumption.
However, this extended parallel trends assumption is often made implicitly and thus, it is used only for assessing the parallel trends assumption. Fortunately, if the extended parallel trends assumption holds, we can also estimate the ATT with higher accuracy, resulting in smaller standard errors.

This additional benefit becomes clear by simply restating the extended parallel trends assumption as follows.
\[
\begin{aligned}
E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 1] &= E[Y_{i2}(0) \mid G_i = 0] - E[Y_{i1}(0) \mid G_i = 0] \\
E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i0}(0) \mid G_i = 1] &= E[Y_{i2}(0) \mid G_i = 0] - E[Y_{i0}(0) \mid G_i = 0].
\end{aligned}
\tag{7}
\]
Under the extended parallel trends assumption, there are two natural DID estimators for the ATT. The first is the same as before: the DID on $t = 1$ and $t = 2$. The second is similar but with the additional pre-treatment period: the DID on $t = 0$ and $t = 2$. In our running example, this means that we have a DID estimator using data from 2008 and 2010 and the other using data from 2006 and 2010.
\[
\begin{aligned}
\widehat{\tau}_{\text{DID}} &= \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right), \\
\widehat{\tau}_{\text{DID(2,0)}} &= \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right).
\end{aligned}
\tag{8}
\]
Under the extended parallel trends assumption, both estimators are unbiased and consistent for the ATT. Thus, we can increase estimation accuracy by combining the two estimators, for example, simply averaging them.
\[
\widehat{\tau}_{\text{e-DID}} = \frac{1}{2} \widehat{\tau}_{\text{DID}} + \frac{1}{2} \widehat{\tau}_{\text{DID(2,0)}}. \tag{9}
\]
Intuitively, this extended DID estimator is more efficient because we have more observations to estimate counterfactual outcomes for the treatment group $E[Y_{i2}(0) \mid G_i = 1]$.

In the panel data settings, we show that this extended DID estimator $\widehat{\tau}_{\text{e-DID}}$ is numerically equivalent to a coefficient of the treatment variable in the two-way fixed effects estimator fitted to the three time periods $t \in \{0, 1, 2\}$. We also present more general results about nonparametric relationships between the extended DID and the two-way fixed effects estimator in Appendix B.2.

To summarize, the second benefit of multiple pre-treatment periods is that researchers can use the extended DID estimator (equivalent to the two-way fixed effects estimator in the panel data) to increase statistical efficiency when the extended parallel trends assumption holds.

In this section, we consider scenarios in which the extended parallel trends assumption may not be plausible. Multiple pre-treatment periods are also useful in accounting for some deviation from the parallel trends assumption. We discuss a popular generalization of the difference-in-differences estimator, a sequential DID estimator, which removes bias due to certain violations of the parallel trends assumption (e.g., Lee, 2016; Mora and Reggio, 2019). We clarify an assumption behind this simple method and relate it to the parallel trends assumption.

To introduce the sequential DID estimator, we begin with the extended parallel trends assumption. As we described in Section 2.2, when the extended parallel trends assumption holds, a DID estimator applied to pre-treatment periods $t = 0$ and $t = 1$ should be zero in expectation. In contrast, when trends of treatment and control groups are not parallel, a DID estimate on pre-treatment periods would be non-zero. The sequential DID estimator uses this DID estimate from pre-treatment periods to adjust for bias in the standard DID estimator. In particular, it subtracts the DID estimator on pre-treatment periods from the usual DID estimator that uses pre- and post-treatment periods $t = 1$ and $t = 2$.
\[
\begin{aligned}
\widehat{\tau}_{\text{s-DID}} = {} & \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) \right\} \\
& - \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right) \right\},
\end{aligned}
\tag{10}
\]
where the first four terms are equal to the standard DID estimator (equation (3)) and the last four terms are the DID estimator applied to pre-treatment periods $t = 0$ and $t = 1$. In our running example, we can use this sequential DID estimator by first estimating the DID using 2008 and 2010 and then subtracting the DID based on 2006 and 2008. The idea behind this approach is that the DID estimator on pre-treatment periods captures deviation from the parallel trends assumption, and thus we subtract this bias from the usual DID estimator.

Although we would expect that this estimator only requires an assumption weaker than the extended parallel trends, what exact assumption do we need for this sequential DID estimator?
At its core, the parallel trends assumption means that differences between treatment and control groups due to unobserved confounders are constant over time. Instead of assuming this constant unmeasured confounding, the sequential DID estimator rests on the parallel trends-in-trends assumption: unobserved confounding increases or decreases over time but with some constant rate. Thus, the sequential DID estimator accounts for linear time-varying unmeasured confounding. For example, researchers might be worried that some treated communes have higher motivation for reforms, which is not measured, and the infrastructure qualities differ between treated and control communes due to this unobserved motivation. The parallel trends assumption means that the difference in the infrastructure qualities due to this unobserved confounder does not grow or decline over time. In contrast, the parallel trends-in-trends assumption accommodates a simple yet important case in which the unobserved difference in the infrastructure qualities does grow or decline with some fixed rate, which analysts do not need to specify.

This parallel trends-in-trends assumption is a generalization of the conventional parallel trends assumption and is formally written as follows.

Assumption 3 (Parallel Trends-in-Trends).
\[
\begin{aligned}
& \{ E[Y_{i2}(0) \mid G_i = 1] - E[Y_{i2}(0) \mid G_i = 0] \} - \{ E[Y_{i1}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 0] \} \\
& \quad = \{ E[Y_{i1}(0) \mid G_i = 1] - E[Y_{i1}(0) \mid G_i = 0] \} - \{ E[Y_{i0}(0) \mid G_i = 1] - E[Y_{i0}(0) \mid G_i = 0] \}
\end{aligned}
\tag{11}
\]
Here, the left-hand side represents how the unobserved difference between treatment and control groups changes over time from $t = 1$ to $t = 2$. The right-hand side quantifies the same time-varying unmeasured differences from $t = 0$ to $t = 1$.
Because the trend of time-varying unmeasured confounding is estimated from pre-treatment periods $t = 0$ to $t = 1$, researchers do not need to know the rate by which time-varying unobserved confounding increases or decreases. Under the parallel trends-in-trends assumption, the sequential DID estimator is unbiased and consistent for the ATT.

Figure 2 illustrates the difference between the extended parallel trends assumption (left panel) and the parallel trends-in-trends assumption (middle panel). We can see in the second row of the figure that the parallel trends-in-trends assumption allows for a linear change in bias over time, whereas the bias is constant over time in the extended parallel trends.

[Figure 2: Comparing Extended Parallel Trends Assumption and Parallel Trends-in-Trends Assumption. Columns: Extended Parallel Trends; Parallel Trends-in-Trends; Both are Violated. Top row: Outcome at t = 0 (before), t = 1 (before), t = 2 (after), showing Trend of Treatment Group, Trend of Control Group, and Counterfactual. Bottom row: Bias (Difference in Groups) over the same periods.]
Note: The extended parallel trends assumption (left column) means that the difference in the treatment and control groups (bias) is constant over time. The parallel trends-in-trends assumption (middle column) allows for linear time-varying unmeasured confounding. Both assumptions are violated in the right column.

The sequential DID estimator is again connected to a widely used regression estimator. In particular, the sequential DID estimator (equation (10)) can be computed as a linear regression in which we replace the outcome $Y_{it}$ with the transformed outcome that adjusts for the lagged outcome in a particular way.
\[
\Delta Y_{it} \sim \alpha_s + \theta_s G_i + \gamma_s I_t + \beta_s (G_i \times I_t), \tag{12}
\]
where $\Delta Y_{it} = Y_{it} - \left( \sum_{i: G_i = 1} Y_{i,t-1} \right) / n_{1,t-1}$ if $G_i = 1$ and $\Delta Y_{it} = Y_{it} - \left( \sum_{i: G_i = 0} Y_{i,t-1} \right) / n_{0,t-1}$ if $G_i = 0$. Coefficients are denoted by $(\alpha_s, \theta_s, \gamma_s, \beta_s)$. In this case, the coefficient of the interaction term $\beta_s$ is numerically identical to the sequential DID estimator. We provide the proof of this equivalence in Appendix B.3.

In time-series econometrics, it is common to take the difference in outcomes in order to remove linear time trends before running regressions (Wooldridge, 2010). We also demonstrate that a common robustness check of including group- or unit-specific time trends (Angrist and Pischke, 2008) is nonparametrically equivalent to the sequential DID estimator (see Appendix B.3). Within the potential outcomes framework, we clarified that these common techniques are justified under the parallel trends-in-trends assumption.

In summary, the third benefit of multiple pre-treatment periods is that researchers can use the sequential DID estimator (equation (10)) under the more flexible parallel trends-in-trends assumption, even when the extended parallel trends assumption is violated (and, therefore, the two-way fixed effects estimator is biased).

Remark.
Researchers might be worried not only about time-varying unmeasured confounders that change over time linearly, but also about more general forms of time-varying unmeasured confounders. Fortunately, when we have more than two pre-treatment periods, we can further generalize the sequential DID estimator. In general, when we have $K$ pre-treatment periods, it can account for unmeasured confounders that follow polynomial functions of degree $K - 1$. However, we have to be aware of a key tradeoff: as we incorporate more flexible forms of confounding, standard errors of the sequential DID estimator get much larger. Another powerful approach to address unobserved time-varying confounding is based on a form of partial identification, such as sensitivity analysis (Keele et al., 2019) and the bracketing bounds (Angrist and Pischke, 2008; Ding and Li, 2019; Ye et al., 2020).

While the most basic form of the DID design only requires one pre-treatment period, we saw in the previous section that multiple pre-treatment periods provide three related benefits. We have clarified that each benefit requires different assumptions and different estimators, and as a result, in practice, researchers tend to enjoy only a subset of the three benefits they can exploit from multiple pre-treatment periods. In this section, we propose a new, simple estimator, which we call the double difference-in-differences (double DID), that blends all three benefits of multiple pre-treatment periods in a single framework. Here, we introduce the double DID in settings with two pre-treatment periods.

We also provide two extensions. First, we generalize the proposed method to any number of pre- and post-treatment periods in the DID design (Section 3.4 and Appendix C). Second, we extend it to the staggered adoption design, where timing of the treatment assignment can vary across units, in Section 4.
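Before the estimators are combined, it may help to see how the standard, extended, and sequential DID estimators of the previous section differ on the same data. The sketch below is a hypothetical illustration in Python: the group-by-period means are invented, and they are constructed so that the gap between groups in untreated outcomes grows by 0.5 per period (parallel trends-in-trends holds, extended parallel trends does not) and the true ATT is 2.5. The function name `did` and the dictionary layout are assumptions made for this example only.

```python
def did(m, later, earlier):
    """DID between two periods; m[g][t] is the mean outcome of group g at time t."""
    return (m[1][later] - m[1][earlier]) - (m[0][later] - m[0][earlier])


# Hypothetical means for t = 0, 1, 2 (key 1 = treatment group, key 0 = control group).
# Untreated gap between groups: 0.5, 1.0, and (counterfactually) 1.5, so the
# treated mean at t = 2 is 2.0 + 1.5 + 2.5 = 6.0 when the true ATT is 2.5.
means = {1: [1.0, 2.0, 6.0], 0: [0.5, 1.0, 2.0]}

tau_did = did(means, 2, 1)                  # standard DID, equation (3)
tau_did_20 = did(means, 2, 0)               # DID on t = 0 and t = 2, equation (8)
pre_did = did(means, 1, 0)                  # pre-treatment DID, equation (5)

tau_e = 0.5 * tau_did + 0.5 * tau_did_20    # extended DID, equation (9)
tau_s = tau_did - pre_did                   # sequential DID, equation (10)

print(tau_did, tau_e, tau_s)
```

In this constructed example only the sequential DID recovers the ATT of 2.5; the standard and extended DID are biased upward because the pre-treatment DID is non-zero. Had the pre-treatment DID been zero, all three estimators would have targeted the same quantity, and the remaining question would be one of efficiency.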
We propose the double DID estimator within the framework of the generalized method of moments (GMM) (Hansen, 1982). In particular, we combine the standard DID estimator and the sequential DID estimator via the GMM:

$$\widehat{\tau}_{\text{d-DID}} = \operatorname*{argmin}_{\tau} \begin{pmatrix} \tau - \widehat{\tau}_{\text{DID}} \\ \tau - \widehat{\tau}_{\text{s-DID}} \end{pmatrix}^{\top} W \begin{pmatrix} \tau - \widehat{\tau}_{\text{DID}} \\ \tau - \widehat{\tau}_{\text{s-DID}} \end{pmatrix} \tag{13}$$

where $W$ is a user-specified weight matrix of dimension $2 \times 2$.

[Table 1: Double DID as Generalization of Popular DID Estimators. Columns: Standard DID, Extended DID, Sequential DID; row: weight matrix $W$.]

The important property of the proposed double DID estimator is that it contains all of the popular estimators that we considered in the previous sections as special cases. By specifying an appropriate weight matrix $W$, we can recover the standard DID, the extended DID, and the sequential DID estimators (see Table 1).

The advantage of the double DID estimator is that we can choose the optimal weight matrix by considering the identification assumption and estimation within the unifying framework of the GMM. The double DID estimator proceeds with the following two steps.

Step 1: Assessing the Underlying Assumptions
The first step is to assess the underlying assumptions. We check the extended parallel trends assumption by applying the DID estimator to pre-treatment periods and testing whether the estimate is statistically distinguishable from zero at a conventional level (equation (5)). Importantly, this first step of the double DID is equivalent to the over-identification test in the GMM framework, where we assume that the sequential DID estimator is correctly specified and test the null hypothesis that the standard DID estimator is correctly specified. When there are more than two pre-treatment periods, we can diagnose the parallel trends-in-trends assumption by applying the sequential DID estimator to pre-treatment periods.
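The pre-treatment check described above can be sketched in a few lines. This is a minimal illustration, not the implementation in the authors' R package: the function name `pretrend_did_test` is hypothetical, the four outcome samples are assumed to be independent (as in repeated cross-sections), and a normal approximation is used for the p-value rather than the block bootstrap used later in the application.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pretrend_did_test(treat_early, treat_late, ctrl_early, ctrl_late):
    """DID applied to two pre-treatment periods, with a two-sided z-test.

    Each argument is a list of outcomes for one (group, period) cell.
    Assumption for this sketch: the four cells are independent samples,
    so the variance of the DID is the sum of the four cell-mean variances.
    """
    cells = (treat_early, treat_late, ctrl_early, ctrl_late)
    # Change in the treatment group minus change in the control group.
    estimate = (mean(treat_late) - mean(treat_early)) - (
        mean(ctrl_late) - mean(ctrl_early)
    )
    std_error = sqrt(sum(variance(cell) / len(cell) for cell in cells))
    z_stat = estimate / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return estimate, std_error, p_value


# Hypothetical data: both groups rise by one unit between the two
# pre-treatment periods, so the placebo DID is exactly zero.
est, se, p = pretrend_did_test([1, 2, 3], [2, 3, 4], [0, 1, 2], [1, 2, 3])
```

A large p-value here only says that the data do not reject parallel pre-treatment trends; with few observations it may simply reflect a large standard error, which is the concern the equivalence approach addresses.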
Equivalence Approach.
The standard hypothesis testing approach risks conflating evidence for parallel trends with statistical inefficiency. For example, when the sample size is small, even if pre-treatment trends of the treatment and control groups differ, a test of the difference might not be statistically significant due to a large standard error, and analysts might “pass” the pre-treatment-trends test. To mitigate such concerns, we also incorporate an equivalence approach (Wellek, 2010; Hartman and Hidalgo, 2018) in which we evaluate the null hypothesis that trends of two groups are not parallel in pre-treatment periods. By us-

Footnote: Liu, Wang and Xu (2020) propose a similar test for a different class of estimators, which they refer to as “counterfactual estimators” and which share many properties with synthetic control methods. We discuss relationships between our method and synthetic control methods in Section 5.3.

Step 2: Estimation of the ATT

The second step is the estimation of the ATT. When the extended parallel trends assumption is plausible, we can estimate the optimal weight matrix $W$ building on the theory of the efficient GMM (Hansen, 1982). Specifically, the optimal weight matrix that minimizes the variance is the inverse of the variance-covariance matrix of the DID estimators:

$$\widehat{W} = \begin{pmatrix} \widehat{\text{Var}}(\widehat{\tau}_{\text{DID}}) & \widehat{\text{Cov}}(\widehat{\tau}_{\text{DID}}, \widehat{\tau}_{\text{s-DID}}) \\ \widehat{\text{Cov}}(\widehat{\tau}_{\text{DID}}, \widehat{\tau}_{\text{s-DID}}) & \widehat{\text{Var}}(\widehat{\tau}_{\text{s-DID}}) \end{pmatrix}^{-1} \tag{14}$$

This optimal weight matrix allows us to compute the weighted average of the standard DID and the sequential DID estimator such that the resulting variance is the smallest.

Remark.
Importantly, under the extended parallel trends assumption, both the standard DID and the sequential DID estimator are consistent for the ATT, and thus, any weighted average is a consistent estimator. The optimal weight matrix chooses the most efficient estimator among all such consistent estimators. As we clarify further below, we do not use the weighted average of the standard DID and the sequential DID when the extended parallel trends assumption is violated.

When only the parallel trends-in-trends assumption is plausible, the double DID only contains one moment condition, $\tau - \widehat{\tau}_{\text{s-DID}} = 0$, and thus it is equal to the sequential DID estimator. This is equivalent to choosing the weight matrix $W$ with $W_{11} = W_{12} = W_{21} = 0$ and $W_{22} = 1$ (the third column in Table 1).

When both assumptions are implausible, there is no credible estimator without making further stringent assumptions. However, when there are more than two pre-treatment periods, researchers can also use the proposed generalized $K$-DID (discussed in Section 3.4) to further relax the parallel trends-in-trends assumption.

3.2 Double DID Enjoys Three Benefits

The proposed double DID estimator naturally enjoys the three benefits of multiple pre-treatment periods within a unified framework.
1. Assessing Underlying Assumptions
The double DID incorporates the assessment of underlying assumptions in its first step as the over-identification test. When the trends in pre-treatment periods are not parallel, researchers have to pay the most careful attention to research design and use domain knowledge to assess the parallel trends-in-trends assumption.
2. Improving Estimation Accuracy
When the extended parallel trends assumption holds, researchers can combine two DIDs with equal weights (i.e., the extended DID estimator, which is numerically equivalent to the two-way fixed effects regression) to increase estimation accuracy (Section 2.3). In this setting, the double DID further improves estimation accuracy because it selects the optimal weights as the GMM estimator. In Section F, we use simulations to demonstrate that the double DID achieves even smaller standard errors than the extended DID estimator.
3. Allowing For A More Flexible Parallel Trends Assumption
Under the parallel trends-in-trends assumption, the double DID estimator converges to the sequential DID estimator. However, when the extended parallel trends assumption holds, the double DID uses optimal weights and is not equal to the sequential DID. Thus, the double DID estimator avoids a dilemma of the sequential DID: the sequential DID is consistent under the weaker parallel trends-in-trends assumption but is less efficient when the extended parallel trends assumption holds. By naturally changing the weight matrix in the GMM framework, the double DID achieves high estimation accuracy under the extended parallel trends assumption and, at the same time, allows for more flexible time-varying unmeasured confounding under the parallel trends-in-trends assumption.
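To make the role of the weight matrix concrete, the following sketch solves the quadratic problem in equation (13) in closed form for a $2 \times 2$ matrix $W$. Setting the derivative with respect to $\tau$ to zero gives a weighted average of the two estimators, with weights determined by the entries of $W$. The function names (`gmm_combine`, `inverse_2x2`) and the numbers in the usage lines are hypothetical and are not taken from the paper or from the authors' R package.

```python
def inverse_2x2(matrix):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def gmm_combine(tau_did, tau_sdid, weight):
    """Stationary point of (tau - m)' W (tau - m) over the scalar tau.

    The first-order condition gives
        tau = [m1 (W11 + off) + m2 (W22 + off)] / (sum of all entries of W),
    where off = (W12 + W21) / 2. This is a minimum when the entries of W
    sum to a positive number.
    """
    (w11, w12), (w21, w22) = weight
    total = w11 + w12 + w21 + w22
    if total == 0:
        raise ValueError("entries of the weight matrix sum to zero")
    off = (w12 + w21) / 2
    return ((w11 + off) * tau_did + (w22 + off) * tau_sdid) / total


# Hypothetical estimates and variance-covariance matrix.
tau_did, tau_sdid = 1.0, 2.0
vcov = [[1.0, 0.0], [0.0, 4.0]]

only_did = gmm_combine(tau_did, tau_sdid, [[1, 0], [0, 0]])   # standard DID
only_sdid = gmm_combine(tau_did, tau_sdid, [[0, 0], [0, 1]])  # sequential DID
optimal = gmm_combine(tau_did, tau_sdid, inverse_2x2(vcov))   # efficient GMM
```

With the inverse variance-covariance matrix as $W$, the combined estimate leans toward the more precise of the two estimators, and its variance is one over the sum of the entries of $W$, which is never larger than the smaller of the two individual variances.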
Like other DID estimators, the double DID estimator is closely connected to a widely used regression approach. This connection is particularly useful when researchers would like to control for pre-treatment covariates to make the DID design more robust and efficient.

To introduce the regression-based double DID estimator, we begin with the standard DID. As discussed in Section 2.1, the standard DID estimator is equivalent to a coefficient in the linear regression of equation (4). Inspired by this connection, researchers often adjust for additional pre-treatment covariates as:

$$Y_{it} \sim \alpha + \theta G_i + \gamma I_t + \beta (G_i \times I_t) + X_{it}^{\top} \rho, \tag{15}$$

where we adjust for the additional pre-treatment covariates $X_{it}$. The coefficient on the interaction term, $\widehat{\beta}$, is the standard DID regression estimator for the ATT. Here, we make the parallel trends assumption conditional on covariates $X_{it}$. The idea is that even when the parallel trends assumption might not hold without controlling for any covariates, trends of the two groups might be parallel after adjusting for observed covariates. For example, the conditional parallel trends assumption means that treatment and control groups have the same trends of the infrastructure quality after controlling for population size and GDP per capita.

The estimated coefficient $\widehat{\beta}$ is consistent for the ATT when this conditional parallel trends assumption holds and the parametric model is correctly specified. This parametric assumption might be strong, but it is common to all regression strategies, including non-causal settings, and can be assessed via usual model diagnostics.

The sequential DID estimator is extended similarly.
Based on the connection to the linear regression of equation (12), we can adjust for additional pre-treatment covariates as:

$$\Delta Y_{it} \sim \alpha_s + \theta_s G_i + \gamma_s I_t + \beta_s (G_i \times I_t) + X_{it}^{\top} \rho_s, \tag{16}$$

where $\Delta Y_{it} = Y_{it} - (\sum_{i: G_i = 1} Y_{i,t-1}) / n_{1,t-1}$ if $G_i = 1$ and $\Delta Y_{it} = Y_{it} - (\sum_{i: G_i = 0} Y_{i,t-1}) / n_{0,t-1}$ if $G_i = 0$. The estimated coefficient $\widehat{\beta}_s$ is consistent for the ATT under the conditional parallel trends-in-trends assumption and the conventional assumption of correct specification.

The double DID regression combines the two regression estimators via the GMM:

$$\widehat{\beta}_{\text{d-DID}} = \operatorname*{argmin}_{\beta_d} \begin{pmatrix} \beta_d - \widehat{\beta} \\ \beta_d - \widehat{\beta}_s \end{pmatrix}^{\top} W \begin{pmatrix} \beta_d - \widehat{\beta} \\ \beta_d - \widehat{\beta}_s \end{pmatrix} \tag{17}$$

where $W$ is a user-specified weight matrix of dimension $2 \times 2$.

Thus, like the double DID estimator without covariates, the double DID regression also has two steps. The first step is to assess the underlying assumptions. Here, instead of using the standard DID estimator, we use the standard DID regression on pre-treatment periods to assess the conditional extended parallel trends assumption. The second step is to estimate the ATT while adjusting for pre-treatment covariates. Instead of using the double DID estimator without covariates, we implement the regression-based double DID estimator (equation (17)).

3.4 Generalized K-Difference-in-Differences

We generalize the proposed method to any number of pre- and post-treatment periods in Appendix C. This generalization has two central benefits. First, it enables researchers to use longer pre-treatment periods to allow for even more flexible forms of unmeasured time-varying confounding.
While the parallel trends-in-trends assumption (Assumption 3) accounts for linear time-varying unmeasured confounding, we can allow for time-varying unmeasured confounding that follows a $(K-1)$th order polynomial function when we have $K$ pre-treatment periods. Second, we also allow for any number of post-treatment periods, and therefore, researchers can estimate not only short-term causal effects but also longer-term causal effects with this generalization.

In this section, we extend the proposed double DID estimator to the staggered adoption design, where the timing of the treatment assignment can vary across units (Athey and Imbens, 2018; Strezhnev, 2018; Ben-Michael, Feller and Rothstein, 2019).
In the staggered adoption (SA) design, different units can receive the treatment in different time periods. Once they receive the treatment, they remain exposed to the treatment afterwards. Therefore, $D_{it} = 1$ if $D_{im} = 1$ where $m < t$. We can thus summarize the information about the treatment assignment by the timing of the treatment $A_i$, where $A_i \equiv \min\{t : D_{it} = 1\}$. When unit $i$ never receives the treatment until the end of time $T$, we let $A_i = \infty$. For example, in many applications where researchers are interested in the causal effect of state- or local-level policies, units adopt policies at different time points and remain exposed to such policies once they introduce them. In Section 6.2, we provide an example based on Paglayan (2019). See Figure 3 for a visualization of the SA design.

Following the recent literature on the SA design, we make two standard assumptions in the SA design: the no anticipation assumption and the invariance to history assumption (Athey and Imbens, 2018; Abadie, 2019; Ben-Michael, Feller and Rothstein, 2019; Imai and Kim, 2019). This implies that, for unit $i$ in period $t$, the potential outcome $Y_{it}(1)$ represents the outcome of unit $i$ that would realize in period $t$ if unit $i$ receives the treatment at or before period $t$. Similarly, $Y_{it}(0)$ represents the outcome of unit $i$ that would realize in period $t$ if unit $i$ does not receive the treatment by period $t$.

[Figure 3: Example of the Staggered Adoption Design. Note: We use gray cells of “1” to denote the treated observation and use white cells of “0” to denote the control observation. The main feature of the SA design is that once units receive the treatment, they remain exposed to the treatment.]

Finally, we generalize the group indicator $G$ as follows.

$$G_{it} = \begin{cases} 1 & \text{if } A_i = t \\ 0 & \text{if } A_i > t \\ -1 & \text{if } A_i < t \end{cases} \tag{18}$$

where $G_{it} = 1$ represents units who receive the treatment at time $t$, and $G_{it} = 0$ ($G_{it} = -1$) indicates units who receive the treatment after (before) time $t$.

Under the SA design, the staggered adoption ATT (SA-ATT) at time $t$ is defined as follows.

$$\tau^{\text{SA}}(t) = E[Y_{it}(1) - Y_{it}(0) \mid G_{it} = 1],$$

which represents the causal effect of the treatment in period $t$ on units with $G_{it} = 1$, who receive the treatment at time $t$. This is a straightforward extension of the standard ATT (equation (1)) in the basic DID setting. Researchers might also be interested in the time-average staggered adoption ATT (time-average SA-ATT).

$$\tau^{\text{SA}} = \sum_{t \in \mathcal{T}} \pi_t \tau^{\text{SA}}(t),$$

where $\mathcal{T}$ represents a set of the time periods for which researchers want to estimate the ATT. For example, if a researcher is interested in estimating the ATT for the entire sample period, one can take $\mathcal{T} = \{1, \ldots, T\}$. The SA-ATT in period $t$, $\tau^{\text{SA}}(t)$, is weighted by the proportion of units who receive the treatment at time $t$: $\pi_t = \sum_{i=1}^{n} \mathbf{1}\{A_i = t\} / \sum_{i=1}^{n} \mathbf{1}\{A_i \in \mathcal{T}\}$.

4.2 Double DID for Staggered Adoption Design

Under what assumptions can we identify the SA-ATT and the time-average SA-ATT? Here, we first extend the standard DID estimator under the parallel trends assumption and the sequential DID estimator under the parallel trends-in-trends assumption to the SA design.
Formally, we define the standard DID estimator for the SA-ATT at time $t$ as

$$\widehat{\tau}^{\text{SA}}_{\text{DID}}(t) = \left( \frac{\sum_{i: G_{it}=1} Y_{it}}{n_{1t}} - \frac{\sum_{i: G_{it}=1} Y_{i,t-1}}{n_{1,t-1}} \right) - \left( \frac{\sum_{i: G_{it}=0} Y_{it}}{n_{0t}} - \frac{\sum_{i: G_{it}=0} Y_{i,t-1}}{n_{0,t-1}} \right),$$

which is consistent for the SA-ATT under the following parallel trends assumption in period $t$ under the SA design:

$$E[Y_{it}(0) \mid G_{it} = 1] - E[Y_{i,t-1}(0) \mid G_{it} = 1] = E[Y_{it}(0) \mid G_{it} = 0] - E[Y_{i,t-1}(0) \mid G_{it} = 0].$$

Similarly, we can define the sequential DID estimator for the SA-ATT at time $t$ as

$$\begin{aligned} \widehat{\tau}^{\text{SA}}_{\text{s-DID}}(t) = {} & \left\{ \left( \frac{\sum_{i: G_{it}=1} Y_{it}}{n_{1t}} - \frac{\sum_{i: G_{it}=1} Y_{i,t-1}}{n_{1,t-1}} \right) - \left( \frac{\sum_{i: G_{it}=0} Y_{it}}{n_{0t}} - \frac{\sum_{i: G_{it}=0} Y_{i,t-1}}{n_{0,t-1}} \right) \right\} \\ & - \left\{ \left( \frac{\sum_{i: G_{it}=1} Y_{i,t-1}}{n_{1,t-1}} - \frac{\sum_{i: G_{it}=1} Y_{i,t-2}}{n_{1,t-2}} \right) - \left( \frac{\sum_{i: G_{it}=0} Y_{i,t-1}}{n_{0,t-1}} - \frac{\sum_{i: G_{it}=0} Y_{i,t-2}}{n_{0,t-2}} \right) \right\}, \end{aligned}$$

which is consistent for the SA-ATT under the following parallel trends-in-trends assumption in period $t$ under the SA design:

$$\begin{aligned} & \{ E[Y_{it}(0) \mid G_{it} = 1] - E[Y_{it}(0) \mid G_{it} = 0] \} - \{ E[Y_{i,t-1}(0) \mid G_{it} = 1] - E[Y_{i,t-1}(0) \mid G_{it} = 0] \} \\ & = \{ E[Y_{i,t-1}(0) \mid G_{it} = 1] - E[Y_{i,t-1}(0) \mid G_{it} = 0] \} - \{ E[Y_{i,t-2}(0) \mid G_{it} = 1] - E[Y_{i,t-2}(0) \mid G_{it} = 0] \}. \end{aligned}$$

Finally, combining the standard and sequential DID estimators, we can extend the double DID to the SA design as follows.

$$\widehat{\tau}^{\text{SA}}_{\text{d-DID}}(t) = \operatorname*{argmin}_{\tau^{\text{SA}}(t)} \begin{pmatrix} \tau^{\text{SA}}(t) - \widehat{\tau}^{\text{SA}}_{\text{DID}}(t) \\ \tau^{\text{SA}}(t) - \widehat{\tau}^{\text{SA}}_{\text{s-DID}}(t) \end{pmatrix}^{\top} W(t) \begin{pmatrix} \tau^{\text{SA}}(t) - \widehat{\tau}^{\text{SA}}_{\text{DID}}(t) \\ \tau^{\text{SA}}(t) - \widehat{\tau}^{\text{SA}}_{\text{s-DID}}(t) \end{pmatrix}$$

where $W(t)$ is a user-specified weight matrix.
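The definitions above can be illustrated with a small sketch that builds the group indicator of equation (18), the weights $\pi_t$, and the two SA estimators at a given time $t$. It assumes a balanced panel stored as nested dictionaries, so that each group mean is taken over the same units in every period; the function names and the toy data are hypothetical and not part of the paper.

```python
import math
from statistics import mean


def group_indicator(adopt_time, t):
    """G_it: 1 if treated at t, 0 if treated later (or never), -1 if earlier."""
    if adopt_time == t:
        return 1
    return 0 if adopt_time > t else -1


def sa_did(panel, adopt, t):
    """Standard and sequential DID for the SA-ATT at time t.

    panel[i][s] is the outcome of unit i in period s (balanced panel assumed);
    adopt[i] is the adoption time A_i, with math.inf for never-treated units.
    """
    treated = [i for i in panel if group_indicator(adopt[i], t) == 1]
    control = [i for i in panel if group_indicator(adopt[i], t) == 0]

    def gap(s):
        # Difference in group means at period s.
        return mean([panel[i][s] for i in treated]) - mean(
            [panel[i][s] for i in control]
        )

    did = gap(t) - gap(t - 1)
    sdid = did - (gap(t - 1) - gap(t - 2))
    return did, sdid


def adoption_weights(adopt, periods):
    """pi_t: share of units adopting at t among units adopting within `periods`."""
    counts = {t: sum(1 for a in adopt.values() if a == t) for t in periods}
    total = sum(counts.values())
    return {t: counts[t] / total for t in periods}


# Toy data: unit "a" adopts at t = 3 with a true effect of 5 and an untreated
# outcome 10 + 2t; units "b" and "c" follow t when untreated ("c" adopts at 4).
panel = {
    "a": {1: 12, 2: 14, 3: 21, 4: 23},
    "b": {1: 1, 2: 2, 3: 3, 4: 4},
    "c": {1: 1, 2: 2, 3: 3, 4: 9},
}
adopt = {"a": 3, "b": math.inf, "c": 4}
did_3, sdid_3 = sa_did(panel, adopt, 3)
weights = adoption_weights(adopt, [3, 4])
```

In this toy example the treated unit's untreated outcome diverges linearly from the controls, so the standard DID at $t = 3$ returns 6 while the sequential DID returns the true effect of 5, mirroring the distinction between the parallel trends and parallel trends-in-trends assumptions.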
Under the SA design as well, the standard DID and sequential DID estimators are special cases of our proposed double DID estimator with specific weight matrices.

Like the basic double DID estimator that we have proposed in Section 3.1, the double DID for the SA design also has two steps. The first step is to assess the underlying assumptions using the standard DID for the SA design with two time points $\{t-2, t-1\}$ for units who are not yet treated at time $t-1$, that is, $\{i : G_{it} \geq 0\}$. This is a generalization of the pre-treatment-trends test in the basic DID setup (Section 2.2). The second step is to estimate the SA-ATT at time $t$. When only the parallel trends-in-trends assumption is plausible, we choose the weight matrix $W(t)$ where $W_{11}(t) = W_{12}(t) = W_{21}(t) = 0$ and $W_{22}(t) = 1$, which converges to the sequential DID under the SA design. When the extended parallel trends assumption is plausible, we use the optimal weight matrix defined as $W(t) = \text{Var}(\widehat{\tau}^{\text{SA}}_{(1:2)}(t))^{-1}$, where $\text{Var}(\cdot)$ is the variance-covariance matrix and $\widehat{\tau}^{\text{SA}}_{(1:2)}(t) = (\widehat{\tau}^{\text{SA}}_{\text{DID}}(t), \widehat{\tau}^{\text{SA}}_{\text{s-DID}}(t))^{\top}$. This optimal weight matrix provides us with the most efficient estimator (i.e., the smallest standard error). We provide further details on the implementation in Appendix D.

To estimate the time-average SA-ATT, we extend the double DID as follows.

$$\widehat{\tau}^{\text{SA}}_{\text{d-DID}} = \operatorname*{argmin}_{\tau^{\text{SA}}} \begin{pmatrix} \tau^{\text{SA}} - \widehat{\tau}^{\text{SA}}_{\text{DID}} \\ \tau^{\text{SA}} - \widehat{\tau}^{\text{SA}}_{\text{s-DID}} \end{pmatrix}^{\top} W \begin{pmatrix} \tau^{\text{SA}} - \widehat{\tau}^{\text{SA}}_{\text{DID}} \\ \tau^{\text{SA}} - \widehat{\tau}^{\text{SA}}_{\text{s-DID}} \end{pmatrix}$$

where

$$\widehat{\tau}^{\text{SA}}_{\text{DID}} = \sum_{t \in \mathcal{T}} \pi_t \widehat{\tau}^{\text{SA}}_{\text{DID}}(t), \quad \text{and} \quad \widehat{\tau}^{\text{SA}}_{\text{s-DID}} = \sum_{t \in \mathcal{T}} \pi_t \widehat{\tau}^{\text{SA}}_{\text{s-DID}}(t).$$

The optimal weight matrix $W$ is equal to $W = \text{Var}(\widehat{\tau}^{\text{SA}}_{(1:2)})^{-1}$, where $\widehat{\tau}^{\text{SA}}_{(1:2)} = (\widehat{\tau}^{\text{SA}}_{\text{DID}}, \widehat{\tau}^{\text{SA}}_{\text{s-DID}})^{\top}$.

We now extend the double DID regression to the SA design setting.
We first extend the standard DID regression (Section 3.3) to the SA design. In particular, to estimate the SA-ATT at time $t$, we can fit the following regression for units who are not yet treated at time $t-1$, that is, $\{i : G_{it} \geq 0\}$.

$$Y_{iv} \sim \alpha + \theta G_{it} + \gamma I_v + \beta^{\text{SA}}(t)(G_{it} \times I_v) + X_{iv}^{\top} \rho,$$

where $v \in \{t-1, t\}$ and $I_v$ is the time indicator (equal to 1 if $v = t$ and 0 if $v = t-1$). Note that $G_{it}$ defines the treatment and control group at time $t$, and thus, it does not depend on the time index $v$. The estimated coefficient $\widehat{\beta}^{\text{SA}}(t)$ is consistent for the SA-ATT under the conditional parallel trends assumption.

Similarly, we can extend the sequential DID regression to the SA design. Using the connection to the linear regression of equation (12), we can adjust for additional pre-treatment covariates as:

$$\Delta Y_{iv} \sim \alpha_s + \theta_s G_{it} + \gamma_s I_v + \beta^{\text{SA}}_s(t)(G_{it} \times I_v) + X_{iv}^{\top} \rho_s,$$

where $v \in \{t-1, t\}$, $\Delta Y_{iv} = Y_{iv} - (\sum_{i: G_{it} = 1} Y_{i,v-1}) / n_{1,v-1}$ if $G_{it} = 1$, and $\Delta Y_{iv} = Y_{iv} - (\sum_{i: G_{it} = 0} Y_{i,v-1}) / n_{0,v-1}$ if $G_{it} = 0$. The estimated coefficient $\widehat{\beta}^{\text{SA}}_s(t)$ is consistent for the SA-ATT under the conditional parallel trends-in-trends assumption.

Therefore, the double DID regression for the SA design combines the two regression estimators via the GMM:

$$\widehat{\beta}^{\text{SA}}_{\text{d-DID}}(t) = \operatorname*{argmin}_{\beta^{\text{SA}}_d(t)} \begin{pmatrix} \beta^{\text{SA}}_d(t) - \widehat{\beta}^{\text{SA}}(t) \\ \beta^{\text{SA}}_d(t) - \widehat{\beta}^{\text{SA}}_s(t) \end{pmatrix}^{\top} W(t) \begin{pmatrix} \beta^{\text{SA}}_d(t) - \widehat{\beta}^{\text{SA}}(t) \\ \beta^{\text{SA}}_d(t) - \widehat{\beta}^{\text{SA}}_s(t) \end{pmatrix}$$

where the choice of the weight matrix follows the same two-step procedure as Section 4.2. We also provide further details in Appendix D. The optimal weight matrix $W(t)$ is equal to $\text{Var}(\widehat{\beta}^{\text{SA}}_{(1:2)}(t))^{-1}$, where $\widehat{\beta}^{\text{SA}}_{(1:2)}(t) = (\widehat{\beta}^{\text{SA}}(t), \widehat{\beta}^{\text{SA}}_s(t))^{\top}$.

To estimate the time-average SA-ATT, we extend the double DID regression as follows.
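$$\widehat{\beta}^{\text{SA}}_{\text{d-DID}} = \operatorname*{argmin}_{\beta^{\text{SA}}_d} \begin{pmatrix} \beta^{\text{SA}}_d - \widehat{\beta}^{\text{SA}} \\ \beta^{\text{SA}}_d - \widehat{\beta}^{\text{SA}}_s \end{pmatrix}^{\top} W \begin{pmatrix} \beta^{\text{SA}}_d - \widehat{\beta}^{\text{SA}} \\ \beta^{\text{SA}}_d - \widehat{\beta}^{\text{SA}}_s \end{pmatrix}$$

where

$$\widehat{\beta}^{\text{SA}} = \sum_{t \in \mathcal{T}} \pi_t \widehat{\beta}^{\text{SA}}(t), \quad \text{and} \quad \widehat{\beta}^{\text{SA}}_s = \sum_{t \in \mathcal{T}} \pi_t \widehat{\beta}^{\text{SA}}_s(t).$$

The optimal weight matrix $W$ is equal to $\text{Var}(\widehat{\beta}^{\text{SA}}_{(1:2)})^{-1}$, where $\widehat{\beta}^{\text{SA}}_{(1:2)} = (\widehat{\beta}^{\text{SA}}, \widehat{\beta}^{\text{SA}}_s)^{\top}$.

This section clarifies relationships between our proposed double DID and three existing methods: the two-way fixed effects estimator, the sequential DID estimator, and synthetic control methods.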
While we contrast the double DID with the two-way fixed effects estimator throughout the paper, we summarize our discussion here. First, in the basic DID design, the two-way fixed effects estimator is a special case of the double DID with a specific choice of the weight matrix $W$ (see Table 1). Therefore, whenever the two-way fixed effects estimator is consistent for the ATT, the double DID is a more efficient, consistent estimator of the ATT. This is because the double DID can choose the optimal weight matrix via the GMM, while the two-way fixed effects estimator uses pre-determined equal weights over time. Second, in the SA design, a large number of recent papers show that the widely used two-way fixed effects estimator is in general inconsistent for the ATT due to treatment effect heterogeneity and implicit parametric assumptions (Abraham and Sun, 2018; Athey and Imbens, 2018; Strezhnev, 2018; Imai and Kim, 2020). In contrast, the proposed double DID in the SA design generalizes nonparametric DID estimators to allow for treatment effect heterogeneity, and thus, it does not suffer from the same problem.

Our double DID estimator contains the sequential DID estimator (e.g., Lee, 2016; Mora and Reggio, 2019) as a special case. Our proposed double DID improves over the sequential DID estimator in two ways. First, when the parallel trends assumption holds, the double DID optimally combines the standard DID and the sequential DID to improve efficiency, and it is not equal to the sequential DID. Therefore, it avoids a dilemma of the sequential DID: the sequential DID is consistent under the parallel trends-in-trends assumption (weaker than the parallel trends assumption), but is less efficient when the parallel trends assumption holds. Second, while the sequential DID estimator has only been available for the basic DID design where treatment assignment happens only once, we generalize it to the staggered adoption design and further incorporate it into our staggered-adoption double DID estimator (Section 4).
Another relevant and popular class of methods is the synthetic control methods. While the method was originally designed to estimate the causal effect on a single treated unit, recent extensions allow for multiple treated units and the staggered adoption design (e.g., Xu, 2017; Athey et al., 2017; Ben-Michael, Feller and Rothstein, 2018; Hazlett and Xu, 2018). Despite a wide variety of innovative extensions, they all share the same core feature: they require long pre-treatment periods to accurately estimate a pre-treatment trajectory of the treated units. For example, Xu (2017) recommends collecting more than ten pre-treatment periods. In contrast, the proposed double DID can be applied as long as there is more than one pre-treatment period, and is better suited when there are a small to moderate number of pre-treatment periods.

When there are a large number of pre-treatment periods (i.e., long enough to apply the synthetic control methods), we recommend applying both the synthetic control methods and the proposed double DID, and evaluating robustness across those approaches. This is important because they rely on different identification assumptions. In fact, as we show in Section 6.2, the double DID can recover credible estimates similar to more flexible variants of synthetic control methods even when there are a large number of pre-treatment periods. This robustness provides researchers with additional credibility for their causal estimates and underlying assumptions.
Malesky, Nguyen and Tran (2014) utilize the basic DID design to study how the abolition of elected councils affects local public services in Vietnam. To estimate the causal effects of the institutional change, the original authors rely on data from 2008 and 2010, which are before and after the abolition of elected councils in 2009. Then, they supplement the main analysis by assessing trends in pre-treatment periods from 2006 to 2008. In this section, we apply the proposed method and illustrate how to improve this basic DID design.

Although Malesky, Nguyen and Tran (2014) employ the exact same DID design for all of the thirty outcomes they consider, each outcome might require different assumptions, as noted in the original paper. Here, we focus on reanalyzing three outcomes that have different patterns of pre-treatment periods. By doing so, we clarify how researchers can use the double DID method to transparently assess underlying assumptions and employ appropriate DID estimators under different settings. We provide an analysis of all the thirty outcomes in Appendix G.
The first step of the DID design is to visualize trends of treatment and control groups. Figure 4 shows trends of three different outcomes: “Education and Cultural Program,” “Tap Water,” and “Agricultural Center.” Although the original analysis uses the same DID design for all of them, they have distinct trends in the pre-treatment periods. The first outcome of “Education and Cultural Program” has parallel trends in pre-treatment periods. For the other two outcomes, trends do not look parallel in either of the cases. While the trends for the second outcome (“Tap Water”) have similar directions, trends for the third outcome (“Agricultural Center”) have opposite signs. This visualization of trends serves as a transparent first step to assess the underlying assumptions necessary for the DID estimation.

Footnote: “Education and Cultural Program” (binary): This variable takes one if there is a program that invests in culture and education in the commune. “Tap Water” (binary): What is the main source of drinking/cooking water for most people in this commune? “Agricultural Center” (binary): Is there any agriculture extension center in a given commune? Please see Malesky, Nguyen and Tran (2014) for further details.

[Figure 4: Visualizing Trends of Treatment and Control Groups. Panels: Education and Cultural Program, Tap Water, Agricultural Center; x-axis: Year (2006, 2008, 2010); y-axis: Mean of Outcomes. Note: We report trends for the treatment group (blue solid line with solid circles) and the control group (gray dashed line with hollow circles). Two pre-treatment periods are 2006 and 2008. One post-treatment period, 2010, is indicated by the gray shaded area.]

The next step is to formally assess underlying assumptions. As in the original study, it is common to incorporate additional covariates to make the parallel trends assumption more plausible. Based on detailed domain knowledge, Malesky, Nguyen and Tran (2014) include four control variables: area size of each commune, population size, whether the commune is a national-level city, and regional fixed effects. Thus, we assess the conditional extended parallel trends assumption by fitting the DID regression (equation (15)) to pre-treatment periods from 2006 to 2008, where $X_{it}$ includes the four control variables. If the conditional extended parallel trends assumption holds, estimates of the DID regression on pre-treatment trends should be close to zero.

While a traditional approach is to assess whether estimates are statistically distinguishable from zero at the conventional 5% or 10% level, we also report results based on the equivalence approach that we recommend in Section 3. Specifically, we compute the 95% standardized equivalence confidence interval, which quantifies the smallest equivalence range supported by the observed data (Hartman and Hidalgo, 2018). In the context of this application, the equivalence confidence interval is standardized based on the mean and standard deviation of the control group in 2006. For example, if the 95% standardized equivalence confidence interval is $[-\nu, \nu]$, this means that the equivalence test rejects the hypothesis that the DID estimate (standardized with respect to the baseline control outcome) on pre-treatment periods is larger than $\nu$ or smaller than $-\nu$ at the 5% level. Thus, the conditional extended parallel trends assumption is more plausible when the equivalence confidence interval is shorter.

The results are summarized in Table 2. Standard errors are computed via block-bootstrap at the district level, where we take 2000 bootstrap iterations.
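The equivalence confidence interval described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the computation used in the paper: it relies on a normal approximation and on the two one-sided tests (TOST) logic, under which the smallest symmetric range $[-\nu, \nu]$ that is rejected at level $\alpha$ has $\nu = |\text{estimate}| + z_{1-\alpha} \times \text{standard error}$. The function name `equivalence_ci`, the `scale` argument (standing in for the baseline control-group standard deviation), and the numbers in the usage lines are hypothetical.

```python
from statistics import NormalDist


def equivalence_ci(estimate, std_error, scale=1.0, alpha=0.05):
    """Smallest symmetric range [-nu, nu] whose TOST null is rejected at alpha.

    Assumptions in this sketch: the estimate is approximately normal, and
    standardization is done by dividing by `scale`, for example the standard
    deviation of the control-group outcome in the baseline period.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    nu = (abs(estimate) + z_crit * std_error) / scale
    return -nu, nu


# Hypothetical pre-treatment DID estimate of 0.1 with standard error 0.05,
# standardized by a baseline standard deviation of 0.5.
lower, upper = equivalence_ci(0.1, 0.05, scale=0.5)
```

A narrow interval indicates that the data support only small deviations from parallel pre-treatment trends, whereas a wide interval can arise either from a large estimate or from a large standard error, which is why it complements the conventional p-value.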
For the first outcome, as the graphical presentation in Figure 4 suggests, a statistical test suggests that the extended parallel trends assumption is plausible. The test of the conditional extended parallel trends yields the p-value of . (the third column), and similarly, the 95% standardized equivalence confidence interval is [ − . , . ] (the fourth column), which is shorter than the other two outcomes discussed below. For the second outcome, the test of the parallel trends produces the p-value of . , and the 95% standardized equivalence confidence interval, [ − . , . ], reveals that the parallel trends assumption is less plausible for this outcome than for the first outcome. Finally, for the third outcome, “Agricultural Center,” both traditional and equivalence approaches provide little evidence for parallel trends, as is graphically clear in Figure 4. The test of the parallel trends is rejected at the 5% level (p-value = . ) and the 95% standardized equivalence confidence interval is relatively large, [ − . , . ]. Although we only have two pre-treatment periods as in the original analysis, if more than two pre-treatment periods are available, researchers can assess the extended parallel trends-in-trends assumption in a similar way by applying the sequential DID estimator to pre-treatment periods. After assessing the underlying parallel trends assumptions, we now proceed to the estimation of the ATT via the double DID.

Table 2: Assessing Underlying Assumptions Using the Pre-treatment Outcomes.

                                 Estimate   Std. Error   p-value   95% Std. Equivalence CI
Education and Cultural Program   − .                               [ − . , . ]
Tap Water                        0.166      0.084        0.048     [ − . , . ]
Agricultural Center              0.198      0.095        0.037     [ − . , . ]

Note: We evaluate the conditional extended parallel trends assumption for three different outcomes. The table reports DID estimates on pre-treatment trends, standard errors, p-values, and the 95% standardized equivalence confidence intervals.
Within the double DID framework, we select appropriate DID estimators after the empirical assessment of underlying assumptions. For the first outcome of “Education and Cultural Program,” diagnostics in the previous section suggest that the extended parallel trends assumption is plausible. In such settings, the double DID is expected to produce similar point estimates with smaller standard errors compared to the conventional DID estimator. The first plot of Figure 5 clearly shows this pattern. In the figure, we report point estimates as well as 90% confidence intervals following the original paper (see Figure 3 in Malesky, Nguyen and Tran (2014)). Using the standard DID estimator, the original estimate of the ATT on “Education and Cultural Program” was . (90% CI = [ − . , . ]). Using the double DID estimator, an estimate is instead . (90% CI = [ . , . ]). By using the double DID estimator, we shrink standard errors by about 10%. Although we only have two pre-treatment periods here, when there are more pre-treatment periods, the efficiency gain of the double DID can be even larger.

[Figure 5: Estimating Causal Effects of Abolishing Elected Councils. Panels: Education and Cultural Program, Tap Water; x-axis: DID, Double DID; y-axis: ATT with confidence interval. Note: We compare estimates from the standard DID and the proposed double DID. For the first plot where the extended parallel trends assumption is plausible, the double DID produces a similar point estimate with smaller standard errors. For the second plot where only the parallel trends-in-trends assumption is plausible, the double DID estimator can still estimate the ATT, while the standard DID estimate is likely to be biased.]

For the second outcome of “Tap Water,” we did not have enough evidence to support the extended parallel trends assumption. Thus, instead of using the standard DID as in the original analysis, we rely on the parallel trends-in-trends assumption. In this case, the double DID estimates the ATT by allowing for linear time-varying unmeasured confounding, in contrast to the standard DID, which still assumes constant unmeasured confounders. The second plot of Figure 5 shows the important difference between the two methods. While the standard DID estimate is − . (90% CI = [ − . , . ]), the double DID estimate is − . (90% CI = [ − . , − . ]). Given that the extended parallel trends assumption is not plausible, this result suggests that the standard DID suffers from substantial bias (the bias of . corresponds to more than 50% of the original point estimate). By incorporating non-parallel pre-treatment trends, the double DID shows that the original DID estimate was underestimated by a large amount.
Finally, for the third outcome (“Agricultural Center”), the previous diagnostics suggest that the extended parallel trends assumption is implausible. It is possible to use the double DID under the parallel trends-in-trends assumption. However, the trends of the treatment and control groups have opposite signs, implying that the double DID estimates are highly sensitive to the parallel trends-in-trends assumption. Given that the parallel trends-in-trends assumption is also difficult to justify here, there is no credible estimator of the ATT without making additional stringent assumptions. While we mainly focused on the three outcomes here, the double DID improves upon the standard DID in a similar way for the other outcomes as well (see Appendix G).

In this section, we apply the proposed double DID estimator to revisit Paglayan (2019), which uses the staggered adoption (SA) design to study the effect of granting collective bargaining rights to teachers’ unions on educational expenditures and teachers’ salaries. Paglayan (2019) applies standard two-way fixed effects models to estimate the effect of the introduction of the mandatory bargaining law in the US states on the two outcomes. The original author exploits the variation induced by the different timing of the law’s introduction: a few states introduced the law as early as the mid-1960s, while some states, such as Arizona or Kentucky, never introduced the mandate. Among the states that granted bargaining rights, the introduction timing varies from the mid-1960s to the mid-1980s (Nebraska was the last state to adopt the law).
We apply the proposed double DID for the SA design to panel data consisting of state-year observations. A state is treated in a particular year if the state passes the mandatory bargaining law in that year or has already passed it. Following the original study, we study two outcomes: per-pupil expenditure and annual teacher salary, both on a log scale. There are 2,058 observations, covering 49 states (excluding Washington DC and Wisconsin, due to the short availability of the pre-treatment outcomes) and spanning 1959 through 2000.

Figure 6 shows the variation of the treatment across states and over time. Cells in gray indicate state-year observations that are not treated, and blue cells indicate the treated observations. We can observe that there are 14 unique treatment timings (the earliest is 1965 and the latest is 1987), where the number of states at each treatment timing varies from one to six (the average number of states at a treatment timing is 2.3). We can also see that there is no reversal of treatment status: once a state adopts the policy, it never abolishes it during the sample period.

[Figure 6 plot omitted. Horizontal axis: year, 1959 to 2000; vertical axis: states; legend: Treated, Control.]
Figure 6: Treatment Variation Plot.
Note: Cells in gray are state-year observations that are not treated (i.e., the mandatory bargaining law is not implemented), while cells in blue are observations that are under the treatment condition. Rows are sorted such that states that adopt the policy in earlier years are shown near the top, while states that never adopt the policy are shown near the bottom. The figure indicates that adoption timing varies across states, and that some states never adopt the policy.

We assess the underlying parallel trends assumption for the SA design by utilizing the pre-treatment outcomes. As in the pre-treatment-trends test in the basic DID design, we apply the standard DID estimator for the SA design to pre-treatment periods. For example, to test the pre-treatment trends from t − to t for units who receive the treatment at time t , we estimate the SA-ATT using the outcome from t − and t − (see Section 4.2 for more details). To further facilitate interpretation, we standardize the outcome by the mean and standard deviation of the baseline control group, so that the effect can be interpreted relative to the control group.

[Figure 7 plots omitted. Panels: Expenditure, Salary; horizontal axis: time relative to treatment assignment (−5 to −1); vertical axis: 95% standardized equivalence CI (−0.2 to 0.2).]
Figure 7: Assessing Underlying Assumptions Using the Pre-treatment Outcomes (Left: logged expenditure; Right: logged teacher salary).
Note: We report the 95% standardized equivalence confidence intervals.

Figure 7 shows 95% standardized equivalence confidence intervals for the two outcomes of interest (see Section 3.1 for details on the standardization procedure). It shows that for both outcomes, the equivalence confidence intervals are within 0.2 standard deviations of the means of the baseline control groups through t − to t − . This suggests that the extended parallel trends assumption is plausible for both outcomes.

We apply the double DID for the SA design as described in Section 4. The standard errors are computed with a block bootstrap in which the block is taken at the state level, using 2000 bootstrap iterations. Analyses for the two outcomes are conducted separately. In addition to the proposed method, we apply two existing variants of synthetic control methods that can handle the staggered adoption design: the generalized synthetic control method, gsynth (Xu, 2017), and the augmented synthetic control method, augsynth (Ben-Michael, Feller and Rothstein, 2019). While the proposed double DID is better suited for settings where there are a small to moderate number of pre-treatment periods, we evaluate, in a setting with long pre-treatment periods, whether it can achieve performance comparable to these variants of synthetic control methods, which are primarily designed to deal with long pre-treatment periods (see more discussion in Section 5.3).

[Figure 8 plots omitted. Rows: Expenditure, Salary; columns: Double DID, Gsynth, augsynth; horizontal axis: time; vertical axis: SA-ATT.]
Figure 8: Plot of the Average Treatment Effect on the Treated on Two Outcomes.
Note: We compare estimates from the double DID, the generalized synthetic control method, and the augmented synthetic control method. The causal estimates are similar across methods for both outcomes, and treatment effects are not statistically significant at the conventional 5% level for most of the time periods.

Figure 8 shows the estimates of the treatment effect on per-pupil expenditure (the first row) and teacher salary (the second row), where both effects are on a log scale. We estimated the average treatment effect on the two outcomes ℓ periods after the treatment assignment where ℓ = { , , . . . , }. Note that ℓ = 0 corresponds to the contemporaneous effect. Each column corresponds to a different estimator. The first column shows the proposed double DID estimator for the staggered adoption design, whereas the second (third) column shows estimates based on the generalized synthetic control method (the augmented synthetic control method). We can see that estimates are similar across methods for both outcomes and that treatment effects are not statistically significant at the 5% level for most of the time periods. This result is consistent with the original finding of Paglayan (2019) that granting collective bargaining rights did not increase the level of resources devoted to education.

As in this example, when there are a large number of pre-treatment periods, it is important to apply both synthetic control methods and the proposed double DID and to evaluate robustness across those approaches. This is critical because they rely on different identification assumptions. We found such robustness in this application, which lends additional credibility to the results.
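The state-level block bootstrap used for the standard errors above can be sketched as follows. This is a minimal illustration, assuming a balanced panel stored as NumPy arrays of shape (units, periods) and a generic estimator function. The toy estimator and all names (`block_bootstrap_se`, `toy_estimator`) are hypothetical; they do not reproduce the double DID estimator or the DIDdesign package, and the simulated data are not the Paglayan (2019) data.

```python
import numpy as np


def block_bootstrap_se(y, d, estimator, n_boot=2000, seed=0):
    """Resample whole units (rows) with replacement and return the bootstrap standard error."""
    rng = np.random.default_rng(seed)
    n_units = y.shape[0]
    draws = np.empty(n_boot)
    for b in range(n_boot):
        # Sampling unit indices keeps each unit's full time series together (the "block").
        idx = rng.integers(0, n_units, size=n_units)
        draws[b] = estimator(y[idx], d[idx])
    return draws.std(ddof=1)


def toy_estimator(y, d):
    """Placeholder estimator: first-to-last change, ever-treated minus never-treated units."""
    ever_treated = d[:, -1] == 1
    change = y[:, -1] - y[:, 0]
    return change[ever_treated].mean() - change[~ever_treated].mean()


# Simulated panel: 49 units, 10 periods, the first 25 units adopt at period 5.
rng = np.random.default_rng(1)
n_units, n_periods = 49, 10
d = np.zeros((n_units, n_periods), dtype=int)
d[:25, 5:] = 1
unit_effect = rng.normal(size=(n_units, 1))
y = unit_effect + rng.normal(scale=0.5, size=(n_units, n_periods)) + 0.2 * d

se = block_bootstrap_se(y, d, toy_estimator, n_boot=500)
print(f"bootstrap SE: {se:.3f}")
```

Resampling entire units rather than individual unit-period observations preserves the within-unit dependence over time, which is the reason for taking the block at the state level.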
While the most basic form of the DID requires only two time periods (one before and one after treatment assignment), researchers can often collect data from several additional pre-treatment periods in a wide range of applications. In this article, we show that such multiple pre-treatment periods can help improve the basic DID design and the staggered adoption design in three ways: (1) assessing underlying assumptions about parallel trends, (2) improving estimation accuracy, and (3) enabling more flexible DID estimators. We use the potential outcomes framework to clarify the assumptions required to enjoy each benefit.

We then propose a simple method, the double DID, to combine all three benefits within the GMM framework. Importantly, the double DID contains the popular two-way fixed effects regression and nonparametric DID estimators as special cases, and it uses the GMM to further improve on them with respect to identification and estimation accuracy. Finally, we provide two key extensions. First, we accommodate any number of pre- and post-treatment periods, which allows for even more flexible forms of unmeasured time-varying confounding. Second, we generalize the double DID estimator to the staggered adoption design, where the timing of the treatment assignment can vary across units.
References
Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” The Review of Economic Studies
Journal of Economic Literature
Abadie, Alberto, Alexis Diamond and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association
https://arxiv.org/abs/1804.05785
Angrist, Joshua D and Jörn-Steffen Pischke. 2008. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Athey, Susan and Guido W Imbens. 2018. “Design-based Analysis in Difference-in-Differences Settings with Staggered Adoption.” National Bureau of Economic Research.
Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens and Khashayar Khosravi. 2017. “Matrix Completion Methods for Causal Panel Data Models.” Available at https://arxiv.org/abs/1710.10251
Bechtel, Michael M and Jens Hainmueller. 2011. “How Lasting is Voter Gratitude? An Analysis of the Short- and Long-Term Electoral Returns to Beneficial Policy.” American Journal of Political Science
Annual Review of Political Science
https://arxiv.org/abs/1811.04170
Ben-Michael, Eli, Avi Feller and Jesse Rothstein. 2019. “Synthetic Controls and Weighted Event Studies with Staggered Adoption.” arXiv preprint arXiv:1912.03290.
Bertrand, Marianne, Esther Duflo and Sendhil Mullainathan. 2004. “How Much Should We Trust Differences-in-Differences Estimates?” The Quarterly Journal of Economics
American Political Science Review
The Journal of Politics
American Journal of Political Science
Political Analysis
American Political Science Review
American Journal of Political Science
American Political Science Review
Political Science Research and Methods
Econometrica
American Journal of Political Science
https://ssrn.com/abstract=3214231
Imai, Kosuke and In Song Kim. 2011. “Understanding and Improving Linear Fixed Effects Regression Models for Causal Inference.” Available at https://imai.fas.harvard.edu/research/files/FEmatchOld.pdf
Imai, Kosuke and In Song Kim. 2019. “When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science
Political Analysis
Imbens, Guido W and Jeffrey M Wooldridge. 2009. “Recent Developments in the Econometrics of Program Evaluation.” Journal of Economic Literature
https://arxiv.org/abs/1901.01869
Keele, Luke and William Minozzi. 2013. “How Much is Minnesota like Wisconsin? Assumptions and Counterfactuals in Causal Inference with Observational Data.” Political Analysis
American Journal of Political Science
Review of Economics and Statistics
Sociological Methods & Research
Available at SSRN 3555463
Malesky, Edmund J, Cuong Viet Nguyen and Anh Tran. 2014. “The Impact of Recentralization on Public Services: A Difference-in-Differences Analysis of the Abolition of Elected Councils in Vietnam.” American Political Science Review
https://e-archivo.uc3m.es/bitstream/handle/10016/16065/we1233.pdf?sequence=1
Mora, Ricardo and Iliana Reggio. 2019. “Alternative Diff-in-Diffs Estimators with Several Pretreatment Periods.” Econometric Reviews
Statistical Science
American Journal of Political Science
Journal of Educational Psychology
American Political Science Review
Testing Statistical Hypotheses of Equivalence and Noninferiority. Chapman and Hall/CRC.
Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press.
Xu, Yiqing. 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis
arXiv preprint arXiv:2006.02423.
Review of Papers in APSR and AJPS

We conduct a review of the literature to assess current practices of the difference-in-differences (DID) design. Specifically, we search articles published in American Political Science Review and American Journal of Political Science from 2015 to 2019. Some of the papers we reviewed were accepted in 2019 and were officially published in 2020. Using Google Scholar, we find articles that contain any of the following keywords: “two-way fixed effect”, “two-way fixed effects”, “difference in difference” or “difference in differences.” We then manually select articles from the list that use the basic DID design or the staggered adoption design (see the main text for details about the first two designs). This procedure left us with a total of 25 articles, 11 from APSR and 14 from AJPS. Tables 3 and 4 show the articles in the list published in APSR and AJPS, respectively.

To determine the number of pre-treatment periods, we manually assess the listed articles. Among the 25 articles, 20 articles use the basic DID design, and 5 articles use the staggered adoption design. When a paper uses the basic DID design, we can determine the length of the pre-treatment periods from the data description and the time of the treatment assignment. On the other hand, the pre-treatment periods for the staggered adoption and the general design are set to the total number of time periods available in the data, as the length of pre-treatment periods varies across units.
Table 3: DID papers on APSR.

Authors | Year | Title
O’Brien, D. Z., & Rickne, J. | 2016 | Gender Quotas And Women’s Political Leadership
Garfias, F. | 2018 | Elite Competition and State Capacity Development: Theory and Evidence From Post-Revolutionary Mexico.
Martin, G. J., & McCrain, J. | 2019 | Local News And National Politics
Blom-Hansen, J., Houlberg, K., Serritzlew, S., & Treisman, D. | 2016 | Jurisdiction Size and Local Government Policy Expenditure: Assessing The Effect of Municipal Amalgamation
Clinton, J. D., & Sances, M. W. | 2018 | The Politics of Policy: The Initial Mass Political Effects of Medicaid Expansion in The States
Malesky, E. J., Nguyen, C. V., & Tran, A. | 2014 | The Impact of Recentralization on Public Services: A Difference-in-Differences Analysis of the Abolition of Elected Councils in Vietnam.
Larsen, M. V., Hjorth, F., Dinesen, P. T., & Sønderskov, K. M. | 2019 | When Do Citizens Respond Politically to The Local Economy? Evidence From Registry Data on Local Housing Markets
Becher, M., & González, I. M. | 2019 | Electoral Reform and Trade-Offs in Representation
Selb, P., & Munzert, S. | 2018 | Examining A Most Likely Case for Strong Campaign Effects
Enos, R. D., Kaufman, A. R., & Sands, M. L. | 2019 | Can Violent Protest Change Local Policy Support?
Vasiliki Fouka | 2019 | How Do Immigrants Respond to Discrimination?

Table 4: DID papers on AJPS.

Authors | Year | Title
Bechtel, M. M., Hangartner, D., & Schmid, L. | 2016 | Does compulsory voting increase support for leftist policy?
Bisgaard, M., & Slothuus, R. | 2018 | Partisan elites as culprits? How party cues shape partisan perceptual gaps.
Bischof, D., & Wagner, M. | 2019 | Do voters polarize when radical parties enter parliament?
Dewan, T., Meriläinen, J., & Tukiainen, J. | 2020 | Victorian voting: The origins of party orientation and class alignment.
Earle, J. S., & Gehlbach, S. | 2015 | The Productivity Consequences of Political Turnover: Firm-Level Evidence from Ukraine’s Orange Revolution.
Enos, R. D. | 2016 | What the demolition of public housing teaches us about the impact of racial threat on political behavior.
Gingerich, D. W. | 2019 | Ballot Reform as Suffrage Restriction: Evidence from Brazil’s Second Republic.
Hainmueller, J., & Hangartner, D. | 2019 | Does direct democracy hurt immigrant minorities? Evidence from naturalization decisions in Switzerland.
Holbein, J. B., & Hillygus, D. S. | 2016 | Making young voters: the impact of preregistration on youth turnout.
Jäger, K. | 2020 | When Do Campaign Effects Persist for Years? Evidence from a Natural Experiment.
Lindgren, K. O., Oskarsson, S., & Dawes, C. T. | 2017 | Can Political Inequalities Be Educated Away? Evidence from a Large-Scale Reform.
Lopes da Fonseca, M. | 2017 | Identifying the source of incumbency advantage through a constitutional reform.
Paglayan, A. S. | 2019 | Public-Sector Unions and the Size of Government
Pardos-Prado, S., & Xena, C. | 2019 | Skill specificity and attitudes toward immigration.
Nonparametric Equivalence to Regression Estimators
In this section, we provide results on the nonparametric connection between regression estimators and the three DID estimators discussed in the paper. These results provide the foundations for our main methodological contributions, which we prove in Sections C and D.
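As a numerical illustration of the kind of equivalence established in this section, the sketch below simulates repeated cross-sectional data with three periods and compares least squares coefficients with differences in group means. It is a minimal check under illustrative assumptions (simulated data, unequal cell sizes, hypothetical names such as `cell_mean` and `did`) and relies only on NumPy's `lstsq`; it is not the implementation in DIDdesign. The weight `lam` is computed from the cell sizes as in Result 3 below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Repeated cross-sections: a fresh sample in each period t = 0, 1, 2 (t = 2 is post-treatment).
# Keys are (group, period); values are hypothetical cell sizes.
sizes = {(1, 0): 40, (0, 0): 60, (1, 1): 50, (0, 1): 70, (1, 2): 45, (0, 2): 55}
g = np.concatenate([np.full(n, grp) for (grp, per), n in sizes.items()])
t = np.concatenate([np.full(n, per) for (grp, per), n in sizes.items()])
treat = (g * (t == 2)).astype(float)  # D_it = 1 only for the treated group at t = 2
y = 1.0 + 0.5 * g + 0.3 * t + 0.4 * g * t + 2.0 * treat + rng.normal(size=g.size)


def cell_mean(grp, per):
    return y[(g == grp) & (t == per)].mean()


def did(post, pre):
    """Difference-in-differences of cell means between periods `post` and `pre`."""
    return (cell_mean(1, post) - cell_mean(1, pre)) - (cell_mean(0, post) - cell_mean(0, pre))


# Standard DID: interaction regression on periods 1 and 2.
keep = t >= 1
post = (t[keep] == 2).astype(float)
X1 = np.column_stack([np.ones(keep.sum()), g[keep], post, g[keep] * post])
beta1 = np.linalg.lstsq(X1, y[keep], rcond=None)[0][3]

# Extended DID: group indicator, period effects, and treatment on all three periods.
X3 = np.column_stack([g, t == 0, t == 1, t == 2, treat]).astype(float)
beta3 = np.linalg.lstsq(X3, y, rcond=None)[0][4]
w1 = sizes[(1, 1)] * sizes[(0, 1)] / (sizes[(1, 1)] + sizes[(0, 1)])
w0 = sizes[(1, 0)] * sizes[(0, 0)] / (sizes[(1, 0)] + sizes[(0, 0)])
lam = w1 / (w1 + w0)

# Sequential DID: add a group-specific linear time trend.
X6 = np.column_stack([g, g * t, t == 0, t == 1, t == 2, treat]).astype(float)
beta6 = np.linalg.lstsq(X6, y, rcond=None)[0][5]

print(abs(beta1 - did(2, 1)))                                   # numerically zero
print(abs(beta3 - (lam * did(2, 1) + (1 - lam) * did(2, 0))))   # numerically zero
print(abs(beta6 - (did(2, 1) - did(1, 0))))                     # numerically zero
```

The equalities hold for any simulated outcome values, which is what "nonparametric" equivalence means here: they follow from the algebra of least squares, not from the data-generating process.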
B.1 Standard DID
B.1.1 Repeated Cross-Sectional Data
For later use in this Appendix, we report the well-known result that the standard DID estimator $\widehat{\tau}_{\text{DID}}$ (equation (3)) is equivalent to the coefficient $\widehat{\beta}$ in the regression estimator (equation (4)) (Abadie, 2005).

We define $O_{it}$ to be an indicator variable taking the value 1 when individual $i$ is observed in time period $t$. Using this notation, we prove the following result.

Result 1 (Nonparametric Equivalence of the Standard DID and Regression Estimator). We write the linear regression estimator (equation (4)) as a solution to the following least squares problem.
$$(\widehat{\alpha}, \widehat{\theta}, \widehat{\gamma}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=1}^{2} O_{it} \left\{ Y_{it} - \alpha - \theta G_i - \gamma I_t - \beta (G_i \times I_t) \right\}^2.$$
Then, $\widehat{\tau}_{\text{DID}} = \widehat{\beta}$.

Proof. By solving the least squares problem, we obtain the following solutions, where $n_{gt}$ denotes the number of observed units with $G_i = g$ in period $t$:
\begin{align*}
\widehat{\alpha} &= \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}}, \\
\widehat{\theta} &= \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}}, \\
\widehat{\gamma} &= \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}}, \\
\widehat{\beta} &= \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right),
\end{align*}
which completes the proof.

B.1.2 Panel Data
Again, for later use in the Appendix, we report the well-known result that the standard DID estimator $\widehat{\tau}_{\text{DID}}$ (equation (3)) is equivalent to the coefficient $\widehat{\beta}$ in the two-way fixed effects regression estimator in the panel data setting (Abadie, 2005).

Result 2 (Nonparametric Equivalence of the Standard DID and Two-way Fixed Effects Regression Estimator). We can write the two-way fixed effects regression estimator as a solution to the following least squares problem.
$$(\widehat{\alpha}, \widehat{\delta}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=1}^{2} (Y_{it} - \alpha_i - \delta_t - \beta D_{it})^2.$$
Then, $\widehat{\tau}_{\text{DID}} = \widehat{\beta}$.

Proof. First we define the unit, time, and overall means of the outcome and treatment variables: $\overline{Y}_i = \sum_{t=1}^{2} Y_{it}/2$, $\overline{Y}_t = \sum_{i=1}^{n} Y_{it}/n$, $\overline{Y} = \sum_{i=1}^{n} \sum_{t=1}^{2} Y_{it}/(2n)$, $\overline{D}_i = \sum_{t=1}^{2} D_{it}/2$, $\overline{D}_t = \sum_{i=1}^{n} D_{it}/n$, and $\overline{D} = \sum_{i=1}^{n} \sum_{t=1}^{2} D_{it}/(2n)$. Given these variables, we can transform the least squares problem into the well-known demeaned form,
$$\widehat{\beta} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} \sum_{t=1}^{2} (\widetilde{Y}_{it} - \beta \widetilde{D}_{it})^2,$$
where $\widetilde{Y}_{it} = Y_{it} - \overline{Y}_i - \overline{Y}_t + \overline{Y}$ and $\widetilde{D}_{it} = D_{it} - \overline{D}_i - \overline{D}_t + \overline{D}$. Using this notation, we can express $\widehat{\beta}$ as
$$\widehat{\beta} = \frac{\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it} \widetilde{Y}_{it}}{\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it}^2},$$
where $\widetilde{D}_{it}$ takes the following form,
$$\widetilde{D}_{it} = \begin{cases} (1/2) \cdot n_0/n & \text{if } G_i = 1,\ t = 2 \\ -(1/2) \cdot n_0/n & \text{if } G_i = 1,\ t = 1 \\ -(1/2) \cdot n_1/n & \text{if } G_i = 0,\ t = 2 \\ (1/2) \cdot n_1/n & \text{if } G_i = 0,\ t = 1, \end{cases}$$
where $n_1 = \sum_{i=1}^{n} G_i$ and $n_0 = \sum_{i=1}^{n} (1 - G_i)$.

Then, the numerator can be written as
$$\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it} \widetilde{Y}_{it} = \frac{n_0}{2n} \left\{ \sum_{i=1}^{n} G_i \widetilde{Y}_{i2} - \sum_{i=1}^{n} G_i \widetilde{Y}_{i1} \right\} - \frac{n_1}{2n} \left\{ \sum_{i=1}^{n} (1 - G_i) \widetilde{Y}_{i2} - \sum_{i=1}^{n} (1 - G_i) \widetilde{Y}_{i1} \right\}$$
and the denominator is given as
$$\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it}^2 = 2 n_1 \left( \frac{n_0}{2n} \right)^2 + 2 n_0 \left( \frac{n_1}{2n} \right)^2 = \frac{n_0 n_1}{2n}.$$
Combining both terms, we get
\begin{align*}
\widehat{\beta} &= \frac{\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it} \widetilde{Y}_{it}}{\sum_{i=1}^{n} \sum_{t=1}^{2} \widetilde{D}_{it}^2} \\
&= \frac{1}{n_1} \left\{ \sum_{i=1}^{n} G_i \widetilde{Y}_{i2} - \sum_{i=1}^{n} G_i \widetilde{Y}_{i1} \right\} - \frac{1}{n_0} \left\{ \sum_{i=1}^{n} (1 - G_i) \widetilde{Y}_{i2} - \sum_{i=1}^{n} (1 - G_i) \widetilde{Y}_{i1} \right\} \\
&= \frac{1}{n_1} \sum_{i=1}^{n} G_i (Y_{i2} - Y_{i1}) - \frac{1}{n_0} \sum_{i=1}^{n} (1 - G_i)(Y_{i2} - Y_{i1}) \\
&= \widehat{\tau}_{\text{DID}},
\end{align*}
which concludes the proof.

B.2 Extended DID
B.2.1 Repeated Cross-Sectional Data
We consider a case in which there are two pre-treatment periods $t = \{0, 1\}$ and one post-treatment period $t = 2$. Using this notation, we report the following result.

Result 3 (Nonparametric Equivalence of the Extended DID and Regression Estimator). We focus on a linear regression estimator that is a solution to the following least squares problem.
$$(\widehat{\theta}, \widehat{\gamma}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{2} O_{it} (Y_{it} - \theta G_i - \gamma_t - \beta D_{it})^2.$$
Then, $\widehat{\beta} = \lambda \widehat{\tau}_{\text{DID}} + (1 - \lambda) \widehat{\tau}_{\text{DID(2,0)}}$, where
$$\lambda = \frac{n_{11} n_{01} (n_{10} + n_{00})}{n_{11} n_{01} (n_{10} + n_{00}) + n_{10} n_{00} (n_{11} + n_{01})}, \qquad 1 - \lambda = \frac{n_{10} n_{00} (n_{11} + n_{01})}{n_{11} n_{01} (n_{10} + n_{00}) + n_{10} n_{00} (n_{11} + n_{01})}.$$
When the sample size of each group is fixed over time, i.e., $n_{10} = n_{11}$ and $n_{00} = n_{01}$, $\lambda = 1/2$ and therefore $\widehat{\beta}$ is equivalent to the extended DID estimator with equal weights in equation (9).

Proof. By solving the least squares problem, we obtain
\begin{align*}
\widehat{\theta} &= \lambda \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) + (1 - \lambda) \left( \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right), \\
\widehat{\gamma}_2 &= \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}}, \\
\widehat{\gamma}_1 &= \frac{\sum_{i: G_i = 1} Y_{i1} + \sum_{i: G_i = 0} Y_{i1}}{n_{11} + n_{01}} - \frac{n_{11}}{n_{11} + n_{01}} \widehat{\theta}, \\
\widehat{\gamma}_0 &= \frac{\sum_{i: G_i = 1} Y_{i0} + \sum_{i: G_i = 0} Y_{i0}}{n_{10} + n_{00}} - \frac{n_{10}}{n_{10} + n_{00}} \widehat{\theta}, \\
\widehat{\beta} &= \lambda \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) \right\} \\
&\quad + (1 - \lambda) \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right) \right\},
\end{align*}
which completes the proof.

B.2.2 Panel Data
We report that the extended DID estimator $\widehat{\tau}_{\text{e-DID}}$ (equation (9)) with equal weights ($\lambda = 1/2$) is equivalent to the estimated coefficient $\widehat{\beta}$ in the two-way fixed effects regression estimator in the panel data setting with $t = \{0, 1, 2\}$.

Result 4 (Nonparametric Equivalence of the Extended DID and Two-way Fixed Effects Regression Estimator). We can write the two-way fixed effects regression estimator as a solution to the following least squares problem.
$$(\widehat{\alpha}, \widehat{\delta}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{2} (Y_{it} - \alpha_i - \delta_t - \beta D_{it})^2.$$
Then, $\widehat{\tau}_{\text{e-DID}} = \widehat{\beta}$.

Proof. First we define $\overline{Y}_i = \sum_{t=0}^{2} Y_{it}/3$, $\overline{Y}_t = \sum_{i=1}^{n} Y_{it}/n$, $\overline{Y} = \sum_{i=1}^{n} \sum_{t=0}^{2} Y_{it}/(3n)$, $\overline{D}_i = \sum_{t=0}^{2} D_{it}/3$, $\overline{D}_t = \sum_{i=1}^{n} D_{it}/n$, and $\overline{D} = \sum_{i=1}^{n} \sum_{t=0}^{2} D_{it}/(3n)$. Then, we can write the two-way fixed effects estimator as a two-way demeaned estimator,
$$\widehat{\beta} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} \sum_{t=0}^{2} (\widetilde{Y}_{it} - \beta \widetilde{D}_{it})^2 = \frac{\sum_{i=1}^{n} \sum_{t=0}^{2} \widetilde{D}_{it} \widetilde{Y}_{it}}{\sum_{i=1}^{n} \sum_{t=0}^{2} \widetilde{D}_{it}^2},$$
as in Result 2, where $\widetilde{Y}_{it} = Y_{it} - \overline{Y}_i - \overline{Y}_t + \overline{Y}$ and $\widetilde{D}_{it} = D_{it} - \overline{D}_i - \overline{D}_t + \overline{D}$. Importantly, $\widetilde{D}_{it}$ takes the following form:
$$\widetilde{D}_{it} = \begin{cases} (2/3) \cdot n_0/n & \text{if } G_i = 1,\ t = 2 \\ -(1/3) \cdot n_0/n & \text{if } G_i = 1,\ t = 0, 1 \\ -(2/3) \cdot n_1/n & \text{if } G_i = 0,\ t = 2 \\ (1/3) \cdot n_1/n & \text{if } G_i = 0,\ t = 0, 1, \end{cases}$$
where $n_1 = \sum_{i=1}^{n} G_i$ and $n_0 = \sum_{i=1}^{n} (1 - G_i)$. Then, the numerator can be written as
\begin{align*}
\sum_{i=1}^{n} \sum_{t=0}^{2} \widetilde{D}_{it} \widetilde{Y}_{it}
&= \sum_{i=1}^{n} G_i \left( \frac{2 n_0}{3n} \right) \widetilde{Y}_{i2} - \sum_{i=1}^{n} \sum_{t=0}^{1} G_i \left( \frac{n_0}{3n} \right) \widetilde{Y}_{it} + \sum_{i=1}^{n} (1 - G_i) \left( -\frac{2 n_1}{3n} \right) \widetilde{Y}_{i2} + \sum_{i=1}^{n} \sum_{t=0}^{1} (1 - G_i) \left( \frac{n_1}{3n} \right) \widetilde{Y}_{it} \\
&= \sum_{i=1}^{n} G_i \left( \frac{n_0}{3n} \right) \{ \widetilde{Y}_{i2} - \widetilde{Y}_{i1} \} + \sum_{i=1}^{n} G_i \left( \frac{n_0}{3n} \right) \{ \widetilde{Y}_{i2} - \widetilde{Y}_{i0} \} \\
&\quad - \left\{ \sum_{i=1}^{n} (1 - G_i) \left( \frac{n_1}{3n} \right) \{ \widetilde{Y}_{i2} - \widetilde{Y}_{i1} \} + \sum_{i=1}^{n} (1 - G_i) \left( \frac{n_1}{3n} \right) \{ \widetilde{Y}_{i2} - \widetilde{Y}_{i0} \} \right\} \\
&= \frac{n_0}{3n} \left\{ \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i1} \} + \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i0} \} \right\} - \frac{n_1}{3n} \left\{ \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i1} \} + \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i0} \} \right\}.
\end{align*}
The denominator can be written as
$$\sum_{i=1}^{n} \sum_{t=0}^{2} \widetilde{D}_{it}^2 = \frac{n_0 n_1}{n} \cdot \frac{2}{3}.$$

Combining the two terms, we have
\begin{align*}
\widehat{\beta} &= \frac{1}{2 n_1} \left\{ \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i1} \} + \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i0} \} \right\} - \frac{1}{2 n_0} \left\{ \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i1} \} + \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i0} \} \right\} \\
&= \frac{1}{2} \left\{ \frac{1}{n_1} \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i1} \} - \frac{1}{n_0} \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i1} \} \right\} + \frac{1}{2} \left\{ \frac{1}{n_1} \sum_{i=1}^{n} G_i \{ Y_{i2} - Y_{i0} \} - \frac{1}{n_0} \sum_{i=1}^{n} (1 - G_i) \{ Y_{i2} - Y_{i0} \} \right\} \\
&= \frac{1}{2} \widehat{\tau}_{\text{DID}} + \frac{1}{2} \widehat{\tau}_{\text{DID(2,0)}},
\end{align*}
which completes the proof.

B.3 Sequential DID
B.3.1 Repeated Cross-Sectional Data
We clarify that the sequential DID estimator $\widehat{\tau}_{\text{s-DID}}$ (equation (10)) is equivalent to a coefficient in a regression estimator with transformed outcomes.

Result 5 (Nonparametric Equivalence of the Sequential DID and Regression Estimator). We focus on a linear regression estimator with a transformed outcome.
$$(\widehat{\alpha}, \widehat{\theta}, \widehat{\gamma}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=1}^{2} O_{it} \left\{ \Delta Y_{it} - \alpha - \theta G_i - \gamma I_t - \beta (G_i \times I_t) \right\}^2,$$
where
$$\Delta Y_{it} = \begin{cases} Y_{i2} - \dfrac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} & \text{if } G_i = 1,\ t = 2 \\[2ex] Y_{i1} - \dfrac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} & \text{if } G_i = 1,\ t = 1 \\[2ex] Y_{i2} - \dfrac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} & \text{if } G_i = 0,\ t = 2 \\[2ex] Y_{i1} - \dfrac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} & \text{if } G_i = 0,\ t = 1. \end{cases}$$
Then, $\widehat{\tau}_{\text{s-DID}} = \widehat{\beta}$.

Proof. Using Result 1, we obtain
\begin{align*}
\widehat{\beta} &= \left( \frac{\sum_{i: G_i = 1} \Delta Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} \Delta Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} \Delta Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} \Delta Y_{i1}}{n_{01}} \right) \\
&= \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) \right\} \\
&\quad - \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right) \right\},
\end{align*}
which completes the proof.

Next, we clarify that the sequential DID estimator $\widehat{\tau}_{\text{s-DID}}$ (equation (10)) is also equivalent to a coefficient in a regression estimator with group-specific time trends. Mora and Reggio (2019) derive similar results by making a parametric assumption about the conditional expectations. We prove nonparametric equivalence without making any assumptions about conditional expectations.

Result 6 (Nonparametric Equivalence of the Sequential DID and Regression Estimator with Group-Specific Time Trends). We focus on a linear regression estimator with group-specific time trends.
$$(\widehat{\theta}, \widehat{\gamma}, \widehat{\beta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{2} O_{it} \left\{ Y_{it} - \theta_0 G_i - \theta_1 (G_i \times t) - \gamma_t - \beta D_{it} \right\}^2.$$
Then, $\widehat{\tau}_{\text{s-DID}} = \widehat{\beta}$.

Proof. By solving the least squares problem, we obtain
\begin{align*}
\widehat{\theta}_0 &= \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}}, \\
\widehat{\theta}_1 &= \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) - \left( \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right), \\
\widehat{\gamma}_0 &= \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}}, \quad \widehat{\gamma}_1 = \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}}, \quad \widehat{\gamma}_2 = \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}}, \\
\widehat{\beta} &= \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i2}}{n_{12}} - \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i2}}{n_{02}} - \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} \right) \right\} \\
&\quad - \left\{ \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right) \right\},
\end{align*}
which completes the proof.

B.4 Connection to the Leads Test
Here we formally prove the connection between the test of pre-treatment periods discussed in Section 2.2 and the well-known leads test (Angrist and Pischke, 2008). The leads test includes $D_{i,t+1}$ in a linear regression and checks whether the coefficient of $D_{i,t+1}$ is zero.

B.4.1 Repeated Cross-Sectional Data

In the repeated cross-sectional data setting, the leads test considers the following linear regression.
$$(\widehat{\theta}, \widehat{\gamma}, \widehat{\beta}, \widehat{\zeta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{1} O_{it} (Y_{it} - \theta G_i - \gamma_t - \beta D_{it} - \zeta D_{i,t+1})^2.$$
Then, because $D_{it} = 0$ for all units in $t = \{0, 1\}$, this least squares problem is the same as
$$(\widehat{\theta}, \widehat{\gamma}, \widehat{\zeta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{1} O_{it} (Y_{it} - \theta G_i - \gamma_t - \zeta D_{i,t+1})^2.$$
Finally, using Result 1, we have
$$\widehat{\zeta} = \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_{11}} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_{10}} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_{01}} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_{00}} \right),$$
which is the standard DID estimator applied to the pre-treatment periods $t = 0, 1$.

B.4.2 Panel Data

In the panel data setting, the leads test considers the following two-way fixed effects regression.
$$(\widehat{\alpha}, \widehat{\delta}, \widehat{\beta}, \widehat{\zeta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{1} (Y_{it} - \alpha_i - \delta_t - \beta D_{it} - \zeta D_{i,t+1})^2.$$
Again, this least squares problem is the same as
$$(\widehat{\alpha}, \widehat{\delta}, \widehat{\zeta}) = \operatorname*{argmin} \sum_{i=1}^{n} \sum_{t=0}^{1} (Y_{it} - \alpha_i - \delta_t - \zeta D_{i,t+1})^2.$$
Applying Result 2 to the periods $t = 0, 1$, with $D_{i,t+1}$ in place of the treatment variable, we have
$$\widehat{\zeta} = \left( \frac{\sum_{i: G_i = 1} Y_{i1}}{n_1} - \frac{\sum_{i: G_i = 1} Y_{i0}}{n_1} \right) - \left( \frac{\sum_{i: G_i = 0} Y_{i1}}{n_0} - \frac{\sum_{i: G_i = 0} Y_{i0}}{n_0} \right),$$
which is the standard DID estimator applied to the pre-treatment periods $t = 0, 1$.

Generalized K-Difference-in-Differences

In this section, we propose the generalized K-DID, which extends the double DID in Section 3 to an arbitrary number of pre- and post-treatment periods in the basic DID setting. We consider the staggered adoption design in Section 4.

C.1 The Setup and Causal Quantities of Interest
We first extend the setup to account for an arbitrary number of pre- and post-treatment periods. Suppose we observe outcome $Y_{it}$ for $i \in \{1, \ldots, n\}$ and $t \in \{0, 1, \ldots, T\}$. We define the binary treatment variable to be $D_{it} \in \{0, 1\}$. The treatment is assigned right before time period $T^*$, and thus time periods $t \in \{T^*, \ldots, T\}$ are the post-treatment periods and time periods $t \in \{0, \ldots, T^* - 1\}$ are the pre-treatment periods. As in Section 2.1, we denote the treatment group as $G_i = 1$ and $G_i = 0$ otherwise. Note that $D_{it} = 0$ for $t \in \{0, \ldots, T^* - 1\}$ for all units.

We are interested in the causal effect at post-treatment time $T^* + s$ where $s \geq 0$. When $s = 0$, this corresponds to the contemporaneous treatment effect. By specifying different values of $s > 0$, researchers can study a variety of long-term causal effects of the treatment. Formally, our quantity of interest is the average treatment effect on the treated (ATT) at post-treatment time $T^* + s$,
$$\tau(s) \equiv \mathbb{E}[Y_{i,T^*+s}(1) - Y_{i,T^*+s}(0) \mid G_i = 1].$$
For example, when $s = 3$, this could mean the causal effect of the policy three years after its initial introduction. This definition is a generalization of the standard ATT: when $s = 0$, this quantity is equal to the ATT defined in equation (1).

C.2 Generalize Parallel Trends Assumptions
What assumptions do we need to identify the ATT at post-treatment time $T^* + s$? Here, we provide a generalization of the parallel trends assumption, which incorporates both the standard parallel trends assumption and the parallel trends-in-trends assumption.

Assumption 4 ($k$-th Order Parallel Trends). For some integer $k$ such that $1 \leq k \leq T^*$,

$$\Delta^k_s(E[Y_{i,T^*+s}(0) \mid G_i = 1]) = \Delta^k_s(E[Y_{i,T^*+s}(0) \mid G_i = 0]),$$

where $\Delta^k_s$ is the $k$-th order difference operator defined recursively as follows. For $g \in \{0, 1\}$,

$$\Delta^1_s(E[Y_{i,T^*+s}(0) \mid G_i = g]) \equiv E[Y_{i,T^*+s}(0) \mid G_i = g] - E[Y_{i,T^*-1}(0) \mid G_i = g]$$

when $k = 1$ and, in general,

$$\begin{aligned}
\Delta^k_s(E[Y_{i,T^*+s}(0) \mid G_i = g]) &\equiv \Delta^{k-1}_s(E[Y_{i,T^*+s}(0) \mid G_i = g]) - M^k_s \Delta^{k-1}(E[Y_{i,T^*-1}(0) \mid G_i = g]) \\
&= E[Y_{i,T^*+s}(0) \mid G_i = g] - E[Y_{i,T^*-1}(0) \mid G_i = g] - \sum_{j=1}^{k-1} M^{j+1}_s \Delta^j(E[Y_{i,T^*-1}(0) \mid G_i = g]),
\end{aligned}$$

where $M^\ell_s = \prod_{j=1}^{\ell-1} (s + j) / \prod_{j=1}^{\ell-1} j$ for $\ell \geq 2$. $\Delta^k(E[Y_{i,T^*-1}(0) \mid G_i = g])$ is also recursively defined as

$$\Delta^k(E[Y_{i,T^*-1}(0) \mid G_i = g]) \equiv \Delta^{k-1}(E[Y_{i,T^*-1}(0) \mid G_i = g]) - \Delta^{k-1}(E[Y_{i,T^*-2}(0) \mid G_i = g]),$$

and $\Delta^1(E[Y_{i,T^*-m}(0) \mid G_i = g]) = E[Y_{i,T^*-m}(0) \mid G_i = g] - E[Y_{i,T^*-m-1}(0) \mid G_i = g]$ for $m = \{1, 2\}$.

The standard parallel trends assumption and the parallel-trends-in-trends assumption are both special cases of this assumption. The $k$-th order parallel trends assumption reduces to the standard parallel trends assumption (Assumption 1) when $s = 0$ and $k = 1$, and to the parallel-trends-in-trends assumption (Assumption 3) when $s = 0$ and $k = 2$.

To further clarify the meaning of Assumption 4, we can consider a simpler but stronger condition. In particular, the $k$-th order parallel trends assumption (Assumption 4) is implied by the following polynomial model of confounding of degree $k - 1$.
$$E[Y_{it}(0) \mid G_i = 1] - E[Y_{it}(0) \mid G_i = 0] = \alpha + \sum_{p=1}^{k-1} \Gamma_p t^p,$$

with unknown parameters $\alpha$ and $\Gamma_p$. Here, the left-hand side of the equality captures the difference between the two groups (treatment and control) in terms of the mean of potential outcomes under the control condition. This representation shows that the standard parallel trends assumption (Assumption 1) is implied by time-invariant confounding; the parallel trends-in-trends assumption (Assumption 3) is implied by linear time-varying confounding; and in general, the $k$-th order parallel trends assumption is implied by polynomial confounding of degree $k - 1$.

C.3 Estimate ATT with Multiple Pre- and Post-Treatment Periods
We consider the identification and estimation of the ATT at post-treatment time $T^* + s$. Under the $k$-th order parallel trends assumption (Assumption 4), the ATT is identified as follows.

$$\tau(s) = \Delta^k_s(E[Y_{i,T^*+s} \mid G_i = 1]) - \Delta^k_s(E[Y_{i,T^*+s} \mid G_i = 0]).$$

Because each conditional expectation can be consistently estimated via its sample analogue,

$$\hat{\tau}_k(s) = \Delta^k_s\left( \frac{\sum_{i: G_i = 1} Y_{i,T^*+s}}{n_{1,T^*+s}} \right) - \Delta^k_s\left( \frac{\sum_{i: G_i = 0} Y_{i,T^*+s}}{n_{0,T^*+s}} \right)$$

is a consistent estimator for the ATT at time $T^* + s$ under the $k$-th order parallel trends assumption. When $s = 0$ and $k = 1$, this estimator corresponds to the standard DID estimator (equation (3)). When $s = 0$ and $k = 2$, this is equal to the sequential DID estimator (equation (10)). While existing approaches (e.g., Angrist and Pischke, 2008; Mora and Reggio, 2012; Lee, 2016; Mora and Reggio, 2019) consider each estimator separately, we propose combining multiple DID estimators within the GMM framework.

In general, the generalized double DID combines $K$ moment conditions, where $K$ is the number of pre-treatment periods researchers use. When there are more than two pre-treatment periods, we can naturally combine more than two DID estimators, improving upon the double DID in Section 3. Formally, the generalized double DID is defined as

$$\hat{\tau}(s) = \operatorname*{argmin}_{\tau} \; g(\tau)^\top \widehat{W} g(\tau),$$

where $g(\tau) = (\tau - \hat{\tau}_1(s), \ldots, \tau - \hat{\tau}_K(s))^\top$. Based on the theory of the efficient GMM (Hansen, 1982), the optimal weight matrix is $\widehat{W} = \mathrm{Var}(\hat{\tau}_{(1:K)}(s))^{-1}$, where $\mathrm{Var}(\cdot)$ is the variance-covariance matrix and $\hat{\tau}_{(1:K)}(s) = (\hat{\tau}_1(s), \ldots, \hat{\tau}_K(s))^\top$. When $T^* = 2$, this reduces to the standard DID estimator (equation (3)). When $T^* = 3$, this corresponds to the basic form of the double DID estimator (equation (13)).
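The estimators above can be illustrated with a minimal Python sketch. It assumes the group means are already available as time-indexed sequences (index 0 is the first period) and that the variance-covariance matrix of the $K$ estimators is supplied by the user; how that matrix is estimated is not covered here. The function names `delta_k_s`, `kth_order_did`, and `gmm_combine` are hypothetical and are not part of the DIDdesign package. The sketch uses the fact that $M^{j+1}_s = \binom{s+j}{j}$, and that for moments of the form $\tau - \hat{\tau}_k$ the GMM minimizer has the closed form $\mathbf{1}^\top \widehat{W} \hat{\tau}_{(1:K)} / \mathbf{1}^\top \widehat{W} \mathbf{1}$.

```python
from math import comb

import numpy as np


def backward_diff(series, idx, order):
    """order-th backward difference of `series` at position `idx`."""
    if order == 0:
        return series[idx]
    return (backward_diff(series, idx, order - 1)
            - backward_diff(series, idx - 1, order - 1))


def delta_k_s(series, t_star, s, k):
    """k-th order difference operator applied to a series of group means.

    `series[t]` is the group mean at time t, `t_star` is the first
    post-treatment period, and `s` is the number of periods after t_star.
    """
    if not 1 <= k <= t_star:
        raise ValueError("need 1 <= k <= t_star pre-treatment periods")
    base = t_star - 1  # last pre-treatment period
    out = series[t_star + s] - series[base]
    for j in range(1, k):
        # M_s^{j+1} = prod_{i=1}^{j} (s + i) / j! = C(s + j, j)
        out -= comb(s + j, j) * backward_diff(series, base, j)
    return out


def kth_order_did(mean_treated, mean_control, t_star, s, k):
    """Sample analogue: difference of the operator across the two groups."""
    return (delta_k_s(mean_treated, t_star, s, k)
            - delta_k_s(mean_control, t_star, s, k))


def gmm_combine(estimates, vcov):
    """Combine K estimates with weight matrix W = vcov^{-1} (vcov is an input)."""
    est = np.asarray(estimates, dtype=float)
    weight = np.linalg.inv(np.asarray(vcov, dtype=float))
    ones = np.ones_like(est)
    return float(ones @ weight @ est / (ones @ weight @ ones))
```

For instance, with a control series fixed at zero and a treated series `[2 * (t + 1) + (10 if t >= 4 else 0) for t in range(7)]` (linear confounding plus an effect of 10 from period 4 onward), `kth_order_did(..., t_star=4, s=1, k=2)` removes the linear trend, whereas `k=1` does not.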
Within the GMM framework, we can select moment conditions using the J-statistics (Hansen, 1982). We can similarly generalize the double DID regression.

To assess the extended parallel trends assumption, we can apply the generalized double DID to pre-treatment periods $t \in \{0, \ldots, T^* - 1\}$ as if the last pre-treatment period $T^* - 1$ is the target time period. Moments are $g(\tau) = (\tau - \hat{\tau}_1(0), \ldots, \tau - \hat{\tau}_K(0))^\top$ where

$$\hat{\tau}_k(0) = \Delta^k_s\left( \frac{\sum_{i: G_i = 1} Y_{i,T^*-1}}{n_{1,T^*-1}} \right) - \Delta^k_s\left( \frac{\sum_{i: G_i = 0} Y_{i,T^*-1}}{n_{0,T^*-1}} \right).$$

Similarly, to assess the extended parallel trends-in-trends assumption, we can apply the generalized double DID to pre-treatment periods with moments $g(\tau) = (\tau - \hat{\tau}_2(0), \ldots, \tau - \hat{\tau}_K(0))^\top$.

Generalized K-DID for Staggered Adoption Design

Combining the setup introduced in Section C.1 and the one in Section 4.1, we propose the generalized $K$-DID for the SA design, which allows researchers to estimate long-term causal effects in the SA design. We focus on the SA-ATT at post-treatment time $t + s$, where $t$ is the timing of the treatment assignment and $s \geq 0$ represents how far into the future we want to estimate the ATT. We first redefine the group indicator $G$ to estimate the long-term SA-ATT at post-treatment time $t + s$. In particular, we define

$$G_{its} = \begin{cases} 1 & \text{if } A_i = t \\ 0 & \text{if } A_i > t + s \\ -1 & \text{otherwise} \end{cases}$$

where $G_{its} = 1$ represents units who receive the treatment at time $t$, and $G_{its} = 0$ indicates units who do not receive the treatment by time $t + s$. $G_{its} = -1$ includes other units who receive the treatment before time $t$ or receive the treatment between $t + 1$ and $t + s$. When $s = 0$, this definition corresponds to the group indicator in equation (18).

Formally, our first quantity of interest is the staggered-adoption average treatment effect on the treated (SA-ATT) at post-treatment time $t + s$:

$$\tau^{SA}(s, t) \equiv E[Y_{i,t+s}(1) - Y_{i,t+s}(0) \mid G_{its} = 1].$$
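The three-way group indicator can be sketched in a few lines of Python. This is only an illustration: the function name `sa_group_indicator` is hypothetical, and coding never-treated units with an adoption time of infinity is an assumption made here for convenience, not a convention taken from the paper.

```python
import numpy as np


def sa_group_indicator(adoption_time, t, s):
    """Return G_its for each unit given its adoption time A_i.

    1  : unit adopts the treatment exactly at time t
    0  : unit has not adopted by time t + s (comparison group)
    -1 : everyone else (adopted before t, or between t + 1 and t + s)

    Assumption: never-treated units are coded with A_i = np.inf.
    """
    a = np.asarray(adoption_time, dtype=float)
    g = np.full(a.shape, -1, dtype=int)
    g[a > t + s] = 0
    g[a == t] = 1
    return g
```

With adoption times `[2, 3, 4, 5, np.inf, 1]`, `t = 2`, and `s = 1`, the unit adopting at time 3 is excluded from both groups because it becomes treated before the target period $t + s = 3$ ends, while later adopters and the never-treated unit serve as controls.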
By averaging over time, we can also define the time-average staggered-adoption average treatment effect on the treated (time-average SA-ATT) at $s$ periods after treatment onset:

$$\tau^{SA}(s) \equiv \sum_{t \in \mathcal{T}} \pi_t \tau^{SA}(s, t),$$

where $\mathcal{T}$ represents the set of time periods for which researchers want to estimate the ATT. The SA-ATT in period $t$, $\tau^{SA}(s, t)$, is weighted by the proportion of units who receive the treatment at time $t$: $\pi_t = \sum_{i=1}^{n} \mathbf{1}\{A_i = t\} / \sum_{i=1}^{n} \mathbf{1}\{A_i \in \mathcal{T}\}$.

Here, we provide a generalization of the parallel trends assumption, which incorporates both the standard parallel trends assumption and the parallel trends-in-trends assumption.

Assumption 5 ($k$-th Order Parallel Trends for Staggered Adoption Design). For some integer $k$ such that $1 \leq k \leq T$, and for $k \leq t \leq T - s$,

$$\Delta^k_s(E[Y_{i,t+s}(0) \mid G_{its} = 1]) = \Delta^k_s(E[Y_{i,t+s}(0) \mid G_{its} = 0]),$$

where $\Delta^k_s$ is the $k$-th order difference operator defined in Assumption 4.

Under Assumption 5, the SA-ATT at post-treatment time $t + s$ is identified as follows.

$$\tau^{SA}(s, t) = \Delta^k_s(E[Y_{i,t+s} \mid G_{its} = 1]) - \Delta^k_s(E[Y_{i,t+s} \mid G_{its} = 0]).$$

Since conditional expectations can be consistently estimated via the sample analogue,

$$\hat{\tau}^{SA}_k(s, t) = \Delta^k_s\left( \frac{\sum_{i: G_{its} = 1} Y_{i,t+s}}{n_{1,t+s}} \right) - \Delta^k_s\left( \frac{\sum_{i: G_{its} = 0} Y_{i,t+s}}{n_{0,t+s}} \right)$$
is a consistent estimator for the SA-ATT at post-treatment time $t + s$ under Assumption 5.

In general, we combine $K$ DID estimators to obtain the generalized $K$-DID for the SA-ATT at post-treatment time $t + s$ as follows.

$$\hat{\tau}^{SA}(s, t) = \operatorname*{argmin}_{\tau^{SA}} \; g(\tau^{SA})^\top \widehat{W} g(\tau^{SA}),$$

where $g(\tau^{SA}) = (\tau^{SA} - \hat{\tau}^{SA}_1(s, t), \ldots, \tau^{SA} - \hat{\tau}^{SA}_K(s, t))^\top$. The optimal weight matrix is $\widehat{W} = \mathrm{Var}(\hat{\tau}^{SA}_{(1:K)}(s, t))^{-1}$, where $\hat{\tau}^{SA}_{(1:K)}(s, t) = (\hat{\tau}^{SA}_1(s, t), \ldots, \hat{\tau}^{SA}_K(s, t))^\top$.

To estimate the time-average SA-ATT, we first define the $k$-th order time-average DID estimator as

$$\hat{\tau}^{SA}_k(s) = \sum_{t \in \mathcal{T}} \pi_t \hat{\tau}^{SA}_k(s, t).$$

Finally, the generalized $K$-DID combines $K$ moment conditions as follows.

$$\hat{\tau}^{SA}(s) = \operatorname*{argmin}_{\tau^{SA}} \; g(\tau^{SA})^\top \widehat{W} g(\tau^{SA}),$$

where $g(\tau^{SA}) = (\tau^{SA} - \hat{\tau}^{SA}_1(s), \ldots, \tau^{SA} - \hat{\tau}^{SA}_K(s))^\top$. The optimal weight matrix is $\widehat{W} = \mathrm{Var}(\hat{\tau}^{SA}_{(1:K)}(s))^{-1}$, where $\hat{\tau}^{SA}_{(1:K)}(s) = (\hat{\tau}^{SA}_1(s), \ldots, \hat{\tau}^{SA}_K(s))^\top$.

Equivalence Approach
Here, we provide technical details on the equivalence approach we introduced in Section 3.1. In standard hypothesis testing, researchers usually evaluate the two-sided null hypothesis $H_0: \theta = 0$, where

$$\theta = \{E[Y_{i1}(0) \mid G_i = 1] - E[Y_{i0}(0) \mid G_i = 1]\} - \{E[Y_{i1}(0) \mid G_i = 0] - E[Y_{i0}(0) \mid G_i = 0]\}$$

when we are conducting the pre-treatment-trends test. However, this approach risks conflating evidence for parallel trends with statistical inefficiency. For example, when the sample size is small, even if pre-treatment trends of the treatment and control groups differ (i.e., the null hypothesis is false), a test of the difference might not be statistically significant due to a large standard error. Analysts might then "pass" the pre-treatment-trends test simply by not finding enough evidence for the difference.

The equivalence approach can mitigate this concern by flipping the null hypothesis, so that rejection of the null becomes evidence for parallel trends. In particular, we consider two one-sided tests:

$$H_0: \theta \geq \gamma_U \ \text{ or } \ \theta \leq \gamma_L,$$

where $(\gamma_L, \gamma_U)$ is a user-specified equivalence range. By rejecting this null hypothesis, researchers can provide statistical evidence for the alternative hypothesis:

$$H_1: \gamma_L < \theta < \gamma_U,$$

which means that $\theta$ (i.e., the difference in pre-treatment trends across treatment and control groups) lies within the equivalence range.

One difficulty of the equivalence approach is that researchers have to choose this equivalence range $(\gamma_L, \gamma_U)$, which might not be straightforward in practice. To overcome this challenge, we follow Hartman and Hidalgo (2018) and estimate the 95% equivalence confidence interval, which is the smallest equivalence range supported by the observed data.
Suppose we obtain $[-c, c]$ as the symmetric 95% equivalence confidence interval, where $c > 0$ is some positive constant. Then, if researchers consider an absolute value of $\theta$ smaller than $c$ substantively negligible, the 5% equivalence test would reject the null hypothesis and provide evidence for parallel pre-treatment trends. In contrast, if researchers consider an absolute value of $\theta$ equal to $c$ substantively too large as bias in practice, the 5% equivalence test would fail to reject the null hypothesis and cannot provide evidence for parallel pre-treatment trends. In sum, reporting the equivalence confidence interval lets readers of the analysis decide how much evidence for parallel pre-treatment trends exists in the observed data.

Researchers can estimate the 95% equivalence confidence interval in two general steps. First, estimate the 90% confidence interval, which we denote by $[b_L, b_U]$. Second, obtain the symmetric 95% equivalence confidence interval as $[-b, b]$, where $b = \max\{|b_L|, |b_U|\}$. See Wellek (2010) and Hartman and Hidalgo (2018) for more details.

Figure 9: Figure 1 from Hartman and Hidalgo (2018) on the difference between standard hypothesis testing and equivalence testing.
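The two-step construction of the equivalence confidence interval described above can be sketched as follows. The sketch assumes a normal approximation for the 90% confidence interval, which is one common choice and not necessarily how the interval is computed in the paper or its software; the function name `equivalence_ci` is hypothetical.

```python
from statistics import NormalDist


def equivalence_ci(estimate, std_error, alpha=0.05):
    """Symmetric (1 - alpha) equivalence confidence interval [-b, b].

    Step 1: build the (1 - 2 * alpha) confidence interval [b_L, b_U]
            (assumption: normal approximation).
    Step 2: set b = max(|b_L|, |b_U|).
    """
    z = NormalDist().inv_cdf(1 - alpha)  # about 1.645 when alpha = 0.05
    b_lower = estimate - z * std_error
    b_upper = estimate + z * std_error
    b = max(abs(b_lower), abs(b_upper))
    return -b, b
```

A reader who regards pre-treatment trend differences smaller than the returned `b` as negligible can treat the result as evidence for parallel pre-treatment trends at the 5% level.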
Simulation Study
We conduct a simulation study to compare the performance of the various DID estimators discussed in this paper. We demonstrate two key results. First, the double DID is unbiased under the extended parallel trends assumption or under the parallel trends-in-trends assumption. Second, the double DID has the smallest standard errors among unbiased DID estimators. In particular, standard errors of the double DID are smaller than those of the extended DID (i.e., the two-way fixed effects estimator) even under the extended parallel trends assumption.

We compare three DID estimators (the double DID, the extended DID, and the sequential DID) using two scenarios. In the first scenario, the extended parallel trends assumption (Assumption 2) holds, where the difference between potential outcomes under control, $E[Y_{it}(0) \mid G_i = 1] - E[Y_{it}(0) \mid G_i = 0]$, is constant over time. This corresponds to time-invariant unmeasured confounding, and we expect all the DID estimators to be unbiased in this scenario. The second scenario represents the parallel-trends-in-trends assumption (Assumption 3), where unmeasured confounding varies linearly over time. Here, we expect the double DID and the sequential DID to be unbiased, whereas the extended DID is biased.

For each of the two scenarios, we consider balanced panel data with $n$ units and five time periods, where treatments are assigned at the last time period. We vary the number of units ($n$) from 100 to 1000 and evaluate the quality of estimators by absolute bias and standard errors over 2000 Monte Carlo simulations. We describe the details of the simulation setup next.

F.1 Simulation Design
We consider balanced panel data with $T = 5$ ($t = \{0, 1, 2, 3, 4\}$), where the last period ($t = 4$) is treated as the post-treatment period. We vary the number of units at each time period as $n \in \{100, 250, 500, 1000\}$. Thus, the total number of observations is $nT \in \{500, 1250, 2500, 5000\}$. We compare three estimators: the double DID, the extended DID, and the sequential DID.

Note that we consider four pre-treatment periods here, and thus the generalized double DID is not equal to the sequential DID even under the parallel trends-in-trends assumption, because it combines two other moments and optimally weights them (see Appendix C). The equivalence between the sequential DID and the double DID holds only when there are two pre-treatment periods. We see below that the generalized double DID improves upon the sequential DID even under the parallel trends-in-trends assumption, as it optimally weights observations from different time periods.

We study two scenarios: one under the extended parallel trends assumption (Assumption 2) and the other under the parallel-trends-in-trends assumption (Assumption 3). In the first scenario, the difference between potential outcomes under control, $E[Y_{it}(0) \mid G_i = 1] - E[Y_{it}(0) \mid G_i = 0]$, is constant over time. In particular, we set

E[Y_it(0) | G_i = g] = α_t + 0 . × g (19)

where ( α , α , α , α , α ) = (1 , , , , . In the second scenario, we allow for linear time-varying confounding. In particular, we set

E[Y_it(0) | G_i = g] = α_t + 0 . × g × (t + 1) (20)

where ( α , α , α , α , α ) = (1 , , , , .

The potential outcome under control is $Y_{it}(0) = E[Y_{it}(0) \mid G_i] + \epsilon_{it}$, where $\epsilon_{it}$ follows the AR(1) process with autocorrelation parameter $\rho$. That is,

(cid:15) it = ρ(cid:15) i,t − + ξ it ,(cid:15) i = N (0 , / (1 − ρ )) ,ξ it = N (0 , .

The causal effect is denoted by $\tau$ and thus $Y_{it}(1) = \tau + Y_{it}(0)$, where we set τ = 0 . . Finally, $Y_{it} = Y_{it}(0)$ for $t \leq 3$ (pre-treatment periods) and $Y_{it} = G_i Y_{it}(1) + (1 - G_i) Y_{it}(0)$ for $t = 4$ (post-treatment period).
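A data-generating process of this form can be sketched in Python as below. Every numeric value passed to the function in the usage note (the time effects `alpha`, the confounding coefficient `gap`, the effect `tau`, and `rho`) is an illustrative placeholder rather than a value used in the paper. The sketch further assumes unit-variance innovations with a stationary initial error, and equal-sized treatment and control groups. The function names are hypothetical.

```python
import numpy as np


def simulate_panel(n, alpha, gap, tau, rho, linear_trend=False, seed=0):
    """Simulate a balanced panel whose last period is post-treatment.

    Scenario 1 (linear_trend=False): group gap is constant over time.
    Scenario 2 (linear_trend=True):  group gap grows as gap * (t + 1).
    Assumptions: innovations are N(0, 1) and the initial error is drawn
    from the stationary AR(1) distribution N(0, 1 / (1 - rho**2)).
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    n_periods = len(alpha)
    g = np.repeat([1, 0], [n // 2, n - n // 2])  # half treated, half control
    t = np.arange(n_periods)
    confounding = gap * (t + 1) if linear_trend else gap * np.ones(n_periods)
    mean_y0 = alpha[None, :] + g[:, None] * confounding[None, :]

    eps = np.empty((n, n_periods))
    eps[:, 0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - rho**2)), size=n)
    for period in range(1, n_periods):
        eps[:, period] = rho * eps[:, period - 1] + rng.normal(0.0, 1.0, size=n)

    y = mean_y0 + eps
    y[:, -1] += tau * g  # treated units receive the effect in the last period
    return y, g


def standard_did(y, g):
    """DID using the last pre-treatment period and the post-treatment period."""
    d = y[g == 1].mean(axis=0) - y[g == 0].mean(axis=0)
    return d[-1] - d[-2]


def sequential_did(y, g):
    """Standard DID minus the DID over the last two pre-treatment periods."""
    d = y[g == 1].mean(axis=0) - y[g == 0].mean(axis=0)
    return (d[-1] - d[-2]) - (d[-2] - d[-3])
```

Calling `simulate_panel(n=20000, alpha=[1, 2, 3, 4, 5], gap=0.5, tau=1.0, rho=0.5, linear_trend=True)` and comparing the two estimators shows the pattern the second scenario is designed to produce: the standard DID absorbs the one-period growth in confounding, while the sequential DID removes it.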
Half of the sample is in the treatment group ($G_i = 1$) and the other half is in the control group ($G_i = 0$).

In Figure 10, we set the autocorrelation parameter ρ = 0 . . This value is similar to the autocorrelation parameter used in the well-known simulation studies of Bertrand, Duflo and Mullainathan (2004) ( ρ = 0 . ). We pick a smaller value to make our simulations harder, as we see below. In Figure 11, we also provide additional results where we consider a full range of the autocorrelation parameters $\rho \in \{0, 0.2, 0.4, 0.6, 0.8\}$ (the same positive autocorrelation values considered in Bertrand, Duflo and Mullainathan (2004)). Both figures show the absolute bias and the standard errors, which are defined as

$$\text{absolute bias} = \left| \frac{1}{M} \sum_{m=1}^{M} (\hat{\tau}_m - \tau) \right| \quad \text{and} \quad \text{standard error} = \sqrt{\frac{1}{M} \sum_{m=1}^{M} (\hat{\tau}_m - \tau)^2},$$

where $M$ is the total number of Monte Carlo iterations. Note that this standard error is a true standard error over the sampling distribution.

F.2 Results
Figure 10 shows the results for the value of the autocorrelation parameter $\rho$ set in Section F.1. To begin with the absolute bias, visualized in the first row, all estimators have little bias under the extended parallel trends assumption (Scenario 1), as expected from theoretical results. In contrast, under the parallel-trends-in-trends assumption (Scenario 2), the extended DID (white circle with dotted line) is biased, while the double DID (black circle with solid line) and the sequential DID (white triangle with dotted line) are unbiased.

The second row represents the standard errors of each estimator. Under the extended parallel trends assumption (the first column), the double DID estimator has the smallest standard error, smaller than the extended DID estimator (i.e., the two-way fixed effects estimator). This efficiency gain comes from the fact that the double DID uses the GMM framework to optimally weight observations from different time periods, whereas the two-way fixed effects estimator gives equal weight to all pre-treatment periods.

Under the parallel trends-in-trends assumption (the second row; the second column), the double DID has almost the same standard error as the sequential DID. This shows that the double DID changes weights according to scenarios and solves a practical dilemma of the sequential DID: it is unbiased under the weaker assumption of parallel trends-in-trends, but not efficient under the extended parallel trends.

[Figure 10 (plot omitted). Columns: Scenario 1 (Extended Parallel Trends), Scenario 2 (Parallel Trends-in-Trends). Rows: Absolute Bias, Standard Errors. Horizontal axis: Sample Size (100, 250, 500, 1000). Series: Extended DID, Sequential DID, Double DID.]

Figure 10: Comparing DID estimators in terms of the absolute bias and the standard errors. The first row shows that the double DID estimator (black circle with solid line) is unbiased under both scenarios. The second row demonstrates that the double DID has the smallest standard errors among unbiased DID estimators.

In Figure 11, we provide additional results where we consider a full range of the autocorrelation parameters $\rho \in \{0, 0.2, 0.4, 0.6, 0.8\}$ (the same positive autocorrelation values considered in Bertrand, Duflo and Mullainathan (2004)). We find that when the autocorrelation of errors is small, standard errors of the double DID are smaller than those of the sequential DID even under the parallel trends-in-trends assumption.

The first row of Figure 11 shows that our results on the (absolute) bias do not change with the autocorrelation of errors. In particular, the double DID is unbiased under the extended parallel trends assumption (the first column) or under the parallel trends-in-trends assumption (the second column). In terms of the standard errors (the second row), two results are important. First, under the extended parallel trends assumption (the first column), the standard errors of the double DID are the smallest for all values of $\rho$, and the efficiency gain relative to the extended DID (i.e., the two-way fixed effects estimator) is large when the autocorrelation is high (i.e., $\rho$ is large). Second, under the parallel trends-in-trends assumption (the second column), the standard errors of the double DID are the smallest among unbiased DID estimators (the extended DID is biased). The efficiency gain relative to the sequential DID is large when $\rho$ is small.
[Figure 11 (plot omitted). Columns: Scenario 1 (Extended Parallel Trends), Scenario 2 (Parallel Trends-in-Trends). Rows: Absolute Bias, Standard Errors. Horizontal axis: Autocorrelation (0, 0.2, 0.4, 0.6, 0.8). Series: Extended DID, Sequential DID, Double DID.]

Figure 11: Comparing DID estimators in terms of the absolute bias and the standard errors according to the autocorrelation of errors. Note: The first row shows that the double DID estimator (black circle with solid line) is unbiased under both scenarios. The second row demonstrates that the double DID has the smallest standard errors among unbiased DID estimators. Under the extended parallel trends assumption (the first column), the efficiency gain relative to the extended DID (i.e., two-way fixed effects estimator) is large when the autocorrelation parameter $\rho$ is large. Under the parallel trends-in-trends assumption (the second column), the efficiency gain relative to the sequential DID is large when $\rho$ is small.

Empirical Application
In Section 6, we focused on three outcomes to illustrate the advantage of the double DID estimator. In this section, we provide results for all thirty outcomes analyzed in the original paper.

To assess the underlying parallel trends assumptions, we combine visualization and formal tests, as recommended in the main text. The assessment suggests that we can make the extended parallel trends assumption for fifteen outcomes. Specifically, for those fifteen outcomes, p-values for the null of pre-treatment parallel trends are above 0.10 (i.e., we fail to reject the null at the conventional level), and the 95% standardized equivalence confidence interval is contained in the interval $[-0.2, 0.2]$. This means that the deviation from parallel trends in the pre-treatment periods is less than 0.2 standard deviations of the control mean in 2006.

Figure 12 shows estimated treatment effects under the extended parallel trends assumption. As in Section 6, the double DID estimates are similar to those from the standard DID, and yet standard errors are smaller because the double DID effectively uses pre-treatment periods within the GMM. Here, we only have two pre-treatment periods, but when there are more pre-treatment periods, the efficiency gain of the double DID can be even larger.

We rely on the parallel trends-in-trends assumption for eight of the fifteen remaining outcomes. These outcomes have a 95% standardized equivalence confidence interval wider than $[-0.2, 0.2]$, but the pre-treatment trends of the treatment and control groups have the same sign. The same sign of the pre-treatment trends suggests that the parallel trends-in-trends assumption, which can account for a linear time-varying unmeasured confounder, can be plausible for these outcomes, even though the stronger parallel trends assumption is possibly violated.

Figure 13 shows results under the parallel trends-in-trends assumption.
As in Section 6, the double DID estimates are often different from those of the standard DID because the extended parallel trends assumption is implausible for these outcomes. Importantly, standard errors of the double DID are often larger than those of the standard DID. This is because the double DID needs to adjust for biases in the standard DID by using pre-treatment trends.

For the remaining seven outcomes, for which the pre-treatment trends of the treatment and control groups have opposite signs, it is difficult to justify either the extended parallel trends or the parallel trends-in-trends assumption without additional information. Thus, there is no credible estimator for the ATT without making stronger assumptions. When there are more than two pre-treatment periods, researchers can apply the sequential DID estimator to pre-treatment periods in order to formally assess the extended parallel trends-in-trends assumption. We emphasize that, although we use the equivalence range of $[-0.2, 0.2]$ as a cutoff for illustration, we recommend basing this decision on substantive domain knowledge whenever possible in practice.

[Figure 12 (plot omitted). Title: Estimates under Extended Parallel Trends Assumption. Vertical axis: ATT estimates with confidence intervals. Each panel compares DID and Double-DID. Panels: Education and Cultural Program; Irrigation Plants; Market or Inter-commune Market; Paved Road; Periodic Market; Post Office; Prop. Households w/ Agricultural Extension; Prop. Households w/ Supported Credit; Prop. Households w/ Supported Healthcare; Prop. Households w/ Supported Tuition; Radio Broadcast; Socio-Dev't/Infra. Project; Staff to Cure Animal; Upper Secondary School; Village w/ Post Office.]
Figure 12: Comparing Standard DID and Double DID under Extended Parallel Trends Assumption. The double DID estimates are similar to those from the standard DID, and yet, standard errors are smaller because the double DID effectively uses pre-treatment periods within the GMM.

[Figure 13 (plot omitted). Title: Estimates under Parallel Trends-in-Trends. Vertical axis: ATT estimates with confidence intervals. Each panel compares DID and Double-DID. Panels: Daily Market; Nonfarm Business; Prop. Households w/ Business Tax Exemption; Prop. Households w/ Supported Crop; Public Health Project; Public Transport; Staff to Support Crops; Tap Water.]