Inference in Differences-in-Differences: How Much Should We Trust in Independent Clusters?
Bruno Ferman†
Sao Paulo School of Economics - FGV
First Draft: May 8th, 2019
This Draft: September 1st, 2020
Abstract
We analyze the conditions in which ignoring spatial correlation is problematic for inference in differences-in-differences models. We show that the relevance of spatial correlation for inference (when it is ignored) depends on the amount of spatial correlation that remains after we control for the time- and group-invariant unobservables. As a consequence, details such as the time frame used in the estimation, and the choice of the estimator, will be key determinants on the degree of distortions we should expect when spatial correlation is ignored. Simulations with real datasets corroborate these conclusions. These findings provide a better understanding on when spatial correlation should be more problematic, and provide important guidelines on how to minimize inference problems due to spatial correlation.
Keywords: spatial correlation; clustered standard errors; two-way clustered standard errors
JEL Codes: C12; C21; C23; C33

∗ I would like to thank Xavier D’Haultfoeuille, Vitor Possebom, Pedro Sant’Anna, Jon Roth, Elie Tamer and Matthew Webb for comments and suggestions. Luis Alvarez, Lucas Barros, and Flavio Riva provided exceptional research assistance.
† email: [email protected]; address: Sao Paulo School of Economics, FGV, Rua Itapeva no. 474, Sao Paulo - Brazil, 01332-000; telephone number: +55 11 3799-3350

Introduction
Differences-in-Differences (DID) is one of the most widely used methods for identification of causal effects in social sciences. However, inference in DID models can be complicated by both serial and spatial correlations. After an influential paper by Bertrand et al. (2004), showing that serial correlation can lead to severe over-rejection in DID applications if not taken into account, most papers applying DID models use inference methods that are robust to arbitrary forms of serial correlation. In contrast, most of these papers do not take spatial correlation into account. Barrios et al. (2012) show that, under some conditions, ignoring spatial correlation is not a problem for inference when treatment is randomly assigned at the cluster level. However, random assignment may not be plausible in many DID empirical applications. In this paper, we consider the consequences of ignoring spatial correlation in DID models when treatment is possibly not randomly assigned.

We consider a setting in which the spatial correlation follows a linear factor model, which, as we show below, allows for a rich variety of spatial correlation structures and relaxes the usual assumption that errors are independent across clusters. The main insight is that the relevant spatial correlation for DID models reflects the spatial correlation of unobserved variables that affect the outcome variable after we control for the time- and group-invariant unobservables. As a consequence, we show in Section 2 that, when we consider the two-way fixed effect (TWFE) estimator, inference ignoring spatial correlation becomes more problematic when both (i) the second moment of the difference between the pre- and post-treatment averages of common factors is large relative to the variance of the same difference for the idiosyncratic shocks, and (ii) the distribution of factor loadings has different expected values for treated and control groups.
When at least one of these conditions does not hold, [...]

Footnote: The importance of clustering at a group level to take serial correlation into account had been previously noted by, for example, Arellano (1987). However, Bertrand et al. (2004) show that such strategies had not been widely incorporated in DID applications.
Footnote: More generally, we can think of such linear factor model as an approximation to more complex models for the spatial correlation.
Footnote: A contemporaneous paper by Kelly (2019) shows that spatial correlation may lead to over-rejection in “persistence” regressions. Our papers differ in that we focus on the conditions in which spatial correlation generates problems in DID models.

[...] both the distance between the pre- and post-treatment periods is large, and when the treated and control groups are very different. Therefore, considering shorter time ranges and/or more similar control groups limits distortions on inference due to spatial correlation problems, on top of potentially making the parallel trends assumption more plausible. We also show that the relevance of the spatial correlation depends crucially on the estimator used to estimate the treatment effects. These results are consistent with the conclusions from the spatial correlation model we analyze in Section 2, and suggest that this structure provides a good approximation for real datasets like the ACS and the CPS.

In Section 4.1, we show that it is not possible to properly address the problem of serial and spatial correlation, unless we impose strong assumptions on the errors in at least one dimension. If relying on methods based on independent clusters is the only option, based on the conclusions from Sections 2 and 3, we present in Section 4.2 recommendations for applied researchers on how to minimize the relevance of spatial correlation. Overall, our main conclusion is that the relevance of spatial correlation for inference (when it is ignored) depends on the amount of spatial correlation that remains after we control for the time- and group-invariant unobservables. As a consequence, details such as the time frame used in the estimation, the choice of the control groups, and the choice of the estimator, will be key determinants of the degree of distortions we should expect when spatial correlation is ignored. Given this insight, we can have a better understanding of when spatial correlation should be more problematic, and we also provide recommendations on how to minimize this problem in empirical applications. Section 5 concludes.
We start by presenting in Section 2.1 a very simple DID model to highlight the importance of the assumption of independent clusters, which is commonly made for inference in the DID setting. Then we present in Section 2.2 the main insight of the paper: the relevant spatial correlation for DID models depends on the amount of spatial correlation that remains after we control for the time- and group-invariant unobservables. Therefore, features such as the time frame used in the estimation may affect the degree to which spatial correlation affects inference. We consider the case of the TWFE estimator in Sections 2.1 and 2.2, and then we consider the first-difference estimator and the estimators proposed by de Chaisemartin and D’Haultfoeuille (2018), de Chaisemartin and D’Haultfoeuille (2020) and Callaway and Sant’Anna (2018) as alternative estimators in Section 2.3.
We start by considering a standard model for the potential outcomes. Let $Y_{jt}(0)$ ($Y_{jt}(1)$) be the potential outcome of group $j$ at time $t$ when this group is untreated (treated) at this period. We consider first that potential outcomes are given by

$$Y_{jt}(0) = \theta_j + \gamma_t + \eta_{jt}, \qquad Y_{jt}(1) = \alpha_{jt} + Y_{jt}(0), \tag{1}$$

where $\theta_j$ and $\gamma_t$ are, respectively, time-invariant and group-invariant unobserved variables, while $\eta_{jt}$ represents unobserved variables that may vary at both dimensions. Importantly, we do not impose any restriction on the serial and spatial correlations of $\eta_{jt}$, so this is a very general model; $\alpha_{jt}$ is the (possibly heterogeneous) treatment effect on group $j$ at time $t$.

This leads to a standard DID model for the observed outcomes $Y_{jt} = d_{jt} Y_{jt}(1) + (1 - d_{jt}) Y_{jt}(0)$ given by

$$Y_{jt} = \alpha d_{jt} + \theta_j + \gamma_t + \tilde\eta_{jt}, \tag{2}$$

where $\tilde\eta_{jt} = \eta_{jt} + (\alpha_{jt} - \alpha) d_{jt}$, and $d_{jt}$ is an indicator variable equal to one if group $j$ is treated at time $t$, and zero otherwise. The parameter $\alpha$ is defined as the TWFE estimand in equation (2). While recent papers consider settings with heterogeneous treatment effects in which the TWFE estimand may be a weighted average of the treatment effects with negative weights, if treatment starts for all treated groups at the same period and we consider an aggregate group × time regression, then $\alpha$ would be the average of the expected values of the treatment effects across treated groups and post-treatment periods. We consider in this section the properties of the DID estimator of $\alpha$ using a TWFE regression over a repeated sampling framework on $\tilde\eta_{jt}$. While we focus on the TWFE estimator in Sections 2.1 and 2.2, in Section 2.3 we also consider the implications of our findings for the estimators proposed by de Chaisemartin and D’Haultfoeuille (2018) (CH1), de Chaisemartin and D’Haultfoeuille (2020) (CH2), and Callaway and Sant’Anna (2018) (CS), which would be more suitable for settings in which treatment effects are heterogeneous and there is variation in treatment timing.

Footnote: See, for example, de Chaisemartin and D’Haultfoeuille (2018), Callaway and Sant’Anna (2018), Athey and Imbens (2018), and Goodman-Bacon (2018).
Footnote: We can think of that as a “super-population” setting. In contrast, Abadie et al. (2020), Abadie et al. (2017), Barrios et al. (2012) and Rambachan and Roth (2020) consider design-based uncertainty, where Rambachan and Roth (2020) consider specifically the DID setting. We show in Appendix A.4 that our main conclusions remain valid if we consider design-based uncertainty. We recommend reading Appendix A.4 only after Section 2.2.

There are $N_1$ treated groups, $N_0$ control groups, and $T$ time periods. For simplicity, we assume that $d_{jt}$ changes to 1 for all treated groups starting after date $t^*$, and define a dummy variable $D_j$ equal to one if group $j$ is treated. Let $\mathcal{I}_1$ ($\mathcal{I}_0$) be the set of indices for treated (control) groups, and let $\mathcal{T}_1$ ($\mathcal{T}_0$) be the set of indices for post- (pre-) treatment periods. For a generic variable $A_t$, define $\nabla A = \frac{1}{T - t^*} \sum_{t \in \mathcal{T}_1} A_t - \frac{1}{t^*} \sum_{t \in \mathcal{T}_0} A_t$. In particular, following Ferman and Pinto (2019), we consider $W_j = \nabla \tilde\eta_j$, which is the post-pre difference in average errors for each group $j$.

In this simpler case in which treatment starts at the same period for all treated groups, the DID estimator is numerically equivalent to the TWFE estimator of $\alpha$, which is given by

$$\hat\alpha = \frac{1}{N_1} \sum_{j \in \mathcal{I}_1} \nabla Y_j - \frac{1}{N_0} \sum_{j \in \mathcal{I}_0} \nabla Y_j = \alpha + \frac{1}{N_1} \sum_{j \in \mathcal{I}_1} W_j - \frac{1}{N_0} \sum_{j \in \mathcal{I}_0} W_j. \tag{3}$$

Let $\mathbf{D} = (D_1, ..., D_N)$. In this section, we consider a repeated sampling framework over the distribution of $\{W_j\}_{j \in \mathcal{I}_1 \cup \mathcal{I}_0}$, conditional on $\mathbf{D} = \mathbf{d}$. For now, we do not make any restriction on the dependence between $W_j$ and $W_{j'}$. Moreover, we allow for different distributions for $W_j$ depending on whether $j \in \mathcal{I}_1$ or $j \in \mathcal{I}_0$.

In this setting, if we have $E[W_j | \mathbf{D} = \mathbf{d}] = 0$ for all $j$, then the DID estimator $\hat\alpha$ will be unbiased for $\alpha$, regardless of the assumptions on the serial and spatial correlations of $\tilde\eta_{jt}$. However, inference in DID models is only possible if we impose assumptions on either the serial or the spatial correlation of $\tilde\eta_{jt}$.
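The numerical form of equation (3) can be sketched in a few lines. This is only an illustration, assuming a balanced group × time panel stored as a NumPy array; the function names `nabla` and `did_estimate` and the noiseless toy data are hypothetical and do not come from the paper.

```python
import numpy as np

def nabla(A, t_star):
    """Post-minus-pre average for each row of A (groups x periods).

    Columns 0..t_star-1 are treated as pre-treatment periods and
    columns t_star..T-1 as post-treatment periods.
    """
    A = np.asarray(A, dtype=float)
    return A[:, t_star:].mean(axis=1) - A[:, :t_star].mean(axis=1)

def did_estimate(Y, D, t_star):
    """Equation (3): average of nabla Y over treated minus average over controls."""
    D = np.asarray(D, dtype=bool)
    dY = nabla(Y, t_star)
    return dY[D].mean() - dY[~D].mean()

# Toy data (assumed for illustration): 2 treated and 2 control groups,
# 2 pre and 2 post periods, no noise, homogeneous effect alpha = 2.
theta = np.array([1.0, 3.0, 0.0, 5.0])   # group effects
gamma = np.array([0.0, 1.0, 4.0, 2.0])   # time effects
D = np.array([1, 1, 0, 0])
t_star = 2
post = (np.arange(4) >= t_star).astype(float)
Y = theta[:, None] + gamma[None, :] + 2.0 * D[:, None] * post[None, :]
alpha_hat = did_estimate(Y, D, t_star)
```

Because the group and time effects difference out exactly, `alpha_hat` recovers the effect of 2 in this noiseless example; with errors, the estimate differs from $\alpha$ by the difference in average $W_j$ between treated and control groups.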
Most commonly, inference methods for DID models do not impose restrictions on the correlation of $\tilde\eta_{jt}$ across time, which is captured by this linear combination of the errors, $W_j$, but assume that $\tilde\eta_{jt}$ are independent across $j$.

Footnote: See, for example, Arellano (1987), Bertrand et al. (2004), Cameron et al. (2008), Brewer et al. (2017), Conley and Taber (2011), Ferman and Pinto (2019), Canay et al. (2017), and MacKinnon and Webb (2019).

The standard approach when independence across $j$ is assumed is to rely on the cluster-robust variance estimator (CRVE), clustering at the group level. In this case, up to a degrees-of-freedom correction, the CRVE is given by

$$\widehat{var}(\hat\alpha)_{Cluster} = \left[\frac{1}{N_1}\right]^2 \sum_{j \in \mathcal{I}_1} \widehat{W}_j^2 + \left[\frac{1}{N_0}\right]^2 \sum_{j \in \mathcal{I}_0} \widehat{W}_j^2, \tag{4}$$

where $\widehat{W}_j = \nabla \hat{\tilde\eta}_j$, which is a linear combination of the residuals of the TWFE regression, $\hat{\tilde\eta}_{jt}$. Assuming independence across $j$, the CRVE provides asymptotically valid inference when $N_1, N_0 \to \infty$. If $W_j$ is correlated across $j$, however, then not taking such spatial correlation into account can lead to severe underestimation of the true standard error, resulting in over-rejection. The intuition is the following. Imagine there is an unobserved variable in $W_j$ that equally affects all treated groups, but does not affect the control groups. If the null $H_0: \alpha = 0$ is true, then, from equation (3), we have that $\hat\alpha = \frac{1}{N_1} \sum_{j \in \mathcal{I}_1} W_j - \frac{1}{N_0} \sum_{j \in \mathcal{I}_0} W_j$. Therefore, under the null, finding a “large” value for $\hat\alpha$ would only be possible if many of those $W_j$ for $j \in \mathcal{I}_1$ were positive.
If we (mistakenly) assume that the $W_j$ are all independent, we would attribute a much lower probability to such an event than when we take into account that those $W_j$’s might be correlated, leading to over-rejection.

Footnote: We assume that the expected value of this variable is equal to zero conditional on $\mathbf{d}$, so that the presence of such correlated shock does not affect the identification assumption of the DID model.
Footnote: Or when many of those $W_j$ for $j \in \mathcal{I}_0$ are negative.

When the assumption that $\tilde\eta_{jt}$ is independent across $j$ is relaxed, there are some alternatives for inference, but these alternatives often assume that there is a distance metric across groups, impose assumptions on the serial correlation, and/or rely on more data. One important case in which spatial correlation does not generate problems for inference even when it is ignored is when treatment is randomly assigned at the cluster level (Barrios et al., 2012). Since random assignment is generally not a reasonable assumption for DID applications, we focus on the case in which treatment may not necessarily be randomly assigned.

Footnote: For example, Kim and Sun (2013), Conley and Taber (2011) (in their online appendix A.3), and Bester et al. (2011) rely on distance measures across groups. Adão et al. (2019) show that spatial correlation leads to over-rejection in shift-share designs, and propose an inference method that is asymptotically valid when there are many shifters. This method, however, does not apply in more general settings. Other papers exploit the time dimension to perform inference in the presence of spatially correlated shocks. However, these methods rely on a large number of periods. For example, Vogelsang (2012), Ferman and Pinto (2019) (Section 4) and Chernozhukov et al. (2019) present inference methods that work with arbitrary spatial correlation when the number of periods goes to infinity, while Dailey (2017) proposes the use of randomization inference using long series of past data when the explanatory variable is rainfall data.
Moreover, Ferman (2020) shows that some inference methods designed to work when there are few treated and many control groups (Conley and Taber (2011) and Ferman and Pinto (2019)) remain valid when there is a single treated group even when there is spatial correlation. When there is more than one treated group, these tests may lead to over-rejection, and he proposes a conservative test. However, these conclusions rely on a strong mixing condition for the spatial correlation, and are only valid for settings with few treated and many control groups.
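The over-rejection intuition of this section can be illustrated with a small Monte Carlo sketch. Everything here is an assumption made for illustration: the function name `crve_rejection_rate`, the normal distributions, the sample sizes, and the 1.96 critical value are hypothetical choices, not the paper's simulation design. The common shock hits all treated groups equally and no control group, and the CRVE follows equation (4) without a degrees-of-freedom correction.

```python
import numpy as np

def crve_rejection_rate(n1, n0, shock_sd, reps=2000, seed=0):
    """Share of 5%-level rejections of H0: alpha = 0 when the null is true.

    W_j = common + e_j for treated groups and W_j = e_j for controls,
    where `common` is one mean-zero draw shared by all treated groups.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        common = rng.normal(0.0, shock_sd)      # shock shared by treated groups
        w1 = common + rng.normal(size=n1)       # treated W_j
        w0 = rng.normal(size=n0)                # control W_j
        alpha_hat = w1.mean() - w0.mean()       # equation (3) with alpha = 0
        r1 = w1 - w1.mean()                     # residuals within each arm
        r0 = w0 - w0.mean()
        var_cluster = (r1 ** 2).sum() / n1 ** 2 + (r0 ** 2).sum() / n0 ** 2
        rejections += abs(alpha_hat / np.sqrt(var_cluster)) > 1.96
    return rejections / reps

rate_independent = crve_rejection_rate(50, 50, shock_sd=0.0)
rate_correlated = crve_rejection_rate(50, 50, shock_sd=0.5)
```

Under these assumptions, `rate_independent` should stay close to the nominal 5% level, while `rate_correlated` should be far above it, because the shared shock moves $\hat\alpha$ but is absorbed by the within-arm residuals used in the CRVE.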
The main insight in this paper is to show that the relevant spatial correlation for the TWFE depends on the unobserved variables that remain after we control for the group and year fixed effects (see also Section 2.3 for alternative estimators). To analyze this idea, we consider a model in which potential outcomes follow a linear factor model, and derive the $W_j$ that is implied when we consider such underlying model. More generally, we can think of such linear factor model as an approximation to more complex models for the spatial correlation. Importantly, the model we consider is more general than the usual models considered in DID settings, in that we allow for spatial correlation in the time-varying unobservables. Consider

$$Y_{jt}(0) = \theta_j + \gamma_t + \lambda_t \mu_j + \epsilon_{jt}, \qquad Y_{jt}(1) = \alpha_{jt} + Y_{jt}(0), \tag{5}$$

where $\lambda_t$ is a $(1 \times F)$ vector of common shocks, while $\mu_j$ is an $(F \times 1)$ vector of factor loadings that determines how group $j$ is affected by the common shocks $\lambda_t$. While $\theta_j$ and $\gamma_t$ could have been included as components of $\mu_j$ and $\lambda_t$, we consider them separately to highlight that we can still have time-invariant and group-invariant shocks as in the standard DID models, so what we add is the possibility of other spatially correlated shocks that are captured by $\lambda_t \mu_j$. Moreover, this allows us to think about $\lambda_t$ and $\mu_j$ as the common shocks and factor loadings that are not time- or group-invariant. This is appropriate since the assumptions we need on $\lambda_t$ and $\mu_j$ are not necessary for the time- or group-invariant shocks. This structure leads to the same model as in equation (2), but with a structure on the errors $\tilde\eta_{jt} = \lambda_t \mu_j + \epsilon_{jt} + (\alpha_{jt} - \alpha) d_{jt}$.

Footnote: While Barrios et al. (2012) show this result in a cross-section model, in this case in which all treated groups start treatment at the same period, the DID model can be re-written as a cross-section model where each observation $j$ is the difference between the post- and pre-treatment means. Their results are limited to the case of equal-sized clusters. However, as they argue, we should expect these results to provide useful guidance for cases in which variation in cluster sizes is modest.

Such structure allows for a rich variety of spatial correlation structures. For example, we can think of $F$ points in $\mathbb{R}^d$, $(c_1, ..., c_F)$, and group $j$ located at point $a_j \in \mathbb{R}^d$. In this case, the $f$-th entry of $\mu_j$ could be a decreasing function of the distance between $c_f$ and $a_j$. This would capture the notion that groups that are closer in some distance dimension in $\mathbb{R}^d$ would be more correlated than groups that are further apart. We can also consider the case of $N$ municipalities divided into $F$ states, where there are relevant state-level shocks. If municipality $j$ belongs to state $f$, we could model that by setting the $f$-th entry of $\mu_j$ equal to one and all other entries equal to zero. On top of that we could also have other correlated shocks, generating a richer spatial correlation structure. For example, we can think of specific shocks depending on whether group $j$ is coastal or inland, or on other variables that the researcher may or may not observe.
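The two constructions of factor loadings described above can be sketched as follows. This is a hedged illustration only: the function names, the choice of $\exp(-\text{distance})$ as the decreasing function, and the toy inputs are assumptions introduced here, not specifications from the paper.

```python
import numpy as np

def distance_loadings(group_locs, factor_locs):
    """mu[j, f] = exp(-||a_j - c_f||), one possible decreasing function of distance."""
    a = np.asarray(group_locs, dtype=float)[:, None, :]    # N x 1 x d
    c = np.asarray(factor_locs, dtype=float)[None, :, :]   # 1 x F x d
    return np.exp(-np.linalg.norm(a - c, axis=2))          # N x F

def state_loadings(state_of_group, n_states):
    """mu[j, f] = 1 if municipality j belongs to state f, and 0 otherwise."""
    n = len(state_of_group)
    mu = np.zeros((n, n_states))
    mu[np.arange(n), state_of_group] = 1.0
    return mu

# Common component lambda_t mu_j for T = 5 periods and F = 3 state-level factors.
rng = np.random.default_rng(0)
mu_state = state_loadings([0, 0, 1, 2], n_states=3)   # 4 municipalities
lam = rng.normal(size=(5, 3))                         # T x F common shocks
common = lam @ mu_state.T                             # T x N matrix of lambda_t mu_j
```

In this sketch, municipalities 0 and 1 share a state and therefore receive an identical common component in every period, which is exactly the cross-group dependence that clustering at the municipality level would ignore.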
Moreover, we can allow for the possibility that there are shocks that only affect the treated and shocks that only affect the control groups by setting a factor loading that is one only for treated groups and another one that is one only for control groups. While we focus in this section on the case in which the dimension $F$ is fixed, we consider in Appendix A.2 a setting in which the dimension $F$ may increase with $N$. This allows us to model, for example, settings in which the spatial correlation is strongly mixing, as studied by Ferman (2020) in the context of DID models. Throughout, we assume that the researcher considers the possibility of spatially correlated shocks, but does not have information on (or is not willing to impose a structure on) the determinants of the spatial correlation.

We assume that all spatial correlation is captured by this linear factor structure, so that $\epsilon_{jt}$ is independent across $j$. We do allow, however, for arbitrary serial correlation in both $\epsilon_{jt}$ and $\lambda_t$. To simplify notation, we fix that treated groups start treatment after $t^*$, and let $D_j = 1$ if group $j$ is treated, and 0 otherwise. We consider the distribution of the DID estimator, and inference on the parameter $\alpha$, based on a repeated sampling framework over the distributions of $D_j$, $\lambda_t$, $\mu_j$, $\epsilon_{jt}$ and $\alpha_{jt}$. Importantly, we allow the distributions of $\mu_j$ and $\epsilon_{jt}$ to depend on whether $j$ is treated or control (in particular, this allows for heteroskedasticity in the model). Likewise, the distributions of $\lambda_t$ and $\epsilon_{jt}$ may differ depending on whether $t$ is pre- or post-treatment. We consider the following assumptions on the random variables in the model.

Assumption 2.1 (sampling)
We observe a sample $\{Y_{j1}, ..., Y_{jT}, D_j\}_{j=1}^{N}$, where $Y_{jt} = D_j Y_{jt}(1) + (1 - D_j) Y_{jt}(0)$ if $t > t^*$, and $Y_{jt} = Y_{jt}(0)$ otherwise. Potential outcomes are determined by equation (5). We also have that $\{D_j, \mu_j, \epsilon_{j1}, ..., \epsilon_{jT}, \alpha_{j t^*+1}, ..., \alpha_{jT}\}_{j=1}^{N}$ is iid, and independent of $\{\lambda_t\}_{t=1}^{T}$. $E[D_j] = c \in (0, 1)$.

This assumption allows for arbitrary dependence between $D_j$ and the factor loadings $\mu_j$. It also allows for arbitrary dependence between $D_j$ and the idiosyncratic shocks $(\epsilon_{j1}, ..., \epsilon_{jT})$. In particular, it allows for heteroskedasticity with respect to treatment assignment. Since we are fixing that treatment starts for the treated groups after $t^*$, we allow for the distribution of $\lambda_t$ to differ depending on whether we are in a pre- or post-treatment period. We consider later the implications of relaxing the condition that $\mu_j$ is iid. We also assume that treatment effects $\alpha_{jt}$ are independent across $j$. Relaxing this assumption to allow for spatial correlation in $\alpha_{jt}$ would only add an additional spatial correlation problem without modifying our main conclusions.

Footnote: We show in Appendix A.4 that our main conclusions remain valid if we consider design-based uncertainty. See also Footnote 5.

Let $\mu^e = E[\mu_j]$, and $\mu^e_w = E[\mu_j | D_j = w]$, for $w \in \{0, 1\}$. In this case, we have that

$$\hat\alpha - \alpha = \frac{1}{N_1} \sum_{j \in \mathcal{I}_1} \left[ \nabla\lambda (\mu_j - \mu^e) + (\nabla\alpha_j - \alpha) + \nabla\epsilon_j \right] - \frac{1}{N_0} \sum_{j \in \mathcal{I}_0} \left[ \nabla\lambda (\mu_j - \mu^e) + \nabla\epsilon_j \right], \tag{6}$$

where, with some abuse of notation, $\nabla\alpha_j$ is the post-treatment average of $\alpha_{jt}$ across $t$. In this case, $\alpha$ is defined as $E[\nabla\alpha_j | D_j = 1]$. Therefore, the potential outcomes model (5) generates a DID model (2) such that $W_j = \nabla\lambda (\mu_j - \mu^e) + \nabla\epsilon_j + (\nabla\alpha_j - \alpha) D_j$, where $W_j$ is potentially correlated across $j$ due to the common shocks.

Since $E[\nabla\alpha_j | D_j = 1] = \alpha$, we have that $\hat\alpha$ is unbiased if $E[\nabla\lambda (\mu_j - \mu^e) + \nabla\epsilon_j | D_j = 1] = E[\nabla\lambda (\mu_j - \mu^e) + \nabla\epsilon_j | D_j = 0]$.
Assuming that $E[\nabla\epsilon_j | D_j] = 0$, we need that $E[\nabla\lambda](\mu^e_1 - \mu^e_0) = 0$. Note that this term is the sum of the expected value of $F$ terms, $\sum_{f=1}^{F} E[\nabla\lambda(f)](\mu^e_1(f) - \mu^e_0(f))$, where $v(f)$ is the $f$-th coordinate of vector $v$. If we do not take into account knife-edge cases in which elements of this sum cancel out, then $\hat\alpha$ is unbiased if, for each $f = 1, ..., F$, either one of two conditions hold. First, it may be that $E[\bar\lambda_{post}(f)] = E[\bar\lambda_{pre}(f)]$, so the first moment of the distribution of the common factor $f$ is stable in the pre- and post-treatment periods. In this case, even if treated and control groups are differentially affected by this common factor, this would not generate bias on the DID estimator over the distribution of $\lambda_t(f)$. Alternatively, it may be that $\mu^e_1(f) = \mu^e_0(f)$. In this case, even if the expected value of $\lambda_t(f)$ differs in the pre- and post-treatment periods, this common factor does not systematically affect treated groups differently relative to control groups, so the DID estimator is unbiased over the distribution of $\mu_j(f)$. Since the focus in this paper is on inference, we assume a condition such that the estimator is unbiased.

Assumption 2.2 (parallel trends) $E[\nabla\lambda](\mu^e_1 - \mu^e_0) = 0$ and $E[\nabla\epsilon_j | D_j] = 0$.

Overall, we can think that there are group- and/or time-invariant unobserved variables that may be arbitrarily correlated with treatment assignment, but the other common shocks are not correlated with treatment assignment once we condition on these fixed effects. Importantly, we do not see the addition of the linear factor model structure in model (5) as implying that the DID model is misspecified. This is similar to a setting in which we have many individuals per state, and we allow individual-level errors to be correlated within state (but not between states).
Such setting can be encompassed in a model similar to the one described in (5), where these common shocks that are state-specific generate problems for inference if we do not cluster at the state level, but do not make the TWFE estimator biased. In other words, the parallel trends assumption would be satisfied, because these time-varying unobserved shocks that are spatially correlated would not be correlated with treatment assignment. However, we would still have problems for inference if we do not take spatial correlation into account. The main difference is that we consider a setting in which we cannot restrict the spatial correlation to be contained within clusters, for a large number of clusters.

Footnote: The same line of reasoning remains valid if we consider other estimators to deal with the possibility of heterogeneous treatment effects.

We consider now under which conditions inference based on standard errors clustered at the group level is significantly affected by spatial correlation. As noted above, based on the results derived by Barrios et al. (2012), inference would still be reliable if treatment is randomly assigned at the cluster (in this case, group) level. However, this is generally a strong assumption in DID applications, so we focus on cases in which treatment may not be randomly assigned.

The potential problem in using the CRVE for inference is that $W_j = \nabla\lambda (\mu_j - \mu^e) + (\nabla\alpha_j - \alpha) D_j + \nabla\epsilon_j$ will generally be correlated across $j$ due to the common shocks. This formulation highlights the conditions in which spatially correlated shocks are more likely to generate problems for inference. The main problems are that, as $N \to \infty$, (i) $\hat\alpha$ may not be consistent and asymptotically normal, and (ii) CRVE may severely underestimate the true variance of $\hat\alpha$. For $w \in \{0, 1\}$, let $\sigma^2_\epsilon(w) = var(\nabla\epsilon_j + (\nabla\alpha_j - \alpha) D_j | D_j = w)$, and $\Sigma_\mu(w) = var(\mu_j | D_j = w)$. We present these conclusions in Proposition 2.1.

Proposition 2.1
Consider a setting in which potential outcomes follow equation (5), and treatment starts after period $t^*$. Assumptions 2.1 and 2.2 hold. Then, as $N \to \infty$,

$$\hat\alpha - \alpha = \nabla\lambda (\mu^e_1 - \mu^e_0) + o_p(1), \tag{7}$$

$$var(\hat\alpha) - \widehat{var}(\hat\alpha)_{Cluster} = (\mu^e_1 - \mu^e_0)' E\left[ (\nabla\lambda)' (\nabla\lambda) \right] (\mu^e_1 - \mu^e_0) + o_p(1), \tag{8}$$

and

$$\widehat{var}(\hat\alpha)_{Cluster} = o_p(1). \tag{9}$$

Proof. See details of the proof in Appendix A.1.1.

While $\hat\alpha$ is unbiased despite the spatial correlation, Proposition 2.1 shows that $\hat\alpha$ is not consistent and may not be asymptotically normal if $(\mu^e_1 - \mu^e_0)' E\left[ (\nabla\lambda)' (\nabla\lambda) \right] (\mu^e_1 - \mu^e_0) > 0$. Moreover, in this case the CRVE underestimates the true variance of $\hat\alpha$ by $(\mu^e_1 - \mu^e_0)' E\left[ (\nabla\lambda)' (\nabla\lambda) \right] (\mu^e_1 - \mu^e_0)$, potentially leading to severe over-rejection. This term is the variance of $\nabla\lambda (\mu^e_1 - \mu^e_0) = \sum_{f=1}^{F} \nabla\lambda(f) (\mu^e_1(f) - \mu^e_0(f))$.

Footnote: Since $(\nabla\lambda)\mu = (\nabla\lambda) A A^{-1} \mu$ for any full rank matrix $A$, we can normalize $\nabla\lambda$ so that its components are uncorrelated. Therefore, we can focus on the variance of each term of this sum separately.

If $\lambda_t(f)$ is serially positively correlated, with stronger dependence relative to the idiosyncratic shocks, then the shorter the distance between the initial and final periods, the smaller the variance of $\nabla\lambda(f) (\mu^e_1(f) - \mu^e_0(f))$ relative to the variance of $\nabla\epsilon_j$ for any given $(\mu^e_1(f) - \mu^e_0(f))$. See details in Appendix A.5. The intuition in this case is that the group fixed effects would absorb more of the relevant spatial correlation if we expect $\bar\lambda_{post}(f)$ to be similar to $\bar\lambda_{pre}(f)$. Likewise, if we fix the second moment of $\nabla\lambda(f)$, then the variance of $\nabla\lambda(f) (\mu^e_1(f) - \mu^e_0(f))$ will be smaller if $\mu^e_1(f) \approx \mu^e_0(f)$.
In this case, the year fixed effects would absorb most of the spatially correlated shocks that could make $\hat\alpha$ inconsistent and lead to such underestimation of the true variance. Therefore, considering a shorter time range, and/or more similar treated and control groups, can limit distortions on inference due to spatial correlation problems, on top of potentially making the parallel trends assumption more plausible.

Moreover, since $\widehat{var}(\hat\alpha)_{Cluster} = o_p(1)$, the variance of the t-statistic based on CRVE diverges if $(\mu^e_1 - \mu^e_0)' E\left[ (\nabla\lambda)' (\nabla\lambda) \right] (\mu^e_1 - \mu^e_0) > 0$ as $N \to \infty$. Therefore, even when the null is true, the probability of rejection would generally converge to one. We consider next a local-to-0 approximation in which the variance of $\nabla\lambda$ drifts to zero, so that the variance of the t-statistic does not diverge.

Assumption 2.3 (local-to-0 approximation) $\sqrt{N} (\nabla\lambda) = \xi$, where $E[\xi (\mu^e_1 - \mu^e_0)] = 0$ and $var(\xi (\mu^e_1 - \mu^e_0)) = (\mu^e_1 - \mu^e_0)' \Omega (\mu^e_1 - \mu^e_0)$.

Proposition 2.2
Consider a setting in which potential outcomes follow equation (5), and treatment starts after period $t^*$. Assumptions 2.1 to 2.3 hold. Then, as $N \to \infty$,

$$\sqrt{N} (\hat\alpha - \alpha) \overset{d}{\to} \xi (\mu^e_1 - \mu^e_0) + \frac{1}{\sqrt{c}} \sigma_\epsilon(1) Z_1 + \frac{1}{\sqrt{1 - c}} \sigma_\epsilon(0) Z_0, \tag{10}$$

where $Z_1$ and $Z_0$ are standard normal variables, and $\xi$, $Z_1$ and $Z_0$ are mutually independent. Moreover,

$$N \, \widehat{var}(\hat\alpha)_{Cluster} = \frac{1}{c} \sigma^2_\epsilon(1) + \frac{1}{1 - c} \sigma^2_\epsilon(0) + o_p(1). \tag{11}$$

Proof. See details of the proof in Appendix A.1.2.

Proposition 2.2 considers a case in which the variance of the common shocks converges to zero, so they are of the same order of magnitude as the average of the idiosyncratic shocks. While $\hat\alpha$ is consistent in this case, it may not be asymptotically normal, implying that a t-statistic may not be asymptotically standard normal even if we had a consistent estimator for the asymptotic variance. Moreover, the CRVE underestimates the asymptotic variance of $\sqrt{N} (\hat\alpha - \alpha)$ by $(\mu^e_1 - \mu^e_0)' \Omega (\mu^e_1 - \mu^e_0)$. Assuming $\xi$ is normally distributed, the t-statistic would be asymptotically normal, but we would underestimate the variance using CRVE.

Footnote: This is true whenever $\nabla\lambda$ has a continuous distribution. If $\nabla\lambda$ had a probability mass at zero, we would not have the probability of rejection converging to one. Moreover, this is valid when the distribution of $\nabla\lambda$ is fixed when $N$ increases. We consider next a case in which the variance of $\nabla\lambda$ goes to zero when $N \to \infty$.

Corollary 2.1
Consider the setting from Proposition 2.2, and assume further that $\xi (\mu^e_1 - \mu^e_0) \sim N(0, (\mu^e_1 - \mu^e_0)' \Omega (\mu^e_1 - \mu^e_0))$. Then, as $N \to \infty$,

$$t = \frac{\hat\alpha}{\sqrt{\widehat{var}(\hat\alpha)_{Cluster}}} \overset{d}{\to} N\left( 0, \; 1 + \frac{(\mu^e_1 - \mu^e_0)' \Omega (\mu^e_1 - \mu^e_0)}{\frac{1}{c} \sigma^2_\epsilon(1) + \frac{1}{1 - c} \sigma^2_\epsilon(0)} \right). \tag{12}$$

In this case, we have that the t-statistic using CRVE would be asymptotically normal, but we would have over-rejection because the asymptotic variance of the DID estimator would be underestimated by $(\mu^e_1 - \mu^e_0)' \Omega (\mu^e_1 - \mu^e_0)$. Spatial correlation will be less relevant when $var(\xi (\mu^e_1 - \mu^e_0))$ is smaller. Therefore, exactly the same conclusions from Proposition 2.1 about when spatial correlation problems are more relevant apply in this setting.

While we consider in Propositions 2.1 and 2.2 settings in which the number of factors is fixed, we find similar results in Appendix A.2, where we consider a model in which the number of factors increases with the number of groups. This allows us to consider settings in which the spatial correlation is strongly mixing. In this case, we show that $\hat\alpha$ is asymptotically normal, but CRVE underestimates the asymptotic variance of the estimator, implying that the t-statistic will have an asymptotically normal distribution, but with variance greater than one. We find the same conclusions regarding when such distortions will be more relevant as in Propositions 2.1 and 2.2. We also consider in Appendix A.3 the case in which $(\mu^e_1 - \mu^e_0)' E\left[ (\nabla\lambda)' (\nabla\lambda) \right] (\mu^e_1 - \mu^e_0) = 0$, but $var(\nabla\lambda)$ does not drift to zero. We show that, in this case, we may have additional problems for inference because the CRVE may depend on the realization of $\nabla\lambda$ even asymptotically. We may also have additional problems for inference in this case if $\mu_j$ is spatially correlated.
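As a worked illustration of Corollary 2.1, the limiting distribution of the t-statistic implies a rejection rate for a nominal 5% two-sided test that can be computed directly. The function names and the parameter values below are hypothetical choices made for this sketch; only the variance formula comes from the corollary.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def implied_rejection_rate(spatial_var, sigma2_treated, sigma2_control, c, crit=1.96):
    """Asymptotic rejection rate of |t| > crit under the null.

    spatial_var is the term (mu1 - mu0)' Omega (mu1 - mu0); the limit of
    N times the CRVE is sigma2_treated / c + sigma2_control / (1 - c).
    """
    crve_limit = sigma2_treated / c + sigma2_control / (1.0 - c)
    t_variance = 1.0 + spatial_var / crve_limit
    return 2.0 * (1.0 - normal_cdf(crit / sqrt(t_variance)))

# Illustrative values (assumed): half of the groups treated, unit variances.
size_no_spatial = implied_rejection_rate(0.0, 1.0, 1.0, c=0.5)
size_with_spatial = implied_rejection_rate(4.0, 1.0, 1.0, c=0.5)
```

With these assumed values, the t-statistic has asymptotic variance 2 in the second case, so a test that should reject about 5% of the time rejects roughly 17% of the time, while the first case stays at the nominal level.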
Again, all of these distortions become less relevant if the second moments of $\nabla\lambda$ are smaller, which is consistent with the conclusions above.

Overall, we find that the problems associated with spatial correlation when we consider the TWFE estimator with CRVE depend crucially on the amount of spatial correlation that remains after we control for the group and time fixed effects. If the second moments of $\nabla\lambda$ are close to zero, then the group fixed effects would capture most of the relevant spatial correlation, so CRVE would remain reliable. Assuming $\lambda_t$ is serially positively correlated, with stronger dependence relative to the idiosyncratic shocks, then we can attenuate the spatial correlation problem by considering a shorter time frame around the treatment. If $\mu_e^1 \approx \mu_e^0$, then the time fixed effects would capture a large part of the spatial correlation. In this case, we might still have distortions from the fact that we have only one realization of $\nabla\lambda$ and from the possibility that $\mu_j$ exhibits spatial correlation (as presented in Appendix A.3). Still, these results show that we can attenuate the spatial correlation problem by restricting the sample to treated and control groups that are more alike. Therefore, considering shorter time ranges and control groups that are more similar to the treated groups would make spatial correlation less relevant, on top of potentially making the parallel trends assumption more plausible.

The asymmetry in the conclusions relative to common factors and factor loadings comes from the fact that we are considering inference based on CRVE at the group level, which is the standard alternative when $N$ is large relative to $T$. If we had a setting with many periods and considered CRVE at the time level, then the reverse result would hold. A possible alternative in this case, if both $N$ and $T$ are large, could be the use of two-way clustering at the group and time dimensions (see Cameron et al. (2011), Thompson (2011), Davezies et al. (2018), Menzel (2017), and MacKinnon et al. (2019)). While some of these methods report good performance in simulations even with few clusters in one dimension, if common factors are serially correlated, then this solution would not take into account the correlation between $\eta_{jt}$ and $\eta_{j't'}$, for $j \neq j'$ and $t \neq t'$, which would lead to over-rejection. We should expect such correlations to be different from zero if spatially correlated shocks are serially correlated. We present a Monte Carlo simulation in Appendix A.6 confirming this intuition, and consider this inference method in simulations with the ACS in Section 3.1.

We considered in Section 2.2 how spatial correlation in the potential outcomes $Y_{jt}(0)$ and $Y_{jt}(1)$ translates into spatial correlation for the relevant linear combination of the errors in a TWFE estimator. If we consider alternative estimators, then the conditions in which spatial correlation generates relevant size distortions will be different. For example, for the setting considered in Section 2.2, where treatment starts for all treated groups after time $t^*$, the first-difference estimator of $\alpha$ would be numerically equivalent to a TWFE estimator using only periods $t^*$ and $t^* + 1$. In this case, the results from Proposition 2.1 would imply

$$ var(\hat{\alpha}_{fd}) - \widehat{var}(\hat{\alpha}_{fd})_{Cluster} = (\mu_e^1 - \mu_e^0)' E\left[(\lambda_{t^*+1} - \lambda_{t^*})'(\lambda_{t^*+1} - \lambda_{t^*})\right] (\mu_e^1 - \mu_e^0) + o_p(1). \tag{13} $$

Therefore, as discussed in Section 2.2, ignoring spatial correlation should be less problematic for the first-difference estimator relative to the TWFE if common shocks are serially positively correlated, with stronger dependence relative to the idiosyncratic shocks.
Note that, instead of Assumption 2.2, we would need in this case $E[\lambda_{t^*+1} - \lambda_{t^*}](\mu_e^1 - \mu_e^0) = 0$ and $E[\epsilon_{j,t^*+1} - \epsilon_{j,t^*} \mid D_j] = 0$ for unbiasedness.

Important alternative estimators that are gaining significant attention are the ones proposed by de Chaisemartin and D'Haultfoeuille (2018) (CH1), de Chaisemartin and D'Haultfoeuille (2020) (CH2), and Callaway and Sant'Anna (2018) (CS). The main idea of these estimators is to take into account that, if treatment effects are heterogeneous, then TWFE and the first-difference estimator may recover meaningless weighted averages of such heterogeneous treatment effects. CH1 essentially exploits variation from consecutive periods in which there is a change in treatment status for some groups. In a very simple example in which we consider group × time aggregate data and all treated groups start treatment at the same time, their estimator would be numerically equivalent to the first-difference estimator. More generally, by relying on variations around periods in which there are changes in treatment status, the relevance of spatial correlation for CH1 depends on the second moments of $\lambda_t - \lambda_{t-1}$ for periods in which there is a status change, rather than depending on the second moments of $\nabla\lambda$. Therefore, ignoring spatial correlation would be less problematic with this new estimator relative to the TWFE if common shocks are serially positively correlated, with stronger dependence relative to the idiosyncratic shocks.

In contrast, CH2 and CS estimate treatment effects not only for the time in which groups switch into treatment, but also the effects in the subsequent periods. More specifically, if a treated group starts treatment after time $t^*$, then they estimate the effects $l$ periods ahead of the treatment by comparing outcomes from $t^* + l + 1$ with outcomes from $t^*$. Consider the case in which we then aggregate these dynamic effects.
In this case, the relevance of spatial correlation will depend on the second moments of $\lambda_{t^*+l+1} - \lambda_{t^*}$ for $l = 0, ..., L$, where $L$ is the maximum dynamic treatment considered in the estimation. Therefore, we should expect spatial correlation to be less relevant for these estimators relative to TWFE, but more relevant for these estimators relative to CH1.

Overall, while there has been a series of papers showing that different estimation methods may identify different structural parameters if treatment effects are heterogeneous (e.g., Laporte and Windmeijer (2005), Goodman-Bacon (2018) and de Chaisemartin and D'Haultfoeuille (2018)), we show that the choice among different estimators may affect the degree to which spatial correlation is a problem for inference, being relevant even when treatment effects are homogeneous. On top of estimating meaningful parameters and possibly relying on weaker assumptions for unbiasedness (as they focus on shorter time range comparisons), we show that another advantage of these estimators is that they should be less affected by spatial correlation. If treatment effects are heterogeneous, then we can think of our results as affecting inference related to whatever parameter the estimation method identifies.

We now test the conclusions from Section 2 in simulations with two real datasets, the American Community Survey (ACS) and the Current Population Survey (CPS). Following the strategy used by Bertrand et al. (2004), we randomly generate placebo interventions, and then evaluate the proportion of simulations in which we would reject the null based on inference ignoring spatial correlation. Note that Bertrand et al. (2004) randomly assigned which states received treatment in their simulations. In light of the results from Barrios et al. (2012), this is likely why CRVE at the state level worked well in their simulations, even though there may be unobserved variables that are spatially correlated across states. Here we consider simulations in which treatment may not be randomly assigned at the cluster level.
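The logic of these placebo exercises can be sketched with synthetic data. The block below is a minimal illustration under strong assumptions that are not taken from the paper: two hypothetical states with 50 groups each, two periods, Gaussian shocks, and a state-specific change in the common shock between periods. With two periods and group fixed effects, the DID estimate is the difference in mean first differences between treated and control groups, and an unequal-variance standard error for that difference plays the role of the group-level CRVE here. All names and parameter values are illustrative.

```python
import numpy as np


def placebo_rejection_rate(state_level: bool, n_sims: int = 2000,
                           n_per_state: int = 50, sd_common: float = 0.5,
                           seed: int = 0) -> float:
    """Share of placebo simulations rejecting a true null at the 5% level.

    state_level=True assigns the placebo to every group in one state;
    state_level=False assigns it at random across groups.
    """
    rng = np.random.default_rng(seed)
    state = np.repeat([0, 1], n_per_state)  # state of each group
    n_groups = state.size
    rejections = 0
    for _ in range(n_sims):
        # Change in the common shock between the two periods, one per state.
        d_common = rng.normal(0.0, sd_common, size=2)
        # First-differenced outcome: common component plus idiosyncratic noise.
        d_y = d_common[state] + rng.normal(0.0, 1.0, size=n_groups)
        # Placebo treatment with no true effect.
        treat = state.copy() if state_level else rng.permutation(state)
        y1, y0 = d_y[treat == 1], d_y[treat == 0]
        alpha_hat = y1.mean() - y0.mean()
        # Standard error that treats groups as independent.
        se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
        rejections += int(abs(alpha_hat / se) > 1.96)
    return rejections / n_sims


if __name__ == "__main__":
    print("random assignment across groups:", placebo_rejection_rate(False))
    print("assignment at the state level:  ", placebo_rejection_rate(True))
```

In this stylized design, random assignment across groups should give rejection rates near the nominal level, while state-level assignment leaves the state-specific shock in the estimator but not in the standard error, so the null is rejected far too often.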
We start by considering simulations with the ACS from 2005 to 2017. We select two states and two periods, and then allocate treatment at the Public Use Microdata Area (PUMA) level in the second period. Since it is expected that there are state-level unobserved covariates, the structure of the data is such that there is potentially relevant spatial correlation across PUMAs. We consider two different treatment allocations, one in which PUMAs are randomly assigned treatment independently of their state, and another one in which treatment is assigned at the state level. Since in either case treatment is randomly assigned, Assumption 2.2 is satisfied in our simulations. We also vary the distance in years between the pre- and post-treatment periods, which can be $\delta_{year} \in \{1, 2, ..., 10\}$. Following Bertrand et al. (2004), we restrict the sample to women between the ages of 25 and 50 [...]

Footnote: We created our ACS extract using IPUMS (Ruggles et al. (2015)).

[...] $\delta \in \{1, 2, ..., 10\}$. In this case, based on the results derived by Barrios et al. (2012), the proportion of placebo regressions in which the null is rejected in a 5% significance level test should be around 5%. Rejection rates are close to 5% regardless of $\delta$, whether we consider log wages (Figure 1.A) or employment (Figure 1.B) as outcome variables. This is consistent with the fact that treatment was randomly assigned across PUMAs.

We also present in Figure 1 rejection rates for simulations in which treatment was assigned at the state level. In this case, we should expect over-rejection if there is spatial correlation in the error term even after taking into account the state and year fixed effects. When we consider simulations in which pre- and post-treatment periods are consecutive years (that is, $\delta = 1$), there is only mild over-rejection: 6.9% when log wages is used as outcome variable and 7.2% when employment is used as outcome variable.
When we increase the distance between the pre- and post-treatment periods, however, the over-rejection sharply increases, reaching more than 20% in some cases.

These results are in line with the intuition presented in Section 2 that group fixed effects should capture most of the spatial correlation if the distance between the pre- and post-treatment years is small. However, when this distance is large, the group fixed effects will capture less of the spatial correlation, implying more severe over-rejection. The results are virtually the same if, instead of considering two periods, year and year + $\delta$, we include all time periods from year to year + $\delta$ with treatment starting after year + $\delta/2$.

Footnote: Available upon request.

In Appendix Figure A.2, we present rejection rates separately depending on whether the two states are from the same census division. We expect that states from the same census division should be more similar than states from different census divisions. We find evidence that the over-rejection is smaller when the two states are from the same census division, particularly when we combine information from simulations with log wages and employment as outcomes. This is again consistent with the theory presented in Section 2 in that spatial correlation becomes less problematic when treated and control groups are more similar, even though such geographic proximity does not seem sufficient in this case to make rejection [...] $\delta$ is large. Since we have relatively few simulations with states in the same census division, particularly when $\delta$ is large, we consider these results with caution, particularly when we consider simulations with the two outcomes separately. We revisit the idea that similarity between treated and control groups makes spatial correlation less problematic in the simulations from Section 3.2.
There, we do not have such stringent limitations on the number of simulations with similar treated and control groups, and we have a better measure of proximity between treated and control groups.

We also consider simulations with time-varying treatment assignment. In this case, we select ten years of data, and set half of the treated PUMAs to start treatment after the third year, and half of the treated PUMAs to start treatment after the seventh year. Again, we vary whether treated PUMAs are randomly selected or selected at the state level. We present in Table 1 rejection rates for the TWFE, CH1, and CH2. When treatment is randomly assigned across PUMAs, we find rejection rates close to 5% for all estimators, whether we consider log wages or employment as outcome. This was expected, given that spatial correlation does not pose a threat for any of these inference methods when treatment is randomly assigned at the cluster level. When we consider treatment randomly allocated at the state level, however, rejection rates are close to 5% for CH1, but we find significant over-rejection for CH2 (9%-10%) and for TWFE (19%-20%). The fact that spatial correlation problems become more severe when we go from CH1 to CH2, and then to TWFE, is consistent with the discussion from Section 2.3, and with our findings that there is not much over-rejection when we compare consecutive years of data.

Finally, we also consider simulations using the two-way cluster standard errors proposed by Cameron et al. (2011), clustering at both the PUMA and the year levels. Since two-way clustering does not work well with only one pre-treatment period and one post-treatment period, we again consider simulations with ten years of data. We consider here the placebo treatment starting after the fifth year. When we consider treatment randomly allocated across PUMAs, rejection rates are 6.5% and 7% when the outcome variable is, respectively, log wages and employment.
There is a slight over-rejection, possibly from the fact that there are only ten periods. In contrast, when we consider treatment allocated at the state level, rejection rates are 23% and 28%. These simulations confirm the intuition presented in Section 2, that two-way cluster procedures may underestimate the standard errors because they fail to take into account correlations between $\eta_{jt}$ and $\eta_{j't'}$, for $j \neq j'$ and $t \neq t'$. Note that such correlations will appear whenever there are common shocks that are serially correlated. We present a Monte Carlo simulation in Appendix A.6 that confirms this intuition.

We now present simulations using the CPS data from 1979 to 2018. We select two years and two age groups. We vary the distance between the pre- and post-treatment periods ($\delta_{year}$), and the distance between the two age groups ($\delta_{age}$), both ranging from 1 to 15. As before, we restrict the sample to women between the ages of 25 and 50, and we consider as outcome variables log wages and employment. Treatment is then randomly allocated in the post-treatment period for one of the age groups. These simulations mimic a setting in which there is a policy change that affects individuals from a specific cohort, so we can use other cohorts as a control group. In these simulations, we treat a pair (state × age) as a group $i$, and we estimate the treatment effect using a DID model including time fixed effects and state × age fixed effects. We test the null hypothesis of no effect based on standard errors clustered at the state level. Therefore, we implicitly assume that the error terms for individuals in different states are independent.

In these simulations, we now have good measures of proximity both between the pre- and post-periods ($\delta_{year}$), and between the treated and control groups ($\delta_{age}$).
Therefore, we are able to better validate, in this example, the intuition presented in Section 2 that correlated shocks should be relatively less important when either (i) treated and control groups are more similar, or (ii) the pre-treatment period is close to the post-treatment period.

Footnote: When we consider the adjustment for the two-way clustered standard error proposed by Davezies et al. (2018), we also find over-rejection when treatment assignment is at the state level. We find 11% for log wages and 15% for employment. When treatment assignment is at the PUMA level, the test is slightly conservative (2% and 2.4%).

We present in Figure 2 rejection rates for combinations of ($\delta_{year}$, $\delta_{age}$). Interestingly, independently of the outcome variable, rejection rates are generally close to 5% when either $\delta_{year}$ or $\delta_{age}$ is small. For example, even when $\delta_{year} = 10$, in which case the simulations from Section 3.1 displayed large over-rejection, rejection rates remain close to 5% when $\delta_{age}$ is small. Likewise, rejection rates are still close to 5% when we consider $\delta_{age} = 10$, as long as $\delta_{year}$ is small. When both $\delta_{year}$ and $\delta_{age}$ increase, however, we find significant over-rejection. With ($\delta_{year}$, $\delta_{age}$) = (15, [...] $\delta_{year}$ is large) and significant differences between the treated and control groups ($\delta_{age}$ is large).

Without imposing additional assumptions either on the time-series or cross-section correlations of the errors, it would not be possible to draw valid inference for the DID estimator. To see that, if we do not impose any restriction on the structure of the errors, then the error term $\eta_{jt}$ in equation (2) could be such that

$$ \eta_{jt} = \sum_{d \in \{0,1\}} \sum_{\tau \in \{0,1\}} v_{d\tau} 1\{j \in I_d, t \in T_\tau\} + \xi_{jt}, \tag{14} $$

for random variables $v_{d\tau}$, $d \in \{0, 1\}$, and $\tau \in \{0, 1\}$.

In this case, there are correlated shocks $v_{d\tau}$ that equally affect all observations in each combination of treated vs control groups, and pre- vs post-treatment periods.
Note that such structure would be consistent with the linear factor model for potential outcomes considered in equation (5). In this case, the distribution of the DID estimator would depend on $(v_{11} - v_{10}) - (v_{01} - v_{00})$, irrespective of the asymptotic framework we consider (even if both $N$ and $T \to \infty$). Essentially, without imposing additional assumptions on the structure of the errors, we are left with a 2 × 2 [...] $v_{d\tau}$ (e.g., Donald and Lang (2007)).

In order to provide valid inference in the DID setting, we need to impose restrictions on the structure of the errors for at least one dimension, either in the time series or in the cross section. For example, when we assume that errors are independent across $j$, then we can provide valid inference even without imposing any restriction on the time series dependence (e.g., Arellano (1987) and Bertrand et al. (2004)). Most of these approaches will then rely on an asymptotic theory in which the number of groups goes to infinity. The independence assumption across units can be relaxed if we have a distance measure and impose restrictions on the spatial correlation (e.g., Bester et al. (2011)). Alternatively, if we impose assumptions in the time series such as stationarity, then it would be possible to allow for arbitrary cross-section correlation (e.g., section 4 of Ferman and Pinto (2019) and Chernozhukov et al. (2017)). In this case, we would rely on an asymptotic theory in which the number of periods goes to infinity.

At the heart of the problem, if we want to allow errors to be correlated across both dimensions, then we need a distance measure in at least one dimension. Moreover, we need to impose assumptions on the structure of the errors in such dimension, and we generally need an asymptotic setting in which this dimension goes to infinity. While a distance measure is [...]

Footnote: Some of these approaches rely on both the number of treated and of control units diverging, while others may allow for one of those being fixed. Donald and Lang (2007) propose inference with both the number of treated and control groups fixed by imposing strong assumptions, such as normality and homoskedasticity. While Donald and Lang (2007) consider a case in which errors are independent both across time and groups, it would be possible to extend their idea to consider a setting with either serial or cross-section correlation, although it would still rely on strong assumptions on the errors.
As discussed in Section 4.1, there is no clear solution for inference if we have the prevalent setting in DID applications in which there is a small number of periods, and it is not possible to impose a distance metric across groups. In such cases, relying on inference methods that assume cross-section independence, such as CRVE, seems like the only option. The results derived in Section 2, and corroborated in simulations with two important datasets in Section 3 (the ACS and the CPS), provide guidelines on how one should proceed in empirical applications to minimize the relevance of spatial correlation in this case. We show that spatial correlation can lead to severe over-rejection when (i) the second moment of the difference in the pre- and post-treatment averages of the common factors is large, and (ii) factor loadings have very different distributions for the treated and control groups.

Therefore, researchers in this situation should make sure that at least one of these conditions is not satisfied (or, at least, is minimized) in their applications. For example, consider a setting with more than one pre- and post-treatment period in which there are arguably relevant unobserved common shocks that can affect treated and control groups differently. In this case, a longer time series would imply a larger second moment for the difference between the pre- and post-treatment averages of the common factors if such common factors exhibit stronger positive serial correlation relative to the idiosyncratic shocks (see details in Appendix A.5). The simulations from Section 3 provide evidence that this is the case for the ACS and CPS datasets. One possible recommendation in this case is to restrict the sample to a few periods before and a few periods after the treatment. In this case, the group fixed effects would absorb more of these common shocks, making inference assuming independent groups more reliable.
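The role of the time window can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, a scalar common factor that follows a stationary AR(1) process with autocorrelation rho and unit marginal variance, and idiosyncratic shocks that are serially uncorrelated (rho = 0). It computes the variance of the difference between the post- and pre-treatment averages for a given number of periods on each side. The function name and parameter values are hypothetical.

```python
import numpy as np


def var_diff_of_means(rho: float, n_pre: int, n_post: int,
                      sigma2: float = 1.0) -> float:
    """Variance of (post-period mean - pre-period mean) of a stationary
    AR(1) series with autocorrelation rho and marginal variance sigma2."""
    n_periods = n_pre + n_post
    t = np.arange(n_periods)
    # Autocovariance matrix: cov(x_s, x_t) = sigma2 * rho^|s - t|.
    cov = sigma2 * rho ** np.abs(t[:, None] - t[None, :])
    # Weights that form the post-period mean minus the pre-period mean.
    weights = np.concatenate([-np.ones(n_pre) / n_pre,
                              np.ones(n_post) / n_post])
    return float(weights @ cov @ weights)


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        common = var_diff_of_means(rho=0.9, n_pre=k, n_post=k)
        idiosyncratic = var_diff_of_means(rho=0.0, n_pre=k, n_post=k)
        print(f"{k:2d} periods per side: common = {common:.3f}, "
              f"idiosyncratic = {idiosyncratic:.3f}, "
              f"ratio = {common / idiosyncratic:.2f}")
```

Under these assumptions, the ratio of the common-factor term to the idiosyncratic term grows as the window widens, which is the mechanism behind the recommendation to focus on a few periods around the treatment.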
In addition to making spatial correlation problems less relevant, restricting to periods close to the policy change can also arguably make the parallel trends assumption of the DID model more plausible.

We also show that different estimators not only estimate different structural parameters if treatment effects are heterogeneous, but also are differentially affected by spatial correlation. For example, since CH1 focuses only on periods immediately before and after changes in treatment status, we show in Section 2.3 that their estimator may be less affected by spatial correlation than the TWFE estimator (on top of the benefit of providing an estimator for a meaningful weighted average of treatment effects in case treatment effects are heterogeneous). One should be careful, however, that the frequency of the data may affect to what extent the spatial correlation should be relevant in this case. For example, from the discussions in Sections 2 and 3, we should expect spatial correlation problems to be less relevant for CH1 when we have yearly data relative to when we have decennial data.

If the focus of the empirical exercise is to estimate the long-term impacts of a policy change, then it would not be possible to minimize the second moments of $\nabla\lambda$ by restricting the sample to periods around the policy change. Therefore, the effort should be in the direction of guaranteeing that the treated and the control groups are as similar as possible. While, in this case, spatial correlation in the factor loadings could affect inference even if the distribution of factor loadings is the same for treated and control groups, focusing on treated and control groups that are more similar ensures that a larger portion of the spatial correlation is absorbed by the year fixed effects.
Again, making spatial correlation problems less relevant would be an additional benefit of focusing on treated and control groups that are more similar, on top of making the parallel trends assumption arguably more plausible.

We show in Appendix A.7 that usual pre-tests for parallel trends can also capture inference problems due to spatial correlation. Following the approach considered by Roth (2019), we show that, if the distribution of $\lambda_t$ is stable, then rejecting the null for a placebo regression using the pre-treatment data may be informative about whether spatial correlation is a relevant problem for the main regression, even if the parallel trends assumption is valid. Roth (2019) shows that pre-testing may exacerbate the problem of violations of parallel trends in case it fails to detect such violations due to sampling noise. In contrast, we show that such pre-testing would not exacerbate the problem in case it fails to detect relevant spatial correlation.

Still, a potential problem with such pre-tests is that they may be low-powered, as presented in our simulations in Appendix A.7 and in simulations from Roth (2019). Our results from Sections 2 and 3 show that these tests should be particularly low-powered to detect spatial correlation problems if a researcher uses the pre-treatment data for the pre-test, and then uses all periods for the estimation. In this case, since the pre-test would be based on a shorter time range, the structure of the errors in the pre-test would induce fewer problems of spatial correlation than the structure of the errors in the main estimation. Given the possibility of the pre-test being low-powered, it may be advisable to follow the recommendations above of avoiding long panels even if the pre-test does not detect problems. Moreover, focusing on shorter panels for the estimation also makes pre-tests more informative about spatial correlation problems, as in this case the pre-tests would have a structure more similar to the main estimation.
Overall, our results complement the recent analysis on pre-testing in DID models by considering the case in which we may have inference problems due to spatial correlation. We show that such pre-testing can also be informative about spatial correlation problems, on top of being informative about violations of the parallel trends assumption. We also analyze the potential limitations of such pre-tests when we have spatial correlation problems, and provide guidance on how to make such pre-tests more informative.

Finally, note that, if we condition on a realization of $\lambda_t$ and $\mu_j$, then our setting is equivalent to a setting in which the parallel trends assumption is violated. In this case, another alternative could be to construct honest confidence sets, as proposed by Rambachan and Roth (2019). However, in this case there would always be a positive probability that the realizations of $\lambda_t$ and $\mu_j$ are such that the pre-trends are much smaller than the post-trends, which implies that it would not be possible to learn much about the post-trends based on the information from the pre-trends. As a consequence, we would have to specify a large set of possible violations of the parallel trends assumption to construct the honest confidence sets proposed by Rambachan and Roth (2019), implying that such an approach would generally be uninformative in this setting.

We analyze the conditions in which (ignored) correlated shocks pose relevant challenges for inference in DID models. Overall, our main conclusion is that the relevance of spatial correlation for inference (when it is ignored) depends on the amount of spatial correlation that remains after we control for the time- and group-invariant unobservables. As a consequence, details such as the time frame used in the estimation, and the choice of the estimator, will be key determinants of the degree of distortions we should expect when spatial correlation is ignored.
The simulation results corroborate our theoretical conclusions, suggesting that the linear factor model analyzed in this paper provides a good approximation to real datasets like the ACS and the CPS. Given these insights, we can have a better understanding of when spatial correlation should be more problematic, and some recommendations on how to minimize this problem in empirical applications.
References
Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. (2017). When should you adjust standard errors for clustering? Working Paper 24003, National Bureau of Economic Research.
Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. M. (2020). Sampling-based versus design-based uncertainty in regression analysis. Econometrica, 88(1):265–296.
Adão, R., Kolesár, M., and Morales, E. (2019). Shift-Share Designs: Theory and Inference. The Quarterly Journal of Economics, 134(4):1949–2010.
Arellano, M. (1987). Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics, 49(4):431–434.
Athey, S. and Imbens, G. (2018). Design-based Analysis in Difference-In-Differences Settings with Staggered Adoption. Working Paper, arXiv:1808.05293.
Barrios, T., Diamond, R., Imbens, G. W., and Kolesar, M. (2012). Clustering, spatial correlations, and randomization inference. Journal of the American Statistical Association, 107(498):578–591.
Bertrand, M., Duflo, E., and Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? Quarterly Journal of Economics, 119(1):249–275.
Bester, C. A., Conley, T. G., and Hansen, C. B. (2011). Inference with dependent data using cluster covariance estimators. Journal of Econometrics, 165(2):137–151.
Brewer, M., Crossley, T. F., and Joyce, R. (2017). Inference with difference-in-differences revisited. Journal of Econometric Methods, 7(1).
Callaway, B. and Sant'Anna, P. H. C. (2018). Difference-in-Differences with Multiple Time Periods and an Application on the Minimum Wage and Employment. Working Paper, arXiv:1803.09015.
Cameron, A., Gelbach, J., and Miller, D. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3):414–427.
Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2011). Robust inference with multiway clustering. Journal of Business & Economic Statistics, 29(2):238–249.
Canay, I. A., Romano, J. P., and Shaikh, A. M. (2017). Randomization tests under an approximate symmetry assumption. Econometrica, 85(3):1013–1030.
Chernozhukov, V., Wuthrich, K., and Zhu, Y. (2017). An exact and robust conformal inference method for counterfactual and synthetic controls.
Chernozhukov, V., Wuthrich, K., and Zhu, Y. (2019). An Exact and Robust Conformal Inference Method for Counterfactual and Synthetic Controls. Papers 1712.09089, arXiv.org.
Conley, T. G. and Taber, C. R. (2011). Inference with Difference in Differences with a Small Number of Policy Changes. The Review of Economics and Statistics, 93(1):113–125.
Dailey, A. (2017). Randomization inference with rainfall data: Using historical weather patterns for variance estimation. Political Analysis, 25(3):277–288.
Davezies, L., D'Haultfoeuille, X., and Guyonvarch, Y. (2018). Asymptotic results under multiway clustering. arXiv e-prints, arXiv:1807.07925.
de Chaisemartin, C. and D'Haultfoeuille, X. (2018). Two-way fixed effects estimators with heterogeneous treatment effects.
de Chaisemartin, C. and D'Haultfoeuille, X. (2020). Difference-in-differences estimators of intertemporal treatment effects.
Donald, S. G. and Lang, K. (2007). Inference with Difference-in-Differences and Other Panel Data. The Review of Economics and Statistics, 89(2):221–233.
Ferman, B. (2020). Inference in differences-in-differences with few treated units and spatial correlation.
Ferman, B. and Pinto, C. (2019). Inference in differences-in-differences with few treated groups and heteroskedasticity. The Review of Economics and Statistics, forthcoming.
Freyaldenhoven, S., Hansen, C., and Shapiro, J. M. (2019). Pre-event trends in the panel event-study design. American Economic Review, 109(9):3307–38.
Goodman-Bacon, A. (2018). Difference-in-differences with variation in treatment timing. Working Paper 25018, National Bureau of Economic Research.
Kahn-Lang, A. and Lang, K. (2019). The promise and pitfalls of differences-in-differences: Reflections on 16 and Pregnant and other applications. Journal of Business & Economic Statistics, 0(0):1–14.
Kelly, M. (2019). The Standard Errors of Persistence. CEPR Discussion Papers 13783, C.E.P.R. Discussion Papers.
Kim, M. S. and Sun, Y. (2013). Heteroskedasticity and spatiotemporal dependence robust inference for linear panel models with fixed effects. Journal of Econometrics, 177(1):85–108.
Laporte, A. and Windmeijer, F. (2005). Estimation of panel data models with binary indicators when treatment effects are not constant over time. Economics Letters, 88(3):389–396.
MacKinnon, J. G., Nielsen, M., and Webb, M. D. (2019). Wild Bootstrap and Asymptotic Inference with Multiway Clustering. Working Paper 1415, Economics Department, Queen's University.
MacKinnon, J. G. and Webb, M. D. (2017). Wild bootstrap inference for wildly different cluster sizes. Journal of Applied Econometrics, 32(2):233–254.
MacKinnon, J. G. and Webb, M. D. (2019). Randomization Inference for Difference-in-Differences with Few Treated Clusters. Journal of Econometrics, forthcoming.
Menzel, K. (2017). Bootstrap with Clustering in Two or More Dimensions. arXiv e-prints, arXiv:1703.03043.
Rambachan, A. and Roth, J. (2019). An honest approach to parallel trends.
Rambachan, A. and Roth, J. (2020). Design-based uncertainty for quasi-experiments.
Roth, J. (2019). Pre-test with caution: Event-study estimates after testing for parallel trends.
Ruggles, S., Genadek, K., Goeken, R., Grover, J., and Sobek, M. (2015). Integrated Public Use Microdata Series: Version 6.0 [Machine-readable database].
Thompson, S. B. (2011). Simple formulas for standard errors that cluster by both firm and time. Journal of Financial Economics, 99(1):1–10.
Vogelsang, T. J. (2012). Heteroskedasticity, autocorrelation, and spatial correlation robust inference in linear panel models with fixed-effects.
Journal of Econometrics, 166(2):303–319.

Figure 1: Simulations with the ACS

[Figure: panels A (Log wages) and B (Employment); rejection rates (vertical axis) against $\delta_{year}$ (horizontal axis), for PUMA-level and state-level treatment allocation.]

Notes: This figure presents rejection rates for the simulations using ACS data, presented in Section 3.1. Each simulation has two states and two periods. We considered all combinations of pairs of states and years. The distance between the pre- and post-treatment periods ($\delta_{year}$) varies from 1 to 10 years. The pre-treatment period ranges from 2005 to 2017 − $\delta_{year}$. In the “PUMA level” results, treatment is randomly allocated at the PUMA level, while in the “state level” results, treatment is allocated at the state level. For each simulation, we run a DID regression and test the null hypothesis using standard errors clustered at the PUMA level. The outcome variable is log(wages) (subfigure A) and employment status (subfigure B) for women aged between 25 and 50. We consider only simulations with 20 or more treated and control PUMAs. The number of simulations for each outcome and for each treatment allocation ranges from 5646 when $\delta_{year} = 1$ to 693 when $\delta_{year} = 10$.

Simulations with the CPS

[Figure: panels A (Log wages) and B (Employment); rejection rates plotted against $\delta_{age}$ and $\delta_{year}$.]

Notes: This figure presents rejection rates for the simulations using CPS data, presented in Section 3.2. We considered all combinations of pairs of years and pairs of ages. The initial time period ranges from 1979 to 2018 − $\delta_{year}$. The initial age ranges from 25 to 50 − $\delta_{year}$. For each simulation, we run a DID regression and test the null hypothesis using standard errors clustered at the state level. The outcome variable is log(wages) (subfigure A) and employment status (subfigure B) for women with the ages considered in each simulation.

Alternative estimators
                                      TWFE     CH1      CH2
                                      (1)      (2)      (3)
Panel A: PUMA-level randomization
  Log wages                           0.053    0.043    0.056
  Employed                            0.049    0.063    0.062
Panel B: State-level randomization
  Log wages                           0.175    0.045    0.108
  Employed                            0.192    0.044    0.094

Notes: This table presents rejection rates for the simulations using ACS data, presented in Section 3.1. Each simulation has two states and ten consecutive years. We considered all combinations of pairs of states and years. In panel A, treatment is randomly allocated at the PUMA level, while in panel B, treatment is allocated at the state level. Half of the PUMAs in the treated state start treatment after the third period, while the other half start treatment after the seventh period. For each simulation, we consider TWFE, CH1, and CH2 estimators and test the null hypothesis using standard errors clustered at the PUMA level. The outcome variables are either log(wages) or employment status for women aged between 25 and 50. We consider only simulations with 20 or more treated and control PUMAs.

Appendix
A.1 Proof of the main results
A.1.1 Proof of Proposition 2.1

Proof. Note first that
\begin{equation}
\hat{\alpha} - \alpha = \nabla\lambda(\mu^e_1 - \mu^e_0) + \frac{1}{N_1}\sum_{j\in\mathcal{I}_1}\left[\nabla\lambda(\mu_j - \mu^e_1) + (\nabla\alpha_j - \alpha) + \nabla\epsilon_j\right] - \frac{1}{N_0}\sum_{j\in\mathcal{I}_0}\left[\nabla\lambda(\mu_j - \mu^e_0) + \nabla\epsilon_j\right] = \nabla\lambda(\mu^e_1 - \mu^e_0) + o_p(1), \tag{15}
\end{equation}
since the terms $\nabla\lambda(\mu_j - \mu^e_w)$, $(\nabla\alpha_j - \alpha)$, and $\nabla\epsilon_j$ are uncorrelated across $j$.

The OLS residuals from the TWFE DID regression are such that, for $j \in \mathcal{I}_w$, $w \in \{0,1\}$,
\begin{align}
\widehat{W}_j &= \nabla Y_j - \frac{1}{N_w}\sum_{k\in\mathcal{I}_w}\nabla Y_k \tag{16}\\
&= \nabla\lambda(\mu_j - \mu^e_w) + (\nabla\alpha_j - \alpha)D_j + \nabla\epsilon_j - \frac{1}{N_w}\sum_{k\in\mathcal{I}_w}\left[\nabla\lambda(\mu_k - \mu^e_w) + (\nabla\alpha_k - \alpha)D_k + \nabla\epsilon_k\right]. \notag
\end{align}

Given Assumptions 2.1 and 2.2,
\begin{equation}
\frac{1}{N_w}\sum_{j\in\mathcal{I}_w}\widehat{W}_j^2 = (\nabla\lambda)(\Sigma_\mu(w))(\nabla\lambda)' + \sigma^2_\epsilon(w) + o_p(1). \tag{17}
\end{equation}

Therefore,
\begin{align}
\widehat{\mathrm{var}}(\hat{\alpha})_{Cluster} &= \frac{1}{N_1}\left(\frac{1}{N_1}\sum_{j\in\mathcal{I}_1}\widehat{W}_j^2\right) + \frac{1}{N_0}\left(\frac{1}{N_0}\sum_{j\in\mathcal{I}_0}\widehat{W}_j^2\right) \tag{18}\\
&= \frac{1}{N_1}(\nabla\lambda)(\Sigma_\mu(1))(\nabla\lambda)' + \frac{1}{N_1}\sigma^2_\epsilon(1) + \frac{1}{N_0}(\nabla\lambda)(\Sigma_\mu(0))(\nabla\lambda)' \tag{19}\\
&\quad + \frac{1}{N_0}\sigma^2_\epsilon(0) + o_p(N^{-1}) = o_p(1). \tag{20}
\end{align}

Now note that, under Assumptions 2.1 and 2.2,
\begin{align}
\mathrm{var}(\hat{\alpha} \mid D = d) &= (\mu^e_1 - \mu^e_0)' E\left[(\nabla\lambda)'(\nabla\lambda)\right](\mu^e_1 - \mu^e_0) + \frac{1}{N_1}\sigma^2_\epsilon(1) + \frac{1}{N_0}\sigma^2_\epsilon(0) \tag{21}\\
&\quad + \frac{1}{N_1}E\left[(\nabla\lambda)(\Sigma_\mu(1))(\nabla\lambda)'\right] + \frac{1}{N_0}E\left[(\nabla\lambda)(\Sigma_\mu(0))(\nabla\lambda)'\right], \tag{22}
\end{align}
where we implicitly assume that we are conditioning on $N_1 \geq 1$ and $N_0 \geq 1$. Otherwise, it would not be possible to construct a DID estimator.

Therefore,
\begin{align}
\mathrm{var}(\hat{\alpha}) &= E[\mathrm{var}(\hat{\alpha} \mid D)] + \mathrm{var}[E(\hat{\alpha} \mid D)] \tag{23}\\
&= (\mu^e_1 - \mu^e_0)' E\left[(\nabla\lambda)'(\nabla\lambda)\right](\mu^e_1 - \mu^e_0) + E\left[\frac{1}{N_1}\right]\sigma^2_\epsilon(1) + E\left[\frac{1}{N_0}\right]\sigma^2_\epsilon(0) \tag{24}\\
&\quad + E\left[\frac{1}{N_1}\right]E\left[(\nabla\lambda)(\Sigma_\mu(1))(\nabla\lambda)'\right] + E\left[\frac{1}{N_0}\right]E\left[(\nabla\lambda)(\Sigma_\mu(0))(\nabla\lambda)'\right] \tag{25}\\
&= (\mu^e_1 - \mu^e_0)' E\left[(\nabla\lambda)'(\nabla\lambda)\right](\mu^e_1 - \mu^e_0) + o(1), \tag{26}
\end{align}
since $E[N_w^{-1}] = o(1)$ from $N_w^{-1} \overset{p}{\to} 0$ and $|N_w^{-1}| \leq 1$, and $E(\hat{\alpha} \mid D) = \alpha$.

Therefore,
\begin{equation}
\mathrm{var}(\hat{\alpha}) - \widehat{\mathrm{var}}(\hat{\alpha})_{Cluster} = (\mu^e_1 - \mu^e_0)' E\left[(\nabla\lambda)'(\nabla\lambda)\right](\mu^e_1 - \mu^e_0) + o_p(1). \tag{27}
\end{equation}

A.1.2 Proof of Proposition 2.2

Proof.
\begin{align}
\sqrt{N}(\hat{\alpha} - \alpha) ={}& \sqrt{N}(\nabla\lambda)(\mu^e_1 - \mu^e_0) + \sqrt{N}(\nabla\lambda)\frac{1}{N_1}\sum_{j\in\mathcal{I}_1}(\mu_j - \mu^e_1) + \frac{\sqrt{N}}{N_1}\sum_{j\in\mathcal{I}_1}\left[(\nabla\alpha_j - \alpha) + \nabla\epsilon_j\right] \notag\\
&- \sqrt{N}(\nabla\lambda)\frac{1}{N_0}\sum_{j\in\mathcal{I}_0}(\mu_j - \mu^e_0) - \frac{\sqrt{N}}{N_0}\sum_{j\in\mathcal{I}_0}\nabla\epsilon_j. \tag{28}
\end{align}

Note that $\sqrt{N}(\nabla\lambda) = \xi = O_p(1)$, $N_w^{-1}\sum_{j\in\mathcal{I}_w}(\mu_j - \mu^e_w) = o_p(1)$, and $N_w^{-1/2}\sum_{j\in\mathcal{I}_w}[(\nabla\alpha_j - \alpha)D_j + \nabla\epsilon_j] \overset{d}{\to} N(0, \sigma^2_\epsilon(w))$. Moreover, $\sqrt{N}(\nabla\lambda)$, $N_0^{-1/2}\sum_{j\in\mathcal{I}_0}\nabla\epsilon_j$ and $N_1^{-1/2}\sum_{j\in\mathcal{I}_1}[(\nabla\alpha_j - \alpha) + \nabla\epsilon_j]$ are mutually independent. Therefore,
\begin{equation}
\sqrt{N}(\hat{\alpha} - \alpha) \overset{d}{\to} \xi(\mu^e_1 - \mu^e_0) + \frac{1}{\sqrt{c}}\sigma_\epsilon(1)Z_1 - \frac{1}{\sqrt{1-c}}\sigma_\epsilon(0)Z_0, \tag{29}
\end{equation}
where $Z_1$ and $Z_0$ are standard normal variables, and $\xi$, $Z_1$ and $Z_0$ are mutually independent.

Moreover, from equation (17), we have that $N_w^{-1}\sum_{j\in\mathcal{I}_w}\widehat{W}_j^2 = (\nabla\lambda)(\Sigma_\mu(w))(\nabla\lambda)' + \sigma^2_\epsilon(w) + o_p(1) = \sigma^2_\epsilon(w) + o_p(1)$, given Assumption 2.3. Therefore,
\begin{align}
N\,\widehat{\mathrm{var}}(\hat{\alpha})_{Cluster} &= \frac{N}{N_1}\left(\frac{1}{N_1}\sum_{j\in\mathcal{I}_1}\widehat{W}_j^2\right) + \frac{N}{N_0}\left(\frac{1}{N_0}\sum_{j\in\mathcal{I}_0}\widehat{W}_j^2\right) \tag{30}\\
&= \frac{1}{c}\sigma^2_\epsilon(1) + \frac{1}{1-c}\sigma^2_\epsilon(0) + o_p(1). \tag{31}
\end{align}

A.2 An alternative model in which $F \to \infty$

In Section 2.2, we consider a linear factor model for the spatial correlation in which the number of factors, $F$, is fixed. While this allows for a rich variety of spatial correlation structures, it would be harder to encompass settings in which, for example, the error is strongly mixing in the cross section. We consider here a stylized example for the spatial correlation, which can also be described as a linear factor model, but in which the number of factors increases when $N_1, N_0 \to \infty$.
We show that, as in Proposition 2.1, we also have that (i) ignoring spatial correlation and relying on CRVE generally leads to over-rejection, and (ii) the over-rejection is stronger when the second moment of the difference between the post- and pre-treatment averages of the common factors is relatively large.

Consider a simple example in which we have $N_1/2$ factors $\lambda_t(f)$, $f = 1, ..., N_1/2$, and $N_0/2$ factors $\delta_t(f)$, $f = 1, ..., N_0/2$. We consider the treatment assignment as fixed, and partition the set of treated groups, $\mathcal{I}_1$, in $N_1/2$ pairs, $\Lambda_1, ..., \Lambda_{N_1/2}$. Likewise, we divide the set of control groups, $\mathcal{I}_0$, in $N_0/2$ pairs, $\Gamma_1, ..., \Gamma_{N_0/2}$. Potential outcomes are given by
\begin{equation}
\begin{cases}
Y_{jt}(0) = \theta_j + \gamma_t + \sum_{f=1}^{N_1/2}\lambda_t(f)1\{j \in \Lambda_f\} + \sum_{f=1}^{N_0/2}\delta_t(f)1\{j \in \Gamma_f\} + \epsilon_{jt}\\
Y_{jt}(1) = \alpha + Y_{jt}(0).
\end{cases} \tag{32}
\end{equation}

Therefore, this model for the potential outcomes follows a linear factor model as the one in equation (5). The main difference is that we allow the number of factors to increase with $N$, and that we impose a structure in which groups are divided into pairs that are spatially correlated, but independent across pairs. We assume for simplicity that treatment effects are homogeneous, but all conclusions remain the same if we allow for heterogeneous treatment effects, as we do in Section 2. This analysis is conditional on treatment assignment and on the sequence of factor loadings (in this case, the pairs to which each group belongs), and we impose the following assumptions.

Assumption A.1 (a) $\{\epsilon_{j1}, ..., \epsilon_{jT}\}_{j \in \mathcal{I}_1 \cup \mathcal{I}_0}$ is mutually independent across $j$, and identically distributed within treated and control groups; (b) $\{(\lambda_1(f), ..., \lambda_T(f))\}_{f=1}^{N_1/2}$ is iid, $\{(\delta_1(f), ..., \delta_T(f))\}_{f=1}^{N_0/2}$ is iid, and these variables are mutually independent; (c) all random variables have finite fourth moments; (d) $E[\nabla\epsilon_j] = 0$ for all $j$, $E[\nabla\lambda(f)] = 0$ for all $f = 1, ..., N_1/2$, and $E[\nabla\delta(f)] = 0$ for all $f = 1, ..., N_0/2$.

No assumptions are needed on $\theta_j$ and $\gamma_t$, because these factors are eliminated by the fixed effects. Therefore, the TWFE estimator eliminates $\theta_j$ and $\gamma_t$ (which may potentially be correlated with treatment assignment), but does not eliminate all of the spatial correlation structure associated with $\{\lambda_t(f)\}_{f=1,...,N_1/2}$ and $\{\delta_t(f)\}_{f=1,...,N_0/2}$. This remaining factor structure does not generate bias given Assumption A.1(d), but may be problematic for inference if it generates relevant spatial correlation.

Let $\sigma^2_\lambda = \mathrm{var}(\nabla\lambda(f))$, $\sigma^2_\delta = \mathrm{var}(\nabla\delta(f))$, and $\sigma^2_\epsilon(w) = \mathrm{var}(\nabla\epsilon_j(w))$ for $j \in \mathcal{I}_w$, $w \in \{0,1\}$. Recall that we are considering treatment assignment as fixed in this setting. Therefore, the variance of the TWFE estimator is given by
\begin{equation}
\mathrm{var}(\hat{\alpha}) = \frac{2}{N_1}\sigma^2_\lambda + \frac{2}{N_0}\sigma^2_\delta + \frac{1}{N_1}\sigma^2_\epsilon(1) + \frac{1}{N_0}\sigma^2_\epsilon(0). \tag{33}
\end{equation}

We consider the asymptotic behavior of the DID estimator and of CRVE in this setting when $N_1$ and $N_0 \to \infty$.

Proposition A.1
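Consider a setting in which potential outcomes follow equation (32). Treatment allocation is fixed, and starts after period $t^*$ for the treated groups. Assumption A.1 holds. Then, as $N_1$ and $N_0 \to \infty$,
\begin{equation}
\sqrt{N}(\hat{\alpha} - \alpha) \overset{d}{\to} N\left(0, \frac{2}{c}\sigma^2_\lambda + \frac{2}{1-c}\sigma^2_\delta + \frac{1}{c}\sigma^2_\epsilon(1) + \frac{1}{1-c}\sigma^2_\epsilon(0)\right), \tag{34}
\end{equation}
where $N_1/N = c$. Moreover,
\begin{equation}
t = \frac{\hat{\alpha}}{\sqrt{\widehat{\mathrm{var}}(\hat{\alpha})_{Cluster}}} \overset{d}{\to} N\left(0, 1 + \frac{\frac{1}{c}\sigma^2_\lambda + \frac{1}{1-c}\sigma^2_\delta}{\frac{1}{c}\sigma^2_\lambda + \frac{1}{1-c}\sigma^2_\delta + \frac{1}{c}\sigma^2_\epsilon(1) + \frac{1}{1-c}\sigma^2_\epsilon(0)}\right). \tag{35}
\end{equation}

Proof.
\begin{equation}
\hat{\alpha} = \alpha + \frac{2}{N_1}\sum_{f=1}^{N_1/2}\nabla\lambda(f) - \frac{2}{N_0}\sum_{f=1}^{N_0/2}\nabla\delta(f) + \frac{1}{N_1}\sum_{j\in\mathcal{I}_1}\nabla\epsilon_j - \frac{1}{N_0}\sum_{j\in\mathcal{I}_0}\nabla\epsilon_j. \tag{36}
\end{equation}
Therefore, applying the central limit theorem, we have
\begin{equation}
\sqrt{N}(\hat{\alpha} - \alpha) \overset{d}{\to} N\left(0, \frac{2}{c}\sigma^2_\lambda + \frac{2}{1-c}\sigma^2_\delta + \frac{1}{c}\sigma^2_\epsilon(1) + \frac{1}{1-c}\sigma^2_\epsilon(0)\right). \tag{37}
\end{equation}

Now the OLS residuals from the TWFE DID regression are such that, for $j \in \mathcal{I}_1$ and $j \in \Lambda_f$,
\begin{equation}
\widehat{W}_j = \nabla Y_j - \frac{1}{N_1}\sum_{k\in\mathcal{I}_1}\nabla Y_k = \nabla\lambda(f) + \nabla\epsilon_j - \frac{2}{N_1}\sum_{f'=1}^{N_1/2}\nabla\lambda(f') - \frac{1}{N_1}\sum_{k\in\mathcal{I}_1}\nabla\epsilon_k. \tag{38}
\end{equation}
Given Assumption A.1,
\begin{equation}
\frac{1}{N_1}\sum_{j\in\mathcal{I}_1}\widehat{W}_j^2 = \sigma^2_\lambda + \sigma^2_\epsilon(1) + o_p(1). \tag{39}
\end{equation}
Using similar calculations for the control groups, we have that, up to a degrees-of-freedom correction,
\begin{align}
N\,\widehat{\mathrm{var}}(\hat{\alpha})_{Cluster} &= N\left[\frac{1}{N_1}\left(\frac{1}{N_1}\sum_{j\in\mathcal{I}_1}\widehat{W}_j^2\right) + \frac{1}{N_0}\left(\frac{1}{N_0}\sum_{j\in\mathcal{I}_0}\widehat{W}_j^2\right)\right] \tag{40}\\
&= \frac{1}{c}\sigma^2_\lambda + \frac{1}{1-c}\sigma^2_\delta + \frac{1}{c}\sigma^2_\epsilon(1) + \frac{1}{1-c}\sigma^2_\epsilon(0) + o_p(1). \tag{41}
\end{align}
Combining equations (37) and (40) finishes the proof.

Proposition A.1 shows that, in this setting, the TWFE estimator is asymptotically normal.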
However, CRVE will underestimate the asymptotic variance of the TWFE estimator. Moreover, if we assume $\lambda_t(f)$ and $\delta_t(f)$ are serially positively correlated, with stronger dependence relative to the idiosyncratic shocks, then the distortion in the variance due to spatial correlation would be less relevant if we consider a shorter distance between the initial and final periods. These are essentially the same conclusions from Proposition 2.1, but for a spatial correlation model based on a linear factor model in which the number of factors increases with $N$. This allows for settings in which the spatial correlation is strongly mixing, as considered by Ferman (2020).

A.3 Case in which $(\mu^e_1 - \mu^e_0)' E[(\nabla\lambda)'(\nabla\lambda)](\mu^e_1 - \mu^e_0) = 0$

We consider now the case in which $(\mu^e_1 - \mu^e_0)' E[(\nabla\lambda)'(\nabla\lambda)](\mu^e_1 - \mu^e_0) = 0$, but $\mathrm{var}(\nabla\lambda)$ does not drift to zero.

From equation (15),
\begin{align}
\sqrt{N}(\hat{\alpha} - \alpha) ={}& \nabla\lambda\frac{\sqrt{N}}{N_1}\sum_{j\in\mathcal{I}_1}(\mu_j - \mu^e_1) + \frac{\sqrt{N}}{N_1}\sum_{j\in\mathcal{I}_1}\left[(\nabla\alpha_j - \alpha) + \nabla\epsilon_j\right] \notag\\
&- \nabla\lambda\frac{\sqrt{N}}{N_0}\sum_{j\in\mathcal{I}_0}(\mu_j - \mu^e_0) - \frac{\sqrt{N}}{N_0}\sum_{j\in\mathcal{I}_0}\nabla\epsilon_j. \tag{42}
\end{align}
Therefore, assuming that $\mu_j$ and $\nabla\epsilon_j$ are independent, the asymptotic distribution of $\sqrt{N}(\hat{\alpha} - \alpha)$ is given by
\begin{equation}
\sqrt{N}(\hat{\alpha} - \alpha) \overset{d}{\to} \nabla\lambda\left(\frac{1}{\sqrt{c}}W_1 - \frac{1}{\sqrt{1-c}}W_0\right) + \frac{1}{\sqrt{c}}\sigma_\epsilon(1)Z_1 - \frac{1}{\sqrt{1-c}}\sigma_\epsilon(0)Z_0, \tag{43}
\end{equation}
where $W_w \sim N(0, \Sigma_\mu(w))$, and $\nabla\lambda$, $W_1$, $W_0$, $Z_1$, and $Z_0$ are mutually independent.

The first conclusion is that the DID estimator is consistent, but it is generally not asymptotically normal. Note that the asymptotic distribution of $\hat{\alpha}$ is closer to normal if the second moments of $\nabla\lambda$ are closer to zero.

Moreover, we have that the asymptotic variance of $\hat{\alpha}$ is given by
\begin{align}
\mathrm{a.var}(\sqrt{N}(\hat{\alpha} - \alpha)) &= \frac{1}{c}\sigma^2_\epsilon(1) + \frac{1}{1-c}\sigma^2_\epsilon(0) + \frac{1}{c}E\left[(\nabla\lambda)(\Sigma_\mu(1))(\nabla\lambda)'\right] \tag{44}\\
&\quad + \frac{1}{1-c}E\left[(\nabla\lambda)(\Sigma_\mu(0))(\nabla\lambda)'\right]. \tag{45}
\end{align}

In contrast, we have that
\begin{equation*}
N\,\widehat{\mathrm{var}}(\hat{\alpha})_{Cluster} = \frac{1}{c}(\nabla\lambda)(\Sigma_\mu(1))(\nabla\lambda)' + \frac{1}{c}\sigma^2_\epsilon(1) + \frac{1}{1-c}(\nabla\lambda)(\Sigma_\mu(0))(\nabla\lambda)' + \frac{1}{1-c}\sigma^2_\epsilon(0) + o_p(1),
\end{equation*}
implying that
\begin{align*}
\mathrm{a.var}(\sqrt{N}(\hat{\alpha} - \alpha)) - N\,\widehat{\mathrm{var}}(\hat{\alpha})_{Cluster} ={}& \frac{1}{c}\left\{E\left[(\nabla\lambda)(\Sigma_\mu(1))(\nabla\lambda)'\right] - (\nabla\lambda)(\Sigma_\mu(1))(\nabla\lambda)'\right\}\\
&+ \frac{1}{1-c}\left\{E\left[(\nabla\lambda)(\Sigma_\mu(0))(\nabla\lambda)'\right] - (\nabla\lambda)(\Sigma_\mu(0))(\nabla\lambda)'\right\} + o_p(1).
\end{align*}

Therefore, another distortion comes from the fact that spatial correlation implies that we would not have a consistent estimator for the asymptotic variance of $\hat{\alpha}$, because the residuals would depend on the realization of $\nabla\lambda$. In this case, even asymptotically, the CRVE (multiplied by $N$) would differ from the asymptotic variance of $\hat{\alpha}$ due to the differences $(\nabla\lambda)(\Sigma_\mu(w))(\nabla\lambda)' - E[(\nabla\lambda)(\Sigma_\mu(w))(\nabla\lambda)']$ for $w \in \{0,1\}$. While the expected values of these differences are equal to zero, this can generate some size distortions, because the distribution of the test statistic would not be asymptotically normal. Again, if $\lambda_t$ is serially positively correlated, with stronger dependence relative to the idiosyncratic shocks, then these terms become less relevant when we consider shorter time ranges. These terms also become less relevant if $\Sigma_\mu(w) = \mathrm{var}(\mu_j \mid D_j = w) \approx 0$.

If we allow $\mu_j$ to be spatially correlated, then we would potentially have an additional problem for inference. The intuition is that, in this case, an average of $N_1$ observations of $\mu_j(f)$ for the treated groups would be less informative about $\mu^e_1(f)$ than the same average if $\mu_j(f)$ were independent across $j$. As a consequence, estimated standard errors that ignore this spatial correlation would be under-estimated, which would lead to over-rejection. Again, this problem becomes less relevant if the second moment of the distribution of $\nabla\lambda$ is smaller.
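As a numerical illustration of the variance comparison in Proposition A.1 (Appendix A.2), the following sketch simulates the differenced outcomes $\nabla Y_j$ directly under the pairs structure and compares the CRVE with the true variance in equation (33). It is a minimal sketch under stated assumptions, not the simulation code used in the paper: the function name `simulate_pairs_model`, the normal distributions, the sample sizes, and the choice $\sigma_\lambda = \sigma_\delta = \sigma_\epsilon = 1$ are all assumptions made for illustration, and the CRVE is computed without a degrees-of-freedom correction.

```python
import numpy as np


def simulate_pairs_model(n_treated=100, n_control=100, sigma_lambda=1.0,
                         sigma_eps=1.0, n_reps=2000, seed=0):
    """Simulate the pairs model of Appendix A.2 in differenced form (alpha = 0).

    Illustrative assumptions: normal shocks, the same standard deviation for
    the pair-level shocks of treated and control groups, even group counts.
    """
    rng = np.random.default_rng(seed)

    def draw_arm(n):
        # One common shock per pair, shared by both members of the pair.
        pair_shock = rng.normal(0.0, sigma_lambda, size=(n_reps, n // 2))
        common = np.repeat(pair_shock, 2, axis=1)
        return common + rng.normal(0.0, sigma_eps, size=(n_reps, n))

    dy1 = draw_arm(n_treated)   # differenced outcomes, treated groups
    dy0 = draw_arm(n_control)   # differenced outcomes, control groups

    # DID estimator: difference in means of the differenced outcomes.
    alpha_hat = dy1.mean(axis=1) - dy0.mean(axis=1)

    # Residuals and CRVE clustered at the group level, as in (18) and (40).
    w1 = dy1 - dy1.mean(axis=1, keepdims=True)
    w0 = dy0 - dy0.mean(axis=1, keepdims=True)
    crve = (w1 ** 2).sum(axis=1) / n_treated ** 2 \
        + (w0 ** 2).sum(axis=1) / n_control ** 2

    # True variance from equation (33).
    true_var = (2 * sigma_lambda ** 2 + sigma_eps ** 2) \
        * (1 / n_treated + 1 / n_control)

    t_stat = alpha_hat / np.sqrt(crve)
    return {
        "true_var": true_var,
        "mc_var": alpha_hat.var(),
        "mean_crve": crve.mean(),
        "rejection_rate": np.mean(np.abs(t_stat) > 1.96),
    }


if __name__ == "__main__":
    for key, value in simulate_pairs_model().items():
        print(f"{key}: {value:.4f}")
```

Under these assumed parameter values, equations (33) and (41) imply that the CRVE should be close to two thirds of the true variance, so a nominal 5% test based on the CRVE is expected to over-reject.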
A.4 Model-based versus Design-based uncertainty

We focus in this paper on a model-based approach in which uncertainty is modeled as a repeated sampling framework on the errors. We provide a simple example here showing that the main intuitions in this paper would also apply if we consider a design-based approach, in which uncertainty comes from the treatment allocation. In this case, we condition on a realization of the potential outcomes, and spatial correlation is captured by considering that the treatment allocation is spatially correlated. Abadie et al. (2017) also consider the case in which treatment allocation is spatially correlated. The difference is that here we are fundamentally interested in the case in which it is not possible to cluster at the treatment assignment level. This may be the case, for example, because the researcher does not have information on the relevant distance metric, or because there are too few clusters to rely on CRVE at the assignment level.

Consider a very simple example where states $j = 1, ..., N$ are partitioned into equally-sized groups of states $\Lambda_1, ..., \Lambda_F$, and potential outcomes are given by
\begin{equation}
\begin{cases}
Y_{jt}(0) = \theta_j + \gamma_t + \sum_{f=1}^{F}\lambda_t(f)1\{j \in \Lambda_f\} + \epsilon_{jt}\\
Y_{jt}(1) = \alpha_{jt} + Y_{jt}(0).
\end{cases} \tag{46}
\end{equation}
Note that this is a particular case of the potential outcomes model determined by equation (5). For simplicity, we assume treatment effects are homogeneous, and consider the case in which $\alpha_{jt} = 0$ for all $j$ and $t$. Therefore, our estimand, which is the finite-population analogue to the average treatment effect, is equal to zero.

We think of this as a “super-population” model from which the finite population is drawn. Therefore, when we consider such design-based approach, we condition on the realizations of $\theta_j$, $\gamma_t$, $\lambda_t(f)$ for $f = 1, ..., F$, and $\epsilon_{jt}$. To capture spatial correlation problems, we consider that treatment allocation is such that $F_1 = F/2$ groups of states are selected for treatment. Let $\pi_j$ be the marginal probability of treatment for state $j$.
From the results derived by Rambachan and Roth (2020), it is clear that the DID estimator is unbiased over the treatment assignment distribution, because $\pi_j = 1/2$ for all $j$.

Footnote: More generally, Rambachan and Roth (2020) show that the DID estimator is unbiased over the randomization distribution if $\sum_{j=1}^{N}(\pi_j - \bar{\pi})(\nabla Y_j(0)) = 0$. Considering an alternative randomization distribution that satisfies this condition on the marginal probabilities of treatment assignment does not change our main conclusions.

Using Lemma 5 from Barrios et al. (2012), the exact variance of the DID estimator under this spatially correlated treatment assignment, conditional on the potential outcomes, is given by
\begin{equation}
V_{corr} = \frac{4}{F(F-1)}\sum_{f=1}^{F}\left(\nabla\lambda(f) - \nabla\bar{\lambda} + \nabla\bar{\epsilon}_f - \nabla\bar{\epsilon}\right)^2, \tag{47}
\end{equation}
where $\nabla\bar{\lambda} = \frac{1}{F}\sum_{f=1}^{F}\nabla\lambda(f)$, $\nabla\bar{\epsilon}_f = \frac{1}{N/F}\sum_{j\in\Lambda_f}\nabla\epsilon_j$, and $\nabla\bar{\epsilon} = \frac{1}{N}\sum_{i=1}^{N}\nabla\epsilon_i$.

In contrast, if we considered that treatment was assigned with no spatial correlation, then the variance would be given by
\begin{equation}
V_{uncorr} = \frac{4}{N(N-1)}\sum_{j=1}^{N}\left(\sum_{f=1}^{F}\left[\nabla\lambda(f)1\{j \in \Lambda_f\}\right] - \nabla\bar{\lambda} + \nabla\epsilon_j - \nabla\bar{\epsilon}\right)^2. \tag{48}
\end{equation}

Note that CRVE at the state level would approximate $V_{uncorr}$. Therefore, we consider the extent to which $V_{uncorr}$ underestimates $V_{corr}$:
\begin{align}
V_{corr} - V_{uncorr} ={}& \frac{4}{F}\sum_{f=1}^{F}\left(\nabla\lambda(f) - \nabla\bar{\lambda}\right)^2\left[\frac{1}{F-1} - \frac{1}{N-1}\right] + \frac{4}{N(N-1)}\sum_{j=1}^{N}(\nabla\epsilon_j - \nabla\bar{\epsilon})^2\left[\frac{F}{F-1}\frac{N-1}{N} - 1\right] \tag{49}\\
&+ \frac{8}{N}\sum_{j=1}^{N}\left(\sum_{f=1}^{F}\left[\nabla\lambda(f)1\{j \in \Lambda_f\}\right] - \nabla\bar{\lambda}\right)(\nabla\epsilon_j - \nabla\bar{\epsilon})\left[\frac{1}{F-1} - \frac{1}{N-1}\right] \notag\\
&+ \frac{F}{F-1}\frac{N-1}{N}\frac{4}{N(N-1)}\sum_{f=1}^{F}\sum_{i \neq j,\; i,j \in \Lambda_f}(\nabla\epsilon_i - \nabla\bar{\epsilon})(\nabla\epsilon_j - \nabla\bar{\epsilon}). \notag
\end{align}
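The gap between the variance under spatially correlated assignment and the variance under uncorrelated assignment can be illustrated numerically. The sketch below fixes one realization of the differenced potential outcomes $\nabla Y_j(0)$, evaluates the two closed-form variances (with no treatment effect and half of the groups or half of the states treated, these are the usual randomization variances of a difference in means), and checks them against draws from the two assignment mechanisms. The function names, the normal draws, and the sizes $N = 200$ and $F = 10$ are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def closed_form_variances(dy, groups):
    """Return (V_corr, V_uncorr) for fixed differenced outcomes dy.

    Assumes no treatment effect, equal-sized groups, and that half of the
    groups (V_corr) or half of the states (V_uncorr) are treated.
    """
    n = dy.size
    labels = np.unique(groups)
    f = labels.size
    group_means = np.array([dy[groups == g].mean() for g in labels])
    v_corr = 4.0 / (f * (f - 1)) * np.sum((group_means - dy.mean()) ** 2)
    v_uncorr = 4.0 / (n * (n - 1)) * np.sum((dy - dy.mean()) ** 2)
    return v_corr, v_uncorr


def randomization_variance(dy, groups, by_group, n_draws=5000, seed=0):
    """Monte Carlo variance of the DID estimator over treatment assignments."""
    rng = np.random.default_rng(seed)
    labels = np.unique(groups)
    estimates = np.empty(n_draws)
    for r in range(n_draws):
        if by_group:
            # Spatially correlated assignment: treat whole groups of states.
            treated_labels = rng.choice(labels, size=labels.size // 2,
                                        replace=False)
            treated = np.isin(groups, treated_labels)
        else:
            # Assignment with no spatial correlation: treat individual states.
            treated = np.zeros(dy.size, dtype=bool)
            treated[rng.choice(dy.size, size=dy.size // 2,
                               replace=False)] = True
        estimates[r] = dy[treated].mean() - dy[~treated].mean()
    return estimates.var()


def make_example(n=200, f=10, seed=1):
    """One hypothetical realization of the differenced potential outcomes."""
    rng = np.random.default_rng(seed)
    groups = np.repeat(np.arange(f), n // f)
    # Group-level common shock plus idiosyncratic shock.
    dy = rng.normal(size=f)[groups] + rng.normal(size=n)
    return dy, groups


if __name__ == "__main__":
    dy, groups = make_example()
    v_corr, v_uncorr = closed_form_variances(dy, groups)
    print("V_corr:", v_corr, "simulated:", randomization_variance(dy, groups, True))
    print("V_uncorr:", v_uncorr, "simulated:", randomization_variance(dy, groups, False))
```

With group-level shocks of the same order as the idiosyncratic shocks, the variance under group-level assignment is typically several times larger than the variance that a state-level CRVE would approximate.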
We consider first the case in which $F$ is fixed and $N \to \infty$. All conclusions remain valid if we consider a setting in which both $F$ and $N \to \infty$, similarly to what we show in Appendix A.2 for the model-based case. We discuss this case below.

The literature on design-based uncertainty imposes assumptions on the sequence of potential outcomes of the finite populations. We can think that there is a super-population from which we draw such finite population. In this case, we think about potential outcomes in this super-population as random variables. In this super-population, we assume $\epsilon_{jt}$ are independent across $j$, as we do in Section 2, implying that, when $N \to \infty$, the last three terms in equation (49) converge almost surely to zero. Therefore, to be consistent with the assumptions on the super-population, we assume that the sequence of finite populations is such that these three terms converge to zero (in this case, these terms are non-stochastic sequences).

The first term of equation (49), however, converges to $\frac{4}{F(F-1)}\sum_{f=1}^{F}\left(\nabla\lambda(f) - \nabla\bar{\lambda}\right)^2 > 0$ when $N \to \infty$. Importantly, if the variance of $\nabla\lambda(f)$ is lower in the super-population, then the probability that we condition on a realization of $(\nabla\lambda(1), ..., \nabla\lambda(F))$ such that $\frac{4}{F(F-1)}\sum_{f=1}^{F}\left(\nabla\lambda(f) - \nabla\bar{\lambda}\right)^2$ is larger would be lower. Therefore, we reach exactly the same conclusion from Proposition 2.1, where we considered a model-based uncertainty. In this setting, the estimator would not generally be consistent and asymptotically normal, similarly to what we find in Proposition 2.1. However, the extent to which we underestimate the variance, and to which we depart from asymptotic normality, depends on the term $\frac{4}{F(F-1)}\sum_{f=1}^{F}\left(\nabla\lambda(f) - \nabla\bar{\lambda}\right)^2 > 0$, which we expect to be smaller when the variance of $\nabla\lambda(f)$ is smaller in the super-population.

The case with $F \to \infty$ is similar, with the difference that in this case we would find $F(V_{corr} - V_{uncorr}) \to K \times \lim_{F \to \infty}\frac{1}{F}\sum_{f=1}^{F}\left(\nabla\lambda(f) - \nabla\bar{\lambda}\right)^2$, for some constant $K$. This constant is greater than zero if $F/N \to c \in [0,1)$. If $F/N \to 1$, then $K = 0$, so the variance is not underestimated. This happens when we have a very large number of groups of states with only one state, which essentially means that we do not have much spatial correlation, so it is reasonable that the variance is not underestimated in this case.

Finally, if we relax the assumption that treatment effects are homogeneous, then, following Abadie et al. (2020), Abadie et al. (2017), and Rambachan and Roth (2020), we should expect CRVE to be conservative relative to $V_{uncorr}$. While this could partially offset the underestimation of $V_{corr}$ when spatial correlation is not taken into account, the same conclusions about when spatial correlation should lead to more significant problems for inference would still apply.

A.5 Second moment of pre- vs post-treatment averages differences
We consider here in detail how including more time periods affects the variance of the difference between the pre- and post-treatment averages of common factors and idiosyncratic shocks. Consider a random variable $X_t$ that follows an AR(1) process, $X_t = \rho X_{t-1} + \nu_t$, where $\nu_t$ is iid with variance $\sigma^2_\nu$. Since $X_t$ is stationary, $\mathrm{var}(X_t) = \mathrm{var}(X_{t-1}) = \frac{1}{1-\rho^2}\sigma^2_\nu$. Assume we have $T$ periods and define $\bar{X}_{pre}$ as the average in the first half of the periods, and $\bar{X}_{post}$ as the average in the second half of the periods. In this case,
\begin{equation}
E[(\nabla X)^2] = \frac{4}{T^2(1-\rho)^2}\left[T - \frac{2\rho}{1-\rho^2}\left(3 - \rho^{T/2}\right)\left(1 - \rho^{T/2}\right)\right]\sigma^2_\nu. \tag{50}
\end{equation}

When we vary $T$, note that there are two forces at play. On the one hand, including more observations to estimate the post- and pre-treatment averages reduces the variance of each average. On the other hand, including more periods implies that the pre- and post-treatment averages will be less correlated, which implies that differencing will absorb less variability.

In Appendix Figure A.1, we present $E[(\nabla X)^2]$ as a function of $T$ for different values of $\rho$. To facilitate the comparison across $\rho$, for each $\rho$ we normalize $\sigma^2_\nu = \frac{1+\rho}{2}$, so that $E[(X_2 - X_1)^2] = 1$. Importantly, $E[(\nabla X)^2]$ goes to zero faster as $T$ increases when $\rho$ is lower. Interestingly, when $\rho$ is large enough, $E[(\nabla X)^2]$ increases with $T$ when $T$ is small, but it eventually starts to decrease and converges to zero when $T$ is large enough. Therefore, if we consider two series, $X_t$ and $Z_t$, where $\rho_x > \rho_z$, then $E[(\nabla X)^2]/E[(\nabla Z)^2]$ will be increasing in $T$.

A.6 Monte Carlo Simulations - Two-way Cluster
We present here a small Monte Carlo (MC) simulation to analyze the properties of the two-way cluster in a DID setting. We consider a simple example with 100 groups, half treated and half control, in which $Y_{jt} = \lambda_{1t} + \epsilon_{jt}$ when $j \in \mathcal{I}_1$ and $Y_{jt} = \lambda_{0t} + \epsilon_{jt}$ when $j \in \mathcal{I}_0$. We draw $\epsilon_{jt}$ from a mean-zero normal distribution for all $j$ and $t$. We also set $E[\lambda_{wt}] = 0$ for $w \in \{0,1\}$ and for all $t$, so the DID estimator is unbiased. However, the $\lambda_{wt}$ generates important spatial correlation that is not absorbed by the time fixed effects.

The $\lambda_{wt}$ follows an AR(1) process with parameter $\rho$. We consider three values of $\rho$ (including $\rho = 0$) and three values of $T$ (including $T = 2$). In all simulations, treatment starts after period $T/2$. Appendix Table A.1 presents rejection rates based on (i) robust standard errors (with no cluster), (ii) standard errors clustered at the group level, and (iii) standard errors clustered at two levels, group and time. As expected, there is a severe over-rejection when we consider inference without clustering, or clustering only at the group level. This happens because this data generating process includes substantial spatial correlation that is not captured by these variance estimators.

With $T = 2$, using a two-way cluster (at the time and group levels) does not solve the problem. The limitation of the two-way cluster estimator in this case comes from the fact that there is only one post-treatment period and one pre-treatment period. When $\rho = 0$, rejection rates converge to 5% when $T$ increases. When $\rho > 0$, however, there is still over-rejection even when $T$ is large. Moreover, the over-rejection is increasing with $\rho$.

These results confirm the intuition presented in Section 2, that two-way cluster procedures may underestimate the standard errors, because they fail to take into account correlations between $\eta_{jt}$ and $\eta_{j't'}$, for $j \neq j'$ and $t \neq t'$. Note that the only case in which such correlation would not appear would be when $\rho = 0$. In this case, we show that two-way cluster would work well when $T$ is large. In contrast, when $\rho > 0$, two-way cluster would still lead to over-rejection even when $T$ is large.

A.7 Pre-testing for spatial correlation problems
In settings with more than one pre-treatment period, it is also possible to conduct placebo exercises to test whether spatial correlation is a problem. For example, consider a setting with two pre-treatment periods ($t \in \{-1, 0\}$) and one post-treatment period ($t \in \{1\}$). In this case, we can consider an estimator for the treatment effect using periods $t \in \{0, 1\}$, $\hat{\alpha}_1 = \frac{1}{N_1}\sum_{i\in\mathcal{I}_1}\Delta Y_{i1} - \frac{1}{N_0}\sum_{i\in\mathcal{I}_0}\Delta Y_{i1}$, where for a generic variable $A_t$, $\Delta A_t = A_t - A_{t-1}$, and use the pre-treatment periods to test whether inference based on CRVE is reliable. In this case, we would test whether $\hat{\alpha}_0 = \frac{1}{N_1}\sum_{i\in\mathcal{I}_1}\Delta Y_{i0} - \frac{1}{N_0}\sum_{i\in\mathcal{I}_0}\Delta Y_{i0}$ is different from zero. This has been widely considered in the literature as a test for pre-trends (e.g., Freyaldenhoven et al. (2019), Kahn-Lang and Lang (2019), and Roth (2019)). In contrast, here we assume that trends are parallel, so $E[\hat{\alpha}_0] = 0$ and $E[\hat{\alpha}_1] = \alpha$, and show that such a test can also be informative about whether spatially correlated shocks pose relevant problems for inference.

Footnote: In a revised version of his paper developed concurrently with our paper, Roth (2019) considers in Appendix D simulations in a setting with stochastic violations of parallel trends that is similar to our setting with spatially correlated shocks.

We consider in detail the case in which potential outcomes are given by equation (5), but all our results are valid for more general settings. Under Assumptions 2.1 and 2.2, and considering that $N \to \infty$, we have from Proposition 2.1 that
\begin{equation}
\widehat{\mathrm{var}}(\hat{\alpha}_\tau)_{Cluster} = \mathrm{var}(\hat{\alpha}_\tau) - (\mu^e_1 - \mu^e_0)' E\left[(\Delta\lambda_\tau)'(\Delta\lambda_\tau)\right](\mu^e_1 - \mu^e_0) + o_p(1). \tag{51}
\end{equation}

If $E[(\Delta\lambda_0)'(\Delta\lambda_0)] \approx E[(\Delta\lambda_1)'(\Delta\lambda_1)]$, then rejecting the null that $E[\hat{\alpha}_0] = 0$ would provide evidence that $\mathrm{var}(\hat{\alpha}_0)$ is underestimated when we consider $\widehat{\mathrm{var}}(\hat{\alpha}_0)_{Cluster}$, which in turn would indicate that $\mathrm{var}(\hat{\alpha}_1)$ is underestimated when we consider $\widehat{\mathrm{var}}(\hat{\alpha}_1)_{Cluster}$.
In an extreme example in which the error term follows equation (14), the pre-test would be completely uninformative, because in this case the year fixed effects would absorb the common shocks in the pre-test, but would not absorb the common shocks in the main test. If, on the other hand, we assume that common factors are stationary, then $E[(\Delta\lambda_0)'(\Delta\lambda_0)] = E[(\Delta\lambda_1)'(\Delta\lambda_1)]$ and the pre-test would be informative.

Building on the setup considered by Roth (2019), we consider a setting where $(\hat{\alpha}_0, \hat{\alpha}_1)$ is jointly normally distributed,
\begin{equation}
\begin{pmatrix}\hat{\alpha}_0\\ \hat{\alpha}_1\end{pmatrix} \sim N\left(\begin{pmatrix}0\\ \alpha\end{pmatrix}, \begin{pmatrix}\mathrm{var}(\hat{\alpha}_0) & \mathrm{cov}(\hat{\alpha}_0, \hat{\alpha}_1)\\ \mathrm{cov}(\hat{\alpha}_0, \hat{\alpha}_1) & \mathrm{var}(\hat{\alpha}_1)\end{pmatrix}\right). \tag{52}
\end{equation}

There are two important differences relative to the analysis from Roth (2019). First, we assume that $(\hat{\alpha}_0, \hat{\alpha}_1)$ are unbiased, so we can focus on the problem of spatial correlation. Second, in our setting, if there are spatially correlated shocks, then a researcher considering CRVE would be relying on an incorrect variance/covariance matrix for $(\hat{\alpha}_0, \hat{\alpha}_1)$. We assume that the researcher relies on $\widetilde{\mathrm{var}}(\hat{\alpha}_\tau) = \mathrm{var}(\hat{\alpha}_\tau) - (\mu^e_1 - \mu^e_0)' E[(\Delta\lambda_\tau)'(\Delta\lambda_\tau)](\mu^e_1 - \mu^e_0)$. Therefore, the researcher would rely on the correct variance matrix if $(\mu^e_1 - \mu^e_0)' E[(\Delta\lambda_\tau)'(\Delta\lambda_\tau)](\mu^e_1 - \mu^e_0) = 0$, but would underestimate the true variance if $(\mu^e_1 - \mu^e_0)' E[(\Delta\lambda_\tau)'(\Delta\lambda_\tau)](\mu^e_1 - \mu^e_0) > 0$. We can think of this normal model as an approximation using Corollary 2.1.

By construction, if $(\mu^e_1 - \mu^e_0)' E[(\Delta\lambda_0)'(\Delta\lambda_0)](\mu^e_1 - \mu^e_0) = 0$, then pre-testing $E[\hat{\alpha}_0] = 0$ with a 5% level test would reject the null 5% of the time. In contrast, if $(\mu^e_1 - \mu^e_0)' E[(\Delta\lambda_0)'(\Delta\lambda_0)](\mu^e_1 - \mu^e_0) > 0$, then the distribution of the t-statistic would have a variance larger than one, which implies that the test would reject at a higher rate than 5%. An immediate consequence is that we should expect a larger fraction of applications “surviving” such pre-test when $(\mu^e_1 - \mu^e_0)' E[(\Delta\lambda_0)'(\Delta\lambda_0)](\mu^e_1 - \mu^e_0)$ is smaller. If we believe $E[(\Delta\lambda_0)'(\Delta\lambda_0)] \approx E[(\Delta\lambda_1)'(\Delta\lambda_1)]$, then this would also imply that the probability of surviving the pre-test would be decreasing with the degree to which $\mathrm{var}(\hat{\alpha}_1)$ is underestimated. It is important to understand, however, what the properties of the estimator $\hat{\alpha}_1$ are when we condition on surviving such pre-test.

Let $B$ be the set of values for $\hat{\alpha}_0$ such that we fail to reject the null in the pre-test using a $t$-test based on $\hat{\alpha}_0/\sqrt{\widetilde{\mathrm{var}}(\hat{\alpha}_0)}$. In this case, the pre-test is symmetric in the sense that $\hat{\alpha}_0$ is rejected if and only if $-\hat{\alpha}_0$ is rejected, even if $\mathrm{var}(\hat{\alpha}_0) > \widetilde{\mathrm{var}}(\hat{\alpha}_0)$. The only difference is that the probability of rejecting the null for a 5% level test would be 5% if $\mathrm{var}(\hat{\alpha}_0) = \widetilde{\mathrm{var}}(\hat{\alpha}_0)$, and would be increasing in $\mathrm{var}(\hat{\alpha}_0) - \widetilde{\mathrm{var}}(\hat{\alpha}_0)$. Therefore, from Proposition 3.1 and Corollary 3.1 from Roth (2019) we have that $E[\hat{\alpha}_1 \mid \hat{\alpha}_0 \in B] = \alpha$, so the DID estimator $\hat{\alpha}_1$ remains unbiased even if we condition on passing such pre-test, regardless of whether there is spatial correlation. Of course, this conclusion remains valid if we consider different significance levels for the pre-test. Moreover, since $B$ is a convex set, from Proposition 3.3 from Roth (2019), we also have that $\mathrm{var}(\hat{\alpha}_1 \mid \hat{\alpha}_0 \in B) \leq \mathrm{var}(\hat{\alpha}_1)$.

Taken together, these results show that pre-testing for spatial correlation can be informative about whether inference based on CRVE is reliable, and such pre-testing would not exacerbate the problem in case it fails to detect relevant spatial correlation due to noise in the data.
This differs from the conclusions of Roth (2019) on testing for pre-trends, where conditioning on passing a pre-test for violations of parallel trends may exacerbate the problem if the parallel trends assumption does not hold. If there are no spatially correlated shocks, then we should expect testing $\alpha = 0$ to have the correct level if we condition on $\hat{\alpha}_{\text{pre}} \in B$, although it may be conservative. If there are spatially correlated shocks, then conditioning on $\hat{\alpha}_{\text{pre}} \in B$ implies that we should not expect more over-rejection than if we did not consider a pre-test. Moreover, if we condition on applications that pass the pre-test, then we should expect relatively fewer empirical applications in which the CRVE grossly underestimates the variance.

In a simple case with $T = 3$, so that we pre-test with $\hat{\alpha}_{\text{pre}}$ and estimate the effect with $\hat{\alpha}_{\text{post}}$, and with stationary common and idiosyncratic shocks, we would detect spatially correlated shocks $x\%$ of the time if the probability of rejecting the null in the main test is $x\%$. Since $\mathrm{var}(\hat{\alpha}_{\text{post}} \mid \hat{\alpha}_{\text{pre}} \in B) \leq \mathrm{var}(\hat{\alpha}_{\text{post}})$, we should expect a slightly lower rejection rate once we condition on passing the pre-test. Of course, conditional on the degree of spatial correlation, having more pre-treatment periods implies that the probability of passing the pre-test will be lower.

We present in Appendix A.8 a simple MC exercise where we consider how the probability of passing the pre-test and the rejection rates conditional on passing the pre-test depend on the number of pre-treatment periods and on the serial correlation of the common shocks. As expected, increasing the number of pre-treatment periods substantially decreases the probability of passing the pre-test, and this reduction is faster the stronger the spatial correlation. Moreover, when common shocks are serially correlated, increasing the number of periods has a slightly larger impact on the probability of passing the test when there is spatial correlation.
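The design of this MC exercise can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind Appendix Table A.2: idiosyncratic shocks are given unit variance, the pre-test compares each earlier period with the last pre-treatment period, every test is a two-sample t-test on group-level differences with a 1.96 critical value, and the names `did_tstat`, `simulate_pretest` and `sd_common` are hypothetical.

```python
import numpy as np


def did_tstat(y, treated, s, r):
    """t-statistic of a two-period DID (period r minus period s), treating
    groups as independent clusters, i.e. ignoring the common shocks."""
    d = y[:, r] - y[:, s]
    d1, d0 = d[treated], d[~treated]
    est = d1.mean() - d0.mean()
    se = np.sqrt(d1.var(ddof=1) / d1.size + d0.var(ddof=1) / d0.size)
    return est / se


def simulate_pretest(T, rho, sd_common, n_groups=100, n_sims=2000, seed=0):
    """Return (pass rate, unconditional rejection rate, conditional rejection rate)."""
    rng = np.random.default_rng(seed)
    treated = np.arange(n_groups) < n_groups // 2
    passed = np.zeros(n_sims, dtype=bool)
    rejected = np.zeros(n_sims, dtype=bool)
    for i in range(n_sims):
        # One stationary AR(1) common shock per arm (row 0: control, row 1: treated).
        lam = np.empty((2, T))
        lam[:, 0] = rng.normal(0.0, sd_common, size=2)
        for t in range(1, T):
            innov = rng.normal(0.0, sd_common * np.sqrt(1 - rho**2), size=2)
            lam[:, t] = rho * lam[:, t - 1] + innov
        eps = rng.normal(0.0, 1.0, size=(n_groups, T))  # assumed unit variance
        y = lam[treated.astype(int)] + eps  # treatment effect is zero
        # Pre-test: each earlier period against the last pre-treatment period.
        pre_stats = [did_tstat(y, treated, s, T - 2) for s in range(T - 2)]
        passed[i] = all(abs(ts) < 1.96 for ts in pre_stats)
        # Main test: last pre-treatment period against the treated period.
        rejected[i] = abs(did_tstat(y, treated, T - 2, T - 1)) > 1.96
    cond = rejected[passed].mean() if passed.any() else float("nan")
    return passed.mean(), rejected.mean(), cond
```

For example, `simulate_pretest(T=3, rho=0.0, sd_common=0.0)` corresponds to a design without common shocks, while raising `sd_common` or `T` should lower the share of draws that pass the pre-test.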
Overall, these simulations reveal that conditioning on passing a pre-test would reduce the relative proportion of applications with more severe over-rejection problems. However, there would still be scenarios with a high probability of passing the pre-test in which inference conditional on passing the pre-test would lead to severe over-rejection, particularly when the number of pre-treatment periods is small. This is consistent with the conclusions of Roth (2019).

Overall, our results complement the recent analysis of pre-testing in DID models by considering the case in which we may have inference problems due to spatial correlation. We show that such pre-testing can also be informative about spatial correlation problems, on top of being informative about violations of the parallel trends assumption. We also analyze the potential limitations of such pre-tests when we have spatial correlation problems, and provide guidance on how to make such pre-tests more informative.

A.8 MC simulations for pre-tests

We consider a simple MC exercise with 100 groups (half treated and half control), and $T$ periods, where treatment occurs only in the last period. The outcomes are given by $Y_{jt} = \mathbb{1}\{j \in I_1\}\lambda_{1t} + \mathbb{1}\{j \in I_0\}\lambda_{0t} + \epsilon_{jt}$, where $\epsilon_{jt}$ is normally distributed with mean zero and $\lambda_{wt}$, $w \in \{0, 1\}$, follows a stationary AR(1) process with parameter $\rho$. We assume that the treatment effect is zero in all simulations.

In panel A of Appendix Table A.2, we set $\mathrm{var}(\lambda_{1t}) = \mathrm{var}(\lambda_{0t}) = 0$, so the unconditional rejection rate in the main test is 5%. We consider that the pre-test passes if we fail to reject the null for all t-tests based on a comparison between periods $t \in \{1, \ldots, T-2\}$ and period $T-1$.
Conditional on passing the pre-test, the main test is conservative, with a probability of rejecting the null that is decreasing with $T$.

In panel B, we set the variance of $\lambda_{1t}$ and $\lambda_{0t}$ so that the unconditional probability of rejecting the null is 8% when we have only two periods, and vary the serial correlation of the common factors and the number of pre-treatment periods used for pre-testing. As expected, with $T = 3$ the probability of passing the test is around 92%. This probability is decreasing with $T$ and, importantly, it decreases faster than in the case with no spatial correlation (panel A). The probability of passing the test also decreases at a faster rate with $T$ when common shocks are serially correlated. The same patterns appear in panels C and D, where we consider a setting with stronger spatial correlation. In all scenarios, the rejection rate conditional on passing the pre-test is about the same as or lower than the unconditional rejection rate. However, there would still be scenarios with a high probability of passing the pre-test in which inference conditional on passing the pre-test would lead to severe over-rejection, particularly when the number of pre-treatment periods is small.

Figure A.1: Function $E[(\nabla X)^2]$ for different values of $T$ and $\rho$
Series: $\rho$ = 0, 0.2, 0.4, 0.6, 0.8, 0.9.
Notes: This figure presents $E[(\nabla X)^2]$ as a function of $T$ for different values of $\rho$. We set $\sigma_\nu$ as a function of $\rho$ so that $E[(\nabla X)^2] = 1$ when $T = 2$ for all values of $\rho$. We set the first half of the $T$ periods as pre-treatment and the second half as post-treatment.

Simulations with the ACS - same vs different census division
Panels: A. Log wages; B. Employment; C. Combined.
Series: same census division; different census division. Vertical axis: rejection rates. Horizontal axis: $\delta_{year}$.
Notes: This figure presents rejection rates for the simulations using ACS data, presented in Section 3.1. Each simulation has two states and two periods. We considered all combinations of pairs of states and years. The distance between the pre- and post-treatment periods ($\delta_{year}$) varies from 1 to 10 years. The pre-treatment period ranges from 2005 to 2017 $- \delta_{year}$. In all series, we consider treatment randomly allocated at the state level. For each simulation, we run a DID regression and test the null hypothesis using standard errors clustered at the PUMA level. Rejection rates are presented separately depending on whether the two states are from the same census division or not. The outcome variable is log(wages) (subfigure A) and employment status (subfigure B) for women aged between 25 and 50. In subfigure C, we combine the information from both outcomes. We consider only simulations with 20 or more treated and control PUMAs. Since there are relatively few simulations with states from the same census division, particularly when $\delta$ is large, we interpret these results with caution. For example, for $\delta = 10$ there are only 63 simulations with states from the same census division for each outcome, whereas for states in different census divisions there are 630 simulations. When $\delta = 1$, there are 544 simulations with states in the same census division, and 5102 with states in different census divisions.

Monte Carlo Simulations - Two-way Cluster
                  No cluster   Cluster at j   CGM      DHG
                  (1)          (2)            (3)      (4)
Panel i: ρ = 0
  T = 2           0.853        0.791          0.853    0.791
  T = 10          0.762        0.795          0.119    0.114
  T = 100         0.743        0.778          0.050    0.045
Panel ii: ρ = 0.
  T = 2           0.857        0.785          0.857    0.785
  T = 10          0.774        0.805          0.168    0.159
  T = 100         0.771        0.826          0.085    0.083
Panel iii: ρ = 0.
  T = 2           0.835        0.769          0.835    0.769
  T = 10          0.840        0.865          0.264    0.254
  T = 100         0.844        0.878          0.226    0.221

Notes: This table presents rejection rates for the simulations described in Appendix A.6. Column (1) presents rejection rates based on robust standard errors (with no cluster). Column (2) presents rejection rates based on standard errors clustered at the group level. Column (3) presents rejection rates based on two-way clustered standard errors at the group and time levels using Cameron et al. (2011). Column (4) presents rejection rates based on two-way clustered standard errors at the group and time levels using the correction proposed by Davezies et al. (2018).
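For reference, the two-way clustered variance of Cameron et al. (2011) used in column (3) combines three one-way cluster-robust terms: clustering by group, plus clustering by time, minus clustering by the group-time intersection. The sketch below is a minimal illustration for OLS, assuming no finite-sample corrections and using hypothetical names (`cluster_meat`, `twoway_cluster_vcov`). It is not the code used for the simulations in Appendix A.6, and it does not implement the Davezies et al. (2018) correction of column (4).

```python
import numpy as np


def cluster_meat(X, resid, ids):
    """Sum over clusters c of (X_c' u_c)(X_c' u_c)'."""
    scores = X * resid[:, None]
    _, inv = np.unique(ids, return_inverse=True)
    inv = inv.reshape(-1)
    sums = np.zeros((inv.max() + 1, X.shape[1]))
    np.add.at(sums, inv, scores)  # accumulate score vectors within each cluster
    return sums.T @ sums


def twoway_cluster_vcov(X, y, group, time):
    """OLS coefficients and two-way clustered variance (group and time)."""
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ (X.T @ y)
    resid = y - X @ beta
    # Label each group-time intersection with a unique integer.
    _, g = np.unique(group, return_inverse=True)
    _, t = np.unique(time, return_inverse=True)
    cell = g.reshape(-1) * (t.max() + 1) + t.reshape(-1)
    # Group term plus time term minus intersection term.
    meat = (cluster_meat(X, resid, group)
            + cluster_meat(X, resid, time)
            - cluster_meat(X, resid, cell))
    return beta, bread @ meat @ bread
```

Because one term is subtracted, the resulting matrix is not guaranteed to be positive semi-definite in finite samples, so the diagonal entries should be checked before taking square roots.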