Inference in Differences-in-Differences with Few Treated Units and Spatial Correlation

Abstract

We consider the problem of inference in Difference-in-Differences (DID) models when there are few treated units and errors are spatially correlated. We first show that, when there is a single treated unit, existing inference methods designed for settings with few treated and many control units remain asymptotically valid when errors are strongly mixing. However, these methods are invalid with more than one treated unit. We propose asymptotically valid, though generally conservative, inference methods for settings with more than one treated unit. These alternative inference methods are valid even when the relevant distance metric across units is unavailable. Moreover, they may be relevant alternatives even for settings in which the number of treated units is usually considered as large enough to rely on cluster-robust standard errors. We also present an empirical application that highlights some common misunderstandings in the use of randomization inference in DID applications, and illustrates how our results can be used to provide proper inference.


Bruno Ferman †
Sao Paulo School of Economics - FGV

First Draft: June 30th, 2020
This Draft: August 8th, 2020


Abstract

We consider the problem of inference in Differences-in-Differences models when there are few treated units and errors are spatially correlated. We first show that, when there is a single treated unit, existing inference methods designed for settings with few treated and many control units remain asymptotically valid when errors are strongly mixing. However, these methods are invalid with more than one treated unit. We propose asymptotically valid, though generally conservative, inference methods for settings with more than one treated unit. Our findings are relevant even for settings in which the number of treated units is usually considered as large enough to rely on cluster-robust standard errors.

Keywords: hypothesis testing; conservative test; causal inference; permutation tests

JEL Codes: C12; C21; C23; C33

∗ I would like to thank Luis Alvarez for comments and suggestions. Lucas Barros provided exceptional research assistance. I also thank Benjamin Sommers for useful discussions and for providing the county FIPS codes for the counties used in Sommers et al. (2014).

† email: [email protected]; address: Sao Paulo School of Economics, FGV, Rua Itapeva no. 474, Sao Paulo - Brazil, 01332-000; telephone number: +55 11 3799-3350

1 Introduction

Differences-in-Differences (DID) models present a series of challenges for inference. There is a large number of inference methods for DID, but the effectiveness of different solutions depends crucially on the set of assumptions one is willing to make on the errors, and on many features of the empirical design, such as the number of treated and control units. A common setting in which a satisfactory solution is not yet available is when (i) errors are possibly spatially correlated, (ii) there is no distance metric across units, and (iii) the number of periods is fixed (see Ferman (2019a)). We analyze this case in settings with a small number of treated and a large number of control units. Importantly, as we explain in detail below, our findings are relevant even for settings in which the number of treated units has been considered as large enough to rely on asymptotic approximations with a large number of both treated and control units.

We first derive conditions in which the inference methods proposed by Conley and Taber (2011) (henceforth, CT) and Ferman and Pinto (2019) (henceforth, FP) remain asymptotically valid in the presence of spatial correlation when there is a single treated unit. The main assumptions are that (i) the post-pre difference in average errors for each unit, whether treated or control, has the same marginal distribution (we can also allow for scale changes depending on observed covariates, as considered by FP), and (ii) the cross-section distribution of this post-pre difference in average errors is strongly mixing for the control units. We consider an asymptotic framework in which the number of treated units is fixed and the number of control units goes to infinity.
Under these conditions, the asymptotic distribution of the DID estimator depends only on the post-pre difference in average errors of the treated units (as originally shown by CT), and the residuals of the control units asymptotically recover the distribution of the errors of the treated units even when there is spatial correlation.

Footnote: A non-exhaustive list of papers that proposed and/or analyzed different inference methods in DID settings includes Arellano (1987), Bertrand et al. (2004), Cameron et al. (2008), Brewer et al. (2017), Conley and Taber (2011), Donald and Lang (2007), Ferman and Pinto (2019), Canay et al. (2017), and MacKinnon and Webb (2019). While not specific to DID settings, Ferman (2019b) can be used to provide a first assessment on whether inference methods that rely on asymptotic results are reliable in specific DID settings.

However, when the number of treated units is greater than one, the inference methods proposed by CT and FP will not be asymptotically valid if there is spatial correlation. The intuition is the same as the one outlined by Ferman (2019a) to show that the cluster-robust variance estimator (CRVE) is generally invalid when we have spatial correlation. When we (mistakenly) assume errors are independent across clusters, we underestimate the volatility of the average of the errors for the treated units. We propose an alternative inference method that is asymptotically valid, though generally conservative, in this setting. We also show that it is possible to construct a valid test that is less conservative, and a test for spatial correlation, in some specific settings.

While the methods proposed by CT and FP have usually been considered good alternatives when the number of treated units is very small, our conservative tests may provide an interesting alternative even for settings in which applied researchers have been confidently using alternatives such as CRVE.
The rationale is the following: if we assume errors are independent across groups, then there is a trade-off between relying on CRVE relative to CT and FP. While CRVE relies on weaker assumptions on the errors, it relies on an asymptotic theory that depends on a large number of both treated and control units. With, say, 20 or more treated units, the asymptotic approximation required by CRVE is generally considered good enough so that the benefits of relying on weaker assumptions dominate. However, when we allow for spatial correlation (and considering the case with more than one treated unit), all of those methods will generally be invalid. Moreover, there is as yet no solution available when there is no distance metric across units and the number of periods is small. The inference methods we propose are the only ones that are robust to spatial correlation in this setting.

We analyze the properties of the inference method proposed by FP and of two conservative [...]

Footnote: CT have in their appendix an example model that allows spatial dependence. However, their model requires an observed distance metric across units. We consider instead a setting in which such distance metric is not available, as is common in DID applications.

Footnote: Ferman (2019b) proposes an assessment that can be used to evaluate such trade-offs.

Let $Y_{it}(0)$ ($Y_{it}(1)$) be the potential outcome of unit $i$ at time $t$ when this unit is untreated (treated) at this period. We consider that potential outcomes are given by

$$Y_{it}(0) = \theta_i + \gamma_t + \eta_{it}, \qquad Y_{it}(1) = \alpha_{it} + Y_{it}(0), \tag{1}$$

where $\alpha_{it}$ is the (possibly heterogeneous) treatment effect for unit $i$ at time $t$, $\theta_i$ are time-invariant unobserved effects, and $\gamma_t$ are unit-invariant unobserved effects. The error term $\eta_{it}$ represents unobserved determinants of $Y_{it}(0)$ that are not captured by the fixed effects. This implies a DID model

$$Y_{it} = \alpha d_{it} + \theta_i + \gamma_t + (\alpha_{it} - \alpha) d_{it} + \eta_{it}, \tag{2}$$

where $Y_{it}$ is the observed outcome variable for unit $i$ at time $t$, $d_{it}$ is an indicator variable equal to one if unit $i$ is treated at time $t$, and zero otherwise, and we define $\alpha$ as the TWFE estimand. We focus on the case in which $d_{it}$ changes to 1 for all treated units starting after date $t^*$, and we consider a unit × time aggregate model. In this setting, the TWFE estimand equals the average treatment effect over $i$ and $t$ for the treated units in the post-treatment periods. We assume $\{\alpha_{i1}, ..., \alpha_{iT}\}_{i \in \mathcal{I}_1}$ are fixed parameters. Alternatively, we can think of $\alpha_{it}$ as stochastic, but consider inference regarding the realization of $\{\alpha_{i1}, ..., \alpha_{iT}\}_{i \in \mathcal{I}_1}$.

Footnote: For example, if we consider a setting in which treatment varies at the state level and we have individual-level observations, then $Y_{it}$ would represent the average of the individual-level observations in unit $i$ and time $t$.

Footnote: See de Chaisemartin and D'Haultfoeuille (2018), Callaway and Sant'Anna (2018), Athey and Imbens (2018), and Goodman-Bacon (2018) for a general discussion on the estimand of the TWFE in different settings.

There are $N_1$ treated units, $N_0$ control units, and $T$ time periods. Let $\mathcal{I}_1$ ($\mathcal{I}_0$) be the set of indices for treated (control) units, and let $\mathcal{T}_1$ ($\mathcal{T}_0$) be the set of indices for post- (pre-) treatment periods.
Finally, let $D_i$ be an indicator variable equal to one if unit $i$ is treated. We consider later the cases with variation in adoption time and in which we have individual-level observations. While we focus on the TWFE estimator, we also consider below the use of alternative estimators in this setting, such as the one proposed by de Chaisemartin and D'Haultfoeuille (2018).

Following FP and Ferman (2019a), for a generic variable $A_t$, define

$$\nabla A = \frac{1}{T - t^*} \sum_{t \in \mathcal{T}_1} A_t - \frac{1}{t^*} \sum_{t \in \mathcal{T}_0} A_t.$$

In particular, we consider $W_i = \nabla \eta_i$, which is the post-pre difference in average errors for each unit $i$. In this case in which treatment starts at the same period for all treated units, the DID estimator is numerically equivalent to the two-way fixed effects (TWFE) estimator of $\alpha$, which is given by

$$\hat\alpha = \frac{1}{N_1} \sum_{i \in \mathcal{I}_1} \nabla Y_i - \frac{1}{N_0} \sum_{i \in \mathcal{I}_0} \nabla Y_i = \alpha + \frac{1}{N_1} \sum_{i \in \mathcal{I}_1} W_i - \frac{1}{N_0} \sum_{i \in \mathcal{I}_0} W_i. \tag{3}$$

We consider a repeated sampling framework over the distribution of $\{W_i\}_{i \in \mathcal{I}_1 \cup \mathcal{I}_0}$, in which treatment assignment is pre-determined (alternatively, we can consider that the analysis is conditional on the vector of treatment assignment).

Footnote: We can think of that as a "super-population" setting. See Abadie et al. (2014) and Abadie et al. (2017) for a discussion on a design-based approach for inference.

For simplicity, we consider first the homoskedastic case, as analyzed by CT, and we do not include covariates in model (2). In Appendix A.2 we allow for heteroskedasticity based on a set of covariates $Z_i$, as considered by FP, and for a set of covariates $X_{it}$ in model (2). In this simpler case, we impose the following assumptions on the distribution of $W_i$.

Assumption 2.1 (distribution of $W_i$)

(i) $E[W_i | D_i = 1] = E[W_i | D_i = 0] = 0$, (ii) $W_i$ has the same marginal distribution for all $i \in \mathcal{I}_1 \cup \mathcal{I}_0$, with finite second moment and continuous distribution, and (iii) $\{W_i\}_{i \in \mathcal{I}_0}$ is $\alpha$-mixing satisfying Assumptions 1 and 3(b) from Jenish and Prucha (2009).

[...] $Z_i$.
Finally, Assumption 2.1.(iii) allows for spatially correlated shocks, but restricts such spatial correlation by assuming a mixing condition. Since we consider an asymptotic framework in which $N_1$ is fixed and $N_0 \to \infty$, we can allow for an unrestricted spatial correlation among treated units. However, the mixing condition implies that, as the number of control units increases, the correlation between two randomly drawn control units vanishes. Since we focus on settings in which the researcher does not have information on what generates the spatial correlation, we do not need to model in detail the sources of spatial correlation. See Appendix A2 from Ferman (2019a) for an example of a spatial correlation model in a DID setting that would satisfy this strong mixing condition.

From Proposition 1 from CT, it follows from equation (3) that the limit of $\hat\alpha$ when $N_1$ is fixed and $N_0 \to \infty$ is given by $\hat\alpha \to_p \alpha + N_1^{-1} \sum_{i \in \mathcal{I}_1} W_i$. The intuition is that, given the strong mixing condition (Assumption 2.1.(iii)), the average of the control errors converges in probability to zero as $N_0 \to \infty$. However, since the number of treated units remains fixed, the distribution of $\hat\alpha$ will still depend on the errors of the treated units even asymptotically. In this case, the estimator is unbiased, but not consistent.

Similar results also apply for other estimators considered in the literature. For example, in this setting in which all treated units start treatment after $t^*$, the first-difference estimator would be numerically equivalent to the TWFE estimator using only periods $t^*$ and $t^* + 1$, so we would have $\hat\alpha_{fd} \to_p \alpha_{fd} + N_1^{-1} \sum_{i \in \mathcal{I}_1} (\eta_{it^*+1} - \eta_{it^*})$, where $\alpha_{fd}$ is the first-difference estimand. In this particular setting in which we have equal weights for all units, the estimator proposed by de Chaisemartin and D'Haultfoeuille (2018) would be numerically the same as $\hat\alpha_{fd}$.
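As a rough numerical illustration of equation (3) and of the limiting result above, the sketch below simulates a panel from model (2) with a homogeneous effect and independent standard normal errors. The function names, the simulation design, and all parameter values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def post_pre_diff(Y, t_star):
    """Post-pre difference in time averages for each unit.

    Rows are units and columns are periods; the first t_star columns are
    pre-treatment and the remaining columns are post-treatment.
    """
    Y = np.asarray(Y, dtype=float)
    return Y[:, t_star:].mean(axis=1) - Y[:, :t_star].mean(axis=1)

def did_estimate(Y, treated, t_star):
    """DID estimate: mean post-pre difference of treated minus that of controls."""
    treated = np.asarray(treated, dtype=bool)
    nabla = post_pre_diff(Y, t_star)
    return nabla[treated].mean() - nabla[~treated].mean()

def simulate_gap(n_controls, n_treated=2, T=4, t_star=2, alpha=1.0, seed=0):
    """Return (alpha_hat - alpha, average W_i over treated units) for one panel."""
    rng = np.random.default_rng(seed)
    n = n_treated + n_controls
    treated = np.arange(n) < n_treated
    theta = rng.normal(size=(n, 1))   # unit fixed effects
    gamma = rng.normal(size=(1, T))   # time fixed effects
    eta = rng.normal(size=(n, T))     # errors, independent in this illustration
    d = treated[:, None] & (np.arange(T)[None, :] >= t_star)
    Y = alpha * d + theta + gamma + eta
    W = post_pre_diff(eta, t_star)
    return did_estimate(Y, treated, t_star) - alpha, W[treated].mean()
```

With many controls, the two returned numbers should be close to each other while neither shrinks toward zero, which is the sense in which the estimator is unbiased but not consistent when the number of treated units is fixed.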
If we had that $Y_{it}$ is the average of $M_{it}$ observations, then a similar result would apply for the estimator proposed by de Chaisemartin and D'Haultfoeuille (2018), but its asymptotic distribution would depend on a different weighted average of the errors, $\left(\sum_{i \in \mathcal{I}_1} M_{it^*+1}\right)^{-1} \sum_{i \in \mathcal{I}_1} M_{it^*+1} (\eta_{it^*+1} - \eta_{it^*})$. Overall, in all these cases the estimator converges to a weighted average across treated units of linear combinations of the errors $\eta_{it}$.

Footnote: In this case, the estimand would be slightly different than $\alpha_{fd}$, as it would put different weights on each unit. We would need a regularity assumption on the distribution of $M_{it}$ for the controls to guarantee that the weighted average of the errors of the controls converges in probability to zero.

CT propose an interesting inference method in this case by noting that the residuals $\hat W_i$ of the control units may be informative about the distribution of $W_i$ and, therefore, about the distribution of $\hat\alpha$. The main idea is to estimate the distribution of $W_i$ using the empirical distribution of $\{\hat W_i\}_{i \in \mathcal{I}_0}$. Intuitively, this works because, as $N_0 \to \infty$, $\hat W_i \to_p W_i$. Therefore, hypothesis testing and confidence intervals can be constructed based on the empirical distribution of $\{\hat W_i\}_{i \in \mathcal{I}_0}$. They show validity of this inference method when $W_i$ is iid (Assumption 2 in CT).

It is easy to understand the importance of this independence assumption when $N_1 > 1$. Consider a simple case in which $N_1 = 2$, and $\{W_1, W_2\}$ is multivariate normally distributed with correlation $\rho$. In this case, the asymptotic variance of $\hat\alpha$ would be given by $2^{-1}(1 + \rho)\,var(W_i)$. However, given Assumption 2.1.(iii), if we consider a random draw of $W_i$ for two control units, as $N_0 \to \infty$, the correlation between these errors would converge to zero. As a consequence, the approach proposed by CT would recover a distribution for $N_1^{-1} \sum_{i \in \mathcal{I}_1} W_i$ that has a variance given by $2^{-1} var(W_i)$. Therefore, the variance of $\hat\alpha$ would be underestimated if $\rho > 0$, leading to over-rejection. The same problem applies for the inference method proposed by FP.

Footnote: If $\rho < 0$, then inference based on CT would lead to under-rejection. However, it is more common to consider the case in which $\rho > 0$.

When $N_1 = 1$, however, the inference method proposed by CT can remain valid even if we allow for spatial correlation. The main intuition is that, under Assumption 2.1, the asymptotic distribution of $\hat\alpha$ depends only on $W_1$, and the distribution of $W_1$ is still asymptotically approximated by the empirical distribution of $\{\hat W_i\}_{i \in \mathcal{I}_0}$, even when errors are spatially correlated. Of course, the strong mixing condition is crucial for this argument. To formalize that, let $F(c)$ be the cdf of $W_i$, and consider the empirical distribution of $\{\hat W_i\}_{i \in \mathcal{I}_0}$, $\hat F(c) = N_0^{-1} \sum_{i \in \mathcal{I}_0} 1\{\hat W_i < c\}$, where $1\{\cdot\}$ is an indicator function.

Proposition 3.1

Under Assumption 2.1, as $N_0 \to \infty$, $\hat F(c)$ converges in probability to $F(c)$ uniformly on any compact subset of the support of $W$.

The proof is essentially the same as the proof of Proposition 2 from CT for the case with only one treated unit, but allowing for spatial correlation. We present details in Appendix A.1. It immediately follows that the inference method proposed by CT remains valid for the case with $N_1 = 1$, even when we may have spatial correlation.

Corollary 3.1

Under Assumption 2.1, if $N_1 = 1$, then the inference method proposed by CT is asymptotically valid when $N_0 \to \infty$.

We show in Appendix A.2 that Proposition 3.1 and Corollary 3.1 remain valid when we include covariates in model (2). We also show in Appendix A.2 that the inference method proposed by FP remains asymptotically valid, in this setting with $N_1 = 1$ and spatial correlation, when we relax Assumption 2.1 to allow for heteroskedasticity. Finally, Proposition 3.1 and Corollary 3.1 trivially hold for the first-difference estimator and for the estimator proposed by de Chaisemartin and D'Haultfoeuille (2018), by considering the distribution of $\{(\hat\eta_{it^*+1} - \hat\eta_{it^*})\}_{i \in \mathcal{I}_0}$.

Footnote: For the estimator proposed by de Chaisemartin and D'Haultfoeuille (2018), we would only need a regularity assumption on the distribution of $M_{it}$ for the controls to guarantee that the weighted average of the errors of the controls converges in probability to zero.

As explained above, however, the inference methods proposed by CT and FP would not be valid when $N_1 > 1$.
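For the single-treated-unit case covered by Corollary 3.1, the idea of contrasting the estimate with the empirical distribution of the control residuals can be sketched as below. This is a minimal illustration rather than CT's implementation: the function name `ct_pvalue_single_treated`, its arguments, and the equal-tailed p-value convention are assumptions made here for concreteness.

```python
import numpy as np

def ct_pvalue_single_treated(alpha_hat, control_resid, alpha_null=0.0):
    """Equal-tailed p-value for H0: alpha = alpha_null with one treated unit.

    control_resid holds the post-pre differences of the DID residuals for the
    control units; their empirical distribution stands in for that of W_1.
    """
    w = np.asarray(control_resid, dtype=float)
    stat = alpha_hat - alpha_null
    lower = np.mean(w <= stat)  # share of control residuals at or below the statistic
    upper = np.mean(w >= stat)  # share of control residuals at or above the statistic
    return min(1.0, 2.0 * min(lower, upper))
```

Rejecting when this p-value falls below a level τ corresponds to checking whether the statistic lies outside the τ/2 and 1 - τ/2 empirical quantiles, up to the handling of ties.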

We consider the case in which the errors of all treated units are perfectly correlated as a worst case scenario to provide an asymptotically valid, though conservative, inference method in this setting. More specifically, from Proposition 3.1, $\hat F(c)$ asymptotically recovers the distribution of $W_i$ even when we allow for spatial correlation. Given that, instead of considering independent draws from this distribution to put in place of the treated units, we sample only one $\hat W_i$ and use it for all treated units, which would asymptotically recover the distribution of $\hat\alpha$ if $\{W_i\}_{i \in \mathcal{I}_1}$ were perfectly correlated and had the same marginal distribution. When treated units are not perfectly correlated, though, we would recover a distribution for $\hat\alpha$ that has a higher variance relative to the true distribution of $\hat\alpha$. To formalize this idea, we consider the following high-level assumption, which trivially holds in case $\{W_i\}_{i \in \mathcal{I}_1}$ is multivariate normal.

Assumption 3.1 (regularity condition)

For any $\tau \in (0, 1)$, let $c_\tau$ be the $\tau$-quantile of $W_i$, and define $\widetilde W = N_1^{-1} \sum_{i \in \mathcal{I}_1} W_i$. Then $Pr\left( \{\widetilde W < c_{\tau/2}\} \cup \{\widetilde W > c_{1-\tau/2}\} \right) \leq \tau$.

Note that $Pr\left( \{\widetilde W < c_{\tau/2}\} \cup \{\widetilde W > c_{1-\tau/2}\} \right)$ trivially equals $\tau$ if $N_1 = 1$. Alternatively, we will also have this equality if $W_i$ are perfectly correlated for all treated observations. What Assumption 3.1 states is that, regardless of the number of treated observations and of the spatial correlation among the treated units, the probability of having extreme values for the average of the treated units, $\widetilde W$, is weakly smaller than the probability of having extreme values for a single draw of $W_i$. This assumption is intuitively reasonable, and is satisfied when $W_i$ is multivariate normal.

Proposition 3.2

Consider testing the null hypothesis $H_0: \alpha = \alpha_0$ with a significance level $\tau$ by contrasting $\hat\alpha$ with the $\tau/2$ and $1 - \tau/2$ quantiles of $\hat F(c)$. Under Assumptions 2.1 and 3.1, as $N_0 \to \infty$, this test is asymptotically level $\tau$ for any $\tau \in (0, 1)$.

This result follows directly from Proposition 3.1 and Assumption 3.1. While we guarantee that this modified test does not over-reject (asymptotically), it will generally be conservative when $N_1 > 1$, unless the errors of treated units are perfectly correlated. If we assume $W_i$ is multivariate normally distributed, then we can calculate the asymptotic rejection rate of the original CT test and of the conservative test, depending on $N_1$ and $\rho$. We present these analytical results in Table 1. As expected, when the treated observations are independent, the inference method proposed by CT has the correct asymptotic size regardless of $N_1$. In contrast, the conservative test has the correct size when $N_1 = 1$, but becomes conservative when $N_1 > 1$. When $\rho > 0$ and $N_1 > 1$, however, the CT test starts to over-reject, while the conservative test always has a size no greater than 0.05.

Therefore, this conservative test provides a viable alternative in settings where $N_1 \ll N_0$ that is robust to strongly mixing spatial correlation. This conservative test may be an interesting alternative even when $N_1$ is such that applied researchers would usually rely on large $N_1$/large $N_0$ methods, such as CRVE. Once we relax the assumption that clusters are independent, CRVE would be invalid even when $N_1$ is reasonably large (see Ferman (2019a)). In contrast, the method we propose remains valid when we have a strongly mixing spatial correlation. Of course, the cost of being robust to spatial correlation is that the conservative test will generally have lower power. In Section 4.1 we consider an alternative test that will generally be less conservative in common DID applications.

Remark 3.1

We show in Appendix A.2 that Proposition 3.2 remains valid for the test proposed by FP. In this case, we relax Assumption 2.1.(ii) by assuming a parametric form for the heteroskedasticity. More specifically, we consider a case in which $W_i = h(Z_i, \delta)\xi_i$ for a known function $h(\cdot, \cdot)$, where $Z_i \in \mathbb{R}^q$ is a vector of observed covariates, and $\delta \in \mathbb{R}^p$ is an unknown parameter. The idea in this case is to estimate $\delta$ using the residuals from the DID regression, and then estimate the cdf of $\xi$ with the empirical distribution of $\{\hat\xi_i\}_{i \in \mathcal{I}_0}$, where $\hat\xi_i = h(Z_i, \hat\delta)^{-1} \hat W_i$. Then we recover a conservative distribution for $\hat\alpha$ considering the empirical distribution of $\{N_1^{-1} \sum_{i \in \mathcal{I}_1} h(Z_i, \hat\delta)\hat\xi_b\}_{b \in \mathcal{I}_0}$. We also allow for covariates $X_{it} \in \mathbb{R}^p$ in model (2), which may or may not overlap with the variables in vector $Z_i$.

Remark 3.2

The assumption that $W_i$ is strongly mixing excludes some important models of spatial correlation. For example, it would not allow for a spatial correlation based on a linear factor model with a fixed dimension, as considered by Ferman (2019a), if the expected value of the factor loadings is different for treated and controls. In this case, the DID residuals $\hat W_i$ would not capture the spatially correlated shocks that differentially affect treated and control units, and $\hat F(c)$ would underestimate the variance of the marginal distribution of $W_i$. As a consequence, even the conservative test may over-reject. Importantly, however, the over-rejection of the conservative test in this case would be no larger than the over-rejection of CT (or FP). An example of a setting in which this mixing assumption may be reasonable would be when units closer in some distance metric have more correlated errors, while such correlation goes to zero when this distance increases. This assumption would also be reasonable if we have cities divided among many states, and errors are correlated within state, but independent across states. In these cases, the conservative test would be valid even if we do not have information on the relevant distance metrics. It may also be relevant in settings in which we have information on states, but there are only very few states with treated cities, so that clustering at the state level would not be valid. In contrast, this assumption would not be reasonable if, for example, we believe there is a common shock that affects all control units differently from the treated units.

Remark 3.3

Constructing a conservative test becomes substantially more complicated if treated units start treatment at different periods. For example, consider that unit 1 starts treatment after $t_1^*$, while unit 2 starts treatment after $t_2^*$. In this case, the asymptotic distribution of $\hat\alpha$ would depend on the linear combinations of the errors

$$W(1) = \frac{1}{T - t_1^*} \sum_{t = t_1^* + 1}^{T} \eta_{1t} - \frac{1}{t_1^*} \sum_{t=1}^{t_1^*} \eta_{1t}, \qquad W(2) = \frac{1}{T - t_2^*} \sum_{t = t_2^* + 1}^{T} \eta_{2t} - \frac{1}{t_2^*} \sum_{t=1}^{t_2^*} \eta_{2t}.$$

We can still consistently estimate the marginal distributions of $W(1)$ and $W(2)$ by considering the appropriate linear combination of the residuals from the control units. This is what CT and FP do, and it works well in a setting in which $W(1)$ and $W(2)$ are independent. When we allow for spatial dependence, however, it becomes harder to define a worst case scenario. The worst case scenario will generally not be such that $corr(\eta_{1t}, \eta_{2t}) = 1$. To see that, suppose there are 3 time periods, with $t_1^* = 1$ and $t_2^* = 2$. In this case,

$$\hat\alpha - \alpha \to_p 0.5\,(0.5\,\eta_{12} + 0.5\,\eta_{13} - \eta_{11}) + 0.5\,(\eta_{23} - 0.5\,\eta_{21} - 0.5\,\eta_{22}).$$

Under $corr(\eta_{1t}, \eta_{2t}) = 1$ the period-2 errors cancel out of this expression. Therefore, assuming $corr(\eta_{1t}, \eta_{2t}) = 1$ will lead to a lower variance relative to the case with no spatial dependence if $var(\eta_{i2})$ is substantially larger than the variance at the two other periods. A possible alternative in this case could be to first estimate the variance/covariance matrix of the marginal distribution of $(\eta_{i1}, ..., \eta_{iT})$ for the treated units using the residuals of the control units. These estimated marginal distributions would be the same for all treated units under the homoskedasticity assumption from CT, or will vary depending on the estimated heteroskedasticity as considered by FP. Then we can calculate the spatial correlation parameters among the treated units (which will be $N_1(N_1 - 1)$ symmetric $T \times T$ matrices) that maximize the variance of $\hat\alpha$ to construct a worst case scenario. To implement that, we could follow a similar strategy as the one considered by CT to deal with spatial correlation in their appendix. The difference is that, since we cannot estimate the spatial dependence because there is no distance metric in our setting, we assume a worst case scenario.

While the conservative test presented in Section 3 is valid even when we have strongly mixing spatial correlation, such a test may be too conservative, implying low power. We then consider how we can exploit particular characteristics of empirical applications to construct a less conservative test (Section 4.1), and to test whether spatial correlation is a problem (Section 4.2).
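To make the contrast between the two reference distributions of Section 3 concrete, the following sketch builds both from a vector of control residuals: the worst-case version places a single draw on all treated units, while the independence benchmark averages separate draws. The function names, the resampling-based implementation, and the default number of draws are assumptions chosen for illustration, not the paper's code.

```python
import numpy as np

def reference_draws(control_resid, n_treated, n_draws=10000, conservative=True, seed=0):
    """Simulated reference distribution for alpha_hat - alpha under the null.

    conservative=True : one control residual is drawn and used for every treated
                        unit (perfect correlation among treated units, worst case).
    conservative=False: n_treated residuals are drawn independently and averaged
                        (benchmark that ignores spatial correlation).
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(control_resid, dtype=float)
    if conservative:
        # The average of n_treated identical draws is the draw itself.
        return rng.choice(w, size=n_draws)
    return rng.choice(w, size=(n_draws, n_treated)).mean(axis=1)

def reject(alpha_hat, draws, alpha_null=0.0, tau=0.05):
    """Reject H0 when alpha_hat - alpha_null is outside the tau/2 and 1 - tau/2 quantiles."""
    lo, hi = np.quantile(draws, [tau / 2, 1 - tau / 2])
    stat = alpha_hat - alpha_null
    return bool(stat < lo or stat > hi)
```

Because the worst-case draws are more dispersed than averages of independent draws, an estimate can be rejected under the independence benchmark yet not under the conservative reference distribution, which is the source of the power loss discussed above.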

FP consider as a leading example a setting in which the aggregate unit × time model (2) is heteroskedastic due to variation in the number of observations used to construct the unit × time aggregate observations. We show that, under reasonable assumptions, we can leverage the structure considered in this setting to construct a test that is asymptotically valid and less conservative than the one considered in Section 3.

Let $M_{it}$ be the number of observations used to construct the aggregate observations for unit $i$ at time $t$. For simplicity, assume that $M_{it} = M_i$ for all $t$. FP show that, under a wide range of structures for the within-unit correlation of the individual-level errors, the variance of $W_i$ conditional on $M_i$ is given by $var(W_i | M_i) = A + B M_i^{-1}$ for constants $A, B \geq 0$.

Consider a case with $N_1 = 2$, where the aggregate data of the treated units are constructed from $M_1$ and $M_2$ individual observations per period, and consider a TWFE estimator using the number of individual-level observations $M_i$ as sampling weights. This would lead to the same estimator as if we used the individual-level data. Note that, assuming that both treated units start treatment after $t^*$, and given the assumption that $M_{it} = M_i$, the estimand of the TWFE will be a weighted average of the (average across time) treatment effects of units 1 and 2.

In this case, a naive version of the inference method proposed by FP ignoring spatial correlation would assume that $W_1$ and $W_2$ are independent, implying that it would asymptotically recover a distribution for $\hat\alpha$ with variance equal to

$$\sigma^2_{naive} = \left( \frac{M_1}{M_1 + M_2} \right)^2 \left[ A + B M_1^{-1} \right] + \left( \frac{M_2}{M_1 + M_2} \right)^2 \left[ A + B M_2^{-1} \right]. \tag{4}$$

As we discussed in Section 3, this would generally lead to over-rejection if the treated units are spatially correlated.
The conservative method proposed in Section 3 would in turn asymptotically recover a distribution with variance given by

$$\sigma^2_{cons1} = \left[ \left( \frac{M_1}{M_1 + M_2} \right) \left( A + B M_1^{-1} \right)^{1/2} + \left( \frac{M_2}{M_1 + M_2} \right) \left( A + B M_2^{-1} \right)^{1/2} \right]^2. \tag{5}$$

It is clear that $\sigma^2_{cons1} \geq \sigma^2_{naive}$. Under the assumptions considered in Section 3, this would lead to a valid, but generally conservative test.

An alternative approach in this particular setting would be to consider that we have only one treated unit with $M_1 + M_2$ observations. In this case, we would asymptotically recover a distribution with variance equal to

$$\sigma^2_{cons2} = A + B (M_1 + M_2)^{-1}. \tag{6}$$

It is easy to show that $\sigma^2_{cons1} \geq \sigma^2_{cons2} \geq \sigma^2_{naive}$. The main question is whether $\sigma^2_{cons2}$ is a conservative estimate of the true variance of $\widetilde W = \frac{M_1}{M_1 + M_2} W_1 + \frac{M_2}{M_1 + M_2} W_2$. While we cannot guarantee this condition without imposing more structure on the spatial correlation, we argue that this is a reasonable assumption. This would be the case if individual-level observations are more correlated within their units when compared to the correlation between individual-level observations in different treated units. In this case, an average of error terms of $M_1$ individual observations from one unit combined with $M_2$ observations from another unit should be more precise than an average of $M_1 + M_2$ observations from the same unit.

Intuitively, this approach would consider a scenario for the spatial correlation between individual-level observations of two different units by assuming they have the same spatial correlation as if we considered two individual-level observations in the same unit. Under the assumption that individuals in the same unit are weakly more spatially correlated than individuals in different units, this would be a worst-case scenario for the spatial correlation, which would be much less stringent than the worst-case scenario considered in Section 3. Importantly, we do not need individual-level data for this approach.
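The ordering of the three variances in equations (4) to (6) can be checked numerically for given constants. The helper below is a small sketch; the function name `did_variances` and the example values of A, B, M1 and M2 are arbitrary choices made for illustration.

```python
def did_variances(A, B, M1, M2):
    """Return (naive, cons1, cons2) variances for two treated units.

    naive: treated units assumed independent (equation (4)).
    cons1: treated units assumed perfectly correlated (equation (5)).
    cons2: treated units pooled into one unit with M1 + M2 observations (equation (6)).
    """
    w1, w2 = M1 / (M1 + M2), M2 / (M1 + M2)
    v1, v2 = A + B / M1, A + B / M2          # var(W_i | M_i) for each treated unit
    naive = w1 ** 2 * v1 + w2 ** 2 * v2
    cons1 = (w1 * v1 ** 0.5 + w2 * v2 ** 0.5) ** 2
    cons2 = A + B / (M1 + M2)
    return naive, cons1, cons2

naive, cons1, cons2 = did_variances(A=1.0, B=100.0, M1=50, M2=150)
print(naive, cons2, cons1)  # listed in increasing order for these inputs
```

Algebraically, the gap between the pooled and naive variances equals $2 A M_1 M_2 / (M_1 + M_2)^2$, so the two coincide when $A = 0$, that is, when the errors have no component common to observations within a unit.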
We only need information on the number of observations used to construct the unit × time aggregates. We formalize the assumptions we need.

Assumption 4.1 (distribution of $W_i$)

(i) $E[W_i | D_i = 1] = E[W_i | D_i = 0] = 0$, (ii) $W_i = [A + B/M_i]^{1/2} \xi_i$, where $\xi_i$ has the same marginal distribution for all $i \in \mathcal{I}_1 \cup \mathcal{I}_0$, with finite second moment and continuous distribution, and (iii) $\{\xi_i, M_i\}_{i \in \mathcal{I}_0}$ is $\alpha$-mixing satisfying Assumptions 1 and 3(b) from Jenish and Prucha (2009).

Assumption 4.2 (regularity condition)

For any $\tau \in (0, 1)$, let $c_\tau$ be the $\tau$-quantile of $[A + B/(\sum_{i' \in \mathcal{I}_1} M_{i'})]^{1/2} \xi_i$ conditional on $\{M_{i'}\}_{i' \in \mathcal{I}_1}$, and define $\widetilde W = \sum_{i \in \mathcal{I}_1} \frac{M_i}{\sum_{i' \in \mathcal{I}_1} M_{i'}} W_i$. Then $Pr\left( \{\widetilde W < c_{\tau/2}\} \cup \{\widetilde W > c_{1-\tau/2}\} \,\middle|\, \{M_{i'}\}_{i' \in \mathcal{I}_1} \right) \leq \tau$.

Under Assumption 4.1, Proposition A.1 implies that we can asymptotically recover the distribution of $[A + B/(\sum_{i' \in \mathcal{I}_1} M_{i'})]^{1/2} \xi_i$, conditional on $\{M_{i'}\}_{i' \in \mathcal{I}_1}$, using the empirical distribution of $\{[\hat A + \hat B/(\sum_{i' \in \mathcal{I}_1} M_{i'})]^{1/2} \hat\xi_i\}_{i \in \mathcal{I}_0}$, where $\hat\xi_i = [\hat A + \hat B/M_i]^{-1/2} \hat W_i$. Under Assumption 4.2, a test based on the quantiles of $\{[\hat A + \hat B/(\sum_{i' \in \mathcal{I}_1} M_{i'})]^{1/2} \hat\xi_i\}_{i \in \mathcal{I}_0}$ is asymptotically valid, although it may still be conservative. Still, we should expect this test to be less conservative than the one presented in Section 3. We show that this is the case in our simulations with the ACS in Section 5.

When $M_{it}$ is constant across $t$, we can implement the inference method above by generating a single treated unit that is a weighted average of the $N_1$ treated units, $\tilde y_t = \sum_{i \in \mathcal{I}_1} \frac{M_i}{\sum_{i' \in \mathcal{I}_1} M_{i'}} y_{it}$, with a number of observations $\widetilde M = \sum_{i \in \mathcal{I}_1} M_i$. In this case, the DID estimator using this collapsed treated unit with $\widetilde M$ and $\{M_i\}_{i \in \mathcal{I}_0}$ as weights is numerically the same as the one using the original data with $\{M_i\}_{i \in \mathcal{I}_1 \cup \mathcal{I}_0}$ as weights. Then we can apply FP for this DID model with a single treated unit. If $M_{it}$ varies with $t$, then the original DID estimator would not be numerically the same as the one using an aggregate treated unit $\tilde y_t$. In this case, we recommend first aggregating the data at the unit × time level, and then estimating the DID estimator using sampling weights $M_i^{min} = \min_{t \in \mathcal{T}_1 \cup \mathcal{T}_0} \{M_{it}\}$ (or $M_i^{mean} = T^{-1} \sum_{t=1}^{T} M_{it}$). Then we can conduct the conservative inference method proposed above based on this alternative DID estimator.
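The collapsing step described above, together with the minimum-count weights suggested for time-varying $M_{it}$, can be sketched as follows. The function names and the array layout (rows for treated units, columns for periods) are assumptions made for this illustration.

```python
import numpy as np

def collapse_treated(y_treated, M_treated):
    """Collapse the treated units into one weighted-average treated unit.

    y_treated: array of shape (N1, T) with unit-by-time aggregate outcomes.
    M_treated: length-N1 array of observation counts, constant over time.
    Returns the length-T collapsed series and the total observation count.
    """
    y = np.asarray(y_treated, dtype=float)
    M = np.asarray(M_treated, dtype=float)
    weights = M / M.sum()
    return weights @ y, M.sum()

def min_weights(M_it):
    """Per-unit sampling weights min_t M_it for counts that vary over time."""
    return np.asarray(M_it).min(axis=1)
```

For instance, `collapse_treated(y, min_weights(M_it))` would build the collapsed unit from the minimum counts, which is one way to read the recommendation for time-varying $M_{it}$.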
The advantage of using $M_i^{\min}$ is that we guarantee a conservative estimator for the variance of $\widetilde W$, which in this case will be a weighted average of the $W_i$ of the treated units using $M_i^{\min}$. The reason is that for each $t$ we will calculate $\tilde y_t = \sum_{i \in \mathcal{I}_1} \frac{M_i^{\min}}{\sum_{i' \in \mathcal{I}_1} M_{i'}^{\min}} y_{it}$, where $M_{it} \ge M_i^{\min}$ for all $t \in \mathcal{T}_1 \cup \mathcal{T}_0$ and $i \in \mathcal{I}_1$. If $M_{it}$ does not vary much over $t$, then the resulting estimator should be close to the original DID estimator using individual-level data.

4.2 Testing for Spatial Correlation

It is common in empirical applications to have settings in which, for example, there are only a few states that are treated, but we observe subgroups (e.g., counties) within each state. Consider a setting in which there is only one treated state, and assume spatial correlation is restricted within states. In this case, the researcher could decide between using CRVE at the county level or a method such as CT and FP at the aggregate level. There is a trade-off in terms of assumptions between these two types of methods. The first relies on no spatial correlation within states, while the second relies on homoskedasticity (or heteroskedasticity with a structure as considered by FP). See Ferman (2019b) for a more thorough comparison among different inference methods.

In this setting, we can consider a simple way to assess whether spatial correlation poses a problem for CRVE at the county level. Under the assumption that errors are independent across counties, and given that $\hat W_i \to_p W_i$, we should expect the proportion of residuals $\hat W_i$ that is positive to be uniform across states (if $W_i$ is symmetric around zero, then we should expect that to be approximately 1/2). We can calculate this proportion for each state ($\hat p_i$), and construct a test statistic $t = \sum_{i=1}^{S} (\hat p_i - \bar p)^2$, where $S$ is the number of states and $\bar p = S^{-1} \sum_{i=1}^{S} \hat p_i$.
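To make the statistic concrete, the following Python sketch computes $t$ and a permutation p-value. It is an illustration under stated assumptions rather than the paper's code: since $t$ is unchanged by relabeling the $\hat p_i$ themselves, the sketch reshuffles the county-level sign indicators across states (holding the number of counties per state fixed), which is one reading of the permutation step. The function names, the number of permutations and the "+1" p-value convention are choices made for this example.

```python
import random


def dispersion_statistic(signs_by_state):
    """t = sum_i (p_hat_i - p_bar)^2, where p_hat_i is the share of
    positive residuals in state i and p_bar is the unweighted mean."""
    p_hat = [sum(signs) / len(signs) for signs in signs_by_state]
    p_bar = sum(p_hat) / len(p_hat)
    return sum((p - p_bar) ** 2 for p in p_hat)


def spatial_correlation_pvalue(residuals_by_state, n_perm=999, seed=0):
    """Permutation p-value for the null of no spatial correlation.

    residuals_by_state: one list of county-level residuals per state.
    """
    rng = random.Random(seed)
    signs_by_state = [
        [1 if w > 0 else 0 for w in state] for state in residuals_by_state
    ]
    sizes = [len(state) for state in signs_by_state]
    pooled = [s for state in signs_by_state for s in state]
    t_obs = dispersion_statistic(signs_by_state)

    exceed = 0
    for _ in range(n_perm):
        # Reassign sign indicators to states at random, keeping state sizes.
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if dispersion_statistic(shuffled) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)
```

A small p-value means the observed dispersion of $\hat p_i$ across states is larger than what independent county-level errors would typically produce with this many counties per state.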
Then we can consider permutations of $\hat p_i$, reconstruct $t_b$ for each permutation, and check whether $t$ is extreme in the distribution of permutations. The distribution of the permutations would recover the discrepancies between $\hat p_i$ and $\bar p$ that we should expect when errors are independent, given that we have only a finite number of counties per state. If $t$ is extreme in this distribution, then this would indicate relevant spatial correlation.

This idea is similar in spirit to the test for appropriate level of clustering proposed by MacKinnon et al. (2020). Note, however, that CRVE at the state level would not be consistent in our setting, so their test would not be valid. Following the work by Roth (2019), we consider such a test with caution, as we may fail to reject the null of no spatial correlation due to sampling variation. Still, such a test may be very informative about when we cannot rely on CRVE (and, therefore, should focus on alternatives such as CT and FP), as we show in the empirical illustration in Section 6.

We analyze the spatial correlation problem and our conservative tests presented in Sections 3 and 4.1 in simulations with the American Community Survey (ACS). Following the strategy used by Bertrand et al. (2004), we randomly generate placebo interventions, and then evaluate the proportion of simulations in which we would reject the null based on different inference methods. We use the ACS data from 2005 to 2018. Following Bertrand et al. (2004), we restrict the sample to women between the ages 25 and 50, and consider log wages as outcome variable. We select two time periods, and allocate treatment at the Public Use Microdata Area (PUMA) level in the second period. We have around 2000 PUMAs for each pair of years. We consider two allocation mechanisms. In the first, $N_1$ PUMAs are randomly allocated to receive treatment. In this case, we should not expect spatial correlation to be a problem.
In the second one, we first randomly choose one state, and then randomly choose $N_1$ PUMAs from this chosen state. In this case, spatial correlation may be a problem if there are state-level shocks that affect all PUMAs in a given state. Importantly, the strong mixing condition we assumed in Sections 3 and 4.1 would be satisfied in a model in which there are state-wide shocks, if PUMAs are independent across states and we have a large number of states. Given that we have a relatively large number of states, we believe our theory provides a good approximation for this setting. In addition to varying whether treatment allocation is independent or spatially correlated, we also vary the number of treated units, $N_1 \in \{2, 20\}$, and the time difference between the pre- and post-treatment periods, $\delta \in \{1, 5\}$.

Footnote: We created our ACS extract using IPUMS (Ruggles et al. (2015)).

Footnote: In all simulations, we restrict to states that had more than 20 PUMAs in the analyzed periods, so the comparison across $N_1$ is not confounded by changes in the composition of states.

We present in Table 2 rejection rates when we consider the inference method proposed by FP, and the conservative methods proposed in Sections 3 and 4.1. Panel A presents rejection rates when treatment has no effect. As expected, when treatment is randomly allocated, the FP test controls size well. When treatment allocation is spatially correlated, we do not find much distortion when $N_1 = 2$, but we find over-rejection when $N_1 = 20$. This is consistent with the analytical results presented in Table 1 that spatial correlation becomes more problematic when $N_1$ is larger. We also find that spatial correlation leads to more over-rejection when $\delta = 5$, which is consistent with the results from Ferman (2019a) that the fixed effects capture less of the spatially correlated shocks when the distance between the pre- and post-treatment periods is larger.
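The way over-rejection grows with the number of treated units can be illustrated in a stylized normal setting. The sketch below assumes, as in the notes to Table 1, that the $W_i$ of the treated units are jointly normal with unit variance and common correlation $\rho$. It further assumes that the original test uses a reference distribution with variance $1/N_1$ (as if treated units were independent) and that the conservative test of Section 3 uses the distribution of a single $W_i$ (the perfectly correlated worst case). Under those assumptions the rejection rates have closed forms. The function names and the grid of $\rho$ values are hypothetical, and the output is not meant to replicate the entries of Table 1.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def true_sd_of_mean(n_treated, rho):
    """Standard deviation of the average of n_treated unit-variance,
    equicorrelated normal errors."""
    return math.sqrt((1.0 + (n_treated - 1) * rho) / n_treated)


def independent_reference_rejection(n_treated, rho, level=0.05):
    """Rejection rate under H0 when critical values come from N(0, 1/n_treated),
    i.e. treated units are treated as independent (assumed stand-in)."""
    z = STD_NORMAL.inv_cdf(1.0 - level / 2.0)
    assumed_sd = math.sqrt(1.0 / n_treated)
    ratio = assumed_sd / true_sd_of_mean(n_treated, rho)
    return 2.0 * (1.0 - STD_NORMAL.cdf(z * ratio))


def worst_case_reference_rejection(n_treated, rho, level=0.05):
    """Rejection rate under H0 when critical values come from N(0, 1),
    the perfectly correlated worst case (assumed stand-in)."""
    z = STD_NORMAL.inv_cdf(1.0 - level / 2.0)
    ratio = 1.0 / true_sd_of_mean(n_treated, rho)
    return 2.0 * (1.0 - STD_NORMAL.cdf(z * ratio))


if __name__ == "__main__":
    for n_treated in (1, 2, 20):
        for rho in (0.0, 0.2, 0.5):  # illustrative values only
            print(
                n_treated,
                rho,
                round(independent_reference_rejection(n_treated, rho), 3),
                round(worst_case_reference_rejection(n_treated, rho), 3),
            )
```

With one treated unit both rates equal the nominal level. With more treated units and $\rho > 0$, the first rate rises above the nominal level and increases with the number of treated units, while the second stays at or below it.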
In particular, when $N_1 = 20$ and $\delta = 5$ in the spatially correlated setting, we find a rejection rate of around 14% for a 5%-level test.

In contrast, the conservative tests never over-reject, even when treatment assignment is spatially correlated. As expected, however, in many scenarios the test is conservative, with rejection rates much smaller than 5%. We consider in Panels B and C rejection rates when treatment effects are, respectively, 0.[…]$\sigma$ and 0.[…]$\sigma$. Not surprisingly, controlling the type I error even when there is spatial correlation with the conservative tests comes at the cost of lower power. Interestingly, the conservative test presented in Section 4.1 has much larger power relative to the one proposed in Section 3. This highlights the importance of using more information to construct a less conservative inference method, such as the one presented in Section 4.1.

Footnote: Since in this setting the number of observations per PUMA × time cell varies across time, we run a regression at the PUMA × time cell, with the average number of observations for each PUMA as weights. We estimate the variance of $W_i$ conditional on this average number of observations, $M_i$. That is, we estimate the conditional variance function $\mathrm{var}(W_i \mid M_i) = A + B M_i^{-1}$. The correlation between the number of observations in 2005 and 2010 is 0.94.

Footnote: Standard deviations are calculated at the individual level using the baseline data for the whole sample.

Finally, we consider the test of spatial correlation proposed in Section 4.2 for each pre- […] $\delta = 1$, and in 6 out of 9 years when $\delta = 5$, which is consistent with the discussion above that spatial correlation is more problematic when the distance between the pre- and post-treatment periods is larger. However, even when spatial correlation is arguably more problematic (when $\delta = 5$), we still cannot reject the null of no spatial correlation in 3 out of 9 cases. Even though we do not detect spatial correlation, we would still have a rejection rate greater than 8% in these 3 years.
Therefore, we emphasize that failing to reject this test should be interpreted with caution, and that presenting alternatives that allow for spatial correlation as robustness checks may be worthwhile even when spatial correlation is not detected.

We illustrate our findings by analyzing the changes in mortality after the Massachusetts 2006 health care reform. This reform was analyzed by Sommers et al. (2014) using a DID design comparing 14 Massachusetts counties with 513 control counties that were selected based on a propensity score to be more similar to the treated counties. Sommers et al. (2014) find a reduction of 2.9%-4.2% in mortality in Massachusetts relative to the controls after the reform (depending on whether covariates are included). As Kaestner (2016) reports, this conclusion received widespread media attention given the importance of these results.

Sommers et al. (2014) reported standard errors clustered at the state level, and also considered standard errors clustered at the county level in their online appendix. Given the recent findings that clustered standard errors may perform badly when there are few treated clusters (e.g., CT), Kaestner (2016) re-analyzed the findings from Sommers et al. (2014) using randomization inference. […] In contrast, the randomization inference techniques do not require a large number of treated units, but, in DID settings, are generally based on homoskedastic and independent errors. As we show in this paper, spatial correlation can lead to substantial over-rejection in randomization inference tests at the county level if spatial correlation is not taken into account.

Footnote: The propensity score used age distribution, sex, race/ethnicity, poverty rate, median income, unemployment, uninsured rate, and baseline annual mortality as predictors. We take this first selection step as given in our analysis. We find similar results if we consider a DID regression using all counties, so that there is no pre-selection of control counties.
If we consider randomization inference at the state level, since we would have only one treated unit, spatial correlation would not generate problems if errors are strongly mixing (Corollary 3.1). However, in both cases we may have distortions given that counties/states vary by population sizes, which would naturally lead to heteroskedasticity (e.g., FP).

We revisit the use of randomization inference techniques in this empirical application in light of our findings. We consider a DID estimator based on OLS TWFE regression with no covariates. Point estimates are slightly different from those reported in the original paper because we use the publicly available data set (so we do not have death counts for cells with fewer than 10 deaths), and because we weight observations by population mean across years to avoid the problem with TWFE highlighted by de Chaisemartin and D'Haultfoeuille (2018). We find a p-value of 0.437 when we use the CT method at the county level, which is similar to the findings from Kaestner (2016) based on permutations, who found a p-value of 0.[…] […] Comparing the p-values from FP at the county and state levels suggests that spatial correlation is a significant problem when we consider the county-level model. The test for spatial correlation proposed in Section 4.2 has a p-value of zero, reinforcing that spatial correlation is important in this setting. When we consider the effects on amenable mortality rates, the differences in p-values become even more striking. We find a p-value of 0.314 for CT at the county level, which goes down to 0.032 when we correct for heteroskedasticity using FP. Considering FP at the state level, however, the p-value is up again to 0.[…] […] 0.381 (0.200 considering amenable mortality rates), which are remarkably close to the FP p-values at the state level.

Footnote: In this setting, there would be 14 treated counties. While this number is not very large, we view spatial correlation as a potentially more important problem for inference when standard errors are clustered at the county level. See Ferman (2019b) for details.

Footnote: We also restrict to counties with non-missing information for all years. We end up with 485 control counties, as compared to 513 from the original study.

Footnote: Sommers et al. (2014) report a p-value of 0.04 for the OLS unadjusted regression using CRVE at the state level (see their Appendix Table 4).

Footnote: Kaestner (2016) does not report a p-value for the state-level OLS unadjusted regression. However, he finds a p-value of 0.404 for a state-level generalized linear model using a negative binomial distribution and log link, which is close to the p-value we found.

We consider the problem of inference in DID models when there are few treated units and errors are spatially correlated. We first show that, when there is only a single treated unit, the inference methods proposed by CT and FP, which were designed for settings with few treated and many control units, remain asymptotically valid when errors are strongly mixing. This extends the set of possible applications in which the tests proposed by CT and FP can be reliably used when there is only a single treated unit. However, these methods can lead to over-rejection with more than one treated unit. We propose two alternative inference methods that are asymptotically valid, though generally conservative, in the presence of spatial correlation. These tests provide interesting alternatives when spatial correlation is likely relevant. However, as presented in the simulations from Section 5, being robust to spatial correlation may come at the expense of reductions in test power.

References

Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. (2017). When should you adjust standard errors for clustering? Working Paper 24003, National Bureau of Economic Research.

Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. M. (2014). Finite population causal standard errors. Working Paper 20325, National Bureau of Economic Research.

Arellano, M. (1987). Computing robust standard errors for within-groups estimators. Oxford Bulletin of Economics and Statistics, 49(4):431–434.

Athey, S. and Imbens, G. (2018). Design-based Analysis in Difference-In-Differences Settings with Staggered Adoption. Working Paper, arXiv:1808.05293.

Bertrand, M., Duflo, E., and Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? Quarterly Journal of Economics, pages 249–275.

Brewer, M., Crossley, T. F., and Joyce, R. (2017). Inference with difference-in-differences revisited. Journal of Econometric Methods, 7(1).

Callaway, B. and Sant'Anna, P. H. C. (2018). Difference-in-Differences with Multiple Time Periods and an Application on the Minimum Wage and Employment. Working Paper, arXiv:1803.09015.

Cameron, A., Gelbach, J., and Miller, D. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3):414–427.

Canay, I. A., Romano, J. P., and Shaikh, A. M. (2017). Randomization tests under an approximate symmetry assumption. Econometrica, 85(3):1013–1030.

Conley, T. G. and Taber, C. R. (2011). Inference with Difference in Differences with a Small Number of Policy Changes. The Review of Economics and Statistics, 93(1):113–125.

de Chaisemartin, C. and D'Haultfoeuille, X. (2018). Two-way fixed effects estimators with heterogeneous treatment effects.

Donald, S. G. and Lang, K. (2007). Inference with Difference-in-Differences and Other Panel Data. The Review of Economics and Statistics, 89(2):221–233.

Ferman, B. (2019a). Inference in differences-in-differences: How much should we trust in independent clusters?

Ferman, B. (2019b). A simple way to assess inference methods.

Ferman, B. and Pinto, C. (2019). Inference in differences-in-differences with few treated groups and heteroskedasticity. The Review of Economics and Statistics, forthcoming.

Goodman-Bacon, A. (2018). Difference-in-differences with variation in treatment timing. Working Paper 25018, National Bureau of Economic Research.

Jenish, N. and Prucha, I. R. (2009). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics, 150(1):86–98.

Kaestner, R. (2016). Did Massachusetts health care reform lower mortality? No according to randomization inference. Statistics and Public Policy, 3:1–6.

MacKinnon, J. G. and Webb, M. D. (2019). Randomization Inference for Difference-in-Differences with Few Treated Clusters. Journal of Econometrics, forthcoming.

MacKinnon, J. G., Ørregaard Nielsen, M., and Webb, M. D. (2020). Testing for the appropriate level of clustering in linear regression models. Working Paper 1428, Economics Department, Queen's University.

Newey, W. K. and McFadden, D. (1994). Chapter 36: Large sample estimation and hypothesis testing. Volume 4 of Handbook of Econometrics, pages 2111–2245. Elsevier.

Roth, J. (2019). Pre-test with caution: Event-study estimates after testing for parallel trends.

Ruggles, S., Genadek, K., Goeken, R., Grover, J., and Sobek, M. (2015). Integrated Public Use Microdata Series: Version 6.0 [Machine-readable database].

Sommers, B. D., Long, S. K., and Baicker, K. (2014). Changes in mortality after Massachusetts health care reform. Annals of Internal Medicine, 160(9):585–593. PMID: 24798521.

van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge (UK), New York (N.Y.).

Table 1:

Asymptotic rejection rates

Columns: (1) CT test, (2) Conservative test. Panels A to E correspond to different values of $\rho$, starting with $\rho = 0$ in Panel A; rows within each panel correspond to different values of $N_1$. [Numerical entries not recoverable from the source.]

Notes: This table presents asymptotic rejection rates of the CT test and of the conservative test presented in Section 3 for different values of $N_1$ and of spatial correlation among the treated units. We assume $\{W_i\}_{i \in \mathcal{I}_1}$ is multivariate normally distributed, with variance one and correlation coefficient $\rho$.

Table 2: Rejection rates - Simulations with the ACS

Columns (1)-(6): independent assignment. Columns (7)-(12): spatially correlated assignment. Within each block, the first three columns use $N_1 = 2$ and the last three use $N_1 = 20$.

| | Orig. (1) | Cons1 (2) | Cons2 (3) | Orig. (4) | Cons1 (5) | Cons2 (6) | Orig. (7) | Cons1 (8) | Cons2 (9) | Orig. (10) | Cons1 (11) | Cons2 (12) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Panel A: test size | | | | | | | | | | | | |
| δ = 1 | 0.043 | 0.008 | 0.043 | 0.036 | 0.000 | 0.015 | 0.040 | 0.005 | 0.038 | 0.057 | 0.000 | 0.023 |
| δ = 5 | 0.045 | 0.009 | 0.018 | 0.046 | 0.000 | 0.000 | 0.051 | 0.009 | 0.021 | 0.143 | 0.000 | 0.008 |
| Panel B: test power, α = 0.[…]σ | | | | | | | | | | | | |
| δ = 1 | 0.378 | 0.166 | 0.380 | 0.966 | 0.003 | 0.890 | 0.378 | 0.159 | 0.395 | 0.959 | 0.005 | 0.880 |
| δ = 5 | 0.177 | 0.050 | 0.108 | 0.752 | 0.000 | 0.199 | 0.178 | 0.064 | 0.120 | 0.707 | 0.004 | 0.184 |
| Panel C: test power, α = 0.[…]σ | | | | | | | | | | | | |
| δ = 1 | 0.869 | 0.736 | 0.864 | 0.999 | 0.914 | 0.923 | 0.874 | 0.722 | 0.860 | 1.000 | 0.905 | 0.925 |
| δ = 5 | 0.471 | 0.291 | 0.365 | 0.990 | 0.355 | 0.444 | 0.466 | 0.290 | 0.367 | 0.985 | 0.306 | 0.444 |

Notes: This table presents rejection rates for the FP inference test (Orig.), the conservative test presented in Section 3 (Cons1), and the conservative test presented in Section 4.1 (Cons2). Details of the simulations are presented in Section 5.

Appendix

A.1 Proof of Proposition 3.1

Proof.

This proof follows essentially the same steps as the proof of Proposition 2 from CT. Let $\Omega$ be a compact subspace of the support of $W_i$. Note that, for $i \in \mathcal{I}_0$, $\hat W_i = W_i - \bar W$, where $\bar W = N_0^{-1} \sum_{i \in \mathcal{I}_0} W_i$, and let $W$ be a generic random variable with the same distribution as $W_i$. Then,

$$\sup_{c \in \Omega} \left| \hat F(c) - F(c) \right| = \sup_{c \in \Omega} \left| \frac{1}{N_0} \sum_{i \in \mathcal{I}_0} \mathbf{1}\{W_i - \bar W < c\} - \Pr(W < c) \right| \tag{7}$$

$$\le \sup_{c \in \Omega} \left| \frac{1}{N_0} \sum_{i \in \mathcal{I}_0} \mathbf{1}\{W_i < c + \bar W\} - \Pr\left(W < c + \bar W\right) \right| \tag{8}$$

$$+ \sup_{c \in \Omega} \left| \Pr\left(W - \bar W < c\right) - \Pr(W < c) \right|. \tag{9}$$

From Theorem 3 of Jenish and Prucha (2009), a LLN applies to our setting with spatially dependent variables, so $\bar W \to_p 0$. This implies $W - \bar W \to_d W$. Therefore, from Lemma 2.11 from van der Vaart (1998), the term in line 9 is $o(1)$. For the term in line 8, consider a compact subset $\Theta$ of the parameter space of $\bar W$ such that 0 is an interior point. Therefore,

$$\sup_{c \in \Omega} \left| \sum_{i \in \mathcal{I}_0} \frac{\mathbf{1}\{W_i < c + \bar W\}}{N_0} - \Pr\left(W < c + \bar W\right) \right| \le \sup_{c \in \Omega,\, h \in \Theta} \left| \sum_{i \in \mathcal{I}_0} \frac{\mathbf{1}\{W_i < c + h\}}{N_0} - \Pr(W < c + h) \right| + \mathbf{1}\{\bar W \notin \Theta\}. \tag{10}$$

Since $\bar W \to_p 0$, and 0 is in the interior of $\Theta$, it follows that $\mathbf{1}\{\bar W \notin \Theta\} = o_p(1)$. For the other term, note that $E[\mathbf{1}\{W_i < c + h\}] = \Pr(W < c + h)$, $\mathbf{1}\{W_i < c + h\}$ is continuous with probability one and bounded by 1 for any $c$ and $h$, and $\Omega$ and $\Theta$ are compact. A simple adaptation from Lemma 2.4 of Newey and McFadden (1994) to allow for $\alpha$-mixing series instead of iid sampling implies that the first term in the RHS of the above equation is $o_p(1)$. Combining these results, we have $\sup_{c \in \Omega} | \hat F(c) - F(c) | = o_p(1)$.

A.2 Extensions

Consider now a setting in which

$$Y_{it} = \alpha d_{it} + X_{it}'\beta + \theta_i + \gamma_t + \eta_{it}, \tag{11}$$

where $X_{it}$ is a $(p \times 1)$ vector of covariates. As in the main text, we consider $W_i = \nabla \eta_{it}$, which is the post-pre difference in average errors for each unit $i$. Moreover, we assume that we observe a $(q \times 1)$ vector $Z_i$ that potentially generates heteroskedasticity on $W_i$. We do not impose any restriction on whether $X_{it}$ and $Z_i$ share common elements or not.

Footnote (proof of Proposition 3.1): Essentially, the only adjustment in the proof is that we use Theorem 3 from Jenish and Prucha (2009) instead of a LLN for iid series.

We consider the following assumptions:

Assumption A.1 (distributions of $W_i$ and $X_{it}$) (i) $\{(X_{i1}, \eta_{i1}), ..., (X_{iT}, \eta_{iT})\}_{i \in \mathcal{I}_0}$ is $\alpha$-mixing satisfying Assumptions 1 and 3(b) from Jenish and Prucha (2009), (ii) $(\eta_{i1}, ..., \eta_{iT})$ have expectation zero conditional on $D_i$ and $(X_{i1}, ..., X_{iT})$, (iii) $X_{it}$ has finite second moments, and the residuals of $X_{it}$ and $d_{it}$ after the projection on time and unit fixed effects are linearly independent and those residuals of $X_{it}$ have variation in the limit, (iv) $W_i \mid \{Z_i, X_{i1}, ..., X_{iT}, D_i\} \sim h(Z_i, \delta)\,\xi_i$ for all $i$, where $h(Z_i, \delta)$ is a known continuous function and $\delta \in \mathbb{R}^m$ is an unknown parameter, (v) $\xi_i$ has the same marginal distribution for all $i \in \mathcal{I}_1 \cup \mathcal{I}_0$, with finite second moment and continuous distribution, and (vi) $h(Z_i, \delta) \le \bar h(\delta)$ for some function $\bar h(\delta)$.

Assumption A.1 allows for heteroskedasticity of a known form, as considered by FP. The important restriction is that, conditional on $Z_i$, the errors of treated and control units have the same distribution. This is how FP are able to use information from the residuals of the control units to infer about the distribution of the treated units. We need to impose Assumption A.1.(ii) on the $\eta_{it}$'s rather than on the linear combination $W_i$ because of the covariates $X_{it}$. Together with Assumption A.1.(iii), we have conditions such that $\beta$ is consistently estimated when $N_0 \to \infty$. We can consider the homoskedastic case with covariates by setting $h(Z_i, \delta)$ constant.

First, under Assumption A.1, we can still apply Proposition 1 from CT, so $\hat\alpha \to_p \alpha + N_1^{-1} \sum_{i \in \mathcal{I}_1} W_i$. Now let $\dot A_{it} = A_{it} - \bar A_i - \bar A_t + \bar A$, where $\bar A_i = \frac{1}{T} \sum_{t \in \mathcal{T}_1 \cup \mathcal{T}_0} A_{it}$, $\bar A_t = \frac{1}{N_1 + N_0} \sum_{i \in \mathcal{I}_1 \cup \mathcal{I}_0} A_{it}$, and $\bar A = \frac{1}{T(N_1 + N_0)} \sum_{t \in \mathcal{T}_1 \cup \mathcal{T}_0} \sum_{i \in \mathcal{I}_1 \cup \mathcal{I}_0} A_{it}$. From the Frisch–Waugh–Lovell theorem, it follows that $\hat\eta_{it} = (\alpha - \hat\alpha)\,\dot d_{it} + \dot X_{it}'(\beta - \hat\beta) + \dot\eta_{it}$, which implies that, for $i \in \mathcal{I}_0$,

$$\hat W_i = (\hat\alpha - \alpha) \left(\frac{T - t^*}{T}\right) \left(\frac{N_1}{N_1 + N_0}\right) + \left(\nabla X_i - \overline{\nabla X}\right)'(\beta - \hat\beta) + W_i - \bar W \tag{12}$$

$$= W_i + \hat v + \left(\nabla X_i - \overline{\nabla X}\right)'(\beta - \hat\beta), \tag{13}$$

where $\hat v = (\hat\alpha - \alpha) \left(\frac{T - t^*}{T}\right) \left(\frac{N_1}{N_1 + N_0}\right) - \bar W$.

We assume we have a consistent estimator for $\delta$, $\hat\delta$. Note that this implicitly imposes restrictions on the function $h()$ and on the distribution of $Z_i$. See FP for an example in which heteroskedasticity comes from variation in the number of observations per unit, and we have a consistent estimator of the parameters of this function $h()$.

We consider an estimator for the cdf of $\xi_i$ given by the empirical distribution of $\hat\xi_i = h(Z_i, \hat\delta)^{-1}\,\hat W_i$ of the control units. Let $\hat G(c) = N_0^{-1} \sum_{i \in \mathcal{I}_0} \mathbf{1}\{\hat\xi_i < c\}$, and $G(c)$ be the cdf of $\xi_i$.

Proposition A.1 Suppose Assumption A.1 holds, and we have an estimator $\hat\delta$ that is consistent for $\delta$ when $N_0 \to \infty$ and $N_1$ is fixed. Then, as $N_0 \to \infty$, $\hat G(c)$ converges in probability to $G(c)$ uniformly on any compact subset of the support of $W$.

Proof. Define

$$\hat\phi_i(c; a_1, a_2, a_3, a_4) = \mathbf{1}\left\{ h(Z_i, a_1)^{-1} \left( W_i + a_2 + (\nabla X_i - a_3)'(\beta - a_4) \right) < c \right\}, \tag{14}$$

and

$$\phi(c; a_1, a_2, a_3, a_4) = \Pr\left( h(Z, a_1)^{-1} \left( W + a_2 + (\nabla X - a_3)'(\beta - a_4) \right) < c \right). \tag{15}$$

Note that $\phi(c; a_1, a_2, a_3, a_4) = E[\hat\phi_i(c; a_1, a_2, a_3, a_4)]$, and that $\phi(c; \delta, 0, E[\nabla X], \beta) = G(c)$. Consider a compact subset of the parameter space of $(a_1, a_2, a_3, a_4)$, $\Theta$, where $(\delta, 0, E[\nabla X], \beta)$ is an interior point.

Now, similarly to the proof of Proposition 3.1,

$$\sup_{c \in \Omega} \left| \hat G(c) - G(c) \right| = \sup_{c \in \Omega} \left| \frac{1}{N_0} \sum_{i \in \mathcal{I}_0} \hat\phi_i\left(c; \hat\delta, \hat v, \overline{\nabla X}, \hat\beta\right) - G(c) \right| \tag{16}$$

$$\le \sup_{c \in \Omega,\, (a_1, a_2, a_3, a_4) \in \Theta} \left| \frac{1}{N_0} \sum_{i \in \mathcal{I}_0} \hat\phi_i(c; a_1, a_2, a_3, a_4) - \phi(c; a_1, a_2, a_3, a_4) \right| \tag{17}$$

$$+ \mathbf{1}\left\{ (\hat\delta, \hat v, \overline{\nabla X}, \hat\beta) \notin \Theta \right\} \tag{18}$$

$$+ \sup_{c \in \Omega} \left| \phi\left(c; \hat\delta, \hat v, \overline{\nabla X}, \hat\beta\right) - \phi(c; \delta, 0, E[\nabla X], \beta) \right|. \tag{19}$$

From Proposition 1 of CT we have that $\hat\beta \to_p \beta$, and we assume $\hat\delta \to_p \delta$. Since $\hat\alpha = O_p(1)$ and $N_1 (N_1 + N_0)^{-1} = o(1)$, combined with Assumption A.1, we have $\hat v = o_p(1)$. Finally, from Assumption A.1, $\overline{\nabla X} \to_p E[\nabla X]$. Therefore, $(\hat\delta, \hat v, \overline{\nabla X}, \hat\beta) \to_p (\delta, 0, E[\nabla X], \beta)$, which implies that $\mathbf{1}\{(\hat\delta, \hat v, \overline{\nabla X}, \hat\beta) \notin \Theta\} = o_p(1)$. The term from line 17 is $o_p(1)$ following the same arguments used for line 8 in the proof of Proposition 3.1. Finally, note that $h(Z_i, \hat\delta)^{-1}\,\hat W_i \to_d \xi_i$.
Therefore, from Lemma 2.11 from van der Vaart (1998), the term in line 19 is $o_p(1)$, which completes the proof.

Proposition A.1 guarantees the validity of the approach proposed by FP for the case with $N_1 = 1$ if we consider the empirical distribution of $\{h(Z_1, \hat\delta)\,\hat\xi_b\}_{b \in \mathcal{I}_0}$. For the case with $N_1 > 1$, […] $\hat\alpha$ is approximated by the empirical distribution of $\{N_1^{-1} \sum_{i \in \mathcal{I}_1} h(Z_i, \hat\delta)\,\hat\xi_b\}_{b \in \mathcal{I}_0}$. The equivalent to Assumption 3.1 to guarantee that the test will asymptotically have the correct level is given by

Assumption A.2 (regularity condition) For any $\tau \in (0,1)$, let $c_\tau$ be the $\tau$-quantile of the distribution of $N_1^{-1} \sum_{i \in \mathcal{I}_1} h(Z_i, \delta)\,\xi$ conditional on $\{Z_i\}_{i \in \mathcal{I}_1}$, and define $\widetilde W = N_1^{-1} \sum_{i \in \mathcal{I}_1} W_i$. Then $\Pr\left(\{\widetilde W < c_{\tau/2}\} \cup \{\widetilde W > c_{1-\tau/2}\} \mid \{Z_i\}_{i \in \mathcal{I}_1}\right) \le \tau$.

Again, Assumption A.2 guarantees that the case in which all treated units are perfectly correlated leads to a worst-case scenario in terms of attaining extreme values for the estimator. This assumption is satisfied if $\{\xi_i\}_{i \in \mathcal{I}_1}$