INFERENCE WITH A SINGLE TREATED CLUSTER
ANDREAS HAGEMANN
Abstract.
I introduce a generic method for inference about a scalar parameter in research designs with a finite number of heterogeneous clusters where only a single cluster received treatment. This situation is commonplace in difference-in-differences estimation but the test developed here applies more generally. I show that the test controls size and has power under asymptotics where the number of observations within each cluster is large but the number of clusters is fixed. The test combines weighted, approximately Gaussian parameter estimates with a rearrangement procedure to obtain its critical values. The weights needed for most empirically relevant situations are tabulated in the paper. Calculation of the critical values is computationally simple and does not require simulation or resampling. The rearrangement test is highly robust to situations where some clusters are much more variable than others. Examples and an empirical application are provided.
JEL classification: C01, C22, C32
Keywords: cluster-robust inference, difference in differences, two-way fixed effects, clustered data, dependence, heterogeneity

Date: October 9, 2020. (First version on arXiv: October 9, 2020.) All errors are my own. Comments are welcome. I would like to thank Sarah Miller for useful discussions.

1. Introduction

Inference about the average effect of a binary treatment or policy intervention is often much more challenging than its estimation. For example, calculating a difference-in-differences estimate can be as simple as comparing the difference in average outcomes of individuals in a group before and after an intervention to the same differences in unaffected groups. The main challenge for inference is that individuals within each of these groups likely depend on one another in unobservable ways. Taking this dependence into account generally requires knowledge of an explicit ordering of the dependence structure within each group. While time-dependent data have a natural ordering, it may be difficult or impossible to credibly order cross-sectionally dependent data within states or villages. Researchers commonly try to sidestep this problem by splitting large groups into smaller clusters that are presumed to be independent in order to have access to standard inferential procedures based on cluster-robust standard errors or the bootstrap. Splitting states, villages, or other large groups into smaller clusters is often difficult to justify but necessary for most of the available inferential procedures because they achieve consistency by requiring the number of clusters to go to infinity. If a procedure is valid with a fixed number of clusters, it typically requires at least two treated clusters unless strong homogeneity conditions are satisfied. Numerical evidence by Bertrand, Duflo, and Mullainathan (2004), MacKinnon and Webb (2017), and others suggests that ignoring dependence and heterogeneity may lead to heavily distorted inference in empirically relevant situations. In both cases, the actual size of the test can exceed its nominal level by several orders of magnitude, i.e., nonexistent effects are far too likely to show up as highly significant.

In this paper, I introduce an asymptotically valid method for inference with a single treated cluster that allows for heterogeneity of unknown form. The number of observations within each cluster is presumed to be large but the total number of clusters is fixed. The method, which I refer to as a rearrangement test, applies to standard difference-in-differences estimation and other settings where treatment occurs in a single cluster and the treatment effect is identified by between-cluster comparisons. The key theoretical insight for the rearrangement test is that a mild restriction on some but not all of the heterogeneity in two samples of independent normal variables allows testing the equality of their means even if one sample consists of only a single observation. I prove that this is possible for empirically relevant levels of significance if the other sample consists of at least ten observations. The rearrangement test compares the data to a reordered version of itself after attaching a special weight to the sample with a single observation. The weights needed for most standard situations are tabulated in the paper and calculating additional weights is computationally simple. I also show that the weights remain approximately valid if the two samples of independent heterogeneous normal variables arise as a distributional limit. I exploit this result in the context of cluster-robust inference by constructing asymptotically normal cluster-level statistics to which the rearrangement test can be applied.
The resulting test is consistent against all fixed alternatives to the null, powerful against $1/\sqrt{n}$ local alternatives, and does not require simulation or resampling.

Inference based on cluster-level estimates goes back at least to Fama and MacBeth (1973). Their approach is generalized and formally justified by Ibragimov and Müller (2010, 2016), who construct t statistics from cluster-level estimates and show that these statistics can be compared to Student t critical values. Canay, Romano, and Shaikh (2017) obtain null distributions by permuting the signs of cluster-level statistics under symmetry assumptions. Hagemann (2019) permutes cluster-level statistics directly but adjusts inference to control for the potential lack of exchangeability. All of these methods allow for a fixed number of large and heterogeneous clusters but require several treated clusters. The rearrangement test complements these methods because it relies on the same type of high-level condition on the cluster-level statistics but is valid with a single treated cluster. Other methods that are valid with a fixed number of clusters are the tests of Bester, Conley, and Hansen (2011) and a cluster-robust version of the wild bootstrap (see, e.g., Cameron, Gelbach, and Miller, 2008; Djogbenou, MacKinnon, and Nielsen, 2019) analyzed by Canay, Santos, and Shaikh (2020). However, these papers rely on strong homogeneity conditions across clusters that are not needed here.

Several approaches for inference have been developed specifically for difference-in-differences estimation. Conley and Taber (2011) provide a method that is valid with a single treated cluster and infinitely many control clusters under strong independence and homogeneity conditions that justify an exchangeability argument. Ferman and Pinto (2019) extend this approach to situations where the form of heteroskedasticity is known exactly. Another extension by Ferman (2020) allows for spatial correlation while maintaining Conley and Taber's exchangeability condition. The rearrangement test differs from these methods because it is not limited to models estimated by difference in differences, does not rely on exchangeability conditions, and allows for completely unknown forms of heterogeneity. Other approaches due to MacKinnon and Webb (2019, 2020) use randomization (permutation) inference for difference-in-differences estimation and other models with few treated clusters. They test "sharp" (Fisher, 1935) nulls under randomization hypotheses and asymptotics where the number of clusters is eventually infinite. In contrast, the present paper is able to test conventional nulls in a setting with finitely many clusters.

The remainder of the paper is organized as follows: Section 2 proves several new results on normal random vectors with independent, heterogeneous entries after a specific transformation and introduces the rearrangement test. Section 3 establishes the asymptotic validity of the test in the presence of finitely many heterogeneous clusters when only one cluster received treatment and discusses several examples. Section 4 illustrates the finite sample behavior of the new test in simulations and in data used by Garthwaite, Gross, and Notowidigdo (2014), who analyze the effects of a large-scale disruption of public health insurance in Tennessee. Section 5 concludes. The appendix contains auxiliary results and proofs.

I will use the following notation. $1\{A\}$ is an indicator function that equals one if $A$ is true and equals zero otherwise. Limits are as $n \to \infty$ unless noted otherwise and $\rightsquigarrow$ denotes convergence in distribution.

2. Inference with heterogeneous normal variables
In this section, I construct a test for the equality of means of two samples of independent heterogeneous normal variables where one sample consists of only a single observation. The other sample has finitely many observations. I show that the test has power while controlling size (Theorem 2.1) and remains approximately valid if this two-sample problem characterizes the large sample distribution of a random vector of interest (Proposition 2.3).

Consider $q$ independent variables $X_{0,1}, \dots, X_{0,q}$ with $X_{0,k} \sim N(\mu_0, \sigma_k^2)$ for $1 \le k \le q$. Independently, there is an additional variable $X_1 \sim N(\mu_1, \sigma^2)$. I interpret this as a two-sample problem with "control" sample $X_{0,1}, \dots, X_{0,q}$ and "treatment" sample $X_1$, although all of the following still applies if these roles are reversed. The objective is to test the null hypothesis of equality of means,
$$H_0\colon \mu_1 = \mu_0,$$
without knowledge of $\mu_0, \sigma, \sigma_1, \dots, \sigma_q$ and without assuming that these quantities can be consistently estimated. I account for the uncertainty about $\mu_0$ by recentering the data $X = (X_1, X_{0,1}, \dots, X_{0,q})$ with $\bar{X}_0 = q^{-1} \sum_{k=1}^{q} X_{0,k}$ to define
$$S(X, w) = \big( (1+w)(X_1 - \bar{X}_0),\ (1-w)(X_1 - \bar{X}_0),\ X_{0,1} - \bar{X}_0,\ \dots,\ X_{0,q} - \bar{X}_0 \big) \tag{2.1}$$
for some known weight $w \in (0, 1)$ that will be chosen shortly. If $X_1 - \bar{X}_0 > 0$, the $1 + w$ increases $X_1 - \bar{X}_0$ and $1 - w$ decreases $X_1 - \bar{X}_0$. If $X_1 - \bar{X}_0 < 0$, these effects are reversed. The idea underlying the test is that if the decreased version of $X_1 - \bar{X}_0$ is still large in comparison to $X_{0,1} - \bar{X}_0, \dots, X_{0,q} - \bar{X}_0$, then this size difference is unlikely to be only due to heterogeneity in $\sigma, \sigma_1, \dots, \sigma_q$ but provides evidence that $\mu_1$ and $\mu_0$ are in fact not equal. I show below that $w$ gives precise probabilistic control over this comparison. In particular, choosing $w$ appropriately allows me to construct a test whose size can be bounded at a predetermined significance level.

Before defining the test statistic, I first introduce some notation. For a given vector $s \in \mathbb{R}^d$, let $s_{(1)} \le \dots \le s_{(d)}$ be the ordered entries of $s$. Denote by $s \mapsto s^{\triangledown} = (s_{(d)}, \dots, s_{(1)})$ the operation of rearranging the components of $s$ from largest to smallest. The test uses $S(X, w)$ and its rearranged version $S(X, w)^{\triangledown}$ in the difference-of-means statistic
$$s = (s_1, \dots, s_{q+2}) \mapsto T(s) = \frac{s_1 + s_2}{2} - \frac{1}{q} \sum_{k=1}^{q} s_{k+2} \tag{2.2}$$
to define the test function
$$\phi(X, w) = 1\big\{ T\big(S(X, w)\big) = T\big(S(X, w)^{\triangledown}\big) \big\}. \tag{2.3}$$
The test, which I refer to as the rearrangement test, rejects if $\phi(X, w) = 1$ and does not reject otherwise. As stated, the test is against the alternative of a positive treatment effect, $H_1\colon \mu_1 > \mu_0$. For a test against $H_1\colon \mu_1 < \mu_0$, simply use $\phi(-X, w)$. These alternatives can be combined to provide a two-sided test. I describe the exact implementation below equation (2.7) ahead. Also note that the first difference of means in (2.3) simplifies to $T(S(X, w)) = X_1 - \bar{X}_0$, because the weights $1 + w$ and $1 - w$ average to one and the recentered control observations sum to zero, but $T(S(X, w)^{\triangledown})$ is in general a complicated function of $w$.

Intuitively, the rearrangement test can be interpreted as a permutation test that treats $S = S(X, w)$ as if it were the data and uses the second largest permutation statistic of $T(S)$ as critical value $c$.
If $T(S) > c$, then the only possibility left is that $T(S)$ equals its largest permutation statistic. For the difference of means $T(S)$, that statistic must be $T(S^{\triangledown})$ and therefore $T(S) > c$ is equivalent to $\phi(X, w) = 1$. Because $S$ is being permuted and not $X$, this also explains why it is sensible to write $T(S(X, w))$ instead of $X_1 - \bar{X}_0$ in the definition of the test function (2.3).

A classical permutation test would then use an exchangeability condition on $S$ to determine the size of the test. Even though the $S$ constructed here is far from exchangeable, I will show that this test has power while controlling size at a predetermined level. Instead of relying on exchangeability, the results here depend on the joint normality of $X$ combined with the location and scale invariance property $\phi(X, w) = \phi((X - \mu_0 1_{q+1})/\sigma, w)$, where $1_{q+1}$ is a $(q+1)$-vector of ones. The location invariance is forced by the recentering of $X$ with $\bar{X}_0$ and effectively removes $\mu_0$ from the list of nuisance quantities. The scale invariance is ensured by the specific choices of $T$ and $\phi$. It reduces the dimensionless unknowns $\sigma, \sigma_1, \dots, \sigma_q$ to the more tractable ratios $\sigma_1/\sigma, \dots, \sigma_q/\sigma$.

I start with the analysis of size and power, and connect these results with the situation where $X = (X_1, X_{0,1}, \dots, X_{0,q})$ is an asymptotic approximation later on. I assume that the variances $\sigma_k^2$ of the $X_{0,k}$, $1 \le k \le q$, are bounded away from zero by some $\underline{\sigma}^2 > 0$ for all but one $k$. This avoids a trivial and in practice easily recognizable situation where some of the $X_{0,k}$ are exactly equal. I also restrict the variance $\sigma^2$ of $X_1$ to be bounded above by some $\bar{\sigma}^2 < \infty$ because letting $\sigma \to \infty$ in $\phi(X, w)$ would have the same effect as setting all $\sigma_k$ equal to zero. Under the null hypothesis, the distribution of $\phi(X, w)$ is then determined by the unknown value of
$$\lambda \in \Lambda := \big\{ (\mu_0, \sigma, \sigma_1, \dots, \sigma_q) \in \mathbb{R} \times (0, \infty)^{q+1} : \sigma \le \bar{\sigma} \text{ and } \sigma_k \ge \underline{\sigma} \text{ for all } k \text{ but one} \big\}.$$
Under the alternative, the distribution of $\phi(X, w)$ also depends on the treatment effect $\delta = \mu_1 - \mu_0$. I write $E_{\lambda,\delta}$ and $P_{\lambda,\delta}$ to emphasize this dependence but occasionally drop subscripts to prevent clutter.

My strategy is to first bound the null rejection probability $E_{\lambda,0}\,\phi(X, w)$ uniformly in $\lambda \in \Lambda$ by a smooth function of the weight $w$. I can then find a $w$ to make the bound exactly equal to the desired significance level to guarantee size control. The bound is also a function of the number of control observations $q$ and the maximal relative heterogeneity $\varrho = \bar{\sigma}/\underline{\sigma}$ of treated and untreated observations. The parameter $\varrho$ is user chosen and has a simple interpretation: it restricts how much more variable $X_1$ can be relative to the $X_{0,k}$ when one of the $\sigma_k$ equals zero and the remaining $\sigma_k$ are all equal to the lower limit $\underline{\sigma}$. This is the worst-case scenario for the test because $X_1$ is then likely to be very large by accident in comparison to the $X_{0,k}$. In that scenario, a $\varrho$ of 5 simply means that the variance of $X_1$ can be up to $5^2 = 25$ times larger than the variances of all but one of the $X_{0,k}$ and "infinitely more variable" than the remaining $X_{0,k}$. There are no restrictions on how much less variable $X_1$ can be than $X_{0,1}, \dots, X_{0,q}$ and, in particular, $\bar{\sigma}/\underline{\sigma}$ can be less than one.

The following theorem is the main theoretical result of the paper. It establishes the existence of a size bound that is valid for a fixed number of control observations $q$ and fully accounts for the uncertainty about the parameters in $\Lambda$. The theorem also shows that the test has power against the alternative $H_1\colon \mu_1 > \mu_0$. Results in the other direction follow by considering $E_{\lambda,-\delta}\,\phi(-X, w)$ instead of $E_{\lambda,\delta}\,\phi(X, w)$. The discussion immediately below focuses on the implications of the theorem. I address some of its technical aspects towards the end of this section. Let $\Phi$ and $\varphi$ denote the standard normal distribution and density functions, respectively.
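To make the definitions (2.1) to (2.3) concrete, the following is a minimal Python sketch of the test function. It is an illustration only and not the R command that accompanies the paper; the function names (`recentered_sample`, `diff_of_means`, `rearrangement_test`, `two_sided`) are hypothetical, and the weight `w` is taken as given.

```python
import math


def recentered_sample(x1, controls, w):
    """S(X, w) as in (2.1): two weighted copies of x1 - mean(controls),
    followed by the recentered control observations."""
    xbar0 = math.fsum(controls) / len(controls)
    d = x1 - xbar0
    return [(1 + w) * d, (1 - w) * d] + [x - xbar0 for x in controls]


def diff_of_means(s):
    """T(s) as in (2.2): mean of the first two entries minus mean of the rest."""
    q = len(s) - 2
    # math.fsum returns an accurately rounded sum, so the value does not
    # depend on the order of the summands; this matters for the equality
    # check in rearrangement_test below.
    return (s[0] + s[1]) / 2 - math.fsum(s[2:]) / q


def rearrangement_test(x1, controls, w):
    """phi(X, w) as in (2.3): 1 if T(S) equals T of S sorted from largest
    to smallest, 0 otherwise (alternative: positive treatment effect)."""
    s = recentered_sample(x1, controls, w)
    s_sorted = sorted(s, reverse=True)
    return int(diff_of_means(s) == diff_of_means(s_sorted))


def two_sided(x1, controls, w):
    """Reject if phi(X, w) = 1 or phi(-X, w) = 1."""
    negated = rearrangement_test(-x1, [-x for x in controls], w)
    return int(rearrangement_test(x1, controls, w) or negated)
```

With controls `[-1.0, 0.0, 1.0]` and `w = 0.5`, a treated observation of `10.0` gives $S = (15, 5, -1, 0, 1)$, which is already ordered in its first two entries, so the sketch rejects; a treated observation of `1.5` gives $S = (2.25, 0.75, -1, 0, 1)$, where the control entry 1 exceeds 0.75, so it does not.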
Theorem 2.1 (Size and power).
Let $X_1, X_{0,1}, \dots, X_{0,q}$ be independent with $X_1 \sim N(\mu_0 + \delta, \sigma^2)$ and $X_{0,k} \sim N(\mu_0, \sigma_k^2)$ for $1 \le k \le q$. If $\delta = 0$, then for all $w \in (0, 1)$,
$$\sup_{\lambda \in \Lambda} E_{\lambda,0}\,\phi(X, w) \le \xi_q(w, \varrho) := \frac{1}{2^{q+1}} + \int_0^{\infty} \Phi\big((1-w)\varrho y\big)^{q-1} \varphi(y)\, dy + \min_{t > 0} \Big( \Phi\big(\sqrt{q-1}\, w t\big)^{q-1} + 2\Phi(-qt) \Big). \tag{2.4}$$
Furthermore, for every $\lambda \in \Lambda$ and $w \in (0, 1)$, we have $\lim_{\delta \to \infty} E_{\lambda,\delta}\,\phi(X, w) = 1$ and $\lim_{\delta \to \infty} E_{\lambda,\delta}\,\phi(X, 1) = 0$.

The theorem implies that the rearrangement test controls size, i.e.,
$$\sup_{\lambda \in \Lambda} E_{\lambda,0}\,\phi(X, w) \le \alpha,$$
whenever $q$, $w$, and $\varrho$ are such that $\xi_q(w, \varrho) \le \alpha$ for the desired significance level $\alpha$. The bound $\xi_q(w, \varrho)$ has several properties that make this possible. In particular, it is monotonically increasing in $\varrho$ and decreasing in $q$. The reason for the monotonicity is that if $X_1$ can be more variable than $X_{0,1}, \dots, X_{0,q}$, then the burden of proof to show "$\mu_1 > \mu_0$" as opposed to "$\mu_1 = \mu_0$ with a large realization of $X_1$" becomes necessarily higher. A large $q$ can ameliorate this effect somewhat because it removes uncertainty about $\mu_0$. The bound also tends to be decreasing in $w \in [0, 1]$ because the integral generally dominates the other components, but can increase slightly in some situations. This is illustrated in Figure 1, where $w \mapsto \xi_q(w, \varrho)$ (solid lines) is essentially decreasing over the entire domain except for $\varrho = 2$ and $w \ge .85$. Most importantly, it can be seen that $w \mapsto \xi_q(w, \varrho)$ decreases enough to dip below the desired significance level $\alpha = .05$ (dashed line) for all values of $\varrho$. As $q$ increases (not shown), $w \mapsto \xi_q(w, \varrho)$ is pushed towards zero but the shape of the function does not change meaningfully with $q$. The $w$ at which $\xi_q(w, \varrho) = \alpha$ is generally unique for most empirically relevant $\alpha$ and does not exist in some extreme situations. This can be seen in Figure 1, where $w \mapsto \xi_q(w, \varrho)$ crosses $\alpha = .05$ only once for each $\varrho$ but, for example, $\xi_q(w, \varrho) = .$ […]

Figure 1. Solid lines show the size bound $\xi_q(w, \varrho)$ at $q = 20$ control observations as a function of the weight $w$ for different values of the maximal heterogeneity $\varrho$ ($\varrho = 2, 5, 20, 100$). The dashed line equals .05.

Theorem 2.1 also provides information about the interplay between $w$ and the test under the alternative. In particular, it shows that the rearrangement test has power against $H_1\colon \mu_1 > \mu_0$ for every $w \in (0, 1)$ but the power declines sharply at $w = 1$. I therefore explore the behavior of the test with $w$ near 1 further in the following result. It provides a lower bound on the power of the test for fixed $\delta$.

Proposition 2.2 (Lower bound on power).
Let $X_1, X_{0,1}, \dots, X_{0,q}$ be independent with $X_1 \sim N(\mu_0 + \delta, \sigma^2)$ and $X_{0,k} \sim N(\mu_0, \sigma_k^2)$ for $1 \le k \le q$. For every $w \in (0, 1)$, $\sigma, \sigma_1, \dots, \sigma_q > 0$, and $\delta > 0$,
$$\inf_{\mu_0 \in \mathbb{R}} E_{\lambda,\delta}\,\phi(X, w) \ge 2^q \sup_{t \ge 0} \Phi\Big( \frac{\delta}{\sigma} - \frac{1+w}{1-w}\, t \Big) \prod_{k=1}^{q} \Big( \Phi\Big( \frac{\sigma}{\sigma_k}\, t \Big) - .5 \Big).$$
The supremum is attained on $t \in (0, \infty)$. The right-hand side is strictly positive and converges to 1 as $\delta \to \infty$.

The bound shows that the test exhibits a standard relationship between the signal $\delta$ and the noise components $\sigma_1, \dots, \sigma_q$. Power is low if the signal relative to $\sigma$ is weak or the noise in the control group relative to $\sigma$ is strong. The latter relationship is in contrast to Theorem 2.1, where small $\sigma_k$ relative to $\sigma$ were problematic. In addition, the bound also clarifies that $w$ dampens $\delta$ through the function $w \mapsto (1+w)/(1-w)$, which is arbitrarily large for $w$ sufficiently close to 1. A $w$ very close to 1 can therefore drown out a large treatment effect even if the noise coming from the control observations is mild. (The role of the supremum is simply to find the best possible balance for a given set of parameters.) It is also worth noting that the bound is tight enough to converge to 1 as $\delta \to \infty$ and to 0 as $w \to 1$.

Because the $w$ that satisfies $\xi_q(w, \varrho) = \alpha$ is not necessarily unique and because Proposition 2.2 suggests that power against the alternative $H_1\colon \mu_1 > \mu_0$ for $w$ near one can be low, it is sensible to choose the smallest feasible $w$, denoted by
$$w_q(\alpha, \varrho) = \inf\big\{ w \in (0, 1) : \xi_q(w, \varrho) = \alpha \big\}, \tag{2.5}$$
in the definition of the rearrangement test function for a test of size $\alpha$,
$$x \mapsto \phi_\alpha(x) := \phi\big(x, w_q(\alpha, \varrho)\big). \tag{2.6}$$
The test $\phi_\alpha$ also depends on $\varrho$ but this is suppressed here to prevent clutter. Table 1 lists values of $w_q(\alpha, \varrho)$ for common choices of $\alpha$ as a function of $\varrho$ and $q$. They guarantee
$$\sup_{\lambda \in \Lambda} E_{\lambda,0}\,\phi_\alpha(X) \le \alpha. \tag{2.7}$$
The list is not exhaustive and additional values can be easily calculated by numerical integration. An R command that performs the calculations can be found at https://hgmn.github.io/rea.

Table 1 shows that the rearrangement test is available in a wide variety of situations depending on the desired significance level and tolerance for heterogeneity. For instance, a test with a 10% significance level is already available with $q = 10$ control observations. A 5% level test becomes available at $q = 15$, a 1% level test at $q = 20$, and for $q \ge 25$ there are essentially no restrictions to the level and underlying heterogeneity. This provides two avenues for implementation:
(1) Choose a desired maximal degree of heterogeneity $\varrho$ and make test decisions based on this choice.
(2) Determine at which degree of maximal heterogeneity the null hypothesis can no longer be rejected.
The first option is similar in spirit to the ubiquitous Staiger and Stock (1997) rule of thumb for weak instruments, where an F statistic larger than 10 corresponds to a tolerance for an at most 10% bias (as defined in Stock and Yogo, 2005) in the instrumental variables estimator relative to least squares. The second option takes the form of a "robustness check." It has a meaningful interpretation because a result that is robust to a tenfold larger standard deviation in the treated observation relative to the control sample is more credible than a result that only survives a twofold difference in standard deviation. This second option leaves it up to the reader to decide whether the results are convincing.

The test decision itself is simple. Choose $w = w_q(\alpha, \varrho)$ from Table 1 for a given number of control observations $q$, desired significance level $\alpha$, and maximal tolerance for heterogeneity, e.g., $\varrho = 2$. For this $w$, compute $S = S(X, w)$ as in (2.1) and reorder the entries of $S$ from largest to smallest to obtain $S^{\triangledown}$. For an $\alpha$-level test of $\mu_1 = \mu_0$, reject in favor of $\mu_1 > \mu_0$ if $T(S) = T(S^{\triangledown})$ as defined in (2.2). For a one-sided test with level $\alpha$ against $\mu_1 < \mu_0$, reject if $T(-S) = T((-S)^{\triangledown})$. For a two-sided test with level $2\alpha$, reject in favor of $\mu_1 \ne \mu_0$ if either $T(S) = T(S^{\triangledown})$ or $T(-S) = T((-S)^{\triangledown})$. The "robustness check" increases $\varrho$ until the null hypothesis can no longer be rejected against the desired alternative.
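For combinations of $q$, $\alpha$, and $\varrho$ that are not listed in Table 1, the bound (2.4) and the weight (2.5) can be approximated numerically. The following Python sketch is one way to do so and is not the R command referenced above. The function names, the Simpson rule, the truncation of the integral at 10, the grid search over $t$, and the scan-then-bisect search for the smallest root are all assumptions made for this illustration.

```python
import math


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def normal_density(y):
    """Standard normal density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def size_bound(q, w, rho, upper=10.0, n=400, t_max=2.0, t_steps=400):
    """Numerical approximation of xi_q(w, rho) in (2.4)."""

    def integrand(y):
        return Phi((1.0 - w) * rho * y) ** (q - 1) * normal_density(y)

    # Composite Simpson rule on [0, upper] with n (even) subintervals;
    # the normal tail mass beyond `upper` is ignored.
    h = upper / n
    total = integrand(0.0) + integrand(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(i * h)
    integral = total * h / 3.0

    def adjustment(t):
        return Phi(math.sqrt(q - 1) * w * t) ** (q - 1) + 2.0 * Phi(-q * t)

    # Crude grid search for the minimum over t > 0 (assumed grid: (0, t_max]).
    min_adjustment = min(adjustment(t_max * j / t_steps)
                         for j in range(1, t_steps + 1))
    return 0.5 ** (q + 1) + integral + min_adjustment


def smallest_weight(alpha, rho, q, step=0.01, tol=1e-6):
    """Approximate w_q(alpha, rho) in (2.5): scan w upward on a grid until
    the bound first drops to alpha, then refine by bisection.
    Returns None if the bound stays above alpha on the grid."""
    previous = 0.0
    for i in range(1, int(round(1.0 / step))):
        w = i * step
        if size_bound(q, w, rho) <= alpha:
            lo, hi = previous, w
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if size_bound(q, mid, rho) <= alpha:
                    hi = mid
                else:
                    lo = mid
            return hi
        previous = w
    return None
```

As a plausibility check, `size_bound(20, 0.3294, 2.0)` should come out close to .10, in line with the Table 1 entry for $\alpha = .10$, $\varrho = 2$, and $q = 20$. A robustness check in the sense of option (2) could call `smallest_weight` for increasing values of `rho` and stop once the test no longer rejects.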
The test decision is monotonic in $\varrho$, i.e., if $\varrho' > \varrho$ lead to the same test decision, then the decision does not change for any value between $\varrho$ and $\varrho'$. An R command that implements the test and the robustness check for any choice of $\varrho$ is available at https://hgmn.github.io/rea.

Table 1. Weights $w_q(\alpha, \varrho)$ as defined in (2.5) that guarantee size control at $\alpha$ for a given maximal degree of heterogeneity $\varrho = \bar{\sigma}/\underline{\sigma}$ for different values of $q$.

α     ϱ    q=10   q=15   q=20   q=25   q=30   q=35   q=40   q=45   q=49
.10   2    .6333  .4010  .3294  .2829  .2475  .2188  .1948  .1742  .1562
      3           .6098  .5543  .5221  .4983  .4792  .4632  .4495  .4375
      4           .7127  .6669  .6418  .6238  .6094  .5974  .5871  .5781
      5           .7732  .7344  .7137  .6991  .6876  .6779  .6697  .6625
      6           .8129  .7792  .7615  .7493  .7396  .7316  .7248  .7188
      7           .8409  .8111  .7957  .7851  .7768  .7700  .7641  .7590
      8           .8616  .8350  .8213  .8120  .8048  .7987  .7936  .7891
      9           .8776  .8536  .8413  .8329  .8265  .8211  .8165  .8125
.05   2           .5752  .5020  .4615  .4318  .4081  .3884  .3715  .3568
      3           .7287  .6703  .6414  .6213  .6054  .5923  .5810  .5712
      4           .8024  .7541  .7314  .7161  .7041  .6942  .6858  .6784
      5           .8450  .8042  .7854  .7729  .7633  .7554  .7486  .7428
      6           .8727  .8374  .8213  .8108  .8028  .7962  .7905  .7856
      7           .8921  .8610  .8469  .8379  .8310  .8253  .8205  .8163
      8           .9064  .8786  .8661  .8582  .8521  .8471  .8429  .8392
      9           .9173  .8923  .8811  .8739  .8685  .8641  .8604  .8571
.025  2           .6981  .6049  .5656  .5387  .5175  .5001  .4852  .4723
      3                  .7400  .7111  .6926  .6784  .6667  .6568  .6482
      4                  .8069  .7838  .7696  .7588  .7501  .7426  .7362
      5                  .8466  .8273  .8157  .8071  .8001  .7941  .7889
      6                  .8728  .8563  .8465  .8393  .8334  .8284  .8241
      7                  .8914  .8770  .8685  .8622  .8572  .8529  .8493
      8                  .9053  .8924  .8849  .8795  .8751  .8713  .8681
      9                  .9160  .9045  .8978  .8929  .8890  .8856  .8828
.01   2                  .6986  .6543  .6286  .6092  .5935  .5801  .5686
      3                  .8058  .7709  .7527  .7396  .7290  .7201  .7124
      4                  .8578  .8290  .8147  .8047  .7968  .7901  .7843
      5                  .8882  .8636  .8519  .8438  .8374  .8321  .8275
      6                  .9080  .8866  .8767  .8699  .8645  .8601  .8562
      7                  .9219  .9030  .8943  .8885  .8839  .8801  .8768
      8                  .9322  .9153  .9076  .9024  .8984  .8951  .8922
      9                  .9401  .9248  .9179  .9133  .9097  .9067  .9042
.005  2                  .7642  .7029  .6764  .6576  .6426  .6300  .6191
      3                         .8042  .7847  .7719  .7618  .7534  .7461
      4                         .8544  .8389  .8290  .8214  .8150  .8096
      5                         .8842  .8713  .8632  .8571  .8520  .8477
      6                         .9040  .8929  .8861  .8809  .8767  .8731
      7                         .9180  .9082  .9024  .8980  .8943  .8912
      8                         .9284  .9198  .9146  .9107  .9075  .9048
      9                         .9365  .9287  .9241  .9207  .9178  .9154

Note: Missing cells mean that the test is not recommended or not feasible. Italics mean that the bound in (2.4) is relatively loose. Upright numbers mean that the bound is nearly tight.

I now turn to a discussion of some technical aspects of the size bound $\xi_q(w, \varrho)$ that forms the theoretical underpinning for the rearrangement test. The bound, defined in (2.4), has three components with simple interpretations: The $1/2^{q+1}$ removes an unlikely event ($X_1 < \mu_0, X_{0,1} < \mu_0, \dots, X_{0,q} < \mu_0$ at the same time) from consideration. This forces a monotonicity property over the complement of this event and allows tightly bounding an oracle version of the problem where $\mu_0$ replaces $\bar{X}_0$ in (2.1). This bound is the integral in (2.4). The minimization problem then adjusts for the fact that the data are centered by $\bar{X}_0$ instead of the unknown $\mu_0$. The minimizer does not have closed form but is easily found numerically. Taken together, $\xi_q(w_q(\alpha, \varrho), \varrho)$ can therefore be roughly viewed as a tight bound for a high-probability event plus two small adjustments. I use Table 1 to illustrate the relative size of these adjustments. In the table, empty cells correspond to situations where there is either no $w$ such that $\xi_q(w, \varrho) = \alpha$ or more than $\alpha/$[…] $\xi_q(w_q(\alpha, \varrho), \varrho)$ is taken up by the non-tight parts of the bound. Cells in italics are settings where between $\alpha/$[…] $\alpha/10$ of the bound are taken up by the non-tight parts. The lack of tightness in the remaining cells is less than $\alpha/10$. For these cells $\sup_{\lambda \in \Lambda} E_{\lambda,0}\,\phi_\alpha(X)$ approximately equals $\alpha$. As the table shows, $\xi_q(w_q(\alpha, \varrho), \varrho)$ is an essentially tight bound for $\sup_{\lambda \in \Lambda} E_{\lambda,0}\,\phi_\alpha(X)$ for $q \ge 30$. The bound is also nearly tight for values of $q$ as small as 15 as long as $\varrho$ is not too large. I return to a discussion of this aspect of the rearrangement test in Example 4.1 (ahead), where I illustrate the size of the test numerically.

Finally, before concluding this section, I show that the rearrangement test remains approximately valid for random vectors $X_n$ converging in distribution to the random vector $X = (X_1, X_{0,1}, \dots, X_{0,q})$ described in Theorem 2.1. The reason is that $E\phi(X_n, w)$ and $E\phi(X, w)$ eventually coincide whenever $X$ has independent entries and a smoothly distributed first entry. The $X$ in Theorem 2.1 easily satisfies these conditions, which makes $\phi_\alpha(X_n)$ asymptotically an $\alpha$-level test.

Proposition 2.3 (Large sample approximation).
Let $X_1, X_{0,1}, \dots, X_{0,q}$ be independent and let $X_1$ have a continuous distribution. If $X_n \rightsquigarrow X$, then $E\phi(X_n, w) \to E\phi(X, w)$ for every $w \in (0, 1)$.

I use Theorem 2.1 and Proposition 2.3 in the next section to construct a simple method for inference with a single treated cluster. Section 4 shows how the rearrangement test performs in Monte Carlo experiments.

3. Inference with a single treated cluster
In this section, I use a single high-level condition to extend the rearrangement test introduced in the previous section to a test about a scalar parameter in research designs with a finite number of large, heterogeneous clusters where only a single cluster received treatment. I then outline how these results can be applied in empirical practice.

Suppose data from $q + 1$ large clusters (e.g., states, industries, or villages observed over one or more time periods) are available. Data are dependent within clusters but independent across clusters. The exact form of dependence is unknown and not presumed to be estimable. An intervention took place during which one cluster received treatment and $q$ clusters did not. The quantity of interest is a treatment effect or an object related to it that can be represented by a scalar parameter $\delta$. Because the entire cluster was treated, this parameter is only identified up to a location shift $\theta_0$ within the treated cluster and therefore only the left-hand side of
$$\theta_1 = \theta_0 + \delta$$
can be identified from this cluster. If the treated cluster would have behaved similarly to the untreated clusters in the absence of an intervention, then $\theta_0$ can be identified from each untreated cluster. Pairwise comparison then identifies $\delta$.

Footnote (on the minimization problem in (2.4)). In particular, at $t = 1/q$, $\Phi(\sqrt{q-1}\,wt)^{q-1} + 2\Phi(-qt) < \Phi(1/\sqrt{q})^{q-1} + 2\Phi(-1) < 1$ for $q > 2$. Because $\Phi(\sqrt{q-1}\,wt)^{q-1} + 2\Phi(-qt) \ge 1$ for $t \in \{0, \infty\}$, the minimization problem always has an interior solution. This also implies that the bound as a whole is a smooth function of $w$ and $\varrho$.

The identification strategy outlined in the preceding paragraph is the idea behind differences in differences (arguably the most popular identification strategy in modern empirical research) and a variety of other models. The goal of this section is to use the rearrangement test to provide a generic method for testing the hypothesis
$$H_0\colon \delta = 0,$$
or, equivalently, $H_0\colon \theta_1 = \theta_0$.
I achieve this by obtaining an estimate $\hat{\theta}_1$ of $\theta_1$ and estimates $\hat{\theta}_{0,1}, \dots, \hat{\theta}_{0,q}$ of $\theta_0$ so that
$$\hat{\theta}_n = (\hat{\theta}_1, \hat{\theta}_{0,1}, \dots, \hat{\theta}_{0,q})$$
is approximately a vector of independent but potentially heterogeneous normal variables that can be used as if it were the data vector $X$ from Section 2.

The following example explains how to construct $\hat{\theta}_n$ in a simple situation. I discuss construction of $\hat{\theta}_n$ for difference in differences towards the end of this section.

Example 3.1 (Regression with cluster-level treatment).
Consider a linear regression model
$$Y_{i,k} = \theta_0 + \delta D_k + \beta_k' X_{i,k} + U_{i,k},$$
where $i$ indexes individuals within cluster $k$. There are $q + 1$ clusters and individuals in cluster $k = q + 1$ received treatment ($D_k = 1$) but those in $1 \le k \le q$ did not ($D_k = 0$). The parameter of interest $\delta$ on the treatment indicator $D_k$ can be interpreted as an average treatment effect under suitable conditions. See, e.g., Słoczyński (2018, 2020) and references therein for a precise discussion. The regression may also include covariates $X_{i,k}$ that vary within each cluster and have coefficients $\beta_k$ that may vary across clusters. The condition $E(U_{i,k} \mid D_k, X_{i,k}) = 0$ identifies $\theta_1 = \theta_0 + \delta$ within the treated cluster and $\theta_0$ within the untreated clusters. The preceding display can then be written as
$$Y_{i,k} = \begin{cases} \theta_0 + \beta_k' X_{i,k} + U_{i,k}, & 1 \le k \le q, \\ \theta_1 + \beta_k' X_{i,k} + U_{i,k}, & k = q + 1. \end{cases}$$
View these as $q + 1$ separate regressions and use the least squares estimates of the constants $\theta_1$ and $\theta_0$ as the vector $\hat{\theta}_n = (\hat{\theta}_1, \hat{\theta}_{0,1}, \dots, \hat{\theta}_{0,q})$ described above. □

I will now show that the cluster-level statistics $\hat{\theta}_n$ can be used together with the results in the previous section to perform a consistent test as the sample size $n$ grows large. The test is not limited to parameters estimated by least squares. Instead, consistency relies on the condition that a centered and scaled version of some estimate $\hat{\theta}_n$ converges to a $(q+1)$-dimensional normal distribution,
$$\sqrt{n} \left( \frac{\hat{\theta}_1 - \theta_1}{\sigma(\theta)}, \frac{\hat{\theta}_{0,1} - \theta_0}{\sigma_1(\theta)}, \dots, \frac{\hat{\theta}_{0,q} - \theta_0}{\sigma_q(\theta)} \right) \overset{\theta}{\rightsquigarrow} N(0, I_{q+1}), \tag{3.1}$$
where $\overset{\theta}{\rightsquigarrow}$ denotes weak convergence under $\theta = (\theta_1, \theta_0)$. For fixed $\theta$, the display can be interpreted as $\sqrt{n}\,(\hat{\theta}_1 - \theta_1, \hat{\theta}_{0,1} - \theta_0, \dots, \hat{\theta}_{0,q} - \theta_0) \rightsquigarrow N(0, \mathrm{diag}(\sigma^2, \sigma_1^2, \dots, \sigma_q^2))$ to include the case that one of the $\sigma_1, \dots, \sigma_q$ may be zero as in Theorem 2.1.
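Returning to Example 3.1, the construction of $\hat{\theta}_n$ can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each cluster has a single scalar covariate (so the least squares intercept has a closed form), observations are stored as `(y, x)` pairs, and the function names are hypothetical. With several covariates, any least squares routine that returns the constant of each cluster-level regression would serve the same purpose.

```python
import math


def intercept(pairs):
    """Least squares constant from regressing y on a constant and one
    covariate x within a single cluster; `pairs` is a list of (y, x)."""
    n = len(pairs)
    xbar = math.fsum(x for _, x in pairs) / n
    ybar = math.fsum(y for y, _ in pairs) / n
    sxx = math.fsum((x - xbar) ** 2 for _, x in pairs)
    sxy = math.fsum((x - xbar) * (y - ybar) for y, x in pairs)
    slope = sxy / sxx  # assumes x varies within the cluster
    return ybar - slope * xbar


def cluster_estimates(treated_cluster, control_clusters):
    """Vector (theta_hat_1, theta_hat_{0,1}, ..., theta_hat_{0,q}):
    the treated cluster's constant first, then one constant per control
    cluster, each from its own separate regression."""
    return [intercept(treated_cluster)] + [intercept(c) for c in control_clusters]
```

The resulting list plays the role of the data vector $X$ from Section 2: its first entry takes the place of $X_1$ and the remaining entries take the place of $X_{0,1}, \dots, X_{0,q}$.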
A key feature of condition (3.1) is that the $\sigma$ and $\sigma_1, \dots, \sigma_q$ are not assumed to be known or estimable by the researcher. This is important for applications because consistent variance estimation generally requires knowledge of an explicit ordering of the dependence structure within each cluster. While time-dependent data are automatically ordered, it may be difficult or impossible to infer or credibly assume an ordering of the data within states or villages. In contrast, (3.1) can be established under weak (short-range) dependence conditions that only require existence of a potentially unknown ordering for which the dependence of more distant units decays sufficiently fast. El Machkouri, Volný, and Wu (2013) present convenient moment bounds and limit theorems for this situation. For more results in this direction, see also Bester et al. (2011) and references therein. In general, the convergence in (3.1) also implicitly requires the number of observations in all clusters to grow with the sample size $n$. However, the clusters are not required to have similar or even identical sizes. Another noteworthy feature of condition (3.1) is the diagonal covariance matrix of the limiting distribution. It is the only independence condition that is imposed on the clusters.

I now show that under the joint convergence (3.1), a rearrangement test that uses $\hat\theta_n$ is asymptotically of level $\alpha$ with a single treated cluster and a fixed number of control clusters. The test $\varphi_\alpha(\hat\theta_n)$, as defined in (2.6), has power against all fixed alternatives $\theta_1 = \theta_0 + \delta$ with $\delta > 0$ and against local alternatives $\theta_1 = \theta_0 + \delta/\sqrt{n}$ converging to the null. In the latter situation, $\theta_0$ is fixed and $\theta = (\theta_0 + \delta/\sqrt{n}, \theta_0)$ implicitly depends on $n$. The convergence in (3.1) is then a statement about an entire sequence $(\theta_0 + \delta/\sqrt{n}, \theta_0)$ instead of a single point. Results for alternatives with $\delta < 0$ are obtained from $\varphi_\alpha(-\hat\theta_n)$. These tests can be combined into a two-sided test that has power against fixed and local alternatives from either direction.
Algorithm 3.4 at the end of this section shows how this can be implemented.

Theorem 3.2 (Consistency and local power).
Suppose (3.1) holds with $\sigma > 0$ and at most one $\sigma_k = 0$. If $\theta_1 = \theta_0$, then
\[ \lim_{n \to \infty} E \varphi_\alpha(\hat\theta_n) \le \alpha \quad \text{for every } \alpha, \varrho \text{ with } 0 < w_q(\alpha, \varrho) < 1, \]
and if $\theta_1 > \theta_0$, then $E \varphi_\alpha(\hat\theta_n) \to 1$. If (3.1) holds with $\theta = (\theta_0 + \delta/\sqrt{n}, \theta_0)$ and the $\sigma, \sigma_1, \dots, \sigma_q$ are continuous and positive at $\theta_0$, then
\[ \lim_{n \to \infty} E \varphi_\alpha(\hat\theta_n) \ge 2^q \sup_{t \ge 0} \Phi\!\left( \frac{\delta}{\sigma(\theta_0)} - \frac{1 + w_q(\alpha, \varrho)}{1 - w_q(\alpha, \varrho)}\, t \right) \prod_{k=1}^{q} \left( \Phi\!\left( \frac{\sigma(\theta_0)}{\sigma_k(\theta_0)}\, t \right) - .5 \right) > 0. \]

Remarks. (i) Because $\varphi_\alpha(\hat\theta_n) = 1$ if and only if $\varphi_\alpha(a(\hat\theta_n - \theta_0 1_{q+1})) = 1$, where $a > 0$ and $1_{q+1}$ is a $(q + 1)$-vector of ones, the $\sqrt{n}$-rate in (3.1) and in the theorem can be replaced by any other rate as long as the asymptotic normal distribution in (3.1) is still attained. Several semiparametric or nonstandard estimators are therefore covered by the theorem.

(ii) It is sometimes of interest in applications to test the null hypothesis $H_0\colon \theta_1 = \theta_0 + \gamma$ for a given $\gamma$. In that case, define $\Gamma = (\gamma 1\{k = 1\})_{1 \le k \le q+1}$ and reject if $\varphi_\alpha(\hat\theta_n - \Gamma) = 1$. Replace $\theta_0$ by $\theta_0 + \gamma$ in Theorem 3.2 and use part (i) of this remark to see that this leads to a consistent test. $\square$

I now discuss how the high-level condition (3.1) can be verified in an application. The specific example I use is difference-in-differences estimation but the arguments
presented here apply more broadly. See also Canay et al. (2017) and Hagemann (2019) for similar types of arguments in other models. For simplicity, I focus on (3.1) under the null hypothesis $H_0\colon \theta_1 = \theta_0$.

Example 3.3 (Difference in differences).
Consider the panel model
\[ Y_{i,t,k} = \theta_0 I_t + \delta I_t D_k + \beta_k' X_{i,t,k} + \zeta_{i,k} + U_{i,t,k}, \tag{3.2} \]
where $i$ indexes individuals in unit $k \in \{1, \dots, q + 1\}$ at time $t \in \{0, 1\}$. Treatment occurred between periods 0 and 1. Right-hand side variables are a post-intervention indicator $I_t = 1\{t = 1\}$, a treatment indicator $D_k$ that equals 1 if unit $k$ ever received treatment, individual fixed effects $\zeta_{i,k}$, and other covariates $X_{i,t,k}$ that for every $k$ vary at least before or after the intervention. The collection of pre- and post-intervention data from unit $k$ forms the $k$-th cluster. Let $n_k$ be the number of individuals in cluster $k$ so that $n = 2 \sum_{k=1}^{q+1} n_k$ is the total sample size. View each cluster as a separate regression and rewrite (3.2) in first differences as
\[ \Delta Y_{i,k} = \begin{cases} \theta_0 + \beta_k' \Delta X_{i,k} + \Delta U_{i,k}, & 1 \le k \le q, \\ \theta_1 + \beta_k' \Delta X_{i,k} + \Delta U_{i,k}, & k = q + 1, \end{cases} \]
where $\Delta Y_{i,k} = Y_{i,1,k} - Y_{i,0,k}$ and so on. Provided $E(\Delta U_{i,k} \mid \Delta X_{i,k}) = 0$, the data identify $\theta_1 = \theta_0 + \delta$ in a treated cluster and $\theta_0$ in an untreated cluster. The least squares estimates $\hat\theta_1$ and $\hat\theta_{0,k}$ of the parameters $\theta_1$ and $\theta_0$ are suitable cluster-level estimates if $\hat\theta_n = (\hat\theta_1, \hat\theta_{0,1}, \dots, \hat\theta_{0,q})$ satisfies condition (3.1).

In the absence of covariates (i.e., $\beta_k \equiv 0$), the estimates under $H_0$ can be expressed as
\[ \sqrt{n}\,(\hat\theta_{0,k} - \theta_0) = \left( \frac{n}{n_k} \right)^{1/2} n_k^{-1/2} \sum_{i=1}^{n_k} \Delta U_{i,k}. \]
The same is true for $\sqrt{n}\,(\hat\theta_1 - \theta_1)$ with $k = q + 1$ on the right-hand side of the display. If the number of individuals per cluster is large in the sense that $n/n_k \to c_k \in (0, \infty)$ for $1 \le k \le q + 1$, then condition (3.1) already holds if $n^{-1/2} \bigl( \sum_{i=1}^{n_k} U_{i,0,k}, \sum_{i=1}^{n_k} U_{i,1,k} \bigr)$ is independent across $1 \le k \le q + 1$ and has a non-degenerate normal limiting distribution for each $k$. The latter condition can be ensured with a central limit theorem for spatially dependent data. See, e.g., Jenish and Prucha (2009) and El Machkouri et al.
(2013) for appropriate results. If the number of individuals per cluster is small, then Theorem 2.1 implies that the rearrangement test can still be applied under the assumption that $\bigl( (U_{i,0,k})^{\top}_{1 \le i \le n_k}, (U_{i,1,k})^{\top}_{1 \le i \le n_k} \bigr)$ is multivariate normal for $1 \le k \le q + 1$. This last condition may be strong but serves to illustrate that $\hat\theta_1$ and $\hat\theta_{0,k}$ need not even be consistent for the test to be valid.

Now consider pooled cross sections with $n_k$ individuals in period 0, $m_k$ individuals in period 1, and $\zeta_{i,k} \equiv \zeta_k$. The calculations in the preceding paragraph still apply with minor modifications. For period 1, $n_k$ has to be replaced by $m_k$. The analysis is no longer in first differences but the underlying conditions are essentially identical as long as $n/n_k \to c_k \in (0, \infty)$ and $n/m_k \to c_k' \in (0, \infty)$ for $1 \le k \le q + 1$, where $n$ is the total sample size. If the number of individuals available post intervention $m = \sum_{k=1}^{q+1} m_k$ is relatively small in the sense that $m/n_k \to 0$ and $m/m_k \to c_k' \in (0, \infty)$, the scale invariance discussed in the remarks below Theorem 3.2 allows replacement of the $\sqrt{n}$ in (3.1) by $\sqrt{m}$. Then (3.1) holds if $n_k^{-1/2} \sum_{i=1}^{n_k} U_{i,0,k} = O_P(1)$ and $m_k^{-1/2} \sum_{i=1}^{m_k} U_{i,1,k}$ obeys a central limit theorem for $1 \le k \le q + 1$. The same argument applies with the roles of $n_k$ and $m_k$ reversed if relatively few individuals are available pre intervention.

The calculations in the preceding two paragraphs can be generalized to include covariates and additional time periods at the expense of more involved notation and non-singularity conditions. The same types of arguments also apply if each cluster consists of one or few units over many time periods, although the conditions for time dependence are generally less involved. See Dedecker et al. (2007) for a comprehensive overview.
These remarks and the calculations in this example also apply to the regression model in Example 3.1. $\square$

Remark (Nonlinear models). The methodology presented here also includes nonlinear models because the parameter $\delta$ does not need to be interpretable by itself. For example, suppose the model in Example 3.1 is the latent model in a binary choice framework with symmetric link function $F$ and $\beta_k \equiv \beta$. Then $F(\theta_0 + \delta + \beta' x) - F(\theta_0 + \beta' x)$ for some $x$ may be the treatment effect of interest but $H_0\colon \delta = 0$ still determines whether the treatment effect is zero or not. Estimates of $\theta_0$ and $\theta_1 = \theta_0 + \delta$ from these models typically do not have closed form in the presence of covariates but generally have asymptotic linear representations to which the same types of arguments as in Example 3.3 can be applied. $\square$

Before concluding this section, I present a brief summary of how the rearrangement test can be implemented in practice. By Theorem 3.2, the following procedure provides an asymptotically $\alpha$-level test in the presence of a finite number of large clusters when only a single cluster received treatment. The test is computationally simple and does not require simulation or resampling, can be two-sided or one-sided in either direction, is able to detect all fixed alternatives, and is powerful against $1/\sqrt{n}$-local alternatives. Recall that $\varrho$ here measures how much more variable the estimate from the treated cluster $\hat\theta_1$ can be relative to the second-least variable control cluster estimate $\hat\theta_{0,k}$. A $\varrho$ of 5 means that the (asymptotic) variance of $\hat\theta_1$ can be up to $5^2 = 25$ times larger. There is no restriction on how much less variable $\hat\theta_1$ can be than any of the other estimates and $\hat\theta_1$ can be infinitely more variable than the least variable control cluster. (See also the discussion above Theorem 2.1.)

Algorithm 3.4 (Rearrangement test).
(1) Choose $w$ from Table 1 for the given number of control clusters $q$, desired significance level $\alpha$, and maximal tolerance for heterogeneity, e.g., $\varrho = 2$.
(2) Compute for each untreated cluster $k = 1, \dots, q$ an estimate $\hat\theta_{0,k}$ of $\theta_0$ and compute an estimate $\hat\theta_1$ of $\theta_1$ from the treated cluster so that the difference $\theta_1 - \theta_0$ is the treatment effect of interest. (See Examples 3.1 and 3.3 above.) Use $\hat\theta_n = (\hat\theta_1, \hat\theta_{0,1}, \dots, \hat\theta_{0,q})$ as if it were $X$ in (2.1) to compute $S = S(\hat\theta_n, w)$ with $w$ as in Step (1). Note that $\bar X_0$ is replaced here by $q^{-1} \sum_{k=1}^{q} \hat\theta_{0,k}$.
(3) Reorder the entries of $S$ from largest to smallest. Denote this by $S^{\triangledown}$ as defined above (2.2). Compute $T(S)$ and $T(S^{\triangledown})$ as in (2.2).
(4) Reject $H_0\colon \theta_1 = \theta_0$ in favor of
    (a) $H_1\colon \theta_1 > \theta_0$ if $T(S) = T(S^{\triangledown})$.
    (b) $H_1\colon \theta_1 < \theta_0$ if $T(-S) = T((-S)^{\triangledown})$.
    (c) $H_1\colon \theta_1 \ne \theta_0$ if either $T(S) = T(S^{\triangledown})$ or $T(-S) = T((-S)^{\triangledown})$ but use $\alpha/2$ instead of $\alpha$ in Step (1). $\square$

This test can also be used as a "robustness check" if inference was originally performed with a method designed for a finer level of clustering, e.g., at the county
level instead of the state level. In that case Algorithm 3.4 can illustrate how well the results of the original test hold up if there is dependence across counties. As I point out in Section 2, one could start at $\varrho = 0$ or $\varrho = 1$ and increase $\varrho$ until the null hypothesis can no longer be rejected. This is informative because a result that holds up to a potentially $\varrho^2 = 25$ times larger variance is more credible than a result that only holds if $\varrho = 1$, i.e., if $\hat\theta_1$ cannot be more variable than all but one $\hat\theta_{0,k}$. If the rearrangement test is used in difference-in-differences models in conjunction with the popular Conley and Taber (2011) test, it is important to note that $\varrho = 1$ still allows for substantial heterogeneity whereas the Conley-Taber test presumes full homogeneity across clusters.

An R command that implements Algorithm 3.4 and the robustness check for any choice of $\varrho$ is available at https://hgmn.github.io/rea . The next section shows how the rearrangement test performs in simulations and an application.

4. Numerical results
This section explores the finite-sample behavior of the rearrangement test intwo experiments. Example 4.1 compares the rearrangement test to the widely usedConley and Taber (2011) test in the two-way fixed effects model with clusters.Example 4.2 applies the rearrangement test as a robustness check for the results ofGarthwaite et al. (2014). The discussion focuses on one-sided tests to the right butthe results apply more generally.
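The decision rule of Algorithm 3.4 that is evaluated in this section can be sketched in a few lines of Python. The sketch does not compute $T(S)$ and $T(S^{\triangledown})$ directly. Instead it uses the characterization from the proof of Theorem 2.1, under which (ties aside) the test rejects exactly when $\min\{(1 + w)(X_1 - \bar X_0), (1 - w)(X_1 - \bar X_0)\} > \max_k (X_{0,k} - \bar X_0)$. The weight $w$ must still be taken from Table 1, which is not reproduced here. The function names and the `weight_for` callback, which stands in for a lookup of $w_q(\alpha, \varrho)$, are hypothetical and are not the interface of the R command mentioned above.

```python
import numpy as np


def rearrangement_reject(theta1_hat, theta0_hat, w):
    """One-sided test to the right for a given weight w in (0, 1).

    Rejects when the smaller of the two weighted, centered treated-cluster
    statistics exceeds every centered control-cluster statistic.
    """
    theta0_hat = np.asarray(theta0_hat, dtype=float)
    if not 0.0 < w < 1.0:
        raise ValueError("w must lie strictly between 0 and 1")
    center = theta0_hat.mean()  # plays the role of X-bar_0
    gap = float(theta1_hat) - center
    treated_min = min((1.0 + w) * gap, (1.0 - w) * gap)
    return bool(treated_min > (theta0_hat - center).max())


def rearrangement_test(theta1_hat, theta0_hat, w, alternative="greater"):
    """Steps (4a)-(4c); for "two-sided", w must be chosen for alpha / 2."""
    theta0_hat = np.asarray(theta0_hat, dtype=float)
    right = rearrangement_reject(theta1_hat, theta0_hat, w)
    left = rearrangement_reject(-float(theta1_hat), -theta0_hat, w)
    if alternative == "greater":
        return right
    if alternative == "less":
        return left
    if alternative == "two-sided":
        return right or left
    raise ValueError("unknown alternative")


def largest_rho(theta1_hat, theta0_hat, weight_for, step=0.001, rho_max=10.0):
    """Robustness check: raise rho from 0 in increments of `step` for as
    long as the null is rejected; return the last rho with a rejection
    (None if the test never rejects).

    `weight_for(rho)` is a user-supplied, hypothetical lookup of the
    tabulated weight for the relevant q and alpha.
    """
    last = None
    for i in range(int(round(rho_max / step)) + 1):
        rho = i * step
        w = weight_for(rho)
        if not 0.0 < w < 1.0:
            break
        if not rearrangement_reject(theta1_hat, theta0_hat, w):
            break
        last = rho
    return last
```

The inputs are the cluster-level estimates only, so the same functions apply to any model in which those estimates are approximately normal in the sense of condition (3.1).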
Example 4.1 (Two-way fixed effects; Conley and Taber, 2011).
This example uses a Monte Carlo experiment to compare the rearrangement test to the Conley and Taber (2011) test. The Conley-Taber test is designed specifically for difference in differences and applies to models with a single treated cluster. Following Conley and Taber (2011, sec. V), the data are generated from the two-way fixed effects model
\[ Y_{t,k} = \delta I_t D_k + \eta_t + \zeta_k + U_{t,k}, \tag{4.1} \]
where $I_t$ is a post-intervention indicator, $D_k$ is a treatment indicator, and $\eta_t$ and $\zeta_k$ are time and cluster fixed effects, respectively. The error term satisfies
\[ U_{t,k} = \gamma U_{t-1,k} + \sigma^{1\{k = q+1\}} V_{t,k}, \tag{4.2} \]
where the $V_{t,k}$ are iid copies of a standard normal variable and $k = q + 1$ is the one cluster that received treatment. The model uses $\eta_t \equiv 0 \equiv \zeta_k$, ten time periods with four post-intervention periods, and, unless stated otherwise, a baseline value of the AR(1) coefficient $\gamma$ and $\delta = 0$. I do not consider all of Conley and Taber's variations of their model and, to focus on the simplest possible situation, I do not include covariates. I expand upon their analysis by investigating smaller numbers of control clusters $q$ and values of $\sigma$ other than one. In the latter situation, the Conley-Taber test can be expected to fail because it relies heavily on homogeneity of all clusters in absence of an intervention. The Conley-Taber test can be restored (as $q \to \infty$) if the exact form of heterogeneity is known (Ferman and Pinto, 2019; Ferman, 2020) but this is not assumed here.

The Conley-Taber test with one treated cluster can be computed as follows: (1) Regress the outcome on $I_t D_k$, time and cluster fixed effects, and other covariates (if available). Denote the coefficient on $I_t D_k$ by $\hat\delta$. (2) Split the residuals by cluster and run, for each of the $q$ control clusters separately, regressions of the residuals on a constant and $I_t$. (3) Compute the $1 - \alpha$ empirical quantile of the $q$ coefficients on $I_t$. Reject $H_0\colon \delta = 0$ if $\hat\delta$ is larger than that quantile.
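The three steps of the Conley-Taber test described above can be sketched as follows for a balanced panel without covariates. This is an illustrative reading of the steps as stated in the text, not code from Conley and Taber (2011) or from this paper. It assumes NumPy, a $T \times (q + 1)$ outcome array with the treated cluster in the last column, and NumPy's default (linearly interpolated) definition of the empirical quantile; the function name `conley_taber_reject` is hypothetical.

```python
import numpy as np


def conley_taber_reject(Y, post, alpha=0.05):
    """Steps (1)-(3) for a balanced panel with one treated cluster.

    Y    : array of shape (T, q + 1); treated cluster in the last column.
    post : boolean array of length T marking post-intervention periods.
    Returns (delta_hat, critical_value, reject).
    """
    Y = np.asarray(Y, dtype=float)
    post = np.asarray(post, dtype=bool)
    T, K = Y.shape

    # Long-format indices that match Y.ravel() (row-major order).
    t_idx, k_idx = np.meshgrid(np.arange(T), np.arange(K), indexing="ij")
    t_idx, k_idx = t_idx.ravel(), k_idx.ravel()

    # Step (1): regress Y on I_t * D_k plus time and cluster dummies.
    # The dummies are collinear, but the coefficient on I_t * D_k and the
    # residuals are the same for every least squares solution.
    treat_post = (post[t_idx] & (k_idx == K - 1)).astype(float)
    time_fe = (t_idx[:, None] == np.arange(T)).astype(float)
    cluster_fe = (k_idx[:, None] == np.arange(K)).astype(float)
    design = np.column_stack([treat_post, time_fe, cluster_fe])
    coef, *_ = np.linalg.lstsq(design, Y.ravel(), rcond=None)
    delta_hat = float(coef[0])
    resid = (Y.ravel() - design @ coef).reshape(T, K)

    # Step (2): in each control cluster, the slope from regressing the
    # residuals on a constant and I_t is the post-minus-pre mean.
    controls = resid[:, : K - 1]
    slopes = controls[post].mean(axis=0) - controls[~post].mean(axis=0)

    # Step (3): compare delta_hat with the 1 - alpha empirical quantile.
    critical_value = float(np.quantile(slopes, 1.0 - alpha))
    return delta_hat, critical_value, bool(delta_hat > critical_value)
```

Other quantile conventions can change the critical value slightly when $q$ is small, so the quantile rule should be treated as an assumption of the sketch.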
[Figure 2: two panels, "Conley-Taber (size, δ = 0)" (left) and "Rearrangement (size, δ = 0)" (right); horizontal axis: σ; vertical axis: rejection frequency; series: q = 50 with normal errors, q = 15 with normal errors, q = 50 with chi-squared errors; the 5% level is marked.]

Figure 2. Rejection frequencies of a true null as a function of the heterogeneity $\sigma$ for the Conley-Taber test (left) and the rearrangement test (right) with (i) $q = 50$ control clusters and normal errors (solid lines), (ii) $q = 15$ and normal errors (long-dashed), and (iii) $q = 50$ and chi-squared errors (dotted). The short-dashed line equals .05. The rearrangement test uses $\varrho = 2$ (vertical line).
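The design behind Figure 2 can be reproduced in outline with a short simulation of (4.1) and (4.2). The following Python sketch assumes NumPy and makes two choices that the text does not specify: the AR(1) errors are started at zero and run through a burn-in period before the ten observed periods, and $\gamma$ is left as an argument rather than fixed at a particular value. The names `simulate_panel` and `cluster_estimates` are hypothetical.

```python
import numpy as np


def simulate_panel(q, sigma, gamma, delta=0.0, n_periods=10, n_post=4,
                   burn_in=100, rng=None):
    """Draw Y[t, k] from (4.1)-(4.2) with eta_t = zeta_k = 0.

    The treated cluster is the last column, and only its innovations are
    scaled by sigma. The burn-in is an assumption of this sketch.
    Returns (Y, post) with Y of shape (n_periods, q + 1).
    """
    rng = np.random.default_rng(rng)
    n_clusters = q + 1
    scale = np.ones(n_clusters)
    scale[-1] = sigma  # sigma^{1{k = q + 1}}

    U = np.zeros(n_clusters)
    path = np.empty((burn_in + n_periods, n_clusters))
    for t in range(burn_in + n_periods):
        U = gamma * U + scale * rng.standard_normal(n_clusters)
        path[t] = U

    post = np.arange(n_periods) >= n_periods - n_post
    Y = path[burn_in:].copy()
    Y[post, -1] += delta  # treatment effect in the treated cluster only
    return Y, post


def cluster_estimates(Y, post):
    """Coefficient on I_t from regressing each cluster's outcome on a
    constant and I_t, which equals the post-minus-pre mean.

    Returns (theta1_hat, theta0_hat) with theta0_hat of length q.
    """
    est = Y[post].mean(axis=0) - Y[~post].mean(axis=0)
    return float(est[-1]), est[:-1]
```

Repeating the draw many times and applying a test to each pair `(theta1_hat, theta0_hat)` gives Monte Carlo rejection frequencies of the kind plotted in the figure.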
The rearrangement test can be computed similarly from $q + 1$ separate artificial regressions of $Y_{t,k}$ on a constant and the post-intervention indicator $I_t$,
\[ Y_{t,k} = \zeta + \theta_0 I_t + \text{error}_{t,k}, \quad 1 \le k \le q, \]
\[ Y_{t,k} = \zeta + \theta_1 I_t + \text{error}_{t,k}, \quad k = q + 1, \]
where $\zeta$ is the intercept in each regression. The coefficients on the post-intervention indicator can be expressed as $\theta_0 = \bar\eta_+ - \bar\eta_-$ and $\theta_1 = \delta + \bar\eta_+ - \bar\eta_-$, where $\bar\eta_-$ and $\bar\eta_+$ are time averages of $\eta_t$ pre and post intervention, respectively. Because $\delta = \theta_1 - \theta_0$, I apply the rearrangement test to the least squares estimates $\hat\theta_{0,1}, \dots, \hat\theta_{0,q}$ and $\hat\theta_1$ of $\theta_0$ and $\theta_1$, respectively. I view (4.1) as coming from individual-level data aggregated to the cluster level with a fixed number of time periods. The estimates $\hat\theta_1, \hat\theta_{0,1}, \dots, \hat\theta_{0,q}$ should therefore be approximately normal for the rearrangement test to apply. To test deviations from this assumption in finite samples, I also consider a situation where the innovations $V_{t,k}$ in (4.2) are centered and scaled chi-squared variables.

Figure 2 shows the rejection frequencies of a true null hypothesis $H_0\colon \delta = 0$ as a function of $\sigma$, on a grid of values up to 2.5, for the two tests at the 5% level (short-dashed lines). The assumptions of the Conley-Taber test (left) hold as $q \to \infty$ when $\sigma = 1$ but are violated at any sample size as soon as $\sigma > 1$. The rearrangement test (right) here uses $\varrho = 2$ (vertical line). The assumptions of the rearrangement test are violated as soon as $\sigma > 2$. The figure shows rejection rates in 10,000 Monte Carlo experiments for each horizontal coordinate with (i) $q = 50$ control clusters (solid lines), (ii) $q = 15$ (long-dashed), and (iii) $q = 50$ but the $V_{t,k}$ are iid copies of a centered and scaled chi-squared variable (dotted). The Conley-Taber test performed well at $\sigma = 1$ but quickly became unusable as $\sigma$ increased. It exceeded a 10% rejection rate at about $\sigma = 1.25$. At $\sigma = 2.5$, the Conley-Taber test falsely discovered a nonzero effect in about 25% of all
cases. In contrast, the rearrangement test was able to reject at or below the nominal level of the test as long as $\sigma \le \varrho$. For $\sigma > \varrho$, the rearrangement test eventually started to over-reject. It performed worst at $\sigma = 2.5$, where it rejected in 6.9-9.2% of all cases.

[Figure 3: two panels, "Rearrangement (power, δ = 2)" (left) and "Rearrangement (power, δ = 3)" (right); horizontal axis: σ; vertical axis: rejection frequency; five series that vary q (50 or 15), the error distribution (normal or chi-squared), and γ.]

Figure 3. Rejection frequencies of the rearrangement test ($\varrho = 2$) under the alternative as a function of the heterogeneity $\sigma$ at $\delta = 2$ (left) and $\delta = 3$ (right) with (i) and (ii) as in Figure 2, (iii) is (i) with weak time dependence, (iv) is (i) with strong time dependence, and (v) is (i) with chi-squared errors.

I also conducted a large number of additional experiments under the null. I considered (not shown) other distributions for $V_{t,k}$ and other values of the AR(1) coefficient $\gamma$, the number of time periods, the number of post-intervention periods, and the number of control clusters. However, I found that these changes had little impact on the results in the preceding paragraph. The Conley-Taber test performed well when there was no heterogeneity but over-rejected wildly otherwise. More results in this direction can be found in Canay et al. (2017), who come to the same conclusion in their experiments. The rearrangement test continued to be highly robust to heterogeneity as long as $\varrho$ was not chosen to be much too small.

I now turn to the performance of the rearrangement test under the alternative. The behavior of the Conley-Taber test under the alternative is not discussed due to its massive size distortion. I consider the same models as before together with some variations mentioned in the preceding paragraph but use nonzero $\delta$. Figure 3 shows the results with $\delta = 2$ (left) and $\delta = 3$ (right). The base model is again model (i) with $q = 50$ control clusters, standard normal $V_{t,k}$, and the baseline time dependence $\gamma$ (solid). Model (ii) uses $q = 15$ (long-dashed), (iii) lowers the time dependence, (iv) raises the time dependence, and (v) uses centered and scaled chi-squared innovations. The rearrangement test reliably detected smaller treatment effects when the time dependence was relatively weak. Increasing the treatment effect (right) improved detection rates substantially and uniformly across models, with strong time dependence again being the most challenging situation. The rearrangement test now had considerable power even when only 15 control clusters were available, the innovations were asymmetric, or the time dependence was not extreme. Power was very high when there was little time dependence.

Figures 2 and 3 also illustrate two noteworthy aspects of the rearrangement test: (1) The inequality the rearrangement test is based on is nearly tight (as discussed below equation (2.6)) in the sense that it cannot be meaningfully improved upon unless $q$ is very small. This can be seen in the right panel of Figure 2, where the rejection rate of the test was essentially at or slightly below nominal level when $\sigma = \varrho$. (2) Rejection rates under the null hypothesis increase with $\sigma$ but this does not necessarily translate into increased rejection rates under the alternative for large $\sigma$. This is seen in the right panel of Figure 3, where the power decreases with $\sigma$ in the presence of weak time dependence. $\square$

Example 4.2 (Health insurance and labor supply; Garthwaite et al., 2014).
In this example, I use the rearrangement test to reanalyze the results of Garthwaite et al. (2014). They use a difference-in-differences design to study the effects of a large-scale disruption of public health insurance on labor supply. Their design exploits that in 2005 approximately 170,000 adults in Tennessee (roughly 4% of the state's non-elderly, adult population) abruptly lost access to TennCare, the state's public health insurance system. Garthwaite et al. use data from the 2001-2008 March Current Population Survey to determine health insurance and work status for the years 2000-2007. The comparison groups for Tennessee are the 16 other Southern states defined by the U.S. Census Bureau.

The main treatment effect in Garthwaite et al. (2014, their $\beta$ in their equation (1)) can be estimated as $\delta$ in
\[ Y_{t,k} = \theta_0 I_t + \delta I_t D_k + \zeta_k + U_{t,k}, \]
where $Y_{t,k}$ is a state-by-year mean of an outcome of interest for state $k$ in year $t$, $I_t$ is an indicator for the post-intervention years, and $D_k$ equals one for an observation from Tennessee and equals zero otherwise. There are $17 \times 8 = 136$ state-by-year observations. Garthwaite et al. conduct inference about $\delta$ with bootstrap standard errors that are compared to Student $t$ critical values with 16 degrees of freedom. Their preferred bootstrap first draws states with replacement and then draws individuals within those states with replacement. This type of inference accounts for autocorrelation within individuals over time but generally requires the number of clusters to be infinite for the asymptotics. This bootstrap also does not account for potential dependence within states.

I replicate the findings of Garthwaite et al. (2014) in the top panel of Table 2. They estimate the causal effect of the TennCare disenrollment on the probability of (1) having public health insurance, (2) being employed, and (3)-(6) being employed for a certain number of hours per week.
I show their bootstrap standard errors in parentheses but report one-sided $p$-values in brackets instead of their two-sided $p$-values. In (1) the alternative is a negative effect, for (2)-(6) the alternative is positive. Garthwaite et al. find a highly significant 4.6 percentage point decrease for (1) and mostly significant positive effects for (2)-(6). They document an approximately 2.5 percentage point increase in employment and find the same effect if the outcome is restricted to individuals working more than 20 hours or more than 35 hours a week. All three effects are significant at the 5% level. The inference in Garthwaite et al. shows no significant effect for individuals working less than 20 hours or 20-35 hours.

Table 2. Effects of TennCare disenrollment in Garthwaite et al. (2014, Table II.A) with their autocorrelation-robust bootstrap standard errors (top) and the largest $\varrho$ at which a rearrangement test robust to arbitrary correlation within states and over time still detects an effect (bottom).

Columns: (1) Has public health insurance; (2) Employed; (3) Employed working < 20 hours per week; (4) Employed working ≥ 20 hours per week; (5) Employed working 20-35 hours per week; (6) Employed working ≥ 35 hours per week.

                      (1)       (2)       (3)       (4)       (5)       (6)
  δ̂ (s.e.)            …         …         …         …         …         …
  p-val.              [0.000]   [0.019]   [0.621]   [0.011]   [0.453]   [0.020]

  Rearrangement test: largest ϱ at which H_0: δ = 0 is rejected at level α
  ("×" indicates that H_0: δ = 0 cannot be rejected for any ϱ ≥ 0)
  α = .10             5.434     1.793     ×         …         ×         ×
  α = .05             2.914     0.972     ×         …         ×         ×

The Southern states are Alabama, Arkansas, Delaware, the District of Columbia, Florida, Georgia, Kentucky, Louisiana, Maryland, Mississippi, North Carolina, Oklahoma, Tennessee, Texas, Virginia, South Carolina, and West Virginia.

I now apply the rearrangement test as a robustness check. I view each state over time as a single cluster and run 17 separate least squares regressions of the form
\[ Y_{t,k} = \theta_0 I_t + \zeta_k + U_{t,k}, \quad 1 \le k \le 16, \]
\[ Y_{t,k} = \theta_1 I_t + \zeta_k + U_{t,k}, \quad k = 17, \]
to obtain $\hat\theta_{0,k}$ ($1 \le k \le 16$) from each of the Southern states except Tennessee and $\hat\theta_1$ from Tennessee ($k = 17$). Note that the $\zeta_k$ are now the constant terms in each regression. To perform the robustness check, I start with $\varrho = 0$ and increase $\varrho$ by .001 in Algorithm 3.4 as long as the null hypothesis $H_0\colon \delta = 0$ is still rejected. The bottom panel of Table 2 shows the largest feasible value of $\varrho$ for outcomes (1)-(6). At the 10% level, the result in (1) survives an up to 5.4 times larger variance in the estimate from Tennessee relative to the second-least variable control cluster estimate. The result in (2) holds if Tennessee has a 1.8 times larger variance and (4) holds even with an up to 2.2 times larger variance. At the 5% level, these three results remain valid with smaller $\varrho$ but the result in (2) only survives if the estimate from Tennessee is at most slightly less variable than the second-least variable control cluster estimate. The results in (3) and (5) confirm findings in Garthwaite et al. (2014) in that they are not significant at any level and for any value of $\varrho$.

A noteworthy situation occurs in (6), where the rearrangement test disagrees sharply with the significant effect found by Garthwaite et al. (2014). The rearrangement test finds no effect at any significance level and for any $\varrho$. In contrast, the effects in (2) and (6) are not only essentially identical but also have identical standard errors. (The $p$-values differ slightly because of rounding.) This also illustrates that the rearrangement test differs fundamentally from inference based on $t$ statistics and resampling.

In sum, the rearrangement test robustly confirms, with one exception, the results of Garthwaite et al. (2014). There is statistical evidence of increased employment concentrated among individuals working at least 20 hours per week even if one accounts for arbitrary dependence within states and over time. The results hold up to substantial heterogeneity across clusters even if the number of clusters is treated as fixed for the analysis. It is also worth noting that $\varrho$ only restricts heterogeneity in one direction.
All of the results presented here are robust to arbitrary heterogeneity in any other direction and to Tennessee being infinitely more variable than the least variable control cluster. $\square$

Conclusion
I introduce a generic method for inference about a scalar parameter in researchdesigns with a finite number of large, heterogeneous clusters where only a singlecluster received treatment. This situation is commonplace in difference-in-differencesestimation but the test developed here applies more generally. I show that thetest asymptotically controls size and has power in a setting where the number ofobservations within each cluster is large but the number of clusters is fixed. Thetest combines independent, approximately Gaussian parameter estimates from eachcluster with a weighting scheme and a rearrangement procedure to obtain its criticalvalues. The weights needed for most empirically relevant situations are tabulatedin the paper. The critical values are computationally simple and do not requiresimulation or resampling. The test is highly robust to situations where some clustersare much more variable than others. Examples and an empirical application areprovided.
Appendix A. Proofs
Proof of Theorem 2.1.
Choose any $\lambda \in \Lambda$ and $w \in (0, 1)$. Let $S(X, w) = S = (S_1, \dots, S_{q+2})$. By continuity, we have $T(S) = T(S^{\triangledown})$ if and only if $S_1 + S_2 = S_{(q+2)} + S_{(q+1)}$ and $\sum_{k=1}^{q} S_{k+2} = \sum_{k=1}^{q} S_{(k)}$ almost surely. Conclude that
\[ E_{\lambda,0}\, \varphi(X, w) = P_{\lambda,0}\Bigl( \min\{ (1 + w)(X_1 - \bar X_0), (1 - w)(X_1 - \bar X_0) \} > \max_k (X_{0,k} - \bar X_0) \Bigr). \]
Because of the centering, we can without loss of generality assume $\mu = 0$. Define $X_{1,1} = (1 + w) X_1$ and $X_{1,2} = (1 - w) X_1$. Use monotonicity of maximum and minimum to express the right-hand side of the preceding display as $P_{\lambda,0}(\min\{ X_{1,1} - w \bar X_0, X_{1,2} + w \bar X_0 \} > X_{0,(q)})$. Let $s^2 = \sum_{k=1}^{q} \sigma_k^2$ and denote by $\tilde\varphi(X, w)$ an infeasible version of the test function $\varphi(X, w)$ that replaces $\bar X_0$ by $\mu$. The inequality $|1\{a > b\} - 1\{c > b\}| \le 1\{|a - b| \le |a - c|\}$ for $a, b, c \in \mathbb{R}$ and the triangle inequality then imply that for every $t > 0$,
\[ \sup_{\lambda \in \Lambda} \bigl| E_{\lambda,0}\, \varphi(X, w) 1\{|\bar X_0| \le st\} - E_{\lambda,0}\, \tilde\varphi(X, w) 1\{|\bar X_0| \le st\} \bigr| \]
cannot exceed
\[ \sup_{\lambda \in \Lambda} P_{\lambda,0}\bigl( |X_{1,(1)} - X_{0,(q)}| \le |\min\{ X_{1,1} - w \bar X_0, X_{1,2} + w \bar X_0 \} - X_{1,(1)}|,\ |\bar X_0| \le st \bigr). \]
By monotonicity, this is at most $\sup_{\lambda \in \Lambda} P_{\lambda,0}(|X_{1,(1)} - X_{0,(q)}| \le wst)$. Note that $X_{1,(1)}$ is negatively skewed and $X_{0,(q)}$ positively skewed. Because $X_{1,(1)}$ and $X_{0,(q)}$ are independent, $P_{\lambda,0}(|X_{1,(1)} - X_{0,(q)}| \le wst)$ is largest when $X_{1,(1)}$ has the least skew. This happens at $\sigma = 0$ and implies
\[ \sup_{\lambda \in \Lambda} P_{\lambda,0}(|X_{1,(1)} - X_{0,(q)}| \le wst) = \sup_{\lambda \in \Lambda} P_{\lambda,0}(|X_{0,(q)}| \le wst). \]
The probability on the right is the supremum of $\prod_{k=1}^{q} \Phi(wst/\sigma_k) - \prod_{k=1}^{q} \Phi(-wst/\sigma_k)$ over $\lambda \in \Lambda$. Because $s/\sigma_k$ is decreasing in $\sigma_k$, the entire expression must be decreasing in $\sigma_k$ and the supremum in the preceding display is therefore attained at $\sigma_1 = \dots = \sigma_{q-1} = \underline{\sigma}$ and $\sigma_q = 0$. Conclude that $\sup_{\lambda \in \Lambda} P_{\lambda,0}(|X_{1,(1)} - X_{0,(q)}| \le wst) \le \Phi(\sqrt{q - 1}\, wt)^{q-1}$. Because
\[ \bigl| E_{\lambda,0}\, \varphi(X, w) 1\{|\bar X_0| > st\} - E_{\lambda,0}\, \tilde\varphi(X, w) 1\{|\bar X_0| > st\} \bigr| \le P(|\bar X_0| > st) = 2\Phi(-qt) \]
and because all bounds so far are valid for every $t$, it follows that
\[ \sup_{\lambda \in \Lambda} \bigl| E_{\lambda,0}\, \varphi(X, w) - E_{\lambda,0}\, \tilde\varphi(X, w) \bigr| \le \min_{t > 0} \Bigl( \Phi\bigl(\sqrt{q - 1}\, wt\bigr)^{q-1} + 2\Phi(-qt) \Bigr). \]
Now consider $E_{\lambda,0}\, \tilde\varphi(X, w) = P_{\lambda,0}(X_{1,(1)} > X_{0,(q)})$, which can be expressed as
\[ P\bigl( (1 - w) X_1 > X_{0,(q)},\ X_1 > 0 \bigr) + P\bigl( (1 + w) X_1 > X_{0,(q)},\ X_1 < 0 \bigr). \]
The second term on the right is at most $P(X_{0,(q)} < 0, Y < 0) = \Phi(0)^{q+1} = 2^{-q-1}$. Use independence to write the first term of the preceding display as
\[ \int_0^\infty \prod_{k=1}^{q} \Phi\!\left( \frac{(1 - w) \sigma y}{\sigma_k} \right) \phi(y)\, dy \le \int_0^\infty \Phi\!\left( \frac{(1 - w) \bar\sigma y}{\underline{\sigma}} \right)^{q-1} \phi(y)\, dy, \]
where the inequality follows because the integrand is increasing in $\sigma$, decreasing in $\sigma_k$, and at most one $\sigma_k$ can be arbitrarily close to zero. Combine the bounds on $E_{\lambda,0}\, \tilde\varphi(X, w)$ and $E_{\lambda,0}\, \varphi(X, w) - E_{\lambda,0}\, \tilde\varphi(X, w)$ to obtain the bound $\xi_q$.

Now consider the alternative. We still have
\[ E_{\lambda,\delta}\, \varphi(X, w) = P_{\lambda,\delta}\Bigl( \min\{ (1 + w)(X_1 - \bar X_0), (1 - w)(X_1 - \bar X_0) \} > \max_k (X_{0,k} - \bar X_0) \Bigr). \]
Because $1\{ \min\{ (1 + w)(X_1 - \bar X_0), (1 - w)(X_1 - \bar X_0) \} > \max_k (X_{0,k} - \bar X_0) \} \to 1$ almost surely as $\delta \to \infty$ for $w \in (0, 1)$, dominated convergence implies $E_{\lambda,\delta}\, \varphi(X, w) \to 1$. At $w = 1$, $\min\{ 2(X_1 - \bar X_0), 0 \} - \max_k (X_{0,k} - \bar X_0) \to -\max_k (X_{0,k} - \bar X_0)$ almost surely as $\delta \to \infty$. This limit has a continuous distribution function at 0. At $w = 1$, the Slutsky lemma implies that the preceding display converges to $P(0 > \max_k (X_{0,k} - \bar X_0)) = P(\bar X_0 > \max_k X_{0,k}) = 0$, as required. $\square$

Proof of Proposition 2.2.
Let $A_t = \bigcap_{k=1}^{q} \{ -t < X_{0,k} \le t \}$ for some $t > 0$. As above, assume without loss of generality that $\mu = 0$ and recall that $E_{\lambda,\delta}\, \varphi(X, w) = P_{\lambda,\delta}(\min\{ X_{1,1} - w \bar X_0, X_{1,2} + w \bar X_0 \} > X_{0,(q)})$. For every fixed $t$, this is strictly larger than
\[ P\bigl( \min\{ X_{1,1} - w \bar X_0, X_{1,2} + w \bar X_0 \} > X_{0,(q)},\ A_t \bigr) \ge P\bigl( \min\{ X_{1,1}, X_{1,2} \} - wt > t,\ A_t \bigr) \]
because $X_{0,(q)} \le t$ and $|\bar X_0| \le t$ on $A_t$. By independence and because $t > 0$, the display can be expressed as
\[ P_{\lambda,\delta}\!\left( X_1 > \frac{1 + w}{1 - w}\, t \right) P_\lambda(A_t) = P_{\lambda,\delta}\!\left( X_1 > \frac{1 + w}{1 - w}\, t \right) \prod_{k=1}^{q} \bigl( \Phi(t/\sigma_k) - \Phi(-t/\sigma_k) \bigr). \]
By symmetry, this simplifies to
\[ \Phi\!\left( \Bigl( \delta - \frac{1 + w}{1 - w}\, t \Bigr) \Big/ \sigma \right) 2^q \prod_{k=1}^{q} \bigl( \Phi(t/\sigma_k) - .5 \bigr) \]
and, because $t$ was arbitrary, it must be true that
\[ E_{\lambda,\delta}\, \varphi(X, w) \ge 2^q \sup_{t \ge 0} \Phi\!\left( \Bigl( \delta - \frac{1 + w}{1 - w}\, t \Bigr) \Big/ \sigma \right) \prod_{k=1}^{q} \bigl( \Phi(t/\sigma_k) - .5 \bigr). \]
Replace $t$ by $t\sigma$ to obtain the bound in the proposition.

The quantity inside the supremum is continuous on $[0, \infty]$, equals zero at $t = 0$ and $t = \infty$, and is strictly positive on $t \in (0, \infty)$. The interval $[0, \infty]$ with the order topology is compact and the supremum must therefore be attained on $t \in (0, \infty)$ to not contradict the extreme value theorem. The supremum in the preceding display is therefore a maximum over $t \in (0, \infty)$ for every fixed $\delta \in [0, \infty)$ and the maximized function is a continuous function of $\delta$ on $[0, \infty]$ by the Berge maximum theorem. As $\delta \to \infty$, the supremum is attained at $t = \infty$ and the right-hand side of the display equals one. $\square$

Proof of Proposition 2.3.
Let $S(X_n, w) = S_n = (S_{1,n}, \dots, S_{q+2,n})$. We cannot have $\min\{S_{1,n}, S_{2,n}\} < \max\{S_{3,n}, \dots, S_{q+2,n}\}$ and $T(S_n) = T(S_n^{\triangledown})$ at the same time. Moreover, the reverse inequality implies $T(S_n) = T(S_n^{\triangledown})$. Conclude that
\[
E\phi(X_n, w) = P\bigl(\min\{S_{1,n}, S_{2,n}\} > \max\{S_{3,n}, \dots, S_{q+2,n}\}\bigr) + P\bigl(T(S_n) = T(S_n^{\triangledown}),\, \min\{S_{1,n}, S_{2,n}\} = \max\{S_{3,n}, \dots, S_{q+2,n}\}\bigr).
\]
By the assumed weak convergence and the continuous mapping theorem, we have $S(X_n, w) \rightsquigarrow S(X, w) = (S_1, \dots, S_{q+2})$. Use the continuous mapping theorem again to deduce
\[
\min\{S_{1,n}, S_{2,n}\} - \max\{S_{3,n}, \dots, S_{q+2,n}\} \rightsquigarrow \min\{S_1, S_2\} - \max\{S_3, \dots, S_{q+2}\}.
\]
The right-hand side can be expressed as
\[
h_{X_{0,1},\dots,X_{0,q}}(X_1) := \min\{(1+w)(X_1-\bar X_0),\,(1-w)(X_1-\bar X_0)\} - \max_k (X_{0,k}-\bar X_0),
\]
where $x \mapsto h_{X_{0,1},\dots,X_{0,q}}(x)$ is strictly increasing and continuous for almost every realization of $X_{0,1}, \dots, X_{0,q}$ and therefore has a strictly increasing and continuous inverse $h^{-1}_{X_{0,1},\dots,X_{0,q}}$ almost everywhere. Independence implies that the distribution function of the preceding display equals $x \mapsto E\Phi(h^{-1}_{X_{0,1},\dots,X_{0,q}}(x)/\sigma_1)$, which is continuous by dominated convergence. Conclude that $h_{X_{0,1},\dots,X_{0,q}}(X_1)$ must have a continuous distribution function at 0 so that
\[
P\bigl(\min\{S_{1,n}, S_{2,n}\} - \max\{S_{3,n}, \dots, S_{q+2,n}\} > 0\bigr) \to E\phi(X, w)
\]
and $P(\min\{S_{1,n}, S_{2,n}\} - \max\{S_{3,n}, \dots, S_{q+2,n}\} = 0) \to 0$. Combine these two results to obtain $E\phi(X_n, w) \to E\phi(X, w) + 0$, as desired. □
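The event that drives the proofs above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function names `rejects` and `power_lower_bound`, and the parameters `grid_size` and `t_max`, are hypothetical and not from the paper. The first function evaluates the indicator $1\{\min\{(1+w)(X_1-\bar X_0),\,(1-w)(X_1-\bar X_0)\} > \max_k (X_{0,k}-\bar X_0)\}$. The second evaluates, on a finite grid, the lower bound from the proof of Proposition 2.2 in its unscaled form, assumed here to read $2^q \sup_{t \ge 0} \Phi((\delta - \tfrac{1+w}{1-w}t)/\sigma_1) \prod_k (\Phi(t/\sigma_{0,k}) - 0.5)$. The weight `w` is taken as given; the values used in the test cases are arbitrary and are not the tabulated weights.

```python
from statistics import NormalDist
from typing import Optional, Sequence

_PHI = NormalDist().cdf  # standard normal distribution function


def rejects(x1: float, x0: Sequence[float], w: float) -> bool:
    """Indicator of min{(1+w)(x1 - mean(x0)), (1-w)(x1 - mean(x0))} > max_k (x0[k] - mean(x0))."""
    xbar0 = sum(x0) / len(x0)
    lhs = min((1 + w) * (x1 - xbar0), (1 - w) * (x1 - xbar0))
    rhs = max(x - xbar0 for x in x0)
    return lhs > rhs


def power_lower_bound(
    delta: float,
    sigma1: float,
    sigma0: Sequence[float],
    w: float,
    grid_size: int = 2000,        # hypothetical tuning parameter
    t_max: Optional[float] = None,  # hypothetical upper end of the grid
) -> float:
    """Grid maximum of Phi((delta - (1+w)/(1-w) t)/sigma1) * prod_k 2(Phi(t/sigma0[k]) - 0.5)."""
    if not 0 <= w < 1:
        raise ValueError("w must lie in [0, 1) for the bound to be defined")
    slope = (1 + w) / (1 - w)
    if t_max is None:
        # Heuristic grid range (an assumption): wide enough to cover the interior maximizer.
        t_max = 10 * max(sigma1, *sigma0) + max(delta, 0.0)
    best = 0.0
    for i in range(1, grid_size + 1):
        t = t_max * i / grid_size
        value = _PHI((delta - slope * t) / sigma1)
        for s in sigma0:
            value *= 2 * (_PHI(t / s) - 0.5)  # 2^q is absorbed factor by factor
        best = max(best, value)
    return best
```

Because every fixed $t \ge 0$ yields a valid lower bound in the argument above, the grid maximum is itself a (possibly slack) lower bound even when it misses the exact supremum. With `w = 1` the left-hand side of the comparison in `rejects` is never positive, so the function never rejects, which matches the limit computed for $w = 1$ in the proof of Theorem 2.1.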
Proof of Theorem 3.2.
Let $X_{1,n} = \sqrt{n}(\hat\theta_1 - \theta_1)$ and $X_{0,k,n} = \sqrt{n}(\hat\theta_{0,k} - \theta_0)$ for $1 \le k \le q$. By assumption, $X_n = (X_{1,n}, X_{0,1,n}, \dots, X_{0,q,n}) \rightsquigarrow X$. Because $x \mapsto \phi_\alpha(x)$ is invariant to multiplication of $x$ with positive constants, we have $\phi_\alpha(\hat\theta_n) = \phi_\alpha(X_n)$ if $\theta_1 = \theta_0$. By Proposition 2.3 and Theorem 2.1, this implies $E\phi_\alpha(\hat\theta_n) \to E\phi_\alpha(X) \le \alpha$ under the null hypothesis.

Suppose $\theta_1 = \theta_0 + \delta/\sqrt{n}$. Let $x \mapsto S_\alpha(x) = S(x, w_q(\alpha, \varrho))$ and $\Delta = (\delta 1\{k = 1\})_{1 \le k \le q+1}$. By the assumed continuity and the Slutsky lemma, we have $X_n + \Delta \rightsquigarrow X + \Delta$. Because $\sqrt{n} S_\alpha(\hat\theta_n) = S_\alpha(X_n + \Delta)$ and $\phi_\alpha$ is invariant to scaling of $S$ by positive constants, it follows from Proposition 2.3 that $E\phi_\alpha(\hat\theta_n) = E\phi_\alpha(X_n + \Delta) \to E\phi_\alpha(X + \Delta)$, to which the lower bound developed in Proposition 2.2 can be applied.

Now suppose $\delta = \theta_1 - \theta_0 > 0$. Let $\bar X_{0,n} = q^{-1} \sum_{k=1}^q X_{0,k,n}$. Because $X_n/\sqrt{n} \rightsquigarrow 0$,
\[
\min\{(1+w)(X_{1,n} + \sqrt{n}\delta - \bar X_{0,n}),\,(1-w)(X_{1,n} + \sqrt{n}\delta - \bar X_{0,n})\} - \max_k (X_{0,k,n} - \bar X_{0,n})
\]
divided by $\sqrt{n}$ converges weakly to $\min\{(1+w)\delta,\,(1-w)\delta\}$. Because zero is a continuity point of the distribution of this degenerate variable unless $\delta = 0$, conclude that $E\phi_\alpha(\hat\theta_n) \to 1$. □

References
Bertrand, M., E. Duflo, and S. Mullainathan (2004). How much should we trust differences-in-differences estimates? Quarterly Journal of Economics 119, 249–275.
Bester, C. A., T. G. Conley, and C. B. Hansen (2011). Inference with dependent data using cluster covariance estimators. Journal of Econometrics 165, 137–151.
Cameron, A. C., J. B. Gelbach, and D. L. Miller (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics 90, 414–427.
Canay, I., J. P. Romano, and A. M. Shaikh (2017). Randomization tests under an approximate symmetry assumption. Econometrica 85, 1013–1030.
Canay, I. A., A. Santos, and A. M. Shaikh (2020). The wild bootstrap with a “small” number of “large” clusters. Review of Economics and Statistics, forthcoming.
Conley, T. G. and C. R. Taber (2011). Inference with “difference in differences” with a small number of policy changes. Review of Economics and Statistics 93, 113–125.
Dedecker, J., P. Doukhan, G. Lang, J. R. León, S. Louhichi, and C. Prieur (2007). Weak Dependence: With Examples and Applications. Springer.
Djogbenou, A. A., J. G. MacKinnon, and M. Ø. Nielsen (2019). Asymptotic theory and wild bootstrap inference with clustered errors. Journal of Econometrics 212, 393–412.
El Machkouri, M., D. Volný, and W. B. Wu (2013). A central limit theorem for stationary random fields. Stochastic Processes and their Applications 123, 1–14.
Fama, E. F. and J. D. MacBeth (1973). Risk, return, and equilibrium: Empirical tests. Journal of Political Economy 81, 607–636.
Ferman, B. (2020). Inference in differences-in-differences with few treated units and spatial correlation. Sao Paulo School of Economics FGV working paper, arXiv:2006.16997.
Ferman, B. and C. Pinto (2019). Inference in differences-in-differences with few treated groups and heteroskedasticity. Review of Economics and Statistics 101, 452–467.
Fisher, R. A. (1935). “The coefficient of racial likeness” and the future of craniometry. Journal of the Royal Anthropological Institute of Great Britain and Ireland 66, 57–63.
Garthwaite, C., T. Gross, and M. J. Notowidigdo (2014). Public health insurance, labor supply, and employment lock. Quarterly Journal of Economics 129, 653–696.
Hagemann, A. (2019). Permutation inference with a finite number of heterogeneous clusters. University of Michigan working paper, arXiv:1907.01049.
Ibragimov, R. and U. Müller (2010). t-statistic based correlation and heterogeneity robust inference. Journal of Business & Economic Statistics 28, 453–468.
Ibragimov, R. and U. Müller (2016). Inference with few heterogeneous clusters. Review of Economics and Statistics 98, 83–96.
Jenish, N. and I. R. Prucha (2009). Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics 150, 86–98.
MacKinnon, J. G. and M. D. Webb (2017). Wild bootstrap inference for wildly different cluster sizes. Journal of Applied Econometrics 32, 233–254.
MacKinnon, J. G. and M. D. Webb (2019). Wild bootstrap randomization inference for few treated clusters. Advances in Econometrics 39, 61–85.
MacKinnon, J. G. and M. D. Webb (2020). Randomization inference for difference-in-differences with few treated clusters. Journal of Econometrics 218, 435–450.
Słoczyński, T. (2018). A general weighted average representation of the ordinary and two-stage least squares estimands. Working paper, Department of Economics, Brandeis University.
Słoczyński, T. (2020). Interpreting OLS estimands when treatment effects are heterogeneous: Smaller groups get larger weights. Review of Economics and Statistics, forthcoming.
Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65, 557–586.
Stock, J. and M. Yogo (2005). Testing for weak instruments in linear IV regression. In D. W. Andrews (Ed.), Identification and Inference for Econometric Models, pp. 80–108. New York: Cambridge University Press.
Department of Economics, University of Michigan, 611 Tappan Ave, Ann Arbor, MI 48109, USA. Tel.: +1 (734) 764-2355. Fax: +1 (734) 764-2769
Email address: [email protected]