A User's Guide to Approximate Randomization Tests with a Small Number of Clusters
Yong Cai, Department of Economics, Northwestern University, [email protected]
Ivan A. Canay, Department of Economics, Northwestern University, [email protected]
Deborah Kim, Department of Economics, Northwestern University, [email protected]
Azeem M. Shaikh, Department of Economics, University of Chicago, [email protected]
February 19, 2021
Abstract
This paper provides a user's guide to the general theory of approximate randomization tests developed in Canay et al. (2017a) when specialized to linear regressions with clustered data. Such regressions include settings in which the data is naturally grouped into clusters, such as villages or repeated observations over time on individual units, as well as settings with weak temporal dependence, in which pseudo-clusters may be formed using blocks of consecutive observations. An important feature of the methodology is that it applies to settings in which the number of clusters is small – even as small as five. We provide a step-by-step algorithmic description of how to implement the test and construct confidence intervals for the quantity of interest. We additionally articulate the main requirements underlying the test, emphasizing in particular common pitfalls that researchers may encounter. Finally, we illustrate the use of the methodology with two applications that further elucidate these points: one to a linear regression with clustered data based on Meng et al. (2015) and a second to a linear regression with temporally dependent data based on Munyo and Rossi (2015). The required software to replicate these empirical exercises and to aid researchers wishing to employ the methods elsewhere is provided in both R and Stata.

KEYWORDS: Randomization tests, linear regression, clustered data, time series

JEL classification codes: C12, C14

∗ The research of the fourth author is supported by NSF Grant SES-1530661.
Introduction
This paper provides a user's guide to the general theory of approximate randomization tests (ARTs) developed in Canay et al. (2017a) when specialized to linear regressions with clustered data. Here, clustered data refers to data that may be grouped so that there may be dependence within each cluster, but distinct clusters are approximately independent in a way to be made precise below. Such data is remarkably common, including not only data that are naturally grouped into clusters, such as villages or repeated observations over time on individual units, but also data with weak temporal dependence, in which pseudo-clusters may be formed using blocks of consecutive observations. An important feature of the methodology is that it applies to commonly encountered settings in which the number of clusters is small – even as small as five. In this respect, the proposed methodology contrasts sharply and meaningfully with many commonly employed methods for inference in such settings. We briefly elaborate on this point in our discussion of related literature below.

A principal goal of this paper is to make the general theory developed in Canay et al. (2017a) more accessible by providing a step-by-step algorithmic description of how to implement the test and construct confidence intervals for the quantity of interest in these types of settings. While we focus in Section 2 below on what we view as the most natural implementation of the test, we further show that it is numerically equivalent to an alternative implementation. This equivalence yields further insights into the main requirements underlying the test, which are clarified in Section 3. These requirements essentially demand that the quantity of interest is suitably estimable cluster-by-cluster. As discussed further in Section 3, when this is not satisfied, a researcher need not conclude that it is not possible to exploit the results in Canay et al. (2017a).
Instead, several remedies are possible, including clustering more coarsely or changing the specification to ensure that this requirement is satisfied. We provide two applications that further elucidate these points: one to a linear regression with clustered data based on Meng et al. (2015) and a second to a linear regression with temporally dependent data based on Munyo and Rossi (2015). The required software to replicate these empirical exercises and to aid researchers wishing to employ the methods elsewhere is provided in both R and Stata.

The methodology described in this paper is part of a large and active literature on inference with clustered data. Following Bertrand et al. (2004), researchers are acutely aware of the need to adjust inferences appropriately to account for this sort of dependence. Many of the most commonly employed methods for doing so, however, are inadequate for the common situation in which the number of clusters is small. Conventional wisdom suggests that the number of clusters is small when it falls below a conventional rule-of-thumb threshold.
The Stata and R packages ARTs can be downloaded from http://sites.northwestern.edu/iac879/software/.
The cluster covariance estimator (CCE), the default method for handling clustered dependence in Stata, is widely acknowledged to perform poorly when this rule-of-thumb is not satisfied. Similarly, the cluster wild bootstrap described in Cameron et al. (2008) requires either a sufficiently large number of clusters or, as shown by Canay et al. (2020), stringent homogeneity across clusters to perform reliably. As explained further in Section 3, the methods developed in Canay et al. (2017a) and described in this paper require neither a large number of clusters nor such homogeneity across clusters. We note that the methods by Ibragimov and Müller (2010, 2016), which are closely related to the ones described here, also do not require such restrictions, but are generally less powerful and permit testing a less rich variety of hypotheses. See Canay et al. (2017a) for further discussion of these points as well as Conley et al. (2018) for an insightful and thorough review of the related literature more broadly.

The remainder of this paper is organized as follows. In Section 2, we first formalize the setting and establish some notation. We then describe the implementation of approximate randomization tests (ARTs) in an algorithmic fashion as well as how to invert these tests to construct confidence intervals for the quantity of interest. In Section 3, we articulate the main requirements underlying the tests and discuss remedies for cases where these requirements are not satisfied. Our two empirical applications are contained in Section 4. Finally, we provide some concluding remarks in Section 5.
We start by reviewing the inference approach proposed by Canay et al. (2017a) in the context of a linear regression model with clustered data. In order to do so, we index clusters by $j \in J \equiv \{1, \ldots, q\}$ and units in the $j$th cluster by $i \in I_{n,j} \equiv \{1, \ldots, n_j\}$. We also denote by $n = \sum_{j=1}^{q} n_j$ the total number of observations. The observed data consists of an outcome of interest, $Y_{i,j}$, and a vector of covariates, $Z_{i,j} \in \mathbf{R}^{d_z}$, that are related through the equation
$$Y_{i,j} = Z_{i,j}'\beta + \epsilon_{i,j}, \quad (1)$$
where $\beta \in \mathbf{R}^{d_z}$ are unknown parameters and our requirements on $\epsilon_{i,j}$ are explained below in Section 3. Our goal is to test
$$H_0 : c'\beta = \lambda \quad \text{vs.} \quad H_1 : c'\beta \neq \lambda, \quad (2)$$
for given values of $c \in \mathbf{R}^{d_z}$ and $\lambda \in \mathbf{R}$, at level $\alpha \in (0,1)$. An important special case is testing whether a single component of $\beta$ equals a given value, i.e., $H_0 : \beta_\ell = \lambda$ vs. $H_1 : \beta_\ell \neq \lambda$ for some $\ell \in \{1, \ldots, d_z\}$, obtained by simply setting $c$ to be a standard unit vector with a one in the $\ell$th component and zeros otherwise. More generally, the approach we describe below extends immediately to the case where the hypothesis of interest involves multiple elements of $\beta$, in which case the test becomes
$$H_0 : R\beta = \Lambda \quad \text{vs.} \quad H_1 : R\beta \neq \Lambda, \quad (3)$$
for a given $p \times d_z$-dimensional matrix $R$ and $p$-dimensional vector $\Lambda$, at level $\alpha \in (0,1)$. The most straightforward way to test the hypotheses in (2) via ARTs is by following the steps described in Algorithm 2.1 below.
Algorithm 2.1 (ARTs via within-cluster estimates). This implementation of ARTs involves the following steps:
Step 1: For each cluster $j \in J$, run an ordinary least squares regression of $Y_{i,j}$ on $Z_{i,j}$ using the $n_j$ observations in cluster $j$. Denote the corresponding estimators of $\beta$ by $\{\hat\beta_{n,j} : j \in J\}$.

Step 2: For each $j \in J$, define the random variables
$$S_{n,j} \equiv \sqrt{n_j}\,(c'\hat\beta_{n,j} - \lambda). \quad (4)$$

Step 3: Compute the sample mean and sample standard deviation of $\{S_{n,j} : j \in J\}$,
$$\bar S_n \equiv \frac{1}{q}\sum_{j=1}^{q} S_{n,j} \quad \text{and} \quad \hat\sigma_S \equiv \sqrt{\frac{1}{q-1}\sum_{j=1}^{q}(S_{n,j} - \bar S_n)^2}, \quad (5)$$
and then construct the test statistic
$$T_n = \frac{|\bar S_n|}{\hat\sigma_S/\sqrt{q}}. \quad (6)$$

Step 4: Let $G = \{1,-1\}^q$, so $g = (g_1, \ldots, g_q) \in G$ is simply a $q$-dimensional vector with elements $g_j$ being either $1$ or $-1$. For any element $g \in G$ define
$$\bar S^*_n(g) \equiv \frac{1}{q}\sum_{j=1}^{q} g_j S_{n,j} \quad \text{and} \quad \hat\sigma^*_S(g) \equiv \sqrt{\frac{1}{q-1}\sum_{j=1}^{q}\left(g_j S_{n,j} - \bar S^*_n(g)\right)^2}, \quad (7)$$
and then construct the test statistic
$$T^*_n(g) = \frac{|\bar S^*_n(g)|}{\hat\sigma^*_S(g)/\sqrt{q}}. \quad (8)$$

Step 5: Compute the $1-\alpha$ quantile of $\{T^*_n(g) : g \in G\}$ as
$$\hat c_n(1-\alpha) \equiv \inf\left\{u \in \mathbf{R} : \frac{1}{|G|}\sum_{g \in G} I\{T^*_n(g) \le u\} \ge 1-\alpha\right\}. \quad (9)$$

Step 6: Compute the test as
$$\phi_n \equiv I\{T_n > \hat c_n(1-\alpha)\}, \quad (10)$$
where $T_n$ is as in (6) and $\hat c_n(1-\alpha)$ is as in (9). The associated p-value is
$$\hat p_n \equiv \frac{1}{|G|}\sum_{g \in G} I\{T^*_n(g) \ge T_n\}, \quad (11)$$
where $T^*_n(g)$ is as in (8).

Algorithm 2.1 involves six steps that are easy to implement from a computational standpoint, but some of the steps deserve clarification. Step 1 involves $q$ within-cluster regressions that lead to $q$ estimates of $\beta$. This essentially demands that the parameter $\beta$ is identified cluster-by-cluster, which may fail to hold if some of the variables in the vector $Z_{i,j}$ are constant within cluster. We discuss possible remedies for this problem in Section 3 and illustrate their use in one of the applications in Section 4. An important feature of the method is that from Step 2 onwards the original data is no longer needed, as all the calculations only involve the $q$ estimators of the parameter $\beta$ obtained in Step 1.
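From Step 2 onwards the algorithm operates only on the $q$ cluster-level statistics. The following is a minimal pure-Python sketch of Steps 3 through 6 (the function name and toy inputs are ours; this is an illustration, not the authors' ARTs package):

```python
import math
from itertools import product

def art_test(S, alpha=0.05):
    """ART for H0: c'beta = lambda, given the q cluster-level statistics
    S_j = sqrt(n_j) * (c'beta_hat_j - lambda) from Steps 1-2 of Algorithm 2.1."""
    q = len(S)

    def t_stat(x):
        xbar = sum(x) / q                                # sample mean (Step 3)
        sd = math.sqrt(sum((v - xbar) ** 2 for v in x) / (q - 1))
        return abs(xbar) / (sd / math.sqrt(q))           # t-type statistic

    Tn = t_stat(S)
    # Step 4: recompute the statistic under all 2^q sign changes
    # (for large q one would instead draw i.i.d. Rademacher signs)
    Tstar = [t_stat([g * s for g, s in zip(gs, S)])
             for gs in product((1, -1), repeat=q)]
    # Step 5: the (1 - alpha) quantile of the randomization distribution
    Tstar.sort()
    crit = Tstar[math.ceil(len(Tstar) * (1 - alpha)) - 1]
    # Step 6: test decision and p-value
    reject = Tn > crit
    pval = sum(t >= Tn for t in Tstar) / len(Tstar)
    return reject, pval
```

With $q = 6$ statistics that are all positive and of similar magnitude, only the identity vector and its negation reproduce $T_n$, so the p-value is $2/2^6 = 0.03125$.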
Step 3 defines a type of t-statistic that is appropriate for the null hypothesis in (2). If the null hypothesis of interest is the one in (3), then a Wald-type test statistic could be used instead, i.e.,
$$T^{\mathrm{wald}}_n \equiv q\,\bar S_n' \hat\Sigma_S^{-1} \bar S_n, \quad (12)$$
where $S_{n,j} \equiv \sqrt{n_j}\,(R\hat\beta_{n,j} - \Lambda)$ and $\hat\Sigma_S \equiv \frac{1}{q}\sum_{j=1}^{q} S_{n,j} S_{n,j}'$. Notably, Step 4 does not require one to recompute the estimates of $\beta$. It rather uses the $q$ estimates from Step 1 and applies sign changes to the $q$-dimensional vector $\{S_{n,j} : j \in J\}$. Since the cardinality of $G$ is $|G| = 2^q$, it exceeds 2,000 when $q > 10$, and in such cases it may be convenient to use a stochastic approximation without affecting the properties of the test (see Canay et al., 2017a, Remark 2.2). Formally, in this case we let
$$\hat G \equiv \{g_1, \ldots, g_B\}, \quad (13)$$
where $g_1 = (1, \ldots, 1)$ is the identity vector and $g_b = (g_{b,1}, \ldots, g_{b,q})$, for $b = 2, \ldots, B$, are i.i.d. Rademacher random vectors; i.e., each $g_{b,j}$ equals $1$ or $-1$ with equal probability. One then proceeds with $\hat G$ replacing $G$ everywhere and sets $B = 1{,}000$ (or any other reasonably large number chosen by the analyst). Step 5 requires computing the $1-\alpha$ quantile of $\{T^*_n(g) : g \in G\}$, which can typically be obtained by sorting the values of $\{T^*_n(g) : g \in G\}$ and then taking the $\lceil |G|(1-\alpha)\rceil$th element in the ordered list. Thus, if we denote the ordered values of $\{T^*_n(g) : g \in G\}$ by $T^{(1)}_n \le T^{(2)}_n \le \cdots \le T^{(|G|)}_n$, then we may define $\hat c_n(1-\alpha)$ in (9) as $\hat c_n(1-\alpha) = T^{(\lceil |G|(1-\alpha)\rceil)}_n$. Because $T_n$ always coincides with $T^*_n(g)$ for $g$ equal to the identity vector and its negation, the smallest attainable p-value is $2/2^q$; this suggests that the test may have no power for very low values of $q$. For example, when $\alpha = 10\%$, this problem arises if $q \le 4$. For $q = 5$ the test already has non-trivial power and is only slightly conservative under the null. Similarly, when $\alpha = 5\%$ the test has non-trivial power for any $q \ge 6$. Step 6 is straightforward and provides both the test $\phi_n$ and the p-value $\hat p_n$.

We now discuss how to compute confidence sets for the parameter $c'\beta$ by test inversion. As before, a particularly important case is when $c$ selects the $\ell$th component of $\beta$, in which case the confidence set is simply a confidence interval for $\beta_\ell$. The idea behind test inversion is straightforward: form confidence sets by collecting all values of $c'\beta$ that cannot be rejected by a test at level $\alpha$. Formally, for the test $\phi_n$ in (10) we define
$$\tilde C_n = \{\lambda : \phi_n = 0 \text{ when testing } H_0 : c'\beta = \lambda\}. \quad (14)$$
In an asymptotic framework where $n \to \infty$ while $q$ remains fixed, Canay et al. (2017a) show that $\phi_n$ is asymptotically level $\alpha$ under $H_0$. It follows from that result that, by construction, $\tilde C_n$ covers $c'\beta$ with probability at least equal to $1-\alpha$ asymptotically. The set $\tilde C_n$ requires implementing Algorithm 2.1 for all possible values of $\lambda \in \mathbf{R}$. This may be cumbersome from a computational point of view and may not be particularly easy to report due to spurious rejections under $H_0$ (which happen with probability $\alpha$). We therefore recommend reporting the convex hull of $\tilde C_n$, denoted below by $C_n$, which is easier to compute and interpret. Let $L_n$ be the smallest value of $\lambda$ that cannot be rejected by $\phi_n$ and $U_n$ the largest value of $\lambda$ that cannot be rejected by $\phi_n$. We then define $C_n$ as
$$C_n = [L_n, U_n]. \quad (15)$$
By construction the value $\hat\lambda = \frac{1}{q}\sum_{j=1}^{q} c'\hat\beta_{n,j}$, where $\{\hat\beta_{n,j} : j \in J\}$ is defined in Step 1 of Algorithm 2.1, is never rejected. Finding $L_n$ and $U_n$ is then typically straightforward by means of a bisection algorithm once the analyst determines a small enough and a large enough value of $\lambda$ for which the null in (2) is rejected. In the applications of Section 4 we report $C_n$ with $L_n$ and $U_n$ computed via the bisection algorithm described below in Algorithm 2.2.

Algorithm 2.2 (ART-based confidence intervals for $c'\beta$).
For $\{\hat\beta_{n,j} : j \in J\}$ as defined in Step 1 of Algorithm 2.1, the construction of the confidence interval involves the following steps:

Step 1: Find the minimum value of $\tau \in \{1, 2, 3, \ldots\}$ such that $\phi_n$ in (10) rejects the null $H_0$ in (2) with $\lambda = \hat\lambda - \tau s_{\hat\lambda}$, where
$$\hat\lambda = \frac{1}{q}\sum_{j=1}^{q} c'\hat\beta_{n,j} \quad \text{and} \quad s_{\hat\lambda} = \sqrt{\frac{1}{q}\sum_{j=1}^{q}\left(c'\hat\beta_{n,j} - \hat\lambda\right)^2}. \quad (16)$$
Denote this value of $\lambda$ by $\lambda_l^{(0),r}$. Repeat for $\lambda = \hat\lambda + \tau s_{\hat\lambda}$ and denote this value of $\lambda$ by $\lambda_u^{(0),r}$. Finally, set $\lambda_l^{(0),a} = \lambda_u^{(0),a} = \hat\lambda$.

Step 2: In a while loop at iteration $k \ge 1$, let $\lambda_l^{(k)} = \frac{1}{2}\left(\lambda_l^{(k-1),r} + \lambda_l^{(k-1),a}\right)$. Test $H_0$ in (2) for $\lambda = \lambda_l^{(k)}$ and set
$$\lambda_l^{(k),r} = \lambda_l^{(k)} \text{ and } \lambda_l^{(k),a} = \lambda_l^{(k-1),a} \text{ when } \phi_n = 1, \quad \text{or} \quad \lambda_l^{(k),r} = \lambda_l^{(k-1),r} \text{ and } \lambda_l^{(k),a} = \lambda_l^{(k)} \text{ when } \phi_n = 0.$$
For a set tolerance level tol and maximum number of iterations max, if $\left|\lambda_l^{(k-1),r} - \lambda_l^{(k-1),a}\right| < \text{tol}$ or $k = \text{max}$, stop the algorithm and set $L_n = \lambda_l^{(k),a}$. Otherwise, move to iteration $k+1$.

Step 3: Repeat Step 2 for the upper bound $U_n$ using $\lambda_u^{(k)}$, $\lambda_u^{(k-1),r}$, and $\lambda_u^{(k-1),a}$ in place of $\lambda_l^{(k)}$, $\lambda_l^{(k-1),r}$, and $\lambda_l^{(k-1),a}$, respectively.

Before we move to the empirical applications we discuss an interesting feature of ARTs as described in Algorithm 2.1. It turns out that ARTs can be implemented by a slightly different algorithm that does not involve estimating the parameter $\beta$ within each cluster. This alternative algorithm replaces Steps 1 and 2 in Algorithm 2.1 with the two alternative steps described in Algorithm 2.3 below, while keeping Steps 3 to 6 unaffected.

Algorithm 2.3 (ARTs via within-cluster weighted scores). This implementation of ARTs involves the following steps:
Step 1′: Run a full-sample least squares regression of $Y_{i,j}$ on $Z_{i,j}$ subject to the restriction imposed by the null hypothesis, i.e., $c'\beta = \lambda$. Denote by $\hat\epsilon^r_{i,j}$ the restricted residuals from this regression and by $\hat\beta^r_n$ the restricted least squares estimator of $\beta$.

Step 2′: For each cluster $j \in J$, define
$$S_{n,j} \equiv c'\hat\Omega^{-1}_{n,j}\frac{1}{\sqrt{n_j}}\sum_{i \in I_{n,j}} Z_{i,j}\hat\epsilon^r_{i,j}, \quad (17)$$
where
$$\hat\Omega_{n,j} \equiv \frac{1}{n_j}\sum_{i \in I_{n,j}} Z_{i,j}Z_{i,j}' \quad (18)$$
is a $d_z \times d_z$ matrix that is assumed to be full rank with inverse $\hat\Omega^{-1}_{n,j}$.

Steps 3-6: Same as in Algorithm 2.1.
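The numerical equivalence of the two implementations, discussed formally below, can also be checked directly. The following pure-Python sketch (toy data and variable names are ours) computes $S_{n,j}$ both via the within-cluster estimates in (4) and via the restricted-residual scores in (17), for a simple intercept-plus-slope model with $H_0 : \beta = \lambda$ on the slope:

```python
import math
import random

def invert2(m):
    # inverse of a 2x2 matrix [[a, b], [c, d]]
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

random.seed(0)
lam = 0.5                          # hypothesized slope under H0
clusters = []                      # toy data: Y = 1 + 0.8*Z + noise
for j in range(5):
    n_j = 30 + 10 * j
    Z = [random.gauss(0, 1) for _ in range(n_j)]
    Y = [1.0 + 0.8 * z + random.gauss(0, 1) for z in Z]
    clusters.append((Z, Y))

# Algorithm 2.1, Step 2: S_j = sqrt(n_j) * (slope_hat_j - lam)
S1 = []
for Z, Y in clusters:
    n_j = len(Z)
    zb, yb = sum(Z) / n_j, sum(Y) / n_j
    slope = (sum((z - zb) * (y - yb) for z, y in zip(Z, Y))
             / sum((z - zb) ** 2 for z in Z))
    S1.append(math.sqrt(n_j) * (slope - lam))

# Algorithm 2.3: restricted full-sample LS (slope fixed at lam), then scores
allZ = [z for Z, _ in clusters for z in Z]
allY = [y for _, Y in clusters for y in Y]
a_r = sum(y - lam * z for z, y in zip(allZ, allY)) / len(allY)  # restricted intercept
S2 = []
for Z, Y in clusters:
    n_j = len(Z)
    resid = [y - a_r - lam * z for z, y in zip(Z, Y)]   # restricted residuals
    # Omega_hat_j = (1/n_j) * sum z_i z_i', with z_i = (1, Z_i)
    om = [[1.0, sum(Z) / n_j],
          [sum(Z) / n_j, sum(z * z for z in Z) / n_j]]
    score = [sum(resid) / math.sqrt(n_j),
             sum(z * e for z, e in zip(Z, resid)) / math.sqrt(n_j)]
    inv = invert2(om)
    # c = (0, 1) selects the slope: S_j = c' Omega^{-1} score
    S2.append(inv[1][0] * score[0] + inv[1][1] * score[1])

assert all(abs(a - b) < 1e-8 for a, b in zip(S1, S2))
```

Up to floating-point error, the two constructions of $S_{n,j}$ coincide cluster by cluster, as the derivation below shows they must.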
Note that Steps 3-6 remain unchanged given the alternative definition of $S_{n,j}$ in Step 2′. When it comes to Steps 1 and 2, there are two differences worth discussing. The first difference is that Step 1′ requires a single full-sample restricted least squares estimator of $\beta$ as opposed to the $q$ cluster-by-cluster estimators in Step 1 of Algorithm 2.1. The second difference is that Step 2′ is based on within-cluster weighted scores as opposed to the centered within-cluster estimates of $\beta$ in Step 2 of Algorithm 2.1. Interestingly, these two implementations are numerically equivalent, and so implementing ARTs via Algorithm 2.1 or Algorithm 2.3 leads to identical results. To see this formally, it is enough to show that $S_{n,j}$ as defined in (4) and (17) are the same using the following argument. For each $j \in J$,
$$\begin{aligned}
S_{n,j} &\equiv c'\hat\Omega^{-1}_{n,j}\frac{1}{\sqrt{n_j}}\sum_{i \in I_{n,j}} Z_{i,j}\hat\epsilon^r_{i,j} = c'\hat\Omega^{-1}_{n,j}\frac{1}{\sqrt{n_j}}\sum_{i \in I_{n,j}} Z_{i,j}\left(Y_{i,j} - Z_{i,j}'\hat\beta^r_n\right) \\
&= c'\hat\Omega^{-1}_{n,j}\frac{1}{\sqrt{n_j}}\sum_{i \in I_{n,j}} Z_{i,j}Y_{i,j} - c'\hat\Omega^{-1}_{n,j}\frac{1}{\sqrt{n_j}}\sum_{i \in I_{n,j}} Z_{i,j}Z_{i,j}'\hat\beta^r_n \\
&= \sqrt{n_j}\left(c'\hat\beta_{n,j} - c'\beta\right) - \sqrt{n_j}\left(c'\hat\beta^r_n - c'\beta\right) = \sqrt{n_j}\left(c'\hat\beta_{n,j} - \lambda\right),
\end{aligned}$$
where the second-to-last equality follows by adding and subtracting $\sqrt{n_j}\,c'\beta$ and the last equality holds because $c'\hat\beta^r_n = \lambda$, as the restricted estimator imposes the null hypothesis in (2). It thus follows that $S_{n,j}$ in (4) and in (17) are identical, and so ARTs can be alternatively implemented via Algorithm 2.1 or 2.3.

The main requirement underlying ARTs is Assumption 3.1 in Canay et al. (2017a). This assumption guarantees that the test delivers rejection probabilities under the null hypothesis that are close to the nominal level $\alpha$ in an asymptotic framework where $n \to \infty$ and $q$ remains fixed. In the context of the linear model in (1), this translates into the following two conditions. The first condition is that the cluster-by-cluster estimators of $\beta$ jointly converge in distribution at some (possibly unknown) rate; i.e.,
$$\begin{pmatrix} a_{n,1}(\hat\beta_{n,1} - \beta) \\ \vdots \\ a_{n,q}(\hat\beta_{n,q} - \beta) \end{pmatrix} \overset{d}{\to} \begin{pmatrix} S_1 \\ \vdots \\ S_q \end{pmatrix} \quad (19)$$
for sequences $a_{n,j} \to \infty$ and random variables $(S_1, \ldots, S_q)'$. The second condition requires that the limiting random variables are invariant to sign changes, i.e.,
$$(g_1 S_1, \ldots, g_q S_q) \overset{d}{=} (S_1, \ldots, S_q), \quad (20)$$
for any $g \in G$, where $G$ is defined in Step 4 of Algorithm 2.1.

Condition (19) holds, for example, when $Z_{i,j}$ and $\epsilon_{i,j}$ are uncorrelated and the analyst assumes some form of weak dependence within clusters that permits the application of an appropriate central limit theorem. In such a case, (19) typically holds with $a_{n,j} = \sqrt{n_j}$ and each $S_j$ being a mean-zero normal random variable. In fact, under the commonly used assumption of independent clusters, it also follows that $S_j \perp\!\!\!\perp S_{j'}$ for any $j \neq j'$. In this case the normally distributed random variables may not be identically distributed but are indeed independent. Condition (20), in turn, requires each $S_j$ to be symmetrically distributed around zero and independent of each other. This is immediately satisfied when each $S_j$ is a mean-zero normal random variable and clusters are independent. Importantly, these assumptions allow the normally distributed random variables to have different variances across clusters; a type of heterogeneity not allowed by the cluster wild bootstrap approach popularized by Cameron et al. (2008) and later studied formally by Canay et al. (2020).

Remark 3.1.
We focus our exposition on the case where $Z_{i,j}$ is exogenous, but we emphasize that the conditions in (19) and (20) typically hold in instrumental variable (IV) models. Accommodating IV in ARTs then only requires modifying Step 1 in Algorithm 2.1 so that the least squares regression is replaced with the appropriate IV regression. Steps 2-6 remain unaffected.

An implicit requirement behind ARTs that deserves further comment lies in Step 1 of Algorithm 2.1, which requires that the analyst run cluster-by-cluster regressions. This step implicitly assumes that the parameter $\beta$ is identified within each cluster. In practice, this means that the matrix $\hat\Omega_{n,j}$ in (18) must be invertible, and hence the same requirement applies to Algorithm 2.3. This restriction may be substantively important in some applications, so here we discuss common ways in which the problem may manifest and two alternative remedies.

One case in which running least squares cluster-by-cluster is not feasible is when the coefficient of interest is associated with a variable that only varies across clusters. For example, consider the model in (1) and partition $Z_{i,j}$ into a constant term, a scalar variable that only varies across clusters, $Z^{(1)}_j$, and another variable that varies across and within clusters, $Z^{(2)}_{i,j}$. That is,
$$Y_{i,j} = \beta_0 + Z^{(1)}_j\beta_1 + Z^{(2)}_{i,j}\beta_2 + \epsilon_{i,j}, \quad (21)$$
where the analyst's interest lies in the coefficient $\beta_1$, i.e., $c'\beta = \beta_1$. Clearly, the regression in Step 1 of Algorithm 2.1 would not separately identify $\beta_0$ and $\beta_1$, as $Z^{(1)}_j$ is perfectly collinear with the constant term within cluster; the matrix $\hat\Omega_{n,j}$ in (18) is simply singular. This situation arises, for example, in the empirical application considered by Canay et al. (2017b), where $j \in J$ indexes schools and the variable of interest is a treatment indicator at the school level. A natural remedy in a situation like this is clustering more coarsely (e.g., by combining clusters) to obtain variation within the re-defined clusters.
This is possible for ARTs since the validity of the method does not rely on having a large number of clusters, and thus it can afford to work with coarser clustering. In fact, in certain settings combining clusters may be quite natural. For example, Canay et al. (2017b) re-defined clusters as "pairs" of schools (as opposed to just schools) given that the treatment assignment mechanism of the experiment was a matched pairs design, and so the pairs used at the randomization stage represented natural groupings. In other settings where it is less clear how to group clusters, any grouping that satisfies the requisite identification condition leads to a valid test, but it may be further desirable to combine such tests to limit concerns about "data snooping" across groupings. To this end, results in DiCiccio et al. (2020) on combining tests may be relevant.

Remark 3.2.
A quick inspection of (21) may lead the analyst to believe there is a workaround that does not involve combining clusters if one instead uses some estimator of $\beta_0$ from a full-sample regression, for example the full-sample least squares estimator $\hat\beta_{n,0}$ from the regression in (21). One may then consider modifying Step 1 in Algorithm 2.1 by running a regression of $Y_{i,j}$ on an intercept and $Z^{(2)}_{i,j}$ (not including $Z^{(1)}_j$) and then redefining $\{\hat\beta_{n,j} : j \in J\}$ as the difference between the within-cluster intercept estimates, $\hat\beta_{j,0}$, and the full-sample estimate $\hat\beta_{n,0}$, i.e., $\hat\beta_{n,j} = \hat\beta_{j,0} - \hat\beta_{n,0}$. Such strategies unfortunately introduce dependence between the $q$ estimators (as they all depend on $\hat\beta_{n,0}$) and thus end up violating one of the two main conditions needed for ARTs to be asymptotically valid; namely, condition (20).

Another case where the lack of identification within cluster may manifest is when the variable of interest actually varies within clusters but the model specification involves other variables that are collinear with some other variable (including the variable of interest or the constant term) within clusters. For example, consider the model in (1) where, instead of individuals indexed by $i \in I_{n,j}$, units within a cluster are indexed over time $t \in T$. Partition $Z_{j,t}$ into the variable of interest, $Z^{(1)}_{j,t}$, and time fixed effects $\delta_t$. That is,
$$Y_{j,t} = Z^{(1)}_{j,t}\beta_1 + \sum_{\tilde t \in T} I\{\tilde t = t\}\delta_{\tilde t} + \epsilon_{j,t}. \quad (22)$$
It then follows that, within each cluster $j \in J$, the time fixed effect $\delta_t$ absorbs all the variation in $Z^{(1)}_{j,t}$, and so $\beta_1$ is not identified. In cases like this the analyst could again combine clusters to obtain variation within the re-defined clusters. An alternative remedy is to change the specification by, for example, replacing the time fixed effects with a cluster-specific time trend.
Such a specification is more restrictive than the time fixed effects in the sense that it imposes a linear trend but, at the same time, is more general in that it allows for heterogeneity across clusters in the linear trend. We illustrate this approach in the application we consider in Section 4.1.

The need to identify $\beta$ within each cluster is, in our view, the main limitation of ARTs, but a limitation that needs to be dealt with in certain settings. One may then wonder why not simply use some other inference method that is valid when the number of clusters is small and that does not rely on estimating $\beta$ cluster-by-cluster. Perhaps the most popular approach in that category is the cluster wild bootstrap popularized by Cameron et al. (2008) and recently studied formally by Canay et al. (2020). While not having to estimate $\beta$ within each cluster represents an advantage over ARTs, this additional flexibility comes at a cost in terms of the degree of heterogeneity that the model can accommodate. In particular, the results in Canay et al. (2020) show that the cluster wild bootstrap is expected to work well in settings with a small number of clusters only as long as the clusters are "homogeneous," in a sense made precise in Canay et al. (2020). Intuitively, it is required that the matrix $\hat\Omega_{n,j}$ defined in (18) is the same across clusters (up to scalar multiplication). Such a stringent homogeneity condition is not required for ARTs to work well, as the method allows clusters to be arbitrarily heterogeneous as long as $\hat\Omega_{n,j}$ is invertible for each $j \in J$.

Remark 3.3.
For ease of exposition, we have written the requirement in (19) in terms of the differences $\hat\beta_{n,j} - \beta$, but it is possible to replace it with the differences $c'\hat\beta_{n,j} - c'\beta$ (or $R\hat\beta_{n,j} - R\beta$, depending on the null hypothesis of interest). In most cases, re-writing the condition in this way is not useful, but it is in cases where $c'\beta$ is identified within each cluster while $\beta$ is not. For example, consider the model in (21) when the coefficient of interest is $\beta_2$ as opposed to $\beta_1$, i.e., $c'\beta = \beta_2$. In that case the entire term $\beta_0 + Z^{(1)}_j\beta_1$ may be absorbed into a cluster-specific intercept without affecting the identification and estimation of $c'\beta = \beta_2$ within each cluster.

In this section we apply ARTs as described in Algorithm 2.1 and ART-based confidence intervals as described in Algorithm 2.2 in the context of two distinct empirical applications. The R and Stata packages and codes required to replicate the results in this section are available as part of the online supplemental material.
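The confidence intervals reported in this section are obtained by the test inversion of Algorithm 2.2. A minimal sketch of the bisection for the lower endpoint $L_n$ (the rejection function `phi` is a stand-in for the ART of Algorithm 2.1, and all names are ours, not the authors' package):

```python
def lower_bound(phi, lam_hat, s, tol=1e-6, max_iter=200):
    """Bisection for the lower CI endpoint L_n (Algorithm 2.2, Steps 1-2).

    phi(lam) returns True when the ART rejects H0: c'beta = lam.
    lam_hat and s play the roles of the center and scale in (16)."""
    # Step 1: walk left in steps of s until the null is rejected
    tau = 1
    while not phi(lam_hat - tau * s):
        tau += 1
    rej, acc = lam_hat - tau * s, lam_hat   # rejected / accepted brackets
    # Step 2: bisect between the rejected and accepted points
    for _ in range(max_iter):
        if abs(rej - acc) < tol:
            break
        mid = 0.5 * (rej + acc)
        if phi(mid):
            rej = mid
        else:
            acc = mid
    return acc

# usage: suppose the test accepts exactly on [2.0, 8.0]
L = lower_bound(lambda lam: not (2.0 <= lam <= 8.0), lam_hat=5.0, s=1.0)
```

The upper endpoint $U_n$ is obtained symmetrically (Step 3 of Algorithm 2.2) by walking right and bisecting on the other side of $\hat\lambda$.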
Meng et al. (2015, MQY) argue that China's Great Famine, from 1959 to 1961, was the result of an inflexible food procurement policy by the central government. To make this point, they show that food production and mortality became positively correlated during the time of famine, when this coefficient is otherwise negative or not significantly different from 0 in normal times.

MQY consider the following regression,
$$Y_{j,t+1} = Z^{(1)}_{j,t}\beta_1 + Z^{(2)}_{j,t}\beta_2 + \delta_t + \epsilon_{j,t},$$
where $j$ indexes provinces (ranging from 1 to 19) and $t$ indexes years (ranging from 1953 to 1982). Here,

$Y_{j,t+1}$ = log(number of deaths in province $j$ during year $t+1$),
$Z^{(1)}_{j,t}$ = log(predicted grain production in province $j$ during year $t$) × $I\{t$ is a famine year$\}$,
$Z^{(2)}_{j,t}$ = log(predicted grain production in province $j$ during year $t$),
$\delta_t$ = time fixed effects.

In this application the level of clustering is a province, and so in order to apply ARTs as described in Section 2.1 one needs to estimate $\beta = (\beta_1, \beta_2)'$ and $\delta_t$ province-by-province. This illustrates one of the situations where including time fixed effects province-by-province is infeasible for the implementation of ARTs, given that the only source of remaining variation within a province is indeed time. The second identification problem described in Section 3 then arises. As we discussed in that section, one way to deal with this issue consists of replacing the time fixed effects with a cluster-specific time trend, i.e., in Step 1 of Algorithm 2.1 estimate
$$Y_{j,t+1} = Z^{(1)}_{j,t}\beta_1 + Z^{(2)}_{j,t}\beta_2 + \gamma_j t + \epsilon_{j,t}. \quad (23)$$
We will refer to this as Analysis 1. Analyses 2 through 6 are variants of this specification that differ in whether the short sample is used, whether autonomous regions are included, and whether predicted grain production is used, as summarized in the bottom rows of Table 2; in each analysis the cluster-specific trend $\gamma_j t$ replaces the time fixed effects $\delta_t$. Table 1 summarizes the number of clusters and the number of observations for each of these analyses.

Meng et al. (2015) consider the following two null hypotheses of interest,
$$H^{(1)}_0 : \beta_1 = 0 \quad \text{and} \quad H^{(2)}_0 : \beta_1 + \beta_2 = 0. \quad (24)$$

Table 1:
Cluster Information. 'Min. Size', 'Med. Size', 'Max. Size' denote the minimum, the median, and the maximum size of clusters.

                      Analysis 1       Analysis 2       Analysis 3       Analysis 4       Analysis 5       Analysis 6
H0: β1 = 0
  CCE p-value         0.000            0.000            0.000            0.000            0.000            0.000
  CCE 95% CI          [0.050, 0.077]   [0.043, 0.071]   [0.057, 0.086]   [0.051, 0.083]   [0.051, 0.078]   [0.044, 0.071]
  ART p-value         0.000            0.002            0.000            0.000            0.000            0.000
  ART 95% CI          [0.032, 0.055]   [0.018, 0.047]   [0.038, 0.066]   [0.028, 0.067]   [0.032, 0.058]   [0.029, 0.050]
H0: β1 + β2 = 0
  CCE p-value         0.050            0.009            0.059            0.005            0.266            0.363
  ART p-value         0.098            0.571            0.096            0.487            0.080            0.001
Observations          569              246              689              298              569              246
Short Sample          No               Yes              No               Yes              No               Yes
Auto. Region          No               No               Yes              Yes              No               No
Pred. Grain Prod.     Yes              Yes              Yes              Yes              No               No

Table 2: Results for Analyses 1-6. CCE refers to cluster-robust standard errors. ART p-values are obtained using Algorithm 2.1. ART-based 95% confidence intervals are obtained using Algorithm 2.2.

In Table 2 we replicate the main table in Meng et al. (2015) using cluster-robust standard errors (CCE) and also include the results associated with ARTs for both $H^{(1)}_0$ and $H^{(2)}_0$ in (24). For $H^{(1)}_0$ we report p-values and 95% confidence intervals, while for $H^{(2)}_0$ we just report p-values, following MQY. The authors note in footnote 33 that using the cluster wild bootstrap led to similar results as those presented in their main table, so we do not include cluster wild bootstrap results here either.

We comment on the following main features of Table 2:

1. For the null hypothesis $H^{(1)}_0$ associated with the parameter $\beta_1$, the ART p-values are of comparable magnitude to traditional CCE p-values. Similarly, ART-based confidence intervals are of roughly the same length as those obtained based on CCE, although the ART-based confidence intervals do not contain the LS estimates. This is because ART-based confidence intervals are centered around the mean of the province-by-province estimates, which may not necessarily be equal to the full-sample LS estimate of $\beta_1$.

2. For the null hypothesis $H^{(2)}_0$ associated with the parameter $\beta_1 + \beta_2$, the ART p-value is sometimes higher and sometimes lower than the CCE p-value depending on the specification. Given the relatively small number of clusters in this application, the ART p-values are likely to be more reliable than those associated with CCE, as CCE is known to perform poorly when the number of clusters is not sufficiently large.

Munyo and Rossi (2015) study criminal recidivism of former prisoners by looking at the relationship between the number of inmates released from incarceration on a given day and the number of offenses committed on the same day.
They claim that the liquidity constraints that inmates face on the day of release increase the likelihood of recidivism on the same day. Using data on 2631 days between January 1st, 2004 and March 15th, 2011, collected from the criminal incident reports in Montevideo, Uruguay, they estimate the following linear model by least squares:

Y_t = Z_t'β + ε_t,

where t indexes days and

Y_t = the total number of offenses on day t,
Z_t = the total number of inmates released, temperature, rainfall, and hours of sunshine on day t, a dummy for holidays, a dummy for December 31st, and a yearly trend.

We refer to this as Analysis 1. The remaining analyses differ only in how time is controlled for:

• Analysis 2: Z_t includes a daily trend in place of a yearly trend.
• Analysis 3: Z_t includes a monthly trend in place of a yearly trend.
• Analysis 4: Z_t includes an intra-month daily trend, month- and year-level fixed effects, and their interactions in place of a yearly trend.
• Analysis 5: Z_t includes month- and year-level fixed effects and their interactions in place of a yearly trend.

Munyo and Rossi (2015) report LS estimates of β with Newey-West heteroskedasticity-and-autocorrelation-consistent (HAC) standard errors. In addition, we report ART p-values as described in Algorithm 2.1 for the null hypothesis H0: c'β = 0 as in (2), where c selects the coefficient on the total number of inmates released on day t.

In this application the level of clustering is not naturally determined by the data, but pseudo-clusters may be formed using blocks of consecutive observations under the assumption of weak temporal dependence. In order to apply ARTs as described in Algorithm 2.1, we then form q pseudo-clusters by dividing the data into q consecutive blocks of size b_n = ⌊n/q⌋, where n = 2631 is the total number of observations. More concretely, we define the j-th pseudo-cluster as

X_j^(n) = {(Y_t, Z_t')' : t = (j − 1)b_n + 1, ..., jb_n},

where j = 1, ..., q − 1, and let the last, q-th pseudo-cluster contain all the remaining n − b_n(q − 1) observations.

Note that in this application the number of pseudo-clusters q is a tuning parameter that the analyst must specify. Munyo and Rossi (2015) set q = 10. We repeat their analyses with alternative values of q and investigate how sensitive the results are to this choice. The relevant cluster information is given in Table 3.

Table 3: Pseudo-cluster size for different values of q.

Table 4 shows LS estimates of β, p-values for the hypothesis in (2), and 95% confidence intervals for each analysis. Following Munyo and Rossi (2015), we report results based on HAC standard errors. The table also shows ART p-values as described in Algorithm 2.1 and ART-based 95% confidence intervals as described in Algorithm 2.2 for q = 8, q = 10, and q = 16.

We summarize the main findings in Table 4 as follows:

1. The choice of q is important for the results of ARTs, but there is currently no theory for choosing this tuning parameter according to a data-dependent criterion. The smaller q is, the more observations are available within each cluster. Having more observations per cluster is important for one of the requirements behind ARTs, namely (19). A small value of q, however, tends to reduce the power of ARTs despite not really affecting the control of the rejection probability under the null hypothesis. This feature can be seen in Table 4, where ART p-values are decreasing in q across different specifications.
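The pseudo-cluster construction described above is straightforward to implement directly. As a minimal sketch (illustrative only; the paper's companion R and Stata packages should be used in practice), the following Python code forms the q consecutive blocks and computes the block-by-block LS estimates on which ARTs are based:

```python
import numpy as np

def form_pseudo_clusters(n, q):
    """Split time indices 0, ..., n-1 into q consecutive blocks.

    The first q - 1 blocks have size b_n = floor(n / q); the last block
    keeps all remaining n - b_n * (q - 1) observations, as in the text.
    """
    b = n // q
    blocks = [np.arange(j * b, (j + 1) * b) for j in range(q - 1)]
    blocks.append(np.arange((q - 1) * b, n))  # q-th pseudo-cluster
    return blocks

def block_estimates(Y, Z, q):
    """Least-squares estimate of beta within each pseudo-cluster."""
    estimates = []
    for idx in form_pseudo_clusters(len(Y), q):
        # LS fit using only the observations in this block.
        beta_hat, *_ = np.linalg.lstsq(Z[idx], Y[idx], rcond=None)
        estimates.append(beta_hat)
    return np.array(estimates)
```

With n = 2631 and q = 10, as in Munyo and Rossi (2015), the first nine pseudo-clusters contain 263 observations each and the last contains 264.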
In this application, where there are many observations per pseudo-cluster even when q = 16, a larger value of q, like q = 10 or q = 16, may be preferable to smaller values, like q = 8, based on power considerations. Note, however, that in Analyses 4 and 5 the choice of q determines whether the null hypothesis is rejected at a given significance level, while the results for Analyses 1–3 are robust to this choice.

2. The results of the t-test with HAC standard errors are consistent with those of ARTs when q = 16. Both methods reject the null hypothesis H0: c'β = 0 at a 10% nominal level across different specifications. The results support the authors' argument that the release of inmates from incarceration increases the chance of re-offense on the day of release.

Specification       (1)              (2)             (3)             (4)              (5)
HAC
  p-value           0.068            0.034           0.034           0.019            0.015
  95% CI            [-0.017, 0.468]  [0.02, 0.5]     [0.019, 0.5]    [0.038, 0.413]   [0.046, 0.421]
CRS: q=8
  estimate          0.277            0.220           0.217           0.177            0.182
  p-value           0.008            0.023           0.023           0.102            0.102
  95% CI            [0.124, 0.429]   [0.035, 0.391]  [0.035, 0.391]  [-0.07, 0.397]   [-0.067, 0.418]
CRS: q=10
  estimate          0.354            0.252           0.257           0.204            0.225
  p-value           0.002            0.014           0.014           0.063            0.053
  95% CI            [0.141, 0.603]   [0.068, 0.446]  [0.068, 0.458]  [-0.023, 0.431]  [-0.003, 0.452]
CRS: q=16
  estimate          0.286            0.229           0.226           0.172            0.207
  p-value           0.002            0.006           0.006           0.027            0.010
  95% CI            [0.131, 0.444]   [0.097, 0.369]  [0.087, 0.371]  [0.02, 0.324]    [0.056, 0.367]
Observations        2631             2631            2631            2631             2631
Time Trend          Year             Day             Month           Intra-month Day  None
Time Fixed Effect   No               No              No              Yes              Yes
Controls            No               No              No              No               No

Table 4: Results for Analyses 1–5. HAC refers to the heteroskedasticity-and-autocorrelation-consistent standard error. ART p-values are obtained using Algorithm 2.1. ART-based 95% confidence intervals are obtained using Algorithm 2.2.

The goal of this paper is to make the general theory developed in Canay et al. (2017a) more accessible by providing a step-by-step algorithmic description of how to implement the test and construct confidence intervals in linear regression models with clustered data, as well as by clarifying the main requirements and limitations of the approach. The two main takeaways are the following. First, implementation of ARTs in linear regression models is as straightforward as other, more conventional inference methods. Algorithms 2.1 and 2.2 provide a clear explanation of how to take ARTs to the data, and the companion
Stata and R packages available as part of the supplemental material are intended to facilitate the adoption of ARTs. Second, our discussion of the main requirements behind ARTs hopefully shows that understanding the trade-offs between ARTs and other popular alternatives for inference with a small number of clusters, like the cluster wild bootstrap, is fundamental for practitioners to choose a method that aligns well with the features of their application. In particular, while ARTs essentially demand that the parameter of interest be suitably estimable cluster by cluster without imposing restrictions on the degree of heterogeneity across clusters, the cluster wild bootstrap requires the clusters to be sufficiently homogeneous (see Canay et al., 2020) without demanding identification of the parameter of interest cluster by cluster.

References
Bertrand, M., Duflo, E. and Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics.

Cameron, A. C., Gelbach, J. B. and Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics.

Canay, I. A. and Kamat, V. (2018). Approximate permutation tests and induced order statistics in the regression discontinuity design. The Review of Economic Studies.

Canay, I. A., Romano, J. P. and Shaikh, A. M. (2017a). Randomization tests under an approximate symmetry assumption. Econometrica.

Canay, I. A., Romano, J. P. and Shaikh, A. M. (2017b). Supplement to 'Randomization tests under an approximate symmetry assumption'. Econometrica Supplemental Material. http://dx.doi.org/10.3982/ECTA13081.

Canay, I. A., Santos, A. and Shaikh, A. M. (2020). The wild bootstrap with a "small" number of "large" clusters. The Review of Economics and Statistics. Forthcoming.

Conley, T., Gonçalves, S. and Hansen, C. (2018). Inference with dependent data in accounting and finance applications. Journal of Accounting Research.

DiCiccio, C. J., DiCiccio, T. J. and Romano, J. P. (2020). Exact tests via multiple data splitting. Statistics & Probability Letters.

Ibragimov, R. and Müller, U. K. (2010). t-statistic based correlation and heterogeneity robust inference. Journal of Business & Economic Statistics.

Ibragimov, R. and Müller, U. K. (2016). Inference with few heterogeneous clusters. Review of Economics and Statistics.

Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika.

Meng, X., Qian, N. and Yared, P. (2015). The institutional causes of China's Great Famine, 1959–1961. The Review of Economic Studies.

Munyo, I. and Rossi, M. A. (2015). First-day criminal recidivism. Journal of Public Economics, 124.