Standard Errors for Panel Data Models with Unknown Clusters
Jushan Bai ∗ Columbia University
Sung Hoon Choi † Rutgers University
Yuan Liao ‡ Rutgers University
May 20, 2020
Abstract
This paper develops a new standard-error estimator for linear panel data models. The proposed estimator is robust to heteroskedasticity, serial correlation, and cross-sectional correlation of unknown forms. The serial correlation is controlled by the Newey-West method. To control for cross-sectional correlations, we propose to use the thresholding method, without assuming the clusters to be known. We establish the consistency of the proposed estimator. Monte Carlo simulations show the method works well. An empirical application is considered.

Keywords: Panel data, clustered standard errors, thresholding, cross-sectional correlation, serial correlation, heteroskedasticity

∗ Address: 420 West 118th St. MC 3308, New York, NY 10027, USA. E-mail: [email protected]
† Address: 75 Hamilton St., New Brunswick, NJ 08901, USA. E-mail: [email protected]
‡ Address: 75 Hamilton St., New Brunswick, NJ 08901, USA. E-mail: [email protected]

Introduction
Consider a linear panel regression with fixed effects: y_it = x'_it β + α_i + µ_t + u_it, where α_i and µ_t are individual and time fixed effects, x_it is a k × 1 vector of regressors, and u_it is an unobservable error component. The outcome variable y_it and the fixed effects are scalars, and β is a k × 1 vector of unknown parameters. The clustered standard errors of Arellano (1987) were developed for the large-N, small-T scenario. The case of large-N, large-T is then studied by Ahn and Moon (2014) and Hansen (2007), among many others, where either cross-sectional or serial independence is required. Hansen (2007) examined the covariance estimator when the time series dependence is left unrestricted. In addition, Vogelsang (2012) studied asymptotic theory that is robust to heteroskedasticity, autocorrelation, and spatial correlation, which extended and generalized the asymptotic results of Hansen (2007) for the conventional cluster standard errors including time fixed effects. Stock and Watson (2008) suggested a bias-adjusted heteroskedasticity-robust variance matrix estimator that handles serial correlations under any sequences of N or T. Also, see Petersen (2009), who used a simulation study to examine different types of standard errors, including the clustered, Fama-MacBeth, and modified Newey-West standard errors for panel data. In general, on the other hand, the conventional cluster standard errors assume that individuals across clusters are independent. Also, the cluster structure should be known, such as schools, villages, industries, or states. See Arellano (2003), Cameron and Miller (2015), and Greene (2003). However, the knowledge of clusters is not available in many applications.

In a recent interesting paper, Abadie, Athey, Imbens, and Wooldridge (2017) argue that clustering is more an issue of sampling design or experimental design. Clustered standard errors are not always necessary, and researchers should be more thoughtful when applying them. One reason is that clustering may result in an unnecessarily wider confidence interval.
Clustered standard errors are derived from the modeling perspective (model-implied variance matrix) and are widely practiced; see, for example, Angrist and Pischke (2008), Cameron and Trivedi (2005), and Wooldridge (2003, 2010). In this paper, we continue to take the modeling perspective. Because of our use of the thresholding method, the resulting confidence interval is not necessarily much wider, even if all cross-sectional units are allowed to be correlated. Furthermore, the proposed approach is also applicable when the knowledge of clustering is not available.

We provide a robust standard error that allows both serial and cross-sectional correlations. We do not impose parametric structures on the serial or cross-sectional correlations. We assume these correlations are weak and apply nonparametric methods to estimate the standard errors. To control for the autocorrelation in the time series, we employ the Newey-West truncation. To control for the cross-sectional correlation, we assume sparsity for cross-section (i, j) pairs, potentially resulting from the presence of cross-sectional clusters, but the knowledge of clustering (the number of clusters and the size of each cluster) is not assumed. We then estimate them by applying the thresholding approach of Bickel and Levina (2008). We also show how to make use of information on clustering when available. In passing, we point out that instead of robust standard errors, in a separate study, Bai, Choi, and Liao (2019) proposed a feasible GLS (FGLS) method to take into account heteroskedasticity and both serial and cross-sectional correlations. The FGLS is more efficient than OLS.

The methods we employ in this paper, banding and thresholding, are regularization methods, and have been used extensively in the recent machine learning literature for estimating high-dimensional parameters. Nonparametric machine learning techniques have proved to be useful tools in econometric studies.

The rest of the paper is organized as follows.
In Section 2, we describe the model and the standard errors, as well as the asymptotic results for OLS. Monte Carlo studies evaluating the finite sample performance of the estimators are presented in Section 3. Section 4 illustrates our methods in an application to US divorce law reform effects. Conclusions are provided in Section 5, and all proofs are given in Appendix A.

Throughout this paper, ν_min(A) and ν_max(A) denote the minimum and maximum eigenvalues of a matrix A. We use ‖A‖ = sqrt(ν_max(A'A)), ‖A‖_1 = max_i Σ_j |A_ij|, and ‖A‖_F = sqrt(tr(A'A)) as the operator norm, the ℓ1-norm, and the Frobenius norm of a matrix A, respectively. Note that if A is a vector, ‖A‖ is the Euclidean norm, and |a| is the absolute value of a scalar a.

The Model and the Standard Errors

We consider the following model:

y_it = x'_it β + u_it,  (2.1)

where β is a k × 1 vector of unknown parameters, x_it is a k × 1 vector of regressors, and u_it represents the error term, often known as the idiosyncratic component. This formulation incorporates the standard fixed effects models as in Hansen (2007). For example, x_it, y_it and u_it can be interpreted as variables resulting from removing the nuisance parameters from the equation, such as first-differencing to remove the fixed effects. Indeed, it is straightforward to allow additive fixed effects by using the usual demeaning procedure.

For a fixed t, model (2.1) can be written as

y_t = x_t β + u_t,  (2.2)

where y_t = (y_1t, ..., y_Nt)' (N × 1), x_t = (x_1t, ..., x_Nt)' (N × k), and u_t = (u_1t, ..., u_Nt)' (N × 1). Similarly, for a fixed i, y_i = (y_i1, ..., y_iT)' (T × 1), x_i = (x_i1, ..., x_iT)' (T × k), and u_i = (u_i1, ..., u_iT)' (T × 1). When y is indexed by t, it refers to an N × 1 vector; when y is indexed by i, it refers to a T × 1 vector. The same convention applies to x and u. There is no confusion when the context is clear.

The (pooled) ordinary least squares (OLS) estimator of β from equations (2.1) and (2.2) may be defined as

β̂ = (Σ_{i=1}^N Σ_{t=1}^T x_it x'_it)^{-1} Σ_{i=1}^N Σ_{t=1}^T x_it y_it = (Σ_{t=1}^T x'_t x_t)^{-1} Σ_{t=1}^T x'_t y_t.
(2.3)

The variance of β̂ depends on both V_X ≡ (1/(NT)) Σ_{i=1}^N Σ_{t=1}^T x_it x'_it and, particularly,

V ≡ Var( (1/√(NT)) Σ_{i=1}^N Σ_{t=1}^T x_it u_it )
  = (1/(NT)) Σ_{t=1}^T E x'_t u_t u'_t x_t + (1/(NT)) Σ_{h=1}^{T-1} Σ_{t=h+1}^T [E x'_t u_t u'_{t-h} x_{t-h} + E x'_{t-h} u_{t-h} u'_t x_t].  (2.4)

The goal of this paper is to consistently estimate V in the presence of both serial and cross-sectional correlations in {u_it}.

There are two types of clustered standard errors suggested by Arellano (1987). The original individual-clustered version is

V̂_CX = (1/(NT)) Σ_{i=1}^N x'_i û_i û'_i x_i,

where û_i = y_i − x_i β̂ are the OLS residuals. This estimator allows for arbitrary serial dependence and heteroskedasticity within individuals. In addition, V̂_CX assumes no cross-section correlation.

The time-clustered version, which allows for heteroskedasticity and arbitrary cross-sectional correlation, is

V̂_CT = (1/(NT)) Σ_{t=1}^T x'_t û_t û'_t x_t,

with û_t = y_t − x_t β̂. Here V̂_CT assumes no serial correlation.

The above clustered standard errors are robust to either arbitrary serial correlation or arbitrary cross-sectional correlation, respectively. In practice, however, since the dependence structure is unknown, an over-rejection problem may occur. Specifically, if there exist both serial and cross-sectional correlations, these estimators are no longer robust, as our numerical evidence in Section 3 shows (e.g., Tables 3 and 5).

To control for the serial correlation, a simple modification of V̂_CT using Newey and West (1987) is

V̂_DK = (1/(NT)) Σ_{t=1}^T x'_t û_t û'_t x_t + (1/(NT)) Σ_{h=1}^L ω(h, L) Σ_{t=h+1}^T [x'_t û_t û'_{t-h} x_{t-h} + x'_{t-h} û_{t-h} û'_t x_t],  (2.5)

where ω(·) is the kernel function and L is the bandwidth. This estimator is suggested by Driscoll and Kraay (1998).
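The OLS estimator and the two clustered estimators above, together with the Driscoll-Kraay modification (2.5), can be sketched in a few lines of numpy. This is a minimal illustration, not code from the paper: the function names, the (N, T, k) array layout, and the Bartlett kernel are our assumptions.

```python
import numpy as np

def pooled_ols(x, y):
    """Pooled OLS estimator (2.3). x: (N, T, k) regressors, y: (N, T)."""
    N, T, k = x.shape
    X, Y = x.reshape(N * T, k), y.reshape(N * T)
    beta = np.linalg.solve(X.T @ X, X.T @ Y)
    return beta, (Y - X @ beta).reshape(N, T)   # residuals u_hat

def v_cx(x, u):
    """Individual-clustered estimator: (1/NT) sum_i x_i' u_i u_i' x_i."""
    N, T, _ = x.shape
    g = np.einsum('itk,it->ik', x, u)           # g_i = x_i' u_i, shape (N, k)
    return g.T @ g / (N * T)

def v_ct_dk(x, u, L=0):
    """Time-clustered estimator; L > 0 adds Bartlett-weighted lag terms,
    giving the Driscoll-Kraay estimator (2.5)."""
    N, T, _ = x.shape
    g = np.einsum('itk,it->tk', x, u)           # g_t = x_t' u_t, shape (T, k)
    V = g.T @ g / (N * T)
    for h in range(1, L + 1):
        w = 1.0 - h / (L + 1.0)                 # Bartlett kernel omega(h, L)
        G = g[h:].T @ g[:-h]                    # sum_t x_t'u_t u_{t-h}'x_{t-h}
        V += w * (G + G.T) / (N * T)
    return V
```

With L = 0, `v_ct_dk` reduces to V̂_CT. The returned matrices estimate V, which is then sandwiched between V_X^{-1} on both sides to form the variance of β̂.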
When N is large, however, (2.5) accumulates a large number of cross-sectional estimation noises.

More generally, let

V_ij ≡ (1/T) Σ_{t=1}^T E x_it u_it u_jt x'_jt + (1/T) Σ_{h=1}^{T-1} Σ_{t=h+1}^T [E x_it u_it u_{j,t-h} x'_{j,t-h} + E x_{i,t-h} u_{i,t-h} u_jt x'_jt].

Then equation (2.4) can be written as V = (1/N) Σ_{ij} V_ij. Unlike time series observations, cross-sectional observations have no natural ordering. They can be arranged in different orders. That is why cross-sectional correlation is more difficult to control. The usual cluster standard error makes the following assumption: let C_1, ..., C_G be disjoint subsets of {1, ..., N}, so that they are known clusters and V_ij = 0 when i and j belong to different clusters. So V can be expressed as

V = (1/N) Σ_{g=1}^G Σ_{(i,j): i,j ∈ C_g} V_ij.

If each C_g is small (this would be the case if the number of clusters G is large) or grows slowly with N, then we only need to estimate the Σ_{(i,j): i,j ∈ C_g} V_ij's, greatly reducing the number of pairwise covariances. But as commented in the literature, this requires the knowledge of C_1, ..., C_G, which in some applications is not naturally available.

Estimating V with unknown clusters

The key assumption we make is that, conditionally on x_t, {u_it} is weakly correlated across both t and i. Essentially, this means V_ij is zero or nearly so for most pairs (i, j). There is a partition {(i, j): i, j ≤ N} = S_s ∪ S_l so that

S_s = {(i, j): ‖E x_it u_it u_{j,t+h} x'_{j,t+h}‖ = 0 for all h},
S_l = {(i, j): ‖E x_it u_it u_{j,t+h} x'_{j,t+h}‖ ≠ 0 for some h},

where the subscript "s" indicates "small" and "l" indicates "large". We assume that (i, i) ∈ S_l for all i ≤ N and, importantly, that most pairs (i, j) belong to S_s. Yet, we do not need to know which elements belong to S_s or S_l. Then

V = (1/N) Σ_{(i,j) ∈ S_l} V_ij.

Furthermore, let ω(h, L) = 1 − h/(L + 1) be the Bartlett kernel. Also see Andrews (1991) for other kernel functions.
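To preview the estimator developed in the remainder of this section, here is a hedged numpy sketch of the pairwise Newey-West blocks and the hard-thresholding step applied to them. The names `s_u` and `v_hard`, the array shapes, and the use of the spectral norm are our illustrative assumptions.

```python
import numpy as np

def s_u(x, u, i, j, L):
    """Newey-West estimate of the (i, j) covariance block.
    x: (N, T, k) regressors, u: (N, T) OLS residuals."""
    T = u.shape[1]
    gi = x[i] * u[i][:, None]                   # rows: x_it * u_it, (T, k)
    gj = x[j] * u[j][:, None]
    S = gi.T @ gj / T
    for h in range(1, L + 1):
        w = 1.0 - h / (L + 1.0)                 # Bartlett kernel
        S += w * (gi[h:].T @ gj[:-h] + gi[:-h].T @ gj[h:]) / T
    return S

def v_hard(x, u, L, M):
    """Hard thresholding: keep the block S_{u,ij} iff i = j or
    ||S_{u,ij}|| > M * omega_NT * sqrt(||S_{u,ii}|| * ||S_{u,jj}||)."""
    N, T, k = x.shape
    omega = L * np.sqrt(np.log(L * N) / T)
    S = [[s_u(x, u, i, j, L) for j in range(N)] for i in range(N)]
    d = [np.linalg.norm(S[i][i], 2) for i in range(N)]   # ||S_{u,ii}||
    V = np.zeros((k, k))
    for i in range(N):
        for j in range(N):
            lam = M * omega * np.sqrt(d[i] * d[j])
            if i == j or np.linalg.norm(S[i][j], 2) > lam:
                V += S[i][j]
    return V / N
```

Setting M = 0 keeps every pair, while a very large M keeps only the diagonal blocks; the soft-thresholding variant introduced below replaces the keep-or-drop rule with entrywise shrinkage sgn(s)(|s| − η)_+.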
As suggested by Newey and West (1987), V_ij can be approximated by

V_{u,ij} ≡ (1/T) Σ_{t=1}^T E x_it u_it u_jt x'_jt + (1/T) Σ_{h=1}^L ω(h, L) Σ_{t=h+1}^T [E x_it u_it u_{j,t-h} x'_{j,t-h} + E x_{i,t-h} u_{i,t-h} u_jt x'_jt].

Then, approximately,

V ≈ (1/N) Σ_{(i,j) ∈ S_l} V_{u,ij}.

The above approximation plays the fundamental role in our standard error estimator. We estimate V_ij using Newey and West (1987), and estimate S_l using cross-sectional thresholding.

To apply Newey and West (1987), we estimate V_{u,ij} by

S_{u,ij} ≡ (1/T) Σ_{t=1}^T x_it û_it û_jt x'_jt + (1/T) Σ_{h=1}^L ω(h, L) Σ_{t=h+1}^T [x_it û_it û_{j,t-h} x'_{j,t-h} + x_{i,t-h} û_{i,t-h} û_jt x'_jt],

where û_it = y_it − x'_it β̂. For a predetermined threshold value λ_ij, we approximate S_l by

Ŝ_l = {(i, j): ‖S_{u,ij}‖ > λ_ij}.

Hence, a "matrix hard-thresholding" estimator of V is

V̂_Hard ≡ (1/N) Σ_{(i,j) ∈ Ŝ_l ∪ {i=j}} S_{u,ij}.

As for the threshold value, we specify

λ_ij = M ω_NT sqrt(‖S_{u,ii}‖ ‖S_{u,jj}‖), where ω_NT = L sqrt(log(LN)/T),

for a constant M >
0. The sequence ω_NT → 0 is the uniform rate of convergence: max_{i,j ≤ N} ‖S_{u,ij} − V_{u,ij}‖ = O_P(ω_NT). In practice, the thresholding constant M can be chosen through multifold cross-validation, which is discussed in the next subsection. In addition, we can obtain V̂_DK from V̂_Hard by setting M = 0.

We also recommend a "matrix soft-thresholding" estimator as follows:

V̂_Soft ≡ (1/N) Σ_{i,j} Ŝ_{u,ij},

where

Ŝ_{u,ij} = S_{u,ij} if i = j;  A_{u,ij} if ‖S_{u,ij}‖ > λ_ij and i ≠ j;  0 if ‖S_{u,ij}‖ ≤ λ_ij and i ≠ j,

and the (k, k')-th element of A_{u,ij} is (sgn(x) denotes the sign function)

A_{u,ij,kk'} = sgn(S_{u,ij,kk'}) [|S_{u,ij,kk'}| − η_{ij,kk'}]_+ if |S_{u,ij,kk'}| > η_{ij,kk'};  0 if |S_{u,ij,kk'}| ≤ η_{ij,kk'},

for the threshold value η_{ij,kk'} = M ω_NT sqrt(|S_{u,ii,kk'}| |S_{u,jj,kk'}|), where ω_NT = L sqrt(log(LN)/T) for some constant M >
0.

Remark 2.1.
The thresholding estimators for V do not assume known cluster information (the number of clusters and the membership of clusters). The method can also be modified to take into account the clustering information when available, and is particularly suitable when the number of clusters is small and the size of each cluster is large. The modification is to apply the thresholding method within each cluster. The conventional clustered standard errors lose a lot of degrees of freedom when the size of a cluster is too large (because each cluster is effectively treated as a "single observation"), resulting in conservative confidence intervals. See Cameron and Miller (2015). The thresholding avoids this problem, while allowing correlations of unknown form within each cluster.

Choice of the tuning parameters

Our suggested estimators, V̂_Hard and V̂_Soft, require the choice of tuning parameters L and M, which are the bandwidth and the threshold constant, respectively. To choose the bandwidth L, we recommend using L = 4(T/100)^{2/9}, as Newey and West (1994) suggested.

In practice, M can be chosen through multifold cross-validation. After obtaining the estimated residuals û_it by OLS, we split the data into two subsets, denoted by {û_it}_{t ∈ J_1} and {û_it}_{t ∈ J_2}; let T(J_1) and T(J_2) be the sizes of J_1 and J_2, with T(J_1) + T(J_2) = T and T(J_1) ≍ T. As suggested by Bickel and Levina (2008), we can set T(J_1) = T(1 − log(T)^{-1}) and T(J_2) = T/log(T); J_1 represents the training data set, and J_2 represents the validation data set.

The procedure requires splitting the data multiple times, say P times. At the p-th split, we denote by V̂_p the sample covariance matrix based on the validation set, defined by

V̂_p = (1/N) Σ_{ij} S^p_{u,ij},

with S^p_{u,ij} defined similarly to S_{u,ij} using data on J_2. Let V̂_s(M) be the thresholding estimator with threshold constant M using the entire sample.
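The block-splitting scheme just described, combined with the grid search that follows, might be sketched as below. This is an illustrative assumption-laden sketch: `estimator` stands for any thresholding estimator of V (such as the hard-thresholding sketch earlier), and using M = 0 on each validation block as the target is our implementation choice.

```python
import numpy as np

def choose_M(x, u, L, grid, estimator):
    """Multifold block cross-validation for the threshold constant M.
    Time is divided into P = log(T) consecutive blocks of length about
    T/log(T); each block plays the role of the validation set J_2 once."""
    N, T, k = x.shape
    P = max(int(np.log(T)), 2)
    edges = np.linspace(0, T, P + 1).astype(int)
    best_M, best_loss = grid[0], np.inf
    for M in grid:
        V_full = estimator(x, u, L, M)          # entire sample, threshold M
        loss = 0.0
        for p in range(P):
            s, e = edges[p], edges[p + 1]
            V_p = estimator(x[:, s:e], u[:, s:e], L, 0.0)  # validation block
            loss += np.linalg.norm(V_full - V_p, 'fro') ** 2 / P
        if loss < best_loss:
            best_M, best_loss = M, loss
    return best_M
```

Keeping each validation set as a consecutive block, rather than a random subsample, preserves the serial correlation within blocks, which is the point of the modification described in the text.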
Then we choose the constant M* by minimizing a cross-validation objective function,

M* = arg min_M (1/P) Σ_{p=1}^P ‖V̂_s(M) − V̂_p‖²_F,

through a grid search.

The above procedure modifies that of Bickel and Levina (2008) in two aspects. One is to use the entire sample when computing V̂_s instead of J_1. Since T(J_1) is close to T, this modification does not change the result much, but it simplifies the computation. The second modification is to use a consecutive block for the validation set because of the time series, so that the serial correlation is not perturbed. Hence, in view of the time series nature, we first divide the data into P = log(T) blocks with block length T/log(T). Each J_2 is taken as one of the P blocks when computing V̂_p, similar to K-fold cross-validation. We have conducted simulations of the cross-validation in the presence of both correlations, and the results show that this procedure performs well. For instance, the cross-validation tends to choose a smaller M as the cross-sectional correlation becomes stronger. Due to the page limit, however, those results are not reported in this paper.

Consistency

Below we present assumptions under which V̂ (either V̂_Hard or V̂_Soft) consistently estimates V. We define

α_NT(h) ≡ sup_X max_{t ≤ T} [‖E(u_t u'_{t-h} | X)‖ + ‖E(u_{t-h} u'_t | X)‖]

and

ρ_{ij,h} ≡ sup_X max_{t ≤ T} [|E(u_it u_{j,t-h} | X)| + |E(u_{i,t-h} u_jt | X)|],

where X = {x_it}_{i ≤ N, t ≤ T}. These coefficients give measures of autocovariances and cross-section covariances.

Assumption 2.1. (i) E(u_t | x_t) = 0.
(ii) Let ν_1 ≤ ... ≤ ν_k be the eigenvalues of (1/(NT)) Σ_{i=1}^N Σ_{t=1}^T E x_it x'_it. Then there exist constants c_1, c_2 > 0 such that c_1 < ν_1 ≤ · · · ≤ ν_k < c_2.

Assumption 2.2. (weak serial and cross-sectional dependence)
(i) Σ_{h=0}^∞ α_NT(h) ≤ C for some C > 0.
In addition, there exist κ ∈ (0, 1) and C > 0 such that for all T > 0,

sup_{A ∈ F^0_{−∞}, B ∈ F^∞_T} |P(A)P(B) − P(AB)| < exp(−C T^κ),

where F^0_{−∞} and F^∞_T denote the σ-algebras generated by {(x_t, u_t): t ≤ 0} and {(x_t, u_t): t ≥ T}, respectively.
(ii) For some q ∈ [0, 1), ω_NT^{−q} max_{i ≤ N} Σ_{j=1}^N (Σ_{h=0}^L ρ_{ij,h})^q = o(1), where ω_NT ≡ L sqrt(log(LN)/T).

Assumption 2.2(i) is the standard alpha-mixing condition, adapted to the large-N panel. Condition (ii) is new here. It requires weak cross-sectional correlations. It is similar to the "approximate sparsity assumption" in Bickel and Levina (2008). Note that we actually allow the presence of many "small" but nonzero ‖E x_it u_it u_{j,t+h} x'_{j,t+h}‖. Clusters that have "large" ‖E x_it u_it u_{j,t+h} x'_{j,t+h}‖ are unknown to us. Hence, the appealing feature of our method is that we allow for unknown clusters.

Essentially, the assumption ω_NT^{−q} max_{i ≤ N} Σ_{j ≤ N} (Σ_{h=0}^L ρ_{ij,h})^q = o(1) controls the order of the number of elements in S_l. The following example presents a case of weak cross-sectional correlations that satisfies condition (ii).

Example 2.1. Suppose that, uniformly for all h = 0, ..., L, E(u_t u'_{t−h} | X) is an N × N block-diagonal matrix, where the size of each block is at most S_NT; this practically means that each cluster contains no more than S_NT individuals, assuming clusters are mutually uncorrelated. Then ρ_{ij,h} = 0 when (i, j) belong to different blocks. Within the same block, almost surely in X,

|E(u_it u_{j,t−h} | X)| + |E(u_{i,t−h} u_jt | X)| ≤ α_NT(h), with Σ_{h=0}^∞ α_NT(h) < ∞.

Then let B(i) denote the block that i belongs to, whose size is at most S_NT. We have

ω_NT^{−q} max_{i ≤ N} Σ_{j=1}^N (Σ_{h=0}^L ρ_{ij,h})^q = ω_NT^{−q} max_{i ≤ N} Σ_{j ∈ B(i)} (Σ_{h=0}^L ρ_{ij,h})^q ≤ C ω_NT^{−q} S_NT (Σ_{h=0}^∞ α_NT(h))^q

for constants c, C > 0. The last term converges to zero so long as ω_NT^{−q} S_NT → 0. This then requires either a fixed or a slowly growing cluster size S_NT.

Assumption 2.3.
(i) For each fixed h, ω(h, L) → 1 as L → ∞, and max_{h ≤ L} |ω(h, L)| ≤ C for some C > 0.
(ii) Exponential tail: There exist r_1, r_2 > 0 and b_1, b_2 > 0, such that r_1^{−1} + r_2^{−1} + κ^{−1} > 1, and for any s > 0 and i ≤ N,

P(|u_it| > s) ≤ exp(−(s/b_1)^{r_1}),  P(|x_it| > s) ≤ exp(−(s/b_2)^{r_2}).

(iii) There is c > 0 such that, for all i, λ_min(var((1/√T) Σ_{t=1}^T x_it u_it)) > c. Additionally, the eigenvalues of V and V_X are bounded away from both zero and infinity.

Condition (i) is well satisfied by various kernels for HAC-type estimators. Condition (ii) ensures the Bernstein-type inequality for weakly dependent data. Note that it requires the underlying distributions to be thin-tailed. Allowing for heavy-tailed distributions is also an important issue. However, it would require a very different estimation method and is out of the scope of this paper. Nevertheless, we have conducted simulation studies under heavy-tailed distributions (e.g., the t-distribution with 5 degrees of freedom). Indeed, the proposed estimator works well in this case, even though the theory requires thin-tailed distributions.

We have the following main theorem; all proofs are contained in Appendix A.

Theorem 2.1. Under Assumptions 2.1-2.3, as N, T → ∞,

√(NT) [V_X^{−1} V̂ V_X^{−1}]^{−1/2} (β̂ − β) →_d N(0, I).

Theorem 2.1 allows us to construct a (1 − τ) confidence interval for c'β for any given c ∈ R^k. The standard error of c'β̂_OLS is

((1/(NT)) c'(V_X^{−1} V̂ V_X^{−1}) c)^{1/2},

and the confidence interval for c'β is [c'β̂ ± z_τ σ̂/√(NT)], where z_τ is the (1 − τ/2) quantile of the standard normal distribution and σ̂ = (c'(V_X^{−1} V̂ V_X^{−1}) c)^{1/2}.

Monte Carlo Simulations

In this section, we examine the finite sample performance of the robust standard errors using a simulation study. The data generating process (DGP) used for the simulation is produced by the fixed-effect linear regression model

y_it = α_i + µ_t + β x_it + u_it,

where the true β = 1.
The DGP allows for serial and cross-sectional correlations in x_it as follows:

x_it = a_i ν_{i+1,t} + ν_{i,t} + b_i ν_{i−1,t},  ν_it = ρ_X ν_{i,t−1} + ε_it,  ε_it ∼ N(0, 1),  ν_{i0} = 0,

and α_i and µ_t are drawn as i.i.d. mean-zero normal random variables. The constants {a_i, b_i}_{i=1}^N are i.i.d. Uniform(0, γ_X), which introduce cross-sectional correlation. In addition, ν_it is modeled as an AR(1) process with autoregressive parameter ρ_X. Throughout this simulation study, we hold ρ_X fixed and set γ_X = 1.

We generate the error terms u_it in three different cases, as follows.

Case 1: u_it = c_i m_{i+1,t} + m_{i,t} + d_i m_{i−1,t},  m_it = ρ m_{i,t−1} + ε_it,  ε_it ∼ N(0, 1),  m_{i0} = 0.

(The simulation results for the heavy-tailed distributions are available upon request from the authors.)

Case 2: u_it = ψ Σ_{j=1}^N w_ij u_jt + η_it,  η_it ∼ N(0, 1),  u_{i0} = 0.

Case 3: u_it = Σ_{k=1}^r λ_ik F_tk + e_it,  F_tk = ρ_F F_{t−1,k} + ξ_tk,  λ_ik = ρ_λ λ_{i−1,k} + ζ_ik,
e_it ∼ N(0, 1),  ξ_tk ∼ N(0, 1),  ζ_ik ∼ N(0, 1).

The regressors and the error terms u_it are uncorrelated with each other. In Case 1, we generate the error term similarly to x_it. The constants {c_i, d_i}_{i=1}^N are i.i.d. Uniform(0, γ), which introduce cross-sectional correlation, and heteroskedasticity when γ > 0; m_it is modeled as an AR(1) process with autoregressive parameter ρ. Varying γ and ρ, we consider four configurations: (a) no correlation (ρ = 0, γ = 0); (b) only serial correlation (ρ = 0.5, γ = 0); (c) only cross-sectional correlation (ρ = 0, γ = 1); and (d) both serial and cross-sectional correlations (ρ ∈ {0.3, 0.9}, γ = 1). In Case 2, the error terms are modeled as a spatial autoregressive (SAR(1)) process. The matrix W = (w_ij)_{N×N} is a rook-type weight matrix whose diagonal elements are zero. Note that the rows of W are standardized, hence they sum to one; ψ is the scalar spatial autoregressive coefficient with |ψ| < 1. In this paper, we report the case of ψ = 0.5. Importantly, the SAR(1) model does not produce serial correlation in the error term. In Case 3, we consider an error factor structure.
Both the factors and the factor loadings follow AR(1) processes, which introduce both serial and cross-sectional correlations. We set r = 2 and hold ρ_λ and ρ_F fixed.

We compute t-statistics for testing the null hypothesis H_0: β = 1 against the alternative H_1: β ≠ 1. In each simulation we compare the proposed estimator with five other common types of standard errors for β̂: the standard White estimator, given by

V̂_White = (1/(NT)) Σ_{i=1}^N Σ_{t=1}^T x̃_it x̃'_it û²_it,

where x̃_it is the demeaned version of the regressor; the two types of clustered standard errors, V̂_CX and V̂_CT, as defined in Section 2; and two types of Newey-West HAC estimators for the panel setting, as follows:

V̂_DK = (1/(NT)) Σ_{t=1}^T x̃'_t û_t û'_t x̃_t + (1/(NT)) Σ_{h=1}^L ω(h, L) Σ_{t=h+1}^T [x̃'_t û_t û'_{t−h} x̃_{t−h} + x̃'_{t−h} û_{t−h} û'_t x̃_t]

and

V̂_HAC = (1/(NT)) Σ_{i=1}^N Σ_{t=1}^T x̃_it x̃'_it û²_it + (1/(NT)) Σ_{i=1}^N Σ_{h=1}^L ω(h, L) Σ_{t=h+1}^T [x̃_it û_it û_{i,t−h} x̃'_{i,t−h} + x̃_{i,t−h} û_{i,t−h} û_it x̃'_it].

Note that V̂_HAC assumes cross-sectional independence, while V̂_DK allows arbitrary cross-sectional dependence. In addition, V̂_DK and V̂_HAC can be obtained from our proposed estimator with M = 0 and a large constant M, respectively.

Results are given for sample sizes N = 50, 200 and T = 100, 200. For each {N, T} combination, we set L = 3, 7, 11 as the bandwidth for V̂_HAC, V̂_DK, and the proposed estimator V̂_Hard. We also use the Bartlett kernel for these three estimators. For the thresholding constant of V̂_Hard, we set M = 0.10, 0.15, 0.20, 0.25 in all cases. The simulation is replicated one thousand times for each case, and the nominal significance level is 0.05. Simulation results are reported in Tables 1-5.
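The Case 1 error process can be generated as in the sketch below. The handling of the boundary units i = 1 and i = N (which lack one neighbour) and the initialization of the AR(1) recursion are our assumptions; the paper does not spell out these details.

```python
import numpy as np

def case1_errors(N, T, rho, gamma, rng):
    """Case 1 errors: u_it = c_i m_{i+1,t} + m_it + d_i m_{i-1,t},
    m_it = rho * m_{i,t-1} + eps_it, with c_i, d_i ~ Uniform(0, gamma).
    Boundary units use only the available neighbours (an assumption)."""
    eps = rng.normal(size=(N, T))
    m = np.zeros((N, T))
    m[:, 0] = eps[:, 0]
    for t in range(1, T):                 # AR(1) recursion over time
        m[:, t] = rho * m[:, t - 1] + eps[:, t]
    c = rng.uniform(0, gamma, size=N)     # cross-sectional mixing weights
    d = rng.uniform(0, gamma, size=N)
    u = m.copy()
    u[:-1] += c[:-1, None] * m[1:]        # c_i * m_{i+1,t}
    u[1:] += d[1:, None] * m[:-1]         # d_i * m_{i-1,t}
    return u

u = case1_errors(50, 100, rho=0.5, gamma=1.0, rng=np.random.default_rng(3))
```

With gamma = 0 the cross-sectional mixing disappears and each u_it is a pure AR(1); with rho = 0 as well, the errors are i.i.d. standard normal, matching configuration (a) in the text.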
Tables 1-5 present the simulation results, where each table corresponds to a different case. Each table presents null rejection probabilities for 5% level tests based on six different standard errors. As expected, a common feature in all tables is that when both N and T are small, all six estimators have rejection probabilities greater than 0.05. This might happen even when the errors are drawn from the i.i.d. standard normal distribution, and the problem becomes more noticeable in the presence of serial, cross-sectional, or both correlations. A number of interesting findings based on the tables are summarized below.

Tables 1-3 show the results of Case 1. In Table 1, Panel A indicates that all the estimators perform well, since there is no correlation. In particular, the White standard errors give rejection probabilities close to 0.05. In Panel B, when serial correlation is introduced, the performances of V̂_CX and V̂_HAC are markedly better than the others, except for the small sample size. In addition, our proposed estimator V̂_Hard also performs well when we use both a larger threshold constant M and a larger bandwidth L. Since there is only serial correlation in the error term, these estimators take this correlation into account and perform well. As the bandwidth increases, the standard error estimated by V̂_HAC increases to a level similar to the results of V̂_CX, and the tendency to over-reject diminishes. Since the Newey-West technique applies weights that are less than one, the estimated standard error may be underestimated. Hence, the traditional cluster standard error V̂_CX dominates the panel Newey-West standard error V̂_HAC. Note that the unreported rejection probabilities of V̂_DK increase sharply as the bandwidth L increases.

Table 2 considers the case of cross-sectionally correlated errors and regressors.
In Panel A, except for the small sample size, V̂_CT and V̂_DK with a small bandwidth L have rejection probabilities close to 0.05. Also, V̂_Hard with small L and M performs well. Importantly, notice that V̂_DK and V̂_Hard tend to over-reject substantially as the lag length L increases. In addition, as the cross-section size N increases, the over-rejection problem becomes worse, as we mentioned in Section 2. This tendency is easy to explain. Since V̂_DK is an estimator based on a single time series and it is zero when full weight is given to the sample autocovariance, the bias in V̂_DK initially falls but then increases as the lag length increases, while the variance of V̂_DK is initially increasing but eventually becomes decreasing. Hence, V̂_DK is biased downward substantially, and its t-statistics tend to over-reject when a large bandwidth is used. On the other hand, with small L and M, V̂_Hard gives less bias in the estimated standard error.

Panel B of Table 2 allows serial correlation as well as cross-sectional correlation. Not surprisingly, all estimators except V̂_Hard and V̂_DK tend to over-reject substantially. In the small sample, these two estimators perform worse than in Panel A. In the large sample, however, the rejection probabilities of V̂_Hard and V̂_DK are close to 0.05. Importantly, V̂_Hard outperforms V̂_DK when M is chosen properly. Unreported results for V̂_DK with a larger bandwidth L show much larger rejection probabilities than those of V̂_Hard. This indicates that we can obtain unbiased standard error estimates and appropriate rejection rates using our proposed estimator V̂_Hard. Table 3 shows the results with strong serial correlation and cross-sectional dependence. When the serial correlation gets stronger, such as ρ = 0.9, all estimators tend to over-reject severely in small samples.
However, V̂_Hard and V̂_DK outperform the other estimators as the dimensionality increases.

Table 4 considers errors with a SAR(1) structure, which does not induce serial correlation in the error term. Similar to the results reported in Panel A of Table 2, V̂_CT gives rejection probabilities close to 0.05. V̂_DK and V̂_Hard with a small bandwidth L also perform well. Moreover, V̂_Hard with a proper thresholding constant M gives less bias than V̂_DK in the estimated standard error.

Finally, Table 5 presents the results for the error factor structure. Similar to the results of Table 3, all estimators except V̂_Hard and V̂_DK tend to over-reject. The rejection probabilities of V̂_Hard and V̂_DK are relatively close to 0.05 when the sample size is large.

Empirical Application

In this section, we re-examine the empirical work on the association between divorce law reforms and divorce rates using our proposed OLS standard errors. There are many empirical studies on the effects of divorce law reforms on divorce rates. Friedberg (1998) found that state law reforms significantly increased divorce rates, with controls for state and year fixed effects. Wolfers (2006) investigated the question of whether law reform continues to have an impact on the divorce rate by including dummy variables for the first two years after the reforms, 3-4 years, 5-6 years, and so on. Specifically, he studied the following fixed-effect panel data model:

y_it = α_i + µ_t + Σ_k β_k X_it,k + δ_i t + u_it,  (4.1)

where y_it is the divorce rate for state i in year t; α_i and µ_t are the state and year fixed effects; X_it,k is a binary regressor representing the treatment effect 2k years after the reform; and δ_i t is a state-specific linear time trend.
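The binary regressors X_it,k in (4.1) can be constructed as sketched below. The two-year bins (years 1-2, 3-4, ..., 15+) follow the description in the text; the exact bin edges and the open-ended final bin are our assumptions, and the function name is hypothetical.

```python
import numpy as np

def reform_dummies(years, reform_year, K=8):
    """Binary regressors X_{it,k} for 2-year bins after a reform:
    k = 1 -> years 1-2 after reform, k = 2 -> years 3-4, ...,
    the last bin is open-ended (15+).  Bin layout is an assumption."""
    t = np.asarray(years) - reform_year   # years elapsed since reform
    X = np.zeros((len(t), K))
    for k in range(1, K + 1):
        if k < K:
            X[:, k - 1] = (t >= 2 * k - 1) & (t <= 2 * k)
        else:
            X[:, k - 1] = t >= 2 * k - 1  # open-ended final bin (15+)
    return X
```

Each row of the returned matrix has at most one nonzero entry, so the β_k trace out the time profile of the treatment effect.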
Wolfers (2006) suggested that two sides of the same treatment might yield this phenomenon: a number of divorces gradually shifted forward in time, following the earlier dissolution of bad matches after the reform.

Both Friedberg (1998) and Wolfers (2006) estimated OLS regressions using state population weights for each year. In addition, they estimated standard errors under the assumption that the errors are homoskedastic and serially and cross-sectionally uncorrelated. However, ignoring these correlations might lead to bias in the standard error estimates. We re-estimate the model of Wolfers (2006) using our proposed OLS standard error estimators.

The same data as in Wolfers (2006) are used, but we exclude Indiana, New Mexico and Louisiana due to missing observations around the divorce law reforms. As a result, we obtain a balanced panel containing the divorce rates, state-level reform years, and the binary regressors from 1956 to 1988 over 48 states. We fit models both with and without linear time trends, and we calculate our standard errors, as well as the OLS, White, cluster, and HAC standard errors. We set the lag choice L = 3 for the HAC and our standard errors, as suggested by Newey and West (1994) (L = 4(T/100)^{2/9}). The threshold constants M chosen by the cross-validation method for the two models are relatively small, implying the existence of cross-sectional correlations. The estimated coefficients with and without linear time trends and their different types of standard errors are presented in Table 6 below. Note that the robust standard errors are not necessarily larger than the usual OLS standard errors, as shown in the columns corresponding to se_CT, se_DK and se_Hard.

In Table 6, the OLS estimates with and without linear time trends are similar to each other. These estimates are also closely comparable to the results obtained in Wolfers (2006). The OLS estimates indicate that divorce rates rose soon after the law reform. However, within a decade, divorce rates had fallen over time.
Most of the coefficient estimates are statistically significant at the 5% level using the usual OLS standard errors. According to the cluster standard errors, however, the only significant estimates are those for 11-15+ years after the reform in the model without a linear time trend. We use our method of correcting the standard error estimates for heteroskedasticity, serial correlation, and also cross-sectional correlation. In the model without linear trends, the estimates for 3-4 and 7-15+ years are significant. On the other hand, the estimates for 1-4 years are significant when linear trends are added. Our estimated standard errors are close to those of se_CT and se_DK, which allow arbitrary cross-sectional correlations. The result indicates non-negligible cross-sectional correlations. The result is also consistent with Kim and Oka (2014), who used the interactive fixed effects approach. The latter approach is suitable for models with strong cross-sectional correlations.

Conclusions

This paper studies the standard error problem for the OLS estimator in linear panel models, and proposes a new standard-error estimator that is robust to heteroskedasticity, serial and cross-sectional correlations when clusters are unknown. Simulated experiments demonstrate the robustness of the new standard-error estimator to various correlation structures.

Table 1: Null rejection probabilities, 5% level. Two-tailed test of H_0: β = 1. Case 1: No cross-sectional correlation (γ = 0).

 N    T    L |  V_Hard: M=0.10  0.15  0.20  0.25 |  V_HAC  V_DK  V_CX  V_CT  V_White
A.
Panel A. No serial correlation: ρ = 0

 50  100   3   .067   .065   .065   .067   .057   .068   .059   .058   .054
 50  100   7   .070   .066   .070   .062   .058   .073   .059   .058   .054
 50  100  11   .082   .071   .056   .055   .057   .088   .059   .058   .054
 50  200   3   .054   .053   .053   .051   .046   .053   .057   .044   .047
 50  200   7   .055   .054   .056   .053   .047   .056   .057   .044   .047
 50  200  11   .056   .054   .051   .047   .047   .061   .057   .044   .047
200  100   3   .062   .065   .059   .060   .047   .065   .051   .055   .047
200  100   7   .071   .066   .057   .050   .047   .075   .051   .055   .047
200  100  11   .079   .065   .057   .047   .047   .091   .051   .055   .047
200  200   3   .051   .051   .053   .052   .048   .051   .051   .051   .048
200  200   7   .057   .055   .055   .054   .047   .057   .051   .051   .048
200  200  11   .058   .052   .049   .046   .047   .060   .051   .051   .048

Panel B. Serial correlation: ρ = 0.5

 50  100   3   .077   .078   .081   .078   .070   .078   .065   .104   .104
 50  100   7   .085   .084   .078   .076   .068   .083   .065   .104   .104
 50  100  11   .086   .082   .070   .066   .067   .091   .065   .104   .104
 50  200   3   .070   .071   .075   .074   .069   .071   .067   .096   .100
 50  200   7   .073   .070   .068   .067   .063   .072   .067   .096   .100
 50  200  11   .077   .074   .065   .064   .061   .072   .067   .096   .100
200  100   3   .078   .080   .080   .077   .065   .080   .053   .103   .094
200  100   7   .083   .078   .072   .061   .057   .082   .053   .103   .094
200  100  11   .087   .071   .059   .058   .055   .105   .053   .103   .094
200  200   3   .057   .056   .055   .056   .052   .057   .045   .085   .082
200  200   7   .059   .054   .053   .051   .048   .064   .045   .085   .082
200  200  11   .064   .057   .054   .047   .047   .067   .045   .085   .082

Table 2: Null rejection probabilities, 5% level. Two-tailed test of H0: β = 1. Case 1: Cross-sectional correlation (γ = 1).

                V_Hard, M =
  N    T   L   0.10   0.15   0.20   0.25   V_HAC   V_DK   V_CX   V_CT    V_W
Panel A. No serial correlation: ρ = 0

 50  100   3   .054   .054   .055   .055   .142   .055   .152   .054   .140
 50  100   7   .066   .064   .063   .069   .141   .068   .152   .054   .140
 50  100  11   .078   .077   .082   .109   .145   .079   .152   .054   .140
 50  200   3   .046   .049   .046   .047   .133   .049   .145   .043   .132
 50  200   7   .054   .055   .060   .060   .134   .052   .145   .043   .132
 50  200  11   .059   .062   .063   .073   .135   .058   .145   .043   .132
200  100   3   .060   .060   .060   .064   .148   .058   .150   .053   .148
200  100   7   .069   .073   .075   .080   .149   .067   .150   .053   .148
200  100  11   .086   .085   .096   .126   .151   .084   .150   .053   .148
200  200   3   .050   .051   .051   .050   .121   .050   .128   .049   .121
200  200   7   .057   .058   .057   .057   .122   .058   .128   .049   .121
200  200  11   .063   .062   .064   .079   .123   .062   .128   .049   .121

Panel B. Serial correlation: ρ = 0.3

 50  100   3   .070   .069   .069   .067   .150   .069   .155   .074   .176
 50  100   7   .074   .075   .073   .078   .150   .077   .155   .074   .176
 50  100  11   .083   .079   .093   .108   .150   .085   .155   .074   .176
 50  200   3   .058   .058   .058   .058   .150   .058   .146   .069   .171
 50  200   7   .055   .060   .062   .061   .142   .056   .146   .069   .171
 50  200  11   .059   .063   .071   .082   .142   .060   .146   .069   .171
200  100   3   .078   .076   .077   .072   .162   .080   .157   .091   .185
200  100   7   .083   .087   .083   .084   .160   .086   .157   .091   .185
200  100  11   .097   .089   .103   .133   .159   .101   .157   .091   .185
200  200   3   .055   .055   .054   .056   .132   .056   .133   .068   .157
200  200   7   .053   .051   .056   .057   .130   .057   .133   .068   .157
200  200  11   .057   .059   .065   .078   .130   .061   .133   .068   .157

Table 3: Null rejection probabilities, 5% level. Two-tailed test of H0: β = 1. Case 1: Both strong serial and cross-sectional correlations (ρ = 0. , γ = 1).
                V_Hard, M =
  N    T   L   0.10   0.15   0.20   0.25   V_HAC   V_DK   V_CX   V_CT    V_W
 50  100   3   .096   .097   .098   .098   .180   .098   .168   .145   .271
 50  100   7   .101   .101   .100   .101   .173   .102   .168   .145   .271
 50  100  11   .113   .101   .110   .128   .174   .113   .168   .145   .271
 50  200   3   .091   .092   .092   .092   .194   .092   .180   .157   .286
 50  200   7   .088   .087   .090   .093   .185   .087   .180   .157   .286
 50  200  11   .092   .092   .099   .114   .182   .092   .180   .157   .286
200  100   3   .089   .087   .087   .086   .193   .089   .146   .144   .256
200  100   7   .092   .089   .095   .097   .178   .097   .146   .144   .256
200  100  11   .100   .096   .109   .128   .173   .109   .146   .144   .256
200  200   3   .069   .069   .069   .067   .146   .068   .125   .121   .226
200  200   7   .066   .068   .069   .067   .136   .069   .125   .121   .226
200  200  11   .072   .071   .076   .087   .133   .072   .125   .121   .226

Table 4: Null rejection probabilities, 5% level. Two-tailed test of H0: β = 1. Case 2: Errors with spatial AR(1) structure (ψ = 0. ).

                V_Hard, M =
  N    T   L   0.10   0.15   0.20   0.25   V_HAC   V_DK   V_CX   V_CT    V_W
 50  100   3   .061   .060   .062   .057   .124   .059   .143   .053   .125
 50  100   7   .068   .068   .073   .077   .125   .067   .143   .053   .125
 50  100  11   .086   .083   .089   .110   .128   .085   .143   .053   .125
 50  200   3   .046   .046   .047   .047   .113   .046   .130   .043   .111
 50  200   7   .052   .053   .053   .059   .114   .048   .130   .043   .111
 50  200  11   .052   .059   .065   .083   .112   .061   .130   .043   .111
200  100   3   .061   .057   .058   .057   .123   .062   .120   .051   .122
200  100   7   .070   .071   .070   .081   .122   .068   .120   .051   .122
200  100  11   .082   .080   .101   .117   .121   .088   .120   .051   .122
200  200   3   .055   .055   .053   .050   .124   .056   .125   .049   .123
200  200   7   .064   .061   .062   .064   .123   .064   .125   .049   .123
200  200  11   .069   .070   .083   .099   .123   .068   .125   .049   .123

Table 5: Null rejection probabilities, 5% level. Two-tailed test of H0: β = 1. Case 3: Errors with factor structure (ρ_F = 0. , ρ_λ = 0. ).
                V_Hard, M =
  N    T   L   0.10   0.15   0.20   0.25   V_HAC   V_DK   V_CX   V_CT    V_W
 50  100   3   .081   .080   .078   .081   .129   .078   .102   .130   .202
 50  100   7   .091   .084   .079   .082   .115   .093   .102   .130   .202
 50  100  11   .103   .086   .094   .086   .115   .108   .102   .130   .202
 50  200   3   .083   .082   .081   .081   .117   .080   .095   .132   .183
 50  200   7   .066   .066   .065   .073   .107   .065   .095   .132   .183
 50  200  11   .072   .076   .080   .084   .104   .072   .095   .132   .183
200  100   3   .076   .074   .076   .074   .109   .076   .086   .121   .167
200  100   7   .077   .080   .077   .074   .105   .083   .086   .121   .167
200  100  11   .081   .076   .084   .094   .103   .096   .086   .121   .167
200  200   3   .072   .072   .073   .069   .115   .071   .090   .126   .184
200  200   7   .070   .067   .071   .074   .106   .068   .090   .126   .184
200  200  11   .074   .073   .073   .075   .104   .072   .090   .126   .184

Table 6: Empirical application: effects of divorce law reform with state and year fixed effects. US state-level annual data from 1956 to 1988; the dependent variable is the divorce rate per 1,000 persons per year. OLS estimates and standard errors (using state population weights).

Effect        β̂_OLS   se_OLS   se_W    se_CX   se_CT   se_HAC  se_DK   se_Hard

Panel A: Without state-specific linear time trends
1–2 years     .256    .086*    .140    .189    .139    .172    .155    .148
3–4 years     .209    .086*    .081*   .159    .075*   .114    .104*   .089*
5–6 years     .126    .086     .073    .168    .064*   .105    .088    .069
7–8 years     .105    .086     .070    .165    .059    .100    .065    .040*
9–10 years   -.122    .085     .060*   .161    .041*   .088    .058*   .054*
11–12 years  -.344    .085*    .071*   .173*   .043*   .101*   .056*   .075*
13–14 years  -.496    .085*    .074*   .188*   .050*   .110*   .054*   .062*
15+ years    -.508    .081*    .089*   .223*   .048*   .139*   .061*   .077*

Panel B: With state-specific linear time trends
1–2 years     .286    .064*    .152    .206    .143*   .185    .145*   .140*
3–4 years     .254    .071*    .099*   .171    .102*   .140    .134    .126*
5–6 years     .186    .079*    .102    .206    .110    .145    .148    .143
7–8 years     .177    .086*    .109    .230    .120    .153    .155    .146
9–10 years   -.037    .093     .111    .241    .120    .156    .164    .154
11–12 years  -.247    .100*    .128    .268    .141    .179    .196    .183
13–14 years  -.386    .108*    .137*   .296    .164*   .193*   .218    .209
15+ years    -.414    .120*    .158*   .337    .186*   .221    .251    .243

Note: Standard errors with asterisks indicate
significance at the 5% level using N(0, 1) critical values; se_OLS and se_W refer to the OLS and White standard errors, respectively; se_CX and se_CT are clustered standard errors suggested by Arellano (1987); se_HAC and se_DK are the two types of Newey-West HAC estimators explained in the text; se_Hard is our standard error. The Bartlett kernel with lag length L = 3 is used for se_HAC, se_DK and se_Hard. The threshold values for se_Hard chosen by cross-validation are M = 0. and M = 0. .

Appendix

Throughout the proofs, $\max_i$, $\max_t$, $\max_h$, $\max_{ij}$, $\max_{it}$, $\sum_i$, $\sum_t$, and $\sum_{ij}$ denote $\max_{i \le N}$, $\max_{t \le T}$, $\max_{h \le L}$, $\max_{i,j \le N}$, $\max_{i \le N,\, t \le T}$, $\sum_{i=1}^{N}$, $\sum_{t=1}^{T}$, and $\sum_{i=1}^{N}\sum_{j=1}^{N}$, respectively.

A.1 Proof of Theorem 2.1

First let
\[
V_L = \frac{1}{NT}\sum_t E x_t' u_t u_t' x_t + \frac{1}{NT}\sum_{h=1}^{L}\omega(h,L)\sum_{t=h+1}^{T}\big[E x_t' u_t u_{t-h}' x_{t-h} + E x_{t-h}' u_{t-h} u_t' x_t\big].
\]
We need the following lemmas to prove the main results.

Lemma A.1. (i) $\|V - V_L\| \le C\sum_{h=L+1}^{T-1}\alpha_{NT}(h) + C\sum_{h=1}^{L}(1-\omega(h,L))\alpha_{NT}(h)$.
(ii) $\max_i \|V_{u,ii} - \mathrm{var}(\frac{1}{\sqrt{T}}\sum_{t=1}^{T} x_{it}u_{it})\| = o(1)$.
(iii) $\min_i \lambda_{\min}(V_{u,ii}) > c$.

Proof. (i) First note that
\begin{align*}
\|E x_t' u_t u_{t-h}' x_{t-h} + E x_{t-h}' u_{t-h} u_t' x_t\|
&\le E\|x_t\|\,\|E(u_t u_{t-h}' \mid X)\|\,\|x_{t-h}\| + E\|x_{t-h}\|\,\|E(u_{t-h} u_t' \mid X)\|\,\|x_t\|\\
&\le \alpha_{NT}(h)\, E\|x_t\|\|x_{t-h}\| \le CN\alpha_{NT}(h).
\end{align*}
Hence for some $C, c > 0$,
\begin{align*}
\|V - V_L\| &\le \frac{1}{NT}\Big\|\sum_{h=L+1}^{T-1}\sum_{t=h+1}^{T}\big[E x_t' u_t u_{t-h}' x_{t-h} + E x_{t-h}' u_{t-h} u_t' x_t\big]\Big\|\\
&\quad + \frac{1}{NT}\Big\|\sum_{h=1}^{L}(1-\omega(h,L))\sum_{t=h+1}^{T}\big[E x_t' u_t u_{t-h}' x_{t-h} + E x_{t-h}' u_{t-h} u_t' x_t\big]\Big\|\\
&\le \frac{C}{T}\sum_{h=L+1}^{T-1}\sum_{t=h+1}^{T}\alpha_{NT}(h) + \frac{C}{T}\sum_{h=1}^{L}(1-\omega(h,L))\sum_{t=h+1}^{T}\alpha_{NT}(h)\\
&\le C\sum_{h>L}\alpha_{NT}(h) + C\sum_{h=1}^{L}|1-\omega(h,L)|\,\alpha_{NT}(h) = o(1).
\end{align*}
The second term of the last line goes to zero by Assumption 2.2(iii) and the dominated convergence theorem, noting that $|1-\omega(h,L)|\alpha_{NT}(h) \le C\alpha_{NT}(h)$ and $\alpha_{NT}(h)$ is summable over $h$.
(ii) The proof of $\max_i \|V_{u,ii} - \mathrm{var}(\frac{1}{\sqrt{T}}\sum_{t=1}^{T} x_{it}u_{it})\| = o(1)$ follows from the same argument.
(iii) The result follows from (ii) and the assumption that $\min_i \lambda_{\min}(\mathrm{var}(\frac{1}{\sqrt{T}}\sum_{t=1}^{T} x_{it}u_{it})) > c$. $\square$

Lemma A.2. Suppose $\log N = o(T)$. For $f(t,h,L) = \omega(h,L)1\{t > h\}$,
\[
\max_h \max_{i,j}\Big\|\frac{1}{T}\sum_{t=1}^{T} x_{it}u_{it}u_{j,t-h}x_{j,t-h}' f(t,h,L) - E x_{it}u_{it}u_{j,t-h}x_{j,t-h}' f(t,h,L)\Big\| = O_P\Big(\sqrt{\frac{\log(LN)}{T}}\Big).
\]
Proof. The left-hand side can be written as $\max_h\max_{ij}\|\frac{1}{T}\sum_t Z_{h,ij,t}\|$, where $Z_{h,ij,t} = f(t,h,L)(x_{it}u_{it}u_{j,t-h}x_{j,t-h}' - E x_{it}u_{it}u_{j,t-h}x_{j,t-h}')$. For convenience, assume that $\dim(Z_{h,ij,t}) = 1$ and that there is no serial correlation. Set $\alpha_n = \sqrt{\log(LN)/T}$ and $c^2 = 3C$ for $c, C > 0$. Then, by the Bernstein inequality and the exponential tail conditions, and since $f(t,h,L)$ is bounded,
\begin{align*}
P\Big(\max_{h \le L}\max_{ij}\Big|\frac{1}{T}\sum_{t=1}^{T} Z_{h,ij,t}\Big| > c\alpha_n\Big)
&\le LN^2 \max_{h \le L}\max_{ij} P\Big(\Big|\frac{1}{T}\sum_{t=1}^{T} Z_{h,ij,t}\Big| > c\alpha_n\Big)\\
&\le LN^2 \exp\Big(-\frac{T c^2 \alpha_n^2}{C}\Big) \le \exp\Big(2\log(LN) - \frac{T c^2 \alpha_n^2}{C}\Big)\\
&= \exp(-\log(LN)) = \frac{1}{LN} \to 0. \qquad\square
\end{align*}

Lemma A.3. Suppose $\log N = o(T)$. For $f(t,h,L) = \omega(h,L)1\{t > h\}$,
\[
\max_h \max_{i,j}\Big\|\frac{1}{T}\sum_{t=1}^{T} x_{it}\hat u_{it}\hat u_{j,t-h}x_{j,t-h}' f(t,h,L) - x_{it}u_{it}u_{j,t-h}x_{j,t-h}' f(t,h,L)\Big\| = O_P\Big(\frac{1}{T}\sqrt{\frac{\log(LN)}{N}}\Big).
\]
Proof. The left-hand side is bounded by $a_1 + a_2 + a_3$, where
\begin{align*}
a_1 &= \max_h\max_{i,j}\Big\|\frac{1}{T}\sum_{t=1}^{T} x_{it}(\hat u_{it}-u_{it})(\hat u_{j,t-h}-u_{j,t-h})x_{j,t-h}' f(t,h,L)\Big\|,\\
a_2 &= \max_h\max_{i,j}\Big\|\frac{1}{T}\sum_{t=1}^{T} x_{it}u_{it}(\hat u_{j,t-h}-u_{j,t-h})x_{j,t-h}' f(t,h,L)\Big\|,\\
a_3 &= \max_h\max_{i,j}\Big\|\frac{1}{T}\sum_{t=1}^{T} x_{it}(\hat u_{it}-u_{it})u_{j,t-h}x_{j,t-h}' f(t,h,L)\Big\|.
\end{align*}
For simplicity, assume $\dim(x_{it}) = 1$. Then
\[
a_1 \le \|\hat\beta - \beta\|^2 \max_h\max_{i,j}\Big\|\frac{1}{T}\sum_{t=1}^{T} x_{it}x_{it}x_{j,t-h}x_{j,t-h} f(t,h,L)\Big\| \le O_P\Big(\frac{1}{NT}\Big)\max_h\max_{ij}\frac{1}{T}\sum_{t=1}^{T} x_{it}^2 x_{j,t-h}^2 = O_P\Big(\frac{1}{NT}\Big).
\]
By the Bernstein inequality for weakly dependent data and the exponential tail conditions, and since $f(t,h,L)$ is bounded,
\[
a_2 \le \|\hat\beta - \beta\| \max_h\max_{i,j}\Big\|\frac{1}{T}\sum_{t=1}^{T} x_{it}u_{it}x_{j,t-h}x_{j,t-h} f(t,h,L)\Big\| \le O_P\Big(\frac{1}{\sqrt{NT}}\Big)O_P\Big(\sqrt{\frac{\log(LN)}{T}}\Big) = O_P\Big(\frac{1}{T}\sqrt{\frac{\log(LN)}{N}}\Big),
\]
and $a_3$ is bounded using the same argument. Together,
\[
\max_h\max_{i,j}\Big\|\frac{1}{T}\sum_{t=1}^{T} x_{it}\hat u_{it}\hat u_{j,t-h}x_{j,t-h}' f(t,h,L) - x_{it}u_{it}u_{j,t-h}x_{j,t-h}' f(t,h,L)\Big\| = O_P\Big(\frac{1}{T}\sqrt{\frac{\log(LN)}{N}}\Big). \qquad\square
\]

Proof of Theorem 2.1. It suffices to prove $\|\hat V - V\| = o_P(1)$. By Lemma A.1, we have
\[
\|\hat V - V\| \le \|\hat V - V_L\| + C\sum_{h>L}\alpha_{NT}(h) + C\sum_{h=1}^{L}(1-\omega(h,L))\alpha_{NT}(h).
\]
The remaining step is to show $\|\hat V - V_L\| = o_P(1)$, given below.

Main proof of the convergence of $\|\hat V - V_L\|$. Note that $V_L = \frac{1}{N}\sum_{ij} V_{u,ij}$ and $\hat V = \frac{1}{N}\sum_{ij}\hat S_{u,ij}$. Hence
\[
\|\hat V - V_L\| \le \frac{1}{N}\sum_{\hat S_{u,ij}\ne 0}\|V_{u,ij} - \hat S_{u,ij}\| + \frac{1}{N}\sum_{\hat S_{u,ij}=0}\|V_{u,ij} - \hat S_{u,ij}\|.
\]
We work on the event $\max_{ij}\|S_{u,ij} - V_{u,ij}\| \le C\omega_{NT}$, which holds with probability approaching one as shown at the end of the proof, and first verify that on this event $\|S_{u,ij} - V_{u,ij}\| < 0.5\lambda_{ij}$ for all $(i,j)$. Indeed, for some $C, c > 0$,
\[
\|S_{u,ii}\| \ge \|V_{u,ii}\| - \|S_{u,ii} - V_{u,ii}\| \ge \|V_{u,ii}\| - \max_{ij}\|S_{u,ij} - V_{u,ij}\| \ge \|V_{u,ii}\| - C\omega_{NT} > c/2,
\]
since from Assumption 2.3, $\|V_{u,ii}\| > c > 0$. Then
\[
\lambda_{ij} = M\omega_{NT}\sqrt{\|S_{u,ii}\|\|S_{u,jj}\|} > \frac{c}{2}M\omega_{NT} \ge 2C\omega_{NT} \ge 2\max_{ij}\|S_{u,ij} - V_{u,ij}\|
\]
for $M$ sufficiently large. Therefore $\|S_{u,ij} - V_{u,ij}\| < 0.5\lambda_{ij}$ for all $(i,j)$.

Recall $\rho_{ij,h} = \sup_X \max_t |E(u_{it}u_{j,t-h} \mid X)| + |E(u_{i,t-h}u_{jt} \mid X)|$. Then
\begin{align*}
\|V_{u,ij}\| &\le \Big\|\frac{1}{T}\sum_t E x_{it}u_{it}u_{jt}x_{jt}'\Big\| + \frac{1}{T}\sum_{h=1}^{L}\omega(h,L)\sum_{t=h+1}^{T}\big\|E x_{it}u_{it}u_{j,t-h}x_{j,t-h}' + E x_{i,t-h}u_{i,t-h}u_{jt}x_{jt}'\big\|\\
&\le C\rho_{ij,0} + \frac{C}{T}\sum_{h=1}^{L}\omega(h,L)\sum_{t=h+1}^{T}\rho_{ij,h} \le C\sum_{h=0}^{L}\rho_{ij,h}.
\end{align*}
Hence, on the event $\max_{ij}\|S_{u,ij} - V_{u,ij}\| \le C\omega_{NT}$,
\begin{align*}
\frac{1}{N}\sum_{\hat S_{u,ij}=0}\|V_{u,ij} - \hat S_{u,ij}\| &= \frac{1}{N}\sum_{\hat S_{u,ij}=0}\|V_{u,ij}\| \le \frac{1}{N}\sum_{ij}\|V_{u,ij}\|1\{\|S_{u,ij}\| < \lambda_{ij}\}\\
&= \frac{1}{N}\sum_{ij}\|V_{u,ij}\|1\{\|V_{u,ij}\| \le \|S_{u,ij}\| + \|S_{u,ij} - V_{u,ij}\|,\ \|S_{u,ij}\| < \lambda_{ij}\}\\
&\le \frac{1}{N}\sum_{ij}\|V_{u,ij}\|(1.5\lambda_{ij})^{1-q}\|V_{u,ij}\|^{q-1}1\{\|V_{u,ij}\| < 1.5\lambda_{ij}\}\\
&\le \frac{1}{N}\sum_{ij}\|V_{u,ij}\|^{q}(1.5\lambda_{ij})^{1-q} \le C\omega_{NT}^{1-q}\frac{1}{N}\sum_{ij}\|V_{u,ij}\|^{q} \le C\omega_{NT}^{1-q}\max_i\sum_j\Big(\sum_{h=0}^{L}\rho_{ij,h}\Big)^{q}.
\end{align*}
On the other hand, on the event $\max_{ij}\|S_{u,ij} - V_{u,ij}\| \le C\omega_{NT}$,
\begin{align*}
\frac{1}{N}\sum_{\hat S_{u,ij}\ne 0}\|V_{u,ij} - \hat S_{u,ij}\| &\le \frac{1}{N}\sum_{\hat S_{u,ij}\ne 0}\|V_{u,ij} - S_{u,ij}\| + \frac{1}{N}\sum_{\hat S_{u,ij}\ne 0}\|S_{u,ij} - \hat S_{u,ij}\|\\
&\le \frac{1}{N}\sum_{\hat S_{u,ij}\ne 0} 0.5\lambda_{ij} + \frac{1}{N}\sum_{\hat S_{u,ij}\ne 0}\lambda_{ij} \le \frac{1}{N}\sum_{ij} 1.5\lambda_{ij}1\{\|S_{u,ij}\| > \lambda_{ij}\}\\
&\le \frac{1}{N}\sum_{ij} 1.5\lambda_{ij}1\{\|V_{u,ij}\| \ge \|S_{u,ij}\| - \|S_{u,ij} - V_{u,ij}\|,\ \|S_{u,ij}\| > \lambda_{ij}\}\\
&\le \frac{1}{N}\sum_{ij} 1.5\lambda_{ij}\|V_{u,ij}\|^{q}(0.5\lambda_{ij})^{-q}1\{\|V_{u,ij}\| > 0.5\lambda_{ij}\}\\
&\le \frac{1}{N}\sum_{ij} C\lambda_{ij}^{1-q}\|V_{u,ij}\|^{q} \le C\omega_{NT}^{1-q}\frac{1}{N}\sum_{ij}\|V_{u,ij}\|^{q} \le C\omega_{NT}^{1-q}\max_i\sum_j\Big(\sum_{h=0}^{L}\rho_{ij,h}\Big)^{q}.
\end{align*}
Hence $\|\hat V - V_L\| \le C\omega_{NT}^{1-q}\max_i\sum_j(\sum_{h=0}^{L}\rho_{ij,h})^{q}$. Therefore, we have
\[
\|\hat V - V\| \le O_P\Big(\omega_{NT}^{1-q}\max_i\sum_j\Big(\sum_{h=0}^{L}\rho_{ij,h}\Big)^{q}\Big) + C\sum_{h=L+1}^{T-1}\alpha_{NT}(h) + C\sum_{h=1}^{L}(1-\omega(h,L))\alpha_{NT}(h).
\]

Remaining proof: $\max_{ij}\|S_{u,ij} - V_{u,ij}\| = O_P(\omega_{NT})$, where $\omega_{NT} = L\sqrt{\log(LN)/T}$. Recall
\begin{align*}
S_{u,ij} &\equiv \frac{1}{T}\sum_t x_{it}\hat u_{it}\hat u_{jt}x_{jt}' + \frac{1}{T}\sum_{h=1}^{L}\omega(h,L)\sum_{t=h+1}^{T}\big[x_{it}\hat u_{it}\hat u_{j,t-h}x_{j,t-h}' + x_{i,t-h}\hat u_{i,t-h}\hat u_{jt}x_{jt}'\big],\\
V_{u,ij} &\equiv \frac{1}{T}\sum_t E x_{it}u_{it}u_{jt}x_{jt}' + \frac{1}{T}\sum_{h=1}^{L}\omega(h,L)\sum_{t=h+1}^{T}\big[E x_{it}u_{it}u_{j,t-h}x_{j,t-h}' + E x_{i,t-h}u_{i,t-h}u_{jt}x_{jt}'\big].
\end{align*}
Let
\[
M_{u,ij} \equiv \frac{1}{T}\sum_t x_{it}u_{it}u_{jt}x_{jt}' + \frac{1}{T}\sum_{h=1}^{L}\omega(h,L)\sum_{t=h+1}^{T}\big[x_{it}u_{it}u_{j,t-h}x_{j,t-h}' + x_{i,t-h}u_{i,t-h}u_{jt}x_{jt}'\big].
\]
We first bound $\max_{ij}\|M_{u,ij} - V_{u,ij}\|$, and then bound $\max_{ij}\|S_{u,ij} - M_{u,ij}\|$.

Proof of $\max_{ij}\|M_{u,ij} - V_{u,ij}\| = O_P(L\sqrt{\log(LN)/T})$. Given Lemma A.2, we have
\begin{align*}
\max_{ij}\|M_{u,ij} - V_{u,ij}\| &\le O_P\Big(\sqrt{\frac{\log(LN)}{T}}\Big)\\
&\quad + 2\max_{ij}\Big\|\frac{1}{T}\sum_{h=1}^{L}\sum_{t=1}^{T}\big[x_{it}u_{it}u_{j,t-h}x_{j,t-h}' f(t,h,L) - E x_{it}u_{it}u_{j,t-h}x_{j,t-h}' f(t,h,L)\big]\Big\|\\
&\le O_P\Big(\sqrt{\frac{\log(LN)}{T}}\Big) + L\,O_P\Big(\sqrt{\frac{\log(LN)}{T}}\Big) = O_P\Big(L\sqrt{\frac{\log(LN)}{T}}\Big).
\end{align*}
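Before turning to the remaining bound, the thresholding step analyzed above can be made concrete. The sketch below applies the hard-thresholding rule with the entry-adaptive threshold λ_ij = M ω_NT √(‖S_u,ii‖ ‖S_u,jj‖) in the scalar-block case (dim(x_it) = 1, so each S_u,ij is a scalar); the function name and the values of M and ω_NT are our own illustrative choices, not the paper's:

```python
import numpy as np

def hard_threshold(S: np.ndarray, M: float, omega_NT: float) -> np.ndarray:
    """Hard-threshold the off-diagonal entries of S: keep S[i, j] only if
    |S[i, j]| >= lam_ij = M * omega_NT * sqrt(|S[i, i]| * |S[j, j]|).
    Diagonal entries are never thresholded."""
    d = np.sqrt(np.abs(np.diag(S)))           # sqrt(|S_ii|)
    lam = M * omega_NT * np.outer(d, d)       # entry-adaptive thresholds lam_ij
    S_hat = np.where(np.abs(S) >= lam, S, 0.0)
    np.fill_diagonal(S_hat, np.diag(S))       # restore the diagonal
    return S_hat

S = np.array([[1.0, 0.30, 0.02],
              [0.30, 1.0, 0.05],
              [0.02, 0.05, 1.0]])
print(hard_threshold(S, M=0.5, omega_NT=0.2))  # entries below 0.1 are set to 0
```

The variance estimator is then formed from the thresholded blocks, as in the decomposition above of ‖V̂ − V_L‖ over the kept and zeroed-out pairs.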
Proof of $\max_{ij}\|M_{u,ij} - S_{u,ij}\| = O_P(\frac{L}{T}\sqrt{\log(LN)/N})$. Given Lemma A.3, we have
\begin{align*}
\max_{ij}\|M_{u,ij} - S_{u,ij}\| &\le O_P\Big(\frac{1}{T}\sqrt{\frac{\log(LN)}{N}}\Big)\\
&\quad + 2\max_{ij}\Big\|\frac{1}{T}\sum_{h=1}^{L}\sum_{t=1}^{T}\big[x_{it}\hat u_{it}\hat u_{j,t-h}x_{j,t-h}' f(t,h,L) - x_{it}u_{it}u_{j,t-h}x_{j,t-h}' f(t,h,L)\big]\Big\|\\
&\le O_P\Big(\frac{1}{T}\sqrt{\frac{\log(LN)}{N}}\Big) + L\,O_P\Big(\frac{1}{T}\sqrt{\frac{\log(LN)}{N}}\Big) = O_P\Big(\frac{L}{T}\sqrt{\frac{\log(LN)}{N}}\Big).
\end{align*}
Together,
\[
\max_{ij}\|V_{u,ij} - S_{u,ij}\| = O_P\Big(L\sqrt{\frac{\log(LN)}{T}}\Big) + O_P\Big(\frac{L}{T}\sqrt{\frac{\log(LN)}{N}}\Big) = O_P\Big(L\sqrt{\frac{\log(LN)}{T}}\Big). \qquad\square
\]

References

Abadie, A., S. Athey, G. W. Imbens, and J. Wooldridge (2017): "When Should You Adjust Standard Errors for Clustering?," National Bureau of Economic Research Working Paper No. 24003.

Ahn, S. C., and H. R. Moon (2014): "Large-N and Large-T Properties of Panel Data Estimators and the Hausman Test," in Festschrift in Honor of Peter Schmidt, pp. 219–258. Springer.

Andrews, D. W. (1991): "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation," Econometrica, 59(3), 817–858.

Angrist, J. D., and J.-S. Pischke (2008): Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Arellano, M. (1987): "Computing Robust Standard Errors for Within-groups Estimators," Oxford Bulletin of Economics and Statistics, 49(4), 431–434.

Arellano, M. (2003): Panel Data Econometrics. Oxford University Press.

Bai, J., S. H. Choi, and Y. Liao (2019): "Feasible Generalized Least Squares for Panel Data Models with Cross-sectional and Serial Correlations," Working paper.

Bickel, P. J., and E. Levina (2008): "Covariance Regularization by Thresholding," The Annals of Statistics, 36(6), 2577–2604.

Cameron, A. C., and D. L. Miller (2015): "A Practitioner's Guide to Cluster-Robust Inference," Journal of Human Resources, 50(2), 317–372.

Cameron, A. C., and P. K. Trivedi (2005): Microeconometrics: Methods and Applications. Cambridge University Press.

Driscoll, J. C., and A. C. Kraay (1998): "Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data," Review of Economics and Statistics, 80(4), 549–560.

Friedberg, L. (1998): "Did Unilateral Divorce Raise Divorce Rates? Evidence from Panel Data," American Economic Review, 88(3), 608–627.

Greene, W. H. (2003): Econometric Analysis. Pearson Education, Upper Saddle River, NJ.

Hansen, C. B. (2007): "Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data When T Is Large," Journal of Econometrics, 141(2), 597–620.

Kim, D., and T. Oka (2014): "Divorce Law Reforms and Divorce Rates in the USA: An Interactive Fixed-Effects Approach," Journal of Applied Econometrics, 29(2), 231–245.

Liang, K.-Y., and S. L. Zeger (1986): "Longitudinal Data Analysis Using Generalized Linear Models," Biometrika, 73(1), 13–22.

Newey, W. K., and K. D. West (1987): "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703–708.

Newey, W. K., and K. D. West (1994): "Automatic Lag Selection in Covariance Matrix Estimation," The Review of Economic Studies, 61(4), 631–653.

Petersen, M. A. (2009): "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches," The Review of Financial Studies, 22(1), 435–480.

Stock, J. H., and M. W. Watson (2008): "Heteroskedasticity-Robust Standard Errors for Fixed Effects Panel Data Regression," Econometrica, 76(1), 155–174.

Vogelsang, T. J. (2012): "Heteroskedasticity, Autocorrelation, and Spatial Correlation Robust Inference in Linear Panel Models with Fixed-Effects," Journal of Econometrics, 166(2), 303–319.

White, H. (1980): "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48(4), 817–838.

Wolfers, J. (2006): "Did Unilateral Divorce Laws Raise Divorce Rates? A Reconciliation and New Results," American Economic Review, 96(5), 1802–1820.

Wooldridge, J. M. (2003): "Cluster-Sample Methods in Applied Econometrics," American Economic Review, 93(2), 133–138.

Wooldridge, J. M. (2010):