Square-root nuclear norm penalized estimator for panel data models with approximately low-rank unobserved heterogeneity
JAD BEYHUM AND ERIC GAUTIER
Abstract.
This paper considers a nuclear norm penalized estimator for panel data models with interactive effects. The low-rank interactive effects can be an approximate model and the rank of the best approximation can be unknown and grow with the sample size. The estimator is the solution of a well-structured convex optimization problem and can be computed in polynomial time. We derive rates of convergence, study the low-rank properties of the estimator, and analyze estimation of the rank and of annihilator matrices when the number of time periods grows with the sample size. We propose and analyze a two-stage estimator and prove its asymptotic normality. We can also use the baseline estimator as an initialization for any sequential algorithm. None of the procedures requires knowledge of the variance of the errors.
The authors acknowledge financial support from the grant ERC POEMH 337665. They are grateful to Thierry Magnac for helpful discussions.
Keywords: panel data, interactive fixed effects, factor models, flexible unobserved heterogeneity, nuclear norm penalization, unknown variance.
MSC 2010 subject classification: 62J99, 62H12, 62H25.

1. Introduction

Panel data allow one to estimate models with flexible unobserved heterogeneity using the fact that each individual is observed repeatedly. The high-dimensional statistics literature enables estimation in the presence of a high-dimensional parameter, provided that it has a low-dimensional structure. This paper studies a model that borrows from the two aforementioned strands of literature. We consider a linear panel data model with interactive effects

(1)   Y_{it} = ∑_{k=1}^K β_k X_{kit} + λ_i^⊤ f_t + Γ^d_{it} + E_{it},   E[E_{it}] = 0,

where i ∈ {1, ..., N} indexes the individuals and t ∈ {1, ..., T} the time periods, Y_{it} is the outcome, X_{kit} is the k-th regressor, β ∈ ℝ^K is a vector of parameters, λ_i and f_t are vectors in ℝ^r of factor loadings and factors, Γ^d_{it} is a remainder which accounts for the fact that the usual interactive effects specification (when Γ^d_{it} = 0) can be an approximation, and E_{it} is an error. Only β is nonrandom. Precise assumptions on the joint distribution of the right-hand side random elements are given later. Only the regressors and outcomes are available to the researcher. The regressors correspond to observed heterogeneity and the remaining right-hand side elements to unobserved heterogeneity. The interactive effects or statistical factor structure generalizes the usual individual plus time effects where λ_i^⊤ f_t = c_i + d_t. One can think that λ_i^⊤ f_t + Γ^d_{it} + E_{it} accounts for the contribution of regressors which are not available to the researcher but have an effect on the outcome, if we believe these have a statistical factor structure plus remainder plus error term. In such a case, the error E_{it} is a composite error which accounts for a linear combination of those coming from the missing regressors and the usual error from the long regression model which includes both observed and unobserved regressors. When the regressors and λ_i^⊤ f_t + Γ^d_{it} are correlated, the least-squares estimator is inconsistent. This is a situation where we say that the regressors are endogenous or that there is an omitted variable bias. The specification is very flexible to model unobserved heterogeneity and can be broadly applied (see, e.g., [13] in the context of public policy evaluation). In matrix form, (1) becomes

(2)   Y = ∑_{k=1}^K β_k X_k + Γ^l + Γ^d + E,

where Y, X_1, ..., X_K, Γ^l, Γ^d, and E are random N × T matrices, Γ^l_{it} = λ_i^⊤ f_t, rank(Γ^l) = r, and Γ^d has small nuclear norm. The nuclear norm is the sum of the singular values. We denote Γ = Γ^l + Γ^d. Many variations on model (1) have been considered and we name only a few. In [9, 26] the regressors have a factor structure and β can vary across individuals. In [14, 20] the number of regressors grows with the sample size. [9, 21] allow for lags of the outcome in (1). In the setup where Γ^d = 0 and r is fixed and known, [3] analyses the least-squares estimator

(3)   (β̂_B, Λ̂_B, F̂_B) ∈ argmin_{β ∈ ℝ^K, Λ^⊤Λ ∈ D_{rr}, F^⊤F = T I_r} |Y − ∑_{k=1}^K β_k X_k − Λ F^⊤|_2^2,

where |·|_2 is the ℓ_2-norm of the vectorized matrix, Λ (resp. F) is an N × r (resp.
T × r) matrix, D_{rr} is the set of diagonal r × r matrices, and I_r is the identity matrix of size r. It is shown to be √(NT)-consistent and asymptotically normal when, among other things, the factors are strong. This means that Λ^⊤Λ/N converges in probability to a nonsingular matrix, hence the ratio of any singular value of Γ^l and √(NT) has a positive and finite limit in probability as N goes to infinity and T increases with N. [22] shows that using the same estimator with an upper bound on the number of factors leads to the same asymptotic properties. However, (3) is a nonconvex optimization problem. For this reason, an iterative algorithm is used, starting from an initial estimator which could be the least-squares estimator

(4)   β̂_LS ∈ argmin_{β ∈ ℝ^K} |Y − ∑_{k=1}^K β_k X_k|_2^2

or based on grid values. Using β̂_LS can be problematic because it can be inconsistent and thus far away from a (the?) global minimum of (3). [16] analyses the asymptotic properties of the m-th iterates of one of [3]'s iterative algorithms, treating them as an estimator. It is found that they can be consistent under several additional assumptions, among which the consistency of β̂_LS if it is used as an initialization. It is argued in Remark 4.2 in [15] that iterative algorithms are consistent if the initialization is a consistent estimator. Corollary 1 in [23] gives a condition on the rate of convergence of a preliminary estimator so that an iterative algorithm in the spirit of those proposed by [3] is asymptotically equivalent to [3]'s theoretical estimator.

The tools in this paper are related to those used in matrix completion. There, the problem consists in estimating the unobserved entries of a low-rank matrix from an observed subset of its entries, sometimes with additive noise (see, e.g., [7, 8, 17, 18, 19, 28, 29, 30]). The usual ℓ_0 and ℓ_1-norms are replaced by the rank and the nuclear norm, and soft and hard thresholding are carried out on the singular values. These methods have recently been used in econometrics (see in particular [2, 4, 10]). The problem in this paper differs in that we observe all the entries of the matrices Y and X_1, ..., X_K but none of Γ + E, and both Γ and E are random.

The iterative procedures in [3] could yield a local minimum while the theoretical properties are for the global minimum. In contrast, the estimators in [23] and in this paper involve convex programs for which convergence to a global minimum is achieved in polynomial time. The additional novelties of this paper are as follows. This paper considers a square-root nuclear norm penalized estimator (see [5] for the Lasso), where the sum of squared residuals is replaced by its square-root. It can be viewed as the estimator in [23] with a data-driven penalty level, so it is directly implementable by the researcher and does not require an additional diverging multiplicative factor which can result in over-penalization. We provide a straightforward iterative algorithm to compute the estimator. Our results do not rely on conditioning on realizations of Γ and we state the conditions on the joint distribution of Γ and the regressors. We allow the interactive effects to be an approximate model, and hence many non-strong factors (see [27]), via the additional term Γ^d. The rank of Γ^l is treated as random and can grow with the sample size and be unknown.
We obtain low-rank oracle type inequalities for various loss functions and results on the rank of our estimator of Γ, and we introduce a thresholded estimator which can be used to estimate the rank of Γ^l as well as the projectors on the vector spaces spanned by the factors and factor loadings, which we analyze theoretically. We also obtain rates of convergence for the estimation of β. These results do not rely on a strong-factor assumption. Finally, we propose a two-stage estimator and show its asymptotic normality. Based on our procedure and the result on the estimation of the rank of Γ^l, we can proceed as analyzed in [23] and use an iterative algorithm as a second stage.

2. Preliminaries

ℕ denotes the positive integers and ℕ_0 denotes ℕ ∪ {0}. For a ∈ ℝ, we set a_+ = max(a, 0) and, for a > 0, a/0 = ∞. {µ_N} denotes a numerical sequence of generic term µ_N. M_{NT} is the set of matrices with real coefficients of size N × T. The transpose of A ∈ M_{NT} is A^⊤, its trace is tr(A), and its rank is rank(A). For A ∈ M_{NT} and v ∈ ℝ^{NT}, vec(A) is obtained by stacking the columns of A and mat(v) is the unique matrix in M_{NT} such that v = vec(mat(v)). Matrices are denoted using capital letters and their vectorizations using lowercase letters. The k-th singular value of A ∈ M_{NT} (arranged in decreasing order and repeated according to multiplicity) is σ_k(A). A = ∑_{k=1}^{rank(A)} σ_k(A) u_k(A) v_k(A)^⊤ is the singular value decomposition of A, where {u_k(A)}_{k=1}^{rank(A)} is a family of orthonormal vectors of ℝ^N and {v_k(A)}_{k=1}^{rank(A)} one of ℝ^T. The scalar product is ⟨A, B⟩ = tr(A^⊤B). The ℓ_2-norm (or Frobenius norm) is given by |A|_2^2 = ⟨A, A⟩ = ∑_{k=1}^{rank(A)} σ_k(A)^2, the nuclear norm is |A|_* = ∑_{k=1}^{rank(A)} σ_k(A), and the operator norm is |A|_{op} = σ_1(A). P_{u(A)} and P_{v(A)} are the orthogonal projectors onto span{u_1(A), ..., u_{rank(A)}(A)} and span{v_1(A), ..., v_{rank(A)}(A)}, and M_{u(A)} and M_{v(A)} are the projectors onto the orthogonal complements. For A, ∆ ∈ M_{NT}, we define P_A(∆) = ∆ − M_{u(A)} ∆ M_{v(A)} and P_A^⊥(∆) = M_{u(A)} ∆ M_{v(A)}. Recall that, if ∆̃ ∈ M_{NT} (see lemmas 2.3 and 3.4 in [29] for (8)-(9)),

(5)   P_A(∆) = M_{u(A)} ∆ P_{v(A)} + P_{u(A)} ∆,
(6)   rank(P_A(∆)) ≤ 2 rank(A),
(7)   ⟨P_A(∆), P_A(∆̃)⟩ = ⟨P_A(∆), ∆̃⟩,
(8)   ⟨P_A(∆), P_A^⊥(∆)⟩ = 0,
(9)   |A + P_A^⊥(∆)|_* = |A|_* + |P_A^⊥(∆)|_*.

The cone C_{A,c} = {∆ ∈ M_{NT} : |P_A^⊥(∆)|_* ≤ c |P_A(∆)|_*} is defined for A ∈ M_{NT} and c > 0. We write A^l, A^d ∈ M_{NT} for two components such that A = A^l + A^d. The role of and assumptions on A^l and A^d will be clear from the text. A^l stands for a "low-rank" component (the rank can diverge with the sample size) with a large operator norm while A^d is a small remainder term.

We denote by P_X (resp. M_X) the orthogonal projector onto the vector space spanned by {X_k}_{k=1}^K (resp. onto its orthogonal complement) and X = (x_1, ..., x_K). We consider an asymptotic where N goes to infinity and T is a function of N that goes to infinity when N goes to infinity. The probabilistic framework consists of a sequence of data generating processes (henceforth DGPs) that depend on N. We write that an event occurs w.p.a. 1 ("with probability approaching 1") when its probability converges to 1 as N goes to infinity. All limits are taken as N goes to infinity.
We denote convergence in probability and in distribution by →_P and →_d, respectively. We allow the researcher to apply annihilator matrices M_u (to the left) and M_v (to the right) on both sides of (2) and still denote by Y, X_1, ..., X_K, Γ^l, Γ^d, E the transformed matrices. She can apply a within-group (or first-difference or Helmert) transform on the left to annihilate individual effects and a similar transform on the right to annihilate time effects. This is important if the researcher thinks there are individual and time effects, possibly together with additional interactive effects, and wants to avoid relying on penalization to figure out that there are classical individual and time effects. The regressors can be transformations of the baseline regressors as in Section 4.6 to ensure their operator norm is not too large. We do not write these transformations explicitly to simplify the exposition.
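As an illustration, a minimal sketch of the left and right within-group transforms mentioned above (the same M_u = I_N − J_N/N and M_v = I_T − J_T/T used in Section 5); the function names are ours, not the paper's:

```python
import numpy as np

def within_transforms(N, T):
    """Build the annihilators M_u = I_N - J_N/N (left) and M_v = I_T - J_T/T
    (right) that remove individual and time effects."""
    M_u = np.eye(N) - np.ones((N, N)) / N
    M_v = np.eye(T) - np.ones((T, T)) / T
    return M_u, M_v

def apply_transforms(Y, X_list, M_u, M_v):
    """Apply M_u and M_v to the outcome and every regressor. The transformed
    model keeps the form (2) with transformed Gamma^l, Gamma^d, and E."""
    Yt = M_u @ Y @ M_v
    Xt = [M_u @ X @ M_v for X in X_list]
    return Yt, Xt
```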
3. First-stage estimator

The estimator is defined, for λ > 0, as
(10)   (β̂, Γ̂) ∈ argmin_{β ∈ ℝ^K, Γ ∈ M_{NT}} (1/√(NT)) |Y − ∑_{k=1}^K β_k X_k − Γ|_2 + (λ/(NT)) |Γ|_*.

The nuclear norm plays the role of the ℓ_1-norm in the Lasso estimator. It yields low-rank solutions, that is, a sparse vector of singular values of Γ̂ (see Proposition 7 for a formal result). The nuclear norm penalization is the convex relaxation of a penalization involving rank(Γ). This estimator can be viewed as a type of square-root Lasso estimator of [5] for parameters which are matrices. As for the square-root Lasso, the ℓ_2-norm is not squared in (10), which implies that we do not need to know the variance of the E_{it} when these are i.i.d. to choose λ. For large classes of DGPs, this choice will be canonical.

Proposition 1.
A solution (β̂, Γ̂) of (10) is such that

Γ̂ ∈ argmin_{Γ ∈ M_{NT}} (1/√(NT)) |M_X(Y − Γ)|_2 + (λ/(NT)) |Γ|_*.

Let us provide another interpretation of this estimator. In the least-squares problem min |Y − ∑_{k=1}^K β_k X_k − Γ|_2^2, even if Γ were given, estimation of β is an inverse problem. For this reason, properties of the design matrix matter for estimation. While, if β were given, estimation of Γ would not be an inverse problem, it is one when β is unknown. Indeed, applying M_X to (2), we obtain M_X(Y) = M_X(Γ) + M_X(E). Because Γ appears via M_X(Γ), estimation of Γ is an inverse problem with correlated errors which can be correlated with M_X(Γ) via M_X. The trace of the covariance operator of the error term is E[|M_X(E)|_2^2] and diverges with N. We will see that |M_X(E)|_2/√(NT) can converge to the standard error of the entries of E when these are i.i.d. mean zero with finite variance. The nonstandard framework here is that X, hence the operator, and the "parameter" Γ are random. Also, M_X is not invertible. But estimation of Γ becomes feasible under shape restrictions. This paper considers a generalization of the restriction that Γ has low rank by allowing for approximately low-rank matrices.

For u ≥ 0, 2u = min_{σ>0} (σ + u²/σ) and the minimum is attained at σ = u if u > 0. Thus, any solution (β̂, Γ̂) of (10) is a solution of

(11)   (β̂, Γ̂, σ̂) ∈ argmin_{β ∈ ℝ^K, Γ ∈ M_{NT}, σ > 0} σ + (1/(σNT)) |Y − ∑_{k=1}^K β_k X_k − Γ|_2^2 + (2λ/(NT)) |Γ|_*

and

(12)   σ̂ = (1/√(NT)) |Y − ∑_{k=1}^K β̂_k X_k − Γ̂|_2.

The formulation in (11) has the advantage that the objective function contains only one nonsmooth convex function of (β, Γ): the nuclear norm. Because f(x, y) = x²/y is convex on the domain {(x, y) ∈ ℝ² : y > 0}, the objective function is convex in (β, Γ, σ). This formulation is analogous to the concomitant Lasso or scaled Lasso for linear regression (see [25, 31]).
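For completeness, the variational identity used to pass from (10) to (11) follows from completing a square: for u ≥ 0 and σ > 0,

σ + u²/σ − 2u = (√σ − u/√σ)² ≥ 0,

with equality if and only if σ = u (when u > 0), so min_{σ>0} (σ + u²/σ) = 2u.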
3.1. First-order conditions and consequences.

This formulation is used in Section 3.2 for the implementation of our estimator. It is also useful to obtain, by subdifferential calculus, the first-order conditions of program (10).
Indeed, the differential with respect to β_k at (β, Γ, σ) on the domain (hence σ > 0) is, for k = 1, ..., K,

(13)   −(2/(σNT)) ⟨X_k, Y − ∑_{j=1}^K β_j X_j − Γ⟩

and the subdifferential with respect to Γ at (β, Γ, σ) (see (2.1) in [19]) is

(14)   −(2/(σNT)) (Y − ∑_{k=1}^K β_k X_k − Γ) + (2λ/(NT)) Z,   Z = ∑_{k=1}^{rank(Γ)} u_k(Γ) v_k(Γ)^⊤ + M_{u(Γ)} W M_{v(Γ)},   |W|_{op} ≤ 1;

in particular |Z|_{op} ≤ 1 and ⟨Γ, Z⟩ = |Γ|_*. Due to (12), if σ̂ = 0 then clearly β̂ is the least-squares estimator which minimizes |Y − ∑_{k=1}^K β_k X_k − Γ̂|_2^2. Else, setting (13) to 0 at (β̂, Γ̂, σ̂) yields the same conclusion. Hence, if X^⊤X is positive definite, we have

(15)   β̂ = (X^⊤X)^{−1} X^⊤ (y − γ̂).
Because, if σ̂ > 0, 0 belongs to the set defined in (14) at (β̂, Γ̂, σ̂), there exist Ŵ ∈ M_{NT} and Ẑ = ∑_{k=1}^{rank(Γ̂)} u_k(Γ̂) v_k(Γ̂)^⊤ + M_{u(Γ̂)} Ŵ M_{v(Γ̂)} such that |Ŵ|_{op} ≤ 1 and Y − ∑_{k=1}^K β̂_k X_k − Γ̂ = λσ̂Ẑ, hence, for all k = 1, ..., K, ⟨X_k, Ẑ⟩ = 0, thus M_X(Ẑ) = Ẑ and

(16)   Y − ∑_{k=1}^K β̂_k X_k − Γ̂ = M_X(Y − Γ̂) = λσ̂Ẑ.

Again, due to (12), if σ̂ = 0 then (16) also holds. As a consequence, we have σ̂ = (1/√(NT)) |M_X(Y − Γ̂)|_2 and any solution (β̂, Γ̂) of (10) is also a solution of

(17)   (β̂, Γ̂) ∈ argmin_{β ∈ ℝ^K, Γ ∈ M_{NT}} (1/(NT)) |Y − ∑_{k=1}^K β_k X_k − Γ|_2^2 + (2λσ̂/(NT)) |Γ|_*.

So (β̂, Γ̂) given by (10) is a solution of a type of matrix Lasso estimator with data-driven penalty 2λσ̂|Γ|_*/(NT). The estimator in [23] corresponds to (17) without the data-driven σ̂.

Remark 1.
Due to the nuclear norm, (16) and the expression of Ẑ yield

(Y − ∑_{k=1}^K β̂_k X_k) M_{v(Γ̂)} = λσ̂ M_{u(Γ̂)} Ŵ M_{v(Γ̂)},

which, unlike in [3], is not zero. Applying the annihilator M_{u(Γ̂)} on the left does not change this.

3.2. Computational aspects.
Based on (11), where the objective function is convex, we can minimize iteratively over β, Γ, and σ: start from (β^(0), Γ^(0), σ^(0)) and repeat until convergence:

(1) β^(t+1) is obtained by least squares, minimizing |Y − ∑_{k=1}^K β_k X_k − Γ^(t)|_2^2;
(2) setting Z^(t+1) = Y − ∑_{k=1}^K β_k^(t+1) X_k, Γ^(t+1) is obtained by solving the matrix Lasso min_Γ |Z^(t+1) − Γ|_2^2 + 2λσ^(t) |Γ|_*, i.e., applying soft-thresholding to the singular value decomposition (henceforth SVD):

Γ^(t+1) = ∑_{k=1}^{min(N,T)} (σ_k(Z^(t+1)) − λσ^(t))_+ u_k(Z^(t+1)) v_k(Z^(t+1))^⊤;

(3) σ^(t+1) = |Z^(t+1) − Γ^(t+1)|_2 / √(NT). A sketch of this loop is given below.
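A minimal, self-contained sketch of this iteration (our own implementation, not the authors' code; variable names are ours):

```python
import numpy as np

def soft_threshold_svd(Z, level):
    """Soft-threshold the singular values of Z at the given level."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - level, 0.0)
    return (U * s) @ Vt

def sqrt_nuclear_norm_estimator(Y, X_list, lam, n_iter=100):
    """Iterative scheme of Section 3.2 for the estimator (10): alternate
    (1) least squares in beta, (2) singular value soft-thresholding in
    Gamma at level lam * sigma, (3) update of sigma."""
    N, T = Y.shape
    K = len(X_list)
    X_mat = np.column_stack([X.ravel() for X in X_list])  # NT x K design
    Gamma = np.zeros((N, T))
    sigma = np.linalg.norm(Y) / np.sqrt(N * T)  # crude initialization
    beta = np.zeros(K)
    for _ in range(n_iter):
        # (1) least squares given Gamma
        beta, *_ = np.linalg.lstsq(X_mat, (Y - Gamma).ravel(), rcond=None)
        Z = Y - sum(b * X for b, X in zip(beta, X_list))
        # (2) matrix Lasso step: soft-threshold the SVD of Z
        Gamma = soft_threshold_svd(Z, lam * sigma)
        # (3) update sigma
        sigma = np.linalg.norm(Z - Gamma) / np.sqrt(N * T)
    return beta, Gamma, sigma
```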
Remark 2.

The estimator in [23] can be obtained by repeating (1) and (2) for a fixed value of σ^(t). λ_N σ^(t) corresponds to √(NT) Ψ_{NT} in their notation, and they assume 1/(Ψ_{NT} √(min(N, T))) + Ψ_{NT} → 0 to circumvent the unavailability of an upper bound on the variance of the errors.

Remark 3.
Various numerical approximations of the theoretical estimator (3) are discussed in [3]. The method on page 1237 in [3] iterates step (1) and a modified step (2) with λ = 0 and under the restriction that rank(Γ) = r, from which the factors and factor loadings are extracted. The second step corresponds to hard-thresholding the SVD of Z^(t+1) to keep only the part corresponding to the r largest singular values. It is argued that the approach which iterates (a) augmented least squares given factors,

(β^(t+1), Λ^(t+1)) ∈ argmin_{β ∈ ℝ^K, Λ ∈ M_{Nr}} |Y − ∑_{k=1}^K β_k X_k − Λ (F^(t))^⊤|_2^2,

or, by partialling out and the fact that M_{v(Γ^(t))} is also the projector onto the orthogonal complement of the columns of F^(t),

β^(t+1) ∈ argmin_{β ∈ ℝ^K} |(Y − ∑_{k=1}^K β_k X_k) M_{v(Γ^(t))}|_2^2,

and (b) PCA to obtain F^(t+1), is less numerically robust.

4. Results
4.1. Error bounds for the estimation of Γ and |M_X(E)|_2/√(NT).

The results of this section are upper bounds on the errors made when estimating Γ and |M_X(E)|_2/√(NT) by (Γ̂, σ̂). They hold without assumption. A key quantity is the compatibility constant (see [6] for high-dimensional linear regression). It is defined, for all realizations of X and A ∈ M_{NT}, by

κ_{A,c} = inf_{∆ ∈ C_{A,c} : ∆ ≠ 0} √(rank(A)) |M_X(∆)|_2 / |P_A(∆)|_*.

Remark 4.
A few remarks are in order. First, if X = 0, we have M_X(∆) = ∆. Second, the denominator of the ratio cannot be 0 because, for ∆ ∈ C_{A,c}, |∆|_* ≤ (1 + c) |P_A(∆)|_*, hence the function of ∆ in the infimum is continuous. Third, because the ratio involves two linear operators, the infimum is the same if we restrict ∆ to have norm 1, and the intersection of the unit sphere with the cone is compact. Hence, the infimum is a minimum. Fourth, for all A ∈ M_{NT} and c > 0, the minimum is the limit of minima over finite sets, so it is a measurable function of X. Fifth, we work with κ_{Γ̃,c} for a random Γ̃ which depends on the random Γ and X via κ_{Γ̃,c} itself, and we allow Γ and X to be dependent. We make a slight abuse of notation and consider that κ_{Γ̃,c} is a measurable function of Γ̃ and X. In practice, it is a measurable lower bound on it for every fixed Γ̃ ∈ M_{NT} and X in the support of the corresponding random matrix.

Remark 5.
When X = 0, one has, for all A ∈ M_{NT} and c > 0, κ_{A,c} ≥ 1/√2.

Proposition 2.
The following lower bounds hold:

(18)   κ_{A,c} ≥ (1/√2) min_{∆ ∈ C_{A,c} : ∆ ≠ 0} |M_X(∆)|_2 / |P_A(∆)|_2 ≥ (1/√2) min_{∆ ∈ C_{A,c} : ∆ ≠ 0} |M_X(∆)|_2 / |∆|_2.

The quantity in the middle is the restricted eigenvalue (see [19]). The smaller one is used in [23]. These constants are essential elements in the upper bounds of Theorem 1 and Proposition 6 and are further discussed below. Throughout the rest of the paper, ρ ∈ (0, 1) and ρ̃ > 0 are fixed, and

c(ρ, ρ̃) = (1 + ρ + ρ̃)/(1 − ρ),   d(ρ, ρ̃) = max(1 + ρ̃, ρ(1 + c(ρ, ρ̃))),   e(ρ, ρ̃) = d(ρ, ρ̃) + ρ(1 + c(ρ, ρ̃)),

θ_∞(Γ̃, ρ, ρ̃) = 2 (1 − d(ρ, ρ̃) √(rank(Γ̃)) λ / (√(NT) κ_{Γ̃,c(ρ,ρ̃)}²))_+^{−1} e(ρ, ρ̃),

θ(ρ, ρ̃) = inf_{Γ̃ ∈ M_{NT}} max( θ_∞(Γ̃, ρ, ρ̃) λ rank(Γ̃) |M_X(E)|_2 / (√(NT) κ_{Γ̃,c(ρ,ρ̃)}²), ρ̃ |Γ − Γ̃|_* ),

θ_*(ρ) = inf_{ρ̃>0} (1 + c(ρ, ρ̃)) θ(ρ, ρ̃),   θ_σ(ρ) = inf_{ρ̃>0} d(ρ, ρ̃) θ(ρ, ρ̃).

Theorem 1.
On the event E = {ρλ |M_X(E)|_2/√(NT) ≥ |M_X(E)|_{op}}, we have

(19)   |Γ̂ − Γ|_* ≤ θ_*(ρ),

(20)   |σ̂ − |M_X(E)|_2/√(NT)| ≤ (λ/(NT)) θ_σ(ρ).
Note that θ_*(ρ) ≤ θ_σ(ρ)/ρ. For example, we can take ρ̃ = 1 and ρ = 2/5, in which case c(ρ, ρ̃) = 4, d(ρ, ρ̃) = 2, e(ρ, ρ̃) = 4, θ_*(ρ) = 5 θ(ρ, ρ̃), and θ_σ(ρ) = 2 θ(ρ, ρ̃).
We state a more general result to allow the theory to handle the case where ρ is close to 1, which allows a smaller λ (what matters is the product ρλ in the definition of E and ρψ_N in Assumption 2), which we find works well in small samples. If we write Γ = Γ^l + Γ^d and take Γ̃ = Γ^l in the maximum in the expression of θ(ρ, ρ̃), we obtain

(21)   θ(ρ, ρ̃) ≤ max( θ_∞(Γ^l, ρ, ρ̃) λ rank(Γ^l) |M_X(E)|_2 / (√(NT) κ_{Γ^l,c(ρ,ρ̃)}²), ρ̃ |Γ^d|_* ).

Under the premises of Proposition 3 (see (22)), |M_X(E)|_2/√(NT) converges in probability to σ, and we can obtain an upper bound like (21) where it is replaced by the constant σ. We obtain a tight bound if Γ^l and Γ^d in the decomposition Γ = Γ^l + Γ^d are such that θ_∞(Γ^l, ρ, ρ̃) rank(Γ^l)/κ_{Γ^l,c(ρ,ρ̃)}² is small and |Γ^d|_* is small; in particular, rank(Γ^l) should be small. But Γ^d could have high rank. The decomposition (Γ^l, Γ^d) which realizes the tradeoff depends on Γ (via their sum) and on M_X, which appears in the definition of κ_{Γ^l,c(ρ,ρ̃)}. Theorem 1 and Proposition 6 show that Γ̂ performs, up to a multiplicative constant, as well as an oracle who would know Γ and choose the best misspecified model to keep the number of incidental parameters moderate while incurring a bias which is not too large. The term involving (·)_+^{−1} in the definition of θ_∞(Γ̃, ρ, ρ̃) could be ∞ if κ_{Γ̃,c(ρ,ρ̃)} is too small. It appears because we do not know the variance of the errors or use a sequence of penalties that diverges faster than necessary. Finally, a smaller constant c(ρ, ρ̃) implies a smaller cone and a larger κ_{Γ̃,c(ρ,ρ̃)}.

Let us comment on the first term in the maximum on the right-hand side of (21). rank(Γ^l) plays the same role as the number of nonzero coefficients in the estimation of sparse vectors in linear regression models. The other key ingredient is κ_{Γ^l,c(ρ,ρ̃)}. Because it appears in the denominator of an upper bound, it is desirable to have it as large as possible. The compatibility constant is the sharpest of the three quantities in (18). To gain insight, we make an analogy with linear regression. Denoting by X the design matrix and N the sample size, the rate of estimation of the vector of coefficients depends on the largest eigenvalue of (X^⊤X/N)^{−1} in a numerator, or the smallest eigenvalue of X^⊤X/N in a denominator (the square of the smallest singular value of X/√N). Sharper constants can be used when the vector is sparse. Because of the ℓ_1-norm in the Lasso, the difference between the estimator and the sparse vector belongs to a cone with probability close to 1. As a result, the smallest singular value which, by the Courant-Fischer theorem, is the solution to a minimization problem, can be replaced by a minimization over the cone rather than over the whole space. This is important in high dimensions because the minimum singular value is zero when the dimension is larger than the sample size. In this paper, the smaller quantity in (18) restricts the whole space to C_{A,c} in the minimization problem defining the smallest singular value of the operator M_X. Recall that, without restriction, the minimum is 0 because M_X is not invertible. The relevant cone is C_{Γ^l,c}.
It contains Γ^l but also, if Γ^d = 0, we show that it contains, with probability close to 1, ∆ defined as Γ̂ − Γ, for every minimizer Γ̂. The definition of the compatibility constant yields

|∆|_* ≤ (1 + c) √(rank(Γ^l)) |M_X(∆)|_2 / κ_{Γ^l,c}.

This allows us to relate the error in terms of a loss involving the nuclear norm to the loss derived from the least-squares criterion in the optimization program of Proposition 1. The restricted eigenvalue replaces ∆ in the denominator by a type of projection P_A(∆) of ∆ onto a subspace spanned by few columns and few rows. The additional gain from using the compatibility constant is obtained because we use √(rank(A))/|P_A(∆)|_* instead of 1/|P_A(∆)|_2 and hence avoid a type of Cauchy-Schwarz inequality |P_A(∆)|_* ≤ √(2 rank(A)) |P_A(∆)|_2.

4.2. Restriction on the joint distribution of X and E.

Assumption 1. The following hold:
(i) there exists σ > 0 such that |E|_2^2/(NT) →_P σ²;
(ii) there exists Σ ∈ M_{KK} positive definite such that X^⊤X/(NT) →_P Σ;
(iii) X^⊤e = O_P(√(NT));
(iv) there exists {µ_N} such that ∑_{k=1}^K |X_k|_{op} = O_P(µ_N).

Condition (iii) is satisfied if, for all k, ⟨X_k, E⟩ = ∑_{t=1}^T ∑_{i=1}^N X_{kit} E_{it} = O_P(√(NT)). This can allow for so-called predetermined regressors. It can be satisfied if, for some family of filtrations (F_{Nt})_{t=1,...,T}, for all t = 1, ..., T and i = 1, ..., N, X_{kit} is F_{N,t−1}-measurable and E_{it} is F_{Nt}-measurable and, for example, under cross-sectional independence. The role of (iv) is to introduce the notation {µ_N}; this is not a restriction. Due to (ii), µ_N = O(√(NT)). {µ_N} sometimes appears in the upper bounds in the results below, and slowly diverging sequences provide sharper results than when µ_N is of an order as large as √(NT). {µ_N} can diverge as slowly as √(max(N, T)) if the regressors satisfy the same assumptions as E in Proposition 4 or, more generally, those that can be found in [24, 32] (see also Appendix A.1 in [22]). This will usually not hold if the regressors have a nonzero mean and, more generally, under the setup of Section 4.6. In these cases, there exists C > 0 such that, for all N ∈ ℕ, µ_N ≥ C√(NT). Section 4.6 presents how to work with transformed regressors to obtain sharper results and complements the solution presented in the paragraph after Proposition 4.
Proposition 3.
Under Assumption 1 with µ_N = √(NT), we have

(22)   | |M_X(E)|_2/√(NT) − σ | = O_P(1/√(NT)),

(23)   | |M_X(E)|_{op} − |E|_{op} | = O_P(µ_N/√(NT)).

The next assumption is a sufficient condition for the event E in Theorem 1 to have a probability which converges to 1.
Assumption 2.
Maintain Assumption 1 and, if an upper bound µ_N for Assumption 1 (iv) is available (else take µ_N = √(NT)), take {λ_N} of the form

(24)   λ_N = (1 − φ_{1N}/√(NT))^{−1} (ψ_N + φ_{2N} µ_N/√(NT)),

where {φ_{1N}} and {φ_{2N}} are arbitrary sequences going to infinity, as slowly as we want but no faster than √(NT) for {φ_{1N}}, and (i) ψ_N = O(√(NT)), (ii) lim_{N→∞} P(ρψ_N σ ≥ |E|_{op}) = 1.

We can take φ_1 = φ_2, in which case we write φ_1 = φ_2 = φ. Under the premises of Section 4.6, we can take µ_N = λ_N and

(25)   λ_N = (1 − φ_N/√(NT))^{−1} ψ_N.

We have

E = { ρψ_N σ + ρ φ_{2N} µ_N σ/√(NT) + ρ φ_{1N} λ_N σ/√(NT) ≥ |E|_{op} + (|M_X(E)|_{op} − |E|_{op}) + ρλ_N (σ − |M_X(E)|_2/√(NT)) },

hence

P(E) ≥ P( {ρψ_N σ ≥ |E|_{op}} ∩ {ρ φ_{2N} µ_N σ/√(NT) ≥ |M_X(E)|_{op} − |E|_{op}} ∩ {φ_{1N} σ/√(NT) ≥ σ − |M_X(E)|_2/√(NT)} )

and the three events have probability going to 1 by (ii) and Proposition 3, so lim_{N→∞} P(E) = 1. We can handle large classes of joint distributions of X and E, including ones where the errors have heavy tails. It is usual, but not necessary, to work with classes of distributions such that |E|_{op} = O_P(√(max(N, T))). For such distributions, it is enough to take ψ_N = C√(max(N, T)) for large enough C for Assumption 2 to hold. An easy way to circumvent the problem that C is unknown is to take ψ_N = φ_N √(max(N, T)), but this results in over-penalization. At the cost of additional assumptions on the distribution, one can obtain the following more precise proposal based on Corollary 5.35 and Theorem 5.31 in [32].
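As an illustration, a sketch of the penalty choice (25) with the ψ_N proposed in Proposition 4 below; the slowly diverging sequences are user choices (the logarithmic choices here are ours, not prescribed by the paper):

```python
import numpy as np

def penalty_level(N, T, rho=0.4, phi=None, varphi=None):
    """Data-driven penalty (25) with psi_N = (sqrt(N)+sqrt(T))/rho + varphi_N
    from Proposition 4 (Gaussian-type errors). phi_N and varphi_N are
    arbitrary slowly diverging sequences."""
    if phi is None:
        phi = np.log(N * T)
    if varphi is None:
        varphi = np.log(N + T)
    psi = (np.sqrt(N) + np.sqrt(T)) / rho + varphi
    return psi / (1.0 - phi / np.sqrt(N * T))
```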
Proposition 4. If E = M_u η M_v, where M_u and M_v are, possibly random, matrices such that |M_u|_{op} ≤ 1 and |M_v|_{op} ≤ 1, and either of the following holds:
(i) {η_{it}}_{i,t} are i.i.d. centered Gaussian random variables;
(ii) {η_{it}}_{i,t} are i.i.d. centered random variables with finite fourth moments and T/N converges to a constant in [0, ∞);
then the sequence defined by ψ_N = (√N + √T)/ρ + ϕ_N, where ϕ_N → ∞ arbitrarily slowly in case (i) and {ϕ_N/√T} is bounded away from 0 in case (ii), satisfies Assumption 2 (ii).

The matrices M_u and M_v can be known or estimated (see, e.g., Section 4.6) and have been applied to the data. Applying such matrices cannot increase rank(M_u Γ^l M_v), |M_u Γ^d M_v|_{op}, or |M_u η M_v|_{op}, but can reduce the operator norm of the regressors and give rise to a smaller sequence {µ_N}. These matrices can be unknown and the baseline error E can have temporal and cross-sectional dependence. Because the operator norm of a matrix is equal to the operator norm of its transpose, the roles of N and T can be exchanged in (ii). The proposed choice of the penalty level is almost completely explicit and does not depend on the variance of the errors. The remaining sequences are arbitrary. In contrast to (24), where (1 − φ_{1N}/√(NT))^{−1} converges to 1, [23] employs a factor converging to infinity. Hence, the data-driven method of this paper provides less shrinkage, less bias, and a better bias/variance tradeoff.

4.3. Restriction on the joint distribution of X and Γ.

Assumption 3. ρ and ρ̃ are given and the random matrix Γ can be decomposed as Γ = Γ^l + Γ^d, where, for a sequence {r_N},
(i) rank(Γ^l) = O_P(r_N);
(ii) |Γ^d|_* = O_P(λ_N r_N);
(iii) there exists κ > 0 independent of N such that κ_{Γ^l,c(ρ,ρ̃)} ≥ κ w.p.a. 1.

We maintain Assumption 3 to translate the result of Theorem 1 into rates of convergence. (i) allows for ranks which are random and can vary with the sample size, which is more general and realistic than the usual assumption that the rank is fixed. Condition (iii) is a condition on the second random element of the first term on the right-hand side of (21). Condition (iii) is introduced because κ_{Γ^l,c(ρ,ρ̃)} is random. Such an assumption would not be required if X and Γ were fixed. Unlike other papers on the topic, this paper allows for Γ^d ≠ 0. By the oracle type inequalities of Theorem 1, the estimator performs as well as the best infeasible trade-off. Assumptions (i) and (ii) are not restrictive because r_N can be anything. The idea, though, to obtain tight results is to have r_N small and realize a trade-off. (i) is the reason why the component Γ^l is called low-rank. The component Γ^d can be viewed as a remainder which can have an arbitrary rank. Before the statement of Assumption 3, Γ^l and Γ^d are not precisely defined. Their sum is Γ, so parts of Γ^l can be transferred to Γ^d and vice versa. Assumption 3 makes it more precise which component is which.

Proposition 5.
Assumption 3 (iii) for a cone with constant c holds with the lower bound κ if, w.p.a. 1,

κ + 2 rank(Γ^l) Q(b, b^⊥) ≤ 1/√2,

where b, b^⊥ ∈ ℝ^K are defined, for k = 1, ..., K, as

b_k = a min(|P_{Γ^l}(X_k)|_{op}, |X_k|_{op}),   b^⊥_k = a |P^⊥_{Γ^l}(X_k)|_{op},   a = |X^⊤X/(NT)|_{op}^{−1} |X|_2/(NT),

Q(b, b^⊥) = |b|_2 1{√(p_N) |b^⊥|_2 ≥ 1}
  + |b + b^⊥/c|_2 (1 − c/√(p_N)) 1{−√(p_N) ⟨b^⊥, b⟩/c ≤ √(p_N) |b^⊥|_2 < 1}
  + ( |b + b^⊥ √(p_N) ⟨b^⊥, b⟩/(1 − √(p_N) |b^⊥|_2)|_2 − √(p_N) ⟨b^⊥, b⟩/(1 − √(p_N) |b^⊥|_2) ) 1{√(p_N) |b^⊥|_2 < −√(p_N) ⟨b^⊥, b⟩/c},

and p_N = min(N − rank(Γ^l), T − rank(Γ^l)).

Note that Q(b, b^⊥) < |b + b^⊥/c|_2 and, if K = 1, a = 1/|X|_2 and

|b + b^⊥/c|_2 = (1/|X|_2) ( min(|P_{Γ^l}(X)|_{op}, |X|_{op}) + |P^⊥_{Γ^l}(X)|_{op}/c ).

The quantity |P^⊥_{Γ^l}(X_k)|_{op} = |M_{u(Γ^l)} X_k M_{v(Γ^l)}|_{op} in the definition of b^⊥_k can be not too large because the projectors can reduce the operator norm if X_k has a component with a factor structure and shares some factors in common with Γ^l which are annihilated by M_{v(Γ^l)} (see Remark 7 for further discussion of this aspect). Due to Assumption 1 (ii), a = O_P(1/√(NT)). In the worst case, by the crude bound |X_k|_{op} ≤ |X_k|_2, b and b^⊥, hence Q(b, b^⊥), are bounded. If µ_N = o(√(NT)), the condition in Proposition 5 holds for arbitrary constants κ < 1/√2 for N large enough, but this is not necessary. Section 4.6 presents solutions to work with regressors with smaller operator norm. Lemma A.7 in [23] provides an alternative sufficient condition for Assumption 3 (iii). Lemma A.8 there is another sufficient condition when K = 1. In our framework r_N can grow, c can be different from 3, and we do not work conditional on Γ^l, so condition (iii) there has to be modified with a denominator of √(NT) r_N and the probabilities are taken with respect to the distribution of (Γ, X). It is claimed in Remark (a) in [23] that the condition in Lemma A.8 holds when X = Π^l + U, Π^l has a fixed rank, and U has i.i.d. mean zero normal entries.

4.4. Rates of convergence.
Theorem 1 and the assumptions on the DGP yield the following.
Theorem 2.
Under Assumptions 2 and 3,

(26)   |Γ̂ − Γ|_* = O_P(λ_N r_N),

(27)   σ̂ − σ = O_P(λ_N² r_N/(NT)),

(28)   β̂ − β = O_P(λ_N r_N µ_N/(NT)).

In (28), we have implicitly assumed that √(NT) = O(λ_N r_N µ_N), but this always occurs when X ≠ 0, and the problem is to have λ_N r_N µ_N as close as possible in rate to √(NT). Under usual assumptions, where we can take λ_N proportional to √(max(N, T)), r_N fixed, and make no restriction on {µ_N} so that µ_N = O(√(NT)), we obtain the rate of convergence 1/√(min(N, T)), which is the one in [23]. Theorem 2 shows that β̂ remains consistent if r_N = o(√(min(N, T))). Obviously, r_N can be larger if µ_N is smaller. The most favorable situation, when µ_N = O(√(max(N, T))) and λ_N is proportional to √(max(N, T)), yields β̂ − β = O_P(max(N, T) r_N/(NT)), hence, when
N/T has a positive limit, this becomes O_P(r_N/√(NT)). Achieving µ_N = o(√(NT)), and in some cases µ_N = O(√(max(N, T))), using transformed regressors is sometimes possible under the premises of Section 4.6, and this paper allows one to obtain such an estimator and transformed regressors in a data-driven way. Section 4.7 proposes an alternative approach where we can obtain the 1/√(NT) rate and asymptotic normality.
4.5. Additional results using the relation to the matrix Lasso.
Because any solution (β̂, Γ̂) of (10) is a solution of (17), we can prove the following additional results. They would also apply to (17) even if, rather than σ̂, we used an upper bound on the standard error of the errors. The results that we state involve σ̂ but, under the assumptions of Theorem 2, σ̂ is a consistent estimator of σ. In order to guarantee P(ρλ_N min(σ̂, σ) ≥ |M_X(E)|_{op}) → 1, we use the following assumption.

Assumption 4.
Assumption 2 holds and {φ_{1N}} satisfies the additional restriction that, for N large enough,

(1 − φ_{1N}/√(NT)) φ_{1N} ≥ (φ_N r_N/√(NT)) (ψ_N + φ_{2N} µ_N/√(NT)).

Indeed, we can replace {φ_{1N} σ/√(NT) ≥ σ − |M_X(E)|_2/√(NT)} by {φ_N σ λ_N r_N/(NT) ≥ σ − σ̂} in the previous analysis, and the probability of the latter event converges to 1 due to (27) because, due to Assumption 4, φ_{1N} ≥ φ_N λ_N r_N/√(NT), hence

(1 − φ_N λ_N r_N/(NT)) λ_N ≥ (1 − φ_{1N}/√(NT)) λ_N = ψ_N + φ_{2N} µ_N/√(NT).
A conservative choice is φ_N = c√(NT) for a small c ∈ (0, 1). In the rest of this section, c = c(ρ) = (1 + ρ)/(1 − ρ). First, with a proof similar to the computations in [19], we obtain a result which is an oracle inequality with constant 1 if X and Γ are not random.

Proposition 6. If ρλ min(σ̂, σ) ≥ |M_X(E)|_{op}, we have

(1/(NT)) |M_X(Γ − Γ̂)|_2^2 ≤ inf_{Γ̃} { (1/(NT)) |M_X(Γ − Γ̃)|_2^2 + 2 (λ(1 + ρ) min(σ̂, σ))²/(NT) · rank(Γ̃)/κ_{Γ̃,c(ρ)}² }.

This inequality yields a slightly different notion of approximately sparse solution because the bias term involves |M_X(Γ − Γ̃)|_2^2/(NT) rather than |Γ − Γ̃|_*. The next result provides a bound on rank(Γ̂) as a function of the previous bound.

Proposition 7. If ρλσ̂ ≥ |M_X(E)|_{op}, then we have

(λ(1 − ρ)σ̂ − |Γ^d|_{op})² rank(Γ̂) ≤ |P_{u(Γ̂)} M_X(Γ^l − Γ̂) P_{v(Γ̂)}|_2^2 ≤ |M_X(Γ^l − Γ̂)|_2^2.

As a result, under the above conditions and Assumption 3 (ii),

rank(Γ̂) ≤ ((1 + ρ)/((1 − ρ) κ_{Γ^l,c(ρ)}))² rank(Γ^l).

We can combine Propositions 6 and 7 with Proposition 11 in the appendix to obtain results for other loss functions, in particular the Frobenius norm.
Our estimator has desirable low-rank properties but it can fail to recover rank(Γ), rank(Γ^l), or the annihilator matrices. Thus, we introduce the hard-thresholded estimator

Γ̂^t = ∑_{k=1}^{rank(Γ̂)} σ_k(Γ̂) 1{σ_k(Γ̂) ≥ t} u_k(Γ̂) v_k(Γ̂)^⊤.
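A minimal sketch of this thresholding rule (our code; the data-driven level t = (ρ + 1)λ_N h σ̂ of Proposition 9 below is one choice):

```python
import numpy as np

def hard_threshold_svd(Gamma_hat, t):
    """Hard-threshold the SVD of Gamma_hat: keep only singular values >= t.
    Returns the thresholded matrix and the implied rank estimate."""
    U, s, Vt = np.linalg.svd(Gamma_hat, full_matrices=False)
    keep = s >= t
    Gamma_t = (U[:, keep] * s[keep]) @ Vt[keep, :]
    return Gamma_t, int(keep.sum())

# Data-driven threshold of Proposition 9 (h > 0 is a user-chosen constant):
# t = (rho + 1) * lam * h * sigma_hat
```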
Proposition 8.

Under the assumptions of Theorem 2 and Assumption 4, if |Γ^d|_{op} = o_P(λ_N σ), we have

(29)   max(|Γ − Γ̂|_{op}, |Γ^l − Γ̂|_{op}) ≤ (ρ + 1) λ_N (σ + O_P(r_N µ_N/(NT))).

Assumption 5.
Let h > 0. The following conditions hold:
(i) r_N µ_N = o(NT);
(ii) P(σ_{rank(Γ^l)}(Γ^l) ≥ (ρ + 1) λ_N h (h + 1) σ) → 1.

Condition (i) guarantees that the O_P in (29) is o_P(1). It allows the pivotal thresholding methods below but imposes a slight restriction on the operator norms of the regressors. Section 4.6 allows one to come back to a case where (i) holds for a large class of regressors. Without (i),

max(|Γ − Γ̂|_{op}, |Γ^l − Γ̂|_{op}) = O_P(λ_N)

and one can adapt the results which follow at the expense of a theoretical but unfeasible thresholding level, or by using conservative levels such that λ_N/t = o(1). Condition (ii) is weaker than a strong-factor assumption on Γ^l. We now show that we can recover rank(Γ^l) with a data-driven threshold.

Proposition 9.
Under the assumptions of Proposition 8 and Assumption 5, setting t = (ρ + 1) λ_N h σ̂ yields

P(rank(Γ̂^t) = rank(Γ^l)) → 1.

Moreover, if we remove (ii), then we have P(rank(Γ̂^t) ≤ rank(Γ^l)) → 1; if we replace (ii) by the weaker assumption P(σ_{rank(Γ^l)}(Γ^l) ≥ (ρ + 1) λ_N h σ) → 1, we have P(rank(Γ̂^t) ≥ rank(Γ^l)) → 1, and

(30)   max(|Γ − Γ̂^t|_{op}, |Γ^l − Γ̂^t|_{op}) ≤ (ρ + 1) λ_N (h + 1) (σ + o_P(1)).
Assumption 6.
Let {v_N} be such that v_N ≥ (ρ + 1) λ_N h (h + 1) σ. Assume that P(σ_{rank(Γ^l)}(Γ^l) ≥ v_N) → 1.

Proposition 10.
Under the assumptions of Proposition 9 and Assumption 6, we have

|P_{v(Γ̂^t)} − P_{v(Γ^l)}|_2 = |M_{v(Γ̂^t)} − M_{v(Γ^l)}|_2 ≤ ((ρ + 1) √(r_N) λ_N/v_N) ((h + 1) σ + o_P(1)),
|P_{u(Γ̂^t)} − P_{u(Γ^l)}|_2 = |M_{u(Γ̂^t)} − M_{u(Γ^l)}|_2 ≤ ((ρ + 1) √(r_N) λ_N/v_N) ((h + 1) σ + o_P(1)).

Under a strong-factor assumption, when λ_N is proportional to √(max(N, T)) and r_N is fixed, we obtain the same rate of convergence as using PCA, as in Lemma A.7 in [3]. Here we obtain an upper bound with a known constant. The rates that we obtain are also more general because we do not need to maintain the strong-factor assumption or that r_N is fixed; {λ_N} could also allow for errors with heavier tails of the operator norm.

4.6. Working with transformed regressors.
In the previous sections, {µ_N} sometimes plays an important role and we might want it to be not too large. However, it can be as large as O(√(NT)) if the next assumption holds.
Assumption 7.
For at least one k ∈ {1, ..., K},

(31)   X_k = Π^l_k + Π^d_k + U_k,

and Π^d_k, U_k, σ_k, r_{kN}, λ_{kN}, and v_{kN} play the roles of Γ^d, E, σ, r_N, λ_N, and v_N and satisfy the assumptions of Proposition 4, Assumption 3 (i) and (ii), and Assumption 5 (ii), and σ_{rank(Π^l_k)}(Π^l_k)^{−1} = O_P(1/√(NT)).

The problem is difficult due to σ_{rank(Π^l_k)}(Π^l_k)^{−1} = O_P(1/√(NT)). This occurs under a strong-factor assumption, when the ratio of any singular value of Π^l_k and √(NT) has a positive and finite limit in probability. The problem would be even harder if Π^l_k did not have a small rank (i.e., with "many" strong factors), and there is obviously a problem related to identification when X_k = Π^l_k and Π^l_k has small rank. Under the aforementioned assumptions, we can take λ_{kN} = λ_N. The matrix Π^l_k, σ_k, and the annihilators M_{u(Π^l_k)} and M_{v(Π^l_k)} can be estimated like in the previous sections, and one can replace X_k by X̃_k, where X_k − X̃_k has low rank, and Γ^l by Γ̃^l = Γ^l + ∑_{k=1}^K β_k (X_k − X̃_k). For simplicity of exposition, we apply a transformation to all regressors. When X = 0, (10) can be computed as an iterated soft-thresholding estimator. One can work with an estimator Π̃_k of Π_k of the form Π̃_k = Π̂_k or Π̃_k = Π̂^t_k, obtained as described in the previous sections, with (1) X̃_k = X_k − Π̃_k; (2) X̃_k = M_{u(Π̃_k)} X_k; (3) X̃_k = X_k M_{v(Π̃_k)}; (4) X̃_k = P^⊥_{Π̃_k}(X_k); (5) X̃_k = X_k − X^{(l_k)}_k, where X^{(l_k)}_k is obtained from X_k by keeping the low-rank component from an SVD corresponding to the l_k = rank(Π̃_k) largest singular values. An alternative is to rely on principal component analysis (henceforth PCA), using the eigenvalue ratio (see [1]) to select the number of factors. By the previous results, using such transformed regressors gives rise to additional terms in Γ̃ of rank each at most
r_{kN} + o_P(1) if Π_k = Π^l_k, or of the same rank as X̃^l_k w.p.a. 1 if we use hard-thresholding as well. Assuming we transform all regressors, the rank of Γ̃ is at most r̃_N + o_P(1), where r̃_N = r_N + 2((1 + ρ)/(1 − ρ)) ∑_{k=1}^K r_{kN} if Π̃_k = Π̂_k and l_k = rank(Π̂_k), and else r̃_N = r_N + ∑_{k=1}^K r_{kN}. Using Π̃_k = Π̂^t_k has the advantage that, if Π^d_k = 0, we have guarantees on the low rank of Γ̃.
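As an illustration, a sketch of transformation (2) above, reusing soft_threshold_svd and hard_threshold_svd from the sketches in Sections 3.2 and 4.5 (our code; t is a thresholding level as discussed above):

```python
import numpy as np

def low_rank_component(X_k, lam, n_iter=50):
    """Estimator (10) with Y = X_k and no regressors: iterate only the
    soft-thresholding and sigma steps."""
    N, T = X_k.shape
    sigma = np.linalg.norm(X_k) / np.sqrt(N * T)
    Pi = np.zeros((N, T))
    for _ in range(n_iter):
        Pi = soft_threshold_svd(X_k, lam * sigma)
        sigma = np.linalg.norm(X_k - Pi) / np.sqrt(N * T)
    return Pi, sigma

def transform_regressor(X_k, lam, t):
    """Option (2) of Section 4.6: hard-threshold the estimated low-rank part
    of X_k and annihilate its column space on the left:
    X~_k = M_u(Pi^t_k) X_k."""
    Pi_hat, _ = low_rank_component(X_k, lam)
    Pi_t, r_k = hard_threshold_svd(Pi_hat, t)
    U, _, _ = np.linalg.svd(Pi_t, full_matrices=False)
    U_r = U[:, :r_k]
    M_u = np.eye(X_k.shape[0]) - U_r @ U_r.T  # annihilator of the loadings
    return M_u @ X_k
```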
Remark 6.

In Assumption 7 we have assumed that we maintain the assumptions of Proposition 4 and Assumption 5 (ii) for simplicity of exposition. But we can also handle heavy-tailed errors U_k by choosing a penalty level λ_{kN} large enough, as discussed before Proposition 4. We maintain Assumption 5 (ii) to allow for a simple thresholding rule, but it is enough to use thresholding at any level of smaller order than √(NT) to obtain µ_N = o(√(NT)).

4.7. Second-stage estimator of β.

As seen at the end of Section 4.4, the estimator β̂ can sometimes achieve the 1/√(NT) rate. But under weaker conditions we obtain a slower rate of convergence. This section presents two-stage approaches which deliver an asymptotically normal estimator of β.

4.7.1. Approach 1: Annihilation of the low-rank components of Γ and the regressors.

This section analyzes another approach under Assumption 7 where, for simplicity of exposition, the last statement holds for all regressors, and we use the transformed regressors with transformation (1) or (2). We obtain estimators of Π^l_u = (Γ^l, Π^l_1, ..., Π^l_K) and Π^l_v = ((Γ^l)^⊤, (Π^l_1)^⊤, ..., (Π^l_K)^⊤)^⊤ by plug-in, using Π̃_k = Π̂_k or Π̃_k = Π̂^t_k (preferably) for k = 1, ..., K, and

(32)   Γ̂ = \hat{Γ̃} − ∑_{k=1}^K β̂_k Π̃_k,

where \hat{Γ̃} is the baseline estimator applied to the transformed data. We denote by Π̂_u and Π̂_v the resulting estimators, by σ̄ = σ + ∑_{k=1}^K σ_k and \hat{σ̄} = σ̂ + ∑_{k=1}^K σ̂_k, by σ̃ = σ̄ and \hat{σ̃} = \hat{σ̄} if Π̃_k = Π̂_k, and by σ̃ = (h + 1)σ̄ and \hat{σ̃} = (h + 1)\hat{σ̄} if Π̃_k = Π̂^t_k. Because, for K ∈ ℕ and A_1, ..., A_K with the same number of rows, |(A_1, ..., A_K)|_{op} ≤ ∑_{k=1}^K |A_k|_{op}, and

Γ̂ − Γ^l = \hat{Γ̃} − Γ̃^l + ∑_{k=1}^K (β_k − β̂_k)(Π̃_k − Π_k) + ∑_{k=1}^K (β_k − β̂_k) Π_k,

we obtain the following corollary of Proposition 8 and (30).

Corollary 1.
Under Assumptions 1, 3 (where in (iii) we have Γ̃^l instead of Γ^l), 4, and 7, if λ_N r̃_N = o(NT) and |Γ^d|_{op} = o_P(λ_N σ), we have

|Γ^l − Γ̂|_{op} ≤ (ρ + 1) λ_N (σ̄ + o_P(1)),
max(|Π^l_u − Π̂_u|_{op}, |Π^l_v − Π̂_v|_{op}) ≤ (ρ + 1) λ_N (σ̃ + o_P(1)).

Based on this corollary, we can rely on hard-thresholding of these estimators, which we denote by Γ̂^t, Π̂^t_u, and Π̂^t_v, and estimate the rank of Γ^l and the annihilator matrices M_{u(Γ^l)}, M_{v(Γ^l)}, M_{u(Π^l_u)}, and M_{v(Π^l_v)} by M_{u(Γ̂^t)}, M_{v(Γ̂^t)}, M_{u(Π̂^t_u)}, and M_{v(Π̂^t_v)}. Again, the first two annihilators are estimated at the same rate as in Lemma A.7 in [3] if Γ^l satisfies a strong-factor assumption. Propositions 9 and 10 hold with the annihilator matrices of this section, replacing σ by σ̃, σ̂ by its plug-in estimator, and Assumption 5 (ii) by

P( min(σ_{rank(Π^l_u)}(Π^l_u), σ_{rank(Π^l_v)}(Π^l_v)) ≥ (ρ + 1) λ_N h (h + 1) σ̃ ) → 1,

with λ_N r̃_N = o(NT) maintained as in Corollary 1 and the next assumption.
Assumption 8.
Let {v_N} be such that v_N ≥ (ρ + 1) λ_N h (h + 1) σ̃. Assume that

P( min(σ_{rank(Π^l_u)}(Π^l_u), σ_{rank(Π^l_v)}(Π^l_v)) ≥ v_N ) → 1

and, for a sequence {r_N}, max(rank(Π^l_u), rank(Π^l_v)) = O_P(r_N).

We denote by P^⊥_{Π̂^t} (resp. P^⊥_Π) the operator which, applied to A ∈ M_{NT}, gives P^⊥_{Π̂^t}(A) = M_{u(Π̂^t_u)} A M_{v(Π̂^t_v)} (resp. P^⊥_Π(A) = M_{u(Π_u)} A M_{v(Π_v)}), and define the estimator

(33)   β̃^(1) ∈ argmin_{β ∈ ℝ^K} |P^⊥_{Π̂^t}(Y − ∑_{k=1}^K β_k X_k)|_2^2.

Also, P^⊥_{Π̂^t}(X) (resp. P^⊥_{Π̂^t}(U), P^⊥_Π(X), and P^⊥_Π(U)) is the matrix formed like X, replacing the matrices X_k by P^⊥_{Π̂^t}(X_k) (resp. P^⊥_{Π̂^t}(U_k), P^⊥_Π(X_k), and P^⊥_Π(U_k)) for k = 1, ..., K.
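A sketch of how (33) can be computed once the estimated annihilators are available (vectorized least squares on the doubly annihilated data; our code):

```python
import numpy as np

def second_stage_beta(Y, X_list, M_u_hat, M_v_hat):
    """Second-stage estimator (33): annihilate the estimated low-rank spaces
    on both sides, then run least squares of the transformed Y on the
    transformed regressors."""
    Y_perp = M_u_hat @ Y @ M_v_hat
    X_perp = np.column_stack([(M_u_hat @ X @ M_v_hat).ravel() for X in X_list])
    beta_tilde, *_ = np.linalg.lstsq(X_perp, Y_perp.ravel(), rcond=None)
    return beta_tilde
```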
Assumption 9.

Maintain the assumptions of Corollary 1 and Assumption 8 and:
(i) r_N λ_N (λ_N + √(r_N) µ_N/v_N)/v_N = o(NT);
(ii) r_N λ_N/v_N = o(√(NT));
(iii) r_N^{3/2} λ_N (|Γ|_{op} + λ_N)/v_N = o_P(√(NT));
(iv) |P^⊥_{Π^l}(Π^d)|_2² = o_P(NT);
(v) there exists Σ_⊥ ∈ M_{KK} positive definite such that P^⊥_{Π^l}(U)^⊤ P^⊥_{Π^l}(U)/(NT) →_P Σ_⊥;
(vi) P^⊥_{Π^l}(U)^⊤ e/√(NT) →_d N(0, σ² Σ_⊥).

Regarding Assumption 9 (iii), |Γ|_{op} is usually O_P(√(NT)) if it has a nontrivial low-rank component. (i)-(iii) can be satisfied under assumptions weaker than a strong-factor assumption (under which v_N is of the order of √(NT)) and when r_N goes to infinity. (v) is satisfied, for example, if (Π^l_u, Π^l_v) and U are independent, and (vi) when (X, Γ^l) and E are independent.

Theorem 3.
Let Assumption 9 hold. We have

(√(NT)/σ̂) (β̃^(1) − β) →_d N(0, Σ_⊥^{−1}),   P^⊥_{Π̂^t}(X)^⊤ P^⊥_{Π̂^t}(X)/(NT) →_P Σ_⊥.
Also, if |P_{Π^l}(U)|_2 = o_P(|U|_2), then Σ_⊥ is the limit of E[U^⊤U]/(NT). This occurs if E[max(rank(Π^l_u), rank(Π^l_v))] = o(√(min(N, T))) and U and (Π^l_u, Π^l_v) are independent.

4.7.2. Approach 2: Using [3]'s estimator as a second stage.
An alternative approach, discussed in [15], is to rely on a preliminary consistent estimator to initialize [3]'s nonconvex estimator. [23] put forward this approach and the possibility of relying on a preliminary estimator, like their matrix Lasso, as a first step. Among other conditions, using such a two-stage approach requires a rate of convergence of the first-step estimator of β of at least (NT)^{1/4}, a consistent estimator of rank(Γ), which is assumed constant, a strong-factor assumption on Γ, and Γ^d = 0. This methodology can be applied using as a first stage the thresholded or nonthresholded square-root estimator of this paper. We denote this estimator by (β̃^(2), Γ̃^(2)). This paper provides a consistent estimator of rank(Γ^l) via hard-thresholding of (32), or an upper bound on it without thresholding. Lemma 3 in [23] proposes another consistent estimator but probably has a typo due to contradictory assumptions. The advantage of the estimator of this paper is that the level of thresholding is less conservative and makes use of the consistent estimator of the variance of the errors. Recall that, if Γ^d = 0 and Π^l_1 = ... = Π^l_K = 0, from the discussion after Proposition 7 and (32),

rank(Γ̂) ≤ ((1 + ρ)/(1 − ρ))² (r̃_N/κ_{Γ̃^l}² + ∑_{k=1}^K r_{kN}) + o_P(1).

An estimator of the asymptotic covariance matrix of the second-stage estimator, given a consistent estimator r̂ of rank(Γ^l), is given by (see page 1552 of [22]) σ̂_B² Σ̂_B^{−1}, where

σ̂_B = (1/√((N − r̂)(T − r̂) − K)) |Y − ∑_{k=1}^K β̃^(2)_k X_k − Γ̃^(2)|_2,

(Σ̂_B)_{kl} = (1/(NT)) ⟨M_{u(Γ̃^(2))} X_k M_{v(Γ̃^(2))}, X_l⟩   for all k, l ∈ {1, ..., K}.
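A sketch of this covariance estimator (our code; Γ̃^(2) provides both the residuals and, through its SVD, the annihilators):

```python
import numpy as np

def second_stage_covariance(Y, X_list, beta2, Gamma2, r_hat):
    """Plug-in variance estimator from page 1552 of [22]: residual standard
    error with degrees-of-freedom correction, and the Hessian-type matrix
    Sigma_B. The covariance estimate is sigma_B**2 * inv(Sigma_B)."""
    N, T = Y.shape
    K = len(X_list)
    resid = Y - sum(b * X for b, X in zip(beta2, X_list)) - Gamma2
    sigma_B = np.linalg.norm(resid) / np.sqrt((N - r_hat) * (T - r_hat) - K)
    U, _, Vt = np.linalg.svd(Gamma2, full_matrices=False)
    Mu = np.eye(N) - U[:, :r_hat] @ U[:, :r_hat].T
    Mv = np.eye(T) - Vt[:r_hat, :].T @ Vt[:r_hat, :]
    Sigma_B = np.empty((K, K))
    for k in range(K):
        MXk = Mu @ X_list[k] @ Mv
        for l in range(K):
            Sigma_B[k, l] = np.sum(MXk * X_list[l]) / (N * T)
    return sigma_B, Sigma_B
```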
5. Simulations

We take the same data generating process as in [23], with a single regressor and two factors:

Y_{it} = X_{it} + ∑_{l=1}^2 (1 + λ_{1,il}) f_{1,tl} + E_{it},
X_{it} = 1 + ∑_{l=1}^2 (2 + λ_{1,il} + λ_{2,il})(f_{1,tl} + f_{1,t−1,l}) + U_{it},

where f_{1,tl}, λ_{1,il}, λ_{2,il}, U_{it}, and E_{it} for all indices are mutually independent and i.i.d. standard normal. The matrix X has a statistical factor structure with a low-rank component of rank 3 due to the constant 1. Recall that β̂_LS, the least-squares estimator which ignores the presence of Γ, is inconsistent because X_{it} and Γ_{it} are correlated. By the analysis of this paper, the square-root estimator coincides with the estimator in [23] with a smaller penalization. The results in [23] are obtained with a penalty much smaller than allowed by their theory. We compare the performance of the least-squares estimator β̂_LS, the square-root estimator β̂ obtained with the baseline regressors, the square-root estimator β̂_pt obtained with the transformed regressors, where we apply (2) from Section 4.6 with Π̃ = Π̂^t, and the two-stage estimators from Section 4.7. We use β̂_LS to initialize the iterative estimators. The number of iterations is 100 to obtain the estimator of rank(Γ), as explained after Corollary 1, which is useful to compute β̃^(2). We use the same number of iterations to obtain β̂_pt. We consider an additional 100 iterations for β̂, β̂_pt, and β̃^(2). As a result, β̃^(1) and β̃^(2) have been computed with the same number of iterations. We consider two sample sizes: (a) N = T = 50 and (b) N = T = 150.
We use 7300 Monte Carlo replications to allow for an accuracy of ±0.005 with 95% confidence for the coverage probabilities of 95% confidence intervals. We choose λ_N as a constant slightly above 1 times (√N + √T), and the hard-thresholding levels are 2λ_N times an estimator of the standard error from the first stages.

A first approach is to not apply any matrix to the data, as described after Proposition 4. The results in Tables 1 and 2 compare the performance of the estimators in terms of MSE, bias, and standard error (henceforth std). In case (a), rank(Π̂^t) has been found to be always equal to 2 while rank(Π̂) was always 3 (the true rank); rank(Γ̂^t) has been found to be equal to 2 (the true rank) in 89% of the cases and else to 1. We used rank(Π̂^t) for β̂_pt and subsequently rank(Γ̂^t), β̃^(1), and β̃^(2), even though it did not perform well for such a small sample size. In case (b), rank(Π̂^t) has been found to be always equal to 3 while rank(Π̂) and rank(Γ̂^t) have been found to be always equal to 2 (the true rank).

Table 1. N = T = 50

        β̂_LS    β̂       β̂_pt       β̃^(1)     β̃^(2)
MSE     0.053   0.020   5·10^{−…}  10^{−…}   10^{−…}
bias    0.230   0.142   −10^{−…}   …         …

Table 2. N = T = 150

        β̂_LS    β̂       β̂_pt       β̃^(1)     β̃^(2)
MSE     0.053   0.011   4·10^{−…}  …         …
bias    0.231   0.103   4·10^{−…}  …         −8·10^{−…}
std     0.009   0.008   0.006      0.006     0.003

A second approach is to apply within transforms M_u = I_N − J_N/N and M_v = I_T − J_T/T to the left and right of Y and X, where J_N ∈ M_{NN} (resp. J_T ∈ M_{TT}) has all entries equal to 1. These allow one to get rid of the mean 1 of X, and more generally of any individual and time effects in both Π^l and Γ^l. The results are in Tables 3 and 4. In case (a), rank(Π̂^t) and rank(Π̂) have been found to be always equal to 2 (the true rank); rank(Γ̂^t) has been found to be equal to 2 (the true rank) in 81% of the cases and else to 1. In case (b), rank(Π̂^t), rank(Π̂), and rank(Γ̂^t) have been found to be always equal to 2 (the true ranks). Table 5 assesses the coverage probabilities in the different cases.
Table 3. $N = T = 50$, Within

       $\hat\beta_{LS}$   $\hat\beta$   $\hat\beta_{pt}$       $\tilde\beta^{(1)}$   $\tilde\beta^{(2)}$
MSE    0.049              0.016         $5\cdot10^{-\cdot}$

Table 4. $N = T = 150$, Within

       $\hat\beta_{LS}$   $\hat\beta$   $\hat\beta_{pt}$       $\tilde\beta^{(1)}$   $\tilde\beta^{(2)}$
MSE    0.049              0.007         $5\cdot10^{-\cdot}$
bias   0.222              0.081         $-1\cdot10^{-\cdot}$
std    0.014              0.007         0.007                  0.007                 0.004

Table 5.
Coverage of 95% confidence intervals based on the two-stage approaches.
Within transforms   $(N,T)$      $\tilde\beta^{(1)}$   $\tilde\beta^{(2)}$
No                  (50, 50)     0.87                  0.84
Yes                 (50, 50)     0.81                  0.76
No                  (150, 150)   0.95                  0.94
Yes                 (150, 150)   0.95                  0.94
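The two Within transforms used above amount to double-demeaning the data matrices. A minimal sketch (the helper name is ours):

```python
import numpy as np

def within_transform(A):
    """Compute M_u A M_v with M_u = I_N - J_N/N and M_v = I_T - J_T/T,
    i.e. remove row means, column means, and the grand mean of A."""
    N, T = A.shape
    M_u = np.eye(N) - np.ones((N, N)) / N
    M_v = np.eye(T) - np.ones((T, T)) / T
    return M_u @ A @ M_v

A = np.arange(12, dtype=float).reshape(3, 4)
B = within_transform(A)
# every row and column of the transformed matrix averages to zero,
# so additive individual and time effects are annihilated
print(np.allclose(B.mean(axis=0), 0), np.allclose(B.mean(axis=1), 0))
```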
References

[1] S. C. Ahn and A. R. Horenstein. Eigenvalue ratio test for the number of factors. Econometrica, 81.
[2] S. Athey, M. Bayati, N. Doudchenko, G. Imbens, and K. Khosravi. Matrix completion methods for causal panel data models. arXiv preprint 1710.10251, 2017.
[3] J. Bai. Panel data models with interactive fixed effects. Econometrica, 77.
[4] J. Bai and S. Ng. Principal components and regularized estimation of factor models. arXiv preprint 1708.08137, 2017.
[5] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98.
[6] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[7] E. J. Candès and Y. Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57.
[8] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56.
[9] A. Chudik and M. H. Pesaran. Common correlated effects estimation of heterogeneous dynamic panel data models with weakly exogenous regressors. Journal of Econometrics, 188.
[10] A. Dupuy, A. Galichon, and Y. Sun. Estimating matching affinity matrix under low-rank constraints. arXiv preprint 1612.09585, 2017.
[11] E. Gautier, A. Tsybakov, and C. Rose. High-dimensional instrumental variables regression and confidence sets. arXiv preprint 1105.2454v5.
[12] C. Giraud. Introduction to High-Dimensional Statistics. Chapman and Hall/CRC, 2014.
[13] L. Gobillon and T. Magnac. Regional policy evaluation: Interactive fixed effects and synthetic controls. Review of Economics and Statistics, 98:535–551, 2016.
[14] C. Hansen and Y. Liao. The factor-lasso and k-step bootstrap approach for inference in high-dimensional economic applications. Econometric Theory, pages 1–45, 2018.
[15] C. Hsiao. Panel models with interactive effects. Journal of Econometrics, 206:645–673, 2018.
[16] B. Jiang, Y. Yang, J. Gao, and C. Hsiao. Recursive estimation in large panel data models: theory and practice. Monash Business School preprint.
[17] O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20.
[18] O. Klopp and S. Gaïffas. High dimensional matrix estimation with unknown variance of the noise. Statistica Sinica, 27:115–145, 2017.
[19] V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39.
[20] X. Lu and L. Su. Shrinkage estimation of dynamic panel data models with interactive fixed effects. Journal of Econometrics, 190.
[21] H. R. Moon and M. Weidner. Dynamic linear panel regression models with interactive fixed effects. Econometric Theory, 33.
[22] H. R. Moon and M. Weidner. Linear regression for panel with unknown number of factors as interactive fixed effects. Econometrica, 83.
[23] H. R. Moon and M. Weidner. Nuclear norm regularized estimation of panel regression models. arXiv preprint 1810.10987, 2018.
[24] A. Onatski. Asymptotic analysis of the squared estimation error in misspecified factor models. Journal of Econometrics, 186.
[25] A. B. Owen. A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443:59–72, 2007.
[26] M. H. Pesaran. Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica, 74.
[27] M. H. Pesaran. Time Series and Panel Data Econometrics. Oxford University Press, 2015.
[28] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12.
[29] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52.
[30] A. Rohde and A. B. Tsybakov. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39.
[31] T. Sun and C.-H. Zhang. Scaled sparse linear regression. Biometrika, 99.
[32] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, Theory and Applications, pages 210–268, 2012.
Appendix
Proof of Proposition 1.
By definition of $\hat{\beta}$ and $\hat{\Gamma}$, we have, for all $\beta \in \mathbb{R}^K$ and $\Gamma \in \mathcal{M}_{NT}$,
$$\frac{1}{\sqrt{NT}}\left|Y - \sum_{k=1}^K \hat{\beta}_k X_k - \hat{\Gamma}\right|_2 + \frac{\lambda}{NT}\left|\hat{\Gamma}\right|_* \le \frac{1}{\sqrt{NT}}\left|Y - \sum_{k=1}^K \beta_k X_k - \Gamma\right|_2 + \frac{\lambda}{NT}\left|\Gamma\right|_*.$$
By definition of $P_X$ and of the estimator, for all $\beta \in \mathbb{R}^K$ and $\Gamma \in \mathcal{M}_{NT}$, we have
$$\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 + \frac{\lambda}{NT}\left|\hat{\Gamma}\right|_* \le \frac{1}{\sqrt{NT}}\left|Y - \sum_{k=1}^K \hat{\beta}_k X_k - \hat{\Gamma}\right|_2 + \frac{\lambda}{NT}\left|\hat{\Gamma}\right|_* \le \frac{1}{\sqrt{NT}}\left|Y - \sum_{k=1}^K \beta_k X_k - \Gamma\right|_2 + \frac{\lambda}{NT}\left|\Gamma\right|_*.$$
By choosing $\beta$ such that $\sum_{k=1}^K \beta_k X_k = P_X(Y - \Gamma)$, we obtain, for all $\Gamma \in \mathcal{M}_{NT}$,
$$\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 + \frac{\lambda}{NT}\left|\hat{\Gamma}\right|_* \le \frac{1}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2 + \frac{\lambda}{NT}\left|\Gamma\right|_*,$$
hence the result.
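In words, the coefficients can be profiled out: for a fixed $\Gamma$, the optimal $\beta$ is the least-squares fit of $Y - \Gamma$ on the vectorized regressors, which turns the residual into $M_X(Y - \Gamma)$. A small numerical sketch of this reduction (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, K = 6, 5, 2
Xs = rng.standard_normal((K, N, T))       # regressor matrices X_1, ..., X_K
Y = rng.standard_normal((N, T))
Gamma = rng.standard_normal((N, T))

D = Xs.reshape(K, -1).T                   # design with vec(X_k) as columns
beta = np.linalg.lstsq(D, (Y - Gamma).ravel(), rcond=None)[0]

# residual after profiling out beta equals M_X(Y - Gamma)
resid = (Y - Gamma).ravel() - D @ beta
P_X = D @ np.linalg.solve(D.T @ D, D.T)   # projection onto span of the vec(X_k)
M_X = np.eye(N * T) - P_X
print(np.allclose(resid, M_X @ (Y - Gamma).ravel()))  # True
```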
Proof of Proposition 2. The first inequality is obtained using trace duality and (6). The second is obtained by (8) and the Pythagorean theorem.
Proof of Theorem 1.
The techniques are similar to those in [5, 11]. Take $\tilde{\Gamma} \in \mathcal{M}_{NT}$ and denote by $\Delta = \hat{\Gamma} - \Gamma$. Remark that
(34)
$$\left|\hat{\Gamma}\right|_* = \left|\Gamma - \tilde{\Gamma} + \tilde{\Gamma} + P_{\tilde{\Gamma}}(\Delta) + P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* \ge \left|\tilde{\Gamma} + P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* - \left|\Gamma - \tilde{\Gamma}\right|_* - \left|P_{\tilde{\Gamma}}(\Delta)\right|_*$$
(35)
$$\ge \left|\tilde{\Gamma}\right|_* + \left|P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* - \left|\Gamma - \tilde{\Gamma}\right|_* - \left|P_{\tilde{\Gamma}}(\Delta)\right|_* \quad \text{(by (9))}.$$
Now, by (16) and the definition of $\hat{\Gamma}$, we have
(36)
$$\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 + \frac{\lambda}{NT}\left|\hat{\Gamma}\right|_* \le \frac{1}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2 + \frac{\lambda}{NT}\left|\Gamma\right|_*.$$
By convexity, trace duality, and $\lambda\rho\left|M_X(E)\right|_2/\sqrt{NT} \ge \left|M_X(E)\right|_{op}$, if $M_X(E) \ne 0$, we have
(37)
$$\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 - \frac{1}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2 \ge -\frac{1}{\sqrt{NT}\left|M_X(E)\right|_2}\left\langle M_X(E), \hat{\Gamma} - \Gamma\right\rangle \ge -\frac{\lambda\rho}{NT}\left|\Delta\right|_*.$$
(37) also holds if $M_X(E) = 0$ because $\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 \ge 0$. This and (36) imply
(38)
$$\left|\hat{\Gamma}\right|_* \le \rho\left|\Delta\right|_* + \left|\Gamma\right|_*.$$
Using (35), we get
$$\left|\tilde{\Gamma}\right|_* + \left|P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* - \left|\Gamma - \tilde{\Gamma}\right|_* - \left|P_{\tilde{\Gamma}}(\Delta)\right|_* \le \rho\left|\Delta\right|_* + \left|\Gamma\right|_*$$
and $\left|\Gamma\right|_* \le \left|\Gamma - \tilde{\Gamma}\right|_* + \left|\tilde{\Gamma}\right|_*$ yields
$$\left|P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* - \left|P_{\tilde{\Gamma}}(\Delta)\right|_* \le \rho\left|\Delta\right|_* + 2\left|\Gamma - \tilde{\Gamma}\right|_*.$$
Then, because $\left|\Delta\right|_* \le \left|P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* + \left|P_{\tilde{\Gamma}}(\Delta)\right|_*$, we have
(39)
$$(1-\rho)\left|P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* \le (1+\rho)\left|P_{\tilde{\Gamma}}(\Delta)\right|_* + 2\left|\Gamma - \tilde{\Gamma}\right|_*.$$
Also, by (36),
$$\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 - \frac{1}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2 \le \frac{\lambda}{NT}\left(\left|\Gamma\right|_* - \left|\hat{\Gamma}\right|_*\right)$$
and
$$\left|\Gamma\right|_* - \left|\hat{\Gamma}\right|_* \le \left|\tilde{\Gamma}\right|_* + \left|\Gamma - \tilde{\Gamma}\right|_* - \left|\hat{\Gamma}\right|_* = 2\left|\Gamma - \tilde{\Gamma}\right|_* + \left|\tilde{\Gamma}\right|_* - \left|\Gamma - \tilde{\Gamma}\right|_* - \left|\hat{\Gamma}\right|_* \le 2\left|\Gamma - \tilde{\Gamma}\right|_* + \left|P_{\tilde{\Gamma}}(\Delta)\right|_* - \left|P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* \quad \text{(by (35))},$$
hence we have
(40)
$$\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 - \frac{1}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2 \le \frac{\lambda}{NT}\left(2\left|\Gamma - \tilde{\Gamma}\right|_* + \left|P_{\tilde{\Gamma}}(\Delta)\right|_*\right).$$
Let $\tilde{\rho} > 0$.
Case 1. If $\tilde{\rho}\left|P_{\tilde{\Gamma}}(\Delta)\right|_* \le \left|\Gamma - \tilde{\Gamma}\right|_*$, we have, by (39),
$$\left|\Delta\right|_* \le \left|P_{\tilde{\Gamma}}(\Delta)\right|_* + \left|P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* \le \frac{2}{1-\rho}\left(\left|P_{\tilde{\Gamma}}(\Delta)\right|_* + \left|\Gamma - \tilde{\Gamma}\right|_*\right) \le \frac{2}{1-\rho}\left(\frac{1}{\tilde{\rho}} + 1\right)\left|\Gamma - \tilde{\Gamma}\right|_*.$$
This yields the first part of the first inequality of Theorem 1. The first part of the second inequality is obtained by combining (37) and (40).
Case 2.
Otherwise, if $\tilde{\rho}\left|P_{\tilde{\Gamma}}(\Delta)\right|_* > \left|\Gamma - \tilde{\Gamma}\right|_*$, we obtain, by (39), that $\left|P_{\tilde{\Gamma}}^{\perp}(\Delta)\right|_* \le c(\rho,\tilde{\rho})\left|P_{\tilde{\Gamma}}(\Delta)\right|_*$, which implies that $\Delta \in \mathcal{C}_{\tilde{\Gamma}}$ and $\left|\Delta\right|_* \le (1 + c(\rho,\tilde{\rho}))\left|P_{\tilde{\Gamma}}(\Delta)\right|_*$. We have
$$\frac{1}{NT}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2^2 - \frac{1}{NT}\left|M_X(Y - \Gamma)\right|_2^2 = \frac{1}{NT}\left|M_X\left(\hat{\Gamma} - \Gamma\right)\right|_2^2 - \frac{2}{NT}\left\langle M_X(E), \hat{\Gamma} - \Gamma\right\rangle,$$
hence, because $\lambda\rho\left|M_X(E)\right|_2/\sqrt{NT} \ge \left|M_X(E)\right|_{op}$,
(41)
$$\frac{1}{NT}\left|M_X\left(\hat{\Gamma} - \Gamma\right)\right|_2^2 \le \frac{1}{NT}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2^2 - \frac{1}{NT}\left|M_X(Y - \Gamma)\right|_2^2 + \frac{2\lambda\rho(1 + c(\rho,\tilde{\rho}))\left|M_X(E)\right|_2}{(NT)^{3/2}}\left|P_{\tilde{\Gamma}}(\Delta)\right|_*$$
and, by (40),
$$\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 - \frac{1}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2 \le (1 + 2\tilde{\rho})\frac{\lambda}{NT}\left|P_{\tilde{\Gamma}}(\Delta)\right|_*,$$
which, combined with (37), yields
(42)
$$\left|\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 - \frac{1}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2\right| \le \frac{d(\rho,\tilde{\rho})\lambda}{NT}\left|P_{\tilde{\Gamma}}(\Delta)\right|_*.$$
Now, using
$$\frac{1}{NT}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2^2 - \frac{1}{NT}\left|M_X(Y - \Gamma)\right|_2^2 = \left(\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 - \frac{1}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2\right)\left(\frac{1}{\sqrt{NT}}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2 - \frac{1}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2 + \frac{2}{\sqrt{NT}}\left|M_X(Y - \Gamma)\right|_2\right)$$
and (42), we obtain
(43)
$$\frac{1}{NT}\left|M_X\left(Y - \hat{\Gamma}\right)\right|_2^2 - \frac{1}{NT}\left|M_X(Y - \Gamma)\right|_2^2 \le \frac{d(\rho,\tilde{\rho})\lambda}{NT}\left|P_{\tilde{\Gamma}}(\Delta)\right|_*\left(\frac{d(\rho,\tilde{\rho})\lambda}{NT}\left|P_{\tilde{\Gamma}}(\Delta)\right|_* + \frac{2\left|M_X(E)\right|_2}{\sqrt{NT}}\right).$$
Combining (41) and (43), we get
$$\frac{1}{NT}\left|M_X\left(\hat{\Gamma} - \Gamma\right)\right|_2^2 \le \left(\frac{d(\rho,\tilde{\rho})\lambda}{NT}\left|P_{\tilde{\Gamma}}(\Delta)\right|_*\right)^2 + \frac{2e(\rho,\tilde{\rho})\lambda\left|M_X(E)\right|_2}{(NT)^{3/2}}\left|P_{\tilde{\Gamma}}(\Delta)\right|_*.$$
By definition of $\kappa_{\tilde{\Gamma},c(\rho,\tilde{\rho})}$, this implies
(44)
$$\left|M_X(\Delta)\right|_2 \le \left(1 - \frac{d(\rho,\tilde{\rho})^2 r\left(\tilde{\Gamma}\right)\lambda^2}{NT\,\kappa_{\tilde{\Gamma},c(\rho,\tilde{\rho})}^2}\right)^{-1}\frac{2e(\rho,\tilde{\rho})\lambda\sqrt{r\left(\tilde{\Gamma}\right)}\left|M_X(E)\right|_2}{\sqrt{NT}\,\kappa_{\tilde{\Gamma},c(\rho,\tilde{\rho})}}, \qquad \left|P_{\tilde{\Gamma}}(\Delta)\right|_* \le \left(1 - \frac{d(\rho,\tilde{\rho})^2 r\left(\tilde{\Gamma}\right)\lambda^2}{NT\,\kappa_{\tilde{\Gamma},c(\rho,\tilde{\rho})}^2}\right)^{-1}\frac{2e(\rho,\tilde{\rho})\lambda\,\mathrm{rank}\left(\tilde{\Gamma}\right)\left|M_X(E)\right|_2}{\sqrt{NT}\,\kappa_{\tilde{\Gamma},c(\rho,\tilde{\rho})}^2},$$
which yields the first result. The second result follows from (42) and (44).

Proof of Proposition 3.
Lemma 1.
It holds that $\left|P_X(E)\right|_2 = O_P(1)$ and $\left|P_X(E)\right|_{op} = O_P\left(\mu_N/\sqrt{NT}\right)$.
Let $|\cdot|$ denote the $\ell_2$ or the operator norm. We use that, due to Assumption 1 (ii), w.p.a. 1, $\left|P_X(E)\right|_2 = \left|X(X^\top X)^{-1}X^\top e\right|_2$ and
$$\left|X(X^\top X)^{-1}X^\top e\right|_2 = \left|\sum_{k=1}^K X_k\left((X^\top X)^{-1}X^\top e\right)_k\right|_2 \le \sqrt{\sum_{k=1}^K\left|X_k\right|_2^2}\,\left|(X^\top X)^{-1}X^\top e\right|_2.$$
Due to Assumption 1 (ii) and (iii), we have
(45)
$$\left|(X^\top X)^{-1}X^\top e\right|_2 \le \left|\left(\frac{X^\top X}{NT}\right)^{-1}\right|_{op}\left|\frac{X^\top e}{NT}\right|_2 = O_P\left(\frac{1}{\sqrt{NT}}\right)$$
and $\left|X_k\right|_2 = \sqrt{(X^\top X)_{kk}} = O_P\left(\sqrt{NT}\right)$, hence the result. □
By Lemma 1 and the inverse triangle inequality, we have
$$\left|\frac{\left|M_X(E)\right|_2}{\sqrt{NT}} - \frac{\left|E\right|_2}{\sqrt{NT}}\right| \le \frac{\left|P_X(E)\right|_2}{\sqrt{NT}} \xrightarrow{P} 0 \quad \text{and} \quad \left|\left|M_X(E)\right|_{op} - \left|E\right|_{op}\right| \le \left|P_X(E)\right|_{op}.$$
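Proposition 3 says that annihilating finitely many regressors leaves the scaled norms of the error essentially unchanged, which is what makes the square-root estimator pivotal with respect to the error variance. A quick numerical check under an i.i.d. standard normal design (a sketch; the setup is ours):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, K = 200, 200, 3
Xs = rng.standard_normal((K, N, T))
E = rng.standard_normal((N, T))

D = Xs.reshape(K, -1).T                 # vectorized regressors as columns
e = E.ravel()
ME = e - D @ np.linalg.lstsq(D, e, rcond=None)[0]   # vectorized M_X(E)

# both scaled norms are close to 1, the standard deviation of the errors
print(np.linalg.norm(ME) / np.sqrt(N * T))
print(np.linalg.norm(e) / np.sqrt(N * T))
```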
Proof of Proposition 5.
Let us consider a cone with constant $c$. We work for any draw of $X$ and $\Gamma^l$ and consider the matrices fixed. By the computations in the proof of Lemma 1,
$$\left|P_X(\Delta)\right|_2 \le \frac{\left|X\right|_2}{NT}\left|\left(\frac{X^\top X}{NT}\right)^{-1}\right|_{op}\left|X^\top\delta\right|_2.$$
Also, for $k \in \{1,\ldots,K\}$, using the cone and the trace duality in the third display, we obtain
$$\left|\left\langle X_k, \Delta\right\rangle\right| \le \left|\left\langle X_k, P_{\Gamma^l}(\Delta)\right\rangle\right| + \left|\left\langle X_k, P_{\Gamma^l}^{\perp}(\Delta)\right\rangle\right| = \left|\left\langle P_{\Gamma^l}(X_k), P_{\Gamma^l}(\Delta)\right\rangle\right| + \left|\left\langle P_{\Gamma^l}^{\perp}(X_k), P_{\Gamma^l}^{\perp}(\Delta)\right\rangle\right| \le \min\left(\left|P_{\Gamma^l}(X_k)\right|_{op}, \left|X_k\right|_{op}\right)\left|P_{\Gamma^l}(\Delta)\right|_* + \left|P_{\Gamma^l}^{\perp}(X_k)\right|_{op}\left|P_{\Gamma^l}^{\perp}(\Delta)\right|_*,$$
hence
$$\left|P_X(\Delta)\right|_2 \le \sum_{k=1}^K\left(b_k\left|P_{\Gamma^l}(\Delta)\right|_* + b_k^{\perp}\left|P_{\Gamma^l}^{\perp}(\Delta)\right|_*\right).$$
Also, by homogeneity, we have
$$\kappa_{l,c} = \sqrt{2\,\mathrm{rank}\left(\Gamma^l\right)}\inf_{\Delta \in \mathcal{C}_{\Gamma^l}:\,\left|P_{\Gamma^l}(\Delta)\right|_* = 1}\left(\left|\Delta\right|_2 - \left|P_X(\Delta)\right|_2\right).$$
Denote by $\{\sigma_k\}$ and $\{\sigma_k^{\perp}\}$ the singular values of $P_{\Gamma^l}(\Delta)$ and $P_{\Gamma^l}^{\perp}(\Delta)$. The rank of the first (resp. the second) matrix is at most $2\,\mathrm{rank}\left(\Gamma^l\right)$ (resp. $p_N$) so, by the Pythagorean theorem,
(46)
$$\kappa_{l,c} \ge \sqrt{2\,\mathrm{rank}\left(\Gamma^l\right)}\inf_{\substack{\sum_k\sigma_k = 1,\ \sum_k\sigma_k^{\perp} \le c\\ \sigma \ge 0,\ \sigma^{\perp} \ge 0}}\left(\sqrt{\sum_k\sigma_k^2 + \sum_k\left(\sigma_k^{\perp}\right)^2} - \sum_{k=1}^K\left(b_k + b_k^{\perp}\sum_k\sigma_k^{\perp}\right)\right)$$
(47)
$$\ge \min_{0 \le u \le c}\left(\sqrt{1 + \frac{2\,\mathrm{rank}\left(\Gamma^l\right)}{p_N}u^2} - \sqrt{2\,\mathrm{rank}\left(\Gamma^l\right)}\sum_{k=1}^K\left(b_k + b_k^{\perp}u\right)\right).$$
The degree-2 polynomial in the bracket has a minimum at $u^*$ given by $u^*\left(1 - p_N\left|b^{\perp}\right|^2\right) = p_N\left\langle b^{\perp}, b\right\rangle$. If $p_N\left|b^{\perp}\right|^2 \ge 1$,
$$\kappa_{l,c} \ge 1 - \sqrt{2\,\mathrm{rank}\left(\Gamma^l\right)}\left|b\right|;$$
else, if $p_N\left\langle b^{\perp}, b\right\rangle < c\left(1 - p_N\left|b^{\perp}\right|^2\right)$, the minimum is at $u^*$ and
$$\kappa_{l,c} \ge 1 - \sqrt{2\,\mathrm{rank}\left(\Gamma^l\right)}\left|b + b^{\perp}\frac{p_N\left\langle b^{\perp}, b\right\rangle}{1 - p_N\left|b^{\perp}\right|^2}\right| - \frac{p_N\left\langle b^{\perp}, b\right\rangle^2}{1 - p_N\left|b^{\perp}\right|^2};$$
else, the minimum is at $c$ and
$$\kappa_{l,c} \ge 1 - \sqrt{2\,\mathrm{rank}\left(\Gamma^l\right)}\left(\left|b + b^{\perp}c\right| - \frac{c^2}{p_N}\right).$$
Remark 7.
Denoting by $P^{\perp}_{A,U\times V}$ the operator defined like $P_A$ using annihilators which project onto the orthogonal of the vector space spanned by the columns of $A$ and $U$ (resp. $A$ and $V$), for $U$ and $V$ such that the vector spaces have common dimension $r(A, U\times V)$, and noting that to obtain (34) it is enough that $\tilde{\Gamma}P^{\perp}_{\tilde{\Gamma},U\times V}(\Delta)^{\top} = 0$ and $\tilde{\Gamma}^{\top}P^{\perp}_{\tilde{\Gamma},U\times V}(\Delta) = 0$, the result of Theorem 1 holds replacing $\kappa_{\tilde{\Gamma},c(\rho,\tilde{\rho})}$ by a compatibility constant replacing $P^{\perp}_{\tilde{\Gamma}}$ by $P^{\perp}_{\tilde{\Gamma},U\times V}$, $P_{\tilde{\Gamma}}$ by $P_{\tilde{\Gamma},U\times V}$, everywhere $\mathrm{rank}\left(\tilde{\Gamma}\right)$ by $r\left(\tilde{\Gamma}, U\times V\right)$, and with an infimum over $U$ and $V$ after the infimum over $\tilde{\Gamma}$. The freedom over $U$ and $V$ allows to annihilate low-rank components of $X_k$ if it has a component with a factor structure and deliver constants $b_k^{\perp}$ which are $O_P\left(\sqrt{\max(N,T)}\right)$.

Proof of Theorem 2. The first inequalities follow from Theorem 1 so we only prove (28). Due to Assumption 1 (ii), w.p.a. 1,
$$\hat{\beta} - \beta = \left(X^{\top}X\right)^{-1}X^{\top}(\gamma - \hat{\gamma}) + \left(X^{\top}X\right)^{-1}X^{\top}e,$$
also
$$\left|X^{\top}(\gamma - \hat{\gamma})\right|_2^2 = \sum_{k=1}^K\left\langle X_k, \hat{\Gamma} - \Gamma\right\rangle^2 \le \sum_{k=1}^K\left|X_k\right|_{op}^2\left|\hat{\Gamma} - \Gamma\right|_*^2 \quad \text{(by trace duality)},$$
$$\left|\left(X^{\top}X\right)^{-1}X^{\top}(\gamma - \hat{\gamma})\right|_2 \le \frac{1}{NT}\left|\left(\frac{X^{\top}X}{NT}\right)^{-1}\right|_{op}\sqrt{\sum_{k=1}^K\left|X_k\right|_{op}^2}\,\left|\hat{\Gamma} - \Gamma\right|_*.$$
By Assumption 1 and (26), we obtain $\left|\left(X^{\top}X\right)^{-1}X^{\top}(\gamma - \hat{\gamma})\right|_2 = O_P(\lambda_N r_N \mu_N/(NT))$. Next, by (45), we have $\left|\left(X^{\top}X\right)^{-1}X^{\top}e\right|_2 = O_P(1/\sqrt{NT})$. This yields the result.
Proof of Proposition 6.
The proof techniques are similar to those in [19]. We make use of the fact that if $Z \in \partial\left|\cdot\right|_*\left(\tilde{\Gamma}\right)$, i.e., is of the form
$$Z = \sum_{k=1}^{\mathrm{rank}\left(\tilde{\Gamma}\right)}u_k\left(\tilde{\Gamma}\right)v_k\left(\tilde{\Gamma}\right)^{\top} + M_{u\left(\tilde{\Gamma}\right)}WM_{v\left(\tilde{\Gamma}\right)}$$
for $W$ such that $\left|W\right|_{op} \le 1$, then
(48)
$$\left\langle\hat{Z} - Z, \hat{\Gamma} - \tilde{\Gamma}\right\rangle \ge 0,$$
and that we can choose $W$ (see [19], page 2308) such that
$$\left\langle M_{u\left(\tilde{\Gamma}\right)}WM_{v\left(\tilde{\Gamma}\right)}, \tilde{\Gamma} - \hat{\Gamma}\right\rangle = -\left|M_{u\left(\tilde{\Gamma}\right)}\hat{\Gamma}M_{v\left(\tilde{\Gamma}\right)}\right|_* = -\left|P_{\tilde{\Gamma}}^{\perp}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_*.$$
Now, by (16) and (48), we obtain
(49)
$$\left\langle M_X\left(\Gamma - \hat{\Gamma}\right), \tilde{\Gamma} - \hat{\Gamma}\right\rangle \le \lambda\hat{\sigma}\left\langle Z, \tilde{\Gamma} - \hat{\Gamma}\right\rangle - \left\langle M_X(E), \tilde{\Gamma} - \hat{\Gamma}\right\rangle \le \lambda\hat{\sigma}\left(\left|P_{\tilde{\Gamma}}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_* \wedge \left|P_{u\left(\tilde{\Gamma}\right)}\left(\tilde{\Gamma} - \hat{\Gamma}\right)P_{v\left(\tilde{\Gamma}\right)}\right|_*\right) - \lambda\hat{\sigma}\left|P_{\tilde{\Gamma}}^{\perp}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_* - \left\langle M_X(E), \tilde{\Gamma} - \hat{\Gamma}\right\rangle.$$
We now use
(50)
$$2\left\langle M_X\left(\Gamma - \hat{\Gamma}\right), \tilde{\Gamma} - \hat{\Gamma}\right\rangle = \left|M_X\left(\Gamma - \hat{\Gamma}\right)\right|_2^2 + \left|M_X\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_2^2 - \left|M_X\left(\Gamma - \tilde{\Gamma}\right)\right|_2^2$$
and consider the cases (1) $\left\langle M_X\left(\Gamma - \hat{\Gamma}\right), \tilde{\Gamma} - \hat{\Gamma}\right\rangle \le 0$ and (2) $\left\langle M_X\left(\Gamma - \hat{\Gamma}\right), \tilde{\Gamma} - \hat{\Gamma}\right\rangle > 0$. In case (1), $\left|M_X\left(\Gamma - \hat{\Gamma}\right)\right|_2^2 \le \left|M_X\left(\Gamma - \tilde{\Gamma}\right)\right|_2^2$, hence the result. In case (2), we have
$$\lambda\hat{\sigma}\left|P_{\tilde{\Gamma}}^{\perp}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_* \le \lambda\hat{\sigma}\left|P_{\tilde{\Gamma}}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_* - \left\langle M_X(E), \tilde{\Gamma} - \hat{\Gamma}\right\rangle,$$
thus, because $\rho\lambda\hat{\sigma} \ge \left|M_X(E)\right|_{op}$, $\tilde{\Gamma} - \hat{\Gamma} \in \mathcal{C}_{\tilde{\Gamma}}$. Moreover, by (50) and (49), we have
$$\left|M_X\left(\Gamma - \hat{\Gamma}\right)\right|_2^2 + \left|M_X\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_2^2 + 2\lambda\hat{\sigma}\left|P_{\tilde{\Gamma}}^{\perp}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_* \le \left|M_X\left(\Gamma - \tilde{\Gamma}\right)\right|_2^2 + 2\lambda\hat{\sigma}\left|P_{\tilde{\Gamma}}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_* - 2\left\langle M_X(E), \tilde{\Gamma} - \hat{\Gamma}\right\rangle \le \left|M_X\left(\Gamma - \tilde{\Gamma}\right)\right|_2^2 + 2\lambda\hat{\sigma}\left|P_{\tilde{\Gamma}}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_* + 2\rho\lambda\hat{\sigma}\left(\left|P_{\tilde{\Gamma}}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_* + \left|P_{\tilde{\Gamma}}^{\perp}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_*\right)$$
and, by definition of $\kappa_{\tilde{\Gamma},c(\rho)}$,
$$\left|M_X\left(\Gamma - \hat{\Gamma}\right)\right|_2^2 + \left|M_X\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_2^2 \le \left|M_X\left(\Gamma - \tilde{\Gamma}\right)\right|_2^2 + 2\lambda(1+\rho)\hat{\sigma}\left|P_{\tilde{\Gamma}}\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_* \le \left|M_X\left(\Gamma - \tilde{\Gamma}\right)\right|_2^2 + 2\lambda(1+\rho)\hat{\sigma}\frac{\sqrt{r\left(\tilde{\Gamma}\right)}}{\kappa_{\tilde{\Gamma},c(\rho)}}\left|M_X\left(\tilde{\Gamma} - \hat{\Gamma}\right)\right|_2,$$
hence
$$\frac{1}{NT}\left|M_X\left(\Gamma - \hat{\Gamma}\right)\right|_2^2 \le \frac{1}{NT}\left|M_X\left(\Gamma - \tilde{\Gamma}\right)\right|_2^2 + \frac{2\left(\lambda(1+\rho)\hat{\sigma}\right)^2}{NT}\frac{\mathrm{rank}\left(\tilde{\Gamma}\right)}{\kappa_{\tilde{\Gamma},c(\rho)}^2}.$$

Proof of Proposition 7.
(16) yields, for all $k = 1,\ldots,\mathrm{rank}\left(\hat{\Gamma}\right)$,
$$u_k\left(\hat{\Gamma}\right)^{\top}M_X\left(\Gamma^l - \hat{\Gamma}\right)v_k\left(\hat{\Gamma}\right) = \lambda\hat{\sigma} - u_k\left(\hat{\Gamma}\right)^{\top}M_X\left(\Gamma^d + E\right)v_k\left(\hat{\Gamma}\right) = \lambda\hat{\sigma} - \left\langle M_X\left(\Gamma^d + E\right), u_k\left(\hat{\Gamma}\right)v_k\left(\hat{\Gamma}\right)^{\top}\right\rangle \ge \lambda(1-\rho)\hat{\sigma} - \left|\Gamma^d\right|_{op},$$
and, by summing the inequalities,
(51)
$$\left\langle\sum_{k=1}^{\mathrm{rank}\left(\hat{\Gamma}\right)}u_k\left(\hat{\Gamma}\right)v_k\left(\hat{\Gamma}\right)^{\top}, P_{u\left(\hat{\Gamma}\right)}M_X\left(\Gamma^l - \hat{\Gamma}\right)P_{v\left(\hat{\Gamma}\right)}\right\rangle \ge \left(\lambda(1-\rho)\hat{\sigma} - \left|\Gamma^d\right|_{op}\right)\mathrm{rank}\left(\hat{\Gamma}\right),$$
thus
$$\left|P_{u\left(\hat{\Gamma}\right)}M_X\left(\Gamma^l - \hat{\Gamma}\right)P_{v\left(\hat{\Gamma}\right)}\right|_2 \ge \left(\lambda(1-\rho)\hat{\sigma} - \left|\Gamma^d\right|_{op}\right)\sqrt{\mathrm{rank}\left(\hat{\Gamma}\right)}.$$
Proposition 11.
Let $m = \frac{\left|X\right|_{op}}{NT}\left|\left(\frac{X^{\top}X}{NT}\right)^{-1}\right|_{op}\left(\sum_{k=1}^K\left|X_k\right|_2^2\right)^{1/2}\left(\mathrm{rank}(\Gamma) + \mathrm{rank}\left(\hat{\Gamma}\right)\right)^{1/2}$; we have
$$\left|P_X\left(\Gamma - \hat{\Gamma}\right)\right|_2 \le \frac{m}{(1-m)_+}\left|M_X\left(\Gamma - \hat{\Gamma}\right)\right|_2, \qquad \left|\Gamma - \hat{\Gamma}\right|_2 \le \left(\left(\frac{m}{(1-m)_+}\right)^2 + 1\right)^{1/2}\left|M_X\left(\Gamma - \hat{\Gamma}\right)\right|_2.$$
By Theorem C.5 in [12], the definition of $P_X$, and the computations in the proof of Theorem 2, we have, w.p.a. 1,
$$\left|P_X\left(\Gamma - \hat{\Gamma}\right)\right|_2 \le \frac{\left|X\right|_{op}}{NT}\left|\left(\frac{X^{\top}X}{NT}\right)^{-1}\right|_{op}\left(\sum_{k=1}^K\left|X_k\right|_2^2\right)^{1/2}\sqrt{\mathrm{rank}\left(\Gamma - \hat{\Gamma}\right)}\left|\hat{\Gamma} - \Gamma\right|_2 \le m\left|\hat{\Gamma} - \Gamma\right|_2.$$
We conclude by the Pythagorean theorem. □
Proof of Proposition 8.
By (16), we have $\Gamma^l - \hat{\Gamma} = \sum_{k=1}^K\left(\hat{\beta}_k - \beta_k\right)X_k - \Gamma^d - E + \lambda_N\hat{\sigma}\hat{Z}$, hence
$$\left|\Gamma^l - \hat{\Gamma}\right|_{op} \le \left|\hat{\beta} - \beta\right|_2\sqrt{\sum_{k=1}^K\left|X_k\right|_2^2} + \left|\Gamma^d\right|_{op} + \left|E\right|_{op} + \lambda_N\hat{\sigma}$$
and we conclude using Theorem 2 and Assumption 2 (ii).
Weyl's inequality yields, for $k \in \{1,\ldots,\min(N,T)\}$,
$$\left|\sigma_k\left(\Gamma^l\right) - \sigma_k\left(\hat{\Gamma}\right)\right| \le \left|\Gamma^l - \hat{\Gamma}\right|_{op}.$$
This implies, for $k \le \mathrm{rank}\left(\Gamma^l\right)$,
(52)
$$\sigma_k\left(\hat{\Gamma}\right) \ge \sigma_k\left(\Gamma^l\right) - \left|\Gamma^l - \hat{\Gamma}\right|_{op}$$
and, for $k > \mathrm{rank}\left(\Gamma^l\right)$,
(53)
$$\sigma_k\left(\hat{\Gamma}\right) \le \left|\Gamma^l - \hat{\Gamma}\right|_{op}.$$
By Assumption 5 (i) and Proposition 8, we have $P\left(\left|\Gamma^l - \hat{\Gamma}\right|_{op} \le (\rho+1)\lambda_N h\sigma\right) \to 1$. By Theorem 2 and $\lambda_N r_N = o(NT)$, we obtain $P\left((\rho+1)\lambda_N h\sigma < t\right) \to 1$ and, by (53),
(54)
$$P\left(\forall k > \mathrm{rank}\left(\Gamma^l\right),\ t > \sigma_k\left(\hat{\Gamma}\right)\right) \to 1.$$
By Assumption 5 (ii), we have $P\left(\sigma_k\left(\Gamma^l\right) - \left|\Gamma^l - \hat{\Gamma}\right|_{op} \ge (\rho+1)\lambda_N h\sigma\right) \to 1$ for $k \le \mathrm{rank}\left(\Gamma^l\right)$. By Theorem 2 and $\lambda_N r_N = o(NT)$, we obtain $P\left(t < (\rho+1)\lambda_N h\sigma\right) \to 1$ and, by (52),
(55)
$$P\left(\forall k \le \mathrm{rank}\left(\Gamma^l\right),\ t < \sigma_k\left(\hat{\Gamma}\right)\right) \to 1.$$
Combining (54) and (55), we obtain the first result. The other results are obtained similarly.
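Proposition 9 underlies the rank estimation used in the simulations: the singular values of $\hat{\Gamma}$ coming from $\Gamma^l$ stay above the threshold while the remaining ones fall below it, so counting the singular values above the threshold recovers the rank. A minimal illustration (the threshold level here is arbitrary, not the paper's exact rule):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, r = 100, 100, 2
# low-rank signal plus a small perturbation, mimicking Gamma_hat
Gamma_l = rng.standard_normal((N, r)) @ rng.standard_normal((r, T))
Gamma_hat = Gamma_l + 0.1 * rng.standard_normal((N, T))

s = np.linalg.svd(Gamma_hat, compute_uv=False)
threshold = 0.5 * (np.sqrt(N) + np.sqrt(T))   # illustrative level
rank_hat = int(np.sum(s > threshold))
# by Weyl's inequality the top r singular values exceed the threshold
# and the others stay below it, so rank_hat recovers r = 2
print(rank_hat)
```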
Proof of Proposition 10.
Because
$$\left|M_{v\left(\hat{\Gamma}^t\right)} - M_{v\left(\Gamma^l\right)}\right|_2^2 = \left|P_{v\left(\hat{\Gamma}^t\right)} - P_{v\left(\Gamma^l\right)}\right|_2^2 = \mathrm{rank}\left(\hat{\Gamma}^t\right) + \mathrm{rank}\left(\Gamma^l\right) - 2\sum_{k=1}^{\mathrm{rank}\left(\Gamma^l\right)}v_k\left(\Gamma^l\right)^{\top}P_{v\left(\hat{\Gamma}^t\right)}v_k\left(\Gamma^l\right) = \mathrm{rank}\left(\hat{\Gamma}^t\right) - \mathrm{rank}\left(\Gamma^l\right) + 2\sum_{k=1}^{\mathrm{rank}\left(\Gamma^l\right)}v_k\left(\Gamma^l\right)^{\top}M_{v\left(\hat{\Gamma}^t\right)}v_k\left(\Gamma^l\right)$$
and
$$\left|\Gamma^lM_{v\left(\hat{\Gamma}^t\right)}\right|_2^2 = \sum_{k=1}^{\mathrm{rank}\left(\Gamma^l\right)}\sigma_k\left(\Gamma^l\right)^2v_k\left(\Gamma^l\right)^{\top}M_{v\left(\hat{\Gamma}^t\right)}v_k\left(\Gamma^l\right),$$
the result follows from
$$\left|M_{v\left(\hat{\Pi}^t_v\right)} - M_{v\left(\Pi^l_v\right)}\right|_2^2 \le \left|\mathrm{rank}\left(\hat{\Gamma}^t\right) - \mathrm{rank}\left(\Gamma^l\right)\right| + \frac{2}{\sigma_{\mathrm{rank}\left(\Gamma^l\right)}\left(\Gamma^l\right)^2}\left|\Gamma^lM_{v\left(\hat{\Gamma}^t\right)}\right|_2^2 \le \left|\mathrm{rank}\left(\hat{\Gamma}^t\right) - \mathrm{rank}\left(\Gamma^l\right)\right| + \frac{2}{\sigma_{\mathrm{rank}\left(\Gamma^l\right)}\left(\Gamma^l\right)^2}\left|\Gamma^l - \hat{\Gamma}^t\right|_2^2 \le o_P(1) + \frac{2r_N\left((\rho+1)\lambda_N(h+1)\right)^2\left(\sigma^2 + o_P(1)\right)}{\sigma_{\mathrm{rank}\left(\Gamma^l\right)}\left(\Gamma^l\right)^2}.$$
Using that $M_{u\left(\hat{\Pi}^t_u\right)}$ and $M_{v\left(\hat{\Pi}^t_v\right)}$ are self-adjoint, a solution to (33) satisfies, for $l = 1,\ldots,K$, $\left\langle M_{u\left(\hat{\Pi}^t_u\right)}X_lM_{v\left(\hat{\Pi}^t_v\right)}, Y - \sum_{k=1}^K\tilde{\beta}^{(1)}_kX_k\right\rangle = 0$, hence
$$\left\langle M_{u\left(\Pi^l_u\right)}X_lM_{v\left(\Pi^l_v\right)}, \Gamma^d + E + \sum_{k=1}^K\left(\beta_k - \tilde{\beta}^{(1)}_k\right)X_k\right\rangle = \left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_lM_{v\left(\Pi^l_v\right)}, \Gamma^d + E + \sum_{k=1}^K\left(\beta_k - \tilde{\beta}^{(1)}_k\right)X_k\right\rangle + \left\langle M_{u\left(\Pi^l_u\right)}X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right), \Gamma^d + E + \sum_{k=1}^K\left(\beta_k - \tilde{\beta}^{(1)}_k\right)X_k\right\rangle - \left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right), \Gamma + E + \sum_{k=1}^K\left(\beta_k - \tilde{\beta}^{(1)}_k\right)X_k\right\rangle,$$
so
(56)
$$\sum_{k=1}^K\left(\beta_k - \tilde{\beta}^{(1)}_k\right)\left(\left\langle M_{u\left(\Pi^l_u\right)}X_lM_{v\left(\Pi^l_v\right)}, X_k\right\rangle - \left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_lM_{v\left(\Pi^l_v\right)}, X_k\right\rangle - \left\langle M_{u\left(\Pi^l_u\right)}X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right), X_k\right\rangle + \left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right), X_k\right\rangle\right) = -\left\langle M_{u\left(\Pi^l_u\right)}X_lM_{v\left(\Pi^l_v\right)}, \Gamma^d + E\right\rangle + \left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_lM_{v\left(\Pi^l_v\right)}, \Gamma^d + E\right\rangle + \left\langle M_{u\left(\Pi^l_u\right)}X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right), \Gamma^d + E\right\rangle - \left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right), \Gamma + E\right\rangle.$$
Let us show that $\left\langle M_{u\left(\Pi^l_u\right)}X_lM_{v\left(\Pi^l_v\right)}, X_k\right\rangle$, which by Assumption 9 (v) diverges like $NT$, is the high-order term multiplying $\left(\beta_k - \tilde{\beta}^{(1)}_k\right)$ in (56). This also yields the consistency of the estimator of the covariance matrix. By self-adjointness of the projections, Theorem C.5 in [12], and Proposition 9 with the modifications of Section 4.7, which imply $\mathrm{rank}\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right) \le 2r_N$ w.p.a. 1, denoting, for a matrix $M$ and $r \in \mathbb{N}$, by $\left|M\right|_{2,r} = \left(\sum_{k=1}^r\sigma_k(M)^2\right)^{1/2}$, we have
$$\left|\left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_lM_{v\left(\Pi^l_v\right)}, X_k\right\rangle\right| \le (1 + o_P(1))\left|M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right|_2\left|X_lM_{v\left(\Pi^l_v\right)}X_k^{\top}\right|_{2,2r_N} \le \left(\sqrt{2r_N} + o_P(1)\right)\left|M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right|_2\left|X_lM_{v\left(\Pi^l_v\right)}\right|_{op}\left|X_kM_{v\left(\Pi^l_v\right)}\right|_{op},$$
hence, by Proposition 10 with the modifications of Section 4.7,
$$\left|\left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_lM_{v\left(\hat{\Pi}^t_v\right)}, X_k\right\rangle\right| \le \frac{2(\rho+1)\sqrt{r_N}\lambda_N}{v_N}\left((h+1)\tilde{\sigma} + o_P(1)\right)\left|\left(\Pi^d_l + U_l\right)M_{v\left(\Pi^l_v\right)}\right|_{op}\left|\left(\Pi^d_k + U_k\right)M_{v\left(\Pi^l_v\right)}\right|_{op}.$$
We treat similarly $\left|\left\langle M_{u\left(\Pi^l_u\right)}X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right), X_k\right\rangle\right|$ and, for the fourth term, use that
$$\left|\left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right), X_k\right\rangle\right| \le \left|\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right)\right|_*\left|X_k\right|_{op} \le \left(\sqrt{2r_N} + o_P(1)\right)\left|\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right)\right|_2\left|X_k\right|_{op} \le \left(\sqrt{2r_N} + o_P(1)\right)\left|M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right|_{op}\left|X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right)\right|_2\left|X_k\right|_{op} \le \left(\sqrt{2r_N} + o_P(1)\right)\left|M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right|_2\left|X_l\right|_{op}\left|M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right|_2\left|X_k\right|_{op} \le \frac{(\rho+1)^2(2r_N)^{3/2}\lambda_N^2}{v_N^2}\left((h+1)\tilde{\sigma} + o_P(1)\right)^2\left|X_l\right|_{op}\left|X_k\right|_{op},$$
where we use Proposition 9 in the third display and Proposition 10 (with the modifications of Section 4.7) in the last display. Let us consider now the quantities on the right-hand side in (56). Proceeding as above, we have
$$\left|\left\langle\left(M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right)X_lM_{v\left(\Pi^l_v\right)}, \Gamma^d + E\right\rangle\right| \le (1 + o_P(1))\left|M_{u\left(\Pi^l_u\right)} - M_{u\left(\hat{\Pi}^t_u\right)}\right|_2\left|X_lM_{v\left(\Pi^l_v\right)}\left(\Gamma^d + E\right)^{\top}\right|_{2,2r_N} \le \frac{2(\rho+1)\sqrt{r_N}\lambda_N\left((h+1)\tilde{\sigma} + o_P(1)\right)}{v_N}\left(\rho\lambda_N\sigma + \left|\Gamma^dM_{v\left(\Pi^l_v\right)}\right|_{op}\right)\left(\rho\lambda_N\sigma_l + \left|\Pi^d_lM_{v\left(\Pi^l_v\right)}\right|_{op}\right)$$
and treat similarly $\left\langle M_{u\left(\Pi^l_u\right)}X_l\left(M_{v\left(\Pi^l_v\right)} - M_{v\left(\hat{\Pi}^t_v\right)}\right), \Gamma^d + E\right\rangle$. With the same arguments, the absolute value of the last term of (56) is smaller than
$$\frac{(\rho+1)^2(2r_N)^{3/2}\lambda_N^2}{v_N^2}\left((h+1)\tilde{\sigma} + o_P(1)\right)\left|X_l\right|_{op}\left(\left|\Gamma\right|_{op} + \rho\lambda_N(h+1)\tilde{\sigma} + o_P(1)\right).$$
Let us now look at the first terms on the left-hand side and on the right-hand side of (56). By (iv), for all $k,l \in \{1,\ldots,K\}$, $\left\langle M_{u\left(\Pi^l_u\right)}X_lM_{v\left(\Pi^l_v\right)}, X_k\right\rangle = \left\langle M_{u\left(\Pi^l_u\right)}U_lM_{v\left(\Pi^l_v\right)}, U_k\right\rangle + o_P(NT)$, so, by (v), the $\left\langle M_{u\left(\Pi^l_u\right)}X_lM_{v\left(\Pi^l_v\right)}, X_k\right\rangle$ are the high-order terms on the left-hand side of (56). Similarly, by (iv), the high-order terms on the right-hand side of (56) are $\left\langle M_{u\left(\Pi^l_u\right)}U_lM_{v\left(\Pi^l_v\right)}, E\right\rangle$. As a result, $\tilde{\beta}^{(1)}$ is asymptotically equivalent to the ideal estimator $\bar{\beta}$,
(57)
$$\bar{\beta} \in \underset{\beta \in \mathbb{R}^K}{\operatorname{argmin}}\left|P_{\Pi^l}^{\perp}\left(Y - \sum_{k=1}^K\beta_kU_k\right)\right|_2^2.$$
Hence, w.p.a. 1, $\bar{\beta} - \beta = \left(P_{\Pi^l}^{\perp}(U)^{\top}P_{\Pi^l}^{\perp}(U)\right)^{-1}P_{\Pi^l}^{\perp}(U)^{\top}e$ and we conclude by usual arguments. To obtain the first part of the second statement we use that $U^{\top}U - P_{\Pi^l}^{\perp}(U)^{\top}P_{\Pi^l}^{\perp}(U)$ is symmetric positive definite. It is clearly symmetric. The positive definiteness follows from the following computations. Let $b \in \mathbb{R}^K$; we have
$$\sum_{k,l}b_kb_l\,\mathrm{tr}\left(U_k^{\top}U_l\right) \ge \sum_{k,l}b_kb_l\,\mathrm{tr}\left(M_{v\left(\Pi^l_v\right)}U_k^{\top}U_l\right) = \sum_{k,l}b_kb_l\,\mathrm{tr}\left(M_{v\left(\Pi^l_v\right)}U_k^{\top}M_{u\left(\Pi^l_u\right)}U_lM_{v\left(\Pi^l_v\right)}\right) + \sum_{k,l}b_kb_l\,\mathrm{tr}\left(M_{v\left(\Pi^l_v\right)}U_k^{\top}P_{u\left(\Pi^l_u\right)}U_lM_{v\left(\Pi^l_v\right)}\right) \ge \sum_{k,l}b_kb_l\,\mathrm{tr}\left(P_{\Pi^l}^{\perp}(U_k)^{\top}P_{\Pi^l}^{\perp}(U_l)\right).$$
Because $U^{\top}U$ has a fixed dimension, all norms are equivalent and
$$\left|U^{\top}U - P_{\Pi^l}^{\perp}(U)^{\top}P_{\Pi^l}^{\perp}(U)\right|_{op} \le \mathrm{tr}\left(U^{\top}U - P_{\Pi^l}^{\perp}(U)^{\top}P_{\Pi^l}^{\perp}(U)\right) = \left|P_{\Pi^l}(U)\right|_2^2 = o_P\left(\left|U\right|_2^2\right).$$
We conclude using that $\left|U\right|_2^2 \le K\left|U^{\top}U\right|_{op}$. Also, from the above, $P_{\hat{\Pi}^t}^{\perp}(U)^{\top}P_{\hat{\Pi}^t}^{\perp}(U) = P_{\Pi^l}^{\perp}(U)^{\top}P_{\Pi^l}^{\perp}(U) + M$, where $M$ is a smaller order term by condition (iv). We obtain the last part of the second statement using the next lemma.
Assume that $U$ and $\left(\Pi^l_u, \Pi^l_v\right)$ are independent and that $E\left[\max\left(\mathrm{rank}\left(\Pi^l_u\right), \mathrm{rank}\left(\Pi^l_v\right)\right)\right] = o\left(\sqrt{\min(N,T)}\right)$; then $\left|P_{\Pi^l}(U)\right|_2^2/(NT) = o_P(1)$, hence $P_{\Gamma^r}^{\perp}(U)^{\top}P_{\Gamma^r}^{\perp}(U)/(NT) \xrightarrow{P} E\left[U^{\top}U\right]$.
We prove that, for $k \in \{1,\ldots,K\}$, $\left|P_{\Pi^l}(U_k)\right|_2^2/(NT)$ converges to 0 in $L^1$. This relies on (5) and the facts that $M_{u\left(\Pi^l\right)}$ is a contraction for the Euclidean norm and
$$E\left[\left|U_kP_{v\left(\Pi^l\right)}\right|_2^2\right] = E\left[E\left[\left|U_kP_{v\left(\Pi^l\right)}\right|_2^2\,\middle|\,\Pi^l_u, \Pi^l_v\right]\right] = E\left[E\left[\sum_{i=1}^N\left|U_{i\cdot}P_{v\left(\Pi^l\right)}\right|_2^2\,\middle|\,\Pi^l_u, \Pi^l_v\right]\right] = N\,E\left[\mathrm{rank}\left(\Pi^l_v\right)\right]\sigma_u^2 = o(NT),$$
and similarly for $E\left[\left|P_{u\left(\Pi^l\right)}U_k\right|_2^2\right]$. By the arguments in the previous proof, $U^{\top}U/(NT)$ and $P_{\Gamma^r}^{\perp}(U)^{\top}P_{\Gamma^r}^{\perp}(U)/(NT)$ have the same limit, hence the result by the law of large numbers. □
Toulouse School of Economics, Université Toulouse Capitole, 21 allée de Brienne, 31000 Toulouse, France
E-mail address: