Likelihood Inference and The Role of Initial Conditions for the Dynamic Panel Data Model
Jose Diogo Barbosa
University of Michigan
Marcelo J. Moreira
FGV
This version: September 3, 2018

The authors gratefully acknowledge the research support of CNPq and FAPERJ.

Abstract
Lancaster (2002) proposes an estimator for the dynamic panel data model with homoskedastic errors and zero initial conditions. In this paper, we show this estimator is invariant to orthogonal transformations, but is inefficient because it ignores additional information available in the data. The zero initial condition is trivially satisfied by subtracting initial observations from the data. We show that differencing out the data further erodes efficiency compared to drawing inference conditional on the first observations.

Finally, we compare the conditional method with standard random-effects approaches for unobserved data. Standard approaches implicitly rely on normal approximations, which may not be reliable when unobserved data are very skewed with some mass at zero values. For example, panel data on firms naturally depend on the first period in which the firm enters a new state. It seems unreasonable, then, to assume that the process determining unobserved data is known or stationary. We can instead make inference on structural parameters by conditioning on the initial observations.
Keywords : Autoregressive, Panel Data, Invariance, Efficiency.
JEL Classification Numbers: C12, C30.

1 Introduction
In an important paper, Lancaster (2002) studies, from a Bayesian perspective, estimation of the structural parameters of a dynamic panel data model with fixed effects and initial observations equal to zero. His method involves reparameterizing the model so that the information matrix is block diagonal, with the common parameters in one block and the incidental parameters in the other. His estimator is then defined as one of the local maxima of the integrated likelihood function, integrating with respect to the Lebesgue measure on $\mathbb{R}^N$. However, Lancaster leaves unanswered the question of how to uniquely determine the consistent root of his proposed methodology. Some authors, including Dhaene and Jochmans (2016) and Kruiniger (2014), have proposed different ways to find the consistent estimator in Lancaster's approach.

In this paper, we explain the shortcoming of Lancaster's estimator: it ignores available information in the model. In particular, Lancaster's posterior distribution uses only part of the maximal invariant statistic's log-likelihood function; when the full likelihood function is used in the estimation, a unique, consistent, and asymptotically normal efficient estimator is obtained; see Moreira (2009). The estimator obtained using the full likelihood is asymptotically more efficient than Lancaster's estimator. Therefore, trying to correct the nonuniqueness issue of Lancaster's estimator is unnecessary and leads to inefficient estimators.

Lancaster (2002) and Moreira (2009) study consistent estimation of dynamic panel models under the same set of assumptions and with initial observations equal to zero. The zero initial condition is trivially satisfied when the initial observations are subtracted from the autoregressive variables. We then show that efficiency is improved by conditioning on the initial observations instead of differencing out the data. The conditional argument is essentially a fixed-effects approach in which we make no further premises on unobserved data.
This is in contrast with commonly used estimators in the literature which make further assumptions; see Bai (2013a,b) on correlated random effects or Blundell and Bond (1998) on stationarity.

A potential advantage of conditioning on the first observation is robustness. As Blundell and Smith (1991) point out, asymptotic arguments are usually based on the average temporal effect, calculated on the individual dimension. Therefore, the importance of unobserved data does not disappear asymptotically for relatively short panels. For example, take data on firms or individual earnings. Cabral and Mata (2003) show that the distribution of firms is very skewed, with most of the mass being small firms, while Evans (1987b,a) and Hall (1987) show that Gibrat's Law (independent firm size and growth) is rejected for small firms. As only a handful of large firms provide the bulk of the data, we should not expect estimators based on assumptions about unobserved data to be approximately normal. As the firms' entry states do not disappear asymptotically for relatively short panels, conditioning on the first observation would be preferable to assuming known processes for unobserved data.

The remainder of this paper is organized as follows. Section 2 introduces a simple dynamic panel data model without covariates and a zero initial condition. This section determines the maximal invariant statistic and summarizes the asymptotic theory for the maximum invariant likelihood estimator (MILE). Section 3 develops the asymptotic theory for Lancaster's estimator. Section 4 shows that this estimator is less efficient than MILE because it ignores relevant data. Section 5 shows that it is less efficient to difference out the first observation than to condition on it. Section 6 compares the conditional argument to standard random-effects approaches for the unobserved data. Section 7 concludes and discusses how to extend the model to a more useful form.
We consider a simple homoskedastic dynamic panel model with fixed effects and without covariates:
$$y_{i,t+1} = \rho y_{i,t} + \eta_i + \sigma u_{i,t}, \qquad i = 1, \dots, N; \; t = 1, \dots, T, \qquad (1)$$
where $N \geq T+1$, $y_{i,t} \in \mathbb{R}$ are observable variables and $u_{i,t} \overset{iid}{\sim} N(0,1)$ are unobservable errors; $\eta_i \in \mathbb{R}$ are incidental parameters and $(\rho, \sigma^2) \in \mathbb{R} \times \mathbb{R}_+$ are structural parameters. We denote the true unknown parameters by $(\rho^*, \sigma^{2*}, \eta_i^*)$. We assume that the parameter space is a compact set and $\sigma^{2*} > 0$. For now, we assume that the initial observed condition is $y_{i,1} = 0$, as Lancaster (2002) does. We relax this assumption in Sections 5 and 6.

Solving model (1) recursively and writing it in matrix form yields
$$Y_T = \eta \mathbf{1}_T' B_T' + \sigma U_T B_T', \quad \text{where} \qquad (2)$$
$$U_T \sim N(0_{N \times T}, I_N \otimes I_T), \quad \eta = (\eta_1, \dots, \eta_N)' \in \mathbb{R}^{N \times 1}, \quad \mathbf{1}_T = (1, \dots, 1)' \in \mathbb{R}^{T \times 1},$$
$$Y_T = \begin{pmatrix} y_{1,2} & \cdots & y_{1,T+1} \\ \vdots & \ddots & \vdots \\ y_{N,2} & \cdots & y_{N,T+1} \end{pmatrix}, \quad U_T = \begin{pmatrix} u_{1,1} & \cdots & u_{1,T} \\ \vdots & \ddots & \vdots \\ u_{N,1} & \cdots & u_{N,T} \end{pmatrix}, \quad \text{and} \quad B_T = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \rho & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \cdots & 1 \end{pmatrix}.$$
When there is no confusion, we will omit the subscript from the matrices; e.g., $Y$ instead of $Y_T$, $B$ instead of $B_T$, etc. The inverse of $B$ has a simple form, $B^{-1} \equiv D = I_T - \rho J_T$, where
$$J_T = \begin{pmatrix} \mathbf{0}_{T-1}' & 0 \\ I_{T-1} & \mathbf{0}_{T-1} \end{pmatrix}$$
is the $T \times T$ shift matrix, with ones on the first subdiagonal and zeros elsewhere.

Because the individuals $i$ are treated equally, the coordinate system used to specify the vector $(y_{1,t}, \dots, y_{N,t})$ should not affect inference based on them. Therefore, it is reasonable to restrict attention to coordinate-free functions of $(y_{1,t}, \dots, y_{N,t})$. Chamberlain and Moreira (2009) and Moreira (2009) show that, indeed, orthogonal transformations preserve both the model (2) and the structural parameters $(\rho, \sigma^2)$, and this yields a maximal invariant statistic, the $T \times T$ matrix $Y'Y$. So, if the researcher finds it reasonable to restrict attention to statistics that are invariant to orthogonal transformations, the maximal invariant statistic plays a crucial role: a statistic is invariant to orthogonal transformations if, and only if, it depends on the data through the maximal invariant statistic $Y'Y$.

The maximal invariant statistic $Y'Y$ has a noncentral Wishart distribution and depends only on $\rho$, $\sigma^2$, and $\omega_\eta \equiv \eta'\eta/(\sigma^2 N)$. The noncentral Wishart distribution is the multivariate generalization of the noncentral chi-squared distribution, and it depends on the modified Bessel function of the first kind.
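To make the setup concrete, the following sketch simulates model (2) and numerically checks two facts stated above: that $D = I_T - \rho J_T$ inverts $B_T$, and that the maximal invariant $Y'Y$ is unchanged by an orthogonal rotation $g$ of the cross-section. This is an illustration with hypothetical parameter values, not a procedure from the paper.

```python
import numpy as np

def B_matrix(rho, T):
    # B_T is lower triangular with (t, j) entry rho^(t - j) for t >= j
    B = np.zeros((T, T))
    for t in range(T):
        for j in range(t + 1):
            B[t, j] = rho ** (t - j)
    return B

def shift_matrix(T):
    # J_T has ones on the first subdiagonal and zeros elsewhere
    return np.diag(np.ones(T - 1), -1)

def simulate_Y(rho, sigma, eta, T, rng):
    # Y_T = eta 1_T' B_T' + sigma U_T B_T'  (zero initial condition, model (2))
    N = eta.shape[0]
    U = rng.standard_normal((N, T))
    return (np.outer(eta, np.ones(T)) + sigma * U) @ B_matrix(rho, T).T

rng = np.random.default_rng(0)
N, T, rho, sigma = 500, 4, 0.5, 1.0
eta = rng.standard_normal(N)          # incidental parameters (illustrative)
Y = simulate_Y(rho, sigma, eta, T, rng)

# D = B^{-1}
B = B_matrix(rho, T)
D = np.eye(T) - rho * shift_matrix(T)
assert np.allclose(D @ B, np.eye(T))

# invariance: rotating individuals by any orthogonal g leaves Y'Y unchanged
g, _ = np.linalg.qr(rng.standard_normal((N, N)))
assert np.allclose((g @ Y).T @ (g @ Y), Y.T @ Y)
```

The second assertion is the invariance property in action: any statistic built from $Y'Y$ is automatically invariant to the choice of coordinate system for the cross-section.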
We use uniform approximations of Bessel functions (see Abramowitz and Stegun (1965)), which allow us to write the density of the noncentral Wishart distribution in a more tractable form. Specifically, the log-likelihood of $Y'Y$ is, up to an $o_p(N^{-1})$ term, proportional to
$$Q_{MN}(\theta) = -\frac{1}{2}\ln(\sigma^2) - \frac{\operatorname{tr}(DY'YD')}{2\sigma^2 NT} - \frac{\omega_\eta}{2} + \frac{(1+A_N^2)^{1/2}}{2T} - \frac{1}{2T}\ln\left(\frac{1+(1+A_N^2)^{1/2}}{2}\right), \qquad (3)$$
where $\theta = (\rho, \sigma^2, \omega_\eta)$ and $A_N = 2\left[\omega_\eta\,\mathbf{1}_T'DY'YD'\mathbf{1}_T/(\sigma^2 N)\right]^{1/2}$.

The log-likelihood $Q_{MN}(\theta)$ is free from the incidental parameter problem since $Y'Y$ is parametrized by the fixed-dimensional vector of parameters $\theta$. Although the dimension is fixed, the parameter
$$\omega_\eta \equiv \frac{\eta'\eta}{\sigma^2 N} = \frac{\sum_{i=1}^N \eta_i^2}{\sigma^2 N} \qquad (4)$$
depends on the sample size $N$. For simplicity, we omit the dependence on $N$ from the parameter $\omega_\eta$. However, the asymptotic properties of the estimator will be derived under different sequences of $\omega_\eta$. The asymptotic properties of the estimator obtained by maximizing the objective function $Q_{MN}(\theta)$ are studied by Moreira (2009), and we reproduce these results here for convenience. The information matrix $\mathcal{I}_T(\theta^*)$ in Moreira (2009) contains typographical errors, which are corrected here. Define the matrix $F$ as the $T \times T$ lower-triangular matrix with $(i,j)$ entry $\rho^{i-j}/(i-j)$ for $i > j$ and zero otherwise,
$$F = \begin{pmatrix} 0 & & & \\ \rho & 0 & & \\ \vdots & & \ddots & \\ \frac{\rho^{T-1}}{T-1} & \frac{\rho^{T-2}}{T-2} & \cdots & 0 \end{pmatrix}, \qquad F_j = \frac{d^j}{d\rho^j}F \quad \text{and} \quad F_j^* = \left.\frac{d^j}{d\rho^j}F\right|_{\rho=\rho^*}.$$

Theorem 1: Let
$$\hat{\theta}_M = \arg\max_{\theta \in \Theta} Q_{MN}(\theta). \qquad (5)$$
(A.1) Under the assumption that $N \to \infty$ with $T$ fixed: (i) if $\omega_\eta$ is fixed at $\omega_\eta^*$, then $\hat{\theta}_M \to_p \theta^* = (\rho^*, \sigma^{2*}, \omega_\eta^*)$; (ii) if $\omega_\eta \to \omega_\eta^*$, then $\hat{\theta}_M \to_p \theta^* = (\rho^*, \sigma^{2*}, \omega_\eta^*)$; and (iii) if $\limsup \omega_\eta^* < \infty$, then $\hat{\theta}_M = \theta^* + o_p(1)$, where $\theta^* = (\rho^*, \sigma^{2*}, \omega_\eta^*)$.
(A.2) Under the assumption that $T \to \infty$ and $|\rho^*| < 1$, statements (i)-(iii) of part (A.1) continue to hold.
(B) Assume that $\omega_\eta^* > 0$. Denote the score statistic and the Hessian matrix by
$$S_{MN}(\theta) = \frac{\partial Q_{MN}(\theta)}{\partial\theta} \quad \text{and} \quad H_{MN}(\theta) = \frac{\partial^2 Q_{MN}(\theta)}{\partial\theta\,\partial\theta'},$$
respectively, and define the limiting information matrix $\mathcal{I}_{MT}(\theta^*) = -\operatorname{plim}_{N\to\infty} H_{MN}(\theta^*)$, whose $(1,1)$ entry is
$$h_{MT} = \frac{\operatorname{tr}(F_1^*F_1^{*\prime})}{T} + \frac{\omega_\eta^* T}{1+\omega_\eta^* T}\,\frac{\mathbf{1}_T'F_1^*F_1^{*\prime}\mathbf{1}_T}{T^2} + \frac{1}{1+2\omega_\eta^* T}\left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2.$$
As $N \to \infty$ with $T$ fixed, (i) $\sqrt{NT}\,S_{MN}(\theta^*) \to_d N(0, \mathcal{I}_{MT}(\theta^*))$; (ii) $H_{MN}(\theta^*) \to_p -\mathcal{I}_{MT}(\theta^*)$; (iii) $\sqrt{NT}\,(\hat{\theta}_M - \theta^*) \to_d N(0, \mathcal{I}_{MT}(\theta^*)^{-1})$; and (iv) the log-likelihood ratio satisfies
$$\Lambda_N\big(\theta^* + h\,(NT)^{-1/2},\, \theta^*\big) = NT\big(Q_{MN}(\theta^* + h\,(NT)^{-1/2}) - Q_{MN}(\theta^*)\big) = h'\sqrt{NT}\,S_{MN}(\theta^*) - \tfrac{1}{2}h'\mathcal{I}_{MT}(\theta^*)h + o_{Q_{MN}(\theta^*)}(1),$$
where $\sqrt{NT}\,S_{MN}(\theta^*) \to_d N(0, \mathcal{I}_{MT}(\theta^*))$ under $Q_{MN}(\theta^*)$. Furthermore, $\hat{\theta}_M$ is asymptotically efficient within the class of regular invariant estimators for the differenced model (16) under large-$N$, fixed-$T$ asymptotics.

Part (A) of the above theorem implies that $\hat{\rho}_M \to_p \rho^*$ and $\hat{\sigma}^2_M \to_p \sigma^{2*}$ regardless of the growth rate of $N$ and $T$, as long as $NT \to \infty$. Part (B) derives the limiting distribution of $\hat{\theta}_M$. It shows, in particular, that $\hat{\rho}_M$ achieves the efficiency bound $\mathcal{I}_{MT}(\theta^*)^{-1}$ for regular invariant estimators as $N \to \infty$.
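The consistency claim in part (A) can be illustrated numerically. The sketch below maximizes the invariant log-likelihood concentrated in $\rho$ (the expression given in equation (9) below, which depends on the data only through $Y'Y$) over a grid, on data simulated with an illustrative $\rho^* = 0.5$; the grid maximizer should land near the truth. All parameter values here are hypothetical.

```python
import numpy as np

def B_matrix(rho, T):
    return np.array([[rho ** (i - j) if i >= j else 0.0 for j in range(T)]
                     for i in range(T)])

def conc_loglik(rho, W, N, T):
    # concentrated invariant log-likelihood Q_MN(rho) of equation (9);
    # W = Y'Y is the maximal invariant
    one = np.ones(T)
    D = np.eye(T) - rho * np.diag(np.ones(T - 1), -1)
    H = np.eye(T) - np.outer(one, one) / T
    V = D @ W @ D.T
    return (-(T - 1) / (2 * T) * np.log(np.trace(V @ H) / (N * (T - 1)))
            - np.log(one @ V @ one / (N * T)) / (2 * T))

rng = np.random.default_rng(1)
N, T, rho_star, sigma = 3000, 4, 0.5, 1.0
eta = rng.standard_normal(N)                      # omega_eta roughly 1
B = B_matrix(rho_star, T)
Y = (np.outer(eta, np.ones(T)) + sigma * rng.standard_normal((N, T))) @ B.T
W = Y.T @ Y

grid = np.linspace(-0.9, 1.3, 441)
rho_hat = grid[np.argmax([conc_loglik(r, W, N, T) for r in grid])]
# rho_hat should be close to rho_star, consistent with Theorem 1
```

A grid search is used only to keep the sketch dependency-free; any one-dimensional optimizer would serve.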
Regular estimators exclude superefficient estimators in the sense of Hodges-Le Cam; heuristically, a regular estimator is one whose asymptotic distribution does not change in shrinking neighborhoods of the true parameter value (see Bickel, Klaassen, Ritov, and Wellner (1998) for more details).

Part (B) of Theorem 1 finds the asymptotic distribution of $\hat{\theta}_M$, assuming that $\omega_\eta$ is fixed at $\omega_\eta^*$. A generalization of part (B) that allows for $\omega_\eta \to \omega_\eta^*$ follows from a simple application of Le Cam's third lemma.

Lemma 2: Assume $\omega_\eta = \omega_\eta^* + h/\sqrt{N}$, where $\omega_\eta^* > 0$, $h \in \mathbb{R}$, and $\omega_\eta$ is in the parameter space. As $N \to \infty$ with $T$ fixed,
$$\sqrt{NT}\,\big(\hat{\theta}_M - \theta^*\big) \to_d N\big(\bar{h}, \mathcal{I}_{MT}(\theta^*)^{-1}\big), \quad \text{where } \bar{h} = (0, 0, \sqrt{T}\,h)'$$
and $\mathcal{I}_{MT}(\theta^*)^{-1}$ is defined in Theorem 1, part (B).

Lemma 2 shows that the asymptotic distribution of the structural parameters does not change, whether $\omega_\eta$ is fixed at $\omega_\eta^*$ or $\omega_\eta \to \omega_\eta^*$. For simplicity's sake, we assume throughout the paper that the sequence $\omega_\eta$ (which can depend on the sample size $N$) is fixed at $\omega_\eta^* > 0$.

Lancaster (2002) proposes a Bayesian approach to estimate the structural parameters, which involves reparameterizing model (2) so that the information matrix is block diagonal, with the common parameters in one block and the incidental parameters in the other. He then defines his estimator as one of the local maxima of the integrated likelihood function, integrating with respect to the Lebesgue measure on $\mathbb{R}^N$. Dhaene and Jochmans (2016) give an alternative interpretation of Lancaster's estimator by showing that Lancaster's posterior can be obtained by adjusting the profile likelihood so that its score is free from asymptotic bias. In this sense, Lancaster's estimator is a bias-corrected estimator.

Lancaster's estimator seeks to maximize the following objective function:
$$Q_{LN}(\rho,\sigma^2) = -\frac{1}{2}\ln(\sigma^2) + \frac{\mathbf{1}_T'F\mathbf{1}_T}{T(T-1)} - \frac{\operatorname{tr}(DY'YD'H)}{2\sigma^2 N(T-1)}, \qquad (6)$$
where the matrix $H$ is defined as $H = I_T - \frac{1}{T}\mathbf{1}_T\mathbf{1}_T'$. The true parameter $(\rho^*, \sigma^{2*})$ is not a global maximizer of $\operatorname{plim}_{N\to\infty} Q_{LN}(\rho,\sigma^2)$, as in standard maximum likelihood theory (see Dhaene and Jochmans (2016)). Nonetheless, Theorem 3, proved by Lancaster (2002), shows that the posterior (6) can be used to consistently estimate the structural parameters.

Theorem 3: Let $S_{LN}(\rho,\sigma^2) = \partial Q_{LN}(\rho,\sigma^2)/\partial(\rho,\sigma^2)'$ be the score of the posterior (6), and let $\Theta_N$ be the set of roots of
$$S_{LN}(\rho,\sigma^2) = 0 \qquad (7)$$
corresponding to the local maxima. If that set is empty, set $\Theta_N = \{0\}$. Then there is a consistent root of equation (7).

The usefulness of Theorem 3 is limited by the fact that it only states that one (of possibly many) of the local maxima of (6) is a consistent estimator of the structural parameters. However, it does not indicate how to find a consistent estimator. Therefore, even though the posterior can be used to consistently estimate the structural parameters of model (2), Lancaster leaves unanswered the question of how to uniquely determine the consistent root of (7). To our knowledge, two different methodologies to uniquely choose the consistent root of Lancaster's score have been proposed. The first approach, by Dhaene and Jochmans (2016), suggests as a consistent estimator of the structural parameters the minimizer of the norm of Lancaster's score on an interval around the maximum likelihood estimator obtained from the maximization of the likelihood of (2). The second method, by Kruiniger (2014), uses as a consistent estimator the minimizer of a quadratic form in Lancaster's score, subject to a condition on the Hessian matrix.

Lemma 4 finds the asymptotic variance of a consistent root of (7).
Lemma 4: Assume $\omega_\eta^* > 0$ and let $(\hat{\rho}_L, \hat{\sigma}^2_L)$ be a consistent root of $S_{LN}(\rho,\sigma^2) = 0$. Let the Hessian matrix be
$$H_{LN}(\rho,\sigma^2) = \frac{\partial^2 Q_{LN}(\rho,\sigma^2)}{\partial(\rho,\sigma^2)'\,\partial(\rho,\sigma^2)},$$
and define the matrices
$$\mathcal{I}_{LT}(\theta^*) = \begin{pmatrix} h_{LT} & -\dfrac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{\sigma^{2*}T(T-1)} \\[1ex] -\dfrac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{\sigma^{2*}T(T-1)} & \dfrac{1}{2(\sigma^{2*})^2} \end{pmatrix} \quad \text{and} \quad \Sigma_T(\theta^*) = \frac{T}{T-1}\begin{pmatrix} a_{LT} & \mathcal{I}_{LT}(1,2) \\ \mathcal{I}_{LT}(1,2) & \mathcal{I}_{LT}(2,2) \end{pmatrix}, \qquad (8)$$
where
$$h_{LT} = -\frac{\mathbf{1}_T'F_2^*\mathbf{1}_T}{T(T-1)} + \frac{1}{T-1}\left(\operatorname{tr}(F_1^*F_1^{*\prime}) - \frac{\mathbf{1}_T'F_1^*F_1^{*\prime}\mathbf{1}_T}{T}\right) + \frac{\omega_\eta^*}{T-1}\left(\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T - \frac{(\mathbf{1}_T'F_1^*\mathbf{1}_T)^2}{T}\right),$$
$$a_{LT} = 2h_{LT} + \frac{2\,\mathbf{1}_T'F_2^*\mathbf{1}_T}{T(T-1)},$$
and $\mathcal{I}_{LT}(i,j)$ is the $(i,j)$-th entry of $\mathcal{I}_{LT}(\theta^*)$. When $\mathcal{I}_{LT}(\theta^*)$ is nonsingular, as $N \to \infty$ with $T$ fixed, (i) $\sqrt{NT}\,S_{LN}(\rho^*,\sigma^{2*}) \to_d N(0, \Sigma_T(\theta^*))$; (ii) $-H_{LN}(\rho^*,\sigma^{2*}) \to_p \mathcal{I}_{LT}(\theta^*)$; and (iii)
$$\sqrt{NT}\begin{pmatrix}\hat{\rho}_L - \rho^* \\ \hat{\sigma}^2_L - \sigma^{2*}\end{pmatrix} \to_d N\big(0,\, \mathcal{I}_{LT}(\theta^*)^{-1}\Sigma_T(\theta^*)\mathcal{I}_{LT}(\theta^*)^{-1}\big).$$

Lemma 4 follows from standard asymptotic theory because of the assumption that the Hessian matrix $-H_{LN}(\rho^*,\sigma^{2*})$ converges in probability to a nonsingular matrix $\mathcal{I}_{LT}(\theta^*)$. Dhaene and Jochmans (2016) prove that this assumption is violated, i.e., $\mathcal{I}_{LT}(\theta^*)$ is singular, when $T = 2$ or $\rho^* = 1$.

In contrast to Lancaster's estimator, the estimator defined in (5), based on the maximal invariant statistic's log-likelihood, is a standard maximum likelihood estimator in the sense that its objective function is a likelihood function and the limit of its objective function attains a unique global maximum at the vector of true parameters. This implies that the estimator that maximizes the maximal invariant statistic's log-likelihood is uniquely determined, and no additional procedures are necessary to obtain a consistent and asymptotically normal estimator of the structural parameters. More than that, Theorem 1 shows that it attains the minimum variance bound for invariant regular estimators of the structural parameters of model (2). Therefore, the estimator defined in (5) efficiently uses all information available in the maximal invariant statistic $Y'Y$.

Lancaster's estimator is also invariant to orthogonal transformations since it depends on the data only through the maximal invariant statistic $Y'Y$; see equation (6).
However, it is not uniquely determined, and additional steps, such as those proposed by Dhaene and Jochmans (2016) and Kruiniger (2014), are necessary to choose the consistent root among the (possibly) many roots of Lancaster's score function. Furthermore, the probability limit of the Hessian of $Q_{LN}(\rho,\sigma^2)$ is not necessarily proportional to the asymptotic variance (AVar) of the score. Since the information equality does not hold, the asymptotic variance of Lancaster's consistent estimator cannot attain the lower bound for invariant regular estimators found in Theorem 1.

The explanation for the shortcomings of Lancaster's estimator is that, even though it is invariant, its objective function ignores information available in the maximal invariant by using only part of the maximal invariant statistic's log-likelihood function. Indeed, the maximal invariant statistic's log-likelihood function (3), concentrated out of $\sigma^2$ and $\omega_\eta$, is
$$Q_{MN}(\rho) = -\frac{T-1}{2T}\ln\left(\frac{\operatorname{tr}(DY'YD'H)}{N(T-1)}\right) - \frac{1}{2T}\ln\left(\frac{\mathbf{1}_T'DY'YD'\mathbf{1}_T}{NT}\right), \qquad (9)$$
while Lancaster's objective function, concentrated out of $\sigma^2$, is
$$Q_{LN}(\rho) = \frac{\mathbf{1}_T'F\mathbf{1}_T}{T(T-1)} - \frac{1}{2}\ln\left(\frac{\operatorname{tr}(DY'YD'H)}{N(T-1)}\right),$$
which allows us to write
$$Q_{MN}(\rho) = \frac{T-1}{T}Q_{LN}(\rho) + Q_{M-L,N}(\rho), \quad \text{where} \qquad (10)$$
$$Q_{M-L,N}(\rho) = -\frac{\mathbf{1}_T'F\mathbf{1}_T}{T^2} - \frac{1}{2T}\ln\left(\frac{\mathbf{1}_T'DY'YD'\mathbf{1}_T}{NT}\right).$$
The respective score functions are given by
$$S_{LN}(\rho) = \frac{\mathbf{1}_T'F_1\mathbf{1}_T}{T(T-1)} + \frac{\operatorname{tr}(J_TY'YD'H)}{\operatorname{tr}(DY'YD'H)}$$
and
$$S_{MN}(\rho) = \frac{T-1}{T}S_{LN}(\rho) + S_{M-L,N}(\rho), \quad \text{where} \qquad (11)$$
$$S_{M-L,N}(\rho) = -\frac{\mathbf{1}_T'F_1\mathbf{1}_T}{T^2} + \frac{1}{T}\,\frac{\mathbf{1}_T'J_TY'YD'\mathbf{1}_T}{\mathbf{1}_T'DY'YD'\mathbf{1}_T}.$$
The decomposition (10) implies an orthogonal decomposition of the concentrated score function:
Theorem 5: Let $S_{MN}(\rho)$, $S_{LN}(\rho)$, and $S_{M-L,N}(\rho)$ be the score functions associated with $Q_{MN}(\rho)$, $Q_{LN}(\rho)$, and $Q_{M-L,N}(\rho)$, respectively. Then
$$S_{MN}(\rho) = \frac{T-1}{T}S_{LN}(\rho) + S_{M-L,N}(\rho), \qquad (12)$$
where
$$S_{M-L,N}(\rho^*) \to_p 0 \qquad (13)$$
and
$$\begin{pmatrix} \sqrt{NT}\,S_{LN}(\rho^*) \\ \sqrt{NT}\,S_{M-L,N}(\rho^*) \end{pmatrix} \to_d N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix} b_T & 0 \\ 0 & c_T \end{pmatrix}\right), \qquad (14)$$
with
$$b_T = a_{LT} - \frac{2}{T(T-1)}\left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2 > 0,$$
$a_{LT}$ as defined in Lemma 4, and
$$c_T = \frac{2}{T(1+\omega_\eta^* T)}\left(\frac{\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T}{T} - \left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2\right) > 0.$$

A consistent estimator for $\rho^*$ is expected to be more efficient the closer it stays to the maximum likelihood estimator $\hat{\rho}_M$. This implies that the consistent estimator $\hat{\rho}_L$ might have low efficiency because it only uses part of the full log-likelihood (10). One may use $|S_{MN}(\hat{\rho}_L)|$ as a measure of how "close" any inflection point $\hat{\rho}_L$ is to $\hat{\rho}_M$ and, because $S_{MN}(\hat{\rho}_L) = S_{M-L,N}(\hat{\rho}_L)$, $\hat{\rho}_L$ will be more efficient the smaller $|S_{M-L,N}(\hat{\rho}_L)|$ is. The term $S_{M-L,N}(\cdot)$, evaluated at the true value $\rho^*$ or at a consistent estimator, converges in probability to zero. This suggests choosing among the inflection points of Lancaster's score by looking at $S_{M-L,N}(\hat{\rho}_L)$ to uniquely determine the consistent root, akin to Dhaene and Jochmans (2016) and Kruiniger (2014). Instead, we can use the information in $S_{M-L,N}(\cdot)$ to increase the efficiency of $\hat{\rho}_L$. Simple algebra shows that
$$\big|S_{M-L,N}(\hat{\rho}_L)\big| = \left|\frac{T-1}{T}\,\frac{\operatorname{tr}(J_TY'Y\hat{D}_L'H)}{\operatorname{tr}(\hat{D}_LY'Y\hat{D}_L'H)} + \frac{1}{T}\,\frac{\mathbf{1}_T'J_TY'Y\hat{D}_L'\mathbf{1}_T}{\mathbf{1}_T'\hat{D}_LY'Y\hat{D}_L'\mathbf{1}_T}\right|,$$
where $\hat{D}_L$ is the matrix $D$ evaluated at $\hat{\rho}_L$. Notice that $\hat{\rho}_L$ is inefficient since $\sqrt{NT}\,S_{M-L,N}(\hat{\rho}_L)$ does not converge in probability to zero.
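The decomposition (10) can be verified numerically. The sketch below evaluates $Q_{LN}(\rho)$, $Q_{M-L,N}(\rho)$, and $Q_{MN}(\rho)$ on simulated data and checks the identity at several values of $\rho$. Here $F$ is taken to be the lower-triangular matrix with $(i,j)$ entry $\rho^{i-j}/(i-j)$, as defined above; note that the $F$-terms enter $Q_{LN}$ and $Q_{M-L,N}$ with offsetting weights, so they cancel exactly in the combination. Simulation parameters are illustrative.

```python
import numpy as np

def ones_F_ones(rho, T):
    # 1'F1 with F_{ij} = rho^(i-j)/(i-j) for i > j, zero otherwise
    return sum((T - k) * rho ** k / k for k in range(1, T))

def pieces(rho, W, N, T):
    # the three concentrated objectives in (9) and (10); W = Y'Y
    one = np.ones(T)
    D = np.eye(T) - rho * np.diag(np.ones(T - 1), -1)
    H = np.eye(T) - np.outer(one, one) / T
    V = D @ W @ D.T
    bF = ones_F_ones(rho, T)
    Q_L  = bF / (T * (T - 1)) - 0.5 * np.log(np.trace(V @ H) / (N * (T - 1)))
    Q_ML = -bF / T**2 - np.log(one @ V @ one / (N * T)) / (2 * T)
    Q_M  = (-(T - 1) / (2 * T) * np.log(np.trace(V @ H) / (N * (T - 1)))
            - np.log(one @ V @ one / (N * T)) / (2 * T))
    return Q_M, Q_L, Q_ML

rng = np.random.default_rng(2)
N, T, rho_star = 800, 5, 0.4
B = np.array([[rho_star ** (i - j) if i >= j else 0.0 for j in range(T)]
              for i in range(T)])
Y = (np.outer(rng.standard_normal(N), np.ones(T))
     + rng.standard_normal((N, T))) @ B.T
W = Y.T @ Y

for rho in (-0.3, 0.1, 0.4, 0.9):
    Q_M, Q_L, Q_ML = pieces(rho, W, N, T)
    # decomposition (10): Q_MN = ((T-1)/T) Q_LN + Q_{M-L,N}
    assert np.isclose(Q_M, (T - 1) / T * Q_L + Q_ML)
```

Because the identity holds at every $\rho$ (not just the truth), it also holds for the score functions, which is the content of equation (12).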
Instead of choosing the consistent root following Lancaster's methodology, we could add the information in $S_{M-L,N}(\cdot)$ to $S_{LN}(\cdot)$ by using (12). The associated estimator $\hat{\rho}_M$ uses the information in both $S_{M-L,N}(\cdot)$ and $S_{LN}(\cdot)$ efficiently and is uniquely defined, making the estimation problem simpler and more objective.

Theorem 1 shows that the estimator for the structural parameters $(\rho, \sigma^2)$ attains the efficiency bound for model (1) when the first observation is $y_{i,1} = 0$. Lancaster (2002) suggests working with the differenced data $\tilde{y}_{i,t} = y_{i,t} - y_{i,1}$ so that we may continue working with an initial observation equal to zero. Instead, we draw inference on $(\rho, \sigma^2)$ by conditioning on the first observation $y_{i,1}$. We will show the differencing method is indeed less efficient than the conditional method.

By using model (1) with the variables $y_{i,t}$ in levels, there are more observations that can be used to estimate the parameters. On the other hand, the number of parameters to be estimated increases. In this section, we show that the estimator that maximizes the likelihood of the maximal invariant statistic of the original model (1) has asymptotic variance that is no larger than the variance of the estimator (5) and, under very general conditions, using $y_{i,t}$ in levels will be strictly more efficient.

When we allow for a nonzero initial condition, model (1) becomes
$$Y_T = \rho\, y\, e_1'B_T' + \eta\mathbf{1}_T'B_T' + \sigma U_T B_T', \quad \text{where} \qquad (15)$$
$$U_T \sim N(0_{N\times T}, I_N \otimes I_T),$$
$y = (y_{1,1}, \dots, y_{N,1})'$ is the vector of first observations and $e_1 = (1, 0, \dots, 0)'$ is the canonical vector.

Lancaster (2002) works with a differenced version of model (1) given by
$$\tilde{y}_{i,t+1} = \rho\tilde{y}_{i,t} + \tilde{\eta}_i + \sigma u_{i,t}, \qquad i = 1, \dots, N;\; t = 1, \dots, T, \qquad (16)$$
$$\tilde{y}_{i,t} \equiv y_{i,t} - y_{i,1}, \qquad i = 1, \dots, N;\; t = 1, \dots, T+1,$$
$$\tilde{\eta}_i \equiv \eta_i - y_{i,1}(1-\rho), \qquad i = 1, \dots, N,$$
and seeks inference based on $\tilde{y}_{i,1} = 0$ for all $i$. In matrix form,
$$Y_{T+1} - y\,\mathbf{1}_{T+1}' = [\,0_{N\times 1} : Y_T - y\,\mathbf{1}_T'\,].$$
The second term equals
$$Y_T - y\,\mathbf{1}_T' = y(\rho e_1'B_T' - \mathbf{1}_T') + \eta\mathbf{1}_T'B_T' + \sigma U B_T' = [\eta - (1-\rho)y]\,\mathbf{1}_T'B_T' + \sigma U B_T'.$$
Defining the differenced variables and incidental parameters, $\tilde{Y}_T = Y_T - y\,\mathbf{1}_T'$ and $\tilde{\eta} = \eta - (1-\rho)y$, we are back to the model with the first observation equal to zero:
$$\tilde{Y}_T = \tilde{\eta}\,\mathbf{1}_T'B_T' + \sigma U_T B_T', \qquad U_T \sim N(0_{N\times T}, I_N \otimes I_T), \qquad (17)$$
where the incidental parameter is $\tau = \tilde{\eta}$.

Differencing eliminates one time period from the data. Instead, we will condition on the initial observation $y$ itself. It is convenient to work with the linear setup of Chamberlain and Moreira (2009):
$$Y = x\,a(\gamma) + \tau\,b(\gamma) + U\,c(\gamma), \qquad (18)$$
where $x \in \mathbb{R}^{N\times K}$ are explanatory variables, $\tau \in \mathbb{R}^{N\times J}$ are the incidental parameters, and $a(\cdot)$, $b(\cdot)$, $c(\cdot)$ are given functions of the unknown parameter of interest $\gamma$. Consider the group of transformations $g\,Y$, where $g$ are orthogonal matrices such that $g\,x = x$. This group modifies the unknown incidental parameter $\tau$ but preserves the model and the parameter $\gamma$ of interest:
$$gY = gx\,a(\gamma) + g\tau\,b(\gamma) + gU\,c(\gamma) = x\,a(\gamma) + (g\tau)\,b(\gamma) + U\,c(\gamma)$$
in distribution, because the law of $gU$ is the same as the law of $U$. This group yields the maximal invariant statistic, which is given by the pair
$$Z_1 = (x'x)^{-1/2}x'Y \quad \text{and} \quad Z_2'Z_2 = Y'M_xY, \qquad (19)$$
where $Z_2 = q_2'Y$ for any matrix $q_2$ such that the matrix $q = \big[\,x(x'x)^{-1/2} : q_2\,\big]$ is orthogonal. Because $q$ is an orthogonal matrix, we have $q_2q_2' = I_N - N_x = M_x$ for $N_x = x(x'x)^{-1}x'$. We refer the reader to Chamberlain and Moreira (2009) for more details.

Define the parameters
$$\omega_{\tau,x} = \frac{\tau'M_x\tau}{\sigma^2 N} \quad \text{and} \quad \delta_{\tau,x} = (x'x)^{-1}x'\tau. \qquad (20)$$
The coefficient $\delta_{\tau,x}$ is the ordinary least squares (OLS) coefficient, and $\omega_{\tau,x}$ is the standardized average of the sum of squared residuals from a fictitious regression of $\tau$ on $x$ (if there are no covariates $x$, we define $\omega_\tau = (\sigma^2 N)^{-1}\tau'\tau$).

For the model (18), the statistics $Z_1$ and $Z_2'Z_2$ are independently normal and noncentral Wishart distributed. Their distributions depend on $(\rho, \sigma^2, \delta_{\tau,x})$ and $(\rho, \sigma^2, \omega_{\tau,x})$, respectively.

If we condition on the initial observation $y$, the model (15) for $Y_T$ is a special case of (18), where $x = y$, $\tau = \eta$, and the regression coefficients are
$$a(\gamma) = \rho e_1'B_T', \quad b(\gamma) = \mathbf{1}_T'B_T', \quad \text{and} \quad c(\gamma) = \sigma B_T'.$$
Let $\theta = (\rho, \sigma^2, \delta_{\eta,y}, \omega_{\eta,y}) \in \mathbb{R}^4$ be the parameters of the joint distribution of (19). The likelihood estimator that maximizes the joint likelihood of (19) does not have the incidental parameter problem since it is parametrized by $\theta$, which has fixed dimension.

If we instead difference out the data, as Lancaster (2002) does, the model (17) for $\tilde{Y}_T$ is a special case of (18), in which $x$ is absent, the incidental parameter $\tau$ is $\tilde{\eta} = \eta - (1-\rho)y$, and the regression coefficients are $b(\gamma) = \mathbf{1}_T'B_T'$ and $c(\gamma) = \sigma B_T'$.

We now show that Lancaster (2002)'s differencing method entails unnecessary efficiency loss. We begin by defining the estimator based on the likelihood of $(Z_1, Z_2'Z_2)$; we write $M_1$ to distinguish it from the differenced-data estimator (5). Let
$$\hat{\theta}_{M_1} = \arg\max_{\theta\in\Theta} Q_{M_1N}(\theta), \qquad (21)$$
where
$$Q_{M_1N}(\theta) = -\frac{1}{2}\ln(\sigma^2) - \frac{\omega_{\eta,y}}{2} - \frac{\operatorname{tr}(DZ_2'Z_2D')}{2\sigma^2 NT} + \frac{(1+A_{1,N}^2)^{1/2}}{2T} \qquad (22)$$
$$- \frac{1}{2\sigma^2 NT}\big(Z_1D' - \|y\|(\rho e_1' + \delta_{\eta,y}\mathbf{1}_T')\big)\big(Z_1D' - \|y\|(\rho e_1' + \delta_{\eta,y}\mathbf{1}_T')\big)' - \frac{1}{2T}\ln\left(\frac{1+(1+A_{1,N}^2)^{1/2}}{2}\right)$$
is, up to $o_p(N^{-1})$ terms, proportional to the (conditional on $y$) log-likelihood of $(Z_1, Z_2'Z_2)$, with $\|y\| = (y'y)^{1/2}$ and $A_{1,N} = 2\big[\omega_{\eta,y}\,\mathbf{1}_T'DZ_2'Z_2D'\mathbf{1}_T/(\sigma^2 N)\big]^{1/2}$. The asymptotic behavior of the estimator $\hat{\theta}_{M_1}$ is given next.
Lemma 6: Assume that the true parameters $\delta_{\eta,y}$ and $\omega_{\eta,y}$ are fixed, respectively, at $\delta_{\eta,y}^*$ and $\omega_{\eta,y}^* > 0$, and that $0 < \lim_{N\to\infty} y'y/N \equiv \|y\|_\infty^2 < \infty$. As $N \to \infty$ with $T$ fixed,
(A) $\hat{\theta}_{M_1} \to_p \theta^* = (\rho^*, \sigma^{2*}, \delta_{\eta,y}^*, \omega_{\eta,y}^*)$.
(B) Let $Q_{M_1N}(\rho,\sigma^2)$ be the objective function (22) concentrated out of $\delta_{\eta,y}$ and $\omega_{\eta,y}$, and denote the score statistic and the Hessian matrix by
$$S_{M_1N}(\rho,\sigma^2) = \frac{\partial Q_{M_1N}(\rho,\sigma^2)}{\partial(\rho,\sigma^2)'} \quad \text{and} \quad H_{M_1N}(\rho,\sigma^2) = \frac{\partial^2 Q_{M_1N}(\rho,\sigma^2)}{\partial(\rho,\sigma^2)'\,\partial(\rho,\sigma^2)},$$
respectively. Also, define the matrix
$$\mathcal{I}_{M_1T}(\theta^*) = \begin{pmatrix} d_{M_1T} & -\dfrac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{\sigma^{2*}T^2} \\[1ex] -\dfrac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{\sigma^{2*}T^2} & \dfrac{1}{2(\sigma^{2*})^2} \end{pmatrix},$$
where
$$d_{M_1T} = \frac{1}{T(1+\omega_{\eta,y}^*T)}\left(\frac{\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T}{T} + \omega_{\eta,y}^*T\left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2\right) - \frac{1}{T}\left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2$$
$$+ \left(\omega_{\eta,y}^* + \frac{\|y\|_\infty^2}{\sigma^{2*}}\big(\delta_{\eta,y}^* + \rho^* - 1\big)^2\right)\left(\frac{\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T}{T} - \left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2\right) + \frac{\operatorname{tr}(F_1^*F_1^{*\prime})}{T} - \frac{\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T}{T^2}.$$
Then, (i) $\sqrt{NT}\,S_{M_1N}(\rho^*,\sigma^{2*}) \to_d N(0, \mathcal{I}_{M_1T}(\theta^*))$; (ii) $H_{M_1N}(\rho^*,\sigma^{2*}) \to_p -\mathcal{I}_{M_1T}(\theta^*)$; and (iii)
$$\sqrt{NT}\begin{pmatrix}\hat{\rho}_{M_1} - \rho^* \\ \hat{\sigma}^2_{M_1} - \sigma^{2*}\end{pmatrix} \to_d N\big(0, \mathcal{I}_{M_1T}(\theta^*)^{-1}\big).$$

Lemma 7 below shows that, in general, (21) estimates the structural parameters with strictly smaller variances than the original estimator that uses differenced data, as defined in (5).
Lemma 7: If $\delta_{\eta,y}^* + \rho^* \neq 1$, then $\operatorname{AVar}(\hat{\rho}_M) > \operatorname{AVar}(\hat{\rho}_{M_1})$ and $\operatorname{AVar}(\hat{\sigma}^2_M) > \operatorname{AVar}(\hat{\sigma}^2_{M_1})$. If $\delta_{\eta,y}^* + \rho^* = 1$, then $\operatorname{AVar}(\hat{\rho}_M) = \operatorname{AVar}(\hat{\rho}_{M_1})$ and $\operatorname{AVar}(\hat{\sigma}^2_M) = \operatorname{AVar}(\hat{\sigma}^2_{M_1})$.

6 Random Effects and Moment Conditions
In Section 5 we draw inference on the structural parameters by conditioning or by differencing the data based on the first observed initial condition $y_{i,1}$. Both methods are fixed-effects approaches, which do not rely on any further assumptions on unobserved data such as $y_{i,0}$. We could instead consider random-effects approaches, which rely on assumptions for the unobserved value $y_{i,0}$. These include Chamberlain's (correlated) random effects and Blundell and Bond (1998)'s stationarity assumptions for the first unobserved value $y_{i,0}$.

In this section, we compare the conditional method with the random-effects methods using the first moments of
$$(x'x)^{-1}x'Y \quad \text{and} \quad \frac{Y'Y}{N}. \qquad (23)$$
Define the function $\pi(\gamma, \delta_{\tau,x}) = a(\gamma) + \delta_{\tau,x}\,b(\gamma)$. The expectation of the maximal invariant is then given by
$$E\big[(x'x)^{-1}x'Y\big] = \pi(\gamma, \delta_{\tau,x})$$
and
$$E\left[\frac{Y'Y}{N}\right] = \sigma^2\big[\pi(\gamma,\delta_{\tau,x})'\,\omega_x\,\pi(\gamma,\delta_{\tau,x}) + b(\gamma)'\,\omega_{\tau,x}\,b(\gamma)\big] + c(\gamma)'c(\gamma),$$
where $\omega_x = x'x/(\sigma^2 N)$. For example, Lancaster (2002)'s differencing approach yields
$$E\left[\frac{\tilde{Y}_T'\tilde{Y}_T}{N}\right] = \sigma^2 B_T\big[\omega_{\tilde{\eta}}\mathbf{1}_T\mathbf{1}_T' + I_T\big]B_T'.$$
Hence, we have $(T+1)T/2$ moments for three parameters: $\rho$, $\sigma^2$, and $\omega_{\tilde{\eta}}$.

Conditioning on $y$ yields the following (conditional) moments based on the maximal invariant:
$$E\big[(y'y)^{-1}y'Y_T \mid y\big] = \rho e_1'B_T' + \delta_{\eta,y}\mathbf{1}_T'B_T' \qquad (24)$$
and
$$E\left[\frac{Y_T'Y_T}{N}\,\Big|\, y\right] = \sigma^2 B_T\big\{\omega_{\eta,y}\mathbf{1}_T\mathbf{1}_T' + I_T\big\}B_T' + \sigma^2\big[\rho B_Te_1 + \delta_{\eta,y}B_T\mathbf{1}_T\big]\,\omega_y\,\big[\rho e_1'B_T' + \delta_{\eta,y}\mathbf{1}_T'B_T'\big], \qquad (25)$$
where $\omega_y = y'y/(\sigma^2 N)$. The unknown parameters are the autoregressive coefficient $\rho$, the error variance $\sigma^2$, the OLS coefficient $\delta_{\eta,y}$, and the standardized squared residuals $\omega_{\eta,y}$.

Invariance reduces the information to $T + (T+1)T/2 = (T+1)(T+2)/2 - 1$ moments. Conditioning thus yields $T$ more moments than Lancaster (2002)'s differencing method and one additional parameter (four instead of three).
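Because $Y_T$ is linear in $U_T$, the conditional moment formulas (24) and (25) can be checked exactly: $E[Y_T \mid y]$ is obtained by setting $U_T = 0$ in model (15), and $E[U_T'U_T] = N I_T$ contributes the $\sigma^2 B_TB_T'$ term. The sketch below verifies both formulas for an arbitrary (illustrative) draw of $y$ and $\eta$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, rho, sigma = 200, 4, 0.6, 1.3
y = rng.standard_normal(N) + 1.0          # observed initial values (illustrative)
eta = 0.5 * y + rng.standard_normal(N)    # fixed effects, correlated with y

one = np.ones(T)
e1 = np.eye(T)[0]
B = np.array([[rho ** (i - j) if i >= j else 0.0 for j in range(T)]
              for i in range(T)])

delta = (y @ eta) / (y @ y)                                   # OLS coefficient
omega_eta_y = (eta @ eta - delta * (y @ eta)) / (sigma**2 * N)  # residual term
omega_y = (y @ y) / (sigma**2 * N)

# E[Y_T | y]: model (15) with U = 0
EY = rho * np.outer(y, e1) @ B.T + np.outer(eta, one) @ B.T

# (24): E[(y'y)^{-1} y' Y_T | y] = rho e1' B' + delta 1' B'
lhs24 = y @ EY / (y @ y)
rhs24 = rho * e1 @ B.T + delta * one @ B.T
assert np.allclose(lhs24, rhs24)

# (25): E[Y_T'Y_T / N | y]; E[U'U] = N I contributes sigma^2 B B'
EYY = (EY.T @ EY + sigma**2 * N * B @ B.T) / N
v = B @ (rho * e1 + delta * one)
rhs25 = (sigma**2 * B @ (omega_eta_y * np.outer(one, one) + np.eye(T)) @ B.T
         + sigma**2 * omega_y * np.outer(v, v))
assert np.allclose(EYY, rhs25)
```

The check works for any $y$ and $\eta$ because both sides are the same quadratic form in the model's mean and variance components.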
This comparison helps to explain the efficiency gains from conditioning instead of differencing. We can then explore the conditional moments (24) and (25) to find Minimum Distance (MD) estimators. We further note that Arellano and Bond (1991) and Ahn and Schmidt (1995) implicitly use linear combinations of the moments (24) and (25) to find moments for the autoregressive parameter $\rho$.

We instead explore connections between these moments, conditional on the first observed value $y_{i,1}$, and random-effects assumptions for the unobserved value $y_{i,0}$. To simplify comparison, we continue working with the model without covariates and with homoskedastic errors.

As a preliminary to the correlated random effects assumption, consider the model written recursively back to time $t = 0$:
$$Y_{T+1} = \rho\, y_0\, e_1'B_{T+1}' + \eta\mathbf{1}_{T+1}'B_{T+1}' + \sigma U_{T+1}B_{T+1}', \quad \text{where} \qquad (26)$$
$$U_{T+1} \sim N\big(0_{N\times(T+1)}, I_N \otimes I_{T+1}\big),$$
and $y_0$ is unobserved. This model is again a special case of the linear setup of Chamberlain and Moreira (2009) without covariates $x$. The incidental parameter $\tau = [\,y_0 : \eta\,]$ would encapsulate the individual fixed effects and the initial conditions themselves. The regression coefficients would be given by
$$b(\gamma) = \begin{pmatrix}\rho e_1' \\ \mathbf{1}_{T+1}'\end{pmatrix}B' \quad \text{and} \quad c(\gamma) = \sigma B'$$
(where we again omit the subscript, here $T+1$, from the matrices when there is no confusion).

The maximal invariant is $Y_{T+1}'Y_{T+1}$, which contains $(T+1)(T+2)/2$ moments:
$$E\left[\frac{Y_{T+1}'Y_{T+1}}{N}\right] = \sigma^2 B_{T+1}\left\{\big[\rho e_1 : \mathbf{1}_{T+1}\big]\,\omega_\tau\begin{pmatrix}\rho e_1' \\ \mathbf{1}_{T+1}'\end{pmatrix} + I_{T+1}\right\}B_{T+1}'.$$
The total number of parameters is five: the autoregressive parameter $\rho$, the error variance $\sigma^2$, and the three nonredundant elements of $\omega_\tau$. The parameters need to be well-behaved as the sample size $N$ grows, and inference is based on the asymptotic normality of their respective estimators.
If different series start at different points in time (as typically happens with firms' data), then the parameter $\omega_{y_0}$ may not be well-behaved, or its estimator may not be asymptotically normal. For example, moment equations depend on terms such as
$$\sum_{i=1}^N y_{i,0}\,u_{i,t}.$$
The normal approximation can be poor when most of the $y_{i,0}$ are close to zero. This happens because the variance ratio of some terms $y_{i,0}u_{i,t}$ to their sum,
$$\frac{\max_{j\leq N}\operatorname{Var}(y_{j,0}u_{j,t})}{\sum_{i=1}^N \operatorname{Var}(y_{i,0}u_{i,t})},$$
may not be negligible. This problem is mitigated by shocks over time; hence, conditioning on the observations $y_{i,1}$ is more robust.

For us to make a connection to the CRE estimator, we need to include the vector of ones in $x = \mathbf{1}_N$. The reason is that the (correlated) random effects assumption
$$[\,y_0 : \eta\,] \mid x \sim N(x\iota, I_N \otimes \Phi), \quad \text{where } \iota = (\iota_1, \iota_2),$$
would need to be invariant to our group of transformations to be decomposed into an invariant uniform prior and an additional term (and we want to allow the random effects to have a nonzero mean even without additional regressors).

The setup is the same as treating the initial condition as an incidental parameter. However, we include $x = \mathbf{1}_N$ and define the regression coefficient on the vector of ones, as in Section 7 of Chamberlain and Moreira (2009). The maximal invariant is the pair $\mathbf{1}_N'Y_{T+1}$ and $Y_{T+1}'Y_{T+1}$. Their (conditional on $\tau = [\,y_0 : \eta\,]$) expectations are given by
$$E\left[\frac{\mathbf{1}_N'Y_{T+1}}{N}\right] = \delta_{\tau,\mathbf{1}_N}\begin{pmatrix}\rho e_1' \\ \mathbf{1}_{T+1}'\end{pmatrix}B_{T+1}'$$
and
$$E\left[\frac{Y_{T+1}'Y_{T+1}}{N}\right] = B_{T+1}\left\{\big[\rho e_1 : \mathbf{1}_{T+1}\big]\big(\sigma^2\omega_{\tau,\mathbf{1}_N} + \delta_{\tau,\mathbf{1}_N}'\delta_{\tau,\mathbf{1}_N}\big)\begin{pmatrix}\rho e_1' \\ \mathbf{1}_{T+1}'\end{pmatrix} + \sigma^2 I_{T+1}\right\}B_{T+1}'.$$
The unknown parameters are the autoregressive coefficient $\rho$, the error variance $\sigma^2$, the sample averages $\delta_{\tau,\mathbf{1}_N}$, and the standardized squared deviations $\omega_{\tau,\mathbf{1}_N}$.
Hence, the invariance argument reduces the data space to $(T+1) + (T+1)(T+2)/2$ statistics. As before, if different series start at different points in time, the parameter $\omega_{\tau,N}$ may not be well-behaved and its estimator may not be asymptotically normal. (The Lindeberg condition holds if and only if the sum is approximately normal and the terms are asymptotically negligible.)

Under the random effects assumption, the model becomes
$$Y_{T+1} = 1_N a(\gamma) + U_{T+1} c(\gamma),$$
where
$$a(\gamma) = \iota \begin{bmatrix} \rho e_1' \\ 1_{T+1}' \end{bmatrix} B_{T+1}' \quad\text{and}\quad c(\gamma)' c(\gamma) = B_{T+1} \left\{ \begin{bmatrix} \rho e_1 & 1_{T+1} \end{bmatrix} \Phi \begin{bmatrix} \rho e_1' \\ 1_{T+1}' \end{bmatrix} + \sigma^2 I_{T+1} \right\} B_{T+1}'.$$
The (unconditional) expectation of the maximal invariant has the same functional form as the moments conditional on $\tau = [y_0 \;\; \eta]$:
$$E\left[\frac{1_N' Y_{T+1}}{N}\right] = \iota \begin{bmatrix} \rho e_1' \\ 1_{T+1}' \end{bmatrix} B_{T+1}'$$
and
$$E\left[\frac{Y_{T+1}' Y_{T+1}}{N}\right] = B_{T+1} \left\{ \begin{bmatrix} \rho e_1 & 1_{T+1} \end{bmatrix} \left(\Phi + \iota' \iota\right) \begin{bmatrix} \rho e_1' \\ 1_{T+1}' \end{bmatrix} + \sigma^2 I_{T+1} \right\} B_{T+1}'.$$
So the criticism on lack of robustness to the data-generating process for $y_{i,0}$ is applicable here as well.

Blundell and Bond (1998) instead consider a different assumption, that $y_0$ does not deviate systematically from the stationary mean $\eta/(1-\rho)$. We start with the assumption
$$y_0 \sim N\left(\frac{\eta}{1-\rho}, \sigma_0^2 I_N\right).$$
We assume normality for purposes of invariance. However, only the mean and variance are relevant for the expectation calculations derived below. Under this assumption, the model is equivalent to
$$Y_{T+1} = \eta b(\gamma) + U_{T+1} c(\gamma),$$
where the coefficients are given by
$$b(\gamma) = 1_{T+1}' B_{T+1}' + \frac{\rho}{1-\rho} e_1' B_{T+1}' \quad\text{and}\quad c(\gamma)' c(\gamma) = B_{T+1} \left(\sigma_0^2 \rho^2 e_1 e_1' + \sigma^2 I_{T+1}\right) B_{T+1}'.$$
The maximal invariant is given by $Y_{T+1}' Y_{T+1}$:
$$E\left[\frac{Y_{T+1}' Y_{T+1}}{N}\right] = B_{T+1} \left(\sigma_0^2 \rho^2 e_1 e_1' + \sigma^2 I_{T+1}\right) B_{T+1}' + \sigma^2 \omega_\eta B_{T+1} \left[ 1_{T+1} 1_{T+1}' + \frac{\rho}{1-\rho} \left(e_1 1_{T+1}' + 1_{T+1} e_1'\right) + \frac{\rho^2}{(1-\rho)^2} e_1 e_1' \right] B_{T+1}'.$$
The unknown parameters are $\rho$, $\sigma^2$, $\omega_\eta$, and $\sigma_0^2$. We can re-arrange some of these $(T+1)(T+2)/2$ moments to obtain simple moment conditions for $\rho$.
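The expectation formula under the mean-stationarity assumption can be checked by simulation. The sketch below (our own construction, with illustrative parameter values) draws $y_0$ around the stationary mean $\eta/(1-\rho)$, simulates the panel, and compares $Y'Y/N$ to the theoretical expression $c(\gamma)'c(\gamma) + \sigma^2 \omega_\eta\, b(\gamma)'b(\gamma)$:

```python
import numpy as np

# Hedged numerical check (our construction, not the paper's code): under
# y_0 ~ N(eta/(1-rho), sigma0^2 I_N), the sample second-moment matrix Y'Y/N
# should approach  B(sigma0^2 rho^2 e1 e1' + sigma^2 I)B'
#                  + sigma^2 omega_eta * b(gamma)' b(gamma).
rng = np.random.default_rng(1)
N, T1 = 400_000, 3
rho, sigma, sigma0 = 0.5, 1.0, 0.8
eta = rng.normal(0.2, 0.5, size=N)           # fixed effects, drawn once and held fixed

t = np.arange(T1)
B = np.tril(rho ** (t[:, None] - t[None, :]))
e1 = np.zeros(T1); e1[0] = 1.0
ones = np.ones(T1)

# Simulate the panel: y_0 centered at the stationary mean, then recurse.
y = eta / (1 - rho) + sigma0 * rng.normal(size=N)
Y = np.empty((N, T1))
for s in range(T1):
    y = rho * y + eta + sigma * rng.normal(size=N)
    Y[:, s] = y

# Theoretical moments: omega_eta is the standardized second moment of eta.
omega_eta = (eta @ eta) / (N * sigma**2)
b = (ones + rho / (1 - rho) * e1) @ B.T                      # b(gamma)
cc = B @ (sigma0**2 * rho**2 * np.outer(e1, e1)
          + sigma**2 * np.eye(T1)) @ B.T                     # c(gamma)'c(gamma)
theory = cc + sigma**2 * omega_eta * np.outer(b, b)

print(np.max(np.abs(Y.T @ Y / N - theory)))  # small Monte Carlo error
```

The agreement holds term by term because $Y = \eta b(\gamma) + \text{error}$ with error second moment $c(\gamma)'c(\gamma)$; replacing the stationary draw of $y_0$ by, say, a constant breaks the match.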
For example,
$$E\left[(y_{i,2} - y_{i,1})(y_{i,3} - \rho y_{i,2})\right] = 0. \quad (27)$$
As Blundell and Bond (1998) also note, this expectation may fail to be zero if the stationarity assumption breaks down. For example, this moment condition breaks down if either $y_{i,0} = k$ or even if $y_{i,0} \overset{iid}{\sim} N(0, \sigma_0^2)$.

It is worthwhile making a connection between the stationarity assumption and inference conditional on the first observations. Since inference is conditional on $y_0$, we could stack the quantities together and look at the (conditional) expectation of
$$Y_{T+1}' Y_{T+1} = \begin{bmatrix} y_0' y_0 & y_0' Y_T \\ Y_T' y_0 & Y_T' Y_T \end{bmatrix}.$$
As in the derivation based on that of Blundell and Bond (1998), we obtain exactly the same quantity $Y_{T+1}' Y_{T+1}$. However, we have five parameters (if we consider $y_0' y_0$ itself to be a parameter). Alternatively, we lose one moment (from removing $y_0' y_0$ itself) and four parameters. If the stationarity assumption is correct, there should be an efficiency loss from making inference conditional on the first observation. On the other hand, conditional inference should be robust to different data-generating processes for the initial data values.

Conclusion

This paper studies the Bayesian estimator of Lancaster (2002), which is invariant to natural rotations of the data. The likelihood of the maximal invariant can be divided into two parts, one of which is maximized to obtain Lancaster's estimator. The second part is not asymptotically negligible and, as such, can be used to attain a more efficient estimator. A natural conclusion is that it is unnecessary to use the data to choose the correct inflection point of Lancaster's score likelihood. Lancaster (2002)'s theory is based on differencing the data using the first observation. Instead, we can condition on the first observation itself to further improve efficiency.
The conditional argument is essentially a fixed-effects approach in which we make no further assumptions on unobserved data.

Current practice in economics uses standard GMM methods based on data differencing; e.g., Athanasoglou, Brissimis, and Delis (2008) on bank profitability, Guiso, Pistaferri, and Schivardi (2005) on risk allocation, Konings and Vandenbussche (2005) on antidumping protection effects, and Topalova and Khandelwal (2011) on the effect of tariff changes on firm productivity, among others. Instead, we advocate using moments based on invariant statistics from a model that allows for heteroskedasticity and a factor structure. It is relatively straightforward to extend our method to allow for covariates in the linear model of Chamberlain and Moreira (2009). Bai (2013a) allows for general time-series heteroskedasticity as opposed to assuming specific moving average (MA) processes and data values lagged enough as instruments. Chamberlain and Moreira (2009) and Moon and Weidner (2015) suggest including factors which generalize individual and time effects. Conditioning on the first observation can then provide reliable inference in dynamic panel data models using moments based on invariant statistics for more general models.
References
Abramowitz, M., and I. A. Stegun (1965): Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables.

Ahn, S., and P. Schmidt (1995): "Efficient Estimation of Models for Dynamic Panel Data," Journal of Econometrics, 68, 5–27.

Arellano, M., and S. Bond (1991): "Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations," Review of Economic Studies, 58, 277–297.

Athanasoglou, P. P., S. N. Brissimis, and M. D. Delis (2008): "Bank-specific, industry-specific and macroeconomic determinants of bank profitability," Journal of International Financial Markets, Institutions and Money, 18, 121–136.

Bai, J. (2013a): "Fixed-Effects Dynamic Panel Models, A Factor Analytical Method," Econometrica, 81, 285–314.

Bai, J. (2013b): "Fixed-Effects Dynamic Panel Models, A Factor Analytical Method," Econometrica, 81, 185–314, Supplement.

Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998): Efficient and Adaptive Estimation for Semiparametric Models. Springer.

Blundell, R., and S. Bond (1998): "Initial Conditions and Moment Restrictions in Dynamic Panel Data Models," Journal of Econometrics, 87, 115–143.

Blundell, R., and R. Smith (1991): "Conditions Initiales et Estimation Efficace dans les Modeles Dynamiques sur Donnees de Panel: Une Application au Comportement d'Investissement des Entreprises," Annales d'Economie et de Statistique, 20.

Cabral, L. M. B., and J. Mata (2003): "On the Evolution of the Firm Size Distribution: Facts and Theory," American Economic Review, 93, 1075–1090.

Chamberlain, G., and M. J. Moreira (2009): "Decision Theory Applied to a Linear Panel Data Model," Econometrica, 77, 107–133.

Dhaene, G., and K. Jochmans (2016): "Likelihood Inference in an Autoregression with Fixed Effects," Econometric Theory, 32, 1–38.

Evans, D. S. (1987a): "The Relationship Between Firm Growth, Size, and Age: Estimates for 100 Manufacturing Industries," The Journal of Industrial Economics, 35, 567–581.

Evans, D. S. (1987b): "Tests of Alternative Theories of Firm Growth," Journal of Political Economy, 95, 657–674.

Guiso, L., L. Pistaferri, and F. Schivardi (2005): "Insurance within the Firm," Journal of Political Economy, 113, 1054–1087.

Hall, B. H. (1987): "The Relationship between Firm Size and Firm Growth in the US Manufacturing Sector," The Journal of Industrial Economics, 35, 583–606.

Konings, J., and H. Vandenbussche (2005): "Antidumping Protection and Markups of Domestic Firms," Journal of International Economics, 65, 151–165.

Kruiniger, H. (2014): "A Further Look at Modified ML Estimation of the Panel AR(1) Model with Fixed Effects and Arbitrary Initial Conditions," Working Paper, Durham University.

Lancaster, T. (2002): "Orthogonal Parameters and Panel Data," The Review of Economic Studies, 69, 647–666.

Moon, H. R., and M. Weidner (2015): "Linear Regression for Panel with Unknown Number of Factors as Interactive Fixed Effects," Econometrica, 83, 1543–1579.

Moreira, M. J. (2009): "A Maximum Likelihood Method for the Incidental Parameter Problem," Annals of Statistics, 37, 3660–3696.

Topalova, P., and A. Khandelwal (2011): "Trade Liberalization and Firm Productivity: The Case of India,"