Likelihood Inference and The Role of Initial Conditions for the Dynamic Panel Data Model
Jose Diogo Barbosa
University of Michigan
Marcelo J. Moreira
FGV
This version: September 3, 2018

The authors gratefully acknowledge the research support of CNPq and FAPERJ.

Abstract
Lancaster (2002) proposes an estimator for the dynamic panel data model with homoskedastic errors and zero initial conditions. In this paper, we show this estimator is invariant to orthogonal transformations, but is inefficient because it ignores additional information available in the data. The zero initial condition is trivially satisfied by subtracting initial observations from the data. We show that differencing out the data further erodes efficiency compared to drawing inference conditional on the first observations.

Finally, we compare the conditional method with standard random-effects approaches for unobserved data. Standard approaches implicitly rely on normal approximations, which may not be reliable when unobserved data are very skewed with some mass at zero values. For example, panel data on firms naturally depend on the first period in which the firm enters a new state. It seems unreasonable, then, to assume that the process determining unobserved data is known or stationary. We can instead make inference on structural parameters by conditioning on the initial observations.
Keywords : Autoregressive, Panel Data, Invariance, Efficiency.
JEL Classification Numbers: C12, C30.

1 Introduction
In an important paper, Lancaster (2002) studies, from a Bayesian perspective, estimation of the structural parameters of a dynamic panel data model with fixed effects and initial observations equal to zero. His method involves reparameterizing the model so that the information matrix is block diagonal, with the common parameters in one block and the incidental parameters in the other. His estimator is then defined as one of the local maxima of the integrated likelihood function, integrating with respect to the Lebesgue measure on $\mathbb{R}^N$. However, Lancaster leaves unanswered the question of how to uniquely determine the consistent root of his proposed methodology. Some authors, including Dhaene and Jochmans (2016) and Kruiniger (2014), have proposed different ways to find the consistent estimator in Lancaster's approach.

In this paper, we explain the shortcoming of Lancaster's estimator: it ignores available information in the model. In particular, Lancaster's posterior distribution uses only part of the maximal invariant statistic's log-likelihood function; when the full likelihood function is used in the estimation, a unique, consistent, and asymptotically normal efficient estimator is obtained; see Moreira (2009). The estimator obtained using the full likelihood is asymptotically more efficient than Lancaster's estimator. Therefore, trying to correct the nonuniqueness issue of Lancaster's estimator is unnecessary and leads to inefficient estimators.

Lancaster (2002) and Moreira (2009) study consistent estimation of dynamic panel models under the same set of assumptions and with initial observations equal to zero. The zero initial condition is trivially satisfied when the initial observations are subtracted from the autoregressive variables. We then show that efficiency is improved by conditioning on the initial observations instead of differencing out the data. The conditional argument is essentially a fixed-effects approach in which we make no further premises on unobserved data.
This is in contrast with commonly used estimators in the literature which make further assumptions; see Bai (2013a,b) on correlated random effects or Blundell and Bond (1998) on stationarity.

A potential advantage of conditioning on the first observation is robustness. As Blundell and Smith (1991) point out, asymptotic arguments are usually based on the average temporal effect, calculated on the individual dimension. Therefore, the importance of unobserved data does not disappear asymptotically for relatively short panels. For example, take data on firms or individual earnings. Cabral and Mata (2003) show that the distribution of firms is very skewed, with most of the mass being small firms, while Evans (1987b,a) and Hall (1987) show that Gibrat's Law (independent firm size and growth) is rejected for small firms. As only a handful of large firms provide the bulk of the data, we should not expect estimators based on assumptions about unobserved data to be approximately normal. As the firms' entry states do not disappear asymptotically for relatively short panels, conditioning on the first observation would be preferable to assuming known processes for unobserved data.

The remainder of this paper is organized as follows. Section 2 introduces a simple dynamic panel data model without covariates and a zero initial condition. This section determines the maximal invariant statistic and summarizes the asymptotic theory for the maximum invariant likelihood estimator (MILE). Section 3 develops the asymptotic theory for Lancaster's estimator. Section 4 shows that this estimator is less efficient than MILE because it ignores relevant data. Section 5 shows that it is less efficient to difference out the first observation than to condition on it. Section 6 compares the conditional argument to standard random-effects approaches for the unobserved data. Section 7 concludes and discusses how to extend the model to a more useful form.
We consider a simple homoskedastic dynamic panel model with fixed effects and without covariates:
$$y_{i,t+1} = \rho y_{i,t} + \eta_i + \sigma u_{i,t}, \qquad i = 1, \dots, N; \; t = 1, \dots, T, \qquad (1)$$
where $N \geq T+1$, $y_{i,t} \in \mathbb{R}$ are observable variables and $u_{i,t} \overset{iid}{\sim} N(0,1)$ are unobservable errors; $\eta_i \in \mathbb{R}$ are incidental parameters and $(\rho, \sigma^2) \in \mathbb{R} \times \mathbb{R}_+$ are structural parameters. We denote the true unknown parameters by $(\rho^*, \sigma^{2*}, \eta_i^*)$. We assume that the parameter space is a compact set and $\sigma^{2*} > 0$. For now, we assume that the initial observed condition is $y_{i,1} = 0$, as Lancaster (2002) does. We relax this assumption in Sections 5 and 6.

Solving model (1) recursively and writing it in matrix form yields
$$Y_T = \eta \mathbf{1}_T' B_T' + \sigma U_T B_T', \quad \text{where} \qquad (2)$$
$$U_T \sim N(0_{N \times T}, I_N \otimes I_T), \quad \eta = (\eta_1, \dots, \eta_N)' \in \mathbb{R}^{N \times 1}, \quad \mathbf{1}_T = (1, \dots, 1)' \in \mathbb{R}^{T \times 1},$$
$$Y_T = \begin{pmatrix} y_{1,2} & \cdots & y_{1,T+1} \\ \vdots & \ddots & \vdots \\ y_{N,2} & \cdots & y_{N,T+1} \end{pmatrix}, \quad U_T = \begin{pmatrix} u_{1,1} & \cdots & u_{1,T} \\ \vdots & \ddots & \vdots \\ u_{N,1} & \cdots & u_{N,T} \end{pmatrix}, \quad \text{and} \quad B_T = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \rho & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \cdots & 1 \end{pmatrix}.$$
When there is no confusion, we will omit the subscript from the matrices; e.g., $Y$ instead of $Y_T$, $B$ instead of $B_T$, etc. The inverse of $B$ has a simple form, $B^{-1} \equiv D = I_T - \rho J_T$, where
$$J_T = \begin{pmatrix} \mathbf{0}_{T-1}' & 0 \\ I_{T-1} & \mathbf{0}_{T-1} \end{pmatrix}$$
is the $T \times T$ shift matrix, with ones on the first subdiagonal and zeros elsewhere.

Because the individuals $i$ are treated equally, the coordinate system used to specify the vector $(y_{1,t}, \dots, y_{N,t})$ should not affect inference based on them. Therefore, it is reasonable to restrict attention to coordinate-free functions of $(y_{1,t}, \dots, y_{N,t})$. Chamberlain and Moreira (2009) and Moreira (2009) show that, indeed, orthogonal transformations preserve both the model (2) and the structural parameters $(\rho, \sigma^2)$, and this yields a maximal invariant statistic, the $T \times T$ matrix $Y'Y$. So, if the researcher finds it reasonable to restrict attention to statistics that are invariant to orthogonal transformations, the maximal invariant statistic plays a crucial role: a statistic is invariant to orthogonal transformations if, and only if, it depends on the data through the maximal invariant statistic $Y'Y$.

The maximal invariant statistic $Y'Y$ has a noncentral Wishart distribution and depends only on $\rho$, $\sigma^2$, and $\omega_\eta \equiv \eta'\eta/(\sigma^2 N)$. The noncentral Wishart distribution is the multivariate generalization of the noncentral chi-squared distribution, and it depends on the modified Bessel function of the first kind.
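To make the setup concrete, the following sketch simulates model (2) and numerically checks two facts stated above: that $D = I_T - \rho J_T$ inverts $B_T$, and that the maximal invariant $Y'Y$ is unchanged by an orthogonal rotation $g$ of the cross-section. This is an illustration with hypothetical parameter values, not a procedure from the paper.

```python
import numpy as np

def B_matrix(rho, T):
    # B_T is lower triangular with (t, j) entry rho^(t - j) for t >= j
    B = np.zeros((T, T))
    for t in range(T):
        for j in range(t + 1):
            B[t, j] = rho ** (t - j)
    return B

def shift_matrix(T):
    # J_T has ones on the first subdiagonal and zeros elsewhere
    return np.diag(np.ones(T - 1), -1)

def simulate_Y(rho, sigma, eta, T, rng):
    # Y_T = eta 1_T' B_T' + sigma U_T B_T'  (zero initial condition, model (2))
    N = eta.shape[0]
    U = rng.standard_normal((N, T))
    return (np.outer(eta, np.ones(T)) + sigma * U) @ B_matrix(rho, T).T

rng = np.random.default_rng(0)
N, T, rho, sigma = 500, 4, 0.5, 1.0
eta = rng.standard_normal(N)          # incidental parameters (illustrative)
Y = simulate_Y(rho, sigma, eta, T, rng)

# D = B^{-1}
B = B_matrix(rho, T)
D = np.eye(T) - rho * shift_matrix(T)
assert np.allclose(D @ B, np.eye(T))

# invariance: rotating individuals by any orthogonal g leaves Y'Y unchanged
g, _ = np.linalg.qr(rng.standard_normal((N, N)))
assert np.allclose((g @ Y).T @ (g @ Y), Y.T @ Y)
```

The second assertion is the invariance property in action: any statistic built from $Y'Y$ is automatically invariant to the choice of coordinate system for the cross-section.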
We use uniform approximations of Bessel functions (see Abramowitz and Stegun (1965)), which allow us to write the density of the noncentral Wishart distribution in a more tractable form. Specifically, the log-likelihood of $Y'Y$ is, up to an $o_p(N^{-1})$ term, proportional to
$$Q_{MN}(\theta) = -\frac{1}{2}\ln(\sigma^2) - \frac{\operatorname{tr}(DY'YD')}{2\sigma^2 NT} - \frac{\omega_\eta}{2} + \frac{(1+A_N^2)^{1/2}}{2T} - \frac{1}{2T}\ln\left(\frac{1+(1+A_N^2)^{1/2}}{2}\right), \qquad (3)$$
where $\theta = (\rho, \sigma^2, \omega_\eta)$ and $A_N = 2\left[\omega_\eta\,\mathbf{1}_T'DY'YD'\mathbf{1}_T/(\sigma^2 N)\right]^{1/2}$.

The log-likelihood $Q_{MN}(\theta)$ is free from the incidental parameter problem since $Y'Y$ is parametrized by the fixed-dimensional vector of parameters $\theta$. Although the dimension is fixed, the parameter
$$\omega_\eta \equiv \frac{\eta'\eta}{\sigma^2 N} = \frac{\sum_{i=1}^N \eta_i^2}{\sigma^2 N} \qquad (4)$$
depends on the sample size $N$. For simplicity, we omit the dependence on $N$ from the parameter $\omega_\eta$. However, the asymptotic properties of the estimator will be derived under different sequences of $\omega_\eta$. The asymptotic properties of the estimator obtained by maximizing the objective function $Q_{MN}(\theta)$ are studied by Moreira (2009), and we reproduce these results here for convenience. The information matrix $\mathcal{I}_T(\theta^*)$ in Moreira (2009) contains typographical errors, which are corrected here. Define the matrix $F$ as the $T \times T$ lower-triangular matrix with $(i,j)$ entry $\rho^{i-j}/(i-j)$ for $i > j$ and zero otherwise,
$$F = \begin{pmatrix} 0 & & & \\ \rho & 0 & & \\ \vdots & & \ddots & \\ \frac{\rho^{T-1}}{T-1} & \frac{\rho^{T-2}}{T-2} & \cdots & 0 \end{pmatrix}, \qquad F_j = \frac{d^j}{d\rho^j}F \quad \text{and} \quad F_j^* = \left.\frac{d^j}{d\rho^j}F\right|_{\rho=\rho^*}.$$

Theorem 1: Let
$$\hat{\theta}_M = \arg\max_{\theta \in \Theta} Q_{MN}(\theta). \qquad (5)$$
(A.1) Under the assumption that $N \to \infty$ with $T$ fixed: (i) if $\omega_\eta$ is fixed at $\omega_\eta^*$, then $\hat{\theta}_M \to_p \theta^* = (\rho^*, \sigma^{2*}, \omega_\eta^*)$; (ii) if $\omega_\eta \to \omega_\eta^*$, then $\hat{\theta}_M \to_p \theta^* = (\rho^*, \sigma^{2*}, \omega_\eta^*)$; and (iii) if $\limsup \omega_\eta^* < \infty$, then $\hat{\theta}_M = \theta^* + o_p(1)$, where $\theta^* = (\rho^*, \sigma^{2*}, \omega_\eta^*)$.
(A.2) Under the assumption that $T \to \infty$ and $|\rho^*| < 1$, statements (i)-(iii) of part (A.1) continue to hold.
(B) Assume that $\omega_\eta^* > 0$. Denote the score statistic and the Hessian matrix by
$$S_{MN}(\theta) = \frac{\partial Q_{MN}(\theta)}{\partial\theta} \quad \text{and} \quad H_{MN}(\theta) = \frac{\partial^2 Q_{MN}(\theta)}{\partial\theta\,\partial\theta'},$$
respectively, and define the limiting information matrix $\mathcal{I}_{MT}(\theta^*) = -\operatorname{plim}_{N\to\infty} H_{MN}(\theta^*)$, whose $(1,1)$ entry is
$$h_{MT} = \frac{\operatorname{tr}(F_1^*F_1^{*\prime})}{T} + \frac{\omega_\eta^* T}{1+\omega_\eta^* T}\,\frac{\mathbf{1}_T'F_1^*F_1^{*\prime}\mathbf{1}_T}{T^2} + \frac{1}{1+2\omega_\eta^* T}\left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2.$$
As $N \to \infty$ with $T$ fixed, (i) $\sqrt{NT}\,S_{MN}(\theta^*) \to_d N(0, \mathcal{I}_{MT}(\theta^*))$; (ii) $H_{MN}(\theta^*) \to_p -\mathcal{I}_{MT}(\theta^*)$; (iii) $\sqrt{NT}\,(\hat{\theta}_M - \theta^*) \to_d N(0, \mathcal{I}_{MT}(\theta^*)^{-1})$; and (iv) the log-likelihood ratio satisfies
$$\Lambda_N\big(\theta^* + h\,(NT)^{-1/2},\, \theta^*\big) = NT\big(Q_{MN}(\theta^* + h\,(NT)^{-1/2}) - Q_{MN}(\theta^*)\big) = h'\sqrt{NT}\,S_{MN}(\theta^*) - \tfrac{1}{2}h'\mathcal{I}_{MT}(\theta^*)h + o_{Q_{MN}(\theta^*)}(1),$$
where $\sqrt{NT}\,S_{MN}(\theta^*) \to_d N(0, \mathcal{I}_{MT}(\theta^*))$ under $Q_{MN}(\theta^*)$. Furthermore, $\hat{\theta}_M$ is asymptotically efficient within the class of regular invariant estimators for the differenced model (16) under large-$N$, fixed-$T$ asymptotics.

Part (A) of the above theorem implies that $\hat{\rho}_M \to_p \rho^*$ and $\hat{\sigma}^2_M \to_p \sigma^{2*}$ regardless of the growth rate of $N$ and $T$, as long as $NT \to \infty$. Part (B) derives the limiting distribution of $\hat{\theta}_M$. It shows, in particular, that $\hat{\rho}_M$ achieves the efficiency bound $\mathcal{I}_{MT}(\theta^*)^{-1}$ for regular invariant estimators as $N \to \infty$.
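The consistency claim in part (A) can be illustrated numerically. The sketch below maximizes the invariant log-likelihood concentrated in $\rho$ (the expression given in equation (9) below, which depends on the data only through $Y'Y$) over a grid, on data simulated with an illustrative $\rho^* = 0.5$; the grid maximizer should land near the truth. All parameter values here are hypothetical.

```python
import numpy as np

def B_matrix(rho, T):
    return np.array([[rho ** (i - j) if i >= j else 0.0 for j in range(T)]
                     for i in range(T)])

def conc_loglik(rho, W, N, T):
    # concentrated invariant log-likelihood Q_MN(rho) of equation (9);
    # W = Y'Y is the maximal invariant
    one = np.ones(T)
    D = np.eye(T) - rho * np.diag(np.ones(T - 1), -1)
    H = np.eye(T) - np.outer(one, one) / T
    V = D @ W @ D.T
    return (-(T - 1) / (2 * T) * np.log(np.trace(V @ H) / (N * (T - 1)))
            - np.log(one @ V @ one / (N * T)) / (2 * T))

rng = np.random.default_rng(1)
N, T, rho_star, sigma = 3000, 4, 0.5, 1.0
eta = rng.standard_normal(N)                      # omega_eta roughly 1
B = B_matrix(rho_star, T)
Y = (np.outer(eta, np.ones(T)) + sigma * rng.standard_normal((N, T))) @ B.T
W = Y.T @ Y

grid = np.linspace(-0.9, 1.3, 441)
rho_hat = grid[np.argmax([conc_loglik(r, W, N, T) for r in grid])]
# rho_hat should be close to rho_star, consistent with Theorem 1
```

A grid search is used only to keep the sketch dependency-free; any one-dimensional optimizer would serve.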
Regular estimators exclude superefficient estimators in the sense of Hodges-Le Cam; heuristically, a regular estimator is one whose asymptotic distribution does not change in shrinking neighborhoods of the true parameter value (see Bickel, Klaassen, Ritov, and Wellner (1998) for more details).

Part (B) of Theorem 1 finds the asymptotic distribution of $\hat{\theta}_M$, assuming that $\omega_\eta$ is fixed at $\omega_\eta^*$. A generalization of part (B) that allows for $\omega_\eta \to \omega_\eta^*$ follows from a simple application of Le Cam's third lemma.

Lemma 2: Assume $\omega_\eta = \omega_\eta^* + h/\sqrt{N}$, where $\omega_\eta^* > 0$, $h \in \mathbb{R}$, and $\omega_\eta$ is in the parameter space. As $N \to \infty$ with $T$ fixed,
$$\sqrt{NT}\,\big(\hat{\theta}_M - \theta^*\big) \to_d N\big(\bar{h}, \mathcal{I}_{MT}(\theta^*)^{-1}\big), \quad \text{where } \bar{h} = (0, 0, \sqrt{T}\,h)'$$
and $\mathcal{I}_{MT}(\theta^*)^{-1}$ is defined in Theorem 1, part (B).

Lemma 2 shows that the asymptotic distribution of the structural parameters does not change, whether $\omega_\eta$ is fixed at $\omega_\eta^*$ or $\omega_\eta \to \omega_\eta^*$. For simplicity's sake, we assume throughout the paper that the sequence $\omega_\eta$ (which can depend on the sample size $N$) is fixed at $\omega_\eta^* > 0$.

Lancaster (2002) proposes a Bayesian approach to estimate the structural parameters, which involves reparameterizing model (2) so that the information matrix is block diagonal, with the common parameters in one block and the incidental parameters in the other. He then defines his estimator as one of the local maxima of the integrated likelihood function, integrating with respect to the Lebesgue measure on $\mathbb{R}^N$. Dhaene and Jochmans (2016) give an alternative interpretation of Lancaster's estimator by showing that Lancaster's posterior can be obtained by adjusting the profile likelihood so that its score is free from asymptotic bias. In this sense, Lancaster's estimator is a bias-corrected estimator.

Lancaster's estimator seeks to maximize the following objective function:
$$Q_{LN}(\rho,\sigma^2) = -\frac{1}{2}\ln(\sigma^2) + \frac{\mathbf{1}_T'F\mathbf{1}_T}{T(T-1)} - \frac{\operatorname{tr}(DY'YD'H)}{2\sigma^2 N(T-1)}, \qquad (6)$$
where the matrix $H$ is defined as $H = I_T - \frac{1}{T}\mathbf{1}_T\mathbf{1}_T'$. The true parameter $(\rho^*, \sigma^{2*})$ is not a global maximizer of $\operatorname{plim}_{N\to\infty} Q_{LN}(\rho,\sigma^2)$, as in standard maximum likelihood theory (see Dhaene and Jochmans (2016)). Nonetheless, Theorem 3, proved by Lancaster (2002), shows that the posterior (6) can be used to consistently estimate the structural parameters.

Theorem 3: Let $S_{LN}(\rho,\sigma^2) = \partial Q_{LN}(\rho,\sigma^2)/\partial(\rho,\sigma^2)'$ be the score of the posterior (6), and let $\Theta_N$ be the set of roots of
$$S_{LN}(\rho,\sigma^2) = 0 \qquad (7)$$
corresponding to the local maxima. If that set is empty, set $\Theta_N = \{0\}$. Then there is a consistent root of equation (7).

The usefulness of Theorem 3 is limited by the fact that it only states that one (of possibly many) of the local maxima of (6) is a consistent estimator of the structural parameters. However, it does not indicate how to find a consistent estimator. Therefore, even though the posterior can be used to consistently estimate the structural parameters of model (2), Lancaster leaves unanswered the question of how to uniquely determine the consistent root of (7). To our knowledge, two different methodologies to uniquely choose the consistent root of Lancaster's score have been proposed. The first approach, by Dhaene and Jochmans (2016), suggests as a consistent estimator of the structural parameters the minimizer of the norm of Lancaster's score on an interval around the maximum likelihood estimator obtained from the maximization of the likelihood of (2). The second method, by Kruiniger (2014), uses as a consistent estimator the minimizer of a quadratic form in Lancaster's score, subject to a condition on the Hessian matrix.

Lemma 4 finds the asymptotic variance of a consistent root of (7).
Lemma 4: Assume $\omega_\eta^* > 0$ and let $(\hat{\rho}_L, \hat{\sigma}^2_L)$ be a consistent root of $S_{LN}(\rho,\sigma^2) = 0$. Let the Hessian matrix be
$$H_{LN}(\rho,\sigma^2) = \frac{\partial^2 Q_{LN}(\rho,\sigma^2)}{\partial(\rho,\sigma^2)'\,\partial(\rho,\sigma^2)},$$
and define the matrices
$$\mathcal{I}_{LT}(\theta^*) = \begin{pmatrix} h_{LT} & -\dfrac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{\sigma^{2*}T(T-1)} \\[1ex] -\dfrac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{\sigma^{2*}T(T-1)} & \dfrac{1}{2(\sigma^{2*})^2} \end{pmatrix} \quad \text{and} \quad \Sigma_T(\theta^*) = \frac{T}{T-1}\begin{pmatrix} a_{LT} & \mathcal{I}_{LT}(1,2) \\ \mathcal{I}_{LT}(1,2) & \mathcal{I}_{LT}(2,2) \end{pmatrix}, \qquad (8)$$
where
$$h_{LT} = -\frac{\mathbf{1}_T'F_2^*\mathbf{1}_T}{T(T-1)} + \frac{1}{T-1}\left(\operatorname{tr}(F_1^*F_1^{*\prime}) - \frac{\mathbf{1}_T'F_1^*F_1^{*\prime}\mathbf{1}_T}{T}\right) + \frac{\omega_\eta^*}{T-1}\left(\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T - \frac{(\mathbf{1}_T'F_1^*\mathbf{1}_T)^2}{T}\right),$$
$$a_{LT} = 2h_{LT} + \frac{2\,\mathbf{1}_T'F_2^*\mathbf{1}_T}{T(T-1)},$$
and $\mathcal{I}_{LT}(i,j)$ is the $(i,j)$-th entry of $\mathcal{I}_{LT}(\theta^*)$. When $\mathcal{I}_{LT}(\theta^*)$ is nonsingular, as $N \to \infty$ with $T$ fixed, (i) $\sqrt{NT}\,S_{LN}(\rho^*,\sigma^{2*}) \to_d N(0, \Sigma_T(\theta^*))$; (ii) $-H_{LN}(\rho^*,\sigma^{2*}) \to_p \mathcal{I}_{LT}(\theta^*)$; and (iii)
$$\sqrt{NT}\begin{pmatrix}\hat{\rho}_L - \rho^* \\ \hat{\sigma}^2_L - \sigma^{2*}\end{pmatrix} \to_d N\big(0,\, \mathcal{I}_{LT}(\theta^*)^{-1}\Sigma_T(\theta^*)\mathcal{I}_{LT}(\theta^*)^{-1}\big).$$

Lemma 4 follows from standard asymptotic theory because of the assumption that the Hessian matrix $-H_{LN}(\rho^*,\sigma^{2*})$ converges in probability to a nonsingular matrix $\mathcal{I}_{LT}(\theta^*)$. Dhaene and Jochmans (2016) prove that this assumption is violated, i.e., $\mathcal{I}_{LT}(\theta^*)$ is singular, when $T = 2$ or $\rho^* = 1$.

In contrast to Lancaster's estimator, the estimator defined in (5), based on the maximal invariant statistic's log-likelihood, is a standard maximum likelihood estimator in the sense that its objective function is a likelihood function and the limit of its objective function attains a unique global maximum at the vector of true parameters. This implies that the estimator that maximizes the maximal invariant statistic's log-likelihood is uniquely determined, and no additional procedures are necessary to obtain a consistent and asymptotically normal estimator of the structural parameters. More than that, Theorem 1 shows that it attains the minimum variance bound for invariant regular estimators of the structural parameters of model (2). Therefore, the estimator defined in (5) efficiently uses all information available in the maximal invariant statistic $Y'Y$.

Lancaster's estimator is also invariant to orthogonal transformations since it depends on the data only through the maximal invariant statistic $Y'Y$; see equation (6).
However, it is not uniquely determined, and additional steps, such as those proposed by Dhaene and Jochmans (2016) and Kruiniger (2014), are necessary to choose the consistent root among the (possibly) many roots of Lancaster's score function. Furthermore, the probability limit of the Hessian of $Q_{LN}(\rho,\sigma^2)$ is not necessarily proportional to the asymptotic variance (AVar) of the score. Since the information equality does not hold, the asymptotic variance of Lancaster's consistent estimator cannot attain the lower bound for invariant regular estimators found in Theorem 1.

The explanation for the shortcomings of Lancaster's estimator is that, even though it is invariant, its objective function ignores information available in the maximal invariant by using only part of the maximal invariant statistic's log-likelihood function. Indeed, the maximal invariant statistic's log-likelihood function (3), concentrated out of $\sigma^2$ and $\omega_\eta$, is
$$Q_{MN}(\rho) = -\frac{T-1}{2T}\ln\left(\frac{\operatorname{tr}(DY'YD'H)}{N(T-1)}\right) - \frac{1}{2T}\ln\left(\frac{\mathbf{1}_T'DY'YD'\mathbf{1}_T}{NT}\right), \qquad (9)$$
while Lancaster's objective function, concentrated out of $\sigma^2$, is
$$Q_{LN}(\rho) = \frac{\mathbf{1}_T'F\mathbf{1}_T}{T(T-1)} - \frac{1}{2}\ln\left(\frac{\operatorname{tr}(DY'YD'H)}{N(T-1)}\right),$$
which allows us to write
$$Q_{MN}(\rho) = \frac{T-1}{T}Q_{LN}(\rho) + Q_{M-L,N}(\rho), \quad \text{where} \qquad (10)$$
$$Q_{M-L,N}(\rho) = -\frac{\mathbf{1}_T'F\mathbf{1}_T}{T^2} - \frac{1}{2T}\ln\left(\frac{\mathbf{1}_T'DY'YD'\mathbf{1}_T}{NT}\right).$$
The respective score functions are given by
$$S_{LN}(\rho) = \frac{\mathbf{1}_T'F_1\mathbf{1}_T}{T(T-1)} + \frac{\operatorname{tr}(J_TY'YD'H)}{\operatorname{tr}(DY'YD'H)}$$
and
$$S_{MN}(\rho) = \frac{T-1}{T}S_{LN}(\rho) + S_{M-L,N}(\rho), \quad \text{where} \qquad (11)$$
$$S_{M-L,N}(\rho) = -\frac{\mathbf{1}_T'F_1\mathbf{1}_T}{T^2} + \frac{1}{T}\,\frac{\mathbf{1}_T'J_TY'YD'\mathbf{1}_T}{\mathbf{1}_T'DY'YD'\mathbf{1}_T}.$$
The decomposition (10) implies an orthogonal decomposition of the concentrated score function:
Theorem 5: Let $S_{MN}(\rho)$, $S_{LN}(\rho)$, and $S_{M-L,N}(\rho)$ be the score functions associated with $Q_{MN}(\rho)$, $Q_{LN}(\rho)$, and $Q_{M-L,N}(\rho)$, respectively. Then
$$S_{MN}(\rho) = \frac{T-1}{T}S_{LN}(\rho) + S_{M-L,N}(\rho), \qquad (12)$$
where
$$S_{M-L,N}(\rho^*) \to_p 0 \qquad (13)$$
and
$$\begin{pmatrix} \sqrt{NT}\,S_{LN}(\rho^*) \\ \sqrt{NT}\,S_{M-L,N}(\rho^*) \end{pmatrix} \to_d N\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix} b_T & 0 \\ 0 & c_T \end{pmatrix}\right), \qquad (14)$$
with
$$b_T = a_{LT} - \frac{2}{T(T-1)}\left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2 > 0,$$
$a_{LT}$ as defined in Lemma 4, and
$$c_T = \frac{2}{T(1+\omega_\eta^* T)}\left(\frac{\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T}{T} - \left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2\right) > 0.$$

A consistent estimator for $\rho^*$ is expected to be more efficient the closer it stays to the maximum likelihood estimator $\hat{\rho}_M$. This implies that the consistent estimator $\hat{\rho}_L$ might have low efficiency because it only uses part of the full log-likelihood (10). One may use $|S_{MN}(\hat{\rho}_L)|$ as a measure of how "close" any inflection point $\hat{\rho}_L$ is to $\hat{\rho}_M$ and, because $S_{MN}(\hat{\rho}_L) = S_{M-L,N}(\hat{\rho}_L)$, $\hat{\rho}_L$ will be more efficient the smaller $|S_{M-L,N}(\hat{\rho}_L)|$ is. The term $S_{M-L,N}(\cdot)$, evaluated at the true value $\rho^*$ or at a consistent estimator, converges in probability to zero. This suggests choosing among the inflection points of Lancaster's score by looking at $S_{M-L,N}(\hat{\rho}_L)$ to uniquely determine the consistent root, akin to Dhaene and Jochmans (2016) and Kruiniger (2014). Instead, we can use the information in $S_{M-L,N}(\cdot)$ to increase the efficiency of $\hat{\rho}_L$. Simple algebra shows that
$$\big|S_{M-L,N}(\hat{\rho}_L)\big| = \left|\frac{T-1}{T}\,\frac{\operatorname{tr}(J_TY'Y\hat{D}_L'H)}{\operatorname{tr}(\hat{D}_LY'Y\hat{D}_L'H)} + \frac{1}{T}\,\frac{\mathbf{1}_T'J_TY'Y\hat{D}_L'\mathbf{1}_T}{\mathbf{1}_T'\hat{D}_LY'Y\hat{D}_L'\mathbf{1}_T}\right|,$$
where $\hat{D}_L$ is the matrix $D$ evaluated at $\hat{\rho}_L$. Notice that $\hat{\rho}_L$ is inefficient since $\sqrt{NT}\,S_{M-L,N}(\hat{\rho}_L)$ does not converge in probability to zero.
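The decomposition (10) can be verified numerically. The sketch below evaluates $Q_{LN}(\rho)$, $Q_{M-L,N}(\rho)$, and $Q_{MN}(\rho)$ on simulated data and checks the identity at several values of $\rho$. Here $F$ is taken to be the lower-triangular matrix with $(i,j)$ entry $\rho^{i-j}/(i-j)$, as defined above; note that the $F$-terms enter $Q_{LN}$ and $Q_{M-L,N}$ with offsetting weights, so they cancel exactly in the combination. Simulation parameters are illustrative.

```python
import numpy as np

def ones_F_ones(rho, T):
    # 1'F1 with F_{ij} = rho^(i-j)/(i-j) for i > j, zero otherwise
    return sum((T - k) * rho ** k / k for k in range(1, T))

def pieces(rho, W, N, T):
    # the three concentrated objectives in (9) and (10); W = Y'Y
    one = np.ones(T)
    D = np.eye(T) - rho * np.diag(np.ones(T - 1), -1)
    H = np.eye(T) - np.outer(one, one) / T
    V = D @ W @ D.T
    bF = ones_F_ones(rho, T)
    Q_L  = bF / (T * (T - 1)) - 0.5 * np.log(np.trace(V @ H) / (N * (T - 1)))
    Q_ML = -bF / T**2 - np.log(one @ V @ one / (N * T)) / (2 * T)
    Q_M  = (-(T - 1) / (2 * T) * np.log(np.trace(V @ H) / (N * (T - 1)))
            - np.log(one @ V @ one / (N * T)) / (2 * T))
    return Q_M, Q_L, Q_ML

rng = np.random.default_rng(2)
N, T, rho_star = 800, 5, 0.4
B = np.array([[rho_star ** (i - j) if i >= j else 0.0 for j in range(T)]
              for i in range(T)])
Y = (np.outer(rng.standard_normal(N), np.ones(T))
     + rng.standard_normal((N, T))) @ B.T
W = Y.T @ Y

for rho in (-0.3, 0.1, 0.4, 0.9):
    Q_M, Q_L, Q_ML = pieces(rho, W, N, T)
    # decomposition (10): Q_MN = ((T-1)/T) Q_LN + Q_{M-L,N}
    assert np.isclose(Q_M, (T - 1) / T * Q_L + Q_ML)
```

Because the identity holds at every $\rho$ (not just the truth), it also holds for the score functions, which is the content of equation (12).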
Instead of choosing the consistent root following Lancaster's methodology, we could add the information in $S_{M-L,N}(\cdot)$ to $S_{LN}(\cdot)$ by using (12). The associated estimator $\hat{\rho}_M$ uses the information in both $S_{M-L,N}(\cdot)$ and $S_{LN}(\cdot)$ efficiently and is uniquely defined, making the estimation problem simpler and more objective.

Theorem 1 shows that the estimator for the structural parameters $(\rho, \sigma^2)$ attains the efficiency bound for model (1) when the first observation is $y_{i,1} = 0$. Lancaster (2002) suggests working with the differenced data $\tilde{y}_{i,t} = y_{i,t} - y_{i,1}$ so that we may continue working with an initial observation equal to zero. Instead, we draw inference on $(\rho, \sigma^2)$ by conditioning on the first observation $y_{i,1}$. We will show the differencing method is indeed less efficient than the conditional method.

By using model (1) with the variables $y_{i,t}$ in levels, there are more observations that can be used to estimate the parameters. On the other hand, the number of parameters to be estimated increases. In this section, we show that the estimator that maximizes the likelihood of the maximal invariant statistic of the original model (1) has asymptotic variance that is no larger than the variance of the estimator (5) and, under very general conditions, using $y_{i,t}$ in levels will be strictly more efficient.

When we allow for a nonzero initial condition, model (1) becomes
$$Y_T = \rho\, y\, e_1'B_T' + \eta\mathbf{1}_T'B_T' + \sigma U_T B_T', \quad \text{where} \qquad (15)$$
$$U_T \sim N(0_{N\times T}, I_N \otimes I_T),$$
$y = (y_{1,1}, \dots, y_{N,1})'$ is the vector of first observations and $e_1 = (1, 0, \dots, 0)'$ is the canonical vector.

Lancaster (2002) works with a differenced version of model (1) given by
$$\tilde{y}_{i,t+1} = \rho\tilde{y}_{i,t} + \tilde{\eta}_i + \sigma u_{i,t}, \qquad i = 1, \dots, N;\; t = 1, \dots, T, \qquad (16)$$
$$\tilde{y}_{i,t} \equiv y_{i,t} - y_{i,1}, \qquad i = 1, \dots, N;\; t = 1, \dots, T+1,$$
$$\tilde{\eta}_i \equiv \eta_i - y_{i,1}(1-\rho), \qquad i = 1, \dots, N,$$
and seeks inference based on $\tilde{y}_{i,1} = 0$ for all $i$. In matrix form,
$$Y_{T+1} - y\,\mathbf{1}_{T+1}' = [\,0_{N\times 1} : Y_T - y\,\mathbf{1}_T'\,].$$
The second term equals
$$Y_T - y\,\mathbf{1}_T' = y(\rho e_1'B_T' - \mathbf{1}_T') + \eta\mathbf{1}_T'B_T' + \sigma U B_T' = [\eta - (1-\rho)y]\,\mathbf{1}_T'B_T' + \sigma U B_T'.$$
Defining the differenced variables and incidental parameters, $\tilde{Y}_T = Y_T - y\,\mathbf{1}_T'$ and $\tilde{\eta} = \eta - (1-\rho)y$, we are back to the model with the first observation equal to zero:
$$\tilde{Y}_T = \tilde{\eta}\,\mathbf{1}_T'B_T' + \sigma U_T B_T', \qquad U_T \sim N(0_{N\times T}, I_N \otimes I_T), \qquad (17)$$
where the incidental parameter is $\tau = \tilde{\eta}$.

Differencing eliminates one time period from the data. Instead, we will condition on the initial observation $y$ itself. It is convenient to work with the linear setup of Chamberlain and Moreira (2009):
$$Y = x\,a(\gamma) + \tau\,b(\gamma) + U\,c(\gamma), \qquad (18)$$
where $x \in \mathbb{R}^{N\times K}$ are explanatory variables, $\tau \in \mathbb{R}^{N\times J}$ are the incidental parameters, and $a(\cdot)$, $b(\cdot)$, $c(\cdot)$ are given functions of the unknown parameter of interest $\gamma$. Consider the group of transformations $g\,Y$, where $g$ are orthogonal matrices such that $g\,x = x$. This group modifies the unknown incidental parameter $\tau$ but preserves the model and the parameter $\gamma$ of interest:
$$gY = gx\,a(\gamma) + g\tau\,b(\gamma) + gU\,c(\gamma) = x\,a(\gamma) + (g\tau)\,b(\gamma) + U\,c(\gamma)$$
in distribution, because the law of $gU$ is the same as the law of $U$. This group yields the maximal invariant statistic, which is given by the pair
$$Z_1 = (x'x)^{-1/2}x'Y \quad \text{and} \quad Z_2'Z_2 = Y'M_xY, \qquad (19)$$
where $Z_2 = q_2'Y$ for any matrix $q_2$ such that the matrix $q = \big[\,x(x'x)^{-1/2} : q_2\,\big]$ is orthogonal. Because $q$ is an orthogonal matrix, we have $q_2q_2' = I_N - N_x = M_x$ for $N_x = x(x'x)^{-1}x'$. We refer the reader to Chamberlain and Moreira (2009) for more details.

Define the parameters
$$\omega_{\tau,x} = \frac{\tau'M_x\tau}{\sigma^2 N} \quad \text{and} \quad \delta_{\tau,x} = (x'x)^{-1}x'\tau. \qquad (20)$$
The coefficient $\delta_{\tau,x}$ is the ordinary least squares (OLS) coefficient, and $\omega_{\tau,x}$ is the standardized average of the sum of squared residuals from a fictitious regression of $\tau$ on $x$ (if there are no covariates $x$, we define $\omega_\tau = (\sigma^2 N)^{-1}\tau'\tau$).

For the model (18), the statistics $Z_1$ and $Z_2'Z_2$ are independently normal and noncentral Wishart distributed. Their distributions depend on $(\rho, \sigma^2, \delta_{\tau,x})$ and $(\rho, \sigma^2, \omega_{\tau,x})$, respectively.

If we condition on the initial observation $y$, the model (15) for $Y_T$ is a special case of (18), where $x = y$, $\tau = \eta$, and the regression coefficients are
$$a(\gamma) = \rho e_1'B_T', \quad b(\gamma) = \mathbf{1}_T'B_T', \quad \text{and} \quad c(\gamma) = \sigma B_T'.$$
Let $\theta = (\rho, \sigma^2, \delta_{\eta,y}, \omega_{\eta,y}) \in \mathbb{R}^4$ be the parameters of the joint distribution of (19). The likelihood estimator that maximizes the joint likelihood of (19) does not have the incidental parameter problem since it is parametrized by $\theta$, which has fixed dimension.

If we instead difference out the data, as Lancaster (2002) does, the model (17) for $\tilde{Y}_T$ is a special case of (18), in which $x$ is absent, the incidental parameter $\tau$ is $\tilde{\eta} = \eta - (1-\rho)y$, and the regression coefficients are $b(\gamma) = \mathbf{1}_T'B_T'$ and $c(\gamma) = \sigma B_T'$.

We now show that Lancaster (2002)'s differencing method entails unnecessary efficiency loss. We begin by defining the estimator based on the likelihood of $(Z_1, Z_2'Z_2)$; we write $M_1$ to distinguish it from the differenced-data estimator (5). Let
$$\hat{\theta}_{M_1} = \arg\max_{\theta\in\Theta} Q_{M_1N}(\theta), \qquad (21)$$
where
$$Q_{M_1N}(\theta) = -\frac{1}{2}\ln(\sigma^2) - \frac{\omega_{\eta,y}}{2} - \frac{\operatorname{tr}(DZ_2'Z_2D')}{2\sigma^2 NT} + \frac{(1+A_{1,N}^2)^{1/2}}{2T} \qquad (22)$$
$$- \frac{1}{2\sigma^2 NT}\big(Z_1D' - \|y\|(\rho e_1' + \delta_{\eta,y}\mathbf{1}_T')\big)\big(Z_1D' - \|y\|(\rho e_1' + \delta_{\eta,y}\mathbf{1}_T')\big)' - \frac{1}{2T}\ln\left(\frac{1+(1+A_{1,N}^2)^{1/2}}{2}\right)$$
is, up to $o_p(N^{-1})$ terms, proportional to the (conditional on $y$) log-likelihood of $(Z_1, Z_2'Z_2)$, with $\|y\| = (y'y)^{1/2}$ and $A_{1,N} = 2\big[\omega_{\eta,y}\,\mathbf{1}_T'DZ_2'Z_2D'\mathbf{1}_T/(\sigma^2 N)\big]^{1/2}$. The asymptotic behavior of the estimator $\hat{\theta}_{M_1}$ is given next.
Lemma 6: Assume that the true parameters $\delta_{\eta,y}$ and $\omega_{\eta,y}$ are fixed, respectively, at $\delta_{\eta,y}^*$ and $\omega_{\eta,y}^* > 0$, and that $0 < \lim_{N\to\infty} y'y/N \equiv \|y\|_\infty^2 < \infty$. As $N \to \infty$ with $T$ fixed,
(A) $\hat{\theta}_{M_1} \to_p \theta^* = (\rho^*, \sigma^{2*}, \delta_{\eta,y}^*, \omega_{\eta,y}^*)$.
(B) Let $Q_{M_1N}(\rho,\sigma^2)$ be the objective function (22) concentrated out of $\delta_{\eta,y}$ and $\omega_{\eta,y}$, and denote the score statistic and the Hessian matrix by
$$S_{M_1N}(\rho,\sigma^2) = \frac{\partial Q_{M_1N}(\rho,\sigma^2)}{\partial(\rho,\sigma^2)'} \quad \text{and} \quad H_{M_1N}(\rho,\sigma^2) = \frac{\partial^2 Q_{M_1N}(\rho,\sigma^2)}{\partial(\rho,\sigma^2)'\,\partial(\rho,\sigma^2)},$$
respectively. Also, define the matrix
$$\mathcal{I}_{M_1T}(\theta^*) = \begin{pmatrix} d_{M_1T} & -\dfrac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{\sigma^{2*}T^2} \\[1ex] -\dfrac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{\sigma^{2*}T^2} & \dfrac{1}{2(\sigma^{2*})^2} \end{pmatrix},$$
where
$$d_{M_1T} = \frac{1}{T(1+\omega_{\eta,y}^*T)}\left(\frac{\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T}{T} + \omega_{\eta,y}^*T\left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2\right) - \frac{1}{T}\left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2$$
$$+ \left(\omega_{\eta,y}^* + \frac{\|y\|_\infty^2}{\sigma^{2*}}\big(\delta_{\eta,y}^* + \rho^* - 1\big)^2\right)\left(\frac{\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T}{T} - \left(\frac{\mathbf{1}_T'F_1^*\mathbf{1}_T}{T}\right)^2\right) + \frac{\operatorname{tr}(F_1^*F_1^{*\prime})}{T} - \frac{\mathbf{1}_T'F_1^{*\prime}F_1^*\mathbf{1}_T}{T^2}.$$
Then, (i) $\sqrt{NT}\,S_{M_1N}(\rho^*,\sigma^{2*}) \to_d N(0, \mathcal{I}_{M_1T}(\theta^*))$; (ii) $H_{M_1N}(\rho^*,\sigma^{2*}) \to_p -\mathcal{I}_{M_1T}(\theta^*)$; and (iii)
$$\sqrt{NT}\begin{pmatrix}\hat{\rho}_{M_1} - \rho^* \\ \hat{\sigma}^2_{M_1} - \sigma^{2*}\end{pmatrix} \to_d N\big(0, \mathcal{I}_{M_1T}(\theta^*)^{-1}\big).$$

Lemma 7 below shows that, in general, (21) estimates the structural parameters with strictly smaller variances than the original estimator that uses differenced data, as defined in (5).
Lemma 7: If $\delta_{\eta,y}^* + \rho^* \neq 1$, then $\operatorname{AVar}(\hat{\rho}_M) > \operatorname{AVar}(\hat{\rho}_{M_1})$ and $\operatorname{AVar}(\hat{\sigma}^2_M) > \operatorname{AVar}(\hat{\sigma}^2_{M_1})$. If $\delta_{\eta,y}^* + \rho^* = 1$, then $\operatorname{AVar}(\hat{\rho}_M) = \operatorname{AVar}(\hat{\rho}_{M_1})$ and $\operatorname{AVar}(\hat{\sigma}^2_M) = \operatorname{AVar}(\hat{\sigma}^2_{M_1})$.

6 Random Effects and Moment Conditions
In Section 5 we draw inference on the structural parameters by conditioning or by differencing the data based on the first observed initial condition $y_{i,1}$. Both methods are fixed-effects approaches, which do not rely on any further assumptions on unobserved data such as $y_{i,0}$. We could instead consider random-effects approaches, which rely on assumptions for the unobserved value $y_{i,0}$. These include Chamberlain's (correlated) random effects and Blundell and Bond (1998)'s stationarity assumptions for the first unobserved value $y_{i,0}$.

In this section, we compare the conditional method with the random-effects methods using the first moments of
$$(x'x)^{-1}x'Y \quad \text{and} \quad \frac{Y'Y}{N}. \qquad (23)$$
Define the function $\pi(\gamma, \delta_{\tau,x}) = a(\gamma) + \delta_{\tau,x}\,b(\gamma)$. The expectation of the maximal invariant is then given by
$$E\big[(x'x)^{-1}x'Y\big] = \pi(\gamma, \delta_{\tau,x})$$
and
$$E\left[\frac{Y'Y}{N}\right] = \sigma^2\big[\pi(\gamma,\delta_{\tau,x})'\,\omega_x\,\pi(\gamma,\delta_{\tau,x}) + b(\gamma)'\,\omega_{\tau,x}\,b(\gamma)\big] + c(\gamma)'c(\gamma),$$
where $\omega_x = x'x/(\sigma^2 N)$. For example, Lancaster (2002)'s differencing approach yields
$$E\left[\frac{\tilde{Y}_T'\tilde{Y}_T}{N}\right] = \sigma^2 B_T\big[\omega_{\tilde{\eta}}\mathbf{1}_T\mathbf{1}_T' + I_T\big]B_T'.$$
Hence, we have $(T+1)T/2$ moments for three parameters: $\rho$, $\sigma^2$, and $\omega_{\tilde{\eta}}$.

Conditioning on $y$ yields the following (conditional) moments based on the maximal invariant:
$$E\big[(y'y)^{-1}y'Y_T \mid y\big] = \rho e_1'B_T' + \delta_{\eta,y}\mathbf{1}_T'B_T' \qquad (24)$$
and
$$E\left[\frac{Y_T'Y_T}{N}\,\Big|\, y\right] = \sigma^2 B_T\big\{\omega_{\eta,y}\mathbf{1}_T\mathbf{1}_T' + I_T\big\}B_T' + \sigma^2\big[\rho B_Te_1 + \delta_{\eta,y}B_T\mathbf{1}_T\big]\,\omega_y\,\big[\rho e_1'B_T' + \delta_{\eta,y}\mathbf{1}_T'B_T'\big], \qquad (25)$$
where $\omega_y = y'y/(\sigma^2 N)$. The unknown parameters are the autoregressive coefficient $\rho$, the error variance $\sigma^2$, the OLS coefficient $\delta_{\eta,y}$, and the standardized squared residuals $\omega_{\eta,y}$.

Invariance reduces the information to $T + (T+1)T/2 = (T+1)(T+2)/2 - 1$ moments. Conditioning thus yields $T$ more moments than Lancaster (2002)'s differencing method and one additional parameter (four instead of three).
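Because $Y_T$ is linear in $U_T$, the conditional moment formulas (24) and (25) can be checked exactly: $E[Y_T \mid y]$ is obtained by setting $U_T = 0$ in model (15), and $E[U_T'U_T] = N I_T$ contributes the $\sigma^2 B_TB_T'$ term. The sketch below verifies both formulas for an arbitrary (illustrative) draw of $y$ and $\eta$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, rho, sigma = 200, 4, 0.6, 1.3
y = rng.standard_normal(N) + 1.0          # observed initial values (illustrative)
eta = 0.5 * y + rng.standard_normal(N)    # fixed effects, correlated with y

one = np.ones(T)
e1 = np.eye(T)[0]
B = np.array([[rho ** (i - j) if i >= j else 0.0 for j in range(T)]
              for i in range(T)])

delta = (y @ eta) / (y @ y)                                   # OLS coefficient
omega_eta_y = (eta @ eta - delta * (y @ eta)) / (sigma**2 * N)  # residual term
omega_y = (y @ y) / (sigma**2 * N)

# E[Y_T | y]: model (15) with U = 0
EY = rho * np.outer(y, e1) @ B.T + np.outer(eta, one) @ B.T

# (24): E[(y'y)^{-1} y' Y_T | y] = rho e1' B' + delta 1' B'
lhs24 = y @ EY / (y @ y)
rhs24 = rho * e1 @ B.T + delta * one @ B.T
assert np.allclose(lhs24, rhs24)

# (25): E[Y_T'Y_T / N | y]; E[U'U] = N I contributes sigma^2 B B'
EYY = (EY.T @ EY + sigma**2 * N * B @ B.T) / N
v = B @ (rho * e1 + delta * one)
rhs25 = (sigma**2 * B @ (omega_eta_y * np.outer(one, one) + np.eye(T)) @ B.T
         + sigma**2 * omega_y * np.outer(v, v))
assert np.allclose(EYY, rhs25)
```

The check works for any $y$ and $\eta$ because both sides are the same quadratic form in the model's mean and variance components.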
This comparison helps to explain the efficiency gains from conditioning instead of differencing. We can then explore the conditional moments (24) and (25) to find Minimum Distance (MD) estimators. We further note that Arellano and Bond (1991) and Ahn and Schmidt (1995) implicitly use linear combinations of the moments (24) and (25) to find moments for the autoregressive parameter $\rho$.

We instead explore connections between these moments, conditional on the first observed value $y_{i,1}$, and random-effects assumptions for the unobserved value $y_{i,0}$. To simplify comparison, we continue working with the model without covariates and with homoskedastic errors.

As a preliminary to the correlated random effects assumption, consider the model written recursively back to time $t = 0$:
$$Y_{T+1} = \rho\, y_0\, e_1'B_{T+1}' + \eta\mathbf{1}_{T+1}'B_{T+1}' + \sigma U_{T+1}B_{T+1}', \quad \text{where} \qquad (26)$$
$$U_{T+1} \sim N\big(0_{N\times(T+1)}, I_N \otimes I_{T+1}\big),$$
and $y_0$ is unobserved. This model is again a special case of the linear setup of Chamberlain and Moreira (2009) without covariates $x$. The incidental parameter $\tau = [\,y_0 : \eta\,]$ would encapsulate the individual fixed effects and the initial conditions themselves. The regression coefficients would be given by
$$b(\gamma) = \begin{pmatrix}\rho e_1' \\ \mathbf{1}_{T+1}'\end{pmatrix}B' \quad \text{and} \quad c(\gamma) = \sigma B'$$
(where we again omit the subscript, here $T+1$, from the matrices when there is no confusion).

The maximal invariant is $Y_{T+1}'Y_{T+1}$, which contains $(T+1)(T+2)/2$ moments:
$$E\left[\frac{Y_{T+1}'Y_{T+1}}{N}\right] = \sigma^2 B_{T+1}\left\{\big[\rho e_1 : \mathbf{1}_{T+1}\big]\,\omega_\tau\begin{pmatrix}\rho e_1' \\ \mathbf{1}_{T+1}'\end{pmatrix} + I_{T+1}\right\}B_{T+1}'.$$
The total number of parameters is five: the autoregressive parameter $\rho$, the error variance $\sigma^2$, and the three nonredundant elements of $\omega_\tau$. The parameters need to be well-behaved as the sample size $N$ grows, and inference is based on the asymptotic normality of their respective estimators.
If different series start at different points in time (as typically happens with firms' data), then the parameter $\omega_{y_0}$ may not be well-behaved, or its estimator may not be asymptotically normal. For example, moment equations depend on terms such as
$$\sum_{i=1}^N y_{i,0}\,u_{i,t}.$$
The normal approximation can be poor when most of the $y_{i,0}$ are close to zero. This happens because the variance ratio of some terms $y_{i,0}u_{i,t}$ to their sum,
$$\frac{\max_{j\leq N}\operatorname{Var}(y_{j,0}u_{j,t})}{\sum_{i=1}^N \operatorname{Var}(y_{i,0}u_{i,t})},$$
may not be negligible. This problem is mitigated by shocks over time; hence, conditioning on the observations $y_{i,1}$ is more robust.

For us to make a connection to the CRE estimator, we need to include the vector of ones in $x = \mathbf{1}_N$. The reason is that the (correlated) random effects assumption
$$[\,y_0 : \eta\,] \mid x \sim N(x\iota, I_N \otimes \Phi), \quad \text{where } \iota = (\iota_1, \iota_2),$$
would need to be invariant to our group of transformations to be decomposed into an invariant uniform prior and an additional term (and we want to allow the random effects to have a nonzero mean even without additional regressors).

The setup is the same as treating the initial condition as an incidental parameter. However, we include $x = \mathbf{1}_N$ and define the regression coefficient on the vector of ones, as in Section 7 of Chamberlain and Moreira (2009). The maximal invariant is the pair $\mathbf{1}_N'Y_{T+1}$ and $Y_{T+1}'Y_{T+1}$. Their (conditional on $\tau = [\,y_0 : \eta\,]$) expectations are given by
$$E\left[\frac{\mathbf{1}_N'Y_{T+1}}{N}\right] = \delta_{\tau,\mathbf{1}_N}\begin{pmatrix}\rho e_1' \\ \mathbf{1}_{T+1}'\end{pmatrix}B_{T+1}'$$
and
$$E\left[\frac{Y_{T+1}'Y_{T+1}}{N}\right] = B_{T+1}\left\{\big[\rho e_1 : \mathbf{1}_{T+1}\big]\big(\sigma^2\omega_{\tau,\mathbf{1}_N} + \delta_{\tau,\mathbf{1}_N}'\delta_{\tau,\mathbf{1}_N}\big)\begin{pmatrix}\rho e_1' \\ \mathbf{1}_{T+1}'\end{pmatrix} + \sigma^2 I_{T+1}\right\}B_{T+1}'.$$
The unknown parameters are the autoregressive coefficient $\rho$, the error variance $\sigma^2$, the sample averages $\delta_{\tau,\mathbf{1}_N}$, and the standardized squared deviations $\omega_{\tau,\mathbf{1}_N}$.
Hence, the invariance argument reduces the data space to $(T+1) + (T+1)(T+2)/2$ statistics. As before, if different series start at different points in time, the parameter $\omega_{\tau,N}$ may not be well-behaved and its estimator may not be asymptotically normal. (The Lindeberg condition holds if and only if the sum is approximately normal and the terms are asymptotically negligible.)

Under the random effects assumption, the model becomes
$$Y_{T+1} = 1_N a(\gamma) + U_{T+1} c(\gamma),$$
where
$$a(\gamma) = \iota \begin{bmatrix} \rho e_1' \\ 1_{T+1}' \end{bmatrix} B_{T+1}' \quad\text{and}\quad c(\gamma)' c(\gamma) = B_{T+1} \left\{ \begin{bmatrix} \rho e_1 & 1_{T+1} \end{bmatrix} \Phi \begin{bmatrix} \rho e_1' \\ 1_{T+1}' \end{bmatrix} + \sigma^2 I_{T+1} \right\} B_{T+1}'.$$
The (unconditional) expectation of the maximal invariant has the same functional form as the moments conditional on $\tau = [y_0 \;\; \eta]$:
$$E\left[\frac{1_N' Y_{T+1}}{N}\right] = \iota \begin{bmatrix} \rho e_1' \\ 1_{T+1}' \end{bmatrix} B_{T+1}'$$
and
$$E\left[\frac{Y_{T+1}' Y_{T+1}}{N}\right] = B_{T+1} \left\{ \begin{bmatrix} \rho e_1 & 1_{T+1} \end{bmatrix} \left(\Phi + \iota' \iota\right) \begin{bmatrix} \rho e_1' \\ 1_{T+1}' \end{bmatrix} + \sigma^2 I_{T+1} \right\} B_{T+1}'.$$
So the criticism on lack of robustness to the data-generating process for $y_{i,0}$ is applicable here as well.

Blundell and Bond (1998) instead consider a different assumption, that $y_0$ does not deviate systematically from the stationary mean $\eta/(1-\rho)$. We start with the assumption
$$y_0 \sim N\left(\frac{\eta}{1-\rho}, \sigma_0^2 I_N\right).$$
We assume normality for purposes of invariance. However, only the mean and variance are relevant for the expectation calculations derived below. Under this assumption, the model is equivalent to
$$Y_{T+1} = \eta b(\gamma) + U_{T+1} c(\gamma),$$
where the coefficients are given by
$$b(\gamma) = 1_{T+1}' B_{T+1}' + \frac{\rho}{1-\rho} e_1' B_{T+1}' \quad\text{and}\quad c(\gamma)' c(\gamma) = B_{T+1} \left(\sigma_0^2 \rho^2 e_1 e_1' + \sigma^2 I_{T+1}\right) B_{T+1}'.$$
The maximal invariant is given by $Y_{T+1}' Y_{T+1}$:
$$E\left[\frac{Y_{T+1}' Y_{T+1}}{N}\right] = B_{T+1} \left(\sigma_0^2 \rho^2 e_1 e_1' + \sigma^2 I_{T+1}\right) B_{T+1}' + \sigma^2 \omega_\eta B_{T+1} \left[ 1_{T+1} 1_{T+1}' + \frac{\rho}{1-\rho} \left(e_1 1_{T+1}' + 1_{T+1} e_1'\right) + \frac{\rho^2}{(1-\rho)^2} e_1 e_1' \right] B_{T+1}'.$$
The unknown parameters are $\rho$, $\sigma^2$, $\omega_\eta$, and $\sigma_0^2$. We can re-arrange some of these $(T+1)(T+2)/2$ moments to obtain simple moment conditions for $\rho$.
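The expectation formula under the mean-stationarity assumption can be checked by simulation. The sketch below (our own construction, with illustrative parameter values) draws $y_0$ around the stationary mean $\eta/(1-\rho)$, simulates the panel, and compares $Y'Y/N$ to the theoretical expression $c(\gamma)'c(\gamma) + \sigma^2 \omega_\eta\, b(\gamma)'b(\gamma)$:

```python
import numpy as np

# Hedged numerical check (our construction, not the paper's code): under
# y_0 ~ N(eta/(1-rho), sigma0^2 I_N), the sample second-moment matrix Y'Y/N
# should approach  B(sigma0^2 rho^2 e1 e1' + sigma^2 I)B'
#                  + sigma^2 omega_eta * b(gamma)' b(gamma).
rng = np.random.default_rng(1)
N, T1 = 400_000, 3
rho, sigma, sigma0 = 0.5, 1.0, 0.8
eta = rng.normal(0.2, 0.5, size=N)           # fixed effects, drawn once and held fixed

t = np.arange(T1)
B = np.tril(rho ** (t[:, None] - t[None, :]))
e1 = np.zeros(T1); e1[0] = 1.0
ones = np.ones(T1)

# Simulate the panel: y_0 centered at the stationary mean, then recurse.
y = eta / (1 - rho) + sigma0 * rng.normal(size=N)
Y = np.empty((N, T1))
for s in range(T1):
    y = rho * y + eta + sigma * rng.normal(size=N)
    Y[:, s] = y

# Theoretical moments: omega_eta is the standardized second moment of eta.
omega_eta = (eta @ eta) / (N * sigma**2)
b = (ones + rho / (1 - rho) * e1) @ B.T                      # b(gamma)
cc = B @ (sigma0**2 * rho**2 * np.outer(e1, e1)
          + sigma**2 * np.eye(T1)) @ B.T                     # c(gamma)'c(gamma)
theory = cc + sigma**2 * omega_eta * np.outer(b, b)

print(np.max(np.abs(Y.T @ Y / N - theory)))  # small Monte Carlo error
```

The agreement holds term by term because $Y = \eta b(\gamma) + \text{error}$ with error second moment $c(\gamma)'c(\gamma)$; replacing the stationary draw of $y_0$ by, say, a constant breaks the match.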
For example,
$$E\left[(y_{i,2} - y_{i,1})(y_{i,3} - \rho y_{i,2})\right] = 0. \quad (27)$$
As Blundell and Bond (1998) also note, this expectation may fail to be zero if the stationarity assumption breaks down. For example, this moment condition breaks down if either $y_{i,0} = k$ or even if $y_{i,0} \overset{iid}{\sim} N(0, \sigma_0^2)$.

It is worthwhile making a connection between the stationarity assumption and inference conditional on the first observations. Since inference is conditional on $y_0$, we could stack the quantities together and look at the (conditional) expectation of
$$Y_{T+1}' Y_{T+1} = \begin{bmatrix} y_0' y_0 & y_0' Y_T \\ Y_T' y_0 & Y_T' Y_T \end{bmatrix}.$$
As in the derivation based on that of Blundell and Bond (1998), we obtain exactly the same quantity $Y_{T+1}' Y_{T+1}$. However, we have five parameters (if we consider $y_0' y_0$ itself to be a parameter). Alternatively, we lose one moment (from removing $y_0' y_0$ itself) and four parameters. If the stationarity assumption is correct, there should be an efficiency loss from making inference conditional on the first observation. On the other hand, conditional inference should be robust to different data-generating processes for the initial data values.

Conclusion

This paper studies the Bayesian estimator of Lancaster (2002), which is invariant to natural rotations of the data. The likelihood of the maximal invariant can be divided into two parts, one of which is maximized to obtain Lancaster's estimator. The second part is not asymptotically negligible and, as such, can be used to attain a more efficient estimator. A natural conclusion is that it is unnecessary to use the data to choose the correct inflection point of Lancaster's score likelihood. Lancaster (2002)'s theory is based on differencing the data using the first observation. Instead, we can condition on the first observation itself to further improve efficiency.
The conditional argument is essentially a fixed-effects approach in which we make no further assumptions on unobserved data.

Current practice in economics uses standard GMM methods based on data differencing; e.g., Athanasoglou, Brissimis, and Delis (2008) on bank profitability, Guiso, Pistaferri, and Schivardi (2005) on risk allocation, Konings and Vandenbussche (2005) on antidumping protection effects, and Topalova and Khandelwal (2011) on the effect of tariff changes on firm productivity, among others. Instead, we advocate using moments based on invariant statistics from a model that allows for heteroskedasticity and a factor structure. It is relatively straightforward to extend our method to allow for covariates in the linear model of Chamberlain and Moreira (2009). Bai (2013a) allows for general time-series heteroskedasticity as opposed to assuming specific moving average (MA) processes and data values lagged enough as instruments. Chamberlain and Moreira (2009) and Moon and Weidner (2015) suggest including factors which generalize individual and time effects. Conditioning on the first observation can then provide reliable inference in dynamic panel data models using moments based on invariant statistics for more general models.
References
Abramowitz, M., and I. A. Stegun (1965): Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables.

Ahn, S., and P. Schmidt (1995): "Efficient Estimation of Models for Dynamic Panel Data," Journal of Econometrics, 68, 5–27.

Arellano, M., and S. Bond (1991): "Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations," Review of Economic Studies, 58, 277–297.

Athanasoglou, P. P., S. N. Brissimis, and M. D. Delis (2008): "Bank-specific, industry-specific and macroeconomic determinants of bank profitability," Journal of International Financial Markets, Institutions and Money, 18, 121–136.

Bai, J. (2013a): "Fixed-Effects Dynamic Panel Models, A Factor Analytical Method," Econometrica, 81, 285–314.

Bai, J. (2013b): "Fixed-Effects Dynamic Panel Models, A Factor Analytical Method," Econometrica, 81, 185–314, Supplement.

Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998): Efficient and Adaptive Estimation for Semiparametric Models. Springer.

Blundell, R., and S. Bond (1998): "Initial Conditions and Moment Restrictions in Dynamic Panel Data Models," Journal of Econometrics, 87, 115–143.

Blundell, R., and R. Smith (1991): "Conditions Initiales et Estimation Efficace dans les Modeles Dynamiques sur Donnees de Panel: Une Application au Comportement d'Investissement des Entreprises," Annales d'Economie et de Statistique, 20.

Cabral, L. M. B., and J. Mata (2003): "On the Evolution of the Firm Size Distribution: Facts and Theory," American Economic Review, 93, 1075–1090.

Chamberlain, G., and M. J. Moreira (2009): "Decision Theory Applied to a Linear Panel Data Model," Econometrica, 77, 107–133.

Dhaene, G., and K. Jochmans (2016): "Likelihood Inference in an Autoregression with Fixed Effects," Econometric Theory, 32, 1–38.

Evans, D. S. (1987a): "The Relationship Between Firm Growth, Size, and Age: Estimates for 100 Manufacturing Industries," The Journal of Industrial Economics, 35, 567–581.

Evans, D. S. (1987b): "Tests of Alternative Theories of Firm Growth," Journal of Political Economy, 95, 657–674.

Guiso, L., L. Pistaferri, and F. Schivardi (2005): "Insurance within the Firm," Journal of Political Economy, 113, 1054–1087.

Hall, B. H. (1987): "The Relationship between Firm Size and Firm Growth in the US Manufacturing Sector," The Journal of Industrial Economics, 35, 583–606.

Konings, J., and H. Vandenbussche (2005): "Antidumping Protection and Markups of Domestic Firms," Journal of International Economics, 65, 151–165.

Kruiniger, H. (2014): "A Further Look at Modified ML Estimation of the Panel AR(1) Model with Fixed Effects and Arbitrary Initial Conditions," Working Paper, Durham University.

Lancaster, T. (2002): "Orthogonal Parameters and Panel Data," The Review of Economic Studies, 69, 647–666.

Moon, H. R., and M. Weidner (2015): "Linear Regression for Panel with Unknown Number of Factors as Interactive Fixed Effects," Econometrica, 83, 1543–1579.

Moreira, M. J. (2009): "A Maximum Likelihood Method for the Incidental Parameter Problem," Annals of Statistics, 37, 3660–3696.

Topalova, P., and A. Khandelwal (2011): "Trade Liberalization and Firm Productivity: The Case of India,"