Discretizing Unobserved Heterogeneity
aa r X i v : . [ ec on . E M ] F e b Discretizing Unobserved Heterogeneity ∗ St´ephane Bonhomme † Thibaut Lamadon ‡ Elena Manresa § Revised draft: January 2021
Abstract
We study discrete panel data methods where unobserved heterogeneity isrevealed in a first step, in environments where population heterogeneity isnot discrete. We focus on two-step grouped fixed-effects (GFE) estimators,where individuals are first classified into groups using kmeans clustering,and the model is then estimated allowing for group-specific heterogeneity.Our framework relies on two key properties: heterogeneity is a function— possibly nonlinear and time-varying — of a low-dimensional continuouslatent type, and informative moments are available for classification. Weillustrate the method in a model of wages and labor market participation,and in a probit model with time-varying heterogeneity. We derive asymp-totic expansions of two-step GFE estimators as the number of groups growswith the two dimensions of the panel. We propose a data-driven rule forthe number of groups, and discuss bias reduction and inference.
JEL codes:
C23, C38.
Keywords:
Unobserved heterogeneity, panel data, kmeans clustering,dimension reduction. ∗ We thank Anna Simoni, Manuel Arellano, Neele Balke, Jesus Carro, Gary Chamberlain, TimChristensen, Alfred Galichon, Chris Hansen, Joe Hotz, Gr´egory Jolivet, Arthur Lewbel, AnnaMikusheva, Roger Moon, Whitney Newey, Juan Pantano, Philippe Rigollet, Martin Weidner,and seminar audiences at various places for comments. The authors acknowledge support fromthe NSF grant number SES-1658920. The usual disclaimer applies. † University of Chicago, [email protected] ‡ University of Chicago, [email protected] § New York University, [email protected] Introduction
In both reduced-form and structural work in economics, it is common to modelunobserved heterogeneity as a small number of discrete types. Various estima-tion strategies are available, including discrete-type random-effects (as in Keaneand Wolpin, 1997, and many other applications) and grouped fixed-effects (as re-cently studied by Hahn and Moon, 2010, and Bonhomme and Manresa, 2015).These methods require the researcher to jointly estimate individual heterogene-ity and model parameters. In addition, little is known about their propertieswhen individual heterogeneity is not discrete in the population. In this paper,we study two-step discrete estimators for panel data, and provide conditions fortheir validity when heterogeneity is continuous.We focus on two-step grouped fixed-effects (GFE) estimators. In a first step,we classify individuals based on a set of individual-specific moments, using the kmeans clustering algorithm. The aim of the kmeans classification is to grouptogether individuals whose latent types are most similar. In a second step, weestimate the model by allowing for group-specific heterogeneity. This second stepis similar to fixed-effects (FE) estimation, albeit it involves a smaller number ofparameters that are group-specific instead of individual-specific. We analyze theproperties of these two-step estimators in panel data models where heterogeneity iscontinuous. Hence, in contrast with existing theoretical justifications for discrete-type methods, here we use discrete heterogeneity as a dimension reduction devicerather than as a substantive assumption about population unobservables.Our approach is targeted to environments with two key properties.
First ,unobserved heterogeneity is a function of a low-dimensional latent variable. Wedo not restrict this latent type to be discrete. In many economic models, agents’heterogeneity in preferences or technology is driven by a low-dimensional type,which enters the model nonlinearly and may affect multiple outcomes. As an Also related, nonparametric maximum likelihood methods (e.g., Heckman and Singer, 1984)rely on joint estimation of the distribution of heterogeneity and the parameters. In a network context, Gao et al. (2015) provide results for stochastic blockmodels undercontinuous heterogeneity. Buchinsky et al. (2005) also propose to group individuals in a first step using kmeans.
Second , the first-step moments satisfy an injectivity condition, which requiresany two individuals with the same population moments to have the same type.The choice of moments is important to ensure good performance. In examples, weshow how suitable moments arise naturally. In models with exogenous covariates,we propose and analyze the use of conditional moments to recover latent types.Our setup also covers models where heterogeneity varies over time. Unlikeadditive FE methods and interactive FE methods based on linear factor structures(Bai, 2009), GFE does not require heterogeneity to take an additive or interactiveform. As an illustration, we compare GFE and FE estimators in a probit modelwhere heterogeneity is a nonlinear function of a time-invariant factor loading anda time-specific factor.Our main results are large-
N, T asymptotic expansions of two-step GFE esti-mators under time-invariant and time-varying continuous heterogeneity. In bothsettings, GFE is consistent as the number of groups grows with the sample size,under conditions that we provide. We find that, when the population heterogene-ity is not discrete, estimating group membership induces an incidental parameterbias, similarly to FE methods. Moreover, since discreteness is an approximationin our setting, GFE is affected by approximation error. We propose a simple data-driven rule for the number of groups that controls the approximation error, anddiscuss how to reduce incidental parameter bias for inference.The outline of the paper is as follows. We introduce the setup and two-stepGFE estimators in Section 2, study their asymptotic properties in Section 3, andoutline several extensions in Section 4. The main proofs may be found in theappendix, and the supplemental material contains additional results.
We consider a panel data setup, where we denote outcome variables and exogenouscovariates as Y i =( Y ′ i , ..., Y ′ iT ) ′ and X i = ( X ′ i , ..., X ′ iT ) ′ , respectively, for i =1 , ..., N .3n our theory we cover two models. In the first one, unobserved heterogeneity istime-invariant. In this case, the conditional log-density of Y i given X i is given by: ln f i ( α i , θ ) = T X t =1 ln f ( Y it | Y i,t − , X it , α i , θ ) , (1)and the log-density of exogenous covariates X i takes the form:ln g i ( µ i ) = T X t =1 ln g ( X it | X i,t − , µ i ) , where θ is a vector of common parameters, and α i and µ i are individual-specificparameters. We leave the form of g unrestricted, and in estimation we will usea conditional likelihood approach based on f i alone. In other words, in applica-tions the researcher only needs to specify the parametric form of f i ( α i , θ ) in (1).However, the heterogeneity µ i in covariates plays an important role in our theory.In the second model, unobserved heterogeneity varies over time. Such variationin unobservables over calendar time (e.g., business cycle), age (e.g., life cycle),counties, or markets, is of interest in many applications. In the time-varying case,log-densities take the form:ln f i ( α i , θ ) = T X t =1 ln f ( Y it | Y i,t − , X it , α it , θ ) , ln g i ( µ i ) = T X t =1 ln g ( X it | X i,t − , µ it ) , where α i = ( α ′ i , ..., α ′ iT ) ′ and µ i = ( µ ′ i , ..., µ ′ iT ) ′ . In both models we areinterested in estimating θ , as well as average effects depending on α , ..., α N . GFE relies on two key assumptions that we now present. We defer the presen-tation of regularity conditions until Section 3.
First , we assume that unobserved In models with first-order dependence, we assume that Y i is observed and we conditionon it. Higher-order dependence can be accommodated similarly. In dynamic settings, Y it maycontain sequentially exogenous covariates in addition to outcome variables. ξ i . Assumption 1. (heterogeneity)(a) Time-invariant heterogeneity: There exist ξ i of fixed dimension d , and twoLipschitz-continuous functions α and µ , such that α i = α ( ξ i ) and µ i = µ ( ξ i ) .(b) Time-varying heterogeneity: There exist ξ i of fixed dimension d , λ t of di-mension d λ , and two functions α and µ that are Lipschitz-continuous in their firstargument, such that α it = α ( ξ i , λ t ) and µ it = µ ( ξ i , λ t ) . We will refer to ξ i as an individual type , and to d as the dimension of het-erogeneity. The researcher does not need to know d , α , or µ in applications. Inmodels with time-varying unobserved heterogeneity, Assumption 1 requires unob-servables to follow a factor structure. The link between α it , ξ i and λ t may benonlinear, the linear structure α it = ξ ′ i λ t (Bai, 2009) being covered as a specialcase. Moreover, the dimension of λ t is unrestricted. Our theory will show thatthe performance of two-step GFE crucially relies on ξ i being low-dimensional, aleading case being d = 1. We provide examples in the next subsection. Second , we rely on individual-specific moment vectors h i that are informativeabout the types ξ i . We state this formally as our second main assumption, where k · k denotes an Euclidean norm. Assumption 2. (injective moments)There exist vectors h i of fixed dimension, and a Lipschitz-continuous function ϕ ,such that plim T →∞ h i = ϕ ( ξ i ) , and N P Ni =1 k h i − ϕ ( ξ i ) k = O p (1 /T ) as N, T tend to infinity. Moreover, there exists a Lipschitz-continuous function ψ suchthat ξ i = ψ ( ϕ ( ξ i )) . Assumption 2 requires the individual moment vector h i to be informative about ξ i , in the sense that, for large T , ξ i can be uniquely recovered from h i . Neither ϕ nor ψ (which may depend on θ ) need to be known to the econometrician. In-tuitively, injectivity guarantees that one can separate the types of two individuals ξ i and ξ i ′ by comparing their moments h i and h i ′ . For example, an average h i = T P Tt =1 h ( Y it , X it ) will, under Assumption 1 and suitable regularity condi-5ions, converge as T tends to infinity to a function ϕ ( ξ i ) of the type ξ i . Werequire ϕ to be injective.The convergence rate in Assumption 2 requires appropriate conditions on theserial dependence of Y it and X it . In models with time-varying heterogeneity, ϕ will also depend on the λ t process. In such models, Assumption 2 requires themoments to be informative about ξ i , and not λ t . Injectivity is a key requirementfor consistency of two-step GFE estimators. More generally, the choice of moments h i is important for finite-sample performance. To illustrate the framework we now describe two examples, for which we willprovide illustrative simulations in Subsection 2.4. First, consider a dynamic modelof wages W ∗ it and labor force participation Y it : Y it = { u ( α i ) ≥ c ( Y i,t − ; θ ) + U it } ,W ∗ it = α i + V it ,W it = Y it W ∗ it , (2)where the wage W ∗ it is only observed when i works, U it are i.i.d. standard normal,independent of the past Y it ’s and α i , and V it are i.i.d. independent of all U it ’s, Y i , and α i . Here the same scalar expected payoff α i = ξ i , unobserved tothe econometrician, drives the wage and the decision to work. 
Individuals havecommon preferences denoted by the utility function u , the cost function c is state-dependent, and both u and c are unknown to the econometrician.In this setting, GFE provides a natural approach to exploit the functional linkbetween α i and u ( α i ), and to learn about the type α i using both wages andparticipation. For instance, when h i = ( W i , Y i ) ′ , where Z i = T P Tt =1 Z it denotesthe individual mean of Z it , injectivity is satisfied under mild conditions, provided W i = α i Y i + o p (1) and plim T →∞ Y i > θ in (2). However, a con-ventional FE estimator would treat α i and u i = u ( α i ) as unrelated parameters,so the FE estimate of θ would be solely based on the binary participation deci-6ions. Another strategy would be to rely on discrete-type random-effects methods,which are typically based on joint estimation. In contrast, we implement GFE intwo steps with no need for iterative estimation, and we justify the estimator inenvironments where heterogeneity is not restricted to be discrete.As a second example, consider the following probit model with time-varyingheterogeneity: Y it = { X ′ it θ + α it + U it ≥ } ,X it = µ it + V it , (3)where U it are i.i.d. standard normal, independent of all V it ’s, α it ’s, and µ it ’s, and V it are i.i.d. independent of all α it ’s and µ it ’s. Under Assumption 1, α it and µ it depend on a low-dimensional vector ξ i of factor loadings, so α it = α ( ξ i , λ t )and µ it = µ ( ξ i , λ t ). Here d is the dimension of the type ξ i governing both α it and µ it .To motivate why, in static models with covariates such as (3), α it and µ it may depend on a common low-dimensional type ξ i , suppose that, in every period,agent i chooses X it based on expected utility or profit maximization. She observes ξ i and λ t — which enter outcomes through α it — and takes her decision beforethe i.i.d. shock U it is realized. In such a case, X it will be a function of ξ i and λ t ,as well as idiosyncratic factors V it in the agent’s information set. Here we assumethat the agent’s information set, and primitives such as preferences or costs, donot include other i -specific elements beyond ξ i . When α ( · , · ) is additive or multiplicative in its arguments, model (3) can beestimated using two-way FE (Fern´andez-Val and Weidner, 2016) or interactiveFE (Bai, 2009, Chen et al. , 2020), respectively. However, when α ( · , · ) is unknown,these fixed-effects estimators are inconsistent in general. In contrast, GFE willremain consistent when unobservables are unknown nonlinear functions of factorloadings ξ i and factors λ t , and injectivity holds. Taking h i = ( Y i , X ′ i ) ′ as mo-ments in model (3), injectivity is satisfied when types have monotone effects on This example is reminiscent of Mundlak’s (1961) classic analysis of farm production func-tions, where soil quality ξ i is observed to the farmer but latent to the analyst. More generally, in Assumption 2 we require thatthe latent type ξ i can be asymptotically recovered from a moment vector whosedimension is not growing with the sample size. Two-step GFE consists of a classification step and an estimation step.
First step: classification.
We rely on the individual-specific moments h i tolearn about the individual types ξ i . Specifically, we partition individuals into K groups, corresponding to group indicators b k i ∈ { , ..., K } , by computing: (cid:16)b h (1) , ..., b h ( K ) , b k , ..., b k N (cid:17) = argmin( e h (1) ,..., e h ( K ) ,k ,...,k N ) N X i =1 (cid:13)(cid:13)(cid:13) h i − e h ( k i ) (cid:13)(cid:13)(cid:13) , (4)where { k i } are partitions of { , ..., N } into K groups, and e h ( k ) is a vector. Notethat b h ( k ) is simply the mean of h i in group b k i = k .In the kmeans optimization problem (4), the minimum is taken with respectto all possible partitions { k i } . Fast and stable optimization methods such asLloyd’s algorithm are available, although computing a global minimum may bechallenging; see Bonhomme and Manresa (2015) for references. Following theliterature, we will focus on the asymptotic properties of the global minimum andabstract from optimization error. Lastly, note that the quadratic loss functionin (4) can accommodate weights on different components of h i , although here forsimplicity we present the unweighted case. Second step: estimation.
We maximize the log-likelihood function with re-spect to common parameters θ and group-specific effects α , where the groups aregiven by the b k i estimated in the first step. We define the two-step GFE estimatoras: (cid:16)b θ, b α (1) , ..., b α ( K ) (cid:17) = argmax ( θ,α (1) ,...,α ( K )) N X i =1 ln f i (cid:16) α (cid:16)b k i (cid:17) , θ (cid:17) . (5) To see this, consider the case where α it is the only component of heterogeneity (i.e., µ it = 0in (3)), and take h i = Y i . Letting G denote the cdf of − ( V ′ it θ + U it ), injectivity will holdwhen α ( · , · ) is strictly increasing in its first argument and G is strictly increasing, since then ϕ ( ξ ) = plim T →∞ T P Tt =1 G ( α ( ξ, λ t )) is strictly increasing. K group-specific parameters instead of N individual-specific ones. In models with time-varying heterogeneity, α ( k ) willsimply be a vector ( α ( k ) ′ , ..., α T ( k ) ′ ) ′ . Choice of K . Two-step GFE estimation requires setting a number of groups K . We propose a simple data-driven selection rule based on the first step. Theconvergence rate of the kmeans estimator (and the rate of the GFE estima-tor) will be governed by two quantities: the kmeans objective function b Q ( K ) = N P Ni =1 k h i − b h ( b k i ) k , which decreases as K gets larger and the group approxi-mation becomes more accurate, and the variability V h = E [ k h i − ϕ ( ξ i ) k ] of themoment h i , which does not depend on K . We take the smallest K that guaranteesthat b Q ( K ) is of the same or lower order as V h . That is, letting b V h = V h + o p (1 /T ),we suggest setting: b K = min K ≥ n K : b Q ( K ) ≤ γ b V h o , (6)where γ ∈ (0 ,
1] is a user-specified parameter. In the simulations in the nextsubsection we will set γ = 1, although smaller γ values corresponding to larger K ’s will also be supported by our theory. To illustrate the performance of GFE in models where heterogeneity follows anonlinear factor structure, we present the results of a small-scale simulation studybased on our two examples (2) and (3). In both cases, we assume that the type ξ i governing heterogeneity is scalar. We compare the bias of GFE to that of FEand interactive FE estimators. In the supplemental material, we provide detailson the simulations and report additional results.In Figure 1, we compare GFE and FE in model (2), using a CRRA functionalform: u ( α ) = e α (1 − η ) − − η , with a risk aversion parameter η ∈ { , } . We focus on thedifference in costs c (0; θ ) − c (1; θ ), which measures the degree of state dependence When h i = T P Tt =1 h ( Y it , X it ) and observations are independent over time, one may take b V h = NT P Ni =1 P Tt =1 k h ( Y it , X it ) − h i k . With dependent data, one can use trimming or thebootstrap to estimate V h (Hahn and Kuersteiner, 2011, Arellano and Hahn, 2007).
10 20 30 40 50T0.20.40.60.81.0 p a r a m e t e r = 1 10 20 30 40 50T= 2 Notes: Means of c (0; b θ ) − c (1; b θ ) over 1000 simulations. GFE is indicated in solid, FE is indashed, and the truth c (0; θ ) − c (1; θ ) = 1 is in dotted. N = 1000 , and T is indicated on thex-axis. η is the risk aversion parameter in u ( · ) . See the supplemental material for details. in participation decisions. We take h i = ( W i , Y i ) ′ as moments for GFE, and reportaverage parameter estimates over 1000 simulations. We set N =1000 and vary Tbetween 5 and 50. We find that FE is more biased than GFE for both values of riskaversion. This is consistent with wages and participation providing informativemoments about the latent type in this setting.In Figure 2 we compare GFE, FE, and interactive FE in model (3) with X it scalar, using a CES specification: α it = ( aξ σi + (1 − a ) λ σt ) σ , for σ ∈{− , , , } and a =0 .
5, and µ it = α it . The factors λ t and the individual loadings ξ i enterheterogeneity in a nonlinear way. We show estimates of θ for various estimators:Figure 2: Probit model (3) with time-varying heterogeneity
20 40T1.01.21.41.6 p a r a m e t e r = 10 20 40T= 0 20 40T= 1 20 40T= 10 Notes: Means of b θ over 1000 simulations. GFE is indicated in solid, FE is in dashed, interactiveFE is in dash-dotted, and the truth θ = 1 is in dotted. N = 1000 , and T is indicated on thex-axis. σ is the substitution parameter in α ( · , · ) . See the supplemental material for details. Y i , X i ) ′ as moments for GFE. Note that both Y i and X i are informative about ξ i in this data generating process. We reportparameter averages over 1000 simulations, for N =1000. We find that, while GFE,FE, and interactive FE are all biased, the bias of GFE is smaller across all σ values. In this section we provide asymptotic expansions for two-step GFE estimators.Our first result is a rate of convergence for kmeans. Let us define the approximationerror one would make if one were to discretize the latent types ξ i directly, as: B ξ ( K ) = min( e ξ (1) ,..., e ξ ( K ) ,k ,...,k N ) 1 N N X i =1 (cid:13)(cid:13)(cid:13) ξ i − e ξ ( k i ) (cid:13)(cid:13)(cid:13) , (7)where, similarly to (4), the minimum is taken with respect to all partitions { k i } and vectors e ξ ( k ). In the asymptotic analysis we let T = T N and K = K N tend toinfinity jointly with N . Lemma 1.
Let Assumption 2 hold. Let b h (1) , ..., b h ( K ) and b k , ..., b k N given by (4).Then, as N, T, K tend to infinity we have: N N X i =1 (cid:13)(cid:13)(cid:13)b h ( b k i ) − ϕ ( ξ i ) (cid:13)(cid:13)(cid:13) = O p (cid:18) T (cid:19) + O p ( B ξ ( K )) . The bound in Lemma 1 has two terms: an O p (1 /T ) term that depends on thenumber of periods used to construct the moments h i , and an O p ( B ξ ( K )) termthat reflects the presence of an approximation error. The rate at which B ξ ( K )tends to zero depends on the dimension of ξ i . Graf and Luschgy (2002, Theorem5.3) provide explicit characterizations in the case where ξ i has compact support. Large-
N, T theory implies that additive and interactive FE are consistent when σ = 1and σ = 0, respectively. Figure 2 shows that, despite being large- N, T consistent in thesespecifications, in our simulations, additive and interactive FE have larger biases than GFE forthe N and T values we consider. See Graf and Luschgy (2002, p. 875) for a discussion of the compact support assumption. B ξ ( K ) = O p ( K − ) when ξ i isone-dimensional, and B ξ ( K ) = O p ( K − ) when ξ i is two-dimensional. Lemma 2. (Graf and Luschgy, 2002) Let ξ i be random vectors with compactsupport in R d . Then, as N, K tend to infinity we have B ξ ( K ) = O p ( K − d ) . We now use these results to study the properties of GFE in models with time-invariant and time-varying heterogeneity, in turn. We use the shorthand notation E Z ( W ) and E Z = z ( W ) for the conditional expectations of W given Z and Z = z ,respectively. In the time-varying case, we denote as λ the process of λ t ’s, andas E λ = λ ( W ) the conditional expectation of W given λ = λ . We use a similarnotation for variances. Finally, k M k denotes the spectral norm of a matrix M . To state our first main theorem, where heterogeneity is time-invariant, we makethe following assumptions, where ℓ it ( α i , θ ) = ln f ( Y it | Y i,t − , X it , α i , θ ), ℓ i ( α i , θ ) = T P Tt =1 ℓ it ( α i , θ ), and α ( θ, ξ ) = argmax α E ξ i = ξ ( ℓ i ( α, θ )) for all θ, ξ . Assumption 3. (regularity, time-invariant heterogeneity)(i) ( Y ′ i , X ′ i , ξ ′ i , h ′ i ) ′ are i.i.d.; ( Y ′ it , X ′ it ) ′ are stationary for all i ; ℓ it ( α, θ ) is threetimes differentiable in ( α, θ ) for all i, t ; and the parameter space Θ for θ is compact, the space for α i is compact, and θ belongs to the interior of Θ .(ii) N, T, K tend jointly to infinity; sup ξ,α,θ | E ξ i = ξ ( ℓ it ( α, θ )) | = O (1) , and sim-ilarly for the first three derivatives of ℓ it ; inf ξ,α,θ E ξ i = ξ ( − ∂ ℓ it ( α,θ ) ∂α∂α ′ ) > ;and max i sup α,θ (cid:12)(cid:12) ℓ i ( α, θ ) − E ξ i ( ℓ i ( α, θ )) (cid:12)(cid:12) = o p (1) , and similarly for the firstthree derivatives of ℓ i .(iii) inf ξ,θ E ξ i = ξ ( − ∂ ℓ it ( α ( θ,ξ ) ,θ ) ∂α∂α ′ ) > ; E [ T P Tt =1 ℓ it ( α ( θ, ξ i ) , θ )] has a unique max-imum at θ on Θ , and its matrix of second derivatives is − H < ; and sup θ NT P Ni =1 P Tt =1 k ∂ ℓ it ( α ( θ,ξ i ) ,θ ) ∂θ∂α ′ k = O p (1) .(iv) sup e ξ,α k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ ) k , sup e ξ,α k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ ) k , and sup e ξ,θ k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ ( ∂ℓ it ( α ( θ, e ξ ) ,θ ) ∂α ) k are O (1) . That is, ln f ( y it | y i,t − , x it , α, θ ) is three times differentiable in ( α, θ ), for almost all( y it , y it − , x it ).
12n part (i) in Assumption 3 we treat heterogeneity as random in order to useLemma 2, which requires ξ i to be i.i.d. draws from a distribution. However, notewe do not restrict how α i and µ i depend on each other. Moreover, while ourresults require asymptotic stationarity of the time-series processes, the theoremcould be extended to allow for nonstationary initial conditions.In part (ii) we require strict concavity of the log-likelihood as a function of α .Concavity holds in a number of nonlinear panel data models such as probit andlogit models, tobit, Poisson, or multinomial logit; see Fern´andez-Val and Weidner(2016) and Chen et al. (2020). One can show that Theorem 1 continues to holdwithout concavity, under an identification condition and an assumption boundingthe derivatives of the empirical GFE objective function. Importantly, note that H − is the asymptotic variance of the FE estimator. As a result, H being positivedefinite rules out models that are not identified under FE, such as a linear modelwith a time-invariant covariate and a heterogeneous intercept.In part (iii) we introduce the target log-likelihood NT P Ni =1 P Tt =1 ℓ it ( α ( θ, ξ i ) , θ )(Arellano and Hahn, 2007), which we will show approximates the GFE log-likelihoodin large samples under our assumptions (note that α ( θ , ξ i )= α i ). In part (iv) werequire some moments to be bounded asymptotically.We now state our first main result, where we denote, evaluating all quantitiesat true values ( θ , α i ) and omitting the dependence from the notation: s i = 1 T T X t =1 ∂ℓ it ∂θ + E ξ i (cid:18) ∂ ℓ it ∂θ∂α ′ (cid:19) (cid:20) E ξ i (cid:18) − ∂ ℓ it ∂α∂α ′ (cid:19)(cid:21) − ∂ℓ it ∂α ! , (8) H = plim N,T →∞ N T N X i =1 T X t =1 E ξ i (cid:18) − ∂ ℓ it ∂θ∂θ ′ (cid:19) − E ξ i (cid:18) ∂ ℓ it ∂θ∂α ′ (cid:19) (cid:20) E ξ i (cid:18) − ∂ ℓ it ∂α∂α ′ (cid:19)(cid:21) − E ξ i (cid:18) ∂ ℓ it ∂α∂θ ′ (cid:19) ! . (9) Theorem 1.
Let the conditions of Lemmas 1 and 2 and Assumptions 1, 2 and 3 old. Then, as N, T, K tend to infinity we have: b θ = θ + H − N N X i =1 s i + O p (cid:18) T (cid:19) + O p (cid:16) K − d (cid:17) + o p (cid:18) √ N T (cid:19) . (10)The first three terms in (10) also appear in large- N, T expansions of FE estima-tors (e.g., Hahn and Newey, 2004). Similarly to FE, GFE is subject to incidentalparameter O p (1 /T ) bias. This contrasts with the properties of GFE estimatorsunder discrete heterogeneity (e.g., Hahn and Moon, 2010, Bonhomme and Man-resa, 2015). Indeed, when heterogeneity is not restricted to have a small numberof points of support, classification noise affects the properties of second-step es-timators in general. This motivates using bias reduction techniques for inferenceanalogous to those used in FE, as we will discuss in the next section.The O p ( K − d ) term in (10) reflects the approximation error, which dependson the number of groups. Setting K = b K according to (6) guarantees that theapproximation error is O p (1 /T ). Formally, we have the following result. Corollary 1.
Let the conditions in Theorem 1 hold. Let K = b K given by (6),with γ = O (1) . Then, as N, T tend to infinity we have: b θ = θ + H − N N X i =1 s i + O p (cid:18) T (cid:19) + o p (cid:18) √ N T (cid:19) . (11)Under Corollary 1, the biases of FE and GFE have the same order of magni-tude. However, the required value of K depends on the dimension d of individualheterogeneity. Specifically, when ξ i follows a continuous distribution of dimen-sion d , setting K proportional to or greater than min( T d , N ) will ensure that theapproximation error is O p (1 /T ). For small d (e.g., when d = 1) this will typicallyrequire a small number of groups (of the order of T ).GFE can have advantages compared to FE, for two reasons. First, the two-stepmethod can allow researchers to select moments that are particularly informative In the supplemental material we provide a similar expansion for GFE estimators of averageeffects M = NT P Ni =1 P Tt =1 m ( X it , α i , θ ), which are functions of both common parametersand individual heterogeneity. /T , yet K/N tends to zero. We have the following.
Corollary 2.
Let the conditions in Theorem 1 hold. Let K = b K given by (6),with γ = o (1) . Suppose that K/N tends to zero, and that Assumption A1 in theappendix holds. Then the O p (1 /T ) term in (11) takes the explicit form C/T + o p (1 /T ) , where: CT = H − ∂∂θ (cid:12)(cid:12)(cid:12)(cid:12) θ E (cid:20) (cid:13)(cid:13)b α i ( θ ) − E ξ i ( b α i ( θ )) (cid:13)(cid:13) i ( θ ) − k b α i ( θ ) − E h i ( b α i ( θ )) k i ( θ ) (cid:21) , (12) with b α i ( θ ) = argmax α ℓ i ( θ, α ) , Ω i ( θ ) = E ξ i ( − ∂ ℓ it ( α ( θ,ξ i ) ,θ ) ∂α∂α ′ ) , and k V k = V ′ Ω V . Corollary 2 shows that the first-order asymptotic bias of GFE is the differencebetween two terms. The bias is zero when h i is an injective function of ξ i ; i.e.,when ε i = h i − ϕ ( ξ i ) = 0. More generally, the bias can be expanded in ε i , and it issmall when moments provide accurate estimates of the latent types. Moreover, thefirst term on the right-hand side of (12) coincides with the bias of FE (e.g., Arellanoand Hahn, 2007). The form of (12) implies that the biases of FE and GFE areequal when the moments are the FE estimates h i = b α i ( θ ), however other momentchoices can lead to smaller biases. From this perspective, GFE provides flexibilityto use well-suited proxies of the latent types. As an example, our simulations ofthe labor force participation model (2) show that, by jointly exploiting wages andparticipation to construct moments that are informative about the latent type,GFE can have smaller bias than FE (and smaller mean squared error as well, asshown in the supplemental material).A second advantage of GFE comes from the use of grouping, and from theresulting regularization. Indeed, individual FE estimates can be highly variablewhenever the number of parameters per individual is large. In such cases, reduc-ing the number of parameters through grouping can improve performance. Forinstance, the ability to handle multiple components of heterogeneity is central tothe performance of GFE in models with time-varying unobserved heterogeneity.15his is the case we focus on next. To state our second main theorem, where heterogeneity is time-varying, we makethe following assumptions, where ℓ it ( α it , θ ) = ln f ( Y it | Y i,t − , X it , α it , θ ), ℓ i ( α i , θ ) = T P Tt =1 ℓ it ( α it , θ ), and α t ( θ, ξ ) = argmax α E ξ i = ξ,λ = λ ( ℓ it ( α, θ )). Assumption 4. (regularity, time-varying heterogeneity)(i) ( Y ′ i , X ′ i , ξ ′ i , h ′ i ) ′ are i.i.d. 
across i conditional on λ ; ( Y ′ it , X ′ it , λ ′ t ) ′ are sta-tionary for all i ; ℓ it ( α it , θ ) is three times differentiable, for all i, t ; and Θ andthe space for α it are compact, and θ belongs to the interior of Θ .(ii) N, T, K tend jointly to infinity; max t sup ξ,λ,α,θ | E ξ i = ξ,λ = λ ( ℓ it ( α, θ )) | = O (1) ,and similarly for the first three derivatives of ℓ it ; the minimum (respectively,maximum) eigenvalue of ( − ∂ ℓ it ( α,θ ) ∂α∂α ′ ) is bounded away from zero (resp., in-finity) with probability one, uniformly in i, t, α, θ ; the third derivatives of ℓ it ( α, θ ) are O p (1) , uniformly in i, t, α, θ ; and NT P Ni =1 P Tt =1 [ ℓ it ( α it , θ ) − E ξ i ,λ ( ℓ it ( α it , θ ))] = O p (1) , and similarly for the first three derivatives.(iii) min t inf ξ,λ,θ E ξ i = ξ,λ = λ ( − ∂ ℓ it ( α t ( θ,ξ ) ,θ ) ∂α∂α ′ ) > ; E [ T P Tt =1 ℓ it ( α t ( θ, ξ i ) , θ )] has aunique maximum at θ on Θ , and its matrix of second derivatives is − H < ;and sup θ NT P Ni =1 P Tt =1 k ∂ ℓ it ( α t ( θ,ξ i ) ,θ ) ∂θ∂α ′ k = O p (1) .(iv) k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ,λ = λ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ ) k , k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ,λ = λ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ ) k , and k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ,λ = λ ( ∂ℓ it ( α t ( θ, e ξ ) ,θ ) ∂α ) k are O (1) , uniformly in t , e ξ , λ , α , and θ .(v) E h i = h,ξ i = ξ,λ = λ ( ∂ℓ it ( α t ( θ,ξ ) ,θ ) ∂α ) and E h i = h,ξ i = ξ,λ = λ (vec ∂∂θ ′ (cid:12)(cid:12) θ ∂ℓ it ( α t ( θ,ξ ) ,θ ) ∂α ) are twicedifferentiable with respect to h , with first and second derivatives that are uni-formly bounded in t , ξ , λ , h in the support of h i given λ = λ , and θ ∈ Θ ; and k Var h i = h,ξ i = ξ,λ = λ ( ∂ℓ it ( α t ( θ,ξ ) ,θ ) ∂α ) k and k Var h i = h,ξ i = ξ,λ = λ (vec ∂∂θ ′ (cid:12)(cid:12) θ ∂ℓ it ( α t ( θ,ξ ) ,θ ) ∂α ) k are O (1) , uniformly in t , ξ , λ , h , and θ . In part (ii) in Assumption 4, we impose a stronger concavity condition than Note that α t ( θ, ξ i ) depends on the process λ in addition to the type ξ i , although we leavethe dependence on λ implicit in the notation. In a static model, α t ( θ, ξ i ) is a function of ξ i and λ t , while in a dynamic model it also depends on the history of the time effects ( λ t , λ t − , , ... ).
16n Assumption 3. The other parts are similar to Assumption 3, except part (v)where we require regularity of certain conditional expectations and variances.We next state our second main result, where, differently from Theorem 1, s i in(8) and H in (9) are now evaluated at ( θ , α it ), and expectations are conditionalon ( ξ i , λ ). Theorem 2.
Let the conditions of Lemmas 1 and 2 and Assumptions 1, 2 and 4hold. Then, as
N, T, K tend to infinity such that
K/N tends to zero, we have: b θ = θ + H − N N X i =1 s i + O p (cid:18) T (cid:19) + O p (cid:18) KN (cid:19) + O p (cid:16) K − d (cid:17) + o p (cid:18) √ N T (cid:19) . (13)Theorem 2 shows that GFE is consistent as N, T, K tend to infinity and
K/N tends to zero. This requires no parametric assumption about how ξ i and λ t affectindividual and time heterogeneity, unlike additive or interactive FE methods.To give intuition, consider the probit model (3) with time-varying unobserv-ables. Under Assumption 2, in the first step, GFE consistently estimates an injec-tive function ϕ i = ϕ ( ξ i ) of the type. One can then rewrite the outcome equationin (3) as Y it = { X ′ it θ + α ( ψ ( ϕ i ) , λ t ) + U it ≥ } , where ψ is the function intro-duced in Assumption 2, and α it = α ( ψ ( ϕ i ) , λ t ) is simply a time-varying functionof ϕ i . In the second step, GFE estimates this function by including group-timeindicators in the probit regression.As in Theorem 1, the expansion in Theorem 2 features a combination of inci-dental parameter bias and approximation error. When using the rule (6) for K ,the approximation error is of the same or lower order compared to 1/T. However,the O p ( K/N ) term is a new contribution relative to the time-invariant case, whichreflects the estimation of KT group-specific parameters using N T observations.As an example, when d = 1 and K is chosen of the order of T , the O p termsin (13) are O p (1 /T + T /N ). Although this rate of convergence can be fastwhen N is sufficiently large relative to T , it is too slow to apply conventional In particular, we use part (ii) in Assumption 4 to establish consistency. Note that thiscondition can be restrictive in models with time-varying random coefficients. When
N/T →
0, one could obtain a faster rate in (13) by choosing another rule for K . λ t is low-dimensional, we describe how toobtain a faster convergence rate by grouping both individuals and time periods. In models with time-invariant heterogeneity, Corollary 1 can be used to charac-terize the asymptotic distribution of GFE estimators. However, as in FE, thepresence of the O p (1 /T ) term in (11) shifts the distribution of b θ away from θ whenever T is not large relative to N . A variety of methods are available to bias-correct FE estimators and construct asymptotically valid confidence intervals; seeArellano and Hahn (2007) for a review. Consider the setup of Corollary 1, underthe additional assumption that the O p (1 /T ) term in (11) is equal to C/T + o p (1 /T )for some constant C . In this case, one can show that half-panel jackknife (Dhaeneand Jochmans, 2015) gives asymptotically valid inference based on GFE as N and T tend to infinity at the same rate. The distribution of the bias-corrected GFEestimator is then asymptotically normal centered at the truth, and the asymptoticvariance H − can be consistently estimated by replacing the expectations in (8)and (9) by group-specific means.In settings where heterogeneity varies over time, it can be desirable to groupnot only individuals as in (4), but also time periods (or alternatively countiesor markets, depending on the application). We now describe such a method,and discuss its potential for performing inference in models with time-varyingheterogeneity. In the two-way GFE approach, we classify time periods based oncross-sectional moments w t = N P Ni =1 w ( Y it , X it ), and compute: (cid:16) b w (1) , ..., b w ( L ) , b l , ..., b l T (cid:17) = argmin ( e w (1) ,..., e w ( L ) ,l ,...,l T ) T X t =1 (cid:13)(cid:13) w t − e w ( l t ) (cid:13)(cid:13) , (14) In particular, half-panel jackknife is valid under the conditions of Corollary 2, which requirestaking γ = o (1) in our rule (6) for K in order for the approximation error to be of small order.Deriving primitive conditions for the validity of half-panel jackknife and other bias-reductionmethods for other choices of K is left for future work. { l t } are partitions of { , ..., T } into L groups. Given the group indicators b k i and b l t , we then maximize P Ni =1 P Tt =1 ln f ( Y it | X it , α ( b k i , b l t ) , θ ), with respect to θ and the KL group-specific parameters α ( k, l ).Two-way GFE estimators can be expanded similarly to Theorem 2, under twomain additional assumptions: the model is static and observations are independentacross i and t , and the dimensions d λ of time heterogeneity λ t and d of individualheterogeneity ξ i are both small. Then, for s i and H as in Theorem 2, we show inthe supplemental material that: b θ = θ + H − N N X i =1 s i + O p (cid:18) T + 1 N + KLN T (cid:19) + O p (cid:16) K − d + L − dλ (cid:17) + o p (cid:18) √ N T (cid:19) . Suppose d = d λ = 1, and K is given by (6) with γ asymptotically constant, withan analogous choice for L . Then the O p term in this expansion can be shownto be O p (1 /T + 1 /N ). We leave to future work the formal study of the validityof bias reduction methods for inference, such as two-way split panel jackknife(Fern´andez-Val and Weidner, 2016), as N and T tend to infinity at the same rate. Our theory shows that the dimension d of heterogeneity plays a key role in theproperties of GFE. 
While models with scalar latent types ξ i , such as model (2)of wages and labor force participation, are not uncommon in economics, manyapplications involve conditioning covariates. Under Assumptions 1 and 2, themoments h i should, asymptotically, be injective functions of all the heterogeneitycoming from both Y i and X i . However, when X i depends on multiple componentsof heterogeneity, this might lead to a large dimension d .We now show that GFE can still perform well under a weaker form of injec-tivity. Consider the case where Assumption 1 is replaced by α i = α ( ξ i ) and µ i = µ ( ξ i , ν i ), where ν i is another latent component that affects covariates.Moreover, instead of requiring injectivity for both ξ i and ν i , let us maintainAssumption 2, which only requires h i to be injective for ξ i . In other words, h i needs to be directly informative about the unobserved heterogeneity component19 i that appears in the conditional distribution of Y i given X i . We show in thesupplemental material that, under regularity conditions otherwise similar to thoseof Corollary 1, the convergence rate of GFE is unaffected by the dimension of ν i .Specifically, when K = b K is given by (6) with γ = O (1) (which adapts to thedimension of ξ i and not the one of ν i ), we have: b θ = θ + O p (cid:18) T (cid:19) + O p (cid:18) √ N T (cid:19) . (15)To prove (15) we assume that the rate condition T d = O ( N ) holds, where d isthe (small) dimension of ξ i . In models with time-varying conditioning covariates, a simple way to targetmoments to ξ i is to construct h i using the conditional distribution of Y i given X i .To see this, consider a static model f ( Y it | X it , α i , θ ) where X it has finite support.In this case, we have under appropriate conditions: P Tt =1 { X it = x } h ( Y it , X it ) P Tt =1 { X it = x } | {z } = h i ( x ) = E X it = x,ξ i [ h ( Y it , X it )] | {z } = ϕ ( x,ξ i ) + o p (1) , where h i ( x ) is only defined when P Tt =1 { X it = x } 6 = 0, and, importantly, ϕ ( x, ξ i )does not depend on ν i . In the supplemental material we discuss implementation,and we report simulation results in a probit model with binary covariates. Wefind that using conditional moments can enhance the performance of GFE insuch settings. We leave the analysis of conditional moments in the presence ofcontinuous covariates to future work. In this paper, we analyze some properties of two-step grouped fixed-effects(GFE) methods in settings where population heterogeneity is not discrete. Our In the supplemental material, we provide an asymptotic expansion for GFE in a linearhomoskedastic model under a small approximation error, as in Corollary 2. The argumentrequires no restriction on the relative rates of N and T . Interestingly, in this case the asymptoticvariances of GFE and FE differ, since the within-group variation in ν i tends to decrease thevariance, yet the expansion features an additional score term compared to Theorem 1. et al. (2019) for matchedemployer-employee data in the presence of continuous firm heterogeneity. Otherpotential applications include nonlinear factor models, nonparametric and semi-parametric panel data models such as quantile regression with individual effects,and network models. References [1] Arellano, M., and J. Hahn (2007): “Understanding Bias in Nonlinear PanelModels: Some Recent Developments,”. In: R. Blundell, W. Newey, andT. Persson (eds.):
Advances in Economics and Econometrics, Ninth WorldCongress , Cambridge University Press.[2] Bai, J. (2009), “Panel Data Models with Interactive Fixed Effects,”
Econo-metrica , 77, 1229–1279.[3] Bonhomme, S., and E. Manresa (2015): “Grouped Patterns of Heterogeneityin Panel Data,”
Econometrica , 83(3), 1147–1184.[4] Bonhomme, S., T. Lamadon, and E. Manresa (2019): “A DistributionalFramework for Matched Employer-Employee Data,”
Econometrica , 87(3),699–739.[5] Buchinsky, M., J. Hahn, and J. Hotz (2005): “Cluster Analysis: A tool forPreliminary Structural Analysis,” unpublished manuscript.[6] Chen, M., I. Fern´andez-Val, and M. Weidner (2020): “Nonlinear Panel Modelswith Interactive Effects,” to appear in the
Journal of Econometrics .[7] Dhaene, G. and K. Jochmans (2015): “Split Panel Jackknife Estimation,”
Review of Economic Studies , 82(3), 991–1030.218] Fern´andez-Val, I., and M. Weidner (2016): “Individual and Time Effects inNonlinear Panel Data Models with Large N, T,”
Journal of Econometrics ,196, 291–312.[9] Gao, C., Y. Lu, and H. H. Zhou (2015): “Rate-Optimal Graphon Estimation,”
Annals of Statistics , 43(6), 2624–2652.[10] Graf, S., and H. Luschgy (2002): “Rates of Convergence for the EmpiricalQuantization Error”,
Annals of Probability , 30(2), 874–897.[11] Hahn, J., and G. Kuersteiner (2011): “Bias Reduction for Dynamic NonlinearPanel Models with Fixed Effects,”
Econometric Theory , 27(6), 1152–1191.[12] Hahn, J., and H. Moon (2010): “Panel Data Models with Finite Number ofMultiple Equilibria,”
Econometric Theory , 26(3), 863–881.[13] Hahn, J. and W.K. Newey (2004): “Jackknife and Analytical Bias Reductionfor Nonlinear Panel Models”,
Econometrica , 72, 1295–1319.[14] Heckman, J.J., and B. Singer (1984): “A Method for Minimizing the Impactof Distributional Assumptions in Econometric Models for Duration Data,”
Econometrica , 52(2), 271–320.[15] Keane, M., and K. Wolpin (1997): “The Career Decisions of Young Men,”
Journal of Political Economy , 105(3), 473–522.[16] Kennan, J., and J. Walker (2011): “The Effect of Expected Income on Indi-vidual Migration Decisions”,
Econometrica , 79(1), 211–251.[17] Mundlak, Y. (1961): “Empirical Production Function Free of ManagementBias,”
Journal of Farm Economics , 43(1), 44–56.
APPENDIX
Proof of Lemma 1.
Define B ϕ ( ξ ) ( K ) = min( e h, { k i } ) N P Ni =1 k ϕ ( ξ i ) − e h ( k i ) k ,similarly to (7), and denote: ( h, { k i } ) = argmin( e h, { k i } ) P Ni =1 k ϕ ( ξ i ) − e h ( k i ) k .By definition of ( b h, { b k i } ), we have: P Ni =1 k h i − b h ( b k i ) k ≤ P Ni =1 k h i − h ( k i ) k (al-most surely). Letting ε i = h i − ϕ ( ξ i ), we thus have, using the triangle inequalitytwice:1 N N X i =1 (cid:13)(cid:13)(cid:13) ϕ ( ξ i ) − b h ( b k i ) (cid:13)(cid:13)(cid:13) ≤ N N X i =1 (cid:13)(cid:13)(cid:13) h i − b h ( b k i ) (cid:13)(cid:13)(cid:13) + 2 N N X i =1 k h i − ϕ ( ξ i ) k N N X i =1 k h i − h ( k i ) k + 2 N N X i =1 k ε i k ≤ N N X i =1 k ϕ ( ξ i ) − h ( k i ) k !| {z } = B ϕ ( ξ ) ( K ) + 6 N N X i =1 k ε i k . By Assumption 2, N P Ni =1 k ε i k = O p (1 /T ). In addition, since ϕ is Lipschitz-continuous, there exists a constant τ such that k ϕ ( ξ ′ ) − ϕ ( ξ ) k ≤ τ k ξ ′ − ξ k for all( ξ, ξ ′ ). This implies that B ϕ ( ξ ) ( K ) ≤ τ B ξ ( K ), and Lemma 1 follows. Proofs of Theorems 1 and 2.
It is convenient to use a common notation forTheorems 1 and 2. Let p denote the number of individual-specific vectors α ji , j ∈ { , ..., p } . In the time-invariant case: p = 1, j = 1, and α ji = α i . In thetime-varying case: p = T , j ∈ { , ..., T } , and α ji = α it . Denote ℓ ij = ℓ i in the time-invariant case, and ℓ ij = ℓ it in the time-varying case. Let v ij = ∂ℓ ij ∂α , v αij = ∂ ℓ ij ∂α∂α ′ , v θij = ∂ ℓ ij ∂θ∂α ′ , and v ααij = ∂ ℓ ij ∂α∂α ′ ⊗ ∂α ′ (which is a dim α ji × (dim α ji ) matrix). Let,for all θ ∈ Θ, j ∈ { , ..., p } , and k ∈ { , ..., K } , b α j ( k, θ )=argmax α P Ni =1 { b k i = k } ℓ ij ( α, θ ). Likewise, denote α j ( θ, ξ )=argmax α E ξ i = ξ,λ = λ ( ℓ ij ( α, θ )). We will in-dex expectations by ξ i and λ , although the conditioning on λ is not neededin the time-invariant case of Theorem 1. Finally, let δ = T + K − d in the time-invariant case, and let δ = T + KN + K − d in the time-varying case.To show consistency of b θ , we first establish the next technical lemma (see thesupplemental material for the proof): Lemma A1.
Under the conditions of either Theorem 1 or Theorem 2 we have: N p N X i =1 p X j =1 (cid:13)(cid:13)(cid:13)b α j ( b k i , θ ) − α j ( θ, ξ i ) (cid:13)(cid:13)(cid:13) = O p ( δ ) , ∀ θ ∈ Θ , (A1)sup θ ∈ Θ N p N X i =1 p X j =1 (cid:13)(cid:13)(cid:13)b α j ( b k i , θ ) − α j ( θ, ξ i ) (cid:13)(cid:13)(cid:13) = o p (1) . (A2)From (A2) we then verify using a Taylor expansion that:sup θ ∈ Θ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N p N X i =1 p X j =1 ℓ ij (cid:16)b α j ( b k i , θ ) , θ (cid:17) − N p N X i =1 p X j =1 ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = o p (1) . b θ then follows by standard arguments.Next, the two key steps in the proof consist in showing the following two ex-pansions:1 N p N X i =1 p X j =1 ∂ℓ ij ( b α j ( b k i , θ ) , θ ) ∂θ = 1 N p N X i =1 p X j =1 ∂∂θ (cid:12)(cid:12)(cid:12)(cid:12) θ ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1) + O p ( δ ) , (A3)1 N p N X i =1 p X j =1 ∂ ∂θ∂θ ′ (cid:12)(cid:12)(cid:12)(cid:12) θ (cid:16) ℓ ij (cid:16)b α j ( b k i , θ ) , θ (cid:17) − ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1)(cid:17) = o p (1) . (A4)To show (A3), we show the following technical lemma, where we omit refer-ences to the evaluation points θ and α ji for conciseness: Lemma A2.
Under the conditions of either Theorem 1 or Theorem 2 we have: N p N X i =1 p X j =1 E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16)b α j ( b k i , θ ) − α ji + ( v αij ) − v ij (cid:17) = O p ( δ ) , N p N X i =1 p X j =1 (cid:16) v θij (cid:0) v αij (cid:1) − − E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − (cid:17) v αij (cid:16)b α j ( b k i , θ ) − α ji (cid:17) = O p ( δ ) . Now, expanding v θij ( b α j ( b k i , θ ) , θ ) around α j ( θ , ξ i )= α ji , and using the identity ∂α j ( θ ,ξ i ) ∂θ ′ = (cid:2) E ξ i ,λ (cid:0) − v αij (cid:1)(cid:3) − E ξ i ,λ (cid:0) v θij (cid:1) ′ , we obtain:1 N p N X i =1 p X j =1 ∂ℓ ij ( b α j ( b k i , θ ) , θ ) ∂θ − N p N X i =1 p X j =1 ∂∂θ (cid:12)(cid:12)(cid:12)(cid:12) θ ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1) = 1 N p N X i =1 p X j =1 n v θij (cid:16)b α j ( b k i , θ ) − α ji (cid:17) + E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v ij o + O p ( δ ) , and summing the two parts in Lemma A2 shows that the last expression is O p ( δ ).It follows that (A3) is satisfied.To show (A4), we show the next technical lemma: Lemma A3.
Under the conditions of either Theorem 1 or Theorem 2 we have: N p N X i =1 p X j =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ b α j ( b k i , θ ) ∂θ ′ − ∂α j ( θ , ξ i ) ∂θ ′ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = o p (1) . (A5)24sing (A1) and the identity ∂α j ( θ ,ξ i ) ∂θ ′ = (cid:2) E ξ i ,λ (cid:0) − v αij (cid:1)(cid:3) − E ξ i ,λ (cid:0) v θij (cid:1) ′ , we thushave, under the conditions of either Theorem 1 or 2:1 N p N X i =1 p X j =1 ∂ ∂θ∂θ ′ (cid:12)(cid:12)(cid:12)(cid:12) θ ℓ ij (cid:16)b α j ( b k i , θ ) , θ (cid:17) − N p N X i =1 p X j =1 ∂ ∂θ∂θ ′ (cid:12)(cid:12)(cid:12)(cid:12) θ ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1) = 1 N p N X i =1 p X j =1 v θij ∂ b α j ( b k i , θ ) ∂θ ′ − ∂α j ( θ , ξ i ) ∂θ ′ ! + o p (1) = o p (1) , where we have used Lemma A3 in the last equality.Finally, to show Theorems 1 and 2 we expand the GFE score as:1 N p N X i =1 p X j =1 ∂ℓ ij ( b α j ( b k i , θ ) , θ ) ∂θ + ∂∂θ ′ (cid:12)(cid:12)(cid:12) e θ N p N X i =1 p X j =1 ∂ℓ ij ( b α j ( b k i , θ ) , θ ) ∂θ ! (cid:16)b θ − θ (cid:17) =0 , where e θ lies between θ and b θ , and further expand ∂∂θ ′ (cid:12)(cid:12) e θ Np P Ni =1 P pj =1 ∂ℓ ij ( b α j ( b k i ,θ ) ,θ ) ∂θ around θ using that e θ is consistent. Lastly, we use (A3) and (A4), and note that,if ℓ i ( θ ) = p P pj =1 ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1) denotes the individual target log-likelihood,then s i = ∂ℓ i ( θ ) ∂θ and H = plim N,T →∞ N P Ni =1 E ξ i ,λ ( − ∂ ℓ i ( θ ) ∂θ∂θ ′ ). Proof of Corollary 1.
By the triangle inequality: N P Ni =1 k b h ( b k i ) − ϕ ( ξ i ) k ≤ b Q ( K ) + O p ( T ) = O p ( T ). The proof of Theorem 1 is then unchanged, simplyredefining δ =1 /T (since heterogeneity is time-invariant here). This shows (11). Proof of Corollary 2.
To prove Corollary 2, we follow a likelihood approach(see Arellano and Hahn, 2007). Consider the difference between the GFE and FEprofile log-likelihoods: ∆ L ( θ ) = N P Ni =1 ℓ i ( b α ( b k i , θ ) , θ ) − N P Ni =1 ℓ i ( b α i ( θ ) , θ ). Assumption A1. (regularity) Let b w i = − ∂ ℓ i ( b α i ( θ ) ,θ ) ∂α∂α ′ , and b g i = ∂ ℓ i ( b α i ( θ ) ,θ ) ∂θ∂α ′ b w − i .(i) ℓ it ( α i , θ ) is four times differentiable, and its fourth derivatives satisfy similarproperties to the first three.(ii) γ ( h )= { E h i = h ( b w i ) } − E h i = h ( b w i b α i ( θ )) and λ ( h )= E h i = h ( b g i b w i ) { E h i = h ( b w i ) } − are Lipschitz-continuous in h ; and Var h i = h ( b w i ( b α i ( θ ) − γ ( h i ))) = O ( T ) and Var h i = h (( b g i − λ ( h i )) b w i ) = O ( T ) , uniformly in h . Lemma A4.
Let the conditions of Corollary 2 hold, and let ν i ( θ )= b α i ( θ ) − E h i ( b α i ( θ )) . e have: ∂∂θ (cid:12)(cid:12)(cid:12) θ ∆ L ( θ )= − ∂∂θ (cid:12)(cid:12)(cid:12) θ N N X i =1 ν i ( θ ) ′ E ξ i [ − v αi ( α ( θ, ξ i ) , θ )] ν i ( θ ) + o p (cid:18) T (cid:19) . (A6)Corollary 2 follows, since the bias of the FE score is: ∂∂θ (cid:12)(cid:12) θ (cid:2) N P Ni =1 ℓ i ( b α i ( θ ) , θ ) − N P Ni =1 ℓ i ( α ( θ, ξ i ) , θ ) (cid:3) = ∂∂θ (cid:12)(cid:12) θ N P Ni =1 b ν i ( θ ) ′ E ξ i [ − v αi ( α ( θ, ξ i ) , θ )] b ν i ( θ ) + o p ( T ),where b ν i ( θ ) = b α i ( θ ) − E ξ i ( b α i ( θ )); see, e.g., Arellano and Hahn (2007).26 UPPLEMENTAL MATERIAL“Discretizing Unobserved Heterogeneity”
S1 Proofs of technical lemmas
Lemma A1.
From Assumption 3 ( iii )-( iv ) or 4 ( iii )-( iv ), both ∂α j ( θ,ξ ) ∂θ ′ and ∂α j ( θ,ξ ) ∂ξ ′ are uniformly bounded (in probability in the time-varying case). Let a j ( k, θ ) = α j ( θ, ψ ( b h ( k ))). We thus have, using Lemmas 1 and 2:sup θ ∈ Θ N p X i,j (cid:13)(cid:13)(cid:13) a j ( b k i , θ ) − α j ( θ, ξ i ) (cid:13)(cid:13)(cid:13) =sup θ ∈ Θ N p X i,j (cid:13)(cid:13)(cid:13) α j ( θ, ψ ( b h ( b k i ))) − α j ( θ, ψ ( ϕ ( ξ i ))) (cid:13)(cid:13)(cid:13) = O p N X i k b h ( b k i ) − ϕ ( ξ i ) k ! = O p ( δ ) . (S1)Let θ ∈ Θ. Expanding: P i,j ℓ ij ( a j ( b k i , θ ) , θ ) ≤ P i,j ℓ ij ( b α j ( b k i , θ ) , θ ) to secondorder around α j ( θ, ξ i ), and using:max i,j sup ( α,θ ) k v αij ( α, θ ) k = O p (1) , (S2)we have, for some a ij ( θ ) between b α j ( b k i , θ ) and α j ( θ, ξ i ):12 N p X i,j (cid:16)b α j ( b k i , θ ) − α j ( θ, ξ i ) (cid:17) ′ [ − v αij ( a ij ( θ ) , θ )] (cid:16)b α j ( b k i , θ ) − α j ( θ, ξ i ) (cid:17) ≤ N p X i,j v ij ( α j ( θ, ξ i ) , θ ) ′ (cid:16)b α j ( b k i , θ ) − a j ( b k i , θ ) (cid:17) + O p ( δ )= 1 N p X i,j v j ( b k i , θ ) ′ (cid:16)b α j ( b k i , θ ) − a j ( b k i , θ ) (cid:17) + O p ( δ ) , (S3)where v j ( k, θ ) denotes the mean over i of v ij ( α j ( θ, ξ i ) , θ ) in group b k i = k , and the O p ( δ ) terms are uniform in θ by (S1).Now, by Assumption 3 (ii) or 4 (ii) there exists a constant c > i,j inf ( α,θ ) mineig (cid:2) − v αij ( α, θ ) (cid:3) ≥ c + o p (1) , (S4)where mineig( M ) is the minimum eigenvalue of M . Let A = Np P i,j k b α j ( b k i , θ ) − j ( θ, ξ i ) k . By (S3) and the Cauchy Schwarz inequality, we have: A ≤ O p N p X i,j (cid:13)(cid:13)(cid:13) v j ( b k i , θ ) (cid:13)(cid:13)(cid:13) ! N p X i,j (cid:13)(cid:13)(cid:13)b α j ( b k i , θ ) − a j ( b k i , θ ) (cid:13)(cid:13)(cid:13) ! + O p ( δ ) . By (S1) and the triangle inequality: ( Np P i,j k b α j ( b k i , θ ) − a j ( b k i , θ ) k ) ≤ A + O p ( δ ). Hence: A = O p (cid:20)(cid:16) Np P i,j k v j ( b k i , θ ) k (cid:17) (cid:16) A + O p ( δ ) (cid:17)(cid:21) + O p ( δ ), whichimplies: A = O p N p X i,j k v j ( b k i , θ ) k ! + O p ( δ ) . (S5)We are now going to show that, for all θ ∈ Θ:1
N p X i,j (cid:13)(cid:13)(cid:13) v j ( b k i , θ ) (cid:13)(cid:13)(cid:13) = O p ( δ ) . (S6)Using (S5) and (S6) will then imply (A1). Under the conditions of Theorem 1, it iseasy to see that (S6) holds. We are now going to show (S6) under the conditions ofTheorem 2. Let, for all j, θ, h, ξ, λ : ρ j ( h, ξ, λ, θ ) = E h i = h,ξ i = ξ,λ = λ ( v ij ( α j ( θ, ξ ) , θ )),and, for all i, j, θ : ζ ij ( θ ) = v ij ( α j ( θ, ξ i ) , θ ) − ρ j ( h i , ξ i , λ , θ ). By Assumption 4 (v),and letting h i = ϕ ( ξ i ) + ε i , we can expand ρ j ( h i , ξ i , λ , θ ) twice around ϕ ( ξ i ) as: ρ j ( ϕ ( ξ i ) , ξ i , λ , θ ) + ∂ρ j ( ϕ ( ξ i ) ,ξ i ,λ ,θ ) ∂h ′ ε i + ε ′ i ∂ ρ j ( a jiθ ,ξ i ,λ ,θ ) ∂h∂h ′ ε i , where a jiθ lies between h i and ϕ ( ξ i ). Hence, taking expectations, using that E ξ i ,λ (cid:2) ρ j ( h i , ξ i , λ , θ ) (cid:3) = 0,and using Assumptions 2 and 4 (v), we have:1 N p X i,j k ρ j ( ϕ ( ξ i ) , ξ i , λ , θ ) k = 1 N p X i,j (cid:13)(cid:13)(cid:13)(cid:13) ∂ρ j ( ϕ ( ξ i ) , ξ i , λ , θ ) ∂h ′ E ξ i ,λ [ ε i ] (cid:13)(cid:13)(cid:13)(cid:13) + o p (cid:18) T (cid:19) , which is O p ( T ). Hence: Np P i,j k ρ j ( h i , ξ i , λ , θ ) k = O p ( T ). It thus follows fromthe triangle inequality that:1 N p X i,j k v j ( b k i , θ ) k ≤ O p (cid:18) T (cid:19) + 2 N p X i,j k ζ j ( b k i , θ ) k , (S7)where ζ j ( k, θ ) denotes the mean of ζ ij ( θ ) in group b k i = k . Now, using that28 k , ..., b k N are functions of h , ..., h N , we have: E " N p X i,j k ζ j ( b k i , θ ) k = 1 N p X k,j E " P i P i ′ { b k i = k } { b k i ′ = k } E h ,...,h N ,ξ ,...,ξ N ,λ (cid:0) ζ ij ( θ ) ′ ζ i ′ j ( θ ) (cid:1)P i { b k i = k } . Furthermore, since observations are independent across i given λ : E h ,...,h N ,ξ ,...,ξ N ,λ (cid:0) ζ i ,j ( θ ) ′ ζ i ,j ( θ ) (cid:1) = E h i ,ξ i , ,λ (cid:0) ζ i ,j ( θ ) (cid:1) ′ E h i ,ξ i , ,λ (cid:0) ζ i ,j ( θ ) (cid:1) = 0 for all i = i and j. Hence: E " N p X i,j k ζ j ( b k i , θ ) k = 1 N p X k,j E " P i { b k i = k } E h i ,ξ i ,λ (cid:0) ζ ij ( θ ) ′ ζ ij ( θ ) (cid:1)P i { b k i = k } . Finally, using that E h i ,ξ i ,λ (cid:0) ζ ij ( θ ) (cid:1) = 0, and using part (v) in Assumption 4: E h i = h,ξ i = ξ,λ = λ (cid:0) ζ ij ( θ ) ′ ζ ij ( θ ) (cid:1) = Tr (cid:2) Var h i = h,ξ i = ξ,λ = λ ( v ij ( α j ( θ, ξ i ) , θ )) (cid:3) = O (1) , uniformly in h, ξ, λ . This implies that E h Np P i,j k ζ j ( b k i , θ ) k i = O (cid:0) KN (cid:1) , andshows (S6) and (A1).We are now going to show:sup θ ∈ Θ N p X i,j k v j ( b k i , θ ) k = o p (1) . (S8)Using a bounding argument similar to the one we used to show (A1), (A2) willthen follow. To see that (S8) holds, let Z ( θ ) = Np P i,j k v j ( b k i , θ ) k . By (S6), Z ( θ ) = O p ( δ ) for all θ ∈ Θ. Moreover: ∂Z ( θ ) ∂θ = Np P i,j v θj ( b k i , θ ) v j ( b k i , θ ) = O p (cid:16)p sup θ ∈ Θ Z ( θ ) (cid:17) uniformly in θ , using the Cauchy Schwarz inequality with ei-ther Assumption 3 (ii) or 4 (ii), where v θj ( k, e θ ) is the mean of ∂∂θ (cid:12)(cid:12) θ = e θ v ij ( α j ( θ, ξ i ) , θ ) ′ Note that the dimension of v ij is fixed throughout, independent of the sample size.
in group $\hat k_i = k$. Since $\Theta$ is compact, it follows that $\sup_{\theta \in \Theta} Z(\theta) = o_p(1)$.

Lemma A2.
Let us omit references to θ and α ji throughout, and let: A = 1 N p N X i =1 p X j =1 E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16)b α j ( b k i , θ ) − α ji + ( v αij ) − v ij (cid:17) ,B = 1 N p N X i =1 p X j =1 (cid:16) v θij (cid:0) v αij (cid:1) − − E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − (cid:17) v αij (cid:16)b α j ( b k i , θ ) − α ji (cid:17) . We first bound A . Expanding: P i { b k i = k } v ij ( b α j ( k ))=0 for all k, j , we have,for a ij between α ji and b α j ( b k i ): X i { b k i = k } v ij ( α ji ) + X i { b k i = k } v αij ( α ji )( b α j ( b k i ) − α ji )+ 12 X i { b k i = k } v ααij ( a ij ) (cid:16)b α j ( b k i ) − α ji (cid:17) ⊗ (cid:16)b α j ( b k i ) − α ji (cid:17) = 0 . It follows that b α j ( b k i ) = e α j ( b k i ) + e v j ( b k i ) + e w j ( b k i ), where: e α j ( k ) = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } ( − v αij ) α ji ! , e v j ( k ) = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } v ij ! , e w j ( k ) = 12 X i { b k i = k } ( − v αij ) ! − X i { b k i = k } v ααij ( a ij ) (cid:16)b α j ( b k i ) − α ji (cid:17) ⊗ ! , where a ⊗ = a ⊗ a . Hence, we have: A = 1 N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16) e w j ( b k i )+ e α j ( b k i ) − α ji + e v j ( b k i )+( v αij ) − v ij (cid:17) . Let υ > , ǫ >
0. There is
M > (cid:16) sup θ ∈ Θ (cid:13)(cid:13)(cid:13) ∂Z ( θ ) ∂θ (cid:13)(cid:13)(cid:13) > M p sup θ ∈ Θ Z ( θ ) (cid:17) < ǫ .Take a finite cover of Θ = B ∪ ... ∪ B R , where B r are balls with centers θ r and diametersdiam B r ≤ M √ υ . Since: sup θ ∈ Θ Z ( θ ) ≤ max r Z ( θ r ) + sup θ (cid:13)(cid:13)(cid:13) ∂Z ( θ ) ∂θ (cid:13)(cid:13)(cid:13) M √ υ , and since: a >υ ⇒ a − √ a √ υ > υ , we have: Pr (sup θ ∈ Θ Z ( θ ) > υ ) ≤ ǫ + Pr (cid:0) max r Z ( θ r ) > υ (cid:1) , which, by(S6), is smaller than ǫ for N, T, K large enough.
N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij e w j ( b k i )= O p ( 1 N p X i,j k b α j ( b k i ) − α ji k ) = O p ( δ ) , where we have used (S2), (A1), and either Assumption 3 (ii) or Assumption 4 (ii).Next, let z j ( ξ i ) ′ = E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − . We have:1 N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16)e α j ( b k i ) − α ji (cid:17) = 1 N p X i,j (cid:18) z j ( ξ i ) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v αij (cid:16)e α j ( b k i ) − α ji (cid:17) , (S9)where, for all k, j : e z j ( k ) = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } ( − v αij ) z j ( ξ i ) ! . (S10)Now we have, using that: α j P i (cid:16) α j ( b k i ) − α ji (cid:17) ′ ( − v αij ) (cid:16) α j ( b k i ) − α ji (cid:17) is mini-mized at α j = e α j , and using (S2) and (S4):1 N p X i,j (cid:13)(cid:13)(cid:13)e α j ( b k i ) − α ji (cid:13)(cid:13)(cid:13) = O p N p X i,j (cid:16)e α j ( b k i ) − α ji (cid:17) ′ ( − v αij ) (cid:16)e α j ( b k i ) − α ji (cid:17)! = O p N p X i,j (cid:16)b α j ( b k i ) − α ji (cid:17) ′ ( − v αij ) (cid:16)b α j ( b k i ) − α ji (cid:17)! = O p N p X i,j (cid:13)(cid:13)(cid:13)b α j ( b k i ) − α ji (cid:13)(cid:13)(cid:13) ! , where the last expression is O p ( δ ) by (A1). Likewise, since by Assumption 3 (iv)or 4 (iv) ∂ vec z j ( ξ ) ∂ξ ′ is bounded (in probability) uniformly in j and ξ , we have:1 N p X i,j (cid:13)(cid:13)(cid:13)e z j ( b k i ) − z j ( ξ i ) (cid:13)(cid:13)(cid:13) = O p N p X i,j (cid:16)e z j ( b k i ) − z j ( ξ i ) (cid:17) ′ ( − v αij ) (cid:16)e z j ( b k i ) − z j ( ξ i ) (cid:17)! = O p N p X i,j (cid:16) z j (cid:16) ψ (cid:16)b h ( b k i ) (cid:17)(cid:17) − z j ( ξ i ) (cid:17) ′ ( − v αij ) (cid:16) z j (cid:16) ψ (cid:16)b h ( b k i ) (cid:17)(cid:17) − z j ( ξ i ) (cid:17)! = O p N p X i,j (cid:13)(cid:13)(cid:13)b h ( b k i ) − ϕ ( ξ i ) (cid:13)(cid:13)(cid:13) ! = O p ( δ ) , (S11)31here we have used (S2), (S4), Lemmas 1 and 2, and that ψ is Lipschitz-continuous.Combining results, and using the Cauchy Schwarz inequality in (S9), we obtain:1 N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16)e α j ( b k i ) − α ji (cid:17) = O p ( δ ) . The last term in A is: A = 1 N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − ( − v αij ) (cid:16) ( − v αij ) − v ij − e v j ( b k i ) (cid:17) . Since e v j ( k ) = ( P i { b k i = k } ( − v αij )) − ( P i { b k i = k } ( − v αij )( − v αij ) − v ij ), we have: A = 1 N p X i,j (cid:18) z j ( ξ i ) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) ( − v αij )( − v αij ) − v ij = 1 N p X i,j (cid:18) z j ( ξ i ) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v ij = 1 N p X i,j (cid:18) z j ( ξ i ) ′ − z ∗ j (cid:16)b k i (cid:17) ′ (cid:19) v ij + 1 N p X i,j (cid:18) z ∗ j (cid:16)b k i (cid:17) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v ij , (S12)where e z j ( k ) is given by (S10), and: z ∗ j ( k )= X i { b k i = k } E ξ i ,λ (cid:0) − v αij (cid:1)! − X i { b k i = k } E ξ i ,λ (cid:0) − v αij (cid:1) z j ( ξ i ) ! . (S13)Under the conditions of Theorem 1, it is easy to see that A = O p ( δ ). We arenow going to show that A = O p ( δ ) under the conditions of Theorem 2. To seethat the first term on the right-hand-side of (S12) is O p ( δ ), we use an argumentsimilar to the one we used to show (S6). Let ζ ij = v ij − E h i ,ξ i ,λ ( v ij ). 
Followingthe same steps as the ones leading to (S7), we obtain:1 N p X i,j (cid:13)(cid:13) E h i ,ξ i ,λ ( v ij ) (cid:13)(cid:13) = O p (cid:18) T (cid:19) . (S14)Moreover, by an argument similar to (S11), since E ξ i ,λ ( − v αij ) is bounded away32rom zero with probability one, we have:1 N p X i,j (cid:13)(cid:13)(cid:13) z j ( ξ i ) − z ∗ j ( b k i ) (cid:13)(cid:13)(cid:13) = O p ( δ ) . (S15)Let z ′ = ( z ′ , ..., z ′ p ), and z ∗ ( k ) ′ = ( z ∗ ( k ) ′ , ..., z ∗ p ( k ) ′ ). Since ζ ij are independentacross i , with zero mean, conditional on h , ..., h N , ξ , ..., ξ N , λ , we thus have,denoting ζ i = ( ζ ′ i , ..., ζ ′ ip ) ′ : E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N p X i,j (cid:18) z j ( ξ i ) ′ − z ∗ j (cid:16)b k i (cid:17) ′ (cid:19) v ij (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ O (cid:18) T (cid:19) E " N p X i,j (cid:13)(cid:13)(cid:13) z j ( ξ i ) − z ∗ j (cid:16)b k i (cid:17)(cid:13)(cid:13)(cid:13) +2 E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N p X i,j (cid:18) z j ( ξ i ) ′ − z ∗ j (cid:16)b k i (cid:17) ′ (cid:19) ζ ij (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = O (cid:18) δT (cid:19) + 2 N p X i E (cid:20)(cid:18) z ′ i − z ∗ (cid:16)b k i (cid:17) ′ (cid:19) E h i ,ξ i ,λ [ ζ i ζ ′ i ] (cid:16) z i − z ∗ (cid:16)b k i (cid:17)(cid:17)(cid:21) = O (cid:18) δT (cid:19) + O (cid:18) δpN T (cid:19) = O ( δ ) , where we have used, in turn, the triangle and Cauchy Schwarz inequalities, (S14),(S15), conditional independence of the ζ i across i , part (v) in Assumption 4, and(S15) one more time. Note that, by part (v) in Assumption 4, k E h i ,ξ i ,λ [ ζ i ζ ′ i ] k ≤ Tr E h i ,ξ i ,λ [ ζ i ζ ′ i ] ≤ p max j Tr E h i ,ξ i ,λ (cid:2) ζ ij ζ ′ ij (cid:3) = O p ( p /T ).Turning to the second term in (S12), we have:1 N p X i,j (cid:18) z ∗ j (cid:16)b k i (cid:17) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v ij = 1 N p X i,j (cid:18) z ∗ j (cid:16)b k i (cid:17) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v j (cid:16)b k i (cid:17) , where by (S6) we have: Np P i,j k v j ( b k i ) k = O p ( δ ). Moreover:1 N p X i,j (cid:13)(cid:13)(cid:13) z ∗ j (cid:16)b k i (cid:17) − e z j (cid:16)b k i (cid:17)(cid:13)(cid:13)(cid:13) ≤ N p X i,j (cid:13)(cid:13)(cid:13) z j ( ξ i ) − z ∗ j (cid:16)b k i (cid:17)(cid:13)(cid:13)(cid:13) + 2 N p X i,j (cid:13)(cid:13)(cid:13) z j ( ξ i ) − e z j (cid:16)b k i (cid:17)(cid:13)(cid:13)(cid:13) , O p ( δ ) due to (S11), and the firstterm is O p ( δ ) due to (S15). This shows that A = O p ( δ ), hence that A = O p ( δ ).Let us now turn to B . Letting: π ′ ij = v θij (cid:0) v αij (cid:1) − − E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − ,we have: B = 1 N p X i,j π ′ ij v αij (cid:16) e w j ( b k i ) + e v j ( b k i ) + e α j ( b k i ) − α ji (cid:17) . First, we have: Np P i,j π ′ ij v αij e w j ( b k i ) = O p ( δ ). Next, we have: Np P i,j π ′ ij v αij e v j ( b k i ) = Np P i,j e π j ( b k i ) ′ v αij e v j ( b k i ), where e π j ( k ) is defined similarly to e α j ( k ). To see that thisquantity is O p ( δ ), note that, by the definition of e v j ( k ) and using (S4) and (S6):1 N p X i,j (cid:13)(cid:13)(cid:13)e v j ( b k i ) (cid:13)(cid:13)(cid:13) = O p N p X i,j (cid:13)(cid:13)(cid:13) v j ( b k i ) (cid:13)(cid:13)(cid:13) ! = O p ( δ ) . Moreover, letting τ ij = π ′ ij v αij , we have:1 N p X i,j (cid:13)(cid:13)(cid:13)e π j ( b k i ) (cid:13)(cid:13)(cid:13) = O p N p X i,j (cid:13)(cid:13)(cid:13) τ j ( b k i ) (cid:13)(cid:13)(cid:13) ! . 
Now, the τ ij are independent across i , with zero conditional mean given ξ i , λ : E ξ i ,λ (cid:0) π ′ ij v αij (cid:1) = E ξ i ,λ (cid:16)(cid:16) v θij (cid:0) v αij (cid:1) − − E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − (cid:17) v αij (cid:17) = 0 . Using an argument similar to the one we used to show (S6), and using Assumption4 (v) in the time-varying case, it thus follows that Np P i,j k e π j ( b k i ) k = O p ( δ ).Hence, by the Cauchy Schwarz inequality: Np P i,j π ′ ij v αij e v j ( b k i ) = O p ( δ ).We lastly bound the third term B in B :1 N p X i,j π ′ ij v αij (cid:16)e α j ( b k i ) − α ji (cid:17) = 1 N p X i,j π ′ ij v αij h(cid:16) α ∗ j ( b k i ) − α ji (cid:17) + (cid:16)e α j ( b k i ) − α ∗ j ( b k i ) (cid:17)i , where e α j ( k ) and α ∗ j ( k ) are given by expressions similar to (S10) and (S13), with α ji in place of z j ( ξ i ) in those formulas. The first term is O p ( δ ) since, similarlyto (S15): Np P i,j k α ∗ j ( b k i ) − α ji k = O p ( δ ), and the τ ij = π ′ ij v αij are conditionally34ndependent across i with zero mean given ξ i and λ (using a similar argumentto the first term in (S12)). The second term is:1 N p X i,j π ′ ij v αij (cid:16)e α j ( b k i ) − α ∗ j ( b k i ) (cid:17) = 1 N p X i,j e π j ( b k i ) ′ v αij (cid:16)e α j ( b k i ) − α ∗ j ( b k i ) (cid:17) . We have already shown that: Np P i,j k e π j ( b k i ) k = O p ( δ ). Moreover, using similararguments to the ones we used to bound Np P i,j k z ∗ j ( b k i ) − e z j ( b k i ) k above, wehave: Np P i,j k e α j ( b k i ) − α ∗ j ( b k i ) k = O p ( δ ). This shows that B = O p ( δ ), hencethat B = O p ( δ ). Lemma A3.
For given k, j , θ -differentiating: P i { b k i = k } v ij ( b α j ( k, θ ) , θ )=0, andusing (S4), we obtain: ∂ b α j ( k, θ ) ∂θ ′ = X i { b k i = k } (cid:16) − v αij (cid:16)b α j ( b k i , θ ) , θ (cid:17)(cid:17)! − X i { b k i = k } v θij (cid:16)b α j ( b k i , θ ) , θ (cid:17) ′ . (S16)Let us define, at θ = θ (and omitting θ and α ji from the notation): ∂ e α j ( k ) ∂θ ′ = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } ( v θij ) ′ ,∂ e α j ∗ ( k ) ∂θ ′ = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } ( − v αij ) (cid:2) E ξ i ,λ ( − v αij ) (cid:3) − E ξ i ,λ ( v θij ) ′ | {z } = ∂αj ( ξi ∂θ ′ . Using (A1) and (S4), we have: Np P i,j k ∂ b α j ( b k i ) ∂θ ′ − ∂ e α j ( b k i ) ∂θ ′ k = o p (1). Moreover: ∂ e α j ( k ) ∂θ ′ − ∂ e α j ∗ ( k ) ∂θ ′ = P i { b k i = k } ( − v αij ) P i { b k i = k } ! − P i { b k i = k } τ ′ ij P i { b k i = k } ! , where τ ′ ij =( v θij ) ′ − ( − v αij ) (cid:2) E ξ i ,λ ( − v αij ) (cid:3) − E ξ i ,λ ( v θij ) ′ are conditionally independentacross i , with zero mean given ξ i and λ . Hence, using (S4), and a similar argu-ment to the one we used to show (S6), we have: Np P i,j k ∂ e α j ( b k i ) ∂θ ′ − ∂ e α j ∗ ( b k i ) ∂θ ′ k = o p (1).Lastly, using (S4) we have, as in (S11): Np P i,j k ∂ e α j ∗ ( b k i ) ∂θ ′ − ∂α j ( ξ i ) ∂θ ′ k = o p (1). Com-bining results shows (A5). 35 emma A4. In the following we again evaluate all functions at θ , and omit θ for the notation. In particular, b α i is a shorthand for b α i ( θ ). We will use thenotation b w i = − v αi ( b α i ). The choice of K = b K with γ = o (1) implies that:1 N X i k h i − b h ( b k i ) k = o p (cid:18) T (cid:19) . (S17)We also have: N P i k b α ( b k i ) − b α i k = O p (cid:0) T (cid:1) . Let, for all k : e α ( k ) = X i { b k i = k } b w i ! − X i { b k i = k } b w i b α i . (S18)Expanding: P i { b k i = k } v i ( b α ( k )) = 0 around b α i , using that v i ( b α i ) = 0, we obtain: b α ( k ) = e α ( k )+ 12 "X i { b k i = k } b w i − X i { b k i = k } v ααi ( a i ( k )) (cid:16)b α ( b k i ) − b α i (cid:17) ⊗ , where a i ( k ) lies between b α i and b α ( k ), and v ααi ( a i ( k )) is a matrix of third derivativeswith (dim α i ) columns.To see that (A6) holds, we rely on the following decomposition: ∂∂θ (cid:12)(cid:12)(cid:12) θ ∆ L ( θ ) = 1 N X i ∂ℓ i ( b α ( b k i )) ∂θ − N X i ∂ℓ i ( b α i ) ∂θ = 1 N X i v θi ( b α i ) (cid:16)b α ( b k i ) − b α i (cid:17) + 12 N X i v θαi ( a i ) (cid:16)b α ( b k i ) − b α i (cid:17) ⊗ = 1 N X i v θi ( b α i ) (cid:16)e α ( b k i ) − b α i (cid:17) + 12 N X i v θαi ( a i ) (cid:16)b α ( b k i ) − b α i (cid:17) ⊗ + 12 N X i v θi ( b α i ) (cid:0) E b k i [ b w i ′ ] (cid:1) − E b k i (cid:20) v ααi ′ (cid:16) a i ′ ( b k i ′ ) (cid:17) (cid:16)b α ( b k i ′ ) − b α i ′ (cid:17) ⊗ (cid:21) = 1 N X i v θi ( b α i ) (cid:16)e α ( b k i ) − b α i (cid:17)| {z } = A + 12 N X i v θαi ( a i ) (cid:16)e α ( b k i ) − b α i (cid:17) ⊗ | {z } = A + o p (cid:18) T (cid:19) + 12 N X i v θi ( b α i ) (cid:0) E b k i [ b w i ′ ] (cid:1) − E b k i (cid:20) v ααi ′ (cid:16) a i ′ ( b k i ′ ) (cid:17) (cid:16)e α ( b k i ′ ) − b α i ′ (cid:17) ⊗ (cid:21)| {z } = A , a i and a i ( b k i ) lie between b α i and b α ( b k i ), and E k denotes a mean in group b k i = k . Let γ ( h ) = { E h i = h ( b w i ) } − E h i = h ( b w i b α i ), and ν i = b α i − γ ( h i ). Let b g i = v θi ( b α i )( b w i ) − , λ ( h ) = E h i = h ( b g i b w i ) { E h i = h ( b w i ) } − , and τ i = b g ′ i − λ ( h i ) ′ . 
Using (S17)we can show, using that γ is Lipschitz-continuous, that N P i k γ ( h i ) − e γ ( b k i ) k = o p (cid:0) T (cid:1) . Moreover, we have: E [ b w i ν i | h , ..., h N ] = E h i [ b w i b α i ] − E h i [ b w i b α i ] =0. Similararguments to the proof of Lemma A1 give: N P i k E b k i [ b w i ′ ν i ′ ] k = O p ( KNT )= o p (cid:0) T (cid:1) .Hence: N P i k e ν ( b k i ) k = N P i k (cid:0) E b k i [ b w i ′ ] (cid:1) − E b k i [ b w i ′ ν i ′ ] k = o p (cid:0) T (cid:1) . Likewise, wehave: N P i k λ ( h i ) − e λ ( b k i ) k = o p (cid:0) T (cid:1) , and: N P i k e τ ( b k i ) k = o p (cid:0) T (cid:1) . Let us now expand the three terms A , A , A in the above decomposition: A = 1 N X i b g i b w i (cid:16)e α ( b k i ) − b α i (cid:17) = 1 N X i (cid:16)b g i − e g ( b k i ) (cid:17) b w i (cid:16)e α ( b k i ) − b α i (cid:17) = − N X i (cid:16) λ ( h i ) − e λ ( b k i ) + τ ′ i − e τ ( b k i ) ′ (cid:17) b w i (cid:16) γ ( h i ) − e γ ( b k i ) + ν i − e ν ( b k i ) (cid:17) = − N X i τ ′ i b w i ν i + o p (cid:18) T (cid:19) = − N X i τ ′ i E ξ i ( − v αi ( α i )) ν i + o p (cid:18) T (cid:19) ,A = 12 N X i E ξ i (cid:0) v θαi ( α i ) (cid:1) (cid:16)e α ( b k i ) − b α i (cid:17) ⊗ + o p (cid:18) T (cid:19) = 12 N X i E ξ i (cid:0) v θαi ( α i ) (cid:1) (cid:16)e γ ( b k i ) − γ ( h i ) + e ν ( b k i ) − ν i (cid:17) ⊗ + o p (cid:18) T (cid:19) = 12 N X i E ξ i (cid:0) v θαi ( α i ) (cid:1) ν ⊗ i + o p (cid:18) T (cid:19) ,A = 12 N X i E ξ i (cid:0) v θi ( α i ) (cid:1) (cid:2) E ξ i ( − v αi ( α i )) (cid:3) − E ξ i [ v ααi ( α i )] ν ⊗ i + o p (cid:18) T (cid:19) . Combining, we get: ∂∂θ (cid:12)(cid:12)(cid:12) θ ∆ L ( θ ) = − N X i τ ′ i E ξ i ( − v αi ( α i )) ν i + o p (cid:18) T (cid:19) + 12 N X i h E ξ i (cid:0) v θαi ( α i ) (cid:1) + E ξ i (cid:0) v θi ( α i ) (cid:1) (cid:2) E ξ i ( − v αi ( α i )) (cid:3) − E ξ i [ v ααi ( α i )] i ν ⊗ i . Here e γ ( k ), e λ ( k ), e ν ( k ), and e τ ( k ) are defined similarly to e α ( k ) in (S18), with γ ( h i ), λ ( h i ), ν i ,and τ i , respectively, replacing b α i in that formula. ∂ b α i ( θ ) ∂θ ′ = b g ′ i , and: ∂∂θ ′ (cid:12)(cid:12)(cid:12) θ vec E ξ i [ − v αi ( α ( θ, ξ i ) , θ )]= − (cid:16) E ξ i (cid:0) v θαi ( α i ) (cid:1) + E ξ i (cid:0) v θi ( α i ) (cid:1) (cid:2) E ξ i ( − v αi ( α i )) (cid:3) − E ξ i [ v ααi ( α i )] (cid:17) ′ . Let ω i = { E h i ( b w i ) } − b w i , and e ν i ( θ ) = b α i ( θ ) − E h i ( ω i b α i ( θ )). Combining the abovewith the expression of the bias of the FE score, we obtain: ∂∂θ (cid:12)(cid:12)(cid:12) θ ∆ L ( θ )= − ∂∂θ (cid:12)(cid:12)(cid:12) θ N X i e ν i ( θ ) ′ E ξ i [ − v αi ( α ( θ, ξ i ) , θ )] e ν i ( θ )+ o p (cid:18) T (cid:19) . (S19)Lastly, let b α i ( θ ) = E h i ( b α i ( θ )) + ν i ( θ ), and ω i = E h i ( ω i ) + η i = 1 + η i . We have: e ν i ( θ ) = ν i ( θ ) − E h i ( η i ν i ( θ )), from which it follows that: N P i k e ν i ( θ ) − ν i ( θ ) k = o p (1 /T ). Likewise: N P i k ∂ e ν i ( θ ) ∂θ ′ − ∂ν i ( θ ) ∂θ ′ k = o p (1 /T ). Hence, (S19) implies(A6). S2 Complements and extensions
S2.1 Average effects
Let $m_i(\alpha_i,\theta) = \frac{1}{T}\sum_{t=1}^T m(X_{it},\alpha_i,\theta)$ in the time-invariant case, and $m_i(\alpha_i,\theta) = \frac{1}{T}\sum_{t=1}^T m(X_{it},\alpha_{it},\theta)$ in the time-varying case. Let $\widehat M = \frac{1}{N}\sum_i m_i\big(\widehat\alpha(\widehat k_i),\widehat\theta\big)$ be the GFE estimator of $M = \frac{1}{N}\sum_i m_i(\alpha_i,\theta_0)$. We use a common notation as in the proofs of Theorems 1 and 2, and denote $m_{ij}(\alpha_{ji},\theta) = m_i(\alpha_i,\theta)$ in the time-invariant case, and $m_{ij}(\alpha_{ji},\theta) = m(X_{it},\alpha_{it},\theta)$ in the time-varying case.

Assumption S1. (average effects)

(i) $m_{ij}(\alpha,\theta)$ is twice differentiable in both its arguments, for all $i,j$.

(ii) $\max_{i,j}\sup_{\alpha,\theta}\|m_{ij}(\alpha,\theta)\| = O_p(1)$, and similarly for the first two derivatives of $m_{ij}$; $\max_j \sup_{\widetilde\xi,\lambda}\big\|\frac{\partial}{\partial\xi'}\big|_{\xi=\widetilde\xi} E_{\xi_i=\xi,\lambda=\lambda}\big(\frac{\partial m_{ij}(\alpha_{ji},\theta_0)}{\partial\alpha}\big)\big\| = O(1)$; and, letting
$$\tau^m_{ij} = \frac{\partial m_{ij}(\alpha_{ji},\theta_0)}{\partial\alpha'} - E_{\xi_i,\lambda}\Big[\frac{\partial m_{ij}(\alpha_{ji},\theta_0)}{\partial\alpha'}\Big]\, E_{\xi_i,\lambda}\big[v^{\alpha}_{ij}(\alpha_{ji},\theta_0)\big]^{-1} v^{\alpha}_{ij}(\alpha_{ji},\theta_0),$$
the function $E_{h_i=h,\xi_i=\xi,\lambda=\lambda}(\mathrm{vec}\,\tau^m_{ij})$ is twice differentiable with respect to $h$, with first and second derivatives that are uniformly bounded in $j$, $\xi$, $\lambda$, and $h$, and $\|\mathrm{Var}_{h_i=h,\xi_i=\xi,\lambda=\lambda}(\mathrm{vec}\,\tau^m_{ij})\| = O(p/T)$, uniformly in $j$, $\xi$, $\lambda$, and $h$.

Take $s_i$ and $H$ as in Theorem 1 or 2, and let $s = \frac{1}{N}\sum_i s_i$. Define:
$$s^m_i = \frac{1}{p}\sum_j \bigg\{ E_{\xi_i,\lambda}\Big(\frac{\partial m_{ij}}{\partial\alpha'}\Big)\Big[E_{\xi_i,\lambda}\Big(-\frac{\partial^2 \ell_{ij}}{\partial\alpha\partial\alpha'}\Big)\Big]^{-1}\frac{\partial \ell_{ij}}{\partial\alpha} + E_{\xi_i,\lambda}\Big(\frac{\partial m_{ij}}{\partial\theta'}\Big) H^{-1}s + E_{\xi_i,\lambda}\Big(\frac{\partial m_{ij}}{\partial\alpha'}\Big)\Big[E_{\xi_i,\lambda}\Big(-\frac{\partial^2 \ell_{ij}}{\partial\alpha\partial\alpha'}\Big)\Big]^{-1} E_{\xi_i,\lambda}\Big(\frac{\partial^2 \ell_{ij}}{\partial\alpha\partial\theta'}\Big) H^{-1}s \bigg\}.$$
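As a concrete illustration of the plug-in estimator $\widehat M$ defined at the start of this subsection, the following minimal sketch (in Python) computes $\frac{1}{N}\sum_i m_i(\widehat\alpha(\widehat k_i),\widehat\theta)$ given fitted group labels and group-specific parameters from the two-step procedure. The function name, the toy probit-type moment, and the placeholder labels and estimates are hypothetical and are only meant to fix ideas; they are not the designs used in the paper.

```python
import numpy as np
from scipy.stats import norm

def average_effect(m, X, k_hat, alpha_hat, theta_hat):
    """Plug-in GFE estimate of M = (1/N) sum_i m_i(alpha_i, theta),
    replacing alpha_i by its group-level estimate alpha_hat[k_hat[i]].

    m          : function m(X_i, alpha, theta) returning the (time-averaged) target moment
    X          : list of individual data blocks X_i
    k_hat      : estimated group labels from the first (kmeans) step
    alpha_hat  : group-specific parameter estimates, one row per group
    theta_hat  : estimated common parameter
    """
    N = len(X)
    return np.mean([m(X[i], alpha_hat[k_hat[i]], theta_hat) for i in range(N)])

# Toy illustration (all inputs are placeholders): average participation
# probability in a probit-type model, m(X_i, a, t) = mean over t of Phi(X_it t + a).
rng = np.random.default_rng(0)
N, T, K = 200, 10, 5
X = [rng.normal(size=(T, 1)) for _ in range(N)]
k_hat = rng.integers(0, K, size=N)          # placeholder group labels
alpha_hat = rng.normal(size=(K, 1))         # placeholder group effects
theta_hat = np.array([1.0])
m = lambda Xi, a, t: norm.cdf(Xi @ t + a).mean()
print(average_effect(m, X, k_hat, alpha_hat, theta_hat))
```

In practice the labels, group effects, and common parameter would come from the kmeans classification and the second-step estimation rather than being drawn at random as in this toy example.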
Corollary S1.

Let the conditions of Theorem 1 or 2 hold, and let Assumption S1 hold. Then, as $N, T, K$ tend to infinity such that $Kp/(NT)$ tends to zero:
$$\widehat M = M + \frac{1}{N}\sum_i s^m_i + O_p\Big(\frac{1}{T}\Big) + O_p\Big(\frac{Kp}{NT}\Big) + O_p\big(K^{-d}\big) + o_p\Big(\frac{1}{\sqrt{NT}}\Big).$$

Proof.
We have, by a Taylor expansion: c M − M = 1 N p X i,j m ij (cid:16)b α j ( b k i , b θ ) , b θ (cid:17) − N p X i,j m ij (cid:0) α ji , θ (cid:1) = 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16)b α j ( b k i , b θ ) − α ji (cid:17) + 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂θ ′ (cid:16)b θ − θ (cid:17) + O p ( δ ) , where δ is defined as in the proofs of Theorems 1 and 2.Using similar arguments to the ones we used to establish Lemma A2, underAssumption S1 we have (recall that α j ( θ , ξ i ) = α ji ):1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16)b α j ( b k i , θ ) − α j ( θ , ξ i ) (cid:17) + 1 N p X i,j E ξ i ,λ " ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ E ξ i ,λ (cid:2) v αij ( α ji , θ ) (cid:3) − v ij ( α ji , θ ) = O p ( δ ) . Moreover, using (A5) and Assumption S1 we obtain:1
N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ n(cid:16)b α j ( b k i , b θ ) − α j ( b θ, ξ i ) (cid:17) − (cid:16)b α j ( b k i , θ ) − α j ( θ , ξ i ) (cid:17)o = o p (cid:16) k b θ − θ k (cid:17) + O p ( δ ) = o p (cid:18) √ N T (cid:19) + O p ( δ ) . c M − M = 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16)b α j ( b k i , b θ ) − b α j ( b k i , θ ) (cid:17) + 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16)b α j ( b k i , θ ) − α ji (cid:17) + 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂θ ′ (cid:16)b θ − θ (cid:17) + O p ( δ )= 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16) α j ( b θ, ξ i ) − α j ( θ , ξ i ) (cid:17) + 1 N p X i,j E ξ i ,λ " ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ E ξ i ,λ (cid:2) − v αij ( α ji , θ ) (cid:3) − v ij ( α ji , θ )+ 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂θ ′ (cid:16)b θ − θ (cid:17) + O p ( δ ) + o p (cid:18) √ N T (cid:19) . The result comes from expanding α j ( b θ, ξ i ) around θ , and then substituting b θ − θ by its influence function. S2.2 Two-way GFE
We have the following lemma, whose proof is analogous to that of Lemma 1.
Lemma S1.
Suppose that there exist random vectors h i = T P t h ( Y it , X it ) and w t = N P i w ( Y it , X it ) , with fixed dimensions, and Lipschitz-continuous functions ϕ and φ , such that h i = ϕ ( ξ i ) + o p (1) , N P i k h i − ϕ ( ξ i ) k = O p (1 /T ) , w t = φ ( λ t ) + o p (1) , and T P t k w t − φ ( λ t ) k = O p (1 /N ) as N, T tend to infinity.Then we have, as
N, T, K tend to infinity: N P i k b h ( b k i ) − ϕ ( ξ i ) k = O p (cid:0) T (cid:1) + O p ( B ξ ( K )) , and, as N, T, L tend to infinity: T P t k b w ( b l t ) − φ ( λ t ) k = O p (cid:0) N (cid:1) + O p ( B λ ( L )) , where B λ ( L ) is defined analogously to B ξ ( K ) . For all θ , ξ , and λ , let α ( θ, ξ, λ ) = argmax α E ξ i = ξ, λ t = λ ( ℓ it ( α, θ )). In addition,let ξ = ( ξ ′ , ..., ξ ′ N ) ′ . Assumption S2. (regularity, two-way)(i) ( Y ′ it , X ′ it ) ′ , i = 1 , .., N , t = 1 , ..., T , are i.i.d. given ξ and λ , ξ i are i.i.d.,and λ t are i.i.d.; ℓ it ( α, θ ) is three times differentiable in ( θ, α ) ; Θ is compact,the spaces for ξ i and λ t are compact, and θ belongs to the interior of Θ . ii) N, T, K, L tend jointly to infinity; sup ξ,λ,α,θ | E ξ i = ξ,λ t = λ ( ℓ it ( α, θ )) | = O (1) ,and similarly for the first three derivatives of ℓ it ; the minimum (resp., max-imum) eigenvalue of ( − ∂ ℓ it ( α,θ ) ∂α∂α ′ ) is bounded away from zero (resp., infinity)with probability one uniformly in i, t, α, θ , and the third derivatives of ℓ it ( α, θ ) are O p (1) , uniformly in i, t, α, θ ; NT P i,t [ ℓ it ( α it , θ ) − E ξ i ,λ t ( ℓ it ( α it , θ ))] = O p (1) , and similarly for the first three derivatives of ℓ it .(iii) inf ξ,λ,θ E ξ i = ξ, λ t = λ ( − ∂ ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α∂α ′ ) > ; E h NT P i,t ℓ it ( α ( θ, ξ i , λ t ) , θ ) i hasa unique maximum at θ on Θ , and its second derivative is − H < .(iv) ∂∂ξ ′ (cid:12)(cid:12) e ξ E ξ i = ξ,λ t = λ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ )= O (1) ; ∂∂λ ′ (cid:12)(cid:12) e λ E ξ i = ξ,λ t = λ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ )= O (1) ; ∂∂ξ ′ (cid:12)(cid:12) e ξ E ξ i = ξ,λ t = λ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ )= O (1) ; ∂∂λ ′ (cid:12)(cid:12) e λ E ξ i = ξ,λ t = λ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ )= O (1) ; ∂∂ξ ′ (cid:12)(cid:12) e ξ E ξ i = ξ,λ t = λ ( ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α )= O (1) ; ∂∂λ ′ (cid:12)(cid:12) e λ E ξ i = ξ,λ t = λ ( ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α )= O (1) ,uniformly in ξ, e ξ, λ, e λ, α, θ .(v) E h i = h,ξ i = ξ,w t = w,λ t = λ ( ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α ) , E h i = h,ξ i = ξ,w t = w,λ t = λ (vec ∂∂θ ′ (cid:12)(cid:12) θ ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α ) are twice differentiable with respect to h and w , with first and second deriva-tives that are uniformly bounded in h ∈ H , w ∈ W , ξ , λ , and θ ∈ Θ , where H and W are the supports of h i and w t ; k Var h i = h,ξ i = ξ,w t = w,λ t = λ ( ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α ) k and k Var h i = h,ξ i = ξ,w t = w,λ t = λ (vec ∂∂θ ′ (cid:12)(cid:12) θ ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α ) k are O (1) , uniformly in h , w , ξ , λ , θ . Theorem S1.
Let the conditions in Lemma S1 hold. Suppose that $B_\xi(K) = O_p(K^{-d})$ and $B_\lambda(L) = O_p(L^{-d_\lambda})$. Suppose that $\alpha$ and $\mu$ are Lipschitz-continuous in both arguments, and that there exist two Lipschitz-continuous functions $\psi$ and $\Psi$ such that $\xi_i = \psi(\varphi(\xi_i))$ and $\lambda_t = \Psi(\phi(\lambda_t))$. Lastly, let Assumption S2 hold. Then, as $N, T, K, L$ tend to infinity such that $KL/(NT)$ tends to zero, we have:
$$\widehat\theta = \theta_0 + H^{-1}\frac{1}{N}\sum_i s_i + O_p\Big(\frac{1}{T} + \frac{1}{N} + \frac{KL}{NT}\Big) + O_p\big(K^{-d} + L^{-d_\lambda}\big) + o_p\Big(\frac{1}{\sqrt{NT}}\Big).$$

Proof.
The proof closely follows the steps of that of Theorem 2. Here we simplyhighlight the main differences. Let δ = T + N + KLNT + K − d + L − dλ . To show41onsistency, a key step is to show, for all θ ∈ Θ:1
N T X i,t (cid:13)(cid:13)(cid:13) v ( b k i , b l t , θ ) (cid:13)(cid:13)(cid:13) = O p ( δ ) , (S20)where v ( k, l, θ ) denotes the mean of v it ( α ( θ, ξ i , λ t ) , θ ) in the intersection of groups b k i = k and b l t = l . Let: ρ ( h, ξ, w, λ, θ ) = E h i = h,ξ i = ξ,w t = w,λ t = λ ( v it ( α ( θ, ξ, λ ) , θ )), andlet, for all i, t, θ : ζ it ( θ ) = v it ( α ( θ, ξ i , λ t ) , θ ) − ρ ( h i , ξ i , w t , λ t , θ ). Proceeding asin the proof of Lemma A1 we have:1 N T X i,t k ρ ( h i , ξ i , w t , λ t , θ ) k = O p (cid:18) T (cid:19) + O p (cid:18) N (cid:19) . We thus only need to bound: E " N T X i,t k ζ ( b k i , b l t , θ ) k = 1 N T X k,ℓ E kℓ (cid:2) E h i ,ξ i ,w t ,λ t ( ζ it ( θ ) ′ ζ it ( θ )) (cid:3) , where we have used that observations are independent across i and t given ξ and λ , and E kℓ denotes a mean in groups b k i = k and b l t = l . To bound this quantity,we use part ( v ) in Assumption S2. We thus obtain (S20).Similarly to the proof of Lemma A2, we then show:1 N T X i,t n v θit (cid:16)b α ( b k i , b l t ) − α it (cid:17) + E ξ i ,λ t (cid:0) v θit (cid:1) (cid:2) E ξ i ,λ t ( v αit ) (cid:3) − v it o = O p ( δ ) , (S21)where we omit references to θ and α it . The first key term is: A = 1 N T X i,t E ξ i ,λ t (cid:0) v θit (cid:1) (cid:2) E ξ i ,λ t ( v αit ) (cid:3) − ( − v αit ) (cid:16) ( − v αit ) − v it − e v ( b k i , b l t ) (cid:17) , where e v is defined analogously to the proof of Lemma A2. To show that A = O p ( δ ), we use that the ζ it ( θ ) are independent across i and t , with zero meanconditional on h , ..., h N , w , ..., w T , ξ , and λ .42et π ′ it = v θit ( v αit ) − − E ξ i ,λ t (cid:0) v θit (cid:1) (cid:2) E ξ i ,λ t ( v αit ) (cid:3) − . The second key term is: B = 1 N T X i,t π ′ it v αit (cid:16)e α ( b k i , b l t ) − α it (cid:17) = 1 N T X i,t π ′ it v αit (cid:16) α ∗ ( b k i , b l t ) − α it (cid:17) + 1 N T X i,t π ′ it v αit (cid:16)e α ( b k i , b l t ) − α ∗ ( b k i , b l t ) (cid:17) , where e α ( k, l ) and α ∗ ( k, l ) are defined analogously to the proof of Lemma A2. Toshow that B = O p ( δ ), we use that τ it = π ′ it v αit are independent across i and t withzero mean given ξ , λ .The final step, as in the proof of Lemma A3, is to show that:1 N T X i,t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ b α ( b k i , b l t , θ ) ∂θ ′ − ∂α ( θ , ξ i , λ t ) ∂θ ′ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = o p (1) . (S22)The proof of (S22) follows similar arguments to the proof of Lemma A3. S2.3 GFE based on conditional moments
Assumption S3. (heterogeneity, conditional case)There exist vectors ξ i of fixed dimension d , and ν i of dimension d ν , and functions α and µ Lipschitz-continuous in ξ , such that α i = α ( ξ i ) and µ i = µ ( ξ i , ν i ) . Differently from Assumption 1, here µ i depends on an additional heterogeneitycomponent ν i , and by Assumption 2 the moment h i is only injective for ξ i . Assumption S4. (regularity, conditional case)(i) ( Y ′ i , X ′ i , ξ ′ i , ν ′ i , h ′ i ) ′ are i.i.d.; ( Y ′ it , X ′ it ) ′ are stationary for all i ; ℓ it ( α, θ ) isthree times differentiable in both its arguments for all i, t ; and Θ is compact,the space for α i is compact, and θ belongs to the interior of Θ .(ii) N, T, K tend jointly to infinity; sup ξ,ν,α,θ | E ξ i = ξ,ν i = ν ( ℓ it ( α, θ )) | = O (1) , andsimilarly for the first three derivatives of ℓ it ; inf ξ,ν,α,θ E ξ i = ξ,ν i = ν ( − ∂ ℓ it ( α,θ ) ∂α∂α ′ ) is positive definite; and max i sup α,θ (cid:12)(cid:12) ℓ i ( α, θ ) − E ξ i ,ν i ( ℓ i ( α, θ )) (cid:12)(cid:12) = o p (1) ,and similarly for the first three derivatives of ℓ i . iii) inf ξ,ν,θ E ξ i = ξ,ν i = ν ( − ∂ ℓ it ( α ( θ,ξ ) ,θ ) ∂α∂α ′ ) > ; E [ T P Tt =1 ℓ it ( α ( θ, ξ i ) , θ )] has a uniquemaximum at θ on Θ , and its matrix of second derivatives is − H cond < ;and sup θ NT P i,t k ∂ ℓ it ( α ( θ,ξ i ) ,θ ) ∂θ∂α ′ k = O p (1) .(iv) sup e ξ,α k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ ) k ; sup e ξ,α k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ ) k ; and sup e ξ,θ k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ ( ∂ℓ it ( α ( θ, e ξ ) ,θ ) ∂α ) k are O (1) .(v) E h i = h,ξ i = ξ ( ∂ℓ it ( α ( θ,ξ ) ,θ ) ∂α ) is twice differentiable with respect to h and ξ , withfirst and second derivatives that are uniformly bounded in ξ , h ∈ H , and θ ∈ Θ ; and k Var h i = h,ξ i = ξ ( ∂ℓ i ( α ( θ,ξ ) ,θ ) ∂α ) k = O (1) , uniformly in ξ , h and θ . Corollary S2.
Let the conditions of Lemmas 1 and 2 hold. Let Assumptions 2, S3, and S4 hold. Let $K$ be given by (6), with $\gamma = O(1)$. Then, as $N, T, K$ tend to infinity such that $T^{d} = O(N)$ we have:
$$\widehat\theta = \theta_0 + O_p\Big(\frac{1}{T}\Big) + O_p\Big(\frac{1}{\sqrt{NT}}\Big). \quad (S23)$$

Proof.
Let δ = T + KN + K − d . To show consistency, the key step is to show:1 N X i (cid:13)(cid:13)(cid:13) v ( b k i , θ ) (cid:13)(cid:13)(cid:13) = O p ( δ ) , ∀ θ ∈ Θ . (S24)Let, for all θ, h, ξ : ρ ( h, ξ, θ ) = E h i = h,ξ i = ξ ( v i ( α ( θ, ξ ) , θ )), and let, for all i, θ : ζ i ( θ ) = v i ( α ( θ, ξ i ) , θ ) − ρ ( h i , ξ i , θ ). One can show, using similar techniques to the proofof Lemma A1, that: N P i k ζ ( b k i , θ ) k = O p ( KN ), and that this implies (S24). We then show: N P i ∂ℓ i ( b α ( b k i ,θ ) ,θ ) ∂θ = O p ( δ ), which will follow from:1 N X i v θi (cid:16)b α ( b k i ) − α i (cid:17) = O p ( δ ) , (S25)where from now on we omit references to θ and α i . We have:1 N X i v θi (cid:16)b α ( b k i ) − α i (cid:17) = 1 N X i v θi (cid:16)e α ( b k i ) − α i + e v ( b k i ) (cid:17) + O p ( δ ) , Note that if K = b K is given by (6) with γ = O (1), then K = O ( T d ) and δ = O ( T + T d N ), so if T d = O ( N ) then δ = O ( T ). Note that, in the case of Theorem 1 (i.e., in the absence of additional heterogeneity ν i ),the left-hand side in (S24) is O p ( T ). e α ( k ) and e v ( k ) are as in the proof of Lemma A2; that is, denoting w i = ( − v αi ),we have e α ( k ) = w ( k ) − wα ( k ) and e v ( k ) = w ( k ) − v ( k ).Let γ v ( h i ) = E h i ( v i ), ζ vi = v i − γ v ( h i ), γ w ( h i ) = E h i ( w i ), ζ wi = w i − γ w ( h i ), γ v θ ( h i ) = E h i ( v θi ), and ζ v θ i = v θi − γ v θ ( h i ). First, we have:1 N X i v θi e v ( b k i ) = 1 N X i v θi w ( b k i ) − v ( b k i ) = 1 N X i v θ ( b k i ) w ( b k i ) − v i = 1 N X i ( γ v θ ( b k i ) + ζ v θ ( b k i ))( γ w ( b k i ) + ζ w ( b k i )) − v i = 1 N X i γ v θ ( b k i ) γ w ( b k i ) − v i + O p ( δ ) , where for example γ w ( k ) is the mean of γ w ( h i ) in group b k i = k , and we have usedthat N P i k ζ v θ ( b k i ) k = O p ( K/N ), N P i k ζ w ( b k i ) k = O p ( K/N ), and N P i k v i k = O p (1 /T ). Moreover:1 N X i γ v θ ( b k i ) γ w ( b k i ) − v i = 1 N X i γ v θ ( b k i ) γ w ( b k i ) − γ v ( h i ) + O p ( δ ) , where we have used that N P i k ζ v ( b k i ) k = O p ( K/ ( N T )). Lastly, we have:1 N X i γ v θ ( b k i ) γ w ( b k i ) − γ v ( h i ) = 1 N X i γ v θ ( h i ) γ w ( h i ) − γ v ( h i )+ 1 N X i h γ v θ ( b k i ) γ w ( b k i ) − − γ v θ ( h i ) γ w ( h i ) − i γ v ( h i ) , where the first term is O p ( δ ) since it is a mean of i.i.d. terms with mean O (1 /T ) andvariance O (1 /T ), and the second term is O p ( δ ) since N P i k h i − h ( b k i ) k = O p ( δ )and the γ functions are Lipschitz-continuous.Second, let v θi w − i = η ( h i , ξ i ) + e i , where E h i = h,ξ i = ξ ( e i w i ) = 0. 
We have:1 N X i v θi (cid:16)e α ( b k i ) − α i (cid:17) = 1 N X i η ( h i , ξ i ) w i (cid:16)e α ( b k i ) − α i (cid:17) + 1 N X i e i w i (cid:16)e α ( b k i ) − α i (cid:17) , where the first term is O p ( δ ) since N P i k h i − h ( b k i ) k = O p ( δ ), N P i k ξ i − ξ ( b k i ) k = O p ( δ ), N P i k e α ( b k i ) − α i k = O p ( δ ), η is Lipschitz-continuous, and45 i is uniformly bounded (as in the proof of Lemma A2), and the second term is:1 N X i e i w i (cid:16)e α ( b k i ) − α i (cid:17) = 1 N X i e i w i (cid:16)e α ( b k i ) − α ( b k i ) (cid:17) + 1 N X i e i w i (cid:16) α ( b k i ) − α i (cid:17) = 1 N X i ew ( b k i ) (cid:16)e α ( b k i ) − α ( b k i ) (cid:17) + O p ( δ ) = O p ( δ ) , where we have used that the ( e i w i )’s have zero mean given h , ..., h N , ξ , ..., ξ N with bounded conditional variance, and N P i k ew ( b k i ) k = O p ( K/N ) = O p ( δ ).Finally, to show: N P i ∂ ∂θ∂θ ′ (cid:12)(cid:12) θ ( ℓ i ( b α ( b k i , θ ) , θ ) − ℓ i ( α ( θ, ξ i ) , θ )) = o p (1), we usesimilar arguments to the proof of Lemma A3. Example: a linear homoskedastic model.
Consider the model $Y_{it} = X_{it}\theta + \alpha_i + U_{it}$, where $X_{it}$ are scalar and $U_{it}$ are i.i.d. with mean zero and variance $\sigma^2$ given $X_{i1}, \dots, X_{iT}, \alpha_i$. Let $\widehat\theta$ be the GFE estimator based on a moment $h_i = \varphi(\alpha_i) + \varepsilon_i$ that satisfies Assumptions 1 and 2 for $\xi_i = \alpha_i$; that is, $h_i$ is only informative about $\alpha_i$, but not about the heterogeneity in $X_{it}$. Let $\zeta^X_i = \bar X_i - E_{h_i}(\bar X_i)$, $\zeta^\alpha_i = \alpha_i - E_{h_i}(\alpha_i)$, and $\zeta^U_i = \bar U_i - E_{h_i}(\bar U_i)$. We assume that $K$ is large enough for the approximation error to be of smaller order, and that $K/N$ tends to zero, as in Corollary 2. Under appropriate conditions in the regression model, using similar arguments to the proof of Corollary S2 (though with no need for any restriction on the relative rates of $N$ and $T$), one can show that $\widehat\theta$ admits the following expansion:
$$\widehat\theta = \theta_0 + \frac{\frac{1}{N}\sum_i \zeta^X_i(\zeta^\alpha_i + \zeta^U_i) + \frac{1}{NT}\sum_{i,t}(X_{it} - \bar X_i)(U_{it} - \bar U_i)}{E\big[(X_{it} - \bar X_i)^2\big] + \mathrm{Var}(\zeta^X_i)} + o_p\Big(\frac{1}{T}\Big) + o_p\Big(\frac{1}{\sqrt{NT}}\Big). \quad (S26)$$
Notice two differences between (S26) and the expansion of the FE estimator: the presence of $\mathrm{Var}(\zeta^X_i)$ in the denominator, and the presence of $\frac{1}{N}\sum_i \zeta^X_i(\zeta^\alpha_i + \zeta^U_i)$ in the numerator. In addition, notice that (S26) simplifies to the expression in Corollary 2 in the absence of additional heterogeneity $\nu_i$.

Although the arguments are as in the proof of Lemma A3, the target log-likelihood is different since here $\alpha(\theta,\xi_i)$ only depends on $\xi_i$, not on $(\xi_i',\nu_i')'$. In particular, the matrix $H^{\mathrm{cond}}$ in Assumption S4 differs from the matrix $H$ in Assumption 3; see (S26) for an example.
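To make the two-step construction in this linear example concrete, here is a small simulation sketch: individuals are classified by kmeans on the scalar moment $h_i$, and $\theta$ is then estimated by pooled least squares with group-specific intercepts, with the within (FE) estimator shown for comparison. The numerical constants, the correlation between $\alpha_i$ and the covariate mean, and the particular choice of $h_i$ as a noisy signal of $\alpha_i$ are illustrative assumptions for this sketch, not the exact design used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
N, T, K, theta0 = 500, 20, 30, 1.0

# Illustrative DGP in the spirit of the linear example: alpha_i and the
# covariate mean mu_i are correlated, and h_i is a noisy signal of alpha_i only.
alpha = rng.normal(size=N)
mu = 0.5 * alpha + rng.normal(size=N)          # heterogeneity in X not captured by h_i
X = mu[:, None] + rng.normal(size=(N, T))
Y = X * theta0 + alpha[:, None] + rng.normal(size=(N, T))
h = alpha + rng.normal(size=N) / np.sqrt(T)    # individual-specific classing moment

# Step 1: kmeans classification of the moments h_i into K groups.
k_hat = KMeans(n_clusters=K, n_init=10, random_state=0).fit(h[:, None]).labels_

# Step 2: pooled OLS of Y on X and group indicators (group-specific intercepts).
D = np.zeros((N * T, K))
D[np.arange(N * T), np.repeat(k_hat, T)] = 1.0
Z = np.column_stack([X.reshape(-1), D])
beta = np.linalg.lstsq(Z, Y.reshape(-1), rcond=None)[0]
print("GFE estimate of theta:", beta[0])

# Within (fixed-effects) estimator for comparison.
Xw, Yw = X - X.mean(1, keepdims=True), Y - Y.mean(1, keepdims=True)
print("FE estimate of theta:", (Xw * Yw).sum() / (Xw ** 2).sum())
```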
Model of wages and participation (see (2)). We model the initial condition as: $Y_{i1} = \mathbf{1}\{u(\alpha_i) \geq c(1;\theta) + U_{i1}\}$, with $U_{i1}$ standard normal, independent of $\alpha_i$. We set $c(0;\theta) = 0$ and $c(1;\theta) = -1$.
We set $\alpha_i$ and $V_{it}$ to be independent standard normals. In the simulations based on models (2) and (3) we weight the moments by the share of between-$i$ variance to total variance. To compute the variance $\widehat V_h$ used to set the number of groups in this dynamic model, we use a Newey-West expression with one lag. Lastly, for kmeans computation we use Lloyd's algorithm with 100 random starting values. Table S1 shows additional simulation results for this model.
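The implementation choices described in this paragraph can be sketched as follows: one possible weighting of the moments by their between-individual variance share, a lag-1 Newey-West estimate of the moment variance, and kmeans run from many random starts of Lloyd's algorithm. The exact formulas used in the simulations may differ from this sketch; the helper names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def weight_moments(h_it):
    """h_it: (N, T, L) array of moment contributions. Returns the standardized
    individual moments h_i, each component multiplied by the share of
    between-individual variance in total variance (one possible implementation
    of the weighting described above)."""
    N, T, L = h_it.shape
    h_bar = h_it.mean(axis=1)                        # (N, L) individual moments
    between = h_bar.var(axis=0)
    total = h_it.reshape(N * T, L).var(axis=0)
    z = (h_bar - h_bar.mean(0)) / h_bar.std(0)       # zero mean, unit variance
    return z * (between / total)

def newey_west_lag1(h_it):
    """One standard lag-1 Newey-West (Bartlett kernel) variance estimate for the
    moment contributions, averaged across individuals: gamma0 + gamma1, since
    the Bartlett weight at lag 1 with bandwidth 1 is 1/2 and covers both signs."""
    d = h_it - h_it.mean(axis=1, keepdims=True)
    gamma0 = (d ** 2).mean(axis=(0, 1))
    gamma1 = (d[:, 1:] * d[:, :-1]).mean(axis=(0, 1))
    return gamma0 + gamma1

# Synthetic moment contributions, purely for illustration.
rng = np.random.default_rng(2)
h_it = rng.normal(size=(500, 10, 3)) + rng.normal(size=(500, 1, 3))
h_weighted = weight_moments(h_it)
print("V_h estimate:", newey_west_lag1(h_it))

# kmeans classification with Lloyd's algorithm and 100 random starting values.
k_hat = KMeans(n_clusters=20, n_init=100, init="random", random_state=0).fit(h_weighted).labels_
print("group sizes:", np.bincount(k_hat))
```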
Probit model with time-varying heterogeneity (see (3)). The $U_{it}$'s are standard normal, independent of the $X_{it}$'s and the $\alpha_{it}$'s. The data generating process (DGP) for the scalar covariate is: $X_{it} = \mu_{it} + V_{it}$, where the $V_{it}$ are i.i.d. standard normal, independent of the $U_{it}$'s, $\alpha_{it}$'s, and $\mu_{it}$'s, and $\mu_{it} = \alpha_{it}$. We set $\theta_0 = 1$, and set $\xi_i$ and $\lambda_t$ to be i.i.d. Gamma(1,1) draws, independent of each other. Table S2 shows additional simulation results for this model, including for the two-way GFE estimator based on both the cross-sectional moments $\big(\frac{1}{N}\sum_i Y_{it}, \frac{1}{N}\sum_i X_{it}\big)'$ and the individual-specific moments $(\bar Y_i, \bar X_i)'$.
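A sketch of the two-way GFE computation for a design of this type is given below: individuals are classified on $(\bar Y_i, \bar X_i)$, time periods are classified on the cross-sectional means, and a probit with an intercept for each (individual group, time group) pair is then estimated. The functional form chosen for $\alpha_{it}$ and the scaling constants are assumptions made only for this sketch; the paper's DGP for model (3) involves a substitution parameter $\sigma$ that is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
N, T, K, L, theta0 = 500, 20, 5, 5, 1.0

# Illustrative DGP: xi_i and lambda_t are Gamma(1,1); the particular smooth
# function of (xi_i, lambda_t) used for alpha_it is an assumption for this sketch.
xi, lam = rng.gamma(1, 1, N), rng.gamma(1, 1, T)
alpha = 0.3 * (np.log1p(np.outer(xi, lam)) - 1.0)
X = alpha + rng.normal(size=(N, T))
Y = (X * theta0 + alpha + rng.normal(size=(N, T)) > 0).astype(float)

# First step: classify individuals on (Ybar_i, Xbar_i) and time periods on the
# cross-sectional moments ((1/N) sum_i Y_it, (1/N) sum_i X_it).
h = np.column_stack([Y.mean(1), X.mean(1)])
w = np.column_stack([Y.mean(0), X.mean(0)])
k_hat = KMeans(n_clusters=K, n_init=50, random_state=0).fit(h).labels_
l_hat = KMeans(n_clusters=L, n_init=50, random_state=0).fit(w).labels_

# Second step: probit of Y_it on X_it with an intercept for each (k, l) pair.
pair = np.repeat(k_hat, T) * L + np.tile(l_hat, N)
D = np.zeros((N * T, K * L))
D[np.arange(N * T), pair] = 1.0
res = sm.Probit(Y.reshape(-1), np.column_stack([X.reshape(-1), D])).fit(disp=0)
print("two-way GFE estimate of theta:", res.params[0])
```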
Conditional moments: an example. Consider the following probit model: $Y_{it} = \mathbf{1}\{X_{it}'\theta + \alpha_i + U_{it} \geq 0\}$, where the $U_{it}$ are i.i.d. standard normal independent of the $X_{it}$'s and $\alpha_i$, and $\theta_0$ is a vector of ones. The DGP for the $k$-th covariate is: $X_{itk} = \mathbf{1}\{\mu_{ik} + V_{itk} > 0\}$, where the $V_{itk}$ are i.i.d. standard normal independent of the $U_{it}$'s, $\alpha_i$, and the $\mu_{ik}$'s, and $\alpha_i$ and the $\mu_{ik}$'s follow independent standard normals. We vary the number of covariates between 1 and 3, so the total dimension of heterogeneity varies between 2 and 4. In this model, we expect the bias of FE to be moderate given the time horizon we consider ($T = 20$), since $\alpha_i$ is scalar and we use $h_i = (\bar Y_i, \bar X_i')'$ as moments.

Specifically, we demean and rescale $h_i$ so that all its components $h_{i\ell}$ have zero mean and unit variance, and multiply each component $h_{i\ell}$ by: $\max\Big( \frac{\sum_i h_{i\ell}^2 - \frac{1}{T}\sum_{i,t}(h_{it\ell} - h_{i\ell})^2}{\sum_i h_{i\ell}^2},\, 0 \Big)$. Using equal weights instead has small effects in these simulations; however, we observed that this particular weighting can improve performance when some moments are substantially less informative about the heterogeneity than others.

In Table S3 we show the biases, standard deviations, and root mean squared errors of FE and GFE among 1000 simulations, for $N = 1000$ and $T = 20$. In the top panel we report GFE estimates as a function of the number of groups $K$. We see that, while the bias of GFE remains moderate with one covariate, the bias increases substantially with the dimension of heterogeneity, in agreement with our theory. By comparison, the bias of FE in the bottom panel is indeed quite small, and it only increases moderately with the number of covariates.

The situation is rather different when using conditional moments in GFE. In the middle panel in Table S3 we show simulation results for GFE based on covariate-specific conditional means $\bar Y_i(x) = \sum_{t=1}^T \mathbf{1}\{X_{it} = x\} Y_{it} \big/ \sum_{t=1}^T \mathbf{1}\{X_{it} = x\}$. Importantly, in large samples these moments are only informative about $\alpha_i$, not $\mu_i$. We see that the bias of GFE with conditional moments increases only moderately with the number of covariates, and that FE and GFE with conditional moments have comparable — and quite small — biases.

Regarding implementation, note that, for a given $i$, all moments $\bar Y_i(x)$ may not be available since $i$'s covariates may never take the value $x$ in the sample. In Table S3, whenever $\bar Y_i(x)$ is not available, we set the moment to an imputed value, the overall conditional mean $\bar Y(x) = \sum_{i,t} \mathbf{1}\{X_{it} = x\} Y_{it} \big/ \sum_{i,t} \mathbf{1}\{X_{it} = x\}$. The imputation does not affect the theory, provided the event that any of the $\bar Y_i(x)$'s is not available has probability approaching zero in large samples. Moreover, we have obtained similar results using an alternative conditional first-step implementation that does not rely on imputations. To provide intuition in a simple case, suppose that $X_{it}$ are binary, i.i.d. over time given $\mu_i$, with $\Pr(X_{it} = 1 \mid \mu_i = \mu) \in (\epsilon, 1 - \epsilon)$ for all $\mu$, for some $\epsilon >$
0. Then Pr( ∃ i : X i = ... = X iT =0) ≤ N (1 − ǫ ) T , which tends to zero whenever (ln N ) /T → This implementation is as follows. Let I i ( x ) be the indicator that there exists a t such that X it = x , and let x , ..., x M denote the points of support of X it . In the first step, we use aLloyd’s-like algorithm to minimize the function P Ni =1 P Mm =1 I i ( x m ) (cid:0) Y i ( x m ) − g ( x m , k i ) (cid:1) , with T Bias std RMSE se/std Bias std RMSE se/stdGFE, η = 1 FE, η = 15 -0.570 0.058 0.573 1.082 -0.835 0.064 0.837 1.06610 -0.207 0.040 0.211 1.003 -0.418 0.040 0.420 1.04120 -0.088 0.027 0.092 0.993 -0.209 0.026 0.211 1.06430 -0.055 0.023 0.060 0.960 -0.140 0.023 0.142 0.99140 -0.040 0.019 0.044 1.000 -0.105 0.019 0.106 1.03450 -0.031 0.017 0.036 0.982 -0.084 0.017 0.086 1.022GFE, η = 2 FE, η = 25 -0.519 0.063 0.523 1.052 -0.876 0.068 0.879 1.06310 -0.163 0.043 0.169 0.985 -0.442 0.041 0.444 1.07020 -0.049 0.031 0.058 0.929 -0.225 0.028 0.227 1.04230 -0.032 0.024 0.040 0.964 -0.153 0.022 0.154 1.06840 -0.019 0.020 0.028 0.981 -0.113 0.019 0.115 1.04550 -0.015 0.019 0.024 0.944 -0.091 0.018 0.093 1.000 Notes: simulations, N = 1000 . “RMSE” is root mean squared error, “se” is the averageof standard error estimates across simulations, “std” is the standard deviation of the estimatoracross simulations. η is the risk aversion parameter. respect to k , ..., k N and g ( x , g ( x M , K ). a b l e S : P r o b i t m o d e l ( ) w i t h t i m e - v a r y i n g h e t e r og e n e i t y T Bias std RMSE se/std Bias std RMSE se/std Bias std RMSE se/std Bias std RMSE se/std2-way GFE, σ = −
10 GFE, σ = −
10 FE, σ = −
10 IFE, σ = −
105 -0.045 0.035 0.057 0.927 -0.044 0.035 0.056 0.926 0.442 0.071 0.448 0.706 0.116 0.064 0.133 0.47310 -0.016 0.024 0.028 0.939 -0.014 0.024 0.028 0.939 0.198 0.036 0.201 0.762 0.100 0.036 0.107 0.48820 -0.003 0.016 0.016 1.014 -0.000 0.016 0.016 1.014 0.098 0.019 0.100 0.911 0.087 0.020 0.089 0.59630 -0.000 0.013 0.013 1.009 0.003 0.013 0.013 1.013 0.069 0.014 0.070 0.966 0.059 0.014 0.061 0.67540 0.001 0.011 0.011 1.021 0.005 0.011 0.012 1.016 0.055 0.012 0.057 0.949 0.044 0.012 0.046 0.67650 0.001 0.010 0.010 0.995 0.006 0.010 0.012 0.994 0.048 0.011 0.049 0.947 0.036 0.011 0.037 0.6772-way GFE, σ =0 GFE, σ =0 FE, σ =0 IFE, σ =05 -0.044 0.042 0.060 0.883 -0.043 0.042 0.060 0.882 0.488 0.091 0.497 0.654 0.152 0.089 0.176 0.39810 -0.022 0.026 0.035 0.951 -0.021 0.026 0.034 0.949 0.226 0.045 0.231 0.710 0.118 0.040 0.125 0.49420 -0.009 0.018 0.020 0.969 -0.006 0.018 0.019 0.964 0.108 0.023 0.110 0.843 0.099 0.023 0.101 0.57130 -0.004 0.014 0.015 1.001 0.001 0.014 0.014 1.000 0.072 0.017 0.074 0.918 0.068 0.017 0.070 0.61540 -0.002 0.013 0.013 0.989 0.004 0.013 0.013 0.985 0.056 0.014 0.058 0.922 0.051 0.014 0.052 0.63950 -0.001 0.011 0.012 0.965 0.005 0.012 0.013 0.961 0.046 0.012 0.047 0.926 0.040 0.012 0.042 0.6432-way GFE, σ =1 GFE, σ =1 FE, σ =1 IFE, σ =15 -0.049 0.056 0.074 0.754 -0.048 0.056 0.074 0.754 0.565 0.146 0.583 0.506 0.207 0.117 0.238 0.35910 -0.032 0.029 0.043 0.981 -0.030 0.029 0.042 0.979 0.251 0.062 0.258 0.603 0.141 0.043 0.147 0.51320 -0.014 0.020 0.024 0.986 -0.010 0.020 0.022 0.983 0.114 0.027 0.117 0.825 0.125 0.029 0.128 0.51430 -0.007 0.016 0.017 0.996 -0.001 0.016 0.016 0.992 0.074 0.019 0.077 0.889 0.085 0.021 0.088 0.56140 -0.005 0.014 0.015 0.985 0.001 0.014 0.014 0.980 0.055 0.016 0.057 0.915 0.063 0.016 0.065 0.61150 -0.003 0.012 0.012 1.027 0.004 0.012 0.013 1.025 0.044 0.014 0.046 0.951 0.050 0.014 0.052 0.6322-way GFE, σ =10 GFE, σ =10 FE, σ =10 IFE, σ =105 -0.016 0.075 0.076 0.691 -0.015 0.075 0.076 0.692 0.706 0.262 0.753 0.386 0.300 0.255 0.394 0.21810 -0.013 0.035 0.037 0.946 -0.010 0.035 0.036 0.947 0.323 0.097 0.337 0.486 0.183 0.060 0.192 0.45820 -0.002 0.024 0.024 0.975 0.003 0.024 0.024 0.967 0.150 0.037 0.154 0.719 0.168 0.036 0.172 0.47430 0.002 0.019 0.019 0.991 0.008 0.019 0.021 0.983 0.100 0.025 0.104 0.814 0.121 0.030 0.125 0.45640 0.003 0.016 0.016 0.989 0.010 0.016 0.019 0.985 0.076 0.020 0.079 0.851 0.091 0.021 0.093 0.53650 0.002 0.014 0.015 0.995 0.010 0.014 0.018 0.996 0.061 0.017 0.063 0.911 0.073 0.017 0.075 0.592 N o t e s : S ee n o t e s t o T a b l e S . I F E i s i n t e r a c t e d fi x e d - e ff ec t s w i t ho n e f a c t o r . σ i s t h e s u b s t i t u t i o n pa r a m e t e r . able S3: Probit model with binary covariatesK Bias std RMSE Bias std RMSE Bias std RMSEGFE, 1 covariate GFE, 2 covariates GFE, 3 covariates5 -0.189 0.029 0.191 -0.293 0.031 0.295 -0.362 0.042 0.36510 -0.083 0.027 0.088 -0.205 0.032 0.207 -0.275 0.035 0.27820 -0.017 0.029 0.033 -0.118 0.030 0.122 -0.206 0.033 0.20930 0.006 0.029 0.030 -0.081 0.030 0.086 -0.166 0.032 0.16940 0.018 0.029 0.035 -0.056 0.030 0.064 -0.136 0.033 0.14050 0.026 0.030 0.039 -0.040 0.031 0.051 -0.116 0.033 0.120Cond. GFE, 1 covariate Cond. GFE, 2 covariates Cond. 
GFE, 3 covariates
5 -0.060 0.035 0.069 -0.085 0.037 0.093 -0.111 0.039 0.117
10 -0.045 0.033 0.056 -0.073 0.043 0.085 -0.100 0.044 0.109
20 -0.015 0.034 0.038 -0.046 0.037 0.059 -0.075 0.045 0.087
30 0.008 0.036 0.036 -0.031 0.036 0.047 -0.061 0.043 0.075
40 0.025 0.035 0.043 -0.020 0.037 0.041 -0.050 0.042 0.065
50 0.034 0.035 0.049 -0.012 0.036 0.038 -0.040 0.041 0.057
FE, 1 covariate FE, 2 covariates FE, 3 covariates
- 0.062 0.031 0.069 0.074 0.034 0.081 0.088 0.039 0.097

Notes: 1000 simulations, N = 1000, T = 20. In the top panel we show GFE estimates based on unconditional moments for different K values, in the middle panel we show GFE estimates based on conditional moments for different K values, and in the bottom row we show FE estimates.
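For completeness, the conditional first step used in the middle panel can be sketched as follows: covariate-specific conditional means $\bar Y_i(x)$ are computed for each support point $x$, and the overall mean $\bar Y(x)$ is imputed whenever individual $i$ never draws the value $x$, as described above. The helper name and the toy data are hypothetical; with several binary covariates, the support points would be the covariate combinations rather than the two values used here.

```python
import numpy as np

def conditional_moments(Y, X, support):
    """Covariate-specific conditional means Ybar_i(x) for a discrete covariate X_it,
    with the overall mean Ybar(x) imputed whenever individual i never draws x
    (the imputation rule described in the text)."""
    N, T = Y.shape
    overall = np.array([Y[X == x].mean() for x in support])
    H = np.empty((N, len(support)))
    for j, x in enumerate(support):
        for i in range(N):
            mask = (X[i] == x)
            H[i, j] = Y[i, mask].mean() if mask.any() else overall[j]
    return H

# Toy illustration with a single binary covariate (hypothetical data).
rng = np.random.default_rng(4)
N, T = 200, 20
X = (rng.random((N, T)) < 0.5).astype(int)
Y = (rng.normal(size=(N, T)) + X > 0).astype(float)
h = conditional_moments(Y, X, support=[0, 1])   # moments fed to the kmeans step
print(h.shape)
```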