Discretizing Unobserved Heterogeneity
aa r X i v : . [ ec on . E M ] F e b Discretizing Unobserved Heterogeneity ∗ St´ephane Bonhomme † Thibaut Lamadon ‡ Elena Manresa § Revised draft: January 2021
Abstract
We study discrete panel data methods where unobserved heterogeneity isrevealed in a first step, in environments where population heterogeneity isnot discrete. We focus on two-step grouped fixed-effects (GFE) estimators,where individuals are first classified into groups using kmeans clustering,and the model is then estimated allowing for group-specific heterogeneity.Our framework relies on two key properties: heterogeneity is a function— possibly nonlinear and time-varying — of a low-dimensional continuouslatent type, and informative moments are available for classification. Weillustrate the method in a model of wages and labor market participation,and in a probit model with time-varying heterogeneity. We derive asymp-totic expansions of two-step GFE estimators as the number of groups growswith the two dimensions of the panel. We propose a data-driven rule forthe number of groups, and discuss bias reduction and inference.
JEL codes:
C23, C38.
Keywords:
Unobserved heterogeneity, panel data, kmeans clustering,dimension reduction. ∗ We thank Anna Simoni, Manuel Arellano, Neele Balke, Jesus Carro, Gary Chamberlain, TimChristensen, Alfred Galichon, Chris Hansen, Joe Hotz, Gr´egory Jolivet, Arthur Lewbel, AnnaMikusheva, Roger Moon, Whitney Newey, Juan Pantano, Philippe Rigollet, Martin Weidner,and seminar audiences at various places for comments. The authors acknowledge support fromthe NSF grant number SES-1658920. The usual disclaimer applies. † University of Chicago, [email protected] ‡ University of Chicago, [email protected] § New York University, [email protected] Introduction
In both reduced-form and structural work in economics, it is common to modelunobserved heterogeneity as a small number of discrete types. Various estima-tion strategies are available, including discrete-type random-effects (as in Keaneand Wolpin, 1997, and many other applications) and grouped fixed-effects (as re-cently studied by Hahn and Moon, 2010, and Bonhomme and Manresa, 2015).These methods require the researcher to jointly estimate individual heterogene-ity and model parameters. In addition, little is known about their propertieswhen individual heterogeneity is not discrete in the population. In this paper,we study two-step discrete estimators for panel data, and provide conditions fortheir validity when heterogeneity is continuous.We focus on two-step grouped fixed-effects (GFE) estimators. In a first step,we classify individuals based on a set of individual-specific moments, using the kmeans clustering algorithm. The aim of the kmeans classification is to grouptogether individuals whose latent types are most similar. In a second step, weestimate the model by allowing for group-specific heterogeneity. This second stepis similar to fixed-effects (FE) estimation, albeit it involves a smaller number ofparameters that are group-specific instead of individual-specific. We analyze theproperties of these two-step estimators in panel data models where heterogeneity iscontinuous. Hence, in contrast with existing theoretical justifications for discrete-type methods, here we use discrete heterogeneity as a dimension reduction devicerather than as a substantive assumption about population unobservables.Our approach is targeted to environments with two key properties.
First ,unobserved heterogeneity is a function of a low-dimensional latent variable. Wedo not restrict this latent type to be discrete. In many economic models, agents’heterogeneity in preferences or technology is driven by a low-dimensional type,which enters the model nonlinearly and may affect multiple outcomes. As an Also related, nonparametric maximum likelihood methods (e.g., Heckman and Singer, 1984)rely on joint estimation of the distribution of heterogeneity and the parameters. In a network context, Gao et al. (2015) provide results for stochastic blockmodels undercontinuous heterogeneity. Buchinsky et al. (2005) also propose to group individuals in a first step using kmeans.
Second , the first-step moments satisfy an injectivity condition, which requiresany two individuals with the same population moments to have the same type.The choice of moments is important to ensure good performance. In examples, weshow how suitable moments arise naturally. In models with exogenous covariates,we propose and analyze the use of conditional moments to recover latent types.Our setup also covers models where heterogeneity varies over time. Unlikeadditive FE methods and interactive FE methods based on linear factor structures(Bai, 2009), GFE does not require heterogeneity to take an additive or interactiveform. As an illustration, we compare GFE and FE estimators in a probit modelwhere heterogeneity is a nonlinear function of a time-invariant factor loading anda time-specific factor.Our main results are large-
N, T asymptotic expansions of two-step GFE esti-mators under time-invariant and time-varying continuous heterogeneity. In bothsettings, GFE is consistent as the number of groups grows with the sample size,under conditions that we provide. We find that, when the population heterogene-ity is not discrete, estimating group membership induces an incidental parameterbias, similarly to FE methods. Moreover, since discreteness is an approximationin our setting, GFE is affected by approximation error. We propose a simple data-driven rule for the number of groups that controls the approximation error, anddiscuss how to reduce incidental parameter bias for inference.The outline of the paper is as follows. We introduce the setup and two-stepGFE estimators in Section 2, study their asymptotic properties in Section 3, andoutline several extensions in Section 4. The main proofs may be found in theappendix, and the supplemental material contains additional results.
We consider a panel data setup, where we denote outcome variables and exogenouscovariates as Y i =( Y ′ i , ..., Y ′ iT ) ′ and X i = ( X ′ i , ..., X ′ iT ) ′ , respectively, for i =1 , ..., N .3n our theory we cover two models. In the first one, unobserved heterogeneity istime-invariant. In this case, the conditional log-density of Y i given X i is given by: ln f i ( α i , θ ) = T X t =1 ln f ( Y it | Y i,t − , X it , α i , θ ) , (1)and the log-density of exogenous covariates X i takes the form:ln g i ( µ i ) = T X t =1 ln g ( X it | X i,t − , µ i ) , where θ is a vector of common parameters, and α i and µ i are individual-specificparameters. We leave the form of g unrestricted, and in estimation we will usea conditional likelihood approach based on f i alone. In other words, in applica-tions the researcher only needs to specify the parametric form of f i ( α i , θ ) in (1).However, the heterogeneity µ i in covariates plays an important role in our theory.In the second model, unobserved heterogeneity varies over time. Such variationin unobservables over calendar time (e.g., business cycle), age (e.g., life cycle),counties, or markets, is of interest in many applications. In the time-varying case,log-densities take the form:ln f i ( α i , θ ) = T X t =1 ln f ( Y it | Y i,t − , X it , α it , θ ) , ln g i ( µ i ) = T X t =1 ln g ( X it | X i,t − , µ it ) , where α i = ( α ′ i , ..., α ′ iT ) ′ and µ i = ( µ ′ i , ..., µ ′ iT ) ′ . In both models we areinterested in estimating θ , as well as average effects depending on α , ..., α N . GFE relies on two key assumptions that we now present. We defer the presen-tation of regularity conditions until Section 3.
First , we assume that unobserved In models with first-order dependence, we assume that Y i is observed and we conditionon it. Higher-order dependence can be accommodated similarly. In dynamic settings, Y it maycontain sequentially exogenous covariates in addition to outcome variables. ξ i . Assumption 1. (heterogeneity)(a) Time-invariant heterogeneity: There exist ξ i of fixed dimension d , and twoLipschitz-continuous functions α and µ , such that α i = α ( ξ i ) and µ i = µ ( ξ i ) .(b) Time-varying heterogeneity: There exist ξ i of fixed dimension d , λ t of di-mension d λ , and two functions α and µ that are Lipschitz-continuous in their firstargument, such that α it = α ( ξ i , λ t ) and µ it = µ ( ξ i , λ t ) . We will refer to ξ i as an individual type , and to d as the dimension of het-erogeneity. The researcher does not need to know d , α , or µ in applications. Inmodels with time-varying unobserved heterogeneity, Assumption 1 requires unob-servables to follow a factor structure. The link between α it , ξ i and λ t may benonlinear, the linear structure α it = ξ ′ i λ t (Bai, 2009) being covered as a specialcase. Moreover, the dimension of λ t is unrestricted. Our theory will show thatthe performance of two-step GFE crucially relies on ξ i being low-dimensional, aleading case being d = 1. We provide examples in the next subsection. Second , we rely on individual-specific moment vectors h i that are informativeabout the types ξ i . We state this formally as our second main assumption, where k · k denotes an Euclidean norm. Assumption 2. (injective moments)There exist vectors h i of fixed dimension, and a Lipschitz-continuous function ϕ ,such that plim T →∞ h i = ϕ ( ξ i ) , and N P Ni =1 k h i − ϕ ( ξ i ) k = O p (1 /T ) as N, T tend to infinity. Moreover, there exists a Lipschitz-continuous function ψ suchthat ξ i = ψ ( ϕ ( ξ i )) . Assumption 2 requires the individual moment vector h i to be informative about ξ i , in the sense that, for large T , ξ i can be uniquely recovered from h i . Neither ϕ nor ψ (which may depend on θ ) need to be known to the econometrician. In-tuitively, injectivity guarantees that one can separate the types of two individuals ξ i and ξ i ′ by comparing their moments h i and h i ′ . For example, an average h i = T P Tt =1 h ( Y it , X it ) will, under Assumption 1 and suitable regularity condi-5ions, converge as T tends to infinity to a function ϕ ( ξ i ) of the type ξ i . Werequire ϕ to be injective.The convergence rate in Assumption 2 requires appropriate conditions on theserial dependence of Y it and X it . In models with time-varying heterogeneity, ϕ will also depend on the λ t process. In such models, Assumption 2 requires themoments to be informative about ξ i , and not λ t . Injectivity is a key requirementfor consistency of two-step GFE estimators. More generally, the choice of moments h i is important for finite-sample performance. To illustrate the framework we now describe two examples, for which we willprovide illustrative simulations in Subsection 2.4. First, consider a dynamic modelof wages W ∗ it and labor force participation Y it : Y it = { u ( α i ) ≥ c ( Y i,t − ; θ ) + U it } ,W ∗ it = α i + V it ,W it = Y it W ∗ it , (2)where the wage W ∗ it is only observed when i works, U it are i.i.d. standard normal,independent of the past Y it ’s and α i , and V it are i.i.d. independent of all U it ’s, Y i , and α i . Here the same scalar expected payoff α i = ξ i , unobserved tothe econometrician, drives the wage and the decision to work. 
Individuals havecommon preferences denoted by the utility function u , the cost function c is state-dependent, and both u and c are unknown to the econometrician.In this setting, GFE provides a natural approach to exploit the functional linkbetween α i and u ( α i ), and to learn about the type α i using both wages andparticipation. For instance, when h i = ( W i , Y i ) ′ , where Z i = T P Tt =1 Z it denotesthe individual mean of Z it , injectivity is satisfied under mild conditions, provided W i = α i Y i + o p (1) and plim T →∞ Y i > θ in (2). However, a con-ventional FE estimator would treat α i and u i = u ( α i ) as unrelated parameters,so the FE estimate of θ would be solely based on the binary participation deci-6ions. Another strategy would be to rely on discrete-type random-effects methods,which are typically based on joint estimation. In contrast, we implement GFE intwo steps with no need for iterative estimation, and we justify the estimator inenvironments where heterogeneity is not restricted to be discrete.As a second example, consider the following probit model with time-varyingheterogeneity: Y it = { X ′ it θ + α it + U it ≥ } ,X it = µ it + V it , (3)where U it are i.i.d. standard normal, independent of all V it ’s, α it ’s, and µ it ’s, and V it are i.i.d. independent of all α it ’s and µ it ’s. Under Assumption 1, α it and µ it depend on a low-dimensional vector ξ i of factor loadings, so α it = α ( ξ i , λ t )and µ it = µ ( ξ i , λ t ). Here d is the dimension of the type ξ i governing both α it and µ it .To motivate why, in static models with covariates such as (3), α it and µ it may depend on a common low-dimensional type ξ i , suppose that, in every period,agent i chooses X it based on expected utility or profit maximization. She observes ξ i and λ t — which enter outcomes through α it — and takes her decision beforethe i.i.d. shock U it is realized. In such a case, X it will be a function of ξ i and λ t ,as well as idiosyncratic factors V it in the agent’s information set. Here we assumethat the agent’s information set, and primitives such as preferences or costs, donot include other i -specific elements beyond ξ i . When α ( · , · ) is additive or multiplicative in its arguments, model (3) can beestimated using two-way FE (Fern´andez-Val and Weidner, 2016) or interactiveFE (Bai, 2009, Chen et al. , 2020), respectively. However, when α ( · , · ) is unknown,these fixed-effects estimators are inconsistent in general. In contrast, GFE willremain consistent when unobservables are unknown nonlinear functions of factorloadings ξ i and factors λ t , and injectivity holds. Taking h i = ( Y i , X ′ i ) ′ as mo-ments in model (3), injectivity is satisfied when types have monotone effects on This example is reminiscent of Mundlak’s (1961) classic analysis of farm production func-tions, where soil quality ξ i is observed to the farmer but latent to the analyst. More generally, in Assumption 2 we require thatthe latent type ξ i can be asymptotically recovered from a moment vector whosedimension is not growing with the sample size. Two-step GFE consists of a classification step and an estimation step.
First step: classification.
We rely on the individual-specific moments h i tolearn about the individual types ξ i . Specifically, we partition individuals into K groups, corresponding to group indicators b k i ∈ { , ..., K } , by computing: (cid:16)b h (1) , ..., b h ( K ) , b k , ..., b k N (cid:17) = argmin( e h (1) ,..., e h ( K ) ,k ,...,k N ) N X i =1 (cid:13)(cid:13)(cid:13) h i − e h ( k i ) (cid:13)(cid:13)(cid:13) , (4)where { k i } are partitions of { , ..., N } into K groups, and e h ( k ) is a vector. Notethat b h ( k ) is simply the mean of h i in group b k i = k .In the kmeans optimization problem (4), the minimum is taken with respectto all possible partitions { k i } . Fast and stable optimization methods such asLloyd’s algorithm are available, although computing a global minimum may bechallenging; see Bonhomme and Manresa (2015) for references. Following theliterature, we will focus on the asymptotic properties of the global minimum andabstract from optimization error. Lastly, note that the quadratic loss functionin (4) can accommodate weights on different components of h i , although here forsimplicity we present the unweighted case. Second step: estimation.
We maximize the log-likelihood function with re-spect to common parameters θ and group-specific effects α , where the groups aregiven by the b k i estimated in the first step. We define the two-step GFE estimatoras: (cid:16)b θ, b α (1) , ..., b α ( K ) (cid:17) = argmax ( θ,α (1) ,...,α ( K )) N X i =1 ln f i (cid:16) α (cid:16)b k i (cid:17) , θ (cid:17) . (5) To see this, consider the case where α it is the only component of heterogeneity (i.e., µ it = 0in (3)), and take h i = Y i . Letting G denote the cdf of − ( V ′ it θ + U it ), injectivity will holdwhen α ( · , · ) is strictly increasing in its first argument and G is strictly increasing, since then ϕ ( ξ ) = plim T →∞ T P Tt =1 G ( α ( ξ, λ t )) is strictly increasing. K group-specific parameters instead of N individual-specific ones. In models with time-varying heterogeneity, α ( k ) willsimply be a vector ( α ( k ) ′ , ..., α T ( k ) ′ ) ′ . Choice of K . Two-step GFE estimation requires setting a number of groups K . We propose a simple data-driven selection rule based on the first step. Theconvergence rate of the kmeans estimator (and the rate of the GFE estima-tor) will be governed by two quantities: the kmeans objective function b Q ( K ) = N P Ni =1 k h i − b h ( b k i ) k , which decreases as K gets larger and the group approxi-mation becomes more accurate, and the variability V h = E [ k h i − ϕ ( ξ i ) k ] of themoment h i , which does not depend on K . We take the smallest K that guaranteesthat b Q ( K ) is of the same or lower order as V h . That is, letting b V h = V h + o p (1 /T ),we suggest setting: b K = min K ≥ n K : b Q ( K ) ≤ γ b V h o , (6)where γ ∈ (0 ,
1] is a user-specified parameter. In the simulations in the nextsubsection we will set γ = 1, although smaller γ values corresponding to larger K ’s will also be supported by our theory. To illustrate the performance of GFE in models where heterogeneity follows anonlinear factor structure, we present the results of a small-scale simulation studybased on our two examples (2) and (3). In both cases, we assume that the type ξ i governing heterogeneity is scalar. We compare the bias of GFE to that of FEand interactive FE estimators. In the supplemental material, we provide detailson the simulations and report additional results.In Figure 1, we compare GFE and FE in model (2), using a CRRA functionalform: u ( α ) = e α (1 − η ) − − η , with a risk aversion parameter η ∈ { , } . We focus on thedifference in costs c (0; θ ) − c (1; θ ), which measures the degree of state dependence When h i = T P Tt =1 h ( Y it , X it ) and observations are independent over time, one may take b V h = NT P Ni =1 P Tt =1 k h ( Y it , X it ) − h i k . With dependent data, one can use trimming or thebootstrap to estimate V h (Hahn and Kuersteiner, 2011, Arellano and Hahn, 2007).
10 20 30 40 50T0.20.40.60.81.0 p a r a m e t e r = 1 10 20 30 40 50T= 2 Notes: Means of c (0; b θ ) − c (1; b θ ) over 1000 simulations. GFE is indicated in solid, FE is indashed, and the truth c (0; θ ) − c (1; θ ) = 1 is in dotted. N = 1000 , and T is indicated on thex-axis. η is the risk aversion parameter in u ( · ) . See the supplemental material for details. in participation decisions. We take h i = ( W i , Y i ) ′ as moments for GFE, and reportaverage parameter estimates over 1000 simulations. We set N =1000 and vary Tbetween 5 and 50. We find that FE is more biased than GFE for both values of riskaversion. This is consistent with wages and participation providing informativemoments about the latent type in this setting.In Figure 2 we compare GFE, FE, and interactive FE in model (3) with X it scalar, using a CES specification: α it = ( aξ σi + (1 − a ) λ σt ) σ , for σ ∈{− , , , } and a =0 .
5, and µ it = α it . The factors λ t and the individual loadings ξ i enterheterogeneity in a nonlinear way. We show estimates of θ for various estimators:Figure 2: Probit model (3) with time-varying heterogeneity
20 40T1.01.21.41.6 p a r a m e t e r = 10 20 40T= 0 20 40T= 1 20 40T= 10 Notes: Means of b θ over 1000 simulations. GFE is indicated in solid, FE is in dashed, interactiveFE is in dash-dotted, and the truth θ = 1 is in dotted. N = 1000 , and T is indicated on thex-axis. σ is the substitution parameter in α ( · , · ) . See the supplemental material for details. Y i , X i ) ′ as moments for GFE. Note that both Y i and X i are informative about ξ i in this data generating process. We reportparameter averages over 1000 simulations, for N =1000. We find that, while GFE,FE, and interactive FE are all biased, the bias of GFE is smaller across all σ values. In this section we provide asymptotic expansions for two-step GFE estimators.Our first result is a rate of convergence for kmeans. Let us define the approximationerror one would make if one were to discretize the latent types ξ i directly, as: B ξ ( K ) = min( e ξ (1) ,..., e ξ ( K ) ,k ,...,k N ) 1 N N X i =1 (cid:13)(cid:13)(cid:13) ξ i − e ξ ( k i ) (cid:13)(cid:13)(cid:13) , (7)where, similarly to (4), the minimum is taken with respect to all partitions { k i } and vectors e ξ ( k ). In the asymptotic analysis we let T = T N and K = K N tend toinfinity jointly with N . Lemma 1.
Let Assumption 2 hold. Let b h (1) , ..., b h ( K ) and b k , ..., b k N given by (4).Then, as N, T, K tend to infinity we have: N N X i =1 (cid:13)(cid:13)(cid:13)b h ( b k i ) − ϕ ( ξ i ) (cid:13)(cid:13)(cid:13) = O p (cid:18) T (cid:19) + O p ( B ξ ( K )) . The bound in Lemma 1 has two terms: an O p (1 /T ) term that depends on thenumber of periods used to construct the moments h i , and an O p ( B ξ ( K )) termthat reflects the presence of an approximation error. The rate at which B ξ ( K )tends to zero depends on the dimension of ξ i . Graf and Luschgy (2002, Theorem5.3) provide explicit characterizations in the case where ξ i has compact support. Large-
N, T theory implies that additive and interactive FE are consistent when σ = 1and σ = 0, respectively. Figure 2 shows that, despite being large- N, T consistent in thesespecifications, in our simulations, additive and interactive FE have larger biases than GFE forthe N and T values we consider. See Graf and Luschgy (2002, p. 875) for a discussion of the compact support assumption. B ξ ( K ) = O p ( K − ) when ξ i isone-dimensional, and B ξ ( K ) = O p ( K − ) when ξ i is two-dimensional. Lemma 2. (Graf and Luschgy, 2002) Let ξ i be random vectors with compactsupport in R d . Then, as N, K tend to infinity we have B ξ ( K ) = O p ( K − d ) . We now use these results to study the properties of GFE in models with time-invariant and time-varying heterogeneity, in turn. We use the shorthand notation E Z ( W ) and E Z = z ( W ) for the conditional expectations of W given Z and Z = z ,respectively. In the time-varying case, we denote as λ the process of λ t ’s, andas E λ = λ ( W ) the conditional expectation of W given λ = λ . We use a similarnotation for variances. Finally, k M k denotes the spectral norm of a matrix M . To state our first main theorem, where heterogeneity is time-invariant, we makethe following assumptions, where ℓ it ( α i , θ ) = ln f ( Y it | Y i,t − , X it , α i , θ ), ℓ i ( α i , θ ) = T P Tt =1 ℓ it ( α i , θ ), and α ( θ, ξ ) = argmax α E ξ i = ξ ( ℓ i ( α, θ )) for all θ, ξ . Assumption 3. (regularity, time-invariant heterogeneity)(i) ( Y ′ i , X ′ i , ξ ′ i , h ′ i ) ′ are i.i.d.; ( Y ′ it , X ′ it ) ′ are stationary for all i ; ℓ it ( α, θ ) is threetimes differentiable in ( α, θ ) for all i, t ; and the parameter space Θ for θ is compact, the space for α i is compact, and θ belongs to the interior of Θ .(ii) N, T, K tend jointly to infinity; sup ξ,α,θ | E ξ i = ξ ( ℓ it ( α, θ )) | = O (1) , and sim-ilarly for the first three derivatives of ℓ it ; inf ξ,α,θ E ξ i = ξ ( − ∂ ℓ it ( α,θ ) ∂α∂α ′ ) > ;and max i sup α,θ (cid:12)(cid:12) ℓ i ( α, θ ) − E ξ i ( ℓ i ( α, θ )) (cid:12)(cid:12) = o p (1) , and similarly for the firstthree derivatives of ℓ i .(iii) inf ξ,θ E ξ i = ξ ( − ∂ ℓ it ( α ( θ,ξ ) ,θ ) ∂α∂α ′ ) > ; E [ T P Tt =1 ℓ it ( α ( θ, ξ i ) , θ )] has a unique max-imum at θ on Θ , and its matrix of second derivatives is − H < ; and sup θ NT P Ni =1 P Tt =1 k ∂ ℓ it ( α ( θ,ξ i ) ,θ ) ∂θ∂α ′ k = O p (1) .(iv) sup e ξ,α k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ ) k , sup e ξ,α k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ ) k , and sup e ξ,θ k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ ( ∂ℓ it ( α ( θ, e ξ ) ,θ ) ∂α ) k are O (1) . That is, ln f ( y it | y i,t − , x it , α, θ ) is three times differentiable in ( α, θ ), for almost all( y it , y it − , x it ).
12n part (i) in Assumption 3 we treat heterogeneity as random in order to useLemma 2, which requires ξ i to be i.i.d. draws from a distribution. However, notewe do not restrict how α i and µ i depend on each other. Moreover, while ourresults require asymptotic stationarity of the time-series processes, the theoremcould be extended to allow for nonstationary initial conditions.In part (ii) we require strict concavity of the log-likelihood as a function of α .Concavity holds in a number of nonlinear panel data models such as probit andlogit models, tobit, Poisson, or multinomial logit; see Fern´andez-Val and Weidner(2016) and Chen et al. (2020). One can show that Theorem 1 continues to holdwithout concavity, under an identification condition and an assumption boundingthe derivatives of the empirical GFE objective function. Importantly, note that H − is the asymptotic variance of the FE estimator. As a result, H being positivedefinite rules out models that are not identified under FE, such as a linear modelwith a time-invariant covariate and a heterogeneous intercept.In part (iii) we introduce the target log-likelihood NT P Ni =1 P Tt =1 ℓ it ( α ( θ, ξ i ) , θ )(Arellano and Hahn, 2007), which we will show approximates the GFE log-likelihoodin large samples under our assumptions (note that α ( θ , ξ i )= α i ). In part (iv) werequire some moments to be bounded asymptotically.We now state our first main result, where we denote, evaluating all quantitiesat true values ( θ , α i ) and omitting the dependence from the notation: s i = 1 T T X t =1 ∂ℓ it ∂θ + E ξ i (cid:18) ∂ ℓ it ∂θ∂α ′ (cid:19) (cid:20) E ξ i (cid:18) − ∂ ℓ it ∂α∂α ′ (cid:19)(cid:21) − ∂ℓ it ∂α ! , (8) H = plim N,T →∞ N T N X i =1 T X t =1 E ξ i (cid:18) − ∂ ℓ it ∂θ∂θ ′ (cid:19) − E ξ i (cid:18) ∂ ℓ it ∂θ∂α ′ (cid:19) (cid:20) E ξ i (cid:18) − ∂ ℓ it ∂α∂α ′ (cid:19)(cid:21) − E ξ i (cid:18) ∂ ℓ it ∂α∂θ ′ (cid:19) ! . (9) Theorem 1.
Let the conditions of Lemmas 1 and 2 and Assumptions 1, 2 and 3 old. Then, as N, T, K tend to infinity we have: b θ = θ + H − N N X i =1 s i + O p (cid:18) T (cid:19) + O p (cid:16) K − d (cid:17) + o p (cid:18) √ N T (cid:19) . (10)The first three terms in (10) also appear in large- N, T expansions of FE estima-tors (e.g., Hahn and Newey, 2004). Similarly to FE, GFE is subject to incidentalparameter O p (1 /T ) bias. This contrasts with the properties of GFE estimatorsunder discrete heterogeneity (e.g., Hahn and Moon, 2010, Bonhomme and Man-resa, 2015). Indeed, when heterogeneity is not restricted to have a small numberof points of support, classification noise affects the properties of second-step es-timators in general. This motivates using bias reduction techniques for inferenceanalogous to those used in FE, as we will discuss in the next section.The O p ( K − d ) term in (10) reflects the approximation error, which dependson the number of groups. Setting K = b K according to (6) guarantees that theapproximation error is O p (1 /T ). Formally, we have the following result. Corollary 1.
Let the conditions in Theorem 1 hold. Let K = b K given by (6),with γ = O (1) . Then, as N, T tend to infinity we have: b θ = θ + H − N N X i =1 s i + O p (cid:18) T (cid:19) + o p (cid:18) √ N T (cid:19) . (11)Under Corollary 1, the biases of FE and GFE have the same order of magni-tude. However, the required value of K depends on the dimension d of individualheterogeneity. Specifically, when ξ i follows a continuous distribution of dimen-sion d , setting K proportional to or greater than min( T d , N ) will ensure that theapproximation error is O p (1 /T ). For small d (e.g., when d = 1) this will typicallyrequire a small number of groups (of the order of T ).GFE can have advantages compared to FE, for two reasons. First, the two-stepmethod can allow researchers to select moments that are particularly informative In the supplemental material we provide a similar expansion for GFE estimators of averageeffects M = NT P Ni =1 P Tt =1 m ( X it , α i , θ ), which are functions of both common parametersand individual heterogeneity. /T , yet K/N tends to zero. We have the following.
Corollary 2.
Let the conditions in Theorem 1 hold. Let K = b K given by (6),with γ = o (1) . Suppose that K/N tends to zero, and that Assumption A1 in theappendix holds. Then the O p (1 /T ) term in (11) takes the explicit form C/T + o p (1 /T ) , where: CT = H − ∂∂θ (cid:12)(cid:12)(cid:12)(cid:12) θ E (cid:20) (cid:13)(cid:13)b α i ( θ ) − E ξ i ( b α i ( θ )) (cid:13)(cid:13) i ( θ ) − k b α i ( θ ) − E h i ( b α i ( θ )) k i ( θ ) (cid:21) , (12) with b α i ( θ ) = argmax α ℓ i ( θ, α ) , Ω i ( θ ) = E ξ i ( − ∂ ℓ it ( α ( θ,ξ i ) ,θ ) ∂α∂α ′ ) , and k V k = V ′ Ω V . Corollary 2 shows that the first-order asymptotic bias of GFE is the differencebetween two terms. The bias is zero when h i is an injective function of ξ i ; i.e.,when ε i = h i − ϕ ( ξ i ) = 0. More generally, the bias can be expanded in ε i , and it issmall when moments provide accurate estimates of the latent types. Moreover, thefirst term on the right-hand side of (12) coincides with the bias of FE (e.g., Arellanoand Hahn, 2007). The form of (12) implies that the biases of FE and GFE areequal when the moments are the FE estimates h i = b α i ( θ ), however other momentchoices can lead to smaller biases. From this perspective, GFE provides flexibilityto use well-suited proxies of the latent types. As an example, our simulations ofthe labor force participation model (2) show that, by jointly exploiting wages andparticipation to construct moments that are informative about the latent type,GFE can have smaller bias than FE (and smaller mean squared error as well, asshown in the supplemental material).A second advantage of GFE comes from the use of grouping, and from theresulting regularization. Indeed, individual FE estimates can be highly variablewhenever the number of parameters per individual is large. In such cases, reduc-ing the number of parameters through grouping can improve performance. Forinstance, the ability to handle multiple components of heterogeneity is central tothe performance of GFE in models with time-varying unobserved heterogeneity.15his is the case we focus on next. To state our second main theorem, where heterogeneity is time-varying, we makethe following assumptions, where ℓ it ( α it , θ ) = ln f ( Y it | Y i,t − , X it , α it , θ ), ℓ i ( α i , θ ) = T P Tt =1 ℓ it ( α it , θ ), and α t ( θ, ξ ) = argmax α E ξ i = ξ,λ = λ ( ℓ it ( α, θ )). Assumption 4. (regularity, time-varying heterogeneity)(i) ( Y ′ i , X ′ i , ξ ′ i , h ′ i ) ′ are i.i.d. 
across i conditional on λ ; ( Y ′ it , X ′ it , λ ′ t ) ′ are sta-tionary for all i ; ℓ it ( α it , θ ) is three times differentiable, for all i, t ; and Θ andthe space for α it are compact, and θ belongs to the interior of Θ .(ii) N, T, K tend jointly to infinity; max t sup ξ,λ,α,θ | E ξ i = ξ,λ = λ ( ℓ it ( α, θ )) | = O (1) ,and similarly for the first three derivatives of ℓ it ; the minimum (respectively,maximum) eigenvalue of ( − ∂ ℓ it ( α,θ ) ∂α∂α ′ ) is bounded away from zero (resp., in-finity) with probability one, uniformly in i, t, α, θ ; the third derivatives of ℓ it ( α, θ ) are O p (1) , uniformly in i, t, α, θ ; and NT P Ni =1 P Tt =1 [ ℓ it ( α it , θ ) − E ξ i ,λ ( ℓ it ( α it , θ ))] = O p (1) , and similarly for the first three derivatives.(iii) min t inf ξ,λ,θ E ξ i = ξ,λ = λ ( − ∂ ℓ it ( α t ( θ,ξ ) ,θ ) ∂α∂α ′ ) > ; E [ T P Tt =1 ℓ it ( α t ( θ, ξ i ) , θ )] has aunique maximum at θ on Θ , and its matrix of second derivatives is − H < ;and sup θ NT P Ni =1 P Tt =1 k ∂ ℓ it ( α t ( θ,ξ i ) ,θ ) ∂θ∂α ′ k = O p (1) .(iv) k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ,λ = λ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ ) k , k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ,λ = λ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ ) k , and k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ,λ = λ ( ∂ℓ it ( α t ( θ, e ξ ) ,θ ) ∂α ) k are O (1) , uniformly in t , e ξ , λ , α , and θ .(v) E h i = h,ξ i = ξ,λ = λ ( ∂ℓ it ( α t ( θ,ξ ) ,θ ) ∂α ) and E h i = h,ξ i = ξ,λ = λ (vec ∂∂θ ′ (cid:12)(cid:12) θ ∂ℓ it ( α t ( θ,ξ ) ,θ ) ∂α ) are twicedifferentiable with respect to h , with first and second derivatives that are uni-formly bounded in t , ξ , λ , h in the support of h i given λ = λ , and θ ∈ Θ ; and k Var h i = h,ξ i = ξ,λ = λ ( ∂ℓ it ( α t ( θ,ξ ) ,θ ) ∂α ) k and k Var h i = h,ξ i = ξ,λ = λ (vec ∂∂θ ′ (cid:12)(cid:12) θ ∂ℓ it ( α t ( θ,ξ ) ,θ ) ∂α ) k are O (1) , uniformly in t , ξ , λ , h , and θ . In part (ii) in Assumption 4, we impose a stronger concavity condition than Note that α t ( θ, ξ i ) depends on the process λ in addition to the type ξ i , although we leavethe dependence on λ implicit in the notation. In a static model, α t ( θ, ξ i ) is a function of ξ i and λ t , while in a dynamic model it also depends on the history of the time effects ( λ t , λ t − , , ... ).
16n Assumption 3. The other parts are similar to Assumption 3, except part (v)where we require regularity of certain conditional expectations and variances.We next state our second main result, where, differently from Theorem 1, s i in(8) and H in (9) are now evaluated at ( θ , α it ), and expectations are conditionalon ( ξ i , λ ). Theorem 2.
Let the conditions of Lemmas 1 and 2 and Assumptions 1, 2 and 4hold. Then, as
N, T, K tend to infinity such that
K/N tends to zero, we have: b θ = θ + H − N N X i =1 s i + O p (cid:18) T (cid:19) + O p (cid:18) KN (cid:19) + O p (cid:16) K − d (cid:17) + o p (cid:18) √ N T (cid:19) . (13)Theorem 2 shows that GFE is consistent as N, T, K tend to infinity and
K/N tends to zero. This requires no parametric assumption about how ξ i and λ t affectindividual and time heterogeneity, unlike additive or interactive FE methods.To give intuition, consider the probit model (3) with time-varying unobserv-ables. Under Assumption 2, in the first step, GFE consistently estimates an injec-tive function ϕ i = ϕ ( ξ i ) of the type. One can then rewrite the outcome equationin (3) as Y it = { X ′ it θ + α ( ψ ( ϕ i ) , λ t ) + U it ≥ } , where ψ is the function intro-duced in Assumption 2, and α it = α ( ψ ( ϕ i ) , λ t ) is simply a time-varying functionof ϕ i . In the second step, GFE estimates this function by including group-timeindicators in the probit regression.As in Theorem 1, the expansion in Theorem 2 features a combination of inci-dental parameter bias and approximation error. When using the rule (6) for K ,the approximation error is of the same or lower order compared to 1/T. However,the O p ( K/N ) term is a new contribution relative to the time-invariant case, whichreflects the estimation of KT group-specific parameters using N T observations.As an example, when d = 1 and K is chosen of the order of T , the O p termsin (13) are O p (1 /T + T /N ). Although this rate of convergence can be fastwhen N is sufficiently large relative to T , it is too slow to apply conventional In particular, we use part (ii) in Assumption 4 to establish consistency. Note that thiscondition can be restrictive in models with time-varying random coefficients. When
N/T →
0, one could obtain a faster rate in (13) by choosing another rule for K . λ t is low-dimensional, we describe how toobtain a faster convergence rate by grouping both individuals and time periods. In models with time-invariant heterogeneity, Corollary 1 can be used to charac-terize the asymptotic distribution of GFE estimators. However, as in FE, thepresence of the O p (1 /T ) term in (11) shifts the distribution of b θ away from θ whenever T is not large relative to N . A variety of methods are available to bias-correct FE estimators and construct asymptotically valid confidence intervals; seeArellano and Hahn (2007) for a review. Consider the setup of Corollary 1, underthe additional assumption that the O p (1 /T ) term in (11) is equal to C/T + o p (1 /T )for some constant C . In this case, one can show that half-panel jackknife (Dhaeneand Jochmans, 2015) gives asymptotically valid inference based on GFE as N and T tend to infinity at the same rate. The distribution of the bias-corrected GFEestimator is then asymptotically normal centered at the truth, and the asymptoticvariance H − can be consistently estimated by replacing the expectations in (8)and (9) by group-specific means.In settings where heterogeneity varies over time, it can be desirable to groupnot only individuals as in (4), but also time periods (or alternatively countiesor markets, depending on the application). We now describe such a method,and discuss its potential for performing inference in models with time-varyingheterogeneity. In the two-way GFE approach, we classify time periods based oncross-sectional moments w t = N P Ni =1 w ( Y it , X it ), and compute: (cid:16) b w (1) , ..., b w ( L ) , b l , ..., b l T (cid:17) = argmin ( e w (1) ,..., e w ( L ) ,l ,...,l T ) T X t =1 (cid:13)(cid:13) w t − e w ( l t ) (cid:13)(cid:13) , (14) In particular, half-panel jackknife is valid under the conditions of Corollary 2, which requirestaking γ = o (1) in our rule (6) for K in order for the approximation error to be of small order.Deriving primitive conditions for the validity of half-panel jackknife and other bias-reductionmethods for other choices of K is left for future work. { l t } are partitions of { , ..., T } into L groups. Given the group indicators b k i and b l t , we then maximize P Ni =1 P Tt =1 ln f ( Y it | X it , α ( b k i , b l t ) , θ ), with respect to θ and the KL group-specific parameters α ( k, l ).Two-way GFE estimators can be expanded similarly to Theorem 2, under twomain additional assumptions: the model is static and observations are independentacross i and t , and the dimensions d λ of time heterogeneity λ t and d of individualheterogeneity ξ i are both small. Then, for s i and H as in Theorem 2, we show inthe supplemental material that: b θ = θ + H − N N X i =1 s i + O p (cid:18) T + 1 N + KLN T (cid:19) + O p (cid:16) K − d + L − dλ (cid:17) + o p (cid:18) √ N T (cid:19) . Suppose d = d λ = 1, and K is given by (6) with γ asymptotically constant, withan analogous choice for L . Then the O p term in this expansion can be shownto be O p (1 /T + 1 /N ). We leave to future work the formal study of the validityof bias reduction methods for inference, such as two-way split panel jackknife(Fern´andez-Val and Weidner, 2016), as N and T tend to infinity at the same rate. Our theory shows that the dimension d of heterogeneity plays a key role in theproperties of GFE. 
While models with scalar latent types ξ i , such as model (2)of wages and labor force participation, are not uncommon in economics, manyapplications involve conditioning covariates. Under Assumptions 1 and 2, themoments h i should, asymptotically, be injective functions of all the heterogeneitycoming from both Y i and X i . However, when X i depends on multiple componentsof heterogeneity, this might lead to a large dimension d .We now show that GFE can still perform well under a weaker form of injec-tivity. Consider the case where Assumption 1 is replaced by α i = α ( ξ i ) and µ i = µ ( ξ i , ν i ), where ν i is another latent component that affects covariates.Moreover, instead of requiring injectivity for both ξ i and ν i , let us maintainAssumption 2, which only requires h i to be injective for ξ i . In other words, h i needs to be directly informative about the unobserved heterogeneity component19 i that appears in the conditional distribution of Y i given X i . We show in thesupplemental material that, under regularity conditions otherwise similar to thoseof Corollary 1, the convergence rate of GFE is unaffected by the dimension of ν i .Specifically, when K = b K is given by (6) with γ = O (1) (which adapts to thedimension of ξ i and not the one of ν i ), we have: b θ = θ + O p (cid:18) T (cid:19) + O p (cid:18) √ N T (cid:19) . (15)To prove (15) we assume that the rate condition T d = O ( N ) holds, where d isthe (small) dimension of ξ i . In models with time-varying conditioning covariates, a simple way to targetmoments to ξ i is to construct h i using the conditional distribution of Y i given X i .To see this, consider a static model f ( Y it | X it , α i , θ ) where X it has finite support.In this case, we have under appropriate conditions: P Tt =1 { X it = x } h ( Y it , X it ) P Tt =1 { X it = x } | {z } = h i ( x ) = E X it = x,ξ i [ h ( Y it , X it )] | {z } = ϕ ( x,ξ i ) + o p (1) , where h i ( x ) is only defined when P Tt =1 { X it = x } 6 = 0, and, importantly, ϕ ( x, ξ i )does not depend on ν i . In the supplemental material we discuss implementation,and we report simulation results in a probit model with binary covariates. Wefind that using conditional moments can enhance the performance of GFE insuch settings. We leave the analysis of conditional moments in the presence ofcontinuous covariates to future work. In this paper, we analyze some properties of two-step grouped fixed-effects(GFE) methods in settings where population heterogeneity is not discrete. Our In the supplemental material, we provide an asymptotic expansion for GFE in a linearhomoskedastic model under a small approximation error, as in Corollary 2. The argumentrequires no restriction on the relative rates of N and T . Interestingly, in this case the asymptoticvariances of GFE and FE differ, since the within-group variation in ν i tends to decrease thevariance, yet the expansion features an additional score term compared to Theorem 1. et al. (2019) for matchedemployer-employee data in the presence of continuous firm heterogeneity. Otherpotential applications include nonlinear factor models, nonparametric and semi-parametric panel data models such as quantile regression with individual effects,and network models. References [1] Arellano, M., and J. Hahn (2007): “Understanding Bias in Nonlinear PanelModels: Some Recent Developments,”. In: R. Blundell, W. Newey, andT. Persson (eds.):
Advances in Economics and Econometrics, Ninth WorldCongress , Cambridge University Press.[2] Bai, J. (2009), “Panel Data Models with Interactive Fixed Effects,”
Econo-metrica , 77, 1229–1279.[3] Bonhomme, S., and E. Manresa (2015): “Grouped Patterns of Heterogeneityin Panel Data,”
Econometrica , 83(3), 1147–1184.[4] Bonhomme, S., T. Lamadon, and E. Manresa (2019): “A DistributionalFramework for Matched Employer-Employee Data,”
Econometrica , 87(3),699–739.[5] Buchinsky, M., J. Hahn, and J. Hotz (2005): “Cluster Analysis: A tool forPreliminary Structural Analysis,” unpublished manuscript.[6] Chen, M., I. Fern´andez-Val, and M. Weidner (2020): “Nonlinear Panel Modelswith Interactive Effects,” to appear in the
Journal of Econometrics .[7] Dhaene, G. and K. Jochmans (2015): “Split Panel Jackknife Estimation,”
Review of Economic Studies , 82(3), 991–1030.218] Fern´andez-Val, I., and M. Weidner (2016): “Individual and Time Effects inNonlinear Panel Data Models with Large N, T,”
Journal of Econometrics ,196, 291–312.[9] Gao, C., Y. Lu, and H. H. Zhou (2015): “Rate-Optimal Graphon Estimation,”
Annals of Statistics , 43(6), 2624–2652.[10] Graf, S., and H. Luschgy (2002): “Rates of Convergence for the EmpiricalQuantization Error”,
Annals of Probability , 30(2), 874–897.[11] Hahn, J., and G. Kuersteiner (2011): “Bias Reduction for Dynamic NonlinearPanel Models with Fixed Effects,”
Econometric Theory , 27(6), 1152–1191.[12] Hahn, J., and H. Moon (2010): “Panel Data Models with Finite Number ofMultiple Equilibria,”
Econometric Theory , 26(3), 863–881.[13] Hahn, J. and W.K. Newey (2004): “Jackknife and Analytical Bias Reductionfor Nonlinear Panel Models”,
Econometrica , 72, 1295–1319.[14] Heckman, J.J., and B. Singer (1984): “A Method for Minimizing the Impactof Distributional Assumptions in Econometric Models for Duration Data,”
Econometrica , 52(2), 271–320.[15] Keane, M., and K. Wolpin (1997): “The Career Decisions of Young Men,”
Journal of Political Economy , 105(3), 473–522.[16] Kennan, J., and J. Walker (2011): “The Effect of Expected Income on Indi-vidual Migration Decisions”,
Econometrica , 79(1), 211–251.[17] Mundlak, Y. (1961): “Empirical Production Function Free of ManagementBias,”
Journal of Farm Economics , 43(1), 44–56.
APPENDIX
Proof of Lemma 1.
Define B ϕ ( ξ ) ( K ) = min( e h, { k i } ) N P Ni =1 k ϕ ( ξ i ) − e h ( k i ) k ,similarly to (7), and denote: ( h, { k i } ) = argmin( e h, { k i } ) P Ni =1 k ϕ ( ξ i ) − e h ( k i ) k .By definition of ( b h, { b k i } ), we have: P Ni =1 k h i − b h ( b k i ) k ≤ P Ni =1 k h i − h ( k i ) k (al-most surely). Letting ε i = h i − ϕ ( ξ i ), we thus have, using the triangle inequalitytwice:1 N N X i =1 (cid:13)(cid:13)(cid:13) ϕ ( ξ i ) − b h ( b k i ) (cid:13)(cid:13)(cid:13) ≤ N N X i =1 (cid:13)(cid:13)(cid:13) h i − b h ( b k i ) (cid:13)(cid:13)(cid:13) + 2 N N X i =1 k h i − ϕ ( ξ i ) k N N X i =1 k h i − h ( k i ) k + 2 N N X i =1 k ε i k ≤ N N X i =1 k ϕ ( ξ i ) − h ( k i ) k !| {z } = B ϕ ( ξ ) ( K ) + 6 N N X i =1 k ε i k . By Assumption 2, N P Ni =1 k ε i k = O p (1 /T ). In addition, since ϕ is Lipschitz-continuous, there exists a constant τ such that k ϕ ( ξ ′ ) − ϕ ( ξ ) k ≤ τ k ξ ′ − ξ k for all( ξ, ξ ′ ). This implies that B ϕ ( ξ ) ( K ) ≤ τ B ξ ( K ), and Lemma 1 follows. Proofs of Theorems 1 and 2.
It is convenient to use a common notation forTheorems 1 and 2. Let p denote the number of individual-specific vectors α ji , j ∈ { , ..., p } . In the time-invariant case: p = 1, j = 1, and α ji = α i . In thetime-varying case: p = T , j ∈ { , ..., T } , and α ji = α it . Denote ℓ ij = ℓ i in the time-invariant case, and ℓ ij = ℓ it in the time-varying case. Let v ij = ∂ℓ ij ∂α , v αij = ∂ ℓ ij ∂α∂α ′ , v θij = ∂ ℓ ij ∂θ∂α ′ , and v ααij = ∂ ℓ ij ∂α∂α ′ ⊗ ∂α ′ (which is a dim α ji × (dim α ji ) matrix). Let,for all θ ∈ Θ, j ∈ { , ..., p } , and k ∈ { , ..., K } , b α j ( k, θ )=argmax α P Ni =1 { b k i = k } ℓ ij ( α, θ ). Likewise, denote α j ( θ, ξ )=argmax α E ξ i = ξ,λ = λ ( ℓ ij ( α, θ )). We will in-dex expectations by ξ i and λ , although the conditioning on λ is not neededin the time-invariant case of Theorem 1. Finally, let δ = T + K − d in the time-invariant case, and let δ = T + KN + K − d in the time-varying case.To show consistency of b θ , we first establish the next technical lemma (see thesupplemental material for the proof): Lemma A1.
Under the conditions of either Theorem 1 or Theorem 2 we have: N p N X i =1 p X j =1 (cid:13)(cid:13)(cid:13)b α j ( b k i , θ ) − α j ( θ, ξ i ) (cid:13)(cid:13)(cid:13) = O p ( δ ) , ∀ θ ∈ Θ , (A1)sup θ ∈ Θ N p N X i =1 p X j =1 (cid:13)(cid:13)(cid:13)b α j ( b k i , θ ) − α j ( θ, ξ i ) (cid:13)(cid:13)(cid:13) = o p (1) . (A2)From (A2) we then verify using a Taylor expansion that:sup θ ∈ Θ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N p N X i =1 p X j =1 ℓ ij (cid:16)b α j ( b k i , θ ) , θ (cid:17) − N p N X i =1 p X j =1 ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = o p (1) . b θ then follows by standard arguments.Next, the two key steps in the proof consist in showing the following two ex-pansions:1 N p N X i =1 p X j =1 ∂ℓ ij ( b α j ( b k i , θ ) , θ ) ∂θ = 1 N p N X i =1 p X j =1 ∂∂θ (cid:12)(cid:12)(cid:12)(cid:12) θ ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1) + O p ( δ ) , (A3)1 N p N X i =1 p X j =1 ∂ ∂θ∂θ ′ (cid:12)(cid:12)(cid:12)(cid:12) θ (cid:16) ℓ ij (cid:16)b α j ( b k i , θ ) , θ (cid:17) − ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1)(cid:17) = o p (1) . (A4)To show (A3), we show the following technical lemma, where we omit refer-ences to the evaluation points θ and α ji for conciseness: Lemma A2.
Under the conditions of either Theorem 1 or Theorem 2 we have: N p N X i =1 p X j =1 E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16)b α j ( b k i , θ ) − α ji + ( v αij ) − v ij (cid:17) = O p ( δ ) , N p N X i =1 p X j =1 (cid:16) v θij (cid:0) v αij (cid:1) − − E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − (cid:17) v αij (cid:16)b α j ( b k i , θ ) − α ji (cid:17) = O p ( δ ) . Now, expanding v θij ( b α j ( b k i , θ ) , θ ) around α j ( θ , ξ i )= α ji , and using the identity ∂α j ( θ ,ξ i ) ∂θ ′ = (cid:2) E ξ i ,λ (cid:0) − v αij (cid:1)(cid:3) − E ξ i ,λ (cid:0) v θij (cid:1) ′ , we obtain:1 N p N X i =1 p X j =1 ∂ℓ ij ( b α j ( b k i , θ ) , θ ) ∂θ − N p N X i =1 p X j =1 ∂∂θ (cid:12)(cid:12)(cid:12)(cid:12) θ ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1) = 1 N p N X i =1 p X j =1 n v θij (cid:16)b α j ( b k i , θ ) − α ji (cid:17) + E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v ij o + O p ( δ ) , and summing the two parts in Lemma A2 shows that the last expression is O p ( δ ).It follows that (A3) is satisfied.To show (A4), we show the next technical lemma: Lemma A3.
Under the conditions of either Theorem 1 or Theorem 2 we have: N p N X i =1 p X j =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ b α j ( b k i , θ ) ∂θ ′ − ∂α j ( θ , ξ i ) ∂θ ′ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = o p (1) . (A5)24sing (A1) and the identity ∂α j ( θ ,ξ i ) ∂θ ′ = (cid:2) E ξ i ,λ (cid:0) − v αij (cid:1)(cid:3) − E ξ i ,λ (cid:0) v θij (cid:1) ′ , we thushave, under the conditions of either Theorem 1 or 2:1 N p N X i =1 p X j =1 ∂ ∂θ∂θ ′ (cid:12)(cid:12)(cid:12)(cid:12) θ ℓ ij (cid:16)b α j ( b k i , θ ) , θ (cid:17) − N p N X i =1 p X j =1 ∂ ∂θ∂θ ′ (cid:12)(cid:12)(cid:12)(cid:12) θ ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1) = 1 N p N X i =1 p X j =1 v θij ∂ b α j ( b k i , θ ) ∂θ ′ − ∂α j ( θ , ξ i ) ∂θ ′ ! + o p (1) = o p (1) , where we have used Lemma A3 in the last equality.Finally, to show Theorems 1 and 2 we expand the GFE score as:1 N p N X i =1 p X j =1 ∂ℓ ij ( b α j ( b k i , θ ) , θ ) ∂θ + ∂∂θ ′ (cid:12)(cid:12)(cid:12) e θ N p N X i =1 p X j =1 ∂ℓ ij ( b α j ( b k i , θ ) , θ ) ∂θ ! (cid:16)b θ − θ (cid:17) =0 , where e θ lies between θ and b θ , and further expand ∂∂θ ′ (cid:12)(cid:12) e θ Np P Ni =1 P pj =1 ∂ℓ ij ( b α j ( b k i ,θ ) ,θ ) ∂θ around θ using that e θ is consistent. Lastly, we use (A3) and (A4), and note that,if ℓ i ( θ ) = p P pj =1 ℓ ij (cid:0) α j ( θ, ξ i ) , θ (cid:1) denotes the individual target log-likelihood,then s i = ∂ℓ i ( θ ) ∂θ and H = plim N,T →∞ N P Ni =1 E ξ i ,λ ( − ∂ ℓ i ( θ ) ∂θ∂θ ′ ). Proof of Corollary 1.
By the triangle inequality: N P Ni =1 k b h ( b k i ) − ϕ ( ξ i ) k ≤ b Q ( K ) + O p ( T ) = O p ( T ). The proof of Theorem 1 is then unchanged, simplyredefining δ =1 /T (since heterogeneity is time-invariant here). This shows (11). Proof of Corollary 2.
To prove Corollary 2, we follow a likelihood approach(see Arellano and Hahn, 2007). Consider the difference between the GFE and FEprofile log-likelihoods: ∆ L ( θ ) = N P Ni =1 ℓ i ( b α ( b k i , θ ) , θ ) − N P Ni =1 ℓ i ( b α i ( θ ) , θ ). Assumption A1. (regularity) Let b w i = − ∂ ℓ i ( b α i ( θ ) ,θ ) ∂α∂α ′ , and b g i = ∂ ℓ i ( b α i ( θ ) ,θ ) ∂θ∂α ′ b w − i .(i) ℓ it ( α i , θ ) is four times differentiable, and its fourth derivatives satisfy similarproperties to the first three.(ii) γ ( h )= { E h i = h ( b w i ) } − E h i = h ( b w i b α i ( θ )) and λ ( h )= E h i = h ( b g i b w i ) { E h i = h ( b w i ) } − are Lipschitz-continuous in h ; and Var h i = h ( b w i ( b α i ( θ ) − γ ( h i ))) = O ( T ) and Var h i = h (( b g i − λ ( h i )) b w i ) = O ( T ) , uniformly in h . Lemma A4.
Let the conditions of Corollary 2 hold, and let ν i ( θ )= b α i ( θ ) − E h i ( b α i ( θ )) . e have: ∂∂θ (cid:12)(cid:12)(cid:12) θ ∆ L ( θ )= − ∂∂θ (cid:12)(cid:12)(cid:12) θ N N X i =1 ν i ( θ ) ′ E ξ i [ − v αi ( α ( θ, ξ i ) , θ )] ν i ( θ ) + o p (cid:18) T (cid:19) . (A6)Corollary 2 follows, since the bias of the FE score is: ∂∂θ (cid:12)(cid:12) θ (cid:2) N P Ni =1 ℓ i ( b α i ( θ ) , θ ) − N P Ni =1 ℓ i ( α ( θ, ξ i ) , θ ) (cid:3) = ∂∂θ (cid:12)(cid:12) θ N P Ni =1 b ν i ( θ ) ′ E ξ i [ − v αi ( α ( θ, ξ i ) , θ )] b ν i ( θ ) + o p ( T ),where b ν i ( θ ) = b α i ( θ ) − E ξ i ( b α i ( θ )); see, e.g., Arellano and Hahn (2007).26 UPPLEMENTAL MATERIAL“Discretizing Unobserved Heterogeneity”
S1 Proofs of technical lemmas
Lemma A1.
From Assumption 3 ( iii )-( iv ) or 4 ( iii )-( iv ), both ∂α j ( θ,ξ ) ∂θ ′ and ∂α j ( θ,ξ ) ∂ξ ′ are uniformly bounded (in probability in the time-varying case). Let a j ( k, θ ) = α j ( θ, ψ ( b h ( k ))). We thus have, using Lemmas 1 and 2:sup θ ∈ Θ N p X i,j (cid:13)(cid:13)(cid:13) a j ( b k i , θ ) − α j ( θ, ξ i ) (cid:13)(cid:13)(cid:13) =sup θ ∈ Θ N p X i,j (cid:13)(cid:13)(cid:13) α j ( θ, ψ ( b h ( b k i ))) − α j ( θ, ψ ( ϕ ( ξ i ))) (cid:13)(cid:13)(cid:13) = O p N X i k b h ( b k i ) − ϕ ( ξ i ) k ! = O p ( δ ) . (S1)Let θ ∈ Θ. Expanding: P i,j ℓ ij ( a j ( b k i , θ ) , θ ) ≤ P i,j ℓ ij ( b α j ( b k i , θ ) , θ ) to secondorder around α j ( θ, ξ i ), and using:max i,j sup ( α,θ ) k v αij ( α, θ ) k = O p (1) , (S2)we have, for some a ij ( θ ) between b α j ( b k i , θ ) and α j ( θ, ξ i ):12 N p X i,j (cid:16)b α j ( b k i , θ ) − α j ( θ, ξ i ) (cid:17) ′ [ − v αij ( a ij ( θ ) , θ )] (cid:16)b α j ( b k i , θ ) − α j ( θ, ξ i ) (cid:17) ≤ N p X i,j v ij ( α j ( θ, ξ i ) , θ ) ′ (cid:16)b α j ( b k i , θ ) − a j ( b k i , θ ) (cid:17) + O p ( δ )= 1 N p X i,j v j ( b k i , θ ) ′ (cid:16)b α j ( b k i , θ ) − a j ( b k i , θ ) (cid:17) + O p ( δ ) , (S3)where v j ( k, θ ) denotes the mean over i of v ij ( α j ( θ, ξ i ) , θ ) in group b k i = k , and the O p ( δ ) terms are uniform in θ by (S1).Now, by Assumption 3 (ii) or 4 (ii) there exists a constant c > i,j inf ( α,θ ) mineig (cid:2) − v αij ( α, θ ) (cid:3) ≥ c + o p (1) , (S4)where mineig( M ) is the minimum eigenvalue of M . Let A = Np P i,j k b α j ( b k i , θ ) − j ( θ, ξ i ) k . By (S3) and the Cauchy Schwarz inequality, we have: A ≤ O p N p X i,j (cid:13)(cid:13)(cid:13) v j ( b k i , θ ) (cid:13)(cid:13)(cid:13) ! N p X i,j (cid:13)(cid:13)(cid:13)b α j ( b k i , θ ) − a j ( b k i , θ ) (cid:13)(cid:13)(cid:13) ! + O p ( δ ) . By (S1) and the triangle inequality: ( Np P i,j k b α j ( b k i , θ ) − a j ( b k i , θ ) k ) ≤ A + O p ( δ ). Hence: A = O p (cid:20)(cid:16) Np P i,j k v j ( b k i , θ ) k (cid:17) (cid:16) A + O p ( δ ) (cid:17)(cid:21) + O p ( δ ), whichimplies: A = O p N p X i,j k v j ( b k i , θ ) k ! + O p ( δ ) . (S5)We are now going to show that, for all θ ∈ Θ:1
N p X i,j (cid:13)(cid:13)(cid:13) v j ( b k i , θ ) (cid:13)(cid:13)(cid:13) = O p ( δ ) . (S6)Using (S5) and (S6) will then imply (A1). Under the conditions of Theorem 1, it iseasy to see that (S6) holds. We are now going to show (S6) under the conditions ofTheorem 2. Let, for all j, θ, h, ξ, λ : ρ j ( h, ξ, λ, θ ) = E h i = h,ξ i = ξ,λ = λ ( v ij ( α j ( θ, ξ ) , θ )),and, for all i, j, θ : ζ ij ( θ ) = v ij ( α j ( θ, ξ i ) , θ ) − ρ j ( h i , ξ i , λ , θ ). By Assumption 4 (v),and letting h i = ϕ ( ξ i ) + ε i , we can expand ρ j ( h i , ξ i , λ , θ ) twice around ϕ ( ξ i ) as: ρ j ( ϕ ( ξ i ) , ξ i , λ , θ ) + ∂ρ j ( ϕ ( ξ i ) ,ξ i ,λ ,θ ) ∂h ′ ε i + ε ′ i ∂ ρ j ( a jiθ ,ξ i ,λ ,θ ) ∂h∂h ′ ε i , where a jiθ lies between h i and ϕ ( ξ i ). Hence, taking expectations, using that E ξ i ,λ (cid:2) ρ j ( h i , ξ i , λ , θ ) (cid:3) = 0,and using Assumptions 2 and 4 (v), we have:1 N p X i,j k ρ j ( ϕ ( ξ i ) , ξ i , λ , θ ) k = 1 N p X i,j (cid:13)(cid:13)(cid:13)(cid:13) ∂ρ j ( ϕ ( ξ i ) , ξ i , λ , θ ) ∂h ′ E ξ i ,λ [ ε i ] (cid:13)(cid:13)(cid:13)(cid:13) + o p (cid:18) T (cid:19) , which is O p ( T ). Hence: Np P i,j k ρ j ( h i , ξ i , λ , θ ) k = O p ( T ). It thus follows fromthe triangle inequality that:1 N p X i,j k v j ( b k i , θ ) k ≤ O p (cid:18) T (cid:19) + 2 N p X i,j k ζ j ( b k i , θ ) k , (S7)where ζ j ( k, θ ) denotes the mean of ζ ij ( θ ) in group b k i = k . Now, using that28 k , ..., b k N are functions of h , ..., h N , we have: E " N p X i,j k ζ j ( b k i , θ ) k = 1 N p X k,j E " P i P i ′ { b k i = k } { b k i ′ = k } E h ,...,h N ,ξ ,...,ξ N ,λ (cid:0) ζ ij ( θ ) ′ ζ i ′ j ( θ ) (cid:1)P i { b k i = k } . Furthermore, since observations are independent across i given λ : E h ,...,h N ,ξ ,...,ξ N ,λ (cid:0) ζ i ,j ( θ ) ′ ζ i ,j ( θ ) (cid:1) = E h i ,ξ i , ,λ (cid:0) ζ i ,j ( θ ) (cid:1) ′ E h i ,ξ i , ,λ (cid:0) ζ i ,j ( θ ) (cid:1) = 0 for all i = i and j. Hence: E " N p X i,j k ζ j ( b k i , θ ) k = 1 N p X k,j E " P i { b k i = k } E h i ,ξ i ,λ (cid:0) ζ ij ( θ ) ′ ζ ij ( θ ) (cid:1)P i { b k i = k } . Finally, using that E h i ,ξ i ,λ (cid:0) ζ ij ( θ ) (cid:1) = 0, and using part (v) in Assumption 4: E h i = h,ξ i = ξ,λ = λ (cid:0) ζ ij ( θ ) ′ ζ ij ( θ ) (cid:1) = Tr (cid:2) Var h i = h,ξ i = ξ,λ = λ ( v ij ( α j ( θ, ξ i ) , θ )) (cid:3) = O (1) , uniformly in h, ξ, λ . This implies that E h Np P i,j k ζ j ( b k i , θ ) k i = O (cid:0) KN (cid:1) , andshows (S6) and (A1).We are now going to show:sup θ ∈ Θ N p X i,j k v j ( b k i , θ ) k = o p (1) . (S8)Using a bounding argument similar to the one we used to show (A1), (A2) willthen follow. To see that (S8) holds, let Z ( θ ) = Np P i,j k v j ( b k i , θ ) k . By (S6), Z ( θ ) = O p ( δ ) for all θ ∈ Θ. Moreover: ∂Z ( θ ) ∂θ = Np P i,j v θj ( b k i , θ ) v j ( b k i , θ ) = O p (cid:16)p sup θ ∈ Θ Z ( θ ) (cid:17) uniformly in θ , using the Cauchy Schwarz inequality with ei-ther Assumption 3 (ii) or 4 (ii), where v θj ( k, e θ ) is the mean of ∂∂θ (cid:12)(cid:12) θ = e θ v ij ( α j ( θ, ξ i ) , θ ) ′ Note that the dimension of v ij is fixed throughout, independent of the sample size.
in group $\hat k_i = k$. Since $\Theta$ is compact, it follows that $\sup_{\theta \in \Theta} Z(\theta) = o_p(1)$.

Lemma A2.
Let us omit references to θ and α ji throughout, and let: A = 1 N p N X i =1 p X j =1 E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16)b α j ( b k i , θ ) − α ji + ( v αij ) − v ij (cid:17) ,B = 1 N p N X i =1 p X j =1 (cid:16) v θij (cid:0) v αij (cid:1) − − E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − (cid:17) v αij (cid:16)b α j ( b k i , θ ) − α ji (cid:17) . We first bound A . Expanding: P i { b k i = k } v ij ( b α j ( k ))=0 for all k, j , we have,for a ij between α ji and b α j ( b k i ): X i { b k i = k } v ij ( α ji ) + X i { b k i = k } v αij ( α ji )( b α j ( b k i ) − α ji )+ 12 X i { b k i = k } v ααij ( a ij ) (cid:16)b α j ( b k i ) − α ji (cid:17) ⊗ (cid:16)b α j ( b k i ) − α ji (cid:17) = 0 . It follows that b α j ( b k i ) = e α j ( b k i ) + e v j ( b k i ) + e w j ( b k i ), where: e α j ( k ) = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } ( − v αij ) α ji ! , e v j ( k ) = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } v ij ! , e w j ( k ) = 12 X i { b k i = k } ( − v αij ) ! − X i { b k i = k } v ααij ( a ij ) (cid:16)b α j ( b k i ) − α ji (cid:17) ⊗ ! , where a ⊗ = a ⊗ a . Hence, we have: A = 1 N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16) e w j ( b k i )+ e α j ( b k i ) − α ji + e v j ( b k i )+( v αij ) − v ij (cid:17) . Let υ > , ǫ >
0. There is
M > (cid:16) sup θ ∈ Θ (cid:13)(cid:13)(cid:13) ∂Z ( θ ) ∂θ (cid:13)(cid:13)(cid:13) > M p sup θ ∈ Θ Z ( θ ) (cid:17) < ǫ .Take a finite cover of Θ = B ∪ ... ∪ B R , where B r are balls with centers θ r and diametersdiam B r ≤ M √ υ . Since: sup θ ∈ Θ Z ( θ ) ≤ max r Z ( θ r ) + sup θ (cid:13)(cid:13)(cid:13) ∂Z ( θ ) ∂θ (cid:13)(cid:13)(cid:13) M √ υ , and since: a >υ ⇒ a − √ a √ υ > υ , we have: Pr (sup θ ∈ Θ Z ( θ ) > υ ) ≤ ǫ + Pr (cid:0) max r Z ( θ r ) > υ (cid:1) , which, by(S6), is smaller than ǫ for N, T, K large enough.
N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij e w j ( b k i )= O p ( 1 N p X i,j k b α j ( b k i ) − α ji k ) = O p ( δ ) , where we have used (S2), (A1), and either Assumption 3 (ii) or Assumption 4 (ii).Next, let z j ( ξ i ) ′ = E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − . We have:1 N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16)e α j ( b k i ) − α ji (cid:17) = 1 N p X i,j (cid:18) z j ( ξ i ) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v αij (cid:16)e α j ( b k i ) − α ji (cid:17) , (S9)where, for all k, j : e z j ( k ) = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } ( − v αij ) z j ( ξ i ) ! . (S10)Now we have, using that: α j P i (cid:16) α j ( b k i ) − α ji (cid:17) ′ ( − v αij ) (cid:16) α j ( b k i ) − α ji (cid:17) is mini-mized at α j = e α j , and using (S2) and (S4):1 N p X i,j (cid:13)(cid:13)(cid:13)e α j ( b k i ) − α ji (cid:13)(cid:13)(cid:13) = O p N p X i,j (cid:16)e α j ( b k i ) − α ji (cid:17) ′ ( − v αij ) (cid:16)e α j ( b k i ) − α ji (cid:17)! = O p N p X i,j (cid:16)b α j ( b k i ) − α ji (cid:17) ′ ( − v αij ) (cid:16)b α j ( b k i ) − α ji (cid:17)! = O p N p X i,j (cid:13)(cid:13)(cid:13)b α j ( b k i ) − α ji (cid:13)(cid:13)(cid:13) ! , where the last expression is O p ( δ ) by (A1). Likewise, since by Assumption 3 (iv)or 4 (iv) ∂ vec z j ( ξ ) ∂ξ ′ is bounded (in probability) uniformly in j and ξ , we have:1 N p X i,j (cid:13)(cid:13)(cid:13)e z j ( b k i ) − z j ( ξ i ) (cid:13)(cid:13)(cid:13) = O p N p X i,j (cid:16)e z j ( b k i ) − z j ( ξ i ) (cid:17) ′ ( − v αij ) (cid:16)e z j ( b k i ) − z j ( ξ i ) (cid:17)! = O p N p X i,j (cid:16) z j (cid:16) ψ (cid:16)b h ( b k i ) (cid:17)(cid:17) − z j ( ξ i ) (cid:17) ′ ( − v αij ) (cid:16) z j (cid:16) ψ (cid:16)b h ( b k i ) (cid:17)(cid:17) − z j ( ξ i ) (cid:17)! = O p N p X i,j (cid:13)(cid:13)(cid:13)b h ( b k i ) − ϕ ( ξ i ) (cid:13)(cid:13)(cid:13) ! = O p ( δ ) , (S11)31here we have used (S2), (S4), Lemmas 1 and 2, and that ψ is Lipschitz-continuous.Combining results, and using the Cauchy Schwarz inequality in (S9), we obtain:1 N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − v αij (cid:16)e α j ( b k i ) − α ji (cid:17) = O p ( δ ) . The last term in A is: A = 1 N p X i,j E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − ( − v αij ) (cid:16) ( − v αij ) − v ij − e v j ( b k i ) (cid:17) . Since e v j ( k ) = ( P i { b k i = k } ( − v αij )) − ( P i { b k i = k } ( − v αij )( − v αij ) − v ij ), we have: A = 1 N p X i,j (cid:18) z j ( ξ i ) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) ( − v αij )( − v αij ) − v ij = 1 N p X i,j (cid:18) z j ( ξ i ) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v ij = 1 N p X i,j (cid:18) z j ( ξ i ) ′ − z ∗ j (cid:16)b k i (cid:17) ′ (cid:19) v ij + 1 N p X i,j (cid:18) z ∗ j (cid:16)b k i (cid:17) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v ij , (S12)where e z j ( k ) is given by (S10), and: z ∗ j ( k )= X i { b k i = k } E ξ i ,λ (cid:0) − v αij (cid:1)! − X i { b k i = k } E ξ i ,λ (cid:0) − v αij (cid:1) z j ( ξ i ) ! . (S13)Under the conditions of Theorem 1, it is easy to see that A = O p ( δ ). We arenow going to show that A = O p ( δ ) under the conditions of Theorem 2. To seethat the first term on the right-hand-side of (S12) is O p ( δ ), we use an argumentsimilar to the one we used to show (S6). Let ζ ij = v ij − E h i ,ξ i ,λ ( v ij ). 
Followingthe same steps as the ones leading to (S7), we obtain:1 N p X i,j (cid:13)(cid:13) E h i ,ξ i ,λ ( v ij ) (cid:13)(cid:13) = O p (cid:18) T (cid:19) . (S14)Moreover, by an argument similar to (S11), since E ξ i ,λ ( − v αij ) is bounded away32rom zero with probability one, we have:1 N p X i,j (cid:13)(cid:13)(cid:13) z j ( ξ i ) − z ∗ j ( b k i ) (cid:13)(cid:13)(cid:13) = O p ( δ ) . (S15)Let z ′ = ( z ′ , ..., z ′ p ), and z ∗ ( k ) ′ = ( z ∗ ( k ) ′ , ..., z ∗ p ( k ) ′ ). Since ζ ij are independentacross i , with zero mean, conditional on h , ..., h N , ξ , ..., ξ N , λ , we thus have,denoting ζ i = ( ζ ′ i , ..., ζ ′ ip ) ′ : E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N p X i,j (cid:18) z j ( ξ i ) ′ − z ∗ j (cid:16)b k i (cid:17) ′ (cid:19) v ij (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ O (cid:18) T (cid:19) E " N p X i,j (cid:13)(cid:13)(cid:13) z j ( ξ i ) − z ∗ j (cid:16)b k i (cid:17)(cid:13)(cid:13)(cid:13) +2 E (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N p X i,j (cid:18) z j ( ξ i ) ′ − z ∗ j (cid:16)b k i (cid:17) ′ (cid:19) ζ ij (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = O (cid:18) δT (cid:19) + 2 N p X i E (cid:20)(cid:18) z ′ i − z ∗ (cid:16)b k i (cid:17) ′ (cid:19) E h i ,ξ i ,λ [ ζ i ζ ′ i ] (cid:16) z i − z ∗ (cid:16)b k i (cid:17)(cid:17)(cid:21) = O (cid:18) δT (cid:19) + O (cid:18) δpN T (cid:19) = O ( δ ) , where we have used, in turn, the triangle and Cauchy Schwarz inequalities, (S14),(S15), conditional independence of the ζ i across i , part (v) in Assumption 4, and(S15) one more time. Note that, by part (v) in Assumption 4, k E h i ,ξ i ,λ [ ζ i ζ ′ i ] k ≤ Tr E h i ,ξ i ,λ [ ζ i ζ ′ i ] ≤ p max j Tr E h i ,ξ i ,λ (cid:2) ζ ij ζ ′ ij (cid:3) = O p ( p /T ).Turning to the second term in (S12), we have:1 N p X i,j (cid:18) z ∗ j (cid:16)b k i (cid:17) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v ij = 1 N p X i,j (cid:18) z ∗ j (cid:16)b k i (cid:17) ′ − e z j (cid:16)b k i (cid:17) ′ (cid:19) v j (cid:16)b k i (cid:17) , where by (S6) we have: Np P i,j k v j ( b k i ) k = O p ( δ ). Moreover:1 N p X i,j (cid:13)(cid:13)(cid:13) z ∗ j (cid:16)b k i (cid:17) − e z j (cid:16)b k i (cid:17)(cid:13)(cid:13)(cid:13) ≤ N p X i,j (cid:13)(cid:13)(cid:13) z j ( ξ i ) − z ∗ j (cid:16)b k i (cid:17)(cid:13)(cid:13)(cid:13) + 2 N p X i,j (cid:13)(cid:13)(cid:13) z j ( ξ i ) − e z j (cid:16)b k i (cid:17)(cid:13)(cid:13)(cid:13) , O p ( δ ) due to (S11), and the firstterm is O p ( δ ) due to (S15). This shows that A = O p ( δ ), hence that A = O p ( δ ).Let us now turn to B . Letting: π ′ ij = v θij (cid:0) v αij (cid:1) − − E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − ,we have: B = 1 N p X i,j π ′ ij v αij (cid:16) e w j ( b k i ) + e v j ( b k i ) + e α j ( b k i ) − α ji (cid:17) . First, we have: Np P i,j π ′ ij v αij e w j ( b k i ) = O p ( δ ). Next, we have: Np P i,j π ′ ij v αij e v j ( b k i ) = Np P i,j e π j ( b k i ) ′ v αij e v j ( b k i ), where e π j ( k ) is defined similarly to e α j ( k ). To see that thisquantity is O p ( δ ), note that, by the definition of e v j ( k ) and using (S4) and (S6):1 N p X i,j (cid:13)(cid:13)(cid:13)e v j ( b k i ) (cid:13)(cid:13)(cid:13) = O p N p X i,j (cid:13)(cid:13)(cid:13) v j ( b k i ) (cid:13)(cid:13)(cid:13) ! = O p ( δ ) . Moreover, letting τ ij = π ′ ij v αij , we have:1 N p X i,j (cid:13)(cid:13)(cid:13)e π j ( b k i ) (cid:13)(cid:13)(cid:13) = O p N p X i,j (cid:13)(cid:13)(cid:13) τ j ( b k i ) (cid:13)(cid:13)(cid:13) ! . 
Now, the τ ij are independent across i , with zero conditional mean given ξ i , λ : E ξ i ,λ (cid:0) π ′ ij v αij (cid:1) = E ξ i ,λ (cid:16)(cid:16) v θij (cid:0) v αij (cid:1) − − E ξ i ,λ (cid:0) v θij (cid:1) (cid:2) E ξ i ,λ (cid:0) v αij (cid:1)(cid:3) − (cid:17) v αij (cid:17) = 0 . Using an argument similar to the one we used to show (S6), and using Assumption4 (v) in the time-varying case, it thus follows that Np P i,j k e π j ( b k i ) k = O p ( δ ).Hence, by the Cauchy Schwarz inequality: Np P i,j π ′ ij v αij e v j ( b k i ) = O p ( δ ).We lastly bound the third term B in B :1 N p X i,j π ′ ij v αij (cid:16)e α j ( b k i ) − α ji (cid:17) = 1 N p X i,j π ′ ij v αij h(cid:16) α ∗ j ( b k i ) − α ji (cid:17) + (cid:16)e α j ( b k i ) − α ∗ j ( b k i ) (cid:17)i , where e α j ( k ) and α ∗ j ( k ) are given by expressions similar to (S10) and (S13), with α ji in place of z j ( ξ i ) in those formulas. The first term is O p ( δ ) since, similarlyto (S15): Np P i,j k α ∗ j ( b k i ) − α ji k = O p ( δ ), and the τ ij = π ′ ij v αij are conditionally34ndependent across i with zero mean given ξ i and λ (using a similar argumentto the first term in (S12)). The second term is:1 N p X i,j π ′ ij v αij (cid:16)e α j ( b k i ) − α ∗ j ( b k i ) (cid:17) = 1 N p X i,j e π j ( b k i ) ′ v αij (cid:16)e α j ( b k i ) − α ∗ j ( b k i ) (cid:17) . We have already shown that: Np P i,j k e π j ( b k i ) k = O p ( δ ). Moreover, using similararguments to the ones we used to bound Np P i,j k z ∗ j ( b k i ) − e z j ( b k i ) k above, wehave: Np P i,j k e α j ( b k i ) − α ∗ j ( b k i ) k = O p ( δ ). This shows that B = O p ( δ ), hencethat B = O p ( δ ). Lemma A3.
For given k, j , θ -differentiating: P i { b k i = k } v ij ( b α j ( k, θ ) , θ )=0, andusing (S4), we obtain: ∂ b α j ( k, θ ) ∂θ ′ = X i { b k i = k } (cid:16) − v αij (cid:16)b α j ( b k i , θ ) , θ (cid:17)(cid:17)! − X i { b k i = k } v θij (cid:16)b α j ( b k i , θ ) , θ (cid:17) ′ . (S16)Let us define, at θ = θ (and omitting θ and α ji from the notation): ∂ e α j ( k ) ∂θ ′ = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } ( v θij ) ′ ,∂ e α j ∗ ( k ) ∂θ ′ = X i { b k i = k } ( − v αij ) ! − X i { b k i = k } ( − v αij ) (cid:2) E ξ i ,λ ( − v αij ) (cid:3) − E ξ i ,λ ( v θij ) ′ | {z } = ∂αj ( ξi ∂θ ′ . Using (A1) and (S4), we have: Np P i,j k ∂ b α j ( b k i ) ∂θ ′ − ∂ e α j ( b k i ) ∂θ ′ k = o p (1). Moreover: ∂ e α j ( k ) ∂θ ′ − ∂ e α j ∗ ( k ) ∂θ ′ = P i { b k i = k } ( − v αij ) P i { b k i = k } ! − P i { b k i = k } τ ′ ij P i { b k i = k } ! , where τ ′ ij =( v θij ) ′ − ( − v αij ) (cid:2) E ξ i ,λ ( − v αij ) (cid:3) − E ξ i ,λ ( v θij ) ′ are conditionally independentacross i , with zero mean given ξ i and λ . Hence, using (S4), and a similar argu-ment to the one we used to show (S6), we have: Np P i,j k ∂ e α j ( b k i ) ∂θ ′ − ∂ e α j ∗ ( b k i ) ∂θ ′ k = o p (1).Lastly, using (S4) we have, as in (S11): Np P i,j k ∂ e α j ∗ ( b k i ) ∂θ ′ − ∂α j ( ξ i ) ∂θ ′ k = o p (1). Com-bining results shows (A5). 35 emma A4. In the following we again evaluate all functions at θ , and omit θ for the notation. In particular, b α i is a shorthand for b α i ( θ ). We will use thenotation b w i = − v αi ( b α i ). The choice of K = b K with γ = o (1) implies that:1 N X i k h i − b h ( b k i ) k = o p (cid:18) T (cid:19) . (S17)We also have: N P i k b α ( b k i ) − b α i k = O p (cid:0) T (cid:1) . Let, for all k : e α ( k ) = X i { b k i = k } b w i ! − X i { b k i = k } b w i b α i . (S18)Expanding: P i { b k i = k } v i ( b α ( k )) = 0 around b α i , using that v i ( b α i ) = 0, we obtain: b α ( k ) = e α ( k )+ 12 "X i { b k i = k } b w i − X i { b k i = k } v ααi ( a i ( k )) (cid:16)b α ( b k i ) − b α i (cid:17) ⊗ , where a i ( k ) lies between b α i and b α ( k ), and v ααi ( a i ( k )) is a matrix of third derivativeswith (dim α i ) columns.To see that (A6) holds, we rely on the following decomposition: ∂∂θ (cid:12)(cid:12)(cid:12) θ ∆ L ( θ ) = 1 N X i ∂ℓ i ( b α ( b k i )) ∂θ − N X i ∂ℓ i ( b α i ) ∂θ = 1 N X i v θi ( b α i ) (cid:16)b α ( b k i ) − b α i (cid:17) + 12 N X i v θαi ( a i ) (cid:16)b α ( b k i ) − b α i (cid:17) ⊗ = 1 N X i v θi ( b α i ) (cid:16)e α ( b k i ) − b α i (cid:17) + 12 N X i v θαi ( a i ) (cid:16)b α ( b k i ) − b α i (cid:17) ⊗ + 12 N X i v θi ( b α i ) (cid:0) E b k i [ b w i ′ ] (cid:1) − E b k i (cid:20) v ααi ′ (cid:16) a i ′ ( b k i ′ ) (cid:17) (cid:16)b α ( b k i ′ ) − b α i ′ (cid:17) ⊗ (cid:21) = 1 N X i v θi ( b α i ) (cid:16)e α ( b k i ) − b α i (cid:17)| {z } = A + 12 N X i v θαi ( a i ) (cid:16)e α ( b k i ) − b α i (cid:17) ⊗ | {z } = A + o p (cid:18) T (cid:19) + 12 N X i v θi ( b α i ) (cid:0) E b k i [ b w i ′ ] (cid:1) − E b k i (cid:20) v ααi ′ (cid:16) a i ′ ( b k i ′ ) (cid:17) (cid:16)e α ( b k i ′ ) − b α i ′ (cid:17) ⊗ (cid:21)| {z } = A , a i and a i ( b k i ) lie between b α i and b α ( b k i ), and E k denotes a mean in group b k i = k . Let γ ( h ) = { E h i = h ( b w i ) } − E h i = h ( b w i b α i ), and ν i = b α i − γ ( h i ). Let b g i = v θi ( b α i )( b w i ) − , λ ( h ) = E h i = h ( b g i b w i ) { E h i = h ( b w i ) } − , and τ i = b g ′ i − λ ( h i ) ′ . 
Using (S17)we can show, using that γ is Lipschitz-continuous, that N P i k γ ( h i ) − e γ ( b k i ) k = o p (cid:0) T (cid:1) . Moreover, we have: E [ b w i ν i | h , ..., h N ] = E h i [ b w i b α i ] − E h i [ b w i b α i ] =0. Similararguments to the proof of Lemma A1 give: N P i k E b k i [ b w i ′ ν i ′ ] k = O p ( KNT )= o p (cid:0) T (cid:1) .Hence: N P i k e ν ( b k i ) k = N P i k (cid:0) E b k i [ b w i ′ ] (cid:1) − E b k i [ b w i ′ ν i ′ ] k = o p (cid:0) T (cid:1) . Likewise, wehave: N P i k λ ( h i ) − e λ ( b k i ) k = o p (cid:0) T (cid:1) , and: N P i k e τ ( b k i ) k = o p (cid:0) T (cid:1) . Let us now expand the three terms A , A , A in the above decomposition: A = 1 N X i b g i b w i (cid:16)e α ( b k i ) − b α i (cid:17) = 1 N X i (cid:16)b g i − e g ( b k i ) (cid:17) b w i (cid:16)e α ( b k i ) − b α i (cid:17) = − N X i (cid:16) λ ( h i ) − e λ ( b k i ) + τ ′ i − e τ ( b k i ) ′ (cid:17) b w i (cid:16) γ ( h i ) − e γ ( b k i ) + ν i − e ν ( b k i ) (cid:17) = − N X i τ ′ i b w i ν i + o p (cid:18) T (cid:19) = − N X i τ ′ i E ξ i ( − v αi ( α i )) ν i + o p (cid:18) T (cid:19) ,A = 12 N X i E ξ i (cid:0) v θαi ( α i ) (cid:1) (cid:16)e α ( b k i ) − b α i (cid:17) ⊗ + o p (cid:18) T (cid:19) = 12 N X i E ξ i (cid:0) v θαi ( α i ) (cid:1) (cid:16)e γ ( b k i ) − γ ( h i ) + e ν ( b k i ) − ν i (cid:17) ⊗ + o p (cid:18) T (cid:19) = 12 N X i E ξ i (cid:0) v θαi ( α i ) (cid:1) ν ⊗ i + o p (cid:18) T (cid:19) ,A = 12 N X i E ξ i (cid:0) v θi ( α i ) (cid:1) (cid:2) E ξ i ( − v αi ( α i )) (cid:3) − E ξ i [ v ααi ( α i )] ν ⊗ i + o p (cid:18) T (cid:19) . Combining, we get: ∂∂θ (cid:12)(cid:12)(cid:12) θ ∆ L ( θ ) = − N X i τ ′ i E ξ i ( − v αi ( α i )) ν i + o p (cid:18) T (cid:19) + 12 N X i h E ξ i (cid:0) v θαi ( α i ) (cid:1) + E ξ i (cid:0) v θi ( α i ) (cid:1) (cid:2) E ξ i ( − v αi ( α i )) (cid:3) − E ξ i [ v ααi ( α i )] i ν ⊗ i . Here e γ ( k ), e λ ( k ), e ν ( k ), and e τ ( k ) are defined similarly to e α ( k ) in (S18), with γ ( h i ), λ ( h i ), ν i ,and τ i , respectively, replacing b α i in that formula. ∂ b α i ( θ ) ∂θ ′ = b g ′ i , and: ∂∂θ ′ (cid:12)(cid:12)(cid:12) θ vec E ξ i [ − v αi ( α ( θ, ξ i ) , θ )]= − (cid:16) E ξ i (cid:0) v θαi ( α i ) (cid:1) + E ξ i (cid:0) v θi ( α i ) (cid:1) (cid:2) E ξ i ( − v αi ( α i )) (cid:3) − E ξ i [ v ααi ( α i )] (cid:17) ′ . Let ω i = { E h i ( b w i ) } − b w i , and e ν i ( θ ) = b α i ( θ ) − E h i ( ω i b α i ( θ )). Combining the abovewith the expression of the bias of the FE score, we obtain: ∂∂θ (cid:12)(cid:12)(cid:12) θ ∆ L ( θ )= − ∂∂θ (cid:12)(cid:12)(cid:12) θ N X i e ν i ( θ ) ′ E ξ i [ − v αi ( α ( θ, ξ i ) , θ )] e ν i ( θ )+ o p (cid:18) T (cid:19) . (S19)Lastly, let b α i ( θ ) = E h i ( b α i ( θ )) + ν i ( θ ), and ω i = E h i ( ω i ) + η i = 1 + η i . We have: e ν i ( θ ) = ν i ( θ ) − E h i ( η i ν i ( θ )), from which it follows that: N P i k e ν i ( θ ) − ν i ( θ ) k = o p (1 /T ). Likewise: N P i k ∂ e ν i ( θ ) ∂θ ′ − ∂ν i ( θ ) ∂θ ′ k = o p (1 /T ). Hence, (S19) implies(A6). S2 Complements and extensions
S2.1 Average effects
Let $m_i(\alpha_i,\theta) = \frac{1}{T}\sum_{t=1}^T m(X_{it},\alpha_i,\theta)$ in the time-invariant case, and $m_i(\alpha_i,\theta) = \frac{1}{T}\sum_{t=1}^T m(X_{it},\alpha_{it},\theta)$ in the time-varying case. Let $\widehat M = \frac{1}{N}\sum_i m_i\big(\widehat\alpha(\widehat k_i),\widehat\theta\big)$ be the GFE estimator of $M = \frac{1}{N}\sum_i m_i(\alpha_i,\theta_0)$. We use a common notation as in the proofs of Theorems 1 and 2, and denote $m_{ij}(\alpha_{ji},\theta) = m_i(\alpha_i,\theta)$ in the time-invariant case, and $m_{ij}(\alpha_{ji},\theta) = m(X_{it},\alpha_{it},\theta)$ in the time-varying case.

Assumption S1. (average effects)

(i) $m_{ij}(\alpha,\theta)$ is twice differentiable in both its arguments, for all $i,j$.

(ii) $\max_{i,j}\sup_{\alpha,\theta}\|m_{ij}(\alpha,\theta)\| = O_p(1)$, and similarly for the first two derivatives of $m_{ij}$; $\max_j \sup_{\widetilde\xi,\lambda}\big\|\frac{\partial}{\partial\xi'}\big|_{\xi=\widetilde\xi} E_{\xi_i=\xi,\lambda=\lambda}\big(\frac{\partial m_{ij}(\alpha_{ji},\theta_0)}{\partial\alpha}\big)\big\| = O(1)$; and, letting
$$\tau^m_{ij} = \frac{\partial m_{ij}(\alpha_{ji},\theta_0)}{\partial\alpha'} - E_{\xi_i,\lambda}\Big[\frac{\partial m_{ij}(\alpha_{ji},\theta_0)}{\partial\alpha'}\Big]\, E_{\xi_i,\lambda}\big[v^{\alpha}_{ij}(\alpha_{ji},\theta_0)\big]^{-1} v^{\alpha}_{ij}(\alpha_{ji},\theta_0),$$
the function $E_{h_i=h,\xi_i=\xi,\lambda=\lambda}(\mathrm{vec}\,\tau^m_{ij})$ is twice differentiable with respect to $h$, with first and second derivatives that are uniformly bounded in $j$, $\xi$, $\lambda$, and $h$, and $\|\mathrm{Var}_{h_i=h,\xi_i=\xi,\lambda=\lambda}(\mathrm{vec}\,\tau^m_{ij})\| = O(p/T)$, uniformly in $j$, $\xi$, $\lambda$, and $h$.

Take $s_i$ and $H$ as in Theorem 1 or 2, and let $s = \frac{1}{N}\sum_i s_i$. Define:
$$s^m_i = \frac{1}{p}\sum_j \bigg\{ E_{\xi_i,\lambda}\Big(\frac{\partial m_{ij}}{\partial\alpha'}\Big)\Big[E_{\xi_i,\lambda}\Big(-\frac{\partial^2 \ell_{ij}}{\partial\alpha\partial\alpha'}\Big)\Big]^{-1}\frac{\partial \ell_{ij}}{\partial\alpha} + E_{\xi_i,\lambda}\Big(\frac{\partial m_{ij}}{\partial\theta'}\Big) H^{-1}s + E_{\xi_i,\lambda}\Big(\frac{\partial m_{ij}}{\partial\alpha'}\Big)\Big[E_{\xi_i,\lambda}\Big(-\frac{\partial^2 \ell_{ij}}{\partial\alpha\partial\alpha'}\Big)\Big]^{-1} E_{\xi_i,\lambda}\Big(\frac{\partial^2 \ell_{ij}}{\partial\alpha\partial\theta'}\Big) H^{-1}s \bigg\}.$$
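As a concrete illustration of the plug-in estimator $\widehat M$ defined at the start of this subsection, the following minimal sketch (in Python) computes $\frac{1}{N}\sum_i m_i(\widehat\alpha(\widehat k_i),\widehat\theta)$ given fitted group labels and group-specific parameters from the two-step procedure. The function name, the toy probit-type moment, and the placeholder labels and estimates are hypothetical and are only meant to fix ideas; they are not the designs used in the paper.

```python
import numpy as np
from scipy.stats import norm

def average_effect(m, X, k_hat, alpha_hat, theta_hat):
    """Plug-in GFE estimate of M = (1/N) sum_i m_i(alpha_i, theta),
    replacing alpha_i by its group-level estimate alpha_hat[k_hat[i]].

    m          : function m(X_i, alpha, theta) returning the (time-averaged) target moment
    X          : list of individual data blocks X_i
    k_hat      : estimated group labels from the first (kmeans) step
    alpha_hat  : group-specific parameter estimates, one row per group
    theta_hat  : estimated common parameter
    """
    N = len(X)
    return np.mean([m(X[i], alpha_hat[k_hat[i]], theta_hat) for i in range(N)])

# Toy illustration (all inputs are placeholders): average participation
# probability in a probit-type model, m(X_i, a, t) = mean over t of Phi(X_it t + a).
rng = np.random.default_rng(0)
N, T, K = 200, 10, 5
X = [rng.normal(size=(T, 1)) for _ in range(N)]
k_hat = rng.integers(0, K, size=N)          # placeholder group labels
alpha_hat = rng.normal(size=(K, 1))         # placeholder group effects
theta_hat = np.array([1.0])
m = lambda Xi, a, t: norm.cdf(Xi @ t + a).mean()
print(average_effect(m, X, k_hat, alpha_hat, theta_hat))
```

In practice the labels, group effects, and common parameter would come from the kmeans classification and the second-step estimation rather than being drawn at random as in this toy example.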
Corollary S1.

Let the conditions of Theorem 1 or 2 hold, and let Assumption S1 hold. Then, as $N, T, K$ tend to infinity such that $Kp/(NT)$ tends to zero:
$$\widehat M = M + \frac{1}{N}\sum_i s^m_i + O_p\Big(\frac{1}{T}\Big) + O_p\Big(\frac{Kp}{NT}\Big) + O_p\big(K^{-d}\big) + o_p\Big(\frac{1}{\sqrt{NT}}\Big).$$

Proof.
We have, by a Taylor expansion: c M − M = 1 N p X i,j m ij (cid:16)b α j ( b k i , b θ ) , b θ (cid:17) − N p X i,j m ij (cid:0) α ji , θ (cid:1) = 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16)b α j ( b k i , b θ ) − α ji (cid:17) + 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂θ ′ (cid:16)b θ − θ (cid:17) + O p ( δ ) , where δ is defined as in the proofs of Theorems 1 and 2.Using similar arguments to the ones we used to establish Lemma A2, underAssumption S1 we have (recall that α j ( θ , ξ i ) = α ji ):1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16)b α j ( b k i , θ ) − α j ( θ , ξ i ) (cid:17) + 1 N p X i,j E ξ i ,λ " ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ E ξ i ,λ (cid:2) v αij ( α ji , θ ) (cid:3) − v ij ( α ji , θ ) = O p ( δ ) . Moreover, using (A5) and Assumption S1 we obtain:1
N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ n(cid:16)b α j ( b k i , b θ ) − α j ( b θ, ξ i ) (cid:17) − (cid:16)b α j ( b k i , θ ) − α j ( θ , ξ i ) (cid:17)o = o p (cid:16) k b θ − θ k (cid:17) + O p ( δ ) = o p (cid:18) √ N T (cid:19) + O p ( δ ) . c M − M = 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16)b α j ( b k i , b θ ) − b α j ( b k i , θ ) (cid:17) + 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16)b α j ( b k i , θ ) − α ji (cid:17) + 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂θ ′ (cid:16)b θ − θ (cid:17) + O p ( δ )= 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ (cid:16) α j ( b θ, ξ i ) − α j ( θ , ξ i ) (cid:17) + 1 N p X i,j E ξ i ,λ " ∂m ij (cid:0) α ji , θ (cid:1) ∂α ′ E ξ i ,λ (cid:2) − v αij ( α ji , θ ) (cid:3) − v ij ( α ji , θ )+ 1 N p X i,j ∂m ij (cid:0) α ji , θ (cid:1) ∂θ ′ (cid:16)b θ − θ (cid:17) + O p ( δ ) + o p (cid:18) √ N T (cid:19) . The result comes from expanding α j ( b θ, ξ i ) around θ , and then substituting b θ − θ by its influence function. S2.2 Two-way GFE
We have the following lemma, whose proof is analogous to that of Lemma 1.
Lemma S1.
Suppose that there exist random vectors h i = T P t h ( Y it , X it ) and w t = N P i w ( Y it , X it ) , with fixed dimensions, and Lipschitz-continuous functions ϕ and φ , such that h i = ϕ ( ξ i ) + o p (1) , N P i k h i − ϕ ( ξ i ) k = O p (1 /T ) , w t = φ ( λ t ) + o p (1) , and T P t k w t − φ ( λ t ) k = O p (1 /N ) as N, T tend to infinity.Then we have, as
N, T, K tend to infinity: N P i k b h ( b k i ) − ϕ ( ξ i ) k = O p (cid:0) T (cid:1) + O p ( B ξ ( K )) , and, as N, T, L tend to infinity: T P t k b w ( b l t ) − φ ( λ t ) k = O p (cid:0) N (cid:1) + O p ( B λ ( L )) , where B λ ( L ) is defined analogously to B ξ ( K ) . For all θ , ξ , and λ , let α ( θ, ξ, λ ) = argmax α E ξ i = ξ, λ t = λ ( ℓ it ( α, θ )). In addition,let ξ = ( ξ ′ , ..., ξ ′ N ) ′ . Assumption S2. (regularity, two-way)(i) ( Y ′ it , X ′ it ) ′ , i = 1 , .., N , t = 1 , ..., T , are i.i.d. given ξ and λ , ξ i are i.i.d.,and λ t are i.i.d.; ℓ it ( α, θ ) is three times differentiable in ( θ, α ) ; Θ is compact,the spaces for ξ i and λ t are compact, and θ belongs to the interior of Θ . ii) N, T, K, L tend jointly to infinity; sup ξ,λ,α,θ | E ξ i = ξ,λ t = λ ( ℓ it ( α, θ )) | = O (1) ,and similarly for the first three derivatives of ℓ it ; the minimum (resp., max-imum) eigenvalue of ( − ∂ ℓ it ( α,θ ) ∂α∂α ′ ) is bounded away from zero (resp., infinity)with probability one uniformly in i, t, α, θ , and the third derivatives of ℓ it ( α, θ ) are O p (1) , uniformly in i, t, α, θ ; NT P i,t [ ℓ it ( α it , θ ) − E ξ i ,λ t ( ℓ it ( α it , θ ))] = O p (1) , and similarly for the first three derivatives of ℓ it .(iii) inf ξ,λ,θ E ξ i = ξ, λ t = λ ( − ∂ ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α∂α ′ ) > ; E h NT P i,t ℓ it ( α ( θ, ξ i , λ t ) , θ ) i hasa unique maximum at θ on Θ , and its second derivative is − H < .(iv) ∂∂ξ ′ (cid:12)(cid:12) e ξ E ξ i = ξ,λ t = λ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ )= O (1) ; ∂∂λ ′ (cid:12)(cid:12) e λ E ξ i = ξ,λ t = λ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ )= O (1) ; ∂∂ξ ′ (cid:12)(cid:12) e ξ E ξ i = ξ,λ t = λ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ )= O (1) ; ∂∂λ ′ (cid:12)(cid:12) e λ E ξ i = ξ,λ t = λ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ )= O (1) ; ∂∂ξ ′ (cid:12)(cid:12) e ξ E ξ i = ξ,λ t = λ ( ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α )= O (1) ; ∂∂λ ′ (cid:12)(cid:12) e λ E ξ i = ξ,λ t = λ ( ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α )= O (1) ,uniformly in ξ, e ξ, λ, e λ, α, θ .(v) E h i = h,ξ i = ξ,w t = w,λ t = λ ( ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α ) , E h i = h,ξ i = ξ,w t = w,λ t = λ (vec ∂∂θ ′ (cid:12)(cid:12) θ ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α ) are twice differentiable with respect to h and w , with first and second deriva-tives that are uniformly bounded in h ∈ H , w ∈ W , ξ , λ , and θ ∈ Θ , where H and W are the supports of h i and w t ; k Var h i = h,ξ i = ξ,w t = w,λ t = λ ( ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α ) k and k Var h i = h,ξ i = ξ,w t = w,λ t = λ (vec ∂∂θ ′ (cid:12)(cid:12) θ ∂ℓ it ( α ( θ,ξ,λ ) ,θ ) ∂α ) k are O (1) , uniformly in h , w , ξ , λ , θ . Theorem S1.
Let the conditions in Lemma S1 hold. Suppose that $B_\xi(K) = O_p(K^{-d})$ and $B_\lambda(L) = O_p(L^{-d_\lambda})$. Suppose that $\alpha$ and $\mu$ are Lipschitz-continuous in both arguments, and that there exist two Lipschitz-continuous functions $\psi$ and $\Psi$ such that $\xi_i = \psi(\varphi(\xi_i))$ and $\lambda_t = \Psi(\phi(\lambda_t))$. Lastly, let Assumption S2 hold. Then, as $N, T, K, L$ tend to infinity such that $KL/(NT)$ tends to zero, we have:
$$\widehat\theta = \theta_0 + H^{-1}\frac{1}{N}\sum_i s_i + O_p\Big(\frac{1}{T} + \frac{1}{N} + \frac{KL}{NT}\Big) + O_p\big(K^{-d} + L^{-d_\lambda}\big) + o_p\Big(\frac{1}{\sqrt{NT}}\Big).$$

Proof.
The proof closely follows the steps of that of Theorem 2. Here we simplyhighlight the main differences. Let δ = T + N + KLNT + K − d + L − dλ . To show41onsistency, a key step is to show, for all θ ∈ Θ:1
N T X i,t (cid:13)(cid:13)(cid:13) v ( b k i , b l t , θ ) (cid:13)(cid:13)(cid:13) = O p ( δ ) , (S20)where v ( k, l, θ ) denotes the mean of v it ( α ( θ, ξ i , λ t ) , θ ) in the intersection of groups b k i = k and b l t = l . Let: ρ ( h, ξ, w, λ, θ ) = E h i = h,ξ i = ξ,w t = w,λ t = λ ( v it ( α ( θ, ξ, λ ) , θ )), andlet, for all i, t, θ : ζ it ( θ ) = v it ( α ( θ, ξ i , λ t ) , θ ) − ρ ( h i , ξ i , w t , λ t , θ ). Proceeding asin the proof of Lemma A1 we have:1 N T X i,t k ρ ( h i , ξ i , w t , λ t , θ ) k = O p (cid:18) T (cid:19) + O p (cid:18) N (cid:19) . We thus only need to bound: E " N T X i,t k ζ ( b k i , b l t , θ ) k = 1 N T X k,ℓ E kℓ (cid:2) E h i ,ξ i ,w t ,λ t ( ζ it ( θ ) ′ ζ it ( θ )) (cid:3) , where we have used that observations are independent across i and t given ξ and λ , and E kℓ denotes a mean in groups b k i = k and b l t = l . To bound this quantity,we use part ( v ) in Assumption S2. We thus obtain (S20).Similarly to the proof of Lemma A2, we then show:1 N T X i,t n v θit (cid:16)b α ( b k i , b l t ) − α it (cid:17) + E ξ i ,λ t (cid:0) v θit (cid:1) (cid:2) E ξ i ,λ t ( v αit ) (cid:3) − v it o = O p ( δ ) , (S21)where we omit references to θ and α it . The first key term is: A = 1 N T X i,t E ξ i ,λ t (cid:0) v θit (cid:1) (cid:2) E ξ i ,λ t ( v αit ) (cid:3) − ( − v αit ) (cid:16) ( − v αit ) − v it − e v ( b k i , b l t ) (cid:17) , where e v is defined analogously to the proof of Lemma A2. To show that A = O p ( δ ), we use that the ζ it ( θ ) are independent across i and t , with zero meanconditional on h , ..., h N , w , ..., w T , ξ , and λ .42et π ′ it = v θit ( v αit ) − − E ξ i ,λ t (cid:0) v θit (cid:1) (cid:2) E ξ i ,λ t ( v αit ) (cid:3) − . The second key term is: B = 1 N T X i,t π ′ it v αit (cid:16)e α ( b k i , b l t ) − α it (cid:17) = 1 N T X i,t π ′ it v αit (cid:16) α ∗ ( b k i , b l t ) − α it (cid:17) + 1 N T X i,t π ′ it v αit (cid:16)e α ( b k i , b l t ) − α ∗ ( b k i , b l t ) (cid:17) , where e α ( k, l ) and α ∗ ( k, l ) are defined analogously to the proof of Lemma A2. Toshow that B = O p ( δ ), we use that τ it = π ′ it v αit are independent across i and t withzero mean given ξ , λ .The final step, as in the proof of Lemma A3, is to show that:1 N T X i,t (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∂ b α ( b k i , b l t , θ ) ∂θ ′ − ∂α ( θ , ξ i , λ t ) ∂θ ′ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = o p (1) . (S22)The proof of (S22) follows similar arguments to the proof of Lemma A3. S2.3 GFE based on conditional moments
Assumption S3. (heterogeneity, conditional case)There exist vectors ξ i of fixed dimension d , and ν i of dimension d ν , and functions α and µ Lipschitz-continuous in ξ , such that α i = α ( ξ i ) and µ i = µ ( ξ i , ν i ) . Differently from Assumption 1, here µ i depends on an additional heterogeneitycomponent ν i , and by Assumption 2 the moment h i is only injective for ξ i . Assumption S4. (regularity, conditional case)(i) ( Y ′ i , X ′ i , ξ ′ i , ν ′ i , h ′ i ) ′ are i.i.d.; ( Y ′ it , X ′ it ) ′ are stationary for all i ; ℓ it ( α, θ ) isthree times differentiable in both its arguments for all i, t ; and Θ is compact,the space for α i is compact, and θ belongs to the interior of Θ .(ii) N, T, K tend jointly to infinity; sup ξ,ν,α,θ | E ξ i = ξ,ν i = ν ( ℓ it ( α, θ )) | = O (1) , andsimilarly for the first three derivatives of ℓ it ; inf ξ,ν,α,θ E ξ i = ξ,ν i = ν ( − ∂ ℓ it ( α,θ ) ∂α∂α ′ ) is positive definite; and max i sup α,θ (cid:12)(cid:12) ℓ i ( α, θ ) − E ξ i ,ν i ( ℓ i ( α, θ )) (cid:12)(cid:12) = o p (1) ,and similarly for the first three derivatives of ℓ i . iii) inf ξ,ν,θ E ξ i = ξ,ν i = ν ( − ∂ ℓ it ( α ( θ,ξ ) ,θ ) ∂α∂α ′ ) > ; E [ T P Tt =1 ℓ it ( α ( θ, ξ i ) , θ )] has a uniquemaximum at θ on Θ , and its matrix of second derivatives is − H cond < ;and sup θ NT P i,t k ∂ ℓ it ( α ( θ,ξ i ) ,θ ) ∂θ∂α ′ k = O p (1) .(iv) sup e ξ,α k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ (vec ∂ ℓ it ( α,θ ) ∂θ∂α ′ ) k ; sup e ξ,α k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ (vec ∂ ℓ it ( α,θ ) ∂α∂α ′ ) k ; and sup e ξ,θ k ∂∂ξ ′ (cid:12)(cid:12) ξ = e ξ E ξ i = ξ ( ∂ℓ it ( α ( θ, e ξ ) ,θ ) ∂α ) k are O (1) .(v) E h i = h,ξ i = ξ ( ∂ℓ it ( α ( θ,ξ ) ,θ ) ∂α ) is twice differentiable with respect to h and ξ , withfirst and second derivatives that are uniformly bounded in ξ , h ∈ H , and θ ∈ Θ ; and k Var h i = h,ξ i = ξ ( ∂ℓ i ( α ( θ,ξ ) ,θ ) ∂α ) k = O (1) , uniformly in ξ , h and θ . Corollary S2.
Let the conditions of Lemmas 1 and 2 hold. Let Assumptions 2, S3, and S4 hold. Let $K$ be given by (6), with $\gamma = O(1)$. Then, as $N, T, K$ tend to infinity such that $T^{d} = O(N)$ we have:
$$\widehat\theta = \theta_0 + O_p\Big(\frac{1}{T}\Big) + O_p\Big(\frac{1}{\sqrt{NT}}\Big). \quad (S23)$$

Proof.
Let δ = T + KN + K − d . To show consistency, the key step is to show:1 N X i (cid:13)(cid:13)(cid:13) v ( b k i , θ ) (cid:13)(cid:13)(cid:13) = O p ( δ ) , ∀ θ ∈ Θ . (S24)Let, for all θ, h, ξ : ρ ( h, ξ, θ ) = E h i = h,ξ i = ξ ( v i ( α ( θ, ξ ) , θ )), and let, for all i, θ : ζ i ( θ ) = v i ( α ( θ, ξ i ) , θ ) − ρ ( h i , ξ i , θ ). One can show, using similar techniques to the proofof Lemma A1, that: N P i k ζ ( b k i , θ ) k = O p ( KN ), and that this implies (S24). We then show: N P i ∂ℓ i ( b α ( b k i ,θ ) ,θ ) ∂θ = O p ( δ ), which will follow from:1 N X i v θi (cid:16)b α ( b k i ) − α i (cid:17) = O p ( δ ) , (S25)where from now on we omit references to θ and α i . We have:1 N X i v θi (cid:16)b α ( b k i ) − α i (cid:17) = 1 N X i v θi (cid:16)e α ( b k i ) − α i + e v ( b k i ) (cid:17) + O p ( δ ) , Note that if K = b K is given by (6) with γ = O (1), then K = O ( T d ) and δ = O ( T + T d N ), so if T d = O ( N ) then δ = O ( T ). Note that, in the case of Theorem 1 (i.e., in the absence of additional heterogeneity ν i ),the left-hand side in (S24) is O p ( T ). e α ( k ) and e v ( k ) are as in the proof of Lemma A2; that is, denoting w i = ( − v αi ),we have e α ( k ) = w ( k ) − wα ( k ) and e v ( k ) = w ( k ) − v ( k ).Let γ v ( h i ) = E h i ( v i ), ζ vi = v i − γ v ( h i ), γ w ( h i ) = E h i ( w i ), ζ wi = w i − γ w ( h i ), γ v θ ( h i ) = E h i ( v θi ), and ζ v θ i = v θi − γ v θ ( h i ). First, we have:1 N X i v θi e v ( b k i ) = 1 N X i v θi w ( b k i ) − v ( b k i ) = 1 N X i v θ ( b k i ) w ( b k i ) − v i = 1 N X i ( γ v θ ( b k i ) + ζ v θ ( b k i ))( γ w ( b k i ) + ζ w ( b k i )) − v i = 1 N X i γ v θ ( b k i ) γ w ( b k i ) − v i + O p ( δ ) , where for example γ w ( k ) is the mean of γ w ( h i ) in group b k i = k , and we have usedthat N P i k ζ v θ ( b k i ) k = O p ( K/N ), N P i k ζ w ( b k i ) k = O p ( K/N ), and N P i k v i k = O p (1 /T ). Moreover:1 N X i γ v θ ( b k i ) γ w ( b k i ) − v i = 1 N X i γ v θ ( b k i ) γ w ( b k i ) − γ v ( h i ) + O p ( δ ) , where we have used that N P i k ζ v ( b k i ) k = O p ( K/ ( N T )). Lastly, we have:1 N X i γ v θ ( b k i ) γ w ( b k i ) − γ v ( h i ) = 1 N X i γ v θ ( h i ) γ w ( h i ) − γ v ( h i )+ 1 N X i h γ v θ ( b k i ) γ w ( b k i ) − − γ v θ ( h i ) γ w ( h i ) − i γ v ( h i ) , where the first term is O p ( δ ) since it is a mean of i.i.d. terms with mean O (1 /T ) andvariance O (1 /T ), and the second term is O p ( δ ) since N P i k h i − h ( b k i ) k = O p ( δ )and the γ functions are Lipschitz-continuous.Second, let v θi w − i = η ( h i , ξ i ) + e i , where E h i = h,ξ i = ξ ( e i w i ) = 0. 
We have:1 N X i v θi (cid:16)e α ( b k i ) − α i (cid:17) = 1 N X i η ( h i , ξ i ) w i (cid:16)e α ( b k i ) − α i (cid:17) + 1 N X i e i w i (cid:16)e α ( b k i ) − α i (cid:17) , where the first term is O p ( δ ) since N P i k h i − h ( b k i ) k = O p ( δ ), N P i k ξ i − ξ ( b k i ) k = O p ( δ ), N P i k e α ( b k i ) − α i k = O p ( δ ), η is Lipschitz-continuous, and45 i is uniformly bounded (as in the proof of Lemma A2), and the second term is:1 N X i e i w i (cid:16)e α ( b k i ) − α i (cid:17) = 1 N X i e i w i (cid:16)e α ( b k i ) − α ( b k i ) (cid:17) + 1 N X i e i w i (cid:16) α ( b k i ) − α i (cid:17) = 1 N X i ew ( b k i ) (cid:16)e α ( b k i ) − α ( b k i ) (cid:17) + O p ( δ ) = O p ( δ ) , where we have used that the ( e i w i )’s have zero mean given h , ..., h N , ξ , ..., ξ N with bounded conditional variance, and N P i k ew ( b k i ) k = O p ( K/N ) = O p ( δ ).Finally, to show: N P i ∂ ∂θ∂θ ′ (cid:12)(cid:12) θ ( ℓ i ( b α ( b k i , θ ) , θ ) − ℓ i ( α ( θ, ξ i ) , θ )) = o p (1), we usesimilar arguments to the proof of Lemma A3. Example: a linear homoskedastic model.
Consider the model $Y_{it} = X_{it}\theta + \alpha_i + U_{it}$, where $X_{it}$ are scalar and $U_{it}$ are i.i.d. with mean zero and variance $\sigma^2$ given $X_{i1}, \dots, X_{iT}, \alpha_i$. Let $\widehat\theta$ be the GFE estimator based on a moment $h_i = \varphi(\alpha_i) + \varepsilon_i$ that satisfies Assumptions 1 and 2 for $\xi_i = \alpha_i$; that is, $h_i$ is only informative about $\alpha_i$, but not about the heterogeneity in $X_{it}$. Let $\zeta^X_i = \bar X_i - E_{h_i}(\bar X_i)$, $\zeta^\alpha_i = \alpha_i - E_{h_i}(\alpha_i)$, and $\zeta^U_i = \bar U_i - E_{h_i}(\bar U_i)$. We assume that $K$ is large enough for the approximation error to be of smaller order, and that $K/N$ tends to zero, as in Corollary 2. Under appropriate conditions in the regression model, using similar arguments to the proof of Corollary S2 (though with no need for any restriction on the relative rates of $N$ and $T$), one can show that $\widehat\theta$ admits the following expansion:
$$\widehat\theta = \theta_0 + \frac{\frac{1}{N}\sum_i \zeta^X_i(\zeta^\alpha_i + \zeta^U_i) + \frac{1}{NT}\sum_{i,t}(X_{it} - \bar X_i)(U_{it} - \bar U_i)}{E\big[(X_{it} - \bar X_i)^2\big] + \mathrm{Var}(\zeta^X_i)} + o_p\Big(\frac{1}{T}\Big) + o_p\Big(\frac{1}{\sqrt{NT}}\Big). \quad (S26)$$
Notice two differences between (S26) and the expansion of the FE estimator: the presence of $\mathrm{Var}(\zeta^X_i)$ in the denominator, and the presence of $\frac{1}{N}\sum_i \zeta^X_i(\zeta^\alpha_i + \zeta^U_i)$ in the numerator. In addition, notice that (S26) simplifies to the expression in Corollary 2 in the absence of additional heterogeneity $\nu_i$.

Although the arguments are as in the proof of Lemma A3, the target log-likelihood is different since here $\alpha(\theta,\xi_i)$ only depends on $\xi_i$, not on $(\xi_i',\nu_i')'$. In particular, the matrix $H^{\mathrm{cond}}$ in Assumption S4 differs from the matrix $H$ in Assumption 3; see (S26) for an example.
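To make the two-step construction in this linear example concrete, here is a small simulation sketch: individuals are classified by kmeans on the scalar moment $h_i$, and $\theta$ is then estimated by pooled least squares with group-specific intercepts, with the within (FE) estimator shown for comparison. The numerical constants, the correlation between $\alpha_i$ and the covariate mean, and the particular choice of $h_i$ as a noisy signal of $\alpha_i$ are illustrative assumptions for this sketch, not the exact design used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
N, T, K, theta0 = 500, 20, 30, 1.0

# Illustrative DGP in the spirit of the linear example: alpha_i and the
# covariate mean mu_i are correlated, and h_i is a noisy signal of alpha_i only.
alpha = rng.normal(size=N)
mu = 0.5 * alpha + rng.normal(size=N)          # heterogeneity in X not captured by h_i
X = mu[:, None] + rng.normal(size=(N, T))
Y = X * theta0 + alpha[:, None] + rng.normal(size=(N, T))
h = alpha + rng.normal(size=N) / np.sqrt(T)    # individual-specific classing moment

# Step 1: kmeans classification of the moments h_i into K groups.
k_hat = KMeans(n_clusters=K, n_init=10, random_state=0).fit(h[:, None]).labels_

# Step 2: pooled OLS of Y on X and group indicators (group-specific intercepts).
D = np.zeros((N * T, K))
D[np.arange(N * T), np.repeat(k_hat, T)] = 1.0
Z = np.column_stack([X.reshape(-1), D])
beta = np.linalg.lstsq(Z, Y.reshape(-1), rcond=None)[0]
print("GFE estimate of theta:", beta[0])

# Within (fixed-effects) estimator for comparison.
Xw, Yw = X - X.mean(1, keepdims=True), Y - Y.mean(1, keepdims=True)
print("FE estimate of theta:", (Xw * Yw).sum() / (Xw ** 2).sum())
```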
Model of wages and participation (see (2)). We model the initial condition as: $Y_{i1} = \mathbf{1}\{u(\alpha_i) \geq c(1;\theta) + U_{i1}\}$, with $U_{i1}$ standard normal, independent of $\alpha_i$. We set $c(0;\theta) = 0$ and $c(1;\theta) = -1$.
We set $\alpha_i$ and $V_{it}$ to be independent standard normals. In the simulations based on models (2) and (3) we weight the moments by the share of between-$i$ variance to total variance. To compute the variance $\widehat V_h$ used to set the number of groups in this dynamic model, we use a Newey-West expression with one lag. Lastly, for kmeans computation we use Lloyd's algorithm with 100 random starting values. Table S1 shows additional simulation results for this model.
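The implementation choices described in this paragraph can be sketched as follows: one possible weighting of the moments by their between-individual variance share, a lag-1 Newey-West estimate of the moment variance, and kmeans run from many random starts of Lloyd's algorithm. The exact formulas used in the simulations may differ from this sketch; the helper names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def weight_moments(h_it):
    """h_it: (N, T, L) array of moment contributions. Returns the standardized
    individual moments h_i, each component multiplied by the share of
    between-individual variance in total variance (one possible implementation
    of the weighting described above)."""
    N, T, L = h_it.shape
    h_bar = h_it.mean(axis=1)                        # (N, L) individual moments
    between = h_bar.var(axis=0)
    total = h_it.reshape(N * T, L).var(axis=0)
    z = (h_bar - h_bar.mean(0)) / h_bar.std(0)       # zero mean, unit variance
    return z * (between / total)

def newey_west_lag1(h_it):
    """One standard lag-1 Newey-West (Bartlett kernel) variance estimate for the
    moment contributions, averaged across individuals: gamma0 + gamma1, since
    the Bartlett weight at lag 1 with bandwidth 1 is 1/2 and covers both signs."""
    d = h_it - h_it.mean(axis=1, keepdims=True)
    gamma0 = (d ** 2).mean(axis=(0, 1))
    gamma1 = (d[:, 1:] * d[:, :-1]).mean(axis=(0, 1))
    return gamma0 + gamma1

# Synthetic moment contributions, purely for illustration.
rng = np.random.default_rng(2)
h_it = rng.normal(size=(500, 10, 3)) + rng.normal(size=(500, 1, 3))
h_weighted = weight_moments(h_it)
print("V_h estimate:", newey_west_lag1(h_it))

# kmeans classification with Lloyd's algorithm and 100 random starting values.
k_hat = KMeans(n_clusters=20, n_init=100, init="random", random_state=0).fit(h_weighted).labels_
print("group sizes:", np.bincount(k_hat))
```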
Probit model with time-varying heterogeneity (see (3)). The $U_{it}$'s are standard normal, independent of the $X_{it}$'s and the $\alpha_{it}$'s. The data generating process (DGP) for the scalar covariate is: $X_{it} = \mu_{it} + V_{it}$, where the $V_{it}$ are i.i.d. standard normal, independent of the $U_{it}$'s, $\alpha_{it}$'s, and $\mu_{it}$'s, and $\mu_{it} = \alpha_{it}$. We set $\theta_0 = 1$, and set $\xi_i$ and $\lambda_t$ to be i.i.d. Gamma(1,1) draws, independent of each other. Table S2 shows additional simulation results for this model, including for the two-way GFE estimator based on both the cross-sectional moments $\big(\frac{1}{N}\sum_i Y_{it}, \frac{1}{N}\sum_i X_{it}\big)'$ and the individual-specific moments $(\bar Y_i, \bar X_i)'$.
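A sketch of the two-way GFE computation for a design of this type is given below: individuals are classified on $(\bar Y_i, \bar X_i)$, time periods are classified on the cross-sectional means, and a probit with an intercept for each (individual group, time group) pair is then estimated. The functional form chosen for $\alpha_{it}$ and the scaling constants are assumptions made only for this sketch; the paper's DGP for model (3) involves a substitution parameter $\sigma$ that is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
N, T, K, L, theta0 = 500, 20, 5, 5, 1.0

# Illustrative DGP: xi_i and lambda_t are Gamma(1,1); the particular smooth
# function of (xi_i, lambda_t) used for alpha_it is an assumption for this sketch.
xi, lam = rng.gamma(1, 1, N), rng.gamma(1, 1, T)
alpha = 0.3 * (np.log1p(np.outer(xi, lam)) - 1.0)
X = alpha + rng.normal(size=(N, T))
Y = (X * theta0 + alpha + rng.normal(size=(N, T)) > 0).astype(float)

# First step: classify individuals on (Ybar_i, Xbar_i) and time periods on the
# cross-sectional moments ((1/N) sum_i Y_it, (1/N) sum_i X_it).
h = np.column_stack([Y.mean(1), X.mean(1)])
w = np.column_stack([Y.mean(0), X.mean(0)])
k_hat = KMeans(n_clusters=K, n_init=50, random_state=0).fit(h).labels_
l_hat = KMeans(n_clusters=L, n_init=50, random_state=0).fit(w).labels_

# Second step: probit of Y_it on X_it with an intercept for each (k, l) pair.
pair = np.repeat(k_hat, T) * L + np.tile(l_hat, N)
D = np.zeros((N * T, K * L))
D[np.arange(N * T), pair] = 1.0
res = sm.Probit(Y.reshape(-1), np.column_stack([X.reshape(-1), D])).fit(disp=0)
print("two-way GFE estimate of theta:", res.params[0])
```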
Conditional moments: an example. Consider the following probit model: $Y_{it} = \mathbf{1}\{X_{it}'\theta + \alpha_i + U_{it} \geq 0\}$, where the $U_{it}$ are i.i.d. standard normal independent of the $X_{it}$'s and $\alpha_i$, and $\theta_0$ is a vector of ones. The DGP for the $k$-th covariate is: $X_{itk} = \mathbf{1}\{\mu_{ik} + V_{itk} > 0\}$, where the $V_{itk}$ are i.i.d. standard normal independent of the $U_{it}$'s, $\alpha_i$, and the $\mu_{ik}$'s, and $\alpha_i$ and the $\mu_{ik}$'s follow independent standard normals. We vary the number of covariates between 1 and 3, so the total dimension of heterogeneity varies between 2 and 4. In this model, we expect the bias of FE to be moderate given the time horizon we consider ($T = 20$), since $\alpha_i$ is scalar and we use $h_i = (\bar Y_i, \bar X_i')'$ as moments.

Specifically, we demean and rescale $h_i$ so that all its components $h_{i\ell}$ have zero mean and unit variance, and multiply each component $h_{i\ell}$ by: $\max\Big( \frac{\sum_i h_{i\ell}^2 - \frac{1}{T}\sum_{i,t}(h_{it\ell} - h_{i\ell})^2}{\sum_i h_{i\ell}^2},\, 0 \Big)$. Using equal weights instead has small effects in these simulations; however, we observed that this particular weighting can improve performance when some moments are substantially less informative about the heterogeneity than others.

In Table S3 we show the biases, standard deviations, and root mean squared errors of FE and GFE among 1000 simulations, for $N = 1000$ and $T = 20$. In the top panel we report GFE estimates as a function of the number of groups $K$. We see that, while the bias of GFE remains moderate with one covariate, the bias increases substantially with the dimension of heterogeneity, in agreement with our theory. By comparison, the bias of FE in the bottom panel is indeed quite small, and it only increases moderately with the number of covariates.

The situation is rather different when using conditional moments in GFE. In the middle panel in Table S3 we show simulation results for GFE based on covariate-specific conditional means $\bar Y_i(x) = \sum_{t=1}^T \mathbf{1}\{X_{it} = x\} Y_{it} \big/ \sum_{t=1}^T \mathbf{1}\{X_{it} = x\}$. Importantly, in large samples these moments are only informative about $\alpha_i$, not $\mu_i$. We see that the bias of GFE with conditional moments increases only moderately with the number of covariates, and that FE and GFE with conditional moments have comparable — and quite small — biases.

Regarding implementation, note that, for a given $i$, all moments $\bar Y_i(x)$ may not be available since $i$'s covariates may never take the value $x$ in the sample. In Table S3, whenever $\bar Y_i(x)$ is not available, we set the moment to an imputed value, the overall conditional mean $\bar Y(x) = \sum_{i,t} \mathbf{1}\{X_{it} = x\} Y_{it} \big/ \sum_{i,t} \mathbf{1}\{X_{it} = x\}$. The imputation does not affect the theory, provided the event that any of the $\bar Y_i(x)$'s is not available has probability approaching zero in large samples. Moreover, we have obtained similar results using an alternative conditional first-step implementation that does not rely on imputations. To provide intuition in a simple case, suppose that $X_{it}$ are binary, i.i.d. over time given $\mu_i$, with $\Pr(X_{it} = 1 \mid \mu_i = \mu) \in (\epsilon, 1 - \epsilon)$ for all $\mu$, for some $\epsilon >$
0. Then Pr( ∃ i : X i = ... = X iT =0) ≤ N (1 − ǫ ) T , which tends to zero whenever (ln N ) /T → This implementation is as follows. Let I i ( x ) be the indicator that there exists a t such that X it = x , and let x , ..., x M denote the points of support of X it . In the first step, we use aLloyd’s-like algorithm to minimize the function P Ni =1 P Mm =1 I i ( x m ) (cid:0) Y i ( x m ) − g ( x m , k i ) (cid:1) , with T Bias std RMSE se/std Bias std RMSE se/stdGFE, η = 1 FE, η = 15 -0.570 0.058 0.573 1.082 -0.835 0.064 0.837 1.06610 -0.207 0.040 0.211 1.003 -0.418 0.040 0.420 1.04120 -0.088 0.027 0.092 0.993 -0.209 0.026 0.211 1.06430 -0.055 0.023 0.060 0.960 -0.140 0.023 0.142 0.99140 -0.040 0.019 0.044 1.000 -0.105 0.019 0.106 1.03450 -0.031 0.017 0.036 0.982 -0.084 0.017 0.086 1.022GFE, η = 2 FE, η = 25 -0.519 0.063 0.523 1.052 -0.876 0.068 0.879 1.06310 -0.163 0.043 0.169 0.985 -0.442 0.041 0.444 1.07020 -0.049 0.031 0.058 0.929 -0.225 0.028 0.227 1.04230 -0.032 0.024 0.040 0.964 -0.153 0.022 0.154 1.06840 -0.019 0.020 0.028 0.981 -0.113 0.019 0.115 1.04550 -0.015 0.019 0.024 0.944 -0.091 0.018 0.093 1.000 Notes: simulations, N = 1000 . “RMSE” is root mean squared error, “se” is the averageof standard error estimates across simulations, “std” is the standard deviation of the estimatoracross simulations. η is the risk aversion parameter. respect to k , ..., k N and g ( x , g ( x M , K ). a b l e S : P r o b i t m o d e l ( ) w i t h t i m e - v a r y i n g h e t e r og e n e i t y T Bias std RMSE se/std Bias std RMSE se/std Bias std RMSE se/std Bias std RMSE se/std2-way GFE, σ = −
10 GFE, σ = −
10 FE, σ = −
10 IFE, σ = −
105 -0.045 0.035 0.057 0.927 -0.044 0.035 0.056 0.926 0.442 0.071 0.448 0.706 0.116 0.064 0.133 0.47310 -0.016 0.024 0.028 0.939 -0.014 0.024 0.028 0.939 0.198 0.036 0.201 0.762 0.100 0.036 0.107 0.48820 -0.003 0.016 0.016 1.014 -0.000 0.016 0.016 1.014 0.098 0.019 0.100 0.911 0.087 0.020 0.089 0.59630 -0.000 0.013 0.013 1.009 0.003 0.013 0.013 1.013 0.069 0.014 0.070 0.966 0.059 0.014 0.061 0.67540 0.001 0.011 0.011 1.021 0.005 0.011 0.012 1.016 0.055 0.012 0.057 0.949 0.044 0.012 0.046 0.67650 0.001 0.010 0.010 0.995 0.006 0.010 0.012 0.994 0.048 0.011 0.049 0.947 0.036 0.011 0.037 0.6772-way GFE, σ =0 GFE, σ =0 FE, σ =0 IFE, σ =05 -0.044 0.042 0.060 0.883 -0.043 0.042 0.060 0.882 0.488 0.091 0.497 0.654 0.152 0.089 0.176 0.39810 -0.022 0.026 0.035 0.951 -0.021 0.026 0.034 0.949 0.226 0.045 0.231 0.710 0.118 0.040 0.125 0.49420 -0.009 0.018 0.020 0.969 -0.006 0.018 0.019 0.964 0.108 0.023 0.110 0.843 0.099 0.023 0.101 0.57130 -0.004 0.014 0.015 1.001 0.001 0.014 0.014 1.000 0.072 0.017 0.074 0.918 0.068 0.017 0.070 0.61540 -0.002 0.013 0.013 0.989 0.004 0.013 0.013 0.985 0.056 0.014 0.058 0.922 0.051 0.014 0.052 0.63950 -0.001 0.011 0.012 0.965 0.005 0.012 0.013 0.961 0.046 0.012 0.047 0.926 0.040 0.012 0.042 0.6432-way GFE, σ =1 GFE, σ =1 FE, σ =1 IFE, σ =15 -0.049 0.056 0.074 0.754 -0.048 0.056 0.074 0.754 0.565 0.146 0.583 0.506 0.207 0.117 0.238 0.35910 -0.032 0.029 0.043 0.981 -0.030 0.029 0.042 0.979 0.251 0.062 0.258 0.603 0.141 0.043 0.147 0.51320 -0.014 0.020 0.024 0.986 -0.010 0.020 0.022 0.983 0.114 0.027 0.117 0.825 0.125 0.029 0.128 0.51430 -0.007 0.016 0.017 0.996 -0.001 0.016 0.016 0.992 0.074 0.019 0.077 0.889 0.085 0.021 0.088 0.56140 -0.005 0.014 0.015 0.985 0.001 0.014 0.014 0.980 0.055 0.016 0.057 0.915 0.063 0.016 0.065 0.61150 -0.003 0.012 0.012 1.027 0.004 0.012 0.013 1.025 0.044 0.014 0.046 0.951 0.050 0.014 0.052 0.6322-way GFE, σ =10 GFE, σ =10 FE, σ =10 IFE, σ =105 -0.016 0.075 0.076 0.691 -0.015 0.075 0.076 0.692 0.706 0.262 0.753 0.386 0.300 0.255 0.394 0.21810 -0.013 0.035 0.037 0.946 -0.010 0.035 0.036 0.947 0.323 0.097 0.337 0.486 0.183 0.060 0.192 0.45820 -0.002 0.024 0.024 0.975 0.003 0.024 0.024 0.967 0.150 0.037 0.154 0.719 0.168 0.036 0.172 0.47430 0.002 0.019 0.019 0.991 0.008 0.019 0.021 0.983 0.100 0.025 0.104 0.814 0.121 0.030 0.125 0.45640 0.003 0.016 0.016 0.989 0.010 0.016 0.019 0.985 0.076 0.020 0.079 0.851 0.091 0.021 0.093 0.53650 0.002 0.014 0.015 0.995 0.010 0.014 0.018 0.996 0.061 0.017 0.063 0.911 0.073 0.017 0.075 0.592 N o t e s : S ee n o t e s t o T a b l e S . I F E i s i n t e r a c t e d fi x e d - e ff ec t s w i t ho n e f a c t o r . σ i s t h e s u b s t i t u t i o n pa r a m e t e r . able S3: Probit model with binary covariatesK Bias std RMSE Bias std RMSE Bias std RMSEGFE, 1 covariate GFE, 2 covariates GFE, 3 covariates5 -0.189 0.029 0.191 -0.293 0.031 0.295 -0.362 0.042 0.36510 -0.083 0.027 0.088 -0.205 0.032 0.207 -0.275 0.035 0.27820 -0.017 0.029 0.033 -0.118 0.030 0.122 -0.206 0.033 0.20930 0.006 0.029 0.030 -0.081 0.030 0.086 -0.166 0.032 0.16940 0.018 0.029 0.035 -0.056 0.030 0.064 -0.136 0.033 0.14050 0.026 0.030 0.039 -0.040 0.031 0.051 -0.116 0.033 0.120Cond. GFE, 1 covariate Cond. GFE, 2 covariates Cond. 
GFE, 3 covariates
5 -0.060 0.035 0.069 -0.085 0.037 0.093 -0.111 0.039 0.117
10 -0.045 0.033 0.056 -0.073 0.043 0.085 -0.100 0.044 0.109
20 -0.015 0.034 0.038 -0.046 0.037 0.059 -0.075 0.045 0.087
30 0.008 0.036 0.036 -0.031 0.036 0.047 -0.061 0.043 0.075
40 0.025 0.035 0.043 -0.020 0.037 0.041 -0.050 0.042 0.065
50 0.034 0.035 0.049 -0.012 0.036 0.038 -0.040 0.041 0.057
FE, 1 covariate FE, 2 covariates FE, 3 covariates
- 0.062 0.031 0.069 0.074 0.034 0.081 0.088 0.039 0.097

Notes: 1000 simulations, N = 1000, T = 20. In the top panel we show GFE estimates based on unconditional moments for different K values, in the middle panel we show GFE estimates based on conditional moments for different K values, and in the bottom row we show FE estimates.
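For completeness, the conditional first step used in the middle panel can be sketched as follows: covariate-specific conditional means $\bar Y_i(x)$ are computed for each support point $x$, and the overall mean $\bar Y(x)$ is imputed whenever individual $i$ never draws the value $x$, as described above. The helper name and the toy data are hypothetical; with several binary covariates, the support points would be the covariate combinations rather than the two values used here.

```python
import numpy as np

def conditional_moments(Y, X, support):
    """Covariate-specific conditional means Ybar_i(x) for a discrete covariate X_it,
    with the overall mean Ybar(x) imputed whenever individual i never draws x
    (the imputation rule described in the text)."""
    N, T = Y.shape
    overall = np.array([Y[X == x].mean() for x in support])
    H = np.empty((N, len(support)))
    for j, x in enumerate(support):
        for i in range(N):
            mask = (X[i] == x)
            H[i, j] = Y[i, mask].mean() if mask.any() else overall[j]
    return H

# Toy illustration with a single binary covariate (hypothetical data).
rng = np.random.default_rng(4)
N, T = 200, 20
X = (rng.random((N, T)) < 0.5).astype(int)
Y = (rng.normal(size=(N, T)) + X > 0).astype(float)
h = conditional_moments(Y, X, support=[0, 1])   # moments fed to the kmeans step
print(h.shape)
```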