Forecasting with Bayesian Grouped Random Effects in Panel Data
Boyuan Zhang, University of Pennsylvania
First Version: June 30, 2020. This Version: October 6, 2020.
Abstract
In this paper, we estimate and leverage the latent constant group structure to generate point, set, and density forecasts for short dynamic panel data. We implement a nonparametric Bayesian approach to simultaneously identify coefficients and group membership in the random effects, which are heterogeneous across groups but fixed within a group. This method allows us to flexibly incorporate subjective prior knowledge on the group structure that potentially improves the predictive accuracy. In Monte Carlo experiments, we demonstrate that our Bayesian grouped random effects (BGRE) estimators produce accurate estimates and score predictive gains over standard panel data estimators. With a data-driven group structure, the BGRE estimators exhibit clustering accuracy comparable to the Kmeans algorithm and outperform a two-step Bayesian grouped estimator whose group structure relies on Kmeans. In the empirical analysis, we apply our method to forecast the investment rate across a broad range of firms and illustrate that the estimated latent group structure improves forecasts relative to standard panel data estimators.
JEL CLASSIFICATION: C11, C14, C23, C53, G31
KEY WORDS: Panel Data; Grouped Heterogeneity; Random Effects; Dirichlet Process; Set Forecast; Density Forecast; Investment

Department of Economics, Perelman Center for Political Science and Economics, University of Pennsylvania, 133 S. 36th St., Philadelphia, PA 19104-6297. Email: [email protected]. We would like to thank Karun Adusumilli, Francis Diebold, Maximilian Göbel, Philippe Goulet Coulombe, Frank Schorfheide for helpful comments and suggestions.
With the increasing availability of panel data, many works have examined and demonstrated its central role in empirical research throughout the social and business sciences. Analysis of panel data has various advantages over that of pure cross-sectional or time-series data. The most important one is that panel data provide researchers with a flexible way to model both heterogeneity among individuals, firms, regions, and countries, and possible structural changes over time. Apart from its principal role in model estimation, it is interesting and essential to study its relevance for forecasting. Among the novel methods that have emerged recently, the latent group structure in the heterogeneity has attracted wide attention. In this paper, we allow for grouped patterns of unobserved heterogeneity in dynamic panel data models and evaluate whether this latent structure improves the predictive performance in an extensive collection of short time series.

In the dynamic panel data model, it is common to assume that each cross-sectional unit has a unique intercept. This assumption introduces a large number of parameters that become a burden in estimation. In models that have as many parameters as individual units, fixed effects estimators are known to suffer from the "incidental parameters" problem (Neyman and Scott, 1948), which can bring about significant biases in estimates of common parameters.
This problem becomes severe in short panels even if the number of units goes to infinity (Chamberlain, 1980; Nickell, 1981), and the fixed effects themselves are often poorly estimated. Unreliable estimates raise concerns about the predictive power of panel data models, as inaccurate estimates affect forecasts in all aspects.

To address this issue, econometricians attempt to reduce the number of unknown parameters by dividing units into a finite number of groups. The premise of this idea is that units in the same group share the unit-specific parameters. Previous works include Bonhomme and Manresa (2015), Ando and Bai (2016), Su et al. (2016), Bester and Hansen (2016), Su et al. (2019), Bonhomme et al. (2019), and Cheng et al. (2019). Moreover, finite mixture models provide a well-known probabilistic approach to model-based clustering (McNicholas and Murphy, 2010; Frühwirth-Schnatter, 2011). With a finite number of groups, econometricians can avoid the "incidental parameters" problem under several particular assumptions and derive consistent estimators for the common parameters. Another important strand of literature implements generalized method of moments (GMM) methods to eliminate bias; see Arellano and Bond (1991), Arellano and Bover (1995), and Blundell and Bond (1998). Though these methods successfully solve the "incidental parameters" problem, they do not allow for any latent group structure.
When N and T are large enough, the information criterion could select the true group structure. However, with a short time span, these criteria might fail to achieve their goal. As noted in Bonhomme and Manresa (2015, hereafter BM), the choice of the number of groups is crucial to estimation and inference for model parameters. Misspecification in the group number forces the algorithm to consider incorrect group memberships.
We will later show that it is the information criteria that substantially affect the performance of the Grouped Fixed Effects (GFE) estimator proposed by BM.

The contributions of this paper are fourfold. First, closely following Kim and Wang (2019, hereafter KW) and Liu (2020), we develop a posterior sampling algorithm that addresses the nonparametric estimation of latent grouped effects and propose the Bayesian Grouped Random Effects (BGRE) estimator. The number of groups is treated as an unknown parameter that is estimated jointly with the component-specific parameters under the assumption that group membership remains constant over time. Instead of using a finite mixture model, which requires presetting the number of groups, we use a Dirichlet Process (DP) prior, in particular the stick-breaking prior, that allows for infinitely many potential groups. The entire posterior sampler builds upon the blocked Gibbs sampling proposed by Ishwaran and James (2001).

Second, we leverage the researcher's prior knowledge of the latent group structure to improve estimation and forecasting. In particular, we summarize and incorporate the information of the subjective group structure in the prior distribution of the membership probabilities. If the subjective prior on the group structure is more precise than a random guess, even with an incorrect presumed number of groups, we show that including it in the prior improves the performance of the BGRE estimators as it guides the group membership estimates. Unlike the Pólya urn Gibbs sampler (Escobar and West, 1995), the blocked Gibbs sampler avoids marginalizing over the prior and thus allows for direct sampling of the nonparametric posterior, leading to computational and inferential advantages.
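The stick-breaking construction underlying the DP prior can be sketched in a few lines. This is a generic illustration truncated at K components, in the spirit of the blocked Gibbs sampler of Ishwaran and James (2001), not the paper's exact implementation; all names and values are hypothetical.

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng):
    """Truncated stick-breaking construction of DP mixture weights.

    Draw v_k ~ Beta(1, alpha) and set pi_k = v_k * prod_{j<k}(1 - v_j),
    with v_K = 1 so the K truncated weights sum to one.
    """
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0  # truncation: the last stick absorbs the remaining mass
    pieces = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * pieces

rng = np.random.default_rng(0)
pi = stick_breaking_weights(alpha=1.0, K=20, rng=rng)
```

Smaller concentration parameters `alpha` put more mass on the first few sticks, so the prior favors a small number of non-empty groups even though infinitely many are potentially available.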
Third, our posterior sampling algorithm is closely connected to the Kmeans algorithm (MacQueen, 1967) under certain assumptions. In particular, both algorithms assign units to the closest centroid when forming the clusters and recalculate the means of the new clusters afterward. To compare the performance of clustering, we modify our algorithm to incorporate Kmeans and construct a two-step BGRE estimator where individuals are clustered in the first step using Kmeans, and the group-specific heterogeneity is estimated in the second step. In the simulation section, we document that our BGRE estimators dominate the two-step GRE estimator in terms of the performance of both clustering and forecasting. We also find that the two-step GRE estimators with the Kmeans algorithm severely underestimate the number of groups under all data generating processes, whereas the BGRE estimators deliver accurate estimates.

Last but not least, we examine the performance of the BGRE estimators using various sets of simulated data and real data. The Monte Carlo study shows that grouped heterogeneity brings gains in estimating the group structure and in one-step-ahead point, set, and density forecasting relative to commonly used predictors with different parametric priors on individual effects. In particular, our estimators outperform BM's Grouped Fixed Effects (GFE) estimator in various settings of the data generating process. The better performance is primarily due to the accurate estimate of the group structure. Regarding other predictors, we show that failing to model the group structure and to pool information across units severely deteriorates the results for both estimation and forecasting. Finally, we use our method to forecast the investment rate across a broad range of firms. The BGRE estimators offer better performance than standard panel data models in forecasting. This reveals that incorporating the latent group structure provides a great amount of flexibility and improves the predictive power of the underlying panel data model.

Our paper relates to several branches of the literature. Our work is closely related to BM, KW, and Bonhomme et al. (2019, hereafter BLM). All three papers aim to estimate the unobserved heterogeneity in a linear dynamic panel data model and develop statistical inference methods. BM estimate the parameters of the model using the GFE estimator that minimizes the least-squares criterion over all possible groupings of the cross-sectional units. They jointly estimate the individual types and the model's parameters given the number of groups and perform model selection afterward. On the other hand, BLM modify this method and split the procedure into two steps, with the Kmeans clustering algorithm used in the first step. From the Bayesian point of view, KW propose a full Bayesian estimator that
Kmeans algorithm. We conduct various Monte Carlo experiments in section 4 to examine the performance of the proposed estimator in a controlled environment in light of point, density, and set forecasts. We also examine the performance of a few variants of the BGRE estimator. In section 5, we conduct an empirical analysis in which we forecast the investment rate across firms. Finally, we conclude in section 6. A description of the data sets, additional empirical results, and derivations are relegated to the appendix.
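As an illustration of the two-step idea discussed above, the following sketch clusters units with a plain Kmeans (Lloyd's iterations) on the outcome panel and then estimates a group-level intercept by pooling observations within each cluster. All names, the deterministic initialization, and the parameter values are hypothetical; the paper's two-step GRE estimator performs a full Bayesian second stage rather than a simple within-cluster mean.

```python
import numpy as np

def two_step_grouped_effects(y, K, n_iter=50):
    """Illustrative two-step grouped estimator (hypothetical sketch).

    Step 1: cluster the N units with Lloyd's Kmeans on the N x T panel y,
            initializing centroids at units with evenly spaced mean ranks.
    Step 2: estimate a group intercept as the pooled mean within each cluster.
    """
    N, T = y.shape
    order = np.argsort(y.mean(axis=1))
    centers = y[order[np.linspace(0, N - 1, K).astype(int)]].astype(float)
    for _ in range(n_iter):
        # assign each unit to the closest centroid (squared Euclidean distance)
        d = ((y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        g = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster empties
        for k in range(K):
            if (g == k).any():
                centers[k] = y[g == k].mean(axis=0)
    alpha_hat = np.array([y[g == k].mean() if (g == k).any() else np.nan
                          for k in range(K)])
    return g, alpha_hat

# two well-separated groups of 30 units each, T = 8 periods
rng = np.random.default_rng(1)
y = np.vstack([rng.normal(0.0, 0.1, (30, 8)),
               rng.normal(5.0, 0.1, (30, 8))])
g, alpha_hat = two_step_grouped_effects(y, K=2)
```

Because the second step conditions entirely on the first-step clustering, any misclassification by Kmeans propagates into the estimated group effects, which is one channel behind the underestimation of the number of groups documented in the simulations.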
We consider a panel with observations for cross-sectional units $i = 1, \ldots, N$ in periods $t = 1, \ldots, T$. Given a panel data set $\{(y_{it}, x_{it})\}$, a simple linear dynamic panel data model with grouped patterns of heterogeneity takes the following form:
$$ y_{it} = \alpha_{g_i t} + \rho y_{it-1} + \beta_i' x_{it} + \varepsilon_{it}, \qquad \varepsilon_{it} \overset{iid}{\sim} N\big(0, \sigma^2_{g_i}\big), \tag{2.1} $$
where $x_{it}$ is a $p \times 1$ vector of covariates that is uncorrelated with $\varepsilon_{it}$ but is allowed to be arbitrarily correlated with $\alpha_{g_i t}$. $\alpha_{g_i t}$ denotes the time-varying group-specific heterogeneity. The subscript $g_i \in \{1, \ldots, K\}$ is the group membership variable with unknown and unconstrained $K$. $y_{it-1}$ is the lagged outcome variable, $\rho$ is the homogeneous AR(1) parameter common to all cross-sectional units, and $\beta_i$ is a $p \times 1$ vector of heterogeneous coefficients. $\varepsilon_{it}$ is the idiosyncratic error term featuring zero mean and grouped heteroskedasticity $\sigma^2_{g_i}$, with cross-sectional homoskedasticity being a special case where $\sigma^2_{g_i} = \sigma^2$. This setting leads to a heterogeneous panel with a group pattern modeled through $\alpha_{g_i t}$ and $\sigma^2_{g_i}$.

By stacking all observations for unit $i$, we get an aggregated model:
$$ y_i = \alpha_{g_i} + \rho y_{i,-1} + x_i \beta_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \Sigma_{g_i}), \tag{2.2} $$
where $y_i = [y_{i1}, y_{i2}, \ldots, y_{iT}]'$, $y_{i,-1} = [y_{i0}, y_{i1}, \ldots, y_{iT-1}]'$, $x_i = [x_{i1}, x_{i2}, \ldots, x_{iT}]'$ is $T \times p$, $\alpha_{g_i} = [\alpha_{g_i 1}, \alpha_{g_i 2}, \ldots, \alpha_{g_i T}]'$, $\varepsilon_i = [\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iT}]'$, and $\Sigma_{g_i} = \sigma^2_{g_i} I_T$.
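A data generating process of the form (2.1) can be simulated as follows. The number of groups, intercepts, variances, and coefficients below are hypothetical choices for illustration, with a time-invariant $\alpha$, a single regressor, and $\beta_i = 1$ for all units.

```python
import numpy as np

def simulate_grouped_panel(N=100, T=10, rho=0.5, seed=0):
    """Simulate a grouped dynamic panel in the spirit of equation (2.1),
    with hypothetical values: two groups with intercepts alpha = (-1, 1),
    group standard deviations sigma = (0.5, 1.0), one exogenous regressor,
    and a homogeneous unit coefficient beta_i = 1."""
    rng = np.random.default_rng(seed)
    alpha = np.array([-1.0, 1.0])
    sigma = np.array([0.5, 1.0])
    g = rng.integers(0, 2, size=N)          # group memberships g_i
    x = rng.normal(size=(N, T))             # exogenous regressor x_it
    y = np.zeros((N, T))
    y_lag = rng.normal(size=N)              # initial condition y_i0
    for t in range(T):
        eps = sigma[g] * rng.normal(size=N)  # grouped heteroskedastic errors
        y[:, t] = alpha[g] + rho * y_lag + x[:, t] + eps
        y_lag = y[:, t]
    return y, x, g

y, x, g = simulate_grouped_panel()
```

Under these values the stationary group means are roughly $\alpha_k / (1 - \rho)$, so the two groups generate clearly separated outcome paths, which is convenient for checking clustering accuracy in controlled experiments.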
To indicate the component from which each observation stems, we introduce a group membership variable $G = [g_1, \ldots, g_N]$ taking values in $\{1, \ldots, K\}^N$. Define the set of units that belong to group $k$: $C_k = \{ i \in \{1, 2, \ldots, N\} \mid g_i = k \}$, and let $|C_k|$ denote the cardinality of the set $C_k$.

Following Sun (2005), Lin and Ng (2012) and BM, we assume that individual group membership does not vary over time. In addition, for any groups $i \neq j$, we assume that they have different paths of random effects, i.e., $\alpha_i \neq \alpha_j$, and that no single unit can simultaneously belong to these two groups: $C_i \cap C_j = \emptyset$.

The main goal of this paper is to estimate the grouped random effects $\alpha_{g_i}$, the common parameter $\rho$, the heterogeneous coefficients $\beta_i$, and the group membership $G$ using the full sample, and to provide point, set, and density forecasts of $y_{iT+h}$ for each unit $i$. Throughout this paper, we focus on the one-step-ahead forecast where $h = 1$. For the multiple-step forecast, the procedure simulates $y_{iT+h}$ forward in accordance with (2.1) given the estimates of the parameters and realizations of the data.

Our goal is to generate one-step-ahead forecasts of $y_{i,T+1}$ for $i = 1, \ldots, N$ conditional on the history of observations,
$$ Y = [y_1, y_2, \ldots, y_N], \quad y_i = [y_{i1}, y_{i2}, \ldots, y_{iT}], \qquad X = [x_1, x_2, \ldots, x_N], \quad x_i = [x_{i1}, x_{i2}, \ldots, x_{iT}], $$
and newly available exogenous variables $x_{iT+1}$ at $T +$
1. For illustration purposes, we drop $X$ and $x_{iT+1}$ from the notation, but we always condition on these exogenous variables.

The posterior predictive distribution for unit $i$ is given by
$$ p(y_{iT+1} \mid Y) = \int p(y_{iT+1} \mid Y, \Theta)\, p(\Theta \mid Y)\, d\Theta, \tag{2.3} $$
where $\Theta$ is the vector of parameters $\Theta = \{\rho, \beta_i, \alpha_{g_i}, \Sigma_{g_i}, g_i\}$. This density is the posterior expectation of the following function,
$$ p(y_{iT+1} \mid Y, \Theta) = \sum_{k=1}^{K} \mathbb{1}(g_i = k)\, p(y_{iT+1} \mid Y, \rho, \beta_i, \alpha_k, \Sigma_k), \tag{2.4} $$
which is invariant to relabeling the components of the mixture. Therefore, given $M^*$ posterior draws, the density estimated from the MCMC draws is
$$ \hat{p}(y_{iT+1} \mid Y) = \frac{1}{M^*} \sum_{j=1}^{M^*} \left( \sum_{k=1}^{K^{(j)}} \mathbb{1}(g_i = k)\, p\big(y_{iT+1} \mid Y, \rho^{(j)}, \beta_i^{(j)}, \alpha_k^{(j)}, \Sigma_k^{(j)}\big) \right). \tag{2.5} $$
Therefore, we can draw samples from $\hat{p}(y_{iT+1} \mid Y)$ by simulating (2.1) forward conditional on the posterior draws of $\Theta$ and the observations.

We evaluate the point forecasts via the Root Mean Square Forecast Error (RMSFE) under the quadratic loss function averaged across units. Let $\hat{y}_{iT+1}$ represent the predicted value
of $y_{iT+1}$ conditional on the information up to $T$; the loss function is written as
$$ L(\hat{y}_{N,T+1}, y_{N,T+1}) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_{iT+1} - y_{iT+1})^2 = \frac{1}{N} \sum_{i=1}^{N} \hat{\varepsilon}_{iT+1}^2, \tag{2.6} $$
where $y_{i,T+1}$ is the realization at $T+1$ and $\hat{\varepsilon}_{iT+1}$ denotes the forecast error. The optimal posterior forecast under the quadratic loss function is obtained by minimizing the posterior risk,
$$ \hat{y}_{N,T+1} = \operatorname*{argmin}_{\hat{y} \in \mathbb{R}^N} \int L(\hat{y}, y_{N,T+1})\, p(y_{N,T+1} \mid Y)\, dy_{N,T+1} = \operatorname*{argmin}_{\hat{y} \in \mathbb{R}^N} \frac{1}{N} \sum_{i=1}^{N} E\big[(\hat{y}_i - y_{iT+1})^2 \mid Y\big]. \tag{2.7} $$
This implies that the optimal posterior forecast is the posterior mean,
$$ \hat{y}_{i,T+1} = E(y_{iT+1} \mid Y), \quad \text{for } i = 1, \ldots, N. \tag{2.8} $$

We construct set forecasts $CS_{iT+1}$ from the posterior predictive distribution of each unit. In particular, we adopt a Bayesian approach and report the highest posterior density interval (HPDI), which is the narrowest connected interval with coverage probability $1 - \alpha$.
Put differently, it requires that the probability of $y_{iT+1} \in CS_{iT+1}$ conditional on having observed the history $Y$ is at least $1 - \alpha$, i.e.,
$$ P(y_{iT+1} \in CS_{iT+1}) \ge 1 - \alpha, \quad \text{for all } i, \tag{2.9} $$
and this interval is the shortest among all possible singly connected candidate sets. Let $\delta^l_i$ be the lower bound and $\delta^u_i$ be the upper bound; then $CS_{iT+1} = [\delta^l_i, \delta^u_i]$.

The assessment of set forecasts in simulation studies and empirical applications is based on two metrics: (1) the cross-sectional coverage frequency,
$$ Cov_{T+1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{ y_{iT+1} \in CS_{iT+1} \}, \tag{2.10} $$
and (2) the average length of the sets $CS_{iT+1}$,
$$ AvgL_{T+1} = \frac{1}{N} \sum_{i=1}^{N} (\delta^u_i - \delta^l_i). \tag{2.11} $$
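Given posterior predictive draws for each unit, the HPDI and the two metrics (2.10)-(2.11) can be approximated as below. The shortest-connected-interval search assumes a unimodal predictive density, and all inputs are simulated placeholders rather than the paper's actual posterior output.

```python
import numpy as np

def hpd_interval(draws, alpha=0.05):
    """Shortest connected interval containing a (1 - alpha) share of the
    posterior predictive draws; for a unimodal predictive density this
    approximates the HPDI."""
    s = np.sort(draws)
    m = int(np.ceil((1.0 - alpha) * len(s)))  # draws the interval must cover
    widths = s[m - 1:] - s[:len(s) - m + 1]   # width of every candidate window
    lo = widths.argmin()                       # narrowest window wins
    return s[lo], s[lo + m - 1]

def set_forecast_metrics(pred_draws, y_real, alpha=0.05):
    """Cross-sectional coverage frequency (2.10) and average length (2.11)."""
    bounds = np.array([hpd_interval(d, alpha) for d in pred_draws])
    cover = ((y_real >= bounds[:, 0]) & (y_real <= bounds[:, 1])).mean()
    avg_len = (bounds[:, 1] - bounds[:, 0]).mean()
    return cover, avg_len

rng = np.random.default_rng(0)
pred = rng.normal(0.0, 1.0, size=(50, 2000))  # N=50 units, 2000 draws each
y_real = rng.normal(0.0, 1.0, size=50)        # placeholder realizations
cov, avg_len = set_forecast_metrics(pred, y_real)
```

For a standard normal predictive distribution the 95% HPDI is roughly $(-1.96, 1.96)$, so a well-calibrated forecaster should report coverage near 0.95 with an average length near 3.9.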
To compare the performance of density forecasts across estimators, we examine the continuous ranked probability score (CRPS) across units. The CRPS is frequently used to assess the relative accuracy of two probabilistic forecasting models. It is a quadratic measure of the difference between the forecast cumulative distribution function (CDF), $F^{T+1}_i(y)$, and the empirical CDF of the observation, with the formula
$$ CRPS_{T+1} = \frac{1}{N} \sum_{i=1}^{N} CRPS(F^{T+1}_i, y_{iT+1}) = \frac{1}{N} \sum_{i=1}^{N} \int \big( F^{T+1}_i(y) - \mathbb{1}\{ y_{iT+1} \le y \} \big)^2 \, dy, \tag{2.12} $$
where $y_{iT+1}$ is the realization at $T+1$. We also report the log predictive score,
$$ LPS_{T+1} = \frac{1}{N} \sum_{i=1}^{N} \ln \hat{p}(y_{iT+1} \mid Y). \tag{2.13} $$

To evaluate the statistical superiority of pooling within $K$ clusters, we report the bias, standard deviation, average length of the 95% credible set, and frequentist coverage of the posterior mean estimate of $\rho$ across Monte Carlo repetitions. For the random effects $\alpha$, we only present the average bias as it may not be of interest for most empirical analyses.

To estimate the number of groups, we derive a point estimator from its posterior distribution, typically the posterior mean, which is consistent with a quadratic loss function. In the empirical analysis, we also consider the posterior mode suggested by Malsiner-Walli et al.
(2016), which is equal to the most frequent number of non-empty components visited during MCMC sampling. These approaches constitute an automatic and straightforward strategy to estimate the unknown number of groups without using model selection criteria or marginal likelihoods.
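The posterior-mode estimator just described counts, at each MCMC iteration, the number of non-empty components and takes the most frequent count. A minimal sketch, assuming the sampler stores an M x N matrix of membership draws (a hypothetical input format):

```python
import numpy as np
from collections import Counter

def posterior_mode_num_groups(membership_draws):
    """Estimate the number of groups as the most frequent number of
    non-empty components across MCMC draws (the posterior mode).
    Row j of `membership_draws` holds the sampled labels g_1,...,g_N
    at draw j; a component is non-empty if some unit carries its label."""
    counts = [len(np.unique(row)) for row in membership_draws]
    return Counter(counts).most_common(1)[0][0]

# hypothetical chain over N = 4 units: most draws visit 3 non-empty components
draws = np.array([[0, 0, 1, 2],
                  [0, 1, 1, 2],
                  [0, 1, 2, 2],
                  [0, 0, 1, 1]])
print(posterior_mode_num_groups(draws))  # → 3
```

Because the estimator only tallies non-empty components, it is unaffected by label switching across draws, which is what makes it automatic relative to criteria that require a fixed labeling or a marginal likelihood.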
Put differently, it requires that the probability of $y_{iT+1} \in CS_{iT+1}$, conditional on having observed the history $Y$, be at least $1 - \alpha$, i.e.,
\[
P(y_{iT+1} \in CS_{iT+1} \mid Y) \ge 1 - \alpha, \quad \text{for all } i, \quad (2.9)
\]
and this interval is the shortest among all possible connected candidate sets. Let $\delta_i^l$ be the lower bound and $\delta_i^u$ the upper bound; then $CS_{iT+1} = [\delta_i^l, \delta_i^u]$.

The assessment of set forecasts in the simulation studies and empirical applications is based on two metrics: (1) the cross-sectional coverage frequency,
\[
Cov_{T+1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{y_{iT+1} \in CS_{iT+1}\}, \quad (2.10)
\]
and (2) the average length of the sets $CS_{iT+1}$,
\[
AvgL_{T+1} = \frac{1}{N} \sum_{i=1}^{N} (\delta_i^u - \delta_i^l). \quad (2.11)
\]

To compare the density forecast performance of the various estimators, we examine the continuous ranked probability score (CRPS) across units. The CRPS is frequently used to assess the relative accuracy of two probabilistic forecasting models. It is a quadratic measure of the difference between the forecast cumulative distribution function (CDF), $F_{T+1}^i(y)$, and the empirical CDF of the observation:
\[
CRPS_{T+1} = \frac{1}{N} \sum_{i=1}^{N} CRPS(F_{T+1}^i, y_{iT+1}) = \frac{1}{N} \sum_{i=1}^{N} \int \big(F_{T+1}^i(y) - \mathbb{1}\{y_{iT+1} \le y\}\big)^2\, dy, \quad (2.12)
\]
where $y_{iT+1}$ is the realization at $T+1$. In addition, we report the log predictive score (LPS),
\[
LPS_{T+1} = \frac{1}{N} \sum_{i=1}^{N} \ln \hat{p}(y_{iT+1} \mid Y). \quad (2.13)
\]

To evaluate the statistical gains from pooling within $K$ clusters, we report the bias, standard deviation, average length of the 95% credible set, and frequentist coverage of the posterior mean estimate of $\rho$ across Monte Carlo repetitions. For the random effects $\alpha$, we only present the average bias, as the remaining statistics may not be of interest for most empirical analyses.

To estimate the number of groups, we derive a point estimator from its posterior distribution, typically the posterior mean, which is consistent with a quadratic loss function. In the empirical analysis, we also consider the posterior mode suggested by Malsiner-Walli et al.
(2016), which is equal to the most frequent number of non-empty components visited during MCMC sampling. These approaches constitute an automatic and straightforward strategy for estimating the unknown number of groups without resorting to model selection criteria or marginal likelihoods.

Until now, our main focus has been the group structure in the intercepts $\alpha_i$, while $\rho$ and $\beta_i$ are left unchanged as in a standard panel data model. We can easily extend our model to allow for joint group-specific heterogeneity in $\alpha$, $\beta$, and $\rho$. The extended model is written as
\[
y_{it} = \theta_{g_i}' \tilde{x}_{it} + \varepsilon_{it}, \quad \varepsilon_{it} \overset{iid}{\sim} N\big(0, \sigma_{g_i}^2\big), \quad (2.14)
\]
where $\tilde{x}_{it} = [1, \; y_{it-1}, \; x_{it}']'$ and $\theta_{g_i} = [\alpha_{g_i}, \; \rho_{g_i}, \; \beta_{g_i}']'$. For a group $k$, with a joint conjugate prior for the parameters $\theta_k$, we modify our block Gibbs sampler to draw $\alpha_k$, $\rho_k$, and $\beta_k$ simultaneously from their joint posterior distribution. The detailed derivations of the posterior distributions are presented in Appendix A.3.

In this section, we provide the details of the Bayesian analysis. In Section 3.1, we document the specification of the prior distribution for all parameters, including the auxiliary variable in the random coefficient model and the subjective group prior used when econometricians have prior knowledge of the group structure. Section 3.2 outlines the posterior sampler; the proposed algorithm is shown in Appendix A.1.1. Finally, we offer preliminary thoughts on the connection between our Bayesian method and unsupervised machine learning methods in Section 3.3.
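To make the grouped data generating process concrete before turning to the priors, a model of the form (2.14) can be simulated in a few lines. This is only an illustrative sketch: the number of groups, the group-specific parameter values, and the scalar regressor are assumptions made here, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 60, 6, 3                         # illustrative panel dimensions

# Assumed group-specific parameters theta_k = (alpha_k, rho_k, beta_k)
# and group-specific error standard deviations sigma_k.
theta = np.array([[ 1.0, 0.3,  0.5],
                  [ 0.0, 0.6, -0.2],
                  [-1.0, 0.1,  1.0]])
sigma = np.array([0.5, 1.0, 0.3])
g = rng.integers(0, K, size=N)             # latent group membership g_i

# Simulate y_it = alpha_{g_i} + rho_{g_i} y_{i,t-1} + beta_{g_i} x_it + eps_it,
# i.e. (2.14) with a single exogenous regressor x_it.
x = rng.normal(size=(N, T))
y = np.zeros((N, T + 1))                   # column 0 holds the initial values
for t in range(T):
    alpha, rho, beta = theta[g].T          # unit-level parameters via g_i
    y[:, t + 1] = alpha + rho * y[:, t] + beta * x[:, t] \
        + sigma[g] * rng.normal(size=N)
```

Every unit in the same group shares the same triplet $(\alpha_k, \rho_k, \beta_k)$ and error variance, which is exactly the pooling the estimators below try to recover.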
Two sets of prior specifications are considered in this section. In the first, we concentrate on random effects models and implement a full Bayesian analysis: we specify a hyperprior for the distribution of unobserved heterogeneity and then construct a joint posterior for the coefficients of this hyperprior as well as for the actual unit-specific and common coefficients. In the second, econometricians can provide useful information on the latent group structure and incorporate it into the prior.
In this paper, we focus on the random coefficients model in which the heterogeneous parameters $\alpha_{g_i t}$ and $\sigma_{g_i}^2$ are independent and are assumed to be independent of the initial value $y_{i0}$ of each unit. The specification can be extended to a correlated random coefficients model by modeling the joint distribution of the heterogeneous parameters and the initial values $y_{i0}$.

A typical choice in the nonparametric Bayesian literature is the Dirichlet process (DP) prior, or stick-breaking prior. With group probabilities $\pi_k$ and prior mean and variance $(\mu_\alpha, \Sigma_\alpha)$, a draw of $\alpha_{g_i t}$ from the DP prior can be viewed as a mixture of point masses with probability mass function
\[
\alpha_{g_i t} \sim \sum_{k=1}^{K} \pi_k\, \delta_{\alpha_{kt}}, \quad \text{with } \alpha_{kt} \sim N(\mu_\alpha, \Sigma_\alpha), \quad (3.1)
\]
where $\delta_x$ denotes the Dirac delta function concentrated at $x$, each $\alpha_{kt}$ is drawn from a normal distribution, and $K$ is unknown. $\mu_\alpha$ is set to the OLS estimate of $\alpha$ assuming $K = 1$, and $\Sigma_\alpha$ equals $200\, \hat{\Sigma}_\alpha$, where $\hat{\Sigma}_\alpha$ is the standard deviation of the OLS estimator.
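A draw from the discrete mixture in (3.1) can be sketched as follows: atoms $\alpha_{kt}$ are drawn from the normal base distribution, and each unit's effect is one of these atoms selected with probability $\pi_k$. The truncation level, base-measure values, and uniform weights below are illustrative assumptions only (in the paper $K$ is unbounded and the weights come from the stick-breaking prior):

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, N = 10, 6, 200                   # assumed truncation, periods, units

# Atoms: each alpha_k = (alpha_k1, ..., alpha_kT) drawn from the base measure.
mu_alpha, sd_alpha = 0.0, 1.0          # illustrative base-measure parameters
alpha_atoms = rng.normal(mu_alpha, sd_alpha, size=(K, T))

# Illustrative group probabilities pi_k (uniform here for simplicity).
pi = np.full(K, 1.0 / K)

# A draw of alpha_{g_i, t}: sample a group label, then read off its atom.
g = rng.choice(K, size=N, p=pi)
alpha_i = alpha_atoms[g]               # shape (N, T); support has at most K points
```

The key feature is visible directly: although there are $N = 200$ units, the drawn effects take at most $K$ distinct values, which is what induces clustering.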
In the same fashion, we can define the DP prior for the grouped heteroskedasticity $\sigma_{g_i}^2$ given identical group probabilities $\pi_k$:
\[
\sigma_{g_i}^2 \sim \sum_{k=1}^{K} \pi_k\, \delta_{\sigma_k^2}, \quad \text{with } \sigma_k^2 \sim IG\left(\frac{\nu_\sigma}{2}, \frac{\delta_\sigma}{2}\right), \quad (3.2)
\]
where each component is drawn from an inverse-Gamma distribution.

Put together, the posterior draws of the group-related coefficients can be characterized by a grouped triplet $\{\pi_k, \alpha_k, \sigma_k^2\}$ for $k = 1, 2, \ldots$, with $\alpha_k = [\alpha_{k1}, \alpha_{k2}, \ldots, \alpha_{kT}]$. Importantly, the distributions of both $\alpha_k$ and $\sigma_k^2$ are discrete, because draws can only take values in the set $\{(\alpha_k, \sigma_k^2) : k \in \mathbb{Z}^+\}$. This nonparametric nature makes the Dirichlet process prior an ideal choice for clustering problems, especially when the number of distinct clusters is unknown beforehand. The group parameters $(\alpha_k, \sigma_k^2)$ are assumed to follow the base distribution $B_0$, which is an independent (non-conjugate) Multivariate Normal-Inverse-Gamma (IMNIG) distribution.

On the other hand, the group probabilities are formalized through an infinite-dimensional stick-breaking prior governed by the concentration parameter $a$,
\[
\pi_k = \xi_k \prod_{j < k} (1 - \xi_j) \quad \text{for } k > 1, \quad \text{and } \pi_1 = \xi_1. \quad (3.3)
\]
The $\xi_k$, called stick lengths, are independent random variables drawn from the beta distribution $Beta(1, a)$. This construction can be viewed as a stick-breaking procedure: at each step, we independently and randomly break off part of what remains of a stick of unit length and assign the length of this break to the current value of $\pi_k$. The smaller $a$ is, the less of the stick is left for subsequent values (on average), yielding more concentrated distributions. The concentration parameter $a$ thus specifies how strong this discretization is. As $a \to 0$, the mass concentrates on the first few atoms; as $a \to \infty$, the realizations become continuous-valued draws from the base distribution.
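The stick-breaking construction (3.3) and the role of $a$ are easy to simulate. The sketch below uses a finite truncation purely for illustration (the values of $a$ and the truncation level are assumptions, not the paper's settings):

```python
import numpy as np

def stick_breaking_weights(a, K, rng):
    """First K stick-breaking weights pi_k for concentration parameter a."""
    xi = rng.beta(1.0, a, size=K)                  # stick lengths xi_k ~ Beta(1, a)
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - xi)[:-1]))
    return xi * leftover                           # pi_k = xi_k prod_{j<k}(1 - xi_j)

rng = np.random.default_rng(2)
for a in (0.5, 5.0):
    pi = stick_breaking_weights(a, K=50, rng=rng)
    # Smaller a leaves less stick for later components, so mass concentrates
    # on the first few weights; larger a spreads it out.
    print(f"a={a}: sum of first 5 weights = {pi[:5].sum():.3f}")
```

Since the expected leftover stick after $k$ breaks is $(a/(1+a))^k$, small $a$ effectively yields a handful of groups while large $a$ spreads mass over many, which is why treating $a$ as a parameter lets the data choose the number of groups.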
Escobar and West (1995) show that the number of estimated groups under a DP prior is sensitive to $a$, which indicates that a data-driven estimate is more reasonable. To determine how discrete we want the distribution to be, and how many groups are needed given the data, it is convenient to treat $a$ as a parameter within the nonparametric Bayesian framework. Put differently, we can place a relatively diffuse Gamma hyperprior on $a$ and update it based on the observations. This step generates a posterior estimate of $a$, which implicitly chooses the optimal $K$ without re-estimating the model for different numbers of groups.

Finally, the prior distribution for the common parameter $\rho$ is chosen to be a normal distribution,
\[
\rho \sim N(0, \sigma_\rho^2). \quad (3.4)
\]
The prior for the heterogeneous parameter $\beta_i$ follows
\[
\beta_i \sim N(0, \Sigma_\beta), \quad \text{with } \Sigma_\beta \propto I_p. \quad (3.5)
\]
To sum up, in the random coefficients model we specify Dirichlet process priors for the grouped random effects $\alpha_{g_i t}$ and heteroskedasticity $\sigma_{g_i}^2$, a stick-breaking process for the group probabilities $\pi_k$, a hyperprior for the concentration parameter $a$, and normal priors for the common parameter $\rho$ and the heterogeneous parameter $\beta_i$.

Frequently, researchers can provide a group structure on all, or at least part, of the units based on personal expertise and the nature of the individuals. For example, firms from the same industry may share a similar growth pattern with relatively high probability; countries at the same level of development form comparable fiscal policies. Though this
knowledge may be imprecise, it can be encoded as a vector of subjective group probabilities $\omega_i = [\omega_{i1}, \omega_{i2}, \ldots, \omega_{iK_p}]$ for each unit $i = 1, \ldots, N$, where $K_p$ is the preset number of groups. Namely, before estimating the group memberships, the researcher assigns each unit to the groups with a set of subjective group-specific probabilities, and these probabilities enter the algorithm through a prior distribution for $\omega_i$. We name this prior for $\omega_i$ the Subjective Group Probability (SGP) prior.
In practice, one could provide a table (for example, Table 1) documenting the subjective probability of a unit falling into each specific group.

Table 1: An example of prior group probabilities

If, for a unit $i$, the researcher is fairly confident in her knowledge, she can set one of the $\omega_{ik}$ to 100%. Doing so for every $i = 1, 2, \ldots, N$ is equivalent to the case where the researcher exactly partitions the $N$ units into $K_p$ predetermined groups.

Building on the prior for the random effects model in Section 3.1.1, we allow the researcher's prior knowledge to be incorporated while inheriting the ability to reallocate units and change the number of groups along the MCMC sampling. These flexible features enable the block Gibbs sampler to automatically correct and update an imprecise subjective prior, especially when $K_p$ does not match the true number of groups.

To incorporate these subjective group probabilities, it is important to choose a proper prior for $\omega$. The Dirichlet distribution is an applicable candidate among assorted densities since
it is conjugate to the multinomial distribution of the group memberships. Suppose the number of non-empty groups ($K^*$) in an iteration equals the presumed $K_p$. Let $\omega_i = [\omega_{i1}, \omega_{i2}, \ldots, \omega_{iK_p}]$ be the vector of group-specific probabilities for unit $i$; we set the prior density for $\omega_i$ to an asymmetric Dirichlet distribution,
\[
\omega_i \sim Dir(a_{i1}, a_{i2}, \ldots, a_{iK_p}), \quad (3.6)
\]
where the $a_{ik}$ are concentration parameters and strictly positive. Conditional on $\omega_i$, the group membership $g_i$ is assumed to be drawn from a multinomial distribution,
\[
g_i \sim Multinomial(\omega_i), \quad \text{i.e.,} \quad P(g_i = k \mid \omega_i) = \omega_{ik}, \quad \text{for } k = 1, \ldots, K_p. \quad (3.7)
\]
It can be shown that the posterior of $\omega_i$ given $g_i$ is also a Dirichlet distribution with modified hyperparameters: $Dir\big(a_{i1} + \mathbb{1}(g_i = 1),\, a_{i2} + \mathbb{1}(g_i = 2),\, \ldots,\, a_{iK_p} + \mathbb{1}(g_i = K_p)\big)$.
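The conjugate update just described is a standard Dirichlet-multinomial computation. A minimal sketch for one unit with $K_p = 3$ and illustrative subjective probabilities (all numbers here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Subjective prior probabilities for unit i over K_p = 3 groups; with the
# a_ik summing to 1, a_ik is also the prior mean of omega_ik.
a_i = np.array([0.6, 0.3, 0.1])

# (3.6): omega_i ~ Dir(a_i1, ..., a_iKp); (3.7): g_i | omega_i ~ Multinomial.
omega_i = rng.dirichlet(a_i)
g_i = rng.choice(len(a_i), p=omega_i)           # group label in {0, 1, 2}

# Conjugate update: posterior is Dir(a_ik + 1(g_i = k)).
a_post = a_i + (np.arange(len(a_i)) == g_i)
posterior_mean = a_post / a_post.sum()          # E(omega_ik | g_i)
```

Each Gibbs iteration only needs to add the indicator of the current membership draw to the concentration parameters, so the subjective prior is revised toward whichever group the data favor.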
Hence we can directly sample $\omega_i$ from its posterior distribution. Another important property that makes the Dirichlet distribution a suitable prior is that we can tie the prior probabilities directly to the expected value of $\omega_{ik}$,
\[
E(\omega_{ik}) = \frac{a_{ik}}{\sum_{k'=1}^{K_p} a_{ik'}}. \quad (3.8)
\]
To integrate the researcher's prior knowledge, one only needs to choose a set of $\{a_{ik}\}$ such that the expected probabilities match her subjective probabilities on the groups. Since the membership probabilities $\omega_{ik}$ are updated based on the observations, and we allow for reallocating units and changing the number of groups along the MCMC sampling, a revision of our block Gibbs sampler is needed to accommodate these changes. The details of the new algorithm are presented in Appendix A.2. In practice, we can restrict $\sum_{k=1}^{K_p} a_{ik}$ to be 1, so that $a_{ik}$ represents both the subjective probability of unit $i$ belonging to group $k$ and the prior mean of $\omega_{ik}$.

Draws from the joint posterior distribution can be obtained using blocked Gibbs sampling. The proposed algorithm is based on Ishwaran and James (2001) and Walker (2007). Though the algorithm in Ishwaran and James (2001) has been widely used for sampling stick-breaking priors, it alone cannot fulfill our need to estimate the number of groups without a predetermined level or upper bound, since it requires a finite-dimensional prior and truncation. To
Hencewe can direct sample ω i from it posterior distribution.Another important property of Dirichlet distribution that enables itself to be the mostsuitable prior is that we can tie our prior probability directly with the expected value of ω ik , E p ω ik q “ a ik ř K p i “ k a ik . (3.8)To integrate the researcher’s prior knowledge, one only need to deliberately choose a setof t a ik u such that the expected probability matches her subjective probability on groups.Since the membership probabilities ω ik are updated based on observations, and we allowfor reallocating units and changing the number of groups along the MCMC sampling, arevision of our block Gibbs sampler is needed to adjust for such changes. The details of thenew algorithm are presented in Appendix A.2. In practice, we can restrict ř K p i “ k a ik to be 1so that a ik represents both the subjective group probability for unit i belonging to group k and the prior mean of ω ik . Draws from the joint posterior distribution can be obtained by using blocked Gibbs sampling.The proposed algorithm is based on Ishwaran and James (2001) and Walker (2007). Thoughthe algorithm in Ishwaran and James (2001) has been widely used for sampling stick-breakingpriors, it alone can’t fulfill our need for estimating the number of groups without any prede-termined level or upper bound since it requires a finite-dimensional prior and truncation. To his Version: October 6, 2020his Version: October 6, 2020 K , we implement slice-samplingproposed by Walker (2007) and modify the framework of Ishwaran and James (2001) withadditional posterior sampling steps. Using the conjugate priors specified in Section 3.1, eachparameter is directly drawn from its posterior distribution.In the Appendix A, we provide detailed derivations for the conditional distributions overwhich the Gibbs sampler iterates. We focus on the time-varying grouped random effectsmodel with grouped heteroskedasticity, which is the most sophisticated specification. 
Otherspecifications can be estimated by merely ignoring time effects in α ’s or shutting down theheteroskedasticity. One feature of our proposed block Gibbs sampler is that it partitions N units into G groups,and, at the same time, generates posterior draws for parameters. This Gibbs sampler andour BGRE estimator inevitably remind us of one of the most popular clustering algorithmsin the area of unsupervised machine learning: the Kmeans algorithm. Indeed, the Kmeansalgorithm plays a crucial role in BM and BLM, who estimate the grouped fixed-effects fromthe frequentists’ point of view. In this subsection, we seek to illustrate the similarity andconnection between our block Gibbs sampler and the Kmeans algorithm in the limit.We start with the Kmeans clustering algorithm. Given a set of observations p z , z , . . . , z N q , where each observation contains the dependent variables and covariates, p y i , x i q . Kmeansclustering aims to partition the N observations into K sets so as to minimize the within-cluster sum of squares,min t C k u Kk “ K ÿ k “ ÿ i P C k } z i ´ µ k } where µ k “ | C k | ÿ i P C k z i . (3.9)The algorithm alternates between reassigning points to clusters and recomputing themeans. For the assignment step, one computes the squared Euclidean distance from eachpoint to each cluster mean, and then assign each observation to the cluster with the nearestmean. The update step of the algorithm recalculates centroid for observations assigned toeach cluster and updates µ k for all k .According to the block Gibbs sampler, we assign unit i to group k conditional on the his Version: October 6, 2020his Version: October 6, 2020
Two sets of specifications of prior distributions are considered in this section. In the first specification, we concentrate on random effects models and implement a full Bayesian analysis. In addition, we specify a hyperprior for the distribution of unobserved heterogeneity and then construct a joint posterior for the coefficients of this hyperprior as well as the actual unit-specific and common coefficients. In the second specification, econometricians can provide useful information on the latent group structure and incorporate it in the prior.

In this paper, we focus on the random coefficients model where the heterogeneous parameters α_{g_i,t} and σ_{g_i} are independent and are assumed to be independent of the initial value y_{i0} of each unit. The specification can be extended to a correlated random coefficients model by modeling the joint distribution of the heterogeneous parameters and the initial values y_{i0}.

A typical choice in the nonparametric Bayesian literature is the Dirichlet Process (DP) prior, or stick-breaking prior. With group probabilities π_k and prior mean and variance (µ_α, Σ_α), a draw of α_{g_i,t} from the DP prior can be viewed as a mixture of point masses with probability mass function

    α_{g_i,t} ~ Σ_{k=1}^K π_k δ_{α_{kt}},  with α_{kt} ~ N(µ_α, Σ_α),    (3.1)

where δ_x denotes the Dirac delta function concentrated at x, each α_{kt} is drawn from a normal distribution, and K is unknown. µ_α is set to the OLS estimate of α assuming a single group, and Σ_α equals 200 × Σ̂_α, where Σ̂_α is the standard deviation of the OLS estimator.
In the same fashion, we can define the DP prior for the grouped heteroskedasticity σ²_{g_i} given identical group probabilities π_k:

    σ²_{g_i} ~ Σ_{k=1}^K π_k δ_{σ²_k},  with σ²_k ~ IG(ν_σ, δ_σ),    (3.2)

where each component is drawn from an inverse-Gamma distribution. Put together, the posterior draws of the group-related coefficients can be characterized by a grouped triplet {π_k, α_k, σ²_k} for k = 1, 2, ..., with α_k = [α_{k1}, α_{k2}, ..., α_{kT}]. Importantly, the distributions of both α_k and σ²_k are discrete, because draws can only take values in the set {(α_k, σ²_k) : k ∈ Z⁺}. This nonparametric nature makes the Dirichlet Process prior an ideal choice for clustering problems, especially when the number of distinct clusters is unknown beforehand. The group parameters (α_k, σ²_k) are assumed to follow the base distribution B, an independent (non-conjugate) Multivariate Normal-Inverse-Gamma (IMNIG) distribution.

The group probabilities, in turn, are formalized through an infinite-dimensional stick-breaking prior governed by the concentration parameter a,

    π_k = ξ_k ∏_{j<k} (1 − ξ_j) for k > 1,  and  π_1 = ξ_1,    (3.3)

where the ξ_k, called stick lengths, are independent random variables drawn from the Beta(1, a) distribution. This construction can be viewed as a stick-breaking procedure: at each step, we independently and randomly break off part of what is left of a stick of unit length and assign the length of this break to the current π_k. The smaller a is, the less of the stick is left for subsequent values (on average), yielding more concentrated distributions. The concentration parameter a thus specifies how strong this discretization is: as a → 0, the realizations concentrate on a few atoms, while as a → ∞, the realizations become continuous-valued draws from the base distribution.
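The stick-breaking construction in (3.3) is straightforward to simulate. The sketch below truncates the infinite sequence once the leftover stick is negligible and pairs each weight with an atom from a normal base distribution; the tolerance and the values a = 0.5, µ_α = 0, Σ_α = 1 are illustrative choices, not the paper's settings.

```python
import numpy as np

def stick_breaking(a, tol=1e-8, rng=None):
    """Stick-breaking weights pi_k = xi_k * prod_{j<k}(1 - xi_j),
    xi_k ~ Beta(1, a), truncated once the leftover stick < tol."""
    rng = np.random.default_rng(rng)
    pis, leftover = [], 1.0
    while leftover > tol:
        xi = rng.beta(1.0, a)
        pis.append(leftover * xi)   # length broken off at this step
        leftover *= 1.0 - xi        # remainder of the stick
    return np.array(pis)

rng = np.random.default_rng(0)
pi = stick_breaking(a=0.5, rng=0)           # group probabilities
atoms = rng.normal(0.0, 1.0, size=pi.size)  # alpha_k ~ N(mu_alpha, Sigma_alpha)
```

With a small a, the first few weights absorb most of the mass, which is exactly the discreteness that drives the clustering behavior described above.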
Escobar and West (1995) show that the number of estimated groups under a DP prior is sensitive to a, which indicates that a data-driven estimate is more reasonable. To determine how discrete we want the distribution to be and how many groups are needed given the data, it is convenient to treat a as a parameter within the nonparametric Bayesian framework. Put differently, we can set up a relatively general Gamma hyperprior for a and update it based on the observations. This step generates a posterior estimate of a, which implicitly chooses the optimal K without re-estimating the model with different numbers of groups.

Finally, the prior distribution for the common parameter ρ is chosen to be a normal distribution,

    ρ ~ N(0, σ²_ρ).    (3.4)

The prior for the heterogeneous parameter β_i is likewise normal,

    β_i ~ N(0, Σ_β),  with Σ_β proportional to I_p.    (3.5)

To sum up, in the random coefficients model we specify Dirichlet Process priors for the grouped random effects α_{g_i,t} and the heteroskedasticity σ²_{g_i}, a stick-breaking process for the group probabilities π_k, a hyperprior for the concentration parameter a, and normal priors for the common parameter ρ and the heterogeneous parameters β_i.

Frequently, researchers can provide a group structure for all, or at least part, of the units based on personal expertise and the nature of the individuals. For example, firms from the same industry may share a similar growth pattern with relatively high probability; countries at the same level of development form comparable fiscal policies. Though this knowledge is typically imperfect, it can be encoded as a vector of group probabilities ω_i = [ω_{i1}, ω_{i2}, ..., ω_{iK_p}] for each unit i = 1, ..., N, where K_p is the preset number of groups. Namely, before estimating the group membership, the researcher assigns each unit to the different groups with a set of subjective group-specific probabilities, and these probabilities enter the algorithm through a prior distribution for ω_i. We name this prior for ω_i the Subjective Group Probability (SGP) Prior.
In practice, one could provide a table (for example, Table 1) documenting the subjective probability of each unit falling into a specific group.

Table 1: An example of prior group probability

In the extreme case where, for every i = 1, 2, ..., N, the researcher is fairly confident in her knowledge, she sets one of the ω_{ik} to 100%. This is equivalent to the case where the researcher exactly partitions the N units into K_p predetermined groups.

Building on the prior for the random effects model in Section 3.1.1, we allow for incorporating the researcher's prior knowledge while inheriting the ability to reallocate units and change the number of groups along the MCMC sampling. These flexible features enable the block Gibbs sampler to automatically correct and update an imprecise subjective prior, especially when K_p does not match the true number of groups.

To incorporate these subjective group probabilities, it is important to choose a proper prior for ω. The Dirichlet distribution is an applicable candidate among assorted densities, since it is conjugate to the multinomial distribution of the group memberships whenever the number of active groups (K*) in an iteration equals the presumed K_p. Let ω_i = [ω_{i1}, ω_{i2}, ..., ω_{iK_p}] be the vector of group-specific probabilities for unit i. We set the prior density for ω_i to an asymmetric Dirichlet distribution,

    ω_i ~ Dir(a_{i1}, a_{i2}, ..., a_{iK_p}),    (3.6)

where the concentration parameters a_{ik} are strictly positive. Conditional on ω_i, the group membership g_i is assumed to be drawn from a multinomial distribution,

    g_i ~ Multinomial(ω_i),  i.e.,  P(g_i = k | ω_i) = ω_{ik},  for k = 1, ..., K_p.    (3.7)

It can be shown that the posterior distribution of ω_i given g_i is also Dirichlet, with modified hyperparameters: Dir(a_{i1} + 1{g_i = 1}, a_{i2} + 1{g_i = 2}, ..., a_{iK_p} + 1{g_i = K_p}).
Hence we can directly sample ω_i from its posterior distribution. Another important property that makes the Dirichlet distribution a suitable prior is that we can tie the prior probabilities directly to the expected value of ω_{ik},

    E(ω_{ik}) = a_{ik} / Σ_{k'=1}^{K_p} a_{ik'}.    (3.8)

To integrate the researcher's prior knowledge, one only needs to choose a set of {a_{ik}} such that the expected probability matches her subjective probability for each group. Since the membership probabilities ω_{ik} are updated based on the observations, and we allow for reallocating units and changing the number of groups along the MCMC sampling, a revision of our block Gibbs sampler is needed to adjust for such changes. The details of the new algorithm are presented in Appendix A.2. In practice, we can restrict Σ_{k=1}^{K_p} a_{ik} to be 1, so that a_{ik} represents both the subjective probability of unit i belonging to group k and the prior mean of ω_{ik}.

Draws from the joint posterior distribution can be obtained by blocked Gibbs sampling. The proposed algorithm is based on Ishwaran and James (2001) and Walker (2007). Though the algorithm in Ishwaran and James (2001) has been widely used for sampling stick-breaking priors, it alone cannot fulfill our need to estimate the number of groups without any predetermined level or upper bound, since it requires a finite-dimensional prior and truncation. To avoid truncating the number of groups K, we implement the slice sampling proposed by Walker (2007) and modify the framework of Ishwaran and James (2001) with additional posterior sampling steps. Using the conjugate priors specified in Section 3.1, each parameter is directly drawn from its posterior distribution.

In Appendix A, we provide detailed derivations of the conditional distributions over which the Gibbs sampler iterates. We focus on the time-varying grouped random effects model with grouped heteroskedasticity, which is the most sophisticated specification.
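The conjugate update in (3.6)-(3.8) can be sketched in a few lines. The prior vector a_i = (0.7, 0.2, 0.1) is a hypothetical subjective probability whose entries sum to 1, so each a_ik doubles as the prior mean E(ω_ik).

```python
import numpy as np

def update_omega(a_i, g_i, rng=None):
    """Dirichlet-multinomial update: the posterior of omega_i given g_i is
    Dirichlet with the g_i-th concentration parameter incremented by 1."""
    rng = np.random.default_rng(rng)
    post = np.asarray(a_i, dtype=float).copy()
    post[g_i] += 1.0  # Dir(a_i1 + 1{g_i=1}, ..., a_iKp + 1{g_i=Kp})
    return rng.dirichlet(post), post

a_i = [0.7, 0.2, 0.1]                        # subjective group probabilities
omega_draw, post = update_omega(a_i, g_i=1)  # unit observed in group 2
post_mean = post / post.sum()                # posterior analogue of (3.8)
```

Observing g_i = 1 (the second group, with zero-based indexing) pulls the posterior mean of ω_i2 from 0.2 up to 0.6, exactly the self-correcting behavior of an imprecise subjective prior described above.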
Other specifications can be estimated by merely ignoring the time effects in the α's or shutting down the heteroskedasticity.

One feature of our proposed block Gibbs sampler is that it partitions the N units into G groups and, at the same time, generates posterior draws for the parameters. This Gibbs sampler and our BGRE estimator inevitably remind us of one of the most popular clustering algorithms in unsupervised machine learning: the Kmeans algorithm. Indeed, the Kmeans algorithm plays a crucial role in BM and BLM, who estimate grouped fixed effects from the frequentist point of view. In this subsection, we illustrate the similarity and connection between our block Gibbs sampler and the Kmeans algorithm in the limit.

We start with the Kmeans clustering algorithm. Consider a set of observations (z_1, z_2, ..., z_N), where each observation contains the dependent variable and the covariates, (y_i, x_i). Kmeans clustering aims to partition the N observations into K sets so as to minimize the within-cluster sum of squares,

    min_{{C_k}} Σ_{k=1}^K Σ_{i∈C_k} ||z_i − µ_k||²,  where  µ_k = (1/|C_k|) Σ_{i∈C_k} z_i.    (3.9)

The algorithm alternates between reassigning points to clusters and recomputing the means. In the assignment step, one computes the squared Euclidean distance from each point to each cluster mean and assigns each observation to the cluster with the nearest mean. In the update step, the algorithm recalculates the centroid of the observations assigned to each cluster and updates µ_k for all k.

In the block Gibbs sampler, we assign unit i to group k conditional on the remaining parameters, with probability

    p(g_i = k | ρ, β, α, Σ, G^(i), Y, X) = p(y_i | ρ, β_i, α_k, σ_k, Y, X) 1(u_i < π_k) / Σ_{j=1}^{K*} p(y_i | ρ, β_i, α_j, σ_j, Y, X) 1(u_i < π_j)
    = c_ik exp[−½ (y_i − ρy_{−1,i} − x_iβ_i − α_k)′ Σ_k⁻¹ (y_i − ρy_{−1,i} − x_iβ_i − α_k)] / Σ_{j=1}^{K*} c_ij exp[−½ (y_i − ρy_{−1,i} − x_iβ_i − α_j)′ Σ_j⁻¹ (y_i − ρy_{−1,i} − x_iβ_i − α_j)]
    = c_ik exp[−½ (ỹ_i − α_k)′ Σ_k⁻¹ (ỹ_i − α_k)] / Σ_{j=1}^{K*} c_ij exp[−½ (ỹ_i − α_j)′ Σ_j⁻¹ (ỹ_i − α_j)],    (3.10)

where c_ik = (2π)^(−T/2) |Σ_k|^(−1/2) 1(u_i < π_k), ỹ_i = y_i − ρy_{−1,i} − x_iβ_i, and y_{−1,i} denotes the lagged values of y_i. If we assume homoskedasticity, i.e., Σ_k = Σ for all k, then in the limit as Σ → 0 the value of p(g_i = k | ρ, β, α, Σ, G^(i), Y, X) approaches zero for all k except the one corresponding to the smallest weighted distance (ỹ_i − α_k)′ Σ_k⁻¹ (ỹ_i − α_k). In this case, this step is akin to the assignment step of Kmeans, but using a weighted Euclidean distance. Then, conditional on the newly estimated group memberships, we update the group random effects α_k through a Bayesian linear regression using only the units in group k. This step exactly recalculates the means of the new clusters, establishing the equivalence of the update step.

With this similarity in mind, it is natural to include the Kmeans algorithm in our Monte Carlo experiment and explore its performance relative to our BGRE estimator in terms of clustering accuracy. Notably, following BLM, we construct a 2-step GRE estimator equipped with the Kmeans algorithm in the first step. The performance of this 2-step estimator is assessed in Section 4.3.3.

In this section, we conduct Monte Carlo simulation experiments to examine the performance of various Grouped Random Effects (GRE) estimators under different data generating processes (DGPs) and prior assumptions. These DGPs differ in whether the random effects are time-invariant or time-varying and in whether heterogeneity is introduced in the variance of the innovations.
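The limiting argument can be checked numerically. The sketch below evaluates a homoskedastic version of the assignment probability in (3.10) (Σ_k = σ²I, with the slice indicators 1(u_i < π_k) set aside for illustration) and shows that shrinking σ² collapses it onto the group with the smallest distance, i.e., the Kmeans assignment step; all numbers are illustrative.

```python
import numpy as np

def assignment_probs(y_tilde, alphas, sigma2):
    """Group probabilities as in (3.10) with Sigma_k = sigma2 * I and the
    slice indicators dropped. Working on the log scale keeps the ratio
    numerically stable."""
    d = np.array([((y_tilde - a) ** 2).sum() for a in alphas])  # sq. distances
    logp = -0.5 * d / sigma2
    logp -= logp.max()            # log-sum-exp shift
    p = np.exp(logp)
    return p / p.sum()

y_tilde = np.array([1.0, 1.1])    # y_i - rho*y_{-1,i} - x_i*beta_i
alphas = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([3.0, 3.0])]
soft = assignment_probs(y_tilde, alphas, sigma2=1.0)   # smooth probabilities
hard = assignment_probs(y_tilde, alphas, sigma2=1e-6)  # ~ one-hot
nearest = int(np.argmin([((y_tilde - a) ** 2).sum() for a in alphas]))
```

As σ² → 0, `hard` puts essentially all of its mass on `nearest`, mirroring a Kmeans assignment, while σ² = 1 retains genuine posterior uncertainty about the membership.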
Such designs allow us to examine not only how our approach performs under DGPs with particular features, but also how reliably it estimates the number of clusters. We consider a setting with sample size N = 100 and T = 11, where the last period is reserved for evaluating the one-step-ahead forecasts. The true number of groups is K = 4. Given N and K, we partition the entire sample into K balanced blocks with N/K units in each block. For each DGP, 100 datasets are generated, and we run the block Gibbs sampler on each dataset for several thousand iterations after a burn-in of 5,000 draws.
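The balanced partition of units into groups can be sketched as follows; the remainder rule for non-integer N/K follows the footnote to the DGPs, and the labels are illustrative.

```python
import numpy as np

def balanced_groups(N, K):
    """Assign units to K balanced blocks: floor(N/K) units in groups
    1..K-1; the last group absorbs any remainder."""
    base = N // K
    sizes = [base] * (K - 1) + [N - base * (K - 1)]
    return np.repeat(np.arange(1, K + 1), sizes)

g = balanced_groups(100, 4)   # 25 units per group when N/K is an integer
```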
The Monte Carlo simulation is based on the dynamic panel data model in (2.1), in which we suppress the exogenous predictors x_it for simplicity. In short, we consider four linear dynamic DGPs in this section. DGP1 and DGP2 involve time-invariant random effects, while time-varying random effects are allowed in DGP3. Moreover, DGP1 and DGP3 assume homoskedasticity, whereas DGP2 has heteroskedastic innovations. DGP4 is a standard panel data model without a group structure. Throughout these DGPs, the random effects α_k and the idiosyncratic errors ε_it are normally distributed, independent across i, k, and t, and mutually independent; ε_it is independent of all regressors. The data are simulated according to the following DGPs:

DGP1:
Time-invariant grouped random effects, homoskedasticity (Grp Ti-Homo). This DGP is the most naive panel data model with a group pattern in the random effects:

    y_it = α_{g_i} + ρ y_{it−1} + ε_it,

with ε_it ~ iid N(0, σ²_ε) and α_k ~ iid N(k, σ²_α) for k = 1, 2, ..., K.

DGP2:
Time-invariant grouped random effects, heteroskedasticity (Grp Ti-Hetero). This DGP incorporates heteroskedasticity, which leads to a slightly more complicated process that is harder to estimate:

    y_it = α_{g_i} + ρ y_{it−1} + ε_it,

with ε_it ~ iid N(0, σ²_{g_i}), where the group-level variance σ²_k is a linear function of the group index k, and α_k ~ iid N(k, σ²_α) for k = 1, 2, ..., K.

DGP3:
Time-varying grouped random effects, homoskedasticity (Grp Tv-Homo). So far, we have focused on time-invariant models. But when estimating on real data, it is reasonable to believe that the random effects could have a time pattern. Hence, we introduce various patterns of time variation:

    y_it = α_it + ρ y_{it−1} + ε_it,

with ε_it ~ iid N(0, σ²_ε) and α_it ~ iid N(α_{g_i,t}, σ²_α), where the group mean α_{g_i,t} varies across periods and groups as depicted in Figure 1. To enrich the patterns of time-varying random effects, we construct 4 different paths. Group 1 has a constant mean for α_i. The means for α_i in group 2 are also constant but experience a structural change at T = 5. Groups 3 and 4 are equipped with monotonically increasing and decreasing means, respectively.

[Figure 1: Mean of Random Effects for DGP3, K = 4]

(Footnote: If N/K is not an integer, we use ⌊N/K⌋ units for groups 1, 2, ..., K − 1; the last group contains the residual units.)

DGP4:
Time-invariant random effects, homoskedasticity, no group structure (Std Ti-Homo). This is the standard panel data model with unit-specific random effects and identical variance of the innovations:

    y_it = α_i + ρ y_{it−1} + ε_it,

with ε_it ~ iid N(0, σ²_ε) and α_i ~ iid N(0, σ²_α).

We consider four types of BGRE estimators that differ in the assumptions made on the random effects (RE) and the variance of the errors: (1) time-invariant grouped RE with homoskedasticity (Ti-Homo); (2) time-invariant grouped RE with heteroskedasticity (Ti-Hetero); (3) time-varying grouped RE with homoskedasticity (Tv-Homo); and (4) time-varying grouped RE with heteroskedasticity (Tv-Hetero). For instance, the Ti-Homo estimator assumes the true model to have time-invariant grouped random effects and the variance of the error terms to be constant across units. Besides the results shown below, we modify the DGPs and conduct further experiments using (a) a larger variance of the error terms, (b) a shorter time span, and (c) a different true number of groups K. These additional results are available in Appendix E.

Regarding alternative estimators, we consider the following Bayesian estimators that make different prior assumptions on the random effects α_i. (1) Bayesian pooled estimator (Pooled): α_i is treated as a common parameter, like ρ; this means all units share the same level of α_i. (2) Flat-prior estimator (Flat): assume p(α_i) ∝ 1; this amounts to drawing samples from a posterior whose mode is the MLE estimate. Given the estimate of the common parameter, there is no pooling across units, and the α_i's are estimated using only their own history. (3) Parametric-prior estimator (Param): assume α_i ~ N(µ, π), where a Normal-Inverse-Gamma hyperprior is further imposed on (µ, π); this prior can be thought of as a limit case of the DP prior when the concentration parameter a → 0, so there is only one cluster and (µ, π) are directly drawn from the base distribution.

Table 2 shows the estimation comparison among the alternative predictors. For DGP1, the Ti-Homo and Ti-Hetero estimators are the best in every aspect. This is as expected, since they correctly model the time-invariant random effects. Among these two estimators, allowing for group-level heteroskedasticity lowers the estimated number of groups, as Ti-Hetero underestimates the number of groups. The coverage probability, however, is not well-controlled: both are below the nominal coverage of 0.95. The Flat estimator also performs well in terms of the RMSE of ρ. Nonetheless, its coverage probability is relatively low: only 23% of the credible sets contain the true values, due to the relatively large biases in the α_i. The remaining predictors are considerably worse.

(Footnote: For the Tv-Homo and Tv-Hetero estimators, as we allow for time effects in α_i, we use the most recent α_iT to make the one-step-ahead prediction. This is equivalent to assuming that the law of motion of α_it is a random walk. Modeling the trend of α_it would result in a more accurate forecast, but this is beyond the scope of this paper.)

(Footnote: The Normal-Inverse-Gamma hyperprior for (µ, π) used in the Monte Carlo simulation is as follows: µ | π ~ N(m, vπ), with m equal to the pooled OLS estimator of α_i, and π ~ IG(ν_π/2, δ_π/2).)
This implies that even the correctly specified estimators tend to slightly underestimate the number of groups: for DGP1, the estimated average K equals 3.60 while the truth is 4.

For DGP2, we keep the time-invariant random effects while introducing heteroskedasticity. Ti-Homo and Ti-Hetero still dominate. Ti-Hetero generates the best results, with an accurate estimate of the number of groups, as it correctly models the heteroskedasticity, which in turn improves the estimation efficiency. The Flat estimator closely follows them, and the rest are worse.
Regarding DGP3, when time-varying random effects are introduced into the model, the Tv-Homo and Tv-Hetero estimators yield the best performance, and the estimated average K's are close to the truth. These two estimators accept somewhat larger biases in exchange for small standard deviations and short credible intervals. It is worth noting that, unlike Ti-Homo in DGP1 and Ti-Hetero in DGP2, though correctly specified, their bias for α_i is still comparatively high. This is because, for simplicity, we do not model the law of motion of α_it and simply assume α_{i,T+1} = α_{iT}, which results in a large bias in α_i.

As regards DGP4, which does not have a group structure, the Flat estimator is the best, since it does not pool cross-sectional information but estimates the unit-specific random effects. All of the BGRE estimators have almost identical performance, with the estimated number of groups close or equal to 1. Since the Pooled and Param estimators also assume no group structure, both produce estimates similar to the BGRE estimators.

Table 3 reports the predictive performance of a range of parametric forecasts. For DGP1, the best forecasts are generated by the Ti-Homo estimator, as it is correctly specified in this environment. It has the smallest RMSFE, the shortest average length of the credible set, correct coverage probability, the largest LPS, and the smallest CRPS. Although it allows for heteroskedasticity along with the time-invariant random effects, Ti-Hetero generates point forecasts as accurate as Ti-Homo's. But Ti-Hetero introduces additional uncertainty, revealed by a slightly wider credible set and a worse density forecast. Moreover, the estimators involving time-varying random effects (Tv-Homo and Tv-Hetero) worsen the forecasts. Finally, incorrectly imposing no latent group pattern substantially deteriorates the predictive performance in all respects.
Regarding DGP3, where time-varying random effects are introduced into the model, the Tv-Homo and Tv-Hetero estimators yield the best performance, and the estimated average $K$s are close to the truth. Their biases are arguably low, and they come with small standard deviations and short credible intervals. It is worth noting that, unlike Ti-Homo in DGP1 and Ti-Hetero in DGP2, although correctly specified, their bias for $\alpha_i$ is still comparatively high. This is because, for simplicity, we do not model the law of motion of $\alpha_{it}$ and simply assume $\alpha_{iT+1} = \alpha_{iT}$, which results in a large bias in $\alpha_i$.

As regards DGP4, which has no group structure, the Flat estimator is the best, since it does not pool cross-sectional information but estimates unit-specific random effects. All of the BGRE estimators perform almost identically, with the estimated number of groups close or equal to 1. Since the Pooled and Param estimators assume no group structure, both deliver estimates similar to those of the BGRE estimators.

Table 3 reports the predictive performance of a range of parametric forecasts. For DGP1, the best forecasts are generated by the Ti-Homo estimator, as it is correctly specified in this environment: it has the smallest RMSFE, the shortest average length of the credible set, correct coverage probability, the largest LPS, and the smallest CRPS. Although it additionally allows for heteroskedasticity along with the time-invariant random effects, Ti-Hetero generates a point forecast as accurate as Ti-Homo's, but it introduces extra uncertainty, revealed by a slightly wider credible set and a worse density forecast. Moreover, the estimators involving time-varying random effects (Tv-Homo and Tv-Hetero) worsen the forecast. Finally, incorrectly imposing no latent group pattern substantially deteriorates the predictive performance in all respects.

Table 2: Estimate Comparison

                 |------------------ rho-hat ------------------|  alpha-hat_i   Cluster
                 RMSE      Bias      Std      AvgL     Cov        Bias          Avg K
DGP 1 (Grp Ti Ho.)
  Ti-Homo        0.0198    0.0113    0.0120   0.0468   0.83      -0.0744        3.60
  Ti-Hetero      0.0202    0.0118    0.0120   0.0468   0.78      -0.0776        3.55
  Tv-Homo        0.2403    0.2387    0.0187   0.0712   0.06      -1.5255        1.82
  Tv-Hetero      0.2405    0.2389    0.0108   0.0689   0.07      -1.5301        1.89
  Pooled         0.2449    0.2447    0.0069   0.0268   0.00      -1.5588        1
  Flat           0.0369   -0.0344    0.0121   0.0469   0.23       0.2166        100
  Param          0.2711    0.2437    0.1148   0.4545   0.38      -1.5474        1
DGP 2 (Grp Ti He.)
  Ti-Homo        0.0226    0.0097    0.0150   0.0583   0.85      -0.0681        3.74
  Ti-Hetero      0.0112    0.0036    0.0082   0.0321   0.95      -0.0261        3.98
  Tv-Homo        0.1924    0.1885    0.0234   0.0893   0.14      -1.2289        11.44
  Tv-Hetero      0.0965    0.0894    0.0255   0.0979   0.30      -0.5925        3.42
  Pooled         0.2318    0.2316    0.0079   0.0310   0.00      -1.4905        1
  Flat           0.0493   -0.0469    0.0146   0.0567   0.06       0.2984        100
  Param          0.2576    0.2303    0.1115   0.4407   0.38      -1.4792        1
DGP 3 (Grp Tv Ho.)
  Ti-Homo        0.2726    0.2724    0.0119   0.0463   0.00      -2.1937        2.00
  Ti-Hetero      0.2741    0.2738    0.0121   0.0470   0.00      -2.2031        2.32
  Tv-Homo        0.0580    0.0525    0.0221   0.0860   0.33      -0.3679        3.89
  Tv-Hetero     0.0589    0.0534    0.0222   0.0863   0.33      -0.3743        3.85
  Pooled         0.1925    0.1923    0.0081   0.0314   0.00      -1.6381        1
  Flat           0.3230    0.3227    0.0126   0.0492   0.00      -2.5462        100
  Param          0.2172    0.1912    0.1021   0.4033   0.54      -1.6269        1
DGP 4 (Std Ti Ho.)
  Ti-Homo        0.2177    0.2170    0.0164   0.0635   0.01       0.0038        1.03
  Ti-Hetero      0.2168    0.2159    0.0165   0.0644   0.01       0.0035        1.02
  Tv-Homo        0.2216    0.2210    0.0161   0.0627   0.00       0.0037        1.01
  Tv-Hetero      0.2204    0.2198    0.0164   0.0638   0.00       0.0038        1.16
  Pooled         0.2204    0.2198    0.0161   0.0628   0.00       0.0037        1
  Flat           0.1838   -0.1817    0.0277   0.1076   0.00      -0.0032        100
  Param          0.2321    0.2201    0.0714   0.2856   0.06       0.0154        1
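The columns of Table 2 can be reproduced from Monte Carlo output along the following lines. This is a schematic sketch: `rho_hat`, `ci_lo`, and `ci_hi` stand for the posterior means and 95% credible-set bounds collected across repetitions, and the toy data are for illustration only:

```python
import numpy as np

def estimation_metrics(rho_hat, ci_lo, ci_hi, rho_true):
    """RMSE, bias, std, average credible-set length, and coverage
    of a vector of point estimates across Monte Carlo repetitions."""
    err = rho_hat - rho_true
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "Bias": float(np.mean(err)),
        "Std": float(np.std(rho_hat, ddof=1)),
        "AvgL": float(np.mean(ci_hi - ci_lo)),           # average interval length
        "Cov": float(np.mean((ci_lo <= rho_true) & (rho_true <= ci_hi))),
    }

rng = np.random.default_rng(1)
rho_true = 0.5
rho_hat = rho_true + rng.normal(0.01, 0.02, size=1000)   # toy estimates
ci_lo, ci_hi = rho_hat - 0.04, rho_hat + 0.04            # toy credible sets
print(estimation_metrics(rho_hat, ci_lo, ci_hi, rho_true))
```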
$\rho$ and $\alpha_i$. But Ti-Homo fails in the set and density forecasts, which illustrates the importance of modeling heteroskedasticity. Again, the remaining estimators suffer, apart from the Flat estimator.

In DGP3, Ti-Homo and Ti-Hetero do badly by not capturing the time effects in $\alpha_{g_i}$. Tv-Homo and Tv-Hetero are the best, beating the rest by a large margin, and are equally accurate in this setup.
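For reference, the density-forecast metrics used in Table 3, the log predictive score (LPS) and the continuous ranked probability score (CRPS), are available in closed form when the predictive density is Gaussian. The sketch below assumes such a density with mean `mu` and standard deviation `sigma`; the CRPS formula is the standard closed form for a normal predictive distribution:

```python
import numpy as np
from math import erf

def norm_pdf(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def norm_cdf(z):
    return 0.5 * (1 + np.vectorize(erf)(z / np.sqrt(2)))

def gaussian_lps(y, mu, sigma):
    """Average log predictive score of N(mu, sigma^2) at realizations y."""
    z = (np.asarray(y) - mu) / sigma
    return float(np.mean(-0.5 * z ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)))

def gaussian_crps(y, mu, sigma):
    """Average CRPS of a N(mu, sigma^2) predictive density (closed form)."""
    z = (np.asarray(y) - mu) / sigma
    crps = sigma * (z * (2 * norm_cdf(z) - 1) + 2 * norm_pdf(z) - 1 / np.sqrt(np.pi))
    return float(np.mean(crps))

y = np.array([0.1, -0.3, 0.2])   # toy realizations
print(gaussian_lps(y, mu=0.0, sigma=1.0))
print(gaussian_crps(y, mu=0.0, sigma=1.0))
```

A higher LPS and a lower CRPS indicate a better density forecast, matching the direction used in the text.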
The coverage probability for these two estimators is slightly lower than that of Ti-Homo and Ti-Hetero, in part due to the uncertainty introduced by the larger number of parameters of interest. The Pooled, Flat, and Param estimators neglect both the group structure and the time-varying random effects, and hence generate poor forecasts.

Regarding DGP4, all estimators besides Param produce comparable forecasts, as all of them detect no group structure in this environment (all estimated $K$s are close to 1). However, Param suffers from high variance, and it always generates the widest credible interval and the worst density forecast, as in the other DGPs.

This section conducts four sets of Monte Carlo simulation experiments that compare the BGRE estimator with several alternatives: (1) the GFE estimator proposed by BM, (2) a two-step GRE estimator with Kmeans, (3) the BGRE estimator with the subjective group prior, and (4) the BGRE estimator with the true $K$ imposed. The main text omits some of the estimators considered in the previous section and focuses on the correctly specified estimator for each DGP.

In addition to the four DGPs specified in Section 4.1, we design three new DGPs that inherit the main features of DGP1, DGP2, and DGP3, including the balanced group structure and the unit variance structure. However, instead of drawing $\alpha_{g_i}$ from the normal distribution, we use a constant $\alpha_k$ for each group. In this way, we can focus on the clustering results rather than running repetitions to average out the randomness brought by the random effects. We impose $K = 4$.

DGP5:
Time-invariant grouped fixed effects, homoskedasticity.

$$y_{it} = \alpha_{g_i} + \rho y_{it-1} + \varepsilon_{it},$$
with $\rho$ as in the baseline DGPs, $\alpha_k = k$ for $k = 1, \ldots, K$, and $\varepsilon_{it} \overset{iid}{\sim} N(0, 1)$.

DGP6:
Time-invariant grouped fixed effects, heteroskedasticity.

$$y_{it} = \alpha_{g_i} + \rho y_{it-1} + \varepsilon_{it},$$

with $\rho$ as in the baseline DGPs, $\alpha_k = k$ for $k = 1, \ldots, K$, and $\varepsilon_{it} \overset{iid}{\sim} N\!\left(0, \sigma^2_{g_i}\right)$, where $\sigma_k$ varies across the groups $k = 1, \ldots, K$.

DGP7:
Time-varying grouped fixed effects, homoskedasticity.

$$y_{it} = \alpha_{g_i t} + \rho y_{it-1} + \varepsilon_{it},$$

with $\varepsilon_{it} \overset{iid}{\sim} N(0, 1)$ and $\alpha_{g_i t}$ varying across periods and groups as depicted in Figure 1.

In this experiment, we compare our BGRE estimator with BM's GFE estimator. In particular, we assess the performance of the point forecast of $y$ and the accuracy of the coefficient estimates (the group random effects $\alpha$ and the common parameter $\rho$).

To compare the estimators, we use the default numerical settings in BM. It is worth noting that BM relies on information criteria to select the optimal number of groups ex post. Hence we consider at most 10 groups and estimate the number of groups $K$ according to the following Akaike information criterion (AIC):

$$\mathrm{AIC}(k) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_{it} - \hat\rho^{(k)} y_{i,t-1} - \hat\beta_i^{(k)\prime} x_{i,t-1} - \hat\alpha^{(k)}_{\hat g_i t} \right)^2 + \hat\sigma^{2(k)} \, \frac{k(T + N - k)}{NT},$$

where $\hat\sigma^{2(k)}$ is a consistent estimate of the variance of $\varepsilon_{it}$:

$$\hat\sigma^{2(k)} = \frac{1}{NT - K_{\max} T - N - (p+1)} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( y_{it} - \hat\rho^{(k)} y_{i,t-1} - \hat\beta_i^{(k)\prime} x_{i,t-1} - \hat\alpha^{(k)}_{\hat g_i t} \right)^2.$$

The results are shown in Table 4. As BM propose two algorithms, we present the results for four versions of the GFE estimator. The first two estimators (GFE_a0 and

The default settings are as follows: (1) Number of groups = 4; Number of covariates = 1; Standard errors: 0 (no standard errors). (2) For algorithm 0, Number of simulations = 100. (3) For algorithm 1, Number of simulations = 10, Number of neighbors = 10, Number of steps = 10.

We also tried the alternative choice $\hat\sigma^{2} \, \frac{kT + N + p + 1}{NT} \ln(NT)$ for the penalty. This corresponds to the default BIC used in Bonhomme and Manresa (2015). We found that, in this case, the BIC selected the smallest possible number of groups for all DGPs, i.e., no group structure, whereas the truth is $K =$
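A grouped panel such as DGP5 can be simulated in a few lines. The value of $\rho$ below is illustrative only, since the paper's exact value is not reproduced here; groups are balanced and $\alpha_k = k$ as in the DGP description:

```python
import numpy as np

def simulate_dgp5(N=100, T=10, K=4, rho=0.5, seed=0):
    """Simulate y_it = alpha_{g_i} + rho * y_{i,t-1} + eps_it with
    constant group effects alpha_k = k and eps_it ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    g = np.repeat(np.arange(1, K + 1), N // K)   # balanced group memberships
    alpha = g.astype(float)                      # alpha_k = k
    y = np.zeros((N, T + 1))
    y[:, 0] = alpha / (1 - rho)                  # initialize at the stationary mean
    for t in range(1, T + 1):
        y[:, t] = alpha + rho * y[:, t - 1] + rng.normal(size=N)
    return y[:, 1:], g

y, g = simulate_dgp5()
print(y.shape)  # (100, 10)
```

DGP6 follows by drawing the innovations with group-specific standard deviations, and DGP7 by letting `alpha` depend on `t`.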
4. Moreover, other forms of the BIC would always select the largest $K$. Due to the inaccurate estimate of the group structure and the substantially poor performance, we do not show the results with the default BIC.

These two algorithms can generate different estimates, as shown in the case of DGP6 below.
GFE_a1) are equipped with the Iterative and the Variable Neighborhood Search algorithm, respectively. We impose the true number of groups $K$ for the other two estimators (GFE*_a0 and GFE*_a1), i.e., we do not perform model selection but set $\hat K = K = 4$. Even with the true $K$ imposed, GFE*_a0 and GFE*_a1 are still straggling and worse than their counterparts with $\hat K$ selected by the information criterion. GFE*_a0 and GFE*_a1 generate relatively high biases for both $\alpha_i$ and $\rho$. In the case of DGP7, Tv-Homo and Tv-Hetero still perform better than the GFE estimators whose models are chosen by the information criterion. Furthermore, Tv-Homo and Tv-Hetero perform only marginally worse than the GFE*_a0 and GFE*_a1 estimators with the true $K$ imposed. This is because they overestimate the number of groups in some posterior draws, and hence, on average, the posterior mean forecast and estimates are slightly off.

Moreover, the accuracy of the GFE estimator is profoundly affected by the choice of the information criterion. We implement several information criteria proposed by Bai and Ng (2002) in this Monte Carlo experiment and find that no single criterion consistently selects the correct number of groups, or even comes close to the truth. As the GFE estimator is designed for the time-varying model, we finally select the AIC given above, which chooses a model closest to the true model in the time-varying DGPs. But even this deliberately selected AIC fails to improve the performance of the GFE estimator in DGP7. Once we switch to other forms of the AIC or BIC, these results no longer hold.
These facts also emphasize the importance of not relying on ex post model selection, and they highlight the advantage of our Bayesian estimators.

In this section, we explore whether a subjective group structure improves the accuracy of forecasts and group clustering. We conduct two Monte Carlo experiments corresponding to the SGP prior defined in Section 3.1.2.

We consider five scenarios, each of which differs in the structure of the subjective prior probability and hence the prior group probability $\pi$. The exact specification is given in Table 5. The first three scenarios set the preset number of groups to the truth $K$, whereas
Table 4 (excerpt):

                 rho-hat / alpha-hat / Group
                 RMSFE     Error     Std      Bias      Error     Avg K
DGP 5 (Grp Ti Ho.)
  Ti-Homo        0.7806    0.0261    0.7801   0.0214   -0.1359    4.63
  Ti-Hetero      0.7829    0.0254    0.7825   0.0212   -0.1353    4.81
  Tv-Homo        0.8062    0.0904    0.8011   0.3261   -2.1184    1.00
  Tv-Hetero      0.7995    0.0883    0.7946   0.3262   -2.1183    1.52
  GFE_a0
  GFE_a1
  GFE*_a0
  GFE*_a1

Note: GFE* = GFE estimator with the true K; a1 = algorithm 1: Variable Neighborhood Search; a0 = algorithm 0: Iterative Search.
($\omega_{ik} = 1/K$ for all $i, k$) in scenario 3. Scenario 2 is an intermediate case where the researcher is less confident in her knowledge and correctly assigns a unit to its group with a prior probability of 70% (the other groups equally split the remaining 30%). For scenarios 4 and 5, the number of groups differs from the truth. We assume the researcher divides all units into $K' \neq K$ even groups with a prior probability of 100%.

Table 5: Simulation Design: Subjective Group Probability

Scenario   Groups   Prior structure
1          K        very confident, assign 100% to the correct group
2          K        less confident, assign 70% to the correct group
3          K        uninformed, evenly assign 1/K x 100% to each group
4          K - 1    confident, evenly divide units into K - 1 groups with 100%
5          K + 1    confident, evenly divide units into K + 1 groups with 100%

Remember that the prior knowledge is the most accurate in scenario 1, where the researcher effectively knows the true group structure. In this regard, a clear gain in estimation emerges, as the RMSE for $\rho$ and the biases for $\rho$ and $\alpha_i$ generated by SGP-RE1

The last two scenarios aim to evaluate the performance of the SGP prior when the number of groups is wrong. Instead of randomly assigning a unit to each group with a set of probabilities, we assume the econometrician is confident in her prior and sets 100% for a particular group. In particular, we assign the first $N/K'$ units to group 1, the next $N/K'$ units to group 2, and so on. We also run other designs for scenarios 4 and 5 with different prior probabilities. The results show that, as long as a certain share of units is correctly clustered into groups, the performance of the SGP-RE estimators is slightly better than that of the BGRE estimator. The full results are presented in Appendix E.5.
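The subjective prior probabilities of Table 5 can be encoded as an $N \times K$ matrix $\omega$, where $\omega_{ik}$ is the prior probability that unit $i$ belongs to group $k$. The sketch below assumes the true memberships `g` are known, as they are in the simulation design:

```python
import numpy as np

def subjective_prior(g, K, p_correct):
    """Build omega (N x K): each unit places prior probability p_correct on
    its true group, and the remainder is split evenly over the other groups."""
    N = len(g)
    omega = np.full((N, K), (1.0 - p_correct) / (K - 1))
    omega[np.arange(N), g] = p_correct
    return omega

g = np.array([0, 1, 2, 3, 0, 1])                    # true group labels (0-based)
omega1 = subjective_prior(g, K=4, p_correct=1.0)    # scenario 1: 100% correct
omega2 = subjective_prior(g, K=4, p_correct=0.7)    # scenario 2: 70% correct
omega3 = np.full((len(g), 4), 1.0 / 4)              # scenario 3: uninformed
print(omega2[0])  # [0.7 0.1 0.1 0.1]
```

Scenarios 4 and 5 follow the same construction with $K' = K - 1$ or $K' = K + 1$ columns and block assignment of units to groups.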
are the smallest among all scenarios. Although the preset number of groups equals the truth ($K' = K$) in scenario 3, the uninformative prior forces the algorithm to consider other, incorrect groups with a large probability ($= 1 - 1/K$), and hence deteriorates the performance of both the estimates and the forecasts. In terms of the one-step-ahead forecast, SGP-RE1 leads the rest by scoring the lowest values on each metric (and the highest LPS), closely followed by SGP-RE2. SGP-RE3 suffers in terms of the point and set forecasts. As for scenarios 4 and 5, despite the flexibility of allowing $a$ and $K'$ to change along the MCMC sampling, the incorrect specification of the group number deteriorates the estimation, and both fail to deliver reliable estimates of $\rho$ and $\alpha$. Nonetheless, such prior structures help the point and density forecasts. Both estimators beat the benchmark and generate LPS values comparable to those of SGP-RE1 and SGP-RE2.
This valuable improvement mainly results from the fact that the algorithm can exploit prior knowledge of the group structure that correctly partitions merely a fraction of the units, and adapt the number of groups accordingly.

In short, regarding the overall performance of the SGP prior, the best case is when the researcher has a relatively confident prior and knows the true number of groups. In this case, the SGP-RE estimator dominates the Tv-Hetero estimator from every angle. In practice, however, we rarely come up with such a precise prior, owing to an incomplete understanding of the population behind the data. Instead, we might be less confident in our knowledge or even specify more or fewer groups than the truth. Under these circumstances, the SGP-RE estimator can still deliver a better density forecast, because it exploits the prior information and adapts via the scheme featured in our Bayesian method.

In this section, we compare our BGRE estimator with a two-step GRE estimator, where units are clustered into groups in the first step using
the Kmeans algorithm, and the model is then estimated in the second step with group-specific heterogeneity. Unlike BLM, we implement the Bayesian framework in the second step to echo the other Bayesian estimators presented in the previous section. This two-step procedure allows us to examine the clustering accuracy of Kmeans relative to our full Bayesian estimates, as the two-step GRE estimators can be viewed as GRE estimators with the group structure fixed in the first step.
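The first step of the two-step procedure can be sketched with scikit-learn's KMeans applied to the unit-level trajectories. Only the clustering step is shown; the second (Bayesian) estimation step is not reproduced here, and the simulated data are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def two_step_groups(y, K):
    """Step 1: cluster N units on their T-dimensional trajectories.
    y is an N x T array; returns estimated group labels."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(y)
    return km.labels_

rng = np.random.default_rng(0)
g = np.repeat(np.arange(4), 25)                       # 4 true groups, N = 100
y = g[:, None] * 2.0 + rng.normal(0, 0.5, (100, 8))   # well-separated trajectories
labels = two_step_groups(y, K=4)
print(len(np.unique(labels)))  # 4
```

In the second step, the estimated `labels` would replace the latent memberships, so any clustering error made here propagates into the group-level coefficient estimates, which is the mechanism behind the results discussed below.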
Note: the value of each relative metric is capped at 100% to enhance readability. The number in each subpanel represents the (original) value of the metric for the benchmark model (Tv-Hetero).
Figure 3: Monte Carlo Experiment: Forecast, SGP Prior
We report the results for the estimates and forecasts in Figures 4 and 5. Each bar represents the performance of the two-step GRE estimators relative to that of the original GRE estimator: a bar above zero indicates that the two-step estimator underperforms the benchmark, while the two-step estimator is better when its bar is negative. The main models are correctly specified for each DGP, i.e., Ti-Homo for DGP1, Ti-Hetero for DGP2, and Tv-Homo for DGP3.

Figure 4 presents the point estimates for each DGP. We document the root mean squared error, absolute bias, standard deviation, and average length of the 95% credible set for the common parameter $\rho$, while the metric for the random effects is the absolute bias. According to these measures, the two-step GRE estimators perform worse than the Bayesian GRE estimators, as they introduce much higher bias in the estimates of $\rho$ (and hence a larger RMSE for $\rho$) and $\alpha_i$. It is worth noting that equipping the estimator with Kmeans barely affects the standard deviation and the average length of the 95% credible set of $\rho$.

The inferior performance of the two-step GRE estimators is due to the inaccurate estimate of the group structure. Table 6 reports the estimated number of groups from the two-step GRE estimators with Kmeans and from the BGRE estimators. Regarding the clustering performance, the Kmeans algorithm severely underestimates the number of groups, preferring far fewer groups, while the true number is 4. Meanwhile, our BGRE approach estimates the number of groups accurately, though it slightly underestimates it in DGP1.

Table 6: Number of groups, Kmeans vs. BGRE

          DGP 1   DGP 2   DGP 3
Kmeans     2.20    2.20    2.26
BGRE       3.60    3.98    3.85

Figure 5 shows the point, set, and density forecasts for each DGP.
As Kmeans fails to estimate the group structure, none of the two-step GRE estimators outperforms the GRE estimators. That is, Kmeans clustering does not help make a more accurate forecast; instead, it generates much higher forecast bias and standard deviation. The full results are presented in Appendix E.4.
Note: the value of each relative metric is capped at 200% to enhance readability.
Figure 5: Monte Carlo Experiment: Forecast, Two-Step GRE with Kmeans
Note: the value of each relative metric is capped at 50% to enhance readability.
In this section, we illustrate the use of the Bayesian Grouped Random Effects estimators in a cross-firm study. We revisit the investment regression and use a different version of the dynamic grouped panel model to forecast the investment rate for a panel of firms across all industries. Instead of using the traditional Tobin's Q-type investment regression, we implement a new scheme proposed by Gala et al. (2019), who directly estimate the corporate investment rate without Tobin's Q. Again, our main focus is the one-step-ahead point, set, and density forecast. Due to space limitations, we only report the forecast results for the most recent year in the main text. Summary statistics and additional implementation details are provided in Appendices F and G.

We consider a general model with grouped latent heterogeneity in $\alpha_i$. Following Hsiao and Tahmiscioglu (1997) and Gala et al. (2019), the investment equation is specified as

$$\left(\frac{I}{K}\right)_{it} = \alpha_{g_i t} + \rho \left(\frac{I}{K}\right)_{i,t-1} + \beta_{1i} \left(\frac{CF}{K}\right)_{i,t-1} + \beta_{2i} \ln K_{i,t-1} + \beta_{3i} \ln\left(\frac{Y}{K}\right)_{i,t-1} + \varepsilon_{it}, \qquad (5.1)$$

where the capital stock $K_{it}$ is defined as net property, plant, and equipment; $I_{it}$ is capital investment; $CF_{it}$ is a liquidity variable defined as cash flow minus dividends; $Y_{it}$ is end-of-year sales; and the $\varepsilon_{it}$ are normally distributed error terms. The subscript $i$ denotes companies, and $t$ denotes time. Unlike the commonly specified investment equation using Tobin's Q, the additional terms, including the natural logarithm of lagged capital and the sales-to-capital ratio, are based on the regression proposed by Gala et al. (2019). The lagged value of the investment rate is included as an explanatory variable to avoid endogeneity problems.

As we focus on forecasting, we can relax a few assumptions to achieve better predictive performance.
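Constructing the regressors in (5.1) from a firm-year panel is mechanical. The sketch below assumes a pandas DataFrame with hypothetical column names `capx` (capital investment), `ppent` (net property, plant, and equipment), `cf` (cash flow minus dividends), and `sale` (sales); the investment rate divides investment by the beginning-of-year capital stock, as in the footnote:

```python
import pandas as pd
import numpy as np

def build_regression_panel(df):
    """Create (I/K), (CF/K), ln K, ln(Y/K) and their one-year lags.
    df holds one row per firm-year."""
    out = df.sort_values(["firm", "year"]).copy()
    # investment rate: capex over beginning-of-year capital stock
    out["ik"] = out["capx"] / out.groupby("firm")["ppent"].shift(1)
    out["cfk"] = out["cf"] / out["ppent"]
    out["lnk"] = np.log(out["ppent"])
    out["lnyk"] = np.log(out["sale"] / out["ppent"])
    for col in ["ik", "cfk", "lnk", "lnyk"]:
        out[col + "_lag"] = out.groupby("firm")[col].shift(1)
    return out.dropna()

df = pd.DataFrame({
    "firm": ["A"] * 4 + ["B"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "capx": [1.0, 1.2, 1.1, 1.3, 2.0, 2.1, 2.2, 2.0],
    "ppent": [10.0, 11, 12, 12.5, 20, 21, 22, 23],
    "cf": [2.0, 2.2, 2.1, 2.4, 3.0, 3.1, 3.3, 3.2],
    "sale": [15.0, 16, 17, 18, 30, 31, 33, 34],
})
panel = build_regression_panel(df)
print(panel.shape[0])  # 4 usable firm-years after lagging
```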
These assumptions include time-invariant random effects $\alpha_{g_i}$, homoskedasticity in $\sigma_i$, and homogeneous coefficients for all dependent variables ($\beta_{i \cdot} = \beta_i$). Table 7 summarizes the estimators and their properties that we consider in this section. The implementation of time-invariant RE and homoskedasticity is similar to the one in the previous section, i.e., we construct four versions of the BGRE estimator: Ti-Homo, Ti-Hetero, Tv-Homo, and Tv-Hetero. Despite the fact that homogeneous slopes have been frequently rejected in empirical

(Footnote: The investment rate for a firm in a particular year is defined as the fraction of capital expenditures in property, plant, and equipment in terms of the beginning-of-year capital stock.)
Table 7: Estimators and their properties

                        Time-invariant α_i   Homoskedasticity   Group structure
Homogeneous coef.
  Ti-Homo                       X                   X                  X
  Ti-Hetero                     X                                      X
  Tv-Homo                                           X                  X
  Tv-Hetero                                                            X
  Flat                          X                   X
  Pooled                        X                   X
  Param                         X                   X
Heterogeneous coef.
  Ti-Homo                       X                   X                  X
  Ti-Hetero                     X                                      X
  Tv-Homo                                           X                  X
  Tv-Hetero                                                            X
  Flat                          X                   X
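To make the grouped structure in (5.1) concrete, the following minimal simulation generates a dynamic panel whose intercepts are shared within latent groups. The group count, coefficient values, and the generic regressors standing in for $(CF/K)$, $\ln K$, and $\ln(Y/K)$ are illustrative assumptions, not the paper's estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 337, 15, 3                     # firms, years, latent groups (illustrative)
alpha = np.array([-0.3, 0.0, 0.4])       # group-level random effects, one per group
g = rng.integers(0, K, size=N)           # latent group membership g_i
rho = 0.5                                # common AR coefficient
beta = np.array([0.2, -0.1, 0.1])        # homogeneous slope coefficients
X = rng.normal(size=(N, T, 3))           # stand-ins for (CF/K), ln K, ln(Y/K)

Y = np.zeros((N, T))                     # investment rate (I/K)_it, initialized at 0
for t in range(1, T):
    eps = rng.normal(scale=0.1, size=N)  # normally distributed errors
    Y[:, t] = alpha[g] + rho * Y[:, t - 1] + X[:, t - 1, :] @ beta + eps
```

Units in the same group share an intercept $\alpha_{g_i}$ while slopes are common, which corresponds to the homogeneous-coefficient BGRE specification with time-invariant effects.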
The individual company data are obtained from the COMPUSTAT Annual database. To account for potential structural breaks and the advanced speed of capital accumulation in recent decades, our sample is composed of a balanced panel of firms for the years 2000 to 2019 that includes firms from all industries with no missing values in the accounting data. We keep only firm-years that have the non-missing information required to construct the primary variables of interest, namely the capital stock $K$, investment $I$, liquidity $CF$, and sales revenues $Y$. Further details on constructing the sample can be found in Appendix F. The final sample comprises 337 firms, with 20 observations for each firm.

To examine the performance of the various estimators with limited observations, we choose to use a rolling window of 15 years. In this sense, we create five balanced panels which end in years 2014, ..., 2018 ($t = T$), respectively. The observations in the next year ($t = T + 1$) are reserved for forecast evaluation.
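The rolling-window design above can be enumerated directly; the tuple representation of each panel below is an assumption for illustration.

```python
# Five 15-year rolling estimation windows over the 2000-2019 balanced panel;
# the window ending in T is used for estimation and year T + 1 for evaluation.
years = list(range(2000, 2020))
window = 15

panels = []
for T_end in range(2014, 2019):                 # windows end in 2014, ..., 2018
    in_sample = [y for y in years if T_end - window < y <= T_end]
    panels.append((in_sample, T_end + 1))       # (estimation years, hold-out year)

assert len(panels) == 5
assert panels[0][0] == list(range(2000, 2015))  # first window: 2000-2014
assert panels[-1][1] == 2019                    # last hold-out year
```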
We begin the empirical analysis by comparing the performance of the point, set, and density forecasts for the last panel (in-sample period: 2005 - 2018). We aim to forecast the investment rate in 2019. We consider all the model specifications depicted in Table 7, and their performance is presented in Table 8. Throughout the analysis, the Flat estimator serves as the benchmark, as it essentially assumes individual effects. In Table 8, the third column shows the RMSFE for the one-step-ahead forecast. For the panel considered in the table, we first notice that the best model is Tv-Hetero (time-varying random effects, heteroskedasticity) in the homogeneous coefficients specification. It outperforms the benchmark Flat estimator by 25%. Ti-Hetero also delivers an accurate point forecast, which suggests that time effects provide merely marginal improvement.

Under the heterogeneous coefficients specification, though all of the BGRE estimators beat the Flat estimator, their RMSFEs are relatively larger. This may arise from the fact that heteroskedasticity alone can capture a great amount of the individual effects, so additionally imposing heterogeneous coefficients in $\beta_i$ may overfit the model and lead to poor forecasts.

The fourth column documents the average number of latent groups in $\alpha_i$. Most of our BGRE estimators deem a group structure with more than six underlying components.
This paper studies the estimation and prediction of a dynamic panel data model with latent grouped random effects. We adopt a nonparametric Bayesian approach to identify the coefficients and the group membership in the random effects simultaneously. This approach avoids the severe issue introduced by ex-post model selection and allows us to incorporate any form of prior knowledge on the group structure. In Monte Carlo experiments, we show that the BGRE estimators have the edge over standard Bayesian estimators. Regarding clustering, the BGRE estimators generate comparable performance with the
Kmeans algorithm. Our empirical application to investment rates across firms reveals that the estimated latent group structure provides a great amount of flexibility and improves the point, set, and density forecasts. The present work raises interesting issues for further research. First, it may be appealing
References
Ando, T. and J. Bai (2016): "Panel data models with grouped factor structure under unknown group membership," Journal of Applied Econometrics, 31, 163–191.
Antoniak, C. E. (1974): "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems," The Annals of Statistics, 1152–1174.
Arellano, M. and S. Bond (1991): "Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations," The Review of Economic Studies, 58, 277–297.
Arellano, M. and O. Bover (1995): "Another look at the instrumental variable estimation of error-components models," Journal of Econometrics, 68, 29–51.
Bai, J. and S. Ng (2002): "Determining the number of factors in approximate factor models," Econometrica, 70, 191–221.
Bester, C. A. and C. B. Hansen (2016): "Grouped effects estimators in fixed effects models," Journal of Econometrics, 190, 197–208.
Biernacki, C., G. Celeux, and G. Govaert (2000): "Assessing a mixture model for clustering with the integrated completed likelihood," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 719–725.
Blundell, R. and S. Bond (1998): "Initial conditions and moment restrictions in dynamic panel data models," Journal of Econometrics, 87, 115–143.
Bonhomme, S., T. Lamadon, and E. Manresa (2019): "Discretizing unobserved heterogeneity," University of Chicago, Becker Friedman Institute for Economics Working Paper.
Bonhomme, S. and E. Manresa (2015): "Grouped patterns of heterogeneity in panel data," Econometrica, 83, 1147–1184.
Celeux, G., F. Forbes, C. P. Robert, and Titterington (2006): "Deviance information criteria for missing data models," Bayesian Analysis, 1, 651–673.
Chamberlain, G. (1980): "Analysis of covariance with qualitative data," The Review of Economic Studies, 47, 225–238.
Cheng, X., F. Schorfheide, and P. Shao (2019): "Clustering for Multi-Dimensional Heterogeneity."
Escobar, M. D. and M. West (1995): "Bayesian density estimation and inference using mixtures," Journal of the American Statistical Association, 90, 577–588.
Frühwirth-Schnatter, S. (2004): "Estimating marginal likelihoods for mixture and Markov switching models using bridge sampling techniques," The Econometrics Journal, 7, 143–167.
Frühwirth-Schnatter, S. (2011): "Panel data analysis: a survey on model-based clustering of time series," Advances in Data Analysis and Classification, 5, 251–280.
Gala, V. D., J. F. Gomes, and T. Liu (2019): "Investment without q," Journal of Monetary Economics.
Geweke, J. and G. Amisano (2010): "Comparing and evaluating Bayesian predictive distributions of asset returns," International Journal of Forecasting, 26, 216–230.
Hastie, D. I., S. Liverani, and S. Richardson (2015): "Sampling from Dirichlet process mixture models with unknown concentration parameter: mixing issues in large data implementations," Statistics and Computing, 25, 1023–1037.
Hsiao, C. and A. K. Tahmiscioglu (1997): "A panel analysis of liquidity constraints and firm investment," Journal of the American Statistical Association, 92, 455–465.
Ishwaran, H. and L. F. James (2001): "Gibbs sampling methods for stick-breaking priors," Journal of the American Statistical Association, 96, 161–173.
Kalli, M., J. E. Griffin, and S. G. Walker (2011): "Slice sampling mixture models," Statistics and Computing, 21, 93–105.
Keribin, C. (2000): "Consistent estimation of the order of mixture models," Sankhyā: The Indian Journal of Statistics, Series A, 49–66.
Kim, J. and L. Wang (2019): "Hidden group patterns in democracy developments: Bayesian inference for grouped heterogeneity," Journal of Applied Econometrics, 34, 1016–1028.
Lin, C.-C. and S. Ng (2012): "Estimation of panel data models with parameter heterogeneity when group membership is unknown," Journal of Econometric Methods, 1, 42–55.
Liu, L. (2020): "Density Forecasts in Panel Data Models: A Semiparametric Bayesian Perspective."
Liu, L., H. R. Moon, and F. Schorfheide (2019): "Forecasting with a panel tobit model," Tech. rep., National Bureau of Economic Research.
Liverani, S., D. I. Hastie, L. Azizi, M. Papathomas, and S. Richardson (2015): "PReMiuM: An R package for profile regression mixture models using Dirichlet processes," Journal of Statistical Software, 64, 1.
MacQueen, J. (1967): "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, 281–297.
Malsiner-Walli, G., S. Frühwirth-Schnatter, and B. Grün (2016): "Model-based clustering based on sparse finite Gaussian mixtures," Statistics and Computing, 26, 303–324.
McNicholas, P. D. and T. B. Murphy (2010): "Model-based clustering of longitudinal data," Canadian Journal of Statistics, 38, 153–168.
Molitor, J., M. Papathomas, M. Jerrett, and S. Richardson (2010): "Bayesian profile regression with an application to the National Survey of Children's Health," Biostatistics, 11, 484–498.
Neal, R. M. (2000): "Markov chain sampling methods for Dirichlet process mixture models," Journal of Computational and Graphical Statistics, 9, 249–265.
Neyman, J. and E. L. Scott (1948): "Consistent estimates based on partially consistent observations," Econometrica, 1–32.
Nickell, S. (1981): "Biases in dynamic models with fixed effects," Econometrica: Journal of the Econometric Society, 1417–1426.
Papaspiliopoulos, O. and G. O. Roberts (2008): "Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models," Biometrika, 95, 169–186.
Su, L., Z. Shi, and P. C. Phillips (2016): "Identifying latent structures in panel data," Econometrica, 84, 2215–2264.
Su, L., X. Wang, and S. Jin (2019): "Sieve estimation of time-varying panel data models with latent structures," Journal of Business & Economic Statistics, 37, 334–349.
Sun, Y. (2005): "Estimation and inference in panel structure models," Available at SSRN 794884.
Walker, S. G. (2007): "Sampling the Dirichlet mixture model with slices," Communications in Statistics—Simulation and Computation, 36, 45–54.
Yau, C., O. Papaspiliopoulos, G. O. Roberts, and C. Holmes (2011): "Bayesian non-parametric hidden Markov models with applications in genomics," Journal of the Royal Statistical Society, 73, 37–57.
Supplemental Appendix to "Forecasting with Bayesian Grouped Random Effects in Panel Data"
Boyuan Zhang
A Posterior Distributions and Algorithms
A.1 Random Effects Model
Below, we present the conditional posterior distributions for the time-varying random effects model with heteroskedasticity, which is the most complicated scenario. For the other models, such as its time-invariant counterpart and the homoskedastic model, adjustments can easily be made by eliminating the time effects and the heteroskedasticity. To facilitate the derivation, we stack the observations and parameters:

Observations: $Y = [y_1, y_2, \ldots, y_N]$, $y_i = [y_{i1}, y_{i2}, \ldots, y_{iT}]'$,
Covariates: $X = [x_1, x_2, \ldots, x_N]$, $x_i = [x_{i1}, x_{i2}, \ldots, x_{iT}]'$,
Random effects: $\alpha = [\alpha_1, \alpha_2, \ldots]$,
Covariance matrices: $\Sigma = [\sigma_1^2, \sigma_2^2, \ldots]$,
Heterogeneous coefficients: $\beta = [\beta_1, \ldots, \beta_N]$,
Stick lengths: $\Xi = [\xi_1, \xi_2, \ldots]$,
Group membership: $G = [g_1, \ldots, g_N]$,
Auxiliary variables: $u = [u_1, u_2, \ldots, u_N]$,
Hyperparameters: $\phi = [\mu_\alpha, \Sigma_\alpha, \nu_\sigma, \delta_\sigma]$.

The posterior of the unknown objects in the random effects model is
$$\begin{aligned}
p(\rho, \beta, \alpha, \Sigma, \Xi, a, G \mid Y, X)
&\propto p(Y \mid X, \rho, \beta, \alpha, \Sigma, G)\, p(\rho, \beta, \alpha, \Sigma, \Xi, a, G) \\
&\propto p(Y \mid X, \rho, \beta, \alpha, \Sigma, G)\, p(\alpha, \Sigma \mid \phi)\, p(\Xi \mid a)\, p(G \mid \Xi)\, p(\rho)\, p(\beta)\, p(a) \\
&= \prod_{i=1}^{N} p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma^2_{g_i}) \prod_{j=1}^{\infty} p(\alpha_j, \sigma^2_j \mid \phi) \prod_{j=1}^{\infty} p(\xi_j \mid a) \prod_{i=1}^{N} p(g_i \mid \Xi)\, p(\rho) \prod_{i=1}^{N} p(\beta_i)\, p(a) \\
&= \left[\prod_{i=1}^{N} p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma^2_{g_i})\, p(g_i \mid \Xi)\, p(\beta_i)\right] \left[\prod_{j=1}^{\infty} p(\alpha_j, \sigma^2_j \mid \phi)\, p(\xi_j \mid a)\right] p(\rho)\, p(a). \quad (A.1)
\end{aligned}$$

In the following derivation and algorithm, we adopt the slice sampler (Walker, 2007), which avoids the approximation in Ishwaran and James (2001). Walker (2007) augments the posterior distribution with a set of auxiliary variables $u = [u_1, u_2, \ldots, u_N]$, which are i.i.d. standard uniform random variables, i.e., $u_i \overset{iid}{\sim} U(0, 1)$.
Then the augmented posterior is written as
$$\begin{aligned}
p(\rho, \beta, \alpha, \Sigma, \Xi, a, G, u \mid Y, X)
&\propto \left[\prod_{i=1}^{N} p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma^2_{g_i})\, \mathbf{1}(u_i < \pi_{g_i})\, p(\beta_i)\right] \left[\prod_{j=1}^{\infty} p(\alpha_j, \sigma^2_j \mid \phi)\, p(\xi_j \mid a)\right] p(\rho)\, p(a) \\
&= \left[\prod_{i=1}^{N} p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma^2_{g_i})\, p(u_i \mid \pi_{g_i})\, \pi_{g_i}\, p(\beta_i)\right] \left[\prod_{j=1}^{\infty} p(\alpha_j, \sigma^2_j \mid \phi)\, p(\xi_j \mid a)\right] p(\rho)\, p(a), \quad (A.2)
\end{aligned}$$
where $\pi_{g_i} = p(g_i \mid \Xi)$ and $\mathbf{1}(\cdot)$ is the indicator function, which is equal to zero unless the specified condition is satisfied. The original posterior can be recovered by integrating out $u_i$ for $i = 1, 2, \ldots, N$. As we do not limit the upper bound on the number of groups, it is impossible to sample from an infinite-dimensional posterior density. The merit of slice sampling is that it reduces the dimension and allows us to solve a manageable problem with finite dimensions, as we will see below.

With the auxiliary variables $u = [u_1, u_2, \ldots, u_N]$, we define the largest possible number of potential components as
$$K^* = \min\left\{k : u^* > 1 - \sum_{j=1}^{k} \pi_j\right\}, \quad (A.3)$$
where
$$u^* = \min_{1 \le i \le N} u_i. \quad (A.4)$$
This specification ensures that for any group $k > K^*$ and any unit $i \in \{1, 2, \ldots, N\}$, we have $u_i > \pi_k$ (see the proof in Proposition B.1). This crucial property limits the dimensions of $\alpha$ and $\Sigma$ to $K^*$, as the densities of $\alpha_k$ and $\sigma^2_k$ equal zero for $k > K^*$ due to $\mathbf{1}(u_i < \pi_k) = 0$, which will become clear in the subsequent posterior derivation.

Next, we define the number of active groups
$$K_a = \max_{1 \le i \le N} g_i. \quad (A.5)$$
It can be shown that $K_a \le K^*$ (see the proof in Proposition B.1).

Conditional posterior of $\alpha$ (grouped random effects).
$$p(\alpha \mid \rho, \beta, \Sigma, \Xi, a, G, u, Y, X) \propto \left[\prod_{i=1}^{N} p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \Sigma_{g_i})\, \mathbf{1}(u_i < \pi_{g_i})\right] \left[\prod_{j=1}^{K^*} p(\alpha_j, \sigma^2_j \mid \phi)\right].$$
For $k \in \{1, 2, \ldots, K_a\}$, define the set of units that belong to group $k$,
$$C_k = \{i \in \{1, 2, \ldots, N\} \mid g_i = k\}; \quad (A.6)$$
then the posterior density of $\alpha_k$ reads as
$$\begin{aligned}
p(\alpha_k \mid \rho, \beta, \Sigma, \Xi, a, G, u, Y, X)
&\propto \left[\prod_{i \in C_k} p(y_i \mid x_i, \rho, \beta_i, \alpha_k, \sigma^2_k)\right] p(\alpha_k \mid \phi) \\
&\propto \exp\left[-\frac{1}{2} \sum_{i \in C_k} (y_i - \rho y_{-1,i} - x_i \beta_i - \alpha_k)' \Sigma_k^{-1} (y_i - \rho y_{-1,i} - x_i \beta_i - \alpha_k)\right] \exp\left[-\frac{1}{2} (\alpha_k - \mu_\alpha)' \Sigma_\alpha^{-1} (\alpha_k - \mu_\alpha)\right] \\
&\propto \exp\left[-\frac{1}{2} (\alpha_k - \bar{\mu}_{\alpha,k})' \bar{\Sigma}_{\alpha,k}^{-1} (\alpha_k - \bar{\mu}_{\alpha,k})\right],
\end{aligned}$$
where $y_{-1,i}$ denotes the lagged values of $y_i$.
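The truncation device in (A.3)-(A.4) is easy to operationalize: grow the stick-breaking weights until the minimum slice variable exceeds the remaining stick mass. The function below is our own sketch, assuming $\xi_j \sim \mathrm{Beta}(1, a)$ sticks for the Dirichlet process weights; it is not the paper's code.

```python
import numpy as np

def truncation_level(a, u, rng):
    """Smallest K* with u* > 1 - sum_{j<=K*} pi_j, eqs. (A.3)-(A.4),
    where pi_j = xi_j * prod_{l<j}(1 - xi_l) and xi_j ~ Beta(1, a)."""
    u_star = u.min()                      # u* = min_i u_i
    pi, remaining = [], 1.0               # `remaining` = 1 - sum of weights so far
    while u_star <= remaining:            # condition in (A.3) not yet met
        xi = rng.beta(1.0, a)
        pi.append(remaining * xi)
        remaining *= 1.0 - xi
    return len(pi), np.array(pi)

rng = np.random.default_rng(1)
u = rng.uniform(size=200)                 # auxiliary slice variables u_i ~ U(0, 1)
K_star, pi = truncation_level(2.0, u, rng)
```

Only components $k \le K^*$ with $u_i < \pi_k$ can receive unit $i$, so each Gibbs pass touches finitely many groups even though the mixture is nominally infinite.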
0, which will be clear in the subsequentposterior derivation.Next, we define the number of active groups K a “ max ď i ď N g i . (A.5)It can be shown that K a ď K ˚ . Conditional posterior of α (grouped random effects) . p p α | ρ, β, Σ , Ξ , a, G, u, Y, X q9 « N ź i “ p p y i | x i , ρ, β i , α g i , Σ g i q p u i ă π g i q ff « ź j “ p p α j , σ j | φ q ff , For k P t , , ..., K a u , define a set of unit that belongs to group k , C k “ t i P t , , ..., N u| g i “ k u , (A.6)then the posterior density for α k read as p p α k | ρ, β, Σ , Ξ , a, G, u, Y, X q9 « ź i P C k p p y i | x i , ρ, β i , α k , σ k q ff p p α k | φ q9 exp « ´ ÿ i P C k p y i ´ ρy ´ ,i ´ x i β i ´ α k q Σ ´ k p y i ´ ρy ´ ,i ´ x i β i ´ α k q ff exp “ ´ p α k ´ µ α q Σ ´ α p α k ´ µ α q ‰ exp “ ´ p α k ´ ¯ µ α k q ¯Σ ´ α p α k ´ ¯ µ α k q ‰ , where y ´ ,i are lagged values for y i . Assuming an independent normal conjugate prior for See proof in proposition B.1 See proof in proposition B.1 his Version: October 6, 2020his Version: October 6, 2020
$\alpha_k$, the posterior for $\alpha_k$ is given by
\[
\alpha_k \mid \rho, \beta, \Sigma, \Xi, a, G, u, Y, X \sim N\big( \bar{\mu}_{\alpha_k}, \bar{\Sigma}_{\alpha_k} \big), \tag{A.7}
\]
where
\[
\bar{\Sigma}_{\alpha_k} = \Big( \Sigma_\alpha^{-1} + \sum_{i \in C_k} \Sigma_i^{-1} \Big)^{-1}, \qquad
\bar{\mu}_{\alpha_k} = \bar{\Sigma}_{\alpha_k} \Big[ \Sigma_\alpha^{-1} \mu_\alpha + \sum_{i \in C_k} \Sigma_i^{-1} \tilde{y}_i \Big], \qquad
\tilde{y}_i = y_i - \rho y_{-1,i} - x_i \beta_i.
\]
If group $k$ is empty, we draw $\alpha_k$ from its prior $N(\mu_\alpha, \Sigma_\alpha)$.

\textbf{Conditional posterior of $\Sigma$ (grouped variance).} Under cross-sectional independence, for $k = 1, 2, \ldots, K_a$,
\[
p(\sigma_k^2 \mid \rho, \beta, \alpha, \Xi, a, G, u, Y, X) \propto \Big[ \prod_{i \in C_k} p(y_i \mid x_i, \rho, \beta_i, \alpha_k, \sigma_k^2) \Big] \, p(\sigma_k^2 \mid \phi).
\]
Assuming an inverse-gamma prior $\sigma_k^2 \sim IG(v_\sigma, \delta_\sigma)$, the posterior distribution of $\sigma_k^2$ is
\begin{align*}
p(\sigma_k^2 \mid \rho, \beta, \alpha, G, u, Y, X)
&\propto \prod_{i \in C_k} \Big[ (\sigma_k^2)^{-\frac{T}{2}} \exp\Big( -\frac{\sum_{t=1}^{T} (y_{it} - \rho y_{it-1} - \beta_i' x_{it} - \alpha_{kt})^2}{2 \sigma_k^2} \Big) \Big] \Big( \frac{1}{\sigma_k^2} \Big)^{v_\sigma + 1} \exp\Big( -\frac{\delta_\sigma}{\sigma_k^2} \Big) \\
&= \Big( \frac{1}{\sigma_k^2} \Big)^{v_\sigma + \frac{T |C_k|}{2} + 1} \exp\Big( -\frac{\delta_\sigma + \frac{1}{2} \sum_{i \in C_k} \sum_{t=1}^{T} (y_{it} - \rho y_{it-1} - \beta_i' x_{it} - \alpha_{kt})^2}{\sigma_k^2} \Big).
\end{align*}
This implies
\[
\sigma_k^2 \mid \rho, \beta, \alpha, \Xi, a, G, u, Y, X \sim IG\big( \bar{v}_{\sigma,k}, \bar{\delta}_{\sigma,k} \big), \tag{A.8}
\]
where
\[
\bar{v}_{\sigma,k} = v_\sigma + \frac{T |C_k|}{2}, \qquad
\bar{\delta}_{\sigma,k} = \delta_\sigma + \frac{1}{2} \sum_{i \in C_k} \sum_{t=1}^{T} (\tilde{y}_{it} - \alpha_{kt})^2, \qquad
|C_k| = \#\{ i : g_i = k \}, \qquad
\tilde{y}_{it} = y_{it} - \rho y_{it-1} - \beta_i' x_{it}.
\]
If group $k$ is empty, we draw $\sigma_k^2$ from its prior $IG(v_\sigma, \delta_\sigma)$.

\textbf{Conditional posterior of $\rho$ (common coefficient).} Using a normal conjugate prior $\rho \sim N(\mu_\rho, \Sigma_\rho)$, we solve a standard Bayesian linear regression to obtain the posterior density of the common coefficient $\rho$:
\begin{align*}
p(\rho \mid \beta, \alpha, \Sigma, \Xi, a, G, u, Y, X)
&\propto \Big[ \prod_{i=1}^{N} p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \Sigma_{g_i}) \Big] p(\rho) \\
&\propto \exp\Big[ -\tfrac{1}{2} \sum_{i=1}^{N} (y_i - \alpha_{g_i} - x_i \beta_i - \rho y_{-1,i})' \Sigma_{g_i}^{-1} (y_i - \alpha_{g_i} - x_i \beta_i - \rho y_{-1,i}) \Big] \exp\Big[ -\tfrac{1}{2} (\rho - \mu_\rho)' \Sigma_\rho^{-1} (\rho - \mu_\rho) \Big].
\end{align*}
This implies
\[
\rho \mid \beta, \alpha, \Sigma, \Xi, a, G, u, Y, X \sim N\big( \bar{\mu}_\rho, \bar{\Sigma}_\rho \big), \tag{A.9}
\]
where
\[
\bar{\Sigma}_\rho = \Big( \Sigma_\rho^{-1} + \sum_{i=1}^{N} y_{-1,i}' \Sigma_{g_i}^{-1} y_{-1,i} \Big)^{-1}, \qquad
\bar{\mu}_\rho = \bar{\Sigma}_\rho \Big[ \Sigma_\rho^{-1} \mu_\rho + \sum_{i=1}^{N} y_{-1,i}' \Sigma_{g_i}^{-1} \hat{y}_i \Big], \qquad
\hat{y}_i = y_i - \alpha_{g_i} - x_i \beta_i.
\]
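The conjugate updates above all share the same template: a normal prior combined with a Gaussian likelihood yields a normal posterior whose precision and mean stack the prior and data terms. The following sketch implements that generic update; this is our illustrative code under simplifying assumptions (scalar error variance, function name `normal_conjugate_draw` is ours), not the paper's implementation.

```python
import numpy as np

def normal_conjugate_draw(X, y, sigma2, mu0, Sigma0, rng):
    """One Gibbs draw from theta | y ~ N(mu_bar, Sigma_bar) for the model
    y = X @ theta + e, e ~ N(0, sigma2 * I), prior theta ~ N(mu0, Sigma0).
    This is the template behind the posteriors derived above."""
    prior_prec = np.linalg.inv(Sigma0)
    # Posterior precision = prior precision + data precision
    post_prec = prior_prec + (X.T @ X) / sigma2
    Sigma_bar = np.linalg.inv(post_prec)
    # Posterior mean weights prior mean and least-squares information
    mu_bar = Sigma_bar @ (prior_prec @ mu0 + (X.T @ y) / sigma2)
    return rng.multivariate_normal(mu_bar, Sigma_bar)

rng = np.random.default_rng(0)
# Example: recover an AR coefficient from y_it = 0.7 * y_lag + noise
y_lag = rng.normal(size=(500, 1))
y = 0.7 * y_lag[:, 0] + rng.normal(scale=0.1, size=500)
rho_draw = normal_conjugate_draw(y_lag, y, 0.01, np.zeros(1), 100.0 * np.eye(1), rng)
```

With many observations and a diffuse prior, the draw concentrates near the least-squares estimate, which is the behavior the derivations above formalize.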
\textbf{Conditional posterior of $\beta$ (heterogeneous coefficients).} As $\varepsilon_{it}$ is independent across units, we solve for $\beta_i$ for each unit separately. We transform the model into a standard linear
model with a known form of heteroskedasticity,
\[
y_{it} - \alpha_{g_i t} - \rho y_{it-1} = \beta_i' x_{it} + \varepsilon_{it}, \qquad \varepsilon_{it} \sim N(0, \sigma_{g_i}^2).
\]
Using a normal conjugate prior $\beta_i \sim N(\mu_\beta, \Sigma_\beta)$, the posterior distribution for unit $i$ is written as
\begin{align*}
p(\beta_i \mid \rho, \alpha, \Sigma, \Xi, a, G, u, Y, X)
&\propto p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma_{g_i}^2) \, p(\beta_i) \\
&\propto \exp\Big[ -\frac{\sum_{t=1}^{T} (y_{it} - \alpha_{g_i t} - \rho y_{it-1} - x_{it}' \beta_i)^2}{2 \sigma_{g_i}^2} \Big] \exp\Big[ -\tfrac{1}{2} (\beta_i - \mu_\beta)' \Sigma_\beta^{-1} (\beta_i - \mu_\beta) \Big].
\end{align*}
This implies
\[
\beta_i \mid \rho, \alpha, \Sigma, \Xi, a, G, u, Y, X \sim N\big( \bar{\mu}_{\beta_i}, \bar{\Sigma}_{\beta_i} \big), \tag{A.10}
\]
where
\[
\bar{\Sigma}_{\beta_i} = \Big( \Sigma_\beta^{-1} + \sigma_{g_i}^{-2} \sum_{t=1}^{T} x_{it} x_{it}' \Big)^{-1}, \qquad
\bar{\mu}_{\beta_i} = \bar{\Sigma}_{\beta_i} \Big[ \Sigma_\beta^{-1} \mu_\beta + \sigma_{g_i}^{-2} \sum_{t=1}^{T} x_{it} \hat{y}_{it} \Big], \qquad
\hat{y}_{it} = y_{it} - \alpha_{g_i t} - \rho y_{it-1}.
\]

\textbf{Conditional posterior of $\Xi$ (stick length).}
\begin{align*}
p(\Xi \mid \rho, \beta, \alpha, \Sigma, a, G, u, Y, X)
&\propto \Big[ \prod_{i=1}^{N} p(u_i \mid \pi_{g_i}) \, \pi_{g_i} \Big] \Big[ \prod_{j=1}^{K^*} p(\xi_j \mid a) \Big] \\
&= \Big[ \prod_{i=1}^{N} p(u_i \mid \pi_{g_i}) \, \xi_{g_i} \prod_{l < g_i} (1 - \xi_l) \Big] \Big[ \prod_{j=1}^{K^*} p(\xi_j \mid a) \Big].
\end{align*}
For $k = 1, 2, \ldots, K_a$,
\begin{align*}
p(\xi_k \mid \rho, \beta, \alpha, \Sigma, a, G, u, Y, X)
&\propto \Big( \prod_{i \in C_k} \xi_k \Big) (1 - \xi_k)^{\sum_{j=1}^{N} \mathbb{1}(g_j > k)} (1 - \xi_k)^{a - 1} \\
&= \xi_k^{|C_k|} (1 - \xi_k)^{a + \sum_{j=1}^{N} \mathbb{1}(g_j > k) - 1}.
\end{align*}
Therefore, the posterior distribution of $\xi_k$ is
\[
\xi_k \mid \rho, \beta, \alpha, \Sigma, a, G, u, Y, X \sim \text{Beta}\Big( |C_k| + 1, \; a + \sum_{j=1}^{N} \mathbb{1}(g_j > k) \Big). \tag{A.11}
\]
Given $\Xi = [\xi_1, \xi_2, \ldots, \xi_{K_a}]$, update $\pi_1, \pi_2, \ldots, \pi_{K_a}$:
\[
\pi_k \mid G, \Xi =
\begin{cases}
\xi_1, & k = 1, \\
\xi_k \prod_{j < k} (1 - \xi_j), & k = 2, \ldots, K_a.
\end{cases} \tag{A.12}
\]

\textbf{Conditional posterior of $a$ (concentration parameter).} For the DP concentration parameter, the standard posterior derivation does not apply because the number of components in the current sampler is unrestricted. Instead, we implement the two-step procedure proposed by Escobar and West (1995, pp. 8--9). Following their approach, we first draw a latent variable $\eta$ from
\[
\eta \sim \text{Beta}(a + 1, N). \tag{A.13}
\]
Then, conditional on $\eta$ and $K_a$, we sample $a$ from a mixture of two Gamma distributions:
\[
p(a \mid \eta, K_a) = \pi_a \, \text{Gamma}(m + K_a, \, n - \log \eta) + (1 - \pi_a) \, \text{Gamma}(m + K_a - 1, \, n - \log \eta), \tag{A.14}
\]
with the weight $\pi_a$ defined by
\[
\frac{\pi_a}{1 - \pi_a} = \frac{m + K_a - 1}{N (n - \log \eta)}.
\]
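The Escobar--West two-step update above can be sketched as follows. This is our illustrative code, not the authors' implementation; `m` and `n` denote the Gamma prior hyperparameters as in the text, and the function name and example values are ours.

```python
import numpy as np

def draw_concentration(a_old, K_a, N, m, n, rng):
    """Two-step update of the DP concentration parameter a
    (Escobar and West, 1995), matching the mixture form above."""
    # Step 1: latent variable eta | a, N ~ Beta(a + 1, N)
    eta = rng.beta(a_old + 1.0, N)
    rate = n - np.log(eta)              # common rate of both Gamma components
    # Step 2: mixture weight pi_a / (1 - pi_a) = (m + K_a - 1) / (N * rate)
    odds = (m + K_a - 1.0) / (N * rate)
    pi_a = odds / (1.0 + odds)
    # Pick a component, then draw a from Gamma(shape, rate)
    shape = m + K_a if rng.uniform() < pi_a else m + K_a - 1.0
    return 1.0 / rate * rng.gamma(shape)

rng = np.random.default_rng(3)
a_new = draw_concentration(a_old=1.0, K_a=4, N=100, m=2.0, n=2.0, rng=rng)
```

Because the Beta draw tends toward zero as `N` grows, the rate `n - log(eta)` increases with the sample size, which shrinks the concentration parameter unless many groups are active.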
\textbf{Conditional posterior of $u$ (auxiliary variable).} Conditional on the group ``stick lengths'' $\xi_k$ and the group membership indices $G$, it is straightforward to show that the posterior density of $u_i$ is uniform on $(0, \pi_{g_i})$, that is,
\[
u_i \mid \Xi, G \sim \text{Unif}(0, \pi_{g_i}), \tag{A.15}
\]
where $\pi_{g_i} = \xi_{g_i} \prod_{j < g_i} (1 - \xi_j)$. Moreover, it is worth noting that the values of $K^*$ and $u^*$ need to be updated according to (A.3) and (A.4) after this step.

\textbf{Conditional posterior of $G$ (group membership).} We derive the posterior distribution of $g_i$ conditional on $G^{(-i)}$, where $G^{(-i)}$ is the set of all membership indices except $g_i$, i.e., $G^{(-i)} = G \setminus g_i$. Hence, for $k = 1, 2, \ldots, K^*$,
\[
p(g_i = k \mid \rho, \beta, \alpha, \Sigma, \Xi, a, G^{(-i)}, u, Y, X) \propto p(y_i \mid \rho, \beta_i, \alpha_k, \sigma_k^2, Y, X) \, \mathbb{1}(u_i < \pi_k).
\]
As this is a discrete distribution, we normalize the point masses to obtain a valid distribution:
\[
p(g_i = k \mid \rho, \beta, \alpha, \Sigma, \Xi, a, G^{(-i)}, u, Y, X) = \frac{p(y_i \mid \rho, \beta_i, \alpha_k, \sigma_k^2, Y, X) \, \mathbb{1}(u_i < \pi_k)}{\sum_{j=1}^{K^*} p(y_i \mid \rho, \beta_i, \alpha_j, \sigma_j^2, Y, X) \, \mathbb{1}(u_i < \pi_j)}. \tag{A.16}
\]
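The slice-restricted membership draw in (A.16) can be sketched numerically: the indicator confines the candidate set to groups whose probability exceeds the auxiliary variable, and the likelihood is normalized over that finite set. This is our illustrative code (names and values are ours), working in log-likelihoods for numerical stability.

```python
import numpy as np

def draw_membership(loglik_i, pi, u_i, rng):
    """Draw g_i from the normalized discrete distribution in (A.16).

    loglik_i : length-K* array of log p(y_i | rho, beta_i, alpha_k, sigma2_k)
    pi       : length-K* array of group probabilities
    u_i      : auxiliary slice variable for unit i
    """
    # 1(u_i < pi_k) restricts the support; by construction u_i < pi_{g_i},
    # so at least one candidate always survives the slice.
    candidates = np.where(pi > u_i)[0]
    logw = loglik_i[candidates]
    w = np.exp(logw - logw.max())      # stabilized normalization
    w /= w.sum()
    return candidates[rng.choice(len(candidates), p=w)] + 1   # 1-based label

rng = np.random.default_rng(4)
loglik = np.array([-50.0, -2.0, -40.0])   # the unit fits group 2 best
pi = np.array([0.5, 0.3, 0.1])
g_i = draw_membership(loglik, pi, u_i=0.05, rng=rng)  # all 3 groups pass the slice
```

A larger `u_i` would knock low-probability groups out of the candidate set entirely, which is how the slice keeps the sum in (A.16) finite despite the unbounded number of DP components.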
A.1.1 Blocked Gibbs Sampler and Algorithm

Initialization:
(i) Preset the initial number of active groups $K_a$. As derived by Antoniak (1974), the expected number of unique groups is $E[K \mid a] \approx a \log\big( \frac{a + N}{a} \big)$. We set $K_a$ to this expected value, with the concentration parameter $a$ replaced by its prior mean.
(ii) In ignorance of group heterogeneity (K =
1) and heteroskedasticity, run OLS to get $\hat{\alpha}_{OLS}$, $\hat{\rho}_{OLS}$, $\hat{\beta}_{i,OLS}$ and $\mathrm{Cov}(\hat{\alpha}_{OLS})$. These OLS estimators serve as the mean and covariance matrix in the related priors.
(iii) Generate $K^*$ random samples from the distribution $N(\hat{\alpha}_{OLS}, \mathrm{Cov}(\hat{\alpha}_{OLS}))$.
(iv) Initialize the group membership $G$ by sampling from (A.16), ignoring $\mathbb{1}(u_i < \pi_{g_i})$. Remove empty groups.

For each iteration $s = 1, 2, \ldots, N_{sim}$:
(i) Number of active groups: $K_a = \max_{1 \le i \le N} g_i^{(s-1)}$.
(ii) Group ``stick lengths'': for $k = 1, 2, \ldots, K_a$, draw $\xi_k$ from the Beta distribution in (A.11):
\[
\xi_k \mid \rho^{(s-1)}, \beta^{(s-1)}, \alpha^{(s-1)}, \Sigma^{(s-1)}, a^{(s-1)}, G^{(s-1)}, u^{(s-1)}, Y, X \sim \text{Beta}\Big( |C_k| + 1, \; a + \sum_{j=1}^{N} \mathbb{1}(g_j > k) \Big),
\]
and calculate the group probabilities in accordance with (A.12).
(iii) Group heterogeneity: for $k = 1, 2, \ldots, K_a$, draw $\alpha_k^{(s)}$ from the normal distribution in (A.7):
\[
\alpha_k \mid \rho^{(s-1)}, \beta^{(s-1)}, \Sigma^{(s-1)}, a^{(s-1)}, G^{(s-1)}, u^{(s-1)}, Y, X \sim N\big( \bar{\mu}_{\alpha_k}, \bar{\Sigma}_{\alpha_k} \big).
\]
(iv) Group heteroskedasticity: for $k = 1, 2, \ldots, K_a$ and $t = 1, 2, \ldots, T$, draw $\sigma_k^{2(s)}$ from the inverse-Gamma distribution in (A.8):
\[
\sigma_k^2 \mid \rho^{(s-1)}, \beta^{(s-1)}, \alpha^{(s)}, G^{(s-1)}, u^{(s-1)}, Y, X \sim IG\big( \bar{v}_{\sigma,k}, \bar{\delta}_{\sigma,k} \big).
\]
(v) Label switching: after each iteration, an additional random permutation step is added to the MCMC scheme, which randomly permutes the current labeling of the components. Random permutation ensures that the sampler explores all $K!$ modes of the full posterior distribution and prevents the sampler from being trapped around a single posterior mode. Following Liu (2020), we update $\{ \alpha_k^{(s)}, \sigma_k^{2(s)}, \pi_k^{(s)}, g_i^{(s-1)} \}$ with three Metropolis--Hastings label-switching moves developed by Papaspiliopoulos and Roberts (2008) (steps (a) and (b)) and Hastie et al. (2015) (step (c)).
All these label-switching moves aim to improve numerical convergence.\footnote{Without this step, the one-at-a-time updates of the allocations mean that clusters rarely switch labels, and consequently the ordering will be largely determined by the (perhaps random) initialization of the sampler.}\footnote{See Algorithm C.4 in the appendix.}
(a) Randomly select two nonempty groups $i$ and $j$, and swap the group labels $g_i^{(s-1)}$ and $g_j^{(s-1)}$
for all units in these groups; accept the new labels with probability
\[
\min\Big( 1, \frac{\pi_i^{N_j} \pi_j^{N_i}}{\pi_i^{N_i} \pi_j^{N_j}} \Big) = \min\big( 1, (\pi_i / \pi_j)^{N_j - N_i} \big),
\]
where $N_i$ and $N_j$ are the numbers of units in groups $i$ and $j$, respectively.
(b) Randomly select two adjacent groups $l$ and $l+1$, $\{ l, l+1 \} \subset \{ 1, 2, \ldots, K_a \}$, and swap the group labels $g_l^{(s-1)}$ and ``stick lengths'' $\xi_l^{(s)}$; accept the new labels and stick lengths with probability
\[
\min\Big( 1, \frac{\tilde{p}_l^{\,N_{l+1}} \, \tilde{p}_{l+1}^{\,N_l}}{\pi_l^{N_l} \, \pi_{l+1}^{N_{l+1}}} \Big),
\]
where $\tilde{p}_l$ and $\tilde{p}_{l+1}$ are the new group probabilities derived from the new $\xi_l^{(s)}$ and $\xi_{l+1}^{(s)}$.
(c) Randomly select two adjacent groups $k$ and $k+1$, $\{ k, k+1 \} \subset \{ 1, 2, \ldots, K_a \}$, swap the group labels $g_i^{(s-1)}$ and ``stick lengths'' $\xi_k^{(s)}$, and update the group-specific parameters $\{ \alpha_k^{(s)}, \sigma_k^{2(s)} \}$; accept the new labels and stick lengths with probability
\[
\min\Big\{ 1, \big( R_1 / \tilde{R} \big)^{N_{k+1}} \big( R_2 / \tilde{R} \big)^{N_k} \Big\},
\]
where
\[
R_1 = \frac{1 + a + N_{k+1} + \sum_{l > k+1} N_l}{a + N_{k+1} + \sum_{l > k+1} N_l}, \qquad
R_2 = \frac{a + N_k + \sum_{l > k+1} N_l}{1 + a + N_k + \sum_{l > k+1} N_l}, \qquad
\tilde{R} = \frac{\pi_{k+1} R_1 + \pi_k R_2}{\pi_k + \pi_{k+1}}.
\]
The new group probabilities are defined as $p_k = \pi_{k+1} R_1 / \tilde{R}$ and $p_{k+1} = \pi_k R_2 / \tilde{R}$.
Additionally, we update the ``stick lengths'' for groups $k$ and $k+1$:
\[
\xi_k' = \frac{p_k}{\prod_{l < k} (1 - \xi_l)}, \qquad
\xi_{k+1}' = \frac{p_{k+1}}{(1 - \xi_k') \prod_{l < k} (1 - \xi_l)}.
\]
(vi) Auxiliary variables: for $i = 1, 2, \ldots, N$, draw $u_i$ from the uniform distribution in (A.15):
\[
u_i \mid \Xi^{(s)}, G^{(s)} \sim \text{Unif}\big( 0, \pi_{g_i}^{(s)} \big).
\]
Then calculate $u^*$ according to (A.4).
(vii) DP concentration parameter:
(a) Draw the latent variable $\eta$ from the Beta distribution in (A.13): $\eta \sim \text{Beta}(a + 1, N)$.
(b) Draw the concentration parameter $a$ from the mixture of Gamma distributions in (A.14):
\[
a \mid \eta, K_a \sim
\begin{cases}
\text{Gamma}(m + K_a, \, n - \log \eta) & \text{with prob. } \pi_a, \\
\text{Gamma}(m + K_a - 1, \, n - \log \eta) & \text{with prob. } 1 - \pi_a,
\end{cases}
\]
where $\pi_a$ is defined by
\[
\frac{\pi_a}{1 - \pi_a} = \frac{m + K_a - 1}{N (n - \log \eta)}.
\]
(viii) Potential groups: start with $\tilde{K} = K_a$.
(a) Group probabilities:
(1) If $\sum_{j=1}^{\tilde{K}} \pi_j^{(s)} > 1 - u^*$, set $K^* = \tilde{K}$ and stop.
(2) Otherwise, let $\tilde{K} = \tilde{K} +$
1, draw $\xi_{\tilde{K}} \sim \text{Beta}\big( 1, a^{(s)} \big)$, update $\pi_{\tilde{K}} = \xi_{\tilde{K}} \prod_{j < \tilde{K}} (1 - \xi_j)$, and go to step (1).
(b) Group parameters: for $k = K_a + 1, \ldots, K^*$, draw $\alpha_k^{(s)}$ and $\sigma_k^{2(s)}$ from their prior distributions.

Note: this particular choice of $\xi_k'$ and $\xi_{k+1}'$ ensures that the only group probabilities that change are those associated with groups k and k +
1, and the rest are unchanged. Moreover, it can be shown that $(1 - \xi_k')(1 - \xi_{k+1}') = (1 - \xi_k)(1 - \xi_{k+1})$. See the appendices of Hastie et al. (2015) for more details.
(ix) Common AR(1) parameter: draw $\rho$ from the normal distribution in (A.9):
\[
\rho \mid \beta^{(s-1)}, \alpha^{(s)}, \Sigma^{(s)}, a^{(s)}, G^{(s-1)}, u^{(s)}, Y, X \sim N\big( \bar{\mu}_\rho, \bar{\Sigma}_\rho \big).
\]
(x) Heterogeneous parameters: draw $\beta_i$ from the normal distribution in (A.10):
\[
\beta_i \mid \rho^{(s)}, \alpha^{(s)}, \Sigma^{(s)}, a^{(s)}, G^{(s-1)}, u^{(s)}, Y, X \sim N\big( \bar{\mu}_{\beta_i}, \bar{\Sigma}_{\beta_i} \big).
\]
(xi) Group membership: for $i = 1, 2, \ldots, N$ and $k = 1, 2, \ldots, K^*$, draw $g_i$ from the multinomial distribution in (A.16):
\[
p(g_i = k \mid \rho^{(s)}, \beta^{(s)}, \alpha^{(s)}, \Sigma^{(s)}, \xi^{(s)}, a^{(s)}, G^{(-i)}, u^{(s)}, Y, X) = \frac{p(y_i \mid \rho^{(s)}, \beta_i^{(s)}, \alpha_k^{(s)}, \Sigma_k^{(s)}) \, \mathbb{1}(u_i^{(s)} < \pi_k)}{\sum_{j=1}^{K^*} p(y_i \mid \rho^{(s)}, \beta_i^{(s)}, \alpha_j^{(s)}, \Sigma_j^{(s)}) \, \mathbb{1}(u_i^{(s)} < \pi_j)}.
\]
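The stick-breaking bookkeeping that recurs throughout the algorithm, drawing $\xi_k$ from the Beta posterior in (A.11) and mapping the sticks to group probabilities via (A.12), can be sketched as follows. This is our illustrative code, not the authors'; the function name and example memberships are ours.

```python
import numpy as np

def update_sticks(g, K_a, a, rng):
    """Draw stick lengths xi_k as in (A.11) and map them to group
    probabilities pi_k via stick breaking as in (A.12)."""
    g = np.asarray(g)
    xi = np.empty(K_a)
    for k in range(1, K_a + 1):
        n_k = np.sum(g == k)            # |C_k|, units currently in group k
        n_gt = np.sum(g > k)            # units allocated past group k
        xi[k - 1] = rng.beta(n_k + 1, a + n_gt)
    # pi_1 = xi_1; pi_k = xi_k * prod_{j<k} (1 - xi_j) for k >= 2
    pi = xi * np.concatenate(([1.0], np.cumprod(1.0 - xi[:-1])))
    return xi, pi

rng = np.random.default_rng(2)
g = np.array([1] * 60 + [2] * 30 + [3] * 10)   # memberships for N = 100 units
xi, pi = update_sticks(g, K_a=3, a=1.0, rng=rng)
```

The probabilities sum to less than one by construction; the leftover mass $\prod_k (1 - \xi_k)$ is what step (viii) carves into additional potential groups until the slice threshold $1 - u^*$ is exceeded.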
A.2 Random Effects Model with Subjective Group Probability Prior

This algorithm is designed for the random effects model in which the econometrician has prior knowledge of the group structure and presumes the number of groups $K_p$. Building on the algorithm for the random effects model in Section A.1, we allow the researcher's prior knowledge to be incorporated while retaining the ability to reallocate units and change the number of groups along the MCMC sampling.
We use the same notation as in Appendix A.1:
\begin{align*}
\text{Observations:} \quad & Y = [y_1, y_2, \ldots, y_N], \quad y_i = [y_{i1}, y_{i2}, \ldots, y_{iT}]', \\
\text{Covariates:} \quad & X = [x_1, x_2, \ldots, x_N], \quad x_i = [x_{i1}, x_{i2}, \ldots, x_{iT}]', \\
\text{Random effects:} \quad & \alpha = [\alpha_1, \alpha_2, \ldots, \alpha_K], \\
\text{Covariance matrices:} \quad & \Sigma = [\Sigma_1, \Sigma_2, \ldots, \Sigma_K], \\
\text{Group membership:} \quad & G = [g_1, \ldots, g_N], \\
\text{Stick lengths:} \quad & \Xi = [\xi_1, \xi_2, \ldots], \\
\text{Group probabilities:} \quad & \pi = [\pi_1, \ldots, \pi_N], \\
\text{Membership probabilities:} \quad & \omega = [\omega_1, \ldots, \omega_N], \quad \omega_i = [\omega_{i1}, \omega_{i2}, \ldots, \omega_{iK}], \\
\text{Auxiliary variables:} \quad & u = [u_1, u_2, \ldots, u_N], \\
\text{Hyperparameters:} \quad & \phi = [\mu_\alpha, \Sigma_\alpha, \nu_\sigma, \delta_\sigma].
\end{align*}
Notice that we define two sets of probabilities, $\pi$ and $\omega$. In practice, they play distinct roles in the algorithm. The group probability $\pi$ captures each group's probability based on the entire sample and, most importantly, determines the upper bound of the auxiliary variable $u_{g_i}$, which has a direct effect on the potential number of groups $K^*$.
On the other hand, the membership probability $\omega_i$ represents the probabilities of unit $i$ belonging to each of the $K$ groups, through which the researcher's prior knowledge enters the algorithm.

As regards the choice of priors, we adopt independent Multivariate Normal--Inverse-Gamma Dirichlet Process priors for the grouped random effects $\alpha_{g_i t}$ and heteroskedasticity $\sigma_{g_i}^2$, a normal prior for the common parameter $\rho$, an asymmetric Dirichlet prior for the membership probability $\omega$ with concentration parameters chosen by the econometrician, a multinomial prior for the group membership $g_i$, a Beta prior for the stick lengths $\xi$, and a mixture Gamma prior for the concentration parameter $a$.
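To make the role of $\omega_i$ concrete, here is a minimal sketch of how a subjective prior enters: $\omega_i$ is drawn from an asymmetric Dirichlet whose concentration vector tilts unit $i$ toward particular groups, and the label $g_i$ is then drawn from $\omega_i$. The concentration values and variable names here are our own illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(7)
K = 3
# Asymmetric concentration vector: the researcher believes unit i most
# likely belongs to group 1
a_i = np.array([8.0, 1.0, 1.0])
omega_i = rng.dirichlet(a_i)            # membership probabilities, sum to 1
g_i = rng.choice(K, p=omega_i) + 1      # multinomial draw of the 1-based label

# Averaging many prior draws recovers the prior mean a_i / sum(a_i) = (0.8, 0.1, 0.1)
draws = rng.dirichlet(a_i, size=20000)
prior_mean = draws.mean(axis=0)
```

Larger concentration values encode stronger prior conviction: as the entries of `a_i` are scaled up, the Dirichlet draws concentrate around the same mean, so the data must work harder to move a unit away from the researcher's presumed group.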
The posterior of the unknown objects in this random effects model is
\begin{align*}
p(\rho, \beta, \alpha, \Sigma, G, \omega \mid Y, X)
&\propto p(Y \mid X, \rho, \beta, \alpha, \Sigma, G, \omega) \, p(\rho, \beta, \alpha, \Sigma, G, \pi) \\
&\propto p(Y \mid X, \rho, \beta, \alpha, \Sigma, G) \, p(\alpha, \Sigma \mid \phi) \, p(G \mid \omega) \, p(\rho) \, p(\beta) \, p(\omega \mid a) \\
&= \prod_{i=1}^{N} p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma_{g_i}^2) \prod_{j=1}^{K} p(\alpha_j, \sigma_j^2 \mid \phi) \prod_{i=1}^{N} p(g_i \mid \omega_i) \prod_{i=1}^{N} p(\omega_i \mid a_i) \prod_{i=1}^{N} p(\beta_i) \, p(\rho) \\
&= \prod_{i=1}^{N} \big[ p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma_{g_i}^2) \, p(g_i \mid \omega_i) \, p(\omega_i \mid a_i) \, p(\beta_i) \big] \prod_{j=1}^{K} p(\alpha_j, \sigma_j^2 \mid \phi) \, p(\rho).
\end{align*}
To allow for automatic adjustment of the number of groups, we introduce a set of auxiliary variables $u = [u_1, u_2, \ldots, u_N]$ as proposed by Walker (2007) and rewrite the posterior above as
\[
p(\rho, \beta, \alpha, \Sigma, \Xi, a, G, u, \omega, \pi \mid Y, X) \propto \prod_{i=1}^{N} \big[ p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma_{g_i}^2) \, p(g_i \mid \omega_i) \, p(\omega_i \mid a_i) \, \mathbb{1}(u_i < \pi_{g_i}) \, p(\beta_i) \big] \prod_{j=1}^{K} p(\alpha_j, \sigma_j^2 \mid \phi) \, p(\rho).
\]
The number of potential groups $K^*$ and the number of active groups $K_a$ are defined in equations (A.3) and (A.5).

\textbf{Conditional posterior of $\alpha$ (grouped random effects).} Identical to (A.7).

\textbf{Conditional posterior of $\Sigma$ (grouped variance).} Identical to (A.8).

\textbf{Conditional posterior of $\rho$ (common coefficient).} Identical to (A.9).
Conditional posterior of β (heterogeneous coefficients) . Identical to (A.10). Conditional posterior of Ξ (stick length) . Identical to (A.11). Then generate π inaccordance to (A.12). Conditional posterior of a (concentration parameter) . Identical to (A.14). Conditional posterior of u (auxiliary variable) . Identical to (A.15). Conditional posterior of ω (membership probability) . Sampling from the posteriorof π can be implemented as follows. As we adopt Dirichlet prior for π and Multinomial prior his Version: October 6, 2020his Version: October 6, 2020
This algorithm is designed for the random effects model in which the econometrician has prior knowledge about the group structure and presumes a number of groups $K_p$. Building on the algorithm for the random effects model in Section A.1, we allow the researcher's prior knowledge to be incorporated while inheriting the features of reallocating units and changing the number of groups along the MCMC sampling.

We use the same notation as in Appendix A.1:

Observations: $Y = [y_1, y_2, \ldots, y_N]$, $y_i = [y_{i1}, y_{i2}, \ldots, y_{iT}]'$,
Covariates: $X = [x_1, x_2, \ldots, x_N]$, $x_i = [x_{i1}, x_{i2}, \ldots, x_{iT}]'$,
Random effects: $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_K]$,
Covariance matrices: $\Sigma = [\Sigma_1, \Sigma_2, \ldots, \Sigma_K]$,
Group membership: $G = [g_1, \ldots, g_N]$,
Stick lengths: $\Xi = [\xi_1, \xi_2, \ldots]$,
Group probabilities: $\pi = [\pi_1, \pi_2, \ldots]$,
Membership probabilities: $\omega = [\omega_1, \ldots, \omega_N]$, $\omega_i = [\omega_{i1}, \omega_{i2}, \ldots, \omega_{iK}]$,
Auxiliary variables: $u = [u_1, u_2, \ldots, u_N]$,
Hyperparameters: $\phi = [\mu_\alpha, \Sigma_\alpha, \nu_\sigma, \delta_\sigma]$.

Notice that we define two sets of probabilities, $\pi$ and $\omega$, which play distinct roles in the algorithm. The group probability $\pi$ captures the probability of each group based on the entire sample and, most importantly, determines the upper bound of the auxiliary variable $u_{g_i}$, which has a direct effect on the potential number of groups $K^*$.
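For intuition, the stick-breaking construction behind the group probabilities $\pi$ and the truncating role of the slice variables $u_i$ can be sketched as follows. This is a simplified numpy illustration with made-up stick lengths and memberships, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(xi):
    """Convert stick lengths xi_k into group probabilities pi_k."""
    xi = np.asarray(xi, dtype=float)
    # pi_k = xi_k * prod_{j<k} (1 - xi_j): break off a fraction of the stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - xi)[:-1]))
    return xi * remaining

# Hypothetical stick lengths (in the sampler these are Beta draws).
xi = np.array([0.5, 0.5, 0.5, 0.5])
pi = stick_breaking(xi)          # [0.5, 0.25, 0.125, 0.0625]

# Slice variables u_i ~ U(0, pi_{g_i}) truncate the infinite mixture:
# only groups with pi_k > min_i u_i can receive members this sweep.
g = np.array([0, 0, 1, 2])       # hypothetical memberships
u = rng.uniform(0.0, pi[g])
K_star = int(np.sum(pi > u.min()))  # potential number of groups
```

The finite truncation at `K_star` is what makes sampling from the Dirichlet Process mixture feasible, since all groups beyond it have probability below every slice variable.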
On the other hand, the membership probability $\omega_i$ represents the probabilities that unit $i$ belongs to each of the $K$ groups; it is through $\omega_i$ that the researcher's prior knowledge enters the algorithm.

As regards the choice of priors, we adopt independent Multivariate Normal--Inverse-Gamma Dirichlet Process priors for the grouped random effects $\alpha_{g_i}$ and heteroskedasticity $\sigma^2_{g_i}$, a normal prior for the common parameter $\rho$, an asymmetric Dirichlet prior for the membership probability $\omega_i$ with concentration parameters chosen by the econometrician, a multinomial prior for the group membership $g_i$, a Beta prior for the stick lengths $\xi$, and a mixture-of-Gammas prior for the concentration parameter $a$.

The posterior of the unknown objects in this random effects model is:
\[
\begin{aligned}
p(\rho, \beta, \alpha, \Sigma, G, \omega \mid Y, X)
&\propto p(Y \mid X, \rho, \beta, \alpha, \Sigma, G, \omega)\, p(\rho, \beta, \alpha, \Sigma, G, \omega) \\
&\propto p(Y \mid X, \rho, \beta, \alpha, \Sigma, G)\, p(\alpha, \Sigma \mid \phi)\, p(G \mid \omega)\, p(\rho)\, p(\beta)\, p(\omega \mid a) \\
&= \prod_{i=1}^{N} p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma^2_{g_i}) \prod_{j=1}^{K} p(\alpha_j, \sigma^2_j \mid \phi) \prod_{i=1}^{N} p(g_i \mid \omega_i) \prod_{i=1}^{N} p(\omega_i \mid a_i) \prod_{i=1}^{N} p(\beta_i)\, p(\rho) \\
&= \prod_{i=1}^{N} \big[ p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma^2_{g_i})\, p(g_i \mid \omega_i)\, p(\omega_i \mid a_i)\, p(\beta_i) \big] \prod_{j=1}^{K} p(\alpha_j, \sigma^2_j \mid \phi)\, p(\rho).
\end{aligned}
\]
To allow for automatic adjustment of the number of groups, we introduce a set of auxiliary variables $u = [u_1, u_2, \ldots, u_N]$ as proposed by Walker (2007) and rewrite the posterior above as
\[
p(\rho, \beta, \alpha, \Sigma, \Xi, a, G, u, \omega, \pi \mid Y, X)
\propto \prod_{i=1}^{N} \big[ p(y_i \mid x_i, \rho, \beta_i, \alpha_{g_i}, \sigma^2_{g_i})\, p(g_i \mid \omega_i)\, p(\omega_i \mid a_i)\, \mathbb{1}(u_i < \pi_{g_i})\, p(\beta_i) \big] \prod_{j=1}^{K} p(\alpha_j, \sigma^2_j \mid \phi)\, p(\rho).
\]
The number of potential groups $K^*$ and the number of active groups $K^a$ are defined in equations (A.3) and (A.5).

Conditional posterior of $\alpha$ (grouped random effects). Identical to (A.7).

Conditional posterior of $\Sigma$ (grouped variances). Identical to (A.8).

Conditional posterior of $\rho$ (common coefficient). Identical to (A.9).
Conditional posterior of $\beta$ (heterogeneous coefficients). Identical to (A.10).

Conditional posterior of $\Xi$ (stick lengths). Identical to (A.11). Then generate $\pi$ in accordance with (A.12).

Conditional posterior of $a$ (concentration parameter). Identical to (A.14).

Conditional posterior of $u$ (auxiliary variables). Identical to (A.15).

Conditional posterior of $\omega$ (membership probabilities). Sampling from the posterior of $\omega_i$ can be implemented as follows. As we adopt a Dirichlet prior for $\omega_i$ and a multinomial prior for $g_i$, for $i = 1, \ldots, N$ the posterior is written as
\[
\begin{aligned}
p(\omega_i \mid \rho, \beta, \alpha, \Sigma, G, Y, X)
&\propto p(g_i \mid \omega_i)\, p(\omega_i \mid a_i) \\
&\propto \big( \omega_{i1}^{\mathbb{1}(g_i = 1)} \cdots \omega_{iK_p}^{\mathbb{1}(g_i = K_p)} \big) \big( \omega_{i1}^{a_{i1} - 1} \cdots \omega_{iK_p}^{a_{iK_p} - 1} \big) \\
&= \omega_{i1}^{a_{i1} + \mathbb{1}(g_i = 1) - 1} \cdots \omega_{iK_p}^{a_{iK_p} + \mathbb{1}(g_i = K_p) - 1}. \qquad \text{(A.17)}
\end{aligned}
\]
This implies
\[
\omega_i \mid \rho, \beta, \alpha, \Sigma, G, Y, X \sim \mathrm{Dir}\big( a_{i1} + \mathbb{1}(g_i = 1), \ldots, a_{iK_p} + \mathbb{1}(g_i = K_p) \big). \qquad \text{(A.18)}
\]
It is worth noting that, during MCMC sampling, we allow for more or fewer groups than the researcher expects, i.e., the potential number of groups $K^{*(s)}$ could be larger or smaller than $K_p$ in some iteration $s$. In such circumstances, we modify the Dirichlet posterior distribution in (A.17) to account for these changes. We present the posterior in three cases.

Case 1: $K^{*(s)} = K_p$. The posterior distribution in (A.17) is still valid.

Case 2: $K^{*(s)} > K_p$. We have to address the new groups. For each additional group $k^\dagger$ with $K_p < k^\dagger \le K^{*(s)}$ we have, for all $i$,
\[
\mathbb{1}\big(g_i^{(s-1)} = k^\dagger\big) = 0 \ \text{ since } k^\dagger > K^{a(s-1)}, \quad \text{and} \quad a_{ik^\dagger} = 0,
\]
where $K^{a(s-1)} = \max_{1 \le i \le N} g_i^{(s-1)}$ denotes the number of active groups in the previous iteration. To ensure a non-negative posterior probability $\omega_{ik^\dagger}$ for the new groups, we assume that, for some $\epsilon$,
\[
\sum_{k = K_p + 1}^{K^{*(s)}} \tilde{a}_{ik} = \epsilon \quad \text{and} \quad \tilde{a}_{im} = \tilde{a}_{in}, \ \forall\, m, n > K_p,
\]
and adjust the prior membership probabilities $a_{ik}$ for $k \le K_p$ by multiplying them by $1 - \epsilon$.
This step artificially assigns nonzero probabilities to the new groups and forms a set of new hyperparameters $\tilde{a}_{ik}$ such that $\sum_{k} \tilde{a}_{ik} = 1$. Then we draw $\omega_i$ from the posterior density
\[
\omega_i \mid \rho, \beta, \alpha, \Sigma, G, Y, X \sim \mathrm{Dir}\big( \tilde{a}_{i1} + \mathbb{1}(g_i = 1), \ldots, \tilde{a}_{iK^{*(s)}} + \mathbb{1}(g_i = K^{*(s)}) \big). \qquad \text{(A.19)}
\]
In the simulation, we set $\epsilon$ to a small positive value.
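A sketch of this Case 2 adjustment with numpy: the original concentration parameters are shrunk by $1 - \epsilon$, the $\epsilon$ mass is spread evenly over the new groups, and $\omega_i$ is drawn from the Dirichlet posterior in (A.19). The function name and example values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def omega_posterior_case2(a_i, g_i, K_star, eps=0.01):
    """Pad the prior concentrations a_i (length K_p) out to K_star > K_p
    groups, then draw omega_i | g_i from the Dirichlet posterior (A.19)."""
    K_p = len(a_i)
    n_new = K_star - K_p
    # Shrink original concentrations by (1 - eps); spread eps over new groups.
    a_tilde = np.concatenate([(1.0 - eps) * np.asarray(a_i, dtype=float),
                              np.full(n_new, eps / n_new)])
    # Dirichlet posterior concentrations: a_tilde_k + 1{g_i = k}.
    post = a_tilde + (np.arange(1, K_star + 1) == g_i)
    return rng.dirichlet(post)

# Unit assigned to a "new" group 4 while the researcher presumed K_p = 3.
omega_i = omega_posterior_case2(a_i=[2.0, 1.0, 1.0], g_i=4, K_star=5)
```

Because the new groups receive only $\epsilon$ prior mass, a unit stays attracted to the researcher's presumed groups unless the likelihood strongly favors a new one.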
Case 3: $K^{*(s)} < K_p$. We have fewer groups than the researcher assumes, and a unit $i$ might be assigned to a group that is no longer considered in the current iteration (i.e., $K^{*(s)} < g_i^{(s-1)}$). In this case, we need to select and renormalize a subset of the $a_{ik}$, since some groups are dismissed. For each unit $i$, we select the $K^{*(s)}$ most frequent non-empty groups among the groups visited in the previous iteration $s - 1$. If there are not enough candidates, we add back non-selected groups among the first $K^{*(s)}$ of the $K_p$ groups. Then we normalize the selected $a_{ik}$ to obtain $\hat{a}_{ik}$. Finally, we draw $\omega_i$ from the posterior density
\[
\omega_i \mid \rho, \beta, \alpha, \Sigma, G, Y, X \sim \mathrm{Dir}\big( \hat{a}_{i1} + \mathbb{1}(g_i = 1), \ldots, \hat{a}_{iK^{*(s)}} + \mathbb{1}(g_i = K^{*(s)}) \big). \qquad \text{(A.20)}
\]
In practice, if all selected $a_{ik}$ are zero for some $i$, we simply set $a_{ik} = 1 / K^{*(s)}$.

Conditional posterior of $G$ (group membership). We derive the posterior distribution of $g_i$ conditional on $G^{(-i)}$, where $G^{(-i)}$ is the set of all membership indices except $g_i$, i.e., $G^{(-i)} = G \setminus g_i$. Hence, for $k = 1, 2, \ldots, K^*$,
\[
p(g_i = k \mid \rho, \beta, \alpha, \Sigma, G^{(-i)}, \pi, \omega, Y, X) \propto p(y_i \mid \rho, \beta_i, \alpha_k, \Sigma_k, Y, X)\, \omega_{ik}\, \mathbb{1}(u_i < \pi_k).
\]
As this is a discrete distribution, we normalize the point masses to get a valid distribution,
\[
p(g_i = k \mid \rho, \beta_i, \alpha, \Sigma, G^{(-i)}, \pi, \omega, Y, X)
= \frac{p(y_i \mid \rho, \beta_i, \alpha_k, \Sigma_k, Y, X)\, \omega_{ik}\, \mathbb{1}(u_i < \pi_k)}
       {\sum_{j=1}^{K^*} p(y_i \mid \rho, \beta_i, \alpha_j, \Sigma_j, Y, X)\, \omega_{ij}\, \mathbb{1}(u_i < \pi_j)}. \qquad \text{(A.21)}
\]

A.3 Random Effects Model with Group Structures in $\alpha$, $\rho$ and $\beta$

In this subsection, we present the conditional posterior distributions for the time-invariant random effects model with group structures in $\alpha$, $\rho$ and $\beta$. The model is
\[
y_{it} = \theta_{g_i}' \check{x}_{it} + \varepsilon_{it}, \qquad \varepsilon_{it} \overset{iid}{\sim} N\big(0, \sigma^2_{g_i}\big),
\]
where $\check{x}_{it} = [1, y_{i,t-1}, x_{it}']'$ and $\theta_{g_i} = [\alpha_{g_i}, \rho_{g_i}, \beta_{g_i}']'$.
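A minimal simulation of this grouped specification may help fix ideas. The parameter values below are made up for illustration; only the structure (group-specific $\theta_k = [\alpha_k, \rho_k, \beta_k]$ and $\sigma^2_k$) follows the model above:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, K = 100, 10, 3

# Hypothetical group-specific coefficients theta_k = [alpha_k, rho_k, beta_k].
theta = np.array([[ 1.0, 0.3,  0.5],
                  [-1.0, 0.6, -0.5],
                  [ 0.0, 0.1,  1.0]])
sigma2 = np.array([0.25, 0.25, 0.25])   # group-specific error variances
g = rng.integers(0, K, size=N)          # latent group memberships

y = np.zeros((N, T + 1))                # y[:, 0] is the initial condition
x = rng.normal(size=(N, T))
for t in range(1, T + 1):
    # x_check_it = [1, y_{i,t-1}, x_it];  y_it = theta_{g_i}' x_check + eps_it
    x_check = np.column_stack([np.ones(N), y[:, t - 1], x[:, t - 1]])
    eps = rng.normal(scale=np.sqrt(sigma2[g]))
    y[:, t] = np.einsum('ij,ij->i', x_check, theta[g]) + eps
```

With $|\rho_k| < 1$ for every group, each unit's series is stationary, and units in the same group share an identical law of motion.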
We use the same notation as in Appendix A.1 aside from the coefficients $\theta$:

Observations: $Y = [y_1, y_2, \ldots, y_N]$, $y_i = [y_{i1}, y_{i2}, \ldots, y_{iT}]'$,
Covariates: $X = [x_1, x_2, \ldots, x_N]$, $x_i = [x_{i1}, x_{i2}, \ldots, x_{iT}]'$,
Stacked coefficients: $\theta = [\theta_1, \theta_2, \ldots]$,
Covariance matrices: $\Sigma = [\sigma^2_1, \sigma^2_2, \ldots]$,
Stick lengths: $\Xi = [\xi_1, \xi_2, \ldots]$,
Group membership: $G = [g_1, \ldots, g_N]$,
Auxiliary variables: $u = [u_1, u_2, \ldots, u_N]$,
Hyperparameters: $\phi = [\mu_\theta, \Sigma_\theta, \nu_\sigma, \delta_\sigma]$.
The posterior of the unknown objects in this random effects model is
\[
p(\theta, \Sigma, \Xi, a, G, u \mid Y, X)
\propto \Bigg[ \prod_{i=1}^{N} p(y_i \mid x_i, \theta_{g_i}, \sigma^2_{g_i})\, \mathbb{1}(u_i < \pi_{g_i}) \Bigg] \Bigg[ \prod_{j=1}^{\infty} p(\theta_j, \sigma^2_j \mid \phi)\, p(\xi_j \mid a) \Bigg] p(a). \qquad \text{(A.22)}
\]
The number of potential groups $K^*$ and the number of active groups $K^a$ are defined in equations (A.3) and (A.5).

Conditional posterior of $\theta$ (grouped coefficients).
\[
p(\theta \mid \Sigma, \Xi, a, G, u, Y, X)
\propto \Bigg[ \prod_{i=1}^{N} p(y_i \mid x_i, \theta_{g_i}, \sigma^2_{g_i})\, \mathbb{1}(u_i < \pi_{g_i}) \Bigg] \Bigg[ \prod_{j=1}^{\infty} p(\theta_j, \sigma^2_j \mid \phi) \Bigg].
\]
For $k \in \{1, 2, \ldots, K^a\}$, define the set of units that belong to group $k$,
\[
C_k = \{ i \in \{1, 2, \ldots, N\} \mid g_i = k \}. \qquad \text{(A.23)}
\]
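The sets $C_k$ are simply the inverse image of the membership vector $G$; in code this is a one-pass grouping (a generic sketch, not tied to the paper's implementation):

```python
from collections import defaultdict

def group_members(G):
    """Map each active group label k to the list C_k = [i : g_i = k]."""
    C = defaultdict(list)
    for i, g_i in enumerate(G):
        C[g_i].append(i)
    return dict(C)

C = group_members([1, 1, 2, 3, 2, 1])
# C[1] == [0, 1, 5], C[2] == [2, 4], C[3] == [3]
```

Each Gibbs step for $\theta_k$ and $\sigma^2_k$ then touches only the data of units in $C_k$.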
Then the posterior density for $\theta_k$ reads
\[
\begin{aligned}
p(\theta_k \mid \Sigma, \Xi, a, G, u, Y, X)
&\propto \Bigg[ \prod_{i \in C_k} p(y_i \mid x_i, \theta_k, \sigma^2_k) \Bigg] p(\theta_k \mid \phi) \\
&\propto \exp\Bigg[ -\frac{1}{2} \sum_{i \in C_k} (y_i - \check{x}_i \theta_k)' \Sigma_k^{-1} (y_i - \check{x}_i \theta_k) \Bigg] \exp\Big[ -\frac{1}{2} (\theta_k - \mu_\theta)' \Sigma_\theta^{-1} (\theta_k - \mu_\theta) \Big] \\
&\propto \exp\Big[ -\frac{1}{2} (\theta_k - \bar{\mu}_{\theta_k})' \bar{\Sigma}_{\theta_k}^{-1} (\theta_k - \bar{\mu}_{\theta_k}) \Big].
\end{aligned}
\]
Assuming an independent normal conjugate prior for $\theta_k$, the posterior for $\theta_k$ is given by
\[
\theta_k \mid \Sigma, \Xi, a, G, u, Y, X \sim N\big( \bar{\mu}_{\theta_k}, \bar{\Sigma}_{\theta_k} \big), \qquad \text{(A.24)}
\]
where
\[
\bar{\Sigma}_{\theta_k} = \Bigg( \Sigma_\theta^{-1} + \sigma_k^{-2} \sum_{i \in C_k} \check{x}_i' \check{x}_i \Bigg)^{-1}, \qquad
\bar{\mu}_{\theta_k} = \bar{\Sigma}_{\theta_k} \Bigg[ \Sigma_\theta^{-1} \mu_\theta + \sigma_k^{-2} \sum_{i \in C_k} \check{x}_i' y_i \Bigg].
\]
If group $k$ is empty, we draw $\theta_k$ from its prior $N(\mu_\theta, \Sigma_\theta)$.

Conditional posterior of $\Sigma$ (grouped variances). Under cross-sectional independence, for $k = 1, 2, \ldots, K^a$,
\[
p(\sigma^2_k \mid \theta, \Xi, a, G, u, Y, X) \propto \Bigg[ \prod_{i \in C_k} p(y_i \mid x_i, \theta_k, \sigma^2_k) \Bigg] p(\sigma^2_k \mid \phi).
\]
Assuming an inverse-gamma prior $\sigma^2_k \sim IG(\nu_\sigma, \delta_\sigma)$, the posterior distribution of $\sigma^2_k$ is
\[
\begin{aligned}
p(\sigma^2_k \mid \theta, \Xi, a, G, u, Y, X)
&\propto \prod_{i \in C_k} \Bigg[ (\sigma^2_k)^{-T/2} \exp\Bigg( -\frac{\sum_{t=1}^{T} (y_{it} - \theta_k' \check{x}_{it})^2}{2 \sigma^2_k} \Bigg) \Bigg] \Big( \frac{1}{\sigma^2_k} \Big)^{\nu_\sigma + 1} \exp\Big( -\frac{\delta_\sigma}{\sigma^2_k} \Big) \\
&= \Big( \frac{1}{\sigma^2_k} \Big)^{\nu_\sigma + \frac{T |C_k|}{2} + 1} \exp\Bigg( -\frac{\delta_\sigma + \frac{1}{2} \sum_{i \in C_k} \sum_{t=1}^{T} (y_{it} - \theta_k' \check{x}_{it})^2}{\sigma^2_k} \Bigg).
\end{aligned}
\]
This implies
\[
\sigma^2_k \mid \theta, \Xi, a, G, u, Y, X \sim IG\big( \bar{\nu}_{\sigma,k}, \bar{\delta}_{\sigma,k} \big), \qquad \text{(A.25)}
\]
where
\[
\bar{\nu}_{\sigma,k} = \nu_\sigma + \frac{T |C_k|}{2}, \qquad
\bar{\delta}_{\sigma,k} = \delta_\sigma + \frac{1}{2} \sum_{i \in C_k} \sum_{t=1}^{T} (y_{it} - \theta_k' \check{x}_{it})^2, \qquad
|C_k| = \#\{ i : g_i = k \}.
\]
If group $k$ is empty, we draw $\sigma^2_k$ from its prior $IG(\nu_\sigma, \delta_\sigma)$.

Conditional posterior of $\Xi$ (stick lengths). Identical to (A.11).

Conditional posterior of $a$ (concentration parameter). Identical to (A.14).

Conditional posterior of $u$ (auxiliary variables). Identical to (A.15).

Conditional posterior of $G$ (group membership). Identical to (A.16).

B Proofs
Proposition B.1
Suppose that we have a model with posterior as given in Section 3.2. Given the definitions of the number of potential components $K^*$ (eq. (A.3)), the minimum of the auxiliary variables $u^*$ (eq. (A.4)), and the number of active groups $K$ (eq. (A.5)), we have:

(i) $u_i > \pi_k$ for all $i = 1, 2, \ldots, N$ and all $k > K^*$;

(ii) $K \le K^*$.

Proof: (i) By definition, $u^* = \min_{1 \le i \le N} u_i$, so for every $i$,
\[
u_i \ge u^* > 1 - \sum_{j=1}^{K^*} \pi_j = \sum_{j = K^* + 1}^{\infty} \pi_j \ge \pi_k, \qquad \forall\, k > K^*.
\]
(ii) Let $i$ be a unit such that $g_i = K$. According to the posterior of $G$, group $K$ is occupied only if $u_i < \pi_K$; otherwise $p(g_i = K \mid \cdot) = 0$. Then, by definition,
\[
u^* \le u_i < \pi_K \ \Rightarrow\ 1 - u^* > 1 - \pi_K \ge \sum_{j=1}^{K-1} \pi_j.
\]
Since $K^*$ is the smallest number such that $1 - u^* < \sum_{j=1}^{K^*} \pi_j$, it follows that $K \le K^*$. $\square$

C Convergence Diagnostic
To assess convergence, we examine the trace plot, cumulative mean, and autocorrelation of the posterior draws for different coefficients. In particular, the data generating process used here is DGP7, in which we assume time-varying grouped random effects and homoskedasticity. We evaluate the most complicated BGRE estimator, Tv-Hetero (time-varying $\alpha_i$, heteroskedasticity), and report the convergence diagnostics for $\alpha$, $\sigma^2$, and $\rho$.

Figure C.1: Convergence Diagnostics, $\alpha$ ($i = 1$, $t = 1$)

Due to the time effects and heteroskedasticity, we randomly present one of the $\alpha$'s, for unit $i = 1$ at $t = 1$, and the variance of the error term $\sigma^2$ for unit $i = 1$.
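The cumulative-mean and autocorrelation diagnostics shown in these figures can be computed along the following lines. This is a generic numpy sketch for any scalar MCMC chain, not the paper's plotting code:

```python
import numpy as np

def cumulative_mean(draws):
    """Running mean of an MCMC chain; should flatten out after burn-in."""
    draws = np.asarray(draws, dtype=float)
    return np.cumsum(draws) / np.arange(1, len(draws) + 1)

def autocorrelation(draws, max_lag=20):
    """Sample autocorrelation of the chain at lags 0, 1, ..., max_lag."""
    x = np.asarray(draws, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - h], x[h:]) / var
                     for h in range(max_lag + 1)])
```

A chain that mixes well shows a quickly stabilizing cumulative mean and autocorrelations that decay toward zero within a few lags.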
Figure C.2: Convergence Diagnostics, $\sigma^2$ ($i = 1$) and $\rho$

D Computation Details

E Additional Simulation Results
E.1 Main MC Simulation: Larger Variance
In this section, we present additional simulation results for DGP1, DGP2, DGP3 and DGP4 with a larger error variance. The other settings remain the same as in the main text: $N = 100$, $T = 10$, and the true number of groups $K$ is unchanged.

Tables E.1 and E.2 show the estimation and forecast comparisons among the alternative predictors. For DGP1 and DGP2, the results are similar to those with the smaller variance in the main text: the best models are Ti-Homo and Ti-Hetero, respectively. In DGP3, however, the Tv-Homo and Tv-Hetero estimators, which are expected to stand out since they correctly model the time effects, do not offer the best performance. A likely reason is that estimation becomes substantially more difficult in the presence of both time-varying random effects and much noisier error terms, making it hard to determine the group structure accurately. Regarding DGP4, Ti-Homo and Ti-Hetero deliver outstanding performance relative to the other estimators. As there is no group structure in this DGP, the Flat estimator should be the best; it does generate accurate estimates and forecasts, but Ti-Homo and Ti-Hetero still stand out. This is mainly because Ti-Homo and Ti-Hetero partition similar units into several groups, which averages out the noisy error terms and hence yields strong performance. These results also suggest that, even when there is no group structure in the sample, our BGRE estimators have an edge over estimators that either pool all information (Pooled) or treat each unit separately (Flat).
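The intuition that grouping averages out noisy errors, which drives the strong DGP4 performance of Ti-Homo and Ti-Hetero, can be seen in a stylized calculation (numpy; this is an illustration of the averaging effect, not the paper's estimator): the mean over a group of similar units has a much smaller error than each unit's own sample mean.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, sigma = 100, 10, 1.0
alpha_true = 0.5                 # all units share one intercept (no groups)
y = alpha_true + sigma * rng.normal(size=(N, T))

alpha_flat = y.mean(axis=1)      # unit-by-unit means ("Flat")
alpha_grouped = y.mean()         # mean over all similar units pooled together

rmse_flat = np.sqrt(np.mean((alpha_flat - alpha_true) ** 2))
rmse_grouped = abs(alpha_grouped - alpha_true)
# Pooling N similar units shrinks the estimation error by roughly sqrt(N).
```

The flat estimator's error scales like $\sigma/\sqrt{T}$, while the grouped mean's scales like $\sigma/\sqrt{NT}$, which is why partitioning similar units helps even when the true number of groups is one.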
Table E.1: Monte Carlo Experiment: Point Estimates, Larger σ

                          --------------- ρ̂ ---------------    α̂_i    Cluster
                  RMSE     Bias      Std     AvgL     Cov     PBias    Avg K
DGP 1 (Grp Ti Ho.)
  Ti-Homo        0.0336   0.0227   0.0169   0.0658   0.70   -0.1446     3.08
  Ti-Hetero      0.0342   0.0241   0.0165   0.0643   0.63   -0.1548     2.96
  Tv-Homo        0.2386   0.2363   0.0181   0.0686   0.05   -1.5126     1.30
  Tv-Hetero      0.2436   0.2419   0.0155   0.0597   0.05   -1.5488     1.41
  Pooled         0.2217   0.2215   0.0088   0.0342   0      -1.4113     1
  Flat           0.0665  -0.0642   0.0164   0.0639   0.03    0.4058   100
  Param          0.2493   0.2205   0.1119   0.4429   0.51   -1.4000     1
DGP 2 (Grp Ti He.)
  Ti-Homo        0.0146   0.0058   0.0104   0.0404   0.94   -0.0405     4.19
  Ti-Hetero      0.0084   0.0033   0.0062   0.0242   0.91   -0.0231     3.96
  Tv-Homo        0.2231   0.2215   0.0193   0.0744   0.02   -1.4385    10.59
  Tv-Hetero      0.1639   0.1611   0.0218   0.0827   0.09   -1.0595     3.08
  Pooled         0.2495   0.2494   0.0063   0.0245   0      -1.6040     1
  Flat           0.0262  -0.0237   0.0105   0.0409   0.32    0.1506   100
  Param          0.2743   0.2480   0.1135   0.4488   0.29   -1.5926     1
DGP 3 (Grp Tv Ho.)
  Ti-Homo        0.2074   0.2064   0.0172   0.0670   0.02    0.0052     1.03
  Ti-Hetero      0.2063   0.2054   0.0169   0.0659   0.01    0.0050     1.03
  Tv-Homo        0.2102   0.2095   0.0168   0.0653   0       0.0052     1.03
  Tv-Hetero      0.2113   0.2106   0.0169   0.0658   0       0.0051     1.17
  Pooled         0.2102   0.2096   0.0166   0.0646   0       0.0050     1
  Flat           0.1870  -0.1849   0.0277   0.1080   0      -0.0047   100
  Param          0.2162   0.2098   0.0505   0.2012   0.01    0.0168     1
DGP 4 (Std Ti Ho.)
  Ti-Homo        0.0145   0.0063   0.0097   0.0376   0.88   -0.0439     4.45
  Ti-Hetero      0.0148   0.0069   0.0098   0.0384   0.89   -0.0481     4.23
  Tv-Homo        0.3096   0.3093   0.0137   0.0530   0      -1.9922     1.99
  Tv-Hetero      0.3108   0.3105   0.0135   0.0522   0      -1.9998     2.04
  Pooled         0.2529   0.2528   0.0062   0.0240   0      -1.6281     1
  Flat           0.0256  -0.0230   0.0100   0.0388   0.37    0.1481   100
  Param          0.2775   0.2515   0.1161   0.4593   0.26   -1.6167     1
Table E.2: Monte Carlo Experiment: Forecast, Larger σ

                     --- Point Forecast ---    - Set Forecast -   Density Forecast
                  RMSFE    Error      Std      AvgL     Cov      LPS      CRPS
DGP 1 (Grp Ti Ho.)
  Ti-Homo        1.2281   0.0012   1.2210    4.8430   0.95   -1.6271   0.6939
  Ti-Hetero      1.2297   0.0041   1.2224    4.8338   0.95   -1.6305   0.6952
  Tv-Homo        1.2863   0.0180   1.2724    5.1561   0.95   -1.6746   0.7266
  Tv-Hetero      1.2902   0.0193   1.2760    5.1626   0.95   -1.6776   0.7285
  Pooled         1.3385   0.3476   1.2832    5.4955   0.96   -1.7161   0.7567
  Flat           1.2637  -0.1433   1.2494    4.8660   0.95   -1.6564   0.7147
  Param          1.3415   0.3499   1.2856    8.2549   1      -1.8465   0.7971
DGP 2 (Grp Ti He.)
  Ti-Homo        0.6989   0.0055   0.6949    2.7235   0.93   -1.0583   0.3819
  Ti-Hetero      0.6926   0.0024   0.6884    2.5818   0.96   -0.8640   0.3600
  Tv-Homo        0.8502   0.0101   0.8425    2.5678   0.89   -1.2358   0.4496
  Tv-Hetero      0.7313   0.0080   0.7230    2.7180   0.95   -0.9323   0.3812
  Pooled         0.8699   0.4283   0.7507    3.7531   0.95   -1.2926   0.4832
  Flat           0.7172  -0.0414   0.7120    2.8231   0.93   -1.0899   0.3935
  Param          0.8751   0.4284   0.7567    7.1947   1      -1.5801   0.5658
DGP 3 (Grp Tv Ho.)
  Ti-Homo        1.2684  -0.0325   1.2614    5.0355   0.95   -1.6600   0.7162
  Ti-Hetero      1.2686  -0.0326   1.2616    5.0343   0.95   -1.6603   0.7164
  Tv-Homo        1.2750   0.0022   1.2616    5.0568   0.95   -1.6653   0.7200
  Tv-Hetero      1.2749   0.0026   1.2614    5.0580   0.95   -1.6653   0.7197
  Pooled         1.2678  -0.0328   1.2608    5.0383   0.95   -1.6596   0.7158
  Flat           1.2781  -0.0377   1.2711    4.7935   0.94   -1.6690   0.7235
  Param          1.2835  -0.0214   1.2608    9.2910   1      -1.8663   0.7878
DGP 4 (Std Ti Ho.)
  Ti-Homo        0.6435  -0.0112   0.6400    2.5472   0.95   -0.9807   0.3636
  Ti-Hetero      0.6436  -0.0103   0.6401    2.6020   0.95   -0.9837   0.3639
  Tv-Homo        0.7069   0.0247   0.6988    2.8327   0.95   -1.0760   0.3993
  Tv-Hetero      0.7072   0.0247   0.6991    2.8513   0.95   -1.0771   0.3994
  Pooled         0.8200   0.4198   0.7011    3.5977   0.97   -1.2356   0.4660
  Flat           0.6720  -0.0580   0.6663    2.6339   0.95   -1.0246   0.3800
  Param          0.8243   0.4203   0.7059    7.0829   1      -1.5529   0.5467
E.2 Main MC Simulation: Shorter Time Periods
Here, we show the additional simulation results of DGP1, DGP2 and DGP4 with smallperiod, i.e, T “
5. The rest settings remain the same: N “ σ “ . and the truenumber of groups is K “ T ˆ ρ ˆ α i ClusterRMSE Bias Std AvgL Cov PBias Avg KDGP 1(Grp Ti Ho.) Ti-Homo 0.0379 0.0276 0.0198 0.0766 0.67 -0.1414 3.16Ti-Hetero 0.0387 0.029 0.0198 0.0772 0.66 -0.1488 3.09Tv-Homo 0.3577 0.3566 0.0199 0.0777 0.02 -1.8223 1.37Tv-Hetero 0.3654 0.3646 0.0195 0.0760 0.01 -1.8584 1.63Pooled 0.2789 0.2785 0.0120 0.0467 0.01 -1.4050 1Flat 0.0591 -0.0554 0.0190 0.0738 0.16 0.2782 100Param 0.3146 0.2783 0.1385 0.5503 0.43 -1.4006 1DGP 2(Grp Ti He.) Ti-Homo 0.0523 0.0380 0.0245 0.0952 0.66 -0.189 2.91Ti-Hetero 0.0230 0.0126 0.0147 0.0574 0.86 -0.0639 3.40Tv-Homo 0.3099 0.3052 0.0288 0.1115 0.05 -1.5558 5.65Tv-Hetero 0.2099 0.2018 0.0377 0.1455 0.17 -1.0403 2.87Pooled 0.2613 0.2609 0.0138 0.0537 0 -1.3253 1Flat 0.0833 -0.0798 0.0232 0.0902 0.04 0.4030 100Param 0.2965 0.2602 0.1365 0.5417 0.51 -1.3210 1DGP 4(Std Ti Ho.) Ti-Homo 0.2345 0.2329 0.0269 0.1051 0 0.0018 1Ti-Hetero 0.2344 0.2328 0.0269 0.105 0 0.0017 1.01Tv-Homo 0.2357 0.2342 0.0270 0.1051 0 0.0016 1Tv-Hetero 0.2363 0.2347 0.0272 0.1062 0 0.0018 1.24Pooled 0.2344 0.2328 0.0270 0.1051 0 0.0017 1Flat 0.3601 -0.3571 0.0459 0.1790 0 -0.0029 100Param 0.2516 0.2323 0.0941 0.3756 0.19 0.0066 1 his Version: October 6, 2020his Version: October 6, 2020
Table E.4: Monte Carlo Experiment: Forecast, Smaller T

                         Point Forecast           Set Forecast   Density Forecast
                     RMSFE    Error     Std      AvgL    Cov       LPS     CRPS
DGP 1 (Grp Ti Ho.)
  Ti-Homo           0.8491   0.0553   0.8407   3.3362   0.95    -1.2561   0.4798
  Ti-Hetero         0.8505   0.0585   0.8418   3.3743   0.95    -1.2605   0.4810
  Tv-Homo           0.9310   0.1221   0.9114   3.6884   0.95    -1.3514   0.5275
  Tv-Hetero         0.9352   0.1248   0.9156   3.6947   0.95    -1.3549   0.5296
  Pooled            1.0817   0.6143   0.8796   4.2897   0.96    -1.4992   0.6152
  Flat              0.8790  -0.1242   0.8653   3.4044   0.95    -1.2946   0.4980
  Param             1.0852   0.6144   0.8836   7.5513   1       -1.6938   0.6666
DGP 2 (Grp Ti He.)
  Ti-Homo           1.1264   0.0956   1.1131   4.2430   0.93    -1.5308   0.6138
  Ti-Hetero         1.0782   0.0397   1.0698   3.9245   0.95    -1.3183   0.5621
  Tv-Homo           1.2929   0.1517   1.2713   4.0156   0.89    -1.6690   0.6880
  Tv-Hetero         1.1323   0.1113   1.1129   4.1041   0.94    -1.3952   0.5948
  Pooled            1.2862   0.6034   1.1257   5.0186   0.93    -1.6744   0.7084
  Flat              1.1320  -0.166    1.1129   4.3232   0.93    -1.5477   0.6210
  Param             1.2889   0.6022   1.1294   7.9859   0.99    -1.8040   0.7577
DGP 4 (Std Ti Ho.)
  Ti-Homo           0.8550  -0.0081   0.8500   3.4368   0.96    -1.2665   0.4841
  Ti-Hetero         0.8550  -0.0082   0.8500   3.4362   0.96    -1.2668   0.4841
  Tv-Homo           0.8607  -0.026    0.8501   3.4509   0.95    -1.2733   0.4875
  Tv-Hetero         0.8620  -0.0259   0.8514   3.4513   0.95    -1.2745   0.4883
  Pooled            0.8550  -0.0082   0.8500   3.4362   0.95    -1.2665   0.4841
  Flat              0.8907   0.0036   0.886    3.2123   0.92    -1.3147   0.5052
  Param             0.8671  -0.0032   0.8498   8.5456   1       -1.6545   0.5970
E.3 Main MC Simulation: Different K

In this section, we present the simulation results for DGP1, DGP2, and DGP3 with different numbers of groups; in particular, we consider four values of K, the largest being K = 8. The remaining settings are unchanged from the main simulation, with T = 10.

Figure E.1 presents the relative performance of the BGRE estimators against the flat estimator under different K. In particular, we show the results for the correctly specified estimator in each DGP, i.e., the Ti-Homo estimator for DGP 1, the Ti-Hetero estimator for DGP 2, and the Tv-Homo estimator for DGP 3. For DGP 1, the accuracy of the estimates and the predictive power of the BGRE estimator gradually vanish as K increases. At K = 8, the BGRE estimator still marginally dominates the flat estimator in all respects except the bias of α. Moving to DGP 2, the BGRE estimator outperforms the flat estimator for all K, suggesting that it successfully captures heterogeneity in variance and sophisticated group patterns, even with eight distinct clusters. Regarding DGP 3, where we introduce time variation in α, the BGRE estimator outperforms the benchmark model in terms of forecasting. Moreover, the RMSFE, the average length of the credible set, and the LPS all trend down, indicating that the predictive gains of the BGRE estimator grow as the true model becomes more sophisticated. It is also noteworthy that, while the average length of the credible set for ρ is relatively large, the BGRE estimator generates a much lower RMSE for ρ and a much lower absolute bias for α than the flat estimator.
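The point-, set-, and density-forecast measures reported in these tables (RMSFE, average credible-set length and coverage, LPS, CRPS) are standard, and the sketch below shows how they can be computed for a one-step-ahead Gaussian predictive density. This is a minimal illustration of ours with hypothetical names, assuming a normal predictive density for each unit; in the paper the predictive densities are posterior mixtures, so this is not the paper's implementation.

```python
import numpy as np
from scipy import stats

def forecast_metrics(y, mu, sigma, level=0.95):
    """Forecast-evaluation metrics for Gaussian predictive densities
    N(mu[i], sigma[i]^2), one per cross-sectional unit.
    y: realized values; returns (RMSFE, AvgL, Cov, LPS, CRPS)."""
    y, mu, sigma = map(np.asarray, (y, mu, sigma))
    # point forecast: root mean squared forecast error
    rmsfe = np.sqrt(np.mean((y - mu) ** 2))
    # set forecast: equal-tailed predictive interval at the given level
    zcrit = stats.norm.ppf(0.5 + level / 2)
    lo, hi = mu - zcrit * sigma, mu + zcrit * sigma
    avg_len = np.mean(hi - lo)                 # AvgL
    coverage = np.mean((y >= lo) & (y <= hi))  # Cov
    # density forecast: average log predictive score at the realization
    lps = np.mean(stats.norm.logpdf(y, mu, sigma))
    # closed-form CRPS of a normal predictive density (lower is better)
    z = (y - mu) / sigma
    crps = np.mean(sigma * (z * (2 * stats.norm.cdf(z) - 1)
                            + 2 * stats.norm.pdf(z) - 1 / np.sqrt(np.pi)))
    return rmsfe, avg_len, coverage, lps, crps
```

For a correctly specified N(0, 1) forecast of N(0, 1) data, coverage should be close to 0.95 and the expected CRPS equals 1/√π ≈ 0.564, which provides a quick sanity check of the implementation.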
Figure E.1: Monte Carlo Experiment: BGRE Estimator, Different K. Panels: (a) DGP1, Estimates; (b) DGP1, Forecasts; (c) DGP2, Estimates; (d) DGP2, Forecasts; (e) DGP3, Estimates; (f) DGP3, Forecasts.

Table E.5: Monte Carlo Experiment: Point Estimates, Different K, DGP1 (Grp Ti Ho.)

Table E.6: Monte Carlo Experiment: Forecast, Different K, DGP1 (Grp Ti Ho.)

Table E.7: Monte Carlo Experiment: Point Estimates, Different K, DGP2 (Grp Ti He.)

Table E.8: Monte Carlo Experiment: Forecast, Different K, DGP2 (Grp Ti He.)

Table E.9: Monte Carlo Experiment: Point Estimates, Different K, DGP3 (Grp Tv Ho.)

Table E.10: Monte Carlo Experiment: Forecast, Different K, DGP3 (Grp Tv Ho.)
E.4 Two-Step GRE Estimator
Table E.11: Monte Carlo Experiment: Two-Step GRE with Kmeans, Point Estimates

                         Estimates of ρ                       α_i     Cluster
                     RMSE     Bias      Std     AvgL    Cov  PBias     Avg K
DGP 1 (Grp Ti Ho.)
  Ti-Homo          0.0627   0.0599   0.0114   0.0444   0.25  -0.3836    2.2
  Ti-Hetero        0.0611   0.0583   0.0116   0.0452   0.27  -0.3734    2.2
  Tv-Homo          0.1567   0.1544   0.0186   0.0721   0.11  -0.9927    2.2
  Tv-Hetero        0.1560   0.1537   0.0189   0.0738   0.10  -0.9887    2.2
DGP 2 (Grp Ti He.)
  Ti-Homo          0.0550   0.0513   0.0133   0.0517   0.27  -0.3388    2.2
  Ti-Hetero        0.0456   0.0429   0.0099   0.0386   0.27  -0.2822    2.2
  Tv-Homo          0.1196   0.1143   0.0203   0.0789   0.21  -0.7572    2.2
  Tv-Hetero        0.1458   0.1427   0.0196   0.0764   0.15  -0.9400    2.2
DGP 3 (Grp Tv Ho.)
  Ti-Homo          0.2863   0.2861   0.0114   0.0446   0.00  -2.2885    2.26
  Ti-Hetero        0.2807   0.2804   0.0115   0.0448   0.00  -2.2489    2.26
  Tv-Homo          0.1196   0.1171   0.0166   0.0648   0.08  -0.8168    2.26
  Tv-Hetero        0.1172   0.1144   0.017    0.0661   0.08  -0.7982    2.26
DGP 4 (Std Ti Ho.)
  Ti-Homo          0.0832   0.0796   0.0210   0.0819   0.10   0.0014    2.03
  Ti-Hetero        0.0833   0.0797   0.0210   0.0819   0.08   0.0015    2.03
  Tv-Homo          0.0942   0.0908   0.0218   0.0851   0.05   0.0018    2.03
  Tv-Hetero        0.0943   0.0908   0.0218   0.0851   0.06   0.0019    2.03
Table E.12: Monte Carlo Experiment: Two-Step GRE with Kmeans, Forecast

                         Point Forecast           Set Forecast   Density Forecast
                     RMSFE    Error     Std      AvgL    Cov       LPS     CRPS
DGP 1 (Grp Ti Ho.)
  Ti-Homo           0.8607   0.0777   0.8502   3.3899   0.95    -1.2715   0.4866
  Ti-Hetero         0.8610   0.0751   0.8507   3.3949   0.95    -1.2714   0.4867
  Tv-Homo           0.8457   0.0084   0.8369   3.3324   0.95    -1.2545   0.4775
  Tv-Hetero         0.8457   0.0086   0.8369   3.3393   0.95    -1.2551   0.4774
DGP 2 (Grp Ti He.)
  Ti-Homo           1.0969   0.0856   1.0860   4.2250   0.93    -1.5155   0.6031
  Ti-Hetero         1.0993   0.0723   1.0896   4.0301   0.94    -1.4115   0.5857
  Tv-Homo           1.0886   0.0054   1.0764   4.2115   0.93    -1.5073   0.5971
  Tv-Hetero         1.0900   0.0079   1.0775   3.9970   0.94    -1.3806   0.5758
DGP 3 (Grp Tv Ho.)
  Ti-Homo           1.3326   0.3341   1.2852   5.0617   0.95    -1.7082   0.7589
  Ti-Hetero         1.3393   0.3219   1.2952   5.0363   0.95    -1.7071   0.7616
  Tv-Homo           1.3031   0.0273   1.2961   4.2823   0.90    -1.7175   0.7477
  Tv-Hetero         1.3052   0.0263   1.2982   4.2703   0.90    -1.7150   0.7483
DGP 4 (Std Ti Ho.)
  Ti-Homo           0.8532  -0.0232   0.8488   3.2353   0.94    -1.2638   0.4822
  Ti-Hetero         0.8532  -0.0231   0.8488   3.2420   0.94    -1.2641   0.4823
  Tv-Homo           0.8537  -0.0018   0.8458   3.2581   0.94    -1.2640   0.4826
  Tv-Hetero         0.8537  -0.0014   0.8458   3.2654   0.94    -1.2646   0.4826
E.5 Subjective Priors With Knowledge on Groups
Table E.13: Monte Carlo Experiment: Estimates, SGP prior, DGP3

                    Estimates of ρ                        α_i     Cluster
               RMSE     Bias      Std     AvgL    Cov     Bias     Avg K
  SGP-RE1    0.0396   0.0294   0.0225   0.0871   0.75   -0.2072    4.00
  SGP-RE2    0.0463   0.0378   0.0228   0.0892   0.64   -0.2650    3.98
  SGP-RE3    0.0760   0.0716   0.0214   0.0834   0.26   -0.4993    3.58
  SGP-RE4    0.0821   0.0793   0.0210   0.0818   0.02   -0.5533    4.00
  SGP-RE5    0.0691   0.0654   0.0214   0.0834   0.09   -0.4583    6.00
  Tv-Hetero  0.0599   0.0549   0.0220   0.0849   0.31   -0.3842    3.85
  Flat       0.3243   0.3240   0.0126   0.0493   0      -2.5626    100

Table E.14: Monte Carlo Experiment: Forecast, SGP prior, DGP3

                    Point Forecast           Set Forecast   Density Forecast
               RMSFE    Error     Std      AvgL    Cov       LPS     CRPS
  SGP-RE1     1.0266   0.0198   1.0178   3.7766   0.93    -1.4553   0.5823
  SGP-RE2     1.0465   0.0214   1.0377   3.8382   0.93    -1.4716   0.5926
  SGP-RE3     1.1337   0.0287   1.1250   4.1172   0.93    -1.5542   0.6455
  SGP-RE4     1.0682   0.0310   1.0590   4.0305   0.94    -1.4934   0.6063
  SGP-RE5     1.0784   0.0293   1.0696   3.9689   0.93    -1.5070   0.6114
  Tv-Hetero   1.0952   0.0255   1.0866   3.9531   0.93    -1.6450   0.6192
  Flat        1.2400   0.4211   1.1608   5.3189   0.97    -1.6450   0.7048
E.5.1 Fixed K Estimator: Imposing the True Number of Groups
As shown in the previous subsection, the Bayesian GRE estimator works reasonably well in finite samples at determining the number of groups. In this subsection, we instead assume that the number of groups is known and focus on clustering. We present a table of clustering accuracy, in which each row shows the fraction of units that are correctly assigned to the true
group. As an orthodox clustering algorithm, Kmeans is also included as a benchmark. To avoid cluttering the tables in the main text, we do not present results for the suboptimal estimators. More precisely, for the DGPs involving time-invariant random effects, we only document the results for Ti-Homo and Ti-Hetero, since the other estimators are arguably worse at clustering, based on the simulations presented in the previous section; the same rule applies to the time-varying DGPs.

Table E.15 shows the accuracy of clustering for each estimator. Overall, accuracy is high for Kmeans and for the correctly specified estimator in each DGP, while our BGRE estimators are slightly dominated by the Kmeans algorithm. The reasons are straightforward. Our BGRE estimators simultaneously estimate parameters and group units, whereas Kmeans merely performs clustering. The additional estimation steps in our block Gibbs sampler depend on priors and parametric assumptions that can affect the clustering; the Kmeans algorithm, on the other hand, forms clusters purely from the spatial relationships between units, free of any such assumption. These differences yield the discrepancies in accuracy between Kmeans and the BGRE estimators, which are acceptable as they are within 10% most of the time. Compared with the performance of Kmeans within the two-step GRE estimator (Table 6), imposing the correct number of groups indeed improves the clustering ability of Kmeans. Nevertheless, the true number of groups is rarely known in practice.

Table E.15: Monte Carlo Experiment: Accuracy of Clustering, Fixed K

                                Group 1   Group 2   Group 3   Group 4
DGP 5 (Grp Ti Ho.)  Ti-Homo    86.87%    83.62%    60.42%    91.64%
                    Ti-Hetero  73.59%    66.17%    56.47%    95.82%
                    Kmeans     88.00%    96.00%    68.00%   100.00%
DGP 6 (Grp Ti He.)  Ti-Homo    78.17%    78.66%    66.58%    99.14%
                    Ti-Hetero  89.63%    84.22%    85.18%    99.98%
                    Kmeans     84.00%   100.00%    88.00%   100.00%
DGP 7 (Grp Tv Ho.)  Tv-Homo    99.33%    68.99%    93.40%    84.61%
                    Tv-Hetero  98.92%    71.53%    93.41%    75.87%
                    Kmeans     96.00%    88.00%   100.00%    88.00%

Next, we visualize the clusters to provide a clearer view of the clustering performance. We construct a posterior similarity matrix, a matrix containing the posterior probabilities that observations i and j belong to the same cluster, estimated empirically from the MCMC draws. This design avoids the problem of reassigning group members in each posterior draw
and shows a clear group structure.

Figures E.2, E.3, and E.4 present the similarity matrices for the simulations using DGP5, DGP6, and DGP7, respectively. The colors depict the degree of similarity. Ideally, a perfect estimator would reveal four light yellow squares in the heatmap, leaving the remaining area dark blue. As DGP5 features time-invariant random effects and homoskedasticity, the Ti-Homo and Ti-Hetero estimators reveal a clear partition that matches the design of DGP5. Though a few units are incorrectly clustered, four yellow squares on the diagonal indicate that the posterior partition is reliable. The Tv-Homo and Tv-Hetero estimators, however, deliver inferior estimates and present one major group instead of four.

Turning to DGP6, the best partition is generated by the Ti-Hetero estimator, which is correctly specified under this DGP. Even though the data density of group 2 heavily overlaps with those of groups 1 and 3, owing to the relatively small mean and large variance of α_i, the Ti-Hetero estimator succeeds in delivering a group pattern that clearly distinguishes these three groups.
The Ti-Homo estimator also performs well despite ignoring heteroskedasticity, but it generates much vaguer boundaries between groups 1, 2, and 3. The Tv-Homo and Tv-Hetero results are severely garbled; neither depicts the correct partition.

As for DGP7, Tv-Homo and Tv-Hetero perform best, which is expected under this DGP. We see a clear four-group pattern in the similarity matrices in panels (c) and (d). A few yellow and light blue stripes in the off-diagonal blocks suggest that the Tv-Homo and Tv-Hetero estimators wrongly allocate a few units in the posterior draws, especially units in groups 2 and 4. Indeed, the paths of the random effects in these two groups are quite similar: as depicted in Figure 1, the red line (group 2) can roughly be viewed as a step-function approximation of the green line (group 4). Ti-Homo and Ti-Hetero struggle as they ignore the time effect in α_i by construction.
Figure E.2: Heatmap for Similarity Matrix, DGP5, fix K
Figure E.3: Heatmap for Similarity Matrix, DGP6, fix K
Figure E.4: Heatmap for Similarity Matrix, DGP7, fix K

F Data Set
The individual company raw annual data are obtained from the COMPUSTAT database. We construct the sample using data from 1999 to 2019. We do not use data going back to the 1970s in order to avoid potential structural breaks in the variables of interest and to reflect the accelerated pace of capital accumulation in recent decades. The primary variables of interest are:

• K = Capital stock: net property, plant, and equipment. [PPENT]
• I = Investment: capital expenditures in property, plant, and equipment. [CAPX]
• Y = Sales: net sales revenues. [SALE]
• CF = Cash flow: income after taxes and interest plus depreciation minus dividends. [EBITDA - TXT - XINT - DVT]

Additional variables used in the alternative model specification:
• Q = Tobin's Q: defined as (E + B - INV) / K - 1.
• E = Market value of equity: the sum of common equity and preferred equity. [PRCC_f * CSHO + PSTK]
• B = Book value of debt: the sum of short-term and long-term debt. [DLC + DLTT]
• INV = Market value of inventories. [INVT]

The variable names and formulas in brackets are the corresponding items in COMPUSTAT. We process the raw data according to the following rules:

1. Observations where capital stock or sales are zero or negative are eliminated.
2. Firms that have missing values in the primary variables of interest during 1999-2019 are excluded.
3. We eliminate any firm-year observation in which the firm was involved in a merger or acquisition.
4. Each firm must have valid annual observations from 1999 to 2019.

The final sample comprises 337 firms with 20 observations per firm. The summary statistics are reported in Table F.1.

Table F.1: Descriptive Statistics for the Variables of Interest

          Min     25%   Median   Mean    75%    Max    StD   Skew.  Kurt.
I/K       0.03   0.11    0.16    0.17   0.21   0.53   0.09   1.41   2.53
CF/K     -1.13   0.12    0.26    0.38   0.51   2.55   0.48   1.55   5.94
Y/K      -1.53   0.54    1.35    1.19   1.95   4.19   1.17  -0.23  -0.09
N/K      -8.63  -5.36   -4.19   -4.56  -3.46  -1.77   1.52  -0.74  -0.12
log(K)   -0.37   5.16    6.77    6.60   8.32   9.82   2.26  -0.63   0.21
q        -0.55   0.83    2.96    7.37   8.32  90.06  12.92   4.00  19.63
Notes: The descriptive statistics are computed across the N and T dimensions of the panel.
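The variable construction above is simple arithmetic on the listed COMPUSTAT items. A minimal sketch for one firm-year, assuming the raw items are available as a mapping keyed by the item names listed above (the function name, the `PRCC_F` spelling of the price item, and the dict layout are ours):

```python
def construct_variables(row):
    """Build the model variables for one firm-year.

    `row` maps COMPUSTAT item names (PPENT, CAPX, SALE, EBITDA, TXT,
    XINT, DVT, PRCC_F, CSHO, PSTK, DLC, DLTT, INVT) to float values.
    Returns None for firm-years that fail the non-positivity filter.
    """
    K = row["PPENT"]                               # capital stock
    Y = row["SALE"]                                # sales
    # filter rule 1: drop observations with non-positive capital or sales
    if K <= 0 or Y <= 0:
        return None
    I = row["CAPX"]                                # investment
    # cash flow: income after taxes and interest plus depreciation
    # minus dividends
    CF = row["EBITDA"] - row["TXT"] - row["XINT"] - row["DVT"]
    E = row["PRCC_F"] * row["CSHO"] + row["PSTK"]  # market value of equity
    B = row["DLC"] + row["DLTT"]                   # book value of debt
    Q = (E + B - row["INVT"]) / K - 1.0            # Tobin's Q
    return {"K": K, "I": I, "Y": Y, "CF": CF, "Q": Q}
```

The merger-and-acquisition and balanced-panel filters (rules 3 and 4) require firm-level metadata and the full time series, so they are not shown here.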
G Additional Empirical Results
In this section, we present the full results of the empirical analysis, with detailed year-by-year estimates.

Table G.1: Empirical Application: Predict Investment Rate, RMSFE

                        2015     2016     2017     2018     2019
Homogeneous Coef.
  Ti-Homo             0.0917   0.1395   0.2625   0.1166   0.1108
  Ti-Hetero           0.0750   0.1159   0.3550   0.0674   0.0822
  Tv-Homo             0.0927   0.1382   0.2590   0.1165   0.1177
  Tv-Hetero           0.0783   0.1156   0.3686   0.0692   0.0812
  Pooled              0.0926   0.1386   0.2625   0.1160   0.1150
  Flat                0.1034   0.1491   0.2703   0.1328   0.1100
  Param               0.1958   0.2295   0.2466   0.2492   0.2043
Heterogeneous Coef.
  Ti-Homo             0.1103   0.1006   1.8575   0.1041   0.1144
  Ti-Hetero           0.1104   0.0999   1.8802   0.1028   0.1152
  Tv-Homo             0.1582   0.1729   0.2863   0.1782   0.1070
  Tv-Hetero           0.1097   0.1009   1.8644   0.1062   0.1101
  Flat                0.1649   0.1906   1.7937   0.1833   0.1164
Table G.2: Empirical Application: Predict Investment Rate, Average Number of Groups

                        2015     2016     2017     2018     2019
Homogeneous Coef.
  Ti-Homo                2        2        1        2        2
  Ti-Hetero            6.62     7.8      6.66     7.79     7.86
  Tv-Homo                1        1        1        1        1
  Tv-Hetero            6.75     6.79     6.8      7.64     6.75
  Pooled                 1        1        1        1        1
  Flat                 337      337      337      337      337
  Param                  1        1        1        1        1
Heterogeneous Coef.
  Ti-Homo              7.75     6.66     6.78     8.05     7.48
  Ti-Hetero            7.09     6.65     7.23     6.67     6.68
  Tv-Homo                1        1        1        1        1
  Tv-Hetero            6.07     6.11     6.79     6.57     7.46
  Flat                 337      337      337      337      337

Table G.3: Empirical Application: Predict Investment Rate, Frequentist Coverage Rate

                        2015     2016     2017     2018     2019
Homogeneous Coef.
  Ti-Homo             0.9822   0.9703   0.9733   0.9733   0.9614
  Ti-Hetero           0.9525   0.9466   0.9674   0.9585   0.9021
  Tv-Homo             0.9822   0.9644   0.9792   0.9733   0.9525
  Tv-Hetero           0.9466   0.9347   0.9496   0.9466   0.8813
  Pooled              0.9852   0.9644   0.9792   0.9763   0.9555
  Flat                0.9792   0.9792   0.9703   0.9733   0.9703
  Param                  1        1        1        1        1
Heterogeneous Coef.
  Ti-Homo             0.9407   0.9585   0.9555   0.9525   0.8724
  Ti-Hetero           0.9436   0.9466   0.9674   0.9496   0.8724
  Tv-Homo             0.9852   0.9792   0.9792   0.9792   0.9703
  Tv-Hetero           0.9466   0.9466   0.9407   0.9258   0.8338
  Flat                0.9733   0.9822   0.9733   0.9763   0.9733
Table G.4: Empirical Application: Predict Investment Rate, Length of 95% Credible Set

                        2015     2016     2017     2018     2019
Homogeneous Coef.
  Ti-Homo             0.5737   0.5692   0.5710   0.5912   0.5908
  Ti-Hetero           0.4149   0.3129   0.3223   0.3030   0.4012
  Tv-Homo             0.5665   0.5620   0.5628   0.5851   0.5875
  Tv-Hetero           0.2886   0.2801   0.2809   0.3810   0.3867
  Pooled              0.5759   0.5716   0.5716   0.5946   0.5966
  Flat                0.5709   0.5664   0.5729   0.5976   0.6041
  Param               6.7334   6.7548   6.9387   6.8507   6.8211
Heterogeneous Coef.
  Ti-Homo             0.2881   0.3008   0.403    0.2925   0.2837
  Ti-Hetero           0.2889   0.3024   0.4033   0.2921   0.2841
  Tv-Homo             0.6368   0.6605   0.7106   0.6414   0.6344
  Tv-Hetero           0.2827   0.2961   0.3978   0.2840   0.2741
  Flat                0.6660   0.7119   0.7948   0.6826   0.6753

Table G.5: Empirical Application: Predict Investment Rate, LPS

                        2015     2016     2017     2018     2019
Homogeneous Coef.
  Ti-Homo             0.8021   0.5577  -0.3237   0.6752   0.7039
  Ti-Hetero           1.5788   1.5833   1.6552   1.7086   1.3724
  Tv-Homo             0.8044   0.5601  -0.3055   0.6754   0.6671
  Tv-Hetero           1.5618   1.5680   1.6063   1.6896   1.2981
  Pooled              0.7952   0.5556  -0.3197   0.6716   0.6746
  Flat                0.7532   0.4900  -0.5030   0.5868   0.6935
  Param              -1.2611  -1.2671  -1.2779  -1.2760  -1.2722
Heterogeneous Coef.
  Ti-Homo             1.3901   1.5670   1.2802   1.6146   1.1904
  Ti-Hetero           1.3059   1.5676   1.5278   1.6140   1.1883
  Tv-Homo             0.6598   0.5030   0.4711   0.5313   0.6879
  Tv-Hetero           1.5067   1.5490   1.5286   1.5520   1.1013
  Flat                0.5675   0.4889   0.4966   0.4385   0.6205
Table G.6: Empirical Application: Predict Investment Rate, CRPS

                        2015     2016     2017     2018     2019
Homogeneous Coef.
  Ti-Homo             0.0529   0.0633   0.0712   0.0599   0.0605
  Ti-Hetero           0.0398   0.0399   0.0511   0.0315   0.0464
  Tv-Homo             0.0532   0.0635   0.0721   0.0599   0.0634
  Tv-Hetero           0.0354   0.0394   0.0517   0.0332   0.0454
  Pooled              0.0535   0.0637   0.0712   0.0603   0.0627
  Flat                0.0537   0.0640   0.0723   0.0620   0.0604
  Param               0.3510   0.3550   0.3649   0.3601   0.3554
Heterogeneous Coef.
  Ti-Homo             0.0389   0.0382   0.1257   0.0343   0.0485
  Ti-Hetero           0.0391   0.0381   0.1271   0.0346   0.0485
  Tv-Homo             0.0639   0.072    0.0790   0.0696   0.0615
  Tv-Hetero           0.0384   0.0379   0.1248   0.0353   0.0486
  Flat                0.0702   0.0751   0.1587   0.0721   0.065
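Metrics of the kind reported in Tables G.1 and G.3–G.6 can be computed from posterior predictive draws. The sketch below is ours, not the paper's code, and assumes the point forecast is the posterior predictive mean and the credible set is equal-tailed; it covers RMSFE, coverage rate, credible-set length, and the sample-based CRPS (E|X - y| - 0.5 E|X - X'|). The LPS additionally requires evaluating the predictive density at the realization, so it is omitted:

```python
import numpy as np

def evaluate_forecasts(draws, actual, alpha=0.05):
    """Forecast evaluation from posterior predictive draws.

    draws  : (M, N) array, M predictive draws for each of N units
    actual : (N,) array of realized values
    Returns the RMSFE of the posterior-mean forecast, the frequentist
    coverage rate and average length of the (1 - alpha) equal-tailed
    credible set, and the average CRPS computed from the draws.
    """
    draws = np.asarray(draws, dtype=float)
    actual = np.asarray(actual, dtype=float)

    point = draws.mean(axis=0)
    rmsfe = np.sqrt(np.mean((point - actual) ** 2))

    lo = np.quantile(draws, alpha / 2, axis=0)
    hi = np.quantile(draws, 1 - alpha / 2, axis=0)
    coverage = np.mean((actual >= lo) & (actual <= hi))
    length = np.mean(hi - lo)

    # sample-based CRPS: E|X - y| - 0.5 E|X - X'|, averaged over units;
    # the pairwise term is O(M^2) in the number of draws
    term1 = np.mean(np.abs(draws - actual), axis=0)
    term2 = 0.5 * np.mean(
        np.abs(draws[:, None, :] - draws[None, :, :]), axis=(0, 1)
    )
    crps = np.mean(term1 - term2)
    return rmsfe, coverage, length, crps
```

For large numbers of draws, the pairwise CRPS term is usually computed from sorted draws instead of the full M × M difference array, but the direct form above makes the definition transparent.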