A Semi-Parametric Bayesian Generalized Least Squares Estimator
Ruochen Wu
School of Economics, Fudan University

Melvyn Weeks*
Faculty of Economics and Clare College, University of Cambridge

November 23, 2020
Abstract
In this paper we propose a semi-parametric Bayesian Generalized Least Squares estimator. In a generic gls setting where each error is a vector, parametric gls maintains the assumption that each error vector has the same covariance matrix. In reality, however, the observations are likely to be heterogeneous in their distributions. To cope with such heterogeneity, a Dirichlet process prior is introduced for the covariance matrices of the errors, leading to an error distribution that is a mixture of a variable number of normal distributions. Our methods let the number of normal components be data driven. Two specific cases are then presented: the semi-parametric Bayesian Seemingly Unrelated Regression (sur) for equation systems, and the Random Effects Model (rem) and Correlated Random Effects Model (crem) for panel data. A series of simulation experiments is designed to explore the performance of our methods. The results demonstrate that our methods obtain smaller posterior standard deviations than the parametric Bayesian gls. We then apply our semi-parametric Bayesian sur and rem/crem methods to empirical examples.

JEL Classification Code: C3

Keywords: Bayesian semi-parametric, generalized least squares, Dirichlet process, equation system, seemingly unrelated regression, panel data, random effects model, correlated random effects model.

* Contact Author: Dr. M. Weeks, Faculty of Economics, University of Cambridge, Cambridge CB3 9DD, UK. Email: [email protected]. We are grateful for the useful comments received from Debopam Bhattacharya, Xiaohong Chen, Gernot Doppelhofer, Oliver Linton and Justin Tobias.

1 Introduction
The Generalized Least Squares (gls) estimator is a family of econometric methods that has seen numerous applications in empirical economics. As pointed out by Wooldridge (2003), parametric gls type estimators accommodate deviations from the assumption that the errors in the model are homoskedastic and serially uncorrelated. For example, relative to the ordinary least squares regression model, gls no longer assumes that the covariance matrix of the errors is diagonal with identical diagonal elements.

In a more general setting, each error may be a random vector, which covers some of the most popular applications of gls. For example, the Seemingly Unrelated Regression (sur, Zellner, 1962 and 1971) has been developed for equation systems, and is widely applied to gain efficiency by exploiting the correlation between errors across equations. Similarly, in the analysis of panel data the random effects model (rem) recognizes that there are individual specific, time-invariant features that are unobservable and uncorrelated with the explanatory variables. A useful extension to the rem is the correlated random effects (crem) model (Chamberlain, 1980; Wooldridge, 2005; Murtazashvili and Wooldridge, 2008), which allows the individual effects to be correlated with the explanatory variables, usually through a linear function of the means of the regressors.

However, parametric gls still maintains the assumption that the error vector of each individual has the same covariance matrix. In reality, heterogeneity in error distributions is a major concern in empirical analysis. Such heterogeneity can be caused by observations on individuals or households reflecting variation in demographics, such as the size of the household and the level of income, among others. It is a challenge for analysts who seek reliable inference to capture the form of the heterogeneity in the observations.

The standard Bayesian approach to gls assumes that the error distribution is multivariate normal.
Recent developments in Bayesian methods allow the use of prior information to relax this assumption. For example, the Dirichlet prior has been introduced to accommodate heterogeneity in the distributions of both errors (see Chigira and Shiba, 2015 for an example) and model parameters (Allenby et al., 1998) by mixing a fixed number of normal distributions. However, a notable drawback of the Dirichlet prior is that the dimension of the mixing distribution is usually unknown. Bayesian semi-parametric methods introduce more flexibility by letting the data and the prior determine the structure of heterogeneity jointly. The Dirichlet Process (DP) prior (see Escobar and West, 1995 and 1998, and MacEachern, 1998, for references) can be used to form a mixture of normal distributions whose dimension need not be predetermined. In this sense, the use of DP priors represents a more flexible approach to accommodating heterogeneity than the mixing of a fixed number of normal distributions with the Dirichlet prior.

In contexts where heterogeneity in the distributions of the errors is the major concern, as is the case here where the focus is upon inference, DP priors are introduced for the hyper-parameters of the errors in the model (e.g. for an error vector with zero mean, the hyper-parameter is its covariance matrix). This leads to a grouping of the hyper-parameters, with those in the same group having identical values. As such, the errors whose hyper-parameters are in the same group have the same distribution, while errors whose hyper-parameters are in separate groups come from different distributions.

A landmark study in this area is Conley et al. (2008), who introduced a Bayesian semi-parametric approach to the instrumental variable problem in a two stage least squares framework. Due to the endogeneity of some explanatory variables, the errors in the two stages are correlated by construction. Instead of assuming that the joint errors in the two stages have an identical bivariate normal distribution (cf. Chao and Phillips, 1998; Geweke, 1996; Kleibergen …), Conley et al. (2008) place a DP prior on their distribution.

In this paper we propose a semi-parametric Bayesian gls that incorporates the DP prior. The motivation is to exploit more of the information in the error distribution by allowing the hyper-parameters to differ across observations. The resulting distribution of the error terms is a mixture of normal distributions in which the number of normal components is influenced by both the prior and the data. We then introduce two specific cases of the semi-parametric Bayesian gls, namely for equation systems and for panel data.

The rest of the paper is organized as follows. In Section 2 we introduce the generic form of the Dirichlet process and demonstrate its use as a prior for semi-parametric Bayesian gls. Then two special cases of the gls are described. The dp-sur method is introduced in Section 3. Sections 3.3 and 3.4 present the simulation design and results, respectively, for the dp-sur. Two empirical examples are given in Section 3.5. Section 4 motivates and introduces our semi-parametric Bayesian gls method for panel data, the dp-rem; its extension, the dp-crem, is introduced in Section 4.3. Simulation designs and results for the panel setting are in Section 4.4. The dp-rem and dp-crem methods are then applied to two empirical examples in Section 4.5. Section 5 concludes the paper.

2 The Generic Semi-parametric Bayesian GLS

We introduce the generic Bayesian gls with a DP mixture in this section. In Section 2.1 we briefly review the literature in related areas. Then in Section 2.2 we present how the Dirichlet process prior can be used to produce a semi-parametric Bayesian gls.

2.1 Related Literature

Bayesian attempts to incorporate heterogeneity in the hyper-parameters of the errors, and consequently in their distributions, can be traced back to Geweke (1993).
He introduced a Bayesian gls in which an inverse gamma prior is placed on the variances of the errors, each of which has a normal distribution. Geweke (1993) demonstrated that such a scale mixture of normal distributions is equivalent to the errors having a t-distribution.

Although the model with t-distributed errors is flexible, this approach depends upon the assumption that the normal distributions are mixed with inverse gamma distributed variances. As pointed out by Koop (2003), relaxing this assumption results in more flexible models, given that the errors are no longer restricted to having a t-distribution. This can be done by using a Dirichlet prior (the conjugate prior of a multinomial distribution) to mix a finite number of normal distributions. The Dirichlet mixture model has emerged as a widely applied methodology for capturing heterogeneity in both linear and non-linear models, including Allenby et al. (1998), Li and Tobias (2011) and Chigira and Shiba (2015). However, its main limitation is that a fairly difficult test procedure is needed to determine the "correct" number of mixing components.

In the wake of this limitation of the Dirichlet mixture model, it seems more reasonable to let the data and the prior determine the number of normal components jointly. This can be achieved using a Dirichlet prior of infinite dimension, which is the Dirichlet process (DP) introduced by Ferguson (1973); see Teh (2011) and Gershman and Blei (2012) for reviews. The DP is the conjugate prior for an infinite-dimensional, non-parametric distribution. A DP can be written as

$$ F \sim DP(\alpha, F_0), \tag{1} $$

where $\alpha > 0$ is the concentration parameter and $F_0$ is the base distribution.
$F$ is a random distribution that is discrete with probability one; the level of discreteness is influenced by the concentration parameter $\alpha$. The DP is a non-parametric "distribution of distributions" (Escobar and West, 1995 and 1998; MacEachern, 1998), in that a draw $F$ from a DP is itself a probability distribution. The Chinese restaurant process (Aldous, 1985) provides the predictive probabilities of the $n$th realization $r_n$ conditioned on the $n-1$ previous realizations $\{r_1, r_2, \ldots, r_{n-1}\}$, and realizations from $F$ can be drawn accordingly. Because $F$ is discrete, the existing realizations are assigned to groups, with a unique value shared by all realizations in the same group. Denote the group id of $r_i$ by $c_i = 1, \ldots, K$, and the unique value of group $c_i$ by $r^*_{c_i}$: if $r_i$ is in group $k$, then $r_i = r^*_{c_i} = r^*_k$. The predictive probability of $r_n$ is given by

$$ \Pr\{r_n = r^*_k \mid r_1, r_2, \ldots, r_{n-1}\} = \begin{cases} \dfrac{n_k}{n-1+\alpha} & \text{if } 1 \le k \le K \\[4pt] \dfrac{\alpha}{n-1+\alpha} & \text{if } k = K+1 \ \text{(i.e. } r_n = r^*_{K+1} \sim F_0\text{)}, \end{cases} \tag{2} $$

where $n_k$ is the number of realizations already in group $k$. Aldous (1985) showed that $r_1, r_2, \ldots, r_n$ generated according to the Chinese restaurant process are i.i.d. draws from $F$ (they are not independent unconditionally, since the $n$th realisation is generated conditional on the previous $n-1$; independence holds only given $F$), i.e.

$$ F \mid \alpha, F_0 \sim DP(\alpha, F_0), \qquad r_i \mid F \overset{iid}{\sim} F. \tag{3} $$

A model with a DP prior on the distribution of parameters is called a DP mixture model (de Carvalho et al., 2013; Wiesenfarth et al., 2014; Li et al., 2018; Hejblum et al., 2019), and is capable of representing very general forms of heterogeneity in the distributions of the observations. The DP normal mixture model, i.e. one whose mixture components are normal distributions, can be written as

$$ F \mid \alpha, F_0 \sim DP(\alpha, F_0), \qquad \theta_i \mid F \overset{iid}{\sim} F, \qquad y_i \mid \theta_i \sim N(\theta_i), \tag{4} $$

where $\theta_i$ is the set of parameters of observation $y_i$ (note that $y_i$ is a vector). In the multivariate normal case, $\theta_i$ consists of the mean vector and covariance matrix.
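To make the grouping in (2) concrete, the following is a minimal Python sketch (ours, not the authors' code; the function name and the choice $\alpha = 2$ are illustrative) that seats $n$ realizations according to the Chinese restaurant process:

```python
import numpy as np

def crp_assignments(n, alpha, seed=None):
    """Seat n realizations according to the Chinese restaurant process
    in (2): the i-th realization joins existing group k with probability
    n_k / (i + alpha) and opens a new group with probability
    alpha / (i + alpha), where i realizations are already seated."""
    rng = np.random.default_rng(seed)
    counts = []               # n_k for each existing group
    labels = []
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)  # a new group, with value drawn from F0
        else:
            counts[k] += 1
        labels.append(k)
    return labels

labels = crp_assignments(500, alpha=2.0, seed=0)
n_groups = len(set(labels))   # K, typically far smaller than n
```

With $\alpha = 2$ and $n = 500$ the expected number of groups is roughly $\alpha \log(1 + n/\alpha) \approx 11$, far fewer than $n$, which is the discreteness that drives the grouping of hyper-parameters below.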
That is, $\theta_i = (\mu_i, \Sigma_i)$. The posterior probability of $\theta_i$ having the same value as one of the existing $\theta_{-i}$ is

$$ \Pr\{\theta_i = \theta^*_k \mid \theta_{-i}, y_i, \alpha\} \propto \frac{n_k}{n-1+\alpha}\, N(y_i \mid \theta^*_k), \tag{5} $$

where $\theta^*_k$ and $n_k$ denote, respectively, the unique value of group $k$ and the number of observations already in group $k$. The posterior probability of $\theta_i$ taking a new value from the base distribution, i.e. $\theta_i = \theta^*_{new} \sim F_0$, is

$$ \Pr\{\theta_i = \theta^*_{new} \mid \theta_{-i}, y_i, \alpha, F_0\} \propto \frac{\alpha}{n-1+\alpha} \int N(y_i \mid \theta^*_{new})\, p(\theta^*_{new} \mid F_0)\, d\theta^*_{new}, \tag{6} $$

where $p(\theta^*_{new} \mid F_0)$ is the probability density of the new value $\theta^*_{new}$ given $F_0$. We now introduce how the DP normal mixture model is applied to produce a semi-parametric Bayesian gls.

2.2 Semi-parametric Bayesian GLS

Below we introduce the generic form of the semi-parametric gls estimator, where a DP prior is introduced on the distribution of the hyper-parameters of the errors. Consider a general linear regression

$$ y_i = X_i \beta + \varepsilon_i, \tag{7} $$

where $i$ indexes the observation, $y_i$ is a $Q \times 1$ vector, $X_i$ is a $Q \times K$ matrix of explanatory variables, $\beta$ is a $K \times 1$ vector of coefficients, and $\varepsilon_i$ is a $Q \times 1$ error vector. The semi-parametric gls estimator introduces a DP prior on the distribution of the error covariance matrix, which will be used to weight the observations. Given the usual assumption of zero means for the errors, $\theta_i = \Sigma_i$. The hierarchical prior can then be written as

$$ F \mid \alpha, F_0 \sim DP(\alpha, F_0), \qquad \Sigma_i \mid F \overset{iid}{\sim} F. \tag{8} $$

Due to the discreteness of $F$ under the DP prior, the values of some covariance matrices $\Sigma_i$ will be the same, thus putting the $\Sigma_i$ into groups denoted by $c_i$. This "grouping" characteristic can help to reveal the structure of the unobserved heterogeneity in the data. The gls estimator weights the observations according to their covariance matrices.
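The conditional draws in (5) and (6) can be sketched as follows, with the integral in (6) replaced by a Monte Carlo average over draws from the base distribution (an illustrative implementation of ours; all names are hypothetical):

```python
import numpy as np

def mvn_pdf(e, Sigma):
    """Density of a zero-mean multivariate normal at the error vector e."""
    Q = len(e)
    quad = e @ np.linalg.solve(Sigma, e)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** Q * np.linalg.det(Sigma))

def assignment_probs(e_i, group_Sigmas, group_counts, alpha, base_draws):
    """Normalized versions of (5) and (6) for one error vector e_i:
    weight n_k * N(e_i | 0, Sigma*_k) for joining existing group k, and
    alpha times a Monte Carlo estimate of the integral over F0 for
    opening a new group (base_draws are covariance draws from F0)."""
    n = sum(group_counts) + 1          # e_i counts as the n-th observation
    denom = n - 1 + alpha
    w = [nk / denom * mvn_pdf(e_i, S)
         for nk, S in zip(group_counts, group_Sigmas)]
    w.append(alpha / denom * np.mean([mvn_pdf(e_i, S) for S in base_draws]))
    w = np.array(w)
    return w / w.sum()

# A small error close to zero favours the tighter, larger covariance group.
p = assignment_probs(np.array([0.1, -0.2]),
                     group_Sigmas=[np.eye(2), 4.0 * np.eye(2)],
                     group_counts=[3, 2], alpha=1.0,
                     base_draws=[np.eye(2)])
```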
The likelihood of $\beta$ is then

$$ p(y_i \mid \beta, \Sigma^*_{c_i}) = \frac{1}{(2\pi)^{Q/2}}\, |\Sigma^*_{c_i}|^{-1/2} \exp\!\left[ -\frac{1}{2} (y_i - X_i\beta)' \Sigma^{*-1}_{c_i} (y_i - X_i\beta) \right], \tag{9} $$

where $\Sigma^*_{c_i}$ is the unique covariance matrix of group $c_i$. Given the choice of prior for $\beta$, one could generate draws from the posterior of the parameters with mcmc methods. The conjugate normal prior for $\beta$ is specified as

$$ \beta \sim N(b_0, V_0), \tag{10} $$

where $b_0$ and $V_0$ denote, respectively, the prior mean and covariance matrix of $\beta$. The posterior of $\beta$ may then be written as

$$ \beta \mid y, \Sigma^*_{c_i} \sim N(b_1, V_1), \tag{11} $$

where

$$ V_1 = \left( V_0^{-1} + \sum_{i=1}^{N} X_i' \Sigma^{*-1}_{c_i} X_i \right)^{-1}, \tag{12} $$

and

$$ b_1 = V_1 \left( V_0^{-1} b_0 + \sum_{i=1}^{N} X_i' \Sigma^{*-1}_{c_i} y_i \right). \tag{13} $$

Note that (11), (12) and (13) have the same form as the posterior of the parametric Bayesian gls estimator assuming i.i.d. normal errors. In the case of the semi-parametric Bayesian gls estimator, the errors are associated with different hyper-parameters, such that each observation $i$ is weighted by $\Sigma^*_{c_i}$. The dp-gls is generic, making no assumptions on the form of the covariance matrix other than all $\Sigma^*_{c_i}$ being positive definite and symmetric. We now proceed to explain how the semi-parametric Bayesian gls estimators work in two specific contexts: equation systems in Section 3 and panel data models in Section 4.
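As a sanity check on (11)-(13), a direct implementation of the conditional posterior of $\beta$ (our sketch, not the authors' code; names are illustrative):

```python
import numpy as np

def beta_posterior(X_list, y_list, Sigma_list, b0, V0):
    """Posterior N(b1, V1) of beta as in (11)-(13):
       V1 = (V0^-1 + sum_i X_i' Sigma_i^-1 X_i)^-1
       b1 = V1 (V0^-1 b0 + sum_i X_i' Sigma_i^-1 y_i),
    where Sigma_i is the (group-specific) error covariance of observation i."""
    A = np.linalg.inv(V0)
    c = A @ b0
    for X_i, y_i, S_i in zip(X_list, y_list, Sigma_list):
        S_inv = np.linalg.inv(S_i)
        A = A + X_i.T @ S_inv @ X_i
        c = c + X_i.T @ S_inv @ y_i
    V1 = np.linalg.inv(A)
    return V1 @ c, V1

# Illustration with noiseless data and a diffuse prior: the posterior
# mean should recover the true coefficients.
rng = np.random.default_rng(1)
beta_true = np.array([1.5, -0.5])
X_list = [rng.standard_normal((2, 2)) for _ in range(40)]   # Q = K = 2
y_list = [X @ beta_true for X in X_list]
Sigma_list = [np.eye(2)] * 40
b1, V1 = beta_posterior(X_list, y_list, Sigma_list,
                        b0=np.zeros(2), V0=1e8 * np.eye(2))
```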
3 Semi-parametric Seemingly Unrelated Regression

Below we introduce the sur equation system and demonstrate how the DP prior is incorporated. Without loss of generality we consider a system of two equations

$$ y_{1i} = \beta_{10} + x_{11,i}\beta_{11} + x_{12,i}\beta_{12} + \varepsilon_{1i}, \qquad y_{2i} = \beta_{20} + x_{21,i}\beta_{21} + x_{22,i}\beta_{22} + x_{23,i}\beta_{23} + \varepsilon_{2i}, \tag{14} $$

where $y_{mi}$ denotes observation $i$ for equation $m$ ($m = 1, 2$) and $x_{mk,i}$ ($k = 1, 2, 3$) are the explanatory variables; $\beta_{ml}$ ($l = 0, 1, 2, 3$) denote the coefficients, and $\varepsilon_{mi}$ are the errors. In this context, the dimension of each error vector, i.e. $Q$ in Section 2.2, is the number of equations in the system. The model can be written in matrix form

$$ y_1 = X_1\beta_1 + \varepsilon_1, \qquad y_2 = X_2\beta_2 + \varepsilon_2, \tag{15} $$

where $y_m = \{y_{mi}\}$ and $\varepsilon_m = \{\varepsilon_{mi}\}$ are $N \times 1$ vectors, $X_1 = [\iota, x_{11}, x_{12}]$ and $X_2 = [\iota, x_{21}, x_{22}, x_{23}]$ are $N \times 3$ and $N \times 4$ matrices of explanatory variables, $\iota$ is an $N \times 1$ vector of ones, and $\beta_1 = \{\beta_{1l}\}$ and $\beta_2 = \{\beta_{2l}\}$ are $3 \times 1$ and $4 \times 1$ vectors of coefficients.

The seemingly unrelated regression (sur, Zellner, 1962) was introduced for this task. Instead of $\varepsilon_{1i} \overset{iid}{\sim} N(0, \sigma_1^2)$ and $\varepsilon_{2i} \overset{iid}{\sim} N(0, \sigma_2^2)$, as in the ols case, the errors $\varepsilon_i$ are now identically multivariate normally distributed, i.e. $\varepsilon_i = (\varepsilon_{1i}\ \varepsilon_{2i})' \overset{iid}{\sim} N(0, \Sigma)$. The covariance matrix of $\varepsilon$ is then

$$ \Omega = \Sigma \otimes I_N = \begin{bmatrix} \sigma_{11} I_N & \sigma_{12} I_N \\ \sigma_{21} I_N & \sigma_{22} I_N \end{bmatrix}, \quad \text{s.t. } \sigma_{12} = \sigma_{21}, \tag{16} $$

where "$\otimes$" stands for the Kronecker product. One could transform the observations with this covariance matrix so that the errors follow the standard normal distribution $N(0,1)$ (note, however, that $\Omega$ has the specific sur form in (16), instead of the general positive definite symmetric form of a covariance matrix).

Although the sur model accounts for the cross-equation correlation of errors, as Wooldridge (2003) has noted, the errors are assumed to be identically distributed. Moreover, unlike the classical gls estimator, this distribution is usually assumed to be normal. In this section we propose a new dp-sur method that makes no a priori assumption on the family of the distribution of the errors. If we allowed each observation $i$ to have its own covariance matrix, the flexibility of the error distribution would lead to identification problems when we only have cross sectional data. Assigning the observations into groups represents a compromise. Given (15), the covariance matrix of the error for observation $i$ is given by

$$ \Sigma_i = \begin{bmatrix} \sigma^*_{11,c_i} & \sigma^*_{12,c_i} \\ \sigma^*_{21,c_i} & \sigma^*_{22,c_i} \end{bmatrix}, \quad \text{s.t. } \sigma^*_{12,c_i} = \sigma^*_{21,c_i}, \tag{17} $$

where $c_i$ denotes the group id of observation $i$, and the superscript $*$ denotes the group-specific hyper-parameter. For $c_i = c_j$, $i, j \in \{1, 2, \ldots, N\}$, observations $i$ and $j$ share the same group id and $\sigma^*_{pq,c_i} = \sigma^*_{pq,c_j}$, where $p, q \in \{1, 2\}$ index the equations in the system.

If the number of groups were known, the Dirichlet prior could be used to perform the mixing. A less restrictive approach is non-parametric, introducing a DP prior for the distribution of $\Sigma_i$ as in (8). A natural choice of the base distribution $F_0$ is the conjugate prior for the covariance matrix of a multivariate normal distribution, the inverse Wishart distribution, i.e.

$$ F_0 \equiv IW(\nu, W), \tag{18} $$

where $\nu$ and $W$ are the hyper-parameters of the inverse Wishart distribution. Given (18), the posterior distribution of each $\Sigma_i$ is also inverse Wishart, which is easy to draw from using the Gibbs sampler. The main difference from the parametric Bayesian sur is that the covariance matrix of each observation is now given by (17), which allows each group of observations to have its own unique values for the parameters.

A Gibbs sampler (see Geweke, 1996) is available for the sur model. The Gibbs sampler draws two sets of parameters from their posteriors: the covariance matrix of the errors $\Sigma$ and the regression parameters $\beta$, namely

$$ \Sigma \mid y, X, \beta \tag{19} $$
$$ \beta \mid y, X, \Sigma. \tag{20} $$

When introducing the hierarchical structure which includes the DP prior, a number of extra parameters are included in the mcmc algorithm. These are the covariance matrices of the errors, $\Theta = \{\Sigma_i\}$, and $\alpha$, the concentration parameter of the DP prior. The Gibbs sampler now consists of

$$ \Theta \mid y, X, \beta, \alpha \tag{21} $$
$$ \beta \mid y, X, \Theta, \alpha \tag{22} $$
$$ \alpha \mid y, X, \beta, \Theta. \tag{23} $$

The major difference between the two Gibbs samplers lies in (19) and (21). In (19) the errors have the same covariance matrix $\Sigma$.
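Within step (21), by contrast, each group's covariance matrix is updated via the conjugate inverse Wishart posterior implied by (18). A sketch of one such group update (ours, using scipy's `invwishart`; names and hyper-parameter values are illustrative):

```python
import numpy as np
from scipy.stats import invwishart

def draw_group_cov(resids, nu, W, seed=None):
    """One conjugate update inside step (21): with base prior IW(nu, W)
    as in (18) and the n_k zero-mean normal residuals currently assigned
    to a group, the group covariance has posterior
    IW(nu + n_k, W + sum_i e_i e_i')."""
    n_k = resids.shape[0]
    scale = W + resids.T @ resids
    return invwishart(df=nu + n_k, scale=scale).rvs(random_state=seed)

# Stand-in residuals for one group; a draw should sit near the sample
# covariance of the residuals (here close to the identity).
resids = np.random.default_rng(0).standard_normal((200, 2))
Sigma_draw = draw_group_cov(resids, nu=4, W=np.eye(2), seed=0)
```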
In contrast, there will be $K \le N$ unique values in $\Theta$ in equation (21) due to the discreteness of $F$ under the DP prior; observations with the same value of $\Sigma_i$ are assigned to the same group. With the last draw of $\beta$, the residuals can be obtained, which are used as the data for a draw of $\Theta$.

In making draws of the concentration parameter $\alpha$ using (23), we adopt the prior introduced by Conley et al. (2008), namely

$$ p(\alpha) \propto \left( 1 - \frac{\alpha - \alpha_{min}}{\alpha_{max} - \alpha_{min}} \right)^{\tau}, \tag{24} $$

where $\alpha_{min}$ and $\alpha_{max}$ are the pre-set lower and upper bounds of $\alpha$. A larger $\alpha$ leads to more groups being generated on average, i.e. the DP being less discrete. Using the distribution of the number of groups $K$ conditional on $\alpha$ in Antoniak (1974), we determine $\alpha_{min}$ and $\alpha_{max}$ by setting the mode of the number of groups to $K_{min}$ and $K_{max}$. In this paper we let $K_{min}$ be 1 and $K_{max}$ be 5% of the number of observations. Following the suggestion of Conley et al. (2008), we set $\tau$ to 0.8. The hyper-parameter $\alpha_{max}$ has also been adjusted, with $K_{max}$ set to 10% and 50% of the sample size; in our experiments the results are insensitive to these changes in the prior of the concentration parameter $\alpha$.

3.3 A Simulation Experiment

In this section we conduct a simulation experiment designed to compare our method to the Bayesian sur described in Section 3. As the main focus of this paper is the potential efficiency gains over gls type estimators, we evaluate the performance of the dp-sur and the normal Bayesian sur focusing upon the posterior standard deviations of the parameters estimated with the two methods. All simulation experiments are based upon the two equation system in (14). The experiments are designed to highlight the performance of the estimators along the following dimensions:

(i) heterogeneity in the errors;
(ii) the tail of the error distribution;
(iii) sample size.

For (i) we check the performance of our dp-sur approach against a model where the errors are distributed i.i.d.
multivariate normal. In the heterogeneous case, the most direct way would be to generate the errors from a mixture of multivariate normal distributions; however, simulating from such a mixture can be problematic given the influence of the number of components, the covariance matrices of the normal components (we fix the means at zero) and the weights assigned to each component. We therefore use multivariate t-distributions (Andrews and Mallows, 1974) to exploit the scale mixture of normal distributions with inverse Wishart covariance matrices.

To accommodate (ii), we vary the degrees of freedom (df) of the multivariate t-distribution. Smaller degrees of freedom lead to heavier tails, which indicates that a larger proportion of observations follow normal distributions that are "flatter", i.e. less concentrated around the mean.

To determine the robustness of our method, we include a set of simulations where the errors follow a log-normal distribution. The log-normal distribution has seen a wide range of applications in empirical studies. For example, with perhaps the exception of the top 1-3 percent of the population, income has been shown to follow a log-normal distribution (Clementi and Gallegati, 2005). In addition, extreme realizations are more likely to be generated from the multivariate log-normal distribution, as it is fat-tailed.

Using (15), the explanatory variables are drawn from normal distributions with parameters

$$ x_{11,i} \overset{iid}{\sim} N(1, \cdot), \quad x_{12,i} \overset{iid}{\sim} N(3, \cdot), \quad x_{21,i} \overset{iid}{\sim} N(-\cdot, \cdot), \quad x_{22,i} \overset{iid}{\sim} N(4, \cdot), \quad x_{23,i} \overset{iid}{\sim} N(-\cdot, \cdot). \tag{25} $$

We set $\beta_1 = (1.\cdot, -\cdot, \cdot)'$ and $\beta_2 = (1.\cdot, -\cdot, -\cdot, \cdot)'$. We generate errors from the multivariate normal distribution $N(0, \Sigma)$, where

$$ \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}. \tag{26} $$

Without loss of generality, we let the variances be identical and fix the correlation between the errors in the two equations at 0.5. We set three sample sizes for the simulation experiment: 100, 250 and 500.

When generating errors from a multivariate t-distribution, we set the location parameter to $\mu = 0$ and the shape parameter to $\Sigma$. As noted, the parameter that controls the tail behaviour of the multivariate t-distribution is the df, which we set to 2, 3 and 4. For df = 2 the tails of the corresponding multivariate t-distribution are much heavier than those of the multivariate normal with the same location and shape parameters $\mu$ and $\Sigma$. When the df is 4, the tails of the multivariate t-distribution are only slightly heavier than those of the multivariate normal. Our dp-sur should demonstrate relative efficiency in all three situations, as the multivariate t-distribution has heavier tails than the multivariate normal. Gains in efficiency will be decreasing in df, given that the tails are less heavy.

The errors in the multivariate log-normal scenario are constructed by first drawing from multivariate normal distributions, and then taking the natural exponent of these draws. Because the log-normal distribution has a positive mean, it is necessary to demean, so that the errors have zero means. Our dp-sur is expected to show efficiency gains in the log-normal case, given that it is asymmetric and heavy tailed.

Below we present the simulation results. We present the posterior standard deviations (s.d.) estimated with both our dp-sur method and the Bayesian sur assuming multivariate normal errors, along with the percentage differences between them. As the behaviour of the posteriors of all the parameters is uniformly similar, for the sake of clarity we present only the results for two representative coefficients.

Multivariate t-distributed Errors
Table 1 presents the posterior s.d.s and the percentage differences between the s.d. estimated with the dp-sur and the normal sur ($\Delta\% = (s.d._{sur} - s.d._{dp})/s.d._{sur} \times 100$), both averaged over the samples. We carry out 100 simulations for each sample size, which proved sufficient to achieve stable results even with the smallest sample size; the tables containing the posterior means can be found in the Appendices (for those tables there are 6 columns, presenting the means estimated by the two methods for the 3 correlations). We observe that the dp-sur gives smaller posterior s.d. when the df is 2, 3 and 4 (we do not use df = 1, as the t-distribution does not even have a mean in that case). The percentage differences are above 40% when the df is 2, above 20% when the df is 3, and around 15% when the df is 4, as shown in the upper three panels of the table. Efficiency gains increase with sample size as more extreme values of the errors are realised. The parametric sur assumes that all the realizations have the same $\Sigma$, where the extreme ones will inflate this $\Sigma$ shared by all realizations. In contrast, these extreme realizations will be assigned to distributions with larger $\Sigma_i$'s by the dp-sur, while the rest will be treated as realizations from normal distributions with smaller $\Sigma_i$'s. By accommodating a higher degree of heterogeneity in the error distributions, potentially more efficiency gains can be achieved by the dp-sur.

Our results are consistent with expectations. The efficiency gains of the semi-parametric dp-sur are the largest when the df is 2 (with the heaviest tails). Efficiency gains fall as the df increases, given the less heavy tails of the distribution of the errors. In fact, in the lowest panel of Table 1, where the df is infinity, we observe that the posterior s.d. estimated with the two methods are very close. The s.d. for the dp-sur are slightly larger than their sur counterparts, though the differences are small in magnitude, less than 2.5% for all coefficients. This is not surprising, since when the distribution of the errors is multivariate normal, the parametric method is more parsimonious, using the correct structure for the covariance matrix of the errors. In the multivariate normal case, among the three sample sizes, the differences between the s.d. are the largest when the sample size is 100. This is expected, as the information "wasted" by the dp-sur in this case has a larger impact on efficiency when the sample size is small.

Table 1: Posterior s.d., multivariate t errors (panels: df = 2, 3, 4 and $\infty$; sample sizes 100, 250 and 500; columns: DP, SUR, $\Delta\%$)

Multivariate Log-normal Errors
The posterior s.d.s are presented in Table 2. We observe that the dp-sur posterior s.d. are more than 55% smaller than those calculated using the Bayesian sur assuming i.i.d. normal errors. The efficiency gains increase with sample size, reaching more than 65% in the case of 500 observations. As with the case of t-distributed errors, this is due to the fact that more extreme realizations of the errors are present in larger samples, leading to more efficiency gains from grouping them.

Table 2: Posterior s.d., multivariate log-normal errors
(sample sizes 100, 250 and 500; columns: DP, SUR, $\Delta\%$)

Below we apply our dp-sur method to an economic model of the demand for factors of production with a generalized Leontief cost function (Diewert, 1971), an equation system with the number of equations equal to that of the factors. To make our empirical demonstration as general as possible, we do not impose symmetry or homogeneity restrictions.

The dataset, taken from Malikov et al. (2016), contains 2397 observations on 285 large U.S. banks between 2001 and 2010. The data include quantities and prices of the inputs, i.e. labour, physical assets and borrowed funds, and the quantity of output, which is the loans made by a bank. Given the relatively large sample size, it is possible for us to explore the performance of the dp-sur with different sample sizes. The demand for factors equation system may be written as

$$ a_L = \frac{L}{Y} = \beta_{LL} + \beta_{LA}\left(\frac{P_A}{P_L}\right)^{1/2} + \beta_{LF}\left(\frac{P_F}{P_L}\right)^{1/2} + \beta_{LT}T + \varepsilon_L \tag{27} $$
$$ a_A = \frac{A}{Y} = \beta_{AA} + \beta_{AL}\left(\frac{P_L}{P_A}\right)^{1/2} + \beta_{AF}\left(\frac{P_F}{P_A}\right)^{1/2} + \beta_{AT}T + \varepsilon_A \tag{28} $$
$$ a_F = \frac{F}{Y} = \beta_{FF} + \beta_{FL}\left(\frac{P_L}{P_F}\right)^{1/2} + \beta_{FA}\left(\frac{P_A}{P_F}\right)^{1/2} + \beta_{FT}T + \varepsilon_F, \tag{29} $$

where $L$, $A$ and $F$ denote the quantity of labour, physical assets and borrowed funds, respectively; $T$ denotes the trend variable; $Y$ denotes output, and $P_k$ is the price of factor $k$, with $k \in \{L, A, F\}$. For the errors we assume $(\varepsilon_{Li}, \varepsilon_{Ai}, \varepsilon_{Fi}) \sim N(0, \Sigma_i)$, where $\Sigma_i$ is the covariance matrix of observation $i$. We allow the errors to be correlated across the three equations in the system, i.e. $cov(\varepsilon_{ki}, \varepsilon_{si}) \ne 0$, with $k, s \in \{L, A, F\}$ indexing equations.

Given that the main objects of interest are the price elasticities, we report the posterior means and s.d. of the price elasticities of the three factors. With the generalized Leontief cost function, the cross price elasticities of the factors are given by

$$ e_{ks} = \frac{1}{2}\, \beta_{ks} (P_k/P_s)^{-1/2} / a_k, \quad \forall k \ne s. \tag{30} $$

The own price elasticities are

$$ e_{kk} = -\frac{1}{2} \sum_{s \ne k} \beta_{ks} (P_k/P_s)^{-1/2} / a_k. \tag{31} $$

Table 3 contains the posterior means of the price elasticities of the demand for factors. One can see that with both the 800-observation sub-sample and the full sample, the posterior means of all the elasticities are relatively small, indicating that the demands for factors (labour, physical assets and borrowed funds) of U.S. banks are relatively price inelastic.

Note that the own price elasticities of labour and physical assets are negative in both samples. In contrast, the own price elasticity of borrowed funds is positive, although we note that its absolute value is extremely small compared to those of labour and physical assets (the posterior s.d. are also relatively large for this elasticity, as shown in Table 4). This shows that the demand for borrowed funds is inelastic in the production of the U.S. banking industry. One potential reason is that borrowed funds are usually used in ways such as meeting the required reserve ratio set by the Fed. Such a feature makes borrowed funds rather inelastic with respect to their price.

There are some differences between the posterior means estimated with the dp-sur and the Bayesian sur assuming normal errors. Such differences are not observed in the simulation studies. However, it should be noted that in the simulations the regression equation was correctly specified, which is not guaranteed with the empirical data. Such differences in the posterior means with empirical datasets have also been observed in the literature on semi-parametric mixtures with a DP prior, including Conley et al. (2008). (Note that the sur and ols estimators are exactly the same when all the equations in the system share the same explanatory variables.)
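As a numerical illustration of (30) and (31) (with hypothetical coefficients, prices and factor-output ratios, not estimates from this application):

```python
import numpy as np

def price_elasticities(beta, prices, a):
    """Elasticities (30)-(31) for the generalized Leontief demand system:
    e_ks = 0.5 * beta_ks * (P_k / P_s)^(-1/2) / a_k for k != s, and
    e_kk = -sum_{s != k} e_ks. beta holds the cross coefficients (its
    diagonal is unused); a holds the factor-output ratios a_k."""
    m = len(prices)
    E = np.zeros((m, m))
    for k in range(m):
        for s in range(m):
            if s != k:
                E[k, s] = 0.5 * beta[k, s] * (prices[k] / prices[s]) ** -0.5 / a[k]
        E[k, k] = -E[k].sum()   # own elasticity offsets the cross elasticities
    return E

# Hypothetical coefficients, prices and ratios (not estimates from the paper).
beta = np.array([[0.0, 0.2, 0.1],
                 [0.2, 0.0, 0.3],
                 [0.1, 0.3, 0.0]])
E = price_elasticities(beta, prices=np.array([1.0, 2.0, 4.0]),
                       a=np.array([0.5, 0.3, 0.2]))
```

By construction each row of `E` sums to zero, consistent with (31); in practice the formulas would be evaluated at each posterior draw of the coefficients to obtain posterior means and s.d. of the elasticities.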
Table 4 presents the posterior s.d. of the price elasticities estimated with the two samples. We observe that the dp-sur achieves smaller posterior s.d. for all the price elasticities than the Bayesian sur assuming normality. This is not unexpected, as the elasticities are functions of the regression parameters in the equation system, which are estimated with smaller posterior s.d. by the semi-parametric dp-sur than by the parametric Bayesian sur. The greatest percentage difference ($\Delta\%$) with the 800-observation sub-sample occurs for the cross price elasticity of the demand for labour with respect to the price of physical assets, for which the dp-sur posterior s.d. is 38.27% smaller than the sur counterpart. With the full sample, the largest percentage difference is observed for the cross price elasticity of the demand for borrowed funds with respect to the price of physical assets, which reaches 39.15%.

Table 4: Elasticities, U.S. banking industry: posterior s.d.
In Figure 1 we present histograms of the posterior distribution of the cross-price elasticity of borrowed funds with respect to the price of physical assets for the dp-sur and the parametric Bayesian sur. Using the smaller 800-observation sub-sample, the posteriors for both estimators include 0. With the full sample, however, the dp-sur gives a posterior distribution whose 95% credible interval (from -0.027 to -0.002) excludes 0, shown by the two red vertical lines in the left panel. In contrast, the right panel shows that the parametric Bayesian sur gives a 95% credible interval (from -0.027 to 0.016) that still includes 0 even with the full sample. From Figure 1 we also note that the posterior distribution with the dp-sur is strongly right skewed, which can result in the parametric Bayesian sur having a larger posterior standard deviation.

Figure 1: Histograms of the elasticity of borrowed funds w.r.t. the asset price. (a) dp-sur; (b) parametric sur.
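The credible intervals above are read directly off the MCMC posterior draws as equal-tailed quantiles. A minimal sketch, with draws simulated for illustration only (none of the numbers come from the paper's data):

```python
import numpy as np

def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws."""
    tail = (1.0 - level) / 2.0
    return np.quantile(draws, [tail, 1.0 - tail])

# Illustrative right-skewed, entirely negative "elasticity" posterior.
rng = np.random.default_rng(0)
draws = -np.exp(rng.normal(-4.0, 0.6, size=20_000))

lo, hi = credible_interval(draws)
excludes_zero = hi < 0.0  # here the whole interval lies below zero
```

Whether 0 falls inside `[lo, hi]` is exactly the check made for the elasticity in Figure 1.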
In addition to equation systems, the random effects model (rem) for panel data is another scenario where the gls has seen numerous applications. In a panel with N cross sections and a time series dimension of T, the error of each individual is a T × 1 vector. We will relax the assumption of the parametric Bayesian gls for the rem (Koop, 2003) that the error vectors of all individuals have the same distribution. In this section we propose a semi-parametric Bayesian approach by introducing DP priors on the variances of the random effects and the errors. We follow the same approach as in the dp-sur method in terms of applying the DP prior on the hyper-parameters.

Consider the following panel data model

y_it = β_1 x_1it + · · · + β_K x_Kit + u_i + η_it = β_1 x_1it + · · · + β_K x_Kit + ε_it,   (32)

where i and t index the cross section and time series dimensions of the data, respectively, y_it is the dependent variable, the x_kit denote the explanatory variables, and the β_k, k = 1, ..., K, are the coefficients. u_i is the time-invariant unobservable of individual i, and η_it the error term. (We use the term "individual" to denote the cross section unit here; in practice it can be households, firms, countries or actual individuals.)

In Bayesian methods the difference between the fixed and random effects lies in the choice of prior for the individual effects u_i. Fixed effects Bayesian methods assume a non-hierarchical prior for u_i, while for the random effects a hierarchical prior is assumed. The prior for u_i may be written as

u_i | d² ~iid N(0, d²),   (33)

where d² is the variance of u_i. Assuming η_it ~iid N(0, σ²), the posterior distribution of u_i is given by

u_i | y_i, β, d², σ² ~ N(μ_i, s²),   (34)

where μ_i = s² σ⁻² ι'_T (y_i − X_i β) and s² = (d⁻² + σ⁻² ι'_T ι_T)⁻¹, with ι_T denoting a T × 1 vector of ones. X_i is the T × K matrix of explanatory variables [x_1it, ..., x_Kit], and y_i = [y_i1, y_i2, ..., y_iT]' is a T × 1 vector.
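The conditional posterior in (34) is cheap to compute. A sketch for one individual, under toy values we assume here (β, d², σ² and the data are all illustrative, not the paper's):

```python
import numpy as np

def u_posterior(y_i, X_i, beta, d2, sigma2):
    """Posterior N(mu_i, s2) of the random effect u_i as in eq. (34):
    s2 = (1/d2 + T/sigma2)^(-1), mu_i = (s2/sigma2) * sum(y_i - X_i @ beta)."""
    T = len(y_i)
    resid = y_i - X_i @ beta
    s2 = 1.0 / (1.0 / d2 + T / sigma2)
    mu_i = (s2 / sigma2) * resid.sum()
    return mu_i, s2

# Toy data for one individual whose true effect is u_true = 0.7.
rng = np.random.default_rng(1)
T, K = 8, 2
beta = np.array([5.0, 10.0])
X_i = rng.normal(size=(T, K))
y_i = X_i @ beta + 0.7 + rng.normal(scale=0.1, size=T)

mu_i, s2 = u_posterior(y_i, X_i, beta, d2=1.0, sigma2=0.01)
```

With a tight error variance the posterior mean lands near the true effect, and s² shrinks toward zero as T grows.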
(That is, the Q in the generic semi-parametric gls of Section 2.2 equals T in this context; we use T here following panel data conventions. The variances of u_i and η_it are often themselves treated as random with their own priors; for the moment we hold them fixed for the sake of simplicity.)

The likelihood of β marginalized over u_i in the Bayesian rem may be written as

p(y_i | β, Σ) = (2π)^{-T/2} |Σ|^{-1/2} exp[ -(1/2) (y_i − X_i β)' Σ^{-1} (y_i − X_i β) ],   (35)

where Σ is the covariance matrix of the T × 1 composite error vector ε_i = [ε_i1, ε_i2, ..., ε_iT]'. Assuming that E[η_it u_j | X] = 0 for all i, j, t (Greene, 2012), the covariance matrix of the composite error ε_i is

Cov(ε_i) = Σ = σ² I_{T×T} + s² ι_T ι'_T =
[ σ² + s²   s²        · · ·   s²
  s²        σ² + s²   · · ·   s²
  ...       ...       . . .   ...
  s²        s²        · · ·   σ² + s² ],   (36)

where σ² is the variance of η_it and s² is the variance of u_i.

Before we proceed to our dp-rem method, we review the work of Kleinman and Ibrahim (1998) and Kyung et al. (2010), who use the Dirichlet process prior for a different purpose. Consider the model

y_it = X_it β_i + ζ_it,   (37)

where β_i is the vector of parameters. In this literature the focus has been on the heterogeneity in the parameters (i.e. the β_i) across individuals. For this purpose, the DP prior is put on the parameters β_i themselves, i.e.

F ~ DP(α, N(μ_β, Σ_β)),   β_i | F ~iid F.   (38)

Given that the β_i have a discrete dp posterior, the β_i are grouped, with those in the same group having the same value.

Heterogeneity in the parameters per se is not the principal focus of this paper. Rather, this paper aims at providing more efficient inference by exploiting the information in the distribution of the unobservables. In a rem setting, the unobservables are composed of individual-specific unobserved effects u_i and idiosyncratic errors η_it. Therefore, we focus on the heterogeneity in the hyper-parameters of these unobservables instead of the model parameters. In this sense, our method is in the same spirit as the literature pioneered by Conley et al. (2008).

We relax the identically distributed assumption for η_it and u_i by introducing DP priors on the variances.
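A draw from a Dirichlet process is discrete with probability one, and that discreteness is exactly what produces the grouping of variances used in our method. A truncated stick-breaking sketch (the base distribution and all parameter values below are illustrative, not the paper's):

```python
import numpy as np

def dp_stick_breaking(alpha, base_draw, n, trunc=200, rng=None):
    """Truncated stick-breaking draw from DP(alpha, G0): returns n samples
    from one realization G. Because G is discrete, values repeat (groups)."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(1.0, alpha, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = base_draw(trunc, rng)            # atom locations drawn from G0
    return rng.choice(atoms, size=n, p=w / w.sum())

rng = np.random.default_rng(2)
# Base distribution G0: inverse-gamma draws for error variances,
# mirroring the conjugate choice used for the variances in the paper.
base = lambda m, r: 1.0 / r.gamma(3.0, 1.0 / 2.0, size=m)

sigmas = dp_stick_breaking(alpha=1.0, base_draw=base, n=100, rng=rng)
n_groups = len(np.unique(sigmas))  # far fewer than 100 distinct values
```

The 100 sampled variances collapse onto a handful of distinct atoms: errors sharing an atom form a group with a common hyper-parameter, which is the mechanism exploited below.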
This will have the effect of grouping the errors over both the cross section dimension i and the time series dimension t, with those in the same group sharing the same hyper-parameter. The DP prior for the variance of the idiosyncratic error η_it is

G ~ DP(α_η, G_0),   σ²_it | G ~ G,   (39)

where α_η and G_0 denote the concentration parameter and base distribution of the DP prior, respectively. The grouping of these variances takes place without imposing any restrictions. For example, if c_it ≠ c_is (t ≠ s), where c_it (c_is) denotes the group id of η_it (η_is), then σ*²_{c_it} and σ*²_{c_is} are allocated to different groups and η_it and η_is have different distributions. (Note that the "random effects model" in the studies cited above, mostly in statistics, means something different from that in econometrics, as theirs is in fact a random coefficients model.)

The DP prior for the variance of the individual effects u_i in our dp-rem can be written using the following hierarchical structure:

F ~ DP(α_u, F_0),   d²_i | F ~ F,   (40)

where d²_i is the prior variance of the random effect u_i, α_u is the concentration parameter, and F_0 the base distribution of the DP prior. The use of an independent DP prior on the hyper-parameters of the individual effects u_i generates groupings over the N individuals such that the u_i that belong to the same group are generated from a distribution with the same hyper-parameter. This relaxes the rem assumption that the individual effects are identically distributed.

Although the u_i are no longer identically distributed, for each particular u_i a conjugate normal prior can be introduced. The posterior of each u_i is then a normal distribution, the means and variances of which differ across the cross section i, i.e.

u_i | y, β, d*²_{c_i}, σ*²_{c_it} ~ N(μ_i, s²_i),   (41)

where

μ_i = s²_i ι'_T Σ^{-1}_{η_i} (y_i − X_i β)   (42)

is the posterior mean of u_i. The posterior variance is given by

s²_i = ( (d*²_{c_i})^{-1} + ι'_T Σ^{-1}_{η_i} ι_T )^{-1}.   (43)

From (43) we observe that the posterior variance of the random effect u_i is the inverse of the sum of (d*²_{c_i})^{-1}, the inverse of the unique value of d²_i (the hyper-parameter of u_i), and ι'_T Σ^{-1}_{η_i} ι_T, the sum of all the elements of Σ^{-1}_{η_i}. As we allow Σ_{η_i} ≠ Σ_{η_j} for all i ≠ j, s²_i is in turn allowed to differ for each individual effect u_i. The covariance matrix of each composite error vector ε_i is also allowed to be different for every i, and is given by

Cov(ε_i) = Σ_i = Σ_{η_i} + s²_i ι_T ι'_T.   (44)

For the choice of base distributions, we use the inverse gamma distribution, the conjugate prior for the variance of a normal distribution, i.e.

F_0 ≡ IG(a_u, b_u),   G_0 ≡ IG(a_η, b_η),   (45)

where a_u and a_η are the shape hyper-parameters, and b_u and b_η denote, respectively, the rate hyper-parameters of F_0 and G_0.

The likelihood of β marginalized over u_i is given by

p(y_i | β, Σ_i) = (2π)^{-T/2} |Σ_i|^{-1/2} exp[ -(1/2) (y_i − X_i β)' Σ^{-1}_i (y_i − X_i β) ].   (46)

Compared with the marginal likelihood of the parametric Bayesian rem in (35), the covariance matrix of the composite error vector ε_i is allowed to be different for each individual i in the panel.

Assuming a normal prior for β, i.e. β ~ N(b_0, V_0), where b_0 and V_0 denote, respectively, the prior mean and covariance matrix of β (here we adopt a prior whose mean is 0 and variance is 1000), the posterior of β marginalized over u_i is

β | y, d*²_{c_i}, σ*²_{c_it} ~ N(b̄, V̄),   (47)

where

V̄ = ( V_0^{-1} + Σ_{i=1}^N X'_i Σ^{-1}_i X_i )^{-1}   (48)

denotes the posterior covariance matrix, and b̄ is the posterior mean vector, which we write as

b̄ = V̄ ( V_0^{-1} b_0 + Σ_{i=1}^N X'_i Σ^{-1}_i y_i ).   (49)

For Θ_η = {σ²_it}, U = {u_i}, and Θ_u = {d²_i}, a Gibbs sampler for this dp-rem cycles through the conditionals:

Θ_η | y, X, U, β, Θ_u, α_u, α_η
Θ_u | y, X, U, β, Θ_η, α_u, α_η
U | y, X, β, Θ_u, Θ_η, α_u, α_η
β | y, X, U, Θ_u, Θ_η, α_u, α_η
α_u | y, X, U, β, Θ_u, Θ_η, α_η
α_η | y, X, U, β, Θ_u, Θ_η, α_u.   (50)

The Gibbs samplers for the regression parameters, hyper-parameters and concentration parameters of the two DPs are similar to those for the dp-sur in Section 3.2. In the dp-rem, the random effects have a mixture of normal distributions. The posterior mean and variance of each particular u_i are given in (42) and (43), respectively; for each i, a u_i is drawn from N(μ_i, s²_i) within the Gibbs sampler.

The Correlated Random Effects Model (crem) represents a natural extension of the rem. Introduced by Mundlak (1978) and further discussed by Chamberlain (1980), the crem offers a middle ground between the fixed and random effects. Without loss of generality, we consider the following model for the panel data:

y_it = β_1 x_1it + β_2 x_2it + v_i + η_it,   (51)

where v_i is the random effect.
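Conditional on the covariance matrices, (47)-(49) are standard conjugate GLS updates. A self-contained numerical sketch with assumed toy values (prior variance 1000 as in the paper; the panel dimensions, coefficients and per-individual covariances are illustrative):

```python
import numpy as np

def beta_posterior(y_list, X_list, Sigma_list, b0, V0):
    """GLS posterior of beta as in eqs. (47)-(49):
    V = (V0^-1 + sum_i X_i' Sigma_i^-1 X_i)^-1,
    b = V (V0^-1 b0 + sum_i X_i' Sigma_i^-1 y_i)."""
    V0_inv = np.linalg.inv(V0)
    A = V0_inv.copy()
    c = V0_inv @ b0
    for y_i, X_i, S_i in zip(y_list, X_list, Sigma_list):
        S_inv = np.linalg.inv(S_i)
        A += X_i.T @ S_inv @ X_i
        c += X_i.T @ S_inv @ y_i
    V = np.linalg.inv(A)
    return V @ c, V

# Toy panel: each unit gets its own composite-error covariance (eq. 44).
rng = np.random.default_rng(3)
T, K, N = 6, 2, 60
beta_true = np.array([5.0, 10.0])
iota = np.ones((T, 1))
ys, Xs, Ss = [], [], []
for i in range(N):
    s2_i, sig2_i = rng.uniform(0.5, 2.0), rng.uniform(0.1, 0.5)
    Sigma_i = sig2_i * np.eye(T) + s2_i * (iota @ iota.T)
    X_i = rng.normal(size=(T, K))
    y_i = X_i @ beta_true + rng.multivariate_normal(np.zeros(T), Sigma_i)
    ys.append(y_i); Xs.append(X_i); Ss.append(Sigma_i)

b, V = beta_posterior(ys, Xs, Ss, b0=np.zeros(K), V0=1000.0 * np.eye(K))
```

Allowing a different Σ_i per individual only changes the terms accumulated inside the sums; the conjugate form of the update is unaffected.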
While maintaining the gls structure of the rem, the crem allows the individual effects to be correlated with X_i, representing the correlation as a linear function of the means of X_i, i.e.

v_i = β_3 x̄_1i + β_4 x̄_2i + u_i.   (52)

The crem model is then

y_it = β_1 x_1it + β_2 x_2it + β_3 x̄_1i + β_4 x̄_2i + u_i + η_it.   (53)

The DP prior can be introduced on the hyper-parameters of u_i and η_it as in the rem case.

DP-REM/CREM Simulation Results

We carry out a series of simulation experiments to demonstrate the performance of our dp-rem and dp-crem methods relative to the standard Bayesian rem and crem. The simulation experiments are designed for the same purpose as those for the dp-sur in Section 3.3. For the rem model we assume

y_it = β_1 x_1it + β_2 x_2it + u_i + η_it = β_1 x_1it + β_2 x_2it + ε_it,   (54)

where the explanatory variables are generated from normal distributions, x_1it ~iid N(1, ·) and x_2it ~iid N(3, ·). We set the coefficients in (54) to β_1 = 5, β_2 = 10. The coefficients in the crem model (53) are set to

β_1 = 5, β_2 = 10, β_3 = −2, β_4 = 2.   (55)

Below we present the simulation results. We first report the results where the errors, u_i and η_it, are assumed to follow t-distributions and then those with log-normal distributions.

t-Distributed Random Effects and Errors

Table 5 reports the averages of the posterior s.d.s of the rem coefficients estimated with both methods, and the average percentage differences between the dp-rem and rem posterior s.d.s. The largest differences between the two estimators with respect to the posterior s.d. are observed when df = 2, where the t-distributions of the random effects and the errors have the heaviest tails. As expected, these differences decrease as the df increases and the tails of the t-distributions become less heavy. In the bottom panel, where the errors have normal distributions (equivalent to the df being infinity), the dp-rem and normal rem posterior s.d. are almost equivalent, as the t-distribution coincides with the normal distribution in this case.

We also note that the percentage differences increase slightly when the sample size becomes larger for all three finite df. This is expected, given that there are more extreme realizations in larger samples; our dp-rem method detects such heterogeneity and assigns the extreme realizations into the same group. In contrast, the Bayesian rem method assuming normality flattens the normal posterior distribution to accommodate the extreme values, leading to larger posterior s.d.
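The t-distributed design in (54) can be generated in a few lines. A sketch of the DGP (the regressor spreads and the scaling of the unobservables are assumptions of ours; the paper's exact values are not reproduced here):

```python
import numpy as np

def simulate_rem_t(N, T, beta, df, rng):
    """Panel y_it = beta1*x1 + beta2*x2 + u_i + eta_it, with u_i and
    eta_it drawn from t-distributions (heavy tails when df is small).
    Regressor means (1 and 3) follow eq. (54); unit spreads are assumed."""
    x1 = rng.normal(1.0, 1.0, size=(N, T))
    x2 = rng.normal(3.0, 1.0, size=(N, T))
    u = rng.standard_t(df, size=(N, 1))      # time-invariant random effects
    eta = rng.standard_t(df, size=(N, T))    # idiosyncratic errors
    y = beta[0] * x1 + beta[1] * x2 + u + eta
    return y, np.stack([x1, x2], axis=-1)

rng = np.random.default_rng(4)
y, X = simulate_rem_t(N=100, T=5, beta=(5.0, 10.0), df=2, rng=rng)
```

With df = 2 the t-distribution has no finite variance, which is what makes this the hardest case for an estimator assuming a single normal error distribution.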
Table 6 reports the averages of the posterior s.d. of the crem coefficients, and the average percentage differences between the two estimators. β_1 and β_2 belong to the two original explanatory variables, whereas β_3 and β_4 capture the effect of the respective sample means for each individual in the panel. The findings are similar to the rem case in that the percentage differences between the posterior s.d. estimated with our dp-crem and the parametric Bayesian crem are largest with df equal to 2, and decrease as the df increases. Differences between the two methods regarding the posterior s.d. are almost zero when the df is infinity, when the t-distribution becomes the normal distribution. The percentage differences also increase slightly in the three finite df cases when the sample size becomes large, due to more extreme values in the unobservables.

Table 5: Posterior s.d., rem with t-distributed unobservables

Log-normal Distributed Random Effects and Errors
Table 7 contains the averages of the posterior s.d. estimated with the dp-rem and the normal rem, and the percentage differences between them. Our dp-rem posterior s.d. are smaller than those estimated by the Bayesian rem assuming normality in all cases. Because the log-normal distribution is heavy tailed, the percentage differences exceed 70% in all cases, and they increase slightly as the sample size grows. The posterior s.d. of the dp-crem and crem averaged over the simulated samples are reported in Table 8, along with the average percentage differences between the two s.d.
As before, the posterior s.d. estimated with our dp-crem are more than 70% smaller than those estimated with the normal Bayesian crem for all coefficients. The percentage differences also increase when the sample size increases.
In this section we present the results based upon two empirical examples. In the first we estimatethe cost function of U.S. banks, and in the second we estimate the wages of U.S. workers.
Bank Cost Function
We first apply our dp-rem and dp-crem methods to the dataset in Feng and Serletis (2009) on the costs of 218 U.S. banks whose assets are between 1 and 3 billion dollars (2000 value), covering a period of 8 years from 1998 to 2005. There are three inputs, labour, borrowed funds and physical capital, and three outputs, consumer loans, non-consumer loans and securities. The functional form is the simple translog cost function (Christensen and Greene, 1976).

Table 6: Posterior s.d., crem with t-distributed unobservables

Table 7: Posterior s.d., rem with log-normal distributed unobservables

Table 8: Posterior s.d., crem with log-normal distributed unobservables

For bank i with n inputs and m outputs we write

ln C_it = Σ_{j=1}^m α_j ln q_j,it + (1/2) Σ_{j=1}^m Σ_{k=1}^m δ_jk ln q_j,it · ln q_k,it + Σ_{r=1}^n β_r ln p_r,it + (1/2) Σ_{r=1}^n Σ_{s=1}^n φ_rs ln p_r,it · ln p_s,it + Σ_{r=1}^n Σ_{j=1}^m γ_rj ln p_r,it · ln q_j,it + u_i + η_it,   (56)

where C is cost, q_j is the quantity of output j, and p_r is the price of input r. We impose linear homogeneity in input prices on the cost function, which in the translog case can be expressed as

Σ_{r=1}^n β_r = 1,   Σ_{s=1}^n φ_sr = 0, r = 1, 2, ..., n,   Σ_{r=1}^n γ_rj = 0, j = 1, 2, ..., m.   (57)

Table 9 contains the posterior means and s.d. of the free coefficients in the rem. To differentiate the inputs from the outputs, we index the three outputs with numbers and the inputs with letters, with l and f denoting, respectively, labour and borrowed funds.
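The restrictions in (57) guarantee that the fitted cost is homogeneous of degree one in input prices: scaling every price by e^λ must shift ln C by exactly λ. A numerical sketch with random coefficients forced to satisfy (57) (all values illustrative, not estimates):

```python
import numpy as np

def translog_lncost(lnp, lnq, alpha, delta, beta, phi, gamma):
    """ln C for the translog in eq. (56); delta and phi taken symmetric."""
    return (alpha @ lnq + 0.5 * lnq @ delta @ lnq
            + beta @ lnp + 0.5 * lnp @ phi @ lnp
            + lnp @ gamma @ lnq)

rng = np.random.default_rng(5)
n, m = 3, 3                                   # three inputs, three outputs
alpha = rng.normal(size=m)
delta = rng.normal(size=(m, m)); delta = (delta + delta.T) / 2
beta = rng.normal(size=n); beta = beta / beta.sum()        # sum_r beta_r = 1
phi = rng.normal(size=(n, n)); phi = (phi + phi.T) / 2
phi = phi - phi.mean(axis=0) - phi.mean(axis=1)[:, None] + phi.mean()
#   ^ double-centering: zero row and column sums, as in eq. (57)
gamma = rng.normal(size=(n, m)); gamma = gamma - gamma.mean(axis=0)
#   ^ column sums over inputs are zero: sum_r gamma_rj = 0

lnp, lnq = rng.normal(size=n), rng.normal(size=m)
lam = 0.37                                    # scale all prices by e^lam
lhs = translog_lncost(lnp + lam, lnq, alpha, delta, beta, phi, gamma)
rhs = translog_lncost(lnp, lnq, alpha, delta, beta, phi, gamma) + lam
```

With the restrictions imposed, `lhs` equals `rhs` up to floating-point error; without them the two sides diverge, which is why (57) is imposed on the estimated system.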
The posterior mode of the number of groups in the random effects is 2, and that in the errors is 3, which shows the existence of heterogeneity, although it is not strong.

The posterior means of all the coefficients on the first-order terms (the β's and α's) are of the same magnitude under the two methods, with the exception of the α for customer loans, which is insignificant under the rem. Although a number of the coefficients on the cross-product terms have different signs across the two methods, these coefficients are not significant. Consistent with the detection of heterogeneity in the random effects and the errors, the posterior s.d. of our dp-rem method are smaller than those estimated by the parametric Bayesian rem for all coefficients. Most of the percentage differences presented in the last column exceed 10%, with the largest being 24.51% for one of the δ's.

We also estimate the model with the dp-crem. The posterior modes of the number of groups in the random effects and the errors are 2 and 3, respectively, again indicating the existence of heterogeneity. The coefficients of the explanatory variables have smaller dp-crem posterior s.d. than their crem counterparts, with magnitudes similar to those in Table 9. However, the regression parameters on the sample means of the explanatory variables are all highly insignificant, as their posterior s.d. are very large compared with their posterior means. This indicates that the data do not support the crem specification.

U.S. Individual Wage
In this section we present the results of a wage model for U.S. workers using the data in Cornwell and Rupert (1988). The data cover 595 individuals over a period of 7 years, from 1976 to 1982. This sample size allows us to demonstrate our method with sub-samples of 100 and 250 individuals. (The number of groups, i.e. the number of normal distributions being mixed, is itself a random variable in Bayesian methods; its posterior mode thus provides an indication of the strength of heterogeneity in the error distributions. Full results are not presented here, but are available on request.)
Table 9: Posterior means and s.d. of the free coefficients, rem

Coefficient   DP-REM mean   REM mean   DP-REM s.d.   REM s.d.   ∆%
β_l           -0.6583       -0.6099    0.1480        0.1795     17.56%
α             -0.1908       -0.0240    0.0799        0.0944     15.35%
φ_lf          -0.1664       -0.0891    0.0079        0.0097     18.57%
δ             -0.1441       -0.1318    0.0117        0.0152     22.98%
γ_l           -0.0030        0.0370    0.0131        0.0151     13.60%
γ_f           -0.0157       -0.0220    0.0035        0.0043     19.51%

The model is given by

ln Wage_it = β_1 E_it + β_2 M_it + β_3 F_i + β_4 Ed_it + v_i + ε_it,

where the dependent variable is the logged wage, and the explanatory variables are experience in years (E), dummies for marital status (M) and for the individual being female (F), as well as the years of education (Ed). As there is strong reason to suspect that the unobserved individual effect v_i is endogenous, due to omitted variables such as personal ability and motivation, we apply our dp-crem model and write v_i as

v_i = β̃_1 Ē_i + β̃_2 M̄_i + u_i.   (58)

The means of experience and marital status for individual i are included because these are the two time-varying variables in the original model.

Table 10: U.S. Individual Wage, crem

                    Mean                          S.D.
Sample size    100            250            100                   250
Parameters     DP     REM     DP     REM     DP      REM     ∆%      DP      REM     ∆%
β̃_1           -0.091 -0.058  -0.085 -0.058  0.0046  0.0077  40.21%  0.0028  0.0056  50.17%

Table 10 contains the posterior means, s.d. and the percentage differences between the s.d. estimated with our dp-crem and the Bayesian crem assuming normality, for both the 100- and 250-individual sub-samples. The two coefficients on the means of the time-varying explanatory variables, β̃_1 and β̃_2, are both significant in both sub-samples, indicating that v_i is indeed correlated with the explanatory variables and confirming our suspicion. As for the coefficients on the explanatory variables themselves (β_1 to β_4), the posterior means all have the same signs under the dp-crem and the parametric crem, and the differences between them become smaller with the larger sub-sample of 250 individuals. As expected, experience and education are both positively correlated with the wage of the workers. The coefficient on the gender dummy (β_3) is also positive. This may seem an indication of gender discrimination in wages against male workers within the two sub-samples.
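Operationally, the crem only requires augmenting the regressor matrix with individual means before running the rem machinery. A sketch of this construction (the function name and toy data are ours):

```python
import numpy as np

def mundlak_design(X, ids):
    """Augment regressors with individual time-averages (the crem device,
    eq. 58): returns [X, X_bar], where X_bar repeats each individual's mean."""
    X, ids = np.asarray(X, float), np.asarray(ids)
    X_bar = np.empty_like(X)
    for g in np.unique(ids):
        mask = ids == g
        X_bar[mask] = X[mask].mean(axis=0)   # time average for individual g
    return np.hstack([X, X_bar])

# Toy panel: 3 individuals, 2 periods, 2 regressors.
X = np.array([[1.0, 0.0], [3.0, 1.0],
              [2.0, 1.0], [2.0, 1.0],
              [0.0, 0.0], [4.0, 0.0]])
ids = np.array([1, 1, 2, 2, 3, 3])
Z = mundlak_design(X, ids)
```

Only time-varying columns are worth averaging: a time-invariant column would duplicate itself exactly and make the design singular, which is why only the means of experience and marital status enter (58).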
Heterogeneity is detected in both sub-samples: the posterior modes of the numbers of groups in the random effects and the errors are 2 and 3 with the 100-individual sub-sample, and 3 and 4 with the 250-individual one. Our semi-parametric dp-crem provides smaller posterior s.d. for all coefficients in both sub-samples.

In Figure 2 we present histograms of the posterior distributions of the education parameter (β_4). As before, the two red vertical lines in each panel mark the 95% credible interval. Comparing panels (a) and (b), we observe that for the 100-observation sub-sample both the dp-crem and the parametric Bayesian crem give 95% credible intervals that exclude 0, with the interval for the dp-crem (from 0.052 to 0.119) shorter than that for the parametric crem (from 0.292 to 0.377). This is not surprising given that the posterior distribution is right skewed, as shown in panel (a). Similar conclusions follow from comparing panels (c) and (d), based upon the 250-observation sub-sample: the 95% credible interval with the dp-crem runs from 0.058 to 0.092, while that with the parametric crem runs from 0.263 to 0.313. Note also that the 95% credible intervals based upon the 250-observation sub-sample are shorter than their counterparts using the 100-observation sub-sample, due to the increase in the sample size.

Conclusion

In this paper we address the potential violation of the assumption, made by parametric Bayesian gls estimators, that the unobservables are homogeneous regarding their distributions. Such an assumption is likely to be problematic in reality, particularly when micro data are used, as the features of individuals or households are likely to lead to the distributions of observations being different. We present a semi-parametric Bayesian gls in which the error distribution is a non-parametric mixture of normal distributions, obtained by introducing a Dirichlet process prior on the hyper-parameters of the errors.
The number of normal components is decided jointly by the data and the prior in such a mixture of normals, which is able to cover a large variety of distributions. The errors are grouped by the DP prior, with those in the same group having the same hyper-parameters and thus the same distribution. Two specific cases of the semi-parametric Bayesian gls are then introduced: the sur for equation systems, and the rem/crem for panel data.

(A caveat on the wage results above: before concluding that there is discrimination against male workers, one should keep in mind that in the time period of the data, 1976 to 1982, fewer women were working than at present. In the 100-observation sub-sample, 16% of the workers are women; in the 250-observation sub-sample the percentage is 8.8%. Given such low percentages of female workers, those women who did decide to enter the labour market may themselves be relatively skilled workers who could reasonably expect a high wage. Sample selection bias (Heckman, 1976) may therefore still be present, though addressing it is not within the scope of this paper.)

Figure 2: Histograms of the parameter for education (β_4). (a) dp-crem, 100 observations; (b) parametric crem, 100 observations; (c) dp-crem, 250 observations; (d) parametric crem, 250 observations.

Our dp-sur and dp-rem/dp-crem methods are demonstrated with a series of simulation experiments consisting of three scenarios, in which the unobservables follow normal distributions, t-distributions (one type of scale mixture of normals), and log-normal distributions, respectively. The results show that in the homogeneous normal case, our dp-sur and dp-rem/dp-crem methods give posterior means and s.d. similar to their parametric counterparts assuming normality. When the errors follow t-distributions, the degrees of freedom control how heavy the tails are, which reflects the strength of heterogeneity in the unobservables. Our simulation results show that the posterior s.d. of our dp-sur and dp-rem/dp-crem are smaller than those of the parametric Bayesian methods. Such efficiency gains are largest when the df is 2, which represents the strongest heterogeneity, and become smaller as the df increases, since the tails are then less heavy, i.e. the heterogeneity is less strong. The simulations with log-normal unobservables demonstrate the robustness of our method with asymmetric, fat-tailed distributions: the posterior s.d. of our dp-sur and dp-rem/dp-crem methods are more than 50% smaller than those of the parametric Bayesian estimators assuming normality. Moreover, the efficiency gains increase slightly with larger sample sizes when the distribution of the unobservables is non-normal, as a result of more extreme realizations in large samples.

We apply our dp-sur method to the demands for production factors with the generalized Leontief cost function, using a dataset on the U.S. banking industry. We estimate the model with an 800-observation sub-sample and with the full sample. Heterogeneity is detected in both. The dp-sur posterior s.d. are smaller than the normal Bayesian sur ones for all the demand elasticities, which shows that a semi-parametric method such as our dp-sur is preferable.

Our dp-rem/dp-crem are applied to two datasets as well. The first is the U.S. bank cost function data, which the rem seems to fit better. Heterogeneity is detected in the U.S. bank data, and our dp-rem achieves smaller posterior s.d. than the parametric Bayesian rem. The second application is a U.S. individual wage model, where there is strong reason to suspect that the unobserved individual effects are correlated with explanatory variables such as education, due to unobserved individual features such as ability. The crem model is then estimated. Our dp-crem detects heterogeneity in this dataset as well, and obtains smaller posterior s.d. than the parametric Bayesian crem.

A Posterior Means for DP-SUR Simulations
A.1 Multivariate t-distributed Errors
Table 11 gives the posterior means averaged over the samples, estimated with the dp-sur and with the sur assuming normality. The posterior means estimated with our semi-parametric dp-sur and with the Bayesian sur assuming normality are similar to each other, and both are close to the true values of the coefficients in all cases.

Table 11: Posterior means, multivariate t errors

Sample size   Truth   100               250               500
Parameters            DP       SUR      DP       SUR      DP       SUR
df = 2
β             -0.5    -0.500   -0.484   -0.491   -0.488   -0.500   -0.519
β             -1.2    -1.195   -1.194   -1.201   -1.194   -1.206   -1.230
β             -0.7    -0.702   -0.719   -0.697   -0.700   -0.696   -0.684
df = 3
β             -0.5    -0.497   -0.493   -0.502   -0.499   -0.498   -0.499
β             -1.2    -1.194   -1.183   -1.198   -1.199   -1.202   -1.199
β             -0.7    -0.699   -0.696   -0.698   -0.699   -0.703   -0.702
df = 4
β             -0.5    -0.500   -0.499   -0.516   -0.519   -0.495   -0.495
β             -1.2    -1.205   -1.208   -1.192   -1.192   -1.202   -1.202
β             -0.7    -0.708   -0.713   -0.700   -0.701   -0.704   -0.701
df = ∞
β             -0.5    -0.4969  -0.4974  -0.4959  -0.4957  -0.4904  -0.4905
β             -1.2    -1.2115  -1.2118  -1.1959  -1.1965  -1.1976  -1.1977
β             -0.7    -0.6984  -0.6985  -0.7004  -0.7003  -0.6911  -0.6912

A.2 Multivariate Log-normal Errors

Table 12 contains the posterior means estimated with the two methods with multivariate log-normal errors. In all three samples the posterior means of all the slope parameters from the two methods are similar, and close to the truth. The intercepts estimated with our dp-sur, however, are farther away from the true values.
The skewness of the log-normal distribution influences the posterior means of the intercepts when the DP mixture model mixes normal distributions to approximate the log-normal distribution.

Table 12: Posterior means, multivariate log-normal errors

Sample size   Truth   100               250               500
Parameters            DP       SUR      DP       SUR      DP       SUR
β             -0.5    -0.501   -0.492   -0.499   -0.488   -0.496   -0.504
β             -1.2    -1.201   -1.180   -1.202   -1.205   -1.199   -1.198
β             -0.7    -0.701   -0.696   -0.703   -0.719   -0.705   -0.720

B Posterior Standard Deviations for DP-SUR Simulations
B.1 Multivariate t-distributed Errors
Table 13 presents the full results regarding the posterior standard deviations of the dp-sur simulations with multivariate t-distributed errors.
B.2 Multivariate Log-normal Errors
Table 14 shows the full results regarding the posterior standard deviations of the dp-sur simu-lations with multivariate log-normal errors.
C Posterior Means for DP-REM/CREM
C.1 t-distributed Errors
Table 15 contains the posterior means, averaged over the samples, of the coefficients in the rem with t-distributed random effects and errors. The averaged posterior means estimated with our dp-rem and with the parametric Bayesian rem are almost identical, and both are close to the true values of the coefficients in all four cases, i.e. df equal to 2, 3, 4, and infinity, where the t-distribution becomes the normal distribution.

The averages of the posterior means of the crem coefficients are presented in Table 16. Similar to the rem case, the averaged posterior means estimated with our dp-crem and with the parametric Bayesian crem are similar to each other, and close to the pre-set true values of the coefficients in all cases.
C.2 Log-normal Errors
Table 17 gives the averages of the posterior means of the rem with log-normal distributed random effects and errors, estimated with the dp-rem and the Bayesian rem assuming normality, with the dp-rem posterior means being slightly closer to the truths.

Table 13: Posterior s.d., multivariate t errors, full results

Table 14: Posterior s.d., multivariate log-normal errors, full results

Table 15: Posterior means, rem with t-distributed unobservables

Sample size   Truth   100               300               500
Parameters            DP       REM      DP       REM      DP       REM
df = 2
β_2           10      9.999    10.002   10.000   9.997    9.996    9.990
df = 3
β_2           10      10.001   10.000   9.998    9.997    9.999    9.999
df = 4
β_2           10      10.001   10.001   10.001   10.003   9.998    9.998
df = ∞
β_2           10      9.998    9.999    9.999    9.999    10.000   10.000

Table 16: Posterior means, crem with t-distributed unobservables

Sample size   Truth   100               300               500
Parameters            DP       REM      DP       REM      DP       REM
df = 2
β_2           10      10.002   9.996    10.001   10.005   10.000   9.999
β_3           -2      -1.973   -2.020   -1.988   -1.961   -2.005   -2.003
df = 3
β_2           10      10.000   9.997    10.000   10.003   9.997    9.995
β_3           -2      -1.987   -1.970   -1.983   -1.968   -1.986   -1.996
df = 4
β_2           10      10.003   10.003   10.002   10.001   9.998    9.993
β_3           -2      -1.999   -1.996   -1.999   -1.996   -2.005   -2.004
df = ∞
β_2           10      10.001   10.001   9.998    9.998    10.001   10.001
β_3           -2      -1.999   -2.000   -2.004   -2.005   -1.998   -1.998

Table 17: Posterior means, rem with log-normal distributed unobservables

Sample size   Truth   100               300               500
Parameters            DP       REM      DP       REM      DP       REM
β_2           10      10.357   11.163   10.369   11.232   10.359   11.222
Table 18 presents the posterior means averaged over the simulation samples of the dp-crem and normal Bayesian crem with log-normal distributed random effects and errors. The posteriormeans of the coefficients for the explanatory variables are very close to the truth with both our dp-crem and Bayesian crem assuming normality. The posterior means of the coefficients forthe means of the explanatory variables are slightly farther away from the truth. The skewnessof the log-normal distribution influences the posterior means of the intercepts, which are timeinvariant for each individual i in the random effects. In the crem case, the sample means ofeach individual’s explanatory variables, ¯ x i and ¯ x i , are also time invariant like the intercept. Asa result, the posterior means of their coefficients, β and β , are more different from the truthcompared with β and β , as the log-normal distribution is skewed.Table 18: Posterior means, crem with log-normal distributed unobservables Log-normalSample size Truth 100 300 500Parameters DP REM DP REM DP REM β β
β (truth 10): DP 9.999, 10.007, 10.000; REM 10.020, 10.015, 9.995
β (truth -2): DP -1.908, -1.851, -1.832; REM -1.269, -1.657, -1.432
(remaining rows of Table 18 not recoverable)

References

[1] Aldous, D. J. (1985). Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII—1983. Springer, Berlin, 1-198.
[2] Allenby, G. M., Arora, N., & Ginter, J. L. (1998). On the heterogeneity of demand. Journal of Marketing Research, 384-389.
[3] Andrews, D. F., & Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B (Methodological), 99-102.
[4] Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 1152-1174.
[5] de Carvalho, V. I., Jara, A., Hanson, T. E., & de Carvalho, M. (2013). Bayesian nonparametric ROC regression modeling. Bayesian Analysis, 8(3), 623-646.
[6] Chao, J. C., & Phillips, P. C. (1998). Posterior distributions in limited information analysis of the simultaneous equations model using the Jeffreys prior. Journal of Econometrics, 87(1), 49-86.
[7] Chigira, H., & Shiba, T. (2015). Dirichlet prior for estimating unknown regression error heteroskedasticity. TERG Discussion Papers, 341, 1-17.
[8] Clementi, F., & Gallegati, M. (2005). Pareto's law of income distribution: Evidence for Germany, the United Kingdom, and the United States. In Econophysics of Wealth Distributions (pp. 3-14). Springer, Milano.
[9] Conley, T. G., Hansen, C. B., McCulloch, R. E., & Rossi, P. E. (2008). A semi-parametric Bayesian approach to the instrumental variable problem. Journal of Econometrics, 144(1), 276-305.
[10] Cornwell, C., & Rupert, P. (1988). Efficient estimation with panel data: An empirical comparison of instrumental variables estimators. Journal of Applied Econometrics, 3(2), 149-155.
[11] Christensen, L. R., & Greene, W. H. (1976). Economies of scale in US electric power generation. Journal of Political Economy, 84(4, Part 1), 655-676.
[12] Diewert, W. E. (1971). An application of the Shephard duality theorem: A generalized Leontief production function. Journal of Political Economy, 79(3), 481-507.
[13] Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), 577-588.
[14] Escobar, M. D., & West, M. (1998). Computing nonparametric hierarchical models. In: Dey, D., Müller, P., & Sinha, D. (Eds.), Practical Nonparametric and Semiparametric Bayesian Statistics. Springer, New York, 1-22.
[15] Feng, G., & Serletis, A. (2009). Efficiency and productivity of the US banking industry, 1998-2005. Journal of Applied Econometrics, 24(1), 105-138.
[16] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2), 209-230.
[17] Gershman, S. J., & Blei, D. M. (2012). A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1), 1-12.
[18] Geweke, J. (1993). Bayesian treatment of the independent Student-t linear model. Journal of Applied Econometrics, 8(S1).
[19] Geweke, J. (1996). Bayesian reduced rank regression in econometrics. Journal of Econometrics, 75(1), 121-146.
[20] Greene, W. H. (2012). Econometric Analysis, Seventh Edition. Prentice Hall, Upper Saddle River, New Jersey.
[21] Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement, 5(4), 475-492. NBER.
[22] Hejblum, B. P., Alkhassim, C., Gottardo, R., Caron, F., & Thiébaut, R. (2019). Sequential Dirichlet process mixtures of multivariate skew t-distributions for model-based clustering of flow cytometry data. The Annals of Applied Statistics, 13(1), 638-660.
[23] Kleinman, K. P., & Ibrahim, J. G. (1998). A semiparametric Bayesian approach to the random effects model. Biometrics, 921-938.
[24] Kleibergen, F., & van Dijk, H. K. (1998). Bayesian simultaneous equations analysis using reduced rank structures. Econometric Theory, 14(6), 701-743.
[25] Kloek, T., & van Dijk, H. K. (1978). Bayesian estimates of equation system parameters: An application of integration by Monte Carlo. Econometrica, 46, 1-19.
[26] Koop, G. (2003). Bayesian Econometrics. John Wiley & Sons.
[27] Kumbhakar, S. C., & Tsionas, E. G. (2011). Stochastic error specification in primal and dual production systems. Journal of Applied Econometrics, 26(2), 270-297.
[28] Kyung, M., Gill, J., & Casella, G. (2010). Estimation in Dirichlet random effects models. The Annals of Statistics, 38(2), 979-1009.
[29] Li, M., & Tobias, J. L. (2011). Bayesian inference in a correlated random coefficients model: Modeling causal effect heterogeneity with an application to heterogeneous returns to schooling. Journal of Econometrics, 162(2), 345-361.
[30] Li, C., Casella, G., & Ghosh, M. (2018). Estimation of regression vectors in linear mixed models with Dirichlet process random effects. Communications in Statistics - Theory and Methods, 47(16), 3935-3954.
[31] MacEachern, S. N. (1998). Computational methods for mixture of Dirichlet process models. In: Dey, D., Müller, P., & Sinha, D. (Eds.), Practical Nonparametric and Semiparametric Bayesian Statistics. Springer, New York, 23-43.
[32] Malikov, E., Kumbhakar, S. C., & Tsionas, M. G. (2016). A cost system approach to the stochastic directional technology distance function with undesirable outputs: The case of US banks in 2001-2010. Journal of Applied Econometrics, 31(7), 1407-1429.
[33] Murtazashvili, I., & Wooldridge, J. M. (2008). Fixed effects instrumental variables estimation in correlated random coefficient panel data models. Journal of Econometrics, 142(1), 539-552.
[34] Rossi, P. E., Allenby, G. M., & McCulloch, R. (2012). Bayesian Statistics and Marketing. John Wiley & Sons.
[35] Strutz, T. (2010). Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond. Vieweg and Teubner.
[36] Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems (pp. 1385-1392).
[37] Teh, Y. W. (2011). Dirichlet process. In Encyclopedia of Machine Learning (pp. 280-287). Springer US.
[38] Wiesenfarth, M., Hisgen, C. M., Kneib, T., & Cadarso-Suarez, C. (2014). Bayesian nonparametric instrumental variables regression based on penalized splines and Dirichlet process mixtures. Journal of Business & Economic Statistics, 32(3), 468-482.
[39] White, H. (2014). Asymptotic Theory for Econometricians. Academic Press.
[40] Wooldridge, J. M. (2003). Cluster-sample methods in applied econometrics. American Economic Review, 93(2), 133-138.
[41] Wooldridge, J. M. (2005). Fixed-effects and related estimators for correlated random-coefficient and treatment-effect panel data models. Review of Economics and Statistics, 87(2), 385-390.
[42] Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association, 57(298), 348-368.
[43] Zellner, A. (1971).