A Semi-Parametric Bayesian Generalized Least Squares Estimator
Ruochen Wu
School of Economics, Fudan University

Melvyn Weeks*
Faculty of Economics and Clare College, University of Cambridge

November 23, 2020
Abstract
In this paper we propose a semi-parametric Bayesian Generalized Least Squares estimator. In a generic gls setting where each error is a vector, parametric gls maintains the assumption that each error vector has the same covariance matrix. In reality, however, the observations are likely to be heterogeneous in their distributions. To cope with such heterogeneity, a Dirichlet process prior is introduced for the covariance matrices of the errors, leading to an error distribution that is a mixture of a variable number of normal distributions. Our methods let the number of normal components be data driven. Two specific cases are then presented: the semi-parametric Bayesian Seemingly Unrelated Regression (sur) for equation systems, and the Random Effects Model (rem) and Correlated Random Effects Model (crem) for panel data. A series of simulation experiments is designed to explore the performance of our methods. The results demonstrate that our methods obtain smaller posterior standard deviations than the parametric Bayesian gls. We then apply our semi-parametric Bayesian sur and rem/crem methods to empirical examples.

JEL Classification Code: C3

Keywords: Bayesian semi-parametric, generalized least squares, Dirichlet process, equation system, seemingly unrelated regression, panel data, random effects model, correlated random effects model.

* Contact Author: Dr. M. Weeks, Faculty of Economics, University of Cambridge, Cambridge CB3 9DD, UK. Email: [email protected]. We are grateful for the useful comments received from Debopam Bhattacharya, Xiaohong Chen, Gernot Doppelhofer, Oliver Linton and Justin Tobias.

1 Introduction
The Generalized Least Squares (gls) estimator is a family of econometric methods that has seen numerous applications in empirical economics. As pointed out by Wooldridge (2003), parametric gls type estimators accommodate deviations from the assumption that the errors in the model are homoskedastic and serially uncorrelated. For example, relative to the ordinary least squares regression model, gls no longer assumes that the covariance matrix of the errors is diagonal with identical diagonal elements.

In a more general setting, each error may be a random vector, which covers some of the most popular applications of gls. For example, the Seemingly Unrelated Regression (sur, Zellner, 1962 and 1971) has been developed for equation systems, and is widely applied to gain efficiency by exploiting the correlation between errors across equations. Similarly, in the analysis of panel data the random effects model (rem) recognizes that there are individual specific, time-invariant features that are unobservable and uncorrelated with the explanatory variables. A useful extension to the rem is the correlated random effects (crem) model (Chamberlain, 1980; Wooldridge, 2005; Murtazashvili and Wooldridge, 2008), which allows the individual effects to be correlated with the explanatory variables, usually through a linear function of the means of the regressors.

However, parametric gls still maintains the assumption that the error vector of each individual has the same covariance matrix. In reality, heterogeneity in error distributions is a major concern in empirical analysis. Such heterogeneity can be caused by observations on individuals or households reflecting variation in demographics, such as the size of the household and the level of income, among others. It is a challenge for analysts who seek reliable inference to capture the form of the heterogeneity in the observations.

The standard Bayesian approach to gls assumes that the error distribution is multivariate normal.
Recent developments in Bayesian methods allow the use of prior information to relax this assumption. For example, the Dirichlet prior has been introduced to accommodate heterogeneity in the distributions of both errors (see Chigira and Shiba, 2015 for an example) and model parameters (Allenby et al., 1998) by mixing a fixed number of normal distributions. However, a notable drawback of the Dirichlet prior is that the dimension of the mixing distribution is usually unknown. Bayesian semi-parametric methods introduce more flexibility by letting the data and the prior determine the structure of heterogeneity jointly. The Dirichlet Process (DP) prior (see Escobar and West, 1995 and 1998, and MacEachern, 1998, for references) can be used to form a mixture of normal distributions whose dimension need not be predetermined. In this sense, the use of DP priors represents a more flexible approach to accommodating heterogeneity than the mixing of a fixed number of normal distributions with the Dirichlet prior.

In contexts where heterogeneity in the distributions of the errors is the major concern, as is the case here where the focus is upon inference, DP priors are introduced for the hyper-parameters of the errors in the model (e.g. for an error vector with zero mean, the hyper-parameter is its covariance matrix). This leads to a grouping of the hyper-parameters, with those in the same group having identical values. As such, the errors whose hyper-parameters are in the same group have the same distribution, while errors whose hyper-parameters are in separate groups come from different distributions.

A landmark study in this area is Conley et al. (2008), who introduced a Bayesian semi-parametric approach to the instrumental variable problem in a two stage least squares framework. Due to the endogeneity of some explanatory variables, the errors in the two stages are correlated by construction. Instead of assuming that the joint errors in the two stages have an identical bivariate normal distribution (cf. Chao and Phillips, 1998; Geweke, 1996; Kleibergen …), Conley et al. (2008) place a DP prior on their distribution.

In this paper we propose a semi-parametric Bayesian gls that incorporates the DP prior. The motivation is to exploit more of the information in the error distribution by allowing the hyper-parameters to differ across observations. The resulting distribution of the error terms is a mixture of normal distributions in which the number of normal components is influenced by both the prior and the data. We then introduce two specific cases of the semi-parametric Bayesian gls, namely for equation systems and for panel data.

The rest of the paper is organized as follows. In Section 2 we introduce the generic form of the Dirichlet process and demonstrate its use as a prior for semi-parametric Bayesian gls. Then two special cases of the gls are described. The dp-sur method is introduced in Section 3. Sections 3.3 and 3.4 present the simulation design and results, respectively, for the dp-sur. Two empirical examples are given in Section 3.5. Section 4 motivates and introduces our semi-parametric Bayesian gls method for panel data, the dp-rem; its extension, the dp-crem, is introduced in Section 4.3. Simulation designs and results for the panel setting are in Section 4.4. The dp-rem and dp-crem methods are then applied to two empirical examples in Section 4.5. Section 5 concludes the paper.

2 The Generic Semi-parametric Bayesian GLS

We introduce the generic Bayesian gls with a DP mixture in this section. In Section 2.1 we briefly review the literature in related areas. Then in Section 2.2 we present how the Dirichlet process prior can be used to produce a semi-parametric Bayesian gls.

2.1 Related Literature

Bayesian attempts to incorporate heterogeneity in the hyper-parameters of the errors, and consequently in their distributions, can be traced back to Geweke (1993).
He introduced a Bayesian gls in which an inverse gamma prior is placed on the variances of the errors, each of which has a normal distribution. Geweke (1993) demonstrated that such a scale mixture of normal distributions is equivalent to the errors having a t-distribution.

Although the model with t-distributed errors is flexible, this approach depends upon the assumption that the normal distributions are mixed with inverse gamma distributed variances. As pointed out by Koop (2003), relaxing this assumption results in more flexible models, given that the errors are no longer restricted to having a t-distribution. This can be done by using a Dirichlet prior (the conjugate prior of a multinomial distribution) to mix a finite number of normal distributions. The Dirichlet mixture model has emerged as a widely applied methodology for capturing heterogeneity in both linear and non-linear models, including Allenby et al. (1998), Li and Tobias (2011) and Chigira and Shiba (2015). However, its main limitation is that a fairly difficult test procedure is needed to determine the "correct" number of mixing components.

In the wake of this limitation of the Dirichlet mixture model, it seems more reasonable to let the data and the prior determine the number of normal components jointly. This can be achieved using a Dirichlet prior of infinite dimension, which is the Dirichlet process (DP) introduced by Ferguson (1973); see Teh (2011) and Gershman and Blei (2012) for reviews. The DP is the conjugate prior for an infinite-dimensional, non-parametric distribution. A DP can be written as

$$ F \sim DP(\alpha, F_0), \tag{1} $$

where $\alpha > 0$ is the concentration parameter and $F_0$ is the base distribution.
$F$ is a random distribution that is discrete with probability one; the level of discreteness is influenced by the concentration parameter $\alpha$. The DP is a non-parametric "distribution of distributions" (Escobar and West, 1995 and 1998; MacEachern, 1998), in that a draw $F$ from a DP is itself a probability distribution. The Chinese restaurant process (Aldous, 1985) provides the predictive probabilities of the $n$th realization $r_n$ conditioned on the $n-1$ previous realizations $\{r_1, r_2, \ldots, r_{n-1}\}$, and realizations from $F$ can be drawn accordingly. Because $F$ is discrete, the existing realizations are assigned to groups, with a unique value shared by all realizations in the same group. Denote the group id of $r_i$ by $c_i = 1, \ldots, K$, and the unique value of group $c_i$ by $r^*_{c_i}$: if $r_i$ is in group $k$, then $r_i = r^*_{c_i} = r^*_k$. The predictive probability of $r_n$ is given by

$$ \Pr\{r_n = r^*_k \mid r_1, r_2, \ldots, r_{n-1}\} = \begin{cases} \dfrac{n_k}{n-1+\alpha} & \text{if } 1 \le k \le K \\[4pt] \dfrac{\alpha}{n-1+\alpha} & \text{if } k = K+1 \ \text{(i.e. } r_n = r^*_{K+1} \sim F_0\text{)}, \end{cases} \tag{2} $$

where $n_k$ is the number of realizations already in group $k$. Aldous (1985) showed that $r_1, r_2, \ldots, r_n$ generated according to the Chinese restaurant process are i.i.d. draws from $F$ (they are not independent unconditionally, since the $n$th realisation is generated conditional on the previous $n-1$; independence holds only given $F$), i.e.

$$ F \mid \alpha, F_0 \sim DP(\alpha, F_0), \qquad r_i \mid F \overset{iid}{\sim} F. \tag{3} $$

A model with a DP prior on the distribution of parameters is called a DP mixture model (de Carvalho et al., 2013; Wiesenfarth et al., 2014; Li et al., 2018; Hejblum et al., 2019), and is capable of representing very general forms of heterogeneity in the distributions of the observations. The DP normal mixture model, i.e. one whose mixture components are normal distributions, can be written as

$$ F \mid \alpha, F_0 \sim DP(\alpha, F_0), \qquad \theta_i \mid F \overset{iid}{\sim} F, \qquad y_i \mid \theta_i \sim N(\theta_i), \tag{4} $$

where $\theta_i$ is the set of parameters of observation $y_i$ (note that $y_i$ is a vector). In the multivariate normal case, $\theta_i$ consists of the mean vector and covariance matrix.
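To make the grouping in (2) concrete, the following is a minimal Python sketch (ours, not the authors' code; the function name and the choice $\alpha = 2$ are illustrative) that seats $n$ realizations according to the Chinese restaurant process:

```python
import numpy as np

def crp_assignments(n, alpha, seed=None):
    """Seat n realizations according to the Chinese restaurant process
    in (2): the i-th realization joins existing group k with probability
    n_k / (i + alpha) and opens a new group with probability
    alpha / (i + alpha), where i realizations are already seated."""
    rng = np.random.default_rng(seed)
    counts = []               # n_k for each existing group
    labels = []
    for i in range(n):
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)  # a new group, with value drawn from F0
        else:
            counts[k] += 1
        labels.append(k)
    return labels

labels = crp_assignments(500, alpha=2.0, seed=0)
n_groups = len(set(labels))   # K, typically far smaller than n
```

With $\alpha = 2$ and $n = 500$ the expected number of groups is roughly $\alpha \log(1 + n/\alpha) \approx 11$, far fewer than $n$, which is the discreteness that drives the grouping of hyper-parameters below.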
That is, $\theta_i = (\mu_i, \Sigma_i)$. The posterior probability of $\theta_i$ having the same value as one of the existing $\theta_{-i}$ is

$$ \Pr\{\theta_i = \theta^*_k \mid \theta_{-i}, y_i, \alpha\} \propto \frac{n_k}{n-1+\alpha}\, N(y_i \mid \theta^*_k), \tag{5} $$

where $\theta^*_k$ and $n_k$ denote, respectively, the unique value of group $k$ and the number of observations already in group $k$. The posterior probability of $\theta_i$ taking a new value from the base distribution, i.e. $\theta_i = \theta^*_{new} \sim F_0$, is

$$ \Pr\{\theta_i = \theta^*_{new} \mid \theta_{-i}, y_i, \alpha, F_0\} \propto \frac{\alpha}{n-1+\alpha} \int N(y_i \mid \theta^*_{new})\, p(\theta^*_{new} \mid F_0)\, d\theta^*_{new}, \tag{6} $$

where $p(\theta^*_{new} \mid F_0)$ is the probability density of the new value $\theta^*_{new}$ given $F_0$. We now introduce how the DP normal mixture model is applied to produce a semi-parametric Bayesian gls.

2.2 Semi-parametric Bayesian GLS

Below we introduce the generic form of the semi-parametric gls estimator, where a DP prior is introduced on the distribution of the hyper-parameters of the errors. Consider a general linear regression

$$ y_i = X_i \beta + \varepsilon_i, \tag{7} $$

where $i$ indexes the observation, $y_i$ is a $Q \times 1$ vector, $X_i$ is a $Q \times K$ matrix of explanatory variables, $\beta$ is a $K \times 1$ vector of coefficients, and $\varepsilon_i$ is a $Q \times 1$ error vector. The semi-parametric gls estimator introduces a DP prior on the distribution of the error covariance matrix, which will be used to weight the observations. Given the usual assumption of zero means for the errors, $\theta_i = \Sigma_i$. The hierarchical prior can then be written as

$$ F \mid \alpha, F_0 \sim DP(\alpha, F_0), \qquad \Sigma_i \mid F \overset{iid}{\sim} F. \tag{8} $$

Due to the discreteness of $F$ under the DP prior, the values of some covariance matrices $\Sigma_i$ will be the same, thus putting the $\Sigma_i$ into groups denoted by $c_i$. This "grouping" characteristic can help to reveal the structure of the unobserved heterogeneity in the data. The gls estimator weights the observations according to their covariance matrices.
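The conditional draws in (5) and (6) can be sketched as follows, with the integral in (6) replaced by a Monte Carlo average over draws from the base distribution (an illustrative implementation of ours; all names are hypothetical):

```python
import numpy as np

def mvn_pdf(e, Sigma):
    """Density of a zero-mean multivariate normal at the error vector e."""
    Q = len(e)
    quad = e @ np.linalg.solve(Sigma, e)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** Q * np.linalg.det(Sigma))

def assignment_probs(e_i, group_Sigmas, group_counts, alpha, base_draws):
    """Normalized versions of (5) and (6) for one error vector e_i:
    weight n_k * N(e_i | 0, Sigma*_k) for joining existing group k, and
    alpha times a Monte Carlo estimate of the integral over F0 for
    opening a new group (base_draws are covariance draws from F0)."""
    n = sum(group_counts) + 1          # e_i counts as the n-th observation
    denom = n - 1 + alpha
    w = [nk / denom * mvn_pdf(e_i, S)
         for nk, S in zip(group_counts, group_Sigmas)]
    w.append(alpha / denom * np.mean([mvn_pdf(e_i, S) for S in base_draws]))
    w = np.array(w)
    return w / w.sum()

# A small error close to zero favours the tighter, larger covariance group.
p = assignment_probs(np.array([0.1, -0.2]),
                     group_Sigmas=[np.eye(2), 4.0 * np.eye(2)],
                     group_counts=[3, 2], alpha=1.0,
                     base_draws=[np.eye(2)])
```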
The likelihood of $\beta$ is then

$$ p(y_i \mid \beta, \Sigma^*_{c_i}) = \frac{1}{(2\pi)^{Q/2}}\, |\Sigma^*_{c_i}|^{-1/2} \exp\!\left[ -\frac{1}{2} (y_i - X_i\beta)' \Sigma^{*-1}_{c_i} (y_i - X_i\beta) \right], \tag{9} $$

where $\Sigma^*_{c_i}$ is the unique covariance matrix of group $c_i$. Given the choice of prior for $\beta$, one could generate draws from the posterior of the parameters with mcmc methods. The conjugate normal prior for $\beta$ is specified as

$$ \beta \sim N(b_0, V_0), \tag{10} $$

where $b_0$ and $V_0$ denote, respectively, the prior mean and covariance matrix of $\beta$. The posterior of $\beta$ may then be written as

$$ \beta \mid y, \Sigma^*_{c_i} \sim N(b_1, V_1), \tag{11} $$

where

$$ V_1 = \left( V_0^{-1} + \sum_{i=1}^{N} X_i' \Sigma^{*-1}_{c_i} X_i \right)^{-1}, \tag{12} $$

and

$$ b_1 = V_1 \left( V_0^{-1} b_0 + \sum_{i=1}^{N} X_i' \Sigma^{*-1}_{c_i} y_i \right). \tag{13} $$

Note that (11), (12) and (13) have the same form as the posterior of the parametric Bayesian gls estimator assuming i.i.d. normal errors. In the case of the semi-parametric Bayesian gls estimator, the errors are associated with different hyper-parameters, such that each observation $i$ is weighted by $\Sigma^*_{c_i}$. The dp-gls is generic, making no assumptions on the form of the covariance matrix other than all $\Sigma^*_{c_i}$ being positive definite and symmetric. We now proceed to explain how the semi-parametric Bayesian gls estimators work in two specific contexts: equation systems in Section 3 and panel data models in Section 4.
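As a sanity check on (11)-(13), a direct implementation of the conditional posterior of $\beta$ (our sketch, not the authors' code; names are illustrative):

```python
import numpy as np

def beta_posterior(X_list, y_list, Sigma_list, b0, V0):
    """Posterior N(b1, V1) of beta as in (11)-(13):
       V1 = (V0^-1 + sum_i X_i' Sigma_i^-1 X_i)^-1
       b1 = V1 (V0^-1 b0 + sum_i X_i' Sigma_i^-1 y_i),
    where Sigma_i is the (group-specific) error covariance of observation i."""
    A = np.linalg.inv(V0)
    c = A @ b0
    for X_i, y_i, S_i in zip(X_list, y_list, Sigma_list):
        S_inv = np.linalg.inv(S_i)
        A = A + X_i.T @ S_inv @ X_i
        c = c + X_i.T @ S_inv @ y_i
    V1 = np.linalg.inv(A)
    return V1 @ c, V1

# Illustration with noiseless data and a diffuse prior: the posterior
# mean should recover the true coefficients.
rng = np.random.default_rng(1)
beta_true = np.array([1.5, -0.5])
X_list = [rng.standard_normal((2, 2)) for _ in range(40)]   # Q = K = 2
y_list = [X @ beta_true for X in X_list]
Sigma_list = [np.eye(2)] * 40
b1, V1 = beta_posterior(X_list, y_list, Sigma_list,
                        b0=np.zeros(2), V0=1e8 * np.eye(2))
```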
3 Semi-parametric Seemingly Unrelated Regression

Below we introduce the sur equation system and demonstrate how the DP prior is incorporated. Without loss of generality we consider a system of two equations

$$ y_{1i} = \beta_{10} + x_{11,i}\beta_{11} + x_{12,i}\beta_{12} + \varepsilon_{1i}, \qquad y_{2i} = \beta_{20} + x_{21,i}\beta_{21} + x_{22,i}\beta_{22} + x_{23,i}\beta_{23} + \varepsilon_{2i}, \tag{14} $$

where $y_{mi}$ denotes observation $i$ for equation $m$ ($m = 1, 2$) and $x_{mk,i}$ ($k = 1, 2, 3$) are the explanatory variables; $\beta_{ml}$ ($l = 0, 1, 2, 3$) denote the coefficients, and $\varepsilon_{mi}$ are the errors. In this context, the dimension of each error vector, i.e. $Q$ in Section 2.2, is the number of equations in the system. The model can be written in matrix form

$$ y_1 = X_1\beta_1 + \varepsilon_1, \qquad y_2 = X_2\beta_2 + \varepsilon_2, \tag{15} $$

where $y_m = \{y_{mi}\}$ and $\varepsilon_m = \{\varepsilon_{mi}\}$ are $N \times 1$ vectors, $X_1 = [\iota, x_{11}, x_{12}]$ and $X_2 = [\iota, x_{21}, x_{22}, x_{23}]$ are $N \times 3$ and $N \times 4$ matrices of explanatory variables, $\iota$ is an $N \times 1$ vector of ones, and $\beta_1 = \{\beta_{1l}\}$ and $\beta_2 = \{\beta_{2l}\}$ are $3 \times 1$ and $4 \times 1$ vectors of coefficients.

The seemingly unrelated regression (sur, Zellner, 1962) was introduced for this task. Instead of $\varepsilon_{1i} \overset{iid}{\sim} N(0, \sigma_1^2)$ and $\varepsilon_{2i} \overset{iid}{\sim} N(0, \sigma_2^2)$, as in the ols case, the errors $\varepsilon_i$ are now identically multivariate normally distributed, i.e. $\varepsilon_i = (\varepsilon_{1i}\ \varepsilon_{2i})' \overset{iid}{\sim} N(0, \Sigma)$. The covariance matrix of $\varepsilon$ is then

$$ \Omega = \Sigma \otimes I_N = \begin{bmatrix} \sigma_{11} I_N & \sigma_{12} I_N \\ \sigma_{21} I_N & \sigma_{22} I_N \end{bmatrix}, \quad \text{s.t. } \sigma_{12} = \sigma_{21}, \tag{16} $$

where "$\otimes$" stands for the Kronecker product. One could transform the observations with this covariance matrix so that the errors follow the standard normal distribution $N(0,1)$ (note, however, that $\Omega$ has the specific sur form in (16), instead of the general positive definite symmetric form of a covariance matrix).

Although the sur model accounts for the cross-equation correlation of errors, as Wooldridge (2003) has noted, the errors are assumed to be identically distributed. Moreover, unlike the classical gls estimator, this distribution is usually assumed to be normal. In this section we propose a new dp-sur method that makes no a priori assumption on the family of the distribution of the errors. If we allowed each observation $i$ to have its own covariance matrix, the flexibility of the error distribution would lead to identification problems when we only have cross sectional data. Assigning the observations into groups represents a compromise. Given (15), the covariance matrix of the error for observation $i$ is given by

$$ \Sigma_i = \begin{bmatrix} \sigma^*_{11,c_i} & \sigma^*_{12,c_i} \\ \sigma^*_{21,c_i} & \sigma^*_{22,c_i} \end{bmatrix}, \quad \text{s.t. } \sigma^*_{12,c_i} = \sigma^*_{21,c_i}, \tag{17} $$

where $c_i$ denotes the group id of observation $i$, and the superscript $*$ denotes the group-specific hyper-parameter. For $c_i = c_j$, $i, j \in \{1, 2, \ldots, N\}$, observations $i$ and $j$ share the same group id and $\sigma^*_{pq,c_i} = \sigma^*_{pq,c_j}$, where $p, q \in \{1, 2\}$ index the equations in the system.

If the number of groups were known, the Dirichlet prior could be used to perform the mixing. A less restrictive approach is non-parametric, introducing a DP prior for the distribution of $\Sigma_i$ as in (8). A natural choice of the base distribution $F_0$ is the conjugate prior for the covariance matrix of a multivariate normal distribution, the inverse Wishart distribution, i.e.

$$ F_0 \equiv IW(\nu, W), \tag{18} $$

where $\nu$ and $W$ are the hyper-parameters of the inverse Wishart distribution. Given (18), the posterior distribution of each $\Sigma_i$ is also inverse Wishart, which is easy to draw from using the Gibbs sampler. The main difference from the parametric Bayesian sur is that the covariance matrix of each observation is now given by (17), which allows each group of observations to have its own unique values for the parameters.

A Gibbs sampler (see Geweke, 1996) is available for the sur model. The Gibbs sampler draws two sets of parameters from their posteriors: the covariance matrix of the errors $\Sigma$ and the regression parameters $\beta$, namely

$$ \Sigma \mid y, X, \beta \tag{19} $$
$$ \beta \mid y, X, \Sigma. \tag{20} $$

When introducing the hierarchical structure which includes the DP prior, a number of extra parameters are included in the mcmc algorithm. These are the covariance matrices of the errors, $\Theta = \{\Sigma_i\}$, and $\alpha$, the concentration parameter of the DP prior. The Gibbs sampler now consists of

$$ \Theta \mid y, X, \beta, \alpha \tag{21} $$
$$ \beta \mid y, X, \Theta, \alpha \tag{22} $$
$$ \alpha \mid y, X, \beta, \Theta. \tag{23} $$

The major difference between the two Gibbs samplers lies in (19) and (21). In (19) the errors have the same covariance matrix $\Sigma$.
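Within step (21), by contrast, each group's covariance matrix is updated via the conjugate inverse Wishart posterior implied by (18). A sketch of one such group update (ours, using scipy's `invwishart`; names and hyper-parameter values are illustrative):

```python
import numpy as np
from scipy.stats import invwishart

def draw_group_cov(resids, nu, W, seed=None):
    """One conjugate update inside step (21): with base prior IW(nu, W)
    as in (18) and the n_k zero-mean normal residuals currently assigned
    to a group, the group covariance has posterior
    IW(nu + n_k, W + sum_i e_i e_i')."""
    n_k = resids.shape[0]
    scale = W + resids.T @ resids
    return invwishart(df=nu + n_k, scale=scale).rvs(random_state=seed)

# Stand-in residuals for one group; a draw should sit near the sample
# covariance of the residuals (here close to the identity).
resids = np.random.default_rng(0).standard_normal((200, 2))
Sigma_draw = draw_group_cov(resids, nu=4, W=np.eye(2), seed=0)
```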
In contrast, there will be $K \le N$ unique values in $\Theta$ in equation (21) due to the discreteness of $F$ under the DP prior; observations with the same value of $\Sigma_i$ are assigned to the same group. With the last draw of $\beta$, the residuals can be obtained, which are used as the data for a draw of $\Theta$.

In making draws of the concentration parameter $\alpha$ using (23), we adopt the prior introduced by Conley et al. (2008), namely

$$ p(\alpha) \propto \left( 1 - \frac{\alpha - \alpha_{min}}{\alpha_{max} - \alpha_{min}} \right)^{\tau}, \tag{24} $$

where $\alpha_{min}$ and $\alpha_{max}$ are the pre-set lower and upper bounds of $\alpha$. A larger $\alpha$ leads to more groups being generated on average, i.e. the DP being less discrete. Using the distribution of the number of groups $K$ conditional on $\alpha$ in Antoniak (1974), we determine $\alpha_{min}$ and $\alpha_{max}$ by setting the mode of the number of groups to $K_{min}$ and $K_{max}$. In this paper we let $K_{min}$ be 1 and $K_{max}$ be 5% of the number of observations. Following the suggestion of Conley et al. (2008), we set $\tau$ to 0.8. The hyper-parameter $\alpha_{max}$ has also been adjusted, with $K_{max}$ set to 10% and 50% of the sample size; in our experiments the results are insensitive to these changes in the prior of the concentration parameter $\alpha$.

3.3 A Simulation Experiment

In this section we conduct a simulation experiment designed to compare our method to the Bayesian sur described in Section 3. As the main focus of this paper is the potential efficiency gains over gls type estimators, we evaluate the performance of the dp-sur and the normal Bayesian sur focusing upon the posterior standard deviations of the parameters estimated with the two methods. All simulation experiments are based upon the two equation system in (14). The experiments are designed to highlight the performance of the estimators along the following dimensions:

(i) heterogeneity in the errors;
(ii) the tail of the error distribution;
(iii) sample size.

For (i) we check the performance of our dp-sur approach against a model where the errors are distributed i.i.d.
multivariate normal. In the heterogeneous case, the most direct way would be to generate the errors from a mixture of multivariate normal distributions; however, simulating from such a mixture can be problematic given the influence of the number of components, the covariance matrices of the normal components (we fix the means at zero) and the weights assigned to each component. We therefore use multivariate t-distributions (Andrews and Mallows, 1974) to exploit the scale mixture of normal distributions with inverse Wishart covariance matrices.

To accommodate (ii), we vary the degrees of freedom (df) of the multivariate t-distribution. Smaller degrees of freedom lead to heavier tails, which indicates that a larger proportion of observations follow normal distributions that are "flatter", i.e. less concentrated around the mean.

To determine the robustness of our method, we include a set of simulations where the errors follow a log-normal distribution. The log-normal distribution has seen a wide range of applications in empirical studies. For example, with perhaps the exception of the top 1-3 percent of the population, income has been shown to follow a log-normal distribution (Clementi and Gallegati, 2005). In addition, extreme realizations are more likely to be generated from the multivariate log-normal distribution, as it is fat-tailed.

Using (15), the explanatory variables are drawn from normal distributions with parameters

$$ x_{11,i} \overset{iid}{\sim} N(1, \cdot), \quad x_{12,i} \overset{iid}{\sim} N(3, \cdot), \quad x_{21,i} \overset{iid}{\sim} N(-\cdot, \cdot), \quad x_{22,i} \overset{iid}{\sim} N(4, \cdot), \quad x_{23,i} \overset{iid}{\sim} N(-\cdot, \cdot). \tag{25} $$

We set $\beta_1 = (1.\cdot, -\cdot, \cdot)'$ and $\beta_2 = (1.\cdot, -\cdot, -\cdot, \cdot)'$. We generate errors from the multivariate normal distribution $N(0, \Sigma)$, where

$$ \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}. \tag{26} $$

Without loss of generality, we let the variances be identical and fix the correlation between the errors in the two equations at 0.5. We set three sample sizes for the simulation experiment: 100, 250 and 500.

When generating errors from a multivariate t-distribution, we set the location parameter to $\mu = 0$ and the shape parameter to $\Sigma$. As noted, the parameter that controls the tail behaviour of the multivariate t-distribution is the df, which we set to 2, 3 and 4. For df = 2 the tails of the corresponding multivariate t-distribution are much heavier than those of the multivariate normal with the same location and shape parameters $\mu$ and $\Sigma$. When the df is 4, the tails of the multivariate t-distribution are only slightly heavier than those of the multivariate normal. Our dp-sur should demonstrate relative efficiency in all three situations, as the multivariate t-distribution has heavier tails than the multivariate normal. Gains in efficiency will be decreasing in df, given that the tails are less heavy.

The errors in the multivariate log-normal scenario are constructed by first drawing from multivariate normal distributions, and then taking the natural exponent of these draws. Because the log-normal distribution has a positive mean, it is necessary to demean, so that the errors have zero means. Our dp-sur is expected to show efficiency gains in the log-normal case, given that it is asymmetric and heavy tailed.

Below we present the simulation results. We present the posterior standard deviations (s.d.) estimated with both our dp-sur method and the Bayesian sur assuming multivariate normal errors, along with the percentage differences between them. As the behaviour of the posteriors of all the parameters is uniformly similar, for the sake of clarity we present only the results for two representative coefficients.

Multivariate t-distributed Errors
Table 1 presents the posterior s.d.s and the percentage differences between the s.d. estimated with the dp-sur and the normal sur ($\Delta\% = (s.d._{sur} - s.d._{dp})/s.d._{sur} \times 100$), both averaged over the samples. We carry out 100 simulations for each sample size, which proved sufficient to achieve stable results even with the smallest sample size; the tables containing the posterior means can be found in the Appendices (for those tables there are 6 columns, presenting the means estimated by the two methods for the 3 correlations). We observe that the dp-sur gives smaller posterior s.d. when the df is 2, 3 and 4 (we do not use df = 1, as the t-distribution does not even have a mean in that case). The percentage differences are above 40% when the df is 2, above 20% when the df is 3, and around 15% when the df is 4, as shown in the upper three panels of the table. Efficiency gains increase with sample size as more extreme values of the errors are realised. The parametric sur assumes that all the realizations have the same $\Sigma$, where the extreme ones will inflate this $\Sigma$ shared by all realizations. In contrast, these extreme realizations will be assigned to distributions with larger $\Sigma_i$'s by the dp-sur, while the rest will be treated as realizations from normal distributions with smaller $\Sigma_i$'s. By accommodating a higher degree of heterogeneity in the error distributions, potentially more efficiency gains can be achieved by the dp-sur.

Our results are consistent with expectations. The efficiency gains of the semi-parametric dp-sur are the largest when the df is 2 (with the heaviest tails). Efficiency gains fall as the df increases, given the less heavy tails of the distribution of the errors. In fact, in the lowest panel of Table 1, where the df is infinity, we observe that the posterior s.d. estimated with the two methods are very close. The s.d. for the dp-sur are slightly larger than their sur counterparts, though the differences are small in magnitude, less than 2.5% for all coefficients. This is not surprising, since when the distribution of the errors is multivariate normal, the parametric method is more parsimonious, using the correct structure for the covariance matrix of the errors. In the multivariate normal case, among the three sample sizes, the differences between the s.d. are the largest when the sample size is 100. This is expected, as the information "wasted" by the dp-sur in this case has a larger impact on efficiency when the sample size is small.

Table 1: Posterior s.d., multivariate t errors (panels: df = 2, 3, 4 and $\infty$; sample sizes 100, 250 and 500; columns: DP, SUR, $\Delta\%$)

Multivariate Log-normal Errors
The posterior s.d.s are presented in Table 2. We observe that the dp-sur posterior s.d. are more than 55% smaller than those calculated using the Bayesian sur assuming i.i.d. normal errors. The efficiency gains increase with sample size, reaching more than 65% in the case of 500 observations. As with the case of t-distributed errors, this is due to the fact that more extreme realizations of the errors are present in larger samples, leading to more efficiency gains from grouping them.

Table 2: Posterior s.d., multivariate log-normal errors
(sample sizes 100, 250 and 500; columns: DP, SUR, $\Delta\%$)

Below we apply our dp-sur method to an economic model of the demand for factors of production with a generalized Leontief cost function (Diewert, 1971), an equation system with the number of equations equal to that of the factors. To make our empirical demonstration as general as possible, we do not impose symmetry or homogeneity restrictions.

The dataset, taken from Malikov et al. (2016), contains 2397 observations on 285 large U.S. banks between 2001 and 2010. The data include quantities and prices of the inputs, i.e. labour, physical assets and borrowed funds, and the quantity of output, which is the loans made by a bank. Given the relatively large sample size, it is possible for us to explore the performance of the dp-sur with different sample sizes. The demand for factors equation system may be written as

$$ a_L = \frac{L}{Y} = \beta_{LL} + \beta_{LA}\left(\frac{P_A}{P_L}\right)^{1/2} + \beta_{LF}\left(\frac{P_F}{P_L}\right)^{1/2} + \beta_{LT}T + \varepsilon_L \tag{27} $$
$$ a_A = \frac{A}{Y} = \beta_{AA} + \beta_{AL}\left(\frac{P_L}{P_A}\right)^{1/2} + \beta_{AF}\left(\frac{P_F}{P_A}\right)^{1/2} + \beta_{AT}T + \varepsilon_A \tag{28} $$
$$ a_F = \frac{F}{Y} = \beta_{FF} + \beta_{FL}\left(\frac{P_L}{P_F}\right)^{1/2} + \beta_{FA}\left(\frac{P_A}{P_F}\right)^{1/2} + \beta_{FT}T + \varepsilon_F, \tag{29} $$

where $L$, $A$ and $F$ denote the quantity of labour, physical assets and borrowed funds, respectively; $T$ denotes the trend variable; $Y$ denotes output, and $P_k$ is the price of factor $k$, with $k \in \{L, A, F\}$. For the errors we assume $(\varepsilon_{Li}, \varepsilon_{Ai}, \varepsilon_{Fi}) \sim N(0, \Sigma_i)$, where $\Sigma_i$ is the covariance matrix of observation $i$. We allow the errors to be correlated across the three equations in the system, i.e. $cov(\varepsilon_{ki}, \varepsilon_{si}) \ne 0$, with $k, s \in \{L, A, F\}$ indexing equations.

Given that the main objects of interest are the price elasticities, we report the posterior means and s.d. of the price elasticities of the three factors. With the generalized Leontief cost function, the cross price elasticities of the factors are given by

$$ e_{ks} = \frac{1}{2}\, \beta_{ks} (P_k/P_s)^{-1/2} / a_k, \quad \forall k \ne s. \tag{30} $$

The own price elasticities are

$$ e_{kk} = -\frac{1}{2} \sum_{s \ne k} \beta_{ks} (P_k/P_s)^{-1/2} / a_k. \tag{31} $$

Table 3 contains the posterior means of the price elasticities of the demand for factors. One can see that with both the 800-observation sub-sample and the full sample, the posterior means of all the elasticities are relatively small, indicating that the demands for factors (labour, physical assets and borrowed funds) of U.S. banks are relatively price inelastic.

Note that the own price elasticities of labour and physical assets are negative in both samples. In contrast, the own price elasticity of borrowed funds is positive, although we note that its absolute value is extremely small compared to those of labour and physical assets (the posterior s.d. are also relatively large for this elasticity, as shown in Table 4). This shows that the demand for borrowed funds is inelastic in the production of the U.S. banking industry. One potential reason is that borrowed funds are usually used in ways such as meeting the required reserve ratio set by the Fed. Such a feature makes borrowed funds rather inelastic with respect to their price.

There are some differences between the posterior means estimated with the dp-sur and the Bayesian sur assuming normal errors. Such differences are not observed in the simulation studies. However, it should be noted that in the simulations the regression equation was correctly specified, which is not guaranteed with the empirical data. Such differences in the posterior means with empirical datasets have also been observed in the literature on semi-parametric mixtures with a DP prior, including Conley et al. (2008). (Note that the sur and ols estimators are exactly the same when all the equations in the system share the same explanatory variables.)
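As a numerical illustration of (30) and (31) (with hypothetical coefficients, prices and factor-output ratios, not estimates from this application):

```python
import numpy as np

def price_elasticities(beta, prices, a):
    """Elasticities (30)-(31) for the generalized Leontief demand system:
    e_ks = 0.5 * beta_ks * (P_k / P_s)^(-1/2) / a_k for k != s, and
    e_kk = -sum_{s != k} e_ks. beta holds the cross coefficients (its
    diagonal is unused); a holds the factor-output ratios a_k."""
    m = len(prices)
    E = np.zeros((m, m))
    for k in range(m):
        for s in range(m):
            if s != k:
                E[k, s] = 0.5 * beta[k, s] * (prices[k] / prices[s]) ** -0.5 / a[k]
        E[k, k] = -E[k].sum()   # own elasticity offsets the cross elasticities
    return E

# Hypothetical coefficients, prices and ratios (not estimates from the paper).
beta = np.array([[0.0, 0.2, 0.1],
                 [0.2, 0.0, 0.3],
                 [0.1, 0.3, 0.0]])
E = price_elasticities(beta, prices=np.array([1.0, 2.0, 4.0]),
                       a=np.array([0.5, 0.3, 0.2]))
```

By construction each row of `E` sums to zero, consistent with (31); in practice the formulas would be evaluated at each posterior draw of the coefficients to obtain posterior means and s.d. of the elasticities.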
Table 4 presents the posterior s.d. of the price elasticities estimated with the two samples. We observe that the dp-sur achieves smaller posterior s.d. for all the price elasticities than the Bayesian sur assuming normality. This is not unexpected, as the elasticities are functions of the regression parameters in the equation system, which are estimated with smaller posterior s.d. by the semi-parametric dp-sur than by the parametric Bayesian sur. The greatest percentage difference ($\Delta\%$) with the 800-observation sub-sample occurs for the cross price elasticity of the demand for labour with respect to the price of physical assets, for which the dp-sur posterior s.d. is 38.27% smaller than the sur counterpart. With the full sample, the largest percentage difference is observed for the cross price elasticity of the demand for borrowed funds with respect to the price of physical assets, which reaches 39.15%.

Table 4: Elasticities, U.S. banking industry: posterior s.d.
In Figure 1 we present histograms of the posterior distribution of the cross-price elasticity of borrowed funds with respect to the price of physical assets for the dp-sur and the parametric Bayesian sur. Using the smaller 800-observation sub-sample, the posteriors for both estimators include 0. With the full sample, however, the dp-sur gives a posterior distribution whose 95% credible interval (from -0.027 to -0.002) excludes 0, shown by the two red vertical lines in the left panel. In contrast, the right panel shows that the parametric Bayesian sur gives a 95% credible interval (from -0.027 to 0.016) that still includes 0 even with the full sample. From Figure 1 we also note that the posterior distribution with the dp-sur is strongly right skewed, which can result in the parametric Bayesian sur having a larger posterior standard deviation.

Figure 1: Histograms of the elasticity of borrowed funds w.r.t. the asset price. (a) dp-sur; (b) parametric sur.
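The credible intervals above are read directly off the MCMC posterior draws as equal-tailed quantiles. A minimal sketch, with draws simulated for illustration only (none of the numbers come from the paper's data):

```python
import numpy as np

def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval from posterior draws."""
    tail = (1.0 - level) / 2.0
    return np.quantile(draws, [tail, 1.0 - tail])

# Illustrative right-skewed, entirely negative "elasticity" posterior.
rng = np.random.default_rng(0)
draws = -np.exp(rng.normal(-4.0, 0.6, size=20_000))

lo, hi = credible_interval(draws)
excludes_zero = hi < 0.0  # here the whole interval lies below zero
```

Whether 0 falls inside `[lo, hi]` is exactly the check made for the elasticity in Figure 1.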
In addition to equation systems, the random effects model (rem) for panel data is another scenario where the gls has seen numerous applications. In a panel with N cross sections and a time series dimension of T, the error of each individual is a T × 1 vector. We will relax the assumption of the parametric Bayesian gls for the rem (Koop, 2003) that the error vectors of all individuals have the same distribution. In this section we propose a semi-parametric Bayesian approach by introducing DP priors on the variances of the random effects and the errors. We follow the same approach as in the dp-sur method in terms of applying the DP prior on the hyper-parameters.

Consider the following panel data model

y_it = β_1 x_1it + · · · + β_K x_Kit + u_i + η_it = β_1 x_1it + · · · + β_K x_Kit + ε_it,   (32)

where i and t index the cross section and time series dimensions of the data, respectively, y_it is the dependent variable, the x_kit denote the explanatory variables, and the β_k, k = 1, ..., K, are the coefficients. u_i is the time-invariant unobservable of individual i, and η_it the error term. (We use the term "individual" to denote the cross section unit here; in practice it can be households, firms, countries or actual individuals.)

In Bayesian methods the difference between the fixed and random effects lies in the choice of prior for the individual effects u_i. Fixed effects Bayesian methods assume a non-hierarchical prior for u_i, while for the random effects a hierarchical prior is assumed. The prior for u_i may be written as

u_i | d² ~iid N(0, d²),   (33)

where d² is the variance of u_i. Assuming η_it ~iid N(0, σ²), the posterior distribution of u_i is given by

u_i | y_i, β, d², σ² ~ N(μ_i, s²),   (34)

where μ_i = s² σ⁻² ι'_T (y_i − X_i β) and s² = (d⁻² + σ⁻² ι'_T ι_T)⁻¹, with ι_T denoting a T × 1 vector of ones. X_i is the T × K matrix of explanatory variables [x_1it, ..., x_Kit], and y_i = [y_i1, y_i2, ..., y_iT]' is a T × 1 vector.
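The conditional posterior in (34) is cheap to compute. A sketch for one individual, under toy values we assume here (β, d², σ² and the data are all illustrative, not the paper's):

```python
import numpy as np

def u_posterior(y_i, X_i, beta, d2, sigma2):
    """Posterior N(mu_i, s2) of the random effect u_i as in eq. (34):
    s2 = (1/d2 + T/sigma2)^(-1), mu_i = (s2/sigma2) * sum(y_i - X_i @ beta)."""
    T = len(y_i)
    resid = y_i - X_i @ beta
    s2 = 1.0 / (1.0 / d2 + T / sigma2)
    mu_i = (s2 / sigma2) * resid.sum()
    return mu_i, s2

# Toy data for one individual whose true effect is u_true = 0.7.
rng = np.random.default_rng(1)
T, K = 8, 2
beta = np.array([5.0, 10.0])
X_i = rng.normal(size=(T, K))
y_i = X_i @ beta + 0.7 + rng.normal(scale=0.1, size=T)

mu_i, s2 = u_posterior(y_i, X_i, beta, d2=1.0, sigma2=0.01)
```

With a tight error variance the posterior mean lands near the true effect, and s² shrinks toward zero as T grows.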
(That is, the Q in the generic semi-parametric gls of Section 2.2 equals T in this context; we use T here following panel data conventions. The variances of u_i and η_it are often themselves treated as random with their own priors; for the moment we hold them fixed for the sake of simplicity.)

The likelihood of β marginalized over u_i in the Bayesian rem may be written as

p(y_i | β, Σ) = (2π)^{-T/2} |Σ|^{-1/2} exp[ -(1/2) (y_i − X_i β)' Σ^{-1} (y_i − X_i β) ],   (35)

where Σ is the covariance matrix of the T × 1 composite error vector ε_i = [ε_i1, ε_i2, ..., ε_iT]'. Assuming that E[η_it u_j | X] = 0 for all i, j, t (Greene, 2012), the covariance matrix of the composite error ε_i is

Cov(ε_i) = Σ = σ² I_{T×T} + s² ι_T ι'_T =
[ σ² + s²   s²        · · ·   s²
  s²        σ² + s²   · · ·   s²
  ...       ...       . . .   ...
  s²        s²        · · ·   σ² + s² ],   (36)

where σ² is the variance of η_it and s² is the variance of u_i.

Before we proceed to our dp-rem method, we review the work of Kleinman and Ibrahim (1998) and Kyung et al. (2010), who use the Dirichlet process prior for a different purpose. Consider the model

y_it = X_it β_i + ζ_it,   (37)

where β_i is the vector of parameters. In this literature the focus has been on the heterogeneity in the parameters (i.e. the β_i) across individuals. For this purpose, the DP prior is put on the parameters β_i themselves, i.e.

F ~ DP(α, N(μ_β, Σ_β)),   β_i | F ~iid F.   (38)

Given that the β_i have a discrete dp posterior, the β_i are grouped, with those in the same group having the same value.

Heterogeneity in the parameters per se is not the principal focus of this paper. Rather, this paper aims at providing more efficient inference by exploiting the information in the distribution of the unobservables. In a rem setting, the unobservables are composed of individual-specific unobserved effects u_i and idiosyncratic errors η_it. Therefore, we focus on the heterogeneity in the hyper-parameters of these unobservables instead of the model parameters. In this sense, our method is in the same spirit as the literature pioneered by Conley et al. (2008).

We relax the identically distributed assumption for η_it and u_i by introducing DP priors on the variances.
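A draw from a Dirichlet process is discrete with probability one, and that discreteness is exactly what produces the grouping of variances used in our method. A truncated stick-breaking sketch (the base distribution and all parameter values below are illustrative, not the paper's):

```python
import numpy as np

def dp_stick_breaking(alpha, base_draw, n, trunc=200, rng=None):
    """Truncated stick-breaking draw from DP(alpha, G0): returns n samples
    from one realization G. Because G is discrete, values repeat (groups)."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(1.0, alpha, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = base_draw(trunc, rng)            # atom locations drawn from G0
    return rng.choice(atoms, size=n, p=w / w.sum())

rng = np.random.default_rng(2)
# Base distribution G0: inverse-gamma draws for error variances,
# mirroring the conjugate choice used for the variances in the paper.
base = lambda m, r: 1.0 / r.gamma(3.0, 1.0 / 2.0, size=m)

sigmas = dp_stick_breaking(alpha=1.0, base_draw=base, n=100, rng=rng)
n_groups = len(np.unique(sigmas))  # far fewer than 100 distinct values
```

The 100 sampled variances collapse onto a handful of distinct atoms: errors sharing an atom form a group with a common hyper-parameter, which is the mechanism exploited below.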
This will have the effect of grouping the errors over both the cross section dimension i and the time series dimension t, with those in the same group sharing the same hyper-parameter. The DP prior for the variance of the idiosyncratic error η_it is

G ~ DP(α_η, G_0),   σ²_it | G ~ G,   (39)

where α_η and G_0 denote the concentration parameter and base distribution of the DP prior, respectively. The grouping of these variances takes place without imposing any restrictions. For example, if c_it ≠ c_is (t ≠ s), where c_it (c_is) denotes the group id of η_it (η_is), then σ*²_{c_it} and σ*²_{c_is} are allocated to different groups and η_it and η_is have different distributions. (Note that the "random effects model" in the studies cited above, mostly in statistics, means something different from that in econometrics, as theirs is in fact a random coefficients model.)

The DP prior for the variance of the individual effects u_i in our dp-rem can be written using the following hierarchical structure:

F ~ DP(α_u, F_0),   d²_i | F ~ F,   (40)

where d²_i is the prior variance of the random effect u_i, α_u is the concentration parameter, and F_0 the base distribution of the DP prior. The use of an independent DP prior on the hyper-parameters of the individual effects u_i generates groupings over the N individuals such that the u_i that belong to the same group are generated from a distribution with the same hyper-parameter. This relaxes the rem assumption that the individual effects are identically distributed.

Although the u_i are no longer identically distributed, for each particular u_i a conjugate normal prior can be introduced. The posterior of each u_i is then a normal distribution, the means and variances of which differ across the cross section i, i.e.

u_i | y, β, d*²_{c_i}, σ*²_{c_it} ~ N(μ_i, s²_i),   (41)

where

μ_i = s²_i ι'_T Σ^{-1}_{η_i} (y_i − X_i β)   (42)

is the posterior mean of u_i. The posterior variance is given by

s²_i = ( (d*²_{c_i})^{-1} + ι'_T Σ^{-1}_{η_i} ι_T )^{-1}.   (43)

From (43) we observe that the posterior variance of the random effect u_i is the inverse of the sum of (d*²_{c_i})^{-1}, the inverse of the unique value of d²_i (the hyper-parameter of u_i), and ι'_T Σ^{-1}_{η_i} ι_T, the sum of all the elements of Σ^{-1}_{η_i}. As we allow Σ_{η_i} ≠ Σ_{η_j} for all i ≠ j, s²_i is in turn allowed to differ for each individual effect u_i. The covariance matrix of each composite error vector ε_i is also allowed to be different for every i, and is given by

Cov(ε_i) = Σ_i = Σ_{η_i} + s²_i ι_T ι'_T.   (44)

For the choice of base distributions, we use the inverse gamma distribution, the conjugate prior for the variance of a normal distribution, i.e.

F_0 ≡ IG(a_u, b_u),   G_0 ≡ IG(a_η, b_η),   (45)

where a_u and a_η are the shape hyper-parameters, and b_u and b_η denote, respectively, the rate hyper-parameters of F_0 and G_0.

The likelihood of β marginalized over u_i is given by

p(y_i | β, Σ_i) = (2π)^{-T/2} |Σ_i|^{-1/2} exp[ -(1/2) (y_i − X_i β)' Σ^{-1}_i (y_i − X_i β) ].   (46)

Compared with the marginal likelihood of the parametric Bayesian rem in (35), the covariance matrix of the composite error vector ε_i is allowed to be different for each individual i in the panel.

Assuming a normal prior for β, i.e. β ~ N(b_0, V_0), where b_0 and V_0 denote, respectively, the prior mean and covariance matrix of β (here we adopt a prior whose mean is 0 and variance is 1000), the posterior of β marginalized over u_i is

β | y, d*²_{c_i}, σ*²_{c_it} ~ N(b̄, V̄),   (47)

where

V̄ = ( V_0^{-1} + Σ_{i=1}^N X'_i Σ^{-1}_i X_i )^{-1}   (48)

denotes the posterior covariance matrix, and b̄ is the posterior mean vector, which we write as

b̄ = V̄ ( V_0^{-1} b_0 + Σ_{i=1}^N X'_i Σ^{-1}_i y_i ).   (49)

For Θ_η = {σ²_it}, U = {u_i}, and Θ_u = {d²_i}, a Gibbs sampler for this dp-rem cycles through the conditionals:

Θ_η | y, X, U, β, Θ_u, α_u, α_η
Θ_u | y, X, U, β, Θ_η, α_u, α_η
U | y, X, β, Θ_u, Θ_η, α_u, α_η
β | y, X, U, Θ_u, Θ_η, α_u, α_η
α_u | y, X, U, β, Θ_u, Θ_η, α_η
α_η | y, X, U, β, Θ_u, Θ_η, α_u.   (50)

The Gibbs samplers for the regression parameters, hyper-parameters and concentration parameters of the two DPs are similar to those for the dp-sur in Section 3.2. In the dp-rem, the random effects have a mixture of normal distributions. The posterior mean and variance of each particular u_i are given in (42) and (43), respectively; for each i, a u_i is drawn from N(μ_i, s²_i) within the Gibbs sampler.

The Correlated Random Effects Model (crem) represents a natural extension of the rem. Introduced by Mundlak (1978) and further discussed by Chamberlain (1980), the crem offers a middle ground between the fixed and random effects. Without loss of generality, we consider the following model for the panel data:

y_it = β_1 x_1it + β_2 x_2it + v_i + η_it,   (51)

where v_i is the random effect.
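Conditional on the covariance matrices, (47)-(49) are standard conjugate GLS updates. A self-contained numerical sketch with assumed toy values (prior variance 1000 as in the paper; the panel dimensions, coefficients and per-individual covariances are illustrative):

```python
import numpy as np

def beta_posterior(y_list, X_list, Sigma_list, b0, V0):
    """GLS posterior of beta as in eqs. (47)-(49):
    V = (V0^-1 + sum_i X_i' Sigma_i^-1 X_i)^-1,
    b = V (V0^-1 b0 + sum_i X_i' Sigma_i^-1 y_i)."""
    V0_inv = np.linalg.inv(V0)
    A = V0_inv.copy()
    c = V0_inv @ b0
    for y_i, X_i, S_i in zip(y_list, X_list, Sigma_list):
        S_inv = np.linalg.inv(S_i)
        A += X_i.T @ S_inv @ X_i
        c += X_i.T @ S_inv @ y_i
    V = np.linalg.inv(A)
    return V @ c, V

# Toy panel: each unit gets its own composite-error covariance (eq. 44).
rng = np.random.default_rng(3)
T, K, N = 6, 2, 60
beta_true = np.array([5.0, 10.0])
iota = np.ones((T, 1))
ys, Xs, Ss = [], [], []
for i in range(N):
    s2_i, sig2_i = rng.uniform(0.5, 2.0), rng.uniform(0.1, 0.5)
    Sigma_i = sig2_i * np.eye(T) + s2_i * (iota @ iota.T)
    X_i = rng.normal(size=(T, K))
    y_i = X_i @ beta_true + rng.multivariate_normal(np.zeros(T), Sigma_i)
    ys.append(y_i); Xs.append(X_i); Ss.append(Sigma_i)

b, V = beta_posterior(ys, Xs, Ss, b0=np.zeros(K), V0=1000.0 * np.eye(K))
```

Allowing a different Σ_i per individual only changes the terms accumulated inside the sums; the conjugate form of the update is unaffected.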
While maintaining the gls structure of the rem, the crem allows the individual effects to be correlated with X_i, representing the correlation as a linear function of the means of X_i, i.e.

v_i = β_3 x̄_1i + β_4 x̄_2i + u_i.   (52)

The crem model is then

y_it = β_1 x_1it + β_2 x_2it + β_3 x̄_1i + β_4 x̄_2i + u_i + η_it.   (53)

The DP prior can be introduced on the hyper-parameters of u_i and η_it as in the rem case.

DP-REM/CREM Simulation Results

We carry out a series of simulation experiments to demonstrate the performance of our dp-rem and dp-crem methods relative to the standard Bayesian rem and crem. The simulation experiments are designed for the same purpose as those for the dp-sur in Section 3.3. For the rem model we assume

y_it = β_1 x_1it + β_2 x_2it + u_i + η_it = β_1 x_1it + β_2 x_2it + ε_it,   (54)

where the explanatory variables are generated from normal distributions, x_1it ~iid N(1, ·) and x_2it ~iid N(3, ·). We set the coefficients in (54) to β_1 = 5, β_2 = 10. The coefficients in the crem model (53) are set to

β_1 = 5, β_2 = 10, β_3 = −2, β_4 = 2.   (55)

Below we present the simulation results. We first report the results where the errors, u_i and η_it, are assumed to follow t-distributions and then those with log-normal distributions.

t-Distributed Random Effects and Errors

Table 5 reports the averages of the posterior s.d.s of the rem coefficients estimated with both methods, and the average percentage differences between the dp-rem and rem posterior s.d.s. The largest differences between the two estimators with respect to the posterior s.d. are observed when df = 2, where the t-distributions of the random effects and the errors have the heaviest tails. As expected, these differences decrease as the df increases and the tails of the t-distributions become less heavy. In the bottom panel, where the errors have normal distributions (equivalent to the df being infinity), the dp-rem and normal rem posterior s.d. are almost equivalent, as the t-distribution coincides with the normal distribution in this case.

We also note that the percentage differences increase slightly when the sample size becomes larger for all three finite df. This is expected, given that there are more extreme realizations in larger samples; our dp-rem method detects such heterogeneity and assigns the extreme realizations into the same group. In contrast, the Bayesian rem method assuming normality flattens the normal posterior distribution to accommodate the extreme values, leading to larger posterior s.d.
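The t-distributed design in (54) can be generated in a few lines. A sketch of the DGP (the regressor spreads and the scaling of the unobservables are assumptions of ours; the paper's exact values are not reproduced here):

```python
import numpy as np

def simulate_rem_t(N, T, beta, df, rng):
    """Panel y_it = beta1*x1 + beta2*x2 + u_i + eta_it, with u_i and
    eta_it drawn from t-distributions (heavy tails when df is small).
    Regressor means (1 and 3) follow eq. (54); unit spreads are assumed."""
    x1 = rng.normal(1.0, 1.0, size=(N, T))
    x2 = rng.normal(3.0, 1.0, size=(N, T))
    u = rng.standard_t(df, size=(N, 1))      # time-invariant random effects
    eta = rng.standard_t(df, size=(N, T))    # idiosyncratic errors
    y = beta[0] * x1 + beta[1] * x2 + u + eta
    return y, np.stack([x1, x2], axis=-1)

rng = np.random.default_rng(4)
y, X = simulate_rem_t(N=100, T=5, beta=(5.0, 10.0), df=2, rng=rng)
```

With df = 2 the t-distribution has no finite variance, which is what makes this the hardest case for an estimator assuming a single normal error distribution.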
Table 6 reports the averages of the posterior s.d. of the crem coefficients, and the average percentage differences between the two estimators. β_1 and β_2 belong to the two original explanatory variables, whereas β_3 and β_4 capture the effect of the respective sample means for each individual in the panel. The findings are similar to the rem case in that the percentage differences between the posterior s.d. estimated with our dp-crem and the parametric Bayesian crem are largest with df equal to 2, and decrease as the df increases. Differences between the two methods regarding the posterior s.d. are almost zero when the df is infinity, when the t-distribution becomes the normal distribution. The percentage differences also increase slightly in the three finite df cases when the sample size becomes large, due to more extreme values in the unobservables.

Table 5: Posterior s.d., rem with t-distributed unobservables

Log-normal Distributed Random Effects and Errors
Table 7 contains the averages of the posterior s.d. estimated with the dp-rem and the normal rem, and the percentage differences between them. Our dp-rem posterior s.d. are smaller than those estimated by the Bayesian rem assuming normality in all cases. Because the log-normal distribution is heavy tailed, the percentage differences exceed 70% in all cases, and they increase slightly as the sample size grows. The posterior s.d. of the dp-crem and crem averaged over the simulated samples are reported in Table 8, along with the average percentage differences between the two s.d.
As before, the posterior s.d. estimated with our dp-crem are more than 70% smaller than those estimated with the normal Bayesian crem for all coefficients. The percentage differences also increase when the sample size increases.
In this section we present the results based upon two empirical examples. In the first we estimatethe cost function of U.S. banks, and in the second we estimate the wages of U.S. workers.
Bank Cost Function
We first apply our dp-rem and dp-crem methods to the dataset in Feng and Serletis (2009) on the costs of 218 U.S. banks whose assets are between 1 and 3 billion dollars (2000 value), covering a period of 8 years from 1998 to 2005. There are three inputs, labour, borrowed funds and physical capital, and three outputs, consumer loans, non-consumer loans and securities. The functional form is the simple translog cost function (Christensen and Greene, 1976).

Table 6: Posterior s.d., crem with t-distributed unobservables

Table 7: Posterior s.d., rem with log-normal distributed unobservables

Table 8: Posterior s.d., crem with log-normal distributed unobservables

For bank i with n inputs and m outputs we write

ln C_it = Σ_{j=1}^m α_j ln q_j,it + (1/2) Σ_{j=1}^m Σ_{k=1}^m δ_jk ln q_j,it · ln q_k,it + Σ_{r=1}^n β_r ln p_r,it + (1/2) Σ_{r=1}^n Σ_{s=1}^n φ_rs ln p_r,it · ln p_s,it + Σ_{r=1}^n Σ_{j=1}^m γ_rj ln p_r,it · ln q_j,it + u_i + η_it,   (56)

where C is cost, q_j is the quantity of output j, and p_r is the price of input r. We impose linear homogeneity in input prices on the cost function, which in the translog case can be expressed as

Σ_{r=1}^n β_r = 1,   Σ_{s=1}^n φ_sr = 0, r = 1, 2, ..., n,   Σ_{r=1}^n γ_rj = 0, j = 1, 2, ..., m.   (57)

Table 9 contains the posterior means and s.d. of the free coefficients in the rem. To differentiate the inputs from the outputs, we index the three outputs with numbers and the inputs with letters, with l and f denoting, respectively, labour and borrowed funds.
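The restrictions in (57) guarantee that the fitted cost is homogeneous of degree one in input prices: scaling every price by e^λ must shift ln C by exactly λ. A numerical sketch with random coefficients forced to satisfy (57) (all values illustrative, not estimates):

```python
import numpy as np

def translog_lncost(lnp, lnq, alpha, delta, beta, phi, gamma):
    """ln C for the translog in eq. (56); delta and phi taken symmetric."""
    return (alpha @ lnq + 0.5 * lnq @ delta @ lnq
            + beta @ lnp + 0.5 * lnp @ phi @ lnp
            + lnp @ gamma @ lnq)

rng = np.random.default_rng(5)
n, m = 3, 3                                   # three inputs, three outputs
alpha = rng.normal(size=m)
delta = rng.normal(size=(m, m)); delta = (delta + delta.T) / 2
beta = rng.normal(size=n); beta = beta / beta.sum()        # sum_r beta_r = 1
phi = rng.normal(size=(n, n)); phi = (phi + phi.T) / 2
phi = phi - phi.mean(axis=0) - phi.mean(axis=1)[:, None] + phi.mean()
#   ^ double-centering: zero row and column sums, as in eq. (57)
gamma = rng.normal(size=(n, m)); gamma = gamma - gamma.mean(axis=0)
#   ^ column sums over inputs are zero: sum_r gamma_rj = 0

lnp, lnq = rng.normal(size=n), rng.normal(size=m)
lam = 0.37                                    # scale all prices by e^lam
lhs = translog_lncost(lnp + lam, lnq, alpha, delta, beta, phi, gamma)
rhs = translog_lncost(lnp, lnq, alpha, delta, beta, phi, gamma) + lam
```

With the restrictions imposed, `lhs` equals `rhs` up to floating-point error; without them the two sides diverge, which is why (57) is imposed on the estimated system.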
The posterior mode of the number of groups in the random effects is 2, and that in the errors is 3, which shows the existence of heterogeneity, although it is not strong.

The posterior means of all the coefficients on the first-order terms (the β's and α's) are of the same magnitude under the two methods, with the exception of the α for customer loans, which is insignificant under the rem. Although a number of the coefficients on the cross-product terms have different signs across the two methods, these coefficients are not significant. Consistent with the detection of heterogeneity in the random effects and the errors, the posterior s.d. of our dp-rem method are smaller than those estimated by the parametric Bayesian rem for all coefficients. Most of the percentage differences presented in the last column exceed 10%, with the largest being 24.51% for one of the δ's.

We also estimate the model with the dp-crem. The posterior modes of the number of groups in the random effects and the errors are 2 and 3, respectively, again indicating the existence of heterogeneity. The coefficients of the explanatory variables have smaller dp-crem posterior s.d. than their crem counterparts, with magnitudes similar to those in Table 9. However, the regression parameters on the sample means of the explanatory variables are all highly insignificant, as their posterior s.d. are very large compared with their posterior means. This indicates that the data do not support the crem specification.

U.S. Individual Wage
In this section we present the results of a wage model for U.S. workers using the data in Cornwell and Rupert (1988). The data cover 595 individuals over a period of 7 years, from 1976 to 1982. This sample size allows us to demonstrate our method with sub-samples of 100 and 250 individuals. (The number of groups, i.e. the number of normal distributions being mixed, is itself a random variable in Bayesian methods; its posterior mode thus provides an indication of the strength of heterogeneity in the error distributions. Full results are not presented here, but are available on request.)
Table 9: Posterior means and s.d. of the free coefficients, rem

Coefficient   DP-REM mean   REM mean   DP-REM s.d.   REM s.d.   ∆%
β_l           -0.6583       -0.6099    0.1480        0.1795     17.56%
α             -0.1908       -0.0240    0.0799        0.0944     15.35%
φ_lf          -0.1664       -0.0891    0.0079        0.0097     18.57%
δ             -0.1441       -0.1318    0.0117        0.0152     22.98%
γ_l           -0.0030        0.0370    0.0131        0.0151     13.60%
γ_f           -0.0157       -0.0220    0.0035        0.0043     19.51%

The model is given by

ln Wage_it = β_1 E_it + β_2 M_it + β_3 F_i + β_4 Ed_it + v_i + ε_it,

where the dependent variable is the logged wage, and the explanatory variables are experience in years (E), dummies for marital status (M) and for the individual being female (F), as well as the years of education (Ed). As there is strong reason to suspect that the unobserved individual effect v_i is endogenous, due to omitted variables such as personal ability and motivation, we apply our dp-crem model and write v_i as

v_i = β̃_1 Ē_i + β̃_2 M̄_i + u_i.   (58)

The means of experience and marital status for individual i are included because these are the two time-varying variables in the original model.

Table 10: U.S. Individual Wage, crem

                    Mean                          S.D.
Sample size    100            250            100                   250
Parameters     DP     REM     DP     REM     DP      REM     ∆%      DP      REM     ∆%
β̃_1           -0.091 -0.058  -0.085 -0.058  0.0046  0.0077  40.21%  0.0028  0.0056  50.17%

Table 10 contains the posterior means, s.d. and the percentage differences between the s.d. estimated with our dp-crem and the Bayesian crem assuming normality, for both the 100- and 250-individual sub-samples. The two coefficients on the means of the time-varying explanatory variables, β̃_1 and β̃_2, are both significant in both sub-samples, indicating that v_i is indeed correlated with the explanatory variables and confirming our suspicion. As for the coefficients on the explanatory variables themselves (β_1 to β_4), the posterior means all have the same signs under the dp-crem and the parametric crem, and the differences between them become smaller with the larger sub-sample of 250 individuals. As expected, experience and education are both positively correlated with the wage of the workers. The coefficient on the gender dummy (β_3) is also positive. This may seem an indication of gender discrimination in wages against male workers within the two sub-samples.
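Operationally, the crem only requires augmenting the regressor matrix with individual means before running the rem machinery. A sketch of this construction (the function name and toy data are ours):

```python
import numpy as np

def mundlak_design(X, ids):
    """Augment regressors with individual time-averages (the crem device,
    eq. 58): returns [X, X_bar], where X_bar repeats each individual's mean."""
    X, ids = np.asarray(X, float), np.asarray(ids)
    X_bar = np.empty_like(X)
    for g in np.unique(ids):
        mask = ids == g
        X_bar[mask] = X[mask].mean(axis=0)   # time average for individual g
    return np.hstack([X, X_bar])

# Toy panel: 3 individuals, 2 periods, 2 regressors.
X = np.array([[1.0, 0.0], [3.0, 1.0],
              [2.0, 1.0], [2.0, 1.0],
              [0.0, 0.0], [4.0, 0.0]])
ids = np.array([1, 1, 2, 2, 3, 3])
Z = mundlak_design(X, ids)
```

Only time-varying columns are worth averaging: a time-invariant column would duplicate itself exactly and make the design singular, which is why only the means of experience and marital status enter (58).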
Heterogeneity is detected in both sub-samples: the posterior modes of the numbers of groups in the random effects and the errors are 2 and 3 with the 100-individual sub-sample, and 3 and 4 with the 250-individual one. Our semi-parametric dp-crem provides smaller posterior s.d. for all coefficients in both sub-samples.

In Figure 2 we present histograms of the posterior distributions of the education parameter (β_4). As before, the two red vertical lines in each panel mark the 95% credible interval. Comparing panels (a) and (b), we observe that for the 100-observation sub-sample both the dp-crem and the parametric Bayesian crem give 95% credible intervals that exclude 0, with the interval for the dp-crem (from 0.052 to 0.119) shorter than that for the parametric crem (from 0.292 to 0.377). This is not surprising given that the posterior distribution is right skewed, as shown in panel (a). Similar conclusions follow from comparing panels (c) and (d), based upon the 250-observation sub-sample: the 95% credible interval with the dp-crem runs from 0.058 to 0.092, while that with the parametric crem runs from 0.263 to 0.313. Note also that the 95% credible intervals based upon the 250-observation sub-sample are shorter than their counterparts using the 100-observation sub-sample, due to the increase in the sample size.

Conclusion

In this paper we address the potential violation of the assumption, made by parametric Bayesian gls estimators, that the unobservables are homogeneous regarding their distributions. Such an assumption is likely to be problematic in reality, particularly when micro data are used, as the features of individuals or households are likely to lead to the distributions of observations being different. We present a semi-parametric Bayesian gls in which the error distribution is a non-parametric mixture of normal distributions, obtained by introducing a Dirichlet process prior on the hyper-parameters of the errors.
The number of normal components is decided jointly by the data and the prior in such a mixture of normals, which is able to cover a large variety of distributions. The errors are grouped by the DP prior, with those in the same group having the same hyper-parameters and thus the same distribution. Two specific cases of the semi-parametric Bayesian gls are then introduced: the sur for equation systems, and the rem/crem for panel data.

(A caveat on the wage results above: before concluding that there is discrimination against male workers, one should keep in mind that in the time period of the data, 1976 to 1982, fewer women were working than at present. In the 100-observation sub-sample, 16% of the workers are women; in the 250-observation sub-sample the percentage is 8.8%. Given such low percentages of female workers, those women who did decide to enter the labour market may themselves be relatively skilled workers who could reasonably expect a high wage. Sample selection bias (Heckman, 1976) may therefore still be present, though addressing it is not within the scope of this paper.)

Figure 2: Histograms of the parameter for education (β_4). (a) dp-crem, 100 observations; (b) parametric crem, 100 observations; (c) dp-crem, 250 observations; (d) parametric crem, 250 observations.

Our dp-sur and dp-rem/dp-crem methods are demonstrated with a series of simulation experiments consisting of three scenarios, in which the unobservables follow normal distributions, t-distributions (one type of scale mixture of normals), and log-normal distributions, respectively. The results show that in the homogeneous normal case, our dp-sur and dp-rem/dp-crem methods give posterior means and s.d. similar to their parametric counterparts assuming normality. When the errors follow t-distributions, the degrees of freedom control how heavy the tails are, which reflects the strength of heterogeneity in the unobservables. Our simulation results show that the posterior s.d. of our dp-sur and dp-rem/dp-crem are smaller than those of the parametric Bayesian methods. Such efficiency gains are largest when the df is 2, which represents the strongest heterogeneity, and become smaller as the df increases, since the tails are then less heavy, i.e. the heterogeneity is less strong. The simulations with log-normal unobservables demonstrate the robustness of our method with asymmetric, fat-tailed distributions: the posterior s.d. of our dp-sur and dp-rem/dp-crem methods are more than 50% smaller than those of the parametric Bayesian estimators assuming normality. Moreover, the efficiency gains increase slightly with larger sample sizes when the distribution of the unobservables is non-normal, as a result of more extreme realizations in large samples.

We apply our dp-sur method to the demands for production factors with the generalized Leontief cost function, using a dataset on the U.S. banking industry. We estimate the model with an 800-observation sub-sample and with the full sample. Heterogeneity is detected in both. The dp-sur posterior s.d. are smaller than the normal Bayesian sur ones for all the demand elasticities, which shows that a semi-parametric method such as our dp-sur is preferable.

Our dp-rem/dp-crem are applied to two datasets as well. The first is the U.S. bank cost function data, which the rem seems to fit better. Heterogeneity is detected in the U.S. bank data, and our dp-rem achieves smaller posterior s.d. than the parametric Bayesian rem. The second application is a U.S. individual wage model, where there is strong reason to suspect that the unobserved individual effects are correlated with explanatory variables such as education, due to unobserved individual features such as ability. The crem model is then estimated. Our dp-crem detects heterogeneity in this dataset as well, and obtains smaller posterior s.d. than the parametric Bayesian crem.

A Posterior Means for DP-SUR Simulations
A.1 Multivariate t-distributed Errors
Table 11 gives the posterior means averaged over the samples, estimated with the dp-sur and with the sur assuming normality. The posterior means estimated with our semi-parametric dp-sur and with the Bayesian sur assuming normality are similar to each other, and both are close to the true values of the coefficients in all cases.

Table 11: Posterior means, multivariate t errors

Sample size   Truth   100               250               500
Parameters            DP       SUR      DP       SUR      DP       SUR
df = 2
β             -0.5    -0.500   -0.484   -0.491   -0.488   -0.500   -0.519
β             -1.2    -1.195   -1.194   -1.201   -1.194   -1.206   -1.230
β             -0.7    -0.702   -0.719   -0.697   -0.700   -0.696   -0.684
df = 3
β             -0.5    -0.497   -0.493   -0.502   -0.499   -0.498   -0.499
β             -1.2    -1.194   -1.183   -1.198   -1.199   -1.202   -1.199
β             -0.7    -0.699   -0.696   -0.698   -0.699   -0.703   -0.702
df = 4
β             -0.5    -0.500   -0.499   -0.516   -0.519   -0.495   -0.495
β             -1.2    -1.205   -1.208   -1.192   -1.192   -1.202   -1.202
β             -0.7    -0.708   -0.713   -0.700   -0.701   -0.704   -0.701
df = ∞
β             -0.5    -0.4969  -0.4974  -0.4959  -0.4957  -0.4904  -0.4905
β             -1.2    -1.2115  -1.2118  -1.1959  -1.1965  -1.1976  -1.1977
β             -0.7    -0.6984  -0.6985  -0.7004  -0.7003  -0.6911  -0.6912

A.2 Multivariate Log-normal Errors

Table 12 contains the posterior means estimated with the two methods with multivariate log-normal errors. In all three samples the posterior means of all the slope parameters from the two methods are similar, and close to the truth. The intercepts estimated with our dp-sur, however, are farther away from the true values.
The skewness of the log-normal distribution influences the posterior means of the intercepts when the DP mixture model mixes normal distributions to approximate the log-normal distribution.

Table 12: Posterior means, multivariate log-normal errors

Sample size   Truth   100               250               500
Parameters            DP       SUR      DP       SUR      DP       SUR
β             -0.5    -0.501   -0.492   -0.499   -0.488   -0.496   -0.504
β             -1.2    -1.201   -1.180   -1.202   -1.205   -1.199   -1.198
β             -0.7    -0.701   -0.696   -0.703   -0.719   -0.705   -0.720

B Posterior Standard Deviations for DP-SUR Simulations
B.1 Multivariate t-distributed Errors
Table 13 presents the full results regarding the posterior standard deviations of the dp-sur simulations with multivariate t-distributed errors.
B.2 Multivariate Log-normal Errors
Table 14 shows the full results regarding the posterior standard deviations of the dp-sur simu-lations with multivariate log-normal errors.
C Posterior Means for DP-REM/CREM
C.1 t-distributed Errors
Table 15 contains the posterior means, averaged over the samples, of the coefficients in the rem with t-distributed random effects and errors. The averaged posterior means estimated with our dp-rem and with the parametric Bayesian rem are almost identical, and both are close to the true values of the coefficients in all four cases, i.e. df equal to 2, 3, 4, and infinity, where the t-distribution becomes the normal distribution.

The averages of the posterior means of the crem coefficients are presented in Table 16. Similar to the rem case, the averaged posterior means estimated with our dp-crem and with the parametric Bayesian crem are similar to each other, and close to the pre-set true values of the coefficients in all cases.
C.2 Log-normal Errors
Table 17 gives the averages of the posterior means of the rem with log-normal distributed random effects and errors, estimated with the dp-rem and the Bayesian rem assuming normality, with the dp-rem posterior means being slightly closer to the truths.

Table 13: Posterior s.d., multivariate t errors, full results

Table 14: Posterior s.d., multivariate log-normal errors, full results

Table 15: Posterior means, rem with t-distributed unobservables

Sample size   Truth   100               300               500
Parameters            DP       REM      DP       REM      DP       REM
df = 2
β_2           10      9.999    10.002   10.000   9.997    9.996    9.990
df = 3
β_2           10      10.001   10.000   9.998    9.997    9.999    9.999
df = 4
β_2           10      10.001   10.001   10.001   10.003   9.998    9.998
df = ∞
β_2           10      9.998    9.999    9.999    9.999    10.000   10.000

Table 16: Posterior means, crem with t-distributed unobservables

Sample size   Truth   100               300               500
Parameters            DP       REM      DP       REM      DP       REM
df = 2
β_2           10      10.002   9.996    10.001   10.005   10.000   9.999
β_3           -2      -1.973   -2.020   -1.988   -1.961   -2.005   -2.003
df = 3
β_2           10      10.000   9.997    10.000   10.003   9.997    9.995
β_3           -2      -1.987   -1.970   -1.983   -1.968   -1.986   -1.996
df = 4
β_2           10      10.003   10.003   10.002   10.001   9.998    9.993
β_3           -2      -1.999   -1.996   -1.999   -1.996   -2.005   -2.004
df = ∞
β_2           10      10.001   10.001   9.998    9.998    10.001   10.001
β_3           -2      -1.999   -2.000   -2.004   -2.005   -1.998   -1.998

Table 17: Posterior means, rem with log-normal distributed unobservables

Sample size   Truth   100               300               500
Parameters            DP       REM      DP       REM      DP       REM
β_2           10      10.357   11.163   10.369   11.232   10.359   11.222
Table 18 presents the posterior means averaged over the simulation samples of the dp-crem and normal Bayesian crem with log-normal distributed random effects and errors. The posteriormeans of the coefficients for the explanatory variables are very close to the truth with both our dp-crem and Bayesian crem assuming normality. The posterior means of the coefficients forthe means of the explanatory variables are slightly farther away from the truth. The skewnessof the log-normal distribution influences the posterior means of the intercepts, which are timeinvariant for each individual i in the random effects. In the crem case, the sample means ofeach individual’s explanatory variables, ¯ x i and ¯ x i , are also time invariant like the intercept. Asa result, the posterior means of their coefficients, β and β , are more different from the truthcompared with β and β , as the log-normal distribution is skewed.Table 18: Posterior means, crem with log-normal distributed unobservables Log-normalSample size Truth 100 300 500Parameters DP REM DP REM DP REM β β
β (truth 10): DP 9.999, 10.007, 10.000; REM 10.020, 10.015, 9.995
β (truth -2): DP -1.908, -1.851, -1.832; REM -1.269, -1.657, -1.432
(remaining rows of Table 18 not recoverable)

References

[1] Aldous, D. J. (1985). Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII—1983. Springer, Berlin, 1-198.
[2] Allenby, G. M., Arora, N., & Ginter, J. L. (1998). On the heterogeneity of demand. Journal of Marketing Research, 384-389.
[3] Andrews, D. F., & Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B (Methodological), 99-102.
[4] Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 1152-1174.
[5] de Carvalho, V. I., Jara, A., Hanson, T. E., & de Carvalho, M. (2013). Bayesian nonparametric ROC regression modeling. Bayesian Analysis, 8(3), 623-646.
[6] Chao, J. C., & Phillips, P. C. (1998). Posterior distributions in limited information analysis of the simultaneous equations model using the Jeffreys prior. Journal of Econometrics, 87(1), 49-86.
[7] Chigira, H., & Shiba, T. (2015). Dirichlet prior for estimating unknown regression error heteroskedasticity. TERG Discussion Papers, 341, 1-17.
[8] Clementi, F., & Gallegati, M. (2005). Pareto's law of income distribution: Evidence for Germany, the United Kingdom, and the United States. In Econophysics of Wealth Distributions (pp. 3-14). Springer, Milano.
[9] Conley, T. G., Hansen, C. B., McCulloch, R. E., & Rossi, P. E. (2008). A semi-parametric Bayesian approach to the instrumental variable problem. Journal of Econometrics, 144(1), 276-305.
[10] Cornwell, C., & Rupert, P. (1988). Efficient estimation with panel data: An empirical comparison of instrumental variables estimators. Journal of Applied Econometrics, 3(2), 149-155.
[11] Christensen, L. R., & Greene, W. H. (1976). Economies of scale in US electric power generation. Journal of Political Economy, 84(4, Part 1), 655-676.
[12] Diewert, W. E. (1971). An application of the Shephard duality theorem: A generalized Leontief production function. Journal of Political Economy, 79(3), 481-507.
[13] Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), 577-588.
[14] Escobar, M. D., & West, M. (1998). Computing nonparametric hierarchical models. In: Dey, D., Müller, P., & Sinha, D. (Eds.), Practical Nonparametric and Semiparametric Bayesian Statistics. Springer, New York, 1-22.
[15] Feng, G., & Serletis, A. (2009). Efficiency and productivity of the US banking industry, 1998-2005. Journal of Applied Econometrics, 24(1), 105-138.
[16] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2), 209-230.
[17] Gershman, S. J., & Blei, D. M. (2012). A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56(1), 1-12.
[18] Geweke, J. (1993). Bayesian treatment of the independent Student-t linear model. Journal of Applied Econometrics, 8(S1).
[19] Geweke, J. (1996). Bayesian reduced rank regression in econometrics. Journal of Econometrics, 75(1), 121-146.
[20] Greene, W. H. (2012). Econometric Analysis, Seventh Edition. Prentice Hall, Upper Saddle River, New Jersey.
[21] Heckman, J. J. (1976). The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement, 5(4), 475-492. NBER.
[22] Hejblum, B. P., Alkhassim, C., Gottardo, R., Caron, F., & Thiébaut, R. (2019). Sequential Dirichlet process mixtures of multivariate skew t-distributions for model-based clustering of flow cytometry data. The Annals of Applied Statistics, 13(1), 638-660.
[23] Kleinman, K. P., & Ibrahim, J. G. (1998). A semiparametric Bayesian approach to the random effects model. Biometrics, 921-938.
[24] Kleibergen, F., & van Dijk, H. K. (1998). Bayesian simultaneous equations analysis using reduced rank structures. Econometric Theory, 14(6), 701-743.
[25] Kloek, T., & van Dijk, H. K. (1978). Bayesian estimates of equation system parameters: An application of integration by Monte Carlo. Econometrica, 46, 1-19.
[26] Koop, G. (2003). Bayesian Econometrics. John Wiley & Sons.
[27] Kumbhakar, S. C., & Tsionas, E. G. (2011). Stochastic error specification in primal and dual production systems. Journal of Applied Econometrics, 26(2), 270-297.
[28] Kyung, M., Gill, J., & Casella, G. (2010). Estimation in Dirichlet random effects models. The Annals of Statistics, 38(2), 979-1009.
[29] Li, M., & Tobias, J. L. (2011). Bayesian inference in a correlated random coefficients model: Modeling causal effect heterogeneity with an application to heterogeneous returns to schooling. Journal of Econometrics, 162(2), 345-361.
[30] Li, C., Casella, G., & Ghosh, M. (2018). Estimation of regression vectors in linear mixed models with Dirichlet process random effects. Communications in Statistics - Theory and Methods, 47(16), 3935-3954.
[31] MacEachern, S. N. (1998). Computational methods for mixture of Dirichlet process models. In: Dey, D., Müller, P., & Sinha, D. (Eds.), Practical Nonparametric and Semiparametric Bayesian Statistics. Springer, New York, 23-43.
[32] Malikov, E., Kumbhakar, S. C., & Tsionas, M. G. (2016). A cost system approach to the stochastic directional technology distance function with undesirable outputs: The case of US banks in 2001-2010. Journal of Applied Econometrics, 31(7), 1407-1429.
[33] Murtazashvili, I., & Wooldridge, J. M. (2008). Fixed effects instrumental variables estimation in correlated random coefficient panel data models. Journal of Econometrics, 142(1), 539-552.
[34] Rossi, P. E., Allenby, G. M., & McCulloch, R. (2012). Bayesian Statistics and Marketing. John Wiley & Sons.
[35] Strutz, T. (2010). Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond. Vieweg and Teubner.
[36] Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2005). Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems (pp. 1385-1392).
[37] Teh, Y. W. (2011). Dirichlet process. In Encyclopedia of Machine Learning (pp. 280-287). Springer US.
[38] Wiesenfarth, M., Hisgen, C. M., Kneib, T., & Cadarso-Suarez, C. (2014). Bayesian nonparametric instrumental variables regression based on penalized splines and Dirichlet process mixtures. Journal of Business & Economic Statistics, 32(3), 468-482.
[39] White, H. (2014). Asymptotic Theory for Econometricians. Academic Press.
[40] Wooldridge, J. M. (2003). Cluster-sample methods in applied econometrics. American Economic Review, 93(2), 133-138.
[41] Wooldridge, J. M. (2005). Fixed-effects and related estimators for correlated random-coefficient and treatment-effect panel data models. Review of Economics and Statistics, 87(2), 385-390.
[42] Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association, 57(298), 348-368.
[43] Zellner, A. (1971).