A Bayesian approach for clustering skewed data using mixtures of multivariate normal-inverse Gaussian distributions
Yuan Fang (Department of Mathematical Sciences, Binghamton University, State University of New York, 4400 Vestal Parkway East, Binghamton, NY, USA 13902; e: [email protected])
Dimitris Karlis (Department of Statistics, Athens University of Economics and Business, Athens, Greece; e: [email protected])
Sanjeena Subedi (Department of Mathematical Sciences, Binghamton University, State University of New York, 4400 Vestal Parkway East, Binghamton, NY, USA 13902; e: [email protected])

May 7, 2020
Abstract
Non-Gaussian mixture models are gaining increasing attention for mixture model-based clustering, particularly when dealing with data that exhibit features such as skewness and heavy tails. Here, such a mixture distribution is presented, based on the multivariate normal inverse Gaussian (MNIG) distribution. For parameter estimation of the mixture, a Bayesian approach via a Gibbs sampler is used; for this, a novel approach to simulate univariate generalized inverse Gaussian random variables and matrix generalized inverse Gaussian random matrices is provided. The proposed algorithm is applied to both simulated and real data. Through simulation studies and real data analysis, we show parameter recovery and that our approach provides competitive clustering results compared to other clustering approaches.

Keywords: GIG distribution, cluster analysis, matrix-GIG distribution, model-based clustering, multivariate skew distributions, MNIG distribution.
1 Introduction

Model-based clustering uses finite mixture models, which assume that the population consists of a finite mixture of subpopulations, each represented by a known distribution. Gaussian mixture models, where each of the mixture components comes from a Gaussian distribution, are predominant in the literature. However, Gaussian mixture models can only model symmetric elliptical data. In the last decade, skewed mixture models, which are based on non-symmetric marginal distributions and have the flexibility of representing both skewed and symmetric components, have increasingly gained attention. Some examples include mixtures of skew-normal distributions (Lin et al., 2007b), mixtures of skew-t distributions (Lin et al., 2007a; Pyne et al., 2009; Lin, 2010; Vrbik and McNicholas, 2012; Murray et al., 2014), mixtures of generalized hyperbolic distributions (Browne and McNicholas, 2015; Wei et al., 2019), mixtures of variance-gamma distributions (McNicholas et al., 2017), and mixtures of multivariate normal inverse Gaussian (MNIG) distributions (Karlis and Santourian, 2009; Subedi and McNicholas, 2014; O'Hagan et al., 2016).

The MNIG distribution (Barndorff-Nielsen, 1997) is a mean-variance mixture of the multivariate normal distribution with the inverse Gaussian distribution. Mixtures of MNIG distributions were first proposed by Karlis and Santourian (2009), with parameter estimation via an expectation-maximization (EM) algorithm. Problems with the EM algorithm for mixture model estimation can include slow convergence and unreliable results arising from an unpleasant likelihood surface; these issues are well known and have been discussed by Titterington et al. (1985), among others. Subedi and McNicholas (2014) implemented an alternative variational Bayes framework for parameter estimation for these MNIG mixtures. Although variational inference tends to be faster than traditional MCMC-based approaches and can easily be scaled to larger datasets, it is an approximation to the true posterior, and the statistical properties of the resulting estimates are not as well understood; moreover, it does not provide exact coverage (Blei et al., 2017).

Early work in a Bayesian framework for parameter estimation for mixtures of distributions dates back over three decades (Diebolt and Robert, 1994; Robert, 1996; Richardson and Green, 1997; Stephens et al., 2000; Stephens, 2000). Treating the parameters as random variables, inference regarding the parameters is conducted based on their posterior distributions. Additionally, a Bayesian approach can moderate the failures to converge that are routinely encountered in an EM algorithm by smoothing the likelihood (Fraley and Raftery, 2002). Frühwirth-Schnatter (2006) provides a detailed overview of the Bayesian framework for modeling finite mixtures of distributions. The Bayesian framework for modeling skewed mixtures has gained increasing attention in the recent literature: Frühwirth-Schnatter and Pyne (2010) implemented a Bayesian approach to parameter estimation and clustering through finite mixtures of univariate and multivariate skew-normal and skew-t distributions, Hejblum et al. (2019) proposed sequential Dirichlet process mixtures of multivariate skew-t distributions for modeling flow cytometry data, and Maleki et al. (2019) introduced mixtures of unrestricted skew-normal generalized hyperbolic distributions.

In this paper, we implement a fully Bayesian approach for parameter estimation via Gibbs sampling. The structure of the paper is as follows.
In Section 2, the MNIG distribution and a G-component mixture of MNIG distributions are described. In Section 3, details on the Gibbs sampling-based approach to parameter estimation, convergence diagnostics, and label-switching issues are discussed; there, we also propose novel approaches to simulate from two distributions: the generalized inverse Gaussian (GIG) distribution and the matrix generalized inverse Gaussian (MGIG) distribution. In Section 4, through simulation studies and real data analysis, we demonstrate competitive clustering results. Lastly, Section 5 concludes with a discussion and some future directions.

2 Mixtures of MNIG distributions

The MNIG distribution is based on a mean-variance mixture of a d-dimensional multivariate normal with an inverse Gaussian distribution (Barndorff-Nielsen, 1997). Suppose Y | U = u is a random vector from a d-dimensional multivariate normal distribution with mean µ + u∆β and covariance u∆, and U arises from an inverse Gaussian distribution with parameters γ and δ, such that

$$Y \mid U = u \sim N(\mu + u\Delta\beta,\ u\Delta) \qquad \text{and} \qquad U \sim \mathrm{IG}(\gamma, \delta).$$

The probability density of U is given by

$$f(u) = \frac{\delta}{\sqrt{2\pi}} \exp(\delta\gamma)\, u^{-3/2} \exp\left\{-\frac{1}{2}\left(\frac{\delta^2}{u} + \gamma^2 u\right)\right\}.$$

The marginal distribution of Y is an MNIG distribution with density

$$f_Y(y) = \frac{\delta}{2^{(d-1)/2}} \left[\frac{\alpha}{\pi\, q(y)}\right]^{\frac{d+1}{2}} \exp\left(p(y)\right) K_{\frac{d+1}{2}}\!\left(\alpha\, q(y)\right),$$

where

$$\alpha = \sqrt{\gamma^2 + \beta^\top\Delta\beta}, \qquad p(y) = \delta\gamma + \beta^\top(y - \mu), \qquad q(y) = \sqrt{\delta^2 + (y - \mu)^\top\Delta^{-1}(y - \mu)},$$

and $K_{\frac{d+1}{2}}$ is the modified Bessel function of the third kind of order $\frac{d+1}{2}$. For identifiability reasons, $|\Delta| = 1$ is required.
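As a concrete transcription of this density, the following R sketch evaluates the MNIG log-density defined above, with base R's besselK supplying $K_{(d+1)/2}$; the function name is illustrative rather than from the paper, and the constraint $|\Delta| = 1$ is assumed rather than checked.

```r
# Log-density of the d-variate MNIG distribution in the parametrisation
# above (the identifiability constraint |Delta| = 1 is assumed).
dmnig_log <- function(y, mu, beta, delta, gamma, Delta) {
  d     <- length(y)
  alpha <- sqrt(gamma^2 + drop(t(beta) %*% Delta %*% beta))
  p     <- delta * gamma + drop(t(beta) %*% (y - mu))
  q     <- sqrt(delta^2 + drop(t(y - mu) %*% solve(Delta, y - mu)))
  nu    <- (d + 1) / 2
  # besselK(..., expon.scaled = TRUE) returns exp(x) * K_nu(x); subtracting
  # alpha * q recovers log K_nu(alpha * q) without underflow for large arguments.
  logK <- log(besselK(alpha * q, nu, expon.scaled = TRUE)) - alpha * q
  log(delta) - (d - 1) / 2 * log(2) + nu * (log(alpha) - log(pi * q)) + p + logK
}
```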
Combining the conditional d-dimensional multivariate normal density of Y | U = u with the marginal density of U, the joint probability density can be derived:

$$
\begin{aligned}
f(y, u) &= f(y \mid u)\, f(u) \\
&= (2\pi)^{-d/2} |u\Delta|^{-1/2} \exp\left\{-\frac{1}{2}(y - \mu - u\Delta\beta)^\top (u\Delta)^{-1} (y - \mu - u\Delta\beta)\right\} \\
&\qquad \times \frac{\delta}{\sqrt{2\pi}} \exp(\delta\gamma)\, u^{-3/2} \exp\left\{-\frac{1}{2}\left(\frac{\delta^2}{u} + \gamma^2 u\right)\right\} \\
&\propto u^{-\frac{d+3}{2}} |\Delta|^{-1/2}\, \delta \exp\left\{\delta\gamma - \frac{1}{2}\left(\frac{\delta^2}{u} + \gamma^2 u\right) - \frac{1}{2}(y - \mu - u\Delta\beta)^\top (u\Delta)^{-1} (y - \mu - u\Delta\beta)\right\}.
\end{aligned}
$$

By letting θ = (µ, β, δ, γ, ∆), it is easy to show that the above joint density is within the exponential family, i.e.,

$$f(y, u \mid \theta) \propto h(y, u)\, r(\theta) \exp\left\{\mathrm{Tr}\left(\sum_{j=1}^{5} \phi_j(\theta)\, t_j(y, u)\right)\right\},$$

where

$$h(y, u) = u^{-\frac{d+3}{2}}, \qquad r(\theta) = \delta \exp(\delta\gamma) \exp\left\{-\mathrm{Tr}(\beta\mu^\top)\right\},$$

and

$$
\begin{aligned}
\phi_1 &= \beta, & t_1(y, u) &= y^\top; \\
\phi_2 &= \mu^\top \Delta^{-1}, & t_2(y, u) &= u^{-1} y; \\
\phi_3 &= -\left(\Delta\beta\beta^\top + \frac{\gamma^2}{d} I_d\right), & t_3(y, u) &= \frac{1}{2} u; \\
\phi_4 &= -\left(\Delta^{-1}\mu\mu^\top + \frac{\delta^2}{d} I_d\right), & t_4(y, u) &= \frac{1}{2} u^{-1}; \\
\phi_5 &= -\Delta^{-1}, & t_5(y, u) &= \frac{1}{2}\, y y^\top u^{-1}.
\end{aligned}
$$

A G-component finite mixture of multivariate normal inverse Gaussian densities can be written as

$$f(y \mid \vartheta) = \sum_{g=1}^{G} \pi_g\, f(y \mid \theta_g),$$

where G is the number of components, f(y | θ_g) is the gth component MNIG density with parameters θ_g = (µ_g, β_g, δ_g, γ_g, ∆_g), the π_g are the mixing proportions with π_g > 0 and $\sum_{g=1}^{G} \pi_g = 1$, and $\vartheta = (\pi_1, \ldots, \pi_G, \theta_1, \ldots, \theta_G)$. Consider n independent observations y = (y_1, ..., y_n) coming from a G-component mixture of MNIG distributions. The likelihood for a G-component mixture of MNIG distributions can be written as

$$L(\theta_1, \ldots, \theta_G) = \prod_{i=1}^{n} \sum_{g=1}^{G} \pi_g\, f_g(y_i \mid \mu_g, \beta_g, \delta_g, \gamma_g, \Delta_g).$$

A component indicator variable Z_i = (Z_{i1}, Z_{i2}, ..., Z_{iG}) can be defined such that z_{ig} = 1 if observation i belongs to group g and z_{ig} = 0 otherwise. The complete-data (i.e., the observed y_i, the latent variable u_{ig}, and the missing data z_{ig}) likelihood of the mixture model is given by

$$
L_c(\theta_1, \ldots, \theta_G) = \prod_{g=1}^{G} \prod_{i=1}^{n} \left[\pi_g\, f(y_i, u_{ig} \mid \mu_g, \beta_g, \delta_g, \gamma_g, \Delta_g)\right]^{z_{ig}}
= \prod_{g=1}^{G} \left\{ \pi_g^{\,t_{0g}} \left[r(\theta_g)\right]^{t_{0g}} \prod_{i=1}^{n} \left[h(y_i, u_{ig})\right]^{z_{ig}} \times \exp\,\mathrm{Tr}\left(\sum_{j=1}^{5} \phi_{jg}(\theta_g)\, t_{jg}(y, u_g)\right)\right\},
$$

where $t_{0g} = \sum_{i=1}^{n} z_{ig}$. Also, θ_g = (π_g, µ_g, β_g, δ_g, γ_g, ∆_g) denotes the parameters related to the gth component. Accordingly, the component-specific functions for the parameters, φ_jg, and the sufficient statistics, t_jg, for j = 1, ...,
5, are given as follows:

$$
\begin{aligned}
\phi_{1g} &= \beta_g, & t_{1g}(y, u_g) &= \sum_{i=1}^{n} z_{ig}\, y_i^\top; \\
\phi_{2g} &= \mu_g^\top \Delta_g^{-1}, & t_{2g}(y, u_g) &= \sum_{i=1}^{n} z_{ig}\, u_{ig}^{-1} y_i; \\
\phi_{3g} &= -\left(\Delta_g \beta_g \beta_g^\top + \frac{\gamma_g^2}{d} I_d\right), & t_{3g}(y, u_g) &= \frac{1}{2}\sum_{i=1}^{n} z_{ig}\, u_{ig}; \\
\phi_{4g} &= -\left(\Delta_g^{-1} \mu_g \mu_g^\top + \frac{\delta_g^2}{d} I_d\right), & t_{4g}(y, u_g) &= \frac{1}{2}\sum_{i=1}^{n} z_{ig}\, u_{ig}^{-1}; \\
\phi_{5g} &= -\Delta_g^{-1}, & t_{5g}(y, u_g) &= \frac{1}{2}\sum_{i=1}^{n} z_{ig}\, u_{ig}^{-1}\, y_i y_i^\top.
\end{aligned}
$$

To simplify notation, we will use $t_{jg} = t_{jg}(y, u_g)$, $j = 1, \ldots, 5$.

3 Parameter Estimation using Gibbs Sampling
We denote by GIG(λ, δ, γ) the GIG distribution with probability density function given by

$$f(z; \lambda, \delta, \gamma) = \frac{(\gamma/\delta)^{\lambda}\, z^{\lambda - 1}}{2 K_{\lambda}(\delta\gamma)} \exp\left\{-\frac{1}{2}\left(\frac{\delta^2}{z} + \gamma^2 z\right)\right\}.$$
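For reference, this density translates directly into R via base R's besselK; a minimal sketch (the function name is illustrative):

```r
# GIG(lambda, delta, gamma) density in the parametrisation above.
dgig <- function(z, lambda, delta, gamma) {
  (gamma / delta)^lambda * z^(lambda - 1) / (2 * besselK(delta * gamma, lambda)) *
    exp(-0.5 * (delta^2 / z + gamma^2 * z))
}
```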
Also, we denote by MGIG_d(·, ·, ·) an MGIG distribution (see Appendix A for details). Obviously, if the dimension d = 1, the MGIG is simply a GIG distribution.

If the conjugate prior of θ_g is of the form

$$p(\theta_g) \propto \left[r(\theta_g)\right]^{a^{(0)}_{g,0}} \exp\,\mathrm{Tr}\left\{\sum_{j=1}^{5} \phi_{jg}(\theta_g)\, a^{(0)}_{g,j}\right\},$$

with hyperparameters $\{a^{(0)}_{g,0}, a^{(0)}_{g,1}, a^{(0)}_{g,2}, a^{(0)}_{g,3}, a^{(0)}_{g,4}, a^{(0)}_{g,5}\}$, then the posterior distribution is of the form

$$p(\theta_g \mid \cdot) \propto \left[r(\theta_g)\right]^{a^{(0)}_{g,0} + t_{0g}} \exp\,\mathrm{Tr}\left\{\sum_{j=1}^{5} \phi_{jg}(\theta_g)\left(a^{(0)}_{g,j} + t_{jg}(y, u_g)\right)\right\}.$$

Hence, the hyperparameters of the posterior distribution are of the form $a_{g,j} = a^{(0)}_{g,j} + t_{jg}$, $j = 0, 1, \ldots, 5$.

The conditional posterior distribution of U given Y = y is a generalized inverse Gaussian distribution such that

$$U \mid Y = y \sim \mathrm{GIG}\left(-\frac{d+1}{2},\ q(y),\ \alpha\right).$$

A conjugate Dirichlet prior distribution with hyperparameters $(a^{(0)}_{1,0}, \ldots, a^{(0)}_{G,0})$ is assigned to the mixing weights $(\pi_1, \ldots, \pi_G)$. The resulting posterior is a Dirichlet distribution with hyperparameters $(a_{1,0}, \ldots, a_{G,0})$.

A conjugate gamma prior with hyperparameters $\left(\frac{a^{(0)}_{g,0}+1}{2},\ a^{(0)}_{g,4} - \frac{(a^{(0)}_{g,0})^2}{4 a^{(0)}_{g,3}}\right)$ is assigned to $\delta_g^2$, which yields the posterior distribution

$$\delta_g^2 \sim \mathrm{Gamma}\left(\frac{a_{g,0}+1}{2},\ a_{g,4} - \frac{a_{g,0}^2}{4 a_{g,3}}\right).$$

Conditional on δ_g, a truncated normal conjugate prior is assigned to γ_g. The resulting posterior distribution is given as

$$\gamma_g \mid \delta_g \sim N\left(\frac{a_{g,0}\,\delta_g}{2 a_{g,3}},\ \frac{1}{2 a_{g,3}}\right)\cdot \mathbb{1}(\gamma_g > 0).$$

A conjugate joint multivariate normal prior conditional on ∆_g was assigned to (µ_g, β_g), and the resulting posterior is a multivariate normal distribution conditional on ∆_g:

$$\begin{pmatrix} \mu_g \\ \beta_g \end{pmatrix}\Bigg|\,\Delta_g \sim MVN\left(\begin{pmatrix} \dfrac{2 a_{g,3}\, a_{g,2} - a_{g,0}\, a_{g,1}}{4 a_{g,3} a_{g,4} - a_{g,0}^2} \\[2ex] \dfrac{\Delta_g^{-1}\left(2 a_{g,4}\, a_{g,1} - a_{g,0}\, a_{g,2}\right)}{4 a_{g,3} a_{g,4} - a_{g,0}^2} \end{pmatrix},\ \begin{pmatrix} \Sigma_{\mu_g} & \Sigma_{\mu_g,\beta_g} \\ \Sigma_{\mu_g,\beta_g} & \Sigma_{\beta_g} \end{pmatrix}\right),$$

where

$$\Sigma_{\mu_g} = \frac{2 a_{g,3}\,\Delta_g}{4 a_{g,3} a_{g,4} - a_{g,0}^2}, \qquad \Sigma_{\beta_g} = \frac{2 a_{g,4}\,\Delta_g^{-1}}{4 a_{g,3} a_{g,4} - a_{g,0}^2}, \qquad \Sigma_{\mu_g,\beta_g} = \frac{-a_{g,0}\, I_d}{4 a_{g,3} a_{g,4} - a_{g,0}^2}.$$

An inverse-Wishart(ν, Λ) prior is assigned to ∆_g, and the resulting posterior distribution is an MGIG distribution:

$$\Delta_g \sim \mathrm{MGIG}_d\left(-\frac{\nu + t_{0g}}{2},\ 2 t_{3g}\,\beta_g\beta_g^\top,\ S_g + \Lambda\right),$$

where $S_g = \sum_{i=1}^{n} z_{ig}\, u_{ig}^{-1}(y_i - \mu_g)(y_i - \mu_g)^\top$.

Given an estimate of θ_g, i.e., $(\hat\pi_g, \hat\mu_g, \hat\beta_g, \hat\gamma_g, \hat\delta_g, \hat\Delta_g)$, the probability that $z_{ig} = 1$ is

$$\widehat{z}_{ig} = \frac{\hat\pi_g\, f_g(y_i \mid u_{ig};\, \hat\mu_g, \hat\beta_g, \hat\Delta_g)\ f_g(u_{ig};\, \hat\gamma_g, \hat\delta_g)}{\sum_{k=1}^{G} \hat\pi_k\, f_k(y_i \mid u_{ik};\, \hat\mu_k, \hat\beta_k, \hat\Delta_k)\ f_k(u_{ik};\, \hat\gamma_k, \hat\delta_k)}.$$

Further details are provided in Appendix B.
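In code, the conjugate updates reduce to accumulating the five sufficient statistics of Section 2 and adding them to the prior hyperparameters. A minimal R sketch, assuming y is an n × d data matrix and z and u hold the current indicator and latent-variable draws for component g (all names are illustrative):

```r
# Sufficient statistics t_{jg} of Section 2 for one component g.
suff_stats <- function(y, z, u) {
  w <- z / u
  list(t0 = sum(z),
       t1 = colSums(z * y),               # sum_i z_ig y_i
       t2 = colSums(w * y),               # sum_i z_ig u_ig^{-1} y_i
       t3 = sum(z * u) / 2,
       t4 = sum(w) / 2,
       t5 = crossprod(sqrt(w) * y) / 2)   # (1/2) sum_i z_ig u_ig^{-1} y_i y_i'
}

# Posterior hyperparameters a_{g,j} = a0_{g,j} + t_{jg}, j = 0, ..., 5.
update_hyper <- function(a0, t) Map(`+`, a0, t)
```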
Parameter estimation in a Bayesian framework can be done by using samples from the posterior distribution via Gibbs sampling. Gibbs sampling is a Markov chain Monte Carlo (MCMC) sampling algorithm which, here, iterates between the following two steps: drawing a sample of $z_i = (z_{i1}, \ldots, z_{iG})$ from its multinomial conditional posterior distribution with $p(z_{ig} = 1 \mid \vartheta^{(t-1)}, y)$, where $\vartheta^{(t-1)} = \big((\pi_1, \ldots, \pi_G)^{(t-1)}, (\theta_1, \ldots, \theta_G)^{(t-1)}\big)$ are the values of $\vartheta$ after the (t-1)th iteration and g = 1, ..., G; and updating $\theta_1, \ldots, \theta_G$ by drawing samples from their posterior distributions and updating $\pi_1, \ldots, \pi_G$.

A Gibbs sampling framework for parameter estimation of the MNIG mixtures is summarized as follows:

Step 0 (Initialization): For the observed data y = (y_1, ..., y_n), the algorithm is initialized with G components. The $\hat z_{ig}$ are initialized using the result from k-means clustering (or any other clustering algorithm). Based on the initialized $\hat z_{ig}$, parameters of the gth component are initialized as follows:
(a) γ_g = δ_g = 1.
(b) µ_g is set to the component sample mean.
(c) β_g is assigned a d-dimensional vector with all entries equal to 0.
(d) ∆_g is initialized as the component sample covariance matrix divided by its determinant raised to the power 1/d, to ensure that |∆_g| = 1.
(e) π_g is set to the proportion of observations in the gth component.

Step 1: At the tth iteration, update $z_{ig}$ from its multinomial conditional posterior distribution with

$$\Pr(z_{ig} = 1) = \hat z_{ig} = \frac{\pi_g\, f\left(y_i, u_{ig} \mid (\mu_g, \beta_g, \delta_g, \gamma_g, \Delta_g)^{(t-1)}\right)}{\sum_{k=1}^{G} \pi_k\, f\left(y_i, u_{ik} \mid (\mu_k, \beta_k, \delta_k, \gamma_k, \Delta_k)^{(t-1)}\right)},$$

where $(\mu_g, \beta_g, \delta_g, \gamma_g, \Delta_g)^{(t-1)}$ are the values of $\mu_g, \beta_g, \delta_g, \gamma_g$, and $\Delta_g$ sampled at the (t-1)th iteration.

Step 2:
(a) Update the $u_{ig}$'s by drawing samples from the GIG(−(d+1)/2, q(y_i), α_g) distribution for i = 1, ..., n and g = 1, ..., G;
(b) Based on the updated $u_{ig}$ and $z_{ig}$, update the hyperparameters $\{a_{g,0}, a_{g,1}, a_{g,2}, a_{g,3}, a_{g,4}, a_{g,5}\}$, g = 1, ..., G;
(c) Update the parameters to $(\mu_g, \beta_g, \delta_g, \gamma_g, \Delta_g)^{(t)}$ by drawing one sample each from the posterior distributions described in the previous section;
(d) Update $(\pi_1^{(t)}, \ldots, \pi_G^{(t)})$ from Dir$(a_{1,0}, \ldots, a_{G,0})$.

Steps 1 and 2 are iterated until convergence.
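Steps 1 and 2(d) are standard categorical and Dirichlet draws; the following self-contained R helpers sketch them. The n × G matrix of values log π_g f(y_i, u_ig | θ_g) is assumed to be computed from the current parameter values, and the function names are illustrative:

```r
# Step 1: one-hot membership draws from the multinomial full conditional,
# given an n x G matrix of component log-densities.
draw_z <- function(logdens) {
  t(apply(logdens, 1, function(lp) {
    p <- exp(lp - max(lp))          # stabilise before normalising
    rmultinom(1, 1, p)[, 1]
  }))
}

# Step 2(d): mixing weights from their Dirichlet full conditional,
# drawn via normalised gamma variates.
rdirichlet1 <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }
```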
Note that the posterior distribution of U | Y = y is a GIG distribution and the conditional posterior distribution of ∆_g is an MGIG distribution; hence, we need to simulate from them. While for the GIG distribution the literature offers certain approaches, here we propose novel approaches to sample from the univariate GIG distribution and from the MGIG distribution that are suitable for the MCMC framework used here.

In the past, there have been various attempts at simulating from the GIG distribution based on different approaches (Atkinson, 1982; Dagpunar, 1989; Devroye, 2014; Lai, 2009; Hörmann and Leydold, 2014; Leydold and Hörmann, 2011). Our approach is particularly suitable because it is based on a Markov chain, and hence it can be embedded in the MCMC easily.

The approach for sampling from a univariate GIG distribution is based on properties of the GIG distribution. Firstly, if X is a GIG random variable, then the same is true for $X^{-1}$. Secondly, if X and Y are two independent random variables from a GIG distribution and a gamma distribution, respectively, then the transformed random variables $(X + Y)^{-1}$ and $X^{-1} - (X + Y)^{-1}$ follow a GIG distribution and a gamma distribution, respectively, and they are independent (Letac and Seshadri, 1983; Letac and Wesołowski, 2000). It has further been shown by Letac and Seshadri (1983) that, if X is a GIG random variable, and Y and Z are gamma random variables, then the following holds:

$$X \stackrel{d}{=} Y + \cfrac{1}{Z + \cfrac{1}{X}},$$

where $\stackrel{d}{=}$ indicates that the left-hand side and the right-hand side possess the same distribution.

The fraction on the right-hand side above is a random continued fraction, denoted as [Y; Z, X]. Following the result of Goswami (2004), the Markov chain defined recursively by the random continued fractions $X_1 = [Y_1; Z_1, X_0]$, $X_2 = [Y_2; Z_2, X_1]$, and in general $X_n = [Y_n; Z_n, X_{n-1}]$, must converge to a stationary distribution. In detail, if $\{Y_i, Z_i,\ i = 1, 2, \ldots\}$ are independent random variables from Gamma(λ, α²/2) and Gamma(λ, β²/2) distributions, respectively, then the chain converges to the density of a GIG(λ, α, β) distribution.

The algorithm to sample random numbers U from a GIG(−(d+1)/2, q(y), α) distribution is outlined below:

Step 1: Start with a U from any distribution.
Step 2: Generate
$$V \sim \mathrm{Gamma}\left(\frac{d+1}{2},\ \frac{\alpha^2}{2}\right) \qquad \text{and} \qquad S \sim \mathrm{Gamma}\left(\frac{d+1}{2},\ \frac{q(y)^2}{2}\right).$$
Step 3: Update U using
$$U = S + \cfrac{1}{V + \cfrac{1}{U}}.$$
Step 4: Repeat Steps 2 and 3 a number of times until convergence.

Convergence of the above Markov chain is monitored using Heidelberger and Welch's convergence diagnostic for a single chain.

Sampling from the MGIG distribution has, to our knowledge, not been developed so far; in Fazayeli and Banerjee (2016), a standard importance sampling approach is described. It is interesting that the MGIG is conjugate for the covariance matrix of a multivariate normal distribution. Our algorithm for sampling from an MGIG distribution is based on the Matsumoto-Yor (MY) property for random matrices of different dimensions, discussed by Massam and Wesolowski (2006).

Let $V_n$ be the Euclidean space of n × n real symmetric matrices, and let $V_n^+$ denote the cone of positive definite matrices in $V_n$. Let $a \in V_s^+$ and $b \in V_r^+$, and let X and Y be two independent random matrices with MGIG_r and Wishart distributions, respectively, with p = q + r − s:

$$X \sim \mathrm{MGIG}_r\left(-p,\ P(z^\top)a,\ b\right) \qquad \text{and} \qquad Y \sim W_s(q, a).$$

Let z be a given constant s × r matrix of full rank, with $P(z^\top)a = z^\top a z$ defining a linear mapping. Then U and V defined as

$$U = \left(P(z)X + Y\right)^{-1}, \qquad V = X^{-1} - P(z^\top)\left(P(z)X + Y\right)^{-1}$$

are independent MGIG and Wishart random matrices, and in particular

$$U \sim \mathrm{MGIG}_s\left(-q,\ P(z)b,\ a\right) \qquad \text{and} \qquad V \sim W_r(p, b).$$

The detailed simulation of $\Delta_g \sim \mathrm{MGIG}_d\left(-\frac{\nu + t_{0g}}{2},\ 2 t_{3g}\beta_g\beta_g^\top,\ S_g + \Lambda\right)$ is organized as follows. Take

$$r = 1, \qquad s = d, \qquad q = \frac{\nu + t_{0g}}{2}, \qquad p = q + 1 - d, \qquad z = \beta_g, \qquad a = \Lambda + S_g,$$

where $S_g = \sum_{i=1}^{n} z_{ig}\, u_{ig}^{-1}(y_i - \mu_g)(y_i - \mu_g)^\top$, and let $b = 2 t_{3g}$. Then take the following steps:

Step 1: Simulate X following the univariate generalized inverse Gaussian distribution such that $X \sim \mathrm{GIG}(-p,\ z^\top a z,\ b)$ in the MGIG parametrization (i.e., GIG(−p, √b, √(z⊤az)) in the notation of this section).
Step 2: Simulate a d × d matrix W following the Wishart distribution, i.e., $W \sim W_d(q, a)$.
Step 3: Compute $Y = (z X z^\top + W)^{-1}$; it follows an $\mathrm{MGIG}_d(-q,\ z b z^\top,\ a)$ distribution.

The algorithm just described offers a simple way to simulate from the MGIG distribution: in order to generate from an MGIG, we need only simulate from a univariate GIG distribution and a Wishart distribution, which is very standard. Also, note that with the algorithm above the GIG is simulated as a Markov chain, and this is very suitable in the Gibbs setting we are discussing.
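The two samplers above can be sketched compactly in R. This is a sketch under stated assumptions, not the paper's code: the Gamma/Wishart mappings in the comments are derived from the densities given in this section and in Appendix A, and the fixed number of inner iterations replaces the Heidelberger-Welch monitoring used in the paper.

```r
# Draw from GIG(-(d + 1)/2, q, alpha) (density proportional to
# u^{-(d + 3)/2} exp{-(q^2 / u + alpha^2 u) / 2}) via the random
# continued fraction chain U <- S + 1 / (V + 1 / U).
rgig_chain <- function(d, q, alpha, n_iter = 50, u0 = 1) {
  u <- u0
  for (k in seq_len(n_iter)) {
    s <- rgamma(1, shape = (d + 1) / 2, rate = q^2 / 2)
    v <- rgamma(1, shape = (d + 1) / 2, rate = alpha^2 / 2)
    u <- s + 1 / (v + 1 / u)   # random continued fraction [S; V, U]
  }
  u
}

# Draw from MGIG_d(-q, z b z', a), the r = 1 case used for Delta_g:
# z is d x 1, b > 0 is scalar, a is d x d positive definite, p = q + 1 - d.
# The Wishart W_d(q, a) of Appendix A (density proportional to
# |x|^{q - (d + 1)/2} exp{-tr(a x) / 2}) is assumed to map to R's
# rWishart() with df = 2q and Sigma = solve(a); rWishart needs 2q > d - 1.
rmgig <- function(q, z, a, b, n_iter = 50) {
  p   <- q + 1 - length(z)
  zaz <- drop(t(z) %*% a %*% z)
  # X ~ GIG(-p, z'az, b) in the MGIG parametrisation, i.e.
  # GIG(-p, sqrt(b), sqrt(z'az)) in the (lambda, delta, gamma) notation:
  x <- rgig_chain(d = 2 * p - 1, q = sqrt(b), alpha = sqrt(zaz), n_iter = n_iter)
  w <- rWishart(1, df = 2 * q, Sigma = solve(a))[, , 1]
  solve(tcrossprod(z) * x + w)   # (z X z' + W)^{-1} ~ MGIG_d(-q, z b z', a)
}
```

For the conditional posterior of ∆_g, one would call rmgig(q = (ν + t_{0g})/2, z = β_g, a = Λ + S_g, b = 2 t_{3g}), matching the quantities defined above.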
In the Bayesian approach to parameter estimation and clustering in finite mixture models, label switching can occur. This is well known and refers to the invariance of the likelihood when the mixture components are permuted (Stephens, 2000). There are many different approaches to dealing with the label-switching issue in the literature. For example, Richardson and Green (1997) suggested that one can put artificial constraints on the mixing proportions $\pi_1, \ldots, \pi_G$ or on the parameters to force the labeling to be unique. Celeux et al. (2000) considered a decision-theoretic approach to overcome the label-switching issue, and Stephens (2000) proposed an algorithm that combined a relabeling algorithm with the decision-theoretic approach. In our analysis, we found that relabeling the parameter estimates using the constraint on the mixing proportions works well.

To diagnose the convergence of the Markov chains, three independent sequences with different k-means initializations are simulated. The likelihood is calculated using the updated parameters at the end of each iteration, and the chains of likelihood values are monitored for convergence using the Gelman-Rubin diagnostic (Gelman et al., 1992), which is based on comparisons of the between-chain and within-chain variations. As early iterations reflect the starting approximation and may not represent the target posterior, samples from the early iterations, known as the "burn-in" period, are discarded (Gelman et al., 2013). If the potential scale reduction factor calculated from the likelihood chains after burn-in is below 1.1, the chains are considered to have converged and to be mixing well. After reaching a stationary approximation of the posterior distribution, averages of the samples drawn from the approximated posterior in the Markov chains, discarding those from the burn-in period, provide good estimates of the parameters (Diebolt and Robert, 1994).

Model selection, i.e., selecting the optimal number of components, can be done by a comparison between models containing different numbers of clusters (Fraley and Raftery, 2002; Gelman et al., 2013). Several model selection criteria have been proposed in the literature, such as the Akaike Information Criterion (AIC; Akaike, 1983), the Bayesian Information Criterion (BIC; Schwarz, 1978), the Deviance Information Criterion (DIC; Spiegelhalter et al., 2002), and the Integrated Complete Likelihood (ICL; Biernacki et al., 1998). Model selection based on the BIC has been shown to obtain satisfactory performance (Leroux, 1992; Keribin, 2000; Roeder and Wasserman, 1997). Additionally, Steele and Raftery (2009) investigated the performance of several model selection approaches for Gaussian mixture models, including some Bayesian approaches as well as information-criterion-based approaches, and showed that the BIC performed the best among the ones compared. Hence, here, the Gibbs sampling algorithm is carried out for a range of numbers of components G, with model selection conducted using the BIC. When the true class label is known, the performance of the clustering algorithm can be assessed using the adjusted Rand index (ARI; Hubert and Arabie, 1985). The ARI is a measure of the pairwise agreement between the true and estimated classifications after adjusting for agreement by chance, such that a value of 1 indicates perfect agreement and the expected value of the ARI under random classification is 0.
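Both the Gelman-Rubin diagnostic and the ARI are available in standard R packages; a sketch, where ll1, ll2, and ll3 are hypothetical per-iteration log-likelihood traces (after burn-in) from three independently initialized chains:

```r
library(coda)    # gelman.diag() and also heidel.diag() for single chains
library(mclust)  # adjustedRandIndex()

chains <- mcmc.list(mcmc(ll1), mcmc(ll2), mcmc(ll3))
psrf <- gelman.diag(chains)$psrf[, "Point est."]
converged <- all(psrf < 1.1)   # the 1.1 cutoff used in the text

# With known true labels, clustering performance via the adjusted Rand index:
# ari <- adjustedRandIndex(true_labels, estimated_labels)
```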
4 Simulation Studies and Real Data Analysis

Simulation Study 1

In this simulation study, the proposed algorithm is applied to 100 two-dimensional datasets with two skewed and heavy-tailed components (with a sample size of 500 for each); see Figure 1 (left panel) for one example. The true parameters used to generate the data are summarized in Table 1. The proposed algorithm is run with the number of components ranging from G = 1 to G = 5. A two-component model is selected by the BIC for all 100 datasets, with an average ARI of 1.00 and a standard deviation (sd) of 0.00. The estimated parameters are also summarized in Table 1.
Figure 1: Scatter plot highlighting the true labels (left) and a contour plot showing the predicted classifications (right). Here, the ARI is 1.

Figure 1 (left) shows the true component membership of one of the hundred datasets, and Figure 1 (right) gives the contour plot based on the estimated parameters for the dataset described in the left panel.
Table 1: True and estimated values for the MNIG parameters in Simulation Study 1. For each of the two components, the table lists the true values of γ, δ, µ, β, and ∆ alongside the averaged parameter estimates, with standard errors in parentheses.
Although flexible clustering models based on skewed data have emerged in the last decade or so, Gaussian mixtures still remain one of the predominantly used approaches for model-based clustering, due to their mathematical tractability and the relative computational simplicity of parameter estimation. Therefore, the Gaussian mixture models (GMM) implemented in the R package mclust (Scrucca et al., 2017) are also applied to these datasets. However, these Gaussian mixture models cannot account for skewness. Hence, mixtures of generalized hyperbolic distributions (MixGHD; Browne and McNicholas, 2015) implemented in the R package MixGHD (Tortora et al., 2018) are also applied to these datasets. Mixtures of generalized hyperbolic distributions (McNeil et al., 2015) also have the flexibility of modeling skewed as well as symmetric components; several skewed distributions, such as the skew-t distribution, the skew-normal distribution, the variance-gamma distribution, and the MNIG distribution, as well as symmetric distributions such as the Gaussian and the t-distribution, can be obtained as limiting cases of the generalized hyperbolic distribution (Browne and McNicholas, 2015). In this simulation, the Gaussian mixture model always chooses 3 or 4 components, since it needs more than one component to fit the skewed clusters. The mixture of generalized hyperbolic distributions always chooses the two-component model as well and gives an ARI of 1.00 with a standard deviation (sd) of 0.00.
Simulation Study 2

In this simulation, 100 four-dimensional datasets were generated with three underlying groups, where one component is skewed with 200 observations, another component is symmetric with 200 observations, and the third component is skewed with 100 observations. The parameters used to generate the data are summarized in Table 2. The proposed algorithm is applied to these datasets for G = 1 through G = 7, and it selected the correct three-component model 99 out of 100 times, with an average ARI of 1.00 (sd 0.00).
Table 2: True and estimated values for the parameters in Simulation Study 2. For each of the three components (of sizes n = 100, n = 200, and n = 200), the table lists the true values of γ, δ, µ, β, and ∆ alongside the averaged parameter estimates, with standard errors in parentheses.

The average of the parameter estimates for the 99 datasets provided in Table 2 shows good parameter recovery. Figure 2 (right) gives the pairwise scatter plot based on the estimated parameters for one of these datasets, where the true group labels are shown in the left panel.
Figure 2: Pairwise scatter plot highlighting the true labels (left) and a pairwise scatter plot showing the predicted classifications (right) of one of the hundred datasets. Here, the ARI was 1.

Gaussian mixture models and mixtures of generalized hyperbolic distributions are also applied to these datasets. The mixture of generalized hyperbolic distributions correctly selects the three-component model 84 times out of the 100 datasets, with an average ARI of 1.00 (sd 0.00). The Gaussian mixture models did not choose the correct number of components for any of the 100 datasets.
The proposed algorithm is also applied to some benchmark clustering datasets.
The Old Faithful Dataset
The Old Faithful data available in the R package MASS consist of the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. There are 272 observations, and each observation contains 2 variables (waiting time between eruptions and duration of eruption). We ran our algorithm on the scaled data for G = 1 to 5, and a two-component model was selected. This is consistent with Subedi and McNicholas (2014), who used mixtures of MNIG distributions in a variational Bayes framework, and with Franczak et al. (2014) and Vrbik and McNicholas (2012), who used other skewed mixture models. The contour plot of the fitted model given in Figure 3 shows that our fitted model captures the density of the data fairly well.

Figure 3: Contour plot for the Old Faithful data using the estimated values of the parameters from the proposed algorithm.
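As an illustration of the workflow on this dataset: base R ships the same two measurements as the faithful data frame (the paper cites the MASS copy of the data), and mnig_gibbs() below is a hypothetical driver wrapping the sampler of Section 3, not a function from the paper.

```r
# Standardise both variables before fitting, as in the text.
y <- scale(faithful[, c("eruptions", "waiting")])

# fits  <- lapply(1:5, function(G) mnig_gibbs(y, G = G))  # hypothetical driver
# G_hat <- which.max(sapply(fits, function(f) f$BIC))     # BIC model choice
```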
The Fish Catch Dataset
The fish catch data, available in the R package rrcov, consist of the weight and different body length measurements of seven different fish species. There are 159 observations in this dataset. Similar to Subedi and McNicholas (2014), after dropping the highly correlated variables, the variables Length2, Height, and Width were used for the analysis, where Length2 is the length from the nose to the notch of the tail, Height is the maximal height as a percentage of the length from the nose to the end of the tail, and Width is the maximal width as a percentage of the length from the nose to the end of the tail. The proposed algorithm is applied (after scaling the data) for G = 1 to 9, and it selects a four-component model. Figure 4 shows the pairwise scatter plots for this dataset, with the left panel showing the true species of the fish and the right panel showing the estimated groups. Although the true number of fish species is seven, from the pairwise scatter plot one can see that the species Whitewish, Roach, and Perch are hard to distinguish, and there is no clear separation between the species Bream and Parkki. Table 3 summarizes the cross-tabulation of the true species and the estimated group memberships, and it is consistent with the result from Subedi and McNicholas (2014), who used a variational Bayes approach for clustering using mixtures of MNIG distributions. The GMM and MixGHD are also applied to the fish catch data, and both result in a five-component model, where the additional fifth component contains fish from both Whitewish and Perch (see Table 3 for details).
Figure 4: Pairwise scatter plot highlighting the true labels (left) and a pairwise scatter plot showing the predicted classifications (right) for the fish catch data.

Table 3: Cross-tabulation of the true fish species and the estimated classifications using the proposed model, Gaussian mixture models, and the mixtures of generalized hyperbolic distributions.
Species      Proposed algorithm   GMM           MixGHD
             (ARI: 0.63)          (ARI: 0.52)   (ARI: 0.54)
Bream        34                   34            34
Parkki       11                   11            11
Whitewish    6                    3, 3          3, 3
Roach        20                   20            20
Perch        56                   36, 20        36, 20
Smelt        14                   11, 3         14
Pike         17                   17            17

(Counts are shown split across each method's estimated groups: four groups for the proposed algorithm, five each for the GMM and MixGHD.)
The Australian Athletes (AIS) Dataset

The AIS dataset available in the R package DAAG (Maindonald and Braun, 2019) contains 202 observations and 13 variables comprising measurements of various characteristics of the blood, body size, and sex of the athletes. The proposed algorithm is applied to a subset of the dataset with the variables body mass index (BMI) and body fat (Bfat), as has been done previously (Vrbik and McNicholas, 2012; Lin, 2010). The algorithm is run for G = 1 to 5, and a two-component model is selected. Comparing the estimated component memberships with the gender of the athletes yields an ARI of 0.83. The contour plot of the fitted model in Figure 5 shows that the fitted model captures the density of the data fairly well. The Gaussian mixture models and mixtures of generalized hyperbolic distributions are also applied to the AIS dataset, and a summary of their performance is given in Table 4. The Gaussian mixture model selects a three-component model, whereas the mixtures of generalized hyperbolic distributions select a two-component model; however, the ARI of the proposed model is higher than that obtained by the mixtures of generalized hyperbolic distributions.

Table 4: Summary of the performances of the proposed model, the Gaussian mixture model, and mixtures of generalized hyperbolic distributions on the AIS dataset.
Model                 Estimated Groups   ARI
Proposed Algorithm    G = 2              0.83
GMM                   G = 3              0.69
MixGHD                G = 2              0.78

Figure 5: Contour plot using the estimated parameters for the AIS data.
The Crab Dataset

This is a dataset of morphological measurements on Leptograpsus crabs, available in the R package MASS (Venables and Ripley, 2002). There are 200 observations and 5 covariates in this dataset, describing 5 morphological measurements on 50 crabs of each combination of two color forms and both sexes. The five measurements are frontal lobe size (FL), rear width (RW), carapace length (CL), carapace width (CW), and body depth (BD), all taken in millimeters. The proposed algorithm is applied to this dataset for G = 1 to G = 5, and it selects a two-component model. Comparison of the estimated group memberships with the two color forms of the crabs ("B" or "O" for blue or orange) shows complete agreement (ARI = 1). The pairwise scatter plots are given in Figure 6, where the left panel shows the original measurement variables and the right panel gives the principal components (for visualization purposes only), both colored by the estimated classification of the color forms.
Figure 6: Pairwise scatter plot highlighting the predicted classification of the color forms of the crabs, in terms of the original measurements (left) and in terms of the principal components (right).

The Gaussian mixture models and the mixtures of generalized hyperbolic distributions are also applied to this dataset, and their performance is summarized in Table 5. The mixtures of generalized hyperbolic distributions select a two-component model (see Table 5).

Table 5: Summary of the performances of the proposed model, the Gaussian mixture model, and mixtures of generalized hyperbolic distributions on the crabs dataset.
Model                 Chosen   ARI (color)   ARI (gender)
GMM                   G = 3    0.02          0.72
MixGHD                G = 2    0.00          0.77
Proposed Algorithm    G = 2    1.00          0.00

However, the estimated group memberships of the mixtures of generalized hyperbolic distributions are more in agreement with a classification of the crabs by their sexes (ARI of 0.77). The Gaussian mixture model, on the other hand, yields a three-component model whose classification is also more in agreement with the classification based on the sex of the crabs (ARI of 0.72) (see Table 5).

5 Conclusion

In summary, we proposed a fully Bayesian approach for parameter estimation for mixtures of MNIG distributions. We also proposed novel approaches to simulate from two distributions: the GIG distribution and the MGIG distribution. Through simulation studies, we showed that the parameter estimates were very close to the true parameters and that close-to-perfect classification was obtained. In the simulation studies, the correct number of components was always selected using the BIC. Using both simulated and real datasets, we showed that our approach provides competitive clustering results compared to other clustering approaches. The contour plots using the estimated parameters show that the mixture distribution captures the shape of the data closely.

When the true number of groups is unknown, which is typically the case, the optimal number of components is chosen a posteriori using a model selection criterion. At present, the BIC is the most popular, but different model selection criteria can choose different models, and the BIC does not always choose the best one. In our case, model selection using the BIC seemed to perform fairly well in both the simulated and real data analyses. However, the search for a highly effective model selection criterion remains an open problem, especially when dealing with skewed data.

References
Akaike, H. (1983), "Information Measures and Model Selection," in Proceedings of the International Statistical Institute 44th Session, 277–291.

Atkinson, A. (1982), "The simulation of generalized inverse Gaussian and hyperbolic random variables," SIAM Journal on Scientific and Statistical Computing, 3, 502–515.

Barndorff-Nielsen, O., Blaesild, P., Jensen, J. L., and Jørgensen, B. (1982), "Exponential transformation models," Proceedings of the Royal Society of London. A. Mathematical and Physical Sciences, 379, 41–65.

Barndorff-Nielsen, O. E. (1997), "Normal inverse Gaussian distributions and stochastic volatility modelling," Scandinavian Journal of Statistics, 24, 1–13.

Biernacki, C., Celeux, G., and Govaert, G. (1998), "Assessing a Mixture Model for Clustering with the Integrated Classification Likelihood," Tech. rep., Institut National de Recherche en Informatique et en Automatique.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017), "Variational inference: A review for statisticians," Journal of the American Statistical Association, 112, 859–877.

Browne, R. P., and McNicholas, P. D. (2015), "A mixture of generalized hyperbolic distributions," The Canadian Journal of Statistics, 43, 176–198.

Celeux, G., Hurn, M., and Robert, C. P. (2000), "Computational and Inferential Difficulties with Mixture Posterior Distributions," Journal of the American Statistical Association, 95, 957–970.

Dagpunar, J. (1989), "An Easily Implemented Generalised Inverse Gaussian Generator," Communications in Statistics - Simulation and Computation, 18, 703–710.

Devroye, L. (2014), "Random variate generation for the generalized inverse Gaussian distribution," Statistics and Computing, 24, 239–246.

Diebolt, J., and Robert, C. P. (1994), "Estimation of finite mixture distributions through Bayesian sampling," Journal of the Royal Statistical Society: Series B (Methodology), 56, 363–375.

Fazayeli, F., and Banerjee, A. (2016), in Machine Learning and Knowledge Discovery in Databases, eds. Frasconi, P., Landwehr, N., Manco, G., and Vreeken, J., Cham: Springer International Publishing, pp. 648–664.

Fraley, C., and Raftery, A. E. (2002), "Model-Based Clustering, Discriminant Analysis, and Density Estimation," Journal of the American Statistical Association, 97, 611–631.

Franczak, B. C., Browne, R. P., and McNicholas, P. D. (2014), "Mixtures of shifted asymmetric Laplace distributions," IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 1149–1157.

Frühwirth-Schnatter, S. (2006), Finite Mixture and Markov Switching Models, Springer Science & Business Media.

Frühwirth-Schnatter, S., and Pyne, S. (2010), "Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions," Biostatistics, 11, 317–336.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013), Bayesian Data Analysis, CRC Press, 3rd ed.

Gelman, A., Rubin, D. B., et al. (1992), "Inference from iterative simulation using multiple sequences," Statistical Science, 7, 457–472.

Goswami, A. (2004), "Random Continued Fractions: A Markov Chain Approach," Economic Theory, 23, 85–105.

Hejblum, B. P., Alkhassim, C., Gottardo, R., Caron, F., Thiébaut, R., et al. (2019), "Sequential Dirichlet process mixtures of multivariate skew t-distributions for model-based clustering of flow cytometry data," The Annals of Applied Statistics, 13, 638–660.

Hörmann, W., and Leydold, J. (2014), "Generating generalized inverse Gaussian random variates," Statistics and Computing, 24, 547–557.

Hubert, L., and Arabie, P. (1985), "Comparing partitions," Journal of Classification, 2, 193–218.

Karlis, D., and Santourian, A. (2009), "Model-based clustering with non-elliptically contoured distributions," Statistics and Computing, 19, 73–83.

Keribin, C. (2000), "Consistent Estimation of the Order of Mixture Models," Sankhya: The Indian Journal of Statistics, Series A, 62, 49–66.

Lai, Y. (2009), "Generating inverse Gaussian random variates by approximation," Computational Statistics & Data Analysis, 53, 3553–3559.

Leroux, B. G. (1992), "Consistent Estimation of a Mixing Distribution," The Annals of Statistics, 20, 1350–1360.

Letac, G., and Seshadri, V. (1983), "A characterization of the generalized inverse Gaussian distribution by continued fractions," Probability Theory and Related Fields, 62, 485–489.

Letac, G., and Wesołowski, J. (2000), "An independence property for the product of GIG and gamma laws," The Annals of Probability, 28, 1371–1383.

Leydold, J., and Hörmann, W. (2011), "Generating generalized inverse Gaussian random variates by fast inversion," Computational Statistics & Data Analysis, 55, 213–217.

Lin, T. I. (2010), "Robust mixture modeling using multivariate skew t distributions," Statistics and Computing, 20, 343–356.

Lin, T. I., Lee, J. C., and Hsieh, W. J. (2007a), "Robust mixture modeling using the skew t distribution," Statistics and Computing, 17, 81–92.

Lin, T. I., Lee, J. C., and Yen, S. Y. (2007b), "Finite mixture modeling using the skew normal distribution," Statistica Sinica, 17, 909–927.

Maindonald, J. H., and Braun, W. J. (2019), DAAG: Data Analysis and Graphics Data and Functions, R package version 1.22.1.

Maleki, M., Wraith, D., and Arellano-Valle, R. B. (2019), "Robust finite mixture modeling of multivariate unrestricted skew-normal generalized hyperbolic distributions," Statistics and Computing, 29, 415–428.

Massam, H., and Wesolowski, J. (2006), "The Matsumoto-Yor property and the structure of the Wishart distribution," Journal of Multivariate Analysis, 97, 103–123.

McNeil, A. J., Frey, R., Embrechts, P., et al. (2015), "Quantitative risk management: Concepts," Economics Books.

McNicholas, S. M., McNicholas, P. D., and Browne, R. P. (2017), "A mixture of variance-gamma factor analyzers," in Big and Complex Data Analysis, Springer, pp. 369–385.

Murray, P. M., Browne, R. P., and McNicholas, P. D. (2014), "Mixtures of skew-t factor analyzers," Computational Statistics & Data Analysis, 77, 326–335.

O'Hagan, A., Murphy, T. B., Gormley, I. C., McNicholas, P. D., and Karlis, D. (2016), "Clustering with the multivariate normal inverse Gaussian distribution," Computational Statistics & Data Analysis, 93, 18–30.

Pyne, S., Hu, X., Wang, K., Rossin, E., Lin, T.-I., Baecher-Allan, L. M. M. C., McLachlan, G. J., Tamayo, P., Hafler, D. A., Jager, P. L. D., and Mesirov, J. P. (2009), "Automated high-dimensional flow cytometric data analysis," Proceedings of the National Academy of Sciences, 106, 8519–8524.

Richardson, S., and Green, P. J. (1997), "On Bayesian analysis of mixtures with an unknown number of components," Journal of the Royal Statistical Society: Series B (Methodology), 59, 731–792.

Robert, C. P. (1996), "Mixtures of distributions: inference and estimation," Markov Chain Monte Carlo in Practice, 441, 464.

Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894–902.

Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461–464.

Scrucca, L., Fop, M., Murphy, T. B., and Raftery, A. E. (2017), "mclust 5: clustering, classification and density estimation using Gaussian finite mixture models," The R Journal, 8, 205–233.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002), "Bayesian measures of model complexity and fit," Journal of the Royal Statistical Society, Series B (Methodology), 64, 583–616.

Steele, R. J., and Raftery, A. E. (2009), "Performance of Bayesian Model Selection Criteria for Gaussian Mixture Models," Tech. rep., Department of Statistics, University of Washington.

Stephens, M. (2000), "Dealing with Label Switching in Mixture Models," Journal of the Royal Statistical Society. Series B (Methodology), 62, 795–809.

Stephens, M., et al. (2000), "Bayesian analysis of mixture models with an unknown number of components - an alternative to reversible jump methods," The Annals of Statistics, 28, 40–74.

Subedi, S., and McNicholas, P. D. (2014), "Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions," Advances in Data Analysis and Classification, 8, 167–193.

Titterington, D. M., Smith, A. F., and Makov, U. E. (1985), Statistical Analysis of Finite Mixture Distributions, Wiley.

Tortora, C., ElSherbiny, A., Browne, R. P., Franczak, B. C., McNicholas, P. D., and Amos, D. D. (2018), MixGHD: Model Based Clustering, Classification and Discriminant Analysis Using the Mixture of Generalized Hyperbolic Distributions, R package version 2.2.

Venables, W. N., and Ripley, B. D. (2002), Modern Applied Statistics with S, New York: Springer, 4th ed., ISBN 0-387-95457-0.

Vrbik, I., and McNicholas, P. (2012), "Analytic calculations for the EM algorithm for multivariate skew-t mixture models," Statistics & Probability Letters, 82, 1169–1174.

Wei, Y., Tang, Y., and McNicholas, P. D. (2019), "Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data," Computational Statistics & Data Analysis, 130, 18–41.

A The matrix-GIG distribution
The MGIG distribution was introduced by Barndorff-Nielsen et al. (1982). Let $V_n$ be the Euclidean space of n × n real symmetric matrices, and let $V_n^+$ denote the cone of positive definite matrices in $V_n$.

A random matrix X taking its values in $V_n^+$ is said to follow the $\mathrm{MGIG}_n(-p, a, b)$ distribution if it has density of the form

$$f_X(x) \propto |x|^{-p-\frac{n+1}{2}} \exp\left\{-\frac{1}{2}\mathrm{Tr}(a x) - \frac{1}{2}\mathrm{Tr}(b x^{-1})\right\}.$$

The above density is well defined if $a, b \in V_n^+$ and $p \in \mathbb{R}$. According to Massam and Wesolowski (2006), the density is well defined for non-positive-definite matrices a and b if certain conditions hold, but these cases are not of interest for our purposes.

Similarly, a random matrix X taking its values in $V_n^+$ is said to follow the Wishart distribution (denoted $W_n(q, c)$) if it has density of the form

$$f_X(x) \propto |x|^{q-\frac{n+1}{2}} \exp\left\{-\frac{1}{2}\mathrm{Tr}(c x)\right\},$$

where $c \in V_n^+$ and $q > (n-1)/2$.
B Mathematical Details for Posterior Distributions

$$
\begin{aligned}
f_{U_{ig}}(u \mid Y_i = y_i) &\propto f_Y(y_i \mid U = u)\, f_U(u) \\
&\propto u^{-\frac{d+3}{2}} \exp\left\{-\frac{1}{2}\left(\frac{\delta_g^2}{u} + \gamma_g^2 u\right) - \frac{1}{2}(y_i - \mu_g - u\Delta_g\beta_g)^\top (u\Delta_g)^{-1}(y_i - \mu_g - u\Delta_g\beta_g)\right\} \\
&\propto u^{-\frac{d+3}{2}} \exp\left\{-\frac{1}{2}\left(\frac{\delta_g^2 + (y_i - \mu_g)^\top \Delta_g^{-1}(y_i - \mu_g)}{u} + \left(\gamma_g^2 + \beta_g^\top \Delta_g \beta_g\right) u\right)\right\} \\
&= u^{-\frac{d+3}{2}} \exp\left\{-\frac{1}{2}\left(\frac{q(y_i)^2}{u} + \alpha_g^2\, u\right)\right\}.
\end{aligned}
$$

Therefore, the posterior distributions of $U_{ig} \mid Y_i$ and of its reciprocal are the GIG distributions

$$U_{ig} \mid Y_i = y_i \sim \mathrm{GIG}\left(-\frac{d+1}{2},\ q(y_i),\ \alpha_g\right), \qquad U_{ig}^{-1} \mid Y_i = y_i \sim \mathrm{GIG}\left(\frac{d+1}{2},\ \alpha_g,\ q(y_i)\right).$$

The contribution from (δ_g, γ_g) to the likelihood is

$$f(\delta_g, \gamma_g \mid \cdot) \propto \delta_g^{\,\sum_i z_{ig}} \exp\left\{\delta_g \gamma_g \sum_i z_{ig} - \frac{1}{2}\left(\gamma_g^2 \sum_{i=1}^{n} z_{ig} u_{ig} + \delta_g^2 \sum_{i=1}^{n} z_{ig} u_{ig}^{-1}\right)\right\}.$$

Given the prior on δ_g of the form

$$\delta_g^2 \sim \mathrm{Gamma}\left(\frac{a^{(0)}_{g,0}+1}{2},\ a^{(0)}_{g,4} - \frac{\big(a^{(0)}_{g,0}\big)^2}{4\, a^{(0)}_{g,3}}\right),$$

the posterior distribution of δ_g is derived as follows. Completing the square in γ_g and integrating it out contributes a factor $\exp\{\delta_g^2\, a_{g,0}^2 / (4 a_{g,3})\}$, so that

$$f(\delta_g \mid \cdot) \propto \big(\delta_g^2\big)^{a_{g,0}/2} \exp\left\{-\delta_g^2\left(a_{g,4} - \frac{a_{g,0}^2}{4 a_{g,3}}\right)\right\},$$

which indicates that the posterior distribution of δ_g is

$$\delta_g^2 \sim \mathrm{Gamma}\left(\frac{a_{g,0}+1}{2},\ a_{g,4} - \frac{a_{g,0}^2}{4 a_{g,3}}\right).$$

Conditional on δ_g, a truncated normal conjugate prior is assigned to γ_g; the resulting posterior distribution is

$$\gamma_g \mid \delta_g \sim N\left(\frac{a_{g,0}\,\delta_g}{2 a_{g,3}},\ \frac{1}{2 a_{g,3}}\right)\cdot \mathbb{1}(\gamma_g > 0).$$

A conjugate joint multivariate normal prior conditional on ∆_g was assigned to (µ_g, β_g), i.e.,

$$\begin{pmatrix} \mu_g \\ \beta_g \end{pmatrix}\Bigg|\,\Delta_g \sim MVN\left(\begin{pmatrix} \mu_0 \\ \beta_0 \end{pmatrix},\ \begin{pmatrix} \Sigma_{\mu_0} & \Sigma_{\mu_0,\beta_0} \\ \Sigma_{\mu_0,\beta_0} & \Sigma_{\beta_0} \end{pmatrix}\right),$$

where

$$\mu_0 = \frac{2 a^{(0)}_{g,3}\, a^{(0)}_{g,2} - a^{(0)}_{g,0}\, a^{(0)}_{g,1}}{4 a^{(0)}_{g,3} a^{(0)}_{g,4} - \big(a^{(0)}_{g,0}\big)^2}, \qquad \beta_0 = \frac{\Delta_g^{-1}\left(2 a^{(0)}_{g,4}\, a^{(0)}_{g,1} - a^{(0)}_{g,0}\, a^{(0)}_{g,2}\right)}{4 a^{(0)}_{g,3} a^{(0)}_{g,4} - \big(a^{(0)}_{g,0}\big)^2},$$

$$\Sigma_{\mu_0} = \frac{2 a^{(0)}_{g,3}\,\Delta_g}{4 a^{(0)}_{g,3} a^{(0)}_{g,4} - \big(a^{(0)}_{g,0}\big)^2}, \qquad \Sigma_{\beta_0} = \frac{2 a^{(0)}_{g,4}\,\Delta_g^{-1}}{4 a^{(0)}_{g,3} a^{(0)}_{g,4} - \big(a^{(0)}_{g,0}\big)^2}, \qquad \Sigma_{\mu_0,\beta_0} = \frac{-a^{(0)}_{g,0}\, I_d}{4 a^{(0)}_{g,3} a^{(0)}_{g,4} - \big(a^{(0)}_{g,0}\big)^2}.$$

Therefore, the posterior is a joint multivariate normal distribution as well, of the form

$$\begin{pmatrix} \mu_g \\ \beta_g \end{pmatrix}\Bigg|\,\cdot \sim MVN\left(\begin{pmatrix} \mu^* \\ \beta^* \end{pmatrix},\ \begin{pmatrix} \Sigma_{\mu} & \Sigma_{\mu,\beta} \\ \Sigma_{\mu,\beta} & \Sigma_{\beta} \end{pmatrix}\right).$$

The joint density of (µ_g, β_g) from the likelihood is

$$f(\mu_g, \beta_g \mid \cdot) \propto \exp\Bigg\{-\beta_g^\top\mu_g \sum_i z_{ig} + \beta_g^\top \sum_{i=1}^{n} z_{ig}\, y_i + \mu_g^\top \Delta_g^{-1} \sum_{i=1}^{n} z_{ig}\, u_{ig}^{-1} y_i - \frac{1}{2}\beta_g^\top\left(\sum_{i=1}^{n} u_{ig} z_{ig}\right)\Delta_g\, \beta_g - \frac{1}{2}\mu_g^\top\left(\sum_{i=1}^{n} u_{ig}^{-1} z_{ig}\right)\Delta_g^{-1} \mu_g\Bigg\}.$$

Hence, the conditional densities of µ_g and β_g under the posterior are

$$f(\mu_g \mid \beta_g = \beta_1, \Delta_g, \cdot) \propto \exp\left\{-\frac{1}{2}\left[\mu_g^\top (2 a_{g,4} \Delta_g^{-1})\, \mu_g - \mu_g^\top v - v^\top \mu_g + v^\top \frac{\Delta_g}{2 a_{g,4}}\, v\right]\right\}, \quad v = \Delta_g^{-1} a_{g,2} - a_{g,0}\beta_1,$$

$$f(\beta_g \mid \mu_g = \mu_1, \Delta_g, \cdot) \propto \exp\left\{-\frac{1}{2}\left[\beta_g^\top (2 a_{g,3} \Delta_g)\, \beta_g - \beta_g^\top w - w^\top \beta_g + w^\top \frac{\Delta_g^{-1}}{2 a_{g,3}}\, w\right]\right\}, \quad w = a_{g,1} - a_{g,0}\mu_1.$$
Based on the conditional normal property of multivariate normal variables,

$$\left(\mu_g \mid \beta_g = \beta_1, \Delta_g, \cdot\right) \sim MVN\left(\mu^* + \Sigma_{\mu,\beta}\Sigma_{\beta}^{-1}(\beta_1 - \beta^*),\ \Sigma_{\mu} - \Sigma_{\mu,\beta}\Sigma_{\beta}^{-1}\Sigma_{\mu,\beta}\right)$$

and

$$\left(\beta_g \mid \mu_g = \mu_1, \Delta_g, \cdot\right) \sim MVN\left(\beta^* + \Sigma_{\mu,\beta}\Sigma_{\mu}^{-1}(\mu_1 - \mu^*),\ \Sigma_{\beta} - \Sigma_{\mu,\beta}\Sigma_{\mu}^{-1}\Sigma_{\mu,\beta}\right),$$

we can derive that

$$\mu^* = \frac{2 a_{g,3}\, a_{g,2} - a_{g,0}\, a_{g,1}}{4 a_{g,3} a_{g,4} - a_{g,0}^2}, \qquad \beta^* = \frac{\Delta_g^{-1}\left(2 a_{g,4}\, a_{g,1} - a_{g,0}\, a_{g,2}\right)}{4 a_{g,3} a_{g,4} - a_{g,0}^2},$$

$$\Sigma_{\mu} = \frac{2 a_{g,3}\,\Delta_g}{4 a_{g,3} a_{g,4} - a_{g,0}^2}, \qquad \Sigma_{\beta} = \frac{2 a_{g,4}\,\Delta_g^{-1}}{4 a_{g,3} a_{g,4} - a_{g,0}^2}, \qquad \Sigma_{\mu,\beta} = \frac{-a_{g,0}\, I_d}{4 a_{g,3} a_{g,4} - a_{g,0}^2}.$$

An inverse-Wishart prior is assigned to ∆_g; the resulting posterior is a matrix generalized inverse Gaussian (MGIG) distribution. The derivation is given below:

$$
\begin{aligned}
f(\Delta_g \mid \cdot) &\propto |\Delta_g|^{-\frac{1}{2}\sum_i z_{ig}} \exp\,\mathrm{Tr}\left\{-\frac{1}{2}\beta_g\left(\sum_{i=1}^{n} u_{ig} z_{ig}\right)\beta_g^\top\, \Delta_g - \frac{1}{2}\sum_{i=1}^{n} z_{ig} u_{ig}^{-1}(y_i - \mu_g)(y_i - \mu_g)^\top\, \Delta_g^{-1}\right\} \\
&\qquad\times |\Delta_g|^{-\frac{\nu + d + 1}{2}} \exp\,\mathrm{Tr}\left\{-\frac{1}{2}\Lambda\,\Delta_g^{-1}\right\} \\
&= |\Delta_g|^{-\frac{t_{0g} + \nu + d + 1}{2}} \exp\,\mathrm{Tr}\left\{-\frac{1}{2}\beta_g\beta_g^\top (2 t_{3g})\, \Delta_g - \frac{1}{2}\left(S_g + \Lambda\right)\Delta_g^{-1}\right\}.
\end{aligned}
$$

Therefore, the posterior of ∆_g is

$$\Delta_g \sim \mathrm{MGIG}_d\left(-\frac{\nu + t_{0g}}{2},\ 2 t_{3g}\,\beta_g\beta_g^\top,\ S_g + \Lambda\right).$$

C Credible Intervals for Simulation Study 1

[Figure: coverage plots of the credible intervals for the parameters γ, δ, µ, β, and ∆ across the two components in Simulation Study 1.]