A parsimonious family of multivariate Poisson-lognormal distributions for clustering multivariate count data
Sanjeena Subedi∗  Ryan Browne†

Abstract
Multivariate count data are commonly encountered through high-throughput sequencing technologies in bioinformatics, text mining, or in sports analytics. Although the Poisson distribution seems a natural fit for these count data, its multivariate extension is computationally expensive. In most cases, mutual independence among the variables is assumed; however, this fails to take into account the correlation among the variables usually observed in the data. Recently, mixtures of multivariate Poisson-lognormal (MPLN) models have been used to analyze such multivariate count measurements with a dependence structure. In the MPLN model, each count is modeled using an independent Poisson distribution conditional on a latent multivariate Gaussian variable. Due to this hierarchical structure, the MPLN model can account for over-dispersion, as opposed to the traditional Poisson distribution, and allows for correlation between the variables. Rather than relying on a Monte Carlo-based estimation framework, which is computationally inefficient, a fast variational-EM based framework is used here for parameter estimation. Further, a parsimonious family of mixtures of Poisson-lognormal distributions is proposed by decomposing the covariance matrix and imposing constraints on these decompositions. The utility of such models is shown using simulated and benchmark datasets.

Keywords: BIC, clustering, count data, MCLUST, mixture models, model-based clustering, MPLN, variational approximations, variational EM algorithm.

∗Department of Mathematical Sciences, Binghamton University, State University of New York, 4400 Vestal Parkway East, Binghamton, NY, USA 13902. e: [email protected]
†Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, ON, Canada. e: [email protected]
1 Introduction

With the emergence of next-generation sequencing technologies that provide a fast and cost-effective data generation platform, multivariate count data are becoming ubiquitous in bioinformatics. Although there has been a big explosion in the approaches for data generation and a dramatic reduction in cost and time, efficiently analyzing these complex biological data sets still remains a challenge. Cluster analysis is widely used in bioinformatics for the identification and analysis of population heterogeneity. Clustering allows us to summarize data into homogeneous groups or clusters, each with its unique characteristics. de Souto et al. (2008) did a comparative analysis of several clustering techniques on 35 different cancer gene expression data sets and concluded that a mixture model-based clustering approach showed superior performance in terms of recovering the true structure of the data as opposed to traditional distance-based approaches such as $k$-means and hierarchical methods. Model-based clustering, which utilizes mixture models, has been increasingly used in the last two decades (Gollini and Murphy, 2014; Subedi and McNicholas, 2014; Bouveyron and Brunet, 2012; Subedi et al., 2015; Browne and McNicholas, 2015; Kosmidis and Karlis, 2016; Dang et al., 2015; Vrbik and McNicholas, 2014; Tortora et al., 2019; Subedi and McNicholas, 2020). A finite mixture model assumes that the population consists of a finite mixture of subpopulations or components, each of which can be represented by a parametric model. If a $d$-dimensional random vector $\boldsymbol{Y}$ arises from a $G$-component finite mixture model, its density can be written as
$$f(\boldsymbol{y}\mid\boldsymbol{\vartheta}) = \sum_{g=1}^{G}\pi_g\,f(\boldsymbol{y}\mid\boldsymbol{\phi}_g),$$
where $\pi_g$, for $g = 1,\ldots,G$, is the mixing proportion such that $\pi_g > 0$ and $\sum_{g=1}^{G}\pi_g = 1$, $f(\cdot)$ is the component-specific probability density function or probability mass function with parameters $\boldsymbol{\phi}_g$, and $\boldsymbol{\vartheta}$ contains all the unknown parameters of the finite mixture. Depending on the nature of the data, an appropriate $f(\cdot)$ is used.

A number of initial works on univariate analysis of a count variable from next-generation sequencing data focused on a Poisson model (Marioni et al., 2008; Bullard et al., 2010; Witten et al., 2010). However, a Poisson distribution also imposes a mean-variance relationship such that the mean and variance must be equal, and most modern biological data typically exhibit over-dispersion (i.e., the observed variation is larger than what is predicted by the Poisson model; see Anders and Huber (2010)). A commonly used approach to incorporate this additional variability in the univariate case is to model the mean parameter of the Poisson distribution as a random variable and impose a distribution on it. Several different distributions have been proposed: the inverse-Gaussian distribution yields a Poisson-inverse Gaussian distribution (Holla, 1967), the inverse-Gamma distribution yields a Poisson-inverse Gamma distribution (Willmot, 1993), the lognormal distribution yields a Poisson-lognormal distribution (Bulmer, 1974), and the gamma distribution yields a gamma-Poisson distribution (i.e., the negative binomial distribution) (Greenwood and Yule, 1920; Collings and Margolin, 1985). See Karlis and Xekalaki (2005) for a detailed list of other possible priors. Another approach is to scale the mean parameter by a random variable and impose a beta prior on that scaling random variable, resulting in a Poisson-beta distribution (Gurland, 1958).
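As a quick numerical illustration of why such mixed Poisson models capture over-dispersion, the following R sketch (using arbitrary example values, not quantities from this paper) draws counts whose log-mean is Gaussian and compares them with plain Poisson counts having the same overall mean.

```r
# Over-dispersion from a mixed Poisson model: a univariate Poisson-lognormal example
set.seed(1)
n      <- 1e5
theta  <- rnorm(n, mean = 2, sd = 0.5)         # log-mean follows a Gaussian distribution
y_pln  <- rpois(n, lambda = exp(theta))        # Poisson-lognormal counts
y_pois <- rpois(n, lambda = mean(exp(theta)))  # plain Poisson with the same overall mean

c(mean_pln = mean(y_pln), var_pln = var(y_pln))      # variance clearly exceeds the mean
c(mean_pois = mean(y_pois), var_pois = var(y_pois))  # variance roughly equals the mean
```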
Amongst these mixed-Poisson approaches, the most popular for statistical methods pertaining to next-generation sequencing data is the gamma-Poisson distribution, also known as the negative binomial distribution (Robinson et al., 2010; Yu et al., 2013; Love et al., 2014; Dong et al., 2016). In a gamma-Poisson distribution, $\lambda$ is modeled as a random variable from the gamma distribution such that $Y_i\mid\lambda_i\sim\text{Poisson}(\lambda_i)$ and $\lambda_i\sim\text{Gamma}(\alpha,\beta)$. The resulting marginal distribution of $Y_i$ is then a negative binomial distribution (Greenwood and Yule, 1920; Collings and Margolin, 1985) with mean $E(Y) = \alpha\beta$, dispersion parameter $\phi = 1/\alpha$, and variance $V(Y) = E(Y) + E(Y)^2\phi > E(Y)$. Note that this is a univariate distribution.

While the multivariate extensions of these distributions seem a natural choice for modelling multivariate count data, these models are computationally expensive and/or they impose restrictions on the covariance structure. For example, in addition to the computational burden (Brijs et al., 2004), the multivariate Poisson distribution of Teicher (1954) and Campbell (1934) can only allow for positive correlation among the variables (Karlis and Meligkotsidou, 2007; Inouye et al., 2017). Several different formulations of a multivariate version of the Poisson-beta distribution were studied by Sarabia and Gómez-Déniz (2011); these formulations are either computationally intensive, especially for high-dimensional data, have restrictions on the covariance structure, or the parameters in the marginal distributions are not free (i.e., they share the hyperparameters). Hence, independence between variables is assumed for multivariate analysis in most cases (Rau et al., 2015; Si et al., 2014; Papastamoulis et al., 2016), which fails to take into account the correlation between variables.

Silva et al. (2019) developed a mixture of multivariate Poisson-lognormal (MPLN; Aitchison and Ho, 1989) distributions to cluster multivariate count measurements. In MPLN models, the counts are modeled using a hierarchical structure such that
$$Y_{ij}\mid\theta_{ij}\sim\text{Poisson}(e^{\theta_{ij}})\quad\text{and}\quad\boldsymbol{\theta}_i = (\theta_{i1},\ldots,\theta_{id}) = (\log\lambda_{i1},\ldots,\log\lambda_{id})\sim N_d(\boldsymbol{\mu},\boldsymbol{\Sigma}),$$
where $N_d(\boldsymbol{\mu},\boldsymbol{\Sigma})$ denotes a $d$-dimensional Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Due to this hierarchical structure, the MPLN model can account for over-dispersion, as opposed to the traditional Poisson distribution, and allows for correlation. An attractive property of the MPLN distribution is that it can support both positive and negative correlation. However, the posterior of the latent variables does not have a closed form and, therefore, these models came with a heavy computational cost as they relied on Bayesian MCMC-based approaches, discussed further in Section 2.1. Bayesian MCMC-based approaches for models with latent variables can be computationally intensive (Blei et al., 2017) and the computational time is further compounded in model-based clustering, where the expected value of the latent variable needs to be computed several times (further detailed in Section 2.2). In this paper, we propose a computationally efficient framework for clustering using mixtures of multivariate Poisson-lognormal distributions based on variational Gaussian approximations. We also impose constraints on the covariance matrices of the latent variable using an eigen-decomposition, which results in a parsimonious family of models.
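To make the hierarchical construction concrete, the R sketch below simulates from a bivariate MPLN model and checks that the counts inherit the sign of the latent correlation; the mean vector and covariance matrix are arbitrary illustrative values, not values used elsewhere in the paper.

```r
set.seed(1)
library(MASS)  # for mvrnorm

# Latent Gaussian layer with a negative correlation between the two dimensions
mu    <- c(2, 1.5)
Sigma <- matrix(c(0.5, -0.3,
                 -0.3,  0.4), nrow = 2)

n     <- 5000
theta <- mvrnorm(n, mu = mu, Sigma = Sigma)                  # theta_i ~ N_2(mu, Sigma)
Y     <- matrix(rpois(2 * n, lambda = exp(theta)), nrow = n)  # Y_ij | theta_ij ~ Poisson(exp(theta_ij))

cor(Y)  # off-diagonal is negative, mirroring Sigma[1, 2] < 0
apply(Y, 2, function(y) c(mean = mean(y), var = var(y)))  # variance exceeds the mean in each dimension
```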
Section 2.3 provides the variational EM estimation algorithm, Section 3 provides results on simulations with comparisons to a popular Poisson-based mixture approach and results on a real data set, and Section 4 concludes with some conclusions and future directions.

2 Methodology

2.1 The multivariate Poisson-lognormal distribution
In the multivariate Poisson-lognormal (MPLN; Aitchison and Ho, 1989) distribution, the counts are modeled using a hierarchical structure such that
$$Y_{ij}\mid\theta_{ij}\sim\text{Poisson}(e^{\theta_{ij}})\quad\text{and}\quad\boldsymbol{\theta}_i\sim N_d(\boldsymbol{\mu},\boldsymbol{\Sigma}),$$
where $\boldsymbol{\theta}_i = (\theta_{i1},\ldots,\theta_{id})$ and $N_d(\boldsymbol{\mu},\boldsymbol{\Sigma})$ denotes a $d$-dimensional Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. The moments of $\boldsymbol{Y}$ (Aitchison and Ho, 1989; Georgescu et al., 2014) can be found using the law of iterated expectation:
$$E(Y_j) = E[E(Y_j\mid\theta_j)] = e^{\mu_j + \frac{1}{2}\Sigma_{jj}},$$
$$V(Y_j) = E[V(Y_j\mid\theta_j)] + V[E(Y_j\mid\theta_j)] = E(Y_j) + E(Y_j)^2\left(e^{\Sigma_{jj}} - 1\right),$$
$$\text{cov}(Y_j, Y_k) = E[\text{cov}(Y_j, Y_k\mid\theta_j,\theta_k)] + \text{cov}[E(Y_j\mid\theta_j), E(Y_k\mid\theta_k)] = E(Y_j)E(Y_k)\left(e^{\Sigma_{jk}} - 1\right),$$
where $Y_j$ is the $j$th entry of $\boldsymbol{Y}$, $\mu_j$ is the $j$th entry of $\boldsymbol{\mu}$, $\Sigma_{jk}$ is the entry in the $j$th row and $k$th column of the matrix $\boldsymbol{\Sigma}$, and $j\neq k$. Due to this hierarchical structure, the MPLN model can account for over-dispersion and also allows for correlation.

The marginal probability density of $\boldsymbol{Y}$ can be written as
$$f_{\boldsymbol{Y}}(\boldsymbol{y}) = \int_{\mathbb{R}^d}\left[\prod_{j=1}^{d} p(y_j\mid\theta_j)\right]\phi_d(\boldsymbol{\theta}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})\,d\boldsymbol{\theta},$$
where $\theta_j$ and $y_j$ are the $j$th elements of $\boldsymbol{\theta}$ and $\boldsymbol{y}$, respectively, $p(\cdot)$ is the probability mass function of the Poisson distribution with mean $\lambda_j = e^{\theta_j}$, and $\phi_d(\cdot)$ is the probability density function of the $d$-dimensional Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Note that the marginal distribution of $\boldsymbol{Y}$ involves multiple integrals and cannot be further simplified. Hence, Aitchison and Ho (1989) suggested that maximum likelihood estimation of the parameters requires a mix of Newton-Raphson and steepest ascent methods. An alternate approach is to use an expectation-maximization (EM) algorithm (Dempster et al., 1977), which is an iterative approach for maximizing the likelihood when the data are incomplete or are treated as incomplete. In the MPLN distribution, the observed variables are the counts $\boldsymbol{Y}$ and the missing data are the latent variables $\boldsymbol{\theta}$. The EM algorithm comprises two iterative steps: an E-step and an M-step. In the E-step, the expected value of the complete-data (i.e., observed and missing data) log-likelihood, given the observed data and the current parameter estimates (i.e., updates from the previous iteration), is computed; it is then maximized in the subsequent M-step. These steps are repeated until convergence to obtain the maximum likelihood estimates of the parameters. In order to compute the expected value of the complete-data log-likelihood, $E(\boldsymbol{\theta}\mid\boldsymbol{y})$ and $E(\boldsymbol{\theta}\boldsymbol{\theta}'\mid\boldsymbol{y})$ need to be computed. The conditional distribution of $\boldsymbol{\theta}$ given $\boldsymbol{Y}$ is
$$f_{\boldsymbol{\theta}\mid\boldsymbol{Y}}(\boldsymbol{\theta}\mid\boldsymbol{y},\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{\left[\prod_{j=1}^{d} p(y_j\mid\theta_j)\right]\phi_d(\boldsymbol{\theta}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})}{f_{\boldsymbol{Y}}(\boldsymbol{y}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})}.$$
However, as mentioned above, the marginal distribution of $\boldsymbol{Y}$ involves multiple integrals and cannot be further simplified. Georgescu et al. (2014) proposed maximum likelihood estimation of the parameters of the MPLN distribution in the Monte Carlo EM (MCEM) framework (Wei and Tanner, 1990); they utilized a Monte Carlo EM algorithm using importance sampling to find the empirical expected value of the complete-data log-likelihood. Chagneau et al. (2011) utilized an MCMC-based approach for parameter estimation. However, it comes with a heavy computational overhead, which is further exacerbated as the dimensionality and the sample size rise (Chagneau et al., 2011).
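The closed-form moments above are easy to verify numerically. The following R sketch simulates a large sample from a single MPLN distribution (again with arbitrary illustrative parameters) and compares the empirical mean, variance, and covariance of the counts with the expressions given above.

```r
set.seed(42)
library(MASS)

mu    <- c(1.0, 2.0)
Sigma <- matrix(c(0.30, 0.15,
                  0.15, 0.40), nrow = 2)

n     <- 1e5
theta <- mvrnorm(n, mu, Sigma)
Y     <- matrix(rpois(length(theta), exp(theta)), nrow = n)

# Closed-form moments of the MPLN distribution
EY   <- exp(mu + diag(Sigma) / 2)               # E(Y_j) = exp(mu_j + Sigma_jj / 2)
VY   <- EY + EY^2 * (exp(diag(Sigma)) - 1)      # V(Y_j) = E(Y_j) + E(Y_j)^2 (exp(Sigma_jj) - 1)
CY12 <- EY[1] * EY[2] * (exp(Sigma[1, 2]) - 1)  # cov(Y_1, Y_2)

rbind(theoretical = c(EY, VY, CY12),
      empirical   = c(colMeans(Y), apply(Y, 2, var), cov(Y[, 1], Y[, 2])))
```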
2.2 Mixtures of multivariate Poisson-lognormal distributions

Silva et al. (2019) proposed a mixture of multivariate Poisson-lognormal distributions to cluster multivariate count measurements. A $G$-component mixture of MPLN distributions can be written as
$$f(\boldsymbol{y}\mid\boldsymbol{\vartheta}) = \sum_{g=1}^{G}\pi_g\,f_{\boldsymbol{Y}}(\boldsymbol{y}\mid\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g) = \sum_{g=1}^{G}\pi_g\int_{\mathbb{R}^d}\left[\prod_{j=1}^{d} p(y_{ij}\mid\theta_{ij})\right]\phi_d(\boldsymbol{\theta}_i\mid\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)\,d\boldsymbol{\theta}_i,$$
where $\boldsymbol{\vartheta}$ denotes all model parameters and $f_{\boldsymbol{Y}}(\boldsymbol{y}\mid\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)$ denotes the distribution of the $g$th component with parameters $\boldsymbol{\mu}_g$ and $\boldsymbol{\Sigma}_g$. In model-based clustering, an additional component membership indicator variable $\boldsymbol{Z}$ is introduced, which is assumed to be unknown, with $Z_{ig} = 1$ if the $i$th observation belongs to group $g$ and $Z_{ig} = 0$ otherwise. Hence, the complete data now comprise the observed expression levels $\boldsymbol{y}$, the underlying latent variables $\boldsymbol{\theta}$, and the unknown group memberships $\boldsymbol{z}$. Therefore, the complete-data log-likelihood is
$$l_c(\boldsymbol{\mu},\boldsymbol{\Sigma}\mid\boldsymbol{y},\boldsymbol{\theta},\boldsymbol{z}) = \sum_{g=1}^{G}\sum_{i=1}^{n} z_{ig}\left[\log\phi_d(\boldsymbol{\theta}_i\mid\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g) + \sum_{j=1}^{d}\log p(y_{ij}\mid\theta_{ij})\right].$$
In order to utilize the EM framework for parameter estimation for mixtures of MPLN distributions, finding the expected value of the complete-data log-likelihood requires the conditional expectations $E(Z_{ig}\boldsymbol{\theta}_i\mid\boldsymbol{y}_i,\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)$ and $E(Z_{ig}\boldsymbol{\theta}_i\boldsymbol{\theta}_i'\mid\boldsymbol{y}_i,\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)$. Silva et al. (2019) utilized an MCMC-EM algorithm using Stan (Hoffman and Gelman, 2014; Stan Development Team, 2015) to compute the required conditional expectations.
Stan uses the No-U-Turn Sampler (NUTS; Hoffman and Gelman, 2014), an adaptive variant of Hamiltonian Monte Carlo (HMC; Neal et al., 2011) that requires no tuning parameters and can efficiently sample from a posterior distribution when the parameters are correlated. However, in a clustering context, even efficient sampling techniques such as NUTS come with a heavy computational burden, as the sampling needs to be conducted at every iteration of the MCMC-EM algorithm after updating the model parameters and the component indicator variable. Additionally, Bayesian techniques based on MCMC sampling algorithms also come with increased computational overhead and possible difficulty in determining convergence as the complexity of the model increases. When dealing with such complex models, convergence can also be quite slow. This crucial aspect is partly why applications of such models have been limited to small dimensions (Georgescu et al., 2014). Additionally, when the true number of groups, i.e., the number of components of the mixture model, is unknown, the MCMC-EM algorithm must be employed for every possible number of components, and the optimal number of components is chosen in conjunction with a model selection criterion, which might entail extreme computational cost.
2.3 Parameter estimation using variational Gaussian approximations

Variational approximation (Wainwright et al., 2008) is an approximate inference technique which has been very popular in machine learning. It presents an alternative parameter estimation framework for the MPLN distribution by using a computationally convenient approximating density in place of the more complex but 'true' posterior density. Using computationally convenient Gaussian densities, complex posterior distributions are approximated by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities, thereby reducing the computational overhead. Several studies have shown that it delivers accurate approximations (Archambeau et al., 2007; Challis and Barber, 2013; Arridge et al., 2018). Khan et al. (2013) developed a variational inference framework for several non-conjugate latent Gaussian models. Arridge et al. (2018) also developed a variational Gaussian approximation (VGA) to the posterior distribution arising from the univariate Poisson model with a Gaussian prior, derived an explicit expression for the lower bound, and showed the existence and uniqueness of the optimal Gaussian approximation.

Here, we propose a variational Gaussian approximation to the marginal distribution of the observed variable, i.e., $f(\boldsymbol{y})$, in the mixtures of MPLN distributions. Suppose we have an approximating density $q(\boldsymbol{\theta})$; the marginal log-likelihood can be written as
$$\log f(\boldsymbol{y}) = \int_{\mathbb{R}^d}\log f(\boldsymbol{y})\,q(\boldsymbol{\theta})\,d\boldsymbol{\theta} = \int_{\mathbb{R}^d}\log\frac{f(\boldsymbol{y},\boldsymbol{\theta})/q(\boldsymbol{\theta})}{f(\boldsymbol{\theta}\mid\boldsymbol{y})/q(\boldsymbol{\theta})}\,q(\boldsymbol{\theta})\,d\boldsymbol{\theta} = \int_{\mathbb{R}^d}\left[\log f(\boldsymbol{y},\boldsymbol{\theta}) - \log q(\boldsymbol{\theta})\right]q(\boldsymbol{\theta})\,d\boldsymbol{\theta} + D_{KL}(q\,\|\,f) = F(q,\boldsymbol{y}) + D_{KL}(q\,\|\,f),$$
where $D_{KL}(q\,\|\,f) = \int_{\mathbb{R}^d} q(\boldsymbol{\theta})\log\frac{q(\boldsymbol{\theta})}{f(\boldsymbol{\theta}\mid\boldsymbol{y})}\,d\boldsymbol{\theta}$ is the Kullback-Leibler (KL) divergence between $f(\boldsymbol{\theta}\mid\boldsymbol{y})$ and the approximating distribution $q(\boldsymbol{\theta})$, and $F(q,\boldsymbol{y}) = \int_{\mathbb{R}^d}\left[\log f(\boldsymbol{y},\boldsymbol{\theta}) - \log q(\boldsymbol{\theta})\right]q(\boldsymbol{\theta})\,d\boldsymbol{\theta}$ is our evidence lower bound (ELBO). To minimize the KL divergence, we maximize the ELBO. In the VGA, $q(\boldsymbol{\theta})$ is assumed to be a Gaussian distribution. Assuming $q(\boldsymbol{\theta}) = N(\boldsymbol{m},\boldsymbol{S})$, the lower bound for $\log f(\boldsymbol{y})$ becomes
$$F(q,\boldsymbol{y}) = -\frac{1}{2}(\boldsymbol{m}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\boldsymbol{m}-\boldsymbol{\mu}) - \frac{1}{2}\text{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{S}) + \frac{1}{2}\log|\boldsymbol{S}| - \frac{1}{2}\log|\boldsymbol{\Sigma}| + \frac{d}{2} + \boldsymbol{m}'\boldsymbol{y} - \sum_{j=1}^{d}\left(e^{m_j + \frac{1}{2}S_{jj}} + \log(y_j!)\right).$$
This lower bound is strictly jointly concave with respect to the mean ($\boldsymbol{m}$) and variance ($\boldsymbol{S}$) of the approximating distribution and hence, similar to Arridge et al. (2018), parameter estimation can be obtained via Newton's method and a fixed-point method. This alleviates the need to use an MCMC-based approach to obtain samples from the posterior distribution of the latent variable $\boldsymbol{\theta}$, along with the computational burden that comes with such approaches. Additionally, the EM framework using the variational Gaussian approximation is monotonic, as opposed to the MCMC-based EM. A similar variational framework for the MPLN distribution was recently proposed by Chiquet et al. (2019) for network models.

The complete-data log-likelihood of the mixtures of MPLN distributions can be written as
$$l_c(\boldsymbol{\vartheta}\mid\boldsymbol{y}) = \sum_{g=1}^{G}\sum_{i=1}^{n} z_{ig}\log\pi_g + \sum_{g=1}^{G}\sum_{i=1}^{n} z_{ig}\log f(\boldsymbol{y}_i\mid\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g) = \sum_{g=1}^{G}\sum_{i=1}^{n} z_{ig}\log\pi_g + \sum_{g=1}^{G}\sum_{i=1}^{n} z_{ig}\left[\int_{\mathbb{R}^d}\left[\log f(\boldsymbol{y}_i,\boldsymbol{\theta}_{ig}) - \log q(\boldsymbol{\theta}_{ig})\right]q(\boldsymbol{\theta}_{ig})\,d\boldsymbol{\theta}_{ig} + D_{KL}(q_{ig}\,\|\,f_{ig})\right],$$
where $D_{KL}(q_{ig}\,\|\,f_{ig}) = \int_{\mathbb{R}^d} q(\boldsymbol{\theta}_{ig})\log\frac{q(\boldsymbol{\theta}_{ig})}{f(\boldsymbol{\theta}_i\mid\boldsymbol{y}_i, Z_{ig}=1)}\,d\boldsymbol{\theta}_{ig}$ is the Kullback-Leibler (KL) divergence between $f(\boldsymbol{\theta}_i\mid\boldsymbol{y}_i, Z_{ig}=1)$ and the approximating distribution $q(\boldsymbol{\theta}_{ig})$. Assuming $q(\boldsymbol{\theta}_{ig}) = N(\boldsymbol{m}_{ig},\boldsymbol{S}_{ig})$, the log-likelihood becomes
$$l_c(\boldsymbol{\vartheta}\mid\boldsymbol{y}) = \sum_{g=1}^{G}\sum_{i=1}^{n} z_{ig}\log\pi_g + \sum_{g=1}^{G}\sum_{i=1}^{n} z_{ig}\left[F(q_{ig},\boldsymbol{y}_i) + D_{KL}(q_{ig}\,\|\,f_{ig})\right],$$
where the ELBO for each observation $\boldsymbol{y}_i$ is
$$F(q_{ig},\boldsymbol{y}_i) = \frac{1}{2}\log|\boldsymbol{S}_{ig}| - \frac{1}{2}(\boldsymbol{m}_{ig}-\boldsymbol{\mu}_g)'\boldsymbol{\Sigma}_g^{-1}(\boldsymbol{m}_{ig}-\boldsymbol{\mu}_g) - \frac{1}{2}\text{tr}(\boldsymbol{\Sigma}_g^{-1}\boldsymbol{S}_{ig}) - \frac{1}{2}\log|\boldsymbol{\Sigma}_g| + \frac{d}{2} + \boldsymbol{m}_{ig}'\boldsymbol{y}_i - \sum_{j=1}^{d}\left(e^{m_{igj} + \frac{1}{2}S_{ig,jj}} + \log(y_{ij}!)\right),$$
where $m_{igj}$ is the $j$th element of $\boldsymbol{m}_{ig}$ and $S_{ig,jj}$ is the $j$th diagonal element of the matrix $\boldsymbol{S}_{ig}$. The variational parameters that maximize the ELBO will minimize the KL divergence between the true posterior and the approximating density.

Parameter estimation can be done in an iterative EM-type approach in which the following steps are iterated.

1. Conditional on the variational parameters $\boldsymbol{m}_{ig}$ and $\boldsymbol{S}_{ig}$ and on $\boldsymbol{\mu}_g$ and $\boldsymbol{\Sigma}_g$, $E(Z_{ig})$ is computed. Given $\boldsymbol{\mu}_g$ and $\boldsymbol{\Sigma}_g$,
$$E(Z_{ig}\mid\boldsymbol{y}_i) = \frac{\pi_g f(\boldsymbol{y}_i\mid\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)}{\sum_{h=1}^{G}\pi_h f(\boldsymbol{y}_i\mid\boldsymbol{\mu}_h,\boldsymbol{\Sigma}_h)}.$$
Note that this involves the marginal distribution of $\boldsymbol{Y}$, which is difficult to compute. Hence, we use an approximation of $E(Z_{ig})$ in which the marginal density is replaced by the exponent of the ELBO, such that
$$\widehat{Z}_{ig} \overset{\text{def}}{=} \frac{\pi_g\exp\left[F(q_{ig},\boldsymbol{y}_i)\right]}{\sum_{h=1}^{G}\pi_h\exp\left[F(q_{ih},\boldsymbol{y}_i)\right]}.$$
This approximation is computationally convenient, and a similar framework was utilized by Tang et al. (2015) and Gollini and Murphy (2014). This approximation works well in our simulation studies and real data analysis.

2. Given $\widehat{Z}_{ig}$, the variational parameters $\boldsymbol{m}_{ig}$ and $\boldsymbol{S}_{ig}$ are updated conditional on $\boldsymbol{\mu}_g$ and $\boldsymbol{\Sigma}_g$. The lower bound is strictly jointly concave with respect to $\boldsymbol{m}_{ig}$ and $\boldsymbol{S}_{ig}$ of the approximating distribution. Therefore, we can update $\boldsymbol{m}_{ig}$ and $\boldsymbol{S}_{ig}$ as follows (a short implementation sketch is given after this list):

(a) the fixed-point update for $\boldsymbol{S}_{ig}$ is
$$\boldsymbol{S}_{ig}^{(t+1)} = \left\{\boldsymbol{\Sigma}_g^{-1} + \mathbf{I}_d\odot\exp\left[\boldsymbol{m}_{ig}^{(t)} + \frac{1}{2}\,\text{diag}\left(\boldsymbol{S}_{ig}^{(t)}\right)\right]\mathbf{1}_d'\right\}^{-1},$$
where the vector function $\exp[\boldsymbol{a}] = (e^{a_1},\ldots,e^{a_d})'$ exponentiates each element of the $d$-dimensional vector $\boldsymbol{a}$, $\text{diag}(\boldsymbol{S}) = (S_{11},\ldots,S_{dd})'$ puts the diagonal elements of the $d\times d$ matrix $\boldsymbol{S}$ into a $d$-dimensional vector, $\odot$ denotes the Hadamard product, and $\mathbf{1}_d$ is a $d$-dimensional vector of ones;

(b) Newton's method to update $\boldsymbol{m}_{ig}$ is
$$\boldsymbol{m}_{ig}^{(t+1)} = \boldsymbol{m}_{ig}^{(t)} - \boldsymbol{S}_{ig}^{(t+1)}\left\{\exp\left[\boldsymbol{m}_{ig}^{(t)} + \frac{1}{2}\,\text{diag}\left(\boldsymbol{S}_{ig}^{(t+1)}\right)\right] + \boldsymbol{\Sigma}_g^{-1}\left(\boldsymbol{m}_{ig}^{(t)} - \boldsymbol{\mu}_g\right) - \boldsymbol{y}_i\right\};$$
and

3. given $\widehat{z}_{ig}$ and the variational parameters $\boldsymbol{m}_{ig}$ and $\boldsymbol{S}_{ig}$, the parameters $\pi_g$, $\boldsymbol{\mu}_g$, and $\boldsymbol{\Sigma}_g$ are updated as
$$\widehat{\pi}_g = \frac{\sum_{i=1}^{n}\widehat{Z}_{ig}}{n},\qquad \widehat{\boldsymbol{\mu}}_g = \frac{\sum_{i=1}^{n}\widehat{Z}_{ig}\boldsymbol{m}_{ig}}{\sum_{i=1}^{n}\widehat{Z}_{ig}},\qquad \widehat{\boldsymbol{\Sigma}}_g = \frac{\sum_{i=1}^{n}\widehat{Z}_{ig}(\boldsymbol{m}_{ig}-\widehat{\boldsymbol{\mu}}_g)(\boldsymbol{m}_{ig}-\widehat{\boldsymbol{\mu}}_g)'}{\sum_{i=1}^{n}\widehat{Z}_{ig}} + \frac{\sum_{i=1}^{n}\widehat{Z}_{ig}\boldsymbol{S}_{ig}}{\sum_{i=1}^{n}\widehat{Z}_{ig}}.$$
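To illustrate the shape of these computations, the R sketch below implements the fixed-point and Newton updates, the per-observation ELBO, and the approximate responsibilities for given values of the component parameters. It is only a minimal illustration of the equations above, not the authors' implementation, and the function and argument names are our own.

```r
# Variational Gaussian updates for one observation y and one component (mu, Sigma):
# fixed-point step for the variational covariance S, Newton step for the variational mean m.
update_variational <- function(y, mu, Sigma, m, S, n_iter = 50) {
  Sigma_inv <- solve(Sigma)
  for (t in seq_len(n_iter)) {
    lam <- exp(m + 0.5 * diag(S))       # element-wise exp(m + diag(S)/2)
    S   <- solve(Sigma_inv + diag(lam))  # fixed-point update for S
    lam <- exp(m + 0.5 * diag(S))
    m   <- m - as.vector(S %*% (lam + Sigma_inv %*% (m - mu) - y))  # Newton update for m
  }
  list(m = m, S = S)
}

# Evidence lower bound F(q, y) for one observation, following the expression above
elbo <- function(y, mu, Sigma, m, S) {
  d         <- length(y)
  Sigma_inv <- solve(Sigma)
  quad      <- as.numeric(t(m - mu) %*% Sigma_inv %*% (m - mu))
  -0.5 * quad - 0.5 * sum(diag(Sigma_inv %*% S)) +
    0.5 * as.numeric(determinant(S)$modulus) -
    0.5 * as.numeric(determinant(Sigma)$modulus) + d / 2 +
    sum(m * y) - sum(exp(m + 0.5 * diag(S)) + lgamma(y + 1))
}

# Approximate E-step: responsibilities from the per-component ELBOs (row-wise softmax)
responsibilities <- function(elbo_mat, pi_g) {  # elbo_mat: n x G matrix of F(q_ig, y_i)
  w <- sweep(exp(sweep(elbo_mat, 1, apply(elbo_mat, 1, max))), 2, pi_g, `*`)
  w / rowSums(w)
}
```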
2.4 A parsimonious family of mixtures of MPLN distributions

For the mixtures of MPLN distributions, the number of free parameters in the covariance matrices of the latent variable is $Gd(d+1)/2$, i.e., the number of parameters increases quadratically with $d$. Banfield and Raftery (1993) proposed an eigen-decomposition of the covariance matrices of mixtures of Gaussian distributions such that $\boldsymbol{\Sigma}_g = \lambda_g\boldsymbol{D}_g\boldsymbol{A}_g\boldsymbol{D}_g'$, where $\boldsymbol{D}_g$ is the matrix of eigenvectors, $\boldsymbol{A}_g$ is a diagonal matrix proportional to the eigenvalues of $\boldsymbol{\Sigma}_g$ such that $|\boldsymbol{A}_g| = 1$, and $\lambda_g$ is the associated constant of proportionality. For a Gaussian distribution, this decomposition results in a geometric interpretation of the constraints such that $\lambda_g$ controls the cluster volume, $\boldsymbol{A}_g$ controls the cluster shape, and $\boldsymbol{D}_g$ controls the cluster orientation. Celeux and Govaert (1995) imposed constraints on these parameters of the covariance matrix to be equal or different among groups, resulting in a family of 14 models known as the Gaussian parsimonious clustering models (GPCM). Parameter estimation in an EM framework for 12 of these 14 GPCM models, and in the MM framework of Browne and McNicholas (2014) for the remaining two models, is available in the R package mclust (Scrucca et al., 2016). Here, we propose an eigen-decomposition of the component-specific covariance matrices of the latent variable $\boldsymbol{\theta}$ to introduce parsimony. The geometric interpretation does not hold for the observed data and, therefore, we only focus on the subset of the GPCM family of models given in Table 1.

Table 1: Geometric interpretation of the subset of the models considered in our study, obtained via eigen-decomposition of $\boldsymbol{\Sigma}_g$.

Model (Covariance)            Volume     Shape      Orientation    Free covariance parameters
EII  ($\lambda\mathbf{I}$)                     Equal      Spherical  -              1
VII  ($\lambda_g\mathbf{I}$)                   Variable   Spherical  -              $G$
EEI  ($\lambda\boldsymbol{A}$)                 Equal      Equal      Axis-aligned   $d$
VVI  ($\lambda_g\boldsymbol{A}_g$)             Variable   Variable   Axis-aligned   $dG$
EEE  ($\lambda\boldsymbol{D}\boldsymbol{A}\boldsymbol{D}'$)        Equal      Equal      Equal          $d(d+1)/2$
VVE  ($\lambda_g\boldsymbol{D}\boldsymbol{A}_g\boldsymbol{D}'$)    Variable   Variable   Equal          $d(d+1)/2 + (G-1)d$
EEV  ($\lambda\boldsymbol{D}_g\boldsymbol{A}\boldsymbol{D}_g'$)    Equal      Equal      Variable       $Gd(d+1)/2 - (G-1)d$
VVV  ($\lambda_g\boldsymbol{D}_g\boldsymbol{A}_g\boldsymbol{D}_g'$) Variable  Variable   Variable       $Gd(d+1)/2$

Recall that $\text{cov}(Y_j, Y_k) = E(Y_j)E(Y_k)(e^{\Sigma_{jk}} - 1)$; therefore, independence among the latent variables $\boldsymbol{\theta}$ results in independence among the observed variables $\boldsymbol{Y}$. We focus on EII, VII, EEI, and VVI to recover a diagonal covariance structure when the variables are independent. Additionally, to introduce some parsimony compared to the VVV model, we also consider some additional non-diagonal constraints, namely EEE, VVE, and EEV. Given $\boldsymbol{m}_{ig}$ and $\boldsymbol{S}_{ig}$, and using the component-specific sample covariance matrix of the $\boldsymbol{m}_{ig}$,
$$\widehat{\boldsymbol{\Sigma}}_{\text{sample}} = \frac{\sum_{i=1}^{n}\widehat{Z}_{ig}(\boldsymbol{m}_{ig}-\widehat{\boldsymbol{\mu}}_g)(\boldsymbol{m}_{ig}-\widehat{\boldsymbol{\mu}}_g)'}{\sum_{i=1}^{n}\widehat{Z}_{ig}} + \frac{\sum_{i=1}^{n}\widehat{Z}_{ig}\boldsymbol{S}_{ig}}{\sum_{i=1}^{n}\widehat{Z}_{ig}},$$
the parameter estimates are analogous to their Gaussian counterparts. See Celeux and Govaert (1995) and Browne and McNicholas (2014) for details.
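The volume/shape/orientation decomposition underlying Table 1 is straightforward to compute for a single covariance matrix. The R sketch below factors an illustrative covariance matrix into $\lambda$, $\boldsymbol{A}$, and $\boldsymbol{D}$ and reconstructs it; the parsimonious models simply constrain which of these pieces are shared across components.

```r
# Decompose a covariance matrix as Sigma = lambda * D %*% A %*% t(D),
# with lambda the volume, A the shape (|A| = 1), and D the orientation (eigenvectors).
decompose_cov <- function(Sigma) {
  ev     <- eigen(Sigma, symmetric = TRUE)
  d      <- nrow(Sigma)
  lambda <- prod(ev$values)^(1 / d)   # volume: geometric mean of the eigenvalues
  A      <- diag(ev$values / lambda)  # shape: normalized eigenvalues, det(A) = 1
  list(lambda = lambda, A = A, D = ev$vectors)
}

Sigma <- matrix(c(0.30, 0.15, 0.15, 0.40), nrow = 2)  # illustrative values
dec   <- decompose_cov(Sigma)
dec$lambda * dec$D %*% dec$A %*% t(dec$D)             # reconstructs Sigma
```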
2.5 Initialization, Model Selection, and Performance Assessment

The EM algorithm relies heavily on the initial values (Biernacki et al., 2003) and hence, good starting values are crucial. While initialization of the clusters using $k$-means is one of the most widely used approaches for symmetric distributions, it may not be appropriate for over-dispersed count datasets, as $k$-means initialization is equivalent to fitting a Gaussian mixture model with spherical clusters, which is not ideal for over-dispersed multivariate counts. Therefore, we utilize a small-EM initialization for our analysis. For each $G$, the initialization consists of fitting the EII model to 20 different random partitions of the dataset into $G$ clusters and running 20 iterations of our variational EM algorithm. The parameters associated with the largest log-likelihood among the 20 random partitions are then used to initialize all eight models, which are then run until convergence. Convergence of the algorithm for these models is determined using a modified Aitken acceleration criterion (Aitken, 1926), which is considered satisfied when $l_{\infty}^{(m+1)} - l_{\infty}^{(m)}$ is positive and smaller than some $\epsilon$, where $l^{(m)}$ is the value of the log-likelihood at iteration $m$ and $l_{\infty}^{(m+1)}$ is an asymptotic estimate of the log-likelihood given by
$$l_{\infty}^{(m+1)} = l^{(m)} + \frac{l^{(m+1)} - l^{(m)}}{1 - a^{(m)}},\qquad\text{where}\qquad a^{(m)} = \frac{l^{(m+1)} - l^{(m)}}{l^{(m)} - l^{(m-1)}}$$
(Böhning et al., 1994). Here, we used a small value of $\epsilon$.

Typically, the number of components is unknown in cluster analysis. In our case, the covariance structure that provides the best fit is also unknown. Hence, we fit the model with all eight covariance structures for a range of $G$, and the Bayesian information criterion (BIC; Schwarz et al., 1978) is used to assess the goodness of fit of each fitted model and to select the number of components $G$ and the corresponding covariance structure that best fit the data. The BIC of a fitted model is computed as
$$\text{BIC}(\widehat{\boldsymbol{\vartheta}}) = -2\,\ell(\widehat{\boldsymbol{\vartheta}}) + \psi\log(n),$$
where $\ell(\widehat{\boldsymbol{\vartheta}})$ is the observed-data log-likelihood, $n$ is the sample size, and $\psi$ is the number of free parameters in the model. Note that here we compute an approximation of the BIC using the ELBO (Chen et al., 2018); we refer to this approximation as the BIC in the remainder of the paper. Additionally, the variational parameters can be regarded as proxies for the latent variables and, therefore, we do not penalize for the variational parameters. The model with the smallest BIC value is selected and the number of underlying groups is thereby estimated. Assessment of clustering performance was done using the Adjusted Rand Index (ARI; Hubert and Arabie, 1985): a value of 1 indicates perfect agreement between the true and predicted classifications, and the expected value of the ARI under random classification is zero. For comparison, we also ran the Poisson mixture model implemented in the R package HTSCluster (Rau et al., 2015). Note that it is a clustering algorithm designed for clustering high-throughput transcriptome sequencing (HTS) data and therefore has a built-in normalization option which could not be turned off. Therefore, we set the normalization option to "TMM" for our analysis.
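The stopping rule and model-selection criterion described above amount to only a few lines of code. The R sketch below is a minimal version of that logic; the helper names and the tolerance value are our own assumptions, and the Aitken check compares the asymptotic log-likelihood estimate with the previous log-likelihood value.

```r
# Aitken acceleration-based stopping rule (Bohning et al., 1994)
aitken_converged <- function(loglik, eps = 1e-3) {  # loglik: log-likelihood trace; eps is an assumed tolerance
  m <- length(loglik)
  if (m < 3) return(FALSE)
  a     <- (loglik[m] - loglik[m - 1]) / (loglik[m - 1] - loglik[m - 2])
  l_inf <- loglik[m - 1] + (loglik[m] - loglik[m - 1]) / (1 - a)
  diff  <- l_inf - loglik[m - 1]   # asymptotic estimate minus previous log-likelihood
  diff > 0 && diff < eps
}

# BIC used for selecting G and the covariance structure (smaller is better with this sign convention)
bic <- function(loglik_final, n_free_params, n) {
  -2 * loglik_final + n_free_params * log(n)
}
```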
3 Analyses
Simulation Study 1

We generated 100 three-dimensional datasets of size N = 2000 with three components with mixing proportions π = (0. , 0. , 0. ). We ran our model for G = 1:4 and all eight covariance structures. In 100 out of the 100 datasets, a three-component VVV model was selected, with an average ARI of 0.99 and standard deviation (sd) of 0.003. A summary of the estimated mean parameters is provided in Table 2; the averages of the estimated covariance matrices were also very close to the true $\boldsymbol{\Sigma}_1$, $\boldsymbol{\Sigma}_2$, and $\boldsymbol{\Sigma}_3$, with standard errors between 0.01 and 0.03.

Table 2: Summary of the average and standard errors (SE) of the estimated parameters from the 100 datasets where a G = 3 component VVV model was selected in Simulation Study 1.

True Parameters      Average of Estimated Parameters    Standard Errors
μ_1 = (6, 3, 3)      (6.00, 3.00, 3.00)                 (0.03, 0.04, 0.04)
μ_2 = (3, 5, 3)      (3.00, 5.00, 3.00)                 (0.02, 0.02, 0.02)
μ_3 = (5, 3, 5)      (5.00, 3.00, 5.00)                 (0.02, 0.03, 0.02)

Similarly, we also ran the Poisson mixture model in the R package HTSCluster for G = 1:4 and selected the best model with the BIC. In 100 out of the 100 datasets, it overestimated the number of components, resulting in a four-component model being selected, with an average ARI of 0.75 and a standard deviation of 0.11.

Simulation Study 2

Here, we generated 100 six-dimensional datasets of size N = 500 with two components with mixing proportions π = (0. , 0. ) and an EII (λI) covariance structure (see Table 3 for the values used to generate the data). We ran our model for G = 1:3 and all eight covariance structures. In 99 out of 100 datasets, a two-component EII model was selected, with an average ARI of 1.00 and a standard deviation of 0.00; in one of the 100 datasets, a two-component VII model was selected. Note that VII still imposes an isotropic (spherical) structure on Σ but allows Σ to vary among groups. In our case, with G = 2, it adds only one additional parameter, so the penalty for selecting a slightly more complicated model was very small. A summary of the parameter estimates for the 99 out of the 100 datasets where the correct model was selected is provided in Table 3.
Table 3: Summary of the average and standard errors (SE) of the estimated parameters from the 99 out of the 100 datasets where a G = 2 component EII model was selected in Simulation Study 2.

g      μ_g    μ̂_g    λ    λ̂
1, 2   –      –       1    Average 0.99, SE 0.03

Similarly, we also ran the Poisson mixture model in the R package HTSCluster for G = 1:3 and selected the best model with the BIC. In 100 out of the 100 datasets, it overestimated the number of components, resulting in a three-component model being selected, with an average ARI of 0.04 and a standard deviation of 0.02.

Simulation Study 3

Here, we generated 100 six-dimensional datasets of size N = 2000 from a two-component mixture of independent negative binomial distributions with mixing proportions π = (0. , 0. ). We ran our model for G = 1, . . . ; a summary of the estimated means and variances is given in Table 4.

Table 4: Summary of the average and standard errors (SE) of the estimated means and variances for the 100 datasets in Simulation Study 3, where the data were generated using a mixture of independent negative binomial distributions.

g = 1
True Mean (1000, 500, 1000, 500, 1000, 500)
Average of Estimated Means (1000.28, 500.24, 999.84, 500.09, 999.75, 500.25)
SE of Estimated Means (2.66, 1.50, 2.31, 1.43, 2.91, 1.36)
True Variance (11000, 3000, 11000, 3000, 11000, 3000)
Average of Estimated Variances (11045.99, 3012.67, 11036.90, 3011.06, 11034.85, 3012.86)
SE of Estimated Variances (147.40, 37.25, 148.18, 37.31, 152.14, 39.56)

g = 2
True Mean (500, 1000, 500, 1000, 500, 500)
Average of Estimated Means (500.10, 999.81, 500.00, 1000.53, 499.91, 499.75)
SE of Estimated Means (2.44, 5.38, 5.37, 2.71, 2.39, 2.68)
True Variance (3000, 11000, 3000, 11000, 3000, 3000)
Average of Estimated Variances (3011.23, 11036.40, 3010.06, 11051.85, 3009.10, 3007.34)
SE of Estimated Variances (43.19, 182.23, 45.03, 193.22, 43.65, 43.70)

Similarly, we also ran the Poisson mixture model in the R package HTSCluster for G = 1:4 and selected the best model with the BIC. In 100 out of the 100 datasets, it overestimated the number of components, resulting in a four-component model being selected, with an average ARI of 0.29 and a standard deviation of 0.01.

Simulation Study 4

Here, we used 100 four-dimensional datasets generated using a mixture of independent Poisson distributions with mixing proportions π = (0. , 0. ). We ran our model for G = 1, . . . ; in all datasets, a two-component EII model was selected, with perfect classification. Note that here the diagonal structure for Σ was always selected, as the datasets were generated from mixtures of independent Poisson distributions. Table 5 shows that our model was able to recover the true means of the data generated from the mixtures of Poisson distributions very well, but it overestimated the variance slightly, which would be expected as the MPLN imposes a structure such that V(Y) > E(Y), whereas for the Poisson distribution V(Y) = E(Y).

Similarly, we also ran the Poisson mixture model in the R package HTSCluster for G = 1:4 and selected the best model with the BIC. In 100 out of the 100 datasets, it selected the correct G = 2 model with an ARI of 1.

Table 5: Summary of the average and standard errors (SE) of the estimated means for the 100 datasets in Simulation Study 4, where the data were generated using a mixture of independent Poisson distributions.

g = 1
True Mean (1000, 1500, 1500, 1000)
Average of Estimated Means (999.98, 1500.16, 1500.27, 1000.09)
SE of Estimated Means (1.31, 1.84, 1.77, 1.40)
True Variance (1000, 1500, 1500, 1000)
Average of Estimated Variances (1027.97, 1563.15, 1563.27, 1028.09)
SE of Estimated Variances (6.21, 14.37, 14.28, 6.52)

g = 2
True Mean (1000, 1000, 1000, 1500)
Average of Estimated Means (999.94, 1000.04, 999.96, 1499.97)
SE of Estimated Means (1.39, 1.32, 1.38, 1.79)
True Variance (1000, 1000, 1000, 1500)
Average of Estimated Variances (1027.93, 1028.03, 1027.95, 1562.95)
SE of Estimated Variances (6.50, 6.13, 6.57, 14.51)
Chronic Kidney Disease Data

Here, we used the Chronic Kidney Disease (CKD) dataset available via the UCI Machine Learning Repository. The dataset comprises 25 attributes, including various blood measurements and various categorical measurements pertaining to disease status, on 400 individuals. Here, we focus on three discrete count measurements: bgr (blood glucose random), wbcc (white blood cell count), and pcv (packed cell volume), with the aim of classifying the disease status of the patients for CKD. Information on the status of CKD is available in the variable class. Before the analysis, observations with missing values on our four variables (bgr, wbcc, pcv, and class) were removed, resulting in 261 observations. We ran our algorithm and HTSCluster each for a range of G; the cluster structures recovered by the two methods are compared with the true CKD status in Figure 1.
Figure 1: Scatter plot matrix of the chronic kidney dataset, where the colors/symbols represent the CKD status (left), the cluster structure recovered by MPLN (middle), and the cluster structure recovered by HTSCluster (right).
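For readers who want to reproduce the pre-processing of the CKD data, the following R sketch shows one possible workflow. The file name, the use of a CSV export, and the mplnClust() call are assumptions for illustration only (the fitting function is a hypothetical stand-in, not a published interface); the ARI can be computed with adjustedRandIndex() from the mclust package.

```r
library(mclust)  # for adjustedRandIndex

# Assumed CSV export of the UCI Chronic Kidney Disease data with columns bgr, wbcc, pcv, class
ckd <- read.csv("chronic_kidney_disease.csv", na.strings = c("?", ""), stringsAsFactors = FALSE)

# Keep the three count variables and the class label; drop rows with missing values
vars <- c("bgr", "wbcc", "pcv", "class")
ckd  <- ckd[complete.cases(ckd[, vars]), vars]   # 261 complete observations are used in the analysis
Y    <- as.matrix(ckd[, c("bgr", "wbcc", "pcv")])

# Hypothetical call to the proposed variational-EM fitting function over a range of G
# fit <- mplnClust(Y, G = 1:5)
# adjustedRandIndex(fit$classification, ckd$class)
```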
4 Conclusions

Here, we propose an efficient framework for parameter estimation for clustering using mixtures of multivariate Poisson-lognormal distributions that utilizes a variational Gaussian approximation. The variational Gaussian approximation provides a fast and deterministic framework for estimating the model parameters and alleviates the computational overhead due to MCMC-EM. Through an eigen-decomposition of the components' covariance matrices of the latent variable and the imposition of constraints, a family of parsimonious models is proposed. Through simulation studies, we show that the proposed models provide competitive performance: the parameter estimates were very close to the true parameters (when the correct model was chosen), and close-to-perfect classification was obtained using the model selected by the BIC. Additionally, we also generated data from mixtures of univariate Poisson distributions and mixtures of negative binomial distributions and show that our proposed model can recover the true means of these models fairly well. When datasets were generated from a Poisson model, which assumes V(Y) = E(Y), the variance was overestimated slightly, as the MPLN imposes the structure V(Y) > E(Y). In the case where the datasets were generated from a mixture of univariate negative binomial distributions, the variance estimation was much closer, as both the MPLN and the negative binomial impose the structure V(Y) > E(Y). Additionally, when data are generated from mixtures of univariate distributions, our approach selects a model with a diagonal covariance structure in which the variables are independent.

While our proposed work is motivated by modern biological data such as RNA-seq data, here we only focus on a general framework that is applicable to any multivariate count data. For biological data, extensions of these approaches may require some structure to be imposed on the datasets to normalize for library-specific factors such as library size and transcript-specific factors such as gene length. Although the use of variational Gaussian approximations alleviates some of the challenges of MCMC-EM, the algorithm can still be prone to computational issues encountered by any traditional EM algorithm, such as convergence to local maxima, singularities, etc. Furthermore, a systematic evaluation of the speed and accuracy of the variational EM against the MCMC-EM would be very valuable and is ongoing. In our work, the BIC seems to perform fairly well; however, some future work will focus on an investigation of different model-selection criteria for MPLN mixture models and for discrete data in general.
References
Aitchison, J. and C. Ho (1989). The multivariate Poisson-log normal distribution. Biometrika 76(4), 643–653.

Aitken, A. C. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh 45, 14–22.

Anders, S. and W. Huber (2010). Differential expression analysis for sequence count data. Nature Precedings, 1–1.

Archambeau, C., D. Cornford, M. Opper, and J. Shawe-Taylor (2007). Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research 1, 1–16.

Arridge, S. R., K. Ito, B. Jin, and C. Zhang (2018). Variational Gaussian approximation for Poisson data. Inverse Problems 34(2), 025005.

Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3), 803–821.

Biernacki, C., G. Celeux, and G. Govaert (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis 41(3-4), 561–575.

Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association 112(518), 859–877.

Böhning, D., E. Dietz, R. Schaub, P. Schlattmann, and B. Lindsay (1994). The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics 46, 373–388.

Bouveyron, C. and C. Brunet (2012). Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Statistics and Computing 22(1), 301–324.

Brijs, T., D. Karlis, G. Swinnen, K. Vanhoof, G. Wets, and P. Manchanda (2004). A multivariate Poisson mixture model for marketing applications. Statistica Neerlandica 58(3), 322–348.

Browne, R. P. and P. D. McNicholas (2014). Estimating common principal components in high dimensions. Advances in Data Analysis and Classification 8(2), 217–226.

Browne, R. P. and P. D. McNicholas (2015). A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics 43(2), 176–198.

Bullard, J. H., E. Purdom, K. D. Hansen, and S. Dudoit (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics 11(1), 94.

Bulmer, M. (1974). On fitting the Poisson lognormal distribution to species-abundance data. Biometrics, 101–110.

Campbell, J. (1934). The Poisson correlation function. Proceedings of the Edinburgh Mathematical Society 4(1), 18–26.

Celeux, G. and G. Govaert (1995). Gaussian parsimonious clustering models. Pattern Recognition 28, 781–793.

Chagneau, P., F. Mortier, N. Picard, and J.-N. Bacro (2011). A hierarchical Bayesian model for spatial prediction of multivariate non-Gaussian random fields. Biometrics 67(1), 97–105.

Challis, E. and D. Barber (2013). Gaussian Kullback-Leibler approximate inference. The Journal of Machine Learning Research 14(1), 2239–2286.

Chen, Y.-C., Y. S. Wang, E. A. Erosheva, et al. (2018). On the use of bootstrap with variational inference: Theory, interpretation, and a two-sample test example. The Annals of Applied Statistics 12(2), 846–876.

Chiquet, J., M. Mariadassou, and S. Robin (2019). Variational inference for sparse network reconstruction from count data. In Proceedings of the 36th International Conference on Machine Learning, Volume 97 of Proceedings of Machine Learning Research.

Collings, B. J. and B. H. Margolin (1985). Testing goodness of fit for the Poisson assumption when observations are not identically distributed. Journal of the American Statistical Association 80(390), 411–418.

Dang, U. J., R. P. Browne, and P. D. McNicholas (2015). Mixtures of multivariate power exponential distributions. Biometrics 71(4), 1081–1089.

de Souto, M. C., I. G. Costa, D. S. de Araujo, T. B. Ludermir, and A. Schliep (2008). Clustering cancer gene expression data: a comparative study. BMC Bioinformatics 9(1), 1.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38.

Dong, K., H. Zhao, T. Tong, and X. Wan (2016). NBLDA: negative binomial linear discriminant analysis for RNA-seq data. BMC Bioinformatics 17(1), 369.

Georgescu, V., N. Desassis, S. Soubeyrand, A. Kretzschmar, and R. Senoussi (2014). An automated MCEM algorithm for hierarchical models with multivariate and multitype response variables. Communications in Statistics-Theory and Methods 43(17), 3698–3719.

Gollini, I. and T. B. Murphy (2014). Mixture of latent trait analyzers for model-based clustering of categorical data. Statistics and Computing 24(4), 569–588.

Greenwood, M. and G. U. Yule (1920). An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. Journal of the Royal Statistical Society 83(2), 255–279.
Gurland, J. (1958). A generalized class of contagious distributions. Biometrics 14(2), 229–249.

Hoffman, M. D. and A. Gelman (2014). The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15(1), 1593–1623.

Holla, M. (1967). On a Poisson-inverse Gaussian distribution. Metrika 11(1), 115–121.

Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2, 193–218.

Inouye, D. I., E. Yang, G. I. Allen, and P. Ravikumar (2017). A review of multivariate distributions for count data derived from the Poisson distribution. Wiley Interdisciplinary Reviews: Computational Statistics 9(3), e1398.

Karlis, D. and L. Meligkotsidou (2007). Finite mixtures of multivariate Poisson distributions with application. Journal of Statistical Planning and Inference 137(6), 1942–1960.

Karlis, D. and E. Xekalaki (2005). Mixed Poisson distributions. International Statistical Review 73(1), 35–58.

Khan, M. E., A. Aravkin, M. Friedlander, and M. Seeger (2013). Fast dual variational inference for non-conjugate latent Gaussian models. In International Conference on Machine Learning, pp. 951–959.

Kosmidis, I. and D. Karlis (2016). Model-based clustering using copulas with applications. Statistics and Computing 26(5), 1079–1099.

Love, M. I., W. Huber, and S. Anders (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15(12), 550.

Marioni, J. C., C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 18(9), 1509–1517.

Neal, R. M. et al. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2(11).

Papastamoulis, P., M.-L. Martin-Magniette, and C. Maugis-Rabusseau (2016). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics & Data Analysis 93, 97–106.

Rau, A., C. Maugis-Rabusseau, M.-L. Martin-Magniette, and G. Celeux (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics 31(9), 1420–1427.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1), 139–140.

Sarabia, J. M. and E. Gómez-Déniz (2011). Multivariate Poisson-beta distributions with applications. Communications in Statistics-Theory and Methods 40(6), 1093–1108.

Schwarz, G. et al. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.

Scrucca, L., M. Fop, T. B. Murphy, and A. E. Raftery (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal 8(1), 205–233.

Si, Y., P. Liu, P. Li, and T. P. Brutnell (2014). Model-based clustering for RNA-seq data. Bioinformatics 30(2), 197–205.

Silva, A., S. J. Rothstein, P. D. McNicholas, and S. Subedi (2019). A multivariate Poisson-lognormal mixture model for clustering transcriptome sequencing data. BMC Bioinformatics 20(1), 394.

Stan Development Team (2015). Stan: A C++ library for probability and sampling (Version 2.8.0).

Subedi, S. and P. D. McNicholas (2014). Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions. Advances in Data Analysis and Classification 8(2), 167–193.

Subedi, S. and P. D. McNicholas (2020). A variational approximations-DIC rubric for parameter estimation and mixture model selection within a family setting. Journal of Classification, 1–20.

Subedi, S., A. Punzo, S. Ingrassia, and P. D. McNicholas (2015). Cluster-weighted t-factor analyzers for robust model-based clustering and dimension reduction. Statistical Methods & Applications 24(4), 623–649.

Tang, Y., R. P. Browne, and P. D. McNicholas (2015). Model based clustering of high-dimensional binary data. Computational Statistics & Data Analysis 87, 84–101.

Teicher, H. (1954). On the multivariate Poisson distribution. Scandinavian Actuarial Journal 1954(1), 1–9.

Tortora, C., B. C. Franczak, R. P. Browne, and P. D. McNicholas (2019). A mixture of coalesced generalized hyperbolic distributions. Journal of Classification 36(1), 26–57.

Vrbik, I. and P. D. McNicholas (2014). Parsimonious skew mixture models for model-based clustering and classification. Computational Statistics and Data Analysis 71, 196–210.

Wainwright, M. J., M. I. Jordan, et al. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1(1–2), 1–305.

Wei, G. C. and M. A. Tanner (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association 85(411), 699–704.

Willmot, G. E. (1993). On recursive evaluation of mixed Poisson probabilities and related quantities. Scandinavian Actuarial Journal 1993(2), 114–133.

Witten, D., R. Tibshirani, S. G. Gu, A. Fire, and W.-O. Lui (2010). Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. BMC Biology 8(1), 58.

Yu, D., W. Huber, and O. Vitek (2013). Shrinkage estimation of dispersion in negative binomial models for RNA-seq experiments with small sample size.