Logistic Normal Multinomial Factor Analyzers for Clustering Microbiome Data
Wangshu Tu∗ and Sanjeena Subedi†

January 7, 2021
Abstract
The human microbiome plays an important role in human health and disease status. Next-generation sequencing technologies allow for quantifying the composition of the human microbiome. Clustering these microbiome data can provide valuable information by identifying underlying patterns across samples. Recently, Fang & Subedi (2020) proposed a logistic normal multinomial mixture model (LNM-MM) for clustering microbiome data. As microbiome data tend to be high dimensional, here we develop a family of logistic normal multinomial factor analyzers (LNM-FA) by incorporating a factor analyzer structure in the LNM-MM. This family of models is more suitable for high-dimensional data, as the number of parameters in LNM-FA can be greatly reduced by assuming that the number of latent factors is small. Parameter estimation is done using a computationally efficient variant of the alternating expectation conditional maximization algorithm that utilizes variational Gaussian approximations. The proposed method is illustrated using simulated and real datasets.

∗Department of Mathematical Sciences, Binghamton University, State University of New York, 4400 Vestal Parkway East, Binghamton, NY, USA 13902. e: [email protected]
†Department of Mathematical Sciences, Binghamton University, State University of New York, 4400 Vestal Parkway East, Binghamton, NY, USA 13902. e: [email protected]

1 Introduction

The human microbiota is a complex collection of microbes, including but not limited to bacteria, fungi, and viruses, that reside in the human body. It is estimated that there are nearly 30 trillion bacterial cells living in or on each human body, which is about one bacterium for every cell in the human body (Sender et al. 2016). These organisms play an important role in human health and disease (Huttenhower et al. 2012). For example, changes in the gut microbiota have been linked to inflammatory bowel disease (Becker et al. 2015), obesity (Davis 2016), type 2 diabetes (Cho & Blaser 2012), and cancer (Pfirschke et al.
2015). Using next-generation sequencing technologies, the abundance and composition of these microbes can be quantified.

Cluster analysis has been widely used to gain insights from microbiome data. Cluster analysis is used to group observations into homogeneous subpopulations with similar characteristics. Enterotype, a term first proposed by Arumugam et al. (2011), refers to groups of individuals with similar gut microbial communities. Wu et al. (2011) used a partitioning around medoids (PAM) approach with various distance measures to cluster the gut microbiota samples of 98 healthy volunteers and found that the number of enterotypes varied between two and three. Abdel-Aziz et al. (2020) utilized hierarchical clustering on sputum microbiome datasets and identified two distinct, robust phenotypes of severe asthma. On the other hand, k-means clustering has also been widely used to cluster microbiome data (Taie et al. 2018, Hotterbeekx et al. 2016). Although k-means, PAM, and hierarchical clustering are well-established clustering techniques and are frequently used in many fields, these approaches fail to take into account the compositional nature of microbiome data.

Several model-based clustering frameworks have been proposed for microbiome data (Holmes et al. 2012, Subedi et al. 2020, Fang & Subedi 2020). A model-based clustering approach utilizes a finite mixture model, which assumes that the data come from a finite collection of subpopulations or components, where each subpopulation can be represented by a distribution function and the appropriate distribution is chosen depending on the nature of the data. A Dirichlet-multinomial model has been widely used for modeling microbiome data (La Rosa et al. 2012, Chen & Li 2013, Wadsworth et al. 2017, Koslovsky & Vannucci 2020). In terms of clustering, Holmes et al. (2012) proposed a Dirichlet-multinomial mixture model to cluster microbiome data. Subedi et al.
(2020) proposed mixtures of Dirichlet-multinomial regression models to cluster microbiome data, which can incorporate the effects of covariates. However, due to the limited number of parameters in the Dirichlet distribution, the covariance of the microbiome data cannot be modeled adequately using a Dirichlet-multinomial distribution (Xia et al. 2013).

An alternate model for microbiome data, utilized by Xia et al. (2013), is an additive logistic normal multinomial (LNM) model. An additive LNM model (Aitchison 1982) models the observed counts using a hierarchical structure: the observed counts are modeled using a multinomial distribution conditional on the compositions, and a Gaussian prior is imposed on the log-ratio transformed compositions. While this approach brings flexibility in modeling the data, the posterior distribution of the transformed variable does not have a closed form. A Markov chain Monte Carlo (MCMC) approach is typically utilized for parameter estimation (Xia et al. 2013, Äijö et al. 2018), which comes with a heavy computational cost. Recently, Fang & Subedi (2020) proposed a mixture of additive logistic normal multinomial (LNM) models to cluster microbiome data and proposed an alternate approach for parameter estimation that utilizes variational Gaussian approximations (VGA; Wainwright & Jordan 2008). VGA provides an alternative parameter estimation framework where complex posterior distributions are approximated using computationally convenient Gaussian densities by minimizing the Kullback-Leibler (KL) divergence between the true and the Gaussian densities.

In the LNM model, the log-ratio transformed composition variable is assumed to follow a multivariate Gaussian distribution and hence, the number of parameters in the covariance matrix of the transformed variable grows quadratically with the dimensionality.
McNicholas & Murphy (2008) proposed a family of parsimonious Gaussian mixture models (PGMM) utilizing a factor analyzer structure. In PGMM, the number of parameters in the covariance matrix is linear in the dimensionality, and by choosing the number of latent factors to be sufficiently small, the number of parameters in the covariance matrix can be greatly reduced. In this paper, we extend the mixture of logistic normal multinomial models to high-dimensional data by incorporating a factor analyzer structure in the latent space. We develop a variational variant of the alternating expectation conditional maximization algorithm for parameter estimation. The paper is structured as follows: Section 2 provides details of the logistic normal multinomial model and the finite mixture of logistic normal multinomial factor analyzers, along with details on parameter estimation; in Sections 3 and 4, these models are applied to simulated and real datasets, respectively; and Section 5 concludes the paper.

2 Methodology

Consider human microbiome count data on $K+1$ taxa. Let $\mathbf{W} = (W_1, \ldots, W_{K+1})^T$ denote the random vector of counts of the $K+1$ bacterial taxa, and let $\mathbf{p} = (p_1, \ldots, p_{K+1})^T$ be the underlying composition of the microbial taxa such that $\sum_{k=1}^{K+1} p_k = 1$. The observed counts $\mathbf{w}$ can be modeled using a multinomial distribution such that
$$f(\mathbf{w} \mid \mathbf{p}) \propto \prod_{k=1}^{K+1} (p_k)^{w_k}.$$
However, the actual variability in microbiome composition data is greater than what is modeled or predicted by the multinomial model (Xia et al. 2013). To account for this additional variability, one approach is to treat the probability vector $\mathbf{p}$ as a random sample from a Dirichlet distribution such that, for each observation $i$,
$$\mathbf{W}_i \mid \mathbf{p}_i \sim \text{Multinomial}(\mathbf{p}_i) \quad \text{and} \quad \mathbf{p}_i \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_{K+1}).$$
The resulting compound distribution is known as the Dirichlet-multinomial distribution and has been used widely for microbiome data (Chen & Li 2013, Holmes et al. 2012, Subedi et al. 2020).
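This compound construction can be sketched in a few lines (a sketch only; the Dirichlet parameters, read depth, and variable names are illustrative placeholders, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 5.0, 3.0])   # illustrative Dirichlet parameters, K+1 = 3 taxa

# Hierarchical draw: p_i ~ Dirichlet(alpha), then W_i | p_i ~ Multinomial(p_i)
p_i = rng.dirichlet(alpha)
w_i = rng.multinomial(1000, p_i)    # 1000 reads for observation i

assert np.isclose(p_i.sum(), 1.0)   # composition lies on the simplex
assert w_i.sum() == 1000            # counts respect the read depth
```

Marginalizing over many such draws of `p_i` produces the extra-multinomial variability that the compound distribution captures.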
However, due to the limited number of parameters in a Dirichlet-multinomial distribution, the variance and covariances of the microbiome composition cannot be adequately modeled by a Dirichlet-multinomial distribution (Xia et al. 2013). An alternate approach is to use a log-ratio transformation on $\mathbf{p}$ and impose a prior on the transformed variable (Xia et al. 2013, Äijö et al. 2018, Silverman et al. 2018).

In this paper, we will use the additive logistic normal multinomial model of Xia et al. (2013), which utilizes an additive log-ratio (ALR) transformation to map $\mathbf{p}$ from the restricted simplex $S^K$ to the $K$-dimensional open real space $\mathbb{R}^K$ such that
$$\mathbf{Y} = \phi(\mathbf{p}) = \left[\log\left(\frac{p_1}{p_{K+1}}\right), \ldots, \log\left(\frac{p_K}{p_{K+1}}\right)\right]^\top, \qquad (1)$$
where $p_{K+1}$ is used as a reference, and a multivariate Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ is imposed on $\mathbf{Y}$. Here, $\phi: S^K \to \mathbb{R}^K$ is a one-to-one function, and therefore
$$\mathbf{p} = \phi^{-1}(\mathbf{Y}) = \left[\frac{\exp(Y_1)}{\sum_{k=1}^{K}\exp(Y_k) + 1}, \ldots, \frac{\exp(Y_K)}{\sum_{k=1}^{K}\exp(Y_k) + 1}, \frac{1}{\sum_{k=1}^{K}\exp(Y_k) + 1}\right]^T. \qquad (2)$$
Thus, the conditional probability function of $\mathbf{W} \mid \mathbf{Y}$ becomes
$$f(\mathbf{w} \mid \mathbf{y}) \propto \prod_{k=1}^{K}\left(\frac{\exp(y_k)}{\sum_{k'=1}^{K}\exp(y_{k'}) + 1}\right)^{w_k}\left(\frac{1}{\sum_{k'=1}^{K}\exp(y_{k'}) + 1}\right)^{w_{K+1}},$$
and the marginal probability function of $\mathbf{W}$ becomes
$$f(\mathbf{w} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \int_{\mathbb{R}^K} f(\mathbf{w} \mid \mathbf{y})\, f(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\, d\mathbf{y} \propto \int_{\mathbb{R}^K} \prod_{k=1}^{K+1}\left\{\phi^{-1}(\mathbf{y})_k\right\}^{w_k}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\left\{-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})\right\} d\mathbf{y}.$$
Note that this marginal probability function of $\mathbf{W}$ involves multiple integrals and cannot be further simplified. Although the LNM model provides flexibility in the modeling structure, parameter estimation has thus far mostly relied on Bayesian MCMC-based approaches that come with a heavy computational burden (Xia et al. 2013). Recently, Fang & Subedi (2020) proposed mixtures of logistic normal multinomial models (LNM-MM) for clustering microbiome data, where a computationally efficient framework for parameter estimation was developed using variational Gaussian approximations (VGA; Wainwright & Jordan 2008). VGA is an alternate parameter estimation framework that utilizes a computationally convenient Gaussian density to approximate a more complex but "true" posterior density. The complex posterior distribution is approximated by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities.

Using an approximating density $q(\mathbf{y})$, the marginal log density of $\mathbf{W}$ can be written as
$$\log f(\mathbf{w}) = \int q(\mathbf{y}) \log \frac{q(\mathbf{y})}{f(\mathbf{y} \mid \mathbf{w})}\, d\mathbf{y} + \int q(\mathbf{y}) \log \frac{f(\mathbf{w}, \mathbf{y})}{q(\mathbf{y})}\, d\mathbf{y} = D_{KL}[q(\mathbf{y}) \,\|\, f(\mathbf{y} \mid \mathbf{w})] + F(q(\mathbf{y}), \mathbf{w}), \qquad (3)$$
where $D_{KL}[q(\mathbf{y}) \,\|\, f(\mathbf{y} \mid \mathbf{w})]$ is the Kullback-Leibler divergence from $f(\mathbf{y} \mid \mathbf{w})$ to $q(\mathbf{y})$, and $F(q(\mathbf{y}), \mathbf{w})$ is known as the evidence lower bound (ELBO). Minimizing the Kullback-Leibler divergence from $f(\mathbf{y} \mid \mathbf{w})$ to $q(\mathbf{y})$ is then equivalent to maximizing the ELBO. In a variational Gaussian approximation framework, $q(\mathbf{y})$ is taken to be a Gaussian distribution. If we assume $q(\mathbf{y})$ to be a Gaussian distribution with mean $\mathbf{m}$ and diagonal covariance matrix $\mathbf{V}$, the lower bound $F(q(\mathbf{y}), \mathbf{w})$ becomes
$$\tilde{F}(\mathbf{m}, \mathbf{V}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = C + \mathbf{w}^{*T}\mathbf{m} - \left(\sum_{k=1}^{K+1} w_k\right)\left[\log\left(\sum_{k=1}^{K}\exp\left(m_k + \frac{v_{kk}}{2}\right) + 1\right)\right] + \frac{1}{2}\log|\mathbf{V}| + \frac{K}{2} - \frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{m} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{m} - \boldsymbol{\mu}) - \frac{1}{2}\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{V}), \qquad (4)$$
where $\mathbf{w}^*$ is a $K$-dimensional vector containing the first $K$ elements of $\mathbf{w}$, and $C$ is a constant. Details of the derivation of this lower bound are provided in Appendix A. This lower bound can be easily maximized with respect to the model parameters and the variational parameters using an iterative approach. Thus, the use of VGA eliminates the need for an MCMC-based approach for parameter estimation and drastically reduces the computational overhead, making it feasible to extend these models for clustering in a high-dimensional setting. Several studies have shown that VGA delivers accurate approximations (Archambeau et al. 2007, Arridge et al. 2018, Challis & Barber 2013, Subedi & Browne 2020).

A finite mixture model assumes that data come from a finite collection of subpopulations, and each subpopulation can be represented using a parametric distribution. A $G$-component finite mixture of LNM models can be written as
$$f(\mathbf{w}_i \mid \boldsymbol{\vartheta}) = \sum_{g=1}^{G} \pi_g\, f(\mathbf{w}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g),$$
where $f(\mathbf{w}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ represents the marginal probability mass function of the logistic normal multinomial model of the $g$-th subpopulation, $\pi_g > 0$ is the mixing proportion of the $g$-th subpopulation such that $\sum_{g=1}^{G}\pi_g = 1$, and $\boldsymbol{\vartheta}$ represents all the model parameters. The likelihood of the mixture of LNM models can be written as
$$L(\boldsymbol{\vartheta}) = \prod_{i=1}^{n}\sum_{g=1}^{G} \pi_g\, f(\mathbf{w}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g). \qquad (5)$$
We define a group membership indicator variable $\mathbf{z}_i = (z_{i1}, \ldots, z_{iG})$ such that $z_{ig} = 1$ if observation $i$ belongs to group $g$ and 0 otherwise. In the context of clustering, these group memberships are treated as unobserved or missing data, and the likelihood function in (5) is considered an incomplete-data likelihood function. Therefore, the complete-data likelihood with observed data $(\mathbf{w}_1, \ldots, \mathbf{w}_n)$ and missing data $(\mathbf{z}_1, \ldots, \mathbf{z}_n)$ can be written as
$$L(\boldsymbol{\vartheta}) = \prod_{i=1}^{n}\prod_{g=1}^{G}\left\{\pi_g\, f(\mathbf{w}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\right\}^{z_{ig}}.$$
Then, the complete-data log-likelihood becomes
$$l(\boldsymbol{\vartheta}) = \sum_{i=1}^{n}\sum_{g=1}^{G} z_{ig}\left\{\log \pi_g + \log f(\mathbf{w}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)\right\}.$$
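The transformation pair in Equations (1) and (2) can be sanity-checked in a few lines (a sketch; the function names `alr` and `alr_inv` are ours):

```python
import numpy as np

def alr(p):
    """Additive log-ratio transform: simplex of K+1 parts -> R^K, last taxon as reference."""
    p = np.asarray(p, dtype=float)
    return np.log(p[:-1] / p[-1])

def alr_inv(y):
    """Inverse ALR: R^K -> simplex of K+1 parts."""
    e = np.exp(np.asarray(y, dtype=float))
    denom = e.sum() + 1.0
    return np.append(e / denom, 1.0 / denom)
```

Applying `alr_inv(alr(p))` returns the original composition, which confirms that $\phi$ is one-to-one.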
To incorporate a factor analyzer structure (Ghahramani et al. 1997, McLachlan & Peel 2000) in the mixtures of LNM models, we utilize the following structure on $\mathbf{Y}$ for the $g$-th component:
$$\mathbf{Y} = \boldsymbol{\mu}_g + \boldsymbol{\Lambda}_g \mathbf{U}_g + \boldsymbol{\epsilon}_g,$$
where $\boldsymbol{\mu}_g$ is a $K$-dimensional mean vector, $\mathbf{U}_g \sim N(\mathbf{0}, \mathbf{I}_q)$ is a $q$-dimensional vector of latent factors, $\boldsymbol{\Lambda}_g$ is a $K \times q$ matrix of factor loadings, $\boldsymbol{\epsilon}_g \sim N(\mathbf{0}, \mathbf{D}_g)$ is a $K$-dimensional vector of errors where $\mathbf{D}_g$ is a diagonal matrix, and $\mathbf{U}_g \perp \boldsymbol{\epsilon}_g$. Thus, for the $g$-th component, $\mathbf{Y} \sim N(\boldsymbol{\mu}_g, \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g)$ and $\mathbf{Y} \mid \mathbf{u}_g \sim N(\boldsymbol{\mu}_g + \boldsymbol{\Lambda}_g\mathbf{u}_g, \mathbf{D}_g)$.

Parameter estimation for mixtures of factor analyzers is typically done using an alternating expectation conditional maximization (AECM) algorithm. The AECM algorithm (Meng & Van Dyk 1997) is an extension of the expectation-maximization (EM) algorithm (Dempster et al. 1977) that uses a different specification of the missing data at each cycle, and its maximization step comprises a series of conditional maximizations. Each cycle of the AECM algorithm consists of an E-step, in which the expected value of the complete-data log-likelihood is computed, followed by a conditional maximization step in which a subset of the model parameters is updated. Here, we develop a variational version of the AECM algorithm that uses different specifications of the missing data at different cycles.
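The parameter savings from this structure, and the cheap inversion of $\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g$ that it enables via the Woodbury identity, are easy to illustrate numerically (our own sketch with arbitrary dimensions, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
K, q = 10, 3
Lambda_g = rng.normal(size=(K, q))            # K x q factor loadings
d = rng.uniform(0.1, 1.0, size=K)             # diagonal of D_g
Sigma_g = Lambda_g @ Lambda_g.T + np.diag(d)  # implied K x K covariance

# Raw parameter counts: the factor structure grows linearly in K,
# while an unconstrained symmetric covariance grows quadratically.
n_factor = K * q + K                  # loadings + diagonal entries
n_full = K * (K + 1) // 2
assert n_factor < n_full              # 40 < 55 here

# Woodbury identity: invert Sigma_g through a q x q (not K x K) inverse.
Dinv = np.diag(1.0 / d)
inner = np.linalg.inv(np.eye(q) + Lambda_g.T @ Dinv @ Lambda_g)
Sigma_inv = Dinv - Dinv @ Lambda_g @ inner @ Lambda_g.T @ Dinv
assert np.allclose(Sigma_inv, np.linalg.inv(Sigma_g))
```

After accounting for the rotational invariance of the loadings, the loading count drops further to $Kq - q(q-1)/2$, which is the count used later in the family of parsimonious models.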
First Cycle
In the first cycle, we utilize the following hierarchical structure:
$$\mathbf{W}_i \mid \mathbf{Y}_i \sim \text{Multinomial}(\mathbf{p}_i) \quad \text{and} \quad \mathbf{Y}_i \sim N(\boldsymbol{\mu}_g, \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g),$$
where $\mathbf{p}_i$ can be obtained from $\mathbf{Y}_i$ using Equation 2. Then the component-specific marginal probability function of the observed data $\mathbf{w}_i$ is
$$f(\mathbf{w}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Lambda}_g, \mathbf{D}_g) = \int_{\mathbb{R}^K} f(\mathbf{w}_i \mid \mathbf{y}_i)\, f(\mathbf{y}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g)\, d\mathbf{y} \propto \int_{\mathbb{R}^K} \prod_{k=1}^{K+1}\left\{\phi^{-1}(\mathbf{y}_i)_k\right\}^{w_k}\, |\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g|^{-1/2} \exp\left\{-\frac{1}{2}(\mathbf{y}_i - \boldsymbol{\mu}_g)^\top(\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g)^{-1}(\mathbf{y}_i - \boldsymbol{\mu}_g)\right\} d\mathbf{y}.$$
Treating $\mathbf{Z}$ and $\mathbf{Y}$ as missing variables, the complete-data log-likelihood using the marginal probability function of $\mathbf{W}$ is
$$l(\boldsymbol{\vartheta}) = \sum_{i=1}^{n}\sum_{g=1}^{G} z_{ig}\left\{\log \pi_g + \log f(\mathbf{w}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g)\right\} = \sum_{i=1}^{n}\sum_{g=1}^{G} z_{ig}\left\{\log \pi_g + \log \int f(\mathbf{w}_i \mid \mathbf{y}_i)\, f_g(\mathbf{y}_i \mid \boldsymbol{\mu}_g, \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g)\, d\mathbf{y}\right\}.$$
Assuming the component-specific $q(\mathbf{y})$ to be a Gaussian distribution with mean $\mathbf{m}_{ig}$ and diagonal covariance matrix $\mathbf{V}_{ig}$, and replacing the log of the component-specific marginal probability function by the component-specific $\tilde{F}(\mathbf{m}_{ig}, \mathbf{V}_{ig}, \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$, the variational Gaussian lower bound of the complete-data log-likelihood can be written as
$$\tilde{L} = \sum_{i=1}^{n}\sum_{g=1}^{G} z_{ig}\Bigg\{\log \pi_g - \left(\mathbf{1}_{(K+1)}^T \mathbf{w}_i\right)\left[\log\left(\mathbf{1}_{(K)}^T \exp\left(\mathbf{m}_{ig} + \frac{\operatorname{diag}(\mathbf{V}_{ig})}{2}\right) + 1\right)\right] + C_i + \mathbf{w}_i^{*T}\mathbf{m}_{ig} + \frac{1}{2}\log|\mathbf{V}_{ig}| + \frac{K}{2} - \frac{1}{2}\log|\boldsymbol{\Sigma}_g| - \frac{1}{2}\operatorname{tr}(\boldsymbol{\Sigma}_g^{-1}\mathbf{V}_{ig}) - \frac{1}{2}(\mathbf{m}_{ig} - \boldsymbol{\mu}_g)^T\boldsymbol{\Sigma}_g^{-1}(\mathbf{m}_{ig} - \boldsymbol{\mu}_g)\Bigg\},$$
where $\mathbf{1}_{(K)}$ stands for a column vector of 1's of dimension $K$, $C_i$ stands for $\log\left[(\mathbf{1}_{(K+1)}^T\mathbf{w}_i)! / \prod_{k} w_{ik}!\right]$, $\operatorname{diag}(\mathbf{V}_{ig}) = (V_{ig,11}, V_{ig,22}, \ldots, V_{ig,KK})^T$ puts the diagonal elements of the $K \times K$ matrix $\mathbf{V}_{ig}$ into a $K$-dimensional vector, and $\boldsymbol{\Sigma}_g = \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g$. In this cycle, for the parameter updates in the $(t+1)$-th iteration, the following steps are conducted:

1. Update the variational Gaussian lower bound of the complete-data log-likelihood of the first cycle $\tilde{L}$ by updating $\mathbf{m}_{ig}$ and $\mathbf{V}_{ig}$. For updating $\mathbf{V}_{ig}^{(t+1)}$, we use the Newton-Raphson method. Writing $\mathbf{V}_{ig} = \operatorname{diag}(\mathbf{v}_{ig}^2)$, we take the derivative with respect to the standard deviations $\mathbf{v}_{ig}$ and find the root of the following score function:
$$\frac{\partial \tilde{L}}{\partial \mathbf{v}_{ig}} = \mathbf{v}_{ig}^{(t)\,-1} - \mathbf{v}_{ig}^{(t)} \odot \operatorname{diag}(\boldsymbol{\Sigma}_g^{(t)-1}) - \left(\mathbf{1}_{(K+1)}^T \mathbf{w}_i\right)\frac{\mathbf{v}_{ig}^{(t)} \odot \exp\left(\mathbf{m}_{ig}^{(t)} + \mathbf{v}_{ig}^{(t)2}/2\right)}{\mathbf{1}_{(K)}^T \exp\left(\mathbf{m}_{ig}^{(t)} + \mathbf{v}_{ig}^{(t)2}/2\right) + 1},$$
where the inverse, square, exponential, and product $\odot$ are taken elementwise. For updating $\mathbf{m}_{ig}^{(t+1)}$, we again use the Newton-Raphson method to find the root of the following score function:
$$\frac{\partial \tilde{L}}{\partial \mathbf{m}_{ig}} = \mathbf{w}_i^{*} - \boldsymbol{\Sigma}_g^{(t)-1}\left(\mathbf{m}_{ig}^{(t)} - \boldsymbol{\mu}_g^{(t)}\right) - \left(\mathbf{1}_{(K+1)}^T \mathbf{w}_i\right)\frac{\exp\left(\mathbf{m}_{ig}^{(t)} + \mathbf{v}_{ig}^{(t)2}/2\right)}{\mathbf{1}_{(K)}^T \exp\left(\mathbf{m}_{ig}^{(t)} + \mathbf{v}_{ig}^{(t)2}/2\right) + 1}.$$
2. Update the component indicator variable $Z_{ig}$. Conditional on the variational parameters $\mathbf{m}_{ig}^{(t+1)}$, $\mathbf{V}_{ig}^{(t+1)}$ and on $\boldsymbol{\mu}_g^{(t)}$, $\boldsymbol{\Lambda}_g^{(t)}$, and $\mathbf{D}_g^{(t)}$, the expected value of $Z_{ig}$ can be computed as
$$E\left(Z_{ig}^{(t+1)}\right) = \frac{\pi_g^{(t)}\, f(\mathbf{w}_i \mid \boldsymbol{\mu}_g^{(t)}, \boldsymbol{\Lambda}_g^{(t)}, \mathbf{D}_g^{(t)})}{\sum_{h=1}^{G} \pi_h^{(t)}\, f(\mathbf{w}_i \mid \boldsymbol{\mu}_h^{(t)}, \boldsymbol{\Lambda}_h^{(t)}, \mathbf{D}_h^{(t)})}.$$
As this involves the marginal distribution of $\mathbf{W}$, which is difficult to compute, we approximate $E(Z_{ig}^{(t+1)})$ using the ELBO:
$$\hat{z}_{ig}^{(t+1)} = \frac{\pi_g^{(t)} \exp\left\{\tilde{F}\left(\boldsymbol{\mu}_g^{(t)}, \boldsymbol{\Lambda}_g^{(t)}\boldsymbol{\Lambda}_g^{(t)T} + \mathbf{D}_g^{(t)}, \mathbf{m}_{ig}^{(t+1)}, \mathbf{V}_{ig}^{(t+1)}\right)\right\}}{\sum_{h=1}^{G} \pi_h^{(t)} \exp\left\{\tilde{F}\left(\boldsymbol{\mu}_h^{(t)}, \boldsymbol{\Lambda}_h^{(t)}\boldsymbol{\Lambda}_h^{(t)T} + \mathbf{D}_h^{(t)}, \mathbf{m}_{ih}^{(t+1)}, \mathbf{V}_{ih}^{(t+1)}\right)\right\}}.$$
3. Given the variational parameters and $\hat{z}_{ig}^{(t+1)}$, we then update the parameters $\pi_g$ and $\boldsymbol{\mu}_g$ as:
$$\hat{\pi}_g^{(t+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{ig}^{(t+1)}}{n}, \quad \text{and} \quad \hat{\boldsymbol{\mu}}_g^{(t+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{ig}^{(t+1)}\, \mathbf{m}_{ig}^{(t+1)}}{\sum_{i=1}^{n} \hat{z}_{ig}^{(t+1)}}.$$

Second Cycle
In the second cycle, we utilize the following hierarchical structure:
$$\mathbf{W}_i \mid \mathbf{Y}_i \sim \text{Multinomial}(\mathbf{p}_i), \quad \mathbf{Y}_i \mid \mathbf{U}_i = \mathbf{u}_i \sim N(\boldsymbol{\mu}_g + \boldsymbol{\Lambda}_g\mathbf{u}_i, \mathbf{D}_g), \quad \text{and} \quad \mathbf{U}_i \sim N(\mathbf{0}, \mathbf{I}_q),$$
where $\mathbf{p}_i$ can be obtained from $\mathbf{Y}_i$ using Equation 2. Treating $\mathbf{Z}$, $\mathbf{Y}$, and $\mathbf{U}$ as missing variables, the complete-data log-likelihood using the marginal probability function of $\mathbf{W}$ has the following form:
$$l(\boldsymbol{\vartheta}) = \sum_{i=1}^{n}\sum_{g=1}^{G} z_{ig}\left\{\log \pi_g + \log\left[\int f(\mathbf{w}_i \mid \mathbf{y}_i)\, f_g(\mathbf{y}_i \mid \boldsymbol{\mu}_g + \boldsymbol{\Lambda}_g\mathbf{u}_i, \mathbf{D}_g)\, f_g(\mathbf{u}_i \mid \mathbf{0}, \mathbf{I}_q)\, d\mathbf{y}\, d\mathbf{u}\right]\right\}.$$
In this cycle, we derive an approximate lower bound for the log of the marginal probability function of $\mathbf{W}$ using the approximating density $q(\mathbf{y}, \mathbf{u})$:
$$\log f(\mathbf{w}) = \int q(\mathbf{y}, \mathbf{u}) \log \frac{q(\mathbf{y}, \mathbf{u})}{f(\mathbf{y}, \mathbf{u} \mid \mathbf{w})}\, d\mathbf{y}\, d\mathbf{u} + \int q(\mathbf{y}, \mathbf{u}) \log \frac{f(\mathbf{w}, \mathbf{u}, \mathbf{y})}{q(\mathbf{y}, \mathbf{u})}\, d\mathbf{y}\, d\mathbf{u} = D_{KL}[q(\mathbf{y}, \mathbf{u}) \,\|\, f(\mathbf{y}, \mathbf{u} \mid \mathbf{w})] + F(q(\mathbf{y}, \mathbf{u}), \mathbf{w}), \qquad (6)$$
where $F(q(\mathbf{y}, \mathbf{u}), \mathbf{w})$ is the ELBO and $D_{KL}[q(\mathbf{y}, \mathbf{u}) \,\|\, f(\mathbf{y}, \mathbf{u} \mid \mathbf{w})]$ is the Kullback-Leibler divergence from $f(\mathbf{y}, \mathbf{u} \mid \mathbf{w})$ to $q(\mathbf{y}, \mathbf{u})$. Furthermore, assuming $q(\mathbf{u}, \mathbf{y}) = q(\mathbf{u})\, q(\mathbf{y})$, $q(\mathbf{u}) = N(\tilde{\mathbf{m}}_{ig}, \tilde{\mathbf{V}}_g)$, and $q(\mathbf{y}) = N(\mathbf{m}_{ig}, \mathbf{V}_{ig})$, we can show that
$$F(q(\mathbf{u}, \mathbf{y}), \mathbf{w}) \ge C_i + \mathbf{w}_i^{*T}\mathbf{m}_{ig} - \left(\mathbf{1}_{(K+1)}^T \mathbf{w}_i\right)\left[\log\left(\mathbf{1}_{(K)}^T \exp\left(\mathbf{m}_{ig} + \frac{\operatorname{diag}(\mathbf{V}_{ig})}{2}\right) + 1\right)\right] + \frac{1}{2}\Big(\log|\mathbf{V}_{ig}| + \log|\tilde{\mathbf{V}}_g| + q + K - \log|\mathbf{D}_g| - \tilde{\mathbf{m}}_{ig}^T\tilde{\mathbf{m}}_{ig} - \operatorname{tr}(\tilde{\mathbf{V}}_g) - \operatorname{tr}\big(\mathbf{D}_g^{-1}(\mathbf{V}_{ig} + (\mathbf{m}_{ig} - \boldsymbol{\mu}_g)(\mathbf{m}_{ig} - \boldsymbol{\mu}_g)^T)\big) + 2(\mathbf{m}_{ig} - \boldsymbol{\mu}_g)^T\mathbf{D}_g^{-1}\boldsymbol{\Lambda}_g\tilde{\mathbf{m}}_{ig} - \tilde{\mathbf{m}}_{ig}^T\boldsymbol{\Lambda}_g^T\mathbf{D}_g^{-1}\boldsymbol{\Lambda}_g\tilde{\mathbf{m}}_{ig} - \operatorname{tr}(\boldsymbol{\Lambda}_g^T\mathbf{D}_g^{-1}\boldsymbol{\Lambda}_g\tilde{\mathbf{V}}_g)\Big) = \tilde{F}(\boldsymbol{\mu}_g, \boldsymbol{\Lambda}_g, \mathbf{D}_g, \mathbf{m}_{ig}, \mathbf{V}_{ig}, \tilde{\mathbf{m}}_{ig}, \tilde{\mathbf{V}}_g).$$
Here, $\mathbf{m}_{ig}$ and $\mathbf{V}_{ig}$ are the variational parameters of $q(\mathbf{y}_i)$ from the first cycle, and $\tilde{\mathbf{m}}_{ig}$ and $\tilde{\mathbf{V}}_g$ are the variational parameters of $q(\mathbf{u}_i)$. Details of the derivation of the lower bound are provided in Appendix B. In this cycle, for the parameter updates in the $(t+1)$-th iteration, the following steps are conducted:

1. Update the variational Gaussian lower bound of the complete-data log-likelihood of the second cycle $\tilde{L}$ by updating $\tilde{\mathbf{m}}_{ig}^{(t+1)}$ and $\tilde{\mathbf{V}}_g^{(t+1)}$ as
$$\tilde{\mathbf{m}}_{ig}^{(t+1)} = \left(\boldsymbol{\Lambda}_g^{(t)T}\mathbf{D}_g^{(t)-1}\boldsymbol{\Lambda}_g^{(t)} + \mathbf{I}_q\right)^{-1}\boldsymbol{\Lambda}_g^{(t)T}\mathbf{D}_g^{(t)-1}\left(\mathbf{m}_{ig}^{(t+1)} - \boldsymbol{\mu}_g^{(t+1)}\right), \quad \text{and} \quad \tilde{\mathbf{V}}_g^{(t+1)} = \left(\boldsymbol{\Lambda}_g^{(t)T}\mathbf{D}_g^{(t)-1}\boldsymbol{\Lambda}_g^{(t)} + \mathbf{I}_q\right)^{-1}.$$
2. Update the group indicator variable $Z$. Similar to the first cycle, we compute an approximation of $E(Z_{ig})$ using the ELBO from the second cycle:
$$\hat{z}_{ig}^{(t+1)} = \frac{\pi_g^{(t+1)} \exp\left\{\tilde{F}\left(\boldsymbol{\mu}_g^{(t+1)}, \boldsymbol{\Lambda}_g^{(t)}, \mathbf{D}_g^{(t)}, \mathbf{m}_{ig}^{(t+1)}, \mathbf{V}_{ig}^{(t+1)}, \tilde{\mathbf{m}}_{ig}^{(t+1)}, \tilde{\mathbf{V}}_g^{(t+1)}\right)\right\}}{\sum_{h=1}^{G} \pi_h^{(t+1)} \exp\left\{\tilde{F}\left(\boldsymbol{\mu}_h^{(t+1)}, \boldsymbol{\Lambda}_h^{(t)}, \mathbf{D}_h^{(t)}, \mathbf{m}_{ih}^{(t+1)}, \mathbf{V}_{ih}^{(t+1)}, \tilde{\mathbf{m}}_{ih}^{(t+1)}, \tilde{\mathbf{V}}_h^{(t+1)}\right)\right\}}.$$
3. Update $\mathbf{D}_g^{(t+1)}$ and $\boldsymbol{\Lambda}_g^{(t+1)}$ as
$$\hat{\mathbf{D}}_g^{(t+1)} = \operatorname{diag}\left\{\hat{\boldsymbol{\Sigma}}_g^{(t+1)} - 2\,\boldsymbol{\Lambda}_g^{(t)}\boldsymbol{\beta}_g^{(t+1)}\hat{\mathbf{S}}_g^{(t+1)} + \boldsymbol{\Lambda}_g^{(t)}\boldsymbol{\theta}_g^{(t+1)}\boldsymbol{\Lambda}_g^{(t)T}\right\}, \quad \hat{\boldsymbol{\Lambda}}_g^{(t+1)} = \hat{\mathbf{S}}_g^{(t+1)}\boldsymbol{\beta}_g^{(t+1)T}\boldsymbol{\theta}_g^{(t+1)-1},$$
where
$$\hat{\mathbf{S}}_g^{(t+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{ig}^{(t+1)}\left(\mathbf{m}_{ig}^{(t+1)} - \boldsymbol{\mu}_g^{(t+1)}\right)\left(\mathbf{m}_{ig}^{(t+1)} - \boldsymbol{\mu}_g^{(t+1)}\right)^T}{\sum_{i=1}^{n} \hat{z}_{ig}^{(t+1)}}, \quad \hat{\boldsymbol{\Sigma}}_g^{(t+1)} = \frac{\sum_{i=1}^{n} \hat{z}_{ig}^{(t+1)}\left[\mathbf{V}_{ig}^{(t+1)} + \left(\mathbf{m}_{ig}^{(t+1)} - \boldsymbol{\mu}_g^{(t+1)}\right)\left(\mathbf{m}_{ig}^{(t+1)} - \boldsymbol{\mu}_g^{(t+1)}\right)^\top\right]}{\sum_{i=1}^{n} \hat{z}_{ig}^{(t+1)}},$$
$$\boldsymbol{\theta}_g^{(t+1)} = \left(\boldsymbol{\Lambda}_g^{(t)T}\mathbf{D}_g^{(t)-1}\boldsymbol{\Lambda}_g^{(t)} + \mathbf{I}_q\right)^{-1} + \boldsymbol{\beta}_g^{(t+1)}\hat{\mathbf{S}}_g^{(t+1)}\boldsymbol{\beta}_g^{(t+1)T}, \quad \text{and} \quad \boldsymbol{\beta}_g^{(t+1)} = \left(\boldsymbol{\Lambda}_g^{(t)T}\mathbf{D}_g^{(t)-1}\boldsymbol{\Lambda}_g^{(t)} + \mathbf{I}_q\right)^{-1}\boldsymbol{\Lambda}_g^{(t)T}\mathbf{D}_g^{(t)-1}.$$

Overall, our algorithm consists of the following steps:

I. Specify the number of clusters $G$ and the number of latent factors $q$, and provide initial guesses for $\boldsymbol{\Lambda}_g$, $\mathbf{D}_g$, and $Z_{ig}$.
II. First cycle:
1) Update the variational Gaussian lower bound of the complete-data log-likelihood of the first cycle by estimating $\mathbf{V}_{ig}$ and $\mathbf{m}_{ig}$.
2) Update $Z_{ig}$.
3) Update $\pi_g$ and $\boldsymbol{\mu}_g$.
III. Second cycle:
1) Update the variational Gaussian lower bound of the complete-data log-likelihood of the second cycle by estimating $\tilde{\mathbf{V}}_g$ and $\tilde{\mathbf{m}}_{ig}$.
2) Update $Z_{ig}$ again.
3) Update $\hat{\mathbf{S}}_g$, $\hat{\boldsymbol{\Sigma}}_g$, $\mathbf{D}_g$, and $\boldsymbol{\Lambda}_g$.
IV. Compute the log-likelihood $\sum_{i=1}^{n} \log \sum_{g=1}^{G} \pi_g f(\mathbf{w}_i \mid \boldsymbol{\vartheta}_g)$ using the current estimates and check for convergence. If converged, stop; otherwise, return to step II.

Note that $\boldsymbol{\Lambda}_g$ is unidentifiable. This can be seen by letting $\boldsymbol{\Lambda}_g^* = \boldsymbol{\Lambda}_g\mathbf{T}$ be a new factor loading matrix, where $\mathbf{T}$ is an orthonormal matrix such that $\mathbf{T}\mathbf{T}^T = \mathbf{I}$; then $\boldsymbol{\Lambda}_g^*\boldsymbol{\Lambda}_g^{*T} + \mathbf{D}_g = \boldsymbol{\Lambda}_g\mathbf{T}\mathbf{T}^T\boldsymbol{\Lambda}_g^T + \mathbf{D}_g = \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g$. Hence, we focus on the recovery of $\boldsymbol{\Sigma}_g = \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g$, which is identifiable.
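The conditional updates above can be collected into a small sketch (our own transcription under our own naming, not the authors' implementation): `factor_posterior` computes the variational parameters of $q(\mathbf{u})$, `cm_step` the updates for $\boldsymbol{\Lambda}_g$ and $\mathbf{D}_g$, and `responsibilities` the ELBO-based $\hat{z}_{ig}$, using a log-sum-exp for numerical stability.

```python
import numpy as np

def factor_posterior(Lambda, d, m, mu):
    """m_tilde = (L'D^-1 L + I_q)^-1 L'D^-1 (m - mu); V_tilde = (L'D^-1 L + I_q)^-1."""
    q = Lambda.shape[1]
    LtDinv = Lambda.T / d                        # Lambda' D^{-1} for diagonal D
    V_tilde = np.linalg.inv(LtDinv @ Lambda + np.eye(q))
    return V_tilde @ LtDinv @ (m - mu), V_tilde

def cm_step(Lambda, d, S_hat, Sigma_hat):
    """Conditional-maximization updates for (Lambda_g, diag of D_g)."""
    q = Lambda.shape[1]
    LtDinv = Lambda.T / d
    inv_q = np.linalg.inv(LtDinv @ Lambda + np.eye(q))
    beta = inv_q @ LtDinv                        # beta_g
    theta = inv_q + beta @ S_hat @ beta.T        # theta_g
    Lambda_new = S_hat @ beta.T @ np.linalg.inv(theta)
    d_new = np.diag(Sigma_hat - 2 * Lambda @ beta @ S_hat
                    + Lambda @ theta @ Lambda.T)
    return Lambda_new, d_new

def responsibilities(log_pi, elbo):
    """z_hat[i, g] proportional to pi_g * exp(ELBO_ig), via the log-sum-exp trick."""
    a = log_pi[None, :] + elbo                   # (n, G)
    a -= a.max(axis=1, keepdims=True)
    z = np.exp(a)
    return z / z.sum(axis=1, keepdims=True)
```

A useful sanity check of `cm_step`: if $\hat{\mathbf{S}}_g = \hat{\boldsymbol{\Sigma}}_g = \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g$ exactly, the updates return the same $(\boldsymbol{\Lambda}_g, \mathbf{D}_g)$, i.e., the true factor model is a fixed point.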
Additionally, by incorporating a factor structure, we can utilize the Woodbury identity (Woodbury 1950) to compute $\boldsymbol{\Sigma}_g^{-1}$:
$$\boldsymbol{\Sigma}_g^{-1} = \left(\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^T + \mathbf{D}_g\right)^{-1} = \mathbf{D}_g^{-1} - \mathbf{D}_g^{-1}\boldsymbol{\Lambda}_g\left(\mathbf{I}_q + \boldsymbol{\Lambda}_g^T\mathbf{D}_g^{-1}\boldsymbol{\Lambda}_g\right)^{-1}\boldsymbol{\Lambda}_g^T\mathbf{D}_g^{-1},$$
and thus the matrix inversion is $O(q^3)$ as opposed to $O(K^3)$. Therefore, when $q \ll K$, inverting $\boldsymbol{\Sigma}_g$ is computationally efficient.

2.4 A Family of Mixture Models for Clustering

To introduce parsimony, we further impose constraints on the parameters of the covariance matrix of the latent variable, namely $\boldsymbol{\Lambda}_g = \boldsymbol{\Lambda}$ and/or $\mathbf{D}_g = \mathbf{D}$ across groups, and whether $\mathbf{D}_g = d_g\mathbf{I}$. This results in a family of eight different parsimonious LNM-FAs (Table 1). Similar constraints on the components of the covariance matrices were utilized by McNicholas & Murphy (2008) and Subedi et al. (2013, 2015).

Table 1: The family of logistic normal multinomial factor analyzers.

Model | Λ_g (Group) | D_g (Group) | D_g (Diagonal) | Total Par
"UUU" | U | U | U | G[Kq − q(q−1)/2] + KG + (G − 1) + GK
"UUC" | U | U | C | G[Kq − q(q−1)/2] + G + (G − 1) + GK
"UCU" | U | C | U | G[Kq − q(q−1)/2] + K + (G − 1) + GK
"UCC" | U | C | C | G[Kq − q(q−1)/2] + 1 + (G − 1) + GK
"CUU" | C | U | U | [Kq − q(q−1)/2] + KG + (G − 1) + GK
"CUC" | C | U | C | [Kq − q(q−1)/2] + G + (G − 1) + GK
"CCU" | C | C | U | [Kq − q(q−1)/2] + K + (G − 1) + GK
"CCC" | C | C | C | [Kq − q(q−1)/2] + 1 + (G − 1) + GK

In Table 1, the column "Group" refers to constraints across groups, the column "Diagonal" refers to the matrix having the same diagonal elements, and the letters indicate whether or not the constraints were imposed on the parameters, with "U" standing for unconstrained and "C" for constrained. For example, the model "UCU" refers to unconstrained $\boldsymbol{\Lambda}_g$ but constrained $\mathbf{D}_g = \mathbf{D}$, whereas in the model "UCC", where constraints on both the "Group" and the "Diagonal" are imposed for $\mathbf{D}_g$, we have $\mathbf{D}_g = d\mathbf{I}$. Details of the parameter estimates for the LNM-FA family are provided in Appendix C.

2.5 Initialization

For estimation, we need to first initialize the model parameters, the variational parameters, and the component indicator variable $Z_{ig}$. The EM algorithm for finite mixture models is known to depend heavily on the starting values. Let $z_{ig}^*$, $\pi_g^*$, $\boldsymbol{\mu}_g^*$, $\mathbf{D}_g^*$, $\boldsymbol{\Lambda}_g^*$, $\mathbf{m}_{ig}^*$, and $\mathbf{V}_{ig}^*$ be the initial values for $Z_{ig}$, $\pi_g$, $\boldsymbol{\mu}_g$, $\mathbf{D}_g$, $\boldsymbol{\Lambda}_g$, $\mathbf{m}_{ig}$, and $\mathbf{V}_{ig}$, respectively. The initialization is conducted as follows:

1. $z_{ig}^*$ can be obtained by random allocation of observations to different clusters, from the cluster assignments of k-means clustering, or from the cluster assignments of a model-based clustering algorithm. Since our algorithm is based on a factor analyzer structure, we initialize $Z_{ig}$ using the cluster memberships obtained by fitting parsimonious Gaussian mixture models (PGMM; McNicholas & Murphy 2008) to the transformed variable $\mathbf{Y}$ obtained using Equation 1. For computational purposes, any 0 in $\mathbf{W}$ was replaced by 0.001 for initialization. An implementation of PGMM is available in the R package "pgmm" (McNicholas et al. 2019).

2. Using this initial partition, $\boldsymbol{\mu}_g^*$ is the sample mean of the $g$-th cluster and $\pi_g^*$ is the proportion of observations in the $g$-th cluster in this initial partition.

3. Similar to McNicholas & Murphy (2008), we estimate the sample covariance matrix $\mathbf{S}_g^*$ for each group and then use the eigendecomposition of $\mathbf{S}_g^*$ to obtain $\mathbf{D}_g^*$ and $\boldsymbol{\Lambda}_g^*$. Suppose $\boldsymbol{\lambda}_g$ is the vector of the $q$ largest eigenvalues of $\mathbf{S}_g^*$ and the columns of $\mathbf{L}_g$ are the corresponding eigenvectors; then
$$\boldsymbol{\Lambda}_g^* = \mathbf{L}_g\operatorname{diag}(\boldsymbol{\lambda}_g)^{1/2}, \quad \text{and} \quad \mathbf{D}_g^* = \operatorname{diag}\left\{\mathbf{S}_g^* - \boldsymbol{\Lambda}_g^*\boldsymbol{\Lambda}_g^{*T}\right\}.$$
4. As the Newton-Raphson method is used to update the variational parameters, we need $\mathbf{m}^*$ and $\mathbf{V}^*$. For $\mathbf{m}^*$, we apply the additive log-ratio transformation to the observed taxa compositions $\hat{\mathbf{p}}$ and set $\mathbf{m}^* = \phi(\hat{\mathbf{p}})$ using Equation 1. For $\mathbf{V}^*$, we use a diagonal matrix in which all diagonal entries are 0.1. Note that it is important to choose a small value for $\mathbf{V}^*$ to avoid overshooting in the Newton-Raphson method.

Convergence of the algorithm is determined using the Aitken acceleration criterion. The Aitken acceleration (Aitken 1926) is defined as
$$a^{(k)} = \frac{l^{(k+1)} - l^{(k)}}{l^{(k)} - l^{(k-1)}},$$
where $l^{(k)}$ stands for the log-likelihood value at the $k$-th iteration. Then, the asymptotic estimate of the log-likelihood at iteration $k+1$ is
$$l_\infty^{(k+1)} = l^{(k)} + \frac{l^{(k+1)} - l^{(k)}}{1 - a^{(k)}},$$
and the algorithm is considered converged when $|l_\infty^{(k+1)} - l_\infty^{(k)}| < \epsilon$, where $\epsilon$ is a small number (Böhning et al. 1994). Here, we set $\epsilon = 10^{-…}$.

In clustering, the number of components is unknown. Hence, we run our algorithm for all candidate numbers of clusters and latent factors, and the best model is chosen a posteriori using a model selection criterion. Here, we use the Bayesian Information Criterion (BIC; Schwarz 1978). The BIC is the most popular choice for model selection in model-based clustering and is defined as
$$\text{BIC} = 2\, l(\mathbf{w}, \hat{\boldsymbol{\vartheta}}) - p\log(n),$$
where $l(\mathbf{w}, \hat{\boldsymbol{\vartheta}})$ is the log-likelihood evaluated at the estimate $\hat{\boldsymbol{\vartheta}}$, $p$ is the number of free parameters, and $n$ is the number of observations. When the true labels are known, the Adjusted Rand Index (ARI; Hubert & Arabie 1985) is used for performance assessment. For perfect agreement, the ARI is 1, while the expected value of the ARI under random classification is 0.

3 Simulation Studies

In this section, we use simulation studies to demonstrate the clustering performance and parameter recovery of the proposed LNM-FA models. We first generate $\mathbf{Y}$ from a multivariate normal distribution and then transform the data into compositions using the additive log-ratio transformation. Count data were then generated from a multinomial distribution with these compositions, with the total count for each observation generated from a uniform distribution $U[5000, …]$; the number of components and the number of latent factors $(G, q)$ were chosen using the BIC.
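This simulation design can be sketched as follows (our own sketch; the component means, scales, and the upper bound of the total-count distribution are illustrative placeholders, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(42)
n, K, G = 1000, 10, 3
pi = np.array([0.5, 0.3, 0.2])                        # mixing proportions
mus = rng.normal(scale=0.5, size=(G, K))              # illustrative component means

# Latent Gaussian draws, one component label per observation
labels = rng.choice(G, size=n, p=pi)
Y = mus[labels] + rng.normal(scale=0.3, size=(n, K))

# Inverse ALR: map each Y_i to a composition on the (K+1)-part simplex
e = np.exp(Y)
denom = e.sum(axis=1, keepdims=True) + 1.0
P = np.hstack([e, np.ones((n, 1))]) / denom

# Total counts from a uniform distribution, then multinomial draws
totals = rng.integers(5000, 10001, size=n)            # upper bound is a placeholder
W = np.array([rng.multinomial(t, p) for t, p in zip(totals, P)])

assert np.allclose(P.sum(axis=1), 1.0)
assert (W.sum(axis=1) == totals).all()
```

The resulting count matrix `W` and labels `labels` play the roles of the observed data and the true partition used to compute the ARI.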
Here, we generated 100 ten-dimensional datasets, each of size n = 1000 from mostconstrained model “CCC” with G = 3, and q = 3. Figure 1 shows a visualization ofthe cluster structure in the latent space for one of the hundred datasets and Figure2 shows the visualization of the relative abundance for observed count data of thesame dataset.We ran all 8 models in the LNM-FA family for G = 1 . . . q = 1 . . . G = 3 and q = 3 with an average ARI of 0.999 (standarddeviation [sd] of 0.003). The true values of the parameters π and µ are provided inTable 2. As Λ is not identifiable but Σ = ΛΛ T + D is identifiable, we demonstratethe recovery of Σ . The true value of Λ g and D g for Σ is provided in the AppendixD and the average and standard errors of norm of the bias of Σ is provided in Table2. For comparison, we also ran the LNM-MM and DMM on all hundred datasetsfor G = 1 , . . . ,
5. In 81 out of the 100 datasets, the BIC selected a three componentLNM-MM model with an average ARI of 0.99 (sd of 0.00) and a four component22able 2: The true parameters and average of the estimated values along with thestandard errors for Simulation Study 1.
Component 1( n = 500) µ [-0.17, 0.03, 0.08, 0.24, 0.24, -0.06, -0.03,0.14, -0.11, 0.14]Average of ˆ µ [-0.17, 0.03, 0.08, 0.25, 0.24, -0.06, -0.02, 0.14, -0.11, 0.14]sd of ˆ µ (0.02, 0.01, 0.02 ,0.05, 0.04, 0.02, 0.03, 0.02, 0.03, 0.02) π π (sd of ˆ π ) 0.50(0.014)Component 2( n = 300) µ [0.33, 0.63, 0.44, 0.60, 0.32, 0.52, 0.39, 0.50,0.51,0.45]Average of ˆ µ [0.33, 0.63, ,0.44, 0.60, 0.33, 0.52, 0.39, 0.50, 0.51 , 0.45]sd of ˆ µ (0.03, 0.02, 0.03, 0.06, 0.05, 0.02, 0.04, 0.02, 0.03, 0.02) π π (sd of ˆ π ) 0.301(0.014)Component 3( n = 200) µ [-0.59, -0.66, -0.55, -0.45, -0.60, -0.68, -0.53, -0.41,-0.65, -0.46]Average of ˆ µ [-0.587, -0.662, -0.553, -0.444 ,-0.602, -0.683, -0.526, -0.408, -0.647, -0.463 ]sd of ˆ µ (0.03, 0.03, 0.03, 0.07, 0.07, 0.03, 0.04, 0.02, 0.04, 0.03) π π (sd of ˆ π ) 0.199 (0.011)Average and sd of the L1 norm of the difference between estimated and true Covariance.Note that for “CCC” model, all components have the same Σ .Average of | ˆ Σ ( i ) − Σ | . | ˆ Σ ( i ) − Σ | . Y in one of the hundred datasets fromSimulation Study 1. The observations are colored using their true class label. Forthis dataset, ARI of 1 was obtained by LNM-FA. 
−0.08−0.020.05 −0.01−0.010.080.120.150.13 0.080.250.27−0.36−0.3−0.32−0.4−0.42−0.41 −0.39−0.42−0.34−0.13−0.19−0.15−0.38−0.33−0.320.390.440.47 −0.08−0.06−0.1−0.14−0.23−0.30.02−0.070.060.20.020.06−0.24−0.23−0.18 0.30.250.260.310.280.17−0.06−0.11−0.130.350.150.140.030.160.07−0.56−0.37−0.49 0.190.160.040.080.120.10.03−0.070.090.160.110.250.02−0.06−0.01−0.16−0.24−0.180.210.320.27 0.230.080.15−0.16−0.22−0.1700.09−0.030.50.540.56−0.43−0.5−0.550.210.270.22−0.090.03−0.03−0.08−0.02−0.11 −0.13−0.22−0.15−0.16−0.1−0.1−0.07−0.06−0.07−0.16−0.19−0.10.30.330.310.020.06−0.06−0.19−0.23−0.24−0.090.01−0.2−0.17−0.05−0.12 Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10 Y Y Y Y Y Y Y Y Y Y −2 −1 0 1 −1 0 1 −2 −1 0 1 −2 0 2 4 −3−2−10 1 2 3 −1 0 1 −2−1 0 1 2 −1.0−0.50.00.51.0 −2−1 0 1 2 −1 0 10.000.250.500.75−101−2−101−2024−3−2−10123−101−2−1012−1.0−0.50.00.51.0−2−1012−101 model in 13 of the datasets. The LNM-MM model encountered computational issuesin 6 out of the 100 datasets. On the other hand, a five component DMM was selectedby the BIC in all 100 datasets with an average ARI of 0.00 (sd 0.00). Here, we generate 100 ten-dimensional datasets, each of size n = 1000 from mostflexible model “UUU” with G = 3, and q = 3. Figure 3 shows visualization ofthe cluster structure in the latent space in one of the hundred datasets and Figure4 shows the visualization of the relative abundance for observed count data of the24igure 2: Boxplot of the relative abundances of the observed counts in each clusterin one of the hundred datasets from Simulation Study 1. Clusters R e l . abundan c e o f ob s e r v ed v a r i ab l e Clusters R e l . abundan c e o f ob s e r v ed v a r i ab l e Clusters R e l . abundan c e o f ob s e r v ed v a r i ab l e Clusters R e l . abundan c e o f ob s e r v ed v a r i ab l e Clusters R e l . abundan c e o f ob s e r v ed v a r i ab l e Clusters R e l . abundan c e o f ob s e r v ed v a r i ab l e Clusters R e l . abundan c e o f ob s e r v ed v a r i ab l e Clusters R e l . 
same dataset.

We ran all 8 models in the LNM-FA family for G = 1, . . . and q = 1, . . . ; in all 100 datasets the BIC selected G = 3 and q = 3, with an average ARI of 1 (sd of 0). The true values of the parameters π and µ, along with the average and standard deviations of their estimates, are provided in Table 3. Again, the average and standard deviations of the L1 norm of the error in each Σ_g are provided in Table 3. We also ran the LNM-MM and DMM on all 100 datasets. From the LNM-MM

Table 3: The true parameters and average of the estimated values along with the standard errors for Simulation Study 2.

Component 1 (n = 500)
  µ: [0.16, -0.13, 0.06, 0.13, 0.00, -0.06, -0.02, -0.11, 0.00, 0.03]
  Average of estimated µ: [0.163, -0.130, 0.057, 0.134, 0.001, -0.064, -0.015, -0.108, 0.002, 0.027]
  sd of estimated µ: (0.02, 0.01, 0.02, 0.05, 0.04, 0.02, 0.03, 0.02, 0.03, 0.02)
  π (sd of estimated π): 0.50 (0.02)

Component 2 (n = 300)
  µ: [0.79, 1.01, 0.66, 0.76, 0.86, 0.83, 0.66, 0.68, 0.85, 0.84]
  Average of estimated µ: [0.79, 1.01, 0.66, 0.76, 0.86, 0.83, 0.66, 0.68, 0.85, 0.84]
  sd of estimated µ: (0.03, 0.02, 0.02, 0.05, 0.03, 0.03, 0.05, 0.02, 0.02, 0.03)
  π (sd of estimated π): 0.30 (0.01)

Component 3 (n = 200)
  µ: [-0.77, -0.89, -0.88, -0.78, -0.71, -0.89, -0.86, -0.82, -0.86, -0.80]
  Average of estimated µ: [-0.77, -0.89, -0.88, -0.78, -0.71, -0.89, -0.86, -0.82, -0.86, -0.80]
  sd of estimated µ: (0.02, 0.02, 0.02, 0.01, 0.02, 0.01, 0.02, 0.01, 0.01, 0.02)
  π (sd of estimated π): 0.20 (0.01)

The average and standard deviation of the L1 norm of the error in each component's covariance matrix were also recorded.

Figure 3: Pairwise plots of Y in one of the hundred datasets from Simulation Study 2. The observations are colored using their true class labels. For this dataset, an ARI of 1 was obtained by LNM-FA.
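The adjusted Rand index (ARI; Hubert & Arabie 1985) used throughout compares an estimated partition with the true one while correcting for chance agreement, so a perfect recovery scores 1 regardless of label switching. A self-contained sketch:

```python
from math import comb
from collections import Counter

def ari(true_labels, est_labels):
    # Adjusted Rand index computed from pairwise contingency counts.
    pairs = Counter(zip(true_labels, est_labels))
    a, b = Counter(true_labels), Counter(est_labels)
    n = len(true_labels)
    idx = sum(comb(c, 2) for c in pairs.values())      # agreeing pairs
    ra = sum(comb(c, 2) for c in a.values())           # row-margin pairs
    rb = sum(comb(c, 2) for c in b.values())           # column-margin pairs
    expected = ra * rb / comb(n, 2)                    # chance agreement
    max_idx = (ra + rb) / 2
    return (idx - expected) / (max_idx - expected)

# Perfect recovery up to label switching gives ARI = 1.
print(ari([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))  # → 1.0
```

Random labelings have expected ARI near 0, which is why the DMM results above can score 0.00 despite always producing a partition.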
family, the BIC selected a three-component model in 12 out of the 100 datasets with perfect classification, but four- and five-component models in 70 and 9 datasets, respectively. The LNM-MM implementation encountered singularities in the remaining 9 datasets. On the other hand, a five-component DMM was selected every time, with an average ARI of 0.27 (sd of 0.03).

We applied our method to three publicly available microbiome datasets.
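Throughout, the number of components G and the number of factors q are chosen by the BIC (Schwarz 1978), which penalizes the maximized log-likelihood by model complexity. A minimal sketch; the "larger is better" sign convention below is an assumption (some references use -2 log L + p log n, where smaller is better), and the numbers are made up for illustration.

```python
import math

def bic(loglik, n_params, n_obs):
    # BIC = 2 * log-likelihood - (#free parameters) * log(n); larger is better
    # in this convention.
    return 2.0 * loglik - n_params * math.log(n_obs)

# A richer model that fits slightly better can still lose to a leaner one
# once the complexity penalty is paid.
print(bic(-1490.0, 80, 1000) < bic(-1500.0, 40, 1000))  # True
```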
Dietswap Dataset:
We applied our algorithm to the microbiome dataset Dietswap
(O'Keefe et al. 2015) available in the R package microbiome (Lahti & Shetty 2012-2019). Colorectal cancer is the third most prevalent cancer worldwide (Garrett 2019). The rate of colon cancer in Americans of African descent is much higher than in rural Africans (O'Keefe et al. 2015). Recent findings indicate that the risk of colon cancer is associated with dietary habits that affect the gut microbiota (Garrett 2019). To investigate diet-associated cancer risk, O'Keefe et al. (2015) collected fecal samples from 20 healthy middle-aged Africans (AFR) and 20 African Americans (AAM). Fecal samples were taken at 6 different time points: the first three measurements (i.e., Day 0, Day 7, and Day 14) were taken in the participants' home environment under their regular dietary habits, and the last three measurements (i.e., Day 15, Day 22, and Day 29) were taken after an intervention diet. The Human Intestinal Tract Chip phylogenetic microarray was used for the global profiling of microbiota composition. As repeated measurements at different time points are taken on the same individuals, which violates the independence assumption and cannot currently be accommodated by our model, we only utilize the measurements at Day 0.
Hence, the resulting dataset comprises 38 individuals from Day 0, and we focus our analysis at the genus level, resulting in 130 genera.
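Under the LNM model, each sample's genus-level counts are mapped to real-valued coordinates via the additive log-ratio (ALR) transform, with one category (here, an aggregated "Others" group) serving as the reference. A minimal sketch; the pseudo-count used to guard against zero counts is an illustrative assumption, not a detail taken from the paper.

```python
import numpy as np

def alr(counts, pseudo=0.5):
    # Additive log-ratio transform: log(p_k / p_ref) for k = 1..K, using the
    # last category as the reference; a pseudo-count avoids log(0).
    x = np.asarray(counts, dtype=float) + pseudo
    p = x / x.sum()
    return np.log(p[:-1] / p[-1])

# Four categories in, three real coordinates out; the last entry ("Others")
# is the reference and is absorbed into the ratios.
y = alr([30, 5, 0, 65])
```

The inverse map (exponentiate, append a zero for the reference, and normalize) returns a point on the simplex, which is how the latent Gaussian variable connects to the multinomial counts.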
FerrettiP Dataset:
We applied our algorithm to the microbiome dataset
FerrettiP (Ferretti et al. 2018) available in the R package curatedMetagenomicData (Pasolli et al. 2017). The study sampled the microbiome of 25 mother-infant pairs across multiple body sites from birth up to 4 months postpartum. Out of the 216 samples collected, 119 were derived from stool (a proxy for the gut microbiome), 15 from skin swabs (skin microbiome), 63 from oral cavity swabs (oral microbiome), and 19 from vaginal swabs (vaginal microbiome). Here, we focus our analysis on the subset of 119 stool samples (23 adults and 96 newborns). As repeated measurements at different time points are taken on the same individuals (violating the independence assumption), we only focus on one time point (i.e., Day 1) for the newborns at the genus level. Hence, the resulting dataset comprises 42 individuals (23 adults and 19 newborns) and 262 genera.

ShiB Dataset:
We applied our algorithm to the microbiome dataset
ShiB (Shi et al. 2015) available in the R package curatedMetagenomicData (Pasolli et al. 2017). Periodontitis is a common oral disease that affects about 50% of American adults and is associated with alterations in the subgingival microbiome of individual tooth sites (Shi et al. 2015). Currently used clinical parameters cannot adequately predict disease progression (McGuire & Nunn 1996). The study by Shi et al. (2015) was designed to identify potential prognostic biomarkers, based on the composition of the subgingival microbiome, that can predict periodontitis. Oral microbiome samples were collected from 12 individuals with chronic periodontitis before and after nonsurgical therapy at multiple tooth sites. Only the samples from the tooth sites that were clinically resolved after the therapy were retained, resulting in a total of 48 samples (24 periodontitis samples and 24 recovered samples) and 96 genera. Although multiple samples per individual were obtained, Shi et al. (2015) noted that individual tooth sites are likely to have independent clinical states and unique microbial communities in subgingival pockets, so we also treat the samples as independent.

For all three datasets, we first utilize the R package ALDEx2 (Fernandes et al. 2013, 2014, Gloor et al. 2016) for differential abundance analysis to identify the genera that differ between the two groups (i.e., AFR vs. AAM for the Dietswap dataset, adults vs. infants for
FerrettiP dataset, and periodontitis vs. recovered for
ShiB dataset). This step is analogous to conducting differential expression analysis in RNA-seq studies before performing cluster analysis, in order to remove noise variables before clustering the data. Here, we used the Welch's t-test option in ALDEx2 on the log-transformed counts for each genus and selected those genera for which the corresponding expected value of the Benjamini-Hochberg corrected p-value is less than 0.05. The numbers of differentially abundant genera for the Dietswap, FerrettiP, and
ShiB datasets are 23, 8, and 4, respectively. To preserve the relative abundance, the remaining genera are aggregated into a category "Others", which is then used as the reference level for the additive log-ratio transformation. The heatmaps of the relative abundances of the differentially abundant genera for all three datasets in Figure 5 show some distinct differences in the relative abundances of the genera between the groups. We ran all 8 models from our mixtures of LNM-FA family for G = 1, . . . ,
3. Since all three datasets have different dimensions, we ran q = 1, . . . for the Dietswap and
FerrettiP datasets, and q = 1, . . . for the ShiB dataset. The BIC was used to select the best-fitting model. For comparison, we also ran the mixtures of LNM models without the factor structure and the Dirichlet-multinomial mixture model on all three datasets for G = 1, . . . ,
3. The classification results from all three approaches are summarized in Table 4. In all three datasets, our proposed LNM-FA was able to recover the underlying groups, and it outperformed the DMM in all three datasets. In the
Dietswap (sample size n = 38) and FerrettiP (sample size n = 42) datasets, 23 and 8 taxa at the genus level were identified as differentially abundant, respectively. Thus, when fitting the LNM-MM to these two datasets, Σ becomes singular, whereas the LNM-FA could still be fitted due to the computational advantage that comes with the incorporation of the factor analyzer structure. On the other hand, the LNM-MM could be fitted to the ShiB dataset, where the dimensionality of the dataset after

Figure 5: Heatmap of the relative abundance of the differentially abundant genera in all three datasets. (a) Dietswap dataset.
(b) FerrettiP dataset. (c) ShiB dataset.

Table 4: Summary of the clustering performances on all three real datasets using the best-fitting models by LNM-FA, LNM-MM, and DMM.
Dietswap
  LNM-FA (CUU), G = 2, q = 2 (AAM / AFR): Est. Group 1: 20, 1; Est. Group 2: -, 16
  LNM-MM: could not be fitted
  DMM, G = 3 (AAM / AFR): Est. Group 1: 2, 13; Est. Group 2: 14, 1; Est. Group 3: 5, 3; ARI = 0.38

FerrettiP
  LNM-FA (UCC), G = 2, q = 1 (Adult / Infant): Est. Group 1: 22, 0; Est. Group 2: 1, 19
  LNM-MM: could not be fitted
  DMM, G = 2 (Adult / Infant): Est. Group 1: 22, 1; Est. Group 2: 1, 18; ARI = 0.81

ShiB
  LNM-FA (CCC), G = 2, q = 1 (Periodontitis / Recovered): Est. Group 1: 21, 4; Est. Group 2: 3, 20
  LNM-MM, G = 2 (Periodontitis / Recovered): Est. Group 1: 21, 4; Est. Group 2: 3, 20
  DMM, G = 2 (Periodontitis / Recovered): Est. Group 1: 18, 2; Est. Group 2: 6, 22; ARI = 0.43

differential abundance analysis was 5 (i.e., four differentially abundant genera and one aggregated column of "Others"). In
the ShiB dataset, both LNM-FA and LNM-MM selected a two-component model with an ARI of 0.49. However, the number of parameters that needs to be estimated for the covariance matrices of the latent variable in the best-fitting LNM-FA model (i.e., CCC with q = 1) is smaller than for the LNM-MM (i.e., 4 for LNM-FA vs. 20 for LNM-MM). Note that the DMM could be fitted to all three datasets. The DMM accounts for overdispersion by placing a Dirichlet prior on the multinomial parameter p. However, as noted by Aitchison (1982) and Xia et al. (2013), the logistic normal multinomial distribution allows for a more flexible covariance structure than the Dirichlet-multinomial model.

Here, we extended the additive logistic normal multinomial mixture model to high-dimensional data by incorporating a factor analyzer structure. A family of eight mixture models was proposed by imposing constraints on the components of the covariance matrix of the latent variable. Due to the incorporation of the factor analyzer structure, the number of parameters is now linear in the dimensionality of the latent variable, as opposed to the additive logistic normal multinomial mixture model, where the number of parameters grows quadratically. Through simulation studies, we demonstrated that our proposed approach provides excellent clustering performance and parameter recovery. Imposing a factor analyzer structure allows us to work in a lower dimension q compared to K, and thus the number of free parameters in the covariance matrix is greatly reduced when q is chosen to be sufficiently smaller than K. Additionally, the use of the Woodbury identity provides further computational advantages. In the real data analyses, our approach outperformed the Dirichlet-multinomial mixture model in all three datasets. For the Dietswap dataset and
FerrettiP datasets, the LNM-MM of Fang & Subedi (2020) (i.e., the additive logistic normal multinomial mixture model without the factor analyzer structure) could not be fitted due to computational issues, as the dimensions of those datasets are higher. In the
ShiB dataset, where K is small, the LNM-MM and our proposed LNM-FA provide comparable performance. While our approach can deal with the high-dimensional nature of microbiome data, it does not currently account for any covariate information. Microbiome composition is very dynamic and is affected by time-variant covariates such as diet and environmental exposures, as well as time-invariant covariates such as gender. Understanding how various biological and environmental factors affect changes in microbiome composition could provide valuable biological insight into disease diagnosis and prognosis.

Acknowledgements
This work was supported by the Collaboration Grant for Mathematicians from the Simons Foundation.

References
Abdel-Aziz, M. I., Brinkman, P., Vijverberg, S. J., Neerincx, A. H., Riley, J. H., Bates, S., Hashimoto, S., Kermani, N. Z., Chung, K. F., Djukanovic, R. et al. (2020), 'Sputum microbiome profiles identify severe asthma phenotypes of relative stability at 12-18 months', Journal of Allergy and Clinical Immunology.

Äijö, T., Müller, C. L. & Bonneau, R. (2018), 'Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing', Bioinformatics (3), 372-380.

Aitchison, J. (1982), 'The statistical analysis of compositional data', Journal of the Royal Statistical Society: Series B (Methodological), 139-160.

Aitken, A. C. (1926), 'A series formula for the roots of algebraic and transcendental equations', Proceedings of the Royal Society of Edinburgh, 14-22.

Archambeau, C., Cornford, D., Opper, M. & Shawe-Taylor, J. (2007), 'Gaussian process approximations of stochastic differential equations', Journal of Machine Learning Research - Proceedings Track, 1-16.

Arridge, S. R., Ito, K., Jin, B. & Zhang, C. (2018), 'Variational Gaussian approximation for Poisson data', Inverse Problems (2), 025005.

Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D. R., Fernandes, G. R., Tap, J., Bruls, T., Batto, J.-M. et al. (2011), 'Enterotypes of the human gut microbiome', Nature (7346), 174-180.

Becker, C., Neurath, M. & Wirtz, S. (2015), 'The intestinal microbiota in inflammatory bowel disease', ILAR Journal.

Blei, D. & Lafferty, J. (2007), 'A correlated topic model of science', The Annals of Applied Statistics.

Böhning, D., Dietz, E., Schaub, R., Schlattmann, P. & Lindsay, B. (1994), 'The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family', Annals of the Institute of Statistical Mathematics, 373-388.

Challis, E. & Barber, D. (2013), 'Gaussian Kullback-Leibler approximate inference', The Journal of Machine Learning Research, 2239-2286.

Chen, J. & Li, H. (2013), 'Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis', The Annals of Applied Statistics (1).

Cho, I. & Blaser, M. J. (2012), 'The human microbiome: at the interface of health and disease', Nature Reviews Genetics (4), 260-270.

Davis, C. (2016), 'The gut microbiome and its role in obesity', Nutrition Today, 167-174.

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', Journal of the Royal Statistical Society: Series B (1), 1-38.

Fang, Y. & Subedi, S. (2020), 'Mixtures of logistic-normal multinomial mixture model for clustering microbiome data', arXiv preprint arXiv:2011.06682.

Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R. & Gloor, G. B. (2014), 'Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis', Microbiome (1), 15.

Fernandes, A., Macklaim, J., Linn, T., Reid, G. & Gloor, G. (2013), 'ANOVA-like differential gene expression analysis of single-organism and meta-RNA-seq', PLOS One (7), e67019.

Ferretti, P., Pasolli, E., Tett, A., Asnicar, F., Gorfer, V., Fedi, S., Armanini, F., Truong, D. T., Manara, S., Zolfo, M. et al. (2018), 'Mother-to-infant microbial transmission from different body sites shapes the developing infant gut microbiome', Cell Host & Microbe (1), 133-145.

Garrett, W. S. (2019), 'The gut microbiota and colon cancer', Science (6446), 1133-1135.

Ghahramani, Z., Hinton, G. E. et al. (1997), The EM algorithm for mixtures of factor analyzers, Technical Report CRG-TR-96-1, University of Toronto.

Gloor, G. B., Macklaim, J. M. & Fernandes, A. D. (2016), 'Displaying variation in large datasets: plotting a visual summary of effect sizes', Journal of Computational and Graphical Statistics (3), 971-979.

Holmes, I., Harris, K. & Quince, C. (2012), 'Dirichlet multinomial mixtures: Generative models for microbial metagenomics', PLOS One, e30126.

Hotterbeekx, A., Xavier, B. B., Bielen, K., Lammens, C., Moons, P., Schepens, T., Ieven, M., Jorens, P. G., Goossens, H., Kumar-Singh, S. et al. (2016), 'The endotracheal tube microbiome associated with Pseudomonas aeruginosa or Staphylococcus epidermidis', Scientific Reports, 36507.

Hubert, L. & Arabie, P. (1985), 'Comparing partitions', Journal of Classification (1), 193-218.

Huttenhower, C., Gevers, D., Knight, R., Abubucker, S., Badger, J. H., Chinwalla, A. T., Creasy, H. H., Earl, A. M., FitzGerald, M. G., Fulton, R. S. et al. (2012), 'Structure, function and diversity of the healthy human microbiome', Nature (7402), 207.

Koslovsky, M. D. & Vannucci, M. (2020), 'MicroBVS: Dirichlet-tree multinomial regression models with Bayesian variable selection - an R package', BMC Bioinformatics (1), 1-10.

La Rosa, P. S., Brooks, J. P., Deych, E., Boone, E. L., Edwards, D. J., Wang, Q., Sodergren, E., Weinstock, G. & Shannon, W. D. (2012), 'Hypothesis testing and power calculations for taxonomic-based human microbiome data', PLOS One (12), e52078.

Lahti, L. & Shetty, S. (2012-2019), 'microbiome R package'.

McGuire, M. K. & Nunn, M. E. (1996), 'Prognosis versus actual outcome. II. The effectiveness of clinical parameters in developing an accurate prognosis', Journal of Periodontology (7), 658-665.

McLachlan, G. & Peel, D. (2000), Mixtures of factor analyzers, in 'Proceedings of the Seventeenth International Conference on Machine Learning'.

McNicholas, P. D., ElSherbiny, A., McDaid, A. F. & Murphy, T. B. (2019), pgmm: Parsimonious Gaussian Mixture Models. R package version 1.2.4. URL: https://CRAN.R-project.org/package=pgmm

McNicholas, P. D. & Murphy, T. B. (2008), 'Parsimonious Gaussian mixture models', Statistics and Computing (3), 285-296.

Meng, X.-L. & Van Dyk, D. (1997), 'The EM algorithm - an old folk-song sung to a fast new tune', Journal of the Royal Statistical Society: Series B (Statistical Methodology) (3), 511-567.

O'Keefe, S. J., Li, J. V., Lahti, L., Ou, J., Carbonero, F., Mohammed, K., Posma, J. M., Kinross, J., Wahl, E., Ruder, E. et al. (2015), 'Fat, fibre and cancer risk in African Americans and rural Africans', Nature Communications (1), 1-14.

Pasolli, E., Schiffer, L., Manghi, P., Renson, A., Obenchain, V., Truong, D. T., Beghini, F., Malik, F., Ramos, M., Dowd, J. B. et al. (2017), 'Accessible, curated metagenomic data through ExperimentHub', Nature Methods (11), 1023.

Pfirschke, C., Garris, C. & Pittet, M. J. (2015), 'Common TLR5 mutations control cancer progression', Cancer Cell (1), 1-3.

Schwarz, G. (1978), 'Estimating the dimension of a model', The Annals of Statistics, 461-464.

Sender, R., Fuchs, S. & Milo, R. (2016), 'Revised estimates for the number of human and bacteria cells in the body', PLOS Biology, e1002533.

Shi, B., Chang, M., Martin, J., Mitreva, M., Lux, R., Klokkevold, P., Sodergren, E., Weinstock, G., Haake, S. & Li, H. (2015), 'Dynamic changes in the subgingival microbiome and their potential for diagnosis and prognosis of periodontitis', mBio, e01926-14.

Silverman, J. D., Durand, H. K., Bloom, R. J., Mukherjee, S. & David, L. A. (2018), 'Dynamic linear models guide design and analysis of microbiota studies within artificial human guts', Microbiome (1), 1-20.

Subedi, S. & Browne, R. (2020), 'A parsimonious family of multivariate Poisson-lognormal distributions for clustering multivariate count data', Stat (1), e310.

Subedi, S., Neish, D., Bak, S. & Feng, Z. (2020), 'Cluster analysis of microbiome data via mixtures of Dirichlet-multinomial regression models', Journal of the Royal Statistical Society: Series C (5), 1163-1187.

Subedi, S., Punzo, A., Ingrassia, S. & McNicholas, P. D. (2013), 'Clustering and classification via cluster-weighted factor analyzers', Advances in Data Analysis and Classification (1), 5-40.

Subedi, S., Punzo, A., Ingrassia, S. & McNicholas, P. D. (2015), 'Cluster-weighted t-factor analyzers for robust model-based clustering and dimension reduction', Statistical Methods & Applications (4), 623-649.

Taie, W. S., Omar, Y. & Badr, A. (2018), Clustering of human intestine microbiomes with k-means, in '2018 21st Saudi Computer Society National Computer Conference (NCC)', IEEE, pp. 1-6.

Wadsworth, W. D., Argiento, R., Guindani, M., Galloway-Pena, J., Shelburne, S. A. & Vannucci, M. (2017), 'An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data', BMC Bioinformatics (1), 1-12.

Wainwright, M. J. & Jordan, M. I. (2008), Graphical Models, Exponential Families, and Variational Inference, Now Publishers Inc., Hanover, MA, USA.

Woodbury, M. A. (1950), 'Inverting modified matrices', Memorandum Report (106), 336.

Wu, G. D., Chen, J., Hoffmann, C., Bittinger, K., Chen, Y.-Y., Keilbaugh, S. A., Bewtra, M., Knights, D., Walters, W. A., Knight, R. et al. (2011), 'Linking long-term dietary patterns with gut microbial enterotypes', Science (6052), 105-108.

Xia, F., Chen, J., Fung, W. K. & Li, H. (2013), 'A logistic normal multinomial regression model for microbiome compositional data analysis', Biometrics (4), 1053-1063.

A ELBO for LNM model
First, we decompose F(q(\mathbf{y}), \mathbf{w}) into three parts:
\[
F(q(\mathbf{y}), \mathbf{w}) = \int q(\mathbf{y}) \log f(\mathbf{w} \mid \mathbf{y})\, d\mathbf{y} + \int q(\mathbf{y}) \log f(\mathbf{y})\, d\mathbf{y} - \int q(\mathbf{y}) \log q(\mathbf{y})\, d\mathbf{y}.
\]
The second and third integrals (i.e., E_{q(\mathbf{y})}[\log f(\mathbf{y})] and E_{q(\mathbf{y})}[\log q(\mathbf{y})]) have explicit solutions such that
\[
E_{q(\mathbf{y})}[\log f(\mathbf{y})] = -\frac{K}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{m} - \boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{m} - \boldsymbol{\mu}) - \frac{1}{2}\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{V})
\]
and
\[
-E_{q(\mathbf{y})}[\log q(\mathbf{y})] = \frac{1}{2}\log|\mathbf{V}| + \frac{K}{2} + \frac{K}{2}\log(2\pi).
\]
Note that \mathbf{V} is a diagonal matrix. The first integral has no explicit solution because of the expectation of the log-sum-exponential term:
\[
E_{q(\mathbf{y})}[\log f(\mathbf{w} \mid \mathbf{y})] = C + \mathbf{w}^{*T}\mathbf{m} - \left(\sum_{k=1}^{K+1} w_k\right) E_{q(\mathbf{y})}\!\left[\log \sum_{k=1}^{K+1} \exp(y_k)\right],
\]
where \mathbf{w}^{*} represents the K-dimensional vector containing the first K elements of \mathbf{w}, y_{K+1} is set to 0, and C = \log\!\left[\left(\sum_{k=1}^{K+1} w_k\right)! \,\big/\, \prod_{k=1}^{K+1} w_k!\right] is the multinomial normalizing constant. Blei & Lafferty (2007) proposed an upper bound for E_{q(\mathbf{y})}\!\left[\log\left(\sum_{k=1}^{K+1} \exp(y_k)\right)\right]:
\[
E_{q(\mathbf{y} \mid \mathbf{m}, \mathbf{V})}\!\left[\log \sum_{k=1}^{K+1} \exp(y_k)\right] \le \xi^{-1}\left\{\sum_{k=1}^{K+1} E_{q(\mathbf{y} \mid \mathbf{m}, \mathbf{V})}[\exp(y_k)]\right\} - 1 + \log \xi, \qquad (7)
\]
where \xi \in \mathbb{R}^{+} is introduced as a new variational parameter. Fang & Subedi (2020) utilized this upper bound to find a lower bound for E_{q(\mathbf{y})}[\log f(\mathbf{w} \mid \mathbf{y})]. Here we further simplify the bound of Blei & Lafferty (2007). Let Z = \sum_{k=1}^{K+1} \exp(y_k); then, by Jensen's inequality,
\[
E_{q(\mathbf{y})}\!\left[\log \sum_{k=1}^{K+1} \exp(y_k)\right] \le \log E_{q(\mathbf{y})}\!\left[\sum_{k=1}^{K+1} \exp(y_k)\right] = \log\!\left[\sum_{k=1}^{K} \exp\!\left(m_k + \frac{V_k}{2}\right) + 1\right],
\]
where m_k and V_k denote the k-th entry of \mathbf{m} and the k-th diagonal entry of \mathbf{V}, respectively. The two upper bounds are equal when (7) is minimized with respect to \xi. Combining all three parts, we obtain the approximate lower bound for \log f(\mathbf{w}):
\[
\tilde{F}(q(\mathbf{y}), \mathbf{w}) = C + \mathbf{w}^{*T}\mathbf{m} - \left(\sum_{k=1}^{K+1} w_k\right) \log\!\left[\sum_{k=1}^{K} \exp\!\left(m_k + \frac{V_k}{2}\right) + 1\right] + \frac{1}{2}\log|\mathbf{V}| + \frac{K}{2} - \frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{m} - \boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{m} - \boldsymbol{\mu}) - \frac{1}{2}\operatorname{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{V}).
\]

B ELBO for Cycle 2
Here, in the second cycle, we have
\[
F(q(\mathbf{u}, \mathbf{y}), \mathbf{w}) = \int\!\!\int q(\mathbf{u}, \mathbf{y}) \log \frac{f(\mathbf{w}, \mathbf{u}, \mathbf{y})}{q(\mathbf{u}, \mathbf{y})}\, d\mathbf{y}\, d\mathbf{u} = \int\!\!\int q(\mathbf{u}, \mathbf{y}) \log f(\mathbf{w} \mid \mathbf{u}, \mathbf{y})\, d\mathbf{y}\, d\mathbf{u} + \int\!\!\int q(\mathbf{u}, \mathbf{y}) \log f(\mathbf{u}, \mathbf{y})\, d\mathbf{y}\, d\mathbf{u} - \int\!\!\int q(\mathbf{u}, \mathbf{y}) \log q(\mathbf{u}, \mathbf{y})\, d\mathbf{y}\, d\mathbf{u}.
\]
Furthermore, we assume that q(\mathbf{u}, \mathbf{y}) = q(\mathbf{u})\, q(\mathbf{y}), with \mathbf{u} \sim N(\tilde{\mathbf{m}}, \tilde{\mathbf{V}}) and \mathbf{y} \sim N(\mathbf{m}, \mathbf{V}). Since f(\mathbf{w} \mid \mathbf{u}, \mathbf{y}) = f(\mathbf{w} \mid \mathbf{y}), the first term can be written as
\[
\int\!\!\int q(\mathbf{u})\, q(\mathbf{y}) \log f(\mathbf{w} \mid \mathbf{y})\, d\mathbf{y}\, d\mathbf{u} = \int q(\mathbf{y}) \log f(\mathbf{w} \mid \mathbf{y})\, d\mathbf{y}.
\]
This is identical to the first term of the ELBO in the first cycle, and thus its lower bound is
\[
\int\!\!\int q(\mathbf{u}, \mathbf{y}) \log f(\mathbf{w} \mid \mathbf{u}, \mathbf{y})\, d\mathbf{y}\, d\mathbf{u} \ge C + \mathbf{w}^{*T}\mathbf{m} - \left(\sum_{k=1}^{K+1} w_k\right) \log\!\left[\sum_{k=1}^{K} \exp\!\left(m_k + \frac{V_k}{2}\right) + 1\right].
\]
The third term is
\[
-\int\!\!\int q(\mathbf{u}, \mathbf{y}) \log q(\mathbf{u}, \mathbf{y})\, d\mathbf{y}\, d\mathbf{u} = \frac{1}{2}\left(\log|\mathbf{V}| + \log|\tilde{\mathbf{V}}| + q + K + (K + q)\log 2\pi\right).
\]
The second term is
\[
\int\!\!\int q(\mathbf{u}, \mathbf{y}) \log f(\mathbf{u}, \mathbf{y})\, d\mathbf{y}\, d\mathbf{u} = E_{q(\mathbf{u})} E_{q(\mathbf{y})}[\log f(\mathbf{y} \mid \mathbf{u})\, f(\mathbf{u})] = -\frac{1}{2}\Big\{(q + K)\log(2\pi) + \log|\mathbf{D}| + \tilde{\mathbf{m}}^{T}\tilde{\mathbf{m}} + \operatorname{tr}(\tilde{\mathbf{V}}) + \operatorname{tr}(\boldsymbol{\Lambda}^{T}\mathbf{D}^{-1}\boldsymbol{\Lambda}\tilde{\mathbf{V}}) + \operatorname{tr}\!\big(\mathbf{D}^{-1}(\mathbf{V} + (\mathbf{m} - \boldsymbol{\mu})(\mathbf{m} - \boldsymbol{\mu})^{T})\big) - 2(\mathbf{m} - \boldsymbol{\mu})^{T}\mathbf{D}^{-1}\boldsymbol{\Lambda}\tilde{\mathbf{m}} + \tilde{\mathbf{m}}^{T}\boldsymbol{\Lambda}^{T}\mathbf{D}^{-1}\boldsymbol{\Lambda}\tilde{\mathbf{m}}\Big\}.
\]
Combining the three terms,
\[
F(q(\mathbf{u}, \mathbf{y}), \mathbf{w}) \ge C + \mathbf{w}^{*T}\mathbf{m} - \left(\sum_{k=1}^{K+1} w_k\right) \log\!\left[\sum_{k=1}^{K} \exp\!\left(m_k + \frac{V_k}{2}\right) + 1\right] + \frac{1}{2}\Big\{\log|\mathbf{V}| + \log|\tilde{\mathbf{V}}| + q + K - \log|\mathbf{D}| - \tilde{\mathbf{m}}^{T}\tilde{\mathbf{m}} - \operatorname{tr}(\tilde{\mathbf{V}}) - \operatorname{tr}\!\big(\mathbf{D}^{-1}(\mathbf{V} + (\mathbf{m} - \boldsymbol{\mu})(\mathbf{m} - \boldsymbol{\mu})^{T})\big) + 2(\mathbf{m} - \boldsymbol{\mu})^{T}\mathbf{D}^{-1}\boldsymbol{\Lambda}\tilde{\mathbf{m}} - \tilde{\mathbf{m}}^{T}\boldsymbol{\Lambda}^{T}\mathbf{D}^{-1}\boldsymbol{\Lambda}\tilde{\mathbf{m}} - \operatorname{tr}(\boldsymbol{\Lambda}^{T}\mathbf{D}^{-1}\boldsymbol{\Lambda}\tilde{\mathbf{V}})\Big\},
\]
where \mathbf{m} and \mathbf{V} are obtained from the first cycle. In addition to being variational parameters in the second cycle, it is worth noting that \tilde{\mathbf{m}}_{ig} = E(\mathbf{u}_{ig} \mid \mathbf{y}_i, z_{ig} = 1) and \tilde{\mathbf{V}}_g = \operatorname{Cov}(\mathbf{u}_{ig} \mid \mathbf{y}_i, z_{ig} = 1).
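The low-dimensional forms of Ṽ and m̃ rest on the Woodbury identity. A quick numerical spot-check with arbitrary test matrices (the sizes and random values are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, q = 6, 2
Lam = rng.normal(size=(K, q))               # stands in for Lambda_g
D = np.diag(rng.uniform(0.5, 1.5, size=K))  # stands in for D_g (invertible)
Dinv = np.linalg.inv(D)

# Woodbury: (Lam^T D^{-1} Lam + I_q)^{-1} = I_q - Lam^T (Lam Lam^T + D)^{-1} Lam
lhs = np.linalg.inv(Lam.T @ Dinv @ Lam + np.eye(q))
rhs = np.eye(q) - Lam.T @ np.linalg.inv(Lam @ Lam.T + D) @ Lam
print(np.allclose(lhs, rhs))  # True

# The two expressions for the conditional mean agree as well.
r = rng.normal(size=K)  # stands in for (m_ig - mu_g)
m1 = lhs @ Lam.T @ Dinv @ r
m2 = Lam.T @ np.linalg.inv(Lam @ Lam.T + D) @ r
print(np.allclose(m1, m2))  # True
```

The computational payoff is that the left-hand sides only ever invert q-by-q or diagonal matrices, whereas the right-hand sides invert a K-by-K matrix.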
This follows from the joint distribution
\[
\begin{pmatrix}\mathbf{y}_i \\ \mathbf{u}_{ig}\end{pmatrix}\Bigg|\, z_{ig} = 1 \;\sim\; N\!\left(\begin{pmatrix}\boldsymbol{\mu}_g \\ \mathbf{0}\end{pmatrix}, \begin{pmatrix}\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T} + \mathbf{D}_g & \boldsymbol{\Lambda}_g \\ \boldsymbol{\Lambda}_g^{T} & \mathbf{I}_q\end{pmatrix}\right).
\]
Therefore,
\[
E(\mathbf{u}_{ig} \mid \mathbf{y}_i, z_{ig} = 1) = \boldsymbol{\Lambda}_g^{T}(\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T} + \mathbf{D}_g)^{-1}(\mathbf{m}_{ig} - \boldsymbol{\mu}_g), \qquad \operatorname{Cov}(\mathbf{u}_{ig} \mid \mathbf{y}_i, z_{ig} = 1) = \mathbf{I}_q - \boldsymbol{\Lambda}_g^{T}(\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T} + \mathbf{D}_g)^{-1}\boldsymbol{\Lambda}_g.
\]
By the Woodbury identity, the following inverse can be rewritten as
\[
(\boldsymbol{\Lambda}_g^{T}\mathbf{D}_g^{-1}\boldsymbol{\Lambda}_g + \mathbf{I}_q)^{-1} = \mathbf{I}_q - \boldsymbol{\Lambda}_g^{T}(\mathbf{D}_g + \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T})^{-1}\boldsymbol{\Lambda}_g,
\]
and \mathbf{D}_g is always invertible by design. Hence,
\[
\tilde{\mathbf{V}} = (\boldsymbol{\Lambda}_g^{T}\mathbf{D}_g^{-1}\boldsymbol{\Lambda}_g + \mathbf{I}_q)^{-1} = \mathbf{I}_q - \boldsymbol{\Lambda}_g^{T}(\mathbf{D}_g + \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T})^{-1}\boldsymbol{\Lambda}_g = \operatorname{Cov}(\mathbf{u}_{ig} \mid \mathbf{y}_i, z_{ig} = 1).
\]
Applying the same trick to \tilde{\mathbf{m}}, we have
\[
\tilde{\mathbf{m}} = (\boldsymbol{\Lambda}_g^{T}\mathbf{D}_g^{-1}\boldsymbol{\Lambda}_g + \mathbf{I}_q)^{-1}\boldsymbol{\Lambda}_g^{T}\mathbf{D}_g^{-1}(\mathbf{m}_{ig} - \boldsymbol{\mu}_g) = \big(\mathbf{I}_q - \boldsymbol{\Lambda}_g^{T}(\mathbf{D}_g + \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T})^{-1}\boldsymbol{\Lambda}_g\big)\boldsymbol{\Lambda}_g^{T}\mathbf{D}_g^{-1}(\mathbf{m}_{ig} - \boldsymbol{\mu}_g) = \boldsymbol{\Lambda}_g^{T}\big(\mathbf{D}_g^{-1} - (\mathbf{D}_g + \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T})^{-1}\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T}\mathbf{D}_g^{-1}\big)(\mathbf{m}_{ig} - \boldsymbol{\mu}_g) = \boldsymbol{\Lambda}_g^{T}(\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T} + \mathbf{D}_g)^{-1}(\mathbf{m}_{ig} - \boldsymbol{\mu}_g),
\]
where the last equality follows from
\[
\mathbf{I} = (\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T} + \mathbf{D}_g)\big(\mathbf{D}_g^{-1} - (\mathbf{D}_g + \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T})^{-1}\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T}\mathbf{D}_g^{-1}\big),
\]
which, together with symmetry, shows that \mathbf{D}_g^{-1} - (\mathbf{D}_g + \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T})^{-1}\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T}\mathbf{D}_g^{-1} = (\boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g^{T} + \mathbf{D}_g)^{-1}. Hence we conclude that the variational parameters are essentially the conditional expectation and covariance of \mathbf{u}_{ig} \mid \mathbf{y}_i.

C Parameter estimates for the family of models
From here, we derive the family of 8 models by setting different constraints on Σ. Note that the following identities are easy to verify:
\[
\sum_{i=1}^{n} z_{ig} = n_g, \qquad \log|(d_g \mathbf{I}_K)^{-1}| = \log(d_g^{-K}), \qquad \frac{\sum_{i=1}^{n} z_{ig}(\tilde{\mathbf{m}}_{ig}\tilde{\mathbf{m}}_{ig}^{T} + \tilde{\mathbf{V}}_{ig})}{n_g} = \boldsymbol{\theta}_g.
\]

1. "UUU": no constraints are placed on \boldsymbol{\Lambda}_g or \mathbf{D}_g. The solution is exactly as in the derivation above.

2. "UUC": we assume \mathbf{D}_g = d_g \mathbf{I}_K, with no constraint on \boldsymbol{\Lambda}_g. Apart from \mathbf{D}_g, the estimates are exactly the same as for model "UUU":
\[
\hat{d}_g = \frac{1}{K}\operatorname{tr}\{\boldsymbol{\Sigma}_g - \boldsymbol{\Lambda}_g\boldsymbol{\beta}_g\mathbf{S}_g + \boldsymbol{\Lambda}_g\boldsymbol{\theta}_g\boldsymbol{\Lambda}_g^{T}\}.
\]

3. "UCU": we assume \mathbf{D}_g = \mathbf{D}, with no constraint on \boldsymbol{\Lambda}_g. Apart from \mathbf{D}, the estimates are exactly the same as for model "UUU". Taking the derivative with respect to \mathbf{D}^{-1}, we get
\[
\hat{\mathbf{D}} = \frac{1}{n}\sum_{g=1}^{G} n_g \operatorname{diag}\{\boldsymbol{\Sigma}_g - \boldsymbol{\Lambda}_g\boldsymbol{\beta}_g\mathbf{S}_g + \boldsymbol{\Lambda}_g\boldsymbol{\theta}_g\boldsymbol{\Lambda}_g^{T}\}.
\]

4. "UCC": we assume \mathbf{D}_g = d \mathbf{I}_K, with no constraint on \boldsymbol{\Lambda}_g. Apart from \mathbf{D}_g, the estimates are exactly the same as for model "UUU". Following the same procedure as for models "UUC" and "UCU", we get
\[
\hat{d} = \frac{1}{Kn}\sum_{g=1}^{G} n_g \operatorname{tr}\{\boldsymbol{\Sigma}_g - \boldsymbol{\Lambda}_g\boldsymbol{\beta}_g\mathbf{S}_g + \boldsymbol{\Lambda}_g\boldsymbol{\theta}_g\boldsymbol{\Lambda}_g^{T}\}.
\]

5. "CUU": we assume \boldsymbol{\Lambda}_g = \boldsymbol{\Lambda}, with no constraint on \mathbf{D}_g. Apart from \boldsymbol{\Lambda}, the estimates are exactly the same as for model "UUU". Taking the derivative of l with respect to \boldsymbol{\Lambda} gives
\[
\frac{\partial l}{\partial \boldsymbol{\Lambda}} = \sum_{g=1}^{G} n_g (\mathbf{D}_g^{-1}\mathbf{S}_g\boldsymbol{\beta}_g^{T} - \mathbf{D}_g^{-1}\boldsymbol{\Lambda}\boldsymbol{\theta}_g),
\]
which must be solved for \boldsymbol{\Lambda} in a row-by-row manner. Let \boldsymbol{\lambda}_i denote the i-th row of \boldsymbol{\Lambda} and \mathbf{r}_i the i-th row of \sum_{g=1}^{G} n_g \mathbf{D}_g^{-1}\mathbf{S}_g\boldsymbol{\beta}_g^{T}. Then
\[
\boldsymbol{\lambda}_i = \mathbf{r}_i \left(\sum_{g=1}^{G} \frac{n_g}{d_g(i)} \boldsymbol{\theta}_g\right)^{-1},
\]
where d_g(i) is the i-th diagonal entry of \mathbf{D}_g.

6. "CUC": we assume \boldsymbol{\Lambda}_g = \boldsymbol{\Lambda} and \mathbf{D}_g = d_g \mathbf{I}_K. The estimate of \boldsymbol{\Lambda} is exactly as in model "CUU"; the estimate of \mathbf{D}_g is exactly as in model "UUC".

7. "CCU": we assume \boldsymbol{\Lambda}_g = \boldsymbol{\Lambda} and \mathbf{D}_g = \mathbf{D}. The estimate of \boldsymbol{\Lambda} is exactly as in model "CUU"; the estimate of \mathbf{D} is exactly as in model "UCU".

8. "CCC": we assume \boldsymbol{\Lambda}_g = \boldsymbol{\Lambda} and \mathbf{D}_g = d \mathbf{I}_K. The estimate of \boldsymbol{\Lambda} is exactly as in model "CUU"; the estimate of d is exactly as in model "UCC".

D True Parameters for Σ_g in Simulation Studies

True Λ and D for Σ in Simulation Study 1:
Λ = [10 × 3 matrix of loadings], D = [scalar multiple of I_10].

True Λ_g and D_g for Σ_g in Simulation Study 2:

Λ_1 = Λ (the same loading matrix as in Simulation Study 1), Λ_2 = [10 × 3 matrix of loadings], Λ_3 = [10 × 3 matrix of loadings], with D_1, D_2, and D_3 = [10 × 10 diagonal matrices].