A Bayesian Mixture Model for Clustering on the Stiefel Manifold
Subhajit Sengupta, Subhadip Pal, Riten Mitra, Ying Guo, Arunava Banerjee, Yuan Ji
aarXiv: arXiv:0000.0000
A Bayesian Mixture Model forClustering on the Stiefel Manifold
Subhajit Sengupta ∗ , ‡ , Subhadip Pal ∗ , § Riten Mitra § , Ying Guo ¶ ,Arunava Banerjee † , (cid:107) and Yuan Ji † , ‡ , ∗∗ Program for Computational Genomics and Medicine, NorthShore University HealthSystem ‡ Department of Bioinformatics and Biostatistics , University of Louisville § Department of Biostatistics and Bioinformatics, Emory University ¶ Department of Computer & Information Science & Engineering, University of Florida (cid:107)
Department of Public Health Sciences, The University of Chicago ∗∗ E-mails of the corresponding authors [email protected] ; [email protected] Abstract:
Analysis of a Bayesian mixture model for the Matrix Langevindistribution on the Stiefel manifold is presented. The model exploits a par-ticular parametrization of the Matrix Langevin distribution, various aspectsof which are elaborated on. A general, and novel, family of conjugate pri-ors, and an efficient Markov chain Monte Carlo (MCMC) sampling schemefor the corresponding posteriors is then developed for the mixture model.Theoretical properties of the prior and posterior distributions, includingposterior consistency, are explored in detail. Extensive simulation experi-ments are presented to validate the efficacy of the framework. Real-worldexamples, including a large scale neuroimaging dataset, are analyzed todemonstrate the computational tractability of the approach.
Keywords and phrases:
Matrix Langevin Mixture model, Mixture model,Orthonormal vectors, Parametric model, Stiefel manifold.
1. Introduction
Analysis of directional data comprises a major sub-field of study in Statistics.Directional data range from unit vectors in the simplest examples, to sets ofordered orthonormal frames in the general case. Since the associated samplespace is not the Euclidean space, standard statistical methods developed for theEuclidean space for the analysis of univariate or multivariate data cannot beeasily adapted for directional data. For example, it is often desirable to accountfor the geometric structure underlying the sample space in statistical inference.Beyond those fashioned for simpler non-Euclidean spaces like the circle or thesphere, there is a pressing need for methodology development for general sam-ple spaces such as the Stiefel or the Grassmann manifold to support modern ∗ Equal contribution † Corresponding author 1 a r X i v : . [ s t a t . M E ] A ug engupta et al./BMMCSM applications, increasingly seen in the fields of computer vision (Turaga, Veer-araghavan and Chellappa, 2008; Turaga et al., 2011; Anand, Mittal and Meer,2016; Lui and Beveridge, 2008; Zeng et al., 2015), medical image analysis (Lui,2012), astronomy (Mardia and Jupp, 2009; Lin, Rao and Dunson, 2017), and,biology (Downs, 1972; Mardia and Khatri, 1977), to name but a few. In thisarticle, we present a framework for Bayesian inference of a mixture model onthe Stiefel manifold (James, 1976; Chikuse, 2012) that remains computationallytractable even at large data sizes. With ever-growing computational power, weargue that it is now feasible to apply Bayesian methods to real world large anddirectional data.One of the most commonly used distributions on the Stiefel manifold is an ex-ponential family distribution known as the Matrix Langevin ( ML ) or the Von-Mises Fisher matrix distribution (Mardia and Jupp, 2009; Khatri and Mardia,1977), introduced first by Downs (1972). In early work Mardia and Khatri (1977)and Jupp and Mardia (1980) studied the properties of the maximum likelihoodestimators for this distribution in the classical setting. In large measure, subse-quent efforts at exploring the ML distribution (Chikuse, 1991a,b, 1998) werelimited to asymptotic results on distributional or inferential problems. More re-cently, Hoff (2009) has developed a rejection sampling based method to samplefrom a matrix Bingham-Von Mises-Fisher distribution on the Stiefel manifold.To date, Bayesian analysis on these general sample spaces have been very lim-ited. A major obstacle for the development of efficient inference techniques forthis family of distributions has been the intractability of the corresponding nor-malizing constant, a hypergeometric function of matrix argument.The article that is most aligned to our overall objective is Lin, Rao and Dunson(2017), where the authors have developed a rejection sampling based data aug-mentation strategy for Bayesian inference with the mixture of ML distribution.However, it is well known that sampling techniques based on a data augmen-tation strategy often suffer from slow rates of convergence. With the additionaldetrimental impact of the rejection ratio, convergence can become painfullyslow. Applicability of their MCMC technique is therefore limited, particularlyin terms of scalability to large datasets.Our contribution begins with an exploration of the properties of the ML dis-tribution, followed by the construction of a family of conjugate priors for ML distribution, which we then analyze in considerable detail. 
In the context ofthe natural exponential family, Diaconis and Ylvisaker (Diaconis and Ylvisaker,1979) laid the foundations for constructing conjugate prior distributions (theDY class) for natural exponential family models. In our case, however, the DYconstruction can not be directly applied, and we therefore derive a modifiedconstruction. The resultant prior is flexible in the sense that one can incor-porate information from data via appropriate hyperparameter selection, andfurthermore, there is the provision to set the hyperparameters in the absence ofany prior knowledge to a weakly non-informative prior. For the latter, the priormight become improper in which case we adopt a constrained mixture model.Using this novel prior we implement a scalable posterior inference scheme by de-signing an efficient Gibbs sampler. We note in passing that in the expression for engupta et al./BMMCSM the posterior, the presence of F ( · , · ) in the denominator make the inferenceprocedure challenging. We also explore the weak and strong posterior consis-tency under the new class of priors. Finally, we extend the proposed frameworkfor a single ML distribution to a finite mixture of ML ’s.To identify the optimum number of clusters, often times deviance informationcriterion (DIC) has been used in the literature (Gelman et al., 2003; Spiegelhal-ter et al., 2002). However, several studies have pointed to the weakness of thestandard DIC measure in mixture models and have proposed alternatives. Weperform extensive simulations to identify alternative schemes to computing DICthat would work best for a mixture of ML distributions. In order to demonstratethe scalability of our inference scheme, we then analyze a large-scale DTMRIdataset. Real datasets that have been analyzed in the literature come from as-tronomy (near-earth objects) or vectorcardiography. In both cases the data isdrawn from a matrix valued manifold where each element is a collection of twoorthonormal vectors in R . Realizing that most of the existing applications relyon an efficient computation of the matrix hypergeometric function on a 2 × ML distribution defined on the Stiefel manifold ( V n,p ) and explore its the-oretical properties as well as properties of the corresponding hypergeometricconstant. In Section 3, we present the construction of the conjugate prior andthe posterior for a single ML distribution, properties of which are then analyzedin considerable detail. Generalization to a finite mixture model and inferenceare presented in Section 4, as well as extended theoretical properties such asthe weak and strong posterior consistency. Extensive simulation studies are pre-sented and summarized in Section 5. In Section 6, we provide experimentalresults from two real-world datasets. Conclusions and future work in presentedin Section 7. Notational Convention • R k = The k -dimensional real space. • S p = (cid:8) ( d , . . . , d p ) ∈ R p + : 0 < d p < · · · < d < ∞ (cid:9) . • R n × p = Space of all n × p real-valued matrices. • V n,p = Stiefel Manifold. • (cid:101) V n,p = { X ∈ V n,p : X ,j > ∀ j = 1 , , · · · , p } . • V p,p = O ( p ) = Space of Orthogonal matrices. • Υ( · ) = Product measure defined on V n,p × R p + × V p,p . engupta et al./BMMCSM • I p = p × p identity matrix. • f ( · ; · ) = Probability density function. • g ( · ; · ) = Unnormalized version of the probability density function. • tr ( A ) = Trace of a square matrix A. • etr ( A ) = Exponential of tr ( A ). • E ( X ) = Expectation of the random variable X . 
• I ( · ) = Indicator function. • We use d and D interchangeably. D is the diagonal matrix with diagonal d . We use matrix notation D in the place of d wherever needed, and vector d otherwise. • (cid:107)·(cid:107) = Matrix operator norm. ML distribution on the Stiefel manifold ( V n,p ) The Stiefel manifold, V n,p is the space of all p ordered orthonormal vectors (alsoknown as p -frames) in R n and is defined as V n,p = { X ∈ R n × p : X T X = I p } , where R n × p is the space of all n × p real-valued matrices and I p is the p × p identity matrix (Mardia and Jupp, 2009; Absil, Mahony and Sepulchre, 2009;Chikuse, 2012; Edelman, Arias and Smith, 1998; Downs, 1972). V n,p is a compactRiemannian manifold of dimension np − p ( p + 1) /
2. For p = 1, V n,p is the ( n − S n − and for p = n , V n,p = O ( p ), the orthogonal group consistingall orthogonal p × p real-valued matrices, with the group operation being matrixmultiplication. V n,p may be embedded in the np -dimensional Euclidean spaceof n × p real-valued matrices with the inclusion map as a natural embedding,and is thus a submanifold of R np . Since V n,p is an embedded submanifold of R n × p , its topology is the subset topology induced by R n × p (Absil, Mahony andSepulchre, 2009; Edelman, Arias and Smith, 1998).The differential form ( H T dH ) = (cid:86) pi =1 (cid:86) nj = i +1 h Tj dh i where H ∈ V n,p , is in-variant under the transforms H → QH and H → H P where Q ∈ V n,n and P ∈ V p,p , respectively. This defines an invariant measure on V n,p . The sur-face area or volume of V n,p is V ol ( V n,p ) := (cid:82) V n,p ( H T dH ) = 2 p ( √ π ) np / Γ p ( n/ p ( · ) is the multivariate Gamma function (page 70 in Muirhead (2009)).The measure defined in this manner is called the invariant unnormalized or theHaar measure. This measure can be normalized to a probability measure bysetting (cid:82) V n,p [ dH ] = 1 where [ dH ] = ( H T dH ) /V ol ( V n,p ). Uniform distributionon V n,p is denoted by [ dH ] and is the unique probability measure which is in-variant under rotations and reflections. For detail description of construction ofthe Haar measure on V n,p and its properties please refer to Muirhead (2009). ML distribution Mardia and Jupp (2009) is a widely used non-uniform dis-tribution on V n,p (Khatri and Mardia, 1977; Mardia and Jupp, 2009; Chikuse,2012; Lin, Rao and Dunson, 2017). This distribution is also known as Von Mises-Fisher Matrix Distribution (Khatri and Mardia, 1977). The density function of engupta et al./BMMCSM the ML distribution with respect to the normalized Haar measure [ dX ] andparametrized by F ∈ R n × p , defined in Chikuse (2012), is given by f ML ( X ; F ) = etr ( F T X ) F (cid:16) n , F T F (cid:17) , (1)where etr ( Z ) = exp( trace ( Z ) for any square matrix Z and the normalizingconstant, F ( n/ , F T F / n × p parameter matrix F = M DV T where M ∈ (cid:101) V n,p , V ∈ V p,p and the diag-onal entries of D , d = ( d , d , · · · , d p ) ∈ S p where 0 < d p < · · · < d < d < ∞ (Chikuse, 2012). See Notation for definitions of (cid:101) V n,p , V p,p and S p . Here, (cid:101) V n,p denotes the a subspace of V n,p consisting of matrices in V n,p whose elements ofthe first row of are positive. Note that, being a closed subspace of V n,p , (cid:101) V n,p isalso a compact space.Plugging in the SVD form of F , we rewrite the ML density function as f ML ( X ; ( M, d , V )) = etr ( V DM T X )) F ( n , D ) I ( M ∈ (cid:101) V n,p , d ∈ S p , V ∈ V p,p ) . This parametrization ensures identifiability of all the parameters ( M , d and V ).For notational convenience we omit the indicator function part and use thefollowing form of the ML density for rest of the article f ML ( X ; ( M, d , V )) = etr ( V DM T X )) F ( n , D ) , (2)with respect to the normalized Haar measure [ dX ] (Muirhead, 2009). From Kha-tri and Mardia (1977) (page 96) note that the normalizing constant can besimplified as follows – F (cid:18) n , F T F (cid:19) = F (cid:18) n , D (cid:19) . Thus F ( · ) only depends on the eigenvalues of the matrix F T F , which are thediagonal elements of the matrix D . 
The parametrization with M, D and V en-ables us to represent the intractable hypergeometric function of matrix argumentas a function of vector d , diagonal entries of D , paving a path for an efficientposterior inference. This makes posterior inference computationally tractable.Note that an alternative parametrization through polar decomposition with M and K (Mardia and Jupp, 2009) may pose computational challenges since theelliptical part K lies on a positive semi-definite cone and inference on positivesemi-definite cone is not that straightforward (Hill and Waters, 1987; Bhatia, engupta et al./BMMCSM M, D and V parameters basedrepresentation for ML distribution for most part of our theory.In the following subsection we study a few important properties of the hyper-geometric function of matrix argument F (cid:0) n/ , D / (cid:1) , which are required forsubsequent sections. F (cid:16) n , D (cid:17) Lemma 1.
For any p × p diagonal matrix D with positive elements, F (cid:16) n , D (cid:17) ≤ etr ( D )) when n ≥ p . Proof of Lemma 1.
From Equation 2, we have (cid:90) V n,p f ML ( X ; ( M, d , V )) [ dX ] = 1= ⇒ (cid:90) V n,p etr ( V DM T X )) F ( n , D ) [ dX ] = 1= ⇒ F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( V DM T X )) [ dX ] . (3)We know that f ML ( X ; ( M, d , V )) has the unique modal orientation M V T (page32 in Chikuse (2012)). Hence it follows from Equation 3 that F (cid:18) n , D (cid:19) ≤ (cid:90) V n,p etr ( V DM T M V T )) [ dX ]= etr ( D )) (cid:90) V n,p [ dX ] = etr ( D )) , (4)where [ dX ] is the normalized Haar measure on V n,p . (cid:3) Lemma 2.
Let A be a n × p real matrix with n ≥ p . If (cid:107) A (cid:107) ≤ δ ( < δ ) for some δ > then | A j,j | ≤ δ ( < δ ) for j = 1 , .., p . Here A j,j denotes the ( j, j ) -th entryof the matrix A and (cid:107) A (cid:107) is the spectral norm of the matrix A . Proof of Lemma 2.
From the assumptions of the Lemma 2 along with the definition of the spectralnorm, it follows that l T A T A l ≤ δ ( < δ ) for all l ∈ R p with l T l = 1. Inparticular, e Tj A T A e j ≤ δ ( < δ ) where e j ∈ R p such that its j -th entry equals1 while rest of its entries are 0. Hence we have that (cid:80) nk =1 A k,j ≤ δ ( < δ )implying the fact that | A j,j | ≤ δ ( < δ ) . (cid:3) engupta et al./BMMCSM Lemma 3.
Let D be a p × p diagonal matrix with positive diagonal elements d = { d , d , · · · , d p } . Then for any δ > and n ≥ p , there exists a positiveconstant, K n,p,δ , depending on n, p and δ , such that F (cid:18) n , D (cid:19) > K n,p,δ etr ((1 − δ ) D ) . Proof of Lemma 3.
Note that, D is a p × p diagonal matrix with positive diagonal elements d , .., d p .For the case n ≥ p , define (cid:102) M = (cid:20) I p n − p,p (cid:21) , (cid:101) V = I p and I (cid:63) := (cid:20) I p n − p,p (cid:21) , (5)where I p denotes the p × p identity matrix and n − p,p represents the zero matrixof dimension ( n − p ) × p . For arbitrary given positive constant δ >
0, consider B δ := { X ∈ V n,p , such that (cid:107) X − I (cid:63) (cid:107) < δ } , where (cid:107)·(cid:107) denotes the spectral norm of a matrix. Let µ denotes the normalizedHaar measure on the V n,p . Clearly, 0 < µ ( B δ ) < ∞ , as B δ is a non-empty opensubset of V n,p . Now from Equation 2 we have, F (cid:18) n , D (cid:19) = (cid:90) V n,p etr (cid:16) (cid:101) V D (cid:102) M T X (cid:17) dµ ( X ) . ≥ (cid:90) B δ etr (cid:16) (cid:101) V D (cid:102) M T X (cid:17) dµ ( X ) . (6)Using Lemma 2 we know that X j,j > (1 − δ ) for j = 1 , , . . . , p where X ∈ B δ . Note that, X j,j denotes the ( j, j )-th entry of the matrix X . Hence fromEquation 5 and 6 it follows that, F (cid:18) n , D (cid:19) ≥ (cid:90) B δ exp p (cid:88) j =1 X j,j d j dµ ( X ) ,> µ ( B δ ) etr ((1 − δ ) D ) , (7)where the last inequality uses the fact that d j > j = 1 , . . . p. Finally wedenote K n,p,δ := µ ( B δ ) > n, p along with δ , to conclude that F (cid:18) n , D (cid:19) > K n,p,δ etr ((1 − δ ) D ) . (cid:3) Lemma 4.
For any p × p diagonal matrix D with positive elements d ∈ S p ,the hypergeometric function of matrix argument denoted by F (cid:16) n , D (cid:17) is log-convex with respect to d where n ≥ p . engupta et al./BMMCSM Proof of Lemma 4.
From Equation 2, we have F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( V DM T X ) [ dX ] , (8)for arbitrary M ∈ (cid:101) V n,p and V ∈ V n,p where n ≥ p . Without loss of generality,we can take M = (cid:102) M = (cid:20) I p ( n − p ) ,p (cid:21) and V = I p .Let D and D be two p × p diagonal matrix with positive diagonal entries d and d , respectively and d (cid:54) = d . From Equation 8, we have F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( D (cid:102) M T X ) [ dX ] F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( D (cid:102) M T X ) [ dX ] . (9)Let λ ∈ [0 ,
1] be any real number. We have F (cid:32) n , ( λ D + (1 − λ ) D ) (cid:33) = (cid:90) V n,p etr (( λ D + (1 − λ ) D ) ˜ M T X ) [ dX ]= (cid:90) V n,p (cid:16) etr ( D ˜ M T X ) (cid:17) λ (cid:16) etr ( D ˜ M T X ) (cid:17) − λ [ dX ] < (cid:32)(cid:90) V n,p etr ( D ˜ M T X ) [ dX ] (cid:33) λ (cid:32)(cid:90) V n,p etr ( D ˜ M T X ) [ dX ] (cid:33) − λ = (cid:18) F (cid:18) n , D (cid:19)(cid:19) λ (cid:18) F (cid:18) n , D (cid:19)(cid:19) − λ . (10)Note that the inequality is due to H¨older (Hardy, Littlewood and P´olya, 1952)and note that in this case d (cid:54) = d . Therefore from Equation 10 we have,log F (cid:18) n , (cid:18) λ D − λ ) D (cid:19)(cid:19) < λ log F (cid:18) n , D (cid:19) +(1 − λ ) log F (cid:18) n , D (cid:19) . (11)Hence log F (cid:16) n , D (cid:17) is a convex function or equivalently F (cid:16) n , D (cid:17) is alog-convex function of the diagonal entries d of matrix D . (cid:3) engupta et al./BMMCSM Lemma 5.
For any p × p ( p ≥ ) diagonal matrix D with positive elements d ∈ S p , then for i = 1 , , · · · , p we have < ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) < F (cid:18) n , D (cid:19) where n ≥ p . Proof of Lemma 5.Right hand side inequality:
Proceeding similar way as Lemma 4 we have F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( D (cid:102) M T X ) [ dX ] , where (cid:102) M = (cid:20) I p ( n − p ) ,p (cid:21) . (12)From Equation 12, we have F (cid:18) n , D (cid:19) = (cid:90) V n,p exp p (cid:88) j =1 d j X j,j [ dX ] (13)Consider the set V := { X ∈ V n,p : X i,i = 1 } . Note that V is isomorphic tothe lower dimensional Stiefel manifold, V n,p − . V , being a lower dimensionalsubspace of V n,p , has measure zero i.e. (cid:82) V n,p I ( X ∈ V )[ dX ] = 0, where I ( X ∈ V )is the indicator function for X to be in the set V . From Equation 13, we have F (cid:18) n , D (cid:19) = (cid:90) V n,p exp p (cid:88) j =1 d j X j,j I ( X ∈ V c ) [ dX ] , (14)where V c is the complement of V . Hence, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) = (cid:90) V n,p X i,i I ( X ∈ V c ) exp p (cid:88) j =1 d j X j,j [ dX ] . (15)Observe that, (cid:107) X (cid:107) = 1 on V n,p . Hence from Lemma 2 we have | X i,i | ≤
1. Also, X i,i (cid:54) = 1 when X ∈ V c . As a result, we conclude that X i,i < V n,p ∩ V c .Subsequently, it follows from Equations 14 and 15 that, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) < (cid:90) V n,p exp p (cid:88) j =1 d j X j,j I ( X ∈ V c ) [ dX ]= F (cid:18) n , D (cid:19) . (16) engupta et al./BMMCSM Left hand side inequality:
Consider V i, + n,p := { X ∈ V n,p : X i,i > } , V i, − n,p := { X ∈ V n,p : X i,i < } and V i, n,p := { X ∈ V n,p : X i,i = 0 } . Clearly, V i, + n,p , V i, n,p and V i, − n,p forms a partition of V n,p . Hence from equation 13 we have, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) = (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ] + (cid:90) V i, n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ]+ (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ]= (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ] + (cid:90) V i, − n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ] . (17)Let Γ be the n × n diagonal matrix such that Γ j,j = 1 for j = 1 , . . . , n, j (cid:54) = i and Γ i,i = −
1. Γ is an orthogonal matrix as Γ T Γ = I n . It is easy to show that V i, + n,p = (cid:8) Γ X : X ∈ V i, − n,p (cid:9) .Consider the change of variable Y := Γ X . Using standard algebra we can showthat X i,i = − Y i,i and X j,j = Y j,j for j = 1 , . . . p, j (cid:54) = i . As the normalizedHaar measure on V n,p is invariant under orthogonal transformation from Lefti.e. [ dX ] = [ dY ] Chikuse (2012), we get that (cid:90) V i, − n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ] = − (cid:90) V i, + n,p Y i,i exp − d i Y i,i + p (cid:88) j =1 ,j (cid:54) = i d j Y j,j [ dY ]= − (cid:90) V i, + n,p X i,i exp − d i X i,i + p (cid:88) j =1 ,j (cid:54) = i d j X j,j [ dX ] . (18)From Equations 17 and 18 we have, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) = (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 ,j (cid:54) = i d j X j,j (cid:32) exp ( d i X i,i ) − exp ( − d i X i,i ) (cid:33) [ dX ]= (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 ,j (cid:54) = i d j X j,j d i X i,i ) [ dX ] (19) engupta et al./BMMCSM where sinh is the hyperbolic sin function. Note that sinh ( d i X i,i ) > d i > X i,i > V i, + n,p . Hence from Equation 19 it follows that, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) > . (20)From Equations 16 and 20, we have the result. (cid:3) All five lemmas will be used for a theoretical development of a conjugate priorfamily for ML distributions, which we discuss next.
3. Bayesian framework for ML distribution In this section we develop a comprehensive Bayesian framework related to ML distribution. We construct a novel class of conjugate priors and study theirproperties. We also derive the posterior form and comment on hyperparametersettings. In the context of the exponential family of distributions, Diaconis and Ylvisaker(1979) (DY) provides a standard procedure to obtain a class of conjugate priorswhen the distribution is represented through natural parametrization Casellaand Berger (2002). But we realize that for the ML distribution DY theoremcould not be applied directly. We postpone the discussion on the DY theory laterin Section 3.4 since a direct application of their construction is not possible.Instead, we propose two different conjugate priors next aiming for scalable andflexible posterior inference.In this context, we would also like to mention that the construction of the classof priors in Hornik and Gr¨un (2013) is based on the direct application of DY,which is also not quite appropriate for ML distribution. The idea of construct-ing a conjugate prior on the natural parameter F and using a transformationafterwards involves calculation of complicated Jacobean term Hornik and Gr¨un(2013). Hence the corresponding class of prior obtained by this transformationwould lack the interpretation of the corresponding hyperparameters. As the DYtheorem is not directly applicable, an appropriate modification is required inorder to use with ML distribution (see details in Section 3.4). In this sectionwe construct a new class of conjugate prior for ML density. We then show thatthe hyperparameters of the constructed class of priors are easily interpretablefrom practitioners point of view. We further extend our investigation to studyproperties that are essential for the hyperparameter selection and posterior in-ference. In the following paragraphs we design both joint and independent priorstructures for the parameters of the ML distribution. engupta et al./BMMCSM Definition 1.
The probability density function of the joint conjugate prior withrespect to the appropriate product measure Υ on V n,p × R p + × V p,p on the param-eters M, D and V for ML distribution is proportional to g ( M, d , V ; ν, Ψ) = etr (cid:0) ν V DM T Ψ (cid:1)(cid:2) F ( n , D ) (cid:3) ν , (21) as long as g ( M, d , V ; ν, Ψ) can be integrable. Here ν > and Ψ ∈ R n × p . Although joint prior structure has some desirable properties (see Theorem 4and Section 3.3), it sometimes difficult to incorporate strength of prior beliefwhich could differ for different parameters. For example, if a practitioner hasstrong prior belief on M but has very less knowledge about parameters D and V , then JMDY may not be the optimal choice for prior structure. We design aclass of conditional conjugate prior which would be better suited for this type ofsituation due to flexibility. Also, it is customary to come up with independentprior structure (Gelman et al., 2014; Khare, Pal and Su, 2017) for parameters ofcurved exponential family (Casella and Berger, 2002), where the parametriza-tion differs from the natural parametrization. In order to develop conditionalconjugate prior structure we assume independent priors on M , d and V . It iseasy to see that conditional conjugate priors for both M and V are ML dis-tribution whereas the following definition is used to construct the conditionalconjugate prior for D . Definition 2.
The probability density function of the conditional conjugateprior for D with respect to the Lebesgue measure on R p + is proportional to g ( d ; ν, η ) = exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν , (22) as long as g ( d ; ν, η ) can be integrable. Here ν > , η ∈ R p and n ≥ p . Note that, g ( d ; ν, η ) is a function of n as well, however we do not vary n anywhere in our construction and thus we omit the symbol n from the notationof g ( d ; ν, η ).We refer this particular class of distributions defined in Definition 1 and Defini-tion 2 as joint modified Diaconis-Ylvisaker ( JMDY ) and independent modifiedDiaconis-Ylvisaker (
IMDY ) class, respectively for subsequent discussions.Theorem 1 and Theorem 2 provides conditions on ν, Ψ and η so that g ( M, d , V ; ν, Ψ)and g ( M, d , V ; ν, η ) are integrable, respectively. We state and prove the follow-ing lemma which is necessary to prove these theorems. Lemma 6.
Let Ψ ∈ R n × p and D be a diagonal matrix with positive diagonalentries. If (cid:107) Ψ (cid:107) < , then for arbitrary M ∈ V n,p , V ∈ V p,p , etr (cid:0) V DM T Ψ (cid:1) F ( n , D ) < etr ( − (cid:15) D ) K n,p,(cid:15) , (23) where (cid:15) = (1 − (cid:107) Ψ (cid:107) ) and K n,p,(cid:15) > is a constant depending on n, p and (cid:15) . engupta et al./BMMCSM Proof of Lemma 6.
Note that, 0 < (cid:15) < as (cid:107) Ψ (cid:107) <
1. Assume Y = M T Ψ V ∈ R p × p . For arbitrary l ∈ R p with (cid:107) l (cid:107) = 1, we have l T Y T Y l = ( V l ) T Ψ T Ψ( V l ) − l T V T Ψ T ( I n − M M T )Ψ V l ≤ (1 − (cid:15) ) . (24)The last inequality follows as (cid:107) Ψ (cid:107) = 1 − (cid:15) and ( I n − M M T ) is a non-negativedefinite matrix. From Equation 24 it follows that (cid:107) Y (cid:107) ≤ − (cid:15) . Hence, wecan apply Lemma 2 we obtain that | Y j,j | < − (cid:15) for j = 1 , · · · , p , where Y ,j is the j -th diagonal element of the matrix Y . Now applying Lemma 3 we have, etr (cid:0) V DM T Ψ (cid:1) F ( n , D ) < etr ( DY − (1 − (cid:15) ) D ) K n,p,(cid:15) < etr ( − (cid:15) D ) K n,p,(cid:15) . (cid:3) Theorem 1.
Let M ∈ V n,p and V ∈ V p,p and D be a diagonal matrix withpositive diagonal elements d ∈ R p + . Let Ψ ∈ R n × p with n ≥ p , then for any ν > ,(a) if (cid:107) Ψ (cid:107) < , we have (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) < ∞ , (b) if (cid:107) Ψ (cid:107) > , we have (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) = ∞ , where g ( M, d , V ; ν, Ψ) is defined in Definition 1. Proof of Theorem 1. ( a ) When (cid:107) Ψ (cid:107) < g ( M, d , V ; ν, Ψ) can be normalized to construct a probability den- engupta et al./BMMCSM sity function with respect to the product measure Υ. Consider that (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M )= (cid:90) V n,p (cid:90) V p,p (cid:90) R p + etr (cid:0) νV DM T Ψ (cid:1)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ) dµ ( M ) ( i ) < (cid:90) V n,p (cid:90) V p,p (cid:90) R p + etr ( − ν(cid:15) D )( K n,p,(cid:15) ) ν d d dµ ( V ) dµ ( M )= (cid:90) V n,p dµ ( M ) (cid:90) V p,p dµ ( V ) (cid:90) R p + etr ( − ν(cid:15) D )( K n,p,(cid:15) ) ν d d ( ii ) = 1 K νn,p,(cid:15) p (cid:89) j =1 (cid:90) R + exp( − ν(cid:15) d j ) dd j < ∞ , where the inequality ( i ) is due to Lemma 6 while ( ii ) follows as µ is the normal-ized Haar measure. Note that, here we write [ dV ] = dµ ( V ) and [ dM ] = dµ ( M ).( b ) When (cid:107) Ψ (cid:107) > M Ψ D Ψ V T Ψ be the the unique SVD (Chikuse, 2012) decomposition forthe matrix Ψ. Note that, using sub-multiplicativity (cid:107) Ψ (cid:107) ≤ (cid:107) M Ψ (cid:107) (cid:107) D Ψ (cid:107) (cid:13)(cid:13) V T Ψ (cid:13)(cid:13) = (cid:107) D Ψ (cid:107) = D Ψ , . Hence there exists an (cid:15) > D Ψ , > (1 + (cid:15) ) where D Ψ , denotes thefirst diagonal element of the diagonal matrix D Ψ . Now consider the fact that (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) ≥ (cid:90) V n,p (cid:90) V p,p (cid:90) S p g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M )= (cid:90) V n,p (cid:90) V p,p (cid:90) S p etr (cid:0) ν V DM T Ψ (cid:1)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ) dµ ( M )= (cid:90) V n,p (cid:90) V p,p (cid:90) S p etr (cid:0) ν DM T M Ψ D Ψ V T Ψ V (cid:1)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ) dµ ( M ) . (25)Consider the change of variable via the following orthogonal transformations M ∗ = (cid:2) M Ψ , M Ψ (cid:3) M, V ∗ = V T Ψ V, where M Ψ is matrix containing the bases for the orthogonal complement ofthe column space of M Ψ . Note that (cid:2) M Ψ , M Ψ (cid:3) T M Ψ = ( I (cid:63) ) T where I (cid:63) := (cid:2) I p , n − p,p (cid:3) T . As the Haar measure on the Stiefel manifold is invariant engupta et al./BMMCSM under the orthogonal transformations (Chikuse, 2012), from Equation 25 weget that, (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) ≥ (cid:90) V n,p (cid:90) V p,p (cid:90) S p etr (cid:16) ν DM ∗ T I (cid:63) D Ψ V ∗ (cid:17)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ∗ ) dµ ( M ∗ ) . (26)Consider V † n,p := (cid:26) M ∈ V n,p : (cid:107) I (cid:63) − M (cid:107) < δ (cid:27) ; V † p,p := (cid:26) V ∈ V p,p : (cid:107) I p − V (cid:107) < δ (cid:27) , where δ = (cid:15) / (2 (cid:107) D Ψ (cid:107) ). Note that δ > < (cid:107) D Ψ (cid:107) < ∞ . Clearly V † n,p and V † p,p are open subsets of V n,p and V p,p respectively. 
Hence, µ ( V † n,p ) > µ ( V † p,p ) > M ∈ V † n,p and V ∈ V † p,p then using sub-multiplicativity of (cid:107)·(cid:107) (Conway, 1990)and triangle inequality, we get (cid:13)(cid:13) M T I (cid:63) D Ψ V − D Ψ (cid:13)(cid:13) ≤ (cid:13)(cid:13) M T I (cid:63) D Ψ V − D Ψ V (cid:13)(cid:13) + (cid:107) D Ψ V − D Ψ (cid:107) ≤ (cid:13)(cid:13) M T I (cid:63) − I p (cid:13)(cid:13) (cid:107) D Ψ V (cid:107) + (cid:107) D Ψ (cid:107) (cid:107) V − I p (cid:107) = (cid:13)(cid:13) ( M − I (cid:63) ) T I (cid:63) (cid:13)(cid:13) (cid:107) D Ψ V (cid:107) + (cid:107) D Ψ (cid:107) (cid:107) V − I p (cid:107) ≤ (cid:13)(cid:13) ( M − I (cid:63) ) T (cid:13)(cid:13) (cid:107) I (cid:63) (cid:107) (cid:107) D Ψ (cid:107) (cid:107) V (cid:107) + (cid:107) D Ψ (cid:107) (cid:107) V − I p (cid:107) ≤ (cid:13)(cid:13) ( M − I (cid:63) ) T (cid:13)(cid:13) (cid:107) D Ψ (cid:107) + (cid:107) D Ψ (cid:107) (cid:107) V − I p (cid:107) ≤ δ (cid:107) D Ψ (cid:107) = (cid:15) . (27)Let λ , . . . , λ p be diagonal elements of the matrix M T I (cid:63) D Ψ V . From Lemma2 we get that | λ j − D Ψ ,j | ≤ (cid:15) / j = 1 , . . . , p . Here D Ψ ,j denotes the j -thdiagonal element of the matrix D Ψ . Hence for arbitrary M ∈ V † n,p and V ∈ V † n,p ,we have tr (cid:0) M T I (cid:63) D Ψ V (cid:1) = p (cid:88) j =1 λ j ≥ p (cid:88) j =1 (cid:16) D Ψ ,j − (cid:15) (cid:17) , (28)as λ j ≥ (cid:0) D Ψ ,j − (cid:15) (cid:1) for all j = 1 , , · · · , p . engupta et al./BMMCSM Now from Equation 26, we have (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) ≥ (cid:90) V † n,p (cid:90) V † p,p (cid:90) S p etr (cid:16) ν DM ∗ T I (cid:63) D Ψ V ∗ (cid:17)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ∗ ) dµ ( M ∗ ) ( iii ) ≥ (cid:90) V † n,p (cid:90) V † p,p (cid:90) S p exp (cid:16) ν (cid:80) pj =1 d j (cid:0) D Ψ ,j − (cid:15) (cid:1)(cid:17)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ∗ ) dµ ( M ∗ ) , ( iv ) ≥ (cid:90) V † n,p (cid:90) V † p,p (cid:90) S p exp (cid:16) ν (cid:80) pj =1 d j (cid:0) D Ψ ,j − (cid:15) (cid:1)(cid:17) [ etr ( D )] ν d d dµ ( V ∗ ) dµ ( M ∗ ) , ≥ µ ( V † n,p ) µ ( V † p,p ) (cid:90) S p exp ν p (cid:88) j =1 d j (cid:16) D Ψ ,j − − (cid:15) (cid:17) d d , ( v ) ≥ µ ( V † n,p ) µ ( V † p,p ) (cid:90) S p exp (cid:16) ν (cid:15) d (cid:17) p (cid:89) j =2 exp (cid:16) ν d j (cid:16) D Ψ ,j − − (cid:15) (cid:17)(cid:17) d d , = ∞ , (29)where ( iii ) and ( iv ) follow from Equation 28 and Lemma 1, respectively. Finally,( v ) follows as D Ψ , > (1 + (cid:15) ). (cid:3) Remark for Theorem 1.
One could notice that the conditions mentionedin this theorem is not entirely necessary and sufficient conditions. We have notaddressed the case where (cid:107) Ψ (cid:107) = 1. This scenarios could be broken into twocases (a) all the eigenvalues of Ψ are equal to 1 and (b) only a few eigenvaluesare equal to 1 and rest are strictly less than 1. In both the cases, it seems thatthe problem is more involved than the current one and we have not investigatedthe finiteness of the corresponding integral in detail for those cases. For now,we leave those for future work. Theorem 2.
Let D be diagonal matrix with diagonal elements d ∈ R p + . Let η = ( η , . . . , η p ) ∈ R p and n be any integer with n ≥ p . Then for any ν > , (cid:90) R p + g ( d ; ν, η , n ) d d < ∞ , if and only if max ≤ j ≤ p η j < , where g ( d ; ν, η , n ) is defined in Definition 2. Proof of Theorem 2. engupta et al./BMMCSM Sufficient condition:
For any η := ( η , . . . , η p ) ∈ R p , define η + := (cid:0) η +1 , . . . , η + p (cid:1) where η + j equals η j when η j > D η to be the diag-onal matrix with diagonal elements η + . Let us consider the following matricesΨ = (cid:20) D η n − p,p (cid:21) , M (cid:63) = (cid:20) I p,p n − p,p (cid:21) and V (cid:63) = I p . Note that (cid:102) M ∈ (cid:101) V n,p , (cid:101) V ∈ V p,p and D η = (cid:102) M T Ψ (cid:101) V . Now from Definition 2 we getthat (cid:90) R p + g ( d ; ν, η , n ) d d = (cid:90) R p + exp( ν (cid:80) pj =1 η j d j ) (cid:2) F ( n , D ) (cid:3) ν d d ≤ (cid:90) R p + exp( ν (cid:80) pj =1 η + j d j ) (cid:2) F ( n , D ) (cid:3) ν d d = (cid:90) R p + etr ( ν DD η ) (cid:2) F ( n , D ) (cid:3) ν d d = (cid:90) R p + etr (cid:16) ν (cid:101) V D (cid:102) M T Ψ (cid:17)(cid:2) F ( n , D ) (cid:3) ν d d ( vi ) < (cid:90) R p + etr ( − ν(cid:15) D )( K n,p,(cid:15) ) ν d d = 1( K n,p,(cid:15) ) ν p (cid:89) j =1 (cid:90) R + exp( − ν(cid:15) d j ) dd j < ∞ , (30)where the inequality at step ( vi ) follows from Lemma 6 with appropriate (cid:15) > Necessary condition:
Let η ∈ R p be such that max j =1 ,...p η j ≥
1. There existat least one j ∈ { , . . . p } such that η j ≥
1. Without loss of generality, let usassume that η ≥
1. From Definition 2, we have engupta et al./BMMCSM (cid:90) R p + g ( d ; ν, η , n ) d d = (cid:90) R p + exp( ν (cid:80) pj =1 η j d j ) (cid:2) F ( n , D ) (cid:3) ν d d ≥ (cid:90) R p + exp( ν (cid:80) pj =1 η j d j ) etr ( νD ) d d = p (cid:89) j =1 (cid:90) R + exp ( ν ( η j − d j ) dd j = (cid:90) R + exp ( ν ( η − d ) dd p (cid:89) j =2 (cid:90) R + exp ( ν ( η j − d j ) dd j = ∞ , where the inequality is due to Lemma 1. (cid:3) Remark for Theorem 2.
We could alternatively parametrize
IMDY in thefollowing way g ( d ; ν, η ) ∝ exp (cid:16)(cid:80) pj =1 η j d j (cid:17) / (cid:104) F ( n , D ) (cid:105) ν when max ≤ j ≤ p η j < ν .In this parametrization if we set ν = 0 and β := − η then g ( d ; ν, η ) refers tothe Exponential distribution with parameter β . The following lemmas are essential to study theoretical properties of the conju-gate prior mentioned in Section 3.1.
Lemma 7.
The probability density function for the prior distribution of d ∼ IMDY ( d ; ν, η ) denoted by g ( d ; ν, η ) := exp( ν η T d ) / (cid:104) F (cid:16) n , D (cid:17)(cid:105) ν , is log-concaveas a function of d where D is the diagonal matrix with diagonal elements d , max ≤ j ≤ p η j < , ν > and n ≥ p . Proof of Lemma 7.
From Definition 2 we have, g ( d ; ν, η ) := exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν , = ⇒ log g ( d ; ν, η ) := ν η T d − ν log (cid:18) F (cid:18) n , D (cid:19)(cid:19) (31)From Lemma 4, it follows that − ν log (cid:16) F (cid:16) n , D (cid:17)(cid:17) is concave function of d .Also, ν η T d is a linear function of d . Therefore from Equation 31 it is clear thatlog g ( d ; ν, η ) is a concave function of d . engupta et al./BMMCSM (cid:3) Lemma 8.
The distribution of d is unimodal if < η j < for all j =1 , , · · · , p . The mode of the distribution is characterized by the parameter η and it does not dependent on the parameter ν . Proof of Lemma 8.
Let l ( d , ν, η ) = log( g ( d ; ν, η )). If (cid:98) d is the mode of the distribution then ∂∂ d l ( d , ν, η ) (cid:12)(cid:12)(cid:12)(cid:12) d = (cid:98) d = 0 , = ⇒ ν η − ν ∂∂ d log (cid:18) F (cid:18) n , D (cid:19)(cid:19) (cid:12)(cid:12)(cid:12)(cid:12) d = (cid:98) d = 0 , = ⇒ ∂∂ d log (cid:18) F (cid:18) n , D (cid:19)(cid:19) (cid:12)(cid:12)(cid:12)(cid:12) d = (cid:98) d = η , = ⇒ h ( (cid:98) d ) = η , (32)where h ( d ) := ( h ( d ) , h ( d ) , · · · , h p ( d )) with h j ( d ) := (cid:16) ∂∂d j F (cid:16) n , D (cid:17)(cid:17) / F (cid:16) n , D (cid:17) for j = 1 , , · · · , p . The function h j ( d ) is strictly increasing as the function F (cid:16) n , D (cid:17) is log-convex (see Lemma 4). Also, it follows from Lemma 5 that0 < h j ( d ) < d ∈ S p . Hence the Equation 32 has a unique solutionwhen 0 < η j < j = 1 , , · · · , p . Also it is clear that the solution doesnot depend on ν . On the other hand, given any (cid:98) d ∈ S p we can always find a η satisfying Equation 32 such that 0 < max ≤ j ≤ p η j < (cid:3) Remark:
In the case of η j ≤
0, the density defined in 2 is decreasing as afunction of d j on the set R + . Therefore, mode does not exist.In order to introduce the notion of “concentration” for IMDY class of distribu-tions we require the concept of level set. Let unnormalized probability densityfunction for
IMDY class of distributions, g ( x ; ν, η ), achieves the maximum valueat m η and let S l = (cid:8) x ∈ R p + : g ( x ; 1 , η ) /g ( m η ; 1 , η ) > l (cid:9) be the level set of order l containing the mode m η where 0 ≤ l <
1. Note that,to define the level set we could have used any fixed value of ν > g ( x ; ν , η )instead of g ( x ; 1 , η ), however without loss of generality we choose ν = 1. Lemma 9.
Let η ∈ R p be a fixed vector such that < max ≤ j ≤ p η j < . Whenever d ∼ IMDY ( d ; ν, η ) , we have(a) P ν ( S l ) is an increasing function of ν .(b) For any open set S ⊂ R p + containing m η , P ν ( d ∈ S ) goes to as ν → ∞ , engupta et al./BMMCSM where P ν ( · ) denotes the probability distribution corresponding to d ∼ IMDY ( d ; ν, η ) . Proof of Lemma 9. ( a ) Note that, from definitions of unimodality and level set we have (cid:20) g ( y ; ν, η ) g ( x ; ν, η ) (cid:21) > y ∈ S and for all x ∈ S c . (33)Consider the function r ( ν, x ) := (cid:90) S g ( y ; ν, η ) g ( x ; ν, η ) d y = (cid:90) S (cid:20) g ( y ; 1 , η ) g ( x ; 1 , η ) (cid:21) ν d y , (34)where x ∈ S c . Using equation 33 it is easy to see that (cid:104) g ( y ;1 , η ) g ( x ;1 , η ) (cid:105) ν is monotoni-cally increasing in ν for all y ∈ S . Hence r ( ν, x ) is increasing function in ν forany x ∈ S c .Note that, P ν ( d ∈ S c ) P ν ( d ∈ S ) = (cid:82) S c g ( x ; ν, η ) d x (cid:82) S g ( y ; ν, η ) d y = (cid:90) S c (cid:82) S g ( y ; ν, η ) g ( x ; ν, η ) d y d x = (cid:90) S c r ( ν, x ) d x . (35)Hence P ν ( d ∈ S c ) /P ν ( d ∈ S ) is a decreasing function of ν as r ( ν, x ) is a decreas-ing function in ν for every x ∈ S c or equivalently P ν ( d ∈ S ) increasing functionin ν . (cid:3) ( b ) Let d ∼ IMDY ( · ; ν, η ) with 0 < η j < j = 1 , . . . p . Let m η be themode the distribution. Note that the value of m η only depends on the parameter η and does not depend on the parameter ν . Let f ( d ; ν, η ) be the correspondingprobability density function. Hence for the class of distribution function definedin Definition 2, it follows that, f ( d ; ν, η ) = 1 K ν, η exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν , (36)where K ν, η is the appropriate normalizing constant.Let us define the function g ( d ; η ) = exp( η T d ) / F (cid:16) n , D (cid:17) . Let S be any openset containing m η , the mode of the density function f ( d ; ν, η ). Consider, the set S (cid:63) := { d : g ( d ; η ) ≤ ζ } , where ζ = sup d ∈ S c g ( d ; η ). It is easy to show that S c ⊆ S (cid:63) . engupta et al./BMMCSM Consider the fact that , g ( d ; η ) g ( λ m η + (1 − λ ) d ; η ) where λ ∈ (0 , η T d ) F (cid:0) n , D (cid:1) F (cid:16) n , [ λD m +(1 − λ ) D ] (cid:17) exp( η T ( λ m η + (1 − λ ) d )) ( vii ) ≤ exp( η T d ) F (cid:0) n , D (cid:1) (cid:104) F (cid:16) n , D m (cid:17)(cid:105) λ (cid:104) F (cid:16) n , D (cid:17)(cid:105) − λ exp( η T ( λ m η + (1 − λ ) d )) ≤ (cid:34) exp( η T d ) F (cid:0) n , D (cid:1) (cid:35) λ F (cid:16) n , D m (cid:17) exp( η T m η ) λ = (cid:20) g ( d ; η ) g ( m η ; η ) (cid:21) λ , (37)where D m is the diagonal matrix with diagonal m η . Note that, inequality ( vii )follows from the fact that F ( · ) is a log-convex function.Hence we have, f ( d ; ν, η ) f ( λ m η + (1 − λ ) d ; ν, η ) = (cid:20) g ( d ; η ) g ( λ m η + (1 − λ ) d ; η ) (cid:21) ν ≤ (cid:20) g ( d ; η ) g ( m η ; η ) (cid:21) ν λ . (38) P ν ( S (cid:63) ) = (cid:90) S (cid:63) f ( d ; ν, η ) d d = (cid:90) S (cid:63) f ( d ; ν, η ) f ( λ m η + (1 − λ ) d ; ν, η ) f ( λ m η + (1 − λ ) d ; ν, η ) d d ≤ (cid:90) S (cid:63) (cid:20) g ( d ; η ) g ( m η ; η ) (cid:21) νλ f ( λ m η + (1 − λ ) d ; ν, η ) d d ≤ (cid:90) S (cid:63) (cid:20) ζg ( m η ; η ) (cid:21) νλ f ( λ m η + (1 − λ ) d ; ν, η ) d d = (cid:20) ζg ( m η ; η ) (cid:21) νλ (cid:90) S (cid:63) f ( λ m η + (1 − λ ) d ; ν, η ) d d ≤ (cid:20) ζg ( m η ; η ) (cid:21) νλ . 
(39)Hence we have,lim ν →∞ P ν ( S ) ≥ − lim ν →∞ P ν ( S (cid:63) ) ≥ − lim ν →∞ (cid:20) ζg ( m η ; η ) (cid:21) νλ = 1as ζ < g ( m η ; η ). (cid:3) engupta et al./BMMCSM The following two theorems establishes few important properties of
IMDY and
JMDY class of distributions.
Theorem 3.
Let d ∼ IMDY ( · ; ν, η ) for some ν > and max ≤ j ≤ p η j < where η = ( η , . . . , η p ) . Then(a) The distribution of d is log-concave.(b) The distribution of d is unimodal if η j > for all j = 1 , , · · · , p . Themode of the distribution is characterized by the parameter η and it does notdependent on the parameter ν .(c) The parameter ν relates to the concentration of the probability around modeof the distribution. Larger values of ν implies larger concentration of prob-ability near the mode of the distribution. Proof of Theorem 3.
Proof of part ( a ) , ( b ) and ( c ) follow from Lemma 7, 8 and 9, respectively. (cid:3) We call the parameter η as modal parameter and ν as Concentration parameter. Definition 3.
The parameter η in the distribution that belongs to the class ofdistributions IMDY is defined as “modal parameter”. Definition 4.
The scalar parameter ν in the distribution that belongs to theclass of distributions IMDY is defined as “concentration parameter”. Theorem 4.
Let ( M, d , V ) ∼ JMDY ( · ; ν, Ψ) for some ν > and (cid:107) Ψ (cid:107) < .Then(a) The distribution has unique mode. The mode is characterized by the param-eter Ψ and it does not dependent on the parameter ν .(b) Conditional distribution of M given ( d , V ) and V given ( M, d ) are ML distributions whereas conditional distribution of d given ( M, V ) is IMDYclass of distribution. Proof of Theorem 4.
The joint density is proportional to g ( M, d , V ; ν, Ψ) = etr ( ν V DM T Ψ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν , (40)( a ) Let us write the SVD (Chikuse, 2012) of Ψ = M Ψ D Ψ V T Ψ . We have, etr ( ν V DM T Ψ) = etr ( ν DM T M Ψ D Ψ V T Ψ V )= etr ( ν V T Ψ V D U M D M V TM D Ψ )= etr ( ν V D U M D M V TM D Ψ ) (41) engupta et al./BMMCSM where SVD of is written as M T M Ψ = U M D M V TM and V = V T Ψ V is an orthog-onal matrix. Therefore we have, etr ( ν V DM T Ψ) = etr ( ν V D U M D M V TM D Ψ ) ( viii ) ≤ etr ( ν DD M D Ψ ) , (42)where the inequality ( viii ) follows from Kristof (1969) (see Theorem on page 5)as V , U M and V M are orthogonal matrices while D , D M and D Ψ are diagonalmatrices with nonnegative diagonal entries. Note that, using sub-multiplicativityof the (cid:107)·(cid:107) (Conway, 1990), we have (cid:107) D M (cid:107) = (cid:13)(cid:13) U TM M T M Ψ V M (cid:13)(cid:13) ≤ (cid:13)(cid:13) U TM (cid:13)(cid:13) (cid:13)(cid:13) M T (cid:13)(cid:13) (cid:107) M Ψ (cid:107) (cid:107) V M (cid:107) ≤ . Therefore, using Lemma 2, we infer that all the diagonal entries of D M is lessthan or equal to 1. Hence from Equation 42, we get that etr ( ν V DM T Ψ) ≤ etr ( ν DD Ψ ) . (43)Therefore, it follows from Kristof (1969) that M = M Ψ and V = V Ψ are uniquemaximizers when M Ψ ∈ (cid:101) V n,p and V Ψ ∈ V p,p . Note that, this does not dependon the choice of ν .Now putting back the value of M and V , we write the expression given in theEquation 40 which can now be seen as etr ( ν DD Ψ ) / (cid:104) F (cid:16) n , D (cid:17)(cid:105) ν . Note that,the diagonal elements of D Ψ is between 0 and 1 as (cid:107) Ψ (cid:107) <
1. Hence usingpart (b) of Theorem 3 we know that etr ( ν DD Ψ ) / (cid:104) F (cid:16) n , D (cid:17)(cid:105) ν has a uniquemaximizer which also does not depend on the choice of ν . (cid:3) ( b ) For JMDY prior structure, the conditional distribution of M given ( d , V )is proportional to etr (cid:16) ν (Ψ V D ) T M ) (cid:17) . This distribution is an ML distribution with parameters M M Ψ , D M Ψ , V M Ψ whereSVD decomposition (Chikuse, 2012) of ν (Ψ V D ) = M M Ψ D M Ψ ( V M Ψ ) T .Similarly, the conditional distribution of V given M and d is proportional to etr (cid:16) ν (Ψ T M D ) T V ) (cid:17) . Therefore, it is another ML distribution with parameters M V Ψ , D V Ψ , V V Ψ whereSVD decomposition of ν (Ψ T M D ) = M V Ψ D V Ψ ( V V Ψ ) T .Finally, the conditional distribution of d given ( M, V ) is a distribution thatbelongs to
IMDY class of distributions with parameters ν and η Ψ , where η Ψ = { η Ψ1 , η Ψ2 , · · · , η Ψ p } and η Ψ j is the j -th diagonal element of the matrix M T Ψ V . engupta et al./BMMCSM (cid:3) In next subsection (Section 3.3) we show that the posterior “modal parameter”is a linear combination of the prior “modal parameter” and a function of samplemean.The following lemmas are useful from the practitioner viewpoint. The resultwill help to truncate the right tail of the distribution at an appropriate pointaccording to a criteria involving only the unnormalized density function.
Lemma 10.
Let d ∼ IMDY ( · ; ν, η ) for some ν > and max ≤ j ≤ p η j < where η = ( η , . . . , η p ) . Let m be the mode of the conditional distribution, g ( · ) := g ( · ; ν, η | ( d , . . . , d p )) , of the variable d given ( d , . . . , d p ) . Then Q ( d ) = g ( d + b ) /g ( d ) is strictly decreasing when b > and d > m where m is themode of the density function given in Definition 2. Proof of Lemma 10.
We have, log( g ( d )) = ν η d − log (cid:18) F (cid:18) n , D (cid:19)(cid:19) = ⇒ ∂ ∂d (log g ( d )) = − ∂ ∂d (cid:18) log (cid:18) F (cid:18) n , D (cid:19)(cid:19)(cid:19) < , (44)as log (cid:16) F (cid:16) n , D (cid:17)(cid:17) is a strictly convex function (from Lemma 4). Therefore ∂∂d (log g ( d )) = g (cid:48) ( d ) /g ( d ) is a strictly decreasing function in d .log( Q ( d )) = log( g ( d + b )) − log( g ( d ))= ⇒ ∂∂d (log Q ( d )) = g (cid:48) ( d + b ) g ( d + b ) − g (cid:48) ( d ) g ( d ) < , as g (cid:48) ( d ) /g ( d ) is a strictly decreasing function. Therefore, Q ( d ) is also astrictly decreasing function in d . (cid:3) Lemma 11.
Let d ∼ IMDY ( · ; ν, η ) for some ν > and max ≤ j ≤ p η j < where η = ( η , . . . , η p ) . Let m be the mode of the conditional distribution, g ( · ) := g ( · ; ν, η | ( d , . . . , d p )) , of the variable d given ( d , . . . , d p ) . Let B > m , besuch that g ( B ) g ( m ) < (cid:15) for some (cid:15) > , then P ( d > B | d , . . . , d p ) < (cid:15) . Proof of Lemma 11.
The unnormalized conditional density of the random variable d is proportionalto g ( d ) = exp ( ν η d ) F (cid:0) n , D (cid:1) ν . engupta et al./BMMCSM Let f ( d ; ν, η | ( d , . . . , d p )) be the density function for the conditional distribu-tion of d given ( d , . . . , d p ). For notational convenience, for rest of this lemmawe use f ( · ) as the conditional probability density function. Hence we have, f ( d ) = 1 K ν, η exp ( ν η d ) F (cid:0) n , D (cid:1) ν , where K ν, η is an appropriate normalizing constant. From Lemma 10, it followsthat f ( B + x ) /f ( m + x ) is a decreasing function of x when B > m . Hence forall x > f ( B + x ) f ( m + x ) = g ( B + x ) g ( m + x ) < g ( B ) g ( m ) ( viii ) < (cid:15), where the inequality at ( viii ) follows due to the assumption of the lemma.Therefore, P ( d > B | ( d , . . . , d p )) = (cid:90) ∞ B f ( y ) dy = (cid:90) ∞ f ( B + x ) f ( m + x ) f ( m + x ) dx< (cid:15) P ( d > m | ( d , . . . , d p )) < (cid:15). (cid:3) Let W i for i = 1 , , · · · , N be the samples drawn from ML distribution withparameters M, d , V . If we consider a Bayesian analysis with the prior class JMDY with parameters ν and Ψ, then the probability density for the jointposterior distribution of M, d and V given { W i } Ni =1 is proportional to g ( M, d , V ; ν, Ψ) × N (cid:89) i =1 etr ( V DM T W i ) F ( n , D )= etr (cid:0) ν V DM T Ψ (cid:1)(cid:2) F ( n , D ) (cid:3) ν × N (cid:89) i =1 etr ( V DM T W i ) F ( n , D )= etr (cid:16) ( ν + N ) V DM T (cid:16) νν + N Ψ + Nν + N W (cid:17)(cid:17)(cid:2) F ( n , D ) (cid:3) ν + N , (45)where W = (cid:80) Ni =1 W i /N and N is the number of data points. Observe that,the posterior distribution is also in JMDY class with concentration parameter( ν + N ) and modal parameter (cid:16) νν + N Ψ + Nν + N W (cid:17) . engupta et al./BMMCSM On the other hand, when we consider a Bayesian analysis with the prior class
IMDY with parameters ν and η , then the conditional probability density forposterior distribution of d given M , V , { W i } Ni =1 is proportional to g ( d ; ν, η ) × N (cid:89) i =1 etr ( V DM T W i ) F ( n , D )= exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν × N (cid:89) i =1 etr ( V DM T W i ) F ( n , D )= exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν × etr ( DM T N W V ) (cid:2) F ( n , D ) (cid:3) N where W = (cid:80) Ni =1 W i /N = exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν + N × exp( N p (cid:88) j =1 d j Y j,j ) where Y = M T W V ∈ R p × p = exp( ν η T d ) exp( N (cid:80) pj =1 d j Y j,j ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν + N = exp (cid:18) ( ν + N ) (cid:16) νν + N η + Nν + N η Y (cid:17) T d (cid:19)(cid:2) F (cid:0) n , D (cid:1)(cid:3) ν + N where η Y = ( Y , , · · · , Y p,p )(46)Here the conditional posterior distribution of d is in IMDY class with concen-tration parameter ( ν + N ) and modal parameter (cid:16) νν + N η + Nν + N η Y (cid:17) .Finally, in the following subsection we talk about the several reasons for notbeing able to use DY theorem directly in our case. ML distribution According to the assumption of DY, for a d -dimensional exponential familydistribution, µ be the measure defined on the Borel sets of R d . In the contextof th ML distribution µ is the measure defined on the Stiefel manifold. Thesymbol X is used to denote the interior of the support of the measure µ . Asshowed in Hornik and Gr¨un (2013) X := { X : (cid:107) X (cid:107) < } . According to theassumptions of DY (cid:82) X dP θ ( X ) = 1 (See the paragraph after equation (2.1) onpage 271 in Diaconis and Ylvisaker (1979)). On the contrary for matrix Langevindistribution (cid:90) X dP θ ( X ) = (cid:90) X f ML ( X ) [ dX ] = 0 . During the proof of Theorem 1 in Diaconis and Ylvisaker (1979) Dy constructs engupta et al./BMMCSM a probability measure restricted on set A as follows. µ A ( B ) = µ ( A ∩ B ) µ ( A ) , where µ ( A ) > . Also, x A = (cid:82) Z dµ A ( Z ). In the context of the proof of Theorem 1 in Diaconisand Ylvisaker (1979) uses the crucial fact that x A are dense in supp ( µ ) (See theline after Equation (2.4) on page 272 in Diaconis and Ylvisaker (1979)).In the context of the ML distribution supp ( µ ) is the Stiefel manifold. It canbe shown that similar construction in the case of ML distribution would leadto x A where x A does not belong to the Stiefel manifold i.e. x A (cid:54)∈ supp ( µ ).Hence x A will not be dense supp ( µ ). As a result, Theorem 1 in (Diaconis andYlvisaker, 1979) is not applicable for ML distribution. Note that a modifiedDY construction can be formulated that would enable us constructing prior on F . However, our parametrization is different than the natural parametrization,therefore we require a new approach to construct the prior distribution on M, d and V . Plots for conditional prior of d given M and V Figure 1 shows plots forprior densities for different values of ν and η . Note that, with the same value of η the location of the mode remain the same for different values of ν (see eachrow of Figure 1). As ν increases, the probability concentration around the modeof the distribution increases. Finding the modal parameter from the mode
We have given an examplewhen the practitioner wants to set a particular mode denoted by d mode . We solvefor the corresponding η mode from Equation 32. For example, let us denote themode by (5 ,
7) and after solving for η mode , we have η mode = (0 . , . ,
7) for two different setting of ν which incorporates the strength of the belief in the value of the mode. Here wetake ν = 10 and ν = 20. For both
JMDY and
IMDY class of distributions, we have uniform prior overrespective parameters whenever the probability density function is proportionalto 1. For
JMDY , it can be achieved by setting ν = 0 in Definition 1. For IMDY , ν = 0 provides the uniform prior on parameter d . The resulting priorswould be improper as in this case, the integral over the entire space becomesinfinite. However, in this case, it is necessary to check the propriety of posteriordistributions.In order to incorporate the prior belief for IMDY class of distributions, one canfind the appropriate value of hyperparameter η from Equation 32 once modeof d (denoted by d mode ) is given. Note that, we get a feasible η for every real d mode ∈ S p . The other parameter ν sets the strength of one’s prior belief. It isimportant to realize that there is a strong relationship between ν and number engupta et al./BMMCSM ν = 10 , η = [0 . , .
89] (b) ν = 20 , η = [0 . , . ν = 10 , η = [0 . , .
50] (d) ν = 20 , η = [0 . , . ν = 10 , η = [0 . , .
94] (f) ν = 20 , η = [0 . , . Fig 1: Prior density plots for different values of ν and η engupta et al./BMMCSM ν = 10 , η = [0 . , .
88] (b) ν = 20 , η = [0 . , . Fig 2: Find appropriate η from given mode of the distribution d mode .of data samples. For setting the hyperparameters of the prior distribution for M and V , one can use M mode and V mode , respectively with the appropriateparameters for ML distribution.On the other hand for JMDY class of distribution, we set appropriate value ofhyperparameter η from Equation 32 when mode of d is given. Next, we constructa diagonal matrix, D η with the diagonal entries η . The hyperparameter Ψ canbe constructed in the following way, Ψ = M mode D η V Tmode where M mode and V mode are the choices for the modes of their respective distributions.In order to setup an empirical prior framework, one could obtain the maximumlikelihood estimator (MLE) using the technique described in Chikuse (2012).We could set the hyperparameters in such a way that the mode of the priordistribution is same as MLE. Also note that, the “Empirical Bayesian” proce-dure (Robbins, 1985; Casella, 1985) is out of scope of this study.
4. Bayesian framework for Mixture of ML distributions

In this section we develop a framework for a finite mixture of ML distributions: we derive the form of the posterior, establish posterior consistency, and describe the sampling scheme.

Cluster analysis determines the internal structure of data in an unsupervised way, when no information other than the observed values is available (Picard, 2007). A finite mixture model clusters data points by assuming that each component of the mixture comes from a suitable parametric distribution and that the mixture distribution is a convex combination of the individual component distributions; the number of components is typically specified in advance.

We describe our framework as a finite mixture of ML distributions with a fixed number of mixture components $C$; the selection of the number of components is discussed in Section 4.5. A popular approach to clustering is to model the data by a mixture of appropriate distributions. The Gaussian mixture model, for example, has been used in numerous applications spanning computer vision and computational neuroscience (Stauffer and Grimson, 1999; McKenna, Raja and Gong, 1999; KaewTraKulPong and Bowden, 2002; Lewicki, 1998; Wood et al., 2004); in the context of directional data, mixtures of von Mises distributions (McGraw et al., 2006; Mardia, Taylor and Subramaniam, 2007; Tang, Chu and Huang, 2009; Bangert, Hennig and Oelfke, 2010; Reisinger et al., 2010; Hornik and Grün, 2014) and mixtures of ML distributions (Lin, Rao and Dunson, 2017) have been used.

Consider the product parameter space $\Theta:=\widetilde{\mathcal{V}}_{n,p}^{C}\times\mathcal{S}_p^{C}\times\mathcal{V}_{p,p}^{C}$, and let $\boldsymbol{\theta}:=\{\theta_c\}_{c=1}^C=\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C$ denote any point in $\Theta$. Let $S_\pi:=\{(\pi_1,\pi_2,\cdots,\pi_C)\in(0,1)^C:\sum_{c=1}^C\pi_c=1\}$ be the $C$-simplex and $\boldsymbol{\pi}\in S_\pi$ any point in it. Denote $\Xi:=\Theta\times S_\pi$.

Now consider the class of finite mixtures of ML densities $\mathcal{C}_{\mathcal{ML}}:=\{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c):(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi\}$. Let $X_i\in\mathcal{V}_{n,p}$ ($i=1,2,\cdots,N$) be data observed from a mixture of ML distributions,
$$f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c),\qquad f\in\mathcal{C}_{\mathcal{ML}}.$$
For sampling it is customary to introduce latent cluster assignment variables (McLachlan and Peel, 2004; Bishop, 2006), so the mixture model can be written as
$$X_i\mid(Z_i=c)\sim f_{\mathcal{ML}}(X_i;\theta_c)\quad\text{with}\quad P(Z_i=c)=\pi_c\quad\text{for }c=1,\cdots,C, \tag{47}$$
where $\pi_c>0$, $\sum_{c=1}^C\pi_c=1$, and $Z_i$ is the latent cluster assignment of the $i$-th data point $X_i$. The likelihood function for $\boldsymbol{\theta}$ is
$$L(\boldsymbol{\theta})=\prod_{i=1}^N\prod_{c=1}^C\left[\pi_c f_{\mathcal{ML}}(X_i\mid\theta_c)\right]^{I(Z_i=c)}. \tag{48}$$
In Section 3 we discuss the prior structures and their properties in detail. We assume two different classes of prior structure. In the first,
$$(M_c,D_c,V_c)\sim \mathrm{JMDY}(\cdot\,;\nu_c,\Psi_c),\qquad \boldsymbol{\pi}\sim\mathrm{Dir}(\cdot\,;(\alpha_1,\alpha_2,\cdots,\alpha_C)), \tag{49}$$
while in the second,
$$M_c\sim\mathcal{ML}(\cdot\,;\xi_c^M,\xi_c^D,\xi_c^V),\quad \mathbf{d}_c\sim\mathrm{IMDY}(\cdot\,;\nu_c,\eta_c),\quad V_c\sim\mathcal{ML}(\cdot\,;\gamma_c^M,\gamma_c^D,\gamma_c^V),\quad \boldsymbol{\pi}\sim\mathrm{Dir}(\cdot\,;(\alpha_1,\alpha_2,\cdots,\alpha_C)). \tag{50}$$
Under both prior structures the conditional posterior distributions of the parameters have a similar form.
Therefore, we choose to use the independent prior structure given in Equation 50 to demonstrate the posterior computation described in Section 4.4.

The posterior density of $(\boldsymbol{\theta},\boldsymbol{\pi},\{Z_i\})$ given $\{X_i\}_{i=1}^N$ is proportional to
$$\left\{\prod_{i=1}^N\prod_{c=1}^C\left[\pi_c f_{\mathcal{ML}}(X_i\mid\theta_c)\right]^{I(Z_i=c)}\right\} f_{prior}(\boldsymbol{\pi},\boldsymbol{\theta}). \tag{51}$$
From Equation 51 it follows that the posterior density is proportional to
$$\prod_{c=1}^C\left\{\pi_c^{\alpha_c+N_c-1}\,\frac{\operatorname{etr}\!\left(\left(V_cD_cM_c^T\right)N_c\bar{X}_c+G_cM_c+H_cV_c\right)}{\left[{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)\right]^{\nu_c+N_c}}\,\exp(\nu_c\eta_c^T\mathbf{d}_c)\,I(\mathbf{d}_c\in\mathcal{S}_p)\right\}, \tag{52}$$
where $N_c=\sum_{i=1}^N I(Z_i=c)$ and $\bar{X}_c=\frac{1}{N_c}\sum_{i=1}^N X_i I(Z_i=c)$ for $c=1,\cdots,C$. Also,
$$G_c=\xi_c^V\xi_c^D(\xi_c^M)^T\quad\text{and}\quad H_c=\gamma_c^V\gamma_c^D(\gamma_c^M)^T. \tag{53}$$

The classes of prior distributions specified in Equations 49 and 50 are flexible in the sense that empirical information and/or prior knowledge about any parameter can be incorporated into the model through appropriate hyperparameter choices. On the other hand, in the absence of prior knowledge, one can specify hyperparameter values such that the corresponding prior distributions become weakly informative or vague. In the following section we describe two specific procedures for selecting hyperparameter values, with the independent prior structure of Equation 50 in mind; similar procedures can easily be developed for the joint prior structure described in Equation 49.
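The per-cluster quantities $N_c$ and $\bar{X}_c$ that enter Equation 52 are simple sufficient statistics of the current cluster assignments. A minimal sketch (Python/NumPy; names are ours):

```python
import numpy as np

def cluster_sufficient_stats(X, Z, C):
    """Given data X of shape (N, n, p) and assignments Z in {0, ..., C-1},
    return the per-cluster counts N_c and means Xbar_c used in the posterior (52)."""
    N, n, p = X.shape
    counts = np.zeros(C, dtype=int)
    xbars = np.zeros((C, n, p))
    for c in range(C):
        members = X[Z == c]
        counts[c] = members.shape[0]
        if counts[c] > 0:
            xbars[c] = members.mean(axis=0)
    return counts, xbars
```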
Weakly informative prior
If the prior probability density function is proportional to 1, we refer to the corresponding prior as a uniform prior. We can construct a uniform prior from the prior structure defined in Equation 50 by choosing $\alpha_c=1$, $\nu_c=0$, $\xi_c^D=0_{p,p}$ and $\gamma_c^D=0_{p,p}$ for $c=1,\ldots,C$, where $0_{p,p}$ denotes the zero matrix of dimension $p\times p$. Note that the remaining hyperparameters $\eta_c,\xi_c^M,\xi_c^V,\gamma_c^M,\gamma_c^V$ need not be specified in this case. The uniform prior designed here is improper, and improper priors are not permissible for mixture models since they lead to an invalid posterior. As a remedy, one may construct a "constrained mixture model" (Diebolt and Robert, 1994) by introducing additional constraints that ensure propriety of the corresponding posterior; as this is tangential to the current discussion, we omit the detailed construction of the constrained mixture model here. Without this additional complexity, one may instead construct a weakly informative, proper prior by choosing $\eta_c$ to be very close to zero.

Empirical prior
We first gather empirical information by fitting an EM-based algorithm to the data to obtain the maximum likelihood estimates of the parameters (see Section 4.6). Once we have basic estimates of the cluster assignments, we compute the number of points $n_c^\dagger$ assigned to each cluster and rough estimates of the cluster-specific parameters $M_c^\dagger,\mathbf{d}_c^\dagger,V_c^\dagger$ for $c\in\{1,\ldots,C\}$. The idea is to choose the hyperparameter values in Equation 50 so that the corresponding prior distributions have modes at $M_c^\dagger,D_c^\dagger,V_c^\dagger$. For the prior distribution of $\mathbf{d}_c$ we use the procedure described in Section 3.5 to set an appropriate value of $\eta_c$. For $c=1,\ldots,C$ we set $\xi_c^M=M_c^\dagger$, $\xi_c^V=I_p$ and $\gamma_c^M=V_c^\dagger$, $\gamma_c^V=I_p$. The choices of $\nu_c$ and of $\xi_c^D,\gamma_c^D$ are crucial, and it may not be desirable to set very large values for these parameters; we set $\nu_c=n_c^\dagger/K^\dagger$ and choose the entries of $\xi_c^D,\gamma_c^D$ to be close to $n_c^\dagger/K^\dagger$, where $K^\dagger$ determines the relative strength of the prior distribution. To select the hyperparameters for the parameter $\boldsymbol{\pi}$ we set $\alpha_c=n_c^\dagger/K^\dagger$ for $c=1,\ldots,C$.

In any Bayesian model, consistency of the posterior distribution is a desirable property. In the following subsection we establish posterior consistency for our mixture model.
Consider the product parameter space $\Theta:=\widetilde{\mathcal{V}}_{n,p}^{C}\times\mathcal{S}_p^{C}\times\mathcal{V}_{p,p}^{C}$. Let $\boldsymbol{\theta}:=\{\theta_c\}_{c=1}^C=\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C$ denote any point in $\Theta$, and let $\boldsymbol{\theta}_0:=\{M_{0c},\mathbf{d}_{0c},V_{0c}\}_{c=1}^C\in\Theta$ be a particular (true) point. Let $S_\pi:=\{(\pi_1,\pi_2,\cdots,\pi_C)\in(0,1)^C:\sum_{c=1}^C\pi_c=1\}$ be the $C$-simplex and $\boldsymbol{\pi}_0\in S_\pi$ a point in it.

Consider the distance metric $d(\cdot,\cdot)$ on the parameter space $\Xi:=\Theta\times S_\pi$, constructed from appropriate distance metrics on the respective component spaces:
$$d(\boldsymbol{\theta},\boldsymbol{\theta}_0):=\sqrt{\sum_{c=1}^C\left[d_{St}^2(M_c,M_{0c})+d_{Eu}^2(\mathbf{d}_c,\mathbf{d}_{0c})+d_{St}^2(V_c,V_{0c})\right]},\qquad
d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)):=\sqrt{d_{Eu}^2(\boldsymbol{\pi},\boldsymbol{\pi}_0)+d^2(\boldsymbol{\theta},\boldsymbol{\theta}_0)} \tag{54}$$
and, likewise,
$$d((X,\boldsymbol{\theta},\boldsymbol{\pi}),(X_0,\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)):=\sqrt{d_{St}^2(X,X_0)+d_{Eu}^2(\boldsymbol{\pi},\boldsymbol{\pi}_0)+d^2(\boldsymbol{\theta},\boldsymbol{\theta}_0)}, \tag{55}$$
where $d_{Eu}$ is the Euclidean distance and $d_{St}$ is the geodesic distance on the Stiefel manifold. Also consider the class of finite mixtures of ML densities $\mathcal{C}_{\mathcal{ML}}:=\{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c):(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi\}$. We alternatively write $f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)$ for $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ when we wish to emphasize the parametrization. $f_{\boldsymbol{\theta},\boldsymbol{\pi}}:\mathcal{V}_{n,p}\to\mathbb{R}^+$ is a family of probability density functions with respect to the normalized Haar measure $[dX]$ on $\mathcal{V}_{n,p}$. Observe that $\Xi$ and $\mathcal{V}_{n,p}$ are complete separable metric spaces, that $(\boldsymbol{\theta},\boldsymbol{\pi})\to f_{\boldsymbol{\theta},\boldsymbol{\pi}}$ is one-to-one, and that $(X,\boldsymbol{\theta},\boldsymbol{\pi})\to f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ is measurable.

The prior $\Pi$ is defined on $\Xi$. Let $X_1,X_2,\cdots,X_N$ be independent and identically distributed with probability density function $f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$. The posterior distribution $\Pi(A\mid X_1,X_2,\cdots,X_N)$ for any measurable subset $A$ of $\Xi$ is given by
$$\Pi(A\mid X_1,X_2,\cdots,X_N)=\frac{\int_A R_n((\boldsymbol{\theta},\boldsymbol{\pi}))\,d\Pi((\boldsymbol{\theta},\boldsymbol{\pi}))}{\int_\Xi R_n((\boldsymbol{\theta},\boldsymbol{\pi}))\,d\Pi((\boldsymbol{\theta},\boldsymbol{\pi}))}, \tag{56}$$
where
$$R_n((\boldsymbol{\theta},\boldsymbol{\pi}))=\prod_{i=1}^N\frac{f(X_i;(\boldsymbol{\theta},\boldsymbol{\pi}))}{f(X_i;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}. \tag{57}$$
In our model, $\Pi((\boldsymbol{\theta},\boldsymbol{\pi}))$ in Equation 56 is defined with respect to the appropriate product measure on $\Theta$ and the Lebesgue measure on $S_\pi$; the prior $\Pi$ is given by Equation 50.

For $\epsilon>0$, define the following neighborhoods of $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$ (corresponding to the true density $f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$) in $\Xi$:
$$N_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\epsilon\},$$
$$KL_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\left\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:\int_{\mathcal{V}_{n,p}} f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)\log\frac{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}\,[dX]<\epsilon\right\},$$
$$U_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\left\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:\left|\int_{\mathcal{V}_{n,p}} g(X) f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)[dX]-\int_{\mathcal{V}_{n,p}} g(X) f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)[dX]\right|<\epsilon\right\},$$
$$W_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\left\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:\left(\int_{\mathcal{V}_{n,p}}\left(\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}\right)^2[dX]\right)^{1/2}<\epsilon\right\}.$$
The weak neighborhood $U_\epsilon$ is understood to require the corresponding inequality for all bounded and continuous functions $g$ on $\mathcal{V}_{n,p}$. Lemma 12.
A finite mixture of ML densities is strictly positive, bounded awayfrom zero and bounded from above. engupta et al./BMMCSM Proof of Lemma 12.
Let $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))\in\mathcal{C}_{\mathcal{ML}}$ be a density function that is a $C$-component mixture of ML distributions parametrized by $(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi$, that is,
$$f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c). \tag{58}$$
Since the density function $f_{\mathcal{ML}}(\cdot\,;\theta_c):\mathcal{V}_{n,p}\to\mathbb{R}^+$ is continuous on the compact manifold $\mathcal{V}_{n,p}$, the extreme value theorem (Rudin et al., 1964) dictates that $f_{\mathcal{ML}}(\cdot\,;\theta_c)$ is bounded and attains at least one minimum and one maximum. In particular, $f_{\mathcal{ML}}(X;\theta_c)$ has the unique modal orientation $M_cV_c^T$ (page 32 in Chikuse (2012)), where $\theta_c=(M_c,\mathbf{d}_c,V_c)$; likewise, it is easy to see that the minimum value of the density occurs at $-M_cV_c^T$. Hence for any $X\in\mathcal{V}_{n,p}$ we have
$$\frac{\operatorname{etr}\!\left(V_cD_cM_c^T(M_cV_c^T)\right)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)}\;\geq\;f_{\mathcal{ML}}(X;\theta_c)\;\geq\;\frac{\operatorname{etr}\!\left(-V_cD_cM_c^T(M_cV_c^T)\right)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)},$$
$$\Longrightarrow\;\frac{\operatorname{etr}(D_c)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)}\;\geq\;f_{\mathcal{ML}}(X;\theta_c)\;\geq\;\frac{\operatorname{etr}(-D_c)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)},$$
$$\Longrightarrow\;\frac{\exp\!\left(\sum_{i=1}^p d_{ic}\right)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)}\;\geq\;f_{\mathcal{ML}}(X;\theta_c)\;\geq\;\frac{\exp\!\left(-\sum_{i=1}^p d_{ic}\right)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)}\;>\;0. \tag{59}$$
Let $UB=\max_{1\le c\le C}\frac{\exp(\sum_{i=1}^p d_{ic})}{{}_0F_1\left(\frac{n}{2};\frac{D_c^2}{4}\right)}$ and $LB=\min_{1\le c\le C}\frac{\exp(-\sum_{i=1}^p d_{ic})}{{}_0F_1\left(\frac{n}{2};\frac{D_c^2}{4}\right)}$. Using Equations 58 and 59, we get
$$0<LB\leq f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))\leq UB<\infty. \tag{60}$$
$\square$

Lemma 13.
Let $(\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)\in\Xi$. Then for any $\epsilon>0$ there exists a $\delta>0$ such that
$$d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta\;\Longrightarrow\;\sup_{X\in\mathcal{V}_{n,p}}\left|\log\frac{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}\right|<\epsilon,$$
where $f(X;(\boldsymbol{\theta},\boldsymbol{\pi})),f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\in\mathcal{C}_{\mathcal{ML}}$.

Proof of Lemma 13.
Let $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))\in\mathcal{C}_{\mathcal{ML}}$, that is, $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c)$. Note that for each $c=1,\ldots,C$ the function $f_{\mathcal{ML}}(X;\theta_c)$ is continuous in $X$ and $\theta_c$. Since $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ is a linear combination of the functions $\{f_{\mathcal{ML}}(X;\theta_c)\}_{c=1}^C$ with weights $\{\pi_c\}_{c=1}^C$, it too is continuous in $X\in\mathcal{V}_{n,p}$ and $(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi$. Moreover, from Lemma 12, $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ is bounded away from $0$ and $\infty$. Hence $\log f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ is continuous in $X\in\mathcal{V}_{n,p}$ and $(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi$, since $\log$ is continuous and well defined over this range.

Let $B_{(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)}\subset\Xi$ be a compact ball around $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$ with strictly positive, bounded radius, and consider the function $\log f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ restricted to the domain $\mathcal{V}_{n,p}\times B_{(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)}$. Since $\mathcal{V}_{n,p}\times B_{(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)}$ is compact, the restricted function is uniformly continuous. Therefore, for any $\epsilon>0$ there exists a $\delta>0$ such that
$$d((X_1,\boldsymbol{\theta},\boldsymbol{\pi}),(X_2,\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta\;\Longrightarrow\;|\log f(X_1;(\boldsymbol{\theta},\boldsymbol{\pi}))-\log f(X_2;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))|<\epsilon$$
for arbitrary $(X_1,\boldsymbol{\theta},\boldsymbol{\pi}),(X_2,\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)\in\mathcal{V}_{n,p}\times B_{(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)}$. In particular, setting $X_1=X_2=X$ and using the fact that $d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=d((X,\boldsymbol{\theta},\boldsymbol{\pi}),(X,\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta$ for all $X$ (clear from Equations 54 and 55), we have
$$d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta\;\Longrightarrow\;\sup_{X\in\mathcal{V}_{n,p}}|\log f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))-\log f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))|<\epsilon\;\Longrightarrow\;\sup_{X\in\mathcal{V}_{n,p}}\left|\log\frac{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}\right|<\epsilon. \tag{61}$$
$\square$

Theorem 5. (Weak Consistency)
For our prior $\Pi$ (defined in Definition 2 in Section 3),
$$\Pi(U^c\mid X_1,X_2,\cdots,X_N)\to 0\quad a.s.\ F^{\infty}_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0},$$
for any weak neighborhood $U$ of $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$. Here $F_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$ is the probability distribution function corresponding to the density $f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$, and $F^{\infty}_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$ is the corresponding infinite product measure.

Proof of Theorem 5.
For every $\epsilon>0$ there exists a $\delta>0$ such that
$$d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta\;\Longrightarrow\;\sup_{X\in\mathcal{V}_{n,p}}\left|\log\frac{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}\right|<\epsilon\quad\text{[from Lemma 13]},$$
$$\Longrightarrow\;\int_{\mathcal{V}_{n,p}} f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\left|\log\frac{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}\right|[dX]<\epsilon\;\Longrightarrow\;\int_{\mathcal{V}_{n,p}} f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\log\frac{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}[dX]<\epsilon.$$
Hence $N_\delta((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\subset KL_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$. It is easy to see that $\Pi$ puts strictly positive measure on $N_\delta((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$ for all $\delta>0$, because by the definition of the distance metric in Equation 54 any neighborhood around $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$ has positive measure. It follows that $\Pi$ puts strictly positive measure on $KL_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$ for all $\epsilon>0$. The theorem then follows from Schwartz (1965). $\square$
For our prior
$\Pi$,
$$\Pi(W^c\mid X_1,X_2,\cdots,X_N)\to 0\quad a.s.\ F^{\infty}_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0},$$
for any Hellinger neighborhood $W$ of $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$.

Proof of Theorem 6.
For any $\epsilon>0$, recall the weak neighborhood
$$U_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\left\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:\left|\int_{\mathcal{V}_{n,p}} g(X) f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)[dX]-\int_{\mathcal{V}_{n,p}} g(X) f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)[dX]\right|<\epsilon\right\}$$
for all bounded and continuous functions $g$ on $\mathcal{V}_{n,p}$. For each $(\boldsymbol{\theta},\boldsymbol{\pi})\in U_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$ choose
$$g(X):=\frac{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}}{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}+\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}}.$$
From Lemma 12 the densities $f_{\boldsymbol{\theta},\boldsymbol{\pi}}$ are bounded away from 0, ensuring a positive lower bound for the denominator; the upper bound for the denominator follows from the upper bound in the same lemma, and a similar argument holds for the numerator. This ensures boundedness of $g(X)$, and continuity follows from the continuity of the densities (Lemma 12). Thus $g(X)$ is a bounded and continuous function. Hence,
$$\left|\int\!\left(\frac{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}}}{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}}+\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}}}\right) f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)[dX]-\int\!\left(\frac{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}}}{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}}+\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}}}\right) f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)[dX]\right|<\epsilon$$
$$\Longrightarrow\;\left|\int\left(\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}\right)^2[dX]\right|<\epsilon\;\Longrightarrow\;\left(\int\left(\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}\right)^2[dX]\right)^{1/2}<\epsilon^{1/2}. \tag{62}$$
Hence $U_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\subset W_{\sqrt{\epsilon}}((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$, and since $\epsilon>0$ is arbitrary, the theorem follows from an application of Theorem 5. $\square$

In order to perform Bayesian inference it is important to compute statistics related to the posterior distribution, e.g. the posterior mean or posterior quantiles. The posterior density defined in Equation 52 is intractable in the sense that these quantities cannot be computed analytically, nor can i.i.d. samples be generated directly from the posterior distribution. However, we can design a Gibbs sampling Markov chain to generate samples from the posterior distribution; the Markov chain corresponding to the Gibbs sampler converges to the desired stationary distribution.

To implement the Gibbs sampler we sample the cluster-specific parameters along with the latent indicator $Z_i$ for the cluster assignment of each data point $X_i$, $i=1,2,\cdots,N$. The conditional distribution of $Z_i$ given all other parameters is
$$P(Z_i=c\mid\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C,\boldsymbol{\pi},\{X_i\}_{i=1}^N)=\frac{\pi_c f_{\mathcal{ML}}(X_i\mid M_c,\mathbf{d}_c,V_c)}{\sum_{c'=1}^C\pi_{c'} f_{\mathcal{ML}}(X_i\mid M_{c'},\mathbf{d}_{c'},V_{c'})} \tag{63}$$
for $c=1,2,\cdots,C$. The conditional posterior distribution of $\boldsymbol{\pi}$ is
$$\boldsymbol{\pi}\mid\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C,\{X_i,Z_i\}_{i=1}^N\sim\mathrm{Dir}(\alpha_1+N_1,\alpha_2+N_2,\cdots,\alpha_C+N_C), \tag{64}$$
where $N_c=\sum_{i=1}^N I(Z_i=c)$ for $c=1,\cdots,C$. Given the latent cluster assignments, the conditional posterior distributions of the cluster-specific parameters are independent across clusters. Due to conditional functional conjugacy for $M_c$ and $V_c$, it is straightforward to show that the corresponding full conditional posteriors belong to the ML class of distributions.
In particular,
$$M_c\mid(\mathbf{d}_c,V_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi})\sim\mathcal{ML}\left(\cdot\,;(S_G^M,S_G^D,S_G^V)\right), \tag{65}$$
$$V_c\mid(M_c,\mathbf{d}_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi})\sim\mathcal{ML}\left(\cdot\,;(S_H^M,S_H^D,S_H^V)\right), \tag{66}$$
where $(S_G^M,S_G^D,S_G^V)$ and $(S_H^M,S_H^D,S_H^V)$ are the singular value decompositions of the matrices $(D_cV_c^T N_c\bar{X}_c^T+G_c)$ and $(D_cM_c^T N_c\bar{X}_c+H_c)$, respectively. Observe that
$$\bar{X}_c=\frac{1}{N_c}\sum_{i=1}^N X_i I(Z_i=c).$$
Efficient sampling from the ML distribution is done using the algorithm developed in Hoff (2009).

The conditional posterior distribution of $\mathbf{d}_c$ given the other parameters has density
$$f(\mathbf{d}_c\mid M_c,V_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi})\propto\frac{\exp\!\left((\nu_c\eta_c^T+N_c\phi_c^T)\,\mathbf{d}_c\right)}{\left[{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)\right]^{\nu_c+N_c}}\,I(\mathbf{d}_c\in\mathcal{S}_p),$$
that is,
$$\mathbf{d}_c\mid(M_c,V_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi})\sim\mathrm{IMDY}\left(\cdot\,;(\nu_c+N_c),\ \frac{\nu_c\eta_c}{\nu_c+N_c}+\frac{N_c\phi_c}{\nu_c+N_c}\right), \tag{67}$$
where $\phi_c=(\phi_{c1},\phi_{c2},\cdots,\phi_{cp})$ and $\phi_{cj}$ is the $j$-th diagonal element of the matrix $M_c^T\bar{X}_c V_c$ for $j=1,2,\cdots,p$. Note that this can also be verified from Equation 46 in Section 3.3. Because of the non-standard form of the posterior distribution given in Equation 67, sampling $\mathbf{d}_c$ is challenging.

The density of the full conditional distribution of $d_{cj}$, the $j$-th diagonal entry of $D_c$, for $j=1,2,\cdots,p$, is
$$f\!\left(d_{cj}\mid\mathbf{d}_{-cj},M_c,V_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi}\right)\propto\frac{\exp\!\left(d_{cj}(\nu_c\eta_{cj}+N_c\phi_{cj})\right)}{\left[{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)\right]^{\nu_c+N_c}}\,I\!\left(d_{c(j+1)}<d_{cj}<d_{c(j-1)}\right). \tag{68}$$
Let $F_{cj}(\cdot)$ denote the corresponding conditional distribution function of $d_{cj}$. We describe the detailed implementation of sampling from this conditional distribution in the paragraph following this subsection. For a generic representation of the conditional distributions of $d_{cj}$ for all $j=1,\cdots,p$, we define $d_{c0}=\infty$ and $d_{c(p+1)}=0$, and write
$$\mathbf{d}_{-cj}:=\{d_{c1},\cdots,d_{c(j-1)},d_{c(j+1)},\cdots,d_{cp}\}. \tag{69}$$
We have designed an efficient scheme to sample $\mathbf{d}_c$ using the set of $p$ conditional distributions given in Equation 68. Observe that the support of the distribution of $d_{c1}$ is $[d_{c2},\infty)$, while the supports of the others are bounded. Note also that the posterior distribution of $d_{c1}$ is unimodal (see Theorem 3), and we exploit this fact to design an efficient sampler for $d_{c1}$. A description of the sampling steps is given in Algorithm 1 below.

Note that, because of the log-concavity of the conditional distribution of $d_{cj}$, an adaptive rejection sampler (ARS) could in principle be used. However, the standard ARS algorithm cannot be immediately implemented in this context because of the computations involving the ${}_0F_1(\cdot)$ function, so we reserve this development for future work.

Gibbs algorithm
The following algorithm outlines the steps of the Gibbs sampler, giving the full conditional distributions of the parameters at the $k$-th step based on the samples drawn at the $(k-1)$-th step.
Algorithm 1: MCMC (Gibbs sampling) scheme

for c = 1, 2, ..., C do
    initialize the MCMC chain with M_c^(0), D_c^(0), V_c^(0) and pi^(0)
end for
k = 0
repeat
    k = k + 1
    for i = 1, 2, ..., N do
        Z_i^(k) ~ Categorical( . ; {1, 2, ..., C}, pi^(k-1), {M_c^(k-1), D_c^(k-1), V_c^(k-1)}_{c=1}^C )
    end for
    for c = 1, 2, ..., C do
        N_c^(k) = sum_{i=1}^N I(Z_i^(k) = c)
        Xbar_c^(k) = (1 / N_c^(k)) sum_{i=1}^N X_i I(Z_i^(k) = c)
        M_c^(k) ~ ML( . ; ( (S_G^M)^(k-1), (S_G^D)^(k-1), (S_G^V)^(k-1) ) )
        V_c^(k) ~ ML( . ; ( (S_H^M)^(k), (S_H^D)^(k-1), (S_H^V)^(k-1) ) )
        d_cj^(k) ~ F_cj( . ; (d_{-cj})^(k), M_c^(k), V_c^(k), {X_i, Z_i}_{i=1}^N, pi^(k-1) )  for all j = 1, 2, ..., p
    end for
    pi^(k) ~ Dir( . ; alpha_1 + N_1^(k), ..., alpha_C + N_C^(k) )
until convergence

Note that $(\mathbf{d}_{-cj})^{(k)}$ is given by the set
$$(\mathbf{d}_{-cj})^{(k)}:=\left\{(d_{c1})^{(k)},\cdots,(d_{c(j-1)})^{(k)},(d_{c(j+1)})^{(k-1)},\cdots,(d_{cp})^{(k-1)}\right\}.$$
The stationary distribution of the Gibbs sampling Markov chain is the posterior distribution corresponding to Equation 52, and convergence to this stationary distribution does not depend on the choice of the initial point. However, in order to run the MCMC method, Algorithm 1 must be initialized with certain values (e.g. $M_c^{(0)}, D_c^{(0)}, V_c^{(0)}$ and $\boldsymbol{\pi}^{(0)}$). In practice, especially for large-scale datasets, a poor choice of initial values can lead to slow convergence of the MCMC method. To obtain a reasonable initial point, we first run hierarchical clustering (Lattin, Carroll and Green, 2003; Rokach and Maimon, 2005) on the entire dataset with a fixed number of clusters $C$ (for selection of the optimal $C$ see Section 4.5) to obtain initial cluster assignments for the data points. Based on this initial assignment, we adopt the maximum likelihood technique described in Chikuse (2012) to obtain initial values of the cluster-specific parameters. This initialization procedure has worked well for our simulated datasets. We note that the selection of the initial point may not be crucial for small datasets; for large datasets, however, a suitable initial point can save a significant amount of time by reducing the number of burn-in iterations.
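To make the flow of Algorithm 1 concrete, the following is a minimal Python skeleton of one Gibbs sweep. The component samplers (`sample_ML`, `sample_d_conditional`) and the ML log-density (`log_fML`) are placeholders standing in for the Hoff (2009) sampler and the rejection sampler described below; they are assumptions of this sketch, not implementations of the paper's samplers.

```python
import numpy as np

def gibbs_sweep(X, state, alpha, sample_ML, sample_d_conditional, log_fML):
    """One sweep of the Gibbs sampler in Algorithm 1.
    X: array (N, n, p); state: dict with lists 'M', 'd', 'V' (length C) and vector 'pi'."""
    N, C = X.shape[0], len(state["M"])
    # 1. Update latent assignments Z_i from their categorical full conditionals (Eq. 63).
    logw = np.stack([np.log(state["pi"][c])
                     + np.array([log_fML(X[i], state["M"][c], state["d"][c], state["V"][c])
                                 for i in range(N)])
                     for c in range(C)], axis=1)
    probs = np.exp(logw - logw.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    Z = np.array([np.random.choice(C, p=probs[i]) for i in range(N)])
    # 2. Update cluster-specific parameters given the assignments.
    counts = np.zeros(C)
    for c in range(C):
        members = X[Z == c]
        counts[c] = len(members)
        xbar = members.mean(axis=0) if len(members) else np.zeros(X.shape[1:])
        state["M"][c] = sample_ML("M", c, xbar, counts[c], state)        # Eq. 65
        state["V"][c] = sample_ML("V", c, xbar, counts[c], state)        # Eq. 66
        state["d"][c] = sample_d_conditional(c, xbar, counts[c], state)  # Eq. 67-68
    # 3. Update mixture weights from the Dirichlet full conditional (Eq. 64).
    state["pi"] = np.random.dirichlet(alpha + counts)
    return state, Z
```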
Efficient Rejection Sampler

In this section we describe the rejection sampling procedure for the conditional distribution of $(d_1\mid(d_2,\cdots,d_p))$ when $\mathbf{d}\sim\mathrm{IMDY}(\cdot\,;\nu,\eta)$ for some $\nu>0$ and $\max_{2\le j\le p}\eta_j<1$, where $\eta=(\eta_1,\ldots,\eta_p)$. Let $m$ be the mode of the conditional distribution $g(\cdot):=g(\cdot\,;\nu,\eta\mid(d_2,\ldots,d_p))$ of the variable $d_1$ given $(d_2,\ldots,d_p)$ when $\eta_1>0$; in case $\eta_1<0$, we explicitly set $m$ to be $0$.

Using the property of the conditional distribution described in Lemma 11, we compute a critical point $RT_{crit}$ so that
$$P\!\left(d_1>RT_{crit}\mid(d_2,\cdots,d_p),\{X_j\}_{j=1}^N\right)<\epsilon$$
for a small choice of $\epsilon$; this effectively restricts the support of $d_1$ to the bounded interval $(0,RT_{crit}]$. We then employ an efficient rejection sampling scheme to sample from the desired distribution as follows.

Let $\delta=RT_{crit}/N_{bin}$, where $N_{bin}$ is the total number of partitions of the interval $(0,RT_{crit}]$, and let $k=[m/\delta]+1$, where $[m/\delta]$ denotes the greatest integer less than or equal to $m/\delta$. Now define the function
$$\tilde{g}(x):=\sum_{j=1}^{k-1} g(j\delta)\,I_{((j-1)\delta,\,j\delta]}(x)+g(m)\,I_{((k-1)\delta,\,k\delta]}(x)+\sum_{j=k+1}^{N_{bin}} g((j-1)\delta)\,I_{((j-1)\delta,\,j\delta]}(x).$$
Note that $\tilde{g}(x)\ge g(x)$ for all $x\in(0,RT_{crit}]$, since $g(\cdot)$ is a unimodal log-concave function with maximum at $m$. To sample from the distribution with density proportional to $\tilde{g}(\cdot)$ we consider $p_j=q_j/\sum_{j=1}^{N_{bin}} q_j$ for $j=1,2,\cdots,N_{bin}$, where
$$q_j=\begin{cases} g(j\delta) & \text{if } 1\le j<[m/\delta]+1,\\ g(m) & \text{if } j=[m/\delta]+1,\\ g((j-1)\delta) & \text{if } [m/\delta]+1<j\le N_{bin}. \end{cases}$$
The steps of the rejection sampler are given below.
– Sample $Z$ from the discrete distribution with support $\{1,2,\ldots,N_{bin}\}$ and corresponding probabilities $\{p_j\}_{j=1}^{N_{bin}}$.
– Sample $y\sim\mathrm{Uniform}((Z-1)\delta,Z\delta)$.
– Sample $U\sim\mathrm{Uniform}(0,1)$ and accept $y$ if $U\le g(y)/\tilde{g}(y)$.

Note that the efficiency of the sampler increases when we choose larger values of $N_{bin}$.
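A minimal Python sketch of this binned rejection sampler is given below. Here `g` is any unimodal, log-concave target known up to a constant (a toy example is used), and `m`, `rt_crit`, `n_bin` are assumed to be supplied by the procedure above.

```python
import numpy as np

def binned_rejection_sampler(g, m, rt_crit, n_bin, rng=np.random.default_rng()):
    """Draw one sample from a unimodal, log-concave density g (known up to a constant)
    supported on (0, rt_crit], using the piecewise-constant envelope described above."""
    delta = rt_crit / n_bin
    k = int(m // delta) + 1                     # index of the bin containing the mode
    # Envelope value per bin: right endpoint left of the mode, g(m) at the modal bin,
    # left endpoint right of the mode.
    heights = np.array([g(j * delta) if j < k else (g(m) if j == k else g((j - 1) * delta))
                        for j in range(1, n_bin + 1)])
    probs = heights / heights.sum()
    while True:
        z = rng.choice(n_bin, p=probs) + 1      # bin index in {1, ..., n_bin}
        y = rng.uniform((z - 1) * delta, z * delta)
        if rng.uniform() <= g(y) / heights[z - 1]:
            return y

# Toy usage: an unnormalized Gamma(3, 1)-shaped density with mode m = 2.
g = lambda x: x**2 * np.exp(-x)
draw = binned_rejection_sampler(g, m=2.0, rt_crit=20.0, n_bin=200)
```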
Posterior summary

There are multiple ways to summarize the posterior distribution for the estimates of the parameters. We choose to use the parametrization given in Equation 1 for the posterior summary, since it allows us to report the estimation error in a more interpretable way. Using $M,\mathbf{d},V$ it is challenging to report the error, as $M$ and $V$ lie on a non-Euclidean space, and generating summaries of results for parameters on $\mathcal{V}_{n,p}$ is not straightforward; a generalized notion of mean, such as the Karcher mean, could be investigated. Note that we can directly compare the true $F$ and $\hat{F}$, as there is no constraint on the elements of $F$; such a direct comparison is not immediately possible for the $M,\mathbf{d},V$ parametrization given in Equation 2, which is adopted mainly to achieve computational tractability.

In order to identify the optimal number of clusters we use the Deviance Information Criterion for Bayesian model selection (
DIC) (Gelman et al., 2003; Spiegelhalter et al., 2002). It has been successfully used as a model selection criterion in various Bayesian models (Berg, Meyer and Yu, 2004; François and Laval, 2011; Khare, Pal and Su, 2017). To explain the DIC criterion in the context of the current model, let $\theta^{(C)}=\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C$ denote all the parameter vectors, and define the deviance function $Dev(\theta^{(C)}):=-2\log L(\theta^{(C)})$, where $L(\cdot)$ is the likelihood function defined in Equation 48. Let $\{\theta^{(C,i)}\}_{i=1}^S$ be $S$ values of the parameters sampled from the corresponding posterior distribution in Equation 52. The DIC score for a given choice of $C$ is computed as
$$DIC(C):=\overline{Dev}(C)+\frac{1}{2(S-1)}\sum_{i=1}^S\left(Dev(\theta^{(C,i)})-\overline{Dev}(C)\right)^2,\qquad \overline{Dev}(C)=\frac{1}{S}\sum_{i=1}^S Dev(\theta^{(C,i)})$$
(Gelman et al. (2003), page 185). To infer the number of clusters, samples are generated from different Markov chains assuming different values of $C$, and the optimal number of clusters is given by $C_{opt}=\operatorname*{argmin}_C DIC(C)$. For a detailed discussion of DIC see DeIorio and Robert (2002); Gelman et al. (2003); Titterington et al. (2006). In the context of mixture models specifically, DeIorio and Robert (2002) described possible limitations of the standard DIC criterion. Following the alternative criteria proposed in Titterington et al. (2006), we considered several alternative DIC score functions as defined there. We conducted an extensive numerical study with several simulated data sets and found that one of these alternative score functions outperforms the other criteria in terms of efficiency for our model; its computation also takes significantly less time than that of the standard DIC. Therefore one may use
this alternative DIC score instead of the standard DIC whenever computation of the standard DIC takes significantly longer, particularly for large datasets. Additional details, along with a table comparing the performance of the different DIC scores in our simulation study, are given in Section 5, where we observe that the chosen score identifies the correct number of clusters in most cases for our model.

In this section we develop an iterative optimization technique to obtain point estimates of the parameters in the model given by Equation 49. Specifically, we employ the expectation-maximization (EM) algorithm (Dempster, Laird and Rubin, 1977) to obtain the mode of the posterior distribution of the parameters in the model specified by Equation 49. The algorithm is computationally fast and can be useful for obtaining rough estimates of the parameters, especially for large datasets. We also use this algorithm to select appropriate values of the hyperparameters (see Section 3.5) for the MCMC-based posterior inference, and the resulting rough estimates can help find suitable initial values for the MCMC procedure, particularly when analyzing massive data. To describe the procedure, consider the complete-data log-likelihood combined with the log prior (from Equation 52):
$$\sum_{i=1}^N\sum_{c=1}^C\left[Z_{ic}\log f_{\mathcal{ML}}(X_i\mid\theta_c)+Z_{ic}\log\pi_c\right]+\sum_{c=1}^C\left[\alpha_c\log\pi_c+\operatorname{tr}\!\left(\nu_c V_cD_cM_c^T\Psi\right)-\nu_c\log\left({}_0F_1\!\Big(\frac{n}{2};\frac{D_c^2}{4}\Big)\right)\right], \tag{70}$$
where $Z_{ic}=I(Z_i=c)$ and
$$\log f_{\mathcal{ML}}(X_i\mid\theta_c)=\operatorname{tr}\!\left((V_cD_cM_c^T)X_i\right)-\log\left({}_0F_1\!\Big(\frac{n}{2};\frac{D_c^2}{4}\Big)\right).$$
We start the iterative algorithm at an initial point $(\boldsymbol{\theta}^{(0)},\boldsymbol{\pi}^{(0)})$ and construct a sequence of parameter values $\{(\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)})\}_{t\ge 0}$, where we move from $(\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)})$ to $(\boldsymbol{\theta}^{(t+1)},\boldsymbol{\pi}^{(t+1)})$ using the "E-step" and "M-step" described below.

E-step:
We construct the objective function
$$Q(\boldsymbol{\theta},\boldsymbol{\pi}\mid\mathbf{X},\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)}):=\sum_{i=1}^N\sum_{c=1}^C\left[\langle Z_{ic}\rangle\log f_{\mathcal{ML}}(X_i\mid\theta_c)+\langle Z_{ic}\rangle\log\pi_c\right]+\sum_{c=1}^C\left[\alpha_c\log\pi_c+\operatorname{tr}\!\left(\nu_c V_cD_cM_c^T\Psi\right)-\nu_c\log\left({}_0F_1\!\Big(\frac{n}{2};\frac{D_c^2}{4}\Big)\right)\right], \tag{71}$$
where
$$\langle Z_{ic}\rangle:=E(Z_{ic}\mid\mathbf{X},\boldsymbol{\theta}^{(t)})=\frac{\pi_c^{(t)} f_{\mathcal{ML}}(X_i\mid\theta_c^{(t)})}{\sum_{k=1}^C\pi_k^{(t)} f_{\mathcal{ML}}(X_i\mid\theta_k^{(t)})}.$$

M-step:
In this step we maximize $Q(\boldsymbol{\theta},\boldsymbol{\pi}\mid\mathbf{X},\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)})$ with respect to $\boldsymbol{\theta},\boldsymbol{\pi}$. It is easy to see that $Q$ is maximized in $\boldsymbol{\pi}$ at $\boldsymbol{\pi}=\hat{\boldsymbol{\pi}}$, whose $c$-th component is
$$\hat{\pi}_c=\frac{\alpha_c+\sum_{i=1}^N\langle Z_{ic}\rangle}{N+\sum_{c'=1}^C\alpha_{c'}}\quad\text{for }c=1,\ldots,C.$$
Recall that $\boldsymbol{\theta}:=\{\theta_c\}_{c=1}^C$ with $\theta_c=\{M_c,\mathbf{d}_c,V_c\}$. Hence $Q(\boldsymbol{\theta},\boldsymbol{\pi}\mid\mathbf{X},\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)})$ can be maximized by maximizing, separately for each $c=1,\ldots,C$, the function
$$\operatorname{tr}\!\left(V_cD_cM_c^T\left[\widetilde{X}^{(c)}+\Psi\right]\right)-\nu_c\log\left({}_0F_1\!\Big(\frac{n}{2};\frac{D_c^2}{4}\Big)\right) \tag{72}$$
with respect to the variables $M_c\in\widetilde{\mathcal{V}}_{n,p}$, $V_c\in\mathcal{V}_{p,p}$ and $\mathbf{d}_c\in\mathcal{S}_p$, where
$$\widetilde{X}^{(c)}=\frac{\sum_{i=1}^N\langle Z_{ic}\rangle X_i}{\sum_{i=1}^N\langle Z_{ic}\rangle}.$$
Let $\widetilde{M}^{(c)},\widetilde{D}^{(c)}$ and $\widetilde{V}^{(c)}$ be the unique singular value decomposition (Chikuse, 2012) of the matrix $[\widetilde{X}^{(c)}+\Psi]$, let $\widetilde{\mathbf{d}}^{(c)}$ be the diagonal elements of $\widetilde{D}^{(c)}$, and let $\widehat{\mathbf{d}}_c$ be the solution of the system of equations $h(\widehat{\mathbf{d}}_c)=\widetilde{\mathbf{d}}^{(c)}$, where
$$h(\mathbf{d}):=\left(\frac{\partial}{\partial\mathbf{d}}\,{}_0F_1\!\Big(\frac{n}{2};\frac{D^2}{4}\Big)\right)\Big/\,{}_0F_1\!\Big(\frac{n}{2};\frac{D^2}{4}\Big).$$
The standard Newton-Raphson (NR) method (Wright and Nocedal, 1999) can be used to solve $h(\widehat{\mathbf{d}}_c)=\widetilde{\mathbf{d}}^{(c)}$ for $\widehat{\mathbf{d}}_c$; for the case $p=2$ we derive the explicit expression of the Hessian matrix and show the NR steps for solving for $\widehat{\mathbf{d}}_c$ in Section 4.6.1. From Chikuse (2012) we get that the objective function in Equation 72 is maximized at $\widehat{M}_c=\widetilde{M}^{(c)}$, $\widehat{V}_c=\widetilde{V}^{(c)}$ and $\widehat{D}_c$, where $\widehat{D}_c$ is the diagonal matrix with diagonal elements $\widehat{\mathbf{d}}_c$.

Finally, we move to the values $(\boldsymbol{\theta}^{(t+1)},\boldsymbol{\pi}^{(t+1)})$ by setting
$$\boldsymbol{\theta}^{(t+1)}:=\left\{\left(\widehat{M}_c,\widehat{\mathbf{d}}_c,\widehat{V}_c\right)\right\}_{c=1}^C\quad\text{and}\quad\boldsymbol{\pi}^{(t+1)}:=\hat{\boldsymbol{\pi}}.$$
We stop the iteration when convergence is achieved, i.e. when the parameter values in two consecutive iterations are sufficiently close.
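For illustration, a minimal Python sketch of one EM iteration is given below. The ML log-density (`log_fML`) and the solver for $h(\mathbf{d})=\widetilde{\mathbf{d}}$ (`solve_h`, e.g. via the Newton-Raphson scheme of Section 4.6.1) are user-supplied placeholders, not part of any library.

```python
import numpy as np

def em_step(X, params, alpha, Psi, nu, log_fML, solve_h):
    """One EM iteration. X: (N, n, p); params: dict with lists 'M', 'd', 'V' (length C)
    and vector 'pi'. log_fML and solve_h are assumed callables."""
    N, C = X.shape[0], len(params["M"])
    # E-step: posterior responsibilities <Z_ic>.
    logw = np.stack([np.log(params["pi"][c])
                     + np.array([log_fML(X[i], params["M"][c], params["d"][c], params["V"][c])
                                 for i in range(N)]) for c in range(C)], axis=1)
    resp = np.exp(logw - logw.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: mixture weights, then per-cluster SVD-based update (Eq. 72).
    params["pi"] = (alpha + resp.sum(axis=0)) / (N + alpha.sum())
    for c in range(C):
        Xtilde = np.tensordot(resp[:, c], X, axes=1) / resp[:, c].sum()
        U, s, Vt = np.linalg.svd(Xtilde + Psi, full_matrices=False)
        params["M"][c], params["V"][c] = U, Vt.T
        params["d"][c] = solve_h(s, nu[c])   # solve h(d) = d-tilde (Newton-Raphson)
    return params, resp
```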
For this subsection we omit the subscript $c$ for ease of notation. Now observe that
$$
{}_0F_1\!\left(\tfrac{p}{2}+2k;\ \tfrac{d_1^2+d_2^2}{4}\right)=\Gamma\!\left(\tfrac{p}{2}+2k\right)\left(\tfrac{\sqrt{d_1^2+d_2^2}}{2}\right)^{-\left(\frac{p}{2}+2k-1\right)} I_{\frac{p}{2}+2k-1}\!\left(\sqrt{d_1^2+d_2^2}\right),
$$
where $I_\nu(\cdot)$ is the modified Bessel function of the first kind of order $\nu$. Taking the partial derivative with respect to $d_1$ we have
$$
\frac{\partial}{\partial d_1}\,{}_0F_1\!\left(\tfrac{p}{2}+2k;\ \tfrac{d_1^2+d_2^2}{4}\right)=\frac{d_1}{2\left(\tfrac{p}{2}+2k\right)}\,{}_0F_1\!\left(\tfrac{p}{2}+2k+1;\ \tfrac{d_1^2+d_2^2}{4}\right).
$$
Consider the expression for the hypergeometric function of matrix argument with a $2\times 2$ argument,
$$
{}_0F_1\!\left(\tfrac{p}{2};\ \tfrac{D^2}{4}\right)=\sum_{k=0}^{\infty}\frac{\left(\tfrac{d_1^2}{4}\right)^k\left(\tfrac{d_2^2}{4}\right)^k}{\left(\tfrac{p-1}{2}\right)_k\left(\tfrac{p}{2}\right)_{2k}\,k!}\ {}_0F_1\!\left(\tfrac{p}{2}+2k;\ \tfrac{d_1^2+d_2^2}{4}\right). \tag{73}
$$
This representation is also useful because it gives a good handle on the error incurred when the infinite series is truncated after a finite number of terms. Let us use the notation
$$
T_1=\frac{\left(\tfrac{d_1^2}{4}\right)^k\left(\tfrac{d_2^2}{4}\right)^k}{\left(\tfrac{p-1}{2}\right)_k\left(\tfrac{p}{2}\right)_{2k}\,k!},\quad
T_2={}_0F_1\!\left(\tfrac{p}{2}+2k;\tfrac{d_1^2+d_2^2}{4}\right),\quad
T_3={}_0F_1\!\left(\tfrac{p}{2}+2k+1;\tfrac{d_1^2+d_2^2}{4}\right),\quad
T_4={}_0F_1\!\left(\tfrac{p}{2}+2k+2;\tfrac{d_1^2+d_2^2}{4}\right).
$$
Differentiating Equation 73 term by term, we derive
$$
\frac{\partial}{\partial d_1}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{2k}{d_1}T_2+\frac{d_1}{2(\tfrac{p}{2}+2k)}T_3\right\},\qquad
\frac{\partial}{\partial d_2}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{2k}{d_2}T_2+\frac{d_2}{2(\tfrac{p}{2}+2k)}T_3\right\},
$$
$$
\frac{\partial^2}{\partial d_1\partial d_2}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{4k^2}{d_1d_2}T_2+\frac{k\,(d_1^2+d_2^2)}{d_1d_2\,(\tfrac{p}{2}+2k)}T_3+\frac{d_1}{2(\tfrac{p}{2}+2k)}\frac{d_2}{2(\tfrac{p}{2}+2k+1)}T_4\right\},
$$
$$
\frac{\partial^2}{\partial d_1^2}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{2k(2k-1)}{d_1^2}T_2+\frac{4k+1}{2(\tfrac{p}{2}+2k)}T_3+\frac{d_1}{2(\tfrac{p}{2}+2k)}\frac{d_1}{2(\tfrac{p}{2}+2k+1)}T_4\right\},
$$
$$
\frac{\partial^2}{\partial d_2^2}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{2k(2k-1)}{d_2^2}T_2+\frac{4k+1}{2(\tfrac{p}{2}+2k)}T_3+\frac{d_2}{2(\tfrac{p}{2}+2k)}\frac{d_2}{2(\tfrac{p}{2}+2k+1)}T_4\right\}. \tag{74}
$$
Denoting $R(d_1,d_2)={}_0F_1\!\left(\tfrac{n}{2};\tfrac{D^2}{4}\right)$, the Hessian matrix is written with the help of the set of Equations in 74 as
$$H=\begin{bmatrix} H_{11} & H_{12}\\ H_{21} & H_{22}\end{bmatrix},$$
where
$$H_{11}=\frac{\partial}{\partial d_1}\!\left(\frac{\partial R/\partial d_1}{R}\right)=\frac{\partial^2 R/\partial d_1^2}{R}-\left(\frac{\partial R/\partial d_1}{R}\right)^2,\qquad
H_{12}=\frac{\partial}{\partial d_2}\!\left(\frac{\partial R/\partial d_1}{R}\right)=\frac{\partial^2 R/(\partial d_1\partial d_2)}{R}-\left(\frac{\partial R/\partial d_1}{R}\right)\!\left(\frac{\partial R/\partial d_2}{R}\right),$$
$$H_{21}=H_{12},\qquad
H_{22}=\frac{\partial}{\partial d_2}\!\left(\frac{\partial R/\partial d_2}{R}\right)=\frac{\partial^2 R/\partial d_2^2}{R}-\left(\frac{\partial R/\partial d_2}{R}\right)^2,$$
and $R$ is written in place of $R(d_1,d_2)$ for brevity. The update equation for the NR method, for solving $h(\widehat{\mathbf{d}})=\widetilde{\mathbf{d}}^{(c)}$, is then
$$\begin{bmatrix}\widehat{d}_1^{\,new}\\ \widehat{d}_2^{\,new}\end{bmatrix}=\begin{bmatrix}\widehat{d}_1^{\,old}\\ \widehat{d}_2^{\,old}\end{bmatrix}-H^{-1}_{\left(\widehat{d}_1^{\,old},\widehat{d}_2^{\,old}\right)}\left(h\!\left(\widehat{d}_1^{\,old},\widehat{d}_2^{\,old}\right)-\widetilde{\mathbf{d}}^{(c)}\right).$$
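A short Python sketch of this Newton-Raphson solve is given below; the gradient and Hessian of $\log {}_0F_1$ are assumed to be supplied as callables, e.g. via truncation of the series in Equations 73-74.

```python
import numpy as np

def newton_raphson_d(d_init, d_tilde, grad_log_0F1, hess_log_0F1, tol=1e-8, max_iter=100):
    """Solve h(d) = d_tilde for d = (d1, d2) by Newton-Raphson, where h is the gradient
    of log 0F1(n/2; D^2/4) and hess_log_0F1 returns its Hessian H."""
    d = np.asarray(d_init, dtype=float)
    for _ in range(max_iter):
        residual = grad_log_0F1(d) - np.asarray(d_tilde)
        if np.linalg.norm(residual) < tol:
            break
        d = d - np.linalg.solve(hess_log_0F1(d), residual)
    return d
```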
5. Experiments with simulated data
We carry out two sets of simulations to investigate the clustering framework based on our proposed Bayesian mixture model. To evaluate the performance of the clustering method we consider the following three criteria:
(a) identification of the correct number of clusters,
(b) correct assignment of each data point to the appropriate cluster, evaluated through several well-established external clustering metrics, and
(c) accuracy in the estimation of the cluster-specific parameters.

To evaluate criterion (a), we consider two simulation scenarios in which the true numbers of clusters are three and four, respectively. In each case we generate 50 individual datasets, with 400 and 500 data points per dataset, respectively. For the rest of this section we refer to these two scenarios as simulation (i) and simulation (ii). In simulation (i), the three cluster-specific parameter matrices are fixed in advance.

We select appropriate values of the hyperparameters for the prior distributions in Equation 50 empirically, using the procedure developed in Section 4.2; the value of $K^\dagger$ is set to 20, so that the prior concentration corresponds to roughly 5% of the size of the respective cluster. In general, a good initial point expedites the practical convergence of the MCMC procedure, so we use the procedure described in Section 4.6 to set the initial values of the parameters $M$, $\mathbf{d}$ and $V$.

The optimal number of clusters is chosen by the DIC criterion described in Section 4.5. We performed numerous experiments with several score functions for DIC (Titterington et al., 2006) apart from the standard definition of DIC, running our model with several candidate numbers of clusters (starting from 2) for simulations (i) and (ii). We present a summary of the results (see Table 1) for the standard DIC and the alternative DIC score: in almost all cases (94% of datasets using the standard DIC and 95% using the alternative score) we are able to select the correct number of clusters, and the computation time for the alternative score is significantly less than that of the standard DIC.

Table 1: Number of datasets in which the correct number of clusters is identified with the standard and the alternative DIC scores.

We notice that in the simulations, whenever the model fails to identify the true number of clusters, it always overestimates the number of clusters. Realizing this, we designed a penalized version of the standard DIC criterion, with which we significantly improve the identification of the correct model.
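As a concrete illustration of the DIC(C) computation used above (Section 4.5), the following is a minimal Python sketch; the posterior deviance draws are assumed to be available from the MCMC output.

```python
import numpy as np

def dic_score(deviance_samples):
    """DIC(C) = mean deviance + p_D, where p_D is half the sample variance of the
    deviance draws Dev(theta^(C,i)) = -2 log L(theta^(C,i)) over S posterior samples."""
    dev = np.asarray(deviance_samples, dtype=float)
    return dev.mean() + dev.var(ddof=1) / 2.0

# Model selection: fit the mixture for each candidate C and keep the minimizer, e.g.
# dic_by_C = {C: dic_score(devs) for C, devs in posterior_deviances.items()}
# C_opt = min(dic_by_C, key=dic_by_C.get)
```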
Common metrics for evaluating clustering methods
It is important to assess the assignment of each data point to the appropriate cluster: even if the number of clusters is correct, the performance of a clustering method may be poor because of incorrect cluster assignments. To evaluate clustering efficiency one can compute several external cluster evaluation metrics; in this study we compute purity, normalized mutual information (NMI), the Rand index (RI), the adjusted Rand index (ARI), the Jaccard index (JI) and the F-measure (Rand, 1971; Vinh, Epps and Bailey, 2010).

We first set up some notation to introduce these metrics briefly. Assume we have $N$ data points $\{X_j\}_{j=1}^N$. The set of $C$ true classes is $\mathcal{A}=\{A_1,A_2,\cdots,A_C\}$, where $A_c=\{j:X_j\text{ belongs to the }c\text{-th class}\}$ for $c=1,2,\cdots,C$, and the clustering method returns $K$ clusters $\mathcal{B}=\{B_1,B_2,\cdots,B_K\}$, where $B_k=\{j:X_j\text{ is assigned to the }k\text{-th cluster}\}$ for $k=1,2,\cdots,K$. We use $|\cdot|$ to denote the number of elements in a set.

• Purity is defined (see Rand (1971); Vinh, Epps and Bailey (2010)) as
$$\mathrm{purity}(\mathcal{A},\mathcal{B})=\frac{\sum_k\max_c|B_k\cap A_c|}{N}.$$
It is the simplest evaluation measure: each cluster is assigned to the class that is most prevalent in it, and the accuracy of the assignment is measured by counting the number of correctly assigned data points and dividing by the total number of data points. Purity lies between 0 and 1, and a perfect clustering has purity 1.

• NMI is an information-theoretic measure defined as
$$\mathrm{NMI}(\mathcal{A},\mathcal{B})=\frac{I(\mathcal{A},\mathcal{B})}{[H(\mathcal{A})+H(\mathcal{B})]/2},$$
where $I(\cdot,\cdot)$ and $H(\cdot)$ stand for mutual information and entropy, respectively, with
$$I(\mathcal{A},\mathcal{B})=\sum_k\sum_c\frac{|B_k\cap A_c|}{N}\log\frac{N|B_k\cap A_c|}{|B_k|\,|A_c|},\qquad H(\mathcal{B})=-\sum_k\frac{|B_k|}{N}\log\frac{|B_k|}{N},\qquad H(\mathcal{A})=-\sum_c\frac{|A_c|}{N}\log\frac{|A_c|}{N}.$$
NMI reaches its maximum value 1 only when the two partitions $\mathcal{A}$ and $\mathcal{B}$ are in perfect one-to-one correspondence.

• RI is written as
$$RI=\frac{TP+TN}{TP+FP+TN+FN},$$
where $TP$ is the number of true positives, $TN$ the number of true negatives, $FP$ the number of false positives, and $FN$ the number of false negatives. RI can be viewed as the percentage of correct decisions; note that false positives and false negatives are weighted equally here. The Rand index also lies between 0 and 1 and equals 1 when the clustering agrees perfectly with the true classes.

• ARI is a chance-corrected version of RI. A problem with RI is that its expected value between two random clusterings is not constant; ARI corrects this by assuming the generalized hypergeometric distribution as the model of randomness. The ARI has maximum value 1 and expected value 0 for random clusterings, and a larger ARI means higher agreement between two clusterings.

• JI is defined by
$$JI=\frac{TP}{TP+FP+FN}.$$
It is also known as the intersection-over-union measure of similarity between two sets; it takes values between 0 and 1, equal to 1 when the two sets are identical and 0 when they have no common elements.

• The F-measure is defined by
$$F_\beta=\frac{(\beta^2+1)\cdot Precision\cdot Recall}{\beta^2\,Precision+Recall},\qquad\text{where}\quad Precision=\frac{TP}{TP+FP}\quad\text{and}\quad Recall=\frac{TP}{TP+FN}.$$
The F-measure penalizes false negatives more strongly than false positives when $\beta>1$; on the other hand, when $\beta=0$, recall has no impact on the F-measure.

We summarize all the evaluation metrics for the two simulation scenarios in Tables 2 and 3. We observe that most of the metrics are close to the maximum possible value of 1, indicating the overall success of our clustering method.

Table 2: Clustering evaluation metrics when the true number of clusters equals three.

Metrics    PUR    RI     ARI    JI     NMI    F05    F1     F2     F5
Mean       0.984  0.979  0.952  0.938  0.923  0.968  0.968  0.968  0.968
Std. dev.  0.008  0.010  0.023  0.028  0.031  0.015  0.015  0.015  0.015
Table 3: Clustering evaluation metrics when the true number of clusters equals four.

Metrics    PUR    RI     ARI    JI     NMI    F05    F1     F2     F5
Mean       0.978  0.978  0.942  0.918  0.921  0.957  0.957  0.957  0.957
Std. dev.  0.008  0.008  0.020  0.027  0.024  0.015  0.015  0.015  0.015
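For reference, a minimal Python sketch of two of the simpler metrics (purity and the Rand index) computed directly from their definitions is given below; labels are assumed to be non-negative integers.

```python
import numpy as np
from itertools import combinations

def purity(true_labels, pred_labels):
    """Purity: each predicted cluster is credited with its most frequent true class."""
    true, pred = np.asarray(true_labels), np.asarray(pred_labels)
    return sum(np.bincount(true[pred == k]).max() for k in np.unique(pred)) / len(true)

def rand_index(true_labels, pred_labels):
    """Rand index: fraction of point pairs on which the two partitions agree."""
    true, pred = np.asarray(true_labels), np.asarray(pred_labels)
    agree = sum((true[i] == true[j]) == (pred[i] == pred[j])
                for i, j in combinations(range(len(true)), 2))
    return agree / (len(true) * (len(true) - 1) / 2)
```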
In order to evaluate the accuracy of the estimated cluster-specific parameters, denote the true parameter set for the $C$ classes by $\{F_1,F_2,\cdots,F_C\}$, where $F_c=M_cD_cV_c^T$. We compute
$$d_{MSE}=\sum_c\|\hat{F}_c-F_c\|_F,$$
where $\hat{F}_c$ is the estimate of the parameter matrix for the $c$-th cluster and $\|\cdot\|_F$ denotes the matrix Frobenius norm. Figure 3 plots the relative error (in percentage) in estimating the true parameter $F$; we observe that the procedure is efficient, with a maximum relative error below 4%. We show the simulation results for one particular dataset from simulations (i) and (ii), for both eigenvectors, in Figures 4 and 5, respectively.

Fig 3: Relative error (in percentage) for parameter $F$ in simulation (i) (left panel) and simulation (ii) (right panel).

As our Bayesian inference technique involves an MCMC sampling scheme, it is customary to check standard MCMC convergence and efficiency diagnostics (Cowles and Carlin, 1996). We investigate convergence by examining the MCMC cumulative average plot and the autocorrelation function (ACF) plot for one of the elements of the parameter $F$ for one of the clusters in the dataset. In Figure 6 we show both plots for simulation scenario (i). Based on the cumulative average (Figure 6(b)) we set the number of burn-in iterations to 800; the small values in the ACF plot (Figure 6(a)) indicate high efficiency of the parameter estimation based on the MCMC samples. These plots have very similar characteristics in all the other scenarios.

Fig 4: First and second eigenvectors using one of the datasets in simulation (i).

Fig 5: First and second eigenvectors using one of the datasets in simulation (ii).

Fig 6: Diagnostic plots of the MCMC for simulation (i): (a) ACF of $F$, (b) cumulative average of $F$.
6. Application
In this section we present two applications of our model to real data: the first is associated with medical image analysis, while the second concerns astronomical data. Each application is presented in a separate subsection below.
The human brain consists of more than 100 billion neurons and is arguably the most complex structure in our body (Basser and Jones, 2002; Mori and Zhang, 2006). Magnetic resonance imaging (MRI) is a powerful, noninvasive, three-dimensional imaging technique for characterizing the entire brain anatomy. Diffusion tensor imaging (DTI) is a relatively new MRI technique that helps to reconstruct the underlying 3D structure of axonal bundles in the brain. Using a technique called tractography, the data collected by DTI are used to group together the voxels that belong to the same white matter tract. This is used to investigate brain connectivity, for example cortex-white matter connectivity (Catani et al., 2002; Lazar and Alexander, 2005) or corticothalamic connectivity (Guy M Mckhann, 2004).

The DTI technique was introduced in the mid 1990s (Basser, Mattiello and LeBihan, 1994). The diffusion term refers to the translational motion of water molecules, and this motion is used as a probe to estimate the axonal organization of the brain: water molecules move relatively easily along the axonal bundles, compared to the direction perpendicular to these bundles, because there are fewer obstacles to movement along the fibers, which carry rich anatomical information about the white matter (Mori and Zhang, 2006). Fiber orientations are estimated from three independent diffusion measurements along the $x$, $y$ and $z$ axes. However, these three measurements are not enough, as the fiber orientation is not always aligned with one of these axes. To accurately construct an apparent diffusion coefficient map, in which the intensity of each voxel is proportional to the extent of diffusion, we would need to measure diffusion along many directions, which is difficult; the concept of the diffusion tensor (DT) was introduced to give a practical solution to this problem (Basser, Mattiello and LeBihan, 1994).

In this model, measurements along different axes (see Figure 7a) are fitted to a 3D ellipsoid, shown in Figure 7b, which represents the average diffusion distance in each direction (Mori and Zhang, 2006). The properties of a 3D ellipsoid can be defined by six parameters: three eigenvalues and the corresponding (mutually perpendicular) eigenvectors, which can be compactly represented by a $3\times 3$ symmetric matrix. The three eigenvalues ($\lambda_1$, $\lambda_2$ and $\lambda_3$) give the diffusivity in the direction of each eigenvector, denoted by $E_1$, $E_2$ and $E_3$ in Figure 7c.

Fig 7: A schematic representation of (a) fibers and (b) estimated ellipsoids with the corresponding (c) eigenvectors and eigenvalues.

To our knowledge, this is the first work in which DTI data are modeled with a mixture of ML distributions. We consider a final dataset, obtained after selecting the voxels in the white matter region of the brain, containing information from almost
63,000 voxels; our implementation is very efficient in handling this large amount of data. We model the diffusion tensors by elements of $\mathcal{V}_{n,p}$. Note that, for the scope of this project, we are only interested in the directions of the eigenvectors of the DT. Moreover, we only need to model $E_1$ and $E_2$, since the direction of $E_3$ is completely determined by the other two eigenvectors. We therefore have two orthonormal eigenvectors in three dimensions, i.e. a $3\times 2$ orthonormal matrix, an element of $\mathcal{V}_{3,2}$.

In practice, the Wishart distribution is commonly used to analyze the DT, a $3\times 3$ positive definite matrix, rather than a mixture of ML distributions. However, the parameters of the ML distribution are easier to interpret in terms of the direction of the data, so our Bayesian mixture model is relatively more flexible for handling DTI data with directional components, and our inference mechanism can handle a very large number of DTI observations across voxels. To the extent of our knowledge, this is the first paper that develops a framework to analyze DTI data when they are modeled as objects on $\mathcal{V}_{n,p}$.

Before presenting the results, we point out that our results could be improved by incorporating the eigenvalues along with the eigenvectors; however, that requires a more complicated statistical model, which we reserve for future work, as our main focus here is building an appropriate framework for analyzing DTI data. Nevertheless, we show in Section 6.1.2 that we find evidence of meaningful clusters by investigating only the directional part of the data.

The Philadelphia Neurodevelopmental Cohort (PNC) is a large-scale initiative to understand how genetics impact trajectories of brain development and cognitive functioning in adolescence, and how abnormal developmental trajectories are associated with psychiatric symptomatology (Satterthwaite et al., 2014). As part of the PNC, 1,445 children aged 8-21 received multi-modal neuroimaging along with a detailed cognitive and psychiatric assessment. The data are pre-processed with the comprehensive DTI data processing software library FSL (Woolrich et al., 2009). Important features of this dataset are that all imaging data were acquired at a single site, on a single scanner, in a short period of time that did not span any software or hardware upgrades. The quality of the DTI images was primarily assessed by visual inspection, and only rarely were artifacts noted in the DTI data (Satterthwaite et al., 2014).
We take one anonymous subject from this dataset, consisting of 62,667 measurements, and use a finite mixture of ML distributions to cluster this large dataset. We use the conditionally conjugate prior distributions defined in Equation 50 and select appropriate values of the hyperparameters using the empirical prior procedure developed in Section 4.2; the value of $K^\dagger$ is set to 100, as we expect a relatively large number of data points in each cluster. We use the procedure described in Section 4.6 to set the initial values of the parameters $M$, $\mathbf{d}$ and $V$ for the MCMC algorithm, and the first 1000 MCMC samples are discarded as burn-in. We fit the dataset with different numbers of clusters and choose 12 as the estimated number of clusters by the DIC criterion described in Section 4.5.

In Figure 8 we present the top three clusters, with their voxel locations mapped inside the anatomical structure of the brain (3D versions of these figures are also available). In this figure, panels (a), (b) and (c) represent the top, side and front views, respectively. It is important to notice that we are able to locate several important fiber structures in the dataset from this sample.

Fig 8: DTI clustering results for the three major clusters.

The Near Earth Object (NEO) population is defined as a group of small bodies with perihelion distance less than
1.3 AU and aphelion distance greater than 0.983 AU (Donnison, 2006). NEOs comprise NEAs (near-Earth asteroids) and NECs (near-Earth comets); NEAs are asteroids whose perihelion distance is less than 1.3 AU (https://cneos.jpl.nasa.gov/faq/). A detailed categorization of NEOs can also be found at https://cneos.jpl.nasa.gov/about/neo_groups.html. NECs are also called short-period (SP) comets, which are generally confined to direct orbits with an angle of inclination, with respect to a reference plane, of less than approximately 35°. The SP comets are in well-determined orbits with modest eccentricities and inclinations, which makes them a possible resource for space development (Lewis, Matthews and Guerrieri, 1993).

The NEC dataset was built by the Near Earth Object Program of the National Aeronautics and Space Administration (NASA). Each data point characterizes the orientation of a two-dimensional elliptical orbit in three-dimensional space and thus lies on the Stiefel manifold $\mathcal{V}_{3,2}$. For our experiment we downloaded the NEC dataset containing 175 entries. The orientation of an SP comet's orbit can be specified by the following quantities, whose definitions can be found at https://ssd.jpl.nasa.gov/?glossary:
• celestial longitude ($L$),
• latitude of the perihelion ($\theta$),
• longitude of the ascending node ($\Omega$).
The celestial longitude of the comet ($L$) (Hughes, 1985) and the latitude of the perihelion ($\theta$) (Yabushita, Hasegawa and Kobayashi, 1979) are computed by
$$L=\Omega+\tan^{-1}\!\left(\frac{\sin\omega\,\cos i}{\cos\omega}\right),\qquad \sin\theta=\sin i\,\sin\omega.$$
From the dataset we obtain the values of the orbital inclination ($i$), the longitude of the ascending node ($\Omega$) and the argument of periapsis (perihelion) ($\omega$), as shown in Figure 9. Using the appropriate transformations given in Jupp and Mardia (1979) and Yabushita, Hasegawa and Kobayashi (1979), we find $L$, $\theta$ and $\Omega$ for each comet. The direction of the perihelion is
$$\mathbf{x}_1=(\cos\theta\cos L,\ \cos\theta\sin L,\ \sin\theta),$$
and the directed unit normal to the orbit, given by the right-hand rule, is
$$\mathbf{x}_2=(\sin\theta\sin\Omega,\ -\sin\theta\cos\Omega,\ -\cos\theta\sin(\Omega-L))/r,\qquad r=\left\{\sin^2\theta+\cos^2\theta\,\sin^2(\Omega-L)\right\}^{1/2}.$$
The orientation of the orbit can therefore be represented by the matrix $X\in\mathcal{V}_{3,2}$ given by $X=[\mathbf{x}_1^T\ \mathbf{x}_2^T]$. An appropriate model for the distribution of these matrices is the ML family (Jupp and Mardia, 1979).

Fig 9: Near Earth comet's orbit and orbital elements (inclination $i$, longitude of the ascending node $\Omega$, argument of periapsis $\omega$).
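A minimal Python sketch of this mapping from orbital elements to a point on $\mathcal{V}_{3,2}$ is given below; the numerical element values in the usage line are purely illustrative, and a quadrant-safe arctangent is used for the celestial longitude.

```python
import numpy as np

def orbit_orientation(incl, Omega, omega):
    """Map orbital elements (inclination, longitude of ascending node, argument of
    perihelion; radians) to X = [x1 x2] in V_{3,2}, following the formulas above."""
    L = Omega + np.arctan2(np.sin(omega) * np.cos(incl), np.cos(omega))  # celestial longitude
    theta = np.arcsin(np.sin(incl) * np.sin(omega))                      # latitude of perihelion
    x1 = np.array([np.cos(theta) * np.cos(L), np.cos(theta) * np.sin(L), np.sin(theta)])
    r = np.sqrt(np.sin(theta)**2 + np.cos(theta)**2 * np.sin(Omega - L)**2)
    x2 = np.array([np.sin(theta) * np.sin(Omega),
                   -np.sin(theta) * np.cos(Omega),
                   -np.cos(theta) * np.sin(Omega - L)]) / r
    return np.column_stack([x1, x2])   # 3 x 2 matrix with orthonormal columns

X = orbit_orientation(np.radians(12.0), np.radians(75.0), np.radians(40.0))
```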
Here we model the NEC dataset as a finite mixture of ML distributions and run our model with the number of clusters equal to 3, 4, 5 and 6. In each case we use 2000 MCMC samples, of which the first 1000 iterations are discarded as burn-in; the number of burn-in iterations is chosen by observing the MCMC convergence diagnostic plots. We select appropriate values of the hyperparameters for the prior distributions in Equation 50 empirically, using the procedure developed in Section 4.2. Below we report the DIC values used for model selection: the DIC (shown in Table 4) is minimized at four clusters. Note that, based on the results reported in Lin, Rao and Dunson (2017) as well, four appears to be the most likely number of clusters.
Table 4: DIC values for the NEC dataset.

Number of clusters   DIC value
3                    3074.91
4                    2607.04
5                    2712.96
6                    2685.94
We compute the probability that any two NEC data points belong to the same cluster, for all pairs of NEC data; this is also called the cluster co-occurrence probability matrix (Hofmann and Puzicha, 1998). The corresponding heatmap is shown in Figure 10.

Finally, we plot each eigenvector of each data point in $\mathcal{V}_{3,2}$ on a sphere (Figures 11 and 12), using different colors (red, blue, green, black) to represent the four clusters. The NECs denoted by points of the same color form groups of comets with similar orbital characteristics.

Fig 10: Cluster co-occurrence probability matrix for the NEC dataset.
7. Discussions and Future directions
In this paper we build a Bayesian framework for a mixture of ML distributions that can be applied to real-world directional data. We construct two special families of distributions to be used as prior distributions, following the original conjugate prior construction in Diaconis and Ylvisaker (1979), and discuss several important properties of these prior classes. For the mixture model we derive the posterior and give guidance on the selection of hyperparameters, which should be helpful for practitioners. Finally, we are able to handle a large amount of DTI data in the real data application, and the results look quite promising.

As a future extension, instead of selecting the number of clusters by the DIC criterion, we would like the number of clusters itself to be a random variable; a fully Bayesian model-based approach assuming a parametric prior (e.g. Poisson) on the number of clusters could be employed. The next natural step in this direction is to extend the existing model to a non-parametric framework, which is more flexible for modeling and experimenting with different types of underlying clustering structure. Note that, although Lin, Rao and Dunson (2017) opened the door to such modeling, their model space differs from ours in various respects.

Fig 11: First eigenvector of each data point embedded in a sphere.

In a separate direction, we also plan to explore in depth the analytical properties of the hypergeometric function of matrix argument, ${}_0F_1(\cdot)$, for $p\geq$
2. Direct computation, as done in our case studies, could create bottlenecks for data of higher dimension. Analytical bounds could help either in approximation or in designing a good MCMC sampler. For example, one could borrow the importance sampling approach used for evaluating the normalizing constants in Mitra et al. (2013). This would primarily rely on the ability to simulate efficiently from ML distributions, which is already ensured by Hoff (2009); a naive Monte Carlo version of this idea is sketched at the end of this section. Along this line, it would also be worthwhile to rigorously study the theoretical properties, particularly the ergodicity, of the MCMC schemes.

Fig 12: Second eigenvector of a data point is embedded in a sphere.

Incorporating the topological properties of the sample space into state-of-the-art Bayesian methods is a rich area that has been initiated only recently by Bhattacharya and Dunson (2012) and Lin, Rao and Dunson (2017). We plan to continue along this direction and contribute to Bayesian methodological development on general analytic manifolds, which would be appropriate for analyzing large-scale data with complex structure.
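As a self-contained illustration of the simplest version of that strategy, the sketch below estimates the ML normalizing constant by plain Monte Carlo, using the uniform (Haar) distribution on the Stiefel manifold as the proposal; with the standard parametrization in which the ML density is proportional to etr(FᵀX), the constant is ₀F₁(n/2; FᵀF/4) = ∫ etr(FᵀX)[dX]. The function names are ours; a scheme in the spirit of Mitra et al. (2013) would instead use better-matched proposals (e.g., ML draws via the sampler of Hoff, 2009) to reduce the variance.

import numpy as np

def uniform_stiefel(n, p, rng):
    """Draw an n x p matrix with orthonormal columns uniformly (Haar) on the Stiefel manifold."""
    z = rng.standard_normal((n, p))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))   # sign fix makes the draw exactly Haar-distributed

def ml_normalizing_constant(F, num_samples=100000, seed=0):
    """Naive Monte Carlo estimate of E_uniform[etr(F'X)], the ML normalizing constant."""
    rng = np.random.default_rng(seed)
    n, p = F.shape
    vals = np.empty(num_samples)
    for s in range(num_samples):
        X = uniform_stiefel(n, p, rng)
        vals[s] = np.exp(np.trace(F.T @ X))
    return vals.mean()

This naive estimator is adequate for small p (such as the p = 2 case studies here) and moderate concentration, but its variance grows quickly with the magnitude of F, which is precisely where importance sampling with ML proposals, or analytical bounds, would pay off.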
References
Absil, P.-A., Mahony, R. and Sepulchre, R. (2009). Optimization Algorithms on Matrix Manifolds. Princeton University Press.
Anand, S., Mittal, S. and Meer, P. (2016). Robust estimation for computer vision using Grassmann manifolds. In Riemannian Computing in Computer Vision.
Bangert, M., Hennig, P. and Oelfke, U. (2010). Using an infinite von Mises-Fisher mixture model to cluster treatment beam directions in external radiation therapy. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on.
Basser, P. J. and Jones, D. K. (2002). Diffusion-tensor MRI: theory, experimental design and data analysis - a technical review. NMR in Biomedicine.
Basser, P. J., Mattiello, J. and LeBihan, D. (1994). MR diffusion tensor spectroscopy and imaging. Biophysical Journal.
Berg, A., Meyer, R. and Yu, J. (2004). Deviance information criterion for comparing stochastic volatility models. Journal of Business & Economic Statistics.
Bhatia, R. (2007). Positive Definite Matrices. Princeton University Press, Princeton and Oxford.
Bhattacharya, A. and Dunson, D. B. (2012). Strong consistency of nonparametric Bayes density estimation on compact metric spaces with applications to specific manifolds. Annals of the Institute of Statistical Mathematics.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Butler, R. W. and Wood, A. T. (2003). Laplace approximation for Bessel functions of matrix argument. Journal of Computational and Applied Mathematics.
Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician.
Casella, G. and Berger, R. L. (2002). Statistical Inference. Duxbury, Pacific Grove, CA.
Catani, M., Howard, R. J., Pajevic, S. and Jones, D. K. (2002). Virtual in vivo interactive dissection of white matter fasciculi in the human brain. NeuroImage.
Chikuse, Y. (1991a). High dimensional limit theorems and matrix decompositions on the Stiefel manifold. Journal of Multivariate Analysis.
Chikuse, Y. (1991b). Asymptotic expansions for distributions of the large sample matrix resultant and related statistics on the Stiefel manifold. Journal of Multivariate Analysis.
Chikuse, Y. (1998). Density estimation on the Stiefel manifold. Journal of Multivariate Analysis.
Chikuse, Y. (2012). Statistics on Special Manifolds. Springer Science & Business Media.
Conway, J. (1990). A Course in Functional Analysis.
Cowles, M. K. and Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association.
DeIorio, M. and Robert, C. P. (2002). Discussion of Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological).
Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. The Annals of Statistics.
Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society. Series B (Methodological).
Donnison, J. (2006). Some aspects of the statistics of Near-Earth Objects. Proceedings of the International Astronomical Union.
Downs, T. D. (1972). Orientation statistics. Biometrika.
Edelman, A., Arias, T. A. and Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications.
François, O. and Laval, G. (2011). Deviance information criteria for model selection in approximate Bayesian computation. arXiv preprint arXiv:1105.0269.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003). Bayesian Data Analysis, Second Edition (Chapman & Hall/CRC Texts in Statistical Science), 2nd ed. Chapman and Hall/CRC.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2014). Bayesian Data Analysis. CRC Press, Boca Raton, FL.
Gross, K. I. and Richards, D. S. P. (1987). Special functions of matrix argument. I. Algebraic induction, zonal polynomials, and hypergeometric functions. Transactions of the American Mathematical Society.
Gross, K. I. and Richards, D. S. P. (1989). Total positivity, spherical series, and hypergeometric functions of matrix argument. Journal of Approximation Theory.
Gupta, R. D. and Richards, D. S. P. (1985). Hypergeometric functions of scalar matrix argument are expressible in terms of classical hypergeometric functions. SIAM Journal on Mathematical Analysis.
Guy M Mckhann, I. (2004). Non-invasive mapping of connections between human thalamus and cortex using diffusion imaging. Neurosurgery.
Hardy, G. H., Littlewood, J. E. and Pólya, G. (1952). Inequalities. Cambridge University Press.
Herz, C. S. (1955). Bessel functions of matrix argument. Annals of Mathematics.
Hill, R. D. and Waters, S. R. (1987). On the cone of positive semidefinite matrices. Linear Algebra and its Applications.
Hoff, P. D. (2009). Simulation of the matrix Bingham-von Mises-Fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics.
Hofmann, T. and Puzicha, J. (1998). Statistical models for co-occurrence data.
Hornik, K. and Grün, B. (2013). On conjugate families and Jeffreys priors for von Mises-Fisher distributions. Journal of Statistical Planning and Inference.
Hornik, K. and Grün, B. (2014). movMF: An R package for fitting mixtures of von Mises-Fisher distributions. Journal of Statistical Software.
Hughes, D. W. (1985). The position of Earth at previous apparitions of Halley's comet. Quarterly Journal of the Royal Astronomical Society.
James, A. T. (1964). Distributions of matrix variates and latent roots derived from normal samples. The Annals of Mathematical Statistics.
James, I. M. (1976). The Topology of Stiefel Manifolds. Cambridge University Press.
Jupp, P. E. and Mardia, K. V. (1979). Maximum likelihood estimators for the matrix von Mises-Fisher and Bingham distributions. The Annals of Statistics.
Jupp, P. and Mardia, K. (1980). A general correlation coefficient for directional data and related regression problems. Biometrika.
KaewTraKulPong, P. and Bowden, R. (2002). An improved adaptive background mixture model for real-time tracking with shadow detection. Video-Based Surveillance Systems.
Khare, K., Pal, S. and Su, Z. (2017). A Bayesian approach for envelope models. The Annals of Statistics.
Khatri, C. and Mardia, K. (1977). The von Mises-Fisher matrix distribution in orientation statistics. Journal of the Royal Statistical Society. Series B (Methodological).
Koev, P. and Edelman, A. (2006). The efficient evaluation of the hypergeometric function of a matrix argument. Mathematics of Computation.
Kristof, W. (1969). A theorem on the trace of certain matrix products and some applications. ETS Research Report Series.
Lattin, J. M., Carroll, J. D. and Green, P. E. (2003). Analyzing Multivariate Data. Thomson Brooks/Cole, Pacific Grove, CA.
Lazar, M. and Alexander, A. L. (2005). Bootstrap white matter tractography (BOOT-TRAC). NeuroImage.
Lewicki, M. S. (1998). A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems, R53–R78.
Lewis, J. S., Matthews, M. S. and Guerrieri, M. L. (1993). Resources of near-Earth space. Resources of Near-Earth Space.
Lin, L., Rao, V. and Dunson, D. (2017). Bayesian nonparametric inference on the Stiefel manifold. Statistica Sinica.
Lui, Y. M. (2012). Advances in matrix manifolds for computer vision. Image and Vision Computing.
Lui, Y. and Beveridge, J. (2008). Grassmann registration manifolds for face recognition. Computer Vision - ECCV 2008.
Mardia, K. V. and Jupp, P. E. (2009). Directional Statistics. John Wiley & Sons.
Mardia, K. and Khatri, C. (1977). Uniform distribution on a Stiefel manifold. Journal of Multivariate Analysis.
Mardia, K. V., Taylor, C. C. and Subramaniam, G. K. (2007). Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data. Biometrics.
McGraw, T., Vemuri, B., Yezierski, R. and Mareci, T. (2006). Segmentation of high angular resolution diffusion MRI modeled as a field of von Mises-Fisher mixtures. In European Conference on Computer Vision.
McKenna, S. J., Raja, Y. and Gong, S. (1999). Tracking colour objects using adaptive mixture models. Image and Vision Computing.
McLachlan, G. and Peel, D. (2004). Finite Mixture Models. John Wiley & Sons.
Mitra, R., Müller, P., Liang, S., Yue, L. and Ji, Y. (2013). A Bayesian graphical model for ChIP-Seq data on histone modifications. Journal of the American Statistical Association.
Mori, S. and Zhang, J. (2006). Principles of diffusion tensor imaging and its applications to basic neuroscience research. Neuron.
Muirhead, R. J. (1975). Expressions for some hypergeometric functions of matrix argument with applications. Journal of Multivariate Analysis.
Muirhead, R. J. (2009). Aspects of Multivariate Statistical Theory. John Wiley & Sons.
Picard, F. (2007). An introduction to mixture models. Statistics for Systems Biology, Research Report.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association.
Reisinger, J., Waters, A., Silverthorn, B. and Mooney, R. J. (2010). Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10).
Robbins, H. (1985). An empirical Bayes approach to statistics. In Herbert Robbins Selected Papers.
Rokach, L. and Maimon, O. (2005). The Data Mining and Knowledge Discovery Handbook: A Complete Guide for Researchers and Practitioners.
Rudin, W. et al. (1964). Principles of Mathematical Analysis. McGraw-Hill, New York.
Satterthwaite, T. D., Elliott, M. A., Ruparel, K., Loughead, J., Prabhakaran, K., Calkins, M. E., Hopson, R., Jackson, C., Keefe, J., Riley, M. et al. (2014). Neuroimaging of the Philadelphia neurodevelopmental cohort. NeuroImage.
Schwartz, L. (1965). On Bayes procedures. Probability Theory and Related Fields.
Schwartzman, A. (2006). Random ellipsoids and false discovery rates: Statistics for diffusion tensor imaging data. PhD thesis, Stanford University.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Stauffer, C. and Grimson, W. E. L. (1999). Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on.
Tang, H., Chu, S. M. and Huang, T. S. (2009). Generative model-based speaker clustering via mixture of von Mises-Fisher distributions. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on.
Titterington, D. M., Robert, C. P., Forbes, F. and Celeux, G. (2006). Deviance information criteria for missing data models. Bayesian Analysis.
Turaga, P., Veeraraghavan, A. and Chellappa, R. (2008). Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.
Turaga, P., Veeraraghavan, A., Srivastava, A. and Chellappa, R. (2011). Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Vinh, N. X., Epps, J. and Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research.
Wood, E., Fellows, M., Donoghue, J. and Black, M. (2004). Automatic spike sorting for neural decoding. In Engineering in Medicine and Biology Society, 2004. IEMBS'04. 26th Annual International Conference of the IEEE.
Woolrich, M. W., Jbabdi, S., Patenaude, B., Chappell, M., Makni, S., Behrens, T., Beckmann, C., Jenkinson, M. and Smith, S. M. (2009). Bayesian analysis of neuroimaging data in FSL. NeuroImage, S173–S186.
Wright, S. J. and Nocedal, J. (1999). Numerical Optimization. Springer Science.
Yabushita, S., Hasegawa, I. and Kobayashi, K. (1979). The distributions of inclination and perihelion latitude of long-period comets and their dynamical implications. Publications of the Astronomical Society of Japan.
Zeng, X., Bian, W., Liu, W., Shen, J. and Tao, D. (2015). Dictionary pair learning on Grassmann manifolds for image denoising. IEEE Transactions on Image Processing.