A Bayesian Mixture Model for Clustering on the Stiefel Manifold
Subhajit Sengupta, Subhadip Pal, Riten Mitra, Ying Guo, Arunava Banerjee, Yuan Ji
aarXiv: arXiv:0000.0000
A Bayesian Mixture Model forClustering on the Stiefel Manifold
Subhajit Sengupta ∗ , ‡ , Subhadip Pal ∗ , § Riten Mitra § , Ying Guo ¶ ,Arunava Banerjee † , (cid:107) and Yuan Ji † , ‡ , ∗∗ Program for Computational Genomics and Medicine, NorthShore University HealthSystem ‡ Department of Bioinformatics and Biostatistics , University of Louisville § Department of Biostatistics and Bioinformatics, Emory University ¶ Department of Computer & Information Science & Engineering, University of Florida (cid:107)
Department of Public Health Sciences, The University of Chicago ∗∗ E-mails of the corresponding authors [email protected] ; [email protected] Abstract:
Analysis of a Bayesian mixture model for the Matrix Langevindistribution on the Stiefel manifold is presented. The model exploits a par-ticular parametrization of the Matrix Langevin distribution, various aspectsof which are elaborated on. A general, and novel, family of conjugate pri-ors, and an efficient Markov chain Monte Carlo (MCMC) sampling schemefor the corresponding posteriors is then developed for the mixture model.Theoretical properties of the prior and posterior distributions, includingposterior consistency, are explored in detail. Extensive simulation experi-ments are presented to validate the efficacy of the framework. Real-worldexamples, including a large scale neuroimaging dataset, are analyzed todemonstrate the computational tractability of the approach.
Keywords and phrases:
Matrix Langevin Mixture model, Mixture model,Orthonormal vectors, Parametric model, Stiefel manifold.
1. Introduction
Analysis of directional data comprises a major sub-field of study in Statistics.Directional data range from unit vectors in the simplest examples, to sets ofordered orthonormal frames in the general case. Since the associated samplespace is not the Euclidean space, standard statistical methods developed for theEuclidean space for the analysis of univariate or multivariate data cannot beeasily adapted for directional data. For example, it is often desirable to accountfor the geometric structure underlying the sample space in statistical inference.Beyond those fashioned for simpler non-Euclidean spaces like the circle or thesphere, there is a pressing need for methodology development for general sam-ple spaces such as the Stiefel or the Grassmann manifold to support modern ∗ Equal contribution † Corresponding author 1 a r X i v : . [ s t a t . M E ] A ug engupta et al./BMMCSM applications, increasingly seen in the fields of computer vision (Turaga, Veer-araghavan and Chellappa, 2008; Turaga et al., 2011; Anand, Mittal and Meer,2016; Lui and Beveridge, 2008; Zeng et al., 2015), medical image analysis (Lui,2012), astronomy (Mardia and Jupp, 2009; Lin, Rao and Dunson, 2017), and,biology (Downs, 1972; Mardia and Khatri, 1977), to name but a few. In thisarticle, we present a framework for Bayesian inference of a mixture model onthe Stiefel manifold (James, 1976; Chikuse, 2012) that remains computationallytractable even at large data sizes. With ever-growing computational power, weargue that it is now feasible to apply Bayesian methods to real world large anddirectional data.One of the most commonly used distributions on the Stiefel manifold is an ex-ponential family distribution known as the Matrix Langevin ( ML ) or the Von-Mises Fisher matrix distribution (Mardia and Jupp, 2009; Khatri and Mardia,1977), introduced first by Downs (1972). In early work Mardia and Khatri (1977)and Jupp and Mardia (1980) studied the properties of the maximum likelihoodestimators for this distribution in the classical setting. In large measure, subse-quent efforts at exploring the ML distribution (Chikuse, 1991a,b, 1998) werelimited to asymptotic results on distributional or inferential problems. More re-cently, Hoff (2009) has developed a rejection sampling based method to samplefrom a matrix Bingham-Von Mises-Fisher distribution on the Stiefel manifold.To date, Bayesian analysis on these general sample spaces have been very lim-ited. A major obstacle for the development of efficient inference techniques forthis family of distributions has been the intractability of the corresponding nor-malizing constant, a hypergeometric function of matrix argument.The article that is most aligned to our overall objective is Lin, Rao and Dunson(2017), where the authors have developed a rejection sampling based data aug-mentation strategy for Bayesian inference with the mixture of ML distribution.However, it is well known that sampling techniques based on a data augmen-tation strategy often suffer from slow rates of convergence. With the additionaldetrimental impact of the rejection ratio, convergence can become painfullyslow. Applicability of their MCMC technique is therefore limited, particularlyin terms of scalability to large datasets.Our contribution begins with an exploration of the properties of the ML dis-tribution, followed by the construction of a family of conjugate priors for ML distribution, which we then analyze in considerable detail. 
In the context ofthe natural exponential family, Diaconis and Ylvisaker (Diaconis and Ylvisaker,1979) laid the foundations for constructing conjugate prior distributions (theDY class) for natural exponential family models. In our case, however, the DYconstruction can not be directly applied, and we therefore derive a modifiedconstruction. The resultant prior is flexible in the sense that one can incor-porate information from data via appropriate hyperparameter selection, andfurthermore, there is the provision to set the hyperparameters in the absence ofany prior knowledge to a weakly non-informative prior. For the latter, the priormight become improper in which case we adopt a constrained mixture model.Using this novel prior we implement a scalable posterior inference scheme by de-signing an efficient Gibbs sampler. We note in passing that in the expression for engupta et al./BMMCSM the posterior, the presence of F ( · , · ) in the denominator make the inferenceprocedure challenging. We also explore the weak and strong posterior consis-tency under the new class of priors. Finally, we extend the proposed frameworkfor a single ML distribution to a finite mixture of ML ’s.To identify the optimum number of clusters, often times deviance informationcriterion (DIC) has been used in the literature (Gelman et al., 2003; Spiegelhal-ter et al., 2002). However, several studies have pointed to the weakness of thestandard DIC measure in mixture models and have proposed alternatives. Weperform extensive simulations to identify alternative schemes to computing DICthat would work best for a mixture of ML distributions. In order to demonstratethe scalability of our inference scheme, we then analyze a large-scale DTMRIdataset. Real datasets that have been analyzed in the literature come from as-tronomy (near-earth objects) or vectorcardiography. In both cases the data isdrawn from a matrix valued manifold where each element is a collection of twoorthonormal vectors in R . Realizing that most of the existing applications relyon an efficient computation of the matrix hypergeometric function on a 2 × ML distribution defined on the Stiefel manifold ( V n,p ) and explore its the-oretical properties as well as properties of the corresponding hypergeometricconstant. In Section 3, we present the construction of the conjugate prior andthe posterior for a single ML distribution, properties of which are then analyzedin considerable detail. Generalization to a finite mixture model and inferenceare presented in Section 4, as well as extended theoretical properties such asthe weak and strong posterior consistency. Extensive simulation studies are pre-sented and summarized in Section 5. In Section 6, we provide experimentalresults from two real-world datasets. Conclusions and future work in presentedin Section 7. Notational Convention • R k = The k -dimensional real space. • S p = (cid:8) ( d , . . . , d p ) ∈ R p + : 0 < d p < · · · < d < ∞ (cid:9) . • R n × p = Space of all n × p real-valued matrices. • V n,p = Stiefel Manifold. • (cid:101) V n,p = { X ∈ V n,p : X ,j > ∀ j = 1 , , · · · , p } . • V p,p = O ( p ) = Space of Orthogonal matrices. • Υ( · ) = Product measure defined on V n,p × R p + × V p,p . engupta et al./BMMCSM • I p = p × p identity matrix. • f ( · ; · ) = Probability density function. • g ( · ; · ) = Unnormalized version of the probability density function. • tr ( A ) = Trace of a square matrix A. • etr ( A ) = Exponential of tr ( A ). • E ( X ) = Expectation of the random variable X . 
• I ( · ) = Indicator function. • We use d and D interchangeably. D is the diagonal matrix with diagonal d . We use matrix notation D in the place of d wherever needed, and vector d otherwise. • (cid:107)·(cid:107) = Matrix operator norm. ML distribution on the Stiefel manifold ( V n,p ) The Stiefel manifold, V n,p is the space of all p ordered orthonormal vectors (alsoknown as p -frames) in R n and is defined as V n,p = { X ∈ R n × p : X T X = I p } , where R n × p is the space of all n × p real-valued matrices and I p is the p × p identity matrix (Mardia and Jupp, 2009; Absil, Mahony and Sepulchre, 2009;Chikuse, 2012; Edelman, Arias and Smith, 1998; Downs, 1972). V n,p is a compactRiemannian manifold of dimension np − p ( p + 1) /
2. For p = 1, V n,p is the ( n − S n − and for p = n , V n,p = O ( p ), the orthogonal group consistingall orthogonal p × p real-valued matrices, with the group operation being matrixmultiplication. V n,p may be embedded in the np -dimensional Euclidean spaceof n × p real-valued matrices with the inclusion map as a natural embedding,and is thus a submanifold of R np . Since V n,p is an embedded submanifold of R n × p , its topology is the subset topology induced by R n × p (Absil, Mahony andSepulchre, 2009; Edelman, Arias and Smith, 1998).The differential form ( H T dH ) = (cid:86) pi =1 (cid:86) nj = i +1 h Tj dh i where H ∈ V n,p , is in-variant under the transforms H → QH and H → H P where Q ∈ V n,n and P ∈ V p,p , respectively. This defines an invariant measure on V n,p . The sur-face area or volume of V n,p is V ol ( V n,p ) := (cid:82) V n,p ( H T dH ) = 2 p ( √ π ) np / Γ p ( n/ p ( · ) is the multivariate Gamma function (page 70 in Muirhead (2009)).The measure defined in this manner is called the invariant unnormalized or theHaar measure. This measure can be normalized to a probability measure bysetting (cid:82) V n,p [ dH ] = 1 where [ dH ] = ( H T dH ) /V ol ( V n,p ). Uniform distributionon V n,p is denoted by [ dH ] and is the unique probability measure which is in-variant under rotations and reflections. For detail description of construction ofthe Haar measure on V n,p and its properties please refer to Muirhead (2009). ML distribution Mardia and Jupp (2009) is a widely used non-uniform dis-tribution on V n,p (Khatri and Mardia, 1977; Mardia and Jupp, 2009; Chikuse,2012; Lin, Rao and Dunson, 2017). This distribution is also known as Von Mises-Fisher Matrix Distribution (Khatri and Mardia, 1977). The density function of engupta et al./BMMCSM the ML distribution with respect to the normalized Haar measure [ dX ] andparametrized by F ∈ R n × p , defined in Chikuse (2012), is given by f ML ( X ; F ) = etr ( F T X ) F (cid:16) n , F T F (cid:17) , (1)where etr ( Z ) = exp( trace ( Z ) for any square matrix Z and the normalizingconstant, F ( n/ , F T F / n × p parameter matrix F = M DV T where M ∈ (cid:101) V n,p , V ∈ V p,p and the diag-onal entries of D , d = ( d , d , · · · , d p ) ∈ S p where 0 < d p < · · · < d < d < ∞ (Chikuse, 2012). See Notation for definitions of (cid:101) V n,p , V p,p and S p . Here, (cid:101) V n,p denotes the a subspace of V n,p consisting of matrices in V n,p whose elements ofthe first row of are positive. Note that, being a closed subspace of V n,p , (cid:101) V n,p isalso a compact space.Plugging in the SVD form of F , we rewrite the ML density function as f ML ( X ; ( M, d , V )) = etr ( V DM T X )) F ( n , D ) I ( M ∈ (cid:101) V n,p , d ∈ S p , V ∈ V p,p ) . This parametrization ensures identifiability of all the parameters ( M , d and V ).For notational convenience we omit the indicator function part and use thefollowing form of the ML density for rest of the article f ML ( X ; ( M, d , V )) = etr ( V DM T X )) F ( n , D ) , (2)with respect to the normalized Haar measure [ dX ] (Muirhead, 2009). From Kha-tri and Mardia (1977) (page 96) note that the normalizing constant can besimplified as follows – F (cid:18) n , F T F (cid:19) = F (cid:18) n , D (cid:19) . Thus F ( · ) only depends on the eigenvalues of the matrix F T F , which are thediagonal elements of the matrix D . 
The parametrization with M, D and V en-ables us to represent the intractable hypergeometric function of matrix argumentas a function of vector d , diagonal entries of D , paving a path for an efficientposterior inference. This makes posterior inference computationally tractable.Note that an alternative parametrization through polar decomposition with M and K (Mardia and Jupp, 2009) may pose computational challenges since theelliptical part K lies on a positive semi-definite cone and inference on positivesemi-definite cone is not that straightforward (Hill and Waters, 1987; Bhatia, engupta et al./BMMCSM M, D and V parameters basedrepresentation for ML distribution for most part of our theory.In the following subsection we study a few important properties of the hyper-geometric function of matrix argument F (cid:0) n/ , D / (cid:1) , which are required forsubsequent sections. F (cid:16) n , D (cid:17) Lemma 1.
For any p × p diagonal matrix D with positive elements, F (cid:16) n , D (cid:17) ≤ etr ( D )) when n ≥ p . Proof of Lemma 1.
From Equation 2, we have (cid:90) V n,p f ML ( X ; ( M, d , V )) [ dX ] = 1= ⇒ (cid:90) V n,p etr ( V DM T X )) F ( n , D ) [ dX ] = 1= ⇒ F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( V DM T X )) [ dX ] . (3)We know that f ML ( X ; ( M, d , V )) has the unique modal orientation M V T (page32 in Chikuse (2012)). Hence it follows from Equation 3 that F (cid:18) n , D (cid:19) ≤ (cid:90) V n,p etr ( V DM T M V T )) [ dX ]= etr ( D )) (cid:90) V n,p [ dX ] = etr ( D )) , (4)where [ dX ] is the normalized Haar measure on V n,p . (cid:3) Lemma 2.
Let A be a n × p real matrix with n ≥ p . If (cid:107) A (cid:107) ≤ δ ( < δ ) for some δ > then | A j,j | ≤ δ ( < δ ) for j = 1 , .., p . Here A j,j denotes the ( j, j ) -th entryof the matrix A and (cid:107) A (cid:107) is the spectral norm of the matrix A . Proof of Lemma 2.
From the assumptions of the Lemma 2 along with the definition of the spectralnorm, it follows that l T A T A l ≤ δ ( < δ ) for all l ∈ R p with l T l = 1. Inparticular, e Tj A T A e j ≤ δ ( < δ ) where e j ∈ R p such that its j -th entry equals1 while rest of its entries are 0. Hence we have that (cid:80) nk =1 A k,j ≤ δ ( < δ )implying the fact that | A j,j | ≤ δ ( < δ ) . (cid:3) engupta et al./BMMCSM Lemma 3.
Let D be a p × p diagonal matrix with positive diagonal elements d = { d , d , · · · , d p } . Then for any δ > and n ≥ p , there exists a positiveconstant, K n,p,δ , depending on n, p and δ , such that F (cid:18) n , D (cid:19) > K n,p,δ etr ((1 − δ ) D ) . Proof of Lemma 3.
Note that, D is a p × p diagonal matrix with positive diagonal elements d , .., d p .For the case n ≥ p , define (cid:102) M = (cid:20) I p n − p,p (cid:21) , (cid:101) V = I p and I (cid:63) := (cid:20) I p n − p,p (cid:21) , (5)where I p denotes the p × p identity matrix and n − p,p represents the zero matrixof dimension ( n − p ) × p . For arbitrary given positive constant δ >
0, consider B δ := { X ∈ V n,p , such that (cid:107) X − I (cid:63) (cid:107) < δ } , where (cid:107)·(cid:107) denotes the spectral norm of a matrix. Let µ denotes the normalizedHaar measure on the V n,p . Clearly, 0 < µ ( B δ ) < ∞ , as B δ is a non-empty opensubset of V n,p . Now from Equation 2 we have, F (cid:18) n , D (cid:19) = (cid:90) V n,p etr (cid:16) (cid:101) V D (cid:102) M T X (cid:17) dµ ( X ) . ≥ (cid:90) B δ etr (cid:16) (cid:101) V D (cid:102) M T X (cid:17) dµ ( X ) . (6)Using Lemma 2 we know that X j,j > (1 − δ ) for j = 1 , , . . . , p where X ∈ B δ . Note that, X j,j denotes the ( j, j )-th entry of the matrix X . Hence fromEquation 5 and 6 it follows that, F (cid:18) n , D (cid:19) ≥ (cid:90) B δ exp p (cid:88) j =1 X j,j d j dµ ( X ) ,> µ ( B δ ) etr ((1 − δ ) D ) , (7)where the last inequality uses the fact that d j > j = 1 , . . . p. Finally wedenote K n,p,δ := µ ( B δ ) > n, p along with δ , to conclude that F (cid:18) n , D (cid:19) > K n,p,δ etr ((1 − δ ) D ) . (cid:3) Lemma 4.
For any p × p diagonal matrix D with positive elements d ∈ S p ,the hypergeometric function of matrix argument denoted by F (cid:16) n , D (cid:17) is log-convex with respect to d where n ≥ p . engupta et al./BMMCSM Proof of Lemma 4.
From Equation 2, we have F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( V DM T X ) [ dX ] , (8)for arbitrary M ∈ (cid:101) V n,p and V ∈ V n,p where n ≥ p . Without loss of generality,we can take M = (cid:102) M = (cid:20) I p ( n − p ) ,p (cid:21) and V = I p .Let D and D be two p × p diagonal matrix with positive diagonal entries d and d , respectively and d (cid:54) = d . From Equation 8, we have F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( D (cid:102) M T X ) [ dX ] F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( D (cid:102) M T X ) [ dX ] . (9)Let λ ∈ [0 ,
1] be any real number. We have F (cid:32) n , ( λ D + (1 − λ ) D ) (cid:33) = (cid:90) V n,p etr (( λ D + (1 − λ ) D ) ˜ M T X ) [ dX ]= (cid:90) V n,p (cid:16) etr ( D ˜ M T X ) (cid:17) λ (cid:16) etr ( D ˜ M T X ) (cid:17) − λ [ dX ] < (cid:32)(cid:90) V n,p etr ( D ˜ M T X ) [ dX ] (cid:33) λ (cid:32)(cid:90) V n,p etr ( D ˜ M T X ) [ dX ] (cid:33) − λ = (cid:18) F (cid:18) n , D (cid:19)(cid:19) λ (cid:18) F (cid:18) n , D (cid:19)(cid:19) − λ . (10)Note that the inequality is due to H¨older (Hardy, Littlewood and P´olya, 1952)and note that in this case d (cid:54) = d . Therefore from Equation 10 we have,log F (cid:18) n , (cid:18) λ D − λ ) D (cid:19)(cid:19) < λ log F (cid:18) n , D (cid:19) +(1 − λ ) log F (cid:18) n , D (cid:19) . (11)Hence log F (cid:16) n , D (cid:17) is a convex function or equivalently F (cid:16) n , D (cid:17) is alog-convex function of the diagonal entries d of matrix D . (cid:3) engupta et al./BMMCSM Lemma 5.
For any p × p ( p ≥ ) diagonal matrix D with positive elements d ∈ S p , then for i = 1 , , · · · , p we have < ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) < F (cid:18) n , D (cid:19) where n ≥ p . Proof of Lemma 5.Right hand side inequality:
Proceeding similar way as Lemma 4 we have F (cid:18) n , D (cid:19) = (cid:90) V n,p etr ( D (cid:102) M T X ) [ dX ] , where (cid:102) M = (cid:20) I p ( n − p ) ,p (cid:21) . (12)From Equation 12, we have F (cid:18) n , D (cid:19) = (cid:90) V n,p exp p (cid:88) j =1 d j X j,j [ dX ] (13)Consider the set V := { X ∈ V n,p : X i,i = 1 } . Note that V is isomorphic tothe lower dimensional Stiefel manifold, V n,p − . V , being a lower dimensionalsubspace of V n,p , has measure zero i.e. (cid:82) V n,p I ( X ∈ V )[ dX ] = 0, where I ( X ∈ V )is the indicator function for X to be in the set V . From Equation 13, we have F (cid:18) n , D (cid:19) = (cid:90) V n,p exp p (cid:88) j =1 d j X j,j I ( X ∈ V c ) [ dX ] , (14)where V c is the complement of V . Hence, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) = (cid:90) V n,p X i,i I ( X ∈ V c ) exp p (cid:88) j =1 d j X j,j [ dX ] . (15)Observe that, (cid:107) X (cid:107) = 1 on V n,p . Hence from Lemma 2 we have | X i,i | ≤
1. Also, X i,i (cid:54) = 1 when X ∈ V c . As a result, we conclude that X i,i < V n,p ∩ V c .Subsequently, it follows from Equations 14 and 15 that, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) < (cid:90) V n,p exp p (cid:88) j =1 d j X j,j I ( X ∈ V c ) [ dX ]= F (cid:18) n , D (cid:19) . (16) engupta et al./BMMCSM Left hand side inequality:
Consider V i, + n,p := { X ∈ V n,p : X i,i > } , V i, − n,p := { X ∈ V n,p : X i,i < } and V i, n,p := { X ∈ V n,p : X i,i = 0 } . Clearly, V i, + n,p , V i, n,p and V i, − n,p forms a partition of V n,p . Hence from equation 13 we have, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) = (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ] + (cid:90) V i, n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ]+ (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ]= (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ] + (cid:90) V i, − n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ] . (17)Let Γ be the n × n diagonal matrix such that Γ j,j = 1 for j = 1 , . . . , n, j (cid:54) = i and Γ i,i = −
1. Γ is an orthogonal matrix as Γ T Γ = I n . It is easy to show that V i, + n,p = (cid:8) Γ X : X ∈ V i, − n,p (cid:9) .Consider the change of variable Y := Γ X . Using standard algebra we can showthat X i,i = − Y i,i and X j,j = Y j,j for j = 1 , . . . p, j (cid:54) = i . As the normalizedHaar measure on V n,p is invariant under orthogonal transformation from Lefti.e. [ dX ] = [ dY ] Chikuse (2012), we get that (cid:90) V i, − n,p X i,i exp p (cid:88) j =1 d j X j,j [ dX ] = − (cid:90) V i, + n,p Y i,i exp − d i Y i,i + p (cid:88) j =1 ,j (cid:54) = i d j Y j,j [ dY ]= − (cid:90) V i, + n,p X i,i exp − d i X i,i + p (cid:88) j =1 ,j (cid:54) = i d j X j,j [ dX ] . (18)From Equations 17 and 18 we have, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) = (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 ,j (cid:54) = i d j X j,j (cid:32) exp ( d i X i,i ) − exp ( − d i X i,i ) (cid:33) [ dX ]= (cid:90) V i, + n,p X i,i exp p (cid:88) j =1 ,j (cid:54) = i d j X j,j d i X i,i ) [ dX ] (19) engupta et al./BMMCSM where sinh is the hyperbolic sin function. Note that sinh ( d i X i,i ) > d i > X i,i > V i, + n,p . Hence from Equation 19 it follows that, ∂∂ d i (cid:20) F (cid:18) n , D (cid:19)(cid:21) > . (20)From Equations 16 and 20, we have the result. (cid:3) All five lemmas will be used for a theoretical development of a conjugate priorfamily for ML distributions, which we discuss next.
3. Bayesian framework for ML distribution In this section we develop a comprehensive Bayesian framework related to ML distribution. We construct a novel class of conjugate priors and study theirproperties. We also derive the posterior form and comment on hyperparametersettings. In the context of the exponential family of distributions, Diaconis and Ylvisaker(1979) (DY) provides a standard procedure to obtain a class of conjugate priorswhen the distribution is represented through natural parametrization Casellaand Berger (2002). But we realize that for the ML distribution DY theoremcould not be applied directly. We postpone the discussion on the DY theory laterin Section 3.4 since a direct application of their construction is not possible.Instead, we propose two different conjugate priors next aiming for scalable andflexible posterior inference.In this context, we would also like to mention that the construction of the classof priors in Hornik and Gr¨un (2013) is based on the direct application of DY,which is also not quite appropriate for ML distribution. The idea of construct-ing a conjugate prior on the natural parameter F and using a transformationafterwards involves calculation of complicated Jacobean term Hornik and Gr¨un(2013). Hence the corresponding class of prior obtained by this transformationwould lack the interpretation of the corresponding hyperparameters. As the DYtheorem is not directly applicable, an appropriate modification is required inorder to use with ML distribution (see details in Section 3.4). In this sectionwe construct a new class of conjugate prior for ML density. We then show thatthe hyperparameters of the constructed class of priors are easily interpretablefrom practitioners point of view. We further extend our investigation to studyproperties that are essential for the hyperparameter selection and posterior in-ference. In the following paragraphs we design both joint and independent priorstructures for the parameters of the ML distribution. engupta et al./BMMCSM Definition 1.
The probability density function of the joint conjugate prior withrespect to the appropriate product measure Υ on V n,p × R p + × V p,p on the param-eters M, D and V for ML distribution is proportional to g ( M, d , V ; ν, Ψ) = etr (cid:0) ν V DM T Ψ (cid:1)(cid:2) F ( n , D ) (cid:3) ν , (21) as long as g ( M, d , V ; ν, Ψ) can be integrable. Here ν > and Ψ ∈ R n × p . Although joint prior structure has some desirable properties (see Theorem 4and Section 3.3), it sometimes difficult to incorporate strength of prior beliefwhich could differ for different parameters. For example, if a practitioner hasstrong prior belief on M but has very less knowledge about parameters D and V , then JMDY may not be the optimal choice for prior structure. We design aclass of conditional conjugate prior which would be better suited for this type ofsituation due to flexibility. Also, it is customary to come up with independentprior structure (Gelman et al., 2014; Khare, Pal and Su, 2017) for parameters ofcurved exponential family (Casella and Berger, 2002), where the parametriza-tion differs from the natural parametrization. In order to develop conditionalconjugate prior structure we assume independent priors on M , d and V . It iseasy to see that conditional conjugate priors for both M and V are ML dis-tribution whereas the following definition is used to construct the conditionalconjugate prior for D . Definition 2.
The probability density function of the conditional conjugateprior for D with respect to the Lebesgue measure on R p + is proportional to g ( d ; ν, η ) = exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν , (22) as long as g ( d ; ν, η ) can be integrable. Here ν > , η ∈ R p and n ≥ p . Note that, g ( d ; ν, η ) is a function of n as well, however we do not vary n anywhere in our construction and thus we omit the symbol n from the notationof g ( d ; ν, η ).We refer this particular class of distributions defined in Definition 1 and Defini-tion 2 as joint modified Diaconis-Ylvisaker ( JMDY ) and independent modifiedDiaconis-Ylvisaker (
IMDY ) class, respectively for subsequent discussions.Theorem 1 and Theorem 2 provides conditions on ν, Ψ and η so that g ( M, d , V ; ν, Ψ)and g ( M, d , V ; ν, η ) are integrable, respectively. We state and prove the follow-ing lemma which is necessary to prove these theorems. Lemma 6.
Let Ψ ∈ R n × p and D be a diagonal matrix with positive diagonalentries. If (cid:107) Ψ (cid:107) < , then for arbitrary M ∈ V n,p , V ∈ V p,p , etr (cid:0) V DM T Ψ (cid:1) F ( n , D ) < etr ( − (cid:15) D ) K n,p,(cid:15) , (23) where (cid:15) = (1 − (cid:107) Ψ (cid:107) ) and K n,p,(cid:15) > is a constant depending on n, p and (cid:15) . engupta et al./BMMCSM Proof of Lemma 6.
Note that, 0 < (cid:15) < as (cid:107) Ψ (cid:107) <
1. Assume Y = M T Ψ V ∈ R p × p . For arbitrary l ∈ R p with (cid:107) l (cid:107) = 1, we have l T Y T Y l = ( V l ) T Ψ T Ψ( V l ) − l T V T Ψ T ( I n − M M T )Ψ V l ≤ (1 − (cid:15) ) . (24)The last inequality follows as (cid:107) Ψ (cid:107) = 1 − (cid:15) and ( I n − M M T ) is a non-negativedefinite matrix. From Equation 24 it follows that (cid:107) Y (cid:107) ≤ − (cid:15) . Hence, wecan apply Lemma 2 we obtain that | Y j,j | < − (cid:15) for j = 1 , · · · , p , where Y ,j is the j -th diagonal element of the matrix Y . Now applying Lemma 3 we have, etr (cid:0) V DM T Ψ (cid:1) F ( n , D ) < etr ( DY − (1 − (cid:15) ) D ) K n,p,(cid:15) < etr ( − (cid:15) D ) K n,p,(cid:15) . (cid:3) Theorem 1.
Let M ∈ V n,p and V ∈ V p,p and D be a diagonal matrix withpositive diagonal elements d ∈ R p + . Let Ψ ∈ R n × p with n ≥ p , then for any ν > ,(a) if (cid:107) Ψ (cid:107) < , we have (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) < ∞ , (b) if (cid:107) Ψ (cid:107) > , we have (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) = ∞ , where g ( M, d , V ; ν, Ψ) is defined in Definition 1. Proof of Theorem 1. ( a ) When (cid:107) Ψ (cid:107) < g ( M, d , V ; ν, Ψ) can be normalized to construct a probability den- engupta et al./BMMCSM sity function with respect to the product measure Υ. Consider that (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M )= (cid:90) V n,p (cid:90) V p,p (cid:90) R p + etr (cid:0) νV DM T Ψ (cid:1)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ) dµ ( M ) ( i ) < (cid:90) V n,p (cid:90) V p,p (cid:90) R p + etr ( − ν(cid:15) D )( K n,p,(cid:15) ) ν d d dµ ( V ) dµ ( M )= (cid:90) V n,p dµ ( M ) (cid:90) V p,p dµ ( V ) (cid:90) R p + etr ( − ν(cid:15) D )( K n,p,(cid:15) ) ν d d ( ii ) = 1 K νn,p,(cid:15) p (cid:89) j =1 (cid:90) R + exp( − ν(cid:15) d j ) dd j < ∞ , where the inequality ( i ) is due to Lemma 6 while ( ii ) follows as µ is the normal-ized Haar measure. Note that, here we write [ dV ] = dµ ( V ) and [ dM ] = dµ ( M ).( b ) When (cid:107) Ψ (cid:107) > M Ψ D Ψ V T Ψ be the the unique SVD (Chikuse, 2012) decomposition forthe matrix Ψ. Note that, using sub-multiplicativity (cid:107) Ψ (cid:107) ≤ (cid:107) M Ψ (cid:107) (cid:107) D Ψ (cid:107) (cid:13)(cid:13) V T Ψ (cid:13)(cid:13) = (cid:107) D Ψ (cid:107) = D Ψ , . Hence there exists an (cid:15) > D Ψ , > (1 + (cid:15) ) where D Ψ , denotes thefirst diagonal element of the diagonal matrix D Ψ . Now consider the fact that (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) ≥ (cid:90) V n,p (cid:90) V p,p (cid:90) S p g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M )= (cid:90) V n,p (cid:90) V p,p (cid:90) S p etr (cid:0) ν V DM T Ψ (cid:1)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ) dµ ( M )= (cid:90) V n,p (cid:90) V p,p (cid:90) S p etr (cid:0) ν DM T M Ψ D Ψ V T Ψ V (cid:1)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ) dµ ( M ) . (25)Consider the change of variable via the following orthogonal transformations M ∗ = (cid:2) M Ψ , M Ψ (cid:3) M, V ∗ = V T Ψ V, where M Ψ is matrix containing the bases for the orthogonal complement ofthe column space of M Ψ . Note that (cid:2) M Ψ , M Ψ (cid:3) T M Ψ = ( I (cid:63) ) T where I (cid:63) := (cid:2) I p , n − p,p (cid:3) T . As the Haar measure on the Stiefel manifold is invariant engupta et al./BMMCSM under the orthogonal transformations (Chikuse, 2012), from Equation 25 weget that, (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) ≥ (cid:90) V n,p (cid:90) V p,p (cid:90) S p etr (cid:16) ν DM ∗ T I (cid:63) D Ψ V ∗ (cid:17)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ∗ ) dµ ( M ∗ ) . (26)Consider V † n,p := (cid:26) M ∈ V n,p : (cid:107) I (cid:63) − M (cid:107) < δ (cid:27) ; V † p,p := (cid:26) V ∈ V p,p : (cid:107) I p − V (cid:107) < δ (cid:27) , where δ = (cid:15) / (2 (cid:107) D Ψ (cid:107) ). Note that δ > < (cid:107) D Ψ (cid:107) < ∞ . Clearly V † n,p and V † p,p are open subsets of V n,p and V p,p respectively. 
Hence, µ ( V † n,p ) > µ ( V † p,p ) > M ∈ V † n,p and V ∈ V † p,p then using sub-multiplicativity of (cid:107)·(cid:107) (Conway, 1990)and triangle inequality, we get (cid:13)(cid:13) M T I (cid:63) D Ψ V − D Ψ (cid:13)(cid:13) ≤ (cid:13)(cid:13) M T I (cid:63) D Ψ V − D Ψ V (cid:13)(cid:13) + (cid:107) D Ψ V − D Ψ (cid:107) ≤ (cid:13)(cid:13) M T I (cid:63) − I p (cid:13)(cid:13) (cid:107) D Ψ V (cid:107) + (cid:107) D Ψ (cid:107) (cid:107) V − I p (cid:107) = (cid:13)(cid:13) ( M − I (cid:63) ) T I (cid:63) (cid:13)(cid:13) (cid:107) D Ψ V (cid:107) + (cid:107) D Ψ (cid:107) (cid:107) V − I p (cid:107) ≤ (cid:13)(cid:13) ( M − I (cid:63) ) T (cid:13)(cid:13) (cid:107) I (cid:63) (cid:107) (cid:107) D Ψ (cid:107) (cid:107) V (cid:107) + (cid:107) D Ψ (cid:107) (cid:107) V − I p (cid:107) ≤ (cid:13)(cid:13) ( M − I (cid:63) ) T (cid:13)(cid:13) (cid:107) D Ψ (cid:107) + (cid:107) D Ψ (cid:107) (cid:107) V − I p (cid:107) ≤ δ (cid:107) D Ψ (cid:107) = (cid:15) . (27)Let λ , . . . , λ p be diagonal elements of the matrix M T I (cid:63) D Ψ V . From Lemma2 we get that | λ j − D Ψ ,j | ≤ (cid:15) / j = 1 , . . . , p . Here D Ψ ,j denotes the j -thdiagonal element of the matrix D Ψ . Hence for arbitrary M ∈ V † n,p and V ∈ V † n,p ,we have tr (cid:0) M T I (cid:63) D Ψ V (cid:1) = p (cid:88) j =1 λ j ≥ p (cid:88) j =1 (cid:16) D Ψ ,j − (cid:15) (cid:17) , (28)as λ j ≥ (cid:0) D Ψ ,j − (cid:15) (cid:1) for all j = 1 , , · · · , p . engupta et al./BMMCSM Now from Equation 26, we have (cid:90) V n,p (cid:90) V p,p (cid:90) R p + g ( M, d , V ; ν, Ψ) d d dµ ( V ) dµ ( M ) ≥ (cid:90) V † n,p (cid:90) V † p,p (cid:90) S p etr (cid:16) ν DM ∗ T I (cid:63) D Ψ V ∗ (cid:17)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ∗ ) dµ ( M ∗ ) ( iii ) ≥ (cid:90) V † n,p (cid:90) V † p,p (cid:90) S p exp (cid:16) ν (cid:80) pj =1 d j (cid:0) D Ψ ,j − (cid:15) (cid:1)(cid:17)(cid:2) F ( n , D ) (cid:3) ν d d dµ ( V ∗ ) dµ ( M ∗ ) , ( iv ) ≥ (cid:90) V † n,p (cid:90) V † p,p (cid:90) S p exp (cid:16) ν (cid:80) pj =1 d j (cid:0) D Ψ ,j − (cid:15) (cid:1)(cid:17) [ etr ( D )] ν d d dµ ( V ∗ ) dµ ( M ∗ ) , ≥ µ ( V † n,p ) µ ( V † p,p ) (cid:90) S p exp ν p (cid:88) j =1 d j (cid:16) D Ψ ,j − − (cid:15) (cid:17) d d , ( v ) ≥ µ ( V † n,p ) µ ( V † p,p ) (cid:90) S p exp (cid:16) ν (cid:15) d (cid:17) p (cid:89) j =2 exp (cid:16) ν d j (cid:16) D Ψ ,j − − (cid:15) (cid:17)(cid:17) d d , = ∞ , (29)where ( iii ) and ( iv ) follow from Equation 28 and Lemma 1, respectively. Finally,( v ) follows as D Ψ , > (1 + (cid:15) ). (cid:3) Remark for Theorem 1.
One could notice that the conditions mentionedin this theorem is not entirely necessary and sufficient conditions. We have notaddressed the case where (cid:107) Ψ (cid:107) = 1. This scenarios could be broken into twocases (a) all the eigenvalues of Ψ are equal to 1 and (b) only a few eigenvaluesare equal to 1 and rest are strictly less than 1. In both the cases, it seems thatthe problem is more involved than the current one and we have not investigatedthe finiteness of the corresponding integral in detail for those cases. For now,we leave those for future work. Theorem 2.
Let D be diagonal matrix with diagonal elements d ∈ R p + . Let η = ( η , . . . , η p ) ∈ R p and n be any integer with n ≥ p . Then for any ν > , (cid:90) R p + g ( d ; ν, η , n ) d d < ∞ , if and only if max ≤ j ≤ p η j < , where g ( d ; ν, η , n ) is defined in Definition 2. Proof of Theorem 2. engupta et al./BMMCSM Sufficient condition:
For any η := ( η , . . . , η p ) ∈ R p , define η + := (cid:0) η +1 , . . . , η + p (cid:1) where η + j equals η j when η j > D η to be the diag-onal matrix with diagonal elements η + . Let us consider the following matricesΨ = (cid:20) D η n − p,p (cid:21) , M (cid:63) = (cid:20) I p,p n − p,p (cid:21) and V (cid:63) = I p . Note that (cid:102) M ∈ (cid:101) V n,p , (cid:101) V ∈ V p,p and D η = (cid:102) M T Ψ (cid:101) V . Now from Definition 2 we getthat (cid:90) R p + g ( d ; ν, η , n ) d d = (cid:90) R p + exp( ν (cid:80) pj =1 η j d j ) (cid:2) F ( n , D ) (cid:3) ν d d ≤ (cid:90) R p + exp( ν (cid:80) pj =1 η + j d j ) (cid:2) F ( n , D ) (cid:3) ν d d = (cid:90) R p + etr ( ν DD η ) (cid:2) F ( n , D ) (cid:3) ν d d = (cid:90) R p + etr (cid:16) ν (cid:101) V D (cid:102) M T Ψ (cid:17)(cid:2) F ( n , D ) (cid:3) ν d d ( vi ) < (cid:90) R p + etr ( − ν(cid:15) D )( K n,p,(cid:15) ) ν d d = 1( K n,p,(cid:15) ) ν p (cid:89) j =1 (cid:90) R + exp( − ν(cid:15) d j ) dd j < ∞ , (30)where the inequality at step ( vi ) follows from Lemma 6 with appropriate (cid:15) > Necessary condition:
Let η ∈ R p be such that max j =1 ,...p η j ≥
1. There existat least one j ∈ { , . . . p } such that η j ≥
1. Without loss of generality, let usassume that η ≥
1. From Definition 2, we have engupta et al./BMMCSM (cid:90) R p + g ( d ; ν, η , n ) d d = (cid:90) R p + exp( ν (cid:80) pj =1 η j d j ) (cid:2) F ( n , D ) (cid:3) ν d d ≥ (cid:90) R p + exp( ν (cid:80) pj =1 η j d j ) etr ( νD ) d d = p (cid:89) j =1 (cid:90) R + exp ( ν ( η j − d j ) dd j = (cid:90) R + exp ( ν ( η − d ) dd p (cid:89) j =2 (cid:90) R + exp ( ν ( η j − d j ) dd j = ∞ , where the inequality is due to Lemma 1. (cid:3) Remark for Theorem 2.
We could alternatively parametrize
IMDY in thefollowing way g ( d ; ν, η ) ∝ exp (cid:16)(cid:80) pj =1 η j d j (cid:17) / (cid:104) F ( n , D ) (cid:105) ν when max ≤ j ≤ p η j < ν .In this parametrization if we set ν = 0 and β := − η then g ( d ; ν, η ) refers tothe Exponential distribution with parameter β . The following lemmas are essential to study theoretical properties of the conju-gate prior mentioned in Section 3.1.
Lemma 7.
The probability density function for the prior distribution of d ∼ IMDY ( d ; ν, η ) denoted by g ( d ; ν, η ) := exp( ν η T d ) / (cid:104) F (cid:16) n , D (cid:17)(cid:105) ν , is log-concaveas a function of d where D is the diagonal matrix with diagonal elements d , max ≤ j ≤ p η j < , ν > and n ≥ p . Proof of Lemma 7.
From Definition 2 we have, g ( d ; ν, η ) := exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν , = ⇒ log g ( d ; ν, η ) := ν η T d − ν log (cid:18) F (cid:18) n , D (cid:19)(cid:19) (31)From Lemma 4, it follows that − ν log (cid:16) F (cid:16) n , D (cid:17)(cid:17) is concave function of d .Also, ν η T d is a linear function of d . Therefore from Equation 31 it is clear thatlog g ( d ; ν, η ) is a concave function of d . engupta et al./BMMCSM (cid:3) Lemma 8.
The distribution of d is unimodal if < η j < for all j =1 , , · · · , p . The mode of the distribution is characterized by the parameter η and it does not dependent on the parameter ν . Proof of Lemma 8.
Let l ( d , ν, η ) = log( g ( d ; ν, η )). If (cid:98) d is the mode of the distribution then ∂∂ d l ( d , ν, η ) (cid:12)(cid:12)(cid:12)(cid:12) d = (cid:98) d = 0 , = ⇒ ν η − ν ∂∂ d log (cid:18) F (cid:18) n , D (cid:19)(cid:19) (cid:12)(cid:12)(cid:12)(cid:12) d = (cid:98) d = 0 , = ⇒ ∂∂ d log (cid:18) F (cid:18) n , D (cid:19)(cid:19) (cid:12)(cid:12)(cid:12)(cid:12) d = (cid:98) d = η , = ⇒ h ( (cid:98) d ) = η , (32)where h ( d ) := ( h ( d ) , h ( d ) , · · · , h p ( d )) with h j ( d ) := (cid:16) ∂∂d j F (cid:16) n , D (cid:17)(cid:17) / F (cid:16) n , D (cid:17) for j = 1 , , · · · , p . The function h j ( d ) is strictly increasing as the function F (cid:16) n , D (cid:17) is log-convex (see Lemma 4). Also, it follows from Lemma 5 that0 < h j ( d ) < d ∈ S p . Hence the Equation 32 has a unique solutionwhen 0 < η j < j = 1 , , · · · , p . Also it is clear that the solution doesnot depend on ν . On the other hand, given any (cid:98) d ∈ S p we can always find a η satisfying Equation 32 such that 0 < max ≤ j ≤ p η j < (cid:3) Remark:
In the case of η j ≤
0, the density defined in 2 is decreasing as afunction of d j on the set R + . Therefore, mode does not exist.In order to introduce the notion of “concentration” for IMDY class of distribu-tions we require the concept of level set. Let unnormalized probability densityfunction for
IMDY class of distributions, g ( x ; ν, η ), achieves the maximum valueat m η and let S l = (cid:8) x ∈ R p + : g ( x ; 1 , η ) /g ( m η ; 1 , η ) > l (cid:9) be the level set of order l containing the mode m η where 0 ≤ l <
1. Note that,to define the level set we could have used any fixed value of ν > g ( x ; ν , η )instead of g ( x ; 1 , η ), however without loss of generality we choose ν = 1. Lemma 9.
Let η ∈ R p be a fixed vector such that < max ≤ j ≤ p η j < . Whenever d ∼ IMDY ( d ; ν, η ) , we have(a) P ν ( S l ) is an increasing function of ν .(b) For any open set S ⊂ R p + containing m η , P ν ( d ∈ S ) goes to as ν → ∞ , engupta et al./BMMCSM where P ν ( · ) denotes the probability distribution corresponding to d ∼ IMDY ( d ; ν, η ) . Proof of Lemma 9. ( a ) Note that, from definitions of unimodality and level set we have (cid:20) g ( y ; ν, η ) g ( x ; ν, η ) (cid:21) > y ∈ S and for all x ∈ S c . (33)Consider the function r ( ν, x ) := (cid:90) S g ( y ; ν, η ) g ( x ; ν, η ) d y = (cid:90) S (cid:20) g ( y ; 1 , η ) g ( x ; 1 , η ) (cid:21) ν d y , (34)where x ∈ S c . Using equation 33 it is easy to see that (cid:104) g ( y ;1 , η ) g ( x ;1 , η ) (cid:105) ν is monotoni-cally increasing in ν for all y ∈ S . Hence r ( ν, x ) is increasing function in ν forany x ∈ S c .Note that, P ν ( d ∈ S c ) P ν ( d ∈ S ) = (cid:82) S c g ( x ; ν, η ) d x (cid:82) S g ( y ; ν, η ) d y = (cid:90) S c (cid:82) S g ( y ; ν, η ) g ( x ; ν, η ) d y d x = (cid:90) S c r ( ν, x ) d x . (35)Hence P ν ( d ∈ S c ) /P ν ( d ∈ S ) is a decreasing function of ν as r ( ν, x ) is a decreas-ing function in ν for every x ∈ S c or equivalently P ν ( d ∈ S ) increasing functionin ν . (cid:3) ( b ) Let d ∼ IMDY ( · ; ν, η ) with 0 < η j < j = 1 , . . . p . Let m η be themode the distribution. Note that the value of m η only depends on the parameter η and does not depend on the parameter ν . Let f ( d ; ν, η ) be the correspondingprobability density function. Hence for the class of distribution function definedin Definition 2, it follows that, f ( d ; ν, η ) = 1 K ν, η exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν , (36)where K ν, η is the appropriate normalizing constant.Let us define the function g ( d ; η ) = exp( η T d ) / F (cid:16) n , D (cid:17) . Let S be any openset containing m η , the mode of the density function f ( d ; ν, η ). Consider, the set S (cid:63) := { d : g ( d ; η ) ≤ ζ } , where ζ = sup d ∈ S c g ( d ; η ). It is easy to show that S c ⊆ S (cid:63) . engupta et al./BMMCSM Consider the fact that , g ( d ; η ) g ( λ m η + (1 − λ ) d ; η ) where λ ∈ (0 , η T d ) F (cid:0) n , D (cid:1) F (cid:16) n , [ λD m +(1 − λ ) D ] (cid:17) exp( η T ( λ m η + (1 − λ ) d )) ( vii ) ≤ exp( η T d ) F (cid:0) n , D (cid:1) (cid:104) F (cid:16) n , D m (cid:17)(cid:105) λ (cid:104) F (cid:16) n , D (cid:17)(cid:105) − λ exp( η T ( λ m η + (1 − λ ) d )) ≤ (cid:34) exp( η T d ) F (cid:0) n , D (cid:1) (cid:35) λ F (cid:16) n , D m (cid:17) exp( η T m η ) λ = (cid:20) g ( d ; η ) g ( m η ; η ) (cid:21) λ , (37)where D m is the diagonal matrix with diagonal m η . Note that, inequality ( vii )follows from the fact that F ( · ) is a log-convex function.Hence we have, f ( d ; ν, η ) f ( λ m η + (1 − λ ) d ; ν, η ) = (cid:20) g ( d ; η ) g ( λ m η + (1 − λ ) d ; η ) (cid:21) ν ≤ (cid:20) g ( d ; η ) g ( m η ; η ) (cid:21) ν λ . (38) P ν ( S (cid:63) ) = (cid:90) S (cid:63) f ( d ; ν, η ) d d = (cid:90) S (cid:63) f ( d ; ν, η ) f ( λ m η + (1 − λ ) d ; ν, η ) f ( λ m η + (1 − λ ) d ; ν, η ) d d ≤ (cid:90) S (cid:63) (cid:20) g ( d ; η ) g ( m η ; η ) (cid:21) νλ f ( λ m η + (1 − λ ) d ; ν, η ) d d ≤ (cid:90) S (cid:63) (cid:20) ζg ( m η ; η ) (cid:21) νλ f ( λ m η + (1 − λ ) d ; ν, η ) d d = (cid:20) ζg ( m η ; η ) (cid:21) νλ (cid:90) S (cid:63) f ( λ m η + (1 − λ ) d ; ν, η ) d d ≤ (cid:20) ζg ( m η ; η ) (cid:21) νλ . 
(39)Hence we have,lim ν →∞ P ν ( S ) ≥ − lim ν →∞ P ν ( S (cid:63) ) ≥ − lim ν →∞ (cid:20) ζg ( m η ; η ) (cid:21) νλ = 1as ζ < g ( m η ; η ). (cid:3) engupta et al./BMMCSM The following two theorems establishes few important properties of
IMDY and
JMDY class of distributions.
Theorem 3.
Let d ∼ IMDY ( · ; ν, η ) for some ν > and max ≤ j ≤ p η j < where η = ( η , . . . , η p ) . Then(a) The distribution of d is log-concave.(b) The distribution of d is unimodal if η j > for all j = 1 , , · · · , p . Themode of the distribution is characterized by the parameter η and it does notdependent on the parameter ν .(c) The parameter ν relates to the concentration of the probability around modeof the distribution. Larger values of ν implies larger concentration of prob-ability near the mode of the distribution. Proof of Theorem 3.
Proof of part ( a ) , ( b ) and ( c ) follow from Lemma 7, 8 and 9, respectively. (cid:3) We call the parameter η as modal parameter and ν as Concentration parameter. Definition 3.
The parameter η in the distribution that belongs to the class ofdistributions IMDY is defined as “modal parameter”. Definition 4.
The scalar parameter ν in the distribution that belongs to theclass of distributions IMDY is defined as “concentration parameter”. Theorem 4.
Let ( M, d , V ) ∼ JMDY ( · ; ν, Ψ) for some ν > and (cid:107) Ψ (cid:107) < .Then(a) The distribution has unique mode. The mode is characterized by the param-eter Ψ and it does not dependent on the parameter ν .(b) Conditional distribution of M given ( d , V ) and V given ( M, d ) are ML distributions whereas conditional distribution of d given ( M, V ) is IMDYclass of distribution. Proof of Theorem 4.
The joint density is proportional to g ( M, d , V ; ν, Ψ) = etr ( ν V DM T Ψ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν , (40)( a ) Let us write the SVD (Chikuse, 2012) of Ψ = M Ψ D Ψ V T Ψ . We have, etr ( ν V DM T Ψ) = etr ( ν DM T M Ψ D Ψ V T Ψ V )= etr ( ν V T Ψ V D U M D M V TM D Ψ )= etr ( ν V D U M D M V TM D Ψ ) (41) engupta et al./BMMCSM where SVD of is written as M T M Ψ = U M D M V TM and V = V T Ψ V is an orthog-onal matrix. Therefore we have, etr ( ν V DM T Ψ) = etr ( ν V D U M D M V TM D Ψ ) ( viii ) ≤ etr ( ν DD M D Ψ ) , (42)where the inequality ( viii ) follows from Kristof (1969) (see Theorem on page 5)as V , U M and V M are orthogonal matrices while D , D M and D Ψ are diagonalmatrices with nonnegative diagonal entries. Note that, using sub-multiplicativityof the (cid:107)·(cid:107) (Conway, 1990), we have (cid:107) D M (cid:107) = (cid:13)(cid:13) U TM M T M Ψ V M (cid:13)(cid:13) ≤ (cid:13)(cid:13) U TM (cid:13)(cid:13) (cid:13)(cid:13) M T (cid:13)(cid:13) (cid:107) M Ψ (cid:107) (cid:107) V M (cid:107) ≤ . Therefore, using Lemma 2, we infer that all the diagonal entries of D M is lessthan or equal to 1. Hence from Equation 42, we get that etr ( ν V DM T Ψ) ≤ etr ( ν DD Ψ ) . (43)Therefore, it follows from Kristof (1969) that M = M Ψ and V = V Ψ are uniquemaximizers when M Ψ ∈ (cid:101) V n,p and V Ψ ∈ V p,p . Note that, this does not dependon the choice of ν .Now putting back the value of M and V , we write the expression given in theEquation 40 which can now be seen as etr ( ν DD Ψ ) / (cid:104) F (cid:16) n , D (cid:17)(cid:105) ν . Note that,the diagonal elements of D Ψ is between 0 and 1 as (cid:107) Ψ (cid:107) <
1. Hence usingpart (b) of Theorem 3 we know that etr ( ν DD Ψ ) / (cid:104) F (cid:16) n , D (cid:17)(cid:105) ν has a uniquemaximizer which also does not depend on the choice of ν . (cid:3) ( b ) For JMDY prior structure, the conditional distribution of M given ( d , V )is proportional to etr (cid:16) ν (Ψ V D ) T M ) (cid:17) . This distribution is an ML distribution with parameters M M Ψ , D M Ψ , V M Ψ whereSVD decomposition (Chikuse, 2012) of ν (Ψ V D ) = M M Ψ D M Ψ ( V M Ψ ) T .Similarly, the conditional distribution of V given M and d is proportional to etr (cid:16) ν (Ψ T M D ) T V ) (cid:17) . Therefore, it is another ML distribution with parameters M V Ψ , D V Ψ , V V Ψ whereSVD decomposition of ν (Ψ T M D ) = M V Ψ D V Ψ ( V V Ψ ) T .Finally, the conditional distribution of d given ( M, V ) is a distribution thatbelongs to
IMDY class of distributions with parameters ν and η Ψ , where η Ψ = { η Ψ1 , η Ψ2 , · · · , η Ψ p } and η Ψ j is the j -th diagonal element of the matrix M T Ψ V . engupta et al./BMMCSM (cid:3) In next subsection (Section 3.3) we show that the posterior “modal parameter”is a linear combination of the prior “modal parameter” and a function of samplemean.The following lemmas are useful from the practitioner viewpoint. The resultwill help to truncate the right tail of the distribution at an appropriate pointaccording to a criteria involving only the unnormalized density function.
Lemma 10.
Let d ∼ IMDY ( · ; ν, η ) for some ν > and max ≤ j ≤ p η j < where η = ( η , . . . , η p ) . Let m be the mode of the conditional distribution, g ( · ) := g ( · ; ν, η | ( d , . . . , d p )) , of the variable d given ( d , . . . , d p ) . Then Q ( d ) = g ( d + b ) /g ( d ) is strictly decreasing when b > and d > m where m is themode of the density function given in Definition 2. Proof of Lemma 10.
We have, log( g ( d )) = ν η d − log (cid:18) F (cid:18) n , D (cid:19)(cid:19) = ⇒ ∂ ∂d (log g ( d )) = − ∂ ∂d (cid:18) log (cid:18) F (cid:18) n , D (cid:19)(cid:19)(cid:19) < , (44)as log (cid:16) F (cid:16) n , D (cid:17)(cid:17) is a strictly convex function (from Lemma 4). Therefore ∂∂d (log g ( d )) = g (cid:48) ( d ) /g ( d ) is a strictly decreasing function in d .log( Q ( d )) = log( g ( d + b )) − log( g ( d ))= ⇒ ∂∂d (log Q ( d )) = g (cid:48) ( d + b ) g ( d + b ) − g (cid:48) ( d ) g ( d ) < , as g (cid:48) ( d ) /g ( d ) is a strictly decreasing function. Therefore, Q ( d ) is also astrictly decreasing function in d . (cid:3) Lemma 11.
Let d ∼ IMDY ( · ; ν, η ) for some ν > and max ≤ j ≤ p η j < where η = ( η , . . . , η p ) . Let m be the mode of the conditional distribution, g ( · ) := g ( · ; ν, η | ( d , . . . , d p )) , of the variable d given ( d , . . . , d p ) . Let B > m , besuch that g ( B ) g ( m ) < (cid:15) for some (cid:15) > , then P ( d > B | d , . . . , d p ) < (cid:15) . Proof of Lemma 11.
The unnormalized conditional density of the random variable d is proportionalto g ( d ) = exp ( ν η d ) F (cid:0) n , D (cid:1) ν . engupta et al./BMMCSM Let f ( d ; ν, η | ( d , . . . , d p )) be the density function for the conditional distribu-tion of d given ( d , . . . , d p ). For notational convenience, for rest of this lemmawe use f ( · ) as the conditional probability density function. Hence we have, f ( d ) = 1 K ν, η exp ( ν η d ) F (cid:0) n , D (cid:1) ν , where K ν, η is an appropriate normalizing constant. From Lemma 10, it followsthat f ( B + x ) /f ( m + x ) is a decreasing function of x when B > m . Hence forall x > f ( B + x ) f ( m + x ) = g ( B + x ) g ( m + x ) < g ( B ) g ( m ) ( viii ) < (cid:15), where the inequality at ( viii ) follows due to the assumption of the lemma.Therefore, P ( d > B | ( d , . . . , d p )) = (cid:90) ∞ B f ( y ) dy = (cid:90) ∞ f ( B + x ) f ( m + x ) f ( m + x ) dx< (cid:15) P ( d > m | ( d , . . . , d p )) < (cid:15). (cid:3) Let W i for i = 1 , , · · · , N be the samples drawn from ML distribution withparameters M, d , V . If we consider a Bayesian analysis with the prior class JMDY with parameters ν and Ψ, then the probability density for the jointposterior distribution of M, d and V given { W i } Ni =1 is proportional to g ( M, d , V ; ν, Ψ) × N (cid:89) i =1 etr ( V DM T W i ) F ( n , D )= etr (cid:0) ν V DM T Ψ (cid:1)(cid:2) F ( n , D ) (cid:3) ν × N (cid:89) i =1 etr ( V DM T W i ) F ( n , D )= etr (cid:16) ( ν + N ) V DM T (cid:16) νν + N Ψ + Nν + N W (cid:17)(cid:17)(cid:2) F ( n , D ) (cid:3) ν + N , (45)where W = (cid:80) Ni =1 W i /N and N is the number of data points. Observe that,the posterior distribution is also in JMDY class with concentration parameter( ν + N ) and modal parameter (cid:16) νν + N Ψ + Nν + N W (cid:17) . engupta et al./BMMCSM On the other hand, when we consider a Bayesian analysis with the prior class
IMDY with parameters ν and η , then the conditional probability density forposterior distribution of d given M , V , { W i } Ni =1 is proportional to g ( d ; ν, η ) × N (cid:89) i =1 etr ( V DM T W i ) F ( n , D )= exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν × N (cid:89) i =1 etr ( V DM T W i ) F ( n , D )= exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν × etr ( DM T N W V ) (cid:2) F ( n , D ) (cid:3) N where W = (cid:80) Ni =1 W i /N = exp( ν η T d ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν + N × exp( N p (cid:88) j =1 d j Y j,j ) where Y = M T W V ∈ R p × p = exp( ν η T d ) exp( N (cid:80) pj =1 d j Y j,j ) (cid:2) F (cid:0) n , D (cid:1)(cid:3) ν + N = exp (cid:18) ( ν + N ) (cid:16) νν + N η + Nν + N η Y (cid:17) T d (cid:19)(cid:2) F (cid:0) n , D (cid:1)(cid:3) ν + N where η Y = ( Y , , · · · , Y p,p )(46)Here the conditional posterior distribution of d is in IMDY class with concen-tration parameter ( ν + N ) and modal parameter (cid:16) νν + N η + Nν + N η Y (cid:17) .Finally, in the following subsection we talk about the several reasons for notbeing able to use DY theorem directly in our case. ML distribution According to the assumption of DY, for a d -dimensional exponential familydistribution, µ be the measure defined on the Borel sets of R d . In the contextof th ML distribution µ is the measure defined on the Stiefel manifold. Thesymbol X is used to denote the interior of the support of the measure µ . Asshowed in Hornik and Gr¨un (2013) X := { X : (cid:107) X (cid:107) < } . According to theassumptions of DY (cid:82) X dP θ ( X ) = 1 (See the paragraph after equation (2.1) onpage 271 in Diaconis and Ylvisaker (1979)). On the contrary for matrix Langevindistribution (cid:90) X dP θ ( X ) = (cid:90) X f ML ( X ) [ dX ] = 0 . During the proof of Theorem 1 in Diaconis and Ylvisaker (1979) Dy constructs engupta et al./BMMCSM a probability measure restricted on set A as follows. µ A ( B ) = µ ( A ∩ B ) µ ( A ) , where µ ( A ) > . Also, x A = (cid:82) Z dµ A ( Z ). In the context of the proof of Theorem 1 in Diaconisand Ylvisaker (1979) uses the crucial fact that x A are dense in supp ( µ ) (See theline after Equation (2.4) on page 272 in Diaconis and Ylvisaker (1979)).In the context of the ML distribution supp ( µ ) is the Stiefel manifold. It canbe shown that similar construction in the case of ML distribution would leadto x A where x A does not belong to the Stiefel manifold i.e. x A (cid:54)∈ supp ( µ ).Hence x A will not be dense supp ( µ ). As a result, Theorem 1 in (Diaconis andYlvisaker, 1979) is not applicable for ML distribution. Note that a modifiedDY construction can be formulated that would enable us constructing prior on F . However, our parametrization is different than the natural parametrization,therefore we require a new approach to construct the prior distribution on M, d and V . Plots for conditional prior of d given M and V Figure 1 shows plots forprior densities for different values of ν and η . Note that, with the same value of η the location of the mode remain the same for different values of ν (see eachrow of Figure 1). As ν increases, the probability concentration around the modeof the distribution increases. Finding the modal parameter from the mode
We have given an examplewhen the practitioner wants to set a particular mode denoted by d mode . We solvefor the corresponding η mode from Equation 32. For example, let us denote themode by (5 ,
7) and after solving for η mode , we have η mode = (0 . , . ,
7) for two different setting of ν which incorporates the strength of the belief in the value of the mode. Here wetake ν = 10 and ν = 20. For both
JMDY and
IMDY class of distributions, we have uniform prior overrespective parameters whenever the probability density function is proportionalto 1. For
JMDY , it can be achieved by setting ν = 0 in Definition 1. For IMDY , ν = 0 provides the uniform prior on parameter d . The resulting priorswould be improper as in this case, the integral over the entire space becomesinfinite. However, in this case, it is necessary to check the propriety of posteriordistributions.In order to incorporate the prior belief for IMDY class of distributions, one canfind the appropriate value of hyperparameter η from Equation 32 once modeof d (denoted by d mode ) is given. Note that, we get a feasible η for every real d mode ∈ S p . The other parameter ν sets the strength of one’s prior belief. It isimportant to realize that there is a strong relationship between ν and number engupta et al./BMMCSM ν = 10 , η = [0 . , .
89] (b) ν = 20 , η = [0 . , . ν = 10 , η = [0 . , .
50] (d) ν = 20 , η = [0 . , . ν = 10 , η = [0 . , .
94] (f) ν = 20 , η = [0 . , . Fig 1: Prior density plots for different values of ν and η engupta et al./BMMCSM ν = 10 , η = [0 . , .
88] (b) ν = 20 , η = [0 . , . Fig 2: Find appropriate η from given mode of the distribution d mode .of data samples. For setting the hyperparameters of the prior distribution for M and V , one can use M mode and V mode , respectively with the appropriateparameters for ML distribution.On the other hand for JMDY class of distribution, we set appropriate value ofhyperparameter η from Equation 32 when mode of d is given. Next, we constructa diagonal matrix, D η with the diagonal entries η . The hyperparameter Ψ canbe constructed in the following way, Ψ = M mode D η V Tmode where M mode and V mode are the choices for the modes of their respective distributions.In order to setup an empirical prior framework, one could obtain the maximumlikelihood estimator (MLE) using the technique described in Chikuse (2012).We could set the hyperparameters in such a way that the mode of the priordistribution is same as MLE. Also note that, the “Empirical Bayesian” proce-dure (Robbins, 1985; Casella, 1985) is out of scope of this study.
4. Bayesian framework for Mixture of ML distributions

In this section we develop a framework for a finite mixture of ML distributions: we derive the form of the posterior, establish posterior consistency, and describe the sampling scheme.

Cluster analysis determines the internal structure of data in an unsupervised way, when no information other than the observed values is available (Picard, 2007). A finite mixture model clusters data points by assuming that each component of the mixture comes from a suitable parametric distribution and that the mixture distribution is a convex combination of the individual component distributions; the number of components is typically specified in advance.

We describe our framework as a finite mixture of ML distributions with a fixed number of mixture components $C$; the selection of the number of components is discussed in Section 4.5. A popular approach to clustering is to model the data by a mixture of appropriate distributions. The Gaussian mixture model, for example, has been used in numerous applications spanning computer vision and computational neuroscience (Stauffer and Grimson, 1999; McKenna, Raja and Gong, 1999; KaewTraKulPong and Bowden, 2002; Lewicki, 1998; Wood et al., 2004); in the context of directional data, mixtures of von Mises distributions (McGraw et al., 2006; Mardia, Taylor and Subramaniam, 2007; Tang, Chu and Huang, 2009; Bangert, Hennig and Oelfke, 2010; Reisinger et al., 2010; Hornik and Grün, 2014) and mixtures of ML distributions (Lin, Rao and Dunson, 2017) have been used.

Consider the product parameter space $\Theta:=\widetilde{\mathcal{V}}_{n,p}^{C}\times\mathcal{S}_p^{C}\times\mathcal{V}_{p,p}^{C}$, and let $\boldsymbol{\theta}:=\{\theta_c\}_{c=1}^C=\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C$ denote any point in $\Theta$. Let $S_\pi:=\{(\pi_1,\pi_2,\cdots,\pi_C)\in(0,1)^C:\sum_{c=1}^C\pi_c=1\}$ be the $C$-simplex and $\boldsymbol{\pi}\in S_\pi$ any point in it. Denote $\Xi:=\Theta\times S_\pi$.

Now consider the class of finite mixtures of ML densities $\mathcal{C}_{\mathcal{ML}}:=\{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c):(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi\}$. Let $X_i\in\mathcal{V}_{n,p}$ ($i=1,2,\cdots,N$) be data observed from a mixture of ML distributions,
$$f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c),\qquad f\in\mathcal{C}_{\mathcal{ML}}.$$
For sampling it is customary to introduce latent cluster assignment variables (McLachlan and Peel, 2004; Bishop, 2006), so the mixture model can be written as
$$X_i\mid(Z_i=c)\sim f_{\mathcal{ML}}(X_i;\theta_c)\quad\text{with}\quad P(Z_i=c)=\pi_c\quad\text{for }c=1,\cdots,C, \tag{47}$$
where $\pi_c>0$, $\sum_{c=1}^C\pi_c=1$, and $Z_i$ is the latent cluster assignment of the $i$-th data point $X_i$. The likelihood function for $\boldsymbol{\theta}$ is
$$L(\boldsymbol{\theta})=\prod_{i=1}^N\prod_{c=1}^C\left[\pi_c f_{\mathcal{ML}}(X_i\mid\theta_c)\right]^{I(Z_i=c)}. \tag{48}$$
In Section 3 we discuss the prior structures and their properties in detail. We assume two different classes of prior structure. In the first,
$$(M_c,D_c,V_c)\sim \mathrm{JMDY}(\cdot\,;\nu_c,\Psi_c),\qquad \boldsymbol{\pi}\sim\mathrm{Dir}(\cdot\,;(\alpha_1,\alpha_2,\cdots,\alpha_C)), \tag{49}$$
while in the second,
$$M_c\sim\mathcal{ML}(\cdot\,;\xi_c^M,\xi_c^D,\xi_c^V),\quad \mathbf{d}_c\sim\mathrm{IMDY}(\cdot\,;\nu_c,\eta_c),\quad V_c\sim\mathcal{ML}(\cdot\,;\gamma_c^M,\gamma_c^D,\gamma_c^V),\quad \boldsymbol{\pi}\sim\mathrm{Dir}(\cdot\,;(\alpha_1,\alpha_2,\cdots,\alpha_C)). \tag{50}$$
Under both prior structures the conditional posterior distributions of the parameters have a similar form.
Therefore, we choose to use the independent prior structure given in Equation 50 to demonstrate the posterior computation described in Section 4.4.

The posterior density of $(\boldsymbol{\theta},\boldsymbol{\pi},\{Z_i\})$ given $\{X_i\}_{i=1}^N$ is proportional to
$$\left\{\prod_{i=1}^N\prod_{c=1}^C\left[\pi_c f_{\mathcal{ML}}(X_i\mid\theta_c)\right]^{I(Z_i=c)}\right\} f_{prior}(\boldsymbol{\pi},\boldsymbol{\theta}). \tag{51}$$
From Equation 51 it follows that the posterior density is proportional to
$$\prod_{c=1}^C\left\{\pi_c^{\alpha_c+N_c-1}\,\frac{\operatorname{etr}\!\left(\left(V_cD_cM_c^T\right)N_c\bar{X}_c+G_cM_c+H_cV_c\right)}{\left[{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)\right]^{\nu_c+N_c}}\,\exp(\nu_c\eta_c^T\mathbf{d}_c)\,I(\mathbf{d}_c\in\mathcal{S}_p)\right\}, \tag{52}$$
where $N_c=\sum_{i=1}^N I(Z_i=c)$ and $\bar{X}_c=\frac{1}{N_c}\sum_{i=1}^N X_i I(Z_i=c)$ for $c=1,\cdots,C$. Also,
$$G_c=\xi_c^V\xi_c^D(\xi_c^M)^T\quad\text{and}\quad H_c=\gamma_c^V\gamma_c^D(\gamma_c^M)^T. \tag{53}$$

The classes of prior distributions specified in Equations 49 and 50 are flexible in the sense that empirical information and/or prior knowledge about any parameter can be incorporated into the model through appropriate hyperparameter choices. On the other hand, in the absence of prior knowledge, one can specify hyperparameter values such that the corresponding prior distributions become weakly informative or vague. In the following section we describe two specific procedures for selecting hyperparameter values, with the independent prior structure of Equation 50 in mind; similar procedures can easily be developed for the joint prior structure described in Equation 49.
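The per-cluster quantities $N_c$ and $\bar{X}_c$ that enter Equation 52 are simple sufficient statistics of the current cluster assignments. A minimal sketch (Python/NumPy; names are ours):

```python
import numpy as np

def cluster_sufficient_stats(X, Z, C):
    """Given data X of shape (N, n, p) and assignments Z in {0, ..., C-1},
    return the per-cluster counts N_c and means Xbar_c used in the posterior (52)."""
    N, n, p = X.shape
    counts = np.zeros(C, dtype=int)
    xbars = np.zeros((C, n, p))
    for c in range(C):
        members = X[Z == c]
        counts[c] = members.shape[0]
        if counts[c] > 0:
            xbars[c] = members.mean(axis=0)
    return counts, xbars
```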
Weakly informative prior
If the prior probability density function is proportional to 1, we refer to the corresponding prior as a uniform prior. We can construct a uniform prior from the prior structure defined in Equation 50 by choosing $\alpha_c=1$, $\nu_c=0$, $\xi_c^D=0_{p,p}$ and $\gamma_c^D=0_{p,p}$ for $c=1,\ldots,C$, where $0_{p,p}$ denotes the zero matrix of dimension $p\times p$. Note that the remaining hyperparameters $\eta_c,\xi_c^M,\xi_c^V,\gamma_c^M,\gamma_c^V$ need not be specified in this case. The uniform prior designed here is improper, and improper priors are not permissible for mixture models since they lead to an invalid posterior. As a remedy, one may construct a "constrained mixture model" (Diebolt and Robert, 1994) by introducing additional constraints that ensure propriety of the corresponding posterior; as this is tangential to the current discussion, we omit the detailed construction of the constrained mixture model here. Without this additional complexity, one may instead construct a weakly informative, proper prior by choosing $\eta_c$ to be very close to zero.

Empirical prior
We first gather empirical information by fitting an EM-based algorithm to the data to obtain the maximum likelihood estimates of the parameters (see Section 4.6). Once we have basic estimates of the cluster assignments, we compute the number of points $n_c^\dagger$ assigned to each cluster and rough estimates of the cluster-specific parameters $M_c^\dagger,\mathbf{d}_c^\dagger,V_c^\dagger$ for $c\in\{1,\ldots,C\}$. The idea is to choose the hyperparameter values in Equation 50 so that the corresponding prior distributions have modes at $M_c^\dagger,D_c^\dagger,V_c^\dagger$. For the prior distribution of $\mathbf{d}_c$ we use the procedure described in Section 3.5 to set an appropriate value of $\eta_c$. For $c=1,\ldots,C$ we set $\xi_c^M=M_c^\dagger$, $\xi_c^V=I_p$ and $\gamma_c^M=V_c^\dagger$, $\gamma_c^V=I_p$. The choices of $\nu_c$ and of $\xi_c^D,\gamma_c^D$ are crucial, and it may not be desirable to set very large values for these parameters; we set $\nu_c=n_c^\dagger/K^\dagger$ and choose the entries of $\xi_c^D,\gamma_c^D$ to be close to $n_c^\dagger/K^\dagger$, where $K^\dagger$ determines the relative strength of the prior distribution. To select the hyperparameters for the parameter $\boldsymbol{\pi}$ we set $\alpha_c=n_c^\dagger/K^\dagger$ for $c=1,\ldots,C$.

In any Bayesian model, consistency of the posterior distribution is a desirable property. In the following subsection we establish posterior consistency for our mixture model.
Consider the product parameter space $\Theta:=\widetilde{\mathcal{V}}_{n,p}^{C}\times\mathcal{S}_p^{C}\times\mathcal{V}_{p,p}^{C}$. Let $\boldsymbol{\theta}:=\{\theta_c\}_{c=1}^C=\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C$ denote any point in $\Theta$, and let $\boldsymbol{\theta}_0:=\{M_{0c},\mathbf{d}_{0c},V_{0c}\}_{c=1}^C\in\Theta$ be a particular (true) point. Let $S_\pi:=\{(\pi_1,\pi_2,\cdots,\pi_C)\in(0,1)^C:\sum_{c=1}^C\pi_c=1\}$ be the $C$-simplex and $\boldsymbol{\pi}_0\in S_\pi$ a point in it.

Consider the distance metric $d(\cdot,\cdot)$ on the parameter space $\Xi:=\Theta\times S_\pi$, constructed from appropriate distance metrics on the respective component spaces:
$$d(\boldsymbol{\theta},\boldsymbol{\theta}_0):=\sqrt{\sum_{c=1}^C\left[d_{St}^2(M_c,M_{0c})+d_{Eu}^2(\mathbf{d}_c,\mathbf{d}_{0c})+d_{St}^2(V_c,V_{0c})\right]},\qquad
d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)):=\sqrt{d_{Eu}^2(\boldsymbol{\pi},\boldsymbol{\pi}_0)+d^2(\boldsymbol{\theta},\boldsymbol{\theta}_0)} \tag{54}$$
and, likewise,
$$d((X,\boldsymbol{\theta},\boldsymbol{\pi}),(X_0,\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)):=\sqrt{d_{St}^2(X,X_0)+d_{Eu}^2(\boldsymbol{\pi},\boldsymbol{\pi}_0)+d^2(\boldsymbol{\theta},\boldsymbol{\theta}_0)}, \tag{55}$$
where $d_{Eu}$ is the Euclidean distance and $d_{St}$ is the geodesic distance on the Stiefel manifold. Also consider the class of finite mixtures of ML densities $\mathcal{C}_{\mathcal{ML}}:=\{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c):(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi\}$. We alternatively write $f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)$ for $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ when we wish to emphasize the parametrization. $f_{\boldsymbol{\theta},\boldsymbol{\pi}}:\mathcal{V}_{n,p}\to\mathbb{R}^+$ is a family of probability density functions with respect to the normalized Haar measure $[dX]$ on $\mathcal{V}_{n,p}$. Observe that $\Xi$ and $\mathcal{V}_{n,p}$ are complete separable metric spaces, that $(\boldsymbol{\theta},\boldsymbol{\pi})\to f_{\boldsymbol{\theta},\boldsymbol{\pi}}$ is one-to-one, and that $(X,\boldsymbol{\theta},\boldsymbol{\pi})\to f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ is measurable.

The prior $\Pi$ is defined on $\Xi$. Let $X_1,X_2,\cdots,X_N$ be independent and identically distributed with probability density function $f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$. The posterior distribution $\Pi(A\mid X_1,X_2,\cdots,X_N)$ for any measurable subset $A$ of $\Xi$ is given by
$$\Pi(A\mid X_1,X_2,\cdots,X_N)=\frac{\int_A R_n((\boldsymbol{\theta},\boldsymbol{\pi}))\,d\Pi((\boldsymbol{\theta},\boldsymbol{\pi}))}{\int_\Xi R_n((\boldsymbol{\theta},\boldsymbol{\pi}))\,d\Pi((\boldsymbol{\theta},\boldsymbol{\pi}))}, \tag{56}$$
where
$$R_n((\boldsymbol{\theta},\boldsymbol{\pi}))=\prod_{i=1}^N\frac{f(X_i;(\boldsymbol{\theta},\boldsymbol{\pi}))}{f(X_i;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}. \tag{57}$$
In our model, $\Pi((\boldsymbol{\theta},\boldsymbol{\pi}))$ in Equation 56 is defined with respect to the appropriate product measure on $\Theta$ and the Lebesgue measure on $S_\pi$; the prior $\Pi$ is given by Equation 50.

For $\epsilon>0$, define the following neighborhoods of $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$ (corresponding to the true density $f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$) in $\Xi$:
$$N_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\epsilon\},$$
$$KL_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\left\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:\int_{\mathcal{V}_{n,p}} f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)\log\frac{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}\,[dX]<\epsilon\right\},$$
$$U_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\left\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:\left|\int_{\mathcal{V}_{n,p}} g(X) f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)[dX]-\int_{\mathcal{V}_{n,p}} g(X) f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)[dX]\right|<\epsilon\right\},$$
$$W_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\left\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:\left(\int_{\mathcal{V}_{n,p}}\left(\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}\right)^2[dX]\right)^{1/2}<\epsilon\right\}.$$
The weak neighborhood $U_\epsilon$ is understood to require the corresponding inequality for all bounded and continuous functions $g$ on $\mathcal{V}_{n,p}$. Lemma 12.
A finite mixture of ML densities is strictly positive, bounded awayfrom zero and bounded from above. engupta et al./BMMCSM Proof of Lemma 12.
Let $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))\in\mathcal{C}_{\mathcal{ML}}$ be a density function that is a $C$-component mixture of ML distributions parametrized by $(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi$, that is,
$$f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c). \tag{58}$$
Since the density function $f_{\mathcal{ML}}(\cdot\,;\theta_c):\mathcal{V}_{n,p}\to\mathbb{R}^+$ is continuous on the compact manifold $\mathcal{V}_{n,p}$, the extreme value theorem (Rudin et al., 1964) dictates that $f_{\mathcal{ML}}(\cdot\,;\theta_c)$ is bounded and attains at least one minimum and one maximum. In particular, $f_{\mathcal{ML}}(X;\theta_c)$ has the unique modal orientation $M_cV_c^T$ (page 32 in Chikuse (2012)), where $\theta_c=(M_c,\mathbf{d}_c,V_c)$; likewise, it is easy to see that the minimum value of the density occurs at $-M_cV_c^T$. Hence for any $X\in\mathcal{V}_{n,p}$ we have
$$\frac{\operatorname{etr}\!\left(V_cD_cM_c^T(M_cV_c^T)\right)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)}\;\geq\;f_{\mathcal{ML}}(X;\theta_c)\;\geq\;\frac{\operatorname{etr}\!\left(-V_cD_cM_c^T(M_cV_c^T)\right)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)},$$
$$\Longrightarrow\;\frac{\operatorname{etr}(D_c)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)}\;\geq\;f_{\mathcal{ML}}(X;\theta_c)\;\geq\;\frac{\operatorname{etr}(-D_c)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)},$$
$$\Longrightarrow\;\frac{\exp\!\left(\sum_{i=1}^p d_{ic}\right)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)}\;\geq\;f_{\mathcal{ML}}(X;\theta_c)\;\geq\;\frac{\exp\!\left(-\sum_{i=1}^p d_{ic}\right)}{{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)}\;>\;0. \tag{59}$$
Let $UB=\max_{1\le c\le C}\frac{\exp(\sum_{i=1}^p d_{ic})}{{}_0F_1\left(\frac{n}{2};\frac{D_c^2}{4}\right)}$ and $LB=\min_{1\le c\le C}\frac{\exp(-\sum_{i=1}^p d_{ic})}{{}_0F_1\left(\frac{n}{2};\frac{D_c^2}{4}\right)}$. Using Equations 58 and 59, we get
$$0<LB\leq f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))\leq UB<\infty. \tag{60}$$
$\square$

Lemma 13.
Let $(\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)\in\Xi$. Then for any $\epsilon>0$ there exists a $\delta>0$ such that
$$d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta\;\Longrightarrow\;\sup_{X\in\mathcal{V}_{n,p}}\left|\log\frac{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}\right|<\epsilon,$$
where $f(X;(\boldsymbol{\theta},\boldsymbol{\pi})),f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\in\mathcal{C}_{\mathcal{ML}}$.

Proof of Lemma 13.
Let $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))\in\mathcal{C}_{\mathcal{ML}}$, that is, $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))=\sum_{c=1}^C\pi_c f_{\mathcal{ML}}(X;\theta_c)$. Note that for each $c=1,\ldots,C$ the function $f_{\mathcal{ML}}(X;\theta_c)$ is continuous in $X$ and $\theta_c$. Since $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ is a linear combination of the functions $\{f_{\mathcal{ML}}(X;\theta_c)\}_{c=1}^C$ with weights $\{\pi_c\}_{c=1}^C$, it too is continuous in $X\in\mathcal{V}_{n,p}$ and $(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi$. Moreover, from Lemma 12, $f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ is bounded away from $0$ and $\infty$. Hence $\log f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ is continuous in $X\in\mathcal{V}_{n,p}$ and $(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi$, since $\log$ is continuous and well defined over this range.

Let $B_{(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)}\subset\Xi$ be a compact ball around $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$ with strictly positive, bounded radius, and consider the function $\log f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))$ restricted to the domain $\mathcal{V}_{n,p}\times B_{(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)}$. Since $\mathcal{V}_{n,p}\times B_{(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)}$ is compact, the restricted function is uniformly continuous. Therefore, for any $\epsilon>0$ there exists a $\delta>0$ such that
$$d((X_1,\boldsymbol{\theta},\boldsymbol{\pi}),(X_2,\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta\;\Longrightarrow\;|\log f(X_1;(\boldsymbol{\theta},\boldsymbol{\pi}))-\log f(X_2;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))|<\epsilon$$
for arbitrary $(X_1,\boldsymbol{\theta},\boldsymbol{\pi}),(X_2,\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)\in\mathcal{V}_{n,p}\times B_{(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)}$. In particular, setting $X_1=X_2=X$ and using the fact that $d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=d((X,\boldsymbol{\theta},\boldsymbol{\pi}),(X,\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta$ for all $X$ (clear from Equations 54 and 55), we have
$$d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta\;\Longrightarrow\;\sup_{X\in\mathcal{V}_{n,p}}|\log f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))-\log f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))|<\epsilon\;\Longrightarrow\;\sup_{X\in\mathcal{V}_{n,p}}\left|\log\frac{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}\right|<\epsilon. \tag{61}$$
$\square$

Theorem 5. (Weak Consistency)
For our prior $\Pi$ (defined in Definition 2 in Section 3),
$$\Pi(U^c\mid X_1,X_2,\cdots,X_N)\to 0\quad a.s.\ F^{\infty}_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0},$$
for any weak neighborhood $U$ of $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$. Here $F_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$ is the probability distribution function corresponding to the density $f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$, and $F^{\infty}_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}$ is the corresponding infinite product measure.

Proof of Theorem 5.
For every $\epsilon>0$ there exists a $\delta>0$ such that
$$d((\boldsymbol{\theta},\boldsymbol{\pi}),(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))<\delta\;\Longrightarrow\;\sup_{X\in\mathcal{V}_{n,p}}\left|\log\frac{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}\right|<\epsilon\quad\text{[from Lemma 13]},$$
$$\Longrightarrow\;\int_{\mathcal{V}_{n,p}} f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\left|\log\frac{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}\right|[dX]<\epsilon\;\Longrightarrow\;\int_{\mathcal{V}_{n,p}} f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\log\frac{f(X;(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))}{f(X;(\boldsymbol{\theta},\boldsymbol{\pi}))}[dX]<\epsilon.$$
Hence $N_\delta((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\subset KL_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$. It is easy to see that $\Pi$ puts strictly positive measure on $N_\delta((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$ for all $\delta>0$, because by the definition of the distance metric in Equation 54 any neighborhood around $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$ has positive measure. It follows that $\Pi$ puts strictly positive measure on $KL_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$ for all $\epsilon>0$. The theorem then follows from Schwartz (1965). $\square$
For our prior
$\Pi$,
$$\Pi(W^c\mid X_1,X_2,\cdots,X_N)\to 0\quad a.s.\ F^{\infty}_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0},$$
for any Hellinger neighborhood $W$ of $(\boldsymbol{\theta}_0,\boldsymbol{\pi}_0)$.

Proof of Theorem 6.
For any $\epsilon>0$, recall the weak neighborhood
$$U_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))=\left\{(\boldsymbol{\theta},\boldsymbol{\pi})\in\Xi:\left|\int_{\mathcal{V}_{n,p}} g(X) f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)[dX]-\int_{\mathcal{V}_{n,p}} g(X) f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)[dX]\right|<\epsilon\right\}$$
for all bounded and continuous functions $g$ on $\mathcal{V}_{n,p}$. For each $(\boldsymbol{\theta},\boldsymbol{\pi})\in U_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$ choose
$$g(X):=\frac{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}}{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}+\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}}.$$
From Lemma 12 the densities $f_{\boldsymbol{\theta},\boldsymbol{\pi}}$ are bounded away from 0, ensuring a positive lower bound for the denominator; the upper bound for the denominator follows from the upper bound in the same lemma, and a similar argument holds for the numerator. This ensures boundedness of $g(X)$, and continuity follows from the continuity of the densities (Lemma 12). Thus $g(X)$ is a bounded and continuous function. Hence,
$$\left|\int\!\left(\frac{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}}}{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}}+\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}}}\right) f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)[dX]-\int\!\left(\frac{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}}}{\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}}+\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}}}\right) f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)[dX]\right|<\epsilon$$
$$\Longrightarrow\;\left|\int\left(\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}\right)^2[dX]\right|<\epsilon\;\Longrightarrow\;\left(\int\left(\sqrt{f_{\boldsymbol{\theta},\boldsymbol{\pi}}(X)}-\sqrt{f_{\boldsymbol{\theta}_0,\boldsymbol{\pi}_0}(X)}\right)^2[dX]\right)^{1/2}<\epsilon^{1/2}. \tag{62}$$
Hence $U_\epsilon((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))\subset W_{\sqrt{\epsilon}}((\boldsymbol{\theta}_0,\boldsymbol{\pi}_0))$, and since $\epsilon>0$ is arbitrary, the theorem follows from an application of Theorem 5. $\square$

In order to perform Bayesian inference it is important to compute statistics related to the posterior distribution, e.g. the posterior mean or posterior quantiles. The posterior density defined in Equation 52 is intractable in the sense that these quantities cannot be computed analytically, nor can i.i.d. samples be generated directly from the posterior distribution. However, we can design a Gibbs sampling Markov chain to generate samples from the posterior distribution; the Markov chain corresponding to the Gibbs sampler converges to the desired stationary distribution.

To implement the Gibbs sampler we sample the cluster-specific parameters along with the latent indicator $Z_i$ for the cluster assignment of each data point $X_i$, $i=1,2,\cdots,N$. The conditional distribution of $Z_i$ given all other parameters is
$$P(Z_i=c\mid\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C,\boldsymbol{\pi},\{X_i\}_{i=1}^N)=\frac{\pi_c f_{\mathcal{ML}}(X_i\mid M_c,\mathbf{d}_c,V_c)}{\sum_{c'=1}^C\pi_{c'} f_{\mathcal{ML}}(X_i\mid M_{c'},\mathbf{d}_{c'},V_{c'})} \tag{63}$$
for $c=1,2,\cdots,C$. The conditional posterior distribution of $\boldsymbol{\pi}$ is
$$\boldsymbol{\pi}\mid\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C,\{X_i,Z_i\}_{i=1}^N\sim\mathrm{Dir}(\alpha_1+N_1,\alpha_2+N_2,\cdots,\alpha_C+N_C), \tag{64}$$
where $N_c=\sum_{i=1}^N I(Z_i=c)$ for $c=1,\cdots,C$. Given the latent cluster assignments, the conditional posterior distributions of the cluster-specific parameters are independent across clusters. Due to conditional functional conjugacy for $M_c$ and $V_c$, it is straightforward to show that the corresponding full conditional posteriors belong to the ML class of distributions.
In particular,
$$M_c\mid(\mathbf{d}_c,V_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi})\sim\mathcal{ML}\left(\cdot\,;(S_G^M,S_G^D,S_G^V)\right), \tag{65}$$
$$V_c\mid(M_c,\mathbf{d}_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi})\sim\mathcal{ML}\left(\cdot\,;(S_H^M,S_H^D,S_H^V)\right), \tag{66}$$
where $(S_G^M,S_G^D,S_G^V)$ and $(S_H^M,S_H^D,S_H^V)$ are the singular value decompositions of the matrices $(D_cV_c^T N_c\bar{X}_c^T+G_c)$ and $(D_cM_c^T N_c\bar{X}_c+H_c)$, respectively. Observe that
$$\bar{X}_c=\frac{1}{N_c}\sum_{i=1}^N X_i I(Z_i=c).$$
Efficient sampling from the ML distribution is done using the algorithm developed in Hoff (2009).

The conditional posterior distribution of $\mathbf{d}_c$ given the other parameters has density
$$f(\mathbf{d}_c\mid M_c,V_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi})\propto\frac{\exp\!\left((\nu_c\eta_c^T+N_c\phi_c^T)\,\mathbf{d}_c\right)}{\left[{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)\right]^{\nu_c+N_c}}\,I(\mathbf{d}_c\in\mathcal{S}_p),$$
that is,
$$\mathbf{d}_c\mid(M_c,V_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi})\sim\mathrm{IMDY}\left(\cdot\,;(\nu_c+N_c),\ \frac{\nu_c\eta_c}{\nu_c+N_c}+\frac{N_c\phi_c}{\nu_c+N_c}\right), \tag{67}$$
where $\phi_c=(\phi_{c1},\phi_{c2},\cdots,\phi_{cp})$ and $\phi_{cj}$ is the $j$-th diagonal element of the matrix $M_c^T\bar{X}_c V_c$ for $j=1,2,\cdots,p$. Note that this can also be verified from Equation 46 in Section 3.3. Because of the non-standard form of the posterior distribution given in Equation 67, sampling $\mathbf{d}_c$ is challenging.

The density of the full conditional distribution of $d_{cj}$, the $j$-th diagonal entry of $D_c$, for $j=1,2,\cdots,p$, is
$$f\!\left(d_{cj}\mid\mathbf{d}_{-cj},M_c,V_c,\{X_i,Z_i\}_{i=1}^N,\boldsymbol{\pi}\right)\propto\frac{\exp\!\left(d_{cj}(\nu_c\eta_{cj}+N_c\phi_{cj})\right)}{\left[{}_0F_1\!\left(\frac{n}{2};\frac{D_c^2}{4}\right)\right]^{\nu_c+N_c}}\,I\!\left(d_{c(j+1)}<d_{cj}<d_{c(j-1)}\right). \tag{68}$$
Let $F_{cj}(\cdot)$ denote the corresponding conditional distribution function of $d_{cj}$. We describe the detailed implementation of sampling from this conditional distribution in the paragraph following this subsection. For a generic representation of the conditional distributions of $d_{cj}$ for all $j=1,\cdots,p$, we define $d_{c0}=\infty$ and $d_{c(p+1)}=0$, and write
$$\mathbf{d}_{-cj}:=\{d_{c1},\cdots,d_{c(j-1)},d_{c(j+1)},\cdots,d_{cp}\}. \tag{69}$$
We have designed an efficient scheme to sample $\mathbf{d}_c$ using the set of $p$ conditional distributions given in Equation 68. Observe that the support of the distribution of $d_{c1}$ is $[d_{c2},\infty)$, while the supports of the others are bounded. Note also that the posterior distribution of $d_{c1}$ is unimodal (see Theorem 3), and we exploit this fact to design an efficient sampler for $d_{c1}$. A description of the sampling steps is given in Algorithm 1 below.

Note that, because of the log-concavity of the conditional distribution of $d_{cj}$, an adaptive rejection sampler (ARS) could in principle be used. However, the standard ARS algorithm cannot be immediately implemented in this context because of the computations involving the ${}_0F_1(\cdot)$ function, so we reserve this development for future work.

Gibbs algorithm
The following algorithm outlines the steps of the Gibbs sampler, giving the full conditional distributions of the parameters at the $k$-th step based on the samples drawn at the $(k-1)$-th step.
Algorithm 1: MCMC (Gibbs sampling) scheme

for c = 1, 2, ..., C do
    initialize the MCMC chain with M_c^(0), D_c^(0), V_c^(0) and pi^(0)
end for
k = 0
repeat
    k = k + 1
    for i = 1, 2, ..., N do
        Z_i^(k) ~ Categorical( . ; {1, 2, ..., C}, pi^(k-1), {M_c^(k-1), D_c^(k-1), V_c^(k-1)}_{c=1}^C )
    end for
    for c = 1, 2, ..., C do
        N_c^(k) = sum_{i=1}^N I(Z_i^(k) = c)
        Xbar_c^(k) = (1 / N_c^(k)) sum_{i=1}^N X_i I(Z_i^(k) = c)
        M_c^(k) ~ ML( . ; ( (S_G^M)^(k-1), (S_G^D)^(k-1), (S_G^V)^(k-1) ) )
        V_c^(k) ~ ML( . ; ( (S_H^M)^(k), (S_H^D)^(k-1), (S_H^V)^(k-1) ) )
        d_cj^(k) ~ F_cj( . ; (d_{-cj})^(k), M_c^(k), V_c^(k), {X_i, Z_i}_{i=1}^N, pi^(k-1) )  for all j = 1, 2, ..., p
    end for
    pi^(k) ~ Dir( . ; alpha_1 + N_1^(k), ..., alpha_C + N_C^(k) )
until convergence

Note that $(\mathbf{d}_{-cj})^{(k)}$ is given by the set
$$(\mathbf{d}_{-cj})^{(k)}:=\left\{(d_{c1})^{(k)},\cdots,(d_{c(j-1)})^{(k)},(d_{c(j+1)})^{(k-1)},\cdots,(d_{cp})^{(k-1)}\right\}.$$
The stationary distribution of the Gibbs sampling Markov chain is the posterior distribution corresponding to Equation 52, and convergence to this stationary distribution does not depend on the choice of the initial point. However, in order to run the MCMC method, Algorithm 1 must be initialized with certain values (e.g. $M_c^{(0)}, D_c^{(0)}, V_c^{(0)}$ and $\boldsymbol{\pi}^{(0)}$). In practice, especially for large-scale datasets, a poor choice of initial values can lead to slow convergence of the MCMC method. To obtain a reasonable initial point, we first run hierarchical clustering (Lattin, Carroll and Green, 2003; Rokach and Maimon, 2005) on the entire dataset with a fixed number of clusters $C$ (for selection of the optimal $C$ see Section 4.5) to obtain initial cluster assignments for the data points. Based on this initial assignment, we adopt the maximum likelihood technique described in Chikuse (2012) to obtain initial values of the cluster-specific parameters. This initialization procedure has worked well for our simulated datasets. We note that the selection of the initial point may not be crucial for small datasets; for large datasets, however, a suitable initial point can save a significant amount of time by reducing the number of burn-in iterations.
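To make the flow of Algorithm 1 concrete, the following is a minimal Python skeleton of one Gibbs sweep. The component samplers (`sample_ML`, `sample_d_conditional`) and the ML log-density (`log_fML`) are placeholders standing in for the Hoff (2009) sampler and the rejection sampler described below; they are assumptions of this sketch, not implementations of the paper's samplers.

```python
import numpy as np

def gibbs_sweep(X, state, alpha, sample_ML, sample_d_conditional, log_fML):
    """One sweep of the Gibbs sampler in Algorithm 1.
    X: array (N, n, p); state: dict with lists 'M', 'd', 'V' (length C) and vector 'pi'."""
    N, C = X.shape[0], len(state["M"])
    # 1. Update latent assignments Z_i from their categorical full conditionals (Eq. 63).
    logw = np.stack([np.log(state["pi"][c])
                     + np.array([log_fML(X[i], state["M"][c], state["d"][c], state["V"][c])
                                 for i in range(N)])
                     for c in range(C)], axis=1)
    probs = np.exp(logw - logw.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    Z = np.array([np.random.choice(C, p=probs[i]) for i in range(N)])
    # 2. Update cluster-specific parameters given the assignments.
    counts = np.zeros(C)
    for c in range(C):
        members = X[Z == c]
        counts[c] = len(members)
        xbar = members.mean(axis=0) if len(members) else np.zeros(X.shape[1:])
        state["M"][c] = sample_ML("M", c, xbar, counts[c], state)        # Eq. 65
        state["V"][c] = sample_ML("V", c, xbar, counts[c], state)        # Eq. 66
        state["d"][c] = sample_d_conditional(c, xbar, counts[c], state)  # Eq. 67-68
    # 3. Update mixture weights from the Dirichlet full conditional (Eq. 64).
    state["pi"] = np.random.dirichlet(alpha + counts)
    return state, Z
```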
Efficient Rejection Sampler

In this section we describe the rejection sampling procedure for the conditional distribution of $(d_1\mid(d_2,\cdots,d_p))$ when $\mathbf{d}\sim\mathrm{IMDY}(\cdot\,;\nu,\eta)$ for some $\nu>0$ and $\max_{2\le j\le p}\eta_j<1$, where $\eta=(\eta_1,\ldots,\eta_p)$. Let $m$ be the mode of the conditional distribution $g(\cdot):=g(\cdot\,;\nu,\eta\mid(d_2,\ldots,d_p))$ of the variable $d_1$ given $(d_2,\ldots,d_p)$ when $\eta_1>0$; in case $\eta_1<0$, we explicitly set $m$ to be $0$.

Using the property of the conditional distribution described in Lemma 11, we compute a critical point $RT_{crit}$ so that
$$P\!\left(d_1>RT_{crit}\mid(d_2,\cdots,d_p),\{X_j\}_{j=1}^N\right)<\epsilon$$
for a small choice of $\epsilon$; this effectively restricts the support of $d_1$ to the bounded interval $(0,RT_{crit}]$. We then employ an efficient rejection sampling scheme to sample from the desired distribution as follows.

Let $\delta=RT_{crit}/N_{bin}$, where $N_{bin}$ is the total number of partitions of the interval $(0,RT_{crit}]$, and let $k=[m/\delta]+1$, where $[m/\delta]$ denotes the greatest integer less than or equal to $m/\delta$. Now define the function
$$\tilde{g}(x):=\sum_{j=1}^{k-1} g(j\delta)\,I_{((j-1)\delta,\,j\delta]}(x)+g(m)\,I_{((k-1)\delta,\,k\delta]}(x)+\sum_{j=k+1}^{N_{bin}} g((j-1)\delta)\,I_{((j-1)\delta,\,j\delta]}(x).$$
Note that $\tilde{g}(x)\ge g(x)$ for all $x\in(0,RT_{crit}]$, since $g(\cdot)$ is a unimodal log-concave function with maximum at $m$. To sample from the distribution with density proportional to $\tilde{g}(\cdot)$ we consider $p_j=q_j/\sum_{j=1}^{N_{bin}} q_j$ for $j=1,2,\cdots,N_{bin}$, where
$$q_j=\begin{cases} g(j\delta) & \text{if } 1\le j<[m/\delta]+1,\\ g(m) & \text{if } j=[m/\delta]+1,\\ g((j-1)\delta) & \text{if } [m/\delta]+1<j\le N_{bin}. \end{cases}$$
The steps of the rejection sampler are given below.
– Sample $Z$ from the discrete distribution with support $\{1,2,\ldots,N_{bin}\}$ and corresponding probabilities $\{p_j\}_{j=1}^{N_{bin}}$.
– Sample $y\sim\mathrm{Uniform}((Z-1)\delta,Z\delta)$.
– Sample $U\sim\mathrm{Uniform}(0,1)$ and accept $y$ if $U\le g(y)/\tilde{g}(y)$.

Note that the efficiency of the sampler increases when we choose larger values of $N_{bin}$.
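A minimal Python sketch of this binned rejection sampler is given below. Here `g` is any unimodal, log-concave target known up to a constant (a toy example is used), and `m`, `rt_crit`, `n_bin` are assumed to be supplied by the procedure above.

```python
import numpy as np

def binned_rejection_sampler(g, m, rt_crit, n_bin, rng=np.random.default_rng()):
    """Draw one sample from a unimodal, log-concave density g (known up to a constant)
    supported on (0, rt_crit], using the piecewise-constant envelope described above."""
    delta = rt_crit / n_bin
    k = int(m // delta) + 1                     # index of the bin containing the mode
    # Envelope value per bin: right endpoint left of the mode, g(m) at the modal bin,
    # left endpoint right of the mode.
    heights = np.array([g(j * delta) if j < k else (g(m) if j == k else g((j - 1) * delta))
                        for j in range(1, n_bin + 1)])
    probs = heights / heights.sum()
    while True:
        z = rng.choice(n_bin, p=probs) + 1      # bin index in {1, ..., n_bin}
        y = rng.uniform((z - 1) * delta, z * delta)
        if rng.uniform() <= g(y) / heights[z - 1]:
            return y

# Toy usage: an unnormalized Gamma(3, 1)-shaped density with mode m = 2.
g = lambda x: x**2 * np.exp(-x)
draw = binned_rejection_sampler(g, m=2.0, rt_crit=20.0, n_bin=200)
```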
Posterior summary

There are multiple ways to summarize the posterior distribution for the estimates of the parameters. We choose to use the parametrization given in Equation 1 for the posterior summary, since it allows us to report the estimation error in a more interpretable way. Using $M,\mathbf{d},V$ it is challenging to report the error, as $M$ and $V$ lie on a non-Euclidean space, and generating summaries of results for parameters on $\mathcal{V}_{n,p}$ is not straightforward; a generalized notion of mean, such as the Karcher mean, could be investigated. Note that we can directly compare the true $F$ and $\hat{F}$, as there is no constraint on the elements of $F$; such a direct comparison is not immediately possible for the $M,\mathbf{d},V$ parametrization given in Equation 2, which is adopted mainly to achieve computational tractability.

In order to identify the optimal number of clusters we use the Deviance Information Criterion for Bayesian model selection (
DIC) (Gelman et al., 2003; Spiegelhalter et al., 2002). It has been successfully used as a model selection criterion in various Bayesian models (Berg, Meyer and Yu, 2004; François and Laval, 2011; Khare, Pal and Su, 2017). To explain the DIC criterion in the context of the current model, let $\theta^{(C)}=\{M_c,\mathbf{d}_c,V_c\}_{c=1}^C$ denote all the parameter vectors, and define the deviance function $Dev(\theta^{(C)}):=-2\log L(\theta^{(C)})$, where $L(\cdot)$ is the likelihood function defined in Equation 48. Let $\{\theta^{(C,i)}\}_{i=1}^S$ be $S$ values of the parameters sampled from the corresponding posterior distribution in Equation 52. The DIC score for a given choice of $C$ is computed as
$$DIC(C):=\overline{Dev}(C)+\frac{1}{2(S-1)}\sum_{i=1}^S\left(Dev(\theta^{(C,i)})-\overline{Dev}(C)\right)^2,\qquad \overline{Dev}(C)=\frac{1}{S}\sum_{i=1}^S Dev(\theta^{(C,i)})$$
(Gelman et al. (2003), page 185). To infer the number of clusters, samples are generated from different Markov chains assuming different values of $C$, and the optimal number of clusters is given by $C_{opt}=\operatorname*{argmin}_C DIC(C)$. For a detailed discussion of DIC see DeIorio and Robert (2002); Gelman et al. (2003); Titterington et al. (2006). In the context of mixture models specifically, DeIorio and Robert (2002) described possible limitations of the standard DIC criterion. Following the alternative criteria proposed in Titterington et al. (2006), we considered several alternative DIC score functions as defined there. We conducted an extensive numerical study with several simulated data sets and found that one of these alternative score functions outperforms the other criteria in terms of efficiency for our model; its computation also takes significantly less time than that of the standard DIC. Therefore one may use
this alternative DIC score instead of the standard DIC whenever computation of the standard DIC takes significantly longer, particularly for large datasets. Additional details, along with a table comparing the performance of the different DIC scores in our simulation study, are given in Section 5, where we observe that the chosen score identifies the correct number of clusters in most cases for our model.

In this section we develop an iterative optimization technique to obtain point estimates of the parameters in the model given by Equation 49. Specifically, we employ the expectation-maximization (EM) algorithm (Dempster, Laird and Rubin, 1977) to obtain the mode of the posterior distribution of the parameters in the model specified by Equation 49. The algorithm is computationally fast and can be useful for obtaining rough estimates of the parameters, especially for large datasets. We also use this algorithm to select appropriate values of the hyperparameters (see Section 3.5) for the MCMC-based posterior inference, and the resulting rough estimates can help find suitable initial values for the MCMC procedure, particularly when analyzing massive data. To describe the procedure, consider the complete-data log-likelihood combined with the log prior (from Equation 52):
$$\sum_{i=1}^N\sum_{c=1}^C\left[Z_{ic}\log f_{\mathcal{ML}}(X_i\mid\theta_c)+Z_{ic}\log\pi_c\right]+\sum_{c=1}^C\left[\alpha_c\log\pi_c+\operatorname{tr}\!\left(\nu_c V_cD_cM_c^T\Psi\right)-\nu_c\log\left({}_0F_1\!\Big(\frac{n}{2};\frac{D_c^2}{4}\Big)\right)\right], \tag{70}$$
where $Z_{ic}=I(Z_i=c)$ and
$$\log f_{\mathcal{ML}}(X_i\mid\theta_c)=\operatorname{tr}\!\left((V_cD_cM_c^T)X_i\right)-\log\left({}_0F_1\!\Big(\frac{n}{2};\frac{D_c^2}{4}\Big)\right).$$
We start the iterative algorithm at an initial point $(\boldsymbol{\theta}^{(0)},\boldsymbol{\pi}^{(0)})$ and construct a sequence of parameter values $\{(\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)})\}_{t\ge 0}$, where we move from $(\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)})$ to $(\boldsymbol{\theta}^{(t+1)},\boldsymbol{\pi}^{(t+1)})$ using the "E-step" and "M-step" described below.

E-step:
We construct the objective function
$$Q(\boldsymbol{\theta},\boldsymbol{\pi}\mid\mathbf{X},\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)}):=\sum_{i=1}^N\sum_{c=1}^C\left[\langle Z_{ic}\rangle\log f_{\mathcal{ML}}(X_i\mid\theta_c)+\langle Z_{ic}\rangle\log\pi_c\right]+\sum_{c=1}^C\left[\alpha_c\log\pi_c+\operatorname{tr}\!\left(\nu_c V_cD_cM_c^T\Psi\right)-\nu_c\log\left({}_0F_1\!\Big(\frac{n}{2};\frac{D_c^2}{4}\Big)\right)\right], \tag{71}$$
where
$$\langle Z_{ic}\rangle:=E(Z_{ic}\mid\mathbf{X},\boldsymbol{\theta}^{(t)})=\frac{\pi_c^{(t)} f_{\mathcal{ML}}(X_i\mid\theta_c^{(t)})}{\sum_{k=1}^C\pi_k^{(t)} f_{\mathcal{ML}}(X_i\mid\theta_k^{(t)})}.$$

M-step:
In this step we maximize $Q(\boldsymbol{\theta},\boldsymbol{\pi}\mid\mathbf{X},\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)})$ with respect to $\boldsymbol{\theta},\boldsymbol{\pi}$. It is easy to see that $Q$ is maximized in $\boldsymbol{\pi}$ at $\boldsymbol{\pi}=\hat{\boldsymbol{\pi}}$, whose $c$-th component is
$$\hat{\pi}_c=\frac{\alpha_c+\sum_{i=1}^N\langle Z_{ic}\rangle}{N+\sum_{c'=1}^C\alpha_{c'}}\quad\text{for }c=1,\ldots,C.$$
Recall that $\boldsymbol{\theta}:=\{\theta_c\}_{c=1}^C$ with $\theta_c=\{M_c,\mathbf{d}_c,V_c\}$. Hence $Q(\boldsymbol{\theta},\boldsymbol{\pi}\mid\mathbf{X},\boldsymbol{\theta}^{(t)},\boldsymbol{\pi}^{(t)})$ can be maximized by maximizing, separately for each $c=1,\ldots,C$, the function
$$\operatorname{tr}\!\left(V_cD_cM_c^T\left[\widetilde{X}^{(c)}+\Psi\right]\right)-\nu_c\log\left({}_0F_1\!\Big(\frac{n}{2};\frac{D_c^2}{4}\Big)\right) \tag{72}$$
with respect to the variables $M_c\in\widetilde{\mathcal{V}}_{n,p}$, $V_c\in\mathcal{V}_{p,p}$ and $\mathbf{d}_c\in\mathcal{S}_p$, where
$$\widetilde{X}^{(c)}=\frac{\sum_{i=1}^N\langle Z_{ic}\rangle X_i}{\sum_{i=1}^N\langle Z_{ic}\rangle}.$$
Let $\widetilde{M}^{(c)},\widetilde{D}^{(c)}$ and $\widetilde{V}^{(c)}$ be the unique singular value decomposition (Chikuse, 2012) of the matrix $[\widetilde{X}^{(c)}+\Psi]$, let $\widetilde{\mathbf{d}}^{(c)}$ be the diagonal elements of $\widetilde{D}^{(c)}$, and let $\widehat{\mathbf{d}}_c$ be the solution of the system of equations $h(\widehat{\mathbf{d}}_c)=\widetilde{\mathbf{d}}^{(c)}$, where
$$h(\mathbf{d}):=\left(\frac{\partial}{\partial\mathbf{d}}\,{}_0F_1\!\Big(\frac{n}{2};\frac{D^2}{4}\Big)\right)\Big/\,{}_0F_1\!\Big(\frac{n}{2};\frac{D^2}{4}\Big).$$
The standard Newton-Raphson (NR) method (Wright and Nocedal, 1999) can be used to solve $h(\widehat{\mathbf{d}}_c)=\widetilde{\mathbf{d}}^{(c)}$ for $\widehat{\mathbf{d}}_c$; for the case $p=2$ we derive the explicit expression of the Hessian matrix and show the NR steps for solving for $\widehat{\mathbf{d}}_c$ in Section 4.6.1. From Chikuse (2012) we get that the objective function in Equation 72 is maximized at $\widehat{M}_c=\widetilde{M}^{(c)}$, $\widehat{V}_c=\widetilde{V}^{(c)}$ and $\widehat{D}_c$, where $\widehat{D}_c$ is the diagonal matrix with diagonal elements $\widehat{\mathbf{d}}_c$.

Finally, we move to the values $(\boldsymbol{\theta}^{(t+1)},\boldsymbol{\pi}^{(t+1)})$ by setting
$$\boldsymbol{\theta}^{(t+1)}:=\left\{\left(\widehat{M}_c,\widehat{\mathbf{d}}_c,\widehat{V}_c\right)\right\}_{c=1}^C\quad\text{and}\quad\boldsymbol{\pi}^{(t+1)}:=\hat{\boldsymbol{\pi}}.$$
We stop the iteration when convergence is achieved, i.e. when the parameter values in two consecutive iterations are sufficiently close.
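For illustration, a minimal Python sketch of one EM iteration is given below. The ML log-density (`log_fML`) and the solver for $h(\mathbf{d})=\widetilde{\mathbf{d}}$ (`solve_h`, e.g. via the Newton-Raphson scheme of Section 4.6.1) are user-supplied placeholders, not part of any library.

```python
import numpy as np

def em_step(X, params, alpha, Psi, nu, log_fML, solve_h):
    """One EM iteration. X: (N, n, p); params: dict with lists 'M', 'd', 'V' (length C)
    and vector 'pi'. log_fML and solve_h are assumed callables."""
    N, C = X.shape[0], len(params["M"])
    # E-step: posterior responsibilities <Z_ic>.
    logw = np.stack([np.log(params["pi"][c])
                     + np.array([log_fML(X[i], params["M"][c], params["d"][c], params["V"][c])
                                 for i in range(N)]) for c in range(C)], axis=1)
    resp = np.exp(logw - logw.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: mixture weights, then per-cluster SVD-based update (Eq. 72).
    params["pi"] = (alpha + resp.sum(axis=0)) / (N + alpha.sum())
    for c in range(C):
        Xtilde = np.tensordot(resp[:, c], X, axes=1) / resp[:, c].sum()
        U, s, Vt = np.linalg.svd(Xtilde + Psi, full_matrices=False)
        params["M"][c], params["V"][c] = U, Vt.T
        params["d"][c] = solve_h(s, nu[c])   # solve h(d) = d-tilde (Newton-Raphson)
    return params, resp
```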
For this subsection we omit the subscript $c$ for ease of notation. Now observe that
$$
{}_0F_1\!\left(\tfrac{p}{2}+2k;\ \tfrac{d_1^2+d_2^2}{4}\right)=\Gamma\!\left(\tfrac{p}{2}+2k\right)\left(\tfrac{\sqrt{d_1^2+d_2^2}}{2}\right)^{-\left(\frac{p}{2}+2k-1\right)} I_{\frac{p}{2}+2k-1}\!\left(\sqrt{d_1^2+d_2^2}\right),
$$
where $I_\nu(\cdot)$ is the modified Bessel function of the first kind of order $\nu$. Taking the partial derivative with respect to $d_1$ we have
$$
\frac{\partial}{\partial d_1}\,{}_0F_1\!\left(\tfrac{p}{2}+2k;\ \tfrac{d_1^2+d_2^2}{4}\right)=\frac{d_1}{2\left(\tfrac{p}{2}+2k\right)}\,{}_0F_1\!\left(\tfrac{p}{2}+2k+1;\ \tfrac{d_1^2+d_2^2}{4}\right).
$$
Consider the expression for the hypergeometric function of matrix argument with a $2\times 2$ argument,
$$
{}_0F_1\!\left(\tfrac{p}{2};\ \tfrac{D^2}{4}\right)=\sum_{k=0}^{\infty}\frac{\left(\tfrac{d_1^2}{4}\right)^k\left(\tfrac{d_2^2}{4}\right)^k}{\left(\tfrac{p-1}{2}\right)_k\left(\tfrac{p}{2}\right)_{2k}\,k!}\ {}_0F_1\!\left(\tfrac{p}{2}+2k;\ \tfrac{d_1^2+d_2^2}{4}\right). \tag{73}
$$
This representation is also useful because it gives a good handle on the error incurred when the infinite series is truncated after a finite number of terms. Let us use the notation
$$
T_1=\frac{\left(\tfrac{d_1^2}{4}\right)^k\left(\tfrac{d_2^2}{4}\right)^k}{\left(\tfrac{p-1}{2}\right)_k\left(\tfrac{p}{2}\right)_{2k}\,k!},\quad
T_2={}_0F_1\!\left(\tfrac{p}{2}+2k;\tfrac{d_1^2+d_2^2}{4}\right),\quad
T_3={}_0F_1\!\left(\tfrac{p}{2}+2k+1;\tfrac{d_1^2+d_2^2}{4}\right),\quad
T_4={}_0F_1\!\left(\tfrac{p}{2}+2k+2;\tfrac{d_1^2+d_2^2}{4}\right).
$$
Differentiating Equation 73 term by term, we derive
$$
\frac{\partial}{\partial d_1}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{2k}{d_1}T_2+\frac{d_1}{2(\tfrac{p}{2}+2k)}T_3\right\},\qquad
\frac{\partial}{\partial d_2}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{2k}{d_2}T_2+\frac{d_2}{2(\tfrac{p}{2}+2k)}T_3\right\},
$$
$$
\frac{\partial^2}{\partial d_1\partial d_2}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{4k^2}{d_1d_2}T_2+\frac{k\,(d_1^2+d_2^2)}{d_1d_2\,(\tfrac{p}{2}+2k)}T_3+\frac{d_1}{2(\tfrac{p}{2}+2k)}\frac{d_2}{2(\tfrac{p}{2}+2k+1)}T_4\right\},
$$
$$
\frac{\partial^2}{\partial d_1^2}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{2k(2k-1)}{d_1^2}T_2+\frac{4k+1}{2(\tfrac{p}{2}+2k)}T_3+\frac{d_1}{2(\tfrac{p}{2}+2k)}\frac{d_1}{2(\tfrac{p}{2}+2k+1)}T_4\right\},
$$
$$
\frac{\partial^2}{\partial d_2^2}\,{}_0F_1\!\left(\tfrac{p}{2};\tfrac{D^2}{4}\right)=\sum_{k=0}^\infty T_1\left\{\frac{2k(2k-1)}{d_2^2}T_2+\frac{4k+1}{2(\tfrac{p}{2}+2k)}T_3+\frac{d_2}{2(\tfrac{p}{2}+2k)}\frac{d_2}{2(\tfrac{p}{2}+2k+1)}T_4\right\}. \tag{74}
$$
Denoting $R(d_1,d_2)={}_0F_1\!\left(\tfrac{n}{2};\tfrac{D^2}{4}\right)$, the Hessian matrix is written with the help of the set of Equations in 74 as
$$H=\begin{bmatrix} H_{11} & H_{12}\\ H_{21} & H_{22}\end{bmatrix},$$
where
$$H_{11}=\frac{\partial}{\partial d_1}\!\left(\frac{\partial R/\partial d_1}{R}\right)=\frac{\partial^2 R/\partial d_1^2}{R}-\left(\frac{\partial R/\partial d_1}{R}\right)^2,\qquad
H_{12}=\frac{\partial}{\partial d_2}\!\left(\frac{\partial R/\partial d_1}{R}\right)=\frac{\partial^2 R/(\partial d_1\partial d_2)}{R}-\left(\frac{\partial R/\partial d_1}{R}\right)\!\left(\frac{\partial R/\partial d_2}{R}\right),$$
$$H_{21}=H_{12},\qquad
H_{22}=\frac{\partial}{\partial d_2}\!\left(\frac{\partial R/\partial d_2}{R}\right)=\frac{\partial^2 R/\partial d_2^2}{R}-\left(\frac{\partial R/\partial d_2}{R}\right)^2,$$
and $R$ is written in place of $R(d_1,d_2)$ for brevity. The update equation for the NR method, for solving $h(\widehat{\mathbf{d}})=\widetilde{\mathbf{d}}^{(c)}$, is then
$$\begin{bmatrix}\widehat{d}_1^{\,new}\\ \widehat{d}_2^{\,new}\end{bmatrix}=\begin{bmatrix}\widehat{d}_1^{\,old}\\ \widehat{d}_2^{\,old}\end{bmatrix}-H^{-1}_{\left(\widehat{d}_1^{\,old},\widehat{d}_2^{\,old}\right)}\left(h\!\left(\widehat{d}_1^{\,old},\widehat{d}_2^{\,old}\right)-\widetilde{\mathbf{d}}^{(c)}\right).$$
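A short Python sketch of this Newton-Raphson solve is given below; the gradient and Hessian of $\log {}_0F_1$ are assumed to be supplied as callables, e.g. via truncation of the series in Equations 73-74.

```python
import numpy as np

def newton_raphson_d(d_init, d_tilde, grad_log_0F1, hess_log_0F1, tol=1e-8, max_iter=100):
    """Solve h(d) = d_tilde for d = (d1, d2) by Newton-Raphson, where h is the gradient
    of log 0F1(n/2; D^2/4) and hess_log_0F1 returns its Hessian H."""
    d = np.asarray(d_init, dtype=float)
    for _ in range(max_iter):
        residual = grad_log_0F1(d) - np.asarray(d_tilde)
        if np.linalg.norm(residual) < tol:
            break
        d = d - np.linalg.solve(hess_log_0F1(d), residual)
    return d
```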
5. Experiments with simulated data
We carry out two sets of simulations to investigate the clustering framework based on our proposed Bayesian mixture model. To evaluate the performance of the clustering method we consider the following three criteria:
(a) identification of the correct number of clusters,
(b) correct assignment of each data point to the appropriate cluster, evaluated through several well-established external clustering metrics, and
(c) accuracy in the estimation of the cluster-specific parameters.

To evaluate criterion (a), we consider two simulation scenarios in which the true numbers of clusters are three and four, respectively. In each case we generate 50 individual datasets, with 400 and 500 data points per dataset, respectively. For the rest of this section we refer to these two scenarios as simulation (i) and simulation (ii). In simulation (i), the three cluster-specific parameter matrices are fixed in advance.

We select appropriate values of the hyperparameters for the prior distributions in Equation 50 empirically, using the procedure developed in Section 4.2; the value of $K^\dagger$ is set to 20, so that the prior concentration corresponds to roughly 5% of the size of the respective cluster. In general, a good initial point expedites the practical convergence of the MCMC procedure, so we use the procedure described in Section 4.6 to set the initial values of the parameters $M$, $\mathbf{d}$ and $V$.

The optimal number of clusters is chosen by the DIC criterion described in Section 4.5. We performed numerous experiments with several score functions for DIC (Titterington et al., 2006) apart from the standard definition of DIC, running our model with several candidate numbers of clusters (starting from 2) for simulations (i) and (ii). We present a summary of the results (see Table 1) for the standard DIC and the alternative DIC score: in almost all cases (94% of datasets using the standard DIC and 95% using the alternative score) we are able to select the correct number of clusters, and the computation time for the alternative score is significantly less than that of the standard DIC.

Table 1: Number of datasets in which the correct number of clusters is identified with the standard and the alternative DIC scores.

We notice that in the simulations, whenever the model fails to identify the true number of clusters, it always overestimates the number of clusters. Realizing this, we designed a penalized version of the standard DIC criterion, with which we significantly improve the identification of the correct model.
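As a concrete illustration of the DIC(C) computation used above (Section 4.5), the following is a minimal Python sketch; the posterior deviance draws are assumed to be available from the MCMC output.

```python
import numpy as np

def dic_score(deviance_samples):
    """DIC(C) = mean deviance + p_D, where p_D is half the sample variance of the
    deviance draws Dev(theta^(C,i)) = -2 log L(theta^(C,i)) over S posterior samples."""
    dev = np.asarray(deviance_samples, dtype=float)
    return dev.mean() + dev.var(ddof=1) / 2.0

# Model selection: fit the mixture for each candidate C and keep the minimizer, e.g.
# dic_by_C = {C: dic_score(devs) for C, devs in posterior_deviances.items()}
# C_opt = min(dic_by_C, key=dic_by_C.get)
```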
Common metrics for evaluating clustering methods
It is important to assess the assignment of each data point to the appropriate cluster: even if the number of clusters is correct, the performance of a clustering method may be poor because of incorrect cluster assignments. To evaluate clustering efficiency one can compute several external cluster evaluation metrics; in this study we compute purity, normalized mutual information (NMI), the Rand index (RI), the adjusted Rand index (ARI), the Jaccard index (JI) and the F-measure (Rand, 1971; Vinh, Epps and Bailey, 2010).

We first set up some notation to introduce these metrics briefly. Assume we have $N$ data points $\{X_j\}_{j=1}^N$. The set of $C$ true classes is $\mathcal{A}=\{A_1,A_2,\cdots,A_C\}$, where $A_c=\{j:X_j\text{ belongs to the }c\text{-th class}\}$ for $c=1,2,\cdots,C$, and the clustering method returns $K$ clusters $\mathcal{B}=\{B_1,B_2,\cdots,B_K\}$, where $B_k=\{j:X_j\text{ is assigned to the }k\text{-th cluster}\}$ for $k=1,2,\cdots,K$. We use $|\cdot|$ to denote the number of elements in a set.

• Purity is defined (see Rand (1971); Vinh, Epps and Bailey (2010)) as
$$\mathrm{purity}(\mathcal{A},\mathcal{B})=\frac{\sum_k\max_c|B_k\cap A_c|}{N}.$$
It is the simplest evaluation measure: each cluster is assigned to the class that is most prevalent in it, and the accuracy of the assignment is measured by counting the number of correctly assigned data points and dividing by the total number of data points. Purity lies between 0 and 1, and a perfect clustering has purity 1.

• NMI is an information-theoretic measure defined as
$$\mathrm{NMI}(\mathcal{A},\mathcal{B})=\frac{I(\mathcal{A},\mathcal{B})}{[H(\mathcal{A})+H(\mathcal{B})]/2},$$
where $I(\cdot,\cdot)$ and $H(\cdot)$ stand for mutual information and entropy, respectively, with
$$I(\mathcal{A},\mathcal{B})=\sum_k\sum_c\frac{|B_k\cap A_c|}{N}\log\frac{N|B_k\cap A_c|}{|B_k|\,|A_c|},\qquad H(\mathcal{B})=-\sum_k\frac{|B_k|}{N}\log\frac{|B_k|}{N},\qquad H(\mathcal{A})=-\sum_c\frac{|A_c|}{N}\log\frac{|A_c|}{N}.$$
NMI reaches its maximum value 1 only when the two partitions $\mathcal{A}$ and $\mathcal{B}$ are in perfect one-to-one correspondence.

• RI is written as
$$RI=\frac{TP+TN}{TP+FP+TN+FN},$$
where $TP$ is the number of true positives, $TN$ the number of true negatives, $FP$ the number of false positives, and $FN$ the number of false negatives. RI can be viewed as the percentage of correct decisions; note that false positives and false negatives are weighted equally here. The Rand index also lies between 0 and 1 and equals 1 when the clustering agrees perfectly with the true classes.

• ARI is a chance-corrected version of RI. A problem with RI is that its expected value between two random clusterings is not constant; ARI corrects this by assuming the generalized hypergeometric distribution as the model of randomness. The ARI has maximum value 1 and expected value 0 for random clusterings, and a larger ARI means higher agreement between two clusterings.

• JI is defined by
$$JI=\frac{TP}{TP+FP+FN}.$$
It is also known as the intersection-over-union measure of similarity between two sets; it takes values between 0 and 1, equal to 1 when the two sets are identical and 0 when they have no common elements.

• The F-measure is defined by
$$F_\beta=\frac{(\beta^2+1)\cdot Precision\cdot Recall}{\beta^2\,Precision+Recall},\qquad\text{where}\quad Precision=\frac{TP}{TP+FP}\quad\text{and}\quad Recall=\frac{TP}{TP+FN}.$$
The F-measure penalizes false negatives more strongly than false positives when $\beta>1$; on the other hand, when $\beta=0$, recall has no impact on the F-measure.

We summarize all the evaluation metrics for the two simulation scenarios in Tables 2 and 3. We observe that most of the metrics are close to the maximum possible value of 1, indicating the overall success of our clustering method.

Table 2: Clustering evaluation metrics when the true number of clusters equals three.

Metrics    PUR    RI     ARI    JI     NMI    F05    F1     F2     F5
Mean       0.984  0.979  0.952  0.938  0.923  0.968  0.968  0.968  0.968
Std. dev.  0.008  0.010  0.023  0.028  0.031  0.015  0.015  0.015  0.015
Table 3: Clustering evaluation metrics when the true number of clusters equals four.

Metrics    PUR    RI     ARI    JI     NMI    F05    F1     F2     F5
Mean       0.978  0.978  0.942  0.918  0.921  0.957  0.957  0.957  0.957
Std. dev.  0.008  0.008  0.020  0.027  0.024  0.015  0.015  0.015  0.015
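For reference, a minimal Python sketch of two of the simpler metrics (purity and the Rand index) computed directly from their definitions is given below; labels are assumed to be non-negative integers.

```python
import numpy as np
from itertools import combinations

def purity(true_labels, pred_labels):
    """Purity: each predicted cluster is credited with its most frequent true class."""
    true, pred = np.asarray(true_labels), np.asarray(pred_labels)
    return sum(np.bincount(true[pred == k]).max() for k in np.unique(pred)) / len(true)

def rand_index(true_labels, pred_labels):
    """Rand index: fraction of point pairs on which the two partitions agree."""
    true, pred = np.asarray(true_labels), np.asarray(pred_labels)
    agree = sum((true[i] == true[j]) == (pred[i] == pred[j])
                for i, j in combinations(range(len(true)), 2))
    return agree / (len(true) * (len(true) - 1) / 2)
```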
In order to evaluate the accuracy of the estimated cluster-specific parameters, denote the true parameter set for the $C$ classes by $\{F_1,F_2,\cdots,F_C\}$, where $F_c=M_cD_cV_c^T$. We compute
$$d_{MSE}=\sum_c\|\hat{F}_c-F_c\|_F,$$
where $\hat{F}_c$ is the estimate of the parameter matrix for the $c$-th cluster and $\|\cdot\|_F$ denotes the matrix Frobenius norm. Figure 3 plots the relative error (in percentage) in estimating the true parameter $F$; we observe that the procedure is efficient, with a maximum relative error below 4%. We show the simulation results for one particular dataset from simulations (i) and (ii), for both eigenvectors, in Figures 4 and 5, respectively.

Fig 3: Relative error (in percentage) for parameter $F$ in simulation (i) (left panel) and simulation (ii) (right panel).

As our Bayesian inference technique involves an MCMC sampling scheme, it is customary to check standard MCMC convergence and efficiency diagnostics (Cowles and Carlin, 1996). We investigate convergence by examining the MCMC cumulative average plot and the autocorrelation function (ACF) plot for one of the elements of the parameter $F$ for one of the clusters in the dataset. In Figure 6 we show both plots for simulation scenario (i). Based on the cumulative average (Figure 6(b)) we set the number of burn-in iterations to 800; the small values in the ACF plot (Figure 6(a)) indicate high efficiency of the parameter estimation based on the MCMC samples. These plots have very similar characteristics in all the other scenarios.

Fig 4: First and second eigenvectors using one of the datasets in simulation (i).

Fig 5: First and second eigenvectors using one of the datasets in simulation (ii).

Fig 6: Diagnostic plots of the MCMC for simulation (i): (a) ACF of $F$, (b) cumulative average of $F$.
6. Application
In this section we present two applications of our model to real data: the first is associated with medical image analysis, while the second concerns astronomical data. Each application is presented in a separate subsection below.
The human brain consists of more than 100 billion neurons and is arguably the most complex structure in our body (Basser and Jones, 2002; Mori and Zhang, 2006). Magnetic resonance imaging (MRI) is a powerful, noninvasive, three-dimensional imaging technique for characterizing the entire brain anatomy. Diffusion tensor imaging (DTI) is a relatively new MRI technique that helps to reconstruct the underlying 3D structure of axonal bundles in the brain. Using a technique called tractography, the data collected by DTI are used to group together the voxels that belong to the same white matter tract. This is used to investigate brain connectivity, for example cortex-white matter connectivity (Catani et al., 2002; Lazar and Alexander, 2005) or corticothalamic connectivity (Guy M Mckhann, 2004).

The DTI technique was introduced in the mid 1990s (Basser, Mattiello and LeBihan, 1994). The diffusion term refers to the translational motion of water molecules, and this motion is used as a probe to estimate the axonal organization of the brain: water molecules move relatively easily along the axonal bundles, compared to the direction perpendicular to these bundles, because there are fewer obstacles to movement along the fibers, which carry rich anatomical information about the white matter (Mori and Zhang, 2006). Fiber orientations are estimated from three independent diffusion measurements along the $x$, $y$ and $z$ axes. However, these three measurements are not enough, as the fiber orientation is not always aligned with one of these axes. To accurately construct an apparent diffusion coefficient map, in which the intensity of each voxel is proportional to the extent of diffusion, we would need to measure diffusion along many directions, which is difficult; the concept of the diffusion tensor (DT) was introduced to give a practical solution to this problem (Basser, Mattiello and LeBihan, 1994).

In this model, measurements along different axes (see Figure 7a) are fitted to a 3D ellipsoid, shown in Figure 7b, which represents the average diffusion distance in each direction (Mori and Zhang, 2006). The properties of a 3D ellipsoid can be defined by six parameters: three eigenvalues and the corresponding (mutually perpendicular) eigenvectors, which can be compactly represented by a $3\times 3$ symmetric matrix. The three eigenvalues ($\lambda_1$, $\lambda_2$ and $\lambda_3$) give the diffusivity in the direction of each eigenvector, denoted by $E_1$, $E_2$ and $E_3$ in Figure 7c.

Fig 7: A schematic representation of (a) fibers and (b) estimated ellipsoids with the corresponding (c) eigenvectors and eigenvalues.

To our knowledge, this is the first work in which DTI data are modeled with a mixture of ML distributions. We consider a final dataset, obtained after selecting the voxels in the white matter region of the brain, containing information from almost
63,000 voxels; our implementation is very efficient in handling this large amount of data. We model the diffusion tensors by elements of $\mathcal{V}_{n,p}$. Note that, for the scope of this project, we are only interested in the directions of the eigenvectors of the DT. Moreover, we only need to model $E_1$ and $E_2$, since the direction of $E_3$ is completely determined by the other two eigenvectors. We therefore have two orthonormal eigenvectors in three dimensions, i.e. a $3\times 2$ orthonormal matrix, an element of $\mathcal{V}_{3,2}$.

In practice, the Wishart distribution is commonly used to analyze the DT, a $3\times 3$ positive definite matrix, rather than a mixture of ML distributions. However, the parameters of the ML distribution are easier to interpret in terms of the direction of the data, so our Bayesian mixture model is relatively more flexible for handling DTI data with directional components, and our inference mechanism can handle a very large number of DTI observations across voxels. To the extent of our knowledge, this is the first paper that develops a framework to analyze DTI data when they are modeled as objects on $\mathcal{V}_{n,p}$.

Before presenting the results, we point out that our results could be improved by incorporating the eigenvalues along with the eigenvectors; however, that requires a more complicated statistical model, which we reserve for future work, as our main focus here is building an appropriate framework for analyzing DTI data. Nevertheless, we show in Section 6.1.2 that we find evidence of meaningful clusters by investigating only the directional part of the data.

The Philadelphia Neurodevelopmental Cohort (PNC) is a large-scale initiative to understand how genetics impact trajectories of brain development and cognitive functioning in adolescence, and how abnormal developmental trajectories are associated with psychiatric symptomatology (Satterthwaite et al., 2014). As part of the PNC, 1,445 children aged 8-21 received multi-modal neuroimaging along with a detailed cognitive and psychiatric assessment. The data are pre-processed with the comprehensive DTI data processing software library FSL (Woolrich et al., 2009). Important features of this dataset are that all imaging data were acquired at a single site, on a single scanner, in a short period of time that did not span any software or hardware upgrades. The quality of the DTI images was primarily assessed by visual inspection, and only rarely were artifacts noted in the DTI data (Satterthwaite et al., 2014).
We take one anonymous subject from this dataset, consisting of 62,667 measurements, and use a finite mixture of ML distributions to cluster this large dataset. We use the conditionally conjugate prior distributions defined in Equation 50 and select appropriate values of the hyperparameters using the empirical prior procedure developed in Section 4.2; the value of $K^\dagger$ is set to 100, as we expect a relatively large number of data points in each cluster. We use the procedure described in Section 4.6 to set the initial values of the parameters $M$, $\mathbf{d}$ and $V$ for the MCMC algorithm, and the first 1000 MCMC samples are discarded as burn-in. We fit the dataset with different numbers of clusters and choose 12 as the estimated number of clusters by the DIC criterion described in Section 4.5.

In Figure 8 we present the top three clusters, with their voxel locations mapped inside the anatomical structure of the brain (3D versions of these figures are also available). In this figure, panels (a), (b) and (c) represent the top, side and front views, respectively. It is important to notice that we are able to locate several important fiber structures in the dataset from this sample.

Fig 8: DTI clustering results for the three major clusters.

The Near Earth Object (NEO) population is defined as a group of small bodies with perihelion distance less than
1.3 AU and aphelion distance greater than 0.983 AU (Donnison, 2006). NEOs comprise NEAs (near-Earth asteroids) and NECs (near-Earth comets); NEAs are asteroids whose perihelion distance is less than 1.3 AU (https://cneos.jpl.nasa.gov/faq/). A detailed categorization of NEOs can also be found at https://cneos.jpl.nasa.gov/about/neo_groups.html. NECs are also called short-period (SP) comets, which are generally confined to direct orbits with an angle of inclination, with respect to a reference plane, of less than approximately 35°. The SP comets are in well-determined orbits with modest eccentricities and inclinations, which makes them a possible resource for space development (Lewis, Matthews and Guerrieri, 1993).

The NEC dataset was built by the Near Earth Object Program of the National Aeronautics and Space Administration (NASA). Each data point characterizes the orientation of a two-dimensional elliptical orbit in three-dimensional space and thus lies on the Stiefel manifold $\mathcal{V}_{3,2}$. For our experiment we downloaded the NEC dataset containing 175 entries. The orientation of an SP comet's orbit can be specified by the following quantities, whose definitions can be found at https://ssd.jpl.nasa.gov/?glossary:
• celestial longitude ($L$),
• latitude of the perihelion ($\theta$),
• longitude of the ascending node ($\Omega$).
The celestial longitude of the comet ($L$) (Hughes, 1985) and the latitude of the perihelion ($\theta$) (Yabushita, Hasegawa and Kobayashi, 1979) are computed by
$$L=\Omega+\tan^{-1}\!\left(\frac{\sin\omega\,\cos i}{\cos\omega}\right),\qquad \sin\theta=\sin i\,\sin\omega.$$
From the dataset we obtain the values of the orbital inclination ($i$), the longitude of the ascending node ($\Omega$) and the argument of periapsis (perihelion) ($\omega$), as shown in Figure 9. Using the appropriate transformations given in Jupp and Mardia (1979) and Yabushita, Hasegawa and Kobayashi (1979), we find $L$, $\theta$ and $\Omega$ for each comet. The direction of the perihelion is
$$\mathbf{x}_1=(\cos\theta\cos L,\ \cos\theta\sin L,\ \sin\theta),$$
and the directed unit normal to the orbit, given by the right-hand rule, is
$$\mathbf{x}_2=(\sin\theta\sin\Omega,\ -\sin\theta\cos\Omega,\ -\cos\theta\sin(\Omega-L))/r,\qquad r=\left\{\sin^2\theta+\cos^2\theta\,\sin^2(\Omega-L)\right\}^{1/2}.$$
The orientation of the orbit can therefore be represented by the matrix $X\in\mathcal{V}_{3,2}$ given by $X=[\mathbf{x}_1^T\ \mathbf{x}_2^T]$. An appropriate model for the distribution of these matrices is the ML family (Jupp and Mardia, 1979).

Fig 9: Near Earth comet's orbit and orbital elements (inclination $i$, longitude of the ascending node $\Omega$, argument of periapsis $\omega$).
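A minimal Python sketch of this mapping from orbital elements to a point on $\mathcal{V}_{3,2}$ is given below; the numerical element values in the usage line are purely illustrative, and a quadrant-safe arctangent is used for the celestial longitude.

```python
import numpy as np

def orbit_orientation(incl, Omega, omega):
    """Map orbital elements (inclination, longitude of ascending node, argument of
    perihelion; radians) to X = [x1 x2] in V_{3,2}, following the formulas above."""
    L = Omega + np.arctan2(np.sin(omega) * np.cos(incl), np.cos(omega))  # celestial longitude
    theta = np.arcsin(np.sin(incl) * np.sin(omega))                      # latitude of perihelion
    x1 = np.array([np.cos(theta) * np.cos(L), np.cos(theta) * np.sin(L), np.sin(theta)])
    r = np.sqrt(np.sin(theta)**2 + np.cos(theta)**2 * np.sin(Omega - L)**2)
    x2 = np.array([np.sin(theta) * np.sin(Omega),
                   -np.sin(theta) * np.cos(Omega),
                   -np.cos(theta) * np.sin(Omega - L)]) / r
    return np.column_stack([x1, x2])   # 3 x 2 matrix with orthonormal columns

X = orbit_orientation(np.radians(12.0), np.radians(75.0), np.radians(40.0))
```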
Here we model the NEC dataset as a finite mixture of ML distributions and run our model with the number of clusters equal to 3, 4, 5 and 6. In each case we use 2000 MCMC samples, of which the first 1000 iterations are discarded as burn-in; the number of burn-in iterations is chosen by observing the MCMC convergence diagnostic plots. We select appropriate values of the hyperparameters for the prior distributions in Equation 50 empirically, using the procedure developed in Section 4.2. Below we report the DIC values used for model selection: the DIC (shown in Table 4) is minimized at four clusters. Note that, based on the results reported in Lin, Rao and Dunson (2017) as well, four appears to be the most likely number of clusters.
Table 4: DIC values for the NEC dataset.

Number of clusters   DIC value
3                    3074.91
4                    2607.04
5                    2712.96
6                    2685.94
We compute the probability that any two NEC data points belong to the same cluster, for all pairs of NEC data; this is also called the cluster co-occurrence probability matrix (Hofmann and Puzicha, 1998). The corresponding heatmap is shown in Figure 10.

Finally, we plot each eigenvector of each data point in $\mathcal{V}_{3,2}$ on a sphere (Figures 11 and 12), using different colors (red, blue, green, black) to represent the four clusters. The NECs denoted by points of the same color form groups of comets with similar orbital characteristics.

Fig 10: Cluster co-occurrence probability matrix for the NEC dataset.
7. Discussions and Future directions
In this paper we build a Bayesian framework for a mixture of ML distributions that can be applied to real-world directional data. We construct two special families of distributions to be used as prior distributions, following the original conjugate prior construction in Diaconis and Ylvisaker (1979), and discuss several important properties of these prior classes. For the mixture model we derive the posterior and give guidance on the selection of hyperparameters, which should be helpful for practitioners. Finally, we are able to handle a large amount of DTI data in the real data application, and the results look quite promising.

As a future extension, instead of selecting the number of clusters by the DIC criterion, we would like the number of clusters itself to be a random variable; a fully Bayesian model-based approach assuming a parametric prior (e.g. Poisson) on the number of clusters could be employed. The next natural step in this direction is to extend the existing model to a non-parametric framework, which is more flexible for modeling and experimenting with different types of underlying clustering structure. Note that, although Lin, Rao and Dunson (2017) opened the door to such modeling, their model space differs from ours in various respects.

Fig 11: First eigenvector of each data point embedded in a sphere.

In a separate direction, we also plan to explore in depth the analytical properties of the hypergeometric function of matrix argument, ${}_0F_1(\cdot)$, for $p\geq$
2. Direct computation, as done in our case studies, could create bottlenecks for data of higher dimension. Analytical bounds could help either in approximation or in designing a good MCMC sampler. For example, one could borrow the importance sampling approach used for evaluating the normalizing constants in Mitra et al. (2013). This would primarily rely on the ability to simulate efficiently from ML distributions, which is already ensured by Hoff (2009); a naive Monte Carlo version of this idea is sketched at the end of this section. Along this line, it would also be worthwhile to rigorously study the theoretical properties, particularly the ergodicity, of the MCMC schemes.

Fig 12: Second eigenvector of a data point is embedded in a sphere.

Incorporating the topological properties of the sample space into state-of-the-art Bayesian methods is a rich area that has been initiated only recently by Bhattacharya and Dunson (2012) and Lin, Rao and Dunson (2017). We plan to continue along this direction and contribute to Bayesian methodological development on general analytic manifolds, which would be appropriate for analyzing large-scale data with complex structure.
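As a self-contained illustration of the simplest version of that strategy, the sketch below estimates the ML normalizing constant by plain Monte Carlo, using the uniform (Haar) distribution on the Stiefel manifold as the proposal; with the standard parametrization in which the ML density is proportional to etr(FᵀX), the constant is ₀F₁(n/2; FᵀF/4) = ∫ etr(FᵀX)[dX]. The function names are ours; a scheme in the spirit of Mitra et al. (2013) would instead use better-matched proposals (e.g., ML draws via the sampler of Hoff, 2009) to reduce the variance.

import numpy as np

def uniform_stiefel(n, p, rng):
    """Draw an n x p matrix with orthonormal columns uniformly (Haar) on the Stiefel manifold."""
    z = rng.standard_normal((n, p))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))   # sign fix makes the draw exactly Haar-distributed

def ml_normalizing_constant(F, num_samples=100000, seed=0):
    """Naive Monte Carlo estimate of E_uniform[etr(F'X)], the ML normalizing constant."""
    rng = np.random.default_rng(seed)
    n, p = F.shape
    vals = np.empty(num_samples)
    for s in range(num_samples):
        X = uniform_stiefel(n, p, rng)
        vals[s] = np.exp(np.trace(F.T @ X))
    return vals.mean()

This naive estimator is adequate for small p (such as the p = 2 case studies here) and moderate concentration, but its variance grows quickly with the magnitude of F, which is precisely where importance sampling with ML proposals, or analytical bounds, would pay off.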
References
Absil, P.-A., Mahony, R. and Sepulchre, R. (2009). Optimization Algorithms on Matrix Manifolds. Princeton University Press.
Anand, S., Mittal, S. and Meer, P. (2016). Robust estimation for computer vision using Grassmann manifolds. In Riemannian Computing in Computer Vision.
Bangert, M., Hennig, P. and Oelfke, U. (2010). Using an infinite von Mises-Fisher mixture model to cluster treatment beam directions in external radiation therapy. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on.
Basser, P. J. and Jones, D. K. (2002). Diffusion-tensor MRI: theory, experimental design and data analysis - a technical review. NMR in Biomedicine.
Basser, P. J., Mattiello, J. and LeBihan, D. (1994). MR diffusion tensor spectroscopy and imaging. Biophysical Journal.
Berg, A., Meyer, R. and Yu, J. (2004). Deviance information criterion for comparing stochastic volatility models. Journal of Business & Economic Statistics.
Bhatia, R. (2007). Positive Definite Matrices. Princeton University Press, Princeton and Oxford.
Bhattacharya, A. and Dunson, D. B. (2012). Strong consistency of nonparametric Bayes density estimation on compact metric spaces with applications to specific manifolds. Annals of the Institute of Statistical Mathematics.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Butler, R. W. and Wood, A. T. (2003). Laplace approximation for Bessel functions of matrix argument. Journal of Computational and Applied Mathematics.
Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician.
Casella, G. and Berger, R. L. (2002). Statistical Inference. Duxbury, Pacific Grove, CA.
Catani, M., Howard, R. J., Pajevic, S. and Jones, D. K. (2002). Virtual in vivo interactive dissection of white matter fasciculi in the human brain. NeuroImage.
Chikuse, Y. (1991a). High dimensional limit theorems and matrix decompositions on the Stiefel manifold. Journal of Multivariate Analysis.
Chikuse, Y. (1991b). Asymptotic expansions for distributions of the large sample matrix resultant and related statistics on the Stiefel manifold. Journal of Multivariate Analysis.
Chikuse, Y. (1998). Density estimation on the Stiefel manifold. Journal of Multivariate Analysis.
Chikuse, Y. (2012). Statistics on Special Manifolds. Springer Science & Business Media.
Conway, J. (1990). A Course in Functional Analysis.
Cowles, M. K. and Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association.
DeIorio, M. and Robert, C. P. (2002). Discussion of Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological).
Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. The Annals of Statistics.
Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society. Series B (Methodological).
Donnison, J. (2006). Some aspects of the statistics of Near-Earth Objects. Proceedings of the International Astronomical Union.
Downs, T. D. (1972). Orientation statistics. Biometrika.
Edelman, A., Arias, T. A. and Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications.
François, O. and Laval, G. (2011). Deviance information criteria for model selection in approximate Bayesian computation. arXiv preprint arXiv:1105.0269.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003). Bayesian Data Analysis, Second Edition (Chapman & Hall/CRC Texts in Statistical Science), 2nd ed. Chapman and Hall/CRC.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2014). Bayesian Data Analysis. CRC Press, Boca Raton, FL.
Gross, K. I. and Richards, D. S. P. (1987). Special functions of matrix argument. I. Algebraic induction, zonal polynomials, and hypergeometric functions. Transactions of the American Mathematical Society.
Gross, K. I. and Richards, D. S. P. (1989). Total positivity, spherical series, and hypergeometric functions of matrix argument. Journal of Approximation Theory.
Gupta, R. D. and Richards, D. S. P. (1985). Hypergeometric functions of scalar matrix argument are expressible in terms of classical hypergeometric functions. SIAM Journal on Mathematical Analysis.
Guy M Mckhann, I. (2004). Non-invasive mapping of connections between human thalamus and cortex using diffusion imaging. Neurosurgery.
Hardy, G. H., Littlewood, J. E. and Pólya, G. (1952). Inequalities. Cambridge University Press.
Herz, C. S. (1955). Bessel functions of matrix argument. Annals of Mathematics.
Hill, R. D. and Waters, S. R. (1987). On the cone of positive semidefinite matrices. Linear Algebra and its Applications.
Hoff, P. D. (2009). Simulation of the matrix Bingham-von Mises-Fisher distribution, with applications to multivariate and relational data. Journal of Computational and Graphical Statistics.
Hofmann, T. and Puzicha, J. (1998). Statistical models for co-occurrence data.
Hornik, K. and Grün, B. (2013). On conjugate families and Jeffreys priors for von Mises-Fisher distributions. Journal of Statistical Planning and Inference.
Hornik, K. and Grün, B. (2014). movMF: An R package for fitting mixtures of von Mises-Fisher distributions. Journal of Statistical Software.
Hughes, D. W. (1985). The position of Earth at previous apparitions of Halley's comet. Quarterly Journal of the Royal Astronomical Society.
James, A. T. (1964). Distributions of matrix variates and latent roots derived from normal samples. The Annals of Mathematical Statistics.
James, I. M. (1976). The Topology of Stiefel Manifolds. Cambridge University Press.
Jupp, P. E. and Mardia, K. V. (1979). Maximum likelihood estimators for the matrix von Mises-Fisher and Bingham distributions. The Annals of Statistics.
Jupp, P. and Mardia, K. (1980). A general correlation coefficient for directional data and related regression problems. Biometrika.
KaewTraKulPong, P. and Bowden, R. (2002). An improved adaptive background mixture model for real-time tracking with shadow detection. Video-Based Surveillance Systems.
Khare, K., Pal, S. and Su, Z. (2017). A Bayesian approach for envelope models. The Annals of Statistics.
Khatri, C. and Mardia, K. (1977). The von Mises-Fisher matrix distribution in orientation statistics. Journal of the Royal Statistical Society. Series B (Methodological).
Koev, P. and Edelman, A. (2006). The efficient evaluation of the hypergeometric function of a matrix argument. Mathematics of Computation.
Kristof, W. (1969). A theorem on the trace of certain matrix products and some applications. ETS Research Report Series.
Lattin, J. M., Carroll, J. D. and Green, P. E. (2003). Analyzing Multivariate Data. Thomson Brooks/Cole, Pacific Grove, CA.
Lazar, M. and Alexander, A. L. (2005). Bootstrap white matter tractography (BOOT-TRAC). NeuroImage.
Lewicki, M. S. (1998). A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems, R53–R78.
Lewis, J. S., Matthews, M. S. and Guerrieri, M. L. (1993). Resources of near-Earth space. Resources of Near-Earth Space.
Lin, L., Rao, V. and Dunson, D. (2017). Bayesian nonparametric inference on the Stiefel manifold. Statistica Sinica.
Lui, Y. M. (2012). Advances in matrix manifolds for computer vision. Image and Vision Computing.
Lui, Y. and Beveridge, J. (2008). Grassmann registration manifolds for face recognition. Computer Vision - ECCV 2008.
Mardia, K. V. and Jupp, P. E. (2009). Directional Statistics. John Wiley & Sons.
Mardia, K. and Khatri, C. (1977). Uniform distribution on a Stiefel manifold. Journal of Multivariate Analysis.
Mardia, K. V., Taylor, C. C. and Subramaniam, G. K. (2007). Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data. Biometrics.
McGraw, T., Vemuri, B., Yezierski, R. and Mareci, T. (2006). Segmentation of high angular resolution diffusion MRI modeled as a field of von Mises-Fisher mixtures. In European Conference on Computer Vision.
McKenna, S. J., Raja, Y. and Gong, S. (1999). Tracking colour objects using adaptive mixture models. Image and Vision Computing.
McLachlan, G. and Peel, D. (2004). Finite Mixture Models. John Wiley & Sons.
Mitra, R., Müller, P., Liang, S., Yue, L. and Ji, Y. (2013). A Bayesian graphical model for ChIP-Seq data on histone modifications. Journal of the American Statistical Association.
Mori, S. and Zhang, J. (2006). Principles of diffusion tensor imaging and its applications to basic neuroscience research. Neuron.
Muirhead, R. J. (1975). Expressions for some hypergeometric functions of matrix argument with applications. Journal of Multivariate Analysis.
Muirhead, R. J. (2009). Aspects of Multivariate Statistical Theory. John Wiley & Sons.
Picard, F. (2007). An introduction to mixture models. Statistics for Systems Biology, Research Report.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association.
Reisinger, J., Waters, A., Silverthorn, B. and Mooney, R. J. (2010). Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10).
Robbins, H. (1985). An empirical Bayes approach to statistics. In Herbert Robbins Selected Papers.
Rokach, L. and Maimon, O. (2005). The Data Mining and Knowledge Discovery Handbook: A Complete Guide for Researchers and Practitioners.
Rudin, W. et al. (1964). Principles of Mathematical Analysis. McGraw-Hill, New York.
Satterthwaite, T. D., Elliott, M. A., Ruparel, K., Loughead, J., Prabhakaran, K., Calkins, M. E., Hopson, R., Jackson, C., Keefe, J., Riley, M. et al. (2014). Neuroimaging of the Philadelphia neurodevelopmental cohort. NeuroImage.
Schwartz, L. (1965). On Bayes procedures. Probability Theory and Related Fields.
Schwartzman, A. (2006). Random ellipsoids and false discovery rates: Statistics for diffusion tensor imaging data. PhD thesis, Stanford University.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Stauffer, C. and Grimson, W. E. L. (1999). Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on.
Tang, H., Chu, S. M. and Huang, T. S. (2009). Generative model-based speaker clustering via mixture of von Mises-Fisher distributions. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on.
Titterington, D. M., Robert, C. P., Forbes, F. and Celeux, G. (2006). Deviance information criteria for missing data models. Bayesian Analysis.
Turaga, P., Veeraraghavan, A. and Chellappa, R. (2008). Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.
Turaga, P., Veeraraghavan, A., Srivastava, A. and Chellappa, R. (2011). Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Vinh, N. X., Epps, J. and Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research.
Wood, E., Fellows, M., Donoghue, J. and Black, M. (2004). Automatic spike sorting for neural decoding. In Engineering in Medicine and Biology Society, 2004. IEMBS'04. 26th Annual International Conference of the IEEE.
Woolrich, M. W., Jbabdi, S., Patenaude, B., Chappell, M., Makni, S., Behrens, T., Beckmann, C., Jenkinson, M. and Smith, S. M. (2009). Bayesian analysis of neuroimaging data in FSL. NeuroImage, S173–S186.
Wright, S. J. and Nocedal, J. (1999). Numerical Optimization. Springer Science.
Yabushita, S., Hasegawa, I. and Kobayashi, K. (1979). The distributions of inclination and perihelion latitude of long-period comets and their dynamical implications. Publications of the Astronomical Society of Japan.
Zeng, X., Bian, W., Liu, W., Shen, J. and Tao, D. (2015). Dictionary pair learning on Grassmann manifolds for image denoising. IEEE Transactions on Image Processing.