Mixture Probabilistic Principal Geodesic Analysis
Youshan Zhang (Computer Science & Engineering, Lehigh University, Bethlehem, USA), Jiarui Xing (Electrical & Computer Engineering, University of Virginia, Charlottesville, USA), and Miaomiao Zhang (Computer Science, University of Virginia, Charlottesville, USA)
Abstract.
Dimensionality reduction on Riemannian manifolds is challenging due to the complex nonlinear data structures. While probabilistic principal geodesic analysis (PPGA) has been proposed to generalize conventional principal component analysis (PCA) onto manifolds, its effectiveness is limited to data with a single modality. In this paper, we present a novel Gaussian latent variable model that provides a unique way to integrate multiple PGA models into a maximum-likelihood framework. This leads to a well-defined mixture model of probabilistic principal geodesic analysis (MPPGA) on sub-populations, where parameters of the principal subspaces are automatically estimated by employing an Expectation Maximization algorithm. We further develop a mixture Bayesian PGA (MBPGA) model that automatically reduces data dimensionality by suppressing irrelevant principal geodesics. We demonstrate the advantages of our model in the contexts of clustering and statistical shape analysis, using synthetic sphere data, real corpus callosum, and mandible data from human brain magnetic resonance (MR) and CT images.
1 Introduction

PCA has been widely used to analyze high-dimensional data due to its effectiveness in finding the most important principal modes for data representation [12]. Motivated by the nice properties of probabilistic modeling, a latent variable model of PCA for factor analysis was presented in [23,18]. Later, different variants of probabilistic PCA, including Bayesian PCA [2] and mixture models of PCA [4], were developed for automatic data dimensionality reduction and clustering, respectively. It is important to extend all these models from flat Euclidean spaces to general Riemannian manifolds, where the data is typically equipped with smooth constraints. For instance, an appropriate representation of directional data, i.e., vectors of unit length in R^n, is the sphere S^{n-1} [16]. Another important example of manifold data is in shape analysis, where the definition of the shape of an object should not depend on its position, orientation, or scale, i.e., Kendall shape space [14]. Other examples of manifold data include geometric transformations such as rotations and translations, symmetric positive-definite tensors [10,25], Grassmannian manifolds (the set of m-dimensional linear subspaces of R^n), and Stiefel manifolds (the set of orthonormal m-frames in R^n) [24].
Data dimensionality reduction on manifolds is challenging because the commonly used linear operations violate the natural constraints of manifold-valued data. In addition, basic statistical notions such as distance metrics and data distributions vary across different types of manifolds [14,24,17]. A groundbreaking work, known as principal geodesic analysis (PGA), was the first to generalize PCA to nonlinear manifolds [10]. This method describes the geometric variability of manifold data by finding lower-dimensional geodesic subspaces that minimize the residual sum-of-squared geodesic distances to the data. Later on, an exact solution to PGA [19,20] and a robust formulation for estimating the output results [1] were developed. The probabilistic interpretation of PGA was first introduced in [26], which paved the way for factor analysis on manifolds. Since PPGA only defines a single projection of the data, the scope of its application is limited to uni-modal distributions. A more natural and motivating solution is to model the multi-modal data structure with a collection, or mixture, of local sub-models. Current mixture models on a specific manifold generally employ a two-stage procedure: a clustering of the data projected into Euclidean space, followed by PCA within each cluster [6]. None of these algorithms defines a probability density.

In this paper, we derive a mixture of PGA models as a natural extension of PPGA [26], where all model parameters, including the low-dimensional factors for each data cluster, are estimated through the maximization of a single likelihood function. The theoretical foundation of developing generative models of principal geodesic analysis for multi-population studies on general manifolds is brand new. In addition, the algorithmic inference of our proposed method is nontrivial due to the complicated geometry of manifold-valued data and numerical issues. Compared to previous methods, the major advantages of our model are: (i) it leads to a unified algorithm that integrates soft data clustering and principal subspace estimation on general Riemannian manifolds; (ii) in contrast to the two-stage approach mentioned above, our model explicitly considers the reconstruction error of principal modes as a criterion for clustering; and (iii) it provides a more powerful way to learn features from data in non-Euclidean spaces with multiple subpopulations. We showcase our model's advantages from two distinct perspectives: automatic data clustering and dimensionality reduction for analyzing shape variability. To validate the effectiveness of the proposed algorithm, we compare its performance with state-of-the-art methods on both synthetic and real datasets. We also briefly discuss a Bayesian version of our mixture PPGA model that is equipped with the functionality of automatic dimensionality selection on general manifold data.
2 Background

In this section, we briefly review PPGA [26] defined on a smooth Riemannian manifold M, which is a generalization of PPCA [23] in Euclidean space. Before introducing the model, we first recap a few basic concepts of Riemannian geometry (more details are provided in [7]).
Covariant Derivative.
The covariant derivative is a generalization of the Euclidean directional derivative to the manifold setting. Consider a curve c(t): [0, 1] → M and let ċ = dc/dt be its velocity. Given a vector field V(t) defined along c, we can define the covariant derivative of V as DV/dt = ∇_ċ V, which reflects the change of the vector field V along the direction ċ. A vector field is called parallel if its covariant derivative along the curve c is zero. A curve c is a geodesic if it satisfies the equation ∇_ċ ċ = 0.

Exponential Map.
For any point p ∈ M and tangent vector v ∈ T_pM (the tangent space of M at p), there exists a unique geodesic curve c with initial conditions c(0) = p and ċ(0) = v. This geodesic is only guaranteed to exist locally. The Riemannian exponential map at p is defined as Exp_p(v) = c(1). In other words, the exponential map takes a position and a velocity as input and returns the point at time t = 1 along the geodesic with those initial conditions. Notice that the exponential map reduces to addition in Euclidean space, i.e., Exp_p(v) = p + v.

Logarithmic Map.
The exponential map is locally diffeomorphic onto a neighborhood of p. Let V(p) be the largest such neighborhood; then the Riemannian log map, Log_p: V(p) → T_pM, is the inverse of the exponential map within V(p). For any point q ∈ V(p), the Riemannian distance function is given by Dist(p, q) = ‖Log_p(q)‖. Similar to the exponential map, the logarithmic map reduces to subtraction in Euclidean space, i.e., Log_p(q) = q − p.

Probabilistic Principal Geodesic Analysis. Given a d-dimensional random variable y ∈ M, the main idea of PPGA [26] is to model y as
\[
y = \mathrm{Exp}\big(\mathrm{Exp}(\mu, Bx), \epsilon\big), \qquad B = W\Lambda, \tag{1}
\]
where µ is a base point on M, x ∈ R^q is a q-dimensional latent variable with x ∼ N(0, I), B is a d × q factor matrix that relates x and y, and ε represents error. We find it convenient to model the factors as B = WΛ, where W is a matrix with q columns of mutually orthogonal tangent vectors in T_µM, and Λ is a q × q diagonal matrix of scale factors for the columns of W. This removes the rotation ambiguity of the latent factors and makes them analogous to the eigenvectors and eigenvalues of standard PCA (there is, of course, still an ambiguity in the ordering of the factors).

The likelihood of PPGA is defined by a generalization of the normal distribution N(µ, τ⁻¹), called the Riemannian normal distribution, with precision parameter τ:
\[
p(y \mid \mu, \tau) = \frac{1}{C(\mu, \tau)} \exp\!\Big(-\frac{\tau}{2}\,\mathrm{Dist}^2(y, \mu)\Big), \quad \text{with } C(\mu, \tau) = \int_M \exp\!\Big(-\frac{\tau}{2}\,\mathrm{Dist}^2(y, \mu)\Big)\, dy. \tag{2}
\]
This distribution is applicable to any Riemannian manifold, and the value of C in Eq. (2) does not depend on µ. It reduces to a multivariate normal distribution with isotropic covariance when M = R^n (see [9] for details). Note that this noise model could be replaced with other distributions for different types of applications.

Now, the PPGA model for a random variable y in Eq. (1) can be defined as
\[
y \sim \mathcal{N}\big(\mathrm{Exp}(\mu, s), \tau^{-1}\big), \qquad s = W\Lambda x. \tag{3}
\]
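To make these operations concrete, here is a minimal sketch of the exponential map, log map, and geodesic distance on the unit sphere (one of the manifolds used in our experiments), written in Python with NumPy. The function names (`exp_map`, `log_map`, `dist`) are our own illustrative choices, not part of any released implementation of this paper.

```python
import numpy as np

def exp_map(p, v, eps=1e-12):
    """Riemannian exponential map on the unit sphere: Exp_p(v) = c(1)."""
    norm_v = np.linalg.norm(v)
    if norm_v < eps:
        return p
    return np.cos(norm_v) * p + np.sin(norm_v) * (v / norm_v)

def log_map(p, q, eps=1e-12):
    """Riemannian log map on the unit sphere: Log_p(q), inverse of Exp_p."""
    w = q - np.dot(p, q) * p                 # project q into the tangent space at p
    norm_w = np.linalg.norm(w)
    if norm_w < eps:
        return np.zeros_like(p)
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))  # geodesic distance
    return theta * (w / norm_w)

def dist(p, q):
    """Riemannian distance Dist(p, q) = ||Log_p(q)||."""
    return np.linalg.norm(log_map(p, q))
```

In Euclidean space these reduce to `p + v` and `q - p`, matching the remarks above.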
3 Mixture Probabilistic Principal Geodesic Analysis

We now introduce a mixture model of PPGA (MPPGA) that offers the tempting prospect of being able to model complex multi-modal data structures. This formulation allows all model parameters to be estimated by maximum likelihood, where both an appropriate data clustering and the associated principal modes are jointly optimized.

Consider observed data y_n ∈ {y_1, ..., y_N} generated from K clusters on M (as shown in Fig. 1). We first introduce a K-dimensional binary random variable z_n whose k-th element z_{nk} ∈ {0, 1} indicates whether the n-th data point belongs to cluster k, where k ∈ {1, ..., K}; that is, z_{nk} = 1 with all other elements being zero if the data point y_n is in cluster k. The probability of each random variable z_n is
\[
p(z_n) = \prod_{k=1}^{K} \pi_k^{z_{nk}}, \tag{4}
\]
where π_k ∈ [0, 1] is the mixing coefficient, which satisfies \(\sum_{k=1}^{K} \pi_k = 1\). Analogous to PPGA in Eq. (1), the likelihood of each observed data point y_n is
\[
p(y_n \mid z_n) = \prod_{k=1}^{K} \mathcal{N}\big(y_n \mid \mathrm{Exp}(\mu_k, s_{nk}), \tau_k^{-1}\big)^{z_{nk}}, \quad \text{with } s_{nk} = W_k \Lambda_k x_{nk}, \tag{5}
\]
where x_{nk} ∼ N(0, I) is a latent random variable in R^q, µ_k is a base point for each cluster k, W_k is a matrix whose columns are mutually orthogonal tangent vectors in T_{µ_k}M, and Λ_k is a diagonal matrix of scale factors for the columns of W_k.

Combining Eq. (4) with Eq. (5), we obtain the complete data likelihood
\[
p(y, z) = \prod_{n=1}^{N} p(y_n \mid z_n)\, p(z_n)\, p(x_n) = \prod_{n,k=1}^{N,K} \big[\pi_k\, p(y_n \mid \mathrm{Exp}(\mu_k, s_{nk}), \tau_k^{-1})\, p(x_{nk})\big]^{z_{nk}}. \tag{6}
\]

Fig. 1: Example MPPGA model with four clusters.
The log of the complete data likelihood in Eq. (6) is
\[
\mathcal{L} \triangleq \ln p(y, z) = \sum_{n,k=1}^{N,K} z_{nk} \ln \big\{ \pi_k\, p(y_n \mid \mathrm{Exp}(\mu_k, s_{nk}), \tau_k^{-1})\, p(x_{nk}) \big\}. \tag{7}
\]

3.1 Inference

We employ a maximum likelihood expectation maximization (EM) method to estimate the model parameters θ = (π_k, µ_k, W_k, Λ_k, τ_k, x_{nk}) and the latent variables z_{nk}. This scheme includes two main steps:

E-step.
To treat the binary indicators z_{nk} fully as latent random variables, we integrate them out of the distribution defined in Eq. (6). As in typical Gaussian mixture models, the expected value of the complete-data log likelihood is
\[
E[\mathcal{L}] = \sum_{n,k=1}^{N,K} E[z_{nk}] \big\{ \ln p(y_n \mid \mathrm{Exp}(\mu_k, s_{nk}), \tau_k^{-1}) + \ln p(x_{nk}) + \ln \pi_k \big\}. \tag{8}
\]
The expected value of the latent variable z_{nk}, also known as the responsibility of component k for data point y_n [3], is computed from its posterior distribution as
\[
E[z_{nk}] = p(z_{nk} = 1 \mid y_n) = \frac{\pi_k\, p(y_n \mid \mathrm{Exp}(\mu_k, s_{nk}), \tau_k^{-1})}{\sum_{j=1}^{K} \pi_j\, p(y_n \mid \mathrm{Exp}(\mu_j, s_{nj}), \tau_j^{-1})}. \tag{9}
\]
Recall that the Riemannian distance function is Dist(p, q) = ‖Log_p(q)‖. Letting γ_{nk} ≜ E[z_{nk}], we can rewrite Eq. (8) as
\[
E[\mathcal{L}] = -\sum_{n,k=1}^{N,K} \gamma_{nk} \left\{ \frac{\tau_k}{2}\,\mathrm{Dist}^2(\mathrm{Exp}(\mu_k, s_{nk}), y_n) + \ln C - \ln \pi_k + \frac{\|x_{nk}\|^2}{2} \right\}, \tag{10}
\]
where C is the normalizing constant from Eq. (2).
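As an illustration, the E-step of Eq. (9) can be implemented on the sphere in a few lines of NumPy, reusing `exp_map` and `dist` from the sketch above. Working in log space avoids numerical underflow; `log_C` evaluates the normalizing constant of the Riemannian normal on S² by one-dimensional quadrature, using the polar-coordinate area element sin(r) dr dφ around the mean. All names here are our own; this is a sketch, not the authors' implementation.

```python
def log_C(tau, num=2000):
    """log C(tau) on S^2: C(tau) = 2*pi * int_0^pi exp(-tau r^2 / 2) sin(r) dr."""
    r = np.linspace(0.0, np.pi, num)
    return np.log(2.0 * np.pi * np.trapz(np.exp(-0.5 * tau * r**2) * np.sin(r), r))

def e_step(Y, mu, S, tau, pi):
    """Responsibilities gamma[n, k] (Eq. 9) for data Y (N x 3), base points
    mu[k], tangent offsets S[n, k] = W_k Lambda_k x_nk, precisions tau[k],
    and mixing coefficients pi[k]."""
    N, K = Y.shape[0], len(pi)
    log_p = np.empty((N, K))
    for k in range(K):
        lc = log_C(tau[k])
        for n in range(N):
            d = dist(exp_map(mu[k], S[n, k]), Y[n])   # Dist(Exp(mu_k, s_nk), y_n)
            log_p[n, k] = np.log(pi[k]) - 0.5 * tau[k] * d**2 - lc
    log_p -= log_p.max(axis=1, keepdims=True)          # stabilized normalization
    gamma = np.exp(log_p)
    return gamma / gamma.sum(axis=1, keepdims=True)
```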
M-step.
We use gradient ascent to maximize the expectation E[𝓛] and update the parameters θ. Since the maximization over the mixing coefficients π_k is the same as in the Gaussian mixture model [3], we only give its closed-form update here: π̃_k = ∑_{n=1}^{N} γ_{nk} / N.

The computation of the gradient terms requires the derivative operator (Jacobian matrix) of the exponential map, i.e., d_{µ_k}Exp(µ_k, s_{nk}) or d_{s_{nk}}Exp(µ_k, s_{nk}). Next, we briefly review the computation of these derivatives w.r.t. the base point µ and the tangent vector s separately. Closed-form formulations of these derivatives on the sphere and in 2D Kendall shape space are provided in [26,11].

Derivative w.r.t. µ. Consider a variation of geodesics, e.g., c(h, t) = Exp(Exp(µ, hu), t s(h)), where u ∈ T_µM and s(h) comes from parallel translating s along the geodesic Exp(µ, hu). The derivative of this variation results in a Jacobi field, J_µ(t) = dc/dh(0, t). This gives an expression for the exponential map derivative, d_µExp(µ, s) u = J_µ(1) (as shown in the left panel of Fig. 2).

Derivative w.r.t. s. Consider a variation of geodesics, e.g., c(h, t) = Exp(µ, hu + ts). Again, the derivative of the exponential map is given by a Jacobi field satisfying J_s(t) = dc/dh(0, t), and we have d_sExp(µ, s) u = J_s(1) (as shown in the right panel of Fig. 2).

Fig. 2: Jacobi fields.
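On the unit sphere, where the sectional curvature is constant and equal to 1, these Jacobi fields have simple closed forms: in a parallel orthonormal frame along the geodesic, the component of u parallel to the geodesic is transported unchanged (base-point derivative) or scaled by t (velocity derivative), while the orthogonal component is scaled by cos(t‖s‖) or sin(t‖s‖)/‖s‖, respectively. A minimal sketch, assuming the `exp_map`/`log_map` helpers above and our own function names:

```python
def parallel_transport(p, v, w, eps=1e-12):
    """Parallel transport of w in T_p S^2 along the geodesic Exp_p(t v) to t = 1."""
    nv = np.linalg.norm(v)
    if nv < eps:
        return w
    vhat = v / nv
    a = np.dot(w, vhat)                         # component along the geodesic
    return w - a * vhat + a * (np.cos(nv) * vhat - np.sin(nv) * p)

def _jacobi(mu, v, u, scale_perp, eps=1e-12):
    """Apply the diagonal Jacobi scalings, then transport to Exp(mu, v)."""
    nv = np.linalg.norm(v)
    if nv < eps:
        return u
    vhat = v / nv
    u_par = np.dot(u, vhat) * vhat              # component along the geodesic
    return parallel_transport(mu, v, u_par + scale_perp(nv) * (u - u_par))

def dexp_base(mu, v, u):
    """d_mu Exp(mu, v) u: Jacobi field with J(0) = u, J'(0) = 0, at t = 1."""
    return _jacobi(mu, v, u, np.cos)

def dexp_velocity(mu, v, u):
    """d_v Exp(mu, v) u: Jacobi field with J(0) = 0, J'(0) = u, at t = 1."""
    return _jacobi(mu, v, u, lambda r: np.sin(r) / r)
```

Because the scalings are diagonal in a parallel frame, the adjoint operators d Exp† needed in Eqs. (11) and (13) below can be realized by parallel-transporting the argument back to T_µM and applying the same scalings.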
Now we are ready to derive all gradient terms of E[𝓛] in Eq. (10) w.r.t. the parameters θ. For better readability, we simplify the notation by defining Log(·) ≜ Log(Exp(µ_k, s_{nk}), y_n) in the remaining sections.

Gradient for µ_k: the gradient for updating µ_k is
\[
\nabla_{\mu_k} E[\mathcal{L}] = \sum_{n,k=1}^{N,K} \gamma_{nk}\, \tau_k\, d_{\mu_k}\mathrm{Exp}(\mu_k, s_{nk})^{\dagger}\, \mathrm{Log}(\cdot), \tag{11}
\]
where † denotes the adjoint operator, i.e., for any tangent vectors û and v̂,
\[
\langle d_{\mu_k}\mathrm{Exp}(\mu_k, s_{nk})\,\hat{u}, \hat{v} \rangle = \langle \hat{u}, d_{\mu_k}\mathrm{Exp}(\mu_k, s_{nk})^{\dagger}\, \hat{v} \rangle.
\]
Gradient for τ_k: the gradient of τ_k is computed as
\[
\nabla_{\tau_k} E[\mathcal{L}] = \sum_{n,k=1}^{N,K} \gamma_{nk} \left\{ \frac{A_{n-1}}{C(\tau_k)} \int_0^R \frac{r^2}{2} \exp\!\Big(-\frac{\tau_k r^2}{2}\Big) \prod_{\kappa=2}^{n} |\kappa_\kappa|^{-1/2} f_\kappa\big(\sqrt{|\kappa_\kappa|}\, r\big)\, dr \;-\; \frac{1}{2}\|\mathrm{Log}(\cdot)\|^2 \right\}, \tag{12}
\]
where A_{n−1} is the surface area of the unit sphere S^{n−1}, r is the radius, and κ_κ denotes the sectional curvature. Here R = min_v R(v), where R(v) is the maximum distance of Exp(µ_k, rv), with v a point on the unit sphere S^{n−1} ⊂ T_{µ_k}M. While this formula is only valid for simply connected symmetric spaces, for other spaces it should be adapted according to the definition of the probability density function in Eq. (2).

To derive the gradients w.r.t. W_k, Λ_k, and x_{nk}, we first need to compute d‖Log(·)‖²/ds_{nk}. Analogous to Eq. (11), we have
\[
\frac{d\,\|\mathrm{Log}(\cdot)\|^2}{d s_{nk}} = 2\, \big( d_{s_{nk}}\mathrm{Exp}(\mu_k, s_{nk})^{\dagger}\, \mathrm{Log}(\cdot) \big). \tag{13}
\]
After applying the chain rule, we obtain the remaining gradient terms as follows:

Gradient for W_k: the gradient term of W_k is
\[
\nabla_{W_k} E[\mathcal{L}] = \sum_{n,k=1}^{N,K} \gamma_{nk}\, \tau_k \cdot \frac{d\,\|\mathrm{Log}(\cdot)\|^2}{d s_{nk}} \cdot x_{nk}^{T} \Lambda_k. \tag{14}
\]
To maintain the mutual orthogonality of the columns of W_k, we consider W_k as a point on the Stiefel manifold V_q(T_µM), i.e., the space of orthonormal q-frames in T_µM, and project the gradient in Eq. (14) onto the tangent space T_{W_k}V_q(T_µM). We then update W_k by taking a small step along the geodesic in the projected gradient direction. For details on the Stiefel manifold, see [8].

Gradient for Λ_k^a: the gradient term of each a-th diagonal element of Λ_k is
\[
\nabla_{\Lambda_k^a} E[\mathcal{L}] = \sum_{n,k=1}^{N,K} \gamma_{nk}\, \tau_k\, (W_k^a x_{nk}^a)^{T} \cdot \frac{d\,\|\mathrm{Log}(\cdot)\|^2}{d s_{nk}}, \tag{15}
\]
where W_k^a is the a-th column of W_k and x_{nk}^a is the a-th component of x_{nk}.

Gradient for x_{nk}: the gradient w.r.t. each x_{nk} is
\[
\nabla_{x_{nk}} E[\mathcal{L}] = -\sum_{n,k=1}^{N,K} \gamma_{nk} \left\{ x_{nk} - \tau_k\, \Lambda_k W_k^{T} \cdot \frac{d\,\|\mathrm{Log}(\cdot)\|^2}{d s_{nk}} \right\}. \tag{16}
\]

3.2 Bayesian Mixture Principal Geodesic Analysis

In this section, we further develop a Bayesian variant of MPPGA that is equipped with the functionality of automatic data dimensionality reduction. A critical issue in maximum likelihood estimation of principal geodesic analysis is the choice of the number of principal geodesics to retain. This can be particularly problematic in our proposed MPPGA model, since we allow each cluster to have a principal subspace of different dimension, and an exhaustive search over the parameter space can become computationally intractable.

To address this issue, we develop a Bayesian mixture principal geodesic analysis (MBPGA) model that determines the number of principal modes automatically, avoiding ad hoc parameter tuning. We introduce an automatic relevance determination (ARD) prior [3] on each a-th diagonal element of the eigenvalue matrix Λ as
\[
p(\Lambda \mid \beta) = \prod_{a=1}^{d-1} \Big( \frac{\beta_a}{2\pi} \Big)^{d/2} e^{-\frac{\beta_a}{2} \|\Lambda_a\|^2}. \tag{17}
\]
Each hyper-parameter β_a controls the inverse variance of its corresponding principal geodesic W_a, the a-th column of the W matrix. This indicates that if β_a is particularly large, the corresponding W_a will tend to be small and will be effectively eliminated.

Incorporating this ARD prior into our MPPGA model defined in Eq. (7), we arrive at the log posterior distribution of Λ,
\[
\ln p(\Lambda \mid Y) = \mathcal{L} - \sum_{a=1}^{d-1} \frac{\beta_a}{2} \|\Lambda_a\|^2 + \mathrm{const}. \tag{18}
\]
Analogous to the EM algorithm introduced in Sec. 3.1, we maximize over Λ_a in the M-step using the gradient
\[
\nabla_{\Lambda_a} E[\mathcal{L}] = \sum_{n,k=1}^{N,K} \gamma_{nk}\, \tau_k\, (W_k^a x_{nk}^a)^{T} \cdot \frac{d\,\|\mathrm{Log}(\cdot)\|^2}{d s_{nk}} - \beta_a \Lambda_a. \tag{19}
\]
Similar to the ARD prior discussed in [2], the hyper-parameter β_a can be effectively estimated as β_a = d/‖Λ_a‖², where d is the dimension of the original data space.
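As a small illustration of how the ARD mechanism prunes components, the following sketch performs one gradient-ascent update of the diagonal of a single cluster's Λ per Eq. (19) and re-estimates β. Here `grad_data` stands for the data term of Eq. (19), computed with the Jacobi-field machinery above; the step size and threshold are our own illustrative choices.

```python
def mbpga_lambda_step(Lam, beta, grad_data, d, lr=1e-3, tol=1e-6):
    """One M-step update of the eigenvalues under the ARD prior (Eq. 19).

    Lam:       current diagonal of Lambda (1-D array)
    grad_data: data term of Eq. (19), i.e., the gradient without -beta_a Lam_a
    d:         dimension of the original data space
    """
    Lam = Lam + lr * (grad_data - beta * Lam)   # ascent step with ARD shrinkage
    beta = d / np.maximum(Lam**2, tol)          # beta_a = d / ||Lambda_a||^2
    keep = Lam > tol                            # principal geodesics not suppressed
    return Lam, beta, keep
```

Components whose β_a grows large are driven toward Λ_a = 0 and can be dropped, which is the automatic dimensionality selection observed in the experiments below.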
4 Experiments

We demonstrate the effectiveness of our MPPGA and MBPGA models using both synthetic and real data, and compare with two baseline methods, K-means-PCA [6] and MPPCA [22], designed for multimodal Euclidean data. The geometric background of the sphere and Kendall shape space, including the computation of the Riemannian exponential map, log map, and Jacobi fields, can be found in [26,9].
Sphere.
Using the generative model for PGA, we simulate a random sample of 764 data points on the unit sphere S² with known parameters W, Λ, τ, and π (see Tab. 1). The data points form three clusters (Green: 200; Blue: 289; Black: 275). Note that the ground truth µ is generated from random uniform points on the sphere, and W is generated from a random Gaussian matrix, to which we then apply the Gram-Schmidt algorithm to ensure its columns are orthonormal.
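A sketch of this simulation protocol, reusing `exp_map` from Sec. 2 and assuming, as an approximation, that the Riemannian-normal noise ε can be drawn as a Gaussian tangent vector with variance 1/τ (function and variable names are ours):

```python
rng = np.random.default_rng(0)

def sample_cluster(n, lam, tau, rng):
    """Draw n points from one PGA cluster on the unit sphere (q = len(lam))."""
    q = len(lam)
    mu = rng.normal(size=3)
    mu /= np.linalg.norm(mu)                    # random base point on the sphere
    A = rng.normal(size=(3, q))
    A -= np.outer(mu, mu @ A)                   # project columns into T_mu
    W, _ = np.linalg.qr(A)                      # orthonormalize (Gram-Schmidt)
    X = rng.normal(size=(n, q))                 # latent coordinates x ~ N(0, I)
    Y = np.empty((n, 3))
    for i in range(n):
        s = W @ (np.asarray(lam) * X[i])        # s = W Lambda x
        p = exp_map(mu, s)
        eps = rng.normal(scale=tau ** -0.5, size=3)
        eps -= np.dot(eps, p) * p               # tangent noise at p (Gaussian
        Y[i] = exp_map(p, eps)                  #  approx. of the Riemannian normal)
    return Y, mu, W
```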
Corpus callosum shape.
The corpus callosum data are derived from the publicly released Open Access Series of Imaging Studies (OASIS) database. The dataset includes 32 magnetic resonance imaging scans of human brains, with subject ages from 19 to 90. The corpus callosum is segmented in a midsagittal slice using the ITK-SNAP program. The boundaries of these segmentations are sampled with 64 points. This procedure generates a sampling of the set of shape boundaries while enforcing correspondences between different point models within the population.
Mandible shape.
The mandible data are extracted from a collection of CT scans of human mandibles from 77 subjects (36 female vs. 41 male), aged 0 to 19. We sample 2 × 400 points on the boundaries.
We first run our EM estimation of both MPPGA and MBPGA to test whether we can recover the model parameters. To initialize the model parameters (e.g., the cluster mean µ, the principal eigenvector matrix W, and the eigenvalues Λ), we use the output of the K-means algorithm followed by linear PCA within each cluster (see the initialization sketch below). We distribute the weight uniformly over the mixing coefficients, i.e., π_k = 1/K. All precision parameters {τ_k} are initialized to 0.01. We compare our model with two existing algorithms, mixture probabilistic principal component analysis (MPPCA) [22] and K-means-PCA [6], both performed in Euclidean space. For a fair comparison, we keep the number of clusters the same across all algorithms.

To further investigate the applicability of our MPPGA model to real data, we test on 2D shapes of the corpus callosum to study brain degeneration. The idea is to identify shape differences between two sub-populations, healthy vs. control, by analyzing their shape variability. We also run the extended Bayesian version of our model, MBPGA, to automatically select a compact set of principal geodesics to represent the data variability. We perform similar experiments on the 2D mandible shape data to study group differences across genders, as well as within-group shape variability that reflects localized regions of growth.
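The initialization described above can be sketched as follows, assuming scikit-learn's `KMeans` and `PCA` run in the embedding space, with the resulting means re-projected onto the sphere and the principal directions projected into the tangent spaces (the names and the lifting strategy are our own illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def initialize(Y, K, q):
    """K-means + per-cluster linear PCA initialization for MPPGA on the sphere."""
    km = KMeans(n_clusters=K, n_init=10).fit(Y)
    params = []
    for k in range(K):
        Yk = Y[km.labels_ == k]
        mu = Yk.mean(axis=0)
        mu /= np.linalg.norm(mu)                # project the Euclidean mean to S^2
        pca = PCA(n_components=q).fit(Yk)
        W = pca.components_.T
        W -= np.outer(mu, mu @ W)               # move directions into T_mu
        W, _ = np.linalg.qr(W)
        Lam = np.sqrt(pca.explained_variance_)  # rough scale factors for Lambda
        params.append((mu, W, Lam))
    pi = np.full(K, 1.0 / K)                    # uniform mixing coefficients
    tau = np.full(K, 0.01)                      # initial precisions
    return params, pi, tau
```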
Fig. 3 compares the estimated results of our MPPGA/MBPGA models with the two baseline methods, K-means-PCA and MPPCA. For visualization, we project the estimated principal modes of the K-means-PCA and MPPCA models from Euclidean space onto the sphere. Our model automatically separates the sphere data into three groups, which align fairly well with the ground truth (Green: 200; Blue: 289; Black: 275). For the geodesics in each cluster (ground truth in yellow, model estimate in red), our results overlap the ground truth better than the baselines. This also indicates that our model recovers parameters closer to the truth (as shown in Tab. 1). In particular, the MBPGA model is able to automatically select an effective dimension of the principal subspaces to represent the data variability.

Fig. 3: Comparison of our MPPGA/MBPGA with K-means-PCA and MPPCA (after being projected from Euclidean space onto the sphere). The three clusters are marked in green, blue, and black. Yellow lines: ground truth geodesics; red lines: estimated geodesics. Panels: (a) K-means-PCA, (b) MPPCA, (c) MPPGA, (d) MBPGA.
Table 1: Comparison between ground truth parameters {λ_k, π_k, τ_k} and the estimates of our models and the baseline algorithms.

                λ_{k=1,2,3}          π_{k=1,2,3}                τ_{k=1,2,3}
Ground truth    (0.2, 0.01, 0)       (0.2618, 0.3783, 0.3599)   (277.7778, 123.4568, 69.4444)
K-means-PCA     (0.1843, 0.0177, 0)  (0.2500, 0.3927, 0.3573)   N/A
MPPCA           (0.5439, 0.0450, 0)  (0.2585, 0.3586, 0.3829)   (163.9344, 107.5269, 101.0101)
MPPGA           (0.1901, 0.0099, 0)  (0.2618, 0.3783, 0.3599)   (211.8783, 137.7593, 94.8111)
MBPGA           (0.1905, 0, 0)       (0.2618, 0.3783, 0.3599)   (212.4965, 140.0511, 96.1169)
Fig. 4 shows the shape variations estimated by our MPPGA and MBPGA models. The corpus callosum shapes are automatically clustered into two groups: healthy vs. control. An example of a segmented corpus callosum from brain MRI is shown in Fig. 4(a). Fig. 4(b)-(e) show shape variations generated from points along the first principal geodesic, Exp(µ, αw_a), with a = 1 and α = (−2, −1, 0, 1, 2) × √λ. The corpus callosum from the healthy group is significantly larger than that of the control group. Meanwhile, the anterior and posterior ends of the corpus callosum show larger variation than the mid-caudate, which is consistent with previous studies.

Fig. 4: Corpus callosum shape variations (healthy k₁ vs. control k₂) along the first principal geodesic (α = (−2, −1, 0, 1, 2) × √λ), estimated by our MPPGA and MBPGA models. Panels: (a) example of corpus callosum segmentation from an MRI slice; (b) MPPGA k₁; (c) MPPGA k₂; (d) MBPGA k₁; (e) MBPGA k₂.

Fig. 5 shows that MPPGA and MBPGA estimate fairly close eigenvalues on the corpus callosum data. Since the ARD prior introduced in MBPGA automatically suppresses irrelevant principal geodesics to zero, 15 principal geodesics are selected out of 128 in total.

Fig. 5: Eigenvalues estimated by MPPGA/MBPGA on corpus callosum data.

We validate our MBPGA model on the mandible shape data (2D examples are visualized in Fig. 6(a)), since MBPGA produces results fairly close to those of MPPGA while providing automatic data dimensionality reduction. The MBPGA model reduces the original data dimension from d = 800 to d = 70. Fig. 6(b)(c) display shape variations of mandibles for the male and female groups. Male mandibles generally show larger variations than female mandibles, which is consistent with previous studies [5]. In particular, male mandibles have larger variation in the temporal crest and the base of the mandible.

Fig. 6: 2D examples of mandible shape data and shape variations (male vs. female) along the first principal geodesic (α = (−2, −1, 0, 1, 2) × √λ), estimated by the MBPGA model. Panels: (a) 2D examples; (b) shape variation (male); (c) shape variation (female).

5 Conclusion

We presented a mixture model of PGA (MPPGA) on general Riemannian manifolds. We developed an Expectation Maximization algorithm for maximum likelihood estimation of the parameters, including the underlying principal subspaces, together with automatic data clustering. This work takes the first step toward generalizing mixture models of principal mode analysis to Riemannian manifolds. A Bayesian variant of MPPGA (MBPGA) was also introduced for automatic dimensionality reduction. This model is particularly useful because it avoids singularities associated with maximum likelihood estimation by suppressing irrelevant information, e.g., outliers or noise. Our proposed model also paves the way for new tasks on manifolds such as hierarchical clustering and classification. Notice that all experiments conducted in this paper use a predetermined number of clusters k (e.g., healthy vs. control in the corpus callosum data, or male vs. female in the mandible data). For datasets with completely unknown cluster structure, methods such as the Elbow [15], Silhouette [13], and Gap statistic [21] methods can be used to determine the optimal number of clusters. This will be further investigated in our future work.

References
1. Banerjee, M., Jian, B., Vemuri, B.C.: Robust Fréchet mean and PGA on Riemannian manifolds with applications to neuroimaging. In: International Conference on Information Processing in Medical Imaging. pp. 3–15. Springer (2017)
2. Bishop, C.M.: Bayesian PCA. In: Advances in Neural Information Processing Systems. pp. 382–388 (1999)
3. Bishop, C.M.: Pattern Recognition and Machine Learning. pp. 500–600 (2006)
4. Chen, J., Liu, J.: Mixture principal component analysis models for process monitoring. Industrial & Engineering Chemistry Research (4), 1478–1488 (1999)
5. Chung, M.K., Qiu, A., Seo, S., Vorperian, H.K.: Unified heat kernel regression for diffusion, kernel smoothing and wavelets on manifolds and its application to mandible growth modeling in CT images. Medical Image Analysis (1), 63–76 (2015)
6. Cootes, T.F., Taylor, C.J.: A mixture model for representing shape variation. Image and Vision Computing (8), 567–573 (1999)
7. Do Carmo, M.: Riemannian Geometry. Birkhäuser (1992)
8. Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications (2), 303–353 (1998)
9. Fletcher, P.T.: Geodesic regression and the theory of least squares on Riemannian manifolds. International Journal of Computer Vision (2), 171–185 (2013)
10. Fletcher, P.T., Lu, C., Pizer, S.M., Joshi, S.: Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging (8), 995–1005 (2004)
11. Fletcher, P.T., Zhang, M.: Probabilistic geodesic models for regression and dimensionality reduction on Riemannian manifolds. In: Riemannian Computing in Computer Vision, pp. 101–121. Springer (2016)
12. Jolliffe, I.T.: Principal component analysis and factor analysis. In: Principal Component Analysis, pp. 115–128. Springer (1986)
13. Kaufman, L., Rousseeuw, P.J.: Partitioning around medoids (Program PAM). In: Finding Groups in Data: An Introduction to Cluster Analysis, pp. 68–125 (1990)
14. Kendall, D.G.: Shape manifolds, procrustean metrics, and complex projective spaces. Bulletin of the London Mathematical Society (2), 81–121 (1984)
15. Ketchen, D.J., Shook, C.L.: The application of cluster analysis in strategic management research: an analysis and critique. Strategic Management Journal (6), 441–458 (1996)
16. Mardia, K.V., Jupp, P.E.: Directional Statistics, vol. 494. John Wiley & Sons (2009)
17. Obata, M.: Certain conditions for a Riemannian manifold to be isometric with a sphere. Journal of the Mathematical Society of Japan (3), 333–340 (1962)
18. Roweis, S.T.: EM algorithms for PCA and SPCA. In: Advances in Neural Information Processing Systems. pp. 626–632 (1998)
19. Sommer, S., Lauze, F., Hauberg, S., Nielsen, M.: Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations. In: European Conference on Computer Vision. pp. 43–56. Springer (2010)
20. Sommer, S., Lauze, F., Nielsen, M.: Optimization over geodesics for exact principal geodesic analysis. Advances in Computational Mathematics (2), 283–313 (2014)
21. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2), 411–423 (2001)
22. Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural Computation (2), 443–482 (1999)
23. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (3), 611–622 (1999)
24. Turaga, P., Veeraraghavan, A., Srivastava, A., Chellappa, R.: Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (11), 2273–2286 (2011)
25. Tuzel, O., Porikli, F., Meer, P.: Pedestrian detection via classification on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(10), 1713–1727 (2008)
26. Zhang, M., Fletcher, P.T.: Probabilistic principal geodesic analysis. In: Advances in Neural Information Processing Systems (2013)