The EM Perspective of Directional Mean Shift Algorithm
Yikun Zhang [email protected]
Yen-Chi Chen [email protected]
University of Washington, Seattle, WA 98195
Abstract
The directional mean shift (DMS) algorithm is a nonparametric method for pursuing local modes of densities defined by kernel density estimators on the unit hypersphere. In this paper, we show that any DMS iteration can be viewed as a generalized Expectation-Maximization (EM) algorithm; in particular, when the von Mises kernel is applied, it becomes an exact EM algorithm. Under the (generalized) EM framework, we provide a new proof for the ascending property of density estimates and demonstrate the global convergence of directional mean shift sequences. Finally, we give a new insight into the linear convergence of the DMS algorithm.
Keywords: Directional data, mean shift algorithm, kernel smoothing, EM algorithm
1. Introduction
The directional mean shift (DMS) algorithm is a generalization of the regular mean shift algorithm with Euclidean data (Fukunaga and Hostetler, 1975; Cheng, 1995; Comaniciu and Meer, 2002; Carreira-Perpiñán, 2015) toward handling directional data, which are assumed to consist of observations lying on a unit hypersphere $\Omega_q = \{x \in \mathbb{R}^{q+1}: \|x\| = 1\}$. Analogous to the regular mean shift procedure, the DMS algorithm is built upon a directional kernel density estimator (KDE) (Hall et al., 1987; Bai et al., 1988; Zhao and Wu, 2001; García-Portugués, 2013) and conducts fixed-point iterations to pursue the local modes of the directional KDE.

Several different forms of the DMS algorithm have been proposed and applied to various fields, such as gene clustering (Oba et al., 2005), medical structure classification (Kafai et al., 2010), and seismological analysis (Zhang and Chen, 2020), in the last two decades. Compared to its success in practice, the convergence theory of the DMS algorithm is less developed. The convergence of density estimates along any mean shift iterative sequence (or the so-called ascending property) was proved by Cheng (1995); Comaniciu and Meer (2002); Li et al. (2007) for the regular mean shift algorithm and generalized to the mean shift algorithm on arbitrary manifolds by Subbarao and Meer (2009). This ascending property was specialized to a DMS algorithm with the von Mises kernel by Kobayashi and Otsu (2010). Kafai et al. (2010) and Zhang and Chen (2020) later showed that the DMS algorithm's ascending property holds for any convex and differentiable directional kernel.

However, the convergence of DMS sequences is a more challenging problem. Such convergence has been well-studied in the regular mean shift scenarios (Cheng, 1995; Li et al., 2007; Carreira-Perpiñán, 2007; Aliyari Ghassabeh, 2013, 2015; Arias-Castro et al., 2016). In the directional data setting, the only work that we are aware of is Zhang and Chen (2020), but their result is limited to small neighborhoods of local modes. The global behavior of DMS sequences remains unclear.

To address the global convergence of any DMS sequence on $\Omega_q$, we formulate the DMS algorithm into a special case of the famous (generalized) Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Wu, 1983; Geoffrey J. McLachlan, 2008; Balakrishnan et al., 2017). It allows us to present a new proof of the ascending property and study the convergence of DMS sequences based on extensive research in the (generalized) EM algorithm. We are inspired by Carreira-Perpiñán (2007) for his method of writing the regular mean shift with Euclidean data as a (generalized) EM algorithm. However, the formulation is more difficult in any DMS algorithm because directional data lie on a nonlinear manifold. We overcome the problem by projecting the data from the original hypersphere $\Omega_q$ onto a higher-dimensional hypersphere $\Omega_{q+1}$ and designing a mixture model to associate the DMS algorithm with the (generalized) EM algorithm.

Main results.
The main objective of this paper is to show that the usual DMS algorithm (reviewed in Section 2.2) is a (generalized) EM algorithm. Specifically, our work contributes the following:

1. We construct a maximum likelihood problem viewing the directional KDE as a mixture model with a single parameter $\mu \in \Omega_q$ in a higher-dimensional sphere $\Omega_{q+1}$. Fitting the mixture model with a (generalized) EM algorithm at a single point $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$ leads to the DMS algorithm (Section 3).

2. Under the (generalized) EM framework, we prove the ascending property of the DMS algorithm based on the monotonicity of observed likelihoods along any (generalized) EM iteration (Section 4.2).

3. We derive the global convergence of DMS sequences using properties from the EM perspective (Section 4.3).

4. Finally, compared to the result in Zhang and Chen (2020), we provide an alternative but straightforward proof of linear convergence of the DMS algorithm (Section 5).

Notation.
Let $\Omega_q = \{x \in \mathbb{R}^{q+1}: \|x\| = 1\}$ be the $q$-dimensional unit sphere in the ambient Euclidean space $\mathbb{R}^{q+1}$, where $\|\cdot\|$ is the usual Euclidean norm (or $L_2$-norm) in $\mathbb{R}^{q+1}$. The inner product between two vectors $x, y \in \mathbb{R}^{q+1}$ is denoted by $x^T y$. Given a smooth function $f: \mathbb{R}^{q+1} \to \mathbb{R}$, we denote its total gradient and Hessian by
\[
\nabla f(x) = \left( \frac{\partial f(x)}{\partial x_1}, \dots, \frac{\partial f(x)}{\partial x_{q+1}} \right)^T \in \mathbb{R}^{q+1}
\quad \text{and} \quad
\nabla\nabla f(x) = \begin{pmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_{q+1}} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_{q+1} \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_{q+1}^2}
\end{pmatrix} \in \mathbb{R}^{(q+1) \times (q+1)},
\]
respectively. We use the big-$O$ notation $p_t = O(q_t)$ for a sequence of vectors $p_t$ and positive scalars $q_t$ if there is a global constant $C$ such that $\|p_t\| \leq C q_t$ for all sufficiently large $t$.
2. Background
Let $X_1, \dots, X_n \in \Omega_q \subset \mathbb{R}^{q+1}$ be a random sample generated from a directional density function $f$ on $\Omega_q$ with $\int_{\Omega_q} f(x)\, \omega_q(dx) = 1$, where $\omega_q$ is the Lebesgue measure on $\Omega_q$. The directional KDE at point $x \in \Omega_q$ is often written as (Beran, 1979; Hall et al., 1987; Bai et al., 1988; García-Portugués, 2013):
\[
\widehat{f}_h(x) = \frac{c_{h,q,L}}{n} \sum_{i=1}^n L\left( \frac{1 - x^T X_i}{h^2} \right), \tag{1}
\]
where $L$ is a directional kernel (i.e., a rapidly decaying function with nonnegative values defined on $[0, \infty) \subset \mathbb{R}$), $h > 0$ is the smoothing bandwidth parameter, and $c_{h,q,L}$ is a normalizing constant satisfying
\[
c_{h,q,L}^{-1} = \int_{\Omega_q} L\left( \frac{1 - x^T y}{h^2} \right) \omega_q(dy) = O(h^q). \tag{2}
\]
The bandwidth selection is crucial in determining the performance of directional KDEs because it controls the bias-variance tradeoff. There are various reliable bandwidth selection mechanisms in the literature (Hall et al., 1987; Bai et al., 1988; Taylor, 2008; Marzio et al., 2011; Oliveira et al., 2012; García-Portugués, 2013; Saavedra-Nieves and María Crujeiras, 2020). Compared to the bandwidth, the choice of the kernel is less influential. One popular candidate is the so-called von Mises kernel $L(r) = e^{-r}$, whose name originates from the famous $q$-von Mises-Fisher (vMF) distribution on $\Omega_q$ with the following density:
\[
f_{\mathrm{vMF}}(x \mid \mu, \kappa) = C_q(\kappa) \cdot \exp\left( \kappa \mu^T x \right) \quad \text{with} \quad C_q(\kappa) = \frac{\kappa^{\frac{q-1}{2}}}{(2\pi)^{\frac{q+1}{2}} I_{\frac{q-1}{2}}(\kappa)}, \tag{3}
\]
where $\mu \in \Omega_q$ is the directional mean, $\kappa \geq 0$ is the concentration parameter, and $I_{\alpha}(\kappa)$ is the modified Bessel function of the first kind of order $\alpha$; see Mardia (1975); Mardia and Jupp (2000) for more details. With the von Mises kernel, the directional KDE (1) becomes a mixture of $q$-von Mises-Fisher densities as:
\[
\widehat{f}_h(x) = \frac{1}{n} \sum_{i=1}^n f_{\mathrm{vMF}}\left( x; X_i, \frac{1}{h^2} \right) = \frac{C_q(1/h^2)}{n} \sum_{i=1}^n \exp\left( \frac{x^T X_i}{h^2} \right). \tag{4}
\]
The EM framework for estimating the parameters of a mixture of vMF distributions is also well-studied. We offer a detailed review in Appendix B. Another potential choice is the following truncated convex kernel proposed by Chang-Chien et al. (2012); Yang et al. (2014):
\[
L(r) = (1 - r)^p \cdot \mathbf{1}\{0 \leq r \leq 1\} \quad \text{with } p \geq 1. \tag{5}
\]
We follow the technique in Yang et al. (2014) to derive the most commonly used directional mean shift (DMS) algorithm. Assume that the kernel $L$ is differentiable except for finitely many points on $[0, \infty)$. Given the directional KDE $\widehat{f}_h(x)$ in (1), we introduce a Lagrangian multiplier $\lambda$ to maximize $\widehat{f}_h(x)$ under the constraint $x^T x = 1$ as follows:
\[
\widehat{\mathcal{L}}_h(x, \lambda) = \frac{c_{h,q,L}}{n} \sum_{i=1}^n L\left( \frac{1 - x^T X_i}{h^2} \right) + \lambda (1 - x^T x).
\]
Taking the partial derivatives of $\widehat{\mathcal{L}}_h$ with respect to $x$ and $\lambda$ and setting them to zero yield that
\[
\frac{\partial \widehat{\mathcal{L}}_h}{\partial x} = -\frac{c_{h,q,L}}{n h^2} \sum_{i=1}^n X_i\, L'\left( \frac{1 - x^T X_i}{h^2} \right) - 2\lambda x = 0 \quad \text{and} \quad 1 - x^T x = 0.
\]
Solving for $x$ leads to the fixed-point iteration equation $x^{(t+1)} = F(x^{(t)})$ for the DMS algorithm, where
\[
F(x) = -\frac{\sum_{i=1}^n X_i\, L'\left( \frac{1 - x^T X_i}{h^2} \right)}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - x^T X_i}{h^2} \right) \right\|}. \tag{6}
\]
Zhang and Chen (2020) suggested an alternative derivation of the DMS algorithm based on the following equality on $\Omega_q$:
\[
\widehat{f}_h(x) = \widetilde{f}_h(x) \quad \text{with} \quad \widetilde{f}_h(x) = \frac{c_{h,q,L}}{n} \sum_{i=1}^n L\left( \frac{1}{2} \left\| \frac{x - X_i}{h} \right\|^2 \right),
\]
and wrote out an explicit formula for the mean shift vector. Based on their derivation, the preceding fixed-point iteration scheme (6) can be written in terms of $\nabla \widehat{f}_h(x)$ as:
\[
x^{(t+1)} = F(x^{(t)}) = \frac{\nabla \widehat{f}_h(x^{(t)})}{\left\| \nabla \widehat{f}_h(x^{(t)}) \right\|}, \tag{7}
\]
where $\nabla$ is the total gradient operator in the ambient space $\mathbb{R}^{q+1}$.
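To make the iteration concrete, the following minimal NumPy sketch implements the fixed-point update (6) for a generic differentiable kernel derivative $L'$; the function names, stopping rule, and toy data are our own illustrative choices and are not part of the original algorithm specification.

```python
import numpy as np

def dms_update(x, X, h, dL):
    """One DMS fixed-point step (6): proportional to -sum_i X_i L'((1 - x^T X_i)/h^2)."""
    v = X.T @ (-dL((1.0 - X @ x) / h**2))
    return v / np.linalg.norm(v)            # project back onto the unit sphere

def dms(x0, X, h, dL, tol=1e-7, max_iter=1000):
    """Iterate x^{(t+1)} = F(x^{(t)}) in (6)/(7) until successive points are within tol."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iter):
        x_new = dms_update(x, X, h, dL)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# derivative of the von Mises kernel L(r) = exp(-r); for the truncated kernel (5) with
# p = 2 one could instead use dL = lambda r: np.where(r <= 1.0, -2.0 * (1.0 - r), 0.0)
dL_vmf = lambda r: -np.exp(-r)

rng = np.random.default_rng(0)
theta = rng.normal(loc=0.5, scale=0.3, size=200)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # n points on the circle Omega_1
mode = dms(np.array([1.0, 0.0]), X, h=0.3, dL=dL_vmf)
```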
3. Directional Mean Shift as a Generalized EM Algorithm
Recall that $\{X_1, \dots, X_n\} \subset \Omega_q$ is the observed directional dataset used by the DMS algorithm in Section 2.2. Our derivation starts with defining a mixture model on a higher-dimensional hypersphere $\Omega_{q+1}$. Consider the following mixture model for any point $y \in \Omega_{q+1}$:
\[
f_{L,q+1}(y) = \sum_{i=1}^n \alpha_i \cdot C_{\kappa_i, q+1, L} \cdot L\left[ \kappa_i (1 - y^T \nu_i) \right], \tag{8}
\]
where $\{\alpha_i, \nu_i, \kappa_i\}_{i=1}^n$ are the parameters of this mixture model such that $\sum_{i=1}^n \alpha_i = 1$, $\alpha_i \geq 0$ for $i = 1, \dots, n$, and $C_{\kappa_i, q+1, L}$ is the normalizing constant such that
\[
C_{\kappa_i, q+1, L}^{-1} = \int_{\Omega_{q+1}} L\left[ \kappa_i \left( 1 - y^T \nu_i \right) \right] \omega_{q+1}(dy) = \omega_q(\Omega_q) \int_{-1}^{1} L\left[ \kappa_i (1 - t) \right] \cdot (1 - t^2)^{\frac{q-1}{2}}\, dt \tag{9}
\]
with $\omega_q(\Omega_q)$ as the surface area of $\Omega_q$. Equation (9) demonstrates that the normalizing constant $C_{\kappa_i, q+1, L}$ is independent of the choice of the directional mean parameter $\nu_i$. The parameters $\alpha_i$ and $\kappa_i$ represent the weight (mixing proportion) and the amount of smoothing (concentration) applied to the $i$-th observation, respectively. Later, we will take $\alpha_i = \frac{1}{n}$ and $\kappa_i = \frac{1}{h^2}$, though they would depend on the index $i$ in a generic case. One should keep in mind that $\alpha_i, \kappa_i$ are fixed and given.

We introduce an unknown parameter $\mu$ relating the directional mean parameter $\nu_i$ in the mixture model (8) to each observation $X_i$ in the directional dataset as:
\[
\nu_i = \nu_i(\mu) = \left( X_i \sqrt{1 - (\mu^T X_i)^2},\; \mu^T X_i \right)^T \in \Omega_{q+1}. \tag{10}
\]
Given $\mu \in \Omega_q$, the distribution in (8) is a mixture of $n$ densities on $\Omega_{q+1}$ with directional means $\{\nu_i(\mu)\}_{i=1}^n$, fixed mixing proportions $\{\alpha_i\}_{i=1}^n$, and fixed concentration parameters $\{\kappa_i\}_{i=1}^n$. Varying the parameter $\mu \in \Omega_q$ in the mixture model leads to a spherical transformation of the whole mixture distribution in the higher-dimensional unit sphere $\Omega_{q+1}$. Our goal is to find the maximum likelihood estimate (MLE) of $\mu$ given a single observation from the mixture model.
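The lifting in (10) is easy to probe numerically: $\nu_i(\mu)$ is a unit vector in $\Omega_{q+1}$, and for $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$ the inner product $\nu_i(\mu)^T Y$ recovers $\mu^T X_i$. A small illustrative check (our own code, not from the paper) is below.

```python
import numpy as np

def lift(mu, Xi):
    """nu_i(mu) = (X_i * sqrt(1 - (mu^T X_i)^2), mu^T X_i)^T in Omega^{q+1}, as in (10)."""
    c = float(mu @ Xi)                                  # mu^T X_i
    return np.append(Xi * np.sqrt(max(1.0 - c**2, 0.0)), c)

rng = np.random.default_rng(1)
q = 2
Xi = rng.normal(size=q + 1); Xi /= np.linalg.norm(Xi)   # a point on Omega_q
mu = rng.normal(size=q + 1); mu /= np.linalg.norm(mu)   # a candidate parameter on Omega_q
nu = lift(mu, Xi)
Y = np.zeros(q + 2); Y[-1] = 1.0                        # Y = (0, ..., 0, 1)^T in Omega^{q+1}
assert np.isclose(np.linalg.norm(nu), 1.0)              # nu_i(mu) lies on Omega^{q+1}
assert np.isclose(nu @ Y, mu @ Xi)                      # nu_i(mu)^T Y = mu^T X_i
```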
We consider the following hypothetical setup for deriving the EM algorithm and connecting it with the DMS algorithm. Suppose that we are given $Y$, an observation in $\Omega_{q+1}$. We fit the mixture model (8) to $Y$ and attempt to find the MLE of $\mu$ based on this single observation. For such a mixture model, finding the MLE of $\mu$ can be done via the EM algorithm. To this end, we introduce a hidden/latent random variable $Z$ indicating from which component $Y$ is generated. Namely, $Z = i$ if $Y$ is sampled from $C_{\kappa_i, q+1, L} \cdot L\left[ \kappa_i (1 - y^T \nu_i) \right]$ in (8). The EM framework for obtaining the MLE of $\mu$ given the observed data $Y$ is as follows:

• E-Step: Assuming that the value of $Z$ is known, the complete log-likelihood is written as:
\[
\log P(Y, Z \mid \mu) = \sum_{i=1}^n \mathbf{1}\{Z = i\} \cdot \left\{ \log \alpha_i + \log C_{\kappa_i, q+1, L} + \log L\left[ \kappa_i (1 - Y^T \nu_i) \right] \right\}. \tag{11}
\]
Given $(Y, \mu^{(t)})$ at the $t$-th iteration, we take the expectation of the complete log-likelihood (11) with respect to the current posterior distribution of $Z \mid (Y, \mu^{(t)})$ as:
\[
Q(\mu \mid \mu^{(t)}) = \mathbb{E}_{Z \mid (Y, \mu^{(t)})}\left[ \log P(Y, Z \mid \mu) \right]
= \sum_{i=1}^n \left[ \log \alpha_i + \log C_{\kappa_i, q+1, L} \right] \cdot P(Z = i \mid Y, \mu^{(t)}) + \sum_{i=1}^n \log L\left[ \kappa_i (1 - Y^T \nu_i) \right] \cdot P(Z = i \mid Y, \mu^{(t)}). \tag{12}
\]
By Bayes' theorem, the posterior distribution $P(Z = i \mid Y, \mu^{(t)})$ is computed as
\[
P(Z = i \mid Y, \mu^{(t)}) = \frac{P(Y \mid Z = i, \mu^{(t)}) \cdot P(Z = i \mid \mu^{(t)})}{P(Y \mid \mu^{(t)})} = \frac{\alpha_i \cdot C_{\kappa_i, q+1, L} \cdot L\left[ \kappa_i \left( 1 - Y^T \nu_i^{(t)} \right) \right]}{\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell, q+1, L} \cdot L\left[ \kappa_\ell \left( 1 - Y^T \nu_\ell^{(t)} \right) \right]}. \tag{13}
\]

• M-Step: To derive the M-step, we focus on a particular case where the observation is $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$. Since $\nu_i^T Y = \mu^T X_i$ for $i = 1, \dots, n$, the Q-function (12) reduces to
\[
Q(\mu \mid \mu^{(t)}) = \sum_{i=1}^n \left[ \log \alpha_i + \log C_{\kappa_i, q+1, L} \right] \cdot P(Z = i \mid Y, \mu^{(t)}) + \sum_{i=1}^n \log L\left[ \kappa_i (1 - \mu^T X_i) \right] \cdot P(Z = i \mid Y, \mu^{(t)}). \tag{14}
\]
As shown in (9), each normalizing constant $C_{\kappa_i, q+1, L}$ is independent of the directional mean parameter $\nu_i$ for $i = 1, \dots, n$ (and consequently, of the parameter $\mu$), so we only need to maximize the second term in (14) with respect to $\mu$ under the constraint $\mu^T \mu = 1$. To this end, we introduce a Lagrangian multiplier $\lambda$ and obtain the following Lagrangian function:
\[
\mathcal{L}(\mu, \lambda) = \sum_{i=1}^n \log L\left[ \kappa_i (1 - \mu^T X_i) \right] \cdot P(Z = i \mid Y, \mu^{(t)}) + \lambda \left( 1 - \mu^T \mu \right).
\]
Taking the partial derivatives of $\mathcal{L}$ with respect to $\mu$ and $\lambda$ and setting them to zero yield that
\[
\frac{\partial \mathcal{L}}{\partial \mu} = \sum_{i=1}^n P(Z = i \mid Y, \mu^{(t)}) \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \mu^T X_i) \right]}{L\left[ \kappa_i (1 - \mu^T X_i) \right]} - 2\lambda \mu = 0 \quad \text{and} \quad 1 - \mu^T \mu = 0.
\]
This further implies that
\[
\begin{aligned}
\widehat{\mu} &= \frac{\sum_{i=1}^n P(Z = i \mid Y, \mu^{(t)}) \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}}{\left\| \sum_{i=1}^n P(Z = i \mid Y, \mu^{(t)}) \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]} \right\|} \\
&= \frac{\sum_{i=1}^n \frac{\alpha_i\, C_{\kappa_i, q+1, L}\, L\left[ \kappa_i (1 - X_i^T \mu^{(t)}) \right]}{\sum_{\ell=1}^n \alpha_\ell\, C_{\kappa_\ell, q+1, L}\, L\left[ \kappa_\ell (1 - X_\ell^T \mu^{(t)}) \right]} \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}}{\left\| \sum_{i=1}^n \frac{\alpha_i\, C_{\kappa_i, q+1, L}\, L\left[ \kappa_i (1 - X_i^T \mu^{(t)}) \right]}{\sum_{\ell=1}^n \alpha_\ell\, C_{\kappa_\ell, q+1, L}\, L\left[ \kappa_\ell (1 - X_\ell^T \mu^{(t)}) \right]} \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]} \right\|} \\
&= \frac{-\sum_{i=1}^n \kappa_i \alpha_i\, C_{\kappa_i, q+1, L}\, X_i \cdot L'\left[ \kappa_i \left( 1 - \widehat{\mu}^T X_i \right) \right] \cdot \frac{L\left[ \kappa_i (1 - X_i^T \mu^{(t)}) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}}{\left\| \sum_{i=1}^n \kappa_i \alpha_i\, C_{\kappa_i, q+1, L}\, X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right] \cdot \frac{L\left[ \kappa_i (1 - X_i^T \mu^{(t)}) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]} \right\|},
\end{aligned} \tag{15}
\]
where we plug the posterior distribution (13) into the second equality, and the term $\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell, q+1, L} \cdot L\left[ \kappa_\ell \left( 1 - X_\ell^T \mu^{(t)} \right) \right]$ is independent of the outer summation index $i$ in both the numerator and denominator, so it cancels out in the third equality.
Therefore, equation (15) defines a fixed-point iteration scheme for obtaining $\mu^{(t+1)}$ in the EM framework, and an exact M-step will iterate (15) until convergence.

Suppose that we take $\kappa_i = \frac{1}{h^2}$ and $\alpha_i = \frac{1}{n}$ for all $i = 1, \dots, n$ in the fixed-point iteration equation (15). Then, the normalizing constant $C_{\kappa_i, q+1, L}$ will no longer depend on the summation index $i$, and the fixed-point iteration equation reduces to
\[
\widehat{\mu} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - \widehat{\mu}^T X_i}{h^2} \right) \cdot \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \frac{1 - \widehat{\mu}^T X_i}{h^2} \right)}}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - \widehat{\mu}^T X_i}{h^2} \right) \cdot \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \frac{1 - \widehat{\mu}^T X_i}{h^2} \right)} \right\|} \tag{16}
\]
or, more specifically,
\[
\widehat{\mu}^{(k+1; t)} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \widehat{\mu}^{(k; t)}}{h^2} \right) \cdot \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \frac{1 - X_i^T \widehat{\mu}^{(k; t)}}{h^2} \right)}}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \widehat{\mu}^{(k; t)}}{h^2} \right) \cdot \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \frac{1 - X_i^T \widehat{\mu}^{(k; t)}}{h^2} \right)} \right\|} \tag{17}
\]
for $k = 0, 1, 2, \dots$. Therefore, the complete EM algorithm consists of two nested loops: the outer one is the usual iteration over $t$, while the inner one iterates $\{\widehat{\mu}^{(k; t)}\}_{k=0}^{\infty}$ under the fixed $\mu^{(t)}$ for an exact M-step. If we choose the initial value for iterating (17) as $\widehat{\mu}^{(0; t)} = \mu^{(t)}$ and do a single iteration of (17), i.e., $\mu^{(t+1)} = \widehat{\mu}^{(1; t)}$, we obtain
\[
\mu^{(t+1)} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\|}, \tag{18}
\]
which coincides with the mean shift updating formula with directional data (cf. equation (6)). In other words, the DMS algorithm can be viewed as the above EM algorithm with an inexact M-step update. More importantly, as we will demonstrate in Section 4.2, the Q-function $Q(\mu \mid \mu^{(t)})$ is non-decreasing along the iterative path defined by (18) under some mild conditions on kernel $L$. Therefore, the DMS algorithm with a general kernel is indeed a generalized EM algorithm.
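This nested structure is easy to mimic in a few lines. The sketch below (an illustrative implementation with $\alpha_i = 1/n$ and $\kappa_i = 1/h^2$; the function decomposition and names are our own) computes the E-step responsibilities in (13) and then either runs the inner fixed-point loop (17) for several steps (approximating an exact M-step) or takes a single step, which reproduces the DMS update (18).

```python
import numpy as np

def responsibilities(mu_t, X, h, L):
    """Posterior P(Z = i | Y, mu^(t)) in (13) with alpha_i = 1/n and kappa_i = 1/h^2."""
    w = L((1.0 - X @ mu_t) / h**2)
    return w / w.sum()

def m_step(mu_t, X, h, L, dL, n_inner=1):
    """Inner fixed-point loop (17); n_inner = 1 reproduces the DMS update (18)."""
    r = responsibilities(mu_t, X, h, L)
    mu = mu_t.copy()
    for _ in range(n_inner):
        args = (1.0 - X @ mu) / h**2
        v = X.T @ (r * (-dL(args) / L(args)))   # weighted mean shift target in (16)
        mu = v / np.linalg.norm(v)
    return mu

# von Mises kernel L(r) = exp(-r): the ratio -L'/L is identically 1, so the inner
# loop converges after a single step (the exact-EM case discussed below)
L_vmf = lambda r: np.exp(-r)
dL_vmf = lambda r: -np.exp(-r)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)); X /= np.linalg.norm(X, axis=1, keepdims=True)
mu0 = np.array([0.0, 0.0, 1.0])
one_step = m_step(mu0, X, h=0.5, L=L_vmf, dL=dL_vmf, n_inner=1)    # DMS update (18)
exact = m_step(mu0, X, h=0.5, L=L_vmf, dL=dL_vmf, n_inner=25)      # near-exact M-step
assert np.allclose(one_step, exact)   # they agree because L' is proportional to L here
```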
Remark 1. Our derivation is different from the suggestion given in Subbarao and Meer (2009). In our derivation, we cannot define a group action of $G$ on $\Omega_q$ such that the componentwise density $C_{\kappa_i, q+1, L} \cdot L\left[ \kappa_i (1 - y^T \nu_i) \right]$ in (8) is equivalent to the original density $C_{\kappa_i, q, L} \cdot L\left[ \kappa_i (1 - (g \cdot y)^T X_i) \right]$ under some $g \in G$, because these two densities lie on hyperspheres with different dimensions.

As the Gaussian mean shift with Euclidean data can be written as an EM algorithm (Carreira-Perpiñán, 2007), we show in this subsection that the DMS algorithm with the von Mises kernel is also an EM algorithm. That is, the preceding M-step in Section 3.1 is exact when we apply the von Mises kernel. This is because the von Mises kernel $L(r) = e^{-r}$ satisfies the property that $L' \propto L$, and the above M-step (15) simplifies to:
\[
\mu^{(t+1)} = \frac{\sum_{i=1}^n \kappa_i \alpha_i\, C_{q+1}(\kappa_i)\, X_i \cdot \exp\left( \kappa_i X_i^T \mu^{(t)} \right)}{\left\| \sum_{i=1}^n \kappa_i \alpha_i\, C_{q+1}(\kappa_i)\, X_i \cdot \exp\left( \kappa_i X_i^T \mu^{(t)} \right) \right\|}. \tag{19}
\]
Now, if we take $\kappa_i = \frac{1}{h^2}$ and $\alpha_i = \frac{1}{n}$ for all $i = 1, \dots, n$ as before, the exact M-step in (19) becomes
\[
\mu^{(t+1)} = \frac{\sum_{i=1}^n X_i \exp\left( \frac{X_i^T \mu^{(t)}}{h^2} \right)}{\left\| \sum_{i=1}^n X_i \exp\left( \frac{X_i^T \mu^{(t)}}{h^2} \right) \right\|}, \tag{20}
\]
which is precisely the DMS algorithm with the von Mises kernel. As a result, the DMS algorithm with the von Mises kernel is an exact EM algorithm.

Our derivation is motivated by the discovery of Carreira-Perpiñán (2007), in which the author obtained an elegant connection between the regular mean shift algorithm with Euclidean data and the (generalized) EM algorithm. The key of the derivation in Carreira-Perpiñán (2007) is to define an artificial maximum likelihood problem where

1. the underlying distribution is a mixture model defined by the original KDE so that the means of its components are given by the dataset to which the mean shift algorithm is applied, but the mixture model contains only a single displacement parameter,

2. and the mixture model is fit on a sample with a single point, the origin $Y = 0$.

Under the above construction, Carreira-Perpiñán (2007) proved that any maximum likelihood estimate of the displacement vector is a mode of the original KDE. However, as is argued in Appendix C, naively applying the idea of Carreira-Perpiñán (2007) would fail. We have to make a non-trivial extension of his approach to bridge the connection between the DMS and EM algorithms. The two key changes we made in our derivation are:

1. We construct a mixture model in a higher-dimensional hypersphere $\Omega_{q+1}$ and introduce the parameter $\mu \in \Omega_q$ controlling the projections of observations $\{X_1, \dots, X_n\} \subset \Omega_q$ onto $\Omega_{q+1}$.

2. We fit the mixture model on a sample containing a single point $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$.
4. Convergence of Directional Mean Shift as a Generalized EM Algorithm
In this section, we show that the Q-function (14) is non-decreasing along any DMS sequence (18) under some regularity conditions on kernel $L$. As the directional KDE will be identical to the corresponding observed likelihood up to a multiplicative constant, we furnish a new proof of the ascending property of density estimates along DMS sequences based on the generalized EM perspective, which is different from the existing results (e.g., Kobayashi and Otsu (2010); Kafai et al. (2010); Zhang and Chen (2020)). Moreover, the connection to the (generalized) EM algorithm implies that the DMS sequence converges from almost any starting point on $\Omega_q$, though it may get stuck at a saddle point or a local minimum of the directional KDE $\widehat{f}_h$ in (1); see such non-typical behaviors of EM algorithms in Sections 3.6.1 and 3.6.2 of Geoffrey J. McLachlan (2008). Nevertheless, the latter cases are unlikely in practice because both saddle points and minima are unstable for maximization (Carreira-Perpiñán, 2007). Therefore, the convergence of DMS sequences will almost always be to a local mode of $\widehat{f}_h$ in practice. This conclusion is much stronger than the local convergence in Theorem 11 of Zhang and Chen (2020).

To guarantee the convergence of the DMS iteration (18) as a generalized EM algorithm, we make the following assumptions on the directional KDE (1) and kernel $L$:

• (C1) The number of local modes of $\widehat{f}_h$ on $\Omega_q$ is finite, and the modes are isolated from other critical points.

• (C2) $L: [0, \infty) \to [0, \infty)$ is non-increasing, differentiable, and convex with $0 < L(0) < \infty$.

Condition (C1) is commonly assumed for the convergence of regular mean shift sequences (Li et al., 2007; Aliyari Ghassabeh, 2015). It is an evident corollary for the regular KDE with various kernels (Silverman, 1981; Hall et al., 2004; Genovese et al., 2016; Chacón, 2020), and one can easily extend the result to directional KDEs. Condition (C2) is a natural assumption on kernel $L$; see, for instance, Comaniciu and Meer (2002); Li et al. (2007); Kafai et al. (2010); Zhang and Chen (2020). It is satisfied by many directional kernels, such as the von Mises kernel $L(r) = e^{-r}$. The differentiability of kernel $L$ can be relaxed to include finitely many non-differentiable points on $[0, \infty)$. Such relaxation enables our convergence results to incorporate DMS algorithms with kernels having bounded supports (e.g., kernels in (5)).

The key reason why the DMS iteration is a generalized EM algorithm is that a single iteration of every incomplete M-step with the initial value $\mu^{(t)}$ as in (18) ensures the following non-decreasing property of the Q-function (14).
Theorem 1. Under condition (C2), the Q-function (14) satisfies $Q(\mu^{(t+1)} \mid \mu^{(t)}) \geq Q(\mu^{(t)} \mid \mu^{(t)})$, where
\[
\mu^{(t+1)} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\|}
\]
is from the iteration (18).

The essence of the proof of Theorem 1 lies in the following two inequalities:
\[
\log y - \log x \geq \frac{y - x}{y} \quad \text{and} \quad L(y) - L(x) \geq L'(x) \cdot (y - x), \tag{21}
\]
which are guaranteed by the concavity of $\log(\cdot)$ as well as the convexity and differentiability of the kernel $L$; see Appendix D.1. The condition (C2) on kernel $L$ provides a sufficient condition under which the directional mean shift iteration does increase the Q-function (14), but it is by no means necessary. One can choose a different kernel and show that the iteration (18) increases the Q-function using a similar argument but under other conditions (e.g., $\log L(r)$ is convex).

The observed log-likelihood with data point $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$ under the EM framework of Section 3.1 embraces the following association with the logarithm of the directional KDE $\widehat{f}_h$ in (1).
Proposition 2. The observed log-likelihood $\log P(Y \mid \mu)$ is given by
\[
\log P(Y \mid \mu) = \log \left[ \sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell, q+1, L} \cdot L\left[ \kappa_\ell \left( 1 - X_\ell^T \mu \right) \right] \right]. \tag{22}
\]
In particular, when $\alpha_\ell = \frac{1}{n}$ and $\kappa_\ell = \frac{1}{h^2}$ for all $\ell = 1, \dots, n$ (i.e., simplifying to the DMS case), it implies that
\[
\log P(Y \mid \mu) = \log \widehat{f}_h(\mu) + \log \frac{C_{1/h^2, q+1, L}}{c_{h,q,L}}, \tag{23}
\]
where $c_{h,q,L}$ is defined in (2) and $C_{\kappa_i, q+1, L}$ is elucidated in (9).

The proof of Proposition 2 is deferred to Appendix D.2. Although some careful readers may note that $c_{h,q,L} = C_{1/h^2, q, L}$, the extra term $\log \frac{C_{1/h^2, q+1, L}}{c_{h,q,L}}$ is still irreducible in (23), because we define the mixture density (8) for the (generalized) EM algorithm on a higher-dimensional sphere $\Omega_{q+1}$. However, the presence of this extra term will not affect the ascending property of density estimates along DMS (or equivalently, generalized EM) sequences, as stated in the theorem below.
Theorem 3. Let $\{\mu^{(t)}\}_{t=0}^{\infty}$ be the path of successive points defined by the DMS algorithm (6). Under condition (C2), the sequence $\{\widehat{f}_h(\mu^{(t)})\}_{t=0}^{\infty}$ is non-decreasing and thus converges.

Theorem 3 is the well-known monotonicity of observed (log-)likelihood sequences defined by the (generalized) EM algorithm (cf. Theorem 1 in Dempster et al. (1977)). The proof is in Appendix D.3.

The connection to the (generalized) EM algorithm allows us to derive a global convergence property of the DMS algorithm.

Theorem 4. Let $\{\mu^{(t)}\}_{t=0}^{\infty}$ be the path of successive points defined by the DMS algorithm (6). Under conditions (C1) and (C2), the sequence $\{\mu^{(t)}\}_{t=0}^{\infty}$ converges to a local mode of $\widehat{f}_h$ from almost every starting point $\mu^{(0)} \in \Omega_q$.

The high-level idea of the proof is as follows. For the convergence of (generalized) EM sequences to stationary points (in most cases, local maxima), Wu (1983) assumed the compactness of the parameter space and the differentiability of the observed log-likelihood within the parameter space. In addition, the set of local modes of the observed log-likelihood is required to be discrete in order to guarantee the convergence of any generalized EM sequence to a single stationary point. We detail these assumptions and the resulting convergence result for generalized EM (or, equally, DMS) sequences in Appendix D.4 and show that conditions (C1) and (C2) imply these assumptions.

In practice, the set of local minima and saddle points of $\widehat{f}_h$ will have zero Lebesgue measure on $\Omega_q$, so the convergence of DMS sequences will almost always be to a local mode; see Appendix A for basins of attraction (or convergence domains) yielded by DMS algorithms. Therefore, our result (Theorem 4) is stronger than the convergence theory of Zhang and Chen (2020), in which they only proved the convergence of DMS sequences around each local mode of $\widehat{f}_h$.
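Theorems 3 and 4 are easy to probe numerically: along any DMS run, the directional KDE values should be non-decreasing, and generic starting points should land on local modes. A minimal, self-contained sketch (our own illustrative code, von Mises kernel, KDE kept unnormalized since the constant does not affect monotonicity) is shown below.

```python
import numpy as np

def kde_unnorm(x, X, h):
    """Directional KDE (1) with the von Mises kernel, up to the constant c_{h,q,L}/n."""
    return np.exp(-(1.0 - X @ x) / h**2).sum()

def dms_path(x0, X, h, n_steps=50):
    """Return the KDE values along the DMS iterates started at x0."""
    x, vals = x0 / np.linalg.norm(x0), []
    for _ in range(n_steps):
        vals.append(kde_unnorm(x, X, h))
        v = X.T @ np.exp(-(1.0 - X @ x) / h**2)   # von Mises update (18)/(20)
        x = v / np.linalg.norm(v)
    return np.array(vals)

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)); X /= np.linalg.norm(X, axis=1, keepdims=True)
vals = dms_path(rng.normal(size=3), X, h=0.5)
assert np.all(np.diff(vals) >= -1e-12)            # ascending property (Theorem 3)
```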
5. Rate of Convergence of Directional Mean Shift Algorithm
In this section, we study the rate of convergence of the DMS algorithm. We have shown in Sections 3 and 4 that the DMS algorithm is a generalized EM algorithm and converges to local modes of the directional KDE from almost every starting point on $\Omega_q$. It is well-known that the convergence rate of an EM algorithm is generally linear (page 102 in Geoffrey J. McLachlan (2008)). Carreira-Perpiñán (2007) studied in detail the rate of convergence for Gaussian mean shift in the homoscedastic and isotropic case. More recently, Zhang and Chen (2020) wrote the DMS into a gradient ascent algorithm with an adaptive step size on $\Omega_q$ and argued that the DMS algorithm will converge linearly with a sufficiently small bandwidth around some neighborhoods of local modes. We will investigate the algorithmic convergence rate of the DMS algorithm in an alternative but straightforward approach. Consider the following assumptions:

• (D1) The true directional density $f$ is continuously differentiable, and its partial derivatives are square integrable on $\Omega_q$.

• (D2) Besides (C2), we assume that $L: [0, \infty) \to [0, \infty)$ satisfies
\[
0 < \int_0^{\infty} L(r)^k\, r^{\frac{q}{2}-1}\, dr < \infty \quad \text{and} \quad 0 < \int_0^{\infty} |L'(r)|^k\, r^{\frac{q}{2}-1}\, dr < \infty
\]
for all $q \geq 1$ and $k = 1, 2$, and $\int_{\epsilon^{-1}}^{\infty} |L'(r)|\, r^{q/2}\, dr = O(\epsilon)$ as $\epsilon \to 0^+$.

Conditions (D1) and (D2) are common assumptions in directional data to establish the pointwise and uniform consistency of the directional KDE $\widehat{f}_h$ to the true density $f$ on $\Omega_q$ (Hall et al., 1987; Klemelä, 2000; Zhao and Wu, 2001; García-Portugués et al., 2013; García-Portugués, 2013; Zhang and Chen, 2020; Alonso-Pena and Crujeiras, 2020).

Now, let $m$ be a local mode of the directional KDE $\widehat{f}_h$ to which the DMS sequence $\{x^{(t)}\}_{t=0}^{\infty}$ defined by (6) converges. By Taylor's theorem and the fact that $m = F(m)$, we deduce that
\[
x^{(t+1)} = F(x^{(t)}) = m + \nabla F(m) \cdot (x^{(t)} - m) + O\left( \left\| x^{(t)} - m \right\|^2 \right), \tag{24}
\]
where $\nabla F(m) \in \mathbb{R}^{(q+1) \times (q+1)}$ is the Jacobian of $F$ evaluated at $m$. Thus, $x^{(t+1)} - m \approx \nabla F(m)(x^{(t)} - m)$ within a small neighborhood of $m$ on $\Omega_q$. It implies that the rate of convergence is associated with the Jacobian $\nabla F(m)$ through its eigenvalues as follows:
\[
\frac{\left\| x^{(t+1)} - m \right\|}{\left\| x^{(t)} - m \right\|} \leq \max_{i=1,\dots,q+1} |\Lambda_i|, \tag{25}
\]
where $\Lambda_1, \dots, \Lambda_{q+1}$ are the eigenvalues of $\nabla F(m)$. More importantly, the Jacobian $\nabla F(x)$ has the following useful properties.
Proposition 5. For any $x \in \Omega_q \subset \mathbb{R}^{q+1}$,

(a) the Jacobian $\nabla F(x)$ is a symmetric matrix in $\mathbb{R}^{(q+1) \times (q+1)}$ and has an explicit form
\[
\nabla F(x) = \frac{\nabla\nabla \widehat{f}_h(x)}{\left\| \nabla \widehat{f}_h(x) \right\|} - \frac{\nabla \widehat{f}_h(x)\, \nabla \widehat{f}_h(x)^T\, \nabla\nabla \widehat{f}_h(x)}{\left\| \nabla \widehat{f}_h(x) \right\|^3}. \tag{26}
\]
In particular, $\nabla F(m) = \left( I_{q+1} - m m^T \right) \frac{\nabla\nabla \widehat{f}_h(m)}{\| \nabla \widehat{f}_h(m) \|}$ at any local mode $m$ of $\widehat{f}_h$.

(b) We have that $\max_{i=1,\dots,q+1} |\Lambda_i| \to 0$ as $h \to 0$ and $nh^q \to \infty$, where $\Lambda_1, \dots, \Lambda_{q+1}$ are the eigenvalues of $\nabla F(m)$.

The proof of Proposition 5 is in Appendix D.5. Note that $\| \nabla \widehat{f}_h(x) \| \to \infty$ as $h \to 0$ and $nh^q \to \infty$ for any $x \in \Omega_q$; see Lemma 10 in Zhang and Chen (2020). While $\| \nabla \widehat{f}_h(x) \| \to \infty$ may seem counter-intuitive, it is indeed a reasonable result because the total gradient $\nabla \widehat{f}_h(x)$ in $\mathbb{R}^{q+1}$ consists of two parts: the gradient tangent to $\Omega_q$ (tangent gradient) and the gradient perpendicular to $\Omega_q$ (normal gradient). The tangent gradient is a stable quantity (since it is a well-defined quantity in terms of the true directional density $f$), while the normal gradient will diverge (since the true directional density $f$ does not have any normal gradient).

Together with (25), Proposition 5 sheds light on the algorithmic convergence rate of the DMS algorithm. As $h \to 0$ and $nh^q \to \infty$, $\nabla F(m) \to 0$ and the convergence of the DMS algorithm around $m$ becomes superlinear. Yet, it will never be quadratic because $\nabla F(m) \neq 0$ for any positive bandwidth $h$ and sample size $n$. In practice, one can select a small bandwidth to attain the linear convergence when the amount of available data is sufficient. Our result coincides with the findings of Carreira-Perpiñán (2007) for Euclidean data and Zhang and Chen (2020) for directional data.
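The linear rate in (25) can be estimated empirically. The sketch below (our own illustrative code with the von Mises kernel; the finite-difference Jacobian is a stand-in for the explicit formula (26)) runs the map $F$ to an approximate mode and reports the spectral radius of a numerical Jacobian there, which should be strictly between 0 and 1 and shrink as $h$ decreases with enough data.

```python
import numpy as np

def F(x, X, h):
    """DMS map (6) with the von Mises kernel, viewed as a map on R^{q+1} away from 0."""
    v = X.T @ np.exp(-(1.0 - X @ x) / h**2)
    return v / np.linalg.norm(v)

def numerical_jacobian(x, X, h, eps=1e-6):
    """Central-difference approximation of the (q+1) x (q+1) Jacobian of F at x."""
    d = x.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        J[:, j] = (F(x + e, X, h) - F(x - e, X, h)) / (2.0 * eps)
    return J

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)); X /= np.linalg.norm(X, axis=1, keepdims=True)
h = 0.4
m = rng.normal(size=3); m /= np.linalg.norm(m)
for _ in range(200):                       # iterate to an (approximate) local mode m
    m = F(m, X, h)
rate = np.max(np.abs(np.linalg.eigvals(numerical_jacobian(m, X, h))))
print("estimated linear rate max_i |Lambda_i| =", rate)
```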
6. Discussions
This paper has shown that the DMS algorithm is an EM algorithm for the von Mises kernel and a generalized EM algorithm for general kernels. Similar to the Euclidean KDE and mean shift algorithm, our results hold when each data point has a different weight and concentration parameter in the underlying directional KDE. Under the (generalized) EM representation, we have provided a new proof of the DMS algorithm's ascending property and validated the global convergence of its iterative sequences on the unit hypersphere $\Omega_q$. Finally, we have discussed the rate of convergence of the DMS algorithm by analyzing the Jacobian of its fixed-point iterative function.

Within a small scope, our work suggests several potential extensions or improvements of the DMS algorithm. First, instead of performing a single iteration step (18) as the DMS algorithm does, one may iterate (16) with a fixed $\mu^{(t)}$ for several steps before replacing $\mu^{(t)}$ by the resulting output $\mu^{(t+1)}$ in (16). These extra iterations in the M-step may give rise to faster convergence of the DMS algorithm. However, one needs to propose a strategy to handle zero divisions in (18) when implementing this extension with truncated kernels such as (5). Second, our explicit formula (26) for the Jacobian of the fixed-point iterative function can help accelerate the DMS algorithm via Aitken's or other related methods (Section 4.8 in Geoffrey J. McLachlan (2008)). Finally, other practical guidelines for improving the regular mean shift algorithm based on its tie to the (generalized) EM algorithm (Carreira-Perpiñán, 2007) may be adoptable to the DMS scenario.

More broadly, our work may have some potential impact on nonconvex optimization theory on manifolds. The existing convergence theory of a gradient ascent/descent type method with a nonconvex objective function (as in our DMS scenario) is often limited to some small neighborhoods of its maximizers/minimizers, especially when the objective function is supported on a compact manifold. Moreover, the ascending property of the objective function along the gradient ascent/descent path is not generally guaranteed. Our research points to a possible avenue for establishing the ascending property and global convergence theory for a broader class of gradient-based methods, that is, connecting them to the well-studied EM algorithm. Upon this connection, our technique of projecting a lower-dimensional manifold onto a higher-dimensional one could be potentially useful.
YC is supported by NSF DMS-1810960 and DMS-195278, NIH U01-AG0169761.
References
M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications, Inc., New York, 1974.

Y. Aliyari Ghassabeh. On the convergence of the mean shift algorithm in the one-dimensional space. Pattern Recognition Letters, 34(12):1423-1427, 2013.

Y. Aliyari Ghassabeh. A sufficient condition for the convergence of the mean shift algorithm with Gaussian kernel. Journal of Multivariate Analysis, 135:1-10, 2015.

M. Alonso-Pena and R. M. Crujeiras. Nonparametric multimodal regression for circular data. arXiv preprint arXiv:2012.09915, 2020.

E. Arias-Castro, D. Mason, and B. Pelletier. On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. Journal of Machine Learning Research, 17(43):1-28, 2016.

Z. Bai, C. Rao, and L. Zhao. Kernel estimators of density function of directional data. Journal of Multivariate Analysis, 27(1):24-39, 1988.

S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 45(1):77-120, 2017. URL https://doi.org/10.1214/16-AOS1435.

A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345-1382, 2005.

R. Beran. Exponential models for directional data. The Annals of Statistics, 7(6):1162-1178, 1979.

M. Á. Carreira-Perpiñán. Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767-776, May 2007.

M. Á. Carreira-Perpiñán. A review of mean-shift algorithms for clustering. arXiv preprint arXiv:1503.00687, 2015.

J. E. Chacón. The modal age of statistics. International Statistical Review, 88(1):122-141, 2020.

S.-J. Chang-Chien, W.-L. Hung, and M.-S. Yang. On mean shift-based clustering for circular data. Soft Computing, 16(6):1043-1060, June 2012. ISSN 1432-7643.

Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790-799, 1995.

D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, 2002.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1-22, 1977.

I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143-175, 2001.

K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32-40, 1975.

E. García-Portugués. Exact risk improvement of bandwidth selectors for kernel density estimation with directional data. Electronic Journal of Statistics, 7:1655-1685, 2013.

E. García-Portugués, R. M. Crujeiras, and W. González-Manteiga. Kernel density estimation for directional-linear data. Journal of Multivariate Analysis, 121:152-175, 2013.

C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Non-parametric inference for density modes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):99-126, 2016.

G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. Wiley-Interscience, 2nd edition, 2008.

P. Hall, G. S. Watson, and J. Cabrera. Kernel density estimation with spherical data. Biometrika, 74(4):751-762, 1987.

P. Hall, M. C. Minnotte, and C. Zhang. Bump hunting with non-Gaussian kernels. Annals of Statistics, 32(5):2124-2141, 2004.

M. Kafai, Y. Miao, and K. Okada. Directional mean shift and its application for topology classification of local 3D structures. Pages 170-177, 2010.

J. Klemelä. Estimation of densities and derivatives of densities with directional data. Journal of Multivariate Analysis, 73(1):18-40, 2000.

T. Kobayashi and N. Otsu. Von Mises-Fisher mean shift for clustering on a hypersphere. Pages 2130-2133, 2010.

X. Li, Z. Hu, and F. Wu. A note on the convergence of the mean shift. Pattern Recognition, 40(6):1756-1762, 2007.

K. Mardia and P. Jupp. Directional Statistics. Wiley Series in Probability and Statistics. Wiley, 2000.

K. V. Mardia. Statistics of directional data. Journal of the Royal Statistical Society: Series B (Methodological), 37(3):349-393, 1975.

M. D. Marzio, A. Panzera, and C. C. Taylor. Kernel density estimation on the torus. Journal of Statistical Planning and Inference, 141(6):2156-2173, 2011.

S. Oba, K. Kato, and S. Ishii. Multi-scale clustering for gene expression profiling data. In Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'05), pages 210-217, Oct 2005.

M. Oliveira, R. M. Crujeiras, and A. Rodríguez-Casal. A plug-in rule for bandwidth selection in circular density estimation. Computational Statistics & Data Analysis, 56(12):3898-3908, 2012.

P. Saavedra-Nieves and R. M. Crujeiras. Nonparametric estimation of directional highest density regions. arXiv preprint arXiv:2009.08915, 2020. URL https://arxiv.org/abs/2009.08915.

B. W. Silverman. Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society: Series B (Methodological), 43(1):97-99, 1981.

S. Sra. A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x). Computational Statistics, 27(1):177-190, 2012.

R. Subbarao and P. Meer. Nonlinear mean shift over Riemannian manifolds. International Journal of Computer Vision, 84(1):1-20, 2009.

C. C. Taylor. Automatic bandwidth selection for circular density estimation. Computational Statistics & Data Analysis, 52(7):3493-3500, 2008.

C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95-103, 1983.

M.-S. Yang, S.-J. Chang-Chien, and H.-C. Kuo. On mean shift clustering for directional data on a hypersphere. In Artificial Intelligence and Soft Computing, pages 809-818, Cham, 2014. Springer International Publishing.

Y. Zhang and Y.-C. Chen. Kernel smoothing, mean shift, and their learning theory with directional data. arXiv preprint arXiv:2010.13523, 2020. URL https://arxiv.org/abs/2010.13523.

L. Zhao and C. Wu. Central limit theorem for integrated squared error of kernel estimators of spherical density. Science in China Series A: Mathematics, 44(4):474-483, 2001.
Appendix A. Basins of Attraction for Directional Mean Shift Algorithm
We simulate a dataset $\{X_1, \dots, X_n\} \subset \Omega_2$ for the DMS algorithm from a three-component mixture of vMF densities as in Zhang and Chen (2020):
\[
f(x) = w_1 \cdot f_{\mathrm{vMF}}(x; \mu_1, \kappa_1) + w_2 \cdot f_{\mathrm{vMF}}(x; \mu_2, \kappa_2) + w_3 \cdot f_{\mathrm{vMF}}(x; \mu_3, \kappa_3),
\]
with mixing weights $w_1, w_2, w_3$, directional means $\mu_1, \mu_2, \mu_3$ (the third one being $[150°, 0°]$ in spherical [longitude, latitude] coordinates), and concentrations $\kappa_1 = \kappa_2 = 8$ and $\kappa_3 = 5$. There are three local modes of the underlying density $f$. When we apply the directional KDE (1) and the corresponding DMS algorithm to this dataset, the bandwidth parameter $h$ is chosen using the rule-of-thumb selector (Proposition 2 in García-Portugués (2013)), and each DMS sequence is stopped once the successive update satisfies $\|X_i^{(t+1)} - X_i^{(t)}\| < \epsilon$ for a small tolerance $\epsilon$.

Figure 1 displays the basins of attraction (or convergence domains) for DMS algorithms with the truncated convex kernel $L(r) = (1 - r) \cdot \mathbf{1}\{0 \leq r \leq 1\}$ and the von Mises kernel $L(r) = e^{-r}$. Each point on $\Omega_2$, except for a set with Lebesgue measure zero, converges to a local mode of $\widehat{f}_h$ as the corresponding DMS algorithms are stopped.

[Figure 1: Basins of attraction for DMS algorithms with two different kernels on the simulated dataset. Panels: (a) DMS with $L(r) = (1 - r) \cdot \mathbf{1}\{0 \leq r \leq 1\}$; (b) DMS with the von Mises kernel $L(r) = e^{-r}$.]
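For readers who wish to reproduce this kind of experiment, directional samples can be drawn with scipy.stats.vonmises_fisher (available in SciPy 1.11 and later). The mixture weights and the first two mean directions below are illustrative placeholders of our own choosing, not an exact reproduction of the settings above; only the third mean ($[150°, 0°]$) and the concentrations are taken from the text.

```python
import numpy as np
from scipy.stats import vonmises_fisher

def sph2cart(lon_deg, lat_deg):
    """Convert [longitude, latitude] in degrees to a unit vector on Omega_2."""
    lon, lat = np.deg2rad(lon_deg), np.deg2rad(lat_deg)
    return np.array([np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)])

def sample_vmf_mixture(n, weights, mus, kappas, seed=0):
    """Draw n points on Omega_2 from a mixture of vMF densities."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(weights), size=n, p=weights)
    out = np.empty((n, 3))
    for j, (mu, kappa) in enumerate(zip(mus, kappas)):
        idx = np.where(labels == j)[0]
        if idx.size:
            out[idx] = vonmises_fisher(mu, kappa, seed=rng).rvs(idx.size)
    return out

# placeholder means and equal weights (illustrative only); kappas follow the text
mus = [sph2cart(-120, -45), sph2cart(0, 60), sph2cart(150, 0)]
X = sample_vmf_mixture(1000, [1/3, 1/3, 1/3], mus, kappas=[8.0, 8.0, 5.0], seed=0)
```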
Appendix B. EM Algorithm on a Mixture of vMF Distributions

We consider a directional dataset $\mathcal{Y} = \{Y_1, \dots, Y_N\} \subset \Omega_q$ consisting of independently and identically distributed (IID) samples from a mixture of $M$ vMF distributions, whose density is
\[
f(y \mid \Theta) = \sum_{j=1}^M \alpha_j \cdot f_{\mathrm{vMF}}(y \mid \mu_j, \kappa_j) \tag{27}
\]
with the set of parameters
\[
\Theta = \left\{ (\alpha_j, \mu_j, \kappa_j): \sum_{j=1}^M \alpha_j = 1,\ \alpha_j \geq 0,\ \mu_j^T \mu_j = 1,\ \kappa_j \geq 0,\ j = 1, \dots, M \right\}.
\]
Let $\mathcal{Z} = \{Z_1, \dots, Z_N\}$ be the corresponding set of hidden random variables indicating the mixture components from which the data points are sampled. In particular, $Z_i = j$ if $Y_i$ is sampled from $f_{\mathrm{vMF}}(y \mid \mu_j, \kappa_j)$. The EM framework for obtaining the maximum likelihood estimates of the parameters in $\Theta$ given the observed data $\mathcal{Y} = \{Y_1, \dots, Y_N\} \subset \Omega_q$ is as follows (Banerjee et al., 2005):

• E-Step: Assuming that the values in $\mathcal{Z} = \{Z_1, \dots, Z_N\}$ are known, the complete log-likelihood is given by
\[
\log P(\mathcal{Y}, \mathcal{Z} \mid \Theta) = \sum_{i=1}^N \sum_{j=1}^M \left[ \mathbf{1}\{Z_i = j\} \log \alpha_j + \mathbf{1}\{Z_i = j\} \log f_{\mathrm{vMF}}(Y_i \mid \mu_j, \kappa_j) \right], \tag{28}
\]
where $\mathbf{1}_A$ is the indicator function of the set $A$. For the given $(\mathcal{Y}, \Theta^{(t)})$ at the $t$-th iteration, the E-step takes the expectation of the complete log-likelihood (28) with respect to the current posterior distribution of $\mathcal{Z} \mid (\mathcal{Y}, \Theta^{(t)})$ as:
\[
Q(\Theta \mid \Theta^{(t)}) \equiv \mathbb{E}_{\mathcal{Z} \mid (\mathcal{Y}, \Theta^{(t)})}\left[ \log P(\mathcal{Y}, \mathcal{Z} \mid \Theta) \right]
= \sum_{j=1}^M \sum_{i=1}^N (\log \alpha_j) \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) + \sum_{j=1}^M \sum_{i=1}^N \left[ \log f_{\mathrm{vMF}}(Y_i \mid \mu_j, \kappa_j) \right] \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right). \tag{29}
\]
By Bayes' theorem, the posterior distribution $P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)})$ is calculated as
\[
P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) = \frac{P(\mathcal{Y} \mid Z_i = j, \Theta^{(t)}) \cdot P(Z_i = j \mid \Theta^{(t)})}{P(\mathcal{Y} \mid \Theta^{(t)})} \overset{(**)}{=} \frac{P(Y_i \mid Z_i = j, \Theta^{(t)}) \cdot P(Z_i = j \mid \Theta^{(t)})}{P(Y_i \mid \Theta^{(t)})} = P\left( Z_i = j \mid Y_i, \Theta^{(t)} \right) = \frac{\alpha_j^{(t)} \cdot f_{\mathrm{vMF}}\left( Y_i \mid \mu_j^{(t)}, \kappa_j^{(t)} \right)}{\sum_{\ell=1}^M \alpha_\ell^{(t)} \cdot f_{\mathrm{vMF}}\left( Y_i \mid \mu_\ell^{(t)}, \kappa_\ell^{(t)} \right)},
\]
where we utilize the IID assumption on $\mathcal{Y} = \{Y_1, \dots, Y_N\}$ and the fact that $Y_i$ is independent of $Z_k$ when $i \neq k$ to obtain (**).

• M-Step: In the M-step, we solve for $\Theta^{(t+1)}$ that maximizes the function $Q(\Theta \mid \Theta^{(t)})$. Notice that the maximization over $\{\alpha_j\}_{j=1}^M$ and over $\{\mu_j, \kappa_j\}_{j=1}^M$ can be done separately, as the terms containing $\{\alpha_j\}_{j=1}^M$ and $\{\mu_j, \kappa_j\}_{j=1}^M$ are unrelated in (29). To maximize $Q(\Theta \mid \Theta^{(t)})$ with respect to $\{\alpha_j\}_{j=1}^M$ under the constraint $\sum_{j=1}^M \alpha_j = 1$, we introduce a Lagrangian multiplier $\lambda$ and obtain the following Lagrangian function:
\[
\mathcal{L}(\alpha_1, \dots, \alpha_M, \lambda) = \sum_{j=1}^M \sum_{i=1}^N (\log \alpha_j) \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) + \lambda \left( \sum_{j=1}^M \alpha_j - 1 \right).
\]
Taking the partial derivatives of $\mathcal{L}$ with respect to each $\alpha_j$ and $\lambda$ and setting them to zero yield that (with some rearrangements)
\[
\sum_{i=1}^N P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) + \alpha_j \lambda = 0 \quad \text{and} \quad \sum_{j=1}^M \alpha_j - 1 = 0.
\]
Summing both sides of the first equation over $j = 1, \dots, M$, we find that $\lambda_{\mathrm{opt}} = -N$ and hence
\[
\alpha_j^{(t+1)} = \frac{1}{N} \sum_{i=1}^N P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) \tag{30}
\]
for $j = 1, \dots, M$.
Now, we maximize $Q(\Theta \mid \Theta^{(t)})$ with respect to $\{\mu_j, \kappa_j\}_{j=1}^M$ under the constraints $\mu_j^T \mu_j = 1$ and $\kappa_j \geq 0$ for $j = 1, \dots, M$.[1] Let $\lambda_j$ be the Lagrangian multiplier associated with each equality constraint $\mu_j^T \mu_j = 1$. The Lagrangian function is given by
\[
\mathcal{L}\left( \{\mu_j, \kappa_j, \lambda_j\}_{j=1}^M \right) = \sum_{j=1}^M \sum_{i=1}^N \left[ \log f_{\mathrm{vMF}}(Y_i \mid \mu_j, \kappa_j) \right] \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) + \sum_{j=1}^M \lambda_j (1 - \mu_j^T \mu_j)
= \sum_{j=1}^M \left\{ \sum_{i=1}^N \left[ \log C_q(\kappa_j) + \kappa_j \mu_j^T Y_i \right] \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) + \lambda_j (1 - \mu_j^T \mu_j) \right\}. \tag{31}
\]
Taking partial derivatives of (31) with respect to $\{\mu_j, \kappa_j, \lambda_j\}_{j=1}^M$ and setting them to zero, we have that
\[
\kappa_j \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) - 2\lambda_j \mu_j = 0, \tag{32}
\]
\[
\frac{C_q'(\kappa_j)}{C_q(\kappa_j)} \sum_{i=1}^N P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) + \mu_j^T \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) = 0, \tag{33}
\]
\[
1 - \mu_j^T \mu_j = 0, \tag{34}
\]
for $j = 1, \dots, M$. Combining (32) and (34) shows that, for $j = 1, \dots, M$,
\[
\widehat{\lambda}_j = \frac{\kappa_j}{2} \left\| \sum_{i=1}^N Y_i \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) \right\| \quad \text{and} \quad \mu_j^{(t+1)} = \frac{\sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)})}{\left\| \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) \right\|}. \tag{35}
\]
Substituting (35) into (33) gives us that
\[
-\frac{C_q'\left( \kappa_j^{(t+1)} \right)}{C_q\left( \kappa_j^{(t+1)} \right)} = \frac{I_{\frac{q+1}{2}}\left( \kappa_j^{(t+1)} \right)}{I_{\frac{q-1}{2}}\left( \kappa_j^{(t+1)} \right)} = \frac{\left\| \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) \right\|}{\sum_{i=1}^N P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)})}. \tag{36}
\]
The first equality in (36) follows from the calculations below (recall (3)):
\[
C_q'(\kappa) = \frac{\frac{q-1}{2} \kappa^{\frac{q-3}{2}} (2\pi)^{\frac{q+1}{2}} I_{\frac{q-1}{2}}(\kappa) - (2\pi)^{\frac{q+1}{2}} I_{\frac{q-1}{2}}'(\kappa)\, \kappa^{\frac{q-1}{2}}}{(2\pi)^{q+1} \left[ I_{\frac{q-1}{2}}(\kappa) \right]^2} = \frac{\kappa^{\frac{q-1}{2}}}{(2\pi)^{\frac{q+1}{2}} I_{\frac{q-1}{2}}(\kappa)} \left[ \frac{q-1}{2\kappa} - \frac{I_{\frac{q-1}{2}}'(\kappa)}{I_{\frac{q-1}{2}}(\kappa)} \right]
\]
and thus,
\[
\frac{C_q'(\kappa)}{C_q(\kappa)} = \frac{q-1}{2\kappa} - \frac{I_{\frac{q-1}{2}}'(\kappa)}{I_{\frac{q-1}{2}}(\kappa)} = \frac{q-1}{2\kappa} - \frac{\frac{q-1}{2\kappa} I_{\frac{q-1}{2}}(\kappa) + I_{\frac{q+1}{2}}(\kappa)}{I_{\frac{q-1}{2}}(\kappa)} = -\frac{I_{\frac{q+1}{2}}(\kappa)}{I_{\frac{q-1}{2}}(\kappa)},
\]
where we use the well-known recurrence relation $\kappa I_s'(\kappa) = s I_s(\kappa) + \kappa I_{s+1}(\kappa)$; see, for instance, Section 9.6.26 in Abramowitz and Stegun (1974).

The updating formula (36) involves the modified Bessel function of the first kind and has no closed-form solution for $\kappa_j^{(t+1)}$. However, Banerjee et al. (2005); Sra (2012) propose an approximation formula for $\kappa$ in terms of $A_q(\kappa) \equiv \frac{I_{(q+1)/2}(\kappa)}{I_{(q-1)/2}(\kappa)}$ as
\[
\widehat{\kappa} \approx \frac{(q+1) A_q(\kappa) - A_q(\kappa)^3}{1 - A_q(\kappa)^2}. \tag{37}
\]
Therefore, we may update $\kappa_j^{(t+1)}$ in the M-step by
\[
\kappa_j^{(t+1)} = \frac{(q+1) A_q(\kappa^{(t+1)}) - A_q(\kappa^{(t+1)})^3}{1 - A_q(\kappa^{(t+1)})^2} \tag{38}
\]
with $A_q(\kappa^{(t+1)}) = \frac{\left\| \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) \right\|}{\sum_{i=1}^N P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)})}$.

The above EM algorithm involves only soft assignments of the hidden variables $\mathcal{Z}$. The analysis of hard assignments for the hidden variables and the subsequent discussion of their connections with spherical $k$-means clustering (Dhillon and Modha, 2001) can be found in Banerjee et al. (2005).

[1] As mentioned in Section A.1 of Banerjee et al. (2005), we should have introduced a Lagrangian multiplier for each inequality constraint $\kappa_j \geq 0$ and worked with the necessary KKT conditions. However, if $\kappa_j = 0$, then $f_{\mathrm{vMF}}(y \mid \mu_j, \kappa_j)$ is the uniform density on $\Omega_q$.
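A compact implementation of these E- and M-steps (soft responsibilities, the mean-direction update (35), and the concentration approximation (37)) might look like the following. This is an illustrative sketch of our own, not the authors' code; it evaluates $C_q(\kappa)$ in (3) via the exponentially scaled Bessel function from scipy.special for numerical stability.

```python
import numpy as np
from scipy.special import ive   # ive(v, x) = I_v(x) * exp(-x)

def log_Cq(kappa, q):
    """log C_q(kappa) in (3), using the scaled Bessel function for stability."""
    return ((q - 1) / 2) * np.log(kappa) - ((q + 1) / 2) * np.log(2 * np.pi) \
           - (np.log(ive((q - 1) / 2, kappa)) + kappa)

def em_vmf_mixture(Y, M, n_iter=100, seed=0):
    """EM for a mixture of M vMF distributions on Omega_q; Y is an N x (q+1) array."""
    N, d = Y.shape
    q = d - 1
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(N, M, replace=False)]          # initialize means at random data points
    kappa = np.ones(M)
    alpha = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibilities P(Z_i = j | Y, Theta^(t)) computed from log-densities
        logp = np.log(alpha) + log_Cq(kappa, q) + (Y @ mu.T) * kappa     # N x M
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp); R /= R.sum(axis=1, keepdims=True)
        # M-step: mixing weights (30), mean directions (35), and kappa via (37)/(38)
        alpha = R.mean(axis=0)
        S = Y.T @ R                                    # (q+1) x M weighted sums
        r_bar = np.linalg.norm(S, axis=0) / R.sum(axis=0)
        mu = (S / np.linalg.norm(S, axis=0)).T
        kappa = r_bar * (q + 1 - r_bar**2) / (1.0 - r_bar**2)
    return alpha, mu, kappa
```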
Appendix C. Naive Method: Introducing a Displacement Parameter

We show in this section that introducing a displacement parameter to define an artificial maximum likelihood problem as in Carreira-Perpiñán (2007) is (probably) not a promising direction under our directional data scenario. For simplicity, we only consider the von Mises kernel case. Given a directional dataset $\{X_1, \dots, X_n\} \subset \Omega_q$, we may define the mixture model by subtracting a displacement parameter $\mu \in \Omega_q$ from each mean $\{X_1, \dots, X_n\}$ of the mixture components as:
\[
f_{\mathrm{MOvMF}}(y) = \sum_{i=1}^n \alpha_i \cdot C_q(\kappa_i, X_i - \mu) \cdot \exp\left[ \kappa_i (X_i - \mu)^T y \right]. \tag{39}
\]
The key difference between the mixture density here and the one in (8) is that the normalizing constant in each component of (39) depends not only on the (fixed) concentration parameter $\kappa_i$ but also on our newly introduced displacement parameter $\mu$. The reason is that a unit vector subtracted by another unit vector is no longer a unit vector on $\Omega_q$. Hence,
\[
C_q(\kappa, X - \mu)^{-1} = \int_{\Omega_q} \exp\left[ \kappa (X - \mu)^T y \right] \omega_q(dy) = \omega_{q-1}(\Omega_{q-1}) \int_{-1}^{1} \exp\left( \kappa t \|X - \mu\| \right) \cdot (1 - t^2)^{\frac{q}{2}-1}\, dt = \frac{(2\pi)^{\frac{q+1}{2}}\, I_{\frac{q-1}{2}}(\kappa \|X - \mu\|)}{(\kappa \|X - \mu\|)^{\frac{q-1}{2}}}.
\]
This extra dependence becomes a main obstacle to connecting the corresponding EM algorithm with the directional mean shift algorithm, because one can easily verify that the derivative of the resulting Q-function with respect to $\mu$ would involve the derivative of modified Bessel functions of the first kind. There is no closed-form expression for such a derivative in general.

Appendix D. Proofs of Theorems and Propositions
D.1 Proof of Theorem 1
Theorem 1. Under condition (C2), the Q-function (14) satisfies $Q(\mu^{(t+1)} \mid \mu^{(t)}) \geq Q(\mu^{(t)} \mid \mu^{(t)})$, where
\[
\mu^{(t+1)} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\|}
\]
is from the iteration (18).

Proof. As we take $\kappa_i = \frac{1}{h^2}$ and $\alpha_i = \frac{1}{n}$ for all $i = 1, \dots, n$ in equations (14) and (13), the Q-function reduces to
\[
Q(\mu \mid \mu^{(t)}) = \sum_{i=1}^n \left[ \log \frac{C_{1/h^2, q+1, L}}{n} + \log L\left( \frac{1 - X_i^T \mu}{h^2} \right) \right] \cdot P(Z = i \mid Y, \mu^{(t)}), \tag{40}
\]
and the posterior of $Z \mid (Y, \mu^{(t)})$ becomes
\[
P(Z = i \mid Y, \mu^{(t)}) = \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\sum_{\ell=1}^n L\left( \frac{1 - X_\ell^T \mu^{(t)}}{h^2} \right)}. \tag{41}
\]
The differentiability and convexity of $L$ under (C2) indicate that
\[
L(y) - L(x) \geq L'(x) \cdot (y - x), \tag{42}
\]
where $L'(x)$ is replaced by any subgradient at non-differentiable points. As $0 < L(0) < \infty$, we have that
\[
\begin{aligned}
& Q(\mu^{(t+1)} \mid \mu^{(t)}) - Q(\mu^{(t)} \mid \mu^{(t)}) \\
&= \sum_{i=1}^n \left[ \log L\left( \tfrac{1 - X_i^T \mu^{(t+1)}}{h^2} \right) - \log L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right] \cdot \frac{L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\sum_{\ell=1}^n L\left( \tfrac{1 - X_\ell^T \mu^{(t)}}{h^2} \right)} \\
&\overset{\text{(i)}}{\geq} \sum_{i=1}^n \frac{1}{L\left( \tfrac{1 - X_i^T \mu^{(t+1)}}{h^2} \right)} \left[ L\left( \tfrac{1 - X_i^T \mu^{(t+1)}}{h^2} \right) - L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right] \cdot \frac{L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\sum_{\ell=1}^n L\left( \tfrac{1 - X_\ell^T \mu^{(t)}}{h^2} \right)} \\
&\overset{\text{(ii)}}{\geq} \sum_{i=1}^n \frac{L'\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \tfrac{1 - X_i^T \mu^{(t+1)}}{h^2} \right)} \cdot \left[ \frac{X_i^T \left( \mu^{(t)} - \mu^{(t+1)} \right)}{h^2} \right] \cdot \frac{L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\sum_{\ell=1}^n L\left( \tfrac{1 - X_\ell^T \mu^{(t)}}{h^2} \right)} \\
&\overset{\text{(iii)}}{\geq} \frac{L\left( 2/h^2 \right)}{n h^2 L(0)} \sum_{i=1}^n L'\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) X_i^T \left( \mu^{(t)} - \mu^{(t+1)} \right) \\
&\overset{\text{(iv)}}{=} \frac{L\left( 2/h^2 \right)}{n h^2 L(0)} \left\| \sum_{i=1}^n X_i\, L'\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\| \cdot \left( \mu^{(t+1)} \right)^T \left( \mu^{(t+1)} - \mu^{(t)} \right) \\
&\overset{\text{(v)}}{=} \frac{L\left( 2/h^2 \right)}{2 n h^2 L(0)} \left\| \sum_{i=1}^n X_i\, L'\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\| \cdot \left\| \mu^{(t+1)} - \mu^{(t)} \right\|^2 \geq 0,
\end{aligned}
\]
where we use the fact that $\log y - \log x \geq \frac{y - x}{y}$ by the concavity of $\log(\cdot)$ in (i), leverage the inequality (42) in (ii), apply the monotonicity of $L$ in (iii), plug in (18) in (iv), and notice that $\left( \mu^{(t+1)} \right)^T \left( \mu^{(t+1)} - \mu^{(t)} \right) = \frac{1}{2} \left\| \mu^{(t+1)} - \mu^{(t)} \right\|^2$ on $\Omega_q$ in (v). The result follows. □

D.2 Proof of Proposition 2
The observed log-likelihood log P ( Y | µ ) is given by log P ( Y | µ ) = log (cid:34) n (cid:88) (cid:96) =1 α (cid:96) · C κ (cid:96) ,q +1 ,L · L (cid:2) κ (cid:96) (cid:0) − X T(cid:96) µ (cid:1)(cid:3)(cid:35) . (43) In particular, when α (cid:96) = n and κ (cid:96) = h for all (cid:96) = 1 , ..., n (i.e., simplifying to the DMScase), it implies that log P ( Y | µ ) = log (cid:98) f h ( µ ) + log C /h ,q +1 ,L c h,q,L , (44) where c h,q,L is defined in (2) and C κ i ,q +1 ,L is elucidated in (9) . Proof
Notice that the observed log-likelihood in the EM framework of Section 3.1, after we take $Y = (0, \ldots, 0, 1)^T \in \Omega_{q+1}$, is given by
\begin{align}
\log P(Y \mid \mu) &= \log P(Y, Z \mid \mu) - \log P(Z \mid Y, \mu) \nonumber\\
&= \sum_{i=1}^n \mathbf{1}_{\{Z=i\}} \cdot \left\{\log \alpha_i + \log C_{\kappa_i,q+1,L} + \log L\!\left[\kappa_i\left(1 - X_i^T \mu\right)\right]\right\} - \sum_{i=1}^n \mathbf{1}_{\{Z=i\}} \cdot \log P(Z = i \mid Y, \mu). \tag{45}
\end{align}
Taking the expectation over $Z \mid \left(Y, \mu^{(t)}\right)$ on both sides of (45) yields that
\[
\log P(Y \mid \mu) = \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \left\{\log \alpha_i + \log C_{\kappa_i,q+1,L} + \log L\!\left[\kappa_i\left(1 - X_i^T \mu\right)\right]\right\} - \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log P(Z = i \mid Y, \mu). \tag{46}
\]
After plugging (13) into (46), we obtain that
\begin{align*}
\log P(Y \mid \mu) &= \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \left[\log \alpha_i + \log C_{\kappa_i,q+1,L} + \log L\!\left[\kappa_i\left(1 - X_i^T \mu\right)\right] - \log \frac{\alpha_i \cdot C_{\kappa_i,q+1,L} \cdot L\!\left[\kappa_i\left(1 - X_i^T \mu\right)\right]}{\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell,q+1,L} \cdot L\!\left[\kappa_\ell\left(1 - X_\ell^T \mu\right)\right]}\right] \\
&= \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log\left[\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell,q+1,L} \cdot L\!\left[\kappa_\ell\left(1 - X_\ell^T \mu\right)\right]\right] \\
&= \log\left[\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell,q+1,L} \cdot L\!\left[\kappa_\ell\left(1 - X_\ell^T \mu\right)\right]\right].
\end{align*}
When $\alpha_\ell = 1/n$ and $\kappa_\ell = 1/h^2$ for all $\ell = 1, \ldots, n$ (i.e., simplifying to the DMS form), the above equation indicates that
\[
\log P(Y \mid \mu) = \log\left[\frac{c_{h,q,L}}{n} \sum_{i=1}^n L\!\left(\frac{1 - X_i^T \mu}{h^2}\right)\right] + \log \frac{C_{1/h^2,q+1,L}}{c_{h,q,L}} = \log \widehat{f}_h(\mu) + \log \frac{C_{1/h^2,q+1,L}}{c_{h,q,L}}, \tag{47}
\]
where $c_{h,q,L}$ is defined in (2) and $C_{\kappa_i,q+1,L}$ is elucidated in (9). The results follow.
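The algebra above is the standard EM decomposition of the observed log-likelihood. The short sketch below checks the identity numerically under purely illustrative assumptions (arbitrary weights $\alpha_i$, concentrations $\kappa_i$, and stand-in constants $C_i$): the expected complete log-likelihood minus the expected log-posterior collapses to the log of the mixture sum in (43) for any pair of parameters $\mu$ and $\mu^{(t)}$.

```python
# Numerical check (illustrative setup) of the decomposition used in (45)-(47): expected
# complete log-likelihood minus expected log-posterior equals the log mixture sum (43).
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # directional observations on the sphere
alpha = rng.dirichlet(np.ones(n))                   # mixture weights alpha_i (sum to one)
kappa = rng.uniform(1.0, 5.0, size=n)               # concentrations kappa_i
C = rng.uniform(0.5, 2.0, size=n)                   # stand-ins for the constants C_{kappa_i,q+1,L}

def mix_terms(mu):                                  # alpha_i * C_i * L[kappa_i (1 - X_i^T mu)], L(r) = exp(-r)
    return alpha * C * np.exp(-kappa * (1.0 - X @ mu))

mu_t, mu = X[0], X[1]                               # current and arbitrary new parameters
post_t = mix_terms(mu_t) / mix_terms(mu_t).sum()    # P(Z = i | Y, mu^(t)), equation (13)
post = mix_terms(mu) / mix_terms(mu).sum()          # P(Z = i | Y, mu)

expected_complete = np.sum(post_t * np.log(mix_terms(mu)))   # first sum in (46)
expected_log_post = np.sum(post_t * np.log(post))            # second sum in (46)
assert np.isclose(expected_complete - expected_log_post,
                  np.log(mix_terms(mu).sum()))               # the mixture sum in (43)
```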
D.3 Proof of Theorem 3

Theorem 3. Let $\left\{\mu^{(t)}\right\}_{t=0}^{\infty}$ be the path of successive points defined by the DMS algorithm (6). Under condition (C2), the sequence $\left\{\widehat{f}_h\!\left(\mu^{(t)}\right)\right\}_{t=0}^{\infty}$ is non-decreasing and thus converges.

Proof
The difference of the observed log-likelihoods (46) between two consecutive steps of our DMS (or generalized EM) iteration is calculated as
\begin{align}
&\log P\!\left(Y \mid \mu^{(t+1)}\right) - \log P\!\left(Y \mid \mu^{(t)}\right) \nonumber\\
&\overset{\text{(i)}}{=} Q\!\left(\mu^{(t+1)} \mid \mu^{(t)}\right) - \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log P\!\left(Z = i \mid Y, \mu^{(t+1)}\right) \nonumber\\
&\qquad - Q\!\left(\mu^{(t)} \mid \mu^{(t)}\right) + \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log P\!\left(Z = i \mid Y, \mu^{(t)}\right) \nonumber\\
&= Q\!\left(\mu^{(t+1)} \mid \mu^{(t)}\right) - Q\!\left(\mu^{(t)} \mid \mu^{(t)}\right) - \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log \frac{P\!\left(Z = i \mid Y, \mu^{(t+1)}\right)}{P\!\left(Z = i \mid Y, \mu^{(t)}\right)} \nonumber\\
&\overset{\text{(ii)}}{\geq} Q\!\left(\mu^{(t+1)} \mid \mu^{(t)}\right) - Q\!\left(\mu^{(t)} \mid \mu^{(t)}\right) - \underbrace{\log\left[\sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t+1)}\right)\right]}_{=0} \nonumber\\
&\overset{\text{(iii)}}{\geq} 0, \tag{48}
\end{align}
where we recall (14) in (i), apply Jensen's inequality in (ii), and utilize Theorem 1 in (iii). The inequality (ii) is strict except when $\mu^{(t+1)} = \mu^{(t)}$. Therefore, combining (47) with (48) shows that
\[
\log \widehat{f}_h\!\left(\mu^{(t+1)}\right) - \log \widehat{f}_h\!\left(\mu^{(t)}\right) = \log P\!\left(Y \mid \mu^{(t+1)}\right) - \log P\!\left(Y \mid \mu^{(t)}\right) \geq 0.
\]
As $\widehat{f}_h$ is differentiable on the compact set $\Omega_q$ under (C2), the sequence $\left\{\widehat{f}_h\!\left(\mu^{(t)}\right)\right\}_{t=0}^{\infty}$ is thus bounded from above and converges.
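As an illustration of Theorem 3 (not a statement from the paper), the following sketch runs the DMS iteration with the von Mises profile on synthetic circular data and verifies that the kernel density estimates along the path never decrease. The data-generating mechanism and bandwidth are arbitrary choices, and normalizing constants are dropped since they do not affect monotonicity.

```python
# Illustration (assumed setup) of Theorem 3: KDE values along a DMS path never decrease.
import numpy as np

rng = np.random.default_rng(2)
n, h = 300, 0.4
theta = rng.vonmises(0.0, 4.0, size=n)              # angles clustered around 0
X = np.column_stack([np.cos(theta), np.sin(theta)]) # lift to the unit circle (q = 1)

def f_hat(mu):                                      # directional KDE up to its normalizing constant
    return np.mean(np.exp(-(1.0 - X @ mu) / h**2))

mu = np.array([0.0, 1.0])                           # a deliberately poor starting point
values = [f_hat(mu)]
for _ in range(50):
    g = X.T @ np.exp(-(1.0 - X @ mu) / h**2)        # -sum_i X_i L'(...) for L(r) = exp(-r)
    mu = g / np.linalg.norm(g)
    values.append(f_hat(mu))

assert all(a <= b + 1e-12 for a, b in zip(values, values[1:]))   # ascending property
```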
D.4 Conditions on the Convergence of Generalized EM Sequences and Proof of Theorem 4

For the convergence of (generalized) EM sequences to stationary points (in most cases, local maxima), Wu (1983) made the following assumptions on the parameter space $\Psi$ and the observed log-likelihood function $\ell(\phi)$:
• (A1) $\Psi$ is a subset of the ambient Euclidean space,
• (A2) $\Psi_{\phi^{(0)}} = \left\{\phi \in \Psi : \ell(\phi) \geq \ell\!\left(\phi^{(0)}\right)\right\}$ is compact for any $\ell\!\left(\phi^{(0)}\right) > -\infty$, where $\phi^{(0)}$ is the initial parameter for the (generalized) EM algorithm,
• (A3) $\ell$ is continuous in $\Psi$ and differentiable in the interior of $\Psi$.
Let $\phi \to M(\phi)$ be the point-to-set map defined by a generalized EM iteration, so that the Q-function satisfies $Q(\phi' \mid \phi) \geq Q(\phi \mid \phi)$ for any $\phi' \in M(\phi)$.
2. The Q-function is defined as the expectation of the complete log-likelihood over the posterior distribution of the hidden variables given the values of the observed variables and the current parameters.

Denote by $S$ the set of stationary points of the observed log-likelihood $\ell$. As a consequence of the above assumptions, all the limit points of a generalized EM sequence are stationary points of the observed log-likelihood.

Lemma 6 (Theorem 1 of Wu (1983)). Let $\{\phi_p\}$ be a generalized EM sequence generated by $\phi_{p+1} \in M(\phi_p)$, and suppose that (i) $M$ is a closed point-to-set map over the complement of $S$; (ii) $\ell(\phi_{p+1}) > \ell(\phi_p)$ for all $\phi_p \notin S$. Then all the limit points of $\{\phi_p\}$ are stationary points of $\ell$, and $\ell(\phi_p)$ converges monotonically to $\ell^* = \ell(\phi^*)$ for some $\phi^* \in S$.

The generalized EM iteration is usually terminated when the (Euclidean) distance between two consecutive iteration points falls below some small threshold value. However, this property does not imply the convergence of the generalized EM sequence to a single point in $S$; see some related discussions in Wu (1983) and a counterexample in Li et al. (2007). To guarantee the convergence of a generalized EM sequence to a single point in $S$ (in most cases, a local maximum), we require the discreteness of the set of local modes:
• (A4) The set of local modes of the observed log-likelihood $\ell(\phi)$ is discrete, and the modes are isolated from other critical points.

D.4.1 Proof of Theorem 4
Theorem 4.
Let $\left\{\mu^{(t)}\right\}_{t=0}^{\infty}$ be the path of successive points defined by the DMS algorithm (6). Under conditions (C1) and (C2), the sequence $\left\{\mu^{(t)}\right\}_{t=0}^{\infty}$ converges to a local mode of $\widehat{f}_h$ from almost all starting points $\mu^{(0)} \in \Omega_q$.

Proof
Now, we verify in detail that the above assumptions (A1)-(A4) for the convergence of generalized EM sequences are satisfied under our conditions (C1) and (C2) on the directional KDE $\widehat{f}_h$ and the kernel function $L$. Notice that the estimated parameter $\mu$ lies on $\Omega_q$, a compact manifold in the ambient Euclidean space $\mathbb{R}^{q+1}$. By the equivalence of the observed log-likelihood and the logarithm of the directional KDE (cf. equation (23)), the constrained parameter space $\Psi = \left\{\mu \in \Omega_q : \log \widehat{f}_h(\mu) \geq \log \widehat{f}_h\!\left(\mu^{(0)}\right)\right\}$ is also compact for any $\log \widehat{f}_h\!\left(\mu^{(0)}\right) > -\infty$. Moreover, under condition (C2), $\widehat{f}_h$ is differentiable on $\Omega_q$. Thus, assumptions (A1)-(A3) are naturally justified in our DMS context. Finally, it is obvious that our condition (C1) implies assumption (A4), and as long as we do not initialize the DMS algorithm within the set of local minima and saddle points, the algorithm converges to a local mode.
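A simple way to visualize Theorem 4 is to start the DMS iteration from many random points and observe that the paths collapse onto a handful of limiting directions. The sketch below does this on an assumed two-component von Mises mixture on the circle, using the usual distance-based stopping rule; the mixture, bandwidth, and tolerance are all illustrative choices, not values from the paper.

```python
# Illustration (assumed setup) of Theorem 4: DMS paths from many random starting points
# collapse onto a small set of limiting directions (the local modes of the KDE).
import numpy as np

rng = np.random.default_rng(4)
n, h = 400, 0.3
theta = np.concatenate([rng.vonmises(0.0, 8.0, n // 2), rng.vonmises(np.pi, 8.0, n // 2)])
X = np.column_stack([np.cos(theta), np.sin(theta)])  # two antipodal clusters on the circle

def dms(mu, tol=1e-8, max_iter=500):
    for _ in range(max_iter):
        g = X.T @ np.exp(-(1.0 - X @ mu) / h**2)     # von Mises profile L(r) = exp(-r)
        mu_new = g / np.linalg.norm(g)
        if np.linalg.norm(mu_new - mu) < tol:        # usual distance-based stopping rule
            return mu_new
        mu = mu_new
    return mu

starts = rng.normal(size=(100, 2))
starts /= np.linalg.norm(starts, axis=1, keepdims=True)
limits = np.array([dms(mu0) for mu0 in starts])
angles = np.mod(np.arctan2(limits[:, 1], limits[:, 0]), 2 * np.pi)
print(np.unique(np.round(angles, 3)))                # typically two distinct limiting directions
```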
D.5 Proof of Proposition 5

Proposition 5. For any $x \in \Omega_q \subset \mathbb{R}^{q+1}$,

(a) the Jacobian $\nabla F(x)$ is a symmetric matrix in $\mathbb{R}^{(q+1)\times(q+1)}$ and has the explicit form
\[
\nabla F(x) = \frac{\nabla\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|} - \frac{\nabla \widehat{f}_h(x)\, \nabla \widehat{f}_h(x)^T\, \nabla\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|^3}. \tag{49}
\]
In particular, $\nabla F(m) = \left(I_{q+1} - mm^T\right) \frac{\nabla\nabla \widehat{f}_h(m)}{\left\|\nabla \widehat{f}_h(m)\right\|}$ at any local mode $m$ of $\widehat{f}_h$.

(b) we have that $\max_{i=1,\ldots,q+1} |\Lambda_i| \to 0$ as $h \to 0$ and $nh^q \to \infty$, where $\Lambda_1, \ldots, \Lambda_{q+1}$ are the eigenvalues of $\nabla F(m)$.

Proof
(a) We leverage the form $F(x) = \frac{\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|}$ in (7) to directly compute the explicit form of the Jacobian $\nabla F(x)$ as
\[
\nabla F(x) = \frac{\nabla\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|} - \frac{\nabla \widehat{f}_h(x)\, \nabla \widehat{f}_h(x)^T\, \nabla\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|^3},
\]
which is a symmetric matrix. At any local mode $m$, we have that $m = \frac{\nabla \widehat{f}_h(m)}{\left\|\nabla \widehat{f}_h(m)\right\|}$, and the Jacobian $\nabla F(m)$ simplifies to
\[
\nabla F(m) = \left(I_{q+1} - mm^T\right) \frac{\nabla\nabla \widehat{f}_h(m)}{\left\|\nabla \widehat{f}_h(m)\right\|},
\]
where $I_{q+1} \in \mathbb{R}^{(q+1)\times(q+1)}$ is the identity matrix.

(b) By the definition of local modes of $\widehat{f}_h$ on $\Omega_q$, the total gradient $\nabla \widehat{f}_h(m)$ has zero projection onto the tangent space of $\Omega_q$ at any local mode $m$. Thus, $\nabla \widehat{f}_h(m)$ is completely determined by its radial component along $m$. By Lemma 10 in Zhang and Chen (2020), it is obvious that $\left\|\nabla \widehat{f}_h(m)\right\| \to \infty$ as $h \to 0$ and $nh^q \to \infty$. Finally, each eigenvalue of $\nabla F(m)$ has its absolute value tending to 0 as $\left\|\nabla \widehat{f}_h(m)\right\| \to \infty$. The result follows.
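To sanity-check the closed form (49), the following sketch compares the analytic Jacobian of $F(x) = \nabla \widehat{f}_h(x)/\|\nabla \widehat{f}_h(x)\|$ with a central finite-difference Jacobian computed in the ambient coordinates. The von Mises profile, synthetic circular data, and bandwidth are assumptions made for this illustration, and the normalizing constants are omitted because they cancel inside $F$ and hence inside $\nabla F$.

```python
# Sanity check (assumed setup) of the Jacobian formula (49): compare the closed form with a
# central finite-difference Jacobian of F(x) = grad(x) / ||grad(x)|| in ambient coordinates.
import numpy as np

rng = np.random.default_rng(3)
n, h = 100, 0.5
theta = rng.vonmises(0.0, 3.0, size=n)
X = np.column_stack([np.cos(theta), np.sin(theta)])

def grad(x):                                         # total gradient of the unnormalized KDE
    w = np.exp(-(1.0 - X @ x) / h**2)
    return (X.T @ w) / h**2

def hess(x):                                         # total Hessian of the unnormalized KDE
    w = np.exp(-(1.0 - X @ x) / h**2)
    return (X.T * w) @ X / h**4

def F(x):
    g = grad(x)
    return g / np.linalg.norm(g)

x = np.array([np.cos(0.3), np.sin(0.3)])             # an arbitrary evaluation point on the circle
g, H = grad(x), hess(x)
norm_g = np.linalg.norm(g)
analytic = H / norm_g - np.outer(g, g) @ H / norm_g**3          # formula (49)

eps = 1e-6                                           # central differences column by column
numeric = np.column_stack([(F(x + eps * e) - F(x - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(analytic, numeric, atol=1e-5)
```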