The EM Perspective of Directional Mean Shift Algorithm
Yikun Zhang [email protected]
Yen-Chi Chen [email protected]
University of Washington, Seattle, WA 98195
Abstract
The directional mean shift (DMS) algorithm is a nonparametric method for pursuing local modes of densities defined by kernel density estimators on the unit hypersphere. In this paper, we show that any DMS iteration can be viewed as a generalized Expectation-Maximization (EM) algorithm; in particular, when the von Mises kernel is applied, it becomes an exact EM algorithm. Under the (generalized) EM framework, we provide a new proof for the ascending property of density estimates and demonstrate the global convergence of directional mean shift sequences. Finally, we give a new insight into the linear convergence of the DMS algorithm.
Keywords: Directional data, mean shift algorithm, kernel smoothing, EM algorithm
1. Introduction
The directional mean shift (DMS) algorithm is a generalization of the regular mean shift algorithm with Euclidean data (Fukunaga and Hostetler, 1975; Cheng, 1995; Comaniciu and Meer, 2002; Carreira-Perpiñán, 2015) toward handling directional data, which are assumed to consist of observations lying on a unit hypersphere $\Omega_q = \{x \in \mathbb{R}^{q+1}: \|x\| = 1\}$. Analogous to the regular mean shift procedure, the DMS algorithm is built upon a directional kernel density estimator (KDE) (Hall et al., 1987; Bai et al., 1988; Zhao and Wu, 2001; García-Portugués, 2013) and conducts fixed-point iterations to pursue the local modes of the directional KDE.

Several different forms of the DMS algorithm have been proposed and applied to various fields, such as gene clustering (Oba et al., 2005), medical structure classification (Kafai et al., 2010), and seismological analysis (Zhang and Chen, 2020), in the last two decades. Compared to its success in practice, the convergence theory of the DMS algorithm is less developed. The convergence of density estimates along any mean shift iterative sequence (or the so-called ascending property) was proved by Cheng (1995); Comaniciu and Meer (2002); Li et al. (2007) for the regular mean shift algorithm and generalized to the mean shift algorithm on arbitrary manifolds by Subbarao and Meer (2009). This ascending property was specialized to a DMS algorithm with the von Mises kernel by Kobayashi and Otsu (2010). Kafai et al. (2010) and Zhang and Chen (2020) later showed that the DMS algorithm's ascending property holds for any convex and differentiable directional kernel.

However, the convergence of DMS sequences is a more challenging problem. Such convergence has been well-studied in the regular mean shift scenarios (Cheng, 1995; Li et al., 2007; Carreira-Perpiñán, 2007; Aliyari Ghassabeh, 2013, 2015; Arias-Castro et al., 2016). In the directional data setting, the only work that we are aware of is Zhang and Chen (2020), but their result is limited to small neighborhoods of local modes. The global behavior of DMS sequences remains unclear.

To address the global convergence of any DMS sequence on $\Omega_q$, we formulate the DMS algorithm into a special case of the famous (generalized) Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Wu, 1983; Geoffrey J. McLachlan, 2008; Balakrishnan et al., 2017). It allows us to present a new proof of the ascending property and study the convergence of DMS sequences based on extensive research in the (generalized) EM algorithm. We are inspired by Carreira-Perpiñán (2007) for his method of writing the regular mean shift with Euclidean data as a (generalized) EM algorithm. However, the formulation is more difficult in any DMS algorithm because directional data lie on a nonlinear manifold. We overcome the problem by projecting the data from the original hypersphere $\Omega_q$ onto a higher-dimensional hypersphere $\Omega_{q+1}$ and designing a mixture model to associate the DMS algorithm with the (generalized) EM algorithm.

Main results.
The main objective of this paper is to show that the usual DMS algorithm (reviewed in Section 2.2) is a (generalized) EM algorithm. Specifically, our work contributes the following:

1. We construct a maximum likelihood problem viewing the directional KDE as a mixture model with a single parameter $\mu \in \Omega_q$ in a higher-dimensional sphere $\Omega_{q+1}$. Fitting the mixture model with a (generalized) EM algorithm at a single point $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$ leads to the DMS algorithm (Section 3).

2. Under the (generalized) EM framework, we prove the ascending property of the DMS algorithm based on the monotonicity of observed likelihoods along any (generalized) EM iteration (Section 4.2).

3. We derive the global convergence of DMS sequences using properties from the EM perspective (Section 4.3).

4. Finally, compared to the result in Zhang and Chen (2020), we provide an alternative but straightforward proof of linear convergence of the DMS algorithm (Section 5).

Notation.
Let $\Omega_q = \{x \in \mathbb{R}^{q+1}: \|x\| = 1\}$ be the $q$-dimensional unit sphere in the ambient Euclidean space $\mathbb{R}^{q+1}$, where $\|\cdot\|$ is the usual Euclidean norm (or $L_2$-norm) in $\mathbb{R}^{q+1}$. The inner product between two vectors $x, y \in \mathbb{R}^{q+1}$ is denoted by $x^T y$. Given a smooth function $f: \mathbb{R}^{q+1} \to \mathbb{R}$, we denote its total gradient and Hessian by
\[
\nabla f(x) = \left( \frac{\partial f(x)}{\partial x_1}, \dots, \frac{\partial f(x)}{\partial x_{q+1}} \right)^T \in \mathbb{R}^{q+1}
\quad \text{and} \quad
\nabla\nabla f(x) = \begin{pmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_{q+1}} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_{q+1} \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_{q+1}^2}
\end{pmatrix} \in \mathbb{R}^{(q+1) \times (q+1)},
\]
respectively. We use the big-$O$ notation $p_t = O(q_t)$ for a sequence of vectors $p_t$ and positive scalars $q_t$ if there is a global constant $C$ such that $\|p_t\| \leq C q_t$ for all sufficiently large $t$.
2. Background
Let $X_1, \dots, X_n \in \Omega_q \subset \mathbb{R}^{q+1}$ be a random sample generated from a directional density function $f$ on $\Omega_q$ with $\int_{\Omega_q} f(x)\, \omega_q(dx) = 1$, where $\omega_q$ is the Lebesgue measure on $\Omega_q$. The directional KDE at point $x \in \Omega_q$ is often written as (Beran, 1979; Hall et al., 1987; Bai et al., 1988; García-Portugués, 2013):
\[
\widehat{f}_h(x) = \frac{c_{h,q,L}}{n} \sum_{i=1}^n L\left( \frac{1 - x^T X_i}{h^2} \right), \tag{1}
\]
where $L$ is a directional kernel (i.e., a rapidly decaying function with nonnegative values defined on $[0, \infty) \subset \mathbb{R}$), $h > 0$ is the smoothing bandwidth parameter, and $c_{h,q,L}$ is a normalizing constant satisfying
\[
c_{h,q,L}^{-1} = \int_{\Omega_q} L\left( \frac{1 - x^T y}{h^2} \right) \omega_q(dy) = O(h^q). \tag{2}
\]
The bandwidth selection is crucial in determining the performance of directional KDEs because it controls the bias-variance tradeoff. There are various reliable bandwidth selection mechanisms in the literature (Hall et al., 1987; Bai et al., 1988; Taylor, 2008; Marzio et al., 2011; Oliveira et al., 2012; García-Portugués, 2013; Saavedra-Nieves and María Crujeiras, 2020). Compared to the bandwidth, the choice of the kernel is less influential. One popular candidate is the so-called von Mises kernel $L(r) = e^{-r}$, whose name originates from the famous $q$-von Mises-Fisher (vMF) distribution on $\Omega_q$ with the following density:
\[
f_{\mathrm{vMF}}(x \mid \mu, \kappa) = C_q(\kappa) \cdot \exp\left( \kappa \mu^T x \right) \quad \text{with} \quad C_q(\kappa) = \frac{\kappa^{\frac{q-1}{2}}}{(2\pi)^{\frac{q+1}{2}} I_{\frac{q-1}{2}}(\kappa)}, \tag{3}
\]
where $\mu \in \Omega_q$ is the directional mean, $\kappa \geq 0$ is the concentration parameter, and $I_{\alpha}(\kappa)$ is the modified Bessel function of the first kind of order $\alpha$; see Mardia (1975); Mardia and Jupp (2000) for more details. With the von Mises kernel, the directional KDE (1) becomes a mixture of $q$-von Mises-Fisher densities as:
\[
\widehat{f}_h(x) = \frac{1}{n} \sum_{i=1}^n f_{\mathrm{vMF}}\left( x; X_i, \frac{1}{h^2} \right) = \frac{C_q(1/h^2)}{n} \sum_{i=1}^n \exp\left( \frac{x^T X_i}{h^2} \right). \tag{4}
\]
The EM framework for estimating the parameters of a mixture of vMF distributions is also well-studied. We offer a detailed review in Appendix B. Another potential choice is the following truncated convex kernel proposed by Chang-Chien et al. (2012); Yang et al. (2014):
\[
L(r) = (1 - r)^p \cdot \mathbf{1}\{0 \leq r \leq 1\} \quad \text{with } p \geq 1. \tag{5}
\]
We follow the technique in Yang et al. (2014) to derive the most commonly used directional mean shift (DMS) algorithm. Assume that the kernel $L$ is differentiable except for finitely many points on $[0, \infty)$. Given the directional KDE $\widehat{f}_h(x)$ in (1), we introduce a Lagrangian multiplier $\lambda$ to maximize $\widehat{f}_h(x)$ under the constraint $x^T x = 1$ as follows:
\[
\widehat{\mathcal{L}}_h(x, \lambda) = \frac{c_{h,q,L}}{n} \sum_{i=1}^n L\left( \frac{1 - x^T X_i}{h^2} \right) + \lambda (1 - x^T x).
\]
Taking the partial derivatives of $\widehat{\mathcal{L}}_h$ with respect to $x$ and $\lambda$ and setting them to zero yield that
\[
\frac{\partial \widehat{\mathcal{L}}_h}{\partial x} = -\frac{c_{h,q,L}}{n h^2} \sum_{i=1}^n X_i\, L'\left( \frac{1 - x^T X_i}{h^2} \right) - 2\lambda x = 0 \quad \text{and} \quad 1 - x^T x = 0.
\]
Solving for $x$ leads to the fixed-point iteration equation $x^{(t+1)} = F(x^{(t)})$ for the DMS algorithm, where
\[
F(x) = -\frac{\sum_{i=1}^n X_i\, L'\left( \frac{1 - x^T X_i}{h^2} \right)}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - x^T X_i}{h^2} \right) \right\|}. \tag{6}
\]
Zhang and Chen (2020) suggested an alternative derivation of the DMS algorithm based on the following equality on $\Omega_q$:
\[
\widehat{f}_h(x) = \widetilde{f}_h(x) \quad \text{with} \quad \widetilde{f}_h(x) = \frac{c_{h,q,L}}{n} \sum_{i=1}^n L\left( \frac{1}{2} \left\| \frac{x - X_i}{h} \right\|^2 \right),
\]
and wrote out an explicit formula for the mean shift vector. Based on their derivation, the preceding fixed-point iteration scheme (6) can be written in terms of $\nabla \widehat{f}_h(x)$ as:
\[
x^{(t+1)} = F(x^{(t)}) = \frac{\nabla \widehat{f}_h(x^{(t)})}{\left\| \nabla \widehat{f}_h(x^{(t)}) \right\|}, \tag{7}
\]
where $\nabla$ is the total gradient operator in the ambient space $\mathbb{R}^{q+1}$.
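To make the iteration concrete, the following minimal NumPy sketch implements the fixed-point update (6) for a generic differentiable kernel derivative $L'$; the function names, stopping rule, and toy data are our own illustrative choices and are not part of the original algorithm specification.

```python
import numpy as np

def dms_update(x, X, h, dL):
    """One DMS fixed-point step (6): proportional to -sum_i X_i L'((1 - x^T X_i)/h^2)."""
    v = X.T @ (-dL((1.0 - X @ x) / h**2))
    return v / np.linalg.norm(v)            # project back onto the unit sphere

def dms(x0, X, h, dL, tol=1e-7, max_iter=1000):
    """Iterate x^{(t+1)} = F(x^{(t)}) in (6)/(7) until successive points are within tol."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iter):
        x_new = dms_update(x, X, h, dL)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# derivative of the von Mises kernel L(r) = exp(-r); for the truncated kernel (5) with
# p = 2 one could instead use dL = lambda r: np.where(r <= 1.0, -2.0 * (1.0 - r), 0.0)
dL_vmf = lambda r: -np.exp(-r)

rng = np.random.default_rng(0)
theta = rng.normal(loc=0.5, scale=0.3, size=200)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # n points on the circle Omega_1
mode = dms(np.array([1.0, 0.0]), X, h=0.3, dL=dL_vmf)
```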
3. Directional Mean Shift as a Generalized EM Algorithm
Recall that $\{X_1, \dots, X_n\} \subset \Omega_q$ is the observed directional dataset used by the DMS algorithm in Section 2.2. Our derivation starts with defining a mixture model on a higher-dimensional hypersphere $\Omega_{q+1}$. Consider the following mixture model for any point $y \in \Omega_{q+1}$:
\[
f_{L,q+1}(y) = \sum_{i=1}^n \alpha_i \cdot C_{\kappa_i, q+1, L} \cdot L\left[ \kappa_i (1 - y^T \nu_i) \right], \tag{8}
\]
where $\{\alpha_i, \nu_i, \kappa_i\}_{i=1}^n$ are the parameters of this mixture model such that $\sum_{i=1}^n \alpha_i = 1$, $\alpha_i \geq 0$ for $i = 1, \dots, n$, and $C_{\kappa_i, q+1, L}$ is the normalizing constant such that
\[
C_{\kappa_i, q+1, L}^{-1} = \int_{\Omega_{q+1}} L\left[ \kappa_i \left( 1 - y^T \nu_i \right) \right] \omega_{q+1}(dy) = \omega_q(\Omega_q) \int_{-1}^{1} L\left[ \kappa_i (1 - t) \right] \cdot (1 - t^2)^{\frac{q-1}{2}}\, dt \tag{9}
\]
with $\omega_q(\Omega_q)$ as the surface area of $\Omega_q$. Equation (9) demonstrates that the normalizing constant $C_{\kappa_i, q+1, L}$ is independent of the choice of the directional mean parameter $\nu_i$. The parameters $\alpha_i$ and $\kappa_i$ represent the weight (mixing proportion) and the amount of smoothing (concentration) applied to the $i$-th observation, respectively. Later, we will take $\alpha_i = \frac{1}{n}$ and $\kappa_i = \frac{1}{h^2}$, though they would depend on the index $i$ in a generic case. One should keep in mind that $\alpha_i, \kappa_i$ are fixed and given.

We introduce an unknown parameter $\mu$ relating the directional mean parameter $\nu_i$ in the mixture model (8) to each observation $X_i$ in the directional dataset as:
\[
\nu_i = \nu_i(\mu) = \left( X_i \sqrt{1 - (\mu^T X_i)^2},\; \mu^T X_i \right)^T \in \Omega_{q+1}. \tag{10}
\]
Given $\mu \in \Omega_q$, the distribution in (8) is a mixture of $n$ densities on $\Omega_{q+1}$ with directional means $\{\nu_i(\mu)\}_{i=1}^n$, fixed mixing proportions $\{\alpha_i\}_{i=1}^n$, and fixed concentration parameters $\{\kappa_i\}_{i=1}^n$. Varying the parameter $\mu \in \Omega_q$ in the mixture model leads to a spherical transformation of the whole mixture distribution in the higher-dimensional unit sphere $\Omega_{q+1}$. Our goal is to find the maximum likelihood estimate (MLE) of $\mu$ given a single observation from the mixture model.
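The lifting in (10) is easy to probe numerically: $\nu_i(\mu)$ is a unit vector in $\Omega_{q+1}$, and for $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$ the inner product $\nu_i(\mu)^T Y$ recovers $\mu^T X_i$. A small illustrative check (our own code, not from the paper) is below.

```python
import numpy as np

def lift(mu, Xi):
    """nu_i(mu) = (X_i * sqrt(1 - (mu^T X_i)^2), mu^T X_i)^T in Omega^{q+1}, as in (10)."""
    c = float(mu @ Xi)                                  # mu^T X_i
    return np.append(Xi * np.sqrt(max(1.0 - c**2, 0.0)), c)

rng = np.random.default_rng(1)
q = 2
Xi = rng.normal(size=q + 1); Xi /= np.linalg.norm(Xi)   # a point on Omega_q
mu = rng.normal(size=q + 1); mu /= np.linalg.norm(mu)   # a candidate parameter on Omega_q
nu = lift(mu, Xi)
Y = np.zeros(q + 2); Y[-1] = 1.0                        # Y = (0, ..., 0, 1)^T in Omega^{q+1}
assert np.isclose(np.linalg.norm(nu), 1.0)              # nu_i(mu) lies on Omega^{q+1}
assert np.isclose(nu @ Y, mu @ Xi)                      # nu_i(mu)^T Y = mu^T X_i
```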
We consider the following hypothetical setup for deriving the EM algorithm and connecting it with the DMS algorithm. Suppose that we are given $Y$, an observation in $\Omega_{q+1}$. We fit the mixture model (8) to $Y$ and attempt to find the MLE of $\mu$ based on this single observation. For such a mixture model, finding the MLE of $\mu$ can be done via the EM algorithm. To this end, we introduce a hidden/latent random variable $Z$ indicating from which component $Y$ is generated. Namely, $Z = i$ if $Y$ is sampled from $C_{\kappa_i, q+1, L} \cdot L\left[ \kappa_i (1 - y^T \nu_i) \right]$ in (8). The EM framework for obtaining the MLE of $\mu$ given the observed data $Y$ is as follows:

• E-Step: Assuming that the value of $Z$ is known, the complete log-likelihood is written as:
\[
\log P(Y, Z \mid \mu) = \sum_{i=1}^n \mathbf{1}\{Z = i\} \cdot \left\{ \log \alpha_i + \log C_{\kappa_i, q+1, L} + \log L\left[ \kappa_i (1 - Y^T \nu_i) \right] \right\}. \tag{11}
\]
Given $(Y, \mu^{(t)})$ at the $t$-th iteration, we take the expectation of the complete log-likelihood (11) with respect to the current posterior distribution of $Z \mid (Y, \mu^{(t)})$ as:
\[
Q(\mu \mid \mu^{(t)}) = \mathbb{E}_{Z \mid (Y, \mu^{(t)})}\left[ \log P(Y, Z \mid \mu) \right]
= \sum_{i=1}^n \left[ \log \alpha_i + \log C_{\kappa_i, q+1, L} \right] \cdot P(Z = i \mid Y, \mu^{(t)}) + \sum_{i=1}^n \log L\left[ \kappa_i (1 - Y^T \nu_i) \right] \cdot P(Z = i \mid Y, \mu^{(t)}). \tag{12}
\]
By Bayes' theorem, the posterior distribution $P(Z = i \mid Y, \mu^{(t)})$ is computed as
\[
P(Z = i \mid Y, \mu^{(t)}) = \frac{P(Y \mid Z = i, \mu^{(t)}) \cdot P(Z = i \mid \mu^{(t)})}{P(Y \mid \mu^{(t)})} = \frac{\alpha_i \cdot C_{\kappa_i, q+1, L} \cdot L\left[ \kappa_i \left( 1 - Y^T \nu_i^{(t)} \right) \right]}{\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell, q+1, L} \cdot L\left[ \kappa_\ell \left( 1 - Y^T \nu_\ell^{(t)} \right) \right]}. \tag{13}
\]

• M-Step: To derive the M-step, we focus on a particular case where the observation is $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$. Since $\nu_i^T Y = \mu^T X_i$ for $i = 1, \dots, n$, the Q-function (12) reduces to
\[
Q(\mu \mid \mu^{(t)}) = \sum_{i=1}^n \left[ \log \alpha_i + \log C_{\kappa_i, q+1, L} \right] \cdot P(Z = i \mid Y, \mu^{(t)}) + \sum_{i=1}^n \log L\left[ \kappa_i (1 - \mu^T X_i) \right] \cdot P(Z = i \mid Y, \mu^{(t)}). \tag{14}
\]
As shown in (9), each normalizing constant $C_{\kappa_i, q+1, L}$ is independent of the directional mean parameter $\nu_i$ for $i = 1, \dots, n$ (and consequently, of the parameter $\mu$), so we only need to maximize the second term in (14) with respect to $\mu$ under the constraint $\mu^T \mu = 1$. To this end, we introduce a Lagrangian multiplier $\lambda$ and obtain the following Lagrangian function:
\[
\mathcal{L}(\mu, \lambda) = \sum_{i=1}^n \log L\left[ \kappa_i (1 - \mu^T X_i) \right] \cdot P(Z = i \mid Y, \mu^{(t)}) + \lambda \left( 1 - \mu^T \mu \right).
\]
Taking the partial derivatives of $\mathcal{L}$ with respect to $\mu$ and $\lambda$ and setting them to zero yield that
\[
\frac{\partial \mathcal{L}}{\partial \mu} = \sum_{i=1}^n P(Z = i \mid Y, \mu^{(t)}) \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \mu^T X_i) \right]}{L\left[ \kappa_i (1 - \mu^T X_i) \right]} - 2\lambda \mu = 0 \quad \text{and} \quad 1 - \mu^T \mu = 0.
\]
This further implies that
\[
\begin{aligned}
\widehat{\mu} &= \frac{\sum_{i=1}^n P(Z = i \mid Y, \mu^{(t)}) \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}}{\left\| \sum_{i=1}^n P(Z = i \mid Y, \mu^{(t)}) \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]} \right\|} \\
&= \frac{\sum_{i=1}^n \frac{\alpha_i\, C_{\kappa_i, q+1, L}\, L\left[ \kappa_i (1 - X_i^T \mu^{(t)}) \right]}{\sum_{\ell=1}^n \alpha_\ell\, C_{\kappa_\ell, q+1, L}\, L\left[ \kappa_\ell (1 - X_\ell^T \mu^{(t)}) \right]} \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}}{\left\| \sum_{i=1}^n \frac{\alpha_i\, C_{\kappa_i, q+1, L}\, L\left[ \kappa_i (1 - X_i^T \mu^{(t)}) \right]}{\sum_{\ell=1}^n \alpha_\ell\, C_{\kappa_\ell, q+1, L}\, L\left[ \kappa_\ell (1 - X_\ell^T \mu^{(t)}) \right]} \cdot \frac{-\kappa_i X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]} \right\|} \\
&= \frac{-\sum_{i=1}^n \kappa_i \alpha_i\, C_{\kappa_i, q+1, L}\, X_i \cdot L'\left[ \kappa_i \left( 1 - \widehat{\mu}^T X_i \right) \right] \cdot \frac{L\left[ \kappa_i (1 - X_i^T \mu^{(t)}) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]}}{\left\| \sum_{i=1}^n \kappa_i \alpha_i\, C_{\kappa_i, q+1, L}\, X_i \cdot L'\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right] \cdot \frac{L\left[ \kappa_i (1 - X_i^T \mu^{(t)}) \right]}{L\left[ \kappa_i (1 - \widehat{\mu}^T X_i) \right]} \right\|},
\end{aligned} \tag{15}
\]
where we plug the posterior distribution (13) into the second equality, and the term $\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell, q+1, L} \cdot L\left[ \kappa_\ell \left( 1 - X_\ell^T \mu^{(t)} \right) \right]$ is independent of the outer summation index $i$ in both the numerator and denominator, so it cancels out in the third equality.
Therefore, equation (15) defines a fixed-point iteration scheme for obtaining $\mu^{(t+1)}$ in the EM framework, and an exact M-step will iterate (15) until convergence.

Suppose that we take $\kappa_i = \frac{1}{h^2}$ and $\alpha_i = \frac{1}{n}$ for all $i = 1, \dots, n$ in the fixed-point iteration equation (15). Then, the normalizing constant $C_{\kappa_i, q+1, L}$ will no longer depend on the summation index $i$, and the fixed-point iteration equation reduces to
\[
\widehat{\mu} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - \widehat{\mu}^T X_i}{h^2} \right) \cdot \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \frac{1 - \widehat{\mu}^T X_i}{h^2} \right)}}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - \widehat{\mu}^T X_i}{h^2} \right) \cdot \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \frac{1 - \widehat{\mu}^T X_i}{h^2} \right)} \right\|} \tag{16}
\]
or, more specifically,
\[
\widehat{\mu}^{(k+1; t)} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \widehat{\mu}^{(k; t)}}{h^2} \right) \cdot \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \frac{1 - X_i^T \widehat{\mu}^{(k; t)}}{h^2} \right)}}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \widehat{\mu}^{(k; t)}}{h^2} \right) \cdot \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \frac{1 - X_i^T \widehat{\mu}^{(k; t)}}{h^2} \right)} \right\|} \tag{17}
\]
for $k = 0, 1, 2, \dots$. Therefore, the complete EM algorithm consists of two nested loops: the outer one is the usual iteration over $t$, while the inner one iterates $\{\widehat{\mu}^{(k; t)}\}_{k=0}^{\infty}$ under the fixed $\mu^{(t)}$ for an exact M-step. If we choose the initial value for iterating (17) as $\widehat{\mu}^{(0; t)} = \mu^{(t)}$ and do a single iteration of (17), i.e., $\mu^{(t+1)} = \widehat{\mu}^{(1; t)}$, we obtain
\[
\mu^{(t+1)} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\|}, \tag{18}
\]
which coincides with the mean shift updating formula with directional data (cf. equation (6)). In other words, the DMS algorithm can be viewed as the above EM algorithm with an inexact M-step update. More importantly, as we will demonstrate in Section 4.2, the Q-function $Q(\mu \mid \mu^{(t)})$ is non-decreasing along the iterative path defined by (18) under some mild conditions on kernel $L$. Therefore, the DMS algorithm with a general kernel is indeed a generalized EM algorithm.
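This nested structure is easy to mimic in a few lines. The sketch below (an illustrative implementation with $\alpha_i = 1/n$ and $\kappa_i = 1/h^2$; the function decomposition and names are our own) computes the E-step responsibilities in (13) and then either runs the inner fixed-point loop (17) for several steps (approximating an exact M-step) or takes a single step, which reproduces the DMS update (18).

```python
import numpy as np

def responsibilities(mu_t, X, h, L):
    """Posterior P(Z = i | Y, mu^(t)) in (13) with alpha_i = 1/n and kappa_i = 1/h^2."""
    w = L((1.0 - X @ mu_t) / h**2)
    return w / w.sum()

def m_step(mu_t, X, h, L, dL, n_inner=1):
    """Inner fixed-point loop (17); n_inner = 1 reproduces the DMS update (18)."""
    r = responsibilities(mu_t, X, h, L)
    mu = mu_t.copy()
    for _ in range(n_inner):
        args = (1.0 - X @ mu) / h**2
        v = X.T @ (r * (-dL(args) / L(args)))   # weighted mean shift target in (16)
        mu = v / np.linalg.norm(v)
    return mu

# von Mises kernel L(r) = exp(-r): the ratio -L'/L is identically 1, so the inner
# loop converges after a single step (the exact-EM case discussed below)
L_vmf = lambda r: np.exp(-r)
dL_vmf = lambda r: -np.exp(-r)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)); X /= np.linalg.norm(X, axis=1, keepdims=True)
mu0 = np.array([0.0, 0.0, 1.0])
one_step = m_step(mu0, X, h=0.5, L=L_vmf, dL=dL_vmf, n_inner=1)    # DMS update (18)
exact = m_step(mu0, X, h=0.5, L=L_vmf, dL=dL_vmf, n_inner=25)      # near-exact M-step
assert np.allclose(one_step, exact)   # they agree because L' is proportional to L here
```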
Remark 1. Our derivation is different from the suggestion given in Subbarao and Meer (2009). In our derivation, we cannot define a group action of $G$ on $\Omega_q$ such that the componentwise density $C_{\kappa_i, q+1, L} \cdot L\left[ \kappa_i (1 - y^T \nu_i) \right]$ in (8) is equivalent to the original density $C_{\kappa_i, q, L} \cdot L\left[ \kappa_i (1 - (g \cdot y)^T X_i) \right]$ under some $g \in G$, because these two densities lie on hyperspheres with different dimensions.

As the Gaussian mean shift with Euclidean data can be written as an EM algorithm (Carreira-Perpiñán, 2007), we show in this subsection that the DMS algorithm with the von Mises kernel is also an EM algorithm. That is, the preceding M-step in Section 3.1 is exact when we apply the von Mises kernel. This is because the von Mises kernel $L(r) = e^{-r}$ satisfies the property that $L' \propto L$, and the above M-step (15) simplifies to:
\[
\mu^{(t+1)} = \frac{\sum_{i=1}^n \kappa_i \alpha_i\, C_{q+1}(\kappa_i)\, X_i \cdot \exp\left( \kappa_i X_i^T \mu^{(t)} \right)}{\left\| \sum_{i=1}^n \kappa_i \alpha_i\, C_{q+1}(\kappa_i)\, X_i \cdot \exp\left( \kappa_i X_i^T \mu^{(t)} \right) \right\|}. \tag{19}
\]
Now, if we take $\kappa_i = \frac{1}{h^2}$ and $\alpha_i = \frac{1}{n}$ for all $i = 1, \dots, n$ as before, the exact M-step in (19) becomes
\[
\mu^{(t+1)} = \frac{\sum_{i=1}^n X_i \exp\left( \frac{X_i^T \mu^{(t)}}{h^2} \right)}{\left\| \sum_{i=1}^n X_i \exp\left( \frac{X_i^T \mu^{(t)}}{h^2} \right) \right\|}, \tag{20}
\]
which is precisely the DMS algorithm with the von Mises kernel. As a result, the DMS algorithm with the von Mises kernel is an exact EM algorithm.

Our derivation is motivated by the discovery of Carreira-Perpiñán (2007), in which the author obtained an elegant connection between the regular mean shift algorithm with Euclidean data and the (generalized) EM algorithm. The key of the derivation in Carreira-Perpiñán (2007) is to define an artificial maximum likelihood problem where

1. the underlying distribution is a mixture model defined by the original KDE so that the means of its components are given by the dataset to which the mean shift algorithm is applied, but the mixture model contains only a single displacement parameter,

2. and the mixture model is fit on a sample with a single point, the origin $Y = 0$.

Under the above construction, Carreira-Perpiñán (2007) proved that any maximum likelihood estimate of the displacement vector is a mode of the original KDE. However, as is argued in Appendix C, naively applying the idea of Carreira-Perpiñán (2007) would fail. We have to make a non-trivial extension of his approach to bridge the connection between the DMS and EM algorithms. The two key changes we made in our derivation are:

1. We construct a mixture model in a higher-dimensional hypersphere $\Omega_{q+1}$ and introduce the parameter $\mu \in \Omega_q$ controlling the projections of observations $\{X_1, \dots, X_n\} \subset \Omega_q$ onto $\Omega_{q+1}$.

2. We fit the mixture model on a sample containing a single point $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$.
4. Convergence of Directional Mean Shift as a Generalized EM Algorithm
In this section, we show that the Q-function (14) is non-decreasing along any DMS sequence (18) under some regularity conditions on kernel $L$. As the directional KDE will be identical to the corresponding observed likelihood up to a multiplicative constant, we furnish a new proof of the ascending property of density estimates along DMS sequences based on the generalized EM perspective, which is different from the existing results (e.g., Kobayashi and Otsu (2010); Kafai et al. (2010); Zhang and Chen (2020)). Moreover, the connection to the (generalized) EM algorithm implies that the DMS sequence converges from almost any starting point on $\Omega_q$, though it may get stuck at a saddle point or a local minimum of the directional KDE $\widehat{f}_h$ in (1); see such non-typical behaviors of EM algorithms in Sections 3.6.1 and 3.6.2 of Geoffrey J. McLachlan (2008). Nevertheless, the latter cases are unlikely in practice because both saddle points and minima are unstable for maximization (Carreira-Perpiñán, 2007). Therefore, the convergence of DMS sequences will almost always be to a local mode of $\widehat{f}_h$ in practice. This conclusion is much stronger than the local convergence in Theorem 11 of Zhang and Chen (2020).

To guarantee the convergence of the DMS iteration (18) as a generalized EM algorithm, we make the following assumptions on the directional KDE (1) and kernel $L$:

• (C1) The number of local modes of $\widehat{f}_h$ on $\Omega_q$ is finite, and the modes are isolated from other critical points.

• (C2) $L: [0, \infty) \to [0, \infty)$ is non-increasing, differentiable, and convex with $0 < L(0) < \infty$.

Condition (C1) is commonly assumed for the convergence of regular mean shift sequences (Li et al., 2007; Aliyari Ghassabeh, 2015). It is an evident corollary for the regular KDE with various kernels (Silverman, 1981; Hall et al., 2004; Genovese et al., 2016; Chacón, 2020), and one can easily extend the result to directional KDEs. Condition (C2) is a natural assumption on kernel $L$; see, for instance, Comaniciu and Meer (2002); Li et al. (2007); Kafai et al. (2010); Zhang and Chen (2020). It is satisfied by many directional kernels, such as the von Mises kernel $L(r) = e^{-r}$. The differentiability of kernel $L$ can be relaxed to include finitely many non-differentiable points on $[0, \infty)$. Such relaxation enables our convergence results to incorporate DMS algorithms with kernels having bounded supports (e.g., kernels in (5)).

The key reason why the DMS iteration is a generalized EM algorithm is that a single iteration of every incomplete M-step with the initial value $\mu^{(t)}$ as in (18) ensures the following non-decreasing property of the Q-function (14).
Theorem 1. Under condition (C2), the Q-function (14) satisfies $Q(\mu^{(t+1)} \mid \mu^{(t)}) \geq Q(\mu^{(t)} \mid \mu^{(t)})$, where
\[
\mu^{(t+1)} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\|}
\]
is from the iteration (18).

The essence of the proof of Theorem 1 lies in the following two inequalities:
\[
\log y - \log x \geq \frac{y - x}{y} \quad \text{and} \quad L(y) - L(x) \geq L'(x) \cdot (y - x), \tag{21}
\]
which are guaranteed by the concavity of $\log(\cdot)$ as well as the convexity and differentiability of the kernel $L$; see Appendix D.1. The condition (C2) on kernel $L$ provides a sufficient condition under which the directional mean shift iteration does increase the Q-function (14), but it is by no means necessary. One can choose a different kernel and show that the iteration (18) increases the Q-function using a similar argument but under other conditions (e.g., $\log L(r)$ is convex).

The observed log-likelihood with data point $Y = (0, \dots, 0, 1)^T \in \Omega_{q+1}$ under the EM framework of Section 3.1 embraces the following association with the logarithm of the directional KDE $\widehat{f}_h$ in (1).
Proposition 2. The observed log-likelihood $\log P(Y \mid \mu)$ is given by
\[
\log P(Y \mid \mu) = \log \left[ \sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell, q+1, L} \cdot L\left[ \kappa_\ell \left( 1 - X_\ell^T \mu \right) \right] \right]. \tag{22}
\]
In particular, when $\alpha_\ell = \frac{1}{n}$ and $\kappa_\ell = \frac{1}{h^2}$ for all $\ell = 1, \dots, n$ (i.e., simplifying to the DMS case), it implies that
\[
\log P(Y \mid \mu) = \log \widehat{f}_h(\mu) + \log \frac{C_{1/h^2, q+1, L}}{c_{h,q,L}}, \tag{23}
\]
where $c_{h,q,L}$ is defined in (2) and $C_{\kappa_i, q+1, L}$ is elucidated in (9).

The proof of Proposition 2 is deferred to Appendix D.2. Although some careful readers may note that $c_{h,q,L} = C_{1/h^2, q, L}$, the extra term $\log \frac{C_{1/h^2, q+1, L}}{c_{h,q,L}}$ is still irreducible in (23), because we define the mixture density (8) for the (generalized) EM algorithm on a higher-dimensional sphere $\Omega_{q+1}$. However, the presence of this extra term will not affect the ascending property of density estimates along DMS (or equivalently, generalized EM) sequences, as stated in the theorem below.
Theorem 3. Let $\{\mu^{(t)}\}_{t=0}^{\infty}$ be the path of successive points defined by the DMS algorithm (6). Under condition (C2), the sequence $\{\widehat{f}_h(\mu^{(t)})\}_{t=0}^{\infty}$ is non-decreasing and thus converges.

Theorem 3 is the well-known monotonicity of observed (log-)likelihood sequences defined by the (generalized) EM algorithm (cf. Theorem 1 in Dempster et al. (1977)). The proof is in Appendix D.3.

The connection to the (generalized) EM algorithm allows us to derive a global convergence property of the DMS algorithm.

Theorem 4. Let $\{\mu^{(t)}\}_{t=0}^{\infty}$ be the path of successive points defined by the DMS algorithm (6). Under conditions (C1) and (C2), the sequence $\{\mu^{(t)}\}_{t=0}^{\infty}$ converges to a local mode of $\widehat{f}_h$ from almost every starting point $\mu^{(0)} \in \Omega_q$.

The high-level idea of the proof is as follows. For the convergence of (generalized) EM sequences to stationary points (in most cases, local maxima), Wu (1983) assumed the compactness of the parameter space and the differentiability of the observed log-likelihood within the parameter space. In addition, the set of local modes of the observed log-likelihood is required to be discrete in order to guarantee the convergence of any generalized EM sequence to a single stationary point. We detail these assumptions and the resulting convergence result for generalized EM (or, equally, DMS) sequences in Appendix D.4 and show that conditions (C1) and (C2) imply these assumptions.

In practice, the set of local minima and saddle points of $\widehat{f}_h$ will have zero Lebesgue measure on $\Omega_q$, so the convergence of DMS sequences will almost always be to a local mode; see Appendix A for basins of attraction (or convergence domains) yielded by DMS algorithms. Therefore, our result (Theorem 4) is stronger than the convergence theory of Zhang and Chen (2020), in which they only proved the convergence of DMS sequences around each local mode of $\widehat{f}_h$.
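Theorems 3 and 4 are easy to probe numerically: along any DMS run, the directional KDE values should be non-decreasing, and generic starting points should land on local modes. A minimal, self-contained sketch (our own illustrative code, von Mises kernel, KDE kept unnormalized since the constant does not affect monotonicity) is shown below.

```python
import numpy as np

def kde_unnorm(x, X, h):
    """Directional KDE (1) with the von Mises kernel, up to the constant c_{h,q,L}/n."""
    return np.exp(-(1.0 - X @ x) / h**2).sum()

def dms_path(x0, X, h, n_steps=50):
    """Return the KDE values along the DMS iterates started at x0."""
    x, vals = x0 / np.linalg.norm(x0), []
    for _ in range(n_steps):
        vals.append(kde_unnorm(x, X, h))
        v = X.T @ np.exp(-(1.0 - X @ x) / h**2)   # von Mises update (18)/(20)
        x = v / np.linalg.norm(v)
    return np.array(vals)

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)); X /= np.linalg.norm(X, axis=1, keepdims=True)
vals = dms_path(rng.normal(size=3), X, h=0.5)
assert np.all(np.diff(vals) >= -1e-12)            # ascending property (Theorem 3)
```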
5. Rate of Convergence of Directional Mean Shift Algorithm
In this section, we study the rate of convergence of the DMS algorithm. We have shown in Sections 3 and 4 that the DMS algorithm is a generalized EM algorithm and converges to local modes of the directional KDE from almost every starting point on $\Omega_q$. It is well-known that the convergence rate of an EM algorithm is generally linear (page 102 in Geoffrey J. McLachlan (2008)). Carreira-Perpiñán (2007) studied in detail the rate of convergence for Gaussian mean shift in the homoscedastic and isotropic case. More recently, Zhang and Chen (2020) wrote the DMS into a gradient ascent algorithm with an adaptive step size on $\Omega_q$ and argued that the DMS algorithm will converge linearly with a sufficiently small bandwidth around some neighborhoods of local modes. We will investigate the algorithmic convergence rate of the DMS algorithm in an alternative but straightforward approach. Consider the following assumptions:

• (D1) The true directional density $f$ is continuously differentiable, and its partial derivatives are square integrable on $\Omega_q$.

• (D2) Besides (C2), we assume that $L: [0, \infty) \to [0, \infty)$ satisfies
\[
0 < \int_0^{\infty} L(r)^k\, r^{\frac{q}{2}-1}\, dr < \infty \quad \text{and} \quad 0 < \int_0^{\infty} |L'(r)|^k\, r^{\frac{q}{2}-1}\, dr < \infty
\]
for all $q \geq 1$ and $k = 1, 2$, and $\int_{\epsilon^{-1}}^{\infty} |L'(r)|\, r^{q/2}\, dr = O(\epsilon)$ as $\epsilon \to 0^+$.

Conditions (D1) and (D2) are common assumptions in directional data to establish the pointwise and uniform consistency of the directional KDE $\widehat{f}_h$ to the true density $f$ on $\Omega_q$ (Hall et al., 1987; Klemelä, 2000; Zhao and Wu, 2001; García-Portugués et al., 2013; García-Portugués, 2013; Zhang and Chen, 2020; Alonso-Pena and Crujeiras, 2020).

Now, let $m$ be a local mode of the directional KDE $\widehat{f}_h$ to which the DMS sequence $\{x^{(t)}\}_{t=0}^{\infty}$ defined by (6) converges. By Taylor's theorem and the fact that $m = F(m)$, we deduce that
\[
x^{(t+1)} = F(x^{(t)}) = m + \nabla F(m) \cdot (x^{(t)} - m) + O\left( \left\| x^{(t)} - m \right\|^2 \right), \tag{24}
\]
where $\nabla F(m) \in \mathbb{R}^{(q+1) \times (q+1)}$ is the Jacobian of $F$ evaluated at $m$. Thus, $x^{(t+1)} - m \approx \nabla F(m)(x^{(t)} - m)$ within a small neighborhood of $m$ on $\Omega_q$. It implies that the rate of convergence is associated with the Jacobian $\nabla F(m)$ through its eigenvalues as follows:
\[
\frac{\left\| x^{(t+1)} - m \right\|}{\left\| x^{(t)} - m \right\|} \leq \max_{i=1,\dots,q+1} |\Lambda_i|, \tag{25}
\]
where $\Lambda_1, \dots, \Lambda_{q+1}$ are the eigenvalues of $\nabla F(m)$. More importantly, the Jacobian $\nabla F(x)$ has the following useful properties.
Proposition 5. For any $x \in \Omega_q \subset \mathbb{R}^{q+1}$,

(a) the Jacobian $\nabla F(x)$ is a symmetric matrix in $\mathbb{R}^{(q+1) \times (q+1)}$ and has an explicit form
\[
\nabla F(x) = \frac{\nabla\nabla \widehat{f}_h(x)}{\left\| \nabla \widehat{f}_h(x) \right\|} - \frac{\nabla \widehat{f}_h(x)\, \nabla \widehat{f}_h(x)^T\, \nabla\nabla \widehat{f}_h(x)}{\left\| \nabla \widehat{f}_h(x) \right\|^3}. \tag{26}
\]
In particular, $\nabla F(m) = \left( I_{q+1} - m m^T \right) \frac{\nabla\nabla \widehat{f}_h(m)}{\| \nabla \widehat{f}_h(m) \|}$ at any local mode $m$ of $\widehat{f}_h$.

(b) We have that $\max_{i=1,\dots,q+1} |\Lambda_i| \to 0$ as $h \to 0$ and $nh^q \to \infty$, where $\Lambda_1, \dots, \Lambda_{q+1}$ are the eigenvalues of $\nabla F(m)$.

The proof of Proposition 5 is in Appendix D.5. Note that $\| \nabla \widehat{f}_h(x) \| \to \infty$ as $h \to 0$ and $nh^q \to \infty$ for any $x \in \Omega_q$; see Lemma 10 in Zhang and Chen (2020). While $\| \nabla \widehat{f}_h(x) \| \to \infty$ may seem counter-intuitive, it is indeed a reasonable result because the total gradient $\nabla \widehat{f}_h(x)$ in $\mathbb{R}^{q+1}$ consists of two parts: the gradient tangent to $\Omega_q$ (tangent gradient) and the gradient perpendicular to $\Omega_q$ (normal gradient). The tangent gradient is a stable quantity (since it is a well-defined quantity in terms of the true directional density $f$), while the normal gradient will diverge (since the true directional density $f$ does not have any normal gradient).

Together with (25), Proposition 5 sheds light on the algorithmic convergence rate of the DMS algorithm. As $h \to 0$ and $nh^q \to \infty$, $\nabla F(m) \to 0$ and the convergence of the DMS algorithm around $m$ becomes superlinear. Yet, it will never be quadratic because $\nabla F(m) \neq 0$ for any positive bandwidth $h$ and sample size $n$. In practice, one can select a small bandwidth to attain the linear convergence when the amount of available data is sufficient. Our result coincides with the findings of Carreira-Perpiñán (2007) for Euclidean data and Zhang and Chen (2020) for directional data.
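The linear rate in (25) can be estimated empirically. The sketch below (our own illustrative code with the von Mises kernel; the finite-difference Jacobian is a stand-in for the explicit formula (26)) runs the map $F$ to an approximate mode and reports the spectral radius of a numerical Jacobian there, which should be strictly between 0 and 1 and shrink as $h$ decreases with enough data.

```python
import numpy as np

def F(x, X, h):
    """DMS map (6) with the von Mises kernel, viewed as a map on R^{q+1} away from 0."""
    v = X.T @ np.exp(-(1.0 - X @ x) / h**2)
    return v / np.linalg.norm(v)

def numerical_jacobian(x, X, h, eps=1e-6):
    """Central-difference approximation of the (q+1) x (q+1) Jacobian of F at x."""
    d = x.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        J[:, j] = (F(x + e, X, h) - F(x - e, X, h)) / (2.0 * eps)
    return J

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)); X /= np.linalg.norm(X, axis=1, keepdims=True)
h = 0.4
m = rng.normal(size=3); m /= np.linalg.norm(m)
for _ in range(200):                       # iterate to an (approximate) local mode m
    m = F(m, X, h)
rate = np.max(np.abs(np.linalg.eigvals(numerical_jacobian(m, X, h))))
print("estimated linear rate max_i |Lambda_i| =", rate)
```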
6. Discussions
This paper has shown that the DMS algorithm is an EM algorithm for the von Mises kernel and a generalized EM algorithm for general kernels. Similar to the Euclidean KDE and mean shift algorithm, our results hold when each data point has a different weight and concentration parameter in the underlying directional KDE. Under the (generalized) EM representation, we have provided a new proof of the DMS algorithm's ascending property and validated the global convergence of its iterative sequences on the unit hypersphere $\Omega_q$. Finally, we have discussed the rate of convergence of the DMS algorithm by analyzing the Jacobian of its fixed-point iterative function.

Within a small scope, our work suggests several potential extensions or improvements of the DMS algorithm. First, instead of performing a single iteration step (18) as the DMS algorithm does, one may iterate (16) with a fixed $\mu^{(t)}$ for several steps before replacing $\mu^{(t)}$ by the resulting output $\mu^{(t+1)}$ in (16). These extra iterations in the M-step may give rise to faster convergence of the DMS algorithm. However, one needs to propose a strategy to handle zero divisions in (18) when implementing this extension with truncated kernels such as (5). Second, our explicit formula (26) for the Jacobian of the fixed-point iterative function can help accelerate the DMS algorithm via Aitken's or other related methods (Section 4.8 in Geoffrey J. McLachlan (2008)). Finally, other practical guidelines for improving the regular mean shift algorithm based on its tie to the (generalized) EM algorithm (Carreira-Perpiñán, 2007) may be adoptable to the DMS scenario.

More broadly, our work may have some potential impact on nonconvex optimization theory on manifolds. The existing convergence theory of a gradient ascent/descent type method with a nonconvex objective function (as in our DMS scenario) is often limited to some small neighborhoods of its maximizers/minimizers, especially when the objective function is supported on a compact manifold. Moreover, the ascending property of the objective function along the gradient ascent/descent path is not generally guaranteed. Our research points to a possible avenue for establishing the ascending property and global convergence theory for a broader class of gradient-based methods, that is, connecting them to the well-studied EM algorithm. Upon this connection, our technique of projecting a lower-dimensional manifold onto a higher-dimensional one could be potentially useful.
YC is supported by NSF DMS-1810960 and DMS-195278, NIH U01-AG0169761.
References
M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications, Inc., New York, 1974.

Y. Aliyari Ghassabeh. On the convergence of the mean shift algorithm in the one-dimensional space. Pattern Recognition Letters, 34(12):1423-1427, 2013.

Y. Aliyari Ghassabeh. A sufficient condition for the convergence of the mean shift algorithm with Gaussian kernel. Journal of Multivariate Analysis, 135:1-10, 2015.

M. Alonso-Pena and R. M. Crujeiras. Nonparametric multimodal regression for circular data. arXiv preprint arXiv:2012.09915, 2020.

E. Arias-Castro, D. Mason, and B. Pelletier. On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. Journal of Machine Learning Research, 17(43):1-28, 2016.

Z. Bai, C. Rao, and L. Zhao. Kernel estimators of density function of directional data. Journal of Multivariate Analysis, 27(1):24-39, 1988.

S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 45(1):77-120, 2017. URL https://doi.org/10.1214/16-AOS1435.

A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345-1382, 2005.

R. Beran. Exponential models for directional data. The Annals of Statistics, 7(6):1162-1178, 1979.

M. Á. Carreira-Perpiñán. Gaussian mean-shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767-776, May 2007.

M. Á. Carreira-Perpiñán. A review of mean-shift algorithms for clustering. arXiv preprint arXiv:1503.00687, 2015.

J. E. Chacón. The modal age of statistics. International Statistical Review, 88(1):122-141, 2020.

S.-J. Chang-Chien, W.-L. Hung, and M.-S. Yang. On mean shift-based clustering for circular data. Soft Computing, 16(6):1043-1060, June 2012. ISSN 1432-7643.

Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790-799, 1995.

D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, 2002.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1-22, 1977.

I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143-175, 2001.

K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32-40, 1975.

E. García-Portugués. Exact risk improvement of bandwidth selectors for kernel density estimation with directional data. Electronic Journal of Statistics, 7:1655-1685, 2013.

E. García-Portugués, R. M. Crujeiras, and W. González-Manteiga. Kernel density estimation for directional-linear data. Journal of Multivariate Analysis, 121:152-175, 2013.

C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Non-parametric inference for density modes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):99-126, 2016.

G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. Wiley-Interscience, 2nd edition, 2008.

P. Hall, G. S. Watson, and J. Cabrera. Kernel density estimation with spherical data. Biometrika, 74(4):751-762, 1987.

P. Hall, M. C. Minnotte, and C. Zhang. Bump hunting with non-Gaussian kernels. Annals of Statistics, 32(5):2124-2141, 2004.

M. Kafai, Y. Miao, and K. Okada. Directional mean shift and its application for topology classification of local 3D structures. Pages 170-177, 2010.

J. Klemelä. Estimation of densities and derivatives of densities with directional data. Journal of Multivariate Analysis, 73(1):18-40, 2000.

T. Kobayashi and N. Otsu. Von Mises-Fisher mean shift for clustering on a hypersphere. Pages 2130-2133, 2010.

X. Li, Z. Hu, and F. Wu. A note on the convergence of the mean shift. Pattern Recognition, 40(6):1756-1762, 2007.

K. Mardia and P. Jupp. Directional Statistics. Wiley Series in Probability and Statistics. Wiley, 2000.

K. V. Mardia. Statistics of directional data. Journal of the Royal Statistical Society: Series B (Methodological), 37(3):349-393, 1975.

M. D. Marzio, A. Panzera, and C. C. Taylor. Kernel density estimation on the torus. Journal of Statistical Planning and Inference, 141(6):2156-2173, 2011.

S. Oba, K. Kato, and S. Ishii. Multi-scale clustering for gene expression profiling data. In Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'05), pages 210-217, Oct 2005.

M. Oliveira, R. M. Crujeiras, and A. Rodríguez-Casal. A plug-in rule for bandwidth selection in circular density estimation. Computational Statistics & Data Analysis, 56(12):3898-3908, 2012.

P. Saavedra-Nieves and R. M. Crujeiras. Nonparametric estimation of directional highest density regions. arXiv preprint arXiv:2009.08915, 2020. URL https://arxiv.org/abs/2009.08915.

B. W. Silverman. Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society: Series B (Methodological), 43(1):97-99, 1981.

S. Sra. A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I_s(x). Computational Statistics, 27(1):177-190, 2012.

R. Subbarao and P. Meer. Nonlinear mean shift over Riemannian manifolds. International Journal of Computer Vision, 84(1):1-20, 2009.

C. C. Taylor. Automatic bandwidth selection for circular density estimation. Computational Statistics & Data Analysis, 52(7):3493-3500, 2008.

C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95-103, 1983.

M.-S. Yang, S.-J. Chang-Chien, and H.-C. Kuo. On mean shift clustering for directional data on a hypersphere. In Artificial Intelligence and Soft Computing, pages 809-818, Cham, 2014. Springer International Publishing.

Y. Zhang and Y.-C. Chen. Kernel smoothing, mean shift, and their learning theory with directional data. arXiv preprint arXiv:2010.13523, 2020. URL https://arxiv.org/abs/2010.13523.

L. Zhao and C. Wu. Central limit theorem for integrated squared error of kernel estimators of spherical density. Science in China Series A: Mathematics, 44(4):474-483, 2001.
Appendix A. Basins of Attraction for Directional Mean Shift Algorithm
We simulate a dataset $\{X_1, \dots, X_n\} \subset \Omega_2$ for the DMS algorithm from a three-component mixture of vMF densities as in Zhang and Chen (2020):
\[
f(x) = w_1 \cdot f_{\mathrm{vMF}}(x; \mu_1, \kappa_1) + w_2 \cdot f_{\mathrm{vMF}}(x; \mu_2, \kappa_2) + w_3 \cdot f_{\mathrm{vMF}}(x; \mu_3, \kappa_3),
\]
with mixing weights $w_1, w_2, w_3$, directional means $\mu_1, \mu_2, \mu_3$ (the third one being $[150°, 0°]$ in spherical [longitude, latitude] coordinates), and concentrations $\kappa_1 = \kappa_2 = 8$ and $\kappa_3 = 5$. There are three local modes of the underlying density $f$. When we apply the directional KDE (1) and the corresponding DMS algorithm to this dataset, the bandwidth parameter $h$ is chosen using the rule-of-thumb selector (Proposition 2 in García-Portugués (2013)), and each DMS sequence is stopped once the successive update satisfies $\|X_i^{(t+1)} - X_i^{(t)}\| < \epsilon$ for a small tolerance $\epsilon$.

Figure 1 displays the basins of attraction (or convergence domains) for DMS algorithms with the truncated convex kernel $L(r) = (1 - r) \cdot \mathbf{1}\{0 \leq r \leq 1\}$ and the von Mises kernel $L(r) = e^{-r}$. Each point on $\Omega_2$, except for a set with Lebesgue measure zero, converges to a local mode of $\widehat{f}_h$ as the corresponding DMS algorithms are stopped.

[Figure 1: Basins of attraction for DMS algorithms with two different kernels on the simulated dataset. Panels: (a) DMS with $L(r) = (1 - r) \cdot \mathbf{1}\{0 \leq r \leq 1\}$; (b) DMS with the von Mises kernel $L(r) = e^{-r}$.]
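For readers who wish to reproduce this kind of experiment, directional samples can be drawn with scipy.stats.vonmises_fisher (available in SciPy 1.11 and later). The mixture weights and the first two mean directions below are illustrative placeholders of our own choosing, not an exact reproduction of the settings above; only the third mean ($[150°, 0°]$) and the concentrations are taken from the text.

```python
import numpy as np
from scipy.stats import vonmises_fisher

def sph2cart(lon_deg, lat_deg):
    """Convert [longitude, latitude] in degrees to a unit vector on Omega_2."""
    lon, lat = np.deg2rad(lon_deg), np.deg2rad(lat_deg)
    return np.array([np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)])

def sample_vmf_mixture(n, weights, mus, kappas, seed=0):
    """Draw n points on Omega_2 from a mixture of vMF densities."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(weights), size=n, p=weights)
    out = np.empty((n, 3))
    for j, (mu, kappa) in enumerate(zip(mus, kappas)):
        idx = np.where(labels == j)[0]
        if idx.size:
            out[idx] = vonmises_fisher(mu, kappa, seed=rng).rvs(idx.size)
    return out

# placeholder means and equal weights (illustrative only); kappas follow the text
mus = [sph2cart(-120, -45), sph2cart(0, 60), sph2cart(150, 0)]
X = sample_vmf_mixture(1000, [1/3, 1/3, 1/3], mus, kappas=[8.0, 8.0, 5.0], seed=0)
```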
Appendix B. EM Algorithm on a Mixture of vMF Distributions

We consider a directional dataset $\mathcal{Y} = \{Y_1, \dots, Y_N\} \subset \Omega_q$ consisting of independently and identically distributed (IID) samples from a mixture of $M$ vMF distributions, whose density is
\[
f(y \mid \Theta) = \sum_{j=1}^M \alpha_j \cdot f_{\mathrm{vMF}}(y \mid \mu_j, \kappa_j) \tag{27}
\]
with the set of parameters
\[
\Theta = \left\{ (\alpha_j, \mu_j, \kappa_j): \sum_{j=1}^M \alpha_j = 1,\ \alpha_j \geq 0,\ \mu_j^T \mu_j = 1,\ \kappa_j \geq 0,\ j = 1, \dots, M \right\}.
\]
Let $\mathcal{Z} = \{Z_1, \dots, Z_N\}$ be the corresponding set of hidden random variables indicating the mixture components from which the data points are sampled. In particular, $Z_i = j$ if $Y_i$ is sampled from $f_{\mathrm{vMF}}(y \mid \mu_j, \kappa_j)$. The EM framework for obtaining the maximum likelihood estimates of the parameters in $\Theta$ given the observed data $\mathcal{Y} = \{Y_1, \dots, Y_N\} \subset \Omega_q$ is as follows (Banerjee et al., 2005):

• E-Step: Assuming that the values in $\mathcal{Z} = \{Z_1, \dots, Z_N\}$ are known, the complete log-likelihood is given by
\[
\log P(\mathcal{Y}, \mathcal{Z} \mid \Theta) = \sum_{i=1}^N \sum_{j=1}^M \left[ \mathbf{1}\{Z_i = j\} \log \alpha_j + \mathbf{1}\{Z_i = j\} \log f_{\mathrm{vMF}}(Y_i \mid \mu_j, \kappa_j) \right], \tag{28}
\]
where $\mathbf{1}_A$ is the indicator function of the set $A$. For the given $(\mathcal{Y}, \Theta^{(t)})$ at the $t$-th iteration, the E-step takes the expectation of the complete log-likelihood (28) with respect to the current posterior distribution of $\mathcal{Z} \mid (\mathcal{Y}, \Theta^{(t)})$ as:
\[
Q(\Theta \mid \Theta^{(t)}) \equiv \mathbb{E}_{\mathcal{Z} \mid (\mathcal{Y}, \Theta^{(t)})}\left[ \log P(\mathcal{Y}, \mathcal{Z} \mid \Theta) \right]
= \sum_{j=1}^M \sum_{i=1}^N (\log \alpha_j) \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) + \sum_{j=1}^M \sum_{i=1}^N \left[ \log f_{\mathrm{vMF}}(Y_i \mid \mu_j, \kappa_j) \right] \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right). \tag{29}
\]
By Bayes' theorem, the posterior distribution $P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)})$ is calculated as
\[
P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) = \frac{P(\mathcal{Y} \mid Z_i = j, \Theta^{(t)}) \cdot P(Z_i = j \mid \Theta^{(t)})}{P(\mathcal{Y} \mid \Theta^{(t)})} \overset{(**)}{=} \frac{P(Y_i \mid Z_i = j, \Theta^{(t)}) \cdot P(Z_i = j \mid \Theta^{(t)})}{P(Y_i \mid \Theta^{(t)})} = P\left( Z_i = j \mid Y_i, \Theta^{(t)} \right) = \frac{\alpha_j^{(t)} \cdot f_{\mathrm{vMF}}\left( Y_i \mid \mu_j^{(t)}, \kappa_j^{(t)} \right)}{\sum_{\ell=1}^M \alpha_\ell^{(t)} \cdot f_{\mathrm{vMF}}\left( Y_i \mid \mu_\ell^{(t)}, \kappa_\ell^{(t)} \right)},
\]
where we utilize the IID assumption on $\mathcal{Y} = \{Y_1, \dots, Y_N\}$ and the fact that $Y_i$ is independent of $Z_k$ when $i \neq k$ to obtain (**).

• M-Step: In the M-step, we solve for $\Theta^{(t+1)}$ that maximizes the function $Q(\Theta \mid \Theta^{(t)})$. Notice that the maximization over $\{\alpha_j\}_{j=1}^M$ and over $\{\mu_j, \kappa_j\}_{j=1}^M$ can be done separately, as the terms containing $\{\alpha_j\}_{j=1}^M$ and $\{\mu_j, \kappa_j\}_{j=1}^M$ are unrelated in (29). To maximize $Q(\Theta \mid \Theta^{(t)})$ with respect to $\{\alpha_j\}_{j=1}^M$ under the constraint $\sum_{j=1}^M \alpha_j = 1$, we introduce a Lagrangian multiplier $\lambda$ and obtain the following Lagrangian function:
\[
\mathcal{L}(\alpha_1, \dots, \alpha_M, \lambda) = \sum_{j=1}^M \sum_{i=1}^N (\log \alpha_j) \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) + \lambda \left( \sum_{j=1}^M \alpha_j - 1 \right).
\]
Taking the partial derivatives of $\mathcal{L}$ with respect to each $\alpha_j$ and $\lambda$ and setting them to zero yield that (with some rearrangements)
\[
\sum_{i=1}^N P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) + \alpha_j \lambda = 0 \quad \text{and} \quad \sum_{j=1}^M \alpha_j - 1 = 0.
\]
Summing both sides of the first equation over $j = 1, \dots, M$, we find that $\lambda_{\mathrm{opt}} = -N$ and hence
\[
\alpha_j^{(t+1)} = \frac{1}{N} \sum_{i=1}^N P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) \tag{30}
\]
for $j = 1, \dots, M$.
Now, we maximize $Q(\Theta \mid \Theta^{(t)})$ with respect to $\{\mu_j, \kappa_j\}_{j=1}^M$ under the constraints $\mu_j^T \mu_j = 1$ and $\kappa_j \geq 0$ for $j = 1, \dots, M$.[1] Let $\lambda_j$ be the Lagrangian multiplier associated with each equality constraint $\mu_j^T \mu_j = 1$. The Lagrangian function is given by
\[
\mathcal{L}\left( \{\mu_j, \kappa_j, \lambda_j\}_{j=1}^M \right) = \sum_{j=1}^M \sum_{i=1}^N \left[ \log f_{\mathrm{vMF}}(Y_i \mid \mu_j, \kappa_j) \right] \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) + \sum_{j=1}^M \lambda_j (1 - \mu_j^T \mu_j)
= \sum_{j=1}^M \left\{ \sum_{i=1}^N \left[ \log C_q(\kappa_j) + \kappa_j \mu_j^T Y_i \right] \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) + \lambda_j (1 - \mu_j^T \mu_j) \right\}. \tag{31}
\]
Taking partial derivatives of (31) with respect to $\{\mu_j, \kappa_j, \lambda_j\}_{j=1}^M$ and setting them to zero, we have that
\[
\kappa_j \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) - 2\lambda_j \mu_j = 0, \tag{32}
\]
\[
\frac{C_q'(\kappa_j)}{C_q(\kappa_j)} \sum_{i=1}^N P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) + \mu_j^T \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) = 0, \tag{33}
\]
\[
1 - \mu_j^T \mu_j = 0, \tag{34}
\]
for $j = 1, \dots, M$. Combining (32) and (34) shows that, for $j = 1, \dots, M$,
\[
\widehat{\lambda}_j = \frac{\kappa_j}{2} \left\| \sum_{i=1}^N Y_i \cdot P\left( Z_i = j \mid \mathcal{Y}, \Theta^{(t)} \right) \right\| \quad \text{and} \quad \mu_j^{(t+1)} = \frac{\sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)})}{\left\| \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) \right\|}. \tag{35}
\]
Substituting (35) into (33) gives us that
\[
-\frac{C_q'\left( \kappa_j^{(t+1)} \right)}{C_q\left( \kappa_j^{(t+1)} \right)} = \frac{I_{\frac{q+1}{2}}\left( \kappa_j^{(t+1)} \right)}{I_{\frac{q-1}{2}}\left( \kappa_j^{(t+1)} \right)} = \frac{\left\| \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) \right\|}{\sum_{i=1}^N P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)})}. \tag{36}
\]
The first equality in (36) follows from the calculations below (recall (3)):
\[
C_q'(\kappa) = \frac{\frac{q-1}{2} \kappa^{\frac{q-3}{2}} (2\pi)^{\frac{q+1}{2}} I_{\frac{q-1}{2}}(\kappa) - (2\pi)^{\frac{q+1}{2}} I_{\frac{q-1}{2}}'(\kappa)\, \kappa^{\frac{q-1}{2}}}{(2\pi)^{q+1} \left[ I_{\frac{q-1}{2}}(\kappa) \right]^2} = \frac{\kappa^{\frac{q-1}{2}}}{(2\pi)^{\frac{q+1}{2}} I_{\frac{q-1}{2}}(\kappa)} \left[ \frac{q-1}{2\kappa} - \frac{I_{\frac{q-1}{2}}'(\kappa)}{I_{\frac{q-1}{2}}(\kappa)} \right]
\]
and thus,
\[
\frac{C_q'(\kappa)}{C_q(\kappa)} = \frac{q-1}{2\kappa} - \frac{I_{\frac{q-1}{2}}'(\kappa)}{I_{\frac{q-1}{2}}(\kappa)} = \frac{q-1}{2\kappa} - \frac{\frac{q-1}{2\kappa} I_{\frac{q-1}{2}}(\kappa) + I_{\frac{q+1}{2}}(\kappa)}{I_{\frac{q-1}{2}}(\kappa)} = -\frac{I_{\frac{q+1}{2}}(\kappa)}{I_{\frac{q-1}{2}}(\kappa)},
\]
where we use the well-known recurrence relation $\kappa I_s'(\kappa) = s I_s(\kappa) + \kappa I_{s+1}(\kappa)$; see, for instance, Section 9.6.26 in Abramowitz and Stegun (1974).

The updating formula (36) involves the modified Bessel function of the first kind and has no closed-form solution for $\kappa_j^{(t+1)}$. However, Banerjee et al. (2005); Sra (2012) propose an approximation formula for $\kappa$ in terms of $A_q(\kappa) \equiv \frac{I_{(q+1)/2}(\kappa)}{I_{(q-1)/2}(\kappa)}$ as
\[
\widehat{\kappa} \approx \frac{(q+1) A_q(\kappa) - A_q(\kappa)^3}{1 - A_q(\kappa)^2}. \tag{37}
\]
Therefore, we may update $\kappa_j^{(t+1)}$ in the M-step by
\[
\kappa_j^{(t+1)} = \frac{(q+1) A_q(\kappa^{(t+1)}) - A_q(\kappa^{(t+1)})^3}{1 - A_q(\kappa^{(t+1)})^2} \tag{38}
\]
with $A_q(\kappa^{(t+1)}) = \frac{\left\| \sum_{i=1}^N Y_i \cdot P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)}) \right\|}{\sum_{i=1}^N P(Z_i = j \mid \mathcal{Y}, \Theta^{(t)})}$.

The above EM algorithm involves only soft assignments of the hidden variables $\mathcal{Z}$. The analysis of hard assignments for the hidden variables and the subsequent discussion of their connections with spherical $k$-means clustering (Dhillon and Modha, 2001) can be found in Banerjee et al. (2005).

[1] As mentioned in Section A.1 of Banerjee et al. (2005), we should have introduced a Lagrangian multiplier for each inequality constraint $\kappa_j \geq 0$ and worked with the necessary KKT conditions. However, if $\kappa_j = 0$, then $f_{\mathrm{vMF}}(y \mid \mu_j, \kappa_j)$ is the uniform density on $\Omega_q$.
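A compact implementation of these E- and M-steps (soft responsibilities, the mean-direction update (35), and the concentration approximation (37)) might look like the following. This is an illustrative sketch of our own, not the authors' code; it evaluates $C_q(\kappa)$ in (3) via the exponentially scaled Bessel function from scipy.special for numerical stability.

```python
import numpy as np
from scipy.special import ive   # ive(v, x) = I_v(x) * exp(-x)

def log_Cq(kappa, q):
    """log C_q(kappa) in (3), using the scaled Bessel function for stability."""
    return ((q - 1) / 2) * np.log(kappa) - ((q + 1) / 2) * np.log(2 * np.pi) \
           - (np.log(ive((q - 1) / 2, kappa)) + kappa)

def em_vmf_mixture(Y, M, n_iter=100, seed=0):
    """EM for a mixture of M vMF distributions on Omega_q; Y is an N x (q+1) array."""
    N, d = Y.shape
    q = d - 1
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(N, M, replace=False)]          # initialize means at random data points
    kappa = np.ones(M)
    alpha = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibilities P(Z_i = j | Y, Theta^(t)) computed from log-densities
        logp = np.log(alpha) + log_Cq(kappa, q) + (Y @ mu.T) * kappa     # N x M
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp); R /= R.sum(axis=1, keepdims=True)
        # M-step: mixing weights (30), mean directions (35), and kappa via (37)/(38)
        alpha = R.mean(axis=0)
        S = Y.T @ R                                    # (q+1) x M weighted sums
        r_bar = np.linalg.norm(S, axis=0) / R.sum(axis=0)
        mu = (S / np.linalg.norm(S, axis=0)).T
        kappa = r_bar * (q + 1 - r_bar**2) / (1.0 - r_bar**2)
    return alpha, mu, kappa
```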
Appendix C. Naive Method: Introducing a Displacement Parameter

We show in this section that introducing a displacement parameter to define an artificial maximum likelihood problem as in Carreira-Perpiñán (2007) is (probably) not a promising direction under our directional data scenario. For simplicity, we only consider the von Mises kernel case. Given a directional dataset $\{X_1, \dots, X_n\} \subset \Omega_q$, we may define the mixture model by subtracting a displacement parameter $\mu \in \Omega_q$ from each mean $\{X_1, \dots, X_n\}$ of the mixture components as:
\[
f_{\mathrm{MOvMF}}(y) = \sum_{i=1}^n \alpha_i \cdot C_q(\kappa_i, X_i - \mu) \cdot \exp\left[ \kappa_i (X_i - \mu)^T y \right]. \tag{39}
\]
The key difference between the mixture density here and the one in (8) is that the normalizing constant in each component of (39) depends not only on the (fixed) concentration parameter $\kappa_i$ but also on our newly introduced displacement parameter $\mu$. The reason is that a unit vector subtracted by another unit vector is no longer a unit vector on $\Omega_q$. Hence,
\[
C_q(\kappa, X - \mu)^{-1} = \int_{\Omega_q} \exp\left[ \kappa (X - \mu)^T y \right] \omega_q(dy) = \omega_{q-1}(\Omega_{q-1}) \int_{-1}^{1} \exp\left( \kappa t \|X - \mu\| \right) \cdot (1 - t^2)^{\frac{q}{2}-1}\, dt = \frac{(2\pi)^{\frac{q+1}{2}}\, I_{\frac{q-1}{2}}(\kappa \|X - \mu\|)}{(\kappa \|X - \mu\|)^{\frac{q-1}{2}}}.
\]
This extra dependence becomes a main obstacle to connecting the corresponding EM algorithm with the directional mean shift algorithm, because one can easily verify that the derivative of the resulting Q-function with respect to $\mu$ would involve the derivative of modified Bessel functions of the first kind. There is no closed-form expression for such a derivative in general.

Appendix D. Proofs of Theorems and Propositions
D.1 Proof of Theorem 1
Theorem 1. Under condition (C2), the Q-function (14) satisfies $Q(\mu^{(t+1)} \mid \mu^{(t)}) \geq Q(\mu^{(t)} \mid \mu^{(t)})$, where
\[
\mu^{(t+1)} = \frac{-\sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\left\| \sum_{i=1}^n X_i\, L'\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\|}
\]
is from the iteration (18).

Proof. As we take $\kappa_i = \frac{1}{h^2}$ and $\alpha_i = \frac{1}{n}$ for all $i = 1, \dots, n$ in equations (14) and (13), the Q-function reduces to
\[
Q(\mu \mid \mu^{(t)}) = \sum_{i=1}^n \left[ \log \frac{C_{1/h^2, q+1, L}}{n} + \log L\left( \frac{1 - X_i^T \mu}{h^2} \right) \right] \cdot P(Z = i \mid Y, \mu^{(t)}), \tag{40}
\]
and the posterior of $Z \mid (Y, \mu^{(t)})$ becomes
\[
P(Z = i \mid Y, \mu^{(t)}) = \frac{L\left( \frac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\sum_{\ell=1}^n L\left( \frac{1 - X_\ell^T \mu^{(t)}}{h^2} \right)}. \tag{41}
\]
The differentiability and convexity of $L$ under (C2) indicate that
\[
L(y) - L(x) \geq L'(x) \cdot (y - x), \tag{42}
\]
where $L'(x)$ is replaced by any subgradient at non-differentiable points. As $0 < L(0) < \infty$, we have that
\[
\begin{aligned}
& Q(\mu^{(t+1)} \mid \mu^{(t)}) - Q(\mu^{(t)} \mid \mu^{(t)}) \\
&= \sum_{i=1}^n \left[ \log L\left( \tfrac{1 - X_i^T \mu^{(t+1)}}{h^2} \right) - \log L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right] \cdot \frac{L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\sum_{\ell=1}^n L\left( \tfrac{1 - X_\ell^T \mu^{(t)}}{h^2} \right)} \\
&\overset{\text{(i)}}{\geq} \sum_{i=1}^n \frac{1}{L\left( \tfrac{1 - X_i^T \mu^{(t+1)}}{h^2} \right)} \left[ L\left( \tfrac{1 - X_i^T \mu^{(t+1)}}{h^2} \right) - L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right] \cdot \frac{L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\sum_{\ell=1}^n L\left( \tfrac{1 - X_\ell^T \mu^{(t)}}{h^2} \right)} \\
&\overset{\text{(ii)}}{\geq} \sum_{i=1}^n \frac{L'\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{L\left( \tfrac{1 - X_i^T \mu^{(t+1)}}{h^2} \right)} \cdot \left[ \frac{X_i^T \left( \mu^{(t)} - \mu^{(t+1)} \right)}{h^2} \right] \cdot \frac{L\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right)}{\sum_{\ell=1}^n L\left( \tfrac{1 - X_\ell^T \mu^{(t)}}{h^2} \right)} \\
&\overset{\text{(iii)}}{\geq} \frac{L\left( 2/h^2 \right)}{n h^2 L(0)} \sum_{i=1}^n L'\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) X_i^T \left( \mu^{(t)} - \mu^{(t+1)} \right) \\
&\overset{\text{(iv)}}{=} \frac{L\left( 2/h^2 \right)}{n h^2 L(0)} \left\| \sum_{i=1}^n X_i\, L'\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\| \cdot \left( \mu^{(t+1)} \right)^T \left( \mu^{(t+1)} - \mu^{(t)} \right) \\
&\overset{\text{(v)}}{=} \frac{L\left( 2/h^2 \right)}{2 n h^2 L(0)} \left\| \sum_{i=1}^n X_i\, L'\left( \tfrac{1 - X_i^T \mu^{(t)}}{h^2} \right) \right\| \cdot \left\| \mu^{(t+1)} - \mu^{(t)} \right\|^2 \geq 0,
\end{aligned}
\]
where we use the fact that $\log y - \log x \geq \frac{y - x}{y}$ by the concavity of $\log(\cdot)$ in (i), leverage the inequality (42) in (ii), apply the monotonicity of $L$ in (iii), plug in (18) in (iv), and notice that $\left( \mu^{(t+1)} \right)^T \left( \mu^{(t+1)} - \mu^{(t)} \right) = \frac{1}{2} \left\| \mu^{(t+1)} - \mu^{(t)} \right\|^2$ on $\Omega_q$ in (v). The result follows. □

D.2 Proof of Proposition 2
The observed log-likelihood log P ( Y | µ ) is given by log P ( Y | µ ) = log (cid:34) n (cid:88) (cid:96) =1 α (cid:96) · C κ (cid:96) ,q +1 ,L · L (cid:2) κ (cid:96) (cid:0) − X T(cid:96) µ (cid:1)(cid:3)(cid:35) . (43) In particular, when α (cid:96) = n and κ (cid:96) = h for all (cid:96) = 1 , ..., n (i.e., simplifying to the DMScase), it implies that log P ( Y | µ ) = log (cid:98) f h ( µ ) + log C /h ,q +1 ,L c h,q,L , (44) where c h,q,L is defined in (2) and C κ i ,q +1 ,L is elucidated in (9) . Proof
Notice that the observed log-likelihood in the EM framework of Section 3.1, after we take $Y = (0, \ldots, 0, 1)^T \in \Omega_{q+1}$, is given by
\begin{align}
\log P(Y \mid \mu) &= \log P(Y, Z \mid \mu) - \log P(Z \mid Y, \mu) \nonumber\\
&= \sum_{i=1}^n \mathbf{1}_{\{Z=i\}} \cdot \left\{\log \alpha_i + \log C_{\kappa_i,q+1,L} + \log L\!\left[\kappa_i\left(1 - X_i^T \mu\right)\right]\right\} - \sum_{i=1}^n \mathbf{1}_{\{Z=i\}} \cdot \log P(Z = i \mid Y, \mu). \tag{45}
\end{align}
Taking the expectation over $Z \mid \left(Y, \mu^{(t)}\right)$ on both sides of (45) yields that
\[
\log P(Y \mid \mu) = \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \left\{\log \alpha_i + \log C_{\kappa_i,q+1,L} + \log L\!\left[\kappa_i\left(1 - X_i^T \mu\right)\right]\right\} - \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log P(Z = i \mid Y, \mu). \tag{46}
\]
After plugging (13) into (46), we obtain that
\begin{align*}
\log P(Y \mid \mu) &= \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \left[\log \alpha_i + \log C_{\kappa_i,q+1,L} + \log L\!\left[\kappa_i\left(1 - X_i^T \mu\right)\right] - \log \frac{\alpha_i \cdot C_{\kappa_i,q+1,L} \cdot L\!\left[\kappa_i\left(1 - X_i^T \mu\right)\right]}{\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell,q+1,L} \cdot L\!\left[\kappa_\ell\left(1 - X_\ell^T \mu\right)\right]}\right] \\
&= \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log\left[\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell,q+1,L} \cdot L\!\left[\kappa_\ell\left(1 - X_\ell^T \mu\right)\right]\right] \\
&= \log\left[\sum_{\ell=1}^n \alpha_\ell \cdot C_{\kappa_\ell,q+1,L} \cdot L\!\left[\kappa_\ell\left(1 - X_\ell^T \mu\right)\right]\right].
\end{align*}
When $\alpha_\ell = 1/n$ and $\kappa_\ell = 1/h^2$ for all $\ell = 1, \ldots, n$ (i.e., simplifying to the DMS form), the above equation indicates that
\[
\log P(Y \mid \mu) = \log\left[\frac{c_{h,q,L}}{n} \sum_{i=1}^n L\!\left(\frac{1 - X_i^T \mu}{h^2}\right)\right] + \log \frac{C_{1/h^2,q+1,L}}{c_{h,q,L}} = \log \widehat{f}_h(\mu) + \log \frac{C_{1/h^2,q+1,L}}{c_{h,q,L}}, \tag{47}
\]
where $c_{h,q,L}$ is defined in (2) and $C_{\kappa_i,q+1,L}$ is elucidated in (9). The results follow.
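The algebra above is the standard EM decomposition of the observed log-likelihood. The short sketch below checks the identity numerically under purely illustrative assumptions (arbitrary weights $\alpha_i$, concentrations $\kappa_i$, and stand-in constants $C_i$): the expected complete log-likelihood minus the expected log-posterior collapses to the log of the mixture sum in (43) for any pair of parameters $\mu$ and $\mu^{(t)}$.

```python
# Numerical check (illustrative setup) of the decomposition used in (45)-(47): expected
# complete log-likelihood minus expected log-posterior equals the log mixture sum (43).
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # directional observations on the sphere
alpha = rng.dirichlet(np.ones(n))                   # mixture weights alpha_i (sum to one)
kappa = rng.uniform(1.0, 5.0, size=n)               # concentrations kappa_i
C = rng.uniform(0.5, 2.0, size=n)                   # stand-ins for the constants C_{kappa_i,q+1,L}

def mix_terms(mu):                                  # alpha_i * C_i * L[kappa_i (1 - X_i^T mu)], L(r) = exp(-r)
    return alpha * C * np.exp(-kappa * (1.0 - X @ mu))

mu_t, mu = X[0], X[1]                               # current and arbitrary new parameters
post_t = mix_terms(mu_t) / mix_terms(mu_t).sum()    # P(Z = i | Y, mu^(t)), equation (13)
post = mix_terms(mu) / mix_terms(mu).sum()          # P(Z = i | Y, mu)

expected_complete = np.sum(post_t * np.log(mix_terms(mu)))   # first sum in (46)
expected_log_post = np.sum(post_t * np.log(post))            # second sum in (46)
assert np.isclose(expected_complete - expected_log_post,
                  np.log(mix_terms(mu).sum()))               # the mixture sum in (43)
```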
D.3 Proof of Theorem 3

Theorem 3. Let $\left\{\mu^{(t)}\right\}_{t=0}^{\infty}$ be the path of successive points defined by the DMS algorithm (6). Under condition (C2), the sequence $\left\{\widehat{f}_h\!\left(\mu^{(t)}\right)\right\}_{t=0}^{\infty}$ is non-decreasing and thus converges.

Proof
The difference of the observed log-likelihoods (46) between two consecutive steps of our DMS (or generalized EM) iteration is calculated as
\begin{align}
&\log P\!\left(Y \mid \mu^{(t+1)}\right) - \log P\!\left(Y \mid \mu^{(t)}\right) \nonumber\\
&\overset{\text{(i)}}{=} Q\!\left(\mu^{(t+1)} \mid \mu^{(t)}\right) - \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log P\!\left(Z = i \mid Y, \mu^{(t+1)}\right) \nonumber\\
&\qquad - Q\!\left(\mu^{(t)} \mid \mu^{(t)}\right) + \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log P\!\left(Z = i \mid Y, \mu^{(t)}\right) \nonumber\\
&= Q\!\left(\mu^{(t+1)} \mid \mu^{(t)}\right) - Q\!\left(\mu^{(t)} \mid \mu^{(t)}\right) - \sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t)}\right) \cdot \log \frac{P\!\left(Z = i \mid Y, \mu^{(t+1)}\right)}{P\!\left(Z = i \mid Y, \mu^{(t)}\right)} \nonumber\\
&\overset{\text{(ii)}}{\geq} Q\!\left(\mu^{(t+1)} \mid \mu^{(t)}\right) - Q\!\left(\mu^{(t)} \mid \mu^{(t)}\right) - \underbrace{\log\left[\sum_{i=1}^n P\!\left(Z = i \mid Y, \mu^{(t+1)}\right)\right]}_{=0} \nonumber\\
&\overset{\text{(iii)}}{\geq} 0, \tag{48}
\end{align}
where we recall (14) in (i), apply Jensen's inequality in (ii), and utilize Theorem 1 in (iii). The inequality (ii) is strict except when $\mu^{(t+1)} = \mu^{(t)}$. Therefore, combining (47) with (48) shows that
\[
\log \widehat{f}_h\!\left(\mu^{(t+1)}\right) - \log \widehat{f}_h\!\left(\mu^{(t)}\right) = \log P\!\left(Y \mid \mu^{(t+1)}\right) - \log P\!\left(Y \mid \mu^{(t)}\right) \geq 0.
\]
As $\widehat{f}_h$ is differentiable on the compact set $\Omega_q$ under (C2), the sequence $\left\{\widehat{f}_h\!\left(\mu^{(t)}\right)\right\}_{t=0}^{\infty}$ is thus bounded from above and converges.
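As an illustration of Theorem 3 (not a statement from the paper), the following sketch runs the DMS iteration with the von Mises profile on synthetic circular data and verifies that the kernel density estimates along the path never decrease. The data-generating mechanism and bandwidth are arbitrary choices, and normalizing constants are dropped since they do not affect monotonicity.

```python
# Illustration (assumed setup) of Theorem 3: KDE values along a DMS path never decrease.
import numpy as np

rng = np.random.default_rng(2)
n, h = 300, 0.4
theta = rng.vonmises(0.0, 4.0, size=n)              # angles clustered around 0
X = np.column_stack([np.cos(theta), np.sin(theta)]) # lift to the unit circle (q = 1)

def f_hat(mu):                                      # directional KDE up to its normalizing constant
    return np.mean(np.exp(-(1.0 - X @ mu) / h**2))

mu = np.array([0.0, 1.0])                           # a deliberately poor starting point
values = [f_hat(mu)]
for _ in range(50):
    g = X.T @ np.exp(-(1.0 - X @ mu) / h**2)        # -sum_i X_i L'(...) for L(r) = exp(-r)
    mu = g / np.linalg.norm(g)
    values.append(f_hat(mu))

assert all(a <= b + 1e-12 for a, b in zip(values, values[1:]))   # ascending property
```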
D.4 Conditions on the Convergence of Generalized EM Sequences and Proof of Theorem 4

For the convergence of (generalized) EM sequences to stationary points (in most cases, local maxima), Wu (1983) made the following assumptions on the parameter space $\Psi$ and the observed log-likelihood function $\ell(\phi)$:
• (A1) $\Psi$ is a subset of the ambient Euclidean space,
• (A2) $\Psi_{\phi^{(0)}} = \left\{\phi \in \Psi : \ell(\phi) \geq \ell\!\left(\phi^{(0)}\right)\right\}$ is compact for any $\ell\!\left(\phi^{(0)}\right) > -\infty$, where $\phi^{(0)}$ is the initial parameter for the (generalized) EM algorithm,
• (A3) $\ell$ is continuous in $\Psi$ and differentiable in the interior of $\Psi$.
Let $\phi \to M(\phi)$ be the point-to-set map defined by a generalized EM iteration, so that the Q-function satisfies $Q(\phi' \mid \phi) \geq Q(\phi \mid \phi)$ for any $\phi' \in M(\phi)$.
2. The Q-function is defined as the expectation of the complete log-likelihood over the posterior distribution of the hidden variables given the values of the observed variables and the current parameters.

Denote by $S$ the set of stationary points of the observed log-likelihood $\ell$. As a consequence of the above assumptions, all the limit points of a generalized EM sequence are stationary points of the observed log-likelihood.

Lemma 6 (Theorem 1 of Wu (1983)). Let $\{\phi_p\}$ be a generalized EM sequence generated by $\phi_{p+1} \in M(\phi_p)$, and suppose that (i) $M$ is a closed point-to-set map over the complement of $S$; (ii) $\ell(\phi_{p+1}) > \ell(\phi_p)$ for all $\phi_p \notin S$. Then all the limit points of $\{\phi_p\}$ are stationary points of $\ell$, and $\ell(\phi_p)$ converges monotonically to $\ell^* = \ell(\phi^*)$ for some $\phi^* \in S$.

The generalized EM iteration is usually terminated when the (Euclidean) distance between two consecutive iteration points falls below some small threshold value. However, this property does not imply the convergence of the generalized EM sequence to a single point in $S$; see some related discussions in Wu (1983) and a counterexample in Li et al. (2007). To guarantee the convergence of a generalized EM sequence to a single point in $S$ (in most cases, a local maximum), we require the discreteness of the set of local modes:
• (A4) The set of local modes of the observed log-likelihood $\ell(\phi)$ is discrete, and the modes are isolated from other critical points.

D.4.1 Proof of Theorem 4
Theorem 4.
Let $\left\{\mu^{(t)}\right\}_{t=0}^{\infty}$ be the path of successive points defined by the DMS algorithm (6). Under conditions (C1) and (C2), the sequence $\left\{\mu^{(t)}\right\}_{t=0}^{\infty}$ converges to a local mode of $\widehat{f}_h$ from almost all starting points $\mu^{(0)} \in \Omega_q$.

Proof
Now, we verify in detail that the above assumptions (A1)-(A4) for the convergence of generalized EM sequences are satisfied under our conditions (C1) and (C2) on the directional KDE $\widehat{f}_h$ and the kernel function $L$. Notice that the estimated parameter $\mu$ lies on $\Omega_q$, a compact manifold in the ambient Euclidean space $\mathbb{R}^{q+1}$. By the equivalence of the observed log-likelihood and the logarithm of the directional KDE (cf. equation (23)), the constrained parameter space $\Psi = \left\{\mu \in \Omega_q : \log \widehat{f}_h(\mu) \geq \log \widehat{f}_h\!\left(\mu^{(0)}\right)\right\}$ is also compact for any $\log \widehat{f}_h\!\left(\mu^{(0)}\right) > -\infty$. Moreover, under condition (C2), $\widehat{f}_h$ is differentiable on $\Omega_q$. Thus, assumptions (A1)-(A3) are naturally justified in our DMS context. Finally, it is obvious that our condition (C1) implies assumption (A4), and as long as we do not initialize the DMS algorithm within the set of local minima and saddle points, the algorithm converges to a local mode.
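A simple way to visualize Theorem 4 is to start the DMS iteration from many random points and observe that the paths collapse onto a handful of limiting directions. The sketch below does this on an assumed two-component von Mises mixture on the circle, using the usual distance-based stopping rule; the mixture, bandwidth, and tolerance are all illustrative choices, not values from the paper.

```python
# Illustration (assumed setup) of Theorem 4: DMS paths from many random starting points
# collapse onto a small set of limiting directions (the local modes of the KDE).
import numpy as np

rng = np.random.default_rng(4)
n, h = 400, 0.3
theta = np.concatenate([rng.vonmises(0.0, 8.0, n // 2), rng.vonmises(np.pi, 8.0, n // 2)])
X = np.column_stack([np.cos(theta), np.sin(theta)])  # two antipodal clusters on the circle

def dms(mu, tol=1e-8, max_iter=500):
    for _ in range(max_iter):
        g = X.T @ np.exp(-(1.0 - X @ mu) / h**2)     # von Mises profile L(r) = exp(-r)
        mu_new = g / np.linalg.norm(g)
        if np.linalg.norm(mu_new - mu) < tol:        # usual distance-based stopping rule
            return mu_new
        mu = mu_new
    return mu

starts = rng.normal(size=(100, 2))
starts /= np.linalg.norm(starts, axis=1, keepdims=True)
limits = np.array([dms(mu0) for mu0 in starts])
angles = np.mod(np.arctan2(limits[:, 1], limits[:, 0]), 2 * np.pi)
print(np.unique(np.round(angles, 3)))                # typically two distinct limiting directions
```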
D.5 Proof of Proposition 5

Proposition 5. For any $x \in \Omega_q \subset \mathbb{R}^{q+1}$,

(a) the Jacobian $\nabla F(x)$ is a symmetric matrix in $\mathbb{R}^{(q+1)\times(q+1)}$ and has the explicit form
\[
\nabla F(x) = \frac{\nabla\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|} - \frac{\nabla \widehat{f}_h(x)\, \nabla \widehat{f}_h(x)^T\, \nabla\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|^3}. \tag{49}
\]
In particular, $\nabla F(m) = \left(I_{q+1} - mm^T\right) \frac{\nabla\nabla \widehat{f}_h(m)}{\left\|\nabla \widehat{f}_h(m)\right\|}$ at any local mode $m$ of $\widehat{f}_h$.

(b) we have that $\max_{i=1,\ldots,q+1} |\Lambda_i| \to 0$ as $h \to 0$ and $nh^q \to \infty$, where $\Lambda_1, \ldots, \Lambda_{q+1}$ are the eigenvalues of $\nabla F(m)$.

Proof
(a) We leverage the form $F(x) = \frac{\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|}$ in (7) to directly compute the explicit form of the Jacobian $\nabla F(x)$ as
\[
\nabla F(x) = \frac{\nabla\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|} - \frac{\nabla \widehat{f}_h(x)\, \nabla \widehat{f}_h(x)^T\, \nabla\nabla \widehat{f}_h(x)}{\left\|\nabla \widehat{f}_h(x)\right\|^3},
\]
which is a symmetric matrix. At any local mode $m$, we have that $m = \frac{\nabla \widehat{f}_h(m)}{\left\|\nabla \widehat{f}_h(m)\right\|}$, and the Jacobian $\nabla F(m)$ simplifies to
\[
\nabla F(m) = \left(I_{q+1} - mm^T\right) \frac{\nabla\nabla \widehat{f}_h(m)}{\left\|\nabla \widehat{f}_h(m)\right\|},
\]
where $I_{q+1} \in \mathbb{R}^{(q+1)\times(q+1)}$ is the identity matrix.

(b) By the definition of local modes of $\widehat{f}_h$ on $\Omega_q$, the total gradient $\nabla \widehat{f}_h(m)$ has zero projection onto the tangent space of $\Omega_q$ at any local mode $m$. Thus, $\nabla \widehat{f}_h(m)$ is completely determined by its radial component along $m$. By Lemma 10 in Zhang and Chen (2020), it is obvious that $\left\|\nabla \widehat{f}_h(m)\right\| \to \infty$ as $h \to 0$ and $nh^q \to \infty$. Finally, each eigenvalue of $\nabla F(m)$ has its absolute value tending to 0 as $\left\|\nabla \widehat{f}_h(m)\right\| \to \infty$. The result follows.
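To sanity-check the closed form (49), the following sketch compares the analytic Jacobian of $F(x) = \nabla \widehat{f}_h(x)/\|\nabla \widehat{f}_h(x)\|$ with a central finite-difference Jacobian computed in the ambient coordinates. The von Mises profile, synthetic circular data, and bandwidth are assumptions made for this illustration, and the normalizing constants are omitted because they cancel inside $F$ and hence inside $\nabla F$.

```python
# Sanity check (assumed setup) of the Jacobian formula (49): compare the closed form with a
# central finite-difference Jacobian of F(x) = grad(x) / ||grad(x)|| in ambient coordinates.
import numpy as np

rng = np.random.default_rng(3)
n, h = 100, 0.5
theta = rng.vonmises(0.0, 3.0, size=n)
X = np.column_stack([np.cos(theta), np.sin(theta)])

def grad(x):                                         # total gradient of the unnormalized KDE
    w = np.exp(-(1.0 - X @ x) / h**2)
    return (X.T @ w) / h**2

def hess(x):                                         # total Hessian of the unnormalized KDE
    w = np.exp(-(1.0 - X @ x) / h**2)
    return (X.T * w) @ X / h**4

def F(x):
    g = grad(x)
    return g / np.linalg.norm(g)

x = np.array([np.cos(0.3), np.sin(0.3)])             # an arbitrary evaluation point on the circle
g, H = grad(x), hess(x)
norm_g = np.linalg.norm(g)
analytic = H / norm_g - np.outer(g, g) @ H / norm_g**3          # formula (49)

eps = 1e-6                                           # central differences column by column
numeric = np.column_stack([(F(x + eps * e) - F(x - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(analytic, numeric, atol=1e-5)
```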