An Adaptive EM Accelerator for Unsupervised Learning of Gaussian Mixture Models
Truong Nguyen, Guangye Chen, and Luis Chacón

• Truong Nguyen is with the Applied Mathematics and Plasma Physics Group, Theoretical Division, Los Alamos National Laboratory, NM 87545, USA, e-mail: [email protected].
• Guangye Chen is with the Applied Mathematics and Plasma Physics Group, Theoretical Division, Los Alamos National Laboratory, NM 87545, USA, e-mail: [email protected].
• Luis Chacón is with the Applied Mathematics and Plasma Physics Group, Theoretical Division, Los Alamos National Laboratory, NM 87545, USA, e-mail: [email protected].

Manuscript submitted September 25, 2020.
Abstract
We propose an Anderson Acceleration (AA) scheme for the adaptive Expectation-Maximization (EM) algorithm for unsupervised learning of a finite mixture model from multivariate data (Figueiredo and Jain 2002). The proposed algorithm is able to determine the optimal number of mixture components autonomously, and converges to the optimal solution much faster than its non-accelerated version. The success of the AA-based algorithm stems from several developments rather than a single breakthrough (and without these, our tests demonstrate that AA fails catastrophically). To begin, we ensure the monotonicity of the likelihood function (a key feature of the standard EM algorithm) with a recently proposed monotonicity-control algorithm (Henderson and Varadhan 2019), enhanced by a novel monotonicity test with little overhead. We propose nimble strategies for AA to preserve the positive definiteness of the Gaussian weights and covariance matrices strictly, and to conserve up to the second moments of the observed data set exactly. Finally, we employ a K-means clustering algorithm using the gap statistic to avoid excessively overestimating the initial number of components, thereby maximizing performance. We demonstrate the accuracy and efficiency of the algorithm with several synthetic data sets that are mixtures of Gaussian distributions with a known number of components, as well as data sets generated from particle-in-cell simulations. Our numerical results demonstrate speed-ups with respect to non-accelerated EM of up to ∼60× when the exact number of mixture components is known, and between a few and more than an order of magnitude with component adaptivity.

Index Terms: unsupervised machine learning, Gaussian mixture model, maximum likelihood estimation, adaptive Expectation-Maximization, Anderson acceleration, monotonicity control, K-means, the gap statistic.

1 INTRODUCTION

The Gaussian mixture model (GMM) is a probabilistic model that assumes all the observed data points are generated from a mixture of a finite number of Gaussian (normal) distributions [1]–[4]. It has wide applications in pattern recognition and unsupervised machine learning [5], [6], big data analytics [7]–[10], and image segmentation and denoising [11]–[15], as well as recent applications in applied and computational physics, e.g., gas kinetic [16] and plasma kinetic algorithms [17], [18]. Some other applications of GMM can be found in [19]–[22]. Of interest here is a parametric probability density function family of the form f(x; θ) = \sum_{k=1}^{K} ω_k G_k(x; θ_k), where G_k is a Gaussian distribution parameterized by θ_k, ω_k is a positive weight under the constraint \sum_{k=1}^{K} ω_k = 1, and K is the total number of Gaussian components. A common iterative approach to estimate the parameters of the Gaussian distributions in GMM is the Expectation-Maximization (EM) algorithm, which is based on the maximum likelihood principle [2], [23]–[25]. EM is well known for its robustness, as it is guaranteed to converge monotonically to a local maximum. The standard EM algorithm for GMM (EM-GMM) is conceptually simple and easy to implement. However, its performance is highly dependent on the initial guess of the Gaussian parameters and the separation of the Gaussian components (convergence can be very slow when Gaussian components are not well separated).
A further difficulty is that the maximum likelihood principle alone cannot determine the number of components [1], [26]. The number of Gaussian components is usually unknown in practice, which makes the proper choice of the number of Gaussian components crucial for the optimal performance of EM-GMM. Choosing too many components would result in overfitting the data set, and potentially worsening the convergence rate of EM-GMM. Alternatively, choosing too few components could result in under-fitting, leading to model predictions that may miss important structures of the data.

To resolve the number-of-components issue in GMM, a recent and widely adopted adaptive EM algorithm was proposed [27], [28] that can automatically converge on the optimal number of Gaussian components. By employing a "minimum-message-length (MML)" Bayesian information criterion [29], [30], the method allows users to start with a relatively
large number of Gaussian components and gradually converge to the optimal number of groups during the EM iterative procedure. Adaptive EM introduces a modified M-step for the Gaussian weights capable of eliminating unimportant components, which is only a simple extension of standard EM-GMM and makes it attractive for practitioners when compared to other methods (e.g., the variational Bayesian model [31], and those reviewed in Ref. [26]). Several drawbacks of standard EM, such as sensitivity to initialization and possible convergence to singular solutions, are also largely avoided [27], [28]. However, the slow convergence problems of standard EM were left unaddressed.

Many methods have been proposed to date to accelerate the convergence rate of standard (non-adaptive)
EM [25], [32], which may be roughly categorized into EM extension algorithms and gradient-based algorithms. The algorithms in the first category are mainly developed within the EM framework based on statistical considerations, and include ECM [33], ECME [34], SAGE [35], AECM [36], PX-EM [37], and CEMM [38], among others. We note that Figueiredo and Jain's original paper for adaptive EM-GMM employed CEMM, but mainly for the purpose of avoiding elimination of all Gaussian components at the beginning of the iteration in some situations [28]. Algorithms in the second category treat standard EM as a fixed-point iteration map, and seek acceleration using various gradient-based methods, including Aitken-Steffensen-type [23], [39]–[42], conjugate gradient (CG) [43]–[45], quasi-Newton (QN) [46]–[49], and Newton-Raphson [50], [51] methods. In the context of GMM density estimation, the Newton-Raphson method requires computation of the second derivative of the log-likelihood function (i.e., the Hessian matrix [2]), which is considered too complicated to be practical, especially when compared to standard EM. QN and CG methods avoid this difficulty by using approximations that involve only the first derivatives of the log-likelihood function (i.e., the score function [2]), which are much easier to obtain, and still often gain much acceleration over standard EM when it is very slow. However, one common feature of many gradient-based EM accelerators is that they require line searches and careful monitoring or safeguards, a consequence of the lack of automatic monotone convergence of the likelihood function. The line-search step (e.g., needed in Refs. [43], [44], [46], [47], [49]) determines the step size in some gradient direction in order to make progress in increasing the likelihood function. This often requires multiple likelihood function evaluations, which is one of the most expensive operations in EM-GMM due to the need to evaluate multiple exponential functions on the sample data. The situation is similar for other strategies such as globalization [41], algorithm restart [42], or monitoring of the progress [49]. The need for many additional likelihood function evaluations can offset much (or even all) of the algorithmic acceleration afforded by these solver strategies, leading to virtually no wall-clock-time advantage.

In this study, we explore acceleration of EM-GMM using Anderson Acceleration (AA) [52], which can be viewed as a variant of QN [53]. We note that AA has been explored before for EM-GMM [54], [55]. In Ref. [54], a reduced mixture problem (i.e., estimating only the means of a three-component univariate Gaussian mixture) was successfully accelerated by AA. Later, Ref. [55] successfully extended the method to two-component multivariate Gaussian mixtures, suggesting potential for AA as an EM accelerator for GMM. However, both studies assumed a known number of mixture components, and both employed a large number of samples in their tests, presumably robustifying the AA iterations. For smaller and more realistic data sets and more complicated applications, as in some of our tests, the AA implementation in Refs. [54], [55] may break positiveness of Gaussian parameters, may converge to sub-optimal solutions [48], and may even fail catastrophically (as we will show).
To remedy those drawbacks, we will employ a restarted/regularized version of AA to accelerate adaptive EM while monitoring the monotonicity of the likelihood function (a critical step [48]), which has not previously been tested with GMM. In fact, to the best of our knowledge, no gradient-based EM accelerators have been applied to the adaptive EM-GMM algorithm.

The success of our proposed algorithm stems from various ingredients. Firstly, we improve the monotonicity-control step proposed in Ref. [48] with a new, very-low-overhead monotonicity test. Secondly, we ensure that the algorithm preserves positive-definiteness of Gaussian weights and covariance matrices, and conserves up to second moments of the observed data set exactly, just as in standard EM. Lastly, it is well known that choosing the initial number of Gaussians well can significantly affect the performance of adaptive EM. To obtain a good estimate for the initial number of components, we complement our method with a reliable initialization routine using K-means clustering with the gap statistic [56]. As a result, our AA-based algorithm delivers significant efficiency gains versus the non-accelerated EM while converging to the same (optimal) solution.

The rest of the paper is organized as follows. Section 2 introduces the basic concepts of both standard and adaptive EM-GMM. Section 3 briefly summarizes a recent attempt at EM acceleration, the exact-line-search (ELS) EM (which we will use as a benchmark for performance). We then present an overview of AA, the challenges of applying it naively to EM-GMM, and strategies proposed in the literature to address these challenges. Section 4 proposes our solution for accelerating adaptive EM with monotonicity control for the likelihood function. Key elements include the design of an efficient approach for monotonicity control in AA, the use of a regularization term [48] to combine the robustness of EM and the local convergence speed of AA, and the careful selection of the initial number of Gaussian components using K-means clustering based on the gap-statistic approach [56]. Also described are our solutions for strict moment conservation and preservation of positive-definiteness of Gaussian weights and covariance matrices. Section 5 demonstrates the fidelity and efficiency of the proposed acceleration scheme over its non-accelerated counterpart for several synthetic data sets, as well as real data sets generated from particle-in-cell simulations [17], where we demonstrate significant algorithmic and wall-clock-time speed-ups. Finally, we conclude in Section 6.
2 THE EXPECTATION-MAXIMIZATION (EM) ALGORITHM FOR GAUSSIAN MIXTURES (EM-GMM)
A Gaussian mixture (GM) of K components is defined to be a convex combination of K Gaussian distributions G_k, k = 1, ..., K, of the following form:

f(x) = \sum_{k=1}^{K} ω_k G_k(x; µ_k, Σ_k),   (1)

where ω_k, µ_k, Σ_k are the weight, mean, and covariance matrix, respectively, of the k-th Gaussian in the mixture. Note that ω_k ≥ 0, and the Σ_k are symmetric-positive-definite matrices. The Gaussian distribution G_k is defined as

G_k(x; µ_k, Σ_k) = \frac{1}{\sqrt{(2π)^D |Σ_k|}} \exp\left[-\frac{1}{2}(x − µ_k)^T Σ_k^{-1} (x − µ_k)\right],   (2)

where both x and µ_k are D-dimensional column vectors, the superscript T denotes transpose, and |Σ_k| is the determinant of the covariance matrix.

The goal is to find the Gaussian parameters θ = (θ_1, ..., θ_K), where θ_k = (ω_k, µ_k, Σ_k) for k = 1, ..., K, that maximize the log-likelihood of the Gaussian mixture model [57]. The log-likelihood is written as

L(θ) = \ln\left(\prod_{j=1}^{N} [f(x_j)]^{ζ_j}\right) + η\left(\sum_{k=1}^{K} ω_k − 1\right) = \sum_{j=1}^{N} ζ_j \ln\left(\sum_{k=1}^{K} ω_k G_k(x_j; µ_k, Σ_k)\right) + η\left(\sum_{k=1}^{K} ω_k − 1\right),   (3)

for N independent samples X = (x_1, ..., x_N) (presumably) drawn from f(x), with each sample x_j having a weight ζ_j. Here, η(\sum_{k=1}^{K} ω_k − 1) is the Lagrange-multiplier term that enforces the normalization constraint \sum_{k=1}^{K} ω_k = 1, i.e., f(x) is normalized to unity. We assume that \sum_{j=1}^{N} ζ_j = N, and the sample weights ζ_j, j = 1, ..., N, account for the cases with non-identical samples [58].

In order to maximize the log-likelihood function L(θ), we solve the following score equations:

∂L(θ)/∂θ = 0, and ∂L(θ)/∂η = 0.   (4)

We obtain (see Ref. [5] and Appendix A):

µ_k = \frac{1}{N_k} \sum_{j=1}^{N} r_{jk} x_j,   (5)

Σ_k = \frac{1}{N_k} \sum_{j=1}^{N} r_{jk} (x_j − µ_k)(x_j − µ_k)^T,   (6)

ω_k = \frac{N_k}{N},   (7)

where r_{jk} is the responsibility value of point x_j within the k-th Gaussian, and is defined as

r_{jk} = \frac{ζ_j ω_k G_k(x_j; µ_k, Σ_k)}{\sum_{l=1}^{K} ω_l G_l(x_j; µ_l, Σ_l)},   (8)

with N_k = \sum_{j=1}^{N} r_{jk} and ζ_j = \sum_{k=1}^{K} r_{jk}.

EM-GMM is an iterative procedure to find the solution to (5), (6), and (7), and proceeds by repeating the following distinct steps until convergence:
• Expectation step: Evaluate the responsibilities r_{jk} for j = 1, ..., N and k = 1, ..., K using (8).
• Maximization step: Update the Gaussian parameters for k = 1, ..., K using (5), (6), and (7).
Convergence of the algorithm is assessed by monitoring the log-likelihood function (3). We note that several quantities used in the computation of the responsibilities for θ^(it), r_{jk}^(it)(θ^(it)), can be re-used in the evaluation of L(θ^(it)) for efficiency. We remark that each EM iteration ensures the monotonic increase of the likelihood function during the iteration [23], and conserves up to the second moments of the data set [17], [59].
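For concreteness, the following is a minimal NumPy sketch of one weighted EM-GMM iteration, i.e., the E-step (8) followed by the M-step (5)–(7). The function name `em_step` and its argument layout are illustrative assumptions, not part of any reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, zeta, w, mu, Sigma):
    """One EM-GMM iteration for weighted samples.

    X : (N, D) samples, zeta : (N,) sample weights (summing to N),
    w : (K,) mixture weights, mu : (K, D) means, Sigma : (K, D, D) covariances.
    Returns updated (w, mu, Sigma) and the responsibilities r of shape (N, K).
    """
    N, D = X.shape
    K = len(w)

    # E-step: responsibilities r_jk, Eq. (8)
    dens = np.column_stack([w[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                            for k in range(K)])              # (N, K)
    r = zeta[:, None] * dens / dens.sum(axis=1, keepdims=True)

    # M-step: Eqs. (5)-(7)
    Nk = r.sum(axis=0)                                        # effective counts N_k
    w_new = Nk / N
    mu_new = (r.T @ X) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        d = X - mu_new[k]
        Sigma_new[k] = (r[:, k, None] * d).T @ d / Nk[k]
    return w_new, mu_new, Sigma_new, r
```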
In practice, the number of Gaussian components in the Gaussian mixture is often unknown. Therefore, a way to select the proper number of mixture components is needed. Component adaptivity can be accomplished by employing the minimum message length (MML) criterion [17], [27]–[30], [60] to penalize the log-likelihood function. Instead of using (3) for the log-likelihood, we use the following penalized log-likelihood function:

PL(θ) = \sum_{j=1}^{N} ζ_j \ln\left(\sum_{k=1}^{K} ω_k G_k(x_j; µ_k, Σ_k)\right) + η\left(\sum_{k=1}^{K} ω_k − 1\right) − \frac{d}{2}\ln(N) − \frac{T}{2}\sum_{k=1}^{K} \ln(ω_k),   (9)

where d is the total number of parameters in the Gaussian mixture and T = D(D+3)/2. For a detailed derivation of (9), we refer the reader to Ref. [17]. The last two terms in (9) are the penalization terms, and they play an important role in determining the optimal number of components by removing unnecessary Gaussian components in order to avoid over-fitting the data.

In order to maximize the penalized log-likelihood function (9), the same approach as described in Section 2.1 results in the same formulas for updating the Gaussians' means µ_k and covariance matrices Σ_k in the M-step, i.e., (5) and (6), respectively. However, due to the presence of the penalization terms, the equation for the Gaussians' weights is modified as (see Appendix A):

ω_k = \frac{N_k − T/2}{N − TK/2},   (10)

provided that N_k > T/2. If N_k < T/2, then ω_k < 0, indicating that the k-th Gaussian should be killed. Also, (10) advises that one should start with more Gaussian components than the exact number of Gaussian components in the GMM [61].

Putting things together, assuming K components at each iteration of adaptive EM-GMM, we do the following:
1) Evaluate the responsibilities r_{jk} using (8).
2) Compute ω_k^(∗) = max((N_k − T/2)/(N − TK/2), 0).
3) If ω_k^(∗) = 0, then kill the k-th Gaussian: set µ_k = 0, Σ_k = 0, and K = K − 1. Otherwise, update the mean and covariance matrix of the k-th Gaussian using (5) and (6), respectively.
4) Re-normalize the Gaussian weights: ω_k = ω_k^(∗) / \sum_{k=1}^{K} ω_k^(∗).
5) Check for convergence by monitoring the penalized log-likelihood function (9).
Steps 1 to 5 are repeated until convergence. We refer to Step 1 as the E-step and Steps 2–4 as the M-step (a sketch of this penalized M-step is given below). In practice, we can perform the iterations in a simultaneous approach or a component-wise approach [38]. In the simultaneous EM, we perform the E-step for all available Gaussian components in the mixture, then update their parameters in the M-step. In the component-wise EM, we perform the E-M steps for one Gaussian component and then move on to the next one until we reach the final component in the mixture. The simultaneous approach is faster, but the algorithm could possibly kill all Gaussians at once in some situations (e.g., when the starting number of components, K_init, is much larger than the model's exact number of components, K_model) [28]. The component-wise approach is slower (because it requires updating the Gaussian mixture's probability density function once a component is updated), but it can prevent such a problem [28]. For efficiency, here we follow the simultaneous approach while relying on an extended form of K-means clustering [56], [62]–[66] to provide a good initial guess for the number of components. This will be discussed later in this study.

We note that both the accelerated and non-accelerated versions of the adaptive EM-GMM iteration do not conserve the moments of the observed data set, due to the presence of the penalization terms. To recover conservation, we perform a final standard EM-GMM step after convergence [17].
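As an illustration, here is a minimal sketch of the penalized (MML) M-step described above, reusing the conventions of the `em_step` listing given earlier; the function name `adaptive_m_step` and the pruning logic are illustrative assumptions, and the responsibilities `r` are assumed to come from the E-step.

```python
import numpy as np

def adaptive_m_step(X, r):
    """MML-penalized M-step: update weights via Eq. (10), kill components whose
    clipped weight vanishes, then update means/covariances via Eqs. (5)-(6)."""
    N, D = X.shape
    K = r.shape[1]
    T = D * (D + 3) / 2.0                       # parameters per Gaussian
    Nk = r.sum(axis=0)                          # effective counts

    # Step 2: penalized weight update, clipped at zero (Eq. (10))
    w_star = np.maximum((Nk - T / 2.0) / (N - T * K / 2.0), 0.0)

    # Step 3: keep only surviving components
    keep = w_star > 0.0
    r, Nk, w_star = r[:, keep], Nk[keep], w_star[keep]

    # Step 4: re-normalize the surviving weights
    w_new = w_star / w_star.sum()

    # Means and covariances of the survivors, Eqs. (5)-(6)
    mu_new = (r.T @ X) / Nk[:, None]
    Sigma_new = np.empty((keep.sum(), D, D))
    for k in range(len(Nk)):
        d = X - mu_new[k]
        Sigma_new[k] = (r[:, k, None] * d).T @ d / Nk[k]
    return w_new, mu_new, Sigma_new, keep
```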
3 NONLINEAR ACCELERATION OF THE EM ALGORITHM
A recent attempt to accelerate EM-GMM is the so-called exact-line-search EM (ELS-EM) introduced by Xiang et al. [67]. The strategy of the method is to search for an improved solution, θ^(new) = (ω_k^(new), µ_k^(new), Σ_k^(new)), along the line joining the current and previous iterates, before updating the Gaussian parameters in the M-step. In particular, we want to find ρ^(it) = (ρ_{ω_k}, ρ), where ρ_{ω_k} is the step size for each Gaussian weight ω_k and ρ is the common step size for all Gaussian means and covariance matrices in the GMM [67], such that

ω_k^(new) = ω_k^(it−1) + ρ_{ω_k} (ω_k^(it) − ω_k^(it−1)),
µ_k^(new) = µ_k^(it−1) + ρ (µ_k^(it) − µ_k^(it−1)),
Σ_k^(new) = Σ_k^(it−1) + ρ (Σ_k^(it) − Σ_k^(it−1)),   (11)

maximizes the log-likelihood. Here, the superscripts indicate solutions at the current (it)-th and previous (it−1)-th iterations, respectively. Details on the computation of the step sizes ρ^(it) can be found in Ref. [67]. If L(θ^(new)) > L(θ^(it)), we use θ^(new) to update the Gaussian parameters in the M-step instead of θ^(it). The sketch of one iteration of ELS-EM [67] is as follows:
1) E-step: Evaluate the responsibilities r_{jk}^(it)(θ^(it)) using (8).
2) ELS-step: Compute the solution θ^(new) using (11). If L(θ^(new)) > L(θ^(it)), evaluate r_{jk}^(new)(θ^(new)) and set r_{jk}^(it) = r_{jk}^(new); else, keep r_{jk}^(it) from the E-step.
3) M-step: Update the Gaussian parameters θ^(it+1) using r_{jk}^(it).
It was pointed out in Ref. [67] that the cost of the ELS-step is the same as the cost of the E-step, since it needs an additional evaluation of the log-likelihood function and the re-evaluation of the responsibilities r_{jk} when the new solution θ^(new) is better. This makes one iteration of ELS-EM about two times more expensive than one EM iteration. We have implemented ELS for standard (non-adaptive) EM-GMM, and we have confirmed that this is the case (as we will show). Moreover, we have found a reduction in iteration count of only ∼1.5–1.9× with ELS-EM for our synthetic data sets, resulting in almost no wall-clock-time speed-up. This is in contrast with our proposed algorithm, described in the next section, where we demonstrate wall-clock-time speed-ups of up to ∼60× for the same data sets.

AA is a common accelerator for nonlinear Picard iterative procedures [52], [54], [68]. EM-GMM is indeed a Picard iteration, which can show slow convergence, especially in the case where the Gaussians in the mixture are highly overlapping. Therefore, in principle, it can be accelerated by Anderson acceleration. A direct application of AA to EM (AAEM), taken from [54], is given in Algorithm 1. In the algorithm, G(θ) is one step of either standard or adaptive EM-GMM.

Algorithm 1
Anderson accelerated EM algorithm (AAEM)
Given an initial solution θ^(0) and a maximum number of residuals m_AA > 0.
Evaluate θ^(1) = G(θ^(0)). Do it = 1, 2, ... until converged:
1) Set the number of residuals in AA, m = min(it, m_AA).
2) Set F_it = (f_{it−m}, ..., f_{it}), where f_i = G(θ^(i)) − θ^(i).
3) Solve for α^(it) such that

α^(it) = argmin_α ||F_it α|| subject to \sum_{i=0}^{m} α_i = 1.   (12)

4) θ^(it+1) = \sum_{i=0}^{m} α_i^(it) G(θ^(it−m+i)).
5) Check for convergence by monitoring the log-likelihood (use (3) for standard EM or (9) for adaptive EM).

Unfortunately, several problems arise when one naively applies AA to EM-GMM. Firstly, for both the standard and adaptive cases, our numerical experiments show that AAEM only conserves the zeroth moment of the given data set. Secondly, the positive-definiteness property of the Gaussian weights and covariance matrices is not ensured with standard AA, since the Anderson iterate is expressed as a non-convex linear combination of positive-definite solutions [69], [70] (i.e., some of the AA coefficients α_i in (12) can be negative). Thirdly, the monotonicity of the log-likelihood function is not ensured. From our numerical results, we often see that the log-likelihood of the Anderson solution is less than the log-likelihood of the current EM iterate, i.e., L(θ^(AA)) < L(θ^(it)) in the standard case, or PL(θ^(AA)) < PL(θ^(it)) in the adaptive case. Fourthly, in adaptive EM, since the exact number of Gaussian components is unknown, we often start with a larger number of components than needed for a given data set. In this case, when applying AA to adaptive EM-GMM, we observe that the method does not produce the right number of Gaussian components and converges to a non-optimal solution. In some situations, the convergence rate of adaptive AAEM is observed to be slower than that of the adaptive non-accelerated EM, as we will demonstrate in a later section.

We have explored various solutions proposed in the literature to address these issues, with varied success. To preserve up to second moments at every iteration, we considered accelerating the Gaussian moments M_k = ω_k (1, µ_k, Σ_k + µ_k µ_k^T) instead of the Gaussian parameters θ_k = (ω_k, µ_k, Σ_k). We find that this approach conserves up to second moments in the non-adaptive case with a fixed number of Gaussian components, but not in the adaptive case, because of the renormalization of the Gaussian weights in the M-step. In addition, the moment-acceleration strategy may break the positive-definiteness of Σ_k when computed from the accelerated second-moment matrix. To address the positive-definiteness of Gaussian weights and covariance matrices, we employed the AA globalization technique proposed in [70], i.e., an additional constraint for the positivity of the AA coefficients is added to the least-squares problem (12) as follows:
Find α^(it) s.t.

α^(it) = argmin_α ||F_it α|| subject to \sum_{i=0}^{m} α_i = 1 and α_i > 0 ∀i,   (13)

where m is the number of past residuals in AA at the (it)-th iteration. We found from our numerical experiments that this approach significantly slows down the local convergence speed of AA (since it restricts the optimization domain), and that, at a later phase in the AAEM iteration, it frequently defaults back to EM, i.e., it returns α_m = 1 and α_0 = ··· = α_{m−1} = 0.

To address successfully the AAEM problem of local convergence to a non-optimal solution, we follow the AA regularization approach proposed by Henderson et al. [48]. Specifically, the constrained least-squares problem (12) can be reformulated as an unconstrained least-squares problem to which a regularization term λ I_{m×m} is added as follows (see [48], [54]): at the (it)-th iteration, find γ^(it) s.t.

(F_it^T F_it + λ I) γ^(it) = F_it^T f_it,   (14)

where F_it = (Δf_{it−m}, ..., Δf_{it−1}), Δf_i = f_{i+1} − f_i, and f_i = G(θ^(i)) − θ^(i). As remarked in Ref. [54], there exists a one-to-one correspondence between the coefficients α^(it) given by (12) and the coefficients γ^(it) given by (14) when λ = 0, that is: α_0^(it) = γ_0^(it), α_i^(it) = γ_i^(it) − γ_{i−1}^(it) for i = 1, ..., m−1, and α_m^(it) = 1 − γ_{m−1}^(it). In this case, at the (it)-th iteration, the updated Anderson iterate can be written as

θ^(it+1) = G(θ^(it)) − \sum_{i=0}^{m−1} γ_i^(it) [G(θ^(it−m+i+1)) − G(θ^(it−m+i))].   (15)

According to Henderson et al. [48], the use of the regularization term λI helps combine the convergence robustness of EM and the local convergence speed of AA. One can see that if λ = 0, we recover AA, and if λ ≫ 1, we recover EM. We refer to Ref. [48] for the strategy of computing λ at each EM iteration, which we follow strictly. This approach, together with a novel, very efficient monotonicity-control implementation for the log-likelihood function (discussed in the next section), results in an algorithm that captures the right number of components and quickly converges to the right solution nearly without run-time penalty.
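The following is a minimal sketch of the regularized AA update (14)–(15) acting on a flattened parameter vector. The helper name `regularized_aa_update`, the history lists, and the use of a fixed λ passed in by the caller are illustrative assumptions; the λ-selection strategy of Ref. [48] is not reproduced here.

```python
import numpy as np

def regularized_aa_update(theta_hist, g_hist, lam):
    """Regularized Anderson update, Eqs. (14)-(15).

    theta_hist : list of flat parameter vectors [theta^(it-m), ..., theta^(it)]
    g_hist     : list of their EM images        [G(theta^(it-m)), ..., G(theta^(it))]
    lam        : regularization factor (lam = 0 -> plain AA, lam >> 1 -> plain EM)
    Returns the Anderson iterate theta^(it+1) as a flat vector.
    """
    f = [g - t for g, t in zip(g_hist, theta_hist)]     # residuals f_i = G(theta_i) - theta_i
    m = len(f) - 1                                       # number of past residuals
    if m == 0:
        return g_hist[-1]                                # no history yet: plain EM step

    dF = np.column_stack([f[i + 1] - f[i] for i in range(m)])        # (n_params, m)
    dG = np.column_stack([g_hist[i + 1] - g_hist[i] for i in range(m)])

    # Eq. (14): (dF^T dF + lam I) gamma = dF^T f_it
    gamma = np.linalg.solve(dF.T @ dF + lam * np.eye(m), dF.T @ f[-1])

    # Eq. (15): theta^(it+1) = G(theta^(it)) - sum_i gamma_i [G(theta_{i+1}) - G(theta_i)]
    return g_hist[-1] - dG @ gamma
```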
ALGORITHM
The main goal of this paper is to make the adaptive EM (A-EM) algorithm faster. To this end, we apply an Andersonacceleration to A-EM while maintaining key features of both adaptive and standard EM such as component adaptivity,conservation of up to second moments, preservation of positive-definiteness of Gaussian weights and covariancematrices, and monotonicity preservation of the log-likelihood function during the iteration. The adaptive, accelerated,monotonicity-preserving EM (A-AMEM) algorithm for GMM is outlined in Section 4.1, with the initialization andimplementation details discussed in Sections 4.2 and 4.3, respectively.
The accelerated algorithm for the adaptive EM-GMM iteration is detailed in Algorithm 2. We implement the regularization term [48] and employ periodic AA restarts [48], [68], [71] to address some of the pitfalls of a naive AAEM implementation for adaptive GMM. In Algorithm 2, G(θ) represents the fixed-point EM step of A-EM, λ is the regularization factor, and ǫ is the log-likelihood controlling parameter. Details on Algorithm 2 and further discussion of the values of λ and ǫ are given next.

We use K-means clustering for Gaussian initialization. The initial centroids of the clusters in each K-means call are selected from the data points with an improved seeding technique [72]. To obtain the best guess for the number of components, we employ the gap statistic (GS) method [56], using the so-called gap statistic value (GSV). (We have also explored other K-means techniques [62], [63], [65], [66], which we discuss in Section 5.9.) The GSV associated with K clusters is defined as (cf. [56]):

GSV(K) = E^*[\ln(SSE_K)] − \ln(SSE_K).   (16)

In (16), E^* denotes the expectation under a sample from the reference distribution, and SSE_K is given as

SSE_K = \sum_{k=1}^{K} \sum_{x_i ∈ C_k} ζ_i |x_i − c_k|^2,   (17)

where C_k is the k-th cluster, c_k is its centroid, and ζ_i is the weight of point x_i ∈ C_k.
Algorithm 2
Adaptive accelerated-monotonicity EM (A-AMEM) algorithm
Given:
• m_AA, the maximum number of residuals for AA.
• K_init, the initial number of components to be used.
Initialization: Perform the K-means clustering algorithm with K_init clusters (multiple times, selecting the best run) to obtain the initial solution θ^(0).
The adaptive AMEM: Do it = 0, 1, 2, ... until converged:
1) Perform the EM step: θ^(it+1) = G(θ^(it)). During this step, Gaussian(s) may be killed. Restart AA if a Gaussian component is eliminated.
2) Apply regularized AA: Solve the least-squares problem (14) for γ^(it). Compute the Anderson iterate, θ^(AA), using (15). If any Gaussian weight becomes negative, then use the EM solution and continue to the next iteration; else, go to Step 3.
3) Monotonicity control: If PL(θ^(AA)) − PL(θ^(it)) > −ǫ, then set θ^(it+1) = θ^(AA); else, use the EM solution.
4) Restart AA if the number of residual vectors reaches m_AA.
5) Check for convergence by monitoring the penalized log-likelihood given by (9).
Conservation of moments: Perform one final standard EM step after convergence.
The computation of the expectation in GSV(K) is done with Monte Carlo, from reference data sets (see Ref. [56]). We have found that generating the reference sets from a uniform distribution over a box aligned with the principal components of the observed sample [56] (which is the one we use in our numerical simulations in Section 5.9), or from a normal distribution with parameters taken to be the sample's mean and covariance matrix, yields reliable results for our data sets. The optimal number of clusters K_opt is found from the following condition:

K_opt = smallest K s.t. GSV(K) > GSV(K+1) + τ × s_{K+1},   (18)

for K = K_min, ..., K_max. In (18), s_K is a standard-deviation term that accounts for the Monte Carlo simulation error in evaluating GSV(K), and τ is a user-input factor that represents the amount of standard deviation used. We refer the reader to Ref. [56] for more details on the evaluation of GSV(K) and s_K. In our application, choosing τ = 0 or 1 in (18), instead of τ = −1 as in Ref. [56], helps avoid possible under-estimation of the number of clusters by the GS method. Although it may potentially over-estimate the number of clusters by a few in some cases, this is acceptable in our application since, for robustness, A-AMEM should begin with more components than the expected number. Once the optimal number of clusters is obtained, we set the initial number of clusters as K = K_opt + K_adjust for some K_adjust > 0, to further avoid under-estimation of the model. We then repeat K-means with K clusters multiple times, and the centroids associated with the best trial are selected. The best trial is the one that yields the smallest inertia (SSE) value. The initial Gaussian means µ_k^(0), k = 1, ..., K, are assigned from the clusters' centroids of the best run. The initial Gaussian weights are computed from the K-means best run as

ω_k^(0) = n_k / N,

where n_k = \sum_{i ∈ C_k} ζ_i is the total weight of the points that belong to the k-th cluster and N is the total number of observed points in the data set. The initial Gaussians' covariance matrices are assigned to the clusters' so-called within-covariances, which are evaluated as

Σ_k^(0) = \frac{1}{n_k} \sum_{x_j ∈ C_k} ζ_j (x_j − µ_k^(0))(x_j − µ_k^(0))^T,

where the x_j are the points assigned to the k-th cluster at the end of the K-means iteration. In general, the K-means initialization algorithm for EM is quite inexpensive compared to EM, taking a small fraction (5–10%) of the total wall-clock time, and can have a large impact on the efficiency of the overall algorithm.

For the adaptive EM algorithms, we follow the steps outlined in Section 2.2 to compute the updated Gaussians' parameters. We remark that the k-th Gaussian is removed from the mixture if the updated weight is negative.
Additionally, we restart AA if a Gaussian component is killed during this step, since the past AA solution history of the removed component is no longer valid.

Once we obtain the updated values for the Gaussian parameters, we solve the unconstrained regularized least-squares problem (14) and compute the updated AA solution, θ^(AA), using (15). To ensure positive-definiteness of the covariance matrices, we accelerate the entries of the matrices L_k, k = 1, ..., K, where Σ_k = L_k L_k^T is the Cholesky decomposition [73] of Σ_k. After the acceleration procedure, the covariance matrices Σ_k are recovered from the lower-triangular matrices L_k. As for the Gaussian weights, if AA returns ω_l < 0 for some l = 1, ..., K, then we simply roll back to the EM solution, which is guaranteed to keep the Gaussian weights positive and to increase the penalized log-likelihood function. We note that the violation of positivity of the Gaussian weights does not occur frequently during the A-AMEM iteration for our synthetic data sets. Hence, the local convergence speed of AA is not much affected by the development of negative weights, and the rollback-to-EM strategy seems reasonable.

The monotonicity control step (Step 3 in Algorithm 2) is expensive because, in principle, it requires an additional evaluation of the log-likelihood function (9), which involves loops over all samples and available Gaussian components and expensive logarithmic and exponential evaluations. As a result, one iteration of A-AMEM becomes twice as expensive as one A-EM iteration. To avoid evaluating the penalized log-likelihood function for the AA iterate, PL(θ^(AA)), in the monotonicity control step, we approximate the computation of PL(θ^(AA)) − PL(θ^(it)) using a first-order Taylor expansion, which relies on the exact evaluation of score functions. We recall that the score function is the derivative of the log-likelihood function with respect to the Gaussian unknowns. In particular, instead of checking

PL(θ^(AA)) − PL(θ^(it)) > −ǫ,   (19)

we check

∂PL(θ)/∂θ^(it) · (θ^(AA) − θ^(it)) > −ǫ,   (20)

where ǫ > 0 is the monotonicity parameter, and

∂PL(θ)/∂ω_k^(it) = N_k^(it)/ω_k^(it) − (T/2)/ω_k^(it) − N + TK/2,   (21)

∂PL(θ)/∂µ_k^(it) = (Σ_k^(it))^{-1} \left[\sum_{j=1}^{N} r_{jk}^(it) (x_j − µ_k^(it))\right],   (22)

∂PL(θ)/∂Σ_k^(it) = \frac{1}{2} (Σ_k^(it))^{-1} \left\{\sum_{j=1}^{N} r_{jk}^(it) \left[−Σ_k^(it) + (x_j − µ_k^(it+1))(x_j − µ_k^(it+1))^T\right]\right\} (Σ_k^(it))^{-1},   (23)

for k = 1, ..., K. Details on the derivation of (21), (22), and (23) are given in Appendix A. Using (20) is cheap because most quantities in (21), (22), and (23) can be re-used from the adaptive M-step in Algorithm 2. Thus, the evaluation complexity of the gradient ∂PL(θ)/∂θ^(it) is only of order O(KD), where D is the dimension of µ_k^(it). This is much more efficient than the direct evaluation of PL(θ^(AA)), which has computational complexity of O(NKD), where N ≫ 1 is the number of sample data points. The Taylor-expansion approach for approximating PL(θ^(AA)) − PL(θ^(it)) renders the cost of one A-AMEM iteration comparable to that of one A-EM iteration, and is a key contributor to the efficiency improvement of our implementation (a code sketch of this test is given at the end of this section). As for the monotonicity parameter ǫ, we find that small values of ǫ work well for our simulations. Choosing a small ǫ means that likelihood ratios between the current and accelerated solutions are allowed to be no greater than e^ǫ [48]. Further discussion about the choice of ǫ can be found in the same reference.

We apply a periodic restart strategy for AA in A-AMEM to help improve the overall robustness of the algorithm, as suggested in [48], [71]. The AA restart solution proposed in [71] keeps the last column of F_it. We have tested both resetting all AA residuals to zero and keeping the latest column of F_it, and found that they yield comparable performance in our numerical tests.

Finally, once the algorithm converges to a solution with an optimal number of Gaussian components, we perform one final standard EM iteration (see Section 2.1 and [17]) to recover the conservation up to second moments of the observed data set.
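To make the first-order test (20) concrete, the sketch below assembles the score functions (21)–(23) from quantities already available after the E-step and forms the directional derivative against the AA increment. The function names, the way the parameters are packed into arrays, and the use of the current mean in the Σ-gradient are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def penalized_score(X, r, w, mu, Sigma):
    """Score functions (21)-(23) of the MML-penalized log-likelihood."""
    N, D = X.shape
    K = len(w)
    T = D * (D + 3) / 2.0
    Nk = r.sum(axis=0)

    dw = Nk / w - (T / 2.0) / w - N + T * K / 2.0             # Eq. (21)
    dmu = np.empty((K, D))
    dSigma = np.empty((K, D, D))
    for k in range(K):
        Sinv = np.linalg.inv(Sigma[k])
        d = X - mu[k]
        dmu[k] = Sinv @ (r[:, k] @ d)                          # Eq. (22)
        S_hat = (r[:, k, None] * d).T @ d                      # sum_j r_jk (x_j-mu)(x_j-mu)^T
        dSigma[k] = 0.5 * Sinv @ (S_hat - Nk[k] * Sigma[k]) @ Sinv   # Eq. (23), current mean used
    return dw, dmu, dSigma

def monotonicity_ok(score, theta_aa, theta_em, eps):
    """First-order test (20): grad(PL) . (theta_AA - theta_it) > -eps."""
    dw, dmu, dSigma = score
    w_aa, mu_aa, S_aa = theta_aa
    w_it, mu_it, S_it = theta_em
    ddot = (dw @ (w_aa - w_it)
            + np.sum(dmu * (mu_aa - mu_it))
            + np.sum(dSigma * (S_aa - S_it)))
    return ddot > -eps
```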
5 NUMERICAL RESULTS
We apply A-AMEM (Algorithm 2) and A-EM to several synthetic GMM data sets and compare the performance and results of the two algorithms. Each synthetic data set is a convex linear combination of K_exact = 3 Gaussian distributions with different overlap. Each set consists of N = 1000 points. The separation between the Gaussians is determined by the Euclidean distances between the Gaussians' means and by the Gaussians' shapes, set by the covariance matrices. We generate three synthetic data sets, namely Very Well Separated (VWS), Poorly Separated (PS), and Very Poorly Separated (VPS), which differ only in the choice of the Gaussian means µ_1, µ_2, µ_3 (the separation between the means decreases from VWS to PS to VPS). The same Gaussian weights ω_1, ω_2, ω_3 and diagonal covariance matrices, with Σ_2 and Σ_3 proportional to Σ_1, are used for the three manufactured GMM data sets. We also apply Algorithm 2 to real data sets generated from collisionless plasma particle-in-cell (PIC) simulations [17]. These real data sets consist of the electrons' velocity points in the three-dimensional (3D) velocity space, and will be described in detail later.

Our goal is to study the efficiency of A-AMEM when compared to A-EM for different initial numbers of Gaussian components, K_init. To this end, we define the iteration reduction factor (IRF) and the CPU time reduction factor (TRF) between the accelerated and non-accelerated EM algorithms as

IRF = (number of standard EM iterations) / (number of accelerated EM iterations),   (24)

TRF = (standard EM CPU time) / (accelerated EM CPU time).   (25)

The same terminating tolerance TOL is used when applying both A-AMEM and A-EM to a given data set. In order for both A-EM and A-AMEM to kill enough unnecessary components before converging to the desired optimal solutions, we use a small terminating tolerance (a different value is used for the synthetic and the PIC data sets). As for the choice of m_AA, the maximum number of past residuals in AA, we acknowledge that the performance of AA with respect to this number is problem-dependent, as was also remarked in [54]. We find that using m_AA between 5 and 10 works well for our simulations. We set m_AA = 5 for K_init = 3, and m_AA = 10 for larger K_init.

The 2D view of the synthetic data sets in the X-Y plane is given in Fig. 1. The views in the Y-Z and X-Z planes are identical to the X-Y plane's view, since we use diagonal covariance matrices for the three components in the mixture.

Fig. 1. Visualization of manufactured GMM data sets in the X-Y plane. (Left) VWS data set. (Center) PS data set. (Right) VPS data set.
We can clearly observe the presence of the three Gaussian components in the VWS data set, and somewhat clearly in the PS data set. However, they are impossible to distinguish by the naked eye in the VPS data set. It is expected that A-EM will be able to detect the right number of components and converge fairly quickly for the VWS data set. It is also anticipated that A-EM will detect the right number of Gaussians for the PS data set, but with a slower convergence rate. For the VPS data set, however, A-EM is expected to converge extremely slowly due to the significant overlap among the Gaussian components in the mixture, and perhaps even underestimate the number of components.
We illustrate the performance of ELS-EM for GMM by applying it to the synthetic data sets of Section 5.1 and comparing the results with standard EM. For these tests, we use K_init = 3, fixed, i.e., we assume the exact model. We use the same tolerance for terminating both algorithms. Both ELS-EM and standard EM are initialized using the best K-means result, i.e., the one with the smallest inertia (SSE) value, out of multiple trials with K = 3 clusters, as described in Section 4.2. In these simulations, we note that ELS-EM converges to the same solutions as EM. We report the IRF and TRF for ELS-EM versus EM in Table 1. We observe from the table that the ratios of IRF over TRF are approximately equal to 2. This verifies that, on average, one iteration of ELS-EM is twice as expensive as one iteration of EM. It is also apparent that ELS-EM is only able to speed up the convergence by a factor of less than 2 for all cases, resulting in no actual CPU speed-up (as demonstrated by the TRF values in Table 1).

TABLE 1
Reduction factors for ELS-EM.
        IRF    TRF    IRF/TRF
VWS    1.50   0.81    1.85
PS     1.90   0.89    2.13
VPS    1.89   0.90    2.10
Case K_init = 3

We apply A-EM and A-AMEM to the manufactured data sets using the exact initial number of components, K_init = 3. We initialize the Gaussians as in the previous section. Fig. 2 depicts the convergence history of the log-likelihood for the two algorithms. We have added a subplot inside the convergence plot of the VPS case to zoom into the first 40 iterations.

Fig. 2. Case K_init = 3: History of penalized log-likelihood values as a function of the number of iterations for synthetic data sets. (Left) VWS data set. (Center) PS data set. (Right) VPS data set.

As expected, the convergence rate of A-EM worsens with increasing Gaussian-component overlap. However, A-AMEM converges fairly quickly, regardless of component overlap, to the same penalized log-likelihood values as the non-accelerated version. We also remark that both A-EM and A-AMEM give the correct number of components (K_final = 3) for the three manufactured data sets, i.e., both algorithms are able to recognize the true number of components within each data set and do not kill any component. Table 2 records the iteration and wall-clock-time reduction factors for this case, and shows that A-AMEM outperforms A-EM dramatically, especially for the hard VPS case (by a factor of 60 in wall-clock time).

In practice, we often do not know in advance the exact model of the Gaussian mixture. Therefore, it is wise to start with a number of Gaussian components larger than the suspected number of groups, and rely on component adaptivity to find the correct model. However, the application of AA without monotonicity control to A-EM (which we term A-AAEM) fails catastrophically. To demonstrate this, we compare the outcomes of A-EM with and without AA (with no monotonicity control) on the synthetic data sets with K_init = 5.
TABLE 2
Reduction factors for K_init = 3.

        IRF     TRF    IRF/TRF
VWS     2.33    1.98   1.18
PS     10.62   10.22   1.04
VPS    67.45   60.96   1.11
Gaussian parameters are initialized from the best run out of multiple trials of K-means clustering with K = 5 clusters. The histories of the penalized log-likelihood values as a function of the number of iterations are given in Fig. 3.
Fig. 3. Case K_init = 5 – Application of AA to A-EM without monotonicity control: History of penalized log-likelihood values as a function of the number of iterations for synthetic data sets. (Left) VWS data set. (Center) PS data set. (Right) VPS data set.

From the figure, we observe that A-AAEM fails to annihilate unnecessary Gaussian components, converges to the wrong solutions, and is sometimes even slower than the non-accelerated version. Using K_init > K_exact is equivalent to expanding the solution space, i.e., many local maxima are created, and the accelerated A-EM manages to converge to one of those local maxima in the expanded subspace. That AA+EM finds a suboptimal solution is also the case without EM adaptivity (AAEM) when K_init > K_exact, even if standard EM is able to find the correct solution. More specifically, in our simulations with K_init = 5 > K_exact, standard EM finds three Gaussians approximately matching the exact ones, and two Gaussians with very small weights. However, in the converged AAEM solutions, we observe that all five Gaussians have comparable weights, and the means and covariance matrices of these Gaussians are far from the exact parameters. As a result, at convergence, L(θ(AAEM)) < L(θ(EM)) for the standard case and PL(θ(A-AAEM)) < PL(θ(A-EM)) for the adaptive case, as observed in Fig. 3. We conclude that the application of AA to EM-GMM without monotonicity control yields unreliable solutions and no performance advantage when the exact number of components in the mixture is unknown.

Case K_init = 5

We use K_init = 5 in both A-AMEM and A-EM, and the initialization of the Gaussian parameters is done in the same manner as described in Section 5.4. Note that we turn on the monotonicity control step for this test. As shown in Fig. 4, both algorithms detect the right number of Gaussian components at convergence for the manufactured data sets, and find the same solution. A-AMEM is able to annihilate Gaussian components somewhat faster than A-EM, and converges very fast once the optimal number of Gaussian components is reached, while A-EM continues to struggle to converge, especially for the VPS data set.

We also investigate the dynamics of the component removal process in A-AMEM due to the monotonicity control step. To this end, in Fig. 5 we show binary plots of the solution choices, AA iterate or EM iterate, during the A-AMEM iteration for the three synthetic GMM data sets. In the binary plots, the y-value for a given iteration is set to 1 only when A-AMEM falls back to the EM solution due to lack of monotonicity (and not because of violations of positivity). Also, the vertical lines in these binary plots represent the iterations at which Gaussians are killed.

From Fig. 5 we see that A-AMEM frequently reverts back to EM before reaching the optimal number of Gaussian components, in order to maintain the monotonicity of the penalized log-likelihood function. This explains the fact that A-AMEM kills Gaussian components at a comparable rate to A-EM. However, once the optimal number of components is reached, the acceleration kicks in aggressively and A-AMEM barely rolls back to EM. The last dot (with y-value equal to zero) in each of the binary plots of Fig. 5 indicates the application of a final standard EM step for conservation up to the second moments of the data points [17].

Table 3 shows the IRF and TRF for K_init = 5, demonstrating that A-AMEM converges faster than A-EM, especially for the hard VPS data set, with speed-ups reaching an order of magnitude for that case.
Fig. 4. Case K_init = 5: History of penalized log-likelihood values as a function of the number of iterations for synthetic data sets. (Left) VWS data set. (Center) PS data set. (Right) VPS data set. The blue and red dots indicate the iterations where Gaussians are killed.

Fig. 5. Case K_init = 5: History of solution choices in A-AMEM iterations for synthetic data sets. (Left) VWS data set. (Center) PS data set. (Right) VPS data set.

TABLE 3
Reduction factors for K_init = 5.

        IRF     TRF    IRF/TRF
VWS     1.85    1.83   1.01
PS      4.56    3.94   1.16
VPS    12.60   10.01   1.26
Case K_init = 8

Next, we consider K_init = 8, and perform the same numerical simulations for the manufactured GMM data sets using both the adaptive non-accelerated and accelerated EM. As before, Gaussian parameters are initialized from the best run out of multiple runs of K-means clustering, but with K = 8 clusters. We present the plots of the penalized log-likelihood values as a function of the number of iterations in Fig. 6.

Observations for this case are very similar to the K_init = 5 case. Again, in the beginning A-AMEM kills Gaussian components at a similar rate to A-EM. However, once the optimal number of Gaussian components is reached, A-AMEM quickly converges while A-EM struggles. In terms of accuracy, both A-AMEM and A-EM converge to the same solutions and yield the same final penalized log-likelihood value. The IRF and TRF for this case are presented in Table 4, again demonstrating a significant speed-up.

It is clear from Figs. 4 and 6 that A-AMEM does not remove Gaussian components much more efficiently than A-EM. Since a larger K_init requires more Gaussian killing, it delays the onset of the fast convergence stage of the iteration, resulting in the increase (degradation) of the IRF/TRF ratio between the K_init = 5 and K_init = 8 cases observed in Tables 3 and 4. This result highlights the importance of finding good guesses for the initial number of components. We discuss in detail our strategy for finding the best guess for K_init in Section 5.9.

The previous results in Sections 5.5 and 5.6 have demonstrated that the cost per iteration of A-AMEM is comparable to that of A-EM. To further highlight the importance of using the Taylor expansion for the monotonicity control step, we examine the performance of A-AMEM (vs. A-EM) using (19) instead of (20). This requires an extra evaluation of the penalized log-likelihood function (9) at every A-AMEM iteration. We consider the synthetic data sets with K_init = 5 and K_init = 8.
Fig. 6. Case K_init = 8: History of penalized log-likelihood values as a function of the number of iterations for synthetic data sets. (Left) VWS data set. (Center) PS data set. (Right) VPS data set. The blue and red dots indicate the iterations where Gaussians are killed.

TABLE 4
Reduction factors for K_init = 8.

        IRF    TRF    IRF/TRF
VWS     1.42   1.46   0.97
PS      5.56   4.12   1.35
VPS    11.26   6.73   1.67
Table 5 shows that the IRF/TRF ratios for these cases are ∼2, confirming that one iteration of A-AMEM without the Taylor expansion is about twice as expensive as one A-EM iteration.

TABLE 5
Reduction factors for K_init = 5, without using the Taylor expansion in the monotonicity control step.

        IRF    TRF    IRF/TRF
VWS     1.73   1.01   1.71
PS      4.51   2.43   1.86
VPS    10.90   5.07   2.15
As a final test for A-AMEM, we consider data sets generated from PIC simulations (see [74] for a detailed description of PIC and [17] for a specific PIC application of GMM). In particular, we consider the same 2D-3V Weibel electromagnetic instability [75] as in Ref. [17]. We partition the 2D spatial domain into cells, with a prescribed number of particles per cell per species. We run the simulations until time t = 50 (in inverse plasma frequency units) and record the velocity points of all particles within each cell. The particle velocities in the 3D velocity space in each cell provide the data set. We then test the algorithms with selected cells. By applying both A-AMEM and A-EM to these cells, we assume that the velocity distribution functions (VDFs) can be approximated by a Gaussian mixture model with an unknown number of components. At time t = 50, the simulations are in the nonlinear phase and the electron VDFs strongly deviate from the Maxwellian distribution. Therefore, for most of the cells, we expect at least a few (anisotropic) components.

For demonstration purposes, we choose cells 83, 155, 170, and 243, with the cell numbers defined lexicographically on the 2D spatial mesh from left to right and from bottom to top. We apply A-EM and A-AMEM to these data sets using K_init = 8 and K_init = 10 Gaussian components. The plots of penalized log-likelihood values as functions of iterations are given in Fig. 7, and the IRFs and TRFs are recorded in Table 6.
TABLE 6
Reduction factors for K_init = 8, for particle-in-cell data sets.

            IRF    TRF    IRF/TRF
Cell 83    2.25   1.95    1.15
Cell 155   7.15   6.16    1.16
Cell 170   3.69   3.25    1.14
Cell 243   3.86   3.34    1.16
We observe from Fig. 7 and Table 6 that A-AMEM converges two to six times faster than A-EM to the same solution, and that on average the largest accelerations and the smaller IRF/TRF ratios occur for smaller K_init, again emphasizing the need for good initial guesses for the number of components to maximize performance.

Fig. 7. Application of A-AMEM to PIC data sets with K_init = 8, 10: History of penalized log-likelihood values as a function of the number of iterations for selected particle-in-cell data sets. (From left to right) Cell 83, 155, 170, 243; (a) K_init = 8, (b) K_init = 10. The blue and red dots indicate the iterations where Gaussians are killed.

The previous results highlight the importance of choosing K_init wisely. In addition to the gap statistic (GS) method, we have explored other extensions of K-means clustering algorithms proposed in the literature [62], [63], [65], [66] to obtain a good initial guess of the number of components for both A-EM and A-AMEM. Firstly, we tested the so-called MAP-DP clustering method introduced in [62]. This approach works well for well-separated data sets, but not for poorly separated ones. We observed that, when the Gaussians are poorly separated, this approach usually yields a single K = 1 component. In addition, the effectiveness of the approach depends strongly on the choice of parameters, making it brittle. Secondly, we tried the mean-silhouette approach [64]–[66]. This approach works well for our data sets. However, it is very expensive, since it requires distance evaluations between all pairs of points in the data set, i.e., it has computational complexity of order O(N²D), much greater than the complexity of the K-means algorithm, O(NKD), for N ≫ K.

We find that GS [56] is cheap to use with K-means as the clustering method, and gives good estimates for the initial number of components for our data sets (as we show below). The initialization procedure for the Gaussian parameters using GS (as described in Section 4.2) is summarized in Algorithm 3. We use K_min = 2, K_max = 10, and K_adjust = 2 for our numerical simulations.

Algorithm 3
GS–K-means algorithm. Given a data set $X = (x_1, \ldots, x_N)$, the minimum number of clusters $K_{\min}$, and the maximum number of clusters $K_{\max}$:
1) Use the gap statistic (GS) method [56] to estimate the number of components among $K_{\min}, \ldots, K_{\max}$ clusters, i.e., $K_{\text{opt}} = \text{GS-method}(K_{\min}, K_{\max})$.
2) Set $K_{\text{init}} = K_{\text{opt}} + K_{\text{adjust}}$ for some $K_{\text{adjust}} > 0$ to further avoid possible under-estimation.
3) Perform the K-means algorithm with $K_{\text{init}}$ clusters multiple times and select the centroids from the best run, i.e., the one yielding the smallest inertia (SSE).
4) Set the initial means to be the centroids obtained in Step 3 and compute the initial values of the Gaussian weights and covariance matrices.

Fig. 8 shows the number of components estimated by the GS–K-means method for the synthetic data sets. Table 7 presents numerical results for applying A-EM and A-AMEM initialized with GS–K-means to both synthetic and PIC data sets. We remark that, in these simulations, A-EM and A-AMEM converge to the same solutions. We observe that the number of components returned by the GS–K-means initialization is very accurate for the VWS and PS synthetic data sets. For the VPS data set, whose Gaussians are highly overlapping, the method under-estimates the number of components by only one. For the synthetic tests, the optimal final number of components $K_{\text{final}}$ returned by A-EM and A-AMEM equals the number of components predicted by GS–K-means. For the PIC data sets, we observe that GS–K-means under-estimates the optimal number of components by one or two groups, justifying the need to adjust $K_{\text{est}}$ by some amount $K_{\text{adjust}}$ in Alg. 3. Finally, we note that, on average, the wall-clock time of the GS–K-means initialization step is about an order of magnitude smaller than that of adaptive EM in our experiments, and this cost is in any case likely amortized by the performance gains from accurately guessing the number of components of the mixture.
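For concreteness, the following is a minimal Python sketch of the GS–K-means initialization of Algorithm 3, assuming the scikit-learn KMeans implementation and a simple uniform-reference version of the gap statistic [56]; the function names (`gap_statistic`, `gs_kmeans_init`) and parameters such as `n_refs` are illustrative and not part of our released code. Following Fig. 8, the sketch selects the $K$ with the largest gap value, whereas the original gap-statistic rule of [56] uses a standard-error criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def _best_kmeans(X, k, n_init=10, seed=0):
    # Best of n_init K-means runs; km.inertia_ is the within-cluster SSE (W_k).
    return KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(X)

def gap_statistic(X, k_min=2, k_max=10, n_refs=10, seed=0):
    # Gap(k) = E*[log W_k] - log W_k, with E* estimated from reference data drawn
    # uniformly over the bounding box of X (Tibshirani et al. [56]).
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ks = np.arange(k_min, k_max + 1)
    gaps = np.empty(len(ks))
    for i, k in enumerate(ks):
        log_wk = np.log(_best_kmeans(X, k).inertia_)
        ref = [np.log(_best_kmeans(rng.uniform(lo, hi, size=X.shape), k).inertia_)
               for _ in range(n_refs)]
        gaps[i] = np.mean(ref) - log_wk
    return int(ks[np.argmax(gaps)])  # K of largest GS value (cf. Fig. 8)

def gs_kmeans_init(X, k_min=2, k_max=10, k_adjust=2, seed=0):
    # Steps 1-4 of Algorithm 3: estimate K_opt, inflate by K_adjust, run K-means,
    # and build initial GMM weights, means, and covariances from the best run.
    k_init = gap_statistic(X, k_min, k_max, seed=seed) + k_adjust
    km = _best_kmeans(X, k_init, seed=seed)
    labels = km.labels_
    weights = np.array([np.mean(labels == k) for k in range(k_init)])
    means = km.cluster_centers_
    covs = np.array([np.cov(X[labels == k].T) for k in range(k_init)])
    return weights, means, covs
```

The returned weights, means, and covariances would then seed the (penalized) EM iteration; clusters left empty or nearly empty by K-means would in practice require a regularized covariance estimate, a detail omitted here for brevity.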
Fig. 8. Number of components estimated by GS–K-means for the synthetic data sets: (left) VWS data set, (center) PS data set, (right) VPS data set. Each panel shows the gap-statistic value versus the number of clusters, with the $K$ of largest GS value marked as the optimal $K$.

TABLE 7
Results of using A-EM and A-AMEM with GS–K-means initialization for both synthetic and particle-in-cell data sets. $K_{\text{final}}$ is the optimal number of groups returned by both A-EM and A-AMEM.

Data set    K_est   K_init   IRF     TRF     K_final
VWS         3       5        1.85    1.76    3
PS          3       5        4.39    4.03    3
VPS         2       4        2.28    2.08    2
Cell 83     2       4        1.84    1.87    3
Cell 155    2       4        12.02   11.85   4
Cell 170    3       5        4.80    4.53    5
Cell 243    2       4        3.63    3.26    3
CONCLUSION
We propose for the first time an accelerated, monotonicity-preserving algorithm for the adaptive EM-GMM algorithm that is significantly faster (in wall-clock time) than its non-accelerated counterpart. The method combines the minimum-message-length Bayesian information criterion with a monotonicity-controlled Anderson acceleration (AA) solver. The targeted use of exact score functions in the monotonicity-control step of AA avoids evaluations of the log-likelihood function, which are very expensive for GMM, and delivers an overall very competitive method. The resulting A-AMEM converges to the same solution as the non-accelerated version, strictly conserves up to the second moments of the observed data points, and preserves the positive definiteness of the solutions. The method has been tested on several synthetic data sets and on data sets generated from PIC simulations. It shows significant acceleration (from a few times up to more than an order of magnitude) in terms of both iteration count and wall-clock time. Finally, we have explored the use of a GS–K-means initialization strategy, which provides as good a guess for the number of components as practical, at a fraction of the cost of the adaptive EM algorithm. By eliminating the guesswork in the number of components (and thus much of the culling of unneeded mixture components that it entails), this strategy significantly enhances both the efficiency and robustness of our approach in practical applications.

APPENDIX
DERIVATION OF THE DERIVATIVES OF THE LOG-LIKELIHOOD FUNCTION W.R.T. THE GAUSSIAN PARAMETERS
We consider the log-likelihood function with particle weights $\zeta_j$, $j = 1, \ldots, N$, as follows:

$$ L(\theta) = \sum_{j=1}^{N} \zeta_j \ln\left( \sum_{k=1}^{K} \omega_k\, G_k(x_j; \mu_k, \Sigma_k) \right) + \eta \left( \sum_{k=1}^{K} \omega_k - 1 \right), \qquad (26) $$

where $N$ is the total number of observed points in the data set, $K$ is the number of Gaussian components in the mixture, $x_j$, $j = 1, \ldots, N$, are the observed data points with weights $\zeta_j$, and $\eta \left( \sum_{k=1}^{K} \omega_k - 1 \right)$ is the Lagrange-multiplier term that enforces the normalization constraint $\sum_{k=1}^{K} \omega_k = 1$. We assume that $\sum_{j=1}^{N} \zeta_j = N$. The parameters $\omega_k$, $\mu_k$, and $\Sigma_k$ represent the weight, mean, and covariance matrix of the $k$-th Gaussian. The $k$-th Gaussian distribution in $D$-dimensional space is defined as

$$ G_k(x; \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^D |\Sigma_k|}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}. \qquad (27) $$

Firstly, we employ the following identities from the Matrix Cookbook [76], valid for symmetric $W$ and $A$:

$$ \frac{\partial}{\partial s} (x - s)^T W (x - s) = -2 W (x - s), \qquad (28) $$
$$ \frac{\partial}{\partial A} v^T A^{-1} v = -A^{-1} v v^T A^{-1}, \qquad (29) $$
$$ \frac{\partial}{\partial A} |A| = |A|\, A^{-1}, \qquad (30) $$

where $x$, $s$, $v$ are vectors and $W$, $A$ are matrices.

Secondly, we define the following quantities:

$$ f(x_j) = f(x_j; \theta) = \sum_{l=1}^{K} \omega_l\, G_l(x_j; \mu_l, \Sigma_l), \qquad (31) $$
$$ r_{jk} = \frac{\zeta_j\, \omega_k\, G_k(x_j; \mu_k, \Sigma_k)}{f(x_j)}, \qquad (32) $$

where $\theta = (\theta_1, \ldots, \theta_K)$ and $\theta_k = (\omega_k, \mu_k, \Sigma_k)$, $k = 1, \ldots, K$, with $K$ the total number of components in the Gaussian mixture (GM).

Next, taking the derivative of $L(\theta)$ w.r.t. the mean $\mu_k$, we have

$$ \frac{\partial L(\theta)}{\partial \mu_k} = \sum_{j=1}^{N} \frac{\zeta_j}{f(x_j)} \frac{\partial}{\partial \mu_k} \left[ \frac{\omega_k\, e^{-\frac{1}{2}(x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k)}}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \right] = \sum_{j=1}^{N} \frac{\zeta_j\, \omega_k\, G_k(x_j; \mu_k, \Sigma_k)}{f(x_j)} \, \frac{\partial \left( -\frac{1}{2}(x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k) \right)}{\partial \mu_k} = -\frac{1}{2} \sum_{j=1}^{N} r_{jk} \frac{\partial}{\partial \mu_k} \left( (x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k) \right) = \sum_{j=1}^{N} r_{jk}\, \Sigma_k^{-1} (x_j - \mu_k), \qquad (33) $$

where in the last equality of (33) we use (28).

Taking the derivative of $L(\theta)$ w.r.t. the covariance matrix $\Sigma_k$, we have

$$ \frac{\partial L(\theta)}{\partial \Sigma_k} = \sum_{j=1}^{N} \frac{\zeta_j}{f(x_j)} \frac{\partial}{\partial \Sigma_k} \left[ \frac{\omega_k\, e^{-\frac{1}{2}(x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k)}}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \right] = \sum_{j=1}^{N} \frac{\zeta_j\, \omega_k}{f(x_j)} \left\{ \frac{e^{-\frac{1}{2}(x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k)}}{(2\pi)^{D/2}} \frac{\partial |\Sigma_k|^{-1/2}}{\partial \Sigma_k} + G_k(x_j; \mu_k, \Sigma_k) \frac{\partial}{\partial \Sigma_k} \left[ -\tfrac{1}{2}(x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k) \right] \right\} = \sum_{j=1}^{N} \frac{\zeta_j\, \omega_k}{f(x_j)} \left\{ -\tfrac{1}{2}\, G_k(x_j; \mu_k, \Sigma_k)\, \Sigma_k^{-1} + \tfrac{1}{2}\, G_k(x_j; \mu_k, \Sigma_k)\, \Sigma_k^{-1} (x_j - \mu_k)(x_j - \mu_k)^T \Sigma_k^{-1} \right\} = \sum_{j=1}^{N} r_{jk} \left\{ -\tfrac{1}{2} \Sigma_k^{-1} + \tfrac{1}{2} \Sigma_k^{-1} (x_j - \mu_k)(x_j - \mu_k)^T \Sigma_k^{-1} \right\}, \qquad (34) $$

in which we use (29) and (30) to go from the second equality to the third.

Taking the derivative of $L(\theta)$ w.r.t. the weight $\omega_k$ subject to the constraint $\sum_{k=1}^{K} \omega_k = 1$, we have

$$ \frac{\partial L(\theta)}{\partial \omega_k} = \frac{\partial}{\partial \omega_k} \left\{ \sum_{j=1}^{N} \zeta_j \ln\left( \sum_{k=1}^{K} \omega_k\, G_k(x_j; \mu_k, \Sigma_k) \right) + \eta \left( \sum_{k=1}^{K} \omega_k - 1 \right) \right\} = \sum_{j=1}^{N} \frac{\zeta_j\, G_k(x_j; \mu_k, \Sigma_k)}{f(x_j)} + \eta = \frac{1}{\omega_k} \sum_{j=1}^{N} r_{jk} + \eta, \qquad (35) $$

where $\eta = -N$ in (35). In the case of the penalized log-likelihood function, taking the penalty term into account, we have

$$ \frac{\partial PL(\theta)}{\partial \omega_k} = \frac{\partial}{\partial \omega_k} \left\{ \sum_{j=1}^{N} \zeta_j \ln\left( \sum_{k=1}^{K} \omega_k\, G_k(x_j; \mu_k, \Sigma_k) \right) - 0.5\,T \sum_{k=1}^{K} \ln(\omega_k) + \eta \left( \sum_{k=1}^{K} \omega_k - 1 \right) \right\} = \sum_{j=1}^{N} \frac{\zeta_j\, G_k(x_j; \mu_k, \Sigma_k)}{f(x_j)} - \frac{0.5\,T}{\omega_k} + \eta = \frac{1}{\omega_k} \sum_{j=1}^{N} r_{jk} - \frac{0.5\,T}{\omega_k} + \eta, \qquad (36) $$

where $\eta = -N + 0.5\,TK$ for this case.

Setting (33), (34), and (35) equal to zero and using $\eta = -N$, we arrive at

$$ \Sigma_k^{-1} \left( \sum_{j=1}^{N} r_{jk} (x_j - \mu_k) \right) = 0 \;\Longleftrightarrow\; \mu_k = \frac{\sum_{j=1}^{N} r_{jk}\, x_j}{\sum_{j=1}^{N} r_{jk}}, \qquad (37) $$

$$ \tfrac{1}{2} \Sigma_k^{-1} \left( \sum_{j=1}^{N} r_{jk} \left\{ -\Sigma_k + (x_j - \mu_k)(x_j - \mu_k)^T \right\} \right) \Sigma_k^{-1} = 0 \;\Longleftrightarrow\; \Sigma_k = \frac{\sum_{j=1}^{N} r_{jk} (x_j - \mu_k)(x_j - \mu_k)^T}{\sum_{j=1}^{N} r_{jk}}, \qquad (38) $$

and

$$ \frac{1}{\omega_k} \sum_{j=1}^{N} r_{jk} = -\eta = N \;\Longleftrightarrow\; \omega_k = \frac{\sum_{j=1}^{N} r_{jk}}{N}. \qquad (39) $$

Similarly, for the penalized log-likelihood case, setting (36) to zero and using $\eta = -N + 0.5\,TK$, we have

$$ \frac{1}{\omega_k} \sum_{j=1}^{N} r_{jk} - \frac{0.5\,T}{\omega_k} = -\eta = N - 0.5\,TK \;\Longleftrightarrow\; \omega_k = \frac{\sum_{j=1}^{N} r_{jk} - 0.5\,T}{N - 0.5\,TK}. \qquad (40) $$

Here, we note that $\frac{\partial L(\theta)}{\partial \mu_k} = \frac{\partial PL(\theta)}{\partial \mu_k}$ and $\frac{\partial L(\theta)}{\partial \Sigma_k} = \frac{\partial PL(\theta)}{\partial \Sigma_k}$.
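As a quick sanity check on (33), the following minimal Python sketch (an illustration, not part of the paper's code) compares the analytic score with a central finite-difference gradient of the weighted log-likelihood (26) for a small random GMM; the toy dimensions and the helper names (`loglike`, `score_mu`, `num_grad_mu`) are assumptions for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N, D, K = 200, 3, 2
X = rng.normal(size=(N, D))
zeta = np.full(N, 1.0)                      # particle weights, sum equals N
w = np.array([0.4, 0.6])                    # mixture weights
mu = rng.normal(size=(K, D))
Sig = np.array([np.eye(D) * s for s in (0.5, 1.5)])

def loglike(mu_k, k):
    # Weighted log-likelihood, Eq. (26) without the Lagrange term (it does not
    # depend on mu_k), with only the k-th mean treated as variable.
    mus = mu.copy(); mus[k] = mu_k
    f = sum(w[l] * multivariate_normal.pdf(X, mus[l], Sig[l]) for l in range(K))
    return np.sum(zeta * np.log(f))

def score_mu(k):
    # Analytic gradient w.r.t. mu_k, Eq. (33): sum_j r_jk Sigma_k^{-1} (x_j - mu_k).
    gk = np.array([w[l] * multivariate_normal.pdf(X, mu[l], Sig[l]) for l in range(K)])
    r_jk = zeta * gk[k] / gk.sum(axis=0)
    return np.linalg.solve(Sig[k], (r_jk[:, None] * (X - mu[k])).sum(axis=0))

def num_grad_mu(k, eps=1e-5):
    # Central finite differences of the log-likelihood w.r.t. mu_k.
    g = np.zeros(D)
    for d in range(D):
        e = np.zeros(D); e[d] = eps
        g[d] = (loglike(mu[k] + e, k) - loglike(mu[k] - e, k)) / (2 * eps)
    return g

print(np.allclose(score_mu(0), num_grad_mu(0), rtol=1e-4, atol=1e-6))  # expect True
```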
ACKNOWLEDGMENT

This work was supported by the U.S. Department of Energy, Office of Science, through the Scientific Discovery through Advanced Computing (SciDAC) Fusion Energy Sciences/Advanced Scientific Computing Research partnership program. This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, and was performed under the auspices of the National Nuclear Security Administration of the U.S. Department of Energy at Los Alamos National Laboratory, managed by Triad National Security, LLC under contract 89233218CNA000001.
REFERENCES

[1] D. M. Titterington, A. F. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.
[2] G. J. McLachlan and D. Peel, Finite Mixture Models, ser. Wiley Series in Probability and Statistics: Applied Probability and Statistics. John Wiley & Sons, 2004.
[3] G. J. McLachlan, S. X. Lee, and S. I. Rathnayake, "Finite mixture models," Annual Review of Statistics and Its Application, vol. 6, pp. 355–378, 2019.
[4] S. Fruhwirth-Schnatter, G. Celeux, and C. P. Robert, Handbook of Mixture Analysis. CRC Press, 2019.
[5] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
[6] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[7] S. Liu, J. McGree, Z. Ge, and Y. Xie, Computational and Statistical Methods for Analysing Big Data with Applications. Academic Press, 2016.
[8] I. Ullah and K. Mengersen, "Bayesian mixture models and their big data implementations with application to invasive species presence-only data," Journal of Big Data, vol. 6, no. 29, 2019.
[9] B. Bouchon-Meunier, G. Coletti, and R. R. Yager, Modern Information Processing. Elsevier Science, 2006.
[10] Z. Yu, C. Chen, X. Zheng, W. Ding, and D. Chen, "Context-aware trust aided recommendation via ontology and Gaussian mixture model in big data environment," IEEE, 2014, pp. 85–90.
[11] H. Bi, H. Tang, G. Yang, H. Shu, and J.-L. Dillenseger, "Accurate image segmentation using Gaussian mixture model with saliency map," Pattern Analysis and Applications, vol. 21, pp. 869–878, 2018.
[12] C.-A. Deledalle, S. Parameswaran, and T. Q. Nguyen, "Image denoising with generalized Gaussian mixture model patch priors," SIAM Journal on Imaging Sciences, vol. 11, no. 4, pp. 2568–2609, 2018.
[13] K. Kalti and M. Mahjoub, "Image segmentation by Gaussian mixture models and modified FCM algorithm," The International Arab Journal of Information Technology, vol. 11, no. 1, pp. 11–18, 2014.
[14] T. M. Nguyen and J. Wu, "Dirichlet Gaussian mixture model: Application to image segmentation," Image and Vision Computing, vol. 29, no. 12, pp. 818–828, 2011.
[15] R. Farnoosh and B. Zarpak, "Image segmentation using Gaussian mixture model," IUST International Journal of Engineering Science, vol. 9, no. 1–2, pp. 29–32, 2008.
[16] A. Alekseenko, T. Nguyen, and A. Wood, "A deterministic-stochastic method for computing the Boltzmann collision integral in O(MN) operations," Kinetic & Related Models, vol. 11, no. 5, pp. 1211–1234, 2018.
[17] G. Chen, L. Chacon, and T. Nguyen, "An unsupervised machine-learning checkpoint-restart algorithm using Gaussian mixtures for particle-in-cell simulations," submitted to Journal of Computational Physics, arXiv:2007.12273, 2020.
[18] R. Dupuis, M. V. Goldman, D. L. Newman, J. Amaya, and G. Lapenta, "Characterizing magnetic reconnection regions using Gaussian mixture models on particle velocity distributions," The Astrophysical Journal, vol. 889, no. 1, p. 22, 2020.
[19] A. Barb, "Gaussian mixture models for semantic ranking in domain specific databases with application in radiology," Central European Journal of Computer Science, vol. 1, no. 3, pp. 266–279, 2011.
[20] D. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1–3, pp. 19–41, 2000.
[21] B. Plataniotis, "Gaussian mixtures and their applications to signal processing," in Advanced Signal Processing Handbook: Theory and Implementation for Radar, Sonar, and Medical Imaging Real Time Systems, S. Stergiopoulos, Ed. Taylor & Francis Group, 2000, ch. 3.
[22] D. Yu and L. Deng, "Gaussian mixture models," in Automatic Speech Recognition. Springer, 2015, pp. 13–21.
[23] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.
[24] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Review, vol. 26, no. 2, pp. 195–239, 1984.
[25] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. John Wiley & Sons, 2007, vol. 382.
[26] G. J. McLachlan and S. Rathnayake, "On the number of components in a Gaussian mixture model," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 5, pp. 341–355, 2014.
[27] M. A. Figueiredo and A. K. Jain, "Unsupervised selection and estimation of finite mixture models," in Proceedings 15th International Conference on Pattern Recognition (ICPR-2000), vol. 2. IEEE, 2000, pp. 87–90.
[28] ——, "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381–396, 2002.
[29] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length. Berlin, Heidelberg: Springer-Verlag, 2005.
[30] M. Hansen and B. Yu, "Model selection and the principle of minimum description length," Journal of the American Statistical Association, vol. 96, no. 454, pp. 746–774, 2001.
[31] A. Corduneanu and C. M. Bishop, "Variational Bayesian model selection for mixture distributions," in Artificial Intelligence and Statistics, vol. 2001. Morgan Kaufmann, Waltham, MA, 2001, pp. 27–34.
[32] K. Lange, Optimization. Springer Science & Business Media, 2013.
[33] X.-L. Meng and D. B. Rubin, "Maximum likelihood estimation via the ECM algorithm: A general framework," Biometrika, vol. 80, no. 2, pp. 267–278, 1993.
[34] C. Liu and D. B. Rubin, "The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence," Biometrika, vol. 81, no. 4, pp. 633–648, 1994.
[35] J. A. Fessler and A. O. Hero, "Space-alternating generalized Expectation-Maximization algorithm," IEEE Transactions on Signal Processing, vol. 42, no. 10, pp. 2664–2677, 1994.
[36] X.-L. Meng and D. Van Dyk, "The EM algorithm, an old folk song sung to a fast new tune," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 59, no. 3, pp. 511–567, 1997.
[37] C. Liu, D. B. Rubin, and Y. N. Wu, "Parameter expansion to accelerate EM: The PX-EM algorithm," Biometrika, vol. 85, no. 4, pp. 755–770, 1998.
[38] G. Celeux, S. Chrétien, F. Forbes, and A. Mkhadri, "A component-wise EM algorithm for mixtures," Journal of Computational and Graphical Statistics, vol. 10, no. 4, pp. 697–712, 2001.
[39] J. H. Wolfe, "Pattern clustering by multivariate mixture analysis," Multivariate Behavioral Research, vol. 5, no. 3, pp. 329–350, 1970.
[40] T. A. Louis, "Finding the observed information matrix when using the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 44, no. 2, pp. 226–233, 1982.
[41] R. Varadhan and C. Roland, "Simple and globally convergent methods for accelerating the convergence of any EM algorithm," Scandinavian Journal of Statistics, vol. 35, no. 2, pp. 335–353, 2008.
[42] A. Berlinet and C. Roland, "Acceleration schemes with application to the EM algorithm," Computational Statistics & Data Analysis, vol. 51, no. 8, pp. 3689–3702, 2007.
[43] M. Jamshidian and R. I. Jennrich, "Conjugate gradient acceleration of the EM algorithm," Journal of the American Statistical Association, vol. 88, no. 421, pp. 221–228, 1993.
[44] R. Salakhutdinov, S. T. Roweis, and Z. Ghahramani, "Optimization with EM and expectation-conjugate-gradient," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 672–679.
[45] Y. He and C. Liu, "The dynamic 'expectation-conditional maximization either' algorithm," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 74, no. 2, pp. 313–336, 2012.
[46] K. Lange, "A quasi-Newton acceleration of the EM algorithm," Statistica Sinica, vol. 5, pp. 1–18, 1995.
[47] M. Jamshidian and R. I. Jennrich, "Acceleration of the EM algorithm by using quasi-Newton methods," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 59, no. 3, pp. 569–587, 1997.
[48] N. Henderson and R. Varadhan, "Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms," Journal of Computational and Graphical Statistics, vol. 28, no. 4, pp. 834–846, 2019.
[49] H. Zhou, D. Alexander, and K. Lange, "A quasi-Newton acceleration for high-dimensional optimization algorithms," Statistics and Computing, vol. 21, no. 2, pp. 261–273, 2011.
[50] I. Meilijson, "A fast improvement to the EM algorithm on its own terms," Journal of the Royal Statistical Society: Series B (Methodological), vol. 51, no. 1, pp. 127–138, 1989.
[51] K. Lange, "A gradient algorithm locally equivalent to the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 2, pp. 425–437, 1995.
[52] D. G. Anderson, "Iterative procedures for nonlinear integral equations," J. Assoc. Comput. Mach., vol. 12, pp. 547–560, 1965.
[53] H.-r. Fang and Y. Saad, "Two classes of multisecant methods for nonlinear acceleration," Numerical Linear Algebra with Applications, vol. 16, no. 3, pp. 197–221, 2009.
[54] H. F. Walker and P. Ni, "Anderson acceleration for fixed-point iterations," SIAM Journal on Numerical Analysis, vol. 49, no. 4, pp. 1715–1735, 2011.
[55] J. H. Plasse, "The EM algorithm in multivariate Gaussian mixture models using Anderson acceleration," Master's thesis, Worcester Polytechnic Institute, April 2013.
[56] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 63, no. 2, pp. 411–423, 2001.
[57] B. S. Everitt, "Finite mixture distributions," Wiley StatsRef: Statistics Reference Online, 2014.
[58] V. Hasselblad, "Estimation of parameters for a mixture of normal distributions," Technometrics, vol. 8, no. 3, pp. 431–444, 1966.
[59] J. Behboodian, "On a mixture of normal distributions," Biometrika, vol. 57, Part 1, pp. 215–217, 1970.
[60] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms. USA: Cambridge University Press, 2003.
[61] J. Rousseau and K. Mengersen, "Asymptotic behaviour of the posterior distribution in overfitted mixture models," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 689–710, 2011.
[62] Y. Raykov, A. Boukouvalas, F. Baig, and M. Little, "What to do when K-means clustering fails: A simple yet principled alternative algorithm," PLoS ONE, vol. 11, no. 9, 2016.
[63] C. Darken and J. Moody, "Fast adaptive K-means clustering: Some empirical results," vol. 2. IEEE, 1990, pp. 233–238.
[64] Scikit-Learn Library, Selecting the Number of Clusters with Silhouette Analysis on K-Means Clustering, 2007-2020. [Online]. Available: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
[65] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[66] S. Nanjundan, S. Sankaran, C. R. Arjun, and P. Anand, "Identifying the number of clusters for K-means: A hypersphere density based approach," in International Conference on Computers, Communication and Signal Processing, 2019.
[67] W. Xiang, A. Karfoul, C. Yang, H. Shu, and R. L. B. Jeannès, "An exact line search scheme to accelerate the EM algorithm: Application to Gaussian mixture models identification," Journal of Computational Science, vol. 41, p. 101073, 2020.
[68] N. N. Carlson and K. Miller, "Design and application of a gradient-weighted moving finite element code I: In one dimension," SIAM J. Sci. Comput., vol. 19, no. 3, pp. 728–765, 1998.
[69] H. An, X. Jia, and H. F. Walker, "Anderson acceleration and application to the three-temperature energy equations," J. Comput. Phys., vol. 347, no. 15, pp. 1–19, 2017.
[70] X. Chen and C. Kelley, "Convergence of the EDIIS algorithm for nonlinear equations," SIAM Journal on Scientific Computing, vol. 41, no. 1, pp. A365–A379, 2019.
[71] P. R. Pratapa and P. Suryanarayana, "Restarted Pulay mixing for efficient and robust acceleration of fixed-point iterations," Chemical Physics Letters, vol. 635, pp. 69–74, 2015.
[72] D. Arthur and S. Vassilvitskii, "K-means++: The advantages of careful seeding," in Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.
[73] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[74] C. K. Birdsall and A. B. Langdon, Plasma Physics via Computer Simulation. CRC Press, 2004.
[75] E. S. Weibel, "Spontaneously growing transverse waves in a plasma due to an anisotropic velocity distribution,"