A Mean Field Games approach to Cluster Analysis
Laura Aquilanti, Simone Cacace, Fabio Camilli and Raul De Maio
Abstract
In this paper, we develop a Mean Field Games approach to Cluster Analysis. We consider a finite mixture model, given by a convex combination of probability density functions, to describe the given data set. We interpret a data point as an agent of one of the populations represented by the components of the mixture model, and we introduce a corresponding optimal control problem. In this way, we obtain a multi-population Mean Field Games system which characterizes the parameters of the finite mixture model. Our method can be interpreted as a continuous version of the classical Expectation-Maximization algorithm.
AMS-Subject Classification:
Keywords: mixture model; Cluster Analysis; Expectation-Maximization algorithm; Mean Field Games; multi-population model

1 Introduction

Cluster Analysis, a classical problem in unsupervised Machine Learning, concerns the repartition of a group of data points into subgroups, in such a way that the elements within a same group are more similar to each other than they are to the elements of a different one. Clustering methods can be used either to infer specific patterns in rough data sets, or as a preliminary analysis in supervised Machine Learning algorithms (for a review of the applications of cluster analysis, see [6, 23]). Algorithms for Cluster Analysis are generically divided into two classes: hard clustering methods, whose prototype is the K-means algorithm and its variants, for which each data point belongs only to a single subgroup [8]; and soft clustering methods, such as the Expectation-Maximization and the Fuzzy K-means algorithms, for which each data point has a certain degree or probability to belong to each cluster [7, 21].

Most of the mathematical literature on Cluster Analysis, and more generally on Machine Learning, is in the framework of finite-dimensional optimization. But recently, different approaches based on partial differential equations and infinite-dimensional optimization have begun to be pursued [11, 12, 15]. Even if, in most of the cases, it is not clear if these new approaches are computationally competitive with the traditional ones, they offer new points of view and insights in this rapidly developing field.

In this paper, we propose an approach to Cluster Analysis based on Mean Field Games (MFG in short). This theory was introduced by Lasry and Lions [19] in the framework of differential games with a large number of rational agents. Although it was realized from the beginning that, in perspective, one of the most promising applications of MFG theory could be "Big Data" analysis [18], up to now most of the applications are in the fields of Economics and Financial Mathematics, and only very few papers deal with data analysis problems [22, 5]. Nevertheless, a MFG version of the classical hard-clustering K-means algorithm has been recently proposed in [14]. Relying on a similar approach, here we deal with a MFG version of the soft-clustering Expectation-Maximization algorithm.

As in the classical Expectation-Maximization (EM in short) algorithm, the starting point of our analysis is a finite mixture model given by convex combinations of probability density functions (PDFs in short)
\[
m(x) = \sum_{k=1}^{K} \alpha_k m_k(x), \quad x \in \mathbb{R}^d, \qquad \text{with } \sum_{k=1}^{K} \alpha_k = 1.
\]
Finite mixture models are powerful probabilistic techniques to describe complex scenarios, e.g. several regions with high mass density, and populations composed by different sub-populations, each one represented by a PDF. Generally, the components of the mixture are assumed to be parametrized PDFs. For example, for a Gaussian mixture model, the parameters are represented by the mean and the variance of each component. Then, the EM algorithm is a technique which allows to compute the unknown coefficients (weights and parameters) of the mixture by incrementally increasing the maximum likelihood with respect to the data set.

In our approach, we are given a PDF $f$ on $\mathbb{R}^d$ representing the data set, and we aim to find a corresponding mixture model in order to subdivide the points in $K$ clusters.
We interpret the data points as agents belonging with a certain probability to one of the $K$ sub-populations, which are described by a PDF $m_k$ with appropriate weights $\alpha_k$ on the entire population. The similarity, or proximity, among the members of a same sub-population is encoded in the cost functions of the optimal control problems for each sub-population, which push the agents to aggregate around the closer barycentre of the distributions $m_k$.

In order to characterize the components of the mixture model, i.e. the PDFs $m_k$ and the weights $\alpha_k$, we first consider the case of a quadratic MFG system
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + \frac{1}{2}|Du_k(x)|^2 + \lambda_k = \frac{1}{2}(x-\mu_k)^t(\Sigma_k^{-1})^t\Sigma_k^{-1}(x-\mu_k), & x \in \mathbb{R}^d,\\
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,Du_k(x)) = 0, & x \in \mathbb{R}^d,\\
\alpha_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx,\\
m_k \ge 0, \quad \int_{\mathbb{R}^d} m_k(x)\,dx = 1, \quad u_k(\mu_k) = 0,
\end{cases}
\tag{1.1}
\]
for $k = 1, \dots, K$, where the coupling among the different populations is encoded in the right hand side of the first equation. Indeed, the mean $\mu_k$ and the covariance matrix $\Sigma_k$ of the population $m_k$ depend, besides the data set $f$, on the whole measure $m$ through the quantities $\gamma_k(x) = \alpha_k m_k(x)/m(x)$. The latter play a central role in the model because they represent the degree of membership of agents to a given population $k$. We introduce a notion of consistency with the data set, and we show that a solution of the multi-population MFG system (1.1) is given by a consistent Gaussian mixture with the same mean and variance of $f$. Moreover, any consistent Gaussian mixture can be obtained as a solution of the MFG system. We remark that the main difference between our approach and the classical EM algorithm, which is an iterative procedure, is that the MFG method gives the partition into clusters as the result of the solution of a stationary system of PDEs.

Then we consider a stationary multi-population MFG system, where the Hamiltonian and the coupling terms can be very general, and we show that it always admits a solution. Note that, in this framework, it is not reasonable to expect uniqueness, since there may be several admissible partitions of a given data set. We also consider a two-step numerical approximation of the MFG system, inspired by the classical EM algorithm, and we test the method on discrete and continuous data sets.

The paper is organized as follows. In Section 2, following [6, 7], we briefly introduce the K-means and the Expectation-Maximization algorithms, in their classical approach. In Section 3, we first describe the approach in [14] and then we propose a quadratic multi-population MFG model for Gaussian finite mixture models. In Section 4, we study a general class of MFG systems for Cluster Analysis. Finally, in Section 5, we perform some numerical tests to confirm the theoretical results.

2 Classical algorithms for Cluster Analysis

In this section, we briefly review some classical algorithms for Cluster Analysis. We start by describing the
K-means algorithm, a hard clustering technique, and then the
Expectation-Maximization algorithm, a soft clustering one. For a complete introduction to the subject, we refer to [6, Chap. 9].

In the following, we are given a set of observable data $\mathcal{X} = \{x_1, \dots, x_I\}$ in $\mathbb{R}^d$ and a fixed number $K$ of clusters $S_j$, $j = 1, \dots, K$, into which we want to partition our observations.

2.1 K-means algorithm

The K-means algorithm is an iterative procedure that assigns points to the nearest cluster with respect to the Euclidean distance, even though other distances can be considered. We denote by $\mu = \{\mu_1, \mu_2, \dots, \mu_K\}$, $\mu_k \in \mathbb{R}^d$, the vector of barycentres of the clusters $\{S_1, S_2, \dots, S_K\}$. Moreover, we introduce a vector of cluster assignments, $c = (c_1, c_2, \dots, c_I)$, $c_i \in \{1, \dots, K\}$, such that for every vector $x_i \in \mathcal{X}$, we have $c_i = k \Leftrightarrow x_i \in S_k$. The goal is to find the vectors $\mu$ and $c$ in order to minimize the functional
\[
J = \sum_{i=1}^{I} \sum_{k=1}^{K} \mathbf{1}_{\{c_i = k\}}\, |x_i - \mu_k|^2,
\tag{2.1}
\]
representing the sum of the squared distances of the data points to the corresponding barycentres ($\mathbf{1}$ denotes the characteristic function). Starting from an arbitrary assignment for the vector $\mu$, the $n$-th iteration of the K-means algorithm alternates two steps:

Cluster assignment: minimization of the objective function $J$ with respect to $c$ by assigning the point $x_i$ to the closest barycentre, i.e.
\[
c_i^n = \arg\min_j |x_i - \mu_j^n|^2 \qquad \forall i = 1, \dots, I.
\]

Barycentre update: computation of the new barycentres $\mu_k^{n+1}$ of the clusters $S_k^n = \{x_i \in \mathcal{X} : c_i^n = k\}$, i.e.
\[
\mu_k^{n+1} = \frac{\sum_{i=1}^{I} x_i\,\mathbf{1}_{\{c_i^n = k\}}}{\sum_{i=1}^{I} \mathbf{1}_{\{c_i^n = k\}}} = \frac{\sum_{i=1}^{I} x_i\,\mathbf{1}_{\{c_i^n = k\}}}{\mathrm{card}(S_k^n)} \qquad \forall k = 1, \dots, K.
\]

At each iteration, the objective function $J$ decreases and the algorithm is repeated until there is no change in the cluster assignments, up to a tolerance error.
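To make the two alternating steps concrete, here is a minimal Python sketch of K-means for a finite data set (our illustration, not taken from [8]; the initialization and the stopping rule are simplified):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Minimal K-means: alternate cluster assignment and barycentre update
    until the assignment vector c stops changing."""
    X = np.asarray(X, dtype=float)          # data points, shape (I, d)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initial barycentres
    c = np.full(len(X), -1)
    for _ in range(max_iter):
        # cluster assignment: closest barycentre in Euclidean distance
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c_new = d2.argmin(axis=1)
        if np.array_equal(c_new, c):
            break
        c = c_new
        # barycentre update: mean of the points currently assigned to S_k
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return mu, c
```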
2.2 Mixture models and the Expectation-Maximization algorithm

Mixture models are probabilistic techniques used to represent complex scenarios of data, and they are related to soft clustering methods, which consider a probabilistic partition of the data points into clusters, whose corresponding parametrized probabilities are computed via the EM algorithm. In this setting, we assume that the data set $\mathcal{X}$ represents a set of (independent and identically distributed) observations of a random variable $X$, with unknown distribution that we aim to approximate with a distribution $P$. Moreover, we assume that $P$ is given by a mixture of probability densities $p_k(x|\theta_k)$, depending on some parameters $\theta_k$, with mixing coefficients $\alpha_k$ such that $\sum_{k=1}^K \alpha_k = 1$, i.e.
\[
P(x) = \sum_{k=1}^{K} \alpha_k\, p_k(x|\theta_k).
\tag{2.2}
\]
The purpose is to compute the parameters $\alpha = (\alpha_1, \dots, \alpha_K)$, $\theta = (\theta_1, \dots, \theta_K)$ of the mixture model (2.2) in order to maximize the log-likelihood objective function
\[
\ln P(\mathcal{X}|\alpha, \theta) = \sum_{i=1}^{I} \left( \ln \sum_{k=1}^{K} \alpha_k\, p_k(x_i|\theta_k) \right)
\]
with respect to the data set $\mathcal{X}$. To simplify the computation of the extrema of the previous functional, the data set $\mathcal{X}$ is considered as incomplete, and a random variable $\mathcal{Y} = \{y_i\}_{i=1}^{I}$ is introduced, whose values $y_i$ specify the component of $P$ generating the point, i.e. $y_i \in \{1, \dots, K\}$ and $y_i = k$ if and only if the point $x_i$ has been generated by the distribution $p_k$. We define the responsibility of $x_i$ with respect to the $k$-th cluster as
\[
\gamma_k(x_i) := p_k(y_i = k \,|\, x_i, \theta_k),
\tag{2.3}
\]
i.e. as the probability of an assignment $y_i$ once we have observed $x_i$; the EM algorithm aims to find a maximum of the expected value of the complete-data log-likelihood functional
\[
\mathbb{E}_{\mathcal{Y}}[\ln p(\mathcal{X}, \mathcal{Y}|\theta, \alpha)] = \sum_{i=1}^{I} \sum_{k=1}^{K} \gamma_k(x_i)\, \ln(\alpha_k\, p_k(x_i|\theta_k))
\tag{2.4}
\]
(see formula (9.40) in [6]). It takes a particularly simple and explicit form if we consider parametric distributions $p_k(x|\theta_k)$ given by Gaussian distributions $\mathcal{N}(x|\mu_k, \Sigma_k)$. Starting from an arbitrary initialization $\mu_k^0, \Sigma_k^0, \alpha_k^0$, $k = 1, \dots, K$, and computing the optimality conditions for the functional (2.4) with respect to $\gamma_k$ for $\alpha_k, \mu_k, \Sigma_k$ fixed, and then exchanging the roles, we get the following two alternating steps:

E-step: Given $\mu_k^{n-1}, \Sigma_k^{n-1}, \alpha_k^{n-1}$, $k = 1, \dots, K$, compute the posterior probability (responsibility) of a datum $x_i$ to belong to the cluster $S_k$
\[
\gamma_k^n(x_i) = P(y_i = k \,|\, x_i, \mu_k^{n-1}, \Sigma_k^{n-1}) = \frac{\alpha_k^{n-1}\,\mathcal{N}(x_i|\mu_k^{n-1}, \Sigma_k^{n-1})}{\sum_{j=1}^{K} \alpha_j^{n-1}\,\mathcal{N}(x_i|\mu_j^{n-1}, \Sigma_j^{n-1})}.
\]

M-step: Update the parameters $\alpha, \mu, \Sigma$ by setting, for $k = 1, \dots, K$,
\[
\alpha_k^n = \frac{\sum_{i=1}^{I} \gamma_k^n(x_i)}{I}, \qquad
\mu_k^n = \frac{\sum_{i=1}^{I} x_i\,\gamma_k^n(x_i)}{\sum_{i=1}^{I} \gamma_k^n(x_i)},
\tag{2.5}
\]
\[
\Sigma_k^n = \frac{\sum_{i=1}^{I} \gamma_k^n(x_i)(x_i - \mu_k^n)(x_i - \mu_k^n)^t}{\sum_{i=1}^{I} \gamma_k^n(x_i)}.
\tag{2.6}
\]

After the convergence of the algorithm, under some appropriate stopping criterion, we interpret the responsibility $\gamma_k(x_i)$ as the probability that $x_i$ belongs to the cluster $S_k$.
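Analogously, the E-step and M-step for a Gaussian mixture can be condensed into a few lines; the following sketch implements formulas (2.5)-(2.6) (the density evaluation relies on scipy.stats, the initialization is simplified, and no safeguard against degenerate covariances is included):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture on data X of shape (I, d)."""
    X = np.asarray(X, dtype=float)
    I, d = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0 / K)                       # mixing coefficients
    mu = X[rng.choice(I, size=K, replace=False)]      # random initial means
    Sigma = np.array([np.eye(d) for _ in range(K)])   # unit covariances
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k]
        dens = np.column_stack([
            alpha[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
            for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means and covariances, (2.5)-(2.6)
        Nk = gamma.sum(axis=0)
        alpha = Nk / I
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    return alpha, mu, Sigma, gamma
```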
It is interesting to observe that the K-means algorithm can be derived as a limit of the EM algorithm when the variances of the mixture model (2.4) go to zero. Indeed, assume for simplicity that the covariance matrix is $\Sigma_k = \sigma I$, $k = 1, \dots, K$, where $I$ is the identity $d \times d$ matrix and $\sigma > 0$, so that
\[
\gamma_k(x_i) = \frac{\alpha_k\, e^{-|x_i - \mu_k|^2/2\sigma}}{\sum_{j=1}^{K} \alpha_j\, e^{-|x_i - \mu_j|^2/2\sigma}}.
\]
For $\sigma \to 0$, the responsibility $\gamma_k(x_i)$, corresponding to the index $k$ minimizing the distance of the barycentre $\mu_k$ from $x_i$, tends to 1, while, for $j \neq k$, all the other responsibilities $\gamma_j(x_i)$ go to 0. Hence, at least formally, $\gamma_k(x_i) \to \mathbf{1}_{\{c_i = k\}}$ and the maximum likelihood functional (2.4) tends, up to a constant, to the K-means functional $J$ defined in (2.1) (see [6, Sec. 9.3.2]).

3 Mean Field Games models for Cluster Analysis

In this section, we describe some models for Cluster Analysis based on MFG theory. In the classical MFG theory (see [19]), we have a continuum of indistinguishable agents whose dynamics, driven by independent Brownian motions $W_t$, is given by
\[
\begin{cases}
dX_t = a_t\,dt + \sqrt{2\varepsilon}\,dW_t, & t > 0,\\
X_0 = x.
\end{cases}
\tag{3.1}
\]
The function $a_t$ represents the control which an agent chooses in order to minimize a cost functional $J$, to be specified according to the considered model. Most of the time, the cost $J$ comprises a term $L(x,a)$ depending on the current state $x$ and control $a$ of the agent, and a term $F(x,m)$ depending on the distribution $m$ of the other agents. For example, consider the long time average cost
\[
J(x,a) = \lim_{T \to +\infty} \frac{1}{T}\, \mathbb{E}_x \left\{ \int_0^T \left[ L(X_s, a_s) + F(X_s, m(X_s)) \right] ds \right\},
\]
where $m$ is a density function representing the distribution of the agents. Let $H(x,p) = \sup_{q \in \mathbb{R}^d} \{ p\cdot q - L(x,q) \}$ be the Hamiltonian function. Then the triplet $(u, m, \lambda)$, given by the ergodic cost $\lambda$, the corrector $u$ and the distribution of the agents $m$ corresponding to the choice of the optimal control $D_pH(x, Du)$, can be characterized as a solution of the MFG system
\[
\begin{cases}
-\varepsilon\Delta u(x) + H(x, Du(x)) + \lambda = F(x, m), & x \in \mathbb{R}^d,\\
\varepsilon\Delta m(x) + \mathrm{div}(D_pH(x, Du(x))\,m(x)) = 0, & x \in \mathbb{R}^d,\\
m \ge 0, \quad \int_{\mathbb{R}^d} m(x)\,dx = 1, \quad \int_{\mathbb{R}^d} u(x)\,dx = 0.
\end{cases}
\tag{3.2}
\]
The first equation is a Hamilton-Jacobi-Bellman equation for the couple $(\lambda, u)$; the second is a Fokker-Planck equation for the distribution $m$, which provides the density of the population at the position $x \in \mathbb{R}^d$; the normalization conditions are imposed since $u$ is defined up to a constant, and $m$ must represent a probability density function.

In some problems, as the ones we consider, several populations of agents, homogeneous within a given group but with different objectives and preferences from one group to the other, can interact in the same environment. In this case, the model can be described by a multi-population MFG system, where, for $k = 1, \dots, K$, each population satisfies a system such as (3.2), but with the cost function $F_k$ depending on $m_k$ and also on the distributions $m_j$, $j \neq k$, of the other populations (see [13, 19]).

We describe two approaches to Cluster Analysis based on MFG theory, which correspond respectively to the K-means algorithm and to the EM algorithm. We consider a data set given by the support of a measure with density function $f : \mathbb{R}^d \to \mathbb{R}$ such that
\[
f \ge 0, \qquad \int_{\mathbb{R}^d} f(x)\,dx = 1.
\tag{3.3}
\]
Note that the data set $\mathcal{X}$ in Section 2 corresponds to the case of an atomic measure
\[
P(x) = \frac{1}{I} \sum_{i=1}^{I} \delta_{x_i}(x).
\tag{3.4}
\]
The function $f$ describes the entire population of agents/data points, that we want to subdivide in $K$ sub-populations/clusters on the basis of some similar properties or characteristics, where $K$ is fixed a priori. Each population is represented by a density function $m_k$ and the similarity or proximity will be expressed by means of an appropriate cost function.
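As an aside, the dynamics (3.1) is straightforward to simulate; the following Euler-Maruyama sketch (our illustration; the feedback `a` is an arbitrary placeholder) can be used to visualize trajectories of a single agent:

```python
import numpy as np

def simulate_agent(x0, a, eps=1.0, dt=1e-3, n_steps=10_000, seed=0):
    """Euler-Maruyama discretization of dX_t = a(X_t) dt + sqrt(2*eps) dW_t."""
    rng = np.random.default_rng(seed)
    X = np.empty((n_steps + 1, len(x0)))
    X[0] = x0
    for n in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=len(x0))
        X[n + 1] = X[n] + a(X[n]) * dt + np.sqrt(2 * eps) * dW
    return X

# example: an affine drift pulling towards a point mu; for the quadratic
# models below, the optimal feedback -Du_k(x) = -eps * T_k^{-1} (x - nu_k)
# has exactly this form
mu = np.array([0.5, 0.5])
path = simulate_agent(np.zeros(2), lambda x: -(x - mu))
```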
3.1 A MFG model for the K-means algorithm

A MFG approach which mimics the K-means algorithm has been recently introduced in [14]. Since this approach is the basis for the EM algorithm we will subsequently study, we briefly describe it. It is worth noting that it is a hard clustering method, as the classical K-means algorithm, hence each data point can belong only to a single cluster $S_k$, $k = 1, \dots, K$.

Denote by $m = (m_1, \dots, m_K)$ and $u = (u_1, \dots, u_K)$ the vectors of the density functions and of the corresponding value functions. An agent of the $k$-th population moves according to the dynamics (3.1) and minimizes the infinite horizon cost functional
\[
u_k(x) = \inf_{a_s}\, \mathbb{E}_x \left[ \int_0^\infty \left( \frac{1}{2}|a_s|^2 + F_k(X_s, m_k(X_s)) \right) e^{-\rho s}\,ds \right],
\]
where $\rho > 0$, the discount factor, is a fixed constant.
The coupling cost is given by
\[
F_k(x, m_k) = 1 + \rho\,|x - y_k|^2,
\tag{3.5}
\]
where the barycentres $y_k$ are defined by
\[
y_k := \frac{\int_{\mathbb{R}^d} x\,m_k(x)\,dx}{\int_{\mathbb{R}^d} m_k(x)\,dx}.
\tag{3.6}
\]
Consider the multi-population MFG system, for $k = 1, \dots, K$,
\[
\begin{cases}
\rho u_k(x) - \varepsilon\Delta u_k(x) + \frac{1}{2}|Du_k(x)|^2 = F_k(x, m_k), & x \in \mathbb{R}^d,\\
\rho m_k(x) - \varepsilon\Delta m_k(x) - \mathrm{div}(Du_k(x)\,m_k(x)) = \rho\,\tilde f_k(x), & x \in \mathbb{R}^d,
\end{cases}
\tag{3.7}
\]
where $\tilde f_k$ is a Gaussian distribution with mean
\[
\tilde y_k = \frac{\int_{S_k} x\,f(x)\,dx}{\int_{S_k} f(x)\,dx}
\tag{3.8}
\]
and variance $\varepsilon$, and the cluster $S_k = S_k(u)$ related to $\tilde y_k$ is defined by
\[
S_k = \{ x \in \mathbb{R}^d : u_k(x) = \min_{j=1,\dots,K} u_j(x) \}.
\tag{3.9}
\]
The value function $u_k$ is intended as a measure of the distance from the barycentre $y_k$. We refer to [14] for a detailed interpretation of the system.

A family of vectors $\{\bar y_k\}_{k=1}^K$ is said to satisfy the self-consistency rule with the data set $f$ if
\[
\bar y_k = \frac{\int_{S_k} x\,f(x)\,dx}{\int_{S_k} f(x)\,dx}
\tag{3.10}
\]
for a family of clusters $\{S_k\}_{k=1}^K$.

In [14], the following two crucial results are proved:

(i) For every family of vectors $\{\bar y_k\}_{k=1}^K$ satisfying the self-consistency rule (3.10) for some family of clusters $\{S_k\}$, there exists a solution $(u,m)$ of (3.7) such that $\bar y_k = y_k$, with $y_k$ given by (3.6).

(ii) Given a solution $(u,m)$ of (3.7), the barycentres $y_k$ given by (3.6) satisfy the self-consistency rule (3.10) for $S_k = S_k(u)$ defined in (3.9).

It is important to observe that it is not reasonable to expect uniqueness of the solution to (3.7), since there exist in general several sets of barycentres $\{\bar y_k\}_{k=1}^K$ satisfying (3.10). This non-uniqueness property can be interpreted as the convergence to a local minimum in the classical K-means algorithm.

3.2 A MFG model for the EM algorithm

The aim is to find, by means of an appropriate multi-population MFG system, a mixture of $K$ density functions which gives an optimal representation of a data set, described by a density function $f$ as in (3.3). The resulting method is soft clustering, since each point/agent has a certain probability to belong to each one of the clusters.

In this section, in order to explain the method in a simple setting, we will describe a specific model for a Gaussian mixture, while the general method will be described in Section 4.

We introduce some preliminary notation. Given a mixture
\[
m(x) = \sum_{k=1}^{K} \alpha_k m_k(x),
\tag{3.11}
\]
where $\alpha_k \in (0,1)$, $\sum_{k=1}^K \alpha_k = 1$ and $m_k \ge 0$, $\int_{\mathbb{R}^d} m_k\,dx = 1$, we introduce the responsibilities
\[
\gamma_k(x) = \frac{\alpha_k m_k(x)}{m(x)}, \qquad x \in \mathbb{R}^d, \ k = 1, \dots, K.
\tag{3.12}
\]
We also define
\[
\mu_k = \frac{\int_{\mathbb{R}^d} x\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx},
\tag{3.13}
\]
\[
\Sigma_k = \frac{\int_{\mathbb{R}^d} (x-\mu_k)(x-\mu_k)^t\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx},
\tag{3.14}
\]
which are the mean $\mu_k \in \mathbb{R}^d$ and the covariance matrix $\Sigma_k \in \mathbb{R}^{d\times d}$ of the density function $m_k$ computed with respect to the data set $f$. Since, as we will see in the rest of the section, the function $m$ is given by a convex combination of Gaussian functions with positive weights $\alpha_k$, then $\gamma_k > 0$ in $\mathbb{R}^d$ and therefore also $\mu_k$ and $\Sigma_k$ are well defined.
Remark 3.1. Consider a finite data set $\mathcal{X} = \{x_1, \dots, x_I\}$ and the corresponding atomic measure $P$ defined in (3.4). Then, replacing the measure $f(x)\,dx$ in (3.13)-(3.14) with $P$, we obtain
\[
\mu_k = \frac{\sum_{i=1}^{I} \gamma_k(x_i)\,x_i}{\sum_{i=1}^{I} \gamma_k(x_i)}, \qquad
\Sigma_k = \frac{\sum_{i=1}^{I} \gamma_k(x_i)(x_i - \mu_k)(x_i - \mu_k)^t}{\sum_{i=1}^{I} \gamma_k(x_i)},
\]
which coincide with formulas (2.5), (2.6) used in the Maximization step of the EM algorithm.

Given the same type of dynamics (3.1) for all the populations, an agent of the $k$-th population wants to minimize the ergodic cost functional
\[
J_k(x,a) = \lim_{T \to +\infty} \frac{1}{T}\,\mathbb{E}_x \left\{ \int_0^T \left[ \frac{1}{2}|a_s|^2 + F_k(X_s, m_k(X_s), m(X_s)) \right] ds \right\},
\tag{3.15}
\]
where
\[
F_k(x, m_k, m) = \frac{1}{2}(x - \mu_k)^t (\Sigma_k^{-1})^t \Sigma_k^{-1} (x - \mu_k).
\tag{3.16}
\]
Note that the coupling among the various populations is given by the dependence of $\mu_k, \Sigma_k$ on the responsibility $\gamma_k$, which in turn depends on the measure $m$. Observe that $F_k$ is instead independent of $u_j$, $j = 1, \dots, K$. The coupling term forces a generic data point to distribute with a higher probability around the nearest mean point $\mu_k$, since it penalizes the square of the distance from it, with an attenuation factor given by the covariance matrix $\Sigma_k$.

At a formal level, the multi-population MFG system corresponding to the previous problem is given by
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + \frac{1}{2}|Du_k(x)|^2 + \lambda_k = \frac{1}{2}(x-\mu_k)^t(\Sigma_k^{-1})^t\Sigma_k^{-1}(x-\mu_k), & x \in \mathbb{R}^d,\\
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,Du_k(x)) = 0, & x \in \mathbb{R}^d,\\
\alpha_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx,\\
m_k \ge 0, \quad \int_{\mathbb{R}^d} m_k(x)\,dx = 1, \quad u_k(\mu_k) = 0,
\end{cases}
\tag{3.17}
\]
for $k = 1, \dots, K$, where $\gamma_k, \mu_k, \Sigma_k$ are defined as in (3.12)-(3.14).
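In the discrete setting of Remark 3.1, the quantities (3.12)-(3.14) are plain weighted averages over the data points. A small sketch of their computation (Gaussian components; the inputs alpha, mu, Sigma describe a given mixture, not the solution of (3.17)); as the remark states, the output coincides with the M-step quantities of Section 2:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities_and_moments(X, alpha, mu, Sigma):
    """Responsibilities (3.12) and data-weighted moments (3.13)-(3.14)
    for the atomic measure P = (1/I) sum_i delta_{x_i} of Remark 3.1."""
    I, d = X.shape
    K = len(alpha)
    dens = np.column_stack([
        alpha[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
        for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)     # gamma_k(x_i)
    Nk = gamma.sum(axis=0)
    mu_f = (gamma.T @ X) / Nk[:, None]                 # discrete (3.13)
    Sigma_f = np.empty((K, d, d))
    for k in range(K):
        Xc = X - mu_f[k]
        Sigma_f[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]   # discrete (3.14)
    return gamma, mu_f, Sigma_f
```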
Definition 3.2. We say that a Gaussian mixture $m(x) = \sum_{k=1}^K \beta_k\,\mathcal{N}(x|\nu_k, T_k)$, $x \in \mathbb{R}^d$, is consistent with the data set if
\[
\nu_k = \frac{\int_{\mathbb{R}^d} x\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx}, \qquad
T_k = \varepsilon\,\frac{\int_{\mathbb{R}^d} (x-\nu_k)(x-\nu_k)^t\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx}, \qquad
\beta_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx,
\tag{3.18}
\]
where the responsibilities $\gamma_k$ are defined as in (3.12) with $m_k(x) = \mathcal{N}(x|\nu_k, T_k)$.

Note that $\varepsilon$ can be considered as an additional parameter of the model, which we can tune to adjust the covariance matrix of the function $m_k$.
Proposition 3.3. Let $\varepsilon = 1$ and let $m$ be a Gaussian mixture consistent with the data set $f$. Then $m$ and $f$ have the same mean and covariance, i.e.
\[
\int_{\mathbb{R}^d} x\,m(x)\,dx = \int_{\mathbb{R}^d} x\,f(x)\,dx,
\tag{3.19}
\]
\[
\int_{\mathbb{R}^d} (x-\mu)(x-\mu)^t\,m(x)\,dx = \int_{\mathbb{R}^d} (x-\mu)(x-\mu)^t\,f(x)\,dx.
\tag{3.20}
\]
Proof. To prove (3.19), it suffices to apply the consistency conditions (3.18) to (3.11) and recall that $\sum_{k=1}^K \gamma_k(x) = 1$ to get
\[
\int_{\mathbb{R}^d} x\,m(x)\,dx = \sum_{k=1}^{K} \alpha_k \int_{\mathbb{R}^d} x\,m_k(x)\,dx
= \sum_{k=1}^{K} \alpha_k\,\frac{\int_{\mathbb{R}^d} x\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx}
= \sum_{k=1}^{K} \int_{\mathbb{R}^d} x\,\gamma_k(x) f(x)\,dx = \int_{\mathbb{R}^d} x\,f(x)\,dx.
\]
To prove (3.20), we first claim that
\[
\int_{\mathbb{R}^d} x x^t\,m(x)\,dx = \int_{\mathbb{R}^d} x x^t\,f(x)\,dx.
\tag{3.21}
\]
Indeed, multiplying the consistency conditions (3.18) by $\alpha_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx$, recalling that $\sum_{k=1}^K \gamma_k(x) = 1$ and summing over $k$, by the identity
\[
\sum_{k=1}^{K} \alpha_k \int_{\mathbb{R}^d} (x-\mu_k)(x-\mu_k)^t\,m_k(x)\,dx
= \sum_{k=1}^{K} \alpha_k\,\frac{\int_{\mathbb{R}^d} (x-\mu_k)(x-\mu_k)^t\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx},
\]
we easily get
\[
\int_{\mathbb{R}^d} x x^t\,m(x)\,dx - \sum_{k=1}^{K} \alpha_k \mu_k \mu_k^t
= \int_{\mathbb{R}^d} x x^t\,f(x)\,dx - \sum_{k=1}^{K} \alpha_k \mu_k \mu_k^t,
\]
and therefore the claim. Combining (3.21) with the identity (3.19), we get (3.20).

In the next result, we show that if there exists a mixture of Gaussian densities consistent with the data set $f$, then it solves (3.17) for appropriate $(u_k, \lambda_k)$, $k = 1, \dots, K$.
Proposition 3.4. Let $m$ be a mixture of Gaussian densities, i.e. $m(x) = \sum_{k=1}^K \beta_k\,\mathcal{N}(x|\nu_k, T_k)$ for $x \in \mathbb{R}^d$, consistent with the data set $f$. Then the family of quadruples $(u_k, \lambda_k, m_k, \beta_k)$, $k = 1, \dots, K$, with
\[
u_k(x) = \frac{\varepsilon}{2}(x - \mu_k)^t T_k^{-1} (x - \mu_k), \qquad
\lambda_k = \varepsilon^2\,\mathrm{Tr}(T_k^{-1}), \qquad
m_k(x) = \mathcal{N}(x|\mu_k, T_k), \qquad
\beta_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx,
\]
is a solution of (3.17).

Proof. Consider a component of the mixture $m_k$, which is a Gaussian density function
\[
\mathcal{N}(x|\nu_k, T_k) = \frac{1}{\sqrt{(2\pi)^d |T_k|}}\, e^{-\frac{1}{2}(x-\nu_k)^t T_k^{-1}(x-\nu_k)}.
\]
Recall that $(x - \nu_k) \in \mathbb{R}^d$ and $T_k^{-1} \in \mathbb{R}^{d\times d}$ is the inverse of the covariance matrix $T_k$. We have
\[
Dm_k(x) = -m_k(x)\,T_k^{-1}(x-\nu_k),
\]
\[
D^2 m_k(x) = m_k(x)\left[ (T_k^{-1}(x-\nu_k))(T_k^{-1}(x-\nu_k))^t - T_k^{-1} \right],
\]
\[
\Delta m_k(x) = m_k(x)\left[ (T_k^{-1}(x-\nu_k))^t (T_k^{-1}(x-\nu_k)) - \mathrm{Tr}(T_k^{-1}) \right],
\]
where $x \in \mathbb{R}^d$, $D^2$ denotes the Hessian and $\mathrm{Tr}$ the trace. Furthermore, given a matrix $B_k \in \mathbb{R}^{d\times d}$ and assuming that
\[
u_k(x) = \frac{1}{2}(x - \rho_k)^t B_k (x - \rho_k),
\]
we have $Du_k(x) = B_k(x - \rho_k)$ and $\Delta u_k(x) = \mathrm{Tr}(B_k)$. Substituting $m_k$ and $u_k$ as above in the second equation of (3.17), we get
\[
\varepsilon\left[ (T_k^{-1}(x-\nu_k))^t(T_k^{-1}(x-\nu_k)) - \mathrm{Tr}(T_k^{-1}) \right]
- (T_k^{-1}(x-\nu_k))^t(B_k(x-\rho_k)) + \mathrm{Tr}(B_k) = 0,
\]
which is satisfied for $\rho_k = \nu_k$ and $B_k = \varepsilon T_k^{-1}$. Hence we get
\[
u_k(x) = \frac{\varepsilon}{2}(x - \nu_k)^t T_k^{-1}(x - \nu_k) + C_k
\tag{3.22}
\]
for some $C_k \in \mathbb{R}$. Replacing $u_k$, given by (3.22), in the Hamilton-Jacobi equation in (3.17), we have
\[
-\varepsilon^2\,\mathrm{Tr}(T_k^{-1}) + \frac{\varepsilon^2}{2}\,|T_k^{-1}(x-\nu_k)|^2 + \lambda_k = F_k(x, m_k, m).
\]
Recalling that $F_k(x, m_k, m) = \frac{1}{2}(x-\mu_k)^t(\Sigma_k^{-1})^t\Sigma_k^{-1}(x-\mu_k)$ and that, by (3.18), $\nu_k = \mu_k$ and $T_k = \varepsilon\Sigma_k$, we get that the previous equation is satisfied for
\[
\lambda_k = \varepsilon^2\,\mathrm{Tr}(T_k^{-1}).
\]
Taking into account the normalization condition $u_k(\mu_k) = 0$, we finally get $C_k = 0$, and then the result.

We now prove a reverse result, i.e. a solution of the MFG system (3.17) gives a mixture of Gaussian densities consistent with the data set $f$.
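Before stating the converse, the computations in the proof above can be double-checked symbolically. The following sketch verifies, in dimension one and with sympy, that the pair $u_k = \frac{\varepsilon}{2}(x-\nu)^2/\tau$, $\lambda_k = \varepsilon^2/\tau$, together with the Gaussian $m_k = \mathcal{N}(x|\nu, \tau)$, solves both equations of (3.17) when $\tau = \varepsilon\sigma$ (here $\sigma$ denotes the variance appearing in the one-dimensional coupling (3.25) below, and by consistency $\nu = \mu_k$):

```python
import sympy as sp

x = sp.symbols('x', real=True)
nu = sp.symbols('nu', real=True)
tau, eps = sp.symbols('tau eps', positive=True)

# candidate solution from Proposition 3.4 in dimension one (T_k = tau)
u   = eps * (x - nu)**2 / (2 * tau)                   # value function
lam = eps**2 / tau                                    # ergodic constant
m   = sp.exp(-(x - nu)**2 / (2 * tau)) / sp.sqrt(2 * sp.pi * tau)

sigma = tau / eps                 # consistency: T_k = eps * Sigma_k
F = (x - nu)**2 / (2 * sigma**2)  # 1-D coupling cost, cf. (3.25)

hjb = -eps * sp.diff(u, x, 2) + sp.diff(u, x)**2 / 2 + lam - F
fp  = eps * sp.diff(m, x, 2) + sp.diff(m * sp.diff(u, x), x)

print(sp.simplify(hjb), sp.simplify(fp))   # prints: 0 0
```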
Proposition 3.5. Let $\{(u_k, \lambda_k, m_k, \alpha_k)\}_{k=1}^K$ be a solution of the MFG system (3.17). Then $m(x) = \sum_{k=1}^K \alpha_k m_k(x)$, $x \in \mathbb{R}^d$, is a Gaussian mixture consistent with the data set $f$.

Proof. Firstly, for fixed $k = 1, \dots, K$, consider the Hamilton-Jacobi equation
\[
-\varepsilon\Delta u_k(x) + \frac{1}{2}|Du_k(x)|^2 + \lambda_k = \frac{1}{2}(x-\mu_k)^t(\Sigma_k^{-1})^t\Sigma_k^{-1}(x-\mu_k)
\tag{3.23}
\]
with the normalization condition $u_k(\mu_k) = 0$. By [3], the unique solution of the previous equation is given by the couple $(u_k, \lambda_k) = \left( \frac{1}{2}(x-\mu_k)^t\Sigma_k^{-1}(x-\mu_k),\ \varepsilon\,\mathrm{Tr}(\Sigma_k^{-1}) \right)$. Replacing $Du_k(x) = \Sigma_k^{-1}(x-\mu_k)$ in the Fokker-Planck equation, we obtain the equation
\[
\varepsilon\Delta m_k(x) + Dm_k(x)\cdot\Sigma_k^{-1}(x-\mu_k) + m_k(x)\,\mathrm{Tr}(\Sigma_k^{-1}) = 0.
\tag{3.24}
\]
Taking into account the normalization conditions $m_k \ge 0$, $\int m_k\,dx = 1$, we have that the solution of (3.24) is given by a Gaussian density $m_k(x) = \mathcal{N}(x|\nu_k, T_k)$, where, by (3.24), the constants $\nu_k, T_k$ have to satisfy the identity
\[
\varepsilon\left( (T_k^{-1}(x-\nu_k))^t(T_k^{-1}(x-\nu_k)) - \mathrm{Tr}(T_k^{-1}) \right)
- (T_k^{-1}(x-\nu_k))^t(\Sigma_k^{-1}(x-\mu_k)) + \mathrm{Tr}(\Sigma_k^{-1}) = 0.
\]
Hence $m_k$ is a solution of (3.24) if $\nu_k = \mu_k$, $T_k = \varepsilon\Sigma_k$. If we set $m(x) = \sum_{k=1}^K \alpha_k m_k(x)$, with $m_k(x) = \mathcal{N}(x|\mu_k, \varepsilon\Sigma_k)$ and $\alpha_k$ as in (3.17), we get a Gaussian mixture consistent with the data set $f$.

Lastly, we state an existence result for solutions to (3.17) (since the proof is similar to the one of Theorem 4.4, we omit the details).
Proposition 3.6. There exists a solution $\{(u_k, \lambda_k, m_k, \alpha_k)\}_{k=1}^K$ to (3.17).

We also observe that, as for (3.7), it is not reasonable to expect uniqueness for the solution of (3.17).
Remark 3.7. In the one-dimensional case, the coupling term is
\[
F_k(x, m_k, m) = \frac{1}{2}\left| \frac{x - \mu_k}{\sigma_k} \right|^2,
\tag{3.25}
\]
where
\[
\mu_k = \frac{\int_{\mathbb{R}} x\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}} \gamma_k(x) f(x)\,dx}, \qquad
\sigma_k = \frac{\int_{\mathbb{R}} (x-\mu_k)^2\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}} \gamma_k(x) f(x)\,dx}.
\tag{3.26}
\]
A Gaussian mixture $m(x) = \sum_{k=1}^K \beta_k\,\mathcal{N}(x|\nu_k, \tau_k)$ is said to be consistent with the data set, in the sense of Definition 3.2, if $\nu_k = \mu_k$, $\tau_k = \varepsilon\sigma_k$, with $\mu_k$ and $\sigma_k$ defined as in (3.26), and $\beta_k = \int_{\mathbb{R}} \gamma_k(x) f(x)\,dx$.

A general class of linear-quadratic MFG systems admitting a Gaussian distribution as solution of the Fokker-Planck equation was studied in [3].
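Consistent mixtures in the sense of Remark 3.7 can be computed in practice by an EM-like fixed-point iteration of the conditions (3.18); by Propositions 3.4-3.5, these correspond to solutions of (3.17). A sketch for a one-dimensional data density $f$ sampled on a uniform grid, with a simple rectangle rule for the integrals (this anticipates the two-step scheme of Section 5, with the Fokker-Planck solve replaced by the explicit Gaussian solution):

```python
import numpy as np

def consistent_mixture_1d(x, f, K, eps=1.0, n_iter=200, seed=0):
    """Fixed-point iteration of the 1-D consistency conditions:
    nu_k = mu_k, tau_k = eps * sigma_k, beta_k = int gamma_k f dx,
    with Gaussian components m_k = N(. | nu_k, tau_k)."""
    dx = x[1] - x[0]
    rng = np.random.default_rng(seed)
    beta = np.full(K, 1.0 / K)
    nu = rng.uniform(x[0], x[-1], size=K)
    tau = np.full(K, 0.01)
    for _ in range(n_iter):
        comp = np.stack([np.exp(-(x - nu[k])**2 / (2 * tau[k]))
                         / np.sqrt(2 * np.pi * tau[k]) for k in range(K)])
        gamma = beta[:, None] * comp / (beta[:, None] * comp).sum(axis=0)
        beta = (gamma * f).sum(axis=1) * dx              # beta_k
        w = gamma * f * dx                               # gamma_k f dx
        nu = (w * x).sum(axis=1) / w.sum(axis=1)         # mu_k, (3.26)
        sigma = (w * (x - nu[:, None])**2).sum(axis=1) / w.sum(axis=1)
        tau = eps * sigma                                # tau_k = eps*sigma_k
    return beta, nu, tau
```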
Remark 3.8.
If we consider a coupling cost $F_k$ in (3.16) independent of $\Sigma_k$, i.e. $\Sigma_k = \sigma I$, where $I$ is the identity matrix, we have $F_k(x, m_k, m) = \frac{1}{2\sigma^2}(x-\mu_k)^t(x-\mu_k)$. Then, the solution of the MFG system (3.17) gives a mixture of Gaussian densities $m(x) = \sum_{k=1}^K \alpha_k\,\mathcal{N}(x|\mu_k, \varepsilon\sigma I)$.

Arguing formally and sending $\sigma \to 0$, the responsibility $\gamma_k(x)$, corresponding to the index $k$ minimizing the distance of $x$ from the barycentre $\mu_k$, tends to 1, while the other responsibilities $\gamma_j(x)$, for $j \neq k$, tend to 0. Set $S_k = \{x \in \mathbb{R}^d : \lim_{\sigma\to 0}\gamma_k(x) = 1\}$ and observe that these sets coincide with the clusters defined in (3.9) for the MFG K-means model. Moreover, the consistency condition for the mean in (3.18) converges to
\[
\int_{\mathbb{R}^d} x\,m_k(x)\,dx = \frac{\int_{S_k} x\,f(x)\,dx}{\int_{S_k} f(x)\,dx},
\]
which coincides with (3.10). Hence the MFG K-means model can be interpreted, at least formally, as the limit of the MFG EM model for $\sigma \to 0$ (we plan to carry out a more rigorous analysis of this observation in the future).

Sending $\varepsilon \to 0^+$ in (3.17) has a different effect on the model. Indeed, in this case, the problem degenerates into a first order MFG system and the finite mixture $m$ tends to a convex combination of Dirac functions concentrated in the barycentres $\mu_k$.

Remark 3.9. The approach developed in this section is a soft clustering model, as the classical EM algorithm, while the one in Section 3.1 is a hard clustering one, as the K-means algorithm. In both models, the coupling terms have the effect of aggregating the agents around the closer barycentre. But, while in the MFG K-means model the clusters are circular, in the model discussed in this section, thanks to the additional parameters given by the covariance matrices, the clusters are ellipsoidal and therefore, in some situations, give a better approximation of the data set. This advantage of the EM algorithm with respect to the K-means algorithm is well known also in the classical setting discussed in Section 2.

We consider an ergodic MFG system, see (3.17), instead of a discounted MFG system as (3.7), but this does not change the model in an essential way. The choice of an ergodic system has been made in order to apply the algorithm developed in [9]. We also observe that system (3.17) has the classical structure of a MFG system, where the Fokker-Planck equation is the adjoint of the linearization of the Hamilton-Jacobi-Bellman equation, while this property is not satisfied by system (3.7).

4 A general MFG model for Cluster Analysis

In this section, we formulate a general MFG model for a finite mixture of probability densities. We consider a smooth, bounded domain $\Omega$ containing the support of the data set $f$, which satisfies the conditions in (3.3). Given a mixture of density functions as in (3.11) and defining the corresponding responsibilities $\gamma_k$ as in (3.12), we consider the following multi-population MFG system
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + H_k(x, Du_k(x)) + \lambda_k = F_k(x, m_k, m), & x \in \Omega, & \text{(HJ)}\\
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,D_pH_k(x, Du_k(x))) = 0, & x \in \Omega, & \text{(FP)}\\
\partial_n u_k(x) = 0, & x \in \partial\Omega, & \text{(HJn)}\\
\varepsilon\,\partial_n m_k(x) + m_k(x)\,D_pH_k(x, Du_k(x))\cdot\vec n = 0, & x \in \partial\Omega, & \text{(FPn)}\\
\alpha_k = \int_\Omega \gamma_k(x) f(x)\,dx, & & \text{(COEFF)}\\
m_k \ge 0, \quad \int_\Omega m_k(x)\,dx = 1, \quad \int_\Omega u_k(x)\,dx = 0, & & \text{(NORM)}
\end{cases}
\tag{4.1}
\]
for $k = 1, \dots, K$, where $\vec n$ is the outward normal to the boundary of $\Omega$, $\partial_n$ denotes the normal derivative and $\gamma_k(x) = \alpha_k m_k(x)/m(x)$, $k = 1, \dots, K$, are the responsibilities.
The boundary conditions in (4.1) are given by the Neumann condition (HJn) for $u_k$ and its dual (FPn) for $m_k$. Moreover, the normalization conditions (NORM) are imposed since $u_k$ is defined up to a constant and $m_k$ represents a probability density function.

In the rest of the paper, we will make the following assumptions:
(H1) $\Omega$ is a $C^2$ bounded domain of $\mathbb{R}^d$;

(H2) the Hamiltonians $H_k$, $k = 1, \dots, K$, are of the form
\[
H_k(x,p) = R_k|p|^r - H_k^0(x),
\]
with $R_k > 0$, $r > 1$, $H_k^0 \in C^2(\bar\Omega)$, $\partial_n H_k^0(x) \ge 0$ for $x \in \partial\Omega$;

(H3) the coupling costs $F_k : \Omega \times \mathbb{R}^d \times \mathbb{R}^{d\times d} \to \mathbb{R}$ are continuous, uniformly bounded functions of the form
\[
F_k(x, m_k, m) = F_k(x, \mu_k, \Sigma_k),
\tag{4.2}
\]
where
\[
\mu_k = \int_\Omega x\,\frac{m_k(x)}{m(x)}\,f(x)\,dx,
\tag{4.3}
\]
\[
\Sigma_k = \int_\Omega (x-\mu_k)(x-\mu_k)^t\,\frac{m_k(x)}{m(x)}\,f(x)\,dx.
\tag{4.4}
\]
Moreover, for $\mu, \Sigma$ fixed, $F_k(\cdot, \mu, \Sigma) \in W^{1,\infty}(\Omega)$.

Remark 4.1. If $\alpha_k = \int_\Omega \gamma_k(x) f(x)\,dx \neq 0$, then
\[
\mu_k = \frac{1}{\alpha_k} \int_\Omega x\,\frac{\alpha_k m_k(x)}{m(x)}\,f(x)\,dx = \frac{\int_\Omega x\,\gamma_k(x) f(x)\,dx}{\alpha_k} = \frac{\int_\Omega x\,\gamma_k(x) f(x)\,dx}{\int_\Omega \gamma_k(x) f(x)\,dx},
\]
which has to be compared with the definition of $\mu_k$ in (3.13) (similarly for $\Sigma_k$). As we will see in the proof of the existence theorem, since $m(x) > 0$ for all $x \in \Omega$, definitions (4.3)-(4.4) avoid the problem of having $\alpha_k = 0$ in the definition of $\mu_k, \Sigma_k$.
Remark 4.2. In view of applications different from Cluster Analysis, it is possible to consider more general Hamiltonians and coupling terms, for example coupling terms $F_k$ involving the Kullback-Leibler divergence
\[
m_k \ln\left( \frac{q_k}{m_k} \right),
\]
where $q_k$ is a function of the data set $f$. But, to maintain the parallelism with the classical EM algorithm, we prefer to restrict to the simpler case (4.2).

We now give the definition of solution for the MFG system (4.1). Note that the coefficients of the mixture $\alpha_k$ are part of the unknowns of the system.
Definition 4.3. A solution of (4.1) is a family of $K$ quadruples $(u_k, \lambda_k, m_k, \alpha_k)$ such that $u_k \in C^2(\Omega)$, $\lambda_k \in \mathbb{R}$, $m_k \in W^{1,2}(\Omega)$, $\alpha_k \in [0,1]$, and

(i) $(u_k, \lambda_k)$ satisfies (HJ)-(HJn) in point-wise sense.

(ii) $m_k$ satisfies (FP)-(FPn) in weak sense, i.e.
\[
\varepsilon \int_\Omega Dm_k(x)\cdot D\phi(x)\,dx + \int_\Omega m_k(x)\,D_pH_k(x, Du_k(x))\cdot D\phi(x)\,dx = 0
\]
for all $\phi \in W^{1,2}(\Omega)$.

(iii) The coefficients $\alpha_k$ satisfy (COEFF) (note that this implies $\alpha_k \in [0,1]$ and $\sum_{k=1}^K \alpha_k = 1$).

(iv) The normalization conditions (NORM) are satisfied.

We are going to prove existence of a solution to (4.1). The proof is similar to the one of [13, Theorem 4]; the main difference is the presence of the additional unknowns given by the coefficients $\alpha_k$.
Theorem 4.4. Assume (H1)-(H3). Then, there exists a solution $(u_k, \lambda_k, m_k, \alpha_k)$, $k = 1, \dots, K$, of (4.1) in the sense of Definition 4.3. Moreover $u_k \in C^2(\Omega)$, $m_k \in W^{1,p}(\Omega)$ for any $p \ge 1$ and $m_k \ge \delta > 0$ for some constant $\delta$.

Proof. For fixed $p > d$, we define the sets
\[
\mathcal{P} = \left\{ m \in L^1(\Omega) \ \text{s.t.} \ \int_\Omega m(x)\,dx = 1 \right\},
\qquad
\mathcal{D} = \mathcal{P} \cap \left\{ m \in W^{1,\infty}(\Omega) \ \text{s.t.} \ \|m\|_{W^{1,p}(\Omega)} \le C,\ m(x) \ge \delta > 0 \right\},
\]
where $C$, $\delta$ are two constants to be fixed later, and we consider the set
\[
\mathcal{C} = \left\{ (\alpha, m) = (\alpha_1, \dots, \alpha_K, m_1, \dots, m_K) :\ \alpha_k \in [0,1],\ m_k \in \mathcal{D},\ k = 1, \dots, K,\ \sum_{k=1}^{K} \alpha_k = 1 \right\}.
\tag{4.5}
\]
It is easy to see that the set $\mathcal{C}$ is convex. Moreover, it is compact with respect to the standard topology of $\mathbb{R} \times \cdots \times \mathbb{R} \times C(\bar\Omega) \times \cdots \times C(\bar\Omega)$, since $\alpha_k \in [0,1]$, $k = 1, \dots, K$, and the set $\mathcal{D}$ is compactly embedded in $C^{0,\beta}(\bar\Omega)$ with $\beta \le (p-d)/p$ (see Theorem 7.26 in [17]).

We define the map $\Psi : \mathcal{C} \to \mathcal{C}$ which, to a vector $(\beta, \eta) = (\beta_1, \dots, \beta_K, \eta_1, \dots, \eta_K)$, associates the vector $(\alpha, m) = (\alpha_1, \dots, \alpha_K, m_1, \dots, m_K)$ defined in the following way:
Step (i): Given $(\beta, \eta) \in \mathcal{C}$, we define the responsibilities
\[
\gamma_k(x) = \frac{\beta_k \eta_k(x)}{\sum_{j=1}^{K} \beta_j \eta_j(x)}, \qquad k = 1, \dots, K.
\tag{4.6}
\]
Note that the responsibilities are well defined since, by $\eta_k \ge \delta > 0$ and $\sum_{k=1}^K \beta_k = 1$, we get $\sum_{j=1}^K \beta_j \eta_j(x) \ge \delta$ for $x \in \Omega$.
Step (ii): Given $\gamma_k(x)$ as in Step (i), we define $\mu_k, \Sigma_k$ as in (4.3)-(4.4) and we consider the Hamilton-Jacobi-Bellman equations
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + H_k(x, Du_k(x)) + \lambda_k = F_k(x, \mu_k, \Sigma_k), & x \in \Omega,\\
\partial_n u_k(x) = 0, & x \in \partial\Omega,\\
\int_\Omega u_k(x)\,dx = 0,
\end{cases}
\tag{4.7}
\]
for $k = 1, \dots, K$. Note that the previous systems are independent of each other. By [20, Theorem II.1], (4.7) admits a unique solution $(u_k, \lambda_k) \in C^2(\bar\Omega) \times \mathbb{R}$ with
\[
|\lambda_k| \le C, \qquad \|u_k\|_{W^{1,\infty}(\Omega)} \le C,
\tag{4.8}
\]
for some positive constant $C$ which depends only on $\Omega$, $H_k$ and $\|F_k(\cdot, \mu_k, \Sigma_k)\|_{L^\infty}$.
Step (iii): Given $u_k$ as in Step (ii), we consider the problems
\[
\begin{cases}
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,D_pH_k(x, Du_k(x))) = 0, & x \in \Omega,\\
\varepsilon\,\partial_n m_k(x) + m_k(x)\,D_pH_k(x, Du_k(x))\cdot\vec n(x) = 0, & x \in \partial\Omega,\\
m_k \ge 0, \quad \int_\Omega m_k(x)\,dx = 1,
\end{cases}
\tag{4.9}
\]
for $k = 1, \dots, K$. Since (4.8) implies that $|D_pH_k(x, Du_k(x))|$ is uniformly bounded in $\Omega$, by Theorems II.4.4, II.4.5, II.4.7 in [4], it follows that problem (4.9) admits a unique weak solution $m_k \in W^{1,2}(\Omega)$. Moreover, $m_k \in W^{1,p}(\Omega)$ for all $p \ge 1$, $m_k \in C(\bar\Omega)$ and
\[
\|m_k\|_{W^{1,p}(\Omega)} \le \hat C,
\tag{4.10}
\]
\[
\hat\delta \le m_k(x) \le \hat\delta^{-1} \qquad \forall x \in \Omega,
\tag{4.11}
\]
for constants $\hat C$, $\hat\delta > 0$ depending on $\|D_pH_k(x, Du_k(x))\|_{L^\infty(\Omega)}$ and therefore, recalling (4.8), on $\Omega$, $H_k$ and $\|F_k\|_{L^\infty}$. Hence, in (4.5), fixing $C$ larger than $\hat C$ in (4.10) and $\delta$ smaller than $\hat\delta$ in (4.11), we conclude $m_k \in \mathcal{D}$, $k = 1, \dots, K$.

Furthermore, we define
\[
\alpha_k := \int_\Omega \gamma_k(x) f(x)\,dx.
\tag{4.12}
\]
Since $0 \le \gamma_k(x) \le 1$ and $\sum_{k=1}^K \gamma_k(x) = 1$ for $x \in \Omega$, it follows that $\alpha_k \in [0,1]$ and $\sum_{k=1}^K \alpha_k = 1$.

We conclude that the map $\Psi$, which to the vector $(\beta, \eta)$ associates the vector $(\alpha, m)$ given by (4.9)-(4.12), is well defined and maps the set $\mathcal{C}$ into itself.

We now show that the map $\Psi$ is continuous with respect to the $\mathbb{R} \times \cdots \times \mathbb{R} \times C(\bar\Omega) \times \cdots \times C(\bar\Omega)$ topology. Let $(\beta^n, \eta^n)$ be a sequence in $\mathcal{C}$ such that $\beta_k^n$ converges to $\beta_k$ in $\mathbb{R}$ and $\eta_k^n$ converges to $\eta_k$ uniformly in $\Omega$, for all $k = 1, \dots, K$. Since $\sum_{k=1}^K \beta_k^n \eta_k^n$ converges uniformly to $\sum_{k=1}^K \beta_k \eta_k$ and $\eta_k^n, \eta_k \ge \delta > 0$, for $\gamma_k^n, \gamma_k$ as in (4.6) it follows that
\[
\gamma_k^n \to \gamma_k \quad \text{for } n \to \infty, \text{ uniformly in } \Omega.
\tag{4.13}
\]
Then, by $0 \le \gamma_k^n, \gamma_k \le 1$, (4.13) and the dominated convergence theorem, we have
\[
\lim_{n\to\infty} \alpha_k^n = \lim_{n\to\infty} \int_\Omega \gamma_k^n(x) f(x)\,dx = \int_\Omega \gamma_k(x) f(x)\,dx = \alpha_k.
\tag{4.14}
\]
Moreover, since $\sum_{k=1}^K \beta_k^n \eta_k^n,\ \sum_{k=1}^K \beta_k \eta_k \ge \delta$ and $\eta_k^n$ converges to $\eta_k$ uniformly, we have
\[
\frac{\eta_k^n}{\sum_{j=1}^{K} \beta_j^n \eta_j^n} \longrightarrow \frac{\eta_k}{\sum_{j=1}^{K} \beta_j \eta_j}, \quad \text{for } n \to \infty, \text{ uniformly in } \Omega.
\]
Recalling definitions (4.3)-(4.4), it follows that
\[
\lim_{n\to\infty} \mu_k^n = \lim_{n\to\infty} \int_\Omega x\,\frac{\eta_k^n(x) f(x)}{\sum_{j=1}^{K} \beta_j^n \eta_j^n(x)}\,dx = \int_\Omega x\,\frac{\eta_k(x) f(x)}{\sum_{j=1}^{K} \beta_j \eta_j(x)}\,dx = \mu_k
\]
and, similarly, $\lim_{n\to\infty} \Sigma_k^n = \Sigma_k$. It follows that $F_k(x, \mu_k^n, \Sigma_k^n)$ converges uniformly to $F_k(x, \mu_k, \Sigma_k)$. Denoting by $(u_k^n, \lambda_k^n)$ and $(u_k, \lambda_k)$, $k = 1, \dots, K$, the solutions of (4.7) corresponding respectively to $F_k(x, \mu_k^n, \Sigma_k^n)$ and $F_k(x, \mu_k, \Sigma_k)$, by a standard stability result for (4.7) we have that $\lambda_k^n$ converges to $\lambda_k$ and $u_k^n$ converges to $u_k$ uniformly in $\Omega$. Moreover $Du_k^n \to Du_k$ locally uniformly in $\Omega$ (see [13, Theorem 4] for details). Let $m_k^n$, $m_k$ be the solutions of (4.9) corresponding to $u_k^n$ and $u_k$. By the estimate (4.10), the functions $m_k^n$ are equi-bounded in $W^{1,p}$. By the local uniform convergence of $Du_k^n$ to $Du_k$, passing to the limit in the weak formulation of (4.9) yields (possibly passing to a subsequence) that $m_k^n$ converges to a weak solution of the problem, which, by uniqueness, coincides with $m_k$. Hence $m_k^n$ converges uniformly to $m_k$, $k = 1, \dots, K$, and, recalling (4.14), we conclude that the map $\Psi$ is continuous.

Applying the Schauder Fixed Point Theorem to the map $\Psi : \mathcal{C} \to \mathcal{C}$, we get that there exists $(\alpha, m) \in \mathcal{C}$ such that $\Psi(\alpha, m) = (\alpha, m)$, hence a solution to (4.1).

5 Numerical experiments

In this section, we present an algorithm for the cluster problem, based on the numerical approximation of (4.1). Our aim is to compute the finite mixture model (3.11) and the corresponding responsibilities $\gamma_k(x)$. We use an approach similar to the classical EM algorithm. Indeed, rather than solving the full problem (4.1) in the coupled unknowns $(u_k, \lambda_k, m_k, \alpha_k)$ for $k = 1, \dots, K$, we split it in the iterative resolution of $K$ independent sub-problems as follows. Given an arbitrary initialization $\alpha_1^0, \dots, \alpha_K^0$, $m_1^0, \dots, m_K^0$ for the mixture model, at the $n$-th iteration we alternate two steps:
E-step. For $k = 1, \dots, K$, compute the new responsibilities, mixture coefficients, barycentres and covariance matrices
\[
\gamma_k^n(x) = \frac{\alpha_k^{n-1} m_k^{n-1}(x)}{m^{n-1}(x)}, \qquad \alpha_k^n = \int_\Omega \gamma_k^n(x) f(x)\,dx,
\tag{5.1}
\]
\[
\mu_k^n = \int_\Omega x\,\frac{m_k^{n-1}(x)}{m^{n-1}(x)}\,f(x)\,dx, \qquad
\Sigma_k^n = \int_\Omega (x-\mu_k^n)(x-\mu_k^n)^t\,\frac{m_k^{n-1}(x)}{m^{n-1}(x)}\,f(x)\,dx.
\]
M-step. For $k = 1, \dots, K$, compute the new coupling cost $F_k(x, \mu_k^n, \Sigma_k^n)$ and solve the $K$ (independent) systems
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + H_k(x, Du_k(x)) + \lambda_k = F_k(x, \mu_k^n, \Sigma_k^n), & x \in \Omega,\\
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,D_pH_k(x, Du_k(x))) = 0, & x \in \Omega,\\
\partial_n u_k(x) = 0, & x \in \partial\Omega,\\
\varepsilon\,\partial_n m_k(x) + m_k(x)\,D_pH_k(x, Du_k(x))\cdot\vec n = 0, & x \in \partial\Omega,\\
m_k \ge 0, \quad \int_\Omega m_k(x)\,dx = 1, \quad \int_\Omega u_k(x)\,dx = 0.
\end{cases}
\tag{5.2}
\]
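In compact form, the resulting iteration can be outlined as follows (a schematic one-dimensional Python sketch; `solve_mfg_system` stands for a black-box solver for (5.2), e.g. the Newton-type method of [9], and its name and signature are hypothetical placeholders):

```python
import numpy as np

def mfg_em(f, x, alpha, m, solve_mfg_system, eps=1.0, tol=1e-6, max_iter=100):
    """Alternate the E-step (5.1) and the M-step (5.2) until barycentres
    and covariances stagnate. Here x collects the (uniform, 1-D) grid nodes,
    f the data density at the nodes, m[k] the k-th component at the nodes."""
    K = len(alpha)
    dx = x[1] - x[0]
    mu, Sigma = np.zeros(K), np.ones(K)
    for _ in range(max_iter):
        mix = sum(alpha[k] * m[k] for k in range(K))
        mu_old, Sigma_old = mu.copy(), Sigma.copy()
        for k in range(K):
            # E-step: rectangle-rule quadrature of the integrals in (5.1)
            w = m[k] / mix * f
            alpha[k] = alpha[k] * np.sum(w) * dx
            mu[k] = np.sum(x * w) * dx
            Sigma[k] = np.sum((x - mu[k])**2 * w) * dx
            # M-step: solve the k-th MFG system (5.2) with the new cost
            u_k, m[k], lam_k = solve_mfg_system(mu[k], Sigma[k], eps)
        if max(np.abs(mu - mu_old).max(), np.abs(Sigma - Sigma_old).max()) < tol:
            break
    return alpha, m, mu, Sigma
```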
Note that the coupling between the mixture components is entirely embedded in the E-step, and this significantly reduces the computational effort to perform the M-step; in particular, the $K$ sub-problems (5.2) can be solved in parallel. Moreover, for fixed $k$, we observe that the Hamilton-Jacobi-Bellman equation is decoupled from the corresponding Fokker-Planck equation. In principle, we could compute first $u_k$ and then use it to compute $m_k$. Nevertheless, we prefer to solve the system (5.2) as a whole, also in view of more general coupling costs, in which the dependence on $m_k$ is not frozen at the previous step as in (5.1).

Let us now discuss, in more detail, the main ingredients for a practical implementation of the algorithm. First of all, we have to introduce a discretization of the domain $\Omega$. The choice of a structured or an unstructured grid clearly depends on the geometry of $\Omega$ and, in turn, on the nature of the data set represented by the distribution $f$. Typically $\Omega$ can be chosen as a simple hypercube in $\mathbb{R}^d$ containing the support of $f$, as in the tests shown below. Then, we need some quadrature rule for the approximation of the integrals appearing in the E-step, and also a suitable discretization of the MFG systems in the M-step, e.g. via finite differences or finite elements ([1, 2, 10]). Note that each system in (5.2) consists of two non-linear PDEs in the three unknowns $(u_k, m_k, \lambda_k)$, plus boundary and normalization conditions. This leads to the building block of the algorithm, namely the numerical solution of a discrete non-linear system which is formally overdetermined, having more equations than unknowns. More precisely, if $\Omega$ is discretized with $N$ degrees of freedom, then $(u_k, m_k, \lambda_k)$ corresponds to a vector of $2N+1$ unknowns, while (5.2) gives $2N+2$ non-linear equations, subject to $N$ constraints representing the condition $m_k \ge 0$. Hence, by employing a Newton-like solver, we end up with non-square linearized systems, whose solutions should be meant in a least-squares sense, as explained in [9], where a direct method for MFG systems is proposed. We refer the reader to this paper for further details and the implementation of the method, which will be used in the following tests.

We conclude this discussion with some remarks on the initialization and the stopping criterion of the algorithm. As is commonly done in the classical EM algorithm, we choose all the initial weights $\alpha_k^0 = \frac{1}{K}$, and the densities $m_k^0$ given by Gaussian PDFs $\mathcal{N}(x|\mu_k, I)$ with random means $\mu_k$. The convergence of the algorithm is really sensitive to this initial guess, and we observe a relevant speedup if the supports of the $m_k^0$ are as much as possible disjoint. Finally, we choose a tolerance $tol > 0$ and repeat the EM-step until both $\max_{k=1,\dots,K} |\mu_k^n - \mu_k^{n-1}| < tol$ and $\max_{k=1,\dots,K} |\Sigma_k^n - \Sigma_k^{n-1}| < tol$. Note that also the M-step has its own tolerances for the convergence of the Newton solvers for the $K$ MFG systems. In principle, we can choose a single tolerance for the whole algorithm and alternate the E-step and M-step in a Gauss-Seidel fashion, namely updating the parameters (5.1) after a single Newton iteration for each system (5.2). Surprisingly, this does not improve the convergence, and the proposed implementation with two nested loops is more effective. Indeed, we observed that, once each mixture component $m_k$ stabilizes around its current mean $\mu_k$, in the next iterations it readily moves to the new barycentre with just small adjustments of the corresponding covariance matrix $\Sigma_k$. In practice, most of the computational effort is concentrated on the first few iterations of the algorithm.

Now, let us define the settings for the numerical experiments. We use a uniform structured grid to discretize the domain $\Omega$, and a simple rectangular quadrature rule for the E-step. Moreover, we employ standard finite differences for the discretization of the M-step, where we consider the particular case of a quadratic Hamiltonian as in (3.17) (see also Remark 3.7), and we set the diffusion coefficient $\varepsilon = 1$. The case of general Hamiltonians and coupling costs will be addressed in a more computationally oriented paper.

Test 1.
We consider a piece-wise constant distribution $f$ on $\Omega = [0,1]$, with $\int f(x)\,dx = 1$. In Figure 1, we show the solution computed on a uniform grid of $N = 201$ nodes, respectively for $K = 1, 2, 3$.
The thin line represents $f$, while the thick line represents the mixture $m = \sum_{k=1}^K \alpha_k m_k$. We clearly see how the mean and the variance of each $m_k$ adapt to the data, according to the given number $K$ of mixture components.

Figure 1: Approximation of a piece-wise constant data distribution, for $K = 1, 2, 3$.

Test 2.
We consider the example of a distribution on $\Omega = [0,1]$ with an oscillatory behavior. Indeed, we define $f$ by suitably scaling and translating the function $\tilde f(x) = x\sin(4\pi x)$ for $x \in [0,1]$, so that $f$ has compact support and $\int f(x)\,dx = 1$. In Figure 2, we show the solutions corresponding to $K = 1, 2, 3, 4$, again for a uniform grid of $N = 201$ nodes. It is interesting to observe that the peaks of $f$ are sequentially approximated as the number $K$ of mixture components increases, according to their heights and the underlying masses.

Figure 2: Approximation of an oscillatory data distribution, for $K = 1, 2, 3, 4$.

Test 3.
We present an application of Cluster Analysis to the so-called color quantization problem in computer graphics. Given an image, the aim is to reduce the total number of colors ($2^{24}$ for images in 24-bit RGB format) to a given number $K$, so that the resulting image is as similar as possible to the original one. As pointed out in [6, Chap. 9], this problem can be solved by the classical K-means algorithm. Note that this approach is quite different from segmentation; indeed, no geometric correlation between pixels is considered, but just the color information. Here we apply the MFG clustering algorithm to the case of an image in grey scales. Each pixel in the image contains a level of grey represented by a value in the interval $[0,1]$ (typically sampled with 256 levels, from black 0 to white 1), so that the problem is reduced to dimension one, choosing $\Omega = [0,1]$ uniformly discretized with $N = 256$ nodes. To generate the data set distribution $f$, for each $x \in [0,1]$ we count the number of pixels in the image with grey level $x$, then we normalize it to satisfy $\int f(x)\,dx = 1$. In Figure 3, we show the original image and the corresponding distribution $f$, while in Figure 4 we report the results for increasing values of $K$ starting from $K = 2$. In all cases, we use the responsibilities $\{\gamma_k\}_{k=1,\dots,K}$ to map the single pixel grey value to the barycentre of its most representative cluster: if the pixel $p$ has grey value $x_p$, then it is mapped to the value $\mu_{k^*}$, where $k^* = \arg\max_{k=1,\dots,K} \gamma_k(x_p)$.

Figure 3: A black and white image (a) and its grey scales distribution (b).
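The final recoloring step just described is straightforward; a sketch (assuming the grey-level grid `x`, the responsibilities `gamma` of shape (K, N) and the barycentres `mu` produced by the clustering algorithm):

```python
import numpy as np

def quantize(img, x, gamma, mu):
    """Map each pixel grey value to the barycentre of its most
    representative cluster: k* = argmax_k gamma_k(x_p)."""
    # index of the grid node closest to each pixel value
    idx = np.abs(img[..., None] - x).argmin(axis=-1)
    k_star = gamma[:, idx].argmax(axis=0)   # most representative cluster
    return mu[k_star]
```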
Test 4. We finally consider a two-dimensional example. We take some data from the ELKI project [16], forming a "mouse" similar to a popular comic character. The data set is organized in 3 clusters (plus some random noise), corresponding to the head and the ears of the mouse, see Figure 5. Then, we set $K = 3$, we discretize the domain $\Omega = [0,1]^2$ by means of a uniform grid of $N = 51^2$ nodes, and we build the distribution $f$ as in the previous test, by counting the data points falling in each cell of the grid and normalizing the result to obtain $\int_\Omega f(x)\,dx = 1$. For the visual representation, we consider RGB triplets in $[0,1]^3$, and we assign to the three clusters the pure colors red $(1,0,0)$, green $(0,1,0)$ and blue $(0,0,1)$ respectively. Then we use the responsibilities $\{\gamma_k\}_{k=1,2,3} \in [0,1]$ to assign to each grid node $x_i$ the color $C_i = (\gamma_1(x_i), \gamma_2(x_i), \gamma_3(x_i))$, for $i = 1, \dots, N$, which quantifies how much it belongs to a certain cluster. Figure 6 shows the surface of the computed mixture and the corresponding clusterization. We observe that the scattered data set is well approximated by the mixture, and the corresponding three clusters are well separated by small overlapping regions.

Figure 4: Color quantization via MFG clustering (left panels) and the corresponding mixtures (right panels), for increasing values of $K$.

Figure 5: The "mouse" data set (a) and the corresponding distribution (b).

Figure 6: MFG mixture (a) and clustering (b) of the "mouse" data set for $K = 3$.

References

[1] Y. Achdou and I. Capuzzo Dolcetta, Mean Field Games: numerical methods, SIAM J. Numer. Anal., 48 (2010), 1136-1162.

[2] N. Almulla, R. Ferreira and D. Gomes, Two numerical approaches to stationary mean-field games, Dyn. Games Appl. 7 (2017), no. 4, 657-682.

[3] M. Bardi and F. Priuli, Linear-quadratic N-person and mean-field games with ergodic cost, SIAM J. Control Optim. 52 (2014), no. 5, 3022-3052.

[4] A. Bensoussan, Perturbation methods in optimal control, Wiley/Gauthier-Villars Series in Modern Applied Mathematics, 1988.

[5] C. Bertucci, S. Vassilaras, J.-M. Lasry, G. Paschos, M. Debbah and P.-L. Lions, Transmit strategies for massive machine-type communications based on Mean Field Games, in Proceedings of the International Symposium on Wireless Communication Systems (ISWCS), 2018.

[6] C.M. Bishop, Pattern recognition and Machine Learning, Information Science and Statistics, Springer, New York, 2006.

[7] J.A. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, Technical Report ICSI-TR-97-021, University of Berkeley, 2000.

[8] L. Bottou and Y. Bengio, Convergence properties of the K-means algorithms, Adv. Neural Inf. Process. Syst. 82 (1995), 585-592.

[9] S. Cacace and F. Camilli, A generalized Newton method for homogenization of Hamilton-Jacobi equations, SIAM J. Sci. Comput. 38 (2016), no. 6, A3589-A3617.

[10] E. Carlini and F.J. Silva, On the discretization of some nonlinear Fokker-Planck-Kolmogorov equations and applications, SIAM J. Numer. Anal. 56 (2018), no. 4, 2148-2177.

[11] P. Chaudhari, A. Oberman, S. Osher, S. Soatto and G. Carlier, Deep relaxation: partial differential equations for optimizing deep neural networks, Res. Math. Sci. 5 (2018), no. 3, Paper No. 30.

[12] Y. Chen, T.T. Georgiou and A. Tannenbaum, Optimal transport for Gaussian mixture models, IEEE Access, 7 (2019), 6269-6278.

[13] M. Cirant, Multi-population Mean Field Games systems with Neumann boundary conditions, J. Math. Pures Appl. 103 (2015), no. 5, 1294-1315.

[14] J.L. Coron, Quelques exemples de jeux à champ moyen, Ph.D. thesis, Université Paris-Dauphine, 2018.

[15] W. E, J. Han and Q. Li, A mean-field optimal control formulation of deep learning, Res. Math. Sci. 6 (2019), no. 1, Paper No. 10.

[16] The ELKI project, https://elki-project.github.io/datasets/

[17] D. Gilbarg and N.S. Trudinger, Elliptic partial differential equations of second order, Classics in Mathematics, Springer-Verlag, Berlin, 2001.

[18] O. Guéant, J.-M. Lasry and P.-L. Lions, Mean Field Games and applications, in Paris-Princeton Lectures on Mathematical Finance 2010, Lecture Notes in Math., 2003, Springer, Berlin, 2011, 205-266.

[19] J.-M. Lasry and P.-L. Lions, Mean Field Games, Jpn. J. Math., 2 (2007), 229-260.

[20] P.-L. Lions, Quelques remarques sur les problèmes elliptiques quasilinéaires du second ordre, J. Anal. Math. 45 (1985), 234-254.

[21] S. Miyamoto and M. Mukaidono, Fuzzy C-Means as a regularization and maximum entropy approach, in IFSA'97 Prague: Proceedings of the seventh International Fuzzy Systems Association World Congress, 86-92.

[22] S. Pequito, A.P. Aguiar, B. Sinopoli and D. Gomes, Unsupervised learning of finite mixture models using Mean Field Games, in Annual Allerton Conference on Communication, Control and Computing, 2011, 321-328.

[23] A. Saxena, M. Prasad, A. Gupta, N. Bharill, O.P. Patel, A. Tiwari, M.J. Er, W. Ding and C.T. Lin, A review of clustering techniques and developments, Neurocomputing, 267 (2017), 664-681.

[email protected]@sbai.uniroma1.it
SBAI, Sapienza Università di Roma
via A. Scarpa 14, 00161 Roma (Italy)

[email protected]
Dipartimento di Matematica e Fisica
Università degli Studi Roma Tre
Largo S. L. Murialdo 1, 00146 Roma (Italy)

[email protected]@iconsulting.biz