A Mean Field Games approach to Cluster Analysis
Laura Aquilanti, Simone Cacace, Fabio Camilli and Raul De Maio
Abstract
In this paper, we develop a Mean Field Games approach to Cluster Analysis. We consider a finite mixture model, given by a convex combination of probability density functions, to describe the given data set. We interpret a data point as an agent of one of the populations represented by the components of the mixture model, and we introduce a corresponding optimal control problem. In this way, we obtain a multi-population Mean Field Games system which characterizes the parameters of the finite mixture model. Our method can be interpreted as a continuous version of the classical Expectation-Maximization algorithm.
AMS-Subject Classification:
Keywords: mixture model; Cluster Analysis; Expectation-Maximization algorithm; Mean Field Games; multi-population model

1 Introduction

Cluster Analysis, a classical problem in unsupervised Machine Learning, concerns the repartition of a group of data points into subgroups, in such a way that the elements within a same group are more similar to each other than they are to the elements of a different one. Clustering methods can be used either to infer specific patterns in rough data sets, or as a preliminary analysis in supervised Machine Learning algorithms (for a review of the applications of cluster analysis, see [6, 23]). Algorithms for Cluster Analysis are generically divided into two classes: hard clustering methods, whose prototype is the K-means algorithm and its variants, for which each data point belongs only to a single subgroup [8]; and soft clustering methods, such as the Expectation-Maximization and the Fuzzy K-means algorithms, for which each data point has a certain degree or probability to belong to each cluster [7, 21].

Most of the mathematical literature on Cluster Analysis, and more generally on Machine Learning, is in the framework of finite-dimensional optimization. But recently, different approaches based on partial differential equations and infinite-dimensional optimization have begun to be pursued [11, 12, 15]. Even if, in most of the cases, it is not clear if these new approaches are computationally competitive with the traditional ones, they offer new points of view and insights in this rapidly developing field.

In this paper, we propose an approach to Cluster Analysis based on Mean Field Games (MFG in short). This theory was introduced by Lasry and Lions [19] in the framework of differential games with a large number of rational agents. Although it was realized from the beginning that, in perspective, one of the most promising applications of MFG theory could be "Big Data" analysis [18], up to now most of the applications are in the fields of Economics and Financial Mathematics, and only very few papers deal with data analysis problems [22, 5]. Nevertheless, a MFG version of the classical hard-clustering K-means algorithm has been recently proposed in [14]. Relying on a similar approach, here we deal with a MFG version of the soft-clustering Expectation-Maximization algorithm.

As in the classical Expectation-Maximization (EM in short) algorithm, the starting point of our analysis is a finite mixture model given by convex combinations of probability density functions (PDFs in short)
\[
m(x) = \sum_{k=1}^{K} \alpha_k m_k(x), \quad x \in \mathbb{R}^d, \qquad \text{with } \sum_{k=1}^{K} \alpha_k = 1.
\]
Finite mixture models are powerful probabilistic techniques to describe complex scenarios, e.g. several regions with high mass density, and populations composed by different sub-populations, each one represented by a PDF. Generally, the components of the mixture are assumed to be parametrized PDFs. For example, for a Gaussian mixture model, the parameters are represented by the mean and the variance of each component. Then, the EM algorithm is a technique which allows to compute the unknown coefficients (weights and parameters) of the mixture by incrementally increasing the maximum likelihood with respect to the data set.

In our approach, we are given a PDF $f$ on $\mathbb{R}^d$ representing the data set, and we aim to find a corresponding mixture model in order to subdivide the points in $K$ clusters.
We interpret the data points as agents belonging with a certain probability to one of the $K$ sub-populations, which are described by a PDF $m_k$ with appropriate weights $\alpha_k$ on the entire population. The similarity, or proximity, among the members of a same sub-population is encoded in the cost functions of the optimal control problems for each sub-population, which push the agents to aggregate around the closer barycentre of the distributions $m_k$.

In order to characterize the components of the mixture model, i.e. the PDFs $m_k$ and the weights $\alpha_k$, we first consider the case of a quadratic MFG system
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + \frac{1}{2}|Du_k(x)|^2 + \lambda_k = \frac{1}{2}(x-\mu_k)^t(\Sigma_k^{-1})^t\Sigma_k^{-1}(x-\mu_k), & x \in \mathbb{R}^d,\\
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,Du_k(x)) = 0, & x \in \mathbb{R}^d,\\
\alpha_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx,\\
m_k \ge 0, \quad \int_{\mathbb{R}^d} m_k(x)\,dx = 1, \quad u_k(\mu_k) = 0,
\end{cases}
\tag{1.1}
\]
for $k = 1, \dots, K$, where the coupling among the different populations is encoded in the right hand side of the first equation. Indeed, the mean $\mu_k$ and the covariance matrix $\Sigma_k$ of the population $m_k$ depend, besides the data set $f$, on the whole measure $m$ through the quantities $\gamma_k(x) = \alpha_k m_k(x)/m(x)$. The latter play a central role in the model because they represent the degree of membership of agents to a given population $k$. We introduce a notion of consistency with the data set, and we show that a solution of the multi-population MFG system (1.1) is given by a consistent Gaussian mixture with the same mean and variance of $f$. Moreover, any consistent Gaussian mixture can be obtained as a solution of the MFG system. We remark that the main difference between our approach and the classical EM algorithm, which is an iterative procedure, is that the MFG method gives the partition into clusters as the result of the solution of a stationary system of PDEs.

Then we consider a stationary multi-population MFG system, where the Hamiltonian and the coupling terms can be very general, and we show that it always admits a solution. Note that, in this framework, it is not reasonable to expect uniqueness, since there may be several admissible partitions of a given data set. We also consider a two-step numerical approximation of the MFG system, inspired by the classical EM algorithm, and we test the method on discrete and continuous data sets.

The paper is organized as follows. In Section 2, following [6, 7], we briefly introduce the K-means and the Expectation-Maximization algorithms, in their classical approach. In Section 3, we first describe the approach in [14] and then we propose a quadratic multi-population MFG model for Gaussian finite mixture models. In Section 4, we study a general class of MFG systems for Cluster Analysis. Finally, in Section 5, we perform some numerical tests to confirm the theoretical results.

2 Classical algorithms for Cluster Analysis

In this section, we briefly review some classical algorithms for Cluster Analysis. We start by describing the
K-means algorithm, a hard clustering technique, and then the
Expectation-Maximization algorithm, a soft clustering one. For a complete introduction to the subject, we refer to [6, Chap. 9].

In the following, we are given a set of observable data $\mathcal{X} = \{x_1, \dots, x_I\}$ in $\mathbb{R}^d$ and a fixed number $K$ of clusters $S_j$, $j = 1, \dots, K$, into which we want to partition our observations.

2.1 K-means algorithm

The K-means algorithm is an iterative procedure that assigns points to the nearest cluster with respect to the Euclidean distance, even though other distances can be considered. We denote by $\mu = \{\mu_1, \mu_2, \dots, \mu_K\}$, $\mu_k \in \mathbb{R}^d$, the vector of barycentres of the clusters $\{S_1, S_2, \dots, S_K\}$. Moreover, we introduce a vector of cluster assignments, $c = (c_1, c_2, \dots, c_I)$, $c_i \in \{1, \dots, K\}$, such that for every vector $x_i \in \mathcal{X}$, we have $c_i = k \Leftrightarrow x_i \in S_k$. The goal is to find the vectors $\mu$ and $c$ in order to minimize the functional
\[
J = \sum_{i=1}^{I} \sum_{k=1}^{K} \mathbf{1}_{\{c_i = k\}}\, |x_i - \mu_k|^2,
\tag{2.1}
\]
representing the sum of the squared distances of the data points to the corresponding barycentres ($\mathbf{1}$ denotes the characteristic function). Starting from an arbitrary assignment for the vector $\mu$, the $n$-th iteration of the K-means algorithm alternates two steps:

Cluster assignment: minimization of the objective function $J$ with respect to $c$ by assigning the point $x_i$ to the closest barycentre, i.e.
\[
c_i^n = \arg\min_j |x_i - \mu_j^n|^2 \qquad \forall i = 1, \dots, I.
\]

Barycentre update: computation of the new barycentres $\mu_k^{n+1}$ of the clusters $S_k^n = \{x_i \in \mathcal{X} : c_i^n = k\}$, i.e.
\[
\mu_k^{n+1} = \frac{\sum_{i=1}^{I} x_i\,\mathbf{1}_{\{c_i^n = k\}}}{\sum_{i=1}^{I} \mathbf{1}_{\{c_i^n = k\}}} = \frac{\sum_{i=1}^{I} x_i\,\mathbf{1}_{\{c_i^n = k\}}}{\mathrm{card}(S_k^n)} \qquad \forall k = 1, \dots, K.
\]

At each iteration, the objective function $J$ decreases and the algorithm is repeated until there is no change in the cluster assignments, up to a tolerance error.
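To make the two alternating steps concrete, here is a minimal Python sketch of K-means for a finite data set (our illustration, not taken from [8]; the initialization and the stopping rule are simplified):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Minimal K-means: alternate cluster assignment and barycentre update
    until the assignment vector c stops changing."""
    X = np.asarray(X, dtype=float)          # data points, shape (I, d)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initial barycentres
    c = np.full(len(X), -1)
    for _ in range(max_iter):
        # cluster assignment: closest barycentre in Euclidean distance
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c_new = d2.argmin(axis=1)
        if np.array_equal(c_new, c):
            break
        c = c_new
        # barycentre update: mean of the points currently assigned to S_k
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return mu, c
```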
2.2 Mixture models and the Expectation-Maximization algorithm

Mixture models are probabilistic techniques used to represent complex scenarios of data, and they are related to soft clustering methods, which consider a probabilistic partition of the data points into clusters, whose corresponding parametrized probabilities are computed via the EM algorithm. In this setting, we assume that the data set $\mathcal{X}$ represents a set of (independent and identically distributed) observations of a random variable $X$, with unknown distribution that we aim to approximate with a distribution $P$. Moreover, we assume that $P$ is given by a mixture of probability densities $p_k(x|\theta_k)$, depending on some parameters $\theta_k$, with mixing coefficients $\alpha_k$ such that $\sum_{k=1}^K \alpha_k = 1$, i.e.
\[
P(x) = \sum_{k=1}^{K} \alpha_k\, p_k(x|\theta_k).
\tag{2.2}
\]
The purpose is to compute the parameters $\alpha = (\alpha_1, \dots, \alpha_K)$, $\theta = (\theta_1, \dots, \theta_K)$ of the mixture model (2.2) in order to maximize the log-likelihood objective function
\[
\ln P(\mathcal{X}|\alpha, \theta) = \sum_{i=1}^{I} \left( \ln \sum_{k=1}^{K} \alpha_k\, p_k(x_i|\theta_k) \right)
\]
with respect to the data set $\mathcal{X}$. To simplify the computation of the extrema of the previous functional, the data set $\mathcal{X}$ is considered as incomplete, and a random variable $\mathcal{Y} = \{y_i\}_{i=1}^{I}$ is introduced, whose values $y_i$ specify the component of $P$ generating the point, i.e. $y_i \in \{1, \dots, K\}$ and $y_i = k$ if and only if the point $x_i$ has been generated by the distribution $p_k$. We define the responsibility of $x_i$ with respect to the $k$-th cluster as
\[
\gamma_k(x_i) := p_k(y_i = k \,|\, x_i, \theta_k),
\tag{2.3}
\]
i.e. as the probability of an assignment $y_i$ once we have observed $x_i$; the EM algorithm aims to find a maximum of the expected value of the complete-data log-likelihood functional
\[
\mathbb{E}_{\mathcal{Y}}[\ln p(\mathcal{X}, \mathcal{Y}|\theta, \alpha)] = \sum_{i=1}^{I} \sum_{k=1}^{K} \gamma_k(x_i)\, \ln(\alpha_k\, p_k(x_i|\theta_k))
\tag{2.4}
\]
(see formula (9.40) in [6]). It takes a particularly simple and explicit form if we consider parametric distributions $p_k(x|\theta_k)$ given by Gaussian distributions $\mathcal{N}(x|\mu_k, \Sigma_k)$. Starting from an arbitrary initialization $\mu_k^0, \Sigma_k^0, \alpha_k^0$, $k = 1, \dots, K$, and computing the optimality conditions for the functional (2.4) with respect to $\gamma_k$ for $\alpha_k, \mu_k, \Sigma_k$ fixed, and then exchanging the roles, we get the following two alternating steps:

E-step: Given $\mu_k^{n-1}, \Sigma_k^{n-1}, \alpha_k^{n-1}$, $k = 1, \dots, K$, compute the posterior probability (responsibility) of a datum $x_i$ to belong to the cluster $S_k$
\[
\gamma_k^n(x_i) = P(y_i = k \,|\, x_i, \mu_k^{n-1}, \Sigma_k^{n-1}) = \frac{\alpha_k^{n-1}\,\mathcal{N}(x_i|\mu_k^{n-1}, \Sigma_k^{n-1})}{\sum_{j=1}^{K} \alpha_j^{n-1}\,\mathcal{N}(x_i|\mu_j^{n-1}, \Sigma_j^{n-1})}.
\]

M-step: Update the parameters $\alpha, \mu, \Sigma$ by setting, for $k = 1, \dots, K$,
\[
\alpha_k^n = \frac{\sum_{i=1}^{I} \gamma_k^n(x_i)}{I}, \qquad
\mu_k^n = \frac{\sum_{i=1}^{I} x_i\,\gamma_k^n(x_i)}{\sum_{i=1}^{I} \gamma_k^n(x_i)},
\tag{2.5}
\]
\[
\Sigma_k^n = \frac{\sum_{i=1}^{I} \gamma_k^n(x_i)(x_i - \mu_k^n)(x_i - \mu_k^n)^t}{\sum_{i=1}^{I} \gamma_k^n(x_i)}.
\tag{2.6}
\]

After the convergence of the algorithm, under some appropriate stopping criterion, we interpret the responsibility $\gamma_k(x_i)$ as the probability that $x_i$ belongs to the cluster $S_k$.
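Analogously, the E-step and M-step for a Gaussian mixture can be condensed into a few lines; the following sketch implements formulas (2.5)-(2.6) (the density evaluation relies on scipy.stats, the initialization is simplified, and no safeguard against degenerate covariances is included):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture on data X of shape (I, d)."""
    X = np.asarray(X, dtype=float)
    I, d = X.shape
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0 / K)                       # mixing coefficients
    mu = X[rng.choice(I, size=K, replace=False)]      # random initial means
    Sigma = np.array([np.eye(d) for _ in range(K)])   # unit covariances
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k]
        dens = np.column_stack([
            alpha[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
            for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means and covariances, (2.5)-(2.6)
        Nk = gamma.sum(axis=0)
        alpha = Nk / I
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]
    return alpha, mu, Sigma, gamma
```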
It is interesting to observe that the K-means algorithm can be derived as a limit of the EM algorithm when the variances of the mixture model (2.4) go to zero. Indeed, assume for simplicity that the covariance matrix is $\Sigma_k = \sigma I$, $k = 1, \dots, K$, where $I$ is the identity $d \times d$ matrix and $\sigma > 0$, so that
\[
\gamma_k(x_i) = \frac{\alpha_k\, e^{-|x_i - \mu_k|^2/2\sigma}}{\sum_{j=1}^{K} \alpha_j\, e^{-|x_i - \mu_j|^2/2\sigma}}.
\]
For $\sigma \to 0$, the responsibility $\gamma_k(x_i)$, corresponding to the index $k$ minimizing the distance of the barycentre $\mu_k$ from $x_i$, tends to 1, while, for $j \neq k$, all the other responsibilities $\gamma_j(x_i)$ go to 0. Hence, at least formally, $\gamma_k(x_i) \to \mathbf{1}_{\{c_i = k\}}$ and the maximum likelihood functional (2.4) tends, up to a constant, to the K-means functional $J$ defined in (2.1) (see [6, Sec. 9.3.2]).

3 Mean Field Games models for Cluster Analysis

In this section, we describe some models for Cluster Analysis based on MFG theory. In the classical MFG theory (see [19]), we have a continuum of indistinguishable agents whose dynamics, driven by independent Brownian motions $W_t$, is given by
\[
\begin{cases}
dX_t = a_t\,dt + \sqrt{2\varepsilon}\,dW_t, & t > 0,\\
X_0 = x.
\end{cases}
\tag{3.1}
\]
The function $a_t$ represents the control which an agent chooses in order to minimize a cost functional $J$, to be specified according to the considered model. Most of the time, the cost $J$ comprises a term $L(x,a)$ depending on the current state $x$ and control $a$ of the agent, and a term $F(x,m)$ depending on the distribution $m$ of the other agents. For example, consider the long time average cost
\[
J(x,a) = \lim_{T \to +\infty} \frac{1}{T}\, \mathbb{E}_x \left\{ \int_0^T \left[ L(X_s, a_s) + F(X_s, m(X_s)) \right] ds \right\},
\]
where $m$ is a density function representing the distribution of the agents. Let $H(x,p) = \sup_{q \in \mathbb{R}^d} \{ p\cdot q - L(x,q) \}$ be the Hamiltonian function. Then the triplet $(u, m, \lambda)$, given by the ergodic cost $\lambda$, the corrector $u$ and the distribution of the agents $m$ corresponding to the choice of the optimal control $D_pH(x, Du)$, can be characterized as a solution of the MFG system
\[
\begin{cases}
-\varepsilon\Delta u(x) + H(x, Du(x)) + \lambda = F(x, m), & x \in \mathbb{R}^d,\\
\varepsilon\Delta m(x) + \mathrm{div}(D_pH(x, Du(x))\,m(x)) = 0, & x \in \mathbb{R}^d,\\
m \ge 0, \quad \int_{\mathbb{R}^d} m(x)\,dx = 1, \quad \int_{\mathbb{R}^d} u(x)\,dx = 0.
\end{cases}
\tag{3.2}
\]
The first equation is a Hamilton-Jacobi-Bellman equation for the couple $(\lambda, u)$; the second is a Fokker-Planck equation for the distribution $m$, which provides the density of the population at the position $x \in \mathbb{R}^d$; the normalization conditions are imposed since $u$ is defined up to a constant, and $m$ must represent a probability density function.

In some problems, as the ones we consider, several populations of agents, homogeneous within a given group but with different objectives and preferences from one group to the other, can interact in the same environment. In this case, the model can be described by a multi-population MFG system, where, for $k = 1, \dots, K$, each population satisfies a system such as (3.2), but with the cost function $F_k$ depending on $m_k$ and also on the distributions $m_j$, $j \neq k$, of the other populations (see [13, 19]).

We describe two approaches to Cluster Analysis based on MFG theory, which correspond respectively to the K-means algorithm and to the EM algorithm. We consider a data set given by the support of a measure with density function $f : \mathbb{R}^d \to \mathbb{R}$ such that
\[
f \ge 0, \qquad \int_{\mathbb{R}^d} f(x)\,dx = 1.
\tag{3.3}
\]
Note that the data set $\mathcal{X}$ in Section 2 corresponds to the case of an atomic measure
\[
P(x) = \frac{1}{I} \sum_{i=1}^{I} \delta_{x_i}(x).
\tag{3.4}
\]
The function $f$ describes the entire population of agents/data points, that we want to subdivide in $K$ sub-populations/clusters on the basis of some similar properties or characteristics, where $K$ is fixed a priori. Each population is represented by a density function $m_k$ and the similarity or proximity will be expressed by means of an appropriate cost function.
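As an aside, the dynamics (3.1) is straightforward to simulate; the following Euler-Maruyama sketch (our illustration; the feedback `a` is an arbitrary placeholder) can be used to visualize trajectories of a single agent:

```python
import numpy as np

def simulate_agent(x0, a, eps=1.0, dt=1e-3, n_steps=10_000, seed=0):
    """Euler-Maruyama discretization of dX_t = a(X_t) dt + sqrt(2*eps) dW_t."""
    rng = np.random.default_rng(seed)
    X = np.empty((n_steps + 1, len(x0)))
    X[0] = x0
    for n in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=len(x0))
        X[n + 1] = X[n] + a(X[n]) * dt + np.sqrt(2 * eps) * dW
    return X

# example: an affine drift pulling towards a point mu; for the quadratic
# models below, the optimal feedback -Du_k(x) = -eps * T_k^{-1} (x - nu_k)
# has exactly this form
mu = np.array([0.5, 0.5])
path = simulate_agent(np.zeros(2), lambda x: -(x - mu))
```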
3.1 A MFG model for the K-means algorithm

A MFG approach which mimics the K-means algorithm has been recently introduced in [14]. Since this approach is the basis for the EM algorithm we will subsequently study, we briefly describe it. It is worth noting that it is a hard clustering method, as the classical K-means algorithm, hence each data point can belong only to a single cluster $S_k$, $k = 1, \dots, K$.

Denote by $m = (m_1, \dots, m_K)$ and $u = (u_1, \dots, u_K)$ the vectors of the density functions and of the corresponding value functions. An agent of the $k$-th population moves according to the dynamics (3.1) and minimizes the infinite horizon cost functional
\[
u_k(x) = \inf_{a_s}\, \mathbb{E}_x \left[ \int_0^\infty \left( \frac{1}{2}|a_s|^2 + F_k(X_s, m_k(X_s)) \right) e^{-\rho s}\,ds \right],
\]
where $\rho > 0$, the discount factor, is a fixed constant.
The coupling cost is given by
\[
F_k(x, m_k) = 1 + \rho\,|x - y_k|^2,
\tag{3.5}
\]
where the barycentres $y_k$ are defined by
\[
y_k := \frac{\int_{\mathbb{R}^d} x\,m_k(x)\,dx}{\int_{\mathbb{R}^d} m_k(x)\,dx}.
\tag{3.6}
\]
Consider the multi-population MFG system, for $k = 1, \dots, K$,
\[
\begin{cases}
\rho u_k(x) - \varepsilon\Delta u_k(x) + \frac{1}{2}|Du_k(x)|^2 = F_k(x, m_k), & x \in \mathbb{R}^d,\\
\rho m_k(x) - \varepsilon\Delta m_k(x) - \mathrm{div}(Du_k(x)\,m_k(x)) = \rho\,\tilde f_k(x), & x \in \mathbb{R}^d,
\end{cases}
\tag{3.7}
\]
where $\tilde f_k$ is a Gaussian distribution with mean
\[
\tilde y_k = \frac{\int_{S_k} x\,f(x)\,dx}{\int_{S_k} f(x)\,dx}
\tag{3.8}
\]
and variance $\varepsilon$, and the cluster $S_k = S_k(u)$ related to $\tilde y_k$ is defined by
\[
S_k = \{ x \in \mathbb{R}^d : u_k(x) = \min_{j=1,\dots,K} u_j(x) \}.
\tag{3.9}
\]
The value function $u_k$ is intended as a measure of the distance from the barycentre $y_k$. We refer to [14] for a detailed interpretation of the system.

A family of vectors $\{\bar y_k\}_{k=1}^K$ is said to satisfy the self-consistency rule with the data set $f$ if
\[
\bar y_k = \frac{\int_{S_k} x\,f(x)\,dx}{\int_{S_k} f(x)\,dx}
\tag{3.10}
\]
for a family of clusters $\{S_k\}_{k=1}^K$.

In [14], the following two crucial results are proved:

(i) For every family of vectors $\{\bar y_k\}_{k=1}^K$ satisfying the self-consistency rule (3.10) for some family of clusters $\{S_k\}$, there exists a solution $(u,m)$ of (3.7) such that $\bar y_k = y_k$, with $y_k$ given by (3.6).

(ii) Given a solution $(u,m)$ of (3.7), the barycentres $y_k$ given by (3.6) satisfy the self-consistency rule (3.10) for $S_k = S_k(u)$ defined in (3.9).

It is important to observe that it is not reasonable to expect uniqueness of the solution to (3.7), since there exist in general several sets of barycentres $\{\bar y_k\}_{k=1}^K$ satisfying (3.10). This non-uniqueness property can be interpreted as the convergence to a local minimum in the classical K-means algorithm.

3.2 A MFG model for the EM algorithm

The aim is to find, by means of an appropriate multi-population MFG system, a mixture of $K$ density functions which gives an optimal representation of a data set, described by a density function $f$ as in (3.3). The resulting method is soft clustering, since each point/agent has a certain probability to belong to each one of the clusters.

In this section, in order to explain the method in a simple setting, we will describe a specific model for a Gaussian mixture, while the general method will be described in Section 4.

We introduce some preliminary notation. Given a mixture
\[
m(x) = \sum_{k=1}^{K} \alpha_k m_k(x),
\tag{3.11}
\]
where $\alpha_k \in (0,1)$, $\sum_{k=1}^K \alpha_k = 1$ and $m_k \ge 0$, $\int_{\mathbb{R}^d} m_k\,dx = 1$, we introduce the responsibilities
\[
\gamma_k(x) = \frac{\alpha_k m_k(x)}{m(x)}, \qquad x \in \mathbb{R}^d, \ k = 1, \dots, K.
\tag{3.12}
\]
We also define
\[
\mu_k = \frac{\int_{\mathbb{R}^d} x\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx},
\tag{3.13}
\]
\[
\Sigma_k = \frac{\int_{\mathbb{R}^d} (x-\mu_k)(x-\mu_k)^t\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx},
\tag{3.14}
\]
which are the mean $\mu_k \in \mathbb{R}^d$ and the covariance matrix $\Sigma_k \in \mathbb{R}^{d\times d}$ of the density function $m_k$ computed with respect to the data set $f$. Since, as we will see in the rest of the section, the function $m$ is given by a convex combination of Gaussian functions with positive weights $\alpha_k$, then $\gamma_k > 0$ in $\mathbb{R}^d$ and therefore also $\mu_k$ and $\Sigma_k$ are well defined.
Remark 3.1. Consider a finite data set $\mathcal{X} = \{x_1, \dots, x_I\}$ and the corresponding atomic measure $P$ defined in (3.4). Then, replacing the measure $f(x)\,dx$ in (3.13)-(3.14) with $P$, we obtain
\[
\mu_k = \frac{\sum_{i=1}^{I} \gamma_k(x_i)\,x_i}{\sum_{i=1}^{I} \gamma_k(x_i)}, \qquad
\Sigma_k = \frac{\sum_{i=1}^{I} \gamma_k(x_i)(x_i - \mu_k)(x_i - \mu_k)^t}{\sum_{i=1}^{I} \gamma_k(x_i)},
\]
which coincide with formulas (2.5), (2.6) used in the Maximization step of the EM algorithm.

Given the same type of dynamics (3.1) for all the populations, an agent of the $k$-th population wants to minimize the ergodic cost functional
\[
J_k(x,a) = \lim_{T \to +\infty} \frac{1}{T}\,\mathbb{E}_x \left\{ \int_0^T \left[ \frac{1}{2}|a_s|^2 + F_k(X_s, m_k(X_s), m(X_s)) \right] ds \right\},
\tag{3.15}
\]
where
\[
F_k(x, m_k, m) = \frac{1}{2}(x - \mu_k)^t (\Sigma_k^{-1})^t \Sigma_k^{-1} (x - \mu_k).
\tag{3.16}
\]
Note that the coupling among the various populations is given by the dependence of $\mu_k, \Sigma_k$ on the responsibility $\gamma_k$, which in turn depends on the measure $m$. Observe that $F_k$ is instead independent of $u_j$, $j = 1, \dots, K$. The coupling term forces a generic data point to distribute with a higher probability around the nearest mean point $\mu_k$, since it penalizes the square of the distance from it, with an attenuation factor given by the covariance matrix $\Sigma_k$.

At a formal level, the multi-population MFG system corresponding to the previous problem is given by
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + \frac{1}{2}|Du_k(x)|^2 + \lambda_k = \frac{1}{2}(x-\mu_k)^t(\Sigma_k^{-1})^t\Sigma_k^{-1}(x-\mu_k), & x \in \mathbb{R}^d,\\
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,Du_k(x)) = 0, & x \in \mathbb{R}^d,\\
\alpha_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx,\\
m_k \ge 0, \quad \int_{\mathbb{R}^d} m_k(x)\,dx = 1, \quad u_k(\mu_k) = 0,
\end{cases}
\tag{3.17}
\]
for $k = 1, \dots, K$, where $\gamma_k, \mu_k, \Sigma_k$ are defined as in (3.12)-(3.14).
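In the discrete setting of Remark 3.1, the quantities (3.12)-(3.14) are plain weighted averages over the data points. A small sketch of their computation (Gaussian components; the inputs alpha, mu, Sigma describe a given mixture, not the solution of (3.17)); as the remark states, the output coincides with the M-step quantities of Section 2:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities_and_moments(X, alpha, mu, Sigma):
    """Responsibilities (3.12) and data-weighted moments (3.13)-(3.14)
    for the atomic measure P = (1/I) sum_i delta_{x_i} of Remark 3.1."""
    I, d = X.shape
    K = len(alpha)
    dens = np.column_stack([
        alpha[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
        for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)     # gamma_k(x_i)
    Nk = gamma.sum(axis=0)
    mu_f = (gamma.T @ X) / Nk[:, None]                 # discrete (3.13)
    Sigma_f = np.empty((K, d, d))
    for k in range(K):
        Xc = X - mu_f[k]
        Sigma_f[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]   # discrete (3.14)
    return gamma, mu_f, Sigma_f
```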
Definition 3.2. We say that a Gaussian mixture $m(x) = \sum_{k=1}^K \beta_k\,\mathcal{N}(x|\nu_k, T_k)$, $x \in \mathbb{R}^d$, is consistent with the data set if
\[
\nu_k = \frac{\int_{\mathbb{R}^d} x\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx}, \qquad
T_k = \varepsilon\,\frac{\int_{\mathbb{R}^d} (x-\nu_k)(x-\nu_k)^t\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx}, \qquad
\beta_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx,
\tag{3.18}
\]
where the responsibilities $\gamma_k$ are defined as in (3.12) with $m_k(x) = \mathcal{N}(x|\nu_k, T_k)$.

Note that $\varepsilon$ can be considered as an additional parameter of the model, which we can tune to adjust the covariance matrix of the function $m_k$.
Proposition 3.3. Let $\varepsilon = 1$ and let $m$ be a Gaussian mixture consistent with the data set $f$. Then $m$ and $f$ have the same mean and covariance, i.e.
\[
\int_{\mathbb{R}^d} x\,m(x)\,dx = \int_{\mathbb{R}^d} x\,f(x)\,dx,
\tag{3.19}
\]
\[
\int_{\mathbb{R}^d} (x-\mu)(x-\mu)^t\,m(x)\,dx = \int_{\mathbb{R}^d} (x-\mu)(x-\mu)^t\,f(x)\,dx.
\tag{3.20}
\]
Proof. To prove (3.19), it suffices to apply the consistency conditions (3.18) to (3.11) and recall that $\sum_{k=1}^K \gamma_k(x) = 1$ to get
\[
\int_{\mathbb{R}^d} x\,m(x)\,dx = \sum_{k=1}^{K} \alpha_k \int_{\mathbb{R}^d} x\,m_k(x)\,dx
= \sum_{k=1}^{K} \alpha_k\,\frac{\int_{\mathbb{R}^d} x\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx}
= \sum_{k=1}^{K} \int_{\mathbb{R}^d} x\,\gamma_k(x) f(x)\,dx = \int_{\mathbb{R}^d} x\,f(x)\,dx.
\]
To prove (3.20), we first claim that
\[
\int_{\mathbb{R}^d} x x^t\,m(x)\,dx = \int_{\mathbb{R}^d} x x^t\,f(x)\,dx.
\tag{3.21}
\]
Indeed, multiplying the consistency conditions (3.18) by $\alpha_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx$, recalling that $\sum_{k=1}^K \gamma_k(x) = 1$ and summing over $k$, by the identity
\[
\sum_{k=1}^{K} \alpha_k \int_{\mathbb{R}^d} (x-\mu_k)(x-\mu_k)^t\,m_k(x)\,dx
= \sum_{k=1}^{K} \alpha_k\,\frac{\int_{\mathbb{R}^d} (x-\mu_k)(x-\mu_k)^t\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx},
\]
we easily get
\[
\int_{\mathbb{R}^d} x x^t\,m(x)\,dx - \sum_{k=1}^{K} \alpha_k \mu_k \mu_k^t
= \int_{\mathbb{R}^d} x x^t\,f(x)\,dx - \sum_{k=1}^{K} \alpha_k \mu_k \mu_k^t,
\]
and therefore the claim. Combining (3.21) with the identity (3.19), we get (3.20).

In the next result, we show that if there exists a mixture of Gaussian densities consistent with the data set $f$, then it solves (3.17) for appropriate $(u_k, \lambda_k)$, $k = 1, \dots, K$.
Proposition 3.4. Let $m$ be a mixture of Gaussian densities, i.e. $m(x) = \sum_{k=1}^K \beta_k\,\mathcal{N}(x|\nu_k, T_k)$ for $x \in \mathbb{R}^d$, consistent with the data set $f$. Then the family of quadruples $(u_k, \lambda_k, m_k, \beta_k)$, $k = 1, \dots, K$, with
\[
u_k(x) = \frac{\varepsilon}{2}(x - \mu_k)^t T_k^{-1} (x - \mu_k), \qquad
\lambda_k = \varepsilon^2\,\mathrm{Tr}(T_k^{-1}), \qquad
m_k(x) = \mathcal{N}(x|\mu_k, T_k), \qquad
\beta_k = \int_{\mathbb{R}^d} \gamma_k(x) f(x)\,dx,
\]
is a solution of (3.17).

Proof. Consider a component of the mixture $m_k$, which is a Gaussian density function
\[
\mathcal{N}(x|\nu_k, T_k) = \frac{1}{\sqrt{(2\pi)^d |T_k|}}\, e^{-\frac{1}{2}(x-\nu_k)^t T_k^{-1}(x-\nu_k)}.
\]
Recall that $(x - \nu_k) \in \mathbb{R}^d$ and $T_k^{-1} \in \mathbb{R}^{d\times d}$ is the inverse of the covariance matrix $T_k$. We have
\[
Dm_k(x) = -m_k(x)\,T_k^{-1}(x-\nu_k),
\]
\[
D^2 m_k(x) = m_k(x)\left[ (T_k^{-1}(x-\nu_k))(T_k^{-1}(x-\nu_k))^t - T_k^{-1} \right],
\]
\[
\Delta m_k(x) = m_k(x)\left[ (T_k^{-1}(x-\nu_k))^t (T_k^{-1}(x-\nu_k)) - \mathrm{Tr}(T_k^{-1}) \right],
\]
where $x \in \mathbb{R}^d$, $D^2$ denotes the Hessian and $\mathrm{Tr}$ the trace. Furthermore, given a matrix $B_k \in \mathbb{R}^{d\times d}$ and assuming that
\[
u_k(x) = \frac{1}{2}(x - \rho_k)^t B_k (x - \rho_k),
\]
we have $Du_k(x) = B_k(x - \rho_k)$ and $\Delta u_k(x) = \mathrm{Tr}(B_k)$. Substituting $m_k$ and $u_k$ as above in the second equation of (3.17), we get
\[
\varepsilon\left[ (T_k^{-1}(x-\nu_k))^t(T_k^{-1}(x-\nu_k)) - \mathrm{Tr}(T_k^{-1}) \right]
- (T_k^{-1}(x-\nu_k))^t(B_k(x-\rho_k)) + \mathrm{Tr}(B_k) = 0,
\]
which is satisfied for $\rho_k = \nu_k$ and $B_k = \varepsilon T_k^{-1}$. Hence we get
\[
u_k(x) = \frac{\varepsilon}{2}(x - \nu_k)^t T_k^{-1}(x - \nu_k) + C_k
\tag{3.22}
\]
for some $C_k \in \mathbb{R}$. Replacing $u_k$, given by (3.22), in the Hamilton-Jacobi equation in (3.17), we have
\[
-\varepsilon^2\,\mathrm{Tr}(T_k^{-1}) + \frac{\varepsilon^2}{2}\,|T_k^{-1}(x-\nu_k)|^2 + \lambda_k = F_k(x, m_k, m).
\]
Recalling that $F_k(x, m_k, m) = \frac{1}{2}(x-\mu_k)^t(\Sigma_k^{-1})^t\Sigma_k^{-1}(x-\mu_k)$ and that, by (3.18), $\nu_k = \mu_k$ and $T_k = \varepsilon\Sigma_k$, we get that the previous equation is satisfied for
\[
\lambda_k = \varepsilon^2\,\mathrm{Tr}(T_k^{-1}).
\]
Taking into account the normalization condition $u_k(\mu_k) = 0$, we finally get $C_k = 0$, and then the result.

We now prove a reverse result, i.e. a solution of the MFG system (3.17) gives a mixture of Gaussian densities consistent with the data set $f$.
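Before stating the converse, the computations in the proof above can be double-checked symbolically. The following sketch verifies, in dimension one and with sympy, that the pair $u_k = \frac{\varepsilon}{2}(x-\nu)^2/\tau$, $\lambda_k = \varepsilon^2/\tau$, together with the Gaussian $m_k = \mathcal{N}(x|\nu, \tau)$, solves both equations of (3.17) when $\tau = \varepsilon\sigma$ (here $\sigma$ denotes the variance appearing in the one-dimensional coupling (3.25) below, and by consistency $\nu = \mu_k$):

```python
import sympy as sp

x = sp.symbols('x', real=True)
nu = sp.symbols('nu', real=True)
tau, eps = sp.symbols('tau eps', positive=True)

# candidate solution from Proposition 3.4 in dimension one (T_k = tau)
u   = eps * (x - nu)**2 / (2 * tau)                   # value function
lam = eps**2 / tau                                    # ergodic constant
m   = sp.exp(-(x - nu)**2 / (2 * tau)) / sp.sqrt(2 * sp.pi * tau)

sigma = tau / eps                 # consistency: T_k = eps * Sigma_k
F = (x - nu)**2 / (2 * sigma**2)  # 1-D coupling cost, cf. (3.25)

hjb = -eps * sp.diff(u, x, 2) + sp.diff(u, x)**2 / 2 + lam - F
fp  = eps * sp.diff(m, x, 2) + sp.diff(m * sp.diff(u, x), x)

print(sp.simplify(hjb), sp.simplify(fp))   # prints: 0 0
```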
Proposition 3.5. Let $\{(u_k, \lambda_k, m_k, \alpha_k)\}_{k=1}^K$ be a solution of the MFG system (3.17). Then $m(x) = \sum_{k=1}^K \alpha_k m_k(x)$, $x \in \mathbb{R}^d$, is a Gaussian mixture consistent with the data set $f$.

Proof. Firstly, for fixed $k = 1, \dots, K$, consider the Hamilton-Jacobi equation
\[
-\varepsilon\Delta u_k(x) + \frac{1}{2}|Du_k(x)|^2 + \lambda_k = \frac{1}{2}(x-\mu_k)^t(\Sigma_k^{-1})^t\Sigma_k^{-1}(x-\mu_k)
\tag{3.23}
\]
with the normalization condition $u_k(\mu_k) = 0$. By [3], the unique solution of the previous equation is given by the couple $(u_k, \lambda_k) = \left( \frac{1}{2}(x-\mu_k)^t\Sigma_k^{-1}(x-\mu_k),\ \varepsilon\,\mathrm{Tr}(\Sigma_k^{-1}) \right)$. Replacing $Du_k(x) = \Sigma_k^{-1}(x-\mu_k)$ in the Fokker-Planck equation, we obtain the equation
\[
\varepsilon\Delta m_k(x) + Dm_k(x)\cdot\Sigma_k^{-1}(x-\mu_k) + m_k(x)\,\mathrm{Tr}(\Sigma_k^{-1}) = 0.
\tag{3.24}
\]
Taking into account the normalization conditions $m_k \ge 0$, $\int m_k\,dx = 1$, we have that the solution of (3.24) is given by a Gaussian density $m_k(x) = \mathcal{N}(x|\nu_k, T_k)$, where, by (3.24), the constants $\nu_k, T_k$ have to satisfy the identity
\[
\varepsilon\left( (T_k^{-1}(x-\nu_k))^t(T_k^{-1}(x-\nu_k)) - \mathrm{Tr}(T_k^{-1}) \right)
- (T_k^{-1}(x-\nu_k))^t(\Sigma_k^{-1}(x-\mu_k)) + \mathrm{Tr}(\Sigma_k^{-1}) = 0.
\]
Hence $m_k$ is a solution of (3.24) if $\nu_k = \mu_k$, $T_k = \varepsilon\Sigma_k$. If we set $m(x) = \sum_{k=1}^K \alpha_k m_k(x)$, with $m_k(x) = \mathcal{N}(x|\mu_k, \varepsilon\Sigma_k)$ and $\alpha_k$ as in (3.17), we get a Gaussian mixture consistent with the data set $f$.

Lastly, we state an existence result for solutions to (3.17) (since the proof is similar to the one of Theorem 4.4, we omit the details).
Proposition 3.6. There exists a solution $\{(u_k, \lambda_k, m_k, \alpha_k)\}_{k=1}^K$ to (3.17).

We also observe that, as for (3.7), it is not reasonable to expect uniqueness for the solution of (3.17).
Remark 3.7. In the one-dimensional case, the coupling term is
\[
F_k(x, m_k, m) = \frac{1}{2}\left| \frac{x - \mu_k}{\sigma_k} \right|^2,
\tag{3.25}
\]
where
\[
\mu_k = \frac{\int_{\mathbb{R}} x\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}} \gamma_k(x) f(x)\,dx}, \qquad
\sigma_k = \frac{\int_{\mathbb{R}} (x-\mu_k)^2\,\gamma_k(x) f(x)\,dx}{\int_{\mathbb{R}} \gamma_k(x) f(x)\,dx}.
\tag{3.26}
\]
A Gaussian mixture $m(x) = \sum_{k=1}^K \beta_k\,\mathcal{N}(x|\nu_k, \tau_k)$ is said to be consistent with the data set, in the sense of Definition 3.2, if $\nu_k = \mu_k$, $\tau_k = \varepsilon\sigma_k$, with $\mu_k$ and $\sigma_k$ defined as in (3.26), and $\beta_k = \int_{\mathbb{R}} \gamma_k(x) f(x)\,dx$.

A general class of linear-quadratic MFG systems admitting a Gaussian distribution as solution of the Fokker-Planck equation was studied in [3].
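Consistent mixtures in the sense of Remark 3.7 can be computed in practice by an EM-like fixed-point iteration of the conditions (3.18); by Propositions 3.4-3.5, these correspond to solutions of (3.17). A sketch for a one-dimensional data density $f$ sampled on a uniform grid, with a simple rectangle rule for the integrals (this anticipates the two-step scheme of Section 5, with the Fokker-Planck solve replaced by the explicit Gaussian solution):

```python
import numpy as np

def consistent_mixture_1d(x, f, K, eps=1.0, n_iter=200, seed=0):
    """Fixed-point iteration of the 1-D consistency conditions:
    nu_k = mu_k, tau_k = eps * sigma_k, beta_k = int gamma_k f dx,
    with Gaussian components m_k = N(. | nu_k, tau_k)."""
    dx = x[1] - x[0]
    rng = np.random.default_rng(seed)
    beta = np.full(K, 1.0 / K)
    nu = rng.uniform(x[0], x[-1], size=K)
    tau = np.full(K, 0.01)
    for _ in range(n_iter):
        comp = np.stack([np.exp(-(x - nu[k])**2 / (2 * tau[k]))
                         / np.sqrt(2 * np.pi * tau[k]) for k in range(K)])
        gamma = beta[:, None] * comp / (beta[:, None] * comp).sum(axis=0)
        beta = (gamma * f).sum(axis=1) * dx              # beta_k
        w = gamma * f * dx                               # gamma_k f dx
        nu = (w * x).sum(axis=1) / w.sum(axis=1)         # mu_k, (3.26)
        sigma = (w * (x - nu[:, None])**2).sum(axis=1) / w.sum(axis=1)
        tau = eps * sigma                                # tau_k = eps*sigma_k
    return beta, nu, tau
```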
Remark 3.8.
If we consider a coupling cost $F_k$ in (3.16) independent of $\Sigma_k$, i.e. $\Sigma_k = \sigma I$, where $I$ is the identity matrix, we have $F_k(x, m_k, m) = \frac{1}{2\sigma^2}(x-\mu_k)^t(x-\mu_k)$. Then, the solution of the MFG system (3.17) gives a mixture of Gaussian densities $m(x) = \sum_{k=1}^K \alpha_k\,\mathcal{N}(x|\mu_k, \varepsilon\sigma I)$.

Arguing formally and sending $\sigma \to 0$, the responsibility $\gamma_k(x)$, corresponding to the index $k$ minimizing the distance of $x$ from the barycentre $\mu_k$, tends to 1, while the other responsibilities $\gamma_j(x)$, for $j \neq k$, tend to 0. Set $S_k = \{x \in \mathbb{R}^d : \lim_{\sigma\to 0}\gamma_k(x) = 1\}$ and observe that these sets coincide with the clusters defined in (3.9) for the MFG K-means model. Moreover, the consistency condition for the mean in (3.18) converges to
\[
\int_{\mathbb{R}^d} x\,m_k(x)\,dx = \frac{\int_{S_k} x\,f(x)\,dx}{\int_{S_k} f(x)\,dx},
\]
which coincides with (3.10). Hence the MFG K-means model can be interpreted, at least formally, as the limit of the MFG EM model for $\sigma \to 0$ (we plan to carry out a more rigorous analysis of this observation in the future).

Sending $\varepsilon \to 0^+$ in (3.17) has a different effect on the model. Indeed, in this case, the problem degenerates into a first order MFG system and the finite mixture $m$ tends to a convex combination of Dirac functions concentrated in the barycentres $\mu_k$.

Remark 3.9. The approach developed in this section is a soft clustering model, as the classical EM algorithm, while the one in Section 3.1 is a hard clustering one, as the K-means algorithm. In both models, the coupling terms have the effect of aggregating the agents around the closer barycentre. But, while in the MFG K-means model the clusters are circular, in the model discussed in this section, thanks to the additional parameters given by the covariance matrices, the clusters are ellipsoidal and therefore, in some situations, give a better approximation of the data set. This advantage of the EM algorithm with respect to the K-means algorithm is well known also in the classical setting discussed in Section 2.

We consider an ergodic MFG system, see (3.17), instead of a discounted MFG system as (3.7), but this does not change the model in an essential way. The choice of an ergodic system has been made in order to apply the algorithm developed in [9]. We also observe that system (3.17) has the classical structure of a MFG system, where the Fokker-Planck equation is the adjoint of the linearization of the Hamilton-Jacobi-Bellman equation, while this property is not satisfied by system (3.7).

4 A general MFG model for Cluster Analysis

In this section, we formulate a general MFG model for a finite mixture of probability densities. We consider a smooth, bounded domain $\Omega$ containing the support of the data set $f$, which satisfies the conditions in (3.3). Given a mixture of density functions as in (3.11) and defining the corresponding responsibilities $\gamma_k$ as in (3.12), we consider the following multi-population MFG system
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + H_k(x, Du_k(x)) + \lambda_k = F_k(x, m_k, m), & x \in \Omega, & \text{(HJ)}\\
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,D_pH_k(x, Du_k(x))) = 0, & x \in \Omega, & \text{(FP)}\\
\partial_n u_k(x) = 0, & x \in \partial\Omega, & \text{(HJn)}\\
\varepsilon\,\partial_n m_k(x) + m_k(x)\,D_pH_k(x, Du_k(x))\cdot\vec n = 0, & x \in \partial\Omega, & \text{(FPn)}\\
\alpha_k = \int_\Omega \gamma_k(x) f(x)\,dx, & & \text{(COEFF)}\\
m_k \ge 0, \quad \int_\Omega m_k(x)\,dx = 1, \quad \int_\Omega u_k(x)\,dx = 0, & & \text{(NORM)}
\end{cases}
\tag{4.1}
\]
for $k = 1, \dots, K$, where $\vec n$ is the outward normal to the boundary of $\Omega$, $\partial_n$ denotes the normal derivative and $\gamma_k(x) = \alpha_k m_k(x)/m(x)$, $k = 1, \dots, K$, are the responsibilities.
The boundary conditions in (4.1) are given by the Neumann condition (HJn) for $u_k$ and its dual (FPn) for $m_k$. Moreover, the normalization conditions (NORM) are imposed since $u_k$ is defined up to a constant and $m_k$ represents a probability density function.

In the rest of the paper, we will make the following assumptions:
(H1) $\Omega$ is a $C^2$ bounded domain of $\mathbb{R}^d$;

(H2) the Hamiltonians $H_k$, $k = 1, \dots, K$, are of the form
\[
H_k(x,p) = R_k|p|^r - H_k^0(x),
\]
with $R_k > 0$, $r > 1$, $H_k^0 \in C^2(\bar\Omega)$, $\partial_n H_k^0(x) \ge 0$ for $x \in \partial\Omega$;

(H3) the coupling costs $F_k : \Omega \times \mathbb{R}^d \times \mathbb{R}^{d\times d} \to \mathbb{R}$ are continuous, uniformly bounded functions of the form
\[
F_k(x, m_k, m) = F_k(x, \mu_k, \Sigma_k),
\tag{4.2}
\]
where
\[
\mu_k = \int_\Omega x\,\frac{m_k(x)}{m(x)}\,f(x)\,dx,
\tag{4.3}
\]
\[
\Sigma_k = \int_\Omega (x-\mu_k)(x-\mu_k)^t\,\frac{m_k(x)}{m(x)}\,f(x)\,dx.
\tag{4.4}
\]
Moreover, for $\mu, \Sigma$ fixed, $F_k(\cdot, \mu, \Sigma) \in W^{1,\infty}(\Omega)$.

Remark 4.1. If $\alpha_k = \int_\Omega \gamma_k(x) f(x)\,dx \neq 0$, then
\[
\mu_k = \frac{1}{\alpha_k} \int_\Omega x\,\frac{\alpha_k m_k(x)}{m(x)}\,f(x)\,dx = \frac{\int_\Omega x\,\gamma_k(x) f(x)\,dx}{\alpha_k} = \frac{\int_\Omega x\,\gamma_k(x) f(x)\,dx}{\int_\Omega \gamma_k(x) f(x)\,dx},
\]
which has to be compared with the definition of $\mu_k$ in (3.13) (similarly for $\Sigma_k$). As we will see in the proof of the existence theorem, since $m(x) > 0$ for all $x \in \Omega$, definitions (4.3)-(4.4) avoid the problem of having $\alpha_k = 0$ in the definition of $\mu_k, \Sigma_k$.
Remark 4.2. In view of applications different from Cluster Analysis, it is possible to consider more general Hamiltonians and coupling terms, for example coupling terms $F_k$ involving the Kullback-Leibler divergence
\[
m_k \ln\left( \frac{q_k}{m_k} \right),
\]
where $q_k$ is a function of the data set $f$. But, to maintain the parallelism with the classical EM algorithm, we prefer to restrict to the simpler case (4.2).

We now give the definition of solution for the MFG system (4.1). Note that the coefficients of the mixture $\alpha_k$ are part of the unknowns of the system.
Definition 4.3. A solution of (4.1) is a family of $K$ quadruples $(u_k, \lambda_k, m_k, \alpha_k)$ such that $u_k \in C^2(\Omega)$, $\lambda_k \in \mathbb{R}$, $m_k \in W^{1,2}(\Omega)$, $\alpha_k \in [0,1]$, and

(i) $(u_k, \lambda_k)$ satisfies (HJ)-(HJn) in point-wise sense.

(ii) $m_k$ satisfies (FP)-(FPn) in weak sense, i.e.
\[
\varepsilon \int_\Omega Dm_k(x)\cdot D\phi(x)\,dx + \int_\Omega m_k(x)\,D_pH_k(x, Du_k(x))\cdot D\phi(x)\,dx = 0
\]
for all $\phi \in W^{1,2}(\Omega)$.

(iii) The coefficients $\alpha_k$ satisfy (COEFF) (note that this implies $\alpha_k \in [0,1]$ and $\sum_{k=1}^K \alpha_k = 1$).

(iv) The normalization conditions (NORM) are satisfied.

We are going to prove existence of a solution to (4.1). The proof is similar to the one of [13, Theorem 4]; the main difference is the presence of the additional unknowns given by the coefficients $\alpha_k$.
Theorem 4.4. Assume (H1)-(H3). Then, there exists a solution $(u_k, \lambda_k, m_k, \alpha_k)$, $k = 1, \dots, K$, of (4.1) in the sense of Definition 4.3. Moreover $u_k \in C^2(\Omega)$, $m_k \in W^{1,p}(\Omega)$ for any $p \ge 1$ and $m_k \ge \delta > 0$ for some constant $\delta$.

Proof. For fixed $p > d$, we define the sets
\[
\mathcal{P} = \left\{ m \in L^1(\Omega) \ \text{s.t.} \ \int_\Omega m(x)\,dx = 1 \right\},
\qquad
\mathcal{D} = \mathcal{P} \cap \left\{ m \in W^{1,\infty}(\Omega) \ \text{s.t.} \ \|m\|_{W^{1,p}(\Omega)} \le C,\ m(x) \ge \delta > 0 \right\},
\]
where $C$, $\delta$ are two constants to be fixed later, and we consider the set
\[
\mathcal{C} = \left\{ (\alpha, m) = (\alpha_1, \dots, \alpha_K, m_1, \dots, m_K) :\ \alpha_k \in [0,1],\ m_k \in \mathcal{D},\ k = 1, \dots, K,\ \sum_{k=1}^{K} \alpha_k = 1 \right\}.
\tag{4.5}
\]
It is easy to see that the set $\mathcal{C}$ is convex. Moreover, it is compact with respect to the standard topology of $\mathbb{R} \times \cdots \times \mathbb{R} \times C(\bar\Omega) \times \cdots \times C(\bar\Omega)$, since $\alpha_k \in [0,1]$, $k = 1, \dots, K$, and the set $\mathcal{D}$ is compactly embedded in $C^{0,\beta}(\bar\Omega)$ with $\beta \le (p-d)/p$ (see Theorem 7.26 in [17]).

We define the map $\Psi : \mathcal{C} \to \mathcal{C}$ which, to a vector $(\beta, \eta) = (\beta_1, \dots, \beta_K, \eta_1, \dots, \eta_K)$, associates the vector $(\alpha, m) = (\alpha_1, \dots, \alpha_K, m_1, \dots, m_K)$ defined in the following way:
Step (i): Given $(\beta, \eta) \in \mathcal{C}$, we define the responsibilities
\[
\gamma_k(x) = \frac{\beta_k \eta_k(x)}{\sum_{j=1}^{K} \beta_j \eta_j(x)}, \qquad k = 1, \dots, K.
\tag{4.6}
\]
Note that the responsibilities are well defined since, by $\eta_k \ge \delta > 0$ and $\sum_{k=1}^K \beta_k = 1$, we get $\sum_{j=1}^K \beta_j \eta_j(x) \ge \delta$ for $x \in \Omega$.
Step (ii): Given $\gamma_k(x)$ as in Step (i), we define $\mu_k, \Sigma_k$ as in (4.3)-(4.4) and we consider the Hamilton-Jacobi-Bellman equations
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + H_k(x, Du_k(x)) + \lambda_k = F_k(x, \mu_k, \Sigma_k), & x \in \Omega,\\
\partial_n u_k(x) = 0, & x \in \partial\Omega,\\
\int_\Omega u_k(x)\,dx = 0,
\end{cases}
\tag{4.7}
\]
for $k = 1, \dots, K$. Note that the previous systems are independent of each other. By [20, Theorem II.1], (4.7) admits a unique solution $(u_k, \lambda_k) \in C^2(\bar\Omega) \times \mathbb{R}$ with
\[
|\lambda_k| \le C, \qquad \|u_k\|_{W^{1,\infty}(\Omega)} \le C,
\tag{4.8}
\]
for some positive constant $C$ which depends only on $\Omega$, $H_k$ and $\|F_k(\cdot, \mu_k, \Sigma_k)\|_{L^\infty}$.
Step (iii): Given $u_k$ as in Step (ii), we consider the problems
\[
\begin{cases}
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,D_pH_k(x, Du_k(x))) = 0, & x \in \Omega,\\
\varepsilon\,\partial_n m_k(x) + m_k(x)\,D_pH_k(x, Du_k(x))\cdot\vec n(x) = 0, & x \in \partial\Omega,\\
m_k \ge 0, \quad \int_\Omega m_k(x)\,dx = 1,
\end{cases}
\tag{4.9}
\]
for $k = 1, \dots, K$. Since (4.8) implies that $|D_pH_k(x, Du_k(x))|$ is uniformly bounded in $\Omega$, by Theorems II.4.4, II.4.5, II.4.7 in [4], it follows that problem (4.9) admits a unique weak solution $m_k \in W^{1,2}(\Omega)$. Moreover, $m_k \in W^{1,p}(\Omega)$ for all $p \ge 1$, $m_k \in C(\bar\Omega)$ and
\[
\|m_k\|_{W^{1,p}(\Omega)} \le \hat C,
\tag{4.10}
\]
\[
\hat\delta \le m_k(x) \le \hat\delta^{-1} \qquad \forall x \in \Omega,
\tag{4.11}
\]
for constants $\hat C$, $\hat\delta > 0$ depending on $\|D_pH_k(x, Du_k(x))\|_{L^\infty(\Omega)}$ and therefore, recalling (4.8), on $\Omega$, $H_k$ and $\|F_k\|_{L^\infty}$. Hence, in (4.5), fixing $C$ larger than $\hat C$ in (4.10) and $\delta$ smaller than $\hat\delta$ in (4.11), we conclude $m_k \in \mathcal{D}$, $k = 1, \dots, K$.

Furthermore, we define
\[
\alpha_k := \int_\Omega \gamma_k(x) f(x)\,dx.
\tag{4.12}
\]
Since $0 \le \gamma_k(x) \le 1$ and $\sum_{k=1}^K \gamma_k(x) = 1$ for $x \in \Omega$, it follows that $\alpha_k \in [0,1]$ and $\sum_{k=1}^K \alpha_k = 1$.

We conclude that the map $\Psi$, which to the vector $(\beta, \eta)$ associates the vector $(\alpha, m)$ given by (4.9)-(4.12), is well defined and maps the set $\mathcal{C}$ into itself.

We now show that the map $\Psi$ is continuous with respect to the $\mathbb{R} \times \cdots \times \mathbb{R} \times C(\bar\Omega) \times \cdots \times C(\bar\Omega)$ topology. Let $(\beta^n, \eta^n)$ be a sequence in $\mathcal{C}$ such that $\beta_k^n$ converges to $\beta_k$ in $\mathbb{R}$ and $\eta_k^n$ converges to $\eta_k$ uniformly in $\Omega$, for all $k = 1, \dots, K$. Since $\sum_{k=1}^K \beta_k^n \eta_k^n$ converges uniformly to $\sum_{k=1}^K \beta_k \eta_k$ and $\eta_k^n, \eta_k \ge \delta > 0$, for $\gamma_k^n, \gamma_k$ as in (4.6) it follows that
\[
\gamma_k^n \to \gamma_k \quad \text{for } n \to \infty, \text{ uniformly in } \Omega.
\tag{4.13}
\]
Then, by $0 \le \gamma_k^n, \gamma_k \le 1$, (4.13) and the dominated convergence theorem, we have
\[
\lim_{n\to\infty} \alpha_k^n = \lim_{n\to\infty} \int_\Omega \gamma_k^n(x) f(x)\,dx = \int_\Omega \gamma_k(x) f(x)\,dx = \alpha_k.
\tag{4.14}
\]
Moreover, since $\sum_{k=1}^K \beta_k^n \eta_k^n,\ \sum_{k=1}^K \beta_k \eta_k \ge \delta$ and $\eta_k^n$ converges to $\eta_k$ uniformly, we have
\[
\frac{\eta_k^n}{\sum_{j=1}^{K} \beta_j^n \eta_j^n} \longrightarrow \frac{\eta_k}{\sum_{j=1}^{K} \beta_j \eta_j}, \quad \text{for } n \to \infty, \text{ uniformly in } \Omega.
\]
Recalling definitions (4.3)-(4.4), it follows that
\[
\lim_{n\to\infty} \mu_k^n = \lim_{n\to\infty} \int_\Omega x\,\frac{\eta_k^n(x) f(x)}{\sum_{j=1}^{K} \beta_j^n \eta_j^n(x)}\,dx = \int_\Omega x\,\frac{\eta_k(x) f(x)}{\sum_{j=1}^{K} \beta_j \eta_j(x)}\,dx = \mu_k
\]
and, similarly, $\lim_{n\to\infty} \Sigma_k^n = \Sigma_k$. It follows that $F_k(x, \mu_k^n, \Sigma_k^n)$ converges uniformly to $F_k(x, \mu_k, \Sigma_k)$. Denoting by $(u_k^n, \lambda_k^n)$ and $(u_k, \lambda_k)$, $k = 1, \dots, K$, the solutions of (4.7) corresponding respectively to $F_k(x, \mu_k^n, \Sigma_k^n)$ and $F_k(x, \mu_k, \Sigma_k)$, by a standard stability result for (4.7) we have that $\lambda_k^n$ converges to $\lambda_k$ and $u_k^n$ converges to $u_k$ uniformly in $\Omega$. Moreover $Du_k^n \to Du_k$ locally uniformly in $\Omega$ (see [13, Theorem 4] for details). Let $m_k^n$, $m_k$ be the solutions of (4.9) corresponding to $u_k^n$ and $u_k$. By the estimate (4.10), the functions $m_k^n$ are equi-bounded in $W^{1,p}$. By the local uniform convergence of $Du_k^n$ to $Du_k$, passing to the limit in the weak formulation of (4.9) yields (possibly passing to a subsequence) that $m_k^n$ converges to a weak solution of the problem, which, by uniqueness, coincides with $m_k$. Hence $m_k^n$ converges uniformly to $m_k$, $k = 1, \dots, K$, and, recalling (4.14), we conclude that the map $\Psi$ is continuous.

Applying the Schauder Fixed Point Theorem to the map $\Psi : \mathcal{C} \to \mathcal{C}$, we get that there exists $(\alpha, m) \in \mathcal{C}$ such that $\Psi(\alpha, m) = (\alpha, m)$, hence a solution to (4.1).

5 Numerical experiments

In this section, we present an algorithm for the cluster problem, based on the numerical approximation of (4.1). Our aim is to compute the finite mixture model (3.11) and the corresponding responsibilities $\gamma_k(x)$. We use an approach similar to the classical EM algorithm. Indeed, rather than solving the full problem (4.1) in the coupled unknowns $(u_k, \lambda_k, m_k, \alpha_k)$ for $k = 1, \dots, K$, we split it in the iterative resolution of $K$ independent sub-problems as follows. Given an arbitrary initialization $\alpha_1^0, \dots, \alpha_K^0$, $m_1^0, \dots, m_K^0$ for the mixture model, at the $n$-th iteration we alternate two steps:
E-step. For $k = 1, \dots, K$, compute the new responsibilities, mixture coefficients, barycentres and covariance matrices
\[
\gamma_k^n(x) = \frac{\alpha_k^{n-1} m_k^{n-1}(x)}{m^{n-1}(x)}, \qquad \alpha_k^n = \int_\Omega \gamma_k^n(x) f(x)\,dx,
\tag{5.1}
\]
\[
\mu_k^n = \int_\Omega x\,\frac{m_k^{n-1}(x)}{m^{n-1}(x)}\,f(x)\,dx, \qquad
\Sigma_k^n = \int_\Omega (x-\mu_k^n)(x-\mu_k^n)^t\,\frac{m_k^{n-1}(x)}{m^{n-1}(x)}\,f(x)\,dx.
\]
M-step. For $k = 1, \dots, K$, compute the new coupling cost $F_k(x, \mu_k^n, \Sigma_k^n)$ and solve the $K$ (independent) systems
\[
\begin{cases}
-\varepsilon\Delta u_k(x) + H_k(x, Du_k(x)) + \lambda_k = F_k(x, \mu_k^n, \Sigma_k^n), & x \in \Omega,\\
\varepsilon\Delta m_k(x) + \mathrm{div}(m_k(x)\,D_pH_k(x, Du_k(x))) = 0, & x \in \Omega,\\
\partial_n u_k(x) = 0, & x \in \partial\Omega,\\
\varepsilon\,\partial_n m_k(x) + m_k(x)\,D_pH_k(x, Du_k(x))\cdot\vec n = 0, & x \in \partial\Omega,\\
m_k \ge 0, \quad \int_\Omega m_k(x)\,dx = 1, \quad \int_\Omega u_k(x)\,dx = 0.
\end{cases}
\tag{5.2}
\]
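In compact form, the resulting iteration can be outlined as follows (a schematic one-dimensional Python sketch; `solve_mfg_system` stands for a black-box solver for (5.2), e.g. the Newton-type method of [9], and its name and signature are hypothetical placeholders):

```python
import numpy as np

def mfg_em(f, x, alpha, m, solve_mfg_system, eps=1.0, tol=1e-6, max_iter=100):
    """Alternate the E-step (5.1) and the M-step (5.2) until barycentres
    and covariances stagnate. Here x collects the (uniform, 1-D) grid nodes,
    f the data density at the nodes, m[k] the k-th component at the nodes."""
    K = len(alpha)
    dx = x[1] - x[0]
    mu, Sigma = np.zeros(K), np.ones(K)
    for _ in range(max_iter):
        mix = sum(alpha[k] * m[k] for k in range(K))
        mu_old, Sigma_old = mu.copy(), Sigma.copy()
        for k in range(K):
            # E-step: rectangle-rule quadrature of the integrals in (5.1)
            w = m[k] / mix * f
            alpha[k] = alpha[k] * np.sum(w) * dx
            mu[k] = np.sum(x * w) * dx
            Sigma[k] = np.sum((x - mu[k])**2 * w) * dx
            # M-step: solve the k-th MFG system (5.2) with the new cost
            u_k, m[k], lam_k = solve_mfg_system(mu[k], Sigma[k], eps)
        if max(np.abs(mu - mu_old).max(), np.abs(Sigma - Sigma_old).max()) < tol:
            break
    return alpha, m, mu, Sigma
```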
Note that the coupling between the mixture components is entirely embedded in the E-step, and this significantly reduces the computational effort to perform the M-step; in particular, the $K$ sub-problems (5.2) can be solved in parallel. Moreover, for fixed $k$, we observe that the Hamilton-Jacobi-Bellman equation is decoupled from the corresponding Fokker-Planck equation. In principle, we could compute first $u_k$ and then use it to compute $m_k$. Nevertheless, we prefer to solve the system (5.2) as a whole, also in view of more general coupling costs, in which the dependence on $m_k$ is not frozen at the previous step as in (5.1).

Let us now discuss, in more detail, the main ingredients for a practical implementation of the algorithm. First of all, we have to introduce a discretization of the domain $\Omega$. The choice of a structured or an unstructured grid clearly depends on the geometry of $\Omega$ and, in turn, on the nature of the data set represented by the distribution $f$. Typically $\Omega$ can be chosen as a simple hypercube in $\mathbb{R}^d$ containing the support of $f$, as in the tests shown below. Then, we need some quadrature rule for the approximation of the integrals appearing in the E-step, and also a suitable discretization of the MFG systems in the M-step, e.g. via finite differences or finite elements ([1, 2, 10]). Note that each system in (5.2) consists of two non-linear PDEs in the three unknowns $(u_k, m_k, \lambda_k)$, plus boundary and normalization conditions. This leads to the building block of the algorithm, namely the numerical solution of a discrete non-linear system which is formally overdetermined, having more equations than unknowns. More precisely, if $\Omega$ is discretized with $N$ degrees of freedom, then $(u_k, m_k, \lambda_k)$ corresponds to a vector of $2N+1$ unknowns, while (5.2) gives $2N+2$ non-linear equations, subject to $N$ constraints representing the condition $m_k \ge 0$. Hence, by employing a Newton-like solver, we end up with non-square linearized systems, whose solutions should be meant in a least-squares sense, as explained in [9], where a direct method for MFG systems is proposed. We refer the reader to this paper for further details and the implementation of the method, which will be used in the following tests.

We conclude this discussion with some remarks on the initialization and the stopping criterion of the algorithm. As is commonly done in the classical EM algorithm, we choose all the initial weights $\alpha_k^0 = \frac{1}{K}$, and the densities $m_k^0$ given by Gaussian PDFs $\mathcal{N}(x|\mu_k, I)$ with random means $\mu_k$. The convergence of the algorithm is really sensitive to this initial guess, and we observe a relevant speedup if the supports of the $m_k^0$ are as much as possible disjoint. Finally, we choose a tolerance $tol > 0$ and repeat the EM-step until both $\max_{k=1,\dots,K} |\mu_k^n - \mu_k^{n-1}| < tol$ and $\max_{k=1,\dots,K} |\Sigma_k^n - \Sigma_k^{n-1}| < tol$. Note that also the M-step has its own tolerances for the convergence of the Newton solvers for the $K$ MFG systems. In principle, we can choose a single tolerance for the whole algorithm and alternate the E-step and M-step in a Gauss-Seidel fashion, namely updating the parameters (5.1) after a single Newton iteration for each system (5.2). Surprisingly, this does not improve the convergence, and the proposed implementation with two nested loops is more effective. Indeed, we observed that, once each mixture component $m_k$ stabilizes around its current mean $\mu_k$, in the next iterations it readily moves to the new barycentre with just small adjustments of the corresponding covariance matrix $\Sigma_k$. In practice, most of the computational effort is concentrated on the first few iterations of the algorithm.

Now, let us define the settings for the numerical experiments. We use a uniform structured grid to discretize the domain $\Omega$, and a simple rectangular quadrature rule for the E-step. Moreover, we employ standard finite differences for the discretization of the M-step, where we consider the particular case of a quadratic Hamiltonian as in (3.17) (see also Remark 3.7), and we set the diffusion coefficient $\varepsilon = 1$. The case of general Hamiltonians and coupling costs will be addressed in a more computationally oriented paper.

Test 1.
We consider a piece-wise constant distribution $f$ on $\Omega = [0,1]$, with $\int f(x)\,dx = 1$. In Figure 1, we show the solution computed on a uniform grid of $N = 201$ nodes, respectively for $K = 1, 2, 3$.
The thin line represents $f$, while the thick line represents the mixture $m = \sum_{k=1}^K \alpha_k m_k$. We clearly see how the mean and the variance of each $m_k$ adapt to the data, according to the given number $K$ of mixture components.

Figure 1: Approximation of a piece-wise constant data distribution, for $K = 1, 2, 3$.

Test 2.
We consider the example of a distribution on $\Omega = [0,1]$ with an oscillatory behavior. Indeed, we define $f$ by suitably scaling and translating the function $\tilde f(x) = x\sin(4\pi x)$ for $x \in [0,1]$, so that $f$ has compact support and $\int f(x)\,dx = 1$. In Figure 2, we show the solutions corresponding to $K = 1, 2, 3, 4$, again for a uniform grid of $N = 201$ nodes. It is interesting to observe that the peaks of $f$ are sequentially approximated as the number $K$ of mixture components increases, according to their heights and the underlying masses.

Figure 2: Approximation of an oscillatory data distribution, for $K = 1, 2, 3, 4$.

Test 3.
We present an application of Cluster Analysis to the so-called color quantization problem in computer graphics. Given an image, the aim is to reduce the total number of colors ($2^{24}$ for images in 24-bit RGB format) to a given number $K$, so that the resulting image is as similar as possible to the original one. As pointed out in [6, Chap. 9], this problem can be solved by the classical K-means algorithm. Note that this approach is quite different from segmentation; indeed, no geometric correlation between pixels is considered, but just the color information. Here we apply the MFG clustering algorithm to the case of an image in grey scales. Each pixel in the image contains a level of grey represented by a value in the interval $[0,1]$ (typically sampled with 256 levels, from black 0 to white 1), so that the problem is reduced to dimension one, choosing $\Omega = [0,1]$ uniformly discretized with $N = 256$ nodes. To generate the data set distribution $f$, for each $x \in [0,1]$ we count the number of pixels in the image with grey level $x$, then we normalize it to satisfy $\int f(x)\,dx = 1$. In Figure 3, we show the original image and the corresponding distribution $f$, while in Figure 4 we report the results for increasing values of $K$ starting from $K = 2$. In all cases, we use the responsibilities $\{\gamma_k\}_{k=1,\dots,K}$ to map the single pixel grey value to the barycentre of its most representative cluster: if the pixel $p$ has grey value $x_p$, then it is mapped to the value $\mu_{k^*}$, where $k^* = \arg\max_{k=1,\dots,K} \gamma_k(x_p)$.

Figure 3: A black and white image (a) and its grey scales distribution (b).
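The final recoloring step just described is straightforward; a sketch (assuming the grey-level grid `x`, the responsibilities `gamma` of shape (K, N) and the barycentres `mu` produced by the clustering algorithm):

```python
import numpy as np

def quantize(img, x, gamma, mu):
    """Map each pixel grey value to the barycentre of its most
    representative cluster: k* = argmax_k gamma_k(x_p)."""
    # index of the grid node closest to each pixel value
    idx = np.abs(img[..., None] - x).argmin(axis=-1)
    k_star = gamma[:, idx].argmax(axis=0)   # most representative cluster
    return mu[k_star]
```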
Test 4. We finally consider a two-dimensional example. We take some data from the ELKI project [16], forming a "mouse" similar to a popular comic character. The data set is organized in 3 clusters (plus some random noise), corresponding to the head and the ears of the mouse, see Figure 5. Then, we set $K = 3$, we discretize the domain $\Omega = [0,1]^2$ by means of a uniform grid of $N = 51^2$ nodes, and we build the distribution $f$ as in the previous test, by counting the data points falling in each cell of the grid and normalizing the result to obtain $\int_\Omega f(x)\,dx = 1$. For the visual representation, we consider RGB triplets in $[0,1]^3$, and we assign to the three clusters the pure colors red $(1,0,0)$, green $(0,1,0)$ and blue $(0,0,1)$ respectively. Then we use the responsibilities $\{\gamma_k\}_{k=1,2,3} \in [0,1]$ to assign to each grid node $x_i$ the color $C_i = (\gamma_1(x_i), \gamma_2(x_i), \gamma_3(x_i))$, for $i = 1, \dots, N$, which quantifies how much it belongs to a certain cluster. Figure 6 shows the surface of the computed mixture and the corresponding clusterization. We observe that the scattered data set is well approximated by the mixture, and the corresponding three clusters are well separated by small overlapping regions.

Figure 4: Color quantization via MFG clustering (left panels) and the corresponding mixtures (right panels), for increasing values of $K$.

Figure 5: The "mouse" data set (a) and the corresponding distribution (b).

Figure 6: MFG mixture (a) and clustering (b) of the "mouse" data set for $K = 3$.

References

[1] Y. Achdou and I. Capuzzo Dolcetta, Mean Field Games: numerical methods, SIAM J. Numer. Anal., 48 (2010), 1136-1162.

[2] N. Almulla, R. Ferreira and D. Gomes, Two numerical approaches to stationary mean-field games, Dyn. Games Appl. 7 (2017), no. 4, 657-682.

[3] M. Bardi and F. Priuli, Linear-quadratic N-person and mean-field games with ergodic cost, SIAM J. Control Optim. 52 (2014), no. 5, 3022-3052.

[4] A. Bensoussan, Perturbation methods in optimal control, Wiley/Gauthier-Villars Series in Modern Applied Mathematics, 1988.

[5] C. Bertucci, S. Vassilaras, J.-M. Lasry, G. Paschos, M. Debbah and P.-L. Lions, Transmit strategies for massive machine-type communications based on Mean Field Games, in Proceedings of the International Symposium on Wireless Communication Systems (ISWCS), 2018.

[6] C.M. Bishop, Pattern recognition and Machine Learning, Information Science and Statistics, Springer, New York, 2006.

[7] J.A. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, Technical Report ICSI-TR-97-021, University of Berkeley, 2000.

[8] L. Bottou and Y. Bengio, Convergence properties of the K-means algorithms, Adv. Neural Inf. Process. Syst. 82 (1995), 585-592.

[9] S. Cacace and F. Camilli, A generalized Newton method for homogenization of Hamilton-Jacobi equations, SIAM J. Sci. Comput. 38 (2016), no. 6, A3589-A3617.

[10] E. Carlini and F.J. Silva, On the discretization of some nonlinear Fokker-Planck-Kolmogorov equations and applications, SIAM J. Numer. Anal. 56 (2018), no. 4, 2148-2177.

[11] P. Chaudhari, A. Oberman, S. Osher, S. Soatto and G. Carlier, Deep relaxation: partial differential equations for optimizing deep neural networks, Res. Math. Sci. 5 (2018), no. 3, Paper No. 30.

[12] Y. Chen, T.T. Georgiou and A. Tannenbaum, Optimal transport for Gaussian mixture models, IEEE Access, 7 (2019), 6269-6278.

[13] M. Cirant, Multi-population Mean Field Games systems with Neumann boundary conditions, J. Math. Pures Appl. 103 (2015), no. 5, 1294-1315.

[14] J.L. Coron, Quelques exemples de jeux à champ moyen, Ph.D. thesis, Université Paris-Dauphine, 2018.

[15] W. E, J. Han and Q. Li, A mean-field optimal control formulation of deep learning, Res. Math. Sci. 6 (2019), no. 1, Paper No. 10.

[16] The ELKI project, https://elki-project.github.io/datasets/

[17] D. Gilbarg and N.S. Trudinger, Elliptic partial differential equations of second order, Classics in Mathematics, Springer-Verlag, Berlin, 2001.

[18] O. Guéant, J.-M. Lasry and P.-L. Lions, Mean Field Games and applications, in Paris-Princeton Lectures on Mathematical Finance 2010, Lecture Notes in Math., 2003, Springer, Berlin, 2011, 205-266.

[19] J.-M. Lasry and P.-L. Lions, Mean Field Games, Jpn. J. Math., 2 (2007), 229-260.

[20] P.-L. Lions, Quelques remarques sur les problèmes elliptiques quasilinéaires du second ordre, J. Anal. Math. 45 (1985), 234-254.

[21] S. Miyamoto and M. Mukaidono, Fuzzy C-Means as a regularization and maximum entropy approach, in IFSA'97 Prague: Proceedings of the seventh International Fuzzy Systems Association World Congress, 86-92.

[22] S. Pequito, A.P. Aguiar, B. Sinopoli and D. Gomes, Unsupervised learning of finite mixture models using Mean Field Games, in Annual Allerton Conference on Communication, Control and Computing, 2011, 321-328.

[23] A. Saxena, M. Prasad, A. Gupta, N. Bharill, O.P. Patel, A. Tiwari, M.J. Er, W. Ding and C.T. Lin, A review of clustering techniques and developments, Neurocomputing, 267 (2017), 664-681.

[email protected]@sbai.uniroma1.it
SBAI, Sapienza Università di Roma
via A. Scarpa 14, 00161 Roma (Italy)

[email protected]
Dipartimento di Matematica e Fisica
Università degli Studi Roma Tre
Largo S. L. Murialdo 1, 00146 Roma (Italy)

[email protected]@iconsulting.biz