Probabilistic Multilevel Clustering via Composite Transportation Distance
Nhat Ho⋆,†   Viet Huynh⋆,‡   Dinh Phung‡   Michael I. Jordan†
† University of California at Berkeley, Berkeley, USA
‡ Monash University, Australia
October 30, 2018
⋆ Nhat Ho and Viet Huynh contributed equally to this work and are listed in alphabetical order.
Abstract
We propose a novel probabilistic approach to multilevel clustering problems based on composite transportation distance, which is a variant of transportation distance where the underlying metric is the Kullback-Leibler divergence. Our method involves solving a joint optimization problem over spaces of probability measures to simultaneously discover grouping structures within groups and among groups. By exploiting the connection of our method to the problem of finding composite transportation barycenters, we develop fast and efficient optimization algorithms even for potentially large-scale multilevel datasets. Finally, we present experimental results with both synthetic and real data to demonstrate the efficiency and scalability of the proposed approach.
1 Introduction

Clustering is a classic and fundamental problem in machine learning. Popular clustering methods such as K-means and the EM algorithm have been the workhorses of exploratory data analysis. However, the underlying model for such methods is a simple flat partition or a mixture model, which does not capture multilevel structures (e.g., words are grouped into documents, documents are grouped into corpora) that arise in many applications in the physical, biological or cognitive sciences. The clustering of multilevel structured data calls for novel methodologies beyond classical clustering.

One natural approach for capturing multilevel structures is to use a hierarchy in which data are clustered locally into groups, and those groups are partitioned in a "global clustering." Attempts to develop algorithms of this kind can be roughly classified into two categories. The first category makes use of probabilistic models, often based on Dirichlet process priors. Examples in this vein include the Hierarchical Dirichlet Process (HDP) [18], the Nested Dirichlet Process (NDP) [15], Multilevel Clustering with Context (MC2) [11], and the Multilevel Clustering Hierarchical Dirichlet Process (MLC-HDP) [21]. Despite the flexibility and solid statistical foundation of these models, they have seen limited application to large-scale datasets, given concerns about the computational scaling of the sampling-based algorithms that are generally used for inference under these models.

A second category of multilevel methods is based on tools from optimal transport theory, where algorithms such as Wasserstein barycenters provide scalable computation [3, 4]. These methods trace their origins to a seminal paper by [14], which established a connection between the K-means algorithm and the problem of determining a discrete probability measure that is close in Wasserstein distance [19] to the empirical measure of the data. Based on this connection, [5] developed a multilevel clustering method built upon Wasserstein barycenters; however, the Euclidean ground metric underlying that approach limits the types of data it can model directly. In this paper we propose to employ composite transportation distance [12] to overcome this limitation and to provide a more general multilevel clustering method. The salient feature of composite transportation distance is that it utilizes the Kullback-Leibler (KL) divergence as the underlying metric of the optimal transportation distance, in contrast to the standard Euclidean metric that has been used in optimal transportation approaches to clustering to date.

In order to motivate our use of composite transportation distance, we start with one-level structured data in which the data are generated from a finite mixture model, e.g., a mixture of (multivariate) Gaussian distributions or multinomial distributions. Unlike traditional estimators such as maximum likelihood estimation (MLE), the high-level idea of using composite transportation distance is to determine optimal parameters that minimize the KL cost of moving the likelihood from one cluster to another cluster. Intuitively, with such a distance, we can exploit the underlying geometric structure of the parameters to perform efficient clustering of the data. Another advantage of composite transportation distance is its flexibility in generalizing to multilevel structured data. More precisely, by representing each group in a multilevel clustering problem by an unknown mixture model (local clustering), we can determine the optimal parameters of each group, which can be represented as local (probability) measures, via optimization problems based on composite transportation distance.
Then, in order to determine a global clustering among these groups, we solve a composite transportation barycenter problem over the local measures to obtain a global measure over the space of mixture models, which serves as a partition of these groups. As a result, our final method, which we refer to as multilevel composite transportation (MCT), involves solving a joint optimal transport optimization problem with respect to both a local clustering and a global clustering, based on a cost matrix encoding the KL divergence among atoms. The solution strategy uses fast computation methods for Wasserstein barycenters combined with coordinate descent.

In summary, our main contributions are the following: (i) a new optimization formulation for clustering of multilevel data of various types, including both continuous and discrete observations, based on composite transportation distance; (ii) a highly scalable solution strategy for this optimization formulation; (iii) although our approach avoids the use of the Dirichlet process as a building block, it retains much of the flexibility of the hierarchical Dirichlet process in its ability to share atoms among local clusterings. We are thus able to borrow strength among clusters, which improves statistical efficiency in certain applications, e.g., image annotation in computer vision.

The paper is organized as follows. Section 2 provides preliminary background on composite transportation distance and composite transportation barycenters. Section 3 formulates the multilevel composite transportation optimization model, while Section 4 presents studies with both synthetic and real data. Finally, we conclude the paper with a discussion in Section 5. Technical details of proofs and algorithm development are provided in the Appendix.
2 Composite transportation distance
Throughout this paper, we let $\Theta$ be a bounded subset of $\mathbb{R}^d$ for a given dimension $d \geq 1$, and we let $\{f(x\,|\,\theta),\ \theta \in \Theta\}$ be a given exponential family of distributions with natural parameter $\theta$:
\[
f(x\,|\,\theta) := h(x)\exp\big(\langle T(x), \theta\rangle - A(\theta)\big),
\]
where $A(\theta)$ is the log-partition function, which is convex. We define $P_{\theta}$ to be the probability distribution whose density function is $f(x\,|\,\theta)$. Given a fixed number of components $K$, we denote a finite mixture distribution as follows:
\[
P_{\omega_K,\Theta_K} := \sum_{k=1}^{K} \omega_k P_{\theta_k}, \tag{1}
\]
where $\omega_K = (\omega_1,\ldots,\omega_K) \in \Delta_K$, the probability simplex of dimension $K-1$, and $\Theta_K = \{\theta_k\}_{k=1}^{K} \in \Theta^K$ are the weights and atoms. Then, the probability density function of the mixture model can be expressed as
\[
p_{\omega_K,\Theta_K}(x) := \sum_{k=1}^{K} \omega_k f(x\,|\,\theta_k).
\]
We also use $Q_{\omega_K,\Theta_K}$ to denote a finite mixture of at most $K$ components, to avoid potential notational clutter.

For any two finite mixture probability distributions $P_{\omega_K,\Theta_K}$ and $P_{\omega'_{K'},\Theta'_{K'}}$ with any two given numbers $K$ and $K'$, we define the composite transportation distance between $P_{\omega_K,\Theta_K}$ and $P_{\omega'_{K'},\Theta'_{K'}}$ as follows:
\[
\widetilde{W}\big(P_{\omega_K,\Theta_K}, P_{\omega'_{K'},\Theta'_{K'}}\big) := \inf_{\pi \in \Pi(\omega_K,\, \omega'_{K'})} \langle \pi, M\rangle, \tag{2}
\]
where the cost matrix $M = (M_{ij})$ satisfies $M_{ij} = \mathrm{KL}\big(f(x\,|\,\theta_i), f(x\,|\,\theta'_j)\big)$ for $1 \leq i \leq K$ and $1 \leq j \leq K'$. Here, $\langle \cdot,\cdot\rangle$ denotes the Frobenius inner product of two matrices, and $\Pi(\omega_K, \omega'_{K'})$ is the set of all probability measures (equivalently, transportation plans) $\pi$ on $[0,1]^{K \times K'}$ that have marginals $\omega_K$ and $\omega'_{K'}$, respectively.
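To make the definition in Eq. (2) concrete, the sketch below computes the composite transportation distance between two univariate Gaussian mixtures by building the KL cost matrix in closed form and solving the small transportation linear program. This is our own illustrative example, not code from the paper; the function names and the use of scipy.optimize.linprog are our choices.

```python
import numpy as np
from scipy.optimize import linprog

def kl_gauss(m1, s1, m2, s2):
    # Closed-form KL(N(m1, s1^2) || N(m2, s2^2)) for univariate Gaussians.
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def composite_distance(w1, comps1, w2, comps2):
    """Eq. (2): min <pi, M> over plans pi with marginals w1 and w2.

    comps are lists of (mean, std) pairs; M_ij = KL(f_i, f_j)."""
    K1, K2 = len(w1), len(w2)
    M = np.array([[kl_gauss(*ci, *cj) for cj in comps2] for ci in comps1])
    # Marginal constraints: row i of pi sums to w1[i], column j sums to w2[j].
    A_eq, b_eq = [], []
    for i in range(K1):
        row = np.zeros(K1 * K2)
        row[i * K2:(i + 1) * K2] = 1.0
        A_eq.append(row); b_eq.append(w1[i])
    for j in range(K2):
        col = np.zeros(K1 * K2)
        col[j::K2] = 1.0
        A_eq.append(col); b_eq.append(w2[j])
    res = linprog(M.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# Two mixtures with different numbers of components.
d = composite_distance([0.5, 0.5], [(0.0, 1.0), (4.0, 1.0)],
                       [0.3, 0.3, 0.4], [(0.0, 1.0), (4.0, 1.0), (8.0, 1.0)])
print(d)
```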
Detailed form of the cost matrix
Since $f(x\,|\,\theta)$ belongs to an exponential family, we can compute $\mathrm{KL}(f(x\,|\,\theta), f(x\,|\,\theta'))$ in closed form as follows [22; 9, Ch. 8]:
\[
\mathrm{KL}\big(f(x\,|\,\theta'), f(x\,|\,\theta)\big) = D_A(\theta, \theta'),
\]
where $D_A(\cdot,\cdot)$ is the Bregman divergence associated with the log-partition function $A(\cdot)$ of $f$, i.e.,
\[
D_A(\theta, \theta') = A(\theta) - A(\theta') - \big\langle \nabla A(\theta'), \theta - \theta'\big\rangle.
\]
Therefore, the cost matrix $M$ has the explicit form
\[
M_{ij} = A(\theta_i) - A(\theta'_j) - \big\langle \nabla A(\theta'_j), \theta_i - \theta'_j\big\rangle \tag{3}
\]
for $1 \leq i \leq K$ and $1 \leq j \leq K'$. Note that the order of the parameters is reversed between the KL and Bregman divergences.

Composite transportation distance on the space of finite mixtures of finite mixtures
We can recursively define finite mixtures of finite mixtures, and define a suitable version of composite transportation distance on this abstract space. In particular, consider a collection of $N$ finite mixture probability distributions with at most $K$ components, $\{P_{\omega^i_K,\Theta^i_K}\}_{i=1}^{N}$, and a collection of $\bar{N}$ finite mixture probability distributions with at most $\bar{K}$ components, $\{P_{\bar{\omega}^i_{\bar{K}},\bar{\Theta}^i_{\bar{K}}}\}_{i=1}^{\bar{N}}$. We define two finite mixtures of these distributions as follows:
\[
\mathcal{P} = \sum_{i=1}^{N} \tau_i P_{\omega^i_K,\Theta^i_K}, \qquad \mathcal{Q} = \sum_{i=1}^{\bar{N}} \bar{\tau}_i P_{\bar{\omega}^i_{\bar{K}},\bar{\Theta}^i_{\bar{K}}},
\]
where $\tau = (\tau_1,\ldots,\tau_N) \in \Delta_N$ and $\bar{\tau} = (\bar{\tau}_1,\ldots,\bar{\tau}_{\bar{N}}) \in \Delta_{\bar{N}}$. Then, the composite transportation distance between $\mathcal{P}$ and $\mathcal{Q}$ is
\[
\widetilde{W}(\mathcal{P}, \mathcal{Q}) := \inf_{\pi \in \Pi(\tau,\, \bar{\tau})} \langle \pi, \bar{M}\rangle,
\]
where the cost matrix $\bar{M} = (\bar{M}_{ij})$ is defined as $\bar{M}_{ij} = \widetilde{W}\big(P_{\omega^i_K,\Theta^i_K}, P_{\bar{\omega}^j_{\bar{K}},\bar{\Theta}^j_{\bar{K}}}\big)$ for $1 \leq i \leq N$ and $1 \leq j \leq \bar{N}$. Note that, in a slight abuse of notation, $\widetilde{W}(\cdot,\cdot)$ is used for both finite mixtures and finite mixtures of finite mixtures.

2.2 Finite mixtures with composite transportation distance

In this section, we assume that $X_1,\ldots,X_n$ are i.i.d. samples from the mixture density
\[
p_{\omega^0,\Theta^0}(x) = \sum_{i=1}^{k_0} \omega^0_i f(x\,|\,\theta^0_i),
\]
where $k_0 < \infty$ is the true number of components. Since $k_0$ is generally unknown, we fit this model with a mixture of $K$ distributions, where $K \geq k_0$. Denote by $P_n := \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ the empirical measure of the samples $X_1,\ldots,X_n$. To facilitate the discussion, we define the following composite transportation distance between the empirical measure $P_n$ and the mixture probability distribution $P_{\omega_K,\Theta_K}$:
\[
\widetilde{W}(P_n, P_{\omega_K,\Theta_K}) := \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle, \tag{4}
\]
where $M = (M_{ij}) \in \mathbb{R}^{n \times K}$ is a cost matrix defined as $M_{ij} := -\log f(X_i\,|\,\theta_j)$ for $1 \leq i \leq n$, $1 \leq j \leq K$. Furthermore, $\Pi(\cdot,\cdot)$ is the set of transportation plans between $\mathbf{1}_n/n$ and $\omega_K$. To estimate the true weights $\omega^0_i$ and true components $\theta^0_i$ for $1 \leq i \leq k_0$, we solve an optimization problem with the transportation distance $\widetilde{W}$:
\[
(\widehat{\omega}_{n,K}, \widehat{\Theta}_{n,K}) = \operatorname*{arg\,min}_{\omega_K,\, \Theta_K} \widetilde{W}(P_n, P_{\omega_K,\Theta_K}). \tag{5}
\]
The estimator $(\widehat{\omega}_{n,K}, \widehat{\Theta}_{n,K})$ is usually referred to as a Minimum Kantorovich estimator [1].
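As a quick numerical illustration of Eq. (3), the snippet below evaluates the Bregman divergence of a log-partition function. For unit-variance Gaussians in natural parameters, $A(\theta) = \|\theta\|^2/2$, so $D_A(\theta,\theta') = \|\theta-\theta'\|^2/2$, matching the well-known KL divergence between the corresponding Gaussians. The code is our own sketch; the helper names are illustrative.

```python
import numpy as np

def bregman(A, grad_A, theta, theta_p):
    # D_A(theta, theta') = A(theta) - A(theta') - <grad A(theta'), theta - theta'>  (Eq. (3))
    return A(theta) - A(theta_p) - grad_A(theta_p) @ (theta - theta_p)

# Unit-variance Gaussian in natural parameters: A(theta) = ||theta||^2 / 2.
A = lambda t: 0.5 * t @ t
grad_A = lambda t: t

t1, t2 = np.array([1.0, 0.0]), np.array([0.0, 2.0])
print(bregman(A, grad_A, t1, t2))  # 2.5 == ||t1 - t2||^2 / 2 == KL(N(t2, I) || N(t1, I))
```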
Algorithm 1 Composite Transportation Distance with Mixtures
Input:
Data $D = \{X_i\}_{i=1}^{n}$; the number of clusters $K$; the regularization hyper-parameter $\lambda > 0$.
Output:
Optimal weights and atoms $\{\omega_j, \theta_j\}_{j=1}^{K}$.
Initialize weights $\{\omega_j\}_{j=1}^{K}$ and atoms $\{\theta_j\}_{j=1}^{K}$.
while not converged do
1. Update weights $\omega_j$:
for $j = 1$ to $K$ do
  Compute the transportation plan $\pi_{ij}$ as
  \[
  \pi_{ij} = \big(f(X_i\,|\,\theta_j)\big)^{1/\lambda} \Big/ \Big( n \sum_{k=1}^{K} \big(f(X_i\,|\,\theta_k)\big)^{1/\lambda} \Big) \quad \text{for } 1 \leq i \leq n.
  \]
  Update the weight $\omega_j = \sum_{i=1}^{n} \pi_{ij}$.
end for
2. Update atoms $\theta_j$:
for $j = 1$ to $K$ do
  Update the atom $\theta_j$ as the solution of the equation $\nabla A(\theta_j) = \frac{1}{\omega_j}\sum_{i=1}^{n} \pi_{ij} T(X_i)$.
end for
end while

2.2.2 Regularized composite transportation distance

As is the case with the traditional optimal transportation distance, the composite transportation distance $\widetilde{W}$ does not have a favorable computational complexity. Therefore, we consider an entropic regularizer to speed up its computation [3]. More precisely, we consider the following regularized version of $\widetilde{W}(P_n, P_{\omega_K,\Theta_K})$:
\[
\inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle - \lambda H(\pi),
\]
where $\lambda > 0$ and $H(\pi) := -\sum_{i,j} \pi_{ij}\log \pi_{ij}$ is the entropy of $\pi \in \Pi(\mathbf{1}_n/n, \omega_K)$. Equipped with this regularization, we have a regularized version of the optimal estimator in (5):
\[
\min_{\omega_K,\Theta_K}\ \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle - \lambda H(\pi). \tag{6}
\]
We summarize the algorithm for determining local solutions of the above objective function in Algorithm 1. The details of how to obtain the updates of the weights and atoms in Algorithm 1 are deferred to the Supplementary Material.
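As an illustration of Algorithm 1, the sketch below specializes it to spherical Gaussians with identity covariance, for which $T(x) = x$, $A(\theta) = \|\theta\|^2/2$, and the atom update $\nabla A(\theta_j) = \frac{1}{\omega_j}\sum_i \pi_{ij} T(X_i)$ reduces to a weighted mean. This is our own sketch under those assumptions, not reference code from the paper.

```python
import numpy as np

def fit_mixture_ctd(X, K, lam=1.0, iters=100, seed=0):
    """Algorithm 1 for spherical unit-variance Gaussian components (a sketch)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    theta = X[rng.choice(n, size=K, replace=False)]        # init atoms at data points
    for _ in range(iters):
        # log f(X_i | theta_j) up to a constant shared across j (it cancels below).
        logf = -0.5 * ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=-1)
        z = logf / lam
        z -= z.max(axis=1, keepdims=True)                   # numerical stability
        pi = np.exp(z)
        pi /= n * pi.sum(axis=1, keepdims=True)             # each row sums to 1/n
        w = pi.sum(axis=0)                                  # omega_j = sum_i pi_ij
        theta = (pi.T @ X) / w[:, None]                     # grad A(theta_j) = weighted mean
    return w, theta

X = np.concatenate([np.random.randn(200, 2), np.random.randn(200, 2) + 5.0])
w, theta = fit_mixture_ctd(X, K=2)
print(w, theta)
```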
Given the formulation of Algorithm 1, we have the following result regarding its convergence to a local optimum.

Theorem 1. Algorithm 1 monotonically decreases the objective function (6) of the regularized composite transportation distance for finite mixtures.

2.3 Composite transportation barycenter for mixtures of exponential families

In this section, we consider the problem of finding composite transportation barycenters for a collection of mixtures of exponential families. For $J \geq 1$, let $\{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\}_{j=1}^{J}$ be a collection of $J$ mixtures of exponential families as described in (1), and let $\{a_j\}_{j=1}^{J} \in \Delta_J$ be weights associated with these mixtures. The transportation barycenter of these probability measures is a mixture of exponential families with at most $L$ components, defined as an optimal solution of the following problem:
\[
\operatorname*{arg\,min}_{w_L,\Psi_L} \sum_{j=1}^{J} a_j \widetilde{W}\big(Q_{w_L,\Psi_L}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big), \tag{7}
\]
where $w_L = \{w_l\}_{l=1}^{L} \in \Delta_L$ and $\Psi_L = \{\psi_l\}_{l=1}^{L} \in \Theta^L$ are unknown weights and parameters that we need to optimize. Recall that, to avoid notational clutter, we use $Q_{w_L,\Psi_L}$ to denote a finite mixture with at most $L$ components. Since $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$ and $Q_{w_L,\Psi_L}$ are mixtures of exponential families, Eq. (7) can be rewritten as
\[
\operatorname*{arg\,min}_{w_L,\Psi_L} \sum_{j=1}^{J} a_j \min_{\pi^j \in \Pi(\omega^j_{K_j},\, w_L)} \langle \pi^j, M^j\rangle,
\]
where the cost matrices $M^j = (M^j_{uv})$ satisfy $M^j_{uv} = \mathrm{KL}\big(f(x\,|\,\theta^j_u), f(x\,|\,\psi_v)\big)$ for $1 \leq j \leq J$, which have the closed form given in Eq. (3) since $f$ is from an exponential family of distributions.

We incorporate regularizers in the composite transportation barycenter. In particular, we write the objective function to be minimized as
\[
\operatorname*{arg\,min}_{w_L,\Psi_L} \sum_{j=1}^{J} a_j \min_{\pi^j \in \Pi(\omega^j_{K_j},\, w_L)} \langle \pi^j, M^j\rangle - \lambda H(\pi^j). \tag{8}
\]
We call this objective function the regularized composite transportation barycenter. Due to space constraints, we present the detailed algorithm for determining local solutions of this objective function in the Supplementary Material.

3 Multilevel clustering with composite transportation distance

Assume that we have $J$ groups of independent data $X_{j,i}$, where $1 \leq j \leq J$ and $1 \leq i \leq n_j$; i.e., the data are presented in a two-level grouping structure. Our goal is to simultaneously find the local clustering for each data group and the global clustering across groups.

3.1 Multilevel composite transportation (MCT)

To facilitate the discussion, for each $1 \leq j \leq J$, we denote the empirical measure associated with group $j$ as
\[
P^j_{n_j} := \frac{1}{n_j}\sum_{i=1}^{n_j} \delta_{X_{j,i}}.
\]
Additionally, we assume that the numbers of local and global clusters are bounded. In particular, we allow local group $j$ to have at most $K_j$ clusters, which can be represented as a mixture of exponential families $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$, while we have at most $C$ global clusters among the $J$ given groups. Here, each global cluster can be represented as a finite mixture distribution $Q^m_{w^m_L,\Psi^m_L}$ with at most $L$ components, where $w^m_L = (w^m_1,\ldots,w^m_L)$ and $\Psi^m_L = (\psi^m_1,\ldots,\psi^m_L)$ are global weights and atoms for $1 \leq m \leq C$, respectively.

For the local clustering, we perform composite transportation distance optimization for each group $j$, which can be expressed as in (5). More precisely, this step can be viewed as finding optimal local weights $\omega^j_{K_j}$ and local atoms $\Theta^j_{K_j}$ to minimize the composite transportation distance $\widetilde{W}(P^j_{n_j}, P_{\omega^j_{K_j},\Theta^j_{K_j}})$ for all $1 \leq j \leq J$. Regarding the global clustering of the $J$ given groups, we can treat the finite mixture probability distribution $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ of each group as an observation in the space of distributions over probability distributions. We thus achieve a clustering of these distributions by means of an optimization with the following composite transportation distance on the space of finite mixtures of finite mixtures:
\[
\inf_{\mathcal{Q}} \widetilde{W}(\mathcal{P}, \mathcal{Q}),
\]
where we denote $\mathcal{P} := \frac{1}{J}\sum_{j=1}^{J} \delta_{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}}$ and $\mathcal{Q} := \sum_{m=1}^{C} b_m \delta_{Q^m_{w^m_L,\Psi^m_L}}$.
Since the finite mixture probability distributions $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ in each group are unobserved, we determine them by minimizing the objective cost functions of the local clustering and the global clustering simultaneously. In particular, we consider the following objective function:
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j},\mathcal{Q}}\ \sum_{j=1}^{J} \widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) + \zeta \widetilde{W}(\mathcal{P}, \mathcal{Q}), \tag{9}
\]
where $\zeta > 0$ is a penalization parameter balancing the local and global terms. We call this objective function Multilevel Composite Transportation (MCT).

3.2 Regularized MCT

To obtain a favorable computational profile with MCT, we consider a regularized version of the composite transportation distances in both the local and global structures. To simplify the discussion, we denote by $\pi^j \in \Pi(\mathbf{1}_{n_j}/n_j,\, \omega^j_{K_j})$ the local transportation plan between $P^j_{n_j}$ and $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$ for all $1 \leq j \leq J$. Thus, the following formulation holds:
\[
\widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) = \inf_{\pi^j \in \Pi(\mathbf{1}_{n_j}/n_j,\, \omega^j_{K_j})} \langle \pi^j, M^j\rangle,
\]
where $M^j$ is the cost matrix between $P^j_{n_j}$ and $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$, defined as
\[
[M^j]_{uv} = -\log f(X_{j,u}\,|\,\theta^j_v)
\]
for $1 \leq u \leq n_j$ and $1 \leq v \leq K_j$. Therefore, we can consider the regularized version of the composite transportation distance at each group $j$ as follows:
\[
\widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) - \lambda_l H(\pi^j), \tag{10}
\]
where $\lambda_l > 0$ is a local regularization hyper-parameter. Regarding the global structure, we have
\[
\widetilde{W}(\mathcal{P}, \mathcal{Q}) = \inf_{a \in \Pi(\mathbf{1}_J/J,\, b)} \sum_{j,m} a_{jm} \widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big),
\]
where $a = (a_{jm})$ in the above infimum is a global transportation plan between $\mathcal{P}$ and $\mathcal{Q}$. Here, we can further rewrite $\widetilde{W}(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L})$ as
\[
\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) = \inf_{\tau^{j,m} \in \Pi(\omega^j_{K_j},\, w^m_L)} \langle \tau^{j,m}, \gamma^{j,m}\rangle,
\]
where the cost matrix $\gamma^{j,m}$ is given by the KL divergence between two exponential family atoms as in Eq. (3):
\[
\gamma^{j,m}_{k,l} := A(\theta^j_k) - A(\psi^m_l) - \big\langle \nabla A(\psi^m_l), \theta^j_k - \psi^m_l\big\rangle.
\]
To facilitate the discussion later, we call $\tau^{j,m}$ the partial global transportation plan between $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ and $Q^m_{w^m_L,\Psi^m_L}$. Therefore, we can regularize the composite transportation distance of the global structure as
\[
\widetilde{W}(\mathcal{P}, \mathcal{Q}) - \lambda_a H(a) - \lambda_g \sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m}), \tag{11}
\]
where $\lambda_a$ corresponds to a penalization term for the global structure while $\lambda_g$ represents a penalization term for the partial global transportation plans. Combining the results from Eqs. (10) and (11), we obtain the overall objective function of MCT:
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j},\mathcal{Q}}\ \sum_{j=1}^{J} \widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) + \zeta \widetilde{W}(\mathcal{P}, \mathcal{Q}) - R(\pi, \tau, a), \tag{12}
\]
where $R(\pi, \tau, a) := \lambda_l \sum_{j=1}^{J} H(\pi^j) + \zeta\big[\lambda_a H(a) + \lambda_g \sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m})\big]$ is a combination of all the regularization terms for the local and global clustering. We call this objective function regularized MCT.
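The combined regularizer $R(\pi, \tau, a)$ in Eq. (12) is just a weighted sum of entropies of the three families of transportation plans; the small sketch below (our own, with illustrative names) makes the bookkeeping explicit.

```python
import numpy as np

def entropy(p):
    # H(p) = -sum p log p, with a guard for zero entries of the plan.
    p = np.asarray(p)
    return -np.sum(p * np.log(p + 1e-300))

def regularizer_R(local_plans, partial_plans, a, lam_l, lam_a, lam_g, zeta):
    """R(pi, tau, a) from Eq. (12).

    local_plans: list of J local plans pi^j; partial_plans[j][m]: plan tau^{j,m};
    a: (J, C) global transportation plan."""
    R = lam_l * sum(entropy(pi) for pi in local_plans)
    R += zeta * (lam_a * entropy(a)
                 + lam_g * sum(entropy(t) for row in partial_plans for t in row))
    return R
```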
3.3 Algorithm for regularized MCT

Algorithm 2 Probabilistic Multilevel Clustering
Input:
Data $D = \{X_{j,i}\}_{j=1,i=1}^{J,n_j}$; the numbers of local clusters $K_j$ and global clusters $C$; the number of components $L$ in each global cluster; the penalization term $\zeta$; the regularization hyper-parameters $\lambda_l$, $\lambda_g$.
Output: local and global parameters $\omega^j_{K_j}$, $\Theta^j_{K_j}$, $w^m_L$, $\Psi^m_L$ for $1 \leq j \leq J$ and $1 \leq m \leq C$.
Initialize these local and global parameters.
while not converged do
1. Update local parameters:
for $j = 1$ to $J$ do
  Update $\omega^j_{K_j}$, $\Theta^j_{K_j}$ as optimal solutions of (13).
end for
2. Update global parameters:
for $m = 1$ to $C$ do
  Update $w^m_L$, $\Psi^m_L$ as optimal solutions of (14).
end for
end while

We now describe our detailed strategy for obtaining a locally optimal solution of regularized MCT. In particular, our algorithm consists of two key steps: local clustering updates and global clustering updates. For simplicity of presentation, we assume that at step $t$ of our algorithm we have the current values of the parameters $\omega^j_{K_j}$, $\Theta^j_{K_j}$, $w^m_L$, $\Psi^m_L$ for $1 \leq j \leq J$ and $1 \leq m \leq C$.

Local clustering updates
To obtain updates for the local weights $\omega^j_{K_j}$ and local atoms $\Theta^j_{K_j}$, we solve the following combined regularized composite transportation barycenter problem:
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j}}\ \widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) - \lambda_l H(\pi^j) + \sum_{m=1}^{C} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \zeta\lambda_g \sum_{m=1}^{C} H(\tau^{j,m}), \tag{13}
\]
where $\pi^j$ is the local transportation plan between $P^j_{n_j}$ and $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$ at step $t$, while $a$ and $\tau^{j,m}$ are respectively the global transportation plan and the partial global transportation plans at this step. The idea for obtaining a local solution of the above objective function is identical to that for (8); therefore, we defer the detailed presentation of this algorithm to the Supplementary Material.

Global clustering updates
In order to update the global weights $w^m_L$ and the global atom parameters $\Psi^m_L$, we consider the following optimization problem:
\[
\inf_{\mathcal{Q}}\ \widetilde{W}(\mathcal{P}, \mathcal{Q}) - \zeta\Big[\lambda_a H(a) + \lambda_g \sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m})\Big]. \tag{14}
\]
The algorithm for obtaining the local solutions of this objective is based on the barycenter computation algorithms in [4] for updating the barycenter weights $w^m_L$ and the partial global transportation plans $\tau^{j,m}$. The natural parameters of the global atoms of the barycenters are weighted averages of the local atoms from all groups $j$:
\[
\psi_v = \frac{\sum_{j=1}^{J}\sum_{u=1}^{K_j} \pi^j_{uv}\theta^j_u}{\sum_{j=1}^{J}\sum_{u=1}^{K_j} \pi^j_{uv}}.
\]
The detailed derivation of this algorithm is deferred to the Supplementary Material. In summary, the main steps of the local and global clustering updates are given in Algorithm 2. We have the following result guaranteeing the local convergence of this algorithm.

Table 1: Statistics of the synthetic and real-world datasets.
(a) Synthetic datasets
Dataset           Groups (J)   Dimension   Points per group (n_j)   Clusters (C)
Continuous data   100          2           500                      6
Discrete data     500          25          100                      5
(b) Real-world datasets
Dataset    Groups (J)   Dimension   Clusters (C)
LabelMe    1,800        30          8
NUS-WIDE   1,040        238         13
Theorem 2.
Algorithm 2 monotonically decreases the objective function of regularized MCT (12) until local convergence.
4 Experiments

We first evaluate the model via simulation studies, and then demonstrate its applications to text and image modeling using two real-world datasets.
4.1 Synthetic data

We evaluate the effectiveness of our proposed clustering algorithm by considering two types of synthetic data (discrete and continuous) generated from multilevel processes as follows.
Continuous data
We start with six clusters of data, each of which is a mixture of three Gaussian components. Figure 1 depicts the ground truth of the six mixtures from which we generate the data. We uniformly generated 100 groups of data, each group belonging to one of the six aforementioned clusters. Once the cluster index of a data group was drawn, we generated 500 data points from the corresponding mixture of Gaussians.
Discrete data
Data were generated from five clusters of 25-dimensional bar topics, each of which is a mixture of four bar topics out of ten topics in total, as shown in Figure 2 (second row). Each cluster shares two topics with every other cluster. We then generated 500 groups of data, each group belonging to one of the five aforementioned clusters. Once the cluster index of a data group was drawn, we generated 100 data points from the mixture of bar topics of that cluster.

Figure 1: Synthetic multilevel Gaussian data (orange at top) and inferred clusters (blue at bottom).
Clustering results
We ran the proposed method on the synthetic continuous data with the local and global penalization hyper-parameters $\lambda_l$ and $\lambda_g$ both set equal to 1; the clusters inferred by our model are shown in the bottom row of Figure 1, and we also measured the NMI (Normalized Mutual Information) between the ground-truth labels and the learned groups. For the discrete data, each observation $X_{j,i}$ is now a one-hot vector. We simulated from ten topics, comprising five horizontal and five vertical bars. The top row of Figure 2 depicts the collection of four bar topics from which the data of a cluster may be generated; e.g., the first cluster contains data simulated from a mixture of the four horizontal bar topics. The middle row depicts the histogram of all data generated from each cluster, while the bottom row shows the clusters discovered by our proposed model. There is only a slight difference between the ground truth and the inferred mixtures of bar topics. These results demonstrate the effectiveness and flexibility of our algorithm in learning with both continuous and discrete data.

Figure 2: Synthetic multilevel bar topic data (top two rows: topics in each cluster and generated data) and rediscovered output (bottom row: rediscovered data).

4.2 Real-world data

We now demonstrate our proposed model on two real-world datasets: the LabelMe dataset [16, 13] with continuous observations and the NUS-WIDE dataset [2] with discrete observations. Statistics for these datasets are presented in Table 1b.

LabelMe dataset
This dataset consists of 2,688 annotated images classified into eight scene categories: tall buildings, inside city, street, highway, coast, open country, mountain, and forest [16]. Each image contains multiple annotated regions; each region, which is annotated by users, represents an object in the image. (Sample images and annotated regions can be found at http://people.csail.mit.edu/torralba/code/spatialenvelope/.) We removed images containing fewer than four annotated regions and obtained 1,800 images in total. We then extracted GIST features [10], a visual descriptor that represents the perceptual dimensions and oriented spatial structures of a scene, for each region in an image. We used PCA to reduce the number of dimensions to 30.

Table 2: Clustering performance on the LabelMe (continuous) dataset.
Method     NMI     ARI     AMI
K-means    0.37    0.282   0.365
SVB-MC2    0.315   0.206   0.273
W-means    0.423   0.35    0.416
MCT        0.485   0.412   0.477
NUS-WIDE dataset
We used a subset of the original NUS-WIDE dataset [2] containing images of 13 kinds of animals (squirrel, cow, cat, zebra, tiger, lion, elephant, whales, rabbit, snake, antlers, hawk, and wolf), comprising 2,054 images in the training subset. Each image is annotated with several tags out of a vocabulary of 1,000 tags. We filtered out images with fewer than three tags and obtained 1,040 images with 238 remaining tags. After preprocessing, we have a dataset with 1,040 groups; each data point in a group is a one-hot vector of 238 dimensions representing a tag word annotated for that group (image).

Table 3: Clustering performance on the NUS-WIDE (discrete) dataset.
Method     NMI     ARI     AMI
MCT        0.423   0.255   0.39

Figure 3: Tag clouds of six clusters discovered by the proposed model on the NUS-WIDE dataset.

Baseline methods
We quantitatively compare our proposed method to the baseline approaches discussed in [5], including K-means, W-means, and SVB-MC2 without context [7]. We use three popular metrics to evaluate clustering performance: NMI (Normalized Mutual Information) [17, Sec. 16.3], ARI (Adjusted Rand Index) [6], and AMI (Adjusted Mutual Information) [20].
Experimental results
We conducted experiments on the LabelMe dataset with the number of local atoms set to $K = 5$, the number of global atoms set to $L = 15$, and the number of clusters set to $C = 8$. We chose the hyper-parameters for the penalization terms to be $\lambda_l = 3$ and $\lambda_g = 3$. As shown in Table 2, our proposed method is superior to the baseline methods in terms of clustering performance.

We also compared clustering performance on the discrete real-world dataset NUS-WIDE. We chose $K = 2$, $L = 4$, and $C = 13$ with the hyper-parameters $\lambda_l = 1$ and $\lambda_g = 1.6$. The results are presented in Table 3. Since the baseline methods are applicable only to continuous data, we normalized the discrete data for each image and then applied the baseline methods to cluster the dataset. The results show that the clustering performance of K-means and W-means is inferior to that of our proposed model, which directly models discrete data. Moreover, the Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI) of these models show that their clustering outcomes are not robust.

To illustrate the qualitative behavior of the proposed model, we selected six of the thirteen discovered clusters and computed the proportion of tags present in each cluster. Figure 3 depicts the tag clouds of these clusters. Each tag cloud consistently manifests the cluster content. For example, the top-left tag cloud denotes the cluster of rabbits, which corresponds to one of the ground-truth subsets of images.
5 Conclusion

We have proposed a probabilistic model that uses a novel composite transportation distance to cluster data with potentially complex hierarchical multilevel structures. The proposed model is able to handle both discrete and continuous observations. Experiments on simulated and real-world data show that our approach outperforms competing methods that also target multilevel clustering tasks. Our model is based on an exponential family assumption on the data distribution and thereby applies naturally to other data types, e.g., a mixture of Poisson distributions [8]. Finally, there are several possible directions for extending our work. First, it is of interest to extend our approach to richer settings of hierarchical data similar to those considered in MC2 [11], e.g., when group-level context is available in the data. Second, our method requires knowledge of upper bounds on the numbers of clusters in both the local and global clustering. It is of practical importance to develop methods that can estimate these cardinalities efficiently.

References

[1] F. Bassetti, A. Bodini, and E. Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006.
[2] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: a real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, page 48. ACM, 2009.
[3] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[4] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[5] N. Ho, X. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Q. Phung. Multilevel clustering via Wasserstein means. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, pages 1501–1509, 2017.
[6] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[7] V. Huynh, D. Q. Phung, S. Venkatesh, X. Nguyen, M. D. Hoffman, and H. H. Bui. Scalable nonparametric Bayesian multilevel clustering. In UAI, 2016.
[8] R. R. Jayasekare, R. Gill, and K. Lee. Modeling discrete stock price changes using a mixture of Poisson distributions. Journal of the Korean Statistical Society, 45(3):409–421, 2016.
[9] M. I. Jordan. An Introduction to Probabilistic Graphical Models. 2003.
[10] D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999.
[11] V. Nguyen, D. Phung, X. Nguyen, S. Venkatesh, and H. Bui. Bayesian nonparametric multilevel clustering with group-level contexts. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[12] X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1):370–400, 2013.
[13] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.
[14] D. Pollard. Quantization and the method of K-means. IEEE Transactions on Information Theory, 28:199–205, 1982.
[15] A. Rodriguez, D. B. Dunson, and A. E. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103:1131–1144, 2008.
[16] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3):157–173, 2008.
[17] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Cambridge University Press, 2008.
[18] Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006.
[19] C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.
[20] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct):2837–2854, 2010.
[21] D. F. Wulsin, S. T. Jensen, and B. Litt. Nonparametric multi-level clustering of human epilepsy seizures. Annals of Applied Statistics, 10:667–689, 2016.
[22] J. Zhang, Y. Song, G. Chen, and C. Zhang. On-line evolutionary exponential family mixture. In IJCAI, pages 1610–1615, 2009.

Appendix
A Finite mixtures with regularized composite transportation distance
In this section, we provide detailed analyses for obtaining the updates of the weights and atoms in Algorithm 1, which finds a local solution of the objective function in Eq. (6) for finite mixtures with regularized composite transportation distance. To ease the presentation, we recall this objective function:
\[
\min_{\omega_K,\Theta_K}\ \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle - \lambda H(\pi),
\]
where $\lambda > 0$ and $H(\pi) = -\sum_{i,j}\pi_{ij}\log\pi_{ij}$ is the entropy of $\pi \in \Pi(\mathbf{1}_n/n, \omega_K)$. Here, $P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ is the empirical measure of the samples $X_1,\ldots,X_n$. Furthermore, $M = (M_{ij})$ is the cost matrix with $M_{ij} = -\log f(X_i\,|\,\theta_j)$ for $1 \leq i \leq n$, $1 \leq j \leq K$, while $\Pi(\cdot,\cdot)$ is the set of transportation plans between $\mathbf{1}_n/n$ and $\omega_K$.
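For reference while following the derivations below, the inner objective of Eq. (6) for a fixed plan is straightforward to evaluate; a minimal sketch (ours, with illustrative names):

```python
import numpy as np

def inner_objective(pi, M, lam):
    # <pi, M> - lam * H(pi), the quantity minimized over plans pi in Eq. (6).
    H = -np.sum(pi * np.log(pi + 1e-300))   # entropy, guarded against zero entries
    return np.sum(pi * M) - lam * H
```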
A.1 Update weights

Our strategy for updating the weights $\omega_K$ in the above objective function relies on solving the following relaxation of that optimization problem:
\[
\inf_{\pi \in \mathcal{S}_n} \langle \pi, M\rangle - \lambda H(\pi), \tag{15}
\]
where $\mathcal{S}_n = \big\{\pi : \sum_{j=1}^{K} \pi_{ij} = 1/n \text{ for all } 1 \leq i \leq n\big\}$. Invoking Lagrange multipliers for the constraint $\pi\mathbf{1}_K = \mathbf{1}_n/n$, the above objective function is equivalent to minimizing the function
\[
F = \sum_{i=1}^{n}\sum_{j=1}^{K} \pi_{ij}M_{ij} + \lambda \sum_{i=1}^{n}\sum_{j=1}^{K} \pi_{ij}(\log \pi_{ij} - 1) + \sum_{i=1}^{n} \kappa_i\Big(\sum_{j=1}^{K}\pi_{ij} - \frac{1}{n}\Big).
\]
By taking the derivative of $F$ with respect to $\pi_{ij}$ and setting it to zero, the following equation holds:
\[
\frac{\partial F}{\partial \pi_{ij}} = M_{ij} + \lambda\log\pi_{ij} + \kappa_i = 0.
\]
The above equation leads to
\[
\pi_{ij} = \exp\Big(\frac{-M_{ij} - \kappa_i}{\lambda}\Big) = \big(f(X_i\,|\,\theta_j)\big)^{1/\lambda}\exp\Big(-\frac{\kappa_i}{\lambda}\Big).
\]
Invoking the condition $\sum_{k}\pi_{ik} = 1/n$, we have
\[
\exp\Big(-\frac{\kappa_i}{\lambda}\Big)\sum_{j=1}^{K}\big(f(X_i\,|\,\theta_j)\big)^{1/\lambda} = \frac{1}{n}, \quad \text{i.e.,} \quad \exp\Big(-\frac{\kappa_i}{\lambda}\Big) = \frac{1}{n\sum_{j=1}^{K}\big(f(X_i\,|\,\theta_j)\big)^{1/\lambda}}.
\]
Governed by the previous equations, we find that
\[
\pi_{ij} = \frac{1}{n}\,\frac{\big(f(X_i\,|\,\theta_j)\big)^{1/\lambda}}{\sum_{k=1}^{K}\big(f(X_i\,|\,\theta_k)\big)^{1/\lambda}}. \tag{16}
\]
Therefore, we can update the weight $\omega_j$ as $\omega_j = \sum_{i=1}^{n}\pi_{ij}$ for any $1 \leq j \leq K$.

A.2 Update atoms
Given the updates for the weights $\omega_K$ and the formulation of the cost matrix $M$, to obtain the update for the atoms $\theta_j$, $1 \leq j \leq K$, we optimize the following objective function:
\[
\min_{\Theta_K}\ -\sum_{i=1}^{n}\sum_{j=1}^{K}\pi_{ij}\log f(X_i\,|\,\theta_j). \tag{17}
\]
Since $f(x\,|\,\theta)$ is an exponential family distribution with natural parameter $\theta$, we can represent it as
\[
f(x\,|\,\theta) = h(x)\exp\big(\langle T(x),\theta\rangle - A(\theta)\big),
\]
where $A(\theta)$ is the log-partition function, which is convex. Plugging this formulation of $f(x\,|\,\theta)$ into the objective function (17) and taking the derivative with respect to $\theta_j$, we obtain the equation
\[
\sum_{i=1}^{n}\pi_{ij}T(X_i) - \omega_j\nabla A(\theta_j) = 0. \tag{18}
\]
Therefore, we can update the atom $\theta_j$ as the solution of the above equation for $1 \leq j \leq K$.
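The atom equation (18) is solved by inverting $\nabla A$. For example, for Bernoulli components in natural parameters, $A(\theta) = \log(1 + e^{\theta})$ coordinate-wise, so $\nabla A$ is the sigmoid and its inverse is the logit. The sketch below (our own, with illustrative names) applies this to a whole plan at once.

```python
import numpy as np

def update_atoms_bernoulli(pi, X):
    """Solve Eq. (18) for Bernoulli components: sigmoid(theta_j) = (1/omega_j) sum_i pi_ij X_i.

    pi: (n, K) transportation plan; X: (n, d) binary data, with T(x) = x."""
    omega = pi.sum(axis=0)                      # omega_j = sum_i pi_ij
    mean = (pi.T @ X) / omega[:, None]          # weighted sufficient statistics
    mean = np.clip(mean, 1e-6, 1 - 1e-6)        # keep the logit finite
    return np.log(mean / (1 - mean))            # theta_j = logit(mean), inverse of sigmoid
```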
A.3 Proof of local convergence of Algorithm 1

Given the formulation of Algorithm 1, we demonstrate its convergence to a local solution of the objective function (6), as claimed in Theorem 1. Our proof of the theorem follows directly from the updates of the weights and atoms via Lagrange multipliers. In particular, we denote by $\omega^{(t)}_K$, $\Theta^{(t)}_K$, and $\pi^{(t)}$ the updates of the weights, atoms, and transportation plan at step $t$ of Algorithm 1 for $t \geq 0$. Additionally, let $M^{(t)}$ be the cost matrix at step $t$, i.e., $M^{(t)}_{ij} = -\log f(X_i\,|\,\theta^{(t)}_j)$ for all $i, j$. Furthermore, we denote
\[
g(\omega_K, \Theta_K) := \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle - \lambda H(\pi).
\]
For each $t \geq 0$, it is clear that
\[
g(\omega^{(t)}_K, \Theta^{(t)}_K) = \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega^{(t)}_K)} \big\langle \pi, M^{(t)}\big\rangle - \lambda H(\pi) \;\geq\; \inf_{\pi \in \mathcal{S}_n} \big\langle \pi, M^{(t)}\big\rangle - \lambda H(\pi) \;\geq\; \big\langle \pi^{(t+1)}, M^{(t)}\big\rangle - \lambda H\big(\pi^{(t+1)}\big),
\]
where $\mathcal{S}_n = \{\pi : \sum_{j=1}^{K} \pi_{ij} = 1/n\ \forall\, 1 \leq i \leq n\}$. Here, the first inequality is due to the fact that $\Pi(\mathbf{1}_n/n, \omega^{(t)}_K) \subset \mathcal{S}_n$, while the second inequality is due to (16) in Subsection A.1. According to the update of the atoms in (18) in Subsection A.2, we have
\[
\big\langle \pi^{(t+1)}, M^{(t)}\big\rangle - \lambda H\big(\pi^{(t+1)}\big) \;\geq\; \big\langle \pi^{(t+1)}, M^{(t+1)}\big\rangle - \lambda H\big(\pi^{(t+1)}\big) \;\geq\; \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega^{(t+1)}_K)} \big\langle \pi, M^{(t+1)}\big\rangle - \lambda H(\pi) = g(\omega^{(t+1)}_K, \Theta^{(t+1)}_K).
\]
Governed by the above results, for any $t \geq 0$, the following holds:
\[
g(\omega^{(t)}_K, \Theta^{(t)}_K) \geq g(\omega^{(t+1)}_K, \Theta^{(t+1)}_K).
\]
As a consequence, we obtain the conclusion of Theorem 1.
B Regularized composite transportation barycenter
In this section, we provide a detailed algorithm for obtaining a local solution of the regularized composite transportation barycenter objective function in Eq. (8). To facilitate the discussion, we recall the formulation of that objective function. In particular, the objective function with regularized composite transportation distance has the form
\[
\min_{w_L,\Psi_L} \sum_{j=1}^{J} a_j \min_{\pi^j \in \Pi(\omega^j_{K_j},\, w_L)} \langle \pi^j, M^j\rangle - \lambda H(\pi^j),
\]
where $M^j$ is the corresponding KL cost matrix between the finite mixture probability distribution $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$ and $Q_{w_L,\Psi_L}$ for $1 \leq j \leq J$. Here, $\{a_j\}_{j=1}^{J} \in \Delta_J$ are given weights associated with the finite mixture probability distributions $\{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\}_{j=1}^{J}$. As $f(x\,|\,\theta)$ is an exponential family, the cost matrix $M^j = (M^j_{uv})$ has the formulation
\[
M^j_{uv} = \mathrm{KL}\big(f(x\,|\,\psi_v), f(x\,|\,\theta^j_u)\big) = A(\theta^j_u) - A(\psi_v) - \big\langle \nabla A(\psi_v), \theta^j_u - \psi_v\big\rangle
\]
for all $1 \leq u \leq K_j$ and $1 \leq v \leq L$.

B.1 Update weights and atoms

Our procedure for updating the weights $w_L$ in the objective function of the regularized composite transportation distance is similar to Algorithm 1 in [4]. Therefore, we focus only on the updates of the atoms $\Psi_L = (\psi_1,\ldots,\psi_L)$. Given the updates of the weights $w_L$, we compute the optimal transportation plan $\pi^j = (\pi^j_{uv})$ between $\omega^j_{K_j}$ and $w_L$ using Algorithm 3 in [4]. Then, to obtain the updates for $\Psi_L$, we consider the following optimization problem:
\[
\min_{\Psi_L} \sum_{j=1}^{J} a_j \sum_{u=1}^{K_j}\sum_{v=1}^{L} \pi^j_{uv}M^j_{uv}.
\]
By taking the derivative of the above objective function with respect to $\psi_v$ and setting it to 0, we obtain the equation
\[
\sum_{j=1}^{J}\sum_{u=1}^{K_j} \pi^j_{uv}\,\nabla^2 A(\psi_v)\big(\theta^j_u - \psi_v\big) = 0.
\]
One possible solution to the above equation is $\sum_{j=1}^{J}\sum_{u=1}^{K_j}\pi^j_{uv}(\theta^j_u - \psi_v) = 0$. This suggests that
\[
\psi_v = \frac{\sum_{j=1}^{J}\sum_{u=1}^{K_j}\pi^j_{uv}\theta^j_u}{\sum_{j=1}^{J}\sum_{u=1}^{K_j}\pi^j_{uv}} \tag{19}
\]
for all $1 \leq v \leq L$. Equipped with these updates for the weights $w_L$ and atoms $\Psi_L$, we summarize an algorithm for determining a local solution of the regularized composite transportation barycenter in Eq. (8) in Algorithm 3.
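Eq. (19) is a plan-weighted average of the local atoms' natural parameters; a compact sketch (ours, with illustrative names):

```python
import numpy as np

def barycenter_atoms(plans, atoms):
    """Eq. (19): psi_v = sum_{j,u} pi^j_{uv} theta^j_u / sum_{j,u} pi^j_{uv}.

    plans[j]: (K_j, L) plan between omega^j and w_L; atoms[j]: (K_j, d) natural parameters."""
    num = sum(p.T @ th for p, th in zip(plans, atoms))   # (L, d) weighted sums
    den = sum(p.sum(axis=0) for p in plans)              # (L,) total plan mass per v
    return num / den[:, None]
```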
Algorithm 3 Regularized composite transportation barycenter
Input:
Finite mixture probability distributions $\{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\}_{j=1}^{J}$; given weights $\{a_j\}_{j=1}^{J}$; and the regularization hyper-parameter $\lambda > 0$.
Output:
Optimal weights $w_L$ and atoms $\Psi_L$.
Initialize weights $\{w_l\}_{l=1}^{L}$ and atoms $\{\psi_l\}_{l=1}^{L}$.
while not converged do
1. Update the weights $w_L$ as in Algorithm 1 of [4].
2. Compute the transportation plans $\pi^j$ for $1 \leq j \leq J$ using Algorithm 3 of [4].
3. Update the atoms $\Psi_L$ as in Eq. (19).
end while
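Steps 1–2 above defer to the barycenter and plan-computation routines of [4]. For the entropically regularized plans used throughout this paper, a generic Sinkhorn-style fixed-point iteration is the standard workhorse; the following is our own illustrative sketch, not the exact algorithm of [4].

```python
import numpy as np

def sinkhorn_plan(mu, nu, M, lam, iters=200):
    """Approximate argmin_{pi in Pi(mu, nu)} <pi, M> - lam * H(pi) by Sinkhorn scaling."""
    K = np.exp(-M / lam)                  # Gibbs kernel of the cost matrix
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)                # match the column marginals
        u = mu / (K @ v)                  # match the row marginals
    return u[:, None] * K * v[None, :]
```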
B.2 Local convergence of Algorithm 3

Given the formulation of Algorithm 3, the following theorem demonstrates that this algorithm determines a local solution of the objective function (8).
Theorem 3.
Algorithm 3 monotonically decreases the objective function (8) of the regularized composite transportation barycenter until local convergence.
The proof of Theorem 3 is a direct consequence of the updates of the weights and atoms in the above subsection and can be argued in a similar fashion to the proof of Theorem 1; therefore, it is omitted.
C Multilevel clustering with composite transportation distance
In this section, we provide a detailed argument for the algorithm development to determine local solutions of regularized multilevel composite transportation (MCT). To ease the later presentation, we recall the objective function of this problem as well as its important notation. We start with the objective function in Eq. (12):
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j},\mathcal{Q}}\ \sum_{j=1}^{J}\widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) + \zeta\widetilde{W}(\mathcal{P},\mathcal{Q}) - R(\pi,\tau,a),
\]
where $R(\pi,\tau,a) = \lambda_l\sum_{j=1}^{J} H(\pi^j) + \zeta\big[\lambda_a H(a) + \lambda_g\sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m})\big]$ is a combination of all the regularization terms for the local and global clustering. Here, for simplicity of the argument, we choose $\zeta = 1$ when deriving our learning updates. In the above objective function, $\mathcal{P} = \frac{1}{J}\sum_{j=1}^{J} \delta_{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}}$ and $\mathcal{Q} = \sum_{m=1}^{C} b_m\delta_{Q^m_{w^m_L,\Psi^m_L}}$. We summarize below the notation for our algorithm development.

Variables of local clustering structures
• Local transportation plans for group $j$: $\pi^j = \{\pi^j_{uv}\}_{u,v} \in \Pi(\mathbf{1}_{n_j}/n_j, \omega^j_{K_j})$, i.e., $\sum_v \pi^j_{uv} = 1/n_j$ and $\sum_u \pi^j_{uv} = \omega^j_v$.
• Local atoms for group $j$, $\Theta^j_{K_j} = \{\theta^j_k\}_{k=1}^{K_j}$, and their local mixing weights $\omega^j = \{\omega^j_k\}_{k=1}^{K_j}$.

Variables assigning groups to barycenters
• Global transportation plan $a = (a_{jm}) \in \Pi(\mathbf{1}_J/J, b)$ between $\mathcal{P}$ and $\mathcal{Q}$.

Variables of global clustering structures
• Partial global transportation plans between the local measure $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ and the global measure $Q^m_{w^m_L,\Psi^m_L}$: $\tau^{j,m} = \{\tau^{j,m}_{kl}\}_{k,l}$, where $\sum_l \tau^{j,m}_{kl} = \omega^j_k$ and $\sum_k \tau^{j,m}_{kl} = w^m_l$ for all $1 \leq j \leq J$ and $1 \leq m \leq C$.
• Global atoms for the global measures, $\Psi^m_L = \{\psi^m_l\}_{l=1}^{L}$, and global mixing weights $w^m_L = \{w^m_l\}_{l=1}^{L}$, where $w^m_l = \sum_k \tau^{j,m}_{kl}$ for any $j$.

C.1 Local clustering updates
As mentioned in the main text, to obtain updates for the local weights $\omega^j_{K_j}$ and local atoms $\Theta^j_{K_j}$, we solve the following regularized composite transportation barycenter problem:
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j}}\ \widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) - \lambda_l H(\pi^j) + \sum_{m=1}^{C} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \lambda_g\sum_{m=1}^{C} H(\tau^{j,m}).
\]
Equivalently, we solve
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j}}\ \inf_{\pi^j \in \Pi(\mathbf{1}_{n_j}/n_j,\, \omega^j_{K_j})}\langle \pi^j, M^j\rangle - \lambda_l H(\pi^j) + \sum_{m=1}^{C} a_{jm}\inf_{\tau^{j,m} \in \Pi(\omega^j_{K_j},\, w^m_L)}\langle \tau^{j,m}, \gamma^{j,m}\rangle - \lambda_g\sum_{m=1}^{C} H(\tau^{j,m}), \tag{20}
\]
where $M^j$ is the cost matrix between $P^j_{n_j}$ and $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$, given by
\[
[M^j]_{uv} = -\log f(X_{j,u}\,|\,\theta^j_v)
\]
for $1 \leq u \leq n_j$ and $1 \leq v \leq K_j$. Additionally, the cost matrix $\gamma^{j,m}$ has the formulation
\[
\gamma^{j,m}_{kl} = A(\theta^j_k) - A(\psi^m_l) - \big\langle \nabla A(\psi^m_l), \theta^j_k - \psi^m_l\big\rangle.
\]

Update local weights:
The idea for obtaining the local weight updates for the above objective function is similar to that in Section A.
Update local atoms:
Given the updates for the local weights $\omega^j_{K_j}$, to obtain the update equation for the local atoms $\Theta^j_{K_j}$, we optimize the following objective function:
\[
\min_{\Theta^j_{K_j}}\ -\sum_{u=1}^{n_j}\sum_{v=1}^{K_j}\pi^j_{uv}\log f(X_{j,u}\,|\,\theta^j_v) + \sum_{m=1}^{C} a_{jm}\sum_{v,l}\tau^{j,m}_{vl}\gamma^{j,m}_{vl}. \tag{21}
\]
Since $f(x\,|\,\theta)$ is an exponential family distribution with natural parameter $\theta$, we can represent it as $f(x\,|\,\theta) = h(x)\exp(\langle T(x),\theta\rangle - A(\theta))$, where $A(\theta)$ is the convex log-partition function. Given this formulation of $f(x\,|\,\theta)$, the objective function (21) is equivalent to minimizing
\[
F_{\mathrm{local}} = -\sum_{u=1}^{n_j}\sum_{v=1}^{K_j}\pi^j_{uv}\big(\langle T(X_{j,u}),\theta^j_v\rangle - A(\theta^j_v)\big) + \sum_{m=1}^{C} a_{jm}\sum_{v,l}\tau^{j,m}_{vl}\Big[A(\theta^j_v) - A(\psi^m_l) - \big\langle \nabla A(\psi^m_l), \theta^j_v - \psi^m_l\big\rangle\Big].
\]
By direct computation, $F_{\mathrm{local}}$ has the following partial derivative with respect to $\theta^j_v$:
\[
\frac{\partial F_{\mathrm{local}}}{\partial \theta^j_v} = -\sum_{u=1}^{n_j}\pi^j_{uv}\big(T(X_{j,u}) - \nabla A(\theta^j_v)\big) + \sum_{m=1}^{C} a_{jm}\sum_{l=1}^{L}\tau^{j,m}_{vl}\big[\nabla A(\theta^j_v) - \nabla A(\psi^m_l)\big] = -\sum_{u=1}^{n_j}\pi^j_{uv}T(X_{j,u}) + \omega^j_v\nabla A(\theta^j_v) + \sum_{m=1}^{C} a_{jm}\sum_{l=1}^{L}\tau^{j,m}_{vl}\big(\nabla A(\theta^j_v) - \nabla A(\psi^m_l)\big),
\]
where the second equality uses $\sum_{u=1}^{n_j}\pi^j_{uv} = \omega^j_v$. Setting this partial derivative to zero, we update the atom $\theta^j_v$ as the solution of the equation
\[
\nabla A(\theta^j_v) = \frac{\sum_{m=1}^{C} a_{jm}\sum_{l=1}^{L}\tau^{j,m}_{vl}\nabla A(\psi^m_l) + \sum_{u=1}^{n_j}\pi^j_{uv}T(X_{j,u})}{\sum_{m=1}^{C} a_{jm}\sum_{l=1}^{L}\tau^{j,m}_{vl} + \omega^j_v}. \tag{22}
\]
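For unit-variance Gaussian components, $\nabla A$ is the identity map, so Eq. (22) gives the new atom directly as a blend of the data term and the global-atom term. A sketch under that assumption (our own, with illustrative names):

```python
import numpy as np

def update_local_atoms_gaussian(pi, X, a_j, tau_j, psi):
    """Eq. (22) for unit-variance Gaussians (grad A = identity).

    pi: (n_j, K_j) local plan; X: (n_j, d) data of group j; a_j: (C,) row of the
    global plan; tau_j: (C, K_j, L) partial plans; psi: (C, L, d) global atoms."""
    omega = pi.sum(axis=0)                               # (K_j,) local weights
    num = pi.T @ X                                       # data term: sum_u pi_uv T(X_u)
    den = omega.copy()
    for m in range(len(a_j)):
        num += a_j[m] * (tau_j[m] @ psi[m])              # (K_j, L) @ (L, d) global-atom term
        den += a_j[m] * tau_j[m].sum(axis=1)
    return num / den[:, None]
```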
C.2 Computing the global transportation plan

Given the updates for the local weights $\omega^j_{K_j}$ and local atoms $\Theta^j_{K_j}$ for $1 \leq j \leq J$, we now develop an update for the global transportation plan $a = (a_{jm})$ between $\mathcal{P}$ and $\mathcal{Q}$. Our strategy relies on solving the following problem:
\[
\inf_{a}\ \sum_{j,m} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \lambda_a H(a),
\]
where $a$ in the above infimum satisfies the row-marginal constraint $a\mathbf{1}_C = \mathbf{1}_J/J$. By means of Lagrange multipliers $\kappa_j$ for these constraints, the above objective function can be rewritten as
\[
F_{\mathrm{global}} = \sum_{j,m} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \lambda_a H(a) + \sum_{j=1}^{J}\kappa_j\Big(\sum_{m=1}^{C} a_{jm} - \frac{1}{J}\Big).
\]
The function $F_{\mathrm{global}}$ has the partial derivative with respect to $a_{jm}$ (absorbing additive constants into $\kappa_j$)
\[
\frac{\partial F_{\mathrm{global}}}{\partial a_{jm}} = \widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) + \lambda_a\log a_{jm} + \kappa_j = \sum_{k,l}\tau^{j,m}_{kl}\gamma^{j,m}_{kl} + \lambda_a\log a_{jm} + \kappa_j.
\]
Setting this derivative to 0 and invoking the constraint $\sum_{m=1}^{C} a_{jm} = 1/J$, we find that
\[
a_{jm} = \frac{1}{J}\,\frac{\exp\big(-\sum_{k,l}\tau^{j,m}_{kl}\gamma^{j,m}_{kl}/\lambda_a\big)}{\sum_{m'=1}^{C}\exp\big(-\sum_{k,l}\tau^{j,m'}_{kl}\gamma^{j,m'}_{kl}/\lambda_a\big)} \tag{23}
\]
for $1 \leq j \leq J$ and $1 \leq m \leq C$.
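Eq. (23) is a row-wise softmax of the (negative, scaled) transport costs; a numerically stable sketch (our own, with illustrative names):

```python
import numpy as np

def update_global_plan(costs, lam_a):
    """Eq. (23): costs[j, m] = sum_{k,l} tau^{j,m}_{kl} gamma^{j,m}_{kl}, shape (J, C).

    Returns a = (a_{jm}) whose rows each sum to 1/J."""
    J = costs.shape[0]
    z = -costs / lam_a
    z -= z.max(axis=1, keepdims=True)    # stabilize the exponentials
    e = np.exp(z)
    return e / (J * e.sum(axis=1, keepdims=True))
```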
C.3 Global clustering updates

Given the updates of the local weights and atoms as well as the global transportation plan, we are now ready to develop updates for the global weights $w^m_L$ and global atoms $\Psi^m_L$ for $1 \leq m \leq C$. In particular, the objective function for updating these global parameters is
\[
\min_{\{w^m_L\},\{\Psi^m_L\}}\ \sum_{j=1}^{J}\sum_{m=1}^{C} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \lambda_g\sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m}).
\]
The above objective function can be rewritten as
\[
\min_{\{w^m_L\},\{\Psi^m_L\}}\ \sum_{j=1}^{J}\sum_{m=1}^{C} a_{jm}\Big(\inf_{\tau^{j,m}\in\Pi(\omega^j_{K_j},\, w^m_L)}\langle \tau^{j,m}, \gamma^{j,m}\rangle + \lambda_g\sum_{k,l}\tau^{j,m}_{kl}\big(\log\tau^{j,m}_{kl} - 1\big)\Big).
\]
Given the above objective function, for each $m$, to update the global weights $w^m_L$ and global atoms $\Psi^m_L$, we consider the following composite transportation barycenter problem:
\[
\min_{w^m_L,\Psi^m_L}\ \sum_{j=1}^{J} a_{jm}\Big(\inf_{\tau^{j,m}\in\Pi(\omega^j_{K_j},\, w^m_L)}\langle \tau^{j,m}, \gamma^{j,m}\rangle + \lambda_g\sum_{k,l}\tau^{j,m}_{kl}\big(\log\tau^{j,m}_{kl} - 1\big)\Big).
\]

Update global weights:
Given the above objective function, the idea for updating the global weights $w^m_L$ is similar to Algorithm 1 in [4].

Update partial transportation plans:
Once the global weights are obtained, we can use Algorithm 3 in [4] to update the optimal partial transportation plans $\tau^{j,m}$ between the local measures $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ and the global measures $Q^m_{w^m_L,\Psi^m_L}$.

Update global atoms:
With the updates for the global weights $w^m_L$, to obtain the update equation for the global atoms $\Psi^m_L$, we minimize the following objective function:
\[
F_{\mathrm{p\text{-}global}} = \sum_{j=1}^{J} a_{jm}\sum_{k,l}\tau^{j,m}_{kl}\gamma^{j,m}_{kl} = \sum_{j=1}^{J} a_{jm}\sum_{k,l}\tau^{j,m}_{kl}\Big(A(\theta^j_k) - A(\psi^m_l) - \big\langle \nabla A(\psi^m_l), \theta^j_k - \psi^m_l\big\rangle\Big).
\]
Taking the derivative of $F_{\mathrm{p\text{-}global}}$ with respect to $\psi^m_l$ and setting it to zero, we find that
\[
\frac{\partial F_{\mathrm{p\text{-}global}}}{\partial \psi^m_l} = \sum_{j=1}^{J} a_{jm}\sum_{k}\tau^{j,m}_{kl}\,\nabla^2 A(\psi^m_l)\big(\psi^m_l - \theta^j_k\big) = 0.
\]
Since the log-partition function $A(\cdot)$ is convex, $\nabla^2 A(\psi^m_l)$ is a positive-semidefinite matrix. Therefore, we can choose $\sum_{j=1}^{J} a_{jm}\sum_k\tau^{j,m}_{kl}(\psi^m_l - \theta^j_k) = 0$, which means that
\[
\psi^m_l = \frac{\sum_{j=1}^{J}\sum_{k=1}^{K_j} a_{jm}\tau^{j,m}_{kl}\theta^j_k}{\sum_{j=1}^{J}\sum_{k=1}^{K_j} a_{jm}\tau^{j,m}_{kl}}.
\]
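The global atom update above is again a weighted average of natural parameters, now weighted by both the global plan $a$ and the partial plans $\tau$; a sketch (ours, with illustrative names):

```python
import numpy as np

def update_global_atoms(a, tau, theta, L):
    """psi^m_l = sum_{j,k} a[j,m] tau[j][m][k,l] theta[j][k] / sum_{j,k} a[j,m] tau[j][m][k,l].

    a: (J, C) global plan; tau[j][m]: (K_j, L) partial plans; theta[j]: (K_j, d) local atoms."""
    J, C = a.shape
    d = theta[0].shape[1]
    num = np.zeros((C, L, d))
    den = np.zeros((C, L))
    for j in range(J):
        for m in range(C):
            w = a[j, m] * tau[j][m]        # (K_j, L) combined weights
            num[m] += w.T @ theta[j]       # accumulate weighted atoms
            den[m] += w.sum(axis=0)
    return num / den[..., None]
```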
C.4 Proof of local convergence of Algorithm 2

Equipped with the above updates of the local and global parameters of regularized MCT, we are ready to demonstrate the convergence of Algorithm 2 to a local solution of the objective function (12) of regularized MCT, as stated in Theorem 2. To simplify the argument, we provide only a proof sketch for this theorem. In particular, we denote by $\omega^{j,(t)}_{K_j}$ and $\Theta^{j,(t)}_{K_j}$ the updates of the local weights and local atoms at step $t$ of Algorithm 2 for $t \geq 0$. Similarly, we denote by $w^{m,(t)}_L$ and $\Psi^{m,(t)}_L$ the updates of the global weights and global atoms at step $t$. Furthermore, we denote
\[
g\big(\{\omega^{j,(t)}_{K_j}\}, \{\Theta^{j,(t)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big) := \sum_{j=1}^{J}\widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) + \zeta\widetilde{W}(\mathcal{P},\mathcal{Q}) - R(\pi,\tau,a),
\]
evaluated at the parameters of step $t$. Then, according to the local clustering updates step, we have
\[
g\big(\{\omega^{j,(t)}_{K_j}\}, \{\Theta^{j,(t)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big) \geq g\big(\{\omega^{j,(t+1)}_{K_j}\}, \{\Theta^{j,(t+1)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big).
\]
On the other hand, invoking the global clustering updates step, we achieve
\[
g\big(\{\omega^{j,(t+1)}_{K_j}\}, \{\Theta^{j,(t+1)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big) \geq g\big(\{\omega^{j,(t+1)}_{K_j}\}, \{\Theta^{j,(t+1)}_{K_j}\}, \{w^{m,(t+1)}_L\}, \{\Psi^{m,(t+1)}_L\}\big).
\]
Governed by the above results, for any $t \geq 0$, the following holds:
\[
g\big(\{\omega^{j,(t)}_{K_j}\}, \{\Theta^{j,(t)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big) \geq g\big(\{\omega^{j,(t+1)}_{K_j}\}, \{\Theta^{j,(t+1)}_{K_j}\}, \{w^{m,(t+1)}_L\}, \{\Psi^{m,(t+1)}_L\}\big).
\]
As a consequence, Algorithm 2 monotonically decreases the objective function of regularized MCT, which establishes Theorem 2.