Probabilistic Multilevel Clustering via Composite Transportation Distance
Nhat Ho⋆,†   Viet Huynh⋆,‡   Dinh Phung‡   Michael I. Jordan†
† University of California at Berkeley, Berkeley, USA
‡ Monash University, Australia
October 30, 2018
⋆ Nhat Ho and Viet Huynh contributed equally to this work and are listed in alphabetical order.
Abstract
We propose a novel probabilistic approach to multilevel clustering problems based on composite transportation distance, which is a variant of transportation distance where the underlying metric is the Kullback-Leibler divergence. Our method involves solving a joint optimization problem over spaces of probability measures to simultaneously discover grouping structures within groups and among groups. By exploiting the connection of our method to the problem of finding composite transportation barycenters, we develop fast and efficient optimization algorithms even for potentially large-scale multilevel datasets. Finally, we present experimental results with both synthetic and real data to demonstrate the efficiency and scalability of the proposed approach.
1 Introduction

Clustering is a classic and fundamental problem in machine learning. Popular clustering methods such as K-means and the EM algorithm have been the workhorses of exploratory data analysis. However, the underlying model for such methods is a simple flat partition or a mixture model, which does not capture multilevel structures (e.g., words are grouped into documents, documents are grouped into corpora) that arise in many applications in the physical, biological or cognitive sciences. The clustering of multilevel structured data calls for novel methodologies beyond classical clustering.

One natural approach for capturing multilevel structures is to use a hierarchy in which data are clustered locally into groups, and those groups are partitioned in a "global clustering." Attempts to develop algorithms of this kind can be roughly classified into two categories. The first category makes use of probabilistic models, often based on Dirichlet process priors. Examples in this vein include the Hierarchical Dirichlet Process (HDP) [18], the Nested Dirichlet Process (NDP) [15], Multilevel Clustering with Context (MC2) [11], and the Multilevel Clustering Hierarchical Dirichlet Process (MLC-HDP) [21]. Despite the flexibility and solid statistical foundation of these models, they have seen limited application to large-scale datasets, given concerns about the computational scaling of the sampling-based algorithms that are generally used for inference under these models.

A second category of multilevel methods is based on tools from optimal transport theory, where algorithms such as Wasserstein barycenters provide scalable computation [3, 4]. These methods trace their origins to a seminal paper by [14], which established a connection between the K-means algorithm and the problem of determining a discrete probability measure that is close in Wasserstein distance [19] to the empirical measure of the data. Based on this connection, [5] developed a multilevel clustering method built upon Wasserstein barycenters; however, the Euclidean ground metric underlying that approach limits the types of data it can model directly. In this paper we propose to employ composite transportation distance [12] to overcome this limitation and to provide a more general multilevel clustering method. The salient feature of composite transportation distance is that it utilizes the Kullback-Leibler (KL) divergence as the underlying metric of the optimal transportation distance, in contrast to the standard Euclidean metric that has been used in optimal transportation approaches to clustering to date.

In order to motivate our use of composite transportation distance, we start with one-level structured data in which the data are generated from a finite mixture model, e.g., a mixture of (multivariate) Gaussian distributions or multinomial distributions. Unlike traditional estimators such as maximum likelihood estimation (MLE), the high-level idea of using composite transportation distance is to determine optimal parameters that minimize the KL cost of moving the likelihood from one cluster to another cluster. Intuitively, with such a distance, we can exploit the underlying geometric structure of the parameters to perform efficient clustering of the data. Another advantage of composite transportation distance is its flexibility in generalizing to multilevel structured data. More precisely, by representing each group in a multilevel clustering problem by an unknown mixture model (local clustering), we can determine the optimal parameters of each group, which can be represented as local (probability) measures, via optimization problems based on composite transportation distance.
Then, in order to determine a global clustering among these groups, we solve a composite transportation barycenter problem over the local measures to obtain a global measure over the space of mixture models, which serves as a partition of these groups. As a result, our final method, which we refer to as multilevel composite transportation (MCT), involves solving a joint optimal transport optimization problem with respect to both a local clustering and a global clustering, based on a cost matrix encoding the KL divergence among atoms. The solution strategy uses fast computation methods for Wasserstein barycenters combined with coordinate descent.

In summary, our main contributions are the following: (i) a new optimization formulation for clustering of multilevel data of various types, including both continuous and discrete observations, based on composite transportation distance; (ii) a highly scalable solution strategy for this optimization formulation; (iii) although our approach avoids the use of the Dirichlet process as a building block, it retains much of the flexibility of the hierarchical Dirichlet process in its ability to share atoms among local clusterings. We are thus able to borrow strength among clusters, which improves statistical efficiency in certain applications, e.g., image annotation in computer vision.

The paper is organized as follows. Section 2 provides preliminary background on composite transportation distance and composite transportation barycenters. Section 3 formulates the multilevel composite transportation optimization model, while Section 4 presents studies with both synthetic and real data. Finally, we conclude the paper with a discussion in Section 5. Technical details of proofs and algorithm development are provided in the Appendix.
2 Composite transportation distance
Throughout this paper, we let $\Theta$ be a bounded subset of $\mathbb{R}^d$ for a given dimension $d \geq 1$, and we let $\{f(x\,|\,\theta),\ \theta \in \Theta\}$ be a given exponential family of distributions with natural parameter $\theta$:
\[
f(x\,|\,\theta) := h(x)\exp\big(\langle T(x), \theta\rangle - A(\theta)\big),
\]
where $A(\theta)$ is the log-partition function, which is convex. We define $P_{\theta}$ to be the probability distribution whose density function is $f(x\,|\,\theta)$. Given a fixed number of components $K$, we denote a finite mixture distribution as follows:
\[
P_{\omega_K,\Theta_K} := \sum_{k=1}^{K} \omega_k P_{\theta_k}, \tag{1}
\]
where $\omega_K = (\omega_1,\ldots,\omega_K) \in \Delta_K$, the probability simplex of dimension $K-1$, and $\Theta_K = \{\theta_k\}_{k=1}^{K} \in \Theta^K$ are the weights and atoms. Then, the probability density function of the mixture model can be expressed as
\[
p_{\omega_K,\Theta_K}(x) := \sum_{k=1}^{K} \omega_k f(x\,|\,\theta_k).
\]
We also use $Q_{\omega_K,\Theta_K}$ to denote a finite mixture of at most $K$ components, to avoid potential notational clutter.

For any two finite mixture probability distributions $P_{\omega_K,\Theta_K}$ and $P_{\omega'_{K'},\Theta'_{K'}}$ with any two given numbers $K$ and $K'$, we define the composite transportation distance between $P_{\omega_K,\Theta_K}$ and $P_{\omega'_{K'},\Theta'_{K'}}$ as follows:
\[
\widetilde{W}\big(P_{\omega_K,\Theta_K}, P_{\omega'_{K'},\Theta'_{K'}}\big) := \inf_{\pi \in \Pi(\omega_K,\, \omega'_{K'})} \langle \pi, M\rangle, \tag{2}
\]
where the cost matrix $M = (M_{ij})$ satisfies $M_{ij} = \mathrm{KL}\big(f(x\,|\,\theta_i), f(x\,|\,\theta'_j)\big)$ for $1 \leq i \leq K$ and $1 \leq j \leq K'$. Here, $\langle \cdot,\cdot\rangle$ denotes the Frobenius inner product of two matrices, and $\Pi(\omega_K, \omega'_{K'})$ is the set of all probability measures (equivalently, transportation plans) $\pi$ on $[0,1]^{K \times K'}$ that have marginals $\omega_K$ and $\omega'_{K'}$, respectively.
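To make the definition in Eq. (2) concrete, the sketch below computes the composite transportation distance between two univariate Gaussian mixtures by building the KL cost matrix in closed form and solving the small transportation linear program. This is our own illustrative example, not code from the paper; the function names and the use of scipy.optimize.linprog are our choices.

```python
import numpy as np
from scipy.optimize import linprog

def kl_gauss(m1, s1, m2, s2):
    # Closed-form KL(N(m1, s1^2) || N(m2, s2^2)) for univariate Gaussians.
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def composite_distance(w1, comps1, w2, comps2):
    """Eq. (2): min <pi, M> over plans pi with marginals w1 and w2.

    comps are lists of (mean, std) pairs; M_ij = KL(f_i, f_j)."""
    K1, K2 = len(w1), len(w2)
    M = np.array([[kl_gauss(*ci, *cj) for cj in comps2] for ci in comps1])
    # Marginal constraints: row i of pi sums to w1[i], column j sums to w2[j].
    A_eq, b_eq = [], []
    for i in range(K1):
        row = np.zeros(K1 * K2)
        row[i * K2:(i + 1) * K2] = 1.0
        A_eq.append(row); b_eq.append(w1[i])
    for j in range(K2):
        col = np.zeros(K1 * K2)
        col[j::K2] = 1.0
        A_eq.append(col); b_eq.append(w2[j])
    res = linprog(M.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# Two mixtures with different numbers of components.
d = composite_distance([0.5, 0.5], [(0.0, 1.0), (4.0, 1.0)],
                       [0.3, 0.3, 0.4], [(0.0, 1.0), (4.0, 1.0), (8.0, 1.0)])
print(d)
```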
Detailed form of the cost matrix
Since $f(x\,|\,\theta)$ belongs to an exponential family, we can compute $\mathrm{KL}(f(x\,|\,\theta), f(x\,|\,\theta'))$ in closed form as follows [22; 9, Ch. 8]:
\[
\mathrm{KL}\big(f(x\,|\,\theta'), f(x\,|\,\theta)\big) = D_A(\theta, \theta'),
\]
where $D_A(\cdot,\cdot)$ is the Bregman divergence associated with the log-partition function $A(\cdot)$ of $f$, i.e.,
\[
D_A(\theta, \theta') = A(\theta) - A(\theta') - \big\langle \nabla A(\theta'), \theta - \theta'\big\rangle.
\]
Therefore, the cost matrix $M$ has the explicit form
\[
M_{ij} = A(\theta_i) - A(\theta'_j) - \big\langle \nabla A(\theta'_j), \theta_i - \theta'_j\big\rangle \tag{3}
\]
for $1 \leq i \leq K$ and $1 \leq j \leq K'$. Note that the order of the parameters is reversed between the KL and Bregman divergences.

Composite transportation distance on the space of finite mixtures of finite mixtures
We can recursively define finite mixtures of finite mixtures, and define a suitable version of composite transportation distance on this abstract space. In particular, consider a collection of $N$ finite mixture probability distributions with at most $K$ components, $\{P_{\omega^i_K,\Theta^i_K}\}_{i=1}^{N}$, and a collection of $\bar{N}$ finite mixture probability distributions with at most $\bar{K}$ components, $\{P_{\bar{\omega}^i_{\bar{K}},\bar{\Theta}^i_{\bar{K}}}\}_{i=1}^{\bar{N}}$. We define two finite mixtures of these distributions as follows:
\[
\mathcal{P} = \sum_{i=1}^{N} \tau_i P_{\omega^i_K,\Theta^i_K}, \qquad \mathcal{Q} = \sum_{i=1}^{\bar{N}} \bar{\tau}_i P_{\bar{\omega}^i_{\bar{K}},\bar{\Theta}^i_{\bar{K}}},
\]
where $\tau = (\tau_1,\ldots,\tau_N) \in \Delta_N$ and $\bar{\tau} = (\bar{\tau}_1,\ldots,\bar{\tau}_{\bar{N}}) \in \Delta_{\bar{N}}$. Then, the composite transportation distance between $\mathcal{P}$ and $\mathcal{Q}$ is
\[
\widetilde{W}(\mathcal{P}, \mathcal{Q}) := \inf_{\pi \in \Pi(\tau,\, \bar{\tau})} \langle \pi, \bar{M}\rangle,
\]
where the cost matrix $\bar{M} = (\bar{M}_{ij})$ is defined as $\bar{M}_{ij} = \widetilde{W}\big(P_{\omega^i_K,\Theta^i_K}, P_{\bar{\omega}^j_{\bar{K}},\bar{\Theta}^j_{\bar{K}}}\big)$ for $1 \leq i \leq N$ and $1 \leq j \leq \bar{N}$. Note that, in a slight abuse of notation, $\widetilde{W}(\cdot,\cdot)$ is used for both finite mixtures and finite mixtures of finite mixtures.

2.2 Finite mixtures with composite transportation distance

In this section, we assume that $X_1,\ldots,X_n$ are i.i.d. samples from the mixture density
\[
p_{\omega^0,\Theta^0}(x) = \sum_{i=1}^{k_0} \omega^0_i f(x\,|\,\theta^0_i),
\]
where $k_0 < \infty$ is the true number of components. Since $k_0$ is generally unknown, we fit this model with a mixture of $K$ distributions, where $K \geq k_0$. Denote by $P_n := \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ the empirical measure of the samples $X_1,\ldots,X_n$. To facilitate the discussion, we define the following composite transportation distance between the empirical measure $P_n$ and the mixture probability distribution $P_{\omega_K,\Theta_K}$:
\[
\widetilde{W}(P_n, P_{\omega_K,\Theta_K}) := \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle, \tag{4}
\]
where $M = (M_{ij}) \in \mathbb{R}^{n \times K}$ is a cost matrix defined as $M_{ij} := -\log f(X_i\,|\,\theta_j)$ for $1 \leq i \leq n$, $1 \leq j \leq K$. Furthermore, $\Pi(\cdot,\cdot)$ is the set of transportation plans between $\mathbf{1}_n/n$ and $\omega_K$. To estimate the true weights $\omega^0_i$ and true components $\theta^0_i$ for $1 \leq i \leq k_0$, we solve an optimization problem with the transportation distance $\widetilde{W}$:
\[
(\widehat{\omega}_{n,K}, \widehat{\Theta}_{n,K}) = \operatorname*{arg\,min}_{\omega_K,\, \Theta_K} \widetilde{W}(P_n, P_{\omega_K,\Theta_K}). \tag{5}
\]
The estimator $(\widehat{\omega}_{n,K}, \widehat{\Theta}_{n,K})$ is usually referred to as a Minimum Kantorovich estimator [1].
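As a quick numerical illustration of Eq. (3), the snippet below evaluates the Bregman divergence of a log-partition function. For unit-variance Gaussians in natural parameters, $A(\theta) = \|\theta\|^2/2$, so $D_A(\theta,\theta') = \|\theta-\theta'\|^2/2$, matching the well-known KL divergence between the corresponding Gaussians. The code is our own sketch; the helper names are illustrative.

```python
import numpy as np

def bregman(A, grad_A, theta, theta_p):
    # D_A(theta, theta') = A(theta) - A(theta') - <grad A(theta'), theta - theta'>  (Eq. (3))
    return A(theta) - A(theta_p) - grad_A(theta_p) @ (theta - theta_p)

# Unit-variance Gaussian in natural parameters: A(theta) = ||theta||^2 / 2.
A = lambda t: 0.5 * t @ t
grad_A = lambda t: t

t1, t2 = np.array([1.0, 0.0]), np.array([0.0, 2.0])
print(bregman(A, grad_A, t1, t2))  # 2.5 == ||t1 - t2||^2 / 2 == KL(N(t2, I) || N(t1, I))
```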
Algorithm 1 Composite Transportation Distance with Mixtures
Input:
Data $D = \{X_i\}_{i=1}^{n}$; the number of clusters $K$; the regularization hyper-parameter $\lambda > 0$.
Output:
Optimal weights and atoms $\{\omega_j, \theta_j\}_{j=1}^{K}$.
Initialize weights $\{\omega_j\}_{j=1}^{K}$ and atoms $\{\theta_j\}_{j=1}^{K}$.
while not converged do
1. Update weights $\omega_j$:
for $j = 1$ to $K$ do
  Compute the transportation plan $\pi_{ij}$ as
  \[
  \pi_{ij} = \big(f(X_i\,|\,\theta_j)\big)^{1/\lambda} \Big/ \Big( n \sum_{k=1}^{K} \big(f(X_i\,|\,\theta_k)\big)^{1/\lambda} \Big) \quad \text{for } 1 \leq i \leq n.
  \]
  Update the weight $\omega_j = \sum_{i=1}^{n} \pi_{ij}$.
end for
2. Update atoms $\theta_j$:
for $j = 1$ to $K$ do
  Update the atom $\theta_j$ as the solution of the equation $\nabla A(\theta_j) = \frac{1}{\omega_j}\sum_{i=1}^{n} \pi_{ij} T(X_i)$.
end for
end while

2.2.2 Regularized composite transportation distance

As is the case with the traditional optimal transportation distance, the composite transportation distance $\widetilde{W}$ does not have a favorable computational complexity. Therefore, we consider an entropic regularizer to speed up its computation [3]. More precisely, we consider the following regularized version of $\widetilde{W}(P_n, P_{\omega_K,\Theta_K})$:
\[
\inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle - \lambda H(\pi),
\]
where $\lambda > 0$ and $H(\pi) := -\sum_{i,j} \pi_{ij}\log \pi_{ij}$ is the entropy of $\pi \in \Pi(\mathbf{1}_n/n, \omega_K)$. Equipped with this regularization, we have a regularized version of the optimal estimator in (5):
\[
\min_{\omega_K,\Theta_K}\ \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle - \lambda H(\pi). \tag{6}
\]
We summarize the algorithm for determining local solutions of the above objective function in Algorithm 1. The details of how to obtain the updates of the weights and atoms in Algorithm 1 are deferred to the Supplementary Material.
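As an illustration of Algorithm 1, the sketch below specializes it to spherical Gaussians with identity covariance, for which $T(x) = x$, $A(\theta) = \|\theta\|^2/2$, and the atom update $\nabla A(\theta_j) = \frac{1}{\omega_j}\sum_i \pi_{ij} T(X_i)$ reduces to a weighted mean. This is our own sketch under those assumptions, not reference code from the paper.

```python
import numpy as np

def fit_mixture_ctd(X, K, lam=1.0, iters=100, seed=0):
    """Algorithm 1 for spherical unit-variance Gaussian components (a sketch)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    theta = X[rng.choice(n, size=K, replace=False)]        # init atoms at data points
    for _ in range(iters):
        # log f(X_i | theta_j) up to a constant shared across j (it cancels below).
        logf = -0.5 * ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=-1)
        z = logf / lam
        z -= z.max(axis=1, keepdims=True)                   # numerical stability
        pi = np.exp(z)
        pi /= n * pi.sum(axis=1, keepdims=True)             # each row sums to 1/n
        w = pi.sum(axis=0)                                  # omega_j = sum_i pi_ij
        theta = (pi.T @ X) / w[:, None]                     # grad A(theta_j) = weighted mean
    return w, theta

X = np.concatenate([np.random.randn(200, 2), np.random.randn(200, 2) + 5.0])
w, theta = fit_mixture_ctd(X, K=2)
print(w, theta)
```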
Given the formulation of Algorithm 1, we have the following result regarding its convergence to a local optimum.

Theorem 1. Algorithm 1 monotonically decreases the objective function (6) of the regularized composite transportation distance for finite mixtures.

2.3 Composite transportation barycenter for mixtures of exponential families

In this section, we consider the problem of finding composite transportation barycenters for a collection of mixtures of exponential families. For $J \geq 1$, let $\{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\}_{j=1}^{J}$ be a collection of $J$ mixtures of exponential families as described in (1), and let $\{a_j\}_{j=1}^{J} \in \Delta_J$ be weights associated with these mixtures. The transportation barycenter of these probability measures is a mixture of exponential families with at most $L$ components, defined as an optimal solution of the following problem:
\[
\operatorname*{arg\,min}_{w_L,\Psi_L} \sum_{j=1}^{J} a_j \widetilde{W}\big(Q_{w_L,\Psi_L}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big), \tag{7}
\]
where $w_L = \{w_l\}_{l=1}^{L} \in \Delta_L$ and $\Psi_L = \{\psi_l\}_{l=1}^{L} \in \Theta^L$ are unknown weights and parameters that we need to optimize. Recall that, to avoid notational clutter, we use $Q_{w_L,\Psi_L}$ to denote a finite mixture with at most $L$ components. Since $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$ and $Q_{w_L,\Psi_L}$ are mixtures of exponential families, Eq. (7) can be rewritten as
\[
\operatorname*{arg\,min}_{w_L,\Psi_L} \sum_{j=1}^{J} a_j \min_{\pi^j \in \Pi(\omega^j_{K_j},\, w_L)} \langle \pi^j, M^j\rangle,
\]
where the cost matrices $M^j = (M^j_{uv})$ satisfy $M^j_{uv} = \mathrm{KL}\big(f(x\,|\,\theta^j_u), f(x\,|\,\psi_v)\big)$ for $1 \leq j \leq J$, which have the closed form given in Eq. (3) since $f$ is from an exponential family of distributions.

We incorporate regularizers in the composite transportation barycenter. In particular, we write the objective function to be minimized as
\[
\operatorname*{arg\,min}_{w_L,\Psi_L} \sum_{j=1}^{J} a_j \min_{\pi^j \in \Pi(\omega^j_{K_j},\, w_L)} \langle \pi^j, M^j\rangle - \lambda H(\pi^j). \tag{8}
\]
We call this objective function the regularized composite transportation barycenter. Due to space constraints, we present the detailed algorithm for determining local solutions of this objective function in the Supplementary Material.

3 Multilevel clustering with composite transportation distance

Assume that we have $J$ groups of independent data $X_{j,i}$, where $1 \leq j \leq J$ and $1 \leq i \leq n_j$; i.e., the data are presented in a two-level grouping structure. Our goal is to simultaneously find the local clustering for each data group and the global clustering across groups.

3.1 Multilevel composite transportation (MCT)

To facilitate the discussion, for each $1 \leq j \leq J$, we denote the empirical measure associated with group $j$ as
\[
P^j_{n_j} := \frac{1}{n_j}\sum_{i=1}^{n_j} \delta_{X_{j,i}}.
\]
Additionally, we assume that the numbers of local and global clusters are bounded. In particular, we allow local group $j$ to have at most $K_j$ clusters, which can be represented as a mixture of exponential families $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$, while we have at most $C$ global clusters among the $J$ given groups. Here, each global cluster can be represented as a finite mixture distribution $Q^m_{w^m_L,\Psi^m_L}$ with at most $L$ components, where $w^m_L = (w^m_1,\ldots,w^m_L)$ and $\Psi^m_L = (\psi^m_1,\ldots,\psi^m_L)$ are global weights and atoms for $1 \leq m \leq C$, respectively.

For the local clustering, we perform composite transportation distance optimization for each group $j$, which can be expressed as in (5). More precisely, this step can be viewed as finding optimal local weights $\omega^j_{K_j}$ and local atoms $\Theta^j_{K_j}$ to minimize the composite transportation distance $\widetilde{W}(P^j_{n_j}, P_{\omega^j_{K_j},\Theta^j_{K_j}})$ for all $1 \leq j \leq J$. Regarding the global clustering of the $J$ given groups, we can treat the finite mixture probability distribution $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ of each group as an observation in the space of distributions over probability distributions. We thus achieve a clustering of these distributions by means of an optimization with the following composite transportation distance on the space of finite mixtures of finite mixtures:
\[
\inf_{\mathcal{Q}} \widetilde{W}(\mathcal{P}, \mathcal{Q}),
\]
where we denote $\mathcal{P} := \frac{1}{J}\sum_{j=1}^{J} \delta_{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}}$ and $\mathcal{Q} := \sum_{m=1}^{C} b_m \delta_{Q^m_{w^m_L,\Psi^m_L}}$.
Since the finite mixture probability distributions $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ in each group are unobserved, we determine them by minimizing the objective cost functions of the local clustering and the global clustering simultaneously. In particular, we consider the following objective function:
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j},\mathcal{Q}}\ \sum_{j=1}^{J} \widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) + \zeta \widetilde{W}(\mathcal{P}, \mathcal{Q}), \tag{9}
\]
where $\zeta > 0$ is a penalization parameter balancing the local and global terms. We call this objective function Multilevel Composite Transportation (MCT).

3.2 Regularized MCT

To obtain a favorable computational profile with MCT, we consider a regularized version of the composite transportation distances in both the local and global structures. To simplify the discussion, we denote by $\pi^j \in \Pi(\mathbf{1}_{n_j}/n_j,\, \omega^j_{K_j})$ the local transportation plan between $P^j_{n_j}$ and $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$ for all $1 \leq j \leq J$. Thus, the following formulation holds:
\[
\widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) = \inf_{\pi^j \in \Pi(\mathbf{1}_{n_j}/n_j,\, \omega^j_{K_j})} \langle \pi^j, M^j\rangle,
\]
where $M^j$ is the cost matrix between $P^j_{n_j}$ and $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$, defined as
\[
[M^j]_{uv} = -\log f(X_{j,u}\,|\,\theta^j_v)
\]
for $1 \leq u \leq n_j$ and $1 \leq v \leq K_j$. Therefore, we can consider the regularized version of the composite transportation distance at each group $j$ as follows:
\[
\widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) - \lambda_l H(\pi^j), \tag{10}
\]
where $\lambda_l > 0$ is a local regularization hyper-parameter. Regarding the global structure, we have
\[
\widetilde{W}(\mathcal{P}, \mathcal{Q}) = \inf_{a \in \Pi(\mathbf{1}_J/J,\, b)} \sum_{j,m} a_{jm} \widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big),
\]
where $a = (a_{jm})$ in the above infimum is a global transportation plan between $\mathcal{P}$ and $\mathcal{Q}$. Here, we can further rewrite $\widetilde{W}(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L})$ as
\[
\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) = \inf_{\tau^{j,m} \in \Pi(\omega^j_{K_j},\, w^m_L)} \langle \tau^{j,m}, \gamma^{j,m}\rangle,
\]
where the cost matrix $\gamma^{j,m}$ is given by the KL divergence between two exponential family atoms as in Eq. (3):
\[
\gamma^{j,m}_{k,l} := A(\theta^j_k) - A(\psi^m_l) - \big\langle \nabla A(\psi^m_l), \theta^j_k - \psi^m_l\big\rangle.
\]
To facilitate the discussion later, we call $\tau^{j,m}$ the partial global transportation plan between $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ and $Q^m_{w^m_L,\Psi^m_L}$. Therefore, we can regularize the composite transportation distance of the global structure as
\[
\widetilde{W}(\mathcal{P}, \mathcal{Q}) - \lambda_a H(a) - \lambda_g \sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m}), \tag{11}
\]
where $\lambda_a$ corresponds to a penalization term for the global structure while $\lambda_g$ represents a penalization term for the partial global transportation plans. Combining the results from Eqs. (10) and (11), we obtain the overall objective function of MCT:
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j},\mathcal{Q}}\ \sum_{j=1}^{J} \widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) + \zeta \widetilde{W}(\mathcal{P}, \mathcal{Q}) - R(\pi, \tau, a), \tag{12}
\]
where $R(\pi, \tau, a) := \lambda_l \sum_{j=1}^{J} H(\pi^j) + \zeta\big[\lambda_a H(a) + \lambda_g \sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m})\big]$ is a combination of all the regularization terms for the local and global clustering. We call this objective function regularized MCT.
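The combined regularizer $R(\pi, \tau, a)$ in Eq. (12) is just a weighted sum of entropies of the three families of transportation plans; the small sketch below (our own, with illustrative names) makes the bookkeeping explicit.

```python
import numpy as np

def entropy(p):
    # H(p) = -sum p log p, with a guard for zero entries of the plan.
    p = np.asarray(p)
    return -np.sum(p * np.log(p + 1e-300))

def regularizer_R(local_plans, partial_plans, a, lam_l, lam_a, lam_g, zeta):
    """R(pi, tau, a) from Eq. (12).

    local_plans: list of J local plans pi^j; partial_plans[j][m]: plan tau^{j,m};
    a: (J, C) global transportation plan."""
    R = lam_l * sum(entropy(pi) for pi in local_plans)
    R += zeta * (lam_a * entropy(a)
                 + lam_g * sum(entropy(t) for row in partial_plans for t in row))
    return R
```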
3.3 Algorithm for regularized MCT

Algorithm 2 Probabilistic Multilevel Clustering
Input:
Data $D = \{X_{j,i}\}_{j=1,i=1}^{J,n_j}$; the numbers of local clusters $K_j$ and global clusters $C$; the number of components $L$ in each global cluster; the penalization term $\zeta$; the regularization hyper-parameters $\lambda_l$, $\lambda_g$.
Output: local and global parameters $\omega^j_{K_j}$, $\Theta^j_{K_j}$, $w^m_L$, $\Psi^m_L$ for $1 \leq j \leq J$ and $1 \leq m \leq C$.
Initialize these local and global parameters.
while not converged do
1. Update local parameters:
for $j = 1$ to $J$ do
  Update $\omega^j_{K_j}$, $\Theta^j_{K_j}$ as optimal solutions of (13).
end for
2. Update global parameters:
for $m = 1$ to $C$ do
  Update $w^m_L$, $\Psi^m_L$ as optimal solutions of (14).
end for
end while

We now describe our detailed strategy for obtaining a locally optimal solution of regularized MCT. In particular, our algorithm consists of two key steps: local clustering updates and global clustering updates. For simplicity of presentation, we assume that at step $t$ of our algorithm we have the current values of the parameters $\omega^j_{K_j}$, $\Theta^j_{K_j}$, $w^m_L$, $\Psi^m_L$ for $1 \leq j \leq J$ and $1 \leq m \leq C$.

Local clustering updates
To obtain updates for the local weights $\omega^j_{K_j}$ and local atoms $\Theta^j_{K_j}$, we solve the following combined regularized composite transportation barycenter problem:
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j}}\ \widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) - \lambda_l H(\pi^j) + \sum_{m=1}^{C} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \zeta\lambda_g \sum_{m=1}^{C} H(\tau^{j,m}), \tag{13}
\]
where $\pi^j$ is the local transportation plan between $P^j_{n_j}$ and $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$ at step $t$, while $a$ and $\tau^{j,m}$ are respectively the global transportation plan and the partial global transportation plans at this step. The idea for obtaining a local solution of the above objective function is identical to that for (8); therefore, we defer the detailed presentation of this algorithm to the Supplementary Material.

Global clustering updates
In order to update the global weights $w^m_L$ and the global atom parameters $\Psi^m_L$, we consider the following optimization problem:
\[
\inf_{\mathcal{Q}}\ \widetilde{W}(\mathcal{P}, \mathcal{Q}) - \zeta\Big[\lambda_a H(a) + \lambda_g \sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m})\Big]. \tag{14}
\]
The algorithm for obtaining the local solutions of this objective is based on the barycenter computation algorithms in [4] for updating the barycenter weights $w^m_L$ and the partial global transportation plans $\tau^{j,m}$. The natural parameters of the global atoms of the barycenters are weighted averages of the local atoms from all groups $j$:
\[
\psi_v = \frac{\sum_{j=1}^{J}\sum_{u=1}^{K_j} \pi^j_{uv}\theta^j_u}{\sum_{j=1}^{J}\sum_{u=1}^{K_j} \pi^j_{uv}}.
\]
The detailed derivation of this algorithm is deferred to the Supplementary Material. In summary, the main steps of the local and global clustering updates are given in Algorithm 2. We have the following result guaranteeing the local convergence of this algorithm.

Table 1: Statistics of the synthetic and real-world datasets.
(a) Synthetic datasets
Dataset           Groups (J)   Dimension   Points per group (n_j)   Clusters (C)
Continuous data   100          2           500                      6
Discrete data     500          25          100                      5
(b) Real-world datasets
Dataset    Groups (J)   Dimension   Clusters (C)
LabelMe    1,800        30          8
NUS-WIDE   1,040        238         13
Theorem 2.
Algorithm 2 monotonically decreases the objective function of regularized MCT (12) until local convergence.
4 Experiments

We first evaluate the model via simulation studies, and then demonstrate its applications to text and image modeling using two real-world datasets.
4.1 Synthetic data

We evaluate the effectiveness of our proposed clustering algorithm by considering two types of synthetic data (discrete and continuous) generated from multilevel processes as follows.
Continuous data
We start with six clusters of data, each of which is a mixture of three Gaussian components. Figure 1 depicts the ground truth of the six mixtures from which we generate the data. We uniformly generated 100 groups of data, each group belonging to one of the six aforementioned clusters. Once the cluster index of a data group was drawn, we generated 500 data points from the corresponding mixture of Gaussians.
Discrete data
Data were generated from five clusters of 25-dimensional bar topics, each of which is a mixture of four bar topics out of ten topics in total, as shown in Figure 2 (second row). Each cluster shares two topics with every other cluster. We then generated 500 groups of data, each group belonging to one of the five aforementioned clusters. Once the cluster index of a data group was drawn, we generated 100 data points from the mixture of bar topics of that cluster.

Figure 1: Synthetic multilevel Gaussian data (orange at top) and inferred clusters (blue at bottom).
Clustering results
We ran the proposed method on the synthetic continuous data with the local and global penalization hyper-parameters $\lambda_l$ and $\lambda_g$ both set equal to 1; the clusters inferred by our model are shown in the bottom row of Figure 1, and we also measured the NMI (Normalized Mutual Information) between the ground-truth labels and the learned groups. For the discrete data, each observation $X_{j,i}$ is now a one-hot vector. We simulated from ten topics, comprising five horizontal and five vertical bars. The top row of Figure 2 depicts the collection of four bar topics from which the data of a cluster may be generated; e.g., the first cluster contains data simulated from a mixture of the four horizontal bar topics. The middle row depicts the histogram of all data generated from each cluster, while the bottom row shows the clusters discovered by our proposed model. There is only a slight difference between the ground truth and the inferred mixtures of bar topics. These results demonstrate the effectiveness and flexibility of our algorithm in learning with both continuous and discrete data.

Figure 2: Synthetic multilevel bar topic data (top two rows: topics in each cluster and generated data) and rediscovered output (bottom row: rediscovered data).

4.2 Real-world data

We now demonstrate our proposed model on two real-world datasets: the LabelMe dataset [16, 13] with continuous observations and the NUS-WIDE dataset [2] with discrete observations. Statistics for these datasets are presented in Table 1b.

LabelMe dataset
This dataset consists of 2,688 annotated images classified into eight scene categories: tall buildings, inside city, street, highway, coast, open country, mountain, and forest [16]. Each image contains multiple annotated regions; each region, which is annotated by users, represents an object in the image. (Sample images and annotated regions can be found at http://people.csail.mit.edu/torralba/code/spatialenvelope/.) We removed images containing fewer than four annotated regions and obtained 1,800 images in total. We then extracted GIST features [10], a visual descriptor that represents the perceptual dimensions and oriented spatial structures of a scene, for each region in an image. We used PCA to reduce the number of dimensions to 30.

Table 2: Clustering performance on the LabelMe (continuous) dataset.
Method     NMI     ARI     AMI
K-means    0.37    0.282   0.365
SVB-MC2    0.315   0.206   0.273
W-means    0.423   0.35    0.416
MCT        0.485   0.412   0.477
NUS-WIDE dataset
We used a subset of the original NUS-WIDE dataset [2] containing images of 13 kinds of animals (squirrel, cow, cat, zebra, tiger, lion, elephant, whales, rabbit, snake, antlers, hawk, and wolf), comprising 2,054 images in the training subset. Each image is annotated with several tags out of a vocabulary of 1,000 tags. We filtered out images with fewer than three tags and obtained 1,040 images with 238 remaining tags. After preprocessing, we have a dataset with 1,040 groups; each data point in a group is a one-hot vector of 238 dimensions representing a tag word annotated for that group (image).

Table 3: Clustering performance on the NUS-WIDE (discrete) dataset.
Method     NMI     ARI     AMI
MCT        0.423   0.255   0.39

Figure 3: Tag clouds of six clusters discovered by the proposed model on the NUS-WIDE dataset.

Baseline methods
We quantitatively compare our proposed method to the baseline approaches discussed in [5], including K-means, W-means, and SVB-MC2 without context [7]. We use three popular metrics to evaluate clustering performance: NMI (Normalized Mutual Information) [17, Sec. 16.3], ARI (Adjusted Rand Index) [6], and AMI (Adjusted Mutual Information) [20].
Experimental results
We conducted experiments on the LabelMe dataset with the number of local atoms set to $K = 5$, the number of global atoms set to $L = 15$, and the number of clusters set to $C = 8$. We chose the hyper-parameters for the penalization terms to be $\lambda_l = 3$ and $\lambda_g = 3$. As shown in Table 2, our proposed method is superior to the baseline methods in terms of clustering performance.

We also compared clustering performance on the discrete real-world dataset NUS-WIDE. We chose $K = 2$, $L = 4$, and $C = 13$ with the hyper-parameters $\lambda_l = 1$ and $\lambda_g = 1.6$. The results are presented in Table 3. Since the baseline methods are applicable only to continuous data, we normalized the discrete data for each image and then applied the baseline methods to cluster the dataset. The results show that the clustering performance of K-means and W-means is inferior to that of our proposed model, which directly models discrete data. Moreover, the Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI) of these models show that their clustering outcomes are not robust.

To illustrate the qualitative behavior of the proposed model, we selected six of the thirteen discovered clusters and computed the proportion of tags present in each cluster. Figure 3 depicts the tag clouds of these clusters. Each tag cloud consistently manifests the cluster content. For example, the top-left tag cloud denotes the cluster of rabbits, which corresponds to one of the ground-truth subsets of images.
5 Conclusion

We have proposed a probabilistic model that uses a novel composite transportation distance to cluster data with potentially complex hierarchical multilevel structures. The proposed model is able to handle both discrete and continuous observations. Experiments on simulated and real-world data show that our approach outperforms competing methods that also target multilevel clustering tasks. Our model is based on an exponential family assumption on the data distribution and thereby applies naturally to other data types, e.g., a mixture of Poisson distributions [8]. Finally, there are several possible directions for extending our work. First, it is of interest to extend our approach to richer settings of hierarchical data similar to those considered in MC2 [11], e.g., when group-level context is available in the data. Second, our method requires knowledge of upper bounds on the numbers of clusters in both the local and global clustering. It is of practical importance to develop methods that can estimate these cardinalities efficiently.

References

[1] F. Bassetti, A. Bodini, and E. Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006.
[2] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: a real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, page 48. ACM, 2009.
[3] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[4] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[5] N. Ho, X. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Q. Phung. Multilevel clustering via Wasserstein means. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, pages 1501–1509, 2017.
[6] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[7] V. Huynh, D. Q. Phung, S. Venkatesh, X. Nguyen, M. D. Hoffman, and H. H. Bui. Scalable nonparametric Bayesian multilevel clustering. In UAI, 2016.
[8] R. R. Jayasekare, R. Gill, and K. Lee. Modeling discrete stock price changes using a mixture of Poisson distributions. Journal of the Korean Statistical Society, 45(3):409–421, 2016.
[9] M. I. Jordan. An Introduction to Probabilistic Graphical Models. 2003.
[10] D. G. Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999.
[11] V. Nguyen, D. Phung, X. Nguyen, S. Venkatesh, and H. Bui. Bayesian nonparametric multilevel clustering with group-level contexts. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[12] X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 41(1):370–400, 2013.
[13] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.
[14] D. Pollard. Quantization and the method of K-means. IEEE Transactions on Information Theory, 28:199–205, 1982.
[15] A. Rodriguez, D. B. Dunson, and A. E. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 103:1131–1144, 2008.
[16] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3):157–173, 2008.
[17] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Cambridge University Press, 2008.
[18] Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006.
[19] C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.
[20] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct):2837–2854, 2010.
[21] D. F. Wulsin, S. T. Jensen, and B. Litt. Nonparametric multi-level clustering of human epilepsy seizures. Annals of Applied Statistics, 10:667–689, 2016.
[22] J. Zhang, Y. Song, G. Chen, and C. Zhang. On-line evolutionary exponential family mixture. In IJCAI, pages 1610–1615, 2009.

Appendix
A Finite mixtures with regularized composite transportation distance
In this section, we provide detailed analyses for obtaining the updates of the weights and atoms in Algorithm 1, which finds a local solution of the objective function in Eq. (6) for finite mixtures with regularized composite transportation distance. To ease the presentation, we recall this objective function:
\[
\min_{\omega_K,\Theta_K}\ \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle - \lambda H(\pi),
\]
where $\lambda > 0$ and $H(\pi) = -\sum_{i,j}\pi_{ij}\log\pi_{ij}$ is the entropy of $\pi \in \Pi(\mathbf{1}_n/n, \omega_K)$. Here, $P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ is the empirical measure of the samples $X_1,\ldots,X_n$. Furthermore, $M = (M_{ij})$ is the cost matrix with $M_{ij} = -\log f(X_i\,|\,\theta_j)$ for $1 \leq i \leq n$, $1 \leq j \leq K$, while $\Pi(\cdot,\cdot)$ is the set of transportation plans between $\mathbf{1}_n/n$ and $\omega_K$.
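For reference while following the derivations below, the inner objective of Eq. (6) for a fixed plan is straightforward to evaluate; a minimal sketch (ours, with illustrative names):

```python
import numpy as np

def inner_objective(pi, M, lam):
    # <pi, M> - lam * H(pi), the quantity minimized over plans pi in Eq. (6).
    H = -np.sum(pi * np.log(pi + 1e-300))   # entropy, guarded against zero entries
    return np.sum(pi * M) - lam * H
```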
A.1 Update weights

Our strategy for updating the weights $\omega_K$ in the above objective function relies on solving the following relaxation of that optimization problem:
\[
\inf_{\pi \in \mathcal{S}_n} \langle \pi, M\rangle - \lambda H(\pi), \tag{15}
\]
where $\mathcal{S}_n = \big\{\pi : \sum_{j=1}^{K} \pi_{ij} = 1/n \text{ for all } 1 \leq i \leq n\big\}$. Invoking Lagrange multipliers for the constraint $\pi\mathbf{1}_K = \mathbf{1}_n/n$, the above objective function is equivalent to minimizing the function
\[
F = \sum_{i=1}^{n}\sum_{j=1}^{K} \pi_{ij}M_{ij} + \lambda \sum_{i=1}^{n}\sum_{j=1}^{K} \pi_{ij}(\log \pi_{ij} - 1) + \sum_{i=1}^{n} \kappa_i\Big(\sum_{j=1}^{K}\pi_{ij} - \frac{1}{n}\Big).
\]
By taking the derivative of $F$ with respect to $\pi_{ij}$ and setting it to zero, the following equation holds:
\[
\frac{\partial F}{\partial \pi_{ij}} = M_{ij} + \lambda\log\pi_{ij} + \kappa_i = 0.
\]
The above equation leads to
\[
\pi_{ij} = \exp\Big(\frac{-M_{ij} - \kappa_i}{\lambda}\Big) = \big(f(X_i\,|\,\theta_j)\big)^{1/\lambda}\exp\Big(-\frac{\kappa_i}{\lambda}\Big).
\]
Invoking the condition $\sum_{k}\pi_{ik} = 1/n$, we have
\[
\exp\Big(-\frac{\kappa_i}{\lambda}\Big)\sum_{j=1}^{K}\big(f(X_i\,|\,\theta_j)\big)^{1/\lambda} = \frac{1}{n}, \quad \text{i.e.,} \quad \exp\Big(-\frac{\kappa_i}{\lambda}\Big) = \frac{1}{n\sum_{j=1}^{K}\big(f(X_i\,|\,\theta_j)\big)^{1/\lambda}}.
\]
Governed by the previous equations, we find that
\[
\pi_{ij} = \frac{1}{n}\,\frac{\big(f(X_i\,|\,\theta_j)\big)^{1/\lambda}}{\sum_{k=1}^{K}\big(f(X_i\,|\,\theta_k)\big)^{1/\lambda}}. \tag{16}
\]
Therefore, we can update the weight $\omega_j$ as $\omega_j = \sum_{i=1}^{n}\pi_{ij}$ for any $1 \leq j \leq K$.

A.2 Update atoms
Given the updates for the weights $\omega_K$ and the formulation of the cost matrix $M$, to obtain the update for the atoms $\theta_j$, $1 \leq j \leq K$, we optimize the following objective function:
\[
\min_{\Theta_K}\ -\sum_{i=1}^{n}\sum_{j=1}^{K}\pi_{ij}\log f(X_i\,|\,\theta_j). \tag{17}
\]
Since $f(x\,|\,\theta)$ is an exponential family distribution with natural parameter $\theta$, we can represent it as
\[
f(x\,|\,\theta) = h(x)\exp\big(\langle T(x),\theta\rangle - A(\theta)\big),
\]
where $A(\theta)$ is the log-partition function, which is convex. Plugging this formulation of $f(x\,|\,\theta)$ into the objective function (17) and taking the derivative with respect to $\theta_j$, we obtain the equation
\[
\sum_{i=1}^{n}\pi_{ij}T(X_i) - \omega_j\nabla A(\theta_j) = 0. \tag{18}
\]
Therefore, we can update the atom $\theta_j$ as the solution of the above equation for $1 \leq j \leq K$.
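The atom equation (18) is solved by inverting $\nabla A$. For example, for Bernoulli components in natural parameters, $A(\theta) = \log(1 + e^{\theta})$ coordinate-wise, so $\nabla A$ is the sigmoid and its inverse is the logit. The sketch below (our own, with illustrative names) applies this to a whole plan at once.

```python
import numpy as np

def update_atoms_bernoulli(pi, X):
    """Solve Eq. (18) for Bernoulli components: sigmoid(theta_j) = (1/omega_j) sum_i pi_ij X_i.

    pi: (n, K) transportation plan; X: (n, d) binary data, with T(x) = x."""
    omega = pi.sum(axis=0)                      # omega_j = sum_i pi_ij
    mean = (pi.T @ X) / omega[:, None]          # weighted sufficient statistics
    mean = np.clip(mean, 1e-6, 1 - 1e-6)        # keep the logit finite
    return np.log(mean / (1 - mean))            # theta_j = logit(mean), inverse of sigmoid
```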
A.3 Proof of local convergence of Algorithm 1

Given the formulation of Algorithm 1, we demonstrate its convergence to a local solution of the objective function (6), as claimed in Theorem 1. Our proof of the theorem follows directly from the updates of the weights and atoms via Lagrange multipliers. In particular, we denote by $\omega^{(t)}_K$, $\Theta^{(t)}_K$, and $\pi^{(t)}$ the updates of the weights, atoms, and transportation plan at step $t$ of Algorithm 1 for $t \geq 0$. Additionally, let $M^{(t)}$ be the cost matrix at step $t$, i.e., $M^{(t)}_{ij} = -\log f(X_i\,|\,\theta^{(t)}_j)$ for all $i, j$. Furthermore, we denote
\[
g(\omega_K, \Theta_K) := \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega_K)} \langle \pi, M\rangle - \lambda H(\pi).
\]
For each $t \geq 0$, it is clear that
\[
g(\omega^{(t)}_K, \Theta^{(t)}_K) = \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega^{(t)}_K)} \big\langle \pi, M^{(t)}\big\rangle - \lambda H(\pi) \;\geq\; \inf_{\pi \in \mathcal{S}_n} \big\langle \pi, M^{(t)}\big\rangle - \lambda H(\pi) \;\geq\; \big\langle \pi^{(t+1)}, M^{(t)}\big\rangle - \lambda H\big(\pi^{(t+1)}\big),
\]
where $\mathcal{S}_n = \{\pi : \sum_{j=1}^{K} \pi_{ij} = 1/n\ \forall\, 1 \leq i \leq n\}$. Here, the first inequality is due to the fact that $\Pi(\mathbf{1}_n/n, \omega^{(t)}_K) \subset \mathcal{S}_n$, while the second inequality is due to (16) in Subsection A.1. According to the update of the atoms in (18) in Subsection A.2, we have
\[
\big\langle \pi^{(t+1)}, M^{(t)}\big\rangle - \lambda H\big(\pi^{(t+1)}\big) \;\geq\; \big\langle \pi^{(t+1)}, M^{(t+1)}\big\rangle - \lambda H\big(\pi^{(t+1)}\big) \;\geq\; \inf_{\pi \in \Pi(\mathbf{1}_n/n,\, \omega^{(t+1)}_K)} \big\langle \pi, M^{(t+1)}\big\rangle - \lambda H(\pi) = g(\omega^{(t+1)}_K, \Theta^{(t+1)}_K).
\]
Governed by the above results, for any $t \geq 0$, the following holds:
\[
g(\omega^{(t)}_K, \Theta^{(t)}_K) \geq g(\omega^{(t+1)}_K, \Theta^{(t+1)}_K).
\]
As a consequence, we obtain the conclusion of Theorem 1.
B Regularized composite transportation barycenter
In this section, we provide a detailed algorithm for obtaining a local solution of the regularized composite transportation barycenter objective function in Eq. (8). To facilitate the discussion, we recall the formulation of that objective function. In particular, the objective function with regularized composite transportation distance has the form
\[
\min_{w_L,\Psi_L} \sum_{j=1}^{J} a_j \min_{\pi^j \in \Pi(\omega^j_{K_j},\, w_L)} \langle \pi^j, M^j\rangle - \lambda H(\pi^j),
\]
where $M^j$ is the corresponding KL cost matrix between the finite mixture probability distribution $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$ and $Q_{w_L,\Psi_L}$ for $1 \leq j \leq J$. Here, $\{a_j\}_{j=1}^{J} \in \Delta_J$ are given weights associated with the finite mixture probability distributions $\{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\}_{j=1}^{J}$. As $f(x\,|\,\theta)$ is an exponential family, the cost matrix $M^j = (M^j_{uv})$ has the formulation
\[
M^j_{uv} = \mathrm{KL}\big(f(x\,|\,\psi_v), f(x\,|\,\theta^j_u)\big) = A(\theta^j_u) - A(\psi_v) - \big\langle \nabla A(\psi_v), \theta^j_u - \psi_v\big\rangle
\]
for all $1 \leq u \leq K_j$ and $1 \leq v \leq L$.

B.1 Update weights and atoms

Our procedure for updating the weights $w_L$ in the objective function of the regularized composite transportation distance is similar to Algorithm 1 in [4]. Therefore, we focus only on the updates of the atoms $\Psi_L = (\psi_1,\ldots,\psi_L)$. Given the updates of the weights $w_L$, we compute the optimal transportation plan $\pi^j = (\pi^j_{uv})$ between $\omega^j_{K_j}$ and $w_L$ using Algorithm 3 in [4]. Then, to obtain the updates for $\Psi_L$, we consider the following optimization problem:
\[
\min_{\Psi_L} \sum_{j=1}^{J} a_j \sum_{u=1}^{K_j}\sum_{v=1}^{L} \pi^j_{uv}M^j_{uv}.
\]
By taking the derivative of the above objective function with respect to $\psi_v$ and setting it to 0, we obtain the equation
\[
\sum_{j=1}^{J}\sum_{u=1}^{K_j} \pi^j_{uv}\,\nabla^2 A(\psi_v)\big(\theta^j_u - \psi_v\big) = 0.
\]
One possible solution to the above equation is $\sum_{j=1}^{J}\sum_{u=1}^{K_j}\pi^j_{uv}(\theta^j_u - \psi_v) = 0$. This suggests that
\[
\psi_v = \frac{\sum_{j=1}^{J}\sum_{u=1}^{K_j}\pi^j_{uv}\theta^j_u}{\sum_{j=1}^{J}\sum_{u=1}^{K_j}\pi^j_{uv}} \tag{19}
\]
for all $1 \leq v \leq L$. Equipped with these updates for the weights $w_L$ and atoms $\Psi_L$, we summarize an algorithm for determining a local solution of the regularized composite transportation barycenter in Eq. (8) in Algorithm 3.
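Eq. (19) is a plan-weighted average of the local atoms' natural parameters; a compact sketch (ours, with illustrative names):

```python
import numpy as np

def barycenter_atoms(plans, atoms):
    """Eq. (19): psi_v = sum_{j,u} pi^j_{uv} theta^j_u / sum_{j,u} pi^j_{uv}.

    plans[j]: (K_j, L) plan between omega^j and w_L; atoms[j]: (K_j, d) natural parameters."""
    num = sum(p.T @ th for p, th in zip(plans, atoms))   # (L, d) weighted sums
    den = sum(p.sum(axis=0) for p in plans)              # (L,) total plan mass per v
    return num / den[:, None]
```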
Algorithm 3 Regularized composite transportation barycenter
Input:
Finite mixture probability distributions $\{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\}_{j=1}^{J}$; given weights $\{a_j\}_{j=1}^{J}$; and the regularization hyper-parameter $\lambda > 0$.
Output:
Optimal weights $w_L$ and atoms $\Psi_L$.
Initialize weights $\{w_l\}_{l=1}^{L}$ and atoms $\{\psi_l\}_{l=1}^{L}$.
while not converged do
1. Update the weights $w_L$ as in Algorithm 1 of [4].
2. Compute the transportation plans $\pi^j$ for $1 \leq j \leq J$ using Algorithm 3 of [4].
3. Update the atoms $\Psi_L$ as in Eq. (19).
end while
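Steps 1–2 above defer to the barycenter and plan-computation routines of [4]. For the entropically regularized plans used throughout this paper, a generic Sinkhorn-style fixed-point iteration is the standard workhorse; the following is our own illustrative sketch, not the exact algorithm of [4].

```python
import numpy as np

def sinkhorn_plan(mu, nu, M, lam, iters=200):
    """Approximate argmin_{pi in Pi(mu, nu)} <pi, M> - lam * H(pi) by Sinkhorn scaling."""
    K = np.exp(-M / lam)                  # Gibbs kernel of the cost matrix
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)                # match the column marginals
        u = mu / (K @ v)                  # match the row marginals
    return u[:, None] * K * v[None, :]
```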
B.2 Local convergence of Algorithm 3

Given the formulation of Algorithm 3, the following theorem demonstrates that this algorithm determines a local solution of the objective function (8).
Theorem 3.
Algorithm 3 monotonically decreases the objective function (8) of the regularized composite transportation barycenter until local convergence.
The proof of Theorem 3 is a direct consequence of the updates of the weights and atoms in the above subsection and can be argued in a similar fashion to the proof of Theorem 1; therefore, it is omitted.
C Multilevel clustering with composite transportation distance
In this section, we provide a detailed argument for the algorithm development to determine local solutions of regularized multilevel composite transportation (MCT). To ease the later presentation, we recall the objective function of this problem as well as its important notation. We start with the objective function in Eq. (12):
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j},\mathcal{Q}}\ \sum_{j=1}^{J}\widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) + \zeta\widetilde{W}(\mathcal{P},\mathcal{Q}) - R(\pi,\tau,a),
\]
where $R(\pi,\tau,a) = \lambda_l\sum_{j=1}^{J} H(\pi^j) + \zeta\big[\lambda_a H(a) + \lambda_g\sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m})\big]$ is a combination of all the regularization terms for the local and global clustering. Here, for simplicity of the argument, we choose $\zeta = 1$ when deriving our learning updates. In the above objective function, $\mathcal{P} = \frac{1}{J}\sum_{j=1}^{J} \delta_{P^j_{\omega^j_{K_j},\Theta^j_{K_j}}}$ and $\mathcal{Q} = \sum_{m=1}^{C} b_m\delta_{Q^m_{w^m_L,\Psi^m_L}}$. We summarize below the notation for our algorithm development.

Variables of local clustering structures
• Local transportation plans for group $j$: $\pi^j = \{\pi^j_{uv}\}_{u,v} \in \Pi(\mathbf{1}_{n_j}/n_j, \omega^j_{K_j})$, i.e., $\sum_v \pi^j_{uv} = 1/n_j$ and $\sum_u \pi^j_{uv} = \omega^j_v$.
• Local atoms for group $j$, $\Theta^j_{K_j} = \{\theta^j_k\}_{k=1}^{K_j}$, and their local mixing weights $\omega^j = \{\omega^j_k\}_{k=1}^{K_j}$.

Variables assigning groups to barycenters
• Global transportation plan $a = (a_{jm}) \in \Pi(\mathbf{1}_J/J, b)$ between $\mathcal{P}$ and $\mathcal{Q}$.

Variables of global clustering structures
• Partial global transportation plans between the local measure $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ and the global measure $Q^m_{w^m_L,\Psi^m_L}$: $\tau^{j,m} = \{\tau^{j,m}_{kl}\}_{k,l}$, where $\sum_l \tau^{j,m}_{kl} = \omega^j_k$ and $\sum_k \tau^{j,m}_{kl} = w^m_l$ for all $1 \leq j \leq J$ and $1 \leq m \leq C$.
• Global atoms for the global measures, $\Psi^m_L = \{\psi^m_l\}_{l=1}^{L}$, and global mixing weights $w^m_L = \{w^m_l\}_{l=1}^{L}$, where $w^m_l = \sum_k \tau^{j,m}_{kl}$ for any $j$.

C.1 Local clustering updates
As mentioned in the main text, to obtain updates for the local weights $\omega^j_{K_j}$ and local atoms $\Theta^j_{K_j}$, we solve the following regularized composite transportation barycenter problem:
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j}}\ \widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) - \lambda_l H(\pi^j) + \sum_{m=1}^{C} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \lambda_g\sum_{m=1}^{C} H(\tau^{j,m}).
\]
Equivalently, we solve
\[
\inf_{\omega^j_{K_j},\Theta^j_{K_j}}\ \inf_{\pi^j \in \Pi(\mathbf{1}_{n_j}/n_j,\, \omega^j_{K_j})}\langle \pi^j, M^j\rangle - \lambda_l H(\pi^j) + \sum_{m=1}^{C} a_{jm}\inf_{\tau^{j,m} \in \Pi(\omega^j_{K_j},\, w^m_L)}\langle \tau^{j,m}, \gamma^{j,m}\rangle - \lambda_g\sum_{m=1}^{C} H(\tau^{j,m}), \tag{20}
\]
where $M^j$ is the cost matrix between $P^j_{n_j}$ and $P^j_{\omega^j_{K_j},\Theta^j_{K_j}}$, given by
\[
[M^j]_{uv} = -\log f(X_{j,u}\,|\,\theta^j_v)
\]
for $1 \leq u \leq n_j$ and $1 \leq v \leq K_j$. Additionally, the cost matrix $\gamma^{j,m}$ has the formulation
\[
\gamma^{j,m}_{kl} = A(\theta^j_k) - A(\psi^m_l) - \big\langle \nabla A(\psi^m_l), \theta^j_k - \psi^m_l\big\rangle.
\]

Update local weights:
The idea for obtaining the local weight updates for the above objective function is similar to that in Section A.
Update local atoms:
Given the updates for the local weights $\omega^j_{K_j}$, to obtain the update equation for the local atoms $\Theta^j_{K_j}$, we optimize the following objective function:
\[
\min_{\Theta^j_{K_j}}\ -\sum_{u=1}^{n_j}\sum_{v=1}^{K_j}\pi^j_{uv}\log f(X_{j,u}\,|\,\theta^j_v) + \sum_{m=1}^{C} a_{jm}\sum_{v,l}\tau^{j,m}_{vl}\gamma^{j,m}_{vl}. \tag{21}
\]
Since $f(x\,|\,\theta)$ is an exponential family distribution with natural parameter $\theta$, we can represent it as $f(x\,|\,\theta) = h(x)\exp(\langle T(x),\theta\rangle - A(\theta))$, where $A(\theta)$ is the convex log-partition function. Given this formulation of $f(x\,|\,\theta)$, the objective function (21) is equivalent to minimizing
\[
F_{\mathrm{local}} = -\sum_{u=1}^{n_j}\sum_{v=1}^{K_j}\pi^j_{uv}\big(\langle T(X_{j,u}),\theta^j_v\rangle - A(\theta^j_v)\big) + \sum_{m=1}^{C} a_{jm}\sum_{v,l}\tau^{j,m}_{vl}\Big[A(\theta^j_v) - A(\psi^m_l) - \big\langle \nabla A(\psi^m_l), \theta^j_v - \psi^m_l\big\rangle\Big].
\]
By direct computation, $F_{\mathrm{local}}$ has the following partial derivative with respect to $\theta^j_v$:
\[
\frac{\partial F_{\mathrm{local}}}{\partial \theta^j_v} = -\sum_{u=1}^{n_j}\pi^j_{uv}\big(T(X_{j,u}) - \nabla A(\theta^j_v)\big) + \sum_{m=1}^{C} a_{jm}\sum_{l=1}^{L}\tau^{j,m}_{vl}\big[\nabla A(\theta^j_v) - \nabla A(\psi^m_l)\big] = -\sum_{u=1}^{n_j}\pi^j_{uv}T(X_{j,u}) + \omega^j_v\nabla A(\theta^j_v) + \sum_{m=1}^{C} a_{jm}\sum_{l=1}^{L}\tau^{j,m}_{vl}\big(\nabla A(\theta^j_v) - \nabla A(\psi^m_l)\big),
\]
where the second equality uses $\sum_{u=1}^{n_j}\pi^j_{uv} = \omega^j_v$. Setting this partial derivative to zero, we update the atom $\theta^j_v$ as the solution of the equation
\[
\nabla A(\theta^j_v) = \frac{\sum_{m=1}^{C} a_{jm}\sum_{l=1}^{L}\tau^{j,m}_{vl}\nabla A(\psi^m_l) + \sum_{u=1}^{n_j}\pi^j_{uv}T(X_{j,u})}{\sum_{m=1}^{C} a_{jm}\sum_{l=1}^{L}\tau^{j,m}_{vl} + \omega^j_v}. \tag{22}
\]
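For unit-variance Gaussian components, $\nabla A$ is the identity map, so Eq. (22) gives the new atom directly as a blend of the data term and the global-atom term. A sketch under that assumption (our own, with illustrative names):

```python
import numpy as np

def update_local_atoms_gaussian(pi, X, a_j, tau_j, psi):
    """Eq. (22) for unit-variance Gaussians (grad A = identity).

    pi: (n_j, K_j) local plan; X: (n_j, d) data of group j; a_j: (C,) row of the
    global plan; tau_j: (C, K_j, L) partial plans; psi: (C, L, d) global atoms."""
    omega = pi.sum(axis=0)                               # (K_j,) local weights
    num = pi.T @ X                                       # data term: sum_u pi_uv T(X_u)
    den = omega.copy()
    for m in range(len(a_j)):
        num += a_j[m] * (tau_j[m] @ psi[m])              # (K_j, L) @ (L, d) global-atom term
        den += a_j[m] * tau_j[m].sum(axis=1)
    return num / den[:, None]
```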
C.2 Computing the global transportation plan

Given the updates for the local weights $\omega^j_{K_j}$ and local atoms $\Theta^j_{K_j}$ for $1 \leq j \leq J$, we now develop an update for the global transportation plan $a = (a_{jm})$ between $\mathcal{P}$ and $\mathcal{Q}$. Our strategy relies on solving the following problem:
\[
\inf_{a}\ \sum_{j,m} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \lambda_a H(a),
\]
where $a$ in the above infimum satisfies the row-marginal constraint $a\mathbf{1}_C = \mathbf{1}_J/J$. By means of Lagrange multipliers $\kappa_j$ for these constraints, the above objective function can be rewritten as
\[
F_{\mathrm{global}} = \sum_{j,m} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \lambda_a H(a) + \sum_{j=1}^{J}\kappa_j\Big(\sum_{m=1}^{C} a_{jm} - \frac{1}{J}\Big).
\]
The function $F_{\mathrm{global}}$ has the partial derivative with respect to $a_{jm}$ (absorbing additive constants into $\kappa_j$)
\[
\frac{\partial F_{\mathrm{global}}}{\partial a_{jm}} = \widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) + \lambda_a\log a_{jm} + \kappa_j = \sum_{k,l}\tau^{j,m}_{kl}\gamma^{j,m}_{kl} + \lambda_a\log a_{jm} + \kappa_j.
\]
Setting this derivative to 0 and invoking the constraint $\sum_{m=1}^{C} a_{jm} = 1/J$, we find that
\[
a_{jm} = \frac{1}{J}\,\frac{\exp\big(-\sum_{k,l}\tau^{j,m}_{kl}\gamma^{j,m}_{kl}/\lambda_a\big)}{\sum_{m'=1}^{C}\exp\big(-\sum_{k,l}\tau^{j,m'}_{kl}\gamma^{j,m'}_{kl}/\lambda_a\big)} \tag{23}
\]
for $1 \leq j \leq J$ and $1 \leq m \leq C$.
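Eq. (23) is a row-wise softmax of the (negative, scaled) transport costs; a numerically stable sketch (our own, with illustrative names):

```python
import numpy as np

def update_global_plan(costs, lam_a):
    """Eq. (23): costs[j, m] = sum_{k,l} tau^{j,m}_{kl} gamma^{j,m}_{kl}, shape (J, C).

    Returns a = (a_{jm}) whose rows each sum to 1/J."""
    J = costs.shape[0]
    z = -costs / lam_a
    z -= z.max(axis=1, keepdims=True)    # stabilize the exponentials
    e = np.exp(z)
    return e / (J * e.sum(axis=1, keepdims=True))
```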
C.3 Global clustering updates

Given the updates of the local weights and atoms as well as the global transportation plan, we are now ready to develop updates for the global weights $w^m_L$ and global atoms $\Psi^m_L$ for $1 \leq m \leq C$. In particular, the objective function for updating these global parameters is
\[
\min_{\{w^m_L\},\{\Psi^m_L\}}\ \sum_{j=1}^{J}\sum_{m=1}^{C} a_{jm}\widetilde{W}\big(P^j_{\omega^j_{K_j},\Theta^j_{K_j}}, Q^m_{w^m_L,\Psi^m_L}\big) - \lambda_g\sum_{j=1}^{J}\sum_{m=1}^{C} H(\tau^{j,m}).
\]
The above objective function can be rewritten as
\[
\min_{\{w^m_L\},\{\Psi^m_L\}}\ \sum_{j=1}^{J}\sum_{m=1}^{C} a_{jm}\Big(\inf_{\tau^{j,m}\in\Pi(\omega^j_{K_j},\, w^m_L)}\langle \tau^{j,m}, \gamma^{j,m}\rangle + \lambda_g\sum_{k,l}\tau^{j,m}_{kl}\big(\log\tau^{j,m}_{kl} - 1\big)\Big).
\]
Given the above objective function, for each $m$, to update the global weights $w^m_L$ and global atoms $\Psi^m_L$, we consider the following composite transportation barycenter problem:
\[
\min_{w^m_L,\Psi^m_L}\ \sum_{j=1}^{J} a_{jm}\Big(\inf_{\tau^{j,m}\in\Pi(\omega^j_{K_j},\, w^m_L)}\langle \tau^{j,m}, \gamma^{j,m}\rangle + \lambda_g\sum_{k,l}\tau^{j,m}_{kl}\big(\log\tau^{j,m}_{kl} - 1\big)\Big).
\]

Update global weights:
Given the above objective function, the idea for updating the global weights $w^m_L$ is similar to Algorithm 1 in [4].

Update partial transportation plans:
Once the global weights are obtained, we can use Algorithm 3 in [4] to update the optimal partial transportation plans $\tau^{j,m}$ between the local measures $P_{\omega^j_{K_j},\Theta^j_{K_j}}$ and the global measures $Q^m_{w^m_L,\Psi^m_L}$.

Update global atoms:
With the updates for the global weights $w^m_L$, to obtain the update equation for the global atoms $\Psi^m_L$, we minimize the following objective function:
\[
F_{\mathrm{p\text{-}global}} = \sum_{j=1}^{J} a_{jm}\sum_{k,l}\tau^{j,m}_{kl}\gamma^{j,m}_{kl} = \sum_{j=1}^{J} a_{jm}\sum_{k,l}\tau^{j,m}_{kl}\Big(A(\theta^j_k) - A(\psi^m_l) - \big\langle \nabla A(\psi^m_l), \theta^j_k - \psi^m_l\big\rangle\Big).
\]
Taking the derivative of $F_{\mathrm{p\text{-}global}}$ with respect to $\psi^m_l$ and setting it to zero, we find that
\[
\frac{\partial F_{\mathrm{p\text{-}global}}}{\partial \psi^m_l} = \sum_{j=1}^{J} a_{jm}\sum_{k}\tau^{j,m}_{kl}\,\nabla^2 A(\psi^m_l)\big(\psi^m_l - \theta^j_k\big) = 0.
\]
Since the log-partition function $A(\cdot)$ is convex, $\nabla^2 A(\psi^m_l)$ is a positive-semidefinite matrix. Therefore, we can choose $\sum_{j=1}^{J} a_{jm}\sum_k\tau^{j,m}_{kl}(\psi^m_l - \theta^j_k) = 0$, which means that
\[
\psi^m_l = \frac{\sum_{j=1}^{J}\sum_{k=1}^{K_j} a_{jm}\tau^{j,m}_{kl}\theta^j_k}{\sum_{j=1}^{J}\sum_{k=1}^{K_j} a_{jm}\tau^{j,m}_{kl}}.
\]
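The global atom update above is again a weighted average of natural parameters, now weighted by both the global plan $a$ and the partial plans $\tau$; a sketch (ours, with illustrative names):

```python
import numpy as np

def update_global_atoms(a, tau, theta, L):
    """psi^m_l = sum_{j,k} a[j,m] tau[j][m][k,l] theta[j][k] / sum_{j,k} a[j,m] tau[j][m][k,l].

    a: (J, C) global plan; tau[j][m]: (K_j, L) partial plans; theta[j]: (K_j, d) local atoms."""
    J, C = a.shape
    d = theta[0].shape[1]
    num = np.zeros((C, L, d))
    den = np.zeros((C, L))
    for j in range(J):
        for m in range(C):
            w = a[j, m] * tau[j][m]        # (K_j, L) combined weights
            num[m] += w.T @ theta[j]       # accumulate weighted atoms
            den[m] += w.sum(axis=0)
    return num / den[..., None]
```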
C.4 Proof of local convergence of Algorithm 2

Equipped with the above updates of the local and global parameters of regularized MCT, we are ready to demonstrate the convergence of Algorithm 2 to a local solution of the objective function (12) of regularized MCT, as stated in Theorem 2. To simplify the argument, we provide only a proof sketch for this theorem. In particular, we denote by $\omega^{j,(t)}_{K_j}$ and $\Theta^{j,(t)}_{K_j}$ the updates of the local weights and local atoms at step $t$ of Algorithm 2 for $t \geq 0$. Similarly, we denote by $w^{m,(t)}_L$ and $\Psi^{m,(t)}_L$ the updates of the global weights and global atoms at step $t$. Furthermore, we denote
\[
g\big(\{\omega^{j,(t)}_{K_j}\}, \{\Theta^{j,(t)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big) := \sum_{j=1}^{J}\widetilde{W}\big(P^j_{n_j}, P^j_{\omega^j_{K_j},\Theta^j_{K_j}}\big) + \zeta\widetilde{W}(\mathcal{P},\mathcal{Q}) - R(\pi,\tau,a),
\]
evaluated at the parameters of step $t$. Then, according to the local clustering updates step, we have
\[
g\big(\{\omega^{j,(t)}_{K_j}\}, \{\Theta^{j,(t)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big) \geq g\big(\{\omega^{j,(t+1)}_{K_j}\}, \{\Theta^{j,(t+1)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big).
\]
On the other hand, invoking the global clustering updates step, we achieve
\[
g\big(\{\omega^{j,(t+1)}_{K_j}\}, \{\Theta^{j,(t+1)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big) \geq g\big(\{\omega^{j,(t+1)}_{K_j}\}, \{\Theta^{j,(t+1)}_{K_j}\}, \{w^{m,(t+1)}_L\}, \{\Psi^{m,(t+1)}_L\}\big).
\]
Governed by the above results, for any $t \geq 0$, the following holds:
\[
g\big(\{\omega^{j,(t)}_{K_j}\}, \{\Theta^{j,(t)}_{K_j}\}, \{w^{m,(t)}_L\}, \{\Psi^{m,(t)}_L\}\big) \geq g\big(\{\omega^{j,(t+1)}_{K_j}\}, \{\Theta^{j,(t+1)}_{K_j}\}, \{w^{m,(t+1)}_L\}, \{\Psi^{m,(t+1)}_L\}\big).
\]
As a consequence, Algorithm 2 monotonically decreases the objective function of regularized MCT, which establishes Theorem 2.