Reducing over-clustering via the powered Chinese restaurant process
Jun Lu, Meng Li, David Dunson

Department of Statistics, Rice University, Houston, TX, USA; Department of Statistical Science, Duke University, Durham, NC, USA. Correspondence to: Jun Lu <[email protected]>.

Abstract
Dirichlet process mixture (DPM) models tend to produce many small clusters regardless of whether they are needed to accurately characterize the data; this is particularly true for large data sets. However, interpretability, parsimony, data storage and communication costs all are hampered by having overly many clusters. We propose a powered Chinese restaurant process to limit this kind of problem and penalize over-clustering. The method is illustrated using some simulation examples and data with large and small sample size, including MNIST and the Old Faithful Geyser data.
1. Introduction
Dirichlet process mixture (DPM) models and closely related formulations have been very widely used for flexible modeling of data and for clustering. DPMs of Gaussians have been shown to possess frequentist optimality properties in density estimation, obtaining minimax adaptive rates of posterior concentration with respect to the true unknown smoothness of the density (Shen et al., 2013). DPMs are also very widely used for probabilistic clustering of data. In the clustering context, it is well known that DPMs favor introducing new components at a log rate as the sample size increases, and tend to produce some large clusters along with many small clusters. As the sample size N increases, these small clusters can be introduced as an artifact even if they are not needed to characterize the true data generating process; for example, even if the true model has finitely many clusters, the DPM will continue to introduce new clusters as N increases (Miller & Harrison, 2013; Argiento et al., 2009; Lartillot & Philippe, 2004; Onogi et al., 2011; Miller & Harrison, 2014).

Continuing to introduce new clusters as N increases can be argued to be an appealing property. The number of 'types' of individuals is unlikely to be finite in an infinitely large population, and there is always a chance of discovering new types as new samples are collected. This rationale has motivated a rich literature on generalizations of Dirichlet processes, which have more flexibility in terms of the rate of introduction of new clusters. For example, the two-parameter Poisson-Dirichlet process (a.k.a. the Pitman-Yor process) is a generalization that instead induces a power law rate, which is more consistent with many observed data processes (Perman et al., 1992). There has also been consideration of a rich class of Gibbs-type processes, which considerably generalize Pitman-Yor to a broad class of so-called exchangeable partition probability functions (EPPFs) (Gnedin & Pitman, 2005; Lijoi & Prünster, 2010; De Blasi et al., 2015; Bacallado et al., 2017; Favaro et al., 2013). Much of the emphasis in the Gibbs-type process literature has been on data in which 'species' are observed directly, and the goal is predicting the number of new species in a further sample (Lijoi et al., 2007). It remains unclear whether such elaborate generalizations of Dirichlet processes have desirable behavior when clusters/species are latent variables in a mixture model.

The emphasis of this article is on addressing practical problems that arise in implementing DPMs and generalizations when sample sizes and data dimensionality are moderate to large. In such settings, it is common knowledge that the number of clusters can be too large, leading to a lack of interpretability, computational problems and other issues. For these reasons, it is well motivated to develop sparser clustering methods that do not restrict the number of clusters to be finite a priori but instead favor deletion of small clusters that may not be needed to accurately characterize the true data generating mechanism. With this goal in mind, we find that the usual focus on exchangeable models, and in particular EPPFs, can limit practical performance.
There has been some previous work on non-exchangeable clustering methods motivated by incorporation of predictor-dependence in clustering (Blei & Frazier, 2011; Ghosh et al., 2011; 2014; Socher et al., 2011), but our focus is instead on providing a simple approach that tends to delete small and unnecessary clusters produced by a DPM. Marginalizing out the random measure in the DPM specification produces a Chinese restaurant process (CRP). We propose a simple powered modification to the CRP, which has the desired impact on clustering, and develop associated inference methods.
2. Powered Chinese restaurant process (pCRP)
The Chinese restaurant process is a simple stochastic process that is exchangeable. In the analogy from which this process takes its name, customers seat themselves at a restaurant with an infinite number of tables. Each customer sits at a previously occupied table with probability proportional to the number of customers already sitting there, and at a new table with probability proportional to a concentration parameter $\alpha$. For example, the first customer enters and sits at the first table. The second customer enters and sits at the first table with probability $\frac{1}{1+\alpha}$ and at a new table with probability $\frac{\alpha}{1+\alpha}$. The $i$th customer sits at an occupied table with probability proportional to the number of customers already seated at that table, or sits at a new table with probability proportional to $\alpha$. Formally, if $z_i$ is the table chosen by the $i$th customer, then
$$
p(z_i = k \mid z_{-i}, \alpha) =
\begin{cases}
\dfrac{N_{k,-i}}{N + \alpha - 1}, & \text{if } k \text{ is occupied, i.e. } N_k > 0,\\[4pt]
\dfrac{\alpha}{N + \alpha - 1}, & \text{if } k \text{ is a new table, i.e. } k = k^\star = K + 1,
\end{cases}
\tag{1}
$$
where $z_{-i} = (z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_N)$ and $N_{k,-i}$ is the number of customers seated at table $k$ excluding customer $i$. From the definition above, we can observe that the CRP is defined by a rich-get-richer property in which the probability of being allocated to a table increases in proportion to the number of customers already at that table.

In a CRP mixture model, each table is assigned a specific parameter in a kernel generating data at the observation level. Customers assigned to a specific table are given the cluster index corresponding to that table, and have their data generated from the kernel with appropriate cluster/table-specific parameters. The CRP provides a prior probability model on the clustering process, and this prior can be updated with the observed data to obtain a posterior over the cluster allocations for each observation in a data set. The CRP provides an exchangeable prior on the partition of indices $\{1, \ldots, N\}$ into clusters; exchangeability means that the ordering of the indices has no impact on the probability of a particular configuration; only the number of clusters $K_N$ and the size of each cluster can play a role. The CRP implies that $\mathbb{E}[K_N \mid \alpha] = O(\alpha \log N)$ (Teh, 2011).
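The logarithmic growth of $K_N$ is easy to see empirically. Below is a minimal simulation sketch; this is our own illustration rather than code from the paper, and the function name `simulate_crp` and the setting $\alpha = 1$ are our choices:

```python
import numpy as np

def simulate_crp(n_customers, alpha, rng):
    """Sequentially seat customers according to the CRP prediction rule (1)."""
    counts = []  # counts[k] = number of customers at table k
    for _ in range(n_customers):
        # Occupied tables are chosen proportionally to their size;
        # a new table is opened with probability proportional to alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            counts.append(1)  # open a new table
        else:
            counts[k] += 1
    return counts

rng = np.random.default_rng(0)
for n in (100, 1000, 10000):
    tables = simulate_crp(n, alpha=1.0, rng=rng)
    # The number of tables K_N grows roughly like alpha * log(N).
    print(n, len(tables), round(np.log(n), 1))
```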
Popular Bayesian nonparametric priors, such as the Dirichlet process (Ferguson, 1973; Blackwell & MacQueen, 1973; Antoniak, 1974), Chinese restaurant process, Pitman-Yor process (Perman et al., 1992; Pitman & Yor, 1997) and Indian buffet process (Griffiths & Ghahramani, 2005; Thibaux & Jordan, 2007), assume infinite exchangeability. In particular, suppose we have a clustering process for an infinite sequence of data points $i = 1, 2, 3, \ldots$. This clustering process will induce a partition of the integers $\{1, \ldots, N\}$ into $K_N$ clusters of sizes $N_1, \ldots, N_{K_N}$, for $N = 1, 2, \ldots$. For an exchangeable clustering process, the probability of a particular partition of $\{1, \ldots, N\}$ only depends on $N_1, \ldots, N_{K_N}$ and $K_N$, and does not depend on the order of the indices $\{1, \ldots, N\}$. In addition, the probability distributions for different choices of $N$ are coherent: the probability distribution of partitions of $\{1, \ldots, N\}$ can be obtained from the probability distribution of partitions of $\{1, \ldots, N+1\}$ by marginalizing out the cluster assignment for data point $i = N + 1$. These properties are often highly appealing computationally and theoretically, but it is nonetheless useful to consider processes that violate the infinite exchangeability assumption. This can occur when the addition of a new data point $i = N + 1$ to a sample of $N$ data points can impact the clustering of the original $N$ data points. For example, we may re-evaluate whether data points 1 and 2 are clustered together in light of new information provided by a third data point, a type of feedback property.

We propose a new powered Chinese restaurant process (pCRP), which is designed to favor elimination of artifactual small clusters produced by the usual CRP by implicit incorporation of a feedback property violating the usual exchangeability assumptions. The proposed pCRP makes the random seating assignment of the customers depend on the powered number of customers at each table (i.e., raise the number at each table to the power $r$). Formally, we have
$$
p(z_i = k \mid z_{-i}, \alpha) =
\begin{cases}
\dfrac{N_{k,-i}^r}{\sum_{h=1}^{K} N_{h,-i}^r + \alpha}, & \text{if } k \text{ is occupied, i.e. } N_k > 0,\\[4pt]
\dfrac{\alpha}{\sum_{h=1}^{K} N_{h,-i}^r + \alpha}, & \text{if } k \text{ is a new table, i.e. } k = k^\star = K + 1,
\end{cases}
\tag{2}
$$
where $r \geq 1$ and $N_{k,-i}$ is the number of customers seated at table $k$ excluding customer $i$.
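Equation (2) is straightforward to evaluate. The following minimal sketch (our own helper, with illustrative values of $\alpha$ and $r$ rather than values from the paper) computes the full conditional seating probabilities for one customer:

```python
import numpy as np

def pcrp_table_probs(counts, alpha, r):
    """Seating probabilities under the pCRP rule (2).

    counts: occupied-table sizes N_{k,-i}; alpha: concentration; r: power
    (r = 1 recovers the ordinary CRP). Returns a length-(K+1) vector whose
    last entry is the new-table probability.
    """
    powered = np.asarray(counts, dtype=float) ** r   # N_{k,-i}^r
    weights = np.append(powered, alpha)              # occupied tables, then a new table
    return weights / weights.sum()                   # denominator: sum_h N_{h,-i}^r + alpha

# Two extreme configurations of N = 4 previous customers:
print(pcrp_table_probs([1, 1, 1, 1], alpha=1.0, r=1.5))  # spread out: new table likely
print(pcrp_table_probs([4], alpha=1.0, r=1.5))           # one big table: new table unlikely
```

With $r = 1.5$, the new-table probability is $1/5$ when the four previous customers are spread across four tables, but only $1/(4^{1.5} + 1) = 1/9$ when they share one table, anticipating the contrast between extreme configurations discussed below.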
More generally, one may consider a $g$-CRP to generalize the CRP such that
$$
p(z_i = k \mid z_{-i}, \alpha) =
\begin{cases}
\dfrac{g(N_{k,-i})}{\sum_{h=1}^{K} g(N_{h,-i}) + \alpha}, & \text{if } k \text{ is occupied, i.e. } N_k > 0,\\[4pt]
\dfrac{\alpha}{\sum_{h=1}^{K} g(N_{h,-i}) + \alpha}, & \text{if } k \text{ is a new table, i.e. } k = k^\star = K + 1,
\end{cases}
\tag{3}
$$
where $g(\cdot): \mathbb{R}^+ \to \mathbb{R}^+$ is an increasing function and $g(0) = 0$. We achieve shrinkage of small clusters via a rich-get-(more)-richer property by requiring $g(x) \geq x$ for $x > 1$ to 'enlarge' clusters containing more than one element. We require the $g$-CRP to maintain a proportional invariance property:
$$
\frac{g(cN_1)}{g(cN_2)} = \frac{g(N_1)}{g(N_2)}
\tag{4}
$$
for any $c, N_1, N_2 > 0$, so that scaling cluster sizes by a constant factor has no impact on the prediction rule in (3). The following Lemma 2.1 shows that the pCRP in equation (2), using the power function, is the only $g$-CRP that satisfies the proportional invariance property.

Lemma 2.1. If a continuous function $g(x): \mathbb{R}^+ \to \mathbb{R}^+$ satisfies equation (4), then $g(x) = g(1) \cdot x^r$ for all $x > 0$ and some constant $r \in \mathbb{R}$.

Proof of Lemma 2.1. It is easy to verify that $g(x) = g(1) \cdot x^r$ for some $r$ is a solution to the functional equation (4). We next show its uniqueness. Equation (4) implies that $g(cN_1)/g(N_1) = g(cN_2)/g(N_2)$ for any $N_1, N_2 > 0$. Denote $f(c) = g(cN)/g(N) > 0$ for arbitrary $N > 0$. We then have $f(st) = g(stN)/g(N) = g(stN)/g(tN) \cdot g(tN)/g(N) = f(s)f(t)$ for any $s, t > 0$. By letting $f^*(x) = f(e^x) > 0$, it follows that $\log f^*(s + t) = \log f^*(s) + \log f^*(t)$, which is the well known Cauchy functional equation and has the unique continuous solution $\log f^*(x) = rx$ for some constant $r$. Therefore, $f(x) = f^*(\log x) = x^r$, which gives $g(cN) = g(N) c^r$. We complete the proof by letting $N = 1$.

As a generalization of the CRP, which corresponds to the special case in which $r = 1$, the proposed pCRP with $r > 1$ generates new clusters following a probability that is configuration dependent and not exchangeable. For example, for three customers $z_1, z_2, z_3$, we have $p(z_3 = 2 \mid z_1 = 1, z_2 = 1) < p(z_3 = 3 \mid z_1 = 1, z_2 = 2)$, where $z_i = k$ if the $i$th customer sits at table $k$. This non-exchangeability is a critical feature of pCRP, allowing new cluster generation to learn from existing patterns. Consider two extreme configurations: (i) $K_N = N$ with one member in each cluster, and (ii) $K_N = 1$ with all members in a single cluster. The probabilities of generating a new cluster under (i) and (ii) are both $\alpha/(N + \alpha)$ in CRP, but dramatically different in pCRP: (i) $\alpha/(N + \alpha)$ and (ii) $\alpha/(N^r + \alpha)$, respectively. Therefore, if the previous customers are more spread out, there is a larger probability of continuing this pattern by creating new tables. Similarly, if customers choose a small number of tables, then a new customer is more likely to join the dominant clusters rather than open a new table.

The power $r$ is a critical parameter controlling how much we penalize small clusters. The larger the power $r$, the greater the penalty. We propose a method to choose $r$ in a data-driven fashion: cross validation using a proper loss function to select a fixed $r$. The proportional invariance property makes it easier to define a cross validation (CV) procedure for estimating $r$. In particular, one can tune $r$ to obtain good performance on an initial training sample, and that $r$ would also be appropriate for a subsequent data set that has a very different sample size. For other choices of $g(\cdot)$, which do not possess proportional invariance, it may be necessary to adapt $r$ to the sample size for appropriate calibration.

In evaluating generalization error, we use the following loss function based on the within-cluster sum of squares:
$$
\sum_{k=1}^{K} \sqrt{\sum_{j \in C_k} \lVert x_j - \bar{x}_k \rVert^2},
\tag{5}
$$
where $C_k$ contains the $N_k$ data samples in the $k$th cluster and $\bar{x}_k$ is the mean vector for cluster $k$. The square root has an important impact in favoring a smaller number of clusters; for example, it induces a price to be paid for introducing two clusters with the same mean. In implementing CV, we start by choosing a small value of $r$ ($r = 1 + \epsilon$) and then increase it until we identify an inflection point.
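The loss (5) is simple to compute. A minimal sketch under our reading of the equation (the helper name `cv_loss` is ours):

```python
import numpy as np

def cv_loss(X, labels):
    """Square-root within-cluster sum of squares, as in equation (5)."""
    total = 0.0
    for k in np.unique(labels):
        cluster = X[labels == k]                 # the N_k samples in C_k
        centroid = cluster.mean(axis=0)          # cluster mean x_bar_k
        wss = np.sum((cluster - centroid) ** 2)  # within-cluster sum of squares
        total += np.sqrt(wss)                    # square root taken per cluster
    return total
```

The per-cluster square root is what penalizes needless splits: splitting a cluster with within-cluster sum of squares $S$ into two same-mean halves with sum $S/2$ each changes its contribution from $\sqrt{S}$ to $2\sqrt{S/2} = \sqrt{2}\,\sqrt{S}$, an increase by a factor of $\sqrt{2}$.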
Figure 1. A Bayesian infinite GMM.
Although the proposed pCRP is generic, we focus on its application in Gaussian mixture models (GMMs) for concreteness. We develop a collapsed Gibbs sampling algorithm (Alg. 3 in Neal (2000), further described in Murphy (2012)) for posterior computation. Our proposed pCRP can also be easily implemented via a non-collapsed Gibbs sampling algorithm (West & Escobar, 1993; Alg. 2 in Neal (2000)). In addition, we permute the data at each sampling iteration to eliminate order dependence, as in Socher et al. (2011).

Let $\mathcal{X}$ be the observations, assumed to follow a mixture of multivariate Gaussian distributions. We use a conjugate Normal-Inverse-Wishart (NIW) prior $p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \beta)$ for the mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ in each multivariate Gaussian component, where $\beta$ consists of all the hyperparameters in the NIW. That is, we work with the following definition of a Bayesian infinite Gaussian mixture model:
$$
\begin{aligned}
x_i \mid z_i, \{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\} &\sim \mathcal{N}(\boldsymbol{\mu}_{z_i}, \boldsymbol{\Sigma}_{z_i}),\\
z_i \mid \boldsymbol{\pi} &\sim \text{Multinomial}(\pi_1, \ldots, \pi_K),\\
\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\} &\sim \text{NIW}(\beta),\\
\boldsymbol{\pi} &\sim \text{Dirichlet}(\alpha/K, \ldots, \alpha/K),
\end{aligned}
\tag{6}
$$
by taking the limit as $K \to \infty$ (Rasmussen, 1999), where $\mathcal{N}(\cdot, \cdot)$ is the multivariate normal distribution, Multinomial($\cdot$) is the multinomial distribution and Dirichlet($\cdot$) is the Dirichlet distribution. The process is illustrated in Figure 1.

A key quantity in a collapsed Gibbs sampler is the probability of each customer $i$ sitting at table $k$: $p(z_i = k \mid z_{-i}, \mathcal{X}, \alpha, \beta)$, where $z_{-i}$ are the seating assignments of all the other customers and $\alpha$ is the concentration parameter in the CRP and pCRP. This probability is calculated as follows:
$$
\begin{aligned}
p(z_i = k \mid z_{-i}, \mathcal{X}, \alpha, \beta)
&\propto p(z_i = k \mid z_{-i}, \alpha)\, p(\mathcal{X} \mid z_i = k, z_{-i}, \beta)\\
&= p(z_i = k \mid z_{-i}, \alpha)\, p(x_i \mid \mathcal{X}_{-i}, z_i = k, z_{-i}, \beta)\, p(\mathcal{X}_{-i} \mid z_{-i}, \beta)\\
&\propto p(z_i = k \mid z_{-i}, \alpha)\, p(x_i \mid \mathcal{X}_{-i}, z_i = k, z_{-i}, \beta)\\
&\propto p(z_i = k \mid z_{-i}, \alpha)\, p(x_i \mid \mathcal{X}_{k,-i}, \beta),
\end{aligned}
\tag{7}
$$
where $\mathcal{X}_{k,-i}$ are the observations at table $k$ excluding the $i$th observation, and the first term in the last line, $p(z_i = k \mid z_{-i}, \alpha)$, is the proposed pCRP in equation (2). If $z_i = k$ is an existing component, the second term $p(x_i \mid \mathcal{X}_{k,-i}, \beta)$ is calculated using the posterior predictive distribution at $x_i$, where the posterior predictive distribution of new data $x^\star$ given the data set $\mathcal{X}$ and the prior parameter $\beta$ under the NIW prior is
$$
p(x^\star \mid \mathcal{X}, \beta) = \int_{\boldsymbol{\mu}} \int_{\boldsymbol{\Sigma}} p(x^\star \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\, p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{X}, \beta)\, d\boldsymbol{\mu}\, d\boldsymbol{\Sigma}.
\tag{8}
$$
When $z_i = k^\star$ is a new component, we have
$$
p(x_i \mid \mathcal{X}_{k,-i}, \beta) = p(x_i \mid \beta) = \int_{\boldsymbol{\mu}} \int_{\boldsymbol{\Sigma}} p(x_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\, p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \beta)\, d\boldsymbol{\mu}\, d\boldsymbol{\Sigma},
\tag{9}
$$
which is just the prior predictive distribution and can be calculated as the posterior predictive distribution $p(x^\star \mid \mathcal{X}, \beta)$ under the NIW prior but with $\mathcal{X} = \emptyset$.
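Equations (8) and (9) have a closed form under the NIW prior: the predictive is a multivariate Student-t whose parameters follow from the standard conjugate updates (see, e.g., Murphy, 2012). A sketch under the assumption $\beta = (\boldsymbol{\mu}_0, \kappa_0, \nu_0, \boldsymbol{\Psi}_0)$, with a function name of our own choosing:

```python
import numpy as np
from scipy.special import gammaln

def niw_predictive_logpdf(x, X, mu0, kappa0, nu0, Psi0):
    """log p(x | X, beta) under a NIW prior, equation (8).

    Passing an empty X gives the prior predictive of equation (9).
    """
    d = len(mu0)
    n = 0 if X is None else X.shape[0]
    if n > 0:
        xbar = X.mean(axis=0)
        S = (X - xbar).T @ (X - xbar)  # scatter matrix
        kappa_n, nu_n = kappa0 + n, nu0 + n
        mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
        Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * np.outer(xbar - mu0, xbar - mu0)
    else:
        kappa_n, nu_n, mu_n, Psi_n = kappa0, nu0, mu0, Psi0
    # Multivariate Student-t: df = nu_n - d + 1, scale = Psi_n (kappa_n + 1)/(kappa_n df)
    df = nu_n - d + 1
    scale = Psi_n * (kappa_n + 1) / (kappa_n * df)
    diff = np.asarray(x) - mu_n
    _, logdet = np.linalg.slogdet(scale)
    quad = diff @ np.linalg.solve(scale, diff)
    return (gammaln((df + d) / 2) - gammaln(df / 2)
            - 0.5 * (d * np.log(df * np.pi) + logdet)
            - 0.5 * (df + d) * np.log1p(quad / df))
```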
Algorithm 1 Collapsed Gibbs Sampling for pCRP

Input: initial $z$, $r$, $\alpha$, $\beta$
for $T$ iterations do
    Sample a random permutation $\tau$ of $1, \ldots, N$
    for $i \in (\tau(1), \ldots, \tau(N))$ do
        Remove $x_i$'s statistics from component $z_i$
        for $k = 1$ to $K$ do
            Calculate $p(z_i = k \mid z_{-i}, \alpha) = \frac{N_{k,-i}^r}{\sum_h N_{h,-i}^r + \alpha}$
            Calculate $p(x_i \mid \mathcal{X}_{k,-i}, \beta)$
            Calculate $p(z_i = k \mid z_{-i}, \mathcal{X}, \alpha, \beta) \propto p(z_i = k \mid z_{-i}, \alpha)\, p(x_i \mid \mathcal{X}_{k,-i}, \beta)$
        end for
        Calculate $p(z_i = k^\star \mid z_{-i}, \alpha) = \frac{\alpha}{\sum_h N_{h,-i}^r + \alpha}$
        Calculate $p(x_i \mid \beta)$
        Calculate $p(z_i = k^\star \mid z_{-i}, \mathcal{X}, \alpha, \beta) \propto p(z_i = k^\star \mid z_{-i}, \alpha)\, p(x_i \mid \beta)$
        Sample $k_{\text{new}}$ from $p(z_i \mid z_{-i}, \mathcal{X}, \alpha, \beta)$ after normalizing
        Add $x_i$'s statistics to the component $z_i = k_{\text{new}}$
        If any component is empty, remove it and decrease $K$
    end for
end for

Algorithm 1 gives the pseudocode of the collapsed Gibbs sampler implementing pCRP in Gaussian mixture models; a minimal sketch of one sweep follows.
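In the sketch below (our own structuring, not code from the paper), `predictive_logpdf(x, X_k)` stands in for the NIW computations above, closing over the hyperparameters $\beta$; the unnormalized weights drop the shared denominator of (2) since it cancels upon normalization:

```python
import numpy as np

def gibbs_sweep(X, z, alpha, r, predictive_logpdf, rng):
    """One pass of the collapsed Gibbs sampler for the pCRP mixture (Algorithm 1)."""
    for i in rng.permutation(X.shape[0]):       # random scan, as in Algorithm 1
        z[i] = -1                               # remove x_i's statistics from its component
        labels = [k for k in np.unique(z) if k >= 0]
        log_w = []
        for k in labels:                        # occupied tables: N_{k,-i}^r * predictive
            members = z == k
            log_w.append(r * np.log(members.sum())
                         + predictive_logpdf(X[i], X[members]))
        # New table: alpha * prior predictive (empty conditioning set).
        log_w.append(np.log(alpha) + predictive_logpdf(X[i], X[:0]))
        log_w = np.asarray(log_w)
        w = np.exp(log_w - log_w.max())         # stabilize before normalizing
        choice = rng.choice(len(w), p=w / w.sum())
        z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z
```

Emptied components vanish automatically, since the labels are recomputed from the current assignments at each step; setting $r = 1$ recovers the standard collapsed sampler for the CRP mixture.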
3. Experiments
We conduct experiments to demonstrate the main advantages of the proposed pCRP using both synthetic and real data. In a wide range of scenarios across various sample sizes, pCRP reduces the over-clustering of CRP, and leads to performance that is as good as or better than CRP in terms of density estimation, out-of-sample prediction, and overall clustering results.

In all experiments, we run the Gibbs sampler for 20,000 iterations with a burn-in of 10,000. The sampler is thinned by keeping every 5th draw. We use the same concentration parameter $\alpha = 1$ for both CRP and pCRP in all scenarios. In addition, we equip CRP with an unfair advantage by matching the magnitude of its prior mean $\alpha \log(N)$ to the true number of clusters, termed CRP-Oracle. The power $r$ in pCRP is tuned using cross validation. In order to measure overall clustering performance, we use normalized mutual information (NMI) (McDaid et al., 2013) and variation of information (VI) (Meilă, 2003), which measure the similarity between the true and estimated cluster assignments. Higher NMI and lower VI indicate better performance. If applicable, metrics using the true clustering are calculated to provide an upper bound for all methods, coded as 'Ground Truth'.

We first use simulated data to assess the performance of pCRP in emptying extra components, compared to the traditional CRP. Figure 2 shows the true data generating densities, which represent the two cases of well-mixed Gaussian components and a shared-mean Gaussian mixture, coded as Sim 1 and Sim 2, respectively. The oracle concentration parameters in CRP-Oracle are (0.52, 0.40) in Sim 1 and (0.52, 0.40) in Sim 2, which are all smaller than the unit concentration parameter used in CRP and pCRP. The sample sizes in the two simulation cases are respectively (300, 2000). Figure 3 shows the cross validation curve used to select $r$ in pCRP on a training data set with 200 samples. The representative inflection-point behavior described in Section 2.3 was observed: the loss curve for cross validation 'blows up' at one particular $r$ value in both Sim 1 and Sim 2. The first stage of Sim 2 oscillates from 14.2595 to 14.2005, which is flatter than that of Sim 1, which oscillates from 15.5260 to 14.9020. This is because the components in Sim 2 are well separated. We choose this change point as the power $r$ in either case.
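For completeness, both evaluation metrics above are easy to compute. A minimal sketch, assuming scikit-learn is available for NMI; `variation_of_information` is our own helper implementing VI = H(A) + H(B) − 2 I(A, B):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def variation_of_information(a, b):
    """VI(A, B) = H(A) + H(B) - 2 I(A, B), from the joint label frequencies."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    entropy = lambda p: -np.sum(p * np.log(p))
    pa = np.unique(a, return_counts=True)[1] / n
    pb = np.unique(b, return_counts=True)[1] / n
    # Joint distribution over (true label, estimated label) pairs.
    pairs = np.unique(np.stack([a, b], axis=1), axis=0, return_counts=True)[1] / n
    mi = entropy(pa) + entropy(pb) - entropy(pairs)
    return entropy(pa) + entropy(pb) - 2 * mi

true_z = [0, 0, 1, 1, 2, 2]
est_z = [0, 0, 1, 1, 1, 2]
print(normalized_mutual_info_score(true_z, est_z))
print(variation_of_information(true_z, est_z))
```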
Figure 2. Data generating densities for two scenarios: (a) Sim 1: a mixture of three poorly separated Gaussian components; (b) Sim 2: a mixture of three well separated Gaussian components.
Figure 3. Cross validation curves to choose $r$ for Sim 1 and Sim 2. The x-axis is the power value, the y-axis is the loss. The vertical line is the chosen power $r$ value.

Figure 4 shows traceplots of posterior samples of the number of clusters for each of the methods in Sim 1. Clearly pCRP places relatively high posterior probability on three clusters, which is the ground truth. In contrast, CRP has higher posterior variance, systematic over-estimation of the number of clusters, and worse computational efficiency. CRP-Oracle has better performance, but does clearly worse than pCRP, and there is still a tendency for over-estimation. This demonstrates that one cannot simply fix up or calibrate the CRP by choosing the precision to be appropriately small.

Figure 4. Traceplots of cluster numbers using the three methods in Sim 1 when $N = 2000$: (a) CRP-Oracle; (b) CRP; (c) pCRP. The x-axis is the sampling iteration, the y-axis is the number of clusters.

Figure 6 suggests that CRP places larger probability on larger cluster numbers, especially as the sample size increases, while pCRP tends to place larger probability on the true cluster number as the sample size increases. For example, in Sim 1, the probability of selecting three clusters increases from 0.55 to 0.68 in pCRP when $N$ increases from 300 to 2000, and the probability for every other cluster number decreases. However, the probability of finding four clusters stabilizes around 0.37 and 0.38 in CRP-Oracle when $N$ increases from 300 to 2000. CRP has increased probability of selecting a larger number of clusters (say 5, 6, 7, or 8 clusters) when $N$ increases from 300 to 2000. In fact, the proposed pCRP has the largest concentration of probability on the true number of clusters among all three methods including CRP-Oracle, and this observation is consistent between $N = 300$ and $N = 2000$.
Figure 5. Posterior densities for the three methods in Sim 1 when $N = 2000$: (a) CRP-Oracle; (b) CRP; (c) pCRP. The dashed lines are weighted components.

Table 1 provides numerical summaries of this simulation. All three methods lead to similar NMI, but pCRP consistently gives the highest value. Furthermore, pCRP leads to the lowest value of VI in most tests. The parsimonious effect of pCRP discussed above is further confirmed by the average and maximum numbers of clusters; see the columns $\bar{K}$ and $K_{\max}$ in the table.

The posterior density plots in Figure 5 show that there is one small unnecessary cluster in CRP-Oracle and two small unnecessary clusters in CRP, while all three methods capture the general shape of the true density and thus provide good fitting performance. The over-clustering effect of CRP is much reduced by pCRP, as seen in Figure 5(c).

Figure 6. Estimated posterior of the number of clusters in observed data for CRP-Oracle (red x), CRP (blue circle) and pCRP (green star): (a) Sim 1, $N = 300$; (b) Sim 2, $N = 300$; (c) Sim 1, $N = 2000$; (d) Sim 2, $N = 2000$.
In this experiment, we cluster 1000 and 3000 digits of the classes 1 to 4 in the MNIST data set (LeCun et al., 2010), where the four clusters are approximately equally distributed. From cross validation on a different set of 1000 samples, we obtain the power value $r$. The concentration parameter $\alpha$ in CRP-Oracle is calculated as 0.58 ($N = 1000$) and 0.5 ($N = 3000$).

Figure 7. Results of clustering 3000 randomly sampled digits from 1 to 4 in spectral space: (a) true clustering; (b) CRP-Oracle; (c) pCRP. Observations in the same color represent the same digit. CRP-Oracle seems to over-fit the noise (the red cluster). We omit the result of CRP as it is similar to CRP-Oracle.

Figure 7 shows the clustering results of all three methods for $N = 3000$. Both CRP and CRP-Oracle seem to over-fit the data by introducing a small cluster (in red), while pCRP gives a cleaner clustering result with four clusters. This comparison is further confirmed by Table 2, where the average posterior cluster number in CRP apparently increases when $N$ grows to 3000. In contrast, pCRP is closer to the true situation by reducing the over-clustering effect, even compared to CRP-Oracle; see the columns $\bar{K}$ and $K_{\max}$. All methods lead to similar NMI, but pCRP gives lower VI.

**N = 300**

| Method | NMI (SE) | VI (SE) | $\bar{K}$ (SE) | $K_{\max}$ |
|---|---|---|---|---|
| Ground truth (Sim 1) | 1.0 | 0.0 | 3 | – |
| CRP-Oracle (Sim 1) | 0.800 (1.1 × 10^-) | 0.669 (4.5 × 10^-) | 4.2 (2.3 × 10^-) | 8 |
| CRP (Sim 1) | 0.773 (1.2 × 10^-) | 0.795 (5.4 × 10^-) | 5.3 (3.3 × 10^-) | 12 |
| pCRP (Sim 1) | – (0.7 × 10^-) | – (4.4 × 10^-) | – (1.7 × 10^-) | – |
| Ground truth (Sim 2) | 1.0 | 0.0 | 2 | – |
| CRP-Oracle (Sim 2) | 0.937 (0.8 × 10^-) | 0.210 (3.0 × 10^-) | 4.0 (2.3 × 10^-) | 9 |
| CRP (Sim 2) | 0.917 (0.9 × 10^-) | 0.287 (3.7 × 10^-) | 4.8 (2.9 × 10^-) | 12 |
| pCRP (Sim 2) | – (0.3 × 10^-) | – (0.8 × 10^-) | – (0.9 × 10^-) | – |

**N = 2000**

| Method | NMI (SE) | VI (SE) | $\bar{K}$ (SE) | $K_{\max}$ |
|---|---|---|---|---|
| Ground truth (Sim 1) | 1.0 | 0.0 | 3 | – |
| CRP-Oracle (Sim 1) | 0.812 (5.3 × 10^-) | 0.610 (2.6 × 10^-) | 4.0 (2.3 × 10^-) | 10 |
| CRP (Sim 1) | 0.782 (8.5 × 10^-) | 0.732 (4.0 × 10^-) | 5.8 (3.6 × 10^-) | 12 |
| pCRP (Sim 1) | – (6.6 × 10^-) | – (7.3 × 10^-) | – (1.6 × 10^-) | – |
| Ground truth (Sim 2) | 1.0 | 0.0 | 2 | – |
| CRP-Oracle (Sim 2) | 0.962 (4.8 × 10^-) | 0.122 (1.7 × 10^-) | 4.1 (2.2 × 10^-) | 8 |
| CRP (Sim 2) | 0.940 (8.2 × 10^-) | 0.205 (3.1 × 10^-) | 5.6 (3.5 × 10^-) | 13 |
| pCRP (Sim 2) | – (1.2 × 10^-) | – (0.4 × 10^-) | – (0.8 × 10^-) | – |

Table 1. Comparison of CRP and pCRP on Sim 1 and Sim 2. $\bar{K}$ is the average number of found clusters. $K_{\max}$ is the maximum number of clusters during sampling. SE is the standard error of the mean. Ground truth is calculated using the true assignments.
The Old Faithful Geyser data ($N = 272$) are widely used to illustrate the performance of clustering algorithms. We use a test sample of 100 in CV, leading to the power value $r$. We compare all methods on the other 172 data points. A manual clustering consisting of two Gaussian components is viewed as the ground truth. The concentration parameter is 0.39 in CRP-Oracle. Figure 8(b) shows the size of each component obtained from all methods and from the manual clustering. We can see that there are two mixture components in CRP-Oracle and pCRP, and four mixture components in the CRP method. In this case, where the sample size is relatively small, we again see that pCRP successfully suppresses small components and generates more parsimonious results than CRP.

Figure 8. Clustering results for the Old Faithful Geyser data: (a) the distribution of the Old Faithful Geyser data after standardization; (b) the number of samples in each cluster using CRP-Oracle, CRP, pCRP and manual clustering.

**N = 1000**

| Method | NMI (SE) | VI (SE) | $\bar{K}$ (SE) | $K_{\max}$ |
|---|---|---|---|---|
| Ground truth | 1.0 | 0 | 4 | – |
| CRP-Oracle | – (3.3 × 10^-) | – (1.4 × 10^-) | 4.37 (1.3 × 10^-) | 7 |
| CRP | – (3.3 × 10^-) | 1.386 (1.4 × 10^-) | 4.58 (1.6 × 10^-) | 8 |
| pCRP | – (3.3 × 10^-) | – (1.4 × 10^-) | – (0.6 × 10^-) | – |

**N = 3000**

| Method | NMI (SE) | VI (SE) | $\bar{K}$ (SE) | $K_{\max}$ |
|---|---|---|---|---|
| Ground truth | 1.0 | 0.0 | 4 | – |
| CRP-Oracle | 0.651 (2.0 × 10^-) | 1.400 (1.1 × 10^-) | 5.17 (1.2 × 10^-) | 8 |
| CRP | 0.651 (2.0 × 10^-) | 1.402 (1.1 × 10^-) | 5.44 (1.6 × 10^-) | 9 |
| pCRP | – (1.9 × 10^-) | – (1.1 × 10^-) | – (1.2 × 10^-) | – |
Table 2. Comparison of CRP-Oracle, CRP and pCRP on a 4-digit subset of MNIST. $\bar{K}$ is the average number of discovered clusters. $K_{\max}$ is the maximum number of clusters during sampling.

References
Antoniak, Charles E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pp. 1152–1174, 1974.

Argiento, Raffaele, Guglielmi, Alessandra, and Pievatolo, Antonio. A comparison of nonparametric priors in hierarchical mixture modelling for AFT regression. Journal of Statistical Planning and Inference, 139(12):3989–4005, 2009.

Bacallado, Sergio, Battiston, Marco, Favaro, Stefano, Trippa, Lorenzo, et al. Sufficientness postulates for Gibbs-type priors and hierarchical generalizations. Statistical Science, 32(4):487–500, 2017.

Blackwell, David and MacQueen, James B. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, pp. 353–355, 1973.

Blei, David M and Frazier, Peter I. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12:2461–2488, August 2011.

De Blasi, Pierpaolo, Favaro, Stefano, Lijoi, Antonio, Mena, Ramsés H, Prünster, Igor, and Ruggiero, Matteo. Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):212–229, 2015.

Favaro, Stefano, Lijoi, Antonio, Prünster, Igor, et al. Conditional formulae for Gibbs-type exchangeable random partitions. The Annals of Applied Probability, 23(5):1721–1754, 2013.

Ferguson, Thomas S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.

Ghosh, Soumya, Ungureanu, Andrei B, Sudderth, Erik B, and Blei, David M. Spatial distance dependent Chinese restaurant processes for image segmentation. In Advances in Neural Information Processing Systems, pp. 1476–1484, 2011.

Ghosh, Soumya, Raptis, Michalis, Sigal, Leonid, and Sudderth, Erik B. Nonparametric clustering with distance dependent hierarchies. In UAI, pp. 260–269, 2014.

Gnedin, Alexander and Pitman, Jim. Exchangeable Gibbs partitions and Stirling triangles. Zap. Nauchn. Sem. S.-Peterburg. Otdel. Mat. Inst. Steklov., 325:83–102, 2005.

Griffiths, Thomas L and Ghahramani, Zoubin. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, volume 18, pp. 475–482, 2005.

Lartillot, Nicolas and Philippe, Hervé. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution, 21(6):1095–1109, 2004.

LeCun, Yann, Cortes, Corinna, and Burges, Christopher JC. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.

Lijoi, Antonio and Prünster, Igor. Models beyond the Dirichlet process. In Hjort, Nils Lid, Holmes, Chris, Müller, Peter, and Walker, Stephen G (eds.), Bayesian Nonparametrics, volume 28, pp. 80–136. Cambridge Univ. Press, Cambridge, 2010.

Lijoi, Antonio, Mena, Ramsés H, and Prünster, Igor. Bayesian nonparametric estimation of the probability of discovering new species. Biometrika, 94(4):769–786, 2007.

McDaid, Aaron F, Greene, Derek, and Hurley, Neil. Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515v2, 2013.

Meilă, Marina. Comparing clusterings by the variation of information. In Schölkopf, Bernhard and Warmuth, Manfred K. (eds.), Learning Theory and Kernel Machines, pp. 173–187. Springer Berlin Heidelberg, 2003.

Miller, Jeffrey W and Harrison, Matthew T. A simple example of Dirichlet process mixture inconsistency for the number of components. In Advances in Neural Information Processing Systems, pp. 199–206, 2013.

Miller, Jeffrey W and Harrison, Matthew T. Inconsistency of Pitman-Yor process mixtures for the number of components. The Journal of Machine Learning Research, 15(1):3333–3370, 2014.

Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Neal, Radford M. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

Onogi, Akio, Nurimoto, Masanobu, and Morita, Mitsuo. Characterization of a Bayesian genetic clustering algorithm based on a Dirichlet process prior and comparison among Bayesian clustering methods. BMC Bioinformatics, 12(1):263, 2011.

Perman, Mihael, Pitman, Jim, and Yor, Marc. Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39, 1992.

Pitman, Jim and Yor, Marc. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, pp. 855–900, 1997.

Rasmussen, Carl Edward. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12, pp. 554–560, 1999.

Shen, Weining, Tokdar, Surya T, and Ghosal, Subhashis. Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika, 100(3):623–640, 2013.

Socher, Richard, Maas, Andrew L, and Manning, Christopher D. Spectral Chinese restaurant processes: Nonparametric clustering based on similarities. In Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 698–706, 2011.

Teh, Yee Whye. Dirichlet process. In Encyclopedia of Machine Learning, pp. 280–287. Springer, 2011.

Thibaux, Romain and Jordan, Michael I. Hierarchical beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, pp. 564–571, 2007.

West, Mike and Escobar, Michael D.