A simple proof of Pitman-Yor's Chinese restaurant process from its stick-breaking representation
arXiv preprint (math.ST)
Caroline Lawless and Julyan Arbel
Univ. Grenoble Alpes, Inria, CNRS, LJK, 38000 Grenoble, France
Abstract
For a long time, the Dirichlet process has been the gold standard discrete random measure in Bayesian nonparametrics. The Pitman–Yor process provides a simple and mathematically tractable generalization, allowing for a very flexible control of the clustering behaviour. Two commonly used representations of the Pitman–Yor process are the stick-breaking process and the Chinese restaurant process. The former is a constructive representation of the process which turns out very handy for practical implementation, while the latter describes the partition distribution induced. Obtaining one from the other is usually done indirectly with use of measure theory. In contrast, we provide here an elementary proof of Pitman–Yor's Chinese restaurant process from its stick-breaking representation.
The Pitman–Yor process defines a rich and flexible class of random probability measures which was developed by Perman et al. (1992) and further investigated by Pitman (1995) and Pitman and Yor (1997). It is a simple generalization of the Dirichlet process (Ferguson, 1973), whose mathematical tractability contributed to its popularity in machine learning theory (Caron et al., 2017), probabilistic models for linguistic applications (Teh, 2006, Wood et al., 2011), excursion theory (Perman et al., 1992, Pitman and Yor, 1997), measure-valued diffusions in population genetics (Petrov, 2009, Feng and Sun, 2010), combinatorics (Vershik et al., 2004, Kerov, 2006) and statistical physics (Derrida, 1981).

Its most prominent role is perhaps in Bayesian nonparametric statistics, where it is used as a prior distribution, following the work of Ishwaran and James (2001). Applications in this setting embrace a variety of inferential problems, including species sampling (Favaro et al., 2009, Navarrete et al., 2008, Arbel et al., 2017), survival analysis and graphical models in genetics (Jara et al., 2010, Ni et al., 2018), image segmentation (Sudderth and Jordan, 2009), curve estimation (Canale et al., 2017), exchangeable feature allocations (Battiston et al., 2018) and time series and econometrics (Caron et al., 2017, Bassetti et al., 2014).

Last but not least, the Pitman–Yor process is also employed in the context of nonparametric mixture modeling, thus generalizing the celebrated Dirichlet process mixture model of Lo (1984). Nonparametric mixture models based on the Pitman–Yor process are characterized by a more flexible parameterization than the Dirichlet process mixture model, thus allowing for a better control of the clustering behaviour (De Blasi et al., 2015). In addition, see Ishwaran and James (2001), Favaro and Walker (2013), Arbel et al. (2018) for posterior sampling algorithms, Scricciolo et al.
(2014), Miller and Harrison (2014) for asymptotic properties, and Scarpa and Dunson (2009), Canale et al. (2017) for spike-and-slab extensions.

The Pitman–Yor process has the following stick-breaking representation: if $v_i \overset{\text{ind}}{\sim} \text{Beta}(1-d, \alpha+id)$ for $i = 1, 2, \ldots$ with $d \in (0,1)$ and $\alpha > -d$, if $\pi_j = v_j \prod_{i=1}^{j-1}(1-v_i)$ for $j = 1, 2, \ldots$, and if $\theta_1, \theta_2, \ldots \overset{\text{iid}}{\sim} H$, then the discrete random probability measure

$$P = \sum_{j=1}^{\infty} \pi_j \delta_{\theta_j} \qquad (1)$$

is distributed according to the Pitman–Yor process, $\text{PY}(\alpha, d, H)$, with concentration parameter $\alpha$, discount parameter $d$, and base distribution $H$.

The Pitman–Yor process induces the following partition distribution: if $P \sim \text{PY}(\alpha, d, H)$ for some nonatomic probability distribution $H$, we observe data $x_1, \ldots, x_n \mid P \overset{\text{iid}}{\sim} P$, and $\mathbf{C}$ is the partition of the first $n$ integers $\{1, \ldots, n\}$ induced by the data, then

$$P(\mathbf{C} = C) = \frac{d^{|C|}}{(\alpha)^{(n)}} \left(\frac{\alpha}{d}\right)^{(|C|)} \prod_{c \in C} (1-d)^{(|c|-1)}, \qquad (2)$$

where the multiplicative factor before the product in (2) is also commonly (and equivalently) written as $\bigl(\prod_{i=1}^{|C|-1} (\alpha + id)\bigr) / (\alpha+1)^{(n-1)}$ in the literature. When the discount parameter $d$ is set to zero, the Pitman–Yor process reduces to the Dirichlet process and the partition distribution (2) boils down to the celebrated Chinese restaurant process (CRP, see Antoniak, 1974). By abuse of language, we call the partition distribution (2) the Pitman–Yor CRP. Under the latter partition distribution, the number of parts in a partition $C$ of $n$ elements, $k_n = |C|$, grows to infinity as a power law of the sample size, $n^d$ (see Pitman, 2003, for details). This Pitman–Yor power-law growth is more in tune with most empirical data (Clauset et al., 2009) than the logarithmic growth induced by the Dirichlet process CRP, $\alpha \log n$.

The purpose of this note is to provide a simple proof of the Pitman–Yor CRP (2) from its stick-breaking representation (1) (Theorem 2.1). This generalizes the derivation by Miller (2018), who obtained the Dirichlet process CRP (Antoniak, 1974) from Sethuraman's stick-breaking representation (Sethuraman, 1994). In doing so, we also provide the marginal distribution of the allocation variables vector (3) in Proposition 2.2.

Suppose we make $n$ observations, $z_1, \ldots, z_n$.
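As an aside (not part of the paper's argument), the stick-breaking representation (1) is straightforward to simulate. The following Python sketch, with arbitrary truncation level `J` and parameter values, draws truncated Pitman–Yor weights and samples allocation variables from them:

```python
import random

def py_stick_breaking(alpha, d, J, rng):
    """First J Pitman-Yor weights pi_1, ..., pi_J via stick-breaking:
    v_i ~ Beta(1 - d, alpha + i*d),  pi_j = v_j * prod_{i<j} (1 - v_i)."""
    weights = []
    remaining = 1.0  # length of the unbroken stick, prod_{i<j} (1 - v_i)
    for i in range(1, J + 1):
        v = rng.betavariate(1.0 - d, alpha + i * d)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(0)
pi = py_stick_breaking(alpha=1.0, d=0.5, J=1000, rng=rng)
print(sum(pi))  # close to 1; the small deficit is the mass beyond the truncation
z = rng.choices(range(1, len(pi) + 1), weights=pi, k=100)
print(len(set(z)))  # number of distinct clusters among n = 100 allocations
```

For $d = 0$ this reduces to the Dirichlet process stick-breaking of Sethuraman (1994); the truncation error $1 - \sum_{j \le J} \pi_j$ vanishes as $J$ grows.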
We denote the set $\{1, \ldots, n\}$ by $[n]$. Our observations induce a partition of $[n]$, denoted $C = \{c_1, \ldots, c_{k_n}\}$, where $c_1, \ldots, c_{k_n}$ are disjoint sets and $\bigcup_{i=1}^{k_n} c_i = [n]$, in such a way that $z_i$ and $z_j$ belong to the same part if and only if $z_i = z_j$. We denote the number of parts in the partition $C$ by $k_n = |C|$ and the number of elements in part $j$ by $|c_j|$. We use bold font to represent random variables. We write $(x)^{(n)} = \prod_{j=0}^{n-1}(x+j)$ to denote the rising factorial.

Theorem 2.1.
Suppose $v_i \overset{\text{ind}}{\sim} \text{Beta}(1-d, \alpha+id)$ for $i = 1, 2, \ldots$, and let

$$\pi_j = v_j \prod_{i=1}^{j-1} (1 - v_i) \quad \text{for } j = 1, 2, \ldots$$

Let allocation variables be defined by $z_1, \ldots, z_n \mid \boldsymbol{\pi} = \pi \overset{\text{iid}}{\sim} \pi$, meaning

$$P(z_i = j \mid \pi) = \pi_j, \qquad (3)$$

and let $\mathbf{C}$ denote the random partition of $[n]$ induced by $z_1, \ldots, z_n$. Then

$$P(\mathbf{C} = C) = \frac{d^{|C|}}{(\alpha)^{(n)}} \left(\frac{\alpha}{d}\right)^{(|C|)} \prod_{c \in C} (1-d)^{(|c|-1)}.$$

The proof of Theorem 2.1 follows the lines of Miller (2018)'s derivation. We need the next two technical results, which we will prove in Section 3. Let $C_z$ denote the partition of $[n]$ induced by $z$ for any $z \in \mathbb{N}^n$, and let $k_n$ be the number of parts in the partition. We define $m(z) = \max\{z_1, \ldots, z_n\}$ and $g_j(z) = |\{i : z_i \geq j\}|$.

Proposition 2.2.
For any $z \in \mathbb{N}^n$, the marginal distribution of the allocation variables vector $\mathbf{z} = (z_1, \ldots, z_n)$ is given by

$$P(\mathbf{z} = z) = \frac{1}{(\alpha)^{(n)}} \prod_{c \in C_z} \frac{\Gamma(|c| + 1 - d)}{\Gamma(1-d)} \prod_{j=1}^{m(z)} \frac{\alpha + (j-1)d}{g_j(z) + \alpha + (j-1)d}.$$

Lemma 2.3.
For any partition $C$ of $[n]$,

$$\sum_{z \in \mathbb{N}^n} \mathbb{1}(C_z = C) \prod_{j=1}^{m(z)} \frac{\alpha + (j-1)d}{g_j(z) + \alpha + (j-1)d} = \frac{d^{|C|}}{\prod_{c \in C}(|c| - d)} \left(\frac{\alpha}{d}\right)^{(|C|)}.$$

Proof of Theorem 2.1.

$$\begin{aligned}
P(\mathbf{C} = C) &= \sum_{z \in \mathbb{N}^n} P(\mathbf{C} = C \mid \mathbf{z} = z) \, P(\mathbf{z} = z) \\
&\overset{(a)}{=} \sum_{z \in \mathbb{N}^n} \mathbb{1}(C_z = C) \, \frac{1}{(\alpha)^{(n)}} \prod_{c \in C_z} \frac{\Gamma(|c|+1-d)}{\Gamma(1-d)} \prod_{j=1}^{m(z)} \frac{\alpha+(j-1)d}{g_j(z)+\alpha+(j-1)d} \\
&= \frac{1}{(\alpha)^{(n)}} \prod_{c \in C} \frac{\Gamma(|c|+1-d)}{\Gamma(1-d)} \sum_{z \in \mathbb{N}^n} \mathbb{1}(C_z = C) \prod_{j=1}^{m(z)} \frac{\alpha+(j-1)d}{g_j(z)+\alpha+(j-1)d} \\
&\overset{(b)}{=} \frac{1}{(\alpha)^{(n)}} \prod_{c \in C} \frac{\Gamma(|c|+1-d)}{\Gamma(1-d)} \cdot \frac{d^{|C|}}{\prod_{c \in C}(|c|-d)} \left(\frac{\alpha}{d}\right)^{(|C|)} \\
&\overset{(c)}{=} \frac{1}{(\alpha)^{(n)}} \prod_{c \in C} (1-d)^{(|c|-1)} \prod_{c \in C} (|c|-d) \cdot \frac{d^{|C|}}{\prod_{c \in C}(|c|-d)} \left(\frac{\alpha}{d}\right)^{(|C|)} \\
&= \frac{d^{|C|}}{(\alpha)^{(n)}} \left(\frac{\alpha}{d}\right)^{(|C|)} \prod_{c \in C} (1-d)^{(|c|-1)},
\end{aligned}$$

where (a) is by Proposition 2.2, (b) is by Lemma 2.3, and (c) holds since $\Gamma(|c|+1-d) = (|c|-d)\Gamma(|c|-d)$ and $\Gamma(|c|-d)/\Gamma(1-d) = (1-d)^{(|c|-1)}$.

We require the following additional lemmas.
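Before proving these lemmas, Theorem 2.1 admits a quick numerical sanity check (an illustration, not part of the argument; the parameter values below are arbitrary): since (2) defines a probability distribution over partitions, it must sum to one over all set partitions of $[n]$. The following Python sketch verifies this for small $n$ by exhaustive enumeration:

```python
def set_partitions(elements):
    """Recursively enumerate all set partitions of a list, as lists of blocks."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        for i in range(len(partition)):  # insert `first` into an existing block
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        yield [[first]] + partition  # or open a new block of its own

def rising(x, n):
    """Rising factorial (x)^(n) = x (x + 1) ... (x + n - 1)."""
    out = 1.0
    for j in range(n):
        out *= x + j
    return out

def py_crp(blocks, alpha, d):
    """Pitman-Yor CRP probability of a given set partition, Equation (2)."""
    n = sum(len(c) for c in blocks)
    p = d ** len(blocks) / rising(alpha, n) * rising(alpha / d, len(blocks))
    for c in blocks:
        p *= rising(1.0 - d, len(c) - 1)
    return p

alpha, d, n = 1.3, 0.5, 5
total = sum(py_crp(C, alpha, d) for C in set_partitions(list(range(1, n + 1))))
print(total)  # numerically 1, as a valid partition distribution requires
```

The enumeration visits the Bell number of partitions (52 for $n = 5$), so this check is only feasible for small $n$.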
Lemma 3.1.
For $a + c > 0$ and $b + d > 0$, if $y \sim \text{Beta}(a, b)$, then

$$\mathbb{E}[y^c (1-y)^d] = \frac{B(a+c, b+d)}{B(a, b)},$$

where $B$ denotes the beta function.

Proof.

$$\mathbb{E}[y^c (1-y)^d] = \int_0^1 y^c (1-y)^d \, \frac{y^{a-1}(1-y)^{b-1}}{B(a,b)} \, \mathrm{d}y = \frac{1}{B(a,b)} \int_0^1 y^{a+c-1}(1-y)^{b+d-1} \, \mathrm{d}y = \frac{B(a+c, b+d)}{B(a,b)}.$$

Let $S_{k_n}$ denote the set of $k_n!$ permutations of $[k_n]$. The following lemma is key for proving Lemma 2.3.

Lemma 3.2. For any $n_1, \ldots, n_{k_n} \in \mathbb{N}$,

$$\sum_{\sigma \in S_{k_n}} \prod_{i=1}^{k_n} \frac{1}{a_i(\sigma) - (k_n - i + 1)d} = \frac{1}{\prod_{i=1}^{k_n} (n_i - d)},$$

where $a_i(\sigma) = n_{\sigma_i} + n_{\sigma_{i+1}} + \cdots + n_{\sigma_{k_n}}$.

Proof. Consider the process of sampling without replacement $k_n$ times from an urn containing $k_n$ balls. The balls have sizes $n_1 - d, \ldots, n_{k_n} - d$, and the probability of drawing ball $i$ is proportional to its size $n_i - d$. Thus, writing $n = n_1 + \cdots + n_{k_n}$, for any permutation $\sigma \in S_{k_n}$ we have

$$p(\sigma_1) = \frac{n_{\sigma_1} - d}{n - k_n d} = \frac{n_{\sigma_1} - d}{a_1(\sigma) - k_n d},$$

$$p(\sigma_2 \mid \sigma_1) = \frac{n_{\sigma_2} - d}{n - n_{\sigma_1} - (k_n - 1)d} = \frac{n_{\sigma_2} - d}{a_2(\sigma) - (k_n - 1)d},$$

$$p(\sigma_i \mid \sigma_1, \ldots, \sigma_{i-1}) = \frac{n_{\sigma_i} - d}{n - n_{\sigma_1} - \cdots - n_{\sigma_{i-1}} - (k_n - i + 1)d} = \frac{n_{\sigma_i} - d}{a_i(\sigma) - (k_n - i + 1)d}.$$

Therefore,

$$p(\sigma) = p(\sigma_1) \, p(\sigma_2 \mid \sigma_1) \cdots p(\sigma_{k_n} \mid \sigma_1, \ldots, \sigma_{k_n - 1}) = \prod_{i=1}^{k_n} \frac{n_{\sigma_i} - d}{a_i(\sigma) - (k_n - i + 1)d}. \qquad (4)$$

This way, we construct a distribution on $S_{k_n}$, so we know that $\sum_{\sigma \in S_{k_n}} p(\sigma) = 1$. Applying this to Equation (4) and dividing both sides by $(n_{\sigma_1} - d) \cdots (n_{\sigma_{k_n}} - d) = (n_1 - d) \cdots (n_{k_n} - d)$ gives the result.

Lemma 3.3.
Let $b_i \in \mathbb{N}$ for $i \in \{1, \ldots, k_n\}$ and let $b_0 = 0$. We define $\bar{b}_i = b_0 + b_1 + \cdots + b_i$. Then

$$\prod_{i=1}^{k_n} \sum_{b_i \in \mathbb{N}} \frac{\left(\frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}}{\left(\frac{a_i}{d} + \frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}} = \frac{\left(\frac{\alpha}{d}\right)^{(k_n)}}{\prod_{i=1}^{k_n} \left(\frac{a_i}{d} - (k_n + 1 - i)\right)}.$$

Proof.
Let $A_j$ denote the intermediate sum $A_j = \prod_{i=j}^{k_n} \sum_{b_i \in \mathbb{N}} \frac{(\frac{\alpha}{d} + \bar{b}_{i-1})^{(b_i)}}{(\frac{a_i}{d} + \frac{\alpha}{d} + \bar{b}_{i-1})^{(b_i)}}$. We show by induction, decreasing from $j = k_n$ to $j = 1$, that

$$A_j = \frac{\left(\frac{\alpha}{d} + \bar{b}_{j-1}\right)^{(k_n - j + 1)}}{\prod_{i=j}^{k_n} \left(\frac{a_i}{d} - (k_n + 1 - i)\right)}. \qquad (5)$$

When $j = k_n$, we have

$$A_{k_n} = \sum_{b_{k_n} \in \mathbb{N}} \frac{\left(\frac{\alpha}{d} + \bar{b}_{k_n - 1}\right)^{(b_{k_n})}}{\left(\frac{a_{k_n}}{d} + \frac{\alpha}{d} + \bar{b}_{k_n - 1}\right)^{(b_{k_n})}} = \sum_{b_{k_n} \in \mathbb{N}} \mathbb{E}\big[X^{b_{k_n}}\big],$$

where $X \sim \text{Beta}\big(\frac{\alpha}{d} + \bar{b}_{k_n - 1}, \frac{a_{k_n}}{d}\big)$. We have that

$$\sum_{b_{k_n} \in \mathbb{N}} \mathbb{E}\big[X^{b_{k_n}}\big] = \mathbb{E}\Big[\sum_{b_{k_n} \in \mathbb{N}} X^{b_{k_n}}\Big] = \mathbb{E}\left[\frac{X}{1 - X}\right] = \frac{\alpha + d\,\bar{b}_{k_n - 1}}{a_{k_n} - d},$$

due to Lemma 3.1, which proves the initialization for (5).

We now consider the case of an arbitrary $j$, greater than 0 and less than $k_n$. By the induction hypothesis, we have that Equation (5) holds for $j + 1$, that is,

$$A_{j+1} = \frac{\left(\frac{\alpha}{d} + \bar{b}_j\right)^{(k_n - j)}}{\prod_{i=j+1}^{k_n} \left(\frac{a_i}{d} - (k_n + 1 - i)\right)}.$$

Therefore,

$$A_j = \sum_{b_j \in \mathbb{N}} \frac{\left(\frac{\alpha}{d} + \bar{b}_{j-1}\right)^{(b_j)}}{\left(\frac{a_j}{d} + \frac{\alpha}{d} + \bar{b}_{j-1}\right)^{(b_j)}} \prod_{i=j+1}^{k_n} \sum_{b_i \in \mathbb{N}} \frac{\left(\frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}}{\left(\frac{a_i}{d} + \frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}} = \sum_{b_j \in \mathbb{N}} \frac{\left(\frac{\alpha}{d} + \bar{b}_{j-1}\right)^{(b_j)}}{\left(\frac{a_j}{d} + \frac{\alpha}{d} + \bar{b}_{j-1}\right)^{(b_j)}} \frac{\left(\frac{\alpha}{d} + \bar{b}_j\right)^{(k_n - j)}}{\prod_{i=j+1}^{k_n} \left(\frac{a_i}{d} - (k_n + 1 - i)\right)}.$$

Rearranging the rising factorials in the numerator, we can write

$$\Big(\frac{\alpha}{d} + \bar{b}_{j-1}\Big)^{(b_j)} \Big(\frac{\alpha}{d} + \bar{b}_j\Big)^{(k_n - j)} = \Big(\frac{\alpha}{d} + \bar{b}_{j-1}\Big)^{(b_j)} \Big(\frac{\alpha}{d} + \bar{b}_{j-1} + b_j\Big)^{(k_n - j)} = \Big(\frac{\alpha}{d} + \bar{b}_{j-1}\Big)^{(b_j + k_n - j)} = \Big(\frac{\alpha}{d} + \bar{b}_{j-1}\Big)^{(k_n - j)} \Big(\frac{\alpha}{d} + \bar{b}_{j-1} + k_n - j\Big)^{(b_j)},$$

and thus factorize the terms independent of $b_j$ in order to obtain

$$A_j = \frac{\left(\frac{\alpha}{d} + \bar{b}_{j-1}\right)^{(k_n - j)}}{\prod_{i=j+1}^{k_n} \left(\frac{a_i}{d} - (k_n + 1 - i)\right)} \sum_{b_j \in \mathbb{N}} \frac{\left(\frac{\alpha}{d} + \bar{b}_{j-1} + k_n - j\right)^{(b_j)}}{\left(\frac{a_j}{d} + \frac{\alpha}{d} + \bar{b}_{j-1}\right)^{(b_j)}}.$$

The sum above can be rewritten, using $X \sim \text{Beta}\big(\frac{\alpha}{d} + \bar{b}_{j-1} + (k_n - j), \frac{a_j}{d} - (k_n - j)\big)$, as

$$\sum_{b_j \in \mathbb{N}} \mathbb{E}\big[X^{b_j}\big] = \mathbb{E}\left[\frac{X}{1 - X}\right] = \frac{\frac{\alpha}{d} + \bar{b}_{j-1} + (k_n - j)}{\frac{a_j}{d} - (k_n + 1 - j)}.$$
Putting this all together,

$$A_j = \frac{\frac{\alpha}{d} + \bar{b}_{j-1} + (k_n - j)}{\frac{a_j}{d} - (k_n + 1 - j)} \cdot \frac{\left(\frac{\alpha}{d} + \bar{b}_{j-1}\right)^{(k_n - j)}}{\prod_{i=j+1}^{k_n} \left(\frac{a_i}{d} - (k_n + 1 - i)\right)} = \frac{\left(\frac{\alpha}{d} + \bar{b}_{j-1}\right)^{(k_n - j + 1)}}{\prod_{i=j}^{k_n} \left(\frac{a_i}{d} - (k_n + 1 - i)\right)},$$

which proves the desired result for $j$. By induction, this result is true for all $j \in \{1, \ldots, k_n\}$. Letting $j = 1$ gives the result stated in the lemma, since $\bar{b}_0 = b_0 = 0$.

Proof of Proposition 2.2.
For simplicity, we fix the allocation variable vector to a value $z$ and denote $m(z)$ by $m$ and $g_j(z)$ by $g_j$. We have

$$P(\mathbf{z} = z \mid \pi_1, \ldots, \pi_m) = \prod_{i=1}^n \pi_{z_i} = \prod_{j=1}^m \pi_j^{e_j},$$

where $e_j = |\{i : z_i = j\}|$. Thus,

$$P(\mathbf{z} = z \mid v_1, \ldots, v_m) = \prod_{j=1}^m \Big( v_j \prod_{i=1}^{j-1} (1 - v_i) \Big)^{e_j} = \prod_{j=1}^m v_j^{e_j} (1 - v_j)^{f_j},$$

where $f_j = |\{i : z_i > j\}|$. Therefore,

$$\begin{aligned}
P(\mathbf{z} = z) &= \int P(\mathbf{z} = z \mid v_1, \ldots, v_m) \, p(v_1, \ldots, v_m) \, \mathrm{d}v_1 \cdots \mathrm{d}v_m \\
&= \int \Big( \prod_{j=1}^m v_j^{e_j} (1 - v_j)^{f_j} \Big) p_1(v_1) \cdots p_m(v_m) \, \mathrm{d}v_1 \cdots \mathrm{d}v_m \\
&= \prod_{j=1}^m \int v_j^{e_j} (1 - v_j)^{f_j} \, p_j(v_j) \, \mathrm{d}v_j \\
&\overset{(a)}{=} \prod_{j=1}^m \frac{B(e_j + 1 - d, f_j + \alpha + jd)}{B(1 - d, \alpha + jd)} \\
&= \prod_{j=1}^m \frac{\Gamma(e_j + 1 - d) \, \Gamma(f_j + \alpha + jd) \, \Gamma(\alpha + (j-1)d + 1)}{\Gamma(e_j + f_j + \alpha + (j-1)d + 1) \, \Gamma(1 - d) \, \Gamma(\alpha + jd)} \\
&\overset{(b)}{=} \prod_{j=1}^m \frac{\Gamma(e_j + 1 - d)}{\Gamma(1 - d)} \prod_{j=1}^m \frac{\Gamma(g_{j+1} + \alpha + jd)}{\Gamma(g_j + \alpha + (j-1)d + 1)} \prod_{j=1}^m \frac{\Gamma(\alpha + (j-1)d + 1)}{\Gamma(\alpha + jd)} \\
&\overset{(c)}{=} \prod_{j=1}^m \frac{\Gamma(e_j + 1 - d)}{\Gamma(1 - d)} \prod_{j=1}^m \frac{\alpha + (j-1)d}{g_j + \alpha + (j-1)d} \cdot \frac{\Gamma(g_{m+1} + \alpha + md) \, \Gamma(\alpha)}{\Gamma(g_1 + \alpha) \, \Gamma(\alpha + md)} \\
&\overset{(d)}{=} \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{c \in C_z} \frac{\Gamma(|c| + 1 - d)}{\Gamma(1 - d)} \prod_{j=1}^m \frac{\alpha + (j-1)d}{g_j + \alpha + (j-1)d},
\end{aligned}$$

where step (a) is by Lemma 3.1, step (b) holds since $f_j = g_{j+1}$ and $g_j = e_j + f_j$, step (c) since $\Gamma(x + 1) = x\Gamma(x)$, and step (d) since $g_1 = n$ and $g_{m+1} = 0$.

Proof of Lemma 2.3.
As before, we denote the parts of $C$ by $c_1, \ldots, c_{k_n}$, where $k_n = |C|$, and we write $n_i = |c_i|$. We denote the distinct values taken on by $z_1, \ldots, z_n$ by $j_1 < \cdots < j_{k_n}$. We define $j_0 = b_0 = 0$, $b_i = j_i - j_{i-1}$, and $\bar{b}_i = b_1 + \cdots + b_i$ for $i \in \{1, \ldots, k_n\}$. We use the notation $a_i(\sigma) = n_{\sigma_i} + \cdots + n_{\sigma_{k_n}}$, where $\sigma$ is the permutation of $[k_n]$ such that $c_{\sigma_i} = \{\ell : z_\ell = j_i\}$. Then, for any $z \in \mathbb{N}^n$ such that $C_z = C$,

$$\prod_{j=1}^{m(z)} \frac{\alpha + (j-1)d}{g_j(z) + \alpha + (j-1)d} = \prod_{j=1}^{m(z)} \frac{\frac{\alpha}{d} + j - 1}{\frac{g_j(z)}{d} + \frac{\alpha}{d} + j - 1} = \prod_{i=1}^{k_n} \prod_{j=\bar{b}_{i-1}+1}^{\bar{b}_i} \frac{\frac{\alpha}{d} + j - 1}{\frac{g_j(z)}{d} + \frac{\alpha}{d} + j - 1} = \prod_{i=1}^{k_n} \frac{\left(\frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}}{\left(\frac{a_i(\sigma)}{d} + \frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}},$$

because $g_j(z) = a_i(\sigma)$ for $\bar{b}_{i-1} < j \leq \bar{b}_i$. It follows from the definition of $b = (b_1, \ldots, b_{k_n})$ and $\sigma$ that there is a one-to-one correspondence between $\{z \in \mathbb{N}^n : C_z = C\}$ and $\{(\sigma, b) : \sigma \in S_{k_n}, b \in \mathbb{N}^{k_n}\}$. Therefore,

$$\begin{aligned}
\sum_{z \in \mathbb{N}^n} \mathbb{1}(C_z = C) \prod_{j=1}^{m(z)} \frac{\alpha + (j-1)d}{g_j(z) + \alpha + (j-1)d} &= \sum_{\sigma \in S_{k_n}} \sum_{b \in \mathbb{N}^{k_n}} \prod_{i=1}^{k_n} \frac{\left(\frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}}{\left(\frac{a_i(\sigma)}{d} + \frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}} \\
&= \sum_{\sigma \in S_{k_n}} \prod_{i=1}^{k_n} \sum_{b_i \in \mathbb{N}} \frac{\left(\frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}}{\left(\frac{a_i(\sigma)}{d} + \frac{\alpha}{d} + \bar{b}_{i-1}\right)^{(b_i)}} \\
&\overset{(a)}{=} \sum_{\sigma \in S_{k_n}} \frac{\left(\frac{\alpha}{d}\right)^{(k_n)}}{\prod_{i=1}^{k_n} \left(\frac{a_i(\sigma)}{d} - (k_n - i + 1)\right)} \\
&= d^{k_n} \left(\frac{\alpha}{d}\right)^{(k_n)} \sum_{\sigma \in S_{k_n}} \prod_{i=1}^{k_n} \frac{1}{a_i(\sigma) - (k_n - i + 1)d} \\
&\overset{(b)}{=} \frac{d^{k_n}}{\prod_{c \in C} (|c| - d)} \left(\frac{\alpha}{d}\right)^{(k_n)},
\end{aligned}$$

where step (a) follows from Lemma 3.3 and step (b) follows from Lemma 3.2.

Acknowledgement
The authors would like to thank Bernardo Nipoti for fruitful discussions that initiated this work.

References
Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2:1152–1174.

Arbel, J., De Blasi, P., and Prünster, I. (2018). Stochastic approximations to the Pitman–Yor process. Bayesian Analysis, in press.

Arbel, J., Favaro, S., Nipoti, B., and Teh, Y. W. (2017). Bayesian nonparametric inference for discovery probabilities: credible intervals and large sample asymptotics. Statistica Sinica, 27:839–858.

Bassetti, F., Casarin, R., and Leisen, F. (2014). Beta-product dependent Pitman–Yor processes for Bayesian inference. Journal of Econometrics, 180(1):49–72.

Battiston, M., Favaro, S., Roy, D. M., and Teh, Y. W. (2018). A characterization of product-form exchangeable feature probability functions. The Annals of Applied Probability, 28(3):1423–1448.

Canale, A., Lijoi, A., Nipoti, B., and Prünster, I. (2017). On the Pitman–Yor process with spike and slab base measure. Biometrika, 104(3):681–697.

Caron, F., Neiswanger, W., Wood, F., Doucet, A., and Davy, M. (2017). Generalized Pólya urn for time-varying Pitman–Yor processes. Journal of Machine Learning Research, 18(27):1–32.

Clauset, A., Shalizi, C. R., and Newman, M. E. (2009). Power-law distributions in empirical data. SIAM Review, 51(4):661–703.

De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Prünster, I., and Ruggiero, M. (2015). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):212–229.

Derrida, B. (1981). Random-energy model: An exactly solvable model of disordered systems. Physical Review B, 24(5):2613.

Favaro, S., Lijoi, A., Mena, R., and Prünster, I. (2009). Bayesian non-parametric inference for species variety with a two-parameter Poisson–Dirichlet process prior. Journal of the Royal Statistical Society: Series B, 71:993–1008.

Favaro, S. and Walker, S. G. (2013). Slice sampling σ-stable Poisson–Kingman mixture models. Journal of Computational and Graphical Statistics, 22(4):830–847.

Feng, S. and Sun, W. (2010). Some diffusion processes associated with two parameter Poisson–Dirichlet distribution and Dirichlet process. Probability Theory and Related Fields, 148(3-4):501–525.

Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.

Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161–173.

Jara, A., Lesaffre, E., De Iorio, M., and Quintana, F. (2010). Bayesian semiparametric inference for multivariate doubly-interval-censored data. The Annals of Applied Statistics, 4(4):2126–2149.

Kerov, S. V. (2006). Coherent random allocations, and the Ewens–Pitman formula. Journal of Mathematical Sciences, 138(3):5699–5710.

Lo, A. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12(1):351–357.

Miller, J. W. (2018). An elementary derivation of the Chinese restaurant process from Sethuraman's stick-breaking process. arXiv preprint arXiv:1801.00513.

Miller, J. W. and Harrison, M. T. (2014). Inconsistency of Pitman–Yor process mixtures for the number of components. The Journal of Machine Learning Research, 15(1):3333–3370.

Navarrete, C., Quintana, F. A., and Mueller, P. (2008). Some issues in nonparametric Bayesian modeling using species sampling models. Statistical Modelling, 8(1):3–21.

Ni, Y., Müller, P., Zhu, Y., and Ji, Y. (2018). Heterogeneous reciprocal graphical models. Biometrics, 74(2):606–615.

Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39.

Petrov, L. (2009). Two-parameter family of diffusion processes in the Kingman simplex. Functional Analysis and Its Applications, 43:279–296.

Pitman, J. (1995). Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102(2):145–158.

Pitman, J. (2003). Poisson–Kingman partitions. Lecture Notes–Monograph Series, pages 1–34.

Pitman, J. and Yor, M. (1997). The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900.

Scarpa, B. and Dunson, D. B. (2009). Bayesian hierarchical functional data analysis via contaminated informative priors. Biometrics, 65(3):772–780.

Scricciolo, C. et al. (2014). Adaptive Bayesian density estimation in L^p-metrics with Pitman–Yor or normalized inverse-Gaussian process kernel mixtures. Bayesian Analysis, 9(2):475–520.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.

Sudderth, E. B. and Jordan, M. I. (2009). Shared segmentation of natural scenes using dependent Pitman–Yor processes. In Advances in Neural Information Processing Systems 21, pages 1585–1592. Curran Associates, Inc.

Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman–Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics.

Vershik, A., Yor, M., and Tsilevich, N. (2004). On the Markov–Krein identity and quasi-invariance of the gamma process. Journal of Mathematical Sciences, 121(3):2303–2310.

Wood, F., Gasthaus, J., Archambeau, C., James, L., and Teh, Y. W. (2011). The sequence memoizer.