Distance Dependent Chinese Restaurant Processes
David M. Blei
Department of Computer Science, Princeton University, Princeton, NJ, USA. E-mail: [email protected]
Peter I. Frazier
School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14853, USA. E-mail: [email protected]
1. Introduction
Dirichlet process (DP) mixture models provide a valuable suite of flexible clustering algorithms for high dimensional data analysis. Such models have been adapted to text modeling (Teh et al., 2006; Goldwater et al., 2006), computer vision (Sudderth et al., 2005), sequential models (Dunson, 2006; Fox et al., 2007), and computational biology (Xing et al., 2007). Moreover, recent years have seen significant advances in scalable approximate posterior inference methods for this class of models (Liang et al., 2007; Daume, 2007; Blei and Jordan, 2005). DP mixtures have become a valuable tool in modern machine learning.

DP mixtures can be described via the Chinese restaurant process (CRP), a distribution over partitions that embodies the assumed prior distribution over cluster structures (Pitman, 2002). The CRP is fancifully described by a sequence of customers sitting down at the tables of a Chinese restaurant. Each customer sits at a previously occupied table with probability proportional to the number of customers already sitting there, and at a new table with probability proportional to a concentration parameter. In a CRP mixture, customers are identified with data points, and data sitting at the same table belong to the same cluster. Since the number of occupied tables is random, this provides a flexible model in which the number of clusters is determined by the data.

The customers of a CRP are exchangeable: under any permutation of their ordering, the probability of a particular configuration is the same. This property is essential to connect the CRP mixture to the DP mixture, for the following reason. The Dirichlet process is a distribution over distributions, and the DP mixture assumes that the random parameters governing the observations are drawn from a distribution drawn from a Dirichlet process. The observations are conditionally independent given the random distribution, and thus they must be marginally exchangeable. If the CRP mixture did not yield an exchangeable distribution, it could not be equivalent to a DP mixture. That these parameters will exhibit a clustering structure is due to the discreteness of distributions drawn from a Dirichlet process (Ferguson, 1973; Antoniak, 1974; Blackwell, 1973).

Exchangeability is a reasonable assumption in some clustering applications, but in many it is not. Consider data ordered in time, such as a time-stamped collection of news articles. In this setting, each article should tend to cluster with other articles that are nearby in time. Or, consider spatial data, such as pixels in an image or measurements at geographic locations. Here again, each datum should tend to cluster with other data that are nearby in space. While the traditional CRP mixture provides a flexible prior over partitions of the data, it cannot accommodate such non-exchangeability.

In this paper, we develop the distance dependent Chinese restaurant process, a new CRP in which the random seating assignment of the customers depends on the distances between them. These distances can be based on time, space, or other characteristics. Distance dependent CRPs can recover a number of existing dependent distributions (Ahmed and Xing, 2008; Zhu et al., 2005). They can also be arranged to recover the traditional CRP distribution. The distance dependent CRP expands the palette of infinite clustering models, allowing for many useful non-exchangeable distributions as priors on partitions.
The key to the distance dependent CRP is that it represents the partition with customer assignments, rather than table assignments. While the traditional CRP connects customers to tables, the distance dependent CRP connects customers to other customers. The partition of the data, i.e., the table assignment representation, arises from these customer connections. When used in a Bayesian model, the customer assignment representation allows for a straightforward Gibbs sampling algorithm for approximate posterior inference (see Section 3). This provides a new tool for flexible clustering of non-exchangeable data, such as time-series or spatial data, as well as a new algorithm for inference with traditional CRP mixtures.
Related work.
Several other non-exchangeable priors on partitions have appeared in the recent research literature. Some can be formulated as distance dependent CRPs, while others represent a different class of models. The most similar to the distance dependent CRP is the probability distribution on partitions presented in Dahl (2008). Like the distance dependent CRP, this distribution may be constructed through a collection of independent priors on customer assignments to other customers, which then implies a prior on partitions. Unlike the distance dependent CRP, however, the distribution presented in Dahl (2008) requires normalization of these customer assignment probabilities. The model in Dahl (2008) may always be written as a distance dependent CRP, although the normalization requirement prevents the reverse from being true (see Section 2). We note that Dahl (2008) does not present an algorithm for sampling from the posterior, but the Gibbs sampler presented here for the distance dependent CRP can also be employed for posterior inference in that model.

There are a number of Bayesian nonparametric models that allow for dependence between (marginal) partition membership probabilities. These include the dependent Dirichlet process (MacEachern, 1999).

This article is an expanded version of our shorter conference paper on this subject (Blei and Frazier, 2010). This version contains new perspectives on inference and new results.

We avoid calling these clustering models "Bayesian nonparametric" (BNP) because they cannot necessarily be cast as a mixture model originating from a random measure, such as the DP mixture model. The DP mixture is BNP because it includes a prior over the infinite space of probability densities, and the CRP mixture is only BNP in its connection to the DP mixture. That said, most applications of this machinery are based around letting the data determine their number of clusters. The fact that it actually places a distribution on the infinite-dimensional space of probability measures is usually not exploited.
2. Distance dependent CRPs
The Chinese restaurant process (CRP) is a probability distribution over partitions (Pitman, 2002). It is described by considering a Chinese restaurant with an infinite number of tables and a sequential process by which customers enter the restaurant and each sit down at a randomly chosen table. After $N$ customers have sat down, their configuration at the tables represents a random partition. Customers sitting at the same table are in the same cluster.

Fig 1. An illustration of the distance dependent CRP. The process operates at the level of customer assignments, where each customer chooses either another customer or no customer according to Eq. (2). Customers that chose not to connect to another are indicated with a self link. The table assignments $z(c)$, a representation of the partition that is familiar to the CRP, are derived from the customer assignments.

In the traditional CRP, the probability of a customer sitting at a table is computed from the number of other customers already sitting at that table. Let $z_i$ denote the table assignment of the $i$th customer, assume that the customers $z_{1:(i-1)}$ occupy $K$ tables, and let $n_k$ denote the number of customers sitting at table $k$. The traditional CRP draws each $z_i$ sequentially,

$$p(z_i = k \mid z_{1:(i-1)}, \alpha) \propto \begin{cases} n_k & \text{for } k \le K \\ \alpha & \text{for } k = K + 1, \end{cases} \qquad (1)$$

where $\alpha$ is a given scaling parameter. When all $N$ customers have been seated, their table assignments provide a random partition. Though the process is described sequentially, the CRP is exchangeable. The probability of a particular partition of $N$ customers is invariant to the order in which they sat down.

We now introduce the distance dependent CRP. In this distribution, the seating plan probability is described in terms of the probability of a customer sitting with each of the other customers. The allocation of customers to tables is a by-product of this representation. If two customers are reachable by a sequence of interim customer assignments, then they sit at the same table. This is illustrated in Figure 1.

Let $c_i$ denote the $i$th customer assignment, the index of the customer with whom the $i$th customer is sitting. Let $d_{ij}$ denote the distance measurement between customers $i$ and $j$, let $D$ denote the set of all distance measurements between customers, and let $f$ be a decay function (described in more detail below). The distance dependent CRP independently draws the customer assignments conditioned on the distance measurements,

$$p(c_i = j \mid D, \alpha) \propto \begin{cases} f(d_{ij}) & \text{if } j \ne i \\ \alpha & \text{if } i = j. \end{cases} \qquad (2)$$

Fig 2. Draws from sequential CRPs for four decay functions (inset in each panel): (1) the traditional CRP; (2) the window decay function; (3) the exponential decay function; (4) the logistic decay function. The table assignments illustrated are derived from the customer assignments drawn from the distance dependent CRP. The decay functions are functions of the distance between the current customer and each previous customer.

Notice the customer assignments do not depend on other customer assignments, only the distances between customers. Also notice that $j$ ranges over the entire set of customers, and so any customer may sit with any other. (If desirable, restrictions are possible through the distances $d_{ij}$;
see the discussion below of sequential CRPs.)

As we mentioned above, customers are assigned to tables by considering sets of customers that are reachable from each other through the customer assignments. (Again, see Figure 1.) We denote the induced table assignments $z(c)$, and notice that many configurations of customer assignments $c$ might lead to the same table assignment. Finally, customer assignments can produce a cycle, e.g., customer 1 sits with 2 and customer 2 sits with 1. This still determines a valid table assignment: all customers sitting in a cycle are assigned to the same table.

By being defined over customer assignments, the distance dependent CRP provides a more expressive distribution over partitions than models based on table assignments. This distribution is determined by the nature of the distance measurements and the decay function. For example, if each customer is time-stamped, then $d_{ij}$ might be the time difference between customers $i$ and $j$; the decay function can encourage customers to sit with those that are contemporaneous. If each customer is associated with a location in space, then $d_{ij}$ might be the Euclidean distance between them; the decay function can encourage customers to sit with those that are in proximity. For many sets of distance measurements, the resulting distribution over partitions is no longer exchangeable; this is an appropriate distribution to use when exchangeability is not a reasonable assumption.
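To make the customer-assignment representation concrete, here is a minimal sketch (ours, not the authors' R implementation) that draws customer assignments from Eq. (2) and recovers the table assignments $z(c)$ as the connected components of the link graph. The function names and the example's sequential distances are our own assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def draw_customer_assignments(D, f, alpha, rng):
    """Draw each c_i from Eq. (2): weight f(d_ij) for j != i, alpha for the self link."""
    N = D.shape[0]
    c = np.empty(N, dtype=int)
    for i in range(N):
        w = np.array([alpha if j == i else f(D[i, j]) for j in range(N)])
        c[i] = rng.choice(N, p=w / w.sum())
    return c

def table_assignments(c):
    """z(c): customers reachable through the links (including cycles) share a table."""
    N = len(c)
    links = csr_matrix((np.ones(N), (np.arange(N), c)), shape=(N, N))
    n_tables, z = connected_components(links, directed=True, connection="weak")
    return z

rng = np.random.default_rng(0)
N = 10
# Sequential distances: d_ij = i - j for j < i, and infinity otherwise.
D = np.array([[float(i - j) if j < i else np.inf for j in range(N)] for i in range(N)])
f = lambda d: np.exp(-d / 2.0) if np.isfinite(d) else 0.0  # exponential decay, a = 2
c = draw_customer_assignments(D, f, alpha=1.0, rng=rng)
print("links:", c)
print("tables z(c):", table_assignments(c))
```

Cycles (e.g., two customers choosing each other) are handled automatically: weak connectivity places all members of a cycle at the same table, matching the description above.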
Decay functions.
In general, the decay function mediates how distances between customers affect the resulting distribution over partitions. We assume that the decay function $f$ is non-increasing, takes non-negative finite values, and satisfies $f(\infty) = 0$. We consider several types of decay as examples, all of which satisfy these nonrestrictive assumptions.

The window decay $f(d) = 1[d < a]$ only considers customers that are at most distance $a$ from the current customer. The exponential decay $f(d) = e^{-d/a}$ decays the probability of linking to an earlier customer exponentially with the distance to the current customer. The logistic decay $f(d) = \exp(-d + a) / (1 + \exp(-d + a))$ is a smooth version of the window decay. Each of these affects the distribution over partitions in a different way.
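The three decay functions are simple to write down; the sketch below (our own illustration, with `a` the decay parameter) evaluates each at a few distances.

```python
import numpy as np

def window_decay(d, a):       # f(d) = 1[d < a]
    return 1.0 * (d < a)

def exponential_decay(d, a):  # f(d) = exp(-d / a)
    return np.exp(-d / a)

def logistic_decay(d, a):     # smooth version of the window decay
    return np.exp(-d + a) / (1.0 + np.exp(-d + a))

for f in (window_decay, exponential_decay, logistic_decay):
    print(f.__name__, [float(np.round(f(d, 3.0), 3)) for d in (0.0, 2.0, 5.0, np.inf)])
```

All three return 0 at $d = \infty$, as the assumptions above require.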
Sequential CRPs and the traditional CRP.

With certain types of distance measurements and decay functions, we obtain the special case of sequential CRPs. A sequential CRP is constructed by assuming that $d_{ij} = \infty$ for those $j > i$. With our previous requirement that $f(\infty) = 0$, this guarantees that no customer can be assigned to a later customer, i.e., $p(c_i \le i \mid D) = 1$. The sequential CRP lets us define alternative formulations of some previous time-series models. For example, with a window decay function and $a = 1$, we recover the model studied in Ahmed and Xing (2008). With a logistic decay function, we recover the model studied in Zhu et al. (2005). In our empirical study we will examine sequential models in detail.

The probability distribution over partitions defined by Eq. (2) is similar to the distribution over partitions presented in Dahl (2008). That probability distribution may be specified by Eq. (2) if $f(d_{ij})$ is replaced by a non-negative value $h_{ij}$ that satisfies the normalization requirement $\sum_{i \ne j} h_{ij} = N - 1$ for each $j$. Thus, the model presented in Dahl (2008) may be understood as a normalized version of the distance dependent CRP. To write this model as a distance dependent CRP, take $d_{ij} = 1/h_{ij}$ and $f(d) = 1/d$ (with $1/0 = \infty$ and $1/\infty = 0$), so that $f(d_{ij}) = h_{ij}$.

The sequential CRP can re-express the traditional CRP. Specifically, the traditional CRP is recovered when $f(d) = 1$ for $d \ne \infty$ and $d_{ij} < \infty$ for $j < i$. To see this, consider the marginal distribution of a customer sitting at a particular table, given the previous customers' assignments. The probability of being assigned to each of the other customers at that table is proportional to one. Thus, the probability of sitting at that table is proportional to the number of customers already sitting there. Moreover, the probability of not being assigned to a previous customer is proportional to the scaling parameter $\alpha$. This is precisely the traditional CRP distribution of Eq. (1). Although these models are the same, the corresponding Gibbs samplers are different (see Section 5.4).

Even though the traditional CRP is described as a sequential process, it gives an exchangeable distribution. Thus, sequential CRPs, which include both the traditional CRP as well as non-exchangeable distributions, are more expressive than the traditional CRP.

Figure 2 illustrates seating assignments (at the table level) derived from draws from sequential CRPs with each of the decay functions described above, including the original CRP. (To adapt these settings to the sequential case, the distances are $d_{ij} = i - j$ for $j < i$ and $d_{ij} = \infty$ for $j > i$.) Compared to the traditional CRP, customers tend to sit at the same table with other nearby customers. We emphasize that sequential CRPs are only one type of distance dependent CRP. Other distances, combined with the formulation of Eq. (2), lead to a variety of other non-exchangeable distributions over partitions.
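As a quick sanity check of the recovery argument above, the snippet below (our own illustration; the table labels are made up) verifies that with $f(d) = 1$ for finite $d$, the ddCRP's implied table probabilities match Eq. (1).

```python
import numpy as np

alpha = 1.5
prev_tables = np.array([0, 0, 1, 0, 2, 1])   # table labels of customers 1..6 (assumed)
counts = np.bincount(prev_tables)            # n_k for each occupied table

# ddCRP view: the next customer links to each previous customer with weight
# f(d) = 1, or to itself with weight alpha; summing weights by table gives n_k.
total = len(prev_tables) + alpha
p_ddcrp = np.append([np.sum(prev_tables == k) for k in range(len(counts))], alpha) / total

# Traditional CRP view of Eq. (1): proportional to n_k, or alpha for a new table.
p_crp = np.append(counts, alpha) / total
assert np.allclose(p_ddcrp, p_crp)
print(p_crp)   # [0.4, 0.267, 0.133, 0.2] for these counts
```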
Marginal invariance.

The traditional CRP is marginally invariant: marginalizing over a particular customer gives the same probability distribution as if that customer were not included in the model at all. The distance dependent CRP does not generally have this property, allowing it to capture the way in which influence might be transmitted from one point to another. See Section 4 for a precise characterization of the class of distance dependent CRPs that are marginally invariant.

To see when this might be a relevant property, consider the goal of modeling preferences of people within a social network. The model used should reflect the fact that persons A and B are more likely to share preferences if they also share a common friend C. Any marginally invariant model, however, would insist that the distribution of the preferences of A and B is the same whether (1) they have no such common friend C, or (2) they do but his preferences are unobserved and hence marginalized out. In this setting, we might prefer a model that is not marginally invariant. Knowing that they have a common friend affects the probability that A and B share preferences, regardless of whether the friend's preferences are observed. A similar example is modeling the spread of disease. Suddenly discovering a city between two others, even if the status of that city is unobserved, should change our assessment of the probability that the disease travels between them.

We note, however, that if observations are missing then models that are not marginally invariant require that relevant conditional distributions be computed as ratios of normalizing constants. In contrast, marginally invariant models afford a more convenient factorization, and so allow easier computation. Even when faced with data that clearly deviates from marginal invariance, the modeler may be tempted to use a marginally invariant model, choosing computational convenience over fidelity to the data.

We have described a general formulation of the distance dependent CRP. We now describe two applications to Bayesian modeling of discrete data, one in a fully observed model and the other in a mixture model. These examples illustrate how one might use the posterior distribution of the partitions, given data and an assumed generating process based on the distance dependent CRP. We will focus on models of discrete data and we will use the terminology of document collections to describe these models. Thus, our observations are assumed to be collections of words from a fixed vocabulary, organized into documents.
Language modeling.
In the language modeling application, each document is associated with a distance dependent CRP, and its tables are embellished with IID draws from a base distribution over terms or words. (The documents share the same base distribution.) The generative process of words in a document is as follows. The data are first placed at tables via customer assignments, and then assigned to the word associated with their tables. Subsets of the data exhibit a partition structure by sharing the same table.

When using a traditional CRP, this is a formulation of a simple Dirichlet-smoothed language model. Alternatives to this model, such as those using the Pitman-Yor process, have also been applied in this setting (Teh, 2006; Goldwater et al., 2006). We consider a sequential CRP, which assumes that a word is more likely to occur near itself in a document. Words are still considered contagious (seeing a word once means we are likely to see it again), but the window of contagion is mediated by the decay function.

More formally, given a decay function $f$, sequential distances $D$, scaling parameter $\alpha$, and base distribution $G$ over discrete words, $N$ words are drawn as follows:

1. For each word $i \in \{1, \ldots, N\}$, draw assignment $c_i \sim \text{dist-CRP}(\alpha, f, D)$.
2. For each table $k \in \{1, 2, \ldots\}$, draw a word $w^*_k \sim G$.
3. For each word $i \in \{1, \ldots, N\}$, assign the word $w_i = w^*_{z(c)_i}$.

The notation $z(c)_i$ is the table assignment of the $i$th customer in the table assignments induced by the complete collection of customer assignments.

For each document, we observe a sequence of words $w_{1:N}$ from which we can infer their seating assignments in the distance dependent CRP. The partition structure of observations (that is, which words are the same as other words) indicates either that they share the same table in the seating arrangement, or that two tables share the same term drawn from $G$. We have not described the process sequentially, as one would with a traditional CRP, in order to emphasize the three stage process of the distance dependent CRP: first the customer assignments are drawn, then the table parameters, and then the observations are assigned to their corresponding parameters. However, the sequential distances $D$ guarantee that we can draw each word successively. This, in turn, means that we can easily construct a predictive distribution of future words given previous words. (See Section 3 below.)
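The following sketch (ours; the vocabulary and window size are toy assumptions, and $G$ is taken to be uniform) simulates the three steps for one document. In the sequential case each chain of links descends to a self-linked customer, so that root customer identifies the table.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["the", "peanut", "walnut", "science"]  # toy vocabulary; G is uniform over it
N, alpha, a = 12, 1.0, 3.0                       # words per document, scaling, window

# Step 1: draw customer assignments from a sequential ddCRP with window decay.
c = np.arange(N)
for i in range(1, N):
    w = np.array([1.0 if (i - j) < a else 0.0 for j in range(i)] + [alpha])
    c[i] = rng.choice(i + 1, p=w / w.sum())      # choosing i itself is the self link

# Steps 2 and 3: each table (root of a link chain) draws one word from G,
# and every word at that table is set to the table's word.
def root(i):
    while c[i] != i:
        i = c[i]
    return i

tables = np.array([root(i) for i in range(N)])
table_word = {t: rng.choice(vocab) for t in np.unique(tables)}
print([table_word[t] for t in tables])
```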
Mixture modeling.

The second model we study is akin to the CRP mixture or (equivalently) the DP mixture, but differs in that the mixture component for a data point depends on the mixture component for nearby data. Again, each table is endowed with a draw from a base distribution $G$, but here that draw is a distribution over mixture component parameters. In the document setting, observations are documents (as opposed to individual words), and $G$ is typically a Dirichlet distribution over distributions of words (Teh et al., 2006).

While we focus on text, these models apply to any discrete data, such as genetic data, and, with modification, to non-discrete data as well. That said, CRP-based methods have been extensively applied to text modeling and natural language processing (Teh et al., 2006; Johnson et al., 2007; Li et al., 2007; Blei et al., 2010).

The data are drawn as follows:

1. For each document $i \in \{1, \ldots, N\}$, draw assignment $c_i \sim \text{dist-CRP}(\alpha, f, D)$.
2. For each table $k \in \{1, 2, \ldots\}$, draw a parameter $\theta^*_k \sim G$.
3. For each document $i \in \{1, \ldots, N\}$, draw $w_i \sim F(\theta^*_{z(c)_i})$.

In Section 5, we will study the sequential CRP in this setting, choosing its structure so that contemporaneous documents are more likely to be clustered together. The distances $d_{ij}$ can be the differences between indices in the ordering of the data, or lags between external measurements of distance like date or time. (Spatial distances or distances based on other covariates can be used to define more general mixtures, but we leave these settings for future work.) Again, we have not defined the generative process sequentially but, as long as $D$ respects the assumptions of a sequential CRP, an equivalent sequential model is straightforward to define.
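For the mixture, only steps 2 and 3 change relative to the language model sketch: each table draws a parameter from $G$ and each document draws its words from that parameter. A minimal sketch under the Dirichlet/multinomial choices mentioned above (the table assignments, vocabulary size, and document length are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
V, doc_len = 4, 20                       # vocabulary size and words per document
tables = np.array([0, 0, 1, 0, 2, 2])    # z(c) for six documents, assumed given

theta = {k: rng.dirichlet(np.ones(V)) for k in np.unique(tables)}       # theta*_k ~ G
docs = np.vstack([rng.multinomial(doc_len, theta[k]) for k in tables])  # w_i ~ F(theta)
print(docs)  # one row of word counts per document; rows sharing a table share theta
```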
Relationship to dependent Dirichlet processes.

More generally, the distance dependent CRP mixture provides an alternative to the dependent Dirichlet process (DDP) mixture as an infinite clustering model that models dependencies between the latent component assignments of the data (MacEachern, 1999). The DDP has been extended to sequential, spatial, and other kinds of dependence (Griffin and Steel, 2006; Duan et al., 2007; Xue et al., 2007). In all these settings, statisticians have appealed to truncations of the stick-breaking representation for approximate posterior inference, citing the dependency between data as precluding the more efficient techniques that integrate out the component parameters and proportions. In contrast, distance dependent CRP mixtures are amenable to Gibbs sampling algorithms that integrate out these variables (see Section 3).

An alternative to the DDP formalism is the Bayesian density regression (BDR) model of Dunson et al. (2007). In BDR, each data point is associated with a random measure and is drawn from a mixture of per-data random measures where the mixture proportions are related to the distance between data points. Unlike the DDP, this model affords a Gibbs sampler where the random measures can be integrated out.

However, it is still different in spirit from the distance dependent CRP. Data are drawn from distributions that are similar to distributions of nearby data, and the particular values of nearby data impose softer constraints than those in the distance dependent CRP. As an extreme case, consider a random partition of the nodes of a network, where distances are defined in terms of the number of hops between nodes. Further, suppose that there are several disconnected components in this network, that is, pairs of nodes that are not reachable from each other. In the DDP model, these nodes are very likely not to be partitioned in the same group. In the ddCRP model, however, it is impossible for them to be grouped together.

We emphasize that DDP mixtures (and BDR) and distance dependent CRP mixtures are different classes of models. DDP mixtures are Bayesian nonparametric models, interpretable as data drawn from a random measure, while the distance dependent CRP mixtures generally are not. DDP mixtures exhibit marginal invariance, while distance dependent CRPs generally do not (see Section 4). In their ability to capture dependence, these two classes of models capture similar assumptions, but the appropriate choice of model depends on the modeling task at hand.
3. Posterior inference and prediction
The central computational problem for distance dependent CRP modeling is posterior inference, determining the conditional distribution of the hidden variables given the observations. This posterior is used for exploratory analysis of the data and how it clusters, and is needed to compute the predictive distribution of a new data point given a set of observations.

Regardless of the likelihood model, the posterior will be intractable to compute because the distance dependent CRP places a prior over a combinatorial number of possible customer configurations. In this section we provide a general strategy for approximating the posterior using Markov chain Monte Carlo (MCMC) sampling. This strategy can be used in either fully-observed or mixture settings, and can be used with arbitrary distance functions. (For example, in Section 5 we illustrate this algorithm with both sequential distance functions and graph-based distance functions, and in both fully-observed and mixture settings.)

In MCMC, we aim to construct a Markov chain whose stationary distribution is the posterior of interest. For distance dependent CRP models, the state of the chain is defined by $c_i$, the customer assignments for each data point. We will also consider $z(c)$, the table assignments that follow from the customer assignments (see Figure 1). Let $\eta = \{D, \alpha, f, G\}$ denote the set of model hyperparameters. It contains the distances $D$, the scaling factor $\alpha$, the decay function $f$, and the base measure $G$. Let $x$ denote the observations.

In Gibbs sampling, we iteratively draw from the conditional distribution of each latent variable given the other latent variables and observations. (This defines an appropriate Markov chain; see Neal (1993).) In distance dependent CRP models, the Gibbs sampler iteratively draws from

$$p(c_i^{\text{(new)}} \mid c_{-i}, x, \eta) \propto p(c_i^{\text{(new)}} \mid D, \alpha)\, p(x \mid z(c_{-i} \cup c_i^{\text{(new)}}), G). \qquad (3)$$

The first term is the distance dependent CRP prior from Eq. (2).

The second term is the likelihood of the observations under the partition given by $z(c_{-i} \cup c_i^{\text{(new)}})$. This can be thought of as removing the current link from the $i$th customer and then considering how each alternative new link affects the likelihood of the observations. Before examining this likelihood, we describe how removing and then replacing a customer link affects the underlying partition (i.e., table assignments).

To begin, consider the effect of removing a customer link. What is the difference between the partition $z(c)$ and $z(c_{-i})$? There are two cases.

The first case is that a table splits. This happens when $c_i$ is the only connection between the $i$th data point and a particular table. Upon removing $c_i$, the customers at its table are split in two: those customers pointing (directly or indirectly) to $i$ are at one table; the other customers previously seated with $i$ are at a different table. (See the change from the first to second rows of Figure 3.)

The second case is that there is no change. If the $i$th link is not the only connection between customer $i$ and his table, or if $c_i$ was a self-link ($c_i = i$), then the tables remain the same. In this case, $z(c_{-i}) = z(c)$.
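A small illustration (ours) of the link-removal bookkeeping: deleting customer $i$'s link amounts to resetting $c_i$ to a self link and recomputing the components, which either splits $i$'s table or leaves the partition unchanged.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def tables(c):
    """Table labels z(c) as weakly connected components of the link graph."""
    N = len(c)
    g = csr_matrix((np.ones(N), (np.arange(N), c)), shape=(N, N))
    return connected_components(g, directed=True, connection="weak")[1]

c = np.array([0, 0, 1, 2, 3, 0])          # a chain of links into customer 0 (assumed)
i = 2
c_minus_i = c.copy()
c_minus_i[i] = i                           # remove customer i's link (make it a self link)
print(tables(c), tables(c_minus_i))        # the single table splits into two
```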
Fig 3. An example of a single step of the Gibbs sampler, shown in both the customer link representation and the table assignment representation. Here we illustrate a scenario that highlights all the ways that the sampler can move: a table can be split when we remove the customer link before conditioning, and two tables can join when we resample that link. In the illustration, we sample the third customer's link; the initial links imply a partition of two tables, removing the third link splits one table into two (changing two customers' table assignments), and redrawing the third link merges two of the tables.

Now consider the effect of replacing the customer link. What is the difference between the partition $z(c_{-i})$ and $z(c_{-i} \cup c_i^{\text{(new)}})$? Again there are two cases. The first case is that $c_i^{\text{(new)}}$ joins two tables in $z(c_{-i})$. Upon adding $c_i^{\text{(new)}}$, the customers at its table become linked to another set of customers. (See the change from the second to third rows of Figure 3.)

The second case, as above, is that there is no change. This occurs if $c_i^{\text{(new)}}$ points to a customer that is already at its table under $z(c_{-i})$, or if $c_i^{\text{(new)}}$ is a self-link.

With the changed partition in hand, we now compute the likelihood term. We first compute the likelihood term for partition $z(c)$. The likelihood factors into a product of terms, each of which is the probability of the set of observations at each table. Let $|z(c)|$ be the number of tables and $z_k(c)$ be the set of indices that are assigned to table $k$. The likelihood term is

$$p(x \mid z(c), G) = \prod_{k=1}^{|z(c)|} p(x_{z_k(c)} \mid G). \qquad (4)$$

Because of this factorization, the Gibbs sampler need only compute terms that correspond to changes in the partition. Consider the partition $z(c_{-i})$, which may have split a table, and the new partition $z(c_{-i} \cup c^{\text{(new)}})$. There are three cases to consider. First, $c_i$ might link to itself; there will be no change to the likelihood function because a self-link cannot join two tables. Second, $c_i$ might link to another table but cause no change in the partition. Finally, $c_i$ might link to another table and join two tables $k$ and $\ell$. The Gibbs sampler for the distance dependent CRP is thus

$$p(c_i^{\text{(new)}} \mid c_{-i}, x, \eta) \propto \begin{cases} \alpha & \text{if } c_i^{\text{(new)}} = i, \\ f(d_{ij}) & \text{if } c_i^{\text{(new)}} = j \text{ does not join two tables}, \\ f(d_{ij}) \, \dfrac{p(x_{z_k(c_{-i}) \cup z_\ell(c_{-i})} \mid G)}{p(x_{z_k(c_{-i})} \mid G)\, p(x_{z_\ell(c_{-i})} \mid G)} & \text{if } c_i^{\text{(new)}} = j \text{ joins tables } k \text{ and } \ell. \end{cases} \qquad (5)$$

The specific form of the terms in Eq. (4) depends on the model. We first consider the fully observed case (i.e., "language modeling"). Recall that the partition corresponds to words of the same type, but that more than one table can contain identical types. (For example, four tables could contain observations of the word "peanut." But observations of the word "walnut" cannot sit at any of the peanut tables.) Thus, the likelihood of the data is simply the probability under $G$ of a representative from each table, e.g., the first customer, times a product of indicators to ensure that all observations are equal,

$$p(x_{z_k(c)} \mid G) = p(x_{z_k(c)_1} \mid G) \prod_{i \in z_k(c)} 1[x_i = x_{z_k(c)_1}], \qquad (6)$$

where $z_k(c)_1$ is the index of the first customer assigned to table $k$.

In the mixture model, we compute the marginal probability that the set of observations from each table are drawn independently from the same parameter, which itself is drawn from $G$.
Each term is

$$p(x_{z_k(c)} \mid G) = \int \Big( \prod_{i \in z_k(c)} p(x_i \mid \theta) \Big) p(\theta \mid G) \, d\theta. \qquad (7)$$

Because this term marginalizes out the mixture component $\theta$, the result is a collapsed sampler for the mixture model. When $G$ and $p(x \mid \theta)$ form a conjugate pair, the integral is straightforward to compute. In nonconjugate settings, an additional layer of sampling is needed.
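For a concrete instance of Eqs. (5) and (7), the sketch below (ours) computes the conjugate Dirichlet-multinomial marginal for a table's aggregated word counts, and the log of the likelihood ratio that weights a link joining tables $k$ and $\ell$. The helper name, the symmetric Dirichlet parameter `eta`, and the counts are toy assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, eta):
    """log p(x | G) of Eq. (7) for a bag of word counts under a symmetric Dirichlet(eta)."""
    V, n = len(counts), counts.sum()
    return (gammaln(V * eta) - gammaln(V * eta + n)
            + np.sum(gammaln(counts + eta) - gammaln(eta)))

eta = 0.5
counts_k = np.array([4, 0, 1, 0])   # aggregated word counts at table k
counts_l = np.array([3, 1, 0, 0])   # aggregated word counts at table l

# Log of the ratio p(x_{k and l merged} | G) / (p(x_k | G) p(x_l | G)) from Eq. (5);
# adding log f(d_ij) gives the (unnormalized) log weight of the joining link.
log_ratio = (log_marginal(counts_k + counts_l, eta)
             - log_marginal(counts_k, eta) - log_marginal(counts_l, eta))
print(log_ratio)
```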
Prediction.

In prediction, our goal is to compute the conditional probability distribution of a new data point $x_{\text{new}}$ given the data set $x$. This computation relies on the posterior. Recall that $D$ is the set of distances between all the data points. The predictive distribution is

$$p(x_{\text{new}} \mid x, D, G, \alpha) = \sum_{c_{\text{new}}} p(c_{\text{new}} \mid D, \alpha) \sum_{c} p(x_{\text{new}} \mid c_{\text{new}}, c, x, G)\, p(c \mid x, D, \alpha, G). \qquad (8)$$

The outer summation is over the customer assignment of the new data point; its prior probability only depends on the distance matrix $D$. The inner summation is over the posterior customer assignments of the data set; it determines the probability of the new data point conditioned on the previous data and its partition. In this calculation, the difference between sequential distances and arbitrary distances is important.

Consider sequential distances and suppose that $x_{\text{new}}$ is a future data point. In this case, the distribution of the data set customer assignments $c$ does not depend on the new data point's location in time. The reason is that data points can only connect to data points in the past. Thus, the posterior $p(c \mid x, D, \alpha, G)$ is unchanged by the addition of the new data, and we can use previously computed Gibbs samples to approximate it.

In other situations (nonsequential distances, or sequential distances where the new data occurs somewhere in the middle of the sequence), the discovery of the new data point changes the posterior $p(c \mid x, D, \alpha, G)$. The reason is that the knowledge of where the new data is relative to the others (i.e., the information in $D$) changes the prior over customer assignments and thus changes the posterior as well. This new information requires rerunning the Gibbs sampler to account for the new data point. Finally, note that the special case where we know the new data's location in advance (without knowing its value) does not require rerunning the Gibbs sampler.
4. Marginal invariance
In Section 2 we discussed the property of marginal invariance, where removing a customer leaves the partition distribution over the remaining customers unchanged. When a model has this property, unobserved data may simply be ignored. We mentioned that the traditional CRP is marginally invariant, while the distance dependent CRP does not necessarily have this property.

In fact, the traditional CRP is the only distance dependent CRP that is marginally invariant. The details of this characterization are given in the appendix. This characterization of marginally invariant CRPs contrasts the distance dependent CRP with the alternative priors over partitions induced by random measures, such as the Dirichlet process.

(One can also create a marginally invariant distance dependent CRP by combining several independent copies of the traditional CRP. Details are discussed in the appendix.)

In addition to the Dirichlet process, random-measure models include the dependent Dirichlet process (MacEachern, 1999) and the order-based dependent Dirichlet process (Griffin and Steel, 2006). These models suppose that data from a given covariate were drawn independently from a fixed latent sampling probability measure. These models then suppose that these sampling measures were drawn from some parent probability measure. Dependence between the randomly drawn sampling measures is achieved through this parent probability measure.

We formally define a random-measure model as follows. Let $X$ and $Y$ be the sets in which covariates and observations take their values, let $x_{1:N} \subset X$, $y_{1:N} \subset Y$ be the set of observed covariates and their corresponding sampled values, and let $M(Y)$ be the space of probability measures on $Y$. A random-measure model is any probability distribution on the samples $y_{1:N}$ induced by a probability measure $G$ on the space $M(Y)^X$. This random-measure model may be written

$$y_n \mid x_n \sim P_{x_n}, \qquad (P_x)_{x \in X} \sim G, \qquad (9)$$

where the $y_n$ are conditionally independent of each other given $(P_x)_{x \in X}$. Such models implicitly induce a distribution on partitions of the data by taking all points $n$ whose sampled values $y_n$ are equal to be in the same cluster.

In such random-measure models, the (prior) distribution on $y_{-n}$ does not depend on $x_n$, and so such models are marginally invariant, regardless of the points $x_n$ and the distances between them. From this observation, and the lack of marginal invariance of the distance dependent CRP, it follows that the distributions on partitions induced by random-measure models are different from the distance dependent CRP. The only distribution that is both a distance dependent CRP, and is also induced by a random-measure model, is the traditional CRP.

Thus, distance dependent CRPs are generally not marginally invariant, and so are appropriate for modeling situations that naturally depart from marginal invariance. This distinguishes priors obtained with distance dependent CRPs from those obtained from random-measure models, which are appropriate when marginal invariance is a reasonable assumption.
5. Empirical study
We studied the distance dependent CRP in the language modeling and mixture settings on four text data sets. We explored both time dependence, where the sequential ordering of the data is respected via the decay function and distance measurements, and network dependence, where the data are connected in a graph. We show below that the distance dependent CRP gives better fits to text data in both the fully-observed and mixture modeling settings. Further, we compared the traditional Gibbs sampler for DP mixtures to the Gibbs sampler for the distance dependent CRP formulation of DP mixtures. We found that the sampler based on customer assignments mixes faster than the traditional sampler.
We evaluated the fully-observed distance dependent CRP models on two data sets: a collection of 100 OCR'ed documents from the journal Science, and a collection of 100 world news articles from the New York Times. We modeled each document independently. We assess sampler convergence visually, examining the autocorrelation plots of the log likelihood of the state of the chain (Robert and Casella, 2004). Our R implementation of Gibbs sampling for ddCRP models is available at
Fig 4. Log Bayes factors of the distance dependent CRP versus the traditional CRP on documents from Science and the New York Times, as a function of the decay parameter, for the exponential and logistic decay types. The black line at zero denotes an equal fit between the traditional CRP and distance dependent CRP, while positive values denote a better fit for the distance dependent CRP. Also illustrated are standard errors across documents.

We compare models by estimating the Bayes factor, the ratio of the probability under the distance dependent CRP to the probability under the traditional CRP (Kass and Raftery, 1995). For a decay function $f$, this Bayes factor is

$$\text{BF}_{f,\alpha} = p(w_{1:N} \mid \text{dist-CRP}_{f,\alpha}) / p(w_{1:N} \mid \text{CRP}_\alpha). \qquad (10)$$

A value greater than one indicates an improvement of the distance dependent CRP over the traditional CRP. Following Geyer and Thompson (1992), we estimate this ratio with a Monte Carlo estimate from posterior samples.

Figure 4 illustrates the average log Bayes factors across documents for various settings of the exponential and logistic decay functions. The logistic decay function always provides a better model than the traditional CRP; the exponential decay function provides a better model at certain settings of its parameter. (These curves are for the hierarchical setting with the base distribution over terms $G$ unobserved; the shapes of the curves are similar in the non-hierarchical settings.)

We examined the distance dependent CRP mixture on two text corpora. We analyzed one month of the
New York Times (NYT), time-stamped by day, containing 2,777 articles, 3,842 unique terms, and 530K observed words. We also analyzed 12 years of NIPS papers time-stamped by year, containing 1,740 papers, 5,146 unique terms, and 1.6M observed words. Distances $D$ were differences between time-stamps.

In both corpora we removed the last 250 articles as held out data. In the NYT data, this amounts to three days of news; in the NIPS data, this amounts to papers from the 11th and 12th year. (We retain the time stamps of the held-out articles because the predictive likelihood of an article's contents depends on its time stamp, as well as the time stamps of earlier articles.) We evaluate the models by estimating the predictive likelihood of the held out data. The results are in Figure 5. On the NYT corpus, the distance dependent CRPs definitively outperform the traditional CRP. A logistic decay with a window of 14 days performs best. On the NIPS corpus, the logistic decay function with a decay parameter of 2 years outperforms the traditional CRP. In general, these results show that non-exchangeable models given by the distance dependent CRP mixture provide a better fit than the exchangeable CRP mixture.

Fig 5. Predictive held-out log likelihood on the NIPS and New York Times corpora, as a function of the decay parameter, for the traditional CRP and the exponential and logistic decay types; held-out data are the last year of NIPS and the last three days of the New York Times corpus. Error bars denote standard errors across MCMC samples. On the NIPS data, the distance dependent CRP outperforms the traditional CRP for the logistic decay with a decay parameter of 2 years. On the New York Times data, the distance dependent CRP outperforms the traditional CRP in almost all settings tested.

The previous two examples have considered data analysis settings with a sequential distance function. However, the distance dependent CRP is a more general modeling tool. Here, we demonstrate its flexibility by analyzing a set of networked documents with a distance dependent CRP mixture model. Networked data induces an entirely different distance function, where any data point may link to an arbitrary set of other data. We emphasize that we can use the same Gibbs sampling algorithms for both the sequential and networked settings.

Specifically, we analyzed the CORA data set, a collection of Computer Science abstracts that are connected if one paper cites the other (McCallum et al., 2000). One natural distance function is the number of connections between data (and $\infty$ if two data points are not reachable from each other). We use the window decay function with parameter 1, enforcing that a customer can only link to itself or to another customer that refers to an immediately connected document. We treat the graph as undirected.

Fig 6. The MAP clustering of a subset of CORA. Each node is an abstract in the collection and each link represents a citation. Colors are repeated across connected components: no two data points from disconnected components in the graph can be assigned to the same cluster. Within each connected component, colors are not repeated, and nodes with the same color are assigned to the same cluster.

Figure 6 shows a subset of the MAP estimate of the clustering under these assumptions. Note that the clusters form connected groups of documents, though several clusters are possible within a large connected group. Traditional CRP clustering does not lean towards such solutions. Overall, the distance dependent CRP provides a better model. The log Bayes factor is 13,062, strongly in favor of the distance dependent CRP, although we emphasize that much of this improvement may occur simply because the distance dependent CRP avoids clustering abstracts from unconnected components of the network.
Further analysis is needed to understand the abilities of the distance dependent CRP beyond those of simpler network-aware clustering schemes.

We emphasize that this analysis is meant to be a proof of concept to demonstrate the flexibility of distance dependent CRP mixtures. Many modeling choices can be explored, including longer windows in the decay function and treating the graph as a directed graph. A similar modeling set-up could be used to analyze spatial data, where distances are natural to compute, or images (e.g., for image segmentation), where distances might be the Manhattan distance between pixels.
The distance dependent CRP can express a number of flexible models. However, as we describe in Section 2, it can also re-express the traditional CRP. In the mixture model setting, the Gibbs sampler of Section 3 thus provides an alternative algorithm for approximate posterior inference in DP mixtures. We compare this Gibbs sampler to the widely used collapsed Gibbs sampler for DP mixtures, i.e., Algorithm 3 from Neal (2000), which is applicable when the base measure $G$ is conjugate to the data generating distribution.

The Gibbs sampler for the distance dependent CRP iteratively samples the customer assignment of each data point, while the collapsed Gibbs sampler iteratively samples the cluster assignment of each data point. The practical difference between the two algorithms is that the distance dependent CRP based sampler can change several customers' cluster assignments via a single customer assignment. This allows for larger moves in the state space of the posterior and, as we will see below, faster mixing of the sampler.

Moreover, the computational complexity of the two samplers is the same. Both require computing the change in likelihood of adding or removing either a set of points (in the distance dependent CRP case) or a single point (in the traditional CRP case) to each cluster. Whether adding or removing one or a set of points, this amounts to computing a ratio of normalizing constants for each cluster, and this is where the bulk of the computation of each sampler lies.

(In some settings, removing a single point, as is done in Neal (2000), allows faster computation of each sampler iteration. This is true, for example, if the observations are single words (as opposed to a document of words) or single draws from a Gaussian. Although each iteration may be faster with the traditional sampler, that sampler may spend many more iterations stuck in local optima.)

To compare the samplers, we analyzed documents from the
Science and New York Times collections under a CRP mixture with scaling parameter equal to one and uniform Dirichlet base measure. Figure 7 illustrates the log probability of the state of the traditional CRP Gibbs sampler as a function of Gibbs sampler iteration. The log probability of the state is proportional to the posterior; a higher value indicates a state with higher posterior likelihood. These numbers are comparable because the models, and thus the normalizing constant, are the same for both the traditional representation and the customer based CRP. Iterations 3–1000 are plotted, where each sampler is started at the same (random) state. The traditional Gibbs sampler is much more prone to stagnation at local optima, particularly for the Science corpus.

Fig 7. Each panel illustrates 100 Gibbs runs using Algorithm 3 of Neal (2000) (CRP, in blue) and the sampler from Section 3 with the identity decay function (distance dependent CRP, in red). Both samplers have the same limiting distribution because the distance dependent CRP with identity decay is the traditional CRP. We plot the log probability of the CRP representation as a function of its iteration. The left panel shows the Science corpus, and the right panel shows the New York Times corpus. Higher values indicate that the chain has found a better local mode of the posterior. In these examples, the distance dependent CRP Gibbs sampler mixes faster.
6. Discussion
We have developed the distance dependent Chinese restaurant process, a distribution over partitions that accommodates a flexible and non-exchangeable seating assignment distribution. The distance dependent CRP hinges on the customer assignment representation. We derived a general-purpose Gibbs sampler based on this representation, and examined sequential models of text.

The distance dependent CRP opens the door to a number of further developments in infinite clustering models. We plan to explore spatial dependence in models of natural images, and multi-level models akin to the hierarchical Dirichlet process (Teh et al., 2006). Moreover, the simplicity and fixed dimensionality of the corresponding Gibbs sampler suggests that a variational method is worth exploring as an alternative deterministic form of approximate inference.
Acknowledgments
David M. Blei is supported by ONR 175-6343, NSF CAREER 0745520, AFOSR 09NL202, the Alfred P. Sloan foundation, and a grant from Google. Peter I. Frazier is supported by AFOSR YIP FA9550-11-1-0083. Both authors thank the three anonymous reviewers for their insightful comments and suggestions.

References
A. Ahmed and E. Xing. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process with applications to evolutionary clustering. In International Conference on Data Mining, 2008.
C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
D. Blackwell. Discreteness of Ferguson selections. The Annals of Statistics, 1(2):356–358, 1973.
D. Blei and P. Frazier. Distance dependent Chinese restaurant processes. In International Conference on Machine Learning, 2010.
D. Blei and M. Jordan. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121–144, 2005.
D. Blei, T. Griffiths, and M. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1–30, 2010.
D.B. Dahl. Distance-based probability distribution for set partitions with applications to Bayesian nonparametrics. In JSM Proceedings, Section on Bayesian Statistical Science, American Statistical Association, Alexandria, VA, 2008.
H. Daume. Fast search for Dirichlet process mixture models. In Artificial Intelligence and Statistics, San Juan, Puerto Rico, 2007. URL http://pub.hal3.name/.
J. Duan, M. Guindani, and A. Gelfand. Generalized spatial Dirichlet process models. Biometrika, 94:809–825, 2007.
D. Dunson. Bayesian dynamic modeling of latent trait distributions. Biostatistics, 2006.
D. Dunson, N. Pillai, and J. Park. Bayesian density regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):163–183, 2007.
M. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.
T. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.
E. Fox, E. Sudderth, M. Jordan, and A. Willsky. Developing a tempered HDP-HMM for systems with state persistence. Technical report, MIT Laboratory for Information and Decision Systems, 2007.
C. Geyer and E. Thompson. Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society: Series B, 54:657–699, 1992.
S. Goldwater, T. Griffiths, and M. Johnson. Interpolating between types and tokens by estimating power-law generators. In Neural Information Processing Systems, 2006.
J. Griffin and M. Steel. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, 101(473):179–194, 2006.
J.A. Hartigan. Partition models. Communications in Statistics - Theory and Methods, 19(8):2745–2756, 1990.
M. Johnson, T. Griffiths, and S. Goldwater. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 641–648, Cambridge, MA, 2007. MIT Press.
R. Kass and A. Raftery. Bayes factors. Journal of the American Statistical Association, 90(430):773–795, 1995.
W. Li, D. Blei, and A. McCallum. Nonparametric Bayes pachinko allocation. In The 23rd Conference on Uncertainty in Artificial Intelligence, 2007.
P. Liang, M. Jordan, and B. Taskar. A permutation-augmented sampler for DP mixture models. In International Conference on Machine Learning, 2007.
S. MacEachern. Dependent nonparametric processes. In ASA Proceedings of the Section on Bayesian Statistical Science, 1999.
A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 2000.
K.T. Miller, T.L. Griffiths, and M.I. Jordan. The phylogenetic Indian buffet process: A non-exchangeable nonparametric prior for latent features. In David A. McAllester and Petri Myllymäki, editors, UAI, pages 403–410. AUAI Press, 2008.
P. Mueller and F. Quintana. Random partition models with regression on covariates. In International Conference on Interdisciplinary Mathematical and Statistical Techniques, 2008.
P. Muller, F. Quintana, and G. Rosner. Bayesian clustering with regression. Working paper, 2008.
R. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
R. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
J. Pitman. Combinatorial Stochastic Processes. Lecture Notes for St. Flour Summer School. Springer-Verlag, New York, NY, 2002.
C. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
C. Ritter and M. Tanner. Facilitating the Gibbs sampler: The Gibbs stopper and the Griddy-Gibbs sampler. Journal of the American Statistical Association, 87(419):861–868, 1992.
C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, NY, 2004.
E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing visual scenes using transformed Dirichlet processes. In Advances in Neural Information Processing Systems 18, 2005.
E.B. Sudderth and M.I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, NIPS, pages 1585–1592. MIT Press, 2008.
Y. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association of Computational Linguistics, 2006.
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
E. Xing, M. Jordan, and R. Sharan. Bayesian haplotype inference via the Dirichlet process. Journal of Computational Biology, 14(3):267–284, 2007.
Y. Xue, D. Dunson, and L. Carin. The matrix stick-breaking process for flexible multi-task learning. In International Conference on Machine Learning, 2007.
X. Zhu, Z. Ghahramani, and J. Lafferty. Time-sensitive Dirichlet process mixture models. Technical Report CMU-CALD-05-104, Carnegie Mellon University, 2005.

Appendix A: A formal characterization of marginal invariance
In this section, we formally characterize the class of distance dependent CRPs that are marginally invariant. This family is a very small subset of the entire set of distance dependent CRPs, containing only the traditional CRP and variants constructed from independent copies of it. This characterization is used in Section 4 to contrast the distance dependent CRP with random-measure models.

Throughout this section, we assume that the decay function satisfies a relaxed version of the triangle inequality, which uses the notation $\bar{d}_{ij} = \min(d_{ij}, d_{ji})$. We assume: if $\bar{d}_{ij} = 0$ and $\bar{d}_{jk} = 0$ then $\bar{d}_{ik} = 0$; and if $\bar{d}_{ij} < \infty$ and $\bar{d}_{jk} < \infty$ then $\bar{d}_{ik} < \infty$.

A.1. Sequential Distances
We first consider sequential distances. We begin with the following proposition, which shows that a very restricted class of distance dependent CRPs may also be constructed by collections of independent CRPs.
Proposition 1.
Fix a set of sequential distances between each of $n$ customers, a real number $a > 0$, and a set $A \in \{\emptyset, \{0\}, \mathbb{R}\}$. Then there is a (non-random) partition $B_1, \ldots, B_K$ of $\{1, \ldots, n\}$ for which two distinct customers $i$ and $j$ are in the same set $B_k$ iff $\bar{d}_{ij} \in A$. For each $k = 1, \ldots, K$, let there be an independent CRP with concentration parameter $\alpha/a$, and let customers within $B_k$ be clustered among themselves according to this CRP.

Then, the probability distribution on clusters induced by this construction is identical to the distance dependent CRP with decay function $f(d) = a\,1[d \in A]$. Furthermore, this probability distribution is marginally invariant.

Proof. We begin by constructing a partition $B_1, \ldots, B_K$ with the stated property. Let $J(i) = \min\{j : j = i \text{ or } \bar{d}_{ij} \in A\}$, and let $\mathcal{J} = \{J(i) : i = 1, \ldots, n\}$ be the set of unique values taken by $J$. Each customer $i$ will be placed in the set containing customer $J(i)$. Assign to each value $j \in \mathcal{J}$ a unique integer $k(j)$ between $1$ and $|\mathcal{J}|$. For each $j \in \mathcal{J}$, let $B_{k(j)} = \{i : J(i) = j\} = \{i : i = j \text{ or } \bar{d}_{ij} \in A\}$. Each customer $i$ is in exactly one set, $B_{k(J(i))}$, and so $B_1, \ldots, B_{|\mathcal{J}|}$ is a partition of $\{1, \ldots, n\}$.

To show that $i \ne i'$ are both in $B_k$ iff $\bar{d}_{ii'} \in A$, we consider two possibilities. If $A = \emptyset$, then $J(i) = i$ and each $B_k$ contains only a single point. If $A = \{0\}$ or $A = \mathbb{R}$, then it follows from the relaxed triangle inequality assumed at the beginning of Appendix A.

With this partition $B_1, \ldots, B_K$, the probability of linkage under the distance dependent CRP with decay function $f(d) = a\,1[d \in A]$ may be written

$$p(c_i = j) \propto \begin{cases} \alpha & \text{if } i = j, \\ a & \text{if } j < i \text{ and } j \in B_{k(J(i))}, \\ 0 & \text{if } j > i \text{ or } j \notin B_{k(J(i))}. \end{cases}$$

By noting that linkages between customers from different sets $B_k$ occur with probability $0$, we see that this is the same probability distribution produced by taking $K$ independent distance dependent CRPs, where the $k$th distance dependent CRP governs linkages between customers in $B_k$ using

$$p(c_i = j) \propto \begin{cases} \alpha & \text{if } i = j, \\ a & \text{if } j < i, \\ 0 & \text{if } j > i, \end{cases}$$

for $i, j \in B_k$.

Finally, dividing the unnormalized probabilities by $a$, we rewrite the linkage probabilities for the $k$th distance dependent CRP as

$$p(c_i = j) \propto \begin{cases} \alpha/a & \text{if } i = j, \\ 1 & \text{if } j < i, \\ 0 & \text{if } j > i, \end{cases}$$

for $i, j \in B_k$. This is identical to the distribution of the traditional CRP with concentration parameter $\alpha/a$.

This shows that the distance dependent CRP with decay function $f(d) = a\,1[d \in A]$ induces the same probability distribution on clusters as the one produced by a collection of $K$ independent traditional CRPs, each with concentration parameter $\alpha/a$, where the $k$th traditional CRP governs the clusters of customers within $B_k$.

The marginal invariance of this distribution follows from the marginal invariance of each traditional CRP, and their independence from one another.

The probability distribution described in this proposition separates customers into groups $B_1, \ldots, B_K$ based on whether inter-customer distances fall within the set $A$, and then governs clustering within each group independently using a traditional CRP. Clustering across groups does not occur.

We consider what this means for specific choices of $A$. If $A = \{0\}$, then each group contains those customers whose distance from one another is $0$. This group is well-defined because of the assumption that $d_{ij} = 0$ and $d_{jk} = 0$ implies $d_{ik} = 0$.
If $A = \mathbb{R}$, then each group contains those customers whose distance from one another is finite. Similarly to the $A = \{0\}$ case, this group is well-defined because of the assumption that $d_{ij} < \infty$ and $d_{jk} < \infty$ implies $d_{ik} < \infty$. If $A = \emptyset$, then each group contains only a single customer, so each customer will be in his own cluster.

Since the resulting construction is marginally invariant, Proposition 1 provides a sufficient condition for marginal invariance.
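To make the construction in Proposition 1 concrete, the following Python sketch enumerates all seating assignments for a small example and checks that the distance dependent CRP with decay $f(d) = a\,1[d \in \{0\}]$ and the blockwise independent CRPs with concentration $\alpha/a$ induce the same distribution over partitions. The block structure, distances, and the values of $n$, $\alpha$, and $a$ are illustrative assumptions, not from the paper.

```python
import itertools
from collections import Counter

n, alpha, a = 4, 1.0, 2.0
blocks = [[0, 1], [2, 3]]                     # B_1, B_2: zero distance within a block
d = {(i, j): 0.0 if i // 2 == j // 2 else 1.0 for i in range(n) for j in range(i)}
f = lambda dist: a if dist == 0.0 else 0.0    # decay f(d) = a * 1[d in {0}]

def partition(c):
    """Resolve the link assignments c into a partition of {0, ..., n-1}."""
    parent = list(range(n))
    find = lambda x: x if parent[x] == x else find(parent[x])
    for i, j in enumerate(c):
        parent[find(i)] = find(j)
    return frozenset(frozenset(k for k in range(n) if find(k) == r)
                     for r in {find(k) for k in range(n)})

def law(weight):
    """Exact partition distribution when customer i links to j <= i
    with probability proportional to weight(i, j)."""
    out = Counter()
    for c in itertools.product(*(range(i + 1) for i in range(n))):
        p = 1.0
        for i in range(n):
            w = [weight(i, j) for j in range(i + 1)]
            p *= w[c[i]] / sum(w)
        out[partition(c)] += p
    return out

ddcrp = law(lambda i, j: alpha if i == j else f(d[(i, j)]))
indep = law(lambda i, j: alpha / a if i == j else float(j in blocks[i // 2]))
assert all(abs(ddcrp[z] - indep[z]) < 1e-12 for z in set(ddcrp) | set(indep))
print("dd-CRP and blockwise independent CRPs induce the same partition law")
```

Exhaustive enumeration keeps the check exact; for larger $n$ one would compare Monte Carlo estimates instead.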
The following proposition shows that this condition is necessary as well.

Proposition 2.

If the distance dependent CRP for a given decay function $f$ is marginally invariant over all sets of sequential distances, then $f$ is of the form $f(d) = a\,1[d \in A]$ for some $a > 0$ and $A$ equal to either $\emptyset$, $\{0\}$, or $\mathbb{R}$.

Proof. Consider a setting with $3$ customers, in which customer $1$ may either be absent, or present with his seating assignment marginalized out. Fix a non-increasing decay function $f$ with $f(\infty) = 0$ and suppose that the distances are sequential, so $d_{12} = d_{13} = d_{23} = \infty$. Suppose that the distance dependent CRP resulting from this $f$ and any collection of sequential distances is marginally invariant. Then the probability that customers $2$ and $3$ share a table must be the same whether customer $1$ is absent or present.

If customer $1$ is absent,
$$P\{2 \text{ and } 3 \text{ sit at same table} \mid 1 \text{ absent}\} = \frac{f(d_{32})}{f(d_{32}) + \alpha}. \tag{11}$$

If customer $1$ is present, customers $2$ and $3$ may sit at the same table in two different ways: $3$ sits with $2$ directly ($c_3 = 2$); or $3$ sits with $1$, and $2$ sits with $1$ ($c_3 = 1$ and $c_2 = 1$). Thus,
$$P\{2 \text{ and } 3 \text{ sit at same table} \mid 1 \text{ present}\} = \frac{f(d_{32})}{f(d_{32}) + f(d_{31}) + \alpha} + \left(\frac{f(d_{31})}{f(d_{32}) + f(d_{31}) + \alpha}\right)\left(\frac{f(d_{21})}{f(d_{21}) + \alpha}\right). \tag{12}$$

For the distance dependent CRP to be marginally invariant, Eq. (11) and Eq. (12) must be identical. Writing Eq. (11) on the left side and Eq. (12) on the right, we have
$$\frac{f(d_{32})}{f(d_{32}) + \alpha} = \frac{f(d_{32})}{f(d_{32}) + f(d_{31}) + \alpha} + \left(\frac{f(d_{31})}{f(d_{32}) + f(d_{31}) + \alpha}\right)\left(\frac{f(d_{21})}{f(d_{21}) + \alpha}\right). \tag{13}$$

We now consider two different possibilities for the distances $d_{21}$ and $d_{31}$, always keeping $d_{32} = d_{21} + d_{31}$.

First, suppose $d_{21} = 0$ and $d_{31} = d_{32} = d$ for some $d \geq 0$. By multiplying Eq. (13) through by $(2f(d) + \alpha)(f(0) + \alpha)(f(d) + \alpha)$ and rearranging terms, we obtain $\alpha f(d)\,(f(0) - f(d)) = 0$. Thus, either $f(d) = 0$ or $f(d) = f(0)$. Since this is true for each $d \geq 0$ and $f$ is non-increasing, $f = a\,1[d \in A]$ with $a \geq 0$ and either $A = \emptyset$, $A = \mathbb{R}$, $A = [0, b]$, or $A = [0, b)$ with $b \in [0, \infty)$. Because $A = \emptyset$ is among the choices, we may assume $a > 0$ without loss of generality. We now show that if $A = [0, b]$ or $A = [0, b)$, then we must have $b = 0$, so that $A$ is of the form claimed by the proposition.

Suppose for contradiction that $A = [0, b]$ or $A = [0, b)$ with $b > 0$. Consider distances given by $d_{21} = d_{31} = d = b - \epsilon$ with $\epsilon \in (0, b/2)$, so that $d_{32} = 2d$. By multiplying Eq. (13) through by $(f(2d) + f(d) + \alpha)(f(d) + \alpha)(f(2d) + \alpha)$ and rearranging terms, we obtain $\alpha f(d)\,(f(d) - f(2d)) = 0$. Since $f(d) = a > 0$, we must have $f(2d) = f(d) > 0$. But $d_{32} = 2(b - \epsilon) > b$ together with $f(2d) = a\,1[2d \in A]$ implies that $f(2d) = 0$, which is a contradiction.
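As a quick numeric illustration of the proof's use of Eq. (13), the sketch below evaluates Eq. (11) and Eq. (12) on the configuration $d_{21} = 0$, $d_{31} = d_{32} = d$: a decay of the invariant form $f(d) = 1[d \in \mathbb{R}]$ makes the two sides agree, while the exponential decay $f(d) = e^{-d}$ does not. The values of $\alpha$ and $d$ are arbitrary illustrative choices, not from the paper.

```python
import math

alpha, d = 1.0, 1.5
d21, d31, d32 = 0.0, d, d   # the first configuration used in the proof

def sides(f):
    lhs = f(d32) / (f(d32) + alpha)                                # Eq. (11)
    D = f(d32) + f(d31) + alpha
    rhs = f(d32) / D + (f(d31) / D) * (f(d21) / (f(d21) + alpha))  # Eq. (12)
    return lhs, rhs

finite = lambda x: 1.0 if x < math.inf else 0.0  # f(d) = 1[d in R]: invariant form
expo = lambda x: math.exp(-x)                    # exponential decay

print(sides(finite))  # (0.5, 0.5): the two sides agree
print(sides(expo))    # (0.182..., 0.231...): marginal invariance fails
```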
These two propositions are combined in the following corollary, which states that the class of decay functions considered in Propositions 1 and 2 is both necessary and sufficient for marginal invariance.

Corollary 1.

Fix a particular decay function $f$. The distance dependent CRP resulting from this decay function is marginally invariant over all sequential distances if and only if $f$ is of the form $f(d) = a\,1[d \in A]$ for some $a > 0$ and some $A \in \{\emptyset, \{0\}, \mathbb{R}\}$.

Proof. Sufficiency for marginal invariance is shown by Proposition 1. Necessity is shown by Proposition 2.

Although Corollary 1 allows any choice of $a > 0$ in the decay function $f(d) = a\,1[d \in A]$, the distribution of the distance dependent CRP with a particular $f$ and $\alpha$ remains unchanged if both $f$ and $\alpha$ are multiplied by a constant factor (see Eq. (2)). Thus, the distance dependent CRP defined by $f(d) = a\,1[d \in A]$ and concentration parameter $\alpha$ is identical to the one defined by $f(d) = 1[d \in A]$ and concentration parameter $\alpha/a$. In this sense, we can restrict the choice of $a$ in Corollary 1 (and also Propositions 1 and 2) to $a = 1$ without loss of generality.
A.2. General Distances

We now consider all sets of distances, including non-sequential distances. The class of distance dependent CRPs that are marginally invariant over this larger class of distances is even more restricted than in the sequential case. We have the following proposition providing a necessary condition for marginal invariance.
Proposition 3.
If the distance dependent CRP for a given decay function $f$ is marginally invariant over all sets of distances, both sequential and non-sequential, then $f$ is identically $0$.

Proof. From Proposition 2, we have that any decay function that is marginally invariant under all sequential distances must be of the form $f(d) = a\,1[d \in A]$, where $a > 0$ and $A \in \{\emptyset, \{0\}, \mathbb{R}\}$. We now show that if the decay function is marginally invariant under all sets of distances (not just those that are sequential), then $f(0) = 0$. The only decay function of the form $f(d) = a\,1[d \in A]$ that satisfies $f(0) = 0$ is the one that is identically $0$, and so this will show our result.

To show $f(0) = 0$, suppose that we have $n + 1$ customers, all of whom are a distance $0$ away from one another, so $d_{ij} = 0$ for $i, j = 1, \dots, n + 1$. Under our assumption of marginal invariance, the probability that the first $n$ customers sit at separate tables should be invariant to the absence or presence of customer $n + 1$.

When customer $n + 1$ is absent, the only way in which the first $n$ customers may sit at separate tables is for each to link to himself. Let $p_n = \alpha / (\alpha + (n - 1) f(0))$ denote the probability of a given customer linking to himself when customer $n + 1$ is absent. Then
$$P\{1, \dots, n \text{ sit separately} \mid n + 1 \text{ absent}\} = (p_n)^n. \tag{14}$$

We now consider the case when customer $n + 1$ is present. Let $p_{n+1} = \alpha / (\alpha + n f(0))$ be the probability of a given customer linking to himself, and let $q_{n+1} = f(0) / (\alpha + n f(0))$ be the probability of a given customer linking to some other given customer. The first $n$ customers may each sit at separate tables in two different ways. First, each may link to himself, which occurs with probability $(p_{n+1})^n$. Second, all but one of these first $n$ customers may link to himself, with the remaining customer linking to customer $n + 1$, and customer $n + 1$ linking either to himself or to the customer that linked to him. This occurs with probability $n (p_{n+1})^{n-1} q_{n+1} (p_{n+1} + q_{n+1})$. Thus, the total probability that the first $n$ customers sit at separate tables is
$$P\{1, \dots, n \text{ sit separately} \mid n + 1 \text{ present}\} = (p_{n+1})^n + n (p_{n+1})^{n-1} q_{n+1} (p_{n+1} + q_{n+1}). \tag{15}$$

Under our assumption of marginal invariance, Eq. (14) must be equal to Eq. (15), and so
$$0 = (p_{n+1})^n + n (p_{n+1})^{n-1} q_{n+1} (p_{n+1} + q_{n+1}) - (p_n)^n. \tag{16}$$

Consider $n = 2$. By substituting the definitions of $p_2$, $p_3$, and $q_3$, and then rearranging terms, we may rewrite Eq. (16) as
$$0 = \frac{\alpha f(0)^2 \left(2 f(0)^2 - \alpha^2\right)}{(\alpha + f(0))^2\,(\alpha + 2 f(0))^3},$$
which is satisfied only when $f(0) \in \{0, \alpha/\sqrt{2}\}$.

Consider the second of these roots, $\alpha/\sqrt{2}$. When $n = 3$, this value of $f(0)$ violates Eq. (16). Thus, the first root is the only possibility and we must have $f(0) = 0$.
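The two numeric facts the proof relies on, namely that $f(0) \in \{0, \alpha/\sqrt{2}\}$ solves Eq. (16) at $n = 2$ while $\alpha/\sqrt{2}$ violates it at $n = 3$, are easy to check directly. The sketch below does so for $\alpha = 1$, an arbitrary illustrative choice; the violation at $n = 3$ is small (about $-3.2 \times 10^{-5}$) but clearly nonzero.

```python
import math

def eq16(n, f0, alpha=1.0):
    p_n = alpha / (alpha + (n - 1) * f0)   # self-link prob. when n+1 is absent
    p = alpha / (alpha + n * f0)           # self-link prob. when n+1 is present
    q = f0 / (alpha + n * f0)              # prob. of linking to a given other customer
    return p**n + n * p**(n - 1) * q * (p + q) - p_n**n

root = 1.0 / math.sqrt(2.0)                # the candidate root alpha / sqrt(2)
print(eq16(2, 0.0), eq16(2, root))   # both ~ 0: Eq. (16) holds at n = 2
print(eq16(3, root))                 # ~ -3.2e-5, nonzero: violated at n = 3
```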
The decay function $f = 0$ described in Proposition 3 is a special case of the decay function from Proposition 2, obtained by taking $A = \emptyset$. As described above, the resulting probability distribution is one in which each customer links to himself, and is thus clustered by himself. This distribution is marginally invariant. From this observation quickly follows the following corollary.

Corollary 2.

The decay function $f = 0$ is the only one for which the resulting distance dependent CRP is marginally invariant over all distances, both sequential and non-sequential.

Proof. Necessity of $f = 0$ for marginal invariance follows from Proposition 3. Sufficiency follows from the fact that the probability distribution on partitions induced by $f = 0$ is the one under which each customer is clustered alone almost surely, which is marginally invariant.

Appendix B: Gibbs sampling for the hyperparameters
To enhance our models, we place a prior on the concentration parameter $\alpha$ and augment our Gibbs sampler accordingly, just as is done in the traditional CRP mixture (Escobar and West, 1995). To sample from the posterior of $\alpha$ given the customer assignments $\mathbf{c}$ and data, we begin by noting that $\alpha$ is conditionally independent of the observed data given the customer assignments. Thus, the quantity needed for sampling is
$$p(\alpha \mid \mathbf{c}) \propto p(\mathbf{c} \mid \alpha)\, p(\alpha),$$
where $p(\alpha)$ is a prior on the concentration parameter.

From the independence of the $c_i$ under the generative process, $p(\mathbf{c} \mid \alpha) = \prod_{i=1}^N p(c_i \mid D, \alpha)$. Normalizing provides
$$p(\mathbf{c} \mid \alpha) = \prod_{i=1}^N \frac{1[c_i = i]\,\alpha + 1[c_i \neq i]\, f(d_{i c_i})}{\alpha + \sum_{j \neq i} f(d_{ij})} \propto \alpha^K \left[\prod_{i=1}^N \left(\alpha + \sum_{j \neq i} f(d_{ij})\right)\right]^{-1},$$
where $K$ is the number of self-links $c_i = i$ in the customer assignments $\mathbf{c}$. Although $K$ is equal to the number of tables $|z(\mathbf{c})|$ when distances are sequential, $K$ and $|z(\mathbf{c})|$ generally differ when distances are non-sequential. Then,
$$p(\alpha \mid \mathbf{c}) \propto \alpha^K \left[\prod_{i=1}^N \left(\alpha + \sum_{j \neq i} f(d_{ij})\right)\right]^{-1} p(\alpha). \tag{17}$$

Eq. (17) reduces further in the following special case: $f$ is the window decay function, $f(d) = 1[d < a]$; $d_{ij} = i - j$ for $i > j$; and distances are sequential, so $d_{ij} = \infty$ for $i < j$. In this case, $\sum_{j=1}^{i-1} f(d_{ij}) = (i - 1) \wedge (a - 1)$, where $\wedge$ is the minimum operator, and
$$\prod_{i=1}^N \left(\alpha + \sum_{j=1}^{i-1} f(d_{ij})\right) = (\alpha + a - 1)^{[N - a]^+}\, \Gamma(\alpha + a \wedge N) / \Gamma(\alpha), \tag{18}$$
where $[N - a]^+ = \max(0, N - a)$ is the positive part of $N - a$. Then,
$$p(\alpha \mid \mathbf{c}) \propto \frac{\Gamma(\alpha)}{\Gamma(\alpha + a \wedge N)}\, \alpha^K (\alpha + a - 1)^{-[N - a]^+} p(\alpha). \tag{19}$$

If we use the identity decay function, which results in the traditional CRP, then we recover an expression from Antoniak (1974): $p(\alpha \mid \mathbf{c}) \propto \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)}\, \alpha^K p(\alpha)$. This expression is used in Escobar and West (1995) to sample exactly from the posterior of $\alpha$ when the prior is gamma distributed.

In general, if the prior on $\alpha$ is continuous, then it is difficult to sample exactly from the posterior of Eq. (17). There are a number of ways to address this. We may, for example, use the Griddy-Gibbs method (Ritter and Tanner, 1992). This method entails evaluating Eq. (17) on a finite set of points, approximating the inverse cdf of $p(\alpha \mid \mathbf{c})$ using these points, and transforming a uniform random variable with this approximation to the inverse cdf.
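The following is a minimal sketch of such a Griddy-Gibbs update for $\alpha$, specialized to the window-decay posterior of Eq. (19). The grid, the Gamma(2, 1) prior, and the values of $K$, $N$, and $a$ are illustrative assumptions, not from the paper.

```python
import numpy as np
from scipy.special import gammaln

def sample_alpha(K, N, a, grid, log_prior, rng):
    """Draw alpha approximately from Eq. (19), evaluated on a finite grid."""
    logp = (gammaln(grid) - gammaln(grid + min(a, N))   # Gamma(alpha) / Gamma(alpha + a ^ N)
            + K * np.log(grid)                          # alpha^K
            - max(0, N - a) * np.log(grid + a - 1.0)    # (alpha + a - 1)^{-[N - a]^+}
            + log_prior(grid))
    p = np.exp(logp - logp.max())                       # stable unnormalized density
    cdf = np.cumsum(p) / p.sum()                        # discrete approximation of the cdf
    return grid[np.searchsorted(cdf, rng.uniform())]    # invert it at a uniform draw

rng = np.random.default_rng(0)
grid = np.linspace(0.01, 10.0, 1000)
log_gamma_prior = lambda x: np.log(x) - x               # Gamma(2, 1) prior, unnormalized
print(sample_alpha(K=12, N=100, a=5, grid=grid, log_prior=log_gamma_prior, rng=rng))
```

Because Eq. (19) is only known up to a constant, working on the log scale and subtracting the maximum before exponentiating keeps the grid evaluation numerically stable.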
We may also sample over any hyperparameters in the decay function (e.g., the window size in the window decay function, or the rate parameter in the exponential decay function) within our Gibbs sampler. For the rest of this section, we use $a$ to generically denote a hyperparameter in the decay function, and we make this dependence explicit by writing $f(d, a)$.

To describe Gibbs sampling over these hyperparameters in the decay function, we first write
$$p(\mathbf{c} \mid \alpha, a) = \prod_{i=1}^N \frac{1[c_i = i]\,\alpha + 1[c_i \neq i]\, f(d_{i c_i}, a)}{\alpha + \sum_{j=1}^{i-1} f(d_{ij}, a)} = \alpha^K \left[\prod_{i : c_i \neq i} f(d_{i c_i}, a)\right] \left[\prod_{i=1}^N \left(\alpha + \sum_{j=1}^{i-1} f(d_{ij}, a)\right)\right]^{-1}.$$

Since $a$ is conditionally independent of the observed data given $\mathbf{c}$ and $\alpha$, to sample over $a$ in our Gibbs sampler it is enough to know the density
$$p(a \mid \mathbf{c}, \alpha) \propto \left[\prod_{i : c_i \neq i} f(d_{i c_i}, a)\right] \left[\prod_{i=1}^N \left(\alpha + \sum_{j=1}^{i-1} f(d_{ij}, a)\right)\right]^{-1} p(a \mid \alpha). \tag{20}$$
In many cases our prior $p(a \mid \alpha)$ on $a$ will not depend on $\alpha$.

In the case of the window decay function with sequential distances and $d_{ij} = i - j$ for $i > j$, we can simplify this further as we did above with Eq. (18). Noting that $\prod_{i : c_i \neq i} f(d_{i c_i}, a)$ will be $1$ for those $a > \max_i (i - c_i)$, and $0$ for other $a$, we have
$$p(a \mid \mathbf{c}, \alpha) \propto \frac{\Gamma(\alpha)}{\Gamma(\alpha + a \wedge N)}\, p(a \mid \alpha)\, 1\!\left[a > \max_i (i - c_i)\right] (\alpha + a - 1)^{-[N - a]^+}. \tag{21}$$

If the prior distribution on $a$