Conditional Variational Inference with Adaptive Truncation for Bayesian Nonparametric Models
J. Y. Liu and Xinghao Qiao
Department of Statistics, London School of Economics, London WC2A 2AE, U.K. [email protected], [email protected]
Abstract
The scalable inference of Bayesian nonparametric models with big data remains challenging. Current variational inference methods fail to characterise the correlation structure among latent variables because of the mean-field setting, and cannot infer the true posterior dimension because of the universal truncation. To overcome these limitations, we build a general framework to infer Bayesian nonparametric models by maximising the proposed nonparametric evidence lower bound, and then develop a novel approach by combining Monte Carlo sampling with the stochastic variational inference framework. Our method has several advantages over the traditional online variational inference method. First, it achieves a smaller divergence between the variational distributions and the true posterior by factorising the variational distributions under the conditional setting instead of the mean-field setting, so as to capture the correlation pattern. Second, it reduces the risk of underfitting or overfitting by truncating the dimension adaptively rather than using a prespecified truncated dimension for all latent variables. Third, it reduces the computational complexity by approximating the posterior functionally instead of updating the stick-breaking parameters individually. We apply the proposed method to hierarchical Dirichlet process and gamma–Dirichlet process models, two essential Bayesian nonparametric models in topic analysis. The empirical study on three large datasets, arXiv, New York Times and Wikipedia, reveals that our proposed method substantially outperforms its competitor in terms of lower perplexity and much clearer topic–word clustering.

Some key words: Gibbs sampling; Hierarchical Dirichlet process; Nonparametric evidence lower bound; Stochastic variational inference; Topic modelling.
Bayesian nonparametric models, which differ from parametric models by relaxing the fixed-dimension assumption, are widely used in bioinformatics, language processing, computer vision and network analysis (Dunson & Park, 2008; Sudderth & Jordan, 2009; Caron & Fox, 2017; Ranganath & Blei, 2018). For example, in natural language processing, Teh et al. (2006) develop the hierarchical Dirichlet process, which extends the latent Dirichlet allocation model (Blei et al., 2003) from a nonparametric perspective. The hierarchical Dirichlet process is defined on a countably infinite-dimensional simplex, replacing the finite-dimensional Dirichlet distribution in latent Dirichlet allocation. Within such a model, the number of topics is regarded as a random variable instead of a fixed value and hence can be inferred from the data.

The inference of Bayesian nonparametric models is more complicated than that of their parametric counterparts. Due to the infinite-dimensional nature of Bayesian nonparametric models, a finite-dimensional truncation is needed to approximate the posterior. However, the selection of the optimal truncation level poses extra challenges. Traditional Markov chain Monte Carlo methods (Teh et al., 2006; Papaspiliopoulos & Roberts, 2008) can produce an adaptive selection of the truncated dimension but are not computationally scalable, especially for big data. On the other hand, standard variational inference methods (Teh et al., 2008; Wang et al., 2011; Hoffman et al., 2013; Roychowdhury & Kulis, 2015) can accelerate the computation but suffer from a universal selection of the truncation level, that is, truncating the dimension of all latent variables at a prespecified value. A subjective selection of a fixed truncation level leads to low predictive accuracy due to possible overfitting or underfitting. In this sense, such a universal truncation method contradicts the motivation and advantages of using Bayesian nonparametric models.

In this paper, we propose a general framework with novel and efficient algorithms to infer a large class of Bayesian nonparametric models in the following steps. First, we derive the nonparametric evidence lower bound based on finite and measurable partitions. Second, we propose the conditional setting when factorising the variational distributions, by letting variables in the middle layers be conditional on the two adjacent layers. Third, to handle big data, we develop the corresponding stochastic variational inference framework (Hoffman et al., 2013; Blei et al., 2017) under our conditional setting. Finally, within our framework, we adopt Monte Carlo sampling to generate samples for the local latent variables, and further update the variational parameters for the global latent variables based on the empirical distribution generated from these samples. Meanwhile, we truncate the dimension of the variational distributions to that of the empirical distribution.

Our proposed method, named conditional variational inference with adaptive truncation, benefits from both the accuracy of Monte Carlo sampling and the efficiency of variational inference as follows. First, our method rebuilds the correlation structure and hence attains a smaller divergence between the variational distribution and the true posterior. Such a procedure removes the unrealistic mean-field assumption and searches for an optimal variational distribution over a wider family. Second, our method assigns a probability of increasing the dimension of the variational distributions that adapts to the goodness-of-fit. As the inference proceeds, it reaches a stable level balancing the goodness-of-fit and the model complexity. Therefore, it provides an adaptive selection of the truncated dimension and reduces the risk of overfitting or underfitting. Finally, our method achieves better prediction without sacrificing computational efficiency. With the optimal variational distributions for the global variables, the local Markov chain converges fast, as demonstrated in our empirical study.

To assess the empirical performance of the proposed method, we develop detailed algorithms for the hierarchical Dirichlet process model and the gamma–Dirichlet process model (Jordan, 2010), and apply them to the topic analysis of three large datasets: arXiv, New York Times and
Wikipedia. The results show that the algorithms for our proposed method consistently outperform traditional online variational inference (Wang et al., 2011) in the three examples by substantially reducing the hold-out perplexity. Furthermore, our method gives a much clearer topic–word clustering by removing replicated topics and providing room to add new topics. We provide the code at https://github.com/yiruiliu110/ConditionalVI.

Suppose that $(\Omega, \mathcal{F})$ is a Polish sample space, $\Theta$ is the set of all bounded measures on $(\Omega, \mathcal{F})$ and $\mathcal{M}$ is a $\sigma$-algebra on $\Theta$. A random measure $G$ on $(\Omega, \mathcal{F})$ is a transition kernel from $(\Theta, \mathcal{M})$ into $(\Omega, \mathcal{F})$ such that (i) $G \mapsto G(A)$ is $\mathcal{M}$-measurable for any $A \in \mathcal{F}$ and (ii) $A \mapsto G(A)$ is a measure for any realisation of $G$ (Ghosal & Van der Vaart, 2017). For example, a Dirichlet process $P$ (Ferguson, 1973) with base measure $P_0$ satisfies
$$ \bigl(P(A_1), P(A_2), \ldots, P(A_n)\bigr) \sim \mathrm{Dirichlet}\bigl(P_0(A_1), P_0(A_2), \ldots, P_0(A_n)\bigr) $$
for any partition $\Omega_1 = (A_1, \ldots, A_n)$ of $\Omega$, that is, a finite number of measurable, nonempty and disjoint sets such that $\bigcup_{i=1}^{n} A_i = \Omega$. The Dirichlet process is denoted by $P \sim \mathrm{DP}(P_0)$ or $P \sim \mathrm{DP}(\alpha H)$ with prior precision $\alpha = P_0(\Omega)$ and centre measure $H = \alpha^{-1} P_0$. Moreover, a random measure is called a completely random measure (Kingman, 1993) if it also satisfies (iii) $P(A_i)$ is independent of $P(A_j)$ for any disjoint subsets $A_i$ and $A_j$ in $\Omega$. See Appendix A.1 for a short review. Completely random measures and their normalisations (Regazzini et al., 2003), for example the gamma process and the Dirichlet process respectively, are commonly used as priors for infinite-dimensional latent variables in Bayesian nonparametric models, because their realisations are atomic measures with a countable-dimensional support.

As an important subclass of Bayesian nonparametric models, hierarchical Bayesian nonparametric models use random measures as priors in multiple layers. Consider the following model,
$$ G_0 \mid H \sim P(H), \qquad \beta \mid \lambda \sim p(\beta \mid \lambda), \qquad G_j \mid G_0 \sim R(G_0) \quad (j = 1, \ldots, J), $$
$$ z_{ji} \mid G_j \sim G_j, \qquad x_{ji} \mid z_{ji} \sim f(x_{ji} \mid \beta, z_{ji}) \quad (i = 1, \ldots, N_j;\ j = 1, \ldots, J), \qquad (1) $$
whose two-layer hierarchical structure is summarised in Figure 1.

Figure 1: Hierarchical structure in Bayesian nonparametric models. The blue and red boxes correspond to $J$ and $N_j$ replicates, respectively.

Specifically, in the top layer, $G_1, \ldots, G_J$ are generated from a random measure $R$ with common base measure $G_0$, while in the bottom layer, $G_0$ itself is a realisation of a random measure $P$ with base measure $H$. To ensure exchangeability, $G_1, \ldots, G_J$ are assumed to be identically and independently distributed given $G_0$. The global parameter $\beta$ is assigned a prior $p(\beta \mid \lambda)$. In addition, each local latent variable $z_{ji}$ is sampled from $G_j$ independently, and the observation $x_{ji}$ is generated from a likelihood function $f$, which is parameterised by both the global latent variable $\beta$ and the local latent variable $z_{ji}$.

We next illustrate the necessity of the hierarchical structure in Bayesian nonparametric models, using the example of the hierarchical Dirichlet process model in topic modelling (Teh et al., 2006), where $P$ and $R$ in (1) are both Dirichlet processes,
$$ G_0 \mid H \sim \mathrm{DP}(\alpha H), \qquad G_j \mid G_0 \sim \mathrm{DP}(\gamma G_0) \quad (j = 1, \ldots, J). \qquad (2) $$
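Model (2) can be simulated forwards with a truncated stick-breaking construction. The sketch below is ours, not taken from the paper's released code; the truncation level, the symmetric Dirichlet base measure over the vocabulary and all function names are illustrative assumptions.

```python
import numpy as np

def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights for a Dirichlet process."""
    betas = rng.beta(1.0, concentration, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

def simulate_hdp(alpha, gamma, n_docs, n_words, vocab_size, truncation=50, seed=0):
    """Forward-simulate the two-layer model (2) under a finite truncation."""
    rng = np.random.default_rng(seed)
    # Top layer: G0 ~ DP(alpha * H); atoms phi_k drawn from H, here Dirichlet(1) over the vocabulary.
    g0_weights = stick_breaking(alpha, truncation, rng)
    topics = rng.dirichlet(np.ones(vocab_size), size=truncation)
    docs = []
    for _ in range(n_docs):
        # G_j | G0 ~ DP(gamma * G0): on the truncated atoms this is approximated by a finite Dirichlet.
        gj_weights = rng.dirichlet(gamma * g0_weights + 1e-12)
        z = rng.choice(truncation, size=n_words, p=gj_weights)       # topic assignments z_ji
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return g0_weights, topics, docs

g0, beta, corpus = simulate_hdp(alpha=1.0, gamma=1.0, n_docs=3, n_words=20, vocab_size=30)
print(len(corpus), corpus[0][:10])
```

Because every $G_j$ re-uses the atoms of $G_0$, the simulated documents share topics, which is exactly the property a diffuse base measure would destroy.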
Suppose a corpus has $J$ documents, each document $j$ has $N_j$ words, and each word is chosen from a vocabulary with $W$ terms. We describe the generative model as follows. First, $G_0 = \sum_{k=1}^{\infty} G_{0k} \delta_{\phi_k}$ is generated from $\mathrm{DP}(\alpha H)$, and for each document $j$ a topic proportion $G_j = \sum_{k=1}^{\infty} G_{jk} \delta_{\phi_k}$ is independently sampled from $\mathrm{DP}(\gamma G_0)$. Second, for any topic $k$, the within-topic word distribution is drawn from a $W$-dimensional Dirichlet distribution parameterised by $\eta$, $\beta_k \sim \mathrm{Dir}(\eta)$. Third, for each word $i$ in document $j$, a topic assignment $z_{ji}$ is allocated by $z_{ji} \sim \mathrm{Multinomial}(G_j)$, where $z_{ji}$ represents topic $k$ if $z_{ji} = \phi_k$. Finally, the observation $x_{ji}$ is independently generated from the assigned topic and the corresponding within-topic word distribution, $x_{ji} \mid \{z_{ji} = \phi_k\} \sim \mathrm{Multinomial}(\beta_k)$. Within such a hierarchical Dirichlet process model, if in the top layer $G_1, \ldots, G_J$ were sampled from a Dirichlet process with a diffuse base measure instead of an atomic $G_0$, the supports of $G_1, \ldots, G_J$ would not overlap almost surely, resulting in no sharing of topics among different documents. To solve this issue, we let the base measure $G_0$ have an atomic and infinite-dimensional support, for example by assigning a Dirichlet process prior to $G_0$.

Generally speaking, it is not necessary to restrict the prior for $G_0$ to be a Dirichlet process or another probability random measure. The essential point here is to equip $G_0$ with an infinite-dimensional and atomic support. Therefore, other completely random measures and their normalisations can also be used as priors for $G_0$. For example, the gamma–Dirichlet process model (Jordan, 2010),
$$ G_0 \mid H \sim \Gamma\mathrm{P}(\alpha H), \qquad G_j \mid G_0 \sim \mathrm{DP}(G_0) \quad (j = 1, \ldots, J). \qquad (3) $$
The gamma–Dirichlet process allows a more flexible model by removing the constraint on the prior precision in the top layer. Other choices of prior for $G_0$ include the beta process, the $\sigma$-stable process and the inverse Gaussian process (Ghosal & Van der Vaart, 2017).

The object of variational inference is to minimise the divergence between the variational distribution and the true posterior. For infinite-dimensional random measures, the Kullback–Leibler divergence is well defined even though the corresponding density function does not exist with respect to Lebesgue measure. Suppose two random measures $P$ and $Q$ from $(\Theta, \mathcal{M})$ into $(\Omega, \mathcal{F})$ are such that the Radon–Nikodym derivative $dQ/dP$ exists, that is, $Q$ is absolutely continuous with respect to $P$. Their Kullback–Leibler divergence is defined as
$$ \mathrm{KL}(Q \,\|\, P) = \int_{\Theta} \log \frac{dQ}{dP}\, dQ, $$
which is computationally intractable due to the infinite-dimensional integral. By contrast, we calculate this divergence via the limit superior of the divergences between the corresponding finite-dimensional induced measures, that is,
$$ \mathrm{KL}(Q \,\|\, P) = \limsup_{\Omega_1} \mathrm{KL}(q^{\Omega_1} \,\|\, p^{\Omega_1}), \qquad (4) $$
where $p^{\Omega_1}$ and $q^{\Omega_1}$ are the measures induced by $P$ and $Q$, respectively, on a finite-dimensional partition $\Omega_1 = (A_1, \ldots, A_n)$, such that $p^{\Omega_1}(A_i) = P(A_i)$ and $q^{\Omega_1}(A_i) = Q(A_i)$ for each $A_i \in \Omega_1$. With an induced random variable $Z^{\Omega_1}: \Theta \to \mathbb{R}^n$, we can also denote the induced measures by $p(Z^{\Omega_1})$ and $q(Z^{\Omega_1})$. The result in (4) is justified in Appendix A.2.
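Equation (4) replaces an intractable infinite-dimensional integral by divergences between induced finite-dimensional measures. As a concrete illustration (ours, with arbitrary numerical values), the measures induced on a partition by two Dirichlet processes are ordinary Dirichlet distributions, whose Kullback–Leibler divergence is available in closed form.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(a, b):
    """KL( Dirichlet(a) || Dirichlet(b) ) in closed form."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return (gammaln(a.sum()) - gammaln(b.sum())
            - np.sum(gammaln(a) - gammaln(b))
            + np.sum((a - b) * (digamma(a) - digamma(a.sum()))))

# Q = DP(alpha_q * H_q) and P = DP(alpha_p * H_p) induce Dirichlet distributions on any
# finite partition (A_1, ..., A_n); (4) approaches KL(Q || P) through such induced divergences.
alpha_q, alpha_p = 5.0, 1.0
H_q = np.array([0.2, 0.3, 0.5])        # H_q(A_1), H_q(A_2), H_q(A_3)
H_p = np.array([1 / 3, 1 / 3, 1 / 3])
print(dirichlet_kl(alpha_q * H_q, alpha_p * H_p))
```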
The parametric variational inference algorithm uses a finite-dimensional variational distribution to approximate the true posterior by maximising the evidence lower bound (Blei et al., 2017), while for nonparametric models we need to use a random measure as the variational distribution due to the infinite dimensionality of the latent variables. Based on the Kullback–Leibler divergence between random measures in (4), we propose a general variational inference framework for Bayesian nonparametric models by defining the corresponding nonparametric evidence lower bound as
$$ \mathrm{NPELBO} = \liminf_{\Omega_1} \Bigl[ E_{q(Z^{\Omega_1})}\bigl\{\log p(X, Z^{\Omega_1})\bigr\} - E_{q(Z^{\Omega_1})}\bigl\{\log q(Z^{\Omega_1})\bigr\} \Bigr], \qquad (5) $$
where $p(X, Z^{\Omega_1})$ and $q(Z^{\Omega_1})$ are the measures induced on $\Omega_1$ by the joint distribution and the variational distribution, with $X$ and $Z$ denoting the observations and the latent variables, respectively. Given the result that
$$ \mathrm{KL}\bigl(Q(Z) \,\|\, P(Z \mid X)\bigr) + \mathrm{NPELBO} = \log p(X), \qquad (6) $$
our proposed framework considers maximising the nonparametric evidence lower bound in (5), which is equivalent to minimising the Kullback–Leibler divergence between the variational distribution $Q(Z)$ and the true posterior $P(Z \mid X)$. See Appendix A.3 for the proof of equation (6). To simplify the notation, we use $p(\cdot)$ and $q(\cdot)$ to denote the true and variational distributions, respectively, where the context is clear.

The hierarchical Bayesian nonparametric model in (1) has multiple layers, and hence $Z$ in (5) includes several latent variables: the global latent variable $\beta$, the local latent variables $\{z_{ji}\}_{1 \le j \le J,\, 1 \le i \le N_j}$, the global prior $G_0$, and the local priors $\{G_j\}_{1 \le j \le J}$. To factorise the variational distribution $q(\beta, \{z_{ji}\}, G_0, \{G_j\})$, traditional variational inference algorithms typically consider the mean-field setting,
$$ q(\beta, \{z_{ji}\}, G_0, \{G_j\}) = q(\beta)\, q(G_0) \prod_{j=1}^{J} q(G_j) \prod_{j=1}^{J} \prod_{i=1}^{N_j} q(z_{ji}), $$
where variables in different layers are assumed to be independent. However, this assumption is not valid in nonparametric variational inference because the independence between $\{G_j\}_{1 \le j \le J}$ and $G_0$ contradicts the fact that the support of each $G_j$ is fully determined by $G_0$. As the updates of $q(G_j)$ and $q(G_0)$ are independent during the iterations, they are likely to have different supports, which contradicts their definitions. Moreover, the mean-field assumption fails to account for the possibly high correlation among $G_0$, $\{G_j\}$ and $\{z_{ji}\}$. In contrast to traditional variational inference under the mean-field setting, we factorise the variational distribution as
$$ q\bigl(\beta, \{z_{ji}\}, G_0, \{G_j\}\bigr) = q(\beta)\, q(G_0) \prod_{j=1}^{J} q(G_j \mid G_0, z_j) \prod_{j=1}^{J} \prod_{i=1}^{N_j} q(z_{ji}), \qquad (7) $$
in the sense of the probability law. On the one hand, our conditional setting eliminates the contradiction in the mean-field setting, because we consider the variational distribution of $G_j$ conditional on $G_0$, which ensures that $G_j$ shares the same support as $G_0$. On the other hand, such a conditional design facilitates the recovery of the dependence structure among $G_0$, $\{G_j\}$ and $\{z_{ji}\}$.

Combining (5) and (7), our proposed conditional variational inference seeks to maximise the following nonparametric evidence lower bound,
$$ \mathrm{NPELBO} = \liminf_{\Omega_1} \Bigl[ E_{q(\beta, \{z_j\}, G_0^{\Omega_1}, \{G_j^{\Omega_1}\})}\bigl\{\log p(\{x_j\}, \beta, \{z_j\}, G_0^{\Omega_1}, \{G_j^{\Omega_1}\})\bigr\} - E_{q(G_0^{\Omega_1})}\bigl\{\log q(G_0^{\Omega_1})\bigr\} - E_{q(\beta)}\bigl\{\log q(\beta)\bigr\} $$
$$ \qquad - \sum_{j=1}^{J} \sum_{i=1}^{N_j} E_{q(z_{ji})}\bigl\{\log q(z_{ji})\bigr\} - \sum_{j=1}^{J} E_{q(G_0^{\Omega_1})} E_{q(z_j)} E_{q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)}\bigl\{\log q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)\bigr\} \Bigr], \qquad (8) $$
where $x_j = \{x_{ji}\}_{1 \le i \le N_j}$, $z_j = \{z_{ji}\}_{1 \le i \le N_j}$, and $\Omega_1$ is a partition of the sample space $\Omega$ for $G_0$ and $\{G_j\}_{1 \le j \le J}$.

To maximise the nonparametric evidence lower bound in (8), we first seek the optimal variational distribution of $G_j$ given $G_0$ and $z_j$ for each $j$. As
$$ p\bigl(\{x_j\}, \beta, \{z_j\}, G_0^{\Omega_1}, \{G_j^{\Omega_1}\}\bigr) = p(\beta)\, p(G_0^{\Omega_1}, \{z_j\}) \prod_{j=1}^{J} p(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)\, p(x_j \mid \beta, z_j), $$
the non-constant term in (8) with respect to $q(G_j \mid G_0, z_j)$ is
$$ \liminf_{\Omega_1} \Bigl[ \sum_{j=1}^{J} E_{q(G_0^{\Omega_1})} E_{q(z_j)} E_{q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)}\bigl\{ \log p(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j) - \log q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j) \bigr\} \Bigr]. $$
It is worth noting that the above expression can be viewed as the negative of a Kullback–Leibler divergence, whose maximum is zero. Therefore, the optimal conditional variational distribution for $G_j$ is $q(G_j \mid G_0, z_j) = p(G_j \mid G_0, z_j)$, since the divergence equals zero if and only if $q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j) = p(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)$ for any partition $\Omega_1$. This result is also intuitive because the best variational distribution to approximate the posterior given the other variables is the conditional posterior itself. Benefiting from the conjugacy in Bayesian nonparametric models, the analytical form of such a conditional posterior is easy to derive.
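For instance, when $G_j \mid G_0 \sim \mathrm{DP}(\gamma G_0)$ and $z_{j1}, \ldots, z_{jN}$ are drawn from $G_j$, the conditional posterior is the Dirichlet process $\mathrm{DP}(\gamma G_0 + \sum_i \delta_{z_{ji}})$, and on the partition induced by the atoms it reduces to an ordinary Dirichlet distribution. A minimal sketch of this conjugacy, with our own variable names and toy numbers:

```python
import numpy as np

def conditional_gj_params(gamma, g0_weights, counts):
    """Dirichlet parameters of q(G_j | G0, z_j) = DP(gamma*G0 + sum_i delta_{z_ji})
    induced on the partition (phi_1, ..., phi_K, phi_0)."""
    # counts[k] = number of z_ji equal to atom phi_k; the complement set phi_0 has no counts.
    return np.append(gamma * np.asarray(g0_weights) + np.asarray(counts),
                     gamma * (1.0 - np.sum(g0_weights)))

rng = np.random.default_rng(1)
g0 = np.array([0.5, 0.3, 0.1])          # masses of G0 on phi_1..phi_3 (0.1 remains on phi_0)
counts = np.array([7, 2, 0])            # observed assignments in document j
params = conditional_gj_params(gamma=2.0, g0_weights=g0, counts=counts)
print(params, rng.dirichlet(params))    # one draw of (G_j(phi_1), ..., G_j(phi_0))
```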
We then implement a coordinate ascent approach by iterating the following three steps until convergence. The first step obtains the optimal $q(G_0)$ conditional on the other parameters. To achieve this, in Appendix A.4 we rely on (8) to derive the evidence lower bound under $\Omega_1$ with respect to $q(G_0^{\Omega_1})$,
$$ \mathrm{ELBO}_{\Omega_1} = E_{q(G_0^{\Omega_1})}\Bigl\{ \log \frac{p(G_0^{\Omega_1})}{q(G_0^{\Omega_1})} \Bigr\} + \sum_{j=1}^{J} E_{q(G_0^{\Omega_1})} E_{q(z_j)} \Bigl[ \log E_{p(G_j^{\Omega_1} \mid G_0^{\Omega_1})}\bigl\{ p(z_j \mid G_j^{\Omega_1}) \bigr\} \Bigr] + \text{constant}, \qquad (9) $$
where $E_{p(G_j^{\Omega_1} \mid G_0^{\Omega_1})}$ is taken with respect to the prior distribution $p(G_j^{\Omega_1} \mid G_0^{\Omega_1})$ instead of the variational distribution. Consequently, this expectation can be calculated easily thanks to its analytical representation. As the nonparametric evidence lower bound is $\mathrm{NPELBO} = \liminf_{\Omega_1} (\mathrm{ELBO}_{\Omega_1})$, if we can find a random measure $q(G_0)$ whose induced measure $q^{\Omega_1}(G_0)$ satisfies
$$ q(G_0^{\Omega_1}) \propto p(G_0^{\Omega_1}) \exp\Bigl( \sum_{j=1}^{J} E_{q(z_j)} \bigl[ \log E_{p(G_j^{\Omega_1} \mid G_0^{\Omega_1})}\{ p(z_j \mid G_j^{\Omega_1}) \} \bigr] \Bigr) \qquad (10) $$
for any partition $\Omega_1$, then this $q(G_0)$ is the optimal variational random measure. In cases where it is difficult to find a simple random measure satisfying (10), we restrict the variational distribution to a special family and optimise its parameters. Provided with the updated $q(G_0)$ and the other parameters, the second step optimises the variational distribution for $z_j$ in the form
$$ q(z_j) \propto \exp\Bigl( E_{q(G_0)} \bigl[ \log E_{p(G_j \mid G_0)}\{ p(z_j \mid G_j) \} \bigr] + E_{q(\beta)}\bigl\{ \log p(x_j \mid z_j, \beta) \bigr\} \Bigr). \qquad (11) $$
Finally, the optimal variational distribution for the global latent variable $\beta$ given the other updated parameters is
$$ q(\beta) \propto p(\beta) \exp\Bigl[ \sum_{j=1}^{J} E_{q(z_j)}\bigl\{ \log p(x_j \mid z_j, \beta) \bigr\} \Bigr]. \qquad (12) $$

Whereas the coordinate ascent formulas in (10)–(12) provide a general framework, they are difficult to implement directly, especially for big data, because updating all local latent variables in each iteration is not computationally efficient. By contrast, stochastic variational inference (Hoffman et al., 2013) is widely used in practice, where the computation is accelerated by randomly selecting a small batch of data and iteratively updating the parameters with a random but unbiased gradient of the evidence lower bound. Specifically, for an evidence lower bound $\mathrm{ELBO}(\xi)$ viewed as a function of a parameter $\xi$, if there exists a random function $h(\xi)$ satisfying $E\{h(\xi)\} = \mathrm{ELBO}(\xi)$, then $\xi$ can be updated in the $\tau$-th iteration by $\xi^{(\tau)} = \xi^{(\tau-1)} + \rho_\tau \nabla h(\xi^{(\tau-1)})$, where the step size $\rho_\tau$ satisfies the Robbins–Monro condition (Robbins & Monro, 1951). For hierarchical Bayesian nonparametric models, the traditional stochastic variational inference methods suffer from the mean-field assumption and the universal truncation (Hoffman et al., 2013; Wang et al., 2011).
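The following toy sketch illustrates the stochastic-ascent recursion with a Robbins–Monro step size. The schedule $(\tau_0 + t)^{-\kappa}$, the target function and all numerical values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def robbins_monro_steps(tau0=1.0, kappa=0.6, n_steps=1000):
    """Step sizes rho_t = (tau0 + t)^(-kappa); for 0.5 < kappa <= 1 they satisfy the
    Robbins–Monro condition: sum rho_t = infinity, sum rho_t^2 < infinity."""
    return [(tau0 + t) ** (-kappa) for t in range(n_steps)]

def stochastic_ascent(xi_init, noisy_gradient, n_steps=1000, seed=0):
    """Generic update xi <- xi + rho_t * h(xi), where h is an unbiased noisy gradient."""
    rng = np.random.default_rng(seed)
    xi = np.asarray(xi_init, float)
    for rho in robbins_monro_steps(n_steps=n_steps):
        xi = xi + rho * noisy_gradient(xi, rng)
    return xi

# Toy objective -0.5 * ||xi - 3||^2 with noisy gradients (3 - xi + noise); xi converges near 3.
print(stochastic_ascent(np.zeros(2), lambda xi, rng: (3.0 - xi) + rng.normal(scale=0.5, size=2)))
```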
To overcome these disadvantages, we propose a new approach that integrates a Monte Carlo sampling scheme into the stochastic variational inference framework under the conditional variational setting, namely conditional variational inference with adaptive truncation. The proposed method not only benefits from the fast speed of stochastic variational inference but also overcomes the challenges of nonparametric variational inference discussed in Section 3.3. Moreover, it can automatically truncate the dimension of the variational distributions in an adaptive fashion. We show the detailed procedures in Sections 4.2 and 4.3.

Under the conditional setting, we rely on the conditional variational inference framework in Section 3.4 to infer the global variables, while approximating the optimal distributions of the local variables via Monte Carlo sampling instead of analytical optimisation. For the variational inference part, we approximate the posterior distribution for the global prior $G_0$ and the global latent variable $\beta$. From the entire data $x = \{x_1, \ldots, x_J\}$, we randomly sample a subset $\{x_s : x_s \in x\}_{s=1}^{S}$, where $S$ is the batch size with $S \ll J$. Assuming that a partition $\Omega_1$ is given to obtain the limit inferior of the nonparametric evidence lower bound, we aim to update the parameters for $q(G_0^{\Omega_1})$ conditional on the updated $q(\beta)$ and $\{q(z_s)\}_{s=1}^{S}$. While standard stochastic variational inference updates the parameters analytically, we draw $T_s$ samples from $q(z_s)$ for each $z_s$ in the batch, $\hat{z}_s = \{\hat{z}_{s,t} : \hat{z}_{s,t} \sim q(z_s)\}_{t=1}^{T_s}$, so as to obtain a random nonparametric evidence lower bound with respect to $q(G_0^{\Omega_1})$,
$$ \widehat{\mathrm{NPELBO}} = E_{q(G_0^{\Omega_1})}\Bigl[ \log \frac{p(G_0^{\Omega_1})}{q(G_0^{\Omega_1})} + \frac{J}{S} \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \log E_{p(G_s^{\Omega_1} \mid G_0^{\Omega_1})}\bigl\{ p(\hat{z}_{s,t} \mid G_s^{\Omega_1}) \bigr\} \Bigr] + \text{constant}. \qquad (13) $$
It is obvious that $E(\widehat{\mathrm{NPELBO}}) = \mathrm{NPELBO}$ and hence the random gradient is unbiased, which satisfies the key condition for stochastic variational inference. Therefore, according to (13), we can use the random gradient generated from $\hat{z}_s$ to update the parameters of $q(G_0^{\Omega_1})$. Analogously, the random nonparametric evidence lower bound with respect to $q(\beta)$ is
$$ \widehat{\mathrm{NPELBO}} = E_{q(\beta)}\Bigl\{ \log \frac{p(\beta)}{q(\beta)} + \frac{J}{S} \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \log p(x_s \mid \hat{z}_{s,t}, \beta) \Bigr\} + \text{constant}, \qquad (14) $$
and we can update its parameters with the corresponding random gradient in a similar way.

For the Monte Carlo sampling part, given the updated $q(G_0^{\Omega_1})$ and $q(\beta)$ from (13) and (14), we draw samples $\hat{z}_s$ for each $z_s$ in the batch using Markov chain Monte Carlo. It is difficult to obtain a closed-form formula for the optimal $q(z_s)$ due to the lack of conjugacy between $G_0$ and $z_s$. Moreover, since $G_s$ is integrated out, the local latent variables $\{z_{si}\}_{i=1}^{N_s}$ are conditionally dependent and cannot be sampled independently. Therefore, we propose the following Gibbs sampling approach to obtain samples from the optimal variational distribution. Conditional on $q(G_0^{\Omega_1})$, $q(\beta)$ and the samples $\hat{z}_{s,-i} = \{\hat{z}_{sl} : l = 1, \ldots, N_s,\ l \neq i\}$, it follows from (11) that the optimal variational distribution of $q(z_{si})$ is
$$ q(z_{si}) \propto \exp\Bigl\{ E_{q(G_0^{\Omega_1})}\bigl[ \log E_{p(G_s^{\Omega_1} \mid G_0^{\Omega_1})}\{ p(z_{si}, \hat{z}_{s,-i} \mid G_s^{\Omega_1}) \} \bigr] + E_{q(\beta)}\bigl[ \log p(x_{si} \mid z_{si}, \beta) \bigr] \Bigr\}. \qquad (15) $$
Then we sample $\hat{z}_{si} \sim q(z_{si})$ for each $i$ iteratively, which constructs a Markov chain. As $\hat{z}_{si}$ is sampled from the optimised variational distribution conditional on $\hat{z}_{s,-i}$ in (15), the joint distribution generated from the Markov chain converges to the optimal variational distribution, which achieves the maximum nonparametric evidence lower bound. After convergence, we can sample $\hat{z}_{s,1}, \ldots, \hat{z}_{s,T_s}$ from the stable Markov chain.

To maximise the nonparametric evidence lower bound, we iterate the following three steps: (i) randomly select a small batch from the entire data, (ii) sample $\{\hat{z}_s\}_{s=1}^{S}$ by the Monte Carlo method, and (iii) update $q(G_0^{\Omega_1})$ and $q(\beta)$ in the stochastic variational inference framework. Moreover, the partition $\Omega_1$ in our method is data-adaptive, as demonstrated in Section 4.3.

In this section, we illustrate the approach to determine the finite and measurable partition $\Omega_1$ that attains the limit inferior of the nonparametric evidence lower bound. Rather than fixing a universal truncation level, in our framework the dimension of $\Omega_1$ gradually increases to a stable level. This partition, or truncation, depends on the data fitting and is embedded within the optimisation process, which provides another key advantage of integrating the Monte Carlo sampling scheme into the stochastic variational inference framework.

We first define the partition $\Omega_1$. Note that the samples $\{\hat{z}_s\}_{s=1}^{S}$ used to simulate the optimal variational distribution have a finite-dimensional atomic support, denoted by $\phi_1, \ldots, \phi_K$, where $K$ is a finite integer. We therefore partition the sample space $\Omega$ into $K + 1$ sets: $K$ probability mass atoms $\phi_1, \ldots, \phi_K$ and one complement set $\phi_0 = \Omega \setminus \{\phi_1, \ldots, \phi_K\}$. We then update the partition $\Omega_1$ during the inference procedure. If all points in $\{\hat{z}_s\}_{s=1}^{S}$ have been sampled before, we keep the current partition $\Omega_1$. Otherwise, if a sample $\hat{z}_{si} \in \phi_0$, meaning it is distinct from $\phi_1, \ldots, \phi_K$, we draw a new $\phi_{K+1}$ and refine the partition as $(\phi_0, \phi_1, \ldots, \phi_K, \phi_{K+1})$, where we update $\phi_0$ as $\Omega \setminus \{\phi_1, \ldots, \phi_K, \phi_{K+1}\}$.
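A small bookkeeping sketch of this refinement rule (ours, not the paper's code; in Algorithm 1 the refinement happens inside the Gibbs sweep rather than as a separate pass, and because $H$ is diffuse every draw that lands in $\phi_0$ is almost surely a distinct new atom):

```python
def refine_partition(z_batch, n_atoms):
    """Any sampled assignment that falls in the complement set phi_0 (coded as label 0)
    is given a freshly created atom phi_{K+1}, phi_{K+2}, ...; otherwise the partition is kept."""
    for doc in z_batch:
        for i, z in enumerate(doc):
            if z == 0:                 # a previously unseen point, i.e. it lies in phi_0
                n_atoms += 1
                doc[i] = n_atoms       # relabel it as the new atom
    return z_batch, n_atoms

z_batch = [[1, 3, 0, 2], [2, 0, 1, 1]]          # 0 marks draws that fell in phi_0
print(refine_partition(z_batch, n_atoms=3))     # two new atoms are opened; n_atoms becomes 5
```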
With the dynamic partition $\Omega_1$ defined above, the prior is proportional to the posterior on $\phi_0$, $p(G_0(\phi_0)) \propto q(G_0(\phi_0))$, due to the lack of data information. Therefore, the nonparametric evidence lower bound remains constant under any further partition, which means that the partition $\Omega_1$ enables the nonparametric evidence lower bound to attain its limit inferior.

Algorithm 1: Conditional variational inference with adaptive truncation.
    Initialise the partition $\Omega_1$ and the parameters of $q(G_0)$ and $q(\beta)$, and set the step sizes $\{\rho_\tau\}_{\tau \ge 1}$;
    repeat
        Randomly select $x_1, \ldots, x_S$ from the entire dataset;
        for $s \in \{1, \ldots, S\}$ do
            Initialise the values of $\{\hat{z}_{si}\}_{i=1}^{N_s}$;
            repeat
                for $i \in \{1, \ldots, N_s\}$ do
                    Sample $\hat{z}_{si}$ conditional on $q(G_0)$, $q(\beta)$ and $\hat{z}_{s,-i}$ according to (15);
                    if a new $\hat{z}_{si}$ is sampled then refine the partition $\Omega_1$;
            until convergence;
            Sample $\{\hat{z}_{s,t}\}_{t=1}^{T_s}$ from the stable Markov chain;
        Update the parameters of $q(G_0)$ and $q(\beta)$ given the samples $\{\hat{z}_s\}_{s=1}^{S}$ with step size $\rho_\tau$ according to (13) and (14);
    until convergence.

In our framework, we start from a low-dimensional partition when the variational distributions are far from optimal, and then update the partition and gradually increase its dimension according to the data fitting. When the inference is close to convergence with a sufficiently large partition dimension, further refinements of the partition become unlikely and hence the dimension of the variational distributions attains a stable level. This data-adaptive truncation reflects a balance between the goodness-of-fit and the model complexity. We summarise the above inference procedure in Algorithm 1.
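To make the sampling step of Algorithm 1 concrete, the sketch below performs one Gibbs sweep over the assignments of a single document, using weights of the form suggested by (15): a prior term involving the masses of $q(G_0)$ plus the current within-document count, multiplied by an expected word likelihood under $q(\beta_k)$. All variable names, the coding of $\phi_0$ as label 0, and the toy inputs are our own assumptions, not the paper's implementation.

```python
import numpy as np

def gibbs_sweep(z, x, g0_mass, gamma, exp_log_lik, rng):
    """One sweep of the local Gibbs sampler for a single document.
    z: current assignments in {0,...,K} (0 = complement set phi_0), x: word ids,
    g0_mass: masses (m_1,...,m_K, m_0) of q(G0) on the current partition,
    exp_log_lik[k, w]: expected log-likelihood of word w under topic k
    (row K plays the role of the prior-predictive term for a new topic)."""
    K = len(g0_mass) - 1
    counts = np.bincount(z, minlength=K + 1)[1:]             # n_k, excluding phi_0
    for i in range(len(z)):
        if z[i] >= 1:
            counts[z[i] - 1] -= 1                            # remove word i from its topic
        log_w = np.empty(K + 1)
        log_w[1:] = np.log(gamma * g0_mass[:K] + counts) + exp_log_lik[:K, x[i]]
        log_w[0] = np.log(gamma * g0_mass[K]) + exp_log_lik[K, x[i]]
        w = np.exp(log_w - log_w.max())
        z[i] = rng.choice(K + 1, p=w / w.sum())              # resample the assignment
        if z[i] >= 1:
            counts[z[i] - 1] += 1                            # a draw of 0 would trigger a refinement
    return z

rng = np.random.default_rng(0)
K, W, N = 3, 5, 12
g0_mass = np.array([0.4, 0.3, 0.2, 0.1])                     # m_1, m_2, m_3 and m_0
exp_log_lik = np.log(rng.dirichlet(np.ones(W), size=K + 1))
x = rng.integers(0, W, size=N)
z = rng.integers(1, K + 1, size=N)
print(gibbs_sweep(z, x, g0_mass, gamma=1.0, exp_log_lik=exp_log_lik, rng=rng))
```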
Applications in topic modelling
We apply the proposed conditional variational inference with adaptive truncation method to the hierarchical Dirichlet process model. Specifically, we factorise the variational distributions in the conditional setting according to (7) and specify the variational family as follows. First, the variational distribution of $G_s$ for each $s$ is given by $q(G_s \mid G_0, z_s) = \mathrm{DP}\bigl( \sum_{k=1}^{K} n_{sk} \delta_{\phi_k} + \gamma G_0 \bigr)$, where $n_{sk} = \sum_{i=1}^{N_s} I(z_{si} = \phi_k)$ with $I(\cdot)$ being the indicator function. Second, $q(\beta_k)$ for each topic $k$ is set as a $W$-dimensional Dirichlet distribution, $q(\beta_k) = \mathrm{Dirichlet}(\lambda_k)$, where $\lambda_k = (\lambda_{k1}, \ldots, \lambda_{kW})^{\mathrm{T}}$ is the parameter of the vocabulary distribution for topic $k$. For prediction, $\lambda_k$ is the core quantity of the inference. The variational distributions for the topics without any observation remain the same as the prior; therefore, we regard them as the zeroth topic without loss of generality and denote the corresponding variational distribution on the vocabulary by $q(\beta_0) = \mathrm{Dirichlet}(\eta)$. Third, we propose the variational family for $G_0$ as
$$ q(G_0) = \sum_{k=1}^{K} m_k \delta_{\phi_k} + m_0\, \mathrm{DP}(\alpha H), \qquad (16) $$
such that $\sum_{k=0}^{K} m_k = 1$, with the atoms of the $\mathrm{DP}(\alpha H)$ component drawn from $H$ owing to the lack of posterior information on $\phi_0$. Taking into account the tradeoff between inferential accuracy and computational efficiency, in (16) we assume that $q(G_0)$ has deterministic probability masses on the $\phi_k$'s, as the main purpose of $G_0$ is to provide a discrete and infinite-dimensional support ensuring that the $G_j$'s share the same topics $\phi_k$. This kind of spike-and-slab methodology is widely used in Bayesian analysis (Andersen et al., 2017). Under such a scenario, the optimised $\{m_k\}_{k=0}^{K}$ coincide with the maximum-a-posteriori estimate. Finally, following (15), we use Monte Carlo sampling to obtain samples $\{\hat{z}_s\}_{s=1}^{S}$ and hence do not need to parametrise their variational distributions. Based on these settings, we can infer the hierarchical Dirichlet process model by applying Algorithm 1 in the following steps.

The partition $\Omega_1$. As different samples in $\{\hat{z}_s\}_{s=1}^{S}$ are used to represent different topic clusters in topic modelling, their exact values in the sample space do not carry any statistical information. We therefore index the topics with observations from 1 to $K$ and denote the different clusters by distinct points $\phi_1, \ldots, \phi_K$ in $\Omega$. With the samples $\{\hat{z}_s\}_{s=1}^{S}$, we define $\hat{n}_{sk,t} = \sum_{i=1}^{N_s} I(\hat{z}_{si,t} = \phi_k)$. Then the number of topics with observations is $K = \sum_{k \ge 1} I\bigl( \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \hat{n}_{sk,t} > 0 \bigr)$. We partition $\Omega$ into a $(K+1)$-dimensional $\Omega_1$ consisting of the $K$ single points $\{\phi_k\}_{k=1}^{K}$ and one complement set $\phi_0 = \Omega \setminus \{\phi_k\}_{k=1}^{K}$.

Inference for $G_0$. With the partition $\Omega_1$ defined above, $G_s^{\Omega_1}$ conditional on $G_0^{\Omega_1}$ follows a $(K+1)$-dimensional Dirichlet distribution. By (13), we derive in Appendix A.5 the random nonparametric evidence lower bound with respect to $q(G_0)$,
$$ \widehat{\mathrm{NPELBO}} = - \sum_{k=1}^{K} \log m_k + (\alpha - 1) \log m_0 + \frac{J}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} T_s^{-1} \sum_{t=1}^{T_s} \log \frac{\Gamma(\gamma m_k + \hat{n}_{sk,t})}{\Gamma(\gamma m_k)} + \text{constant}. \qquad (17) $$
However, there is no closed-form expression for the probability proportion parameters $\{m_k\}_{k=0}^{K}$ that attains the maximum of (17). Moreover, the standard gradient descent algorithm fails in this case, because $\{m_k\}_{k=0}^{K}$ may easily leave the simplex during the updating procedure. Instead, given the parameters $\{m_k^{(\tau)}\}_{k=0}^{K}$ in the $\tau$-th iteration, we define $m^*_k$, up to a normalising constant ensuring $\sum_{k=0}^{K} m^*_k = 1$, by
$$ m^*_k \propto \begin{cases} \dfrac{J}{S}\, \gamma \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \bigl\{ \Psi(\gamma m_k^{(\tau)} + \hat{n}_{sk,t}) - \Psi(\gamma m_k^{(\tau)}) \bigr\} m_k^{(\tau)} - 1, & (k = 1, \ldots, K), \\ \alpha - 1, & (k = 0), \end{cases} \qquad (18) $$
where $\Psi(\cdot)$ denotes the digamma function, and update the parameters by $m_k^{(\tau+1)} = (1 - \rho_\tau) m_k^{(\tau)} + \rho_\tau m^*_k$. In Appendix A.6, we also show that this updating scheme is consistent with gradient-based updating after the inverse-logit transformation, and the condition $\sum_{k=0}^{K} m_k = 1$ is preserved throughout the updates.
Inference for $\beta$. By (14), we update the parameters of $q(\beta)$ using the samples $\{\hat{z}_s\}_{s=1}^{S}$. We define $\lambda^*_{kw}$ for topic $k$ and word $w$ as
$$ \lambda^*_{kw} = \eta + \frac{J}{S} \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \sum_{i=1}^{N_s} I(\hat{z}_{si,t} = \phi_k,\ x_{si} = w), \qquad (19) $$
and update the parameter $\lambda_k$ by $\lambda_k^{(\tau+1)} = (1 - \rho_\tau) \lambda_k^{(\tau)} + \rho_\tau \lambda^*_k$ for each $k$, where $\lambda^*_k = (\lambda^*_{k1}, \ldots, \lambda^*_{kW})^{\mathrm{T}}$.

Sampling for $z$. According to (15), we sample $\hat{z}_{si}$ conditional on $q(G_0)$, $q(\beta)$ and $\hat{z}_{s,-i}$ by
$$ q(z_{si} = \phi_k) \propto \begin{cases} (\gamma m_k + \hat{n}^{-i}_{sk}) \exp\bigl( \Psi(\lambda_{k x_{si}}) - \Psi(\textstyle\sum_{w=1}^{W} \lambda_{kw}) \bigr), & (k = 1, \ldots, K), \\ \gamma m_0 \exp\bigl( \Psi(\eta) - \Psi(W\eta) \bigr), & (k = 0), \end{cases} \qquad (20) $$
to construct the Markov chain, where $\hat{n}^{-i}_{sk} = \sum_{1 \le l \le N_s,\ l \neq i} I(\hat{z}_{sl} = \phi_k)$. Whenever the sampled $\hat{z}_{si}$ lies in $\phi_0$, meaning that $\hat{z}_{si}$ forms a new point not belonging to $\{\phi_1, \ldots, \phi_K\}$, we update the partition and add a new topic indicated by $\phi_{K+1}$; otherwise we keep the same partition dimension. Iterating the sampling scheme until convergence, we obtain the samples $\{\hat{z}_{si,t}\}_{1 \le s \le S,\ 1 \le i \le N_s,\ 1 \le t \le T_s}$ and the corresponding counts $\{\hat{n}_{sk,t}\}_{1 \le s \le S,\ 1 \le k \le K,\ 1 \le t \le T_s}$ for the selected chunk.

According to Algorithm 1, we repeatedly select documents in a batch at random, sample $z$ and update the parameters for $G_0$ and $\beta$ by iterating (18)–(20) until the nonparametric evidence lower bound converges to its maximum.

Our method is different from other nonparametric inference methods. Wang & Blei (2012) replace the analytical updating of local parameters with locally collapsed Gibbs sampling, but their approach cannot maximise the evidence lower bound, especially when $q(\beta)$ has a large variance. Bryant & Sudderth (2012) use split–merge moves to generate new dimensions and remove redundant ones. However, to check the split–merge criterion, their method needs to evaluate the training likelihood, which is computationally inefficient. Moreover, both methods rely on the mean-field assumption and hence ignore the correlation structure among latent variables.
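A sketch of the two global stochastic updates (18) and (19), assuming the local samples have already been drawn and tabulated as in the text. The small positive clip before normalisation is our own numerical safeguard and is not part of the paper's derivation; all names and toy inputs are illustrative.

```python
import numpy as np
from scipy.special import digamma

def update_m(m, n_hat, alpha, gamma, J, rho):
    """One stochastic update of the masses (m_1,...,m_K, m_0) of q(G0), following (18).
    n_hat[s] has shape (T_s, K) with the counts n_hat_{sk,t} for batch document s."""
    K, S = len(m) - 1, len(n_hat)
    m_star = np.empty(K + 1)
    for k in range(K):
        acc = sum(np.mean(digamma(gamma * m[k] + counts[:, k]) - digamma(gamma * m[k]))
                  for counts in n_hat)
        m_star[k] = (J / S) * gamma * acc * m[k] - 1.0
    m_star[K] = alpha - 1.0                        # the complement atom phi_0
    m_star = np.clip(m_star, 1e-12, None)          # numerical safeguard (ours), then normalise
    m_star /= m_star.sum()
    return (1.0 - rho) * m + rho * m_star          # convex step keeps the masses on the simplex

def update_lambda(lam, eta, z_samples, docs, K, W, J, rho):
    """One stochastic update of the topic-word Dirichlet parameters, following (19).
    z_samples[s] has shape (T_s, N_s) with labels in {1..K}; docs[s] holds word ids."""
    S = len(docs)
    lam_star = np.full((K, W), float(eta))
    for x, zs in zip(docs, z_samples):
        T_s = zs.shape[0]
        for t in range(T_s):
            for k in range(1, K + 1):
                idx = (zs[t] == k)
                lam_star[k - 1] += (J / S) / T_s * np.bincount(x[idx], minlength=W)
    return (1.0 - rho) * lam + rho * lam_star

# Toy usage with random samples in place of the Gibbs output.
rng = np.random.default_rng(0)
K, W, J, S = 4, 10, 1000, 2
m = np.full(K + 1, 1.0 / (K + 1))
lam = np.full((K, W), 0.5)
docs = [rng.integers(0, W, size=30) for _ in range(S)]
z_samples = [rng.integers(1, K + 1, size=(5, 30)) for _ in range(S)]
n_hat = [np.stack([np.bincount(zt, minlength=K + 1)[1:] for zt in zs]) for zs in z_samples]
m = update_m(m, n_hat, alpha=1.5, gamma=1.0, J=J, rho=0.1)
lam = update_lambda(lam, eta=0.5, z_samples=z_samples, docs=docs, K=K, W=W, J=J, rho=0.1)
print(m.round(3), lam.sum().round(1))
```

The sampling weights for each $\hat{z}_{si}$ would follow (20), as in the Gibbs sweep sketched in Section 4.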
The algorithm of conditional variational inference with adaptive truncation can also be applied to a general class of hierarchical Bayesian nonparametric models in which the global prior $G_0$ is generated from a completely random measure. For example, the gamma–Dirichlet process model uses a gamma process to generate $G_0$ and Dirichlet processes to generate $\{G_j\}_{j=1}^{J}$. In these models, the concentration parameter of each $G_j$ is not fixed and $G_0$ is not restricted to be a probability measure. The corresponding inference algorithm is similar to that for the hierarchical Dirichlet process model, but requires a new parameter $\mu$ to approximate $G_0(\Omega)$. We choose the variational family for the global prior $G_0$ as
$$ q(G_0) = \mu\Bigl( \sum_{k=1}^{K} m_k \delta_{\phi_k} + m_0\, \tilde{N}(\alpha H) \Bigr), \qquad (21) $$
where $\tilde{N}$ is the normalisation of the corresponding completely random measure and $\sum_{k=0}^{K} m_k = 1$. Similarly, we derive the random nonparametric evidence lower bound in Appendix A.7,
$$ \widehat{\mathrm{NPELBO}} = K \log \mu + \sum_{k=1}^{K} \log v(\mu m_k) + \log u(\mu m_0) + \frac{J}{S} \sum_{s=1}^{S} \Bigl\{ \log \frac{\Gamma(\mu)}{\Gamma(\mu + N_s)} + T_s^{-1} \sum_{t=1}^{T_s} \sum_{k=1}^{K} \log \frac{\Gamma(\mu m_k + \hat{n}_{sk,t})}{\Gamma(\mu m_k)} \Bigr\} + \text{constant}, \qquad (22) $$
where $v(\cdot)$ is the weight intensity measure (Appendix A.1) of the completely random measure and $u$ is the density function of $G_0(\Omega)$, which can be derived from the Laplace transform of the completely random measure. Therefore, we can update $\{m_k\}_{k=0}^{K}$ in the same way as for the hierarchical Dirichlet process model. Following Algorithm 1 and its application in Section 5.1, we can also update $\mu$ by stochastic gradient descent. To illustrate with an example, we consider the gamma–Dirichlet model, whose inference algorithm is provided in Appendix A.7.

We apply the algorithm of conditional variational inference with adaptive truncation to three large datasets.

1. arXiv: The corpus includes the descriptive metadata of all articles on arXiv, a free distribution service and an open archive for scholarly articles, up to September 1, 2019, which comprises 1.03M documents and 44M words from a vocabulary of 7,500 terms after preprocessing.
2. New York Times: The corpus combines all articles published by the New York Times from January 1, 1987 to June 19, 2007 (Sandhaus, 2008), which has 1.56M documents and 176M words from a vocabulary of 7,600 terms after preprocessing.

3. Wikipedia: The corpus collects the entire set of entries on the English Wikipedia website on January 1, 2019, which contains 4.03M documents and 423M words from a vocabulary of 8,000 terms after preprocessing.

In the preprocessing, stemming and lemmatisation are used to clean the raw text data. Moreover, words with too high or too low frequency and common stop words are removed before the experiments.

To evaluate the performance of our proposed method, we set aside a test set of 10,000 documents for each dataset and calculate the hold-out perplexity (Ranganath & Blei, 2018),
$$ \text{perplexity}_{\text{hold-out}} = \exp\Bigl\{ - \frac{\sum_{j \in D_{\text{test}}} \log p(x_j^{\text{test}} \mid x_j^{\text{train}}, D_{\text{train}})}{\sum_{j \in D_{\text{test}}} |x_j^{\text{test}}|} \Bigr\}, $$
where $D_{\text{train}}$ and $D_{\text{test}}$ represent the training and test data, $x_j^{\text{train}}$ and $x_j^{\text{test}}$ are the training and test words in test document $j$, and $|x_j^{\text{test}}|$ is the number of words in $x_j^{\text{test}}$. The perplexity measures the uncertainty of the fitted model; hence a better language model with more accurate inference has a higher predictive likelihood and thus a lower perplexity. Since the exact computation of the perplexity is not tractable, the standard routine uses $D_{\text{train}}$ to obtain the variational distributions of $\beta$ and $G_0$, obtains the variational distribution of $G_j$ based on $G_0$ and $x_j^{\text{test}}$, and then approximates the likelihood by $p(x_j^{\text{test}} \mid x_j^{\text{train}}) = \prod_{w \in x_j^{\text{test}}} \sum_{k=1}^{K} \bar{G}_{jk} \bar{\beta}_{kw}$, where $\bar{G}_j = (\bar{G}_{j1}, \ldots, \bar{G}_{jK})^{\mathrm{T}}$ and $\bar{\beta}_k = (\bar{\beta}_{k1}, \ldots, \bar{\beta}_{kW})^{\mathrm{T}}$ are the variational expectations of $G_j$ and $\beta_k$, respectively (Blei et al., 2003).
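The hold-out perplexity defined above can be computed directly from the variational expectations, as in the following sketch; the arrays here are illustrative, not fitted quantities.

```python
import numpy as np

def holdout_perplexity(test_docs, G_bar, beta_bar):
    """Hold-out perplexity as defined above.
    test_docs[j]: array of held-out word ids for test document j,
    G_bar[j]: variational expectation of the topic proportions (length K),
    beta_bar: K x W matrix of expected topic-word probabilities."""
    total_log_lik, total_words = 0.0, 0
    for words, g in zip(test_docs, G_bar):
        # p(x_j^test | x_j^train) ~= prod_w sum_k G_bar_jk * beta_bar_kw
        word_probs = g @ beta_bar[:, words]
        total_log_lik += np.sum(np.log(word_probs))
        total_words += len(words)
    return np.exp(-total_log_lik / total_words)

# Toy check with made-up expectations.
rng = np.random.default_rng(0)
K, W = 4, 20
beta_bar = rng.dirichlet(np.ones(W), size=K)
G_bar = rng.dirichlet(np.ones(K), size=3)
test_docs = [rng.integers(0, W, size=15) for _ in range(3)]
print(holdout_perplexity(test_docs, G_bar, beta_bar))
```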
We model the three datasets under both the hierarchical Dirichlet process and the gamma–Dirichlet process models. For the hierarchical Dirichlet process model we set the hyperparameters $\alpha$, $\gamma$ and $\eta$ to a common fixed value, where $\alpha$ and $\gamma$ are the concentration parameters for $G_0$ and $\{G_j\}$ respectively, and $\eta$ is the hyperparameter of the prior on the distribution of words. We choose a batch size of 256 and adopt a decaying Robbins–Monro learning rate in the updates (Hoffman et al., 2010). The initial number of topics is chosen as 100. For the gamma–Dirichlet process model, we use the same hyperparameters but discard $\gamma$. For comparison, we keep the default settings of traditional online variational inference (Wang et al., 2011).

Figure 2: Top row: plots of the perplexity versus the running time up to 5 hours. Bottom row: plots of the number of topics versus the running time up to 5 hours. Left, middle and right columns correspond to the datasets arXiv, New York Times and Wikipedia, respectively. The black dotted line corresponds to the traditional online variational inference method for the hierarchical Dirichlet process model. The red solid and blue dashed lines correspond to the conditional variational inference with adaptive truncation method for the hierarchical Dirichlet process model and for the gamma–Dirichlet process model, respectively.

The top row of Figure 2 plots the hold-out perplexity as a function of running time for the three comparison methods on the three datasets. Table 1 reports numerical summaries. Several conclusions can be drawn. First, on all three datasets, our conditional variational inference with adaptive truncation method consistently outperforms the traditional online variational inference method. The improvement is highly significant especially for arXiv and Wikipedia; for New York Times, the improvement is moderate, probably due to the long length of the documents in this corpus. Second, for each dataset, the gamma–Dirichlet process model attains a lower perplexity than the hierarchical Dirichlet process model, which makes sense because the gamma–Dirichlet process model removes a restriction of the hierarchical Dirichlet process model and is hence more flexible. Third, the proposed method is computationally efficient. Although it involves Monte Carlo sampling, the perplexity converges quickly, because the convergence of the local Markov chain that assigns words to topics is accelerated by a clear topic–word clustering as the global variational distributions approach the optimum.

Table 1: A summary of hold-out perplexity results on the three datasets. Relative improvements in percentage over TOVI for the HDP model are shown in parentheses.

Inference method      arXiv           New York Times    Wikipedia
HDP model, TOVI       1005            1681              1422
HDP model, CVIAT      832 (17.21%)    1569 (6.66%)      1207 (15.12%)
Gamma-DP model, CVIAT 808 (19.60%)    1536 (8.62%)      1157 (18.64%)

TOVI, traditional online variational inference; CVIAT, conditional variational inference with adaptive truncation; HDP, hierarchical Dirichlet process; Gamma-DP, gamma–Dirichlet process.

The bottom row of Figure 2 plots the number of topics over the course of the inference. For traditional online variational inference, the number of topics remains constant, while for our method it first increases steeply and then converges to a stable level. For example, the number of topics on Wikipedia increases rapidly from 100 to around 190 for the hierarchical Dirichlet process model and around 200 for the gamma–Dirichlet process model. The sharp increase is driven by the data complexity, while the stable level is due to the dimension penalty of the hierarchical Dirichlet process model. Although the estimation of the number of topics is not consistent, the proposed truly nonparametric inference method can provide useful information about the topics in the data. For instance, arXiv, containing the abstract descriptions of scientific articles, has the smallest number of topics because its topics are restricted to the quantitative subjects including computer science, mathematics, statistics and physics. By contrast,
New York Times is a compilation of all news articles covering a wider range of areas, and hence consists of more topics.
Wikipedia has the largest number of topics as it contains almost every aspect of an encyclopedia. One key point here is that we do not need to set a fixed number of topics before the inference. Instead, our method starts from an initial value, for example 100 in our experiments, automatically reaches the optimal number of topics after iterations, and finally keeps it at a stable level.

Moreover, our method reveals better linguistic results. To compare our proposed method with traditional online inference for the hierarchical Dirichlet process model, we report the top 12 words in the top 10 topics with the biggest weights for both methods on the datasets arXiv and Wikipedia in Tables 2a and 2b, respectively. We observe a few apparent patterns. First, our method does not contain replicated topics. The traditional online variational inference method results in very similar word components, for example, columns 1-6 in the bottom part of Table 2a. An ideal topic–word clustering should allocate these words into just one topic. But since the prespecified number of topics is fixed at 150, which seems larger than the truth, the inference generates multiple replicated topics. By contrast, the topic–word clustering produced by our method removes such replicated topics.

Table 2: Top 12 words in the top 10 topics for the datasets (a) arXiv and (b) Wikipedia.
Within the proposed general framework, Algorithm 1 can also be applied to other hierarchical Bayesian nonparametric models, including the hierarchical Pitman–Yor process model (Teh & Jordan, 2010) and the hierarchical beta process model (Thibaux & Jordan, 2007), which are used to represent the power law and the sparsity in latent features, respectively. In such cases, other Monte Carlo methods, for example slice sampling (Neal, 2003), retrospective Markov chain Monte Carlo (Papaspiliopoulos & Roberts, 2008) or unbiased Markov chain Monte Carlo methods with couplings (Jacob et al., 2019), could be used. We expect that our proposed method offers even more advantages in these applications, because the hierarchical Pitman–Yor process with its heavy-tail behaviour and the hierarchical beta process with its sparse structure may suffer more from a universal truncation.
Appendix
A.1 A short review of completely random measures
A completely random measure (Kingman, 1993) is characterised by its Laplace transform,
$$ E\bigl\{ e^{-t P(A)} \bigr\} = \exp\Bigl\{ - \int_{A} \int_{(0,\infty)} (1 - e^{-ts})\, v_c(dx, ds) \Bigr\}, $$
where $A$ is any measurable subset of $\Omega$ and $v_c(dx, ds)$ is called the Lévy measure. If $v_c(dx, ds) = \kappa(dx)\, v(ds)$, where $\kappa(\cdot)$ and $v(\cdot)$ are measures on $\Omega$ and $(0, \infty)$, respectively, the completely random measure is homogeneous (Ghosal & Van der Vaart, 2017). In this case, we call $v(\cdot)$ the weight intensity measure. We can view a completely random measure as a Poisson process on the product space $\Omega \times (0, \infty)$, using its Lévy measure as the mean measure.

A.2 Derivation for (4)
By the definition of the induced measure, $q^{\Omega_1}(d\Theta) = Q(d\Theta)$ for any $\mathcal{M}$-measurable $d\Theta$, we have
$$ \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dq^{\Omega_1} = \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dQ. $$
It follows from $\limsup_{\Omega_1} dq^{\Omega_1}/dp^{\Omega_1} = dQ/dP$ and the monotone convergence theorem that
$$ \limsup_{\Omega_1} \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dQ = \int_{\Theta} \log \frac{dQ}{dP}\, dQ. $$
Combining the above equations yields (4). Furthermore, suppose there exists a sequence of partitions $\{\Omega_i\}_{i \ge 1}$ such that $\limsup_i \Omega_i = \Omega_1$; then
$$ \limsup_{\Omega_i} \int_{\Theta} \log \frac{dq^{\Omega_i}}{dp^{\Omega_i}}\, dq^{\Omega_i} = \limsup_{\Omega_i} \int_{\Theta} \log \frac{dq^{\Omega_i}}{dp^{\Omega_i}}\, dQ = \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dQ = \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dq^{\Omega_1}. $$
Hence $\limsup \mathrm{KL}(q^{\Omega_i} \,\|\, p^{\Omega_i}) = \mathrm{KL}(q^{\Omega_1} \,\|\, p^{\Omega_1})$, which will be used in Appendix A.5.

A.3 Derivation for (6)

By $p(X, Z) = p(Z \mid X)\, p(X)$, we have
$$ \int \log \frac{p(X, Z^{\Omega_1})}{q(Z^{\Omega_1})}\, q(dZ^{\Omega_1}) = \log p(X) + \int \log \frac{p(Z^{\Omega_1} \mid X)}{q(Z^{\Omega_1})}\, q(dZ^{\Omega_1}). $$
Taking the limit inferior on both sides gives
$$ \liminf_{\Omega_1} \Bigl\{ \int \log \frac{p(X, Z^{\Omega_1})}{q(Z^{\Omega_1})}\, q(dZ^{\Omega_1}) \Bigr\} = \log p(X) - \limsup_{\Omega_1} \Bigl\{ - \int \log \frac{p(Z^{\Omega_1} \mid X)}{q(Z^{\Omega_1})}\, q(dZ^{\Omega_1}) \Bigr\}. $$
Combining the above equation with the definition of the nonparametric evidence lower bound in (5) and the Kullback–Leibler divergence in (4) yields (6).
A.4 Derivation for (9) By p p G Ω , t z j u Jj “ q “ ş ¨ ¨ ¨ ş p ` G Ω , t G j u Jj “ , t z j u Jj “ ˘ dG dG ¨ ¨ ¨ dG J and the hierarchical gen-erative structure, the evidence lower bound under partition Ω with respect to q p G Ω q equals,ELBO Ω “ E q p G Ω q E q pt z j u Jj “ q “ log p p G Ω , t z j u Jj “ q ‰ ´ E q p G Ω q “ log q p G Ω q ‰ ` constant “ E q p G Ω q E q pt z j u Jj “ q “ log q p G Ω q J ź j “ ż p p G Ωj | G Ω q p p z j | G Ωj q dG j ‰ ´ E q p G Ω q “ log q p G Ω q ‰ ` constant “ J ÿ j “ E q p G Ω q E q p z j q ” log E p p G Ωj | G Ω q “ p p z j | G Ωj q ‰ı ` E q p G Ω q “ log p p G Ω q ´ log q p G Ω q ‰ ` constant . Furthermore, based on the equation above, (8) can be expressed as NPELBO “ lim inf Ω ELBO Ω . A.5 Derivation for (17)
By the formula of moments for Dirichlet-distributed random variables, we obtainE p p G Ωs | G Ω q “ p p ˆ z s,t | G Ωs q ‰ “ Γ p γ q Γ p γ ` N s q K ź k “ Γ p γG k ` ˆ n sk,t q Γ p γG k q . Based on the points t φ k u Kk “ defined in Section 5.1, we propose a sequence of partition t Ω c : Ω c “ Ť Kk “ Ω ck u c ě to approach Ω , where Ω ck “ p φ k ´ c ´ , φ k ` c ´ s for k “ , . . . , K and Ω c is the corresponding complement. Under Ω c , q p G Ω c q “ d K ` ` m ´ p G Ω c ´ M Ω c q ˘ and p p G Ω c q “ d K ` p G Ω c q , where d K ` p¨q denotes the density function for p K ` q -dimensionalDirichlet distribution, M “ ř Kk “ m k δ φ k and M Ω c is the corresponding induced randomvariable. By (13), the random nonparametric evidence lower bound under Ω c is { NPELBO Ω c “ E q p G Ωc q ! K ÿ k “ p αH Ω c k ´ q log m G k p G k ´ m k q ` p αH Ω c ´ q log m ` JS S ÿ s “ K ÿ k “ T ´ s T s ÿ t “ log Γ p γG k ` ˆ n sk,t q Γ p γG k q ) ` constant , H Ω c k “ H p Ω ck q . Since, p G k ´ m k q{ m „ Beta p H Ω c k q under q p G Ω c q , the term E q p G Ωc q p αH Ω c k ´ q log m p G k ´ m k q ´ ( is constant with respect to parameters t m k u Kk “ . Taking limsup onboth sides of the above equation with lim sup Ω C E q p G Ωc q p log G k q “ log m k , lim sup Ω C H Ω C k “ k “ , . . . , K and lim sup Ω C H Ω C “ , we obtain equation (17). A.6 Derivation for (18)
Consider the Lagrange multiplier of constrained optimisation, L “ ´ K ÿ k “ log m k ` p α ´ q log m ` JS S ÿ s “ K ÿ k “ T ´ s T s ÿ t “ log Γ p γm k ` ˆ n sk,t q Γ p γm k q ´ λ p K ÿ k “ m k ´ q , its first order conditions satisfy, $’’&’’% J S ´ γ ř Ss “ T ´ s ř T s t “ Φ p γm k ` ˆ n sk,t q ´ Φ p γm k q ( m k ´ “ m k λ, p k “ , . . . , K q ,α ´ “ m λ, p k “ q . Dividing λ on both sides of the above equations, the definition of t m ˚ k u Kk “ in (18) follows.We next show that this updating is consistent to the gradient descent after the inverselogit transformation, that is, transforming t m k u Kk “ by m k “ e θ k { ř Kl “ e θ l to remove theconstraint of ř Kk “ m k “
1. By B m k {B θ k “ m k ´ m k , B m l {B θ k “ ´ m k m l for l ‰ k , and thechain rule, we have B L B θ k “ $’’&’’% J S ´ γ ř Ss “ T ´ s ř T s t “ Φ p γm k ` ˆ n sk,t q ´ Φ p γm k q ( m k ´ ´ Λ m k p k “ , . . . , K q ,α ´ ´ Λ m k p k “ q , where L denotes { NPELBO in (17) andΛ “ α ´ ` K ÿ k “ ” J S ´ γ S ÿ s “ T ´ s T s ÿ t “ Φ p γm k ` ˆ n sk,t q ´ Φ p γm k q ( m k ´ ı . As B L {B θ k “ Λ p m ˚ k ´ m k q , p m ˚ k ´ m k q represents the gradient with respect to t θ k u Kk “ afterthe inverse logit transformation. 27 .7 Derivation for the extension in Section 5.2 Without restriction on probability random measure,log E p p G Ωs | G Ω q “ p p ˆ z s,t | G Ωs q ‰ “ log Γ p ř Kk “ G k q Γ p ř Kk “ G k ` N s q K ź k “ Γ p G k ` ˆ n sk,t q Γ p G k q , In analogy to Appendix A.5, under a partition Ω c , the random nonparametric evidence lowerbound equals, { NPELBO Ω c “ K log µ ` E q p G Ωc q ” K ÿ k “ log p p G Ω c k q ` log p p G Ω c q` JS S ÿ s “ K ÿ k “ T ´ s T s ÿ t “ log Γ p γG Ω c k ` ˆ n sk,t q Γ p γG Ω c k q ı ` constant , where K log µ comes from the Jacob matrix from G , G , . . . , G K to µ, m , . . . , m K . As thepartition converges to single points and the corresponding complement, lim sup Ω c p p G Ω c k q “ v p G Ω c k q , lim sup Ω c p p G Ω c q “ u p G Ω c q , we can have (22) by lim sup Ω c G k “ µm k for k ‰ Ω c G “ µm . Specially, for the gamma–Dirichlet model, { NPELBO “ ´ µ ´ K ÿ k “ log m k ` p α ´ q log µm ` JS S ÿ s “ " log Γ p µ q Γ p µ ` N s q ` K ÿ k “ T ´ s T s ÿ t “ log Γ p µm k ` ˆ n sk,t q Γ p µm k q * ` constant . Therefore, its gradient with respect to µ is, ´ ` α ´ µ ` JS S ÿ s “ ! Φ p µ q ´ Φ p µ ` N s q ` K ÿ k “ T ´ s T s ÿ t “ m k ` Φ p µm k ` ˆ n sk,t q ´ Φ p µm k q ˘) References
Andersen, M. R., Vehtari, A., Winther, O. & Hansen, L. K. (2017). Bayesian inference for spatio-temporal spike-and-slab priors. Journal of Machine Learning Research, 5076–5133.

Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association, 859–877.

Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 993–1022.

Bryant, M. & Sudderth, E. B. (2012). Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 25.

Caron, F. & Fox, E. B. (2017). Sparse graphs using exchangeable random measures. Journal of the Royal Statistical Society: Series B, 1295–1366.

Dunson, D. B. & Park, J.-H. (2008). Kernel stick-breaking processes. Biometrika, 307–323.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 209–230.

Ghosal, S. & Van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge: Cambridge University Press.

Hoffman, M., Bach, F. R. & Blei, D. M. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23.

Hoffman, M. D., Blei, D. M., Wang, C. & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 1303–1347.

Jacob, P. E., O'Leary, J. & Atchadé, Y. F. (2019). Unbiased Markov chain Monte Carlo with couplings. Journal of the Royal Statistical Society: Series B, in press.

Jordan, M. I. (2010). Hierarchical models, nested models and completely random measures. In Frontiers of Statistical Decision Making and Bayesian Analysis. New York: Springer, pp. 207–218.

Kingman, J. F. C. (1993). Poisson Processes. Oxford: Clarendon Press.

Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 705–767.

Papaspiliopoulos, O. & Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 169–186.

Ranganath, R. & Blei, D. M. (2018). Correlated random measures. Journal of the American Statistical Association, 417–430.

Regazzini, E., Lijoi, A. & Prünster, I. (2003). Distributional results for means of normalized random measures with independent increments. The Annals of Statistics, 560–585.

Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.

Roychowdhury, A. & Kulis, B. (2015). Gamma processes, stick-breaking, and variational inference. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics.

Sandhaus, E. (2008). The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium.

Sudderth, E. B. & Jordan, M. I. (2009). Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Advances in Neural Information Processing Systems 21.

Teh, Y. W. & Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. In Bayesian Nonparametrics. Cambridge: Cambridge University Press, pp. 158–207.

Teh, Y. W., Jordan, M. I., Beal, M. J. & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 1566–1581.

Teh, Y. W., Kurihara, K. & Welling, M. (2008). Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems 20.

Thibaux, R. & Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, vol. 2.

Wang, C. & Blei, D. M. (2012). Truncation-free online variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems 25.

Wang, C., Paisley, J. & Blei, D. (2011). Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics.