Conditional Variational Inference with Adaptive Truncation for Bayesian Nonparametric Models
J. Y. Liu and Xinghao Qiao
Department of Statistics, London School of Economics, London WC2A 2AE, U.K. [email protected], [email protected]
Abstract
The scalable inference of Bayesian nonparametric models with big data remains challenging. Current variational inference methods fail to characterise the correlation structure among latent variables because of the mean-field setting, and cannot infer the true posterior dimension because of the universal truncation. To overcome these limitations, we build a general framework to infer Bayesian nonparametric models by maximising the proposed nonparametric evidence lower bound, and then develop a novel approach by combining Monte Carlo sampling with the stochastic variational inference framework. Our method has several advantages over the traditional online variational inference method. First, it achieves a smaller divergence between the variational distributions and the true posterior by factorising the variational distributions under the conditional setting instead of the mean-field setting, so as to capture the correlation pattern. Second, it reduces the risk of underfitting or overfitting by truncating the dimension adaptively rather than using a prespecified truncated dimension for all latent variables. Third, it reduces the computational complexity by approximating the posterior functionally instead of updating the stick-breaking parameters individually. We apply the proposed method to hierarchical Dirichlet process and gamma–Dirichlet process models, two essential Bayesian nonparametric models in topic analysis. The empirical study on three large datasets, arXiv, New York Times and Wikipedia, reveals that our proposed method substantially outperforms its competitor in terms of lower perplexity and much clearer topic–word clustering.

Some key words: Gibbs sampling; Hierarchical Dirichlet process; Nonparametric evidence lower bound; Stochastic variational inference; Topic modelling.
Bayesian nonparametric models, which differ from parametric models by relaxing the fixed-dimension assumption, are widely used in bioinformatics, language processing, computer vision and network analysis (Dunson & Park, 2008; Sudderth & Jordan, 2009; Caron & Fox, 2017; Ranganath & Blei, 2018). For example, in natural language processing, Teh et al. (2006) develop the hierarchical Dirichlet process, which extends the latent Dirichlet allocation model (Blei et al., 2003) from a nonparametric perspective. The hierarchical Dirichlet process is defined on a countably infinite-dimensional simplex, replacing the finite-dimensional Dirichlet distribution in latent Dirichlet allocation. Within such a model, the number of topics is regarded as a random variable instead of a fixed value and hence can be inferred from the data.

The inference of Bayesian nonparametric models is more complicated than that of their parametric counterparts. Due to the infinite-dimensional nature of Bayesian nonparametric models, a finite-dimensional truncation is needed to approximate the posterior. However, the selection of the optimal truncation level poses extra challenges. Traditional Markov chain Monte Carlo methods (Teh et al., 2006; Papaspiliopoulos & Roberts, 2008) can produce an adaptive selection of the truncated dimension but are not computationally scalable, especially for big data. On the other hand, standard variational inference methods (Teh et al., 2008; Wang et al., 2011; Hoffman et al., 2013; Roychowdhury & Kulis, 2015) can accelerate the computation but suffer from a universal selection of the truncation level, that is, truncating the dimension of all latent variables at a prespecified value. A subjective selection of a fixed truncation level leads to low predictive accuracy due to possible overfitting or underfitting. In this sense, such a universal truncation method contradicts the motivation and advantages of using Bayesian nonparametric models.

In this paper, we propose a general framework with novel and efficient algorithms to infer a large class of Bayesian nonparametric models in the following steps. First, we derive the nonparametric evidence lower bound based on finite and measurable partitions. Second, we propose the conditional setting when factorising the variational distributions, by letting variables in the middle layers be conditional on the two adjacent layers. Third, to handle big data, we develop the corresponding stochastic variational inference framework (Hoffman et al., 2013; Blei et al., 2017) under our conditional setting. Finally, within our framework, we adopt Monte Carlo sampling to generate samples for the local latent variables, and further update the variational parameters for the global latent variables based on the empirical distribution generated from these samples. Meanwhile, we truncate the dimension of the variational distributions to that of the empirical distribution.

Our proposed method, named conditional variational inference with adaptive truncation, benefits from both the accuracy of Monte Carlo sampling and the efficiency of variational inference as follows. First, our method rebuilds the correlation structure and hence attains a smaller divergence between the variational distribution and the true posterior. Such a procedure removes the unrealistic mean-field assumption and searches for an optimal variational distribution over a wider family. Second, our method assigns a probability of increasing the dimension of the variational distributions that adapts to the goodness-of-fit. As the inference proceeds, it reaches a stable level balancing the goodness-of-fit and the model complexity. Therefore, it provides an adaptive selection of the truncated dimension and reduces the risk of overfitting or underfitting. Finally, our method achieves better prediction without sacrificing computational efficiency. With the optimal variational distributions for the global variables, the local Markov chain converges fast, as demonstrated in our empirical study.

To assess the empirical performance of the proposed method, we develop detailed algorithms for the hierarchical Dirichlet process model and the gamma–Dirichlet process model (Jordan, 2010), and apply them to the topic analysis of three large datasets: arXiv, New York Times and
Wikipedia. The results show that the algorithms for our proposed method consistently outperform traditional online variational inference (Wang et al., 2011) in the three examples by substantially reducing the hold-out perplexity. Furthermore, our method gives a much clearer topic–word clustering by removing replicated topics and providing room to add new topics. We provide the code at https://github.com/yiruiliu110/ConditionalVI.

Suppose that $(\Omega, \mathcal{F})$ is a Polish sample space, $\Theta$ is the set of all bounded measures on $(\Omega, \mathcal{F})$ and $\mathcal{M}$ is a $\sigma$-algebra on $\Theta$. A random measure $G$ on $(\Omega, \mathcal{F})$ is a transition kernel from $(\Theta, \mathcal{M})$ into $(\Omega, \mathcal{F})$ such that (i) $G \mapsto G(A)$ is $\mathcal{M}$-measurable for any $A \in \mathcal{F}$ and (ii) $A \mapsto G(A)$ is a measure for any realisation of $G$ (Ghosal & Van der Vaart, 2017). For example, a Dirichlet process $P$ (Ferguson, 1973) with base measure $P_0$ satisfies
$$ \bigl(P(A_1), P(A_2), \ldots, P(A_n)\bigr) \sim \mathrm{Dirichlet}\bigl(P_0(A_1), P_0(A_2), \ldots, P_0(A_n)\bigr) $$
for any partition $\Omega_1 = (A_1, \ldots, A_n)$ of $\Omega$, that is, a finite number of measurable, nonempty and disjoint sets such that $\bigcup_{i=1}^{n} A_i = \Omega$. The Dirichlet process is denoted by $P \sim \mathrm{DP}(P_0)$ or $P \sim \mathrm{DP}(\alpha H)$ with prior precision $\alpha = P_0(\Omega)$ and centre measure $H = \alpha^{-1} P_0$. Moreover, a random measure is called a completely random measure (Kingman, 1993) if it also satisfies (iii) $P(A_i)$ is independent of $P(A_j)$ for any disjoint subsets $A_i$ and $A_j$ in $\Omega$. See Appendix A.1 for a short review. Completely random measures and their normalisations (Regazzini et al., 2003), for example the gamma process and the Dirichlet process respectively, are commonly used as priors for infinite-dimensional latent variables in Bayesian nonparametric models, because their realisations are atomic measures with a countable-dimensional support.

As an important subclass of Bayesian nonparametric models, hierarchical Bayesian nonparametric models use random measures as priors in multiple layers. Consider the following model,
$$ G_0 \mid H \sim P(H), \qquad \beta \mid \lambda \sim p(\beta \mid \lambda), \qquad G_j \mid G_0 \sim R(G_0) \quad (j = 1, \ldots, J), $$
$$ z_{ji} \mid G_j \sim G_j, \qquad x_{ji} \mid z_{ji} \sim f(x_{ji} \mid \beta, z_{ji}) \quad (i = 1, \ldots, N_j;\ j = 1, \ldots, J), \qquad (1) $$
whose two-layer hierarchical structure is summarised in Figure 1.

Figure 1: Hierarchical structure in Bayesian nonparametric models. The blue and red boxes correspond to $J$ and $N_j$ replicates, respectively.

Specifically, in the top layer, $G_1, \ldots, G_J$ are generated from a random measure $R$ with common base measure $G_0$, while in the bottom layer, $G_0$ itself is a realisation of a random measure $P$ with base measure $H$. To ensure exchangeability, $G_1, \ldots, G_J$ are assumed to be identically and independently distributed given $G_0$. The global parameter $\beta$ is assigned a prior $p(\beta \mid \lambda)$. In addition, each local latent variable $z_{ji}$ is sampled from $G_j$ independently, and the observation $x_{ji}$ is generated from a likelihood function $f$, which is parameterised by both the global latent variable $\beta$ and the local latent variable $z_{ji}$.

We next illustrate the necessity of the hierarchical structure in Bayesian nonparametric models, using the example of the hierarchical Dirichlet process model in topic modelling (Teh et al., 2006), where $P$ and $R$ in (1) are both Dirichlet processes,
$$ G_0 \mid H \sim \mathrm{DP}(\alpha H), \qquad G_j \mid G_0 \sim \mathrm{DP}(\gamma G_0) \quad (j = 1, \ldots, J). \qquad (2) $$
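Model (2) can be simulated forwards with a truncated stick-breaking construction. The sketch below is ours, not taken from the paper's released code; the truncation level, the symmetric Dirichlet base measure over the vocabulary and all function names are illustrative assumptions.

```python
import numpy as np

def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights for a Dirichlet process."""
    betas = rng.beta(1.0, concentration, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

def simulate_hdp(alpha, gamma, n_docs, n_words, vocab_size, truncation=50, seed=0):
    """Forward-simulate the two-layer model (2) under a finite truncation."""
    rng = np.random.default_rng(seed)
    # Top layer: G0 ~ DP(alpha * H); atoms phi_k drawn from H, here Dirichlet(1) over the vocabulary.
    g0_weights = stick_breaking(alpha, truncation, rng)
    topics = rng.dirichlet(np.ones(vocab_size), size=truncation)
    docs = []
    for _ in range(n_docs):
        # G_j | G0 ~ DP(gamma * G0): on the truncated atoms this is approximated by a finite Dirichlet.
        gj_weights = rng.dirichlet(gamma * g0_weights + 1e-12)
        z = rng.choice(truncation, size=n_words, p=gj_weights)       # topic assignments z_ji
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return g0_weights, topics, docs

g0, beta, corpus = simulate_hdp(alpha=1.0, gamma=1.0, n_docs=3, n_words=20, vocab_size=30)
print(len(corpus), corpus[0][:10])
```

Because every $G_j$ re-uses the atoms of $G_0$, the simulated documents share topics, which is exactly the property a diffuse base measure would destroy.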
Suppose a corpus has $J$ documents, each document $j$ has $N_j$ words, and each word is chosen from a vocabulary with $W$ terms. We describe the generative model as follows. First, $G_0 = \sum_{k=1}^{\infty} G_{0k} \delta_{\phi_k}$ is generated from $\mathrm{DP}(\alpha H)$, and for each document $j$ a topic proportion $G_j = \sum_{k=1}^{\infty} G_{jk} \delta_{\phi_k}$ is independently sampled from $\mathrm{DP}(\gamma G_0)$. Second, for any topic $k$, the within-topic word distribution is drawn from a $W$-dimensional Dirichlet distribution parameterised by $\eta$, $\beta_k \sim \mathrm{Dir}(\eta)$. Third, for each word $i$ in document $j$, a topic assignment $z_{ji}$ is allocated by $z_{ji} \sim \mathrm{Multinomial}(G_j)$, where $z_{ji}$ represents topic $k$ if $z_{ji} = \phi_k$. Finally, the observation $x_{ji}$ is independently generated from the assigned topic and the corresponding within-topic word distribution, $x_{ji} \mid \{z_{ji} = \phi_k\} \sim \mathrm{Multinomial}(\beta_k)$. Within such a hierarchical Dirichlet process model, if in the top layer $G_1, \ldots, G_J$ were sampled from a Dirichlet process with a diffuse base measure instead of an atomic $G_0$, the supports of $G_1, \ldots, G_J$ would not overlap almost surely, resulting in no sharing of topics among different documents. To solve this issue, we let the base measure $G_0$ have an atomic and infinite-dimensional support, for example by assigning a Dirichlet process prior to $G_0$.

Generally speaking, it is not necessary to restrict the prior for $G_0$ to be a Dirichlet process or another probability random measure. The essential point here is to equip $G_0$ with an infinite-dimensional and atomic support. Therefore, other completely random measures and their normalisations can also be used as priors for $G_0$. For example, the gamma–Dirichlet process model (Jordan, 2010),
$$ G_0 \mid H \sim \Gamma\mathrm{P}(\alpha H), \qquad G_j \mid G_0 \sim \mathrm{DP}(G_0) \quad (j = 1, \ldots, J). \qquad (3) $$
The gamma–Dirichlet process allows a more flexible model by removing the constraint on the prior precision in the top layer. Other choices of prior for $G_0$ include the beta process, the $\sigma$-stable process and the inverse Gaussian process (Ghosal & Van der Vaart, 2017).

The object of variational inference is to minimise the divergence between the variational distribution and the true posterior. For infinite-dimensional random measures, the Kullback–Leibler divergence is well defined even though the corresponding density function does not exist with respect to Lebesgue measure. Suppose two random measures $P$ and $Q$ from $(\Theta, \mathcal{M})$ into $(\Omega, \mathcal{F})$ are such that the Radon–Nikodym derivative $dQ/dP$ exists, that is, $Q$ is absolutely continuous with respect to $P$. Their Kullback–Leibler divergence is defined as
$$ \mathrm{KL}(Q \,\|\, P) = \int_{\Theta} \log \frac{dQ}{dP}\, dQ, $$
which is computationally intractable due to the infinite-dimensional integral. By contrast, we calculate this divergence via the limit superior of the divergences between the corresponding finite-dimensional induced measures, that is,
$$ \mathrm{KL}(Q \,\|\, P) = \limsup_{\Omega_1} \mathrm{KL}(q^{\Omega_1} \,\|\, p^{\Omega_1}), \qquad (4) $$
where $p^{\Omega_1}$ and $q^{\Omega_1}$ are the measures induced by $P$ and $Q$, respectively, on a finite-dimensional partition $\Omega_1 = (A_1, \ldots, A_n)$, such that $p^{\Omega_1}(A_i) = P(A_i)$ and $q^{\Omega_1}(A_i) = Q(A_i)$ for each $A_i \in \Omega_1$. With an induced random variable $Z^{\Omega_1}: \Theta \to \mathbb{R}^n$, we can also denote the induced measures by $p(Z^{\Omega_1})$ and $q(Z^{\Omega_1})$. The result in (4) is justified in Appendix A.2.
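Equation (4) replaces an intractable infinite-dimensional integral by divergences between induced finite-dimensional measures. As a concrete illustration (ours, with arbitrary numerical values), the measures induced on a partition by two Dirichlet processes are ordinary Dirichlet distributions, whose Kullback–Leibler divergence is available in closed form.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(a, b):
    """KL( Dirichlet(a) || Dirichlet(b) ) in closed form."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return (gammaln(a.sum()) - gammaln(b.sum())
            - np.sum(gammaln(a) - gammaln(b))
            + np.sum((a - b) * (digamma(a) - digamma(a.sum()))))

# Q = DP(alpha_q * H_q) and P = DP(alpha_p * H_p) induce Dirichlet distributions on any
# finite partition (A_1, ..., A_n); (4) approaches KL(Q || P) through such induced divergences.
alpha_q, alpha_p = 5.0, 1.0
H_q = np.array([0.2, 0.3, 0.5])        # H_q(A_1), H_q(A_2), H_q(A_3)
H_p = np.array([1 / 3, 1 / 3, 1 / 3])
print(dirichlet_kl(alpha_q * H_q, alpha_p * H_p))
```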
The parametric variational inference algorithm uses a finite-dimensional variational distribution to approximate the true posterior by maximising the evidence lower bound (Blei et al., 2017), while for nonparametric models we need to use a random measure as the variational distribution due to the infinite dimensionality of the latent variables. Based on the Kullback–Leibler divergence between random measures in (4), we propose a general variational inference framework for Bayesian nonparametric models by defining the corresponding nonparametric evidence lower bound as
$$ \mathrm{NPELBO} = \liminf_{\Omega_1} \Bigl[ E_{q(Z^{\Omega_1})}\bigl\{\log p(X, Z^{\Omega_1})\bigr\} - E_{q(Z^{\Omega_1})}\bigl\{\log q(Z^{\Omega_1})\bigr\} \Bigr], \qquad (5) $$
where $p(X, Z^{\Omega_1})$ and $q(Z^{\Omega_1})$ are the measures induced on $\Omega_1$ by the joint distribution and the variational distribution, with $X$ and $Z$ denoting the observations and the latent variables, respectively. Given the result that
$$ \mathrm{KL}\bigl(Q(Z) \,\|\, P(Z \mid X)\bigr) + \mathrm{NPELBO} = \log p(X), \qquad (6) $$
our proposed framework considers maximising the nonparametric evidence lower bound in (5), which is equivalent to minimising the Kullback–Leibler divergence between the variational distribution $Q(Z)$ and the true posterior $P(Z \mid X)$. See Appendix A.3 for the proof of equation (6). To simplify the notation, we use $p(\cdot)$ and $q(\cdot)$ to denote the true and variational distributions, respectively, where the context is clear.

The hierarchical Bayesian nonparametric model in (1) has multiple layers, and hence $Z$ in (5) includes several latent variables: the global latent variable $\beta$, the local latent variables $\{z_{ji}\}_{1 \le j \le J,\, 1 \le i \le N_j}$, the global prior $G_0$, and the local priors $\{G_j\}_{1 \le j \le J}$. To factorise the variational distribution $q(\beta, \{z_{ji}\}, G_0, \{G_j\})$, traditional variational inference algorithms typically consider the mean-field setting,
$$ q(\beta, \{z_{ji}\}, G_0, \{G_j\}) = q(\beta)\, q(G_0) \prod_{j=1}^{J} q(G_j) \prod_{j=1}^{J} \prod_{i=1}^{N_j} q(z_{ji}), $$
where variables in different layers are assumed to be independent. However, this assumption is not valid in nonparametric variational inference because the independence between $\{G_j\}_{1 \le j \le J}$ and $G_0$ contradicts the fact that the support of each $G_j$ is fully determined by $G_0$. As the updates of $q(G_j)$ and $q(G_0)$ are independent during the iterations, they are likely to have different supports, which contradicts their definitions. Moreover, the mean-field assumption fails to account for the possibly high correlation among $G_0$, $\{G_j\}$ and $\{z_{ji}\}$. In contrast to traditional variational inference under the mean-field setting, we factorise the variational distribution as
$$ q\bigl(\beta, \{z_{ji}\}, G_0, \{G_j\}\bigr) = q(\beta)\, q(G_0) \prod_{j=1}^{J} q(G_j \mid G_0, z_j) \prod_{j=1}^{J} \prod_{i=1}^{N_j} q(z_{ji}), \qquad (7) $$
in the sense of the probability law. On the one hand, our conditional setting eliminates the contradiction in the mean-field setting, because we consider the variational distribution of $G_j$ conditional on $G_0$, which ensures that $G_j$ shares the same support as $G_0$. On the other hand, such a conditional design facilitates the recovery of the dependence structure among $G_0$, $\{G_j\}$ and $\{z_{ji}\}$.

Combining (5) and (7), our proposed conditional variational inference seeks to maximise the following nonparametric evidence lower bound,
$$ \mathrm{NPELBO} = \liminf_{\Omega_1} \Bigl[ E_{q(\beta, \{z_j\}, G_0^{\Omega_1}, \{G_j^{\Omega_1}\})}\bigl\{\log p(\{x_j\}, \beta, \{z_j\}, G_0^{\Omega_1}, \{G_j^{\Omega_1}\})\bigr\} - E_{q(G_0^{\Omega_1})}\bigl\{\log q(G_0^{\Omega_1})\bigr\} - E_{q(\beta)}\bigl\{\log q(\beta)\bigr\} $$
$$ \qquad - \sum_{j=1}^{J} \sum_{i=1}^{N_j} E_{q(z_{ji})}\bigl\{\log q(z_{ji})\bigr\} - \sum_{j=1}^{J} E_{q(G_0^{\Omega_1})} E_{q(z_j)} E_{q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)}\bigl\{\log q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)\bigr\} \Bigr], \qquad (8) $$
where $x_j = \{x_{ji}\}_{1 \le i \le N_j}$, $z_j = \{z_{ji}\}_{1 \le i \le N_j}$, and $\Omega_1$ is a partition of the sample space $\Omega$ for $G_0$ and $\{G_j\}_{1 \le j \le J}$.

To maximise the nonparametric evidence lower bound in (8), we first seek the optimal variational distribution of $G_j$ given $G_0$ and $z_j$ for each $j$. As
$$ p\bigl(\{x_j\}, \beta, \{z_j\}, G_0^{\Omega_1}, \{G_j^{\Omega_1}\}\bigr) = p(\beta)\, p(G_0^{\Omega_1}, \{z_j\}) \prod_{j=1}^{J} p(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)\, p(x_j \mid \beta, z_j), $$
the non-constant term in (8) with respect to $q(G_j \mid G_0, z_j)$ is
$$ \liminf_{\Omega_1} \Bigl[ \sum_{j=1}^{J} E_{q(G_0^{\Omega_1})} E_{q(z_j)} E_{q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)}\bigl\{ \log p(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j) - \log q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j) \bigr\} \Bigr]. $$
It is worth noting that the above expression can be viewed as the negative of a Kullback–Leibler divergence, whose maximum is zero. Therefore, the optimal conditional variational distribution for $G_j$ is $q(G_j \mid G_0, z_j) = p(G_j \mid G_0, z_j)$, since the divergence equals zero if and only if $q(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j) = p(G_j^{\Omega_1} \mid G_0^{\Omega_1}, z_j)$ for any partition $\Omega_1$. This result is also intuitive because the best variational distribution to approximate the posterior given the other variables is the conditional posterior itself. Benefiting from the conjugacy in Bayesian nonparametric models, the analytical form of such a conditional posterior is easy to derive.
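For instance, when $G_j \mid G_0 \sim \mathrm{DP}(\gamma G_0)$ and $z_{j1}, \ldots, z_{jN}$ are drawn from $G_j$, the conditional posterior is the Dirichlet process $\mathrm{DP}(\gamma G_0 + \sum_i \delta_{z_{ji}})$, and on the partition induced by the atoms it reduces to an ordinary Dirichlet distribution. A minimal sketch of this conjugacy, with our own variable names and toy numbers:

```python
import numpy as np

def conditional_gj_params(gamma, g0_weights, counts):
    """Dirichlet parameters of q(G_j | G0, z_j) = DP(gamma*G0 + sum_i delta_{z_ji})
    induced on the partition (phi_1, ..., phi_K, phi_0)."""
    # counts[k] = number of z_ji equal to atom phi_k; the complement set phi_0 has no counts.
    return np.append(gamma * np.asarray(g0_weights) + np.asarray(counts),
                     gamma * (1.0 - np.sum(g0_weights)))

rng = np.random.default_rng(1)
g0 = np.array([0.5, 0.3, 0.1])          # masses of G0 on phi_1..phi_3 (0.1 remains on phi_0)
counts = np.array([7, 2, 0])            # observed assignments in document j
params = conditional_gj_params(gamma=2.0, g0_weights=g0, counts=counts)
print(params, rng.dirichlet(params))    # one draw of (G_j(phi_1), ..., G_j(phi_0))
```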
We then implement a coordinate ascent approach by iterating the following three steps until convergence. The first step obtains the optimal $q(G_0)$ conditional on the other parameters. To achieve this, in Appendix A.4 we rely on (8) to derive the evidence lower bound under $\Omega_1$ with respect to $q(G_0^{\Omega_1})$,
$$ \mathrm{ELBO}_{\Omega_1} = E_{q(G_0^{\Omega_1})}\Bigl\{ \log \frac{p(G_0^{\Omega_1})}{q(G_0^{\Omega_1})} \Bigr\} + \sum_{j=1}^{J} E_{q(G_0^{\Omega_1})} E_{q(z_j)} \Bigl[ \log E_{p(G_j^{\Omega_1} \mid G_0^{\Omega_1})}\bigl\{ p(z_j \mid G_j^{\Omega_1}) \bigr\} \Bigr] + \text{constant}, \qquad (9) $$
where $E_{p(G_j^{\Omega_1} \mid G_0^{\Omega_1})}$ is taken with respect to the prior distribution $p(G_j^{\Omega_1} \mid G_0^{\Omega_1})$ instead of the variational distribution. Consequently, this expectation can be calculated easily thanks to its analytical representation. As the nonparametric evidence lower bound is $\mathrm{NPELBO} = \liminf_{\Omega_1} (\mathrm{ELBO}_{\Omega_1})$, if we can find a random measure $q(G_0)$ whose induced measure $q^{\Omega_1}(G_0)$ satisfies
$$ q(G_0^{\Omega_1}) \propto p(G_0^{\Omega_1}) \exp\Bigl( \sum_{j=1}^{J} E_{q(z_j)} \bigl[ \log E_{p(G_j^{\Omega_1} \mid G_0^{\Omega_1})}\{ p(z_j \mid G_j^{\Omega_1}) \} \bigr] \Bigr) \qquad (10) $$
for any partition $\Omega_1$, then this $q(G_0)$ is the optimal variational random measure. In cases where it is difficult to find a simple random measure satisfying (10), we restrict the variational distribution to a special family and optimise its parameters. Provided with the updated $q(G_0)$ and the other parameters, the second step optimises the variational distribution for $z_j$ in the form
$$ q(z_j) \propto \exp\Bigl( E_{q(G_0)} \bigl[ \log E_{p(G_j \mid G_0)}\{ p(z_j \mid G_j) \} \bigr] + E_{q(\beta)}\bigl\{ \log p(x_j \mid z_j, \beta) \bigr\} \Bigr). \qquad (11) $$
Finally, the optimal variational distribution for the global latent variable $\beta$ given the other updated parameters is
$$ q(\beta) \propto p(\beta) \exp\Bigl[ \sum_{j=1}^{J} E_{q(z_j)}\bigl\{ \log p(x_j \mid z_j, \beta) \bigr\} \Bigr]. \qquad (12) $$

Whereas the coordinate ascent formulas in (10)–(12) provide a general framework, they are difficult to implement directly, especially for big data, because updating all local latent variables in each iteration is not computationally efficient. By contrast, stochastic variational inference (Hoffman et al., 2013) is widely used in practice, where the computation is accelerated by randomly selecting a small batch of data and iteratively updating the parameters with a random but unbiased gradient of the evidence lower bound. Specifically, for an evidence lower bound $\mathrm{ELBO}(\xi)$ viewed as a function of a parameter $\xi$, if there exists a random function $h(\xi)$ satisfying $E\{h(\xi)\} = \mathrm{ELBO}(\xi)$, then $\xi$ can be updated in the $\tau$-th iteration by $\xi^{(\tau)} = \xi^{(\tau-1)} + \rho_\tau \nabla h(\xi^{(\tau-1)})$, where the step size $\rho_\tau$ satisfies the Robbins–Monro condition (Robbins & Monro, 1951). For hierarchical Bayesian nonparametric models, the traditional stochastic variational inference methods suffer from the mean-field assumption and the universal truncation (Hoffman et al., 2013; Wang et al., 2011).
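The following toy sketch illustrates the stochastic-ascent recursion with a Robbins–Monro step size. The schedule $(\tau_0 + t)^{-\kappa}$, the target function and all numerical values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def robbins_monro_steps(tau0=1.0, kappa=0.6, n_steps=1000):
    """Step sizes rho_t = (tau0 + t)^(-kappa); for 0.5 < kappa <= 1 they satisfy the
    Robbins–Monro condition: sum rho_t = infinity, sum rho_t^2 < infinity."""
    return [(tau0 + t) ** (-kappa) for t in range(n_steps)]

def stochastic_ascent(xi_init, noisy_gradient, n_steps=1000, seed=0):
    """Generic update xi <- xi + rho_t * h(xi), where h is an unbiased noisy gradient."""
    rng = np.random.default_rng(seed)
    xi = np.asarray(xi_init, float)
    for rho in robbins_monro_steps(n_steps=n_steps):
        xi = xi + rho * noisy_gradient(xi, rng)
    return xi

# Toy objective -0.5 * ||xi - 3||^2 with noisy gradients (3 - xi + noise); xi converges near 3.
print(stochastic_ascent(np.zeros(2), lambda xi, rng: (3.0 - xi) + rng.normal(scale=0.5, size=2)))
```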
To overcome these disadvantages, we propose a new approach that integrates a Monte Carlo sampling scheme into the stochastic variational inference framework under the conditional variational setting, namely conditional variational inference with adaptive truncation. The proposed method not only benefits from the fast speed of stochastic variational inference but also overcomes the challenges of nonparametric variational inference discussed in Section 3.3. Moreover, it can automatically truncate the dimension of the variational distributions in an adaptive fashion. We show the detailed procedures in Sections 4.2 and 4.3.

Under the conditional setting, we rely on the conditional variational inference framework in Section 3.4 to infer the global variables, while approximating the optimal distributions of the local variables via Monte Carlo sampling instead of analytical optimisation. For the variational inference part, we approximate the posterior distribution for the global prior $G_0$ and the global latent variable $\beta$. From the entire data $x = \{x_1, \ldots, x_J\}$, we randomly sample a subset $\{x_s : x_s \in x\}_{s=1}^{S}$, where $S$ is the batch size with $S \ll J$. Assuming that a partition $\Omega_1$ is given to obtain the limit inferior of the nonparametric evidence lower bound, we aim to update the parameters for $q(G_0^{\Omega_1})$ conditional on the updated $q(\beta)$ and $\{q(z_s)\}_{s=1}^{S}$. While standard stochastic variational inference updates the parameters analytically, we draw $T_s$ samples from $q(z_s)$ for each $z_s$ in the batch, $\hat{z}_s = \{\hat{z}_{s,t} : \hat{z}_{s,t} \sim q(z_s)\}_{t=1}^{T_s}$, so as to obtain a random nonparametric evidence lower bound with respect to $q(G_0^{\Omega_1})$,
$$ \widehat{\mathrm{NPELBO}} = E_{q(G_0^{\Omega_1})}\Bigl[ \log \frac{p(G_0^{\Omega_1})}{q(G_0^{\Omega_1})} + \frac{J}{S} \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \log E_{p(G_s^{\Omega_1} \mid G_0^{\Omega_1})}\bigl\{ p(\hat{z}_{s,t} \mid G_s^{\Omega_1}) \bigr\} \Bigr] + \text{constant}. \qquad (13) $$
It is obvious that $E(\widehat{\mathrm{NPELBO}}) = \mathrm{NPELBO}$ and hence the random gradient is unbiased, which satisfies the key condition for stochastic variational inference. Therefore, according to (13), we can use the random gradient generated from $\hat{z}_s$ to update the parameters of $q(G_0^{\Omega_1})$. Analogously, the random nonparametric evidence lower bound with respect to $q(\beta)$ is
$$ \widehat{\mathrm{NPELBO}} = E_{q(\beta)}\Bigl\{ \log \frac{p(\beta)}{q(\beta)} + \frac{J}{S} \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \log p(x_s \mid \hat{z}_{s,t}, \beta) \Bigr\} + \text{constant}, \qquad (14) $$
and we can update its parameters with the corresponding random gradient in a similar way.

For the Monte Carlo sampling part, given the updated $q(G_0^{\Omega_1})$ and $q(\beta)$ from (13) and (14), we draw samples $\hat{z}_s$ for each $z_s$ in the batch using Markov chain Monte Carlo. It is difficult to obtain a closed-form formula for the optimal $q(z_s)$ due to the lack of conjugacy between $G_0$ and $z_s$. Moreover, since $G_s$ is integrated out, the local latent variables $\{z_{si}\}_{i=1}^{N_s}$ are conditionally dependent and cannot be sampled independently. Therefore, we propose the following Gibbs sampling approach to obtain samples from the optimal variational distribution. Conditional on $q(G_0^{\Omega_1})$, $q(\beta)$ and the samples $\hat{z}_{s,-i} = \{\hat{z}_{sl} : l = 1, \ldots, N_s,\ l \neq i\}$, it follows from (11) that the optimal variational distribution of $q(z_{si})$ is
$$ q(z_{si}) \propto \exp\Bigl\{ E_{q(G_0^{\Omega_1})}\bigl[ \log E_{p(G_s^{\Omega_1} \mid G_0^{\Omega_1})}\{ p(z_{si}, \hat{z}_{s,-i} \mid G_s^{\Omega_1}) \} \bigr] + E_{q(\beta)}\bigl[ \log p(x_{si} \mid z_{si}, \beta) \bigr] \Bigr\}. \qquad (15) $$
Then we sample $\hat{z}_{si} \sim q(z_{si})$ for each $i$ iteratively, which constructs a Markov chain. As $\hat{z}_{si}$ is sampled from the optimised variational distribution conditional on $\hat{z}_{s,-i}$ in (15), the joint distribution generated from the Markov chain converges to the optimal variational distribution, which achieves the maximum nonparametric evidence lower bound. After convergence, we can sample $\hat{z}_{s,1}, \ldots, \hat{z}_{s,T_s}$ from the stable Markov chain.

To maximise the nonparametric evidence lower bound, we iterate the following three steps: (i) randomly select a small batch from the entire data, (ii) sample $\{\hat{z}_s\}_{s=1}^{S}$ by the Monte Carlo method, and (iii) update $q(G_0^{\Omega_1})$ and $q(\beta)$ in the stochastic variational inference framework. Moreover, the partition $\Omega_1$ in our method is data-adaptive, as demonstrated in Section 4.3.

In this section, we illustrate the approach to determine the finite and measurable partition $\Omega_1$ that attains the limit inferior of the nonparametric evidence lower bound. Rather than fixing a universal truncation level, in our framework the dimension of $\Omega_1$ gradually increases to a stable level. This partition, or truncation, depends on the data fitting and is embedded within the optimisation process, which provides another key advantage of integrating the Monte Carlo sampling scheme into the stochastic variational inference framework.

We first define the partition $\Omega_1$. Note that the samples $\{\hat{z}_s\}_{s=1}^{S}$ used to simulate the optimal variational distribution have a finite-dimensional atomic support, denoted by $\phi_1, \ldots, \phi_K$, where $K$ is a finite integer. We therefore partition the sample space $\Omega$ into $K + 1$ sets: $K$ probability mass atoms $\phi_1, \ldots, \phi_K$ and one complement set $\phi_0 = \Omega \setminus \{\phi_1, \ldots, \phi_K\}$. We then update the partition $\Omega_1$ during the inference procedure. If all points in $\{\hat{z}_s\}_{s=1}^{S}$ have been sampled before, we keep the current partition $\Omega_1$. Otherwise, if a sample $\hat{z}_{si} \in \phi_0$, meaning it is distinct from $\phi_1, \ldots, \phi_K$, we draw a new $\phi_{K+1}$ and refine the partition as $(\phi_0, \phi_1, \ldots, \phi_K, \phi_{K+1})$, where we update $\phi_0$ as $\Omega \setminus \{\phi_1, \ldots, \phi_K, \phi_{K+1}\}$.
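A small bookkeeping sketch of this refinement rule (ours, not the paper's code; in Algorithm 1 the refinement happens inside the Gibbs sweep rather than as a separate pass, and because $H$ is diffuse every draw that lands in $\phi_0$ is almost surely a distinct new atom):

```python
def refine_partition(z_batch, n_atoms):
    """Any sampled assignment that falls in the complement set phi_0 (coded as label 0)
    is given a freshly created atom phi_{K+1}, phi_{K+2}, ...; otherwise the partition is kept."""
    for doc in z_batch:
        for i, z in enumerate(doc):
            if z == 0:                 # a previously unseen point, i.e. it lies in phi_0
                n_atoms += 1
                doc[i] = n_atoms       # relabel it as the new atom
    return z_batch, n_atoms

z_batch = [[1, 3, 0, 2], [2, 0, 1, 1]]          # 0 marks draws that fell in phi_0
print(refine_partition(z_batch, n_atoms=3))     # two new atoms are opened; n_atoms becomes 5
```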
With the dynamic partition $\Omega_1$ defined above, the prior is proportional to the posterior on $\phi_0$, $p(G_0(\phi_0)) \propto q(G_0(\phi_0))$, due to the lack of data information. Therefore, the nonparametric evidence lower bound remains constant under any further partition, which means that the partition $\Omega_1$ enables the nonparametric evidence lower bound to attain its limit inferior.

Algorithm 1: Conditional variational inference with adaptive truncation.
    Initialise the partition $\Omega_1$ and the parameters of $q(G_0)$ and $q(\beta)$, and set the step sizes $\{\rho_\tau\}_{\tau \ge 1}$;
    repeat
        Randomly select $x_1, \ldots, x_S$ from the entire dataset;
        for $s \in \{1, \ldots, S\}$ do
            Initialise the values of $\{\hat{z}_{si}\}_{i=1}^{N_s}$;
            repeat
                for $i \in \{1, \ldots, N_s\}$ do
                    Sample $\hat{z}_{si}$ conditional on $q(G_0)$, $q(\beta)$ and $\hat{z}_{s,-i}$ according to (15);
                    if a new $\hat{z}_{si}$ is sampled then refine the partition $\Omega_1$;
            until convergence;
            Sample $\{\hat{z}_{s,t}\}_{t=1}^{T_s}$ from the stable Markov chain;
        Update the parameters of $q(G_0)$ and $q(\beta)$ given the samples $\{\hat{z}_s\}_{s=1}^{S}$ with step size $\rho_\tau$ according to (13) and (14);
    until convergence.

In our framework, we start from a low-dimensional partition when the variational distributions are far from optimal, and then update the partition and gradually increase its dimension according to the data fitting. When the inference is close to convergence with a sufficiently large partition dimension, further refinements of the partition become unlikely and hence the dimension of the variational distributions attains a stable level. This data-adaptive truncation reflects a balance between the goodness-of-fit and the model complexity. We summarise the above inference procedure in Algorithm 1.
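To make the sampling step of Algorithm 1 concrete, the sketch below performs one Gibbs sweep over the assignments of a single document, using weights of the form suggested by (15): a prior term involving the masses of $q(G_0)$ plus the current within-document count, multiplied by an expected word likelihood under $q(\beta_k)$. All variable names, the coding of $\phi_0$ as label 0, and the toy inputs are our own assumptions, not the paper's implementation.

```python
import numpy as np

def gibbs_sweep(z, x, g0_mass, gamma, exp_log_lik, rng):
    """One sweep of the local Gibbs sampler for a single document.
    z: current assignments in {0,...,K} (0 = complement set phi_0), x: word ids,
    g0_mass: masses (m_1,...,m_K, m_0) of q(G0) on the current partition,
    exp_log_lik[k, w]: expected log-likelihood of word w under topic k
    (row K plays the role of the prior-predictive term for a new topic)."""
    K = len(g0_mass) - 1
    counts = np.bincount(z, minlength=K + 1)[1:]             # n_k, excluding phi_0
    for i in range(len(z)):
        if z[i] >= 1:
            counts[z[i] - 1] -= 1                            # remove word i from its topic
        log_w = np.empty(K + 1)
        log_w[1:] = np.log(gamma * g0_mass[:K] + counts) + exp_log_lik[:K, x[i]]
        log_w[0] = np.log(gamma * g0_mass[K]) + exp_log_lik[K, x[i]]
        w = np.exp(log_w - log_w.max())
        z[i] = rng.choice(K + 1, p=w / w.sum())              # resample the assignment
        if z[i] >= 1:
            counts[z[i] - 1] += 1                            # a draw of 0 would trigger a refinement
    return z

rng = np.random.default_rng(0)
K, W, N = 3, 5, 12
g0_mass = np.array([0.4, 0.3, 0.2, 0.1])                     # m_1, m_2, m_3 and m_0
exp_log_lik = np.log(rng.dirichlet(np.ones(W), size=K + 1))
x = rng.integers(0, W, size=N)
z = rng.integers(1, K + 1, size=N)
print(gibbs_sweep(z, x, g0_mass, gamma=1.0, exp_log_lik=exp_log_lik, rng=rng))
```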
Applications in topic modelling
We apply the proposed conditional variational inference with adaptive truncation method to the hierarchical Dirichlet process model. Specifically, we factorise the variational distributions in the conditional setting according to (7) and specify the variational family as follows. First, the variational distribution of $G_s$ for each $s$ is given by $q(G_s \mid G_0, z_s) = \mathrm{DP}\bigl( \sum_{k=1}^{K} n_{sk} \delta_{\phi_k} + \gamma G_0 \bigr)$, where $n_{sk} = \sum_{i=1}^{N_s} I(z_{si} = \phi_k)$ with $I(\cdot)$ being the indicator function. Second, $q(\beta_k)$ for each topic $k$ is set as a $W$-dimensional Dirichlet distribution, $q(\beta_k) = \mathrm{Dirichlet}(\lambda_k)$, where $\lambda_k = (\lambda_{k1}, \ldots, \lambda_{kW})^{\mathrm{T}}$ is the parameter of the vocabulary distribution for topic $k$. For prediction, $\lambda_k$ is the core quantity of the inference. The variational distributions for the topics without any observation remain the same as the prior; therefore, we regard them as the zeroth topic without loss of generality and denote the corresponding variational distribution on the vocabulary by $q(\beta_0) = \mathrm{Dirichlet}(\eta)$. Third, we propose the variational family for $G_0$ as
$$ q(G_0) = \sum_{k=1}^{K} m_k \delta_{\phi_k} + m_0\, \mathrm{DP}(\alpha H), \qquad (16) $$
such that $\sum_{k=0}^{K} m_k = 1$, with the atoms of the $\mathrm{DP}(\alpha H)$ component drawn from $H$ owing to the lack of posterior information on $\phi_0$. Taking into account the tradeoff between inferential accuracy and computational efficiency, in (16) we assume that $q(G_0)$ has deterministic probability masses on the $\phi_k$'s, as the main purpose of $G_0$ is to provide a discrete and infinite-dimensional support ensuring that the $G_j$'s share the same topics $\phi_k$. This kind of spike-and-slab methodology is widely used in Bayesian analysis (Andersen et al., 2017). Under such a scenario, the optimised $\{m_k\}_{k=0}^{K}$ coincide with the maximum-a-posteriori estimate. Finally, following (15), we use Monte Carlo sampling to obtain samples $\{\hat{z}_s\}_{s=1}^{S}$ and hence do not need to parametrise their variational distributions. Based on these settings, we can infer the hierarchical Dirichlet process model by applying Algorithm 1 in the following steps.

The partition $\Omega_1$. As different samples in $\{\hat{z}_s\}_{s=1}^{S}$ are used to represent different topic clusters in topic modelling, their exact values in the sample space do not carry any statistical information. We therefore index the topics with observations from 1 to $K$ and denote the different clusters by distinct points $\phi_1, \ldots, \phi_K$ in $\Omega$. With the samples $\{\hat{z}_s\}_{s=1}^{S}$, we define $\hat{n}_{sk,t} = \sum_{i=1}^{N_s} I(\hat{z}_{si,t} = \phi_k)$. Then the number of topics with observations is $K = \sum_{k \ge 1} I\bigl( \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \hat{n}_{sk,t} > 0 \bigr)$. We partition $\Omega$ into a $(K+1)$-dimensional $\Omega_1$ consisting of the $K$ single points $\{\phi_k\}_{k=1}^{K}$ and one complement set $\phi_0 = \Omega \setminus \{\phi_k\}_{k=1}^{K}$.

Inference for $G_0$. With the partition $\Omega_1$ defined above, $G_s^{\Omega_1}$ conditional on $G_0^{\Omega_1}$ follows a $(K+1)$-dimensional Dirichlet distribution. By (13), we derive in Appendix A.5 the random nonparametric evidence lower bound with respect to $q(G_0)$,
$$ \widehat{\mathrm{NPELBO}} = - \sum_{k=1}^{K} \log m_k + (\alpha - 1) \log m_0 + \frac{J}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} T_s^{-1} \sum_{t=1}^{T_s} \log \frac{\Gamma(\gamma m_k + \hat{n}_{sk,t})}{\Gamma(\gamma m_k)} + \text{constant}. \qquad (17) $$
However, there is no closed-form expression for the probability proportion parameters $\{m_k\}_{k=0}^{K}$ that attains the maximum of (17). Moreover, the standard gradient descent algorithm fails in this case, because $\{m_k\}_{k=0}^{K}$ may easily leave the simplex during the updating procedure. Instead, given the parameters $\{m_k^{(\tau)}\}_{k=0}^{K}$ in the $\tau$-th iteration, we define $m^*_k$, up to a normalising constant ensuring $\sum_{k=0}^{K} m^*_k = 1$, by
$$ m^*_k \propto \begin{cases} \dfrac{J}{S}\, \gamma \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \bigl\{ \Psi(\gamma m_k^{(\tau)} + \hat{n}_{sk,t}) - \Psi(\gamma m_k^{(\tau)}) \bigr\} m_k^{(\tau)} - 1, & (k = 1, \ldots, K), \\ \alpha - 1, & (k = 0), \end{cases} \qquad (18) $$
where $\Psi(\cdot)$ denotes the digamma function, and update the parameters by $m_k^{(\tau+1)} = (1 - \rho_\tau) m_k^{(\tau)} + \rho_\tau m^*_k$. In Appendix A.6, we also show that this updating scheme is consistent with gradient-based updating after the inverse-logit transformation, and the condition $\sum_{k=0}^{K} m_k = 1$ is preserved throughout the updates.
Inference for $\beta$. By (14), we update the parameters of $q(\beta)$ using the samples $\{\hat{z}_s\}_{s=1}^{S}$. We define $\lambda^*_{kw}$ for topic $k$ and word $w$ as
$$ \lambda^*_{kw} = \eta + \frac{J}{S} \sum_{s=1}^{S} T_s^{-1} \sum_{t=1}^{T_s} \sum_{i=1}^{N_s} I(\hat{z}_{si,t} = \phi_k,\ x_{si} = w), \qquad (19) $$
and update the parameter $\lambda_k$ by $\lambda_k^{(\tau+1)} = (1 - \rho_\tau) \lambda_k^{(\tau)} + \rho_\tau \lambda^*_k$ for each $k$, where $\lambda^*_k = (\lambda^*_{k1}, \ldots, \lambda^*_{kW})^{\mathrm{T}}$.

Sampling for $z$. According to (15), we sample $\hat{z}_{si}$ conditional on $q(G_0)$, $q(\beta)$ and $\hat{z}_{s,-i}$ by
$$ q(z_{si} = \phi_k) \propto \begin{cases} (\gamma m_k + \hat{n}^{-i}_{sk}) \exp\bigl( \Psi(\lambda_{k x_{si}}) - \Psi(\textstyle\sum_{w=1}^{W} \lambda_{kw}) \bigr), & (k = 1, \ldots, K), \\ \gamma m_0 \exp\bigl( \Psi(\eta) - \Psi(W\eta) \bigr), & (k = 0), \end{cases} \qquad (20) $$
to construct the Markov chain, where $\hat{n}^{-i}_{sk} = \sum_{1 \le l \le N_s,\ l \neq i} I(\hat{z}_{sl} = \phi_k)$. Whenever the sampled $\hat{z}_{si}$ lies in $\phi_0$, meaning that $\hat{z}_{si}$ forms a new point not belonging to $\{\phi_1, \ldots, \phi_K\}$, we update the partition and add a new topic indicated by $\phi_{K+1}$; otherwise we keep the same partition dimension. Iterating the sampling scheme until convergence, we obtain the samples $\{\hat{z}_{si,t}\}_{1 \le s \le S,\ 1 \le i \le N_s,\ 1 \le t \le T_s}$ and the corresponding counts $\{\hat{n}_{sk,t}\}_{1 \le s \le S,\ 1 \le k \le K,\ 1 \le t \le T_s}$ for the selected chunk.

According to Algorithm 1, we repeatedly select documents in a batch at random, sample $z$ and update the parameters for $G_0$ and $\beta$ by iterating (18)–(20) until the nonparametric evidence lower bound converges to its maximum.

Our method is different from other nonparametric inference methods. Wang & Blei (2012) replace the analytical updating of local parameters with locally collapsed Gibbs sampling, but their approach cannot maximise the evidence lower bound, especially when $q(\beta)$ has a large variance. Bryant & Sudderth (2012) use split–merge moves to generate new dimensions and remove redundant ones. However, to check the split–merge criterion, their method needs to evaluate the training likelihood, which is computationally inefficient. Moreover, both methods rely on the mean-field assumption and hence ignore the correlation structure among latent variables.
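A sketch of the two global stochastic updates (18) and (19), assuming the local samples have already been drawn and tabulated as in the text. The small positive clip before normalisation is our own numerical safeguard and is not part of the paper's derivation; all names and toy inputs are illustrative.

```python
import numpy as np
from scipy.special import digamma

def update_m(m, n_hat, alpha, gamma, J, rho):
    """One stochastic update of the masses (m_1,...,m_K, m_0) of q(G0), following (18).
    n_hat[s] has shape (T_s, K) with the counts n_hat_{sk,t} for batch document s."""
    K, S = len(m) - 1, len(n_hat)
    m_star = np.empty(K + 1)
    for k in range(K):
        acc = sum(np.mean(digamma(gamma * m[k] + counts[:, k]) - digamma(gamma * m[k]))
                  for counts in n_hat)
        m_star[k] = (J / S) * gamma * acc * m[k] - 1.0
    m_star[K] = alpha - 1.0                        # the complement atom phi_0
    m_star = np.clip(m_star, 1e-12, None)          # numerical safeguard (ours), then normalise
    m_star /= m_star.sum()
    return (1.0 - rho) * m + rho * m_star          # convex step keeps the masses on the simplex

def update_lambda(lam, eta, z_samples, docs, K, W, J, rho):
    """One stochastic update of the topic-word Dirichlet parameters, following (19).
    z_samples[s] has shape (T_s, N_s) with labels in {1..K}; docs[s] holds word ids."""
    S = len(docs)
    lam_star = np.full((K, W), float(eta))
    for x, zs in zip(docs, z_samples):
        T_s = zs.shape[0]
        for t in range(T_s):
            for k in range(1, K + 1):
                idx = (zs[t] == k)
                lam_star[k - 1] += (J / S) / T_s * np.bincount(x[idx], minlength=W)
    return (1.0 - rho) * lam + rho * lam_star

# Toy usage with random samples in place of the Gibbs output.
rng = np.random.default_rng(0)
K, W, J, S = 4, 10, 1000, 2
m = np.full(K + 1, 1.0 / (K + 1))
lam = np.full((K, W), 0.5)
docs = [rng.integers(0, W, size=30) for _ in range(S)]
z_samples = [rng.integers(1, K + 1, size=(5, 30)) for _ in range(S)]
n_hat = [np.stack([np.bincount(zt, minlength=K + 1)[1:] for zt in zs]) for zs in z_samples]
m = update_m(m, n_hat, alpha=1.5, gamma=1.0, J=J, rho=0.1)
lam = update_lambda(lam, eta=0.5, z_samples=z_samples, docs=docs, K=K, W=W, J=J, rho=0.1)
print(m.round(3), lam.sum().round(1))
```

The sampling weights for each $\hat{z}_{si}$ would follow (20), as in the Gibbs sweep sketched in Section 4.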
The algorithm of conditional variational inference with adaptive truncation can also be applied to a general class of hierarchical Bayesian nonparametric models in which the global prior $G_0$ is generated from a completely random measure. For example, the gamma–Dirichlet process model uses a gamma process to generate $G_0$ and Dirichlet processes to generate $\{G_j\}_{j=1}^{J}$. In these models, the concentration parameter of each $G_j$ is not fixed and $G_0$ is not restricted to be a probability measure. The corresponding inference algorithm is similar to that for the hierarchical Dirichlet process model, but requires a new parameter $\mu$ to approximate $G_0(\Omega)$. We choose the variational family for the global prior $G_0$ as
$$ q(G_0) = \mu\Bigl( \sum_{k=1}^{K} m_k \delta_{\phi_k} + m_0\, \tilde{N}(\alpha H) \Bigr), \qquad (21) $$
where $\tilde{N}$ is the normalisation of the corresponding completely random measure and $\sum_{k=0}^{K} m_k = 1$. Similarly, we derive the random nonparametric evidence lower bound in Appendix A.7,
$$ \widehat{\mathrm{NPELBO}} = K \log \mu + \sum_{k=1}^{K} \log v(\mu m_k) + \log u(\mu m_0) + \frac{J}{S} \sum_{s=1}^{S} \Bigl\{ \log \frac{\Gamma(\mu)}{\Gamma(\mu + N_s)} + T_s^{-1} \sum_{t=1}^{T_s} \sum_{k=1}^{K} \log \frac{\Gamma(\mu m_k + \hat{n}_{sk,t})}{\Gamma(\mu m_k)} \Bigr\} + \text{constant}, \qquad (22) $$
where $v(\cdot)$ is the weight intensity measure (Appendix A.1) of the completely random measure and $u$ is the density function of $G_0(\Omega)$, which can be derived from the Laplace transform of the completely random measure. Therefore, we can update $\{m_k\}_{k=0}^{K}$ in the same way as for the hierarchical Dirichlet process model. Following Algorithm 1 and its application in Section 5.1, we can also update $\mu$ by stochastic gradient descent. To illustrate with an example, we consider the gamma–Dirichlet model, whose inference algorithm is provided in Appendix A.7.

We apply the algorithm of conditional variational inference with adaptive truncation to three large datasets.

1. arXiv: The corpus includes the descriptive metadata of all articles on arXiv, a free distribution service and an open archive for scholarly articles, up to September 1, 2019, which comprises 1.03M documents and 44M words from a vocabulary of 7,500 terms after preprocessing.
2. New York Times: The corpus combines all articles published by the New York Times from January 1, 1987 to June 19, 2007 (Sandhaus, 2008), which has 1.56M documents and 176M words from a vocabulary of 7,600 terms after preprocessing.

3. Wikipedia: The corpus collects the entire set of entries on the English Wikipedia website on January 1, 2019, which contains 4.03M documents and 423M words from a vocabulary of 8,000 terms after preprocessing.

In the preprocessing, stemming and lemmatisation are used to clean the raw text data. Moreover, words with too high or too low frequency and common stop words are removed before the experiments.

To evaluate the performance of our proposed method, we set aside a test set of 10,000 documents for each dataset and calculate the hold-out perplexity (Ranganath & Blei, 2018),
$$ \text{perplexity}_{\text{hold-out}} = \exp\Bigl\{ - \frac{\sum_{j \in D_{\text{test}}} \log p(x_j^{\text{test}} \mid x_j^{\text{train}}, D_{\text{train}})}{\sum_{j \in D_{\text{test}}} |x_j^{\text{test}}|} \Bigr\}, $$
where $D_{\text{train}}$ and $D_{\text{test}}$ represent the training and test data, $x_j^{\text{train}}$ and $x_j^{\text{test}}$ are the training and test words in test document $j$, and $|x_j^{\text{test}}|$ is the number of words in $x_j^{\text{test}}$. The perplexity measures the uncertainty of the fitted model; hence a better language model with more accurate inference has a higher predictive likelihood and thus a lower perplexity. Since the exact computation of the perplexity is not tractable, the standard routine uses $D_{\text{train}}$ to obtain the variational distributions of $\beta$ and $G_0$, obtains the variational distribution of $G_j$ based on $G_0$ and $x_j^{\text{test}}$, and then approximates the likelihood by $p(x_j^{\text{test}} \mid x_j^{\text{train}}) = \prod_{w \in x_j^{\text{test}}} \sum_{k=1}^{K} \bar{G}_{jk} \bar{\beta}_{kw}$, where $\bar{G}_j = (\bar{G}_{j1}, \ldots, \bar{G}_{jK})^{\mathrm{T}}$ and $\bar{\beta}_k = (\bar{\beta}_{k1}, \ldots, \bar{\beta}_{kW})^{\mathrm{T}}$ are the variational expectations of $G_j$ and $\beta_k$, respectively (Blei et al., 2003).
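The hold-out perplexity defined above can be computed directly from the variational expectations, as in the following sketch; the arrays here are illustrative, not fitted quantities.

```python
import numpy as np

def holdout_perplexity(test_docs, G_bar, beta_bar):
    """Hold-out perplexity as defined above.
    test_docs[j]: array of held-out word ids for test document j,
    G_bar[j]: variational expectation of the topic proportions (length K),
    beta_bar: K x W matrix of expected topic-word probabilities."""
    total_log_lik, total_words = 0.0, 0
    for words, g in zip(test_docs, G_bar):
        # p(x_j^test | x_j^train) ~= prod_w sum_k G_bar_jk * beta_bar_kw
        word_probs = g @ beta_bar[:, words]
        total_log_lik += np.sum(np.log(word_probs))
        total_words += len(words)
    return np.exp(-total_log_lik / total_words)

# Toy check with made-up expectations.
rng = np.random.default_rng(0)
K, W = 4, 20
beta_bar = rng.dirichlet(np.ones(W), size=K)
G_bar = rng.dirichlet(np.ones(K), size=3)
test_docs = [rng.integers(0, W, size=15) for _ in range(3)]
print(holdout_perplexity(test_docs, G_bar, beta_bar))
```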
We model the three datasets under both the hierarchical Dirichlet process and the gamma–Dirichlet process models. For the hierarchical Dirichlet process model we set the hyperparameters $\alpha$, $\gamma$ and $\eta$ to a common fixed value, where $\alpha$ and $\gamma$ are the concentration parameters for $G_0$ and $\{G_j\}$ respectively, and $\eta$ is the hyperparameter of the prior on the distribution of words. We choose a batch size of 256 and adopt a decaying Robbins–Monro learning rate in the updates (Hoffman et al., 2010). The initial number of topics is chosen as 100. For the gamma–Dirichlet process model, we use the same hyperparameters but discard $\gamma$. For comparison, we keep the default settings of traditional online variational inference (Wang et al., 2011).

Figure 2: Top row: plots of the perplexity versus the running time up to 5 hours. Bottom row: plots of the number of topics versus the running time up to 5 hours. Left, middle and right columns correspond to the datasets arXiv, New York Times and Wikipedia, respectively. The black dotted line corresponds to the traditional online variational inference method for the hierarchical Dirichlet process model. The red solid and blue dashed lines correspond to the conditional variational inference with adaptive truncation method for the hierarchical Dirichlet process model and for the gamma–Dirichlet process model, respectively.

The top row of Figure 2 plots the hold-out perplexity as a function of running time for the three comparison methods on the three datasets. Table 1 reports numerical summaries. Several conclusions can be drawn. First, on all three datasets, our conditional variational inference with adaptive truncation method consistently outperforms the traditional online variational inference method. The improvement is highly significant especially for arXiv and Wikipedia; for New York Times, the improvement is moderate, probably due to the long length of the documents in this corpus. Second, for each dataset, the gamma–Dirichlet process model attains a lower perplexity than the hierarchical Dirichlet process model, which makes sense because the gamma–Dirichlet process model removes a restriction of the hierarchical Dirichlet process model and is hence more flexible. Third, the proposed method is computationally efficient. Although it involves Monte Carlo sampling, the perplexity converges quickly, because the convergence of the local Markov chain that assigns words to topics is accelerated by a clear topic–word clustering as the global variational distributions approach the optimum.

Table 1: A summary of hold-out perplexity results on the three datasets. Relative improvements in percentage over TOVI for the HDP model are shown in parentheses.

Inference method      arXiv           New York Times    Wikipedia
HDP model, TOVI       1005            1681              1422
HDP model, CVIAT      832 (17.21%)    1569 (6.66%)      1207 (15.12%)
Gamma-DP model, CVIAT 808 (19.60%)    1536 (8.62%)      1157 (18.64%)

TOVI, traditional online variational inference; CVIAT, conditional variational inference with adaptive truncation; HDP, hierarchical Dirichlet process; Gamma-DP, gamma–Dirichlet process.

The bottom row of Figure 2 plots the number of topics over the course of the inference. For traditional online variational inference, the number of topics remains constant, while for our method it first increases steeply and then converges to a stable level. For example, the number of topics on Wikipedia increases rapidly from 100 to around 190 for the hierarchical Dirichlet process model and around 200 for the gamma–Dirichlet process model. The sharp increase is driven by the data complexity, while the stable level is due to the dimension penalty of the hierarchical Dirichlet process model. Although the estimation of the number of topics is not consistent, the proposed truly nonparametric inference method can provide useful information about the topics in the data. For instance, arXiv, containing the abstract descriptions of scientific articles, has the smallest number of topics because its topics are restricted to the quantitative subjects including computer science, mathematics, statistics and physics. By contrast,
New York Times is a compilation of all news articles covering a wider range of areas, and hence consists of more topics.
Wikipedia has the largest number of topics as it contains almost every aspect of an encyclopedia. One key point here is that we do not need to set a fixed number of topics before the inference. Instead, our method starts from an initial value, for example 100 in our experiments, automatically reaches the optimal number of topics after iterations, and finally keeps it at a stable level.

Moreover, our method reveals better linguistic results. To compare our proposed method with traditional online inference for the hierarchical Dirichlet process model, we report the top 12 words in the top 10 topics with the biggest weights for both methods on the datasets arXiv and Wikipedia in Tables 2a and 2b, respectively. We observe a few apparent patterns. First, our method does not contain replicated topics. The traditional online variational inference method results in very similar word components, for example, columns 1-6 in the bottom part of Table 2a. An ideal topic–word clustering should allocate these words into just one topic. But since the prespecified number of topics is fixed at 150, which seems larger than the truth, the inference generates multiple replicated topics. By contrast, the topic–word clustering produced by our method removes such replicated topics.

Table 2: Top 12 words in the top 10 topics for the datasets (a) arXiv and (b) Wikipedia.
Within the proposed general framework, Algorithm 1 can also be applied to other hierarchical Bayesian nonparametric models, including the hierarchical Pitman–Yor process model (Teh & Jordan, 2010) and the hierarchical beta process model (Thibaux & Jordan, 2007), which are used to represent the power law and the sparsity in latent features, respectively. In such cases, other Monte Carlo methods, for example slice sampling (Neal, 2003), retrospective Markov chain Monte Carlo (Papaspiliopoulos & Roberts, 2008) or unbiased Markov chain Monte Carlo methods with couplings (Jacob et al., 2019), could be used. We expect that our proposed method offers even more advantages in these applications, because the hierarchical Pitman–Yor process with its heavy-tail behaviour and the hierarchical beta process with its sparse structure may suffer more from a universal truncation.
Appendix
A.1 A short review of completely random measures
A completely random measure (Kingman, 1993) is characterised by its Laplace transform,
$$ E\bigl\{ e^{-t P(A)} \bigr\} = \exp\Bigl\{ - \int_{A} \int_{(0,\infty)} (1 - e^{-ts})\, v_c(dx, ds) \Bigr\}, $$
where $A$ is any measurable subset of $\Omega$ and $v_c(dx, ds)$ is called the Lévy measure. If $v_c(dx, ds) = \kappa(dx)\, v(ds)$, where $\kappa(\cdot)$ and $v(\cdot)$ are measures on $\Omega$ and $(0, \infty)$, respectively, the completely random measure is homogeneous (Ghosal & Van der Vaart, 2017). In this case, we call $v(\cdot)$ the weight intensity measure. We can view a completely random measure as a Poisson process on the product space $\Omega \times (0, \infty)$, using its Lévy measure as the mean measure.

A.2 Derivation for (4)
By the definition of the induced measure, $q^{\Omega_1}(d\Theta) = Q(d\Theta)$ for any $\mathcal{M}$-measurable $d\Theta$, we have
$$ \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dq^{\Omega_1} = \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dQ. $$
It follows from $\limsup_{\Omega_1} dq^{\Omega_1}/dp^{\Omega_1} = dQ/dP$ and the monotone convergence theorem that
$$ \limsup_{\Omega_1} \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dQ = \int_{\Theta} \log \frac{dQ}{dP}\, dQ. $$
Combining the above equations yields (4). Furthermore, suppose there exists a sequence of partitions $\{\Omega_i\}_{i \ge 1}$ such that $\limsup_i \Omega_i = \Omega_1$; then
$$ \limsup_{\Omega_i} \int_{\Theta} \log \frac{dq^{\Omega_i}}{dp^{\Omega_i}}\, dq^{\Omega_i} = \limsup_{\Omega_i} \int_{\Theta} \log \frac{dq^{\Omega_i}}{dp^{\Omega_i}}\, dQ = \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dQ = \int_{\Theta} \log \frac{dq^{\Omega_1}}{dp^{\Omega_1}}\, dq^{\Omega_1}. $$
Hence $\limsup \mathrm{KL}(q^{\Omega_i} \,\|\, p^{\Omega_i}) = \mathrm{KL}(q^{\Omega_1} \,\|\, p^{\Omega_1})$, which will be used in Appendix A.5.

A.3 Derivation for (6)

By $p(X, Z) = p(Z \mid X)\, p(X)$, we have
$$ \int \log \frac{p(X, Z^{\Omega_1})}{q(Z^{\Omega_1})}\, q(dZ^{\Omega_1}) = \log p(X) + \int \log \frac{p(Z^{\Omega_1} \mid X)}{q(Z^{\Omega_1})}\, q(dZ^{\Omega_1}). $$
Taking the limit inferior on both sides gives
$$ \liminf_{\Omega_1} \Bigl\{ \int \log \frac{p(X, Z^{\Omega_1})}{q(Z^{\Omega_1})}\, q(dZ^{\Omega_1}) \Bigr\} = \log p(X) - \limsup_{\Omega_1} \Bigl\{ - \int \log \frac{p(Z^{\Omega_1} \mid X)}{q(Z^{\Omega_1})}\, q(dZ^{\Omega_1}) \Bigr\}. $$
Combining the above equation with the definition of the nonparametric evidence lower bound in (5) and the Kullback–Leibler divergence in (4) yields (6).
A.4 Derivation for (9) By p p G Ω , t z j u Jj “ q “ ş ¨ ¨ ¨ ş p ` G Ω , t G j u Jj “ , t z j u Jj “ ˘ dG dG ¨ ¨ ¨ dG J and the hierarchical gen-erative structure, the evidence lower bound under partition Ω with respect to q p G Ω q equals,ELBO Ω “ E q p G Ω q E q pt z j u Jj “ q “ log p p G Ω , t z j u Jj “ q ‰ ´ E q p G Ω q “ log q p G Ω q ‰ ` constant “ E q p G Ω q E q pt z j u Jj “ q “ log q p G Ω q J ź j “ ż p p G Ωj | G Ω q p p z j | G Ωj q dG j ‰ ´ E q p G Ω q “ log q p G Ω q ‰ ` constant “ J ÿ j “ E q p G Ω q E q p z j q ” log E p p G Ωj | G Ω q “ p p z j | G Ωj q ‰ı ` E q p G Ω q “ log p p G Ω q ´ log q p G Ω q ‰ ` constant . Furthermore, based on the equation above, (8) can be expressed as NPELBO “ lim inf Ω ELBO Ω . A.5 Derivation for (17)
By the formula of moments for Dirichlet-distributed random variables, we obtainE p p G Ωs | G Ω q “ p p ˆ z s,t | G Ωs q ‰ “ Γ p γ q Γ p γ ` N s q K ź k “ Γ p γG k ` ˆ n sk,t q Γ p γG k q . Based on the points t φ k u Kk “ defined in Section 5.1, we propose a sequence of partition t Ω c : Ω c “ Ť Kk “ Ω ck u c ě to approach Ω , where Ω ck “ p φ k ´ c ´ , φ k ` c ´ s for k “ , . . . , K and Ω c is the corresponding complement. Under Ω c , q p G Ω c q “ d K ` ` m ´ p G Ω c ´ M Ω c q ˘ and p p G Ω c q “ d K ` p G Ω c q , where d K ` p¨q denotes the density function for p K ` q -dimensionalDirichlet distribution, M “ ř Kk “ m k δ φ k and M Ω c is the corresponding induced randomvariable. By (13), the random nonparametric evidence lower bound under Ω c is { NPELBO Ω c “ E q p G Ωc q ! K ÿ k “ p αH Ω c k ´ q log m G k p G k ´ m k q ` p αH Ω c ´ q log m ` JS S ÿ s “ K ÿ k “ T ´ s T s ÿ t “ log Γ p γG k ` ˆ n sk,t q Γ p γG k q ) ` constant , H Ω c k “ H p Ω ck q . Since, p G k ´ m k q{ m „ Beta p H Ω c k q under q p G Ω c q , the term E q p G Ωc q p αH Ω c k ´ q log m p G k ´ m k q ´ ( is constant with respect to parameters t m k u Kk “ . Taking limsup onboth sides of the above equation with lim sup Ω C E q p G Ωc q p log G k q “ log m k , lim sup Ω C H Ω C k “ k “ , . . . , K and lim sup Ω C H Ω C “ , we obtain equation (17). A.6 Derivation for (18)
Consider the Lagrange multiplier of constrained optimisation, L “ ´ K ÿ k “ log m k ` p α ´ q log m ` JS S ÿ s “ K ÿ k “ T ´ s T s ÿ t “ log Γ p γm k ` ˆ n sk,t q Γ p γm k q ´ λ p K ÿ k “ m k ´ q , its first order conditions satisfy, $’’&’’% J S ´ γ ř Ss “ T ´ s ř T s t “ Φ p γm k ` ˆ n sk,t q ´ Φ p γm k q ( m k ´ “ m k λ, p k “ , . . . , K q ,α ´ “ m λ, p k “ q . Dividing λ on both sides of the above equations, the definition of t m ˚ k u Kk “ in (18) follows.We next show that this updating is consistent to the gradient descent after the inverselogit transformation, that is, transforming t m k u Kk “ by m k “ e θ k { ř Kl “ e θ l to remove theconstraint of ř Kk “ m k “
1. By B m k {B θ k “ m k ´ m k , B m l {B θ k “ ´ m k m l for l ‰ k , and thechain rule, we have B L B θ k “ $’’&’’% J S ´ γ ř Ss “ T ´ s ř T s t “ Φ p γm k ` ˆ n sk,t q ´ Φ p γm k q ( m k ´ ´ Λ m k p k “ , . . . , K q ,α ´ ´ Λ m k p k “ q , where L denotes { NPELBO in (17) andΛ “ α ´ ` K ÿ k “ ” J S ´ γ S ÿ s “ T ´ s T s ÿ t “ Φ p γm k ` ˆ n sk,t q ´ Φ p γm k q ( m k ´ ı . As B L {B θ k “ Λ p m ˚ k ´ m k q , p m ˚ k ´ m k q represents the gradient with respect to t θ k u Kk “ afterthe inverse logit transformation. 27 .7 Derivation for the extension in Section 5.2 Without restriction on probability random measure,log E p p G Ωs | G Ω q “ p p ˆ z s,t | G Ωs q ‰ “ log Γ p ř Kk “ G k q Γ p ř Kk “ G k ` N s q K ź k “ Γ p G k ` ˆ n sk,t q Γ p G k q , In analogy to Appendix A.5, under a partition Ω c , the random nonparametric evidence lowerbound equals, { NPELBO Ω c “ K log µ ` E q p G Ωc q ” K ÿ k “ log p p G Ω c k q ` log p p G Ω c q` JS S ÿ s “ K ÿ k “ T ´ s T s ÿ t “ log Γ p γG Ω c k ` ˆ n sk,t q Γ p γG Ω c k q ı ` constant , where K log µ comes from the Jacob matrix from G , G , . . . , G K to µ, m , . . . , m K . As thepartition converges to single points and the corresponding complement, lim sup Ω c p p G Ω c k q “ v p G Ω c k q , lim sup Ω c p p G Ω c q “ u p G Ω c q , we can have (22) by lim sup Ω c G k “ µm k for k ‰ Ω c G “ µm . Specially, for the gamma–Dirichlet model, { NPELBO “ ´ µ ´ K ÿ k “ log m k ` p α ´ q log µm ` JS S ÿ s “ " log Γ p µ q Γ p µ ` N s q ` K ÿ k “ T ´ s T s ÿ t “ log Γ p µm k ` ˆ n sk,t q Γ p µm k q * ` constant . Therefore, its gradient with respect to µ is, ´ ` α ´ µ ` JS S ÿ s “ ! Φ p µ q ´ Φ p µ ` N s q ` K ÿ k “ T ´ s T s ÿ t “ m k ` Φ p µm k ` ˆ n sk,t q ´ Φ p µm k q ˘) References
Andersen, M. R., Vehtari, A., Winther, O. & Hansen, L. K. (2017). Bayesian inference for spatio-temporal spike-and-slab priors. Journal of Machine Learning Research, 5076–5133.

Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association, 859–877.

Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 993–1022.

Bryant, M. & Sudderth, E. B. (2012). Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 25.

Caron, F. & Fox, E. B. (2017). Sparse graphs using exchangeable random measures. Journal of the Royal Statistical Society: Series B, 1295–1366.

Dunson, D. B. & Park, J.-H. (2008). Kernel stick-breaking processes. Biometrika, 307–323.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 209–230.

Ghosal, S. & Van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge: Cambridge University Press.

Hoffman, M., Bach, F. R. & Blei, D. M. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23.

Hoffman, M. D., Blei, D. M., Wang, C. & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 1303–1347.

Jacob, P. E., O'Leary, J. & Atchadé, Y. F. (2019). Unbiased Markov chain Monte Carlo with couplings. Journal of the Royal Statistical Society: Series B, in press.

Jordan, M. I. (2010). Hierarchical models, nested models and completely random measures. In Frontiers of Statistical Decision Making and Bayesian Analysis. New York: Springer, pp. 207–218.

Kingman, J. F. C. (1993). Poisson Processes. Oxford: Clarendon Press.

Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 705–767.

Papaspiliopoulos, O. & Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 169–186.

Ranganath, R. & Blei, D. M. (2018). Correlated random measures. Journal of the American Statistical Association, 417–430.

Regazzini, E., Lijoi, A. & Prünster, I. (2003). Distributional results for means of normalized random measures with independent increments. The Annals of Statistics, 560–585.

Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.

Roychowdhury, A. & Kulis, B. (2015). Gamma processes, stick-breaking, and variational inference. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics.

Sandhaus, E. (2008). The New York Times Annotated Corpus. Philadelphia: Linguistic Data Consortium.

Sudderth, E. B. & Jordan, M. I. (2009). Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Advances in Neural Information Processing Systems 21.

Teh, Y. W. & Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. In Bayesian Nonparametrics. Cambridge: Cambridge University Press, pp. 158–207.

Teh, Y. W., Jordan, M. I., Beal, M. J. & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 1566–1581.

Teh, Y. W., Kurihara, K. & Welling, M. (2008). Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems 20.

Thibaux, R. & Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, vol. 2.

Wang, C. & Blei, D. M. (2012). Truncation-free online variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems 25.

Wang, C., Paisley, J. & Blei, D. (2011). Online variational inference for the hierarchical Dirichlet process. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics.