Scaling up Dynamic Topic Models
Arnab Bhadury †∗, Jianfei Chen †, Jun Zhu †, Shixia Liu ‡
† Department of Computer Science & Technology; State Key Lab of Intelligent Technology & Systems
† Tsinghua National Lab of Information Science & Technology; Center for Bio-Inspired Computing Research
‡ School of Software, Tsinghua University, Beijing, 100084 China
abhadury@flipboard.com; [email protected]; {dcszj, shixia}@tsinghua.edu.cn
ABSTRACT
Dynamic topic models (DTMs) are very effective in discovering topics and capturing their evolution trends in time series data. To do posterior inference of DTMs, existing methods are all batch algorithms that scan the full dataset before each update of the model and make inexact variational approximations with mean-field assumptions. Due to a lack of a more scalable inference algorithm, despite the usefulness, DTMs have not captured large topic dynamics.

This paper fills this research void, and presents a fast and parallelizable inference algorithm using Gibbs Sampling with Stochastic Gradient Langevin Dynamics that does not make any unwarranted assumptions. We also present a Metropolis-Hastings based $O(1)$ sampler for topic assignments for each word token. In a distributed environment, our algorithm requires very little communication between workers during sampling (almost embarrassingly parallel) and scales up to large-scale applications. We are able to learn the largest Dynamic Topic Model to our knowledge, and learned the dynamics of 1,000 topics from 2.6 million documents in less than half an hour, and our empirical results show that our algorithm is not only orders of magnitude faster than the baselines but also achieves lower perplexity.

General Terms
Algorithms, Experimentation, Performance
Keywords
Topic Model; Dynamic Topic Model; Large Scale Machine Learning; Parallel Computing; MCMC; MPI
1. INTRODUCTION
As we are surrounded by data, statistical topic models have become some of the most useful machine learning tools to automatically analyze large sets of categorical data, including both text documents and images under some bag-of-words representations. Topic models can capture the thematic structure that exists within a data corpus and find a low dimensional representation of the documents. Such topical representations can be used for subsequent analysis tasks, such as clustering (29), classification (33; 34), and data visualization (16). One of the most popular topic models, Latent Dirichlet Allocation (LDA) (5), has been widely applied in both industry (25) and academia. Since exact posterior inference of LDA is intractable, recent research has focused on speeding up the approximate inference methods for LDA from various directions, including stochastic/online inference (19), fast sampling algorithms (31; 7), and scalable systems (1).

∗ Arnab Bhadury is now with Flipboard Inc., 210-128 W Hastings St., Vancouver, BC, V6B 1G8, Canada.
Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media.
WWW 2016, April 11–15, 2016, Montréal, Québec, Canada.
ACM 978-1-4503-4143-1/16/04.
DOI: http://dx.doi.org/10.1145/2872427.2883046.

While LDA is extremely useful, it makes many simplistic assumptions that fail to capture some complicated structures underlying a large data corpus, such as the correlation between multiple topics and the temporal evolution of topics in data streams. Correlated Topic Model (CTM) (3) is one such extension to LDA that introduces non-conjugate Logistic-Normal parameters to capture the correlation among topics. Though flexible in model capacity, the non-conjugacy makes approximating the posterior and scaling up a lot more difficult. Variational approximation was often adopted (3) under some unwarranted mean-field assumptions. The standard variational methods cannot deal with large datasets either.
Recently, Chen et al. (8) scaled up the CTM using a novel (distributed) Gibbs Sampler with Data Augmentation, which does not make unnecessary mean-field assumptions, thereby leading to better performance in terms of both time efficiency and testing likelihood/perplexity.

Dynamic Topic Model (DTM) (4) is another extension to LDA that discovers topics and their evolution trends in time series data by chaining the time-specific topic-term distributions via a Markov process under a Logistic-Normal parameterization as in CTM. The non-conjugacy in the DTM model makes its large-scale posterior inference even more difficult and it remains a challenge in machine learning research. Existing inference algorithms of DTM have focused on mean-field variational approximations, Laplace approximations or delta methods (23), which potentially lead to inaccurate results due to improper assumptions, require model specific derivations, and can only deal with small data corpora and learn a small number of topics. There has been a lot of recent research on scaling up variational inference (14; 6) to large data corpora. With variational methods, however, inferring the variational distribution over topics for a word in topic modeling is typically of $O(K)$ complexity, where $K$ is the number of topics, while recent sampling algorithms of LDA utilize the sparsity in the model (30) or use Metropolis-Hastings sampling with dedicated data structures (e.g., Alias tables (31; 15)) to incrementally bring the sampling complexity down to an amortized $O(1)$ per token. Our recent work (7) presents an even more efficient $O(1)$ sampler by optimizing the CPU cache access.

As shown by Wang et al. (26), capturing a large number of topic trends in topic modeling is extremely important as it improves tasks such as advertisement and recommendations.
Using variational approaches, it is difficult to support a large number of topics, while there also exist many machine learning applications where algorithms need to be fast while processing small scale data (17). To address these two different limitations, we present a novel scalable solution to do posterior inference using Gibbs Sampling, which avoids any restricting assumptions and can be easily scaled up to capture thousands of topics from millions of documents in a parallel environment, while also being faster on single machines. To deal with the non-conjugacy in the model, within the Gibbs Sampling framework, we use recent developments in Stochastic MCMC and use Stochastic Gradient Langevin Dynamics (SGLD) (28) to sample the logistic normal parameters by observing only a mini-batch of the document set. Stochastic methods have been shown to converge faster than their batch counterparts (19), and are particularly applicable to large-scale problems.

We use ideas from these state-of-the-art methods to derive a fast algorithm to capture topic trends of a large number of topics. We first derive the update equations for all the parameters, then show the algorithm's efficiency on a single multithreaded machine and further present a parallel algorithm to learn large Dynamic Topic Models on multiple machines. Our algorithm is very close to being "embarrassingly parallel" and scales up extremely well with the number of time slices. We learn a large dynamic topic model from a 9 GB dataset consisting of 2.6 million documents in less than an hour. This is not only the biggest DTM but also the fastest inference algorithm to our knowledge.

In the rest of the paper, we first introduce in Section 2 some related work that inspires our research, including LDA and its fast sampling algorithms, and the DTM model. In Section 3, we introduce our proposed algorithm, and provide implementation details for multithreaded and distributed machines in Section 4.
Experiments for both single machines and distributed machines are discussed in Section 5 and we conclude our paper in Section 6.
2. RELATED WORK
In this section, we briefly review some related work on the vanilla LDA and dynamic topic models.
Latent Dirichlet Allocation (LDA) is a probabilistic generative model of documents. It represents documents as an admixture of a set of $K$ topics to be learned from data. LDA uses conjugate Dirichlet-Multinomial parameters for both document-topic and topic-term distributions. Let $\mathcal{D} = \{W_d\}_{d=1}^{D}$ be a set of $D$ documents, where each document $W_d = \{W_{d,n}\}_{n=1}^{N_d}$ is a set of $N_d$ words $W_{d,n}$. The generating process of LDA, as shown in Figure 1, is:

1. For each topic $k \in [K]$, draw $\Phi_k \sim \mathrm{Dir}(\Phi_k \mid \beta)$;
2. For each document $d \in [D]$, draw the topic mixing proportion $\theta_d \sim \mathrm{Dir}(\theta_d \mid \alpha)$;
   (a) For each word $n \in [N_d]$, draw the topic assignment $Z_{d,n}$ and the word itself:
       $$Z_{d,n} \sim \mathrm{Mult}(Z_{d,n} \mid \theta_d), \qquad W_{d,n} \sim \mathrm{Mult}(W_{d,n} \mid \Phi_{Z_{d,n}}),$$

where $\alpha$ and $\beta$ are the Dirichlet parameters, $\Phi_k$ is the term distribution of the topic $k$ of size $V$, $\theta_d$ is the $K$-dimensional topic mixing distribution of document $d$, $\mathrm{Dir}(\cdot)$ is the Dirichlet distribution, and $\mathrm{Mult}(\cdot)$ is the Multinomial distribution.

Figure 1: Generative Process of Latent Dirichlet Allocation

Footnote: At the start of each iteration, there is a small synchronization step, but for the intensive computation part, there is no communication needed.

Having conjugacy in the model allows the document-topic proportion $\Theta := \{\theta_d\}$ and the topic-term proportion $\Phi := \{\Phi_k\}$ to be integrated out analytically, yielding a collapsed posterior distribution $p(Z \mid \mathcal{D}, \alpha, \beta)$, where $Z := \{z_d\}$ denotes the set of all topic assignments. Extensive research has been done to optimize the posterior inference of the collapsed model using both variational approximation (20) and sampling.

Recently, sampling based posterior inference algorithms have received close attention because of their simplicity and the fact that they have been shown to provide sparser results that can be further optimized. Once the conjugate parameters are integrated out from the joint distribution, a simple Gibbs Sampler can draw topic indices $Z_{d,n}$ by the following conditional probability:
$$p(Z_{d,n} = k \mid \text{rest}) \propto \frac{\left(C_{dk}^{\neg(d,n)} + \alpha_k\right)\left(C_{wk}^{\neg(d,n)} + \beta_w\right)}{C_{k}^{\neg(d,n)} + \bar{\beta}}. \tag{1}$$
Here, $C_{dk}$ stands for the number of times that topic $k$ has been observed in document $d$, $C_{wk}$ denotes the number of times a word $w$ has been observed as a topic $k$ in the corpus, $C_k$ stands for the number of times topic $k$ has been assigned to a word in the corpus, and rest denotes all other variables except $Z_{d,n}$.
The superscript $\neg(d,n)$ specifies the count ignoring the topic index for the word in position $(d,n)$, and $\bar{\beta}$ is the sum of all the $\beta_w$s.

Eq. (1) is the original Collapsed Gibbs Sampler conditional proposed by Griffiths et al. (12), which requires $O(K)$ time to sample each token in the data. Yao et al. (30) rewrite Eq. (1) into a different form to take advantage of the sparsity that exists within the model and lower the time complexity of sampling each token down to $O(K_d + K_w)$, where $K_d$ denotes the number of topics that exist in a document $d$ and $K_w$ stands for the number of different topics a word $w$ has been assigned to. For models with a large $K$, we often have $K_d \ll K$ and $K_w \ll K$.

AliasLDA (15) factorizes the same conditional in Eq. (1) into two components, a sparse component and a dense component. The sparse component (document-component) is sampled in a method analogous to SparseLDA (30) in $O(K_d)$, and for sampling the dense component, an Alias table (21) is created to generate $K$ stale samples and Metropolis-Hastings tests (18; 13) are used to accept parameter updates in amortized $O(1)$ time complexity.

LightLDA (31) builds on that by removing the sparse component altogether, and uses an Alias table to generate two proposals that are factors of the true conditional. LightLDA alternately samples from the two high-probability proposal distributions, and brings the amortized sampling complexity down by another magnitude to $O(1)$. WarpLDA (7) further improves the efficiency of LightLDA by optimizing the access of CPU cache, making the $O(1)$ algorithm more efficient in practice.

This research builds on their insights to create a fast sampler of topic indices for DTM.

Dynamic Topic Model (DTM) (4) is a topic model that is used to model time series data.
Since a Dirichlet distribution is not suitable to model state changes, DTM adopts a logistic-normal parameterization as in CTM (3), chains the parameters together in a Markovian structure, and allows evolution of parameters with a Gaussian noise. The Gaussian variables are mapped to the simplex from where the Multinomial variables are drawn.

Let $\mathcal{D} = \{\mathcal{D}_t\}_{t=1}^{T}$ be a set of data, where $\mathcal{D}_t$ denotes the dataset at time slice $t$ and $T$ is the total number of time slices. For each time slice $t$, DTM first samples the evolved parameters from linear dynamic systems, namely,
$$\alpha_t \sim \mathcal{N}(\alpha_t \mid \alpha_{t-1}, \sigma I), \qquad \forall k \in [K]: \ \Phi_{k,t} \sim \mathcal{N}(\Phi_{k,t} \mid \Phi_{k,t-1}, \beta I),$$
where $\sigma$ and $\beta$ are variance parameters. Then, for a document $d$ at time $t$, the words are generated as follows:
$$\eta_{d,t} \sim \mathcal{N}(\eta_{d,t} \mid \alpha_t, \psi I),$$
$$\forall n \in [N_{d,t}]: \ Z_{d,n,t} \sim \mathrm{Mult}(Z_{d,n,t} \mid \pi(\eta_{d,t})),$$
$$\forall n \in [N_{d,t}]: \ W_{d,n,t} \sim \mathrm{Mult}(W_{d,n,t} \mid \pi(\Phi_{Z_{d,n,t},t})),$$
where $\pi(\alpha) = \gamma$ with each element being $\gamma_k = \frac{\exp(\alpha_k)}{\sum_j \exp(\alpha_j)}$ under a softmax transformation, and $\psi$ is another variance parameter.

Figure 2: Generative Process of DTM

Footnote: This assumes that a document $d$ only contains a subset of topics, and a word $w$ is only assigned to a subset of topics while sampling. Following this assumption, (1) could be multiplied for a sparser expression. The second part of this assumption does not necessarily hold in large data corpora.

After observing the data $\mathcal{D}$, the posterior distribution conditioned on the hyper-parameters is given by:
$$p(\alpha, \eta, \Phi, Z \mid \mathcal{D}, \sigma, \beta, \psi) \propto \prod_{t=1}^{T} \Bigg[ \mathcal{N}(\alpha_t \mid \alpha_{t-1}, \sigma I) \times \prod_{k=1}^{K} \mathcal{N}(\Phi_{k,t} \mid \Phi_{k,t-1}, \beta I) \prod_{d=1}^{D_t} \bigg( \mathcal{N}(\eta_{d,t} \mid \alpha_t, \psi I) \times \prod_{n=1}^{N_{d,t}} \mathrm{Mult}(Z_{d,n,t} \mid \pi(\eta_{d,t}))\, \mathrm{Mult}(W_{d,n,t} \mid \pi(\Phi_{Z_{d,n,t},t})) \bigg) \Bigg].$$
The non-conjugacy of Gaussian and multinomial variables makes exact inference of the model intractable.
Unlike LDA, the topic-term proportion parameters $\Phi_t$ and the document-topic proportion parameters $\eta_t$ cannot be analytically integrated out, which makes the task of approximating the posterior more complicated.

Blei & Lafferty (4) propose a variational Kalman filtering approach to infer the posterior distribution by making an unwarranted mean-field approximation, which is not scalable to large datasets. Other fast variational methods (14; 6) could be generalized to do posterior inference but, as discussed earlier, variational methods are not well suited to supporting a large number of topics because they cannot exploit model sparsity. In the following section, we present an even faster Gibbs Sampler that is efficient in capturing large dynamic topic models with many topics.
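The generative process of DTM described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: $\sigma$, $\beta$ and $\psi$ are treated as variances, the chains start from zero vectors (the text does not specify the initial state), and the function name `generate_dtm`, its arguments and the toy sizes are hypothetical.

```python
import math
import random


def softmax(x):
    """Map a real vector to the simplex (the pi(.) transformation)."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def generate_dtm(T=3, K=4, V=10, docs_per_slice=5, words_per_doc=20,
                 sigma=0.1, beta=0.1, psi=0.5, seed=0):
    """Sample a toy corpus from the DTM generative process.

    Assumptions of this sketch: sigma, beta, psi are variances, and the
    chains start from zero vectors.
    """
    rng = random.Random(seed)
    alpha = [0.0] * K                      # assumed initial state of alpha
    phi = [[0.0] * V for _ in range(K)]    # assumed initial state of Phi_k
    corpus = []
    for _ in range(T):
        # Evolve the parameters with Gaussian noise.
        alpha = [rng.gauss(a, math.sqrt(sigma)) for a in alpha]
        phi = [[rng.gauss(p, math.sqrt(beta)) for p in row] for row in phi]
        topic_word = [softmax(row) for row in phi]
        docs = []
        for _ in range(docs_per_slice):
            # Document-level logistic-normal topic proportions.
            eta = [rng.gauss(a, math.sqrt(psi)) for a in alpha]
            theta = softmax(eta)
            # Topic assignment, then the word itself, for every token.
            z = rng.choices(range(K), weights=theta, k=words_per_doc)
            w = [rng.choices(range(V), weights=topic_word[k])[0] for k in z]
            docs.append(list(zip(z, w)))
        corpus.append(docs)
    return corpus
```

Each element of the returned list is one time slice, holding documents as lists of `(topic, word)` pairs; inference has to recover the topics from the words alone.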
3. GIBBS SAMPLER FOR DTM
We now present our blockwise Gibbs Sampler for the posterior inference of DTM. Unlike LDA or CTM, we cannot integrate any of the parameters out because of the non-conjugacy from both directions; therefore we sample each parameter separately. The parameters that we need to infer are $\alpha_t$, $\eta_{d,t}$, $\Phi_{k,t}$ and $Z_{d,n,t}$. The following subsections show the derivation and the sampler update equations of each parameter conditioned on all the other parameters.

$\alpha_t$

It is easy to see from Figure 2 that $\alpha_t$ has a Markovian structure and, conditioned on its Markov Blanket (MB), we get:
$$p(\alpha_t \mid \mathrm{MB}(\alpha_t)) \propto \mathcal{N}(\alpha_t \mid \alpha_{t-1}, \sigma I)\, \mathcal{N}(\alpha_{t+1} \mid \alpha_t, \sigma I) \prod_{d=1}^{D_t} \mathcal{N}(\eta_{d,t} \mid \alpha_t, \psi I). \tag{2}$$
With three Gaussian terms, we simply use the "completing the square" trick and get:
$$\alpha_t \sim \mathcal{N}\left(\alpha_t \mid \hat{\mu}, \hat{\Lambda}^{-1}\right), \tag{3}$$
where the mean $\hat{\mu}$ and the precision matrix $\hat{\Lambda}$ (the inverse of the covariance) are evaluated as:
$$\hat{\mu} = \bar{\alpha}_t + \bar{\eta}_{d,t} - \hat{\Lambda}^{-1}\left(\frac{2}{\sigma}\bar{\eta}_{d,t} + \frac{D_t}{\psi}\bar{\alpha}_t\right), \tag{4}$$
$$\bar{\eta}_{d,t} = \frac{1}{D_t}\sum_{d=1}^{D_t}\eta_{d,t}, \qquad \bar{\alpha}_t = \frac{\alpha_{t+1} + \alpha_{t-1}}{2}, \qquad \hat{\Lambda} = \left(\frac{2}{\sigma} + \frac{D_t}{\psi}\right) I. \tag{5}$$
The key thing to note here is that evaluating the mean $\hat{\mu}$ is an $O(K)$ operation as long as we keep a record of $\bar{\eta}_{d,t}$, because $\hat{\Lambda}$ is a diagonal matrix. Also note that $\hat{\Lambda}$ is constant for all documents in the time slice $t$.

$\eta_{d,t}$

Conditioned on $\alpha_t$ and $Z_{d,t}$, the posterior conditional for $\eta_{d,t}$ is:
$$p(\eta_{d,t} \mid \alpha_t, Z_{d,t}) \propto \mathcal{N}(\eta_{d,t} \mid \alpha_t, \psi I) \times \prod_{n=1}^{N_{d,t}} \mathrm{Mult}(Z_{d,n,t} \mid \pi(\eta_{d,t})). \tag{6}$$
This is a multivariate Bayesian Logistic Regression model with a Gaussian prior and Multinomial observations $Z_{d,n,t}$. To infer the posterior, we use Stochastic Gradient Langevin Dynamics (SGLD) (28), a method developed by Welling & Teh.

SGLD is an iterative learning algorithm that uses mini-batches to perform updates, which makes it suitable for large datasets. It builds on the well-known optimization algorithm, Stochastic Gradient Descent, and adds some Gaussian noise to each update so that it can generate samples from the true posterior and not just collapse to the MAP/MLE solution.
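The Gaussian conditional for $\alpha_t$ can be sketched as follows. The sketch reads Eqs. (4) and (5) as $\hat{\mu} = \bar{\alpha}_t + \bar{\eta}_{d,t} - \hat{\Lambda}^{-1}\left(\frac{2}{\sigma}\bar{\eta}_{d,t} + \frac{D_t}{\psi}\bar{\alpha}_t\right)$ with $\hat{\Lambda} = \left(\frac{2}{\sigma} + \frac{D_t}{\psi}\right) I$, treats $\sigma$ and $\psi$ as variances, and only covers interior time slices that have both neighbours. The function name `sample_alpha_t` and its signature are hypothetical.

```python
import math
import random


def sample_alpha_t(alpha_prev, alpha_next, etas, sigma, psi, rng):
    """Draw alpha_t from N(mu_hat, Lambda_hat^{-1}) for an interior time slice.

    alpha_prev, alpha_next: neighbouring K-vectors alpha_{t-1}, alpha_{t+1}.
    etas: list of D_t K-vectors eta_{d,t}.
    sigma, psi: variances (an assumption of this sketch).
    Returns (mu_hat, diagonal precision entry, one draw).
    """
    D_t = len(etas)
    K = len(alpha_prev)
    a = 2.0 / sigma      # precision contributed by the two chain neighbours
    b = D_t / psi        # precision contributed by the D_t documents
    lam = a + b          # every diagonal entry of Lambda_hat
    mu, draw = [], []
    for k in range(K):
        alpha_bar = 0.5 * (alpha_prev[k] + alpha_next[k])
        eta_bar = sum(e[k] for e in etas) / D_t
        # Same value as the precision-weighted mean (a*alpha_bar + b*eta_bar)/lam.
        m = alpha_bar + eta_bar - (a * eta_bar + b * alpha_bar) / lam
        mu.append(m)
        draw.append(rng.gauss(m, math.sqrt(1.0 / lam)))
    return mu, lam, draw
```

Because $\hat{\Lambda}$ is diagonal, each coordinate is drawn independently, which is why the whole update costs $O(K)$ once the per-slice mean of $\eta_{d,t}$ is cached (the sketch recomputes it for clarity).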
Let $p(\theta \mid \{x_n\}) \propto p(\theta) \prod_n p(x_n \mid \theta)$ be an arbitrary generative model with prior $p(\theta)$ and likelihood $p(x \mid \theta)$. The SGLD parameter update for $\theta$ at the $i$th iteration is:
$$\Delta\theta_i = \frac{\epsilon_i}{2}\left(\nabla \log p(\theta) + \frac{N}{M}\sum_{n=1}^{M} \nabla \log p(x_{in} \mid \theta)\right) + \xi_i, \tag{7}$$
$$\xi_i \sim \mathcal{N}(\xi_i \mid 0, \epsilon_i), \tag{8}$$
where $N$ is the number of data points in the dataset, and $M$ is the size of a mini-batch created from those $N$ data points. Welling & Teh show that a Metropolis-Hastings test is unnecessary if we update $\epsilon_i$ in a way such that as $i$ increases, $\epsilon_i \to 0$. We use $\epsilon_i = a \times (b + i)^{-c}$ as our heuristic, and it satisfies the aforementioned condition.

Taking the gradient of the natural logarithm of the posterior conditional of $\eta_{d,t}$ in (6), we get:
$$\nabla_{\eta^k_{d,t}} \log p(\eta_{d,t} \mid \alpha_t, Z_{d,t}) = -\frac{1}{\psi}\left(\eta^k_{d,t} - \alpha^k_t\right) + \sum_{n=1}^{N_{d,t}} \left(\delta(Z_{d,n,t} = k) - \pi(\eta_{d,t})_k\right), \tag{9}$$
where $\delta(Z_{d,n,t} = k)$ is the Kronecker delta function. On closer inspection, the first term in the gradient of the likelihood is simply the number of times the topic $k$ has been observed in document $d$ in time slice $t$. Hence, the summation in (9) can be replaced, and (9) can be rewritten as:
$$\nabla_{\eta^k_{d,t}} \log p(\eta_{d,t} \mid \alpha_t, Z_{d,t}) = -\frac{1}{\psi}\left(\eta^k_{d,t} - \alpha^k_t\right) + C^k_{d,t} - \left(N_{d,t} \times \pi(\eta_{d,t})_k\right), \tag{10}$$
where $N_{d,t}$ is the number of words in document $d$ of time slice $t$ and $C^k_{d,t}$ stands for the number of times the topic $k$ has been observed in document $d$ in time slice $t$. This equation can be directly put into the SGLD update equation, and the count matrix can be updated while storing topic indices. Updating the topic proportion of each document takes $O(K)$ time, if we store the softmax normalization constant during the first evaluation.

$\Phi_{k,t}$

The $\Phi_k$s are linked together as a Markov chain similar to the $\alpha_t$s, and the posterior conditioned on its Markov Blanket is:
$$p(\Phi_{k,t} \mid \mathrm{MB}(\Phi_{k,t})) \propto \mathcal{N}(\Phi_{k,t} \mid \Phi_{k,t-1}, \beta I) \times \mathcal{N}(\Phi_{k,t+1} \mid \Phi_{k,t}, \beta I) \times \prod_{d=1}^{D_t} \prod_{n=1}^{N_{d,t}} \mathrm{Mult}(W_{d,n,t} \mid \pi(\Phi_{k,t})). \tag{11}$$
Since the first two terms are Gaussians with the same variance, they can be multiplied together to get a new Gaussian, and we can treat it as the prior. Using the "completing the square" trick again, Eq. (11) can be rewritten as:
$$p(\Phi_{k,t} \mid \mathrm{MB}(\Phi_{k,t})) \propto \mathcal{N}\left(\Phi_{k,t} \mid \bar{\Phi}_{k,t}, \frac{\beta}{2} I\right) \times \prod_{d=1}^{D_t} \prod_{n=1}^{N_{d,t}} \mathrm{Mult}(W_{d,n,t} \mid \pi(\Phi_{k,t})), \tag{12}$$
where
$$\bar{\Phi}_{k,t} = \frac{\Phi_{k,t+1} + \Phi_{k,t-1}}{2}. \tag{13}$$
As with $\eta_{d,t}$, this is a Bayesian Logistic Regression model, now with $W_{d,n,t}$ as multinomial observations. If we take the gradient of the natural logarithm of this Bayesian Logistic Regression model, we get:
$$\nabla_{\Phi^w_{k,t}} \log p(\Phi_{k,t} \mid \mathrm{MB}(\Phi_{k,t})) = \frac{\Phi^w_{k,t+1} + \Phi^w_{k,t-1} - 2\Phi^w_{k,t}}{\beta} + C^w_{k,t} - \left(C_{k,t} \times \pi(\Phi_{k,t})_w\right), \tag{14}$$
where $C^w_{k,t} = \sum_{d=1}^{D_t} \sum_{n=1}^{N_{d,t}} \delta(W_{d,n,t} = w, Z_{d,n,t} = k)$ and $C_{k,t} = \sum_{d=1}^{D_t} \sum_{n=1}^{N_{d,t}} \delta(Z_{d,n,t} = k)$.

Using cached matrices, only the first and the third terms need to be evaluated, and $\Phi_t$ can be sampled in $O(VK)$ time using SGLD. Since we use a small batch of documents to sample $\Phi_t$, we clear $C^w_{k,t}$ and $C_{k,t}$ at the start of each iteration.

$Z_{d,n,t}$

From the generative process of DTM, as shown earlier in Figure 2, it can be found that given $\eta$ and $\Phi$, $Z$ is conditionally independent of the rest of the parameters in the model. Therefore, sampling the topic indices for DTM conditioned on the topic-term proportions and the topic-document proportions can be collapsed to:
$$p(Z_{d,n,t} = k \mid \eta^k_{d,t}, \Phi^w_{k,t}) \propto \underbrace{\exp(\eta^k_{d,t})}_{\text{doc}}\ \underbrace{\exp(\Phi^w_{k,t})}_{\text{word}}. \tag{15}$$
However, since sampling a token requires evaluating the normalization constant, the naive method requires $O(K)$ time to sample each token. For fast inference, it is vital to have a fast sampler for $Z$, since it is the most expensive parameter to sample in DTM.
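The SGLD update of Eq. (7), combined with the gradient for $\eta_{d,t}$ in Eq. (10), can be sketched as below. This is a hedged illustration rather than the authors' implementation: $\psi$ is treated as a variance, the gradient uses all tokens of a single document (so no $N/M$ rescaling appears), and the function names and the step-size constants used in the usage note are hypothetical.

```python
import math
import random


def softmax(x):
    """Map a real vector to the simplex (the pi(.) transformation)."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def step_size(i, a, b, c):
    """Polynomially decaying step size eps_i = a * (b + i)^(-c)."""
    return a * (b + i) ** (-c)


def grad_log_post_eta(eta, alpha, counts, psi):
    """Gradient of log p(eta | alpha, Z) as in Eq. (10).

    counts[k] is the number of tokens of the document assigned to topic k;
    psi is treated as a variance (an assumption of this sketch).
    """
    n_d = sum(counts)
    pi = softmax(eta)
    return [-(eta[k] - alpha[k]) / psi + counts[k] - n_d * pi[k]
            for k in range(len(eta))]


def sgld_step(theta, grad, eps, rng):
    """One SGLD move: half a gradient step plus N(0, eps) noise per coordinate."""
    return [t + 0.5 * eps * g + rng.gauss(0.0, math.sqrt(eps))
            for t, g in zip(theta, grad)]
```

A typical loop would call `sgld_step(eta, grad_log_post_eta(eta, alpha, counts, psi), step_size(i, a, b, c), rng)` once per iteration `i`; the same `sgld_step` could be reused for $\Phi_{k,t}$ with the gradient of Eq. (14).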
We take ideas from recent advances in the inference of Latent Dirichlet Allocation to optimize the sampling, where authors use Alias Tables (as mentioned in Section 2) to generate $K$ samples in $O(K)$ time that are accepted using a Metropolis-Hastings (MH) Test, thus bringing the amortized sampling complexity to $O(1)$ (31; 15).

Alias sampling (21) takes advantage of the fact that, while generating one sample from a $K$ dimensional vector is of $O(K)$ complexity, subsequent samples only require $O(1)$ time. It transforms the multinomial sampling problem into an easy uniform sampling one, which requires a simple lookup on the "Alias tables", as shown in Figure 3.

But since Eq. (15) is not a static distribution, a Metropolis-Hastings (MH) Test is done to check whether the new sample should be accepted or rejected. While an $O(1)$ complexity is great in theory, in practice we also desire a high acceptance rate of the stale samples in the MH Test.

The key to high acceptance rates in MH tests is to generate good proposals, and similar to Yuan et al. (31), we break Eq. (15) into two factors, $\exp(\eta^k_{d,t})$ and $\exp(\Phi^w_{k,t})$, and alternately use one of them to generate proposals. We call proposals generated from the topic-term parameter the word-proposal and proposals generated from the topic-document parameter the doc-proposal.

To generate both proposals, we build Alias tables, generate $K$ samples, and use them until we run out of generated samples. When we do not have any more samples, we build a new Alias table, and generate new samples again. The doc-proposal $p_d(k)$ and its acceptance probability $A_d$ for a proposed topic $k$, with $s$ denoting the current topic of the token, are given by:
$$p_d(k) \propto \exp(\eta^k_{d,t}), \tag{16}$$
$$A_d = \min\left\{1, \frac{\exp(\eta^k_{d,t} + \Phi^w_{k,t} + \eta^s_{d,t})}{\exp(\eta^s_{d,t} + \Phi^w_{s,t} + \eta^k_{d,t})}\right\} = \min\left\{1, \frac{\exp(\Phi^w_{k,t})}{\exp(\Phi^w_{s,t})}\right\}. \tag{17}$$
Analogous to the doc-proposal, the word-proposal $p_w(k)$ and its acceptance probability $A_w$ for topic $k$ are given by:
$$p_w(k) \propto \exp(\Phi^w_{k,t}), \tag{18}$$
$$A_w = \min\left\{1, \frac{\exp(\eta^k_{d,t} + \Phi^w_{k,t} + \Phi^w_{s,t})}{\exp(\eta^s_{d,t} + \Phi^w_{s,t} + \Phi^w_{k,t})}\right\} = \min\left\{1, \frac{\exp(\eta^k_{d,t})}{\exp(\eta^s_{d,t})}\right\}. \tag{19}$$
It is easy to see that $A_w$ is high when the proposed topic $k$ is commonly seen in document $d$, and $A_d$ is high when the word $w$ is more often observed as a topic $k$.

The sampling complexities of each parameter have been summarized in Table 1, and in a multicore environment, all of the parameters could be sampled completely independently after updating the count matrices at the start of each iteration.

Parameter    Sampling Complexity
$\alpha$     $O(K)$
$\eta$       $O(D_m K)$
$\Phi$       $O(VK)$
$Z$          $O(D_m N_d)$

Table 1: Sampling complexities of each parameter in DTM in each iteration. $D_m$: size of the mini-batch; $N_d$: number of words in a sample document $d$.
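The alias-table lookup and the alternating doc-proposal and word-proposal tests of Eqs. (16) to (19) can be sketched as follows. The sketch makes simplifying assumptions: it uses Vose's variant of the alias method, $\eta$ and $\Phi$ are held fixed so the tables never go stale, and the names `build_alias`, `alias_draw` and `mh_step` are hypothetical.

```python
import math
import random


def build_alias(weights):
    """Build an alias table in O(K) time (Vose's method) from unnormalized weights."""
    K = len(weights)
    total = sum(weights)
    scaled = [w * K / total for w in weights]
    prob = [0.0] * K
    alias = [0] * K
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]          # column s keeps its own mass ...
        alias[s] = l                 # ... and is topped up by element l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:          # leftovers fill their whole column
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def alias_draw(prob, alias, rng):
    """O(1) draw: pick a column uniformly, then toss a weighted coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def accept(log_ratio, rng):
    """Metropolis-Hastings test with acceptance probability min(1, exp(log_ratio))."""
    return log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)


def mh_step(s, eta, phi_w, doc_table, word_table, rng):
    """One doc-proposal followed by one word-proposal for a single token.

    s: current topic; eta[k] = eta^k_{d,t}; phi_w[k] = Phi^w_{k,t} for the token's word.
    """
    k = alias_draw(doc_table[0], doc_table[1], rng)     # proposal of Eq. (16)
    if accept(phi_w[k] - phi_w[s], rng):                # acceptance of Eq. (17)
        s = k
    k = alias_draw(word_table[0], word_table[1], rng)   # proposal of Eq. (18)
    if accept(eta[k] - eta[s], rng):                    # acceptance of Eq. (19)
        s = k
    return s
```

Here `doc_table = build_alias([math.exp(v) for v in eta])` and `word_table = build_alias([math.exp(v) for v in phi_w])` would be built once and reused across many tokens, which is what amortizes the $O(K)$ construction cost.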
4. MULTITHREADED AND DISTRIBUTED IMPLEMENTATION
In this section we present a distributed implementation of our algorithm, which utilizes two levels of parallelism: multi-machines and multi-threads.

DTM itself has an "embarrassingly parallel" structure, which can be exploited by assigning each time slice to a separate worker. The data of different time slices can be stored in different machines without ever needing to move them around. We implement DTM in parallel using Message-Passing Interface (MPI) and pass the required data to neighbouring workers at the start of each iteration. The Markovian structure in the model requires $\alpha_t$ and $\Phi_t$ to be passed to adjacent nodes, and after that no further communication is required among these nodes.

Within each worker process, there is an additional level of multi-thread parallelism, implemented using C++11 std::thread in our system. We create three threads to sample $\eta$, $Z$, $\Phi$ respectively for the time slice assigned to a worker. We relax the original blockwise Gibbs sampling by adopting a Jacobi style iteration. Instead of using the most recent value, we sample $\eta$ conditioned on $Z$ and $\Phi$ from the previous iteration, and likewise for $Z$ and $\Phi$. We can therefore sample $\eta$, $Z$ and $\Phi$ independently. Better implementations with no Jacobi relaxations and better parallelism can be designed, which sample $\eta$, $Z$ and $\Phi$ sequentially and partition computational tasks by $d$ for $Z$, $\eta$ and by $k$ for $\Phi$. We adopt the aforementioned 3-thread approach purely for ease of implementation.

Figure 3: An example of how an alias table (21) is constructed for a four dimensional non-uniform vector. The figure shows how a non-uniform vector of different probabilities can be compactly stored in a table with the constraint that each column indexes a maximum of two elements in the vector. Using an alias table, generating a sample only requires two uniform samples: first to pick the column, and then a weighted coin-toss to sample an element. Constructing Alias Tables takes $O(K)$ time, where $K$ is the number of dimensions in the vector.

While using distributed computing, we often come across huge datasets (Bing News, in Section 5), where after trimming of vocabulary, the number of documents is much larger than the size of the vocabulary. In this case, one option is to use a mini-batch size ($D_m$) such that $D_m \times N_{avg} \approx V \times K$ for load balancing between threads, where $N_{avg}$ denotes the average length of documents. Another option is to divide the sampling of $Z$ and $\eta$ into more cores than for $\Phi$.
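The Jacobi style relaxation described in this section can be illustrated with a small Python sketch. It is not the C++11 std::thread implementation of the system: the helper `jacobi_iteration`, the dictionary-based state and the toy arithmetic updaters are hypothetical stand-ins for the real samplers of $\eta$, $Z$ and $\Phi$, chosen only to show that every updater reads values from the previous iteration.

```python
from concurrent.futures import ThreadPoolExecutor


def jacobi_iteration(state, updaters):
    """Run all updaters concurrently on the same snapshot of the previous iteration.

    state: dict mapping a block name to its current value.
    updaters: dict mapping a block name to a function snapshot -> new value.
    Updaters must only read the snapshot, never mutate it.
    """
    snapshot = dict(state)  # shallow copy of the previous iteration's values
    with ThreadPoolExecutor(max_workers=len(updaters)) as pool:
        futures = {name: pool.submit(fn, snapshot) for name, fn in updaters.items()}
        results = {name: f.result() for name, f in futures.items()}
    return {**state, **results}


# Toy stand-ins for the three samplers (hypothetical arithmetic, not real updates).
toy_updaters = {
    "eta": lambda s: s["z"] + s["phi"],
    "z": lambda s: s["eta"] * 2,
    "phi": lambda s: s["z"] - 1,
}
toy_state = {"eta": 1, "z": 10, "phi": 100}
new_state = jacobi_iteration(toy_state, toy_updaters)
```

With these toy updaters the result is `{"eta": 110, "z": 2, "phi": 9}`, because each block sees only old values; a sequential sweep that reused fresh values would instead give `z = 220` and `phi = 219`. That difference is the relaxation the three-thread design accepts in exchange for independent sampling.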
5. EXPERIMENTS
We now evaluate our algorithm on two experimental platforms, the first being a single machine in a multithreaded environment, and the second being a 6-node cluster consisting of 72 cores in total. Our results demonstrate that our algorithm (GS-SGLD) is substantially faster than the existing baseline (VKF) in both single machine and parallel environments. We also capture the largest Dynamic Topic Model to our knowledge.
There are two datasets used for the experiments. The first consists of NIPS full papers from the years 1987 to 1999, where we pick the most frequent 8,000 words as vocabulary with stop words removed. We divide the dataset into thirteen time slices by year. We also use a large dataset that consists of all the news containing the word "Obama" from Bing News during the years from 2012 to 2015. This second dataset consists of more than 2.6 million documents, and is divided into 29 time slices, according to months. The vocabulary is trimmed down to the 15,000 most frequent words after removing stop words.

Footnote (NIPS dataset): Available here:

We partition each time slice of both datasets to contain a training set and a testing set with observed and held-out parts to evaluate perplexity. We use the partially observed document method (22).

For the NIPS dataset, we run our GS-SGLD sampler on a single machine, and Figure 4 shows the evolution of one of the 50 topics captured. The figure shows different methods and algorithms used in Computer Vision over the years, and the increase in the problem complexity as algorithms got better (Face Recognition, Object Recognition).

In the Bing News dataset, we capture many political trends, and Figure 6 shows one such evolution of the topic regarding the possibility of Syria possessing mass-destruction weapons. This trend shows an increasing involvement of Russia and the United States on the issue, and words such as "agreement" are captured in September 2014, when agreements were announced to eliminate chemical weapon stockpiles. The trend captured is in complete accordance with the Wikipedia article on the issue.

Footnote: Wikipedia link on the issue: http://en.wikipedia.org/wiki/Syria_and_weapons_of_mass_destruction

Figure 4: Evolution of Image Processing in NIPS from 1987-1999.

In this section, we compare our inference algorithm to the existing variational Kalman filtering (VKF) method (4) for estimating DTM's model parameters.
We run our GS-SGLD algorithm and the VKF algorithm on the NIPS dataset containing a total of 1,740 documents divided in 13 time slices. We set our SGLD learning rate to ǫ i = 0 . × (100 + i ) − . and use a mini-batch size of 60 documents. Figure 5 shows our results compared to the baseline when we capture 50 topics in the dataset, and in Figure 7, we further show that the perplexity of each time slice decreases at approximately the same rate, which is also a desirable property for our algorithm. For clarity, we only plot the first five time slices and the average perplexity of 13 time slices.

Figure 5: Perplexity comparison of GS-SGLD with VKF over time in seconds.

We find that our algorithm (GS-SGLD) is significantly faster than the variational Kalman filtering (VKF) approach because it uses a stochastic algorithm and the amortized $O(1)$ sampler for the topic of each token. In addition, GS-SGLD achieves a slightly lower perplexity than VKF even when both algorithms are run until convergence, because it does not make any unwarranted mean-field assumptions.

In this section, we show the scalability of our algorithm by inferring 1,000 topics from the Bing News dataset described earlier. The dataset consists of news of 29 time slices, and we use 58 cores for our experiment, to infer a large Dynamic Topic Model. We use a mini-batch size of 8,000 and set our SGLD learning rate ǫ i to 0 . × (1000 + i ) − . . We also run our code on the NIPS dataset with the same parameters as the ones described in the Single Machine Experiments section. The results are summarized in Table 2.

Dataset      Docs    Time Slices    Topics    Run. Time
NIPS         1740    13             50        118.72
Bing news    2.6M    29             1000      1694.83

Table 2: Running time comparison (in seconds) of two datasets in a parallel environment.

We further show that our algorithm scales up well with an increasing number of time slices due to its embarrassingly parallel nature. Assuming we have more cores to work with than time slices, a little communication overhead is added to the running time, but the sampling complexity remains constant. The experiment in Table 3 uses the same dataset but we trim each time slice's documents to 50K so that we can efficiently run the algorithm on a single machine as well. During this experiment, we choose our mini-batch size to be 2000 and we iterate until convergence (around 60 iterations). The scalability is graphed in Figure 8.

Figure 6: Evolution of Syrian possible threat of possessing mass-destruction weapons.

Figure 7: Perplexity comparison of different time slices.

Figure 8: Running Time comparison of single machine and parallel GS-SGLD inference.
6. DISCUSSIONS AND FUTURE WORK
It is also worth mentioning that there are other dynamic topic models that can also capture topic trends over time, such as Dynamic Mixture Model (DMM) (27) and Topics over Time (ToT) (24), but the Dynamic Topic Model proposed by Blei et al. has the distinct advantage of being able to naturally capture the correlation among topics over time by allowing the covariance of $\eta$ to be a non-diagonal matrix. Relaxing this condition only requires a small update in the $\eta$ and $\alpha$ samplers, and hence DTM is especially appealing to us. Furthermore, DMM and ToT can be parallelized using the existing parallel frameworks of LDA (25; 1; 32), while our algorithm can be generalized to scale up non-conjugate Logistic Normal topic models.

We next plan to test our inference algorithm for DTM at an even bigger scale with thousands of machines and terabytes of data. We intend to further extend this research and learn dynamic correlation graphs of topics over multiple time slices (Dynamic Correlated Topic Models), and improve the visualization of the learned evolution of topics from DTM, which has been hard to show graphically.

At the moment, we use Stochastic Gradient Langevin Dynamics on each node separately, but recent developments in Distributed Stochastic MCMC (2) can also parallelize the sampling of $\Phi_t$ and $\eta$, making the algorithm even more efficient.

Using SGLD requires tuning the step-size parameters, and the parameters can differ from dataset to dataset. Hence, we intend to try a more adaptive approach such as AdaGrad (11), or use more recent and sophisticated Stochastic Gradient Monte Carlo methods such as SGHMC (9) and SGNHT (10) for inference.

Num. of Time Slices    S-M GS-SGLD (single machine)    P GS-SGLD (parallel)
5                      712.81 (2)                      208.74 (10)
8                      1106.25 (2)                     227.38 (16)
11                     1577.83 (2)                     296.51 (22)
15                     2200.01 (2)                     302.97 (30)
20                     N/A* (2)                        357.11 (40)
29                     N/A* (2)                        398.32 (58)

Table 3: Running time comparison of GS-SGLD inference in single machine and parallel settings in seconds. *not finished in one hour. (number) specifies the number of cores.
7. CONCLUSIONS
We propose a scalable and efficient inference method for Dynamic Topic Models. Our algorithm is a novel combination of Stochastic Gradient Langevin Dynamics and a Metropolis-Hastings sampler using alias tables in a blockwise Gibbs Sampling framework. This combination makes our algorithm naturally parallelizable and allows it to scale up extremely well with the number of time slices and topics, as we have shown in our experiments.

Our algorithm is significantly faster than the baselines in both single-machine and distributed environments. Because it makes fewer restrictive assumptions, our algorithm also performs slightly better in terms of perplexity than the existing variational methods. DTMs can capture very exciting topic trends over time, but the existing inference methods use variational approximations and have not been able to learn large DTMs from big datasets.

Our algorithm has made it possible to do posterior inference of DTM at an industrial scale, and we support this claim by learning the largest DTM to our knowledge. Our work is applicable to both researchers and industry, as a large-scale Dynamic Topic Model can capture very interesting trends.

Acknowledgments
The work was supported by the National Basic Research Program (973 Program) of China (Nos. 2013CB329403, 2012CB316301), National NSF of China (Nos. 61322308, 61332007), Tsinghua National Laboratory for Information Science and Technology Big Data Initiative, Tsinghua Initiative Scientific Research Program (No. 20141080934), and the Collaboration Awards from Microsoft and Intel.
References

[1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In International Conference on Web Search and Data Mining (WSDM). ACM, 2012.
[2] S. Ahn, B. Shahbaba, and M. Welling. Distributed stochastic gradient MCMC. In International Conference on Machine Learning (ICML), 2014.
[3] D. Blei and J. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems (NIPS), 2006.
[4] D. M. Blei and J. D. Lafferty. Dynamic topic models. In International Conference on Machine Learning (ICML), 2006.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[6] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.
[7] J. Chen, K. Li, J. Zhu, and W. Chen. WarpLDA: a simple and efficient O(1) algorithm for latent Dirichlet allocation. arXiv preprint arXiv:1510.08628, 2015.
[8] J. Chen, J. Zhu, Z. Wang, X. Zheng, and B. Zhang. Scalable inference for logistic-normal topic models. In Advances in Neural Information Processing Systems (NIPS), 2013.
[9] T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. arXiv:1402.4102, 2014.
[10] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems (NIPS), 2014.
[11] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
[12] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
[13] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[14] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[15] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2014.
[16] S. Liu, X. Wang, J. Chen, J. Zhu, and B. Guo. TopicPanorama: A full picture of relevant topics. In Visual Analytics Science and Technology (VAST), 2014.
[17] E. Meeds, R. Hendriks, S. a. Faraby, M. Bruntink, and M. Welling. MLitB: Machine learning in the browser. arXiv preprint arXiv:1412.2432, 2014.
[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[19] S. Patterson and Y. W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems (NIPS), 2013.
[20] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2006.
[21] A. J. Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software (TOMS), 3(3):253–256, 1977.
[22] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In International Conference on Machine Learning (ICML). ACM, 2009.
[23] C. Wang and D. M. Blei. Variational inference in nonconjugate models. The Journal of Machine Learning Research, 14(1):1005–1031, 2013.
[24] X. Wang and A. McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 424–433. ACM, 2006.
[25] Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Y. Chang. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In Algorithmic Aspects in Information and Management, pages 301–314. Springer, 2009.
[26] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang, Y. Gao, J. Zeng, Q. Yang, et al. Towards topic modeling for big data. ACM Transactions on Intelligent Systems and Technology, 2014.
[27] X. Wei, J. Sun, and X. Wang. Dynamic mixture models for multiple time-series. In IJCAI, volume 7, pages 2909–2914, 2007.
[28] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning (ICML), 2011.
[29] P. Xie and E. P. Xing. Integrating document clustering and topic modeling. arXiv preprint arXiv:1309.6874, 2013.
[30] L. Yao, D. Mimno, and A. McCallum. Efficient methods for topic model inference on streaming document collections. In International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2009.
[31] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma. LightLDA: Big topic models on modest compute clusters. arXiv:1412.1576, 2014.
[32] H. Zhao, B. Jiang, and J. Canny. Same but different: Fast and high-quality Gibbs parameter estimation. arXiv preprint arXiv:1409.5402, 2014.
[33] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research, 13:2237–2278, 2012.
[34] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs max-margin topic models with data augmentation.