Efficient Learning for Undirected Topic Models
Jiatao Gu and Victor O.K. Li
Department of Electrical and Electronic Engineering
The University of Hong Kong
{jiataogu, vli}@eee.hku.hk

Abstract
Replicated Softmax model, a well-known undirected topic model, is powerful in extracting semantic representations of documents. Traditional learning strategies such as Contrastive Divergence are very inefficient. This paper provides a novel estimator to speed up the learning based on Noise Contrastive Estimate, extended for documents of variant lengths and weighted inputs. Experiments on two benchmarks show that the new estimator achieves great learning efficiency and high accuracy on document retrieval and classification.
Topic models are powerful probabilistic graphical approaches to analyze document semantics in different applications such as document categorization and information retrieval. They are mainly constructed with directed structures, like pLSA (Hofmann, 2000) and LDA (Blei et al., 2003). Accompanied by the vast developments in deep learning, several undirected topic models, such as (Salakhutdinov and Hinton, 2009; Srivastava et al., 2013), have recently been reported to achieve great improvements in efficiency and accuracy.

The Replicated Softmax model (RSM) (Hinton and Salakhutdinov, 2009), a typical undirected topic model, is composed of a family of Restricted Boltzmann Machines (RBMs). Commonly, RSM is learned like standard RBMs using approximate methods like Contrastive Divergence (CD). However, CD is not really designed for RSM. Unlike RBMs with binary input, RSM adopts softmax units to represent words, which makes the sampling inside CD very inefficient, especially for a large vocabulary. Yet NLP systems usually require vocabulary sizes of tens to hundreds of thousands, which seriously limits the application of RSM.

Dealing with the large vocabulary size of the inputs is a serious problem in deep-learning-based NLP systems. Bengio et al. (2003) pointed this problem out when normalizing the softmax probability in the neural language model (NNLM), and Morin and Bengio (2005) solved it based on a hierarchical binary tree. A similar architecture was used in word representations like (Mnih and Hinton, 2009; Mikolov et al., 2013a). Directed tree structures cannot be applied to undirected models like RSM, but stochastic approaches can work well. For instance, Dahl et al. (2012) found that several Metropolis-Hastings (MH) sampling approaches approximate the softmax distribution in CD well, although MH requires additional complexity in computation.
Hyvärinen (2007) proposed Ratio Matching (RM) to train unnormalized models, and Dauphin and Bengio (2013) added stochastic approaches in RM to accommodate high-dimensional inputs. Recently, a new estimator, Noise Contrastive Estimate (NCE) (Gutmann and Hyvärinen, 2010), was proposed for unnormalized models, and has shown great efficiency in learning word representations such as in (Mnih and Teh, 2012; Mikolov et al., 2013b).

In this paper, we propose an efficient learning strategy for RSM named α-NCE, applying NCE as the basic estimator. Different from most related efforts that use NCE for predicting a single word, our method extends NCE to generate noise for documents of variant lengths. It also enables RSM to use weighted inputs to improve the modelling ability. As RSM is usually used as the first layer in many deeper undirected models like Deep Boltzmann Machines (Srivastava et al., 2013), α-NCE can be readily extended to learn them efficiently.

RSM is a typical undirected topic model, which is based on bag-of-words (BoW) to represent documents. In general, it consists of a series of RBMs, each of which contains a different number of softmax visible units but the same binary hidden units.

Suppose $K$ is the vocabulary size. For a document with $D$ words, if the $i$-th word in the document equals the $k$-th word of the dictionary, a vector $v_i \in \{0,1\}^K$ is assigned, with only the $k$-th element $v_{ik} = 1$.
An RBM is formed by assigning a hidden state $h \in \{0,1\}^H$ to this document $V = \{v_1, ..., v_D\}$, where the energy function is:

$$E_\theta(V, h) = -h^T W \hat{v} - b^T \hat{v} - D \cdot a^T h \tag{1}$$

where $\theta = \{W, b, a\}$ are parameters shared by all the RBMs, and $\hat{v} = \sum_{i=1}^{D} v_i$ is commonly referred to as the word count vector of a document. The probability for the document $V$ is given by:

$$P_\theta(V) = \frac{1}{Z_D} e^{-F_\theta(V)}, \quad Z_D = \sum_V e^{-F_\theta(V)}, \quad F_\theta(V) = -\log \sum_h e^{-E_\theta(V, h)} \tag{2}$$

where $F_\theta(V)$ is the “free energy”, which can be computed analytically, and $Z_D$ is the “partition function” for normalization, which depends only on the document length $D$. As the hidden units are conditionally independent given the document, and the words are conditionally independent given the hidden state, the conditional distributions are derived:

$$P_\theta(v_{ik} = 1 \mid h) = \frac{\exp\left(W_k^T h + b_k\right)}{\sum_{k'=1}^{K} \exp\left(W_{k'}^T h + b_{k'}\right)} \tag{3}$$

$$P_\theta(h_j = 1 \mid V) = \sigma\left(W_j \hat{v} + D \cdot a_j\right) \tag{4}$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$. Equation (3) is the softmax units describing the multinomial distribution of the words, and Equation (4) serves as an efficient inference from words to semantic meanings, where we adopt the probability of each hidden unit being “activated” as the topic features.

RSM is naturally learned by minimizing the negative log-likelihood function (ML) as follows:

$$L(\theta) = -E_{V \sim P_{data}}\left[\log P_\theta(V)\right] \tag{5}$$

However, the gradient is intractable because of the combinatorial normalization term $Z_D$. Common strategies to overcome this intractability are MCMC-based approaches such as Contrastive Divergence (CD) (Hinton, 2002) and Persistent CD (PCD) (Tieleman, 2008), both of which require repeating Gibbs steps of $h^{(i)} \sim P_\theta(h \mid V^{(i)})$ and $V^{(i+1)} \sim P_\theta(V \mid h^{(i)})$ to generate model samples that approximate the gradient. Typically, the performance and consistency improve when more steps are adopted. Nevertheless, even one Gibbs step is time consuming for RSM, since multinomial sampling normally requires linear-time computation.
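As a rough illustration of Equations (1) to (4) and of one Gibbs step, the following NumPy sketch may help. It is not the authors' implementation: the function names are hypothetical, `W` is assumed to have shape `(H, K)`, and `v_hat` is assumed to be the word count vector. The free energy is obtained by summing the binary hidden units out of Equation (1) in closed form.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy(v_hat, W, b, a):
    """F(V) = -b^T v_hat - sum_j log(1 + exp(W_j v_hat + D a_j))."""
    D = v_hat.sum()
    pre = W @ v_hat + D * a
    # logaddexp(0, x) is a stable log(1 + exp(x))
    return -(b @ v_hat) - np.sum(np.logaddexp(0.0, pre))

def hidden_posterior(v_hat, W, a):
    """Equation (4): P(h_j = 1 | V) for every hidden unit."""
    D = v_hat.sum()
    return sigmoid(W @ v_hat + D * a)

def word_distribution(h, W, b):
    """Equation (3): softmax over the K dictionary words given h."""
    logits = W.T @ h + b
    logits = logits - logits.max()  # stabilise the softmax
    p = np.exp(logits)
    return p / p.sum()

def gibbs_step(v_hat, W, b, a, rng):
    """One Gibbs step V -> h -> V'. The O(K) softmax is rebuilt on every call."""
    D = int(v_hat.sum())
    h = (rng.random(a.shape) < hidden_posterior(v_hat, W, a)).astype(float)
    p = word_distribution(h, W, b)
    return rng.multinomial(D, p).astype(float)
```

Because `word_distribution` depends on the freshly sampled `h`, its output cannot be precomputed, which is the cost the text attributes to CD.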
The “alias method” (Kronmal and Peterson Jr, 1979) speeds up multinomial sampling to constant time per sample, but requires linear time to preprocess the distribution. Since $P_\theta(V \mid h)$ changes at every iteration in CD, such methods cannot be used.

Unlike Dahl et al. (2012), who retain CD, we adopted NCE as the basic learning strategy. Considering that RSM is designed for documents, we further modified NCE with two novel heuristics, developing the approach “Partial Noise Uniform Contrastive Estimate” (or α-NCE for short).

Noise Contrastive Estimate (NCE), similar to CD, is another estimator for training models with intractable partition functions. NCE solves the intractability by treating the partition function $Z_D$ as an additional parameter $Z^c_D$ added to $\theta$, which makes the likelihood computable. Yet the model cannot be trained through ML, as the likelihood can be made arbitrarily large simply by adjusting $Z^c_D$. Instead, NCE learns the model in a proxy classification problem with noise samples.

Given a document collection (data) $\{V_d\}_{T_d}$ and another collection (noise) $\{V_n\}_{T_n}$ with $T_n = kT_d$, NCE distinguishes these $(1+k)T_d$ documents simply based on Bayes' Theorem, where we assume that the data samples are matched by our model, i.e. $P_\theta \simeq P_{data}$, and that the noise samples are generated from an artificial distribution $P_n$. Parameters are learned by minimizing the cross-entropy function:

$$J(\theta) = -E_{V_d \sim P_\theta}\left[\log \sigma_k(X(V_d))\right] - k E_{V_n \sim P_n}\left[\log \sigma_{k^{-1}}(-X(V_n))\right] \tag{6}$$

and the gradient is derived as follows:

$$-\nabla_\theta J(\theta) = E_{V_d \sim P_\theta}\left[\sigma_{k^{-1}}(-X) \nabla_\theta X(V_d)\right] - k E_{V_n \sim P_n}\left[\sigma_k(X) \nabla_\theta X(V_n)\right] \tag{7}$$

where $\sigma_k(x) = \frac{1}{1 + k e^{-x}}$, and the “log-ratio” is:

$$X(V) = \log\left[P_\theta(V) / P_n(V)\right] \tag{8}$$

$J(\theta)$ can be optimized efficiently with stochastic gradient descent (SGD). Gutmann and Hyvärinen (2010) showed that the NCE gradient $\nabla_\theta J(\theta)$ approaches the ML gradient when $k \to \infty$. In practice, a larger $k$ tends to train the model better.
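The objective in Equation (6) and the per-sample weights in Equation (7) can be sketched in NumPy as below. This is only an illustration under stated assumptions: the function names are hypothetical, the expectations are replaced by sample means, and the log-ratios $X(V)$ are assumed to be precomputed arrays.

```python
import numpy as np

def log_sigma_k(x, k):
    """log sigma_k(x) = -log(1 + k * exp(-x)), computed stably."""
    return -np.logaddexp(0.0, np.log(k) - x)

def nce_loss(x_data, x_noise, k):
    """Sample estimate of J(theta) in Equation (6).

    x_data:  log-ratios X(V_d) of the data documents
    x_noise: log-ratios X(V_n) of the noise documents (k per data document)
    """
    data_term = -np.mean(log_sigma_k(x_data, k))
    noise_term = -k * np.mean(log_sigma_k(-x_noise, 1.0 / k))
    return data_term + noise_term

def nce_grad_weights(x_data, x_noise, k):
    """Weights that multiply grad X(V) in Equation (7)."""
    w_data = np.exp(log_sigma_k(-x_data, 1.0 / k))  # sigma_{1/k}(-X)
    w_noise = np.exp(log_sigma_k(x_noise, k))       # sigma_k(X)
    return w_data, w_noise
```

For any log-ratio, $\sigma_k(X)$ and $\sigma_{k^{-1}}(-X)$ sum to one, since they are the posterior probabilities of the "data" and "noise" classes in the proxy classification problem.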
Partial Noise Sampling

Different from (Mnih and Teh, 2012), which generates noise per word, RSM requires the estimator to sample the noise at the document level. An intuitive approach is to sample from the empirical distribution $\tilde{p}$ $D$ times, where the log probability is computed as $\log P_n(V) = \sum_{v \in V} \left[v^T \log \tilde{p}\right]$.

For a fixed $k$, Gutmann and Hyvärinen (2010) suggested choosing the noise close to the data for a sufficient learning result, indicating that full noise might not be satisfactory. We proposed an alternative, “Partial Noise Sampling (PNS)”, which generates noise by replacing part of the data with sampled words. See Algorithm 1, where we fixed the proportion of remaining words at α, named the “noise level” of PNS.

Algorithm 1 Partial Noise Sampling
  Initialize: $k$, $\alpha \in (0, 1)$
  for each $V_d = \{v\}_D \in \{V_d\}_{T_d}$ do
    Set: $D_r = \lceil \alpha \cdot D \rceil$
    Draw: $V_r = \{v_r\}_{D_r} \subseteq V_d$ uniformly
    for $j = 1, ..., k$ do
      Draw: $V_n^{(j)} = \{v_n^{(j)}\}_{D - D_r} \sim \tilde{p}$
      $V_n^{(j)} = V_n^{(j)} \cup V_r$
    end for
    Bind: $(V_d, V_r), (V_n^{(1)}, V_r), ..., (V_n^{(k)}, V_r)$
  end for

However, traversing all the conditions to guess the remaining words requires $O(D!)$ computations. To avoid this, we simply bind the remaining words to the data and noise in advance, and the noise log-probability $\log P_n(V)$ is derived readily:

$$\log P_n(V) = \log P_\theta(V_r) + \sum_{v \in V \setminus V_r} \left[v^T \log \tilde{p}\right] \tag{9}$$

where the remaining words $V_r$ are still assumed to be described by RSM with a smaller document length. In this way, it also strengthens the robustness of RSM towards incomplete data.

Sampling the noise normally requires additional computational load. Fortunately, since $\tilde{p}$ is fixed, sampling is efficient using the “alias method”. It also allows storing the noise for subsequent use, yielding much faster computation than CD.

When we initially implemented NCE for RSM, we found that the document lengths badly biased the log-ratio, resulting in poor parameters. Therefore “Uniform Contrastive Estimate (UCE)” was proposed to accommodate variant document lengths by adding the uniform assumption:

$$\bar{X}(V) = D^{-1} \log\left[P_\theta(V) / P_n(V)\right] \tag{10}$$

where UCE adopts the uniform probabilities $\sqrt[D]{P_\theta}$ and $\sqrt[D]{P_n}$ for classification, to average the modelling ability at the word level. Note that $D$ is not necessarily an integer in UCE, which allows real-valued weights on the document such as idf-weighting (Salton and McGill, 1983). Typically, it is defined as a weighting vector $w$, where $w_k = \log \frac{T_d}{\left|\{V \in \{V_d\} : v_{ik} = 1, v_i \in V\}\right|}$ is multiplied to the $k$-th word in the dictionary.
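Algorithm 1 and the length-normalized log-ratio of Equation (10) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, a document is assumed to be a 1-D array of word ids, and NumPy's `Generator.choice` stands in for the alias method mentioned in the text.

```python
import numpy as np

def partial_noise_sampling(doc, p_tilde, k, alpha, rng):
    """Sketch of Algorithm 1 for a single document.

    doc:     1-D integer array of word ids (length D)
    p_tilde: empirical word distribution over the dictionary
    Returns the kept words V_r and k pairs (noise document, sampled words).
    """
    D = len(doc)
    D_r = int(np.ceil(alpha * D))
    # V_r: D_r positions of the document, drawn uniformly without replacement
    kept = rng.choice(doc, size=D_r, replace=False)
    noise_docs = []
    for _ in range(k):
        # the other D - D_r words are drawn from the fixed distribution p_tilde
        sampled = rng.choice(len(p_tilde), size=D - D_r, p=p_tilde)
        noise_docs.append((np.concatenate([kept, sampled]), sampled))
    return kept, noise_docs

def noise_log_prob_sampled(sampled, p_tilde):
    """Second term of Equation (9): log-probability of the replaced words."""
    return float(np.sum(np.log(p_tilde[sampled])))

def uce_log_ratio(log_p_model, log_p_noise, length):
    """Equation (10): log-ratio divided by the (possibly real-valued) length."""
    return (log_p_model - log_p_noise) / length
```

Because `p_tilde` never changes during training, the sampled noise can be generated once and cached, which is the property the text relies on for speed.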
Thus for a weighted input $V_w$ and corresponding length $D_w$, we derive:

$$\tilde{X}(V_w) = D_w^{-1} \log\left[P_\theta(V_w) / P_n(V_w)\right] \tag{11}$$

where $\log P_n(V_w) = \sum_{v_w \in V_w} \left[v_w^T \log \tilde{p}\right]$. A specific $Z^c_{D_w}$ will be assigned to $P_\theta(V_w)$.

Combining PNS and UCE yields a new estimator for RSM, which we simply call α-NCE (the α comes from the noise level in PNS, but UCE is also a vital part of this estimator and is absorbed into the name).

We evaluated the new estimator by training RSMs on two text datasets: 20 Newsgroups (available at http://qwone.com/~jason/20Newsgroups) and IMDB (available at http://ai.stanford.edu/~amaas/data/sentiment).

The 20 Newsgroups dataset is a collection of Usenet posts, which contains 11,345 training and 7,531 testing instances. Both the training and testing sets are labeled into 20 classes. Stop-word removal and stemming were performed.

The IMDB dataset is a benchmark for sentiment analysis, which consists of 100,000 movie reviews taken from IMDB. The dataset is divided into 75,000 training instances (1/3 labeled and 2/3 unlabeled) and 25,000 testing instances. Two types of labels, positive and negative, are given to show sentiment. Following (Maas et al., 2011), no stop words are removed from this dataset.

For each dataset, we randomly selected a portion of the training set for validation, and the idf-weight vector is computed in advance. In addition, replacing the word count $\hat{v}$ by $\lceil \log(1 + \hat{v}) \rceil$ slightly improved the modelling performance for all models.

We implemented α-NCE according to the parameter settings in (Hinton, 2010), using SGD with minibatches and a preset initial learning rate. The number of hidden units was fixed to the same value for all models. Although learning the partition function $Z^c_D$ separately for every length $D$ is nearly impossible, as in (Mnih and Teh, 2012) we also found, surprisingly, that freezing $Z^c_D$ as a constant function of $D$ without updating it never harmed but actually enhanced the performance.
This is probably because the large number of free parameters in RSM are forced to learn better when $Z^c_D$ is a constant. In practice, we set this constant function as $Z^c_D = 2^H \cdot \left(\sum_k e^{b_k}\right)^D$. It extends readily to learning RSM with a real-valued weighted length $D_w$.

We also implemented CD with the same settings. All the experiments were run on a single GTX970 GPU using the library Theano (Bergstra et al., 2010). To make the comparison fair, α-NCE and CD share the same implementation.

To evaluate the efficiency in learning, we used the most frequent words as dictionaries over a range of sizes for both datasets, and tested the computation time both for CD with different numbers of Gibbs steps and for α-NCE with different noise sample sizes. The comparison of the mean running time per minibatch, averaged over both datasets, is shown in Figure 1. Typically, α-NCE achieves a substantial speed-up compared to CD. Although both CD and α-NCE run slower when the input dimension increases, CD tends to take much more time due to the multinomial sampling at each iteration, especially when more Gibbs steps are used. In contrast, the running time of α-NCE stays reasonable even if a larger noise size or a larger dimension is applied.

Figure 1: Comparison of running time

One direct way to evaluate the modelling performance is to assess RSM as a generative model, estimating the log-probability per word as perplexity. However, as α-NCE learns RSM by distinguishing the data and noise from their respective features, the parameters are trained more like a feature extractor than a generative model. It is therefore not fair to use perplexity to evaluate the performance. For this reason, we evaluated the modelling performance with some indirect measures.

Figure 2: Precision-Recall curves for the retrieval task on the 20 Newsgroups dataset using RSMs.

For 20 Newsgroups, we trained RSMs on the training set, and reported the results on document retrieval and document classification.
For retrieval, we treated the testing set as queries, and retrieved documents with the same labels in the training set by cosine similarity. Precision-recall (P-R) curves and mean average precision (MAP) are the two metrics we used for evaluation. For classification, we trained a softmax regression on the training set, and checked the accuracy on the testing set. We use this dataset to show the modelling ability of RSM with different estimators.

For IMDB, the whole training set is used for learning RSMs, and an L2-regularized logistic regression is trained on the labeled training set. The error rate of sentiment classification on the testing set is reported, compared with several BoW-based baselines. We use this dataset to show the general modelling ability of RSM compared with others.

We trained both α-NCE and CD, as well as plain NCE (without UCE), at a fixed vocabulary size (2000 for 20 Newsgroups, and 5000 for IMDB). Posteriors of the hidden units were used as topic features. For α-NCE, we used a fixed noise level for each dataset. In comparison, we trained CD from 1 up to 5 Gibbs steps.

Figure 3: Tracking the modelling performance with variant α using α-NCE to learn RSMs. CD is also reported as the baseline. (a) MAP for document retrieval and (b) document classification accuracy are measured on 20 Newsgroups, and (c) sentiment classification accuracy is measured on IMDB.

Figure 2 and Table 1 show that a larger noise size in α-NCE achieves better modelling performance, and α-NCE greatly outperforms CD on retrieval tasks, especially around large recall values. The classification results of α-NCE are also comparable to or slightly better than those of CD. It is also gratifying to find that the idf-weighted inputs achieve the best results in both retrieval and classification tasks, as idf-weighting is known to extract information better than word count.
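The retrieval protocol described above (rank training documents by cosine similarity to each test query, then score the ranking by MAP) can be sketched as follows. The function names are hypothetical, and the features are assumed to be the hidden-unit posteriors stored as rows of a NumPy array. This is one common definition of average precision, not necessarily the exact variant used in the experiments.

```python
import numpy as np

def average_precision(relevant):
    """AP of one ranked list; `relevant` is a boolean sequence in ranked order."""
    relevant = np.asarray(relevant, dtype=bool)
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)                 # relevant documents seen so far
    ranks = np.arange(1, len(relevant) + 1)
    # mean of precision@rank taken at the positions of the relevant documents
    return float(np.mean(hits[relevant] / ranks[relevant]))

def mean_average_precision(query_feats, query_labels, db_feats, db_labels):
    """MAP when retrieving same-label documents by cosine similarity."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    d = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = q @ d.T
    aps = []
    for i in range(len(q)):
        order = np.argsort(-sims[i])           # most similar first
        aps.append(average_precision(db_labels[order] == query_labels[i]))
    return float(np.mean(aps))
```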
In addition, plain NCE performs poorly compared to the others in Figure 2, indicating that variant document lengths do bias the learning greatly.

CD    α-NCE k=1    k=5    k=25    k=25 (idf)
64.1%    61.8%    63.6%
Table 1: Comparison of classification accuracy on the 20 Newsgroups dataset using RSMs.
Models                                    Accuracy
Bag of Words (BoW) (Maas and Ng, 2010)    86.75%
LDA (Maas et al., 2011)                   67.42%
LSA (Maas et al., 2011)                   83.96%
Maas et al. (2011)'s “full” model         87.44%
WRRBM (Dahl et al., 2012)                 87.42%
RSM: CD                                   86.22%
RSM: α-NCE-5
RSM: α-NCE-5 (idf)
Table 2: Sentiment classification accuracy on the IMDB dataset using RSMs, compared to other BoW-based approaches. Strictly speaking, WRRBM uses a “bag of n-grams” assumption.

On the other hand, Table 2 shows the performance of RSM in sentiment classification, where model combinations reported in previous efforts are not considered. It is clear that α-NCE learns RSM better than CD, and outperforms BoW and other BoW-based models such as LDA. The idf-weighted inputs also achieve the best performance. Note that RSM is also based on BoW, indicating that α-NCE has arguably reached the limits of learning BoW-based models. In future work, RSM can be extended to more powerful undirected topic models by considering more syntactic information, such as word order or dependency relationships, in the representation. α-NCE can be used to learn them efficiently and achieve better performance.

In order to decide the best noise level (α) for PNS, we learned RSMs using α-NCE with different noise levels for both word count and idf-weighted inputs on the two datasets. Figure 3 shows that α-NCE learning with partial noise ($\alpha > 0$) outperforms full noise ($\alpha = 0$) in most situations, and achieves better results than CD in retrieval and classification on both datasets. However, learning tends to become extremely difficult if the noise becomes too close to the data, which explains why the performance drops rapidly when $\alpha \to 1$. Furthermore, the curves in Figure 3 also imply that the choice of α might be problem-dependent, with larger sets like IMDB requiring relatively smaller α. A systematic strategy for choosing the optimal α will be explored in future work. In practice, a moderate range of α is recommended.
We propose a novel approach, α-NCE, for learning undirected topic models such as RSM efficiently, allowing large vocabulary sizes. It is a new estimator based on NCE, adapted to documents with variant lengths and weighted inputs. We learn RSMs with α-NCE on two classic benchmarks, where it achieves both efficiency in learning and accuracy in retrieval and classification tasks.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.
George E Dahl, Ryan P Adams, and Hugo Larochelle. 2012. Training restricted Boltzmann machines on word observations. arXiv preprint arXiv:1202.5695.
Yann Dauphin and Yoshua Bengio. 2013. Stochastic ratio matching of RBMs for sparse high-dimensional inputs. In Advances in Neural Information Processing Systems, pages 1340–1348.
Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics, pages 297–304.
Geoffrey E Hinton and Ruslan R Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, pages 1607–1614.
Geoffrey Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.
Geoffrey Hinton. 2010. A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926.
Thomas Hofmann. 2000. Learning the similarity of documents: An information-geometric approach to document retrieval and categorization.
Aapo Hyvärinen. 2007. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499–2512.
Richard A Kronmal and Arthur V Peterson Jr. 1979. On the alias method for generating random variables from a discrete distribution. The American Statistician, 33(4):214–218.
Andrew L Maas and Andrew Y Ng. 2010. A probabilistic model for semantic word vectors. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Andriy Mnih and Geoffrey E Hinton. 2009. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pages 1081–1088.
Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.
Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, pages 246–252. Citeseer.
Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978.
Gerard Salton and Michael J McGill. 1983. Introduction to modern information retrieval.
Nitish Srivastava, Ruslan R Salakhutdinov, and Geoffrey E Hinton. 2013. Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865.
Tijmen Tieleman. 2008. Training restricted Boltzmann machines using approximations to the likelihood gradient. In