Learning Topic Models by Belief Propagation
Jia Zeng, Member, IEEE, William K. Cheung, Member, IEEE, and Jiming Liu, Fellow, IEEE
Abstract—Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision and computational biology. This paper represents LDA as a factor graph within the Markov random field (MRF) framework, which enables the classic loopy belief propagation (BP) algorithm for approximate inference and parameter estimation. Although two commonly-used approximate inference methods, variational Bayes (VB) and collapsed Gibbs sampling (GS), have achieved great success in learning LDA, the proposed BP is competitive in both speed and accuracy, as validated by encouraging experimental results on four large-scale document data sets. Furthermore, the BP algorithm has the potential to become a generic learning scheme for variants of LDA-based topic models. To this end, we show how to learn two typical variants of LDA-based topic models, the author-topic model (ATM) and the relational topic model (RTM), using BP based on the factor graph representation.
Index Terms—Latent Dirichlet allocation, topic models, belief propagation, message passing, factor graph, Bayesian networks, Markov random fields, hierarchical Bayesian models, Gibbs sampling, variational Bayes.

1 INTRODUCTION
Latent Dirichlet allocation (LDA) [1] is a three-layer hierarchical Bayesian model (HBM) that can infer probabilistic word clusters called topics from the document-word (document-term) matrix. LDA has no exact inference methods because of loops in its graphical representation. Variational Bayes (VB) [1] and collapsed Gibbs sampling (GS) [2] have been two commonly-used approximate inference methods for learning LDA and its extensions, including author-topic models (ATM) [3] and relational topic models (RTM) [4]. Other inference methods for probabilistic topic modeling include expectation-propagation (EP) [5] and collapsed VB inference (CVB) [6]. The connections and empirical comparisons among these approximate inference methods can be found in [7]. Recently, LDA and HBMs have found many important real-world applications in text mining and computer vision (e.g., tracking historical topics from time-stamped documents [8] and activity perception in crowded and complicated scenes [9]).

This paper represents LDA by the factor graph [10] within the Markov random field (MRF) framework [11]. From the MRF perspective, the topic modeling problem can be interpreted as a labeling problem, in which the objective is to assign a set of semantic topic labels to explain the observed nonzero elements in the document-word matrix. MRF solves the labeling problem, which exists widely in image analysis and computer vision, by two important concepts: neighborhood systems and clique potentials [12] or factor functions [11].

• J. Zeng is with the School of Computer Science and Technology, Soochow University, Suzhou 215006, China. He is also with the Shanghai Key Laboratory of Intelligent Information Processing, China. To whom correspondence should be addressed. E-mail: [email protected].
• W. K. Cheung and J. Liu are with the Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong.

It assigns the best
topic labels according to the maximum a posteriori (MAP) estimation through maximizing the posterior probability, which is in nature a prohibitive combinatorial optimization problem in the discrete topic space. However, we often employ the smoothness prior [13] over neighboring topic labels to reduce the complexity by encouraging or penalizing only a limited number of possible labeling configurations.

The factor graph is a graphical representation method for both directed models (e.g., hidden Markov models (HMMs) [11, Chapter 13.2.3]) and undirected models (e.g., Markov random fields (MRFs) [11, Chapter 8.4.3]), because factor functions can represent both conditional and joint probabilities. In this paper, the proposed factor graph for LDA describes the same joint probability as that in the three-layer HBM, and thus it is not a new topic model but interprets LDA from a novel MRF perspective. The basic idea is inspired by the collapsed GS algorithm for LDA [2], [14], which integrates out multinomial parameters based on Dirichlet-multinomial conjugacy and views the Dirichlet hyperparameters as pseudo topic labels residing in the same layer as the latent topic labels. In the collapsed hidden variable space, the joint probability of LDA can be represented as the product of factor functions in the factor graph. By contrast, the undirected model "harmonium" [15] encodes a different joint probability from LDA and probabilistic latent semantic analysis (PLSA) [16], so that it is a new and viable alternative to the directed models.

The factor graph representation facilitates the classic loopy belief propagation (BP) algorithm [10], [11], [17] for approximate inference and parameter estimation. By designing proper neighborhood systems and factor functions, we may encourage or penalize different local labeling configurations in the neighborhood system to realize the topic modeling goal.
The BP algorithm operates well on the factor graph, and it has the potential to become a generic learning scheme for variants of LDA-based topic models. For example, we also extend the BP algorithm to learn ATM [3] and RTM [4] based on the factor graph representations. Although the convergence of BP is not guaranteed on general graphs [11], it often converges and works well in real-world applications.

The factor graph of LDA also reveals some intrinsic relations between HBM and MRF. HBM is a class of directed models within the Bayesian network framework [14], which represents the causal or conditional dependencies of observed and hidden variables in a hierarchical manner, so that it is difficult to factorize the joint probability of hidden variables. By contrast, MRF can factorize the joint distribution of hidden variables into the product of factor functions according to the Hammersley-Clifford theorem [18], which facilitates the efficient BP algorithm for approximate inference and parameter estimation. Although learning HBM often has difficulty in estimating parameters and inferring hidden variables due to the causal coupling effects, the alternative factor graph representation as well as the BP-based learning scheme may shed more light on faster and more accurate algorithms for HBM.

The remainder of this paper is organized as follows. In Section 2 we introduce the factor graph interpretation for LDA, and derive the loopy BP algorithm for approximate inference and parameter estimation. Moreover, we discuss the intrinsic relations between BP and other state-of-the-art approximate inference algorithms. Sections 3 and 4 present how to learn ATM and RTM using the BP algorithms. Section 5 validates the BP algorithm on four document data sets. Finally, Section 6 draws conclusions and envisions future work.
2 BELIEF PROPAGATION FOR LDA
The probabilistic topic modeling task is to assign a set of semantic topic labels, $\mathbf{z} = \{z^k_{w,d}\}$, to explain the observed nonzero elements in the document-word matrix $\mathbf{x} = \{x_{w,d}\}$. The notation $1 \le k \le K$ is the topic index, $x_{w,d}$ is the number of word counts at the index $\{w,d\}$, and $1 \le w \le W$ and $1 \le d \le D$ are the word index in the vocabulary and the document index in the corpus. Table 1 summarizes some important notations.

Fig. 1 shows the original three-layer graphical representation of LDA [1]. The document-specific topic proportion $\theta_d(k)$ generates a topic label $z^k_{w,d,i} \in \{0,1\}$, $\sum_{k=1}^{K} z^k_{w,d,i} = 1$, which in turn generates each observed word token $i$ at the index $w$ in the document $d$ based on the topic-specific multinomial distribution $\phi_k(w)$ over the vocabulary words. Both multinomial parameters $\theta_d(k)$ and $\phi_k(w)$ are generated by two Dirichlet distributions with hyperparameters $\alpha$ and $\beta$, respectively. For simplicity, we consider only the smoothed LDA [2] with fixed symmetric Dirichlet hyperparameters $\alpha$ and $\beta$. The plates indicate replications. For example, the document $d$ repeats $D$ times in the corpus, the word tokens $w_n$ repeat $N_d$ times in the document $d$, the vocabulary size is $W$, and there are $K$ topics.

Fig. 1. Three-layer graphical representation of LDA [1].

TABLE 1
Notations

$1 \le d \le D$                 Document index
$1 \le w \le W$                 Word index in vocabulary
$1 \le k \le K$                 Topic index
$1 \le a \le A$                 Author index
$1 \le c \le C$                 Link index
$\mathbf{x} = \{x_{w,d}\}$      Document-word matrix
$\mathbf{z} = \{z^k_{w,d}\}$    Topic labels for words
$\mathbf{z}_{-w,d}$             Labels for $d$ excluding $w$
$\mathbf{z}_{w,-d}$             Labels for $w$ excluding $d$
$\mathbf{a}_d$                  Coauthors of the document $d$
$\mu_{\cdot,d}(k)$              $\sum_w x_{w,d}\,\mu_{w,d}(k)$
$\mu_{w,\cdot}(k)$              $\sum_d x_{w,d}\,\mu_{w,d}(k)$
$\theta_d$                      Factor of the document $d$
$\phi_w$                        Factor of the word $w$
$\eta_c$                        Factor of the link $c$
$f(\cdot)$                      Factor functions
$\alpha, \beta$                 Dirichlet hyperparameters
We begin by transforming the directed graph of Fig. 1 into a two-layer factor graph, of which a representative fragment is shown in Fig. 2. The notation, $z^k_{w,d} = \sum_{i=1}^{x_{w,d}} z^k_{w,d,i}/x_{w,d}$, denotes the average topic labeling configuration over all word tokens $1 \le i \le x_{w,d}$ at the index $\{w,d\}$. We define the neighborhood system of the topic label $z_{w,d}$ as $\mathbf{z}_{-w,d}$ and $\mathbf{z}_{w,-d}$, where $\mathbf{z}_{-w,d}$ denotes the set of topic labels associated with all word indices in the document $d$ except $w$, and $\mathbf{z}_{w,-d}$ denotes the set of topic labels associated with the word index $w$ in all documents except $d$. The factors $\theta_d$ and $\phi_w$ are denoted by squares, and their connected variables $z_{w,d}$ are denoted by circles. The factor $\theta_d$ connects the neighboring topic labels $\{z_{w,d}, \mathbf{z}_{-w,d}\}$ at different word indices within the same document $d$, while the factor $\phi_w$ connects the neighboring topic labels $\{z_{w,d}, \mathbf{z}_{w,-d}\}$ at the same word index $w$ but in different documents. We absorb the observed word $w$ as the index of $\phi_w$, which is similar to absorbing the observed document $d$ as the index of $\theta_d$. Because the factors can be parameterized functions [11], both $\theta_d$ and $\phi_w$ can represent the same multinomial parameters as in Fig. 1.

Fig. 2. Factor graph of LDA and message passing.

Fig. 2 describes the same joint probability as Fig. 1 if we properly design the factor functions. The bipartite factor graph is inspired by the collapsed GS [2], [14] algorithm, which integrates out the parameter variables $\{\theta, \phi\}$ in Fig. 1 and treats the hyperparameters $\{\alpha, \beta\}$ as pseudo topic counts residing in the same layer as the hidden variables $\mathbf{z}$. Thus, the joint probability of the collapsed hidden variables can be factorized as the product of factor functions.
This collapsed view has also been discussed within the mean-field framework [19], inspiring the zero-order approximation CVB (CVB0) algorithm [7] for LDA. So, we speculate that all three-layer LDA-based topic models can be collapsed into two-layer factor graphs, which facilitate the BP algorithm for efficient inference and parameter estimation. However, how to use the two-layer factor graph to represent more general multi-layer HBMs still remains to be studied.

Based on Dirichlet-multinomial conjugacy, integrating out the multinomial parameters $\{\theta, \phi\}$ yields the joint probability [14] of LDA in Fig. 1,

$$P(\mathbf{x}, \mathbf{z}|\alpha, \beta) \propto \prod_{d=1}^{D} \prod_{k=1}^{K} \frac{\Gamma(\sum_{w=1}^{W} x_{w,d} z^k_{w,d} + \alpha)}{\Gamma[\sum_{k=1}^{K}(\sum_{w=1}^{W} x_{w,d} z^k_{w,d} + \alpha)]} \times \prod_{w=1}^{W} \prod_{k=1}^{K} \frac{\Gamma(\sum_{d=1}^{D} x_{w,d} z^k_{w,d} + \beta)}{\Gamma[\sum_{w=1}^{W}(\sum_{d=1}^{D} x_{w,d} z^k_{w,d} + \beta)]}, \quad (1)$$

where $x_{w,d} z^k_{w,d} = \sum_{i=1}^{x_{w,d}} z^k_{w,d,i}$ recovers the original topic configuration over the word tokens in Fig. 1. Here, we design the factor functions as

$$f_{\theta_d}(\mathbf{x}_{\cdot,d}, \mathbf{z}_{\cdot,d}, \alpha) = \prod_{k=1}^{K} \frac{\Gamma(\sum_{w=1}^{W} x_{w,d} z^k_{w,d} + \alpha)}{\Gamma[\sum_{k=1}^{K}(\sum_{w=1}^{W} x_{w,d} z^k_{w,d} + \alpha)]}, \quad (2)$$

$$f_{\phi_w}(\mathbf{x}_{w,\cdot}, \mathbf{z}_{w,\cdot}, \beta) = \prod_{k=1}^{K} \frac{\Gamma(\sum_{d=1}^{D} x_{w,d} z^k_{w,d} + \beta)}{\Gamma[\sum_{w=1}^{W}(\sum_{d=1}^{D} x_{w,d} z^k_{w,d} + \beta)]}, \quad (3)$$

where $\mathbf{z}_{\cdot,d} = \{z_{w,d}, \mathbf{z}_{-w,d}\}$ and $\mathbf{z}_{w,\cdot} = \{z_{w,d}, \mathbf{z}_{w,-d}\}$ denote subsets of the variables in Fig. 2. Therefore, the joint probability (1) of LDA can be re-written as the product of factor functions [11, Eq. (8.59)] in Fig. 2,

$$P(\mathbf{x}, \mathbf{z}|\alpha, \beta) \propto \prod_{d=1}^{D} f_{\theta_d}(\mathbf{x}_{\cdot,d}, \mathbf{z}_{\cdot,d}, \alpha) \prod_{w=1}^{W} f_{\phi_w}(\mathbf{x}_{w,\cdot}, \mathbf{z}_{w,\cdot}, \beta). \quad (4)$$

Therefore, the two-layer factor graph in Fig. 2 encodes exactly the same information as the three-layer graph for LDA in Fig. 1. In this way, we may interpret LDA within the MRF framework and treat probabilistic topic modeling as a labeling problem.

The BP [11] algorithms provide exact solutions for inference problems in tree-structured factor graphs but approximate solutions in factor graphs with loops.
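To make the collapsed representation concrete, the joint probability (1) can be evaluated numerically for a hard topic assignment (a special case of the average labeling $z^k_{w,d}$). The Python sketch below is illustrative only: the dense input format and names such as `log_joint` are our own, not from the paper.

```python
import math
import numpy as np

def log_joint(X, Z, alpha, beta):
    """Unnormalized log P(x, z | alpha, beta) of Eq. (1).

    X : (W, D) array of word counts x_{w,d}.
    Z : (W, D) integer array; Z[w, d] is the hard topic label shared
        by all tokens at the index {w, d}.
    """
    W, D = X.shape
    K = int(Z.max()) + 1
    # n_dk[d, k] = sum_w x_{w,d} z^k_{w,d};  n_wk[w, k] = sum_d x_{w,d} z^k_{w,d}
    n_dk = np.zeros((D, K))
    n_wk = np.zeros((W, K))
    for w in range(W):
        for d in range(D):
            if X[w, d] > 0:
                n_dk[d, Z[w, d]] += X[w, d]
                n_wk[w, Z[w, d]] += X[w, d]
    lp = 0.0
    for d in range(D):  # document factors f_theta_d, Eq. (2)
        lp += sum(math.lgamma(n_dk[d, k] + alpha) for k in range(K))
        lp -= math.lgamma(n_dk[d].sum() + K * alpha)
    for k in range(K):  # word factors f_phi_w, Eq. (3)
        lp += sum(math.lgamma(n_wk[w, k] + beta) for w in range(W))
        lp -= math.lgamma(n_wk[:, k].sum() + W * beta)
    return lp
```

Because the factorization (4) is symmetric in the topic labels, permuting the labels leaves the log joint probability unchanged, which is an easy sanity check for the formula.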
Rather than directly computing the conditional joint probability $p(\mathbf{z}|\mathbf{x})$, we compute the conditional marginal probability, $p(z^k_{w,d} = 1, x_{w,d} | \mathbf{z}^k_{-w,-d}, \mathbf{x}_{-w,-d})$, referred to as the message $\mu_{w,d}(k)$, which can be normalized using a local computation, i.e., $\sum_{k=1}^{K} \mu_{w,d}(k) = 1$, $0 \le \mu_{w,d}(k) \le 1$. According to the Markov property in Fig. 2, we obtain

$$p(z^k_{w,d}, x_{w,d} | \mathbf{z}^k_{-w,-d}, \mathbf{x}_{-w,-d}) \propto p(z^k_{w,d}, x_{w,d} | \mathbf{z}^k_{-w,d}, \mathbf{x}_{-w,d})\, p(z^k_{w,d}, x_{w,d} | \mathbf{z}^k_{w,-d}, \mathbf{x}_{w,-d}), \quad (5)$$

where $-w$ and $-d$ denote all word indices except $w$ and all document indices except $d$, and the notations $\mathbf{z}_{-w,d}$ and $\mathbf{z}_{w,-d}$ represent all possible neighboring topic configurations. From the message-passing view, $p(z^k_{w,d}, x_{w,d} | \mathbf{z}^k_{-w,d}, \mathbf{x}_{-w,d})$ is the neighboring message $\mu_{\theta_d \to z_{w,d}}(k)$ sent from the factor node $\theta_d$, and $p(z^k_{w,d}, x_{w,d} | \mathbf{z}^k_{w,-d}, \mathbf{x}_{w,-d})$ is the other neighboring message $\mu_{\phi_w \to z_{w,d}}(k)$ sent from the factor node $\phi_w$. Notice that (5) uses the smoothness prior in MRF, which encourages only $K$ smooth topic configurations within the neighborhood system. Using Bayes' rule and the joint probability (1), we can expand Eq. (5) as

$$\mu_{w,d}(k) \propto \frac{p(\mathbf{z}^k_{\cdot,d}, \mathbf{x}_{\cdot,d})}{p(\mathbf{z}^k_{-w,d}, \mathbf{x}_{-w,d})} \times \frac{p(\mathbf{z}^k_{w,\cdot}, \mathbf{x}_{w,\cdot})}{p(\mathbf{z}^k_{w,-d}, \mathbf{x}_{w,-d})} \propto \frac{\sum_{-w} x_{-w,d}\, z^k_{-w,d} + \alpha}{\sum_{k=1}^{K}(\sum_{-w} x_{-w,d}\, z^k_{-w,d} + \alpha)} \times \frac{\sum_{-d} x_{w,-d}\, z^k_{w,-d} + \beta}{\sum_{w=1}^{W}(\sum_{-d} x_{w,-d}\, z^k_{w,-d} + \beta)}, \quad (6)$$

where the property $\Gamma(x+1) = x\Gamma(x)$ is used to cancel the common terms in both numerator and denominator [14]. We find that Eq. (6) updates the message on the variable $z^k_{w,d}$ if its neighboring topic configuration $\{\mathbf{z}^k_{-w,d}, \mathbf{z}^k_{w,-d}\}$ is known. However, due to uncertainty, we know only the neighboring messages rather than the precise topic configuration. So, we replace topic configurations by the corresponding messages in Eq.
(6) and obtain the following message update equation,

$$\mu_{w,d}(k) \propto \frac{\mu_{-w,d}(k) + \alpha}{\sum_k [\mu_{-w,d}(k) + \alpha]} \times \frac{\mu_{w,-d}(k) + \beta}{\sum_w [\mu_{w,-d}(k) + \beta]}, \quad (7)$$

where

$$\mu_{-w,d}(k) = \sum_{-w} x_{-w,d}\, \mu_{-w,d}(k), \quad (8)$$

$$\mu_{w,-d}(k) = \sum_{-d} x_{w,-d}\, \mu_{w,-d}(k). \quad (9)$$

Messages are passed from variables to factors, and in turn from factors to variables, until convergence or until the maximum number of iterations is reached.

input: $\mathbf{x}, K, T, \alpha, \beta$.
output: $\theta_d, \phi_w$.
$\mu_{w,d}(k) \leftarrow$ initialization and normalization;
for $t \leftarrow 1$ to $T$ do
  $\mu^{t+1}_{w,d}(k) \propto \frac{\mu^t_{-w,d}(k)+\alpha}{\sum_k[\mu^t_{-w,d}(k)+\alpha]} \times \frac{\mu^t_{w,-d}(k)+\beta}{\sum_w[\mu^t_{w,-d}(k)+\beta]}$;
end
$\theta_d(k) \leftarrow [\mu_{\cdot,d}(k)+\alpha]/\sum_k[\mu_{\cdot,d}(k)+\alpha]$;
$\phi_w(k) \leftarrow [\mu_{w,\cdot}(k)+\beta]/\sum_w[\mu_{w,\cdot}(k)+\beta]$;

Fig. 3. The synchronous BP for LDA.

Notice that we need only pass messages for $x_{w,d} \neq 0$. Because $\mathbf{x}$ is a very sparse matrix, the message update equation (7) is computationally fast, sweeping only the nonzero elements in the sparse matrix $\mathbf{x}$.

Based on messages, we can estimate the multinomial parameters $\theta$ and $\phi$ by the expectation-maximization (EM) algorithm [14]. The E-step infers the normalized message $\mu_{w,d}(k)$. Using Dirichlet-multinomial conjugacy and Bayes' rule, we express the marginal Dirichlet distributions on parameters as follows,

$$p(\theta_d) = \mathrm{Dir}(\theta_d | \mu_{\cdot,d}(k) + \alpha), \quad (10)$$

$$p(\phi_w) = \mathrm{Dir}(\phi_w | \mu_{w,\cdot}(k) + \beta). \quad (11)$$

The M-step maximizes (10) and (11) with respect to $\theta_d$ and $\phi_w$, resulting in the following point estimates of the multinomial parameters,

$$\theta_d(k) = \frac{\mu_{\cdot,d}(k) + \alpha}{\sum_k [\mu_{\cdot,d}(k) + \alpha]}, \quad (12)$$

$$\phi_w(k) = \frac{\mu_{w,\cdot}(k) + \beta}{\sum_w [\mu_{w,\cdot}(k) + \beta]}. \quad (13)$$

In this paper, we consider only fixed hyperparameters $\{\alpha, \beta\}$. Interested readers can refer to [14] for how to estimate hyperparameters based on inferred messages.

To implement the BP algorithm, we must choose either the synchronous or the asynchronous update schedule to pass messages [20]. Fig.
3 shows the synchronous message update schedule. At each iteration $t$, each variable uses the incoming messages from the previous iteration $t-1$ to calculate the current message. Once every variable has computed its message, the message is passed to the neighboring variables and used to compute messages in the next iteration $t+1$. An alternative is the asynchronous message update schedule, which updates the message of each variable immediately. The updated message is immediately used to compute other neighboring messages at each iteration $t$. The asynchronous update schedule often passes messages faster across variables, which causes the BP algorithm to converge faster than the synchronous update schedule. Another termination condition for convergence is that the change of the multinomial parameters [1] is less than a predefined small threshold $\lambda$ [21].

We may also adopt one of the BP instantiations, the sum-product algorithm [11], to infer $\mu_{w,d}(k)$. For convenience, we will not include the observation $x_{w,d}$ in the formulation. Fig. 2 shows the message passing from the two factors $\theta_d$ and $\phi_w$ to the variable $z_{w,d}$, where the arrows denote the message passing directions. Based on the smoothness prior, we encourage only $K$ smooth topic configurations without considering all other possible configurations. The message $\mu_{w,d}(k)$ is proportional to the product of both incoming messages from the factors,

$$\mu_{w,d}(k) \propto \mu_{\theta_d \to z_{w,d}}(k) \times \mu_{\phi_w \to z_{w,d}}(k). \quad (14)$$

Eq. (14) has the same meaning as (5). The messages from factors to variables are the product of all incoming messages from the neighboring variables,

$$\mu_{\theta_d \to z_{w,d}}(k) = f_{\theta_d} \prod_{-w} \mu_{-w,d}(k)\, \alpha, \quad (15)$$

$$\mu_{\phi_w \to z_{w,d}}(k) = f_{\phi_w} \prod_{-d} \mu_{w,-d}(k)\, \beta, \quad (16)$$

where $\alpha$ and $\beta$ can be viewed as pseudo-messages, and $f_{\theta_d}$ and $f_{\phi_w}$ are the factor functions that encourage or penalize the incoming messages.

In practice, however, Eqs. (15) and (16) often make the product of multiple incoming messages close to zero [12].
To avoid arithmetic underflow, we use the sum operation rather than the product operation of incoming messages, because when the product value increases the sum value also increases,

$$\prod_{-w} \mu_{-w,d}(k)\, \alpha \propto \sum_{-w} \mu_{-w,d}(k) + \alpha, \quad (17)$$

$$\prod_{-d} \mu_{w,-d}(k)\, \beta \propto \sum_{-d} \mu_{w,-d}(k) + \beta. \quad (18)$$

The approximations (17) and (18) transform the sum-product algorithm into a sum-sum algorithm, which resembles the relaxation labeling algorithm for learning MRF with good performance [12].

The normalized message $\mu_{w,d}(k)$ is multiplied by the number of word counts $x_{w,d}$ during the propagation, i.e., $x_{w,d}\,\mu_{w,d}(k)$. In this sense, $x_{w,d}$ can be viewed as the weight of $\mu_{w,d}(k)$ during the propagation, where a bigger $x_{w,d}$ corresponds to a larger influence of its message on those of its neighbors. Thus, the topics may be distorted by those documents with greater word counts. To avoid this problem, we may choose another weight, such as term frequency (TF) or term frequency-inverse document frequency (TF-IDF), for weighted belief propagation. In this sense, BP can not only handle discrete data, but also process continuous data like TF-IDF. The MRF model in Fig. 2 can be extended to describe both discrete and continuous data in general, while LDA in Fig. 1 focuses only on generating discrete data.

In the MRF model, we can design the factor functions arbitrarily to encourage or penalize local topic labeling configurations based on our prior knowledge. From Fig.
1, LDA solves the topic modeling problem according to three intrinsic assumptions:

1) Co-occurrence: Different word indices within the same document tend to be associated with the same topic labels.
2) Smoothness: The same word indices in different documents are likely to be associated with the same topic labels.
3) Clustering: All word indices do not tend to be associated with the same topic labels.

The first assumption is determined by the document-specific topic proportion $\theta_d(k)$: it is more likely to assign a topic label $z^k_{w,d} = 1$ to the word index $w$ if the topic $k$ is more frequently assigned to other word indices in the document $d$. Similarly, the second assumption is based on the topic-specific multinomial distribution $\phi_k(w)$, which implies a higher likelihood of associating the word index $w$ with the topic label $z^k_{w,d} = 1$ if $k$ is more frequently assigned to the same word index $w$ in other documents except $d$. The third assumption avoids grouping all word indices into one topic through normalizing $\phi_k(w)$ in terms of all word indices. If most word indices are associated with the topic $k$, the multinomial parameter $\phi_k$ will become too small to allocate the topic $k$ to these word indices.

According to the above assumptions, we design $f_{\theta_d}$ and $f_{\phi_w}$ over messages as

$$f_{\theta_d}(\boldsymbol{\mu}_{\cdot,d}, \alpha) = \frac{1}{\sum_k [\mu_{-w,d}(k) + \alpha]}, \quad (19)$$

$$f_{\phi_w}(\boldsymbol{\mu}_{w,\cdot}, \beta) = \frac{1}{\sum_w [\mu_{w,-d}(k) + \beta]}. \quad (20)$$

Eq. (19) normalizes the incoming messages by the total number of messages for all topics associated with the document $d$, to make the outgoing messages comparable across documents. Eq. (20) normalizes the incoming messages by the total number of messages for all words in the vocabulary, to make the outgoing messages comparable across vocabulary words. Notice that (15) and (16) realize the first two assumptions, and (20) encodes the third assumption of topic modeling.
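Combining the messages (14)-(16) with the normalizing factor functions (19)-(20) and the sum approximations (17)-(18) recovers the update (7). The synchronous schedule of Fig. 3 can then be sketched in NumPy as follows; this is an illustrative dense implementation with hypothetical names (`bp_lda`), not the authors' code.

```python
import numpy as np

def bp_lda(X, K, T, alpha, beta, seed=0):
    """Synchronous BP for LDA (Fig. 3): update (7) with the
    count-weighted sums (8)-(9), then point estimates (12)-(13).
    X is a (W, D) array of word counts (sparse in practice)."""
    rng = np.random.default_rng(seed)
    W, D = X.shape
    w_idx, d_idx = np.nonzero(X)
    x = X[w_idx, d_idx].astype(float)          # counts at nonzero entries
    mu = rng.random((K, x.size))
    mu /= mu.sum(0, keepdims=True)             # normalized K-tuple messages

    def accumulate(mu):
        xmu = x * mu                           # weight messages by counts
        md = np.zeros((K, D))                  # mu_{.,d}(k)
        mw = np.zeros((K, W))                  # mu_{w,.}(k)
        for k in range(K):
            np.add.at(md[k], d_idx, xmu[k])
            np.add.at(mw[k], w_idx, xmu[k])
        return xmu, md, mw

    for _ in range(T):
        xmu, md, mw = accumulate(mu)
        num_l = md[:, d_idx] - xmu + alpha     # mu_{-w,d}(k) + alpha
        num_r = mw[:, w_idx] - xmu + beta      # mu_{w,-d}(k) + beta
        den_l = num_l.sum(0, keepdims=True)    # sum_k [mu_{-w,d}(k) + alpha]
        # sum_w [mu_{w,-d}(k) + beta] = total topic mass - doc d's mass + W*beta
        den_r = xmu.sum(1)[:, None] - md[:, d_idx] + W * beta
        mu = (num_l / den_l) * (num_r / den_r)  # Eq. (7)
        mu /= mu.sum(0, keepdims=True)
    _, md, mw = accumulate(mu)
    theta = md + alpha
    theta /= theta.sum(0, keepdims=True)       # Eq. (12)
    phi = mw + beta
    phi /= phi.sum(1, keepdims=True)           # Eq. (13)
    return theta, phi
```

A production implementation would sweep only the nonzero entries of a sparse matrix, as the paper emphasizes; the dense loop above is kept for readability.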
A similar normalization technique to avoid partitioning all data points into one cluster has been used in the classic normalized cuts algorithm for image segmentation [22]. Combining (14) to (20) yields the same message update equation (7).

To estimate the parameters $\theta_d$ and $\phi_w$, we use the joint marginal distributions (15) and (16) of the sets of variables belonging to the factors $\theta_d$ and $\phi_w$, including the variable $z_{w,d}$, which produce the same point estimation equations (12) and (13).

We may simplify the message update equation (7). Substituting (12) and (13) into (7) yields the approximate message update equation,

$$\mu_{w,d}(k) \propto \theta_d(k) \times \phi_w(k), \quad (21)$$

which includes the current message $\mu_{w,d}(k)$ in both the numerator and the denominator of (7). In many real-world topic modeling tasks, a document often contains many different word indices, and the same word index appears in many different documents. So, at each iteration, Eq. (21) deviates only slightly from (7) after adding the current message to both numerator and denominator.

function [phi, theta] = siBP(X, K, T, ALPHA, BETA)
% X is a W*D sparse matrix.
% W is the vocabulary size.
% D is the number of documents.
% The element of X is the word count 'xi'.
% 'wi' and 'di' are word and document indices.
% K is the number of topics.
% T is the number of iterations.
% mu is a matrix with K rows for topic messages.
% phi is a K*W matrix.
% theta is a K*D matrix.
% ALPHA and BETA are hyperparameters.
%
% normalize(A,dim) returns the normalized values
% (sum to one) of the elements along different
% dimensions of an array.
[wi,di,xi] = find(X);
% random initialization
mu = normalize(rand(K,nnz(X)),1);
% simplified belief propagation
for t = 1:T
    for k = 1:K
        md(k,:) = accumarray(di,xi'.*mu(k,:));
        mw(k,:) = accumarray(wi,xi'.*mu(k,:));
    end
    theta = normalize(md+ALPHA,1); % Eq. (12)
    phi = normalize(mw+BETA,2);    % Eq. (13)
    mu = normalize(theta(:,di).*phi(:,wi),1); % Eq. (21)
end
return

Fig. 4. The MATLAB code for siBP.
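For readers without MATLAB, a rough NumPy port of the siBP code in Fig. 4 might look as follows; the function name `sibp` and the dense input format are our own choices, not part of the paper.

```python
import numpy as np

def sibp(X, K, T, alpha, beta, seed=0):
    """Simplified BP (siBP) for LDA, following Fig. 4.
    X is a (W, D) array of word counts; returns phi (K, W)
    and theta (K, D) as in the MATLAB signature."""
    rng = np.random.default_rng(seed)
    W, D = X.shape
    w_idx, d_idx = np.nonzero(X)               # like [wi, di, xi] = find(X)
    x = X[w_idx, d_idx].astype(float)
    mu = rng.random((K, x.size))
    mu /= mu.sum(0, keepdims=True)             # random initialization
    for _ in range(T):
        xmu = x * mu
        md = np.zeros((K, D))
        mw = np.zeros((K, W))
        for k in range(K):
            np.add.at(md[k], d_idx, xmu[k])    # accumarray over documents
            np.add.at(mw[k], w_idx, xmu[k])    # accumarray over words
        theta = md + alpha
        theta /= theta.sum(0, keepdims=True)   # Eq. (12)
        phi = mw + beta
        phi /= phi.sum(1, keepdims=True)       # Eq. (13)
        mu = theta[:, d_idx] * phi[:, w_idx]   # Eq. (21)
        mu /= mu.sum(0, keepdims=True)
    return phi, theta
```

As in the MATLAB version, only the nonzero entries of the document-word matrix are swept at each iteration.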
Such a slight difference may be enlarged after many iterations in Fig. 3 due to accumulation effects, leading to different estimated parameters. Intuitively, Eq. (21) implies that if the topic $k$ has a higher proportion in the document $d$, and it has a higher likelihood to generate the word index $w$, it is more likely to allocate the topic $k$ to the observed word $x_{w,d}$. This allocation scheme in principle follows the three intrinsic topic modeling assumptions in Subsection 2.3. Fig. 4 shows the MATLAB code for the simplified BP (siBP).

Here we discuss some intrinsic relations between BP and three state-of-the-art LDA learning algorithms, namely VB [1], GS [2], and zero-order approximation CVB (CVB0) [7], [19], within the unified message passing framework. The message update scheme is an instantiation of the E-step of the EM algorithm [23], which has been widely used to infer the marginal probabilities of hidden variables in various graphical models according to the maximum-likelihood estimation [11] (e.g., the E-step inference for GMMs [24], the forward-backward algorithm for HMMs [25], and the probabilistic relaxation labeling algorithm for MRF [26]). After the E-step, we estimate the optimal parameters using the updated messages and observations at the M-step of the EM algorithm.

VB is a variational message passing method [27] that uses a set of factorized variational distributions $q(\mathbf{z})$ to approximate the joint distribution (1) by minimizing the Kullback-Leibler (KL) divergence between them. Employing Jensen's inequality makes the approximate variational distribution an adjustable lower bound on the joint distribution, so that maximizing the joint probability is equivalent to maximizing the lower bound by tuning a set of variational parameters. The lower bound $q(\mathbf{z})$ is also an MRF in nature that approximates the joint distribution (1). Because there is always a gap between the lower bound and the true joint distribution, VB introduces bias when learning LDA.
The variational message update equation is

$$\mu_{w,d}(k) \propto \frac{\exp[\Psi(\mu_{\cdot,d}(k) + \alpha)]}{\exp[\Psi(\sum_k [\mu_{\cdot,d}(k) + \alpha])]} \times \frac{\mu_{w,\cdot}(k) + \beta}{\sum_w [\mu_{w,\cdot}(k) + \beta]}, \quad (22)$$

which resembles the synchronous BP (7) but with two major differences. First, VB uses the complicated digamma function $\Psi(\cdot)$, which not only introduces bias [7] but also slows down the message updating. Second, VB uses a different variational EM schedule. At the E-step, it simultaneously updates both the variational messages and the parameter $\theta_d$ until convergence, holding the variational parameter of $\phi$ fixed. At the M-step, VB updates only the variational parameter of $\phi$.

The message update equation of GS is

$$\mu_{w,d,i}(k) \propto \frac{n^{-i}_{\cdot,d}(k) + \alpha}{\sum_k [n^{-i}_{\cdot,d}(k) + \alpha]} \times \frac{n^{-i}_{w,\cdot}(k) + \beta}{\sum_w [n^{-i}_{w,\cdot}(k) + \beta]}, \quad (23)$$

where $n^{-i}_{\cdot,d}(k)$ is the total number of topic labels $k$ in the document $d$ excluding the topic label on the current word token $i$, and $n^{-i}_{w,\cdot}(k)$ is the total number of topic labels $k$ of the word $w$ excluding the topic label on the current word token $i$. Eq. (23) resembles the asynchronous BP implementation of (7) but with two subtle differences. First, GS randomly samples the current topic label $z^k_{w,d,i} = 1$ from the message $\mu_{w,d,i}(k)$, which truncates all $K$-tuple message values to zeros except the sampled topic label $k$. Such information loss introduces bias when learning LDA. Second, GS must sample a topic label for each word token, which repeats $x_{w,d}$ times for the word index $\{w,d\}$. The sweep of the entire set of word tokens rather than word indices restricts GS's scalability to large-scale document repositories containing billions of word tokens.

CVB0 is exactly equivalent to our asynchronous BP implementation but is based on word tokens. Previous empirical comparisons [7] advocated the CVB0 algorithm for LDA within the approximate mean-field framework [19], closely connected with the proposed BP.
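For contrast with the index-based BP updates, the token-based GS update (23) can be sketched as a minimal collapsed Gibbs sampler. The names (`gibbs_lda`) and the flat token arrays are our own; the per-document denominator of (23) is dropped in the sampling probability because it is constant over $k$.

```python
import numpy as np

def gibbs_lda(tokens, docs, W, D, K, T, alpha, beta, seed=0):
    """Collapsed Gibbs sampling for LDA, a minimal sketch of Eq. (23).
    tokens[i] is the word index and docs[i] the document index of token i."""
    rng = np.random.default_rng(seed)
    N = len(tokens)
    z = rng.integers(0, K, size=N)             # random initial topic labels
    n_dk = np.zeros((D, K))                    # document-topic counts
    n_wk = np.zeros((W, K))                    # word-topic counts
    n_k = np.zeros(K)                          # total counts per topic
    for i in range(N):
        n_dk[docs[i], z[i]] += 1
        n_wk[tokens[i], z[i]] += 1
        n_k[z[i]] += 1
    for _ in range(T):
        for i in range(N):
            w, d, k = tokens[i], docs[i], z[i]
            n_dk[d, k] -= 1; n_wk[w, k] -= 1; n_k[k] -= 1  # exclude token i
            # Eq. (23): truncated to a single sampled hard label
            p = (n_dk[d] + alpha) * (n_wk[w] + beta) / (n_k + W * beta)
            k = rng.choice(K, p=p / p.sum())
            z[i] = k
            n_dk[d, k] += 1; n_wk[w, k] += 1; n_k[k] += 1
    return z, n_dk, n_wk
```

Note how the inner loop runs over every word token, whereas the BP updates above sweep only the nonzero word indices; this is the source of the complexity gap discussed next.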
Here we clearly explain that the superior performance of CVB0 can be largely attributed to its asynchronous BP implementation from the MRF perspective. Our experiments also support that message passing over word indices instead of word tokens produces comparable or even better topic modeling performance but with significantly smaller computational costs.

Eq. (21) also reveals that siBP is a probabilistic matrix factorization algorithm that factorizes the document-word matrix, $\mathbf{x} = [x_{w,d}]_{W \times D}$, into a matrix of document-specific topic proportions, $\boldsymbol{\theta} = [\theta_d(k)]_{K \times D}$, and a matrix of vocabulary word-specific topic proportions, $\boldsymbol{\phi} = [\phi_w(k)]_{K \times W}$, i.e., $\mathbf{x} \sim \boldsymbol{\phi}^T \boldsymbol{\theta}$. We see that a larger number of word counts $x_{w,d}$ corresponds to a higher likelihood $\sum_k \theta_d(k)\phi_w(k)$. From this point of view, the multinomial principal component analysis (PCA) [28] describes some intrinsic relations among LDA, PLSA [16], and non-negative matrix factorization (NMF) [29]. Eq. (21) is the same as the E-step update for PLSA except that the parameters $\theta$ and $\phi$ are smoothed by the hyperparameters $\alpha$ and $\beta$ to prevent overfitting.

VB, BP and siBP have the computational complexity $O(KDW_dT)$, but GS and CVB0 require $O(KDN_dT)$, where $W_d$ is the average vocabulary size per document, $N_d$ is the average number of word tokens per document, and $T$ is the number of learning iterations.
3 BELIEF PROPAGATION FOR ATM
Author-topic models (ATM) [3] depict each author of a document as a mixture of probabilistic topics, and have found important applications in matching papers with reviewers [30]. Fig. 5A shows the generative graphical representation of ATM, which first uses a document-specific uniform distribution $u_d$ to generate an author index $a$, $1 \le a \le A$, and then uses the author-specific topic proportions $\theta_a$ to generate a topic label $z^k_{w,d} = 1$ for the word index $w$ in the document $d$. The plate on $\theta$ indicates that there are $A$ unique authors in the corpus. A document often has multiple coauthors. ATM randomly assigns one of the observed author indices to each word in the document based on the document-specific uniform distribution $u_d$. However, it is more reasonable that each word $x_{w,d}$ is associated with an author index $a \in \mathbf{a}_d$ from a multinomial rather than a uniform distribution, where $\mathbf{a}_d$ is the set of author indices of the document $d$. As a result, each topic label takes two variables, $z^{a,k}_{w,d} \in \{0,1\}$, $\sum_{a,k} z^{a,k}_{w,d} = 1$, $a \in \mathbf{a}_d$, $1 \le k \le K$, where $a$ is the author index and $k$ is the topic index attached to the word.

We transform Fig. 5A into the factor graph representation of ATM in Fig. 5B. As with Fig. 2, we absorb the observed author index $a \in \mathbf{a}_d$ of the document $d$ as the index of the factor $\theta_{a \in \mathbf{a}_d}$. The notation $\mathbf{z}^a_{-w,\cdot}$ denotes all labels connected with the authors $a \in \mathbf{a}_d$ except those for the word index $w$. The only difference between ATM and LDA is that the author $a \in \mathbf{a}_d$ instead of the document $d$ connects the labels $z^a_{w,d}$ and $\mathbf{z}^a_{-w,\cdot}$. As a result, ATM encourages topic smoothness among the labels $z^a_{w,d}$ attached to the same author $a$ instead of the same document $d$.

Fig. 5. (A) The three-layer graphical representation [3] and (B) two-layer factor graph of ATM.

input: $\mathbf{x}, \mathbf{a}_d, K, T, \alpha, \beta$.
output: $\theta_a, \phi_w$.
$\mu^a_{w,d}(k), a \in \mathbf{a}_d \leftarrow$ initialization and normalization;
for $t \leftarrow 1$ to $T$ do
  $\mu^{a,t+1}_{w,d}(k) \propto \frac{\mu^{a,t}_{-w,-d}(k)+\alpha}{\sum_k[\mu^{a,t}_{-w,-d}(k)+\alpha]} \times \frac{\mu^t_{w,-d}(k)+\beta}{\sum_w[\mu^t_{w,-d}(k)+\beta]}$;
end
$\theta_a(k) \leftarrow [\mu^a_{\cdot,\cdot}(k)+\alpha]/\sum_k[\mu^a_{\cdot,\cdot}(k)+\alpha]$;
$\phi_w(k) \leftarrow [\mu_{w,\cdot}(k)+\beta]/\sum_w[\mu_{w,\cdot}(k)+\beta]$;

Fig. 6. The synchronous BP for ATM.
Unlike passing the $K$-tuple message $\mu_{w,d}(k)$ in Fig. 3, the BP algorithm for learning ATM passes the $|\mathbf{a}_d| \times K$-tuple message vectors $\mu^a_{w,d}(k)$, $a \in \mathbf{a}_d$, through the factor $\theta_{a \in \mathbf{a}_d}$ in Fig. 5B, where $|\mathbf{a}_d|$ is the number of authors in the document $d$. Nevertheless, we can still obtain the $K$-tuple word topic message $\mu_{w,d}(k)$ by marginalizing the message $\mu^a_{w,d}(k)$ in terms of the author variable $a \in \mathbf{a}_d$ as follows,

$$\mu_{w,d}(k) = \sum_{a \in \mathbf{a}_d} \mu^a_{w,d}(k). \quad (24)$$

Since Figs. 2 and 5B have the same right half, the message passing equation from the factor $\phi_w$ to the variable $z_{w,d}$ and the parameter estimation equation for $\phi_w$ in Fig. 5B remain the same as (7) and (13), based on the marginalized word topic message in (24). Thus, we only need to derive the message passing equation from the factor $\theta_{a \in \mathbf{a}_d}$ to the variable $z^a_{w,d}$ in Fig. 5B. Because of the topic smoothness prior, we design the factor function as follows,

$$f_{\theta_a} = \frac{1}{\sum_k [\mu^a_{-w,-d}(k) + \alpha]}, \quad (25)$$

where $\mu^a_{-w,-d}(k) = \sum_{-w,-d} x^a_{w,d}\, \mu^a_{w,d}(k)$ denotes the sum of all incoming messages attached to the author index $a$ and the topic index $k$, excluding $x^a_{w,d}\,\mu^a_{w,d}(k)$. Likewise, Eq. (25) normalizes the incoming messages attached to the author index $a$ in terms of the topic index $k$ to make the outgoing messages comparable for different authors $a \in \mathbf{a}_d$. Similar to (15), we derive the message passing $\mu_{f_{\theta_a} \to z^a_{w,d}}$ by adding all incoming messages evaluated by the factor function (25). Multiplying the two messages from the factors $\theta_{a \in \mathbf{a}_d}$ and $\phi_w$ yields the message update equation,

$$\mu^a_{w,d}(k) \propto \frac{\mu^a_{-w,-d}(k) + \alpha}{\sum_k [\mu^a_{-w,-d}(k) + \alpha]} \times \frac{\mu_{w,-d}(k) + \beta}{\sum_w [\mu_{w,-d}(k) + \beta]}. \quad (26)$$

Notice that the $|\mathbf{a}_d| \times K$-tuple message $\mu^a_{w,d}(k)$, $a \in \mathbf{a}_d$, is normalized in terms of all combinations of $\{a, k\}$, $a \in \mathbf{a}_d$, $1 \le k \le K$.
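As a small illustration of the joint normalization and the marginalization (24), the sketch below builds a $|\mathbf{a}_d| \times K$ message table at one nonzero index $\{w, d\}$, normalizes it over all $\{a, k\}$ pairs, and marginalizes out the author variable. The coauthor indices and message values are toy numbers of our own.

```python
import numpy as np

rng = np.random.default_rng(0)
authors_d = [0, 3]                  # hypothetical coauthor indices a_d
K = 4
# |a_d| x K message table mu^a_{w,d}(k) at a single index {w, d}
mu_a = rng.random((len(authors_d), K))
mu_a /= mu_a.sum()                  # normalize over all {a, k} combinations
mu_wd = mu_a.sum(axis=0)            # Eq. (24): marginalize the author variable
assert np.isclose(mu_wd.sum(), 1.0)  # K-tuple word topic message is normalized
```

The marginalized `mu_wd` plays the role of the LDA message $\mu_{w,d}(k)$ in the word-side updates (7) and (13).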
Based on the normalized messages, the author-specific topic proportion θ_a(k) can be estimated from the sum of all incoming messages, including µ^a_{w,d}, evaluated by the factor function f_{θ_a} as follows,

θ_a(k) = [µ^a_{·,·}(k) + α] / Σ_k [µ^a_{·,·}(k) + α].   (27)

As a summary, Fig. 6 shows the synchronous BP algorithm for learning ATM. The difference between Fig. 3 and Fig. 6 is that Fig. 6 additionally considers the author index a as a label for each word. The computational complexity is O(KDW_d A_d T), where A_d is the average number of authors per document.

BELIEF PROPAGATION FOR RTM
Network data, such as citation and coauthor networks of documents [30], [31], tag networks of documents and images [32], hyperlinked networks of web pages, and social networks of friends, exist pervasively in data mining and machine learning. The probabilistic relational topic modeling of network data can provide both useful predictive models and descriptive statistics [4].

In Fig. 7A, relational topic models (RTM) [4] represent entire document topics by the mean value of the document topic proportions, and use the Hadamard product of mean values z̄_d ◦ z̄_{d'} from two linked documents {d, d'} as link features, which are learned by the generalized linear model (GLM) η to generate the observed binary citation link variable c = 1. All other parts of RTM remain the same as LDA.

Fig. 7. (A) The three-layer graphical representation [4] and (B) the two-layer factor graph of RTM.

input: w, c, ξ, K, T, α, β. output: θ_d, φ_w.
µ_{w,d}(k) ← initialization and normalization;
for t ← 1 to T do
    µ^{t+1}_{w,d}(k) ∝ [(1 − ξ)µ^t_{θ_d → z_{w,d}}(k) + ξµ^t_{η_c → z_{w,d}}(k)] × µ^t_{φ_w → z_{w,d}}(k);
end
θ_d(k) ← [µ_{·,d}(k) + α] / Σ_k [µ_{·,d}(k) + α];
φ_w(k) ← [µ_{w,·}(k) + β] / Σ_w [µ_{w,·}(k) + β];

Fig. 8. The synchronous BP for RTM.

We transform Fig. 7A into the factor graph of Fig. 7B by absorbing the observed link index c ∈ c, 1 ≤ c ≤ C, as the index of the factor η_c. Each link index connects a document pair {d, d'}, and the factor η_c connects the word topic labels z_{w,d} and z_{·,d'} of the document pair. Besides encoding the topic smoothness, RTM explicitly describes the topic structural dependencies between the pair of linked documents {d, d'} using the factor function f_{η_c}(·). In Fig. 7B, the messages from the factors θ_d and φ_w to the variable z_{w,d} are the same as those for LDA in (15) and (16).
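The combination step in Fig. 8 — a weighted sum of the document and link messages, multiplied by the word message — can be sketched as below. This is a minimal illustration: the three incoming message arrays are assumed precomputed, the function name is our own, and the value of ξ is illustrative.

```python
import numpy as np

def rtm_message_update(mu_theta, mu_eta, mu_phi, xi=0.15):
    """One synchronous RTM message update per Fig. 8.
    Each input is an (N, K) array of incoming messages for the N nonzero
    document-word elements: mu_theta from the document factor, mu_eta from
    the link factor, mu_phi from the word factor. xi balances document vs.
    link messages; with xi = 0 this reduces to the LDA update (7)."""
    mu = ((1.0 - xi) * mu_theta + xi * mu_eta) * mu_phi
    return mu / mu.sum(axis=1, keepdims=True)        # normalize over topics k
```

Setting ξ = 0 recovers the plain product of document and word messages, which is exactly the LDA behavior the text describes.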
Thus, we only need to derive the message passing equation from the factor η_c to the variable z_{w,d}. We design the factor function f_{η_c}(·) for linked documents as follows,

f_{η_c}(k|k') = Σ_{{d,d'}} µ_{·,d}(k)µ_{·,d'}(k') / Σ_{{d,d'},k'} µ_{·,d}(k)µ_{·,d'}(k'),   (28)

which depicts the likelihood of the topic label k being assigned to the document d when its linked document d' is associated with the topic label k'. Notice that the designed factor function does not follow the GLM for link modeling in the original RTM [4], because the GLM makes inference slightly more complicated. However, similar to the GLM, Eq. (28) is also able to capture the topic interactions between two linked documents {d, d'} in document networks. Instead of the smoothness prior encoded by the factor functions (19) and (20), it describes arbitrary topic dependencies {k, k'} of linked documents {d, d'}.

Based on the factor function (28), we resort to the sum-product algorithm to calculate the message,

µ_{η_c → z_{w,d}}(k) = Σ_{d'} Σ_{k'} f_{η_c}(k|k') µ_{·,d'}(k'),   (29)

where we use the sum rather than the product of messages from all linked documents d' to avoid arithmetic underflow. The standard sum-product algorithm requires the product of all messages from factors to variables. However, in practice, the direct product operation cannot balance the messages from different sources. For example, the message µ_{θ_d → z_{w,d}} is from the neighboring words within the same document d, while the message µ_{η_c → z_{w,d}} is from all linked documents d'. If we pass the product of these two types of messages, we cannot distinguish which one influences the topic label z_{w,d} more. Hence, we use the weighted sum of the two types of messages,

µ(z_{w,d} = k) ∝ [(1 − ξ)µ_{θ_d → z_{w,d}}(k) + ξµ_{η_c → z_{w,d}}(k)] × µ_{φ_w → z_{w,d}}(k),   (30)

where ξ ∈ [0, 1] is the weight that balances the two messages µ_{θ_d → z_{w,d}} and µ_{η_c → z_{w,d}}. When there is no link information, ξ = 0 and Eq. (30) reduces to (7), so that RTM reduces to LDA. Fig. 8 shows the synchronous BP algorithm for learning RTM. Given the inferred messages, the parameter estimation equations remain the same as (12) and (13). The computational complexity is O(KCDW_d T), where C is the total number of links in the document network.

TABLE 2. Summarization of the four document data sets.
Data sets: CORA, MEDL, NIPS, BLOG (columns D, A, W, C, N_d, W_d; NIPS has no links C, BLOG has no authors A).

EXPERIMENTS
We use four large-scale document data sets:
1) CORA [33] contains abstracts from the CORA research paper search engine in the machine learning area, where the documents can be classified into several major categories.
2) MEDL [34] contains abstracts from the MEDLINE biomedical paper search engine, where the documents fall broadly into several categories.
3) NIPS [35] includes papers from the conference "Neural Information Processing Systems", where all papers are grouped into several categories. NIPS has no citation link information.
4) BLOG [36] contains a collection of political blogs on the subject of American politics in the year 2008, where all blogs can be broadly classified into several categories. BLOG has no author information.
Table 2 summarizes the statistics of the four data sets, where D is the total number of documents, A is the total number of authors, W is the vocabulary size, C is the total number of links between documents, N_d is the average number of words per document, and W_d is the average vocabulary size per document.

We compare BP with two commonly-used LDA learning algorithms, VB [1] (here we use Blei's implementation of digamma functions) and GS [2], under the same fixed hyperparameters α = β = 0.01. We use MATLAB C/C++ MEX-implementations for all these algorithms, and carry out the experiments on a common PC. With the goal of repeatability, we have made our source codes and data sets publicly available [37].

To examine the convergence property of BP, we use the entire data set as the training set, and calculate the training perplexity [1] at regular intervals over 1000 training iterations from the same initialization. Fig. 9 shows that the training perplexity of BP generally decreases rapidly as the number of training iterations increases. In our experiments, BP on average converges when the difference of training perplexity between two successive iterations is less than one. Although this paper does not

1. ∼blei/lda-c/index.html
2. http://psiexp.ss.uci.edu/research/programs data/toolbox.htm

theoretically prove that BP will definitely converge to a fixed point, the resemblance among VB, GS and BP in subsection 2.5 implies that a similar underlying principle should ensure BP to converge on the sparse word vector spaces of real-world applications. Further analysis reveals that BP on average uses more training iterations until convergence than VB, but far fewer than GS, on the four data sets. The fast convergence rate is a desirable property as far as online [21] and distributed [38] topic modeling for large-scale corpora are concerned.

The predictive perplexity for the unseen test set is computed as follows [1], [7]. To ensure that all algorithms achieve a local optimum, we use a fixed, sufficiently large number of training iterations to estimate φ on the training set from the same initialization. In practice, this number of training iterations is large enough for the convergence of all algorithms in Fig. 9. We randomly partition each document in the test set into two subsets. We run the learning algorithms to estimate θ from the same initialization while holding φ fixed on one subset, and then calculate the predictive perplexity on the remaining subset,

P = exp{ −[Σ_{w,d} x_{w,d} log( Σ_k θ_d(k) φ_w(k) )] / Σ_{w,d} x_{w,d} },   (31)

where x_{w,d} denotes the word counts in the held-out subset. Notice that the perplexity (31) is based on the marginal probability of the word topic label µ_{w,d}(k) in (21).

Fig. 10 shows the predictive perplexity (average ± standard deviation) from five-fold cross-validation for different numbers of topics, where lower perplexity indicates better generalization ability for the unseen test set. Consistently, BP has the lowest perplexity for different numbers of topics on all four data sets, which confirms its effectiveness for learning LDA. On average, BP achieves noticeably lower perplexity than both VB and GS. Fig.
11 shows that BP uses less training time than both VB and GS. We show only 0.3 times the real training time of VB because of its time-consuming digamma functions. In fact, VB runs as fast as BP if we remove the digamma functions, so we believe that it is the digamma functions that slow down VB in learning LDA. BP is faster than GS because it computes messages only for word indices. The speed difference is largest on the NIPS set due to its largest ratio N_d/W_d in Table 2. Although VB converges rapidly thanks to the digamma functions, it often consumes roughly triple the training time. Therefore, BP on average enjoys the highest efficiency for learning LDA with regard to the balance of convergence rate and training time.

We also compare six BP implementations, namely siBP, BP and CVB0 [7], using both synchronous and asynchronous update schedules. We name the three synchronous implementations s-BP, s-siBP and s-CVB0, and the three asynchronous implementations a-BP, a-siBP
Fig. 9. Training perplexity as a function of the number of iterations when K = 50 on CORA, MEDL, NIPS and BLOG.

Fig. 10. Predictive perplexity as a function of the number of topics on CORA, MEDL, NIPS and BLOG.
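The predictive perplexity of Eq. (31) can be computed directly from the held-out counts and the estimated parameters. A small sketch, assuming dense arrays and our own function name:

```python
import numpy as np

def predictive_perplexity(x, theta, phi):
    """Eq. (31): x is a (W, D) matrix of held-out word counts, theta a
    (D, K) matrix of document topic proportions (rows sum to one), and
    phi a (W, K) matrix of topic-word distributions (columns sum to one)."""
    p = phi @ theta.T                           # (W, D): sum_k theta_d(k) phi_w(k)
    mask = x > 0                                # only nonzero counts contribute
    log_lik = (x[mask] * np.log(p[mask])).sum()
    return np.exp(-log_lik / x.sum())
```

As a sanity check, uniform θ and φ give p = 1/W for every word, so the perplexity equals the vocabulary size W, the perplexity of random guessing.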
Fig. 11. Training time as a function of the number of topics on CORA, MEDL, NIPS and BLOG. For VB, the plot shows 0.3 times the real learning time, denoted by 0.3x.

and a-CVB0. Because these six belief propagation implementations produce comparable perplexity, we show the relative perplexity that subtracts the mean value of the six implementations in Fig. 12. Overall, the asynchronous schedule gives slightly lower perplexity than the synchronous schedule because it passes messages faster and more efficiently. Except on the CORA set, siBP generally yields the highest perplexity because it introduces subtle biases in computing messages at each iteration; the biased messages are propagated and accumulated, leading to inaccurate parameter estimation. Although the proposed BP achieves lower perplexity than CVB0 on the NIPS set, both of them work comparably well on the other sets. But BP is much faster because it computes messages over word indices. The comparable results also confirm our assumption that topic modeling can be performed efficiently on word indices instead of tokens.

To measure the interpretability of a topic model, word intrusion and topic intrusion tasks have been proposed to involve subjective judgements [39]. The basic idea is to ask volunteer subjects to identify the number of word intruders in a topic as well as the topic intruders in a document, where intruders are defined as inconsistent words or topics based on the prior knowledge of the subjects. Fig. 13 shows the top ten words of K = 10 topics inferred by the VB, GS and BP algorithms on the NIPS set. We find no obvious difference with respect to word intrusions in each topic. Most topics share similar top ten words, but with different ranking orders. Despite significant perplexity differences, the topics extracted by the three algorithms retain almost the same interpretability, at least for the top ten words.
This result coincides with [39] in that lower perplexity may not enhance the interpretability of inferred topics. A similar phenomenon has also been observed in MRF-based image labeling problems [20]: different MRF inference algorithms, such as graph cuts and BP, often yield comparable results. Although one inference method may find more optimal MRF solutions, this does not necessarily translate into better performance compared with the ground truth. The underlying hypothesis is that the ground-truth labeling configuration is often less optimal
Fig. 12. Relative predictive perplexity as a function of the number of topics on CORA, MEDL, NIPS and BLOG.
Topic 1:
recognition image system images network training speech figure model set
image images figure object feature features recognition space visual distance
recognition image images training system speech set feature features word
Topic 2:
network networks units input neural learning hidden output training unit
network networks neural input units output learning hidden layer weights
network units networks input output neural hidden learning unit layer
Topic 3:
analog chip circuit figure neural input output time system network
analog circuit chip output figure signal input neural time system
analog neural circuit chip figure output input time system signal
Topic 4:
time model neurons neuron spike synaptic activity input firing information
model neurons cells neuron cell visual activity response input stimulus
neurons time model neuron synaptic spike cell activity input firing
Topic 5:
model visual figure cells motion direction input spatial field orientation
time noise dynamics order results point model system values figure
model visual figure motion field direction spatial cells image orientation
Topic 6:
function functions algorithm linear neural matrix learning space networks data
function functions number set algorithm theorem tree bound learning class
function functions algorithm set theorem linear number vector case space
Topic 7:
learning network error neural training networks time function weight model
function learning error algorithm training linear vector data set space
learning error neural network function weight training networks time gradient
Topic 8:
data training set error algorithm learning function class examples classification
training data set performance classification recognition test class error speech
training data set error learning performance test neural number classification
Topic 9:
model data models distribution probability parameters gaussian algorithm likelihood mixture
model data models distribution gaussian probability parameters likelihood mixture algorithm
model data distribution models gaussian algorithm probability parameters likelihood mixture
Topic 10:
learning state time control function policy reinforcement action algorithm optimal
learning state control time model policy action reinforcement system states
learning state time control policy function action reinforcement algorithm model

Fig. 13. Top ten words of K = 10 topics of VB (first line), GS (second line), and BP (third line) on NIPS.

than solutions produced by inference algorithms. For example, if we manually label the topics for a corpus, the final perplexity is often higher than that of the solutions returned by VB, GS and BP. For each document, LDA provides an equal number of topics K, but the ground truth often uses an unequal number of topics to explain the observed words, which may be another reason why the overall perplexity of learned LDA is often lower than that of the ground truth. To test this hypothesis, we compare the perplexity of labeled LDA (L-LDA) [40] with LDA in Fig. 14. L-LDA is a supervised LDA that restricts the hidden topics to the observed class labels of each document. When a document has multiple class labels, L-LDA automatically assigns one of the class labels to each word index. In this way, L-LDA resembles the process of manual topic labeling by humans, and its solution can be viewed as close to the ground truth. For a fair comparison, we set the number of topics K of LDA for CORA, MEDL, NIPS and BLOG according to the number of document categories in each set (e.g., K = 7 for CORA). Both L-LDA and LDA are trained by BP for the same number of iterations from the same initialization. Fig. 14 confirms that L-LDA produces higher perplexity than LDA, which partly supports that the ground truth often yields higher perplexity than the optimal solutions of LDA inferred by BP. The underlying reason may be
Fig. 14. Perplexity of L-LDA and LDA on the four data sets.

that the three topic modeling rules encoded by LDA are still too simple to capture human behaviors in finding topics. Under this situation, improving the formulation of topic models such as LDA is better than improving inference algorithms to enhance topic modeling performance significantly. Although proper settings of hyperparameters can make the predictive perplexity comparable for all state-of-the-art approximate inference algorithms [7], we still advocate BP because it is faster and more accurate than both VB and GS, even if they all can provide comparable perplexity and interpretability under proper settings of hyperparameters.
The GS algorithm for learning ATM is implemented in the MATLAB topic modeling toolbox. We compare BP and GS for learning ATM using the same number of training iterations. Fig. 15 shows the predictive perplexity (average ± standard deviation) from five-fold cross-validation. On average, BP achieves lower perplexity than GS, which is consistent with Fig. 10. Another possible reason for such improvements may be our assumption that all coauthors of a document account for the word topic label using multinomial instead of uniform probabilities.

The GS algorithm for learning RTM is implemented in the R package. We compare BP with GS for learning RTM using the same number of iterations on the training data set. Based on the training perplexity, we manually set the weight ξ in Fig. 8 to achieve the overall best performance on the four data sets. Fig. 16 shows the predictive perplexity (average ± standard deviation) from five-fold cross-validation. On average, BP achieves lower perplexity than GS. Because the original RTM learned by GS is inflexible in balancing information from different sources, it has slightly higher
3. http://psiexp.ss.uci.edu/research/programs data/toolbox.htm
4. http://cran.r-project.org/web/packages/lda/

perplexity than LDA (Fig. 10). To circumvent this problem, we introduce the weight ξ in (30) to balance the two types of messages, so that the learned RTM gains lower perplexity than LDA. Future work will estimate the balancing weight ξ based on feature selection or MRF structure learning techniques.

We also examine the link prediction performance of RTM. We define link prediction as a binary classification problem. As in [4], we use the Hadamard product of a pair of document topic proportions as the link feature, and train an SVM [41] to decide whether there is a link between the two documents. Notice that the original RTM [4] learned by the GS algorithm uses the GLM to predict links. Fig. 17 compares the F-measure (average ± standard deviation) of link prediction from five-fold cross-validation. Encouragingly, BP provides a significantly higher F-measure than GS on average. These results confirm the effectiveness of BP in capturing accurate topic structural dependencies in document networks.

CONCLUSIONS
First, this paper has presented a novel factor graph representation of LDA within the MRF framework. Not only does MRF solve topic modeling as a labeling problem, but it also facilitates BP algorithms for approximate inference and parameter estimation in three steps:
1) We absorb {w, d} as indices of factors, which connect hidden variables such as topic labels in the neighborhood system.
2) We design proper factor functions to encourage or penalize different local topic labeling configurations in the neighborhood system.
3) We develop the approximate inference and parameter estimation algorithms within the message passing framework.
The BP algorithm is easy to implement and computationally efficient, and it is faster and more accurate than the other two approximate inference methods, VB [1] and GS [2], in several topic modeling tasks of broad interest. Furthermore, the superior performance of the BP algorithm for learning ATM [3] and RTM [4] confirms its potential effectiveness in learning other LDA extensions.

Second, as the main contribution of this paper, the proper definition of neighborhood systems as well as the design of factor functions can interpret the three-layer LDA by the two-layer MRF in the sense that they encode the same joint probability. Since probabilistic topic modeling is essentially a word annotation paradigm, the opened MRF perspective may inspire us to use other MRF-based image segmentation [22] or data clustering [17] algorithms for LDA-based topic models.

Finally, the scalability of BP is an important issue in our future work. As with VB and GS, the BP algorithm has a linear complexity with respect to the number of documents D and the number of topics K. We may extend the proposed BP algorithm for online [21] and distributed [38] learning of LDA, where the former incrementally learns
Fig. 15. Predictive perplexity as a function of number of topics for ATM on CORA, MEDL and NIPS.
Fig. 16. Predictive perplexity as a function of number of topics for RTM on CORA, MEDL and BLOG.
Fig. 17. F-measure of link prediction as a function of number of topics on CORA, MEDL and BLOG.

parts of D documents in data streams and the latter learns parts of D documents on distributed computing units. Since the K-tuple messages are often sparse [42], we may also pass only the salient parts of the K-tuple messages, or update only the informative parts of the messages at each learning iteration, to speed up the whole message passing process.

ACKNOWLEDGEMENTS
This work is supported by NSFC (Grant No. 61003154), the Shanghai Key Laboratory of Intelligent Information Processing, China (Grant No. IIPL-2010-009), and a grant from Baidu.

REFERENCES

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[2] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proc. Natl. Acad. Sci., vol. 101, pp. 5228–5235, 2004.
[3] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, "The author-topic model for authors and documents," in UAI, 2004, pp. 487–494.
[4] J. Chang and D. M. Blei, "Hierarchical relational models for document networks," Annals of Applied Statistics, vol. 4, no. 1, pp. 124–150, 2010.
[5] T. Minka and J. Lafferty, "Expectation-propagation for the generative aspect model," in UAI, 2002, pp. 352–359.
[6] Y. W. Teh, D. Newman, and M. Welling, "A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation," in NIPS, 2007, pp. 1353–1360.
[7] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, "On smoothing and inference for topic models," in UAI, 2009, pp. 27–34.
[8] I. Pruteanu-Malinici, L. Ren, J. Paisley, E. Wang, and L. Carin, "Hierarchical Bayesian modeling of topics in time-stamped documents," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 6, pp. 996–1011, 2010.
[9] X. G. Wang, X. X. Ma, and W. E. L. Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 3, pp. 539–555, 2009.
[10] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 498–519, 2001.
[11] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[12] J. Zeng and Z.-Q. Liu, "Markov random field-based statistical character structure modeling for handwritten Chinese character recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 767–780, 2008.
[13] S. Z. Li, Markov Random Field Modeling in Image Analysis. New York: Springer-Verlag, 2001.
[14] G. Heinrich, "Parameter estimation for text analysis," University of Leipzig, Tech. Rep., 2008.
[15] M. Welling, M. Rosen-Zvi, and G. Hinton, "Exponential family harmoniums with an application to information retrieval," in NIPS, 2004, pp. 1481–1488.
[16] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, pp. 177–196, 2001.
[17] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.
[18] J. M. Hammersley and P. Clifford, "Markov field on finite graphs and lattices," unpublished, 1971.
[19] A. U. Asuncion, "Approximate mean field for Dirichlet-based models," in ICML Workshop on Topic Models, 2010.
[20] M. F. Tappen and W. T. Freeman, "Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters," in ICCV, 2003, pp. 900–907.
[21] M. Hoffman, D. Blei, and F. Bach, "Online learning for latent Dirichlet allocation," in NIPS, 2010, pp. 856–864.
[22] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
[23] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1–38, 1977.
[24] J. Zeng, L. Xie, and Z.-Q. Liu, "Type-2 fuzzy Gaussian mixture models," Pattern Recognition, vol. 41, no. 12, pp. 3636–3643, 2008.
[25] J. Zeng and Z.-Q. Liu, "Type-2 fuzzy hidden Markov models and their application to speech recognition," IEEE Trans. Fuzzy Syst., vol. 14, no. 3, pp. 454–467, 2006.
[26] J. Zeng and Z.-Q. Liu, "Type-2 fuzzy Markov random fields and their application to handwritten Chinese character recognition," IEEE Trans. Fuzzy Syst., vol. 16, no. 3, pp. 747–760, 2008.
[27] J. Winn and C. M. Bishop, "Variational message passing," J. Mach. Learn. Res., vol. 6, pp. 661–694, 2005.
[28] W. L. Buntine, "Variational extensions to EM and multinomial PCA," in ECML, 2002, pp. 23–34.
[29] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.
[30] J. Zeng, W. K. Cheung, C.-H. Li, and J. Liu, "Coauthor network topic models with application to expert finding," in IEEE/WIC/ACM WI-IAT, 2010, pp. 366–373.
[31] J. Zeng, W. K.-W. Cheung, C.-H. Li, and J. Liu, "Multirelational topic models," in ICDM, 2009, pp. 1070–1075.
[32] J. Zeng, W. Feng, W. K. Cheung, and C.-H. Li, "Higher-order Markov tag-topic models for tagged documents and images," arXiv:1109.5370v1 [cs.CV], 2011.
[33] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the construction of internet portals with machine learning," Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.
[34] S. Zhu, J. Zeng, and H. Mamitsuka, "Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity," Bioinformatics, vol. 25, no. 15, pp. 1944–1951, 2009.
[35] A. Globerson, G. Chechik, F. Pereira, and N. Tishby, "Euclidean embedding of co-occurrence data," J. Mach. Learn. Res., vol. 8, pp. 2265–2295, 2007.
[36] J. Eisenstein and E. Xing, "The CMU 2008 political blog corpus," Carnegie Mellon University, Tech. Rep., 2010.
[37] J. Zeng, "TMBP: A topic modeling toolbox using belief propagation," arXiv:1201.0838v1 [cs.LG], 2012.
[38] K. Zhai, J. Boyd-Graber, and N. Asadi, "Using variational inference and MapReduce to scale topic modeling," arXiv:1107.3765v1 [cs.AI], 2011.
[39] J. Chang, J. Boyd-Graber, S. Gerris, C. Wang, and D. Blei, "Reading tea leaves: How humans interpret topic models," in NIPS, 2009, pp. 288–296.
[40] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora," in Empirical Methods in Natural Language Processing, 2009, pp. 248–256.
[41] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011.
[42] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling, "Fast collapsed Gibbs sampling for latent Dirichlet allocation," in