Scalable Text and Link Analysis with Mixed-Topic Link Models
Yaojia Zhu
University of New Mexico
[email protected]

Xiaoran Yan
University of New Mexico
[email protected]

Lise Getoor
University of Maryland
[email protected]

Cristopher Moore
Santa Fe Institute
[email protected]
ABSTRACT
Many data sets contain rich information about objects, as well as pairwise relations between them. For instance, in networks of websites, scientific papers, and other documents, each node has content consisting of a collection of words, as well as hyperlinks or citations to other nodes. In order to perform inference on such data sets, and make predictions and recommendations, it is useful to have models that are able to capture the processes which generate the text at each node and the links between them. In this paper, we combine classic ideas in topic modeling with a variant of the mixed-membership block model recently developed in the statistical physics community. The resulting model has the advantage that its parameters, including the mixture of topics of each document and the resulting overlapping communities, can be inferred with a simple and scalable expectation-maximization algorithm. We test our model on three data sets, performing unsupervised topic classification and link prediction. For both tasks, our model outperforms several existing state-of-the-art methods, achieving higher accuracy with significantly less computation, analyzing a data set with 1.3 million words and 44 thousand links in a few minutes.
Keywords
Document classification, Community detection, Topic modeling, Link prediction, Stochastic block model
1. INTRODUCTION
Many modern data sets contain not only rich information about each object, but also pairwise relationships between them, forming networks where each object is a node and links represent the relationships. In document networks, for example, each node is a document containing a sequence of words, and the links between nodes are citations or hyperlinks. Both the content of the documents and the topology of the links between them are meaningful.

Over the past few years, two disparate communities have been approaching these data sets from different points of view. In the data mining community, the goal has been to augment traditional approaches to learning and data mining by including relations between objects [15, 33]: for instance, to use the links between documents to help us label them by topic. In the network community, including its subset in statistical physics, the goal has been to augment traditional community structure algorithms such as the stochastic block model [14, 20, 30] by taking node attributes into account: for instance, to use the content of documents, rather than just the topological links between them, to help us understand their community structure.

In the original stochastic block model, each node has a discrete label, assigning it to one of k communities. These labels, and the k × k matrix of probabilities with which a given pair of nodes with a given pair of labels have a link between them, can be inferred using Monte Carlo algorithms (e.g. [26]) or, more efficiently, with belief propagation [12, 11] or pseudolikelihood approaches [7]. However, in real networks communities often overlap, and a given node can belong to multiple communities. This led to the mixed-membership block model [1], where the goal is to infer, for each node v, a distribution or mixture of labels θ_v describing to what extent it belongs to each community.
If we assume that links are assortative, i.e., that nodes are more likely to link to others in the same community, then the probability of a link between two nodes v and v′ depends on some measure of similarity (say, the inner product) of θ_v and θ_v′.

These mixed-membership block models fit nicely with classic ideas in topic modeling. In models such as Probabilistic Latent Semantic Analysis (plsa) [19] and Latent Dirichlet Allocation (lda) [4], each document d has a mixture θ_d of topics. Each topic corresponds in turn to a probability distribution over words, and each word in d is generated independently from the resulting mixture of distributions. If we think of θ_d as both the mixture of topics for generating words and the mixture of communities for generating links, then we can infer {θ_d} jointly from the documents' content and the presence or absence of links between them.

There are many possible such models, and we are far from the first to think along these lines. Our innovation is to take as our starting point a particular mixed-membership block model recently developed in the statistical physics community [2], which we refer to as the bkn model. It differs from the mixed-membership stochastic block model (mmsb) of [1] in several ways:

1. The bkn model treats the community membership mixtures θ_d directly as parameters to be inferred. In contrast, mmsb treats θ_d as hidden variables generated by a Dirichlet distribution, and infers the hyperparameters of that distribution. The situation between plsa and lda is similar; plsa infers the topic mixtures θ_d, while lda generates them from a Dirichlet distribution.

2. The mmsb model generates each link according to a Bernoulli distribution, with an extra parameter for sparsity. Instead, bkn treats the links as a random multigraph, where the number of links A_dd′ between each pair of nodes is Poisson-distributed.
As a result, the derivatives of the log-likelihood with respect to θ_d and the other parameters are particularly simple. These two factors make it possible to fit the bkn model using an efficient and exact expectation-maximization (EM) algorithm, making its inference highly scalable. The bkn model has another advantage as well:

3. The bkn model is degree-corrected, in that it takes the observed degrees of the nodes into account when computing the expected number of edges between them. Thus it recognizes that two documents that have very different degrees might in fact have the same mix of topics; one may simply be more popular than the other.

In our work, we use a slight variant of the bkn model to generate the links, and we use plsa to generate the text. We present an EM algorithm for inferring the topic mixtures and other parameters. (While we do not impose a Dirichlet prior on the topic mixtures, it is easy to add a corresponding term to the update equations.) Our algorithm is scalable in the sense that each iteration takes O(K(N + M + R)) time for networks with K topics, N documents, and M links, where R is the sum over documents of the number of distinct words appearing in each one. In practice, our EM algorithm converges within a small number of iterations, making the total running time linear in the size of the corpus.

Our model can be used for a variety of learning and generalization tasks, including document classification or link prediction. For document classification, we can obtain hard labels for each document by taking its most-likely topic with respect to θ_d, and optionally improve these labels further with local search. For link prediction, we train the model using a subset of the links, and then ask it to rank the remaining pairs of documents according to the probability of a link between them. For each task we determine the optimal relative weight of the content vs.
the link information.

We performed experiments on three real-world data sets, with thousands of documents and millions of words. Our results show that our algorithm is more accurate, and considerably faster, than previous techniques for both document classification and link prediction.

The rest of the paper is organized as follows. Section 2 describes our generative model, and compares it with related models in the literature. Section 3 gives our EM algorithm and analyzes its running time. Section 4 contains our experimental results for document classification and link prediction, comparing our accuracy and running time with other techniques. In Section 5, we conclude, and offer some directions for further work.
2. OUR MODEL AND PREVIOUS WORK
In this section, we give our proposed model, which we call the Poisson mixed-topic link model (pmtlm), and its degree-corrected variant pmtlm-dc.

Consider a network of N documents. Each document d has a fixed length L_d, and consists of a string of words w_dℓ for 1 ≤ ℓ ≤ L_d, where 1 ≤ w_dℓ ≤ W and W is the number of distinct words. In addition, each pair of documents d, d′ has an integer number of links connecting them, giving an adjacency matrix A_dd′. There are K topics, which play the dual role of the overlapping communities in the network. Our model generates both the content {w_dℓ} and the links {A_dd′} as follows.

We generate the content using the plsa model [19]. Each topic z is associated with a probability distribution β_z over words, and each document has a probability distribution θ_d over topics. For each document 1 ≤ d ≤ N and each 1 ≤ ℓ ≤ L_d, we independently

1. choose a topic z = z_dℓ ∼ Multi(θ_d), and
2. choose the word w_dℓ ∼ Multi(β_z).

Thus the total probability that w_dℓ is a given word w is

    Pr[w_dℓ = w] = Σ_{z=1}^{K} θ_dz β_zw .    (1)

We assume that the number of topics K is fixed. The distributions β_z and θ_d are parameters to be inferred.

We generate the links using a version of the Ball-Karrer-Newman (bkn) model [2]. Each topic z is associated with a link density η_z. For each pair of documents d, d′ and each topic z, we independently generate a number of links which is Poisson-distributed with mean θ_dz θ_d′z η_z. Since the sum of independent Poisson variables is Poisson, the total number of links between d and d′ is distributed as

    A_dd′ ∼ Poi( Σ_z θ_dz θ_d′z η_z ) .    (2)

Since A_dd′ can exceed 1, this gives a random multigraph. In the data sets we study below, A_dd′ is 1 or 0 depending on whether d cites d′, giving a simple graph.
On the other hand, in the sparse case the event that A_dd′ > 1 has negligible probability. The fact that A_dd′ is Poisson-distributed rather than Bernoulli makes the derivatives of the likelihood with respect to the parameters θ_dz and η_z very simple, allowing us to write down an efficient EM algorithm for inferring them.

Figure 1: Graphical models for link generation: (a) link-lda, (b) c-pldc, (c) rtm, (d) pmtlm-dc.
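To make the generative process concrete, here is a minimal sketch that samples a toy corpus from the pmtlm model: topics and words via the plsa step of Eq. (1), and links via the Poisson multigraph of Eq. (2). It is written in Python with NumPy; all dimensions and parameter values are illustrative choices of ours, not taken from the paper's data sets.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dimensions (illustrative only)
N, W, K = 30, 50, 3

# Model parameters: topic mixtures theta, per-topic word distributions beta,
# and per-topic link densities eta
theta = rng.dirichlet(np.ones(K), size=N)   # N x K, rows sum to 1
beta = rng.dirichlet(np.ones(W), size=K)    # K x W, rows sum to 1
eta = np.full(K, 4.0)                       # link density per topic

# Content: for each word slot, pick a topic z ~ Multi(theta_d),
# then a word w ~ Multi(beta_z), as in Eq. (1)
L_d = 100                                   # words per document (toy value)
docs = []
for d in range(N):
    z = rng.choice(K, size=L_d, p=theta[d])
    docs.append(np.array([rng.choice(W, p=beta[zi]) for zi in z]))

# Links: A_dd' ~ Poi(sum_z theta_dz theta_d'z eta_z), Eq. (2)
mean = (theta * eta) @ theta.T              # N x N matrix of Poisson means
A = rng.poisson(mean)
A = np.triu(A, 1)
A = A + A.T                                 # undirected multigraph, no self-loops
```

Note that some entries of A can exceed 1: the model is a multigraph, and only in the sparse regime does it look like a simple graph.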
This version of the model assumes that links are assortative, i.e., that links between documents only form to the extent that they belong to the same topic. One can easily generalize the model to include disassortative links as well, replacing η_z with a matrix η_zz′ that allows documents with distinct topics z, z′ to link [2].

We also consider degree-corrected versions of this model, where in addition to its topic mixture θ_d, each document has a propensity S_d of forming links. In that case,

    A_dd′ ∼ Poi( S_d S_d′ Σ_z θ_dz θ_d′z η_z ) .    (3)

We call this variant the Poisson Mixed-Topic Link Model with Degree Correction (pmtlm-dc).

Most models for document networks generate content using either plsa [19], as we do, or lda [4]. The distinction is that plsa treats the document mixtures θ_d as parameters, while in lda they are hidden variables, integrated over a Dirichlet distribution. As we show in Section 3, our approach gives a simple, exact EM algorithm, avoiding the need for sampling or variational methods. While we do not impose a Dirichlet prior on θ_d in this paper, it is easy to add a corresponding term to the update equations for the EM algorithm, with no loss of efficiency.

There are a variety of methods in the literature to generate links between documents. phits-plsa [10], link-lda [13] and link-plsa-lda [27] use the phits [9] model for link generation. phits treats each document as an additional term in the vocabulary, so two documents are similar if they link to the same documents. This is analogous to a mixture model for networks studied in [28]. In contrast, block models like ours treat documents as similar if they link to similar documents, as opposed to literally the same ones. The pairwise link-lda model [27], like ours, generates the links with a mixed-topic block model, although as in mmsb [1] and lda [4] it treats the θ_d as hidden variables integrated over a Dirichlet prior.
They fit their model with a variational method that requires N² parameters, making it less scalable than our approach.

In the c-pldc model [32], the link probability from d to d′ is determined by their topic mixtures θ_d, θ_d′ and the popularity t_d′ of d′, which is drawn from a Gamma distribution with hyperparameters a and b. Thus t_d′ plays a role similar to the degree-correcting parameter S_d′ in our model, although we correct for the degree of d as well. However, c-pldc does not generate the content, but takes it as given.

The Relational Topic Model (rtm) [5, 6] assumes that the link probability between d and d′ depends on the topics of the words appearing in their text. In contrast, our model uses the underlying topic mixtures θ_d to generate both the content and the links. Like our model, rtm defines the similarity of two documents as a weighted inner product of their topic mixtures; however, in rtm the probability of a link is a nonlinear function of this similarity, which can be logistic, exponential, or normal.

Although it deals with a slightly different kind of data set, our model is closest in spirit to the Latent Topic Hypertext Model (lthm) [18]. This is a generative model for hypertext networks, where each link from d to d′ is associated with a specific word w in d. If we sum over all words in d, the total number of links A_dd′ from d to d′ that lthm would generate follows a binomial distribution

    A_dd′ ∼ Bin( L_d , λ_d′ Σ_z θ_dz θ_d′z ) ,    (4)

where λ_d′ is, in our terms, a degree-correction parameter. When L_d is large this becomes a Poisson distribution with mean L_d λ_d′ Σ_z θ_dz θ_d′z. Our model differs from this in two ways: our parameters η_z give a link density associated with each topic z, and our degree correction S_d does not assume that the number of links from d is proportional to its length.

We briefly mention several other approaches.
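The limiting claim above, that the binomial link count of lthm approaches a Poisson distribution when L_d is large, is easy to verify numerically. In this sketch the success probability p stands in for λ_d′ Σ_z θ_dz θ_d′z, with toy values of our own choosing:

```python
import math

# Check: Bin(L, p) -> Poi(L p) for large L and small p.
L = 1000
p = 1.8 / L          # so the Poisson mean is L * p = 1.8
mean = L * p

def binom_pmf(k):
    return math.comb(L, k) * p**k * (1 - p)**(L - k)

def poisson_pmf(k):
    return math.exp(-mean) * mean**k / math.factorial(k)

# Total variation distance between the two distributions; it is small
# (Le Cam's inequality bounds it by L * p^2), so in this regime the
# binomial link count of lthm behaves like the Poisson used by pmtlm.
total_variation = 0.5 * sum(abs(binom_pmf(k) - poisson_pmf(k))
                            for k in range(30))
```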
The authors of [16] extend the probabilistic relational model (prm) framework and propose a unified generative model for both content and links in a relational structure. In [24], the authors propose a link-based model that describes both node attributes and links. The htm model [31] treats links as fixed rather than generating them, and only generates the text. Finally, the lmmg model [22] treats the appearance or absence of a word as a binary attribute of each document, and uses a logistic or exponential function of these attributes to determine the link probabilities.

In Section 4 below, we compare our model to phits-plsa, link-lda, c-pldc, and rtm. Graphical models for the link generation components of these models, and ours, are shown in Figure 1.
3. A SCALABLE EM ALGORITHM
Here we describe an efficient expectation-maximization algorithm to find the maximum-likelihood estimates of the parameters of our model. Each update takes O(K(N + M + R)) time for a document network with K topics, N documents, and M links, where R is the sum over the documents of the number of distinct words in each one. Thus the running time per iteration is linear in the size of the corpus. For simplicity we describe the algorithm for the simpler version of our model, pmtlm. The algorithm for the degree-corrected version, pmtlm-dc, is similar.

Let C_dw denote the number of times a word w appears in document d. From (1), the log-likelihood of d's content is

    L^content_d = log P(w_d1, ..., w_dL_d | θ_d, β) = Σ_{w=1}^{W} C_dw log( Σ_{z=1}^{K} θ_dz β_zw ) .    (5)

Similarly, from (2), the log-likelihood for the links A_dd′ is

    L^links = log P(A | θ, η) = (1/2) Σ_{dd′} A_dd′ log( Σ_z θ_dz θ_d′z η_z ) − (1/2) Σ_{dd′} Σ_z θ_dz θ_d′z η_z .    (6)

We ignore the constant term −Σ_{dd′} log A_dd′! from the denominator of the Poisson distribution, since it has no bearing on the parameters.

While we can use the total likelihood Σ_d L^content_d + L^links directly, in practice we can improve our performance significantly by better balancing the information in the content vs. that in the links. In particular, the log-likelihood L^content_d of each document is proportional to its length, while its contribution to L^links is proportional to its degree. Since a typical document has many more words than links, L^content tends to be much larger than L^links. Following [19], we can provide this balance in two ways. One is to normalize L^content_d by the length L_d, and another is to add a parameter α that reweights the relative contributions of the two terms L^content and L^links. We then maximize the function

    L = α Σ_d (1/L_d) L^content_d + (1 − α) L^links .    (7)

Varying α from 0 to 1 lets us interpolate between two extremes: studying the document network purely in terms of its topology, or purely in terms of the documents' content. Indeed, we will see in Section 4 that the optimal value of α depends on which task we are performing: closer to 0 for link prediction, and closer to 1 for topic classification.

We maximize L as a function of {θ, β, η} using an EM algorithm, very similar to the one introduced by [2] for overlapping community detection. We start with a standard trick to change the log of a sum into a sum of logs, writing

    L^content_d ≥ Σ_{w=1}^{W} C_dw Σ_{z=1}^{K} h_dw(z) log( θ_dz β_zw / h_dw(z) )

    L^links ≥ (1/2) Σ_{dd′} Σ_{z=1}^{K} A_dd′ q_dd′(z) log( θ_dz θ_d′z η_z / q_dd′(z) ) − (1/2) Σ_{dd′} Σ_{z=1}^{K} θ_dz θ_d′z η_z .    (8)

Here h_dw(z) is the probability that a given appearance of w in d is due to topic z, and q_dd′(z) is the probability that a given link between d and d′ is due to topic z. This lower bound holds with equality when

    h_dw(z) = θ_dz β_zw / Σ_{z′} θ_dz′ β_z′w ,    q_dd′(z) = θ_dz θ_d′z η_z / Σ_{z′} θ_dz′ θ_d′z′ η_z′ ,    (9)

giving us the E step of the algorithm. For the M step, we derive update equations for the parameters {θ, β, η}. By taking derivatives of the log-likelihood (7) (see Appendix A for details) we obtain

    η_z = Σ_{dd′} A_dd′ q_dd′(z) / ( Σ_d θ_dz )²    (10)

    β_zw = Σ_d (1/L_d) C_dw h_dw(z) / Σ_d (1/L_d) Σ_{w′} C_dw′ h_dw′(z)    (11)

    θ_dz = [ (α/L_d) Σ_w C_dw h_dw(z) + (1 − α) Σ_{d′} A_dd′ q_dd′(z) ] / [ α + (1 − α) κ_d ] .    (12)

Here κ_d = Σ_{d′} A_dd′ is the degree of document d.

To analyze the running time, let R_d denote the number of distinct words in document d, and let R = Σ_d R_d. Then only KR of the parameters h_dw(z) are nonzero. Similarly, q_dd′(z) only appears if A_dd′ ≠ 0, so in a network with M links only KM of the q_dd′(z) are nonzero. The total number of nonzero terms appearing in (9)–(12), and hence the running time of the E and M steps, is thus O(K(N + M + R)).

As in [2], we can speed up the algorithm if θ is sparse, i.e., if many documents belong to fewer than K topics, so that many of the θ_dz are zero. According to (9), if θ_dz = 0 then h_dw(z) = q_dd′(z) = 0, in which case (12) implies that θ_dz = 0 for all future iterations. If we choose a threshold below which θ_dz is effectively zero, then as θ becomes sparser we need maintain just those h_dw(z) and q_dd′(z) for which θ_dz ≠ 0. This in turn simplifies the updates for η and β in (10) and (11).

We note that the simplicity of our update equations comes from the fact that A_dd′ is Poisson, and that its mean is a multilinear function of the parameters. Models where A_dd′ is Bernoulli-distributed with a more complicated link probability, such as a logistic function, have more complicated derivatives of the likelihood, and therefore more complicated update equations.

Note also that this EM algorithm is exact, in the sense that the maximum-likelihood estimators {θ̂, β̂, η̂} are fixed points of the update equations. This is because the E step (9) is exact: the conditional distribution of topics associated with each word occurrence and each link is a product distribution, which we can describe exactly with h_dw and q_dd′. (There are typically multiple fixed points, so in practice we run our algorithm with many different initial conditions, and take the fixed point with the highest likelihood.)

This exactness is due to the fact that the topic mixtures θ_d are parameters to be inferred. In models such as lda and mmsb where θ_d is a hidden variable integrated over a Dirichlet prior, the topics associated with each word and link have a complicated joint distribution that can only be approximated using sampling or variational methods.
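As an illustration of how simple the resulting updates are, here is a dense NumPy sketch of one EM iteration implementing the E step (9) and the M step (10)–(12). The function name and dense-array layout are our own; a genuinely scalable implementation would loop only over the nonzero C_dw and A_dd′, as the running-time analysis above describes.

```python
import numpy as np

def em_step(C, A, theta, beta, eta, alpha):
    """One exact EM update for the plain (non-degree-corrected) pmtlm.

    C: N x W word counts, A: N x N symmetric adjacency,
    theta: N x K, beta: K x W, eta: length-K, alpha: content/link weight.
    Dense for clarity; materializes all N x N pairs.
    """
    L = C.sum(axis=1)                      # document lengths L_d
    kappa = A.sum(axis=1)                  # degrees kappa_d

    # E step, Eq. (9): h[d, z, w] and q[d, d', z]
    h = theta[:, :, None] * beta[None, :, :]           # N x K x W
    h /= h.sum(axis=1, keepdims=True)
    q = theta[:, None, :] * theta[None, :, :] * eta    # N x N x K
    q /= q.sum(axis=2, keepdims=True)

    # M step, Eqs. (10)-(12)
    Aq = A[:, :, None] * q                             # A_dd' q_dd'(z)
    eta_new = Aq.sum(axis=(0, 1)) / theta.sum(axis=0) ** 2
    Ch = C[:, None, :] * h                             # C_dw h_dw(z)
    beta_new = (Ch / L[:, None, None]).sum(axis=0)     # numerator of (11)
    beta_new /= beta_new.sum(axis=1, keepdims=True)
    num = (alpha / L)[:, None] * Ch.sum(axis=2) \
          + (1 - alpha) * Aq.sum(axis=1)
    theta_new = num / (alpha + (1 - alpha) * kappa)[:, None]
    return theta_new, beta_new, eta_new
```

As a quick sanity check on the normalizations: summing the numerator of (12) over z gives α + (1 − α)κ_d, so the updated rows of θ sum to 1, and likewise each row of β remains a distribution over words.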
(To be fair, recent advances such as stochastic optimization based on network subsampling [17] have shown that approximate inference in these models can be carried out quite efficiently.) On the other hand, in the context of finding communities in networks, models with Dirichlet priors have been observed to generalize more successfully than Poisson models such as bkn [17]. Happily, we can impose a Dirichlet prior on θ_d with no loss of efficiency, simply by including pseudocounts in the update equations, in essence adding additional words and links that are known to come from each topic (see Appendix C). This lets us obtain a maximum a posteriori (MAP) estimate of an lda-like model. We leave this as a direction for future work.

Our model, like plsa and the bkn model, lets us infer a soft classification: a mixture of topic labels or community memberships for each document. However, we often want to infer categorical labels, where each document d is assigned to a single topic 1 ≤ z_d ≤ K. A natural way to do this is to let z_d be the most-likely label in the inferred mixture, ẑ_d = argmax_z θ_dz. This is equivalent to rounding θ_d to a delta function, with θ_dz = 1 for z = ẑ_d and 0 for z ≠ ẑ_d.

If we wish, we can improve these discrete labels further using local search. If each document has just a single topic, the log-likelihood of our model is

    L^content_d = Σ_{w=1}^{W} C_dw log β_{z_d w}    (13)

    L^links = (1/2) Σ_{dd′} A_dd′ log η_{z_d z_d′} .    (14)

Note that here η is a matrix, with off-diagonal entries that allow documents with different topics z_d, z_d′ to be linked. Otherwise, these discrete labels would cause the network to split into K separate components.

Let n_z denote the number of documents of topic z, let L_z = Σ_{d: z_d = z} L_d be their total length, and let C_zw = Σ_{d: z_d = z} C_dw be the total number of times w appears in them. Let m_zz′ denote the total number of links between documents of topics z and z′, counting each link twice if z = z′.
Then the MLEs for β and η are

    β̂_zw = C_zw / L_z ,    η̂_zz′ = m_zz′ / (n_z n_z′) .    (15)

Applying these MLEs in (13) and (14) gives us a point estimate of the likelihood of a discrete topic assignment z_d, which we can normalize or reweight as discussed in Section 3.2 if we like. We can then maximize this likelihood using local search: for instance, using the Kernighan-Lin heuristic as in [21], or a Monte Carlo algorithm, to find a local maximum of the likelihood in the vicinity of ẑ. Each step of these algorithms changes the label of a single document d, so we can update the values of n_z, L_z, C_zw, and m_zz′ and compute the new likelihood in O(K + R_d) time. In our experiments we used the KL heuristic, and found that for some data sets it noticeably improved the accuracy of our algorithm for the document classification task.
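As a concrete illustration of Eq. (15), the MLEs for a hard labeling reduce to a few matrix products. This NumPy helper is our own sketch; it assumes a symmetric adjacency matrix, so the diagonal blocks of the product count each within-topic link twice, matching the definition of m_zz′.

```python
import numpy as np

def discrete_mles(C, A, zhat, K):
    """MLEs of Eq. (15) for a hard labeling zhat (one topic per document).

    C: N x W word counts, A: N x N symmetric adjacency,
    zhat: length-N integer labels. Returns beta_hat (K x W) and the
    K x K matrix eta_hat = m_zz' / (n_z n_z').
    """
    onehot = np.eye(K)[zhat]              # N x K indicator matrix
    n = onehot.sum(axis=0)                # documents per topic, n_z
    Czw = onehot.T @ C                    # aggregated word counts C_zw
    Lz = Czw.sum(axis=1)                  # total lengths L_z
    beta_hat = Czw / Lz[:, None]
    m = onehot.T @ A @ onehot             # m_zz'; diagonal counts links twice
    eta_hat = m / np.outer(n, n)
    return beta_hat, eta_hat
```

Since each local-search step of the KL heuristic moves one document, in practice one would update n_z, L_z, C_zw, and m_zz′ incrementally rather than recompute these products, as the text notes.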
4. EXPERIMENTAL RESULTS
In this section we present empirical results on our model and our algorithm for unsupervised document classification and link prediction. We compare its accuracy and running time with those of several other methods, testing it on three real-world document citation networks.
The top portion of Table 1 lists the basic statistics for three real-world corpora [29]: Cora, Citeseer, and PubMed. Cora and Citeseer contain papers in machine learning, with K = 7 topics for Cora and K = 6 for Citeseer. PubMed consists of medical research papers on K = 3 topics, namely three types of diabetes. All three corpora have ground-truth topic labels provided by human curators. These data sets are publicly available for download.

The data sets for these corpora are slightly different. The PubMed data set has the number of times C_dw each word appeared in each document, while the data for Cora and Citeseer records whether or not a word occurred at least once in the document. For Cora and Citeseer, we treat C_dw as being 0 or 1.

We compare the Poisson Mixed-Topic Link Model (pmtlm) and its degree-corrected variant, denoted pmtlm-dc, with phits-plsa, link-lda, c-pldc, and rtm (see Section 2.2). We used our own implementation of both phits-plsa and rtm. For rtm, we implemented the variational EM algorithm given in [6]; the implementation is based on the lda code available from the authors. We also tried the code provided by J. Chang, which uses a Monte Carlo algorithm for the E step, but we found the variational algorithm works better on our data sets. While rtm includes a variety of link probability functions, we only used the sigmoid function. We also assume a symmetric Dirichlet prior. The results for link-lda and c-pldc are taken from [32].

Each E and M step of the variational algorithm for rtm performs multiple iterations until they converge on estimates for the posterior and the parameters [6]. This is quite different from our EM algorithm: since our E step is exact, we update the parameters only once in each iteration. In our implementation, the convergence condition for the E step and for the entire EM algorithm is that the fractional increase of the
log-likelihood between iterations is less than 10⁻; we performed a maximum of 500 iterations of the rtm algorithm due to its greater running time. In order to optimize the η parameters (see the graphical model in Section 2.2), rtm uses a tunable regularization parameter ρ, which can be thought of as the number of observed non-links. We tried various settings for ρ, from fractions of M up to 10M, where M is the number of observed links.

As described in Section 3.2, for pmtlm, pmtlm-dc and phits-plsa we vary the relative weight α of the likelihood of the content vs. the links, tuning α to its best possible value. For the PubMed data set, we also normalized the content likelihood by the length of the documents.

For pmtlm, pmtlm-dc and phits-plsa, we performed 500 independent runs of the EM algorithm, each with random initial values of the parameters and topic mixtures. For each run we iterated the EM algorithm up to 5000 times; we found that it typically converges in fewer iterations, with the criterion that the fractional increase of the log-likelihood for two successive iterations is less than 10⁻. Figure 2 shows that the log-likelihood as a function of the number of iterations is quite similar for all three data sets, even though these corpora have very different sizes. This indicates that even for large data sets, our algorithm converges within a small number of iterations, making its total running time linear in the size of the corpus.

Figure 2: The log-likelihood of the PMTLM and PMTLM-DC models as a function of the number of EM iterations, normalized so that 0 and 1 are the initial and final log-likelihood respectively. The convergence is roughly the same for all three data sets, showing that the number of iterations is roughly constant as a function of the size of the corpus.

For pmtlm and pmtlm-dc, we obtain discrete topic labels by running our EM algorithm and rounding the topic mixtures as described in Section 3.4.
We also tested improving these labels with local search, using the Kernighan-Lin heuristic to change the label of one document at a time until we reach a local optimum of the likelihood. More precisely, of those 500 runs, we took the T best fixed points of the EM algorithm (i.e., with the highest likelihood) and attempted to improve them further with the KL heuristic. We used T = 50 for Cora and Citeseer and T = 5 for PubMed.

                      Cora   Citeseer   PubMed
    EM (plsa)           28       61       362
    EM (phits-plsa)     40       67       445
    EM (pmtlm)          33       64       419
    EM (pmtlm-dc)       36       64       402
    EM (rtm)           992      597     2,194
    KL (pmtlm)         375      618    13,723
    KL (pmtlm-dc)      421      565    13,014

Table 1: The statistics of the three data sets, and the mean running time, for the EM algorithms in our model PMTLM, its degree-corrected variant PMTLM-DC, and PLSA, PHITS-PLSA, and RTM. Each corpus has K topics, N documents, M links, a vocabulary of size W, and a total size R. Running times for our algorithm, PLSA, and PHITS-PLSA are given for one run of EM iterations. Running times for RTM consist of up to 500 EM iterations, or until the convergence criteria are reached. Our EM algorithm is highly scalable, with a running time that grows linearly with the size of the corpus. In particular, it is much faster than the variational algorithm for RTM. Improving discrete labels with the Kernighan-Lin heuristic (KL) increases our algorithm's running time, but improves its accuracy for document classification in Cora and Citeseer.
For rtm, in each E step we initialize the variational parameters randomly, and in each M step we initialize the hyperparameters randomly. We execute 500 independent runs for each setting of the tunable parameter ρ.

For each algorithm, we used several measures of the accuracy of the inferred labels as compared to the human-curated ones. The Normalized Mutual Information (NMI) between two labelings C₁ and C₂ is defined as

    NMI(C₁, C₂) = MI(C₁, C₂) / max( H(C₁), H(C₂) ) .    (16)

Here MI(C₁, C₂) is the mutual information between C₁ and C₂, and H(C₁) and H(C₂) are the entropies of C₁ and C₂ respectively. Thus the NMI is a measure of how much information the inferred labels give us about the true ones. We also used the Pairwise F-measure (PWF) [3] and the Variation of Information (VI) [25] (which we wish to minimize).

The best NMI, VI, and PWF we observed for each algorithm are given in Table 2, where for link-lda and c-pldc we quote results from [32]. For rtm, we give these metrics for the labeling with the highest likelihood, using the best value of ρ for each metric. We see that even without the additional step of local search, our algorithm does very well, outperforming all other methods we tried on Citeseer and PubMed, and all but c-pldc on Cora. (Note that we did not test link-lda or c-pldc on PubMed.)

                      Cora: NMI / VI / PWF            Citeseer: NMI / VI / PWF        PubMed: NMI / VI / PWF
    phits-plsa        — / — / —                       — / — / —                       — / — / —
    link-lda†         — / — / —                       — / — / —                       — / — / —
    c-pldc†           — / — / —                       — / — / —                       — / — / —
    rtm               — / — / —                       — / — / —                       — / — / —
    pmtlm             — / — / —                       — / — / —                       — / — / —
    pmtlm (kl)        — (.4) / — (.4) / — (.4)        — (.6) / — (.6) / 0.518 (.5)    0.233 (.9) / 1.642 (.9) / 0.488 (.9)
    pmtlm-dc          — (.8) / — (.8) / — (.8)        — / — / —                       — / — / —
    pmtlm-dc (kl)     0.491 (.3) / 1.865 (.3) / 0.511 (.3)   0.406 (.3) / 2.084 (.3) / — (.3)   0.260 (.8) / 1.577 (.8) / 0.492 (.8)

Table 2: The best normalized mutual information (NMI), variation of information (VI) and pairwise F-measure (PWF) achieved by each algorithm. Values marked by † are quoted from [32]; other values are based on our implementation. The best values are shown in bold; note that we seek to maximize NMI and PWF, and minimize VI.
For PHITS-PLSA, PMTLM, and PMTLM-DC, the number in parentheses is the best value of the relative weight α of content vs. links. Refining the labeling returned by the EM algorithm with the Kernighan-Lin heuristic is indicated by (KL).

Degree correction (pmtlm-dc) improves accuracy significantly for PubMed. Refining our labeling with the KL heuristic improved the performance of our algorithm significantly for Cora and Citeseer, giving us a higher accuracy than all the other methods we tested. For PubMed, local search did not increase accuracy in a statistically significant way. In fact, on some runs it decreased the accuracy slightly compared to the initial labeling ẑ obtained from our EM algorithm; this is counterintuitive, but it shows that increasing the likelihood of a labeling in the model can decrease its accuracy.

In Figure 3, we show how the performance of pmtlm, pmtlm-dc, and phits-plsa varies as a function of α, the relative weight of content vs. links. Recall that at α = 0 these algorithms label documents solely on the basis of their links, while at α = 1 they only pay attention to the content. Each point consists of the top 20 runs with that value of α. For Cora and Citeseer, there is an intermediate value of α at which pmtlm and pmtlm-dc have the best accuracy. However, this peak is fairly broad, showing that we do not have to tune α very carefully. For PubMed, where we also normalized the content information by document length, pmtlm-dc performs best at a particular value of α.

We give the running time of these algorithms, including pmtlm and pmtlm-dc with and without the KL heuristic, in Table 1, and compare it to the running time of the other algorithms we implemented. Our EM algorithm is much faster than the variational EM algorithm for rtm, and is scalable in that it grows linearly with the size of the corpus.

Link prediction (e.g.
[8, 23, 34]) is a natural generalization task in networks, and another way to measure the quality of our model and our EM algorithm. Based on a training set consisting of a subset of the links, our goal is to rank all pairs without an observed link according to the probability of a link between them. For our models, we rank pairs according to the expected number of links \(A_{dd'}\) in the Poisson distribution, (2) and (3), which is monotonic in the probability that at least one link exists. We can then predict links between those pairs where this probability exceeds some threshold. Since we are agnostic about this threshold and about the cost of Type I vs. Type II errors, we follow other work in this area by defining the accuracy of our model as the AUC, i.e., the probability that a random true positive link is ranked above a random true non-link. Equivalently, this is the area under the receiver operating characteristic (ROC) curve. Our goal is to do better than the baseline AUC of 1/2, corresponding to a random ranking of the pairs.

We carried out 10-fold cross-validation, in which the links in the original graph are partitioned into 10 subsets of equal size. For each fold, we use one subset as the test links, and train the model using the links in the other 9 folds. We evaluated the AUC on the held-out links and the non-links. For Cora and Citeseer, all the non-links are used; for PubMed, we randomly chose 10% of the non-links for comparison. We trained the models with the same settings as those for document classification in Section 4.3, and we executed 100 independent runs for each test. Note that unlike the document classification task, here we used the full topic mixtures to predict links, not just the discrete labels consisting of the most-likely topic for each document.

Note that PMTLM-DC assigns S_d to be zero if the degree of d is zero. This makes it impossible for d to have any test link with other documents if its observed degree is zero in the training data. One way to solve this is to assign a small positive value to S_d even if d's degree is zero. Our approach assigns S_d the smallest value among those S_{d'} that are nonzero:

\[ S_d = \min \{ S_{d'} : S_{d'} > 0 \} \quad \text{if } \kappa_d = 0 . \tag{17} \]

Figure 4(a) gives the AUC values for PMTLM and PMTLM-DC as a function of the relative weight α of content vs. links. The green horizontal line in each of those subplots represents the highest AUC value achieved by the RTM model for each data set, using the best value of ρ among those specified in Section 4.3. Interestingly, for Cora and Citeseer the optimal value of α is smaller than in Figure 3, showing that content is less important for link prediction than for document classification. We also plot the receiver operating characteristic (ROC) curves and precision-recall curves that achieve the highest AUC values in Figure 4(b) and Figure 4(c) respectively. We see that, for all three data sets, our models outperform RTM, and that the degree-corrected model PMTLM-DC is significantly more accurate than the uncorrected one.

Figure 3: The accuracy of PMTLM, PMTLM-DC, and PHITS-PLSA on the document classification task, measured by the NMI, as a function of the relative weight α of the content vs. the links. At α = 0 these algorithms label documents solely on the basis of their links, while at α = 1 they pay attention only to the content. For Cora and Citeseer, there is a broad range of α that maximizes the accuracy. For PubMed, the degree-corrected model PMTLM-DC performs best at a particular value of α.

Figure 4: Performance on the link prediction task: (a) AUC values for different α; (b) ROC curves achieving the highest AUC values; (c) precision-recall curves achieving the highest AUC values. For all three data sets and all values of α, the PMTLM-DC model achieves higher accuracy than the PMTLM model. In contrast to Figure 3, for this task the optimal value of α is relatively small, showing that the content is less important, and the topology more important, for link prediction than for document classification. The green line in Figure 4(a) indicates the highest AUC achieved by the RTM model, maximized over the tunable parameter ρ. Our models outperform RTM on all three data sets. In addition, the degree-corrected model (PMTLM-DC) does significantly better than the uncorrected version (PMTLM).
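The AUC used in this section can be computed directly from its probabilistic definition, counting ties as one half. A minimal sketch (the function name is ours); this quadratic-time version is for illustration only, and a rank-based implementation would be preferable at PubMed scale.

```python
def auc(link_scores, nonlink_scores):
    """AUC: the probability that a randomly chosen true link is ranked
    above a randomly chosen non-link; ties contribute 1/2.
    Equivalent to the area under the ROC curve."""
    wins = sum(1.0 if s > t else 0.5 if s == t else 0.0
               for s in link_scores for t in nonlink_scores)
    return wins / (len(link_scores) * len(nonlink_scores))
```

A perfect ranking gives AUC = 1, a reversed ranking gives 0, and a random ranking gives the 1/2 baseline discussed above.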
5. CONCLUSIONS
We have introduced a new generative model for document networks. It is a marriage between Probabilistic Latent Semantic Analysis [19] and the Ball-Karrer-Newman mixed-membership block model [2]. Because of its mathematical simplicity, its parameters can be inferred with a particularly simple and scalable EM algorithm. Our experiments on both document classification and link prediction show that it achieves high accuracy and efficiency for a variety of data sets, outperforming a number of other methods. In future work, we plan to test its performance on other tasks, including supervised and semisupervised learning, active learning, and content prediction, i.e., predicting the presence or absence of words in a document based on its links to other documents and/or a subset of its text.
6. ACKNOWLEDGMENTS
We are grateful to Brian Ball, Brian Karrer, Mark Newman, and David M. Blei for helpful conversations. Y.Z., X.Y., and C.M. are supported by AFOSR and DARPA under grant FA9550-12-1-0432.
7. REFERENCES
[1] E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic blockmodels. J. Machine Learning Research, 9:1981–2014, 2008.
[2] B. Ball, B. Karrer, and M. E. J. Newman. Efficient and principled method for detecting communities in networks. Phys. Rev. E, 84:036103, 2011.
[3] S. Basu. Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments. PhD thesis, Department of Computer Sciences, University of Texas at Austin, 2005.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Machine Learning Research, 3:993–1022, 2003.
[5] J. Chang and D. M. Blei. Relational topic models for document networks. Artificial Intelligence and Statistics, 2009.
[6] J. Chang and D. M. Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, 4(1):124–150, Mar. 2010.
[7] A. Chen, A. A. Amini, P. J. Bickel, and E. Levina. Fitting community models to large sparse networks. CoRR, abs/1207.2340, 2012.
[8] A. Clauset, C. Moore, and M. E. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.
[9] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. 17th Intl. Conf. on Machine Learning, pages 167–174, 2000.
[10] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. Proc. 13th Neural Information Processing Systems, 2001.
[11] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E, 84(6), 2011.
[12] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett., 107:065701, 2011.
[13] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proc. National Academy of Sciences, 101 Suppl:5220–7, Apr. 2004.
[14] S. E. Fienberg and S. Wasserman. Categorical data analysis of single sociometric relations. Sociological Methodology, pages 156–192, 1981.
[15] L. Getoor and C. P. Diehl. Link mining: a survey. ACM SIGKDD Explorations Newsletter, 7(2):3–12, 2005.
[16] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. Journal of Machine Learning Research, 3:679–707, December 2002.
[17] P. Gopalan, D. Mimno, S. Gerrish, M. Freedman, and D. Blei. Scalable inference of overlapping communities. In Advances in Neural Information Processing Systems 25, pages 2258–2266, 2012.
[18] A. Gruber, M. Rosen-Zvi, and Y. Weiss. Latent topic models for hypertext. Proc. 24th Conf. on Uncertainty in Artificial Intelligence, 2008.
[19] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM.
[20] P. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
[21] B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83:016107, 2011.
[22] M. Kim and J. Leskovec. Latent multi-group membership graph model. CoRR, abs/1205.4546, 2012.
[23] L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.
[24] Q. Lu and L. Getoor. Link-based classification using labeled and unlabeled data. ICML Workshop on "The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining", 2003.
[25] M. Meilă. Comparing clusterings by the variation of information. Learning Theory and Kernel Machines, pages 173–187, 2003.
[26] C. Moore, X. Yan, Y. Zhu, J. Rouquier, and T. Lane. Active learning for node classification in assortative and disassortative networks. In Proc. 17th KDD, pages 841–849, 2011.
[27] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent topic models for text and citations. Proc. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, page 542, 2008.
[28] M. E. J. Newman and E. A. Leicht. Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences of the United States of America, 104(23):9564–9, 2007.
[29] P. Sen, G. Namata, M. Bilgic, and L. Getoor. Collective classification in network data. AI Magazine, pages 1–24, 2008.
[30] T. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997.
[31] C. Sun, B. Gao, Z. Cao, and H. Li. HTM: a topic model for hypertexts. In Proc. Conf. on Empirical Methods in Natural Language Processing, EMNLP '08, pages 514–522, 2008.
[32] T. Yang, R. Jin, Y. Chi, and S. Zhu. A Bayesian framework for community detection integrating content and link. In Proc. 25th Conf. on Uncertainty in Artificial Intelligence, pages 615–622, 2009.
[33] P. Yu, J. Han, and C. Faloutsos. Link Mining: Models, Algorithms, and Applications. Springer, 2010.
[34] Y. Zhao, E. Levina, and J. Zhu. Link prediction for partially observed networks. arXiv preprint arXiv:1301.7047, 2013.
APPENDIX
A. UPDATE EQUATIONS FOR PMTLM
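The M step derived in this appendix, Eqs. (21), (23), and (25) below, can be summarized in a short dense-array sketch. This is our own illustrative rendering, not the authors' code: `A` is the adjacency matrix, `C` the document-word count matrix, and `q`, `h` the E-step posteriors q_{dd'}(z) and h_{dw}(z), each assumed normalized over topics.

```python
import numpy as np

def m_step_pmtlm(A, C, q, h, alpha):
    """One M step of the PMTLM EM algorithm (Eqs. 21, 23, 25 below).
    A: (D, D) adjacency matrix, C: (D, W) word counts,
    q: (D, D, K) link topic posteriors, h: (D, W, K) word topic posteriors."""
    L = C.sum(axis=1)                      # document lengths L_d
    kappa = A.sum(axis=1)                  # degrees kappa_d
    Aq = np.einsum('de,dek->dk', A, q)     # sum_{d'} A_{dd'} q_{dd'}(z)
    Ch = np.einsum('dw,dwk->dk', C, h)     # sum_w C_{dw} h_{dw}(z)

    # Eq. (25): topic mixtures theta_{dz}; rows sum to 1 by construction
    theta = (alpha / L)[:, None] * Ch + (1 - alpha) * Aq
    theta /= (alpha + (1 - alpha) * kappa)[:, None]

    # Eq. (21): eta_z = sum_{dd'} A_{dd'} q_{dd'}(z) / (sum_d theta_{dz})^2
    eta = Aq.sum(axis=0) / theta.sum(axis=0) ** 2

    # Eq. (23): word distributions beta_{zw}, normalized over words
    beta = np.einsum('d,dw,dwk->kw', 1.0 / L, C, h)
    beta /= beta.sum(axis=1, keepdims=True)
    return theta, beta, eta
```

Each update is a closed form, which is why the algorithm scales linearly with the number of words and links when the sums are restricted to observed entries.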
In this appendix, we derive the update equations (10)–(12) for the parameters η, β, and θ, giving the M step of our algorithm. Recall that the likelihood is given by (7) and (8). For identifiability, we impose the normalization constraints

\[ \forall z : \sum_w \beta_{zw} = 1 \tag{18} \]

\[ \forall d : \sum_z \theta_{dz} = 1 . \tag{19} \]

For each topic z, taking the derivative of the likelihood with respect to η_z gives

\[ 0 = \frac{2}{1-\alpha} \frac{\partial L}{\partial \eta_z} = \frac{1}{\eta_z} \sum_{dd'} A_{dd'} q_{dd'}(z) - \sum_{dd'} \theta_{dz} \theta_{d'z} . \tag{20} \]

Thus

\[ \eta_z = \frac{\sum_{dd'} A_{dd'} q_{dd'}(z)}{\sum_{dd'} \theta_{dz} \theta_{d'z}} = \frac{\sum_{dd'} A_{dd'} q_{dd'}(z)}{\left( \sum_d \theta_{dz} \right)^2} . \tag{21} \]

Plugging this in to (8) makes the last term a constant, \(-\tfrac{1}{2} \sum_{dd'} A_{dd'} = -M\). Thus we can ignore this term when estimating θ_dz.

Similarly, for each topic z and each word w, taking the derivative with respect to β_zw gives

\[ \nu_z = \frac{1}{\alpha} \frac{\partial L}{\partial \beta_{zw}} = \frac{1}{\beta_{zw}} \sum_d \frac{1}{L_d} C_{dw} h_{dw}(z) , \tag{22} \]

where ν_z is the Lagrange multiplier for (18). Normalizing β_z determines ν_z, and gives

\[ \beta_{zw} = \frac{\sum_d (1/L_d) \, C_{dw} h_{dw}(z)}{\sum_d (1/L_d) \sum_{w'} C_{dw'} h_{dw'}(z)} . \tag{23} \]

Finally, for each document d and each topic z, taking the derivative with respect to θ_dz gives

\[ \lambda_d = \frac{\partial L}{\partial \theta_{dz}} = \frac{\alpha}{L_d \theta_{dz}} \sum_w C_{dw} h_{dw}(z) + \frac{1-\alpha}{\theta_{dz}} \sum_{d'} A_{dd'} q_{dd'}(z) , \tag{24} \]

where λ_d is the Lagrange multiplier for (19). Normalizing θ_d determines λ_d and gives

\[ \theta_{dz} = \frac{(\alpha/L_d) \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z)}{\alpha + (1-\alpha) \kappa_d} . \tag{25} \]

B. UPDATE EQUATIONS FOR THE DEGREE-CORRECTED MODEL
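The degree-corrected updates derived in this appendix, Eqs. (30), (33), (37), and (38) below, can be sketched the same way. Array names and shapes are our own illustrative choices; we assume every document has positive degree, so the special case (17) for κ_d = 0 is not handled here.

```python
import numpy as np

def m_step_pmtlm_dc(A, C, q, h, theta, alpha):
    """Degree-corrected M-step quantities (Eqs. 30, 33, 37, 38 below).
    Given the E-step posteriors q, h and the current theta, compute eta,
    the correction xi, the propensities S_d, and the new theta."""
    L = C.sum(axis=1)                      # document lengths L_d
    kappa = A.sum(axis=1)                  # degrees kappa_d (assumed > 0)
    Aq = np.einsum('de,dek->dk', A, q)     # sum_{d'} A_{dd'} q_{dd'}(z)
    Ch = np.einsum('dw,dwk->dk', C, h)     # sum_w C_{dw} h_{dw}(z)

    # Eq. (30): eta_z is the expected number of links caused by topic z
    eta = Aq.sum(axis=0)
    # Eq. (37): xi_z compares word topic posteriors with topic mixtures
    xi = (alpha / (1 - alpha)) * (Ch / L[:, None] - theta).sum(axis=0)
    # Eq. (33): degree-correction parameters S_d
    S = kappa / (theta @ (eta + xi))
    # Eq. (38): updated topic mixtures
    theta_new = ((alpha / L)[:, None] * Ch + (1 - alpha) * Aq) \
        / (alpha + (1 - alpha) * (eta + xi) * S[:, None])
    return eta, xi, S, theta_new
```

Because S and θ appear in each other's updates, in practice these formulas are applied from the current parameter values within the outer EM loop.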
Recall that in the degree-corrected model PMTLM-DC, the number of links between each pair of documents d, d′ is Poisson-distributed with mean

\[ S_d S_{d'} \sum_z \eta_z \theta_{dz} \theta_{d'z} . \tag{26} \]

To make the model identifiable, in addition to (18) and (19), we impose the following constraint on the degree-correction parameters,

\[ \forall z : \sum_d S_d \theta_{dz} = 1 . \tag{27} \]

With this constraint, we have

\[ L = \alpha \sum_d \frac{1}{L_d} \sum_{wz} C_{dw} h_{dw}(z) \log \frac{\theta_{dz} \beta_{zw}}{h_{dw}(z)} + (1-\alpha) \sum_d \kappa_d \log S_d + \frac{1-\alpha}{2} \sum_{dd'z} \left( A_{dd'} q_{dd'}(z) \log \frac{\eta_z \theta_{dz} \theta_{d'z}}{q_{dd'}(z)} - S_d S_{d'} \eta_z \theta_{dz} \theta_{d'z} \right) . \tag{28} \]

The update equation (23) for β remains the same, since the degree correction only affects the part of the model that generates the links, not the words. We now derive the update equations for η, S, and θ.

For each topic z, taking the derivative of the likelihood with respect to η_z gives

\[ 0 = \frac{2}{1-\alpha} \frac{\partial L}{\partial \eta_z} = \frac{1}{\eta_z} \sum_{dd'} A_{dd'} q_{dd'}(z) - \sum_{dd'} S_d S_{d'} \theta_{dz} \theta_{d'z} = \frac{1}{\eta_z} \sum_{dd'} A_{dd'} q_{dd'}(z) - 1 , \tag{29} \]

where we used (27). Thus

\[ \eta_z = \sum_{dd'} A_{dd'} q_{dd'}(z) , \tag{30} \]

so η_z is simply the expected number of links caused by topic z. In particular,

\[ \sum_z \eta_z = \sum_{dd'} A_{dd'} = \sum_d \kappa_d = 2M . \tag{31} \]

For S_d, we have

\[ \frac{1}{1-\alpha} \frac{\partial L}{\partial S_d} = \frac{\kappa_d}{S_d} - \sum_{d'z} S_{d'} \eta_z \theta_{dz} \theta_{d'z} = \frac{\kappa_d}{S_d} - \sum_z \eta_z \theta_{dz} = \sum_z \xi_z \theta_{dz} , \tag{32} \]

where ξ_z is the Lagrange multiplier for (27). Thus

\[ S_d = \frac{\kappa_d}{\sum_z (\eta_z + \xi_z) \theta_{dz}} . \tag{33} \]

We will determine ξ_z below. However, note that multiplying both sides of (32) by S_d, summing over d, and applying (27) and (31) gives

\[ \sum_z \xi_z = 0 . \tag{34} \]

Most importantly, for θ we have

\[ \frac{\partial L}{\partial \theta_{dz}} = \frac{1}{\theta_{dz}} \left( \frac{\alpha}{L_d} \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z) \right) - (1-\alpha) \sum_{d'} S_d S_{d'} \eta_z \theta_{d'z} = \frac{1}{\theta_{dz}} \left( \frac{\alpha}{L_d} \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z) \right) - (1-\alpha) S_d \eta_z = \lambda_d + (1-\alpha) S_d \xi_z , \tag{35} \]

where λ_d is the Lagrange multiplier for (19), and where we applied (27) in the second equality.
Multiplying both sides of (35) by θ_dz, summing over z, and applying (33) gives

\[ \lambda_d = \alpha . \tag{36} \]

Summing over d and applying (27), (30), and (36) gives

\[ \frac{1-\alpha}{\alpha} \xi_z = \sum_d \frac{1}{L_d} \sum_w C_{dw} h_{dw}(z) - \sum_d \theta_{dz} = \sum_d \frac{1}{L_d} \sum_w C_{dw} \left( h_{dw}(z) - \theta_{dz} \right) . \tag{37} \]

Thus ξ_z measures how the inferred topic distributions of the words h_dw(z) differ from the topic mixtures θ_dz. Finally, (35) and (36) give

\[ \theta_{dz} = \frac{(\alpha/L_d) \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z)}{\alpha + (1-\alpha)(\eta_z + \xi_z) S_d} , \tag{38} \]

where η_z and ξ_z are given by (30) and (37).

C. UPDATE EQUATIONS WITH DIRICHLET PRIOR
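The pseudocount update (39) derived in this appendix is a one-line modification of the uncorrected update (25). A sketch, with array names of our own choosing; `t` holds the pseudocounts t_z = γ_z − 1, and the E-step posteriors `q` and `h` are assumed normalized over topics.

```python
import numpy as np

def theta_update_dirichlet(A, C, q, h, alpha, t):
    """Eq. (39) below: PMTLM theta update with Dirichlet pseudocounts t_z.
    A: (D, D) adjacency, C: (D, W) counts, q: (D, D, K), h: (D, W, K)."""
    L = C.sum(axis=1)                      # document lengths L_d
    kappa = A.sum(axis=1)                  # degrees kappa_d
    Aq = np.einsum('de,dek->dk', A, q)     # sum_{d'} A_{dd'} q_{dd'}(z)
    Ch = np.einsum('dw,dwk->dk', C, h)     # sum_w C_{dw} h_{dw}(z)
    numer = t[None, :] + (alpha / L)[:, None] * Ch + (1 - alpha) * Aq
    denom = t.sum() + alpha + (1 - alpha) * kappa
    return numer / denom[:, None]
```

Setting `t` to the zero vector recovers the uniform-prior update (25), and the rows of the result sum to one whenever `q` and `h` are properly normalized.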
If we impose a Dirichlet prior on θ, with parameters {γ_z} for each topic z, this adds a term \(\sum_{dz} (\gamma_z - 1) \log \theta_{dz}\) to the log-likelihood of both the PMTLM and PMTLM-DC models. This is equivalent to introducing pseudocounts t_z = γ_z − 1 for each topic z, which we can think of as additional words or links that we know are due to topic z. Our original models, without this term, correspond to the uniform prior with γ_z = 1 and t_z = 0. As long as γ_z ≥ 1, the pseudocounts are nonnegative and the updates change only slightly. For the PMTLM model, (25) becomes

\[ \theta_{dz} = \frac{t_z + (\alpha/L_d) \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z)}{\sum_z t_z + \alpha + (1-\alpha) \kappa_d} . \tag{39} \]

In the degree-corrected model PMTLM-DC, (36) and (37) become

\[ \lambda_d = \alpha + \sum_z t_z \tag{40} \]

and

\[ \frac{1-\alpha}{\alpha} \xi_z = \sum_d \frac{1}{L_d} \sum_w C_{dw} \left( h_{dw}(z) - \theta_{dz} \right) + \frac{1}{\alpha} \sum_d \left( t_z - \theta_{dz} \sum_{z'} t_{z'} \right) . \tag{41} \]

Note that ξ_z now has two contributions. One measures, as before, how the inferred topic distributions of the words h_dw(z) differ from the topic mixtures θ_dz, and the other measures how the fraction \(t_z / \sum_{z'} t_{z'}\) of pseudocounts for topic z differs from θ_dz. Finally, (38) becomes

\[ \theta_{dz} = \frac{t_z + (\alpha/L_d) \sum_w C_{dw} h_{dw}(z) + (1-\alpha) \sum_{d'} A_{dd'} q_{dd'}(z)}{\alpha + (1-\alpha)(\eta_z + \xi_z) S_d + \sum_{z'} t_{z'}} , \tag{42} \]

where η_z and ξ_z are given by (30) and (41).
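For completeness, the E step that alternates with these M steps computes the posteriors q and h. The closed forms below, q_{dd'}(z) ∝ η_z θ_dz θ_{d'z} and h_{dw}(z) ∝ θ_dz β_zw, are the forms implied by the likelihood (28); the E step is derived in the body of the paper, so treat this sketch and its array conventions as our own illustrative assumption.

```python
import numpy as np

def e_step(theta, beta, eta):
    """E step: posterior topic responsibilities for words and links,
    assuming h_{dw}(z) ∝ theta_{dz} beta_{zw} and
    q_{dd'}(z) ∝ eta_z theta_{dz} theta_{d'z} (the forms implied by Eq. 28).
    theta: (D, K), beta: (K, W), eta: (K,)."""
    # h: (D, W, K), normalized over the topic axis
    h = theta[:, None, :] * beta.T[None, :, :]
    h /= h.sum(axis=2, keepdims=True)
    # q: (D, D, K), normalized over the topic axis
    q = eta[None, None, :] * theta[:, None, :] * theta[None, :, :]
    q /= q.sum(axis=2, keepdims=True)
    return q, h
```

In a scalable implementation, q would of course only be computed for the observed links, which is what keeps each EM iteration linear in the corpus size.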