Random-walk Based Generative Model for Classifying Document Networks
Takafumi J. Suzuki
Communication Technology Laboratory, Research and Technology Group, Fuji Xerox Co., Ltd.
6-1 Minatomirai, Nishi-ku, Yokohama, Kanagawa, 220-8668, Japan
[email protected]
Abstract
Document networks are found in various collections of real-world data, such as citation networks, hyperlinked web pages, and online social networks. A large number of generative models have been proposed because they offer intuitive and useful pictures for analyzing document networks. Prominent examples are relational topic models, where documents are linked according to their topic similarities. However, existing generative models do not make full use of network structures because they are largely dependent on topic modeling of documents. In particular, the centrality of graph nodes is missing in the generative processes of previous models. In this paper, we propose a novel generative model for document networks by introducing random walkers on networks to integrate node centrality into the link generation processes. The developed method is evaluated in semi-supervised classification tasks with real-world citation networks. We show that the proposed model outperforms existing probabilistic approaches, especially in detecting communities in connected networks.
1 Introduction

Graph representation is one of the most fundamental data structures in computer science, and it describes various types of real-world data such as social networks, relational databases, and the world-wide web. Uncovering clusters, which are also referred to as communities in network science, is an essential step in clarifying the intrinsic nature of the data at hand. There has been much work on community detection algorithms, including supervised and unsupervised approaches [Fortunato and Hric, 2016]. Among the various methods, Bayesian approaches to graph generation processes have been intensively studied over many years because they give us fundamental insights into hidden patterns of networks.

Usually, real-world networks are composed of not only link structures between nodes but also rich information on their constituents. Citation networks belong to an important class of such networks, where each article contains informative sets of words in addition to the topology of citation links. Since topic analysis of citation networks enables us to predict latent citation links, many generative models have been proposed to utilize the complementary information from graphs and contents. The relational topic model (RTM) [Chang and Blei, 2009] is an extension of latent Dirichlet allocation (LDA) [Blei et al., 2003] to predicting citation links based on topic similarities. RTM has been further extended to incorporate various aspects of underlying topics. For instance, the generalized RTM (gRTM) [Chen et al., 2015] takes into account inter-topic similarities. The constrained RTM [Terragni et al., 2020] reflects prior knowledge on documents in the link generation processes. Besides LDA, traditional generative models including probabilistic latent semantic analysis (PLSA) and the stochastic block model (SBM) have been flexibly employed to describe citation networks [Cohn and Hofmann, 2001; Erosheva et al., 2004; Nallapati and Cohen, 2008; Liu et al., 2009; Yang et al., 2016a].

In order to develop generative models which jointly describe networks and texts, it is crucial to capture the intrinsic nature of both data types. From this perspective, existing models do not seem to fully utilize topological structures because they are basically built upon topic modeling of documents. RTM and its variants, for instance, are only concerned with topic similarities between local pairs of documents, while non-local coherence among multiple documents is not considered. Meanwhile, it is widely known in network science that random walks can capture global information of networks. One of the most successful examples is the PageRank algorithm [Brin and Page, 1998], where the centrality of nodes is evaluated with eigenvectors of modified transition matrices. Recently, Okamoto and Qiu [2018; 2019] have proposed a novel generative model called modular decomposition of Markov chain (MDMC) for network clustering. The key idea of MDMC is to introduce random walkers on networks to utilize global link structures for detecting communities. The predictive performance of MDMC has been further elaborated with Gibbs sampling algorithms, and it is competitive with other probabilistic community-detection approaches [Suzuki, 2019]. Hence, it is fruitful to unify MDMC and basic topic models to simultaneously leverage the complementary information from latent topics and communities.

In this paper, we develop a novel generative model named topic MDMC (TMDMC), where a random-walk method is
augmented with additional textual information to improve the detectability of community structures. We apply TMDMC to transductive classification tasks with real-world citation networks, and show that TMDMC outperforms other probabilistic approaches, especially in detecting communities in connected networks. The remainder of the paper is organized as follows: In Section 2, we review previous models for citation networks. In Section 3, we outline MDMC to provide a basis for random-walk approaches. We combine MDMC with topic modeling and propose our model in Section 4. TMDMC is evaluated in semi-supervised node classification on benchmark citation networks in Section 5. Conclusions and promising future works are stated in Section 6.

2 Related Work

Joint modeling of network structures and document contents has been intensively studied to perform traditional tasks such as node classification and link prediction. Broadly speaking, these approaches fall into two major categories: probabilistic and deterministic approaches.

The proposed method shares some common features with previous probabilistic approaches. RTM [Chang and Blei, 2009] is a hierarchical probabilistic model associating links between documents with their topic similarities: links are considered to be more often formed between documents with similar topic distributions, which are estimated by LDA [Blei et al., 2003] in advance. While the original RTM allows interactions only within the same topics, gRTM [Chen et al., 2015] incorporates inter-topic correlations with weighted matrices in a latent topic space. Imbalance issues between the presence and absence of links in observed graphs are also alleviated in gRTM with regularized Bayesian inference techniques. Bai et al. [2018] have employed neural architectures to overcome the limited expressivity of RTM. Very recently, RTM has been extended to a semi-supervised model to incorporate prior knowledge on must-link and cannot-link constraints [Terragni et al., 2020]. While link generation is considered a downstream task in RTMs, some approaches consider it a parallel or upstream process. Cohn and Hofmann [2001] have used PLSA as a core building block to perform a simultaneous decomposition of texts and graphs. PLSA has been replaced by LDA in Erosheva et al. [2004], whose graphical model is closely related to our proposed model. SBM, which is considered one of the most famous generative models for community detection, has also been integrated with RTM to model document networks in LBH-RTM [Yang et al., 2016a].

In spite of these intensive studies, most of the LDA-based models do not fully utilize network structures because they consider link generation processes as downstream tasks of topic modeling. An important exception is LBH-RTM [Yang et al., 2016a], which associates link generation processes with latent communities via a weighted SBM. However, it may discard crucial information on node centrality because the weighted SBM cannot resolve nodes within communities. On the other hand, random walks, which are integrated in our proposed model, can quantify node centrality and provide this rich information for various downstream tasks.

Another successful approach for analyzing document networks is to deterministically embed graph nodes into low-dimensional feature vectors, which are later used in node classification or link prediction. Traditional methods, such as label propagation [Zhu et al., 2003] and manifold regularization [Belkin et al.
, 2006], use the graph Laplacian as a regularization term in the corresponding loss functions. Neural architectures have also been frequently employed to learn node vectors. Semi-supervised embedding [Weston et al., 2012] imposes regularization terms in deep architectures to learn graph structures. DeepWalk [Perozzi et al., 2014] embeds nodes into a low-dimensional space with node sequences obtained by random walkers. Planetoid [Yang et al., 2016b] has extended DeepWalk to jointly embed node features and link structures with Skipgram models. Recently, the limited expressivity of traditional models has been significantly relaxed by graph convolutional networks (GCNs) [Kipf and Welling, 2017], whose hidden layers are used as node embedding vectors. In spite of their performance, however, vast numbers of model parameters must be optimized in GCNs, which requires techniques such as renormalization tricks [Kipf and Welling, 2017] and kernel smoothing [Xu et al., 2019]. Establishing efficient and powerful training schemes for GCNs is a challenging ongoing issue.
3 Modular Decomposition of Markov Chain

In this section, we outline the generative processes of MDMC [Okamoto and Qiu, 2018; Okamoto and Qiu, 2019; Suzuki, 2019] to lay the basis of our new model. The key idea of MDMC is to introduce and observe the Markovian dynamics of random walkers who travel around networks. The network structure is characterized by a transition matrix $T_{mn}$, which satisfies $\sum_{m=1}^{N} T_{mn} = 1$. Given the transition probabilities $p^{(t)}$ of the agent at time $t$, those at the next time are designed to be equal to $T p^{(t)}$ in terms of expectation values. Each link in the observed graph can be encoded into an $N$-dimensional vector $\tau^{(t)}_{l}$, where $\tau^{(t)}_{ln} = 1$ if link $l$ contains node $n$ and $0$ otherwise. Parameters in MDMC are optimized in an unsupervised manner to reconstruct the observed graph $\tau^{(t)}$. MDMC models the generative processes of links by combining agent probability distributions $p^{(t)}$ and latent community assignments $z^{(t)}$.

MDMC instantiates the aforementioned ideas with the following generative processes at time $t$:

1. For each community $k = 1, \dots, K$:
   (a) Draw probabilities $p^{(t)}(:|k) \sim \mathrm{Dir}(\alpha^{(t)}_{:k})$ with $\alpha^{(t)}_{nk} = \alpha^{(t)}_{k} \sum_{m=1}^{N} T_{nm}\, p^{(t-1)}(m|k)$.
2. For each link $l = 1, \dots, L$:
   (a) Draw a community distribution $\pi^{(t)}_{l} \sim \mathrm{Dir}(\eta^{(t)})$.
   (b) Draw a community assignment $z^{(t)}_{l} \sim \mathrm{Mult}(\pi^{(t)}_{l})$.
   (c) Draw link data $\tau^{(t)}_{l} \sim \mathrm{Mult}\bigl(p^{(t)}(:|z^{(t)}_{lk} = 1)\bigr)$.

Here, $\mathrm{Dir}(\cdot)$ and $\mathrm{Mult}(\cdot)$ denote the Dirichlet and multinomial distributions, respectively.

Although the generative processes in MDMC closely resemble those in LDA, there are several crucial differences for analyzing network structures. The most important point is that the prior distribution of $p^{(t)}$ depends on the transition matrix $T$ and the previous distribution $p^{(t-1)}$. This modeling instantiates the Markovian dynamics of random walkers who capture global network structures. The second point is that the generation probability of link $l$ connecting nodes $m$ and $n$ is proportional to $p^{(t)}(m|k)\, p^{(t)}(n|k)$ for the presumed community $k$. An intuition behind this modeling is that links are more often generated between central nodes within a community.
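To make these processes concrete, the following is a minimal sketch (ours, not the authors' code) of one Markov step of the MDMC generative model. We assume the transition matrix is stored so that T[n, m] = T_{nm} and that p_prev holds the agent distributions from the previous step; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mdmc_generate(T, p_prev, alpha_k, eta, L):
    """Sketch of one Markov time step of the MDMC generative process."""
    N, K = p_prev.shape
    # Random-walk prior: alpha_{nk} = alpha_k * sum_m T_{nm} p^{(t-1)}(m|k).
    alpha_nk = alpha_k[None, :] * (T @ p_prev)              # shape (N, K)
    # Step 1: draw a node distribution p(:|k) for each community k.
    p = np.stack([rng.dirichlet(alpha_nk[:, k]) for k in range(K)], axis=1)
    links, z = [], []
    for _ in range(L):
        pi = rng.dirichlet(eta)          # step 2(a): community distribution
        k = int(rng.choice(K, p=pi))     # step 2(b): community assignment
        # Step 2(c): a link is two node draws from p(:|k), so links form
        # preferentially between central (high-probability) nodes of k.
        m, n = rng.choice(N, size=2, p=p[:, k])
        links.append((int(m), int(n)))
        z.append(k)
    return p, links, z
```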
4 Proposed Model

In this section, we detail our proposed model, which we call TMDMC. Subsection 4.1 describes how to represent textual information and presents the generative processes of TMDMC. Parameter inference steps and procedures for implementing TMDMC are discussed in Subsection 4.2. TMDMC is extended to a semi-supervised model for node classification tasks in Subsection 4.3.

4.1 Generative Processes

In order to model the generative processes of documents, we consider documents as sets of words, i.e., the bag-of-words (BoW) representation. We incorporate the BoW-represented documents into the MDMC scheme with the following strategies: (1) We introduce "linked documents" $w^{(t)}_{li}$ for link $l$ by combining the BoW representations of its endpoint documents. (2) A topic $y^{(t)}_{li}$ is assigned to each word $i$ in a linked document with the topic distribution $\pi^{(t)}_{l}$, which is also used to predict the community assignment of the concerned link. (3) With the topic $y^{(t)}_{li}$ assigned in the previous step, we fit the observed word $w^{(t)}_{li}$ with the topic-specific word distribution $\phi^{(t)}$. This modeling is motivated by the observation that documents linked in networks should share similar topic distributions, which are also common to the communities of links. Consistency between link communities and word topics is guaranteed through the common distribution $\pi^{(t)}_{l}$.

TMDMC consists of the following additional generative processes at time $t$ on top of the original MDMC model:

1. For each community $k = 1, \dots, K$:
   (b) Draw word distributions $\phi^{(t)}(:|k) \sim \mathrm{Dir}(\beta^{(t)}_{:k})$.
2. For each link $l = 1, \dots, L$:
   (d) For each word $i = 1, \dots, N_l$:
       i. Draw a topic vector $y^{(t)}_{li} \sim \mathrm{Mult}(\pi^{(t)}_{l})$.
       ii. Draw a word $w^{(t)}_{li} \sim \mathrm{Mult}\bigl(\phi^{(t)}(:|y^{(t)}_{lik} = 1)\bigr)$.

The graphical model representation of TMDMC is shown in Figure 1. The model is composed of time-series blocks, where the upper and lower branches within each block represent LDA and the original MDMC, respectively. The inter-block links are introduced by the Markovian dynamics of random walkers inherited from the original MDMC.
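As a minimal illustration of strategy (1), the "linked documents" can be built by merging the endpoint BoW count vectors. The array names here are our assumptions, not the paper's notation.

```python
import numpy as np

def linked_documents(bow, links):
    """bow: (N_docs, V) word-count matrix; links: list of (m, n) node pairs.
    Returns an (L, V) matrix whose row l combines the BoW counts of the two
    documents joined by link l."""
    return np.array([bow[m] + bow[n] for (m, n) in links])
```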
Figure 1: Graphical model of TMDMC.

4.2 Parameter Inference

Approximation. While TMDMC theoretically consists of an infinite number of sequential blocks, each dependent on the agent probability distributions at previous time steps, it is hard to optimize all the parameters simultaneously. In order to make the model tractable, we approximate the hyper-parameters $\alpha^{(t)}_{nk}$ by their expectation values [Suzuki, 2019]. This approximation cuts the dependency between different blocks in the parameter inference steps, because it suffices to predict the expectation value of $p^{(t)}$ in order to infer the parameters at the next Markov time.
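A one-line sketch of this approximation follows (array names are ours): the Dirichlet hyper-parameters of the next block are computed directly from the expected agent distribution, so the blocks can be optimized one at a time.

```python
import numpy as np

def next_alpha(alpha_k, T, p_expect):
    """alpha_k: (K,) concentration scales; T: (N, N) with T[n, m] = T_{nm};
    p_expect: (N, K) expectation of p^{(t-1)}. Returns alpha^{(t)}_{nk}."""
    return alpha_k[None, :] * (T @ p_expect)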
Joint probability distribution. Thanks to the approximation described above, the parameters can be optimized separately at each Markov time. We obtain the joint probability distribution at time $t$ by marginalizing $p^{(t)}$, $\pi^{(t)}$, and $\phi^{(t)}$:

$$
P(\tau^{(t)}, w^{(t)}, z^{(t)}, y^{(t)}) \propto
\frac{\Gamma\!\left(\sum_{k=1}^{K}\eta^{(t)}_{k}\right)}{\prod_{k=1}^{K}\Gamma\!\left(\eta^{(t)}_{k}\right)}
\frac{\prod_{k=1}^{K}\Gamma\!\left(\eta^{(t)}_{k} + \sum_{l=1}^{L} z^{(t)}_{lk}\right)}{\Gamma\!\left(\sum_{k=1}^{K}\left(\eta^{(t)}_{k} + \sum_{l=1}^{L} z^{(t)}_{lk}\right)\right)}
\times \prod_{k=1}^{K}
\frac{\Gamma\!\left(\sum_{n=1}^{N}\alpha^{(t)}_{nk}\right)}{\prod_{n=1}^{N}\Gamma(\alpha^{(t)}_{nk})}
\frac{\prod_{n=1}^{N}\Gamma\!\left(\alpha^{(t)}_{nk} + (\tau z)^{(t)}_{nk}\right)}{\Gamma\!\left(\sum_{n=1}^{N}\left(\alpha^{(t)}_{nk} + (\tau z)^{(t)}_{nk}\right)\right)}
\times \frac{\Gamma\!\left(\sum_{w=1}^{V}\beta^{(t)}_{wk}\right)}{\prod_{w=1}^{V}\Gamma(\beta^{(t)}_{wk})}
\frac{\prod_{w=1}^{V}\Gamma\!\left(\beta^{(t)}_{wk} + Y^{(t)}_{wk}\right)}{\Gamma\!\left(\sum_{w=1}^{V}\left(\beta^{(t)}_{wk} + Y^{(t)}_{wk}\right)\right)}, \tag{1}
$$

with $(\tau z)^{(t)}_{nk} = \sum_{l=1}^{L} \tau^{(t)}_{ln} z^{(t)}_{lk}$ and $Y^{(t)}_{wk} = \sum_{(l,i)} \delta_{w, w_{li}}\, y^{(t)}_{lik}$, where $\delta_{ab}$ denotes the Kronecker delta.
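For implementers, the log of the unnormalized joint in Eq. (1) is a sum of log-gamma terms. Below is a sketch under our own naming conventions (z_onehot, tz, Ywk, and so on are assumptions, not the paper's code).

```python
import numpy as np
from scipy.special import gammaln

def log_joint(eta, alpha_nk, beta, z_onehot, tz, Ywk):
    """eta: (K,); alpha_nk: (N, K); beta, Ywk: (V, K); z_onehot: (L, K);
    tz: (N, K) with tz[n, k] = (tau z)_{nk}. Returns log of Eq. (1)."""
    zk = z_onehot.sum(axis=0)                                   # (K,)
    lp  = gammaln(eta.sum()) - gammaln(eta).sum()               # eta factor
    lp += gammaln(eta + zk).sum() - gammaln((eta + zk).sum())
    lp += (gammaln(alpha_nk.sum(0)) - gammaln(alpha_nk).sum(0)).sum()
    lp += (gammaln(alpha_nk + tz).sum(0)
           - gammaln((alpha_nk + tz).sum(0))).sum()
    lp += (gammaln(beta.sum(0)) - gammaln(beta).sum(0)).sum()
    lp += (gammaln(beta + Ywk).sum(0) - gammaln((beta + Ywk).sum(0))).sum()
    return lp
```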
Collapsed Gibbs Sampling. With the aid of Eq. (1), the community assignments $z^{(t)}$ and topic assignments $y^{(t)}$ are sampled as

$$
P(z^{(t)}_{lk} = 1 \mid \tau^{(t)}, w^{(t)}, z^{(t)}_{\setminus l}, y^{(t)}) \propto
\left(\eta^{(t)}_{k} + Y^{(t)}_{l\cdot k}\right)
\frac{\prod_{n:\, \tau^{(t)}_{ln} \neq 0} \prod_{u=0}^{\tau^{(t)}_{ln} - 1} \left(\alpha^{(t)}_{nk} + (\tau z)^{(t)}_{nk \setminus l} + u\right)}
{\prod_{u=0}^{T^{(t)}_{l} - 1} \left[\sum_{n=1}^{N} \left(\alpha^{(t)}_{nk} + (\tau z)^{(t)}_{nk \setminus l}\right) + u\right]}, \tag{2}
$$

and

$$
P(y^{(t)}_{lik} = 1 \mid \tau^{(t)}, w^{(t)}, z^{(t)}, y^{(t)}_{\setminus li}) \propto
\left(\eta^{(t)}_{k} + z^{(t)}_{lk} + Y^{(t)}_{l\cdot k \setminus i}\right)
\frac{\beta^{(t)}_{w_{li} k} + Y^{(t)}_{w_{li} k \setminus li}}{\sum_{w=1}^{V} \left(\beta^{(t)}_{wk} + Y^{(t)}_{wk \setminus li}\right)}, \tag{3}
$$

respectively, where $T^{(t)}_{l} = \sum_{n=1}^{N} \tau^{(t)}_{ln}$. Here, we have introduced $(\tau z)^{(t)}_{nk \setminus l} = \sum_{l' \neq l} \tau^{(t)}_{l'n} z^{(t)}_{l'k}$, $Y^{(t)}_{l\cdot k} = \sum_{i=1}^{N_l} y^{(t)}_{lik}$, $Y^{(t)}_{l\cdot k \setminus i} = \sum_{i' \neq i} y^{(t)}_{li'k}$, and $Y^{(t)}_{wk \setminus li} = \sum_{(l', i') \neq (l, i)} \delta_{w, w_{l'i'}}\, y^{(t)}_{l'i'k}$.
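The sampling formulas translate into short count-cache routines. The sketch below is our simplified rendering of Eqs. (2) and (3) for simple graphs, where each link has exactly two distinct endpoints ($\tau_{ln} \in \{0, 1\}$, $T_l = 2$); the caches tz (for $(\tau z)_{nk}$), Ylk (for $Y_{l\cdot k}$), and Ywk (for $Y_{wk}$) and all other names are assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_z(l, endpoints, z, tz, Ylk, eta, alpha_nk):
    """One collapsed-Gibbs draw of z_l (Eq. (2), simple-link case)."""
    m, n = endpoints[l]
    tz[[m, n], z[l]] -= 1                       # remove link l from the cache
    tot = alpha_nk.sum(axis=0) + tz.sum(axis=0)               # (K,)
    prob = (eta + Ylk[l]) \
         * (alpha_nk[m] + tz[m]) * (alpha_nk[n] + tz[n]) \
         / (tot * (tot + 1.0))
    z[l] = int(rng.choice(len(prob), p=prob / prob.sum()))
    tz[[m, n], z[l]] += 1                       # add it back with the new value
    return z[l]

def sample_y(l, i, w, y, z, Ylk, Ywk, eta, beta):
    """One collapsed-Gibbs draw of y_{li} for word w (Eq. (3))."""
    k_old = y[(l, i)]
    Ylk[l, k_old] -= 1
    Ywk[w, k_old] -= 1                          # remove word (l, i)
    onehot = np.zeros(Ylk.shape[1])
    onehot[z[l]] = 1.0                          # the z_{lk} term of Eq. (3)
    prob = (eta + onehot + Ylk[l]) \
         * (beta[w] + Ywk[w]) / (beta + Ywk).sum(axis=0)
    k_new = int(rng.choice(len(prob), p=prob / prob.sum()))
    y[(l, i)] = k_new
    Ylk[l, k_new] += 1
    Ywk[w, k_new] += 1
    return k_new
```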
Update equations. In this paper, we update the hyper-parameters $\alpha^{(t)}_{k}$, $\beta^{(t)}_{wk}$, and $\eta^{(t)}_{k}$ at the end of each Markov step by approximately maximizing the likelihood function with Newton's method and Minka's fixed-point iteration [Minka, 2000]. First, the parameter $\alpha^{(t)}_{k}$ is updated as

$$
\alpha^{(t+1)}_{k} = \alpha^{(t)}_{k} - \frac{F_{k}(\alpha^{(t)}_{k})}{F'_{k}(\alpha^{(t)}_{k})}, \tag{4}
$$

with the log-derivative of the likelihood function $F_{k}(\alpha^{(t)}_{k}) = \frac{d}{d\alpha^{(t)}_{k}} \ln P(\tau^{(t)}, w^{(t)} \mid z^{(t)}, y^{(t)})$. Second, the parameters $\beta^{(t+1)}_{wk}$ can be estimated with Minka's fixed-point iteration as

$$
\beta^{(t+1)}_{wk} = \frac{\left[\Psi\!\left(\beta^{(t)}_{wk} + Y^{(t)}_{wk}\right) - \Psi\!\left(\beta^{(t)}_{wk}\right)\right] \beta^{(t)}_{wk}}{\Psi\!\left(\beta^{(t)}_{k} + Y^{(t)}_{k}\right) - \Psi\!\left(\beta^{(t)}_{k}\right)}, \tag{5}
$$

where $\Psi(\cdot)$ denotes the digamma function, $\beta^{(t)}_{k} = \sum_{w=1}^{V} \beta^{(t)}_{wk}$, and $Y^{(t)}_{k} = \sum_{w=1}^{V} Y^{(t)}_{wk}$. Finally, the parameter $\eta^{(t+1)}_{k}$ for the next time step is also obtained with Minka's fixed-point iteration as

$$
\eta^{(t+1)}_{k} = \frac{\sum_{l=1}^{L} \left[\Psi\!\left(\eta^{(t)}_{k} + Y^{(t)}_{l\cdot k} + z^{(t)}_{lk}\right) - \Psi\!\left(\eta^{(t)}_{k}\right)\right] \eta^{(t)}_{k}}{\sum_{l=1}^{L} \left[\Psi\!\left(\eta^{(t)}_{\mathrm{sum}} + Y^{(t)}_{l\cdot\cdot} + 1\right) - \Psi\!\left(\eta^{(t)}_{\mathrm{sum}}\right)\right]}, \tag{6}
$$

with $\eta^{(t)}_{\mathrm{sum}} = \sum_{k=1}^{K} \eta^{(t)}_{k}$ and $Y^{(t)}_{l\cdot\cdot} = \sum_{k=1}^{K} Y^{(t)}_{l\cdot k}$.
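A sketch of Eqs. (5) and (6) follows, assuming scipy.special.digamma for $\Psi$; the Newton step of Eq. (4) is omitted because the derivative $F_k$ is model-specific. Array names are illustrative.

```python
import numpy as np
from scipy.special import digamma as psi

def update_beta(beta, Ywk):
    """beta, Ywk: (V, K). One fixed-point step of Eq. (5)."""
    num = psi(beta + Ywk) - psi(beta)                          # (V, K)
    den = psi(beta.sum(0) + Ywk.sum(0)) - psi(beta.sum(0))     # (K,)
    return beta * num / den[None, :]

def update_eta(eta, Ylk, z_onehot):
    """eta: (K,); Ylk, z_onehot: (L, K). One fixed-point step of Eq. (6)."""
    num = (psi(eta[None, :] + Ylk + z_onehot) - psi(eta)[None, :]).sum(0)
    eta_sum = eta.sum()
    den = (psi(eta_sum + Ylk.sum(1) + 1.0) - psi(eta_sum)).sum()
    return eta * num / den
```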
Inference of parameters at the next time step. We can obtain the expectation value of the probability $p^{(t)}(n|k)$ from its posterior Dirichlet distribution as

$$
p^{(t)}(n|k) = \frac{\alpha^{(t)}_{nk} + (\tau z)^{(t)}_{nk}}{\sum_{n'=1}^{N} \left[\alpha^{(t)}_{n'k} + (\tau z)^{(t)}_{n'k}\right]}. \tag{7}
$$

The updated parameters $\alpha^{(t+1)}_{k}$ and the estimate (7) of $p^{(t)}$ are used to compute the parameters $\alpha^{(t+1)}_{nk}$ of the prior distribution of $p^{(t+1)}$ at the next time.
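In code, Eq. (7) is simply a normalization of the posterior Dirichlet counts; a sketch for NumPy arrays, with our array names:

```python
def estimate_p(alpha_nk, tz):
    """alpha_nk, tz: (N, K). Posterior mean of p^{(t)}(n|k), Eq. (7)."""
    post = alpha_nk + tz
    return post / post.sum(axis=0, keepdims=True)
```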
Algorithm. The sampling algorithms for the latent variables $z^{(t)}$ and $y^{(t)}$ and the update equations for the parameters $\alpha^{(t)}$, $\beta^{(t)}$, $\eta^{(t)}$, and $p^{(t)}$ are summarized in Algorithm 1. The numbers of time steps and Monte Carlo samples are denoted by $T_{\mathrm{step}}$ and $S$, respectively.
Algorithm 1 TMDMC

  Initialize α^(1), β^(1), η^(1), and p^(0)
  for t = 1, ..., T_step do
      Initialize z^(t) and y^(t)
      for s = 1, ..., S do
          for l = 1, ..., L do
              Draw z^(t)_l with Eq. (2)
              for i = 1, ..., N_l do
                  Draw y^(t)_li with Eq. (3)
              end for
          end for
      end for
      Update α^(t+1) with Eq. (4)
      Update β^(t+1) with Eq. (5)
      Update η^(t+1) with Eq. (6)
      Estimate p^(t) with Eq. (7)
  end for

4.3 Semi-Supervised Extension

We apply TMDMC to semi-supervised node-classification tasks, where the class labels of a small portion of nodes are given in order to predict those of the remaining nodes. It is straightforward to generalize generative models to semi-supervised setups once class labels are translated into latent variables. In the following experiments, we determine the latent variables from labeled data as follows (a minimal sketch of this protocol is given at the end of this subsection): (1) The community assignment $z^{(t)}_{l}$ of link $l$ is set to the corresponding class when either of the endpoint nodes is labeled; when both endpoint nodes are labeled, one of the class labels is chosen at random. (2) The topic assignments $y^{(t)}$ of all the words in labeled documents are set to the ground-truth label. Protocols (1) and (2) can be used for LDA-based models and the original MDMC as well, because they share common latent variables with TMDMC.

Community structures can be detected by TMDMC in two different ways, in terms of community and topic distributions. First, the agent probability distributions $p^{(t)}$ can be used to predict the community distribution of node $n$ as $p^{(t)}_{kn} = p^{(t)}(n|k)\, \pi^{(t)}_{k} / p^{(t)}_{n}$, where $\pi^{(t)}_{k} = \eta^{(t)}_{k} / \sum_{k'=1}^{K} \eta^{(t)}_{k'}$ and $p^{(t)}_{n} = \sum_{k=1}^{K} p^{(t)}(n|k)\, \pi^{(t)}_{k}$. Another way to quantify community assignments is to consider the topic distributions of the words contained in the corresponding documents. When we focus on document $d$ with $N_d$ words, the posterior topic distribution is given by $\theta^{(t)}_{dk} = (Y^{(t)}_{d\cdot k} + \eta^{(t)}_{k}) / \sum_{k'=1}^{K} (Y^{(t)}_{d\cdot k'} + \eta^{(t)}_{k'})$ with $Y^{(t)}_{d\cdot k} = \sum_{i=1}^{N_d} y^{(t)}_{dik}$. The expectation value of $\theta^{(t)}_{dk}$ can be evaluated with the Gibbs samplers. The first indicator, $p^{(t)}_{kn}$, is used in previous MDMC papers, while the second one, $\theta^{(t)}_{dk}$, is useful in LDA-based models.
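The sketch below illustrates the seeding protocol of this subsection under our own naming: links touching labeled endpoints get clamped community assignments (random tie-break when both ends are labeled), mirroring protocol (1); protocol (2) would clamp the word topics of labeled documents analogously.

```python
import numpy as np

rng = np.random.default_rng(2)

def seed_link_communities(links, labels):
    """links: list of (m, n) node pairs; labels: dict node -> class id.
    Returns {link index: clamped community} for links touching labeled nodes."""
    z_fixed = {}
    for l, (m, n) in enumerate(links):
        cands = [labels[v] for v in (m, n) if v in labels]
        if cands:
            z_fixed[l] = int(rng.choice(cands))  # random tie-break
    return z_fixed
```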
5 Experiments

In this section, we evaluate the performance of TMDMC in semi-supervised node classification tasks with real-world citation networks. We describe the experimental setups and baseline methods in Subsections 5.1 and 5.2, respectively. Classification accuracies of TMDMC are presented and compared with those of various methods in Subsection 5.3.

Dataset        Nodes  Edges  Clusters  Classes  Features
Citeseer       3,312  4,732  437       6        3,703
Citeseer-LCC   2,110  3,757  1         6        3,703
Cora           2,708  5,429  77        7        1,433
Cora-LCC       2,485  5,209  1         7        1,433

Table 1: Dataset statistics.
Method     Text  Graph  Model         Classifier
TSVM       ✓     -      Vector-space  One-vs-rest
LP         -     ✓      Graph-based   Neighbor-voting
Planetoid  ✓     ✓      Neural        Softmax
LDA        ✓     -      Generative    Argmax(p_kd)
MDMC       -     ✓      Generative    Argmax(p_kn)
gRTM       ✓     ✓      Generative    Argmax(p_kd)
TMDMC      ✓     ✓      Generative    Argmax(p_kn, p_kd)

Table 2: Summary of baseline and proposed methods.
5.1 Experimental Setups

We evaluate TMDMC on two citation networks: Citeseer and Cora [Sen et al., 2008]. We also analyze the largest connected components (LCCs) of these networks to study the effects of connectivity. The datasets contain BoW-represented documents and the citation links between them. Each document is classified into a single ground-truth class. We consider the sets of words in the documents as feature vectors, and construct undirected graphs from the citation links. The statistics of these datasets are summarized in Table 1.

We conduct semi-supervised node classification experiments in the following setup: We randomly sample a few percent of the nodes of each class as labeled data, and the rest of the nodes are left as unlabeled data. The classes of the labeled nodes are translated into fixed values of the latent variables with the protocol described in Subsection 4.3. Model parameters are optimized in a transductive manner, where both labeled and unlabeled data are observed in the parameter inference steps. The predictive performance of a model is evaluated by the accuracy of the classes assigned to the unlabeled data. These procedures are repeated with different random data splits for each dataset to allow statistical analysis.

The model parameters in TMDMC are set as $\alpha^{(0)}_{k} = 0.L$ and $\beta^{(0)}_{vk} = \eta^{(0)}_{k} = 1$ for all $k$ and $v$, and are updated as described in Subsection 4.2. The number of communities $K$ and the vocabulary size $V$ are read from the ground-truth classes and the dimension of the feature vectors, respectively. Gibbs sampling is performed with sample size $S = 100$ and burn-in period $S_{\mathrm{burn}} = 100$. The maximum Markov time is set to $T_{\mathrm{step}} = 20$.

5.2 Baseline Methods

We compare TMDMC against previous methods developed for transductive classification: the transductive support vector machine (TSVM) [Joachims, 1999], label propagation (LP) [Zhu et al., 2003], and Planetoid [Yang et al., 2016b]. TSVM and LP can leverage only feature vectors and graph structures, respectively, while Planetoid utilizes both types of data. We build one-vs-the-rest classifiers with the TSVM implementation in SVMLight (http://svmlight.joachims.org/), and use the python-igraph library (https://igraph.org/2014/02/04/igraph-0.7-python.html) for LP. For Planetoid, we use the transductive version provided by the authors (https://github.com/kimiyoung/planetoid). We have also implemented LDA [Blei et al., 2003], MDMC [Suzuki, 2019], and gRTM [Chen et al., 2015] to compare TMDMC with other generative models. Labeled data are used in the same way as described above when a model has latent variables corresponding to $y$ and $z$ in TMDMC. Similarly, the initial values and update equations of the hyper-parameters in LDA, MDMC, and gRTM are common with those of TMDMC. We use the hinge loss in gRTM with cost parameters $l = 1$ and $c = 4$. The baseline and proposed methods are summarized in Table 2, where they are divided into upper and lower rows depending on whether or not they are generative models.

Table 3: Classification accuracy in percent with -labeled data.

Table 4: Classification accuracy in percent with -labeled data.

5.3 Classification Results

Table 3 shows the means and standard errors of the classification accuracies on the -labeled datasets over the repeated trials. Values in square brackets are the results for the LCCs of the corresponding datasets. Values are underscored (bold) when they are the highest among all (generative) models. In terms of expectation values, TMDMC outperforms the other generative models, i.e., LDA, MDMC, and gRTM, on Citeseer-LCC, Cora, and Cora-LCC.
In addition, TMDMC achieves the best score on the Cora datasets even when non-generative models are included.

Figure 2: Communities of Citeseer-LCC with TMDMC.

Figure 3: Communities of Cora-LCC with TMDMC.

Citeseer is characteristic in that feature-based methods clearly outperform genuinely graph-based methods. Besides, the performance of LP, Planetoid, MDMC, and TMDMC, all of which are based on random walks, is significantly improved when the LCC is considered. This indicates that the observed graph structure is not fully consistent with the ground-truth labels, while the texts are informative for classification. In contrast to Citeseer, both feature- and graph-based methods perform evenly well on Cora. It is also worth noting that TMDMC surpasses Planetoid, which employs deep neural architectures, on Cora and Cora-LCC. These observations imply that TMDMC can effectively learn community patterns of document networks from limited numbers of labeled data, especially when texts and graphs can collaboratively predict the ground-truth class labels.

In order to evaluate the effect of the labeled-data size, we test the models with increased proportions of labeled data. Table 4 reports the classification accuracies with labeled data. While TMDMC still performs better than the other generative models on Citeseer-LCC, Cora, and Cora-LCC, Planetoid attains the highest scores on all the datasets. In particular, the margin between Planetoid and TMDMC widens on Citeseer. We conjecture that the high-capacity neural architectures employed by Planetoid can discern informative and uninformative features given the increased training data. Such discernibility is crucial especially for the Citeseer datasets, because the observed graph may sometimes cause confusion in predicting correct class labels.

We finally visualize typical results for -labeled Citeseer-LCC (Cora-LCC) in Figure 2 (3) to qualitatively analyze the characteristics of TMDMC. Node sizes are scaled by degree, and node colors represent the class labels assigned by TMDMC. Layouts are computed with the Fruchterman-Reingold algorithm. Circle, triangle, and square nodes denote correctly classified, misclassified, and labeled nodes, respectively. As expected, nodes with large degrees tend to be correctly classified by TMDMC, whose random walkers take account of central nodes in the link generation processes. Besides, nodes of the same label tend to aggregate into clusters, even in misclassified cases, because of the associative nature of the TMDMC modeling. This second feature results in misclassification on Citeseer, where peripheral nodes do not necessarily belong to the same community as central ones. On Cora, TMDMC fails to distinguish the "blue" and "cyan" communities, which are densely connected in the network and require classifiers to find additional discriminating features.

6 Conclusions

In this paper, we have proposed a novel generative model named TMDMC for analyzing document networks. The proposed model unifies random-walk-based community detection and topic modeling to jointly model graph structures and textual information. Random walkers quantify node centrality, which is incorporated into the link generation processes. We have compared TMDMC with previous probabilistic models, i.e., LDA, MDMC, and gRTM, in semi-supervised classification tasks. TMDMC outperforms the other probabilistic models in classifying nodes in connected components of real-world citation networks. Besides, TMDMC surpasses Planetoid, a deep-neural model, on the -labeled Cora dataset.
This indicates that TMDMC detects community structures from a limited number of labeled data. As future work, it is promising to enhance the model capacity of TMDMC with neural architectures to discern informative data, in a similar way to NRTM [Bai et al., 2018]. This extension could also strengthen the modeling of mutual interactions between textual and graph natures.

Acknowledgments
The author would like to thank Xule Qiu and Seiya Inagi for fruitful discussions and comments.

References

[Bai et al., 2018] Haoli Bai, Zhuangbin Chen, Michael R. Lyu, Irwin King, and Zenglin Xu. Neural relational topic models for scientific article analysis. In International Conference on Information and Knowledge Management (CIKM'18), pages 27–36, 2018.

[Belkin et al., 2006] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

[Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[Brin and Page, 1998] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, 1998.

[Chang and Blei, 2009] Jonathan Chang and David Blei. Relational topic models for document networks. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 81–88, 2009.

[Chen et al., 2015] Ning Chen, Jun Zhu, Fei Xia, and Bo Zhang. Discriminative relational topic models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(5):973–986, 2015.

[Cohn and Hofmann, 2001] David A. Cohn and Thomas Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems (NIPS'00), pages 430–436, 2001.

[Erosheva et al., 2004] Elena Erosheva, Stephen Fienberg, and John Lafferty. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, 101:5220–5227, 2004.

[Fortunato and Hric, 2016] Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics Reports, 659:1–44, 2016.

[Joachims, 1999] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (ICML'99), pages 200–209, 1999.

[Kipf and Welling, 2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR'17), 2017.

[Liu et al., 2009] Yan Liu, Alexandru Niculescu-Mizil, and Wojciech Gryc. Topic-link LDA: Joint models of topic and author community. In Proceedings of the 26th International Conference on Machine Learning (ICML'09), pages 665–672, 2009.

[Minka, 2000] Thomas P. Minka. Estimating a Dirichlet distribution. https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf, 2000.

[Nallapati and Cohen, 2008] Ramesh Nallapati and William W. Cohen. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In International Conference on Weblogs and Social Media (ICWSM'08), pages 84–92, 2008.

[Okamoto and Qiu, 2018] Hiroshi Okamoto and Xule Qiu. Community detection by modular decomposition of random walk. In The 7th International Conference on Complex Networks and Their Applications, page 59, 2018.

[Okamoto and Qiu, 2019] Hiroshi Okamoto and Xule Qiu. Modular decomposition of Markov chain: detecting hierarchical organization of pervasive communities. arXiv preprint, arXiv:1909.07066, 2019.

[Perozzi et al., 2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14), pages 701–710, 2014.

[Sen et al., 2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29:93–93, 2008.

[Suzuki, 2019] Takafumi J. Suzuki. Bayesian modeling of random walker for community detection in networks. arXiv preprint, arXiv:1910.11587, 2019.

[Terragni et al., 2020] Silvia Terragni, Elisabetta Fersini, and Enza Messina. Constrained relational topic models. Information Sciences, 512:581–594, 2020.

[Weston et al., 2012] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. 2012.

[Xu et al., 2019] Bingbing Xu, Huawei Shen, Qi Cao, Keting Cen, and Xueqi Cheng. Graph convolutional networks using heat kernel for semi-supervised learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI'19), pages 1928–1934. AAAI Press, 2019.

[Yang et al., 2016a] Weiwei Yang, Jordan Boyd-Graber, and Philip Resnik. A discriminative topic model using document network structure. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL'16), pages 686–696, 2016.

[Yang et al., 2016b] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), pages 40–48, 2016.

[Zhu et al., 2003] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML'03), pages 912–919, 2003.