A Framework for Comparing Groups of Documents
Arun S. Maiya
Institute for Defense Analyses — Alexandria, VA, USA [email protected]
Abstract
We present a general framework for comparing multiple groups of documents. A bipartite graph model is proposed where document groups are represented as one node set and the comparison criteria are represented as the other node set. Using this model, we present basic algorithms to extract insights into similarities and differences among the document groups. Finally, we demonstrate the versatility of our framework through an analysis of NSF funding programs for basic research.
Given multiple sets (or groups) of documents, it is often necessary to compare the groups to identify similarities and differences along different dimensions. In this work, we present a general framework to perform such comparisons for the extraction of important insights. Indeed, many real-world tasks can be framed as a problem of comparing two or more groups of documents. Here, we provide two motivating examples.
1. Program Reviews.
To better direct research efforts, funding organizations such as the National Science Foundation (NSF), the National Institutes of Health (NIH), and the Department of Defense (DoD) are often in the position of reviewing research programs via their artifacts (e.g., grant abstracts, published papers, and other research descriptions). Such reviews might involve identifying overlaps across different programs, which may indicate a duplication of effort. They may also involve the identification of unique, emerging, or diminishing topics. A “document group” here could be defined as a particular research program that funds many organizations, the totality of funded research conducted by a specific organization, or all research associated with a particular time period (e.g., a fiscal year). In all cases, the objective is to draw comparisons between groups by comparing the document sets associated with them.
2. Intelligence.
In the areas of defense and intelligence, document sets are sometimes obtained from different sources or entities. For instance, the U.S. Armed Forces sometimes seize documents during raids of terrorist strongholds. Similarities between two document sets (each captured from a different source) can potentially be used to infer a non-obvious association between the sources.

Of course, there are numerous additional examples across many domains (e.g., comparing different news sources, comparing the reviews for several products, etc.). Given the abundance of real-world applications as illustrated above, it is surprising, then, that there are no existing general-purpose approaches for drawing such comparisons. While there is some previous work on the comparison of document sets (referred to as comparative text mining), these existing approaches lack the generality to be widely applicable across different use case scenarios with different comparison criteria. Moreover, much of the work in the area focuses largely on the summarization of shared or unshared topics among document groups (e.g., Wan et al. (2011), Huang et al. (2011), Campr and Ježek (2013), Wang et al. (2012), Zhai et al. (2004)). That is, the problem of drawing multi-faceted comparisons among the groups themselves is not typically addressed. This, then, motivates our development of a general-purpose model for comparisons of document sets along arbitrary dimensions. We use this model for the identification of similarities, differences, trends, and anomalies among large groups of documents. We begin by formally describing our model.
As input, we are given several groups of documents, and our task is to compare them. We now formally define these document groups and the criteria used to compare them. Let D = {d_1, d_2, ..., d_N} be a document collection comprising the totality of documents under consideration, where N is the size. Let D^P be a partition of D representing the document groups.

(See Document Exploitation (DOCEX) at http://en.wikipedia.org for more information.)

Definition 1. A document group is a subset D^P_i ∈ D^P (where index i ∈ {1, ..., |D^P|}).

Each document group in D^P, for instance, might represent articles associated with a particular organization (e.g., a university), a research funding source (e.g., an NSF or DARPA program), or a time period (e.g., a fiscal year). Document groups are compared using comparison criteria, D^C, a family of subsets of D.

Definition 2. A comparison criterion is a subset D^C_i ∈ D^C (where index i ∈ {1, ..., |D^C|}).

Intuitively, each subset of D^C represents a set of documents sharing some attribute. Our model allows great flexibility in how D^C is defined. For instance, D^C might be defined by the named entities mentioned within documents (e.g., each subset contains documents that mention a particular person or organization of interest). For the present work, we define D^C by topics discovered using latent Dirichlet allocation, or LDA (Blei et al., 2003).

LDA Topics as Comparison Criteria.
Probabilistic topic modeling algorithms like LDA discover latent themes (i.e., topics) in document collections. By using these discovered topics as the comparison criteria, we can compare arbitrary groups of documents by the themes and subject areas comprising them. Let K be the number of topics or themes in D. Each document in D is composed of a sequence of words: d_i = (s_{i1}, s_{i2}, ..., s_{iN_i}), where N_i is the number of words in d_i and i ∈ {1, ..., N}. V = ∪_{i=1}^{N} f(d_i) is the vocabulary of D, where f(·) takes a sequence of elements and returns a set. LDA takes K and D (including its components such as V) as input and produces two matrices as output, one of which is θ. The matrix θ ∈ R^{N×K} is the document-topic distribution matrix and shows the distribution of topics within each document. Each row of the matrix represents a probability distribution. D^C is constructed using K subsets of documents, each of which represents a set of documents pertaining largely to the same topic. That is, for t ∈ {1, ..., K} and i ∈ {1, ..., N}, each subset D^C_t ∈ D^C is comprised of all documents d_i where t = argmax_x θ_{ix}. Having defined the document groups D^P and the comparison criteria D^C, we now construct a bipartite graph model used to perform comparisons.

A Bipartite Graph Model.
Our objective is to compare the document groups in D^P based on D^C. We do so by representing D^P and D^C as a weighted bipartite graph, G = (P, C, E, w), where P and C are disjoint sets of nodes, E is the edge set, and w : E → Z^+ gives the edge weights. Each subset of D^P is represented as a node in P, and each subset of D^C is represented as a node in C. (Note that D^C, when defined in this way, is also a partition of D.) Let α : P → D^P and β : C → D^C be functions that map nodes to the document subsets they represent. Then, the edge set E is {(u, v) | u ∈ P, v ∈ C, α(u) ∩ β(v) ≠ ∅}, and the edge weight for any two nodes u ∈ P and v ∈ C is w((u, v)) = |α(u) ∩ β(v)|. Concisely, each weighted edge in G between a document group (in P) and a topic (in C) represents the number of documents shared by the two sets. Figure 1 shows a toy illustration of the model. Each node in P is shown in black and represents a subset of D^P (i.e., a document group). Each node in C is shown in gray and represents a subset of D^C (i.e., a document cluster pertaining primarily to the same topic). Each edge represents the intersection of the two subsets it connects. In the next section, we will describe basic algorithms on such bipartite graphs capable of yielding important insights into the similarities and differences among document groups.
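As a concrete sketch of the two steps described above (assigning each document to its dominant topic to form the comparison criteria, then building the weighted bipartite graph), consider the following minimal implementation. All names and the dict-based graph representation are illustrative choices, not from the paper:

```python
from collections import defaultdict

def build_comparison_criteria(theta):
    """Assign each document i to the subset of its dominant topic,
    t = argmax_x theta[i][x].  theta is an N x K document-topic matrix
    (list of rows, each a probability distribution over K topics)."""
    criteria = defaultdict(set)
    for i, row in enumerate(theta):
        t = max(range(len(row)), key=lambda x: row[x])
        criteria[t].add(i)
    return dict(criteria)

def build_bipartite_graph(groups, criteria):
    """Edge weights of G: w((u, v)) = |alpha(u) & beta(v)|, with an edge
    present only when the intersection is non-empty."""
    return {(g, t): len(docs_g & docs_t)
            for g, docs_g in groups.items()
            for t, docs_t in criteria.items()
            if docs_g & docs_t}

# Toy example: 4 documents, 2 topics, 2 document groups
theta = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
criteria = build_comparison_criteria(theta)        # {0: {0, 2}, 1: {1, 3}}
groups = {"ProgramA": {0, 1}, "ProgramB": {2, 3}}
print(build_bipartite_graph(groups, criteria))
# {('ProgramA', 0): 1, ('ProgramA', 1): 1, ('ProgramB', 0): 1, ('ProgramB', 1): 1}
```

In practice, θ would come from an LDA implementation such as MALLET rather than being specified by hand.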
Figure 1: Toy Illustration of the Bipartite Graph Model.
Each black node (i.e., node ∈ P) represents a document group. Each gray node (i.e., node ∈ C) represents a cluster of documents pertaining primarily to the same topic.

We focus on three basic operations in this work.
Node Entropy.
Let w = (w_1, ..., w_m) be the vector of weights for all edges incident to some node v ∈ P ∪ C. The entropy H of v is: H(v) = −Σ_i p_i log_m(p_i), where p_i = w_i / Σ_j w_j and i, j ∈ {1, ..., m}. A similar formulation was employed in Eagle et al. (2010). Intuitively, if v ∈ P, H(v) measures the extent to which the document group is concentrated around a small number of topics (lower values of H(v) mean more concentrated). Similarly, if v ∈ C, it measures the extent to which a topic is concentrated around a small number of document groups.

Node Similarity.
Given a graph G, there are many ways to measure the similarity of two nodes based on their connections. Such measures can be used to infer similarity (and dissimilarity) among document groups. However, existing methods are not well-suited for the task of document group comparison. The well-known SimRank algorithm (Jeh and Widom, 2002) ignores edge weights, and neither SimRank nor its extension, SimRank++ (Antonellis et al., 2008), scales to larger graphs. SimRank++ and ASCOS (Chen and Giles, 2013) do incorporate edge weights, but in ways that are not appropriate for document group comparisons. For instance, both SimRank++ and ASCOS incorporate magnitude in the similarity computation. Consider the case where document groups are defined as research labs. ASCOS and SimRank++ will measure large research labs and small research labs as less similar when in fact they may publish nearly identical lines of research. Finally, under these existing methods, document groups sharing zero topics in common could still be considered similar, which is undesirable here. For these reasons, we formulate similarity as follows. Let N_G(·) be a function that returns the neighbors of a given node in G. Given two nodes u, v ∈ P, let L_{u,v} = N_G(u) ∪ N_G(v) and let x : I → L_{u,v} be the indexing function for L_{u,v}, where I is the index set of L_{u,v}. We construct two vectors, a and b, where a_k = w((u, x(k))), b_k = w((v, x(k))), and k ∈ I. Each vector is essentially a sequence of weights for edges between u, v ∈ P and each node in L_{u,v}. Similarity of two nodes is measured using the cosine similarity of their corresponding sequences, (a · b) / (‖a‖‖b‖), which we compute using a function sim(·, ·). Thus, document groups are considered more similar when they have similar sets of topics in similar proportions. As we will show later, this simple solution, based on item-based collaborative filtering (Sarwar et al., 2001), is surprisingly effective at inferring similarity among document groups in G.
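A minimal sketch of these two operations, node entropy and the cosine-based sim function, follows. The dict-of-edge-weights representation and all names are our own illustrative choices; absent edges contribute weight 0, so groups with no shared topics score exactly zero:

```python
import math

def node_entropy(edge_weights):
    """H(v) = -sum_i p_i log_m(p_i) with p_i = w_i / sum_j w_j and
    m = len(edge_weights).  The log base m normalizes H to [0, 1];
    lower values mean the node's weight is more concentrated."""
    total = sum(edge_weights)
    m = len(edge_weights)
    if m <= 1:
        return 0.0
    return -sum((w / total) * math.log(w / total, m)
                for w in edge_weights if w > 0)

def sim(u, v, weights):
    """Cosine similarity of group nodes u, v over the union of their
    topic neighbors (L_{u,v}).  weights maps (group, topic) -> weight."""
    topics = sorted({t for (g, t) in weights if g in (u, v)})
    a = [weights.get((u, t), 0) for t in topics]
    b = [weights.get((v, t), 0) for t in topics]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

w = {("A", 0): 2, ("A", 1): 1, ("B", 0): 4, ("B", 1): 2, ("C", 2): 3}
print(node_entropy([5, 5, 5, 5]))  # maximal (approx. 1.0): evenly spread
print(sim("A", "B", w))            # approx. 1.0: same topics, same proportions
print(sim("A", "C", w))            # 0.0: no shared topics
```

Note that groups A and B above score near 1.0 despite B having twice A's volume, illustrating the magnitude-insensitivity argued for in the text.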
Node Clusters.
Identifying clusters of related nodes in the bipartite graph G can show how document groups form larger classes. However, we find that G is typically fairly dense. For this reason, partitioning of the one-mode projection of G and other standard bipartite graph clustering techniques (e.g., Dhillon (2001) and Sun et al. (2009)) are rendered less effective. We instead employ a different tack and exploit the node similarities computed earlier. We transform G into a new weighted graph G_P = (P, E_P, w_sim) where E_P = {(u, v) | u, v ∈ P, sim(u, v) > ξ}, ξ is a predefined threshold, and w_sim is the edge weight function (i.e., w_sim = sim). Thus, G_P is the similarity graph of document groups. ξ = 0. was used as the threshold for our analyses. To find clusters in G_P, we employ the Louvain algorithm, a heuristic method based on modularity optimization (Blondel et al., 2008). Modularity measures the fraction of edges falling within clusters as compared to the expected fraction if edges were distributed evenly in the graph (Newman, 2006). The algorithm initially assigns each node to its own cluster. At each iteration, in a local and greedy fashion, nodes are re-assigned to clusters with which they achieve the highest modularity.
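The construction of the similarity graph G_P can be sketched as below. The cosine helper, the threshold default of 0.5, and all names are illustrative assumptions; the Louvain step itself is omitted here (a library implementation, e.g. python-louvain or NetworkX's community module, would then be run on the resulting graph):

```python
import math
from itertools import combinations

def cosine(u, v, weights):
    """Cosine similarity of two group nodes over the union of their topics."""
    topics = sorted({t for (g, t) in weights if g in (u, v)})
    a = [weights.get((u, t), 0) for t in topics]
    b = [weights.get((v, t), 0) for t in topics]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_graph(group_nodes, weights, xi=0.5):
    """Build G_P = (P, E_P, w_sim): keep only pairs whose similarity
    exceeds the threshold xi (0.5 is an illustrative default, not
    necessarily the paper's value)."""
    return {(u, v): cosine(u, v, weights)
            for u, v in combinations(sorted(group_nodes), 2)
            if cosine(u, v, weights) > xi}

w = {("A", 0): 2, ("A", 1): 1, ("B", 0): 4, ("B", 1): 2, ("C", 2): 3}
print(similarity_graph({"A", "B", "C"}, w))  # only the (A, B) edge survives
```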
As a realistic and informative case study, we utilize our model to characterize funding programs of the National Science Foundation (NSF). This corpus consists of 132,372 grant abstracts describing awards for basic research and other support funded by the NSF between the years 1990 and 2002 (Bache and Lichman, 2013). Each award is associated with both a program element (i.e., funding source) and a date. We define document groups in two ways: by program element and by calendar year. For comparison criteria, we used topics discovered with the MALLET implementation of LDA (McCallum, 2002) using K = 400 as the number of topics. All other parameters were left as defaults. The NSF corpus possesses unique properties that lend themselves to experimental evaluation. For instance, program elements are not only associated with specific sets of research topics but are named based on the content of the program. This provides a measure of ground truth against which we can validate our model. We structure our analyses around specific questions, which now follow.

Which NSF programs are focused on specific areas and which are not?
When defining document groups as program elements (i.e., each NSF program is a node in P), node entropy can be used to answer this question. Table 1 shows examples of program elements most and least associated with specific topics, as measured by entropy. For example, one low-entropy program is largely focused on a single linguistics topic (labeled by LDA with words such as “language,” “languages,” and “linguistic”). By contrast, the Australia program (high entropy) was designed to support US-Australia cooperative research across many fields, as correctly inferred by our model.
Low Entropy Program Elements (Primary LDA Topic): “language, languages, linguistic”; “network, connection, internet”
High Entropy Program Elements (Primary LDA Topic): (many topics & disciplines)

Table 1: Examples of High/Low Entropy Programs.
Which research areas are growing/emerging?
When defining document groups as calendar years (instead of program elements), low-entropy nodes in C are topics concentrated around certain years. (Data for the years 1989 and 2003 in this publicly available corpus were partially missing and were omitted from some analyses.) Concentrations in later years indicate growth. The LDA-discovered topic nanotechnology is among the lowest-entropy topics (i.e., an outlier topic with respect to entropy). As shown in Figure 2, the number of nanotechnology grants drastically increased in proportion through 2002. This result is consistent with history, as the National Nanotechnology Initiative was proposed in the late 1990s to promote nanotechnology R&D. One could also measure such trends using budget allocations by incorporating the award amounts into the edge weights of G.

Figure 2: Uptrend in Nanotechnology (y-axis: percentage of total grants). Our model correctly identifies the surge in nanotechnology R&D beginning in the late 1990s.
Given an NSF program, to which other programs is it most similar?

As described in Section 3, when each node in P represents an NSF program, our model can easily identify the programs most similar to a given program. For instance, Table 2 shows the top three most similar programs to both the Theoretical Physics and Ecology programs. Results agree with intuition. For each NSF program, we identified the top n most similar programs ranked by our sim(·, ·) function, for three values of n. These programs were manually judged for relatedness, and the Mean Average Precision (MAP), a standard performance metric for ranking tasks in information retrieval, was computed. We were unsuccessful in evaluating alternative weighted similarity measures mentioned in Section 3 due to their aforementioned issues with scalability and the size of the NSF dataset. (For instance, the implementations of ASCOS (Chen and Giles, 2013) and SimRank (Jeh and Widom, 2002) that we considered are available in the networkx-addon project at http://github.com/hhchen1105/.) Recall that our sim(·, ·) function is based on measuring the cosine similarity between two weight vectors, a and b, generated from our bipartite graph model. As a baseline for comparison, we evaluated two additional similarity implementations using these weight vectors. The first measures the similarity between weight vectors using weighted Jaccard similarity, Σ_k min(a_k, b_k) / Σ_k max(a_k, b_k) (denoted as Wtd. Jaccard). The second is implemented by taking Spearman's rank correlation coefficient of a and b (denoted as Rank). Figure 3 shows the Mean Average Precision (MAP) for each method and each value of n. With the exception of the difference between Cosine and Wtd. Jaccard for MAP@3, all other performance differentials were statistically significant, based on a one-way ANOVA and post-hoc Tukey HSD at a 5% significance level. This, then, provides some validation for our choice.

(See National Nanotechnology Initiative at http://en.wikipedia.org for more information.)

Table 2: Similarity Queries. Three most similar programs to the Theoretical Physics and Ecology programs.

Figure 3: Mean Average Precision (MAP). Cosine similarity outperforms alternative approaches.
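The weighted Jaccard baseline above can be sketched as follows (the Rank baseline would instead use a rank-correlation routine such as scipy.stats.spearmanr; names here are illustrative):

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity of two aligned weight vectors:
    sum_k min(a_k, b_k) / sum_k max(a_k, b_k)."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 0.0

# Two topic-weight vectors over the same topic index set
print(weighted_jaccard([2, 1, 0], [4, 2, 0]))  # 3 / 6 = 0.5
```

Unlike cosine similarity, this measure is sensitive to differences in magnitude: doubling one vector halves the score even when the topic proportions are identical, which is consistent with its weaker MAP results reported above.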
How do NSF programs join together to form larger program categories?
As mentioned, by using the similarity graph G_P constructed from G, clusters of related NSF programs can be discovered. Figure 4, for instance, shows a discovered cluster of NSF programs all related to the field of neuroscience. Each NSF program (i.e., node) is composed of many documents.

Figure 4: Neuroscience Programs. A discovered cluster of program elements all related to neuroscience.

Which pairs of grants are the most similar in the research they describe?
Although the focus of this paper is on drawing comparisons among groups of documents, it is often necessary to draw comparisons among individual documents as well. For instance, in the case of this NSF corpus, one may wish to identify pairs of grants from different programs describing highly similar lines of research. One common approach to this is to exploit the low-dimensional representations of documents returned by LDA (Blei et al., 2003). Any given document d_i ∈ D (where i ∈ {1, ..., N}) can be represented by a K-dimensional probability vector of topic proportions given by θ_{i*}, the i-th row of the document-topic matrix θ. The similarity between any two documents, then, can be measured using the distance between their corresponding probability vectors (i.e., probability distributions). We quantify the similarity between probability vectors using the complement of the Hellinger distance: H_S(d_x, d_y) = 1 − (1/√2) · √(Σ_{i=1}^{K} (√θ_{xi} − √θ_{yi})²), where x, y ∈ {1, ..., N}. Unfortunately, identifying the set of most similar document pairs in this way can be computationally expensive, as the number of pairwise comparisons scales quadratically with the size of the corpus. For the moderately-sized NSF corpus, this amounts to well over 8 billion comparisons. To address this issue, our bipartite graph model can be exploited as a blocking heuristic using either the document groups or the comparison criteria. In the latter case, one can limit the pairwise comparisons to only those documents that reside in the same subset of D^C. For the former case, node similarity can be used. Instead of comparing each document with every other document, we can limit the comparisons to only those document groups of interest that are deemed similar by our model. As an illustrative example, among the NSF programs covering these 132,372 grant abstracts, the Computational Mathematics program and the Numeric, Symbolic, and Geometric Computation program are inferred as being highly similar by our model. Thus, we can limit the pairwise comparisons to only such document groups that are similar and likely to contain similar documents. In the case of these two programs, the following two grants are easily identified as being the most similar by Hellinger similarity (H_S); only text snippets are shown due to space constraints:

Grant
Program: 1271 Computational Mathematics
Title: Analyses of Structured Computational Problems and Parallel Iterative Algorithms.
Abstract: The main objectives of the research planned are the analysis of large-scale structured computational problems and of the convergence of parallel iterative methods for solving linear systems, and applications of these techniques to the solution of large sparse and dense structured systems of linear equations…
Grant
Program: 2865 Numeric, Symbolic, and Geometric Computation
Title: Sparse Matrix Algorithms on Distributed Memory Multiprocessors.
Abstract: The design, analysis, and implementation of algorithms for the solution of sparse matrix problems on distributed memory multiprocessors will be investigated. The development of these parallel sparse matrix algorithms should have an impact on challenging large-scale computational problems in several scientific, econometric, and engineering disciplines.
Some key terms in each grant are manually highlighted in bold. As can be seen, despite some differences in terminology (e.g., “matrix” vs. “linear systems”, “parallel” vs. “distributed”), the two lines of research are related, and document similarity can still be accurately inferred by taking the Hellinger similarity of the LDA-derived low-dimensional representations of the two documents. In this way, by exploiting the group-level similarities inferred by our model in combination with such document-level similarities, we can more effectively “zero in” on highly related document pairs.
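The Hellinger-based document similarity can be sketched as below, assuming the standard normalization by 1/√2 (so that identical distributions score 1.0 and disjoint ones 0.0); names are illustrative:

```python
import math

def hellinger_similarity(p, q):
    """Complement of the Hellinger distance between two topic distributions:
    H_S = 1 - (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2)."""
    d = math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q)))
    return 1.0 - d / math.sqrt(2)

# Toy topic-proportion vectors (rows of theta)
print(hellinger_similarity([0.5, 0.5], [0.5, 0.5]))  # 1.0
print(hellinger_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

With blocking, this function would be applied only to document pairs within the same topic subset of D^C, or within groups already deemed similar, rather than to all O(N²) pairs.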
We have presented a bipartite graph model for drawing comparisons among large groups of documents. We showed how basic algorithms using the model can identify trends and anomalies among the document groups. As an example analysis, we demonstrated how our model can be used to better characterize and evaluate NSF research programs. For future work, we plan on employing alternative comparison criteria in our model, such as those derived from named entity recognition and paraphrase detection.

References

[Antonellis et al. 2008] Ioannis Antonellis, Hector Garcia-Molina, and Chi C. Chang. 2008. Simrank++: Query Rewriting Through Link Analysis of the Click Graph. Proc. VLDB Endow., 1(1):408–421, August.
[Bache and Lichman 2013] K. Bache and M. Lichman. 2013. UCI Machine Learning Repository.
[Blei et al. 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3:993–1022, March.
[Blondel et al. 2008] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, July.
[Campr and Ježek 2013] Michal Campr and Karel Ježek. 2013. Topic Models for Comparative Summarization. In Ivan Habernal and Václav Matoušek, editors, Text, Speech, and Dialogue, volume 8082 of Lecture Notes in Computer Science, pages 568–574. Springer Berlin Heidelberg.
[Chen and Giles 2013] Hung H. Chen and C. Lee Giles. 2013. ASCOS: An Asymmetric Network Structure COntext Similarity Measure. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’13, pages 442–449, New York, NY, USA. ACM.
[Dhillon 2001] Inderjit S. Dhillon. 2001. Co-clustering Documents and Words Using Bipartite Spectral Graph Partitioning. Technical report, Austin, TX, USA.
[Eagle et al. 2010] Nathan Eagle, Michael Macy, and Rob Claxton. 2010. Network diversity and economic development. Science, 328(5981):1029–1031, May.
[Huang et al. 2011] Xiaojiang Huang, Xiaojun Wan, and Jianguo Xiao. 2011. Comparative News Summarization Using Linear Programming. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT ’11, pages 648–653, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Jeh and Widom 2002] Glen Jeh and Jennifer Widom. 2002. SimRank: A Measure of Structural-Context Similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 538–543, New York, NY, USA. ACM.
[McCallum 2002] Andrew K. McCallum. 2002. MALLET: A Machine Learning for Language Toolkit.
[Newman 2006] M. E. J. Newman. 2006. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, June.
[Sarwar et al. 2001] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th International Conference on World Wide Web, WWW ’01, pages 285–295, New York, NY, USA. ACM.
[Sun et al. 2009] Yizhou Sun, Yintao Yu, and Jiawei Han. 2009. Ranking-based Clustering of Heterogeneous Information Networks with Star Network Schema. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 797–806, New York, NY, USA. ACM.
[Wan et al. 2011] Xiaojun Wan, Houping Jia, Shanshan Huang, and Jianguo Xiao. 2011. Summarizing the Differences in Multilingual News. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 735–744, New York, NY, USA. ACM.
[Wang et al. 2012] Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2012. Comparative Document Summarization via Discriminative Sentence Selection. ACM Trans. Knowl. Discov. Data, 6(3), October.
[Zhai et al. 2004] ChengXiang Zhai, Atulya Velivelli, and Bei Yu. 2004. A Cross-collection Mixture Model for Comparative Text Mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, New York, NY, USA. ACM.