Clustering and Classification in Text Collections Using Graph Modularity
arXiv [cs.IR]. Journal of Machine Learning Research x (2011) x-xx. Submitted 5/11; Published xx/11.
Grigory Pivovarov [email protected]
Institute for Nuclear Research
Russian Academy of Sciences
Moscow, 117312, Russia
Sergei Trunov [email protected]
Institute for Institutional Analyses
Higher School of Economics
Moscow, 109028, Russia
Editor: Not Known
Abstract
A new fast algorithm for clustering and classification of large collections of text documents is introduced. The algorithm employs the bipartite graph that realizes the word-document matrix of the collection: the modularity of this bipartite graph is used as the optimization functional. Experiments performed with the new algorithm on a number of text collections showed competitive clustering (classification) quality and record-breaking speed.
Keywords:
Text Clustering, Text Classification, Modularity
1. Introduction
We explore a possibility of clustering (or classification) of documents. Clustering and classification are methods for information retrieval (for a recent review see Berry (2003)). The possibility we explore consists in combining two ideas considered previously.

The first idea is co-clustering (Dhillon, 2001; Zha et al., 2001). Co-clustering clusters the words used in the documents along with the documents themselves. As an outcome, clusters of documents are generated along with corresponding clusters of words. This approach has the following advantages: clusters of words generated as a byproduct can be used for interpretation of the clusters of documents, and, in classification tasks, training sets may contain separate words along with documents. The standard algorithm used within the co-clustering approach is spectral clustering (von Luxburg, 2007). (Computationally, spectral clustering finds eigenvectors of the graph Laplacian; with a number of tricks, the eigenvectors are used for clustering.)

The second idea is modularity (Newman, 2006). Modularity is a class of optimization functionals introduced in the study of graph clustering. Let us compare modularity to other optimization functionals appearing within the widely used approach to clustering based on generative models (Zhong and Ghosh, 2005). These optimization functionals are various "distances" between the data and the model; optimization consists in finding parameters of the model yielding the minimal distance. In contrast, the modularity is optimal when a "distance" between the data and a null model is maximal. The null model is a key notion for the modularity idea: it models the data without structure (the most random data). Concretely, modularity is defined as follows. A functional on graph partitionings is picked out.
Modularity is an additive or multiplicative difference of the value the functional takes on the graph under study and the mean value it takes on the null model. From this comparison we conclude that using modularity is in a way less demanding than using generative models, because it is easier to model randomness than specific data. A comparison of modularity-based and generative-model approaches is attempted in (Karrer and Newman, 2011; Bickel and Chen, 2009).

Modularity has been used in text clustering before (Grineva et al., 2009). In that attempt, a dense weighted graph was clustered: the nodes of the graph are the documents, all the documents are potentially linked to one another, and the edges have weights characterizing similarity of the linked documents.

In this paper we apply the modularity of (Newman, 2006) to the bipartite word-document graph. This is a very sparse bipartite graph G whose nodes are documents and words, with edges between documents and the words contained in them. The sparsity of G makes our approach practical.

More technically, our work is based on two facts. First, the modularity can be optimized with fast and efficient algorithms (Blondel et al., 2008) whose complexity is proportional to the number of links. (Here we point out that we independently developed an algorithm similar to the so-called Louvain algorithm (Blondel et al., 2008) before that paper appeared. We used it in 2007 to cluster the citation graph of the papers from http://arxiv.org . The results of this clustering are accessible via http://xstructure.inr.ac.ru .) Second, the density of the graphs in our experiments was in the range from 0.0015 to 0.006. For such graphs, |E| ∝ |V| log |V|, where |E| is the number of edges and |V| is the number of vertexes of the graph.
Also, the number of vertexes in our graph approximately equals the number of documents in the collection. We conclude that in the case under consideration the linearity of the algorithm in the number of edges of the graph almost implies linearity in the number of documents. In this way we obtain a very fast algorithm: it allows one to cluster (classify) tens of millions of documents in a few hours on typical computer hardware. Presently, a clustering problem is considered to be "large scale" if it involves millions of documents (Vries and Geva, 2010). With our algorithm, it is possible to raise this bar by at least an order of magnitude.

The paper is organised as follows. In the next section, we outline the algorithm. In the third section, we present the results of experiments applying the new algorithm to various text collections. In the concluding section, we briefly summarise our achievements.
2. The Algorithm
1. The density of a graph is defined as 2|E|/(|V|(|V| - 1)), where |E| is the number of edges and |V| is the number of vertexes of the graph.

In this section, we outline the algorithm we used to maximize the modularity of the bipartite graph G modeling a collection of documents (its vertexes are documents and words of the collection; an edge appears between a document and a word if the latter is contained in the former).

The modularity Q(P, G) is a functional defined on the set {P} whose members are partitions of the set of vertexes of the graph G (Newman, 2006). As discussed above, the modularity is the difference between the fraction of the edges inside the clusters for the graph under consideration and for the null model. For example, for a simple (unweighted and undirected) graph, the value it takes on a particular partition is

Q(P, G) = \sum_{i=1}^{N} \left( \frac{l_i}{L} - \frac{D_i^2}{4L^2} \right),   (1)

where the summation runs over the clusters of the partition, N is the number of clusters, l_i is the number of edges inside the i-th cluster, L is the number of graph edges, and D_i is the sum of degrees of vertexes inside cluster i.

Modularity can be used to determine an invariant of the graph G: the partition P that gives the modularity its maximal value. Generally, computing this invariant is an NP-complete problem (Brandes et al., 2006). There is a number of algorithms for computing an approximation to this invariant (Fortunato, 2010).

For our particular case, where the graph under consideration is bipartite, the null model should be modified so that edges appear randomly only between the two parts of the graph. Accordingly, equation (1) is transformed as follows (Barber, 2007):

Q_{bp}(P, G) = \sum_{i=1}^{N} \left( \frac{l_i}{L} - \frac{D_i^{(1)} D_i^{(2)}}{L^2} \right),   (2)

where D_i^{(1)} (D_i^{(2)}) is the sum of degrees of vertexes inside the first (second) part of the i-th cluster, and L is the number of graph edges.

Our algorithm is based on the use of an operation T_P to be defined below.
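As a concrete illustration of the bipartite modularity (2), and of its parametric variant with λ introduced in (5) below, the functional can be computed directly from an edge list. The following is a minimal sketch; the function name and the dictionary representation of a partition are ours, not from the paper:

```python
from collections import defaultdict

def bipartite_modularity(edges, part, lam=1.0):
    """Bipartite modularity, eq. (2) (Barber, 2007); lam != 1 gives the
    parametric form, eq. (5).
    edges : list of (document, word) pairs,
    part  : dict mapping every node (document or word) to its cluster id."""
    L = len(edges)
    l = defaultdict(int)    # intra-cluster edge counts, per cluster
    d1 = defaultdict(int)   # degree sums on the document side, per cluster
    d2 = defaultdict(int)   # degree sums on the word side, per cluster
    for doc, word in edges:
        d1[part[doc]] += 1
        d2[part[word]] += 1
        if part[doc] == part[word]:
            l[part[doc]] += 1
    return sum(l[i] / L - lam * d1[i] * d2[i] / L ** 2
               for i in set(d1) | set(d2))
```

For two disconnected document-word components placed in two clusters, this evaluates to 0.5; the trivial one-cluster partition gives 0. The operation T_P mentioned above is what searches over partitions for large values of such a functional.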
It acts on any partition P' that can be obtained from the partition P involved in its definition by a coarsening, P' ≥ P (this means that the subsets of P' can be obtained by merging some subsets of P). The outcome of T_P acting on P' is a new partition whose modularity is not less than that of P': Q(T_P P') ≥ Q(P'). (Here and below we omit the second argument of Q(P, G) because the graph G is fixed.) This is the basic property of the operation T_P: its action "improves" the partition. The definition of T_P does not use any specific property of the quality functional Q, and can be given for any particular choice of the latter. We stress that T_P depends on the particular choice of the quality functional Q.

To define T_P, we introduce an arbitrary numbering of the elements v ∈ P (the notation v originates from the most refined partition of G, whose members are separate vertexes). After that, instead of the set of elements v of the partition P, we deal with the set of their numbers, v ∈ {1, 2, ..., |P|}.

The next step is to introduce coordinates on the set of P' ≥ P. Each P' can be considered as a point in a space with |P| discrete coordinates; each coordinate takes an integer value from 1 to |P|. Indeed, each P' defines an equivalence relation on the numbers: v' ∼ v if v and v' belong to the same subset of P'. The v-th coordinate of P' can be defined as follows:

P'_v = \max_{v' \sim v} v'.   (3)

So, by this formula, any element v is mapped to the element inside the same cluster of P' with the maximal number. Inversely, any point (x_1, ..., x_{|P|}) of the discrete space {1, ..., |P|}^{|P|} can be interpreted as a partition P' whose members are obtained by merging the subsets of P whose coordinates x_v coincide.

Now the functional Q can be considered as a function of |P| discrete arguments:

Q(P') = Q(P'_1, ..., P'_{|P|});   (4)

each argument runs from 1 to |P|.
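The coordinate map (3) and its inverse can be sketched in a few lines. This is our illustration, not code from the paper; elements are numbered from 0 rather than 1, and a coarsening P' of P is given as a list `merge` assigning each element of P to a class of P':

```python
def coords(merge, n):
    """Coordinates (3) of a coarsening P' of P.
    merge[v] is the P'-class of element v of P (v = 0 .. n-1);
    the v-th coordinate is the largest element number in v's class."""
    top = {}
    for v in range(n):
        top[merge[v]] = v          # elements scanned in increasing order
    return [top[merge[v]] for v in range(n)]

def decode(x):
    """Inverse map: elements with equal coordinates are merged."""
    seen = {}
    return [seen.setdefault(c, len(seen)) for c in x]
```

For instance, merging elements {0, 1, 4} and {2, 3} yields the coordinate vector [4, 4, 3, 3, 4], and decoding it recovers the same grouping.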
We are looking for the maximum of this function. To approximate the maximum, we can take any starting P' and use the discrete cyclic coordinate descent method (Luenberger, 1973) to obtain a point T_P P' improving the partition P', Q(T_P P') ≥ Q(P'). This concludes our definition of the operation T_P.

The operation T_P can be used to describe the previously introduced Louvain algorithm (Blondel et al., 2008). Indeed, the Louvain algorithm yields the partition ending the sequence of partitions P_n = T_{P_{n-1}} P_{n-1} that starts from the most refined partition P_0, whose members are the vertexes.

Experimenting with classification of text collections, we have found that it is advantageous to use another sequence of partitions approaching the maximum, P_n = T_{P_0} T_{P_{n-1}} P_{n-1}. So, we start with the most refined partition P_0. The first step of the process yields P_1 = T_{P_0} P_0 (this is the case because T_P T_P = T_P for any P), the second, P_2 = T_{P_0} T_{P_1} P_1, and so on. The process stops when the next step yields a partition whose modularity coincides with the one obtained on the previous step.

Comparing this algorithm to the Louvain algorithm, we point out that, in contrast to the Louvain algorithm, each step of our algorithm does not necessarily coarsen the partition, i.e., our P_n is not always more coarse than P_{n-1}. The results we obtain appear to be more accurate (in a sense defined later on) than the ones obtained with the Louvain algorithm.

This concludes the general description of our algorithm.

Handmade classifications of large text collections have a number of classification levels. For example, the online archive arxiv.org has three classification levels (e.g., Physics—Condensed Matter—Superconductivity), and the huge collection of web sites dmoz.org has more than three classification levels (the actual number of levels depends on the subject field).
Such levels are not described by the above approach employing the modularity functional. A handle on this is provided by the parametric modularity introduced in (Reichardt and Bornholdt, 2006; Lambiotte, 2010). It is defined as follows:

Q_{bp}(P, G, λ) = \sum_{i=1}^{N} \left( \frac{l_i}{L} - λ \frac{D_i^{(1)} D_i^{(2)}}{L^2} \right),   (5)

where an extra real positive parameter λ has appeared.

Let us give an example clarifying the meaning of the new parameter λ. Consider a graph G_K which consists of K copies of the graph G. Let the modularity of G reach its maximum value on the partition P_max. This G_K gives a simple model of a graph with two classification levels naturally present: the upper level P_1 has as its classes the separate copies of G, while the ground level P_0 of the classification subdivides each copy of G into the subgraphs participating in P_max. With this notation, Q(G_K, P_0) = Q_+(G, P_max) − Q_−(G, P_max)/K, where Q_+ (Q_−) denotes the first (second) term on the right-hand side of (2). Also, Q(G_K, P_1) = 1 − 1/K. Because Q_± < 1, at large K, Q(G_K, P_1) > Q(G_K, P_0). We conclude that in this case the modularity is unable to resolve the ground level of the classification if the number K of subclasses at the upper level is large enough.

In this example, the graph G_K is not a bipartite one. So, we take as a parametric modularity the quantity Q(G, P, λ) = \sum_{i=1}^{N} ( l_i/L − λ D_i^2/(4L^2) ). Compare this formula with the above definition of the modularity for nonbipartite graphs (1). For this case, take λ = K. We have Q(G_K, P_0, K) = Q(G, P_max), and Q(G_K, P_1, K) = 0. We conclude that taking λ = K enables the parametric modularity to see precisely the ground level of the classification.

The big question in using the parametric modularity is how to find "good values" of the parameter λ. As we have seen, λ has the meaning of the number of clusters on the upper level of clustering, and we normally do not know it beforehand. At this moment we do not give any prescription for choosing λ. In what follows, we use the parametric modularity to find our classifications, and we always give the value of λ with which one or another classification was obtained. What we can state is that varying λ is a useful tool. In our experiments, λ was varied from 1 to 300.

Applying the above clustering algorithm to various large graphs, we observed the appearance of long tails in the distribution of the clusters in the number of vertexes: typically, along with a few large clusters, we obtain a large number of relatively small clusters. The smaller the cluster, the harder it is to interpret. Also, it seems that the appearance of small clusters is not infrequently caused by minor peculiarities in the data.

In the results we present below, the vertexes of the clusters belonging to the long tails are redistributed among a few large clusters.
In this subsection, we describe the procedure of this redistribution of the "astray" vertexes.

The redistribution is performed with an operation similar to the above T_P. This operation, R_N, depends on a natural number N. It acts on any partition P with a number of clusters larger than N, |P| > N.

First, the redistribution operation R_N orders the clusters of the partition P by their size. Next, all the vertexes not included in the first N clusters are counted. Let the number of these astray vertexes be M. A redistribution of the astray vertexes among the N largest clusters can be specified by a set of coordinates (x_1, ..., x_M); the value taken by the coordinate x_k equals the number of the large cluster the k-th astray vertex is redistributed to.

As in the operation T_P, the optimal point in the space with the above coordinates is determined by the modularity with the discrete cyclic coordinate descent method (Luenberger, 1973). The only undetermined ingredient in the definition of the redistribution operation R_N is the starting point for the descent.

The starting point for the descent was determined with the following procedure. The value of the first coordinate x_1 was determined as the optimal number of the large cluster for placing the first astray vertex in, under the condition that the rest of the astray vertexes are considered as separate clusters. The value of the second coordinate x_2 was determined similarly, but under the condition that the first of the astray vertexes is already placed into the large cluster number x_1, and so on.

Previously we described the sequence of partitions P_n = T_{P_0} T_{P_{n-1}} P_{n-1}. It stops on a partition P_s. Our final result is P_f = T_{P_s} R_N P_s, where N < |P_s| is the number of clusters we choose to be present in the final clustering.
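The greedy construction of the starting point described above can be sketched as follows. This is our schematic rendering: the names are ours, and `gain` stands in for the actual modularity change that a real implementation would compute from the graph:

```python
def seed_redistribution(astray, n_big, gain):
    """Greedy starting point for the descent in R_N: astray vertices are
    placed one at a time into whichever of the big clusters 0 .. n_big-1
    yields the largest gain, given the placements made so far.
    gain(v, c, placed) -> change in the objective from putting v into c."""
    placed = {}
    for v in astray:
        placed[v] = max(range(n_big), key=lambda c: gain(v, c, placed))
    return placed
```

The cyclic coordinate descent of R_N then refines this seed, revisiting each astray vertex in turn until no single move improves the modularity.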
As before, the leftmost operation T_P improves the clustering (its action determines the optimal cluster for each vertex among the clusters obtained by the action of the redistribution operation R_N).

A classification problem is given if a subset of the classification indexes is already given (the training set), and the rest should be generated. To clarify: the number of classes is preset to N; for a subset of vertexes (the training set) the correct classes are known; for the rest of the vertexes (the testing set) the correct classes should be determined. We attempt to solve the classification problem using its analogy to the problem of redistribution of the astray vertexes of the previous subsection.

To solve the problem using modularity, we point out that the correct classification defines a partition on the set of documents obtained from joining the training and testing sets. The members of this partition are the classes consisting of the documents of the training set together with the correctly attributed documents from the testing set. We assume that this partition is the one that maximizes the parameterized modularity at a certain value of the parameter λ. If λ is known, this is a problem of maximization with constraints: the constraints fix the number of clusters to N and the distribution among the clusters for the training set.

We look for an approximate solution of this problem using the above redistribution operation R_N. Our approximation to the optimal classification is P_c = R_N P', where P' is the partition with the training set correctly distributed and each of the remaining vertexes belonging to its own cluster.
3. The Experiment
Four document collections have been used for testing our algorithm. Three of them are well-known classical test collections: 20 Newsgroups, Reuters-21578, and WebKB4. We used pre-processed versions of these collections (Cardoso-Cachopo). The fourth collection (TripAdvisor dataset) is a collection of travelers' reviews of the hotels they stayed in, obtained via the popular resource tripadvisor.com (OpinionAnalysisCorpus). In this collection, all the reviews were classified into two classes: positive and negative reviews.

Table 1 gives parameters of the collections. All four collections were used for clustering and classification.
[Table 1: Total number of documents, number of training documents, number of test documents, and number of classes for 20 Newsgroups, Reuters-21578, WebKB4, and TripAdvisor. The numeric entries were not recoverable from the extraction.]
[Table 2: Numbers of vertexes and numbers of links of the graphs G_1 and G_2 for 20 Newsgroups, Reuters-21578, WebKB4, and TripAdvisor. The numeric entries were not recoverable from the extraction.]

(... the ℓ-norm—had not shown improvement sufficient to justify the trouble of using them.)

The clustering was performed by the following protocol. For each testing collection, optimization of the parameterized modularity was performed for a sequence of values of λ, with the objective of finding the suboptimal value of λ.

As mentioned above, the quality of clustering was measured with the Normalized Mutual Information (NMI) and Purity functionals. These functionals are maximal when the generated clustering coincides with a given "correct" clustering. Below we give the formulas for computing these functionals. The clusters of the given "correct" clustering are called classes. The NMI functional reads

NMI = \frac{\sum_{l} \sum_{m} N_{l,m} \log\left( N N_{l,m} / (N_l N_m) \right)}{\sqrt{\sum_{m} N_m \log(N_m/N) \, \sum_{l} N_l \log(N_l/N)}}.   (6)

Here the summation in l is over the classes and in m over the generated clusters, N is the total number of documents, N_l (N_m) is the number of documents in class l (cluster m), and N_{l,m} is the number of documents in the overlap between class l and cluster m. The NMI takes its values in the interval (0, 1]. Purity is defined as

Purity = \frac{\sum_{m} \max_{l} N_{l,m}}{N}.   (7)

Table 3 gives the clustering results. It shows that the optimization in the value of λ and the use of bigrams improve the quality of clustering (measured with NMI) considerably.
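The NMI and Purity measures can be computed directly from the class and cluster labels. A short sketch follows (the function names and label-list representation are ours):

```python
import math
from collections import Counter

def nmi(classes, clusters):
    """Normalized mutual information, eq. (6); classes and clusters are
    parallel lists of labels, one entry per document."""
    N = len(classes)
    nl, nm = Counter(classes), Counter(clusters)
    nlm = Counter(zip(classes, clusters))
    num = sum(n * math.log(N * n / (nl[l] * nm[m]))
              for (l, m), n in nlm.items())
    den = math.sqrt(sum(n * math.log(n / N) for n in nm.values())
                    * sum(n * math.log(n / N) for n in nl.values()))
    return num / den

def purity(classes, clusters):
    """Purity, eq. (7): the fraction of documents belonging to the
    majority class of their cluster."""
    nlm = Counter(zip(clusters, classes))
    best = Counter()
    for (m, l), n in nlm.items():
        best[m] = max(best[m], n)
    return sum(best.values()) / len(classes)
```

Both functions reach 1 exactly when the generated clustering coincides with the "correct" one, matching the property stated above.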
[Table 3: Clustering results. For each dataset (20 Newsgroups, Reuters-21578, WebKB4, TripAdvisor), the columns give λ and, for each of the graphs G_1 and G_2, the number of clusters, NMI, and Purity. Most numeric entries were not recoverable from the extraction; the surviving fragment of the 20 Newsgroups row reads "20 0.63 0.67". λ is the modularity parameter. Numbers in italic were obtained with the projection onto the first K clusters ordered by their size (K equals the number of clusters in the training set). Numbers in bold give our best results.]

Table 4 compares our results with the results obtained with other algorithms. The latter were extracted from the sources pointed out in the Table 4 caption.

In the classification experiments, we used the suboptimal value of the parameter λ determined previously in the clustering experiments.

[Table 4: Dataset/Algorithm: Modularity, ExtPLSA, MMF, SC, SKM, CLGR, NMF; rows for 20 Newsgroups and WebKB4. The numeric entries were not recoverable from the extraction.]
Table 4 caption: NMI values obtained with various methods. Columns are marked with the method names: Modularity is the method of this paper; ExtPLSA is a version of probabilistic latent semantic analysis (Kim et al., 2008); MMF is a mixture of von Mises-Fisher distributions (Zhong and Ghosh, 2005); SC is spectral clustering (Zhong and Ghosh, 2005); SKM is spherical K-means (Wang et al., 2007); CLGR is Clustering with Local and Global Regularization (Wang et al., 2007); NMF is nonnegative matrix factorization (Wang et al., 2007).

There are two standard classification quality measures (Manning et al., 2008), micro-averaged and macro-averaged:

micro-F1 = \frac{\sum_{c} TP(c)}{D},   macro-F1 = \frac{\sum_{c} F(c)}{N}.   (8)

Here the sums on the right-hand sides run over classes; D is the number of documents to be classified, N is the number of classes, TP(c) is the number of correctly classified documents for class c, and F(c) = 2 R(c) P(c) / (R(c) + P(c)), where R(c) = TP(c)/N_1(c) and P(c) = TP(c)/N_2(c). In the last relations, N_1(c) (N_2(c)) is the correct (actual) number of the documents from the testing set to be (have been) attributed to class c.

Table 5 gives the results of our classification experiments. The same table compares our results to the results obtained with other algorithms.

[Table 5: Micro- and macro-averaged F1 for 20 Newsgroups, Reuters-21578, WebKB4, and TripAdvisor, obtained with Modularity on G_1, Modularity on G_2, SVM, N-Bayes, and K-NN. The numeric entries were not recoverable from the extraction. Modularity G_1 and Modularity G_2 mark our method applied to the graphs G_1 and G_2; SVM, the support vector machine (Cardoso-Cachopo, 2007; Guo et al., 2004); N-Bayes, the naive Bayes (Guo et al., 2004; Cardoso-Cachopo, 2007); K-NN, the K nearest neighbors method (Cardoso-Cachopo, 2007; Guo et al., 2004).]
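The micro- and macro-averaged F1 measures (8) can be sketched as follows for single-label classification (our naming; `true` and `pred` are parallel lists of class labels):

```python
def f1_scores(true, pred):
    """Micro- and macro-averaged F1, eq. (8)."""
    classes = set(true) | set(pred)
    tp = {c: 0 for c in classes}
    n_true = {c: 0 for c in classes}   # N_1(c): correct count per class
    n_pred = {c: 0 for c in classes}   # N_2(c): attributed count per class
    for t, p in zip(true, pred):
        n_true[t] += 1
        n_pred[p] += 1
        if t == p:
            tp[t] += 1
    micro = sum(tp.values()) / len(true)   # sum_c TP(c) / D
    f = []
    for c in classes:
        r = tp[c] / n_true[c] if n_true[c] else 0.0
        p = tp[c] / n_pred[c] if n_pred[c] else 0.0
        f.append(2 * r * p / (r + p) if r + p else 0.0)
    macro = sum(f) / len(classes)          # sum_c F(c) / N
    return micro, macro
```

Note that for single-label classification the micro-F1 of (8) reduces to plain accuracy, while macro-F1 weighs every class equally regardless of its size.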
4. Conclusions
We presented a new algorithm for clustering and classification of text collections. Our algorithm optimizes modularity computed for a fundamental object: the word-document bipartite graph.

At a competitive quality of the output, our algorithm's main strength is its speed. Using the results on the clustering of a large web graph (about one billion edges) (Blondel et al., 2008), we estimate the time needed to cluster a collection of 10 million documents (each document about the average size of the documents from the 20 Newsgroups collection) as several hours on typical hardware.

We conclude that our algorithm can be used for clustering very large document collections in reasonable time. With our algorithm, the size of amenable collections can be increased by at least an order of magnitude.

We believe that using our algorithm opens up new possibilities for automated structuring of the enormous number of text documents available via the web.
References
Michael J. Barber. Modularity and community detection in bipartite networks. Phys. Rev. E, 76(6):066102, 2007.

Michael W. Berry. Survey of Text Mining. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.

Peter J. Bickel and Aiyou Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068-21073, 2009.

Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

U. Brandes, D. Delling, M. Gaertler, R. Goerke, M. Hoefer, Z. Nikoloski, and D. Wagner. Maximizing modularity is hard. arXiv:physics/0608255, 2006.

Ana Cardoso-Cachopo. Datasets for single-label text categorization. http://web.ist.utl.pt/~acardoso/datasets/ .

Ana Cardoso-Cachopo. Improving Methods for Single-label Text Categorization. PhD thesis, IST, October 2007.

Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 269-274, New York, NY, USA, 2001. ACM.

S. Fortunato and M. Barthélemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36, 2007.

Santo Fortunato. Community detection in graphs. Physics Reports, 486:75, 2010.

Maria Grineva, Maxim Grinev, and Dmitry Lizorkin. Extracting key terms from noisy and multitheme documents. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 661-670. ACM, 2009.

Gongde Guo, Hui Wang, David Bell, Yaxin Bi, and Kieran Greer. A kNN model-based approach and its application in text categorization. In Computational Linguistics and Intelligent Text Processing, 5th International Conference, CICLing 2004, Seoul, Korea, pages 559-570. Springer, 2004.

Brian Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83(1):016107, 2011.

Young-Min Kim, Jean-François Pessiot, Massih-Reza Amini, and Patrick Gallinari. An extension of PLSA for document clustering. In CIKM, pages 1345-1346, 2008.

Renaud Lambiotte. Multi-scale modularity in complex networks. arXiv:1004.4268, 2010.

David G. Luenberger. Introduction to Linear and Nonlinear Programming. Addison-Wesley, 1973.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

M. E. J. Newman. Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA, 103:8577, 2006.

OpinionAnalysisCorpus.

Joerg Reichardt and Stefan Bornholdt. Statistical mechanics of community detection. Phys. Rev. E, 74:016110, 2006.

U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

Christopher M. De Vries and Shlomo Geva. Document clustering with k-tree. arXiv:1001.0827, 2010.

Fei Wang, Changshui Zhang, and Tao Li. Regularized clustering for documents. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, pages 95-102, 2007.

Hongyuan Zha, Xiaofeng He, Chris Ding, Horst Simon, and Ming Gu. Bipartite graph partitioning and data clustering. In Proceedings of the Tenth International Conference on Information and Knowledge Management, CIKM '01, pages 25-32, New York, NY, USA, 2001. ACM.

Shi Zhong and Joydeep Ghosh. Generative model-based document clustering: a comparative study. Knowl. Inf. Syst., 8(3):374-384, 2005.