Benchmarking community detection methods on social media data

Abstract

Benchmarking the performance of community detection methods on empirical social network data has been identified as critical for improving these methods. In particular, while most current research focuses on detecting communities in data that has been digitally extracted from large social media and telecommunications services, most evaluation of this research is based on small, hand-curated datasets. We argue that these two types of networks differ so significantly that by evaluating algorithms solely on the former, we know little about how well they perform on the latter. To address this problem, we consider the difficulties that arise in constructing benchmarks based on digitally extracted network data, and propose a task-based strategy which we feel addresses these difficulties. To demonstrate that our scheme is effective, we use it to carry out a substantial benchmark based on Facebook data. The benchmark reveals that some of the most popular algorithms fail to detect fine-grained community structure.

Keywords: community detection, benchmarking, evaluation, social networks, data mining, social media data

1. Introduction

Community structure has been identified as playing a key role in the formation and function of many systems, so it comes as no surprise that the topic has received a large amount of attention, with hundreds of papers currently published on it every year. However, the literature commonly observes that although many community detection methods exist, we do not know which ones work best on real data [12, 15, 29, 30].

This problem of evaluation on real data is so acute that, in what has been recognized as the authoritative review on the community detection problem, Fortunato states that our inability to properly benchmark algorithms has led to “a serious limit of the field” and that little is known about which methods perform best in practice [12].

In this paper, we address this problem of evaluating community detection methods, focusing on social networks. We first briefly review the history of community detection in order to reveal why benchmarking and evaluation are currently so problematic. Next we argue that recent attempts to solve the problem of evaluation using large network datasets with “ground truth” data, while a step in the right direction, are flawed because they do not properly deal with the fact that the ground truth data is imperfect and incomplete.

We propose a modified benchmarking workflow which we believe appropriately utilizes an imperfect set of ground truth communities. In this benchmark, we use the communities detected by an algorithm to infer the value of node attributes related to community structure. The inference is done in a machine-learning setting where the community assignment matrix is used as the features associated with each node. The idea is that if the community detection algorithm has done a good job, then a machine learning classifier should be able to use the community assignment matrix to accurately infer such an attribute. Finally, we run a benchmark using our proposed workflow on forty Facebook datasets. We find that none of the methods we test automatically detect communities at all scales, and the methods become most effective when run many times, each time with a different value for the resolution parameter.

arXiv [cs.SI]

2. A Brief History of Community Detection

We present the following brief history of community detection because we believe the historical trends we highlight are closely related to the currently inadequate standard of evaluation in the field of community detection. In short, we argue that up until the mid-1990s, community detection algorithms were typically run on smaller, hand-curated datasets that were gathered specifically for research. Researchers typically carefully curated these datasets and had well-informed prior knowledge of the community structure. This expert knowledge, based on first-hand experience with the social system from which the data was generated, could be used to set up benchmarks for community detection algorithms. However, since then the focus has shifted to finding communities in larger datasets that are not purpose-gathered, but rather digitally extracted, i.e., mined from sources such as log files of web services and mobile communication networks. While the community structure found in these datasets is potentially very different, researchers continue to evaluate their algorithms on the smaller, purpose-gathered datasets. This leaves us ignorant of how well community detection algorithms perform on larger, mined datasets.

With the objective of examining how improvements in community detection methods were evaluated, let us start by examining one of the earliest improvements: the introduction of the adjacency matrix as a means of group detection. The adjacency matrix was meant to improve upon the “sociogram” introduced by Moreno in 1934 [20]. An example of a sociogram is depicted in fig. 1a. While the sociogram as a method for visualization attracted immediate attention (the New York Times described it as a “new science” [4]), laying out nodes and links by hand was criticized as too subjective: “at present, the sociogram must be built by a process of trial and error, which produces the unhappy result that different investigators using the same data build as many different sociograms as there are investigators” [11]. If one looks at fig. 1a, one can observe the grounds for this criticism: not only is it time-consuming to draw such a diagram by hand, but it also seems that the group structure of the social network is not very clear, and that it might be clearer if one had drawn it differently.

In 1946, Forsyth et al. proposed that one could represent social networks with an adjacency matrix [11]. The focus of their paper was a procedure for sorting the rows and columns of this matrix in order “to present sociometric data more objectively, and to make possible a more detailed analysis of group structure.” One can argue that the method presented in that paper is the first community detection algorithm. The algorithm orders the entries in an adjacency matrix such that the community structure should be apparent as dense blocks along the diagonal. In fig. 1b, we see the same social network as in fig. 1a, but this time visualized using the technique proposed by Forsyth et al.

For our purposes, the most relevant part of this paper is how Forsyth et al. evaluate how well their method works. Their evaluation can best be described as a sort of informal visual check. They present fig. 1b and assert that the group structure is much clearer than it was when depicted using Moreno’s sociogram method. They outline what they consider the groups to be in fig. 1b using faint dotted lines to draw blocks along the diagonal axis. They simply assert that these boxes represent subgroups, even though these subgroups are not described by Moreno. The important point is that, aside from the visualization, there is no external evidence to support their claim that the subgroups are valid. Thus, the evaluation employed here boils down to visualizing the output of the algorithm and asserting that one can clearly see group structure.

In one sense, this is a poor evaluation: aside from their informal visual check, Forsyth et al. present basically no empirical evidence that their method more objectively detects community structure. A proper empirical evaluation might begin with several networks where the “ground truth” set of communities is known, and then determine whether users of the new method are able to more accurately identify the ground truth set of communities than when they use previously existing techniques.

Fig. 1: (a) One of Moreno’s early sociograms. (b) The same network depicted using Forsyth et al.’s community detection method. The first community detection algorithm, introduced by Forsyth et al. in 1946, was supposed to more objectively and clearly display community structure [11]. Because the network used for evaluating the utility of the technique is small and hand-curated, the evaluation was informal, yet adequate.

However, from another perspective, the evaluation is adequate. The researcher who gathered data such as in the example above (in this case H. H. Jennings) typically spends months carrying out surveys on and observing the social system which produces the data.
Even before any social network analysis is performed, the researcher has a rich understanding of the social structure that exists among the subjects. In this context, community detection methods are not meant to discover previously unknown structure; rather, they are meant to support, augment, and “make objective” the expert knowledge built up over months of observation and first-hand research.

Over the next fifty years, the datasets used to evaluate community detection algorithms were also generally gathered by experts who had first-hand knowledge of the social system which the dataset covered. Well-known examples include Zachary’s Karate Club [31], Sampson’s Monks [24], and the Southern Women dataset [8]. Through their close observation, the researchers who collected this data were able to group the nodes into communities based on events such as crises or social gatherings. During this time, network datasets tended to be small (with fewer than 500 nodes, and often fewer than 50 nodes) and well-studied (in [13], Freeman synthesizes the findings of 21 methodological studies on the Southern Women’s network alone).

Then, starting in the late 1990s, a new era of work on community detection began. This new era was created in part by a new type of social network data that emerged in the form of digital records such as mobile communication records or data from Facebook. While this new data still represents social networks, it differs in important ways from the data that had been analyzed in the previous decades because it is not collected personally by researchers.

First, the new datasets are typically not collected specifically for the purpose of a scientific study, but rather extracted after the fact from logs or databases. As a result, the data may be messier (e.g., include a majority of users who have very low activity levels) and cover many social contexts for each user. For example, in most of the datasets gathered up to the mid-1990s, only one social context would be studied, such as activity in a club, at home, or in the workplace. On the other hand, in the more modern datasets, such as Facebook data, social interactions from several social contexts are jumbled together. Another important difference is that the more modern datasets are typically several orders of magnitude larger than the earlier type. Figure 2 displays an example of what we will call a hand-curated network and a digitally extracted network. Finally, whereas the goal of community detection on hand-curated networks was often to make essentially known community structure more objective, the goal on large digitally extracted networks is to uncover completely unknown community structure.

Fig. 2: (a) Zachary’s Karate Club [31]. (b) Facebook friendships at Caltech [28]. On the left, a network which is typical of “hand-curated” datasets gathered by a researcher in the field. On the right, an example of a “digitally extracted” dataset. We have reason to believe that the structure of the communities in these two types of networks differs in important ways.

Due to these key differences between the old, hand-curated networks and the new, digitally extracted networks, many new methods for the community detection problem were proposed. The field of community detection enjoyed booming popularity as more physicists, computer scientists, and social scientists developed these new methods. We argue that when modern community detection methods were evaluated on social network data something went wrong: rather than evaluating these new methods on the new datasets for which they were designed, the new methods were often evaluated on the old datasets [6, 10, 14, 21]. Thus, we know that many of the new community detection methods work well on datasets like Zachary’s Karate Club or the Southern Women’s dataset, but we do not know how well they work on larger, digitally extracted datasets, and this is the ignorance that Fortunato described in the excerpt above.

Footnote: In some of these papers a larger social network was evaluated (such as the co-authorship network on arXiv), but these lacked the ground truth or meta-data necessary for a proper evaluation. These larger networks were typically employed only for comparing something other than how well the algorithm identifies all relevant community structure, such as which algorithm gets the highest modularity or runs quickest.

Note that these new methods were evaluated on diverse types of data. For example, the Gene Ontology and other annotation can be used to evaluate the modules found in protein-protein interaction networks [17]; product categorizations can be used to annotate the network of products co-purchased on Amazon.com [30]. See [3] for an example of thorough evaluation on current datasets from these and other types of data. Here we consider evaluation only on social network data, i.e., networks in which nodes represent humans and links represent relationships.

There is thus a clear need for benchmarks based on modern, digitally extracted social network datasets; however, creating these is not straightforward. Researchers who gather a small dataset by hand for the purpose of studying group behavior can confidently annotate their datasets with the ground truth community structure. On the other hand, with the digitally extracted datasets it is nearly always unclear how exactly one should define a ground-truth set of communities. In some cases, it might not even be clear that such community structure exists in the data at all.

If we are to make progress in detecting community structure in modern datasets, then it is imperative that we design benchmarks based on such data. In the next section, we review recent efforts in this direction and propose how such a benchmark could be created.

3. Evaluation on digitally extracted networks

Recent efforts to benchmark the performance of community detection algorithms can broadly be placed into three categories:

1. real-world benchmarks, such as Zachary’s Karate Club, where a dataset based on a social system includes a natural set of ground-truth communities;
2. synthetic benchmarks, where data is artificially created according to some model which includes a pre-defined set of ground-truth communities; and
3. task-oriented benchmarks, in which communities are used to help complete some task on real-world data, such as graph compression, decentralized routing [26], or attribute inference [19].

The real-world benchmark is ideal, because clear problems exist with the other options. While using synthetic data to benchmark an algorithm is better than nothing, it is unclear how to create synthetic data that is realistic. Thus, even if an algorithm performs well on synthetic data, it may perform poorly on real datasets. The problem with task-oriented benchmarks is that many types of structure may be useful for solving a task, including structures that do not resemble network communities. Thus, the algorithm that performs best on the task-oriented benchmark may not be finding communities at all, but something else, e.g., the clusters identified by block modelling.

Due to these deficiencies, here we focus on the possibility of using real-world benchmark graphs on digitally extracted data. Such a benchmark is typically carried out as in fig. 3. In the last section, we mentioned that many of the smaller, hand-curated datasets include a natural ground truth set of communities which can be used for real-world benchmarks. It is possible to construct this ground truth because a researcher (or team of researchers) has first-hand experience studying a relatively small social system. For digitally extracted datasets, no such experts exist, partly because the process of gathering the dataset (which typically involves writing a web crawler or log-file parser) does not involve first-hand contact with a social system, and partly because the dataset is simply too large for any individual to have detailed knowledge of its structure.

Fig. 3: The straightforward way of benchmarking an algorithm using a network with a ground-truth set of communities. Diagram labels: ground truth; algorithmically detected clusters; similarity measure; score.

Recent attempts to create a ground-truth set of communities for a large, digitally extracted social network typically employ relevant meta-data; perhaps the most significant work in this direction is in two recent papers from Yang and Leskovec [29, 30], who examine the social networks of LiveJournal, Orkut, Friendster, and Ning. Each of these social network datasets includes groups of users explicitly created by users. Another common type of social network data that includes an explicit ground-truth set of communities is co-authorship networks, where attendance at conferences is used to define the ground truth set of communities.

3.1 Problem: so-called ground truths are incomplete

We now come to this section’s central question: while one can certainly choose to define a ground truth set of communities by using this meta-data, is this a sensible thing to do? In other words: should we punish a community finding algorithm if its output does not match such a ground truth? We argue that the ground truth set of communities which can be extracted from such meta-data is likely to be woefully incomplete.

We substantiate this objection with a concrete example based on the Facebook100 dataset, introduced by Traud et al. [27]. We choose our example from this dataset because it is in many ways an ideal dataset for a community detection benchmark: it comes from Facebook, which at the time the data was collected in 2005 was extremely popular among college students, and thus the data provides thorough coverage of the acquaintanceship among students. Furthermore, as the data was provided directly by Facebook, the dataset is not based on a sample, but rather includes all Facebook friendships. Finally, it includes meta-data based on the profile page of each user that indicates dorm membership, gender, graduation year, and academic major.

As our particular example, we will examine the University of Chicago sub-network, which includes 6,591 nodes and 208,103 undirected edges. This sub-network is chosen because one of the authors has first-hand experience of the social life there as an undergraduate and is even included in the dataset. Furthermore, the residential “houses” are known to be of utmost importance to the social life at the University of Chicago: upon entering the university, every student is required to spend at least one year living in the dormitory system, and the friendships formed in this stage of college (for many students, the first time living away from home) often endure for years. According to the University of Chicago website, “each house represents a tight-knit community of students, resident faculty masters, and residential staff, who live, relax, study, dine together at House Tables, engage, socialize, and learn from each other” [2].

Footnote: The meta-data does not explicitly indicate that the “dorm” meta-data represents the housing system at the University of Chicago, but this is strongly suggested by the data itself, as the number of distinct values corresponds quite closely to the number of houses and fraternity houses. At the time of data collection, the housing system consisted of 10 residence halls (physical buildings) which were further subdivided into 37 houses, which typically represent a physically adjacent wing of a residence hall. House sizes range from 37 to 100 students, with an average size of 70 [1]. Furthermore, there were somewhere between ten and twenty fraternity houses. Thus some of the dorm meta-data may indicate fraternity membership.

Fig. 4: Upper pane: the adjacency matrix of the University of Chicago, with house membership highlighted (zoomable online). Lower pane: a zoomed-in region suggests that sub-communities exist within houses, and that macro-structure exists between houses. Diagram labels: possible meta-community; possible sub-communities.

The adjacency matrix of this network is displayed in the top pane of fig. 4. The nodes have been ordered so that all nodes belonging to the same house are adjacent to each other. We can confirm that the house meta-data is indeed relevant for friendships by noting that when nodes are arranged in this order, dense blocks form on the main diagonal.

However, to return to the central question of this section: should we use these dorms as the ground truth set of communities in a benchmark? In other words: does the grouping of nodes by house correspond to the grouping of network communities that exist in the network? The ordering of nodes in fig. 4 not only blocks nodes by house, but also attempts to highlight any sub- or meta-structure. We can observe that within many of the houses there is clearly defined sub-structure. A few examples of this sub-structure are highlighted in the brown squares in the bottom pane of fig. 4. We can also observe that several of the houses together form one meta-community. An example of this meta-community is highlighted in green in the same image. The twelve houses that belong to this meta-community could represent houses that are physically co-located in the same residence hall.

We do not claim to have found the correct ground truth set of communities in fig. 4; there may be alternative orderings of the adjacency matrix that highlight additional community structure. The point we want to make is that if we were to simply use the houses as ground truth communities in a benchmarking setting, then we would be using, at best, an incomplete ground truth, because it contains neither the sub-structure nor the meta-structure that is clearly visible in the adjacency matrix. We also stress that the set of houses alone is not even approximately complete, because while there are only around fifty blue squares, there are several times as many sub-communities visible. Thus, if a community finding algorithm were to find all of the communities that are visible in the adjacency matrix, and were then evaluated by measuring how similar those found communities were to the ground truth of houses, the two sets would not be very similar according to several common similarity measures. According to such a benchmark, the algorithm would therefore perform poorly, even though in fact it did a good job of detecting communities.

Thus, even though on the face of it the housing meta-data appeared to provide a great ground truth for a community detection benchmark, it turns out that it is gravely incomplete. We should therefore not build a benchmark based on the assumption that there is a one-to-one mapping between network communities and houses. We also believe that similar problems apply to other digitally extracted social network datasets. For example, in the annotated networks used in [29, 30], the ground-truth set of communities is provided explicitly by users who typically create groups in an online social network service. There may be many network communities that the users themselves are not even aware of, or for which they simply do not create explicit groups. Also, users may not be aware of meta-communities, as these could include thousands of users.

3.2 Solution: Measure communities’ relation to meta-data rather than correspondence with it

Returning to the University of Chicago example, we ask: can we create any reasonable benchmark based on the house meta-data? Certainly the houses are related to the community structure, but we are unsure of the nature of this relationship. This relationship could be simple, for example, if a node is a member of a given network community, then it is more likely to be a member of some house. Or this relationship could be more complex, for example, if a node is a member of a certain set of network communities and at the same time not a member of some other set of network communities, then there is a high probability of belonging to a certain house. Due to these possibilities, while we might know that certain meta-data (such as the housing meta-data) is closely related to community structure, we often do not know the exact nature of this relationship.

Footnote: To highlight sub-structure within each house, we performed the following steps: we first extracted the sub-graph induced by each house (i.e., the edges contained in each blue square in fig. 4) and then ran a community detection method on that sub-graph (we used the Louvain method of modularity maximization [5], but many non-overlapping community detection algorithms would also have been appropriate). We then arranged the nodes within each house block-wise by the network communities found in that house’s sub-graph. To highlight meta-structure between dorms, we created a meta-graph by turning each house into one node in the meta-graph, and weighted the edges between nodes in the meta-graph by the total number of edges which connected houses in the original graph. We then found communities on the meta-graph using the same method. Finally, we ordered the houses in fig. 4 block-wise by the sets of houses that were identified as belonging to the same meta-community.
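The node-ordering procedure described for fig. 4 (per-house sub-graphs, a house-level meta-graph, and a block-wise ordering) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the input formats (`edges` as node pairs, `house_of` as a dictionary) and the function names `build_meta_graph` and `block_order` are hypothetical, and the community detection step itself (Louvain in the paper) is left out, with its results passed in as plain dictionaries.

```python
from collections import Counter

def build_meta_graph(edges, house_of):
    """Split edges into per-house sub-graphs and a weighted house-level meta-graph.

    edges: iterable of (u, v) undirected friendships (hypothetical format).
    house_of: dict mapping node -> house label (hypothetical format).
    Returns (meta_edges, internal_edges), where meta_edges maps
    frozenset({house_a, house_b}) to the number of edges between the two
    houses, and internal_edges maps each house to the edges inside it.
    """
    meta_edges = Counter()
    internal_edges = {}
    for u, v in edges:
        hu, hv = house_of[u], house_of[v]
        if hu == hv:
            # Edge lies inside one house: part of that house's induced sub-graph.
            internal_edges.setdefault(hu, []).append((u, v))
        else:
            # Edge connects two houses: adds one unit of weight in the meta-graph.
            meta_edges[frozenset((hu, hv))] += 1
    return meta_edges, internal_edges

def block_order(house_of, house_order, sub_community_of):
    """Order nodes house by house, then by sub-community within each house."""
    rank = {house: i for i, house in enumerate(house_order)}
    return sorted(
        house_of,
        key=lambda n: (rank[house_of[n]], sub_community_of[n], n),
    )

# Toy data: three houses, two edges crossing house boundaries.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (3, 4), (5, 6), (6, 7)]
house_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "C", 7: "C"}
meta_edges, internal_edges = build_meta_graph(edges, house_of)

# Sub-community labels would come from running a detector on each house's
# sub-graph; here they are made up for illustration.
sub_community_of = {1: 1, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0}
order = block_order(house_of, ["B", "A", "C"], sub_community_of)
```

Running any non-overlapping detector on `internal_edges[house]` and on the weighted `meta_edges` would supply the `sub_community_of` labels and the `house_order` that this sketch takes as given.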

AlgorithmicallyDetectedCommunities N ode s Features:Communities Labels:Dorms

Test dataTrain Data TrainedClassifierModelMeasurePredictionaccuracy F IG . 5: Procedure for a community-detection benchmark based on attribute inference.Thus, we would like to create a benchmarking scheme which allows for a ﬂexible relationshipbetween network communities and meta-data that we believe is closely related to community struc-ture. In such a scheme, we do not want to assume we have knowledge of a complete ground-truth set ofcommunities.We observe that the objective of machine learning is to come up with models that ﬂexibly capturethe relationship between a set of features and some target attribute. In fact, machine learning modelsare designed to be used precisely in situations where the relationship between the features and the targetattribute is complex and unknown. Building on this observation, we propose that one valid way toincorporate meta-data into a benchmarking scheme is to treat the meta-data as a node attribute whosevalue can be more easily inferred with a good set of network communities. In a way, we shift from aground-truth based benchmark to a task-oriented benchmark, where the task is to infer missing metadata.The benchmarking procedure that we propose is illustrated in ﬁg. 5. The ﬁrst step is to detectcommunities with the algorithm which is being benchmarked. These detected communities are usedto build a community assignment matrix, where each row corresponds to a node, and each columncorresponds to a detected community. The values in this matrix are either one or zero—a value of oneat position ( i , j ) indicates that node i is a member of community j . The meta-data is used to label eachnode’s class. 
With the community-assignment matrix as the feature matrix, and the meta-data as thelabels, we can then train a machine learning model in the usual manner, and use 10-fold cross validationto measure the accuracy of the model.The accuracy indicates how well the communities as features allow the meta-data to be inferred—the higher the accuracy, the better the communities. This makes sense if we assume that the communitystructure is closely related to the meta-data. However, there are some important conceptual objectionsone could raise against this benchmarking scheme. First, all sorts of network features might be usefulfor inferring missing meta-data, so the algorithm that performs best in such a benchmark may not evenbe detecting network communities at all. Second, if a community detection algorithm produces several0 of 19irrelevant communities, then many machine learning models are clever enough to simply ignore these.Thus, a community detection algorithm will not necessarily be punished for producing a large set ofbogus communities. In fact, if one had inﬁnite computing time and an ideal machine learning model,then a trivial community detection algorithm would always obtain the top score: the algorithm thatproduces all possible communities. Given all these communities, the ideal machine learning modelwould discover which subset of communities allowed the most accurate inference of the meta-data.We point out that any benchmarking scheme which is based on an incomplete ground-truth willhave the same problems, because given a detected community that does not correspond to anything inthe incomplete ground truth, one cannot say whether the community is invalid or whether it is valid butnot included in the ground truth.Furthermore, while these objections are valid and need to be kept in mind when interpreting results,we believe they can be addressed. 
One can ensure that the algorithm in question does in fact detectnetwork communities and that it does not detect bogus communities by running it on synthetic syntheticbenchmark networks with clear community structure in which the ground-truth is complete. In this con-text, the synthetic benchmark networks are not used to determine which community detection algorithmis the best, but rather as a sort of sanity check to make sure that the algorithm in question is in fact acommunity detection algorithm and does not detect a large number of bogus communities. Thus, thesynthetic network used should have clear community structure that is relatively easy to detect.One might argue that since the goal here is simply to measure how related the community assignmentmatrix is to some meta-data, a simpler approach would be to use some measure of mutual informationbetween the community assignment matrix and the meta-data. If the community structure were a non-overlapping partition, then this task would be straightforward—one could use the normalized mutualinformation as deﬁned in [7]. Because nodes can belong to several communities, this measurementbecomes more difﬁcult. While the mutual information of groupings (as opposed to partitions) has beendeﬁned (see [18] for an overview), these deﬁnitions do not take interactions between community mem-bership into account. Because we believe that in social systems the relation between community mem-bership and some target attribute may be more complex than allowed for by these measure of mutualinformation, we choose instead to use machine learning models to measure this relation, as these striveto form more ﬂexible hypotheses about how a feature space is related to some target attribute.We began this section by deﬁning three types of benchmarks used to measure the performance ofcommunity detection algorithms, and then explained why one would ideally use real-world networksin which the complete ground truth was known. 
We then argued that the meta-data associated with digitally extracted networks is unfortunately likely to be incomplete, and therefore inappropriate for real-world benchmarks. Thus, while in theory we would like to use real-world data with complete ground truths, in practice no such datasets exist. Finally, we proposed an alternative benchmarking scheme which includes both a task-oriented component and a “sanity check” component based on synthetic data. In the next section, we carry out the proposed task-oriented benchmark in order to provide a concrete example of what we have in mind, and to demonstrate that such a benchmarking scheme can reveal insight into the behavior of the community detection algorithms.
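As a reference point for the mutual-information alternative discussed earlier in this section, the following is a minimal sketch of normalized mutual information for two non-overlapping partitions, written as 2·I(X;Y) / (H(X) + H(Y)). The function name and the toy label lists are illustrative assumptions; the sketch does not cover the overlapping case that motivates the classifier-based approach.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI of two partitions given as per-node label lists of equal length."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information between the two labelings.
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    # Entropy of each labeling.
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())

    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions place every node in a single group
    return 2 * mutual / (entropy_a + entropy_b)


# Hypothetical toy data: a detected partition and an attribute partition.
detected = [0, 0, 1, 1]
attribute = ["x", "x", "y", "y"]
print(normalized_mutual_information(detected, attribute))
```

A score near 1 means the two partitions agree up to relabeling, and a score near 0 means they are close to independent.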

4. Illustrative example: a benchmark based on Facebook data

The primary purpose of this section is to flesh out the benchmarking procedure outlined in fig. 5. Along the way we also uncover problematic behavior exhibited by a few community detection algorithms; for example, the Louvain and InfoMap methods seem to detect community structure only at a very coarse level, even if hierarchical versions of these algorithms are used.

Data and experiment design.

First, a word on the data. In the last section, we illustrated an example with a Facebook network representing acquaintanceships at the University of Chicago. As we mentioned above, this network came from a larger dataset, the Facebook100 dataset from Traud et al. [27], which includes Facebook data on 100 collegiate networks; these 100 networks are the data used here. These networks range in size from 769 nodes and 17k edges to 36k nodes and 1.6m edges. The data has all of the desirable characteristics described above; for example, it comes directly from Facebook and is not sampled. Furthermore, the dataset includes node attribute information on five attributes: gender, year of graduation, dormitory (as used in the last section), academic major, and high school. Note that no edges exist between members of different networks, and that each network is treated independently of the other networks.

Traud et al. found that two of these attributes had a close relationship with community structure: year of graduation and dorm. Our benchmark will therefore contain two separate components: one in which communities are used to infer dorm membership, and the other in which they are used to infer year of graduation. For each combination of network and algorithm, we first detect communities and use them to build a community assignment matrix. Next, for each attribute, we use ten-fold cross validation to measure how well a classifier can infer the attribute based on the community assignment matrix. We measure a classifier's accuracy by simply calculating the percentage of cases in which the classifier correctly infers the attribute.

We reiterate that the assumption underlying the benchmarks is that each of these attributes is closely related to community structure, and so as the community assignment matrix improves and more closely resembles the unknown ground truth, this matrix will allow a machine learning classifier to infer missing attributes more accurately.
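The procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used for the benchmark: the function names are hypothetical, the folds are formed by simple striding, and a per-community majority vote stands in for the classifier so that the example needs no external libraries.

```python
from collections import Counter


def assignment_matrix(nodes, communities):
    """One row per node, one column per community; 1 marks membership."""
    return [[1 if node in comm else 0 for comm in communities] for node in nodes]


def k_fold_indices(n, k):
    """Partition range(n) into k folds by striding (sizes differ by at most 1)."""
    return [list(range(i, n, k)) for i in range(k)]


def fit_majority(rows, labels):
    """Stand-in classifier: majority training label per community column."""
    overall = Counter(labels).most_common(1)[0][0]
    per_column = {}
    for j in range(len(rows[0])):
        votes = Counter(lab for row, lab in zip(rows, labels) if row[j])
        if votes:
            per_column[j] = votes.most_common(1)[0][0]
    return overall, per_column


def predict_majority(model, row):
    """Vote among the majority labels of the communities the node belongs to."""
    overall, per_column = model
    votes = Counter(per_column[j] for j, bit in enumerate(row) if bit and j in per_column)
    return votes.most_common(1)[0][0] if votes else overall


def cross_validated_accuracy(rows, labels, k):
    """Mean fraction of held-out nodes whose attribute is inferred correctly.

    Assumes k is no larger than the number of labeled nodes.
    """
    # Nodes with a missing attribute value (None) are left out.
    known = [(r, lab) for r, lab in zip(rows, labels) if lab is not None]
    scores = []
    for fold in k_fold_indices(len(known), k):
        held_out = set(fold)
        train = [known[i] for i in range(len(known)) if i not in held_out]
        test = [known[i] for i in fold]
        model = fit_majority([r for r, _ in train], [lab for _, lab in train])
        correct = sum(predict_majority(model, r) == lab for r, lab in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Hypothetical toy network: two communities that match two dorms exactly.
nodes = list(range(8))
communities = [{0, 1, 2, 3}, {4, 5, 6, 7}]
dorm = ["A", "A", "A", "A", "B", "B", "B", None]  # one missing label
X = assignment_matrix(nodes, communities)
print(cross_validated_accuracy(X, dorm, k=2))
```

Because a node may appear in several columns, the same matrix layout handles overlapping communities without change.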

Classiﬁer.

Some of the community detection methods benchmarked here detect thousands of communities on these networks, so it is essential that the classifier used performs well in situations with thousands of features; otherwise our benchmark may be biased against methods which detect many communities. After experimenting with several classifiers and feature selection schemes, we found that an ensemble method called stochastic gradient boosting both performed best and was least sensitive to large numbers of communities. In particular, we use the implementation provided in the Python package scikit-learn [22], with the learning rate set to 0.005 and the number of trees set to 1000.

Community detection methods tested.

We benchmark four community detection algorithms: the Louvain method of modularity maximization [5], the InfoMap method of map equation minimization [23], the Link Community method (LC) [3], and the Greedy Clique Expansion algorithm (GCE) [16]. We choose the Louvain method and InfoMap because they are perhaps the two currently most popular methods of community detection. We include the Link Community method and GCE because they both claim to handle the case of overlapping communities particularly well, and we have reason to believe that in the Facebook data most nodes could belong to multiple communities.

We used the authors' implementation of the Louvain method, which allows for both flat and hierarchical partitions, both of which will be considered below. We also used the authors' implementation of the InfoMap algorithm presented in [23], which is designed to detect hierarchical community structure. Likewise, we used the authors' C++ implementation of LC, which can detect either a flat or hierarchical clustering, both of which will be tested below. Because LC often found vast numbers of extremely small communities, we removed all communities containing fewer than four nodes or three edges. For GCE, we used the authors' implementation, and set the value of the resolution parameter α to 1.5, as this value was recommended for the Facebook data in previous work. We should note that because GCE has a resolution parameter that has been tuned to this type of data, it has an unfair advantage over the other algorithms; however, in the latter part of this section we also try to optimally tune the resolution parameters of the other methods.

Table 1: Performance of four community detection algorithms on inferring dorm and year of graduation. The “UChicago Stats” columns indicate some relevant statistics about communities found on the University of Chicago's Facebook network, whose adjacency matrix is displayed in fig. 4.

              Dorm Accuracies   Year Accuracies   UChicago Stats
Method        Mean              Mean              Median     Smallest
gce15         47.0              65.0              43.0       266
infohier      32.7              57.2              171.0      114
louvain10     25.4              60.0              1016.0     27
linkcluster   27.9              38.5              4.0        10

Footnote: For some nodes, attribute values were missing; i.e., some Facebook users failed to mention on their profile which dorm they lived in or which year they graduated. We simply leave these nodes out of the cross-validation.

Footnote: There were two more parameter values to set: we required at least five examples for a split in a decision tree (i.e., the max samples subsplit parameter = 5) and set the subsampling rate to 0.4. We arrived at these values through experimentation, choosing those values which maximized the performance of the classifier.

Footnote: https://sites.google.com/site/findcommunities/

Results.

Results are presented in table 1. Note that training the classifier is computationally expensive, and for this reason we run our benchmark on only the forty smallest universities. Furthermore, rather than carrying out the evaluation of accuracy on all ten folds of the cross-validation scheme, we use only three. Thus, for each combination of community detection method and attribute, we detected communities on the forty smallest networks, then trained classifiers to infer the attribute on three folds, yielding a total of 120 classifiers. The “mean accuracy” columns presented in tables 1 to 3 therefore indicate the average accuracy obtained by these 120 classifiers. We also show the distribution of these 120 accuracies with a histogram.

Turning to the results, we see that GCE has the best performance on inferring values for both the dorm and year attributes; in particular, it performs substantially better than the other methods on the dorm inference task. We observe that the Louvain and InfoMap methods detected a smaller number of larger communities, whereas GCE tended to find more and smaller communities. We believe that the Louvain and InfoMap methods performed poorly because they missed the fine-grained community structure.

Footnote: https://github.com/bagrow/linkcomm

Footnote: https://sites.google.com/site/greedycliqueexpansion/

Footnote: Even so, our benchmark scheme took months to carry out on a machine with 32 cores; it required us to train 2400 classifiers (ten community detection methods, three folds, forty universities, two attributes). Some of these classifiers were slow to train because they needed to be trained on more than ten thousand features (a variation of the link-clustering method labeled below as linkClusterCombined found by far the most communities, and therefore the classifiers trained on these communities took up most of the CPU time).

[Table 2. Columns: Louvain parameters (Markov time, multi-level), dorm accuracies (histogram, mean), year accuracies (histogram, mean), UChicago stats (median, smallest).]

The Louvain method and multi-scale community detection.

To further investigate whether the Louvain method's poor performance is due to missing communities at the smallest scale, we perform additional experiments. The scores for the Louvain method presented in table 1 are based on the optimal flat partitioning. However, as the Louvain method is based on an agglomerative, hierarchical clustering, one can also include communities from all levels of the dendrogram, not just the flat cut which optimizes modularity. Blondel et al., the authors of the Louvain method [5], claim that the method “unfolds a complete hierarchical community structure for the network,” which suggests that the algorithm should detect community structures on all scales. In the second row of table 2, we present the results when communities detected at all levels are used. We note that the accuracy increases slightly, and that the number of communities found increases; for example, on the University of Chicago network, the number of communities increased from 27 to 97.

The findings of [9] indicate that to find community structure at all resolutions, the very definition of modularity should be parameterized with a parameter called “Markov time.” We test this claim by checking whether such a parameterized version of modularity can yield fine-grained communities that improve accuracy. We use the implementation by Renaud Lambiotte, which is fortunately based on the very same implementation of the Louvain method and so allows for direct comparison. We set the resolution parameter to 0.5 and 0.2, which are values that should detect community structure on a smaller scale than the unparameterized version of modularity used above, which implicitly sets this value to 1.0. For each of these values, we extract all communities from the dendrogram, as described in the last paragraph. We observe that when the Markov time is decreased, the number of communities detected increases and the accuracy increases significantly for the dorm attribute.
As the dorm attribute is more closely associated with finer-scale community structure, this indicates that the resolution parameter does indeed help to find community structure on a smaller scale.

This finding suggests that when modularity maximization techniques are used, then in order to find community structure at smaller scales, it is not enough simply to use a hierarchical clustering technique and make cuts at all levels in the resulting dendrogram. Rather, the very definition of modularity itself must be parameterized with a resolution parameter. While there is much theoretical literature on the “resolution limit” inherent in modularity, here we find strong empirical evidence of this limit.

Our findings here also place the results of Traud et al. [27] into doubt. They analyzed the community structure in the Facebook100 networks using a modularity maximization technique, but paid no consideration to the resolution limit. Traud et al. found that in larger universities, year of graduation was more relevant for community structure than dormitory assignment. Our results indicate that this finding is likely not inherent in the data, but rather due to a limitation of modularity maximization techniques: in larger networks, a naïve application of these techniques does not detect finer-grained communities.

The importance of a resolution parameter for modularity raises the question of whether InfoMap could also perform better if its objective function, the Map Equation, were parameterized with a resolution parameter. As mentioned above, while the implementation of InfoMap that we used is designed to detect community structure at all relevant resolutions, it tended to detect only larger communities. Along the lines of the parameterized definition of modularity discussed above, recent work in [25] has parameterized the Map Equation (which InfoMap optimizes) with a resolution parameter by modifying the Markov time used to compute the stationary distribution of the random walk.
We tried to use their implementation, but encountered unexpected behavior and results that were worse than with the unparameterized InfoMap. We believe that these results are related to implementation rather than the conceptual modifications, and therefore do not report these results.
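To make the role of the Markov time concrete, the following sketch scores two partitions of a toy network under a parameterized modularity. It assumes the linearized form Q_t = (1/2m) Σ_ij [t·A_ij − k_i·k_j/(2m)] δ(c_i, c_j), with any term that does not depend on the partition dropped. This form, the ring-of-triangles network, and the function names are illustrative assumptions, not the implementation used in the experiments above.

```python
def ring_of_triangles(n):
    """Edge list of n triangles joined in a ring by single edges."""
    edges = []
    for i in range(n):
        a, b, c = 3 * i, 3 * i + 1, 3 * i + 2
        edges += [(a, b), (b, c), (a, c)]
        edges.append((c, (3 * (i + 1)) % (3 * n)))  # link to the next triangle
    return edges


def modularity(edges, community_of, t=1.0):
    """Sum over communities of t * (internal endpoints / 2m) - (degree sum / 2m)^2."""
    two_m = 2 * len(edges)
    internal = {}  # number of edge endpoints on edges inside each community
    degree = {}    # total degree of each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 2
    return sum(
        t * internal.get(c, 0) / two_m - (d / two_m) ** 2
        for c, d in degree.items()
    )


edges = ring_of_triangles(16)
fine = {node: node // 3 for node in range(48)}    # one community per triangle
coarse = {node: node // 6 for node in range(48)}  # adjacent triangles merged

for t in (1.0, 0.2):
    print(t, modularity(edges, fine, t), modularity(edges, coarse, t))
```

On this toy network the merged pairs score higher at t = 1.0 (0.75 against 0.6875), while at t = 0.2 the individual triangles score higher, which mirrors the coarse-versus-fine behavior discussed above.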

Combining multiple runs to ﬁnd structure at all scales.

Leaving InfoMap aside, each of the three other algorithms has a resolution parameter: for the Louvain method, we have the Markov time; for the LC method, we have the threshold at which to cut the hierarchical clustering of edges; and GCE has a parameter α which is built into its local objective function. In order to detect communities at all scales, one could run the algorithm multiple times using different values for the resolution parameter, and then combine the results. In our final experiment, we check whether such a procedure increases the performance on the benchmark. We combine runs of the Louvain method where the Markov time is set to t = [...], ..., t = [...]; [...] α is set to 0.8, 1.0, 1.3, 1.5, 1.7, and 2.2.

When combining several runs of a community detection algorithm, the resulting set of communities can contain a very large set of near-duplicate communities. For example, in a single run, the LC method finds over 100,000 communities on some of the Facebook100 networks, so when several runs are combined, this number can reach into the millions. The vast majority of these millions of communities are near-duplicates of other communities, an undesirable property in most settings, and one which in the current context makes the training of the classifier computationally expensive. When we combine several runs of an algorithm, we therefore remove the near duplicates by following the procedure outlined in section 2 of [16], setting ε to 0.5; this technique basically removes communities that have a Jaccard similarity of greater than 0.5 with any communities of equal or lesser size.

The results of this final experiment are displayed in table 3. We see that each method has benefitted by combining the results of multiple runs at different settings of the resolution parameter.

Table 3: Results for when each algorithm is run several times, with different values for the resolution parameter, and the distinct communities from all runs are combined. Note the large improvement for the dorm attribute when compared to table 1.

                      Dorm Accuracies   Year Accuracies   UChicago Stats
Method                Mean              Mean              Median     Smallest
linkClusterCombined   44.7              61.3              5.0        15731
gceCombinedE500       53.4              75.3              22.0       890
louvainCombined       50.8              73.6              59.0       278

Footnote: Available at [...]

Footnote: In particular, when the resolution parameter of the parameterized InfoMap is set to 1.0, then the original InfoMap should be recovered, but this was not the case.

We now wrap up this demonstration benchmark with a summary of our findings. Many community detection techniques strive to be parameter free so that they can automatically detect communities without requiring a user to experiment with different parameter values [12]. While this is a worthy goal, the results of this section indicate that to achieve good performance, one must tweak the resolution parameter of every method tested here.
If we compare the results in table 1 with those in table 3, we see that if one simply trusts the algorithm to automatically set the resolution parameter, then the method may in practice struggle to find structure at all relevant levels. For example, the naive application of the Louvain method produces a set of communities which allow a classifier to infer the dorm attribute with an accuracy of only 25.6%, whereas by combining multiple runs with different values of the Markov time parameter, one can obtain an accuracy of 50.4%. This dramatic increase in performance (as well as our analysis above) indicates that the naive application of the algorithm failed to detect much of the finer-grained community structure.
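The near-duplicate removal used when combining runs can be sketched as follows. This is one greedy reading of the description given earlier in this section (a community is dropped if its Jaccard similarity with an already kept community of equal or lesser size exceeds the threshold); the procedure in [16] may differ in detail, and the function names and toy communities are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def drop_near_duplicates(communities, threshold=0.5):
    """Keep communities from smallest to largest, skipping near-duplicates."""
    unique = {frozenset(c) for c in communities}  # exact duplicates collapse here
    kept = []
    # Sort by size, then by members, so that the result is deterministic.
    for comm in sorted(unique, key=lambda c: (len(c), sorted(c))):
        if all(jaccard(comm, other) <= threshold for other in kept):
            kept.append(comm)
    return kept


# Hypothetical communities pooled from several runs of one algorithm.
pooled = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3}, {7, 8, 9}, {1, 2, 3, 4}]
print(drop_near_duplicates(pooled))
```

The pairwise comparison is quadratic in the number of kept communities, so at the scale of millions of communities mentioned above a practical version would need some form of indexing.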

5. Conclusion

In section 2, we distinguished between digitally extracted networks and small, hand-curated networks, and we argued that these two types of data differ in important ways. We also pointed out that although the recent wave of community detection methods are supposed to work on digitally extracted networks, in practice the only real data they are tested on are small, hand-curated networks. As a result, we do not know whether these methods work on digitally extracted networks. This situation is caused in large part by the lack of digitally extracted networks with acceptable ground-truth data; indeed, in section 3 we showed that even in cases where it appears that one may have a reasonable ground truth set of communities, this ground truth is likely to be quite incomplete, and therefore unsuitable for a straightforward benchmark such as the one depicted in fig. 3. In that section we also proposed an alternative benchmarking scheme that is appropriate for the case where one has only an incomplete ground truth. Finally, in section 4 we employed this alternative benchmarking scheme.

The results in section 4 demonstrate how the inference-based benchmarking scheme we proposed in section 3 can reveal limitations of community finding algorithms. The benchmarks indicate that some of the most popular community detection methods struggle to detect communities at smaller scales. We note that this problem did not emerge when these methods were benchmarked on small, hand-curated networks. This suggests that we cannot assume that a community detection method which works well on small, hand-curated networks like Zachary's Karate Club will work as well on digitally extracted networks.

An unfortunate drawback of the benchmarking approach presented here is its complexity. The benchmarking workflow involves components that are not directly related to community detection, such as classifiers, which add extra parameters whose values must be set with care.
While this complexity is unfortunate, we feel that the simpler approach of treating meta-data as if it contained a complete ground truth, as in [29, 30], is even more problematic because it may unfairly punish a community algorithm for detecting a valid network community that does not exist in an incomplete ground truth.

One benefit of this benchmarking approach is that it indicates a practical problem for which network communities are useful. We can confirm that network communities are in fact useful for inferring missing attribute values. While it is not the primary concern of this paper and so we will not go into great detail on the matter, we note that we also conducted further experiments in which the goal was not to benchmark community detection algorithms, but rather to infer attribute values as accurately as possible. In these experiments we included two additional types of node features that one would almost certainly use in practice. First, each node had features indicating its attributes such as gender, academic major, year of graduation, and dorm (of course, we left out the attribute which we were trying to infer). Second, each node had features indicating the percentage of its friends who had each attribute value; e.g., the percentage of friends that were male or female, and the percentage of friends in each possible academic major. We found that these two simple feature types allowed for more accurate classification than network communities.
When we combined all three feature types (i.e., each node's attributes, its friends' attributes, and the network communities to which it belonged), the accuracy improved only slightly (by less than 1%) over the case where we left out the network communities altogether.

Thus, if one were trying to infer the missing attribute values in the Facebook100 dataset, one could obtain quite good results even if one ignored network communities altogether and used simpler features based simply on node attributes and the distribution of these attributes in each node's egocentric network. While this may be a rather gloomy finding for champions of community detection, we note that this finding does not indicate that network communities are useless, but rather that in the particular case of the Facebook100 dataset, these two other features simply happen to contain information which is very useful for inferring missing attribute values.

Footnote: The accuracy with which we could infer missing node attributes should be taken with a grain of salt. In our scheme, we measured this accuracy by holding out data from nodes whose labels were known, whereas in practice, one would want to infer values for nodes with unknown labels. It could be that these nodes with missing labels differ from the nodes with known labels, and as a result, the accuracy of inferring their labels might differ. Because our objective was simply to measure how related community structure is to meta-data, and not to accurately measure how well we can infer missing data in practice, this limitation is not problematic for the work presented here, but should be borne in mind by those who are interested in attribute inference for its own sake.

We conclude by noting that while the benchmarking here was based on the task of inferring missing node attributes that we believe to be closely related to community structure, one could construct conceptually similar benchmarks based on different tasks. One natural example would be to use network communities to perform supervised link prediction; this is a natural fit because presumably the processes responsible for link formation are closely related to the processes which form network communities.
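For readers who want to reproduce the simpler baseline features mentioned earlier in this section, the fraction of a node's friends holding each attribute value can be computed as in the sketch below. The function name, the dictionary-based data layout, and the choice to ignore friends with missing values are assumptions made for illustration.

```python
from collections import Counter


def friend_fraction_features(node, friends_of, attribute_of, values):
    """Fraction of a node's friends (with a known value) holding each value."""
    known = [
        attribute_of[f]
        for f in friends_of.get(node, ())
        if attribute_of.get(f) is not None
    ]
    counts = Counter(known)
    total = len(known)
    # One feature per possible attribute value, in the order given by `values`.
    return [counts[v] / total if total else 0.0 for v in values]


# Hypothetical egocentric network for node "a".
friends_of = {"a": ["b", "c", "d", "e"]}
gender = {"b": "F", "c": "F", "d": "M", "e": None}
print(friend_fraction_features("a", friends_of, gender, ["F", "M"]))
```

These columns can be appended to each node's row of features alongside, or instead of, the community assignment columns.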

Funding

This work is supported by Science Foundation Ireland under grant no. 08/SRC/I1407, Clique: Graph and Network Analysis Cluster.

References

[1] (2004) University of Chicago House system. Accessed: 28/12/2012.
[2] (2012) Announcing New Residence Hall and Dining Commons. http://csl.uchicago.edu/feature/announcing-new-residence-hall-and-dining-commons. Accessed: 28/12/2012.
[3] Ahn, Y., Bagrow, J. & Lehmann, S. (2010) Link communities reveal multiscale complexity in networks. Nature, (7307), 761–764.
[4] Author, U. (1933) Emotions Mapped by New Geography. The New York Times.
[5] Blondel, V., Guillaume, J., Lambiotte, R. & Lefebvre, E. (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, (10), P10008.
[6] Clauset, A., Moore, C. & Newman, M. (2007) Structural inference of hierarchies in networks. Statistical network analysis: models, issues, and new directions, pages 1–13.
[7] Danon, L., Guilera, A. D., Duch, J. & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, (9), P09008–09008.
[8] Davis, A., Gardner, B. & Gardner, M. (1941) Deep south. University of Chicago Press Chicago.
[9] Delvenne, J., Yaliraki, S. & Barahona, M. (2010) Stability of graph communities across timescales. Proceedings of the National Academy of Sciences, (29), 12755–12760.
[10] Duch, J. & Arenas, A. (2005) Community detection in complex networks using extremal optimization. Physical Review E, (2), 027104.
[11] Forsyth, E. & Katz, L. (1946) A Matrix Approach to the Analysis of Sociometric Data: Preliminary Report. Sociometry, (4), pp. 340–347.
[12] Fortunato, S. (2010) Community detection in graphs. Physics Reports, (3-5), 75–174.
[13] Freeman, L. C. (2003) Finding social groups: A meta-analysis of the southern women data. In Dynamic Social Network Modeling and Analysis, pages 39–97. The National Academies Press.
[14] Girvan, M. & Newman, M. (2002) Community structure in social and biological networks. Proceedings of the National Academy of Sciences, (12), 7821–7826.
[15] Lancichinetti, A., Kivelä, M., Saramäki, J. & Fortunato, S. (2010) Characterizing the community structure of complex networks. PLoS One, (8), e11976.
[16] Lee, C., Reid, F., McDaid, A. & Hurley, N. (2010) Detecting highly overlapping community structure by greedy clique expansion. SNA-KDD 2010, page 11.
[17] Marras, E., Travaglione, A., Chaurasia, G., Futschik, M. & Capobianco, E. (2010) Inferring modules from human protein interactome classes. BMC Systems Biology, (1), 102.
[18] McDaid, A., Greene, D. & Hurley, N. (2011) Normalized Mutual Information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515.
[19] Mislove, A., Viswanath, B., Gummadi, K. & Druschel, P. (2010) You are who you know: inferring user profiles in online social networks. In Proceedings of the third ACM international conference on Web search and data mining, pages 251–260. ACM.
[20] Moreno, J. (1934) Who shall survive?: a new approach to the problem of human interrelations. Nervous and Mental Disease Publishing Co.
[21] Newman, M. (2006) Modularity and community structure in networks. Proceedings of the National Academy of Sciences, (23), 8577–8582.
[22] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. & Duchesnay, E. (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2825–2830.
[23] Rosvall, M. & Bergstrom, C. (2011) Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. PLoS One, (4), e18209.
[24] Sampson, S. (1968) A novitiate in a period of change: An experimental and case study of social relationships. PhD thesis, Cornell University.
[25] Schaub, M., Delvenne, J., Yaliraki, S. & Barahona, M. (2012) Markov dynamics as a zooming lens for multiscale community detection: non clique-like communities and the field-of-view limit. PLoS One, (2), e32210.
[26] Stabeler, M., Lee, C., Williamson, G. & Cunningham, P. (2011) Using Hierarchical Community Structure to Improve Community-Based Message Routing. In ICWSM 2011 Workshop on Social Mobile Web Workshop, SMW.
[27] Traud, A., Mucha, P. & Porter, M. (2011) Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications.
[28] Traud, A. L., Mucha, P. J. & Porter, M. A. (2012) Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications, (16), 4165–4180.
[29] Yang, J. & Leskovec, J. (2012a) Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, page 3. ACM.
[30] Yang, J. & Leskovec, J. (2012b) Structure and Overlaps of Communities in Networks. In Proceedings of the 6th SNA-KDD Workshop.
[31] Zachary, W. W. (1977) An Information Flow Model for Conflict and Fission in Small Groups. Journal of Anthropological Research, 33