ABACUS: frequent pAttern mining-BAsed Community discovery in mUltidimensional networkS

Abstract

Community Discovery in complex networks is the problem of detecting, for each node of the network, its membership to one or more groups of nodes, the communities, that are densely connected, or highly interactive, or, more generally, similar, according to a similarity function. So far, the problem has been widely studied in monodimensional networks, i.e. networks where only one connection between two entities can exist. However, real networks are often multidimensional, i.e., multiple connections between any two nodes can exist, either reflecting different kinds of relationships, or representing different values of the same type of tie. In this context, the problem of Community Discovery has to be redefined, taking into account the multidimensional structure of the graph. We define a new concept of community that groups together nodes sharing memberships to the same monodimensional communities in the different single dimensions. As we show, such communities are meaningful and able to group highly correlated nodes, even if they might not be connected in any of the monodimensional networks. We devise ABACUS (Apriori-BAsed Community discoverer in mUltidimensional networkS), an algorithm that is able to extract multidimensional communities based on the apriori itemset miner applied to monodimensional community memberships. Experiments on two different real multidimensional networks confirm the meaningfulness of the introduced concepts, and open the way for a new class of algorithms for community discovery that do not rely on the dense connections among nodes.


arXiv [cs.SI] Jun


Michele Berlingerio · Fabio Pinelli · Francesco Calabrese


Abstract

Community Discovery in complex networks is the problem of detecting, for each node of the network, its membership to one or more groups of nodes, the communities, that are densely connected, or highly interactive, or, more generally, similar, according to a similarity function. So far, the problem has been widely studied in monodimensional networks, i.e. networks where only one connection between two entities may exist. However, real networks are often multidimensional, i.e., multiple connections between any two nodes may exist, either reflecting different kinds of relationships, or representing different values of the same type of tie. In this context, the problem of Community Discovery has to be redefined, taking into account the multidimensional structure of the graph. We define a new concept of community that groups together nodes sharing memberships to the same monodimensional communities in the different single dimensions. As we show, such communities are meaningful and able to group nodes even if they might not be connected in any of the monodimensional networks. We devise ABACUS (frequent pAttern mining-BAsed Community discoverer in mUltidimensional networkS), an algorithm that is able to extract multidimensional communities based on the extraction of frequent closed itemsets from monodimensional community memberships. Experiments on two different real multidimensional networks confirm the meaningfulness of the introduced concepts, and open the way for a new class of algorithms for community discovery that do not rely on the dense connections among nodes.

M. Berlingerio, IBM Research, Dublin, Ireland. E-mail: mberling [at] ie.ibm.com
F. Pinelli, IBM Research, Dublin, Ireland. E-mail: fabiopin [at] ie.ibm.com
F. Calabrese, IBM Research, Dublin, Ireland. E-mail: fcalabre [at] ie.ibm.com

Keywords

Community discovery · Multidimensional Networks · Social Network Analysis

Inspired by real-world scenarios such as social networks, technology networks, the Web, biological networks, and so on, in recent years wide, multidisciplinary, and extensive research has been devoted to the extraction of non-trivial knowledge from networks. Predicting future links among the nodes or actors of a network ([13]), detecting and studying the diffusion of information among them ([23,29]), and mining frequent patterns of nodes' behaviors ([4,47,20]) are only a few examples of tasks in the field of Complex Network Analysis, which involves, among others, physicists, mathematicians, computer scientists, sociologists, economists and biologists. The data at the basis of this field of research is huge, heterogeneous, and semantically rich, and this makes it possible to identify many properties and behaviors of the actors involved in a network.

One crucial task at the basis of Complex Network Analysis is Community Discovery, i.e., the discovery of a group of nodes densely connected, or highly related. Many techniques have been proposed to identify communities in networks ([28,21]), allowing the detection of hierarchical connections, influential nodes in communities, or just groups of nodes that share some properties or behaviors. In order to do so, the connections among the nodes of a network have so far been placed at the center of investigation, since they play a key role in the study of the network structure, evolution, and behavior.

Most of the work in the literature is limited to a very simplified perspective of such relations, focusing only on whether two nodes are connected or not, and possibly assigning a strength to this connection. In the real world, however, this is not always enough to model all the available information about the interactions between actors, including their multiple preferences, their multifaceted behaviors, and their complex interactions. While multiple types of connections among actors could still be represented in a monodimensional network, by collapsing all connections to one type and potentially affecting a measure of tie strength, a more sophisticated analysis of the network structure, which maintains information on the semantic differences in how actors are connected, would help all the techniques to provide more meaningful communities.

To this aim, in this paper we deal with multidimensional networks, i.e. networks in which multiple connections may exist between a pair of nodes, reflecting various interactions (i.e., dimensions) between them. Multidimensionality in real networks may be expressed by either different types of connections (two persons may be connected because they are friends, colleagues, they play together in a team, and so on), or different quantitative values of one specific relationship (co-authorship between two authors may occur in several different years, for example).

Fig. 1 Example of multidimensional networks

This distinction is reported in Figure 1, where on the left we have different types of links, while on the right we have different values (conferences) for one relationship (for example, co-authorship). We can also distinguish between explicit and implicit dimensions, the former being relationships explicitly set by the nodes (friendship, for example), the latter being relationships inferred by the analyst, that may link two nodes according to their similarity or other principles (two users may be passively linked if they wrote a post on the same topic).

In this scenario, we deal with the problem of Multidimensional Community Discovery, i.e. the problem of detecting communities of actors in a multidimensional network. We define a new concept of multidimensional community that groups nodes sharing their membership to the same monodimensional communities in the same single dimensions. This concept gives us the possibility to leverage traditional monodimensional community discovery algorithms. It then allows us to define the lattice of multidimensional communities as a function of the subset of dimensions for which the monodimensional community memberships of nodes are shared. Each multidimensional community can then be represented by the associated subset of dimensions, providing a semantic meaning to the community. Note that while the problem of finding cross-dimensional or cross-network structures is not new [15,5,46], our definition of multidimensional community differs from the previous ones.
In fact, using this definition, a multidimensional community could be unconnected, i.e. composed of nodes which are not directly connected in any of the dimensions. This represents a complex phenomenon that can be seen in the real world: not all the people in a social community are necessarily connected directly, and, if they share their memberships in more than one dimension, they can be seen as a (potentially unconnected) group of highly related (both positively and negatively) people.

We devise ABACUS (frequent pAttern mining-BAsed Community discoverer in mUltidimensional networkS), an algorithm that extracts multidimensional communities such as the one in Figure 2, working in four steps:

1. Each dimension is treated separately and monodimensional communities are extracted
2. Each node is labeled with a list of pairs (dimension, community the node belongs to in that dimension)
3. Each pair is treated as an item and a frequent closed itemset mining algorithm is applied
4. Frequent closed itemsets represent multidimensional communities described by the itemsets

Fig. 2 An example of a real multidimensional community found in DBLP by our algorithm ABACUS, that other methods are not able to detect. Nodes in the community: Amit Agarwal, Qikai Chen, Swaroop Ghosh, Patrick Ndai, Kaushik Roy.
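Steps 2 and 3 of the list above can be illustrated with a short Python sketch. It assumes that step 1 has already produced, for every dimension, a non-overlapping assignment of nodes to community ids. The function name `build_transactions`, the input layout and the toy author names are hypothetical and are not taken from the ABACUS implementation.

```python
from collections import defaultdict


def build_transactions(communities_by_dimension):
    """Label each node with its (dimension, community) pairs.

    communities_by_dimension: dict mapping a dimension name to a dict
    {node: community id} (assumed output of a monodimensional algorithm).
    Returns a dict {node: frozenset of (dimension, community) items}.
    """
    transactions = defaultdict(set)
    for dimension, assignment in communities_by_dimension.items():
        for node, community in assignment.items():
            # Each pair becomes one item of the node's transaction.
            transactions[node].add((dimension, community))
    return {node: frozenset(items) for node, items in transactions.items()}


# Hypothetical toy input: two dimensions (conferences), three authors.
communities_by_dimension = {
    "KDD": {"ann": 1, "bob": 1, "carl": 2},
    "VLDB": {"ann": 1, "bob": 1},
}
transactions = build_transactions(communities_by_dimension)
for node in sorted(transactions):
    print(node, sorted(transactions[node]))
```

In this toy input, "ann" and "bob" end up with identical transactions, which is the situation a frequent closed itemset miner would report as a two-dimensional community.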

ABACUS is based on existing monodimensional algorithms for community discovery (used as a parameter), and on the extraction of frequent closed itemsets, that, in our scenario, represent the multidimensional description of the communities.

Our main contribution can then be summarized as follows: we introduce the new concept of multidimensional communities, and the ABACUS algorithm to extract them (Section 4); we show the applicability of ABACUS to real world multidimensional networks (Section 5), together with a comparison with previous approaches to the problem of community discovery in multidimensional networks.

Detecting communities in networks has been studied from many angles. Two comprehensive surveys on the topic can be found in [21,28]. On one side, a community has been defined as a set of nodes with a high density of links among them, and sparse connections with nodes outside the community. The papers working with this quantitative definition rely on information theory principles [35] or on the notion of modularity [19], which is a function defined to detect the ratio between intra- and inter-community number of edges. Modularity is widely used in many works, and several algorithms have been proposed to extract high modularity partitionings of a network: one of them is a greedy optimization able to scale up to networks with billions of edges [9].

On another side, communities have been approached by looking at the statistical properties of the graph. In [24], a framework for the detection of overlapping communities, i.e. communities allowing the vertices to be in more than one community, is presented. The framework is based on the “split betweenness” concept: vertices and edges are ranked by their betweenness centrality (the fraction of shortest paths in which they appear) and then split in order to form a transformed network, where classical algorithms can be used to detect communities. The resulting communities are then merged in order to find overlaps. Another class of approaches relies on the propagation in the network of a label [39] or a particular definition of structure (usually a clique [34]). The first approach is known for being a quasi-linear solution to the problem; the second one allows overlapping communities to be found. One algorithm that maximizes quality and quantity measures on its results is InfoMap [41], a random walk-based algorithm. An emerging novel problem definition can be found in [1], in which the authors state that community discovery algorithms should not group nodes but edges, emphasizing the role of the relation residing in a community.

The methods described so far have focused on either unweighted or weighted graphs, but still consider the network as a monodimensional entity. Only recently has multidimensionality started to be taken into account in network analysis. A few examples of studies are: link prediction in networks with positive and negative links [27] or in multidimensional networks [40]; a statistical analysis over different kinds of relations in the same network in an online game community [43]; analysis of structural properties of multidimensional networks [6,8] and its applications to multidimensional hub analysis [7].

From a community discovery point of view, to the best of our knowledge, there are three main approaches that take multiple dimensions into account.
In [31] the authors extend the definition of modularity to fit the multidimensional case, which they call “multislice”. However, no definition of “multidimensional community” is provided, nor does the approach characterize and analyze the communities found. Instead, the authors use the multidimensional information to extract monodimensional communities. In [44] the authors create a machine learning procedure which detects the possible different latent dimensions among the entities in the network and uses them as features for the node classification algorithm. In other words, they use the multidimensional labels of some nodes to infer labels for other nodes, by means of edge clustering. Hence, multidimensionality is only present here in terms of node labels, but the input network is not multidimensional according to our definition (see Section 3), and the output is not in the form of multidimensional communities. In [5], a possible formulation of community discovery and characterization in multidimensional networks was given. A new measure was introduced to capture the interplay among the dimensions, that makes multidimensional communities emerge even where the connections among nodes reside in different dimensions. In this paper, we approach the problem from a similar angle, but focus on extracting communities using frequent itemset mining, and giving a semantic description to each multidimensional community as the subset of dimensions used to characterize it. Resulting multidimensional communities may be different from the ones extracted in [5] and are navigable using the lattice extracted in the frequent itemset mining process.

Another work that deals with networks containing heterogeneous information, but not multiple dimensions, is presented in [42], where the authors propose a method to generate net-clusters using links across multi-typed objects. This approach works on heterogeneous networks, i.e. networks where nodes may have different types (e.g. papers or authors), and does not deal with multidimensional networks, i.e. networks where edges may be of different types and two nodes may be connected by multiple edges.

The authors of [15] studied the problem of community mining in multi-relational networks. The problem setting, however, is different: the authors exploit the multi-relational links to evaluate the importance of the relations based on labeled examples, provided by a user as queries. Hence, they do not perform community detection, but rather extract the importance of each dimension for a given node, in the form of a weight.

The idea of applying closed frequent pattern mining to multi-relational data is not new. In [17], the authors extract all closed n-sets satisfying given piecewise (anti-)monotonic constraints, from n-ary relations. In [33], the authors presented a framework for constraint-based pattern mining in multi-relational databases, finding patterns not under (anti-)monotonic and closedness constraints, expressed over complex aggregates over multiple relations. However, both works solve the technical problem of finding the frequent closed patterns, but do not apply this technique to the setting of multidimensional network analysis.

There are other works in the literature that deal with the extraction of knowledge across networks. In [46] and [48], for example, the authors deal with the problem of finding cross-graph quasi-cliques.
This problem can be seen as a sub-problem of the one we deal with in this paper. However, our concept of community is independent of the density of the connections among the nodes. Two other papers [18,30] deal with the extraction of cliques with particular constraints: in the first work, the authors search for cliques that remain cliques over time; in the second, cliques with homogeneous node attributes are found. They however do not deal with the community discovery problem.

Based on all the above, we believe that two approaches may be considered really related to our problem formulation, namely [31] and [5], thus we use these as baselines for comparison in Section 5.

In the world as we know it we can see a large number of interactions and connections among information sources, events, people, or items, giving birth to complex networks. Enumerating all the possible networks detectable within our world, or their properties, would be difficult due to their number and heterogeneity, and it is beyond the scope of this paper. An excellent survey on complex networks can be found in [32], where the author gives a good classification of networks into social (where, for example, we find on-line social networks such as Facebook), information (for example, citation networks), technological (among which we mention the power grid, the train routes, or the Internet), and biological (e.g., protein interaction networks) networks.

While all the example networks presented in [32] are monodimensional, in the real world it is possible to find many multidimensional networks: transportation networks (transport means are different dimensions), social networks (different online services may be seen as different dimensions connecting the same users), and co-authorship networks (different venues as dimensions) constitute a short, non-exhaustive list of possible real-world examples.

3.1 A model for multidimensional networks

In its classical definition, a network is a structure made up of a set of entities and connections among them. We want to extend this definition by allowing connections of different kinds, that we call dimensions.

We use a multigraph to model a multidimensional network and its properties. For the sake of simplicity, in our model we only consider undirected multigraphs and, since we do not consider node labels, hereafter we use edge-labeled undirected multigraphs, denoted by a triple G = (V, E, L) where: V is a set of nodes; L is a set of labels; E is a set of labeled edges, i.e. the set of triples (u, v, d) where u, v ∈ V are nodes and d ∈ L is a label.
Also, we use the term dimension to indicate label, and we say that a node belongs to or appears in a given dimension d if there is at least one edge labeled with d adjacent to it. We also say that an edge belongs to or appears in a dimension d if its label is d. We assume that given a pair of nodes u, v ∈ V and a label d ∈ L only one edge (u, v, d) may exist. Thus, each pair of nodes in G can be connected by at most |L| possible edges.

3.2 Real world dataset

We created two multidimensional networks from the well known digital bibliography database DBLP (http://dblp.uni-trier.de/xml) and from a search engine query log.

– DBLP. We extracted author-author relationships if two authors collaborated in writing at least one paper. The dimensions of this network are defined as the venues in which the paper was published, resulting in 2,536 conferences that took place in years 2000-2010 (all the editions of a conference are considered as one dimension). As the network was created in year 2012, we consider our temporal subset to be complete for years 2000-2010. We weighted each edge by the number of papers published by the two connected authors in the same conference (dimension). The final network consisted of 558,800 nodes, connected by 2,668,497 edges in 2,536 dimensions. A small extract of this network is represented in Figure 3(a). Figure 4(a) reports the distribution of the number of edges per dimension (the dimensions are sorted by the values of the y axis). A high number of edges corresponds to a high number of editions of a conference and/or a high number of published papers and/or a high co-authorship number per paper.

– QueryLog. This network was constructed from a query log of approximately 20 million web-search queries submitted by 650,000 users, as described in [36]. We extracted a word-word network of query terms (nodes), connecting two words if they appeared together in a query. The dimensions are defined as the rank positions of the clicked results, grouped into six almost equi-populated bins: “Bin1” for rank 1, “Bin2” for ranks 2-3, “Bin3” for ranks 4-6, “Bin4” for ranks 7-10, “Bin5” for ranks 11-500. Hence two words appeared together in a query for which the user clicked on a resulting url ranked […]

Fig. 3 Small extracts from the multidimensional DBLP and Query Log networks: (a) small DBLP extract; (b) small Query Log extract. Edge line styles correspond to dimensions in the network, i.e. distinct conferences for DBLP, and binned ranks for Query Log.

Fig. 4 Distribution of the number of edges in each dimension in DBLP and Query Log: (a) edges in DBLP dimensions; (b) edges in Query Log dimensions. Axes: dimension (sorted by y) versus number of edges per dimension. Rank of the dimensions based on number of edges.

Following the classification in [32], we took one social and one information network, with different features (semantic: social vs information network; number of dimensions: thousands vs five; number of nodes: ∼ ∼ […]

ABACUS framework

In this section we present the core theoretical concepts of our problem. After defining the types of communities we are seeking, we show how to map the problem of finding multidimensional communities to the problem of extracting frequent closed itemsets from community memberships, and then finally present ABACUS, the algorithm proposed to solve our definition of the problem.

4.1 A new concept

As said above, most of the existing approaches to the problem of community discovery rely on a concept of community which is structure-based. That is, nodes with dense connections (or high interaction) are grouped together (in some cases, overlapping communities are also discovered). In this paper, we change this perspective. Let us start with a real-world example. In the WWW context, nowadays it is very popular to be connected in services like Facebook, Twitter, Google+ and, possibly, all of them. Each of these services sees different communities that can be spotted within their sets of nodes. As many users today have their online identities replicated across the different social networks, it is very likely that people sharing their membership in community k in service s are also sharing their membership in community k′ in service s′. Extending this, we can easily imagine that many communities (especially small ones) would be exactly replicated across different dimensions.

In addition to this, there is another effect that can be detected in the real world. Even within close circles of friends, it is common to see pairs of people who are not directly connected. There can be many reasons for this: they can be enemies, or potential friends not yet connected, or there can be obstacles to setting up their connection (in some example networks, spatial constraints may inhibit people living too far away from connecting to each other). Yet, in these cases, two or more persons can share their memberships to communities in different contexts, or social networks, or, more generally, dimensions.

Two nodes A and B can then end up being logically connected by their shared memberships (say, to community 3 in dimension Google+ and to community 4 in dimension Facebook), but never actually connected in any of the dimensions in which they appear. This concept of logical connection here is crucial.
Monodimensional approaches have a limited view of the rich set of connections among nodes, and disregarding the additional information provided by multiple dimensions would be restrictive. Let us consider a co-authorship graph in DBLP, where each conference is a different dimension. Two persons in such a network can easily be found to have connections in conferences such as KDD, VLDB, and SDM, while they are not connected, or not even present, in other dimensions such as AAAI, or SIGGRAPH, and so on. This piece of information is usually lost in traditional algorithms working on monodimensional networks, and, unfortunately, weights do not help in conveying this additional knowledge entirely.

On the other hand, if we use the shared memberships as the key concept for connecting people (thus, not necessarily directly connected), we are linking them logically, using the semantics residing in the dimensions.

4.2 From communities to itemsets

Following the above idea, we can proceed as follows. First, we can split a multidimensional network into several monodimensional ones. We can then perform any existing technique for monodimensional community discovery, obtaining, for each node of the original network, a set of memberships to communities in each single dimension. We are now using the nodes as transactions of items, where an item is a pair (dimension, community) expressing the membership of the node in the various dimensions. At this point, applying frequent pattern mining to find frequent closed itemsets [3] appears to be natural. There is, in fact, a natural mapping of almost all the concepts in the frequent closed itemset mining (FCIM) paradigm in our problem: nodes are transactions; memberships are items; multidimensional communities are itemsets; the support of an itemset is the number of nodes sharing that set of memberships, and so on.
Even the constraint-based paradigm [10] has a role in our problem: one can, in fact, use constraints on the itemsets (e.g. excluding/including specific items, computing any monotonic or convertible measure on itemsets, and so on). For the sake of simplicity, we reserve this part of the problem for future work, and we focus only on the extraction of frequent closed itemsets. In this new domain, it is also necessary to define concepts for a common understanding. With the term support, we mean the number of nodes that are members of a given multidimensional community. For instance, in the case of a co-authorship multidimensional network, two is the support of a multidimensional community formed by two authors as members. Moreover, the size represents the number of different dimensions involved in a multidimensional community. Again in the multidimensional co-authorship network, two is the size of a multidimensional community composed of two dimensions such as two conferences.
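The two definitions above can be made concrete with a minimal Python sketch. It assumes that every node is stored as a set of (dimension, community) pairs; the helper names `support` and `size`, the author names and the community ids are hypothetical and serve only as an illustration.

```python
def support(itemset, transactions):
    """Number of nodes (transactions) whose memberships include every item."""
    return sum(1 for items in transactions.values() if itemset <= items)


def size(itemset):
    """Number of distinct dimensions involved in the itemset."""
    return len({dimension for dimension, _community in itemset})


# Hypothetical memberships: node -> set of (dimension, community) pairs.
transactions = {
    "author_1": frozenset({("KDD", 1), ("VLDB", 3)}),
    "author_2": frozenset({("KDD", 1), ("VLDB", 3), ("SDM", 2)}),
    "author_3": frozenset({("KDD", 1)}),
}

# A candidate multidimensional community described by two memberships.
community = frozenset({("KDD", 1), ("VLDB", 3)})
print(support(community, transactions), size(community))  # prints "2 2"
```

Here the community is shared by two authors (support two) and spans two conferences (size two), matching the co-authorship example in the text.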

Fig. 5 Run-through example: co-authorship network with two dimensions: KDD and VLDB (top), monodimensional overlapping communities (bottom)

Let us follow a run-through example of our search strategy. Figure 5 describes our toy input network (top), consisting of six nodes, connected in three different dimensions (KDD, VLDB and PKDD). From the top image to the ones below, we perform two steps: first, we split the multidimensional network into three monodimensional ones; then, we perform the community discovery on each of them. The algorithm finds two different communities (highlighted by different line styles) in the VLDB and KDD dimensions, and one formed by a single node in the PKDD dimension. Note that, as this is just meant to provide an example to guide the reader through the steps of our methodology, we did not run a real community discovery algorithm here, and instead built the communities such that the resulting output would contain all the features we want to explain by means of this example. In particular, we imagined running an overlapping monodimensional community discoverer, assigning communities also to single nodes (node 6).

Fig. 6 Run-through example: set of monodimensional communities (items) associated to each node (transaction), on the left; lattice of multidimensional communities extracted using an algorithm for mining frequent closed itemsets, in the middle; resulting frequent closed itemsets (multidimensional communities) and supporting transactions (nodes) on the right.

Items: A = VLDB-1, B = VLDB-2, C = KDD-1, D = PKDD-1, E = KDD-2

TID   ITEMS
1     ABC
2     BCE
3     ABCE
4     BE
5     ABCE
6     D

ITEMSETS   TID
B          1 2 3 4 5
BC         1 2 3 5
BE         2 3 4 5
ABC        1 3 5
BCE        2 3 5
ABCE       3 5

The output of this process is represented on the left of Figure 6, which shows the list of transactions that can be built from the memberships of the six nodes. The central part of Figure 6 then shows how the lattice of multidimensional communities is created. We see how the community found in the PKDD dimension gets cut due to a minimum support threshold σ = 2. In bold black we have the frequent itemsets, while with a bold dashed line we highlighted the closed frequent ones. Finally, we see that the closed frequent itemsets clearly summarize the entire set of frequent itemsets found, so it would be redundant to return also non-closed itemsets. In the right part of the figure we show the final output. The first community is found with a single membership to B, i.e. VLDB-2. This is clearly a monodimensional community, that our algorithm is still able to extract. The last community, formed by nodes 3 and 5, shows how we are able to extract communities of nodes that were not necessarily connected in the initial input. Indeed, nodes 3 and 5 are unconnected in all the dimensions of the example. We want to emphasize here that the ability to find such communities is not given by the type of monodimensional community discoverer, but it is due to the mapping to the frequent pattern mining paradigm, and the fact that our concept of multidimensional communities groups together nodes sharing memberships to the same monodimensional communities in the same dimensions.
Nodes 3 and 5, in fact, share their memberships to the communities VLDB-1, VLDB-2, KDD-1 and KDD-2.

ABACUS algorithm. Algorithm 1 is the core of our approach. It takes as input three parameters: the multidimensional network G, a monodimensional algorithm for community discovery CD, and a minimum support threshold σ. The algorithm works by building a set of transactions memberships that, for each node n, records a set of pairs (i, j) representing the membership of node n to community j in dimension i. Note that if CD is able to find overlapping communities, one node may have more than one pair associated with a specific dimension. This would result in more possible combinations, i.e. more distinct items, and thus a higher number of resulting communities. However, this does not change the type of communities that ABACUS may find, namely groups of nodes sharing memberships to the same monodimensional communities in the same dimensions. Thus, for the sake of simplicity, and without loss of generality, in the rest of the paper we show experiments conducted with a non-overlapping community discovery algorithm.

Note also that CD may or may not take into account edge weights to drive the search for communities in a more meaningful way. As the networks presented in Section 3 are weighted, in our experimental evaluation in Section 5 we use an algorithm that takes edge weights into account.

In line 4 the function φ is called to split the multidimensional network into a set of monodimensional ones, by replicating each node into each of the dimensions in which it has at least one edge, and adding to it all of its adjacent edges in their corresponding dimensions. Each dimension is then processed as a separate network G_i by CD in a for loop, returning a different set of communities per dimension. In lines 6-8, for each node in each community, its memberships are updated with the pair (dimension, community), building a set of transactions (one per node). The function map returns a unique item code for its argument. This set is then passed to the frequent closed itemset miner (FCIM) in line 11, together with a minimum support threshold, and the resulting set of frequent closed itemsets is returned, constituting the multidimensional description of each community. In Section 5 we show how, by using an implementation of FCIM that also returns the transaction ids of each itemset, we also get the set of nodes contained in each community (i.e., the ids of the transactions supporting the frequent closed itemset).

The complexity of ABACUS is directly inherited from the complexity of the algorithm used for FCIM, and from that of the method for monodimensional community discovery. The additional complexity introduced by ABACUS, in fact, resides only in the problem-mapping phase, where we perform a linear scan of the list of communities found and prepare the input for FCIM. We therefore refer to the corresponding papers for a discussion of the complexity, although in Section 5.3.3 we present an empirical evaluation of the complexity of ABACUS.

Algorithm 1 ABACUS
Require: G, CD, σ
 1: for all n ∈ nodes(G) do
 2:     memberships[n] = ∅
 3: end for
 4: for all G_i ∈ φ(G) do
 5:     for all c_j ∈ CD(G_i) do
 6:         for all n ∈ nodes(c_j) do
 7:             memberships[n] ← memberships[n] ∪ map((i, j))
 8:         end for
 9:     end for
10: end for
11: I ← FCIM(memberships, σ)
12: return I

We implemented ABACUS in C++, making use of the igraph library. As CD parameter, we use the community discovery algorithm based on label propagation [39], which takes into account edge weights. This algorithm is well known to be scalable and, as a result, our running times to process the network were low (a few seconds up to the creation of the transaction file, plus a few minutes to perform frequent closed itemset mining; see Section 5.3 for running times). In all the experiments we set the minimum support threshold to 2, in order to capture all the possible connections among nodes.

We chose an efficient implementation of Eclat [12] as frequent itemset miner, with options to return both the frequent closed itemsets and the list of supporting transactions for every itemset.

Note that many other choices are possible for the CD and the frequent itemset mining steps and that, for the sake of simplicity of presentation, we only report the results obtained with the above choice. Note also that, while the choice of the frequent closed itemset mining implementation is usually driven mainly by scalability issues, selecting a different algorithm for community discovery may lead to very different communities. The debate on which algorithm to choose is however out of the scope of this paper, and we refer to Section 2 and to the surveys on community discovery [21,28] to guide the reader to the best choice for this step, which is mainly driven by the final application. Moreover, despite the possibility of returning different types of communities, we want to emphasize that the ability to return potentially unconnected communities is given by the mapping of the problem as described in Section 4, and not by the choice of CD. In fact, as said above, our multidimensional communities represent nodes that share memberships to the same monodimensional communities in the same dimensions.
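To make the steps of Algorithm 1 concrete, the following is a minimal Python sketch; it is not the C++ implementation used in the experiments. It assumes that the split φ and the discoverer CD have already been run, so the input is, for each dimension, the list of communities found in it. The function names `build_memberships` and `closed_frequent_itemsets` are hypothetical, and the brute-force enumeration of closed itemsets stands in for Eclat only for illustration: it is exponential in the number of nodes and suitable for toy inputs only.

```python
from collections import defaultdict
from itertools import combinations


def build_memberships(communities_per_dimension):
    """Lines 1-10 of Algorithm 1: build one transaction (set of items) per node.

    communities_per_dimension maps a dimension label i to the list of
    communities (sets of nodes) returned by some CD algorithm on G_i.
    Each item is the pair (i, j): membership to community j in dimension i.
    """
    memberships = defaultdict(set)
    for dim, communities in communities_per_dimension.items():
        for j, community in enumerate(communities):
            for node in community:
                memberships[node].add((dim, j))
    return dict(memberships)


def closed_frequent_itemsets(memberships, min_support):
    """Naive stand-in for FCIM: closed frequent itemset -> supporting nodes.

    The intersection of any group of at least min_support transactions is a
    closed itemset with support >= min_support, and every closed frequent
    itemset arises this way, so enumerating all such groups is sufficient.
    """
    nodes = sorted(memberships)
    closed = {}
    for size in range(max(min_support, 1), len(nodes) + 1):
        for group in combinations(nodes, size):
            common = frozenset.intersection(
                *(frozenset(memberships[n]) for n in group)
            )
            if not common:
                continue
            # Supporting transactions: every node whose items include `common`.
            closed[common] = frozenset(
                n for n in nodes if common <= memberships[n]
            )
    return closed


if __name__ == "__main__":
    # Hypothetical toy input: node 3 belongs to two VLDB communities,
    # i.e. an overlapping CD is assumed here.
    toy = {
        "VLDB": [{1, 2, 3}, {3, 4}],
        "KDD": [{1, 2}, {3, 4}],
    }
    for itemset, support in closed_frequent_itemsets(build_memberships(toy), 2).items():
        print(sorted(itemset), sorted(support))
```

Returning the supporting nodes together with each itemset mirrors the option, mentioned above, of having the miner output the transaction ids of every itemset.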
This concept does not require the nodes to be directly connected in all the dimensions found in the multidimensional community.

igraph library: http://igraph.sourceforge.net

All the experiments were performed on a laptop equipped with an Intel i7 processor at 2.2GHz, with 4GB of RAM.

5.2 Experiments

We organized our experiments around three questions related to our problem:

Q1. Quantitative evaluation: given the high number of resulting communities, how can we easily reduce the patterns to select only a set of meaningful ones?
Q2. Are there relational dependencies between our concept of communities and their structural properties?
Q3. Qualitative evaluation: among the communities found, are there any relevant ones? Can we reason on the multidimensional density of the connections within the communities?

In order to answer the above questions, we define a simple and easy to compute measure of connectedness within communities. The Multidimensional Community Density (MCD) is the number of edges in a community normalized by the maximum possible for that community, or, in formula:

$$\frac{edges}{ndim \times nodes \times (nodes - 1)} \qquad (1)$$

where ndim is the number of different dimensions found in the community.

[Figure 7 plot: cumulative distributions, P(X >= x) versus MCDensity, for DBLP and Query Log.]

Fig. 7 Cumulative distributions of Multidimensional Community Density (MCD) for the two networks

[Figure 8 panels, each plotting P(X >= x): (a) distribution of the support in DBLP; (b) distribution of the support in Query Log; (c) distribution of the size of the itemsets in DBLP; (d) distribution of the size of the itemsets in Query Log.]

Fig. 8 Cumulative distribution of support (top row) and size of the itemsets (bottom row), for DBLP (left column) and Query Log (right column)
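As an illustration of the MCD measure defined above (edges divided by ndim × nodes × (nodes − 1)), here is a small Python sketch. The edge representation as `(u, v, dimension)` triples and the function names are assumptions made for this example. The text does not say whether an undirected link counts once or in both directions, so the sketch simply counts the triples it is given; with the formula as stated, a value of 1 is reached only if every ordered pair of nodes appears in every dimension.

```python
def count_internal_edges(edge_list, community_nodes, community_dims):
    """Count the (u, v, dimension) triples lying entirely inside the community."""
    return sum(
        1
        for u, v, dim in edge_list
        if u in community_nodes and v in community_nodes and dim in community_dims
    )


def mcd(edge_list, community_nodes, community_dims):
    """Multidimensional Community Density: edges / (ndim * nodes * (nodes - 1))."""
    nodes = len(community_nodes)
    ndim = len(community_dims)
    if nodes < 2 or ndim < 1:
        # Density is undefined for degenerate communities; 0.0 is a convention
        # chosen for this sketch only.
        return 0.0
    edges = count_internal_edges(edge_list, community_nodes, community_dims)
    return edges / (ndim * nodes * (nodes - 1))


if __name__ == "__main__":
    # Hypothetical toy data: three authors, two dimensions, one outside edge.
    edges = [
        ("a", "b", "KDD"),
        ("b", "c", "KDD"),
        ("a", "c", "VLDB"),
        ("a", "z", "KDD"),  # ignored: z is not in the community
    ]
    print(mcd(edges, {"a", "b", "c"}, {"KDD", "VLDB"}))
```

A community of unconnected nodes, such as the one formed by nodes 3 and 5 in the run-through example, has no internal edges and therefore gets density 0 under this sketch.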

Let us answer Q1. The frequent pattern mining literature reports that the problem of finding a few relevant patterns to be interpreted, among the many returned, is hard [14,25]. We can overcome this problem in three different ways. First, we can look at the distributions of the MCD (defined above), the support of the patterns, and the size of the itemsets to focus our search towards the communities that we consider relevant, depending on the final application. Figures 7 and 8 report these distributions (we report the cumulative versions, to be able to use the three measures as straightforward filters). For better comparison, we report on the y-axes the percentage of communities with values of the measures greater than a given threshold. However, the absolute number of communities can be used to choose, depending on the application, the best support, size of the itemset, and MCD to select only the relevant communities. Second, more generally speaking, the entire Constraint-Based Frequent Pattern Mining literature can be applied in our scenario at running time, to drive the search to fewer, more focused patterns [38,37,11]. For example, we may want only patterns including or excluding a specific dimension, or patterns including dimensions with specific properties (e.g., at least 1000 authors). In this respect, it is worth noting that MCD is neither (anti-)monotone, nor convertible, nor loose-antimonotone. We leave for further research the definition of meaningful, application-driven constraints, and their effect on the results. Lastly, the authors of [14] present another methodology for selecting few interesting patterns among many, which is not based on constraints. We believe that this technique may also be used, and we plan to investigate this opportunity in the future.

To answer Q2 we check whether MCD is correlated with other structural properties of the nodes of the communities.
For example, one possible intuition is that communities with low density may group together nodes that were at the borders of the monodimensional communities. To study this, we computed the closeness centrality for each node and for each dimension, and checked the correlation between the centrality and the density. We did not find any clear sign of direct correlation. We also checked for correlation with PageRank, the degree centrality and the betweenness centrality, for which again we found no signs of correlation. Based on these results, we believe that MCD is yet another measure that can be used to filter the output towards more focused results.

Lastly, in order to answer Q3, we extracted a few communities either minimizing or maximizing MCD. In the remainder, we call MCS the number of nodes in a community. As stated above, we can use the distributions of size, support and MCD to post-process the results and keep only the few interesting ones. We extracted a few (i.e., 200) communities for each network, and we report four of them in Figure 9. Besides the first example, which was found by searching for one of the co-authors of this paper, the others were found by examining the results filtered by means of the three measures mentioned above. In particular: Figure 9(b) was found within 260 communities obtained by constraining MCD < …, MCS ≥ … and size ≥ 3; Figure 9(c) was found among 287 communities obtained by constraining MCD = 1, MCS ≥ … and size ≥ 2; Figure 9(d) was found within 286 communities obtained by constraining MCD < …, MCS ≥ … and size ≥ 4. These thresholds were obtained by looking at the distributions reported above.

Consider the one in Figure 9(a). We discovered a size-4 community connecting FP, FG, MN and DP with dimension set {KDD, GIS, SAC, SEBD}. It is interesting to observe that, given its very dense connections, this multidimensional community would also have been found by using the methods proposed in [5,31].

However, the method proposed in this paper can discover more complex interactions between dimensions. Indeed, the lattice can be used to browse the multidimensional communities by selecting different dimension sets. To give an example, we extracted a size-3 community composed of authors AA, PN, QC, KR and SG with connections in the dimension set {IOLTS, DATE, ISLPED}; see Figure 9(b), where the different multidimensional memberships are shown. These authors are part of three monodimensional communities, but have not co-authored papers at these three conferences (there are no links connecting them). By adding the dimension ICCD (solid line circle), we are able to extract a size-4 community composed of the first three authors. This fourth dimension includes a paper co-authored by the three authors, which resulted in an ICCD-monodimensional community formed by the three nodes. Interestingly, through this dimension we are able to specialize the previously discovered community of five authors. Note that by using the methods proposed in [5,31] it would not be possible to discover how the ICCD dimension could specialize the community, and so its semantic meaning. This is due to the fact that our results include more information than those of the mentioned works.

Fig. 9 Four communities with high and low MCD extracted from the two networks. Nodes in (a): Fosca Giannotti, Mirco Nanni, Dino Pedreschi, Fabio Pinelli. Nodes in (b): Amit Agarwal, Qikai Chen, Swaroop Ghosh, Patrick Ndai, Kaushik Roy. The dashed ovals represent the shared memberships to the same community in the corresponding dimensions (see circle labels). The dashed anonymous nodes in (b) represent several nodes belonging to the communities in dimensions IOLTS, ISLPED and DATE, and are not visualized to simplify the readability. In (d), we report with a single solid line the point-to-point connections found in all the five dimensions.

Similar results can be obtained by applying ABACUS to the Query Log dataset. Finding a community of words in this network means finding a set of words typically used together in queries that lead to good or bad results. A set of words found together only in dimension 1 is a set of words that, used together in a query, lead to very specific results (users clicked on the first result). Words found together only in dimension 5, on the other hand, lead to lower ranked results. If we find words together in different dimensions, it may mean either that the concept the users were looking for is only representable by words used in conjunction, or that they need more terms to be disambiguated. An example of the first kind is shown in Figure 9(c), where, maximizing the MCD, we were able to detect a highly connected multidimensional community where the words Machu, Picchu, and unbelievable are connected in three different dimensions of the dataset. An example of the second kind, on the other hand, is shown in Figure 9(d). In this example, we can observe that, if we consider all the dimensions, we obtain a set of words belonging to the same multidimensional community with a strong intrinsic semantic correlation (i.e. Pablo, Picasso, Neruda: besides sharing their first name, there exists an edition of a book by Neruda with a Picasso painting on the cover); then, by removing the most specific dimension (Bin 1, i.e. click on the first returned result), we include words that make the concept broader. Also in this case, the methods proposed in [5,31] do not allow investigating the effect of the different dimensions on the specialization of the communities and, thus, the intrinsic semantic correlation among different words.

5.3 Comparison with previous approaches

As reported in Section 2, in [5] the authors proposed another way to extract multidimensional communities. Their approach is however based on a different concept of community: a multidimensional community groups nodes that are highly multidimensionally connected. The evaluation of this multidimensional connectedness is left to the end of the process, by post-processing the resulting communities. Their approach is composed of the following steps: first, the multidimensional network is collapsed to a monodimensional one (i.e., they follow exactly the opposite of our first step), by weighting the edges in different ways; second, monodimensional community discovery is performed on the resulting network; third, multidimensional connections are restored on the resulting communities from the original network; finally, the communities are evaluated by means of multidimensional measures.

The approach described in [31] works in a similar way, although it presents some differences.
The approach works in two phases. In the first phase, the adjacency matrices corresponding to the different dimensions are coupled by connecting, for each entity, its node representation i in dimension k′ to its node representation j in dimension k″. This step is driven by a coupling parameter ω, which controls the weight of this inter-dimension connection. This is basically a node-centric monodimensional collapsing pre-process on the multidimensional information, as opposed to the edge-centric one done in [5]. In the second phase, the authors apply a modularity-driven monodimensional community discovery to extract the communities. There are thus two main differences between this baseline and the one presented in [5]: first, the pre-processing step in which the multidimensional information is collapsed is done at the node level, rather than on the edges; second, instead of being parametric in the monodimensional community discovery algorithm, the authors apply a strategy that aims at maximizing a multidimensional version of the modularity function.

Table 1 Statistics of the large and small nets used in the comparisons

We then compared against both these approaches, using a C++ implementation of the method presented in [31], and a C++ implementation of the method presented in [5].

https://code.launchpad.net/louvain

We wanted to compare the three approaches at different levels. In particular, we wanted to answer the following questions:

Q4. Quantitative evaluation: how do the sets of returned communities compare? Can we measure their intersection and the number of communities that only our method or a given baseline may find?
Q5. Qualitative evaluation: what do the different concepts of community look like?
Q6. Scalability: how do the methods perform on networks of different size?

In order to address the above, we ran ABACUS and the two baselines on several subsets of the DBLP dataset. Hereafter, we refer to the method proposed in [5] as MD and to the method proposed in [31] as GL.

We created two additional (w.r.t. the networks presented in Section 3) sets of networks by taking incrementally larger subsets of DBLP, including all the nodes, edges and dimensions contained in different temporal windows. This was needed to be able to compare against the two baselines, which present different scalability in terms of both running time and memory occupation, as we see in Section 5.3.3 (in particular, we were not able to run GL on large networks). The first set, called "large nets" hereafter, consists of 11 networks corresponding to the single year 2010, the years 2009 and 2010, the years between 2008 and 2010, and so on, up to the years from 2000 to 2010. The second set, called "small nets" hereafter, consists of 11 networks corresponding to the single year 1990, the years 1989 and 1990, and so on, up to the years from 1980 to 1990. Table 1 reports the basic statistics of the two sets of networks. As we see, the small nets are much smaller than the large ones, in terms of nodes, edges and number of dimensions.

Figure 10(a) reports the number of communities found by ABACUS and MD in the large networks, while Figure 10(b) reports the number of communities found by ABACUS, GL with different values of ω, and MD in the small networks.

In the large networks, as we see, due to the strategy of collapsing the multidimensional network to a monodimensional one, the number of communities found by MD becomes nearly stable after adding four years. In fact, after the first step, each additional year included in the subset only changes the weight of existing edges, instead of creating new ones (and bringing new nodes). On the other hand, the search space of ABACUS grows consistently up to the last two or three steps, where the growth slows down. By keeping the dimensions separated, in fact, each additional year is able to provide a significant number of new combinations to the previous ones. Although the number of results returned by ABACUS is high, we have discussed in Q1 how to deal with it.

In the small networks, the above trend is followed as well, but we can make further considerations regarding the different baselines. First of all, we see how, due to the definition of the GL approach, setting ω = 0 leads to a larger number of communities compared to other values of ω. This value of the parameter actually forces each node representing the same entity in different dimensions to be grouped separately. In other words, the dimensions are treated in a disjoint way, i.e. the algorithm performs community discovery in each dimension separately.

[Figure 10 panels: (a) communities found in the large nets (data subset [x,2010]; ABACUS, MD); (b) communities found in the small nets (data subset [x,1990]; ABACUS, GL with ω = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, MD). The y-axes report the number of communities found.]

Fig. 10 Quantitative comparisons between ABACUS and the baselines (each x value corresponds to an additional year included in the subset, from 2000 to 2010 in (a), and from 1980 to 1990 in (b)). Number of communities found by ABACUS and MD on the large nets in (a), and by ABACUS and all the baselines on the small nets in (b). In these plots, MD is always the leftmost bar within a stack, and ABACUS is always the rightmost one.
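The incremental temporal windows behind the x axes of these plots (a single year, then the last two years, and so on) can be sketched as follows. This is only an illustration: the record layout `(u, v, dimension, year)` and the function names are assumptions made for this example, not the format used by the authors.

```python
def incremental_windows(first_year, last_year):
    """Return [[last], [last-1, last], ..., [first, ..., last]]."""
    return [
        list(range(start, last_year + 1))
        for start in range(last_year, first_year - 1, -1)
    ]


def subset_by_years(edges, years):
    """Keep the (u, v, dimension, year) records whose year falls in the window."""
    wanted = set(years)
    return [edge for edge in edges if edge[3] in wanted]


if __name__ == "__main__":
    large_windows = incremental_windows(2000, 2010)  # the "large nets"
    small_windows = incremental_windows(1980, 1990)  # the "small nets"
    print(len(large_windows), large_windows[0], large_windows[1])
    print(len(small_windows), small_windows[-1][0], small_windows[-1][-1])
```

Each window yields one network, which is why both the large and the small sets contain 11 networks.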

The plots also show the difference in the number of communities returned with values of ω greater than zero, although there appear to be no major differences among the experiments run with values greater than zero. To better explore the sensitivity of these experiments to the ω parameter, we also ran GL with values up to 10000. Figure 11 shows that there is no substantial difference in running GL with values larger than zero and up to 10000. Because of this, and for the sake of simplicity, in the following we only show the results obtained with a few values of ω.

Looking at the plots, a few clear questions arise: are the three methods finding the same communities? Is one method returning communities found also by the competitors? Can we identify (classes of) communities that can be found only by one of the three methods? Figure 12 partially answers these questions from a quantitative point of view. Calling A the set of communities found by ABACUS and B the set returned by a given baseline, the light gray bar (always the leftmost bar in a stack) shows |A ∩ B| / |A ∪ B|, i.e. the portion of communities found by both; the dark gray bar (always the bar in the middle of a stack) shows |B \ A| / |A ∪ B|, i.e. the portion of communities found only by the baseline; and the black bar (always the rightmost bar in a stack) shows |A \ B| / |A ∪ B|, i.e. the portion of communities found only by ABACUS. Note that in order to compare the communities found we had to remove the multidimensional information contained in those found by ABACUS. This step is however correct, i.e. there cannot be two instances of the same set of nodes tied to two different sets of dimensions (itemsets), as this would violate the theory behind closed itemsets. Note also that, in analogy with the majority of the works on community discovery and on frequent pattern mining, we perform exact matching here, thus we are only counting the identical communities in this comparison.

As we see, since the bars report relative numbers, the ratio of communities that can be found only by the baselines decreases as the subset of years grows. Put in other words, even if we know that ABACUS is meant to find communities of a different type than the ones found by the baselines, we see how, for large datasets, the set of communities found only by one of the baselines becomes smaller. Moreover, the number of communities found only by ABACUS increases accordingly. This is clearly related to the type of communities that only ABACUS can find, i.e. the communities of unconnected nodes, or, more formally, communities formed by more than one connected component. In the following, we answer the above questions also from a qualitative perspective.

[Figure 11 plot: number of communities found, data subset [x,1990], for GL with ω = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 10, 100, 1000, 10000.]

Fig. 11 Number of communities in the small nets returned by GL for different values of ω.

[Figure 12 panels: (a) large nets, MD; (b) small nets, MD; (c) small nets, GL with ω = 0.0; (d) small nets, GL with ω = 0.6. Each panel reports the % of the set union for the categories Only ABACUS, Only baseline and Intersection, per data subset.]

Fig. 12 Comparison between sets of communities found. Each plot compares ABACUS against a different baseline or net: large nets and MD in (a), small nets and MD in (b), small nets and GL with ω = 0.0 in (c), small nets and GL with ω = 0.6 in (d). Calling the two sets A (for ABACUS) and B (for baseline), we show, for each interval of years: |A ∩ B| / |A ∪ B|, i.e. the portion of communities found by both ABACUS and the baseline, with the light gray bar; |A \ B| / |A ∪ B|, i.e. the portion of communities found only by ABACUS, with the black bar; |B \ A| / |A ∪ B|, i.e. the portion of communities found only by the baseline, with the dark gray bar. In these plots, MD or GL are always the leftmost bar within a stack, and ABACUS is always the rightmost one.
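The set-based comparison reported in Figure 12 can be sketched in a few lines of Python. The sketch assumes that each community has already been reduced to its set of nodes (the dimensions attached to ABACUS communities are dropped, as described above) and uses exact matching; the function name and the dictionary keys are hypothetical.

```python
def compare_community_sets(abacus_communities, baseline_communities):
    """Return the three fractions of |A ∪ B| plotted as bars in Figure 12."""
    a = {frozenset(c) for c in abacus_communities}
    b = {frozenset(c) for c in baseline_communities}
    union = a | b
    if not union:
        # Nothing found by either method: report zeros rather than divide by 0.
        return {"both": 0.0, "only_baseline": 0.0, "only_abacus": 0.0}
    total = len(union)
    return {
        "both": len(a & b) / total,           # |A ∩ B| / |A ∪ B|
        "only_baseline": len(b - a) / total,  # |B \ A| / |A ∪ B|
        "only_abacus": len(a - b) / total,    # |A \ B| / |A ∪ B|
    }


if __name__ == "__main__":
    # Hypothetical toy communities, given as collections of node ids.
    found_by_abacus = [{1, 2}, {3, 4}, {5, 6}]
    found_by_baseline = [{2, 1}, {7, 8}]
    print(compare_community_sets(found_by_abacus, found_by_baseline))
```

Using frozensets makes the match independent of node order, and the three fractions always sum to one because the three categories partition the union.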

The two concepts of communities found by ABACUS and by the baselines are different, without a clear winner (i.e., they just reflect different types of interactions among nodes). This is also visible in the different classes of communities that only one of the two kinds of methods can find. Consider Figure 3: if that was the entire input, MD would collapse the network into a monodimensional one and possibly find only one community containing all the four nodes. GL would also collapse the multidimensional connectivity according to the parameter ω. This cannot happen in ABACUS, as the principle for which the nodes are found in the same multidimensional community is to share memberships to monodimensional communities. That is, if Figure 3(a) was the entire input, ABACUS would find Jon Doe and John Smith in a multidimensional community, but not the entire set of nodes, as the remaining two do not share all the memberships with the other nodes (they do not exist in dimensions ICDM, CIKM and SIGMOD). Figure 13 shows two communities found during our comparison: (a) was found only by ABACUS, and (b) was found only by both the baselines. Note that we depict all the edges in the original input, if there were any, and in (b) we also report the outgoing edges. While it is clear that (a) cannot be found by the baselines (as they rely on connectedness, but there are no edges among those nodes in the input), in order to confirm that (b) could not be found by ABACUS we had to investigate whether the four nodes were sharing memberships to the same communities in the depicted dimensions. That is, even if the image shows a community that could not be detected by ABACUS if the depicted edges were the entire input data, there might be other edges (and paths) in the data connecting the nodes. After post-processing the data, we found that this was not the case for (b), as different nodes are connected in different dimensions (see also the outgoing edges).

[Figure 13 panels: (a) found only by ABACUS; (b) found only by the baselines.]

Fig. 13 Examples of communities found by (a) only ABACUS and (b) only the baselines. Nodes in (a): Deepak Agarwal, Zhiyuan Chen, Nitin Gupta. Nodes in (b): Anika Awwal, Matthew Jin, Gul N. Khan, Anita Tino. In (b) we also depicted the other outgoing edges from each node that were present in the input data. By the nature of the approaches, (a) could not be found by the baselines and (b) could not be found by ABACUS, as different nodes exist in different dimensions.
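The check described above, i.e. whether a group of nodes shares memberships to the same communities in a given set of dimensions, can be sketched as follows. The layout of `memberships` (node mapped to a set of `(dimension, community)` pairs) and both function names are assumptions made for this illustration; the minimum support threshold is ignored here.

```python
def shared_memberships(memberships, nodes, dimensions):
    """Return the (dimension, community) pairs common to all the given nodes,
    restricted to the dimensions of interest."""
    node_list = list(nodes)
    if not node_list:
        return set()
    common = set(memberships.get(node_list[0], set()))
    for node in node_list[1:]:
        common &= memberships.get(node, set())
    return {(dim, comm) for dim, comm in common if dim in dimensions}


def share_all_dimensions(memberships, nodes, dimensions):
    """True if the nodes share at least one community in every given dimension."""
    shared_dims = {dim for dim, _ in shared_memberships(memberships, nodes, dimensions)}
    return shared_dims == set(dimensions)


if __name__ == "__main__":
    # Hypothetical memberships: "a" and "b" fall in different D2 communities.
    toy = {
        "a": {("D1", 0), ("D2", 0)},
        "b": {("D1", 0), ("D2", 1)},
        "c": {("D1", 0), ("D2", 0)},
    }
    print(share_all_dimensions(toy, {"a", "c"}, {"D1", "D2"}))
    print(share_all_dimensions(toy, {"a", "b"}, {"D1", "D2"}))
```

Under the definition used in this paper, a group for which this check fails cannot be returned as a multidimensional community over those dimensions, regardless of how its nodes are connected.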

The last part of our comparison regards scalability. Consider Figure 14, reporting the running times (in seconds) of ABACUS and MD on the large nets on the left, and of ABACUS, MD and GL on the small nets on the right. As we see, even though by adding years we implicitly add dimensions as well (not all the conferences take place in all the years, see Table 1), this has a very low impact on the running time of ABACUS, and a very high impact on that of the baselines. Note that here we report only the running times obtained with a minimum support of two. That is, we do not test the sensitivity to the minimum support parameter, as we already give the worst case. In practice, when looking for larger communities (depending on the application), the running times may be even lower.

To conclude, ABACUS is scalable: it processed our data in 32 to 1200 seconds (20 minutes) on the large nets, while MD needed 380 seconds (about 6 minutes) to 13500 seconds (225 minutes, i.e. almost 4 hours); on the small nets, it needed from less than a second to 2.5 seconds, while MD needed up to 7.5 seconds and GL up to 86 seconds.

We also report that we were not able to run GL on networks larger than the small networks we used because of its memory occupation, which exceeded 4GB when processing the large nets.

[Figure 14 panels: (a) running time (seconds), data subset [x,2010], ABACUS and MD; (b) running times (seconds), data subset [x,1990], ABACUS, GL with ω = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, and MD.]

Fig. 14 Quantitative comparisons between ABACUS and the baselines. Running times (in seconds) of ABACUS and MD on the large nets in (a), and of ABACUS and all the baselines on the small nets in (b).

In this paper, we have addressed the problem of multidimensional community discovery. We have given a definition of multidimensional community for which nodes sharing memberships to the same monodimensional communities in the different single dimensions are grouped together. This leads us to define a community extractor combining the use of:
– A given monodimensional community discovery algorithm (that could also allow for overlapping communities)
– Frequent itemset pattern mining to allow merging discovered monodimensional communities into multidimensional ones
By browsing the lattice generated by the frequent closed itemset mining algorithm, it is possible to extract multidimensional communities of different sizes (pattern lengths) and so navigate the complex multidimensional structure of a network, in a way that previous methods could not permit.

The proposed method could lead to the development of analytical tools to characterize the redundancy in the dimensions, the impact of new dimensions on the network structure, and more in general to evaluate the interplay between dimensions. For these reasons, we see potential applications in real world problems including characterizing the interplay between mobility and communication dimensions in a place-to-place network [16], the similarity between users in a user-mobility profile network [45], or the analysis of the spreading of infectious diseases [2]. We leave for future research the analysis of potential applications of ABACUS.

Acknowledgements. We would like to thank Mason Porter and Peter Mucha for helpful discussion around [31], and Vincent Traag for providing us with a C++ implementation of GL.

References

1. Ahn, Y.Y., Bagrow, J.P., Lehmann, S.: Link communities reveal multi-scale complexity in networks. Nature pp. 761–764 (2010)
2. Balcan, D., Colizza, V., Goncalves, B., Hu, H., Ramasco, J.J., Vespignani, A.: Multiscale mobility networks and the spatial spreading of infectious diseases. Proceedings of the National Academy of Sciences of the United States of America (51), 21,484–21,489 (2009)
3. Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., Lakhal, L.: Mining minimal non-redundant association rules using frequent closed itemsets. Computational Logic pp. 972–986 (2000)
4. Benevenuto, F., Rodrigues, T., Cha, M., Almeida, V.A.F.: Characterizing user behavior in online social networks. Internet Measurement Conference pp. 49–62 (2009)
5. Berlingerio, M., Coscia, M., Giannotti, F.: Finding redundant and complementary communities in multidimensional networks. ACM Conference on Information and Knowledge Management pp. 2181–2184 (2011)
6. Berlingerio, M., Coscia, M., Giannotti, F., Monreale, A., Pedreschi, D.: Foundations of multidimensional network analysis. International Conference on Advances in Social Networks Analysis and Mining pp. 485–489 (2011)
7. Berlingerio, M., Coscia, M., Giannotti, F., Monreale, A., Pedreschi, D.: The pursuit of hubbiness: Analysis of hubs in large multidimensional networks. Journal of Computational Science (3), 223–237 (2011)
8. Berlingerio, M., Coscia, M., Giannotti, F., Monreale, A., Pedreschi, D.: Multidimensional networks: foundations of structural analysis. World Wide Web pp. 1–27 (2012)
9. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment (10), P10,008 (2008)
10. Bonchi, F., Giannotti, F., Mazzanti, A., Pedreschi, D.: Efficient breadth-first mining of frequent pattern with monotone constraints. Journal on Knowledge and Information Systems (2), 131–153 (2005)
11. Bonchi, F., Lucchese, C.: Pushing tougher constraints in frequent pattern mining. Advances in Knowledge Discovery and Data Mining, 173–202 (2005)
12. Borgelt, C.: Efficient implementations of apriori and eclat. IEEE ICDM Workshop on Frequent Item Set Mining Implementations p. 90 (2003)
13. Bringmann, B., Berlingerio, M., Bonchi, F., Gionis, A.: Learning and predicting the evolution of social networks. IEEE Intelligent Systems (4), 26–35 (2010)
14. Bringmann, B., Zimmermann, A.: The chosen few: On identifying valuable patterns. IEEE International Conference on Data Mining pp. 63–72 (2007)
15. Cai, D., Shao, Z., He, X., Yan, X., Han, J.: Community mining from multi-relational networks. The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases pp. 445–452 (2005)
16. Calabrese, F., Dahlem, D., Gerber, A., Paul, D., Chen, X., Rowland, J., Rath, C., Ratti, C.: The connected states of america: Quantifying social radii of influence. IEEE International Conference on Social Computing (SocialCom) (2011)
17. Cerf, L., Besson, J., Robardet, C., Boulicaut, J.F.: Closed patterns meet n-ary relations. ACM Transactions on Knowledge Discovery from Data (1), 1–36 (2009)
18. Cerf, L., Nguyen, T.B.N., Boulicaut, J.F.: Discovering Relevant Cross-Graph Cliques in Dynamic Networks. International Symposium on Methodologies for Intelligent Systems pp. 513–522 (2009)
19. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks. Physical Review E, 066,111 (2004)
20. Cook, D.J., Crandall, A.S., Singla, G., Thomas, B.: Detection of social interaction in smart spaces. Cybernetics and Systems (2), 90–104 (2010)
21. Fortunato, S.: Community detection in graphs. Physics Reports (3-5), 75–174 (2010)
22. Francisco, A.P., Baeza-Yates, R.A., Oliveira, A.L.: Clique analysis of query log graphs. String Processing and Information Retrieval pp. 188–199 (2008)
23. Goyal, A., Bonchi, F., Lakshmanan, L.V.: Discovering leaders from community actions. ACM Conference on Information and Knowledge Management pp. 499–508 (2008)
24. Gregory, S.: Finding overlapping communities using disjoint community detection algorithms. Complex Networks: CompleNet 2009 pp. 47–61 (2009)
25. Günnemann, S., Färber, I., Boden, B., Seidl, T.: Subspace clustering meets dense subgraph mining: A synthesis of two paradigms. IEEE International Conference on Data Mining pp. 845–850 (2010)
26. Huang, Y., Sun, L., Nie, J.Y.: Query model refinement using word graphs. ACM Conference on Information and Knowledge Management pp. 1453–1456 (2010)
27. Leskovec, J., Huttenlocher, D., Kleinberg, J.: Predicting positive and negative links in online social networks. ACM International Conference on World Wide Web pp. 641–650 (2010)
28. Leskovec, J., Lang, K.J., Mahoney, M.W.: Empirical comparison of algorithms for network community detection. ACM International Conference on World Wide Web pp. 631–640 (2010)
29. Mongiovì, M., Singh, A.K., Yan, X., Zong, B., Psounis, K.: Efficient multicasting for delay tolerant networks using graph indexing. IEEE International Conference on Computer Communications pp. 1386–1394 (2012)
30. Mougel, P.N., Rigotti, C., Gandrillon, O.: Finding collections of k-clique percolated components in attributed graphs. Advances in Knowledge Discovery and Data Mining pp. 181–192 (2012)
31. Mucha, P.J., Richardson, T., Macon, K., Porter, M.A., Onnela, J.: Community structure in time-dependent, multiscale, and multiplex networks. Science, 876–878 (2010)
32. Newman, M.E.J.: The structure and function of complex networks. SIAM Review (2), 167–256 (2003)
33. Nijssen, S., Jiménez, A., Guns, T.: Constraint-based pattern mining in multi-relational databases. Workshops of the IEEE International Conference on Data Mining pp. 1120–1127 (2011)
34. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature (7043), 814–818 (2005)
35. Papadimitriou, S., Sun, J., Faloutsos, C., Yu, P.S.: Hierarchical, parameter-free community discovery. The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases pp. 170–187 (2008)
36. Pass, G., Chowdhury, A., Torgeson, C.: A picture of search. ACM International Conference on Scalable Information Systems p. 1 (2006)
37. Pei, J., Han, J.: Can we push more constraints into frequent pattern mining? ACM International Conference on Knowledge Discovery and Data Mining pp. 350–354 (2000)
38. Pei, J., Han, J., Lakshmanan, L.V.S.: Mining frequent item sets with convertible constraints. IEEE International Conference on Data Engineering pp. 433–442 (2001)
39. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 036,106 (2007). URL doi:10.1103/PhysRevE.76.036106

40. Rossetti, G., Berlingerio, M., Giannotti, F.: Scalable link prediction on multidimensionalnetworks. Workshops of the IEEE International Conference on Data Mining pp. 979–986(2011)41. Rosvall, M., Bergstrom, C.T.: Maps of random walks on complex networks reveal com-munity structure. PNAS , 1118–1123 (2008)42. Sun, Y., Yu, Y., Han, J.: Ranking-based clustering of heterogeneous information net-works with star network schema. ACM International Conference on Knowledge Discov-ery and Data Mining pp. 797–806 (2009)43. Szell, M., Lambiotte, R., Thurner, S.: Multirelational organization of large-scale socialnetworks in an online world. PNAS107