Multi-attributed Community Search in Road-social Networks

Abstract

Given a location-based social network, how to find the communities that are highly relevant to query users and have top overall scores in multiple attributes according to user preferences? Typically, in the face of such a problem setting, we can model the network as a multi-attributed road-social network, in which each user is linked with location information and d (≥ 1) numerical attributes. In practice, user preferences (i.e., weights) are usually inherently uncertain and can only be estimated with bounded accuracy, because a human user is not able to designate exact values with absolute precision. Inspired by this, we introduce a normative community model suitable for multi-criteria decision making, called multi-attributed community (MAC), based on the concepts of k-core and a novel dominance relationship specific to preferences. Given uncertain user preferences, namely, an approximate representation of weights, the MAC search reports the exact communities for each of the possible weight settings. We devise an elegant index structure to maintain the dominance relationships, based on which two algorithms are developed to efficiently compute the top-j MACs. The efficiency and scalability of our algorithms and the effectiveness of the MAC model are demonstrated by extensive experiments on both real-world and synthetic road-social networks.



Fangda Guo†, Ye Yuan§, Guoren Wang§, Xiangguo Zhao†, Hao Sun†
† School of Computer Science and Engineering, Northeastern University, Shenyang, China
§ School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
{†fangda@stu, §yuanye@, †zhaoxiangguo@, †andynosh@stu}mail.neu.edu.cn, §[email protected]

I. INTRODUCTION

Numerous real-world networks (e.g., social networks) are made up of community structures, and discovering them is an essential problem in network analysis. Recently, community search, a query-dependent community discovery problem designed to find densely connected subgraphs containing query vertices, has drawn much attention among database professionals due to an ever-growing number of applications [1]–[8].

At the same time, with the prevalence of GPS-enabled mobile devices, location-based social networks (LBSNs) have become more diverse and complex in recent years (e.g., Facebook Places, Foursquare). Not only are users and friendships included, but each user is also often associated with various properties, such as location information and numerical attributes. The location information bridges the gap between the virtual and physical worlds, while the numerical attributes, obtained from user profiles or from statistical information derived by various network analytics (e.g., influence, similarity, etc.), characterize the user. For instance, in a scientific collaboration network such as Aminer, every author may have their own (spatial) position/address and several numerical attributes (e.g., h-index, [...] d (≥ 1) numerical attributes.

Given a multi-attributed road-social network, how can we identify the query-user-involved communities that are not dominated by other communities according to the user's preferences for the d numerical attributes? For instance, considering h-index and activeness in the Aminer network, how can we find a group of collaborators who are related and close to the query users, taking activeness in their research fields in recent years as the main criterion?
Considering that a top-j query receives a dataset of records with d-dimensional attributes and a weight vector w of d values assigning the relative importance of each dimension to the user as input, where the weighted sum of attribute values is used as the score of a record, we evolve this scoring method into a multi-attributed community model. Similarly, the weight vector w denotes the user preferences and is therefore crucial in generating useful recommendations for the community. Generally, w can be directly input by the user or mined from his/her past behavior or choices [9], [10].

Driven by the fact that weight accuracy plays a vital role in the applicability and practicality of top-j queries, we argue that the assumption that exact values of a weight vector are known is inherently inaccurate and almost unrealistic. To illustrate this, consider the case of manually assigning weights in the above example. Taking activeness as the main criterion, a user may specify weights 0.2 and 0.8 for h-index and activeness, respectively. At this point, leaving aside the participation of query users and spatial cohesiveness, the weighted sum of these two attribute values can be regarded as the influence (or importance) of the vertex in [4]. On the other hand, based on pure intuition, she may also specify the same weight per dimension (e.g., 0.5 each). The results may be similar to the skyline community [8], since the weights are not biased towards any dimension (as verified in our experiments, the skyline community [8] is usually contained in our results). However, it is impractical and unfair to require the user to specify weights with absolute precision, since even a slight variation in a weight (e.g., from 0.2 to 0.19; see Example 3) may remarkably change the results. On the contrary, it would be preferable to take user inputs as generic instructions and leave room for inaccuracies in the weight setting.
Similarly, a weight vector computed by preference learning techniques should serve only as a rough guide rather than an accurate expression of user preferences. Naturally, this issue can be dealt with by expanding the weight vector to a region and returning all promising results to the user, thus providing a practical and more user-centric design.

Prior work on the community search problem has never incorporated the uncertainty of the weight vector, so existing community search algorithms are unable to answer the above questions. To adequately characterize such interesting communities w.r.t. user preferences, we propose a novel community model called multi-attributed community (MAC), based on the concepts of k-core [12] and r-dominance (a variant of the traditional sense of dominance) [13]. An MAC is a maximal connected k-core that contains the query vertices, satisfies spatial cohesiveness, and is not r-dominated by other connected k-cores in terms of the d-dimensional attributes w.r.t. a region of interest R (see Section II for the detailed definition). Importantly, since the scores of communities all depend on w, they are necessarily correlated and vary together as w freely lies inside region R. As a result, the output is a partitioning of R, in which each partition is associated with the MACs obtained when w falls anywhere in that partition. The MAC model is also applicable to many interesting applications, some of which are introduced below.

Personalized optimum community search.

In daily life, people always want to discover the optimum community based on different needs. For example, a coach hopes to reorganize the school basketball team around certain players (as query users) to improve offense. In this application, we can limit the query scope to the school and extract three numerical attributes for each player: points, rebounds and assists. By setting the region of user preferences, we can obtain the corresponding communities that are not r-dominated by the others. Similar queries may also help organizations to analyze customer orientation or perform marketing/promotion activities.

Cohesive groups discovery in LBSNs.

In some cases, one may wish to circle the target range by finding cohesive groups to achieve identification, such as COVID-19 precaution and suspect investigation. Given several confirmed cases, possible cases are likely to be within a certain range of them (providing the possibility of close contact), and the Jaccard similarity (e.g., hobbies, interests) and influence (e.g.,

Contributions.

Efficient solutions are formulated and provided to find multi-attributed communities in road-social networks. Below, we summarize the contributions of this paper.

[Fig. 1. Example of road-social network: (a) social network ($G_s$); (b) road network ($G_r$).]

Footnote: There are already preference learning techniques (e.g., [11]) to generate such a region instead of a specific weight vector.

Novel community model.

The MAC model is proposed in road-social networks and can be used for finding communities not r-dominated by others. To the best of our knowledge, our model is the first to incorporate the uncertainty of weight vectors, and our work is also the first to introduce a dominance relationship specific to preferences for community modeling.

New algorithms.

An efficient global search algorithm based on DFS is first developed to find the top-j MACs for each of the possible weight settings in user preferences. The time and space complexity of the DFS-based algorithm are bounded by O(n′ d) and O(m′ + n′ + n′ · d), respectively, where n′ and m′ denote the number of vertices and edges in the maximal (k, t)-core of a road-social network. To further accelerate the search, we propose a more efficient local search framework with two striking features: (1) its time complexity is much lower than that of the global search (at least one order of magnitude faster in practice), and it can find all non-contained MACs in most cases; and (2) it enables the MACs to be output progressively during execution, which is very beneficial to applications expecting only part of the MACs.

Extensive experiments.

To demonstrate the high efficiency and scalability of our proposed algorithms and to evaluate the effectiveness of the MAC model, we conduct extensive experiments and a comprehensive case study on both real-world and synthetic road-social networks, in which many significant and interesting communities can be discovered.

II. PROBLEM STATEMENT

In this section, we formally introduce the multi-attributed community in road-social networks and its search problems. Table I summarizes the notations used throughout this paper.

A. Preliminaries

Road network.

We model a road network as an undirected weighted graph $G_r = (V_r, E_r)$, where $V_r$ (resp. $E_r$) is the set of vertices (resp. edges). A vertex $u \in V_r$ represents a road intersection/end. An edge $(u, v) \in E_r$ represents a road segment that can be traveled between vertices u and v, and is associated with a non-negative weight $\omega(u, v)$ that represents its cost (e.g., distance/travel time). Let p be a spatial point lying on edge (u, v), and let $\omega(u, p)$ be proportional to the length from u to p. By $dist(p, p')$ we refer to the network distance between points p and p′ in $G_r$, which is the sum of edge weights along the least costly (i.e., shortest) path from p to p′.

TABLE I
SUMMARY OF NOTATIONS

Notation | Description
Q, j | query vertices and number of MACs to select
k, t | coreness threshold and query distance threshold
d, R | dimensionality of attributes and preference domain
$dg_H(v)$, $\delta(H)$ | degree of vertex v in H and minimum degree of H
N(v, H) | neighbor set of vertex v in subgraph H
L(v) | mapping of user v's location (i.e., spatial point) in $G_r$
X(v) | d-dimensional vector with real values of user v
$D_Q(H)$, S(H) | query distance and score of subgraph H, resp.
$H_k^t$, $G_d$ | the maximal (k, t)-core and the r-dominance graph
$G_e$, $G_c$ | subgraphs of $G_d$ induced by $V_H$ and $V_{G_d} \setminus V_H$, resp.
$l_b(G_e)$, $l_t(G_c)$ | vertices in the bottom layer of $G_e$ and top layer of $G_c$

Social network.

We model a social network with d numerical attributes as an undirected graph $G_s = (V_s, E_s, L, X)$, where $V_s$ ($|V_s| = n$) and $E_s$ ($|E_s| = m$) denote the sets of vertices (i.e., users) and edges (i.e., social relations) respectively, and L and X ($|X| = n$) are the sets of mappings and d-dimensional vectors defined on $V_s$ respectively. Specifically, for each vertex $v \in V_s$, L(v) provides a mapping of the user's location (i.e., spatial point) in the road network; v is also associated with a d-dimensional vector with real values, denoted by $X(v) = (x^v_1, \ldots, x^v_d)$, where $X(v) \in X$ and $x^v_i \in \mathbb{R}$.

Road-social network.

A road-social network is a pair of graphs, denoted by $(G_r, G_s)$, where $G_r$ is a road network and $G_s$ is a social network. Each user $u_s \in G_s$ is associated with a spatial point p in $G_r$, i.e., $L(u_s) = p$, indicating the current location of $u_s$. We assume that p can be either on a vertex or on an edge of $G_r$.

For example, Fig. 1 displays a road-social network, where vertices represent users and road junctions, and edges represent social relations and road segments, respectively. For simplicity, the location of user $v_i$ in Fig. 1(a) is on the vertex $r_i$ of the road network in Fig. 1(b). Fig. 2(a) shows the values of some of the vertices in three different dimensions.

B. (k, t)-Core

We introduce a novel densely connected substructure, called (k, t)-core, by focusing on the following structural cohesiveness and communication cost.

Structural cohesiveness.

In order to model the structural cohesiveness, the generally used k-core model is adopted to indicate the communities [1], [2], [4], [12], [14]. In particular, by $dg_{G_s}(v)$ we refer to the degree of vertex v in social network $G_s$. Let $H = (V_H, E_H, L_H, X_H)$ be an induced subgraph of $G_s$, where $V_H \subseteq V_s$, $E_H = \{(u, v) \mid u, v \in V_H, (u, v) \in E_s\}$, $L_H = \{L(v) \mid v \in V_H\}$ and $X_H = \{X(v) \mid v \in V_H\}$.

Definition 1: (k-core.) Given a graph $G_s$ and an integer k, H is a k-core of $G_s$ if each vertex $v \in V_H$ has a degree of at least k, i.e., $dg_H(v) \geq k$.

The maximal k-core is the one for which no super k-core containing it exists. We refer to the maximal k in all k-cores containing vertex $v \in V_s$ as the core number of v. In order to avoid confusion, we denote a connected k-core as a $k$-$\widehat{\text{core}}$, since the maximal k-core is not necessarily connected.

Remarks.

Although we use k-core as the structural cohesiveness metric, our techniques can also be applied to other criteria such as k-clique [15] and k-truss [3].

Communication cost.

For two users $u, v \in G_s$, the length of the shortest path between their locations in $G_r$ is denoted by $dist(L(u), L(v))$, which is equal to $+\infty$ if L(u) and L(v) are not connected. To model the communication cost in $G_r$, we utilize the notion of query distance below.

Definition 2: (Query Distance.) Given a graph H and query vertices $Q \subseteq V_H$, $\forall q \in Q$, the query distance of $v \in V_H$ is the maximum length of the shortest path from $L_H(v)$ to $L_H(q)$ in $G_r$, denoted by $D_Q(v) = \max_{q \in Q} dist(L(v), L(q))$; the query distance of H is defined as $D_Q(H) = \max_{u \in V_H} D_Q(u) = \max_{u \in V_H, q \in Q} dist(L(u), L(q))$.

By the query distance $D_Q(H)$, the communication cost between query vertices Q and the members of H can be measured. In general, a good community is considered to have a low communication cost, i.e., a small $D_Q(H)$. Consider the query vertices Q = {v, v, v} in Fig. 1. The query distance of v is $D_Q(v)$ = dist(r, r) = 7. The query distance of the subgraph induced by {v, v, v, v} is equal to dist(r, r) = 9. In the following, we propose a new notion of (k, t)-core, by adapting the concepts of k-core and query distance, to capture dense structural cohesiveness and low communication cost.

Definition 3: ((k, t)-core.) Given graphs $(G_r, G_s)$, query vertices Q, and numbers k and t, H is a (k, t)-core iff H is a $k$-$\widehat{\text{core}}$ of $G_s$ containing Q and $D_Q(H) \leq t$ in $G_r$.

For a (k, t)-core, its structural cohesiveness increases with k, while its proximity to the query vertices decreases with t. For instance, in Fig. 1, the subgraph induced by {v, v, v, v} for Q = {v, v, v} is a (k, t)-core with k = 3 and t = 9.

C. Score of Multi-Attributes

Note that generalizing the existing community models, most of which handle only a 1-dimensional attribute such as influence [4], Euclidean distance [6] or keyword similarity [5], [7], [16], [17], to a comparable one with multi-dimensional attributes is nontrivial. Unlike the above, comparing two communities becomes quite tough if either can have d (d > 1) values due to d different dimensions. On the other hand, the existing multi-dimensional model [8] cannot compare the pros and cons of any skyline communities and make trade-offs based on user preferences. To overcome the above issues, we introduce the r-dominance relationship between two communities, which will be used to define our multi-attributed community model.

As each vertex $v \in V_s$ is associated with a vector X(v), the attributes define a d-dimensional data domain. We assume that for each attribute a higher value is preferable, and that a spatial index (e.g., R-tree [18]) is used to organize the vector set X.

Referring to the traditional top-j queries, the score of a vertex v w.r.t. X(v) can also be derived by inputting a weight vector $w = (w_1, w_2, \ldots, w_d)$ as

$$S(v) = \sum_{i=1}^{d} w_i \cdot x^v_i. \tag{1}$$

[Fig. 2. Numerical attributes and MACs in R: (a) 3-dimensional vectors; (b) preference domain and MACs.]

Thus, the top-j results consist of the j vertices with the highest scores. We assume w.l.o.g. that $w_i \in (0, 1)$ for $\forall i \in [1, d]$ and $\sum_{i=1}^{d} w_i = 1$. Such conditions do not restrain user preferences, because score ranking depends only on the direction of w instead of its magnitude [19]; but they allow w to drop one weight (i.e., $w_d = 1 - \sum_{i=1}^{d-1} w_i$), thereby mapping the domain of w to a (d − 1)-dimensional space, named the preference domain.
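As a minimal sketch of Eq. (1) and of the reduced weight representation (not the authors' implementation), the snippet below expands a (d − 1)-dimensional weight vector to its full d-dimensional form and computes a vertex score. The function names `expand_weights` and `vertex_score` are hypothetical, and the attribute values in the usage note are made up for illustration.

```python
from typing import List, Sequence


def expand_weights(w_reduced: Sequence[float]) -> List[float]:
    """Recover the full d-dimensional weight vector from its (d-1)-dimensional form.

    The dropped weight is w_d = 1 - sum(w_1..w_{d-1}); all weights must be positive.
    """
    last = 1.0 - sum(w_reduced)
    if last <= 0 or any(w <= 0 for w in w_reduced):
        raise ValueError("weights must be positive and sum to less than 1")
    return [*w_reduced, last]


def vertex_score(x: Sequence[float], w_reduced: Sequence[float]) -> float:
    """Weighted-sum score S(v) of Eq. (1) for attribute vector x."""
    w = expand_weights(w_reduced)
    if len(w) != len(x):
        raise ValueError("expected len(w_reduced) == len(x) - 1")
    return sum(wi * xi for wi, xi in zip(w, x))
```

For a hypothetical 3-dimensional vertex with attributes (2, 4, 6) and reduced weights (0.2, 0.3), `expand_weights` yields a third weight of 0.5 and `vertex_score` evaluates 0.2·2 + 0.3·4 + 0.5·6.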
This dimensionality reduction is critical [20], since the dimensionality directly determines the processing time of the costliest operations in our techniques (see Section V-B). In the following, w refers to the (d − 1)-dimensional form of the weight vector. For example, given weights 0.2 and 0.3 for $x_1$ and $x_2$ respectively in Fig. 2(a), we have $w_3 = 0.5$, and S(v) = 4.[...].

Community score.

Let $H = (V_H, E_H, L_H, X_H)$ be an induced subgraph of $G_s$. Given a weight vector w, we define the score of H as

$$S(H) = \min_{v \in V_H} \{S(v)\}. \tag{2}$$

Here, we briefly discuss why the "min" operator is used in Eq. (2) to define S(H). The intention is the same as that in [4]. By using it, all the members of H are ensured to have a score in terms of the d dimensions no less than S(H). In other words, if S(H) is large, vertices in H may not have large values in each dimension, but the weighted sum of their attribute values must be large. For instance, consider the previous case of taking activeness as the main criterion (e.g., weight 0.8) in the collaboration network. Obviously, we can apply the S(H) defined above to quantify the overall quality, focusing on activeness, of a group of collaborators.

Definition 4: (r-dominance.) Given a region R in the preference domain, a community H r-dominates another community H′ when $S(H) \geq S(H')$ for any weight vector in R, denoted by $H \succ H'$; and $v \succ v'$ denotes that vertex v r-dominates another vertex v′.

Intuitively, in the traditional sense of dominance [21], [22], the dominator is superior to the dominee for any weight vector w, while r-dominance is specific to the weight vectors in R. Although a community may not dominate another based on the skyline community model [8], it might always have a higher score as w is bounded in R. For ease of presentation, we assume that R is an axis-parallel hyper-rectangle, but our techniques are directly applicable to general convex polytopes.

For example, Fig. 2(b) illustrates a preference domain (i.e., the domain of the weight vector) with d = 3, where axis $w_1$ (resp. $w_2$) corresponds to the weight for $x_1$ (resp. $x_2$) on a scale of 0 to 1, and the region R is a convex polygon (i.e., an axis-parallel rectangle [0. , . ] × [0. , . ]) representing the expanded or approximated user preferences.

D. The MAC Problem

Combining the notion of (k, t)-core and community score, we define the multi-attributed community (MAC) as follows.

Definition 5: (MAC.) Given graphs $(G_r, G_s)$, query vertices $Q \subseteq V_s$, two numbers k and t, and a region of interest R, H is a multi-attributed community in the road-social network if H satisfies the following conditions:
(1) H is a (k, t)-core containing Q.
(2) There does not exist an induced subgraph H′ ($V_{H'} \supset V_H$) such that H′ is a (k, t)-core and $H' \succ H$.

In terms of structural cohesiveness and communication cost, condition (1) requires not only that the community containing the query vertices Q is densely connected, but also that each vertex is spatially close to Q. In terms of maximality and multi-attributed community score, condition (2) ensures that the community is maximal and has the greatest score. The following example illustrates Definition 5.

Example 1:

Consider $(G_r, G_s)$, the numerical attributes and a region R in Fig. 1 and Fig. 2, respectively. Suppose, for instance, that Q = {v}, k = 2 and t = 9. By Definition 5, the subgraph induced by {v, v, v, v, v} is an MAC with community score equal to S(v) if w lies anywhere in the upper-left part of R (divided by the dotted line), as it meets both conditions. Note that the subgraph induced by {v, v, v, v} is not an MAC. This is because it is contained in the MAC induced by {v, v, v, v, v} with the same community score, and thus fails to satisfy condition (2). □

For most practical applications, we typically tend to focus on the query-user-involved communities that score higher than (i.e., r-dominate) all other communities in the preference domain R. In this paper, we aim to efficiently discover such communities in road-social networks. Below, two multi-attributed community search problems are formulated.

Problem 1.

Given graphs $(G_r, G_s)$, query vertices $Q \subseteq V_s$, two numbers k and t, and a region of interest R, the problem is to find the top-j MACs with the highest community scores for each possible weight vector in R. Although there are infinitely many possible weight vectors in R, a partitioning of R can form the output, in which each partition is associated with the top-j MACs obtained when w falls anywhere in that partition.

For Problem 1, an MAC may be contained in another MAC in the top-j results.

Example 2:

Assume that Q = {v, v, v}, k = 3 and t = 9 in Fig. 1. The top-2 MACs are subgraphs H and H induced by {v, v, v, v} and {v, . . . , v} for any weight vector in R, respectively. S(H) = S(v), and S(H) = S(v) or S(v) on either side of the dotted line. Clearly, H contains H. □

To eliminate the containment relations in the top-j results, we study another problem of finding the non-contained MAC.

Definition 6: (non-contained MAC.) An MAC H is a non-contained MAC if there does not exist an induced subgraph H′ ($V_{H'} \subset V_H$) such that H′ is a (k, t)-core and $H' \succ H$.

The following example illustrates Definition 6.

Example 3:

Let us reconsider Example 2. By Definition 6, subgraphs H and H (induced by {v, . . . , v}) are the non-contained MACs for any weight vector in R and in R ∪ R, respectively. For instance, H (resp. H) is the top- result when w = (0. , . ) (resp. w = (0. , . )). However, H is not a non-contained MAC as it contains H and $H \succ H$ for any weight vector in R. □

Problem 2.

Given graphs $(G_r, G_s)$, query vertices $Q \subseteq V_s$, two numbers k and t, and a region of interest R, the problem is to find the non-contained MAC for every possible weight vector in its corresponding partitioning of R.

Discussions.

Another three possible operators, "max", "sum" and "avg", are not appropriate for the community score. The first two are monotonic w.r.t. the size of the community, that is, a community scores higher than its sub-communities. Hence, the answer is always the maximal (k, t)-core, which is independent of the numerical attributes X and the region R. The last one may cause outliers in the answer, e.g., only a few vertices score very high while the rest score low, resulting in a higher community score. Obviously, this is not an ideal community.

Challenges.

Solving the above two problems faces three major challenges. First, the number of $k$-$\widehat{\text{core}}$s containing Q in a multi-attributed network $G_s$ can be exponentially large (even regardless of the query distance in $G_r$). Thus, enumerating all the $k$-$\widehat{\text{core}}$s to identify the MACs is intractable. Second, unlike traditional top-j queries [23], the score of a community may vary greatly in different parts of R, making it nontrivial to draw conclusions about r-dominance relationships between communities. Third, the MAC model enables a more flexible way to express user preferences in the community search problem, which means that inherent inaccuracies in weight specification need to be taken into account. In consequence, devising an efficient algorithm to detect the MACs without enumerating all the (k, t)-cores is challenging. To overcome these challenges, we will develop efficient algorithms in the following sections.

III. WARMING UP FOR OUR METHODS

According to Definition 5, regardless of maximality and community score, the multi-attributed communities have to satisfy the constraints of structural cohesiveness and communication cost. Thus, we give two useful lemmas as follows.

Lemma 1:

For a number t, the vertices of $G_s$ with query distance greater than t in $G_r$ cannot exist in any MAC.

Lemma 2:

For an integer k, the MACs must be contained in the maximal $k$-$\widehat{\text{core}}$ containing Q.

The correctness of the above lemmas can be verified by Definition 2 and the maximality of k-core, resulting in Lemma 3.

Definition 7: (Maximal (k, t)-core.) For graphs $(G_r, G_s)$ and query vertices Q, the maximal (k, t)-core is a (k, t)-core such that no super (k, t)-core contains it, denoted by $H_k^t$.

[Fig. 3. Cases of r-dominance for vertices v and v′: (a) v r-dominates; (b) r-incomparable; (c) v r-dominated.]

Lemma 3:

For two numbers k and t, the MACs must be contained in the maximal (k, t)-core.

Referring to Lemma 3, for a given k and t, we first filter out the vertices of $G_s$ that do not satisfy the query distance threshold t by a range query in $G_r$, which can be accelerated by G-tree [24] or G*-tree [25], to obtain the connected subgraph $G'_s$ induced by the remaining vertices. Next, we perform k-core decomposition [14] on the filtered social subgraph $G'_s$ to compute the maximal $k$-$\widehat{\text{core}}$ containing Q. It is worth noting that we employ the upper bound of coreness [2] before core decomposition. If k is larger than $\lfloor (1 + \sqrt{9 + 8(|E'_s| - |V'_s|)})/2 \rfloor$, we immediately know that there is no $k$-$\widehat{\text{core}}$ w.r.t. Q. So far, the maximal (k, t)-core has been found, such that the MACs can be obtained through in-depth computation. For example, the maximal (3, 9)-core, i.e., $H_3^9$, for Q = {v, v, v} is the subgraph induced by {v, . . . , v}, as shown in Fig. 4(a).

After abandoning the scheme of enumerating all the (k, t)-cores, whose number can be exponentially large even in $H_k^t$, we may intuitively think of iteratively deleting the smallest-score vertex w.r.t. the d-dimensional attributes until the resulting graph does not have a $k$-$\widehat{\text{core}}$ containing Q. However, we then face another problem: which vertex has the smallest score? To address this issue, we design an effective data structure and construction algorithm to preserve r-dominance relationships between vertices.

IV. R-DOMINANCE GRAPH

In this section, we exploit the r-dominance graph to preserve pair-wise r-dominance relationships between vertices in $H_k^t$, which will be used in our proposed search algorithms.

A. R-Dominance Test

Consider two vertices v and v′ where neither dominates the other in terms of traditional dominance, so that no reliable conclusion can be drawn about which vertex ranks higher. Nevertheless, given a preference domain, the inequality $S(v) \geq S(v')$ (resp. equation $S(v) = S(v')$) corresponds to a half-space (resp. hyperplane), for which there are three different cases regarding its positioning against R [13]. Specifically, in Fig. 3(a), v r-dominates v′ since the half-space $HS: S(v) \geq S(v')$ completely covers R, which means that v scores higher for $\forall w \in R$; the case in Fig. 3(c) is symmetric. In Fig. 3(b), v scores higher in one part of R but lower in another, which is called r-incomparable as neither r-dominates the other.

Clearly, the cases in Fig. 3(a) and 3(c) allow r-dominance conclusions to be safely drawn. In this way, we can determine r-dominance by detecting whether the polygon vertices defining R fall into the half-space $HS: S(v) \geq S(v')$. If all of them do (resp. none of them does), v r-dominates (resp. is r-dominated by) v′. Otherwise, v and v′ are r-incomparable. The inclusion check of each polygon vertex costs O(d), so the r-dominance test requires O(pd) in total, where p is the number of polygon vertices defining R.

[Fig. 4. The maximal (k, t)-core and r-dominance graph: (a) H for Q = {v, v, v}; (b) DAG $G_d$.]

B. Pair-Wise R-Dominance Relationship
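The r-dominance test of Section IV-A, which the pair-wise computation in this subsection applies repeatedly, can be sketched as follows. This is an illustrative sketch rather than the authors' code: it assumes that R is given by the list of its polygon vertices in the (d − 1)-dimensional preference domain, and the names `score_gap` and `r_dominance` as well as the tolerance `eps` are hypothetical. Because S(v) − S(v′) is affine in w, its sign over the convex region R is determined by its values at the polygon vertices.

```python
from typing import Sequence


def score_gap(xv: Sequence[float], xu: Sequence[float],
              w_reduced: Sequence[float]) -> float:
    """S(v) - S(v') at a reduced weight vector, using w_d = 1 - sum(w_1..w_{d-1})."""
    delta = [a - b for a, b in zip(xv, xu)]
    last = delta[-1]
    # O(d) work per polygon vertex.
    return last + sum(w * (di - last) for w, di in zip(w_reduced, delta[:-1]))


def r_dominance(xv: Sequence[float], xu: Sequence[float],
                region_vertices: Sequence[Sequence[float]],
                eps: float = 1e-12) -> str:
    """Classify v against v' over R; O(p*d) for p polygon vertices."""
    gaps = [score_gap(xv, xu, w) for w in region_vertices]
    if all(g >= -eps for g in gaps):
        # Ties everywhere also land here, since Definition 4 uses ">=".
        return "dominates"
    if all(g <= eps for g in gaps):
        return "dominated"
    return "incomparable"
```

With a hypothetical rectangle R = [0.1, 0.3] × [0.2, 0.4], the vectors (10, 0, 0) and (0, 0, 5) are r-incomparable: the gap is negative at the corner (0.1, 0.2) and positive at (0.3, 0.4).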

The computation of r-dominance relationships is somewhat similar to that of the k-skyband (i.e., BBS [26]), but differs as follows. (1) Rather than traditional dominance, we employ r-dominance and apply its test described in Section IV-A both for vertex-to-vertex and vertex-to-MBB (i.e., minimum bounding box) dominance testing. (2) Due to the fact that w is bounded in R, we adopt a unique sorting key for R-tree nodes (represented by the MBB's upper-right corner) and vertices in the heap to accelerate search convergence by leading it to r-dominate as many members as possible first. (3) We preserve the r-dominance relationships between all vertices in $H_k^t$ instead of only the top-j layers. It is worth noting that a max-heap is utilized in our adapted BBS, and its sorting key is the score of the R-tree node/vertex w.r.t. a pivot vector of R, whose value in each dimension is the mean of the polygon vertices of R in that dimension [13]. The correctness of our adaptation is guaranteed as follows. (1) The pivot vector must lie in R due to R's convexity [27]. (2) Vertices popped after v cannot r-dominate v due to pivot-based sorting (in decreasing order).

In addition, we adopt a directed acyclic graph (DAG) [22], [28] to maintain all pair-wise r-dominance relationships between vertices in $H_k^t$, named the r-dominance graph (denoted by $G_d$). Fig. 4(b) illustrates $G_d$ of H. An arc from vertex v to v′ signifies that v r-dominates v′. It is worth noting that an arc from v or v to v is not needed, as the transitivity of the r-dominance relationship already implies this. The number of vertices that r-dominate v is called v's r-dominance count.

V. GLOBAL SEARCH

In this section, we develop a global search algorithm forProblem 2 and discuss its generalization for Problem 1. Beforeproceeding further, three useful lemmas are given as follows.

Lemma 4: The maximal (k, t)-core, i.e., H_k^t, is an MAC.

Lemma 5: For any MAC, if we delete the smallest-score vertex w.r.t. any weight w in R and the resulting subgraph still has a k-ĉore H containing Q, H is an MAC.

Lemma 6: For any MAC H, if we delete the smallest-score vertex w.r.t. any weight w in R but the resulting subgraph does not have a k-ĉore containing Q, H is a non-contained MAC.

Algorithm 1: DFS-based algorithm
  Input: G_r, G_s, Q, k, t, R
  Output: The top-j MACs
   1  G′_s ← filter out vertices by RangeQuery(Q, t);
   2  H_k^t ← k-core decomposition on G′_s;
   3  G_d ← build r-dominance graph of H_k^t;
   4  Queue U ← ∅; Heap I ← ∅; U.push(H_k^t, G_d, R, I);
   5  while U ≠ ∅ do
   6      (H, G′_d, ρ, I′) ← U.pop();
   7      HP ← compute/locate new hyperplanes via leaf nodes of G′_d;
   8      foreach hp ∈ HP (if exists) do
   9          Sub-partitions S ← Partition(ρ, hp);
  10      foreach ρ′ ∈ S do
  11          u ← find the smallest-score vertex (if ∉ Q);
  12          DFS(u, H, G′_d, I′);  // Consider condition (2) in Corollary 1
  13          if Corollary 1 holds then output top-j MACs of ρ′;
  14          else U.push(H, G′_d, ρ′, I′);
  15  Procedure DFS(u, H, G′_d, I)
  16      foreach v ∈ N(u, H) do
  17          Delete edge (u, v) from H;
  18          if dg_H(v) < k then
  19              DFS(v, H, G′_d, I);
  20      I.push(u); delete u from H and G′_d (with incident edges);

In view of the above lemmas, we can devise an efficient algorithm based on depth-first search (DFS) for our problems.

A. The DFS-based Algorithm

The idea of the DFS-based algorithm is described in detail in Algorithm 1. First, for given k and t, we compute the maximal (k, t)-core, i.e., H_k^t, and build the r-dominance graph G_d. Then, the following procedure is iteratively invoked until the resulting graph in each partition of R does not have a k-ĉore containing Q. The procedure consists of two steps. Let H and G′_d be the resulting subgraph and corresponding r-dominance graph in partition ρ of R (Line 6). The first step is to insert sub-partitions into ρ according to G′_d (Lines 7-9), then find the smallest-score vertex in each sub-partition (Lines 10-11). In Line 9, note that ρ is the root node of a binary tree after being passed in as a parameter, and S is a set of leaf nodes of the binary tree, representing sub-partitions of ρ. This step is essential and will be elaborated in Section V-B. The second step is to delete all the vertices that are definitely excluded from subsequent MACs, which enables H and G′_d to be updated accordingly (Lines 15-20).

In particular, Algorithm 1 recursively deletes all the vertices violating the structural cohesiveness constraint by a DFS procedure (Lines 16-19) as soon as the smallest-score vertex u is found in the sub-partition. The reason is that the degrees of all vertices adjacent to u decrease by 1 when we delete u. This may cause some neighbors of u to violate the structural cohesiveness constraint, so that they cannot be contained in subsequent MACs. Likewise, we also need to verify whether the neighbors at further hops (e.g., 2-hop, 3-hop, etc.) meet the structural cohesiveness constraint. Obviously, the DFS procedure can be used to identify and delete all these vertices.

According to Lemma 6, we have a corollary as follows.

Corollary 1: Given an MAC H, H is a non-contained MAC if the smallest-score vertex u meets one of the following conditions: (1) u ∈ Q; and (2) deleting u will recursively disconnect Q (e.g., ∃q ∈ Q being deleted) or make the degree of remaining vertices less than k.

Fig. 5. Arrangement and partitions in R. (a) Partitioning R; (b) partitioning two partitions of (a).

Note that we always consider the early termination conditions of Corollary 1 in conjunction with the DFS procedure. Once either is met, vertex u cannot be deleted even though it currently has the smallest score, because H is already the non-contained MAC w.r.t. partition ρ′ (Line 12). As a result, if Corollary 1 (i.e., Lemma 6) holds, the top-j MACs can be obtained by the union of top vertices in heap I′ (backtracking j − 1 times in total) and the last subgraph H (Line 13). Based on Lemma 4 and 5, we can easily verify that H for each partition ρ recursively obtained in Line 6 is an MAC. Thus, Theorem 1 shows the correctness of Algorithm 1.

Theorem 1: Algorithm 1 correctly finds the top-j MACs.

Proof Sketch: For any partition ρ, as long as its current subgraph H does not satisfy Corollary 1, ρ will be divided into |S| sub-partitions by |HP| hyperplanes (Line 10 in Algorithm 1), e.g., ρ_i for 1 ≤ i ≤ |S|. Here, ρ is discarded when the recursion proceeds to the promising sub-partitions of the local arrangement, i.e., ρ_i. Assume that u is the smallest-score vertex in H w.r.t. ρ_i. The resulting subgraph, denoted by H_i, obtained by invoking the DFS procedure (Line 12 in Algorithm 1) can be shown by contradiction to be the maximal k-ĉore of the subgraph H \ u, because the DFS procedure recursively deletes all the vertices in H whose degrees are smaller than k. Therefore, we have V_{H_i} ⊂ V_H and S(H) ≤ S(H_i) w.r.t. ρ_i for 1 ≤ i ≤ |S|. In this way, the non-contained MAC corresponding to each final sub-partition can be found, as well as all the MACs. Thus, we conclude that Algorithm 1 correctly finds the top-j MACs. □
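The cascading deletion that the proof relies on (removing the smallest-score vertex u and then every vertex whose degree falls below k) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it uses an explicit worklist instead of recursion, works on a copy of an adjacency-set dictionary, aborts as soon as a query vertex would be removed, and does not check whether Q stays connected. The name `delete_with_cascade` and its signature are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, List, Optional, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def delete_with_cascade(adj: Graph, u: Hashable, k: int,
                        query: FrozenSet[Hashable] = frozenset()
                        ) -> Optional[Tuple[Graph, List[Hashable]]]:
    """Delete u, then every vertex whose degree drops below k.

    Returns (remaining graph, vertices in removal order), or None if the
    deletion would remove a query vertex (early termination in the
    spirit of Corollary 1). The input graph is assumed to be a k-core.
    """
    if u in query:
        return None
    g = {x: set(nbrs) for x, nbrs in adj.items()}   # work on a copy
    removed: List[Hashable] = []
    pending = {u}          # vertices already scheduled for deletion
    stack = [u]
    while stack:
        x = stack.pop()
        for y in g.pop(x):             # drop x together with its edges
            g[y].discard(x)
            if y not in pending and len(g[y]) < k:
                if y in query:
                    return None
                pending.add(y)
                stack.append(y)
        removed.append(x)              # plays the role of I.push(x)
    return g, removed


# A 4-clique {1, 2, 3, 4} with two extra vertices 5 and 6 attached.
adj = {1: {2, 3, 4, 5, 6}, 2: {1, 3, 4, 5}, 3: {1, 2, 4},
       4: {1, 2, 3}, 5: {1, 2, 6}, 6: {1, 5}}
remaining, order = delete_with_cascade(adj, 5, k=2)
print(sorted(remaining), order)
```

In the example, deleting vertex 5 leaves vertex 6 with a single neighbor, so it is removed as well and only the clique survives.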

We analyze the complexity of Algorithm 1 in Theorem 2.

Theorem 2: The time complexity and space complexity of Algorithm 1 are bounded by O(n′^{2d}) and O(m′ + n′ + n′^2 · d) respectively, where n′ and m′ denote the number of vertices and edges in H_k^t.

Proof: The key factor determining Algorithm 1's time complexity is the construction of arrangements. In the worst case, vertices in H_k^t are r-incomparable to each other, i.e., the complete arrangement of n′(n′ − 1)/2 half-spaces needs to be constructed, in O(n′^{2d}) time [29]. The algorithm only needs to store H_k^t, and maintains the heap I′ and half-space information related to d, which uses less than O(m′ + n′ + n′^2 · d) space even in the worst case. □

Algorithm 2: Partition
  Input: Node ρ, hyperplane hp
  Output: Leaf nodes of the binary tree of half-space arrangements
  if ρ ∩ hp− = ∅ then
      ρ is covered by hp+;
  else if ρ ∩ hp+ = ∅ then
      ρ is covered by hp−;
  else if ρ's child is NULL then
      Insert ρ.left ← ρ ∩ hp− and ρ.right ← ρ ∩ hp+;
  else
      Partition(ρ.left, hp); Partition(ρ.right, hp);

B. Arrangement Jointing and Indexing
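To make the binary-tree index of Algorithm 2 concrete, the following sketch covers only the simplest case, under the assumption that d = 2 and weights have the form (w, 1 − w), so that R is an interval of w and each hyperplane S(u′) = S(u) reduces to a single cut point. The `Node` class and the helper `leaves` are hypothetical names; for general d the nodes would store convex polytopes rather than intervals.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A cell of the arrangement; here an interval [lo, hi] of w."""
    lo: float
    hi: float
    left: Optional["Node"] = None    # part of the cell inside hp-
    right: Optional["Node"] = None   # part of the cell inside hp+


def partition(node: Node, cut: float) -> None:
    """Insert the hyperplane w = cut into the subtree rooted at node."""
    if cut <= node.lo or cut >= node.hi:
        return                       # cell lies entirely in one half-space
    if node.left is None:            # leaf cell crossed by the hyperplane
        node.left = Node(node.lo, cut)
        node.right = Node(cut, node.hi)
    else:                            # push the hyperplane down to the leaves
        partition(node.left, cut)
        partition(node.right, cut)


def leaves(node: Node) -> List[Tuple[float, float]]:
    """Current sub-partitions, i.e. the leaf cells from left to right."""
    if node.left is None:
        return [(node.lo, node.hi)]
    return leaves(node.left) + leaves(node.right)


root = Node(0.2, 0.6)                # R as an interval of w
for cut in (0.4, 0.3, 0.9):          # 0.9 lies outside R and splits nothing
    partition(root, cut)
print(leaves(root))
```

As in Algorithm 2, a hyperplane that misses a cell leaves it untouched, and only leaf cells that are actually crossed are split.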

In Algorithm 1, to find the smallest-score vertex for any weight vector in partition ρ, we consider the vertices of G′_d in a bottom-up manner. In other words, leaf vertices of the r-dominance graph are preferred (Line 7). The reason is that a vertex is deleted either because it does not satisfy the structural cohesiveness constraint (already considered in the DFS procedure) or because it is the smallest-score vertex; in the latter case, all vertices it r-dominates must be deleted first. The verification of a leaf vertex u in G′_d entails partitioning ρ by half-spaces HS_i: S(u′) ≥ S(u), each corresponding to one of the remaining leaf vertices u′. Formally, an arrangement bounded by R is defined by the supporting hyperplanes of these half-spaces, where each cell (i.e., sub-partition) is located in a set of half-spaces. The leaf vertices corresponding to these half-spaces are precisely those with scores higher than u if w falls in that cell; when all remaining leaf vertices are among them, u is the smallest-score vertex in that cell.

Consider G_d in Fig. 4(b). Initially, the leaf vertices are v , v and v (G′_d = G_d, I′ = ∅, Line 6 in Algorithm 1). Their respective half-spaces HS : S(v ) ≥ S(v ), HS : S(v ) ≥ S(v ) and HS : S(v ) ≥ S(v ) are inserted into the arrangement, as shown in Fig. 5(a). The vertex in brackets for each partition indicates the smallest-score vertex for any weight vector w in that partition. For partitions ρ and ρ on the right, vertex v will also be deleted by the DFS procedure due to the deletion of v , after which v and v are both pushed into the heap I′ w.r.t. the partitions (representing the vertices ignored). Thus, the resulting subgraph H induced by { v , . . . , v } is the corresponding non-contained MAC, since discarding any further vertex would violate the structural cohesiveness constraint. By backtracking the top vertices in I′ once (i.e., v and v ), we can easily obtain the second-ranked MAC induced by { v , . . . , v } in ρ and ρ (refer to R in Fig. 2(b)).

When Corollary 1 does not hold, sub-partition ρ′ will be pushed into queue U to compute the non-contained MAC in depth. At this time, the ignored vertices are discarded to update the resulting subgraph H and the corresponding G′_d, so that a new leaf vertex can be designated in the next round of verification and the half-spaces against other leaf vertices are inserted into the local arrangement. Consider ρ in Fig. 5(a). We refer to v as the new leaf vertex after v is deleted, i.e., H and G′_d are induced by { v , . . . , v } and I′ = { v }. Then, new half-spaces HS : v ≥ v and HS : v ≥ v are inserted into a newly initialized local arrangement against partition ρ , where three sub-partitions ρ , ρ and ρ are produced, as shown in Fig. 5(b). As v and v are pushed into I′ together, the non-contained MAC induced by { v , v , v , v } is returned for each sub-partition. Likewise, the same operation applies to partition ρ . Eventually, the solution in Example 3 is obtained. Note that we can directly locate HS and HS for ρ since no new half-space needs to be computed, as its leaf vertices (in G′_d) are the same as those of ρ . This drastically reduces repetitive computation of half-spaces, each of which is computed only once if necessary.

Specifically, for each local arrangement considered, an index is built by a recursive process Partition (Algorithm 2). Then, the index is discarded when all relevant half-spaces are inserted, leaving only the promising sub-partitions of the local arrangement (if any). Note that, for any index of ρ (Line 9 in Algorithm 1), the total cost of inserting the i-th hyperplane hp is O(i^{d−1}) [20]. In addition, optimization of arrangement indexing and maintenance is the same as described in [13].

VI. LOCAL SEARCH

Algorithm 3: Local Search Framework
  Input: G_r, G_s, Q, k, t, R
  Output: The non-contained MACs
  G′_s ← filter out vertices by RangeQuery(Q, t);
  H_k^t ← k-core decomposition on G′_s;
  G_d ← build r-dominance graph of H_k^t;
  C ← Expand(H_k^t, G_d, Q, k);
  Verify(C, G_d, Q, R);

Although the global search algorithm is reasonably efficient for each query, it may need to explore the entire maximal (k, t)-core, especially when the query vertices Q are located in the upper layers of the r-dominance graph G_d. In this section, we devise the local search algorithms for Problem 2 and investigate the generalization for Problem 1.

The intuition is that the non-contained MACs for Q are in the vicinity of Q. Thus, it should be unnecessary to involve the entire H_k^t during the search. Nonetheless, it is intractable to enumerate all the k-ĉores containing Q, whose number can be exponentially large w.r.t. the size of H_k^t. Accordingly, we focus only on finding the communities that are most likely to be candidates for non-contained MACs. Their validity and corresponding partitions in R can be quickly verified by G_d alone. This inspires us to develop a more efficient local search framework (Algorithm 3). Specifically, the Expand procedure finds candidates (i.e., C) by selecting the most promising vertex as we explore the neighborhood of Q, and stops when each target community forms a k-ĉore. The Verify procedure guarantees that all valid non-contained MACs w.r.t. R are identified from C. Note that the time complexity of Algorithm 3 is bounded by O(|C| · s^d) (see Theorem 3 and 4), which is much lower than that of Algorithm 1 (|C| and s are typically very small in practice). As verified in our experiments, local search is at least one order of magnitude faster than global search, and all non-contained MACs will be expanded by our candidate generation strategies in most cases.

In this way, two problems arise: (1) how do we guarantee that the target community can be a candidate for non-contained MACs (at least forming a k-ĉore containing Q); and (2) how do we know whether the k-ĉore is a valid non-contained MAC w.r.t. R? The former poses a great challenge of determining which vertex to choose and when to terminate expansion, and the latter requires an in-depth study of the characteristics of non-contained MACs. In the following, we present lemmas and algorithms for the local search strategy.

Algorithm 4: Expand
  Input: H_k^t, G_d, Q, k
  Output: The candidate set C
  Queue U ← ∅; V_H ← Q;  // or ∀v ∈ N(Q, H_k^t), V_H ← Q ∪ {v}
  foreach unvisited v ∈ N(V_H, H_k^t) do U.push(v);
  while U ≠ ∅ do
      u ← U.pop(); V_H ← V_H ∪ {u};
      if δ(H) ≥ k then output C ← C ∪ H;
      foreach unvisited v′ ∈ N(u, H_k^t) do U.push(v′);

A. Candidate Generation
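The expansion loop of Algorithm 4 can be sketched with a priority queue in place of the plain queue. The sketch below is only one possible reading: it ranks a frontier vertex by the number of neighbors it already has inside the current community (a purely structural priority), ignores the community-score component and the r-dominance graph, returns just the first candidate whose minimum degree reaches k, and recomputes δ(H) from scratch in every round for clarity rather than speed. The names `expand` and `min_degree` are hypothetical.

```python
import heapq
from typing import Dict, Hashable, Iterable, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]


def min_degree(adj: Graph, members: Set[Hashable]) -> int:
    """delta(H): minimum degree of the subgraph induced by members."""
    return min(len(adj[v] & members) for v in members)


def expand(adj: Graph, query: Iterable[Hashable], k: int
           ) -> Optional[Set[Hashable]]:
    """Grow a community from the query vertices until delta(H) >= k.

    Returns the vertex set of the first candidate found, or None if the
    reachable part of the graph never attains minimum degree k.
    """
    members = set(query)
    heap = []   # entries (-priority, vertex); outdated entries are skipped

    def push_frontier(src: Hashable) -> None:
        for v in adj[src]:
            if v not in members:
                heapq.heappush(heap, (-len(adj[v] & members), v))

    for q in list(members):
        push_frontier(q)
    while heap:
        if min_degree(adj, members) >= k:
            return members
        neg_pri, v = heapq.heappop(heap)
        if v in members or -neg_pri != len(adj[v] & members):
            continue                  # duplicate or outdated priority
        members.add(v)
        push_frontier(v)              # refresh priorities of v's neighbors
    return members if min_degree(adj, members) >= k else None


# A 4-clique {1, 2, 3, 4} with two extra vertices 5 and 6 attached.
adj = {1: {2, 3, 4, 5, 6}, 2: {1, 3, 4, 5}, 3: {1, 2, 4},
       4: {1, 2, 3}, 5: {1, 2, 6}, 6: {1, 5}}
print(expand(adj, {3}, k=3))
```

Starting from vertex 3 with k = 3, the expansion reaches the 4-clique without touching vertices 5 and 6, which illustrates why local search may avoid exploring the entire H_k^t.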

To be a candidate for non-contained MACs, the current community must be at least a k-ĉore containing Q. It would be convenient if the structural cohesiveness metric (i.e., k-core) were "monotonic", meaning that the larger the community, the smaller its minimum degree. Then, once the minimum degree dropped below the given coreness threshold k, we could stop the search. Unfortunately, such a metric is not monotonic. Thus, the greatest challenge is to overcome non-monotonicity first, which motivates us to conduct community search only by exploring the local neighborhood of Q.

Now for the minimum degree of a subgraph H, denoted by δ(H), we make an in-depth analysis of its monotonicity. Consider the exploration starting from the query vertices, i.e., H_0 induced by Q. We add a vertex from H_k^t at each step until a k-ĉore H is obtained, assuming that v_1, v_2, ..., v_e is the vertex sequence added. So let H_i be induced by Q ∪ {v_1, ..., v_i}. In general, δ(H) is a non-monotonic function of H. More formally, δ(H_{i+1}) is not necessarily greater than δ(H_i). However, the order of vertices added to V_H determines the monotonicity of δ(H). Interestingly, for any Q with δ(H_0) = 0, we can always find a sequence of added vertices such that δ(H_i) is a non-decreasing function of i.

Lemma 7: For any query vertices Q ⊆ V_H with δ(H_0) = 0 in graph H_k^t, there always exists an added vertex sequence v_1, v_2, ..., v_e of H starting with Q such that ∀ 0 ≤ i < e, δ(H_i) ≤ δ(H_{i+1}).

Proof Sketch: This is equivalent to proving that vertices can be removed one by one from V_H until only Q remains, while ensuring that no removal of a vertex v increases the minimal degree of the remaining vertices. Such an increase can occur only when v is currently one of the vertices with the minimal degree. The reason is that removing a vertex with non-minimal degree will only preserve or decrease the minimal degree. □

Lemma 7 implies that there always exists an exploration order that leads monotonically to a k-ĉore community containing Q. This can be generated by a sequence of vertices added to V_H starting from Q so that δ(H_i) ≤ δ(H_{i+1}) for each i. Note that the existence of such an order is a necessary but insufficient condition for finding a valid community. To illustrate the insufficiency, consider Q = { v } and k = 2 in Fig. 1(a). Any vertex sequence starting with v , v cannot yield a valid solution, yet δ(H_1) is greater than δ(H_0).

In the Expand procedure (Algorithm 4), we explore from the vicinity of Q by BFS and generate the candidate set C. To make the current community converge towards a candidate for a non-contained MAC, we develop two intelligent candidate selection strategies according to Lemma 3, 6 and 7. The idea of improving candidate generation is to use priority queues such that the most promising vertex can be selected to rapidly generate a candidate. From the perspective of structural cohesiveness, the priority of a vertex v can be defined as f_1(v) = δ(H′) − δ(H) or f_2(v) = dg_{H′}(v), where V_{H′} = V_H ∪ {v}. f_1(v) emphasizes the improvement of the minimum degree for the next step, with f_1(v) = 1 or 0 for any v; f_2(v) produces the fastest increase in the average degree of H, so that the minimum degree will increase as H becomes denser. From the perspective of community score, the priority of v can be defined as f_3(v) = ζ − l(v), where ζ is a constant (the maximum priority in G_d) and l(v) denotes the layer of v in G_d. f_3(v) drives the community score higher by adding a vertex that r-dominates as many vertices as possible. To sum up, the priority f(v) is defined as

  f(v) = λ · f_1(v) + f_3(v),   (3)

where λ is a trade-off against ζ, or

  f(v) = ζ · f_2(v) + f_3(v).   (4)

Theorem 3: The time complexity of Algorithm 4 by Eq. 3 and Eq. 4 is O(n + m) and O(n + m log n) respectively, where n and m denote the number of vertices and edges in C. The space complexity is O(n + m).

B. Verification

The dominant cost of global search is that computing the local arrangement of all half-spaces HS_i among leaf vertices is a relatively expensive process (O(i*^d) time [29], where i* is the number of half-spaces). Instead, in the Verify procedure (Algorithm 5), an empty arrangement in R is initialized, into which only the half-spaces w.r.t. a carefully selected, and therefore very small, subset of vertices (i.e., the competitors below) are inserted. The aim is to safely confirm or disqualify a candidate H without considering all other vertices. Before this, we first give a corollary to filter out unpromising candidates from C. Note that G_e represents the r-dominance graph corresponding to H, which is the subgraph of G_d induced by V_H, denoted by G_d[V_H]; and G_c represents the rest of G_d, denoted by G_d[V_{G_d} \ V_H].

Corollary 2: A community H can be discarded if one of the following conditions is met: (1) ∀v ∈ V_{G_d} \ V_H, v is a non-leaf vertex in G_d; and (2) ∃v ∈ V_{G_d} \ V_H and v′ ∈ V_H such that v r-dominates v′ and v cannot be recursively deleted by deleting vertices in V_{G_d} \ V_H.

Proof Sketch: Vertices in V_{G_d} \ V_H must be deleted if H holds. □

In other words, for H to survive, either there exists a non-cross-layer arc in G_c between a non-leaf vertex v and a leaf vertex, or v can be recursively deleted by the DFS procedure. We refer to an H that is not discarded by Corollary 2 as a promising community.

Algorithm 5: Verify
  Input: C, G_d, Q, R
  Output: The non-contained MACs and corresponding partitions
  foreach H ∈ C do
      G_e ← G_d[V_H]; G_c ← G_d[V_{G_d} \ V_H];
      if Corollary 2 holds then continue;
      HP ← compute hyperplanes via G_e and G_c (Corollary 3);
      foreach hp ∈ HP do
          Sub-partitions S ← Partition(R, hp);
      foreach ρ ∈ S do
          if Corollary 3 holds then output H and ρ;

Lemma 8:

For any promising H, v is regarded as an anchor if H still forms a k-ĉore after a (non-Q) leaf vertex v in G_e is deleted.

Now we discuss the verification process by half-space insertion, which may further benefit from the r-dominance relationships stored in G_d as follows.

Lemma 9: Consider u, u′, and their half-space HS_i inserted into the arrangement. Assume that u″ is a vertex r-dominated by u′, and ρ is a partition in the arrangement not covered by HS_i. Thus u is guaranteed to r-dominate u″ in partition ρ.

Proof: First, from the definition of half-space HS_i, S(u′)

Corollary 3: For any promising H, it is a valid non-contained MAC if a partition exists in R such that all vertices in l_b(G_e) score higher than those in l_t(G_c), and the corresponding conditions are met:
(1) If Lemma 8 holds, additionally, all anchors need to score higher than other leaf vertices in G_e.
(2) ∃v ∈ l_t(G_c), if v can be recursively deleted by the DFS procedure starting from l_b(G_c) (namely, v is bound), then l_t(G_c) is updated where v is ignored and replaced by the vertices of its next layer in G_c.
(3) ∃v, v′ ∈ l_t(G_c), if v and v′ are bound to each other, then vertices in l_b(G_e) only need to score higher than v or v′.

Proof:

According to Corollary 2, all vertices in V_{G_d} \ V_H have to be deleted when H is a promising community. In other words, vertices in V_{G_d} \ V_H are either recursively deleted by deleting those in the lower layer of G_c, or deleted individually due to their lower scores. First, we consider the latter case. As a sufficient and necessary condition, these vertices just need to score lower than those in l_b(G_e); that is, the vertices in l_b(G_e) only need to score higher than those in l_t(G_c). Then, we consider the former case. That is, if condition (2) holds, it means that the restriction on half-spaces between the competitors can be relaxed by the newly updated vertices in l_t(G_c), since such a vertex v will be deleted anyway. Similarly, if condition (3) holds, the restriction on half-spaces between the competitors can also be relaxed through such vertices, because deletion of one will also lead to deletion of the other. On this basis, H becomes a non-contained MAC when Lemma 8 does not hold; otherwise, condition (1) has to be satisfied. This is because such an anchor can still be deleted as it is a non-Q leaf vertex in G_e, i.e., possibly the current smallest-score vertex in H. Putting it all together, we ensure the correctness of the partition corresponding to the non-contained MAC H (if any in R). □

TABLE II
DATASETS (K = 10^3 AND M = 10^6)

  Dataset             Vertices  Edges  dg_avg  dg_max  k_max
  San Francisco (SF)  175K      223K   2.55    8       -
  Florida (FL)        1.1M      1.4M   2.53    12      -
  Slashdot            79K       0.5M   13      2,507   85
  Delicious           536K      1.4M   5       3,216   34
  Lastfm              1.2M      4.5M   7       5,150   71
  Flixster            2.5M      7.9M   6       1,474   69
  Yelp                3.6M      9.0M   5       10,433  129

To illustrate Algorithm 5, let us reconsider Example 2. Assume that by Algorithm 4 we have three promising communities H , H and H , where V_H = { v , v , v , v }, V_H = { v , . . . , v } and V_H = { v , v , v , v , v }. For H , l_b(G_e) = { v } and l_t(G_c) = { v , v }. As v and v are bound to each other in H (condition (3) met), we only insert HS and HS into R in Fig. 5(b), and choose the partitions covered by either of them. As a result, H is a valid non-contained MAC for any weight vector of R in Fig. 2(b). Similarly, H is also valid w.r.t. R ∪ R (condition (2) met) but H is invalid (condition (1) met) as its partition is outside R.

Theorem 4:

The time complexity of Algorithm 5 is O(|C| · (n′ + m′) + c · s^d), where |C| and c denote the number of candidates and the number of promising communities in C respectively, and s is the product of |l_b(G_e)| and |l_t(G_c)|. The space complexity is bounded by O(c · s · d + n + m + n′ + m′).

Finally, we can simply generalize local search for Problem 1. As the non-contained MAC H with corresponding partition ρ is known, we insert sub-partitions into ρ according to G_c (in a top-down manner) and add the highest-score vertex to H. As long as the current H contains an MAC, it will be output. The process terminates once all top-j MACs in ρ are acquired. Consider H and its G_c: half-spaces among vertices in l_t(G_c) (i.e., HS ) are inserted first into R . Then v is added to V_H for ρ and ρ shown in Fig. 5(b). As v r-dominates v , we have the second MAC induced by { v , . . . , v } in ρ and ρ . The same applies to ρ and ρ .

VII. EXPERIMENTS

Comprehensive experiments are conducted to evaluate the proposed model and four algorithms, named GS-T, GS-NC, LS-T and LS-NC respectively. GS-T and GS-NC (resp. LS-T and LS-NC) are global search algorithms (resp. local search algorithms) used to compute the top-j MACs and the non-contained MACs, respectively. Note that either of the two candidate selection strategies in Section VI-A can be adopted in LS-T or LS-NC, and we just give the results by using Eq. 3 with ζ = 100 and λ = 10 (results by using Eq. 4 are similar and omitted to save space). All algorithms were implemented in C++, and all experiments were conducted on an Ubuntu server with 2GHz Intel Xeon E7-4820 CPU and 1TB memory.

TABLE III
PARAMETERS

  Parameter   Tested values
  k           4, 8, 16, 32, 64
  t (SF/FL)   600/800, 800/1000, 1000/1200, 1200/1400, 1400/1600
  d           3, 4, 5, 6
  |Q|         , 8, 16, 32
  j           5, 10, , 40, 60
  σ           , 5%, 10%

Datasets.

We use five real-world social networks and two road networks (SF /FL ) in our experiments. Table II summarizes the statistics of the datasets, of which dg_avg, dg_max and k_max denote the average degree, the maximal degree and the maximal core number, respectively. Note that numerical attributes are not contained in the first four original social networks, for which we employ a widely used method in [21] to generate three different types of numerical attributes, i.e., independence, correlation and anti-correlation. Due to space limits, we report the results obtained from datasets with independent and real attributes. In addition, we map each user v to a spatial point p in the road network that matches the scale of his/her social network as follows: we project SF/FL into range [0 , in each dimension and generate normalized L(v) by drawing from a list of recent check-ins. We assume that p is the current location of v if it has the smallest Euclidean distance to L(v) in the projection space.

Parameters. We vary the following parameters: structural cohesiveness k, query distance t, dimensionality d, number of query users |Q|, number j of top MACs, and percentage σ of axis length (i.e., side-length of R). Table III shows the range of parameters and their default values (in bold). For each value of |Q|, we randomly select sets of query vertices, that satisfy t and can ensure the existence of the maximal (k, t)-core, from the k-core of each social network. In each experiment, only one parameter varies and the rest remain at their defaults. Every reported measurement is the average of , MAC searches for ten randomly generated axis-parallel hypercubes R in the preference domain.

A. Performance Evaluation

Exp-1: Varying k . We evaluate the query processing time ofall algorithms and the number of vertices of H tk by varying k . In each road-social network, we can see that local searchperforms better than global search, but the advantage becomesless obvious when k increases. In the best case, e.g., k = 4 , LS-T and

LS-NC are more than one order of magnitude faster http://networkrepository.com .11101001000 T i m e ( s ) GS-NC GS-TLS-NC LS-T1010 -1 (a) Varying k T i m e ( s ) GS-NC GS-TLS-NC LS-T10 (b) Varying t T i m e ( s ) GS-NC GS-T

LS-NC LS-T (c) Varying d T i m e ( s ) GS-NC GS-TLS-NC LS-T (d) Varying | Q | T i m e ( s ) GS-T LS-T (e) Varying j T i m e ( s ) GS-NC GS-T

LS-NC LS-T (f) Varying σ Fig. 6. Efﬁciency and scalability of proposed algorithms in SF+Slashdot with independent attributes. T i m e ( s ) GS-NCGS-TLS-NC

LS-T -1 (a) Varying k T i m e ( s ) GS-NC GS-T

LS-NC LS-T10 (b) Varying t T i m e ( s ) GS-NC GS-TLS-NC LS-T (c) Varying d T i m e ( s ) GS-NC GS-TLS-NC LS-T (d) Varying | Q | T i m e ( s ) GS-T LS-T (e) Varying j T i m e ( s ) GS-NC GS-T

LS-NC LS-T (f) Varying σ Fig. 7. Efﬁciency and scalability of proposed algorithms in SF+Delicious with independent attributes. T i m e ( s ) GS-NC GS-TLS-NC LS-T10 (a) Varying k T i m e ( s ) GS-NC GS-TLS-NC LS-T10 (b) Varying t T i m e ( s ) GS-NC GS-TLS-NC LS-T (c) Varying d T i m e ( s ) GS-NC GS-TLS-NC LS-T (d) Varying | Q | T i m e ( s ) GS-T LS-T (e) Varying j T i m e ( s ) GS-NC GS-TLS-NC LS-T (f) Varying σ Fig. 8. Efﬁciency and scalability of proposed algorithms in FL+Lastfm with independent attributes. T i m e ( s ) GS-NC GS-T

LS-NC LS-T (a) Varying k T i m e ( s ) GS-NC GS-T

LS-NC LS-T10 (b) Varying t T i m e ( s ) GS-NC GS-T

LS-NC LS-T (c) Varying d T i m e ( s ) GS-NC GS-TLS-NC LS-T (d) Varying | Q | T i m e ( s ) GS-T LS-T (e) Varying j T i m e ( s ) GS-NC GS-T

LS-NC LS-T (f) Varying σ Fig. 9. Efﬁciency and scalability of proposed algorithms in FL+Flixster with independent attributes. than

GS-T and

GS-NC in Fig. 7(a) and 9(a). Only when k = 64 , global search is comparable to local search since H tk size shrinks when k is large (see Fig. 11(c)), resultingin a reduction in the time complexity of global search andin the number of promising vertices involved in local search.Although Delicious is larger than Slashdot, but algorithms runfaster at k = 16 and k = 32 since H tk contains fewer vertices (as k max = 34 in Table II). Note that when k increases from to , local search is merely more than twice as fast, e.g., LS-NC takes s and s respectively in Fig. 7(a). This is becausewhen k is relatively small, candidate selection strategies tendto ﬁnd more smaller candidates. However, algorithms runfaster in Yelp than in Flixster, yet its H tk size is larger. Thiswill be explained in Exp-6. In short, across a wide range of k local search is consistently better than global search. Exp-2: Varying t . We evaluate the query processing timeof all algorithms by varying t . For each road-social network,local search outperforms global search signiﬁcantly, with ad-vantage becoming more obvious as t increases. For example,in Fig. 9(b), GS-NC takes , s while LS-NC takes sfor t = 1600 . Note that the results are obtained in the caseof k = 16 , thus generally LS-T and

LS-NC are almost oneorder of magnitude faster than

GS-T and

GS-NC in terms of t. This is because when t is large, more users are retained by the range query accelerated by G-tree [24] or G*-tree [25]. This favors local search radiating outward from Q, which is equivalent to increasing the expansion radius.

Exp-3: Varying d. By varying d, we study the query processing time and the memory overhead of all algorithms, and compare them with other methods. MAC search becomes harder as d increases because of its computational-geometric nature. Nonetheless, all four algorithms remain practical at d = 6 in Fig. 9(c). Furthermore, Fig. 11(d) shows the memory overhead of GS-NC/LS-NC and the BBS process (X indexed and G_d built). When d increases, the dimensionality of the R-tree increases, but the memory overhead changes little because the size of G_d is unchanged; local search is very lightweight compared with global search. For example, LS-NC takes less memory than GS-NC at both d = 3 and d = 6. The results confirm both the theoretical analysis and the claims on arrangement indexing in Section V-B.

Fig. 10. Efficiency and scalability of proposed algorithms in FL+Yelp with independent attributes. Panels (time in seconds; GS-NC, GS-T, LS-NC, LS-T): (a) varying k, (b) varying t, (c) varying d, (d) varying |Q|, (e) varying j (GS-T and LS-T only), (f) varying σ.

Fig. 11. Scalability of proposed algorithms. Panels: (a) number of partitions (during search) against σ, (b) number of non-contained MACs against σ, (c) H_k^t, (d) memory overhead in MB against d (FL+Lastfm; BBS, GS-NC, LS-NC). Datasets in (a)-(c): SF+Slashdot, SF+Delicious, FL+Lastfm, FL+Flixster, FL+Yelp.

Fig. 12. Ratio of NC-MACs found by LS-NC to GS-NC in FL+Lastfm. Panels: (a) varying k, (b) varying |Q|.

In addition, Fig. 13 and 14 compare our algorithms with the methods in [4] and [8], where Influ (resp. Influ+) is the DFS-based (resp. ICP-index based) algorithm, and Sky (resp. Sky+) is the basic (resp. space-partition) algorithm. We evaluate Influ and Influ+ by varying k instead of d since they can only capture a one-dimensional attribute. For a fair comparison, weight vectors that fall anywhere in R are randomly selected, each is used to calculate the weighted sum of the d (at the default) numerical attributes as the vertex influence (i.e., score), and the average processing time is reported in Fig. 13(b) and 14(b). Since no r-dominance graph needs to be maintained and no half-space has to be computed or inserted, Influ and Influ+ are superior to GS-NC and LS-NC in terms of processing time, while Sky and Sky+ are generally the most expensive due to their time complexity. On the other hand, in terms of d, Sky and Sky+ are much costlier than ours and become intractable when d is relatively large (Fig. 14(c)). Here, “Inf” means that the processing time exceeds the time limit. Therefore, our model and algorithms are tractable and scalable to handle real-world applications comprehensively and flexibly.
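The weighted-sum scoring used for the Influ and Influ+ comparison can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy attribute values, and the example region are hypothetical. It also assumes that R is an axis-parallel box over the first d − 1 weights and that the last weight is 1 minus their sum, which is consistent with the case study (four attributes, a region given by three intervals) but is not stated in this section.

```python
import random

def sample_weight(region, rng):
    """Draw one weight vector from an axis-parallel region R.

    `region` holds (low, high) bounds for the first d-1 weights.
    Assumption: the last weight is 1 minus the sum of the others.
    """
    head = [rng.uniform(lo, hi) for lo, hi in region]
    return head + [1.0 - sum(head)]

def influence(attrs, weight):
    """Weighted sum of a vertex's d numerical attributes (its score)."""
    return sum(a * w for a, w in zip(attrs, weight))

def score_vertices(vertices, region, trials=5, seed=0):
    """Score every vertex once per randomly selected weight vector."""
    rng = random.Random(seed)
    runs = []
    for _ in range(trials):
        w = sample_weight(region, rng)
        runs.append({v: influence(a, w) for v, a in vertices.items()})
    return runs

# Toy data: two vertices with d = 3 attributes, R over the first two weights.
vertices = {"u1": [0.9, 0.2, 0.5], "u2": [0.4, 0.8, 0.6]}
region = [(0.2, 0.3), (0.3, 0.4)]
runs = score_vertices(vertices, region)
```

Each entry of `runs` is the one-dimensional influence that a single-attribute method would consume for one sampled weight vector; averaging a measurement over the entries mirrors the averaging over random weight vectors described above.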

Exp-4: Varying |Q|. We evaluate the query processing time of all algorithms and the ratio of non-contained MACs (NC-MACs) found by LS-NC to those found by GS-NC by varying |Q|. All processing times decrease almost monotonically as |Q| grows, since a larger |Q| accelerates the convergence of both global and local search. The reason is that increasing |Q| means more vertices cannot be deleted in global search, and fewer vertices need to be selected before local search finds a candidate. Note that in Fig. 7(d), for k = 16, the algorithms are not much faster at |Q| = 32 than at |Q| = 16. In fact, we also find that global search terminates early if the generated query vertices Q are located in the lower layers of G_d, because it is more likely to encounter Q when deleting vertices, while local search behaves in just the opposite way. Fig. 12(a) and 12(b) show the ratio against k and |Q|, respectively. Both ratios decrease as k and |Q| grow, but remain satisfactory. The reason is that the number of community candidates expanded by the Expand procedure tends to decrease as k or |Q| increases, while the Verify procedure always ensures the correctness of the corresponding partitions in R for any (promising) community. This also shows that local search is very useful for applications that expect only part of the MACs. Note that, in practice, |Q| is usually not very large, and the ratio at the default |Q| confirms that local search can find all non-contained MACs in most cases.

Fig. 13. Comparison of different methods in SF+Delicious (independence). Panels (time in seconds; Influ, Influ+, Sky, Sky+, GS-NC, LS-NC): (b) varying k, (c) varying d.

Fig. 14. Comparison of different methods in FL+Flixster (independence). Panels (time in seconds; Influ, Influ+, Sky, Sky+, GS-NC, LS-NC): (b) varying k, (c) varying d.

Exp-5: Varying j. We examine the query processing time of GS-T and LS-T by varying j. The curve of GS-T rises slowly with increasing j since the top-j MACs can be obtained directly after executing global search. For LS-T, however, after a non-contained MAC is obtained, its corresponding cell has to be divided again to find the top-j MACs, which increases the processing time.

Exp-6: Varying σ. We evaluate the query processing time of all algorithms, as well as the number of partitions and non-contained MACs, by varying σ (which determines the size of region R). As anticipated, a larger R means a larger output and thus more computation. In Fig. 11(a), we can see that increasing σ leads to a significant increase in the number of partitions in R, which also explains the increase in the processing time of all algorithms. Fig. 11(b) records the relationship between σ and the number of non-contained MACs obtained by GS-NC. Similarly, as the number of partitions increases, the diversity of non-contained MACs w.r.t. R also increases. Note that both bars of Yelp in Fig. 11(a) and 11(b) are much shorter than those of Flixster, while its H_k^t size is larger in Fig. 11(c). Since attributes, not only in Yelp but in the real world in general, are usually correlated, there are fewer (or even unique) branches in the DAG, which results in less processing time, that is, less half-space computation and insertion.

B. Case Study

NA+Aminer. We use the road network of North America (NA) and Aminer (aminer.org) for the first case study. Aminer is a scientific collaboration network that incorporates authors in the DB, DM, IR, and ML fields. For each author we crawl four numerical attributes, including h-index, activeness, and diverseness. To evaluate the effectiveness of the MAC model in the real world, we map each author to a location in NA according to the author's affiliation and use Q = {“Jiawei Han”, “Jian Pei”, “Philip S. Yu”, “Xifeng Yan”}, who are renowned scientists in DM (i.e., with relatively high scores), as query vertices. After setting k = 5, j = 2 and a region R given as the product of three intervals (with t large enough), Fig. 15(a-d) show the top-2 MACs anywhere in R.

Furthermore, we compare MAC with different models. For InfC [4], Fig. 15(f, g) report the results involving Q, taking respectively a single attribute and the weighted sum (by a weight vector w ∈ R) as influence. In fact, InfC either cannot capture the characteristics of all attributes, or must be covered by a non-contained MAC (NC-MAC) if w ∈ R (e.g., Fig. 15(c)). For SkyC [8], there are two results: one is the same as the NC-MAC in Fig. 15(a), while the other, shown in Fig. 15(e), contains only part of Q and is covered by the NC-MAC in Fig. 15(c). In effect, we find that SkyC is always contained in NC-MACs because it involves no query vertices and because of its skyline property. For ATC [7], Fig. 15(h) reports the truss-based community w.r.t. Q and the keyword “DM”, with trussness 6 since a (k+1)-truss is a k-core. Although its communication cost is low, its size is still too large since it only considers the maximum inclusion of keywords but ignores attributes. Therefore, the MAC model is very effective, comprehensive and flexible for applications.

Fig. 15. Case study of Aminer+NA: results for k = 5. Members shown in each panel:
(a) NC-MAC: Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, Ke Wang, Charu Aggarwal, Haixun Wang.
(b) MAC: Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, Ke Wang, Yizhou Sun, Charu Aggarwal, Haixun Wang, Chi Wang.
(c) NC-MAC: Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, Ke Wang, Yizhou Sun, Charu Aggarwal, Haixun Wang, Chi Wang, Xiang Ren, Yintao Yu.
(d) MAC: Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, Ke Wang, Yizhou Sun, Charu Aggarwal, Haixun Wang, Chi Wang, Xiang Ren, Jing Gao, Yintao Yu.
(e) SkyC: Jiawei Han, Jian Pei, Xifeng Yan, Ke Wang, Yizhou Sun, Charu Aggarwal, Haixun Wang, Chi Wang.
(f) InfC (1-D): Jiawei Han, Jian Pei, Philip S. Yu, Ke Wang, Charu Aggarwal, Haixun Wang, Chi Wang, Xiang Ren.
(g) InfC (w ∈ R): Jiawei Han, Jian Pei, Philip S. Yu, Ke Wang, Charu Aggarwal, Haixun Wang, Xiang Ren.
(h) ATC (“DM”): Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, Ke Wang, Yizhou Sun, Charu Aggarwal, Haixun Wang, Chi Wang, Xiang Ren, Jing Gao, Yintao Yu, Xiaohui Gu, Yu Xiao, Xin Jin, Chen Chen, Wei Fan, Marina Danilevsky.

SF+Yelp. We use SF and Yelp for the second case study. In addition to profile information (e.g., ID, first name, etc.), users in Yelp also have real attribute data, such as the average rating of all reviews. Because such real-world attributes are usually correlated, the number of branches in the DAG will be extremely small or even unique, resulting in less half-space computation/insertion and fewer partitions in R. We map each user to a location in SF according to check-ins and use Q = {“Emi”, “Phil”, “Dani”, “Michelle”}, who are relatively active, as query vertices. To discover a group of people who are more concerned and popular, we set k = 6, t = 300, j = 3 and a region R given as the product of two intervals; Fig. 16 shows the top-3 MACs in R. This illustrates that, in the real world, the diversity of (non-contained) MACs and the number of corresponding partitions w.r.t. R are very small and user-friendly.

Fig. 16. Case study of Yelp+SF: results for k = 6. Members shown in each panel:
(a) NC-MAC: Michelle, Dani, Emi, Phil, Marthie, Jane, Gabi, Kat.
(b) MAC: Michelle, Dani, Emi, Phil, Marthie, Jane, Gabi, Kat, Tom.
(c) MAC: Michelle, Dani, Emi, Phil, Marthie, Jane, Gabi, Kat, Tom, Linda.

VIII. RELATED WORK

Community model and search. A large number of community models have been proposed, such as k-core [2], [12], k-truss [3], maximal clique [30], quasi-clique [15], maximal k-edge connected subgraph [31]–[33], locally densest subgraph [34], query-biased density [35], etc. All these models consider only graph structural information but ignore numerical/textual attributes associated with vertices. To discover cohesive subgraphs containing the query vertices, a community search problem (CSP) was studied to find the maximal connected k-core in social networks [1], for which [2] proposed a more efficient local search algorithm. Recently, [7] introduced the CSP of small-diameter k-truss with similar query attributes. [5] and [6] developed the CSP of k-core with textual attributes and smallest minimum covering circle, respectively. [4] proposed an influential community with a vertex's influence as one numerical attribute, based on which [8] studied a skyline community for d-dimensional numerical attributes. More recently, [36] studied the CSP of k-truss with a distance of at most d between any two vertices. [37] and [38] investigated a cohesive version of CSP that brings all community members closest to the point-of-interest in road networks. [39] and [17] studied two different CSPs in terms of context, with only query keywords but no query vertices. In addition, [40] and [41] made variations on the CSPs in [7] and [36], respectively.

This work differs from all the prior work in the following respects. (1) Our multi-attributed community (MAC) model is the first one that can incorporate uncertainty of user preferences in the weight vector into d-dimensional numerical attributes and capture spatial cohesiveness between users in road networks. (2) The preference domain is introduced into community modeling for the first time. (3) We study the novel MAC search problems in road-social networks, such that our techniques are significantly different from all previous CSP algorithms.

Skyline and its generalization. The r-dominance graph used in our MAC model is relevant to the skyline [21] and, more closely, to its extension, the k-skyband [26]. In traditional top-j queries, if two records are not identical and one has no smaller value in any dimension [21], then it dominates the other. Thus, for a dataset, the skyline consists of the records that are not dominated by any other, while the k-skyband comprises those dominated by fewer than k other records [26]; it serves as a superset of all records that may appear in the top-j results for any weight vector. As a typical k-skyband computation algorithm, BBS [26] uses a spatial index on the dataset, following the branch-and-bound paradigm [42].

This reveals an additional essential difference between [8] and our work from another perspective. Regardless of social and spatial cohesiveness, dominance in [8] between two communities comes down to a series of standard dominance tests on d-dimensional vectors. However, the r-dominance tests adopted in our model do not suffice by themselves; e.g., two or more non-skyline communities may still collaboratively disqualify a skyline one if they score higher than it at different parts of R, collectively blocking it from being a top-j result anywhere in R.

IX. CONCLUSIONS

In this paper, we propose a novel community model to discover normative communities suitable for multi-criteria decision making in a road-social network, in which each user is linked with location information and d (≥ 1) numerical attributes. Taking a preference region of the d-dimensional data domain as input, the resulting communities identified by our model cannot be r-dominated by other ones as long as the weight vector could fall anywhere in the region. We formalize the multi-attributed community search, distinguish two problem versions, develop solutions for the corresponding processing, and, using both real-world and synthetic datasets, demonstrate the efficiency and scalability of our solutions and the effectiveness of our model.

ACKNOWLEDGMENT

This work is supported by NSFC (No. 61932004, 61732003, 61729201 and 62072087) and Fundamental Research Funds for the Central Universities (No. N181605012). Ye Yuan is the corresponding author.

REFERENCES

[1] M. Sozio and A. Gionis, “The community-search problem and how to plan a successful cocktail party,” in SIGKDD, 2010, pp. 939–948.
[2] W. Cui, Y. Xiao, H. Wang, and W. Wang, “Local search of communities in large graphs,” in SIGMOD, 2014, pp. 991–1002.
[3] X. Huang, L. V. Lakshmanan, J. X. Yu, and H. Cheng, “Approximate closest community search in networks,” PVLDB, vol. 9, no. 4, pp. 276–287, 2015.
[4] R. Li, L. Qin, J. X. Yu, and R. Mao, “Influential community search in large networks,” PVLDB, vol. 8, no. 5, pp. 509–520, 2015.
[5] Y. Fang, R. Cheng, S. Luo, and J. Hu, “Effective community search for large attributed graphs,” PVLDB, vol. 9, no. 12, pp. 1233–1244, 2016.
[6] Y. Fang, R. Cheng, X. Li, S. Luo, and J. Hu, “Effective community search over large spatial graphs,” PVLDB, vol. 10, no. 6, pp. 709–720, 2017.
[7] X. Huang and L. V. Lakshmanan, “Attribute-driven community search,” PVLDB, vol. 10, no. 9, pp. 949–960, 2017.
[8] R. Li, L. Qin, F. Ye, J. X. Yu, X. Xiao, N. Xiao, and Z. Zheng, “Skyline community search in multi-valued networks,” in SIGMOD, 2018, pp. 457–472.
[9] T. Joachims, “Optimizing search engines using clickthrough data,” in SIGKDD, 2002, pp. 133–142.
[10] B. Jiang, J. Pei, X. Lin, D. W. Cheung, and J. Han, “Mining preferences from superior and inferior examples,” in SIGKDD, 2008, pp. 390–398.
[11] L. Qian, J. Gao, and H. Jagadish, “Learning user preferences by adaptive pairwise comparison,” PVLDB, vol. 8, no. 11, pp. 1322–1333, 2015.
[12] S. B. Seidman, “Network structure and minimum degree,” Social networks, vol. 5, no. 3, pp. 269–287, 1983.
[13] K. Mouratidis and B. Tang, “Exact processing of uncertain top-k queries in multi-criteria settings,” PVLDB, vol. 11, no. 8, pp. 866–879, 2018.
[14] V. Batagelj and M. Zaversnik, “An o(m) algorithm for cores decomposition of networks,” CoRR, cs.DS/0310049, 2003.
[15] W. Cui, Y. Xiao, H. Wang, Y. Lu, and W. Wang, “Online search of overlapping communities,” in SIGMOD, 2013, pp. 277–288.
[16] F. Zhang, Y. Zhang, L. Qin, W. Zhang, and X. Lin, “When engagement meets similarity: efficient (k, r)-core computation on social networks,” PVLDB, vol. 10, no. 10, pp. 998–1009, 2017.
[17] Z. Zhang, X. Huang, J. Xu, B. Choi, and Z. Shang, “Keyword-centric community search,” in ICDE, 2019, pp. 422–433.
[18] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in SIGMOD, 1984, pp. 47–57.
[19] Y.-C. Chang, L. Bergman, V. Castelli, C.-S. Li, M.-L. Lo, and J. R. Smith, “The onion technique: indexing for linear optimization queries,” in SIGMOD, 2000, pp. 391–402.
[20] B. Tang, K. Mouratidis, and M. L. Yiu, “Determining the impact regions of competing options in preference space,” in SIGMOD, 2017, pp. 805–820.
[21] S. Borzsony, D. Kossmann, and K. Stocker, “The skyline operator,” in ICDE, 2001, pp. 421–430.
[22] J. Liu, L. Xiong, J. Pei, J. Luo, and H. Zhang, “Finding pareto optimal groups: Group-based skyline,” PVLDB, vol. 8, no. 13, pp. 2086–2097, 2015.
[23] I. F. Ilyas, G. Beskales, and M. A. Soliman, “A survey of top-k query processing techniques in relational database systems,” ACM Comp. Surveys, vol. 40, no. 4, pp. 1–58, 2008.
[24] R. Zhong, G. Li, K.-L. Tan, L. Zhou, and Z. Gong, “G-tree: An efficient and scalable index for spatial search on road networks,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 8, pp. 2175–2189, 2015.
[25] Z. Li, L. Chen, and Y. Wang, “G*-tree: An efficient spatial index on road networks,” in ICDE, 2019, pp. 268–279.
[26] D. Papadias, Y. Tao, G. Fu, and B. Seeger, “Progressive skyline computation in database systems,” ACM Trans. Database Syst., vol. 30, no. 1, pp. 41–82, 2005.
[27] M. D. Berg, O. Cheong, M. V. Kreveld, and M. Overmars, Computational geometry: algorithms and applications. Springer, 2008.
[28] L. Zou and L. Chen, “Pareto-based dominant graph: An efficient indexing structure to answer top-k queries,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 5, pp. 727–741, 2011.
[29] P. K. Agarwal and M. Sharir, “Arrangements and their applications,” in Handbook of computational geometry. Elsevier, 2000, pp. 49–119.
[30] J. Cheng, Y. Ke, A. W.-C. Fu, J. X. Yu, and L. Zhu, “Finding maximal cliques in massive networks,” ACM Trans. Database Syst., vol. 36, no. 4, pp. 21:1–21:34, 2011.
[31] R. Zhou, C. Liu, J. X. Yu, W. Liang, B. Chen, and J. Li, “Finding maximal k-edge-connected subgraphs from a large graph,” in EDBT, 2012, pp. 480–491.
[32] T. Akiba, Y. Iwata, and Y. Yoshida, “Linear-time enumeration of maximal k-edge-connected subgraphs in large networks by random contraction,” in CIKM, 2013, pp. 909–918.
[33] L. Chang, J. X. Yu, L. Qin, X. Lin, C. Liu, and W. Liang, “Efficiently computing k-edge connected components via graph decomposition,” in SIGMOD, 2013, pp. 205–216.
[34] L. Qin, R. Li, L. Chang, and C. Zhang, “Locally densest subgraph discovery,” in SIGKDD, 2015, pp. 965–974.
[35] Y. Wu, R. Jin, J. Li, and X. Zhang, “Robust local community detection: on free rider effect and its elimination,” PVLDB, vol. 8, no. 7, pp. 798–809, 2015.
[36] L. Chen, C. Liu, R. Zhou, J. Li, X. Yang, and B. Wang, “Maximum co-located community search in large scale social networks,” PVLDB, vol. 11, no. 10, pp. 1233–1246, 2018.
[37] F. Guo, Y. Yuan, G. Wang, L. Chen, X. Lian, and Z. Wang, “Cohesive group nearest neighbor queries over road-social networks,” in ICDE, 2019, pp. 434–445.
[38] F. Guo, Y. Yuan, G. Wang, L. Chen, X. Lian, and Z. Wang, “Cohesive group nearest neighbor queries on road-social networks under multi-criteria,” IEEE Trans. Knowl. Data Eng., 2020.
[39] L. Chen, C. Liu, K. Liao, J. Li, and R. Zhou, “Contextual community search over large social networks,” in ICDE, 2019, pp. 88–99.
[40] Q. Liu, Y. Zhu, M. Zhao, X. Huang, J. Xu, and Y. Gao, “Vac: vertex-centric attributed community search,” in ICDE, 2020, pp. 937–948.
[41] J. Luo, X. Cao, X. Xie, Q. Qu, Z. Xu, and C. S. Jensen, “Efficient attribute-constrained co-located community search,” in ICDE, 2020, pp. 1201–1212.
[42] A. H. Land and A. G. Doig, “An automatic method of solving discrete programming problems,”


Submitted on 24 Jan 2021 (v1), last revised 21 Feb 2021 (this version, v2)
