Efficiently Finding a Maximal Clique Summary via Effective Sampling
Xiaofan Li, Rui Zhou, Lu Chen, Chengfei Liu, Qiang He, Yun Yang
Abstract
Maximal clique enumeration (MCE) is a fundamental problem in graph theory and is used in many applications, such as social network analysis, bioinformatics, intelligent agent systems, cyber security, etc. Most existing MCE algorithms focus on improving the efficiency rather than reducing the output size. The output, unfortunately, could consist of a large number of maximal cliques. In this paper, we study how to report a summary of less overlapping maximal cliques. The problem was studied before; however, after examining the pioneering approach, we consider it still not satisfactory. To advance the research along this line, our paper attempts to make four contributions: (a) we propose a more effective sampling strategy, which produces a much smaller summary but still ensures that the summary can somehow witness all the maximal cliques and that the expectation of each maximal clique witnessed by the summary is above a predefined threshold; (b) we prove that the sampling strategy is optimal under certain optimality conditions; (c) we apply clique-size bounding and design a new enumeration order to approach the optimality conditions; and (d) to verify experimentally, we test eight real benchmark datasets that have a variety of graph characteristics. The results show that our new sampling strategy consistently outperforms the state-of-the-art approach by producing smaller summaries and running faster on all the datasets.
Keywords maximal clique · clique summary · bound estimation · clique enumeration · clique sampling

X. Li, R. Zhou, L. Chen, C. Liu, Q. He and Y. Yang
Swinburne University of Technology, Australia
E-mail: {xiaofanli,rzhou,luchen,cliu,qhe,yyang}@swin.edu.au

1 Introduction

A clique C is a complete subgraph of an undirected graph G(V, E), which means that each pair of nodes in C have an edge between them. A maximal clique is a clique which is not a subgraph of any other clique. The procedure of enumerating all maximal cliques in a graph is called Maximal Clique Enumeration (MCE). MCE has a range of applications in different fields, such as discovering communities in social networks [18], identifying co-expressed genes [22], detecting protein-protein interaction complexes [41], supporting the construction of intelligent agent systems [29] and recognizing emergent patterns in terrorist networks [4].

There are a number of works [12,14,21,24,25,32,38] focusing on improving the efficiency of MCE, which is considered to take exponential time. This is probably because the number of maximal cliques in a graph can be very large: even a graph with a modest number of vertices and edges can contain a huge number of maximal cliques [36], and counting the number of maximal cliques in a general graph is computationally intractable. To reduce the output size, the work [36] proposed the notion of a τ-visible summary, a set of maximal cliques which promises that every maximal clique in graph G can be covered by at least one maximal clique in the summary with a ratio of at least τ. Here, τ is given by a user and reflects the user's tolerance of overlap. For example, a summary with a high τ guarantees that most of the nodes of every maximal clique appear in some clique of the summary. This summary model is interesting, e.g., in the marketing domain: if a certain percentage of users in a clique community has been covered, we expect that the covered users will spread a message across the community. Consequently, finding fewer communities as targets due to marketing cost while still ensuring a broad final user coverage is very desirable.
The work [36] modified the depth-first MCE [5] by adding a sampling function that determines whether a new clique enumeration sub-procedure should be entered. It was proved that the expected visibility of such a sampled summary is larger than τ.

However, expected τ-visible summaries are not unique. Apparently, as long as a summary is τ-visible, the more concise the summary is, the better. Hence three questions arise naturally in sequence:
(1) Is there any sampling strategy that can find a better (smaller) expected τ-visible summary?
(2) What kind of sampling strategy is optimal?
(3) If achieving the optimum is difficult or impossible, how can we provide the best effort?
We will tackle these three questions in this paper.

For question (1), the answer is yes. The state-of-the-art work [36] used a sampling function to determine whether each maximal clique should be included into the growing summary. However, it could include many redundant maximal cliques that (a) are visible to the current summary; or (b) are likely to be visible to the future summary (explained by Observation 1 for (a) and by Observation 2 for (b) in Section 3.1). By identifying and discarding these two types of maximal cliques, we find a novel sampling function that is superior to the existing work in terms of both output summary size and running time. For question (2), we naturally define the optimality as including each maximal clique with the lowest probability while still promising τ-visibility of the summary. We find that, when (I) the maximal cliques are enumerated in a proper order and (II) the size of a clique can be estimated sufficiently accurately at an early stage of the depth-first search, our newly proposed sampling function successfully guarantees optimality, while the existing approach [36] does not. The optimality of our sampling function promises significant improvement over [36] in terms of both effectiveness and efficiency.
Although the above two optimality conditions (I) and (II) hardly hold ideally, they provide us two important directions to approach the optimum by looking for (i) better vertex orders to enumerate maximal cliques, and (ii) better maximal clique size estimation techniques. Thus for question (3), inspired by the cohesive structure of k-truss, we design a novel vertex order based on truss decomposition. Following this vertex order, our algorithm can enumerate maximal cliques in such an order that two consecutively generated maximal cliques tend to be contained by the same cohesive k-truss and thus overlap more. When one of the two cliques is included into the summary first, the second one has a higher probability to be discarded, so the summary stays concise. Besides, we utilize k-truss to estimate the size of a maximal clique contained by a subgraph, which provides an upper bound tighter than the k-core based bound (used by [36]), since a k-truss must be a (k−1)-core, while the reverse does not hold.

In summary, this paper makes the following contributions:
– We introduce a new sampling strategy to help to identify an expected τ-visible maximal clique summary. We prove that the new sampling strategy guarantees a better performance than the state-of-the-art method in terms of producing a smaller summary while still meeting the threshold τ.
– We give a theoretical analysis that the sampling can be optimal under certain conditions, which substantiates the good performance of the proposed sampling strategy in practice. Future investigations could also be directed by exploring how to approximate the optimality conditions.
– We show that the sampling approach can get close to optimal with clique size bounding and enumeration ordering strategies. Then we propose the truss order and the truss bound respectively to further improve the performance of our sampling strategy.
– We conduct experimental studies to verify the superiority of the new sampling method as well as our newly designed truss order and truss bound in terms of both effectiveness and efficiency on eight real-world datasets.

The rest of this paper is organized as follows. In Section 2, we review the definition of τ-visible summary and an existing sampling approach. In Section 3, we give our motivation, introduce a novel sampling function and prove its superiority. The conditions of optimality are analyzed in Section 4. We propose the truss vertex order and the truss bound to practically instantiate the optimality conditions in Section 5. Extensive experiments are conducted in Section 6. Related work and conclusion are in Section 7 and Section 8.

2 τ-Visible Summary

A clique refers to a complete subgraph of an undirected graph G(V, E). A clique C is maximal if it is not contained by any other clique. When the context is clear, we also use C to denote the node set of a maximal clique. Given the set of all maximal cliques in graph G, denoted as M(G), a summary S is a subset of M(G), i.e., S ⊆ M(G). To measure to what extent a summary can witness a clique, visibility is defined in [36], restated as Definitions 1 and 2. We then introduce expected visibility in Definitions 3 and 4.

Definition 1 (Visibility)
Given a summary S, the visibility V_S : M(G) → [0, 1] of a maximal clique C is defined as:

V_S(C) = max_{C′ ∈ S} |C ∩ C′| / |C|    (1)

Note that C′ is allowed to be the same as C. This means that if C ∈ S, C's visibility with respect to S is 1. In other words, if C ∈ S, the summary S can completely witness C.

Definition 2 (τ-Visible Summary) A summary S is called τ-visible iff

∀C ∈ M(G), V_S(C) ≥ τ    (2)

Rather than the exact τ-visible summary defined above, our work looks for an expected τ-visible summary. Before we give the formal definition of an expected τ-visible summary, we explain what the term expected means intuitively. Since the number of maximal cliques is likely to be exponential, it is infeasible to first compute all the cliques and then decide the summary. Instead, it is more practical to decide on the way while enumerating, i.e., try to make a decision whether to keep/discard a new clique, or keep/discard it with a probability, when the clique is found. To be more active, a decision can be made on whether to enter each enumeration branch with some probability. This means that each maximal clique has a probability Pr[C ∈ S] to be included in S and a corresponding probability Pr[C ∉ S] = 1 − Pr[C ∈ S] to be discarded. For a clique C, if it is selected to be included into S, its visibility should be 1, since it is witnessed by itself; otherwise this value is V_S(C), which stays unknown before S is finalized. Given the above discussion of visibility, we can have the mathematical expectation of V_S(C) in Definition 3:

Definition 3 (Expected Visibility) The expected visibility of a clique C with regard to a summary S, E[V_S(C)], is defined as

E[V_S(C)] ≜ 1 · Pr[C ∈ S] + V_S(C) · Pr[C ∉ S]    (3)

One may question that, before S is finally known, V_S(C) is unavailable to Formula (3), since this value relies on a materialization of S. However, we need to point out that such a V_S(C) does exist, albeit it is hard to know its value early. We will see under which conditions V_S(C) can be calculated without S being known in Section 4. Currently, we only need a lower bound of it, since we want to make sure the lower bound is sufficiently large, so that the expectation of V_S(C) is larger than a user-given threshold, implying that we want to find a summary with a good visibility expectation guarantee. Definition 4 defines this case:

Definition 4 (Expected τ-Visible Summary) A summary S is expected τ-visible iff

∀C ∈ M(G), E[V_S(C)] ≥ τ    (4)

where τ ∈ [0, 1] is a given threshold.
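To make Definitions 1 and 2 concrete, the visibility of a clique and the τ-visibility check can be sketched in a few lines of Python. This is our own minimal illustration, not code from [36]: cliques are plain vertex sets and the function names are ours.

```python
def visibility(C, S):
    """Definition 1: V_S(C) = max over C' in S of |C ∩ C'| / |C|."""
    C = set(C)
    return max((len(C & set(Cp)) / len(C) for Cp in S), default=0.0)

def is_tau_visible(S, max_cliques, tau):
    """Definition 2: S is tau-visible iff every maximal clique is
    witnessed by some clique of S with a ratio of at least tau."""
    return all(visibility(C, S) >= tau for C in max_cliques)
```

For example, with M(G) = [{1,2,3}, {2,3,4}] and S = [{1,2,3}], the clique {2,3,4} has visibility 2/3, so S is 0.6-visible but not 0.7-visible.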
In this paper, we focus on developing theories and algorithms for finding a good expected τ-visible summary. The key issue that we are going to address is how to keep/discard the enumeration branches to ensure that the finally found cliques can form a summary which is τ-visible and of a small size. Note that in an expected τ-visible summary, there may exist a clique which cannot be covered by any other clique in the summary with a factor of more than τ. However, we still aim for expected visibility rather than exact visibility, because (1) visibility itself already means that the summary is an approximation, hence there may be little gain in enforcing exact visibility; (2) the basic MCE algorithm (BK-MCE), which we will introduce in Section 2.1, is a depth-first search approach. Under expected visibility semantics, the great pruning power of a sampling approach can terminate search subtrees as early as possible, so that the exponential search space can be reduced significantly and the summary is promised to be concise with a sufficient quality guarantee. An algorithm serving exact visibility has to decide whether a search subtree can be discarded at a relatively late stage, and thus slows down the running time.

Next, we start with introducing an existing depth-first MCE procedure [5] for maximal clique enumeration in Section 2.1, and then explain how it can be modified to find an expected τ-visible summary by the state-of-the-art work [36] in Section 2.2. Important notations are listed in Table 1.

2.1 Maximal Clique Enumeration

The BK-MCE algorithm [5] (Algorithm 1) is a backtracking approach, which recursively calls procedure ProcMCE to grow the current partial clique by adding a new node from the candidate set until a maximal clique is
Table 1: Notations
Notation : Meaning
G(V, E) : the graph G with vertex set V and edge set E
G_T : the induced graph of vertex set T on graph G
C : a maximal clique
M(G) : the set of all maximal cliques in graph G
S : a summary, which is a subset of M(G)
V_S(C) : the visibility of maximal clique C w.r.t. summary S
E[V_S(C)] : the expectation of visibility V_S(C)
τ : the user-specified threshold
N(v) : the neighbor node set of node v
T : the candidate set in the BK-MCE algorithm
D : the candidate set whose elements should not be touched in the BK-MCE algorithm
v_p : the pivot in the BK-MCE algorithm
r : the local visibility, refers to Formula (5)
l̄ : the upper bound of the size of a maximal clique
r̲ : the lower bound of the local visibility
s(r̲) : the sampling function used in [36]
s_opt(r̲) : the conditionally optimal sampling function
𝒯 : a search subtree

found. Here, we denote all the neighbor nodes of node v by N(v). C is the current partial clique or configuration, which is still growing. T and D are candidate sets whose elements are common neighbors of C, while D only contains nodes which have been contained by some earlier output maximal cliques grown from the current C. Algorithm 1 takes graph G(V, E) as input and outputs all the maximal cliques in G. Initially it calls procedure ProcMCE(∅, V, ∅) (line 1). Then ProcMCE will be called recursively (line 8) until M(G) is generated. At every recursive stage, ProcMCE will first check whether T = ∅ and D = ∅ (line 3). If so, it means that there is no candidate node left, and therefore the current C is output as a maximal clique (line 4). If not, generally speaking, it will remove an arbitrary node v from T and add it into C. Then it recursively calls procedure ProcMCE(C ∪ {v}, T ∩ N(v), D ∩ N(v)). Here, T ∩ N(v) is the refined T obtained by deleting all nodes which are not neighbors of v, and the same for D ∩ N(v). It ensures that every node in T or D is a common neighbor of the current C.
Finally, since v is sure to be contained by some future cliques grown from C, v is added into D (lines 9-10). Note that a pivot v_p is chosen to avoid some branches which would generate the same maximal clique (line 5). This is because, from the current configuration, a maximal clique containing a node v that is a neighbor of v_p can be grown either from v_p or from a node u which is a neighbor of v but not of v_p.

Algorithm 1 BK-Maximal Clique Enumeration
Input: Graph G(V, E); Output: M(G);
1: Call ProcMCE(∅, V, ∅)
2: procedure ProcMCE(C, T, D)
3:   if T = ∅ and D = ∅ then
4:     Output C as a maximal clique; return
5:   Choose a pivot vertex v_p from (T ∪ D);
6:   T′ ← T \ N(v_p);
7:   for each v ∈ T′ do
8:     Call ProcMCE(C ∪ {v}, T ∩ N(v), D ∩ N(v));
9:     T ← T \ {v};
10:    D ← D ∪ {v};
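For reference, Algorithm 1 can be sketched compactly in Python. This is our own sketch, not the authors' implementation; the graph is assumed to be a dict mapping each vertex to its neighbor set, and the pivot is chosen greedily to maximize |N(v_p) ∩ T|.

```python
def bk_mce(graph):
    """Sketch of Algorithm 1: enumerate all maximal cliques.

    graph: dict mapping each vertex to the set of its neighbors."""
    cliques = []

    def proc_mce(C, T, D):
        if not T and not D:                # line 3: no candidate node left
            cliques.append(set(C))         # line 4: C is maximal
            return
        # line 5: pick a pivot covering as much of T as possible
        v_p = max(T | D, key=lambda u: len(graph[u] & T))
        for v in list(T - graph[v_p]):     # lines 6-7: skip neighbors of v_p
            proc_mce(C | {v}, T & graph[v], D & graph[v])  # line 8
            T.remove(v)                    # line 9
            D.add(v)                       # line 10

    proc_mce(set(), set(graph), set())
    return cliques
```

On the 4-vertex graph with edges {0,1}, {0,2}, {1,2}, {2,3}, the sketch returns the two maximal cliques {0,1,2} and {2,3}.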
Algorithm 2 Summarization by Sampling
Input: Graph G(V, E), threshold τ; Output: An expected τ-visible summary S;
1: S ← ∅, C′ ← ∅;
2: Call ProcRMCE(∅, V, ∅).
3: procedure ProcRMCE(C, T, D)
4:   if T = ∅ and D = ∅ then
5:     include C in S; C′ ← C; return
6:   Calculate l̄ and r̲;
7:   Keep the branch 𝒯 with probability s(r̲)^(1/l̄);
8:   if the branch 𝒯 is kept then
9:     Choose a pivot vertex v_p from (T ∪ D);
10:    T′ ← T \ N(v_p);
11:    for each v ∈ T′ do
12:      Call ProcRMCE(C ∪ {v}, T ∩ N(v), D ∩ N(v));
13:      T ← T \ {v};
14:      D ← D ∪ {v};

2.2 Summarization by Sampling

The work [36] exploits the following property of BK-MCE to find an expected τ-visible summary: recall that BK-MCE is a depth-first algorithm; it outputs M(G) in such an order that two maximal cliques share a large portion of common nodes if they are produced next to each other. We denote this property as locality. Let C′ be the last generated maximal clique which has been added into summary S. When a new clique C is generated, we can compare it with C′, rather than with every clique in S, to compute a local visibility r (Formula (5)). If r ≥ τ, discard C; otherwise, keep C. Such a deterministic strategy will guarantee to produce a τ-visible summary.

r = |C ∩ C′| / |C|    (5)

However, it would be desirable if we could discard a whole search branch with good confidence when we find that the branch has significant overlap with the last found clique C′. This leads to the idea of deliberately pruning some recursive sub-procedures with some probability; let us call it sampling. Meanwhile, we must guarantee that the summary has the expected visibility E[V_S(C)] ≥ τ, ∀C ∈ M(G).

Details about invoking a sampling method to give an expected τ-visible summary are shown in Algorithm 2. The key idea is to execute a sampling operation (lines 7-8) to determine whether the current new branch 𝒯 should be grown or not before entering a new procedure ProcRMCE(C, T, D) (line 12). In line 6, l̄ denotes an upper bound of the size of the next maximal clique C and r̲ denotes a lower bound of the local visibility r. As we have not found C yet, i.e., l (= |C|) and r are unknown, we can only estimate l̄ and r̲.
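A compact Python rendering of Algorithm 2 may clarify where the sampling fits. In this sketch of ours, the sampling function s, the upper-bound estimator est_l and the lower-bound estimator est_r are caller-supplied placeholders (the names are ours, not from [36]):

```python
import random

def summarize(graph, s, est_l, est_r):
    """Sketch of Algorithm 2: sampling-augmented clique enumeration.

    graph: dict vertex -> neighbor set; s: sampling function of r_low;
    est_l(C, T): upper bound on the size of a maximal clique grown from C;
    est_r(C, T, last): lower bound on the local visibility w.r.t. last."""
    S = []
    last = set()                           # C', the last clique added to S

    def proc_rmce(C, T, D):
        nonlocal last
        if not T and not D:
            S.append(set(C))               # line 5: include C in S
            last = set(C)
            return
        l_bar = max(est_l(C, T), 1)        # line 6: estimate the bounds
        r_low = est_r(C, T, last)
        # line 7: keep the branch with probability s(r_low)^(1/l_bar)
        if random.random() >= s(r_low) ** (1.0 / l_bar):
            return                         # prune this search subtree
        v_p = max(T | D, key=lambda u: len(graph[u] & T))
        for v in list(T - graph[v_p]):
            proc_rmce(C | {v}, T & graph[v], D & graph[v])
            T.remove(v)
            D.add(v)

    proc_rmce(set(), set(graph), set())
    return S
```

With s ≡ 1 the sketch degenerates to plain BK-MCE and outputs every maximal clique; plugging in Formula (6) below, or the function of Section 3, yields the corresponding sampled summaries.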
The sampling probability s(r̲)^(1/l̄) is designed as a function of l̄ and r̲. The work in [36] chose the probability function s() to be:

s(r̲) = (1 − r̲)(2 − τ) / (2 − r̲ − τ)    (6)

and proved that applying s(r̲) in Algorithm 2 can produce a summary with the expected visibility E[V_S(C)] ≥ τ, ∀C ∈ M(G). Due to the space limit, we briefly introduce the rationale of s(r̲)^(1/l̄). From Formula (6), s(r̲) is a decreasing function with range [0, 1]. This means that when the estimated r̲ becomes larger, the probability of keeping the current search branch becomes smaller. When l̄ is estimated larger, this implies that we will call the recursion more times, hence the probability of keeping the current search branch is made larger. Algorithm 1 and Algorithm 2 follow the clique enumeration paradigm, so their time complexities are both bounded by O(3^(|V|/3)), because a |V|-vertex graph has at most 3^(|V|/3) maximal cliques [19]. Algorithm 2 should be practically faster due to early prunings, but has the same complexity in the worst case when τ = 1.

3 A Novel Sampling Function

Expected τ-visible summaries are not unique. Apparently, the more concise (smaller) a summary is, the better the summary is. Three questions arise naturally:
(1) Are there any better sampling strategies?
(2) What kind of sampling strategy is optimal?
(3) If finding the optimal is difficult, how can we provide the best effort?
We will address question (1) in this section and discuss questions (2) and (3) in Section 4 and Section 5 respectively. In Section 3.1, we explain why we consider there should exist a better sampling function; then we introduce the new sampling function and prove its superiority in Section 3.2.

3.1 Intuition

Our new sampling strategy is based on the following two observations:

Observation 1:
In Formula (6), when r̲ ∈ [τ, 1), we have s(r̲) > 0. This means that even if we know the newly generated clique is τ-visible with respect to the current S, there is still a positive probability to add it into the summary. Thus S will be more redundant because of these unnecessary cliques. A better strategy is to set s(r̲) = 0 in such cases, which means not to add these cliques at all.

Observation 2:
In Formula (6), when r̲ = 0, we have s(r̲) = 1. This means that once we find a maximal clique whose nodes are totally new to the current summary, we add it into S without hesitation. This seems reasonable; however, there is still some possibility for this brand new clique to be covered by some future cliques. Moreover, since we are looking for an expected τ-visible summary, we have the option not to include a brand new clique as long as the final summary is expected τ-visible. In other words, it is safe to add the brand new clique with a certain probability. Cases where r̲ is in (0, τ) are similar.

3.2 Sampling Function s_opt(r̲)

Following the observations in Section 3.1, we give a new sampling function s_opt(r̲) in Formula (7).

s_opt(r̲) = (τ − r̲) / (1 − r̲), if r̲ ∈ [0, τ);  0, if r̲ ∈ [τ, 1].    (7)

The sampling function s_opt(r̲) implies: if r̲ ∈ [τ, 1], we discard the current branch; otherwise, we keep it with a probability based on (τ − r̲)/(1 − r̲). The rationale of setting (τ − r̲)/(1 − r̲) will be shown in Theorem 2. Next, we prove that, compared with s(r̲), s_opt(r̲) is a better function. This means that we need to prove: (1) s_opt(r̲) samples with a lower probability (in Theorem 1); and (2) s_opt(r̲) can produce an expected τ-visible summary (in Theorem 2).

Theorem 1 s_opt(r̲) samples with a lower probability than s(r̲), i.e.,

s_opt(r̲) ≤ s(r̲), ∀r̲ ∈ [0, 1]    (8)

The equality holds iff r̲ = 1.

Proof We show that this inequality holds for r̲ ∈ [0, τ) and r̲ ∈ [τ, 1] separately:
- if r̲ ∈ [τ, 1], s_opt(r̲) = 0 and s(r̲) ≥ 0, where the equality holds iff r̲ = 1 (corresponding to s(r̲) = 0).
- if r̲ ∈ [0, τ), since τ ∈ [0, 1] and r̲ ≠ 1, we have

s(r̲) − s_opt(r̲) = (1 − r̲)(2 − τ) / (2 − r̲ − τ) − (τ − r̲) / (1 − r̲)
               = [(1 − r̲)²(2 − τ) − (2 − r̲ − τ)(τ − r̲)] / [(2 − r̲ − τ)(1 − r̲)]
               = (1 − τ)[(1 − r̲)² + (1 − τ)] / [(2 − r̲ − τ)(1 − r̲)] > 0.

Before using s_opt(r̲) in Algorithm 2 to generate a summary S, we have to prove that this S is indeed expected τ-visible:

Theorem 2
Algorithm 2, with sampling function s_opt(r̲), can produce an expected τ-visible summary.

Proof First, we give the probability of a maximal clique C being added into S. Then we calculate the expected visibility E[V_S(C)] and show that it is no less than τ. Recall that every time before Algorithm 2 starts a new search subtree 𝒯_i, line 6 will compute a new pair of l̄ and r̲. We denote them by l̄_i and r̲_i, where 1 ≤ i ≤ k and k = |C| is the size of the maximal clique to be grown. Since every l̄_i is an upper bound of k and every r̲_i is a lower bound of r, together with the monotonicity of s_opt(r̲), we have

Pr[C ∈ S] = Π_{1≤i≤k} Pr[𝒯_i is kept]
          = Π_{1≤i≤k} s_opt(r̲_i)^(1/l̄_i)
          ≥ Π_{1≤i≤k} s_opt(r̲)^(1/k)
          = s_opt(r̲)    (10)

If C is not included in S, its visibility V_S(C) should be no less than the local visibility r̲; if C is included in S, V_S(C) = 1. Now we can calculate the expectation of V_S(C):

E[V_S(C)] ≥ 1 · Pr[C ∈ S] + r̲ · Pr[C ∉ S]
          ≥ s_opt(r̲) + r̲ · (1 − s_opt(r̲))    (11)

We consider the two cases r̲ ∈ [0, τ) and r̲ ∈ [τ, 1] separately:
- if r̲ ∈ [τ, 1], since s_opt(r̲) = 0, E[V_S(C)] ≥ r̲ · (1 − 0) = r̲ ≥ τ.
- if r̲ ∈ [0, τ), E[V_S(C)] ≥ (τ − r̲)/(1 − r̲) + r̲ · (1 − (τ − r̲)/(1 − r̲)) = τ.
Combining these two cases, we complete the proof.

Summary:
Theorem 1 and Theorem 2 jointly show that s_opt(r̲) is a valid sampling function and is better than s(r̲).

4 Optimality Analysis

In this section, for the purpose of analyzing the optimality of the sampling function, we show what kinds of conditions should be satisfied. We prove the optimality of s_opt(r̲) under such conditions and further explain why the performance of s_opt(r̲) is good even without the conditions being fully satisfied.

4.1 Conditions for Optimality Analysis

Since we can find a better sampling function s_opt(r̲), another question comes out naturally: with the restriction of expected τ-visibility, does an optimal sampling function (even better than s_opt(r̲)) with the smallest probability exist?

In the proof of Theorem 2, we can only prove the expectation E[V_S(C)] ≥ τ. Intuitively, the smaller the sampling probability is, the smaller the expectation is. It is hard to determine whether a sampling function is optimal because of lacking information on how loose the inequality is. If we intend to analyze the optimality of any function, we need to tighten this inequality to be an equation first. Now we show under what conditions the theoretical analysis of optimality can be feasible. In the proof of Theorem 2, we relax E[V_S(C)] twice.

The first inequality sign of Formula (11) is derived from Formula (10). Algorithm 2 implements the sampling operation in each recursive procedure using the probability s_opt(r̲_i)^(1/l̄_i). Since l̄_i and r̲_i are an upper bound of k and a lower bound of r respectively, we have s_opt(r̲_i)^(1/l̄_i) ≥ s_opt(r̲)^(1/k), thus Pr[C ∈ S] ≥ s_opt(r̲). Now we have two approaches to eliminate this inequality. The first approach is that, if we want to analyze the property of the sampling function itself, we need to set the other factors ideal. This means that if we do not care about the details of how to calculate l̄_i, we can assume that this upper bound is ideal, so we have l̄_i = l, and the same for r̲_i.
Note that this hypothesis is made only for the purpose of analyzing the function theoretically, not for implementing Algorithm 2 in practice. With this assumption, we have Pr[C ∈ S] = s_opt(r̲). The second approach is to modify the sampling procedure: let us sample with the probability s_opt(r̲) only after a complete maximal clique is generated, rather than sample each time a new node is grown. If so, it is obvious that Pr[C ∈ S] = s_opt(r̲). One may argue that it is meaningless to do sampling once a maximal clique is found. It is true that if we add every clique whose r is no more than τ into the summary, this summary is strictly τ-visible. However, as we explained in Section 3, this would introduce more redundancy to S. In some applications, we only need this summary to be expected τ-visible, so this one-step sampling procedure is significant for giving such a concise S.

The second inequality sign of Formula (11) comes from the definition of r̲. Due to the locality of Algorithm 2, we use r̲ to replace the real visibility, which should be no less than r̲. Now we need to make the assumption that such locality is sufficiently strong (by which we mean that two similar cliques should be produced consecutively), so that r̲ is indeed the visibility defined in Formula (1). In practice, we do not need to enforce such strong locality to implement Algorithm 2. We introduce this hypothesis only for theoretical consideration, which means we only need this assumption to construct a framework under which we can analyze the optimality of sampling functions.
Once this hypothesis is made, the second inequality becomes an equation. Now we can modify Formula (11) to be an equation:

E[V_S(C)] = s_opt(r̲) + r̲ · (1 − s_opt(r̲))    (12)

if the following two conditions are satisfied:
- The bounds l̄ and r̲ are ideal, or we only do sampling when a full maximal clique is generated.
- The property of locality is strong enough for r̲ to be the real visibility.

4.2 Optimality of s_opt(r̲)

One may expect that the optimality should be defined as minimizing the expected cardinality of the summary. However, we notice that such a strong version of optimality requires the sampling function s(r̲) to be a function of the distribution ρ(r̲) of r̲, rather than simply a function of r̲. Accordingly, we have to report that it is not easy to give a proper definition for ρ(r̲). This is due to the following fact: ρ(r̲) cannot be known until the summary has been found. This fact holds because the r value of a maximal clique C depends on which maximal clique C′ it is compared with: once C′ changes, the value of r will change accordingly, and the distribution ρ(r̲) will also be disturbed thereafter. Since a nondeterministic (sampling) algorithm can hardly know the exact predecessor C′ of each maximal clique unless the algorithm is finalized, it may be impossible to define ρ(r̲) unless the summary is fully determined. Thus ρ(r̲) not only is data-dependent, but also relies on the sampling function s(r̲). Since s(r̲) should in turn rely on ρ(r̲), it is likely to be hard to give ρ(r̲) a proper independent definition.

For the above reasons, we choose not to seek a distribution-related sampling function with the strong version of optimality, but rather define the optimality as: given the r̲ value of a maximal clique C, sampling C with the lowest probability while still promising τ-visibility of the summary. We consider this definition more manipulatable theoretically and applicable practically. This notion of optimality indeed shows its effectiveness in the experimental studies (see Section 6). Now we can analyze the optimality of s_opt(r̲) in the framework introduced in Section 4.1.

Theorem 3
If the two conditions in Section 4.1 are satisfied, s_opt(r̲) is optimal for Algorithm 2.

Proof We show that if there exists a sampling function s′(r̲) such that ∀r̲ ∈ [0, 1], s′(r̲) ≤ s_opt(r̲), and for at least one point r̲ there is s′(r̲) < s_opt(r̲), then such a function cannot be used to generate an expected τ-visible summary. Note that s_opt(r̲) = 0 when r̲ ∈ [τ, 1], so such a point r̲ cannot be in this range, since a valid probability should be nonnegative. So we have r̲ ∈ [0, τ), and

E[V_S(C)]|_r̲ = s′(r̲) + r̲ · (1 − s′(r̲))
            < s_opt(r̲) + r̲ · (1 − s_opt(r̲))
            = (τ − r̲)/(1 − r̲) + r̲ · (1 − (τ − r̲)/(1 − r̲))
            = τ    (13)

This means that the summary generated by s′(r̲) cannot be expected τ-visible.

Theorem 3 is a conditional theoretical guarantee for the optimality of s_opt(r̲). Note that even if in general cases these two strong conditions are not fully satisfied, Theorem 3 is still useful for the practical implementation of Algorithm 2. That means if the bounds l̄ and r̲ are well estimated and the property of locality is strong, the inequalities in Formula (11) can be very tight. In such cases, s_opt(r̲) can still show good performance.

One may be concerned that it is not clear to what extent we can achieve the good performance of s_opt(r̲) in practice by (1) tightening the bounds; (2) strengthening the locality. In the next section, we address the first concern by reviewing two existing bounds and proposing a new one which outperforms the other two by large margins. For the second concern, we show that stronger locality can be achieved by reordering vertices carefully. We will review an existing vertex order and design a better new one.
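The gap between the two sampling functions, and the tightness of Formula (12), are easy to verify numerically. Below is a small self-check of ours in Python, with an assumed threshold τ = 0.8:

```python
def s_existing(r, tau):
    """Formula (6): the sampling function of [36]."""
    return (1 - r) * (2 - tau) / (2 - r - tau)

def s_opt(r, tau):
    """Formula (7): the proposed sampling function."""
    return (tau - r) / (1 - r) if r < tau else 0.0

tau = 0.8                             # assumed threshold for the check
for i in range(100):
    r = i / 100.0
    # Theorem 1: s_opt never exceeds s
    assert s_opt(r, tau) <= s_existing(r, tau) + 1e-12
    # Formula (12): under the ideal conditions, the expected visibility
    # achieved by s_opt is exactly tau when r < tau
    if r < tau:
        ev = s_opt(r, tau) + r * (1 - s_opt(r, tau))
        assert abs(ev - tau) < 1e-9
```

The loop passes for any τ in (0, 1), mirroring the case analysis in the proofs of Theorems 1 and 2.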
In this section, we show how to approach good per-formance of the new sampling strategy by tighteningbounds (Section 5.1) and by reordering vertices (Sec-tion 5.2).5.1 Bound AnalysisThe first inequality of (11) is derived from bound esti-mation. Note that Formula (14) we use to calculate thelower bound r is the same as that introduced in [36]: r = min ≤ t ≤ d | C ∩ C (cid:48) | + max { t − y t , }| C | + t (14)where C (cid:48) is the previous maximal clique added into S ; t is the number of vertices to be used for growingthe partial configuration C into a full maximal clique; d , which satisfies l = | C | + d , is the upper bound of t ; and given y t out of the t vertices are not coveredby C (cid:48) , y t is an upper bound of y t . Formula (14) canbe understood in this way: suppose we know that thecurrent partial configuration C still needs t vertices togrow into a full maximal clique P , then the dominator | C | + t is the size of P . Since y t means that at most y t out of t vertices in P \ C are not contained by C (cid:48) ,this means that at least t − y t are covered by C (cid:48) . Thusmax { t − y t , } is a lower bound of the size of ( P \ C ) ∩ C (cid:48) .(The max operator is inserted here because dependingon the estimation method of y t , t − y t may be negative.)Now we see that the two parts of the numerator are | C ∩ C (cid:48) | and a lower bound of | ( P \ C ) ∩ C (cid:48) | respectively,thus the sum is a lower bound of | P ∩ C (cid:48) | . Combining thediscussions above, the whole fraction is the very lowerbound of r ≡ | P ∩ C (cid:48) | / | P | . Since we lack information ofthe exact value of t , we have to enumerate all possible t in [0 , d ] and choose the minimum as the lower bound. y t can be estimated as | T \ C (cid:48) | , or simply the value of t ,or the number of vertices in T \ C (cid:48) whose degrees are atleast t − y t vertices should be containedby a t -clique). 
We see here that the upper bound d̄ (= l − |C|, where |C| is known) is used to estimate r, and the fraction after the min_{0 ≤ t ≤ d̄} operator has nothing related to d̄ (because it is calculated after t is given), so the quality of r is determined by the tightness of d̄ (or l). Thus, in the following, we focus on estimating d̄.

One valid and tight bound of d̄ is the size of the maximum clique in the candidate set T; however, finding such a maximum clique is itself a clique problem of exponential time. As a result, we should consider realistic bounds instead. In the following, we review two bounds that were discussed in the previous work [36], and then we propose a new one to further improve the effectiveness of s_opt(r).

Let G_T be the induced graph of the candidate set T on graph G. Two existing upper bounds of d̄ are:

- H bound, denoted by d̄_h, is the maximum h such that there exist at least h vertices in G_T whose degrees are no less than h − 1. The maximum clique size can be bounded by h because if there exists a k-clique, there must also exist at least k vertices in G_T whose degrees are no less than k − 1. Therefore h ≥ k holds for all possible k-cliques, including the maximum clique.

- Core bound, denoted by d̄_core = Core(G_T) + 1, where Core(G_T) denotes the maximum core number in G_T. We first review the definitions of k-core [26] and core number.

Definition 5 (k-core) The k-core of a graph G is the largest induced subgraph in which the degree of each vertex is at least k.

Definition 6 (Core Number) The core number of graph G, denoted as Core(G), is the largest k such that a k-core is contained in G.

The core number can serve as an upper bound because the k-core constraint is weaker than the k-clique constraint: a k-clique must be a (k−1)-core, while a (k−1)-core may not be a k-clique. Thus Core(G_T) + 1 is no less than the maximum clique size in G_T. Now we define our newly proposed bound.

Definition 7 (Truss bound) The truss bound, denoted by d̄_truss, is the maximum truss number Truss(G_T) in G_T.

We now review the definitions of k-truss and truss number, and then explain why the maximum truss number Truss(G_T) is a valid upper bound.

Definition 8 (k-truss) The k-truss of a graph G is the largest induced subgraph in which each edge is contained in at least k − 2 triangles.

Definition 9 (Truss Number) The truss number of graph G, denoted as Truss(G), is the largest k such that a k-truss is contained in G.

d̄_truss is an upper bound of the size of the maximum clique. This is because a k-clique with the maximum k is also a k-truss, since each edge in a k-clique is contained in exactly k − 2 triangles within the clique.

These three bounds satisfy the following inequality:

d̄_h ≥ d̄_core ≥ d̄_truss    (15)

The first inequality holds because the H bound does not enforce the h vertices to be connected, while the core bound does. The second inequality comes from the fact that a k-truss must be a (k−1)-core: each vertex incident to an edge e is incident to no less than k − 1 edges (the edges of the k − 2 triangles through e, plus e itself), since e is guaranteed to be involved in at least k − 2 triangles.

The time complexities of computing the three bounds are:

d̄_h: O(V_T);   d̄_core: O(E_T);   d̄_truss: O(E_T^1.5)    (16)

where V_T and E_T are the vertex set and edge set of the induced graph G_T respectively. The induced graph G_T can be constructed when selecting the pivot, thus its construction does not incur an extra cost. For d̄_h, when constructing G_T, we can maintain a V_T-length array to record the number of vertices at each degree value. This can be done in O(V_T). Then the H-value can be found by scanning this array from tail (where the counts of vertices with higher degree values are stored) to head (where the counts of vertices with lower degree values are stored) until h vertices whose degrees are no less than h − 1 are found, which again takes O(V_T). For d̄_core, an O(E_T) core decomposition [13] is needed after G_T is found. For d̄_truss, the truss decomposition takes O(E_T^1.5) to find the maximum truss number [35].

We see that the truss bound is the tightest among the three, and therefore it promises the best performance in terms of effectiveness. The intrinsic reason is that the structure of a truss is more compact (or cohesive) than those of the other two. (This property of compactness can also be used to design vertex orders to enhance the locality; we will give a detailed discussion in Section 5.2.)
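The three bounds and Inequality (15) can be illustrated with straightforward peeling implementations. This is a simplified sketch under our own naming, operating on an adjacency-set dictionary; real implementations use the bucketed decompositions that achieve the complexities in (16).

```python
# Hedged sketches of the three upper bounds of Section 5.1 on a small graph.
# Input: adjacency dict {vertex: set of neighbors}; vertices are integers.

def h_bound(adj):
    # Maximum h with at least h vertices of degree >= h - 1.
    degs = sorted((len(n) for n in adj.values()), reverse=True)
    h = 0
    for i, d in enumerate(degs, start=1):
        if d >= i - 1:
            h = i
    return h

def core_bound(adj):
    # Core(G_T) + 1 via min-degree vertex peeling (core decomposition).
    adj = {v: set(n) for v, n in adj.items()}
    max_core = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        max_core = max(max_core, len(adj[v]))
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return max_core + 1

def truss_bound(adj):
    # Truss(G_T) via min-support edge peeling (truss decomposition sketch).
    adj = {v: set(n) for v, n in adj.items()}
    max_truss = 2
    while True:
        edges = [(u, v) for u in adj for v in adj[u] if u < v]
        if not edges:
            return max_truss
        u, v = min(edges, key=lambda e: len(adj[e[0]] & adj[e[1]]))
        max_truss = max(max_truss, len(adj[u] & adj[v]) + 2)  # support + 2
        adj[u].discard(v); adj[v].discard(u)

# A 5-cycle has maximum clique size 2: the truss bound is exact here, while
# the H bound and core bound both overestimate, matching Inequality (15).
C5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
assert h_bound(C5) == 3 and core_bound(C5) == 3 and truss_bound(C5) == 2
assert h_bound(C5) >= core_bound(C5) >= truss_bound(C5)
```

On the 5-cycle the chain in (15) is strict at the truss bound, illustrating why a tighter d̄ feeds a better lower bound into Formula (14).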
Users may have their own preferences in balancing the running time and the summary size; thus which bound to select depends on to what extent the effectiveness can be improved by sacrificing efficiency. In Section 6, we conduct experimental studies to compare the practical performance of different bounds in terms of both effectiveness and efficiency.

5.2 Locality Analysis

Strong locality implies that two similar cliques should be produced consecutively. This means that, for a new clique C, the local visibility computed with the previous output clique C' should be close to the global visibility, which is computed with the most similar clique to C in the summary. However, such a condition is difficult to meet exactly. Reflected in practice, one typical implementation is the vertex order we follow to grow the current partial clique. An effective vertex order with strong locality should have the property that each candidate set T of the current configuration C has a sufficiently compact structure. Here, by compact (or cohesive) we mean that the nodes of a candidate set are well-connected with each other, so that cliques in this set have a higher probability of overlapping.

One question arises: in the outer recursion level of BK-MCE, since the neighbor set N(v) of the only vertex v in the current partial clique C = {v} is uniquely determined by the graph G(V, E), why do we still expect a particular structure in the candidate set of {v}? The answer is that if we implement a fixed vertex order to grow cliques, then when we include v into C, all the neighbors of v which precede it in the order can be safely moved into set D. The key point is that the difference between N(v) and the candidate set of {v} is determined by the particular order we choose, which leaves us the very opportunity to reshape the structure of the candidate set. The same holds for each level of the recursion.

Now we see that strong locality can be achieved by reordering vertices.
In the following, we explain why degeneracy order can be employed to achieve this goal, even though its initial purpose was to bound the time complexity of BK-MCE [12]. Then we propose a novel truss order based on truss decomposition to further enhance locality. We begin with the definition of degeneracy.

Definition 10 (Degeneracy)
Given a graph G(V, E), the degeneracy of G is the smallest value d such that every subgraph of G contains a vertex whose degree is no more than d.

Degeneracy is naturally related to the special vertex order below.

Definition 11 (Degeneracy Order)
The vertices of a d-degeneracy graph have a degeneracy order, in which each vertex v has only d or fewer neighbors after itself.

Degeneracy order can be formed by repeatedly deleting the minimum-degree vertex, with all its edges, from the current subgraph. Note that this is actually the core decomposition procedure [13]; thus this order sorts vertices by core number from low to high.

The reason why this order can be used to enhance locality is straightforward. We explain it by focusing on the particular scene during the MCE procedure where a vertex v is being moved from the candidate set T to the partial clique C. This v and all vertices of G that are ordered after v in the degeneracy order induce a subgraph G'. By the construction of the degeneracy order, we know v is the minimum-degree vertex in G'; denoting its degree by d(v), G' is thus a d(v)-core. Since the candidate set T is a subset of G', we reach the conclusion that T is contained in a d(v)-core, which is our desired compact structure with strong locality. Although existing works [12] [24] studied using degeneracy to speed up MCE in terms of running time, to the best of our knowledge, our work is the first to exploit the degeneracy order to strengthen locality for the purpose of reducing overlapping cliques.

To further enhance the locality, we notice that the key to locality is to guarantee that the candidate set T is contained in a compact structure, e.g., a k-core. Hence, if we can find a novel vertex order with a stronger guarantee, e.g., that T is covered by a k-truss, then we can foresee that the effectiveness will outperform that of the degeneracy order. Following this intuition, we carefully inspect the relationship between core decomposition and degeneracy order, and find that such a relationship also applies to truss decomposition and a new vertex order (truss order).

Table 2: Statistics of datasets

Name | |V| | |E| | Cliques | τ-RMCE-TU [τ = 0.5 / 0.9] | τ-R+MCE-TU [τ = 0.5 / 0.9]

Definition 12 (Truss Order)
Vertices sorted by truss order satisfy the following property: if k is the maximum value such that there exists a k-truss containing vertex v, then all the vertices ordered after v are also contained in that same k-truss.

Truss order can be formed during the procedure of truss decomposition. We first delete the edge (u, v) which is contained in the least number of triangles (this number is called the support of the edge). After (u, v) is removed, the support of every edge that formed a triangle with (u, v) decreases by 1. The procedure repeats until all the edges are removed. The order in which vertices are peeled off from G is then a valid truss order, because this order sorts each vertex by the maximum k such that there exists a k-truss containing it. The same analysis of why the degeneracy order enhances locality applies to the truss order: by Definition 12, the candidate set T is guaranteed to be contained in a k-truss. Since what we desire is a compact structure for the candidate set, a k-truss is apparently more favorable than a k-core.

Note that the concept of locality is goal-driven and the extent of locality is output-determined. Hence, instead of giving a formal theoretical analysis, which we found difficult, we use extensive experiments to illustrate the effect of three types of vertex orders on the output size. We will report experimental results in Section 6 to compare the performance of these two vertex orders, with random order as a baseline, in terms of both effectiveness and efficiency.

6 Experiments

In this section, we look into three research questions by experiments. (1) To what extent can the summary size and running time be reduced by τ-R+MCE vs. τ-RMCE? (2) To what extent can the effectiveness of τ-R+MCE be further improved by our newly proposed truss order and truss bound? (3) To what extent do our newly designed truss order and bound affect the efficiency (both running time and memory requirement)?
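Before turning to the experiments, the two peeling orders analyzed in Section 5.2 (and compared experimentally below) can be sketched as follows. This is a simplified illustration under our own naming; production implementations use bucket queues to reach the decomposition complexities stated earlier.

```python
# Simplified sketches of the two vertex orders of Section 5.2.
# Input: adjacency dict {vertex: set of neighbors}; vertices are integers.

def degeneracy_order(adj):
    # Repeatedly peel the minimum-degree vertex (core decomposition order).
    adj = {v: set(n) for v, n in adj.items()}
    order = []
    while adj:
        v = min(adj, key=lambda u: (len(adj[u]), u))   # deterministic tie-break
        order.append(v)
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return order

def truss_order(adj):
    # Repeatedly peel the minimum-support edge (truss decomposition order);
    # a vertex enters the order once it becomes isolated.
    adj = {v: set(n) for v, n in adj.items()}
    order = []
    while any(adj.values()):
        edges = [(u, v) for u in adj for v in adj[u] if u < v]
        u, v = min(edges, key=lambda e: len(adj[e[0]] & adj[e[1]]))  # support
        adj[u].discard(v); adj[v].discard(u)
        for w in (u, v):
            if w in adj and not adj[w]:
                order.append(w)
                del adj[w]
    order += sorted(adj)                # any initially isolated vertices
    return order

# A triangle {0,1,2} with a pendant vertex 3: both orders peel 3 first, since
# it has the lowest core number and its edge has the lowest support.
g = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
assert degeneracy_order(g) == [3, 0, 1, 2]
assert truss_order(g)[0] == 3
```

Both sketches emit vertices from the least to the most cohesive region, which is exactly the property exploited to keep each candidate set inside a k-core (respectively, a k-truss).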
For short, we denote the τ-visible MCE algorithm [36] by τ-RMCE and ours by τ-R+MCE. All algorithms are implemented in C++ and tested on a MacBook Pro with 16GB memory and an Intel Core i7 2.6GHz CPU. We evaluated both effectiveness (in terms of summary size) and efficiency (in terms of first-result time, total running time and total memory requirement) with τ varying from 0.5 to 0.9. τ-R+MCE and τ-RMCE were implemented with three types of bounds (truss bound (T), core bound (C), H bound (H)) as well as three vertex orders (truss order (U), degeneracy order (I), random order (R)). Details are shown in Table 3. All results were reported as an average of five runs.

Table 3: Notations

Setting | Meaning
T, H, C | Truss bound, H bound, Core bound
U, I, R | Truss order, Degeneracy order, Random order

Fig. 1: Summary size of τ-R+MCE and τ-RMCE on eight datasets ((a) soc-Epinions1, (b) loc-Gowalla, (c) amazon0302, (d) email-EuAll, (e) web-NotreDame, (f) com-youtube, (g) soc-pokec, (h) cit-Patents) with τ varied from 0.5 to 0.9, T and U as default

Datasets
We use eight real-world datasets from different domains with various data properties to show the robustness of our algorithms. To provide a more comprehensive comparison with τ-RMCE, we also test the algorithms on the dataset cit-Patents, which contains the largest number of vertices among the datasets used by the work [36]. Details are shown in Table 2. For each dataset, we denote by |V| the number of vertices, by |E| the number of edges, and by Cliques the total number of maximal cliques as a reference. The 5th and 6th columns denote the fraction summary size / total number of maximal cliques for τ-RMCE-TU and τ-R+MCE-TU with the best configuration (truss bound (T) and truss order (U)) respectively. The percentages before and after "/" are the values at τ = 0.5 and τ = 0.9. For example, 18.1%/77.9% means that the sizes of the summaries produced by τ-RMCE occupy 18.1% and 77.9% of the total number of maximal cliques at τ = 0.5 and τ = 0.9 respectively. All datasets used in this paper can be found in the Stanford Large Network Dataset Collection, available at http://snap.stanford.edu/data/index.html.

6.1 Effectiveness

We first compare the summary sizes of τ-RMCE and τ-R+MCE in Section 6.1.1 (both with T bound and U order as default). To see to what extent our proposed truss bound and truss order benefit effectiveness, we implemented τ-RMCE and τ-R+MCE with three orders (U, I, R, bound T as default) in Section 6.1.2, and with three bounds (T, C, H, order U as default) in Section 6.1.3.
We implemented τ-RMCE and τ-R+MCE with the best configurations using the truss bound and truss order, denoted by τ-RMCE-TU and τ-R+MCE-TU respectively. The results are shown in Fig. 1. We see that τ-R+MCE-TU consistently outperforms τ-RMCE-TU on all datasets with all the τ values.

When τ = 0.9, τ-R+MCE-TU reduces the output cliques by more than 50% vs. τ-RMCE-TU on all datasets; three of them (Fig. 1f, 1g, 1h) achieve 70%, and two of them (Fig. 1d, 1e) even achieve more than 80%. When τ decreases, the difference is more dramatic, i.e., the percentage of reduction monotonically increases. At τ = 0.5, the reduction for all datasets is more than 70%; two of them (Fig. 1a, 1b) reach 85%, and five of them (Fig. 1d, 1e, 1f, 1g, 1h) reach 90%, reducing the summary size by more than one order of magnitude. This monotonically increasing trend implies that the advantage of τ-R+MCE-TU over τ-RMCE-TU grows as τ decreases. This is because, for a small threshold, τ-RMCE includes more unnecessary cliques, whose visibilities are greater than τ, into the summary with high probabilities, which confirms our intuition in Section 3 that s_opt(r) should be set to 0 for r ∈ [τ, 1]. For a clique C whose visibility is close to 0, the sampling function s(r) of τ-RMCE forces it to output C immediately, while τ-R+MCE considers the potential that C may be covered by some future cliques, and thus outputs such a clique more carefully, with a proper probability.

To show the robustness of our proposed method, we tested the algorithms on eight real-world datasets with different scales. The results show that τ-R+MCE achieves relatively better performance on large graphs. For the convenience of our discussion, we now focus on the results at τ = 0.5. We see that all five datasets (Fig. 1d, 1e, 1f, 1g, 1h) that have more than 90% reductions are the top five largest graphs among the eight datasets. This implies that our proposed method is more capable of handling contemporary large-scale graphs than the state-of-the-art approach.

Fig. 2: Summary size of τ-R+MCE and τ-RMCE on eight datasets ((a) soc-Epinions1, (b) loc-Gowalla, (c) amazon0302, (d) email-EuAll, (e) web-NotreDame, (f) com-youtube, (g) soc-pokec, (h) cit-Patents) with different orders, τ varies from 0.5 to 0.9, T bound as default

Fig. 3: Average r of τ-R+MCE and τ-RMCE on eight datasets with different orders, τ varies from 0.5 to 0.9, T bound as default
To see to what extent the performance of τ-R+MCE can be further improved by employing a vertex order with strong locality, we implemented τ-R+MCE and τ-RMCE with three types of orders: random order (R), degeneracy order (I) and truss order (U). The default bound was set as the truss bound (T). The results are shown in Fig. 2.

Fig. 2 shows that the truss order consistently outperforms the degeneracy order and the random order for both τ-R+MCE and τ-RMCE, while generally the degeneracy order is superior to the random order, with one exception (τ-R+MCE on loc-Gowalla at one τ value). We now focus on τ-R+MCE. Generally, τ-R+MCE-TI is superior to τ-R+MCE-TR for all τ values on 7 out of 8 datasets (except web-NotreDame), while the reduction percentage (around 10%) is relatively small. In contrast, the improvement of τ-R+MCE-TU vs. τ-R+MCE-TI is much more dramatic: at τ = 0.5, the reduction percentage varies from 33% (soc-pokec) to 83% (com-youtube), and 5 out of 8 datasets achieve more than 50% (except soc-Epinions1, amazon0302, soc-pokec).

To further show how different orders help our approach get close to the optimum, we calculate the actual average r for each dataset w.r.t. different τ values. We take every maximal clique generated by MCE into consideration: if a search subtree would be pruned at a certain stage, we keep searching the subtree to calculate r of the discarded cliques, while disallowing these cliques from being added into the summary S. The results are shown in Fig. 3. This set of experiments shows that, with τ decreasing from 0.9 to 0.5, the average visibility r of τ-RMCE changes very mildly and stays high. Especially on loc-Gowalla, the lines of τ-RMCE overlap with each other, which implies that different types of vertex orders may have limited impact on the state-of-the-art approach. However, the three lines of τ-R+MCE drop sharply as τ decreases. τ-R+MCE-TR shows the worst performance among the three tested vertex orders, with its line dropping from around 0.95; τ-R+MCE-TI shows better effectiveness (dropping from around 0.95 to 0.55) than τ-R+MCE-TR; and τ-R+MCE-TU shows the best performance (dropping from around 0.93), stepping forward to the optimal reference line with only a narrow gap between them.

Fig. 2 and Fig. 3 confirm our assumption that the effectiveness of τ-R+MCE can be further improved by properly reordering vertices. The newly designed truss order significantly outperforms the degeneracy order by a large margin, due to the strong locality provided by the cohesiveness of the k-truss.

To see to what extent the effectiveness of τ-R+MCE can be further improved by employing a tight bound, we implemented τ-R+MCE and τ-RMCE with three different bounds: H bound (H), core bound (C) and truss bound (T). The truss order (U) was set as the default. The results are shown in Fig. 4.

We see that for both τ-R+MCE and τ-RMCE, the effectiveness consistently follows this order: T outperforms C, and C outperforms H. When we focus on τ-R+MCE, the results show that τ-R+MCE-CU reduces the summary size vs. τ-R+MCE-HU by less than 10% for all τ values on all datasets. However, the reduction between τ-R+MCE-TU and τ-R+MCE-CU ranges from 21% to 43%. At τ = 0.
5, the percentage achieves more than 30% for four out of eight datasets (the exceptions being email-EuAll, web-NotreDame, com-youtube and cit-Patents).

Fig. 5 reports the average r for τ-RMCE and τ-R+MCE with different bounds. The results are similar to Fig. 3. We see that the three lines of τ-RMCE stay much closer to each other for all datasets, and the line of τ-RMCE-TU keeps slightly lower than the other two. This implies that various bounds can provide only limited improvements to the existing sampling approach. For τ-R+MCE, in contrast, we see that a better bound helps the three lines move towards the required threshold by a significant margin. The gap between τ-R+MCE-TU and τ-R+MCE-CU is much more dramatic than that between τ-R+MCE-CU and τ-R+MCE-HU, which confirms the superiority of the newly applied truss bound T.

Fig. 4 and Fig. 5 confirm that the effectiveness of τ-R+MCE can be further improved by employing tight bounds. Although the benefit brought by good bounds is inferior to that brought by vertex orders with strong locality, our proposed truss bound still surpasses the state-of-the-art core bound by a significant margin.

Fig. 4: Summary size of τ-R+MCE and τ-RMCE on eight datasets with different bounds, τ varies from 0.5 to 0.9, U order as default

Fig. 5: Average r of τ-R+MCE and τ-RMCE on eight datasets ((a) soc-Epinions1, (b) loc-Gowalla, (c) amazon0302, (d) email-EuAll, (e) web-NotreDame, (f) com-youtube, (g) soc-pokec, (h) cit-Patents) with different bounds, τ varies from 0.5 to 0.9, U order as default

6.2 Efficiency

While our main concern in this paper is the output size, the efficiency of τ-RMCE and τ-R+MCE (with three types of bounds and orders) is also reported. To provide a fuller discussion of the efficiency, we plotted both the total running time and the memory requirement.

We compare the total running time of τ-R+MCE and τ-RMCE with the default setting of U and T (see Fig. 6). The results show that τ-R+MCE consistently surpasses τ-RMCE on the eight datasets. When τ = 0.9, the time reduction is more than 20% for all datasets, among which four datasets (soc-Epinions1, email-EuAll, com-youtube, cit-Patents) achieve 30%. When τ = 0.5, this percentage exceeds 35% for all datasets, and five of them (soc-Epinions1, amazon0302, email-EuAll, com-youtube, cit-Patents) achieve more than 40%.

To get a full understanding of why our proposed method benefits efficiency (although our initial purpose was to target effectiveness), we recorded the first-result time, that is, the duration from the beginning of the run to the first maximal clique being included into the summary. We found that the result varies very little for different bounds and vertex orders; thus we use Table 4 to briefly summarize the results (T and U are set as default) for τ-R+MCE and τ-RMCE on the eight datasets. The fact is that most of the running time (more than 93%) is consumed by the enumeration procedure, which implies that the improved efficiency of τ-R+MCE comes from the early pruning power that speeds up the enumeration recursion. The search tree of τ-R+MCE does not have to explore as deep as τ-RMCE does to finally determine whether to discard a candidate clique; thus less time is wasted on growing cliques that would result in redundancy.

Fig. 6: Running time of τ-R+MCE and τ-RMCE on eight datasets ((a) soc-Epinions1, (b) loc-Gowalla, (c) amazon0302, (d) email-EuAll, (e) web-NotreDame, (f) com-youtube, (g) soc-pokec, (h) cit-Patents), τ varies from 0.5 to 0.9, T bound and U order as default

To test the efficiency of the three types of vertex orders, we implement τ-RMCE and τ-R+MCE with orders U, I and R. The default bound is set to T. We recorded both the total running time and the memory requirement for all experiments. The details are shown in Fig. 7 and Fig. 8.
Running time:
Fig. 7 shows that the results of τ-R+MCE and τ-RMCE are very similar on each dataset, hence we focus on the curves of τ-R+MCE. We see that τ-R+MCE-TU shows the best performance on five out of eight datasets (soc-Epinions1, amazon0302, email-EuAll, com-youtube, cit-Patents), with varying ranges of reduction vs. τ-R+MCE-TI. It shows similar performance to τ-R+MCE-TI on two datasets (web-NotreDame, soc-pokec), since the two lines coincide with each other. τ-R+MCE-TU shows the worst performance on one special dataset, loc-Gowalla, because of its small degeneracy. This result implies that, benefiting from its summarization effectiveness, τ-R+MCE-TU shows comparable or even better performance than the state-of-the-art order on a variety of real-world datasets. However, the degeneracy order is still the best choice for graphs with small degeneracies, which this order was initially designed for.

Table 4: First-Result Time (s)

Name          | τ-RMCE-TU | τ-R+MCE-TU
soc-Epinions1 |      0.62 |       0.61
loc-Gowalla   |      8.55 |       8.55
amazon0302    |      0.21 |       0.21
email-EuAll   |      1.22 |       1.20
NotreDame     |      5.97 |       5.97
com-youtube   |     12.33 |      12.32
soc-pokec     |     40.11 |      39.57
cit-Patents   |     10.30 |       9.27
Memory requirement:
Fig. 8 shows the memory requirement for different orders. We see that the truss order U consistently outperforms the other two for both τ-R+MCE and τ-RMCE. The memory reduction of τ-R+MCE-TU vs. τ-R+MCE-TI varies little when τ changes. The reduction is more than 10% for all eight datasets, two of which (email-EuAll, com-youtube) even achieve 30%. The results on memory cost are very similar to those on output size. This is because the memory requirement highly relies on the depth of the recursion: a larger number of deep branches results in higher memory consumption. The strong locality, and thus the early pruning power, of τ-R+MCE-TU prevents some of the redundant branches from growing unnecessarily deep; hence the memory requirement can be reduced significantly, for the same reason that the output summary size is reduced.

Fig. 7: Running time of τ-R+MCE and τ-RMCE on eight datasets with different orders, τ varies from 0.5 to 0.9, T bound as default

Fig. 8: Memory requirement of τ-R+MCE and τ-RMCE on eight datasets ((a) soc-Epinions1, (b) loc-Gowalla, (c) amazon0302, (d) email-EuAll, (e) web-NotreDame, (f) com-youtube, (g) soc-pokec, (h) cit-Patents) with different orders, τ varies from 0.5 to 0.9, T bound as default
We test the efficiency of different bounds (T, C, H)with default vertex order U. Both the total runningtime (Fig. 9) and memory requirement (Fig. 10) arerecorded.
Running time:
Fig. 9 shows that for both τ-R+MCE and τ-RMCE, the H bound is the fastest choice on five out of eight datasets (the exceptions being soc-Epinions1, amazon0302 and email-EuAll), while the truss bound T runs most slowly on seven out of eight datasets (the exception being soc-Epinions1). However, we notice that the time differences between τ-R+MCE-TU and τ-R+MCE-CU narrow as τ decreases on all datasets. This is consistent with Fig. 4: since the summary reduction increases as τ decreases, the benefit of early pruning gradually offsets the cost of bound calculation. This explains why τ-R+MCE-TU shows the best performance for small τ values.

Fig. 9: Running time of τ-R+MCE and τ-RMCE on eight datasets with different bounds, τ varies from 0.5 to 0.9, U order as default

Fig. 10: Memory requirement of τ-R+MCE and τ-RMCE on eight datasets ((a) soc-Epinions1, (b) loc-Gowalla, (c) amazon0302, (d) email-EuAll, (e) web-NotreDame, (f) com-youtube, (g) soc-pokec, (h) cit-Patents) with different bounds, τ varies from 0.5 to 0.9, U order as default

Memory requirement:
As we explained in Section 6.2.2, the results on memory requirement are similar to those on output size. The performance of the three bounds for both τ-R+MCE and τ-RMCE is quite clear: T is better than C, and C is better than H. When we focus on τ-R+MCE, we see that the memory reduction of τ-R+MCE-TU vs. τ-R+MCE-CU is more than 10% on the eight datasets for all τ values, among which four datasets (soc-Epinions1, com-youtube, soc-pokec, cit-Patents) even achieve 25% at τ = 0.5. This reduction is mainly caused by the fact that a tight bound, and thus early pruning, helps to avoid redundant search branches growing unnecessarily deep, which shows the superiority of the truss bound.

6.3 Algorithm Realization

In this subsection, we discuss one detail of the algorithm realization: during each recursion of Algorithm 2, to calculate the intersection C ∩ C' by merge join, we have to sort the vertices of C according to a given order. This is caused by an easily overlooked fact: even if we strictly select each vertex in the candidate set T' by a given order to grow the current partial clique C, the order of the vertices in C is still disorganized due to pivot selection. The other intersection operations (T ∩ N(v), D ∩ N(v)) in Algorithm 2 have no such requirement of vertex sorting. Ignoring this fact can lead to wrong outputs while the algorithm runs improperly fast. This explains why our results for τ-RMCE differ from those of the work [36], where the reported output size of τ-RMCE changes dramatically when τ decreases. With our appropriate implementation, however, we find that both the output size and the running time decrease much more slowly when τ varies from 0.9 to 0.5. Even if we set the default parameters as our newly proposed truss bound T and truss order U, the decreases are much less dramatic than the results shown in the work [36]. Besides, we test the algorithms on eight benchmark real-world datasets from various domains, all of which show very similar results. Hence we believe that our results are convincing. Moreover, we did not include two other datasets
Skitter and
Wiki used by [36] because, although their vertex numbers are smaller than that of cit-Patents, the running times of τ-RMCE with our appropriate implementation on these two datasets are not as fast as anticipated. Because of the inherent nature of the clique enumeration problem, once the input graph gets large, the processing time becomes unbearable. Therefore, we chose datasets that are relatively manageable (3600 s as the running time limit). For extremely large datasets, we believe that parallel and distributed computing solutions are preferable, which could be very interesting to explore [10, 25, 38]. Accordingly, we raise this problem in the future work section to attract more attention from the research community.

6.4 Summary

After a comprehensive discussion of all experiments, we can now answer the three questions posed at the beginning of Section 6:

(1) τ-R+MCE consistently outperforms τ-RMCE in both effectiveness and efficiency on all datasets with all the τ values. The output reduction can be up to one order of magnitude, and the time reduction is more than 35% at τ = 0.5. Besides, τ-R+MCE achieves relatively better performance on large graphs than τ-RMCE.

(2) When implemented with τ-R+MCE, the truss order reduces the output size by up to 83% vs. the state-of-the-art degeneracy order at τ = 0.5, and the reduction of the truss bound vs. the core bound can be up to 43%. The boost from the vertex order is more significant than that from the bounds.

(3) The running time of the truss order with τ-R+MCE is comparable to or even better than that of the degeneracy order, except on small-degeneracy graphs. The memory requirement of the truss order consistently shows the best performance, with a reduction of more than 10% vs. the degeneracy order. Although the efficiency of the truss bound is surpassed by the core bound and the H bound, the difference narrows as τ decreases. The memory requirement of the truss bound still shows the best performance, achieving a more than 10% reduction vs. the core bound.

7 Related Work

The number of maximal cliques in an undirected graph is proved to be exponential [19]. Bron and Kerbosch [5] and Akkoyunlu et al. [2] introduced backtracking algorithms to enumerate all maximal cliques in a graph. There are sufficient studies focusing on the efficiency of MCE. To effectively reduce the search space, pruning strategies were introduced in [6, 14, 32] by selecting good pivots; the key idea is to avoid searching unnecessary branches which lead to duplicated results. Degeneracy vertex ordering was introduced by [12] to bound the time complexity, because with the degeneracy order the size of the candidate set T in the first recursion level can be bounded by the degeneracy, and thus the candidate sets at all depths of the search tree can be bounded. Pivot selection strategies were studied by [21, 24] to optimize the algorithms: Naudé [21] relaxed the restriction of pivot selection while keeping the time complexity unchanged, and Segundo et al. [24] improved the practical performance of the algorithm by avoiding excessive time spent on selecting the pivot. With distributed computing paradigms, scalable and parallel algorithms were designed for MCE in [25, 38]. Schmidt et al. [25] decomposed the search tree to enable parallelization. Xu et al.
[38] proposed a distributed MCE algorithm based on a share-nothing architecture. The I/O performance of MCE on massive networks was improved by [8, 9]. An external-memory algorithm for MCE was first introduced by [8] to bound the memory consumption. A partition-based MCE algorithm was designed by [9] to reduce the memory used for processing large graphs. Maximal spatial clique enumeration was studied by [42], in which some geometric properties were used to enhance the enumeration efficiency. Dynamic maximal clique enumeration was studied in [11, 27, 28], in which the graph structure can evolve mildly; all three works considered dynamic cases where edges can be added or deleted. When considering an uncertain graph, which is a probability distribution over a set of deterministic graphs, uncertain versions of MCE were designed in [15, 20]. Mukherjee et al. [20] designed an algorithm to enumerate all α-maximal cliques in an uncertain graph. The size of an uncertain graph can be reduced by the core-based algorithms proposed in [15]. The top-k maximal clique finding problem on uncertain graphs was also studied by [43]. While these approaches reduce the running time of MCE, the bottleneck in applications is the large output size, which is our main focus.

There is a large volume of work [7, 17, 30, 31, 33] studying the maximum clique problem, which aims to find a maximal clique of the largest size. An approximate coloring technique was employed by [30] to bound the maximum clique size, and was further improved by [31] and [33]. Lu et al. [17] proposed a randomized algorithm with a binary search technique to find the maximum clique in massive graphs, while [7] studied the problem over sparse graphs by transforming the maximum clique problem on sparse graphs into the k-clique problem over dense subgraphs.
Although the concept of the maximum clique is closely related to that of the maximal clique, MCE and maximum clique finding are two distinct problems, and there is no need for a summary of the latter's output, since the number of maximum cliques is typically small.

Summarization has also been studied for frequent pattern mining [1, 37, 39]. Afrati et al. [1] studied how to find at most k patterns to span a collection of patterns, as an approximation of the original pattern sets. Yan et al. [39] proposed a profile-based approach to summarize all frequent patterns by k representatives. Pattern redundancy was introduced by [37], which studied how to extract redundancy-aware top-k significant patterns. While cliques share great similarity with frequent patterns, these algorithms cannot be used to summarize maximal cliques efficiently due to their offline nature. Some studies focus on online summarization algorithms. Saha and Getoor [23] and Ausiello et al. [3] studied how to find k diversified sets to represent all sets in a streaming fashion, based on which [40] introduced an online algorithm to produce diversified top-k maximal cliques. In these works, k is normally small, and coverage is not the focus.

Our work is closest to [36], which introduced the τ-visible summary of maximal cliques. Beyond the better sampling function given in the earlier version [16], we further discuss the optimality conditions and propose to approach the optimum by introducing the novel truss vertex order and truss bound.

8 Conclusion

In this paper, we have studied how to report a summary of less overlapping maximal cliques during the online maximal clique enumeration process.
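To make the coverage notion concrete, the following minimal sketch (our own illustration, not code from the paper) checks whether a maximal clique is τ-covered by a summary. Cliques are represented as vertex sets, and the helper names `coverage_ratio` and `is_tau_visible` are hypothetical:

```python
def coverage_ratio(clique, summary_clique):
    """Fraction of the clique's vertices that also appear in the summary clique."""
    return len(clique & summary_clique) / len(clique)

def is_tau_visible(clique, summary, tau):
    """True if at least one clique in the summary covers the clique with ratio >= tau."""
    return any(coverage_ratio(clique, s) >= tau for s in summary)

# Example: {1,2,3,4} shares 3 of its 4 vertices with the summary clique {1,2,3,5}.
c = frozenset({1, 2, 3, 4})
summary = [frozenset({1, 2, 3, 5}), frozenset({6, 7})]
print(coverage_ratio(c, summary[0]))    # 0.75
print(is_tau_visible(c, summary, 0.5))  # True  (0.75 >= 0.5)
print(is_tau_visible(c, summary, 0.9))  # False (0.75 <  0.9)
```

A τ-visible summary, as defined above, is simply one under which `is_tau_visible` holds for every maximal clique of the graph.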
We have proposed the best sampling strategy so far, which guarantees that the summary represents all the maximal cliques in expectation while keeping the summary sufficiently concise, i.e., each maximal clique is expected to be covered by at least one maximal clique in the summary with a ratio of at least τ (τ is given by the user and reflects the user's tolerance of overlap). We have proved the optimality of this sampling approach under two conditions (ideal bound estimation and sufficiently strong locality), and proposed the novel truss order as well as the truss bound to approach the optimum. Experimental studies have shown that the new strategy outperforms the state-of-the-art approach in both effectiveness and efficiency on eight real-world datasets. Future work could be directed towards approaching the optimality conditions further. It would also be interesting to solve the problem in parallel, considering that maximal clique enumeration is expensive on large graphs.

Acknowledgments
The work was supported by Australian Research Council Discovery Projects DP170104747 and DP180100212. We would like to thank Yujun Dai for her effort in the earlier version [16].
References
1. Afrati, F., Gionis, A., Mannila, H.: Approximating a collection of frequent sets. In: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 12–19. ACM Press, Seattle, WA, USA (2004)
2. Akkoyunlu, E.: The enumeration of maximal cliques of large graphs. SIAM Journal on Computing (1), 1–6 (1973)
3. Ausiello, G., Boria, N., Giannakos, A., Lucarelli, G., Paschos, V.T.: Online maximum k-coverage. In: O. Owe, M. Steffen, J.A. Telle (eds.) Fundamentals of Computation Theory, vol. 6914, pp. 181–192. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
4. Berry, N., Ko, T., Moy, T., Smrcka, J., Turnley, J., Wu, B.: Emergent clique formation in terrorist recruitment. In: AAAI-04 Workshop on Agent Organizations: Theory and Practice (2004)
5. Bron, C., Kerbosch, J.: Algorithm 457: finding all cliques of an undirected graph. Communications of the ACM (9), 575–577 (1973)
6. Cazals, F., Karande, C.: A note on the problem of reporting maximal cliques. Theoretical Computer Science (1-3), 564–568 (2008)
7. Chang, L.: Efficient maximum clique computation over large sparse graphs. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 529–538. ACM, New York, USA (2019)
8. Cheng, J., Ke, Y., Fu, A.W.C., Yu, J.X., Zhu, L.: Finding maximal cliques in massive networks by H*-graph. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 447–458. ACM, New York, USA (2010)
9. Cheng, J., Zhu, L., Ke, Y., Chu, S.: Fast algorithms for maximal clique enumeration with limited memory. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1240–1248. ACM, New York, USA (2012)
10. Das, A., Sanei-Mehri, S.V., Tirthapura, S.: Shared-memory parallel maximal clique enumeration from static and dynamic graphs. ACM Transactions on Parallel Computing (TOPC) (1), 1–28 (2020)
11. Das, A., Svendsen, M., Tirthapura, S.: Incremental maintenance of maximal cliques in a dynamic graph. The VLDB Journal (3), 351–375 (2019)
12. Eppstein, D., Löffler, M., Strash, D.: Listing all maximal cliques in large sparse real-world graphs. Journal of Experimental Algorithmics, 3.1–3.21 (2013)
13. Khaouid, W., Barsky, M., Srinivasan, V., Thomo, A.: K-core decomposition of large networks on a single PC. Proceedings of the VLDB Endowment (1), 13–23 (2015)
14. Koch, I.: Enumerating all connected maximal common subgraphs in two graphs. Theoretical Computer Science (1), 1–30 (2001)
15. Li, R., Dai, Q., Wang, G., Ming, Z., Qin, L., Yu, J.X.: Improved algorithms for maximal clique search in uncertain networks. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1178–1189 (2019)
16. Li, X., Zhou, R., Dai, Y., Chen, L., Liu, C., He, Q., Yang, Y.: Mining maximal clique summary with effective sampling. In: 2019 IEEE International Conference on Data Mining (ICDM), pp. 1198–1203. IEEE (2019)
17. Lu, C., Yu, J.X., Wei, H., Zhang, Y.: Finding the maximum clique in massive graphs. Proceedings of the VLDB Endowment (11), 1538–1549 (2017)
18. Lu, Z., Wahlström, J., Nehorai, A.: Community detection in complex networks via clique conductance. Scientific Reports (1), 5982–5997 (2018)
19. Moon, J.W., Moser, L.: On cliques in graphs. Israel Journal of Mathematics (1), 23–28 (1965)
20. Mukherjee, A.P., Xu, P., Tirthapura, S.: Mining maximal cliques from an uncertain graph. In: 2015 IEEE 31st International Conference on Data Engineering (ICDE), pp. 243–254 (2015)
21. Naudé, K.A.: Refined pivot selection for maximal clique enumeration in graphs. Theoretical Computer Science, 28–37 (2016)
22. Rokhlenko, O., Wexler, Y., Yakhini, Z.: Similarities and differences of gene expression in yeast stress conditions. Bioinformatics (2), 184–190 (2007)
23. Saha, B., Getoor, L.: On maximum coverage in the streaming model & application to multi-topic Blog-Watch. In: C. Apte, H. Park, K. Wang, M.J. Zaki (eds.) Proceedings of the 2009 SIAM International Conference on Data Mining, pp. 697–708. Society for Industrial and Applied Mathematics, Philadelphia, PA (2009)
24. San Segundo, P., Artieda, J., Strash, D.: Efficiently enumerating all maximal cliques with bit-parallelism. Computers & Operations Research, 37–46 (2018)
25. Schmidt, M.C., Samatova, N.F., Thomas, K., Park, B.H.: A scalable, parallel algorithm for maximal clique enumeration. Journal of Parallel and Distributed Computing (4), 417–428 (2009)
26. Seidman, S.B.: Network structure and minimum degree. Social Networks (3), 269–287 (1983)
27. Stix, V.: Finding all maximal cliques in dynamic graphs. Computational Optimization and Applications (2), 173–186 (2004)
28. Sun, S., Wang, Y., Liao, W., Wang, W.: Mining maximal cliques on dynamic graphs efficiently by local strategies. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 115–118 (2017)
29. Tandon, A., Karlapalem, K.: Agent strategies for the hide-and-seek game. In: Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, pp. 2088–2090. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC (2018)
30. Tomita, E., Kameda, T.: An efficient branch-and-bound algorithm for finding a maximum clique with computational experiments. Journal of Global Optimization (1), 95–111 (2007)
31. Tomita, E., Sutani, Y., Higashi, T., Takahashi, S., Wakatsuki, M.: A simple and faster branch-and-bound algorithm for finding a maximum clique. In: M.S. Rahman, S. Fujita (eds.) WALCOM: Algorithms and Computation, Lecture Notes in Computer Science, pp. 191–203. Springer Berlin Heidelberg (2010)
32. Tomita, E., Tanaka, A., Takahashi, H.: The worst-case time complexity for generating all maximal cliques and computational experiments. Theoretical Computer Science (1), 28–42 (2006)
33. Tomita, E., Yoshida, K., Hatta, T., Nagao, A., Ito, H., Wakatsuki, M.: A much faster branch-and-bound algorithm for finding a maximum clique. In: D. Zhu, S. Bereg (eds.) Frontiers in Algorithmics, Lecture Notes in Computer Science, pp. 215–226. Springer International Publishing (2016)
34. Valiant, L.G.: The complexity of enumeration and reliability problems. SIAM Journal on Computing (3), 410–421 (1979)
35. Wang, J., Cheng, J.: Truss decomposition in massive networks. arXiv preprint arXiv:1205.6693 (2012)
36. Wang, J., Cheng, J., Fu, A.W.C.: Redundancy-aware maximal cliques. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 122–130. ACM Press, Chicago, Illinois, USA (2013)
37. Xin, D., Cheng, H., Yan, X., Han, J.: Extracting redundancy-aware top-k patterns. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 444–453. ACM Press, Philadelphia, PA, USA (2006)
38. Xu, Y., Cheng, J., Fu, A.W.: Distributed maximal clique computation and management. IEEE Transactions on Services Computing (1), 110–122 (2016)
39. Yan, X., Cheng, H., Han, J., Xin, D.: Summarizing itemset patterns: a profile-based approach. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 314–323. ACM Press, Chicago, Illinois, USA (2005)
40. Yuan, L., Qin, L., Lin, X., Chang, L., Zhang, W.: Diversified top-k clique search. The VLDB Journal (2), 171–196 (2016)
41. Zhang, B., Park, B.H., Karpinets, T., Samatova, N.F.: From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics 24