Provably and Efficiently Approximating Near-cliques using the Turán Shadow: PEANUTS
Shweta Jain
University of California, Santa Cruz, Santa Cruz, CA, [email protected]
C. Seshadhri
University of California, Santa Cruz, Santa Cruz, [email protected]
ABSTRACT
Clique and near-clique counts are important graph properties with applications in graph generation, graph modeling, graph analytics, and community detection, among others. They are the archetypal examples of dense subgraphs. While there are several different definitions of near-cliques, most of them share the attribute that they are cliques that are missing a small number of edges. Clique counting is itself considered a challenging problem. Counting near-cliques is significantly harder, since the search space for near-cliques is orders of magnitude larger than that of cliques.

We give a formulation of a near-clique as a clique that is missing a constant number of edges. We exploit the fact that a near-clique contains a smaller clique, and use techniques for clique sampling to count near-cliques. This method allows us to count near-cliques with 1 or 2 missing edges in graphs with tens of millions of edges. To the best of our knowledge, there was no known efficient method for this problem, and we obtain a 10x-100x speedup over existing algorithms for counting near-cliques.

Our main technique is a space-efficient adaptation of the Turán Shadow sampling approach, recently introduced by Jain and Seshadhri (WWW 2017). This approach constructs a large recursion tree (called the Turán Shadow) that represents cliques in a graph. We design a novel algorithm that builds an estimator for near-cliques, using an online, compact construction of the Turán Shadow.

CCS CONCEPTS
• Theory of computation → Theory and algorithms for application domains; • Mathematics of computing → Approximation algorithms.

KEYWORDS
Cliques, near-cliques, defective-cliques, TuránShadow, sampling, graphs
ACM Reference Format:
Shweta Jain and C. Seshadhri. 2020. Provably and Efficiently Approximating Near-cliques using the Turán Shadow: PEANUTS. In Proceedings of The Web Conference 2020 (WWW ’20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3366423.3380264
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’20, April 20–24, 2020, Taipei, Taiwan
© 2020 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-7023-3/20/04.
https://doi.org/10.1145/3366423.3380264
Subgraph counting is an important tool in graph analysis that goes by many names, such as motif counting, graphlet analysis, and pattern counting. The aim is to count the number of occurrences of a (generally small) subgraph in a much larger graph. Among these subgraphs, cliques are arguably the most important. Even the simplest clique, the triangle, has a rich history of algorithms and applications. There has been much recent focus on counting cliques in large graphs [5, 11, 14, 16, 19, 42].

While clique counts are important, the requirement that every edge in the clique be present is excessively rigid. Data is often noisy or incomplete, and it is likely that cliques that are missing even an edge or two are significant. Hence, it is important to also look at counts of patterns that are extremely close to being cliques. We will call these structures near-cliques, but they are also known as quasi-cliques [23, 27] and defective cliques [46], and they have several applications ranging from clustering to prediction. Recent work has used the fraction of near-cliques to k-cliques to define higher order variants of clustering coefficients [45].

In the bioinformatics literature, near-cliques (or defective cliques, as they are known) have been used to predict missed protein-protein interactions in noisy PPI networks [46] and have been shown to have good predictive performance. An alternative viewpoint treats near-cliques as dense subgraphs. Mining dense subgraphs is an important problem with many applications in network analysis [2, 8, 15, 22, 31].

Counting cliques is already challenging, and counting near-cliques introduces more challenges. Most importantly, near-cliques do not enjoy the recursive structural property of cliques: that a subset of a clique is also a clique. This rules out most recursive backtracking algorithms for clique counting. Moreover, empirical evidence suggests that the number of near-cliques in real world datasets is orders of magnitude higher than that of cliques, making the task of counting them at least as difficult. Fig. 1i shows the ratio of 3 different types of near-cliques to the number of k-cliques for k = 5.

There are several different ways of defining near-cliques. Tsourakakis et al. [38] define α-quasi-cliques as cliques that are missing an α fraction of the edges. Other formulations define them in terms of graph properties like the degree of every vertex in the near-clique or the diameter of the near-clique. A set S of size n is called a k-plex if every member of the set is connected to n − k others. A k-club is a subset S of nodes such that in the subgraph induced by S, the diameter is k or less. All these formulations have the common property that they represent a clique that is missing a few edges. We formulate near-cliques in a slightly different way, as cliques that are missing 1 or 2 edges. The advantage of defining them this way is that it allows us to leverage the machinery of clique counting. Every such near-clique has a smaller clique contained in it. By sampling the smaller cliques and using them as hints to find near-cliques, we give an estimate for the total number of near-cliques. In §6.1 we show an interesting application of such near-cliques, where we run our algorithm on a citation network to discover papers that perhaps should have cited other papers but did not.

A k-clique is a set of k vertices such that there is an edge between all pairs of vertices belonging to the set. We define (k, 1)-cliques and (k, 2)-cliques below. For the rest of this paper, whenever we say near-cliques, we mean the following 3 kinds of near-cliques (unless mentioned otherwise).

Definition 1.1.
A (k, 1)-clique is a k-clique with exactly 1 edge missing.

For (k, 2)-cliques, there are 2 configurations possible: one in which the missing edges share a vertex, and one in which they do not.

Definition 1.2.
A Type 1 (k, 2)-clique is a k-clique with exactly 2 edges missing such that the missing edges share a vertex.

Definition 1.3.
A Type 2 (k, 2)-clique is a k-clique with exactly 2 edges missing such that the missing edges do not share a vertex.

The different types of near-cliques are shown in Fig. 2. We want to estimate the number of (k, 1)-cliques and (k, 2)-cliques in G. Note that all our near-cliques are induced, and obtaining counts of non-induced near-cliques is simply a matter of taking a linear combination of the number of k-cliques and near-cliques. For the sake of brevity, we skip a detailed discussion.

We stress that we make no distributional assumption on the graph. All probabilities are over the internal randomness of the algorithm itself (which is independent of the instance).

We provide a randomized algorithm based on TuránShadow, called PEANUTS, which estimates the counts of (k, 1)-cliques and (k, 2)-cliques. In addition, we provide a heuristic algorithm called Inverse-TS, based on PEANUTS, which takes roughly the same time as PEANUTS (and in some cases up to 10x less time) but drastically reduces the space required. Our implementation of Inverse-TS on a commodity machine showed significant time savings in obtaining counts of near-cliques over other methods like color-coding and brute force counting, and showed consistently low error over 100s of runs of the algorithm.

Leveraging cliques for near-cliques:
Data being noisy, cliques are brittle and, as a result, the number of near-cliques is often very large. However, it is not at all clear how one can count them without looking at every set of k vertices, which is computationally very expensive. PEANUTS uses the fact that near-cliques themselves contain cliques, and leverages TuránShadow to count near-cliques. There exist algorithms for generic pattern counting which can be used for counting near-cliques, but there is no known algorithm dedicated to finding near-cliques that exploits the clique-like structure of near-cliques to give a faster estimate.

Extremely fast:
PEANUTS is based on the observation that every near-clique contains a smaller clique. Thus, we can use cliques as clues for finding near-cliques. We leverage a fast clique-counting algorithm (TuránShadow) to achieve fast and accurate near-clique counting. Fig. 1ii shows the time taken by Inverse-TS, color-coding (cc) and brute force (bf) to count near-cliques with k = 7 for a variety of graphs. Inverse-TS is able to estimate their number to within 2% error in a graph (com-lj) with 4 million vertices and 34 million edges in 452 seconds, which is at least 100 times faster than cc and bf. As we will show later, similar performance is found in the estimation of other near-cliques and on other graphs.

Extremely accurate:
Similar to TuránShadow, Inverse-TS uses a seminal result from extremal combinatorics, Turán's theorem, which allows for efficiently sampling cliques. This translates to fast and accurate estimation of the number of near-cliques. Fig. 1iii shows the error in the estimate obtained for the number of Type 2 (5, 2)-cliques (a specific configuration of (5, 2)-cliques) in a variety of graphs using Inverse-TS. As we can see, all the errors were within 2%. Moreover, unlike color-coding, Inverse-TS allows us to control the number of samples we take, and even using 500K samples, Inverse-TS was more accurate and took less time than color-coding (§6).

For many of the graphs we experimented with, the brute force algorithm had not terminated within 1 day and thus was unable to give us ground truth values, but in the cases where the algorithm did terminate, we saw that Inverse-TS gave < 5% error, and in most cases less.

Excellent space efficiency:
TuránShadow requires that the entire shadow be generated and stored, which for a graph with 100s of millions of edges can potentially require a large amount of memory. Our practical implementation of Inverse-TS addresses this by removing the separation between the Shadow construction and sampling phases; instead, it performs sampling while the shadow is being constructed in an online fashion. This eliminates the need for storing the entire Shadow and consequently gives savings of orders of magnitude in the space required. The purple bars in Fig. 1iv show the factor savings in the maximum shadow size required to be stored at any point (instantaneous shadow size or inst SS) for Inverse-TS vs the space required by TuránShadow. There is at least 100x savings in space using Inverse-TS.
Comparison with other algorithms:
We do a thorough analysis of Inverse-TS by deploying it on a number of real-world graphs of varying sizes. In most cases we observed that Inverse-TS was considerably fast while showing consistently low error over 100s of runs of the algorithm. We also do a thorough comparison of Inverse-TS with other generic pattern-counting algorithms like color-coding. Fig. 1ii shows the time required for counting near-cliques with k = 7 by the different methods. Across all of our experiments we observe that Inverse-TS was at least 10 times faster on most graphs as compared to other algorithms.

[Figure 1, four panels. (i) Ratios: near-cliques / cliques per graph for k = 5, with series (5,1), (5,2) Type 1 and (5,2) Type 2. (ii) Timings: time in seconds per graph for inv-ts, cc and bf. (iii) Error: percent relative error per graph. (iv) TS vs Inverse-TS: factor savings per graph for k = 7 cliques, with series time, SS and inst SS.]

Figure 1: Fig. 1i shows the ratio of the number of different types of near-cliques to k-cliques for k = 5 in four real world graphs. The red line indicates ratio = 1. In most cases the number of near-cliques is at least of the same order of magnitude as the number of k-cliques, if not more. Fig. 1ii shows the time required by Inverse-TS (inv-ts), color-coding (cc) and brute force (bf) to estimate the number of near-cliques with k = 7 in 10 real world graphs. The y-axis shows time in seconds on a log scale. The red line indicates 86400 seconds (24 hours). All experiments that ran for more than 24 hours were terminated. Inverse-TS terminated in minutes in all cases except com-orkut, giving a speedup of anywhere between 3x and 100x. Fig. 1iii shows the percentage error in the estimates for Type 1 (k, 2)-cliques for k = 5 obtained using Inverse-TS. As we can see, the error is within 2% in all cases. Fig. 1iv shows the savings in time and space when using Inverse-TS (500000 samples) vs when using TuránShadow (50000 samples) to estimate the number of 7-cliques in 4 of the largest real world graphs we experimented with. The green bars show the factor savings in the percentage of the Turán Shadow that was explored (factor of 2-10). The purple bars show the factor savings in the maximum amount of space required for the Turán Shadow at any instant.

[Figure 2, three panels: (i) (k, 1)-clique, (ii) Type 1 (k, 2)-clique, (iii) Type 2 (k, 2)-clique.]

Figure 2: Near-cliques. Dotted lines indicate the missing edges. Blue lines mark the contained clique.

All code and data available:
All the datasets we used are publicly available at [36]. In addition, we can readily make the code for our algorithm publicly available if the paper is accepted for publication.
Pattern counting, also known as graphlet counting or motif counting, has been an important tool for graph analysis. It has been used in bioinformatics [25, 29, 44], social sciences [18], spam detection [4], graph modeling [33], etc. Triangle counting and, more recently, clique counting have gained a lot of attention [11, 14, 19] due to their special role in characterizing real-world graphs. Clique counts have been employed in applications such as discovery of dense subgraphs [32, 39], topological approaches to network analysis [35], and graph clustering [45], among others. More generally, motif counts have been used in clustering [40, 45], evaluation of graph models [33, 34], classification of graphs [41], etc.

On the theoretical side, several motif-counting algorithms exist [9, 10, 45]. On the more practical side, efficient methods for counting graphlets up to size 5 [17, 20, 28, 43] have been proposed only recently. Most of these are extensions of triangle counting methods and do not scale. For patterns of larger sizes, two widely used techniques are MCMC [16, 42] and the color-coding (CC) technique of [1]. However, as shown in [7], MCMC based methods have poorer accuracy than CC for the same running time, and for patterns of size greater than 5, CC is also generally quite inefficient, as we will show in our results. Motif counting has been studied in streaming [6, 21] and distributed settings [13] and in temporal networks [26].

All these methods are geared towards counting arbitrary patterns with up to 6 nodes, and none of them scale beyond 6 nodes. Moreover, these are generic pattern counting methods that do not utilize the clique-like nature of near-cliques to give more efficient methods. Ours is the first work to do so.
Dense subgraph algorithms:
The notion of dense subgraphs as near-cliques was introduced by Tsourakakis et al. in [38]. There are several different formulations of dense subgraphs, many of which are NP-Hard (indeed, even the problem of finding the densest subgraph on k vertices, known as the densest-k-subgraph problem, is NP-Hard [32]). The algorithms of Andersen and Chellapilla [3], Rossi et al. [30], and Tsourakakis et al. [38, 39] are practical for some of the formulations. However, most of them focus on finding or approximating the densest subgraph rather than giving global statistics.

The starting point of our result is the TuránShadow algorithm for estimating the number of k-cliques in a graph. TuránShadow is based on a seminal theorem of Turán and Erdős which says that if the edge density of an n-vertex graph is greater than $1 - 1/(k-1)$ (the Turán density), then the graph is guaranteed to have many ($\Omega(n^{k-2})$) k-cliques. This implies that if we randomly sample a k-vertex set from the graph, the probability of it being a k-clique would be high. TuránShadow exploits this fact by splitting G into (possibly overlapping) Turán-dense subgraphs such that there is a one-to-one correspondence between the cliques of a specific size in each subgraph and the k-cliques in G. The set of all such subgraphs of G is called the Turán Shadow of G. Essentially, TuránShadow reduces the search space for k-cliques in G from 1 large sparse graph to several dense subgraphs.
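To make the density test concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a graph stored as a dictionary of neighbor sets, and it assumes the edge density is normalised as $m / (n^2/2)$, which the text does not spell out. The helper names (`edge_density`, `turan_threshold`, `is_turan_dense`, `count_k_cliques`, `clique_hit_ratio`) are hypothetical, and the clique counter is brute force, suitable only for tiny examples.

```python
from itertools import combinations
from math import comb


def edge_density(adj):
    """Edge density m / (n^2 / 2); this normalisation is an assumption."""
    n = len(adj)
    if n == 0:
        return 0.0
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is seen twice
    return m / (n * n / 2)


def turan_threshold(k):
    """Turán density 1 - 1/(k-1) for k-cliques (k >= 2)."""
    return 1 - 1 / (k - 1)


def is_turan_dense(adj, k):
    """True if the density strictly exceeds the Turán density for k."""
    return edge_density(adj) > turan_threshold(k)


def count_k_cliques(adj, k):
    """Brute-force k-clique count; only meant for tiny illustrative graphs."""
    return sum(
        1
        for verts in combinations(sorted(adj), k)
        if all(v in adj[u] for u, v in combinations(verts, 2))
    )


def clique_hit_ratio(adj, k):
    """Fraction of all k-vertex sets that are k-cliques."""
    return count_k_cliques(adj, k) / comb(len(adj), k)


# Example: the complete graph on 5 vertices with the edge (0, 1) removed.
adj = {i: set(range(5)) - {i} for i in range(5)}
adj[0].discard(1)
adj[1].discard(0)
print(edge_density(adj), is_turan_dense(adj, 3), clique_hit_ratio(adj, 3))
```

In this small example the graph passes the test for k = 3 and k = 4 but not for k = 5, and a uniformly random 3-vertex set is a triangle 7 times out of 10, which is the kind of high hit rate that sampling inside a dense subgraph relies on.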
More importantly though, for any h, TuránShadow provides an efficient way of sampling a u.a.r. h-clique from G. Let $C_h$ be the set of all h-cliques in G and let $f : C_h \to \mathbb{R}^+$ be a bounded function over all h-cliques. Then we can obtain an unbiased estimate for $F = \sum_{K \in C_h} f(K)$ by taking the average of f over a set of uniformly sampled h-cliques and scaling by the total number of h-cliques. In other words, we can use this clique sampler to obtain an unbiased estimate of the sum (and mean value) of any bounded function over h-cliques. We exploit this fact to obtain an estimate of the number of near-cliques.

To estimate the number of (k, 1)-cliques, we make the following observation: every (k, 1)-clique has exactly two (k−1)-cliques embedded in it. Let $C_{(k,1)}$ be the set of (k, 1)-cliques in G and $C_{k-1}$ be the set of (k−1)-cliques in G, and for all $K \in C_{k-1}$ let f(K) = number of (k, 1)-cliques that clique K is contained in. Then $\sum_{K \in C_{k-1}} f(K) = 2|C_{(k,1)}|$. However, since every (k, 1)-clique is counted twice, the variance of the estimator can be pretty large. We observe that if the missing edge in a (k, 1)-clique is (u, v), u < v, exactly one of the (k−1)-cliques contains u and the other contains v. In order to reduce the variance, we define f(K) = number of (k, 1)-cliques that clique K is contained in such that $u \in K$, i.e. we break ties based on the direction of the missing edge. With this formulation, $\sum_{K \in C_{k-1}} f(K) = |C_{(k,1)}|$.

For (k, 2)-cliques, there are 2 possible configurations, as shown in Fig. 2. A Type 1 (k, 2)-clique contains exactly one (k−1)-clique, so we set f(K) = number of Type 1 (k, 2)-cliques that a given (k−1)-clique K is contained in. Type 2 (k, 2)-cliques are a bit more complicated. A Type 2 (k, 2)-clique has exactly four (k−2)-cliques embedded in it: if edges (u, v) and (w, x) are missing, then there is an induced cycle involving u, v, w and x, and every edge of this cycle gives a different (k−2)-clique contained in the (k, 2)-clique. Let min(u, v, w, x) = u and let min(w, x) = w. Then, for a (k−2)-clique K, we set f(K) = number of Type 2 (k, 2)-cliques containing K such that $u, w \in K$.

As long as f is bounded and is a “well behaved function”, i.e. has low variance, we can efficiently estimate F using TuránShadow as a black box. Improving the running time of the black box only improves the running time of the overall algorithm. We observe that in TuránShadow, most of the time is spent in constructing the Shadow, but only a small fraction of it is used to gather samples. Thus, if we can first sample and determine which areas of the Shadow the samples lie in, we can save time by developing only those parts of the Shadow instead of developing the whole Shadow. Additionally, when the number of samples is fixed (as is the case in the practical implementation of our algorithm), we can interleave the development of the parts of the Shadow with sampling for h-cliques from those parts, thus obtaining our estimate of F in an online fashion. This leads to considerable savings in space and time.

Outline: In §3 we set some basic notation. In §4 we show our basic framework PEANUTS and an optimized version of it called Inverse-TS. Depending on which type of pattern we want to count, we propose and analyze different counters in §5. Finally, in §6 we provide a detailed experimental study of Inverse-TS and its comparison with the state-of-the-art.

We set some notation. The input graph G has n vertices and m edges. We will assume that $m \ge n$. Let α be the degeneracy of the graph. Recall that the degeneracy is the maximum outdegree of any vertex when the edges of the graph are oriented according to the degeneracy ordering of the vertices in G. Let $N_v(G)$ represent the neighborhood of v and let $N^+_v(G)$ represent the outneighborhood of v when the vertices are ordered by degeneracy.

We use “u.a.r.” as a shorthand for “uniform at random”.

We will be using the following (rescaled) Chernoff bound.

Theorem 3.1. [Theorem 1 in [12]] Let $X_1, X_2, \ldots, X_k$ be a sequence of iid random variables with expectation $\mu$. Furthermore, let $X_i \in [0, B]$. Then, for $\varepsilon < 1$, $\Pr\left[\left|\sum_{i=1}^{k} X_i - \mu k\right| \ge \varepsilon \mu k\right] \le 2\exp(-\varepsilon^2 \mu k / 3B)$.

At the core of TuránShadow lies an object called the shadow. We define an analogous structure called Prefixed-Shadow.
Definition 4.1.
Let $C_k(G)$ be the set of all k-cliques in G. A k-clique Prefixed-Shadow $\mathbf{S}$ for graph G is a set of triples $\{(P_i, S_i, \ell_i)\}$ where $P_i \subseteq V$, $S_i \subseteq V$ and $\ell_i \in \mathbb{N}$ such that $\forall (P_i, S_i, \ell_i) \in \mathbf{S}$, $\forall c \in C_{\ell_i}(S_i)$, $P_i \cup c$ is a unique k-clique in G and there is a bijection between $C_k(G)$ and $\bigcup_{(P_i, S_i, \ell_i) \in \mathbf{S}} \bigcup_{c \in C_{\ell_i}(S_i)} P_i \cup c$.

Moreover, if the multiset $\{(S_i, \ell_i)\}$ is such that $\forall (S_i, \ell_i)$, $\rho(S_i) > 1 - 1/(\ell_i - 1)$, where $\rho(S_i)$ represents the edge density of $S_i$, then $\mathbf{S}$ is a k-clique Prefixed-Turán-Shadow of G.

It is easy to see that $\{(v, N^+_v, h-1)\}$ is an h-clique Prefixed-Shadow of G.

We will briefly recap how TuránShadow constructs the shadow. It orders the vertices of G by degeneracy and converts it into a DAG. As shown in [14], to count k-cliques in G it suffices to count the (k−1)-cliques in the outneighborhood of every vertex. For every $v \in V$, TuránShadow counts the number of k-cliques with v as the lowest order vertex by looking at the number of (k−1)-cliques in the outneighborhood of v, and it applies this procedure recursively. When the outneighborhood becomes dense enough, instead of continuing to expand the partial clique, it adds the outneighborhood to the shadow and continues until there are no more outneighborhoods left to be added to the shadow.

Algorithm PrefixedTuránShadowFinder carries out exactly the same steps as Shadow-Finder in [19], except that at each stage it also maintains the partial clique P.

Algorithm 1: PrefixedTuránShadowFinder(G, k)
 1  Initialize $\mathbf{T} = \{(\emptyset, V, k)\}$ and $\mathbf{S} = \emptyset$
 2  While $\exists (P, S, \ell) \in \mathbf{T}$ such that $\rho(S) \le 1 - 1/(\ell - 1)$
 3      Construct the degeneracy DAG $D(G|_S)$
 4      Let $N^+_s$ denote the outneighborhood (within $D(G|_S)$) of $s \in S$
 5      Delete $(P, S, \ell)$ from $\mathbf{T}$
 6      For each $s \in S$
 7          If $\ell \le 2$ or $\rho(N^+_s) > 1 - 1/(\ell - 2)$
 8              Add $(P \cup \{s\}, N^+_s, \ell - 1)$ to $\mathbf{S}$
 9          Else, add $(P \cup \{s\}, N^+_s, \ell - 1)$ to $\mathbf{T}$
10  Output $\mathbf{S}$

Claim 4.2. Given a graph G and integer k, PrefixedTuránShadowFinder returns a k-clique Prefixed-Turán-Shadow of G. Its running time is $O(|V|^{k+1})$.

Proof. When the function returns, $\mathbf{T}$ is empty, and any element $(P, S, \ell) \in \mathbf{S}$ was added to $\mathbf{S}$ only when $\rho(S) > 1 - 1/(\ell - 1)$. Thus, if $\mathbf{S}$ is a Prefixed-Shadow, it is also a Prefixed-Turán-Shadow.

By Theorem 5.2 in [19], the multiset $\{(S, \ell)\}$ is a shadow and hence there is a bijection between $C_k(G)$ and $\bigcup_{(P, S, \ell) \in \mathbf{S}} C_\ell(S)$. Thus, it suffices to prove that $\forall (P, S, \ell) \in \mathbf{T} \cup \mathbf{S}$, $\forall c \in C_\ell(S)$, $P \cup c$ is a unique k-clique in G. We will prove this using induction. At the start of the first iteration, P is empty, $S = V$ and $\ell = k$, $\mathbf{T} = \{(P, S, \ell)\}$ and $\mathbf{S}$ is empty. Thus, for the base case, the hypothesis is trivially true.

Suppose the hypothesis is true at the start of some iteration, and say element $E = (P', S', \ell')$ is deleted from $\mathbf{T}$ at the start of this iteration. Each $E_s = (P' \cup \{s\}, N^+_s, \ell' - 1)$ for $s \in S'$ is added to $\mathbf{S}$ or to $\mathbf{T}$. Let $\mathcal{K}(E) = \{P' \cup c \mid c \in C_{\ell'}(S')\}$ denote the set of k-cliques obtained from E. It suffices to prove that: (i) for any k-clique $K \in \mathcal{K}(E)$, $K \in \bigcup_s \mathcal{K}(E_s)$, and (ii) $|\mathcal{K}(E)| = \sum_s |\mathcal{K}(E_s)|$.

Consider a k-clique $K = P' \cup c$, $c \in C_{\ell'}(S')$. Let s be the lowest order vertex in c according to the degeneracy ordering in $G|_{S'}$. Then $c \setminus \{s\}$ is an $(\ell' - 1)$-clique in $N^+_s$. Thus, $K \in \mathcal{K}(E_s)$. Additionally, for $c \in C_{\ell'}(S')$, the smallest vertex in c defines a partition over $C_{\ell'}(S')$. Hence, $|C_{\ell'}(S')| = \sum_{s \in S'} |C_{\ell' - 1}(N^+_s)|$, i.e. $|\mathcal{K}(E)| = \sum_{s \in S'} |\mathcal{K}(E_s)|$. Hence, proved.

The out-degree of every vertex is at most $|V|$ and the depth of the recursive calls is at most $k - 1$. When processing an element $(P, S, \ell)$ it constructs the graph $G|_S$, which takes time at most $|V|^2$ since it queries every pair of vertices in S and $|S| \le |V|$. Thus, the time required is $O(|V|^{k+1})$. □

Algorithm 2:
Sample($\mathbf{S}$)
Inputs: $\mathbf{S}$: k-clique Prefixed-Turán-Shadow of some graph G
Output: B: k-vertex set
 1  Let $w(\mathbf{S}) = \sum_{(P', S', \ell') \in \mathbf{S}} \binom{|S'|}{\ell'}$
 2  Set probability distribution D over $\mathbf{S}$ such that $(P, S, \ell) \in \mathbf{S}$ is sampled with probability $\binom{|S|}{\ell} / w(\mathbf{S})$
 3  Sample a $(P, S, \ell)$ from D
 4  Choose a u.a.r. $\ell$-tuple c from S
 5  Let $B = P \cup c$
 6  return B

Claim 4.3. The probability of any k-clique K in G being returned by a call to Sample is $1 / w(\mathbf{S})$.

Proof. Let $E = (P, S, \ell) \in \mathbf{S}$ where $\mathbf{S}$ is the k-clique Prefixed-Shadow of some graph G. Note that $w(\mathbf{S}) = \sum_{(P', S', \ell') \in \mathbf{S}} \binom{|S'|}{\ell'}$. Let c be an $\ell$-clique in S and let $K = P \cup c$; then K must be a unique k-clique in G.

$\Pr(K \text{ is sampled}) = \Pr(E \text{ is sampled from } D) \cdot \Pr(c \text{ is sampled from } S) = \frac{\binom{|S|}{\ell}}{w(\mathbf{S})} \cdot \frac{1}{\binom{|S|}{\ell}} = \frac{1}{w(\mathbf{S})}$.

Thus, every k-clique in G has the same probability of being returned by Sample. □

We will first describe PEANUTS. Essentially, it constructs the Prefixed-Turán-Shadow of G, samples h-cliques, obtains f for the sampled h-clique and estimates the value of F.

Algorithm 3:
PEANUTS(G, h, s, Func)
Inputs: G: input graph; h: clique size (= k for cliques, k−1 for (k, 1)-cliques and Type 1 (k, 2)-cliques, k−2 for Type 2 (k, 2)-cliques); s: budget for samples; Func: function that returns f(K) for an h-clique K.
Output: $\hat{F}$: estimated F
 1  $\mathbf{S}$ = PrefixedTuránShadowFinder(G, h)
 2  Let $w(\mathbf{S}) = \sum_{(P, S, \ell) \in \mathbf{S}} \binom{|S|}{\ell}$, W = 0
 3  For i = 1, 2, ..., s:
 4      K = Sample($\mathbf{S}$)
 5      If K is a clique, set $X_i$ = Func(G, K)
 6      else set $X_i = 0$
 7      $W = W + X_i$
 8  let $\hat{F} = \frac{W}{s} w(\mathbf{S})$
 9  return $\hat{F}$

Theorem 4.4. Let f be a function over h-cliques, bounded above by B, such that given an h-clique, it takes $O(T_f)$ time to obtain the value of f. Let $\hat{F}$ be the output of PEANUTS; then $E[\hat{F}] = F$. Moreover, given any $\varepsilon > 0$, $\delta > 0$ and number of samples $s = 3 w(\mathbf{S}) B \ln(2/\delta) / (\varepsilon^2 F)$, with probability at least $1 - \delta$ (this probability is over the randomness of PEANUTS; there is no stochastic assumption on G), $|\hat{F} - F| \le \varepsilon F$.

Let $\mathbf{S}$ denote the h-clique Turán shadow of G and $size(\mathbf{S}) = \sum_{(S, \ell) \in \mathbf{S}} |S|$. The running time of PEANUTS is $O(\alpha \, size(\mathbf{S}) + s T_f + m + n)$ and the total storage is $O(size(\mathbf{S}) + m + n)$.

Proof. The $X_i$ are all iid random variables and, by the arguments in Claim 4.3, every h-clique in G has the same probability of being returned by Sample. Hence $E[X_i] = \sum_{K \in C_h(G)} \frac{f(K)}{w(\mathbf{S})} = \frac{F}{w(\mathbf{S})}$. Suppose $X_i \in [0, B]$. By Theorem 3.1, $\Pr\left[\left|\sum_{i=1}^{s} X_i - s E[X_i]\right| \ge \varepsilon s E[X_i]\right] \le \delta$ when $s = 3 w(\mathbf{S}) B \ln(2/\delta) / (\varepsilon^2 F)$.

The running time and storage required are a direct consequence of the running time and storage required for TuránShadow (Theorem 5.4 in [19]). The only difference is the addition of $s T_f$ in the running time, which is the time required to obtain f for s samples. □

We observed that with TuránShadow, the bulk of the time is spent in building the tree, and only a small fraction is needed for sampling.
To give an example, for the web-Stanford graph, construction of the shadow took 155 seconds for approximating the number of 7-cliques, while taking 50K samples required 0.2 seconds. Similar results were observed for all other graphs we experimented with. Thus, naturally, to optimize the performance of TuránShadow it would be beneficial to minimize the fraction of the shadow that is required to be built.

Consider one extreme of minimizing the construction of the shadow, which we will call level 1 sampling. Let $N^+_v$ be the outneighborhood of v in DG, $\Phi_v = \binom{|N^+_v|}{h-1}$ and $\Phi = \sum_v \Phi_v$. Recall that $\{(v, N^+_v, h-1)\}$ is an h-clique Prefixed-Shadow of G. If we sample a v with probability proportional to $\Phi_v$, and sample $h-1$ vertices from $N^+_v$ u.a.r., the probability of sampling a particular $(h-1)$-vertex set from $N^+_v$ would be $\frac{\Phi_v}{\Phi} \cdot \frac{1}{\Phi_v} = \frac{1}{\Phi}$. If there are $C_h$ h-cliques in G, then the probability that a sampled set of h vertices is a clique is $C_h / \Phi$ (we call this the success ratio). Hence, the number of samples required to find an h-clique would be $O(\Phi / C_h)$. But $\Phi$ is typically very large compared to $C_h$ and hence the number of samples required would be very large. In other words, most of the h-vertex sets picked will not be cliques.

TuránShadow remedies this by first finding the Turán shadow and then sampling within the subgraphs of the shadow, which are dense and hence require fewer samples to find a k-clique. Thus, TuránShadow saves on the number of samples required at the cost of building the shadow.

The advantage of level 1 sampling is that we do not need to spend time finding the Turán Shadow. We mimic the process of sampling an h-clique from this Prefixed-Shadow, but boost the success ratio by using the latter approach. In particular, we sample a v proportional to $\Phi_v = \binom{|N^+_v|}{h-1}$, and obtain the $(h-1)$-clique Prefixed-Turán-Shadow $\mathbf{S}$ of $N^+_v$. Suppose the shadow size is $\phi_v = \sum_{(P, S, \ell) \in \mathbf{S}} \binom{|S|}{\ell}$; then the probability of sampling an $(h-1)$-clique is $|C_{h-1}(G|_{N^+_v})| / \phi_v$.
Thus, the success ratio goes from C h − ( G | N + v )/ Φ v to C h − ( G | N + v )/ ϕ v . Since ϕ v is typically much smaller than Φ v , thesuccess ratio is much improved. However, to account for the factthat we are now sampling u.a.r. in a search space of size ϕ v andnot Φ v , we give a smaller weight ( ϕ v / Φ v ) to every clique obtainedfrom N + v .For an element E = ( P , S , ℓ ) ∈ S where S is the k − cliquePrefixed-Turán-Shadow of a graph G , let K( E ) = { P ∪ c , c ∈ C ℓ ( S )} denote the set of k -cliques obtained from E .Lemma 4.5. Let ˆ F be the value returned by Inverse-TS. Then E [ ˆ F ] = F . Proof. Consider an h -clique K ∈ C h ( G ) and let v be the lowestorder vertex according to degenerecy ordering of vertices in G . Let E = ( v , N + v , h − ) then, K ∈ K( E ) .Let S v be the h − G | N + v and let E v = ( P , S , ℓ ) be the element in S v such that K = v ∪ P ∪ c , c ∈ C ℓ ( S ) . Pr ( K is sampled in Step 11 ) = Pr ( E is sampled ) ∗ Pr ( E v is sampled ) ∗ Pr ( c is sampled ) = Φ v Φ ∗ ( | S | ℓ ) ϕ v ∗ ( | S | ℓ ) = Φ v Φ ϕ v Thus, E [ X i ] = (cid:205) v ∈ V (cid:205) K ∈K( E v ) Φ v Φ ϕ v ϕ v Φ v f ( K ) = (cid:205) K ∈ C k ( G ) f ( K ) Φ = F Φ Algorithm 4:
Inverse-TS($G$, $h$, $s$, Func)
1   Order $G$ by degeneracy and convert it to a DAG $DG$.
2   Let $M$ be a map, $W = 0$.
3   Set probability distribution $D$ over $V$ where $p(v) = \Phi_v / \Phi$.
4   For $i = 1, 2, \ldots, s$:
5       Independently sample a vertex $v$ from $D$.
6       If $M[v]$ exists, set $\mathbf{S} = M[v]$
7       else
8           $\mathbf{S}$ = PrefixedTuránShadowFinder($G|_{N^+_v}$, $h-1$)
9           $M[v] = \mathbf{S}$
10      Let $\phi_v = \sum_{(P,S,\ell) \in \mathbf{S}} \binom{|S|}{\ell}$
11      Let $K = \{v\} \cup$ Sample($\mathbf{S}$)
12      If $K$ is a clique, set $X_i = \frac{\phi_v}{\Phi_v} \cdot$ Func($G$, $K$)
13      else set $X_i = 0$
14      $W = W + X_i$
15  Let $\hat{F} = \frac{W}{s} \Phi$
16  return $\hat{F}$

Moreover, $W = \sum_{i=1}^{s} X_i$. Therefore, $E[W] = E[\sum_{i=1}^{s} X_i] = \sum_{i=1}^{s} E[X_i] = sF/\Phi$. Hence, $E[\hat{F}] = E[\frac{W}{s}\Phi] = F$. □

Theorem 4.6.
Let $f$ be a function over $h$-cliques, bounded above by $B$, such that given an $h$-clique, it takes $O(T_f)$ time to obtain the value of $f$. Given any $\varepsilon > 0$, $\delta > 0$ and number of samples $s = 3\Phi B \ln(2/\delta)/(\varepsilon^2 F)$, Inverse-TS outputs an estimate $\hat{F}$ such that with probability at least $1 - \delta$, $|\hat{F} - F| \le \varepsilon F$.

Let $\mathbf{S}$ denote the $h$-clique Turán shadow of $G$ and $size(\mathbf{S}) = \sum_{(S,\ell) \in \mathbf{S}} |S|$. The running time of Inverse-TS is $O(\min(s\alpha^h, \alpha \, size(\mathbf{S})) + sT_f + m + n)$ and the total storage is $O(size(\mathbf{S}) + m + n)$.

Proof. The $X_i$ are all i.i.d. random variables and, by the arguments in Lemma 4.5, their expectation is $\mu = F/\Phi$. Suppose $X_i \in [0, B]$. By Theorem 3.1, $\Pr[|\sum_{i=1}^{s} X_i - \mu s| \ge \varepsilon s \mu] \le \delta$ when $s = 3\Phi B \ln(2/\delta)/(\varepsilon^2 F)$.

The degeneracy of $G$ can be computed in time linear in the size of the graph [24]. For any $v$, the map $M[v]$ in Inverse-TS stores the $(h-1)$-clique Prefixed-Turán-Shadow of $N^+_v$. For any $v$ that gets sampled in Step 5, Inverse-TS checks if the Prefixed-Turán-Shadow of $N^+_v$ has been constructed and, if so, uses the already-constructed shadow. If not, it constructs it in Step 8 and stores it in $M$. Thus, in the worst case, it calculates the Prefixed-Turán-Shadow of $N^+_v$ for every $v$, i.e. it calculates the Prefixed-Turán-Shadow of $G$, which requires time $O(\alpha \, size(\mathbf{S}))$ according to Thm. 5.4 from [19]. On the other hand, given any $v$, the size of $N^+_v$ is at most $\alpha$, so constructing the $(h-1)$-clique Prefixed-Turán-Shadow of $N^+_v$ takes time $O(\alpha^h)$ (Claim 4.2), and it samples $s$ such vertices from $D$, so the time required is $O(s\alpha^h)$.

There are $s$ $h$-vertex sets sampled in Step 11, and checking if the sampled vertices form a clique takes time $h^2$, while calculating $f$, given that the sampled set is a clique, takes time $T_f$. Thus, the total time required by Inverse-TS is $O(\min(\alpha \, size(\mathbf{S}), s\alpha^h) + sT_f + m + n)$. □

Depending on which structure we are counting, we can find appropriate values for $B$ and $T_f$.
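To make the estimator concrete, the following is a minimal Python sketch of the sampling loop of Inverse-TS under one simplifying assumption: the shadow of each outneighborhood is taken to be the trivial one, $\{(\emptyset, N^+_v, h-1)\}$, so that $\phi_v = \Phi_v$ and the weight $\phi_v/\Phi_v$ equals 1. This is level 1 sampling rather than the Turán-shadow refinement. The function names, the adjacency-dictionary representation and the tie-breaking rule in the ordering are illustrative choices and are not taken from the released code.

```python
import random
from itertools import combinations
from math import comb


def degeneracy_dag(adj):
    """Return out-neighborhoods N+_v under a smallest-last (degeneracy) ordering.

    adj maps each vertex to the set of its neighbors (simple undirected graph).
    Ties are broken by vertex id, which is an arbitrary illustrative choice.
    """
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    out = {}
    while remaining:
        v = min(remaining, key=lambda x: (len(remaining[x]), x))
        out[v] = set(remaining[v])  # neighbors that come later in the ordering
        for u in remaining[v]:
            remaining[u].discard(v)
        del remaining[v]
    return out


def inverse_ts_level1(adj, h, s, func, rng):
    """Estimate F = sum of func over h-cliques, using the trivial shadow.

    Assumption: phi_v = Phi_v, so every successful sample has weight 1.
    """
    out = degeneracy_dag(adj)
    verts = sorted(out)
    phi_v = [comb(len(out[v]), h - 1) for v in verts]
    phi = sum(phi_v)
    if phi == 0:
        return 0.0  # no outneighborhood is large enough to hold a clique
    w = 0.0
    for _ in range(s):
        v = rng.choices(verts, weights=phi_v)[0]   # sample v proportional to Phi_v
        cand = rng.sample(sorted(out[v]), h - 1)   # u.a.r. (h-1)-subset of N+_v
        if all(b in adj[a] for a, b in combinations(cand, 2)):
            w += func(adj, [v] + cand)             # X_i = 1 * Func(G, K)
    return w / s * phi


# Usage: counting triangles (func == 1) in the complete graph on 6 vertices.
k6 = {v: set(range(6)) - {v} for v in range(6)}
print(inverse_ts_level1(k6, 3, 50, lambda g, K: 1, random.Random(0)))
```

On a complete graph every sample is a clique, so the estimate equals $\Phi = \binom{n}{h}$ exactly. On sparse graphs most samples fail, which is the motivation for building the Prefixed-Turán-Shadow of each sampled outneighborhood and reweighting by $\phi_v/\Phi_v$.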
Notice that in the worst case, depending on the structure of the graph, Inverse-TS may end up building the entire shadow, in which case it will not provide any savings over PEANUTS. In practice, however, we observe significant savings in the amount of shadow built using Inverse-TS in most cases. Unless specified otherwise, all results in this paper are obtained using Inverse-TS.

$(k, 1)$-cliques

Algorithm 5:
Func-$(k,1)$-Clique($G$, $K$)
    $f' = 0$
    Let $u$ and $v$ be two distinct vertices from $K$
    Let $nbrs = N_u \cup N_v$
    For $nbr \in nbrs$:
        If $nbr$ is connected to all vertices in $K$ except 1 vertex, say $w$, and $nbr > w$, then $f' = f' + 1$
    return $f'$

Definition 5.1.
Let $(u, v)$, $u < v$, be the missing edge in a $(k,1)$-clique $J$. The lower-order $(k-1)$-clique of $J$ is the $(k-1)$-clique $J \setminus \{u\}$, and $J \setminus \{v\}$ is the higher-order $(k-1)$-clique of $J$.

Claim 5.2. Let $f(K)$ for a $(k-1)$-clique $K$ denote the number of $(k,1)$-cliques that $K$ is the lower-order $(k-1)$-clique in. Then $F = \sum_{K \in C_{k-1}(G)} f(K)$ = total number of $(k,1)$-cliques in $G$.

Proof. Every $(k,1)$-clique has exactly 1 lower-order $(k-1)$-clique. Thus, if $f(K)$ denotes the number of $(k,1)$-cliques that $K$ is a part of and is the lower-order clique in, then $\sum_{K \in C_{k-1}(G)} f(K) = F$ = total number of $(k,1)$-cliques in $G$. □

Claim 5.3.
For input $(k-1)$-clique $K$, Func-$(k,1)$-Clique returns $f(K)$.

Proof. For any $nbr \in V$, if $K \cup \{nbr\}$ is a $(k,1)$-clique, then either $nbr \in N_u$ or $nbr \in N_v$ or both. For a given $K$, Func-$(k,1)$-Clique finds the vertices $nbr$ in $nbrs$ that are connected to every vertex in $K$ except one. Thus, every $\{nbr\} \cup K$ for such an $nbr$ is a $(k,1)$-clique, and it is counted in $f'$ iff $K$ is its lower-order $(k-1)$-clique. Hence $f' = f(K)$. □

Theorem 5.4.
Let $d_{max}$ be the maximum degree of any vertex in $G$. Then $B = \min(d_{max}, n)$ and $T_f = O(d_{max})$ for Func-$(k,1)$-Clique.

Proof. By Claim 5.2, $F$ = total number of $(k,1)$-cliques in $G$. For any $(k,1)$-clique $J = K \cup \{nbr\}$ that $K$ is the lower-order $(k-1)$-clique in, $nbr \in N_u$ or $nbr \in N_v$ or both. Thus the number of $(k,1)$-cliques in which $K$ is the lower-order $(k-1)$-clique is at most $d_{max}$. On the other hand, there can be at most $n$ candidates for $nbr$; thus $B = \min(d_{max}, n)$. Finding $nbrs$ takes time $O(d_{max})$, and checking if $nbr \in nbrs$ forms a $(k,1)$-clique with $K$ takes time $O(1)$. Hence, $T_f = O(d_{max})$. □

$(k, 2)$-cliques

Algorithm 6:
Func-$(k,2)$-Clique-Type1($G$, $K$)
    $f' = 0$
    For $u \in K$:
        For $v \in K$, $v > u$:
            Let $nbrs$ be the set of vertices connected to all vertices in $K$ except $u$ and $v$
            $f' = f' + |nbrs|$
    return $f'$

Claim 5.5.
Let $f(K)$ for a $(k-1)$-clique $K$ denote the number of Type 1 $(k,2)$-cliques that $K$ is contained in. Then $F = \sum_{K' \in C_{k-1}(G)} f(K')$ = the total number of Type 1 $(k,2)$-cliques in $G$.

Proof. Every Type 1 $(k,2)$-clique contains exactly 1 $(k-1)$-clique. Thus, $\sum_{K' \in C_{k-1}(G)} f(K') = F$ = the total number of Type 1 $(k,2)$-cliques in $G$. □

Claim 5.6.
For input $(k-1)$-clique $K$, Func-$(k,2)$-Clique-Type1 returns $f(K)$.

Proof. Given $K$, for every distinct pair of vertices $u, v \in K$ with $v > u$, Func-$(k,2)$-Clique-Type1 finds the set of vertices $nbrs$ such that every $nbr \in nbrs$ is connected to all vertices in $K$ except $u$ and $v$. Thus, $K \cup \{nbr\}$ is a $k$-clique with exactly 2 edges missing, $(u, nbr)$ and $(v, nbr)$, and the missing edges have a vertex in common ($nbr$), i.e. it is a Type 1 $(k,2)$-clique. Thus, Func-$(k,2)$-Clique-Type1 returns the number of Type 1 $(k,2)$-cliques that $K$ is contained in, i.e. it returns $f(K)$. □

Theorem 5.7. $B = \min(d_{max}, n)$, $T_f = O(d_{max})$ for Func-$(k,2)$-Clique-Type1.

Proof. For any 3 vertices $u, v, w \in K$ and for any $(k,2)$-clique $J = K \cup \{nbr\}$ that $K$ is contained in, at least one of $(u, nbr)$, $(v, nbr)$, $(w, nbr)$ is in $E(G)$. Thus, any $K$ can be a part of at most $\min(d_{max}, n)$ Type 1 $(k,2)$-cliques. For every pair $(u, v)$ in $K$, Func-$(k,2)$-Clique-Type1 calculates the number of vertices connected to all vertices in $K$ but $u$ and $v$, which takes time $O(d_{max})$. Thus, $T_f = O(d_{max})$. □

Algorithm 7:
Func-$(k,2)$-Clique-Type2($G$, $K$)
1   $f' = 0$
2   Let $degen(u)$ denote the position of $u$ in the degeneracy order of $G$.
3   For $u \in K$:
4       For $w \in K$, $degen(w) > degen(u)$:
5           Let $nbrsu \subseteq N^+_u$ be the set of out-nbrs of $u$ that are connected to all vertices in $K$ except $w$ and that satisfy $degen(w) < degen(nbru)$ for all $nbru \in nbrsu$.
6           Let $nbrsw$ be the set of neighbors of $w$ in $G$ that are connected to all vertices in $K$ except $u$.
7           For $x \in nbrsu$:
8               For $v \in nbrsw$:
9                   If $(x, v) \in E(G)$: $f' = f' + 1$
10  return $f'$

$(k, 2)$-cliques

Definition 5.8.
Given a Type 2 $(k,2)$-clique $J$ and $v, x \in J$, the set $K = J \setminus \{v, x\}$ is the lowest-order $(k-2)$-clique of $J$ if it fulfills all the following conditions:
(1) $(u, v) \notin E(G)$, $(w, x) \notin E(G)$ (note that this implies that $K$ is a $(k-2)$-clique),
(2) $degen(u) < degen(v)$,
(3) $degen(u) < degen(w) < degen(x)$.

Note that $u$, $v$, $w$ and $x$ are all distinct and $J$ contains exactly 4 $(k-2)$-cliques: $J \setminus \{v, x\}$, $J \setminus \{v, w\}$, $J \setminus \{u, x\}$ and $J \setminus \{u, w\}$ (Fig. 2). The lowest-order $(k-2)$-clique of $J$ is the one which has the vertex ($u$) with minimum position in the degeneracy ordering of $G$ and the minimum neighbor of $u$.

Claim 5.9. Let $f(K)$ for a $(k-2)$-clique $K$ denote the number of Type 2 $(k,2)$-cliques that $K$ is the lowest-order $(k-2)$-clique in. Then $F = \sum_{K' \in C_{k-2}(G)} f(K')$ = total number of Type 2 $(k,2)$-cliques in $G$.

Proof. Every Type 2 $(k,2)$-clique has exactly one lowest-order $(k-2)$-clique. Thus, if $f(K)$ denotes the number of Type 2 $(k,2)$-cliques that $K$ is the lowest-order $(k-2)$-clique in, then $\sum_{K' \in C_{k-2}(G)} f(K')$ = total number of Type 2 $(k,2)$-cliques in $G$. □

Claim 5.10.
For input $(k-2)$-clique $K$, Func-$(k,2)$-Clique-Type2 returns $f(K)$.

Proof. Given a $(k-2)$-clique $K$, Step 3 and Step 4 loop over all possible candidates for $u$ and $w$, maintaining the condition that $degen(u) < degen(w)$. In Step 5, Func-$(k,2)$-Clique-Type2 picks the outneighbors of $u$ that are potential candidates for $x$ ($nbrsu$), such that $(w, x) \notin E(G)$ and $degen(w) < degen(x)$. In Step 6, it picks potential candidates for $v$ ($nbrsw$), i.e. neighbors of $w$ that are connected to all vertices in $K$ except $u$. Finally, in Step 9, it checks if $v$ and $x$ are connected. Thus, $f'$ in Step 9 is incremented iff all the conditions of a lowest-order $(k-2)$-clique of a Type 2 $(k,2)$-clique are fulfilled. Thus, the returned value $f' = f(K)$. □

Theorem 5.11. $B = \min(n^2, k^2 \alpha d_{max}/2)$, $T_f = O(\alpha + d_{max})$ for Func-$(k,2)$-Clique-Type2.

Proof. Given $K$, there can be at most $k^2/2$ candidates for $(u, w)$. There can be at most $\alpha$ candidates for $x$ (since it has to be an outneighbor of $u$) and at most $d_{max}$ candidates for $v$ (neighbors of $w$). On the other hand, there can be at most $n$ candidates for $x$ and $v$ each. Thus, $B = \min(n^2, k^2 \alpha d_{max}/2)$.

Given a set of $k-2$ vertices, it takes $O(k^2)$ time to check if it forms a clique. There are $O(k^2)$ candidates for $(u, w)$. There are at most $\alpha$ candidates for $x$ and $d_{max}$ candidates for $v$, whose connections to each of the $k-2$ vertices of $K$ must be checked, which takes time $O(\alpha + d_{max})$. Altogether, $T_f = O(\alpha + d_{max})$. □

Preliminaries:
We implemented our algorithms in C++ and ran our experiments on a commodity machine equipped with a 1.4GHz AMD Opteron(TM) processor 6272 with 8 cores and 2048KB L2 cache (per core), 6144KB L3 cache, and 128GB memory. We performed our experiments on a collection of graphs from SNAP [36], including social networks, web networks, and infrastructure networks. The largest graph has more than 100M edges. Basic properties of these graphs, such as degeneracy and maximum degree, are presented in Table 1. We consider the graph to be simple and undirected. Code for all experiments is available at: https://bitbucket.org/sjain12/counting-near-cliques

Our practical implementation differs slightly from Inverse-TS in two ways. First, we fix the number of samples to 500K. Second, since the number of samples is fixed, we can sample from $D$ in Inverse-TS all at once and maintain counts of the number of cliques to be sampled from each outneighborhood. We can then explore the outneighborhoods in an online fashion, sampling as we build the shadow. Once the samples from a vertex's outneighborhood have been obtained, we no longer need the shadow of that outneighborhood and it can be discarded. Thus, we don't need to store the entire shadow but only the shadow of the current vertex's outneighborhood.

We focus on counting near-$k$-cliques for $k$ ranging from 5 to 10.

Accuracy and convergence of Inverse-TS:
We picked some graphs for which the exact near-clique counts are known (for all $k \in [5, 10]$). For each graph and near-clique type, for sample sizes in [10K, 50K, 100K, 500K, 1M], we performed 100 runs of the algorithm. We show here results for amazon0601 for $k = 7$, though similar results were observed for other graphs and $k$. We plot the spread of the output of Inverse-TS over all these runs. The results are shown in Fig. 3. The red line denotes the true answer, and there is a point for the output of every single run. As we can see, the output of Inverse-TS quickly converges to the true value as we increase the number of samples. For 500K samples, the range of values is within 5% of the true answer, which is much smaller than the spread of cc. Similar results were observed for other graphs for which the exact counts were available, except soc-pokec. The error was mostly < 5% and often < 1%, as can be seen from Tab. 1.

In cases like soc-pokec the error can be high. This happens when most of the samples end up empty, either because the sampled vertices did not form a clique, or the samples belonged to out-neighborhoods that did not have a clique of the required size, or the sampled clique does not participate in any near-cliques. This can be detected by observing how many of the samples taken in Step 11 were cliques with non-zero $f$. If this number is << …

| graph | vertices | edges | degen | $d_{max}$ | type | k=5 estimate | k=5 % error | k=5 time | k=7 estimate | k=7 % error | k=7 time | k=10 estimate | k=10 % error | k=10 time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| web-Stanford | 2.82E+05 | 1.99E+06 | 71 | 38625 | $(k,1)$ | 2.36E+10 | 0.85 | 142 | 8.99E+11 | - | 216 | 2.16E+14 | - | 129 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 1.12E+10 | 1.19 | 5396 | 2.51E+11 | - | 538 | 1.04E+14 | - | 293 |
| | | | | | $k$ | 6.21E+08 | | | 3.47E+10 | | | 5.82E+12 | | |
| web-Google | 8.76E+05 | 4.32E+06 | 44 | 6332 | $(k,1)$ | 6.76E+08 | 0.44 | 13 | 2.19E+09 | 0.45 | 12 | 2.41E+10 | 0.41 | 10 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 7.18E+07 | 1.10 | 21 | 2.93E+08 | 0.01 | 18 | 7.70E+09 | 0.01 | 13 |
| | | | | | $k$ | 1.05E+08 | | | 6.06E+08 | | | 1.29E+10 | | |
| amazon0601 | 4.03E+05 | 4.89E+06 | 10 | 2752 | $(k,1)$ | 1.17E+07 | 0.00 | 4 | 2.88E+06 | 0.01 | 3 | 3.76E+04 | 0.02 | 1.5 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 3.16E+06 | 0.01 | 4 | 1.30E+06 | 0.01 | 5 | 2.96E+04 | 0.00 | 3 |
| | | | | | $k$ | 3.64E+06 | | | 9.98E+05 | | | 9.77E+03 | | |
| web-BerkStan | 6.85E+05 | 6.65E+06 | 201 | 84230 | $(k,1)$ | 4.89E+11 | 0.93 | 397 | 2.89E+13 | - | 470 | 1.85E+16 | - | 704 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 6.61E+10 | 0.09 | 12400 | 7.32E+11 | - | 605 | 1.65E+14 | - | 646 |
| | | | | | $k$ | 2.19E+10 | | | 9.30E+12 | | | 5.79E+16 | | |
| as-skitter | 1.70E+06 | 1.11E+07 | 111 | 35455 | $(k,1)$ | 3.94E+10 | 4.52 | 1180 | 5.44E+11 | - | 1034 | 7.91E+13 | - | 800 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 2.34E+10 | 1.37 | 4132 | 3.97E+11 | - | 2598 | 8.55E+13 | - | 1038 |
| | | | | | $k$ | 1.17E+09 | | | 7.30E+10 | | | 1.43E+13 | | |
| cit-Patents | 3.77E+06 | 1.65E+07 | 64 | 793 | $(k,1)$ | 4.12E+07 | 0.01 | 10 | 7.20E+07 | 0.01 | 6 | 9.06E+05* | 42.22 | 4 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 1.31E+08 | 0.01 | 6 | 6.76E+08 | 3.36 | 9 | 2.54E+07* | 31.35 | 5 |
| | | | | | $k$ | 3.05E+06 | | | 1.89E+06 | | | 2.55E+03 | | |
| soc-pokec | 1.63E+06 | 2.23E+07 | 47 | 14854 | $(k,1)$ | 4.22E+08* | 8.48 | 218 | 5.41E+07* | 9.96 | 81 | 7.67E+08 | 4.24 | 55 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 3.34E+08 | 0.00 | 38 | 6.78E+08* | 7.61 | 95 | 1.28E+09 | 0.01 | 64 |
| | | | | | $k$ | 5.29E+07 | | | 8.43E+07 | | | 1.98E+08 | | |
| com-lj | 4.00E+06 | 3.47E+07 | 360 | 14815 | $(k,1)$ | 2.85E+11 | 0.11 | 200 | 4.28E+14 | - | 452 | 1.18E+19 | - | 558 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 5.39E+10 | 0.53 | 269 | 1.24E+14 | - | 581 | 4.23E+18 | - | 568 |
| | | | | | $k$ | 2.47E+11 | | | 4.51E+14 | | | 1.47E+19 | | |
| soc-LJ | 4.84E+06 | 8.57E+07 | 372 | 20333 | $(k,1)$ | 6.32E+11 | 0.03 | 677 | 1.01E+15 | - | 779 | 4.14E+19 | - | 960 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 1.34E+11 | 0.41 | 506 | 2.77E+14 | - | 1007 | 1.17E+19 | - | 1111 |
| | | | | | $k$ | … | | | 4.49E+14 | | | … | | |
| com-orkut | 3.07E+06 | 1.17E+08 | 253 | 33313 | $(k,1)$ | 1.56E+11 | - | 9507 | 2.26E+12 | - | 16546 | 4.66E+13 | - | 26370 |
| | | | | | $(k,2)$ Type 1 | … | … | … | … | … | … | … | … | … |
| | | | | | $(k,2)$ Type 2 | 2.37E+11 | - | 3879 | 3.51E+12 | - | 11617 | 1.60E+14 | - | 22676 |
| | | | | | $k$ | 1.57E+10 | | | 3.61E+11 | | | 3.03E+13 | | |

Table 1: Sizes, degeneracy and maximum degree of the graphs; the counts of 5-, 7- and 10-cliques and near-cliques obtained using Inverse-TS; the percent relative error in the estimates (for those graphs for which we were able to get exact numbers within 24 hours); and the time in seconds required to get the estimates. The rows whose type is $k$ show the number of $k$-cliques. For most instances, the algorithm terminated in minutes. Values marked with * have significant errors, which are addressed in Tab. 2.

| graph | k | type | revised estimate | revised % error | time |
|---|---|---|---|---|---|
| cit-Patents | 10 | $(k,1)$ | 648944 | 1.91 | 130 |
| cit-Patents | 10 | $(k,2)$ Type 1 | … | … | … |
| cit-Patents | 10 | $(k,2)$ Type 2 | 3.69E+07 | 0.27 | 130 |
| soc-pokec | 5 | $(k,1)$ | 3.91E+08 | 0.51 | 284 |
| soc-pokec | 5 | $(k,2)$ Type 1 | … | … | … |
| soc-pokec | 7 | $(k,1)$ | 4.92E+08 | 0.01 | 288 |
| soc-pokec | 7 | $(k,2)$ Type 1 | … | … | … |
| soc-pokec | 7 | $(k,2)$ Type 2 | 6.27E+08 | 0.47 | 298 |

Table 2: Revised estimates, revised error and time in seconds for the counts of near-cliques obtained using PEANUTS with 500K samples for the erroneous estimates in Tab. 1 (marked with *).

Running time:
The runtimes for near-cliques of size 7 are presented in Tab. 1. We show the time for a single run in each case. In all cases except com-orkut, the algorithm terminated in minutes (for com-orkut, it took less than a day), whereas color-coding (cc) and brute force (bf) did not terminate in an entire day (and in some cases, even after 5 days).
Comparison with other algorithms:
Our exact brute-force procedure is a well-tuned algorithm that uses the degeneracy ordering and exhaustively searches outneighborhoods for cliques (based on the approach by Chiba-Nishizeki [9]). Once a clique is found, we count all the near-cliques the clique is a part of and sum this quantity over all cliques.

On average, color-coding took between 2x and 100x the time taken by Inverse-TS, while giving poorer accuracy. Brute force took even more time. Inverse-TS has reduced the time required to obtain these estimates from days to minutes.
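The brute-force baseline can be illustrated for $(k,1)$-cliques with the short Python sketch below. It is a simplified illustration under stated assumptions, not the tuned implementation: edges are oriented by vertex id rather than by the degeneracy ordering, the comparison $nbr > w$ in Func-$(k,1)$-Clique is taken to be a comparison of vertex ids, and all function names are hypothetical.

```python
def cliques(adj, size):
    """Yield every clique with `size` vertices exactly once, as a sorted tuple.

    adj maps each vertex to the set of its neighbors (simple undirected graph).
    """
    def extend(prefix, cands):
        if len(prefix) == size:
            yield tuple(prefix)
            return
        for v in sorted(cands):
            # Only extend with higher-id common neighbors, so each clique
            # is generated once (in increasing vertex order).
            yield from extend(prefix + [v],
                              {u for u in cands if u > v and u in adj[v]})
    yield from extend([], set(adj))


def func_k1(adj, K):
    """Number of (k,1)-cliques K ∪ {nbr} in which nbr exceeds its missing partner.

    K is a (k-1)-clique with at least 2 vertices.
    """
    u, v = K[0], K[1]
    count = 0
    for nbr in (adj[u] | adj[v]) - set(K):
        missing = [w for w in K if w not in adj[nbr]]
        if len(missing) == 1 and nbr > missing[0]:
            count += 1
    return count


def count_k1_brute_force(adj, k):
    """Exact number of (k,1)-cliques: sum func_k1 over all (k-1)-cliques."""
    return sum(func_k1(adj, K) for K in cliques(adj, k - 1))


# Usage: a 4-clique with the edge (0, 3) removed is itself a (4,1)-clique.
near = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(count_k1_brute_force(near, 4))
```

Each $(k,1)$-clique contains exactly two $(k-1)$-cliques, and the condition on the missing partner ensures that only one of them contributes, so no near-clique is counted twice.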
Figure 3: Fig. 3i, Fig. 3ii, Fig. 3iii show convergence over 100 runs of Inverse-TS using number of samples in [10K, 50K, 100K, 500K, 1M] for all near-clique types (web-Google, $k=7$; methods bf, cc and inv-ts; x-axis: number of samples): (i) $(k,1)$-clique, (ii) Type 1 $(k,2)$-clique, (iii) Type 2 $(k,2)$-clique. The red line indicates the true value.

Figure 4: Time required by Inverse-TS (inv-ts), color-coding (cc) and brute force (bf) to estimate the number of Type 1 and Type 2 $(k,2)$-cliques, respectively, in 10 real-world graphs for $k = 7$ (x-axis: graphs; y-axis: time in seconds). The red line indicates 86400 seconds (24 hours).

One of the important applications of near-cliques is in finding missing edges that likely should have been present in the graph in the first place. We deployed our algorithm on a citation network [37]. Using Inverse-TS we were able to obtain several sets of papers in which every pair of papers either cited or was cited by the other paper (depending on the chronological order of the papers), except for one pair. One $(7,1)$-clique we obtained comprised the papers with the following titles:
(1) A ray tracing solution for diffuse interreflection
(2) Distributed ray tracing
(3) A global illumination solution for general reflectance distributions
(4) Adaptive radiosity textures for bidirectional ray tracing
(5) The rendering equation
(6) A two-pass solution to the rendering equation: A synthesis of ray tracing and radiosity methods
(7) A framework for realistic image synthesis
in which only (…) and (…) were not connected. Thus, by mining near-cliques one can discover missing links and offer suggestions for which items should be related. In applications where the data is known to be noisy, it would be interesting to see how the properties of the graph change upon adding these (possibly) missing links and obtaining a more complete picture.

Listing near-cliques:
In some applications of near-cliques, a u.a.r. sample of near-cliques may be required. Suppose we want to provide a u.a.r. sample of Type 1 $(k,2)$-cliques for a given $k$. PEANUTS allows us to sample cliques u.a.r. Once a clique $K$ is sampled, suppose we return a u.a.r. Type 1 $(k,2)$-clique that $K$ participates in. Let $J$ be a Type 1 $(k,2)$-clique that $K$ participates in; then the probability of $J$ being returned is inversely proportional to $f(K)$. In other words, this approach does not give us a u.a.r. sample of Type 1 $(k,2)$-cliques. However, if we list all the Type 1 $(k,2)$-cliques that $K$ participates in, and repeat this process for several different $K$, then even though the samples in the list may be correlated, every Type 1 $(k,2)$-clique in $G$ has equal probability of being put in the list. In applications where some amount of correlation in samples is tolerable, such a list can be useful.

We leverage the fast clique counting algorithm TuránShadow to count near-cliques that are essentially $k$-cliques missing 1 or 2 edges, for $k$ up to 10. The proposed algorithm gives significant savings in space and time compared to the state of the art.

One could generalize the definition of near-cliques to larger values of $r$ and define a $(k,r)$-clique as a $k$-clique that is missing exactly $r$ edges. It would be interesting to see how far $r$ can be increased such that near-clique counting would still be feasible using this clique-centered approach.

ACKNOWLEDGMENTS
Shweta Jain and C. Seshadhri acknowledge the support of NSF Awards CCF-1740850, CCF-1813165, and ARO Award W911NF1910294.
REFERENCES
[1] Noga Alon, Raphy Yuster, and Uri Zwick. 1994. Color-coding: A New Method for Finding Simple Paths, Cycles and Other Small Subgraphs Within Large Graphs. In Symposium on the Theory of Computing (STOC) (Montreal, Quebec, Canada). 326–335. https://doi.org/10.1145/195058.195179
[2] J Ignacio Alvarez-Hamelin, Luca Dall’Asta, Alain Barrat, and Alessandro Vespignani. 2006. Large scale networks fingerprinting and visualization using the k-core decomposition. In Advances in neural information processing systems. 41–50.
[3] R. Andersen and K. Chellapilla. 2009. Finding Dense Subgraphs with Size Bounds. In Workshop on Algorithms and Models for the Web-Graph (WAW). 25–37.
[4] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. 2008. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD’08. 16–24. https://doi.org/10.1145/1401890.1401898
[5] Mansurul A Bhuiyan, Mahmudur Rahman, Mahmuda Rahman, and Mohammad Al Hasan. 2012. Guise: Uniform sampling of graphlets for large graph analysis. In … IEEE, 91–100.
[6] I. Bordino, D. Donata, A. Gionis, and S. Leonardi. 2008. Mining Large Networks with Subgraph Counting. In Proceedings of International Conference on Data Mining. 737–742.
[7] Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. 2018. Motif Counting Beyond Five Nodes. ACM Transactions on Knowledge Discovery from Data (TKDD) 12, 4 (2018), 48.
[8] Jie Chen and Yousef Saad. 2010. Dense subgraph extraction with application to community detection. IEEE Transactions on knowledge and data engineering 24, 7 (2010), 1216–1230.
[9] Norishige Chiba and Takao Nishizeki. 1985. Arboricity and subgraph listing algorithms. SIAM J. Comput. 14 (1985), 210–223. Issue 1. https://doi.org/10.1137/0214017
[10] Radu Curticapean, Holger Dell, and Dániel Marx. 2017. Homomorphisms are a good basis for counting small subgraphs. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing. ACM, 210–223.
[11] Maximilien Danisch, Oana Balalau, and Mauro Sozio. 2018. Listing k-cliques in Sparse Real-World Graphs. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 589–598.
[12] Devdatt Dubhashi and Alessandro Panconesi. 2009. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press.
[13] Ethan R Elenberg, Karthikeyan Shanmugam, Michael Borokhovich, and Alexandros G Dimakis. 2016. Distributed estimation of graph 4-profiles. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 483–493.
[14] Irene Finocchi, Marco Finocchi, and Emanuele G. Fusco. 2015. Clique Counting in MapReduce: Algorithms and Experiments. ACM Journal of Experimental Algorithmics 20 (2015). https://doi.org/10.1145/2794080
[15] Eugene Fratkin, Brian T Naughton, Douglas L Brutlag, and Serafim Batzoglou. 2006. MotifCut: regulatory motifs finding with maximum density subgraphs. Bioinformatics 22, 14 (2006), e150–e157.
[16] Guyue Han and Harish Sethu. 2016. Waddling random walk: Fast and accurate mining of motif statistics in large graphs. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 181–190.
[17] Tomaž Hočevar and Janez Demšar. 2017. Combinatorial algorithm for counting small induced graphs and orbits. PloS one 12, 2 (2017), e0171428.
[18] P. Holland and S. Leinhardt. 1970. A method for detecting structure in sociometric data. Amer. J. Sociology 76 (1970), 492–513.
[19] Shweta Jain and C Seshadhri. 2017. A Fast and Provable Method for Estimating Clique Counts Using Turán’s Theorem. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 441–449.
[20] M. Jha, C. Seshadhri, and A. Pinar. 2015. Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts. In World Wide Web (WWW). 495–505.
[21] Daniel M Kane, Kurt Mehlhorn, Thomas Sauerwald, and He Sun. 2012. Counting arbitrary subgraphs in data streams. In International Colloquium on Automata, Languages, and Programming. Springer, 598–609.
[22] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. 1999. Trawling the Web for emerging cyber-communities. Computer networks 31, 11-16 (1999), 1481–1493.
[23] Guimei Liu and Limsoon Wong. 2008. Effective pruning techniques for mining quasi-cliques. In Joint European conference on machine learning and knowledge discovery in databases. Springer, 33–49.
[24] David W Matula and Leland L Beck. 1983. Smallest-last ordering and clustering and graph coloring algorithms. Journal of the ACM (JACM) 30, 3 (1983), 417–427.
[25] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. 2002. Network motifs: Simple building blocks of complex networks. Science …
[26] … Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 601–610.
[27] Jeffrey Pattillo, Nataly Youssef, and Sergiy Butenko. 2012. Clique relaxation models in social network analysis. In Handbook of Optimization in Complex Networks. Springer, 143–162.
[28] Ali Pinar, C Seshadhri, and Vaidyanathan Vishal. 2017. Escape: Efficiently counting all 5-vertex subgraphs. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1431–1440.
[29] Nataša Pržulj. 2007. Biological network comparison using graphlet degree distribution. Bioinformatics 23, 2 (2007), e177–e183.
[30] Ryan A Rossi, David F Gleich, and Assefaw H Gebremedhin. 2015. Parallel Maximum Clique Algorithms with Applications to Network Analysis. SIAM Journal on Scientific Computing 37, 5 (2015), C589–C616.
[31] Ahmet Erdem Sariyuce, C Seshadhri, Ali Pinar, and Umit V Catalyurek. 2015. Finding the hierarchy of dense subgraphs using nucleus decompositions. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 927–937.
[32] Ahmet Erdem Sariyüce, C. Seshadhri, Ali Pinar, and Ümit V. Çatalyürek. 2015. Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions. (2015), 927–937.
[33] C. Seshadhri, Tamara G. Kolda, and Ali Pinar. 2012. Community structure and scale-free collections of Erdös-Rényi graphs. Physical Review E 85, 5 (May 2012), 056109. https://doi.org/10.1103/PhysRevE.85.056109
[34] Miguel EP Silva, Pedro Paredes, and Pedro Ribeiro. 2017. Network motifs detection using random networks with prescribed subgraph frequencies. In Workshop on Complex Networks CompleNet. Springer, 17–29.
[35] Ann Sizemore, Chad Giusti, and Danielle S. Bassett. 2016. Classification of weighted networks through mesoscale homological features. Journal of Complex Networks …
[36] …
[37] … KDD’08. 990–998.
[38] C. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. Tsiarli. 2013. Denser Than the Densest Subgraph: Extracting Optimal Quasi-cliques with Quality Guarantees. In Knowledge Data and Discovery (KDD).
[39] Charalampos E. Tsourakakis. 2015. The K-clique Densest Subgraph Problem. In Proceedings of the Conference on World Wide Web WWW. 1122–1132. https://doi.org/10.1145/2736277.2741098
[40] Charalampos E. Tsourakakis, Jakub W. Pachocki, and Michael Mitzenmacher. 2016. Scalable motif-aware graph clustering. CoRR abs/1606.06235 (2016). http://arxiv.org/abs/1606.06235
[41] Johan Ugander, Lars Backstrom, and Jon M. Kleinberg. 2013. Subgraph frequencies: mapping the empirical and extremal geography of large graph collections. In WWW, Daniel Schwabe, Virgílio A. F. Almeida, Hartmut Glaser, Ricardo A. Baeza-Yates, and Sue B. Moon (Eds.). International World Wide Web Conferences Steering Committee / ACM, 1307–1318.
[42] Pinghui Wang, John Lui, Bruno Ribeiro, Don Towsley, Junzhou Zhao, and Xiaohong Guan. 2014. Efficiently estimating motif statistics of large networks. ACM Transactions on Knowledge Discovery from Data (TKDD) 9, 2 (2014), 8.
[43] Pinghui Wang, Junzhou Zhao, Xiangliang Zhang, Zhenguo Li, Jiefeng Cheng, John CS Lui, Don Towsley, Jing Tao, and Xiaohong Guan. 2018. MOSS-5: A fast method of approximating counts of 5-node graphlets in large graphs. IEEE Transactions on Knowledge and Data Engineering 30, 1 (2018), 73–86.
[44] Sebastian Wernicke. 2006. Efficient Detection of Network Motifs. IEEE/ACM Trans. Comput. Biology Bioinform. 3, 4 (2006), 347–359.
[45] Hao Yin, Austin R Benson, Jure Leskovec, and David F Gleich. 2017. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 555–564.
[46] Haiyuan Yu, Alberto Paccanaro, Valery Trifonov, and Mark Gerstein. 2006. Predicting interactions in protein networks by completing defective cliques.