Improved 3-pass Algorithm for Counting 4-cycles in Arbitrary Order Streaming
arXiv [cs.DS]

Sofya Vorotnikova
Dartmouth College
Abstract
The problem of counting small subgraphs, and specifically cycles, in the streaming model has received a lot of attention over the past few years. In this paper, we consider arbitrary order insertion-only streams, improving over the state-of-the-art result on counting 4-cycles. Our algorithm computes a $(1+\epsilon)$-approximation by taking three passes over the stream and using space $O\!\left(\frac{m \log n}{\epsilon^2 T^{1/3}}\right)$, where $m$ is the number of edges in the graph and $T$ is the number of 4-cycles.

Subgraph counting is a fundamental graph problem and an important primitive in massive graph analysis. It has many applications in data mining and analyzing the structure of large networks. This problem has also received a lot of attention in the streaming community, with the main focus on counting triangles [1–9, 13–15]. Several papers considered counting larger cycles and cliques [3, 9, 12], and a few studied arbitrary subgraphs of constant size [3, 10, 11]. There is also work on counting 4-cycles in the case when the underlying graph is bipartite [16]. Since a 4-cycle is also a 2-by-2 biclique, it is the most basic motif in bipartite graphs and plays essentially the same role as a triangle does in general graphs.

In this paper, we concentrate on counting 4-cycles in the arbitrary order insertion-only streaming model, improving over the state-of-the-art algorithm presented by McGregor and Vorotnikova [13].
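As a point of reference for the quantity $T$ that the streaming algorithm approximates, the following minimal sketch counts 4-cycles exactly when the whole graph fits in memory. It is not part of the paper's algorithm; the function name `count_four_cycles` and the edge-list input format are assumptions made for illustration, and the graph is assumed to be simple (no self-loops or parallel edges). It relies on the standard identity that every 4-cycle has exactly two pairs of opposite vertices, so summing $\binom{c}{2}$ over all vertex pairs with $c$ common neighbors counts each cycle twice.

```python
from itertools import combinations
from math import comb


def count_four_cycles(edges):
    """Exact number of 4-cycles in a simple undirected graph (offline baseline)."""
    # Build adjacency sets from the (hypothetical) in-memory edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Each 4-cycle is found once from each of its two opposite-vertex pairs.
    twice_total = 0
    for u, v in combinations(list(adj), 2):
        common = len(adj[u] & adj[v])
        twice_total += comb(common, 2)
    return twice_total // 2


square = [(0, 1), (1, 2), (2, 3), (3, 0)]
k4 = list(combinations(range(4), 2))
print(count_four_cycles(square), count_four_cycles(k4))  # prints: 1 3
```

This baseline needs space proportional to the whole graph, which is exactly what the streaming algorithm avoids; it is only useful for checking estimates on small inputs.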
Throughout this paper, we use $n$ to denote the number of vertices in the graph, $m$ to denote the number of edges, and $T$ for the number of 4-cycles. Note that our algorithm is parameterized in terms of $T$, which is a convention adopted in the literature. In practice, the quantities in the algorithm would be initialized based on a promised lower bound on $T$.

Our result is as follows.

Theorem 1.
There exists an $O\!\left(\frac{m \log n}{\epsilon^2 T^{1/3}}\right)$ space algorithm that takes three passes over an arbitrary order stream and returns a $(1+\epsilon)$ multiplicative approximation to the number of 4-cycles in the graph with probability at least / .

∗ This work is supported by NSF Award

By running $\Theta(\log 1/\delta)$ copies of the algorithm in parallel and taking the median of their outputs, we can increase the success probability to $1-\delta$, where $\delta \in (0, 1)$.

Our result improves over the Õ(m/T / ) space algorithm by McGregor and Vorotnikova [13]. It takes the same number of passes over the stream and has the same approximation guarantees. We believe that the space of our algorithm is tight; however, the best known lower bound is currently Ω(m/T / ) [13].

In [3], Bera and Chakrabarti present a different 4-cycle counting algorithm which takes four passes and uses space $\tilde{O}(m^2/T)$. Note that the space used by our algorithm is as good or better when $T = O(m^{3/2})$. McGregor and Vorotnikova [13] also present a 2-pass Õ(m / /T / ) space algorithm which distinguishes between graphs with 0 and $T$ 4-cycles.

A wedge is a path of length 2. For wedge $(u, b, v)$ we call vertices $u$ and $v$ the endpoints of the wedge and vertex $b$ the center.

We use $\Gamma(v)$ to denote the set of neighbors of vertex $v$. Consider sets of vertices $\{u, v\}$ and $\Gamma(u) \cap \Gamma(v)$. Edges between these two sets form a complete bipartite graph, which we call a diamond with endpoints $u$ and $v$. We say that wedge $w$ is a part of diamond $d$ if they have the same endpoints. Note that a diamond with endpoints $u$ and $v$ consists of $|\Gamma(u) \cap \Gamma(v)|$ wedges and involves $\binom{|\Gamma(u) \cap \Gamma(v)|}{2}$ 4-cycles. We use $t(e)$, $t(w)$, and $t(d)$ to denote the number of 4-cycles involving edge $e$, wedge $w$, or involved in diamond $d$ respectively. For any quantity $k$, we use $\hat{k}$ to denote its estimate.

In Section 2.3, we define heavy/light edges, wedges, and diamonds, where “heavy” roughly corresponds to “involved in many 4-cycles” and “light” to “involved in few 4-cycles”.
Note that these are defined by the algorithm and depend on the collected samples of vertices and edges. We define $T_H$ to be the number of 4-cycles with at least one heavy wedge and $T_L$ as the number of 4-cycles with no heavy wedges and at most one heavy edge.

The most basic algorithm approximating the number of 4-cycles in a graph is as follows:

Pass 1: Sample edges with probability $p$, call set $S$.

Pass 2: For each edge $e$ in the stream, let $s(e)$ be the number of 3-paths with all edges in $S$ that $e$ completes to a 4-cycle.

Return: $\frac{1}{4p^3} \sum_{e \in E} s(e)$.

In expectation, the value returned by this algorithm is $T$: each 4-cycle is found from each of its 4 edges exactly when the other 3 edges are all sampled, which happens with probability $p^3$. However, due to the fact that some edges or wedges in the graph can be involved in a large number of 4-cycles, the variance of this estimator is large. If an edge or wedge participates in many 4-cycles, call it “bad”. In this paper, we show that it is possible to identify such bad edges and wedges and take care of them separately, leading to an accurate approximation.

We observe that if wedge $(u, b, v)$ is bad, then it is a part of a large diamond with endpoints $u$ and $v$. If we sample $\tilde{\Omega}(1)$ vertices in $\Gamma(u) \cap \Gamma(v)$ and collect all incident edges, we will detect the diamond and accurately estimate its size. Using this method, we approximate the total number of cycles with bad wedges.

We then separately approximate the number of cycles with no bad wedges and at most one bad edge. This procedure follows the same template as the arbitrary order 4-cycle counting algorithm in [13]. Sampling edges uniformly at a certain rate allows us to obtain some 3-paths which are involved in 4-cycles with no bad wedges. Additionally, sampling vertices uniformly and storing all incident edges allows us to build an oracle roughly classifying edges as good or bad. We use this oracle to compute the number of bad edges in each of the cycles we discover. Note that the oracle takes an extra pass over the stream, and thus in total our algorithm uses three passes.

(We use $\tilde{O}(\cdot)$ notation to hide $\mathrm{polylog}(n)$ and $1/\epsilon$ factors.)

The algorithm in this section computes estimates to $T_H$ and $T_L$ separately and then returns their sum. We later show that $\hat{T}_H + \hat{T}_L$ is an accurate approximation of $T$.

Within the algorithm, we define heavy/light diamonds and wedges. Roughly speaking, a heavy diamond consists of $\Omega(T^{1/3})$ wedges and a light diamond consists of $O(T^{1/3})$ wedges.
A wedge is then defined as heavy or light if it is a part of a heavy or light diamond respectively. In the third pass, we refer to the oracle which classifies edges as heavy or light. It is described separately after the main algorithm.

Pass 1:
• Let $p = \frac{c \log n}{\epsilon^2 T^{1/3}}$.
• Sample edges with probability $p$, call set $S_E$.
• Sample vertices with probability $p$, call set $Q_V$. Collect all incident edges, call set $Q_E$.
• Sample vertices with probability $p$, call set $Z_V$. Collect all incident edges, call set $Z_E$.

After Pass 1:
• For a pair of vertices $(u, v)$, let $q(u, v)$ be the number of wedges with center in $Q_V$ and endpoints $u$ and $v$.
• Define diamond $d$ with endpoints $u$ and $v$ to be heavy if $q(u, v) \ge pT^{1/3}$ and light otherwise. Let $\hat{t}(d) = \binom{q(u,v)/p}{2}$.
• Define wedge $w$ with endpoints $u$ and $v$ to be heavy if it is part of a heavy diamond and light otherwise. Let $\hat{t}(w) = q(u, v)/p - 1$.
• Find all pairs of vertices $(u, v)$ which are endpoints of heavy diamonds/wedges.
• Let $\hat{T}_H = \sum_d \hat{t}(d)$, where the sum is over heavy diamonds $d$.

Pass 2:
For every edge $e$ in the stream:
• Check if $e$ completes any 3 edges from $S_E$ to a 4-cycle (call it $\tau$). Check whether $\tau$ has a heavy wedge; if not, store $(e, \tau)$.

Pass 3:
• For all edges involved in cycles stored in pass 2, use oracle($Z_V$, $Z_E$) to classify them as heavy or light.
• Let $A_0$ be the number of $(e, \tau)$ pairs s.t. $\tau$ has no heavy edges.
• Let $A_1$ be the number of $(e, \tau)$ pairs s.t. $e$ is heavy and the other 3 edges in $\tau$ are light.
• Let $\hat{T}_L = A_0/(4p^3) + A_1/p^3$.

Return: $\hat{T}_H + \hat{T}_L$

Oracle.
Below, we describe the oracle which classifies edges as heavy or light. Roughly speaking, heavy edges are involved in $\Omega(T^{2/3})$ 4-cycles and light edges in $O(T^{2/3})$.

Suppose that we need to classify edge $e = (u, v)$ as heavy or light. We then look at edges sharing a vertex with $e$. In the post-processing of the first pass, we determined all pairs of vertices which are endpoints of heavy diamonds/wedges. Thus, for wedge $(e, e')$ we can refer to that list to check whether it is heavy or not. If it is heavy, we also get an estimate of the number of 4-cycles it is involved in, which thus contributes to $t(e)$. Separately, we approximate the total number of 4-cycles on $e$ which involve two light wedges $(e, e')$ and $(e, e'')$.

oracle($Z_V$, $Z_E$, $e$):
• Let $\hat{t}_H(e) \leftarrow 0$ and $\hat{t}_L(e) \leftarrow 0$.
• For wedges of the form $(e, e')$, where $e' \in Z_E$: if $(e, e')$ is heavy, “exclude” $e'$ from $Z_E$.
• For each edge $e^*$ in the stream, s.t. $e^*$ shares a vertex with $e$:
  – Look up whether $(e, e^*)$ is heavy.
  – If heavy, $\hat{t}_H(e) \leftarrow \hat{t}_H(e) + \hat{t}(e, e^*)$.
  – If light and $e^* = (v, a)$, let $\lambda(e, e^*)$ be the number of vertices $b \in Z_V$ such that $(u, v, a, b)$ is a 4-cycle. $\hat{t}_L(e) \leftarrow \hat{t}_L(e) + \lambda(e, e^*)/p$.
• Let $\hat{t}(e) = \hat{t}_H(e) + \hat{t}_L(e)$.
• Return: L (light edge) if $\hat{t}(e) < T^{2/3}$; H (heavy edge) if $\hat{t}(e) \ge T^{2/3}$.

In Lemma 2, we show that light edges are involved in at most $4T^{2/3}$ 4-cycles and heavy edges in at least $T^{2/3}/4$.

Lemma 2.
With high probability,
a. oracle($Z_V$, $Z_E$, $e$) = L implies $t(e) \le 4T^{2/3}$;
b. oracle($Z_V$, $Z_E$, $e$) = H implies $t(e) \ge T^{2/3}/4$.

Proof.
Let $t_H(e)$ be the number of 4-cycles on $e$, where $e$ is a part of a heavy wedge. Let $t_L(e) = t(e) - t_H(e)$. Let $\hat{t}_H(e)$ and $\hat{t}_L(e)$ be our estimates of those two quantities.

Note that in the process of approximating $t_H(e)$, we are double-counting 4-cycles with two heavy wedges involving $e$. However, we can show that this double-count is negligible. Let $D(e)$ be the number of heavy diamonds which involve $e$. Since each 4-cycle can belong to at most 2 diamonds, we are double-counting at most $D(e)$ cycles. From Lemma 3 part (b), it follows that the number of 4-cycles in a heavy diamond is at least $\binom{T^{1/3}/2}{2} \ge T^{2/3}/9$. Therefore, t_H(e) ≥ D(e) T / / D(e) ≤ T / (T / / ≤ T / . If $T$ is sufficiently large, then $D(e) < (\epsilon/\ )\, t_H(e)$.

From Lemma 3 part (c) it follows that

$$\sum_{\text{heavy } w:\ e \in w} \hat{t}(w) = (1 \pm \epsilon/2) \sum_{\text{heavy } w:\ e \in w} t(w).$$

Taking double-counting into account,

$$\hat{t}_H(e) = (1 \pm \epsilon)\, t_H(e). \qquad (1)$$

Recall that $e = (u, v)$ and let $X_b$ be the number of cycles $(u, v, a, b)$ with no heavy wedges if $b \in Z_V$, and 0 otherwise. Let $X_L = \sum_{b \in V} X_b$ and note that $E[X_L] = p\, t_L(e) = p\, E[\hat{t}_L(e)]$. If $t_L(e) < T^{2/3}$, then $E[\hat{t}_L(e)] < T^{2/3}$, and from the Chernoff bound it follows that

$$P\left[\,|\hat{t}_L(e) - t_L(e)| \ge T^{2/3}/\ \right] = P\left[\,|X_L - p\, t_L(e)| \ge pT^{2/3}/\ \right] \le \exp\!\left(-\frac{\ pT^{2/3}}{\ T^{1/3}}\right) \le 1/\mathrm{poly}(n), \qquad (2)$$

where the first inequality follows from the fact that $X_b \le T^{1/3}$ for all $b$. Similarly, if $t(e) \ge T^{2/3}$, then

$$P\left[\,|\hat{t}_L(e) - t_L(e)| \ge t(e)/\ \right] \le 1/\mathrm{poly}(n). \qquad (3)$$

Footnote: When we talk about “excluding” edges from $Z_E$, we need to “exclude” different sets of edges for different instances of the oracle. In practice, for each instance mark those edges and ignore them. However, they might be used by other instances.

We first prove the contrapositive of (a). Assume $t(e) > 4T^{2/3}$. Then from Eq. 1 (taking $\epsilon = 1/4$) and Eq. 3,

$$\hat{t}(e) \ge (t_H(e) - t_H(e)/4) + (t_L(e) - t(e)/\ ) \ge t(e) - t(e)/\ > T^{2/3}.$$

Similarly, we prove the contrapositive of (b) from Eq. 1 and 2. If $t(e) < T^{2/3}/4$, then

$$\hat{t}(e) \le (t_H(e) + t_H(e)/4) + (t_L(e) + T^{2/3}/\ ) < T^{2/3}.$$

Estimating $T_H$

In Lemma 3, we prove that we can distinguish between large and small diamonds and estimate the number of 4-cycles in a heavy diamond or on a heavy wedge.
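To make the quantities in the following lemma concrete, here is a minimal offline sketch of the diamond estimates from the algorithm: sample centers with probability $p$, count sampled wedges $q(u, v)$ per endpoint pair, call a pair heavy when $q(u, v) \ge pT^{1/3}$, and estimate its 4-cycle count by $\binom{q(u,v)/p}{2}$. The function name, the in-memory edge list, and the `rng` parameter are assumptions for illustration only; a streaming implementation would instead collect $Q_E$ during the first pass.

```python
import random
from itertools import combinations


def heavy_diamond_estimates(edges, p, T, rng=random):
    """Return {(u, v): estimated 4-cycles} for endpoint pairs classified as heavy."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Q_V: each vertex is kept independently with probability p.
    sampled_centers = [b for b in adj if rng.random() < p]

    # q[(u, v)]: number of wedges (u, b, v) whose center b was sampled.
    # Vertex labels are assumed to be mutually comparable so pairs can be sorted.
    q = {}
    for b in sampled_centers:
        for u, v in combinations(sorted(adj[b]), 2):
            q[(u, v)] = q.get((u, v), 0) + 1

    threshold = p * T ** (1.0 / 3.0)
    estimates = {}
    for pair, count in q.items():
        if count >= threshold:
            k = count / p                      # estimate of the wedge count w(d)
            estimates[pair] = k * (k - 1) / 2  # "k choose 2" for real-valued k
    return estimates


# K_{2,6}: vertices 0 and 1 share six common neighbors, so T = 15.
edges = [(side, mid) for side in (0, 1) for mid in range(2, 8)]
print(heavy_diamond_estimates(edges, p=1.0, T=15))  # prints: {(0, 1): 15.0}
```

With `p=1.0` every center is sampled, so the estimate equals the exact count for the single heavy diamond; the pairs among the six middle vertices have only two common neighbors each and fall below the threshold. For `p < 1` the output is random and only concentrates around the true value under the conditions of the lemma.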
Lemma 3.
Let $w(d)$ be the number of wedges in diamond $d$, and let $q(d)$ be the number of those wedges with center in $Q_V$. Recall that $t(d)$ is the number of 4-cycles in diamond $d$, and $t(w)$ is the number of 4-cycles involving wedge $w$. Then with high probability,
a. If diamond $d$ is light ($q(d) < pT^{1/3}$), then $w(d) \le 2T^{1/3}$.
b. If diamond $d$ is heavy ($q(d) \ge pT^{1/3}$), then $w(d) \ge T^{1/3}/2$.
c. If wedge $w$ is heavy, then $\hat{t}(w) = q(d)/p - 1 = (1 \pm \epsilon/2)\, t(w)$.
d. If diamond $d$ is heavy, then $\hat{t}(d) = \binom{q(d)/p}{2} = (1 \pm \epsilon/\ )\, t(d)$.

Proof.
Observe that $q(d) \sim \mathrm{Bin}(w(d), p)$. By an application of the Chernoff bound, if $w(d) \ge 2T^{1/3}$, then

$$P\left[q(d) < pT^{1/3}\right] \le \exp\!\left(-pT^{1/3}/\ \right) \le 1/\mathrm{poly}(n),$$

proving (a). Statement (b) is proved similarly.

Note that the number of 4-cycles in a diamond grows as the square of the number of wedges. Therefore, to get a $(1 + \epsilon/\ )$-approximation of $t(d)$, we need to estimate $w(d)$ to a higher accuracy. If $q(d) \ge pT^{1/3}$, from Chernoff it follows that

$$P\left[\,|q(d) - w(d)p| \ge (\epsilon/\ )\, w(d)p\,\right] \le \exp\!\left(-\epsilon^2 w(d)p/\ \right) \le 1/\mathrm{poly}(n).$$

Recall that if a diamond consists of $k$ wedges, then the number of 4-cycles on each of those wedges is $k - 1$. Therefore, statement (c) follows since $\epsilon/\ < \epsilon/2$. Statement (d) follows since $\binom{(1+\epsilon/\ )\,w(d)}{2} \le (1 + \epsilon/\ )\binom{w(d)}{2}$ and $\binom{(1-\epsilon/\ )\,w(d)}{2} \ge (1 - \epsilon/\ )\binom{w(d)}{2}$.

Lemma 4.
With high probability, $\hat{T}_H = T_H \pm \epsilon T/\ $.

Proof. First, note that our algorithm double-counts 4-cycles which are involved in two heavy diamonds. As was mentioned before, the number of 4-cycles in a heavy diamond is at least $T^{2/3}/9$, and thus the number of heavy diamonds is at most $5T^{1/3}$. Since two diamonds can have at most one cycle in common, we are double-counting at most $25T^{2/3} \le (\epsilon/\ )\,T$ cycles. The rest of the proof follows from Lemma 3 part (d).

Estimating $T_L$

Lemma 5.
With constant probability, $\hat{T}_L = T_L \pm \epsilon T/2$.

Proof. Let $T_i$ be the number of 4-cycles in $T_L$ with $i$ heavy edges. Let $\hat{T}_0 = A_0/(4p^3)$ and $\hat{T}_1 = A_1/p^3$. Note that $E[\hat{T}_0] = T_0$ and $E[\hat{T}_1] = T_1$.

We now show that with constant probability, $\hat{T}_0 = T_0 \pm \epsilon T/4$ and $\hat{T}_1 = T_1 \pm \epsilon T/4$. By Chebyshev's inequality, $P\left[\,|\hat{T}_0 - T_0| \ge \epsilon T/4\,\right] \le 1/16$ follows from $V[\hat{T}_0] \le \epsilon^2 T^2/256$. We start with $\hat{T}_0$. Let $\mathcal{H}$ be the set of 3-paths which are involved in 4-cycles in $T_0$. Let $X_q$ be 1 if all 3 edges of path $q \in \mathcal{H}$ were sampled and 0 otherwise. Then

$$
\begin{aligned}
V[\hat{T}_0] &= V\Big[\frac{1}{4p^3} \sum_{q \in \mathcal{H}} X_q\Big] = \frac{1}{16p^6}\Big(\sum_{q \in \mathcal{H}} V[X_q] + \sum_{q,t \in \mathcal{H}:\ q \ne t,\ q \cap t \ne \emptyset} \mathrm{COV}[X_q, X_t]\Big) \\
&\le \frac{1}{16p^6}\Big(\sum_{q \in \mathcal{H}} E[X_q^2] + \sum_{q,t \in \mathcal{H}:\ q \ne t,\ q \cap t \ne \emptyset} E[X_q X_t]\Big) \\
&\le \frac{1}{16p^6}\Big(\sum_{q \in \mathcal{H}} p^3 + \sum_{q \in \mathcal{H}} \sum_{t \in \mathcal{H}:\ q \ne t,\ |q \cap t| = 1} p^5 + \sum_{q \in \mathcal{H}} \sum_{t \in \mathcal{H}:\ q \ne t,\ |q \cap t| = 2} p^4\Big) \\
&\le \frac{1}{16p^6}\Big(|\mathcal{H}|\, p^3 + \sum_{q \in \mathcal{H}} cT^{2/3} p^5 + \sum_{q \in \mathcal{H}} cT^{1/3} p^4\Big) \qquad (4) \\
&\le \frac{1}{16p^6}\Big(|\mathcal{H}|\, p^3 + c\,|\mathcal{H}|\, T^{2/3} p^5 + c\,|\mathcal{H}|\, T^{1/3} p^4\Big) \\
&\le \frac{T}{4p^3} + \frac{cT^{5/3}}{4p} + \frac{cT^{4/3}}{4p^2} \le \frac{\epsilon^2 T^2}{256}. \qquad (5)
\end{aligned}
$$

Equation 4 follows from the fact that any path $q \in \mathcal{H}$ intersects at most $12T^{2/3}$ other paths in $\mathcal{H}$ at one edge and at most $4T^{1/3}$ paths at two edges (from Lemmas 2 and 3). The step after it uses $|\mathcal{H}| \le 4T$, since each 4-cycle contains four 3-paths. Equation 5 follows from our definition of $p$.

Proving $P\left[\,|\hat{T}_1 - T_1| \ge \epsilon T/4\,\right] \le 1/16$ follows along the same lines.

Estimating $T$

We refer to one of the lemmas in [13], which bounds the number of 4-cycles with at most one edge which is involved in a lot of cycles.

Lemma 6 (McGregor and Vorotnikova [13]). We call an edge $e$ “bad” if it is contained in at least $\eta\sqrt{T}$ cycles. Then there are at least $(1 - \ /\eta)\,T$ cycles containing no more than one bad edge.

Applying this lemma with $\eta = T^{1/6}/4$, we get that the number of cycles with at most one bad edge is at least $(1 - \ /T^{1/6})\,T \ge (1 - \epsilon/\ )\,T$.

We can now prove the main lemma.

Lemma 7. With constant probability, $\tilde{T} = (1 \pm \epsilon)\,T$.

Proof. Let $T'_H$ be the number of cycles with at least one heavy wedge and at most one heavy edge. Note that $T'_H \le T_H$. Since good edges (with $t(e) \le T^{2/3}/4$) are classified as light w.h.p.,

$$(1 - \epsilon/\ )\,T \le T_L + T'_H \le T_L + T_H \le T,$$

where the first inequality follows from Lemma 6. The rest of the proof follows from Lemmas 4 and 5.

Sets $S_E$, $Q_E$, and $Z_E$ all have the same expected size $mp = O\!\left(\frac{m \log n}{\epsilon^2 T^{1/3}}\right)$. The expected number of cycles stored in pass 2 is $4Tp^3 = \tilde{O}(1)$. Finally, the extra space used by each instance of oracle($Z_V$, $Z_E$, $e$) is in expectation $\tilde{O}(1)$, since it keeps track of a constant number of counters and $\tilde{O}(1)$ “excluded” edges, corresponding to heavy wedges involving $e$ among the input of the instance. Therefore, the total space used by the algorithm is $O\!\left(\frac{m \log n}{\epsilon^2 T^{1/3}}\right)$.

References

[1] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Graph sketches: sparsification, spanners, and subgraphs. In
Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 5–14, 2012.

[2] Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 623–632, 2002.

[3] Suman K. Bera and Amit Chakrabarti. Towards Tighter Space Bounds for Counting Triangles and Other Substructures in Graph Streams. In Heribert Vollmer and Brigitte Vallée, editors, volume 66 of Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1–11:14, Dagstuhl, Germany, 2017. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[4] Vladimir Braverman, Rafail Ostrovsky, and Dan Vilenchik. How hard is counting triangles in the streaming model? In Automata, Languages, and Programming - 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part I, pages 244–254, 2013.

[5] Laurent Bulteau, Vincent Froese, Konstantin Kutzkov, and Rasmus Pagh. Triangle counting in dynamic graph streams. Algorithmica, 76(1):259–278, Sep 2016.

[6] Luciana S. Buriol, Gereon Frahling, Stefano Leonardi, Alberto Marchetti-Spaccamela, and Christian Sohler. Counting triangles in data streams. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 253–262, 2006.

[7] Graham Cormode and Hossein Jowhari. A second look at counting triangles in graph streams. Theor. Comput. Sci., 552:44–51, 2014.

[8] Hossein Jowhari and Mohammad Ghodsi. New streaming algorithms for counting triangles in graphs. In Proceedings of the 11th International Computing and Combinatorics Conference (COCOON), pages 710–716, 2005.

[9] John Kallaugher, Andrew McGregor, Eric Price, and Sofya Vorotnikova. The complexity of counting cycles in the adjacency list streaming model. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 119–133, 2019.

[10] John Kallaugher and Eric Price. A hybrid sampling scheme for triangle counting. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '17, pages 1778–1797, Philadelphia, PA, USA, 2017. Society for Industrial and Applied Mathematics.

[11] Daniel M. Kane, Kurt Mehlhorn, Thomas Sauerwald, and He Sun. Counting arbitrary subgraphs in data streams. In Proceedings of the 39th International Colloquium Conference on Automata, Languages, and Programming - Volume Part II, ICALP '12, pages 598–609, Berlin, Heidelberg, 2012. Springer-Verlag.

[12] Madhusudan Manjunath, Kurt Mehlhorn, Konstantinos Panagiotou, and He Sun. Approximate counting of cycles in streams. In Algorithms - ESA 2011 - 19th Annual European Symposium, Saarbrücken, Germany, September 5-9, 2011. Proceedings, pages 677–688, 2011.

[13] Andrew McGregor and Sofya Vorotnikova. Triangle and four cycle counting in the data stream model. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '20, pages 445–456, New York, NY, USA, 2020. Association for Computing Machinery.

[14] Andrew McGregor, Sofya Vorotnikova, and Hoa T. Vu. Better algorithms for counting triangles in data streams. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 401–411, 2016.

[15] A. Pavan, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. Counting and sampling triangles from a graph stream. PVLDB, 6(14):1870–1881, 2013.

[16] Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet Erdem Sariyüce, and Srikanta Tirthapura. Fleet: Butterfly estimation from a bipartite graph stream. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pages 1201–1210, New York, NY, USA, 2019. Association for Computing Machinery.