sGrapp: Butterfly Approximation in Streaming Graphs

Abstract

We study the fundamental problem of butterfly (i.e. (2,2)-bicliques) counting in bipartite streaming graphs. Similar to triangles in unipartite graphs, enumerating butterflies is crucial in understanding the structure of bipartite graphs. This benefits many applications where studying the cohesion in a graph shaped data is of particular interest. Examples include investigating the structure of computational graphs or input graphs to the algorithms, as well as dynamic phenomena and analytic tasks over complex real graphs. Butterfly counting is computationally expensive, and known techniques do not scale to large graphs; the problem is even harder in streaming graphs. In this paper, following a data-driven methodology, we first conduct an empirical analysis to uncover temporal organizing principles of butterflies in real streaming graphs and then we introduce an approximate adaptive window-based algorithm, sGrapp, for counting butterflies as well as its optimized version sGrapp-x. sGrapp is designed to operate efficiently and effectively over any graph stream with any temporal behavior. Experimental studies of sGrapp and sGrapp-x show superior performance in terms of both accuracy and efficiency.


Aida Sheshbolouki

University of Waterloo

M. Tamer Özsu

University of Waterloo


In this paper we address the problem of counting butterfly patterns in large, bipartite streaming graphs. A butterfly (also called (2,2)-biclique or rectangle) is a complete bipartite subgraph with two vertices of one type and two vertices of another type (rightmost in Figure 1). Similar to the triangles in unipartite graphs, butterflies are the simplest and most local form of a cycle in bipartite graphs. Enumerating butterflies is important in measuring graph cohesion and clustering or community structure [1]. Clustering or community structure is measured by the transitivity/clustering coefficient, which is computed as the fraction of three-paths (called caterpillars; left four in Figure 1) that form a butterfly [1, 35, 62]. Graph cohesion can be measured by the number of butterflies per vertex and by the local clustering coefficient. The study of such local structural measures unveils hidden ordering and hierarchies in graphs displaying structural deviations from uncorrelated random connections [14, 42, 46]. A recent study investigates the predictive performance of deep neural networks by means of the clustering coefficient [60]. Other applications are realistic graph models [1, 30] and representative graph sampling [61]. The study of different phenomena in complex graphs, such as social collective behaviours [18], synchronization [51, 64], information propagation [28], and epidemic spreading [45], relies on the clustering coefficient. Moreover, the clustering coefficient plays an important role in graph analytics tasks such as link prediction [26] and community detection [62], and in general in any graph processing algorithm relying on counting the mutual neighbors or Jaccard similarity. The distribution of the local clustering coefficient is used as a feature to uncover statistical differences between normal and fraudulent data in applications such as spam detection [8].

Figure 1: Caterpillar and butterfly (rightmost) patterns.

We study the problem in the context of streaming graphs, because the graphs that are used in many modern applications are not static and not available to algorithms in their entirety; rather the graph vertices and edges are streamed and the graph “emerges” over time. These are called streaming graphs and they differ from dynamic graphs that are fully available but undergo changes over time. A driving example is the stream of user-product interactions in e-commerce services. Alibaba has reported that customer purchase activities during a heavy period in 2017 resulted in generation of 320 PB of log data in a six hour period, and it had to deal with a high velocity stream of data that incurred a processing rate of 470 million event logs per second. Other e-commerce sites have similar activity albeit at somewhat lower levels. Other applications such as web recommenders, fraud detection, and social network analysis rely on butterfly counting over streaming graphs.

Bipartite graphs that model networks with two disjoint sets of vertices are prevalent in real applications: interaction graphs that model the interactions (e.g. comments, reviews, purchases, ratings, etc.) between users and items, affiliation graphs that model the membership of actors/people in groups, authorship graphs that model the links between authors and their works, text graphs that model the occurrence of words in documents, and feature graphs that model the assignment of features to entities. In particular, user-product graphs are currently recognized as the most common graphs in industry that require attention. It is important to study the underlying patterns and structures of bipartite graphs, and in this paper we focus on butterfly patterns. A natural question that arises is why the bipartite graph cannot be projected into a unipartite graph on which the existing approaches to count the triangles are used. The answer is that the projected graph is misleading and counting on it is inefficient.
First, the projected unipartite graph loses fine-grained pattern information [32, 49], since the one-to-many relationship information is projected to pairwise relationships and the projection is not bijective. Second, the projected unipartite graph will have significantly more edges than the bipartite graph since each i-(j-)vertex $v$ with degree $d_v$ produces $d_v(d_v - 1)/2$ homogeneous edges. That is, the number of edges in the original bipartite graph is $\sum_v d_v$ while in the projected graph it is $\sum_v \binom{d_v}{2}$. It has been shown that projection can lead to an edge inflation of × [32]. In the case of streaming bipartite graphs that already have a high number of edges, the projection will exacerbate the computational footprint. Finally, the patterns that emerge in the projected unipartite graph are not reliable signals of the original bipartite graph since the edge inflation artificially changes the patterns. For instance, it has been shown that the clustering coefficient is high in the projected mode [19, 43] and unipartite projection misleads the community detection analysis [7, 20]. Due to these issues, it is important to devise techniques to directly study bipartite graphs.

Exact butterfly counting is feasible only when the entire graph is available to the processing algorithm. As noted earlier, this is not possible in streaming graphs (and even in massive static graphs [37]). The alternative is approximation. One such approach is to use random sampling/sparsification [12, 47], which requires determining the sampling probability, reservoir size, and scaling factor. The sampling process is done several times and can be a potential overhead lowering the processing throughput. Another approach in streaming graphs is to batch the incoming graph vertices and edges into a window and process them when the window moves; this is what we follow.
Most existing streaming proposals [5, 12, 13] assume that (a) all the edges incident to a vertex arrive together (i.e. incidence streams) and (b) vertex degrees are bounded. Neither of these is likely to hold in real-life streaming graphs. We propose a butterfly counting algorithm that can efficiently return an accurate answer over any graph stream without these unrealistic assumptions. It has been shown that the space lower bound for an approximate butterfly count that bounds the relative error to < $\delta$ < . is $O(n)$ where $n$ is the number of vertices [48]. This is not feasible in streaming systems. We analyze the computational and error bounds of our proposed algorithm. We also validate our algorithm’s accuracy and efficiency empirically.

We follow a data-driven approach to algorithm design: we conduct a deep empirical analysis of a number of real graphs with varying temporal/structural characteristics to determine the temporal occurrence of connectivities. We formulate this as a power law (Section 3) that grounds our algorithm, sGrapp, to exploit these patterns. A data-driven approach has previously been used to design a graph generator/model preserving the mined patterns in a set of unipartite real graphs [34]. However, to the best of our knowledge, this is the first time this approach is followed for designing a graph processing algorithm. sGrapp is a streaming graph approximation algorithm for butterfly counting in bipartite graphs (Section 4) and is based on (a) our novel stream processing framework, which uses time-based windows that can adapt to the temporal distribution of the stream (Section 4.1) and (b) our algorithm for exact butterfly counting in streaming graph snapshots (Section 3.2). Our experimental analysis (Section 5) shows that sGrapp achieves × higher throughput and . × lower estimation error than baselines and can process . × edges-per-second. It can achieve an average window error of less than . in graph streams with almost uniform temporal distribution.
We introduce optimizations that lower the average window error to less than . in graph streams with non-uniform temporal distribution without affecting the throughput. sGrapp handles graph streams with both high number of edges and high average degree with a sublinear memory footprint, which is lower than that of the baselines. Empirical analysis shows that the performance of sGrapp is independent of its input data, hence it can be applied to any real graph stream.

We define a graph $G$ as a pair of vertex and edge sets $G = (V, E)$. Since $G$ is a bipartite graph, $V = V_i \cup V_j$ and $V_i \cap V_j = \emptyset$. We use user-item bipartite graphs in which $V_i$ (called i-vertices) represents users and $V_j$ (called j-vertices) represents items.

Definition 2.1 (Streaming Graph Record).

A streaming graph record (sgr) $r = (\tau, p)$ is a pair where $\tau$ is the event (application) timestamp of the record assigned by the data source, and payload $p = \langle e/v, op \rangle$ indicates an edge $e \in E$ or a vertex $v \in V$ of the [property] graph $G$, and an operation $op \in \{insert, delete, update\}$ that defines the type of the record.

In this paper, the operations are limited to edge insertion. If there are duplicate edge arrivals, the algorithm ignores the duplicates.

Definition 2.2 (Streaming Graph).

A streaming graph $S$ is an unbounded sequence of streaming graph records $S = \langle r_1, r_2, \cdots \rangle$ in which each record $r_m$ arrives at a particular time $t_m$ ($t_m \le t_n$ for $m < n$).

Definition 2.3 (Time-based Window).

A time-based window $W$ over a streaming graph $S$ is denoted by the time interval $[W^b, W^e)$ where $W^b$ and $W^e$ are the beginning and end times of window $W$ and $W^e - W^b = |W|$. The window contents are the multiset of sgrs where the timestamp $\tau_i$ of each record $r_i$ is in the window interval.

Definition 2.4 (Time-based Sliding Window).

A time-based sliding window $W$ with a slide interval $\beta$ is a time-based window that progresses every $\beta$ time units. At any time point $\tau$, a time-based sliding window $W$ with a slide interval $\beta$ defines a time interval $(W^b, W^e]$ where $W^e = \lfloor \tau / \beta \rfloor \cdot \beta$ and $W^b = W^e - |W|$.

Definition 2.5 (Time-based Tumbling Window).

A tumbling window is a time-based window where, for two subsequent windows $W_i$ and $W_{i+1}$, $W^b_{i+1} = W^e_i$ and $W^e_{i+1} = W^b_{i+1} + |W_{i+1}|$. Simply, when subsequent sliding windows are disjoint, they are called tumbling windows.

Definition 2.6 (Time-based Landmark Window).

A landmark window is a constantly expanding time-based window denoted by a pair $\langle W^b, |W| \rangle$ where $W^b$ is the fixed beginning time and $|W|$ is the expansion size. For two subsequent windows $W_i$ and $W_{i+1}$, $W^b_{i+1} = W^b_i$ and $W^e_{i+1} = W^e_i + |W_{i+1}|$. Simply, when the beginning border is fixed and the end border moves forward, the window is called landmark.

Definition 2.7 (Streaming Graph Snapshot).

A streaming graph snapshot $G_{W,t}$ is the graph formed by the records in the time-based window $W$ at time $t$.

Table 1 lists the notations used in the paper.

The existing works in butterfly counting can be classified along three dimensions: graph characteristic (bipartite/unipartite), data location (disk-resident/in-memory) and graph availability (static/dynamic/streaming). Detailed coverage of each design point is beyond the scope of this paper; we focus on two particular design points that are most relevant to our work: static bipartite graphs and streaming bipartite graphs.
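The window types of Definitions 2.3 to 2.6 can be illustrated with a minimal Python sketch. The function names and the representation of a record as a `(tau, payload)` tuple are assumptions made for this illustration only; they are not part of sGrapp.

```python
import math


def sliding_window_interval(tau, width, beta):
    """Interval (W_b, W_e] defined by a sliding window at time tau (Definition 2.4).

    W_e = floor(tau / beta) * beta and W_b = W_e - |W|.
    """
    w_end = math.floor(tau / beta) * beta
    return (w_end - width, w_end)


def tumbling_windows(start, width, count):
    """First `count` tumbling windows [W_b, W_e): each begins where the previous ends."""
    return [(start + k * width, start + (k + 1) * width) for k in range(count)]


def landmark_windows(start, expansion, count):
    """First `count` landmark windows: fixed beginning, end moves forward by `expansion`."""
    return [(start, start + (k + 1) * expansion) for k in range(count)]


def window_contents(records, w_begin, w_end):
    """Records (tau, payload) whose timestamp lies in the interval [w_begin, w_end)."""
    return [r for r in records if w_begin <= r[0] < w_end]
```

For example, with `width=10` and `beta=5`, a sliding window evaluated at `tau=17` covers `(5, 15]`, while three landmark windows starting at 0 with expansion 10 share the beginning 0 and end at 10, 20 and 30.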

Table 1: Frequent notations. Similar notations stand for j-vertices where applicable.

| Notation | Description |
|---|---|
| $r_m = (\tau, p)$ | A streaming graph record (sgr) with timestamp $\tau$, payload $p$, and arrival time $t_m$ |
| $\tau$ | sgr timestamp (real time-label) |
| $t$ | Computational time point or time of sgr arrival at the computational system |
| $R$ | Average stream rate |
| $p = \langle e/v, op \rangle$ | An edge $e \in E$ or a vertex $v \in V$, and an operation $op \in \{insert, delete, update\}$ |
| $W_i := [W^b_i, W^e_i)$ | $i$-th time-based window $W$ as an interval of width $\lvert W \rvert$ |
| $\beta$ | Slide size for a sliding window |
| $G_{W,t} = (V(t), E(t))$ | A graph snapshot formed by window $W$ at time $t$ |
| $deg(i)$ | Degree of vertex $i$ |
| $N_i$ | Neighborhood of vertex $i$ |
| $P / \gamma / M$ | FLEET’s sampling probability/subsampling probability/reservoir capacity |
| $K_i$ | Average degree of i-vertices |
| $\eta / \alpha$ | Butterfly densification power law exponent for all/inter-window butterflies |
| $N_{hub}(t)$ | Number of hubs at time $t$ |
| $N_t$ | Number of unique timestamps in data stream |
| $B(t)$ | The number of butterflies since the initial time point until $t$ |
| $B_i$ | Butterfly support of vertex $i$ |
| $B_{W_k}$ | Number of butterflies introduced by at least one vertex in the window $W_k$ |
| $\hat{B}(t = W^e_k) = \hat{B}_k$ | Estimation of number of butterflies at time $t = W^e_k$ |
| $B^{W_k}_G$ | Number of butterflies in graph corresponding to window $W_k$ |
| $B_{interW}$ & $\hat{B}_{interW}$ | Number of inter-window butterflies & its estimate |
| $N^w_t$ | Number of unique timestamps per window |
| $K_{i,W_k}$ | The lower bound of degree of i(j)-vertices in window $W_k$ |
| $V_{i,W_k} / E_{W_k}$ | Set of i-vertices/edges in the interval $[W^b_k, W^e_k)$ |
| $E_k$ | Set of edges in the interval $[W^b, W^e_k)$ |
| $Pr(N^t_{iHub} \ge 1)$ | Probability of having at least one i-hub in the butterflies at time $t$ |

The literature on counting (bi)cliques in static bipartite graphs [47, 49, 53, 54] and static unipartite graphs [22, 56] is quite rich. A major challenge in this context is the massive size of these graphs. Some studies have focused on disk-resident data and optimized I/O access patterns for counting the exact number of cliques [8, 17, 22–24, 44].
Other studies consider in-memory algorithms and use random sampling so that the induced graph can fit in main memory for estimating the number of (bi)cliques [12, 47]. There are studies that propose scaling out computation by parallelization [3, 29].

Butterfly counting algorithms in bipartite graphs follow either vertex-centric or edge-centric processing. One straightforward edge-centric approach is to take each pair of disjoint edges $(e_{i_1,j_1}, e_{i_2,j_2})$ in the graph (Figure 2a) and check for the existence of the two other edges that complete the butterfly pattern. The complexity of this approach is $O(|E|^2)$, which is too expensive for graphs with a high number of edges. Another edge-centric approach [16] takes an edge $e_{i_1,j_1}$ and examines the existence of the three complementary edges. That is, the algorithm checks the connections between neighbors of $i_1$ and neighbors of $j_1$, denoted as $j_2$ and $i_2$ respectively, to see whether they are connected by an edge $e_{i_2,j_2}$ (Figure 2b). This approach can be implemented with an algorithm that has complexity $O(\sum_{\langle i,j \rangle \in E} \min(deg(i), deg(j)))$, which is not appropriate for dense graphs with a high number of edges and high average degrees.

Figure 2: Butterfly counting methods (panels a to d).
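The second edge-centric approach described above can be sketched as follows. This is a simplified illustration, not the implementation of [16]: the function name, the edge-list input format, and the final division by four (each butterfly is discovered once from each of its four edges) are assumptions of this sketch, and it does not reproduce the min-degree optimization behind the stated complexity.

```python
from collections import defaultdict


def count_butterflies_edge_centric(edges):
    """Count butterflies by starting from every edge (i1, j1).

    edges: iterable of (i, j) pairs; i-vertices and j-vertices are treated
    as two separate namespaces, and duplicate edges are ignored.
    """
    i_adj = defaultdict(set)  # i-vertex -> set of its j-neighbors
    j_adj = defaultdict(set)  # j-vertex -> set of its i-neighbors
    for i, j in set(edges):
        i_adj[i].add(j)
        j_adj[j].add(i)

    discovered = 0
    for i1, j_neighbors in i_adj.items():
        for j1 in j_neighbors:               # start edge e(i1, j1)
            for j2 in j_neighbors:           # other neighbors of i1
                if j2 == j1:
                    continue
                for i2 in j_adj[j1]:         # other neighbors of j1
                    # the closing edge e(i2, j2) completes the butterfly
                    if i2 != i1 and j2 in i_adj[i2]:
                        discovered += 1
    # every butterfly is found once per edge, i.e. four times
    return discovered // 4
```

On a complete bipartite graph with two vertices on each side the function returns 1, and on one with three vertices on each side it returns 9.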

The state-of-the-art approach [47, 53, 54] is vertex-centric: it takes a vertex $i_1$ and traverses all two-hop neighbors to identify triples $\langle i_1, j_1, i_2 \rangle$ and $\langle i_1, j_2, i_2 \rangle$. That is, it finds all triples (i.e. two-paths) with common end vertices (i.e. the same two-hop neighbor) and then combines them to get the number of all butterflies (Figure 2c). The complexity of this approach is $O(\sum_{i \in V_i} \sum_{j \in N_i} deg(j))$, which is challenging for graphs with high average i- and j-degrees as a result of traversing two-hop neighbors [54].

In the streaming graph context, the literature is also rich for counting in unipartite graphs [5, 8, 9, 11–13, 56, 57]. However, to the best of our knowledge, the only butterfly counting study over bipartite streaming graphs is FLEET [48], which introduces a suite of algorithms. FLEET1 samples the edges of a window with probability $P$ into a reservoir with fixed capacity $M$ to bound the memory consumption and increments the butterfly count by the number of incident butterflies for each sampled edge. When the size of the reservoir exceeds $M$, the edges are sub-sampled with probability $\gamma$ and the butterfly count is set to the exact number of butterflies in the reservoir. The sampling probability is then multiplied by $\gamma$ for the following edges. FLEET2 avoids re-computing the exact number of butterflies in the reservoir during the sub-sampling iterations. FLEET3 avoids re-computation and also updates the estimate before sampling the edges into the reservoir. FLEETSSW uses count-based sliding windows with limited graph size in each window, and FLEETTSW uses time-based sliding windows with fixed window length across windows. To overcome the variable number of edges inside each window, FLEETTSW assumes an upper bound for the number of edges in a window on top of a FIFO-based sampling scheme.
As we discuss in Section 4, there exist a number of inter-window butterflies in the stream that are missed by the FLEET algorithms. Moreover, FLEET requires determining a sub-sampling probability and a normalization factor to scale up the estimation computed over the sampled edges, and the specification of a time when the result is ready to be returned. FLEET requires a sufficiently large amount of memory to guarantee a desired level of accuracy.

In this section, we present our investigations into the emergence of butterfly patterns in graph streams and on the underlying contributors to these patterns. We use the insights provided by this analysis to introduce an approximation algorithm for butterfly counting in streaming graphs in Section 4. The analysis results themselves are also important as they expose how butterfly patterns exist in real world graphs.

We study a set of real world graphs and make use of a set of synthetic graphs to explore additional features. Table 2 provides the statistics about the graphs we study; these graphs are also used in the experiments discussed in Section 5.

Real-world graphs – In this study, we use six real world graphs: four rating graphs including Epinions, MovieLens100k, MovieLens1m, MovieLens10m, and two Wikipedia edit networks in English and French obtained from the KONECT repository [31]. All of these networks include information generated from interaction of a set of users with a set of items (products, movies, or Wikipedia pages). These datasets cover graphs with different edge density levels and are suitable for deep analysis and evaluations.

Synthetic graphs – In addition to the real world graphs, we use six synthetic random graphs in this study to bolster the analysis of real world graphs. Synthetic graphs are configurable and have known structural properties that ease the understanding of their patterns. We use these synthetic graphs to better understand and explain what is happening in real world graphs through comparisons and contradictory case investigations. These synthetic graphs are generated with respect to three of the real world graphs (Epinions, MovieLens100k, and MovieLens1m) in that the synthetic graphs and the corresponding real world graphs have (roughly) the same structural statistics. We use the Barabasi-Albert (BA) model [6] to generate the structure of random graphs as the baseline for analyzing real world graphs. We chose this model because it is a popular and widely adopted model for generating scale-free graphs [10, 15, 21, 25, 27, 33, 36, 38, 41, 52, 59]. Given the total number of vertices $N$, the initial number of vertices $m_0$ and the number of connections of new vertices $m$ ($m \le m_0$) as inputs, the BA graph model applies the rich-get-richer preferential attachment rule to generate a unipartite scale-free random graph. Precisely, this graph model creates an initial complete graph with $m_0$ vertices and keeps adding $N - m_0$ new vertices to this initial graph. The new vertices are connected to $m$ existing vertices with higher probability of attachment dictated by the attachment rule. The BA preferential attachment rule states that the probability is determined based on the degree of the vertex; therefore the higher the degree (i.e. the older the vertex), the higher the probability of attachment. The original BA model produces growing unipartite graphs with no timestamps. Therefore, we extended the model to generate bipartite and temporal graphs with respect to a given real graph such that the structure is dynamic but the timestamps are static.
We introduce a three-step procedure to create a bipartite and temporal scale-free BA graph as a baseline for a given real-world graph:

(1) Create Unipartite BA graph – The input parameters to the BA model (i.e. $N$, $m_0$, and $m$) should be set such that the average degree of i-vertices and the number of total edges ($|E|$) in real-world and synthetic graphs are (roughly) the same. That is because of the edge-centric nature of our intended analysis. Therefore, we set the parameters $m_0 = m$ equal to the average degree of i-vertices (i.e. users) in the real-world graph and determine the value of $N$ in a way that it satisfies the equation for the number of edges in a BA graph, that is $m_0(m_0 - 1)/2 + (N - m_0)m = |E|$. Given the input parameters, the edge list of the scale-free unipartite directed graph is generated.

(2) Project the graph to bipartite mode – A common approach to project a bipartite graph $BG = (V, E_{ij}, \Sigma, \psi, \phi)$ to unipartite modes $G_i = (V_i, E_i, \Sigma, \psi, \phi)$ and $G_j = (V_j, E_j, \Sigma, \psi, \phi)$ is to connect a pair of vertices if they have a common neighbor (Figure 3). That is, $(i_m, i_n) \in E_i$ if $\exists j \in V_j : (i_m, j) \in E_{ij} \wedge (i_n, j) \in E_{ij}$, and the same connection rule applies to j-vertices. Accordingly, we propose a reverse-engineering technique for projecting the unipartite graphs to bipartite mode. Precisely, given a unipartite BA graph $G_i$ with $N_i$ vertices (assuming the vertices as i-vertices), the bipartite mode $BG$ is generated by the procedure below:
(a) Assign $N_j$ labels $\{L_k \mid 1 \le k \le N_j\}$ to arbitrary edges in $G_i$.
(b) Create a set of $N_j$ j-vertices.
(c) Project each edge $(i_m, i_n) \in E_i$ with label $L_k$ into two edges $(i_m, j_k)$ and $(i_n, j_k)$.
Clearly, this procedure can yield a bipartite BA graph with a pre-specified number of i- and j-vertices. Therefore, it can mimic the number of vertices in the real-world graph exactly. However, the number of edges in the output bipartite BA graph does not match that of the unipartite BA graph, and if we create a unipartite BA graph with a specific number of edges, then the number of i-vertices would be affected accordingly. Therefore, this projection method cannot yield bipartite BA graphs that have a specific number of edges and vertices at the same time, and solely adjusting the number of edges will affect the number of vertices. On the other hand, the intended analysis in this work is edge-centric, therefore it is important to create synthetic bipartite graphs with the same number of edges as the real-world graphs.
To address this problem, we follow a simple projection method. Given the list of directed edges in the unipartite BA graph, the sources of edges are treated as i-vertices and the destinations as the j-vertices. Hence, the BA graph is projected to bipartite mode with the same number of edges as that of the unipartite and the corresponding real-world graph. The number of i-vertices in the projected bipartite BA graph (equal to the $N$ of the unipartite BA graph) is very close to that of the real-world graph. In spite of the different number of j-vertices in the projected and real-world graphs, this projection method is preferable as it solves the aforementioned issue. Moreover, this method preserves the scale-free characteristic of the unipartite graph since the j-degree (i-degree) distribution in the bipartite graph is equivalent to the in-degree (out-degree) distribution of vertices in the unipartite graph and the j-degree distribution is scale-free (see Figure 4).

(3) Assign timestamps to the synthetic edges – Given the timestamps of a real-world graph and the bipartite structure of the corresponding random graph, timestamps are assigned to the edges in two ways:
(a) Each BA edge is randomly assigned a timestamp within the range of timestamps of the corresponding real-world graph and the resulting graph is called BA+random stamps.
(b) The un-ordered timestamps of the corresponding real-world graph are assigned to arbitrary BA edges and the resulting graph is called BA+real stamps. This method guarantees the same temporal distribution for the edges of BA and real-world graphs and supports fair comparisons.

All the edge lists (real and synthetic) are sorted based on the timestamps to simulate the streaming graph records in the analysis.

Figure 3: Projecting a bipartite graph to two unipartite graphs. There is a link between two vertices in unipartite mode if they have any common neighbors in the bipartite mode. Edge labels in the unipartite graph reflect the common neighbors.

Figure 4: The j-degree distribution of projected bipartite BA graphs for three real-world graphs.

Table 2: Bipartite and temporal graph datasets used. $\langle k_i \rangle$ and $\langle k_j \rangle$ denote the average degree of i-vertices and j-vertices, respectively. $N$ and $m_0 = m$ are parameters of BA graphs and refer to the total number of vertices and average degree in the unipartite BA graph, respectively. $N_t$ denotes the number of unique timestamps. $B_G$ denotes the number of butterflies in the graph.

| Graph dataset | $\lvert V_i \rvert$ | $\lvert V_j \rvert$ | $\lvert E \rvert$ | $\langle k_i \rangle$ | $\langle k_j \rangle$ | $N$ | $m_0 = m$ | $N_t$ | $B_G$ |
|---|---|---|---|---|---|---|---|---|---|
| Epinions | | | | 41 | | | | | |
| BA+Epinions stamps | | | | 41 | | | 41 | | |
| BA+random stamps | | | | 41 | | | 41 | | |
| MovieLens1m | | | | 166 | 270 | | | | |
| BA+ML1m stamps | | | | 164 | 166 | | 166 | | |
| BA+random stamps | | | | 164 | 166 | | 166 | | |
| MovieLens100k | | | | 106 | 59 | | | | |
| BA+ML100k stamps | | | | 100 | 100 | 966 | 106 | | |
| BA+random stamps | | | | 100 | 100 | 966 | 106 | | |
| MovieLens10m | 69,878 | 10,677 | 10,000,054 | 143 | 937 | | | 7,096,905 | 1,197,019,065,804 |
| edit-frwiki | 288,275 | 3,992,426 | 46,168,355 | 160 | | | | | |
| edit-enwiki | 262,373,039 | 266,665,865 | 266,769,613 | 70 | 12 | | | 134,075,025 | |

Table 3: $R^2$ and RMSE of ten fitting functions (linear, quadratic, cubic, 4th degree polynomial, quintic, and 6th to 10th degree polynomials) for the temporal evolution of butterfly frequency in real-world graph streams (Epinions, ML100k, ML1m, ML10m, Edit-FrWiki, Edit-EnWiki). Filled cells denote increasing functions and best fits are highlighted in gray cells.
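The three-step procedure above (directed BA edge list, source/destination projection, timestamp assignment, then sorting) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' generator: all function names are hypothetical, the orientation of the seed-clique edges is an arbitrary choice, and timestamps are assumed to be integers.

```python
import random


def ba_vertex_count(num_edges, m):
    """Solve m0(m0 - 1)/2 + (N - m0) * m = |E| for N, with m0 = m (rounded)."""
    return m + round((num_edges - m * (m - 1) / 2) / m)


def directed_ba_edges(n, m, rng):
    """Directed BA-style edge list: a complete seed graph on m vertices, then each
    new vertex points to m distinct existing vertices chosen with probability
    proportional to their current degree."""
    edges = [(u, v) for u in range(m) for v in range(u)]  # seed clique
    endpoints = [x for edge in edges for x in edge]       # one entry per incident edge
    for new in range(m, n):
        pool = endpoints if endpoints else list(range(new))  # m = 1 has no seed edges
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(pool))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


def to_bipartite(directed_edges):
    """Simple projection: edge sources become i-vertices, destinations j-vertices."""
    return [(("i", u), ("j", v)) for u, v in directed_edges]


def assign_stamps(edges, real_stamps, rng, mode):
    """Return (timestamp, edge) records sorted by timestamp.

    mode "random": uniform integer stamps within the range of real_stamps.
    mode "real":   the real stamps, shuffled, one per edge (equal lengths assumed).
    """
    if mode == "random":
        lo, hi = min(real_stamps), max(real_stamps)
        stamps = [rng.randint(lo, hi) for _ in edges]
    else:
        stamps = list(real_stamps)
        rng.shuffle(stamps)
    return sorted(zip(stamps, edges), key=lambda record: record[0])
```

A possible use, with made-up inputs: `directed_ba_edges(ba_vertex_count(24, 3), 3, random.Random(7))` produces 24 directed edges, which `to_bipartite` and `assign_stamps` turn into a timestamp-ordered stream of bipartite edge insertions.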

Network motifs are “patterns of interconnections occurring in complex networks at numbers that are significantly higher than those in randomized networks” [40]. Identifying the motifs helps characterize the graph and also benefits graph querying systems that are based on a subgraph-centric programming model (i.e. one that operates on subgraphs rather than vertices or edges) and can be optimized by indexing the network motifs. That is, network motifs represent the regularities in the graph data and are helpful in building indexes over frequent and regular graph structures (structural indexing) [50, 58, 63]. On the other hand, frequent butterflies in a graph are a sign of a high clustering coefficient. While butterflies are known to be motifs in static graphs, their temporal emergence patterns are not well studied. Therefore, we study the number of butterflies emerging in the real-world graphs over time. Also, we compare these with the occurrence patterns in randomized graphs to see if the occurrence frequency is higher in real-world graphs. This is required for a sound and complete recognition of butterflies as temporal motifs, since the motif definition requires such a comparison.

For this analysis we use an exact butterfly counting algorithm for graph snapshots (called countButterflies(G) – Algorithm 1). Given a bipartite graph snapshot $G_{W,t} = (V(t), E(t))$ at a time point $t$, the goal is to compute $B(t)$ as the number of all quadruples $\langle i_1, i_2, j_1, j_2 \rangle$ in $G_{W,t}$ such that they form a butterfly via four edges $\{e_{i_1,j_1}, e_{i_1,j_2}, e_{i_2,j_1}, e_{i_2,j_2}\}$ (Figure 1, rightmost).

Algorithm 1 follows a vertex-centric approach that does not require accessing two-hop neighbors (i.e. it is not triple-based) and can be computed by looping over either i-vertices or j-vertices depending on their average degree (denoted by $K_i$ and $K_j$). The algorithm takes a vertex $i_1$ (provided that $K_i \le K_j$) and, considering each pair of j-neighbors $j_1$ and $j_2$, identifies the common i-neighbors of $j_1$ and $j_2$, i.e. vertices such as $i_2$ (Figure 2d). We use sublists to avoid iterating over repeated j-neighbors (lines 6-8 in Algorithm 1) and we identify the common neighbors by iterating over the lower degree j-vertex (line 10 in Algorithm 1).

Algorithm 1: countButterflies(G)

Input: $G = \langle V_i \cup V_j, E_{ij} \rangle$, static graph
Output: $B_G$, the number of butterflies in G
 1  Butterflies ← ∅   // An empty hashSet of quadruples
 2  jNeighbors ← ∅   // An empty Set
 3  vi_2s ← ∅   // An empty Set
    /* loop over i_1 ∈ V_i if K_i < K_j, otherwise loop over j_1 ∈ V_j */
 4  for i_1 ∈ V_i do
 5      jNeighbors ← N_{i_1}   // j-neighbors of vertex i_1
 6      for index_1 ∈ [0, size(jNeighbors)] do
 7          j_1 ← jNeighbors[index_1]
 8          for index_2 ∈ [index_1 + 1, size(jNeighbors)] do
 9              j_2 ← jNeighbors[index_2]
10              vi_2s ← N_{j_1} ∩ N_{j_2}   // common i-neighbors
11              Butterflies.add([i_1, j_1, i_2, j_2])
12  B_G ← size(Butterflies)

It is important to calculate the exact number of butterflies to make sure that the analyses are correct and the identified patterns are reliable. Hence, we adopt an eager computation model where the exact number of butterflies is computed after each edge is added (connecting new/existing vertices) (Algorithm 1). We do this in the time period to due to the computational expenses of the computation model. Note that the frequency distribution of edge insertions occurring in time-intervals of varying sizes follows the same shape. This means that the distribution with respect to scaling across time scales is invariant (i.e. self-similar [55]). Therefore, we can rely on the analysis on a fraction of the subsequent streaming edges.

To compare the numbers with that of a random graph (see the definition of network motifs), we just use the corresponding BA graph with the same real timestamps. This enables fair comparison of the structural evolution of real-world and synthetic random graphs.
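Algorithm 1 can be sketched in Python as below. This is one reading of the pseudocode, written under a few assumptions: the graph is given as a dictionary from each i-vertex to its j-neighbors, the sketch always loops over i-vertices (the algorithm chooses the side with the lower average degree), and each butterfly is stored in a canonical form so that the set counts it only once regardless of which of its two i-vertices discovered it.

```python
from itertools import combinations


def count_butterflies(i_neighbors):
    """Exact butterfly count of a static bipartite graph.

    i_neighbors: dict mapping each i-vertex to an iterable of its j-neighbors.
    """
    i_adj = {i: set(js) for i, js in i_neighbors.items()}
    j_adj = {}  # j-vertex -> set of its i-neighbors
    for i, js in i_adj.items():
        for j in js:
            j_adj.setdefault(j, set()).add(i)

    butterflies = set()
    for i1, js in i_adj.items():
        # each unordered pair of j-neighbors is visited once (lines 6-9)
        for j1, j2 in combinations(js, 2):
            # intersect by iterating over the lower-degree j-vertex (line 10)
            small, large = sorted((j_adj[j1], j_adj[j2]), key=len)
            for i2 in small:
                if i2 != i1 and i2 in large:
                    # canonical form: unordered i-pair plus unordered j-pair
                    butterflies.add((frozenset((i1, i2)), frozenset((j1, j2))))
    return len(butterflies)
```

Storing whole butterflies mirrors the hashSet of quadruples in the pseudocode; a counter would need less memory but would require compensating for double discovery.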

Figure 5: [Best viewed in color.] Temporal evolution of butterfly frequency.

Figure 6: [Best viewed in color.] Best fitting functions for the temporal evolution of butterfly frequency (top) and the residual errors of the estimated fitting function (bottom).

As shown in Figure 5, real-world graphs display rapid temporal evolution of the number of butterflies. To further investigate the growth pattern of butterfly frequency in these graphs, we fit ten polynomial functions, of degree one to ten, to the data points of temporal butterfly frequency evolution (black lines in Figure 5) and pick the best fitting function (Table 3). The best fitting function satisfies three conditions: (i) it has the lowest RMSE; (ii) it has the highest coefficient of determination ($R^2$); and (iii) it is a non-decreasing function. RMSE quantifies the estimation error, while $R^2$ quantifies how well the estimated fitting function explains the variation in the data points. Figure 6 illustrates the best fitting function and its estimation errors (residuals) used in the calculation of the RMSE. Note that the high RMSE values are due to the increasing function giving rise to large residuals. We do not compare the RMSE across graphs; instead, we compare the RMSE of different fitting functions for each graph. Therefore, the absolute value of the RMSE is less important than its relative value across functions. As shown in Figure 6, all the plots are well fitted by polynomial functions of degree 5 or above (the best fits are of 5th, 7th, 9th, and 10th degree). We term this the butterfly densification power law (following the power-law terminology [34]): the number of butterflies at time point $t$, $B(t)$, follows a power-law function of the number of edges at $t$, i.e. $B(t) \propto f(|E(t)|^{\eta})$ with $\eta > 1$. Moreover, the markedly higher frequency of butterflies in the real-world graphs compared to that of random graphs suggests that butterflies are network motifs across the timeline.

In the previous subsection we observed the densification of butterflies as network motifs. Now, we study how these motifs are formed over time. To this end, we check the distribution of the inter-arrival time of pairs of edges forming a butterfly. That is, for any pair of edges $\langle e_1, e_2 \rangle$ with timestamps $\tau_1$ and $\tau_2$ that co-exist in a butterfly, the inter-arrival time is $|\tau_1 - \tau_2|$. We adopt a lazy computation model and compute the inter-arrival distribution once, after a batch of edges has been added to the graph.

As shown in Figures 7 and 8, the distribution of inter-arrival values is skewed to the right. The left peaks and the heavy tail of the distribution reveal different patterns. The leftmost peaks show that many butterflies are formed by edges with close timestamps. At the same time, according to Figure 5, the number of butterflies increases significantly over time. It can be inferred that butterflies are formed in a bursty fashion.

Next, we investigate the vertices that form the butterflies to see (a) whether the bursty butterfly generation is driven by hubs (i.e. vertices with degree above the average of unique vertex degrees) or by normal vertices (Subsection 3.3.1), and (b) if hubs are the main contributors, whether they are young, old, or both (Subsection 3.3.2).

We hypothesize that butterflies are contributed by hubs. To test this, we study the following items:
• The probability of forming butterflies by hubs
• The correlation between degree and support of vertices
• The connection patterns of hubs
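The inter-arrival computation described earlier in this subsection can be sketched as below. The function name, the quadruple layout `(i1, i2, j1, j2)`, and the `edge_time` dictionary mapping each edge to its timestamp are assumptions made for illustration; the paper does not prescribe these data structures.

```python
from itertools import combinations


def butterfly_interarrival_times(butterflies, edge_time):
    """Return |tau_1 - tau_2| for every pair of edges co-existing in a butterfly.

    butterflies: iterable of (i1, i2, j1, j2) quadruples (assumed layout)
    edge_time:   dict mapping (i, j) -> timestamp of that edge (assumed layout)
    """
    gaps = []
    for i1, i2, j1, j2 in butterflies:
        # timestamps of the four edges of this butterfly
        stamps = [edge_time[(i, j)] for i in (i1, i2) for j in (j1, j2)]
        # a butterfly has four edges, hence six edge pairs
        gaps.extend(abs(a - b) for a, b in combinations(stamps, 2))
    return gaps
```

A histogram of the returned values gives the kind of distribution plotted in Figures 7 and 8.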

The probability of forming butterflies by hubs –

We enumerate the butterflies formed over the analyzed time interval and check the fraction of butterflies formed by zero to four hubs (Table 4) and the fraction of butterflies formed by zero, one, or two i-/j-hubs (Table 5). It is evident that butterflies mostly include one or, with higher probability, two hubs, which are usually i-hubs.

The correlation between degree and support of vertices –

We study the correlation between the degree $deg(i)$ and the butterfly support $B_i$ of vertices, where $B_i$ is defined as the number of butterflies incident to vertex $i$. To compute $B_i$, we extend countButterflies(G) (Algorithm 1) to obtain ButterflySupport(G) (Algorithm 2). We refer to the correlation computed over the i-vertices and j-vertices as i-correlation (Equation 1) and j-correlation (computed similarly), respectively. We use the Pearson correlation coefficient, computed over all the $N = |V_i|$ or $|V_j|$ i-(j-)vertices seen in the graph snapshot at the analyzed time point. Note that a positive correlation coefficient means that $deg(i)$ and $B_i$ increase or decrease together, while a negative coefficient means that increasing one quantity implies decreasing the other. Values with magnitude close to 1 demonstrate strong correlation.

$$\frac{N \sum_{i \in V_i} deg(i)\, B_i - \sum_{i \in V_i} deg(i) \sum_{i \in V_i} B_i}{\sqrt{\left[ N \sum_{i \in V_i} deg(i)^2 - \left( \sum_{i \in V_i} deg(i) \right)^2 \right] \left[ N \sum_{i \in V_i} B_i^2 - \left( \sum_{i \in V_i} B_i \right)^2 \right]}} \quad (1)$$

As shown in Table 6, there is a strong positive correlation between the degree and the support of vertices in real-world graphs, i.e. the higher the degree, the higher the butterfly support, and vice versa. This highlights the impact of hubs on the emergence of the enormous number of butterflies in real-world graphs.

Algorithm 2: ButterflySupport(G)
Input: G = ⟨V_i ∪ V_j, E_ij⟩, static graph
Output: vSupport, butterfly support of vertices
    vSupport ← ∅ // An empty hashMap
    Butterflies ← ∅ // An empty hashSet of quadruples
    jNeighbors ← ∅ // An empty Set
    vi_2s ← ∅ // An empty Set
    /* loop over i ∈ V_i if K_i < K_j, otherwise loop over j ∈ V_j */
    for i_1 ∈ V_i do
        jNeighbors ← N_{i_1} // j-neighbors of vertex i_1
        for index_1 ∈ [0, size(jNeighbors)] do
            j_1 ← jNeighbors[index_1]
            for index_2 ∈ [index_1 + 1, size(jNeighbors)] do
                j_2 ← jNeighbors[index_2]
                vi_2s ← N_{j_1} ∩ N_{j_2} // common i-neighbors
                for i_2 ∈ vi_2s do
                    if [i_1, j_1, i_2, j_2] ∉ Butterflies then
                        Butterflies.add([i_1, j_1, i_2, j_2])
                        vSupport.put(i_1, vSupport.get(i_1) + 1)
                        vSupport.put(j_1, vSupport.get(j_1) + 1)
                        vSupport.put(i_2, vSupport.get(i_2) + 1)
                        vSupport.put(j_2, vSupport.get(j_2) + 1)
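As a rough illustration of the support computation and the i-correlation of Equation 1, the following Python sketch counts the butterflies incident to each vertex and then evaluates the Pearson coefficient. The names `butterfly_support`, `pearson`, and `i_correlation`, the `("i", v)` / `("j", v)` keys that keep the two vertex sides apart, and the edge-list input are assumptions for this example, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def butterfly_support(edges):
    """Return (i-adjacency, support), where support counts butterflies per vertex."""
    n_i, n_j = defaultdict(set), defaultdict(set)
    for i, j in edges:
        n_i[i].add(j)
        n_j[j].add(i)

    support = defaultdict(int)
    seen = set()  # canonical quadruples already counted
    for i1, js in n_i.items():
        for j1, j2 in combinations(sorted(js), 2):
            for i2 in n_j[j1] & n_j[j2]:  # common i-neighbors
                if i2 == i1:
                    continue
                key = (min(i1, i2), max(i1, i2), j1, j2)
                if key not in seen:
                    seen.add(key)
                    for vertex in (("i", i1), ("i", i2), ("j", j1), ("j", j2)):
                        support[vertex] += 1
    return n_i, support


def pearson(xs, ys):
    """Pearson correlation coefficient in the form of Equation 1."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    denom = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return (n * sxy - sx * sy) / denom if denom else float("nan")


def i_correlation(edges):
    """Correlation between degree and butterfly support over the i-vertices."""
    n_i, support = butterfly_support(edges)
    vertices = list(n_i)
    degrees = [len(n_i[v]) for v in vertices]
    supports = [support[("i", v)] for v in vertices]
    return pearson(degrees, supports)
```

The j-correlation follows the same pattern using the j-side adjacency and the `("j", v)` keys.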

The connection patterns of hubs –

We quantify the extent to which i-(j-)hubs dominate the edges over time by means of two equivalent measures: (i) the fraction of i-(j-)hub connections, $\frac{\sum_{i=1}^{N_{hub}(t)} deg(hub_i)}{|E(t)|}$, normalized by the number of hubs at time point $t$, $N_{hub}(t)$; and (ii) the average degree of i-(j-)hubs, $\frac{\sum_{i=1}^{N_{hub}(t)} deg(hub_i)}{N_{hub}(t)}$, normalized by the total number of edges at time point $t$, $|E(t)|$. Both quantities are calculated as $\frac{\sum_{i=1}^{N_{hub}(t)} deg(hub_i)}{|E(t)| \cdot N_{hub}(t)}$ at any given time point $t$. We adopt an eager computation model to compute this value whenever a new edge is added. The time point $t$ can be interpreted as the number of edges added to the graph since the initial time point.

As shown in Figures 9 and 10, as the number of edges added to the graph increases, the normalized fraction of i-(j-)hub connections (average degree of i-(j-)hubs) decreases over time in both real-world and BA graphs.
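The normalized hub measure above can be sketched in a few lines of Python. The function names and the `degrees` dictionary (vertex to degree, for one side of the bipartite graph) are assumptions for illustration; the hub definition follows the text, i.e. a degree above the average of the unique degrees.

```python
def hubs(degrees):
    """Vertices whose degree is above the average of the unique degrees."""
    if not degrees:
        return set()
    unique = set(degrees.values())
    threshold = sum(unique) / len(unique)
    return {v for v, d in degrees.items() if d > threshold}


def normalized_hub_fraction(degrees, num_edges):
    """Sum of hub degrees divided by (|E(t)| * N_hub(t))."""
    hub_set = hubs(degrees)
    if not hub_set or num_edges == 0:
        return 0.0
    return sum(degrees[v] for v in hub_set) / (num_edges * len(hub_set))
```

In an eager setting this function would be re-evaluated after each edge insertion, with `degrees` and `num_edges` updated incrementally.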

Figure 7: Distribution of inter-arrival time of edges forming butterflies in real-world graphs.

Figure 8: Distribution of inter-arrival time of edges forming butterflies in BA+real stamps graphs.

Table 4: Fraction of butterflies including zero, one, two, three, or four hub(s) after applying 5000 edge-insertion sgrs. Columns: 0 hub, 1 hub, 2 hubs, 3 hubs, 4 hubs. Rows: Epinions, BA+Epinions stamps, ML100k, BA+ML100k stamps, ML1m, BA+ML1m stamps, ML10m, Edit-Frwiki, Edit-Enwiki.

Table 5: Fraction of butterflies including zero, one, or two i-hub(s) or j-hub(s) after applying 5000 edge-insertion sgrs. Columns: 0 i-hub, 1 i-hub, 2 i-hubs, 0 j-hub, 1 j-hub, 2 j-hubs. Rows: Epinions, BA+Epinions stamps, ML100k, BA+ML100k stamps, ML1m, BA+ML1m stamps, ML10m, Edit-Frwiki, Edit-Enwiki.

Figures 9 and 10 also reveal that (a) unlike in real-world graphs, i- and j-hubs emerge later in the BA graphs (a consequence of BA's preferential attachment rule), and (b) the average degree of hubs at early time points is higher in real-world graphs than in BA graphs. This is due to the bursty characteristic of the graph stream (i.e. the arrival of a bunch of edges with the same timestamp and the same i- or j-vertex). In summary, early in the stream, the BA graphs have fewer hubs with lower degrees than the real-world graphs. Figure 5 also illustrates the low number of butterflies in BA graphs early in the stream, when there are no hubs in these graphs or the average hub degree is low. In contrast, real-world graphs have a high number of hub connections and a high number of butterflies. These observations again verify the contribution of hubs to the emergence of butterflies: when the number of hubs and their average degree are low, the number of butterflies is also low (as seen in BA graphs), and when the number of hubs and their average degree are high, the number of butterflies is high (as seen in real-world graphs).

Figure 9: [Best viewed in color.] Temporal evolution of the normalized fraction of i-hub connections (average i-hub degree).

Figure 10: [Best viewed in color.] Temporal evolution of the normalized fraction of j-hub connections (average j-hub degree).

Table 6: Correlation between the butterfly support and the degree of i-vertices (i-correlation) and j-vertices (j-correlation). Columns: i-correlation, j-correlation. Rows: Epinions, BA+Epinions stamps, MovieLens1m, BA+MovieLens1m stamps, MovieLens100k, BA+MovieLens100k stamps, MovieLens10m, Edit-Frwiki, Edit-Enwiki.

We hypothesize that butterflies are contributed by old hubs. To test this, we study the following items:
• The evolution of young and old hubs
• The inter-arrival of butterfly edges

The evolution of young and old hubs –

To further investigate how the age of hubs contributes to the emergence of butterflies, we first check the evolution of young and old hubs. As mentioned before, we define an i-(j-)hub as any i-(j-)vertex whose degree is above the average of unique i-(j-)degrees in the graph. Accordingly, young (old) hubs are defined as hubs whose timestamp is in the last (first) fraction of the ordered set of already seen timestamps. The timestamp of a vertex is the timestamp of the sgr by which the vertex was added to the graph for the first time. For instance, if a vertex $i$ arrives via the edge insertions $e_1 = \langle i, j_1 \rangle$ and $e_2 = \langle i, j_2 \rangle$, the timestamp of vertex $i$ is set to the timestamp of $e_1$, which arrived before $e_2$ (assuming that subscripts identify the order of arrival). We adopt a lazy computation model to compute the number of young/old i-(j-)hubs using a time-based landmark window, where the computation is done over the growing graph generated by the edges in the append-only window following each expansion. Window expansion lengths are set as a fraction of $N_t$, the number of unique timestamps in the data stream, with the fraction specified separately for Epinions, ML100k, ML1m, and ML10m and for the larger graph streams Edit-EnWiki and Edit-FrWiki.

As shown in Figure 11, young i-hubs and/or j-hubs are formed in the real-world graphs over time, while in BA graphs with random timestamps the number of young i-(j-)hubs is always zero. In BA graphs with real timestamps, the timestamps of hubs are shuffled; therefore old hubs are identified as young hubs, which should be ignored. Figure 12 demonstrates that the number of old hubs increases over time in BA graphs, which is not always the case for real-world graphs. Moreover, the number of old hubs in real-world graphs is lower than in BA graphs.

Figure 11: The number of young (top) i-hubs and (bottom) j-hubs after arrival of each batch of edge-insertion sgrs.

Figure 12: The number of old (top) i-hubs and (bottom) j-hubs after arrival of each batch of edge-insertion sgrs.

The inter-arrival of butterfly edges –

Finally, we revisit the heavy tail of the inter-arrival distribution, which is over-represented in BA graphs (Figure 8). The heavy tail corresponds to butterfly edges with high inter-arrival times. These highly frequent butterfly edges with high inter-arrival times reflect connections between young vertices and old vertices. We hypothesize that the young vertices are ordinary vertices and the old ones are hubs. This is supported by two observations: (a) we showed in the previous subsection that hubs are the main contributors to butterfly emergence; and (b) the hubs forming the butterflies cannot be young hubs, since BA graphs would then be a counterexample: BA graphs have no young hubs (Figure 11), yet they have many butterfly edges with high inter-arrival times (Figure 8), so butterflies cannot originate from young hubs. Therefore, old hubs drive the bursty butterfly emergence. Young hubs can exist, but they are not the hubs dominating the butterflies.

In this section, we summarize the findings of our study of the emergence of butterflies in streaming graphs. We observed that butterflies are network motifs across the timeline of sgr arrivals, since the number of butterflies increases significantly over time in real-world streaming bipartite graphs, and at each time point the number of butterfly occurrences in real-world graphs is significantly higher than in random graphs. We formulated the emergence of butterfly interconnections as the butterfly densification power law, stating that the number of butterflies at any time point $t$ is a power-law function of the size of the stream prefix seen until $t$.

In terms of how this enormous number of butterflies emerges over time, our studies reveal the contribution of hubs in the streaming graphs. Further investigation of the impact of hubs in terms of their age reveals that older hubs contribute more to the densification of butterflies.

An efficient streaming algorithm for butterfly counting can only deal with a subset of the stream at any given point in time. At the same time, a precise streaming algorithm must take into account all existing butterflies, regardless of how long they take to form and how much memory is available. The statistical analyses uncover the temporal organizing principles of butterflies that affect the identification of any potential butterfly that should be counted by the algorithm. Specifically, our study reveals the dominant contribution of old hubs with young neighbours to shaping butterfly structures over time. That is, a butterfly can take a long time to form, as it takes a while before newly added vertices get connected to old hubs and the butterfly structure completes. In window-based algorithms such as ours, care is required in windowing, as butterflies may be split across windows, affecting the butterfly count; it is important to take into account the butterflies that span windows. Moreover, when counting the butterflies that span multiple windows, it is important to take advantage of the butterfly densification power law, which quantifies the butterfly count with respect to the number of edges seen so far. The total number of received edges is easy to track in streaming graphs. Our analysis of real-world graph streams enabled us to design the data-driven butterfly counting algorithm discussed in the next section.

The analysis results presented in the previous section, in particular the contribution of old hubs to bursty butterfly densification (the heavy tail of the distribution of inter-arrival values in Figure 7), provide insights into butterfly counting in streaming graphs. In view of these, the precise problem definition reads as follows:

Given a sequence of streaming graph records ordered by their timestamps, the goal is to compute the total number of butterflies in the emerging graph $G$ at time point $t$, denoted as $B(t)$. In other words, the count is over the snapshot corresponding to the prefix of the stream seen so far.

Computing $B(t)$ exactly over a streaming graph is not feasible, since the stream is unbounded. It is known that without knowing the size of the streaming input data, it is not possible to determine the memory required for processing the data [2], and unless there is unbounded memory, it is not possible to compute exact answers for this data stream problem [4]. Butterfly counting is an example of streaming problems that are provably intractable if the available space is sub-linear in the number of stream elements [39]. Windowing addresses this fundamental problem; however, since data enters and leaves the window as the graph emerges, the result is approximate. Approximation has been recognized as an important method for processing high-speed data streams, and windows are known as a natural approximation method over data streams [4].

Consequently, in this section we develop an approximate butterfly counting algorithm called sGrapp that uses windowing. The algorithm uses tumbling windows in order to avoid double counting of repeated butterflies. As defined in Section 2.1, tumbling windows do not overlap when windows move, thus avoiding the double-counting problem. We adopt a lazy time-based tumbling window model to compute the number of butterflies introduced by each window of disjoint edge insertions, $W_k$, at the end time of the window, denoted by $B_{W_k}$, and increment the cumulative value accordingly: $B(t = W_k^e) = B(t = W_{k-1}^e) + B_{W_k}$. This processing is incremental.

An issue that has to be addressed is that there may exist butterflies formed by edges with large inter-arrival times (the heavy and long tail in Figure 7). These butterflies are not captured within one window (unless it is sufficiently large); we refer to them as inter-window butterflies. However, setting the window length to a large value to cover the inter-window butterflies implies a high computational footprint in terms of memory and time. This conflicts with the goal of using a windowed approach to lower this footprint by performing incremental processing over subsets of sgrs. sGrapp addresses this issue by using tumbling windows with adaptive lengths rather than lengthy windows.

sGrapp estimates the number of butterflies from the beginning of the first window, $t = W_0^b$, until the end of the $k$th window, denoted as $\hat{B}(t = W_k^e) = \hat{B}_k$, by counting the exact number of butterflies in the graph corresponding to the current window $W_k$, $B_G^{W_k}$, and approximating the number of inter-window butterflies, $\hat{B}_{interW}$. The estimated cumulative value is $\hat{B}_k = \hat{B}_{k-1} + B_G^{W_k} + \delta(k \neq 0)\, \hat{B}_{interW}$, where the function $\delta(\cdot)$ returns 1 for a true input and 0 otherwise. Note that the first window $W_0$ has no inter-window butterflies, and hence the corresponding term becomes zero by means of the delta function. In the following, we introduce our adaptive window framework to perform the butterfly approximation (Subsection 4.1). Next, we explain how sGrapp approximates $\hat{B}_{interW}$ and consequently $\hat{B}_k$ (Subsection 4.2). Afterwards, we discuss optimizations to sGrapp (Subsection 4.3). We end this section by analyzing the computational complexity and error bounds of sGrapp (Subsection 4.4).

A main challenge with time-based windows is how to set the length of the windows. A common approach in stream processing is to set the length of a window using a predetermined value $L$ ($|W_i| = L, \forall i$). However, different graph streams have different temporal distributions (frequency distribution of sgr timestamps, Figure 13), and the number of arriving sgrs is not uniform across time intervals. Therefore, this approach would result in windows of sgrs that cover differing numbers of timestamps, which imposes unbalanced loads on the processing algorithms, particularly in the case of sgr arrivals with bursty characteristics and a non-uniform temporal distribution.

To tackle this issue, we introduce an adaptive approach to set the window length. This approach determines the window length according to the timestamps of the graph stream and adapts to the temporal distribution of the stream (Algorithm 3), with no assumption about the order and number of arriving sgrs per time unit. Hence, graph streams with differing arrival rates and temporal distributions can be accommodated. Precisely, we use a number of time-based tumbling windows, each including a variable number of sgrs but a fixed number of unique timestamps in the graph stream, $N_t^W$. For instance, in Subsection 3.3.2 we used 10 windows, each including a varying number of sgrs that cover a fixed share of the unique timestamps (Figures 11 and 12). That is, given the number of unique timestamps per window $N_t^W$, we ingest sgrs into the window. When $N_t^W$ timestamps have been seen, we close the window and perform the intended analysis over the corresponding snapshot. The outputs of the analysis are streamed out correspondingly. Next, the window slides forward and the retired edges are deleted from the computational graph (Algorithm 3). In tumbling windows, all the edges are retired when the window slides, and the graph snapshot is renewed. The time-step is incremented and the algorithm continues as long as there is an sgr (i.e. continuously in real-world streams).

This may appear to be a count-based window, but it is not. A count-based window would contain a fixed number of sgrs, while we only fix the number of unique timestamps in the window, not the number of sgrs. Therefore, ours is time-based with adaptive width, since the window borders adapt to the temporal distribution of the stream. In fact, adaptive windowing would reduce to count-based windowing if and only if the temporal distribution of the stream is uniform and unique timestamps occur with equal frequency. Therefore, our windowing mechanism is general and conforms to real streams.

Figure 13: Temporal distribution of real-world graph streams.

Sequential adaptive windows cover the same fraction of the distribution of the sgrs (load-balanced windows for efficient analysis). They also enable comparing the analyses over different windows of a graph stream, as well as analyses over different graph streams having different temporal distributions (time-based windows for the accuracy of temporal analysis).
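The adaptive windowing idea can be sketched as a Python generator. This is an illustrative simplification under stated assumptions, not a transcription of Algorithm 3: the name `adaptive_tumbling_windows`, the `(timestamp, edge)` record layout, and the choice to close a window when the first record with a new timestamp arrives after `n_unique` timestamps have been seen are all assumptions made for this example.

```python
def adaptive_tumbling_windows(sgrs, n_unique):
    """Yield tumbling windows that each cover n_unique unique timestamps.

    sgrs:     iterable of (timestamp, edge) records in timestamp order (assumed layout)
    n_unique: number of unique timestamps per window
    """
    window, seen = [], set()
    for tau, edge in sgrs:
        # a new timestamp arriving after n_unique have been seen closes the window
        if tau not in seen and len(seen) == n_unique:
            yield window
            window, seen = [], set()  # tumbling: retire every edge at once
        seen.add(tau)
        window.append((tau, edge))
    if window:  # flush the last, possibly partial, window
        yield window
```

Each yielded window holds a variable number of records but, apart from the last one, the same number of unique timestamps, which is the property that distinguishes this scheme from a count-based window.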

Algorithm 3: Adaptive tumbling windows
Data: {r_i}, sequence of time-ordered sgrs
Input: N_t^W, number of unique timestamps per window
Output: x, analysis output collection
    G ← ⟨V = ∅, E = ∅⟩ // initial empty graph
    t ← 0 // time-step
    unqt ← ∅ // an empty hashSet
    x ← ∅ // output collection
    k ← 0 // window number
    W_k^b ← τ_0 // beginning time of kth window
    while true do
        r_t = (τ_t, p) ← sgrIngest()
        if r_t ≠ ∅ then
            unqt.add(τ_t)
            G ← updateG(r_t, G)
            if unqt.size() == N_t^W then
                x[k] ← analysis(G)
                k ← k + 1
                W_k^b ← τ_t
                for e ∈ G : e.timestamp ≤ W_k^b do
                    G ← Delete(e, G)
            t ← t + 1

Algorithm 4 describes how sGrapp uses the adaptive windowing framework (Algorithm 3) to estimate the number of butterflies in the streaming graph. Note that sGrapp uses tumbling windows; therefore, instead of checking the timestamps of windowed edges to decide on their retirement (as in Algorithm 3), sGrapp renews the processing graph (Algorithm 4). As mentioned earlier in this section, the total number of butterflies is calculated as the total number of butterflies computed at the end of the previous window, plus the exact number of butterflies in the current window (computed by invoking Algorithm 1), plus the estimated number of inter-window butterflies contributed by the current window. According to the butterfly densification power law discussed earlier, the number of butterflies follows a power-law function of the number of existing edges in the graph. Moreover, recall the observation that butterflies are formed by hubs. Thus, we propose to approximate the number of inter-window butterflies as $\hat{B}_{interW} = |E(t = W_k^e)|^{\alpha}$, where $|E(t = W_k^e)|$ is the total number of edges from $t = W_0^b$ until $t = W_k^e$ and $\alpha$ is the approximation exponent. The total number of edges is updated at ingestion time (Algorithm 4): $|E|$ is increased when the sgr is an edge insertion and decreased when the sgr is an edge deletion.

Algorithm 4: sGrapp
Data: {r_i}, sequence of time-ordered sgrs
Input: N_t^W, number of unique timestamps per window; α, approximation exponent
Output: timestep-Bcount, approximated number of butterflies at the end of each window
    G ← ⟨V = ∅, E = ∅⟩ // initial empty graph
    t ← 0 // time-step
    unqt ← ∅ // an empty hashSet
    k ← 0 // window number
    timestep-Bcount ← ∅ // an empty hashMap
    B_G^{W_k} ← 0 // number of butterflies in the graph of kth window
    B̂_k ← 0 // cumulative number of butterflies until t = W_k^e
    |E| ← 0 // total number of edges since the beginning of the first window
    while true do
        r_t = (τ_t, p) ← sgrIngest()
        if r_t ≠ ∅ then
            unqt.add(τ_t)
            G ← updateG(r_t, G)
            |E| ← updateE(r_t, |E|)
            if unqt.size() == N_t^W then
                B_G^{W_k} ← countButterflies(G)
                B̂_k ← B̂_{k-1} + B_G^{W_k} + δ(k ≠ 0) |E|^α
                timestep-Bcount.put(t, B̂_k)
                k ← k + 1
                /* Retire all the edges in the processing graph. */
                G ← ⟨V = ∅, E = ∅⟩
            t ← t + 1

The approximation exponent used in sGrapp (Algorithm 4) is constant over windows. However, as we show in the experimental studies in Section 5, the number of butterflies estimated with a static exponent can be above or below the true value in subsequent windows. The reason is that the number of edges connecting to old hubs varies across windows, and consequently the estimation should not increase linearly with respect to the number of edges.

To address this problem, we optimize sGrapp by changing the exponent over windows. To this end, we modify the unsupervised sGrapp algorithm into a semi-supervised algorithm that we call sGrapp-x. We provide the algorithm with the true number of butterflies for an initial subset of the stream. Based on the true value, in the corresponding window $W_k$ we compute the relative error $\frac{\hat{B}_k - B_k}{B_k}$ (Algorithm 5). If the relative error is lower than a user-specified negative tolerance value, there is an underestimation, and we therefore increase the exponent by a fixed step (line 23-24 in Algorithm 5). Similarly, we decrease the exponent when the relative error is above the positive tolerance value, to avoid overestimation in the next window (line 21-22 in Algorithm 5). The exponent is stabilized when the error is tolerable and after the supervised search for the exponent is finished. In summary, the optimized version of sGrapp is an adaptive algorithm using reinforcement learning that learns the most accurate approximation exponent for any given window parameter $N_t^W$ in a subset of the stream and generalizes the learned exponent to the rest of the stream. sGrapp-x is semi-supervised and performs well given limited ground truth.
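To make the estimator and its exponent adjustment concrete, here is a hedged Python sketch of the sGrapp / sGrapp-x update over pre-built tumbling windows. It is not the authors' Java code: the function names, the window format (a list of `(i, j)` edge insertions per window, with no deletions), and in particular the default values of `tol` and `step` are hypothetical placeholders chosen for illustration. The in-window counter uses the standard identity that two j-vertices with `c` common i-neighbors close `c(c-1)/2` butterflies, which is a different exact method from Algorithm 1.

```python
from itertools import combinations


def count_butterflies(edges):
    """Exact in-window count: each j-pair with c common i-neighbors adds c*(c-1)/2."""
    n_j = {}
    for i, j in edges:
        n_j.setdefault(j, set()).add(i)
    total = 0
    for j1, j2 in combinations(n_j, 2):
        c = len(n_j[j1] & n_j[j2])
        total += c * (c - 1) // 2
    return total


def sgrapp_x(windows, alpha, truths=(), tol=0.05, step=0.005):
    """Cumulative butterfly estimates, one per tumbling window.

    windows: iterable of edge lists, one per window (assumed format)
    alpha:   approximation exponent
    truths:  ground-truth cumulative counts for the first len(truths) windows;
             an empty sequence gives plain sGrapp with a constant exponent
    tol, step: hypothetical tolerance and exponent step, not the paper's values
    """
    estimates = []
    b_hat = 0.0      # cumulative estimate
    total_edges = 0  # |E| since the beginning of the first window
    error = 0.0      # relative error of the previous supervised window
    for k, edges in enumerate(windows):
        total_edges += len(edges)
        in_window = count_butterflies(edges)
        if k < len(truths):
            if error > tol:      # overestimation: shrink the exponent
                alpha -= step
            elif error < -tol:   # underestimation: grow the exponent
                alpha += step
        # the first window has no inter-window butterflies
        inter_window = total_edges ** alpha if k != 0 else 0.0
        b_hat += in_window + inter_window
        estimates.append(b_hat)
        if k < len(truths) and truths[k]:
            error = (b_hat - truths[k]) / truths[k]
    return estimates, alpha
```

Calling `sgrapp_x(windows, alpha)` without `truths` keeps the exponent fixed, while supplying ground truth for a prefix of the windows lets the exponent drift toward a value with tolerable error before it is frozen for the rest of the stream.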
Previous study of space bounds has shown that any butterfly counting algorithm, either randomized or deterministic, that returns an accurate (exact or approximate) answer (i.e. bounds the relative error by a small value $\delta$ in each computation round) requires storing the entire graph in $\theta(n)$ bits, where $n$ is the number of vertices [48]. On the other hand, it is not possible to determine the size of the stream (i.e. $n$) in real-world streaming graphs. Hence, it is not possible to determine the memory required for processing the data without knowing the size of the data [2]. In the following, we analyze the properties of our estimator in terms of computational and error bounds.

Algorithm 5: sGrapp-x
Data: {r_i}, sequence of time-ordered sgrs; B, ground truths
Input: N_t^W, number of unique timestamps per window; α, approximation exponent
Output: timestep-Bcount, approximated number of butterflies at the end of each window
    G ← ⟨V = ∅, E = ∅⟩
    t ← 0
    unqt ← ∅
    k ← 0
    timestep-Bcount ← ∅
    B_G^{W_k} ← 0
    B̂_k ← 0
    |E| ← 0
    error ← 0 // relative error for window W_k
    while true do
        r_t = (τ_t, p) ← sgrIngest()
        if r_t ≠ ∅ then
            unqt.add(τ_t)
            G ← updateG(r_t, G)
            |E| ← updateE(r_t, |E|)
            if unqt.size() == N_t^W then
                B_G^{W_k} ← countButterflies(G)
                if t < size(B) & error > tolerance then
                    α −= step
                if t < size(B) & error < −tolerance then
                    α += step
                B̂_k ← B̂_{k-1} + B_G^{W_k} + δ(k ≠ 0) |E|^α
                timestep-Bcount.put(t, B̂_k)
                if t < size(B) then
                    error ← (B̂_k − B_k) / B_k
                k ← k + 1
                G ← ⟨V = ∅, E = ∅⟩
            t ← t + 1

Theorem 4.1. The upper bound of the computational complexity of sGrapp for each window $W_k$ is $O(K_{i,W_k}(K_{i,W_k} - 1)\, K_{j,W_k}\, \mathcal{R}\, N_t^{W_k})$, where $\mathcal{R}$ is the average stream rate and $K_{i,W_k}$ ($K_{j,W_k}$) is the lower bound of the degree of i(j)-vertices in $W_k$.

Proof. sGrapp's computations in each window are dominated by the exact counting algorithm; the cost of calculating the number of inter-window butterflies and of the summations is negligible, and we ignore it. When the i-vertices are the vertex set with the lower average degree, the computational complexity of the core exact counting algorithm is the following.

$$O\Big(\sum_{i \in V_i} \sum_{j_1, j_2 \in N_i} \min(deg(j_1), deg(j_2))\Big) \quad (2)$$

Let us assume that the lower bounds of the i-degree and j-degree in the graph snapshot corresponding to the tumbling window $W_k$ are $K_{i,W_k}$ and $K_{j,W_k}$, respectively. Accordingly, the computational complexity for this window is $O(K_{i,W_k}(K_{i,W_k} - 1)\, K_{j,W_k}\, |V_{i,W_k}|)$, where $V_{i,W_k}$ denotes the set of i-vertices in the window $W_k$. Since the stream can include edges connecting already existing vertices, the total number of edges in $W_k$, denoted as $|E_{W_k}|$, is greater than or equal to the total number of i-vertices in $W_k$, i.e. $|V_{i,W_k}| \le |E_{W_k}|$. Therefore,

$$O(K_{i,W_k}(K_{i,W_k} - 1)\, K_{j,W_k}\, |V_{i,W_k}|) \le O(K_{i,W_k}(K_{i,W_k} - 1)\, K_{j,W_k}\, |E_{W_k}|) \quad (3)$$

sGrapp uses tumbling windows with adaptive lengths; therefore $|E_{W_k}| \approx \mathcal{R}\, N_t^{W_k}$, where $\mathcal{R}$ is the average stream rate (i.e. the number of edges per timestamp) and $N_t^{W_k}$ is the number of unique timestamps in $W_k$. Hence, the upper bound of the computational complexity of sGrapp for a tumbling window $W_k$ is $O(K_{i,W_k}(K_{i,W_k} - 1)\, K_{j,W_k}\, \mathcal{R}\, N_t^{W_k})$. Note that this holds for all sequential windows. □

Figure 14: Schematic butterfly formation. i(j)-vertices are blue (red) at the bottom (top).

Theorem 4.2.

The exact number of inter-window butterflies at theend of each window 𝑊 𝑘 , ∀ 𝑘 > , denoted as 𝐵 𝑖𝑛𝑡𝑒𝑟𝑊 is bounded as | 𝐸 𝑊 𝑘 | − | 𝑉 𝑖,𝑊 𝑘 | ≤ 𝐵 𝑖𝑛𝑡𝑒𝑟𝑊 ≤ (cid:0) | 𝑉 𝑖,𝑊𝑘 | (cid:1) , where 𝑉 𝑖 is the set of alli-vertices in the 𝑊 𝑘 . Proof. The number of inter-window butterflies contributed bywindow 𝑊 𝑘 denoted as 𝐵 𝑖𝑛𝑡𝑒𝑟𝑊 , is minimum when the 𝑊 𝑘 ’s edges 𝐸 𝑊 𝑘 are uniformly distributed over vertices by connecting eachi-vertex in 𝑊 𝑘 to at least 2 j-neighbors in 𝑊 𝑘 and previous windowsforming a series of caterpillars (solid edges in Figure 14–left). Inthis case, according to the pigeonhole principle, the number ofedges that complete the caterpillars (dashed edges in Figure 14–left)will determine the number of inter-window butterflies: 𝐵 𝑖𝑛𝑡𝑒𝑟𝑊 = | 𝐸 𝑊 𝑘 |− | 𝑉 𝑖,𝑊 𝑘 | . 𝐵 𝑖𝑛𝑡𝑒𝑟𝑊 is maximum when all of the 𝑊 𝑘 ’s i-verticesare connected to two j-vertices such that at least one of them isnot in 𝑊 𝑘 (Figure 14–right). (Note, when all of j-neighbors are inprevious windows, there wouldn’t be any in-window butterfly in 𝑊 𝑘 ). In this case, the number of inter-window butterflies reduces tothe number of ways we can choose two i-vertices from the entireset of i-vertices: 𝐵 𝑖𝑛𝑡𝑒𝑟𝑊 = (cid:0) | 𝑉 𝑖,𝑊𝑘 | (cid:1) . Therefore, | 𝐸 𝑊 𝑘 | − | 𝑉 𝑖,𝑊 𝑘 | ≤ 𝐵 𝑖𝑛𝑡𝑒𝑟𝑊 ≤ (cid:0) | 𝑉 𝑖,𝑊𝑘 | (cid:1) . □ We test the effectiveness and efficiency of sGrapp and its optimizedversion sGrapp-x where x is the percentage of the available groundtruth. We use x=25, 50, 75, and 100. The ground truths are obtainedby running the exact counting Algorithm 1 over the graph streams.Due to the computational expense of Algorithm 1, we collect thetruth values over a limited number of sgrs: in Epinions, in ML100k, in ML1m, in ML10m, inEdit-EnWiki, and in Edit-FrWiki. The data sets that we useare described in Section 3.1.We report the effectiveness and efficiency of sGrapp and sGrapp-x in Sections 5.1 and 5.2, respectively. 
We also compare the performance of our algorithms with that of baselines in Subsection 5.3. Our experiments, as well as the analysis in Section 3, are conducted on a machine with […] GB native memory and an Intel Core i[…]-[…]HQ CPU @ 2.[…] GHz * […] processor. We have implemented the FLEET algorithms and the sGrapp algorithms in Java (OpenJDK version […], OpenJDK Runtime Environment build […]).

We compute the Mean Absolute Percentage Error (MAPE) of sGrapp for windows with a variable number of unique timestamps ($N_t^W$, y axis) and different exponent values ($\alpha$, x axis). These are shown in Figure 16. The number of unique timestamps per window, $N_t$, varies in different graph streams, therefore we set the value of $N_t^W$ differently for each graph stream. We cross-validated the values of $\alpha$ and $N_t^W$ to explore the region including the best accuracy (lowest MAPE, illustrated by the lightest color) for sGrapp. $MAPE = \frac{1}{n} \sum \frac{|B_k - \hat{B}_k|}{B_k}$, where $B_k$ is the ground truth computed over the growing graph at $t = W_k^e$ by Algorithm 1, $\hat{B}_k$ is the approximated value at $t = W_k^e$, and $n$ is the number of windows. The data tips in the figures mark the pair of $\alpha$ and $N_t^W$ yielding the lowest MAPE.

We observe that the approximation accuracy of sGrapp is not sensitive to window length and the exponent, since there exists a combination of approximation exponent and window length for each graph stream that yields an appropriate MAPE (Figure 16). In fact, the best MAPE of sGrapp is significantly lower than […] in all of the rating graph streams, demonstrating that sGrapp is a good approximator of the actual butterfly count.

When the approximation exponent is high and the window is compact (bottom right corners in Figure 16), the error is high. In this case, sGrapp overestimates the number of inter-window butterflies due to the high exponent value. Also, when the exponent is low and the window includes a large number of sgrs (top left corner in Figure 16), the error is high.
The reason in this case is that sGrapp underestimates the number of inter-window butterflies. An appropriate parameter region to gain a reasonable accuracy is where $\alpha$ and $N_t^W$ are both high or both low (the diagonal from the top right corner to the bottom left corner in Figure 16). The best accuracy is always obtained for higher exponent values. For rating networks, an appropriate exponent value for sGrapp is $\alpha =$ […].

As we investigated the contribution of hubs to the emergence of butterflies (Section 3), we relate the value of the approximation exponent to the probability of having at least one i-hub ($P(N^t_{iHub} \ge 1)$) plus the probability of having at least one j-hub ($P(N^t_{jHub} \ge 1)$) in the butterflies at time $t$, i.e. $\alpha = P(t) = P(N^t_{iHub} = 1) + P(N^t_{iHub} = 2) + P(N^t_{jHub} = 1) + P(N^t_{jHub} = 2)$ (Table 5). That is, the value of $\alpha$ can be determined based on the probability of i- or j-hubs forming butterflies at a certain time point $t$. The time point $t$ is likely a tipping point where the number of hub connections in the graph is stabilized (Figures 9 and 10). To check this, we calculate the value of $P(t)$ for $t \in \{$[…], […], .., […], […]$\}$ in the Epinions graph stream. We compute the value of MAPE for sGrapp($N_t^W$, $\alpha$). We set $\alpha = P(t)$ and $N_t^W \in \{$[…]$N_t$, […]$N_t$, […]$N_t$, […]$N_t$, […]$N_t\}$. In Table 7, we report the value of MAPE for the approximations with different exponent values and different fractions of unique timestamps per adaptive window. We observe that, at $t =$ […], where the exponent is equal to $\alpha = P(t =$ […]$) \approx$ […], the approximation error is the lowest. This time point is a tipping point: the fraction of average hub degree is steadily low after it and high before it (Figures 9 and 10). Moreover, in Figure 16, we see that the best accuracy is obtained when the exponent is equal to $P(t =$ […]$) =$ […].
We leave further investigation of the significance of these values as future work.

After evaluating sGrapp in terms of the average window errors (MAPE), we delve into its performance evolution over windows so that we can track the origins of the accuracy gain. We pick the most accurate $\alpha$ and $N_t^W$ (highlighted data points in Figure 16) and plot the relative error $\frac{|B_k - \hat{B}_k|}{B_k}$, signed by the direction of the deviation, for each window $W_k$ in Figure 25. Depending on the value of $N_t^W$, the number of windows varies across graph streams. Positive errors (depicted by red upward triangles) reflect over-estimations and negative errors (depicted by blue downward triangles) reflect under-estimations. In ML10m, Edit-EnWiki and Edit-FrWiki, the approximation begins with over-estimation and ends up with under-estimation. The underlying reason is the static exponent over sequential windows that have different numbers of connections to the old hubs and, consequently, different numbers of inter-window butterflies.

We also evaluate the accuracy of sGrapp-x in terms of MAPE in the region where sGrapp displays its lowest errors (Figures 17 – 20). This enables a fair comparison of sGrapp with its optimized version sGrapp-x. Note that sGrapp-x begins with a given exponent value and ends up with a modified value after the supervision phase reaches an error below […]. Therefore we fed sGrapp-x with the same input values of $\alpha$ and $N_t^W$ as sGrapp. The values shown in Figures 17 – 20 reflect the inputs.

It is evident from these figures that sGrapp-x improves the accuracy, which can be summarized as (a) improving the minimum MAPE (Figure 21), (b) improving the maximum MAPE (Figure 22), and (c) expanding the coverage of MAPE ≤ 0.15 and MAPE ≤ 0.2 (Figures 23 and 24). As illustrated in Figure 21, the minimum MAPE value in the studied parameter space is roughly the same for both sGrapp and sGrapp-x, $x =$ […] − […], in all rating graph streams. sGrapp-x lowers the minimum MAPE with respect to sGrapp in the Edit-EnWiki graph from […] to […] (via $x =$ […]), […] (via $x =$ […]), […] (via $x =$ […]), and […] (via $x =$ […]); in the Edit-FrWiki graph from […] to […] (via $x =$ […]), […] (via $x =$ […]), […] (via $x =$ […]), and […] (via $x =$ […]). That is, the minimum MAPE is lowered by between […] and […] in Edit-EnWiki and between […] and […] in Edit-FrWiki. As illustrated in Figure 22, the maximum MAPE related to the over-estimations (bottom right corners in Figures 17 – 20) is notably decreased in all graph streams. The most significant decrease corresponds to the Edit-FrWiki stream, with the highest change from […] to […] (via $x =$ […], […]), and the Edit-EnWiki stream, with the highest change from […] to […] (via $x =$ […]).

In Figures 23 and 24, we present the probability of approximation with MAPE ≤ 0.15 and MAPE ≤ 0.2 ($P(MAPE \le 0.15\,(0.2))$) by calculating the fraction of approximations that satisfy MAPE ≤ 0.15 and MAPE ≤ 0.2. That is the relative coverage of the light blue areas in Figures 16 – 20. When the approximation MAPE is above 0.15 or 0.2, the corresponding bars are omitted in Figures 23 and 24. Since sGrapp-100 approximates the number of butterflies in Edit-EnWiki with highest MAPE equal to […], the corresponding bar has a height of […]. sGrapp-25 improves the accuracy of sGrapp in MovieLens10m more than the other sGrapp-x versions. For the other graph streams, when $x \ge$ […], sGrapp-x displays a fairly good accuracy improvement, as the probability of accurate approximation (i.e. average window error below 0.15 and 0.2) is amplified. As expected, sGrapp-100 has the most improvement; however, sGrapp-75 and sGrapp-50 are reliable alternatives for Edit-FrWiki and the rest of the graph streams, respectively. sGrapp-x, $x$ = 25, 50, 75, and 100, can achieve $P(MAPE \le 0.15\,(0.2))$ equal to […] (78.53%), […] ([…]), […] ([…]), and […] ([…]). Most notably, sGrapp-50(75) increases $P(MAPE \le$ […]$)$ from […] to […] ([…])% in Edit-EnWiki.

We check the evolution of the signed value of relative error over windows for the data points with the lowest sGrapp-x MAPE. As shown in Figures 26, 27, 28, and 29, dynamically changing the approximation exponent heals the under/over-estimation problem; hence the average window error is diminished. There is always a value of x by which sGrapp-x can yield an average approximation error less than or equal to […] in rating graphs and […] in Wikipedia graphs.
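The three accuracy measures used in this subsection (MAPE over windows, the signed per-window relative error, and the fraction of parameter settings with MAPE under a threshold) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the function names and the list-based inputs are assumptions. The sign convention follows the text: a positive error means over-estimation.

```python
def mape(truth, estimates):
    """Mean absolute percentage error over n windows: (1/n) * sum(|B_k - B^_k| / B_k)."""
    if not truth or len(truth) != len(estimates):
        raise ValueError("truth and estimates must be non-empty and of equal length")
    return sum(abs(b - e) / b for b, e in zip(truth, estimates)) / len(truth)


def signed_relative_errors(truth, estimates):
    """Per-window relative error; positive means over-estimation, negative under-estimation."""
    return [(e - b) / b for b, e in zip(truth, estimates)]


def coverage(mape_values, threshold):
    """Fraction of (alpha, N_t^W) settings whose MAPE is at most `threshold`."""
    if not mape_values:
        raise ValueError("mape_values must be non-empty")
    return sum(1 for m in mape_values if m <= threshold) / len(mape_values)
```

For example, with ground truths `[100, 200, 400]` and estimates `[110, 180, 400]`, the first window is over-estimated by 10% and the second under-estimated by 10%, so the signs differ while both contribute equally to MAPE.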

Table 7: Epinions - The approximation MAPE for different adaptive window lengths (columns) and different exponents calculated as the probability of one or two i-hubs plus the probability of one or two j-hubs at different time points (rows).

[Table 7 body: columns are five values of $N_t^W$ given as fractions of $N_t$; rows are ten exponents $\alpha = P(t)$ at successive time points. The numeric entries are missing in this copy.]

Table 8: Throughput of different algorithms for $\gamma$ = 0.7.

Columns, in order: FLEET2 (M=75k); FLEET3 (M=75k); FLEET2 (M=150k); FLEET3 (M=150k); FLEET2 (M=300k); FLEET3 (M=300k); FLEET2 (M=600k); FLEET3 (M=600k); sGrapp; sGrapp-100
Epinions: 89 575; 137 411; 59 336; 53 077; 16 912; 16 360; 11 028; 10 907; 182 427; 166 895
ML100k: 3 664; 5 652; 4 691; 4 717; 3 509; 3 424; 4 268; 4 378; 8 026 (only one of the two sGrapp entries is present)
ML1m: 23 490; 23 292; 12 038; 7 355; 2 383; 1 673; 1 004; 857; 26 698; 26 487
ML10m: 147 665; 72 918; 62 905; 23 536; 16 719; 5 358; 4 410; 2 337; 234 571; 228 021
Edit-FrWiki: 554 741; 155 343; 298 019; 57 477; 116 917; 16 856; 41 051; 6 240; 985 265 (only one of the two sGrapp entries is present)
Edit-EnWiki: 719 375 1 373 708 305 347 911 170 114 806 324 183 34 283 1 085 185 1 098 382 (digit grouping ambiguous)

Table 9: MAPE of different algorithms for $\gamma$ = 0.7 and M = 0.1S and same $N_t^W$.

Columns: FLEET1; FLEET2; FLEET3; sGrapp; sGrapp-25; sGrapp-50; sGrapp-75; sGrapp-100
Epinions: 0.058 13.789 0.336 (three of eight entries present)
ML1m: 0.085 5.261 0.047 0.043 (four of eight entries present)

We evaluate the efficiency of sGrapp and sGrapp-100 by averaging over 50 independent cases. We do not report the efficiency metrics for sGrapp-x for $x < 100$ since their efficiency is close to that of sGrapp-100. For each graph stream we study the performance for the parameter settings that yield the best accuracy (highlighted data points in Figures 16 and 20) to see the overhead of a highly accurate approximation. Note that parameter values do not affect the efficiency.

We check the latency of sGrapp and sGrapp-100 for each processing window (Figures 31 and 32). We observe that the window latency of all the graph streams (except Epinions) is not decreasing. The window latency of each graph stream follows its temporal distribution pattern (Figure 13). Therefore, to omit the effect of temporal distribution, we study the performance by considering both the processing time (latency) and the number of processed elements. To this end, at the end point of each window, we check the window throughput (i.e. the number of processed edges in the window divided by the elapsed time in seconds, Figures 35 and 36) as well as the total throughput (i.e. the total number of processed edges from the first window until the end of the current window divided by the total elapsed time in seconds, Figures 33 and 34).

The window throughput fluctuates due to the varying number of sgrs in each window; overall, however, it is higher in later windows for both sGrapp and sGrapp-100. The total throughput of both sGrapp and sGrapp-100 displays an increasing pattern. As mentioned in the previous section, the old hubs are the main contributors to butterfly formation. Since old hubs occur in the early windows, the later windows mostly include butterfly vertices with lower degree. That is, there are fewer windowed butterflies in later windows than inter-window butterflies. Therefore, the exact counting algorithm that computes the number of windowed butterflies finishes quicker.
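The two throughput metrics defined above can be computed from per-window measurements as in the following minimal sketch. The function name `throughputs` and the `(edges_processed, elapsed_seconds)` input format are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def throughputs(windows):
    """Compute window and total throughput (edges/s) at the end of each window.

    `windows` is an ordered list of (edges_processed, elapsed_seconds) pairs,
    one per tumbling window. Returns (window_throughput, total_throughput).
    """
    window_tp = []
    total_tp = []
    total_edges = 0
    total_seconds = 0.0
    for edges, seconds in windows:
        if seconds <= 0:
            raise ValueError("elapsed time must be positive")
        total_edges += edges
        total_seconds += seconds
        window_tp.append(edges / seconds)             # this window only
        total_tp.append(total_edges / total_seconds)  # first window up to now
    return window_tp, total_tp
```

Because the total throughput is cumulative, it smooths out the per-window fluctuations caused by windows that contain different numbers of sgrs.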
Rapid approximation of the inter-window butterflies also plays the main role in reducing the processing time, enhancing the total throughput. Evidence for this is the throughput for MovieLens100k, which has an almost uniform temporal distribution: we observe an increasing total throughput over windows. This is important since the number of sgrs in the windows is not decreasing while the throughput is increasing. This confirms that (1) the algorithm's power is independent of the structural/temporal characteristics of the input data and (2) the algorithm is efficient particularly in dense graph streams.

We compare the effectiveness and efficiency of the sGrapp suite and the FLEET suite. Experimental results of the FLEET suite show that FLEET3, FLEET2 and FLEET1 have the best performance (in that order), so we use those as baselines. While sGrapp has the $\alpha$ (approximation exponent) and $N_t^W$ (number of unique timestamps per window) parameters, FLEET has the $M$ (reservoir size) and $\gamma$ (sub-sampling probability) parameters. Since the performance of the FLEET algorithms is sensitive to their parameters, we compare our algorithms against the FLEET settings which achieve the best performance. We set the sub-sampling probability as $\gamma = 0.7$, as suggested by the FLEET authors [48].

We observe that when the reservoir size $M$ is greater than the entire stream, latency is negatively impacted, since sub-sampling does not occur, all the edges are added to the reservoir, and the exact butterfly counting is executed for each new edge. Hence, for evaluating the accuracy over the prefix of a stream, we set $M = 0.1S$, where $S$ is the size of the available stream. For evaluating the efficiency, we also use a range of values $M \in \{75k, 150k, 300k, 600k\}$ to examine the throughput over the entire stream; these values are the ones offered in the original paper [48]. We use the approximation exponent values yielding the lowest MAPE in sGrapp, which do not necessarily yield the best MAPE in the optimized variant sGrapp-x.
Since FLEET algorithms use different window semantics than sGrapp, we use virtual time-based adaptive windows over FLEET algorithms to extract the estimated values at the end of virtual windows for accuracy evaluations only (not for efficiency tests). We use the same value of $N_t^W$ for the sGrapp and FLEET suites to compute MAPE: $N_t^W \in [$[…], […], […], […], […], […]$]$ for Epinions, ML100k, ML1m, ML10m, Edit-EnWiki, and Edit-FrWiki, respectively. For efficiency comparisons, we used the same value used in the effectiveness experiments, since our goal is to check the efficiency cost of the most accurate approximation. For each $N_t^W$, there exists an $\alpha$ yielding a high precision estimate. $N_t^W$ does not affect accuracy.

In Table 8, we report the total throughput over the entire graph streams for the sGrapp and FLEET suites. Since FLEET1's throughput is very low, we do not include it in this experiment. By increasing the size of the reservoir, the throughput of all FLEET algorithms decreases, since the frequency of exact butterfly counting per edge increases. It is always the case that $M = 75k$ and $M = 600k$ yield the highest and the lowest throughput, respectively. sGrapp outperforms FLEET for every setting: minimum (maximum) ratios of sGrapp to FLEET throughput are […] ([…]), […] ([…]), […] ([…]), […] ([…]), […] ([…]), and […] ([…]) in Epinions, ML100k, ML1m, ML10m, Edit-FrWiki, and Edit-EnWiki, respectively. sGrapp and its optimized version outperform the FLEET suite within a range of [[…], […]], with the performance improvement increasing as graph streams become larger (i.e., Edit-FrWiki, ML10m, and Edit-EnWiki).

Figure 15: Impact of FLEET parameters on estimate.

In Table 9, we report accuracy (in terms of MAPE) of the sGrapp and FLEET suites over the subset of the stream with available true values. We observe that sGrapp and sGrapp-x achieve MAPE values equal to […], […], […], […], […], and […] in Epinions, ML100k, ML1m, ML10m, Edit-FrWiki, and Edit-EnWiki, which are significantly lower than those of FLEET: sGrapp errors are […]×, […]×, […]×, […]×, […]×, and […]× of FLEET for these graphs. Table 9 (Table 8) shows that for ML10m, FLEET3's accuracy (throughput) is […] better (up to […]x lower) than sGrapp, explaining the high computational cost of FLEET3 in this specific dataset. FLEET3 updates the estimate for each new edge by enumerating butterflies incident to that edge. This increases the probability of detecting the incident butterflies by a factor of $P$ (i.e. sampling probability); however, the computations are much increased. This technique is more impactful in ML10m, with its high butterfly density. The butterfly estimate $\hat{B}$ is updated as soon as an edge arrives in FLEET3, or during the sampling and (or) sub-sampling phase in FLEET1 (FLEET2). In FLEET1, when $P$ is not high, or $M$ is small and $\gamma$ is low, $\hat{B}$ is not frequently updated and the error goes up. In FLEET2, many butterflies are missed due to sampling. Moreover, FLEET has poor accuracy when the butterflies are distributed across the edges uniformly (e.g. Edit-EnWiki, with a low butterfly density of […] × 10^(−[…]) according to the statistics in [48]). The reason is that $\hat{B}$ is updated for some edges only. In summary, the accuracy of the FLEET algorithms depends highly on $M$, $\gamma$, and the frequency of updating $\hat{B}$, because $\hat{B}$ is updated with respect to $P$, and $P$ is updated as $p \leftarrow p \cdot \gamma$ in each sampling round, which in turn increases $\hat{B}$ more. As depicted in Figure 15, $M$ and $\gamma$ (confounding variables) impact $P$, and $P$ impacts $\hat{B}$ directly through the formula and indirectly through the frequency of updates. A high frequency of butterfly counting and high sub-sampling come at the cost of low throughput. A large $M$ comes at the cost of memory consumption as well as latency issues. The FLEET suite cannot guarantee both efficiency and effectiveness at the same time.
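The interaction between $M$, $\gamma$, and $P$ described above can be made concrete with a small sketch of a reservoir that is sub-sampled whenever it fills up. This is a simplified illustration under stated assumptions, not the FLEET implementation from [48]: the function name, the seeding, and the exact trigger condition are hypothetical, and the butterfly estimate update is omitted entirely. It only shows how $p \leftarrow p \cdot \gamma$ evolves and why no sub-sampling happens when $M$ exceeds the stream length.

```python
import random


def subsampling_reservoir(stream, capacity, gamma, seed=0):
    """Keep a sample of edges, sub-sampling with probability gamma when full.

    Returns (reservoir, p, rounds) where p is the final sampling probability
    and rounds is the number of sub-sampling rounds that occurred.
    """
    if capacity < 1 or not 0.0 < gamma < 1.0:
        raise ValueError("need capacity >= 1 and 0 < gamma < 1")
    rng = random.Random(seed)
    p = 1.0
    rounds = 0
    reservoir = []
    for edge in stream:
        if rng.random() < p:           # admit the new edge with probability p
            reservoir.append(edge)
        while len(reservoir) >= capacity:
            # Sub-sampling round: keep each stored edge with probability gamma.
            reservoir = [e for e in reservoir if rng.random() < gamma]
            p *= gamma                  # p <- p * gamma
            rounds += 1
    return reservoir, p, rounds
```

With a capacity larger than the stream, `p` stays at 1 and every edge is retained, which matches the observation that an oversized reservoir forces per-edge exact counting; with a small capacity, `p` shrinks geometrically with the number of rounds.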
sGrapp does not suffer from the aforementioned issues since it does not rely on exact counting and sampling; rather it relies on counting the inter-window butterflies. sGrapp keeps the computational footprint of exactly counting the in-window butterflies low by means of the load-balanced adaptive windows and then effectively estimates the number of inter-window butterflies, which are the dominant butterflies, based on the butterfly densification power law formalism.

We studied the fundamental problem of dense bi-clique counting in streaming graphs. We introduced an effective and efficient framework for approximate butterfly counting, sGrapp. Following a data-driven approach, we conducted extensive graph analysis to unveil the organizing principles of temporal butterflies in streaming graphs (the butterfly densification power law). These insights shed light on developing the sGrapp algorithm. sGrapp utilizes a new exact counting core and a time-based windowing technique which adapts to the temporal distribution of the graph stream with no assumptions on the order and rate of the stream, making it applicable to any real stream. sGrapp displays $MAPE <$ […] in graph streams with an almost uniform temporal distribution. The optimized version, called sGrapp-x, handles graph streams with a non-uniform temporal distribution with MAPE below […]. sGrapp-x lowers the minimum and maximum MAPE of sGrapp and also increases the probability of approximation error below 0.15 and 0.2, most notably in the densest graph streams. sGrapp variants perform much better than existing algorithms.

Figure 16: [Best viewed in color.] Accuracy of sGrapp for different values of $\alpha$ and $N_t^W$.
Figure 17: [Best viewed in color.] Accuracy of sGrapp-25 for different values of $\alpha$ and $N_t^W$.
Figure 18: [Best viewed in color.] Accuracy of sGrapp-50 for different values of $\alpha$ and $N_t^W$.
Figure 19: [Best viewed in color.] Accuracy of sGrapp-75 for different values of $\alpha$ and $N_t^W$.
Figure 20: [Best viewed in color.] Accuracy of sGrapp-100 for different values of $\alpha$ and $N_t^W$.
Figure 21: Minimum approximation MAPE.
Figure 22: Maximum approximation MAPE.
Figure 23: Probability of approximation with MAPE less than or equal to 0.15.
Figure 24: Probability of approximation with MAPE less than or equal to 0.2.
Figure 25: Relative Error of sGrapp over windows for the best obtained MAPE.
Figure 26: Relative Error of sGrapp-25 over windows for the best obtained MAPE.

Figure 27: Relative Error of sGrapp-50 over windows for the best obtained MAPE.
Figure 28: Relative Error of sGrapp-75 over windows for the best obtained MAPE.
Figure 29: Relative Error of sGrapp-100 over windows for the best obtained MAPE.
Figure 30: MAPE of different algorithms.
Figure 31: Average window latency (s) of sGrapp.
Figure 32: Average window latency (s) of sGrapp-100.
Figure 33: Average total throughput (edge/s) of sGrapp at the end of each window.
Figure 34: Average total throughput (edge/s) of sGrapp-100 at the end of each window.
Figure 35: Average window throughput (edge/s) of sGrapp at the end of each window.
Figure 36: Average window throughput (edge/s) of sGrapp-100 at the end of each window.

REFERENCES

[1] Sinan G Aksoy, Tamara G Kolda, and Ali Pinar. Measuring and modeling bipartite graphs with community structure. Journal of Complex Networks, 5(4):581–603, 2017.
[2] Arvind Arasu, Brian Babcock, Shivnath Babu, Jon McAlister, and Jennifer Widom. Characterizing memory requirements for queries over continuous data streams. ACM Trans. Database Syst., 29(1):162–194, 2004.
[3] Shaikh Arifuzzaman, Maleq Khan, and Madhav Marathe. Patric: A parallel algorithm for counting triangles in massive networks. In Proc. 22nd ACM Int. Conf. on Information and Knowledge Management, pages 529–538, 2013.
[4] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 1–16, 2002.
[5] Ziv Bar-Yossef, Ravi Kumar, and D Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. 13th annual ACM-SIAM symposium on Discrete algorithms, pages 623–632, 2002.
[6] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[7] Michael J Barber. Modularity and community detection in bipartite networks. Physical Review E, 76(6):066102, 2007.
[8] Luca Becchetti, Paolo Boldi, Carlos Castillo, and Aristides Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In Proc. 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 16–24, 2008.
[9] Suman K Bera and Amit Chakrabarti. Towards tighter space bounds for counting triangles and other substructures in graph streams. In Proc. 34th Symposium on Theoretical Aspects of Computer Science, 2017.
[10] Massimo Bernaschi, Alessandro Celestini, Stefano Guarino, Flavio Lombardi, and Enrico Mastrostefano. Spiders like onions: on the network of tor hidden services. In Proc. 28th Int. World Wide Web Conf., pages 105–115, 2019.
[11] Vladimir Braverman, Rafail Ostrovsky, and Dan Vilenchik. How hard is counting triangles in the streaming model? In […], pages 244–254, 2013.
[12] Luciana S Buriol, Gereon Frahling, Stefano Leonardi, Alberto Marchetti-Spaccamela, and Christian Sohler. Counting triangles in data streams. In Proc. 25th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 253–262, 2006.
[13] Luciana S Buriol, Gereon Frahling, Stefano Leonardi, and Christian Sohler. Estimating clustering indexes in data streams. In Proc. European Symposium on Algorithms, pages 618–632, 2007.
[14] Guido Caldarelli, Romualdo Pastor-Satorras, and Alessandro Vespignani. Structure of cycles and local ordering in complex networks. The European Physical Journal B, 38(2):183–186, 2004.
[15] Lijun Chang, Jeffrey Xu Yu, Lu Qin, Hong Cheng, and Miao Qiao. The exact distance to destination in undirected world. VLDB J., 21(6):869–888, 2012.
[16] Norishige Chiba and Takao Nishizeki. Arboricity and subgraph listing algorithms. SIAM Journal on Computing, 14(1):210–223, 1985.
[17] Shumo Chu and James Cheng. Triangle listing in massive networks and its applications. In Proc. 17th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 672–680, 2011.
[18] Tamas David-Barrett. Herding friends in similarity-based architecture of social networks. Scientific Reports, 10(1):1–6, 2020.
[19] Jean-Loup Guillaume and Matthieu Latapy. Bipartite structure of all complex networks. Information processing letters, 90(5):215–221, 2004.
[20] Roger Guimerà, Marta Sales-Pardo, and Luís A Nunes Amaral. Module identification in bipartite and directed networks. Physical Review E, 76(3):036102, 2007.
[21] Ali Hadian, Sadegh Nobari, Behrooz Minaei-Bidgoli, and Qiang Qu. Roll: Fast in-memory generation of gigantic scale-free networks. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1829–1842, 2016.
[22] Jelle Hellings, George H.L. Fletcher, and Herman Haverkort. Efficient external-memory bisimulation on dags. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 553–564, 2012.
[23] Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung. Massive graph triangulation. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 325–336, 2013.
[24] Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung. I/o-efficient algorithms on triangle listing and counting. ACM Trans. Database Syst., 39(4):1–30, 2014.
[25] Jiewen Huang and Daniel J Abadi. Leopard: Lightweight edge-oriented partitioning and replication for dynamic graphs. Proc. VLDB Endowment, 9(7):540–551, 2016.
[26] Zan Huang. Link prediction based on graph topology: The predictive value of generalized clustering coefficient. Available at SSRN 1634014, 2010.
[27] Ruoming Jin, Hui Hong, Haixun Wang, Ning Ruan, and Yang Xiang. Computing label-constraint reachability in graph databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 123–134, 2010.
[28] Hyun-Joo Kim and Jin Min Kim. Cyclic topology in complex networks. Physical Review E, 72:036109, 2005.
[29] Jinha Kim, Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, and Hwanjo Yu. Opt: a new framework for overlapped and parallel triangulation in large-scale graphs. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 637–648, 2014.
[30] Myunghwan Kim and Jure Leskovec. Multiplicative attribute graph model of real-world networks. Internet mathematics, 8(1-2):113–160, 2012.
[31] Jérôme Kunegis. Konect: the koblenz network collection. In Proc. 22nd Int. World Wide Web Conf., pages 1343–1350, 2013.
[32] Matthieu Latapy, Clemence Magnien, and Nathalie Del Vecchio. Basic notions for the analysis of large affiliation networks/bipartite graphs. arXiv preprint cond-mat/0611631, 2006.
[33] Xi Tong Lee, Arijit Khan, Sourav Sen Gupta, Yu Hann Ong, and Xuan Liu. Measurements, analyses, and insights on the entire ethereum blockchain network. In Proc. The Web Conference 2020, pages 155–166, 2020.
[34] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proc. 11th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 177–187, 2005.
[35] Pedro G. Lind, Marta C. Gonzalez, and Hans J. Herrmann. Cycles and clustering in bipartite networks. Physical Review E, 72:056127, 2005.
[36] Kun Liu and Evimaria Terzi. Towards identity anonymization on graphs. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 93–106, 2008.
[37] Bingqing Lyu, Lu Qin, Xuemin Lin, Ying Zhang, Zhengping Qian, and Jingren Zhou. Maximum biclique search at billion scale. Proc. VLDB Endowment, 13(9):1359–1372, 2020.
[38] Chenhao Ma, Reynold Cheng, Laks VS Lakshmanan, Tobias Grubenmann, Yixiang Fang, and Xiaodong Li. Linc: a motif counting algorithm for uncertain graphs. Proc. VLDB Endowment, 13(2):155–168, 2019.
[39] Andrew McGregor. Graph stream algorithms: A survey. ACM SIGMOD Record, 43(1):9–20, 2014.
[40] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
[41] Jayanta Mondal and Amol Deshpande. Managing large dynamic graphs efficiently. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 145–156, 2012.
[42] Mark EJ Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.
[43] Mark EJ Newman, Steven H Strogatz, and Duncan J Watts. Random graphs with arbitrary degree distributions and their applications. Physical Review E, 64(2):026118, 2001.
[44] Rasmus Pagh and Francesco Silvestri. The input/output complexity of triangle enumeration. In Proc. 33rd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 224–233, 2014.
[45] Thomas Petermann and Paolo De Los Rios. Role of clustering and gridlike ordering in epidemic spreading. Physical Review E, 69, 2004.
[46] Erzsébet Ravasz and Albert-László Barabási. Hierarchical organization in complex networks. Physical Review E, 67(2):026112, 2003.
[47] Seyed-Vahid Sanei-Mehri, Ahmet Erdem Sariyuce, and Srikanta Tirthapura. Butterfly counting in bipartite networks. In Proc. 24th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 2150–2159, 2018.
[48] Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet Erdem Sariyüce, and Srikanta Tirthapura. Fleet: Butterfly estimation from a bipartite graph stream. In Proc. 28th ACM Int. Conf. on Information and Knowledge Management, pages 1201–1210, 2019.
[49] Ahmet Erdem Sarıyüce and Ali Pinar. Peeling bipartite networks for dense subgraph discovery. In Proc. 11th ACM Int. Conf. Web Search and Data Mining, pages 504–512, 2018.
[50] Yuya Sasaki, George H.L. Fletcher, and Makoto Onizuka. Structural indexing for conjunctive path queries. arXiv preprint arXiv:2003.03079, 2020.
[51] Aida Sheshbolouki, Mina Zarei, and Hamid Sarbazi-Azad. Are feedback loops destructive to synchronization? EPL (Europhysics Letters), 111(4):40010, 2015.
[52] Partha Pratim Talukdar, Zachary G Ives, and Fernando Pereira. Automatically incorporating new sources in keyword search-based data integration. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 387–398, 2010.
[53] Jia Wang, Ada Wai-Chee Fu, and James Cheng. Rectangle counting in large bipartite graphs. In Proc. 2014 IEEE Int. Congress on Big Data, pages 17–24, 2014.
[54] Kai Wang, Xuemin Lin, Lu Qin, Wenjie Zhang, and Ying Zhang. Vertex priority based butterfly counting for large-scale bipartite networks. Proc. VLDB Endowment, 12(10):1139–1152, 2019.
[55] Mengzhi Wang, Tara Madhyastha, Ngai Hang Chan, Spiros Papadimitriou, and Christos Faloutsos. Data mining meets performance evaluation: Fast algorithms for modeling bursty traffic. In Proc. 18th Int. Conf. on Data Engineering, pages 507–516, 2002.
[56] Nan Wang, Jingbo Zhang, Kian-Lee Tan, and Anthony KH Tung. On triangulation-based dense neighborhood graph discovery. Proc. VLDB Endowment, 4(2):58–68, 2010.
[57] Pinghui Wang, Yiyan Qi, Yu Sun, Xiangliang Zhang, Jing Tao, and Xiaohong Guan. Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage. Proc. VLDB Endowment, 11(2):162–175, 2017.

[58] Xifeng Yan, Philip S Yu, and Jiawei Han. Graph indexing: a frequent structure-based approach. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 335–346, 2004.
[59] Shengqi Yang, Xifeng Yan, Bo Zong, and Arijit Khan. Towards effective partition management for large graphs. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 517–528, 2012.
[60] Jiaxuan You, Jure Leskovec, Kaiming He, and Saining Xie. Graph structure of neural networks. In Proc. 37th Int. Conf. on Machine Learning, pages 10881–10891, 2020.
[61] Jianpeng Zhang, Kaijie Zhu, Yulong Pei, George H.L. Fletcher, and Mykola Pechenizkiy. Clustering-structure representative sampling from graph streams. In Proc. Int. Conf. Complex Networks and their Applications, pages 265–277, 2017.
[62] Peng Zhang, Jinliang Wang, Xiaojia Li, Menghui Li, Zengru Di, and Ying Fan. Clustering coefficient and community structure of bipartite networks. Physica A: Statistical Mechanics and its Applications, 387(27):6869–6875, 2008.
[63] Peixiang Zhao, Jeffrey Xu Yu, and S Yu Philip. Graph indexing: Tree+ delta ≥ graph. In Proc. 33rd Int. Conf. on Very Large Data Bases, volume 7, pages 938–949, 2007.
[64] Abolfazl Ziaeemehr, Mina Zarei, and Aida Sheshbolouki. Emergence of global synchronization in directed excitatory networks of type i neurons.