Mining Frequent Patterns in Evolving Graphs

Cigdem Aslay (Aalto), [email protected]
Muhammad Anis Uddin Nasir (King Digital Entertainment), [email protected]
Gianmarco De Francisci Morales, [email protected]
Aristides Gionis (Aalto), [email protected]
ABSTRACT

Given a labeled graph, the frequent-subgraph mining (FSM) problem asks to find all the k-vertex subgraphs that appear with frequency greater than a given threshold. FSM has numerous applications ranging from biology to network science, as it provides a compact summary of the characteristics of the graph. However, the task is challenging, even more so for evolving graphs, due to the streaming nature of the input and the exponential time complexity of the problem.

In this paper, we initiate the study of the approximate FSM problem in both incremental and fully-dynamic streaming settings, where arbitrary edges can be added or removed from the graph. For each streaming setting, we propose algorithms that can extract a high-quality approximation of the frequent k-vertex subgraphs for a given threshold, at any given time instance, with high probability. In contrast to the existing state-of-the-art solutions that require iterating over the entire set of subgraphs for any update, our algorithms operate by maintaining a uniform sample of k-vertex subgraphs with optimized neighborhood-exploration procedures local to the updates. We provide theoretical analysis of the proposed algorithms and empirically demonstrate that they generate high-quality results compared to baselines.

1 INTRODUCTION

Frequent-subgraph mining (FSM) is a fundamental graph-mining task with applications in various disciplines, including bioinformatics, security, and social sciences. The goal of FSM is to find subgraph patterns of interest that are frequent in a given graph. Such subgraphs might be indicative of an important protein interaction, a possible intrusion, or a common social norm. FSM also finds applications in graph classification and indexing.

Existing algorithms for subgraph mining are not scalable to large graphs that arise, for instance, in social domains. In addition, these graphs are usually produced as a result of a dynamic process, hence are subject to continuous changes.
For example, in social networks new edges are added as a result of the interactions of their users, and the graph structure is in continuous flux. Whenever the graph changes, i.e., by adding or removing an edge, a large number of new subgraphs can be created, and existing subgraphs can be modified or destroyed. Keeping track of all the possible changes in the graph is subject to combinatorial explosion and is thus highly challenging.

(Part of the work was done while the first author was at ISI Foundation, the second author was at KTH, and the third author was at QCRI.)

In this paper we address the problem of mining frequent subgraphs in an evolving graph, which is represented as a stream of edge updates: additions or deletions. Only a few existing works consider a similar setting [1, 4, 29]. Bifet et al. [4] deal with a transactional setting where the input is a stream of small graphs. Their setting is similar to the one considered by frequent-itemset mining, so many of the existing results can be reused. Conversely, in our case, there is a single graph that is continuously evolving. Ray et al. [29] consider a scenario similar to the one we study in this paper. They consider a single graph with continuous updates, although they only allow incremental ones (edge addition) rather than the fully-dynamic ones we consider (edge addition and deletion). Moreover, their approach is a simple heuristic that does not provide any correctness guarantee. Our approach, instead, is able to provably find the frequent subgraphs in a fully-dynamic graph stream. Abdelhamid et al. [1] tackle a problem setting similar to ours, with a single fully-dynamic evolving graph. They propose an exact algorithm which tracks patterns at the "fringe" of the frequency threshold, and borrows heavily from the existing literature on incremental pattern mining. As such, they need to use a specialized notion of frequency for graphs (minimum image support). Instead, our algorithm provides an approximate solution which uses the standard notion of induced subgraph isomorphism for frequency.

This paper is the first to propose an approximation algorithm for the frequent-subgraph mining problem on a fully-dynamic evolving graph. We propose a principled sampling scheme for subgraphs and provide theoretical justifications for its accuracy. Differently from previous work on sampling from graph streams, our method relies on sampling subgraphs rather than edges. This choice enables sampling any kind of subgraph of the same size with equal probability, and thus dramatically simplifies the design of the frequency estimators.
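Our approach builds on classic reservoir sampling, which maintains a uniform fixed-size sample of a stream of unknown length. As a primer, a minimal sketch (our own illustration, not the paper's implementation):

```python
import random

def reservoir_sample(stream, M, rng=random):
    """Maintain a uniform sample of M items from a stream of unknown length:
    the i-th item (1-based) is accepted with probability M / i and, once the
    reservoir is full, an accepted item evicts a uniformly random resident."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if len(sample) < M:
            sample.append(item)          # reservoir not yet full
        elif rng.random() < M / i:
            sample[rng.randrange(M)] = item  # evict a random resident
    return sample

rng = random.Random(0)
print(reservoir_sample(range(100), 5, rng))
```

At every point of the stream, each item seen so far is in the reservoir with the same probability M/i, which is the uniformity property the paper's algorithms preserve for subgraph streams.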
We maintain a uniform sample of subgraphs via reservoir sampling, which in turn allows us to estimate the frequency of the different patterns. To handle deletions in the stream, we employ an adapted version of random pairing [13]. Finally, to increase the efficiency of our sampling procedure during the exploration of the local neighborhood of updated edges, we employ an adaptation of the "skip optimization," proposed by Vitter [35] for reservoir sampling and by Gemulla et al. [14] for random pairing.

Concretely, our main contributions are the following:
• We are the first to propose an approximation algorithm for the frequent-subgraph mining problem on an evolving graph.
• We propose a new subgraph-based sampling scheme.
• We show how to use random pairing to handle deletions.
• We describe how to implement neighborhood exploration efficiently via the "skip optimization."
• We provide theoretical analysis and guarantees on the accuracy of the algorithm.

2 PROBLEM DEFINITION
We consider graphs with vertex and edge labels. We model dynamic graphs as a sequence of edge additions and deletions. We assume monitoring a graph that changes over time. For any time t ≥ 0, we let G_t = (V_t, E_t) be the graph that has been observed up to and including time t, where V_t represents the set of vertices and E_t represents the set of edges. We assume that vertices and edges have labels, and we write L and Q for the sets of labels of vertices and edges, respectively. For each vertex v ∈ V_t we denote its label by ℓ_v ∈ L, and similarly, for each edge e = (u,v) ∈ E_t we denote its label by q_e ∈ Q.

Initially, at time t = 0, we have V_0 = E_0 = ∅. For any t ≥ 0, at time t + 1 we receive an update ⟨o, e, q⟩ from a stream, where o ∈ {+, −} represents an update operation, addition or deletion, e = (u,v) is a pair of vertices, and q ∈ Q is an edge label. The graph G_{t+1} = (V_{t+1}, E_{t+1}) is obtained by adding a new edge or deleting an existing edge as follows:

  E_{t+1} = E_t ∪ {e} (with q_e = q), if o = +;
  E_{t+1} = E_t \ {e}, if o = −.

Additions and deletions of vertices are treated similarly. Furthermore, we assume that when adding an edge e = (u,v), the vertices u and v are added to the graph too, if they are not present at time t. Similarly, when deleting a vertex, we assume that all incident edges are deleted too, prior to the vertex deletion. Our model deals with a fully-dynamic stream of edges, which is different from a stream of graphs [36]. For simplicity of exposition, in the rest of the paper we discuss only edge additions and deletions; vertex operations can be handled rather easily.

We use n_t = |V_t| and m_t = |E_t| to refer to the number of vertices and edges, respectively, at time t. In this work, we consider simple, connected, and undirected graphs. The neighborhood of a vertex u ∈ V_t at time t is defined as N_u^t = {v | (u,v) ∈ E_t}, and its degree as d_u^t = |N_u^t|. Similarly, the h-hop neighborhood of u at time t is denoted by N_{u,h}^t, and indicates the set of vertices that can be reached from u in h steps by following the edges in E_t. To simplify the notation, we omit the dependency on t when it is obvious from the context.

For any graph G = (V, E) and a subset of vertices S ⊆ V, we say that G_S = (S, E(S)) is an induced subgraph of G if, for all pairs of vertices u, v ∈ S, it is (u,v) ∈ E(S) if and only if (u,v) ∈ E. We define C^k to be the set of all induced subgraphs with k vertices in G.
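The evolving-graph model above can be exercised with a short sketch (our own illustrative Python; the data representation is an assumption, not the paper's):

```python
# Maintain G_t = (V_t, E_t) under a stream of updates <o, e, q>, where
# o is '+' or '-', e = (u, v) is an undirected edge, and q is its label.
def apply_update(V, E, qlab, op, e, q=None):
    u, v = e
    key = frozenset({u, v})        # undirected edge as an unordered pair
    if op == '+':
        V.update((u, v))           # endpoints are added if not yet present
        E.add(key)
        qlab[key] = q              # label is fixed at insertion time
    else:                          # op == '-': delete an existing edge
        E.discard(key)
        qlab.pop(key, None)

V, E, qlab = set(), set(), {}
stream = [('+', (1, 2), 'a'), ('+', (2, 3), 'b'), ('-', (1, 2), None)]
for op, e, q in stream:
    apply_update(V, E, qlab, op, e, q)
assert E == {frozenset({2, 3})} and qlab[frozenset({2, 3})] == 'b'
```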
All subgraphs considered in this paper are induced subgraphs, unless stated otherwise.

We say that two subgraphs of G, G_S ∈ C^k and G_T ∈ C^k, are isomorphic if there exists a bijection I : S → T such that (u,v) ∈ E(S) if and only if (I(u), I(v)) ∈ E(T), and the mapping I preserves the vertex and edge labels, i.e., ℓ_u = ℓ_{I(u)} and q_{(u,v)} = q_{(I(u),I(v))}, for all u ∈ S and for all (u,v) ∈ E(S). We write G_S ≃ G_T to denote that G_S and G_T are isomorphic.

The isomorphism relation ≃ partitions the set of subgraphs C^k into T_k equivalence classes, denoted by C_1^k, ..., C_{T_k}^k. Each equivalence class C_i^k is called a subgraph pattern. (Notice that the value of T_k is determined solely by k, |L|, and |Q|.)

We define the support σ(G_S) of any k-vertex subgraph G_S ∈ C^k as the number of k-vertex subgraphs of G that are isomorphic to G_S, i.e., σ(G_S) = |C_i^k|, where G_S ∈ C_i^k. We then define the frequency f(G_S) of a subgraph G_S as the fraction of k-vertex subgraphs of G that are isomorphic to G_S, i.e., f(G_S) = σ(G_S)/|C^k|.

Next we define the problem of mining frequent k-vertex subgraphs. Given a graph G = (V, E, L, Q) and a frequency threshold τ ∈ (0, 1], the set F_τ^k ⊆ C^k of frequent k-vertex subgraphs of G with respect to τ is the collection of all k-vertex subgraphs with frequency at least τ, that is, F_τ^k = {G_S | G_S ∈ C^k and f(G_S) ≥ τ}.

Problem 2.1. Given a graph G = (V, E, L, Q), an integer k, and a frequency threshold τ, find the collection F_τ^k of frequent k-vertex subgraphs of G.

Let p_i = |C_i^k|/|C^k| denote the frequency of isomorphism class i, with i = 1, ..., T_k. The problem of finding the frequent k-vertex subgraphs requires finding all isomorphism classes C_i^k with p_i ≥ τ. Hence, we equivalently have

  F_τ^k = ⋃_{i ∈ [1, T_k]} {G_S | G_S ∈ C_i^k and p_i ≥ τ}.

In this paper, our aim is to find an approximation to the collection F_τ^k by efficiently estimating p_i from a uniform sample S of C^k. We say that a subset S ⊆ C^k, with |S| = M, is a uniform sample of size M from C^k if the probability of sampling S is equal to the probability of sampling any S' ⊆ C^k with |S'| = M, i.e., all samples of the same size are equally likely to be produced.

Formally, we want to find an (ϵ, δ)-approximation to F_τ^k, denoted by F̃_τ^k(ϵ, δ), such that

  F̃_τ^k(ϵ, δ) = ⋃_{i ∈ [1, T_k]} {G_S | G_S ∈ C_i^k ∩ S, |p̂_i − p_i| ≤ ϵ/2, p_i ≥ τ},

where p̂_i is an estimate of p_i such that |p̂_i − p_i| ≤ ϵ/2 holds with probability at least 1 − δ. In practice, the collection F̃_τ^k(ϵ, δ) of approximate frequent patterns is computed from a sample S ⊆ C^k. The problem of approximate frequent-subgraph mining can now be formulated as follows.

Problem 2.2. Given a graph G = (V, E, L, Q), a frequency threshold τ, a small integer k, and constants 0 < ϵ, δ < 1, find the collection F̃_τ^k(ϵ, δ) that is an (ϵ, δ)-approximation to F_τ^k.

We focus on the dynamic case with vertex and edge additions and deletions. As discussed above, at each time t we consider the graph G_t = (V_t, E_t) that results from all vertex and edge operations.
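To make Problem 2.1 concrete on a static snapshot, the following brute-force sketch enumerates all connected k-vertex induced subgraphs of a toy labeled graph and computes the pattern frequencies p_i. It is our own illustrative code, not the paper's implementation; canonical forms by trying all vertex orderings are feasible only for the small k considered here.

```python
from itertools import combinations, permutations

# Toy labeled graph (assumed example): adjacency sets and vertex labels.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
label = {1: "A", 2: "A", 3: "B", 4: "A"}

def connected(S):
    """Check that the subgraph induced by vertex set S is connected (DFS)."""
    S = set(S)
    start = next(iter(S))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u] & S:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == S

def canonical(S):
    """Canonical form of the induced subgraph on S: the lexicographically
    smallest (labels, edges) encoding over all vertex orderings."""
    best = None
    for perm in permutations(S):
        pos = {v: i for i, v in enumerate(perm)}
        labels = tuple(label[v] for v in perm)
        edges = tuple(sorted((min(pos[u], pos[v]), max(pos[u], pos[v]))
                             for u, v in combinations(S, 2) if v in adj[u]))
        key = (labels, edges)
        if best is None or key < best:
            best = key
    return best

def pattern_frequencies(k):
    """Frequencies f(G_S) = sigma(G_S) / |C^k| over connected k-subgraphs."""
    counts = {}
    for S in combinations(adj, k):
        if connected(S):
            key = canonical(S)
            counts[key] = counts.get(key, 0) + 1
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}

print(pattern_frequencies(3))  # one triangle pattern, one wedge pattern
```

On the toy graph there are three connected 3-subgraphs: the triangle {1,2,3} and the wedges {1,3,4} and {2,3,4}, so the two patterns have frequencies 1/3 and 2/3.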
Our goal is to maintain the approximate collection of frequent subgraphs F̃_τ^k(ϵ, δ) at each time t without having to recompute it from scratch after each addition or deletion.

In the following problem definition we assume that vertex/edge labels are specified when a vertex/edge is added in the graph stream and do not change afterwards. We make this assumption without loss of generality, as a vertex/edge label change can be simulated by a vertex/edge deletion followed by an addition of the same vertex/edge with a different label.

Problem 2.3. Given an evolving graph G_t = (V_t, E_t, L, Q), a frequency threshold τ, a small integer k, and constants 0 < ϵ, δ < 1, maintain an approximate collection of frequent subgraphs F̃_τ^k(ϵ, δ) at each time t.

3 ALGORITHMS
This section describes the proposed algorithms, which are based on subgraph sampling. We present two algorithms, both built on two components: a reservoir of samples and an exploration procedure. The goal of the reservoir is to capture the changes to already-sampled connected k-subgraphs. The goal of the exploration procedure is to include newly (dis)connected k-subgraphs into the sample. This separation of concerns allows the algorithm to minimize the amount of work per sample, e.g., by avoiding the computation of expensive minimum DFS codes for the corresponding patterns [39].

The base algorithm requires enumerating, at each time t, every newly (dis)connected k-subgraph at least once, by performing a neighborhood exploration around the updated edge. We show how to improve this algorithm by avoiding materializing all the subgraphs, via a skip optimization. This optimization enables picking subgraphs into the sample without having to list them all. We also propose an additional heuristic to speed up the neighborhood exploration. We provide an efficient implementation for the case k = 3, and describe how it generalizes to values k > 3.

We begin by describing our algorithm for maintaining a uniform sample S of fixed size M of k-subgraphs of G_t for incremental streams (edge additions only). The algorithm relies on reservoir sampling [35] to ensure the uniformity of the sample S.

The addition of an edge (u,v) ∉ E_{t−1} at time t affects only the subgraphs in the local neighborhoods up to N_{u,h}^{t−1} and N_{v,j}^{t−1}, where h + j = k − 2, i.e., all the connected k-subgraphs that contain u, v, and h + j additional vertices from their neighborhoods, for all admissible values of h, j ≥ 0. Therefore, a uniform sample S of subgraphs can be maintained by iterating through the subgraphs in the neighborhood of the newly inserted edge. In particular, consider the addition of an edge (u,v) at time t. Let H ⊆ N_{u,h}^{t−1} ∪ N_{v,j}^{t−1} be a subset of vertices, for some h and j such that h + j = k − 2, with {u,v} ⊆ H and |H| = k. There are two possible cases: (i) if H is connected in G_{t−1}, a modified subgraph H' = H ∪ {(u,v)} is formed in G_t; (ii) if H is not connected in G_{t−1} and H' = H ∪ {(u,v)} is connected in G_t, then H' is a newly formed connected k-subgraph in G_t.

Example for k = 3. Assume an edge (u,v) arrives at time t. For case (i) to hold, there should be some w ∈ N_u^{t−1} ∩ N_v^{t−1} for which the edge (u,v) closes the wedge ∧ = {(u,w), (w,v)} in G_{t−1}, forming a new triangle ∆ = ∧ ∪ {(u,v)} in G_t. For case (ii) to hold, there must be some w ∈ N_u^{t−1} (or w ∈ N_v^{t−1}) for which a new wedge {(u,v), (u,w)} (respectively, {(u,v), (v,w)}) is formed in G_t. □

When a modified subgraph H' is formed in G_t, if the previously connected subgraph H = H' \ {(u,v)} is present in S, we update the sample by substituting H with H'. Otherwise, we ignore the modified subgraph. Given that the elements in the sample are induced connected subgraphs, this operation is equivalent to keeping the sample up-to-date.

Conversely, when a new connected k-subgraph H' is formed in G_t, we can be sure that it appears at time t for the first time. Therefore, we use the standard reservoir sampling algorithm as follows: if |S| < M, we directly add the new subgraph H' to the sample S.

(Hereafter, we simply refer to a k-vertex induced subgraph as a k-subgraph.)
Otherwise, if |S| = M, we remove a randomly selected subgraph from S and insert the new one, H', with probability M/N, where M is the upper bound on the sample size and N is the total number of (valid) k-subgraphs encountered since t = 0. The modification of existing subgraphs in G_t (i.e., case (i)) does not affect N, since, by definition, they replace previous subgraphs that were already present in G_{t−1}. Therefore, the only increase in the count N of subgraphs occurs in the case of new connected k-subgraph formations in G_t (i.e., case (ii)). Algorithm 1 shows the pseudocode for incremental streams.

Algorithm 1: Algorithm for incremental streams
 1: N ← 0, S ← ∅
 2: M ← log(2 T_k/δ) · 2(4 + ϵ)/ϵ²
 3: procedure addEdge(t, (u,v))
 4:   for h ∈ [0, k−2] do
 5:     j ← k − 2 − h
 6:     for H ⊆ N_{u,h}^{t−1} ∪ N_{v,j}^{t−1} do
 7:       if H is connected in G_{t−1} then
 8:         if H ∈ S then
 9:           H' ← H ∪ {(u,v)}
10:           Replace(S, H, H')          ▷ replace H with H'
11:       else
12:         H' ← H ∪ {(u,v)}             ▷ H' is newly connected in G_t
13:         ReservoirSampling(H', S, M, N)
14: procedure Replace(S, G_R, G_S)
15:   S ← S \ {G_R}
16:   S ← S ∪ {G_S}

Next, we show that the sample S maintained by Algorithm 1 is uniform at any given time t.

Claim 3.1. Algorithm 1 ensures the uniformity of the sample S at any time t.

Proof. To show that S is uniform, we need to consider two cases: (i) the inserted edge modifies an existing k-subgraph; (ii) the inserted edge forms a newly connected k-subgraph. For the case of new subgraph formation, the uniformity property directly holds, as it leverages the standard reservoir sampling algorithm. Now, we show that the uniformity property holds when a subgraph is modified.

Assume the edge (u,v) ∉ E_{t−1} is inserted at time t, and let H denote the invalidated subgraph that is modified as H' = H ∪ {(u,v)} at time t. Let S' denote the sample after the invalidation of H and the formation of H'.
For the sample to be truly uniform, the probability that H' ∈ S' should be equal to M/N, conditioned on |S| = M < N (the case N ≤ M is trivial, since every k-subgraph of G_t is then deterministically included in S'). Now, given that Pr(H ∈ S) = M/N, we have

  Pr(H' ∈ S') = Pr(H ∈ S, H' ∈ S') + Pr(H ∉ S, H' ∈ S')
              = (M/N) · 1 + (1 − M/N) · 0
              = M/N,

hence uniformity is preserved. □

(Note that the addition of an edge (u,v) translates to a partially-dynamic k-subgraph stream, in which the k-subgraphs are subject to addition and deletion operations, while k-cliques are subject to addition-only operations. Thus, we can impose, without loss of generality, an order of operations during the exploration of the neighborhood of the inserted edge.)

Algorithm 2: Algorithm for fully-dynamic edge streams
 1: N ← 0, S ← ∅, c1 ← 0, c2 ← 0
 2: M ← log(2 T_k/δ) · 2(4 + ϵ)/ϵ²
 3: procedure addEdge(t, (u,v))
 4:   for h ∈ [0, k−2] do
 5:     j ← k − 2 − h
 6:     for H ⊆ N_{u,h}^{t−1} ∪ N_{v,j}^{t−1} do
 7:       if H is connected in G_{t−1} then
 8:         if H ∈ S then
 9:           H' ← H ∪ {(u,v)}
10:           Replace(S, H, H')          ▷ replace H with H'
11:       else
12:         H' ← H ∪ {(u,v)}             ▷ H' is newly connected in G_t
13:         RandomPairing(H', S, M)
14: procedure deleteEdge(t, (u,v))
15:   for h ∈ [0, k−2] do
16:     j ← k − 2 − h
17:     for H ⊆ N_{u,h}^{t−1} ∪ N_{v,j}^{t−1} do
18:       if H is still connected in G_t then
19:         if H ∈ S then
20:           H' ← H \ {(u,v)}
21:           Replace(S, H, H')          ▷ replace H with H'
22:       else
23:         if H ∈ S then
24:           S ← S \ H
25:           c1 ← c1 + 1
26:         else
27:           c2 ← c2 + 1
28:         N ← N − 1
29: procedure RandomPairing(G_S, S, M)
30:   if c1 + c2 = 0 then
31:     ReservoirSampling(G_S, S, M, N)
32:   else
33:     if uniform(0,1) < c1/(c1 + c2) then
34:       S ← S ∪ {G_S}
35:       c1 ← c1 − 1
36:     else
37:       c2 ← c2 − 1

In this section we describe our algorithm for maintaining a uniform sample S of fixed size M for fully-dynamic edge streams (edge insertions and deletions). Our algorithm relies on random pairing (RP) [14], a sampling scheme that extends traditional reservoir sampling to evolving data streams in which elements are subject to both addition and deletion operations.

We first give a brief background on the RP scheme. In RP, the uniformity of the sample is guaranteed by randomly pairing an inserted element with an uncompensated "partner" deletion, without necessarily keeping the identity of the partner. At any time, there can be 0 or more uncompensated deletions; their number d equals the excess of the cumulative number of deletions over the cumulative number of insertions since d was last zero. The RP algorithm maintains (i) a counter c1 that records the number of uncompensated deletions in which the deleted element was in the sample, and (ii) a counter c2 that records the number of uncompensated deletions in which the deleted element was not in the sample; hence, d = c1 + c2. When d = 0, i.e., when there are no uncompensated deletions, inserted elements are processed as in standard reservoir sampling. When d > 0, the algorithm flips a coin for each inserted element and includes it in the sample with probability c1/(c1 + c2), otherwise it excludes it from the sample (and decreases c1 or c2 as appropriate).

Next, we describe our adaptation of the RP scheme for fully-dynamic edge streams, which translate to fully-dynamic k-subgraph streams. First, remember that an incremental edge stream translates to
2. The effect of the edge deletion is the following:either ( i ) the vertices of H remain connected, hence, H is replacedby a new subgraph H ′ in G t ; or ( ii ) H gets disconnected, hence H does not exist in G t . The first case corresponds to a modificationof an existing connected k -subgraph. As such, it does not cause anaddition or deletion in the subgraph stream. Example for k=3.
In the case a triangle ∆ in G t − that contains anedge ( u , v ) deleted at time t , if ∆ ∈ S , we modify the correspondinginduced subgraph into a subgraph ∧ = ∆ \ {( u , v )} . □ The second case corresponds to a deletion of a subgraph in thestream. To handle this case, our sampling strategy follows the RPscheme. In the case that a subgraph H in G t − is deleted due tothe deletion of edge ( u , v ) at time t , if H ∈ S , we increment thecounter c , otherwise we increment the counter c . In the case thata new subgraph H ′ is formed in G t due to the addition of edge ( u , v ) at time t , we include it in S with probability c /( c + c ) . Theapproach is shown in Algorithm 2. Next, we show that the sample S maintained by Algorithm 2 is uniform at any given time t .Claim 3.2. Algorithm 2 ensures the uniformity of the sample S atany time t . Proof. To show that S is uniform, we need to consider fourcases: ( i ) added edge forms a newly connected subgraph; ( ii ) deletededge disconnects a subgraph; ( iii ) added edge modifies an exist-ing a subgraph; ( iv ) deleted edge modifies an existing a subgraph.For cases ( i ) and ( ii ), the correctness follows from RP hence weonly show the correctness in cases ( iii ) and ( iv ). Assume the edge ( u , v ) ∈ E t − is deleted (resp. added) at time t . Let H ′ denote thenew subgraph due to the deletion (addition) of the edge, so that H ′ = H \ {( u , v )} (resp. H ′ = H ∪ {( u , v )} ). Let S ′ denote thesample after the invalidation of H and the formation of H ′ . Recallthat N remains unchanged since H ′ replaces H in G t . Given thatthe random pairing scheme guarantees uniformity of the sample ateach time instance independently from the current value of d [13],we have Pr ( H ∈ S) = |S|/ N . 
For the sample to be truly uniform, the probability that H' ∈ S' should also be equal to |S|/N, since the values of both N and |S| remain unchanged: we either replace H with H' in S, or we ignore H' if H ∉ S. Thus, we have

  Pr(H' ∈ S') = Pr(H ∈ S) · Pr(H' ∈ S' | H ∈ S) + Pr(H ∉ S) · Pr(H' ∈ S' | H ∉ S)
              = (|S|/N) · 1 + (1 − |S|/N) · 0
              = |S|/N,

hence uniformity is preserved. □

Algorithm 3: Compute the number W of new connected k-subgraphs
 1: procedure Compute-W(t, (u,v))
 2:   W ← 0
 3:   for h ∈ [0, k−2] do
 4:     j ← k − 2 − h
 5:     V_h ← N_{u,h}^{t−1} \ N_{v,j}^{t−1}
 6:     V_j ← N_{v,j}^{t−1} \ N_{u,h}^{t−1}
 7:     x ← |{G_S = (S, E(S)) : u ∈ S, |S| = h+1, S ⊆ V_h, E(S) ⊆ E(V_h)}|
 8:     y ← |{G_S = (S, E(S)) : v ∈ S, |S| = j+1, S ⊆ V_j, E(S) ⊆ E(V_j)}|
 9:     W ← W + x · y
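For k = 3, the computation of W reduces to counting the vertices adjacent to exactly one endpoint of the inserted edge; a sketch (our own illustrative code, not the paper's implementation):

```python
def compute_w_k3(adj, u, v):
    """Number W of newly connected 3-subgraphs created by inserting edge (u, v).

    For k = 3, a new connected 3-subgraph {u, v, w} is created exactly when w
    was adjacent to one endpoint but not the other before the insertion: a
    common neighbor w only *modifies* the existing wedge u-w-v into a
    triangle, so it does not contribute to W. `adj` maps each vertex to its
    neighbor set in G_{t-1}, i.e., before the edge is inserted.
    """
    nu = adj.get(u, set())
    nv = adj.get(v, set())
    only_u = nu - nv - {v}   # new wedges w-u-v, w attached to u only
    only_v = nv - nu - {u}   # new wedges u-v-w, w attached to v only
    return len(only_u) + len(only_v)

# Example: before inserting (1, 4), vertex 2 neighbors both endpoints,
# vertex 3 neighbors only vertex 1, and vertex 5 neighbors only vertex 4.
adj = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2, 5}, 5: {4}}
print(compute_w_k3(adj, 1, 4))  # new wedges {3,1,4} and {1,4,5} -> 2
```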
Algorithm 4: Optimized algorithm for incremental streams
 1: N ← 0, S ← ∅, sum ← 0
 2: M ← log(2 T_k/δ) · 2(4 + ϵ)/ϵ²
 3: SkipRS(N, M): skip function as in [35]   ▷ SkipRS(N, M) = 0 if N < M
 4: procedure addEdge(t, (u,v))
 5:   for H ∈ S : u ∈ H ∧ v ∈ H do
 6:     H' ← H ∪ {(u,v)}
 7:     Replace(S, H, H')                    ▷ replace H with H'
 8:   W ← Compute-W(t, (u,v))                ▷ Algorithm 3
 9:   I ← 0
10:   while sum ≤ W do
11:     I ← I + 1
12:     Z_RS ← SkipRS(N, M)
13:     N ← N + Z_RS + 1
14:     sum ← sum + Z_RS + 1
15:   replace I random elements of S with I random subgraphs drawn from N_{u,h}^t ∪ N_{v,j}^t, ∀h ∈ [0, k−2], j = k−2−h
16:   sum ← sum − W

The basic algorithm for incremental streams we described requires processing each subgraph H ⊆ N_{u,h}^{t−1} ∪ N_{v,j}^{t−1}, for all admissible values of h, j ≥ 0 with h + j = k − 2, to identify among them the newly created k-subgraphs. All these new subgraphs are then provided as input to the standard reservoir sampling algorithm, which needs to generate a random number for each of them. To reduce the cost of traversing the local neighborhood and generating a random number for each new subgraph, we employ Vitter's acceptance-rejection algorithm, which generates skip counters for reservoir sampling [35], as follows. Let Z_RS be the random variable denoting the number of rejected subgraphs after the last time a subgraph was inserted into the sample S. Then, the probability that the next z new subgraphs will not be accepted into S is given by:

  Pr[Z_RS = z] = (M / (N + z + 1)) · ∏_{z'=0}^{z−1} (1 − M / (N + z' + 1)).   (1)

Thus, rather than identifying all the new subgraphs and calling the reservoir algorithm for each one, we can keep a skip counter Z_RS distributed according to the probability mass function given in Eq. (1), and compute its value in constant time using Vitter's acceptance-rejection algorithm for reservoir sampling [35]. Then, based on the value of Z_RS, which denotes the number of new subgraph insertions we can safely skip, we can decide on the fly whether we should insert into the sample any of the new subgraphs created by the insertion of edge (u,v). Given that a new k-subgraph G_S = (S, E(S)) can be formed only when E(S) \ {(u,v)} is not already an induced subgraph, we can compute the exact value of W as in Algorithm 3. The pseudocode of the optimized algorithm for incremental streams is given in Algorithm 4.

A similar optimization is also possible for fully-dynamic streams, by proper adjustment of the skip counter based on the value d = c1 + c2 of uncompensated deletions. Recall that when d = 0, reservoir sampling is in effect; hence, we can compute the value of the skip counter Z_RS as in the case of incremental streams. When d > 0, let Z_RP be the random variable denoting the number of new subgraphs that are not accepted into the sample after the last time a subgraph was deleted (not necessarily from the sample) due to the deletion of an edge. Assume without loss of generality that the deletion of a subgraph was followed by the creation of d new subgraphs due to at least one edge insertion. Following the fact that the new elements that random pairing includes into the sample form a uniform random sample of size c1 among the d new elements [14], the probability that random pairing will not accept the next z new subgraphs into S is given by:

  Pr[Z_RP = z] = (c1 / (d − z)) · ∏_{z'=0}^{z−1} (1 − c1 / (d − z')).   (2)

Thus, after each edge deletion, we can compute in constant time the value of the skip counter Z_RP for random pairing, using the acceptance-rejection algorithm for list-sequential sampling [34], and decide on the fly whether, and how many, of the new W subgraphs created in the pairing step we should insert into the sample. The algorithm for computing the exact number D of deleted induced subgraphs when an edge (u,v) is deleted at time t is similar to the computation of W, but operates on the neighborhoods at time t instead of time t − 1.

Now we provide a lower bound on the size of the sample S such that F̃_τ^k(ϵ, δ) computed on S provides an (ϵ, δ)-approximation to F_τ^k.

Lemma 3.1. Suppose that |S| = M satisfies

  M ≥ log(2 T_k / δ) · 2(4 + ϵ)/ϵ².   (3)

Then, for any isomorphism class i ∈ [1, T_k], |p̂_i − p_i| ≤ ϵ/2 holds with probability at least 1 − δ/T_k.

Proof. Let X_i denote an indicator random variable that equals 1 if a randomly sampled subgraph G_S from C^k belongs to C_i^k, and 0 otherwise, for all i ∈ [1, T_k]. Notice that X_i ∼ Bernoulli(p_i).
W.l.o.g., let G_j, j ∈ [1, M], denote the j-th subgraph in S for an arbitrary ordering of the subgraphs, and let X_i^1, ..., X_i^M be i.i.d. copies of X_i, where each X_i^j denotes the event [G_j ∈ C_i^k]. Using the two-sided Chernoff bound we have

  Pr( |∑_{j=1}^{M} X_i^j − p_i M| ≥ θ M p_i ) ≤ 2 exp( −(θ²/(2 + θ)) · p_i M ),

which implies

  Pr( |p̂_i − p_i| ≥ θ p_i ) ≤ 2 exp( −(θ²/(2 + θ)) · p_i M ).

Now, let ϵ = 2 p_i θ. Substituting θ = ϵ/(2 p_i) we have

  Pr( |p̂_i − p_i| ≥ ϵ/2 ) ≤ 2 exp( −((ϵ/2)²/(2 p_i + ϵ/2)) · M ).

To obtain a failure probability of at most δ/T_k for each isomorphism class i ∈ [1, T_k], we should have

  2 exp( −((ϵ/2)²/(2 p_i + ϵ/2)) · M ) ≤ δ/T_k.

Rearranging the terms we obtain

  M ≥ log(2 T_k / δ) · (2 p_i + ϵ/2)/(ϵ/2)².

As we want this to hold for all i ∈ [1, T_k], M should satisfy

  M ≥ log(2 T_k / δ) · (2 p_max + ϵ/2)/(ϵ/2)²,

where p_max = max_{i ∈ [1, T_k]} p_i. Using the worst case p_max = 1, we obtain the following lower bound on M:

  M ≥ log(2 T_k / δ) · 2(4 + ϵ)/ϵ². □

Algorithm 5: Optimized algorithm for fully-dynamic edge streams
 1: N ← 0, S ← ∅, c1 ← 0, c2 ← 0
 2: M ← log(2 T_k/δ) · 2(4 + ϵ)/ϵ²
 3: SkipRS(N, M): skip function as in [35]   ▷ SkipRS(N, M) = 0 if N < M
 4: SkipRP(c1, c1 + c2): skip function as in [34]
 5: sum1 ← 0, sum2 ← 0
 6: procedure addEdge(t, (u,v))
 7:   for H ∈ S : u ∈ H ∧ v ∈ H do
 8:     H' ← H ∪ {(u,v)}
 9:     Replace(S, H, H')                    ▷ replace H with H'
10:   W ← Compute-W(t, (u,v))
11:   if c1 + c2 = 0 then
12:     I ← 0
13:     while sum1 ≤ W do
14:       I ← I + 1
15:       Z_RS ← SkipRS(N, M)
16:       N ← N + Z_RS + 1
17:       sum1 ← sum1 + Z_RS + 1
18:     replace I random elements of S with I random subgraphs drawn from N_{u,h}^t ∪ N_{v,j}^t, ∀h ∈ [0, k−2], j = k−2−h
19:     sum1 ← sum1 − W
20:   else
21:     I ← 0, sum2 ← 0
22:     while c1 + c2 > 0 and sum2 < W do
23:       I ← I + 1
24:       Z_RP ← SkipRP(c1, c1 + c2)
25:       c1 ← c1 − 1
26:       c2 ← c2 − Z_RP                     ▷ c2 = 0 if c2 < 0
27:       sum2 ← sum2 + Z_RP + 1
28:     replace I random elements of S with I random subgraphs drawn from N_{u,h}^t ∪ N_{v,j}^t, ∀h ∈ [0, k−2], j = k−2−h
29:     W ← W − sum2
30:     if W > 0 then
31:       jump to line 12
32: procedure deleteEdge(t, (u,v))
33:   for H ∈ S : u ∈ H ∧ v ∈ H do
34:     if H is still connected in G_t then
35:       H' ← H \ {(u,v)}
36:       Replace(S, H, H')                  ▷ replace H with H'
37:     else
38:       S ← S \ H
39:       c1 ← c1 + 1
40:   d ← d + Compute-D(t, (u,v))
41:   c2 ← d − c1
42:   N ← N − D

Theorem 3.2.
Given a uniform sample
S ⊆ C_k of size M that satisfies Eq. (3), F̃^k_τ(ϵ, δ) provides an (ϵ, δ)-approximation to F^k_τ.

Proof. Given M that satisfies Eq. (3), using a union bound over all T_k estimation-failure scenarios, we have |p̂_i − p_i| ≤ ϵ/2 for all i ∈ [1, T_k], with probability at least 1 − δ. Then, there can be no i ∈ [1, T_k] with p_i ≥ τ for which p̂_i < τ − ϵ/2. Hence, we ensure F(C_k, τ) ⊆ F̃(S, τ − ϵ/2) with probability at least 1 − δ. Now, assume that there is a subgraph G_S ∈ C^i_k such that p_i < τ − ϵ. We then have p̂_i < τ − ϵ/2; hence, there is no subgraph G_S such that G_S ∉ F(C_k, τ − ϵ) and G_S ∈ F̃(S, τ − ϵ/2), with probability at least 1 − δ. □

The skip optimizations allow us to efficiently maintain the uniformity of the sample S by eliminating the need to test the inclusion of each newly created k-subgraph in the local neighborhood of the inserted edge. However, the skip optimizations require knowing the number W of new k-subgraphs. Unfortunately, the exact computation of W requires a costly traversal of the neighborhood of the inserted edge. Moreover, for dynamic streams, the value of the skip counter directly depends on c1 and c2, which requires computing the number D of deleted induced subgraphs after each edge deletion. Thus, we resort to efficient methods to approximate the values of W and D.

To efficiently approximate the value of W after an edge (u, v) is inserted at time t, we use sketches to estimate |N^{t−1}_{u,h} ∩ N^{t−1}_{v,j}| for all possible values of h ∈ [0, k−2] and j ∈ [0, k−2]. Similarly, to efficiently approximate D after an edge (u, v) is deleted at time t, we use sketches to estimate |N^t_{u,h} ∩ N^t_{v,j}| for all possible values of h ∈ [0, k−2] and j ∈ [0, k−2].

Any sketching technique for set-size estimation can be used. For our purpose, we choose the bottom-k sketch [11] in conjunction with recently proposed improved estimators for unions and intersections of sketches [32]. A bottom-k sketch uses a hash function h(·) to map the elements of a universe into real numbers in [0, 1], and stores the k minimum values in a set. The smaller the k-th stored value is, the larger the size of the original set should be; a simple estimate of the size is given by (k − 1)/γ, where γ is the largest stored hash value.

In our case, the universe of elements is the set of vertices V_t that belong to the graph at time t. We build a sketch for each vertex v ∈ V_t that summarizes N^t_v.
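To make the sketch-based set-size estimation concrete, the following illustrative Python snippet estimates an intersection size from two bottom-k sketches via the merge-then-inclusion–exclusion route. This is a simplified sketch of the idea, not the paper's implementation: it uses the naive inclusion–exclusion estimator rather than the improved estimators of [32], and the sketches are represented simply as sorted lists of the k smallest hash values.

```python
import random

def bottom_k_union(s1, s2, k):
    """Bottom-k sketch of a set union: the k smallest values among both
    sketches (valid because both sets are hashed with the same function)."""
    return sorted(set(s1) | set(s2))[:k]

def estimate_size(sketch, k):
    """(k-1)/gamma estimator; exact count when fewer than k values stored."""
    return (k - 1) / sketch[-1] if len(sketch) == k else float(len(sketch))

def estimate_intersection(s1, s2, n1, n2, k):
    """Inclusion-exclusion estimate |A ∩ B| ≈ |A| + |B| − |A ∪ B|,
    given the (known or estimated) set sizes n1 and n2."""
    union_est = estimate_size(bottom_k_union(s1, s2, k), k)
    return max(0.0, n1 + n2 - union_est)

# Example: two overlapping vertex sets, hashed deterministically.
def h(v):
    return random.Random(v).random()

A = range(800)            # vertices 0..799
B = range(400, 1200)      # vertices 400..1199; true intersection = 400
k = 256
sketch_A = sorted(h(v) for v in A)[:k]
sketch_B = sorted(h(v) for v in B)[:k]
approx = estimate_intersection(sketch_A, sketch_B, 800, 800, k)
```

With k = 256 the estimate lands near the true intersection size of 400, with relative error shrinking roughly as 1/√k.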
These sketches can be efficiently combined to create a sketch for the union of the neighbors of a given vertex while exploring the neighborhood via a breadth-first search (BFS).

Bottom-k sketches can easily be built incrementally: when a new edge (u, v) is added, we simply add the hash value of v to the sketch of u if it is smaller than the current maximum, and vice versa. Alas, bottom-k sketches do not directly support deletions. However, the sketches are traditionally used in a streaming setting where memory is the main concern. In our case, the universe of elements (the vertices of the graph) already resides in memory, and our goal is to improve the speed of computation of Algorithm 3 and its counterpart for deletion. Therefore, we can easily store the global hash value of each vertex to be used for sketching. We can then implement the sketch by using a pair of heaps. The max-heap A+ has bounded size and contains the hash values of the current bottom-k vertices. The min-heap A− contains the hash values of the rest of the neighborhood. Whenever an edge (u, v) is deleted, if h(v) ∈ A− we remove the value from A− and the sketch remains unchanged; if h(v) ∈ A+ we remove the value from A+, and we also transfer the minimum value from A− to A+ to maintain the fixed size of the sketch.

The reservoir sample S needs to support two main access operations efficiently: (1) random access, to replace subgraphs in the sample for reservoir sampling; and (2) access by vertex id, to identify modified subgraphs, as in Algorithm 4. To support both operations in constant time, we resort to an array for the basic random access, supplemented by hash-based indexes for the access by vertex id. The basic array is straightforward to implement, as the size M of the sample is fixed, and each element has constant size, determined by k (to store both vertices and edges).
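The pair-of-heaps bottom-k sketch described earlier can be sketched as follows in Python. This is an illustration, not the authors' implementation; the per-vertex hash function shown here (seeding a PRNG with the vertex id) is a stand-in for the stored global hash values.

```python
import heapq
import random

class BottomKSketch:
    """Bottom-k sketch over a neighborhood, with deletion support via a
    bounded max-heap A+ (the k smallest hashes) and a min-heap A- (the rest)."""

    def __init__(self, k, hash_fn=None):
        self.k = k
        # Deterministic per-vertex hash into [0, 1); a stand-in for the
        # globally stored hash value of each vertex.
        self.h = hash_fn or (lambda v: random.Random(v).random())
        self.a_plus = []   # max-heap via negated values: the k smallest hashes
        self.a_minus = []  # min-heap: hashes of the rest of the neighborhood

    def add(self, v):
        hv = self.h(v)
        if len(self.a_plus) < self.k:
            heapq.heappush(self.a_plus, -hv)
        elif hv < -self.a_plus[0]:
            # New value enters the bottom-k; evict the current maximum to A-.
            evicted = -heapq.heappushpop(self.a_plus, -hv)
            heapq.heappush(self.a_minus, evicted)
        else:
            heapq.heappush(self.a_minus, hv)

    def remove(self, v):
        hv = self.h(v)
        if self.a_plus and hv <= -self.a_plus[0]:
            self.a_plus.remove(-hv)
            heapq.heapify(self.a_plus)
            if self.a_minus:  # refill from A- to keep the sketch at size k
                heapq.heappush(self.a_plus, -heapq.heappop(self.a_minus))
        else:
            self.a_minus.remove(hv)
            heapq.heapify(self.a_minus)

    def estimate(self):
        if len(self.a_plus) < self.k:
            return float(len(self.a_plus))  # fewer than k elements: exact
        return (self.k - 1) / (-self.a_plus[0])  # (k - 1) / gamma
```

The O(n) `list.remove` plus re-heapify on deletion is kept for brevity; an indexed heap would make deletions logarithmic.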
On top of this basic array, we maintain an index I : V → {S ⊂ V} such that v ↦ S for all v ∈ S and all S ∈ S. That is, from each vertex that is part of a subgraph in the sample, we keep a pointer to the set of subgraphs containing it. Therefore, when an edge (u, v) is modified at time t (either added or deleted), retrieving the set of potentially affected subgraphs takes constant time. For each potentially affected subgraph, checking whether it is actually affected also takes constant time: for a subgraph S ∈ I(u) (respectively, S ∈ I(v)), we simply need to check whether v ∈ S (respectively, u ∈ S). If so, the subgraph needs to be updated, and so do the corresponding counters for its pattern.

Our proposed algorithms contain two components: an exploration procedure and a reservoir of samples. The addition of an edge (u, v) ∉ E_{t−1} at time t affects only the subgraphs in the local neighborhoods up to N^{t−1}_{u,h} and N^{t−1}_{v,j}, where h + j = k − 2. The base algorithms, for both the incremental and the fully-dynamic setting, iterate through the set of subgraphs in these local neighborhoods. Moreover, the subgraphs are added into the reservoir in constant time, i.e., O(1) per subgraph, which implies that the running time of the algorithms is proportional to the expensive exploration procedure, i.e., O(|N^{t−1}_{u,h} ∪ N^{t−1}_{v,j}|). The skip optimization improves the execution time by avoiding materializing and computing the expensive DFS code for many subgraphs, but it does not change the worst-case upper bound.

We conduct an extensive empirical evaluation of the proposed algorithms, and provide a comparison with existing solutions. In particular, we answer the following questions:
Q1:
What is the quality of frequent patterns for incremental streams?
Q2:
What is the quality of frequent patterns for dynamic streams?
Q3:
What is the performance in terms of average update time?
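Before turning to the experiments, the two components discussed earlier (neighborhood exploration feeding a subgraph reservoir) can be illustrated with a minimal Python sketch. This is not the paper's implementation: `new_k_subgraphs` naively grows the vertex set {u, v} to enumerate the connected k-vertex subgraphs affected by an inserted edge, and `skip_rs` simulates a skip counter in the spirit of SkipRS by sequential coin flips.

```python
import random

def new_k_subgraphs(adj, u, v, k):
    """All connected k-vertex subgraphs (as vertex sets) containing the
    newly inserted edge (u, v): grow {u, v} by repeatedly attaching a
    neighbor of the current set (h + j = k - 2 extra vertices in total).
    `adj` maps each vertex to its set of neighbors."""
    results = set()

    def grow(current):
        if len(current) == k:
            results.add(frozenset(current))
            return
        frontier = set()
        for w in current:
            frontier |= adj[w]
        for w in frontier - current:
            grow(current | {w})

    grow({u, v})
    return results

def skip_rs(n, m, rng=random):
    """How many of the next stream items to skip before one enters the
    reservoir, given n items seen so far and reservoir capacity m.
    Simulated by sequential coin flips; Vitter's Algorithm Z [34] draws
    from the same distribution in O(1) expected time."""
    z = 0
    while rng.random() >= m / (n + z + 1.0):  # item n+z+1 is NOT sampled
        z += 1
    return z
```

For example, on the path 1-2-3-4, inserting edge (2, 3) yields exactly the two connected 3-subgraphs {1, 2, 3} and {2, 3, 4}; the skip counter then tells how many of these can bypass the reservoir test entirely.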
Table 1: Datasets used in the experiments.
Dataset    Symbol    |V|     |E|     |L|
Patents    PT        3M      14M     37
YouTube    YT        4.6M    43M     108
Datasets.
Table 1 shows the graphs used as input in our experiments. All datasets are publicly available. Patents (PT) [15] contains citations among US patents from January 1963 to December 1999; the label of a patent is the year it was granted. YouTube (YT) [10] lists crawled videos and their related videos posted from February 2007 to July 2008. The label is a combination of a video's rating and length. The streams are generated by permuting the edges in a random order.
Metrics.
We use the following metrics to evaluate the quality of all the algorithms:
• Average Relative Error (RE): measures how close the estimated frequencies of the subgraph patterns are to the ground truth. For the set of patterns F^k_τ, the average RE of the estimation is defined as (1/T_k) Σ_{i=1}^{T_k} |p̂_i − p_i| / p_i.
• Precision: the fraction of truly frequent subgraph patterns among the ones returned by the algorithm.
• Recall: the fraction of frequent subgraph patterns returned by the algorithm over all frequent subgraphs (as computed by the exact algorithm).
Additionally, we evaluate the efficiency of the algorithms by reporting the average update time. We provide an extensive comparison of all the algorithms for k = 3. We report the results of experiments averaged over 5 runs.
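A minimal Python sketch of these metrics may help fix the definitions. It is illustrative only: pattern frequencies are assumed to be given as dicts, and the relative error is averaged over the truly frequent patterns.

```python
def evaluate(est, truth, tau):
    """Average relative error, precision, and recall of estimated pattern
    frequencies `est` against ground-truth frequencies `truth`, for a
    frequency threshold tau. Both arguments map pattern -> frequency."""
    frequent = {p for p, f in truth.items() if f >= tau}
    reported = {p for p, f in est.items() if f >= tau}
    # Average RE over the true frequent patterns.
    total_re = sum(abs(est.get(p, 0.0) - truth[p]) / truth[p] for p in frequent)
    avg_re = total_re / len(frequent) if frequent else 0.0
    precision = len(reported & frequent) / len(reported) if reported else 1.0
    recall = len(reported & frequent) / len(frequent) if frequent else 1.0
    return avg_re, precision, recall
```

For instance, with true frequencies {a: 0.5, b: 0.3, c: 0.1}, estimates {a: 0.4, b: 0.3, c: 0.2}, and τ = 0.25, the average RE is 0.1 with precision and recall both 1; lowering τ to 0.15 lets the overestimated pattern c into the result set and the precision drops to 2/3.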
Algorithms.
We use two baselines. Exact counting (EC) performs an exhaustive exploration of the neighborhood of the updated edge, and counts all possible subgraph patterns. Edge reservoir (ER) is a scheme inspired by Stefani et al. [31], which maintains a reservoir of edges during the dynamic edge updates. The edge reservoir is used to estimate the frequency of subgraph patterns by applying the appropriate correction factor for the sampling probability of each pattern. We compare these baselines with our proposed algorithms, subgraph reservoir (SR) and its optimized version (OSR). The size of the subgraph reservoir is set as in Section 3.4. Unless otherwise specified, we fix ϵ = 0.01 and δ = 0.1. For a fair comparison with ER, following the evaluation of Stefani et al. [31], we set the size of the edge reservoir to the maximum number of edges used in the subgraph reservoir, averaged over 5 runs. Note that EC and ER are more competitive than any offline algorithm, e.g., GraMi [12], which requires processing the whole graph upon any update. EC takes a small fraction of a second to process an edge of the PT dataset, on average, while one execution of GraMi on the same dataset takes around 30 seconds, which is several orders of magnitude larger, and it would need to be executed once per edge. Experimental environment.
We conduct our experiments on a machine with 2 Intel Xeon E5-2698 processors and 128 GiB of memory. All the algorithms are implemented in Java and executed on JRE 7 running on Linux. The source code is available online: https://github.com/anisnasir/frequent-patterns

Figure 1: Relative error, precision, and recall for incremental streams on PT and YT datasets, for different values of threshold τ.

We first evaluate our proposed algorithms on incremental streams. Starting from an empty graph, we add one edge per timestamp, for both the PT and YT datasets, and run the algorithms for several values of the frequency threshold τ.

Figure 1 shows the results. For the PT dataset, the three algorithms behave similarly in terms of RE. The subgraph versions offer slightly higher precision at the expense of a decrease in recall. However, for the highest frequency threshold, we see a marked deterioration in the performance of ER. This behavior is a result of the higher variance of ER due to non-uniform subgraph-sampling probabilities. Conversely, for YT, both versions of the subgraph-reservoir algorithm provide superior results in terms of average relative error. Considering that YT is the larger and more challenging dataset (in terms of number of labels), this result shows the power of subgraph sampling. The improved estimation performance translates into much higher precision for SR and OSR compared to ER. The recall of all the algorithms is very similar.
Overall, the results indicate that ER generates a larger number of false positives in the result set, while SR and OSR are able to avoid such errors while still maintaining a low false-negative rate.
Now, we proceed to evaluate the algorithms on fully-dynamic streams. To produce edge deletions, we execute the algorithms in a sliding-window model. This model is of practical interest, as it allows observing recent trends in the stream. We evaluate the algorithms on the YT dataset, using a sliding window of size 10M. We choose a sliding window large enough that the edges (subgraphs) do not fit in the edge (subgraph) reservoir; otherwise, both algorithms are equivalent to exact counting. We only report the results for the YT dataset, as the results for the PT dataset are similar to the incremental case.

Figure 2 contains the results for the YT dataset. ER obtains higher relative error compared to SR, and poor precision and recall. SR is clearly the best-performing algorithm in terms of accuracy; however, as we show next, it pays in terms of efficiency. OSR has consistently better accuracy than ER, although the approximations it deploys introduce some errors. This effect is more evident for larger frequency thresholds, where the precision drops noticeably.

Lastly, we evaluate the algorithms in terms of the average update time for both incremental and fully-dynamic streams on the PT and YT datasets. The size of the sliding window is 10M for the fully-dynamic streams. Figure 3 reports the results of the experiments, which show that both SR and OSR provide significant performance gains compared to EC, while they are both outperformed by ER. However, given the superior accuracy of SR and OSR compared to ER, OSR provides a good trade-off between accuracy and efficiency.
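The sliding-window protocol used in this evaluation is easy to state in code. The following illustrative Python generator (not part of the paper's implementation) turns an insertion-only edge stream into a fully-dynamic one: each new edge is inserted, and once the window is full, the oldest edge is deleted.

```python
from collections import deque

def sliding_window_stream(edges, window):
    """Yield ('+', e) for each arriving edge, and ('-', e_old) once the
    number of live edges exceeds the window size."""
    live = deque()
    for e in edges:
        yield ('+', e)
        live.append(e)
        if len(live) > window:
            yield ('-', live.popleft())
```

With a window of 10M edges, as in the experiments, every insertion past the 10M-th is paired with the deletion of the edge that arrived 10M steps earlier.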
Triangle counting.
Exact and approximate triangle counting in static graphs has attracted a great deal of attention. We refer the reader to the survey by Latapy [26] for a comprehensive treatment of the topic, and include only related work on approximate triangle counting in a streaming setting.

Figure 2: Relative error, precision, and recall for fully-dynamic streams on YT dataset, for different values of threshold τ.

Figure 3: Average update time for incremental and fully-dynamic streams on PT and YT datasets.

Tsourakakis et al. [33] proposed triangle sparsifiers to approximate the triangle counts with a single pass over the graph; hence, the technique can also be applied to incremental streams. Pavan et al. [28] and Jha et al. [20] proposed sampling a set of connected paths of length 2 for approximately counting the triangles in incremental streams. Lim and Kang [27] proposed an algorithm based on Bernoulli sampling of edges for incremental streams, in which the edges are kept in the sample with a fixed user-defined probability. Recently, Stefani et al. [31] proposed an algorithm for fully-dynamic streams via reservoir sampling [35] and random pairing [13].
General k-vertex graphlet counting. Approximate counting of 3-, 4-, and 5-vertex graphlets in static graphs has received much more attention than exact counting, which has an exponential cost. Most of the literature on approximate counting of graphlets uses random walks to collect a uniform sample of graphlets on static graphs [3, 8, 16, 37]. Alternatively, Bressan et al. [6] proposed a color-coding-based scheme for estimating k-vertex graphlet statistics. Unlike the static case, approximating graphlet statistics in a streaming setting has received much less attention, and the literature is limited to incremental streams for k > 3. Wang et al. [38] were the first to propose an algorithm that estimates graphlet statistics from a uniform sample of edges in incremental streams. A recent work by Chen and Lui [9] examines approximate counting of graphlets in incremental streams for different choices of edge-sampling and probabilistic-counting methods.
Transactional FSM.
Inokuchi et al. [19] introduced the problem of FSM in the transactional setting, where the goal is to mine all the frequent subgraphs in a given dataset of many, usually small, graphs. Following [19], a good number of algorithms for this task were proposed [18, 23, 39]. The transactional FSM setting is similar to the one considered by frequent-itemset mining [17], allowing the reuse of many existing results, thanks to the anti-monotonicity of its support metric. In addition to the exact mining approaches, a line of work has studied the approximate mining of frequent subgraphs by MCMC sampling from the space of graph patterns [2, 30], with efficient pruning strategies based on the anti-monotonicity of the support metric. For a comprehensive treatment, see the survey by Jiang et al. [21].
Single-Graph FSM.
Kuramochi and Karypis [25] proposed an algorithm for exact mining of all frequent subgraphs in a given static graph, which enumerates all the isomorphisms of the given graph and relies on the maximum-independent-set (MIS) support metric, whose computation is NP-complete. Elseidy et al. [12] proposed an apriori-like algorithm for exact mining of all frequent subgraphs from a given static graph, based on the MIS metric. Apart from the exact mining algorithms, a line of work has focused on approximate mining of frequent subgraphs in a given static graph. Kuramochi and Karypis [24] proposed a heuristic approach that largely prunes the search space, but discovers only a small subset of the frequent subgraphs, without provable guarantees. Chen et al. [7] use an approximate version of the MIS metric, allowing approximate matches during the pruning. Khan et al. [22] propose proximity patterns, which, by relaxing the connectivity constraint of subgraphs, identify frequent patterns that cannot be found by other approaches.

While the discussed works for solving the FSM problem on a static graph are promising, none of them is applicable to streaming graphs. The closest to our setting is the work by Ray et al. [29], who consider a single graph with continuous updates; however, their approach is a simple heuristic, applicable only to incremental streams and without provable guarantees. Likewise, Abdelhamid et al. [1] consider an analogous setting, and propose an exact algorithm that borrows from the literature on incremental pattern mining. The algorithm keeps track of "fringe" subgraph patterns, which are around the frequency threshold, and all their possible expansions/contractions (by adding/removing one edge). While the algorithm uses clever indexing heuristics to reduce the runtime, an exact algorithm still needs to enumerate and track an exponential number of candidate subgraphs. Finally, Borgwardt et al. [5] look at the problem of finding dynamic patterns in graphs, i.e., patterns over a graph time series, where persistence in time is a key element of the pattern. By transforming the time series of a labeled edge into a binary string, the authors are able to leverage suffix trees and string-manipulation algorithms to find common substrings in the graph. While dynamic graph patterns capture the time-series nature of the evolving graph, in our streaming scenario only the latest instance of the graph is of interest, and the graph patterns found are comparable to the ones found for static graphs.
We initiated the study of approximate frequent-subgraph mining (FSM) in both incremental and fully-dynamic streaming settings, where edges can be arbitrarily added to or removed from the graph. For each streaming setting, we proposed algorithms that can extract a high-quality approximation of the frequent k-vertex subgraph patterns, for a given threshold, at any given time instance, with high probability. Our algorithms operate by maintaining a uniform sample of k-vertex subgraphs at any time instance, for which we provide theoretical guarantees. We also proposed several optimizations to our algorithms that allow achieving high accuracy with improved execution time. We showed empirically that the proposed algorithms generate high-quality results compared to natural baselines. Acknowledgements.
Cigdem Aslay and Aristides Gionis are supported by three Academy of Finland projects (286211, 313927, and 317085), and the EC H2020 RIA project "SoBigData" (654024).
REFERENCES
[1] Ehab Abdelhamid, Mustafa Canim, Mohammad Sadoghi, Bishwaranjan Bhattacharjee, Yuan-Chi Chang, and Panos Kalnis. 2017. Incremental Frequent Subgraph Mining on Large Evolving Graphs. TKDE 29, 12 (2017), 2710–2723.
[2] Mohammad Al Hasan and Mohammed J Zaki. 2009. Output space sampling for graph patterns. PVLDB 2, 1 (2009), 730–741.
[3] Mansurul A Bhuiyan, Mahmudur Rahman, and M Al Hasan. 2012. GUISE: Uniform sampling of graphlets for large graph analysis. In ICDM. 91–100.
[4] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Ricard Gavaldà. 2011. Mining frequent closed graphs on evolving data streams. In KDD '11. 591–599.
[5] Karsten M Borgwardt, Hans-Peter Kriegel, and Peter Wackersreuther. 2006. Pattern mining in frequent dynamic subgraphs. In ICDM. 818–822.
[6] Marco Bressan, Flavio Chierichetti, Ravi Kumar, Stefano Leucci, and Alessandro Panconesi. 2017. Counting Graphlets: Space vs Time. In WSDM. 557–566.
[7] Chen Chen, Xifeng Yan, Feida Zhu, and Jiawei Han. 2007. gApprox: Mining frequent approximate patterns from a massive network. In ICDM.
[8] Xiaowei Chen, Yongkun Li, Pinghui Wang, and John Lui. 2016. A general framework for estimating graphlet statistics via random walk. PVLDB 10, 3 (2016), 253–264.
[9] Xiaowei Chen and John Lui. 2017. A unified framework to estimate global and local graphlet counts for streaming graphs. In ASONAM. 131–138.
[10] Xu Cheng, Cameron Dale, and Jiangchuan Liu. 2008. Statistics and social network of youtube videos. In IWQoS. 229–238.
[11] Edith Cohen and Haim Kaplan. 2007. Summarizing data using bottom-k sketches. In PODC. 225–234.
[12] Mohammed Elseidy, Ehab Abdelhamid, Spiros Skiadopoulos, and Panos Kalnis. 2014. GraMi: Frequent subgraph and pattern mining in a single large graph. PVLDB 7, 7 (2014), 517–528.
[13] Rainer Gemulla, Wolfgang Lehner, and Peter J Haas. 2006. A dip in the reservoir: Maintaining sample synopses of evolving datasets. In VLDB. 595–606.
[14] Rainer Gemulla, Wolfgang Lehner, and Peter J Haas. 2008. Maintaining bounded-size sample synopses of evolving datasets. VLDBJ 17, 2 (2008), 173–201.
[15] Bronwyn H Hall, Adam B Jaffe, and Manuel Trajtenberg. 2001. The NBER patent citation data file: Lessons, insights and methodological tools. Technical Report. National Bureau of Economic Research.
[16] Guyue Han and Harish Sethu. 2016. Waddling random walk: Fast and accurate sampling of motif statistics in large graphs. arXiv:1605.09776 (2016).
[17] Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data mining: concepts and techniques. Elsevier.
[18] Jun Huan, Wei Wang, and Jan Prins. 2003. Efficient mining of frequent subgraphs in the presence of isomorphism. In ICDM.
[19] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. 2000. An apriori-based algorithm for mining frequent substructures from graph data. In ECML-PKDD.
[20] Madhav Jha, C Seshadhri, and Ali Pinar. 2015. A space-efficient streaming algorithm for estimating transitivity and triangle counts using the birthday paradox. TKDD (2015).
[21] Chuntao Jiang, Frans Coenen, and Michele Zito. 2013. A survey of frequent subgraph mining algorithms. The Knowledge Eng. Review 28, 1 (2013), 75–105.
[22] Arijit Khan, Xifeng Yan, and Kun-Lung Wu. 2010. Towards proximity pattern mining in large graphs. In SIGMOD. 867–878.
[23] Michihiro Kuramochi and George Karypis. 2001. Frequent subgraph discovery. In ICDM. 313–320.
[24] Michihiro Kuramochi and George Karypis. 2004. GREW: A scalable frequent subgraph discovery algorithm. In ICDM.
[25] Michihiro Kuramochi and George Karypis. 2005. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery 11, 3 (2005), 243–271.
[26] Matthieu Latapy. 2008. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theoretical Comp. Sci.
[27] Yongsub Lim and U Kang. 2015. MASCOT: Memory-efficient and accurate sampling for counting local triangles in graph streams. In KDD. 685–694.
[28] Aduri Pavan, Kanat Tangwongsan, Srikanta Tirthapura, and Kun-Lung Wu. 2013. Counting and sampling triangles from a graph stream. PVLDB (2013).
[29] Abhik Ray, Larry Holder, and Sutanay Choudhury. 2014. Frequent Subgraph Discovery in Large Attributed Streaming Graphs. In BigMine. 166–181.
[30] Tanay Kumar Saha and Mohammad Al Hasan. 2015. FS3: A sampling based method for top-k frequent subgraph mining. Statistical Analysis and Data Mining: The ASA Data Science Journal 8, 4 (2015), 245–261.
[31] Lorenzo De Stefani, Alessandro Epasto, Matteo Riondato, and Eli Upfal. 2017. Trièst: Counting local and global triangles in fully dynamic streams with fixed memory size. TKDD 11, 4 (2017), 43.
[32] Daniel Ting. 2016. Towards optimal cardinality estimation of unions and intersections with sketches. In KDD '16. 1195–1204.
[33] Charalampos E Tsourakakis, Mihail N Kolountzakis, and Gary L Miller. 2011. Triangle Sparsifiers. J. Graph Algorithms Appl. 15, 6 (2011), 703–726.
[34] Jeffrey Scott Vitter. 1984. Faster methods for random sampling. CACM 27, 7 (1984), 703–718.
[35] Jeffrey S Vitter. 1985. Random sampling with a reservoir. TOMS 11, 1 (1985), 37–57.
[36] Bianca Wackersreuther, Peter Wackersreuther, Annahita Oswald, Christian Böhm, and Karsten M Borgwardt. 2010. Frequent subgraph discovery in dynamic networks. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs. ACM, 155–162.
[37] Pinghui Wang, John Lui, Bruno Ribeiro, Don Towsley, Junzhou Zhao, and Xiaohong Guan. 2014. Efficiently estimating motif statistics of large networks. TKDD 9, 2 (2014), 8.
[38] Pinghui Wang, John CS Lui, Don Towsley, and Junzhou Zhao. 2016. Minfer: A method of inferring motif statistics from sampled edges. In ICDE.
[39] Xifeng Yan and Jiawei Han. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM.