Efficient Mining of Frequent Subgraphs with Two-Vertex Exploration

Abstract

Frequent Subgraph Mining (FSM) is the key task in many graph mining and machine learning applications. Numerous systems have been proposed for FSM in the past decade. Although these systems show good performance for small patterns (with no more than four vertices), we found that they have difficulty in mining larger patterns. In this work, we propose a novel two-vertex exploration strategy to accelerate the mining process. Compared with the single-vertex exploration adopted by previous systems, our two-vertex exploration avoids the large memory consumption issue and significantly reduces the memory access overhead. We further enhance the performance through an index-based quick pattern technique that reduces the overhead of isomorphism checks, and a subgraph sampling technique that mitigates the issue of subgraph explosion. The experimental results show that our system achieves significant speedups against the state-of-the-art graph pattern mining systems and supports larger pattern mining tasks that none of the existing systems can handle.


Efficient Mining of Frequent Subgraphs with Two-Vertex Exploration

Peng Jiang

The University of Iowa

Rujia Wang

Illinois Institute of Technology

Bo Wu

Colorado School of Mines


Frequent Subgraph Mining (FSM) is an important operation on graphs and is widely used in various application domains, including bioinformatics [22, 29], computer vision [11], and social network analysis [28]. The task is to discover frequently occurring subgraph patterns from an input graph. Unlike graph pattern matching problems, where a query pattern is given, FSM needs to find the important patterns based on a support measure and thus has a much larger exploration space.

Since the patterns of interest are unknown, most systems for FSM take an explore-aggregate-filter approach [10, 12, 27, 30]. The principle is to explore all the subgraphs, aggregate the subgraphs according to their patterns, and filter out the subgraphs that are redundant or are not of interest. The exploration happens in a vertex-by-vertex manner where smaller subgraphs are iteratively extended based on the connections in the graph. There are mainly two ways of exploration: breadth-first and depth-first. Starting from all vertices in the graph, breadth-first exploration stores all subgraphs of size $l$ and extends them with one more vertex to find subgraphs of size $l + 1$. The main problem of breadth-first exploration is that the intermediate data can easily exceed the memory capacity as the subgraph size grows. With depth-first exploration, a subgraph of size $l$ is immediately extended to a subgraph of size $l + 1$. It need not save the intermediate subgraphs and thus can explore larger patterns. However, depth-first exploration cannot exploit the anti-monotone property to prune the search space [12], resulting in a lot of unnecessary computation.

Some recent graph mining systems take a pattern-based approach [15, 17]. The idea is to enumerate the (unlabeled) subgraph patterns and then perform pattern matching on the graph. Because the pre-generated patterns guide the exploration, these systems need not store any intermediate data, and the aggregation overhead can be reduced as the topology of the subgraphs is given. However, this approach only works well for small patterns because when the pattern is larger (more than 6 vertices), listing all subgraph patterns itself becomes a hard problem [2, 19, 20]. It is also difficult for the pattern-based systems to exploit the anti-monotone property to prune the search space. Peregrine [15] maintains a list of frequent patterns, extends the patterns with one vertex or edge, and then re-matches the extended patterns on the graph. It prunes the search space without storing the intermediate subgraphs, but the re-matching incurs a lot of redundant computation. These issues have impeded the existing graph mining systems from supporting FSM for large patterns. In fact, most of the prior work only reports experimental results for FSM with no more than 4 vertices.

To enable large pattern mining, we propose a novel two-vertex exploration method in this work. Our key observation is that vertex-by-vertex exploration is not necessary for pattern mining. Instead, we can perform two-vertex exploration that joins size-$(n-2)$ subgraphs with size-3 subgraphs on a common vertex to obtain subgraphs of size $n$. The new exploration method significantly accelerates the exploration process and reduces the memory access overhead in the join operation. It also allows us to exploit the anti-monotone property to prune the exploration space without storing the intermediate subgraphs or re-matching the patterns.

To further accelerate the mining process, we propose two new techniques to overcome the performance bottlenecks. One performance bottleneck is due to the expensive isomorphism checks in the aggregation step. To aggregate the subgraphs based on their patterns, we need to generate a canonical form for each subgraph such that the subgraphs with the same canonical form are isomorphic. Unfortunately, the best known algorithms for generating such canonical forms have exponential complexity [7, 25, 31]. Therefore, we want to perform isomorphism checks for as few subgraphs as possible. Previous work has employed a quick pattern technique to reduce the number of isomorphism checks [27, 30]. The main idea is to first group the subgraphs based on an easily computed pattern (e.g., a list of all edges). Since subgraphs in the same group must be isomorphic, only one isomorphism check is needed for each group. We improve on this idea by proposing an index-based quick pattern technique. It assigns an index to each pattern and uses the indices to compute a quick pattern for the joined subgraph. Compared with the quick pattern technique used in prior work, our quick pattern encodes the information of sub-patterns and achieves more accurate grouping of the subgraphs, leading to a significant reduction of isomorphism checks.

Another more fundamental challenge of mining large patterns on graphs is due to the exponential growth of the exploration space. For example, in a medium-sized graph such as MiCo [13], which has about $9 \times 10^4$ vertices and $10^6$ edges, the number of size-5 subgraphs is already very large.
When the pattern size increases to 7, the estimated number of subgraphs grows so large that exhaustive enumeration becomes infeasible. To mitigate this issue, we propose a subgraph sampling technique. The idea is that we sample a small subset of size-3 subgraphs for exploring larger subgraphs during the joining and/or the matching phase. Since the subgraphs of frequent patterns are more likely to be sampled, we are able to discover frequent patterns with only a small number of sampled subgraphs. Compared with previous works that apply edge or neighbor sampling to FSM [14, 18], we can discover more frequent patterns with the same or less computation. This is because subgraph samples preserve more structural information of the graph than edge samples.

We perform an extensive evaluation of our system and compare it with three state-of-the-art graph mining systems: AutoMine [17], Peregrine [15], and Pangolin [10]. The results show that, without using sampling, our system achieves 1.8x to 8.4x speedups on tasks that the compared systems can complete. By using sampling, our system can discover larger patterns that none of the existing systems can handle.

This section introduces the graph-related concepts that are important to our discussion and formally defines the frequent subgraph mining problem.

A graph $G$ is defined as $G = (V, E, L)$, consisting of a set of vertices $V$, a set of edges $E$, and a labeling function $L$ that assigns labels to the vertices and edges. A graph $G' = (V', E', L')$ is a subgraph of graph $G = (V, E, L)$ if $V' \subseteq V$, $E' \subseteq E$ and $L'(v) = L(v), \forall v \in V'$. A subgraph $G' = (V', E', L')$ is vertex-induced if all the edges in $E$ that connect the vertices in $V'$ are included in $E'$. A subgraph is edge-induced if it is connected and is not vertex-induced.

Definition. Two graphs $G_a = (V_a, E_a, L_a)$ and $G_b = (V_b, E_b, L_b)$ are isomorphic if there is a bijective function $f : V_a \to V_b$ such that $(v_i, v_j) \in E_a$ if and only if $(f(v_i), f(v_j)) \in E_b$.

We say two (sub)graphs have the same pattern if they are isomorphic. The pattern is a template for the isomorphic subgraphs, and a subgraph is an instance (also called an embedding) of its pattern. To determine the pattern of a subgraph, a canonical form for each subgraph can be computed, and the subgraphs with the same canonical form are isomorphic. There are various tools and algorithms available for graph isomorphism checks [16, 20, 31]. All of these algorithms have exponential complexity. We use bliss [16] for isomorphism checks in our system as it is fast in practice and is widely used in graph mining systems [15, 27, 30]. A related concept is the automorphism check, which checks if two subgraphs are identical even though they might have different orderings of vertices and edges.

The task of Frequent Subgraph Mining (FSM) is to obtain all frequent subgraph patterns from a labeled input graph. A pattern is considered frequent if it has a support above a threshold. While the definition of the support measure can vary across applications, the support usually needs to satisfy an anti-monotone property, i.e., the support of a pattern should be no greater than the support of its sub-patterns [21].
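To make the role of the canonical form concrete, the following is a minimal brute-force sketch, not the algorithm used by bliss or by the system described here. It assumes tiny graphs given as a vertex count plus an edge list, optionally with vertex labels that an isomorphism must preserve; the names `canonical_form` and `is_isomorphic` are hypothetical.

```python
from itertools import permutations

def canonical_form(num_vertices, edges, labels=None):
    """Smallest (labels, edge list) encoding over all vertex relabelings.

    Brute force over num_vertices! permutations, so only usable for tiny graphs.
    """
    if labels is None:
        labels = [0] * num_vertices  # treat the graph as unlabeled
    best = None
    for perm in permutations(range(num_vertices)):
        # perm[v] is the new index assigned to original vertex v
        new_labels = [None] * num_vertices
        for v, p in enumerate(perm):
            new_labels[p] = labels[v]
        new_edges = sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges)
        code = (tuple(new_labels), tuple(new_edges))
        if best is None or code < best:
            best = code
    return best

def is_isomorphic(g1, g2):
    """Two graphs are isomorphic iff their canonical forms are equal."""
    return canonical_form(*g1) == canonical_form(*g2)

wedge_a = (3, [(0, 1), (1, 2)])
wedge_b = (3, [(0, 2), (2, 1)])
triangle = (3, [(0, 1), (1, 2), (0, 2)])

print(is_isomorphic(wedge_a, wedge_b))   # both are wedges
print(is_isomorphic(wedge_a, triangle))  # different edge counts
```

The factorial cost of this sketch is the reason a system would want to call the canonical-form routine for as few subgraphs as possible.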

Definition. Given a pattern $P = (V_p, E_p, L_p)$ and an input graph $G = (V, E, L)$, if $P$ has $m$ embeddings $\{f_1, f_2, \ldots, f_m\}$ in $G$, the minimum image based (MNI) support of $P$ in $G$ is defined as

$$\sigma_{MNI}(P, G) = \min_{v \in V_p} |\{ f_i(v) : i = 1, 2, \ldots, m \}|.$$

Other support measures include maximum independent set based (MIS) support, minimum instance based (MI) support, and maximum vertex cover based (MVC) support. All these support measures are anti-monotone. MNI support is the most commonly used one because it has linear computation complexity while achieving good accuracy in measuring the ‘frequency’ of patterns in a graph. Readers are referred to [21] for detailed descriptions and the computation complexity of different support measures. We adopt the MNI support for our experiments, although our proposed techniques are applicable to any support measure with the anti-monotone property.

With a support measure $\sigma$, the frequent subgraph mining problem is defined as finding all patterns $\{P_i = (V_i, E_i, L_i)\}$ in a graph $G$ such that $|V_i| = s$ and $\sigma(P_i, G) \ge t$, where $s$ is the given pattern size and $t$ is the given support threshold. The support can be calculated with either vertex-induced subgraphs or edge-induced subgraphs. Our proposed techniques work for both cases. We use edge-induced subgraphs for experiments, as it is the common setting in prior work [12, 15, 30].

Figure 1. An example of finding size-5 subgraphs by joining size-3 subgraphs. (a) An example graph. (b) Size-3 subgraphs. (c) Join the size-3 subgraphs on every column.

Before getting into technical details, we describe our idea of two-vertex exploration with an example. We use an unlabeled graph for simple illustration.

Suppose our task is to discover size-5 patterns in an input graph (as shown in Figure 1a). We can first find all size-3 subgraphs and join them on a common vertex to obtain size-5 subgraphs. In this example, we first apply a matching algorithm to obtain all the embeddings of size-3 patterns (i.e., wedge and triangle) as listed in Figure 1b. Each pattern is assigned an index (0 for wedge and 1 for triangle in this example), and the index is stored with each embedding during the pattern matching.

Next, we calculate the MNI support for each size-3 pattern and prune the patterns with support less than the threshold. In this example, the supports for both wedge and triangle are 3. Suppose we set the support threshold to 3. Neither of the patterns will be pruned. After we obtain the pruned size-3 subgraphs, we perform a binary join on every pair of the columns (i.e., ⟨1,1⟩, ⟨1,2⟩, ⟨1,3⟩, ⟨2,1⟩, ⟨2,2⟩, ⟨2,3⟩, ⟨3,1⟩, ⟨3,2⟩, ⟨3,3⟩) to explore size-5 subgraphs. Figure 1c shows how we can obtain four size-5 subgraphs by joining column ⟨1,1⟩ on key 3. Every pair of subgraphs with key 3 is tested (i.e., ⟨‘342’, ‘342’⟩, ⟨‘342’, ‘352’⟩, ..., ⟨‘387’, ‘385’⟩, ⟨‘387’, ‘387’⟩). If two subgraphs have one and only one common vertex, they compose a valid size-5 subgraph. In this example, ‘342’ and ‘375’ make up a valid size-5 subgraph ‘34275’; ‘342’ and ‘387’ make up ‘34287’; ‘352’ and ‘387’ make up ‘35287’; and ‘342’ and ‘385’ make up ‘34285’.
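The join on key 3 can be sketched in a few lines. This is an illustration under assumptions, not the system's implementation: the list `key3` contains only the five subgraphs named in the valid joins above (the full column in Figure 1c may contain more), each unordered pair is tested once, and no automorphism check is applied. The function name `join_on_key` is hypothetical.

```python
from itertools import combinations

def join_on_key(subgraphs, key):
    """Join every unordered pair of subgraphs that share `key` and no other vertex."""
    results = []
    for s, t in combinations(subgraphs, 2):
        if set(s) & set(t) == {key}:
            # keep s, then append the vertices of t other than the join vertex
            results.append(s + tuple(v for v in t if v != key))
    return results

# Assumption: only the size-3 subgraphs named in the text, all with 3 in the first column.
key3 = [(3, 4, 2), (3, 5, 2), (3, 7, 5), (3, 8, 5), (3, 8, 7)]

for sub in join_on_key(key3, 3):
    print("".join(map(str, sub)))
```

On this input the sketch yields the same four size-5 subgraphs listed in the text; pairs such as ‘352’ and ‘375’ are rejected because they share vertex 5 in addition to the key.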
The four valid joins are marked with connected arrows in Figure 1c. We can see that, through the join operation, we grow the pattern size from 3 to 5 in one exploration step. We will show that such two-vertex exploration is exhaustive for subgraph exploration in §4.1.

One may notice that the result of joining ‘374’ with ‘385’ (‘37485’) is not included in Figure 1c. This is because our system performs an automorphism check when generating the join results to remove redundancy. We propose a smallest-vertex-first dissection method that ensures only the results that are obtained by joining the subgraph of the smallest spanning vertex indices are saved. In this case, the ‘37485’ subgraph will be generated when we join the third column of ‘543’ and the first column of ‘387’. More details on the automorphism check and redundancy removal are explained in §4.3.

The above procedure can be extended to explore larger subgraphs by joining multiple subgraph lists. For example, a 3-way join of two size-3 subgraph lists and one size-2 subgraph list (i.e., edges) will explore all size-6 subgraphs. A 3-way join of size-3 subgraph lists will explore all size-7 subgraphs. Given an input graph $G$, a pattern size $s$, and a support threshold $t$, the workflow of our frequent subgraph mining algorithm is summarized as follows:

Step 1: Obtain all size-3 subgraphs by matching.
Step 2: Calculate the support for each size-3 and size-2 pattern, and remove patterns with support smaller than $t$ along with their subgraphs.
Step 3: Perform a multi-way join of size-3 subgraphs and/or edges to obtain subgraphs of size $s$: if $s = 2n + 1$, join $n$ size-3 subgraph lists; if $s = 2n$, join the edge list with $n - 1$ size-3 subgraph lists.
Step 4: Calculate the support of the size-$s$ patterns and remove patterns with support smaller than $t$.

For Step 1, any matching algorithm will work; we use AutoMine [17] in our implementation. Step 2 and Step 4 are straightforward based on the definition of the support measure. Step 3 is the most important step in the algorithm. We will detail this step in the next section.

All the current graph mining systems based on the explore-aggregate-filter approach use single-vertex exploration because it ensures that all the size-$n$ subgraphs can be found by extending the size-$(n-1)$ subgraphs with an edge. We find that limiting the step size to 1 is not necessary for finding all patterns. This section describes our two-vertex exploration idea and explains its advantage over single-vertex exploration.

We propose to explore the size-$n$ subgraphs by joining the size-$(n-2)$ subgraphs with the size-3 subgraphs (i.e., wedges and triangles). The completeness of this two-vertex exploration method is summarized as follows.

Theorem 1. All of the size-$n$ subgraphs can be discovered by joining the size-$(n-2)$ subgraphs with the size-3 subgraphs on a common vertex.

Proof. Our goal is to show that any size-$n$ subgraph can be dissected into a connected size-$(n-2)$ subgraph and a connected size-3 subgraph on one vertex. Because we join all size-$(n-2)$ and size-3 subgraphs in all possible ways, if a dissection exists for a size-$n$ subgraph, it will be discovered by the join operation. Suppose any size-$n$ subgraph can be dissected into a size-$(n-2)$ and a size-3 subgraph. There are only two ways a size-$(n+1)$ subgraph can be constructed from a size-$n$ subgraph: 1) the new vertex is connected with the size-$(n-2)$ subgraph, and in this case, the size-$(n+1)$ subgraph can be dissected into the size-$(n-1)$ subgraph formed by the size-$(n-2)$ subgraph and the new vertex, together with the same size-3 subgraph as in the dissection of the size-$n$ subgraph; 2) if the new vertex is only connected with the size-3 subgraph, it is easy to verify that for either of the two cases (wedge or triangle), we can always pick three connected vertices as the new dissection. As the base case, all the six size-4 patterns can be dissected into a size-3 subgraph and an edge. The proof finishes by induction. □

Note that multi-vertex exploration is not complete with more than two vertices. For example, a seven-vertex three-pronged star graph with two vertices in each prong cannot be obtained by joining any two size-4 subgraphs. Therefore, we cannot explore more than two vertices in each step.

Two-vertex exploration can be either vertex-induced or edge-induced. For vertex-induced exploration, we add all the connecting edges between the two joining subgraphs to the resulting subgraph. For edge-induced exploration, we enumerate all possible combinations of the connecting edges between the joining subgraphs and generate a resulting subgraph for each combination.
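The difference between the vertex-induced and edge-induced variants of one join can be sketched as follows. This is an illustration only: the helper names (`norm`, `cross_edges`, `combine_edges`) and the six-edge toy graph are hypothetical, and the sketch assumes that "connecting edges" means graph edges between the two subgraphs that do not touch the join vertex.

```python
from itertools import combinations

def norm(u, v):
    """Store an undirected edge as an ordered pair."""
    return (u, v) if u < v else (v, u)

def cross_edges(s_vertices, t_vertices, key, graph_edges):
    """Graph edges between the two subgraphs that do not touch the join vertex."""
    s_only = set(s_vertices) - {key}
    t_only = set(t_vertices) - {key}
    return sorted(
        norm(u, v) for u in s_only for v in t_only if norm(u, v) in graph_edges
    )

def combine_edges(s_edges, t_edges, cross, vertex_induced):
    """Yield the edge set of every subgraph produced by one join."""
    base = {norm(*e) for e in s_edges} | {norm(*e) for e in t_edges}
    if vertex_induced:
        # vertex-induced: all connecting edges are added, giving one result
        yield frozenset(base | set(cross))
    else:
        # edge-induced: one result per subset of the connecting edges
        for r in range(len(cross) + 1):
            for chosen in combinations(cross, r):
                yield frozenset(base | set(chosen))

# Hypothetical toy graph: path 0-1-2-3-4 plus edges 0-3 and 1-4.
graph_edges = {(0, 1), (1, 2), (2, 3), (3, 4), (0, 3), (1, 4)}
cross = cross_edges((0, 1, 2), (2, 3, 4), 2, graph_edges)
induced = list(combine_edges([(0, 1), (1, 2)], [(2, 3), (3, 4)], cross, True))
edge_based = list(combine_edges([(0, 1), (1, 2)], [(2, 3), (3, 4)], cross, False))
print(len(induced), len(edge_based))
```

With two connecting edges, the vertex-induced join produces a single subgraph while the edge-induced join produces one subgraph per subset of those edges, which is why the edge-induced exploration space is larger.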

To avoid the large memory consumption, we implement the exploration process as a depth-first multi-way join. Suppose we want to join $t$ subgraph lists $SL_1, SL_2, \ldots, SL_t$ and the subgraphs in $SL_i$ have $l_i$ vertices. For each subgraph list $SL_i$, we group the subgraphs by each of its $l_i$ columns, create $l_i$ hash tables and store the hash tables in $H_i$. For example, the size-3 subgraphs in Figure 1c are grouped by the vertex indices in the first column. Once the hash tables are created, the multi-way join operation is simply a nested loop that iterates over all possible combinations of subgraphs in different hash tables, as shown in Figure 2.

    // iterating over all hash tables of each subgraph list
    for h_1 ∈ H_1
      for h_2 ∈ H_2
        for h_3 ∈ H_3
          // iterating over all keys in the first joining hash table
          for k_1 ∈ h_1
            // iterating over all subgraphs with the key
            for s_1 ∈ h_1[k_1]
              // iterating over all subgraphs with the same key in the
              // second joining hash table
              for t_2 ∈ h_2[k_1]
                // combine the two subgraphs on vertex k_1
                s_2 = combine(s_1, t_2, k_1)
                // if s_2 is a valid subgraph, iterating over all its vertices
                for k_2 ∈ s_2
                  for t_3 ∈ h_3[k_2]
                    s_3 = combine(s_2, t_3, k_2)
                    ...

Figure 2. Code of multi-way join.

We first enumerate all possible combinations of columns in different subgraph lists by iterating over all hash tables of each subgraph list. Then, we identify the matching keys $k_1$ in the first two hash tables and try to combine the subgraphs ($s_1$ and $t_2$) on the key. If the two subgraphs make up a valid larger subgraph $s_2$, we iterate over all the vertices of $s_2$ and look up each vertex $k_2$ in the third hash table. For every subgraph $t_3$ with key $k_2$ in the third hash table, we combine $s_2$ with $t_3$ to obtain a larger subgraph. For joining more subgraph lists, the code simply repeats the for loop.

Depth-first join is also used in Fractal [12] for single-vertex exploration. The main issue is that it incurs a huge amount of redundant memory accesses. Our two-vertex exploration mitigates this issue as it requires fewer join steps to enumerate subgraphs of a certain size. To see this, let us consider the exploration of size-5 subgraphs. Single-vertex exploration requires a 4-way join of the edges in the graph. The first join operation of the edge list does not incur any redundant memory accesses as each neighbor list is accessed only once in the two hash tables. However, when we join the intermediate size-3 subgraphs with the edge list, we need to query the edge list for each intermediate subgraph. For non-consecutive size-3 subgraphs of the same key, each neighbor list will be accessed multiple times during the join process. The same problem exists when joining size-4 subgraphs with the edge list. In contrast, two-vertex exploration obtains size-5 subgraphs by performing a binary join of size-3 subgraphs, which incurs no redundant memory accesses. Our experimental results also validate this point.

Combining small subgraphs in different ways can lead to identical results.
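The nested loop of Figure 2, and the redundancy it produces, can be sketched with a recursive generator. This is a simplified illustration under assumptions: it iterates over each starting subgraph and its vertices rather than over hash-table keys (the same pairs are enumerated, but the memory access order differs from Figure 2), and `simple_combine` is a hypothetical stand-in for the $combine$ function that only enforces the one-common-vertex rule.

```python
from collections import defaultdict

def build_hash_tables(subgraph_list):
    """One hash table per column: vertex id -> subgraphs with that vertex in the column."""
    width = len(subgraph_list[0])
    tables = [defaultdict(list) for _ in range(width)]
    for sub in subgraph_list:
        for col, v in enumerate(sub):
            tables[col][v].append(sub)
    return tables

def simple_combine(s, t, key):
    """Stand-in for combine: valid only if `key` is the sole shared vertex."""
    if set(s) & set(t) != {key}:
        return None
    return s + tuple(v for v in t if v != key)

def extend(partial, remaining):
    """Depth-first extension of `partial` with one subgraph from each remaining list."""
    if not remaining:
        yield partial
        return
    for table in remaining[0]:        # every column of the next subgraph list
        for key in partial:           # every vertex of the current subgraph
            for t in table.get(key, ()):
                combined = simple_combine(partial, t, key)
                if combined is not None:
                    yield from extend(combined, remaining[1:])

def multiway_join(subgraph_lists):
    """Join the lists left to right without materializing intermediate results."""
    all_tables = [build_hash_tables(sl) for sl in subgraph_lists]
    for first in subgraph_lists[0]:
        yield from extend(first, all_tables[1:])

# Hypothetical input: the three connected size-3 subgraphs of the path 0-1-2-3-4.
wedges = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
results = list(multiway_join([wedges, wedges]))
print(results)
print(len({frozenset(r) for r in results}))
```

Because intermediate subgraphs live only on the generator stack, memory use stays small. The output also shows why a redundancy check is needed: the same five vertices are produced twice, once from each side of the join.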
As we briefly mentioned in Figure 1c, joining subgraphs ‘342’ and ‘375’ generates the same subgraph as joining ‘352’ and ‘274’. These redundant subgraphs incur redundant computation, and the redundancy can accumulate over the exploration steps. To eliminate the redundant subgraphs, we perform an automorphism check when a subgraph is generated. The previous automorphism check technique for single-vertex exploration is based on the concept of the canonicality of the subgraphs [27]. This canonicality check does not work for multi-vertex exploration because the small subgraphs are generated by a matching algorithm and may not have the canonicality property. We propose a smallest-vertex-first dissection method that enables the redundancy removal for multi-vertex exploration.

Algorithm 1: Combine two subgraphs and check for automorphism.

    Input: subgraph s; subgraph t; joining key k
    Output: combined subgraph s′
    func dissect(s′, n):
      foreach v in s′ in ascending order do
        l = the first n vertices visited by starting from v and spanning to the smallest vertex at each step;
        r′ = the unvisited vertices in s′;
        foreach v′ in l in ascending order do
          r = r′ ∪ v′;
          if r is connected then return l, r;
    if s and t have identical vertices other than k then return ∅;
    // s′ is a valid subgraph joined by s and t
    s′ = s ∪ t;
    // find the smallest dissection of s′
    l, r = dissect(s′, t.size);
    // if the two joining subgraphs correspond to the smallest dissection, return s′
    if l == t and r == s then return s′;
    else return ∅;

Our method is based on the following observation: for any subgraph, there is only one way to divide it into two smaller subgraphs with both subgraphs being connected and one of them having the smallest spanning vertex indices.
Thus, we can eliminate redundancy by producing a subgraph $s'$ only if the two joining subgraphs correspond to this unique dissection of $s'$.

The automorphism check is performed each time we combine two subgraphs (i.e., in the $combine$ function in Figure 2). Algorithm 1 shows the procedure of the $combine$ function. For a pair of input subgraphs $s$ and $t$ ($t$ is usually a size-3 subgraph), we first check if there are any other identical vertices except for the joining vertex $k$. If yes, $s$ and $t$ cannot form a valid subgraph, and the function returns an empty set. If no, we give the combined subgraph to a dissection procedure that divides the subgraph into two small subgraphs $l$ and $r$. From the vertex with the smallest index, the dissection procedure finds the smallest $n$ vertices and stores them in $l$, where $n$ is the size of $t$. Next, the algorithm checks if the remaining vertices can constitute a connected subgraph $r$ with any of the vertices in $l$. If yes, the dissection procedure stops and returns $l$ and $r$. The algorithm returns as soon as the first dissection is found, and it will always return because of Theorem 1. Once we have the smallest dissection $l$ and $r$, we check if they are the same as $t$ and $s$. If yes, the $combine$ function returns the combined subgraph; otherwise, it returns an empty set.

Example: The smallest-vertex-first dissection of the subgraph ‘34257’ in Figure 1a can be obtained by spanning from vertex 2. The two adjacent vertices of 2 are 3 and 8. Because 3 is smaller, we take 3 in the first step, and the visited set contains vertices 2 and 3. The vertices that are adjacent to the two visited vertices are 4, 5, 7, 8. Because 4 is the smallest, we take 4 in the next step, and we have three vertices 2, 3, 4 in $l$. The unvisited vertices are 5 and 7. We check if any of 2, 3, 4 connects 5 and 7, and we find that 3 is the smallest vertex that connects 5 and 7. The algorithm stops and returns $l = \{2, 3, 4\}$ and $r = \{3, 5, 7\}$. When joining the two subgraph lists in Figure 1c, our system generates ‘34275’ (by combining ‘342’ and ‘375’) instead of ‘35274’ (by combining ‘352’ and ‘274’). For the same reason, ‘37485’ is not generated by combining ‘374’ and ‘385’, as the smallest dissection of ‘37485’ is ‘345’ and ‘387’.

The worst-case complexity of the algorithm is polynomial in $|s'|$. Although it is higher than the linear complexity of the automorphism check for single-vertex exploration [27, 30], the actual number of instructions does not increase much because $s'$ is small and the algorithm usually returns early at line 7.

Next, we need to aggregate the subgraphs according to their patterns. This is done by computing the canonical form of each subgraph. The subgraphs with the same canonical form are isomorphic and will be put in the same group. As pointed out in §2, computing the canonical form is expensive, especially for large patterns. Previous work has used a quick pattern technique to reduce the canonical form computation. However, their quick patterns encode little topological information of the subgraphs, resulting in a lot of quick pattern groups of isomorphic subgraphs.

We propose an index-based quick pattern technique that can achieve more accurate grouping of subgraphs and reduce the overhead of canonical form computation. The idea is to assign an index to each pattern in a subgraph list and use the indices for computing the quick pattern of the combined subgraph. If a subgraph list is generated by the matching algorithm, we simply index the input patterns and store the indices with each subgraph. As shown in Figure 1b, the size-3 subgraphs are obtained by matching the two size-3 patterns. We store the $pattern\_idx$ with each of its embeddings. When two subgraphs are combined, we construct a 4-tuple as the quick pattern for the combined subgraph.
The first two elements in the 4-tuple are the pattern indices of the two joining subgraphs. The third element represents the position of the joining vertex in the two subgraphs. Suppose the two joining subgraphs $s_1$ and $s_2$ are of size $n_1$ and $n_2$. If the joining vertex is the $i$th vertex in $s_1$ and the $j$th vertex in $s_2$, then the value of the third element is $(i \times n_2 + j)$. The last element is a bitarray representing connections between the two subgraphs. If the $i$th vertex in $s_1$ is connected with the $j$th vertex in $s_2$, then the $(i \times n_2 + j)$th bit in the bitarray is set.

Example: In Figure 1c, the resulting subgraph ‘34275’ is obtained by joining $s_1$ = ‘342’ and $s_2$ = ‘375’, and its quick pattern is $\langle p_1, p_2, 0, 32 \rangle$, where the first two elements $p_1$ and $p_2$ are the pattern indices of ‘342’ and ‘375’. The third element is 0 because the joining vertex is at position 0 in both subgraphs. The last element is 32 because $s_1[1] = 4$ and $s_2[2] = 5$ are connected, so the $(1 \times 3 + 2)$th bit is set in the bitarray. Similarly, ‘34287’ and ‘35287’ have the same quick pattern, and the quick pattern of ‘34285’ is computed in the same way.

By encoding the sub-pattern information, our quick pattern achieves more accurate grouping of the subgraphs and thus reduces the canonical form computation. The computation is further reduced by multi-vertex exploration as larger subgraphs contain more accurate sub-pattern information. To see this point, let us consider the number of possible size-4 unlabeled patterns. We know that any size-4 subgraph can be obtained by joining a size-3 subgraph and an edge. The total number of possible 4-tuples with our index-based quick pattern is 48 ($= 2 \times 6 \times 4$), where 2 represents the two types of size-3 subgraphs (i.e., triangle and wedge), 6 is the number of possible joining positions, and 4 is the number of possible values of the last element in the 4-tuple. In comparison, if we use the edge list as the quick pattern as in previous work [27, 30], the fully-connected size-4 graph alone has 624 possible edge lists: the $6! = 720$ permutations of its six edges minus the 96 ($= 4 \times 4!$) permutations that do not have adjacent edges. This indicates that our index-based technique has far fewer possible patterns compared with the technique used in previous work, leading to fewer groups for isomorphism checks.

Figure 3. An example of exploration space pruning. (Panels: frequent patterns; joining two subgraphs at vertex 2; the combined subgraph ‘01234’ is infrequent because ‘023’ is infrequent.)

The quick pattern is computed after every $combine$ function in Figure 2. If the $combine$ function returns a valid subgraph, we compute its quick pattern and look for the quick pattern in a global dictionary. The dictionary keeps a mapping from quick patterns to their indices. If the quick pattern exists, we store its index with the subgraph. If a quick pattern is not found in the dictionary, we increase the global index number and insert a new pair of the quick pattern and its index. In our implementation, we parallelize the for-loop that iterates over all keys in the first joining hash table. To avoid synchronization among threads, we store a quick pattern dictionary for each thread.
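The 4-tuple and its dictionary can be sketched as below. This is a minimal illustration under assumptions rather than the system's data layout: positions are 0-based, only connections that do not involve the join vertex set bits (which is consistent with the value 32 in the example), both joined subgraphs are assumed to carry pattern index 0, and the single-edge `edges` set is a hypothetical fragment of the graph. The names `quick_pattern` and `QuickPatternIndex` are hypothetical.

```python
def quick_pattern(s1, idx1, s2, idx2, key, graph_edges):
    """Build the 4-tuple (idx1, idx2, join position, connection bits)."""
    n2 = len(s2)
    i, j = s1.index(key), s2.index(key)
    bits = 0
    for a, u in enumerate(s1):
        for b, v in enumerate(s2):
            if key in (u, v):
                continue  # assumption: only edges away from the join vertex count
            if (u, v) in graph_edges or (v, u) in graph_edges:
                bits |= 1 << (a * n2 + b)
    return (idx1, idx2, i * n2 + j, bits)

class QuickPatternIndex:
    """Dictionary from quick patterns to small integer indices."""

    def __init__(self):
        self.index_of = {}

    def lookup(self, qp):
        # Assign the next free index the first time a quick pattern is seen.
        return self.index_of.setdefault(qp, len(self.index_of))

edges = {(4, 5)}  # hypothetical: the only cross connection between '342' and '375'
qp = quick_pattern((3, 4, 2), 0, (3, 7, 5), 0, 3, edges)
index = QuickPatternIndex()
print(qp, index.lookup(qp))
```

Keeping one `QuickPatternIndex` per worker thread, as the text describes, would avoid locking on the dictionary at the cost of merging the per-thread indices afterwards.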

An optimization that most graph mining systems adopt for frequent subgraph mining is to filter out the subgraphs of infrequent patterns so as to reduce the subgraph exploration space [10, 15, 27, 30]. All of the existing systems achieve this optimization with breadth-first exploration. They either store all intermediate subgraphs (e.g., RStream [30], Pangolin [10]) or maintain a list of frequent patterns and re-match these patterns (e.g., Peregrine [15]) in each exploration step. The problem with the first approach is that it takes a lot of memory and needs to aggregate the subgraphs in each step. The problem with the second approach is that it needs to perform redundant matching in each step, and it only works for support measures that can be computed without storing all the embeddings (e.g., MNI). If the user wants to use more accurate support measures (e.g., MIS, MVC [21]), the second approach will not work. An advantage of two-vertex exploration is that it enables exploration space pruning without storing intermediate results or re-matching.

Our main idea is that, instead of checking the support of the combined pattern, we check whether the vertices around the joining point form any subgraphs of smaller infrequent patterns. If an infrequent subgraph is found, then the combined subgraph must be infrequent and should be discarded. Figure 3 shows an example of this method. When the system tries to join two subgraphs ‘012’ and ‘234’ at vertex 2, it finds that there is an edge connecting vertices 0 and 3 and an edge connecting vertices 1 and 4. This forms two triangles ‘023’ and ‘124’. While triangle ‘124’ is frequent, triangle ‘023’ is not, according to the list of frequent size-3 patterns. Due to the anti-monotone property of the support measure, a frequent pattern cannot contain infrequent subpatterns. Thus, the combined subgraph ‘01234’ must be infrequent and should not be used for further exploration.

The above pruning procedure is done in the $combine$ function (line 9 in Algorithm 1) when we check the connectivity among vertices of the two joining subgraphs. For any size-3 subgraph ‘abc’ with ‘a’ being the joining vertex and ‘b’, ‘c’ from different subgraphs, if the pattern of the subgraph is not in the list of frequent size-3 patterns, the $combine$ function returns an empty set immediately.

Subgraph Sampling for Faster Exploration

In real applications, we may not need to find all frequent patterns, and exhaustive exploration is unnecessary [14]. Thus, we propose a subgraph sampling technique to accelerate the exploration process.

Sampling during Joining:

The idea is to add a sampling operation each time we iterate over the joining subgraphs, i.e., before each for-loop in the dotted boxes in Figure 2. Because the MNI support measures the frequency of a pattern as the number of distinct matching vertices, we sample a fixed number of iterations in each of the boxed for-loops in Figure 2, in order to achieve a more even distribution of subgraphs over all vertices. If a loop has fewer iterations than the sampling threshold, we execute all of them; if a loop has more iterations than the threshold, we uniformly sample a threshold number of iterations. This subgraph sampling during the joining phase can be considered a generalization of the neighbor sampling technique in ASAP [14]. ASAP samples a subset of the edges when it extends the matched subgraph from one vertex to its neighbors. We sample the neighboring size-3 subgraphs instead. Intuitively, our subgraph sampling is more accurate than neighbor sampling because size-3 subgraphs preserve more graph structure than edges.
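The thresholding rule described here can be sketched as follows. This is a minimal illustration only: `sample_iterations` and `sampled_pairs` are hypothetical names, and the two-level loop is a stand-in for the boxed for-loops of Figure 2, whose exact structure is not reproduced here.

```python
import random


def sample_iterations(items, threshold, rng):
    """Pick the loop iterations to execute under a sampling threshold."""
    items = list(items)
    if len(items) <= threshold:
        return items                      # few iterations: run all of them
    return rng.sample(items, threshold)   # many: uniform sample without replacement


def sampled_pairs(group_a, group_b, threshold, seed=0):
    """Nested loop over two joining groups, sampling before each for-loop."""
    rng = random.Random(seed)
    pairs = []
    for a in sample_iterations(group_a, threshold, rng):
        for b in sample_iterations(group_b, threshold, rng):
            pairs.append((a, b))
    return pairs


pairs = sampled_pairs(range(3), range(1000), threshold=5)
print(len(pairs))  # 3 * 5 combinations are explored instead of 3 * 1000
```

Capping each loop separately, rather than sampling the final output, is what keeps the number of explored combinations bounded per key group.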

Sampling during Matching:

For very large graphs, we may not be able to store all size-3 subgraphs in memory or even on disk. To achieve fast mining, we can sample the subgraphs during the matching phase and only store the sampled subgraphs. Similar to the sampling in the joining phase, we sample a fixed number of subgraphs around each vertex in order to have subgraphs evenly distributed over all vertices. More specifically, we permute the vertex list at each inner loop of the nested for-loop generated by AutoMine [17]. The execution continues to the next iteration of the outermost loop once 𝑡 subgraphs have been matched in the current iteration. This gives us 𝑡 subgraphs sampled from each vertex. We set 𝑡 to a number such that all the sampled subgraphs can be stored in memory. These sampled size-3 subgraphs are then given to the join procedure to explore larger subgraphs. This subgraph sampling during the matching phase can be considered a generalization of the edge sampling technique for approximate graph processing [5, 33]. Previous work has shown that edge sampling does not work well for graph mining tasks [14]. Our subgraph sampling is much more robust than edge sampling for graph pattern mining as it preserves more structures of the graph. Our experiments also validate this point.

This section presents our experimental setup and performance comparison with the existing graph mining systems and methods.

Table 1. Graph datasets

We run all the experiments on a workstation with an Intel Xeon W-3225 CPU containing 8 physical cores (16 logical cores with hyper-threading), 196GB memory, and a 4TB SSD. We use GCC 7.3.1 for compilation with optimization level O2 enabled. All the systems are configured to run with 16 threads. We use OpenMP to parallelize the for-loop that iterates over all keys in the first joining hash table.

Datasets:

We test on five graphs as listed in Table 1. These graphs are commonly used for evaluating the performance of graph mining systems. CiteSeer and MiCo are labeled, and the other four are unlabeled. For the unlabeled graphs, we randomly assign 30 labels to the vertices.

Settings:

We compare our system with three state-of-the-art graph mining systems: Peregrine [15] and AutoMine [17], which represent the pattern-based systems, and Pangolin [10], which represents the explore-aggregate-filter systems. We run edge-induced FSM since it is more commonly evaluated by the existing graph mining systems [12, 15, 30]. The original code of AutoMine only supports vertex-induced FSM (which has much less computation than edge-induced FSM) and uses the number of embeddings as the support measure (which is not anti-monotone). We adapt the code to support edge-induced FSM with MNI support, and we use it to find all size-3 subgraphs for our two-vertex exploration.

For most graphs, we set the MNI support threshold 𝑡 = 0.001𝑛, 0.005𝑛, 0.01𝑛 and 0.05𝑛, where 𝑛 is the number of nodes in the graph. The reason we use proportional thresholds is that the MNI support measures frequency as the number of distinct vertices [21]. The threshold means that if every vertex in a pattern maps to at least 𝑡 different vertices in the graph, we consider the pattern frequent. For UK and FR, because 𝑛 is large, there are few patterns that can meet threshold 0 . 𝑛 . Therefore, we test with 0 . 𝑛 and 0 . 𝑛 on UK and FR.

Table 2. Execution times in seconds. Systems: Two-Vertex exploration (TV), Peregrine (PR), AutoMine (AM) and Pangolin (PG). ‘T’ represents timeout after 24 hours of execution. ‘F’ represents execution failure due to insufficient memory or disk space.

| Size | Support | Gr. | TV | PR | AM | PG |
|---|---|---|---|---|---|---|
| 4-FSM | 0.001 | CI | 0.99 | 5.4 | 5.1 | 5.5 |
| | 0.005 | | 0.89 | 4.8 | | 4.8 |
| | 0.01 | | 0.81 | 3.4 | | 3.7 |
| | 0.05 | | 0.61 | 1.1 | | 2.8 |
| 4-FSM | 0.001 | MI | 41645 | F | 78244 | F |
| | 0.005 | | 32763 | | | |
| | 0.01 | | 29698 | | | |
| | 0.05 | | 25682 | | | |
| 5-FSM | 0.001 | CI | 25.2 | F | 68.2 | F |
| | 0.005 | | 22.1 | | | |
| | 0.01 | | 21.5 | | | |
| | 0.05 | | 16.9 | | | |
| 6-FSM | 0.001 | CI | 615 | F | 1924 | F |
| | 0.005 | | 597 | | | |
| | 0.01 | | 564 | | | |
| | 0.05 | | 416 | | | |
| 7-FSM | 0.001 | CI | 26760 | F | 63362 | F |
| | 0.005 | | 24644 | | | |
| | 0.01 | | 23697 | | | |
| | 0.05 | | 16257 | | | |

Since none of the compared systems supports sampling, we first run our algorithm without sampling to compare the performance. Table 2 summarizes the execution time of the FSM tasks for which at least one of the compared systems can return a result within 24 hours. The execution time of our system reported here is the time of Steps 2, 3 and 4 as described in Section 3. We do not include the time for Step 1 because 1) it is negligible on these two graphs (0.08 seconds on CI and 102 seconds on MI) compared with the joining time, and 2) Step 1 can be considered as preprocessing. We find that Peregrine and Pangolin abort for most tasks. In fact, the Peregrine paper [15] only reports results of 3-FSM. Pangolin [10] reports results mostly for 3-FSM. It reports 4-FSM for only one graph using large support thresholds, but it fails to give a result for MI. For the only test case (4-FSM on CI) for which Peregrine and Pangolin do return results, our system is 1.8x to 5.6x faster. AutoMine is able to return results for these tasks. However, because it matches the patterns in a depth-first order, it cannot benefit from the anti-monotone property (i.e., it does not run faster for larger support thresholds). Our system is 1.9x to 8.4x faster than AutoMine for these tasks.
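The speedup ranges quoted above can be re-derived from the Table 2 timings with a few lines of Python. The sketch below copies the numbers from Table 2 and assumes that the single AutoMine time reported per task applies to every support threshold of that task, which is consistent with the statement that AutoMine does not run faster for larger thresholds. The helper `speedup_range` is a hypothetical name.

```python
# Execution times in seconds, copied from Table 2 (one value per support threshold).
tv = {
    "4-FSM CI": [0.99, 0.89, 0.81, 0.61],
    "4-FSM MI": [41645, 32763, 29698, 25682],
    "5-FSM CI": [25.2, 22.1, 21.5, 16.9],
    "6-FSM CI": [615, 597, 564, 416],
    "7-FSM CI": [26760, 24644, 23697, 16257],
}
# Assumption: AutoMine's single reported time per task holds for all thresholds.
am = {
    "4-FSM CI": 5.1,
    "4-FSM MI": 78244,
    "5-FSM CI": 68.2,
    "6-FSM CI": 1924,
    "7-FSM CI": 63362,
}
pr_ci4 = [5.4, 4.8, 3.4, 1.1]   # Peregrine, 4-FSM on CI
pg_ci4 = [5.5, 4.8, 3.7, 2.8]   # Pangolin, 4-FSM on CI


def speedup_range(ratios):
    """Smallest and largest speedup, rounded to one decimal place."""
    return round(min(ratios), 1), round(max(ratios), 1)


vs_pr_pg = [
    other / ours
    for series in (pr_ci4, pg_ci4)
    for other, ours in zip(series, tv["4-FSM CI"])
]
vs_am = [am[task] / ours for task, times in tv.items() for ours in times]

print(speedup_range(vs_pr_pg))  # (1.8, 5.6)
print(speedup_range(vs_am))     # (1.9, 8.4)
```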

Advantage over Single-Vertex Exploration:

As discussed in Section 4.2, one advantage of two-vertex exploration over single-vertex exploration is that it reduces the memory access overhead in depth-first multi-way join. To show the advantage, we configure our system to run single-vertex exploration. The single-vertex version still uses our index-based quick patterns, but it does not support exploration space pruning since the size-3 subgraphs are not computed. The execution times of single-vertex exploration are shown in Figure 4. We can see that single-vertex exploration is 1.02x to 1.52x slower than two-vertex exploration. We also collect the total memory access sizes to the hash tables with two-vertex exploration and single-vertex exploration (assuming every query to the hash tables is a cache miss). As shown in Figure 5, two-vertex exploration reduces the memory access overhead by 5x to 189x. The results are collected with support threshold 0 . 𝑛 . Other support thresholds show a similar pattern.

Benefit of Index-Based Quick Pattern:

To show the benefit of our index-based quick pattern technique, we disable our index-based quick pattern and use the quick pattern technique in previous work instead (i.e., a list of edges with labels of adjacent nodes). Figure 4 shows the execution times of two-vertex exploration without our index-based quick pattern. We can see that it leads to a 1.75x to 2.78x slowdown. To further verify the advantage, we collect the number of invocations to the bliss function [16] for computing the canonical forms of subgraphs. As shown in Figure 6, our index-based quick pattern reduces the number of isomorphism checks by 31x to 564x for different tasks, which explains the speedups.
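To illustrate why quick patterns reduce isomorphism checks at all, the following Python sketch implements the baseline idea from previous work (a list of edges with the labels of adjacent nodes) as a cache key in front of an expensive canonical-form computation. It does not show the index-based variant proposed in this paper. The names `quick_pattern`, `canonical_form` and `PatternCounter` are hypothetical, and the brute-force `canonical_form` is only a stand-in for a real tool such as bliss.

```python
from itertools import permutations


def quick_pattern(edge_list, labels):
    """Cheap key: edges in discovery order, vertices renamed by first appearance."""
    order = {}
    key = []
    for u, v in edge_list:
        for x in (u, v):
            order.setdefault(x, len(order))
        key.append((order[u], order[v], labels[u], labels[v]))
    return tuple(key)


def canonical_form(qp):
    """Brute-force canonical form over all vertex renamings (small patterns only)."""
    lab = {}
    for i, j, li, lj in qp:
        lab[i], lab[j] = li, lj
    n = len(lab)
    best = None
    for perm in permutations(range(n)):   # perm[old] = new vertex id
        new_lab = [None] * n
        for old, new in enumerate(perm):
            new_lab[new] = lab[old]
        edges = tuple(sorted(tuple(sorted((perm[i], perm[j]))) for i, j, _, _ in qp))
        key = (tuple(new_lab), edges)
        if best is None or key < best:
            best = key
    return best


class PatternCounter:
    def __init__(self):
        self.canon_of = {}   # quick pattern -> canonical form
        self.iso_calls = 0   # number of expensive canonical-form computations
        self.counts = {}     # canonical form -> number of subgraphs seen

    def add(self, edge_list, labels):
        qp = quick_pattern(edge_list, labels)
        if qp not in self.canon_of:       # only unseen quick patterns pay the cost
            self.iso_calls += 1
            self.canon_of[qp] = canonical_form(qp)
        canon = self.canon_of[qp]
        self.counts[canon] = self.counts.get(canon, 0) + 1
        return canon


labels = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A"}
counter = PatternCounter()
counter.add([(0, 1), (1, 2)], labels)   # path A-B-A
counter.add([(2, 3), (3, 4)], labels)   # same quick pattern: cache hit
counter.add([(1, 2), (0, 1)], labels)   # same pattern, different edge order: cache miss
print(counter.iso_calls, list(counter.counts.values()))
```

Three subgraphs of the same pattern cost two canonical-form computations here. The third call shows the weakness of the baseline key: the same pattern discovered in a different edge order produces a different quick pattern and triggers another check.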

Next, we evaluate the effectiveness of our sampling methods. Since all the size-3 subgraphs of CI and MI can be stored in memory, we only perform sampling during the joining phase for these two graphs.

Figure 7 shows the number of size-4 frequent patterns we can find on MI graph with different support thresholds and different sampling thresholds. The execution times of the different runs are labeled on top of the bars. When the support threshold is set to 0.001𝑛, there are 215025 frequent patterns in total, and to discover all these patterns precisely our system needs to run for 41645 seconds (as shown in Table 2). If we sample 10 edges and 100 size-3 subgraphs in each key group when we join the edge list and the size-3 subgraphs (ST10 in the figure), our two-vertex exploration returns 49% of the frequent patterns in 671 seconds. The execution time is reduced by 62x. When the support threshold is set to 0.005𝑛, we can find 41% of the total frequent patterns within 524 seconds, which is 1/63 of the total execution time. When the support threshold is set to 0.05𝑛, there are only 8 frequent patterns, and our two-vertex exploration with ST6 sampling can find 7 of them in 58 seconds, which leads to a 443x speedup compared with the accurate execution. The figure also shows that the larger the sampling threshold we use, the more frequent patterns we can find.

Figure 8 shows the number of size-7 frequent patterns found on CI graph with different support thresholds and different sampling thresholds. When the support threshold is 0.001𝑛, our two-vertex exploration can find 86% of the frequent patterns in 1284 seconds with ST10 sampling. Compared with the time of accurate execution in Table 2, sampling achieves a 21x speedup. When the support threshold is set to 0.005𝑛 and 0.01𝑛, there are fewer frequent patterns, and our two-vertex exploration with ST10 sampling can find more than 90% of the frequent patterns with less than 1/21 of the total execution time. When the support threshold is 0.05𝑛, there are only 22 frequent patterns, and our two-vertex exploration with ST6 sampling can find all of them within 170 seconds, which is 1/96 of the accurate execution time.

[Figure 4 plot. Y-axis: Slowdown. Series: two-vertex, single-vertex, two-vertex without IQP.]

Figure 4.

Performance of two-vertex exploration, single-vertex exploration, and two-vertex exploration without our index-based quick pattern. The execution times are normalized for each task with the execution time of two-vertex exploration in Table 2.

[Figure 5 plot. Y-axis: Memory Access Size (MB). Series: two-vertex, single-vertex.]

Figure 5.

Total memory access size in multi-way join with two-vertex and single-vertex exploration for different FSM tasks (MNI support threshold 0 . 𝑛 ).

[Figure 6 plot. Y-axis: Number of Isomorphism Checks. Series: index-based, non-index-based.]

Figure 6.

Number of isomorphism checks for different FSM tasks (MNI support threshold 0 . 𝑛 ) with and without our index-based quick pattern technique.

Advantage over Single-Vertex Exploration:

As discussed in Section 5, another advantage of two-vertex exploration over single-vertex exploration is that it leads to more accurate sampling. To verify this, we configure our system to run sampled single-vertex exploration. Since single-vertex exploration needs twice as many join steps as two-vertex exploration, we set its sampling threshold to the square root of the threshold for two-vertex exploration in order to achieve a similar size of overall exploration space. As shown in Figures 7 and 8, if we do not include the matching time, two-vertex exploration has a slightly shorter execution time than single-vertex exploration when they use the corresponding sampling thresholds. Even if we add the time for matching size-3 subgraphs (102 seconds on MI and 0.08 seconds on CI), the total execution time is close to that of single-vertex exploration. For 4-FSM on MI, two-vertex exploration finds 6% to 57% more frequent patterns than single-vertex exploration. For 7-FSM on CI, the number of frequent patterns found by two-vertex exploration is 1.18x to 4x that of single-vertex exploration.
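The square-root rule for the single-vertex sampling threshold can be checked with a short calculation. The sketch below uses a deliberately simplified model, which is an assumption and not the paper's cost model: each join step keeps at most `threshold` extensions per partial subgraph, so the explored space is bounded by `threshold ** join_steps`. The function name and the example values are hypothetical.

```python
import math


def max_branches(threshold, join_steps):
    """Upper bound on explored extensions when each join step keeps `threshold` of them."""
    return threshold ** join_steps


tv_threshold = 100                       # two-vertex sampling threshold (example value)
sv_threshold = math.isqrt(tv_threshold)  # square root of the two-vertex threshold
tv_steps = 2                             # two joins, each adding two vertices
sv_steps = 2 * tv_steps                  # single-vertex needs twice as many joins

tv_space = max_branches(tv_threshold, tv_steps)
sv_space = max_branches(sv_threshold, sv_steps)
print(tv_space, sv_space)
```

Under this model both configurations explore at most the same number of branches, which is why the comparison in Figures 7 and 8 pairs each two-vertex threshold with its square root for single-vertex exploration.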

[Figure 7 plot. Y-axis: Number of Discovered Patterns / Tot. Patterns. Series: two-vertex, single-vertex.
Time (sec), first row of bar labels: 156 267 536 671 92 204 438 524 72 180 313 376 40 58 80 105
Time (sec), second row of bar labels: 240 365 678 787 185 313 604 687 175 285 467 534 137 156 181 208]

Figure 7.

Number of discovered size-4 frequent patterns on MI graph with different support thresholds and different sampling thresholds. For single-vertex exploration, ST𝑥 means that in every join step only 𝑥 edges are sampled in each neighbor list. For two-vertex exploration, it means that 𝑥 edges and 𝑥² size-3 subgraphs are sampled in each key group when we join the edge list with the size-3 subgraphs.

[Figure 8 plot. Y-axis: Number of Discovered Patterns / Tot. Patterns. Series: two-vertex, single-vertex.
Time (sec), first row of bar labels: 96 325 916 1284 82 280 834 1153 75 271 779 1101 55 170 589 678
Time (sec), second row of bar labels: 99 325 916 1271 93 311 860 1141 106 310 798 1039 58 175 586 664]

Figure 8.

Number of discovered size-7 frequent patterns on CI graph with different support thresholds and different sampling thresholds. For single-vertex exploration, ST𝑥 means that in every join step only 𝑥 edges are sampled in each neighbor list. For two-vertex exploration, it means that in every join step only 𝑥 size-3 subgraphs are sampled in each key group.

Table 3 shows the number of size-9 frequent patterns found on CI graph with different support thresholds. We can see that with a similar execution time two-vertex exploration discovers many more frequent patterns than single-vertex exploration. It is worth noting that none of the previous systems can return results for 9-FSM even on a small graph like CI. AutoMine cannot even enumerate all the size-9 unlabeled patterns in 24 hours.

Table 3. Results of 9-FSM on CI graph with sampling threshold 4 for two-vertex exploration (TV) and sampling threshold 2 for single-vertex exploration (SV).

(a) Number of discovered patterns

| Support | 0.001 | 0.005 | 0.01 | 0.05 |
|---|---|---|---|---|
| TV | 63941 | 6050 | 1770 | 16 |
| SV | 402 | 0 | 0 | 0 |

(b) Execution time (sec)

Table 4. Results of 5-FSM on OK graph with support threshold 0 . 𝑛 and different sampling thresholds. ‘M. ST’ stands for Matching Sampling Threshold, ‘M. Time’ stands for Matching Time, ‘J. ST’ stands for Joining Sampling Threshold, ‘J. Time’ stands for Joining Time.

(a) Two-vertex exploration

Columns: M. ST, M. Time (sec), J. ST, J. Time (sec)

(b) Single-vertex exploration

Columns: J. ST, J. Time

Results on Large Graphs:

Since the size-3 subgraphs of OK, UK and FR cannot be entirely stored in memory, we perform sampling during the matching phase and only store the sampled size-3 subgraphs. Table 4a shows the number of size-5 frequent patterns with support larger than 0 . 𝑛 found on OK graph. In the table, a matching sampling threshold 𝑥 means that 𝑥 subgraphs are sampled from each vertex during the matching phase. A larger matching sampling threshold results in a longer matching time, although the two are not proportional: the matching time is mainly determined by the number of subgraph groups that need isomorphism checks. We find that the number of discovered patterns does not increase much with larger sampling thresholds. This is because 0 . 𝑛 is a relatively large support threshold for this graph, and there are not many frequent patterns. To show the advantage of two-vertex exploration, we run single-vertex exploration for the same task. Since single-vertex exploration does not need to match the size-3 subgraphs, we only perform sampling during the joining phase, with thresholds 1 and 2. The results are shown in Table 4b. We can see that single-vertex exploration takes a longer time and finds fewer patterns.

Table 5.

Results of 5-FSM on UK and FR graph with different sampling thresholds and support thresholds. ‘M. ST’ stands for Matching Sampling Threshold, ‘M. Time’ stands for Matching Time, ‘J. ST’ stands for Joining Sampling Threshold, ‘J. Time’ stands for Joining Time.

(a) Two-vertex exploration on UK

Columns: Support, M. ST, M. Time (sec), J. ST, J. Time (sec)

(b) Two-vertex exploration on FR

Columns: Support, M. ST, M. Time (sec), J. ST, J. Time (sec)

The results of 5-FSM on UK and FR graph are shown in Table 5. We can see that matching takes a large proportion of the total execution time. This is because there are a lot of size-3 subgraphs and patterns in these two large graphs. However, if we consider matching as preprocessing and store the sampled size-3 subgraphs in memory, the joining procedure is fast. As shown in Table 5a, our system can find frequent patterns on UK within a few minutes, and more patterns can be found by using larger joining sampling thresholds. Table 5b shows the results of 5-FSM on FR graph. Again, the matching procedure is expensive. Once the size-3 subgraphs are sampled, we can find size-5 frequent patterns in a relatively short time with sampled join. For comparison, we also run single-vertex exploration with sampling threshold 2 on these two graphs. It cannot finish execution within 24 hours, so we print out the patterns found after 24 hours of execution. For UK, it returns 5 frequent patterns when the support threshold is set to 0 . 𝑛 , and no frequent patterns when the support threshold is 0 . 𝑛 . For FR, single-vertex exploration with sampling threshold 2 cannot return any frequent pattern within 24 hours.

This section summarizes the graph pattern mining systems that are most related to our work.

Exploration-based Systems:

Arabesque [27] is a distributed graph pattern mining system. It enumerates all possible embeddings in multiple rounds and uses a filter-process model to generate the results. It was the first to propose the quick pattern technique for reducing isomorphism checks. RStream [30] is the first single-machine, out-of-core graph mining system. It supports a rich programming model that exposes relational algebra for developers to express various mining tasks and a runtime engine that can efficiently compute the relational operations. Pangolin [10] also targets a single machine but provides a GPU programming interface for acceleration. DistGraph [26], ScaleMine [4] and G-Miner [8] are all distributed graph mining systems that adopt breadth-first exploration. DistGraph focuses on reducing the communication of distributed computing when each node can only have a portion of the graph. ScaleMine proposes a two-phase mining approach to achieve good load balance and reduce communication in distributed computing. G-Miner proposes a block-based graph partitioning technique and uses work stealing to achieve good load balance. Because these systems use breadth-first exploration and need to store all intermediate results, they are not able to mine large patterns on large graphs. Fractal [12] is also exploration-based, but it supports depth-first exploration to reduce the memory consumption. All of the existing systems adopt single-vertex exploration. Our system is the first to adopt multi-vertex exploration for mining larger patterns in graphs.

Pattern-based Systems:

AutoMine [17] is a single-machine graph mining system that features compiler-based optimizations. Its main idea is to enumerate all the unlabeled patterns of a particular size and match them one by one on a graph. Because the patterns are given, AutoMine is able to search for an optimal matching strategy and combine the matching procedures of multiple patterns. Because of its depth-first matching order, it is hard for AutoMine to benefit from the anti-monotone property of FSM. Also, when the pattern size is more than 7, enumerating the patterns becomes difficult. Peregrine [15] is another pattern-based system. Instead of enumerating all the patterns before matching, it discovers patterns based on the subgraphs it has explored and maintains a list of the patterns. The main issue with Peregrine is that it needs to rematch the frequent patterns in each step, which leads to redundant computation. DwarvesGraph [9] is a recently proposed pattern-based graph mining system. It is based on the idea that the task of matching a large pattern can be divided into smaller tasks of matching the subpatterns. Similar to AutoMine, it needs to know all the unlabeled patterns in advance. Thus, it cannot discover large patterns. In fact, the DwarvesGraph paper only reports results for 3-FSM.

Approximate Pattern Mining:

Sampling has been proposed by earlier works [6, 24] to accelerate FSM in a database of graphs. In this setting, a pattern is considered frequent if it exists in more than a certain number of graphs. The main idea of these works is to perform a random walk in the space of all patterns. Every time the walk moves from one pattern to another, it calculates a probability distribution over all candidate patterns. By carefully setting the sampling probability at each step, they ensure that patterns of higher support are more likely to be sampled [6]. More recent works consider FSM on a single graph since it is more commonly used in real applications and is more general (a list of graphs can be considered as a single graph with disconnected components) [13]. Sampling has also been proposed to accelerate pattern-based graph mining in this setting [14, 18, 23]. The main idea is to sample edges in the graph based on the given patterns and estimate the actual results with the sampled results. These methods need to know the patterns in advance. It is not obvious how they can be applied to the exploration-based systems. We fill this gap and show that FSM can be accelerated by sampling the subgraphs in each key group of the join operation.

In this work, we propose a novel two-vertex exploration method to accelerate frequent subgraph mining. Based on two-vertex exploration, we further improve the performance through an index-based quick pattern technique and subgraph sampling. The experiments show that our method outperforms other state-of-the-art graph mining systems for FSM on various input graphs and pattern sizes.

References

[1] [n.d.]. Dataset for "Statistics and Social Network of YouTube Videos". http://netsg.cs.sfu.ca/youtubedata/.
[2] [n.d.]. Number of Graphs on n unlabelled vertices. http://garsia.math.yorku.ca/~zabrocki/math3260w03/nall.html.
[3] [n.d.]. Orkut social network. http://snap.stanford.edu/data/com-Orkut.html.
[4] Ehab Abdelhamid, Ibrahim Abdelaziz, Panos Kalnis, Zuhair Khayyat, and Fuad Jamour. 2016. Scalemine: Scalable parallel frequent subgraph mining in a single large graph. In SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 716–727.
[5] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems. 29–42.
[6] Mohammad Al Hasan and Mohammed J Zaki. 2009. Output space sampling for graph patterns. Proceedings of the VLDB Endowment 2, 1 (2009), 730–741.
[7] László Babai, William M Kantor, and Eugene M Luks. 1983. Computational complexity and the classification of finite simple groups. In . IEEE, 162–171.
[8] Hongzhi Chen, Miao Liu, Yunjian Zhao, Xiao Yan, Da Yan, and James Cheng. 2018. G-Miner: an efficient task-oriented graph mining system. In Proceedings of the Thirteenth EuroSys Conference. 1–12.
[9] Jingji Chen and Xuehai Qian. 2020. DwarvesGraph: A High-Performance Graph Mining System with Pattern Decomposition. arXiv:2008.09682 [cs.DC]
[10] Xuhao Chen, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. 2020. Pangolin: An Efficient and Flexible Graph Mining System on CPU and GPU. Proc. VLDB Endow. 13, 8 (April 2020), 1190–1205. https://doi.org/10.14778/3389133.3389137
[11] Wei-Ta Chu and Ming-Hung Tsai. 2012. Visual Pattern Discovery for Architecture Image Classification and Product Image Search. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval (Hong Kong, China) (ICMR ’12). Association for Computing Machinery, New York, NY, USA, Article 27, 8 pages. https://doi.org/10.1145/2324796.2324831
[12] Vinicius Dias, Carlos H. C. Teixeira, Dorgival Guedes, Wagner Meira, and Srinivasan Parthasarathy. 2019. Fractal: A General-Purpose Graph Pattern Mining System. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD ’19). Association for Computing Machinery, New York, NY, USA, 1357–1374. https://doi.org/10.1145/3299869.3319875
[13] Mohammed Elseidy, Ehab Abdelhamid, Spiros Skiadopoulos, and Panos Kalnis. 2014. GraMi: Frequent Subgraph and Pattern Mining in a Single Large Graph. Proc. VLDB Endow. 7, 7 (March 2014), 517–528. https://doi.org/10.14778/2732286.2732289
[14] Anand Padmanabha Iyer, Zaoxing Liu, Xin Jin, Shivaram Venkataraman, Vladimir Braverman, and Ion Stoica. 2018. ASAP: Fast, Approximate Graph Pattern Mining at Scale. In . USENIX Association, Carlsbad, CA, 745–761.
[15] Kasra Jamshidi, Rakesh Mahadasa, and Keval Vora. 2020. Peregrine: A Pattern-Aware Graph Mining System. In Proceedings of the Fifteenth European Conference on Computer Systems (Heraklion, Greece) (EuroSys ’20). Association for Computing Machinery, New York, NY, USA, Article 13, 16 pages. https://doi.org/10.1145/3342195.3387548
[16] Tommi Junttila and Petteri Kaski. 2007. Engineering an efficient canonical labeling tool for large and sparse graphs. In Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments and the Fourth Workshop on Analytic Algorithms and Combinatorics, David Applegate, Gerth Stølting Brodal, Daniel Panario, and Robert Sedgewick (Eds.). SIAM, 135–149.
[17] Daniel Mawhirter and Bo Wu. 2019. AutoMine: Harmonizing High-Level Abstraction and High Performance for Graph Mining. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP ’19). Association for Computing Machinery, New York, NY, USA, 509–523. https://doi.org/10.1145/3341301.3359633
[18] Daniel Mawhirter, Bo Wu, Dinesh Mehta, and Chao Ai. 2018. Approxg: Fast approximate parallel graphlet counting through accuracy control. In . IEEE, 533–542.
[19] Brendan McKay and Adolfo Piperno. [n.d.]. nauty and Traces. http://users.cecs.anu.edu.au/~bdm/nauty/.
[20] Brendan D McKay et al. 1981. Practical graph isomorphism. Department of Computer Science, Vanderbilt University Tennessee, USA.
[21] Jinghan Meng and Yi-cheng Tu. 2017. Flexible and Feasible Support Measures for Mining Frequent Patterns in Large Labeled Graphs. In Proceedings of the 2017 ACM International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD ’17). Association for Computing Machinery, New York, NY, USA, 391–402. https://doi.org/10.1145/3035918.3035936
[22] Ron Milo, Shai Shen-Orr, Shalev Itzkovitz, Nadav Kashtan, Dmitri Chklovskii, and Uri Alon. 2002. Network motifs: simple building blocks of complex networks. Science
Statistical Analysis and Data Mining: The ASA Data Science Journal 8, 4 (2015), 245–261.
[25] Haichuan Shang, Ying Zhang, Xuemin Lin, and Jeffrey Xu Yu. 2008. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. Proc. VLDB Endow. 1, 1 (Aug. 2008), 364–375. https://doi.org/10.14778/1453856.1453899
[26] Nilothpal Talukder and Mohammed J Zaki. 2016. A distributed approach for graph mining in massive networks. Data Mining and Knowledge Discovery 30, 5 (2016), 1024–1052.
[27] Carlos HC Teixeira, Alexandre J Fonseca, Marco Serafini, Georgos Siganos, Mohammed J Zaki, and Ashraf Aboulnaga. 2015. Arabesque: a system for distributed graph mining. In Proceedings of the 25th Symposium on Operating Systems Principles. 425–440.
[28] Johan Ugander, Lars Backstrom, and Jon Kleinberg. 2013. Subgraph Frequencies: Mapping the Empirical and Extremal Geography of Large Graph Collections. In Proceedings of the 22nd International Conference on World Wide Web (Rio de Janeiro, Brazil) (WWW ’13). Association for Computing Machinery, New York, NY, USA, 1307–1318. https://doi.org/10.1145/2488388.2488502
[29] A Vazquez, R Dobrin, D Sergi, J-P Eckmann, Zoltan N Oltvai, and A-L Barabási. 2004. The topological relationship between the large-scale attributes and local interaction patterns of complex networks. Proceedings of the National Academy of Sciences
. 763–782.
[31] Xifeng Yan and Jiawei Han. 2002. gSpan: graph-based substructure pattern mining. In
Knowledge and Information Systems 42, 1 (2015), 181–213.
[33] Ruoyu Zou and Lawrence B Holder. 2010. Frequent subgraph mining on a single large graph using sampling techniques. In Proceedings of the eighth workshop on mining and learning with graphs. 171–178.