Graph Iso/Auto-morphism: A Divide-&-Conquer Approach
Can Lu, Jeffrey Xu Yu, Zhiwei Zhang, Hong Cheng
The Chinese University of Hong Kong, Hong Kong, China;
Hong Kong Baptist University, Hong Kong, China
{lucan,yu,hcheng}@se.cuhk.edu.hk; [email protected]

ABSTRACT
Graph isomorphism is to determine whether two graphs are isomorphic. A closely related problem is graph automorphism (symmetry) detection, where an isomorphism between two graphs is a bijection between their vertex sets that preserves adjacency, and an automorphism is an isomorphism from a graph to itself. Applications of graph isomorphism and automorphism detection include database indexing, network modeling, network measurement, network simplification, and social network anonymization. By graph automorphism, we deal with symmetric subgraph matching (SSM), which is to find all subgraphs in a graph G that are symmetric to a given subgraph in G. An application of SSM is to identify multiple seed sets that have the same influence power as a set of seeds found by influence maximization in a social network. To test two graphs for isomorphism, canonical labeling has been studied to relabel a graph in such a way that isomorphic graphs are identical after relabeling. Efficient canonical labeling algorithms have been designed by individualization-refinement. They enumerate all permutations of vertices using a search tree and select the minimum permutation as the canonical labeling; candidates are pruned by the current minimum permutation during enumeration. Despite their high performance on benchmark graphs, these algorithms face difficulties in handling massive graphs, and the search trees they use serve pruning purposes only and cannot answer symmetric subgraph matching. In this paper, we design a new efficient canonical labeling algorithm, DviCL. DviCL is based on the observation that we can use the k-th minimum permutation as the canonical labeling. Different from previous algorithms, we take a divide-and-conquer approach to partition a graph G. By partitioning G, an AutoTree is constructed, which preserves symmetric structures as well as the automorphism group of G. The canonical labeling for a tree node can be obtained from the canonical labelings of its child nodes, and the canonical labeling for the root is the one for G. Such an AutoTree can also be used effectively to answer queries on the automorphism group and symmetric subgraphs. We conducted extensive performance studies using 22 large graphs, and confirmed that DviCL is much more efficient and robust than the state-of-the-art.
1. INTRODUCTION
Combinatorial objects and complex structures are often modeled as graphs in many applications, including social networks and social media [29], expert networks [20], bioinformatics [38, 39], and mathematical chemistry [6]. Among many graph problems, graph isomorphism is to determine whether two graphs are isomorphic [32]. Graph isomorphism is important in practice, since it is used for deduplication and retrieval when dealing with a collection of graphs, and it is important in theory due to its relationship to the concept of NP-completeness. A closely related graph problem is automorphism (symmetry) detection, where an isomorphism between two graphs is a bijection between their vertex sets that preserves adjacency, and an automorphism (symmetry) is an isomorphism from a graph to itself. Automorphism detection is also important in various graph problems. On one hand, by automorphism, from a global viewpoint, two vertices (or subgraphs) are equivalent in the sense that the entire graph remains unchanged if one is replaced by the other. Therefore, with automorphism, a finding over a single vertex (or subgraph) can be applied to all other automorphic vertices (or subgraphs). On the other hand, as symmetries of combinatorial objects are known to complicate algorithms, detecting and discarding symmetric subproblems can reduce the scale of the original problems. There are many applications of graph isomorphism and automorphism detection, including database indexing [31], network modeling [24, 36], network measurement [35], network simplification [35], and social network anonymization [34]. (a) Database Indexing: Given a large database of graphs (e.g., chemical compounds), it assigns every graph a certificate such that two graphs are isomorphic iff they share the same certificate [31]. (b) Network Model: It studies the automorphism groups of a wide variety of real-world networks and finds that real graphs are richly symmetric [24].
In [36], it is claimed that similar linkage patterns are the underlying ingredient responsible for the emergence of symmetry in complex networks. (c) Network Measurement: In [37], a structure entropy based on automorphism partition is proposed to precisely quantify the structural heterogeneity of networks, with the finding that structural heterogeneity is strongly negatively correlated with the symmetry of real graphs. (d) Network Simplification: In [35], inherent network symmetry is utilized to collapse all redundant information from a network, resulting in a coarse graining, known as a "quotient", which is claimed to preserve various key functional properties such as complexity (heterogeneity and hub vertices) and communication (diameter and mean geodesic distance), although quotients can be substantially smaller than the original graphs. (e) Social Network Anonymization: In [34], a k-symmetry model is proposed to modify a naively-anonymized network such that for any vertex in the network, there exist at least k − 1 structurally equivalent counterparts, protecting against re-identification under any potential structural knowledge about a target. Below, we discuss how graph automorphism is used for influence maximization (IM) [1, 8, 17, 28], and discuss symmetric subgraph matching (SSM) by graph automorphism together with other SSM applications [10, 19, 21]. Influence maximization (IM) is widely studied in social networks and social media to select a set S of k seeds such that the expected spread of influence σ(S) is maximized. In the literature, almost all work on IM finds a single S with the maximum influence. With graph automorphism, we can possibly find a set S = {S_1, S_2, ...} where each S_i has the same maximum influence as S while containing some different vertices, and we are then able to select one S_i in S that satisfies some additional criteria (e.g., attributes on vertices in a seed set and the distribution of such seed vertices).
To show such possibilities, we compute IM by one of the best performing algorithms, PMC [28], under the IC model as reported by [1], over a large number of datasets (Table 1) using the parameters following [1], where the probability of influencing one vertex from another is treated as a constant. We conduct testing to select a set of k seeds, for k = 10 and k = 100. We find that there are 8.82E+15 and 2.93E+15 candidate seed sets for wikivote when k = 10 and k = 100, respectively, and the numbers for Orkut are 4 and 2.9E+10, respectively. Finding S for the S found by IM can be processed as a special case of symmetric subgraph matching (SSM), which we discuss below. Symmetric subgraph matching (SSM), which we study in this paper, is closely related to subgraph matching (or subgraph isomorphism). Given a query graph q and a data graph G, subgraph matching finds all subgraphs g in G that are isomorphic to q. By SSM, q is required to be a subgraph that exists in G, and any g returned should be symmetric to q in G, i.e., there is at least one automorphism γ of G having g = q^γ. Note that all subgraphs discussed here are induced. The applications of SSM include software plagiarism detection, program maintenance, and compiler optimizations [10, 19, 21], where an intermediate program representation, called the program dependence graph (PDG), is constructed for both control and data dependencies of each operation in a program. In the literature, to check if two graphs are isomorphic, the most practical approach is canonical labeling, by which a graph is relabeled in such a way that two graphs are isomorphic iff their canonical labelings are the same. Since the seminal work [25, 26] by McKay in 1981, nauty has become a standard for canonical labeling and has been incorporated into several mathematical software tools such as GAP [14] and MAGMA [7]. Other canonical labeling approaches, such as bliss [15, 16] and traces [30], address possible shortcomings of nauty while closely following nauty's ideas.
Despite their high performance, these approaches face difficulties in handling today's massive graphs. As shown in our experimental studies (Table 5), nauty fails on all but one dataset, traces fails on nearly half of the datasets, and bliss is inefficient on most datasets. Due to the lack of efficient canonical labeling algorithms for massive graphs, to the best of our knowledge, hardly any algorithms have incorporated graph isomorphism or graph automorphism. In this work, we propose a novel efficient canonical labeling algorithm for massive graphs. We observe that the state-of-the-art algorithms (e.g., nauty [25, 26], traces [30] and bliss [15, 16]) discover the canonical labeling following the "individualization-refinement" schema. These algorithms enumerate all possible permutations and select the minimum G^γ as the canonical labeling. Here, the graph G as well as the permuted G^γ can be represented as elements from a totally ordered set; for instance, G (resp. G^γ) can be represented by its sorted edge list. At first glance, choosing the minimum G^γ as the target is probably the most efficient choice for branch-and-bound algorithms. However, the minimum G^γ is not always the best choice for every graph G. For instance, if all vertices in G can be easily distinguished, a permutation γ based on sorting is a better choice. Our main idea is to divide the given graph into a set of subgraphs satisfying that (1) two isomorphic graphs G and G′ will be divided into two sorted subgraph sets {g_1, ..., g_k} and {g′_1, ..., g′_k}, such that g_i is isomorphic to g′_i for 1 ≤ i ≤ k; and (2) the canonical labeling of the original graph G can be easily obtained from the canonical labelings of the subgraphs. Note that the canonical labeling of each subgraph g_i can be defined arbitrarily, not limited to the minimum g_i^γ. As a consequence, our approach returns the k-th minimum G^γ as the canonical labeling.
Note that k is not fixed for all graphs, and we do not need to know the k value when computing the canonical labeling. Applying this idea to each subgraph g_i, our approach DviCL follows the divide-and-conquer paradigm and constructs a tree index, called the AutoTree AT. Here, a tree node in AT corresponds to a subgraph g_i of G, and contains its canonical labeling as well as its automorphism group. The root corresponds to G. By the AutoTree, we can easily detect symmetric vertices and subgraphs in G. Take the maximum clique as an example. Given a graph G, for a maximum clique q found [22], we can efficiently identify 4 candidate maximum cliques in Google and 16 candidate maximum cliques in LiveJournal (Table 1) using the AutoTree constructed. Algorithm SSM-AT for symmetric subgraph matching is given in Section 6.4. For k-symmetry [34], with the AutoTree, each subtree of the root can be duplicated to have at least k − 1 symmetric siblings. As a consequence, each vertex has at least k − 1 automorphic counterparts in the reconstructed graph. The main contributions of our work are summarized below. First, we propose a novel canonical labeling algorithm DviCL following the divide-and-conquer paradigm. DviCL can efficiently discover the canonical labeling and the automorphism group for massive graphs. Second, we construct an AutoTree for a graph G which provides an explicit view of the symmetric structure in G in addition to the automorphism group and canonical labeling. Such an AutoTree can also be used to solve symmetric subgraph matching and social network anonymization. Third, we conduct extensive experimental studies to show the efficiency and robustness of DviCL. The preliminaries and the problem statement are given in Section 2. We discuss related works in Section 3, and review the previous algorithms in Section 4. We give an overview in Section 5, and discuss the algorithms in Section 6. We conduct comprehensive experimental studies and report our findings in Section 7.
We conclude this paper in Section 8.
2. PROBLEM DEFINITION
In this paper, we discuss our approach on an undirected graph G = (V, E) without self-loops or multiple edges, where V and E denote the sets of vertices and edges of G, respectively. We use n and m to denote the numbers of vertices and edges of G, respectively, i.e., n = |V| and m = |E|. For a vertex u ∈ V, the neighbor set of u is denoted as N(u) = {v | (u, v) ∈ E}, and the degree of u is denoted as d(u) = |N(u)|. In the following, we discuss some concepts and notations using an example graph G shown in Fig. 1(a).

Permutation: A permutation of V, denoted as γ, is a bijective function from V to itself. We use v^γ to denote the image of v ∈ V under a permutation γ. Applying a permutation γ to a graph G permutes the vertices in V of G and produces a graph G^γ = (V^γ, E^γ), where V^γ = V and E^γ = {(u^γ, v^γ) | (u, v) ∈ E}. Following the convention used in the literature, we use the cycle notation to represent permutations. In a permutation γ, (v_1, v_2, ..., v_k) means v_i^γ = v_{i+1} for 1 ≤ i ≤ k − 1 and v_k^γ = v_1.

[Figure 1: (a) An example graph; (b) the backtrack search tree T(G, π) constructed by bliss for the graph in Fig. 1(a).]

For simplicity, we may show a permutation only for a subset of vertices using the cycle notation, with the assumption that all other vertices are permuted to themselves. Consider the graph G in Fig. 1(a). The permutation γ = (4, 5, 6) relabels 4 as 5, 5 as 6, and 6 as 4, where all the other vertices are permuted to themselves. It produces a graph G^γ = (V, E^γ), where E^γ = E. For the same graph G, the permutation γ′ = (0, 1) relabels 0 as 1 and 1 as 0, and produces G^{γ′} = (V, E^{γ′}), where E^{γ′} ≠ E. All n! permutations of V (for n = |V|) form a symmetric group with permutation composition as the group operation, denoted as S_n.

Automorphism: An automorphism of a graph G = (V, E) is a permutation γ (∈ S_n) that preserves G's edge relation, i.e., G^γ = G, or equivalently, E^γ = E. In the graph G (Fig. 1(a)), γ = (4, 5, 6) is an automorphism of G whereas γ′ = (0, 1) is not. Similarly, all automorphisms of G form an automorphism group with permutation composition as the group operation, denoted as Aut(G) (⊆ S_n). Each graph G has a trivial automorphism, called the identity, denoted as ι, that maps every vertex to itself. For two distinct vertices u and v in G, if there is an automorphism γ mapping u to v, i.e., u^γ = v, we say u and v are automorphically equivalent, denoted as u ∼ v. For instance, the automorphism γ = (4, 5, 6) indicates that vertices 4, 5 and 6 are automorphically equivalent.

Structural equivalence: In a graph G, two distinct vertices u and v are structurally equivalent if they have the same neighbor set, i.e., N(u) = N(v). Obviously, if two vertices are structurally equivalent, they must be automorphically equivalent, while the converse does not always hold. For G in Fig.
1(a), vertices 0 and 2 are structurally equivalent since they have the same neighbor set. Similarly, vertices 1 and 3 are also structurally equivalent. Vertices 4 and 5 are not structurally equivalent, although they are automorphically equivalent.

Isomorphism: Two graphs G_1 and G_2 are isomorphic iff there exists a permutation γ s.t. G_1^γ = G_2, and we use G_1 ≅ G_2 to denote that G_1 and G_2 are isomorphic. To check whether two graphs are isomorphic, canonical labeling (also known as canonical representative or canonical form) is used. A canonical labeling is a function, C, that relabels all vertices of a graph G, such that C(G) ≅ G, and two graphs, G and G′, are isomorphic iff C(G) = C(G′). A common technique used in the literature to determine a canonical labeling is coloring. Below, we introduce coloring, colored graphs, and canonical labeling by coloring. In brief, to get a canonical labeling for a graph G, we first get a colored graph (G, π) for a given coloring π, and we get the canonical labeling for (G, π) using the coloring to prune unnecessary candidates. The canonical labeling obtained for (G, π) is the canonical labeling for the original graph G.

Coloring: A coloring π = [V_1 | V_2 | ... | V_k] is a disjoint partition of V in which the order of the subsets matters. We use Π(V), or simply Π, to denote the set of all colorings of V. Here, a subset V_i is called a cell of the coloring, and all vertices in V_i have the same color. In other words, π associates each v ∈ V with the color π(v), determined by the cell V_i that contains v.

3. RELATED WORKS

Graph isomorphism is an equivalence relation on graphs by which all graphs are grouped into equivalence classes.
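The definitions of Section 2 (the action G^γ on the edge set, and the automorphism test E^γ = E) reduce to a few lines over a normalized edge set. A minimal sketch; the 4-cycle used here is a hypothetical example, not the graph of Fig. 1(a):

```python
def apply_perm(edges, gamma):
    """E^gamma = {(u^gamma, v^gamma) | (u, v) in E}, normalized for an
    undirected graph by sorting each endpoint pair."""
    return {tuple(sorted((gamma[u], gamma[v]))) for (u, v) in edges}

def is_automorphism(edges, gamma):
    """gamma is an automorphism of G iff E^gamma = E."""
    E = {tuple(sorted(e)) for e in edges}
    return apply_perm(E, gamma) == E

# Hypothetical example: the 4-cycle 0-1-2-3-0. Rotating every vertex by
# one position preserves the edge set; swapping only 0 and 1 does not.
C4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(is_automorphism(C4, {0: 1, 1: 2, 2: 3, 3: 0}))  # True
print(is_automorphism(C4, {0: 1, 1: 0, 2: 2, 3: 3}))  # False
```

The normalization to sorted pairs is what makes E^γ = E a plain set comparison for undirected graphs.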
By graph isomorphism, we can distinguish graph properties inherent to the structure of a graph from properties associated with a graph's representation: graph drawings, graph labelings, data structures, etc. From the theoretical viewpoint, the graph isomorphism problem is one of few standard problems in computational complexity theory belonging to NP but not known to belong to either P or NP-complete. It is one of only two, out of 12 total, problems listed in [12] whose complexity remains unresolved. NP-completeness is considered unlikely since it would imply a collapse of the polynomial-time hierarchy [13]. The best currently accepted theoretical algorithm is due to [2, 4], with time complexity e^{O(√(n log n))}. Although the graph isomorphism problem is not known in general to be in P or NP-complete, it can be solved in polynomial time for special classes of graphs, for instance, graphs of bounded degree [23], bounded genus [11, 27], bounded tree-width [5], and, with high probability, random graphs [3]. However, most of these algorithms are unlikely to be useful in practice. In practice, the first practical algorithm to canonically label graphs with hundreds of vertices and graphs with large automorphism groups was nauty [25, 26], developed by McKay. Observing that the set of symmetries of a graph forms a group under functional composition, nauty integrates group-theoretical techniques and utilizes discovered automorphisms to prune the search tree. Motivated by nauty, a number of algorithms, such as bliss [15, 16] and traces [30], have been proposed to address possible shortcomings of nauty's search tree, which we will discuss in Section 4. Another algorithm worth noting is saucy [9]. The data structures and algorithms in saucy take advantage of both the sparsity of input graphs and the sparsity of their symmetries to attain scalability.
Different from nauty-based canonical labeling algorithms, saucy only finds graph symmetries, precisely, a generating set of the automorphism group. All of the algorithms mentioned above have difficulty dealing with real-world massive graphs, and the search trees they use are for pruning purposes, not for answering SSM queries.

4. THE PREVIOUS ALGORITHMS

In this section, we outline the main ideas of the three state-of-the-art algorithms, namely, nauty, bliss and traces. They enumerate all permutations in the symmetric group S_n, add every permutation γ satisfying (G^γ, π^γ) = (G, π) to the automorphism group Aut(G, π), and choose the colored graph (G^γ, π^γ) with the minimum value under some specific function as the canonical labeling. Such enumeration of permutations in S_n is done by a search tree. In the search tree, each node corresponds to a coloring, and each edge is established by individualizing a vertex in a non-singleton cell of the coloring of the parent node. Here, individualizing a vertex means assigning this vertex a unique color. For instance, individualizing vertex 4 in π = [0, 1, 2, 3 | 4, 5, 6 | 7] results in π′ = [0, 1, 2, 3 | 4 | 5, 6 | 7]. The coloring of a child node is strictly finer than the coloring of its parent node, and each leaf node corresponds to a discrete coloring, which is equivalent to a permutation in S_n. By the search tree, each permutation is enumerated once and only once, which implies that the whole search tree contains as many as n! leaf nodes. We now give the details of the search tree. The search tree, denoted as T(G, π), is a rooted labeled tree with labels on both nodes and edges. Here, a node-label is the coloring obtained by individualizing from the root down to the node concerned, and an edge-label is the vertex in G that is individualized from the node-label of the parent node to the node-label of the child node in T(G, π). Fig. 1(b) shows the search tree T(G, π) constructed by bliss for the graph G (Fig.
1(a)), in which a node in T(G, π) is shown as a circled number x, where x is a node identifier. The node identifiers indicate the order in which the nodes are traversed. In Fig. 1(b), the root node is labeled by an equitable coloring [0, 1, 2, 3, 4, 5, 6 | 7], and has 7 child nodes obtained by individualizing one of the vertices in the non-singleton cell. Node 1 is a child node of the root, obtained by individualizing vertex 0 in G. Here, the individualization of 0 is represented as the edge-label of the edge from the root to node 1. The node-label of node 1 represents a finer equitable coloring [6, 5, 4 | 2 | 3, 1 | 0 | 7] compared with the coloring [0, 1, 2, 3, 4, 5, 6 | 7] at the root. In T(G, π), the edge-label sequence (or simply sequence) from the root to a node shows the order of individualization. In Fig. 1(b), node 3 is associated with the sequence 045 and has the node-label coloring [6 | 5 | 4 | 2 | 3, 1 | 0 | 7]. In the following, we also use (G, π, ν) to identify a node in the search tree by the sequence ν from the root to the node. Node 4 is the leftmost leaf node in the search tree, with the discrete coloring π = [6 | 5 | 4 | 2 | 3 | 1 | 0 | 7] and a corresponding permutation γ. In T(G, π), the leftmost leaf node (which corresponds to a colored graph (G^γ, π^γ) with some specific permutation γ) is taken as the reference node. An automorphism, γ′γ^{−1}, will be discovered when traversing a leaf node with permutation γ′ having (G^γ, π^γ) = (G^{γ′}, π^{γ′}). Here, γ^{−1} denotes the inverse of γ. Reconsidering Fig. 1(b), by taking the leftmost leaf node 4 as the reference node, the automorphism (1, 3) is discovered when traversing node 5. The three state-of-the-art algorithms, nauty, bliss and traces, exploit three main techniques, namely, a refinement function R, a target cell selector T, and a node invariant φ, to construct the search tree T(G, π) and prune fruitless subtrees in T(G, π).
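The two basic search-tree operations can be illustrated with a toy sketch: individualization splits one cell, and equitable refinement repeatedly splits cells whose vertices see different multisets of neighbor colors. This is a simplified stand-in for the R used by nauty, bliss, and traces, not their actual implementations (in particular, cell-order conventions differ between tools):

```python
from collections import defaultdict

def individualize(coloring, v):
    """Assign v a unique color by splitting its cell into [v] plus the
    rest. A coloring is a list of cells (lists of vertices)."""
    out = []
    for cell in coloring:
        if v in cell and len(cell) > 1:
            out.append([v])
            out.append([w for w in cell if w != v])
        else:
            out.append(cell)
    return out

def refine(adj, coloring):
    """Naive equitable refinement: split a cell whenever two of its
    vertices have different multisets of neighbor colors; repeat
    until no cell splits (the coloring is equitable)."""
    while True:
        color = {v: i for i, cell in enumerate(coloring) for v in cell}
        new_coloring, split = [], False
        for cell in coloring:
            groups = defaultdict(list)
            for v in cell:
                groups[tuple(sorted(color[w] for w in adj[v]))].append(v)
            split = split or len(groups) > 1
            for sig in sorted(groups):
                new_coloring.append(groups[sig])
        if not split:
            return coloring
        coloring = new_coloring

# Individualizing vertex 4 in [0,1,2,3 | 4,5,6 | 7]:
print(individualize([[0, 1, 2, 3], [4, 5, 6], [7]], 4))
# On a path 0-1-2, refinement separates the middle vertex:
print(refine({0: [1], 1: [0, 2], 2: [1]}, [[0, 1, 2]]))  # [[0, 2], [1]]
```

Each search-tree edge amounts to one `individualize` followed by one `refine`; a leaf is reached when every cell is a singleton.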
In brief, the refinement function R aims at pruning subtrees whose leaf nodes cannot result in any automorphism with the reference node, the target cell selector T selects a non-singleton cell from the coloring at a node to generate its children in the search tree, and the node invariant φ is designed to prune subtrees in which no new automorphism can be found or the canonical labeling cannot exist.

The refinement function R: For every tree node with a sequence ν (the edge-label sequence from the root to the node), the refinement function, R: G × Π × V* → Π, specifies an equitable coloring corresponding to ν and π. Specifically, the refinement is done by giving the vertices in the sequence unique colors and then inferring a coloring of the other vertices such that the resulting coloring is equitable. Mathematically, a refinement function is a function, R: G × Π × V* → Π, such that for any G ∈ G, π ∈ Π and ν ∈ V*, we have the following. (i) R(G, π, ν) ⪯ π. (ii) If v ∈ ν, then {v} is a cell of R(G, π, ν). (iii) For any γ ∈ S_n, R(G^γ, π^γ, ν^γ) = R(G, π, ν)^γ. Revisit the search tree T(G, π) in Fig. 1(b). The refinement function R refines the empty sequence and the unit coloring of the root node by differentiating vertex 7 from the others in G (Fig. 1(a)). Node 1 can be identified by the sequence 0 from the root. R(G, π, 0) individualizes vertex 0 from the coloring associated with the root node, i.e., [0, 1, 2, 3, 4, 5, 6 | 7], resulting in [1, 2, 3, 4, 5, 6 | 0 | 7], which is further refined to the equitable coloring [6, 5, 4 | 2 | 3, 1 | 0 | 7].

The target cell selector T: For a tree node (G, π, ν) identified by a sequence ν, the target cell selector T: G × Π × V* → V selects a non-singleton cell from the coloring R(G, π, ν) to specify the node's children, where each child node is generated by individualizing a vertex in the selected non-singleton cell, if the coloring R(G, π, ν) is not discrete.
Mathematically, a target cell selector is a function, T: G × Π × V* → V, such that for any G ∈ G, π ∈ Π and ν ∈ V*, the following three conditions hold. (i) If R(G, π, ν) is discrete, then T(G, π, ν) = ∅. (ii) If R(G, π, ν) is not discrete, then T(G, π, ν) is a non-singleton cell of R(G, π, ν). (iii) For any γ ∈ S_n, T(G^γ, π^γ, ν^γ) = T(G, π, ν)^γ. The choice of the target cell has a significant effect on the shape of the search tree. Some work [26] uses the first smallest non-singleton cell, while other work [18] uses the first non-singleton cell regardless of its size. In Fig. 1(b), we follow the suggestion of [18]. For instance, the target cell selector T on node 1 chooses the first non-singleton cell {4, 5, 6}, and generates three child nodes (2, 7, and 9) by individualizing vertices 4, 5, and 6, respectively.

The node invariant φ: It assigns each node in the search tree an element from a totally ordered set, and φ is designed with the following properties: (a) φ is isomorphism-invariant on tree nodes, i.e., φ(G^γ, π^γ, ν^γ) = φ(G, π, ν) for any γ ∈ S_n; (b) φ acts as a certificate on leaf nodes, i.e., two leaf nodes share the same value under φ iff they are isomorphic; (c) φ retains the partial ordering between two subtrees rooted at the same level. Mathematically, let Ω be some totally ordered set. A node invariant is a function, φ: G × Π × V* → Ω, such that for any π ∈ Π, G ∈ G, and distinct ν, ν′ ∈ T(G, π), we have the following. (i) If |ν| = |ν′| and φ(G, π, ν) < φ(G, π, ν′), then for every leaf μ ∈ T(G, π, ν) and leaf μ′ ∈ T(G, π, ν′), we have φ(G, π, μ) < φ(G, π, μ′). (ii) If π̄ = R(G, π, ν) and π̄′ = R(G, π, ν′) are discrete, then φ(G, π, ν) = φ(G, π, ν′) ⇔ G^{π̄} = G^{π̄′}. (iii) For any γ ∈ S_n, we have φ(G^γ, π^γ, ν^γ) = φ(G, π, ν). By the node invariant φ, three types of pruning operations are possible.
(1) P_A(ν, ν′) removes the subtree T(G, π, ν′), which contains no automorphism with the reference node, when φ(G, π, ν′) on some node ν′ does not equal φ(G, π, ν). Here, ν is the node on the leftmost path having |ν| = |ν′|. (2) P_B(ν, ν′) removes the subtree T(G, π, ν′), which does not contain the canonical labeling, when φ(G, π, ν′) < φ(G, π, ν). Here, ν is the node on the path whose leaf node is chosen as the current canonical labeling, and |ν| = |ν′|. (3) P_C(ν, ν′) removes the subtree T(G, π, ν′), which contains no new automorphisms, when ν′ = ν^γ where γ is an automorphism already discovered or composable from discovered automorphisms.

5. AN OVERVIEW OF OUR APPROACH

Previous algorithms enumerate all permutations and select the minimum (G, π)^γ as the canonical labeling C(G, π). There are two things to note. The first is that the algorithms use the minimum (G, π)^γ as the target to prune candidates during the enumeration, and the second is that the minimum (G, π)^γ is used for any graph. From a different angle, we consider whether we can use the k-th minimum (G, π)^γ as the canonical labeling C(G, π), where the minimum (G, π)^γ is the special case of k = 1. Recall that (G, π)^γ is represented as a sorted edge list; in other words, all possible (G, π)^γ form a totally ordered set. We observe that there is no need to fix a certain k for any graph, or even to know what the k value is when computing the canonical labeling. We only need to ensure that there is such a k value based on which two graphs are isomorphic iff their corresponding k-th minimum (G, π)^γ are the same. Different from previous algorithms, which are designed to prune candidates, we take a divide-and-conquer approach to partition a graph. We discuss the axis by which a graph is divided, the AutoTree AT(G, π), and its construction.

Axis: We partition (G, π) into a set of vertex-disjoint subgraphs, {(g_1, π_1), (g_2, π_2), ..., (g_k, π_k)}.
We ensure that, by the partition, all automorphisms in (G, π) can be composed from the automorphisms within each (g_i, π_i) and the isomorphisms between subgraphs (g_i, π_i) and (g_j, π_j). In other words, the automorphisms in each (g_i, π_i) and the isomorphisms between (g_i, π_i) and (g_j, π_j) form a generating set for the automorphism group of (G, π). We then compute the canonical labeling C(G, π) from C(g_i, π_i) for every (g_i, π_i). We discuss how to partition (G, π) into subgraphs by symmetry according to an axis, which satisfies the requirements mentioned above. Note that two subgraphs, (g_i, π_i) and (g_j, π_j), are symmetric in (G, π) if there is an automorphism γ that maps (g_i, π_i) to (g_j, π_j). The axis of γ consists of all vertices v having v^γ = v, since they are invariant under γ. We partition (G, π) by such an axis. By removing the vertices in the axis and their adjacent edges from (G, π), (g_i, π_i) and (g_j, π_j) become connected components, and all symmetries by γ in (G, π) are preserved due to the fact that (g_i, π_i) and (g_j, π_j) are isomorphic. We preserve all symmetries by any such automorphism γ with an equitable coloring. Recall that, in an equitable coloring, vertices in singleton cells cannot be automorphic to any other vertices, and thus such vertices, as the common part of all axes, preserve the symmetries of Aut(G, π).

The AutoTree AT(G, π): We illustrate the main idea of our approach in Fig. 2. First, the graph (G, π) is divided into a set of vertex-disjoint colored subgraphs {(g_1, π_1), ..., (g_k, π_k)}. Such a partition can be achieved by the common symmetries given in an equitable coloring obtained by a refinement function R on (G, π).
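The partition step itself, removing an axis and taking the connected components of what remains, can be sketched in a few lines. The graph and axis below are hypothetical illustrations; selecting the axis is the part done by DviCL and is not shown here:

```python
def split_by_axis(adj, axis):
    """Remove the axis vertices (and their incident edges) and return
    the connected components of the remainder; each component induces
    one colored subgraph (g_i, pi_i)."""
    remaining = set(adj) - set(axis)
    seen, components = set(), []
    for s in sorted(remaining):          # deterministic traversal order
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:                     # DFS within the remainder
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w in remaining and w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(comp))
    return components

# Two triangles sharing vertex 0: fixing 0 as the axis splits the rest
# into two symmetric components, {1,2} and {3,4}.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}
print(split_by_axis(adj, [0]))  # [[1, 2], [3, 4]]
```

In this toy example the two components are isomorphic, so any isomorphism between them extends (with the axis vertex fixed) to an automorphism of the whole graph, which is exactly the generating-set property described above.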
Given the canonical labeling C(g_i, π_i) for every subgraph (g_i, π_i), all subgraphs can be sorted and divided into subsets, where subgraphs having the same canonical labeling are grouped into one subset (divided by vertical dashed lines in Fig. 2). The subgraphs in the same subset are symmetric in (G, π), since they are partitioned by symmetry. For instance, in Fig. 2, suppose two subgraphs (g_1, π_1) and (g_2, π_2) have the same canonical labeling; then they are in the same subset. They are isomorphic, and there is a permutation γ such that (g_1, π_1)^γ = (g_2, π_2) by definition.

[Figure 2: An Overview of Our Approach]

[Figure 3: An AutoTree Example]

In general, for (g_i, π_i)^{γ_{ij}} = (g_j, π_j), such a γ_{ij} derives an automorphism in Aut(G, π), and in addition, every automorphism within a single subgraph (g_i, π_i) is also an automorphism in Aut(G, π). In this sense, we have a generating set of the automorphism group Aut(G, π), i.e., Aut(G, π) is completely preserved. Note that two graphs, (G, π) and (G′, π′), are isomorphic iff they generate the same sorted subgraph sets, resulting in the same canonical labeling. As a consequence, DviCL discovers the k-th minimum (G, π)^γ as the canonical labeling. The canonical labeling of each subgraph (g_i, π_i) can be obtained in the same manner, which results in a tree index. In the tree, each node is associated with an automorphism group and a canonical labeling, and the child nodes of each non-leaf node are sorted by canonical labeling. We call such an ordered tree an AutoTree, denoted as AT(G, π), for a given graph (G, π).
Such an AutoTree benefits the discovery of the automorphism group Aut(G, π) and the canonical labeling C(G, π).

We explain the key points of AutoTree AT(g, π) using an example in Fig. 3. Here, we assume the coloring π associated with the colored graph (g, π) is equitable, and show how the AutoTree is constructed for such a colored graph. As shown in Fig. 3, the entire graph g is represented by the root. There are 3 colors in the equitable coloring. Here, two vertices have the same color if they are in the same cell in π. First, the vertex in the singleton cell in π acts as an axis for g and partitions g into three subgraphs (left, right, and mid), where the mid subgraph consists of a single vertex. Second, we construct the sub-AutoTree rooted at the left subgraph. We find that there is a complete subgraph g′ over a set of vertices that all have the same color. We observe that the automorphism group of this subgraph is not affected by removing the edges in g′, and we further divide it into three smaller subgraphs (left, mid, and right). Here, we consider the vertex set of g′ as an additional axis (a) for the subgraph. Third, one of these subgraphs is in turn divided into another 2 subgraphs, each containing a vertex with a unique color. In AT(g, π), two vertices are automorphic when they lie in two subgraphs that are isomorphic and symmetric according to the axis a.

Figure 4: AutoTree for the graph G in Fig. 1(a).

We discuss the key property of AutoTree AT(G, π). For any two automorphic vertices u and u′, the axes recursively divide u and u′ into a series of subgraph pairs ((g_1, g′_1), (g_2, g′_2), ..., (g_k, g′_k)) such that (1) g_1 ⊃ g_2 ⊃ ... ⊃ g_k and g′_1 ⊃ g′_2 ⊃ ... ⊃ g′_k, (2) (g_k, π_k) and (g′_k, π′_k) are leaf nodes in AT(G, π), and (3) g_i and g′_i are isomorphic and symmetric in G. For instance, the two automorphic vertices, 2 and 12, in Fig. 3 are divided into such a series of subgraph pairs. In other words, any two automorphic vertices must be in two leaf nodes of the AutoTree whose corresponding subgraphs have the same canonical labeling and are symmetric. As a consequence, automorphisms between vertices can be detected by comparing the canonical labeling of the leaf nodes containing these vertices. As can be observed in the experiments, (1) most vertices in G are in singleton cells, and (2) non-singleton leaf nodes in AT(G, π) are small in size. By AT(G, π), automorphisms between vertices can therefore be efficiently detected. It is worth mentioning that, in the existing approaches, determining whether two vertices are automorphic needs to compare a set of permutations. The generation of the canonical labeling C(G, π), as we will discuss, can be done in a bottom-up manner, where the canonical labeling of a non-leaf node in AT(G, π) is obtained by combining the canonical labelings of its child nodes, which significantly reduces the cost.

AT(G, π) is a sorted tree. In AT(G, π), the root represents (G, π), and every node represents a subgraph (g, π_g). Here, g is a subgraph of G induced by V(g), and π_g is the projection of π onto V(g). Note that π_g(v) = π(v) for any v ∈ g and any g ⊂ G.
Each node (g, π_g) in AT(G, π) is associated with a canonical labeling C(g, π_g), or equivalently, a permutation γ_g generating C(g, π_g), i.e., (g, π_g)^{γ_g} = C(g, π_g). For any singleton subgraph g = {v}, we define C(g, π_g) = (v^{γ_g}, v^{γ_g}) = (π(v), π(v)). The permutation γ_g can be generated for a node (g, π_g) in three cases: (a) γ_g is trivially obtained for a singleton leaf node, (b) γ_g is generated with the canonical labeling achieved by any existing algorithm (e.g., nauty, bliss, and traces) for a non-singleton leaf node, and (c) γ_g is generated by combining all the canonical labelings of g's children. The canonical labeling of the root node is that of the given graph. Automorphisms of (G, π) can be discovered between two nodes with the same canonical labeling, together with the automorphisms of each subgraph.

Fig. 4 shows the AutoTree AT(G, π) constructed for the graph G in Fig. 1(a). A node in AT(G, π) represents a subgraph (g, π_g) by its V(g) together with its permutation γ_g. Consider the three leaf nodes (singletons) from the left (i.e., the three one-vertex subgraphs), {4}, {5}, and {6}, with colorings π_g = [4], π_g = [5], and π_g = [6], respectively. The permutations for the three subgraphs are γ = 4→0, γ = 5→0, and γ = 6→0. Vertices 4, 5 and 6 are mutually automorphic since these three leaf nodes have the same canonical labeling. The permutation for the parent of the three singletons, γ = [4, 5, 6] → [0, 1, 2], is obtained by combining the canonical labelings of the three singletons. The subgraph {4, 5, 6} does not have symmetric counterparts since no other node has the same canonical labeling.

Figure 5: The Overview of Algorithm DivideI

Figure 6: The Overview of Algorithm DivideS
The 4th leaf node from the left is non-singleton, since it cannot be further divided. We use bliss to obtain its permutation, shown in a dashed rectangle.

In an AutoTree, the permutation γ_g for (g, π_g) is computed as follows. First, v^{γ_g} = π(v) is generated for a singleton leaf node with g = {v}. For example, for the 2nd leaf node from the left in Fig. 4, π(5) = 0, which indicates the cell in the coloring where 5 resides. Second, γ_g is generated by an existing algorithm for a non-singleton leaf node. For example, the 4th leaf node from the left in Fig. 4 is non-singleton; its permutation γ_g is obtained by a backtrack search tree constructed using an existing algorithm. Third, γ_g for a non-leaf node is determined by those of its child nodes.

The AutoTree AT(G, π) Construction: We design an algorithm DviCL to construct an AutoTree AT(G, π) by divide-and-conquer. In the divide phase, DviCL divides (G, π) into a set of subgraphs (g_i, π_{g_i}), each of which becomes a child node of (G, π) in AT(G, π). DviCL recursively constructs the AutoTree AT(g_i, π_{g_i}) rooted at each (g_i, π_{g_i}). In the combine phase, DviCL determines the canonical labeling of (G, π) from the canonical labelings of its child nodes (g_i, π_{g_i}). In the divide phase, two algorithms are used to divide (g, π_g), namely, DivideI and DivideS, by removing edges in g that have no influence on the automorphism group and the canonical labeling of (g, π_g). Consider Fig. 3. DivideI removes edges by finding singleton cells in π_g (e.g., the vertex 1 in g), whereas DivideS removes edges in complete subgraphs or complete bipartite subgraphs (e.g., the complete subgraph in g). A leaf node in AT(G, π) is a node that cannot be divided by DivideI or DivideS.

An overview of DivideI is shown in Fig. 5, where the left shows a tree node (g, π_g) in which vertices v_{s_i} are in singleton cells in π_g, and the right shows the child nodes constructed for (g, π_g) by DivideI.
Isolating the singleton cells in π_g removes the dashed edges in g and partitions g into a set of connected components g_i:

(g, π_g) → ∪_i (v_{s_i}, [v_{s_i}]) ∪ ∪_j (g_j, π_{g_j})

Here, each (v_{s_i}, [v_{s_i}]) represents a one-vertex colored subgraph resulting from a singleton cell in π_g, and each (g_j, π_{g_j}) is a connected component of (g, π_g).

An overview of DivideS is shown in Fig. 6, where the left shows a subgraph (g, π_g) whose vertices are in 4 different cells, V_i, V_j, V_k, and V_l, and the right shows the child nodes constructed for the node that represents (g, π_g) by DivideS. DivideS removes edges in 2 cases. First, DivideS removes all edges from the subgraph induced by a cell V_i if it is a complete subgraph. Second, DivideS removes all edges between 2 different cells V_j and V_k if there is a complete bipartite subgraph between all vertices in V_j and all vertices in V_k. Removing such edges does not affect the automorphism group Aut(g, π_g). By removing such edges, (g, π_g) can possibly be divided into several disconnected components (g_i, π_i):

(g, π_g) → ∪_k (g_k, π_{g_k})

Figure 7: Simplified graph and its AutoTree

Figure 8: AutoTree by DviCL for G (Fig. 1(a)) on Fig. 7(a)(b).

In the combine phase, DviCL generates the permutation γ_g for the node (g, π_g) in the AutoTree. Note that the permutation γ_g is the one that produces the canonical labeling C(g, π_g). First, consider the base case in which (g, π_g) is a leaf node in AT(G, π). If (g, π_g) is a singleton leaf node (e.g., g = {v}), we define g^{γ_g} = π(v).
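The DivideI step above (peel off singleton-cell vertices, then split what remains into connected components) can be sketched as follows; the adjacency-dict interface is an assumption, not the paper's data structure:

```python
from collections import defaultdict

def divide_i(adj, coloring):
    """Sketch of DivideI: peel off every vertex whose colour class is a
    singleton, then return those vertices together with the connected
    components of what remains."""
    cells = defaultdict(list)
    for v, c in coloring.items():
        cells[c].append(v)
    singles = {vs[0] for vs in cells.values() if len(vs) == 1}
    rest = set(adj) - singles
    seen, parts = set(), []
    for v in rest:                      # connected components by DFS
        if v in seen:
            continue
        comp, stack = [], [v]
        seen.add(v)
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in adj[x]:
                if y in rest and y not in seen:
                    seen.add(y)
                    stack.append(y)
        parts.append(sorted(comp))
    return sorted(singles), sorted(parts)

# vertex 0 is the only red vertex (an axis); removing it splits the graph
adj = {0: [1, 2], 1: [0], 2: [0]}
print(divide_i(adj, {0: "red", 1: "blue", 2: "blue"}))
```

Removing the axis vertex 0 disconnects 1 and 2, mirroring how an axis splits a graph into symmetric components.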
If (g, π_g) is a non-singleton leaf node, we obtain γ_g by CombineCL. Here, CombineCL first applies an existing approach to generate a canonical labeling γ* for (g, π_g). With γ*, vertices sharing the same color, i.e., in the same cell in π_g, are differentiated by the ordering introduced by γ*. Second, consider the case when (g, π_g) is a non-leaf node. CombineST exploits the canonical labelings of the child nodes of (g, π_g) (i.e., γ_{g_i} and (g_i, π_{g_i})^{γ_{g_i}} of each (g_i, π_{g_i})) to determine an ordering that can differentiate vertices in the same cell in π_g, and obtains γ_g in a similar manner.

6. THE NEW APPROACH

We give our DviCL algorithm in Algorithm 1. Given a colored graph (G, π), DviCL constructs an AutoTree AT(G, π) for (G, π) (Line 4). Note that the canonical labeling C(G, π) at the root of AT(G, π) acts as the canonical labeling of the given graph. In DviCL, the given coloring π is refined to be equitable by a refinement function R, for instance, the Weisfeiler-Lehman algorithm [33], which is further exploited to assign each vertex v a color π(v) (Lines 1-2). Then, DviCL applies procedure cl to construct the AutoTree AT(G, π) (Line 3). We discuss Procedure cl in detail. Procedure cl constructs AT(g, π_g) rooted at (g, π_g), for a colored subgraph (g, π_g) ⊂ (G, π), following the divide-and-conquer paradigm. AT(g, π_g) is initialized with the root node (g, π_g) (Line 6). cl divides (g, π_g) into a set of subgraphs (g_i, π_{g_i}), each of which becomes a child node of (g, π_g), utilizing Algorithm DivideI (Algorithm 2) and Algorithm DivideS (Algorithm 3) (Lines 11-12). cl recursively constructs subtrees AT(g_i, π_{g_i}) rooted at each (g_i, π_{g_i}) (Lines 13-14) and identifies the canonical labeling C(g, π_g) for the root node (g, π_g) utilizing Algorithm CombineST (Algorithm 5) (Line 15). The base cases occur when either g contains a single vertex (Lines 7-8) or (g, π_g) cannot be disconnected by DivideI or DivideS (Lines 9-10). For the former case, obtaining C(g, π_g) is trivial. For the latter case, C(g, π_g) can be achieved by applying Algorithm CombineCL (Algorithm 4), which exploits the canonical labeling γ* by existing algorithms like bliss.

Algorithm 1: DviCL(G, π)
1: π = [V_1 | V_2 | ... | V_k] ← R(G, π);
2: assign each vertex v its color π(v);
3: AT(G, π) ← cl(G, π);
4: return C(G, π) at the root of AT(G, π);

Procedure cl(g, π_g)
6: initialize AT(g, π_g) with root node (g, π_g);
7-8: if g contains a single vertex then obtain C(g, π_g) trivially; return AT(g, π_g);
9-10: if neither DivideI nor DivideS can disconnect (g, π_g) then C(g, π_g) ← CombineCL(g, π_g); return AT(g, π_g);
11: ∪_{1≤i≤k} (g_i, π_{g_i}) ← DivideI(g, π_g) (DivideS(g, π_g));
12: construct tree edges ((g, π_g), (g_i, π_{g_i})) for all i;
13-14: for i from 1 to k do AT(g_i, π_{g_i}) ← cl(g_i, π_{g_i});
15: C(g, π_g) ← CombineST(g, π_g);
16: return AT(g, π_g);

Recall that structurally equivalent vertices must be automorphically equivalent. This property can be applied to simplify (G, π) and improve the performance of DviCL. Specifically, the vertices in V are partitioned into a number of structurally equivalent subsets. The vertices in each non-singleton subset S are simplified by retaining only one vertex v of the subset, and the colored graph (G, π) is simplified accordingly. When constructing the AutoTree, the leaf node containing v is extended either by adding a number of sibling leaf nodes, each containing a vertex in S, if the leaf node containing v is singleton, or by adding the vertices in S to the subgraph of the leaf node otherwise. Fig. 7(a) and Fig. 7(b) illustrate the simplified graph G_s of the graph G in Fig. 1(a) and its AutoTree, respectively. For simplicity, the AutoTree in Fig. 7(b) contains the tree structure without any information such as the canonical labeling on each tree node. In the example graph G, shown in Fig.
1(a), there are two non-singleton structurally equivalent subsets. Therefore, in the simplified graph G_s, shown in Fig. 7(a), vertices 2 and 3, along with their adjacent edges, are removed. Based on the simplified graph and its AutoTree AT(G_s, π_s), the AutoTree AT(G, π) of (G, π) is constructed by extending the leaf nodes containing vertices 0 and 1, as shown in Fig. 8. It is worth noting that different approaches, or even different implementations, can generate different canonical labelings, while each approach, or implementation, will generate the same canonical labeling for isomorphic graphs. For instance, the canonical labelings of the root nodes in Fig. 4 and Fig. 8 are different.

DivideI and DivideS: We show DivideI and DivideS in Algorithm 2 and Algorithm 3. Both algorithms take a colored graph (g, π_g) as input and attempt to divide (g, π_g) into a set of subgraphs (g_i, π_{g_i}). DivideI isolates each singleton cell {v_{s_i}} in π_g as a colored subgraph (v_{s_i}, [v_{s_i}]) of (g, π_g) (Lines 2-3). Each connected component g_i resulting from the isolation yields a colored subgraph (g_i, π_{g_i}) of (g, π_g) (Lines 4-5). On the other hand, DivideS divides (g, π_g) based on Theorem 6.2 (Lines 1-6). Similar to DivideI, each connected component g_i yields a colored subgraph (g_i, π_{g_i}) of (g, π_g) (Lines 8-9).

We first discuss the properties of the refinement function R. In DviCL, we apply the Weisfeiler-Lehman algorithm [33] as the refinement function R. As proved in [33], only vertices in the same cell of the resulting equitable coloring π can possibly be automorphically equivalent. In DviCL, only the coloring π for G is obtained by the refinement function R; all the other colorings, i.e., π_g for subgraphs g, are obtained by projecting π onto V(g). The following theorem proves the equivalence between projecting π onto V(g) and applying R on (g, π_g).
Theorem 6.1: π_g, the projection of π onto V(g) by DivideI and DivideS, inherits the properties of π. Specifically, (1) only vertices in the same cell in π_g can be automorphically equivalent, and (2) π_g is equitable with respect to g.

Proof Sketch: The first property can be proved trivially. We focus on the second property, and prove the claim by mathematical induction. Assume g is a connected component in g′ that emerges due to either DivideI or DivideS, and π_{g′} satisfies the second property. In either case, the edges removed are those between two cells in π_{g′}. Then, for any two vertices in the same cell in π_g, they either retain all neighbors or lose all neighbors in any cell in π_{g′}, i.e., π_g is equitable with respect to g. □

Lemma 6.1: Given a colored graph (g, π_g), for any cell V_i ∈ π_g, if the subgraph induced by V_i is a clique, removing the edges among the vertices in V_i, i.e., E_i = {(u, v) | u, v ∈ V_i, u ≠ v} ∩ E(g), will not influence the automorphism group of (g, π_g).

Proof Sketch: Let g′ denote the graph after removing the edges E_i from g, and Aut(g′, π_{g′}) denote its automorphism group. By Theorem 6.1, π_{g′} = π_g. For simplicity, we use π_g for π_{g′} below. Consider automorphisms γ ∈ Aut(g, π_g) and γ′ ∈ Aut(g′, π_g). We need to prove γ ∈ Aut(g′, π_g) and γ′ ∈ Aut(g, π_g), respectively. We prove γ′ ∈ Aut(g, π_g); γ ∈ Aut(g′, π_g) can be proved in a similar manner. Consider v ∈ V_i. Since v and v^{γ′} are automorphic, v and v^{γ′} must be in the same cell in π_g, i.e., v^{γ′} ∈ V_i, implying that V_i^{γ′} = V_i. As a consequence, for any edge (u, v) ∈ E_i, (u^{γ′}, v^{γ′}) ∈ E_i. Therefore, (g, π_g)^{γ′} = (g, π_g), i.e., γ′ ∈ Aut(g, π_g). □

Lemma 6.2: Given a colored graph (g, π_g), for any two cells V_i and V_j in π_g, let E_ij denote the edges between V_i and V_j, i.e., E_ij = {(u, v) | u ∈ V_i, v ∈ V_j} ∩ E(g).
If the subgraph (V_i ∪ V_j, E_ij) is a complete bipartite graph, removing the edges E_ij will not influence the automorphism group of (g, π_g).

Proof Sketch: The proof is similar to that of Lemma 6.1. □

Note that DivideI is a special case of Lemma 6.2.

Theorem 6.2: Given a colored graph (g, π_g), applying DivideI and DivideS on (g, π_g) retains the automorphism group of (g, π_g). In other words, removing the following two classes of edges will not influence Aut(g, π_g): (1) the edges among the vertices in V_i, i.e., E_i = {(u, v) | u, v ∈ V_i, u ≠ v} ∩ E(g), if the subgraph induced by V_i is a clique; (2) the edges between V_i and V_j, i.e., E_ij = {(u, v) | u ∈ V_i, v ∈ V_j} ∩ E(g), if the subgraph (V_i ∪ V_j, E_ij) is a complete bipartite graph.

Proof Sketch: It can be proved by Lemma 6.1 and Lemma 6.2. □

Lemma 6.3: Given two isomorphic graphs (G, π) and (G′, π′), if they are simplified by either DivideI or DivideS, then the remaining graphs are isomorphic. Specifically, each remaining graph can be partitioned into a subgraph set, i.e., {(g_i, π_{g_i})} for (G, π) and {(g′_i, π′_{g_i})} for (G′, π′), and the two subgraph sets can be sorted such that (g_i, π_{g_i}) ≅ (g′_i, π′_{g_i}).

Algorithm 2: DivideI(g, π_g)
1: S ← ∅;
2-3: for each singleton cell {v_{s_i}} in π_g do S ← S ∪ {(v_{s_i}, [v_{s_i}])}; g ← g \ v_{s_i};
4-5: for each connected component g_i in g do S ← S ∪ {(g_i, π_{g_i})};
6: return S;

Algorithm 3: DivideS(g, π_g)
1-3: for each cell V_i ∈ π_g do if V_i induces a clique then remove all edges between vertices in V_i;
4-6: for any two distinct cells V_i and V_j in π_g do if the edges between V_i and V_j form a complete bipartite graph then remove all edges between V_i and V_j;
7: S ← ∅;
8-9: for each connected component g_i in g after removing edges do S ← S ∪ {(g_i, π_{g_i})};
10: return S;
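The two edge-removal tests of DivideS can be sketched as follows. The sketch assumes an adjacency dict and cells given as vertex sets (both assumptions), and uses the fact, exploited later in the complexity analysis, that under an equitable coloring every vertex of a cell has the same number of neighbors in any other cell, so checking one representative per cell suffices:

```python
def removable(adj, cell_i, cell_j=None):
    """Sketch of the two DivideS tests.  With one cell: does it induce a
    clique?  With two cells: do they induce a complete bipartite graph?
    Under an equitable coloring, checking a single vertex per cell is
    enough, which is what lets DivideS run in O(m)."""
    if cell_j is None:
        v = next(iter(cell_i))
        # clique iff the representative has |V_i| - 1 neighbours in V_i
        return sum(1 for u in adj[v] if u in cell_i) == len(cell_i) - 1
    v, w = next(iter(cell_i)), next(iter(cell_j))
    # complete bipartite iff each side sees the whole other side
    return (sum(1 for u in adj[v] if u in cell_j) == len(cell_j)
            and sum(1 for u in adj[w] if u in cell_i) == len(cell_i))

adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1], 3: [0, 1]}
print(removable(adj, {0, 1}))          # {0,1} induces a clique -> True
print(removable(adj, {0, 1}, {2, 3}))  # complete bipartite -> True
```

By Theorem 6.2, both edge sets in this toy graph could be removed without changing the automorphism group.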
Proof Sketch: Similar to the proofs of automorphism retainment, i.e., Lemma 6.1 and Lemma 6.2, we prove that each edge-set removal retains any isomorphism between (G, π) and (G′, π′). Without loss of generality, we prove the case of removing edges E_i from (G, π) and removing E′_i from (G′, π′) simultaneously. Here, E_i and E′_i are as defined in Lemma 6.1, and the vertices in the corresponding cells V_i and V′_i have the same color. This property can be easily extended to prove Lemma 6.3. Let (g, π_g) and (g′, π_{g′}) denote the remaining graphs after removing E_i from (G, π) and E′_i from (G′, π′), respectively. Here, π_g = π and π_{g′} = π′ by Theorem 6.1. Denote by γ an arbitrary isomorphism between (G, π) and (G′, π′), i.e., (G, π)^γ = (G′, π′). We prove that (g, π)^γ = (g′, π′). First, by isomorphism, we have V_i^γ = V′_i. Since both V_i and V′_i induce complete subgraphs, for any edge (u, v) ∈ E_i, (u, v)^γ ∈ E′_i. As a consequence, E_i^γ = E′_i. Second, since (g, π)^γ = (G^γ \ E_i^γ, π′) and (g′, π_{g′}) = (G′ \ E′_i, π′), we have (g, π)^γ = (g′, π′). □

Theorem 6.3: Given two isomorphic graphs (G, π) and (G′, π′), the structures of AT(G, π) and AT(G′, π′) are the same. Here, the structure of an AutoTree is the tree without any labels.

Proof Sketch: The proof can be constructed by mathematical induction on each tree node with Lemma 6.3. □

Time complexity of DivideI and DivideS: It is easy to see that the time complexity of DivideI is O(m), as each component of DivideI costs O(m). We focus on the time complexity of DivideS. Recall that a coloring π is equitable with respect to a graph G if, for all vertices v_1, v_2 ∈ V_i, they have the same number of neighbors in each V_j. This property can be utilized to accelerate DivideS.
Specifically, DivideS assigns each cell V_i a vector, where the j-th element maintains the number of neighbors in V_j of each vertex v ∈ V_i. Then, checking whether V_i forms a clique is equivalent to checking whether the i-th element of the vector equals |V_i| − 1, and checking whether V_i and V_j form a complete bipartite graph is equivalent to checking whether the j-th element of the vector equals |V_j|. Therefore, each component of DivideS also costs O(m), i.e., DivideS costs O(m).

CombineCL and CombineST: We discuss the algorithms CombineCL and CombineST, which generate the canonical labeling C(g, π_g) for the input colored graph (g, π_g). CombineCL, shown in Algorithm 4, generates γ_g for a non-singleton leaf node exploiting the canonical labeling γ* obtained by existing approaches (Line 1). γ* introduces a total order among the vertices in the same cell in π_g which, together with the vertex color π(v) due to π, yields the canonical labeling γ_g (Lines 2-3). The canonical labeling C(g, π_g) can then be trivially obtained as (g, π_g)^{γ_g} (Line 4).

Algorithm 4: CombineCL(g, π_g)
1: γ* ← bliss(g, π_g);
2-3: for each vertex v ∈ V(g) do v^{γ_g} ← π(v) + |{u | π_g(u) = π_g(v), u^{γ*} < v^{γ*}}|;
4: C(g, π_g) = (g, π_g)^{γ_g};
5: return C(g, π_g);

On the other hand, CombineST, shown in Algorithm 5, generates γ_g for a non-leaf node by combining the canonical labelings of its child nodes (g_i, π_{g_i}). The canonical labelings C(g_i, π_{g_i}) introduce a total order between vertices in different subgraphs (Lines 1-2), and the canonical labeling γ_{g_i} introduces a total order among the vertices in the same subgraph (g_i, π_{g_i}) (Line 3). These two orders determine a total order between the vertices in the same cell in π_g, resulting in the canonical labeling γ_g for (g, π_g) (Lines 4-5), in a similar manner. The canonical labeling C(g, π_g) can be obtained as (g, π_g)^{γ_g} (Line 6).
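The relabeling rule of CombineCL (Line 2-3 of Algorithm 4) can be sketched directly; the dict-based interfaces below are assumptions, and the `gamma_star` positions stand in for the labeling found by an existing tool such as bliss:

```python
def combine_cl(cell_color, gamma_star):
    """Sketch of CombineCL's relabeling rule:
        v^{gamma_g} = pi(v) + |{u : same cell as v, u^{gamma*} < v^{gamma*}}|
    cell_color maps a vertex to its cell's colour; gamma_star maps a vertex
    to its position in the labeling found by an existing algorithm."""
    return {v: cell_color[v]
               + sum(1 for u in cell_color
                     if cell_color[u] == cell_color[v]
                     and gamma_star[u] < gamma_star[v])
            for v in cell_color}

# one cell {4,5,6} of colour 4; gamma_star orders them 5 < 4 < 6
print(combine_cl({4: 4, 5: 4, 6: 4}, {5: 0, 4: 1, 6: 2}))
```

Vertices of one cell receive consecutive labels starting at the cell's color, in the order imposed by γ*, which is exactly how same-colored vertices are differentiated.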
Lemma 6.4: For two leaf nodes (g_1, π_{g_1}) and (g_2, π_{g_2}) in the AutoTree, if they are symmetric in (G, π), i.e., there is a permutation γ ∈ Aut(G, π) such that (g_1, π_{g_1})^γ = (g_2, π_{g_2}), then C(g_1, π_{g_1}) = C(g_2, π_{g_2}) by Algorithm CombineCL.

Proof Sketch: The proof is trivial when (g_1, π_{g_1}) and (g_2, π_{g_2}) are singleton, since the vertices are in the same cell in π. We focus on the case when (g_1, π_{g_1}) and (g_2, π_{g_2}) are non-singleton. For ease of discussion, we assume the vertices in g_1 and g_2 are relabeled from 1 to k, respectively, where k = |V(g_1)| = |V(g_2)|. Since (g_1, π_{g_1}) and (g_2, π_{g_2}) are symmetric in (G, π), they are isomorphic, i.e., (g_1, π_{g_1})^{γ_1} = (g_2, π_{g_2})^{γ_2}, where γ_1 and γ_2 are the corresponding permutations by bliss. Let v ∈ g_1 and u ∈ g_2 be two vertices having v^{γ_1} = u^{γ_2}; we prove v^{γ_{g_1}} = u^{γ_{g_2}}. Let v′ ∈ g_1 and u′ ∈ g_2 be two vertices having v′^{γ_1} = u′^{γ_2}. If π(v′) = π(v) and v′^{γ_1} < v^{γ_1}, then π(u′) = π(v′) = π(v) = π(u) and u′^{γ_2} < u^{γ_2}. The reverse also holds, implying that v′ and u′ have the same influence on v^{γ_{g_1}} and u^{γ_{g_2}}. If π(v′) ≠ π(v), then π(u′) ≠ π(u), i.e., v′ and u′ have no influence (which is also the same) on v^{γ_{g_1}} and u^{γ_{g_2}}. As a consequence, v^{γ_{g_1}} = u^{γ_{g_2}}, i.e., C(g_1, π_{g_1}) = C(g_2, π_{g_2}). □

Lemma 6.5: For two non-leaf nodes (g_1, π_{g_1}) and (g_2, π_{g_2}), if (g_1, π_{g_1}) and (g_2, π_{g_2}) are symmetric in (G, π), i.e., there is a permutation γ ∈ Aut(G, π) such that (g_1, π_{g_1})^γ = (g_2, π_{g_2}), then C(g_1, π_{g_1}) = C(g_2, π_{g_2}) by CombineST.

Proof Sketch: We prove the base case, i.e., the child nodes of (g_1, π_{g_1}) and (g_2, π_{g_2}) are leaf nodes; the other cases can be proved by mathematical induction. First, by Lemma 6.3, the child nodes of (g_1, π_{g_1}) and (g_2, π_{g_2}) can be sorted such that each pair (g_i, π_{g_i}) and (g_j, π_{g_j}) are isomorphic. Second, by Lemma 6.4, C(g_i, π_{g_i}) = C(g_j, π_{g_j}).
Then, for any vertex pair v ∈ g_i and u ∈ g_j having v^{γ_{g_i}} = u^{γ_{g_j}}, we have (1) there are the same numbers of subgraph pairs (g′_i, π′_{g_i}) and (g′_j, π′_{g_j}) that are isomorphic to and share the same canonical labeling with (g_i, π_{g_i}) and (g_j, π_{g_j}), while (g′_i, π′_{g_i}) is sorted before (g_i, π_{g_i}) and (g′_j, π′_{g_j}) is sorted before (g_j, π_{g_j}); and (2) there are the same numbers of vertex pairs v′ ∈ g_i and u′ ∈ g_j with v′^{γ_{g_i}} = u′^{γ_{g_j}} having v′^{γ_{g_i}} < v^{γ_{g_i}} and u′^{γ_{g_j}} < u^{γ_{g_j}}. As a consequence, v^{γ_{g_1}} = u^{γ_{g_2}}; in other words, C(g_1, π_{g_1}) = C(g_2, π_{g_2}). □

Algorithm 5: CombineST(g, π_g)
1: sort the child nodes (g_i, π_{g_i}) of (g, π_g) in non-descending order of C(g_i, π_{g_i});
2: sort the vertices in each cell in π_g, s.t., u is before v if u ∈ g_i, v ∈ g_j, and i < j;
3: sort the vertices in each cell in π_{g_i}, s.t., u is before v if u^{γ_{g_i}} < v^{γ_{g_i}};
4-5: for each vertex v ∈ V(g) do v^{γ_g} ← π(v) + |{u | π_g(u) = π_g(v), u is before v}|;
6: C(g, π_g) = (g, π_g)^{γ_g};
7: return C(g, π_g);

The following theorem gives the correctness of DviCL.

Theorem 6.4: Given two graphs (G_1, π_1) and (G_2, π_2), they are isomorphic iff the canonical labelings C(G_1, π_1) and C(G_2, π_2) by DviCL satisfy C(G_1, π_1) = C(G_2, π_2).

Proof Sketch: We construct an auxiliary graph G containing G_1, G_2, and a vertex u connected to every vertex in G_1 and G_2. It is easy to see that u is distinct from any other vertex in G, and that G_1 and G_2 are symmetric in G. Therefore, the root of the AutoTree AT(G, π) has three child nodes, (u, π_u), (G_1, π_1), and (G_2, π_2). According to Lemma 6.4 and Lemma 6.5, C(G_1, π_1) = C(G_2, π_2). □

Theorem 6.5: In (G, π), if two vertices are symmetric, they are in two leaf nodes of AT(G, π) sharing the same canonical labeling.

Proof Sketch: It can be proved by Lemma 6.4 and Lemma 6.5, since a leaf node cannot be isomorphic to a non-leaf node.
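The auxiliary-graph construction in the proof of Theorem 6.4 can be sketched concretely; the sketch assumes both input graphs use vertex labels 0..n-1 as adjacency dicts, which is an assumption of this illustration rather than a requirement of the paper:

```python
def auxiliary_graph(adj1, adj2):
    """Sketch of the construction in the proof of Theorem 6.4: disjointly
    union G1 and G2 (G2's vertices shifted by |V(G1)|), then add an apex
    vertex u adjacent to everything.  G1 and G2 are isomorphic iff they
    are symmetric inside this auxiliary graph."""
    n1 = len(adj1)
    adj = {v: set(ns) for v, ns in adj1.items()}
    adj.update({v + n1: {w + n1 for w in ns} for v, ns in adj2.items()})
    u = len(adj)                      # the fresh apex vertex
    adj[u] = set(adj) - {u}           # u connects to every other vertex
    for v in list(adj):
        if v != u:
            adj[v].add(u)
    return adj

# two copies of a single edge; the apex is vertex 4
aux = auxiliary_graph({0: [1], 1: [0]}, {0: [1], 1: [0]})
print(sorted(aux[4]))  # [0, 1, 2, 3]
```

The apex has a unique degree, so it lands in a singleton cell, and the two copies become symmetric siblings under the root of the AutoTree, reducing the isomorphism test to a comparison of their canonical labelings.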
□

We now revisit the previous algorithms, e.g., nauty, bliss, and traces, as well as our approach DviCL. As mentioned, the previous algorithms enumerate all possible permutations and select the minimum (G, π)^γ as the canonical labeling. On the other hand, our approach DviCL constructs a tree index, the AutoTree AT, that recursively partitions the given graph (G, π) into subgraphs. By partitioning, DviCL exploits properties of each (s, π_s) that enable canonical labeling computation by combining, without enumeration. The canonical labeling for each node (s, π_s) in AT(G, π) is either the minimum (s, π_s)^γ, for a leaf node, or the k-th minimum (s, π_s)^γ obtained by combining the canonical labelings of its child nodes, for a non-leaf node. Note that the k values differ across tree nodes. As a consequence, DviCL returns the k-th minimum (G, π)^γ as the canonical labeling and ensures that k is the same for isomorphic graphs.

Time complexity of CombineCL and CombineST: For CombineCL, it is easy to see that the most time-consuming parts are invoking an existing canonical labeling algorithm to generate γ* (Line 1) and generating the canonical labeling C(g, π_g) (Line 4). Therefore, the time complexity of CombineCL is O(X + |E(g)| ln(|E(g)|)), where X is the time complexity of the canonical labeling algorithm. Similarly, the most time-consuming parts of CombineST are determining the total order between the different child nodes of (g, π_g) (Line 1) and generating the canonical labeling (Line 6), since the other parts cost either O(|V(g)| log(|V(g)|)) or O(|V(g)|). Therefore, the time complexity of CombineST is O(|E(g)| ln(|E(g)|)).

We propose Algorithm SSM-AT (Algorithm 6) for SSM, following the divide-and-conquer paradigm. SSM-AT is designed around the properties of the AutoTree AT.
Specifically, two tree nodes sharing the same canonical labeling imply that the two corresponding subgraphs in G are symmetric, and one isomorphism between these two subgraphs can be easily obtained. SSM-AT(G, q, AT(g)) finds all symmetric subgraphs of q in the subtree of AT rooted at g, i.e., in a subgraph g of G. SSM-AT first finds the minimal subgraph, a tree node n_q in AT, that contains q (Line 1). Then, the symmetric subgraphs of q in n_q can be extended to those in the subgraphs n_{q′} that are symmetric to n_q by an isomorphism γ_{qq′} from n_q to n_{q′}, constituting the symmetric subgraphs of q in G (Lines 13-14). The symmetric subgraphs of q in n_q can be found by divide-and-conquer. The base cases occur when n_q is a leaf node or n_q = q; then an existing subgraph isomorphism algorithm SM can be applied, or n_q is returned as the result (Lines 2-3). Otherwise, q is divided into subgraphs {q_1, ..., q_k} by the children of n_q, where q_i is contained in n_i (Line 5). The symmetric subgraphs of q_i in n_i can be found recursively by SSM-AT(G, q_i, AT(n_i)), and mapped to those in each n_j that is symmetric to n_i (Lines 6-9). As a consequence, each symmetric subgraph of q in n_q can be composed of mosaic subgraphs in {n_{1′}, ..., n_{k′}}, where n_{i′} is n_i or a sibling node symmetric to n_i (Lines 11-12).

Algorithm 6: SSM-AT(G, q, AT(g))
1: find the tree node n_q ∈ AT(g) with max depth that contains q;
2-3: if n_q is a leaf node or n_q = q then S ← SM(n_q, q) or S ← n_q;
4-5: else divide q into subgraphs {q_1, ..., q_k}, contained in children {n_1, ..., n_k} of n_q;
6-7: for each (q_i, n_i) do S_i ← SSM-AT(G, q_i, AT(n_i));
8-9: for each child n_j of n_q that shares the same canonical labeling with n_i do S_j ← S_i^{γ_ij};
10-12: S ← ∅; for each {n_{1′}, ..., n_{k′}} do S ← S ∪ S_{1′} × ... × S_{k′};
13-14: for each n_{q′} sharing the same canonical labeling with n_q do S ← S ∪ S^{γ_{qq′}};
15: return S;
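The match-extension steps of SSM-AT (Lines 8-9 and 13-14) amount to pushing already-found matches through a known isomorphism. A minimal sketch, with hypothetical matches and vertex mapping:

```python
def extend_matches(matches, gamma):
    """Sketch of SSM-AT's extension steps: once the symmetric subgraphs of
    q inside one tree node are known, the matches in any sibling node with
    the same canonical labeling are simply their images under the
    isomorphism gamma between the two nodes."""
    return [tuple(gamma[v] for v in m) for m in matches]

# a match found in n_i, and gamma mapping n_i's vertices onto n_j's
print(extend_matches([(3, 2, 6)], {3: 13, 2: 12, 6: 16}))
```

No subgraph search runs in the sibling node at all, which is where SSM-AT gains over generic subgraph matching.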
Since the majority of leaf nodes are singleton, SSM-AT is efficient and robust. On the other hand, existing subgraph matching algorithms have several drawbacks: (1) their time complexity is not bounded; (2) they find many more candidate matchings than the result; (3) verifying the symmetry between a matching g and the query graph q is non-trivial; and (4) there is no guarantee of finding all symmetric subgraph matchings.

Example 6.1: Consider the AutoTree AT in Fig. 3, where all the leaf nodes are singleton, and leaf nodes with the same canonical labeling correspond to the vertices with the same color. Consider an SSM query q, 3-2-6, on g. We find the symmetric subgraphs of q in one subtree, and those in its symmetric sibling can be extended by the corresponding isomorphism γ. The symmetric subgraphs of q are divided into those of q_1, 3-2, and q_2, 6, each found recursively in the child node containing it and extended to that child's symmetric siblings. As a consequence, the symmetric subgraphs of q can be composed from these partial results, and those in the symmetric sibling subtree can be obtained by applying γ.

7. EXPERIMENTAL STUDIES

We conducted extensive experimental studies using 22 large real graphs and 9 benchmark graphs to test how DviCL improves nauty [26], bliss [15], and traces [30], using their latest distributed versions, i.e., nauty-2.6r10, traces-2.6r10 (http://pallini.di.uniroma1.it/) and bliss-0.73. Below, we use DviCL+X to indicate that X is used to compute the canonical labeling for non-singleton leaf nodes in AT. We have DviCL+n, DviCL+b, and DviCL+t, where n, b, and t stand for nauty, bliss, and traces, respectively. All algorithms are implemented in C++ and compiled by gcc 4.8.2, and tested on a machine with a 3.40GHz Intel Core i7-4770 CPU and 32GB RAM, running Linux. Times are reported in seconds, and we set the time limit to 2 hours.
Graph         |V|        |E|          d_max    d_avg   cells      singleton
Amazon        403,394    2,443,408    2,752    12.11   396,034    390,706
BerkStan      685,230    6,649,470    84,230   19.41   387,172    316,162
Epinions      75,879     405,740      3,044    10.69   53,067     45,552
Gnutella      62,586     147,892      95       4.73    46,098     38,216
Google        875,713    4,322,051    6,332    9.87    525,232    424,563
LiveJournal   4,036,538  34,681,189   14,815   17.18   3,703,527  3,518,490
NotreDame     325,729    1,090,108    10,721   6.69    115,038    89,791
Pokec         1,632,803  22,301,964   14,854   27.32   1,586,176  1,561,671
Slashdot0811  77,360     469,180      2,539    12.13   61,457     56,219
Slashdot0902  82,168     504,229      2,552    12.27   65,264     59,384
Stanford      281,903    1,992,636    38,625   14.14   168,967    133,992
WikiTalk      2,394,385  4,659,563    100,029  3.89    553,199    498,161
wikivote      7,115      100,762      1,065    28.32   5,789      5,283
Youtube       1,138,499  2,990,443    28,754   5.25    684,471    585,349
Orkut         3,072,627  117,185,083  33,313   11.19   3,042,918  3,028,961
BuzzNet       101,163    2,763,066    64,289   54.63   77,588     76,758
Delicious     536,408    1,366,136    3,216    5.09    263,961    221,669
Digg          771,229    5,907,413    17,643   15.32   445,181    400,605
Flixster      2,523,386  7,918,801    1,474    6.28    1,047,509  928,445
Foursquare    639,014    3,214,986    106,218  10.06   364,447    315,108
Friendster    5,689,498  14,067,887   4,423    4.95    2,135,136  1,973,584
Lastfm        1,191,812  4,519,340    5,150    7.58    675,962    609,605

Table 1: Summarization of real graphs

Graph              |V|     |E|      d_max  d_avg  cells   singleton
ag2-49             4,851   120,050  50     49.49  2       0
cfi-200            2,000   3,000    3      3      800     0
difp-21-0-wal-rcr  16,927  44,188   1,526  5.22   16,215  15,755
fpga11-20-uns-rcr  5,100   9,240    21     3.62   3,531   2,418
grid-w-3-20        8,000   24,000   6      6      1       0
had-256            1,024   131,584  257    257    1       0
mz-aug-50          1,000   2,300    6      4.6    250     0
pg2-49             4,902   122,550  50     50     1       0
s3-3-3-10          12,974  23,798   26     3.67   9,146   5,318

Table 2: Summarization of benchmark graphs

Datasets: The 22 large real-world graphs include social networks (Epinions, LiveJournal, Pokec, Slashdot0811, Slashdot0902, wikivote, Youtube, Orkut, BuzzNet, Delicious, Digg, Flixster, Foursquare and Friendster), web graphs (BerkStan, Google, NotreDame, Stanford), a peer-to-peer network
(Gnutella), a product co-purchasing network (Amazon), a communication network (WikiTalk), and a music website (Lastfm). All these datasets are available online. (For each dataset, we remove directions if included, and delete all self-loops and multi-edges if they exist.) The detailed information of the real-world datasets is summarized in Table 1, where, for each graph, the 2nd and 3rd columns show the numbers of vertices and edges, the 4th and 5th columns show the maximum degree and average degree, and the 6th and 7th columns show the numbers of cells and singleton cells in the orbit coloring. As shown in Table 1, the majority of the cells in the orbit coloring are singleton cells. This property makes DivideI and DivideS effective, since the partition (Theorem 6.2) is more likely to happen when subgraphs get smaller. For the 9 benchmark graphs, we select the largest one in each family of graphs used in the bliss collection [15]. Detailed descriptions of each benchmark graph can be found in [15]. A similar summarization is given in Table 2.

Below, we first demonstrate the structure of the AutoTrees constructed, and use the observations made to explain the efficiency and performance of our approaches DviCL+X, which will be confirmed when we illustrate the performance of DviCL+X and X.

The Structure of AutoTree: Table 3 demonstrates the structure of the AutoTrees constructed for real graphs by DviCL+X. Note that for the same graph, the three DviCL+X algorithms construct the same AutoTree. The 2nd, 3rd and 4th columns show the numbers of total nodes,
Graph         |V(AT)|    singleton  non-singleton  avg size  depth
Amazon        407,032    403,388    1              6         3
BerkStan      709,702    681,680    118            30.08     5
Epinions      76,919     75,879     0              0         3
Gnutella      62,598     62,586     0              0         2
Google        910,617    874,908    71             11.34     5
LiveJournal   4,064,750  4,036,533  1              5         3
NotreDame     328,259    318,204    46             163.59    5
Pokec         1,633,602  1,632,803  0              0         3
Slashdot0811  77,809     77,360     0              0         3
Slashdot0902  82,661     82,168     0              0         3
Stanford      291,006    279,912    55             36.2      5
WikiTalk      2,398,843  2,394,385  0              0         3
wikivote      7,139      7,115      0              0         2
Youtube       1,161,551  1,138,499  0              0         3
Orkut         3,073,414  3,072,627  0              0         3
BuzzNet       101,179    101,163    0              0         2
Delicious     537,831    533,507    339            8.56      3
Digg          771,879    771,229    0              0         3
Flixster      2,524,659  2,523,386  0              0         3
Foursquare    639,015    639,014    0              0         1
Friendster    5,689,609  5,689,498  0              0         3
Lastfm        1,192,094  1,191,812  0              0         2

Table 3: The Structure of AutoTrees of real graphs

Graph              |V(AT)|  singleton  non-singleton  avg size  depth
ag2-49             1        0          1              4,851     0
cfi-200            1        0          1              2,000     0
difp-21-0-wal-rcr  16,928   16,927     0              0         1
fpga11-20-uns-rcr  2,441    2,418      22             121.91    1
grid-w-3-20        1        0          1              8,000     0
had-256            1        0          1              1,024     0
mz-aug-50          1        0          1              1,000     0
pg2-49             1        0          1              4,902     0
s3-3-3-10          12,999   12,974     0              0         2

Table 4: The Structure of AutoTrees of benchmark graphs

singleton leaf nodes, and non-singleton leaf nodes in each AutoTree, respectively. The 5th column shows the average size (number of vertices) of the non-singleton leaf nodes, and the 6th column shows the depth of the AutoTree. Several interesting observations can be made. First, in 15 out of 22 datasets, the AutoTree contains only singleton leaf nodes. For these datasets, there is no need to invoke existing approaches to discover the automorphism group and canonical labeling, i.e., the three DviCL+X algorithms complete in polynomial time on these graphs, and their performances are almost the same. The AutoTree AT(G, π), the automorphism group Aut(G, π) and the canonical labeling C(G, π) can be obtained with only an equitable coloring at the root of the AutoTree.
Second, in the remaining 7 datasets that contain non-singleton leaf nodes, there are only a small number of non-singleton leaf nodes, and these non-singleton leaf nodes are of small sizes. Transferring the problem of discovering the canonical labeling of a massive graph into finding the canonical labeling of a few small subgraphs improves the efficiency and robustness significantly. This observation also explains the phenomenon that all DviCL+X algorithms consume almost the same amount of memory on each dataset: the AutoTree is the most space-consuming structure when there are only a few small non-singleton leaf nodes. Third, AutoTrees usually have low depths. Since both DivideI and DivideS cost O(m_s) for a graph with m_s edges, and all subgraphs at the same depth in an AutoTree are vertex disjoint, constructing AT(G, π) costs O(m). Similarly, given the canonical labeling of all non-singleton leaf nodes, achieving the canonical labeling of all tree nodes in AT(G, π) costs only O(m · ln m). Fourth, comparing the 3rd column in Table 3 with the 7th column in Table 1, DivideI and DivideS can further partition some automorphic vertices into singleton leaf nodes in the AutoTree, which can further improve the efficiency.

Table 4 demonstrates the structure of the AutoTrees constructed for benchmark graphs by DviCL+X. Different from those constructed for real graphs, the AutoTrees of most benchmark graphs contain only the root node. Revisiting Table 2, most benchmark graphs are highly regular and contain no singleton cells, which makes DviCL and AutoTree useless in improving the performance.

Table 5: Performance of nauty, DviCL+n, traces, DviCL+t, bliss, and DviCL+b on real-world networks

Graph         |S| = 10            |S| = 100
              number    time      number    time
Amazon        1         0.12      1         0.1
BerkStan      16        0.12      1.12E23   0.12
Epinions      2         0.01      840       0.01
Gnutella      1         0.01      1         0.01
Google        40        0.19      1.43E25   0.18
LiveJournal   30        1.39      1.19E37   1.53
NotreDame     88        0.04      63,360    0.04
Pokec         1         0.5       302,400   0.52
Slashdot0811  1         0.01      192       0.01
Slashdot0902  2         0.01      4,608     0.01
Stanford      6         0.04      1.23E15   0.04
WikiTalk      1         0.49      1         0.48
wikivote      8.82E15   0         2.94E15   0
Youtube       1         0.27      1         0.28
Orkut         4         1.01      2.91E10   1.01
BuzzNet       80        0.02      7.36E88   0.02
Delicious     19        0.09      787,968   0.09
Digg          1         0.15      1         0.16
Flixster      1         0.13      1         0.14
Foursquare    6.64E6    0.13      4.44E71   0.13
Friendster    1         1.64      1         1.62
Lastfm        1         0.29      1         0.28

Table 6: SSM on seed set S by IM

The efficiency on real datasets: Table 5 shows the efficiency of DviCL+X and X on real graphs. The 2nd, 4th, 6th, 8th, 10th and 12th columns show the running times of nauty, DviCL+n, traces, DviCL+t, bliss and DviCL+b, respectively. In Table 5, the symbol "-" indicates that the algorithm cannot get the result within 2 hours, and the champion on each dataset is in bold. Several points can be made. First, among the 22 datasets, DviCL+X outperforms X significantly on 14 datasets. Specifically, on 3 datasets (Google, LiveJournal and WikiTalk), none of the previous approaches can achieve the results, and on 10 datasets, none of the previous approaches can obtain the results within 100 seconds. For the remaining 8 datasets, traces performs the best; however, its advantage over DviCL+t is marginal.
Graph         maximum clique          triangle
              number   cluster  max   number       cluster      max
Amazon        610      584      3     3,986,507    3,837,711    120
BerkStan      4        4        1     64,690,980   10,487,015   735,000
Epinions      18       18       1     1,624,481    1,622,749    35
Gnutella      16       16       1     2,024        2,017        2
Google        8        2        4     13,391,903   6,325,254    4,200
LiveJournal   589,824  36,864   16    177,820,130  158,645,941  198,485
NotreDame     1        1        1     8,910,005    2,629,782    2,268,014
Pokec         6        6        1     32,557,458   32,545,137   84
Slashdot0811  52       52       1     551,724      550,747      46
Slashdot0902  104      104      1     602,588      600,239      242
Stanford      10       6        2     11,329,473   4,041,344    42,504
WikiTalk      141      141      1     9,203,518    9,165,115    780
wikivote      23       23       1     608,389      608,366      6
Youtube       2        2        1     3,056,537    3,036,649    445
Orkut         20       20       1     -            -            -
BuzzNet       12       12       1     30,919,848   30,914,434   71
Delicious     9        9        1     487,972      478,909      132
Digg          192      192      1     62,710,797   62,685,651   407
Flixster      752      752      1     7,897,122    7,114,518    192
Foursquare    8        8        1     21,651,003   21,646,991   13
Friendster    120      120      1     8,722,131    8,604,990    563
Lastfm        330      330      1     3,946,212    3,930,145    100

Table 7: Subgraph clustering by SSM

Second, if we take DviCL as a preprocessing procedure, DviCL improves the efficiency and robustness of nauty, bliss and traces significantly. Third, among the 6 algorithms, only the DviCL+X algorithms can achieve results on all datasets; furthermore, DviCL+X can get the results on all datasets in less than 26 seconds. We explain the efficiency and robustness of DviCL as follows. By constructing an AutoTree, DviCL only needs to discover the canonical labeling of a few small subgraphs instead of directly finding the canonical labeling of a massive graph, for two reasons: (1) the AutoTree construction, including graph partition and canonical labeling generation, is of low cost; (2) finding the canonical labeling of small subgraphs is always efficient. Fourth, for the graphs whose AutoTrees contain no non-singleton leaf nodes, all three DviCL+X algorithms perform similarly.

The 3rd, 5th, 7th, 9th, 11th and 13th columns illustrate the maximum memory consumption of nauty, DviCL+n, traces, DviCL+t, bliss and DviCL+b, respectively.
First, it is interesting to find that DviCL+n, DviCL+t and DviCL+b consume almost the same amount of memory on each dataset, confirming our analysis made when demonstrating the structure of the AutoTrees. Second, bliss consumes the least amount of memory on most datasets.

Table 8: Performance on benchmark graphs

The efficiency on benchmark datasets: Table 8 shows the efficiency of DviCL and its competitors on benchmark graphs. The 2nd, 3rd, 4th, 5th, 6th and 7th columns show the running times of nauty, DviCL+n, traces, DviCL+t, bliss and DviCL+b, respectively. It is worth noting that, due to the accuracy of the timers provided in nauty, traces and bliss, we equate "< 0.01" with any value in [0, 0.01). From Table 8, almost all algorithms perform well on all datasets. Among these 6 algorithms, traces and DviCL+t are the best two approaches. Although traces performs the best on more datasets than DviCL+t, DviCL+t is more robust than traces. Specifically, DviCL+t can achieve the result within at most 0.04s on any benchmark dataset tested, while traces spends 0.23s on fpga11-20-uns-rcr.

In conclusion, since the AutoTrees constructed for real graphs are of low depth, and the non-singleton leaf nodes in the AutoTrees are few and small, our DviCL reduces substantial redundant computation and significantly improves the performance on massive real graphs, at a small extra cost for constructing the AutoTrees. Due to the small sizes and the regularity of benchmark graphs, the improvements there are not remarkable.

Applications of SSM: We study the applications of SSM. First, given a seed set S found by influence maximization (IM), we estimate the number of sets that have the same max influence as S. Here, S is obtained by PMC [28], one of the best performing algorithms for IM, and the seed number k, i.e., |S|, is set to 10 and 100, respectively. Table 6 demonstrates the results.
The 2nd and 4th columns show the numbers of candidate seed sets when |S| = 10 and |S| = 100, respectively. The 3rd and 5th columns show the running times for the estimation. Several observations can be made. First, for a large number of the graphs tested, numerous candidate sets can be found. Second, it is efficient to estimate the number of candidate sets. The reasons are as follows: 1) the most time-consuming part of Algorithm SSM-AT is invoking SM on non-singleton leaf nodes (Line 3 in Algorithm 6); 2) in the AutoTrees, the non-singleton leaf nodes are few and are of small sizes.

Second, we study subgraph clustering by SSM. Given a set of subgraphs in a graph G, all these subgraphs can be clustered such that each cluster contains subgraphs that are mutually symmetric. We consider the set of all maximum cliques and all triangles, and estimate the number of clusters and the size of the maximum cluster. Table 7 illustrates the results. The 2nd, 3rd and 4th columns show the total number, the number of clusters, and the size of the maximum cluster for maximum cliques, respectively. The 5th, 6th and 7th columns show the same statistics for triangles. It is interesting to find that 1) both the maximum cliques and the triangles are diverse; 2) given a single maximum clique or triangle, it is possible to find several symmetric ones by SSM.

8. CONCLUSION

In this paper, we study graph isomorphism and automorphism detection for massive graphs. Different from the state-of-the-art algorithms that adopt an individualization-refinement schema for canonical labeling, we propose a novel efficient canonical labeling algorithm DviCL following the divide-and-conquer paradigm. With DviCL, a tree-shaped index, called AutoTree, is constructed for the given colored graph (G, π). The AutoTree AT(G, π) provides insights into the symmetric structure of (G, π) in addition to the automorphism group and canonical labeling.
We show that AT(G, π) can be used (1) to find all possible seed sets for influence maximization, and (2) to find all subgraphs in a graph G that are symmetric to a given subgraph in G. We conducted comprehensive experimental studies to demonstrate the efficiency and robustness of our approach DviCL. First, the non-singleton leaf nodes in the AutoTrees constructed are few and small, and the AutoTrees are of low depth. Thus, the extra cost for AutoTree construction is low and worthwhile. Second, DviCL+X significantly outperforms X, where X is nauty, traces or bliss, on 14 out of 22 datasets. For the remaining 8 datasets, only traces can beat DviCL+X, whereas its advantage is marginal. Third, among the 6 algorithms tested, all of the DviCL+X algorithms can achieve the results on all datasets in less than 26 seconds, while X is inefficient on most datasets.

ACKNOWLEDGEMENTS: This work was supported by grants from RGC 14203618, RGC 14202919, RGC 12201518, RGC 12232716, RGC 12258116, RGC 14205617 and NSFC 61602395.

9. REFERENCES

[1] A. Arora, S. Galhotra, and S. Ranu. Debunking the myths of influence maximization: An in-depth benchmarking study. In Proc. of SIGMOD'17, 2017.
[2] L. Babai, W. M. Kantor, and E. M. Luks. Computational complexity and the classification of finite simple groups. In Proc. of FOCS'83, 1983.
[3] L. Babai and L. Kucera. Canonical labelling of graphs in linear average time. In Proc. of FOCS'79, 1979.
[4] L. Babai and E. M. Luks. Canonical labeling of graphs. In Proc. of STOC'83, 1983.
[5] H. L. Bodlaender. Polynomial algorithms for graph isomorphism and chromatic index on partial k-trees. Journal of Algorithms, 11(4), 1990.
[6] D. Bonchev. Chemical graph theory: introduction and fundamentals, volume 1. CRC Press, 1991.
[7] W. Bosma, J. Cannon, and C. Playoust. The Magma algebra system I: The user language. Journal of Symbolic Computation, 24(3-4), 1997.
[8] W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In Proc.
of SIGKDD'09, 2009.
[9] P. T. Darga, M. H. Liffiton, K. A. Sakallah, and I. L. Markov. Exploiting structure in symmetry detection for CNF. In Proc. of the 41st Annual Design Automation Conference, 2004.
[10] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems (TOPLAS), 9(3), 1987.
[11] I. S. Filotti and J. N. Mayer. A polynomial-time algorithm for determining the isomorphism of graphs of fixed genus. In Proc. of STOC'80, 1980.
[12] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. 1979.
[13] O. Goldreich, S. Micali, and A. Wigderson. Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems. JACM, 38(3), 1991.
[14] The GAP Group et al. GAP system for computational discrete algebra, 2007.
[15] T. Junttila and P. Kaski. Engineering an efficient canonical labeling tool for large and sparse graphs. In Proc. of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX), 2007.
[16] T. Junttila and P. Kaski. Conflict propagation and component recursion for canonical labeling. In Theory and Practice of Algorithms in (Computer) Systems. Springer, 2011.
[17] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proc. of SIGKDD'03, 2003.
[18] W. Kocay. On writing isomorphism programs. In Computational and Constructive Design Theory. Springer, 1996.
[19] D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Proceedings of the 8th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, 1981.
[20] T. Lappas, K. Liu, and E. Terzi. Finding a team of experts in social networks. In Proc. of KDD'09, 2009.
[21] C. Liu, C. Chen, J. Han, and P. S. Yu. GPLAG: detection of software plagiarism by program dependence graph analysis. In Proc. of SIGMOD'06. ACM, 2006.
[22] C. Lu, J. X.
Yu, H. Wei, and Y. Zhang. Finding the maximum clique in massive graphs. PVLDB, 10(11), 2017.
[23] E. M. Luks. Isomorphism of graphs of bounded valence can be tested in polynomial time. Journal of Computer and System Sciences, 25(1), 1982.
[24] B. D. MacArthur, R. J. Sánchez-García, and J. W. Anderson. Symmetry in complex networks. Discrete Applied Mathematics, 156(18), 2008.
[25] B. D. McKay. Computing automorphisms and canonical labellings of graphs. In Combinatorial Mathematics. Springer, 1978.
[26] B. D. McKay et al. Practical graph isomorphism. 1981.
[27] G. Miller. Isomorphism testing for graphs of bounded genus. In Proc. of STOC'80, 1980.
[28] N. Ohsaka, T. Akiba, Y. Yoshida, and K.-i. Kawarabayashi. Fast and accurate influence maximization on large networks with pruned Monte-Carlo simulations. In AAAI, 2014.
[29] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P. Spyridonos. Community detection in social media. Data Mining and Knowledge Discovery, 24, 2012.
[30] A. Piperno. Search space contraction in canonical labeling of graphs. arXiv preprint arXiv:0804.4881, 2008.
[31] M. Randic, G. M. Brissey, and C. L. Wilkins. Computer perception of topological symmetry via canonical numbering of atoms. Journal of Chemical Information and Computer Sciences, 21(1), 1981.
[32] R. C. Read and D. G. Corneil. The graph isomorphism disease. Journal of Graph Theory, 1(4), 1977.
[33] B. Weisfeiler. On Construction and Identification of Graphs, volume 558. Springer, 2006.
[34] W. Wu, Y. Xiao, W. Wang, Z. He, and Z. Wang. K-symmetry model for identity anonymization in social networks. In Proc. of EDBT'10, 2010.
[35] Y. Xiao, B. D. MacArthur, H. Wang, M. Xiong, and W. Wang. Network quotients: Structural skeletons of complex systems. Physical Review E, 78(4), 2008.
[36] Y. Xiao, M. Xiong, W. Wang, and H. Wang. Emergence of symmetry in complex networks. Physical Review E, 77(6), 2008.
[37] Y.-H. Xiao, W.-T. Wu, H. Wang, M. Xiong, and W.
Wang. Symmetry-based structure entropy of complex networks. Physica A: Statistical Mechanics and its Applications, 387(11), 2008.
[38] E. Yeger-Lotem, S. Sattath, N. Kashtan, S. Itzkovitz, R. Milo, R. Y. Pinter, U. Alon, and H. Margalit. Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction. PNAS, 101, 2004.
[39] X. Zheng, T. Liu, Z. Yang, and J. Wang. Large cliques in arabidopsis gene coexpression network and motif discovery.