A Family of Tractable Graph Distances
Jose Bento∗  Stratis Ioannidis†

∗ Boston College, [email protected]
† Northeastern University, [email protected]

Abstract
Important data mining problems such as nearest-neighbor search and clustering admit theoretical guarantees when restricted to objects embedded in a metric space. Graphs are ubiquitous, and clustering and classification over graphs arise in diverse areas, including, e.g., image processing and social networks. Unfortunately, popular distance scores used in these applications, that scale over large graphs, are not metrics and thus come with no guarantees. Classic graph distances such as, e.g., the chemical and the CKS distance are arguably natural and intuitive, and are indeed also metrics, but they are intractable: as such, their computation does not scale to large graphs. We define a broad family of graph distances, that includes both the chemical and the CKS distance, and prove that these are all metrics. Crucially, we show that our family includes metrics that are tractable. Moreover, we extend these distances by incorporating auxiliary node attributes, which is important in practice, while maintaining both the metric property and tractability.
Graph similarity and the related problem of graph isomorphism have a long history in data mining, machine learning, and pattern recognition [20, 44, 39]. Graph distances naturally arise in this literature: intuitively, given two (unlabeled) graphs, their distance is a score quantifying their structural differences. A highly desirable property for such a score is that it is a metric, i.e., it is non-negative, symmetric, positive-definite, and, crucially, satisfies the triangle inequality.
Metrics exhibit significant computational advantages over non-metrics. For example, operations such as nearest-neighbor search [19, 18, 10], clustering [3], outlier detection [7], and diameter computation [32] admit fast algorithms precisely when performed over objects embedded in a metric space. To this end, proposing tractable graph metrics is of paramount importance in applying such algorithms to graphs. Unfortunately, graph metrics of interest are often computationally expensive. A well-known example is the chemical distance [41]. Formally, given graphs $G_A$ and $G_B$, represented by their adjacency matrices $A, B \in \{0,1\}^{n \times n}$, the chemical distance $d_{P_n}(A,B)$ is defined in terms of a mapping between the two graphs that minimizes their edge discrepancies, i.e.:

$$d_{P_n}(A,B) = \min_{P \in P_n} \|AP - PB\|_F, \quad (1)$$

where $P_n$ is the set of permutation matrices of size $n$ and $\|\cdot\|_F$ is the Frobenius norm (see Sec. 2 for definitions). The Chartrand-Kubicki-Schultz (CKS) [17] distance is an alternative: CKS is again given by (1) but, instead of edges, matrices $A$ and $B$ contain the pairwise shortest path distances between any two nodes. The chemical and CKS distances have important properties. First, they are zero if and only if the graphs are isomorphic, which appeals to both intuition and practice; second, as desired, they are metrics; third, they have a natural interpretation, capturing global structural similarities between graphs. However, finding an optimal permutation $P$ is notoriously hard; graph isomorphism, which is equivalent to deciding if there exists a permutation $P$ s.t. $AP = PB$ (for both adjacency and path matrices), is famously a problem that is neither known to be in P nor shown to be NP-hard [8]. There is a large and expanding literature on scalable heuristics to estimate the optimal permutation $P$ [35, 9, 43, 22]. Despite their computational advantages, unfortunately, using them to approximate $d_{P_n}(A,B)$ breaks the metric property. This significantly degrades the performance of many important tasks that rely on computing distances between graphs. For example, there is a clear separation on the approximability of clustering over metric and non-metric spaces [3]. We also demonstrate this empirically in Section 5 (c.f. Fig. 1): attempting to cluster graphs sampled from well-known families based on non-metric distances significantly increases the misclassification rate, compared to clustering using metrics.

An additional issue that arises in practice is that nodes often have attributes not associated with adjacency. For example, in social networks, nodes may contain profiles with a user's age or gender; similarly, nodes in molecules may be labeled by atomic numbers. Such attributes are not captured by the chemical or CKS distances. However, in such cases, only label-preserving permutations $P$ may make sense (e.g., mapping females to females, oxygens to oxygens, etc.). Incorporating attributes while preserving the metric property is thus important from a practical perspective.

Contributions.
We seek generalizations of the chemical and CKS distances that (a) satisfy the metric property and (b) are tractable: by this, we mean that they can be computed either by solving a convex optimization problem, or by a polynomial time algorithm. Specifically, we study generalizations of (1) of the form:

$$d_S(A,B) = \min_{P \in S} \|AP - PB\|, \quad (2)$$

where $S \subset \mathbb{R}^{n \times n}$ is closed and bounded, $\|\cdot\|$ is a matrix norm, and $A, B \in \mathbb{R}^{n \times n}$ are arbitrary real matrices (representing adjacency, path distances, weights, etc.). We make the following contributions:

• We prove sufficient conditions on $S$ and norm $\|\cdot\|$ for which (2) is a metric. In particular, we show that $d_S$ is a so-called pseudometric (see Sec. 2) when: (i) $S = P_n$ and $\|\cdot\|$ is any entry-wise or operator norm; (ii) $S = W_n$, the set of doubly stochastic matrices, $\|\cdot\|$ is an arbitrary entry-wise norm, and $A, B$ are symmetric; a modification of $d_S$ extends this result to both operator norms as well as arbitrary matrices (capturing, e.g., directed graphs); and (iii) $S = O_n$, the set of orthogonal matrices, and $\|\cdot\|$ is the operator or entry-wise 2-norm. Relaxations (ii) and (iii) are very important from a practical standpoint. For all matrix norms, computing (2) with $S = W_n$ is tractable, as it is a convex optimization. For $S = O_n$, (2) is non-convex but is still tractable, as it reduces to a spectral decomposition. This was known for the Frobenius norm [57]; we prove this is the case for the operator 2-norm also.

• We include node attributes in a natural way in the definition of $d_S$ as both soft (i.e., penalties in the objective) or hard constraints in Eq. (2). Crucially, we do this without affecting the metric property and tractability. This allows us to explore label or feature preserving permutations, that incorporate both (a) exogenous node attributes, such as, e.g., user age or gender in a social network, as well as (b) endogenous, structural features of each node, such as its degree or the number of triangles that pass through it. We numerically show that adding these constraints can speed up the computation of $d_S$.

From an experimental standpoint, we extensively compare our tractable metrics to several existing heuristic approximations. We also demonstrate the tractability of our metrics by parallelizing their execution using the alternating direction method of multipliers [14], which we implement over a compute cluster using Apache Spark [63].

Related Work.
Graph distance (or similarity) scores find applications in varied fields such as image processing [20], chemistry [6, 41], and social network analysis [44, 39]. Graph distances are easy to define when, contrary to our setting, the correspondence between graph nodes is known, i.e., graphs are labeled [47, 39, 56]. Beyond the chemical distance, classic examples of distances between unlabeled graphs are the edit distance [27, 52] and the maximum common subgraph distance [16, 15], both of which also have versions for labeled graphs. Both are metrics and are hard to compute, while existing heuristics [49, 25] are not metrics. The reaction distance [37] is also a metric directly related to the chemical distance [41] when edits are restricted to edge additions and deletions. Jain [33] also considers an extension of the chemical distance, limited to the Frobenius norm, that incorporates edge attributes. However, it is not immediately clear how to relax the above metrics [33, 37] to attain tractability.

A metric can also be induced by embedding graphs in a metric space and measuring the distance of these embeddings [51, 26, 50]. Several works follow such an approach, mapping graphs, e.g., to spaces determined by their spectral decomposition [64, 61, 23]. In general, in contrast to our metrics, such approaches are not as discriminative, as embeddings summarize graph structure. Continuous relaxations of graph isomorphism, both convex and non-convex [43, 4, 57], have found applications in a variety of contexts, including social networks [38], computer vision [53], shape detection [54, 29], and neuroscience [58]. None of the above works focus on metric properties of resulting relaxations, which several fail to satisfy [58, 38, 54, 29]. Metrics naturally arise in data mining tasks, including clustering [62, 28], NN search [19, 18, 10], and outlier detection [7]. Some of these tasks become tractable or admit formal guarantees precisely when performed over a metric space. For example, finding the nearest neighbor [19, 18, 10] or the diameter of a dataset [32] become polylogarithmic under metric assumptions; similarly, approximation algorithms for clustering (which is NP-hard) rely on metric assumptions, whose absence leads to a deterioration of known bounds [3]. Our search for metrics is motivated by these considerations.
Graphs. We represent an undirected graph $G(V,E)$ with node set $V = [n] \equiv \{1, \ldots, n\}$ and edge set $E \subseteq [n] \times [n]$ by its adjacency matrix, i.e., $A = [a_{ij}]_{i,j \in [n]} \in \{0,1\}^{n \times n}$ s.t. $a_{ij} = a_{ji} = 1$ if and only if $(i,j) \in E$. In particular, $A$ is symmetric, i.e., $A = A^\top$. We denote the set of all real, symmetric matrices by $S_n$. Directed graphs are represented by (possibly non-symmetric) binary matrices $A \in \{0,1\}^{n \times n}$, and weighted graphs by real matrices $A \in \mathbb{R}^{n \times n}$.

Matrix Norms.
Given a matrix $A = [a_{ij}]_{i,j \in [n]} \in \mathbb{R}^{n \times n}$ and a $p \in \mathbb{N}_+ \cup \{\infty\}$, its induced or operator $p$-norm is defined in terms of the vector $p$-norm through $\|A\|_p = \sup_{x \in \mathbb{R}^n: \|x\|_p = 1} \|Ax\|_p$, while its entry-wise $p$-norm is given by $\|A\|_p = \left(\sum_{i=1}^n \sum_{j=1}^n |a_{ij}|^p\right)^{1/p}$ for $p \in \mathbb{N}_+$, and $\|A\|_\infty = \max_{i,j} |a_{ij}|$. We denote the entry-wise 2-norm (i.e., the Frobenius norm) as $\|\cdot\|_F$.
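For concreteness, here is a minimal numpy sketch, added for exposition only, that contrasts the two norm families on a small matrix; numpy's `norm` with `ord=2` on a matrix gives the operator 2-norm, while the entry-wise norms reduce to elementary operations:

```python
# Operator vs. entry-wise matrix norms on a 2x2 example.
import numpy as np

A = np.array([[0., 1.], [1., 0.]])

op_2 = np.linalg.norm(A, ord=2)        # operator 2-norm: largest singular value
fro = np.linalg.norm(A, ord='fro')     # entry-wise 2-norm (Frobenius)
entry_1 = np.abs(A).sum()              # entry-wise 1-norm
entry_inf = np.abs(A).max()            # entry-wise infinity-norm: max |a_ij|

print(op_2, fro, entry_1, entry_inf)   # 1.0 1.414... 2.0 1.0
```

Note that the operator and entry-wise $p$-norms generally differ for the same $p$ (here, $1.0$ versus $\sqrt{2}$ for $p = 2$).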
Permutation, Doubly Stochastic, and Orthogonal Matrices. We denote the set of permutation matrices as $P_n = \{P \in \{0,1\}^{n \times n} : P\mathbf{1} = \mathbf{1}, P^\top\mathbf{1} = \mathbf{1}\}$, the set of doubly stochastic matrices (i.e., the Birkhoff polytope) as $W_n = \{W \in [0,1]^{n \times n} : W\mathbf{1} = \mathbf{1}, W^\top\mathbf{1} = \mathbf{1}\}$, and the set of orthogonal matrices (i.e., the Stiefel manifold) as $O_n = \{U \in \mathbb{R}^{n \times n} : UU^\top = U^\top U = I\}$. Note that $P_n = W_n \cap O_n$. Moreover, the Birkhoff-von Neumann Theorem [11] states that $W_n = \mathrm{conv}(P_n)$, i.e., the Birkhoff polytope is the convex hull of $P_n$.
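As a quick illustration of the relation $P_n = W_n \cap O_n$, the following sketch (our own, added for exposition) verifies both memberships for one permutation matrix:

```python
# A permutation matrix is both doubly stochastic and orthogonal.
import numpy as np

P = np.eye(3)[[2, 0, 1]]   # permutation matrix mapping (0,1,2) -> (2,0,1)

doubly_stochastic = (P >= 0).all() and \
    np.allclose(P.sum(axis=0), 1) and np.allclose(P.sum(axis=1), 1)
orthogonal = np.allclose(P @ P.T, np.eye(3)) and np.allclose(P.T @ P, np.eye(3))

print(doubly_stochastic, orthogonal)   # True True
```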
Metrics. Given a set $\Omega$, a function $d: \Omega \times \Omega \to \mathbb{R}$ is called a metric, and the pair $(\Omega, d)$ is called a metric space, if for all $x, y, z \in \Omega$:

$d(x,y) \geq 0$ (non-negativity) (3a)
$d(x,y) = 0$ iff $x = y$ (pos. definiteness) (3b)
$d(x,y) = d(y,x)$ (symmetry) (3c)
$d(x,y) \leq d(x,z) + d(z,y)$ (triangle inequality) (3d)

A function $d$ is called a pseudometric if it satisfies (3a), (3c), and (3d), but the positive definiteness property (3b) is replaced by the (weaker) property:

$d(x,x) = 0$ for all $x \in \Omega$. (3e)

If $d$ is a pseudometric, then $d(x,y) = 0$ defines an equivalence relation $x \sim_d y$ over $\Omega$. A pseudometric is then a metric over $\Omega/\!\sim_d$, the quotient space of $\sim_d$. A $d$ that satisfies (3a), (3b), and (3d) but not the symmetry property (3c) is called a quasimetric. If $d$ is a quasimetric, then its symmetric extension $\bar{d}: \Omega \times \Omega \to \mathbb{R}$, defined as $\bar{d}(x,y) = d(x,y) + d(y,x)$, is a metric over $\Omega$.

Graph Isomorphism, Chemical, and CKS Distance.
Let $A, B \in \mathbb{R}^{n \times n}$ be the adjacency matrices of two graphs $G_A$ and $G_B$. Then, $G_A$ and $G_B$ are isomorphic if and only if there exists $P \in P_n$ such that $P^\top A P = B$ or, equivalently, $AP = PB$. The chemical distance, given by (1), extends the latter relationship to capture distances between graphs. Let $\|\cdot\|$ be a matrix norm in $\mathbb{R}^{n \times n}$. For some $\Omega \subseteq \mathbb{R}^{n \times n}$, define $d_S: \Omega \times \Omega \to \mathbb{R}_+$ as:

$$d_S(A,B) = \min_{P \in S} \|AP - PB\|, \quad (4)$$

where $S \subset \mathbb{R}^{n \times n}$ is a closed and bounded set, so that the infimum is indeed attained. Note that $d_S$ is the chemical distance (1) when $\Omega = \mathbb{R}^{n \times n}$, $S = P_n$, and $\|\cdot\| = \|\cdot\|_F$. In the CKS distance [17], matrices $A, B$ contain pairwise path distances between any two nodes; equivalently, CKS is the chemical distance of two weighted complete graphs with path distances as edge weights. Our main contribution is determining general conditions on $S$ and $\|\cdot\|$ under which $d_S$ is a metric over $\Omega$, for arbitrary weighted graphs, thereby including both the chemical and CKS distances as special cases. For concreteness, we focus on distances between graphs of equal size. Extensions to graphs of unequal size are described in Appendix F.

Our first result establishes that $d_{P_n}$ is a pseudometric over all weighted graphs when $\|\cdot\|$ is an arbitrary entry-wise or operator norm.

Theorem 1. If $S = P_n$ and $\|\cdot\|$ is an arbitrary entry-wise or operator norm, then $d_S$ given by (4) is a pseudometric over $\Omega = \mathbb{R}^{n \times n}$.

Hence, $d_{P_n}$ is a pseudometric under any entry-wise or operator norm over arbitrary directed, weighted graphs. Our second result states that this property extends to the relaxed version of the chemical distance, in which permutations are replaced by doubly stochastic matrices.

Theorem 2. If $S = W_n$ and $\|\cdot\|$ is an arbitrary entry-wise norm, then $d_S$ given by (4) is a pseudometric over $\Omega = S_n$. If $\|\cdot\|$ is an arbitrary entry-wise or operator norm, then its symmetric extension $\bar{d}_S(A,B) = d_S(A,B) + d_S(B,A)$ is a pseudometric over $\Omega = \mathbb{R}^{n \times n}$.

Hence, if $S = W_n$ and $\|\cdot\|$ is an arbitrary entry-wise norm, then (4) defines a pseudometric over undirected graphs. The symmetry property (3c) breaks if $\|\cdot\|$ is an operator norm or graphs are directed. In either case, $d_S$ is a quasimetric over the quotient space $\Omega/\!\sim_d$, and symmetry is attained via the symmetric extension $\bar{d}_S$.

Theorem 2 has significant practical implications. In contrast to $d_{P_n}$ and its extensions implied by Theorem 1, computing $d_{W_n}$ under any operator or entry-wise norm is tractable [13]: it involves minimizing a convex function subject to linear constraints. A more limited result extends to the Stiefel manifold:

Theorem 3. If $S = O_n$ and $\|\cdot\|$ is either the operator or the entry-wise (i.e., Frobenius) 2-norm, then $d_S$ given by (4) is a pseudometric over $\Omega = \mathbb{R}^{n \times n}$.

Though (4) is not a convex problem when $S = O_n$, it is also tractable. Umeyama [57] shows that the optimization can be solved exactly when $\|\cdot\| = \|\cdot\|_F$ and $\Omega = S_n$ (i.e., for undirected graphs) by performing a spectral decomposition on $A$ and $B$. We extend this result, showing that the same procedure also applies when $\|\cdot\|$ is the operator 2-norm (see Thm. 7 in Appendix C). In the general case of directed graphs, (4) is a classic example of a problem that can be solved through optimization on manifolds [2].
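To make the convexity claim of Theorem 2 concrete, here is a minimal sketch of computing $d_{W_n}$ with the Frobenius norm using the cvxpy modeling library; the function name is our own, and this toy stand-in is not the authors' optimized CVX/C/Spark implementations described in Appendix E:

```python
# Convex relaxation d_{W_n} of the chemical distance (Frobenius norm):
# minimize ||AP - PB||_F over the Birkhoff polytope W_n.
import cvxpy as cp
import numpy as np

def d_Wn(A, B):
    n = A.shape[0]
    P = cp.Variable((n, n), nonneg=True)       # nonnegative entries
    constraints = [cp.sum(P, axis=0) == 1,     # columns sum to 1
                   cp.sum(P, axis=1) == 1]     # rows sum to 1
    objective = cp.Minimize(cp.norm(A @ P - P @ B, 'fro'))
    return cp.Problem(objective, constraints).solve()

# Two isomorphic path graphs should be at distance 0.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
perm = [1, 2, 0]
B = A[np.ix_(perm, perm)]                      # relabeled copy of A
print(round(d_Wn(A, B), 6))                    # ~0.0
```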
Equivalence Classes. The equivalence of matrix norms implies that all pseudometrics $d_S$ defined through (4) for a given $S$ have the same quotient space $\Omega/\!\sim_{d_S}$: if $d_S(A,B) = 0$ for one matrix norm $\|\cdot\|$ in (4), it will be so for all. When $S = P_n$, $\Omega/\!\sim_{d_{P_n}}$ is the quotient space defined by graph isomorphism: any two adjacency matrices $A, B \in \mathbb{R}^{n \times n}$ satisfy $d_{P_n}(A,B) = 0$ if and only if their (possibly weighted) graphs are isomorphic. When $S = W_n$, the quotient space $\Omega/\!\sim_{d_{W_n}}$ has a connection to the Weisfeiler-Lehman (WL) algorithm [60] described in Appendix D: Ramana et al. [48] show that $d_{W_n}(A,B) = 0$ if and only if $G_A$ and $G_B$ receive identical colors by the WL algorithm. If $S = O_n$ and $\Omega = S_n$, i.e., graphs are undirected, then $\Omega/\!\sim_{d_{O_n}}$ is determined by co-spectrality: $d_{O_n}(A,B) = 0$ if and only if $A, B$ have the same spectrum. When $\Omega = \mathbb{R}^{n \times n}$, $d_{O_n}(A,B) = 0$ implies that $A, B$ are co-spectral, but co-spectral matrices $A, B$ do not necessarily satisfy $d_{O_n}(A,B) = 0$.

Proof of Theorems 1–3.
We define several properties that play a crucial role in our proofs. We say that a set $S \subseteq \mathbb{R}^{n \times n}$ is closed under multiplication if $P_1, P_2 \in S$ implies that $P_1 P_2 \in S$. We say that $S$ is closed under transposition if $P \in S$ implies that $P^\top \in S$, and closed under inversion if $P \in S$ implies that $P^{-1} \in S$. Finally, given a matrix norm $\|\cdot\|$, we say that set $S$ is contractive w.r.t. $\|\cdot\|$ if $\|AP\| \leq \|A\|$ and $\|PA\| \leq \|A\|$, for all $P \in S$ and $A \in \mathbb{R}^{n \times n}$. Put differently, $S$ is contractive if and only if every $P \in S$ is a contraction w.r.t. $\|\cdot\|$. We rely on several lemmas, whose proofs can be found in Appendix A. The first three establish conditions under which (4) satisfies the triangle inequality (3d), symmetry (3c), and weak property (3e), respectively:
Lemma 1. Given a matrix norm $\|\cdot\|$, suppose that set $S$ is (a) contractive w.r.t. $\|\cdot\|$, and (b) closed under multiplication. Then, for any $A, B, C \in \mathbb{R}^{n \times n}$, $d_S$ given by (4) satisfies $d_S(A,C) \leq d_S(A,B) + d_S(B,C)$.

Lemma 2. Given a matrix norm $\|\cdot\|$, suppose that $S \subset \mathbb{R}^{n \times n}$ is (a) contractive w.r.t. $\|\cdot\|$, and (b) closed under inversion. Then, for all $A, B \in \mathbb{R}^{n \times n}$, $d_S(A,B) = d_S(B,A)$.

Lemma 3. If $I \in S$, then $d_S(A,A) = 0$ for all $A \in \mathbb{R}^{n \times n}$.

Both the set of permutation matrices $P_n$ and the Stiefel manifold $O_n$ are groups w.r.t. matrix multiplication: they are closed under multiplication, contain the identity $I$, and are closed under inversion. Hence, if they are also contractive w.r.t. a matrix norm $\|\cdot\|$, $d_{P_n}$ and $d_{O_n}$ defined in terms of this norm satisfy all assumptions of Lemmas 1–3. We therefore turn our attention to this property.

Lemma 4. Let $\|\cdot\|$ be any operator or entry-wise norm. Then, $S = P_n$ is contractive w.r.t. $\|\cdot\|$.

Hence, Theorem 1 follows as a direct corollary of Lemmas 1–4. Indeed, $d_{P_n}$ is non-negative, symmetric by Lemmas 2 and 4, satisfies the triangle inequality by Lemmas 1 and 4, as well as property (3e) by Lemma 3; hence $d_{P_n}$ is a pseudometric over $\mathbb{R}^{n \times n}$. Our next lemma shows that the Stiefel manifold $O_n$ is contractive for 2-norms:

Lemma 5. Let $\|\cdot\|$ be the operator 2-norm or the Frobenius norm. Then, $S = O_n$ is contractive w.r.t. $\|\cdot\|$.

Theorem 3 follows from Lemmas 1–3 and Lemma 5, along with the fact that $O_n$ is a group. Note that $O_n$ is not contractive w.r.t. other norms, e.g., $\|\cdot\|_1$ or $\|\cdot\|_\infty$. Lemma 4 along with the Birkhoff-von Neumann theorem imply that $W_n$ is also contractive:

Lemma 6. Let $\|\cdot\|$ be any operator or entry-wise norm. Then, $W_n$ is contractive w.r.t. $\|\cdot\|$.

The Birkhoff polytope $W_n$ is not a group, as it is not closed under inversion. Nevertheless, it is closed under transposition; in establishing (partial) symmetry of $d_{W_n}$, we leverage the following lemma:

Lemma 7. Suppose that $\|\cdot\|$ is transpose invariant, and $S$ is closed under transposition. Then, $d_S(A,B) = d_S(B,A)$ for all $A, B \in S_n$.

The first part of Theorem 2 therefore follows from Lemmas 1, 3, and 6, as $W_n$ is closed under transposition, contains the identity $I$, and is closed under multiplication, while all entry-wise norms are transpose invariant. Operator norms are not transpose invariant. However, if $\|\cdot\|$ is an operator norm, or $\Omega = \mathbb{R}^{n \times n}$, then Lemma 6 and Lemma 1 imply that $d_{W_n}$ satisfies non-negativity (3a) and the triangle inequality (3d), while Lemma 3 implies that it satisfies (3e). These properties are inherited by the extension $\bar{d}_S$, which also satisfies symmetry (3c), and Theorem 2 follows. □

We have seen that the chemical distance $d_{P_n}$ can be relaxed to $d_{W_n}$ or $d_{O_n}$, gaining tractability while still maintaining the metric property. In practice, nodes in a graph often contain additional attributes that one might wish to leverage when computing distances. In this section, we show that such attributes can be seamlessly incorporated in $d_S$ either as soft or hard constraints, without violating the metric property.

Metric Embeddings. Given a graph $G_A$ of size $n$, a metric embedding of $G_A$ is a mapping $\psi_A: [n] \to \tilde{\Omega}$ from the nodes of the graph to a metric space $(\tilde{\Omega}, \tilde{d})$. That is, $\psi_A$ maps nodes of the graph to $\tilde{\Omega}$, where $\tilde{\Omega}$ is endowed with a metric $\tilde{d}$. We refer to a graph endowed with an embedding $\psi_A$ as an embedded graph, and denote this by $(A, \psi_A)$, where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix of $G_A$. We list two examples:

Example 1: Node Attributes.
Consider an embedding of a graph to $(\mathbb{R}^k, \|\cdot\|)$ in which every node $v \in V$ is mapped to a $k$-dimensional vector describing "local" attributes. These can be exogenous: e.g., features extracted from a user's profile (age, binarized gender, etc.) in a social network. Alternatively, attributes may be endogenous or structural, extracted from the adjacency matrix $A$, e.g., the node's degree, the size of its $k$-hop neighborhood, its page-rank, etc.

Example 2: Node Colors. Let $\tilde{\Omega}$ be an arbitrary finite set endowed with the Kronecker delta as a metric, that is, for $s, s' \in \tilde{\Omega}$, $\tilde{d}(s,s') = 0$ if $s = s'$, while $\tilde{d}(s,s') = \infty$ if $s \neq s'$. Given a graph $G_A$, a mapping $\psi_A: [n] \to \tilde{\Omega}$ is then a metric embedding. The values of $\tilde{\Omega}$ are invariably called colors or labels, and a graph embedded in $\tilde{\Omega}$ is a colored or labeled graph. Colors can again be exogenous or structural: e.g., if the graph represents an organic molecule, colors can correspond to atoms, while structural colors can be, e.g., the output of the WL algorithm (see Appendix D) after $k$ iterations.

As discussed below, node attributes translate to soft constraints in metric (4), while node colors correspond to hard constraints. The unified view through embeddings allows us to establish metric properties for both simultaneously (c.f. Thm. 4 and 5).

Embedding Distance.
Consider two embedded graphs $(A, \psi_A), (B, \psi_B)$ of size $n$ that are embedded in the same metric space $(\tilde{\Omega}, \tilde{d})$. For $u \in [n]$ a node in the first graph, and $v \in [n]$ a node in the second graph, the embedded distance between the two nodes is given by $\tilde{d}(\psi_A(u), \psi_B(v))$. Let $D_{\psi_A,\psi_B} = [\tilde{d}(\psi_A(u), \psi_B(v))]_{u \in V, v \in V} \in \mathbb{R}_+^{n \times n}$ be the corresponding matrix of embedded distances. After mapping nodes to the same metric space, it is natural to seek $P \in P_n$ that preserve the embedding distance. This amounts to finding a $P \in P_n$ that minimizes:

$$\mathrm{tr}\left(P^\top D_{\psi_A,\psi_B}\right) = \sum_{u,v \in [n]} P_{u,v}\, \tilde{d}(\psi_A(u), \psi_B(v)). \quad (5)$$

Note that, in the case of colored graphs and the Kronecker delta distance, minimizing (5) finds a $P \in P_n$ that maps nodes in $G_A$ to nodes in $G_B$ of equal color. It is not hard to verify that $\min_{P \in P_n} \mathrm{tr}(P^\top D_{\psi_A,\psi_B})$ induces a metric between graphs embedded in $(\tilde{\Omega}, \tilde{d})$ (this follows from Thm. 4 below with $A = B = 0$, i.e., for distances between embedded graphs with no edges). Despite the combinatorial nature of $P_n$, (5) is a weighted matching (assignment) problem, which can be solved through, e.g., the Hungarian algorithm [40] in time polynomial in $n$. We note that this metric is not as expressive as (4): depending on the definition of the embeddings $\psi_A, \psi_B$, attributes may only capture "local" similarities between nodes, as opposed to the "global" view of a mapping attained by (4).
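As an illustration, the sketch below solves the assignment problem (5) with scipy's `linear_sum_assignment` in place of a hand-rolled Hungarian algorithm; the scalar attribute values are hypothetical:

```python
# Minimize tr(P^T D) over permutations: a linear assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

# D[u, v] = embedded distance between node u of G_A and node v of G_B,
# e.g. |age_A[u] - age_B[v]| for a scalar node attribute.
age_A = np.array([21., 35., 50.])
age_B = np.array([49., 22., 34.])
D = np.abs(age_A[:, None] - age_B[None, :])

rows, cols = linear_sum_assignment(D)   # optimal P as a row -> column map
print(cols, D[rows, cols].sum())        # [1 2 0] and the minimal tr(P^T D) = 3.0
```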
A Unified, Tractable Metric. Motivated by the above considerations, we focus on unifying the "global" metric (4) with the "local" metrics induced by arbitrary graph embeddings. Proofs for the two theorems below are provided in the supplement. Given a metric space $(\tilde{\Omega}, \tilde{d})$, let $\Psi_{\tilde{\Omega}}^n = \{\psi: [n] \to \tilde{\Omega}\}$ be the set of all mappings from $[n]$ to $\tilde{\Omega}$. Then, given two embedded graphs $(A, \psi_A), (B, \psi_B) \in \mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$, we define:

$$d_S((A,\psi_A),(B,\psi_B)) = \min_{P \in S} \left[ \|AP - PB\| + \mathrm{tr}\left(P^\top D_{\psi_A,\psi_B}\right) \right] \quad (6)$$

for some compact set $S \subset \mathbb{R}^{n \times n}$ and matrix norm $\|\cdot\|$. Our next result states that incorporating this linear term does not affect the pseudometric property of $d_S$.

Theorem 4. If $S = P_n$ and $\|\cdot\|$ is an arbitrary entry-wise or operator norm, then $d_S$ given by (6) is a pseudometric over the set of embedded graphs $\Omega = \mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$.

We stress here that this result is non-obvious: it is not true that adding any linear term to $d_S$ leads to a quantity that satisfies the triangle inequality. It is precisely because $D_{\psi_A,\psi_B}$ contains pairwise distances that Theorem 4 holds. We can similarly extend Theorem 2:
[Figure 1 appears here; panels: (a) Clustering Misclassification Error; (b) TIVs, n = 10; (c) TIVs, n = 50; (d) Synthetic Graph Classes (table below); (e) TIVs, small graphs.]

(d) Synthetic Graph Classes:
  B_d : Barabási-Albert of degree d [5]
  E_p : Erdős-Rényi with probability p [24]
  P   : Power Law Tree [45]
  R_d : Regular Graph of degree d [12]
  S   : Small World [36]
  W_d : Watts-Strogatz of degree d [59]
Figure 1: A clustering experiment using metrics and non-metrics (y-axis) for different clustering parameters (x-axis) is shown in (a), left. We sample graphs with n = 50 nodes from the six classes shown in the adjacent table in (d), bottom-center. We compute distances between them using nine different algorithms from Table 1. Only the distances in our family (DSL1, DSL2, ORTHOP, and ORTHFR) are metrics. The resulting graphs are clustered using hierarchical agglomerative clustering [28] using Average, Centroid, Complete, Median, Single, Ward, and Weighted as a means of merging clusters. Colors represent the fraction of misclassified graphs, with the minimal misclassification rate per distance labeled explicitly. Metrics outperform other distance scores across all clustering methods. The error rate of a random guess is ≈ 0.8. Subfigures (b) and (c), top center and right, show that non-metric distances produce triangle inequality violations (TIVs) which contribute to poor clustering results; the figure shows the fraction of TIVs within different 10-node and 50-node graph families under these algorithms. Finally, subfigure (e), bottom right, shows the fraction of triangle inequality violations for different algorithms on the small graphs dataset of all 7-node graphs.
Theorem 5. If $S = W_n$ and $\|\cdot\|$ is an arbitrary entry-wise norm, then $d_S$ given by (6) is a pseudometric over $\Omega = S_n \times \Psi_{\tilde{\Omega}}^n$, the set of symmetric graphs embedded in $(\tilde{\Omega}, \tilde{d})$. Moreover, if $\|\cdot\|$ is an arbitrary entry-wise or operator norm, then the symmetric extension $\bar{d}_S$ of (6) is a pseudometric over $\Omega = \mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$.

Adding the linear term (5) in $d_S$ has significant practical advantages. Beyond expressing exogenous attributes, a linear term involving colors, combined with a Kronecker distance, translates into hard constraints: any permutation attaining a finite objective value must map nodes in one graph to nodes of the same color. Theorem 5 therefore implies that such constraints can be added to the optimization problem while maintaining the metric property. (Note that adding (5) to $d_{O_n}$ does not preserve the metric property.) In practice, as the number of variables in optimization problem (4) is $n^2$, incorporating such hard constraints can significantly reduce the problem's computation time; we illustrate this in the next section.
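The following sketch (our own, with hypothetical color values) illustrates how node colors and the Kronecker-delta metric induce such hard constraints: entries of $D_{\psi_A,\psi_B}$ are 0 for matching colors and $\infty$ otherwise, so only a subset of the entries of $P$ remain free:

```python
# Node colors as hard constraints: a finite objective in (6) forces P
# to be supported only where colors agree; solvers can fix the other
# entries of P to zero, shrinking the problem.
import numpy as np

colors_A = np.array([0, 0, 1, 2])   # e.g., WL colors or atom types
colors_B = np.array([0, 1, 2, 0])

D = np.where(colors_A[:, None] == colors_B[None, :], 0.0, np.inf)
support = np.argwhere(np.isfinite(D))   # the only (u, v) pairs P may match

print(len(support), "of", D.size, "entries of P remain free")   # 6 of 16
```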
Graphs. We use synthetic graphs from six classes summarized in the table in Fig. 1(d). In addition, we use a dataset of small graphs, comprising all 853 connected graphs of 7 nodes [46]. Finally, we use a collaboration graph with 5242 nodes and 14496 edges representing author collaborations [42].
Algorithms.
We compare our metrics to several competitors outlined in Table 1 (see also Appendix E). All receive only two unlabeled undirected simple graphs $A$ and $B$ and output a matching matrix $\hat{P}$, either in $W_n$ or in $P_n$, estimating $P^*$. If $\hat{P} \in P_n$, we compute $\|A\hat{P} - \hat{P}B\|$. If $\hat{P} \in W_n$, then we compute both $\|A\hat{P} - \hat{P}B\|_1$ and $\|A\hat{P} - \hat{P}B\|_F$; all norms are entry-wise. We also implement our two relaxations $d_{W_n}$ and $d_{O_n}$, each for two different matrix norm combinations.
Figure 2: (a) Effect of introducing TIVs on the performance of different algorithms on the clustering experiment of Figure 1(a) when using the Ward method. (b) Cosine similarity between the Laplacian of distances produced by each algorithm and the one by EXACT. (c) Distance between nearest neighbor (NN) graphs induced by different algorithms and the NN graph induced by EXACT.

Table 1: Competitor Distance Scores & Our Metrics

(Non-metric) Distance Score Algorithms:
  NetAlignBP      Network Alignment using Belief Propagation [9, 34]
  IsoRank         Neighborhood Topology Isomorphism using PageRank [55, 34]
  SparseIsoRank   Neighborhood Topology Sparse Isomorphism using PageRank [9, 34]
  InnerPerm       Inner Product Matching with Permutations [43]
  InnerDSL1       Inner Product Matching with Matrices in W_n and entry-wise 1-norm [43]
  InnerDSL2       Inner Product Matching with Matrices in W_n and Frobenius norm [43]
  NetAlignMR      Iterative Matching Relaxation [35, 34]
  Natalie (v2.0)  Improved Iterative Matching Relaxation [22, 21]

Metrics from our Family (4):
  EXACT    Chemical Distance via brute-force search over GPU
  DSL1     Doubly Stochastic Chemical Distance d_{W_n} with entry-wise 1-norm
  DSL2     Doubly Stochastic Chemical Distance d_{W_n} with Frobenius norm
  ORTHOP   Orthogonal Relaxation of Chemical Distance d_{O_n} with operator 2-norm
  ORTHFR   Orthogonal Relaxation of Chemical Distance d_{O_n} with Frobenius norm

Figure 3: (a) Effect of coloring/hard constraints on the number of variables and the number of terms of the objective $\|AP - PA\|$ using $k$ iterations of the WL coloring algorithm; the last column shows the execution time $\tau$ of WL on a 40-CPU machine using Apache Spark [63]. (b) Convergence of the ADMM algorithm [14] computing DSL2 on two copies of the collaboration graph as a function of time, implemented using Apache Spark [63] on a 40-CPU machine.

Clustering Graphs.
The difference between our metrics and non-metrics is striking when clustering graphs. This is illustrated by the clustering experiment shown in Fig. 1(a). Graphs of size n = 50 from the 6 classes in Fig. 1(d) are clustered together through hierarchical agglomerative clustering. We compute distances between them using nine different algorithms; only the distances in our family (DSL1, DSL2, ORTHOP, and ORTHFR) are metrics. The quality of the clusters induced by our metrics is far superior to clusters induced by non-metrics; in fact, ORTHOP and ORTHFR can lead to no misclassifications. This experiment strongly suggests that our metrics correctly capture the topology of the metric space between these larger graphs.
Triangle Inequality Violations (TIV).
Given graphs $A$, $B$, and $C$ and a distance $d$, a TIV occurs when $d(A,C) > d(A,B) + d(B,C)$. Being metrics, none of our distances induce TIVs; this is not the case for the remaining algorithms in Table 1. Fig. 1(b) and (c) show the TIV fraction across the synthetic graphs of Fig. 1(d), while Fig. 1(e) shows the fraction of TIVs found on the 853 small graphs (n = 7). NetAlignMR also produces no TIVs on the small graphs, but it does induce TIVs in synthetic graphs. We observe that it is easier to find TIVs when graphs are close: in synthetic graphs, TIVs abound for n = 10. No algorithm performs well across all categories of graphs.
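A brute-force check for TIVs in a (hypothetical) distance matrix, matching the definition above, can be sketched as follows:

```python
# Count triangle inequality violations in a distance matrix, where
# dist[i, j] is the score between graphs i and j; true metrics from
# our family should always report zero.
import itertools
import numpy as np

def count_tivs(dist):
    n = dist.shape[0]
    tivs = 0
    for a, b, c in itertools.permutations(range(n), 3):
        if dist[a, c] > dist[a, b] + dist[b, c] + 1e-9:  # float tolerance
            tivs += 1
    return tivs

dist = np.array([[0., 1., 5.], [1., 0., 1.], [5., 1., 0.]])
print(count_tivs(dist))   # 2: d(0,2) > d(0,1) + d(1,2), found in both orders
```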
Next, to investigate the effect of TIVs on clustering, we artificially introduced triangle inequality violations into the pairs of distances between graphs. We then re-evaluated clustering performance for hierarchical agglomerative clustering using the Ward method, which performed best in Fig. 1(a). Fig. 2(a) shows the fraction of misclassified graphs as the fraction of TIVs introduced increases. To incur as small a perturbation on distances as possible, we introduce TIVs as follows: for every three graphs $A, B, C$, with probability $p$, we set $d(A,C) = d(A,B) + d(B,C)$. Although this does not introduce a TIV w.r.t. $A$, $B$, and $C$, this distortion does introduce TIVs w.r.t. other triplets involving $A$ and $C$. We repeat this 20 times for each algorithm and each value of $p$, and compute the average fraction of TIVs, shown on the x-axis, and the average fraction of misclassified graphs, shown on the y-axis. As little as 1% TIVs significantly deteriorate clustering performance. We also see that, even after introducing TIVs, clustering based on metrics outperforms clustering based on non-metrics.

Comparison to Chemical Distance.
We compare how different distance scores relate to the chemical distance EXACT through two experiments on the small graphs (computation on larger graphs is prohibitive). In Figure 2(b), we compare the distances between small graphs with 7 nodes produced by the different algorithms and EXACT using the DISTATIS method of [1]. Let $D \in \mathbb{R}^{853 \times 853}$ be the matrix of distances between graphs under an algorithm. DISTATIS computes the normalized Laplacian of this matrix, given by $L = -UDU/\|UDU\|$ where $U = I - \mathbf{1}\mathbf{1}^\top/n$. The DISTATIS score is the cosine similarity of such Laplacians (vectorized). We see that our metrics produce distances attaining high similarity with EXACT, though NetAlignBP has the highest similarity. We measure proximity to EXACT with an additional test. Given $D$, we compute the nearest neighbor (NN) meta-graph by connecting a graph in $D$ to every graph at distance less than its average distance to other graphs. This results in a (labeled) meta-graph, which we can compare to the NN meta-graph induced by other algorithms, measuring the fraction of distinct edges. Fig. 2(c) shows that our algorithms perform quite well, though Natalie yields the smallest distance to EXACT.
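A minimal sketch of this DISTATIS score, following the formula above (the toy distance matrix is our own):

```python
# DISTATIS similarity: cosine similarity of normalized Laplacians
# L = -U D U / ||U D U||, with U = I - 11^T / n the centering matrix.
import numpy as np

def distatis_similarity(D1, D2):
    n = D1.shape[0]
    U = np.eye(n) - np.ones((n, n)) / n
    def laplacian(D):
        L = -U @ D @ U
        return L / np.linalg.norm(L)          # unit Frobenius norm
    v1, v2 = laplacian(D1).ravel(), laplacian(D2).ravel()
    return float(v1 @ v2)                     # cosine similarity of unit vectors

D = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
print(distatis_similarity(D, 3 * D))          # 1.0: invariant to rescaling
```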
Incorporating Constraints. Computation costs can be reduced through metric embeddings, as in (6). To show this, we produce a copy of the 5242-node collaboration graph with permuted node labels. We then run the WL algorithm (see Appendix D) to produce structural colors, which induce coloring constraints on $P \in W_n$. The support of $P$ (i.e., the number of variables in the optimization (4)), the support of $AP - PA$ (i.e., the number of non-zero summation terms in the objective of (4)), as well as the execution time $\tau$ of the WL algorithm, are summarized in Fig. 3(a). The original unconstrained problem involves $5242^2 \approx 27.4$M variables. However, after using WL and the induced constraints, the effective dimension of the optimization problem (4) reduces considerably. This, in turn, speeds up convergence time, shown in Fig. 3(b): including the time to compute constraints, a solution is found 110 times faster after the introduction of the constraints.
Our work suggests that incorporating soft and hard constraints has great potential to further improve the efficiency of our metrics. In future work, we intend to investigate and characterize the resulting equivalence classes under different soft and hard constraints, and to quantify these gains in efficiency, especially in parallel implementations like ADMM. Determining the necessity of the conditions used in proving that $d_S$ is a metric is also an open problem.

Acknowledgements
The authors gratefully acknowledge the support of the National Science Foundation (grants IIS-1741197, IIS-1741129) and of the National Institutes of Health (grant 1U01AI124302).
References

[1] H. Abdi, A. J. O'Toole, D. Valentin, and B. Edelman. DISTATIS: The analysis of multiple distance matrices. In CVPR Workshops, 2005.
[2] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[3] M. R. Ackermann, J. Blömer, and C. Sohler. Clustering for metric and nonmetric distance measures. ACM Transactions on Algorithms (TALG), 6(4):59, 2010.
[4] Y. Aflalo, A. Bronstein, and R. Kimmel. On convex relaxation of graph isomorphism. PNAS, 112(10):2942–2947, 2015.
[5] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
[6] F. H. Allen. The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallographica Section B: Structural Science, 58(3):380–388, 2002.
[7] F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In PKDD, 2002.
[8] L. Babai. Graph isomorphism in quasipolynomial time [extended abstract]. In STOC, 2016.
[9] M. Bayati, M. Gerritsen, D. F. Gleich, A. Saberi, and Y. Wang. Algorithms for large, sparse network alignment problems. In ICDM, 2009.
[10] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, 2006.
[11] G. Birkhoff. Three observations on linear algebra. Univ. Nac. Tucumán. Revista A, 5:147–151, 1946.
[12] B. Bollobás. Random graphs. In Modern Graph Theory, pages 215–252. Springer, 1998.
[13] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[14] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[15] H. Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18(8):689–694, 1997.
[16] H. Bunke and K. Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3):255–259, 1998.
[17] G. Chartrand, G. Kubicki, and M. Schultz. Graph similarity and distance in graphs. Aequationes Mathematicae, 55(1-2):129–145, 1998.
[18] K. L. Clarkson. Nearest neighbor queries in metric spaces. Discrete & Computational Geometry, 22(1):63–93, 1999.
[19] K. L. Clarkson. Nearest-neighbor searching and metric space dimensions. Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pages 15–59, 2006.
[20] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(03):265–298, 2004.
[21] M. El-Kebir, J. Heringa, and G. Klau. Natalie, a tool for pairwise global network alignment.
[22] M. El-Kebir, J. Heringa, and G. W. Klau. Natalie 2.0: Sparse global network alignment as a special case of quadratic assignment. Algorithms, 8(4):1035–1051, 2015.
[23] H. Elghawalby and E. R. Hancock. Measuring graph similarity using spectral geometry. In ICIAR, 2008.
[24] P. Erdős and A. Rényi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
[25] S. Fankhauser, K. Riesen, and H. Bunke. Speeding up graph edit distance computation through fast bipartite matching. In GBR, 2011.
[26] M. Ferrer, E. Valveny, F. Serratosa, K. Riesen, and H. Bunke. Generalized median graph computation by means of graph embedding in vector spaces. Pattern Recognition, 43(4):1642–1655, 2010.
[27] M. R. Garey and D. S. Johnson. Computers and Intractability, volume 29. W. H. Freeman, New York, 2002.
[28] J. A. Hartigan. Clustering Algorithms. Wiley, New York, 1975.
[29] L. He, C. Y. Han, and W. G. Wee. Object recognition and recovery by skeleton graph matching. In ICME, 2006.
[30] A. J. Hoffman and H. W. Wielandt. The variation of the spectrum of a normal matrix. Duke Math. J., 20(1):37–39, 1953.
[31] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.
[32] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, pages 428–434. ACM, 1999.
[33] B. J. Jain. On the geometry of graph spaces. Discrete Applied Mathematics, 214:126–144, 2016.
[34] A. Khan, D. Gleich, M. Halappanavar, and A. Pothen. Multicore codes for network alignment.
[35] G. W. Klau. A new graph-based method for pairwise global network alignment. BMC Bioinformatics, 10(1):S59, 2009.
[36] J. Kleinberg. The small-world phenomenon: An algorithmic perspective. In STOC, 2000.
[37] J. Koca, M. Kratochvil, V. Kvasnicka, L. Matyska, and J. Pospichal. Synthon Model of Organic Chemistry and Synthesis Design, volume 51. Springer Science & Business Media, 2012.
[38] D. Koutra, H. Tong, and D. Lubensky. Big-Align: Fast bipartite graph alignment. In ICDM, 2013.
[39] D. Koutra, J. T. Vogelstein, and C. Faloutsos. DeltaCon: A principled massive-graph similarity function. In SDM, 2013.
[40] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[41] V. Kvasnička, J. Pospíchal, and V. Baláž. Reaction and chemical distances and reaction graphs. Theoretical Chemistry Accounts: Theory, Computation, and Modeling (Theoretica Chimica Acta), 79(1):65–79, 1991.
[42] J. Leskovec, J. Kleinberg, and C. Faloutsos. Stanford large network dataset collection. http://snap.stanford.edu/data/ca-GrQc.html.
[43] V. Lyzinski, D. E. Fishkind, M. Fiori, J. T. Vogelstein, C. E. Priebe, and G. Sapiro. Graph matching: Relax at your own risk. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):60–73, 2016.
[44] O. Macindoe and W. Richards. Graph comparison using fine structure analysis. In SocialCom, 2010.
[45] H. M. Mahmoud, R. T. Smythe, and J. Szymański. On the structure of random plane-oriented recursive trees and their branches. Random Structures & Algorithms, 4(2):151–176, 1993.
[46] B. McKay. List of 7 node connected graphs. http://users.cecs.anu.edu.au/~bdm/data/graphs.html.
[47] P. Papadimitriou, A. Dasdan, and H. Garcia-Molina. Web graph similarity for anomaly detection. Journal of Internet Services and Applications, 1(1):19–30, 2010.
[48] M. V. Ramana, E. R. Scheinerman, and D. Ullman. Fractional isomorphism of graphs. Discrete Mathematics, 132(1-3):247–265, 1994.
[49] K. Riesen and H. Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision Computing, 27(7):950–959, 2009.
[50] K. Riesen and H. Bunke. Graph Classification and Clustering Based on Vector Space Embedding, volume 77. World Scientific, 2010.
[51] K. Riesen, M. Neuhaus, and H. Bunke. Graph embedding in vector spaces by means of prototype selection. In GBR, 2007.
[52] A. Sanfeliu and K. S. Fu. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, (3):353–362, 1983.
[53] C. Schellewald, S. Roth, and C. Schnörr. Evaluation of convex optimization techniques for the weighted graph-matching problem in computer vision. In Pattern Recognition, 2001.
[54] T. B. Sebastian, P. N. Klein, and B. B. Kimia. Recognition of shapes by editing their shock graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):550–571, 2004.
[55] R. Singh, J. Xu, and B. Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. In RECOMB, 2007.
[56] S. Soundarajan, T. Eliassi-Rad, and B. Gallagher. A guide to selecting a network similarity method. In SDM, 2014.
[57] S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(5):695–703, 1988.
[58] J. T. Vogelstein, J. M. Conroy, L. J. Podrazik, S. G. Kratzer, E. T. Harley, D. E. Fishkind, R. J. Vogelstein, and C. E. Priebe. Large (brain) graph matching via fast approximate quadratic programming. arXiv preprint arXiv:1112.5507, 2011.
[59] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
[60] B. Weisfeiler and A. A. Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.
[61] R. C. Wilson and P. Zhu. A study of graph spectra for comparing graphs and trees. Pattern Recognition, 41(9):2833–2841, 2008.
[62] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS, volume 15, page 12, 2002.
[63] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.
[64] P. Zhu and R. C. Wilson. A study of graph spectra for comparing graphs. In BMVC, 2005.

A Proof of Lemmas 1–7
A.1 Proof of Lemma 1

Consider $P_1 \in \arg\min_{P \in S} \|AP - PB\|$ and $P_2 \in \arg\min_{P \in S} \|BP - PC\|$. Then, from closure under multiplication, $P_1 P_2 \in S$. Hence,

$d_S(A,C) \leq \|AP_1P_2 - P_1P_2C\| \leq \|AP_1P_2 - P_1BP_2\| + \|P_1BP_2 - P_1P_2C\| = \|(AP_1 - P_1B)P_2\| + \|P_1(BP_2 - P_2C)\| \leq \|AP_1 - P_1B\| + \|BP_2 - P_2C\|,$

where the last inequality follows from the fact that $P_1, P_2$ are contractions. □

A.2 Proof of Lemma 2

Observe that property (b) implies that, for all $P \in S$, $P$ is invertible and $P^{-1} \in S$. Hence, $\|AP - PB\| = \|P(P^{-1}A - BP^{-1})P\| \leq \|BP^{-1} - P^{-1}A\|$, as $P$ is a contraction w.r.t. $\|\cdot\|$. We can similarly show that $\|BP^{-1} - P^{-1}A\| \leq \|AP - PB\|$, hence $\|AP - PB\| = \|BP^{-1} - P^{-1}A\|$. As $S$ is closed under inversion, $\min_{P \in S} f(P) = \min_{P: P^{-1} \in S} f(P)$, so $d_S(A,B) = \min_{P \in S} \|BP^{-1} - P^{-1}A\| = \min_{P: P^{-1} \in S} \|BP^{-1} - P^{-1}A\| = \min_{P \in S} \|BP - PA\| = d_S(B,A)$. □

A.3 Proof of Lemma 3

If $I \in S$, then $0 \leq d_S(A,A) \leq \|AI - IA\| = 0$. □

A.4 Proof of Lemma 4

Observe first that all vector $p$-norms are invariant to permutations of a vector's entries; hence, for any vector $x \in \mathbb{R}^n$, if $P \in P_n$, $\|Px\|_p = \|x\|_p$. Hence, if $\|\cdot\|$ is an operator $p$-norm, $\|P\| = 1$ for all $P \in P_n$. Every operator norm is submultiplicative; as a result $\|PA\| \leq \|P\|\|A\| = \|A\|$ and, similarly, $\|AP\| \leq \|A\|$, so the lemma follows for operator norms. On the other hand, if $\|\cdot\|$ is an entry-wise norm, then $\|A\|$ is invariant to permutations of either $A$'s rows or columns. Matrices $PA$ and $AP$ precisely amount to such permutations, so $\|PA\| = \|AP\| = \|A\|$, and the lemma follows also for entry-wise norms. □

A.5 Proof of Lemma 5

Any $U \in O_n$ satisfies $\|U\|_2 = 1$. The operator 2-norm is submultiplicative, hence $\|UA\|_2 \leq \|U\|_2 \|A\|_2 = \|A\|_2$, and similarly $\|AU\|_2 \leq \|A\|_2$. For the Frobenius norm, $\|UA\|_F^2 = \mathrm{tr}(A^\top U^\top U A) = \mathrm{tr}(A^\top A) = \|A\|_F^2$, and similarly $\|AU\|_F = \|A\|_F$ (see also Lemma 12 in Appendix C). □

A.6 Proof of Lemma 6

By the Birkhoff-von Neumann theorem [11], $W_n = \mathrm{conv}(P_n)$. Hence, for any $W \in W_n$ there exist $P_i \in P_n$, $\theta_i > 0$, $i = 1, \ldots, k$, such that $W = \sum_{i=1}^k \theta_i P_i$ and $\sum_{i=1}^k \theta_i = 1$. Both operator and entry-wise $p$-norms are convex functions; hence, by Jensen's inequality, for any $A \in \mathbb{R}^{n \times n}$: $\|WA\| \leq \sum_{i=1}^k \theta_i \|P_i A\| \leq \sum_{i=1}^k \theta_i \|A\| = \|A\|$, where the last inequality follows by Lemma 4. The statement $\|AW\| \leq \|A\|$ follows similarly. □

A.7 Proof of Lemma 7

By transpose invariance and the symmetry of $A$ and $B$, we have that $\|AP - PB\| = \|BP^\top - P^\top A\|$. Moreover, as $S$ is closed under transposition, $\min_{P \in S} f(P) = \min_{P: P^\top \in S} f(P)$. Hence, $d_S(A,B) = \min_{P \in S} \|BP^\top - P^\top A\| = \min_{P^\top \in S} \|BP^\top - P^\top A\| = \min_{P \in S} \|BP - PA\| = d_S(B,A)$. □

B Proof of Theorems 4 and 5
We begin by establishing conditions under which $d_S$ satisfies the triangle inequality (3d). We note that, in contrast to Lemma 1, we require the additional condition that $S \subseteq W_n$, which is not satisfied by $O_n$.

Lemma 8. Given a norm $\|\cdot\|$, suppose that $S$ is (a) contractive w.r.t. $\|\cdot\|$, (b) closed under multiplication, and (c) a subset of $W_n$, i.e., contains only doubly stochastic matrices. Then, for any $(A,\psi_A), (B,\psi_B), (C,\psi_C)$ in $\mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$, $d_S((A,\psi_A),(C,\psi_C)) \leq d_S((A,\psi_A),(B,\psi_B)) + d_S((B,\psi_B),(C,\psi_C))$.

Proof. Consider

$P_1 \in \arg\min_{P \in S} \left( \|AP - PB\| + \mathrm{tr}(P^\top D_{\psi_A,\psi_B}) \right)$, and
$P_2 \in \arg\min_{P \in S} \left( \|BP - PC\| + \mathrm{tr}(P^\top D_{\psi_B,\psi_C}) \right)$.

Then, from closure under multiplication, $P_1 P_2 \in S$. We have that

$d_S((A,\psi_A),(C,\psi_C)) \leq \|AP_1P_2 - P_1P_2C\| + \mathrm{tr}\left[(P_1P_2)^\top D_{\psi_A,\psi_C}\right].$

As in the proof of Lemma 1, we can show that $\|AP_1P_2 - P_1P_2C\| \leq \|AP_1 - P_1B\| + \|BP_2 - P_2C\|$, using the fact that both $P_1$ and $P_2$ are contractions, while

$\mathrm{tr}\left[(P_1P_2)^\top D_{\psi_A,\psi_C}\right] = \sum_{u,v \in [n]} \sum_{k \in [n]} P_{1,uk} P_{2,kv}\, \tilde{d}(\psi_A(u), \psi_C(v))$
$\leq \sum_{u,v \in [n]} \sum_{k \in [n]} P_{1,uk} P_{2,kv} \left( \tilde{d}(\psi_A(u), \psi_B(k)) + \tilde{d}(\psi_B(k), \psi_C(v)) \right)$

(as $\tilde{d}$ is a metric, and $P_1$, $P_2$ are non-negative)

$= \sum_{u,k \in [n]} P_{1,uk}\, \tilde{d}(\psi_A(u), \psi_B(k)) \sum_{v \in [n]} P_{2,kv} + \sum_{k,v \in [n]} P_{2,kv}\, \tilde{d}(\psi_B(k), \psi_C(v)) \sum_{u \in [n]} P_{1,uk}$
$\leq \mathrm{tr}\left(P_1^\top D_{\psi_A,\psi_B}\right) + \mathrm{tr}\left(P_2^\top D_{\psi_B,\psi_C}\right),$

where the last inequality follows as the rows and columns of every $P \in S \subseteq W_n$ sum to 1. □

The weak property (3e) is again satisfied provided the identity is included in $S$.

Lemma 9. If $I \in S$, then $d_S((A,\psi_A),(A,\psi_A)) = 0$ for all $A \in \mathbb{R}^{n \times n}$.

Proof. Indeed, $0 \leq d_S((A,\psi_A),(A,\psi_A)) \leq \|AI - IA\| + \sum_{u \in [n]} \tilde{d}(\psi_A(u), \psi_A(u)) = 0$. □

To attain symmetry over $\Omega = \mathbb{R}^{n \times n}$, we again rely on closure under inversion, as in Lemma 2; nonetheless, in contrast to Lemma 2, due to the linear term, we also need to assume orthogonality of $S$.

Lemma 10. Given a norm $\|\cdot\|$, suppose that $S$ (a) is contractive w.r.t. $\|\cdot\|$, (b) is closed under inversion, and (c) is a subset of $O_n$, i.e., contains only orthogonal matrices. Then, $d_S((A,\psi_A),(B,\psi_B)) = d_S((B,\psi_B),(A,\psi_A))$ for all $(A,\psi_A), (B,\psi_B) \in \mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$.

Proof. As in the proof of Lemma 2, we can show that contractiveness w.r.t. $\|\cdot\|$ along with closure under inversion imply that $\|AP - PB\| = \|BP^{-1} - P^{-1}A\|$. As $S$ is closed under inversion, $\min_{P \in S} f(P) = \min_{P: P^{-1} \in S} f(P)$ for all $f: S \to \mathbb{R}$, while orthogonality implies $P^{-1} = P^\top$ for all $P \in S$. Hence, $d_S((A,\psi_A),(B,\psi_B))$ equals

$\min_{P \in S} \left[ \|AP - PB\| + \mathrm{tr}\left(P^\top D_{\psi_A,\psi_B}\right) \right]$
$= \min_{P \in S} \left[ \|BP^{-1} - P^{-1}A\| + \mathrm{tr}\left(P^{-1} D_{\psi_A,\psi_B}\right) \right]$
$= \min_{P \in S} \left[ \|BP^{-1} - P^{-1}A\| + \mathrm{tr}\left((P^{-1})^\top D_{\psi_A,\psi_B}^\top\right) \right]$
$= \min_{P^{-1} \in S} \left[ \|BP^{-1} - P^{-1}A\| + \mathrm{tr}\left((P^{-1})^\top D_{\psi_B,\psi_A}\right) \right]$
$= d_S((B,\psi_B),(A,\psi_A)).$ □

Theorem 4 therefore follows from the above lemmas, as $S = P_n$ contains $I$, is closed under multiplication and inversion, is a subset of $W_n \cap O_n$, and is contractive w.r.t. all operator and entry-wise norms. Theorem 5 also follows by using the following lemma, along with Lemmas 8 and 9.

Lemma 11. Suppose that $\|\cdot\|$ is transpose invariant, and $S$ is closed under transposition. Then, $d_S((A,\psi_A),(B,\psi_B)) = d_S((B,\psi_B),(A,\psi_A))$ for all $(A,\psi_A),(B,\psi_B) \in S_n \times \Psi_{\tilde{\Omega}}^n$.

Proof. By transpose invariance of $\|\cdot\|$ and the symmetry of $A$ and $B$, we have that $\|AP - PB\| = \|BP^\top - P^\top A\|$. Moreover, as $S$ is closed under transposition, $\min_{P \in S} f(P) = \min_{P^\top \in S} f(P)$ for any $f: S \to \mathbb{R}$. Hence, $d_S((A,\psi_A),(B,\psi_B))$ equals

$\min_{P \in S} \left[ \|AP - PB\| + \mathrm{tr}\left(P^\top D_{\psi_A,\psi_B}\right) \right]$
$= \min_{P \in S} \left[ \|BP^\top - P^\top A\| + \mathrm{tr}\left(P D_{\psi_A,\psi_B}^\top\right) \right]$
$= \min_{P^\top \in S} \left[ \|BP^\top - P^\top A\| + \mathrm{tr}\left((P^\top)^\top D_{\psi_B,\psi_A}\right) \right]$
$= d_S((B,\psi_B),(A,\psi_A)).$ □

C Metric Computation over the Stiefel Manifold
In this section, we describe how to compute the metric $d_S$ in polynomial time when $S = O_n$ and $\|\cdot\|$ is the Frobenius norm or the operator 2-norm. The algorithm for the Frobenius norm, and the proof of its correctness, is due to [57]; we include it in this appendix for completeness, along with its extension to the operator norm. Both cases make use of the following lemma:

Lemma 12.
For any matrix $M \in \mathbb{R}^{n \times n}$ and any matrix $P \in O_n$ we have that $\|PM\| = \|MP\| = \|M\|$, where $\|\cdot\|$ is either the Frobenius or operator 2-norm.

Proof. Recall that the operator 2-norm $\|\cdot\|_2$ is $\|M\|_2 = \sup_{x \neq 0} \|Mx\|_2/\|x\|_2 = \sqrt{\sigma_{\max}(M^\top M)} = \sqrt{\sigma_{\max}(MM^\top)} = \|M^\top\|_2$, where $\sigma_{\max}$ denotes the largest singular value. Hence, $\|PM\|_2 = \sup_{x \neq 0} \|PMx\|_2/\|x\|_2 = \sqrt{\sigma_{\max}(M^\top P^\top P M)} = \sqrt{\sigma_{\max}(M^\top M)} = \|M\|_2$, as $P^\top P = I$. Using the fact that $\|M\|_2 = \|M^\top\|_2$ for all $M \in \mathbb{R}^{n \times n}$, as well as that $PP^\top = I$, we can show that $\|MP\|_2 = \|P^\top M^\top\|_2 = \|M^\top\|_2 = \|M\|_2$.

The Frobenius norm is $\|M\|_F = \sqrt{\mathrm{tr}(M^\top M)} = \sqrt{\mathrm{tr}(MM^\top)} = \|M^\top\|_F$, hence $\|PM\|_F = \sqrt{\mathrm{tr}(M^\top P^\top P M)} = \sqrt{\mathrm{tr}(M^\top M)} = \|M\|_F$ and, as in the case of the operator norm, we can similarly show $\|MP\|_F = \|P^\top M^\top\|_F = \|M^\top\|_F = \|M\|_F$. □

In both norm cases, for $A, B \in S_n$, we can compute $d_S$ using a simple spectral decomposition. Let $A = U\Sigma_A U^\top$ and $B = V\Sigma_B V^\top$ be the spectral decompositions of $A$ and $B$. As $A$ and $B$ are real and symmetric, we can assume $U, V \in O_n$. Recall that $U^{-1} = U^\top$ and $V^{-1} = V^\top$, while $\Sigma_A$ and $\Sigma_B$ are diagonal and contain the eigenvalues of $A$ and $B$ sorted in increasing order; this ordering matters for the computations below.

The following theorem establishes that this decomposition readily yields the distance $d_S$, as well as the optimal orthogonal matrix $P^*$, when $\|\cdot\| = \|\cdot\|_F$:

Theorem 6 ([57]). $d_S(A,B) \triangleq \min_{P \in S} \|AP - PB\|_F = \|\Sigma_A - \Sigma_B\|_F$, and the minimum is attained by $P^* = UV^\top$.

Proof. The proof makes use of the following lemma by [30].
Lemma 13. If $A$ and $B$ are Hermitian matrices with eigenvalues $a_1 \leq a_2 \leq \ldots \leq a_n$ and $b_1 \leq b_2 \leq \ldots \leq b_n$, then

$$\|A - B\|_F^2 \geq \sum_{i=1}^n (a_i - b_i)^2. \quad (7)$$
Remark 1. Note that if $\Sigma_A$ and $\Sigma_B$ are diagonal matrices with the ordered eigenvalues of $A$ and $B$ on the diagonal, then Lemma 13 can be written as $\|A - B\|_F \geq \|\Sigma_A - \Sigma_B\|_F$.

For any $P \in O_n$ and $\|\cdot\| = \|\cdot\|_F$ we have

$\|AP - PB\| = \|(A - PBP^{-1})P\| \overset{\text{Lem. 12}}{=} \|A - PBP^\top\| = \|U\Sigma_A U^\top - PV\Sigma_B V^\top P^\top\| = \|U(\Sigma_A - U^\top P V \Sigma_B V^\top P^\top U)U^\top\| \overset{\text{Lem. 12}}{=} \|\Sigma_A - U^\top P V \Sigma_B V^\top P^\top U\| = \|\Sigma_A - \Delta \Sigma_B \Delta^\top\|,$

where we define $\Delta \triangleq U^\top P V$. As a product of orthogonal matrices, $\Delta \in O_n$. Notice that

$\|\Sigma_A - \Delta\Sigma_B\Delta^\top\| = \|\Sigma_A - \Delta\Sigma_A\Delta^\top + \Delta(\Sigma_A - \Sigma_B)\Delta^\top\| \leq \|\Sigma_A - \Delta\Sigma_A\Delta^\top\| + \|\Delta(\Sigma_A - \Sigma_B)\Delta^\top\| \overset{\text{Lem. 12}}{=} \|\Sigma_A - \Delta\Sigma_A\Delta^\top\| + \|\Sigma_B - \Sigma_A\|.$

Therefore, for any $P \in O_n$, $\|\Sigma_A - \Sigma_B\| \leq d_S(A,B) \leq \|\Sigma_A - \Delta\Sigma_A\Delta^\top\| + \|\Sigma_B - \Sigma_A\|$, where the first inequality follows by Lemma 13 if we notice that $\|AP - PB\| = \|A - PBP^{-1}\|$ and that $PBP^{-1}$ and $B$ have the same spectrum for any $P$. If we choose $P = UV^\top$ then $\Delta = I$ and the result follows. □

We can compute $d_S$ when $S = O_n$ and $\|\cdot\|$ is the operator norm in the exact same way.
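In code, Theorem 6 (and, by Theorem 7 below, the operator-2-norm case) amounts to comparing sorted spectra; a minimal numpy sketch for symmetric inputs, with our own function name:

```python
# Orthogonal relaxation d_{O_n} via eigendecomposition (Theorem 6):
# distance is ||Sigma_A - Sigma_B||, optimum is P* = U V^T.
import numpy as np

def d_On(A, B):
    eig_A, U = np.linalg.eigh(A)   # eigh sorts eigenvalues in increasing order
    eig_B, V = np.linalg.eigh(B)
    P_star = U @ V.T               # optimal orthogonal matrix
    # 2-norm of the eigenvalue gap vector = ||Sigma_A - Sigma_B||_F;
    # max(|gap|) would give the operator-2-norm version of Theorem 7.
    return np.linalg.norm(eig_A - eig_B), P_star

A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])   # path graph
B = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])   # relabeled copy
dist, P = d_On(A, B)
print(round(dist, 6))              # ~0.0: isomorphic graphs are co-spectral
```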
Theorem 7. Let $\|\cdot\| = \|\cdot\|_2$ be the operator 2-norm. Then, $d_S(A,B) \triangleq \min_{P \in S} \|AP - PB\|_2 = \|\Sigma_A - \Sigma_B\|_2$, and the minimum is attained by $P^* = UV^\top$.

Proof. The proof follows the same steps as the proof of Theorem 6, using Lemma 14 below instead of Lemma 13.

Lemma 14. If $A$ and $B$ are Hermitian matrices with eigenvalues $a_1 \leq a_2 \leq \ldots \leq a_n$ and $b_1 \leq b_2 \leq \ldots \leq b_n$, then

$$\|A - B\|_2 \geq \max_i |a_i - b_i|. \quad (8)$$

Remark 2. Note that if $\Sigma_A$ and $\Sigma_B$ are diagonal matrices with the ordered eigenvalues of $A$ and $B$ on the diagonal, then Lemma 14 can be written as $\|A - B\|_2 \geq \|\Sigma_A - \Sigma_B\|_2$.

Proof of Lemma 14. Let $\tilde{B} = -B$ have eigenvalues $\tilde{b}_1 \leq \tilde{b}_2 \leq \ldots \leq \tilde{b}_n$ and let $C = A + \tilde{B}$ have eigenvalues $c_1 \leq c_2 \leq \ldots \leq c_n$. We make use of the following lemma by Weyl [31] to lower bound $c_n$.

Lemma 15. If $X$ and $Y$ are Hermitian with eigenvalues $x_1 \leq \ldots \leq x_n$ and $y_1 \leq \ldots \leq y_n$, and if $X + Y$ has eigenvalues $w_1 \leq \ldots \leq w_n$, then $x_{i-j+1} + y_j \leq w_i$ for all $i = 1, \ldots, n$ and $j = 1, \ldots, i$.

If we choose $X = \tilde{B}$, $Y = A$, and $i = n$ we get $a_j + \tilde{b}_{n+1-j} \leq c_n$ for all $j = 1, \ldots, n$. Since $\tilde{b}_{n+1-j} = -b_j$ we get that $a_j - b_j \leq c_n$ for any $j$. Similarly, by exchanging the roles of $A$ and $B$, we can lower bound the largest eigenvalue of $B - A$, say $d_n$, by $b_j - a_j$ for any $j$. Notice that, by definition of the operator norm and the fact that $A - B$ is Hermitian, $\|A - B\|_2 \geq |c_n|$ and $\|B - A\|_2 \geq |d_n|$. Since $\|B - A\|_2 = \|A - B\|_2$ we have that $\|A - B\|_2 \geq \max\{|c_n|, |d_n|\} \geq \max\{c_n, d_n\} \geq \max\{a_j - b_j, b_j - a_j\} = |a_j - b_j|$ for all $j$. Taking the maximum over $j$ we get that $\|A - B\|_2 \geq \max_j |a_j - b_j|$, and the lemma follows. □

The proof of Thm. 7 thus proceeds along the same steps as the proof of Theorem 6, using again the fact that, by Lemma 12, $\|M\| = \|MP\| = \|PM\|$ for any $P \in O_n$ and any matrix $M$, along with Lemma 14 in place of Lemma 13. □

D The Weisfeiler-Lehman (WL) Algorithm
The WL algorithm [60] is a graph isomorphism heuristic. To gain some intuition on the algorithm, note that two isomorphic graphs must have the same degree distribution. More broadly, the distributions of $k$-hop neighborhoods in the two graphs must also be identical. Building on this, to test if two undirected, unweighted graphs are isomorphic, WL colors the nodes of a graph $G(V,E)$ iteratively. At iteration 0, each node $v \in V$ receives the same color $c^0(v) := 1$. Colors at iteration $k+1 \in \mathbb{N}$ are defined recursively via $c^{k+1}(v) := \mathrm{hash}\left(\mathrm{sort}\left(\mathrm{clist}_v^k\right)\right)$, where hash is a perfect hash function, and $\mathrm{clist}_v^k = [c^k(u) : (u,v) \in E]$ is a list containing the colors of all of $v$'s neighbors at iteration $k$. Intuitively, two nodes in $V$ share the same color after $k$ iterations if their $k$-hop neighborhoods are isomorphic. WL terminates when the partition of $V$ induced by colors is stable from one iteration to the next. This coloring extends to weighted directed graphs by appending weights and directions to colors in $\mathrm{clist}_v^k$. After coloring two graphs $G_A$, $G_B$, WL declares a non-isomorphism if their color distributions differ. If not, then they may be isomorphic and WL gives a set of constraints on candidate isomorphisms: a permutation $P$ under which $AP = PB$ must map nodes in $G_A$ to nodes in $G_B$ of the same color.
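A minimal sketch of this coloring procedure for an unweighted, undirected graph given as an adjacency list, with Python's built-in tuple hash standing in for the perfect hash function (names and the iteration cap are our own):

```python
# Weisfeiler-Lehman node coloring: refine colors from sorted neighbor
# colors until the induced partition of V stops changing.
def wl_colors(adj, max_iters=10):
    n = len(adj)
    colors = [1] * n                                   # iteration 0: all nodes alike
    for _ in range(max_iters):
        new = [hash(tuple(sorted(colors[u] for u in adj[v])))
               for v in range(n)]
        # Stable if no old color class was split into several new ones.
        if len(set(new)) == len(set(colors)) and \
           len(set(zip(colors, new))) == len(set(colors)):
            break
        colors = new
    return colors

# A 4-cycle: every node looks alike, so one color class survives.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(len(set(wl_colors(adj))))                        # 1
```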
E Algorithms and Implementation Details

We outline here additional implementation details about the algorithms summarized in Table 1.

• NetAlignBP, IsoRank, SparseIsoRank, and NetAlignMR, for which code is publicly available [34], are described by [9]. Natalie is described in [22]; code is again available [21]. All five algorithms output $P \in P_n$.

• The algorithm in [43] outputs one $P \in P_n$ and one $P \in W_n$. We use $P \in P_n$ to compute $\|AP - PB\|_1$ and call this InnerPerm. We use $P \in W_n$ to compute $\|AP - PB\|_1$ and $\|AP - PB\|_F$ and call these algorithms InnerDSL1 and InnerDSL2, respectively. We use our own CVX-based projected gradient descent solver for the non-convex optimization problem the authors propose.

• DSL1 and DSL2 denote $d_S(A,B)$ when $S = W_n$ and $\|\cdot\|$ is $\|\cdot\|_1$ (entry-wise) and $\|\cdot\|_F$, respectively. We implement them in Matlab (using CVX) as well as in C, aimed at medium size graphs and multicore use. We also implemented a distributed version in Apache Spark [63] that scales to very large graphs over multiple machines based on the Alternating Direction Method of Multipliers [14].

• ORTHOP and ORTHFR denote $d_S(A,B)$ when $S = O_n$ and $\|\cdot\|$ is $\|\cdot\|_2$ (operator norm) and $\|\cdot\|_F$, respectively. We compute them using an eigendecomposition (see Appendix C).

• For small graphs, we compute $d_{P_n}(A,B)$ using our brute-force GPU-based code. For a single pair of graphs with $n \geq 15$ nodes, EXACT already takes several days to finish. For $\|\cdot\| = \|\cdot\|_1$ in $d_S$ (element-wise or matrix norm), we have implemented the chemical distance as an integer-valued LP and solved it using branch-and-cut. It did not scale well beyond small $n$.

• We implemented the WL algorithm over Spark to run, multithreaded, on a machine with 40 CPUs.

We use all public algorithms as black boxes with their default parameters, as provided by the authors.
F Graphs of Different Sizes
For simplicity, we described our framework for graphs of equal sizes. However, it can be extended in different ways to produce a metric for graphs of different sizes. These extensions all start by extending the two graphs, $G_A$ and $G_B$, with dummy nodes such that the new graphs $\bar{G}_A$ and $\bar{G}_B$ have the same number of nodes. If $G_A$ has $n_A$ nodes and $G_B$ has $n_B$ nodes we can, for example, add $n_B$ dummy nodes to $G_A$ and $n_A$ dummy nodes to $G_B$. Once we have $\bar{G}_A$ and $\bar{G}_B$ of equal size, we can use the methods we already described to compute a distance between $\bar{G}_A$ and $\bar{G}_B$, and return this distance as the distance between $G_A$ and $G_B$. The different ways of extending the graphs, described as Methods 1–3 below (see also the padding sketch that follows), differ in how the dummy nodes connect to existing graph nodes, how dummy nodes connect to themselves, and what kind of penalty we introduce for associating dummy nodes with existing graph nodes.
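A minimal sketch of the size-equalization step, shown here with Method 1's isolated dummy nodes (all names are our own; Methods 2 and 3 below differ only in how the padded block is filled):

```python
# Pad each adjacency matrix with dummy nodes so both graphs have
# n_A + n_B nodes; zero padding corresponds to isolated dummies.
import numpy as np

def pad(A, total):
    n = A.shape[0]
    out = np.zeros((total, total))
    out[:n, :n] = A
    return out

A = np.ones((3, 3)) - np.eye(3)    # triangle, n_A = 3
B = np.zeros((2, 2))               # two isolated nodes, n_B = 2
A_bar, B_bar = pad(A, 5), pad(B, 5)
print(A_bar.shape, B_bar.shape)    # both (5, 5)
```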
Method 1: One way of extending the graphs is to add dummy nodes and leave them isolated, i.e., with no edges to either existing nodes or other dummy nodes. Although this might work when both graphs are dense, it might lead to undesirable results when one of the graphs is sparse. For example, let $G_A$ be 3 isolated nodes and $G_B$ be the complete graph on 4 nodes minus the edges forming the triangle $\{(1,2),(2,3),(3,1)\}$. Let us assume that $S = P_n$, such that, when we compute the distance between $\bar{G}_A$ and $\bar{G}_B$, we produce an alignment between the graphs. One desirable outcome would be for $G_A$ to be aligned with the three nodes in $G_B$ that have no edges among them. This is basically solving the problem of finding a sparse subgraph inside a dense graph. However, computing $d_S(\bar{A}, \bar{B})$, where $\bar{A}$ and $\bar{B}$ are the extended adjacency matrices, could equally well align $G_A$ with the 3 dummy nodes of $\bar{G}_B$.
Method 2: Add dummy nodes and connect each dummy node to all existing nodes and all other dummy nodes. This avoids the issue described for Method 1 but creates a similar undesirable situation: since the dummy nodes in each extended graph form a clique, we might align $G_A$, or $G_B$, with just dummy nodes, instead of producing an alignment between existing nodes in $G_A$ and existing nodes in $G_B$.
Method 3: If both $G_A$ and $G_B$ are unweighted graphs, a method that avoids both issues above (aligning a sparse graph with isolated dummy nodes, or aligning a dense graph with cliques of dummy nodes) is to connect each dummy node to all existing nodes and all other dummy nodes with edges of weight 1/