A Family of Tractable Graph Distances
Jose Bento∗  Stratis Ioannidis†

∗ Boston College, [email protected]
† Northeastern University, [email protected]

Abstract
Important data mining problems such as nearest-neighbor search and clustering admit theoretical guarantees when restricted to objects embedded in a metric space. Graphs are ubiquitous, and clustering and classification over graphs arise in diverse areas, including, e.g., image processing and social networks. Unfortunately, popular distance scores used in these applications, that scale over large graphs, are not metrics and thus come with no guarantees. Classic graph distances such as, e.g., the chemical and the CKS distance are arguably natural and intuitive, and are indeed also metrics, but they are intractable: as such, their computation does not scale to large graphs. We define a broad family of graph distances, that includes both the chemical and the CKS distance, and prove that these are all metrics. Crucially, we show that our family includes metrics that are tractable. Moreover, we extend these distances by incorporating auxiliary node attributes, which is important in practice, while maintaining both the metric property and tractability.
Graph similarity and the related problem of graph isomorphism have a long history in data mining, machine learning, and pattern recognition [20, 44, 39]. Graph distances naturally arise in this literature: intuitively, given two (unlabeled) graphs, their distance is a score quantifying their structural differences. A highly desirable property for such a score is that it is a metric, i.e., it is non-negative, symmetric, positive-definite, and, crucially, satisfies the triangle inequality.
Metrics exhibit significant computational advantages over non-metrics. For example, operations such as nearest-neighbor search [19, 18, 10], clustering [3], outlier detection [7], and diameter computation [32] admit fast algorithms precisely when performed over objects embedded in a metric space. To this end, proposing tractable graph metrics is of paramount importance in applying such algorithms to graphs. Unfortunately, graph metrics of interest are often computationally expensive. A well-known example is the chemical distance [41]. Formally, given graphs $G_A$ and $G_B$, represented by their adjacency matrices $A, B \in \{0,1\}^{n \times n}$, the chemical distance $d_{P_n}(A,B)$ is defined in terms of a mapping between the two graphs that minimizes their edge discrepancies, i.e.:

$$d_{P_n}(A,B) = \min_{P \in P_n} \|AP - PB\|_F, \quad (1)$$

where $P_n$ is the set of permutation matrices of size $n$ and $\|\cdot\|_F$ is the Frobenius norm (see Sec. 2 for definitions). The Chartrand-Kubicki-Schultz (CKS) [17] distance is an alternative: CKS is again given by (1) but, instead of edges, matrices $A$ and $B$ contain the pairwise shortest path distances between any two nodes. The chemical and CKS distances have important properties. First, they are zero if and only if the graphs are isomorphic, which appeals to both intuition and practice; second, as desired, they are metrics; third, they have a natural interpretation, capturing global structural similarities between graphs. However, finding an optimal permutation $P$ is notoriously hard; graph isomorphism, which is equivalent to deciding if there exists a permutation $P$ s.t. $AP = PB$ (for both adjacency and path matrices), is famously a problem that is neither known to be in P nor shown to be NP-hard [8]. There is a large and expanding literature on scalable heuristics to estimate the optimal permutation $P$ [35, 9, 43, 22]. Despite their computational advantages, unfortunately, using them to approximate $d_{P_n}(A,B)$ breaks the metric property. This significantly degrades the performance of many important tasks that rely on computing distances between graphs. For example, there is a clear separation on the approximability of clustering over metric and non-metric spaces [3]. We also demonstrate this empirically in Section 5 (c.f. Fig. 1): attempting to cluster graphs sampled from well-known families based on non-metric distances significantly increases the misclassification rate, compared to clustering using metrics.

An additional issue that arises in practice is that nodes often have attributes not associated with adjacency. For example, in social networks, nodes may contain profiles with a user's age or gender; similarly, nodes in molecules may be labeled by atomic numbers. Such attributes are not captured by the chemical or CKS distances. However, in such cases, only label-preserving permutations $P$ may make sense (e.g., mapping females to females, oxygens to oxygens, etc.). Incorporating attributes while preserving the metric property is thus important from a practical perspective.

Contributions.
We seek generalizations of the chemical and CKS distances that (a) satisfy the metric property and (b) are tractable: by this, we mean that they can be computed either by solving a convex optimization problem, or by a polynomial time algorithm. Specifically, we study generalizations of (1) of the form:

$$d_S(A,B) = \min_{P \in S} \|AP - PB\|, \quad (2)$$

where $S \subset \mathbb{R}^{n \times n}$ is closed and bounded, $\|\cdot\|$ is a matrix norm, and $A, B \in \mathbb{R}^{n \times n}$ are arbitrary real matrices (representing adjacency, path distances, weights, etc.). We make the following contributions:

• We prove sufficient conditions on $S$ and norm $\|\cdot\|$ for which (2) is a metric. In particular, we show that $d_S$ is a so-called pseudometric (see Sec. 2) when: (i) $S = P_n$ and $\|\cdot\|$ is any entry-wise or operator norm; (ii) $S = W_n$, the set of doubly stochastic matrices, $\|\cdot\|$ is an arbitrary entry-wise norm, and $A, B$ are symmetric; a modification of $d_S$ extends this result to both operator norms as well as arbitrary matrices (capturing, e.g., directed graphs); and (iii) $S = O_n$, the set of orthogonal matrices, and $\|\cdot\|$ is the operator or entry-wise 2-norm. Relaxations (ii) and (iii) are very important from a practical standpoint. For all matrix norms, computing (2) with $S = W_n$ is tractable, as it is a convex optimization. For $S = O_n$, (2) is non-convex but is still tractable, as it reduces to a spectral decomposition. This was known for the Frobenius norm [57]; we prove this is the case for the operator 2-norm also.

• We include node attributes in a natural way in the definition of $d_S$ as both soft (i.e., penalties in the objective) or hard constraints in Eq. (2). Crucially, we do this without affecting the metric property and tractability. This allows us to explore label or feature preserving permutations, that incorporate both (a) exogenous node attributes, such as, e.g., user age or gender in a social network, as well as (b) endogenous, structural features of each node, such as its degree or the number of triangles that pass through it. We numerically show that adding these constraints can speed up the computation of $d_S$.

From an experimental standpoint, we extensively compare our tractable metrics to several existing heuristic approximations. We also demonstrate the tractability of our metrics by parallelizing their execution using the alternating direction method of multipliers [14], which we implement over a compute cluster using Apache Spark [63].

Related Work.
Graph distance (or similarity) scores find applications in varied fields such as image processing [20], chemistry [6, 41], and social network analysis [44, 39]. Graph distances are easy to define when, contrary to our setting, the correspondence between graph nodes is known, i.e., graphs are labeled [47, 39, 56]. Beyond the chemical distance, classic examples of distances between unlabeled graphs are the edit distance [27, 52] and the maximum common subgraph distance [16, 15], both of which also have versions for labeled graphs. Both are metrics and are hard to compute, while existing heuristics [49, 25] are not metrics. The reaction distance [37] is also a metric directly related to the chemical distance [41] when edits are restricted to edge additions and deletions. Jain [33] also considers an extension of the chemical distance, limited to the Frobenius norm, that incorporates edge attributes. However, it is not immediately clear how to relax the above metrics [33, 37] to attain tractability.

A metric can also be induced by embedding graphs in a metric space and measuring the distance of these embeddings [51, 26, 50]. Several works follow such an approach, mapping graphs, e.g., to spaces determined by their spectral decomposition [64, 61, 23]. In general, in contrast to our metrics, such approaches are not as discriminative, as embeddings summarize graph structure. Continuous relaxations of graph isomorphism, both convex and non-convex [43, 4, 57], have found applications in a variety of contexts, including social networks [38], computer vision [53], shape detection [54, 29], and neuroscience [58]. None of the above works focus on metric properties of resulting relaxations, which several fail to satisfy [58, 38, 54, 29]. Metrics naturally arise in data mining tasks, including clustering [62, 28], NN search [19, 18, 10], and outlier detection [7]. Some of these tasks become tractable or admit formal guarantees precisely when performed over a metric space. For example, finding the nearest neighbor [19, 18, 10] or the diameter of a dataset [32] become polylogarithmic under metric assumptions; similarly, approximation algorithms for clustering (which is NP-hard) rely on metric assumptions, whose absence leads to a deterioration of known bounds [3]. Our search for metrics is motivated by these considerations.
Graphs. We represent an undirected graph $G(V,E)$ with node set $V = [n] \equiv \{1, \ldots, n\}$ and edge set $E \subseteq [n] \times [n]$ by its adjacency matrix, i.e., $A = [a_{ij}]_{i,j \in [n]} \in \{0,1\}^{n \times n}$ s.t. $a_{ij} = a_{ji} = 1$ if and only if $(i,j) \in E$. In particular, $A$ is symmetric, i.e., $A = A^\top$. We denote the set of all real, symmetric matrices by $S_n$. Directed graphs are represented by (possibly non-symmetric) binary matrices $A \in \{0,1\}^{n \times n}$, and weighted graphs by real matrices $A \in \mathbb{R}^{n \times n}$.

Matrix Norms.
Given a matrix $A = [a_{ij}]_{i,j \in [n]} \in \mathbb{R}^{n \times n}$ and a $p \in \mathbb{N}_+ \cup \{\infty\}$, its induced or operator $p$-norm is defined in terms of the vector $p$-norm through $\|A\|_p = \sup_{x \in \mathbb{R}^n: \|x\|_p = 1} \|Ax\|_p$, while its entry-wise $p$-norm is given by $\|A\|_p = \left(\sum_{i=1}^n \sum_{j=1}^n |a_{ij}|^p\right)^{1/p}$ for $p \in \mathbb{N}_+$, and $\|A\|_\infty = \max_{i,j} |a_{ij}|$. We denote the entry-wise 2-norm (i.e., the Frobenius norm) as $\|\cdot\|_F$.
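For concreteness, here is a minimal numpy sketch, added for exposition only, that contrasts the two norm families on a small matrix; numpy's `norm` with `ord=2` on a matrix gives the operator 2-norm, while the entry-wise norms reduce to elementary operations:

```python
# Operator vs. entry-wise matrix norms on a 2x2 example.
import numpy as np

A = np.array([[0., 1.], [1., 0.]])

op_2 = np.linalg.norm(A, ord=2)        # operator 2-norm: largest singular value
fro = np.linalg.norm(A, ord='fro')     # entry-wise 2-norm (Frobenius)
entry_1 = np.abs(A).sum()              # entry-wise 1-norm
entry_inf = np.abs(A).max()            # entry-wise infinity-norm: max |a_ij|

print(op_2, fro, entry_1, entry_inf)   # 1.0 1.414... 2.0 1.0
```

Note that the operator and entry-wise $p$-norms generally differ for the same $p$ (here, $1.0$ versus $\sqrt{2}$ for $p = 2$).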
Permutation, Doubly Stochastic, and Orthogonal Matrices. We denote the set of permutation matrices as $P_n = \{P \in \{0,1\}^{n \times n} : P\mathbf{1} = \mathbf{1}, P^\top\mathbf{1} = \mathbf{1}\}$, the set of doubly stochastic matrices (i.e., the Birkhoff polytope) as $W_n = \{W \in [0,1]^{n \times n} : W\mathbf{1} = \mathbf{1}, W^\top\mathbf{1} = \mathbf{1}\}$, and the set of orthogonal matrices (i.e., the Stiefel manifold) as $O_n = \{U \in \mathbb{R}^{n \times n} : UU^\top = U^\top U = I\}$. Note that $P_n = W_n \cap O_n$. Moreover, the Birkhoff-von Neumann Theorem [11] states that $W_n = \mathrm{conv}(P_n)$, i.e., the Birkhoff polytope is the convex hull of $P_n$.
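As a quick illustration of the relation $P_n = W_n \cap O_n$, the following sketch (our own, added for exposition) verifies both memberships for one permutation matrix:

```python
# A permutation matrix is both doubly stochastic and orthogonal.
import numpy as np

P = np.eye(3)[[2, 0, 1]]   # permutation matrix mapping (0,1,2) -> (2,0,1)

doubly_stochastic = (P >= 0).all() and \
    np.allclose(P.sum(axis=0), 1) and np.allclose(P.sum(axis=1), 1)
orthogonal = np.allclose(P @ P.T, np.eye(3)) and np.allclose(P.T @ P, np.eye(3))

print(doubly_stochastic, orthogonal)   # True True
```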
Metrics. Given a set $\Omega$, a function $d: \Omega \times \Omega \to \mathbb{R}$ is called a metric, and the pair $(\Omega, d)$ is called a metric space, if for all $x, y, z \in \Omega$:

$d(x,y) \geq 0$ (non-negativity) (3a)
$d(x,y) = 0$ iff $x = y$ (pos. definiteness) (3b)
$d(x,y) = d(y,x)$ (symmetry) (3c)
$d(x,y) \leq d(x,z) + d(z,y)$ (triangle inequality) (3d)

A function $d$ is called a pseudometric if it satisfies (3a), (3c), and (3d), but the positive definiteness property (3b) is replaced by the (weaker) property:

$d(x,x) = 0$ for all $x \in \Omega$. (3e)

If $d$ is a pseudometric, then $d(x,y) = 0$ defines an equivalence relation $x \sim_d y$ over $\Omega$. A pseudometric is then a metric over $\Omega/\!\sim_d$, the quotient space of $\sim_d$. A $d$ that satisfies (3a), (3b), and (3d) but not the symmetry property (3c) is called a quasimetric. If $d$ is a quasimetric, then its symmetric extension $\bar{d}: \Omega \times \Omega \to \mathbb{R}$, defined as $\bar{d}(x,y) = d(x,y) + d(y,x)$, is a metric over $\Omega$.

Graph Isomorphism, Chemical, and CKS Distance.
Let $A, B \in \mathbb{R}^{n \times n}$ be the adjacency matrices of two graphs $G_A$ and $G_B$. Then, $G_A$ and $G_B$ are isomorphic if and only if there exists $P \in P_n$ such that $P^\top A P = B$ or, equivalently, $AP = PB$. The chemical distance, given by (1), extends the latter relationship to capture distances between graphs. Let $\|\cdot\|$ be a matrix norm in $\mathbb{R}^{n \times n}$. For some $\Omega \subseteq \mathbb{R}^{n \times n}$, define $d_S: \Omega \times \Omega \to \mathbb{R}_+$ as:

$$d_S(A,B) = \min_{P \in S} \|AP - PB\|, \quad (4)$$

where $S \subset \mathbb{R}^{n \times n}$ is a closed and bounded set, so that the infimum is indeed attained. Note that $d_S$ is the chemical distance (1) when $\Omega = \mathbb{R}^{n \times n}$, $S = P_n$, and $\|\cdot\| = \|\cdot\|_F$. In the CKS distance [17], matrices $A, B$ contain pairwise path distances between any two nodes; equivalently, CKS is the chemical distance of two weighted complete graphs with path distances as edge weights. Our main contribution is determining general conditions on $S$ and $\|\cdot\|$ under which $d_S$ is a metric over $\Omega$, for arbitrary weighted graphs, thereby including both the chemical and CKS distances as special cases. For concreteness, we focus on distances between graphs of equal size. Extensions to graphs of unequal size are described in Appendix F.

Our first result establishes that $d_{P_n}$ is a pseudometric over all weighted graphs when $\|\cdot\|$ is an arbitrary entry-wise or operator norm.

Theorem 1. If $S = P_n$ and $\|\cdot\|$ is an arbitrary entry-wise or operator norm, then $d_S$ given by (4) is a pseudometric over $\Omega = \mathbb{R}^{n \times n}$.

Hence, $d_{P_n}$ is a pseudometric under any entry-wise or operator norm over arbitrary directed, weighted graphs. Our second result states that this property extends to the relaxed version of the chemical distance, in which permutations are replaced by doubly stochastic matrices.

Theorem 2. If $S = W_n$ and $\|\cdot\|$ is an arbitrary entry-wise norm, then $d_S$ given by (4) is a pseudometric over $\Omega = S_n$. If $\|\cdot\|$ is an arbitrary entry-wise or operator norm, then its symmetric extension $\bar{d}_S(A,B) = d_S(A,B) + d_S(B,A)$ is a pseudometric over $\Omega = \mathbb{R}^{n \times n}$.

Hence, if $S = W_n$ and $\|\cdot\|$ is an arbitrary entry-wise norm, then (4) defines a pseudometric over undirected graphs. The symmetry property (3c) breaks if $\|\cdot\|$ is an operator norm or graphs are directed. In either case, $d_S$ is a quasimetric over the quotient space $\Omega/\!\sim_d$, and symmetry is attained via the symmetric extension $\bar{d}_S$.

Theorem 2 has significant practical implications. In contrast to $d_{P_n}$ and its extensions implied by Theorem 1, computing $d_{W_n}$ under any operator or entry-wise norm is tractable [13]: it involves minimizing a convex function subject to linear constraints. A more limited result extends to the Stiefel manifold:

Theorem 3. If $S = O_n$ and $\|\cdot\|$ is either the operator or the entry-wise (i.e., Frobenius) 2-norm, then $d_S$ given by (4) is a pseudometric over $\Omega = \mathbb{R}^{n \times n}$.

Though (4) is not a convex problem when $S = O_n$, it is also tractable. Umeyama [57] shows that the optimization can be solved exactly when $\|\cdot\| = \|\cdot\|_F$ and $\Omega = S_n$ (i.e., for undirected graphs) by performing a spectral decomposition on $A$ and $B$. We extend this result, showing that the same procedure also applies when $\|\cdot\|$ is the operator 2-norm (see Thm. 7 in Appendix C). In the general case of directed graphs, (4) is a classic example of a problem that can be solved through optimization on manifolds [2].
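To make the convexity claim of Theorem 2 concrete, here is a minimal sketch of computing $d_{W_n}$ with the Frobenius norm using the cvxpy modeling library; the function name is our own, and this toy stand-in is not the authors' optimized CVX/C/Spark implementations described in Appendix E:

```python
# Convex relaxation d_{W_n} of the chemical distance (Frobenius norm):
# minimize ||AP - PB||_F over the Birkhoff polytope W_n.
import cvxpy as cp
import numpy as np

def d_Wn(A, B):
    n = A.shape[0]
    P = cp.Variable((n, n), nonneg=True)       # nonnegative entries
    constraints = [cp.sum(P, axis=0) == 1,     # columns sum to 1
                   cp.sum(P, axis=1) == 1]     # rows sum to 1
    objective = cp.Minimize(cp.norm(A @ P - P @ B, 'fro'))
    return cp.Problem(objective, constraints).solve()

# Two isomorphic path graphs should be at distance 0.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
perm = [1, 2, 0]
B = A[np.ix_(perm, perm)]                      # relabeled copy of A
print(round(d_Wn(A, B), 6))                    # ~0.0
```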
Equivalence Classes. The equivalence of matrix norms implies that all pseudometrics $d_S$ defined through (4) for a given $S$ have the same quotient space $\Omega/\!\sim_{d_S}$: if $d_S(A,B) = 0$ for one matrix norm $\|\cdot\|$ in (4), it will be so for all. When $S = P_n$, $\Omega/\!\sim_{d_{P_n}}$ is the quotient space defined by graph isomorphism: any two adjacency matrices $A, B \in \mathbb{R}^{n \times n}$ satisfy $d_{P_n}(A,B) = 0$ if and only if their (possibly weighted) graphs are isomorphic. When $S = W_n$, the quotient space $\Omega/\!\sim_{d_{W_n}}$ has a connection to the Weisfeiler-Lehman (WL) algorithm [60] described in Appendix D: Ramana et al. [48] show that $d_{W_n}(A,B) = 0$ if and only if $G_A$ and $G_B$ receive identical colors by the WL algorithm. If $S = O_n$ and $\Omega = S_n$, i.e., graphs are undirected, then $\Omega/\!\sim_{d_{O_n}}$ is determined by co-spectrality: $d_{O_n}(A,B) = 0$ if and only if $A, B$ have the same spectrum. When $\Omega = \mathbb{R}^{n \times n}$, $d_{O_n}(A,B) = 0$ implies that $A, B$ are co-spectral, but co-spectral matrices $A, B$ do not necessarily satisfy $d_{O_n}(A,B) = 0$.

Proof of Theorems 1–3.
We define several properties that play a crucial role in our proofs. We say that a set $S \subseteq \mathbb{R}^{n \times n}$ is closed under multiplication if $P_1, P_2 \in S$ implies that $P_1 P_2 \in S$. We say that $S$ is closed under transposition if $P \in S$ implies that $P^\top \in S$, and closed under inversion if $P \in S$ implies that $P^{-1} \in S$. Finally, given a matrix norm $\|\cdot\|$, we say that set $S$ is contractive w.r.t. $\|\cdot\|$ if $\|AP\| \leq \|A\|$ and $\|PA\| \leq \|A\|$, for all $P \in S$ and $A \in \mathbb{R}^{n \times n}$. Put differently, $S$ is contractive if and only if every $P \in S$ is a contraction w.r.t. $\|\cdot\|$. We rely on several lemmas, whose proofs can be found in Appendix A. The first three establish conditions under which (4) satisfies the triangle inequality (3d), symmetry (3c), and weak property (3e), respectively:
Lemma 1. Given a matrix norm $\|\cdot\|$, suppose that set $S$ is (a) contractive w.r.t. $\|\cdot\|$, and (b) closed under multiplication. Then, for any $A, B, C \in \mathbb{R}^{n \times n}$, $d_S$ given by (4) satisfies $d_S(A,C) \leq d_S(A,B) + d_S(B,C)$.

Lemma 2. Given a matrix norm $\|\cdot\|$, suppose that $S \subset \mathbb{R}^{n \times n}$ is (a) contractive w.r.t. $\|\cdot\|$, and (b) closed under inversion. Then, for all $A, B \in \mathbb{R}^{n \times n}$, $d_S(A,B) = d_S(B,A)$.

Lemma 3. If $I \in S$, then $d_S(A,A) = 0$ for all $A \in \mathbb{R}^{n \times n}$.

Both the set of permutation matrices $P_n$ and the Stiefel manifold $O_n$ are groups w.r.t. matrix multiplication: they are closed under multiplication, contain the identity $I$, and are closed under inversion. Hence, if they are also contractive w.r.t. a matrix norm $\|\cdot\|$, $d_{P_n}$ and $d_{O_n}$ defined in terms of this norm satisfy all assumptions of Lemmas 1–3. We therefore turn our attention to this property.

Lemma 4. Let $\|\cdot\|$ be any operator or entry-wise norm. Then, $S = P_n$ is contractive w.r.t. $\|\cdot\|$.

Hence, Theorem 1 follows as a direct corollary of Lemmas 1–4. Indeed, $d_{P_n}$ is non-negative, symmetric by Lemmas 2 and 4, satisfies the triangle inequality by Lemmas 1 and 4, as well as property (3e) by Lemma 3; hence $d_{P_n}$ is a pseudometric over $\mathbb{R}^{n \times n}$. Our next lemma shows that the Stiefel manifold $O_n$ is contractive for 2-norms:

Lemma 5. Let $\|\cdot\|$ be the operator 2-norm or the Frobenius norm. Then, $S = O_n$ is contractive w.r.t. $\|\cdot\|$.

Theorem 3 follows from Lemmas 1–3 and Lemma 5, along with the fact that $O_n$ is a group. Note that $O_n$ is not contractive w.r.t. other norms, e.g., $\|\cdot\|_1$ or $\|\cdot\|_\infty$. Lemma 4 along with the Birkhoff-von Neumann theorem imply that $W_n$ is also contractive:

Lemma 6. Let $\|\cdot\|$ be any operator or entry-wise norm. Then, $W_n$ is contractive w.r.t. $\|\cdot\|$.

The Birkhoff polytope $W_n$ is not a group, as it is not closed under inversion. Nevertheless, it is closed under transposition; in establishing (partial) symmetry of $d_{W_n}$, we leverage the following lemma:

Lemma 7. Suppose that $\|\cdot\|$ is transpose invariant, and $S$ is closed under transposition. Then, $d_S(A,B) = d_S(B,A)$ for all $A, B \in S_n$.

The first part of Theorem 2 therefore follows from Lemmas 1, 3, and 6, as $W_n$ is closed under transposition, contains the identity $I$, and is closed under multiplication, while all entry-wise norms are transpose invariant. Operator norms are not transpose invariant. However, if $\|\cdot\|$ is an operator norm, or $\Omega = \mathbb{R}^{n \times n}$, then Lemma 6 and Lemma 1 imply that $d_{W_n}$ satisfies non-negativity (3a) and the triangle inequality (3d), while Lemma 3 implies that it satisfies (3e). These properties are inherited by the extension $\bar{d}_S$, which also satisfies symmetry (3c), and Theorem 2 follows. □

We have seen that the chemical distance $d_{P_n}$ can be relaxed to $d_{W_n}$ or $d_{O_n}$, gaining tractability while still maintaining the metric property. In practice, nodes in a graph often contain additional attributes that one might wish to leverage when computing distances. In this section, we show that such attributes can be seamlessly incorporated in $d_S$ either as soft or hard constraints, without violating the metric property.

Metric Embeddings. Given a graph $G_A$ of size $n$, a metric embedding of $G_A$ is a mapping $\psi_A: [n] \to \tilde{\Omega}$ from the nodes of the graph to a metric space $(\tilde{\Omega}, \tilde{d})$. That is, $\psi_A$ maps nodes of the graph to $\tilde{\Omega}$, where $\tilde{\Omega}$ is endowed with a metric $\tilde{d}$. We refer to a graph endowed with an embedding $\psi_A$ as an embedded graph, and denote this by $(A, \psi_A)$, where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix of $G_A$. We list two examples:

Example 1: Node Attributes.
Consider an embedding of a graph to $(\mathbb{R}^k, \|\cdot\|)$ in which every node $v \in V$ is mapped to a $k$-dimensional vector describing "local" attributes. These can be exogenous: e.g., features extracted from a user's profile (age, binarized gender, etc.) in a social network. Alternatively, attributes may be endogenous or structural, extracted from the adjacency matrix $A$, e.g., the node's degree, the size of its $k$-hop neighborhood, its page-rank, etc.

Example 2: Node Colors. Let $\tilde{\Omega}$ be an arbitrary finite set endowed with the Kronecker delta as a metric, that is, for $s, s' \in \tilde{\Omega}$, $\tilde{d}(s,s') = 0$ if $s = s'$, while $\tilde{d}(s,s') = \infty$ if $s \neq s'$. Given a graph $G_A$, a mapping $\psi_A: [n] \to \tilde{\Omega}$ is then a metric embedding. The values of $\tilde{\Omega}$ are invariably called colors or labels, and a graph embedded in $\tilde{\Omega}$ is a colored or labeled graph. Colors can again be exogenous or structural: e.g., if the graph represents an organic molecule, colors can correspond to atoms, while structural colors can be, e.g., the output of the WL algorithm (see Appendix D) after $k$ iterations.

As discussed below, node attributes translate to soft constraints in metric (4), while node colors correspond to hard constraints. The unified view through embeddings allows us to establish metric properties for both simultaneously (c.f. Thm. 4 and 5).

Embedding Distance.
Consider two embedded graphs $(A, \psi_A), (B, \psi_B)$ of size $n$ that are embedded in the same metric space $(\tilde{\Omega}, \tilde{d})$. For $u \in [n]$ a node in the first graph, and $v \in [n]$ a node in the second graph, the embedded distance between the two nodes is given by $\tilde{d}(\psi_A(u), \psi_B(v))$. Let $D_{\psi_A,\psi_B} = [\tilde{d}(\psi_A(u), \psi_B(v))]_{u \in V, v \in V} \in \mathbb{R}_+^{n \times n}$ be the corresponding matrix of embedded distances. After mapping nodes to the same metric space, it is natural to seek $P \in P_n$ that preserve the embedding distance. This amounts to finding a $P \in P_n$ that minimizes:

$$\mathrm{tr}\left(P^\top D_{\psi_A,\psi_B}\right) = \sum_{u,v \in [n]} P_{u,v}\, \tilde{d}(\psi_A(u), \psi_B(v)). \quad (5)$$

Note that, in the case of colored graphs and the Kronecker delta distance, minimizing (5) finds a $P \in P_n$ that maps nodes in $G_A$ to nodes in $G_B$ of equal color. It is not hard to verify that $\min_{P \in P_n} \mathrm{tr}(P^\top D_{\psi_A,\psi_B})$ induces a metric between graphs embedded in $(\tilde{\Omega}, \tilde{d})$ (this follows from Thm. 4 below with $A = B = 0$, i.e., for distances between embedded graphs with no edges). Despite the combinatorial nature of $P_n$, (5) is a weighted matching (assignment) problem, which can be solved through, e.g., the Hungarian algorithm [40] in time polynomial in $n$. We note that this metric is not as expressive as (4): depending on the definition of the embeddings $\psi_A, \psi_B$, attributes may only capture "local" similarities between nodes, as opposed to the "global" view of a mapping attained by (4).
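As an illustration, the sketch below solves the assignment problem (5) with scipy's `linear_sum_assignment` in place of a hand-rolled Hungarian algorithm; the scalar attribute values are hypothetical:

```python
# Minimize tr(P^T D) over permutations: a linear assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

# D[u, v] = embedded distance between node u of G_A and node v of G_B,
# e.g. |age_A[u] - age_B[v]| for a scalar node attribute.
age_A = np.array([21., 35., 50.])
age_B = np.array([49., 22., 34.])
D = np.abs(age_A[:, None] - age_B[None, :])

rows, cols = linear_sum_assignment(D)   # optimal P as a row -> column map
print(cols, D[rows, cols].sum())        # [1 2 0] and the minimal tr(P^T D) = 3.0
```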
A Unified, Tractable Metric. Motivated by the above considerations, we focus on unifying the "global" metric (4) with the "local" metrics induced by arbitrary graph embeddings. Proofs for the two theorems below are provided in the supplement. Given a metric space $(\tilde{\Omega}, \tilde{d})$, let $\Psi_{\tilde{\Omega}}^n = \{\psi: [n] \to \tilde{\Omega}\}$ be the set of all mappings from $[n]$ to $\tilde{\Omega}$. Then, given two embedded graphs $(A, \psi_A), (B, \psi_B) \in \mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$, we define:

$$d_S((A,\psi_A),(B,\psi_B)) = \min_{P \in S} \left[ \|AP - PB\| + \mathrm{tr}\left(P^\top D_{\psi_A,\psi_B}\right) \right] \quad (6)$$

for some compact set $S \subset \mathbb{R}^{n \times n}$ and matrix norm $\|\cdot\|$. Our next result states that incorporating this linear term does not affect the pseudometric property of $d_S$.

Theorem 4. If $S = P_n$ and $\|\cdot\|$ is an arbitrary entry-wise or operator norm, then $d_S$ given by (6) is a pseudometric over the set of embedded graphs $\Omega = \mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$.

We stress here that this result is non-obvious: it is not true that adding any linear term to $d_S$ leads to a quantity that satisfies the triangle inequality. It is precisely because $D_{\psi_A,\psi_B}$ contains pairwise distances that Theorem 4 holds. We can similarly extend Theorem 2:
[Figure 1 appears here; panels: (a) Clustering Misclassification Error; (b) TIVs, n = 10; (c) TIVs, n = 50; (d) Synthetic Graph Classes (table below); (e) TIVs, small graphs.]

(d) Synthetic Graph Classes:
  B_d : Barabási-Albert of degree d [5]
  E_p : Erdős-Rényi with probability p [24]
  P   : Power Law Tree [45]
  R_d : Regular Graph of degree d [12]
  S   : Small World [36]
  W_d : Watts-Strogatz of degree d [59]
Figure 1: A clustering experiment using metrics and non-metrics (y-axis) for different clustering parameters (x-axis) is shown in (a), left. We sample graphs with n = 50 nodes from the six classes shown in the adjacent table in (d), bottom-center. We compute distances between them using nine different algorithms from Table 1. Only the distances in our family (DSL1, DSL2, ORTHOP, and ORTHFR) are metrics. The resulting graphs are clustered using hierarchical agglomerative clustering [28] using Average, Centroid, Complete, Median, Single, Ward, and Weighted as a means of merging clusters. Colors represent the fraction of misclassified graphs, with the minimal misclassification rate per distance labeled explicitly. Metrics outperform other distance scores across all clustering methods. The error rate of a random guess is ≈ 0.8. Subfigures (b) and (c), top center and right, show that non-metric distances produce triangle inequality violations (TIVs) which contribute to poor clustering results; the figure shows the fraction of TIVs within different 10-node and 50-node graph families under these algorithms. Finally, subfigure (e), bottom right, shows the fraction of triangle inequality violations for different algorithms on the small graphs dataset of all 7-node graphs.
Theorem 5. If $S = W_n$ and $\|\cdot\|$ is an arbitrary entry-wise norm, then $d_S$ given by (6) is a pseudometric over $\Omega = S_n \times \Psi_{\tilde{\Omega}}^n$, the set of symmetric graphs embedded in $(\tilde{\Omega}, \tilde{d})$. Moreover, if $\|\cdot\|$ is an arbitrary entry-wise or operator norm, then the symmetric extension $\bar{d}_S$ of (6) is a pseudometric over $\Omega = \mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$.

Adding the linear term (5) in $d_S$ has significant practical advantages. Beyond expressing exogenous attributes, a linear term involving colors, combined with a Kronecker distance, translates into hard constraints: any permutation attaining a finite objective value must map nodes in one graph to nodes of the same color. Theorem 5 therefore implies that such constraints can be added to the optimization problem while maintaining the metric property. (Note that adding (5) to $d_{O_n}$ does not preserve the metric property.) In practice, as the number of variables in optimization problem (4) is $n^2$, incorporating such hard constraints can significantly reduce the problem's computation time; we illustrate this in the next section.
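The following sketch (our own, with hypothetical color values) illustrates how node colors and the Kronecker-delta metric induce such hard constraints: entries of $D_{\psi_A,\psi_B}$ are 0 for matching colors and $\infty$ otherwise, so only a subset of the entries of $P$ remain free:

```python
# Node colors as hard constraints: a finite objective in (6) forces P
# to be supported only where colors agree; solvers can fix the other
# entries of P to zero, shrinking the problem.
import numpy as np

colors_A = np.array([0, 0, 1, 2])   # e.g., WL colors or atom types
colors_B = np.array([0, 1, 2, 0])

D = np.where(colors_A[:, None] == colors_B[None, :], 0.0, np.inf)
support = np.argwhere(np.isfinite(D))   # the only (u, v) pairs P may match

print(len(support), "of", D.size, "entries of P remain free")   # 6 of 16
```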
Graphs. We use synthetic graphs from six classes summarized in the table in Fig. 1(d). In addition, we use a dataset of small graphs, comprising all 853 connected graphs of 7 nodes [46]. Finally, we use a collaboration graph with 5242 nodes and 14496 edges representing author collaborations [42].
Algorithms.
We compare our metrics to several competitors outlined in Table 1 (see also Appendix E). All receive only two unlabeled undirected simple graphs $A$ and $B$ and output a matching matrix $\hat{P}$, either in $W_n$ or in $P_n$, estimating $P^*$. If $\hat{P} \in P_n$, we compute $\|A\hat{P} - \hat{P}B\|$. If $\hat{P} \in W_n$, then we compute both $\|A\hat{P} - \hat{P}B\|_1$ and $\|A\hat{P} - \hat{P}B\|_F$; all norms are entry-wise. We also implement our two relaxations $d_{W_n}$ and $d_{O_n}$, each for two different matrix norm combinations.
Figure 2: (a) Effect of introducing TIVs on the performance of different algorithms on the clustering experiment of Figure 1(a) when using the Ward method. (b) Cosine similarity between the Laplacian of distances produced by each algorithm and the one by EXACT. (c) Distance between nearest neighbor (NN) graphs induced by different algorithms and the NN graph induced by EXACT.

Table 1: Competitor Distance Scores & Our Metrics

(Non-metric) Distance Score Algorithms:
  NetAlignBP      Network Alignment using Belief Propagation [9, 34]
  IsoRank         Neighborhood Topology Isomorphism using PageRank [55, 34]
  SparseIsoRank   Neighborhood Topology Sparse Isomorphism using PageRank [9, 34]
  InnerPerm       Inner Product Matching with Permutations [43]
  InnerDSL1       Inner Product Matching with Matrices in W_n and entry-wise 1-norm [43]
  InnerDSL2       Inner Product Matching with Matrices in W_n and Frobenius norm [43]
  NetAlignMR      Iterative Matching Relaxation [35, 34]
  Natalie (v2.0)  Improved Iterative Matching Relaxation [22, 21]

Metrics from our Family (4):
  EXACT    Chemical Distance via brute-force search over GPU
  DSL1     Doubly Stochastic Chemical Distance d_{W_n} with entry-wise 1-norm
  DSL2     Doubly Stochastic Chemical Distance d_{W_n} with Frobenius norm
  ORTHOP   Orthogonal Relaxation of Chemical Distance d_{O_n} with operator 2-norm
  ORTHFR   Orthogonal Relaxation of Chemical Distance d_{O_n} with Frobenius norm

Figure 3: (a) Effect of coloring/hard constraints on the number of variables and the number of terms of the objective $\|AP - PA\|$ using $k$ iterations of the WL coloring algorithm; the last column shows the execution time $\tau$ of WL on a 40-CPU machine using Apache Spark [63]. (b) Convergence of the ADMM algorithm [14] computing DSL2 on two copies of the collaboration graph as a function of time, implemented using Apache Spark [63] on a 40-CPU machine.

Clustering Graphs.
The difference between our metrics and non-metrics is striking when clustering graphs. This is illustrated by the clustering experiment shown in Fig. 1(a). Graphs of size n = 50 from the 6 classes in Fig. 1(d) are clustered together through hierarchical agglomerative clustering. We compute distances between them using nine different algorithms; only the distances in our family (DSL1, DSL2, ORTHOP, and ORTHFR) are metrics. The quality of the clusters induced by our metrics is far superior to clusters induced by non-metrics; in fact, ORTHOP and ORTHFR can lead to no misclassifications. This experiment strongly suggests that our metrics correctly capture the topology of the metric space between these larger graphs.
Triangle Inequality Violations (TIV).
Given graphs $A$, $B$, and $C$ and a distance $d$, a TIV occurs when $d(A,C) > d(A,B) + d(B,C)$. Being metrics, none of our distances induce TIVs; this is not the case for the remaining algorithms in Table 1. Fig. 1(b) and (c) show the TIV fraction across the synthetic graphs of Fig. 1(d), while Fig. 1(e) shows the fraction of TIVs found on the 853 small graphs (n = 7). NetAlignMR also produces no TIVs on the small graphs, but it does induce TIVs in synthetic graphs. We observe that it is easier to find TIVs when graphs are close: in synthetic graphs, TIVs abound for n = 10. No algorithm performs well across all categories of graphs.
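A brute-force check for TIVs in a (hypothetical) distance matrix, matching the definition above, can be sketched as follows:

```python
# Count triangle inequality violations in a distance matrix, where
# dist[i, j] is the score between graphs i and j; true metrics from
# our family should always report zero.
import itertools
import numpy as np

def count_tivs(dist):
    n = dist.shape[0]
    tivs = 0
    for a, b, c in itertools.permutations(range(n), 3):
        if dist[a, c] > dist[a, b] + dist[b, c] + 1e-9:  # float tolerance
            tivs += 1
    return tivs

dist = np.array([[0., 1., 5.], [1., 0., 1.], [5., 1., 0.]])
print(count_tivs(dist))   # 2: d(0,2) > d(0,1) + d(1,2), found in both orders
```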
Next, to investigate the effect of TIVs on clustering, we artificially introduced triangle inequality violations into the pairs of distances between graphs. We then re-evaluated clustering performance for hierarchical agglomerative clustering using the Ward method, which performed best in Fig. 1(a). Fig. 2(a) shows the fraction of misclassified graphs as the fraction of TIVs introduced increases. To incur as small a perturbation on distances as possible, we introduce TIVs as follows: for every three graphs $A, B, C$, with probability $p$, we set $d(A,C) = d(A,B) + d(B,C)$. Although this does not introduce a TIV w.r.t. $A$, $B$, and $C$, this distortion does introduce TIVs w.r.t. other triplets involving $A$ and $C$. We repeat this 20 times for each algorithm and each value of $p$, and compute the average fraction of TIVs, shown on the x-axis, and the average fraction of misclassified graphs, shown on the y-axis. As little as 1% TIVs significantly deteriorate clustering performance. We also see that, even after introducing TIVs, clustering based on metrics outperforms clustering based on non-metrics.

Comparison to Chemical Distance.
We compare how different distance scores relate to the chemical distance EXACT through two experiments on the small graphs (computation on larger graphs is prohibitive). In Figure 2(b), we compare the distances between small graphs with 7 nodes produced by the different algorithms and EXACT using the DISTATIS method of [1]. Let $D \in \mathbb{R}^{853 \times 853}$ be the matrix of distances between graphs under an algorithm. DISTATIS computes the normalized Laplacian of this matrix, given by $L = -UDU/\|UDU\|$ where $U = I - \mathbf{1}\mathbf{1}^\top/n$. The DISTATIS score is the cosine similarity of such Laplacians (vectorized). We see that our metrics produce distances attaining high similarity with EXACT, though NetAlignBP has the highest similarity. We measure proximity to EXACT with an additional test. Given $D$, we compute the nearest neighbor (NN) meta-graph by connecting a graph in $D$ to every graph at distance less than its average distance to other graphs. This results in a (labeled) meta-graph, which we can compare to the NN meta-graph induced by other algorithms, measuring the fraction of distinct edges. Fig. 2(c) shows that our algorithms perform quite well, though Natalie yields the smallest distance to EXACT.
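A minimal sketch of this DISTATIS score, following the formula above (the toy distance matrix is our own):

```python
# DISTATIS similarity: cosine similarity of normalized Laplacians
# L = -U D U / ||U D U||, with U = I - 11^T / n the centering matrix.
import numpy as np

def distatis_similarity(D1, D2):
    n = D1.shape[0]
    U = np.eye(n) - np.ones((n, n)) / n
    def laplacian(D):
        L = -U @ D @ U
        return L / np.linalg.norm(L)          # unit Frobenius norm
    v1, v2 = laplacian(D1).ravel(), laplacian(D2).ravel()
    return float(v1 @ v2)                     # cosine similarity of unit vectors

D = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
print(distatis_similarity(D, 3 * D))          # 1.0: invariant to rescaling
```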
Incorporating Constraints. Computation costs can be reduced through metric embeddings, as in (6). To show this, we produce a copy of the 5242-node collaboration graph with permuted node labels. We then run the WL algorithm (see Appendix D) to produce structural colors, which induce coloring constraints on $P \in W_n$. The support of $P$ (i.e., the number of variables in the optimization (4)), the support of $AP - PA$ (i.e., the number of non-zero summation terms in the objective of (4)), as well as the execution time $\tau$ of the WL algorithm, are summarized in Fig. 3(a). The original unconstrained problem involves $5242^2 \approx 27.4$M variables. However, after using WL and the induced constraints, the effective dimension of the optimization problem (4) reduces considerably. This, in turn, speeds up convergence time, shown in Fig. 3(b): including the time to compute constraints, a solution is found 110 times faster after the introduction of the constraints.
Our work suggests that incorporating soft and hard constraints has great potential to further improve the efficiency of our metrics. In future work, we intend to investigate and characterize the resulting equivalence classes under different soft and hard constraints, and to quantify these gains in efficiency, especially in parallel implementations like ADMM. Determining the necessity of the conditions used in proving that $d_S$ is a metric is also an open problem.

Acknowledgements
The authors gratefully acknowledge the support of the National Science Foundation (grants IIS-1741197, IIS-1741129) and of the National Institutes of Health (grant 1U01AI124302).
References

[1] H. Abdi, A. J. O'Toole, D. Valentin, and B. Edelman. DISTATIS: The analysis of multiple distance matrices. In CVPR Workshops, 2005.
[2] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[3] M. R. Ackermann, J. Blömer, and C. Sohler. Clustering for metric and nonmetric distance measures. ACM Transactions on Algorithms (TALG), 6(4):59, 2010.
[4] Y. Aflalo, A. Bronstein, and R. Kimmel. On convex relaxation of graph isomorphism. PNAS, 112(10):2942–2947, 2015.
[5] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
[6] F. H. Allen. The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallographica Section B: Structural Science, 58(3):380–388, 2002.
[7] F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In PKDD, 2002.
[8] L. Babai. Graph isomorphism in quasipolynomial time [extended abstract]. In STOC, 2016.
[9] M. Bayati, M. Gerritsen, D. F. Gleich, A. Saberi, and Y. Wang. Algorithms for large, sparse network alignment problems. In ICDM, 2009.
[10] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, 2006.
[11] G. Birkhoff. Three observations on linear algebra. Univ. Nac. Tucumán. Revista A, 5:147–151, 1946.
[12] B. Bollobás. Random graphs. In Modern Graph Theory, pages 215–252. Springer, 1998.
[13] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[14] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[15] H. Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18(8):689–694, 1997.
[16] H. Bunke and K. Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3):255–259, 1998.
[17] G. Chartrand, G. Kubicki, and M. Schultz. Graph similarity and distance in graphs. Aequationes Mathematicae, 55(1-2):129–145, 1998.
[18] K. L. Clarkson. Nearest neighbor queries in metric spaces. Discrete & Computational Geometry, 22(1):63–93, 1999.
[19] K. L. Clarkson. Nearest-neighbor searching and metric space dimensions. Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pages 15–59, 2006.
[20] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(03):265–298, 2004.
[21] M. El-Kebir, J. Heringa, and G. Klau. Natalie, a tool for pairwise global network alignment.
[22] M. El-Kebir, J. Heringa, and G. W. Klau. Natalie 2.0: Sparse global network alignment as a special case of quadratic assignment. Algorithms, 8(4):1035–1051, 2015.
[23] H. Elghawalby and E. R. Hancock. Measuring graph similarity using spectral geometry. In ICIAR, 2008.
[24] P. Erdős and A. Rényi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
[25] S. Fankhauser, K. Riesen, and H. Bunke. Speeding up graph edit distance computation through fast bipartite matching. In GBR, 2011.
[26] M. Ferrer, E. Valveny, F. Serratosa, K. Riesen, and H. Bunke. Generalized median graph computation by means of graph embedding in vector spaces. Pattern Recognition, 43(4):1642–1655, 2010.
[27] M. R. Garey and D. S. Johnson. Computers and Intractability, volume 29. W. H. Freeman, New York, 2002.
[28] J. A. Hartigan. Clustering Algorithms. Wiley, New York, 1975.
[29] L. He, C. Y. Han, and W. G. Wee. Object recognition and recovery by skeleton graph matching. In ICME, 2006.
[30] A. J. Hoffman and H. W. Wielandt. The variation of the spectrum of a normal matrix. Duke Math. J., 20(1):37–39, 1953.
[31] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.
[32] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, pages 428–434. ACM, 1999.
[33] B. J. Jain. On the geometry of graph spaces. Discrete Applied Mathematics, 214:126–144, 2016.
[34] A. Khan, D. Gleich, M. Halappanavar, and A. Pothen. Multicore codes for network alignment.
[35] G. W. Klau. A new graph-based method for pairwise global network alignment. BMC Bioinformatics, 10(1):S59, 2009.
[36] J. Kleinberg. The small-world phenomenon: An algorithmic perspective. In STOC, 2000.
[37] J. Koca, M. Kratochvil, V. Kvasnicka, L. Matyska, and J. Pospichal. Synthon Model of Organic Chemistry and Synthesis Design, volume 51. Springer Science & Business Media, 2012.
[38] D. Koutra, H. Tong, and D. Lubensky. Big-Align: Fast bipartite graph alignment. In ICDM, 2013.
[39] D. Koutra, J. T. Vogelstein, and C. Faloutsos. DeltaCon: A principled massive-graph similarity function. In SDM, 2013.
[40] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[41] V. Kvasnička, J. Pospíchal, and V. Baláž. Reaction and chemical distances and reaction graphs. Theoretical Chemistry Accounts: Theory, Computation, and Modeling (Theoretica Chimica Acta), 79(1):65–79, 1991.
[42] J. Leskovec, J. Kleinberg, and C. Faloutsos. Stanford large network dataset collection. http://snap.stanford.edu/data/ca-GrQc.html.
[43] V. Lyzinski, D. E. Fishkind, M. Fiori, J. T. Vogelstein, C. E. Priebe, and G. Sapiro. Graph matching: Relax at your own risk. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):60–73, 2016.
[44] O. Macindoe and W. Richards. Graph comparison using fine structure analysis. In SocialCom, 2010.
[45] H. M. Mahmoud, R. T. Smythe, and J. Szymański. On the structure of random plane-oriented recursive trees and their branches. Random Structures & Algorithms, 4(2):151–176, 1993.
[46] B. McKay. List of 7 node connected graphs. http://users.cecs.anu.edu.au/~bdm/data/graphs.html.
[47] P. Papadimitriou, A. Dasdan, and H. Garcia-Molina. Web graph similarity for anomaly detection. Journal of Internet Services and Applications, 1(1):19–30, 2010.
[48] M. V. Ramana, E. R. Scheinerman, and D. Ullman. Fractional isomorphism of graphs. Discrete Mathematics, 132(1-3):247–265, 1994.
[49] K. Riesen and H. Bunke. Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision Computing, 27(7):950–959, 2009.
[50] K. Riesen and H. Bunke. Graph Classification and Clustering Based on Vector Space Embedding, volume 77. World Scientific, 2010.
[51] K. Riesen, M. Neuhaus, and H. Bunke. Graph embedding in vector spaces by means of prototype selection. In GBR, 2007.
[52] A. Sanfeliu and K. S. Fu. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, (3):353–362, 1983.
[53] C. Schellewald, S. Roth, and C. Schnörr. Evaluation of convex optimization techniques for the weighted graph-matching problem in computer vision. In Pattern Recognition, 2001.
[54] T. B. Sebastian, P. N. Klein, and B. B. Kimia. Recognition of shapes by editing their shock graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):550–571, 2004.
[55] R. Singh, J. Xu, and B. Berger. Pairwise global alignment of protein interaction networks by matching neighborhood topology. In RECOMB, 2007.
[56] S. Soundarajan, T. Eliassi-Rad, and B. Gallagher. A guide to selecting a network similarity method. In SDM, 2014.
[57] S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(5):695–703, 1988.
[58] J. T. Vogelstein, J. M. Conroy, L. J. Podrazik, S. G. Kratzer, E. T. Harley, D. E. Fishkind, R. J. Vogelstein, and C. E. Priebe. Large (brain) graph matching via fast approximate quadratic programming. arXiv preprint arXiv:1112.5507, 2011.
[59] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
[60] B. Weisfeiler and A. A. Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.
[61] R. C. Wilson and P. Zhu. A study of graph spectra for comparing graphs and trees. Pattern Recognition, 41(9):2833–2841, 2008.
[62] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS, volume 15, page 12, 2002.
[63] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.
[64] P. Zhu and R. C. Wilson. A study of graph spectra for comparing graphs. In BMVC, 2005.

A Proof of Lemmas 1–7
A.1 Proof of Lemma 1

Consider $P_1 \in \arg\min_{P \in S} \|AP - PB\|$ and $P_2 \in \arg\min_{P \in S} \|BP - PC\|$. Then, from closure under multiplication, $P_1 P_2 \in S$. Hence,

$d_S(A,C) \leq \|AP_1P_2 - P_1P_2C\| \leq \|AP_1P_2 - P_1BP_2\| + \|P_1BP_2 - P_1P_2C\| = \|(AP_1 - P_1B)P_2\| + \|P_1(BP_2 - P_2C)\| \leq \|AP_1 - P_1B\| + \|BP_2 - P_2C\|,$

where the last inequality follows from the fact that $P_1, P_2$ are contractions. □

A.2 Proof of Lemma 2

Observe that property (b) implies that, for all $P \in S$, $P$ is invertible and $P^{-1} \in S$. Hence, $\|AP - PB\| = \|P(P^{-1}A - BP^{-1})P\| \leq \|BP^{-1} - P^{-1}A\|$, as $P$ is a contraction w.r.t. $\|\cdot\|$. We can similarly show that $\|BP^{-1} - P^{-1}A\| \leq \|AP - PB\|$, hence $\|AP - PB\| = \|BP^{-1} - P^{-1}A\|$. As $S$ is closed under inversion, $\min_{P \in S} f(P) = \min_{P: P^{-1} \in S} f(P)$, so $d_S(A,B) = \min_{P \in S} \|BP^{-1} - P^{-1}A\| = \min_{P: P^{-1} \in S} \|BP^{-1} - P^{-1}A\| = \min_{P \in S} \|BP - PA\| = d_S(B,A)$. □

A.3 Proof of Lemma 3

If $I \in S$, then $0 \leq d_S(A,A) \leq \|AI - IA\| = 0$. □

A.4 Proof of Lemma 4

Observe first that all vector $p$-norms are invariant to permutations of a vector's entries; hence, for any vector $x \in \mathbb{R}^n$, if $P \in P_n$, $\|Px\|_p = \|x\|_p$. Hence, if $\|\cdot\|$ is an operator $p$-norm, $\|P\| = 1$ for all $P \in P_n$. Every operator norm is submultiplicative; as a result $\|PA\| \leq \|P\|\|A\| = \|A\|$ and, similarly, $\|AP\| \leq \|A\|$, so the lemma follows for operator norms. On the other hand, if $\|\cdot\|$ is an entry-wise norm, then $\|A\|$ is invariant to permutations of either $A$'s rows or columns. Matrices $PA$ and $AP$ precisely amount to such permutations, so $\|PA\| = \|AP\| = \|A\|$, and the lemma follows also for entry-wise norms. □

A.5 Proof of Lemma 5

Any $U \in O_n$ satisfies $\|U\|_2 = 1$. The operator 2-norm is submultiplicative, hence $\|UA\|_2 \leq \|U\|_2 \|A\|_2 = \|A\|_2$, and similarly $\|AU\|_2 \leq \|A\|_2$. For the Frobenius norm, $\|UA\|_F^2 = \mathrm{tr}(A^\top U^\top U A) = \mathrm{tr}(A^\top A) = \|A\|_F^2$, and similarly $\|AU\|_F = \|A\|_F$ (see also Lemma 12 in Appendix C). □

A.6 Proof of Lemma 6

By the Birkhoff-von Neumann theorem [11], $W_n = \mathrm{conv}(P_n)$. Hence, for any $W \in W_n$ there exist $P_i \in P_n$, $\theta_i > 0$, $i = 1, \ldots, k$, such that $W = \sum_{i=1}^k \theta_i P_i$ and $\sum_{i=1}^k \theta_i = 1$. Both operator and entry-wise $p$-norms are convex functions; hence, by Jensen's inequality, for any $A \in \mathbb{R}^{n \times n}$: $\|WA\| \leq \sum_{i=1}^k \theta_i \|P_i A\| \leq \sum_{i=1}^k \theta_i \|A\| = \|A\|$, where the last inequality follows by Lemma 4. The statement $\|AW\| \leq \|A\|$ follows similarly. □

A.7 Proof of Lemma 7

By transpose invariance and the symmetry of $A$ and $B$, we have that $\|AP - PB\| = \|BP^\top - P^\top A\|$. Moreover, as $S$ is closed under transposition, $\min_{P \in S} f(P) = \min_{P: P^\top \in S} f(P)$. Hence, $d_S(A,B) = \min_{P \in S} \|BP^\top - P^\top A\| = \min_{P^\top \in S} \|BP^\top - P^\top A\| = \min_{P \in S} \|BP - PA\| = d_S(B,A)$. □

B Proof of Theorems 4 and 5
We begin by establishing conditions under which $d_S$ satisfies the triangle inequality (3d). We note that, in contrast to Lemma 1, we require the additional condition that $S \subseteq W_n$, which is not satisfied by $O_n$.

Lemma 8. Given a norm $\|\cdot\|$, suppose that $S$ is (a) contractive w.r.t. $\|\cdot\|$, (b) closed under multiplication, and (c) a subset of $W_n$, i.e., contains only doubly stochastic matrices. Then, for any $(A,\psi_A), (B,\psi_B), (C,\psi_C)$ in $\mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$, $d_S((A,\psi_A),(C,\psi_C)) \leq d_S((A,\psi_A),(B,\psi_B)) + d_S((B,\psi_B),(C,\psi_C))$.

Proof. Consider

$P_1 \in \arg\min_{P \in S} \left( \|AP - PB\| + \mathrm{tr}(P^\top D_{\psi_A,\psi_B}) \right)$, and
$P_2 \in \arg\min_{P \in S} \left( \|BP - PC\| + \mathrm{tr}(P^\top D_{\psi_B,\psi_C}) \right)$.

Then, from closure under multiplication, $P_1 P_2 \in S$. We have that

$d_S((A,\psi_A),(C,\psi_C)) \leq \|AP_1P_2 - P_1P_2C\| + \mathrm{tr}\left[(P_1P_2)^\top D_{\psi_A,\psi_C}\right].$

As in the proof of Lemma 1, we can show that $\|AP_1P_2 - P_1P_2C\| \leq \|AP_1 - P_1B\| + \|BP_2 - P_2C\|$, using the fact that both $P_1$ and $P_2$ are contractions, while

$\mathrm{tr}\left[(P_1P_2)^\top D_{\psi_A,\psi_C}\right] = \sum_{u,v \in [n]} \sum_{k \in [n]} P_{1,uk} P_{2,kv}\, \tilde{d}(\psi_A(u), \psi_C(v))$
$\leq \sum_{u,v \in [n]} \sum_{k \in [n]} P_{1,uk} P_{2,kv} \left( \tilde{d}(\psi_A(u), \psi_B(k)) + \tilde{d}(\psi_B(k), \psi_C(v)) \right)$

(as $\tilde{d}$ is a metric, and $P_1$, $P_2$ are non-negative)

$= \sum_{u,k \in [n]} P_{1,uk}\, \tilde{d}(\psi_A(u), \psi_B(k)) \sum_{v \in [n]} P_{2,kv} + \sum_{k,v \in [n]} P_{2,kv}\, \tilde{d}(\psi_B(k), \psi_C(v)) \sum_{u \in [n]} P_{1,uk}$
$\leq \mathrm{tr}\left(P_1^\top D_{\psi_A,\psi_B}\right) + \mathrm{tr}\left(P_2^\top D_{\psi_B,\psi_C}\right),$

where the last inequality follows as the rows and columns of every $P \in S \subseteq W_n$ sum to 1. □

The weak property (3e) is again satisfied provided the identity is included in $S$.

Lemma 9. If $I \in S$, then $d_S((A,\psi_A),(A,\psi_A)) = 0$ for all $A \in \mathbb{R}^{n \times n}$.

Proof. Indeed, $0 \leq d_S((A,\psi_A),(A,\psi_A)) \leq \|AI - IA\| + \sum_{u \in [n]} \tilde{d}(\psi_A(u), \psi_A(u)) = 0$. □

To attain symmetry over $\Omega = \mathbb{R}^{n \times n}$, we again rely on closure under inversion, as in Lemma 2; nonetheless, in contrast to Lemma 2, due to the linear term, we also need to assume orthogonality of $S$.

Lemma 10. Given a norm $\|\cdot\|$, suppose that $S$ (a) is contractive w.r.t. $\|\cdot\|$, (b) is closed under inversion, and (c) is a subset of $O_n$, i.e., contains only orthogonal matrices. Then, $d_S((A,\psi_A),(B,\psi_B)) = d_S((B,\psi_B),(A,\psi_A))$ for all $(A,\psi_A), (B,\psi_B) \in \mathbb{R}^{n \times n} \times \Psi_{\tilde{\Omega}}^n$.

Proof. As in the proof of Lemma 2, we can show that contractiveness w.r.t. $\|\cdot\|$ along with closure under inversion imply that $\|AP - PB\| = \|BP^{-1} - P^{-1}A\|$. As $S$ is closed under inversion, $\min_{P \in S} f(P) = \min_{P: P^{-1} \in S} f(P)$ for all $f: S \to \mathbb{R}$, while orthogonality implies $P^{-1} = P^\top$ for all $P \in S$. Hence, $d_S((A,\psi_A),(B,\psi_B))$ equals

$\min_{P \in S} \left[ \|AP - PB\| + \mathrm{tr}\left(P^\top D_{\psi_A,\psi_B}\right) \right]$
$= \min_{P \in S} \left[ \|BP^{-1} - P^{-1}A\| + \mathrm{tr}\left(P^{-1} D_{\psi_A,\psi_B}\right) \right]$
$= \min_{P \in S} \left[ \|BP^{-1} - P^{-1}A\| + \mathrm{tr}\left((P^{-1})^\top D_{\psi_A,\psi_B}^\top\right) \right]$
$= \min_{P^{-1} \in S} \left[ \|BP^{-1} - P^{-1}A\| + \mathrm{tr}\left((P^{-1})^\top D_{\psi_B,\psi_A}\right) \right]$
$= d_S((B,\psi_B),(A,\psi_A)).$ □

Theorem 4 therefore follows from the above lemmas, as $S = P_n$ contains $I$, is closed under multiplication and inversion, is a subset of $W_n \cap O_n$, and is contractive w.r.t. all operator and entry-wise norms. Theorem 5 also follows by using the following lemma, along with Lemmas 8 and 9.

Lemma 11. Suppose that $\|\cdot\|$ is transpose invariant, and $S$ is closed under transposition. Then, $d_S((A,\psi_A),(B,\psi_B)) = d_S((B,\psi_B),(A,\psi_A))$ for all $(A,\psi_A),(B,\psi_B) \in S_n \times \Psi_{\tilde{\Omega}}^n$.

Proof. By transpose invariance of $\|\cdot\|$ and the symmetry of $A$ and $B$, we have that $\|AP - PB\| = \|BP^\top - P^\top A\|$. Moreover, as $S$ is closed under transposition, $\min_{P \in S} f(P) = \min_{P^\top \in S} f(P)$ for any $f: S \to \mathbb{R}$. Hence, $d_S((A,\psi_A),(B,\psi_B))$ equals

$\min_{P \in S} \left[ \|AP - PB\| + \mathrm{tr}\left(P^\top D_{\psi_A,\psi_B}\right) \right]$
$= \min_{P \in S} \left[ \|BP^\top - P^\top A\| + \mathrm{tr}\left(P D_{\psi_A,\psi_B}^\top\right) \right]$
$= \min_{P^\top \in S} \left[ \|BP^\top - P^\top A\| + \mathrm{tr}\left((P^\top)^\top D_{\psi_B,\psi_A}\right) \right]$
$= d_S((B,\psi_B),(A,\psi_A)).$ □

C Metric Computation over the Stiefel Manifold
In this section, we describe how to compute the metric $d_S$ in polynomial time when $S = O_n$ and $\|\cdot\|$ is the Frobenius norm or the operator 2-norm. The algorithm for the Frobenius norm, and the proof of its correctness, is due to [57]; we include it in this appendix for completeness, along with its extension to the operator norm. Both cases make use of the following lemma:

Lemma 12.
For any matrix $M \in \mathbb{R}^{n \times n}$ and any matrix $P \in O_n$ we have that $\|PM\| = \|MP\| = \|M\|$, where $\|\cdot\|$ is either the Frobenius or operator 2-norm.

Proof. Recall that the operator 2-norm $\|\cdot\|_2$ is $\|M\|_2 = \sup_{x \neq 0} \|Mx\|_2/\|x\|_2 = \sqrt{\sigma_{\max}(M^\top M)} = \sqrt{\sigma_{\max}(MM^\top)} = \|M^\top\|_2$, where $\sigma_{\max}$ denotes the largest singular value. Hence, $\|PM\|_2 = \sup_{x \neq 0} \|PMx\|_2/\|x\|_2 = \sqrt{\sigma_{\max}(M^\top P^\top P M)} = \sqrt{\sigma_{\max}(M^\top M)} = \|M\|_2$, as $P^\top P = I$. Using the fact that $\|M\|_2 = \|M^\top\|_2$ for all $M \in \mathbb{R}^{n \times n}$, as well as that $PP^\top = I$, we can show that $\|MP\|_2 = \|P^\top M^\top\|_2 = \|M^\top\|_2 = \|M\|_2$.

The Frobenius norm is $\|M\|_F = \sqrt{\mathrm{tr}(M^\top M)} = \sqrt{\mathrm{tr}(MM^\top)} = \|M^\top\|_F$, hence $\|PM\|_F = \sqrt{\mathrm{tr}(M^\top P^\top P M)} = \sqrt{\mathrm{tr}(M^\top M)} = \|M\|_F$ and, as in the case of the operator norm, we can similarly show $\|MP\|_F = \|P^\top M^\top\|_F = \|M^\top\|_F = \|M\|_F$. □

In both norm cases, for $A, B \in S_n$, we can compute $d_S$ using a simple spectral decomposition. Let $A = U\Sigma_A U^\top$ and $B = V\Sigma_B V^\top$ be the spectral decompositions of $A$ and $B$. As $A$ and $B$ are real and symmetric, we can assume $U, V \in O_n$. Recall that $U^{-1} = U^\top$ and $V^{-1} = V^\top$, while $\Sigma_A$ and $\Sigma_B$ are diagonal and contain the eigenvalues of $A$ and $B$ sorted in increasing order; this ordering matters for the computations below.

The following theorem establishes that this decomposition readily yields the distance $d_S$, as well as the optimal orthogonal matrix $P^*$, when $\|\cdot\| = \|\cdot\|_F$:

Theorem 6 ([57]). $d_S(A,B) \triangleq \min_{P \in S} \|AP - PB\|_F = \|\Sigma_A - \Sigma_B\|_F$, and the minimum is attained by $P^* = UV^\top$.

Proof. The proof makes use of the following lemma by [30].
Lemma 13. If $A$ and $B$ are Hermitian matrices with eigenvalues $a_1 \leq a_2 \leq \ldots \leq a_n$ and $b_1 \leq b_2 \leq \ldots \leq b_n$, then

$$\|A - B\|_F^2 \geq \sum_{i=1}^n (a_i - b_i)^2. \quad (7)$$
Remark 1. Note that if $\Sigma_A$ and $\Sigma_B$ are diagonal matrices with the ordered eigenvalues of $A$ and $B$ on the diagonal, then Lemma 13 can be written as $\|A - B\|_F \geq \|\Sigma_A - \Sigma_B\|_F$.

For any $P \in O_n$ and $\|\cdot\| = \|\cdot\|_F$ we have

$\|AP - PB\| = \|(A - PBP^{-1})P\| \overset{\text{Lem. 12}}{=} \|A - PBP^\top\| = \|U\Sigma_A U^\top - PV\Sigma_B V^\top P^\top\| = \|U(\Sigma_A - U^\top P V \Sigma_B V^\top P^\top U)U^\top\| \overset{\text{Lem. 12}}{=} \|\Sigma_A - U^\top P V \Sigma_B V^\top P^\top U\| = \|\Sigma_A - \Delta \Sigma_B \Delta^\top\|,$

where we define $\Delta \triangleq U^\top P V$. As a product of orthogonal matrices, $\Delta \in O_n$. Notice that

$\|\Sigma_A - \Delta\Sigma_B\Delta^\top\| = \|\Sigma_A - \Delta\Sigma_A\Delta^\top + \Delta(\Sigma_A - \Sigma_B)\Delta^\top\| \leq \|\Sigma_A - \Delta\Sigma_A\Delta^\top\| + \|\Delta(\Sigma_A - \Sigma_B)\Delta^\top\| \overset{\text{Lem. 12}}{=} \|\Sigma_A - \Delta\Sigma_A\Delta^\top\| + \|\Sigma_B - \Sigma_A\|.$

Therefore, for any $P \in O_n$, $\|\Sigma_A - \Sigma_B\| \leq d_S(A,B) \leq \|\Sigma_A - \Delta\Sigma_A\Delta^\top\| + \|\Sigma_B - \Sigma_A\|$, where the first inequality follows by Lemma 13 if we notice that $\|AP - PB\| = \|A - PBP^{-1}\|$ and that $PBP^{-1}$ and $B$ have the same spectrum for any $P$. If we choose $P = UV^\top$ then $\Delta = I$ and the result follows. □

We can compute $d_S$ when $S = O_n$ and $\|\cdot\|$ is the operator norm in the exact same way.
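In code, Theorem 6 (and, by Theorem 7 below, the operator-2-norm case) amounts to comparing sorted spectra; a minimal numpy sketch for symmetric inputs, with our own function name:

```python
# Orthogonal relaxation d_{O_n} via eigendecomposition (Theorem 6):
# distance is ||Sigma_A - Sigma_B||, optimum is P* = U V^T.
import numpy as np

def d_On(A, B):
    eig_A, U = np.linalg.eigh(A)   # eigh sorts eigenvalues in increasing order
    eig_B, V = np.linalg.eigh(B)
    P_star = U @ V.T               # optimal orthogonal matrix
    # 2-norm of the eigenvalue gap vector = ||Sigma_A - Sigma_B||_F;
    # max(|gap|) would give the operator-2-norm version of Theorem 7.
    return np.linalg.norm(eig_A - eig_B), P_star

A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])   # path graph
B = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])   # relabeled copy
dist, P = d_On(A, B)
print(round(dist, 6))              # ~0.0: isomorphic graphs are co-spectral
```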
Theorem 7. Let $\|\cdot\| = \|\cdot\|_2$ be the operator 2-norm. Then, $d_S(A,B) \triangleq \min_{P \in S} \|AP - PB\|_2 = \|\Sigma_A - \Sigma_B\|_2$, and the minimum is attained by $P^* = UV^\top$.

Proof. The proof follows the same steps as the proof of Theorem 6, using Lemma 14 below instead of Lemma 13.

Lemma 14. If $A$ and $B$ are Hermitian matrices with eigenvalues $a_1 \leq a_2 \leq \ldots \leq a_n$ and $b_1 \leq b_2 \leq \ldots \leq b_n$, then

$$\|A - B\|_2 \geq \max_i |a_i - b_i|. \quad (8)$$

Remark 2. Note that if $\Sigma_A$ and $\Sigma_B$ are diagonal matrices with the ordered eigenvalues of $A$ and $B$ on the diagonal, then Lemma 14 can be written as $\|A - B\|_2 \geq \|\Sigma_A - \Sigma_B\|_2$.

Proof of Lemma 14. Let $\tilde{B} = -B$ have eigenvalues $\tilde{b}_1 \leq \tilde{b}_2 \leq \ldots \leq \tilde{b}_n$ and let $C = A + \tilde{B}$ have eigenvalues $c_1 \leq c_2 \leq \ldots \leq c_n$. We make use of the following lemma by Weyl [31] to lower bound $c_n$.

Lemma 15. If $X$ and $Y$ are Hermitian with eigenvalues $x_1 \leq \ldots \leq x_n$ and $y_1 \leq \ldots \leq y_n$, and if $X + Y$ has eigenvalues $w_1 \leq \ldots \leq w_n$, then $x_{i-j+1} + y_j \leq w_i$ for all $i = 1, \ldots, n$ and $j = 1, \ldots, i$.

If we choose $X = \tilde{B}$, $Y = A$, and $i = n$ we get $a_j + \tilde{b}_{n+1-j} \leq c_n$ for all $j = 1, \ldots, n$. Since $\tilde{b}_{n+1-j} = -b_j$ we get that $a_j - b_j \leq c_n$ for any $j$. Similarly, by exchanging the roles of $A$ and $B$, we can lower bound the largest eigenvalue of $B - A$, say $d_n$, by $b_j - a_j$ for any $j$. Notice that, by definition of the operator norm and the fact that $A - B$ is Hermitian, $\|A - B\|_2 \geq |c_n|$ and $\|B - A\|_2 \geq |d_n|$. Since $\|B - A\|_2 = \|A - B\|_2$ we have that $\|A - B\|_2 \geq \max\{|c_n|, |d_n|\} \geq \max\{c_n, d_n\} \geq \max\{a_j - b_j, b_j - a_j\} = |a_j - b_j|$ for all $j$. Taking the maximum over $j$ we get that $\|A - B\|_2 \geq \max_j |a_j - b_j|$, and the lemma follows. □

The proof of Thm. 7 thus proceeds along the same steps as the proof of Theorem 6, using again the fact that, by Lemma 12, $\|M\| = \|MP\| = \|PM\|$ for any $P \in O_n$ and any matrix $M$, along with Lemma 14 in place of Lemma 13. □

D The Weisfeiler-Lehman (WL) Algorithm
The WL algorithm [60] is a graph isomorphism heuristic. To gain some intuition on the algorithm, note that two isomorphic graphs must have the same degree distribution. More broadly, the distributions of $k$-hop neighborhoods in the two graphs must also be identical. Building on this, to test if two undirected, unweighted graphs are isomorphic, WL colors the nodes of a graph $G(V,E)$ iteratively. At iteration 0, each node $v \in V$ receives the same color $c^0(v) := 1$. Colors at iteration $k+1 \in \mathbb{N}$ are defined recursively via $c^{k+1}(v) := \mathrm{hash}\left(\mathrm{sort}\left(\mathrm{clist}_v^k\right)\right)$, where hash is a perfect hash function, and $\mathrm{clist}_v^k = [c^k(u) : (u,v) \in E]$ is a list containing the colors of all of $v$'s neighbors at iteration $k$. Intuitively, two nodes in $V$ share the same color after $k$ iterations if their $k$-hop neighborhoods are isomorphic. WL terminates when the partition of $V$ induced by colors is stable from one iteration to the next. This coloring extends to weighted directed graphs by appending weights and directions to colors in $\mathrm{clist}_v^k$. After coloring two graphs $G_A$, $G_B$, WL declares a non-isomorphism if their color distributions differ. If not, then they may be isomorphic and WL gives a set of constraints on candidate isomorphisms: a permutation $P$ under which $AP = PB$ must map nodes in $G_A$ to nodes in $G_B$ of the same color.
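A minimal sketch of this coloring procedure for an unweighted, undirected graph given as an adjacency list, with Python's built-in tuple hash standing in for the perfect hash function (names and the iteration cap are our own):

```python
# Weisfeiler-Lehman node coloring: refine colors from sorted neighbor
# colors until the induced partition of V stops changing.
def wl_colors(adj, max_iters=10):
    n = len(adj)
    colors = [1] * n                                   # iteration 0: all nodes alike
    for _ in range(max_iters):
        new = [hash(tuple(sorted(colors[u] for u in adj[v])))
               for v in range(n)]
        # Stable if no old color class was split into several new ones.
        if len(set(new)) == len(set(colors)) and \
           len(set(zip(colors, new))) == len(set(colors)):
            break
        colors = new
    return colors

# A 4-cycle: every node looks alike, so one color class survives.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(len(set(wl_colors(adj))))                        # 1
```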
E Algorithms and Implementation Details

We outline here additional implementation details about the algorithms summarized in Table 1.

• NetAlignBP, IsoRank, SparseIsoRank, and NetAlignMR, for which code is publicly available [34], are described by [9]. Natalie is described in [22]; code is again available [21]. All five algorithms output $P \in P_n$.

• The algorithm in [43] outputs one $P \in P_n$ and one $P \in W_n$. We use $P \in P_n$ to compute $\|AP - PB\|_1$ and call this InnerPerm. We use $P \in W_n$ to compute $\|AP - PB\|_1$ and $\|AP - PB\|_F$ and call these algorithms InnerDSL1 and InnerDSL2, respectively. We use our own CVX-based projected gradient descent solver for the non-convex optimization problem the authors propose.

• DSL1 and DSL2 denote $d_S(A,B)$ when $S = W_n$ and $\|\cdot\|$ is $\|\cdot\|_1$ (entry-wise) and $\|\cdot\|_F$, respectively. We implement them in Matlab (using CVX) as well as in C, aimed at medium size graphs and multicore use. We also implemented a distributed version in Apache Spark [63] that scales to very large graphs over multiple machines based on the Alternating Direction Method of Multipliers [14].

• ORTHOP and ORTHFR denote $d_S(A,B)$ when $S = O_n$ and $\|\cdot\|$ is $\|\cdot\|_2$ (operator norm) and $\|\cdot\|_F$, respectively. We compute them using an eigendecomposition (see Appendix C).

• For small graphs, we compute $d_{P_n}(A,B)$ using our brute-force GPU-based code. For a single pair of graphs with $n \geq 15$ nodes, EXACT already takes several days to finish. For $\|\cdot\| = \|\cdot\|_1$ in $d_S$ (element-wise or matrix norm), we have implemented the chemical distance as an integer-valued LP and solved it using branch-and-cut. It did not scale well beyond small $n$.

• We implemented the WL algorithm over Spark to run, multithreaded, on a machine with 40 CPUs.

We use all public algorithms as black boxes with their default parameters, as provided by the authors.
F Graphs of Different Sizes
For simplicity, we described our framework for graphs of equal sizes. However, it can be extended in different ways to produce a metric for graphs of different sizes. These extensions all start by extending the two graphs, $G_A$ and $G_B$, with dummy nodes such that the new graphs $\bar{G}_A$ and $\bar{G}_B$ have the same number of nodes. If $G_A$ has $n_A$ nodes and $G_B$ has $n_B$ nodes we can, for example, add $n_B$ dummy nodes to $G_A$ and $n_A$ dummy nodes to $G_B$. Once we have $\bar{G}_A$ and $\bar{G}_B$ of equal size, we can use the methods we already described to compute a distance between $\bar{G}_A$ and $\bar{G}_B$, and return this distance as the distance between $G_A$ and $G_B$. The different ways of extending the graphs, described as Methods 1–3 below (see also the padding sketch that follows), differ in how the dummy nodes connect to existing graph nodes, how dummy nodes connect to themselves, and what kind of penalty we introduce for associating dummy nodes with existing graph nodes.
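A minimal sketch of the size-equalization step, shown here with Method 1's isolated dummy nodes (all names are our own; Methods 2 and 3 below differ only in how the padded block is filled):

```python
# Pad each adjacency matrix with dummy nodes so both graphs have
# n_A + n_B nodes; zero padding corresponds to isolated dummies.
import numpy as np

def pad(A, total):
    n = A.shape[0]
    out = np.zeros((total, total))
    out[:n, :n] = A
    return out

A = np.ones((3, 3)) - np.eye(3)    # triangle, n_A = 3
B = np.zeros((2, 2))               # two isolated nodes, n_B = 2
A_bar, B_bar = pad(A, 5), pad(B, 5)
print(A_bar.shape, B_bar.shape)    # both (5, 5)
```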
Method 1: One way of extending the graphs is to add dummy nodes and leave them isolated, i.e., with no edges to either existing nodes or other dummy nodes. Although this might work when both graphs are dense, it might lead to undesirable results when one of the graphs is sparse. For example, let $G_A$ be 3 isolated nodes and $G_B$ be the complete graph on 4 nodes minus the edges forming the triangle $\{(1,2),(2,3),(3,1)\}$. Let us assume that $S = P_n$, such that, when we compute the distance between $\bar{G}_A$ and $\bar{G}_B$, we produce an alignment between the graphs. One desirable outcome would be for $G_A$ to be aligned with the three nodes in $G_B$ that have no edges among them. This is basically solving the problem of finding a sparse subgraph inside a dense graph. However, computing $d_S(\bar{A}, \bar{B})$, where $\bar{A}$ and $\bar{B}$ are the extended adjacency matrices, could equally well align $G_A$ with the 3 dummy nodes of $\bar{G}_B$.
Method 2: Add dummy nodes and connect each dummy node to all existing nodes and all other dummy nodes. This avoids the issue described for Method 1 but creates a similar undesirable situation: since the dummy nodes in each extended graph form a clique, we might align $G_A$, or $G_B$, with just dummy nodes, instead of producing an alignment between existing nodes in $G_A$ and existing nodes in $G_B$.
Method 3: If both $G_A$ and $G_B$ are unweighted graphs, a method that avoids both issues above (aligning a sparse graph with isolated dummy nodes, or aligning a dense graph with cliques of dummy nodes) is to connect each dummy node to all existing nodes and all other dummy nodes with edges of weight 1/