A Hidden Challenge of Link Prediction: Which Pairs to Check?
Caleb Belth
University of Michigan
Ann Arbor, MI, [email protected]
Alican Büyükçakır
University of Michigan
Ann Arbor, MI, [email protected]
Danai Koutra
University of Michigan
Ann Arbor, MI, [email protected]
Abstract—The traditional setup of link prediction in networks assumes that a test set of node pairs, which is usually balanced, is available over which to predict the presence of links. However, in practice, there is no test set: the ground truth is not known, so the number of possible pairs to predict over is quadratic in the number of nodes in the graph. Moreover, because graphs are sparse, most of these possible pairs will not be links. Thus, link prediction methods, which often rely on proximity-preserving embeddings or heuristic notions of node similarity, face a vast search space, with many pairs that are in close proximity, but that should not be linked. To mitigate this issue, we introduce LINKWALDO, a framework for choosing from this quadratic, massively-skewed search space of node pairs a concise set of candidate pairs that, in addition to being in close proximity, also structurally resemble the observed edges. This allows it to ignore some high-proximity but low-resemblance pairs, and also identify high-resemblance, lower-proximity pairs. Our framework is built on a model that theoretically combines Stochastic Block Models (SBMs) with node proximity models. The block structure of the SBM maps out where in the search space new links are expected to fall, and the proximity identifies the most plausible links within these blocks, using locality sensitive hashing to avoid expensive exhaustive search. LINKWALDO can use any node representation learning or heuristic definition of proximity, and can generate candidate pairs for any link prediction method, allowing the representation power of current and future methods to be realized for link prediction in practice. We evaluate LINKWALDO on 13 networks across multiple domains, and show that on average it returns candidate sets containing 7-33% more missing and future links than both embedding-based and heuristic baselines' sets.
I. INTRODUCTION
Link prediction is a long-studied problem that attempts to predict either missing links in an incomplete graph, or links that are likely to form in the future. This has applications in discovering unknown protein interactions to speed up the discovery of new drugs, friend recommendation in social networks, knowledge graph completion, and more [1], [15], [16], [25]. Techniques range from heuristics, such as predicting links based on the number of common neighbors between a pair of nodes, to machine learning techniques, which formulate the link prediction problem as a binary classification problem over node pairs [7], [29].
Link prediction is often evaluated via a ranking, where pairs of nodes that are not currently linked are sorted based on the "likelihood" score given by the method being evaluated [16]. To construct the ranking, a "ground-truth" test set of node pairs is constructed by either (1) removing a certain percentage of links from a graph at random, or (2) removing the newest links that formed in the graph, if edges have timestamps. These removed edges form the test positives, and the same number of unlinked pairs are generated at random as test negatives. The methods are then evaluated on how well they are able to rank the test positives higher than the test negatives.
Fig. 1: Our proposed framework LINKWALDO chooses candidate pairs from the quadratic, highly-skewed search space of possible links by first constructing a roadmap, which partitions the search space into structural equivalence classes of node pairs to capture how much pairs in each location resemble the observed links. This roadmap tells LINKWALDO how closely to look in each section of the search space. LINKWALDO follows the roadmap, selecting from each equivalence class the node pairs in closest proximity. (Steps: S1: generate node groupings; S2: map the search space; S3: discover closest pairs per equivalence class; S4: augment pairs from global pool.)

However, when link prediction is applied in practice, these ground-truth labels are not known, since that is the very question that link prediction is attempting to answer. Instead, any pair of nodes that are not currently linked could link in the future. Thus, to identify likely missing or future links, a link prediction method would need to consider O(n²) node pairs for a graph with n nodes, most of which in sparse, real-world networks would turn out not to link. Proximity, on its own, is only a weak signal: sufficient to rank pairs in a balanced test set, but likely to turn up many false positives in an asymptotically skewed space, leaving discovering the relatively small number of missing or future links a challenging problem. Proximity-based link prediction heuristics [15], such as Common Neighbors, could ignore some of the search space, such as nodes that are farther than two hops from each other, but this would not extend to other notions of proximity, like proximity-preserving embeddings. Duan et al. studied the problem of pruning the search space [5], but formulated it as top-k link prediction, which attempts to predict a small number of links and misses a large number of missing links in the process, suffering from low recall.
The goal of this work is to develop a principled approach to choose, from the quadratic and skewed space of possible links, a set of candidate pairs for a link prediction method to make decisions about. We envision that this will allow current and future developments to be realized for link prediction in practice, where no ground-truth set is available.
Problem 1. Given a graph and a proximity function between nodes, we seek to return a candidate set of node pairs for a link predictor to make decisions about, such that the set is significantly smaller than the quadratic search space, but contains many of the missing and future links.
Our insight to handle the vast number of negatives is to consider not just the proximity of nodes, but also their structural resemblance to observed links. We measure resemblance as the fraction of observed links that fall in inferred, graph-structural equivalence classes of node pairs. For example, Fig. 1 shows one possible grouping of nodes based on their degrees, where the resulting structural equivalence classes (the cells in the "roadmap") capture what fraction of observed links form between nodes of different degrees. Based on the roadmap, equivalence classes with a high fraction of observed edges are expected to contain more unlinked pairs than those with lower resemblance. We then employ node proximity within equivalence classes, rather than globally, which decreases false positives that are in close proximity, but do not resemble observed links, and decreases false negatives that are farther away in the graph, but resemble many observed edges. Moreover, to avoid computing proximities for all pairs of nodes within each equivalence class, we extend self-tuning locality sensitive hashing (LSH). Our main contributions are:
• Formulation & Theoretical Connections. Going beyond the heuristic of proximity between nodes, we model the plausibility of a node pair being linked as both their proximity and their structural resemblance to observed links. Based on this insight, we propose Future Link Location Models (FLLM), which combine Proximity Models and Stochastic Block Models; and we prove that Proximity Models are a naive special case. (§ III)
• Scalable Method. We develop a scalable method, LINKWALDO (Fig. 1), which implements FLLM, and uses locality sensitive hashing to implicitly ignore unimportant pairs. (§ IV)
• Empirical Analysis. We evaluate LINKWALDO on 13 diverse datasets from different domains, where it returns on average 22-33% more missing links than embedding-based models and 7-30% more than strong heuristics. (§ V)
Our code is at https://github.com/GemsLab/LinkWaldo.
II. RELATED WORK
In this paper, we focus on the understudied problem of choosing candidate pairs from the quadratic space of possible links, for link prediction methods to make predictions about. Link prediction techniques range from heuristic definitions of similarity, such as Common Neighbors [15], Jaccard Similarity [15], and Adamic-Adar [1], to machine learning approaches, such as latent methods, which learn low-dimensional node representations that preserve graph-structural proximity in latent space [7], and GNN methods, which learn heuristics specific to each graph [29] or attempt to reconstruct the observed adjacency matrix [10]. For detailed discussion of link prediction techniques, we refer readers to [15] and [16].
Selecting Candidate Pairs. The closest problem to ours is top-k link prediction [5], which attempts to take a particular link prediction method and prune its search space to directly return the k highest-scoring pairs. One method [5] samples multiple subgraphs to form a bagging ensemble, and performs NMF on each subgraph, returning the nodes with the largest latent factor products from each, while leveraging early stopping. The authors view their method's output as predictions rather than candidates, and thus focus on high precision at small values of k relative to our setting. Another approach, Approximate Resistance Distance Link Predictor [21], generates spectral node embeddings by constructing a low-rank approximation of the graph's effective resistance matrix, and applies a k-closest-pairs algorithm on the embeddings, predicting these as links. However, this approach does not scale to moderate embedding dimensions (e.g., the dimensionality of 128 often used in embedding methods), and is often outperformed by the simple Common Neighbors heuristic.
A related problem is link recommendation, which seeks to identify the k most relevant nodes to a query node. It has been studied in social networks for friend recommendation [26], and in knowledge graphs [9] to pick subgraphs that are likely to contain links to a given query entity. In contrast, we focus on candidate pairs globally, not specific to a query node.
III. THEORY
Let G = (V, E) be a graph or network with |V| = n nodes and |E| = m edges, where E ⊆ V × V. The adjacency matrix A of G is an n × n binary matrix with element a_ij = 1 if nodes i and j are linked, and 0 otherwise. The set of node v's neighbors is N(v) = {u : (u, v) ∈ E}. We summarize the key symbols used in this paper and their descriptions in Table I.

TABLE I: Description of major symbols.
  Notation            Description
  G = (V, E), A       Graph, nodes, edges, adjacency matrix
  |V| = n, |E| = m    Number of nodes resp. edges in G
  E_new               Unobserved future or missing links
  Γ, Π                Grouping of V, partition of V × V
  x_v ∈ X, µ_v        Node embedding and membership vector
  C_i                 Equivalence class i
  P, P̃_G             Pairs selected by LINKWALDO, global pool
  k, κ                Budget for |P|, target for an equivalence class
We now formalize the problem that we seek to solve:
Problem 2. Given a graph G = (V, E), a proximity function sim : V × V → R+ between nodes, and a budget k ≪ n², return a set of plausible candidate node pairs P ⊂ V × V of size |P| = k for a link predictor to make decisions about.
We describe next how to define resemblance in a principled way inspired by Stochastic Block Models, introduce a unified model for link prediction methods that use the proximity of nodes to rank pairs, and describe our model, which combines resemblance and proximity to solve Problem 2.
A. Stochastic Block Models
Stochastic Block Models (SBMs) are generative models of networks. They model the connectivity of graphs as emerging from the community or group membership of nodes [20].
Node Grouping. A node grouping Γ is a set of groups or subsets V_i of the nodes that satisfies ∪_{V_i ∈ Γ} V_i = V. It is called a partition if it satisfies V_i ∩ V_j = ∅ ∀ V_i ≠ V_j ∈ Γ. Each node v ∈ V has a |Γ|-dimensional binary membership vector µ_v, with element µ_vi = 1 if v belongs to group V_i. A node grouping can capture community structure, but it can also capture other graph-structural properties, like the degrees of nodes, in which case the SBM captures the compatibility of nodes w.r.t. degree—viz. degree assortativity.
Membership Indices. The membership indices I_{u,v} of nodes u, v are the set of group ids (i, j) s.t. u ∈ V_i and v ∈ V_j:
  I_{u,v} ≜ {i : µ_{u,i} = 1} × {j : µ_{v,j} = 1}, i, j ∈ {1, 2, ..., |Γ|}.
Membership equivalence relation & classes. The membership indices form the equivalence relation ∼_I: (u, v) ∼_I (u′, v′) ⟺ I_{u,v} = I_{u′,v′}. This induces a partition Π = {C_1, C_2, ..., C_|Π|} over all pairs of nodes V × V (both linked and unlinked), where the equivalence class C_i contains all node pairs (u, v) with the same membership indices, i.e., µ_u = µ and µ_v = µ′ for some µ, µ′ ∈ {0, 1}^|Γ|. We denote the equivalence class of pair (u, v) as [(u, v)]_{∼I}.
Example 1. If nodes are grouped by their degrees to form Γ, then the membership indices I_{u,v} of node pair (u, v) are determined by u and v's respective degrees. For example, in Fig. 1, the degrees of the upper circled node pair determine their equivalence class—in this case, a particular cell in the roadmap. Each cell of the roadmap corresponds to an equivalence class C_i ∈ Π.
We can now formally define an SBM:
Definition 1 (Stochastic Block Model - SBM). Given a node grouping Γ and a |Γ| × |Γ| weight matrix W specifying the propensity for links to form across groups, the probability that two nodes link given their group memberships is Pr(a_uv = 1 | µ_u, µ_v) = σ(µ_u^T W µ_v), where function σ(·) converts the dot product to a probability (e.g., sigmoid) [18].
The vanilla SBM [20] assigns each node to one group (i.e., the grouping is a partition and membership vectors µ are one-hot), in which case µ_u^T W µ_v = w_{I_{u,v}}. The overlapping SBM [18], [12] is a generalization that allows nodes to belong to multiple groups, in which case membership vectors may have multiple elements set to 1, and µ_u^T W µ_v = Σ_{(i,j) ∈ I_{u,v}} w_ij.
Resemblance. Given an SBM with grouping Γ, we define the resemblance of node pair (u, v) ∈ V × V under the SBM as the percentage of the observed (training) edges that have the same group membership as (u, v):
  ρ(u, v) ≜ |{(v₁, v₂) ∈ E : (v₁, v₂) ∼_I (u, v)}| / m.   (1)
Example 2. In Figure 1, the resemblance ρ(u, v) of node pair (u, v) corresponds to the density of the cell that it maps to. The high density in the border cells indicates that many low-degree nodes connect to high-degree nodes. The dense central cells indicate that mid-degree nodes connect to each other.
B. Proximity Models
Proximity-based link prediction models (PM) model the connectivity of graphs based on the proximity of nodes. Some methods define the proximity of nodes with a heuristic, such as Common Neighbors (CN), Jaccard Similarity (JS), and Adamic/Adar (AA). More recent approaches learn latent similarities between nodes, capturing the proximity in latent embeddings such that nodes that are in close proximity in the graph have similar latent embeddings (e.g., dot product) [7].
Node Embedding. A node embedding, x_v ∈ R^d, is a real-valued, d-dimensional vector representation of a node v ∈ V. We denote all the node embeddings as matrix X ∈ R^{n×d}.
Definition 2 (Proximity Model - PM). Given a similarity or proximity function sim : V × V → R+ between nodes, the probability that nodes u and v link is an increasing function of their proximity: Pr(a_uv = 1 | sim(·,·)) = f(sim(u, v)).
Instances of the PM include the Latent Proximity Model:
  sim_LaPM(u, v) ≜ x_u^T x_v,   (2)
where x_u, x_v are the nodes' latent embeddings; and the Common Neighbors, Jaccard Similarity, and Adamic/Adar models:
  sim_CN(u, v) ≜ |N(u) ∩ N(v)|,   (3)
  sim_JS(u, v) ≜ |N(u) ∩ N(v)| / |N(u) ∪ N(v)|,   (4)
  sim_AA(u, v) ≜ Σ_{v′ ∈ N(u) ∩ N(v)} 1 / log |N(v′)|.   (5)
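These heuristic scores reduce to simple set operations over neighborhoods. A minimal sketch (our own helper names; adjacency is assumed stored as Python sets mapping each node to its neighbors):

```python
import math

def common_neighbors(N, u, v):
    """sim_CN, Eq. (3): size of the shared neighborhood."""
    return len(N[u] & N[v])

def jaccard(N, u, v):
    """sim_JS, Eq. (4): shared neighborhood normalized by the union."""
    union = len(N[u] | N[v])
    return len(N[u] & N[v]) / union if union else 0.0

def adamic_adar(N, u, v):
    """sim_AA, Eq. (5): common neighbors weighted inversely by log-degree.
    Degree-1 common neighbors are skipped to avoid division by log(1) = 0."""
    return sum(1.0 / math.log(len(N[w])) for w in N[u] & N[v] if len(N[w]) > 1)
```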
C. Proposed: Future Link Location Model
Unlike SBM and PM, our model, which we call the Future Link Location Model (FLLM), is not just modeling the probability of links, but rather where in the search space future links are likely to fall. To do so, FLLM uses a partition of the search space, and corresponding SBM, as a roadmap that gives the number of new edges expected to fall in each equivalence class. To formalize this idea, we first define two distributions:
New and Observed Distributions. The new link distribution p_n(C_i) ≜ Pr(C_i | E_new) and the observed link distribution p_o(C_i) ≜ Pr(C_i | E) capture the fraction of new and observed edges that fall in equivalence class C_i, respectively.
Definition 3 (Future Link Location Model - FLLM). Given an overlapping SBM with grouping Γ, the expected number of new links in equivalence class C_i is proportional to the number of observed links in C_i, and the probability of node pair (u, v) linking is equal to the pair's resemblance times their proximity relative to other nodes in [(u, v)]_{∼I}:
  Pr(a_uv = 1 | µ_u, µ_v, sim(·,·)) = ρ(u, v) · sim(u, v) / Σ_{(u′,v′) ∈ [(u,v)]_{∼I}} sim(u′, v′).
FLLM employs the following theorem, which states that if q% of the observed links fall in equivalence class C_i, then in expectation, q% of the unobserved links will fall in equivalence class C_i. We initially assume that the unobserved future links follow the same distribution as the observed links—as generally assumed in machine learning—i.e., the relative fraction of links in each equivalence class will be the same for future links as observed links: p_n = p_o. In the next subsection, we show that for a fixed k, the error in this assumption is determined by the total variation distance between p_n and p_o, and hence is upper-bounded by a constant.
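To make the scoring rule concrete, a minimal sketch of Dfn. 3 (illustrative helper names; the resemblance and similarity functions are assumed given, and class_pairs enumerates the pairs in [(u, v)]_{∼I}):

```python
def fllm_score(u, v, sim, resemblance, class_pairs):
    """Pr(a_uv = 1) under FLLM (Dfn. 3): the pair's resemblance times its
    proximity, normalized over all pairs in its equivalence class."""
    denom = sum(sim(a, b) for a, b in class_pairs)  # pairs in [(u, v)]~I
    return resemblance(u, v) * sim(u, v) / denom
```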
Theorem 1. Given an overlapping SBM with grouping Γ inducing the partition Π of V × V for a graph G = (V, E), out of k new (unobserved) links E_new, the expected number that will fall in equivalence class C_i and its variance are:
  E[|C_i ∩ E_new|] = k |C_i ∩ E| / m   (6)
  Var(|C_i ∩ E_new|) = k |C_i ∩ E| · |E \ C_i| / m²   (7)
Proof. Observe that the number of the k new edges that fall in equivalence class C_i, i.e., |C_i ∩ E_new|, is a binomial random variable over k trials, with success probability Pr(C_i | E_new). Thus, the random variable's expected value is
  E[|C_i ∩ E_new|] = k Pr(C_i | E_new),   (8)
and its variance is
  Var(|C_i ∩ E_new|) = k Pr(C_i | E_new)(1 − Pr(C_i | E_new)).   (9)
We can derive Pr(C_i | E_new) via Pr(C_i | E) and Bayes' rule:
  Pr(C_i | E) = Pr(E | C_i) Pr(C_i) / Pr(E) = ((|C_i ∩ E| / |C_i|) · (|C_i| / |V × V|)) / (m / |V × V|) = |C_i ∩ E| / m.
Combining the last equation with Eq. (8) results directly in Eq. (6), and by substituting into Eq. (9) we obtain:
  Var(|C_i ∩ E_new|) = k (|C_i ∩ E| / m)(1 − |C_i ∩ E| / m) = k |C_i ∩ E| · |E \ C_i| / m²,
where we used the fact that |E \ C_i| = |E| − |C_i ∩ E|.
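Theorem 1 is what makes the roadmap cheap to compute: the per-class expectations need only a single pass over the observed edges. A minimal sketch (hypothetical helper names; class labels are assumed precomputed, e.g., by the degree grouping of § IV-A):

```python
from collections import Counter
import math

def roadmap(edges, class_of, k):
    """For each equivalence class C_i, the expected number of the k new
    links that fall in it (Eq. 6) and its standard deviation (Eq. 7)."""
    m = len(edges)
    counts = Counter(class_of(u, v) for u, v in edges)  # |C_i ∩ E| per class
    return {c: (k * cnt / m, math.sqrt(k * cnt * (m - cnt) / m**2))
            for c, cnt in counts.items()}
```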
D. Guarantees on Error
While this derivation assumed that the future link distribution is the same as the observed link distribution, we now show that for a fixed k, the amount of error incurred when this assumption does not hold is entirely dependent on the total variation distance, and hence is upper-bounded by 2k.
Total Variation Distance. The total variation distance [27] between p_n and p_o, which is a metric, is defined as
  d_TV(p_n, p_o) ≜ sup_{A ⊂ Π} |p_n(A) − p_o(A)|.   (10)
Total Error. The total error made in the approximation of E[|C_i ∩ E_new|] using Eq. (6) is defined as
  ξ ≜ Σ_{C_i ∈ Π} |Ê[|C_i ∩ E_new|] − E[|C_i ∩ E_new|]| = Σ_{C_i ∈ Π} |k p_o(C_i) − k p_n(C_i)|,   (11)
where Ê[|C_i ∩ E_new|] is the true expected value regardless of whether or not p_n = p_o holds.
Theorem 2. The total error incurred over Π in the computation of the expected number of new edges that fall in each C_i ∈ Π is an increasing function of the number of new pairs k and the total variation distance between p_n and p_o. Furthermore, it has the following upper bound:
  ξ = 2k d_TV(p_n, p_o) ≤ min(2k, k √(2 D_KL(p_n || p_o))).   (12)
Proof. From the definition of total error in Eq. (11), the first equality holds from [14]. The inequality holds based on the fact that d_TV(·,·) ranges in [0, 1], and Pinsker's inequality [27], which upper-bounds d_TV(·,·) via KL-divergence.
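As a quick numerical check of the first equality in Eq. (12), on toy distributions of our own choosing:

```python
def total_variation(p_n, p_o):
    """d_TV between two distributions over the same classes (Eq. 10);
    for discrete distributions it equals half the L1 distance."""
    classes = set(p_n) | set(p_o)
    return 0.5 * sum(abs(p_n.get(c, 0.0) - p_o.get(c, 0.0)) for c in classes)

p_o = {"c1": 0.5, "c2": 0.3, "c3": 0.2}   # observed-link distribution
p_n = {"c1": 0.4, "c2": 0.4, "c3": 0.2}   # new-link distribution
k = 1000
xi = sum(abs(k * p_o[c] - k * p_n[c]) for c in p_o)  # total error, Eq. (11)
assert abs(xi - 2 * k * total_variation(p_n, p_o)) < 1e-9
```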
E. Proximity Model as a Special Case of FLLM
The PM, defined in § III-B, is a special case of FLLM, where FLLM's grouping contains just one group Γ = {V}. That is, if the nodes are not grouped, then the models give the same result. Thus, FLLM's improvement over LaPM is a result of using structurally-meaningful groupings over the graph. The following theorem states this result formally.
Theorem 3. For a single node grouping Γ = {V}, both PM and FLLM give the same ranking of pairs (u, v) ∈ V × V:
  Pr_PM(a_uv = 1 | sim(·,·)) > Pr_PM(a_u′v′ = 1 | sim(·,·)) ⟺ Pr_FLLM(a_uv = 1 | µ_u, µ_v, sim(·,·)) > Pr_FLLM(a_u′v′ = 1 | µ_u′, µ_v′, sim(·,·)).
Proof. Since Γ = {V}, Π = {V × V}, and since E ⊆ V × V, all observed edges fall in the lone equivalence class C_1 = V × V. Thus ρ(u, v) = 1 ∀ (u, v) ∈ V × V. Since there is only one equivalence class, the denominator in Dfn. 3 is equal to a constant c ≜ Σ_{(u′,v′) ∈ [(u,v)]_{∼I}} sim(u′, v′) = Σ_{(u′,v′) ∈ V × V} sim(u′, v′) ∀ (u, v) ∈ V × V. Therefore, Pr_FLLM(a_uv = 1 | µ_u, µ_v, sim(·,·)) = sim(u, v) / c, and both models are increasing functions of sim(·,·).
IV. METHOD
We solve Problem 2 by using our FLLM model in a new method, LINKWALDO, shown in Fig. 1, which has four steps:
• S1: Generate node groupings and equivalence classes.
• S2: Map the search space, deciding how many candidate pairs to return from each equivalence class.
• S3: Search each equivalence class, returning directly the highest-proximity pairs, and stashing some slightly lower-proximity pairs in a global pool.
• S4: Choose the best pairs from the global pool to augment those returned from each equivalence class.
We discuss these steps next, give pseudocode in Alg. 1, and discuss time complexity in the appendix.
A. Generating Node Groupings (S1)
In theory, we would like to infer the groupings that directly maximize the likelihood of the observed adjacency matrix. However, the techniques for inferring these groupings (and the corresponding node membership vectors) are computationally intensive, relying on Markov chain Monte Carlo (MCMC) methods [17]. Indeed, these methods are generally applied in networks with only up to a few hundred nodes [18]. In cases where n is large enough that considering all O(n²) node pairs would be computationally infeasible, so would be MCMC. Instead, LINKWALDO uses a fixed grouping, though it is agnostic to how the nodes are grouped. We discuss a number of sensible groupings below (a small sketch of the first follows the list), and discuss how to set the number of groups in § V-D. Any other grouping can be readily used within our framework, but should be carefully chosen to lead to strong results.
• Log-binned Node Degree (DG). This grouping captures degree assortativity [19], i.e., the extent to which low-degree nodes link with other low-degree nodes vs. high-degree nodes, by creating uniform bins in log-space (e.g., Fig. 1, with linear bins).
• Structural Embedding Clusters (SG). This grouping extends DG by clustering latent node embeddings that capture structural roles of nodes [24].
• Communities (CG). This grouping captures community structure by clustering proximity-preserving latent embeddings or using community detection methods.
• Multiple Groupings (MG). Any subset of these groupings, or any other groupings, can be combined into a new grouping by setting µ_v element(s) to 1 for v's membership in each grouping, since nodes can have overlapping group memberships.
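To make the DG grouping concrete, the sketch below log-bins node degrees into |Γ| groups (the helper name and binning details are our own illustration, not the reference implementation; the graph is assumed to have at least one edge):

```python
import math

def degree_grouping(N, num_groups):
    """Assign each node to a log-spaced degree bin (the DG grouping);
    N maps each node to its set of neighbors."""
    max_deg = max(len(nbrs) for nbrs in N.values())
    width = math.log(max_deg + 1) / num_groups   # uniform bins in log-space
    return {v: min(int(math.log(len(nbrs) + 1) / width), num_groups - 1)
            for v, nbrs in N.items()}

# Pair (u, v) then falls in roadmap cell (group[u], group[v]).
```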
B. Mapping the Search Space (S2)
LINKWALDO's approach to mapping the search space (i.e., identifying how many pairs to return per class C_i) follows directly from Thm. 1. LINKWALDO computes the expected number of pairs in each equivalence class based on Eq. (6) and its variance based on Eq. (7), as a measure of the uncertainty. When LINKWALDO searches each equivalence class C_i, it returns the expected number of pairs minus a standard deviation directly, and adds more pairs, up to a standard deviation past the mean, to a global pool P̃_G. Thus, LINKWALDO adds into P the E[|C_i ∩ E_new|] − √Var(|C_i ∩ E_new|) pairs in closest proximity in equivalence class C_i, and the next 2√Var(|C_i ∩ E_new|) closest pairs into the global pool P̃_G (both expressions are rounded to the nearest integer). Node pairs that are already linked are skipped.

Algorithm 1 LINKWALDO(G, sim(·,·), k, τ)
  /* S1: Generating Node Groupings */
  Generate node grouping Γ inducing partition Π   ▷ § IV-A
  P, P̃_G ← ∅, ∅   ▷ Initialize pairs to return and global pool
  for C_i ∈ Π do   ▷ Search each equivalence class, § IV-C
    /* S2: Mapping the Search Space */
    µ̄ ← E[|C_i ∩ E_new|]   ▷ Eq. (6)
    σ ← √Var(|C_i ∩ E_new|)   ▷ Eq. (7)
    /* S3: Discovering Closest Pairs per Equivalence Class */
    if |C_i| < τ then SELECTPAIRSEXACT(P, P̃_G, C_i, µ̄, σ)
    else SELECTPAIRSAPPROX(P, P̃_G, C_i, µ̄, σ)
  /* S4: Augmenting Pairs from Global Pool */
  P ← P ∪ {top k − |P| pairs from P̃_G}
  return P

procedure SELECTPAIRSEXACT(P, P̃_G, C_i, µ̄, σ)
  Sort pairs (u, v) ∈ C_i in descending order on sim(u, v)
  P ← P ∪ {top µ̄ − σ pairs}
  P̃_G ← P̃_G ∪ {next 2σ pairs}

procedure SELECTPAIRSAPPROX(P, P̃_G, C_i, µ̄, σ)
  for i = 1, 2, ..., r do   ▷ Create r trees
    B ← {(V_u, V_v)}   ▷ Buckets start off as the root
    while depth < b_max do   ▷ κ = µ̄ + σ; cf. Prob. 3
      Choose h(·) at random from H_rh
      B′ ← ∅   ▷ Create new buckets
      for β ∈ B do   ▷ Branch each leaf (bucket)
        β′_left ← ({u ∈ V_u^(β) : h(x_u) = 0}, {v ∈ V_v^(β) : h(x_v) = 0})
        β′_right ← ({u ∈ V_u^(β) : h(x_u) = 1}, {v ∈ V_v^(β) : h(x_v) = 1})
        B′ ← B′ ∪ {β′_left, β′_right}
      if Vol(B′) < κ then break   ▷ Undo last branch
      B ← B′
  Search pairs sharing a bucket in any tree for the closest
  P ← P ∪ {top µ̄ − σ pairs}
  P̃_G ← P̃_G ∪ {next 2σ pairs}
C. Discovering Closest Pairs per Equivalence Class (S3)
We now discuss how LINKWALDO discovers the κ closest unlinked pairs within each equivalence class (Fig. 1), where κ is determined in step S2 based on the expected number of pairs in the equivalence class, and variance (uncertainty).
Problem 3. Given an equivalence class C_i, return the top-κ unlinked pairs in C_i in closest proximity sim(·,·), where κ = E[|C_i ∩ E_new|] + √Var(|C_i ∩ E_new|) (based on S2).
For equivalence classes smaller than some tolerance τ, it is feasible to search all pairs of nodes exhaustively. However, for |C_i| > τ, this should be avoided, to make the search practical. We first discuss this case when using the dot product similarity sim_LaPM(·,·) in Eq. (2), and then discuss it for other similarity models (CN, JS, and AA) given by Eqs. (3)-(5). Finally, we introduce a refinement that improves the robustness of LINKWALDO against errors in proximity.
1) Avoiding Exhaustive Search for Dot Product:
In the case of dot product, we use Locality Sensitive Hashing (LSH) [28] to avoid searching all |C_i| pairs. LSH functions have the property that the probability of two items colliding is a function of their similarity. We use the following fact:
Fact 1. The equivalence class C_i can be decomposed into the Cartesian product of two sets C_i = V_u × V_v, where V_u ≜ {u : µ_u = µ} and V_v ≜ {v : µ_v = µ′}.
At a high level, to solve Prob. 3, we hash each node embedding of the nodes in V_u and V_v using a locality sensitive hash function. We design the hash function, described next, such that the number of pairs that map to the same bucket is greater than κ, but as small as possible, to maximally prune pairs. Once the embeddings are hashed, we search the pairs in each hash bucket for the κ closest. We normalize the embeddings so that dot product is equivalent to cosine similarity, and use the Random Hyperplane LSH family [4].
Definition 4 (Random Hyperplane Hash Family). The random hyperplane hash family H_rh is the set of hash functions H_rh ≜ {h : R^d → {0, 1}}, where r_h is a random d-dimensional Gaussian unit vector and
  h(x) ≜ 1 if r_h^T x ≥ 0, and 0 if r_h^T x < 0.
This hash family is well-known to provide the property that the probability of two vectors colliding is a function of the degree of the angle between them [2]:
  Pr(h(x_u) = h(x_v)) = 1 − θ(x_u, x_v)/π = 1 − arccos(x_u^T x_v)/π,
where the last equality holds due to normalized embeddings. To lower the false positive rate, it is conventional to form a new hash function by sampling b hash functions from H_rh and concatenating the hash codes: g(·) = (h_1(·), h_2(·), ..., h_b(·)). The new hash function is from another LSH family:
Definition 5 (b-AND-Random Hyperplane Hash Family). The b-AND-random hyperplane hash family is the set of hash functions H_b^and ≜ {g : R^d → {0, 1}^b}, where g(x) = (h_1(x), h_2(x), ..., h_b(x)) is formed by concatenating b randomly sampled hash functions h_i(·) ∈ H_rh for some b ∈ N.
Since the hash functions are sampled randomly from H_rh,
  Pr(g(x_u) = g(x_v)) = (1 − arccos(x_u^T x_v)/π)^b.   (13)
Only vectors that are not split by all b random hyperplanes end up with the same hash codes, so this process lowers the false positive rate. However, it also increases the false negative rate for the same reason. The conventional LSH scheme then repeats the process r times, computing the dot product exactly over all pairs that match in at least one b-dim hash code, in order to lower the false negative rate. The challenge of this approach is determining how to set b. To do so, we first define the hash buckets of a hash function, and their volume.
Definition 6 (Hash Buckets and Volume). Given an equivalence class C_i = V_u × V_v and a hash function g(·) : R^d → {0, 1}^b, after applying g(·) to all v ∈ V_u ∪ V_v, a hash bucket β = {u ∈ V_u, v ∈ V_v : g(u) = g(v) = β_hashcode} consists of subsets V_u^(β) ⊆ V_u, V_v^(β) ⊆ V_v of nodes that mapped to hashcode β_hashcode ∈ {0, 1}^b. The set of hash buckets B_g = {β : |β| > 0} consists of all non-empty buckets. We define the volume of the buckets as the number of pairs (u, v) where u and v landed in the same bucket: Vol(B_g) ≜ |{(u, v) : g(x_u) = g(x_v)}| = Σ_{β ∈ B_g} |V_u^(β) × V_v^(β)|.
Since we are after the κ closest pairs, we want to find a hash function g(·) such that Vol(B_g) ≥ κ. But since we want to search as few pairs as possible, we seek the value of b that minimizes Vol(B_g) for some g(·) ∈ H_b^and, subject to the constraint that Vol(B_g) ≥ κ. Any hash function g ∈ H_b^and corresponds to a binary prefix tree, like Fig. 2.
Fig. 2: LSH Tree.
Each level of the tree corresponds to one h ∈ H_rh, and the leaves correspond to the buckets B_g. Thus, to automatically identify the best value of b, we can recursively grow the tree, branching each leaf with a new random hyperplane hash function h ∈ H_rh, until Vol(B_g) < κ, then undo the last branch. At that point, the depth of the tree equals b, and is the largest value such that Vol(B_g) ≥ κ. To prevent this process from repeating indefinitely in edge cases, we halt the branching at a maximum depth b_max. This approach is closely related to LSH Forests [2], but with some key differences, which we discuss below.
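A compact sketch of the random hyperplane hashing of Defs. 4-5 (illustrative names; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def hyperplane_hashes(X, b):
    """Hash rows of X (normalized embeddings) to b-bit codes (Defs. 4-5):
    bit j is 1 iff the point lies on the positive side of hyperplane j."""
    R = rng.standard_normal((X.shape[1], b))   # b random hyperplanes
    return (X @ R >= 0).astype(np.uint8)       # n x b binary codes

def bucketize(codes):
    """Group row indices by hash code; pairs sharing a bucket are candidates."""
    buckets = {}
    for i, code in enumerate(map(tuple, codes)):
        buckets.setdefault(code, []).append(i)
    return buckets

# Deeper codes (larger b) split buckets further, shrinking the volume
# Vol(B_g); the tree-growing procedure above increases b until the volume
# would drop below kappa.
```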
Theorem 4. Given a hash function g ∈ H_b^and, the κ closest pairs in C_i are the κ most likely pairs to be in the same bucket:
  Pr(g(x_u) = g(x_v)) > Pr(g(x_u′) = g(x_v′)) ⟺ x_u^T x_v > x_u′^T x_v′.
Proof. Since arccos(x) is a decreasing function of x, Eq. (13) shows that Pr(g(x_u) = g(x_v)) is an increasing function of x_u^T x_v. The result follows from this.
While x_u^T x_v > x_u′^T x_v′ implies that (u, v) are more likely than (u′, v′) to be in the same bucket, it does not guarantee that this outcome will always happen. Thus, we repeat the process r times, creating r binary prefix trees, and searching the pairs that fall in the same bucket in any tree for the top κ. Setting the r parameter is considered of minor importance, as long as it is sufficiently large (e.g., 10) [2].
Differences from LSH Forests [2]. LSH Forests are designed for k-NN search, which seeks to return the nearest neighbors to a query vector. In contrast, our approach is designed for κ-closest-pairs search, which seeks to return the κ closest pairs in a set C_i. LSH Forests grow each tree until each vector is in its own leaf. We grow each tree until we reach the target bucket volume κ. LSH Forests allow variable-length hash codes, since the nearest neighbors of different query vectors may be at different relative distances. All our leaves are at the same depth, so that the probability of (u, v) surviving together to the leaf is an increasing function of their dot product.
2) Avoiding Exhaustive Search for Heuristics: For the heuristic definitions of proximity in Eqs. (3)-(5), there are two approaches to solving Prob. 3. The first is to construct embeddings from the CN and AA scores (this does not apply to JS). For CN, if we let the node embeddings be their corresponding rows in the adjacency matrix, i.e., X_CN = A, then sim_CN(u, v) = x_u^T x_v. Similarly, X_AA = A · log(D)^{−1/2} yields sim_AA(u, v) = x_u^T x_v, where D is a diagonal matrix recording the degree of each node. Thus, the LSH solution just described can be applied. The second approach uses the fact that all three heuristics are defined over the 1-hop neighborhoods of nodes (u, v). Thus, to have nonzero proximity, (u, v) must be within 2 hops of each other, and any pairs not within 2 hops can implicitly be ignored.
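A sketch of the first approach—constructing sparse "embeddings" whose dot products recover the CN and AA scores (scipy assumed; the degree guard is our own detail):

```python
import numpy as np
import scipy.sparse as sp

def heuristic_embeddings(A):
    """Sparse embeddings whose dot products recover CN and AA scores.
    A: scipy.sparse CSR adjacency matrix."""
    X_cn = A                                 # x_u . x_v = |N(u) ∩ N(v)|
    deg = np.asarray(A.sum(axis=1)).ravel()
    w = np.zeros_like(deg, dtype=float)
    mask = deg > 1                           # log(1) = 0 would divide by zero
    w[mask] = 1.0 / np.sqrt(np.log(deg[mask]))
    X_aa = A @ sp.diags(w)                   # x_u . x_v = sum 1/log d_w
    return X_cn, X_aa
```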
3) Bail Out Refinement:
To this point we have assumed that the proximity model used in LINKWALDO is highly informative and accurate. However, in reality, heuristics may not be informative for all equivalence classes, and even learned, latent proximity models can fail to encode adequate information. For instance, it is challenging to learn high-quality representations for low-degree nodes. Thus, we introduce a refinement to LINKWALDO that automatically identifies when a proximity model is uninformative in an equivalence class, and allows it to bail out of searching that equivalence class.
Proximity Model Error. The error that a proximity model makes is the probability Pr(sim(u, v) < sim(u′, v′)) that it gives a higher proximity for some unlinked pair (u′, v′) ∉ E than for some linked pair (u, v) ∈ E.
By this definition of error, we expect strong proximity models to mostly assign higher proximity between observed edges than future or missing edges: Pr(sim(u, v) > sim(u′, v′)) ≈ 1 for some (u, v) ∈ E and (u′, v′) ∈ E_new. Thus, on our way to finding the top-κ most similar (unlinked) pairs in an equivalence class (Problem 3), we expect to encounter a majority of the observed edges (linked pairs) |E ∩ C_i| that fall in that class. For a user-specified error tolerance ζ, LINKWALDO will bail out and return no pairs from any equivalence class where less than a ζ fraction of its observed edges are encountered on the way to finding the κ most similar unlinked pairs. LINKWALDO keeps track of how many pairs were skipped by bailing out, and replaces them (after step S4) by adding to P the top-ranked pairs of a heuristic (e.g., AA).
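A simplified sketch of the bail-out test (our own rendering: it scans a class's candidates in descending proximity and measures how many of the class's observed edges are encountered before κ unlinked pairs are collected):

```python
def search_with_bailout(ranked_pairs, observed, kappa, zeta):
    """ranked_pairs: pairs of C_i sorted by descending sim(u, v);
    observed: the set of linked pairs in C_i. Returns the top-kappa
    unlinked pairs, or None (bail out) if proximity looks uninformative."""
    selected, seen_linked = [], 0
    for pair in ranked_pairs:
        if pair in observed:
            seen_linked += 1          # proximity agrees with a known edge
        else:
            selected.append(pair)
            if len(selected) == kappa:
                break
    if observed and seen_linked / len(observed) < zeta:
        return None                   # fewer than a zeta fraction seen: bail
    return selected
```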
D. Augmenting Pairs from Global Pool (S4)
Since LINKWALDO returns a standard deviation below the expected number of new pairs in each equivalence class, it chooses the remaining pairs up to k from P̃_G. To do so, it considers pairs in descending order on the input similarity function sim(·,·), and greedily adds to P until |P| = k.
V. EVALUATION
We evaluate LINKWALDO on three research questions: (RQ1) Does the set P returned by LINKWALDO have high recall and precision? (RQ2) Is LINKWALDO scalable? (RQ3) How do parameters affect performance?
TABLE II: Dataset statistics: whether the graph is temporal or static, density, degree assortativity [19], and number of nodes and edges.
  Graph        Time  Density  Assortativity  n        m
  Yeast        -     0.41%    0.4539         2,375    11,693
  DBLP         -     0.06%    -0.0458        12,595   49,638
  Facebook1    -     1.08%    0.0636         4,041    88,235
  MovieLens    ✓     <0.01%   -0.0557        279,376  1,546,541
  Protein-Soy  -     1.64%    -0.0192        45,116   16,691,679

A. Data & Setup
We evaluate LINKWALDO on a large, diverse set of networks: metabolic, social, communication, and information networks. Moreover, we include datasets to evaluate in both LP scenarios: (1) returning possible missing links in static graphs and (2) returning possible future links in temporal graphs. We treat all graphs as undirected.
Metabolic. Yeast [29], HS-Protein [11], and Protein-Soy [13] are metabolic protein networks, where edges denote known associations between proteins in different species. Yeast contains proteins in a species of yeast, HS-Protein in human beings, and Protein-Soy in Glycine max (soybeans).
Social. Facebook1 [13] and Facebook2 [11] capture friendships on Facebook, Reddit [13] encodes links between subreddits (topical discussion boards), edges in Epinions [11] connect users who trust each other's opinions, MathOverflow [13] captures comments and answers on math-related questions (e.g., user u answered user v's question), and Digg [23] captures friendships among users.
Communication. Enron [11] is an email network, capturing emails sent during the collapse of the Enron energy company.
Information. DBLP [11] is a citation network, and arXiv [13] is a co-authorship network of astrophysicists. MovieLens [11] is a bipartite graph of users rating movies for the research project MovieLens; edges connect users and the movies that they rated.
Training Graph and Ground Truth. While using LINKWALDO in practice does not require a test set, in order to know how effective it is, we must evaluate it on ground-truth missing links. As ground truth, we remove 20% of the edges. In the static graphs, we remove 20% at random. In the temporal graphs, we remove the 20% of edges with the most recent timestamps. If either of the nodes in a removed edge is not present in the training graph, we discard the edge from the ground truth. The graph with these edges removed is the training graph, which LINKWALDO and the baselines observe when choosing the set of unlinked pairs to return.
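A minimal sketch of this ground-truth split (helper names are ours; timestamps are optional):

```python
import random

def split_edges(edges, frac=0.2, timestamps=None, seed=0):
    """Hold out frac of the edges as ground truth: the most recent ones
    if timestamps are given (temporal), otherwise a random sample (static)."""
    if timestamps is not None:
        order = sorted(edges, key=lambda e: timestamps[e])
    else:
        order = random.Random(seed).sample(edges, len(edges))
    cut = int(len(edges) * (1 - frac))
    train, test = order[:cut], order[cut:]
    # Discard held-out edges whose endpoints vanish from the training graph.
    train_nodes = {v for e in train for v in e}
    test = [(u, v) for u, v in test if u in train_nodes and v in train_nodes]
    return train, test
```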
Setup.
We discuss in § V-D how we choose which groupings to use, and how many groups in each. Whenever used, we implement SG and CG by clustering embeddings with KMeans: xNetMF [8] and NetMF [22] (window size 1), respectively. In LSH, we set the maximum tree depth b_max dynamically based on the size of an equivalence class, increasing it from 12 to 15, 20, and finally 30 for successively larger |C_i|. We set the number of trees r based on the fraction κ/|C_i| of the class that we seek to return, increasing it from r = 5 to r = 10 and r = 25 as this fraction grows.
B. Recall and Precision (RQ1)
Task Setup. We evaluate how effectively LINKWALDO returns in P the ground-truth missing links, at values of k much smaller than n². We report k, chosen based on dataset size, in Tab. III, and discuss effects of the choice in the appendix. We compare the set LINKWALDO returns to those of five baselines, and evaluate both LINKWALDO-D, which uses grouping DG, and LINKWALDO-M, which uses DG, SG, and CG together. In both LINKWALDO variants, we consider the following proximities (cf. § III-B) as input, and report the results that are best: LaPM using NetMF [22] embeddings (window sizes 1 and 2), and AA, the best heuristic proximity. For the bipartite MovieLens, we use BiNE [6], an embedding method designed for bipartite graphs. We report the input proximity model for each dataset in Tab. V in the appendix. We set the exact-search tolerance to τ = 25M and the bailout tolerance ζ to the value determined via a parameter study in § V-D. Results are averages over five random seeds (§ V-A): for static graphs, the randomly-removed edges are different for each seed; for temporal graphs, the latest edges are always removed, so the LSH hash functions are the main source of randomness.
Metrics. We use Recall (R@k), the fraction of known missing/future links that are in the size-k set returned by the method, and Precision (P@k), the fraction of the k pairs that are known to be missing/future links. Recall is a more important metric, since (1) the returned set of pairs P does not contain final predictions, but rather pairs for an LP method to make final decisions about, and (2) our real-world graphs are inherently incomplete, and thus pairs returned that are not known to be missing links could nonetheless be missing in the original dataset prior to ground-truth removal (i.e., the open-world assumption [25]). We report both in Table III.
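For concreteness, the two metrics in a few lines (a sketch; pairs are assumed canonicalized so that (u, v) and (v, u) compare equal):

```python
def recall_precision_at_k(candidates, ground_truth):
    """candidates: the size-k set P a method returns;
    ground_truth: the held-out missing/future links E_new."""
    hits = len(set(candidates) & set(ground_truth))
    return hits / len(ground_truth), hits / len(candidates)  # R@k, P@k
```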
Baselines. We use five baselines. NMF+BAG [5] uses non-negative matrix factorization (NMF) and a bagging ensemble to return k pairs while pruning the search space. We use their reported strongest version: the Biased Edge Bagging version with Node Uptake and Edge Filter optimizations (Biased(NMF+)). We use the authors' recommended parameters when possible, including ε = 1, d = 50 latent factors, the suggested values of µ, f, and ρ, and ensemble size µ/f. In some cases, these suggested parameters led to fewer than k pairs being returned, in which case we tweaked the values of ε, µ, and f until k were returned. We report these deviations in Tab. V in the appendix. We use our own implementation.
We also use four proximity models, which we showed to be special cases of FLLM in § III-E: LaPM ranks pairs globally based on the dot product of their embeddings, and returns the top k. To avoid searching all pairs, we use the same LSH scheme that we introduce in § IV-C for LINKWALDO. We set r = 25, and like LINKWALDO, use NetMF with a window size of 1 or 2, except for MovieLens, where we use BiNE. JS, CN, and AA are defined in § III-B. We exploit the property described in § IV-C2—i.e., all these scores are zero for nodes beyond two hops. We compute the scores for all nodes within two hops, and return the top k unlinked pairs.
Results. Across the 13 datasets, LINKWALDO is the best-performing method on 10, in both recall and precision. The LINKWALDO-M variant is slightly stronger than LINKWALDO-D, but the small gap between the two demonstrates that even simple node groupings can lead to strong improvements over baselines. LINKWALDO generalizes well across the diverse types of networks. In contrast, the heuristics perform well on social networks, but not as well on, e.g., metabolic networks (Yeast, HS-Protein, and Protein-Soy). Furthermore, the heuristic baselines cannot extend to bipartite graphs like MovieLens, because fundamentally, all links form between nodes more than one hop away. These observations demonstrate the value of learning from the observed links, which LINKWALDO does via resemblance. We also observe that heuristic definitions of similarity, such as AA, outperform latent embeddings (LaPM) that capture proximity. We conjecture that the embedding methods are more sensitive to the massive skew of the data, because even random vectors in high-dimensional space can end up with some level of proximity, due to the curse of dimensionality. This suggests that the standard approach of evaluating on a balanced test set may artificially inflate results.
In the three datasets where LINKWALDO does not outperform AA, it is only outperformed by a small margin. Furthermore, the four datasets with the largest total variation distances between p_n and p_o are MovieLens, MathOverflow, Enron, and Digg. Theorem 2 suggests that LINKWALDO may incur the most error in these datasets. Indeed, three of these four are the only datasets where LINKWALDO fails to outperform all other methods (the fourth, MovieLens, being bipartite, as discussed above). While the performance on temporal networks is strong, the higher total variation distance suggests that the assumption that p_o = p_n may sometimes be violated due to concept drift [3]. Thus, a promising future research direction is to use the timestamps of observed edges to predict roadmap drift over time, in order to more accurately estimate the future roadmap.
C. Scalability (RQ2)
Task Setup. We evaluate how LINKWALDO scales with the number of edges and the number of nodes in a graph by running LINKWALDO with fixed parameters on all datasets. We set k = 1M, use NetMF (window size 1) as sim(·,·), and do not perform bailout (ζ = 0). All other parameters are identical to RQ1. We use our Python implementation on an Intel(R) Xeon(R) CPU E5-2697 v3, 2.60GHz with 1TB RAM.
Results. The results in Fig. 3 demonstrate that in practice, LINKWALDO scales linearly on the number of edges, and sub-quadratically on the number of nodes.
TABLE III: On each dataset, we highlight the cell of the top-performing method with bold text and a gray background, and the second best with bold text only. An "*" denotes statistical significance at a 0.05 p-value in a paired t-test; "**" means that the better-performing variant of LINKWALDO was also significantly better than the other variant at the same p-value. Under each dataset, we give the percentage of the quadratic search space that the value of k corresponds to (usually a tiny fraction of it). On average, LINKWALDO-M is the best-performing method, and LINKWALDO-D the second best. Columns: NMF+BAG [5], LaPM, JS, CN, AA, LINKWALDO-D, LINKWALDO-M. Rows report R@k and P@k for Yeast (k = 10K); DBLP, Facebook1, MovieLens, HS-Protein, and arXiv (k = 100K); MathOverflow, Enron, Reddit, Epinions, and Facebook2 (k = 1M); and Digg and Protein-Soy (k = 10M).
Fig. 3: LINKWALDO is sub-quadratic on the number of nodes (a) and linear on the number of edges (b).
D. Parameters (RQ3)
Setup. We evaluate the quality of different groupings (§ IV-A), and how the number of groups in each affects performance. On four graphs, Yeast, arXiv, Reddit, and Epinions, we run LINKWALDO with groupings DG, SG, and CG, varying the number of groups |Γ|. We also investigate pairs of groupings, and the combination of all three groupings, via grid search over the number of groups in each. We also evaluated multiple values of τ, the tolerance for searching equivalence classes exactly vs. approximately with LSH, and of ζ, the fraction of training pairs we allow the proximity function to miss before we bail out of an equivalence class.
Results. The results for the individual groupings are shown in Fig. 4. Grouping by log-binning nodes based on their degree (i.e., DG) is in general the strongest grouping. Across all three groupings, we find that |Γ| = 25 is a good number of groups. We found that using all three groupings was the best combination, with 25 log-bins, 5 structural clusters, and 5 communities (we omit the figures for brevity). For individual groupings, we observe diminishing returns, and in multiple groupings, slightly diminished performance when the number of groups in each grows large. We omit the figures for τ and ζ, but found τ = 25M, together with a nonzero bailout tolerance ζ, to be the best setting.

Fig. 4: Number of groups for groupings DG, SG, and CG.

VI. CONCLUSION
In this paper, we focus on the under-studied and challenging problem of identifying a moderately-sized set of node pairs for a link prediction method to make decisions about. We mitigate the vastness of the search space, filled with mostly non-links, by considering not just proximity, but also how much a pair of nodes resembles observed links. We formalize this idea in the Future Link Location Model, show its theoretical connections to stochastic block models and proximity models, and introduce an algorithm, LINKWALDO, that leverages it to return high-recall candidate sets, with only a tiny fraction of all pairs. Via our resemblance insight, LINKWALDO's strong performance generalizes from social networks to protein networks. Future directions include investigating the directionality of links, since the roadmap can incorporate this information, and extending to heterogeneous graphs with many edge and node types, like knowledge graphs.
ACKNOWLEDGEMENTS
This work is supported by an NSF GRF, NSF Grant No. IIS 1845491, Army Young Investigator Award No. W911NF1810397, and Adobe, Amazon, and Google faculty awards.

REFERENCES
[1] Lada A. Adamic and Eytan Adar. Friends and neighbors on the web. Social Networks, 25(3):211-230, 2003.
[2] Mayank Bawa, Tyson Condie, and Prasanna Ganesan. LSH forest: Self-tuning indexes for similarity search. In WWW, pages 651-660, 2005.
[3] Caleb Belth, Xinyi Zheng, and Danai Koutra. Mining persistent activity in continually evolving networks. In KDD, 2020.
[4] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[5] Liang Duan, Shuai Ma, Charu Aggarwal, Tiejun Ma, and Jinpeng Huai. An ensemble approach to link prediction. IEEE TKDE, 29(11), 2017.
[6] Ming Gao, Leihui Chen, Xiangnan He, and Aoying Zhou. BiNE: Bipartite network embedding. In SIGIR, 2018.
[7] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Eng. Bull., 40(3):52-74, 2017.
[8] Mark Heimann, Haoming Shen, Tara Safavi, and Danai Koutra. REGAL: Representation learning-based graph alignment. In CIKM, 2018.
[9] Unmesh Joshi and Jacopo Urbani. Searching for embeddings in a haystack: Link prediction on knowledge graphs with subgraph pruning. In WebConf, 2020.
[10] Thomas N. Kipf and Max Welling. Variational graph auto-encoders. In NIPS Workshop on Bayesian Deep Learning, 2016.
[11] Jérôme Kunegis. KONECT: The Koblenz network collection. In WWW, 2013.
[12] Pierre Latouche, Etienne Birmelé, Christophe Ambroise, et al. Overlapping stochastic block models with application to the French political blogosphere. Annals App. Stat., 5(1):309-336, 2011.
[13] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[14] David A. Levin and Yuval Peres. Markov Chains and Mixing Times, volume 107. American Mathematical Soc., 2017.
[15] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. JASIST, 58(7):1019-1031, 2007.
[16] Víctor Martínez, Fernando Berzal, and Juan-Carlos Cubero. A survey of link prediction in complex networks. CSUR, 49(4):1-33, 2016.
[17] Nikhil Mehta, Lawrence Carin, and Piyush Rai. Stochastic blockmodels meet graph neural networks. In ICML, 2019.
[18] Kurt Miller, Michael I. Jordan, and Thomas L. Griffiths. Nonparametric latent feature models for link prediction. In NeurIPS, 2009.
[19] Mark E. J. Newman. Mixing patterns in networks. Phys. Rev. E, 67(2), 2003.
[20] Krzysztof Nowicki and Tom A. B. Snijders. Estimation and prediction for stochastic blockstructures. JASA, 96(455):1077-1087, 2001.
[21] Benjamin Pachev and Benjamin Webb. Fast link prediction for large networks using spectral embedding. J. Complex Netw., 6(1):79-94, 2018.
[22] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In WSDM, pages 459-467, 2018.
[23] Ryan Rossi and Nesreen Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, 2015.
[24] Ryan A. Rossi, Di Jin, Sungchul Kim, Nesreen Ahmed, Danai Koutra, and John Boaz Lee. On proximity and structural role-based embeddings in networks: Misconceptions, techniques, and applications. TKDD, 2020.
[25] Tara Safavi, Danai Koutra, and Edgar Meij. Evaluating the calibration of knowledge graph embeddings for trustworthy link prediction. In EMNLP, 2020.
[26] Dongjin Song, David A. Meyer, and Dacheng Tao. Top-k link recommendation in social networks. In ICDM, pages 389-398. IEEE, 2015.
[27] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.
[28] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927, 2014.
[29] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In NeurIPS, pages 5165-5175, 2018.

APPENDIX
A. Effect of k
Table IV gives results (averages over 3 seeds) for multiple values of k. For reasonable values of k—roughly up to an order of magnitude greater than m—results are mostly stable. The main exception is for small k, where NMF+BAG performs well in some cases. This is consistent with its design: to return a small, accurate set of top-k predictions, rather than a candidate set.
B. Complexity Analysis
Let γ be the time complexity of the node grouping (S1). Computing the expected number of new edges in each cell (and variance), directly from the observed links (S2), is O(m). The complexity of searching equivalence classes (S3) comes from hashing each node in the decomposition O(b_max) times, and finding the κ_i closest pairs in the O(κ_i) pairs that land in the same bucket: Σ_{C_i ∈ Π} O(|V_u ∪ V_v| b_max + κ_i) = O(n b_max + k). This assumes that we do not encounter unrealistic scenarios, e.g., the embeddings being equivalent and hence inseparable, and that b_max is set large enough that the volume of tree leaves is not asymptotically larger than O(κ_i). Adding from the global pool (S4) takes O(k) time, since |P̃_G| = O(k) and can be maintained in sorted order (similar to the merge in merge sort). Thus, the total time complexity is O(γ + m + n b_max + k).

TABLE IV: We report "< k" if fewer than k pairs are returned. Black cells indicate values of k outside the scale of the dataset. Columns: NMF+BAG, AA, and LINKWALDO-M, for HS-Protein and Facebook2; rows report performance at k = 10K, 100K, 1M, and 5M.
TABLE V: Input Proximity Model for LaPM and LINKWALDO, and parameter deviations from default for NMF+BAG.