Learning low-rank latent mesoscale structures in networks
Hanbaek Lyu, Yacoub H. Kureh, Joshua Vendrow, Mason A. Porter
Abstract.
It is common to use networks to encode the architecture of interactions between entities in complex systems in the physical, biological, social, and information sciences. Moreover, to study the large-scale behavior of complex systems, it is important to study mesoscale structures in networks as building blocks that influence such behavior [17, 43]. In this paper, we present a new approach for describing low-rank mesoscale structure in networks, and we illustrate our approach using several synthetic network models and empirical friendship, collaboration, and protein–protein interaction (PPI) networks. We find that these networks possess a relatively small number of 'latent motifs' that together can successfully approximate most subnetworks at a fixed mesoscale. We use an algorithm that we call "network dictionary learning" (NDL) [30], which combines a network-sampling method [29] and nonnegative matrix factorization [19, 30], to learn the latent motifs of a given network. The ability to encode a network using a set of latent motifs has a wide range of applications to network-analysis tasks, such as comparison, denoising, and edge inference. Additionally, using our new network denoising and reconstruction (NDR) algorithm, we demonstrate how to denoise a corrupted network by using only the latent motifs that one learns directly from the corrupted network itself.
Department of Mathematics, University of California, Los Angeles, CA 90095, USA
E-mail addresses: {hlyu, ykureh, jvendrow, mason}@math.ucla.edu. Our code for the main algorithms and simulations is available at https://github.com/HanbaekLyu/NDL_paper. We also provide a user-friendly version as a Python package, ndlearn; see https://github.com/jvendrow/Network-Dictionary-Learning.

It is often insightful to examine structures in networks [40] at an intermediate scale (i.e., at a 'mesoscale') that lies above the microscale of nodes and edges but below the macroscale of distributions of local network properties. There is a large variety of mesoscale structures in networks, including community structure [9, 47], core–periphery structure [49], and role structures [1]. We are interested in mesoscale network structures that are large enough that it is reasonable to discuss their collective properties, but that are also small enough that we can discuss their statistical properties. In this paper, we examine mesoscale network structures that we obtain using k-node induced subgraphs of networks. These subgraphs have k nodes that inherit their adjacency structure from the original networks from which we draw them. Because most real-world networks are sparse [40], independently choosing a set of k nodes from a network may not return meaningful information. Instead, we use motif sampling [29]: we first uniformly randomly sample a set of k nodes that form a path (this set is called a 'k-chain motif'), and we then obtain the subgraph that is induced by that k-chain motif by including all of the edges between those k nodes. This guarantees that we sample a connected subgraph of a network while assuming very little about the structure of the original network; by repeating this process, we obtain a data set of 'mesoscale patches' of a network. We then use 'dictionary-learning' algorithms [32] to learn mesoscale structures of networks that we call 'latent motifs'. We then use latent motifs to infer subgraph structures of networks, compare different networks, and denoise corrupted networks.

Dictionary-learning algorithms are machine-learning techniques that learn interpretable latent structures of complex data sets, and they are applied regularly in the data analysis of text and images [7, 34, 46]. Such algorithms usually consist of two steps. First, one samples a large number of structured subsets of a data set (e.g., square patches of an image or collections of a few sentences of a text); we refer to such a subset as a mesoscale patch of a data set. Second, one finds a set of basis elements such that taking a nonnegative linear combination of them can successfully approximate each of the sampled mesoscale patches. Such a set of basis elements is called a dictionary, and one can interpret each basis element as a latent structure of the data set. As an example, consider the image of the artwork
Cycle by M. C. Escher in Figure 1a. We first sample 10,000 square patches of 21 × 21 pixels, and we then use a nonnegative matrix factorization (NMF) [19] algorithm to find a dictionary with r = 25 square patches (see Figure 1a). Each element of the learned dictionary describes a latent shape in the image.

Algorithms for network dictionary learning (NDL) [30] use a similar idea. As mesoscale patches of a network, we use the k-node subgraphs that are induced by motif sampling. We represent these subgraphs using their k × k adjacency matrices. After obtaining sufficiently many mesoscale patches of a network, we apply NMF to learn a dictionary for the latent adjacency matrices, which we call latent motifs of the network. We give a complete implementation of our approach in Algorithm 1 of the Supplementary Information (SI). See our SI for more details, including the theoretical guarantees for Algorithm 1 in Theorems G.2 and G.5.
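The image example is easy to reproduce in outline. The following Python sketch is not the authors' released code; the image array img is a stand-in, while k = 21, n = 10,000, and r = 25 follow the text. It samples square patches and factorizes them with scikit-learn's NMF:

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
img = rng.random((512, 512))  # stand-in for a grayscale image (e.g., Escher's "Cycle")

k, n, r = 21, 10_000, 25
patches = np.empty((k * k, n))
for t in range(n):
    i = rng.integers(0, img.shape[0] - k)
    j = rng.integers(0, img.shape[1] - k)
    patches[:, t] = img[i:i + k, j:j + k].reshape(-1)  # vectorized k x k patch

# X ~ W H with nonnegative factors; the columns of W are latent shapes
model = NMF(n_components=r, init="random", random_state=0, max_iter=500)
H = model.fit_transform(patches.T)   # n x r coefficient matrix
W = model.components_.T              # (k*k) x r dictionary
latent_shapes = W.reshape(k, k, r)   # each column of W reshaped to a k x k patch

The analogous computation on vectorized k × k adjacency matrices, rather than pixel patches, yields latent motifs of a network.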
[Figure 1 schematic: a network and a motif feed an MCMC motif-sampling step (Lyu, Memoli, and Sivakoff), whose sampled mesoscale patches feed online matrix factorization for Markovian data (Lyu, Needell, and Balzano) to produce an interpretable low-rank dictionary. The panels preview network dictionaries for the UCLA and Caltech Facebook networks and an image dictionary for Cycle by M. C. Escher; panels are labeled a, b, c.]
Figure 1.
Illustration of mesoscale structures that we learn from (a) images and (b, c) networks. In all experiments in this figure, we form a matrix X of size d × n by sampling n mesoscale patches of size d = 21 × 21 from the corresponding object. For the image in panel (a), the columns of X are square patches of 21 × 21 pixels. In panels (b) and (c), we show both heat maps and adjacency matrices. We take the columns of X to be the k × k adjacency matrices of the connected subgraphs that are induced by a walk of k = 21 nodes, where a walk of k nodes consists of k nodes x_1, …, x_k such that x_i and x_{i+1} are adjacent for all i ∈ {1, …, k − 1}. Using nonnegative matrix factorization (NMF), we compute an approximate factorization X ≈ WH into nonnegative matrices, where W has r = 25 columns. Because of this factorization, we can approximate any sampled mesoscale patch (i.e., any column of X) of an object by a nonnegative linear combination of the columns of W, which we can interpret as latent shapes for images and as latent motifs (i.e., subgraphs) for networks, respectively. The network dictionaries of latent motifs that we learn from the (b) UCLA and (c) Caltech Facebook networks reveal distinctive social structures. For example, if we uniformly sample a chain of 21 friends in one of these networks, we observe for Caltech that there are likely to be communities with six or more nodes and also some 'hub' users who know most of the others in the sample. However, for UCLA, it is unlikely that there are such communities or hubs. In the heat map of the UCLA network, we show only the first 3000 nodes according to the node labeling in the data set.
In Figure 1, we demonstrate the NDL method for networks of Facebook friendships (which were collected on one day in fall 2005) from UCLA ("UCLA") and Caltech ("Caltech") [55, 56]. Each node in one of these networks is a Facebook account of an individual, and each edge encodes a Facebook friendship between two individuals. Using NDL, we learn 25 latent motifs from each of these two social networks using a chain motif with k = 21 nodes. In each latent motif in Figure 1, the two diagonal lines correspond to the edges of the k-chain motif. We refer to the remaining entries as 'off-chain' entries; one learns these from the subgraphs that are induced by the chain motif. The latent motifs in UCLA's dictionary (see Figure 1b) have sparse off-chain connections with a few clusters, whereas Caltech's dictionary (see Figure 1c) has relatively dense off-chain connections. Such latent motifs reveal distinct social structures in the two networks. For example, if we uniformly sample a chain of 21 friends in one of these networks, we observe for Caltech that there are likely to be communities with six or more nodes and also some 'hub' users who know most of the others in the sample. However, for UCLA, it is unlikely that there are such communities or hubs. See Figure 3.
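To make the sampling step concrete, the following sketch grows a 21-node walk edge by edge and records the adjacency matrix of the induced subgraph. It is a simplification for illustration only: a plain random walk started uniformly does not sample exactly from the motif-sampling distribution that NDL uses (the MCMC algorithms in the SI correct for this), and the graph G here is a stand-in.

import random
import numpy as np
import networkx as nx

def chain_patch(G, k):
    """Grow a k-node walk edge by edge, then return the k x k adjacency
    matrix of the subgraph induced on the walk's nodes (with repeats)."""
    x = [random.choice(list(G.nodes))]
    for _ in range(k - 1):
        nbrs = list(G.neighbors(x[-1]))
        if not nbrs:
            return None  # dead end; the caller should resample
        x.append(random.choice(nbrs))
    A = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            if a != b and G.has_edge(x[a], x[b]):
                A[a, b] = 1.0
    return A

# Example: mesoscale patches of an ER graph at scale k = 21
G = nx.erdos_renyi_graph(500, 0.05, seed=0)
patches = [P for P in (chain_patch(G, 21) for _ in range(1000)) if P is not None]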
Figure 2. Latent motifs that we learn from 14 networks (eight real-world networks and six synthetic networks, which include two distinct instantiations from each of three random-graph models) at five different scales (the smallest of which is k = 6), which reveal distinct mesoscale structures in the networks. Using network dictionary learning (NDL), we learn network dictionaries of r = 25 latent motifs of k nodes for each of the 14 networks. For each network at each scale, we show only the second-most-dominant latent motif from each dictionary. These motifs include more information than the most dominant motifs for these sparse networks. Black squares represent nonzero entries and white squares represent zero entries. See Section D.2 in our SI for details of how we measure latent-motif dominance, and see Figure 6 in our SI for the most dominant latent motifs of each network.

We examine mesoscale structures of eight real-world networks and six synthetic networks using NDL. The real-world networks are Facebook networks from
Caltech, UCLA, Harvard, and MIT [55, 56], SNAP Facebook (which we also denote by SNAP FB as a shorthand) [11, 24], arXiv ASTRO-PH (with a shorthand of arXiv) [11, 23], Coronavirus PPI (with a shorthand of Coronavirus) [10, 44, 52], and Homo sapiens PPI (with a shorthand of H. sapiens) [11, 44]. The first four networks are 2005 Facebook networks from four universities from the Facebook100 data set [56]. The fifth network is a 2012 Facebook network that was collected from survey participants [24]. The sixth network is a collaboration network based on coauthorship of preprints that were posted in the astrophysics category of the arXiv preprint server. The seventh network is a protein–protein interaction (PPI) network of proteins that are related to the coronaviruses that cause Coronavirus disease 2019 (COVID-19), Severe Acute Respiratory Syndrome (SARS), and Middle Eastern Respiratory Syndrome (MERS) [52]. The eighth network is a PPI network of proteins that are related to
Homo sapiens [44].

For the six synthetic networks, we generate two instances each of Erdős–Rényi (ER) G(N, p) networks [8], Watts–Strogatz (WS) networks [57], and Barabási–Albert (BA) networks [2]. These are three types of well-studied random-graph models [40]. Each of the ER networks has 5000 nodes, and we independently connect each pair of nodes with a fixed probability p, which differs between the two instances (which we call ER1 and ER2). For the WS networks, we use a different rewiring probability in each of WS1 and WS2, starting from a 5000-node ring network in which each node is adjacent to a fixed number of its nearest neighbors. For the BA networks, we use m = 25 (in BA1) and m = 50 (in BA2), where m denotes the number of edges of each new node when it connects (via linear preferential attachment) to the existing network, which we grow from an initial network of m individual nodes (i.e., none of them are adjacent to each other) until it has 5000 nodes. See Section F.1 in our SI for more details.

One can interpret the size (i.e., the number of nodes) k of a chain motif as a scale parameter. A k × k mesoscale patch of a network that one obtains by using the k-chain motif encodes connectivity between nodes that are at most k − 1 edges apart in the network. In Figure 2, we show the second-most-dominant latent motif (see Section D.2 and Figure 6 in our SI) that we learn from each of 14 networks (eight real-world networks and six synthetic networks) at five different scales when we use a dictionary with r = 25 latent motifs. The latent motifs differ drastically both across the different networks and across different scales.

Suppose that we are given a network G and a dictionary W of latent motifs at scale k, where we may or may not learn W from G. Consider the following two scenarios. In one scenario, we suppose that we know G exactly, and we ask how to measure the 'effectiveness' of W in approximating mesoscale patches of G at scale k. In the other scenario, we suppose that G is a noisy version of some true network G_true and that W is 'faithful' in the sense that it can well-approximate mesoscale patches of G_true at scale k. (See (6) in the SI for the precise definition.) We then ask how we can infer the true network G_true from G and W.

To examine the above questions, we develop an algorithm that we call network denoising and reconstruction (NDR) (see Algorithm 2) that takes a network G and a network dictionary W as input and outputs a weighted network G_recons that has the same node set as G. The NDR algorithm repeatedly (until convergence) samples mesoscale patches of G at scale k and finds a nonnegative linear approximation of them using the latent motifs in W. Because each edge e of G can appear in multiple mesoscale patches of G, there may be multiple reconstructed weights for e in this procedure. We take their mean as the final weight of e in G_recons (see Algorithm 2). To measure the effectiveness of W for an unweighted network G (i.e., edges are either present or absent and there are no multi-edges), one can threshold the weighted edges of G_recons at some fixed level θ ∈ [0, 1] to obtain an undirected reconstructed network G_recons(θ) with binary edge weights (of either 0 or 1), which one can then compare directly with the original unweighted network G. We regard W as effective at describing G at mesoscale k if G_recons(θ) is close to G for some θ. (We will quantify our notion of 'closeness' in the next paragraph.)
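A minimal sketch of the averaging step of NDR appears below. It assumes a helper sample_walk(G, k) that returns the k nodes of a sampled chain (the function names are illustrative rather than the paper's API, and the use of scipy's nonnegative least squares is an assumption; Algorithm 2 in the SI differs in details such as the MCMC sampling and its treatment of on-chain entries).

import numpy as np
from scipy.optimize import nnls

def ndr_reconstruct(G, W, k, n_samples, sample_walk):
    """Average latent-motif approximations of sampled mesoscale patches
    into per-node-pair confidence weights. W has shape (k*k, r)."""
    sums, counts = {}, {}
    for _ in range(n_samples):
        x = sample_walk(G, k)            # k node labels (a sampled homomorphism)
        A_x = np.array([[1.0 if G.has_edge(x[a], x[b]) else 0.0
                         for b in range(k)] for a in range(k)])
        coeffs, _ = nnls(W, A_x.reshape(-1))   # nonnegative linear fit
        A_hat = (W @ coeffs).reshape(k, k)     # patch approximation
        for a in range(k):
            for b in range(a + 1, k):
                e = tuple(sorted((x[a], x[b])))
                if e[0] == e[1]:
                    continue                   # a repeated node, not an edge
                sums[e] = sums.get(e, 0.0) + A_hat[a, b]
                counts[e] = counts.get(e, 0) + 1
    return {e: sums[e] / counts[e] for e in sums}  # mean reconstructed weights

# Thresholding at theta then gives a binary reconstructed network:
# edges = [e for e, w in weights.items() if w > theta]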
We interpret the edge weights in G_recons as measures of confidence in the corresponding edges of G with respect to W. For example, if an edge e has the smallest weight in G_recons, then we regard it as the most 'outlying' with respect to the latent motifs in W. See Theorems G.7 and G.10 in the SI for theoretical guarantees and error bounds for NDR.

In Figure 3, we show various reconstruction experiments using several real-world networks and synthetic networks. We perform these experiments for various values of the edge threshold θ and for seven different numbers r of latent motifs in a single dictionary (including r = 9 and r = 25). Each network dictionary in Figure 3 uses a chain motif with k = 21 nodes, for which the dimension of the space of all possible mesoscale patches (i.e., the adjacency matrices of the induced subgraphs) is (21 choose 2) − 20 = 190, which is the number of entries in the upper triangle of a 21 × 21 symmetric matrix minus the 20 on-chain entries (which always equal 1).

Figure 3. Self-reconstruction and cross-reconstruction accuracy between several real-world and synthetic networks versus the edge threshold value θ and the number r of latent motifs in a network dictionary. The label X ← Y indicates that we reconstruct network X using a network dictionary that we learn from network Y. The reconstruction process produces a weighted network that we turn into an unweighted network by thresholding the edge weights at a threshold value θ, such that we keep only edges whose weight is strictly larger than θ. We measure reconstruction accuracy by calculating the Jaccard index of the original network's edge set and the reconstructed network's edge set. In panel (a), we plot accuracies versus θ (keeping the number of latent motifs fixed at r = 25), where X is one of five real-world networks (two PPI networks, two Facebook networks, and one collaboration network). In panels (b) and (c), we reconstruct each of the four Facebook networks using network dictionaries with seven different numbers r of latent motifs that we learn from one of ten networks (with the edge threshold value fixed).

An important observation is that one can reconstruct a given network using an arbitrary network dictionary, which one can even learn from a different network. Such a 'cross-reconstruction' allows us to quantitatively compare the learned mesoscale structures of different networks. We label each subplot of Figure 3 with Y ← X to indicate that we are reconstructing network Y using a network dictionary that we learn from network X. We turn the weighted reconstructed networks into unweighted networks by thresholding their edges using some threshold θ ∈ [0, 1]. We measure the reconstruction accuracy by calculating the Jaccard index between the original network's edge set and the reconstructed network's edge set. That is, to measure the similarity of two edge sets, we calculate the number of edges in the intersection of these sets divided by the number of edges in the union of these sets. (We obtain the same qualitative results as in Figure 3 if we instead measure similarity using the Rand index [48].)

In Figure 3a, we plot the accuracy for 'self-reconstruction' X ← X versus θ (with r = 25), where X is one of the real-world networks Coronavirus, H. sapiens, SNAP FB, Caltech, and arXiv. For each of these networks, the self-reconstruction accuracy peaks at a high value for a suitable choice of θ. We fix a single such threshold θ for the cross-reconstruction experiments for the Facebook networks
Caltech, Harvard, MIT, and
UCLA in Figures 3b,c. We observe that these four Facebook networks have high self-reconstruction accuracies using r = 25 motifs at this threshold. The total number of dimensions when using mesoscale patches at scale k = 21 is 190, so this result suggests that all of these real-world networks have low-rank mesoscale structures at scale k = 21.

We consider accuracies for cross-reconstruction Y ← X in Figures 3b,c, where Y is one of the Facebook networks Caltech, Harvard, MIT, and
UCLA and X (with X ≠ Y) is one of these four networks or one of the six synthetic networks ER_i, WS_i, or BA_i (with i ∈ {1, 2}). From the cross-reconstruction accuracies and from interpreting the latent motifs (see Section B.4 in the SI) in Figures 1 and 2 (see also Figures 8, 10, and 14 in the SI), we draw the following conclusions at scale k = 21. First, the mesoscale structure of Caltech is distinct from those of Harvard, UCLA, and MIT. This is consistent with prior studies of these networks (see, e.g., [15, 56]). Second, Caltech's mesoscale structure at scale k = 21 has a higher dimension than those of the other three universities' Facebook networks. Third, Caltech has many more communities of size at least six than the other three universities' Facebook networks. Fourth, both BA networks capture the mesoscale structures of MIT, Harvard, and UCLA at scale k = 21 better than the synthetic networks that we generate from the ER and WS models. For instance, the self-reconstruction accuracies in Figures 3b,c using r = 9 latent motifs are noticeably lower for Caltech than for the other three universities' Facebook networks. See Section F.5 in the SI for further discussion.

In Figure 4, we use our algorithms to perform two types of network denoising. We can think of them as distinct binary classification problems in network analysis: network denoising with subtractive noise (which is often called edge 'prediction' [12, 18, 26, 28, 37]) and network denoising with additive noise [4]. In each scenario, we suppose that we are given an observed network G = (V, E) with node set V and edge set E and are asked to find an unknown network G′ = (V, E′) with the same node set V but a possibly different edge set E′. We interpret G as a corrupted version of a 'true network' G′ that we observe with some uncertainty. One can interpret incorrectly observed edges and non-edges in G as 'false relations' and 'false non-relations', respectively. In the subtractive-noise setting, we assume that G is a partially observed version of G′ (i.e., E ⊊ E′), and we seek to classify all non-edges in G into 'positives' (i.e., edges in G′) and 'negatives' (i.e., non-edges in G′). In the additive-noise setting, we suppose that G is a corrupted version of G′ to which some unknown edges have been added (i.e., E ⊇ E′), and we seek to classify all edges in G into 'positives' (i.e., edges in G′) and 'negatives' (i.e., non-edges in G′).

To experiment with these problems, we use the following five real-world networks: Caltech, SNAP FB, arXiv, Coronavirus, and H. sapiens. Given a network G = (V, E), our experiments proceed as follows. In the subtractive-noise setting, we create two smaller networks by removing uniformly random subsets that consist of 10% (in one experiment) or 50% (in the other) of the edges from our network. In the additive-noise case, we create two corrupted networks by adding edges between node pairs that we choose independently with a fixed probability so that 10% or 50% of the edges in a corrupted network are new. We then apply NDL with r = 25 latent motifs at scale k = 21 to learn a network dictionary for each corrupted network, and we use each dictionary to reconstruct the network from which it was learned using NDR. The reconstruction algorithm outputs a weighted network G_recons, where the weight of each edge is our confidence that the edge belongs to the true network. For denoising subtractive (respectively, additive) noise, we classify each non-edge (respectively, edge) in a corrupted network as 'positive' if its weight in G_recons is strictly larger than some threshold θ and as 'negative' otherwise. By varying θ, we construct a receiver-operating characteristic (ROC) curve that consists of points whose horizontal and vertical coordinates are the false-positive rates and true-positive rates, respectively.

In Figure 4, we show the ROC curves and the corresponding area-under-the-curve (AUC) scores for our network-denoising experiments with subtractive and additive noise for these networks. For example, when we add 50% false edges (with one extra edge as a tie-breaker, given the odd number of edges in the original network) to Coronavirus, such that 2,463 edges are true and 1,232 edges are false, we are able to detect the vast majority of the false edges while misclassifying only a small fraction of the true edges.
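Given the per-edge confidence weights from NDR, the ROC curve and AUC score follow from a standard threshold sweep. The toy labels and scores below are placeholders for the true edge indicators and the reconstructed weights (in the additive-noise experiment, the candidates are the edges of the corrupted network; in the subtractive-noise experiment, they are its non-edges):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                          # 1 = true edge, 0 = false edge
scores = np.clip(y_true * 0.6 + rng.random(1000) * 0.5, 0, 1)   # toy confidence values

fpr, tpr, thresholds = roc_curve(y_true, scores)  # sweep over the threshold theta
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.3f}")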
[Figure 4, panels a–f: ROC curves (true-positive rate versus false-positive rate) for denoising Caltech, Coronavirus PPI, arXiv, SNAP Facebook, and Homo sapiens PPI, with legend entries that report AUC scores for the +10%, +50%, −10%, and −50% noise levels (and a −20% level for Coronavirus PPI). Additional Coronavirus PPI panels compare mask choices (identity versus no-folding) and MCMC samplers (Glauber versus ApproxPivot). A table reports AUC scores for denoising −50% noise that compare Spectral Clustering, DeepWalk, and NDL+NDR (our method) on SNAP FB, H. sapiens, and arXiv.]

Figure 4.
Application of the NDL and NDR algorithms to network denoising with additive and subtractive noise on a variety of networks from empirical data sets. In panels (a)–(e), we plot our results. In panel (f), we compare our denoising results for −50% noise on SNAP FB, H. sapiens, and arXiv versus those of other methods. In our experiments with subtractive noise, we corrupt a network by removing a uniformly random subset of 10% or 50% of its edges, and we seek to classify the removed edges and the non-edges as true edges and false edges, respectively. In our experiments with additive noise, we corrupt a network by uniformly randomly adding 10% or 50% of the number of its edges, and we seek to classify the edges and non-edges in the resulting corrupted network as 'negative' (i.e., false edges) or 'positive' (i.e., true edges). To perform classification in a network, we first use NDL to learn latent motifs from a corrupted network, and we then reconstruct the network using NDR to assign a confidence value to each potential edge. We then use these confidence values to infer membership of potential edges in the uncorrupted network. Importantly, we never use information from the original networks. For the Caltech Facebook network in panel (b), we also perform edge inference and denoising using a network dictionary that we learn from the MIT Facebook network. For each network, we indicate the receiver-operating characteristic (ROC) curves and the corresponding area-under-the-curve (AUC) scores for network denoising with additive noise using the labels +10% and +50%, and we indicate the ROC curves and corresponding AUC scores for network denoising with subtractive noise using the labels −10% and −50%.

In Figure 4f, we compare the performance of our method to those of some popular supervised algorithms that are based on network embeddings. Specifically, we compare to node2vec [11], DeepWalk [45], and
LINE [50] for the task of denoising subtractive noise for SNAP FB, H. sapiens, and arXiv. Our method achieves state-of-the-art results in all cases.

We make two important remarks about applying the NDL and NDR algorithms to network denoising. First, NDL and NDR are able to perform the desired classification tasks in an unsupervised manner, in the sense that we do not require fully known examples to train our algorithms. This is particularly useful when it is difficult to obtain a large number of fully known examples, such as when measuring PPI networks for a new organism (e.g., SARS-CoV-2). Part of the reason that NDL and NDR are successful at these network-denoising tasks is the low-rank nature of the examined mesoscale structures of the social and PPI networks (see Figure 3a). Specifically, because NDL learns a small number of latent motifs that are able to successfully give an approximate basis for all mesoscale patches, the latent motifs should not be affected significantly by noise. Second, unlike the existing algorithms that we just mentioned, we are able to perform denoising on a network using information that we learn from an entirely different network. Consequently, we are able to successfully perform not only self-reconstruction but also cross-reconstruction. For instance, in Figure 4, we show the results for denoising Caltech using a dictionary that we learn from the MIT network, which we expect to have a similar structure to
Caltech based on the results of the experiments that we highlighted in Figure 3 and also on prior research on these networks [14, 43, 56].

Our experiments in Figures 3 and 4 illustrate that various social, collaboration, and PPI networks have low-rank [36] mesoscale structures, in the sense that a few (e.g., r = 25, but see Figure 3 for other choices of r) latent motifs that we learn using NDL are able to reconstruct, infer, and denoise the edges of the entire networks when we employ the NDR algorithm. We hypothesize that such low-rank mesoscale structures are a general phenomenon for networks of interactions in various complex systems beyond the social, collaboration, and PPI networks that we have examined. As we have illustrated in this paper, one can leverage mesoscale structures to perform important tasks like network denoising, so it is important in future studies to explore the level of generality of our insights.

References

[1] Nesreen Ahmed, Ryan Anthony Rossi, John Lee, Theodore Willke, Rong Zhou, Xiangnan Kong, and Hoda Eldardiry. Role-based graph embeddings. IEEE Transactions on Knowledge and Data Engineering, 2020. Available at doi:10.1109/TKDE.2020.3006475.
[2] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[3] Marco Bressan. Faster algorithms for sampling connected induced subgraphs. arXiv preprint arXiv:2007.12102, 2020.
[4] Fernanda B. Correia, Edgar D. Coelho, José L. Oliveira, and Joel P. Arrais. Handling noise in protein interaction networks. BioMed Research International, 2019:1–13, 2019.
[5] Rick Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK, fourth edition, 2010.
[6] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
[7] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
[8] Paul Erdős and Alfréd Rényi. On random graphs. I. Publicationes Mathematicae, 6:290–297, 1959.
[9] Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics Reports, 659:1–44, 2016.
[10] David E. Gordon, Gwendolyn M. Jang, Mehdi Bouhaddou, Jiewei Xu, Kirsten Obernier, Kris M. White, Matthew J. O'Meara, Veronica V. Rezelj, Jeffrey Z. Guo, and Danielle L. Swaney. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature, 583:1–13, 2020.
[11] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.
[12] Roger Guimerà. One model to rule them all in network science? Proceedings of the National Academy of Sciences of the United States of America, 117(41):25195–25197, 2020.
[13] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, UK, second edition, 2012.
[14] Abigail Z. Jacobs, Samuel F. Way, Johan Ugander, and Aaron Clauset. Assembling the Facebook: Using heterogeneity to understand online social network assembly. In Proceedings of the ACM Web Science Conference, WebSci '15, New York City, NY, USA, 2015. Association for Computing Machinery.
[15] Lucas G. S. Jeub, Prakash Balachandran, Mason A. Porter, Peter J. Mucha, and Michael W. Mahoney. Think locally, act locally: Detection of small, medium-sized, and large communities in large networks. Physical Review E, 91:012821, 2015.
[16] Nadav Kashtan, Shalev Itzkovitz, Ron Milo, and Uri Alon. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics, 20(11):1746–1758, 2004.
[17] Ankit N. Khambhati, Ann E. Sizemore, Richard F. Betzel, and Danielle S. Bassett. Modeling and interpreting mesoscale network dynamics. NeuroImage, 180:337–349, 2018.
[18] István A. Kovács, Katja Luck, Kerstin Spirohn, Yang Wang, Carl Pollis, Sadie Schlabach, Wenting Bian, Dae-Kyum Kim, Nishka Kishore, and Tong Hao. Network-based prediction of protein interactions. Nature Communications, 10(1):1240, 2019.
[19] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.
[20] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
[21] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801–808, 2007.
[22] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 631–636, 2006.
[23] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. Available at http://snap.stanford.edu/data, 2020.
[24] Jure Leskovec and Julian J. Mcauley. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, pages 539–547, 2012.
[25] David A. Levin and Yuval Peres. Markov Chains and Mixing Times. American Mathematical Society, Providence, RI, USA, 2017.
[26] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[27] László Lovász. Large Networks and Graph Limits, volume 60 of Colloquium Publications. American Mathematical Society, Providence, RI, USA, 2012.
[28] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A, 390(6):1150–1170, 2011.
[29] Hanbaek Lyu, Facundo Memoli, and David Sivakoff. Sampling random graph homomorphisms and applications to network data analysis. arXiv:1910.09483, 2019.
[30] Hanbaek Lyu, Deanna Needell, and Laura Balzano. Online matrix factorization for Markovian data and applications to network dictionary learning. Journal of Machine Learning Research, 21:1–49, 2020.
[31] Julien Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, pages 2283–2291, 2013.
[32] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
[33] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Non-local sparse models for image restoration. In IEEE 12th International Conference on Computer Vision, pages 2272–2279, 2009.
[34] Julien Mairal, Michael Elad, and Guillermo Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53–69, 2007.
[35] Julien Mairal, Michael Elad, and Guillermo Sapiro. Sparse learned representations for image restoration. Proceedings of the 4th World Conference of the International Association for Statistical Computing, page 118, 2008.
[36] Ivan Markovsky and Konstantin Usevich. Low Rank Approximation. Springer-Verlag, Heidelberg, Germany, 2012.
[37] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, pages 437–452, Heidelberg, Germany, 2011. Springer-Verlag.
[38] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, Heidelberg, Germany, 2012.
[39] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[40] Mark E. J. Newman. Networks. Oxford University Press, Oxford, UK, second edition, 2018.
[41] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In
Advances in Neural Information Processing Systems, pages 849–856, 2002.
[42] William S. Noble. What is a support vector machine? Nature Biotechnology, 24(12):1565–1567, 2006.
[43] Jukka-Pekka Onnela, Daniel J. Fenn, Stephen Reid, Mason A. Porter, Peter J. Mucha, Mark D. Fricker, and Nick S. Jones. Taxonomies of networks from community structure. Physical Review E, 86(3):036104, 2012.
[44] Rose Oughtred, Chris Stark, Bobby-Joe Breitkreutz, Jennifer Rust, Lorrie Boucher, Christie Chang, Nadine Kolas, Lara O'Donnell, Genie Leung, and Rochelle McAdam. The BioGRID interaction database: 2019 update. Nucleic Acids Research, 47(D1):D529–D541, 2019.
[45] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.
[46] Gabriel Peyré. Sparse modeling of textures. Journal of Mathematical Imaging and Vision, 34(1):17–31, 2009.
[47] Mason A. Porter, Jukka-Pekka Onnela, and Peter J. Mucha. Communities in networks. Notices of the American Mathematical Society, 56(9):1082–1097, 1164–1166, 2009.
[48] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[49] Puck Rombach, Mason A. Porter, James H. Fowler, and Peter J. Mucha. Core–periphery structure in networks (revisited). SIAM Review, 59(3):619–646, 2017.
[50] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077, 2015.
[51] Lei Tang and Huan Liu. Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3):447–478, 2011.
[52] theBiogrid.org. Coronavirus PPI network. 2020. Retrieved from https://wiki.thebiogrid.org/doku.php/covid (downloaded 24 July 2020, Ver. 3.5.187.tab3).
[53] theBiogrid.org. Homo sapiens PPI network. 2020. Retrieved from https://wiki.thebiogrid.org/doku.php/covid (downloaded 24 July 2020, Ver. 3.5.180.tab2).
[54] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[55] Amanda L. Traud, Eric D. Kelsic, Peter J. Mucha, and Mason A. Porter. Comparing community structure to characteristics in online collegiate social networks. SIAM Review, 53:526–543, 2011.
[56] Amanda L. Traud, Peter J. Mucha, and Mason A. Porter. Social structure of Facebook networks. Physica A, 391(16):4165–4180, 2012.
[57] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
[58] Sebastian Wernicke. Efficient detection of network motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(4):347–359, 2006.
Supplementary Information: LEARNING LOW-RANK LATENT MESOSCALE STRUCTURES IN NETWORKS
Appendix A. Overview
In this supplementary material, we present our algorithms for network dictionary learning (NDL) and network denoising and reconstruction (NDR), and we prove theoretical results about their convergence and error bounds. We give the full NDL algorithm (see Algorithm 1) in Section D, and we give the full NDR algorithm (see Algorithm 2) in Section E. We introduce the notion of 'latent motif dominance' in Section D.2 to measure the significance of each latent motif that we learn from a network. In Section G, we give a rigorous analysis of the NDL and NDR algorithms.
Appendix B. Problem Formulation for Network Dictionary Learning (NDL)
B.1. Definitions and notation.
To facilitate our discussions, we use terminology and notation from [27, Ch. 3]. In the main text, we described a network as a graph G = (V, E) with node set V and edge set E, without directed edges or multi-edges but possibly with self-edges. One can characterize the edge set E of G using an adjacency matrix A_G : V × V → {0, 1}, where A(x, y) = 1({x, y} ∈ E) for each x, y ∈ V. The function 1(S) denotes the indicator of the event S; it takes the value 1 if S occurs and 0 if it does not occur. In this supplementary information, for broader applicability, we formulate our NDL framework in the more general setting in which edges in a network may have weights. Although one can extend the above definition of networks to include weighted edges by adjoining an additional entry to G = (V, E) for edge weights, it is convenient to simply extend the range of adjacency matrices from {0, 1} to the interval [0, ∞).

With the above considerations in mind, we define a network as a pair G = (V, A_G) with node set V and a weight matrix (which is also often called a 'weighted adjacency matrix') A_G : V × V → [0, ∞) that encodes the interaction strengths between the nodes. For simplicity, we will often drop the subscript G in A_G and denote it by A. A given graph G = (V, E) determines a unique network G = (V, A_G), where A_G is the adjacency matrix of G. The set V(G) is the node set of the network G, which has size |V(G)|, where |S| is the number of elements in the set S. A pair (x, y) of nodes in G is called a directed edge if A(x, y) > 0. We say that a network G = (V, A) is symmetric if its weight matrix is symmetric (i.e., A(x, y) = A(y, x) for all x, y ∈ V), and we say that it is binary if A(x, y) ∈ {0, 1} for all x, y ∈ V. Given a symmetric network G = (V, A), we call an unordered pair {x, y} of nodes in G an edge if A(x, y) = A(y, x) > 0. We say that a network G = (V, A) is connected if, for any two distinct nodes x, y ∈ V, there exists a sequence of nodes x_1, x_2, …, x_m for some m ≥ 2 such that A(x_i, x_{i+1}) > 0 for all i ∈ {1, …, m − 1} and {x_1, x_m} = {x, y}. We call G bipartite if it admits a 'bipartition', which is a partition V = V_1 ∪ V_2 of the node set V such that A(x, y) = 0 if x, y ∈ V_1 or x, y ∈ V_2. If two networks G = (V, A) and G′ = (V′, A′) satisfy V′ ⊆ V and A′(x, y) ≤ A(x, y) for all x, y ∈ V′, then we say that G′ is a subnetwork of G, and we write G′ ⊆ G.

Suppose that we are given m elements v_1, …, v_m in some vector space. By their mean, we refer to their sample mean v̄ = m^{−1} ∑_{i=1}^{m} v_i. By their weighted average, we refer to the expectation ∑_{i=1}^{m} v_i p_i, where (p_1, …, p_m) is a probability distribution on the set of m elements.
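In code, these definitions amount to storing a weight matrix. The following small example (with illustrative data) builds a symmetric binary network from an edge list and checks the properties defined above:

import numpy as np

# A network G = (V, A) as in B.1: nodes indexed 0, ..., n-1 and a weight matrix A.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0   # a symmetric, binary network from an undirected graph

is_symmetric = np.array_equal(A, A.T)
is_binary = np.isin(A, (0.0, 1.0)).all()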
B.2. Homomorphisms between networks and motif sampling.

Being able to sample from a complex data set according to a known probability distribution (e.g., a uniform one) is a crucial ingredient in dictionary-learning problems. This is often the case for image-processing applications [7, 33, 35], as it is straightforward to sample a k × k patch uniformly at random from an image. However, the similar problem of uniformly randomly sampling a k-node connected subnetwork from a network is not straightforward [3, 16, 22, 58]. For our purpose of developing dictionary learning for networks, we use motif sampling, which was introduced recently in [29]. In motif sampling, instead of directly sampling a connected subnetwork, one samples a random node map from a smaller network (i.e., a motif) into a target network that preserves adjacency relations, and one then uses the subnetwork that is induced on the nodes in the image of the node map. As we discuss below, such a node map between networks is a homomorphism.

Fix an integer k ≥ 1 and a weight matrix A_F : [k] × [k] → [0, ∞), where we use the shorthand notation [k] = {1, …, k}. We use the term motif for the corresponding network F = ([k], A_F). A motif is a network, and we use such motifs to sample from a given network. The type of motif in which we are particularly interested is a k-chain, for which A_F = 1({(1, 2), (2, 3), …, (k − 1, k)}). A k-chain is a directed path with node set [k]. For a general k-node motif F = ([k], A_F) and a network G = (V, A), we define the probability distribution π_{F→G} on the set V^{[k]} of all node maps x : [k] → V by

    π_{F→G}(x) = (1/Z) ∏_{i,j ∈ {1,…,k}} A(x(i), x(j))^{A_F(i,j)},   (1)

where Z is a normalizing constant that is called the homomorphism density of F in G [27]. A node map x : [k] → V is a homomorphism F → G if π_{F→G}(x) > 0, which is the case if and only if A(x(a), x(b)) > 0 for all a, b ∈ [k] with A_F(a, b) > 0 (with the convention that α^0 = 1 for all α ∈ ℝ). When both A and A_F are binary matrices, the probability distribution π_{F→G} is the uniform distribution on the set of all homomorphisms F → G. This is the case for all examples that we discuss in the main manuscript. Motif sampling refers to the problem of sampling a random homomorphism x : F → G according to the distribution in (1). In Section C, we discuss three Markov chain Monte Carlo (MCMC) algorithms for motif sampling.
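For intuition, the weight in (1) of a node map is simple to compute directly (up to the normalization Z); the following sketch also illustrates that homomorphisms from a chain motif may revisit nodes:

import numpy as np

def hom_weight(A, A_F, x):
    """Unnormalized weight of a node map x : [k] -> V under (1):
    prod_{i,j} A(x(i), x(j))^{A_F(i,j)}. A is the |V| x |V| weight matrix,
    A_F is the k x k motif matrix, and x is a list of k node indices.
    The weight is nonzero if and only if x is a homomorphism."""
    k = len(x)
    w = 1.0
    for i in range(k):
        for j in range(k):
            if A_F[i, j] > 0:
                w *= A[x[i], x[j]] ** A_F[i, j]  # 0 if a required edge is absent
    return w

# 3-chain motif: directed edges (1, 2) and (2, 3) in the 1-based labels of the text
A_F = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph on 3 nodes
print(hom_weight(A, A_F, [0, 1, 2]) > 0)  # True: a valid homomorphism
print(hom_weight(A, A_F, [0, 1, 0]) > 0)  # also True: homomorphisms may backtrack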
B.3. Mesoscale patches of networks.

A homomorphism F → G is a node map V(F) → V(G) that maps the edges of F to some edges of G, so it maps F onto a subgraph of G. It thereby maps F 'into' G. For each homomorphism x : F → G from a motif F = ([k], A_F) into a network G = (V, A), we define a k × k matrix

    A_x(a, b) := A(x(a), x(b)) Φ_{F,x}(a, b)  for all a, b ∈ {1, …, k},   (2)

where Φ_{F,x} : [k] × [k] → {0, 1} is a k × k binary matrix that we call a mask. Two particular choices of masks are

    identity: Φ_{F,x}(a, b) = 1  for all a, b ∈ {1, …, k},   (3)

    no-folding: Φ_{F,x}(a, b) = 1( A_F(a, b) > 0, or there do not exist a′, b′ ∈ {1, …, k} such that [A_F(a′, b′) > 0 and (x(a), x(b)) = (x(a′), x(b′))] )  for all a, b ∈ {1, …, k}.   (4)

We call A_x in (2) the mesoscale patch of G that is induced by the homomorphism x : F → G; it is specified uniquely by the homomorphism x : F → G, the weight matrix A, and the mask Φ. For a k × k matrix B and a homomorphism x : F → G, we say that the (a, b) entries of B are on-chain if A_F(a, b) > 0 and off-chain otherwise. The condition A_F(a, b) > 0 implies that A(x(a), x(b)) > 0 by the definition of the homomorphism x, so the on-chain entries of A_x are always positive (and are always 1 if G is binary). However, the off-chain entries of A_x are not necessarily positive, so they encode meaningful information about a network that is 'detected' by the homomorphism x : F → G. For example, see the (1, 3) entries in the 3 × 3 mesoscale patches in Figure 5b.

Suppose that the homomorphism x uses k distinct nodes of G in its image Im(x) := {x(a) | a ∈ {1, …, k}}. It then follows that the network G_x := (Im(x), A_x) that is induced by x is a k-node subnetwork of G. In this case, the no-folding mask (4) is the same as the identity mask (3). However, when Im(x) has fewer than k nodes, G_x is not necessarily a subnetwork of G, and distinct positive off-chain entries of A_x with the identity mask may represent the same edge in G. We introduced the no-folding mask in (4) so that the positive off-chain entries of A_x always carry information that is not included in its on-chain entries.
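The masks (3) and (4) translate directly into code. The following unoptimized sketch (which assumes integer node indices; it is an illustration, not the paper's implementation) computes the mesoscale patch in (2) with either mask:

import numpy as np

def mesoscale_patch(A, A_F, x, no_folding=False):
    """Mesoscale patch A_x from (2), with the identity mask (3) or the
    no-folding mask (4). A: |V| x |V| weights; A_F: k x k motif; x: node list."""
    k = len(x)
    A_x = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            mask = 1.0
            if no_folding and A_F[a, b] == 0:
                # Zero out (a, b) if some on-chain pair (a', b') maps to the
                # same ordered node pair, i.e., the chain 'folds' onto this edge.
                for ap in range(k):
                    for bp in range(k):
                        if A_F[ap, bp] > 0 and (x[a], x[b]) == (x[ap], x[bp]):
                            mask = 0.0
            A_x[a, b] = A[x[a], x[b]] * mask
    return A_x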
As an illustration, consider the case in which F is the 6-chain motif and G = (V, A) is an undirected and binary graph. Suppose that we have the following two 6 × 6 binary matrices:

    P =
    ∗ 1 ∗ ∗ ∗ ∗
    1 ∗ 1 ∗ ∗ ∗
    ∗ 1 ∗ 1 ∗ ∗
    ∗ ∗ 1 ∗ 1 ∗
    ∗ ∗ ∗ 1 ∗ 1
    ∗ ∗ ∗ ∗ 1 ∗

    Q =
    0 1 0 1 0 1
    1 0 1 0 1 0
    0 1 0 1 0 1
    1 0 1 0 1 0
    0 1 0 1 0 1
    1 0 1 0 1 0

where each entry ∗ of P is either 0 or 1. Suppose that Φ_{F,x} is the identity mask (3). It then follows that any homomorphism x : F → G from the 6-chain motif F always induces a mesoscale patch A_x of the form P, where the entries in the two diagonals adjacent to the main diagonal correspond to the on-chain entries. Suppose that x uses only two distinct nodes, p and q, in G. Specifically, let p = x(1) = x(3) = x(5) and q = x(2) = x(4) = x(6). In this case, the mesoscale patch A_x equals Q, which has four off-chain entries of 1 in its upper triangle. However, all of the 1 entries in A_x in this case correspond to the single edge between p and q in G, so the off-chain entries of A_x do not give any new information about the network G that is not already given by its on-chain entries. The indicator function in the definition of the no-folding mask (4) prevents this situation. Indeed, in this case, the mesoscale patch A_x with the no-folding mask is the matrix P with all ∗ entries equal to 0.

B.4. Problem formulation for network dictionary learning (NDL).
The goal of the
NDL problem is to learn, for a fixed integer r ≥ 1, a set of r nonnegative k × k matrices L_1, …, L_r, with Frobenius norms of at most 1, such that

    A_x ≈ a_1(x) L_1 + ⋯ + a_r(x) L_r   (5)

for each homomorphism x : F → G for some coefficients a_1(x), …, a_r(x) ≥ 0. For each homomorphism x : F → G, this implies that one can approximate the mesoscale patch A_x of G that is induced by x as a suitable linear combination of the r matrices L_1, …, L_r. We call the tuple [L_1, …, L_r] a network dictionary for G, and we call each L_i a latent motif of G. We identify a network dictionary [L_1, …, L_r] with the nonnegative matrix W ∈ ℝ^{k² × r}_{≥0} whose jth column is the vectorization of the jth latent motif L_j for j ∈ {1, …, r}. The choice of vectorization ℝ^{k × k} → ℝ^{k²} can be arbitrary, but we use a column-wise vectorization in Algorithm A4.

For the latent motifs to be interpretable, it is crucial that we require the entries of the latent motifs L_i and the coefficients a_i(x) to be nonnegative. The nonnegativity constraint on each L_i allows one to interpret each L_i as the weight matrix of a k-node network. Additionally, because the coefficients a_i(x) are also nonnegative, the approximate decomposition in (5) implies that a_i(x) L_i ≲ A_x entrywise. Therefore, if a_i(x) > 0, any network structure (e.g., nodes with large degree, communities, and so on) in the latent motif L_i must exist in A_x. Consequently, one can consider the latent motifs as approximate k-node subnetworks of G that exhibit typical network structure of G at scale k. In the spirit of Lee and Seung [19], one can regard the latent motifs as 'parts' of a network G.

For a more precise formulation of (5), consider the following optimization problem:

    arg min_{L_1, …, L_r ∈ ℝ^{k × k}_{≥0}; ‖L_1‖_F, …, ‖L_r‖_F ≤ 1}  E_{x ∼ π_{F→G}} [ inf_{a_1(x), …, a_r(x) ≥ 0} ‖ A_x − ∑_{i=1}^{r} a_i(x) L_i ‖_F² ],   (6)

where π_{F→G} is the probability distribution that we defined in (1) and ‖·‖_F denotes the matrix Frobenius norm. The choice of the probability distribution π_{F→G} for the homomorphisms x : F → G is natural because it becomes the uniform distribution on the set of all homomorphisms F → G when the adjacency matrices of G and F are both binary. The NDL problem is computationally difficult because the objective function in (6) is non-convex and it is not obvious how to sample a homomorphism F → G according to the distribution π_{F→G} that we defined in (1). In Section D, we state an algorithm for NDL that approximately solves (6). Lee and Seung [19] discussed a similar nonnegative decomposition in which the A_x are images of faces; the learned factors L_i then capture parts of human faces (such as eyes, noses, and mouths).
B.5. Overview of our algorithms and their theoretical guarantees.
We overview our algorithms and their theoretical guarantees.
Algorithm 1: Given a network G, the NDL algorithm (see Algorithm 1) computes a sequence (W_t)_{t≥1} of network dictionaries of latent motifs, which take the form of k² × r matrices.

Algorithm 2: Given a network G, a network dictionary W, and a threshold parameter θ > 0, the NDR algorithm (see Algorithm 2) computes a sequence of weighted networks G_recons and binary (i.e., unweighted) networks G_recons(θ).

Theorem G.2: Given a non-bipartite network G and a choice of the parameters in Algorithm 1, we prove that the sequence (W_t)_{t≥1} of network dictionaries converges almost surely to the set of stationary points of the associated objective function in (6).

Theorem G.5: Given a bipartite network G and a choice of the parameters in Algorithm 1, we prove a convergence result that is analogous to the one in Theorem G.2.

Theorem G.7: Given a non-bipartite target network G and a network dictionary W, we show that (i) the sequence of weighted reconstructed networks G_recons that we compute using the NDR algorithm (see Algorithm 2) converges almost surely to some limiting network, and (ii) we obtain a closed-form expression for this limiting network. We also show that (iii) a suitable distance between an arbitrary network G′ and the limiting reconstructed network is bounded by the mean L₂ distance between the k × k mesoscale patches of G′ and their nonnegative linear approximations from the latent motifs in W. Finally, (iv) if G′ = G in (iii), we show that the upper bound on the distance between G and G_recons is approximately optimized if one learns the network dictionary W using the NDL algorithm (Algorithm 1).

Theorem G.10: We show a convergence result that is analogous to the one in Theorem G.7 for a bipartite target network G.

Appendix C. Markov Chain Monte Carlo (MCMC) Motif-Sampling Algorithms
We mentioned in Section B.4 that one of the main difficulties in solving the optimization problem (6) is to directly sample a homomorphism x : F → G from the distribution π_{F→G} (see (1)). To overcome this difficulty, we use the Markov chain Monte Carlo (MCMC) algorithms that were introduced in [29]. Although the algorithms in [29] apply to networks with edge weights and/or node weights, we only use the simplified forms of them that we give in Algorithms MP and MG. Additionally, Algorithm MP with the option AcceptProb = Approximate is a novel algorithm of the present paper. Using these MCMC sampling algorithms, we generate a sequence (x_t)_{t≥0} of homomorphisms F → G such that the distribution of x_t converges to π_{F→G} under some mild conditions on G and F [30, Thm. 5.7].

In the pivot chain (see Algorithm MP with AcceptProb = Exact), for each update x_t ↦ x_{t+1}, the pivot x_t(1) first performs a random-walk move on G (see (7)) to move to a new node x_{t+1}(1) ∈ V. It accepts this move with a suitable acceptance probability (see (8)) according to the Metropolis–Hastings algorithm (see, e.g., [25, Sec. 3.2]). After the move x_t(1) ↦ x_{t+1}(1), we sample each x_{t+1}(i) ∈ V for i ∈ {2, 3, …, k} successively from the appropriate conditional distribution (see (9)). This ensures that the desired distribution π_{F→G} in (1) is a stationary distribution of the resulting Markov chain. In the Glauber chain, we pick one node i ∈ [k] of F uniformly at random, and we resample its location x_t(i) ∈ V(G) at time t to ℓ = x_{t+1}(i) ∈ V from the correct conditional distribution in (10) (see Figure 5a). See [25, Sec. 3.3] for background on the Metropolis–Hastings algorithm and Glauber-chain MCMC sampling.

Let Δ denote the maximum degree (i.e., the maximum number of neighbors) of the nodes in the network G = (V, A). We also say that the network G itself has a maximum degree of Δ. The Glauber chain has an efficient local update (with a computational complexity of O(Δ)), but it converges quickly to the stationary distribution π_{F→G} only for networks that are dense enough that two homomorphisms that differ at one node have a probability of at least 1/(2Δ) to coincide after a single Glauber-chain update. (See [29, Thm. 6.1] for a precise statement.)

By contrast, the pivot chain (see Algorithm MP with AcceptProb = Exact) has more computationally expensive local updates, with a computational complexity of O(Δ^{k−1}) (as discussed in [29, Remark 5.6]), but it converges as fast as a 'lazy' random walk on a network. (In such a random walk, each move has a chance to be rejected according to a Metropolis–Hastings algorithm; see [29, Thm. 6.2].)
Algorithm MP. Pivot-Chain Update

Input: Symmetric network G = (V, A), motif F = ([k], A_F), and homomorphism x : F → G
Parameters: AcceptProb ∈ {Exact, Approximate}
Do: Set x′ ← x.
  If ∑_{c ∈ V} A(x(1), c) = 0: Terminate.
  Else: Sample ℓ ∈ V at random from the distribution

      p(w) = A(x(1), w) / ∑_{c ∈ V} A(x(1), c),  w ∈ V.   (7)

  Compute the acceptance probability α ∈ [0, 1] by

      α ← min( [ ∑_{c ∈ V} A^{k−1}(ℓ, c) / ∑_{c ∈ V} A^{k−1}(x(1), c) ] · [ ∑_{c ∈ V} A(c, x(1)) / ∑_{c ∈ V} A(ℓ, c) ], 1 )  if AcceptProb = Exact,
      α ← min( ∑_{c ∈ V} A(c, x(1)) / ∑_{c ∈ V} A(ℓ, c), 1 )  if AcceptProb = Approximate.   (8)

  Sample U ∈ [0, 1] uniformly at random, independently of everything else; set ℓ ← x(1) if U > α, and then set x′(1) ← ℓ.
  For i = 2, 3, …, k: Sample x′(i) ∈ V from the distribution

      p_i(w) = A(x′(i − 1), w) / ∑_{c ∈ V} A(x′(i − 1), c),  w ∈ V.   (9)

Output: Homomorphism x′ : F → G

Algorithm MG. Glauber-Chain Update

Input:
Network G = ( V, A ) , k -chain motif F = ([ k ] , A F ) , and homomorphism x : F → G Do:
Sample v ∈ [ k ] uniformly at random Sample z ∈ V at random from the distribution p ( w ) = 1 Z (cid:89) u ∈ [ k ] A ( x ( u ) , w ) A F ( u,v ) (cid:89) u ∈ [ k ] A ( w, x ( u )) A F ( v,u ) , w ∈ V (10)where Z = (cid:80) c ∈ V (cid:16)(cid:81) u ∈ [ k ] A ( x ( u ) , c ) A F ( u,v ) (cid:17) (cid:16)(cid:81) u ∈ [ k ] A ( c, x ( u )) A F ( v,u ) (cid:17) is the normalization con-stant. Define a new homomorphism x (cid:48) : F → G by x (cid:48) ( w ) = z if w = v and x (cid:48) ( w ) = x ( w ) otherwise Output:
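To make the pivot update concrete, the following is a minimal Python sketch of Algorithm MP with AcceptProb = Approximate for a network that is stored as a dense NumPy adjacency matrix. The function name and the dense-matrix representation are our own illustrative choices (a practical implementation would use sparse adjacency structures), and the acceptance probability follows our reading of (8) above.

```python
import numpy as np

def approx_pivot_update(A, x, rng):
    """One update of the approximate pivot chain (Algorithm MP with
    AcceptProb = Approximate) for a k-chain motif.

    A   : (n, n) symmetric nonnegative adjacency matrix (dense, for clarity)
    x   : integer array of length k; x[i] is the node to which motif node
          i + 1 maps
    rng : a numpy.random.Generator, e.g., np.random.default_rng()
    """
    n, k = A.shape[0], len(x)
    x_new = np.array(x, copy=True)
    deg = A[x[0]].sum()
    if deg == 0:
        return x_new  # isolated pivot: no admissible move
    # Random-walk proposal for the pivot; cf. the distribution (7).
    ell = rng.choice(n, p=A[x[0]] / deg)
    # Approximate Metropolis-Hastings acceptance; cf. (8). This makes the
    # pivot's stationary distribution uniform over the node set V.
    alpha = min(deg / A[ell].sum(), 1.0)
    if rng.random() <= alpha:
        x_new[0] = ell
    # Successively resample the remaining nodes of the chain; cf. (9).
    for i in range(1, k):
        w = A[x_new[i - 1]]
        x_new[i] = rng.choice(n, p=w / w.sum())
    return x_new
```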
Experimentally, we find that the Glauber chain is slow, especially for sparse networks (e.g., for Coronavirus PPI and UCLA, which both have very low edge densities), and that the pivot chain is too expensive to compute for chain motifs with large k. As a compromise with a low computational complexity and fast convergence (as fast as the standard random walk), we employ an approximate pivot chain, which is Algorithm MP with the option AcceptProb = Approximate. Specifically, we compute the acceptance probability α in (8) only approximately and thereby reduce the computational cost to O(Δ). The compromise, which we discuss in the next paragraph, is that the stationary distribution of the approximate pivot chain may be slightly different from our target distribution π_{F→G}.

According to Proposition G.1, the stationary distribution of the approximate pivot chain is

  π̂_{F→G}(x) := [∏_{i=2}^{k} A(x(i−1), x(i))] / [|V| ∑_{y_2,...,y_k∈V} A(x(1), y_2) ∏_{i=3}^{k} A(y_{i−1}, y_i)].    (11)

In general, the distribution (11) is different from the desired target distribution π_{F→G}. Specifically, π_{F→G}(x) is proportional only to the numerator in (11), and the sum in the denominator in (11) is a weighted count of the homomorphisms y: F → G for which y(1) = x(1). Therefore, under π̂_{F→G}, we penalize the probability of each homomorphism x: F → G according to the number of k-step walks in G that start from x(1) ∈ V. (The exact acceptance probability in (8) neutralizes this penalty.) It follows that π̂_{F→G} is close to π_{F→G} when the k-step-walk counts that start from each node in G do not differ too much across nodes. For example, on degree-regular networks like lattices, such counts do not depend on the starting node, and it thus follows that π̂_{F→G} = π_{F→G}. Nevertheless, despite the potential discrepancy between π_{F→G} and π̂_{F→G}, the approximate pivot chain gives good results for the reconstruction and denoising experiments that we showed in Figures 3 and 4 in the main manuscript.

Appendix D. Algorithm for Network Dictionary Learning (NDL)
D.1. Algorithm overview and statement. The essential idea behind our algorithm for NDL (see Algorithm 1) is as follows. We first sample a large number M of homomorphisms x_t: F → G from π_{F→G} and compute their corresponding mesoscale patches A_{x_t} for t ∈ {1, ..., M}. These M mesoscale patches of G form the data set (a so-called ‘batch’) to which we apply a dictionary-learning algorithm. Specifically, we (column-wise) vectorize each of these k × k matrices (using Algorithm A4) and obtain a k² × M data matrix X, and we then apply nonnegative matrix factorization (NMF) [19] to obtain a k² × r nonnegative matrix W for some fixed integer r ≥ 1 that yields an approximate factorization X ≈ WH for some nonnegative matrix H. From this procedure, we approximate each column of X by the nonnegative linear combination of the r columns of W with coefficients that are given by the corresponding column of H. Therefore, if we let L_i be the k × k matrix that we obtain by reshaping the i-th column of W (using Algorithm A5), then [L_1, ..., L_r] is an approximate solution of (6). We give the precise meaning of ‘approximate solution’ in Theorems G.2 and G.5.

The scheme in the paragraph above requires one to store all M mesoscale patches, entailing a memory requirement that is at least of order k²M. Because M should scale with the size of G, this implies that we need unbounded memory to handle arbitrarily large networks. To address this issue, Algorithm 1 implements the above scheme in the setting of ‘online learning’, in which subsets (so-called ‘minibatches’) of data arrive in a sequential manner and one does not store previous subsets of the data before processing new subsets. Specifically, at each iteration t = 1, 2, ..., T, we only process a sample matrix X_t that is smaller than the full matrix X and includes only N ≪ M mesoscale patches, where one can take N to be independent of the network size. Instead of the standard NMF algorithms for a fixed matrix [20], we use an ‘online’ NMF algorithm [30, 32] that applies to sequences of matrices, where the intermediate dictionary matrices W_t that we obtain by factorizing the sample matrices X_t typically improve as we iterate (see [30, 32]). In Algorithm 1, we give a full implementation of the NDL algorithm.

We now explain how the NDL algorithm works. It combines one of three MCMC algorithms — a pivot chain (in which we use Algorithm MP with AcceptProb = Exact), an approximate pivot chain (in which we use Algorithm MP with AcceptProb = Approximate), and a Glauber chain (in which we use Algorithm MG) — for motif sampling (see Section C) with the online NMF algorithm of [30]. Suppose that we have an undirected and binary graph G = (V, A) and a k-chain motif F = ([k], A_F). The requirement in Algorithm 1 that there exists at least one homomorphism F → G is satisfied as long as G has at least one edge, so we can find an initial homomorphism x_0: F → G by rejection sampling (see Algorithm A3). At each iteration t = 1, 2, ..., T, the motif-sampling algorithm generates a sequence x_s: F → G of N homomorphisms and corresponding mesoscale patches A_{x_s} (see Figure 5a), which we summarize as the k² × N data matrix X_t. The online NMF algorithm in (12) learns a nonnegative factor matrix W_t of size k² × r by improving the previous factor matrix W_{t−1} with respect to the new data matrix X_t. It is an ‘online’ NMF algorithm because it factorizes a sequence (X_t)_{t∈{1,...,T}} of data matrices, rather than a single matrix as in conventional NMF algorithms [20]. During this entire process, the algorithm only needs to store auxiliary matrices P_t and Q_t of fixed sizes r × r and r × k², respectively; it does not need the previous data matrices X_1, ..., X_{t−1}. Therefore, NDL is efficient in memory and scales well with network size. Moreover, NDL is applicable to time-dependent networks because of its online nature, although we do not pursue this direction in the present paper.
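As an illustration of the sampling stage, the following Python sketch assembles the k² × N data matrix X_t from N sampled homomorphisms. It assumes the identity mask, under which the mesoscale patch is simply the induced submatrix A_x(a, b) = A(x(a), x(b)) as in (2)–(3); the helper names are our own hypothetical choices.

```python
import numpy as np

def mesoscale_patch(A, x):
    """k x k mesoscale patch that the homomorphism x induces (identity mask),
    i.e., the submatrix with entries A_x(a, b) = A(x(a), x(b))."""
    return A[np.ix_(x, x)]

def data_matrix(A, homs):
    """Assemble the k^2 x N matrix X_t whose j-th column is the column-wise
    vectorization of the j-th sampled mesoscale patch."""
    return np.column_stack(
        [mesoscale_patch(A, x).flatten(order="F") for x in homs]
    )
```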
Algorithm 1. Network Dictionary Learning (NDL)
Input: network G = (V, A)
Parameters: F = ([k], A_F) (a motif), T ∈ ℕ (the number of iterations), N ∈ ℕ (the number of homomorphisms per iteration), r ∈ ℕ (the number of latent motifs), λ ≥ 0 (the coefficient of an L¹-regularizer)
Options: mask ∈ {Id, NF}, MCMC ∈ {Pivot, PivotApprox, Glauber}
Requirement: there exists at least one homomorphism F → G
Initialization:
  Sample a homomorphism x_0: F → G using rejection sampling (see Algorithm A3)
  W_0 ← k² × r matrix of independent entries that we sample uniformly from [0, 1]
  P_0 ← matrix of size r × r whose entries are 0
  Q_0 ← matrix of size r × k² whose entries are 0
For t = 1, 2, ..., T:
  MCMC update and sampling of mesoscale patches:
    Successively generate N homomorphisms x_{N(t−1)+1}, x_{N(t−1)+2}, ..., x_{Nt} by applying
      Algorithm MP with AcceptProb = Exact, if MCMC = Pivot
      Algorithm MP with AcceptProb = Approximate, if MCMC = PivotApprox
      Algorithm MG, if MCMC = Glauber
    For s = N(t − 1) + 1, ..., Nt:
      A_{x_s} ← k × k mesoscale patch of G that is induced by x_s (see (2)) with
        Φ_{F,x_s} = identity mask in (3), if mask = Id; no-folding mask in (4), if mask = NF
    X_t ← k² × N matrix whose j-th column is vec(A_{x_ℓ}) with ℓ = N(t − 1) + j (where vec(·) denotes the vectorization operator that we define in Algorithm A4)
  Single iteration of online nonnegative matrix factorization:
    H_t ← argmin_{H ∈ R^{r×N}_{≥0}} ‖X_t − W_{t−1}H‖_F² + λ‖H‖_1  (using Algorithm A1)
    P_t ← (1 − t^{−1}) P_{t−1} + t^{−1} H_t H_t^T
    Q_t ← (1 − t^{−1}) Q_{t−1} + t^{−1} H_t X_t^T
    W_t ← argmin_{W ∈ C^{dict} ⊆ R^{k²×r}_{≥0}} (tr(W P_t W^T) − 2 tr(W Q_t))  (using Algorithm A2),    (12)
  where C^{dict} = {W ∈ R^{k²×r}_{≥0} | the columns of W have Frobenius norm at most 1}
Output: network dictionary W_T ∈ R^{k²×r}_{≥0}
In (12), we solve convex optimization problems to find the matrices H_t ∈ R^{r×N}_{≥0} and W_t ∈ R^{k²×r}_{≥0}. The first subproblem in (12) is a coding problem. Given the two matrices X_t and W_{t−1}, we seek a factor matrix (i.e., a ‘code matrix’) H_t such that X_t ≈ W_{t−1}H_t. The parameter λ ≥ 0 is the coefficient of an L¹-regularizer, which encourages H_t to have a small L¹ norm. One can solve the coding problem efficiently by using Algorithm A1 or one of a variety of existing algorithms (e.g., LARS [6], LASSO [54], or feature-sign search [21]). The second and third lines in (12) update the ‘aggregate matrices’ P_{t−1} ∈ R^{r×r} and Q_{t−1} ∈ R^{r×k²} by taking a weighted average of them with the new information H_t H_t^T ∈ R^{r×r} and H_t X_t^T ∈ R^{r×k²}, respectively. We weight the old aggregate matrices by 1 − t^{−1} and the new information by t^{−1}. By induction, we obtain P_t = t^{−1} ∑_{s=1}^{t} H_s H_s^T and Q_t = t^{−1} ∑_{s=1}^{t} H_s X_s^T. We use the updated aggregate matrices, P_t and Q_t, in the last subproblem in (12). This subproblem is called the dictionary-update problem, and it yields W_t. It is a constrained quadratic problem, and we can solve it using projected gradient descent (see Algorithm A2).
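For concreteness, here is a minimal sketch of the dictionary-update subproblem in (12) via projected gradient descent, in the spirit of Algorithm A2 (whose precise form we do not reproduce here); the step size and the iteration count are illustrative choices rather than those of the actual implementation.

```python
import numpy as np

def update_dictionary(W, P, Q, n_iter=100, step=None):
    """Projected gradient descent for the dictionary-update problem in (12):
    minimize tr(W P W^T) - 2 tr(W Q) over nonnegative W whose columns have
    Euclidean norm at most 1.

    W : (k*k, r) current dictionary;  P : (r, r);  Q : (r, k*k)
    """
    if step is None:
        # 2*P is the Hessian of the objective in W; its spectral norm
        # bounds the Lipschitz constant of the gradient.
        step = 1.0 / (2 * np.linalg.norm(P, 2) + 1e-12)
    for _ in range(n_iter):
        grad = 2 * (W @ P - Q.T)
        W = np.maximum(W - step * grad, 0.0)   # project onto the nonnegative orthant
        norms = np.linalg.norm(W, axis=0)
        W = W / np.maximum(norms, 1.0)         # project columns into the unit ball
    return W
```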
[Figure 5 panels: an 11 × 11 network dictionary that we learn from the UCLA Facebook network, the original UCLA Facebook network, the reconstructed UCLA Facebook network, and a sequence of intermediate dictionaries (Dictionary 1, Dictionary 2, Dictionary 3, ..., and the limiting dictionary), in panels a–d.]

Figure 5.
Illustration of our network dictionary learning (NDL) algorithm (see Algorithm 1). (a) Homomorphisms x_t: F → G from a k-chain motif into a target network G evolve as a Markov chain to yield a sequence of k-chain subgraphs (the green edges) in G. (b) Each copy of the k-chain motif in G induces a k-node subgraph (i.e., the mesoscale patch A_{x_t} that we defined in (2)). (c) We form a sequence of matrices X_t of size k² × N, where the N columns of each X_t are vectorizations of the N consecutive k × k mesoscale patches in panel (b). (d) Using an online nonnegative matrix factorization (NMF) algorithm, we progressively learn the desired number of latent motifs as the data matrices X_t of mesoscale patches arrive.

In all of our experiments, we take the compact and convex constraint set C^{dict} ⊆ R^{k²×r}_{≥0} to be the set of matrices W ∈ R^{k²×r}_{≥0} whose columns have a Frobenius norm of at most 1 (as required in (6)).

D.2. Dominance scores of latent motifs.
In this subsection, we introduce a quantitative measurement of the ‘prevalence’ of the latent motifs in the network dictionary W_T that we compute using NDL (see Algorithm 1) for a network G.

Recall that the output of the NDL algorithm for a network G using a k-chain motif is a network dictionary W_T, which consists of r latent motifs L_1, ..., L_r of size k × k. Recall as well that the algorithm computes data matrices X_1, ..., X_T of size k² × N. Suppose that we have code matrices H*_1, ..., H*_T such that X_t ≈ W_T H*_t for all t ∈ {1, ..., T}. More precisely, we let

  H*_t = argmin_{H ≥ 0} (‖X_t − W_T H‖_F² + λ‖H‖_1),    (13)

where we take the argmin over all H ∈ R^{r×N}_{≥0}. The columns of H*_t encode how to nonnegatively combine the latent motifs in W_T to approximate the mesoscale patches in X_t ∈ R^{k²×N}_{≥0}, so the rows of H*_t encode the linear coefficients of each latent motif in W_T that we use for approximating the columns of X_t. Consequently, the mean inner products of the rows of H*_t encode the mean usage of the latent motifs in W_T in G. This motivates us to consider the following mean Gramian matrix [13]:

  P*_T := (1/T) ∑_{t=1}^{T} H*_t (H*_t)^T ∈ R^{r×r}.

We then can take the square roots of the diagonal entries of P*_T to obtain the mean prevalences of the latent motifs in W_T in G.
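In code, given the mean Gramian matrix as a NumPy array (either P*_T or, as we discuss next, the aggregate matrix P_T from Algorithm 1), the dominance scores are simply the square roots of its diagonal entries; the helper below is a hypothetical two-line sketch.

```python
import numpy as np

def dominance_scores(P):
    """Dominance scores from an r x r mean Gramian matrix P (e.g., P_T).

    Returns the scores and the indices of the latent motifs sorted from
    most dominant to least dominant."""
    scores = np.sqrt(np.diag(P))
    ranking = np.argsort(scores)[::-1]
    return scores, ranking
```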
Computing P*_T requires one to store the previous data matrices X_1, ..., X_T and to compute H*_1, ..., H*_T by solving (13) for each t ∈ {1, ..., T}; this is a very expensive computation. To address this issue, we instead use the aggregate matrix P_T that we compute as part of Algorithm 1, so it does not require any extra computation. Note that

  P_T = (1/T) ∑_{t=1}^{T} H_t H_t^T,

where H_t ∈ R^{r×N}_{≥0} is the code matrix that is given by H_t = argmin_{H ≥ 0} (‖X_t − W_{t−1}H‖_F² + λ‖H‖_1). Note that P_T is an approximation of P*_T because the defining equation of H_t is the same as that of H*_t in (13) with W_T replaced by W_{t−1}. However, the approximation error between P*_T and P_T vanishes as T → ∞ under mild conditions. Specifically, under the hypotheses of Theorems G.2 and G.5, W_t converges almost surely to some limiting dictionary. It follows that ‖P*_T − P_T‖_F → 0 almost surely as T → ∞.

[Figure 6 grid: the most dominant latent motifs for the networks Coronavirus PPI, SNAP FB, arXiv, H. sapiens, Caltech, MIT, UCLA, Harvard, ER₁, ER₂, WS₁, WS₂, BA₁, and BA₂, with one row per scale k.]

Figure 6.
The most dominant latent motifs that we learn from 14 networks (eight real-world networks and six synthetic networks, which are single instantiations of random-graph models) at five different scales (the smallest of which is k = 6). Using our NDL algorithm, we learn network dictionaries of r = 25 latent motifs with k nodes for each of the 14 networks. For each network at each scale, we show only the most dominant latent motif from each dictionary. We showed the associated second-most dominant latent motifs in Figure 2. Black squares indicate 1 entries, and white squares indicate 0 entries.

In Figure 2 of the main manuscript, we showed the second-most dominant latent motifs for twelve networks at the same five scales. In Figure 6, we show the most dominant latent motifs for the same sets of networks and scales. See Figures 8, 9, 10, and 14 in Section I for all 25 latent motifs (along with their dominance scores) of all of the networks in Figure 6 at all five scales. For many of these networks (e.g., Harvard, UCLA, arXiv, and the ER networks) and at all five scales, the most dominant latent motifs in Figure 6 are close to the adjacency matrix of the k-chain itself. In these examples, the entries in the first superdiagonal and the first subdiagonal overwhelm the rest of the entries. However, for all networks except the ER networks, the second-most dominant latent motifs in Figure 2 reveal more interesting mesoscale structures (e.g., communities in Harvard, MIT, UCLA, and arXiv and a bipartition of the edges that are not in the k-chain in Coronavirus PPI) than the most dominant latent motifs in Figure 6.
D.3. Influence of masking and MCMC algorithms on latent motifs. Our NDL algorithm (see Algorithm 1) uses mesoscale patches A_x with a mask that is either the identity mask (3) or the no-folding mask (4), but one can generalize it to any choice of mask. The original NDL algorithm [30, Algorithm 1] corresponds to Algorithm 1 with the option mask = Id (i.e., with the identity mask for the mesoscale patches). In Section B, we discussed how the no-folding mask improves the interpretability of the positive entries in the mesoscale patches. Consequently, it also improves the interpretability of the latent motifs that we learn from them using Algorithm 1.
[Figure 7 panels: four grids of latent motifs that correspond to the settings mask = NF with MCMC = PivotApprox, mask = NF with MCMC = Glauber, mask = Id with MCMC = PivotApprox, and mask = Id with MCMC = Glauber, in panels a–d.]

Figure 7.
Comparison of the r = 25 latent motifs that we learn from Coronavirus PPI using the NDL algorithm (see Algorithm 1) in four different settings, which arise from the choices mask ∈ {Id, NF} and MCMC ∈ {PivotApprox, Glauber}. The choices mask = Id and mask = NF indicate that the NDL algorithm uses mesoscale patches (2) with the identity mask (3) and the no-folding mask (4), respectively. The choices MCMC = PivotApprox and MCMC = Glauber indicate that the NDL algorithm uses the approximate pivot chain (see Algorithm MP with AcceptProb = Approximate) and the Glauber chain (see Algorithm MG), respectively. The other parameter values are λ = 1, N = 100, and T = 100. We use black squares for 1 entries and white squares for 0 entries. The numbers underneath the latent motifs give their dominance scores.

In Figure 7, we compare the r = 25 latent motifs that we learn from Coronavirus PPI using four different combinations of masks and MCMC motif-sampling algorithms. In Figure 7c,d, we see that the latent motifs that we learn with the identity mask have large off-chain entries that dominate the on-chain entries. (See Section B.3 for the definitions of on-chain and off-chain entries.) The most dominant latent motifs appear as complete bipartite networks, in which each node in one set is adjacent to all nodes in the other set of the bipartition. This is counterintuitive, as Coronavirus PPI has a very low edge density. By contrast, as we see in Figure 7a,b, the latent motifs that we learn with the no-folding mask (4) have comparatively sparser off-chain entries, which are dominated by the on-chain entries. However, by comparing Figure 7a with Figure 7b and Figure 7c with Figure 7d, we also see that the choice between the approximate pivot chain and the Glauber chain as the MCMC algorithm for Algorithm 1 may not affect the latent motifs as much as whether we use the identity mask or the no-folding mask.

Appendix E. Algorithms for Network Denoising and Reconstruction (NDR)
E.1. Algorithm overview and statement. The standard pipeline for image denoising and reconstruction [7, 33, 35] is to uniformly randomly sample a large number of k × k overlapping patches of an image and then average their associated approximations at each pixel to obtain a reconstructed version of the original image. A reasonable network analogue of this pipeline is as follows. Given a network G = (V, A), a motif F = ([k], A_F), and a network dictionary with latent motifs [L_1, ..., L_r], we uniformly randomly sample a large number T of homomorphisms x_t: F → G. To simplify the present discussion, we assume that each x_t uses k distinct nodes of G in its image V(t) := {x_t(a) ∈ V | a ∈ {1, ..., k}}. (We do not make this assumption elsewhere in the paper.) This yields T k-node subnetworks G(t) = (V(t), A(t)), where A(t) is the mesoscale patch A_{x_t} in (2). We then approximate each weight matrix A(t) by a nonnegative linear combination Â(t) of the latent motifs L_i. We then define a network G_recons = (V, A_recons), where we set A_recons(p, q) for each p, q ∈ V to be the mean of the approximate weights Â(t)(p, q) over all t ∈ {1, ..., T} such that p, q ∈ V(t).

Our network denoising and reconstruction (NDR) algorithm (see Algorithm 2) builds on the idea in the preceding paragraph. Suppose that we have a network G = (V, A), a motif F = ([k], A_F), and a network dictionary W that consists of r nonnegative k × k matrices L_1, ..., L_r. First, because uniformly randomly sampling a homomorphism x_t: F → G is not as straightforward as uniformly randomly sampling k × k patches of an image, we generate a sequence (x_t)_{t∈{1,...,T}} of homomorphisms using an MCMC motif-sampling algorithm (see Algorithms MP and MG). For each t ≥ 1, we approximate the mesoscale patch A_{x_t} (see (2)) by a nonnegative linear combination of the latent motifs L_i, and we then take the mean of the approximate values of each entry A(a, b) up to time t.
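One can maintain these running means without storing past patches: whenever a new approximate weight Â(t)(p, q) arrives for a node pair (p, q), one updates the current mean with the standard incremental formula, exactly as in the last step of Algorithm 2 below. A minimal sketch (which uses Python dictionaries in place of the matrices A_recons and A_count) is the following.

```python
def update_running_mean(A_recons, A_count, p, q, value):
    """Incrementally average the approximate weights for the node pair (p, q).

    A_recons, A_count : dicts that map node pairs to the running mean and to
                        the number of contributions, respectively
    value             : the new approximate weight for the pair (p, q)
    """
    j = A_count.get((p, q), 0) + 1
    A_count[(p, q)] = j
    mean = A_recons.get((p, q), 0.0)
    # (1 - 1/j) * old mean + (1/j) * new value, as in Algorithm 2
    A_recons[(p, q)] = (1 - 1 / j) * mean + value / j
    return A_recons[(p, q)]
```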
Algorithm 2. Network Denoising and Reconstruction (NDR)
Input: network G = (V, A) and network dictionary W ∈ R^{k²×r}_{≥0}
Parameters: F = ([k], A_F) (a motif), T ∈ ℕ (the number of iterations), λ ≥ 0 (the coefficient of an L¹-regularizer), θ ∈ [0, 1] (an edge threshold)
Options: denoising ∈ {T, F}, mask ∈ {Id, NF}, MCMC ∈ {Pivot, PivotApprox, Glauber}
Requirement: there exists at least one homomorphism F → G
Initialization:
  A_recons, A_count: V × V → {0} (matrices with 0 entries)
  Sample a homomorphism x_0: F → G by the rejection sampling in Algorithm A3
For t = 1, 2, ..., T:
  MCMC update and mesoscale-patch extraction:
    x_t ← updated homomorphism that we obtain by applying
      Algorithm MP with AcceptProb = Exact, if MCMC = Pivot
      Algorithm MP with AcceptProb = Approximate, if MCMC = PivotApprox
      Algorithm MG, if MCMC = Glauber
    A_{x_t} ← k × k mesoscale patch of G that is induced by x_t (see (2)) with
      Φ_{F,x_t} = identity mask in (3), if mask = Id; no-folding mask in (4), if mask = NF
    X_t ← k² × 1 matrix that we obtain by vectorizing A_{x_t} (using Algorithm A4)
  Mesoscale reconstruction:
    X̃_t ← X_t and W̃ ← W, if denoising = F
    X̃_t ← (X_t)_off and W̃ ← (W)_off using Algorithm 2a, if denoising = T
    H_t ← argmin_{H ∈ R^{r×1}_{≥0}} (‖X̃_t − W̃H‖_F² + λ‖H‖_1) and X̂_t ← W̃H_t
    Â_{x_t;W} ← k × k matrix that we obtain by reshaping the k² × 1 matrix X̂_t using Algorithm A5
  Update of the global reconstruction:
    For a, b ∈ {1, ..., k}:
      If (denoising = F or A_F(a, b) = 0) and Φ_{F,x_t}(a, b) = 1:
        A_count(x_t(a), x_t(b)) ← A_count(x_t(a), x_t(b)) + 1
        j ← A_count(x_t(a), x_t(b))
        A_recons(x_t(a), x_t(b)) ← (1 − j^{−1}) A_recons(x_t(a), x_t(b)) + j^{−1} Â_{x_t;W}(x_t(a), x_t(b))
Output: reconstructed networks G_recons = (V, A_recons) and G_recons(θ) = (V, 1(A_recons > θ))

As in Algorithm 1, we require in Algorithm 2 that there exists at least one homomorphism F → G. This condition is satisfied when F = ([k], A_F) is a chain motif and G has at least one edge; it holds for all of our experiments in the present paper. As in the first line of (12), the problem of finding H_t in the mesoscale-reconstruction step of Algorithm 2 is a standard convex problem, which one can solve by using Algorithm A1.

Algorithm 2a. Off-Chain Projection
Input: matrix Y ∈ R^{k²×m} and motif F = ([k], A_F)
Do:
  Let Y′ be the k × k × m tensor that we obtain by reshaping each column of Y using Algorithm A5
  Let Y′′ be the k × k × m tensor that we obtain from Y′ by Y′′(a, b, c) = Y′(a, b, c) · 1(A_F(a, b) = 0) for all a, b ∈ {1, ..., k} and c ∈ {1, ..., m}
  Let (Y)_off be the k² × m matrix that we obtain from Y′′ by vectorizing each of its slices Y′′[:, :, c] for c ∈ {1, ..., m} using Algorithm A4
Output: matrix (Y)_off ∈ R^{k²×m}

There are two variants of the NDR algorithm. The variant is specified by the Boolean variable denoising. The NDR algorithm with denoising = F is identical to the network-reconstruction algorithm in [30, Algorithm 2], except for the thresholding step. The NDR algorithm with denoising = T is a new variant of NDR that we introduce in the present work for the purpose of network denoising.
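In NumPy, the off-chain projection of Algorithm 2a amounts to zeroing out the entries (a, b) with A_F(a, b) ≠ 0 in each reshaped column. A minimal sketch, which assumes column-wise vectorization as in Algorithm A4, is the following.

```python
import numpy as np

def off_chain_projection(Y, A_F):
    """Zero out the on-chain entries of each vectorized k x k patch.

    Y   : (k*k, m) matrix whose columns are column-wise vectorized patches
    A_F : (k, k) adjacency matrix of the k-chain motif F
    """
    mask = (A_F == 0).flatten(order="F").astype(float)  # 1 on off-chain entries
    return Y * mask[:, None]
```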
E.2. Further discussion of the denoising variant of the NDR algorithm. We now give a detailed discussion of Algorithm 2 with denoising = T for network-denoising applications. Recall that the network-denoising problem that we consider is to reconstruct a true network G_true = (V, A) from an observed network G_obs = (V, A′). The scheme that we used to produce Figure 4 is the following:

D.1 Learn a network dictionary W ∈ R^{k²×r}_{≥0} from the observed network G_obs = (V, A′) using NDL (see Algorithm 1).
D.2 Compute a reconstructed network G_recons = (V, A_recons) using NDR (see Algorithm 2) with input G_obs = (V, A′) and W.
D.3 Fix an edge threshold θ ∈ [0, 1]. If G_obs is G_true with additive (respectively, subtractive) noise, we classify each edge (respectively, non-edge) (p, q) as ‘positive’ if and only if A_recons(p, q) > θ.

The NDR algorithm was first introduced in [30]. The version of the NDR algorithm from [30] is Algorithm 2 with the options mask = Id and denoising = F. It has a competitive performance for denoising −50% subtractive noise on the networks SNAP FB, H. sapiens, and arXiv in comparison to the methods
Spectral Clustering [41], DeepWalk [45], LINE [50], and node2vec [11]. In all of these methods, one first obtains a 128-dimensional vector representation of the nodes in a network; this is called a ‘node embedding’ of the network. One then uses this node embedding to compute vector representations of the edges using binary operations such as the Hadamard product. (See [11] for details.) One can then use a binary-classification algorithm (e.g., a support vector machine [42]) to attempt to detect the false edges. Spectral Clustering uses the leading eigenvectors of the normalized Laplacian matrix of G_obs to learn vector embeddings of the nodes. (See [51] for details.) The other three benchmark methods first generate sequences of nodes using random-walk sampling and then use a word-embedding technique (see, e.g., [39]) to learn a node embedding.

In Table 1, we compare the AUC scores of our NDL and NDR approach that we described at the beginning of Section E.2 for our network-denoising tasks to the AUC scores that one obtains using the above four existing methods. As was discussed in [30, Remark 4], a limitation of using NDR with mask = Id and denoising = F for network denoising is that one needs to invert what it means to classify successfully depending on whether the noise is additive or subtractive. Specifically, [30] used network denoising with NDR for additive noise, but it is then necessary to classify each non-edge (p, q) as ‘positive’ if A_recons(p, q) < θ. Therefore, the AUC scores for −50% noise in the fifth row of Table 1 are ‘flipped’, and the same is true for the sixth row of the table (for mask = NF and denoising = F). Our NDR algorithm with denoising = T addresses this directionality issue and allows us to use the unified classification scheme above for both additive and subtractive noise.

The idea behind NDR with denoising = T is to handle an issue that arises when one denoises additive noise in sparse real-world networks and that does not arise in the image-denoising setting. Suppose that we obtain G_obs by adding some false edges to a sparse binary network G_true. The on-chain entries of the mesoscale patches A_x are always equal to 1. Therefore, the latent motifs that we learn from G_obs have constant on-chain entries (see, e.g., Figures 2 and 6).
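As an illustration of step (D.3), the following sketch scores candidate node pairs with the reconstructed weights and computes the AUC with scikit-learn. The variable names are hypothetical; in the additive-noise setting, the candidates would be the edges of G_obs (with the true edges labeled 1), and in the subtractive-noise setting they would be the non-edges of G_obs (again with the true edges labeled 1).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def denoising_auc(A_recons, candidates, labels):
    """AUC of classifying candidate node pairs by their reconstructed weight.

    A_recons   : (n, n) reconstructed weight matrix from Algorithm 2
    candidates : list of node pairs (p, q) to classify
    labels     : 1 for pairs that belong to the true network, 0 otherwise
    """
    scores = np.array([A_recons[p, q] for (p, q) in candidates])
    return roc_auc_score(labels, scores)
```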
Algorithm                      |  SNAP FB        |  H. sapiens     |  arXiv
Noise                          |  +50%    −50%   |  +50%    −50%   |  +50%    −50%
Spectral Clustering            |  –       0.619  |  –       0.492  |  –       0.574
DeepWalk                       |  –       0.968  |  –       0.744  |  –       0.934
LINE                           |  –       0.949  |  –       0.725  |  –       0.890
node2vec                       |  –       0.968  |  –       0.772  |  –       0.934
NDL + NDR (mk = Id, dn = F)    |  0.845          |                 |
NDL + NDR (mk = NF, dn = F)    |  0.898          |                 |
NDL + NDR (mk = Id, dn = T)    |  0.943   0.980  |  0.677          |
NDL + NDR (mk = NF, dn = T)    |                 |                 |
Table 1. Area-under-the-curve (AUC) scores of the ROC curves for our network-denoising experiments using NDL (see Algorithm 1) and NDR (see Algorithm 2). As in Figure 4, we first use NDL to learn latent motifs from a corrupted network, and we then reconstruct the network using NDR to assign a confidence value to each potential edge. We use the networks SNAP FB, H. sapiens, and arXiv with −50% and +50% noise on the edges. In the last four rows, mk stands for the choice of mask in NDR and dn stands for the denoising option in NDR. We use mask = NF for NDL in each of the last four rows. For both NDL and NDR, we use MCMC = PivotApprox. The last row is a duplicate from Figure 4, and we obtain the results in rows five–seven using the same parameter choices as in Figure 4. For the last four rows, except for the six instances in italics (i.e., for dn = F and −50% noise), we use the classification scheme (D.3). For these six instances, we compute the AUC of the ‘flipped’ ROC curve, as the algorithm technically performs the opposite classification task. When dn = T, we do not flip the classification task. In the first four rows, we construct 128-dimensional vector representations of the networks using the indicated methods, and we then use them for edge classification.

Consequently, a linear approximation of the mesoscale patches A_x of G_obs by the latent motifs that we learn from G_obs cannot distinguish between true and false on-chain entries. Furthermore, because G_obs is sparse, there are many fewer positive off-chain entries in A_x than on-chain entries of A_x. Therefore, linear approximations of A_x using the latent motifs are likely to assign larger weights to reconstructing the on-chain entries of A_x than the off-chain entries. The resulting reconstruction of G_obs is thus similar to G_obs, and it is very hard to detect any false edges in G_obs. Using the option denoising = T prevents this issue by ignoring all on-chain entries, both for each sampled mesoscale patch A_x and for each latent motif in the network dictionary W that we use for denoising. For example, using denoising = T instead of denoising = F for SNAP FB and arXiv (see Table 1) yields a performance gain of about 10% for +50% noise.

Another issue arises when one uses NDR with denoising = F for denoising subtractive noise in sparse real-world networks.
In many of our experiments in this situation, our network-denoising scheme in (D.1)–(D.3) seems to give ‘flipped’ ROC curves that lie below the diagonal line y = x that represents the baseline ROC curve, and the AUC score is thus close to 0 instead of 1. Therefore, in the reconstructed network G_recons, false non-edges have larger weights than true non-edges. For the six instances in italics (for denoising = F and −50% noise) in Table 1, we give AUC scores after flipping the ROC curves, such that we classify a non-edge (p, q) as ‘positive’ if A_recons(p, q) < θ, which is the opposite of (D.3), which we used for additive noise in all cases. It is unfortunate to have to use the opposite classification scheme for different types of noise, especially when one may not know the type of noise in advance. We suspect that the sparsity of real-world networks may lead to such ‘opposite directionality’; this phenomenon requires further investigation. By contrast, the ROC curves for denoising = T are always above the diagonal in all experiments in Table 1 (as well as in Figure 4), regardless of whether the noise is additive or subtractive, and we always obtain the AUC scores in Table 1 using the scheme in (D.3).

Appendix F. Experimental details
F.1. Data sets. We now describe the eight real-world networks that we examined in the main manuscript:

(1) Caltech: This connected network, which is part of the Facebook100 data set [56] (and previously was studied as part of the Facebook5 data set [55]), has 762 nodes and 16,651 edges. Nodes represent users in the Facebook network of Caltech on one day in fall 2005, and edges represent Facebook ‘friendships’ between these users.
(2) MIT: This connected network, which is part of the Facebook100 data set [56], has 6,402 nodes and 251,230 edges. Nodes represent users in the Facebook network of MIT on one day in fall 2005, and edges represent Facebook ‘friendships’ between these users.
(3) UCLA: This connected network, which is part of the Facebook100 data set [56], has 20,453 nodes and 747,604 edges. Nodes represent users in the Facebook network of UCLA on one day in fall 2005, and edges represent Facebook ‘friendships’ between these users.
(4) Harvard: This connected network, which is part of the Facebook100 data set [56], has 15,086 nodes and 824,595 edges. Nodes represent users in the Facebook network of Harvard on one day in fall 2005, and edges represent Facebook ‘friendships’ between these users.
(5) SNAP Facebook (SNAP FB) [24]: This connected network has 4,039 nodes and 88,234 edges. This network is a Facebook network that has been used as a benchmark example for edge inference [11].
(6) arXiv ASTRO-PH (arXiv) [11, 23]: This network has 18,722 nodes and 198,110 edges. Its largest connected component has 17,903 nodes and 197,031 edges. We use the full network in our experiments. It is a collaboration network between authors of papers in astrophysics that were posted to the arXiv preprint server. Nodes represent scientists, and edges indicate coauthorship relationships.
(7) Coronavirus PPI (Coronavirus): This connected network, which was curated by BioGRID [10, 44, 52] from 142 publications and preprints, has 1,546 proteins that are related to coronaviruses and 2,481 protein–protein interactions (in the form of physical contacts) between them. We downloaded this data set on 24 July 2020. Among the 2,481 interactions, 1,546 are for SARS-CoV-2 and were reported by 44 publications and preprints; the rest are related to the coronaviruses that cause Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS).
(8) Homo sapiens PPI (H. sapiens) [11, 44, 53]: This network has 24,407 nodes and 390,420 edges. Its largest connected component has 24,379 nodes and 390,397 edges. We use the full network in our experiments. The nodes represent proteins in the organism Homo sapiens, and the edges represent physical interactions between these proteins.

We now describe the six synthetic networks that we examined in the main manuscript:

(9) ER₁ and ER₂: An Erdős–Rényi (ER) network [8, 40], which we denote by ER(n, p), is a random-graph model. The parameter n is the number of nodes, and the parameter p is the independent, homogeneous probability that each pair of distinct nodes has an edge between them. The networks ER₁ and ER₂ are individual graphs that we draw from ER(5000, p) with two different choices of the connection probability p.
(10) WS₁ and WS₂: A Watts–Strogatz (WS) network, which we denote by WS(n, k, p), is a random-graph model for studying the small-world phenomenon [40, 57]. In the version of WS networks that we use, we start with an n-node ring network in which each node is adjacent to its k nearest neighbors. With independent probability p, we then remove each edge and rewire it to a pair of distinct nodes that we choose uniformly at random. The networks WS₁ and WS₂ are individual graphs that we draw from the WS model with n = 5000 nodes and two different parameter choices.
(11) BA₁ and BA₂: A Barabási–Albert (BA) network, which we denote by BA(n, m), is a random-graph model with a linear preferential-attachment mechanism [2, 40]. In the version of BA networks that we use, we start with m isolated nodes, and we introduce new nodes with m new edges each that attach preferentially (with a probability that is proportional to node degree) to existing nodes until we have a total of n nodes. The networks BA₁ and BA₂ are individual graphs that we draw from BA(5000, m) with two different choices of m.

F.2. Figures 2, 6, 8, 9, 10, 11, 12, 13, and 14.
These figures give latent motifs of the networks that we described in Section F.1, which we compute using Algorithm 1 with various parameter choices. In all of these figures, we use a chain motif with the corresponding network F = ([k], A_F) for T = 100 iterations, N = 100 homomorphisms per iteration, an L¹-regularizer with coefficient λ = 1, a mask of mask = Id, and an MCMC motif-sampling algorithm of MCMC = Pivot. We specify the number r of latent motifs and the scale k in the caption of each figure.

F.3. Figure 1.
In this figure, we illustrate latent motifs that we learn from UCLA and Caltech, and we compare them to an image dictionary. We use the following parameters in Algorithm 1 to generate these results: a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = 100 iterations, N = 100 homomorphisms per iteration, r = 25 latent motifs, an L¹-regularizer with coefficient λ = 1, a mask of mask = Id, and an MCMC motif-sampling algorithm of MCMC = Pivot. The pivot chain in Algorithm MP uses AcceptProb = Approximate. The image dictionary for the artwork Cycle in Figure 1 uses an algorithm that is similar to Algorithm 1, except that we uniformly randomly sample 21 × 21 square patches of the image instead of k × k mesoscale patches of a network.

F.4. Figure 7.
This figure compares the r = 25 latent motifs at scale k = 21 that we learn from Coronavirus PPI using Algorithm 1 with the MCMC motif-sampling algorithms MCMC ∈ {Glauber, PivotApprox} and the masks mask ∈ {Id, NF}. We specify the parameters in Algorithm 1 that we use to generate the results in Figure 7 in the caption of that figure.

F.5. Figure 3.
To generate Figure 3, we first apply the NDL algorithm (see Algorithm 1) to each network that we consider in the figure to learn r = 25 latent motifs using a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = 100 iterations, N = 100 homomorphisms per iteration, an L¹-regularizer with coefficient λ = 1, a mask of mask = NF, and MCMC = PivotApprox. For each self-reconstruction X ← X (see the caption of Figure 3), we apply the NDR algorithm (see Algorithm 2) with a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = ⌊n ln n⌋ iterations (where n is the number of nodes in the network), N = 100 homomorphisms per iteration, r = 25 latent motifs, an L¹-regularizer with coefficient λ = 0 (i.e., no regularization), a mask of mask = Id, an MCMC motif-sampling algorithm of MCMC = PivotApprox, and denoising = F. For each cross-reconstruction Y ← X (see the caption of Figure 3), we apply the NDR algorithm (see Algorithm 2) with a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = ⌊n ln n⌋ time steps (where n is the number of nodes in the network), N = 100 homomorphisms, an edge-threshold value of θ = 0.5, an L¹-regularizer with coefficient λ = 0 (i.e., no regularization), a mask of mask = NF, an MCMC motif-sampling algorithm of MCMC = PivotApprox, and denoising = F. We use multiple different choices of the number r of latent motifs; we indicate them in the caption of Figure 3.

In the main manuscript, we mentioned that the following claims follow from the reconstruction accuracies that we reported in Figure 3 in conjunction with the latent motifs in Figures 1, 2, 8, and 10. We now provide their justifications.
(1) The mesoscale structure of Caltech is rather different from those of Harvard, UCLA, and MIT at scale k = 21.

• In Figure 3c, we observe that the accuracy of the cross-reconstruction Y ← X is consistently higher for X ∈ {UCLA, Harvard, MIT} than for X = Caltech for all values of r. For instance, at r = 9, we can reconstruct UCLA with more than 90% accuracy, and we can reconstruct Harvard and MIT with more than 80% accuracy. However, the latent motifs that we learn from Caltech for r = 9 yield substantially lower accuracies for reconstructing UCLA, Harvard, and MIT. This indicates that Caltech has a significantly different mesoscale structure than the other three universities’ Facebook networks at scale k = 21. Indeed, from Figures 1, 2, 6, and 8, we see that the r = 25 latent motifs of Caltech at scale k = 21 have larger off-chain entries than those of UCLA, MIT, and Harvard.

(2) The mesoscale structure of Caltech at scale k = 21 has a higher dimension than those of the other three universities’ Facebook networks.

• Consider the cross-reconstructions Caltech ← X for X ∈ {UCLA, Harvard, MIT} in Figure 3b. The accuracies at r = 9 are low, even for the self-reconstruction that uses the latent motifs that we learn from Caltech itself. By contrast, the self-reconstruction accuracies X ← X for the Facebook networks of the other universities are noticeably higher. In other words, r = 9 latent motifs at scale k = 21 cannot approximate the mesoscale structures of Caltech as well as those of the other three universities’ Facebook networks. This indicates that the dimension of the mesoscale structures of Caltech at scale k = 21 is larger than those of the other three universities’ Facebook networks.

(3) The networks BA₁ and BA₂ are better than ER₁, ER₂, WS₁, and WS₂ at capturing the mesoscale structures of MIT, Harvard, and UCLA at scale k = 21.

• From the reconstruction accuracies for Y ← X in Figure 3b,c, where X is one of the six synthetic networks (ER_i, WS_i, and BA_i for i ∈ {1, 2}), we observe that the two BA networks have higher accuracies than the networks from the ER and WS models for Y ∈ {UCLA, Harvard, MIT}. This suggests that the mesoscale structures of UCLA, Harvard, and MIT may be more similar in some respects to those of the BA_i than to those of the ER_i and WS_i. The latent motifs of the BA networks in Figures 2 and 8 at k = 21 have characteristics that we also observe in UCLA, Harvard, and MIT. (Specifically, they have hub nodes and off-chain entries that are much smaller (so they are lighter in color) than the on-chain entries.) By contrast, in Figure 14, we see that the latent motifs for the ER networks have sparse but seemingly randomly distributed off-chain connections and that the ones for the WS networks have strongly interconnected communities (see the diagonal blocks of black entries). These patterns differ from the ones that we observe in the latent motifs of UCLA, MIT, and Harvard (see Figure 8).

(4) If we uniformly sample a walk of k = 21 nodes, then it is more likely that there are large communities in the induced subnetwork for Caltech than is the case for UCLA, Harvard, and MIT.

• From the reconstruction accuracies for Y ← X in Figure 3b,c, where X is one of the six synthetic networks (ER_i, WS_i, and BA_i for i ∈ {1, 2}), we observe that the WS networks outperform both the BA and ER networks in reconstructing Caltech, but they are among the lowest-performing networks for reconstructing the Facebook networks of the other three universities. In other words, nonnegative linear combinations of the latent motifs of the WS_i can better approximate the mesoscale patches of Caltech than they can those of UCLA, Harvard, and MIT. Recall that most latent motifs of the WS_i at scale k = 21 have large blocks of black entries (see Figure 10); these blocks correspond to adjacency matrices of communities. Therefore, such community structure should be more likely to occur in subnetworks that are induced from uniform samples of k = 21-node walks in Caltech than from such samples in UCLA, Harvard, or MIT.

F.6. Figure 4.
To generate Figure 4, we first apply the NDL algorithm (see Algorithm 1) to each corrupted network that we consider in the figure to learn r = 25 latent motifs using a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = 400 iterations, N = 1000 homomorphisms per iteration, an L¹-regularizer with coefficient λ = 1, a mask of mask = NF, and an MCMC motif-sampling algorithm of MCMC = PivotApprox. The NDR algorithm (see Algorithm 2) that we use to generate the results in Figure 4 uses r = 25 latent motifs with a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = 400,000 iterations for H. sapiens and T = 200,000 iterations for all other networks, an L¹-regularizer with coefficient λ = 1, a mask of mask = NF, MCMC = PivotApprox, and denoising = T. For Figure 4, we did not conduct the denoising experiment for Coronavirus PPI with −50% noise because the resulting network (with 1,536 nodes and 1,232 edges) cannot be connected. (To be connected, its spanning trees need to have 1,535 edges.)

We obtain the AUC scores in Figure 4 for the methods node2vec [11], DeepWalk [45], and LINE [50] for the task of denoising subtractive noise for SNAP Facebook, H. sapiens, and arXiv from [11]. In [11], the ROCs were computed from a ‘balanced test set’ that was chosen uniformly at random from all sets that include all of the |E|/2 false non-edges and |E|/2 true non-edges, where |E| denotes the number of edges in the network. By contrast, we compute our ROCs in Figure 4 using all of the |E|/2 false non-edges and all of the true non-edges, of which there are often very many. For instance, recall that arXiv has |V| = 18,722 nodes and |E| = 198,110 edges. Therefore, after deleting |E|/2 edges to create a corrupted network, there are |E|/2 = 99,055 false non-edges and |V|(|V| − 1)/2 − |E| = 175,049,171 true non-edges. Such an imbalance in a data set does not affect the ROCs and hence does not affect the AUCs. Consequently, the true-positive rates and false-positive rates do not change even if we compute them after independently sampling |E|/2 true non-edges. Therefore, our results are directly comparable to the AUCs in [11] after uniformly sampling a balanced test set that consists of |E|/2 true non-edges and |E|/2 false non-edges.

Appendix G. Convergence Analysis
In this section, we give rigorous convergence guarantees for our main algorithms for NDL and NDR. In [30, Corollary 6.1], Lyu et al. obtained a convergence guarantee for the original NDL algorithm ([30, Algorithm 1]) for non-bipartite networks G with the MCMC motif-sampling algorithms MCMC ∈ {Pivot, Glauber}. Theorem G.2(i) in the present paper establishes the same result for the NDL algorithm (see Algorithm 1) that uses mesoscale patches (2) with either the identity mask (3) or the no-folding mask (4). Theorem G.2(ii) gives a similar convergence result for the NDL algorithm with the approximate pivot chain (i.e., for MCMC = PivotApprox). Theorem G.5 establishes a similar convergence result for NDL for bipartite networks G. Theorems G.7 and G.10 establish convergence and error bounds for the NDR algorithm (see Algorithm 2). Except for Theorem G.2(i), all of these theoretical results are novel results of the present paper.

Let F = ([k], A_F) be the k-chain motif, and let G = (V, A) be a network. Let Ω ⊆ V^{[k]} denote the set of all homomorphisms x: F → G. Algorithm 1 generates three stochastic sequences. The first one is the sequence (x_t)_{t≥0} of homomorphisms F → G that we generate using the pivot chain (see Algorithm MP). The second one is the sequence (X_t)_{t≥1} of k² × N data matrices whose columns encode N mesoscale patches of G. More precisely, for each y_1, ..., y_N ∈ Ω, we write Ψ(y_1, ..., y_N) ∈ R^{k²×N}_{≥0} for the k² × N matrix whose i-th column is the vectorization (using Algorithm A4) of the corresponding k × k mesoscale patch A_{y_i} of G. For each y ∈ Ω, define

  X^{(N)}(y) := Ψ(y_1, ..., y_N) ∈ R^{k²×N}_{≥0},

where we generate y_1, ..., y_N using the pivot chain when it starts at y. It then follows that X_t = X^{(N)}(x_{Nt}) for each t ≥ 1, where x_{Nt} is the state of the pivot chain at time Nt. For the third (and final) sequence that we generate using Algorithm 1, let (W_t)_{t≥0} denote the sequence of dictionary matrices, where we define each W_t = W_t(x_0) via (12) with an initial homomorphism x_0: F → G that we sample using Algorithm A3.

G.1. Convergence of the approximate pivot chain.
We prove Proposition G.1, which states the convergence of the approximate pivot chain and gives an explicit formula for its unique stationary distribution.

Proposition G.1. Fix a network G = (V, A) and the k-chain motif F = ([k], A_F). Let (x_t)_{t≥0} denote a sequence of homomorphisms x_t: F → G that we generate using the approximate pivot chain (in which we use Algorithm MP with AcceptProb = Approximate). Suppose that

(a) the weight matrix A is ‘bidirectional’ (i.e., A(a, b) > 0 implies that A(b, a) > 0 for all a, b ∈ V) and the undirected and binary graph (V, 1(A > 0)) is connected and non-bipartite.

It then follows that (x_t)_{t≥0} is an irreducible and aperiodic Markov chain with the unique stationary distribution π̂_{F→G} that we defined in (11).

Proof. We follow the proof of [29, Thm. 5.8]. Let P: V × V → [0, 1] be the matrix with entries

  P(a, b) := A(a, b) / ∑_{c∈V} A(a, c),  a, b ∈ V.

This is the transition matrix of the standard random walk on the network G. By hypothesis (a), P is irreducible and aperiodic. Additionally, it has the unique stationary distribution (see [25, Ch. 9])

  π^{(1)}(v) := ∑_{c∈V} A(v, c) / ∑_{c,c′∈V} A(c, c′).

The approximate pivot chain generates a move x_t(1) ↦ x_{t+1}(1) of the pivot according to the distribution P(x_t(1), ·). We accept this move of the pivot, independently of everything else, with the approximate acceptance probability α in (8). If we were to always accept each move of the pivot, then the pivot would perform a random walk on G with the unique stationary distribution π^{(1)}. We compute the acceptance probability α using the Metropolis–Hastings algorithm (see [25, Sec. 3.3]), and we thereby modify the stationary distribution of the pivot from π^{(1)} to the uniform distribution on V. (See the discussion in [29, Sec. 5].) Therefore, (x_t(1))_{t≥0} is an irreducible and aperiodic Markov chain on V that has the uniform distribution as its unique stationary distribution. Because we sample the locations x_{t+1}(i) ∈ V of the subsequent nodes i = 2, 3, ..., k independently, conditional on the location x_{t+1}(1) of the pivot, it follows that the approximate pivot chain (x_t)_{t≥0} is also an irreducible and aperiodic Markov chain with a unique stationary distribution, which we denote by π̂_{F→G}.

To compute the stationary distribution π̂_{F→G}, we decompose (x_t)_{t≥0} into the return times of the pivot x_t(1) to a fixed node x_1 ∈ V of G. Specifically, let τ(j) be the j-th return time of x_t(1) to x_1. By the independence of sampling x_t(2), ..., x_t(k) for each t, the strong law of large numbers yields

  lim_{M→∞} (1/M) ∑_{j=1}^{M} 1(x_{τ(j)}(2) = x_2, ..., x_{τ(j)}(k) = x_k) = ∏_{i=2}^{k} A(x_{i−1}, x_i) / ∑_{y_2,...,y_k∈V} A(x_1, y_2) A(y_2, y_3) ⋯ A(y_{k−1}, y_k).

For each fixed homomorphism x: F → G, i ↦ x_i, we use the Markov-chain ergodic theorem (see, e.g., [5, Theorem 6.2.1 and Example 6.2.4] or [38, Theorem 17.1.7]) to obtain

  π̂_{F→G}(x) = lim_{N→∞} (1/N) ∑_{t=0}^{N} 1(x_t = x)
             = lim_{N→∞} [∑_{t=0}^{N} 1(x_t = x) / ∑_{t=0}^{N} 1(x_t(1) = x_1)] · [(1/N) ∑_{t=0}^{N} 1(x_t(1) = x_1)]
             = P(x_t(2) = x_2, ..., x_t(k) = x_k | x_t(1) = x_1) · (1/|V|)
             = [∏_{i=2}^{k} A(x_{i−1}, x_i) / ∑_{y_2,...,y_k∈V} A(x_1, y_2) A(y_2, y_3) ⋯ A(y_{k−1}, y_k)] · (1/|V|).

This proves the assertion. ∎
G.2. Convergence of the NDL algorithm. Recall the problem statement for NDL in (6). Informally, we seek to learn r latent motifs L_1, ..., L_r ∈ R^{k×k}_{≥0} that minimize the expectation of the error of approximating the mesoscale patch A_x by a nonnegative combination of the motifs L_i, where x: F → G is a random homomorphism that we sample from the distribution π_{F→G} in (1). We reformulate this problem as the following matrix-factorization problem, which generalizes (6). Let C^{dict} denote the set of all matrices W ∈ R^{k²×r}_{≥0} whose columns have a Frobenius norm of at most 1. The matrix-factorization problem is then

  argmin_{W ∈ C^{dict} ⊆ R^{k²×r}_{≥0}} ( f(W) := E_{x∼π_{F→G}} [ ℓ(X^{(N)}(x), W) ] ),    (14)

where we define the loss function

  ℓ(X, W) := inf_{H ∈ R^{r×N}_{≥0}} ‖X − WH‖_F² + λ‖H‖_1,  X ∈ R^{k²×N}, W ∈ R^{k²×r}.    (15)

The parameters N ∈ ℕ and λ ≥ 0 appear in Algorithm 1. The former is the number of homomorphisms that we sample at each iteration of Algorithm 1, and the latter is the coefficient of the L¹-regularizer that we use to find the code matrix H_t in (12). In the special case in which N = 1 and λ = 0, the problem (14) is equivalent to the problem (6) because X^{(1)}(x) and the columns of W are vectorizations (using Algorithm A4) of the mesoscale patch A_x and the latent motifs L_1, ..., L_r, respectively.

Theorems G.2 and G.5 imply that our NDL algorithm (see Algorithm 1) finds a sequence (W_t)_{t≥0} of dictionary matrices such that, almost surely, W_t is asymptotically a stationary point of the objective function f in the optimization problem (14). The objective function f is non-convex, so it is generally difficult to find a global optimum of f. In practice, however, such stationary points have often been good enough for practical applications, such as image restoration [7, 35]. We find that this is also the case for our network-denoising problem (see Figure 4).

We are now ready to state our first convergence result for the NDL algorithm (see Algorithm 1).

Theorem G.2 (Convergence of the NDL Algorithm for Non-Bipartite Networks). Let F = ([k], A_F) be the k-chain motif, and let G = (V, A) be a network that satisfies the following properties:

(a) The weight matrix A is ‘bidirectional’ (i.e., A(a, b) > 0 implies that A(b, a) > 0 for all a, b ∈ V), and the undirected and binary graph (V, 1(A > 0)) is connected and non-bipartite.
(b) For all t ≥ 1, there exists a unique solution H_t in (12).
(c) For all t ≥ 1, the eigenvalues of the positive semidefinite matrix P_t that we define in (12) are at least as large as some constant κ > 0.

Let (W_t)_{t≥0} denote the sequence of dictionary matrices that we generate using Algorithm 1. The following claims hold:
(i) For MCMC ∈ {Pivot, Glauber}, we have almost surely as t → ∞ that W_t converges to the set of stationary points of the objective function f that we defined in (14). Furthermore, if f has finitely many stationary points in C^{dict}, then W_t converges to a single stationary point of f almost surely as t → ∞.

(ii) For MCMC = PivotApprox, we have almost surely as t → ∞ that W_t converges to the set of stationary points of the objective function

  f̂(W) := E_{x∼π̂_{F→G}} [ ℓ(X^{(N)}(x), W) ],

where the distribution π̂_{F→G} is defined in (11). Furthermore, if f̂ has finitely many stationary points in C^{dict}, then W_t converges to a single stationary point of f̂ almost surely as t → ∞.
Remark G.3. Assumptions (a)–(c) in Theorem G.2 are all reasonable and are easy to satisfy. Assumption (a) is satisfied if G is undirected, binary, and connected, which is the case for all of our examples in the present paper. Assumptions (b) and (c) are standard assumptions in the study of online dictionary learning [30, 31, 32]. For instance, (b) is a common assumption in methods such as least-angle regression (LARS) [6] that aim to find good solutions to problems of the form (15). Additionally, in practice, one can verify (c) experimentally after a few iterations of Algorithm 1 for a reasonable choice of the initial dictionary (e.g., r samples of mesoscale patches). See [32, Sec. 4.1] and [30, Sec. 4.1] for more detailed discussions of these assumptions.
Remark G.4. It is also possible to slightly modify both the optimization problem (14) and our NDL algorithm so that Theorem G.2 holds for the modified problem and algorithm without needing to assume (b) and (c). The modified problem is

  argmin_{W ∈ C^{dict} ⊆ R^{k²×r}_{≥0}} ( E_{x∼π} [ inf_{H ∈ R^{r×N}_{≥0}} ‖X^{(N)}(x) − WH‖_F² + λ‖H‖_1 + λ′‖H‖_F² + κ‖W‖_F² ] ),    (16)

where π = π̂_{F→G} if MCMC = PivotApprox and π = π_{F→G} otherwise. Note that (16) is the same as (14) with additional quadratic penalization terms for both H and W in the loss function ℓ that we defined in (15). Correspondingly, consider the modification of the NDL algorithm (see Algorithm 1) in which the objective function for H_t in (12) has the additional term λ′‖H‖_F² and we replace P_t in (12) by P_t + κI. Assuming that λ′, κ > 0, the modified objective function for H_t is strictly convex, so H_t is uniquely defined. Therefore, it satisfies the uniqueness condition (b) in Theorem G.2. Additionally, the smallest eigenvalue of each matrix P_t + κI that we compute using the modified NDL algorithm has a lower bound of κ for all t, so it satisfies condition (c) in Theorem G.2. One can then show that the statement of Theorem G.2 holds for the modified problem (16) and the modified NDL algorithm without assumptions (b) and (c). The argument, which we omit, is almost identical to the one for Theorem G.2.

Proof of Theorem G.2. The proof of the first part of (i) is identical to the proof of [30, Corollary 6.1]. For the first part of (ii), we can use the same essential argument as in the proof of [30, Corollary 6.1]. However, because our assertion is for the approximate pivot chain that we propose in the present article (see Algorithm MP with
AcceptProb = Approximate ), we need to use Proposition G.1 (instead of [29, Prop.5.8]) to establish irreducibility and convergence of our Markov chain. The proof of the second parts of both (i) and (ii) are identical.We give a detailed proof of (ii) . Let π = ˆ π F →G if MCMC = PivotApprox and π = π F →G otherwise (see(1) and (11)). We define ( x t ) t ≥ , ( X t ) t ≥ , and ( W t ) t ≥ as before. We use a general convergence result foronline NMF for Markovian data [30, Theorem 4.1]. We first observe that the matrices X t ∈ R k × N ≥ that we compute in line 15 of Algorithm 1 do notnecessarily form a Markov chain, because the forward evolution of the Markov chain depends both on theinduced mesoscale patches and on the actual homomorphisms ( x s ) N ( t −
Define the map $\mathrm{Flip} \colon \mathbb{R}^{k^2 \times r} \to \mathbb{R}^{k^2 \times r}$ that maps $W \mapsto \overline{W}$, where the $j$th column $\overline{W}(:, j)$ of $\overline{W}$ is defined by
\[
\overline{W}(:, j) := \mathrm{vec} \circ \mathrm{rev} \circ \mathrm{reshape}\,(W(:, j))\,, \qquad j \in \{1, \ldots, r\}\,,
\]
where $W(:, j)$ denotes the $j$th column of $W$, $\mathrm{reshape} \colon \mathbb{R}^{k^2} \to \mathbb{R}^{k \times k}$ is the reshaping operator that we defined in Algorithm A5, $\mathrm{rev}$ maps a $k \times k$ matrix $K$ to the $k \times k$ matrix $(\overline{K}_{ab})_{1 \leq a, b \leq k}$ with entries $\overline{K}_{ab} = K(k - a + 1, k - b + 1)$, and $\mathrm{vec}$ denotes the vectorization operator in Algorithm A4. Applying Flip twice gives the identity map.
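In code, Flip amounts to reversing both axes of each column after a column-major reshape. A short Python sketch (our own illustration):

    import numpy as np

    def flip(W, k):
        # Apply vec o rev o reshape (Algorithms A4 and A5) to each column.
        W_bar = np.empty_like(W)
        for j in range(W.shape[1]):
            K = W[:, j].reshape((k, k), order="F")   # reshape
            K_rev = K[::-1, ::-1]                    # rev: (a, b) -> (k-a+1, k-b+1)
            W_bar[:, j] = K_rev.flatten(order="F")   # vec
        return W_bar

    k, r = 5, 3
    W = np.random.default_rng(2).random((k * k, r))
    assert np.allclose(flip(flip(W, k), k), W)   # applying Flip twice is the identity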
Theorem G.5 (Convergence of the NDL Algorithm for Bipartite Networks). Let $F = ([k], A_F)$ be the $k$-chain motif, and let $G = (V, A)$ be a network that satisfies the following properties:
(a') $A$ is symmetric and the undirected and binary graph $(V, \mathbf{1}(A > 0))$ is connected and bipartite.
(b) For all $t \geq 1$, there exists a unique solution $H_t$ in (12).
(c) For all $t \geq 1$, the eigenvalues of the positive semidefinite matrix $A_t$ in (12) are at least as large as some constant $\kappa > 0$.
Let $(W_t)_{t \geq 1}$ denote the sequence of dictionary matrices that we generate using Algorithm 1. We then have the following properties:
(i) Suppose that MCMC ∈ {Pivot, Glauber}. For each $i \in \{1, 2\}$, conditional on $x_0 \in \Omega_i$, the sequence $W_t$ of dictionary matrices converges almost surely as $t \to \infty$ to the set of stationary points of the associated conditional expected loss function $f^{(i)}$ that we defined in (18). If MCMC = PivotApprox, then the same statement holds with $f^{(i)}$ replaced by the function $\hat{f}^{(i)}$ that we defined in (19). If we also assume that $f^{(i)}$ (respectively, $\hat{f}^{(i)}$), with $i \in \{1, 2\}$, has only finitely many stationary points in $\mathcal{C}^{\mathrm{dict}}$, then it follows that $W_t$ converges to a single stationary point of $f^{(i)}$ (respectively, $\hat{f}^{(i)}$) almost surely as $t \to \infty$.
(ii) Suppose that
MCMC = Glauber in Algorithm 1 and that $k$ is even. Assume that $x_0 \in \Omega_1$. It then follows that, almost surely, the sequences of dictionary matrices $W_t$ and $\overline{W}_t := \mathrm{Flip}(W_t)$ converge simultaneously to the sets of stationary points of the expected loss functions $f^{(1)}$ and $f^{(2)}$, respectively. Moreover, $f^{(1)}(W_t) = f^{(2)}(\overline{W}_t)$ for all $t \geq 1$. If we also assume that $f^{(i)}$ (with $i \in \{1, 2\}$) has only finitely many stationary points in $\mathcal{C}^{\mathrm{dict}}$, then it follows that both $W_t$ and $\overline{W}_t$ converge to single stationary points of $f^{(1)}$ and $f^{(2)}$, respectively, almost surely as $t \to \infty$.

Proof. We first prove (i). Fix $i \in \{1, 2\}$, and recall the conditional stationary distribution $\pi^{(i)}_{F \to G}$ from (17). Conditional on $x_0 \in \Omega_i$, the Markov chain $(x_t)_{t \geq 0}$ is irreducible and aperiodic with a unique stationary distribution $\pi^{(i)}_{F \to G}$. Recall that the conclusion of Theorem G.2 holds as long as the underlying Markov chain is irreducible. Therefore, $W_t$ converges almost surely to the set of stationary points of the associated conditional expected loss function $f^{(i)}$ that we defined in (18). The same argument verifies the case in which MCMC = PivotApprox.

We now verify (ii). Define the notation $\mu_i := \pi^{(i)}_{F \to G}$ for $i \in \{1, 2\}$, and suppose that $k$ is even. For each homomorphism $x \colon F \to G$, we define a map $\bar{x} \colon [k] \to V$ by $\bar{x}(j) := x(k - j + 1)$ for all $j \in \{1, \ldots, k\}$. For even $k$, we have that $x \in \Omega_1$ if and only if $\bar{x} \in \Omega_2$. Because $A$ is symmetric, it follows that
\[
\prod_{j=1}^{k-1} A(x(j), x(j+1)) = \prod_{j=1}^{k-1} A(x(j+1), x(j)) = \prod_{j=1}^{k-1} A(\bar{x}(j), \bar{x}(j+1))\,.
\]
Therefore, $Z_1 = Z_2 = Z/2$. Consequently, for each $x \in \Omega_1$, (17) implies that
\[
\mu_1(x) = \mu_2(\bar{x}) = 2\,\pi_{F \to G}(x)\,. \quad (20)
\]
Consider two Glauber chains, $(x_t)_{t \geq 0}$ and $(x'_t)_{t \geq 0}$, where $x_0 = y$ and $x'_0 = \bar{y}$. We evolve these two Markov chains using a common source of randomness so that individually they have Glauber-chain trajectories but they also satisfy the relation $x'_t = \bar{x}_t$ for all $t \geq 0$. (This is typically called a 'coupling argument' in the probability literature; see [25, Sec. 4.2].) Specifically, suppose that $x'_t = \bar{x}_t$. For each update $x_t \mapsto x_{t+1}$ and $x'_t \mapsto x'_{t+1}$, we choose a node $v \in [k]$ uniformly at random and sample $z \in V$ according to the conditional distribution (10). We define
\[
x_{t+1}(v) = z \quad \text{and} \quad x_{t+1}(u) = x_t(u) \text{ for } u \neq v\,,
\]
\[
x'_{t+1}(k - v + 1) = z \quad \text{and} \quad x'_{t+1}(u) = x'_t(u) \text{ for } u \neq k - v + 1\,.
\]
We then have that $x_t \mapsto x_{t+1}$ follows the Glauber-chain update in Algorithm MG. We also have the desired relation $x'_{t+1} = \bar{x}_{t+1}$ because
\[
x'_{t+1}(k - v + 1) = z = x_{t+1}(v) = \bar{x}_{t+1}(k - v + 1)\,,
\]
\[
x'_{t+1}(u) = x'_t(u) = \bar{x}_t(u) = x_t(k - u + 1) = x_{t+1}(k - u + 1) = \bar{x}_{t+1}(u) \quad \text{for } u \neq k - v + 1\,.
\]
Finally, we need to verify that $x'_t \mapsto x'_{t+1}$ also follows the Glauber-chain update in Algorithm MG. It suffices to check that $z \in V$ has the same distribution as $x'_{t+1}(k - v + 1)$. Because $v$ is uniformly distributed on $[k]$, so is $k - v + 1$. The distribution of $z \in V$ is determined by
\[
p(z) \propto
\begin{cases}
A(z, x_t(2)) = A(z, \bar{x}_t(k - 1)) & \text{if } v = 1\,, \\
A(x_t(v - 1), z)\, A(z, x_t(v + 1)) = A(\bar{x}_t(k - v), z)\, A(z, \bar{x}_t(k - v + 2)) & \text{if } v \in \{2, \ldots, k - 1\}\,, \\
A(x_t(k - 1), z) = A(\bar{x}_t(2), z) & \text{if } v = k\,.
\end{cases}
\]
Because $x'_t = \bar{x}_t$, it follows that $z$ is distributed as the conditional distribution (10) of $x'_{t+1}(k - v + 1)$, as desired.

For the two Glauber chains, $(x_t)_{t \geq 0}$ and $(x'_t)_{t \geq 0}$, we observe that
\[
X^{(N)}(x'_t) = \overline{X^{(N)}(x_t)} \quad (21)
\]
almost surely, where the overline denotes the application of $\mathrm{vec} \circ \mathrm{rev} \circ \mathrm{reshape}$ to each column (as in the definition of Flip). This result follows from the facts that $x'_t = \bar{x}_t$ for all $t \geq 0$ and $\mathrm{rev}(A_x) = A_{\bar{x}}$ for all $x \in \Omega$. (See (2) for the definition of $A_x$.) From this, we note that
\[
f^{(1)}(W) = \mathbb{E}_{x \sim \pi}\Bigl[ \ell\bigl(X^{(N)}(x), W\bigr) \,\Big|\, x \in \Omega_1 \Bigr] \quad (22)
\]
\[
= \mathbb{E}_{x \sim \pi}\Bigl[ \ell\bigl(\overline{X^{(N)}(x)}, \overline{W}\bigr) \,\Big|\, x \in \Omega_1 \Bigr]
= \mathbb{E}_{x \sim \pi}\Bigl[ \ell\bigl(X^{(N)}(\bar{x}), \overline{W}\bigr) \,\Big|\, x \in \Omega_1 \Bigr]
\]
\[
= \mathbb{E}_{x \sim \pi}\Bigl[ \ell\bigl(X^{(N)}(x), \overline{W}\bigr) \,\Big|\, x \in \Omega_2 \Bigr]
= f^{(2)}(\overline{W})\,.
\]
The first and the last equalities use the second equality in (17). The second equality uses the fact that $\ell(X, W) = \ell(\overline{X}, \overline{W})$. The third equality follows from (21). The fourth equality follows from the change of variables $x \mapsto \bar{x}$ and the fact that $x \sim \mu_1$ if and only if $\bar{x} \sim \mu_2$ (see (20)).

We now complete the proof of (ii). Its first part follows immediately from (i) and the above construction of the Glauber chains $x_t$ and $x'_t$ that satisfy $x'_t = \bar{x}_t$ for all $t \geq 0$. Specifically, let $W_t = W_t(x_0)$ and $W'_t = W'_t(x'_0)$ denote the sequences of dictionary matrices that we compute using Algorithm 1 with initial homomorphisms $x_0$ and $x'_0$, respectively. Suppose that $x_0 \in \Omega_1$, from which we see that $x'_0 = \bar{x}_0 \in \Omega_2$. By (i), $W_t$ and $W'_t$ converge almost surely to the sets of stationary points of the associated conditional expected loss functions $f^{(1)}$ and $f^{(2)}$, respectively. We complete the proof of the first part of (ii) by observing that, almost surely,
\[
W'_t = \overline{W}_t \quad \text{for all } t \geq 1\,. \quad (23)
\]
The second part of (ii) follows immediately from (22).

We still need to verify (23). Roughly, the argument is that all $k \times k$ mesoscale patches $A_{\bar{x}_t} = \mathrm{rev}(A_{x_t})$ have the reversed row and column ordering from the original ordering, so the $k \times k$ latent motifs that we train on such matrices also have the reversed ordering of rows and columns. More concretely, one can check this claim by induction on $t$ together with (21) and the uniqueness assumption (b). We omit the details. □
Remark G.6. Our proofs of Theorems G.2 and G.5 do not depend on the particular choice of mask $\Phi_{F, x}$ that we use to define the mesoscale patches $A_x$ in (2).
G.3. Convergence of the NDR algorithm.
We prove convergence results for the NDR algorithm (see Algorithm 2) in Theorem G.7. Specifically, we show that the reconstructed network that we obtain using Algorithm 2 at iteration $t$ converges almost surely to some limiting network as $t \to \infty$, and we give a closed-form expression for the limiting network. We also give a bound for the 'distance' between the original network and the limiting reconstructed network in terms of a 'fitness' of the network dictionary that we use in the reconstruction algorithm. We measure this fitness using the expected 1-norm between the sampled $k \times k$ mesoscale patch and its best nonnegative linear approximation using our network dictionary.

Let denoising denote the Boolean variable in Algorithm 2. Fix a network $G = (V, A)$, the $k$-chain motif $F = ([k], A_F)$, and a homomorphism $x \colon F \to G$. Let $\Phi_{F, x} \in \{0, 1\}^{k \times k}$ denote the no-folding mask that we defined in (4). For each matrix $B \colon V \times V \to [0, \infty)$ and a node map $x \colon [k] \to V$, define the $k \times k$ matrix $B_x$ by
\[
B_x(a, b) := B(x(a), x(b))\, \Phi_{F, x}(a, b) \quad \text{for all } a, b \in \{1, \ldots, k\}\,.
\]
If $B = A$, then $B_x = A_x$ equals the mesoscale patch of $G$ that is induced by $x$ (see (2)). Additionally, given a network $G = (V, A)$, a motif $F = ([k], A_F)$, a homomorphism $x \colon F \to G$, and a nonnegative matrix $W \in \mathbb{R}^{k^2 \times r}_{\geq 0}$, let $\hat{A}_{x; W}$ denote the $k \times k$ matrix that we defined in line 11 of Algorithm 2. This matrix depends on the Boolean variable denoising. Recall that $\hat{A}_{x; W}$ is a nonnegative linear approximation of $A_x$ that uses $W$. We introduce the event $(p, q) \xleftarrow{x} (a, b)$ using the following indicator function:
\[
\mathbf{1}\bigl((p, q) \xleftarrow{x} (a, b)\bigr) := \mathbf{1}\bigl(x(a) = p,\, x(b) = q\bigr)\, \mathbf{1}\bigl(\text{denoising} = \text{False or } A_F(a, b) = 0\bigr)\, \Phi_{F, x}(a, b)\,, \quad (24)
\]
where $\Phi_{F, x}$ is the no-folding mask that we defined in (4). For each homomorphism $x \colon F \to G$ and $p, q \in V$, we say that the pair $(p, q)$ is visited by $(a, b)$ through $x$ whenever the indicator on the left-hand side of (24) is $1$. Additionally,
\[
N_{pq}(x) := \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}\bigl((p, q) \xleftarrow{x} (a, b)\bigr) \quad (25)
\]
is the total number of visits to $(p, q)$ through $x$. When $N_{pq}(x) > 0$, we say that the pair $(p, q)$ is visited by $x$. In Algorithm 2, observe that both $A_{\mathrm{count}}(p, q)$ and $A_{\mathrm{recons}}(p, q)$ change at iteration $t$ if and only if $N_{pq}(x_t) > 0$. Finally,
\[
\Omega_{pq} := \bigl\{ x \colon F \to G \,\big|\, N_{pq}(x) > 0 \bigr\} \quad (26)
\]
is the set of all homomorphisms $x \colon F \to G$ that visit the pair $(p, q)$.

Theorem G.7 (Convergence of the NDR Algorithm (see Algorithm 2) for Non-Bipartite Networks). Let $F = ([k], A_F)$ be the $k$-chain motif, and fix a network $G = (V, A)$ and a network dictionary $W \in \mathbb{R}^{k^2 \times r}_{\geq 0}$. We use Algorithm 2 with inputs $G$, $F$, and $W$ and the parameter value $T = \infty$. Let $\hat{G}_t = (V, \hat{A}_t)$ denote the network that we reconstruct at iteration $t$, and suppose that $G$ satisfies assumption (a) of Theorem G.2. Let $\pi = \hat{\pi}_{F \to G}$ if MCMC = PivotApprox and $\pi = \pi_{F \to G}$ otherwise. The following statements hold:
(i) The network $\hat{G}_t$ converges almost surely to some limiting network $\hat{G}_\infty = (V, \hat{A}_\infty)$ in the sense that $\lim_{t \to \infty} \hat{A}_t(p, q) = \hat{A}_\infty(p, q) \in [0, \infty)$ almost surely for all $p, q \in V$.
(ii)
Let $\hat{A}_\infty$ denote the limiting matrix in (i). For each $p, q \in V$, we then have that
\[
\hat{A}_\infty(p, q) = \sum_{y \in \Omega_{pq}} \sum_{a, b \in \{1, \ldots, k\}} \hat{A}_{y; W}(a, b)\, \mathbf{1}\bigl((p, q) \xleftarrow{y} (a, b)\bigr)\, \frac{\pi(y)}{\mathbb{E}_{x \sim \pi}[N_{pq}(x)]}\,. \quad (27)
\]
(iii) Let $\hat{A}_\infty$ be as in (ii). For any network $G' = (V, B)$, we have
\[
\sum_{p, q \in V} \bigl| B(p, q) - \hat{A}_\infty(p, q) \bigr|\, \mathbb{E}_{x \sim \pi}[N_{pq}(x)] \leq
\begin{cases}
\mathbb{E}_{x \sim \pi}\bigl[ \|B_x - \hat{A}_{x; W}\|_1 \bigr] & \text{if denoising = False}\,, \\
\mathbb{E}_{x \sim \pi}\bigl[ \|B_x - \hat{A}_{x; W}\|_{1, F} \bigr] & \text{if denoising = True}\,,
\end{cases}
\]
where $\|R\|_{1, F} := \sum_{1 \leq a, b \leq k} |R(a, b)|\, \mathbf{1}(A_F(a, b) = 0)$ for each $R \in \mathbb{R}^{k \times k}$.
(iv) Let $\hat{A}_\infty$ be as in (ii) and suppose that denoising = False. It then follows that
\[
\sum_{p, q \in V} \bigl| A(p, q) - \hat{A}_\infty(p, q) \bigr|\, \mathbb{E}_{x \sim \pi}[N_{pq}(x)] \leq \sqrt{k}\; \mathbb{E}\Bigl[ \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr] \quad (28)
\]
\[
\leq 2\sqrt{k}\, \Bigl( \mathbb{E}_{x \sim \pi}\bigl[ \ell(\mathrm{vec}(A_x), W) \bigr] \sup_{x \colon F \to G} \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr)^{1/3}\,. \quad (29)
\]
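The following sketch makes the bookkeeping in (24)–(26) concrete (our own illustration; `x` is a homomorphism stored as an integer array, `A_F` is the adjacency matrix of the $k$-chain motif, and `Phi` is the no-folding mask from (4)):

    import numpy as np

    def visit_counts(x, A_F, Phi, n, denoising=False):
        # N_pq(x) from (25): count the pairs (a, b) that visit (p, q)
        # through x, i.e., that satisfy the indicator in (24).
        k = len(x)
        N = np.zeros((n, n), dtype=int)
        for a in range(k):
            for b in range(k):
                if (not denoising or A_F[a, b] == 0) and Phi[a, b] > 0:
                    N[x[a], x[b]] += 1
        return N   # the pair (p, q) is visited by x if N[p, q] > 0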
Proof. Let $x$ denote a random homomorphism $F \to G$ with distribution $\pi$, and let $\mathbb{P}$ and $\mathbb{E}$ denote the associated probability measure and expectation, respectively.

We first verify (i) and (ii) simultaneously. Let $(x_t)_{t \geq 0}$ denote the Markov chain that we generate during the reconstruction process (see Algorithm 2). We fix $p, q \in V$ and let
\[
M_t := \sum_{s=1}^t \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}\bigl((p, q) \xleftarrow{x_s} (a, b)\bigr) = \sum_{s=1}^t N_{pq}(x_s)\,,
\]
where we defined the indicator $\mathbf{1}\bigl((p, q) \xleftarrow{x} (a, b)\bigr)$ in (24). The key observation is that
\[
\hat{A}_t(p, q) = \frac{1}{M_t} \sum_{s=1}^t \sum_{a, b \in \{1, \ldots, k\}} \hat{A}_{x_s; W}(a, b)\, \mathbf{1}\bigl((p, q) \xleftarrow{x_s} (a, b)\bigr) \quad (30)
\]
\[
= \sum_{y \in \Omega_{pq}} \sum_{a, b \in \{1, \ldots, k\}} \hat{A}_{y; W}(a, b)\, \mathbf{1}\bigl((p, q) \xleftarrow{y} (a, b)\bigr)\, \frac{t}{M_t}\, \frac{1}{t} \sum_{s=1}^t \mathbf{1}(x_s = y)\,.
\]
With assumption (a), the Markov chain $(x_t)_{t \geq 0}$ of homomorphisms $F \to G$ is irreducible and aperiodic with $\pi$ (see (1)) as its unique stationary distribution. By the Markov-chain ergodic theorem (see, e.g., [5, Theorem 6.2.1 and Example 6.2.4] or [38, Theorem 17.1.7]), it follows that
\[
\lim_{t \to \infty} \frac{t}{M_t}\, \frac{1}{t} \sum_{s=1}^t \mathbf{1}(x_s = y) = \frac{\mathbb{P}(x = y)}{\mathbb{E}[N_{pq}(x)]}\,.
\]
This proves both (i) and (ii).

We now verify (iii). For each $a, b \in \{1, \ldots, k\}$ and $p, q \in V$, let $\Omega_{ab \to pq}$ denote the set of all homomorphisms $x \colon F \to G$ such that $\mathbf{1}\bigl((p, q) \xleftarrow{x} (a, b)\bigr) = 1$. By changing the order of the sums, we rewrite the formula in (27) as
\[
\hat{A}_\infty(p, q) = \sum_{a, b \in \{1, \ldots, k\}} \sum_{y \in \Omega_{ab \to pq}} \hat{A}_{y; W}(a, b)\, \frac{\mathbb{P}(x = y)}{\mathbb{E}[N_{pq}(x)]}\,.
\]
For each $a, b \in \{1, \ldots, k\}$, define the indicator function
\[
\mathbf{1}_{ab} := \mathbf{1}\bigl(\text{denoising} = \text{False or } A_F(a, b) = 0\bigr)\,.
\]
Observe that
\[
\mathbb{E}[N_{pq}(x)] = \sum_{a, b \in \{1, \ldots, k\}} \mathbb{E}\bigl[ \mathbf{1}(x(a) = p,\, x(b) = q)\, \mathbf{1}_{ab}\, \Phi_{F, x}(a, b) \bigr] = \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}_{ab} \sum_{y \in \Omega_{ab \to pq}} \mathbb{P}(x = y)\,.
\]
Indeed, the indicator $\mathbf{1}_{ab}$ does not depend on the homomorphism $y$, and $\Phi_{F, y}(a, b) = 1$ if $y \in \Omega_{ab \to pq}$.
We then calculate
\[
\sum_{p, q \in V} \bigl| B(p, q) - \hat{A}_\infty(p, q) \bigr|\, \mathbb{E}[N_{pq}(x)]
\]
\[
= \sum_{p, q \in V} \Bigl| B(p, q)\, \mathbb{E}[N_{pq}(x)] - \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}_{ab} \sum_{y \in \Omega_{ab \to pq}} \hat{A}_{y; W}(a, b)\, \mathbb{P}(x = y) \Bigr|
\]
\[
= \sum_{p, q \in V} \Bigl| \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}_{ab} \sum_{y \in \Omega_{ab \to pq}} \bigl( B(p, q) - \hat{A}_{y; W}(a, b) \bigr)\, \mathbb{P}(x = y) \Bigr|
\]
\[
\leq \sum_{p, q \in V} \sum_{a, b \in \{1, \ldots, k\}} \sum_{y \in \Omega_{ab \to pq}} \bigl| B(y(a), y(b))\, \mathbf{1}_{ab} - \hat{A}_{y; W}(a, b)\, \mathbf{1}_{ab} \bigr|\, \mathbb{P}(x = y)
\]
\[
= \sum_{p, q \in V} \sum_{a, b \in \{1, \ldots, k\}} \sum_{y \in \Omega} \bigl| B(y(a), y(b))\, \mathbf{1}_{ab}\, \Phi_{F, y}(a, b) - \hat{A}_{y; W}(a, b)\, \mathbf{1}_{ab}\, \Phi_{F, y}(a, b) \bigr|\, \mathbb{P}(x = y)\, \mathbf{1}(y(a) = p,\, y(b) = q)
\]
\[
= \sum_{y \in \Omega} \mathbb{P}(x = y) \sum_{a, b \in \{1, \ldots, k\}} \bigl| B_y(a, b)\, \mathbf{1}_{ab} - \hat{A}_{y; W}(a, b)\, \mathbf{1}_{ab} \bigr| \sum_{p, q \in V} \mathbf{1}(y(a) = p,\, y(b) = q)
\]
\[
= \sum_{y \in \Omega} \sum_{a, b \in \{1, \ldots, k\}} \bigl| B_y(a, b)\, \mathbf{1}_{ab} - \hat{A}_{y; W}(a, b)\, \mathbf{1}_{ab} \bigr|\, \pi(y)\,.
\]
This verifies (iii).

Finally, we prove (iv). First, by the Cauchy–Schwarz inequality,
\[
\mathbb{E}\bigl[ \|A_x - \hat{A}_{x; W}\|_1 \bigr] \leq \sqrt{k}\; \mathbb{E}\bigl[ \|A_x - \hat{A}_{x; W}\|_F \bigr] \leq \sqrt{k}\; \mathbb{E}\Bigl[ \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr]\,, \quad (31)
\]
where $\ell$ denotes the loss function that we defined in (15) and $\mathrm{vec}$ denotes the vectorization operator in Algorithm A4. By Markov's inequality, we have for each $\delta > 0$ that
\[
\mathbb{P}\bigl( \ell(\mathrm{vec}(A_x), W) \geq \delta \bigr) \leq \frac{\mathbb{E}[\ell(\mathrm{vec}(A_x), W)]}{\delta}\,.
\]
We define the notation $M := \sup_{x \colon F \to G} \sqrt{\ell(\mathrm{vec}(A_x), W)}$, which is finite because $\ell(\mathrm{vec}(A_x), W) \leq \|A_x\|_F^2$ and there are only finitely many homomorphisms $x \colon F \to G$. By conditioning on whether or not $\ell(\mathrm{vec}(A_x), W) \geq \delta$, it follows that
\[
\mathbb{E}\Bigl[ \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr] \leq \sqrt{\delta}\; \mathbb{P}_{x \sim \pi}\bigl( \ell(\mathrm{vec}(A_x), W) < \delta \bigr) + \frac{\mathbb{E}[\ell(\mathrm{vec}(A_x), W)]}{\delta}\, M \quad (32)
\]
\[
\leq \sqrt{\delta} + \frac{\mathbb{E}[\ell(\mathrm{vec}(A_x), W)]}{\delta}\, M\,.
\]
The last expression in (32) is minimized when $\delta = \bigl(2 M\, \mathbb{E}[\ell(\mathrm{vec}(A_x), W)]\bigr)^{2/3}$. This yields
\[
\mathbb{E}\Bigl[ \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr] \leq \bigl( 2^{1/3} + 2^{-2/3} \bigr)\, \bigl( \mathbb{E}[\ell(\mathrm{vec}(A_x), W)]\, M \bigr)^{1/3}\,.
\]
Noting that $2^{1/3} + 2^{-2/3} < 2$ and combining (31) and (32) with (iii) then verifies (iv). □
Remark G.8. Suppose that $G' = G$ and denoising = False in Theorem G.7 (iv). The left-hand side of (28) is a measure of the difference between the original network $G = (V, A)$ and the limiting reconstructed network $\hat{G}_\infty = (V, \hat{A}_\infty)$ that we compute using the NDR algorithm (see Algorithm 2) with a network dictionary $W \in \mathbb{R}^{k^2 \times r}_{\geq 0}$. Recall that the columns of $W$ encode $r$ latent motifs $L_1, \ldots, L_r \in \mathbb{R}^{k \times k}_{\geq 0}$ (see Section B.4). According to (28), $G = \hat{G}_\infty$ if the right-hand side of (28) is $0$. This is the case if $\sup_{x \colon F \to G} \ell(\mathrm{vec}(A_x), W) = 0$, which means that $W$ can perfectly approximate all mesoscale patches $A_x$ of $G$. However, the right-hand side of (28) can still be small if the worst-case approximation error $\sup_{x \colon F \to G} \ell(\mathrm{vec}(A_x), W)$ is large but the expected approximation error $\mathbb{E}_{x \sim \pi}[\ell(\mathrm{vec}(A_x), W)]$ is small (i.e., when $W$ is effective at approximating most of the mesoscale patches).

How can we find a network dictionary $W$ that minimizes the right-hand side of (28)? Although it is difficult to find a globally optimal network dictionary $W$ that minimizes the non-convex objective function in the right-hand side of (28), Theorems G.2 and G.5 guarantee that our NDL algorithm (see Algorithm 1) always finds a locally optimal network dictionary. Indeed, from these theorems, the NDL algorithm with $N = 1$ computes a network dictionary $W$ that is approximately a local optimum of the following expected loss function:
\[
f(W) = \mathbb{E}_{x \sim \pi}\bigl[ \ell(\mathrm{vec}(A_x), W) \bigr]\,, \quad (33)
\]
where $\pi = \hat{\pi}_{F \to G}$ if MCMC = PivotApprox and $\pi = \pi_{F \to G}$ otherwise. The function $f$ in (33) appears in the upper bound in (29). In our experiments, we found that our NDL algorithm produces network dictionaries that are effective at minimizing the reconstruction error in the left-hand side of (28); see, e.g., Figure 3.

Another implication of Theorem G.7 (iii) is also relevant to network denoising (see Figure 4). Suppose that we have an uncorrupted network $G' = (V, B)$ and a corrupted network $G = (V, A)$. Additionally, suppose that we have trained the network dictionary $W$ on the uncorrupted network $G'$, but that we use it to reconstruct the corrupted network $G$. Although $\hat{A}_{x; W}$ is a nonnegative linear approximation of the $k \times k$ matrix $A_x$ of a mesoscale patch of the corrupted network $G$, it may be close to the corresponding mesoscale patch $B_x$ of the uncorrupted network $G'$, because we used the network dictionary $W$ that we learned from the uncorrupted network $G'$. In that case, Theorem G.7 (iii) guarantees that the network $\hat{G}_\infty$ that we reconstruct for the corrupted network $G$ using the uncorrupted-network dictionary $W$ is close to the uncorrupted network $G'$.
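In practice, one can estimate the expected loss (33) by averaging the nonnegative least-squares error over sampled mesoscale patches. A Monte Carlo sketch (assuming a user-supplied sampler `sample_patch` that returns one $k \times k$ mesoscale patch per call; both names are ours):

    import numpy as np
    from scipy.optimize import nnls

    def estimate_expected_loss(sample_patch, W, n_samples=1000):
        # Monte Carlo estimate of f(W) = E[ell(vec(A_x), W)] in (33),
        # with lambda = 0 so that ell is a nonnegative least-squares error.
        total = 0.0
        for _ in range(n_samples):
            A_x = sample_patch()
            _, residual = nnls(W, A_x.flatten(order="F"))
            total += residual**2
        return total / n_samples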
Remark G.9. The update step (see line 17) for global reconstruction in Algorithm 2 indicates that we loop over all node pairs $(a, b)$ in the $k$-chain motif and that we update the weight of the edge $(x_t(a), x_t(b))$ in the reconstructed network using the homomorphism $x_t \colon F \to G$. There may be multiple node pairs $(a, b)$ in $F$ that contribute to the edge $(p, q)$ in the reconstructed network, because $x_t(a) = p$ and $x_t(b) = q$ can occur for multiple choices of $(a, b)$. The output of this update step does not depend on the ordering of $a, b \in \{1, \ldots, k\}$, as one can see from the expressions in (30).

One can also consider the following alternative update step for global reconstruction. We first choose two nodes $p, q$ of the reconstructed network in the image $\{x_t(j) \mid j \in \{1, \ldots, k\}\}$ of the homomorphism $x_t$ and average over all pairs $(a, b) \in [k]^2$ such that $(p, q)$ is visited by $(a, b)$ through $x_t$, and we then update the weight of $(p, q)$ in the reconstructed network with this mean contribution from $x_t$. Specifically, for each $a, b \in \{1, \ldots, k\}$, let $\mathbf{1}\bigl((p, q) \xleftarrow{x_t} (a, b)\bigr)$ denote the indicator that we defined in (24), and let $N_{pq}(x_t) \geq 0$ be the number of visits of $x_t$ to $(p, q)$ (see (25)). We can then replace line 17 in Algorithm 2 with the following lines:

Alternative update for global reconstruction: For $p, q \in V$ such that $N_{pq}(x_t) > 0$:
\[
\widetilde{A}_{x_t; W}(p, q) \leftarrow \frac{\sum_{1 \leq a, b \leq k} \hat{A}_{x_t; W}(a, b)\, \mathbf{1}\bigl((p, q) \xleftarrow{x_t} (a, b)\bigr)}{\sum_{1 \leq a, b \leq k} \mathbf{1}\bigl((p, q) \xleftarrow{x_t} (a, b)\bigr)}\,,
\]
\[
j \leftarrow A_{\mathrm{count}}(p, q) + 1\,,
\]
\[
A_{\mathrm{recons}}(p, q) \leftarrow (1 - j^{-1})\, A_{\mathrm{recons}}(p, q) + j^{-1}\, \widetilde{A}_{x_t; W}(p, q)\,.
\]
For the alternative NDR algorithm that we just described (see also the sketch below), we can establish a convergence result that is analogous to Theorem G.7, using a similar argument as the one in our proof of that theorem. Specifically, (i) holds for the alternative NDR algorithm, so there exists a limiting reconstructed network. In the proof of (ii), the formula for the limiting reconstructed network is now
\[
\hat{A}_\infty(p, q) = \sum_{y \in \Omega_{pq}} \widetilde{A}_{y; W}(p, q)\, \mathbb{P}_{x \sim \pi}\bigl( x = y \,\big|\, x \in \Omega_{pq} \bigr) \quad \text{for all } p, q \in V\,,
\]
where $\Omega_{pq}$ is the set of all homomorphisms that visit $(p, q)$ (see (26)). In particular, if $G$ is an undirected and binary graph, then
\[
\hat{A}_\infty(p, q) = \frac{1}{|\Omega_{pq}|} \sum_{y \in \Omega_{pq}} \widetilde{A}_{y; W}(p, q) \quad \text{for all } p, q \in V\,.
\]
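A sketch of this alternative update in Python (our own illustration; `A_recons` and `A_count` are dictionaries keyed by node pairs, and `A_hat` is the $k \times k$ approximation $\hat{A}_{x_t; W}$ from line 11 of Algorithm 2):

    import numpy as np

    def alternative_global_update(A_recons, A_count, A_hat, x, A_F, Phi,
                                  denoising=False):
        k = len(x)
        num, den = {}, {}
        for a in range(k):
            for b in range(k):
                # the indicator (24)
                if (not denoising or A_F[a, b] == 0) and Phi[a, b] > 0:
                    pq = (x[a], x[b])
                    num[pq] = num.get(pq, 0.0) + A_hat[a, b]
                    den[pq] = den.get(pq, 0) + 1
        for pq, n_visits in den.items():       # pairs with N_pq(x_t) > 0
            mean_contribution = num[pq] / n_visits
            j = A_count.get(pq, 0) + 1
            A_count[pq] = j
            A_recons[pq] = ((1 - 1.0 / j) * A_recons.get(pq, 0.0)
                            + mean_contribution / j)
        return A_recons, A_count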
In the proofs of (iii) and (iv), the same error bounds hold with $\mathbb{E}_{x \sim \pi}[N_{pq}(x)]$ replaced by $\mathbb{P}_{x \sim \pi}(x \in \Omega_{pq})$. We omit the details of the proofs of the above statements for this alternative NDR algorithm.

We now discuss convergence results of Algorithm 2 for a bipartite network $G$. Recall the notation and discussions about bipartite networks above Theorem G.5. Additionally, recall for our bipartite networks that there exist disjoint subsets $\Omega_1$ and $\Omega_2$ of the set $\Omega$ of all homomorphisms $F \to G$ such that (1) $\Omega = \Omega_1 \cup \Omega_2$ and (2) the Markov chain $(x_t)_{t \geq 0}$ restricted to each $\Omega_i$ (with $i \in \{1, 2\}$) is irreducible but is not irreducible on the set $\Omega$.

Theorem G.10 (Convergence of the NDR Algorithm (see Algorithm 2) for Bipartite Networks). Let $F = ([k], A_F)$ be the $k$-chain motif, and let $G = (V, A)$ be a network that satisfies assumption (a') in Theorem G.5. Let $\hat{G}_t = (V, \hat{A}_t)$ denote the network that we reconstruct using Algorithm 2 at iteration $t$ with a fixed network dictionary $W \in \mathbb{R}^{k^2 \times r}_{\geq 0}$. Fix $i \in \{1, 2\}$ and an initial homomorphism $x_0 \in \Omega_i$. Let $\pi = \hat{\pi}_{F \to G}$ if MCMC = PivotApprox and $\pi = \pi_{F \to G}$ otherwise. The following properties hold:
(i) The network $\hat{G}_t$ converges almost surely to some limiting network $\hat{G}_\infty = (V, \hat{A}_\infty)$ in the sense that $\lim_{t \to \infty} \hat{A}_t(p, q) = \hat{A}_\infty(p, q)$ almost surely for all $p, q \in V$.
(ii)–(iv)
The same statements as in Theorem G.7 (ii)–(iv) hold with the expectation $\mathbb{E}_{x \sim \pi}$ replaced by the conditional expectation $\mathbb{E}_{x \sim \pi}[\,\cdot \mid x \in \Omega_i]$.
(v) The results in (ii)–(iv) do not depend on $i \in \{1, 2\}$ if $k$ is even.

Proof. The proofs of statements (i)–(iv) are identical to those for Theorem G.7. Statement (v) follows from a similar argument as in the proof of Theorem G.5 (ii) by constructing coupled Markov chains $(x_t)_{t \geq 0}$ and $(x'_t)_{t \geq 0}$ such that $x'_t = \bar{x}_t$ for all $t \geq 0$. □
Remark G.11. Our proofs of Theorems G.7 and G.10 do not depend on the particular choice of mask $\Phi_{F, x}$ that we use to define the mesoscale patches $A_x$ in (2).
Remark G.12. In Theorem G.10 (ii), let $\hat{G}^{(i)}_\infty = (V, \hat{A}^{(i)}_\infty)$ denote the limiting reconstructed network for $G$ conditional on the Markov chain being initialized in $\Omega_i$ for $i \in \{1, 2\}$. When $k$ is even, Theorem G.10 (v) implies that $\hat{G}^{(1)}_\infty = \hat{G}^{(2)}_\infty$. When $k$ is odd, we run the NDR algorithm (see Algorithm 2) twice, with the Markov chain initialized in both $\Omega_1$ and $\Omega_2$. We then define the network $\hat{G}_\infty := (V, (\hat{A}^{(1)}_\infty + \hat{A}^{(2)}_\infty)/2)$, whose weight matrix is the mean of those of the two limiting reconstructed networks $\hat{G}^{(i)}_\infty$ for $i \in \{1, 2\}$. We obtain a similar error bound as in Theorem G.10 (iii) for this mean limiting reconstructed network. In practice, one can obtain a sequence of reconstructed networks that converges to the mean reconstructed network $\hat{G}_\infty$ by reinitializing the Markov chain every $\tau$ iterations of the reconstruction procedure for any fixed $\tau$.

Appendix H. Auxiliary Algorithms
We now present the auxiliary algorithms that we use to solve subproblems of Algorithms 1 and 2. Let $\Pi_S$ denote the projection operator onto a subset $S$ of the ambient space. For each matrix $A$, let $[A]_{\bullet i}$ (respectively, $[A]_{i \bullet}$) denote the $i$th column (respectively, $i$th row) of $A$.

Algorithm A1. Coding
Input: Data matrix $X \in \mathbb{R}^{d \times b}$; dictionary matrix $W \in \mathbb{R}^{d \times r}$
Parameters: $T \in \mathbb{N}$ (the number of iterations); $\lambda \geq 0$ (the coefficient of an $L_1$-regularizer); $\mathcal{C}^{\mathrm{code}} \subseteq \mathbb{R}^{r \times b}$ (convex constraint set of codes)
Initialize $H \in \mathcal{C}^{\mathrm{code}}$ (e.g., $H = 0$)
For $t = 1, \ldots, T$: Do:
\[
H \leftarrow \Pi_{\mathcal{C}^{\mathrm{code}}}\left( H - \frac{1}{\mathrm{tr}(W^T W)} \left( W^T W H - W^T X + \lambda J \right) \right),
\]
where $J \in \mathbb{R}^{r \times b}$ is the matrix with all entries equal to $1$
Output: $H \in \mathcal{C}^{\mathrm{code}} \subseteq \mathbb{R}^{r \times b}$
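A direct NumPy transcription of Algorithm A1 (a sketch; we take $\mathcal{C}^{\mathrm{code}}$ to be the nonnegative orthant, which is the relevant choice for NDL):

    import numpy as np

    def coding(X, W, T=100, lam=1.0):
        # Solve min_{H >= 0} ||X - W H||_F^2 + lam * ||H||_1 by projected
        # gradient descent with step size 1 / tr(W^T W).
        r, b = W.shape[1], X.shape[1]
        H = np.zeros((r, b))
        WtW, WtX = W.T @ W, W.T @ X
        step = 1.0 / np.trace(WtW)
        for _ in range(T):
            grad = WtW @ H - WtX + lam        # "+ lam" adds lam * J entrywise
            H = np.maximum(H - step * grad, 0.0)   # projection onto C_code
        return H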
Algorithm A2. Dictionary-Matrix Update
Input: Previous dictionary matrix $W_{t-1} \in \mathcal{C}^{\mathrm{dict}} \subseteq \mathbb{R}^{k^2 \times r}$; previous aggregate matrices $(P_t, Q_t) \in \mathbb{R}^{r \times r} \times \mathbb{R}^{r \times k^2}$
Parameters: $\mathcal{C}^{\mathrm{dict}} \subseteq \mathbb{R}^{k^2 \times r}$ (compactness and convexity constraint for dictionary matrices); $T \in \mathbb{N}$ (the number of iterations)
$W \leftarrow W_{t-1}$
For $t = 1, \ldots, T$:
  For $j = 1, 2, \ldots, r$:
\[
W(:, j) \leftarrow \Pi_{\mathcal{C}^{\mathrm{dict}}}\left( W(:, j) - \frac{1}{P_t(j, j) + 1} \left( W P_t(:, j) - Q_t^T(:, j) \right) \right)
\]
Output: $W_t = W \in \mathcal{C}^{\mathrm{dict}} \subseteq \mathbb{R}^{k^2 \times r}_{\geq 0}$
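A corresponding sketch of Algorithm A2 (our own transcription; the projection clips each column to be nonnegative and rescales it to norm at most 1, which is the exact projection onto the constraint set $\mathcal{C}^{\mathrm{dict}}$ that we defined above):

    import numpy as np

    def dictionary_update(W_prev, P_t, Q_t, T=1):
        W = W_prev.copy()
        for _ in range(T):
            for j in range(W.shape[1]):
                grad_j = W @ P_t[:, j] - Q_t.T[:, j]
                W[:, j] -= grad_j / (P_t[j, j] + 1.0)
                # Project column j onto C_dict: nonnegative, norm at most 1.
                W[:, j] = np.maximum(W[:, j], 0.0)
                norm_j = np.linalg.norm(W[:, j])
                if norm_j > 1.0:
                    W[:, j] /= norm_j
        return W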
Algorithm A3. Rejection Sampling of Homomorphisms
Input: Network $G = (V, A)$; motif $F = ([k], A_F)$
Requirement: There exists at least one homomorphism $F \to G$
Repeat:
  Sample $x = [x(1), x(2), \ldots, x(k)] \in V^{[k]}$ so that the quantities $x(i)$ are independent and identically distributed
  If $\prod_{i, j \in \{1, \ldots, k\}} A(x(i), x(j))^{A_F(i, j)} > 0$: Return $x \colon F \to G$ and Terminate
Output: Homomorphism $x \colon F \to G$
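A Python sketch of Algorithm A3 (we draw each $x(i)$ uniformly at random from $V$, which is the natural choice of i.i.d. proposal distribution):

    import numpy as np

    def rejection_sample_homomorphism(A, A_F, rng=None):
        rng = rng or np.random.default_rng()
        n, k = A.shape[0], A_F.shape[0]
        while True:
            x = rng.integers(0, n, size=k)   # i.i.d. uniform nodes
            # Accept if A(x(i), x(j)) > 0 whenever A_F(i, j) > 0, which is
            # equivalent to prod_{i,j} A(x(i), x(j))^{A_F(i,j)} > 0.
            if all(A[x[i], x[j]] > 0
                   for i in range(k) for j in range(k) if A_F[i, j] > 0):
                return x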
Algorithm A4. Vectorization
Input: Matrix $X \in \mathbb{R}^{k_1 \times k_2}$
Output: Matrix $Y \in \mathbb{R}^{k_1 k_2 \times 1}$, where $Y(k_1(j - 1) + i,\, 1) = X(i, j)$ for all $i \in \{1, \ldots, k_1\}$ and $j \in \{1, \ldots, k_2\}$
Algorithm A5. Reshaping
Input: Matrix $X \in \mathbb{R}^{k_1 k_2 \times 1}$; a pair $(k_1, k_2)$ of integers
Output: Matrix $Y \in \mathbb{R}^{k_1 \times k_2}$, where $Y(i, j) = X(k_1(j - 1) + i,\, 1)$ for all $i \in \{1, \ldots, k_1\}$ and $j \in \{1, \ldots, k_2\}$
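Algorithms A4 and A5 are column-major (Fortran-order) vectorization and reshaping and are mutually inverse; in NumPy (our illustration):

    import numpy as np

    X = np.arange(6).reshape(2, 3)          # a k1 x k2 matrix with k1 = 2, k2 = 3
    y = X.flatten(order="F")                # Algorithm A4: Y(k1*(j-1) + i) = X(i, j)
    X_back = y.reshape((2, 3), order="F")   # Algorithm A5 inverts Algorithm A4
    assert np.array_equal(X, X_back)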
Appendix I. Additional Figures
We show additional figures of our network dictionaries.

Figure 8.
The $r = 25$ latent motifs at scales $k = 6, 11, 21, 51, 101$ that we learn from the networks Caltech, MIT, UCLA, and Harvard. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for the details of these experiments.

Figure 9.
The $r = 25$ latent motifs at scales $k = 6, 11, 21, 51, 101$ that we learn from the networks Coronavirus PPI, SNAP Facebook, arXiv ASTRO-PH, and Homo sapiens PPI. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for the details of these experiments.
Figure 10.
The $r = 25$ latent motifs at scales $k = 6, 11, 21, 51, 101$ that we learn from the networks WS$_1$, WS$_2$, BA$_1$, and BA$_2$. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of the experiments.

Figure 11.
The latent motifs for $r \in \{9, 16, 25, 36, 49\}$ at scale $k = 21$ that we learn from the networks Caltech, MIT, UCLA, and Harvard. The $r = 25$ column is identical to the $k = 21$ column in Figure 8. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of the experiments.
Figure 12.
The latent motifs for $r \in \{9, 16, 25, 36, 49\}$ at scale $k = 21$ that we learn from the networks Coronavirus PPI, SNAP Facebook, arXiv ASTRO-PH, and Homo sapiens PPI. The $r = 25$ column is identical to the $k = 21$ column in Figure 9. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of the experiments.

Figure 13.
The latent motifs for $r \in \{9, 16, 25, 36, 49\}$ at scale $k = 21$ that we learn from the networks WS$_1$, WS$_2$, BA$_1$, and BA$_2$. The $r = 25$ column is identical to the $k = 21$ column in Figure 10. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of our experiments.

Figure 14. (Top two rows) The $r = 25$ latent motifs at scales $k = 6, 11, 21, 51, 101$ that we learn from the networks ER$_1$ and ER$_2$. (Bottom two rows) The latent motifs for $r \in \{9, 16, 25, 36, 49\}$ at scale $k = 21$ that we learn from the networks ER$_1$ and ER$_2$. The $k = 21$ column of the first two rows is identical to the $r = 25$ column of the last two rows.