Learning low-rank latent mesoscale structures in networks
Hanbaek Lyu, Yacoub H. Kureh, Joshua Vendrow, Mason A. Porter
Abstract.
It is common to use networks to encode the architecture of interactions between entities in complex systems in the physical, biological, social, and information sciences. Moreover, to study the large-scale behavior of complex systems, it is important to study mesoscale structures in networks as building blocks that influence such behavior [17, 43]. In this paper, we present a new approach for describing low-rank mesoscale structure in networks, and we illustrate our approach using several synthetic network models and empirical friendship, collaboration, and protein–protein interaction (PPI) networks. We find that these networks possess a relatively small number of 'latent motifs' that together can successfully approximate most subnetworks at a fixed mesoscale. We use an algorithm that we call "network dictionary learning" (NDL) [30], which combines a network-sampling method [29] and nonnegative matrix factorization [19, 30], to learn the latent motifs of a given network. The ability to encode a network using a set of latent motifs has a wide range of applications to network-analysis tasks, such as comparison, denoising, and edge inference. Additionally, using our new network denoising and reconstruction (NDR) algorithm, we demonstrate how to denoise a corrupted network by using only the latent motifs that one learns directly from the corrupted network itself.
Department of Mathematics, University of California, Los Angeles, CA 90095, USA
E-mail addresses: {hlyu, ykureh, jvendrow, mason}@math.ucla.edu. Our code for the main algorithms and simulations is available at https://github.com/HanbaekLyu/NDL_paper. We also provide a user-friendly version as a Python package, ndlearn; see https://github.com/jvendrow/Network-Dictionary-Learning.

It is often insightful to examine structures in networks [40] at an intermediate scale (i.e., at a 'mesoscale') that lies above the microscale of nodes and edges but below the macroscale of distributions of local network properties. There is a large variety of mesoscale structures in networks, including community structure [9, 47], core–periphery structure [49], and role structures [1]. We are interested in mesoscale network structures that are large enough that it is reasonable to discuss their collective properties, but that are also small enough that we can discuss their statistical properties. In this paper, we examine mesoscale network structures that we obtain using k-node induced subgraphs of networks. These subgraphs have k nodes that inherit their adjacency structure from the original networks from which we draw them. Because most real-world networks are sparse [40], independently choosing a set of k nodes from a network may not return meaningful information. Instead, we use motif sampling [29]: we first uniformly randomly sample a set of k nodes that form a path (this set is called a 'k-chain motif'), and we then obtain the subgraph that is induced by that k-chain motif by including all of the edges between those k nodes. This guarantees that we sample a connected subgraph of a network while assuming very little about the structure of the original network; by repeating this process, we obtain a data set of 'mesoscale patches' of a network. We then use 'dictionary-learning' algorithms [32] to learn mesoscale structures of networks that we call 'latent motifs'. We then use latent motifs to infer subgraph structures of networks, compare different networks, and denoise corrupted networks.

Dictionary-learning algorithms are machine-learning techniques that learn interpretable latent structures of complex data sets, and they are applied regularly in the data analysis of text and images [7, 34, 46]. Such algorithms usually consist of two steps. First, one samples a large number of structured subsets of a data set (e.g., square patches of an image or collections of a few sentences of a text); we refer to such a subset as a mesoscale patch of a data set. Second, one finds a set of basis elements such that taking a nonnegative linear combination of them can successfully approximate each of the sampled mesoscale patches. Such a set of basis elements is called a dictionary, and one can interpret each basis element as a latent structure of the data set. As an example, consider the image of the artwork
Cycle by M. C. Escher in Figure 1a. We first sample 10,000 square patches of 21 × 21 pixels, and we then use a nonnegative matrix factorization (NMF) [19] algorithm to find a dictionary with r = 25 square patches (see Figure 1a). Each element of the learned dictionary describes a latent shape in the image.

Algorithms for network dictionary learning (NDL) [30] use a similar idea. As mesoscale patches of a network, we use the k-node subgraphs that are induced by motif sampling. We represent these subgraphs using their k × k adjacency matrices. After obtaining sufficiently many mesoscale patches of a network, we apply NMF to learn a dictionary for the latent adjacency matrices, which we call latent motifs of the network. We give a complete implementation of our approach in Algorithm 1 of the Supplementary Information (SI). See our SI for more details, including the theoretical guarantees for Algorithm 1 in Theorems G.2 and G.5.
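The image example is easy to reproduce in outline. The following Python sketch is not the authors' released code; the image array img is a stand-in, while k = 21, n = 10,000, and r = 25 follow the text. It samples square patches and factorizes them with scikit-learn's NMF:

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
img = rng.random((512, 512))  # stand-in for a grayscale image (e.g., Escher's "Cycle")

k, n, r = 21, 10_000, 25
patches = np.empty((k * k, n))
for t in range(n):
    i = rng.integers(0, img.shape[0] - k)
    j = rng.integers(0, img.shape[1] - k)
    patches[:, t] = img[i:i + k, j:j + k].reshape(-1)  # vectorized k x k patch

# X ~ W H with nonnegative factors; the columns of W are latent shapes
model = NMF(n_components=r, init="random", random_state=0, max_iter=500)
H = model.fit_transform(patches.T)   # n x r coefficient matrix
W = model.components_.T              # (k*k) x r dictionary
latent_shapes = W.reshape(k, k, r)   # each column of W reshaped to a k x k patch

The analogous computation on vectorized k × k adjacency matrices, rather than pixel patches, yields latent motifs of a network.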
[Figure 1 schematic: a network and a motif feed an MCMC motif-sampling step (Lyu, Memoli, and Sivakoff), whose sampled mesoscale patches feed online matrix factorization for Markovian data (Lyu, Needell, and Balzano) to produce an interpretable low-rank dictionary. The panels preview network dictionaries for the UCLA and Caltech Facebook networks and an image dictionary for Cycle by M. C. Escher; panels are labeled a, b, c.]
Figure 1.
Illustration of mesoscale structures that we learn from (a) images and (b, c) networks. In all experiments in this figure, we form a matrix X of size d × n by sampling n mesoscale patches of size d = 21 × 21 from the corresponding object. For the image in panel (a), the columns of X are square patches of 21 × 21 pixels. In panels (b) and (c), we show both heat maps and adjacency matrices. We take the columns of X to be the k × k adjacency matrices of the connected subgraphs that are induced by a walk of k = 21 nodes, where a walk of k nodes consists of k nodes x_1, …, x_k such that x_i and x_{i+1} are adjacent for all i ∈ {1, …, k − 1}. Using nonnegative matrix factorization (NMF), we compute an approximate factorization X ≈ WH into nonnegative matrices, where W has r = 25 columns. Because of this factorization, we can approximate any sampled mesoscale patch (i.e., any column of X) of an object by a nonnegative linear combination of the columns of W, which we can interpret as latent shapes for images and as latent motifs (i.e., subgraphs) for networks, respectively. The network dictionaries of latent motifs that we learn from the (b) UCLA and (c) Caltech Facebook networks reveal distinctive social structures. For example, if we uniformly sample a chain of 21 friends in one of these networks, we observe for Caltech that there are likely to be communities with six or more nodes and also some 'hub' users who know most of the others in the sample. However, for UCLA, it is unlikely that there are such communities or hubs. In the heat map of the UCLA network, we show only the first 3000 nodes according to the node labeling in the data set.
In Figure 1, we demonstrate the NDL method for networks of Facebook friendships (which were collected on one day in fall 2005) from UCLA ("UCLA") and Caltech ("Caltech") [55, 56]. Each node in one of these networks is a Facebook account of an individual, and each edge encodes a Facebook friendship between two individuals. Using NDL, we learn 25 latent motifs from each of these two social networks using a chain motif with k = 21 nodes. In each latent motif in Figure 1, the two diagonal lines correspond to the edges of the k-chain motif. We refer to the remaining entries as 'off-chain' entries; one learns these from the subgraphs that are induced by the chain motif. The latent motifs in UCLA's dictionary (see Figure 1b) have sparse off-chain connections with a few clusters, whereas Caltech's dictionary (see Figure 1c) has relatively dense off-chain connections. Such latent motifs reveal distinct social structures in the two networks. For example, if we uniformly sample a chain of 21 friends in one of these networks, we observe for Caltech that there are likely to be communities with six or more nodes and also some 'hub' users who know most of the others in the sample. However, for UCLA, it is unlikely that there are such communities or hubs. See Figure 3.
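To make the sampling step concrete, the following sketch grows a 21-node walk edge by edge and records the adjacency matrix of the induced subgraph. It is a simplification for illustration only: a plain random walk started uniformly does not sample exactly from the motif-sampling distribution that NDL uses (the MCMC algorithms in the SI correct for this), and the graph G here is a stand-in.

import random
import numpy as np
import networkx as nx

def chain_patch(G, k):
    """Grow a k-node walk edge by edge, then return the k x k adjacency
    matrix of the subgraph induced on the walk's nodes (with repeats)."""
    x = [random.choice(list(G.nodes))]
    for _ in range(k - 1):
        nbrs = list(G.neighbors(x[-1]))
        if not nbrs:
            return None  # dead end; the caller should resample
        x.append(random.choice(nbrs))
    A = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            if a != b and G.has_edge(x[a], x[b]):
                A[a, b] = 1.0
    return A

# Example: mesoscale patches of an ER graph at scale k = 21
G = nx.erdos_renyi_graph(500, 0.05, seed=0)
patches = [P for P in (chain_patch(G, 21) for _ in range(1000)) if P is not None]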
Figure 2. Latent motifs that we learn from 14 networks (eight real-world networks and six synthetic networks, which include two distinct instantiations from each of three random-graph models) at five different scales (the smallest of which is k = 6), which reveal distinct mesoscale structures in the networks. Using network dictionary learning (NDL), we learn network dictionaries of r = 25 latent motifs of k nodes for each of the 14 networks. For each network at each scale, we show only the second-most-dominant latent motif from each dictionary. These motifs include more information than the most dominant motifs for these sparse networks. Black squares represent nonzero entries and white squares represent zero entries. See Section D.2 in our SI for details of how we measure latent-motif dominance, and see Figure 6 in our SI for the most dominant latent motifs of each network.

We examine mesoscale structures of eight real-world networks and six synthetic networks using NDL. The real-world networks are Facebook networks from
Caltech, UCLA, Harvard, and MIT [55, 56], SNAP Facebook (which we also denote by SNAP FB as a shorthand) [11, 24], arXiv ASTRO-PH (with a shorthand of arXiv) [11, 23], Coronavirus PPI (with a shorthand of Coronavirus) [10, 44, 52], and Homo sapiens PPI (with a shorthand of H. sapiens) [11, 44]. The first four networks are 2005 Facebook networks from four universities from the Facebook100 data set [56]. The fifth network is a 2012 Facebook network that was collected from survey participants [24]. The sixth network is a collaboration network based on coauthorship of preprints that were posted in the astrophysics category of the arXiv preprint server. The seventh network is a protein–protein interaction (PPI) network of proteins that are related to the coronaviruses that cause Coronavirus disease 2019 (COVID-19), Severe Acute Respiratory Syndrome (SARS), and Middle Eastern Respiratory Syndrome (MERS) [52]. The eighth network is a PPI network of proteins that are related to
Homo sapiens [44].

For the six synthetic networks, we generate two instances each of Erdős–Rényi (ER) G(N, p) networks [8], Watts–Strogatz (WS) networks [57], and Barabási–Albert (BA) networks [2]. These are three types of well-studied random-graph models [40]. Each of the ER networks has 5000 nodes, and we independently connect each pair of nodes with a fixed probability p, which differs between the two instances (which we call ER1 and ER2). For the WS networks, we use a different rewiring probability in each of WS1 and WS2, starting from a 5000-node ring network in which each node is adjacent to a fixed number of its nearest neighbors. For the BA networks, we use m = 25 (in BA1) and m = 50 (in BA2), where m denotes the number of edges of each new node when it connects (via linear preferential attachment) to the existing network, which we grow from an initial network of m individual nodes (i.e., none of them are adjacent to each other) until it has 5000 nodes. See Section F.1 in our SI for more details.

One can interpret the size (i.e., the number of nodes) k of a chain motif as a scale parameter. A k × k mesoscale patch of a network that one obtains by using the k-chain motif encodes connectivity between nodes that are at most k − 1 edges apart in the network. In Figure 2, we show the second-most-dominant latent motif (see Section D.2 and Figure 6 in our SI) that we learn from each of 14 networks (eight real-world networks and six synthetic networks) at five different scales when we use a dictionary with r = 25 latent motifs. The latent motifs differ drastically both across the different networks and across different scales.

Suppose that we are given a network G and a dictionary W of latent motifs at scale k, where we may or may not learn W from G. Consider the following two scenarios. In one scenario, we suppose that we know G exactly, and we ask how to measure the 'effectiveness' of W in approximating mesoscale patches of G at scale k. In the other scenario, we suppose that G is a noisy version of some true network G_true and that W is 'faithful' in the sense that it can well-approximate mesoscale patches of G_true at scale k. (See (6) in the SI for the precise definition.) We then ask how we can infer the true network G_true from G and W.

To examine the above questions, we develop an algorithm that we call network denoising and reconstruction (NDR) (see Algorithm 2) that takes a network G and a network dictionary W as input and outputs a weighted network G_recons that has the same node set as G. The NDR algorithm repeatedly (until convergence) samples mesoscale patches of G at scale k and finds a nonnegative linear approximation of them using the latent motifs in W. Because each edge e of G can appear in multiple mesoscale patches of G, there may be multiple reconstructed weights for e in this procedure. We take their mean as the final weight of e in G_recons (see Algorithm 2). To measure the effectiveness of W for an unweighted network G (i.e., edges are either present or absent and there are no multi-edges), one can threshold the weighted edges of G_recons at some fixed level θ ∈ [0, 1] to obtain an undirected reconstructed network G_recons(θ) with binary edge weights (of either 0 or 1), which one can then compare directly with the original unweighted network G. We regard W as effective at describing G at mesoscale k if G_recons(θ) is close to G for some θ. (We will quantify our notion of 'closeness' in the next paragraph.)
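A minimal sketch of the averaging step of NDR appears below. It assumes a helper sample_walk(G, k) that returns the k nodes of a sampled chain (the function names are illustrative rather than the paper's API, and the use of scipy's nonnegative least squares is an assumption; Algorithm 2 in the SI differs in details such as the MCMC sampling and its treatment of on-chain entries).

import numpy as np
from scipy.optimize import nnls

def ndr_reconstruct(G, W, k, n_samples, sample_walk):
    """Average latent-motif approximations of sampled mesoscale patches
    into per-node-pair confidence weights. W has shape (k*k, r)."""
    sums, counts = {}, {}
    for _ in range(n_samples):
        x = sample_walk(G, k)            # k node labels (a sampled homomorphism)
        A_x = np.array([[1.0 if G.has_edge(x[a], x[b]) else 0.0
                         for b in range(k)] for a in range(k)])
        coeffs, _ = nnls(W, A_x.reshape(-1))   # nonnegative linear fit
        A_hat = (W @ coeffs).reshape(k, k)     # patch approximation
        for a in range(k):
            for b in range(a + 1, k):
                e = tuple(sorted((x[a], x[b])))
                if e[0] == e[1]:
                    continue                   # a repeated node, not an edge
                sums[e] = sums.get(e, 0.0) + A_hat[a, b]
                counts[e] = counts.get(e, 0) + 1
    return {e: sums[e] / counts[e] for e in sums}  # mean reconstructed weights

# Thresholding at theta then gives a binary reconstructed network:
# edges = [e for e, w in weights.items() if w > theta]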
We interpret the edge weights in G_recons as measures of confidence in the corresponding edges of G with respect to W. For example, if an edge e has the smallest weight in G_recons, then we regard it as the most 'outlying' with respect to the latent motifs in W. See Theorems G.7 and G.10 in the SI for theoretical guarantees and error bounds for NDR.

In Figure 3, we show various reconstruction experiments using several real-world networks and synthetic networks. We perform these experiments for various values of the edge threshold θ and for seven different numbers r of latent motifs in a single dictionary (including r = 9 and r = 25). Each network dictionary in Figure 3 uses a chain motif with k = 21 nodes, for which the dimension of the space of all possible mesoscale patches (i.e., the adjacency matrices of the induced subgraphs) is (21 choose 2) − 20 = 190, which is the number of entries in the upper triangle of a 21 × 21 symmetric matrix minus the 20 on-chain entries (which always equal 1).

Figure 3. Self-reconstruction and cross-reconstruction accuracy between several real-world and synthetic networks versus the edge threshold value θ and the number r of latent motifs in a network dictionary. The label X ← Y indicates that we reconstruct network X using a network dictionary that we learn from network Y. The reconstruction process produces a weighted network that we turn into an unweighted network by thresholding the edge weights at a threshold value θ, such that we keep only edges whose weight is strictly larger than θ. We measure reconstruction accuracy by calculating the Jaccard index of the original network's edge set and the reconstructed network's edge set. In panel (a), we plot accuracies versus θ (keeping the number of latent motifs fixed at r = 25), where X is one of five real-world networks (two PPI networks, two Facebook networks, and one collaboration network). In panels (b) and (c), we reconstruct each of the four Facebook networks using network dictionaries with seven different numbers r of latent motifs that we learn from one of ten networks (with the edge threshold value fixed).

An important observation is that one can reconstruct a given network using an arbitrary network dictionary, which one can even learn from a different network. Such a 'cross-reconstruction' allows us to quantitatively compare the learned mesoscale structures of different networks. We label each subplot of Figure 3 with Y ← X to indicate that we are reconstructing network Y using a network dictionary that we learn from network X. We turn the weighted reconstructed networks into unweighted networks by thresholding their edges using some threshold θ ∈ [0, 1]. We measure the reconstruction accuracy by calculating the Jaccard index between the original network's edge set and the reconstructed network's edge set. That is, to measure the similarity of two edge sets, we calculate the number of edges in the intersection of these sets divided by the number of edges in the union of these sets. (We obtain the same qualitative results as in Figure 3 if we instead measure similarity using the Rand index [48].)

In Figure 3a, we plot the accuracy for 'self-reconstruction' X ← X versus θ (with r = 25), where X is one of the real-world networks Coronavirus, H. sapiens, SNAP FB, Caltech, and arXiv. For each of these networks, the self-reconstruction accuracy peaks at a high value for a suitable choice of θ. We fix a single such threshold θ for the cross-reconstruction experiments for the Facebook networks
Caltech, Harvard, MIT, and
UCLA in Figures 3b,c. We observe that these four Facebook networks have high self-reconstruction accuracies using r = 25 motifs at this threshold. The total number of dimensions when using mesoscale patches at scale k = 21 is 190, so this result suggests that all of these real-world networks have low-rank mesoscale structures at scale k = 21.

We consider accuracies for cross-reconstruction Y ← X in Figures 3b,c, where Y is one of the Facebook networks Caltech, Harvard, MIT, and
UCLA and X (with X ≠ Y) is one of these four networks or one of the six synthetic networks ER_i, WS_i, or BA_i (with i ∈ {1, 2}). From the cross-reconstruction accuracies and from interpreting the latent motifs (see Section B.4 in the SI) in Figures 1 and 2 (see also Figures 8, 10, and 14 in the SI), we draw the following conclusions at scale k = 21. First, the mesoscale structure of Caltech is distinct from those of Harvard, UCLA, and MIT. This is consistent with prior studies of these networks (see, e.g., [15, 56]). Second, Caltech's mesoscale structure at scale k = 21 has a higher dimension than those of the other three universities' Facebook networks. Third, Caltech has many more communities of size at least six than the other three universities' Facebook networks. Fourth, both BA networks capture the mesoscale structures of MIT, Harvard, and UCLA at scale k = 21 better than the synthetic networks that we generate from the ER and WS models. For instance, the self-reconstruction accuracies in Figures 3b,c using r = 9 latent motifs are noticeably lower for Caltech than for the other three universities' Facebook networks. See Section F.5 in the SI for further discussion.

In Figure 4, we use our algorithms to perform two types of network denoising. We can think of them as distinct binary classification problems in network analysis: network denoising with subtractive noise (which is often called edge 'prediction' [12, 18, 26, 28, 37]) and network denoising with additive noise [4]. In each scenario, we suppose that we are given an observed network G = (V, E) with node set V and edge set E and are asked to find an unknown network G′ = (V, E′) with the same node set V but a possibly different edge set E′. We interpret G as a corrupted version of a 'true network' G′ that we observe with some uncertainty. One can interpret incorrectly observed edges and non-edges in G as 'false relations' and 'false non-relations', respectively. In the subtractive-noise setting, we assume that G is a partially observed version of G′ (i.e., E ⊊ E′), and we seek to classify all non-edges in G into 'positives' (i.e., edges in G′) and 'negatives' (i.e., non-edges in G′). In the additive-noise setting, we suppose that G is a corrupted version of G′ to which some unknown edges have been added (i.e., E ⊇ E′), and we seek to classify all edges in G into 'positives' (i.e., edges in G′) and 'negatives' (i.e., non-edges in G′).

To experiment with these problems, we use the following five real-world networks: Caltech, SNAP FB, arXiv, Coronavirus, and H. sapiens. Given a network G = (V, E), our experiments proceed as follows. In the subtractive-noise setting, we create two smaller networks by removing uniformly random subsets that consist of 10% (in one experiment) or 50% (in the other) of the edges from our network. In the additive-noise case, we create two corrupted networks by adding edges between node pairs that we choose independently with a fixed probability so that 10% or 50% of the edges in a corrupted network are new. We then apply NDL with r = 25 latent motifs at scale k = 21 to learn a network dictionary for each corrupted network, and we use each dictionary to reconstruct the network from which it was learned using NDR. The reconstruction algorithm outputs a weighted network G_recons, where the weight of each edge is our confidence that the edge belongs to the true network. For denoising subtractive (respectively, additive) noise, we classify each non-edge (respectively, edge) in a corrupted network as 'positive' if its weight in G_recons is strictly larger than some threshold θ and as 'negative' otherwise. By varying θ, we construct a receiver-operating characteristic (ROC) curve that consists of points whose horizontal and vertical coordinates are the false-positive rates and true-positive rates, respectively.

In Figure 4, we show the ROC curves and the corresponding area-under-the-curve (AUC) scores for our network-denoising experiments with subtractive and additive noise for these networks. For example, when we add 50% false edges (with one extra edge as a tie-breaker, given the odd number of edges in the original network) to Coronavirus, such that 2,463 edges are true and 1,232 edges are false, we are able to detect the vast majority of the false edges while misclassifying only a small fraction of the true edges.
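Given the per-edge confidence weights from NDR, the ROC curve and AUC score follow from a standard threshold sweep. The toy labels and scores below are placeholders for the true edge indicators and the reconstructed weights (in the additive-noise experiment, the candidates are the edges of the corrupted network; in the subtractive-noise experiment, they are its non-edges):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                          # 1 = true edge, 0 = false edge
scores = np.clip(y_true * 0.6 + rng.random(1000) * 0.5, 0, 1)   # toy confidence values

fpr, tpr, thresholds = roc_curve(y_true, scores)  # sweep over the threshold theta
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.3f}")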
[Figure 4, panels a–f: ROC curves (true-positive rate versus false-positive rate) for denoising Caltech, Coronavirus PPI, arXiv, SNAP Facebook, and Homo sapiens PPI, with legend entries that report AUC scores for the +10%, +50%, −10%, and −50% noise levels (and a −20% level for Coronavirus PPI). Additional Coronavirus PPI panels compare mask choices (identity versus no-folding) and MCMC samplers (Glauber versus ApproxPivot). A table reports AUC scores for denoising −50% noise that compare Spectral Clustering, DeepWalk, and NDL+NDR (our method) on SNAP FB, H. sapiens, and arXiv.]

Figure 4.
Application of the NDL and NDR algorithms to network denoising with additive and subtractive noise on a variety of networks from empirical data sets. In panels (a)–(e), we plot our results. In panel (f), we compare our denoising results for −50% noise on SNAP FB, H. sapiens, and arXiv versus those of other methods. In our experiments with subtractive noise, we corrupt a network by removing a uniformly random subset of 10% or 50% of its edges, and we seek to classify the removed edges and the non-edges as true edges and false edges, respectively. In our experiments with additive noise, we corrupt a network by uniformly randomly adding 10% or 50% of the number of its edges, and we seek to classify the edges and non-edges in the resulting corrupted network as 'negative' (i.e., false edges) or 'positive' (i.e., true edges). To perform classification in a network, we first use NDL to learn latent motifs from a corrupted network, and we then reconstruct the network using NDR to assign a confidence value to each potential edge. We then use these confidence values to infer membership of potential edges in the uncorrupted network. Importantly, we never use information from the original networks. For the Caltech Facebook network in panel (b), we also perform edge inference and denoising using a network dictionary that we learn from the MIT Facebook network. For each network, we indicate the receiver-operating characteristic (ROC) curves and the corresponding area-under-the-curve (AUC) scores for network denoising with additive noise using the labels +10% and +50%, and we indicate the ROC curves and corresponding AUC scores for network denoising with subtractive noise using the labels −10% and −50%.

In Figure 4f, we compare the performance of our method to those of some popular supervised algorithms that are based on network embeddings. Specifically, we compare to node2vec [11], DeepWalk [45], and
LINE [50] for the task of denoising subtractive noise for SNAP FB, H. sapiens, and arXiv. Our method achieves state-of-the-art results in all cases.

We make two important remarks about applying the NDL and NDR algorithms to network denoising. First, NDL and NDR are able to perform the desired classification tasks in an unsupervised manner, in the sense that we do not require fully known examples to train our algorithms. This is particularly useful when it is difficult to obtain a large number of fully known examples, such as when measuring PPI networks for a new organism (e.g., SARS-CoV-2). Part of the reason that NDL and NDR are successful at these network-denoising tasks is the low-rank nature of the examined mesoscale structures of the social and PPI networks (see Figure 3a). Specifically, because NDL learns a small number of latent motifs that are able to successfully give an approximate basis for all mesoscale patches, the latent motifs should not be affected significantly by noise. Second, unlike the existing algorithms that we just mentioned, we are able to perform denoising on a network using information that we learn from an entirely different network. Consequently, we are able to successfully perform not only self-reconstruction but also cross-reconstruction. For instance, in Figure 4, we show the results for denoising Caltech using a dictionary that we learn from the MIT network, which we expect to have a similar structure to
Caltech based on the results of the experiments that we highlighted in Figure 3 and also on prior research on these networks [14, 43, 56].

Our experiments in Figures 3 and 4 illustrate that various social, collaboration, and PPI networks have low-rank [36] mesoscale structures, in the sense that a few (e.g., r = 25, but see Figure 3 for other choices of r) latent motifs that we learn using NDL are able to reconstruct, infer, and denoise the edges of the entire networks when we employ the NDR algorithm. We hypothesize that such low-rank mesoscale structures are a general phenomenon for networks of interactions in various complex systems beyond the social, collaboration, and PPI networks that we have examined. As we have illustrated in this paper, one can leverage mesoscale structures to perform important tasks like network denoising, so it is important in future studies to explore the level of generality of our insights.

References

[1] Nesreen Ahmed, Ryan Anthony Rossi, John Lee, Theodore Willke, Rong Zhou, Xiangnan Kong, and Hoda Eldardiry. Role-based graph embeddings. IEEE Transactions on Knowledge and Data Engineering, 2020. Available at doi:10.1109/TKDE.2020.3006475.
[2] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[3] Marco Bressan. Faster algorithms for sampling connected induced subgraphs. arXiv preprint arXiv:2007.12102, 2020.
[4] Fernanda B. Correia, Edgar D. Coelho, José L. Oliveira, and Joel P. Arrais. Handling noise in protein interaction networks. BioMed Research International, 2019:1–13, 2019.
[5] Rick Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, UK, fourth edition, 2010.
[6] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
[7] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
[8] Paul Erdős and Alfréd Rényi. On random graphs. I. Publicationes Mathematicae, 6:290–297, 1959.
[9] Santo Fortunato and Darko Hric. Community detection in networks: A user guide. Physics Reports, 659:1–44, 2016.
[10] David E. Gordon, Gwendolyn M. Jang, Mehdi Bouhaddou, Jiewei Xu, Kirsten Obernier, Kris M. White, Matthew J. O'Meara, Veronica V. Rezelj, Jeffrey Z. Guo, and Danielle L. Swaney. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature, 583:1–13, 2020.
[11] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.
[12] Roger Guimerà. One model to rule them all in network science? Proceedings of the National Academy of Sciences of the United States of America, 117(41):25195–25197, 2020.
[13] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, UK, second edition, 2012.
[14] Abigail Z. Jacobs, Samuel F. Way, Johan Ugander, and Aaron Clauset. Assembling the Facebook: Using heterogeneity to understand online social network assembly. In Proceedings of the ACM Web Science Conference, WebSci '15, New York City, NY, USA, 2015. Association for Computing Machinery.
[15] Lucas G. S. Jeub, Prakash Balachandran, Mason A. Porter, Peter J. Mucha, and Michael W. Mahoney. Think locally, act locally: Detection of small, medium-sized, and large communities in large networks. Physical Review E, 91:012821, 2015.
[16] Nadav Kashtan, Shalev Itzkovitz, Ron Milo, and Uri Alon. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics, 20(11):1746–1758, 2004.
[17] Ankit N. Khambhati, Ann E. Sizemore, Richard F. Betzel, and Danielle S. Bassett. Modeling and interpreting mesoscale network dynamics. NeuroImage, 180:337–349, 2018.
[18] István A. Kovács, Katja Luck, Kerstin Spirohn, Yang Wang, Carl Pollis, Sadie Schlabach, Wenting Bian, Dae-Kyum Kim, Nishka Kishore, and Tong Hao. Network-based prediction of protein interactions. Nature Communications, 10(1):1240, 2019.
[19] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.
[20] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
[21] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801–808, 2007.
[22] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 631–636, 2006.
[23] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. Available at http://snap.stanford.edu/data, 2020.
[24] Jure Leskovec and Julian J. Mcauley. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, pages 539–547, 2012.
[25] David A. Levin and Yuval Peres. Markov Chains and Mixing Times. American Mathematical Society, Providence, RI, USA, 2017.
[26] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[27] László Lovász. Large Networks and Graph Limits, volume 60 of Colloquium Publications. American Mathematical Society, Providence, RI, USA, 2012.
[28] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A, 390(6):1150–1170, 2011.
[29] Hanbaek Lyu, Facundo Memoli, and David Sivakoff. Sampling random graph homomorphisms and applications to network data analysis. arXiv:1910.09483, 2019.
[30] Hanbaek Lyu, Deanna Needell, and Laura Balzano. Online matrix factorization for Markovian data and applications to network dictionary learning. Journal of Machine Learning Research, 21:1–49, 2020.
[31] Julien Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, pages 2283–2291, 2013.
[32] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
[33] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Non-local sparse models for image restoration. In IEEE 12th International Conference on Computer Vision, pages 2272–2279, 2009.
[34] Julien Mairal, Michael Elad, and Guillermo Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53–69, 2007.
[35] Julien Mairal, Michael Elad, and Guillermo Sapiro. Sparse learned representations for image restoration. Proceedings of the 4th World Conference of the International Association for Statistical Computing, page 118, 2008.
[36] Ivan Markovsky and Konstantin Usevich. Low Rank Approximation. Springer-Verlag, Heidelberg, Germany, 2012.
[37] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, pages 437–452, Heidelberg, Germany, 2011. Springer-Verlag.
[38] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, Heidelberg, Germany, 2012.
[39] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[40] Mark E. J. Newman. Networks. Oxford University Press, Oxford, UK, second edition, 2018.
[41] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In
Advances in Neural Information Processing Systems, pages 849–856, 2002.
[42] William S. Noble. What is a support vector machine? Nature Biotechnology, 24(12):1565–1567, 2006.
[43] Jukka-Pekka Onnela, Daniel J. Fenn, Stephen Reid, Mason A. Porter, Peter J. Mucha, Mark D. Fricker, and Nick S. Jones. Taxonomies of networks from community structure. Physical Review E, 86(3):036104, 2012.
[44] Rose Oughtred, Chris Stark, Bobby-Joe Breitkreutz, Jennifer Rust, Lorrie Boucher, Christie Chang, Nadine Kolas, Lara O'Donnell, Genie Leung, and Rochelle McAdam. The BioGRID interaction database: 2019 update. Nucleic Acids Research, 47(D1):D529–D541, 2019.
[45] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.
[46] Gabriel Peyré. Sparse modeling of textures. Journal of Mathematical Imaging and Vision, 34(1):17–31, 2009.
[47] Mason A. Porter, Jukka-Pekka Onnela, and Peter J. Mucha. Communities in networks. Notices of the American Mathematical Society, 56(9):1082–1097, 1164–1166, 2009.
[48] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[49] Puck Rombach, Mason A. Porter, James H. Fowler, and Peter J. Mucha. Core–periphery structure in networks (revisited). SIAM Review, 59(3):619–646, 2017.
[50] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077, 2015.
[51] Lei Tang and Huan Liu. Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3):447–478, 2011.
[52] theBiogrid.org. Coronavirus PPI network. 2020. Retrieved from https://wiki.thebiogrid.org/doku.php/covid (downloaded 24 July 2020, Ver. 3.5.187.tab3).
[53] theBiogrid.org. Homo sapiens PPI network. 2020. Retrieved from https://wiki.thebiogrid.org/doku.php/covid (downloaded 24 July 2020, Ver. 3.5.180.tab2).
[54] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[55] Amanda L. Traud, Eric D. Kelsic, Peter J. Mucha, and Mason A. Porter. Comparing community structure to characteristics in online collegiate social networks. SIAM Review, 53:526–543, 2011.
[56] Amanda L. Traud, Peter J. Mucha, and Mason A. Porter. Social structure of Facebook networks. Physica A, 391(16):4165–4180, 2012.
[57] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
[58] Sebastian Wernicke. Efficient detection of network motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(4):347–359, 2006.
Supplementary Information: LEARNING LOW-RANK LATENT MESOSCALE STRUCTURES IN NETWORKS
Appendix A. Overview
In this supplementary material, we present our algorithms for network dictionary learning (NDL) and network denoising and reconstruction (NDR), and we prove theoretical results about their convergence and error bounds. We give the full NDL algorithm (see Algorithm 1) in Section D, and we give the full NDR algorithm (see Algorithm 2) in Section E. We introduce the notion of 'latent motif dominance' in Section D.2 to measure the significance of each latent motif that we learn from a network. In Section G, we give a rigorous analysis of the NDL and NDR algorithms.
Appendix B. Problem Formulation for Network Dictionary Learning (NDL)
B.1. Definitions and notation.
To facilitate our discussions, we use terminology and notation from [27, Ch. 3]. In the main text, we described a network as a graph G = (V, E) with node set V and edge set E, without directed edges or multi-edges but possibly with self-edges. One can characterize the edge set E of G using an adjacency matrix A_G : V × V → {0, 1}, where A(x, y) = 1({x, y} ∈ E) for each x, y ∈ V. The function 1(S) denotes the indicator of the event S; it takes the value 1 if S occurs and 0 if it does not occur. In this supplementary information, for broader applicability, we formulate our NDL framework in the more general setting in which edges in a network may have weights. Although one can extend the above definition of networks to include weighted edges by adjoining an additional entry to G = (V, E) for edge weights, it is convenient to simply extend the range of adjacency matrices from {0, 1} to the interval [0, ∞).

With the above considerations in mind, we define a network as a pair G = (V, A_G) with node set V and a weight matrix (which is also often called a 'weighted adjacency matrix') A_G : V × V → [0, ∞) that encodes the interaction strengths between the nodes. For simplicity, we will often drop the subscript G in A_G and denote it by A. A given graph G = (V, E) determines a unique network G = (V, A_G), where A_G is the adjacency matrix of G. The set V(G) is the node set of the network G, which has size |V(G)|, where |S| is the number of elements in the set S. A pair (x, y) of nodes in G is called a directed edge if A(x, y) > 0. We say that a network G = (V, A) is symmetric if its weight matrix is symmetric (i.e., A(x, y) = A(y, x) for all x, y ∈ V), and we say that it is binary if A(x, y) ∈ {0, 1} for all x, y ∈ V. Given a symmetric network G = (V, A), we call an unordered pair {x, y} of nodes in G an edge if A(x, y) = A(y, x) > 0. We say that a network G = (V, A) is connected if, for any two distinct nodes x, y ∈ V, there exists a sequence of nodes x_1, x_2, …, x_m for some m ≥ 2 such that A(x_i, x_{i+1}) > 0 for all i ∈ {1, …, m − 1} and {x_1, x_m} = {x, y}. We call G bipartite if it admits a 'bipartition', which is a partition V = V_1 ∪ V_2 of the node set V such that A(x, y) = 0 if x, y ∈ V_1 or x, y ∈ V_2. If two networks G = (V, A) and G′ = (V′, A′) satisfy V′ ⊆ V and A′(x, y) ≤ A(x, y) for all x, y ∈ V′, then we say that G′ is a subnetwork of G, and we write G′ ⊆ G.

Suppose that we are given m elements v_1, …, v_m in some vector space. By their mean, we refer to their sample mean v̄ = m^{−1} ∑_{i=1}^{m} v_i. By their weighted average, we refer to the expectation ∑_{i=1}^{m} v_i p_i, where (p_1, …, p_m) is a probability distribution on the set of m elements.
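In code, these definitions amount to storing a weight matrix. The following small example (with illustrative data) builds a symmetric binary network from an edge list and checks the properties defined above:

import numpy as np

# A network G = (V, A) as in B.1: nodes indexed 0, ..., n-1 and a weight matrix A.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0   # a symmetric, binary network from an undirected graph

is_symmetric = np.array_equal(A, A.T)
is_binary = np.isin(A, (0.0, 1.0)).all()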
B.2. Homomorphisms between networks and motif sampling.

Being able to sample from a complex data set according to a known probability distribution (e.g., a uniform one) is a crucial ingredient in dictionary-learning problems. This is often the case for image-processing applications [7, 33, 35], as it is straightforward to sample a k × k patch uniformly at random from an image. However, the similar problem of uniformly randomly sampling a k-node connected subnetwork from a network is not straightforward [3, 16, 22, 58]. For our purpose of developing dictionary learning for networks, we use motif sampling, which was introduced recently in [29]. In motif sampling, instead of directly sampling a connected subnetwork, one samples a random node map from a smaller network (i.e., a motif) into a target network that preserves adjacency relations, and one then uses the subnetwork that is induced on the nodes in the image of the node map. As we discuss below, such a node map between networks is a homomorphism.

Fix an integer k ≥ 1 and a weight matrix A_F : [k] × [k] → [0, ∞), where we use the shorthand notation [k] = {1, …, k}. We use the term motif for the corresponding network F = ([k], A_F). A motif is a network, and we use such motifs to sample from a given network. The type of motif in which we are particularly interested is a k-chain, for which A_F = 1({(1, 2), (2, 3), …, (k − 1, k)}). A k-chain is a directed path with node set [k]. For a general k-node motif F = ([k], A_F) and a network G = (V, A), we define the probability distribution π_{F→G} on the set V^{[k]} of all node maps x : [k] → V by

    π_{F→G}(x) = (1/Z) ∏_{i,j ∈ {1,…,k}} A(x(i), x(j))^{A_F(i,j)},   (1)

where Z is a normalizing constant that is called the homomorphism density of F in G [27]. A node map x : [k] → V is a homomorphism F → G if π_{F→G}(x) > 0, which is the case if and only if A(x(a), x(b)) > 0 for all a, b ∈ [k] with A_F(a, b) > 0 (with the convention that α^0 = 1 for all α ∈ ℝ). When both A and A_F are binary matrices, the probability distribution π_{F→G} is the uniform distribution on the set of all homomorphisms F → G. This is the case for all examples that we discuss in the main manuscript. Motif sampling refers to the problem of sampling a random homomorphism x : F → G according to the distribution in (1). In Section C, we discuss three Markov chain Monte Carlo (MCMC) algorithms for motif sampling.
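For intuition, the weight in (1) of a node map is simple to compute directly (up to the normalization Z); the following sketch also illustrates that homomorphisms from a chain motif may revisit nodes:

import numpy as np

def hom_weight(A, A_F, x):
    """Unnormalized weight of a node map x : [k] -> V under (1):
    prod_{i,j} A(x(i), x(j))^{A_F(i,j)}. A is the |V| x |V| weight matrix,
    A_F is the k x k motif matrix, and x is a list of k node indices.
    The weight is nonzero if and only if x is a homomorphism."""
    k = len(x)
    w = 1.0
    for i in range(k):
        for j in range(k):
            if A_F[i, j] > 0:
                w *= A[x[i], x[j]] ** A_F[i, j]  # 0 if a required edge is absent
    return w

# 3-chain motif: directed edges (1, 2) and (2, 3) in the 1-based labels of the text
A_F = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph on 3 nodes
print(hom_weight(A, A_F, [0, 1, 2]) > 0)  # True: a valid homomorphism
print(hom_weight(A, A_F, [0, 1, 0]) > 0)  # also True: homomorphisms may backtrack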
B.3. Mesoscale patches of networks.

A homomorphism F → G is a node map V(F) → V(G) that maps the edges of F to some edges of G, so it maps F onto a subgraph of G. It thereby maps F 'into' G. For each homomorphism x : F → G from a motif F = ([k], A_F) into a network G = (V, A), we define a k × k matrix

    A_x(a, b) := A(x(a), x(b)) Φ_{F,x}(a, b)  for all a, b ∈ {1, …, k},   (2)

where Φ_{F,x} : [k] × [k] → {0, 1} is a k × k binary matrix that we call a mask. Two particular choices of masks are

    identity: Φ_{F,x}(a, b) = 1  for all a, b ∈ {1, …, k},   (3)

    no-folding: Φ_{F,x}(a, b) = 1( A_F(a, b) > 0, or there do not exist a′, b′ ∈ {1, …, k} such that [A_F(a′, b′) > 0 and (x(a), x(b)) = (x(a′), x(b′))] )  for all a, b ∈ {1, …, k}.   (4)

We call A_x in (2) the mesoscale patch of G that is induced by the homomorphism x : F → G; it is specified uniquely by the homomorphism x : F → G, the weight matrix A, and the mask Φ. For a k × k matrix B and a homomorphism x : F → G, we say that the (a, b) entries of B are on-chain if A_F(a, b) > 0 and off-chain otherwise. The condition A_F(a, b) > 0 implies that A(x(a), x(b)) > 0 by the definition of the homomorphism x, so the on-chain entries of A_x are always positive (and are always 1 if G is binary). However, the off-chain entries of A_x are not necessarily positive, so they encode meaningful information about a network that is 'detected' by the homomorphism x : F → G. For example, see the (1, 3) entries in the 3 × 3 mesoscale patches in Figure 5b.

Suppose that the homomorphism x uses k distinct nodes of G in its image Im(x) := {x(a) | a ∈ {1, …, k}}. It then follows that the network G_x := (Im(x), A_x) that is induced by x is a k-node subnetwork of G. In this case, the no-folding mask (4) is the same as the identity mask (3). However, when Im(x) has fewer than k nodes, G_x is not necessarily a subnetwork of G, and distinct positive off-chain entries of A_x with the identity mask may represent the same edge in G. We introduced the no-folding mask in (4) so that the positive off-chain entries of A_x always carry information that is not included in its on-chain entries.
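The masks (3) and (4) translate directly into code. The following unoptimized sketch (which assumes integer node indices; it is an illustration, not the paper's implementation) computes the mesoscale patch in (2) with either mask:

import numpy as np

def mesoscale_patch(A, A_F, x, no_folding=False):
    """Mesoscale patch A_x from (2), with the identity mask (3) or the
    no-folding mask (4). A: |V| x |V| weights; A_F: k x k motif; x: node list."""
    k = len(x)
    A_x = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            mask = 1.0
            if no_folding and A_F[a, b] == 0:
                # Zero out (a, b) if some on-chain pair (a', b') maps to the
                # same ordered node pair, i.e., the chain 'folds' onto this edge.
                for ap in range(k):
                    for bp in range(k):
                        if A_F[ap, bp] > 0 and (x[a], x[b]) == (x[ap], x[bp]):
                            mask = 0.0
            A_x[a, b] = A[x[a], x[b]] * mask
    return A_x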
As an illustration, consider the case in which F is the 6-chain motif and G = (V, A) is an undirected and binary graph. Suppose that we have the following two 6 × 6 binary matrices:

    P =
    ∗ 1 ∗ ∗ ∗ ∗
    1 ∗ 1 ∗ ∗ ∗
    ∗ 1 ∗ 1 ∗ ∗
    ∗ ∗ 1 ∗ 1 ∗
    ∗ ∗ ∗ 1 ∗ 1
    ∗ ∗ ∗ ∗ 1 ∗

    Q =
    0 1 0 1 0 1
    1 0 1 0 1 0
    0 1 0 1 0 1
    1 0 1 0 1 0
    0 1 0 1 0 1
    1 0 1 0 1 0

where each entry ∗ of P is either 0 or 1. Suppose that Φ_{F,x} is the identity mask (3). It then follows that any homomorphism x : F → G from the 6-chain motif F always induces a mesoscale patch A_x of the form P, where the entries in the two diagonals adjacent to the main diagonal correspond to the on-chain entries. Suppose that x uses only two distinct nodes, p and q, in G. Specifically, let p = x(1) = x(3) = x(5) and q = x(2) = x(4) = x(6). In this case, the mesoscale patch A_x equals Q, which has four off-chain entries of 1 in its upper triangle. However, all of the 1 entries in A_x in this case correspond to the single edge between p and q in G, so the off-chain entries of A_x do not give any new information about the network G that is not already given by its on-chain entries. The indicator function in the definition of the no-folding mask (4) prevents this situation. Indeed, in this case, the mesoscale patch A_x with the no-folding mask is the matrix P with all ∗ entries equal to 0.

B.4. Problem formulation for network dictionary learning (NDL).
The goal of the
NDL problem is to learn, for a fixed integer r ≥ 1, a set of r nonnegative k × k matrices L_1, …, L_r, with Frobenius norms of at most 1, such that

    A_x ≈ a_1(x) L_1 + ⋯ + a_r(x) L_r   (5)

for each homomorphism x : F → G for some coefficients a_1(x), …, a_r(x) ≥ 0. For each homomorphism x : F → G, this implies that one can approximate the mesoscale patch A_x of G that is induced by x as a suitable linear combination of the r matrices L_1, …, L_r. We call the tuple [L_1, …, L_r] a network dictionary for G, and we call each L_i a latent motif of G. We identify a network dictionary [L_1, …, L_r] with the nonnegative matrix W ∈ ℝ^{k² × r}_{≥0} whose jth column is the vectorization of the jth latent motif L_j for j ∈ {1, …, r}. The choice of vectorization ℝ^{k × k} → ℝ^{k²} can be arbitrary, but we use a column-wise vectorization in Algorithm A4.

For the latent motifs to be interpretable, it is crucial that we require the entries of the latent motifs L_i and the coefficients a_i(x) to be nonnegative. The nonnegativity constraint on each L_i allows one to interpret each L_i as the weight matrix of a k-node network. Additionally, because the coefficients a_i(x) are also nonnegative, the approximate decomposition in (5) implies that a_i(x) L_i ≲ A_x entrywise. Therefore, if a_i(x) > 0, any network structure (e.g., nodes with large degree, communities, and so on) in the latent motif L_i must exist in A_x. Consequently, one can consider the latent motifs as approximate k-node subnetworks of G that exhibit typical network structure of G at scale k. In the spirit of Lee and Seung [19], one can regard the latent motifs as 'parts' of a network G.

For a more precise formulation of (5), consider the following optimization problem:

    arg min_{L_1, …, L_r ∈ ℝ^{k × k}_{≥0}; ‖L_1‖_F, …, ‖L_r‖_F ≤ 1}  E_{x ∼ π_{F→G}} [ inf_{a_1(x), …, a_r(x) ≥ 0} ‖ A_x − ∑_{i=1}^{r} a_i(x) L_i ‖_F² ],   (6)

where π_{F→G} is the probability distribution that we defined in (1) and ‖·‖_F denotes the matrix Frobenius norm. The choice of the probability distribution π_{F→G} for the homomorphisms x : F → G is natural because it becomes the uniform distribution on the set of all homomorphisms F → G when the adjacency matrices of G and F are both binary. The NDL problem is computationally difficult because the objective function in (6) is non-convex and it is not obvious how to sample a homomorphism F → G according to the distribution π_{F→G} that we defined in (1). In Section D, we state an algorithm for NDL that approximately solves (6). Lee and Seung [19] discussed a similar nonnegative decomposition in which the A_x are images of faces; the learned factors L_i then capture parts of human faces (such as eyes, noses, and mouths).
B.5. Overview of our algorithms and their theoretical guarantees.
We overview our algorithms and their theoretical guarantees.
Algorithm 1: Given a network G, the NDL algorithm (see Algorithm 1) computes a sequence (W_t)_{t≥1} of network dictionaries of latent motifs, which take the form of k² × r matrices.

Algorithm 2: Given a network G, a network dictionary W, and a threshold parameter θ > 0, the NDR algorithm (see Algorithm 2) computes a sequence of weighted networks G_recons and binary (i.e., unweighted) networks G_recons(θ).

Theorem G.2: Given a non-bipartite network G and a choice of the parameters in Algorithm 1, we prove that the sequence (W_t)_{t≥1} of network dictionaries converges almost surely to the set of stationary points of the associated objective function in (6).

Theorem G.5: Given a bipartite network G and a choice of the parameters in Algorithm 1, we prove a convergence result that is analogous to the one in Theorem G.2.

Theorem G.7: Given a non-bipartite target network G and a network dictionary W, we show that (i) the sequence of weighted reconstructed networks G_recons that we compute using the NDR algorithm (see Algorithm 2) converges almost surely to some limiting network, and (ii) we obtain a closed-form expression for this limiting network. We also show that (iii) a suitable distance between an arbitrary network G′ and the limiting reconstructed network is bounded by the mean L₂ distance between the k × k mesoscale patches of G′ and their nonnegative linear approximations from the latent motifs in W. Finally, (iv) if G′ = G in (iii), we show that the upper bound on the distance between G and G_recons is approximately optimized if one learns the network dictionary W using the NDL algorithm (Algorithm 1).

Theorem G.10: We show a convergence result that is analogous to the one in Theorem G.7 for a bipartite target network G.

Appendix C. Markov Chain Monte Carlo (MCMC) Motif-Sampling Algorithms
We mentioned in Section B.4 that one of the main difficulties in solving the optimization problem (6) is to directly sample a homomorphism x : F → G from the distribution π_{F→G} (see (1)). To overcome this difficulty, we use the Markov chain Monte Carlo (MCMC) algorithms that were introduced in [29]. Although the algorithms in [29] apply to networks with edge weights and/or node weights, we only use the simplified forms of them that we give in Algorithms MP and MG. Additionally, Algorithm MP with the option AcceptProb = Approximate is a novel algorithm of the present paper. Using these MCMC sampling algorithms, we generate a sequence (x_t)_{t≥0} of homomorphisms F → G such that the distribution of x_t converges to π_{F→G} under some mild conditions on G and F [30, Thm. 5.7].

In the pivot chain (see Algorithm MP with AcceptProb = Exact), for each update x_t ↦ x_{t+1}, the pivot x_t(1) first performs a random-walk move on G (see (7)) to move to a new node x_{t+1}(1) ∈ V. It accepts this move with a suitable acceptance probability (see (8)) according to the Metropolis–Hastings algorithm (see, e.g., [25, Sec. 3.2]). After the move x_t(1) ↦ x_{t+1}(1), we sample each x_{t+1}(i) ∈ V for i ∈ {2, 3, …, k} successively from the appropriate conditional distribution (see (9)). This ensures that the desired distribution π_{F→G} in (1) is a stationary distribution of the resulting Markov chain. In the Glauber chain, we pick one node i ∈ [k] of F uniformly at random, and we resample its location x_t(i) ∈ V(G) at time t to ℓ = x_{t+1}(i) ∈ V from the correct conditional distribution in (10) (see Figure 5a). See [25, Sec. 3.3] for background on the Metropolis–Hastings algorithm and Glauber-chain MCMC sampling.

Let Δ denote the maximum degree (i.e., the maximum number of neighbors) of the nodes in the network G = (V, A). We also say that the network G itself has a maximum degree of Δ. The Glauber chain has an efficient local update (with a computational complexity of O(Δ)), but it converges quickly to the stationary distribution π_{F→G} only for networks that are dense enough that two homomorphisms that differ at one node have a probability of at least 1/(2Δ) to coincide after a single Glauber-chain update. (See [29, Thm. 6.1] for a precise statement.)

By contrast, the pivot chain (see Algorithm MP with AcceptProb = Exact) has more computationally expensive local updates, with a computational complexity of O(Δ^{k−1}) (as discussed in [29, Remark 5.6]), but it converges as fast as a 'lazy' random walk on a network. (In such a random walk, each move has a chance to be rejected according to a Metropolis–Hastings algorithm; see [29, Thm. 6.2].)
Algorithm MP. Pivot-Chain Update

Input: Symmetric network G = (V, A), motif F = ([k], A_F), and homomorphism x : F → G
Parameters: AcceptProb ∈ {Exact, Approximate}
Do: Set x′ ← x.
  If ∑_{c ∈ V} A(x(1), c) = 0: Terminate.
  Else: Sample ℓ ∈ V at random from the distribution

      p(w) = A(x(1), w) / ∑_{c ∈ V} A(x(1), c),  w ∈ V.   (7)

  Compute the acceptance probability α ∈ [0, 1] by

      α ← min( [ ∑_{c ∈ V} A^{k−1}(ℓ, c) / ∑_{c ∈ V} A^{k−1}(x(1), c) ] · [ ∑_{c ∈ V} A(c, x(1)) / ∑_{c ∈ V} A(ℓ, c) ], 1 )  if AcceptProb = Exact,
      α ← min( ∑_{c ∈ V} A(c, x(1)) / ∑_{c ∈ V} A(ℓ, c), 1 )  if AcceptProb = Approximate.   (8)

  Sample U ∈ [0, 1] uniformly at random, independently of everything else; set ℓ ← x(1) if U > α, and then set x′(1) ← ℓ.
  For i = 2, 3, …, k: Sample x′(i) ∈ V from the distribution

      p_i(w) = A(x′(i − 1), w) / ∑_{c ∈ V} A(x′(i − 1), c),  w ∈ V.   (9)

Output: Homomorphism x′ : F → G

Algorithm MG. Glauber-Chain Update

Input:
Network G = ( V, A ) , k -chain motif F = ([ k ] , A F ) , and homomorphism x : F → G Do:
Sample v ∈ [ k ] uniformly at random Sample z ∈ V at random from the distribution p ( w ) = 1 Z (cid:89) u ∈ [ k ] A ( x ( u ) , w ) A F ( u,v ) (cid:89) u ∈ [ k ] A ( w, x ( u )) A F ( v,u ) , w ∈ V (10)where Z = (cid:80) c ∈ V (cid:16)(cid:81) u ∈ [ k ] A ( x ( u ) , c ) A F ( u,v ) (cid:17) (cid:16)(cid:81) u ∈ [ k ] A ( c, x ( u )) A F ( v,u ) (cid:17) is the normalization con-stant. Define a new homomorphism x (cid:48) : F → G by x (cid:48) ( w ) = z if w = v and x (cid:48) ( w ) = x ( w ) otherwise Output:
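To make the pivot update concrete, the following is a minimal Python sketch of Algorithm MP with AcceptProb = Approximate for a network that is stored as a dense NumPy adjacency matrix. The function name and the dense-matrix representation are our own illustrative choices (a practical implementation would use sparse adjacency structures), and the acceptance probability follows our reading of (8) above.

```python
import numpy as np

def approx_pivot_update(A, x, rng):
    """One update of the approximate pivot chain (Algorithm MP with
    AcceptProb = Approximate) for a k-chain motif.

    A   : (n, n) symmetric nonnegative adjacency matrix (dense, for clarity)
    x   : integer array of length k; x[i] is the node to which motif node
          i + 1 maps
    rng : a numpy.random.Generator, e.g., np.random.default_rng()
    """
    n, k = A.shape[0], len(x)
    x_new = np.array(x, copy=True)
    deg = A[x[0]].sum()
    if deg == 0:
        return x_new  # isolated pivot: no admissible move
    # Random-walk proposal for the pivot; cf. the distribution (7).
    ell = rng.choice(n, p=A[x[0]] / deg)
    # Approximate Metropolis-Hastings acceptance; cf. (8). This makes the
    # pivot's stationary distribution uniform over the node set V.
    alpha = min(deg / A[ell].sum(), 1.0)
    if rng.random() <= alpha:
        x_new[0] = ell
    # Successively resample the remaining nodes of the chain; cf. (9).
    for i in range(1, k):
        w = A[x_new[i - 1]]
        x_new[i] = rng.choice(n, p=w / w.sum())
    return x_new
```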
Experimentally, we find that the Glauber chain is slow, especially for sparse networks (e.g., for Coronavirus PPI and UCLA, which both have very low edge densities), and that the pivot chain is too expensive to compute for chain motifs with large k. As a compromise with a low computational complexity and fast convergence (as fast as the standard random walk), we employ an approximate pivot chain, which is Algorithm MP with the option AcceptProb = Approximate. Specifically, we compute the acceptance probability α in (8) only approximately and thereby reduce the computational cost to O(Δ). The compromise, which we discuss in the next paragraph, is that the stationary distribution of the approximate pivot chain may be slightly different from our target distribution π_{F→G}.

According to Proposition G.1, the stationary distribution of the approximate pivot chain is

  π̂_{F→G}(x) := [∏_{i=2}^{k} A(x(i−1), x(i))] / [|V| ∑_{y_2,...,y_k∈V} A(x(1), y_2) ∏_{i=3}^{k} A(y_{i−1}, y_i)].    (11)

In general, the distribution (11) is different from the desired target distribution π_{F→G}. Specifically, π_{F→G}(x) is proportional only to the numerator in (11), and the sum in the denominator in (11) is a weighted count of the homomorphisms y: F → G for which y(1) = x(1). Therefore, under π̂_{F→G}, we penalize the probability of each homomorphism x: F → G according to the number of k-step walks in G that start from x(1) ∈ V. (The exact acceptance probability in (8) neutralizes this penalty.) It follows that π̂_{F→G} is close to π_{F→G} when the k-step-walk counts that start from each node in G do not differ too much across nodes. For example, on degree-regular networks like lattices, such counts do not depend on the starting node, and it thus follows that π̂_{F→G} = π_{F→G}. Nevertheless, despite the potential discrepancy between π_{F→G} and π̂_{F→G}, the approximate pivot chain gives good results for the reconstruction and denoising experiments that we showed in Figures 3 and 4 in the main manuscript.

Appendix D. Algorithm for Network Dictionary Learning (NDL)
D.1. Algorithm overview and statement. The essential idea behind our algorithm for NDL (see Algorithm 1) is as follows. We first sample a large number M of homomorphisms x_t: F → G from π_{F→G} and compute their corresponding mesoscale patches A_{x_t} for t ∈ {1, ..., M}. These M mesoscale patches of G form the data set (a so-called ‘batch’) to which we apply a dictionary-learning algorithm. Specifically, we (column-wise) vectorize each of these k × k matrices (using Algorithm A4) and obtain a k² × M data matrix X, and we then apply nonnegative matrix factorization (NMF) [19] to obtain a k² × r nonnegative matrix W for some fixed integer r ≥ 1 that yields an approximate factorization X ≈ WH for some nonnegative matrix H. From this procedure, we approximate each column of X by the nonnegative linear combination of the r columns of W with coefficients that are given by the corresponding column of H. Therefore, if we let L_i be the k × k matrix that we obtain by reshaping the i-th column of W (using Algorithm A5), then [L_1, ..., L_r] is an approximate solution of (6). We give the precise meaning of ‘approximate solution’ in Theorems G.2 and G.5.

The scheme in the paragraph above requires one to store all M mesoscale patches, entailing a memory requirement that is at least of order k²M. Because M should scale with the size of G, this implies that we need unbounded memory to handle arbitrarily large networks. To address this issue, Algorithm 1 implements the above scheme in the setting of ‘online learning’, in which subsets (so-called ‘minibatches’) of data arrive in a sequential manner and one does not store previous subsets of the data before processing new subsets. Specifically, at each iteration t = 1, 2, ..., T, we only process a sample matrix X_t that is smaller than the full matrix X and includes only N ≪ M mesoscale patches, where one can take N to be independent of the network size. Instead of the standard NMF algorithms for a fixed matrix [20], we use an ‘online’ NMF algorithm [30, 32] that applies to sequences of matrices, where the intermediate dictionary matrices W_t that we obtain by factorizing the sample matrices X_t typically improve as we iterate (see [30, 32]). In Algorithm 1, we give a full implementation of the NDL algorithm.

We now explain how the NDL algorithm works. It combines one of three MCMC algorithms — a pivot chain (in which we use Algorithm MP with AcceptProb = Exact), an approximate pivot chain (in which we use Algorithm MP with AcceptProb = Approximate), and a Glauber chain (in which we use Algorithm MG) — for motif sampling (see Section C) with the online NMF algorithm of [30]. Suppose that we have an undirected and binary graph G = (V, A) and a k-chain motif F = ([k], A_F). The requirement in Algorithm 1 that there exists at least one homomorphism F → G is satisfied as long as G has at least one edge, so we can find an initial homomorphism x_0: F → G by rejection sampling (see Algorithm A3). At each iteration t = 1, 2, ..., T, the motif-sampling algorithm generates a sequence x_s: F → G of N homomorphisms and corresponding mesoscale patches A_{x_s} (see Figure 5a), which we summarize as the k² × N data matrix X_t. The online NMF algorithm in (12) learns a nonnegative factor matrix W_t of size k² × r by improving the previous factor matrix W_{t−1} with respect to the new data matrix X_t. It is an ‘online’ NMF algorithm because it factorizes a sequence (X_t)_{t∈{1,...,T}} of data matrices, rather than a single matrix as in conventional NMF algorithms [20]. During this entire process, the algorithm only needs to store auxiliary matrices P_t and Q_t of fixed sizes r × r and r × k², respectively; it does not need the previous data matrices X_1, ..., X_{t−1}. Therefore, NDL is efficient in memory and scales well with network size. Moreover, NDL is applicable to time-dependent networks because of its online nature, although we do not pursue this direction in the present paper.
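As an illustration of the sampling stage, the following Python sketch assembles the k² × N data matrix X_t from N sampled homomorphisms. It assumes the identity mask, under which the mesoscale patch is simply the induced submatrix A_x(a, b) = A(x(a), x(b)) as in (2)–(3); the helper names are our own hypothetical choices.

```python
import numpy as np

def mesoscale_patch(A, x):
    """k x k mesoscale patch that the homomorphism x induces (identity mask),
    i.e., the submatrix with entries A_x(a, b) = A(x(a), x(b))."""
    return A[np.ix_(x, x)]

def data_matrix(A, homs):
    """Assemble the k^2 x N matrix X_t whose j-th column is the column-wise
    vectorization of the j-th sampled mesoscale patch."""
    return np.column_stack(
        [mesoscale_patch(A, x).flatten(order="F") for x in homs]
    )
```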
Algorithm 1. Network Dictionary Learning (NDL)
Input: network G = (V, A)
Parameters: F = ([k], A_F) (a motif), T ∈ ℕ (the number of iterations), N ∈ ℕ (the number of homomorphisms per iteration), r ∈ ℕ (the number of latent motifs), λ ≥ 0 (the coefficient of an L¹-regularizer)
Options: mask ∈ {Id, NF}, MCMC ∈ {Pivot, PivotApprox, Glauber}
Requirement: there exists at least one homomorphism F → G
Initialization:
  Sample a homomorphism x_0: F → G using rejection sampling (see Algorithm A3)
  W_0 ← k² × r matrix of independent entries that we sample uniformly from [0, 1]
  P_0 ← matrix of size r × r whose entries are 0
  Q_0 ← matrix of size r × k² whose entries are 0
For t = 1, 2, ..., T:
  MCMC update and sampling of mesoscale patches:
    Successively generate N homomorphisms x_{N(t−1)+1}, x_{N(t−1)+2}, ..., x_{Nt} by applying
      Algorithm MP with AcceptProb = Exact, if MCMC = Pivot
      Algorithm MP with AcceptProb = Approximate, if MCMC = PivotApprox
      Algorithm MG, if MCMC = Glauber
    For s = N(t − 1) + 1, ..., Nt:
      A_{x_s} ← k × k mesoscale patch of G that is induced by x_s (see (2)) with
        Φ_{F,x_s} = identity mask in (3), if mask = Id; no-folding mask in (4), if mask = NF
    X_t ← k² × N matrix whose j-th column is vec(A_{x_ℓ}) with ℓ = N(t − 1) + j (where vec(·) denotes the vectorization operator that we define in Algorithm A4)
  Single iteration of online nonnegative matrix factorization:
    H_t ← argmin_{H ∈ R^{r×N}_{≥0}} ‖X_t − W_{t−1}H‖_F² + λ‖H‖_1  (using Algorithm A1)
    P_t ← (1 − t^{−1}) P_{t−1} + t^{−1} H_t H_t^T
    Q_t ← (1 − t^{−1}) Q_{t−1} + t^{−1} H_t X_t^T
    W_t ← argmin_{W ∈ C^{dict} ⊆ R^{k²×r}_{≥0}} (tr(W P_t W^T) − 2 tr(W Q_t))  (using Algorithm A2),    (12)
  where C^{dict} = {W ∈ R^{k²×r}_{≥0} | the columns of W have Frobenius norm at most 1}
Output: network dictionary W_T ∈ R^{k²×r}_{≥0}
In (12), we solve convex optimization problems to find the matrices H_t ∈ R^{r×N}_{≥0} and W_t ∈ R^{k²×r}_{≥0}. The first subproblem in (12) is a coding problem. Given the two matrices X_t and W_{t−1}, we seek a factor matrix (i.e., a ‘code matrix’) H_t such that X_t ≈ W_{t−1}H_t. The parameter λ ≥ 0 is the coefficient of an L¹-regularizer, which encourages H_t to have a small L¹ norm. One can solve the coding problem efficiently by using Algorithm A1 or one of a variety of existing algorithms (e.g., LARS [6], LASSO [54], or feature-sign search [21]). The second and third lines in (12) update the ‘aggregate matrices’ P_{t−1} ∈ R^{r×r} and Q_{t−1} ∈ R^{r×k²} by taking a weighted average of them with the new information H_t H_t^T ∈ R^{r×r} and H_t X_t^T ∈ R^{r×k²}, respectively. We weight the old aggregate matrices by 1 − t^{−1} and the new information by t^{−1}. By induction, we obtain P_t = t^{−1} ∑_{s=1}^{t} H_s H_s^T and Q_t = t^{−1} ∑_{s=1}^{t} H_s X_s^T. We use the updated aggregate matrices, P_t and Q_t, in the last subproblem in (12). This subproblem is called the dictionary-update problem, and it yields W_t. It is a constrained quadratic problem, and we can solve it using projected gradient descent (see Algorithm A2).
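For concreteness, here is a minimal sketch of the dictionary-update subproblem in (12) via projected gradient descent, in the spirit of Algorithm A2 (whose precise form we do not reproduce here); the step size and the iteration count are illustrative choices rather than those of the actual implementation.

```python
import numpy as np

def update_dictionary(W, P, Q, n_iter=100, step=None):
    """Projected gradient descent for the dictionary-update problem in (12):
    minimize tr(W P W^T) - 2 tr(W Q) over nonnegative W whose columns have
    Euclidean norm at most 1.

    W : (k*k, r) current dictionary;  P : (r, r);  Q : (r, k*k)
    """
    if step is None:
        # 2*P is the Hessian of the objective in W; its spectral norm
        # bounds the Lipschitz constant of the gradient.
        step = 1.0 / (2 * np.linalg.norm(P, 2) + 1e-12)
    for _ in range(n_iter):
        grad = 2 * (W @ P - Q.T)
        W = np.maximum(W - step * grad, 0.0)   # project onto the nonnegative orthant
        norms = np.linalg.norm(W, axis=0)
        W = W / np.maximum(norms, 1.0)         # project columns into the unit ball
    return W
```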
[Figure 5 panels: an 11 × 11 network dictionary that we learn from the UCLA Facebook network, the original UCLA Facebook network, the reconstructed UCLA Facebook network, and a sequence of intermediate dictionaries (Dictionary 1, Dictionary 2, Dictionary 3, ..., and the limiting dictionary), in panels a–d.]

Figure 5.
Illustration of our network dictionary learning (NDL) algorithm (see Algorithm 1). (a) Homomorphisms x_t: F → G from a k-chain motif into a target network G evolve as a Markov chain to yield a sequence of k-chain subgraphs (the green edges) in G. (b) Each copy of the k-chain motif in G induces a k-node subgraph (i.e., the mesoscale patch A_{x_t} that we defined in (2)). (c) We form a sequence of matrices X_t of size k² × N, where the N columns of each X_t are vectorizations of the N consecutive k × k mesoscale patches in panel (b). (d) Using an online nonnegative matrix factorization (NMF) algorithm, we progressively learn the desired number of latent motifs as the data matrices X_t of mesoscale patches arrive.

In all of our experiments, we take the compact and convex constraint set C^{dict} ⊆ R^{k²×r}_{≥0} to be the set of matrices W ∈ R^{k²×r}_{≥0} whose columns have a Frobenius norm of at most 1 (as required in (6)).

D.2. Dominance scores of latent motifs.
In this subsection, we introduce a quantitative measurement of the ‘prevalence’ of the latent motifs in the network dictionary W_T that we compute using NDL (see Algorithm 1) for a network G.

Recall that the output of the NDL algorithm for a network G using a k-chain motif is a network dictionary W_T, which consists of r latent motifs L_1, ..., L_r of size k × k. Recall as well that the algorithm computes data matrices X_1, ..., X_T of size k² × N. Suppose that we have code matrices H*_1, ..., H*_T such that X_t ≈ W_T H*_t for all t ∈ {1, ..., T}. More precisely, we let

  H*_t = argmin_{H ≥ 0} (‖X_t − W_T H‖_F² + λ‖H‖_1),    (13)

where we take the argmin over all H ∈ R^{r×N}_{≥0}. The columns of H*_t encode how to nonnegatively combine the latent motifs in W_T to approximate the mesoscale patches in X_t ∈ R^{k²×N}_{≥0}, so the rows of H*_t encode the linear coefficients of each latent motif in W_T that we use for approximating the columns of X_t. Consequently, the mean inner products of the rows of H*_t encode the mean usage of the latent motifs in W_T in G. This motivates us to consider the following mean Gramian matrix [13]:

  P*_T := (1/T) ∑_{t=1}^{T} H*_t (H*_t)^T ∈ R^{r×r}.

We then can take the square roots of the diagonal entries of P*_T to obtain the mean prevalences of the latent motifs in W_T in G.
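In code, given the mean Gramian matrix as a NumPy array (either P*_T or, as we discuss next, the aggregate matrix P_T from Algorithm 1), the dominance scores are simply the square roots of its diagonal entries; the helper below is a hypothetical two-line sketch.

```python
import numpy as np

def dominance_scores(P):
    """Dominance scores from an r x r mean Gramian matrix P (e.g., P_T).

    Returns the scores and the indices of the latent motifs sorted from
    most dominant to least dominant."""
    scores = np.sqrt(np.diag(P))
    ranking = np.argsort(scores)[::-1]
    return scores, ranking
```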
Computing P*_T requires one to store the previous data matrices X_1, ..., X_T and to compute H*_1, ..., H*_T by solving (13) for each t ∈ {1, ..., T}; this is a very expensive computation. To address this issue, we instead use the aggregate matrix P_T that we compute as part of Algorithm 1, so it does not require any extra computation. Note that

  P_T = (1/T) ∑_{t=1}^{T} H_t H_t^T,

where H_t ∈ R^{r×N}_{≥0} is the code matrix that is given by H_t = argmin_{H ≥ 0} (‖X_t − W_{t−1}H‖_F² + λ‖H‖_1). Note that P_T is an approximation of P*_T because the defining equation of H_t is the same as that of H*_t in (13) with W_T replaced by W_{t−1}. However, the approximation error between P*_T and P_T vanishes as T → ∞ under mild conditions. Specifically, under the hypotheses of Theorems G.2 and G.5, W_t converges almost surely to some limiting dictionary. It follows that ‖P*_T − P_T‖_F → 0 almost surely as T → ∞.

[Figure 6 grid: the most dominant latent motifs for the networks Coronavirus PPI, SNAP FB, arXiv, H. sapiens, Caltech, MIT, UCLA, Harvard, ER₁, ER₂, WS₁, WS₂, BA₁, and BA₂, with one row per scale k.]

Figure 6.
The most dominant latent motifs that we learn from 14 networks (eight real-world networks and six synthetic networks, which are single instantiations of random-graph models) at five different scales (the smallest of which is k = 6). Using our NDL algorithm, we learn network dictionaries of r = 25 latent motifs with k nodes for each of the 14 networks. For each network at each scale, we show only the most dominant latent motif from each dictionary. We showed the associated second-most dominant latent motifs in Figure 2. Black squares indicate 1 entries, and white squares indicate 0 entries.

In Figure 2 of the main manuscript, we showed the second-most dominant latent motifs for twelve networks at the same five scales. In Figure 6, we show the most dominant latent motifs for the same sets of networks and scales. See Figures 8, 9, 10, and 14 in Section I for all 25 latent motifs (along with their dominance scores) of all of the networks in Figure 6 at all five scales. For many of these networks (e.g., Harvard, UCLA, arXiv, and the ER networks) and at all five scales, the most dominant latent motifs in Figure 6 are close to the adjacency matrix of the k-chain itself. In these examples, the entries in the first superdiagonal and the first subdiagonal overwhelm the rest of the entries. However, for all networks except the ER networks, the second-most dominant latent motifs in Figure 2 reveal more interesting mesoscale structures (e.g., communities in Harvard, MIT, UCLA, and arXiv and a bipartition of the edges that are not in the k-chain in Coronavirus PPI) than the most dominant latent motifs in Figure 6.
D.3. Influence of masking and MCMC algorithms on latent motifs. Our NDL algorithm (see Algorithm 1) uses mesoscale patches A_x with a mask that is either the identity mask (3) or the no-folding mask (4), but one can generalize it to any choice of mask. The original NDL algorithm [30, Algorithm 1] corresponds to Algorithm 1 with the option mask = Id (i.e., with the identity mask for the mesoscale patches). In Section B, we discussed how the no-folding mask improves the interpretability of the positive entries in the mesoscale patches. Consequently, it also improves the interpretability of the latent motifs that we learn from them using Algorithm 1.
[Figure 7 panels: four grids of latent motifs that correspond to the settings mask = NF with MCMC = PivotApprox, mask = NF with MCMC = Glauber, mask = Id with MCMC = PivotApprox, and mask = Id with MCMC = Glauber, in panels a–d.]

Figure 7.
Comparison of the r = 25 latent motifs that we learn from Coronavirus PPI using the NDL algorithm (see Algorithm 1) in four different settings, which arise from the choices mask ∈ {Id, NF} and MCMC ∈ {PivotApprox, Glauber}. The choices mask = Id and mask = NF indicate that the NDL algorithm uses mesoscale patches (2) with the identity mask (3) and the no-folding mask (4), respectively. The choices MCMC = PivotApprox and MCMC = Glauber indicate that the NDL algorithm uses the approximate pivot chain (see Algorithm MP with AcceptProb = Approximate) and the Glauber chain (see Algorithm MG), respectively. The other parameter values are λ = 1, N = 100, and T = 100. We use black squares for 1 entries and white squares for 0 entries. The numbers underneath the latent motifs give their dominance scores.

In Figure 7, we compare the r = 25 latent motifs that we learn from Coronavirus PPI using four different combinations of masks and MCMC motif-sampling algorithms. In Figure 7c,d, we see that the latent motifs that we learn with the identity mask have large off-chain entries that dominate the on-chain entries. (See Section B.3 for the definitions of on-chain and off-chain entries.) The most dominant latent motifs appear as complete bipartite networks, in which each node in one set is adjacent to all nodes in the other set of the bipartition. This is counterintuitive, as Coronavirus PPI has a very low edge density. By contrast, as we see in Figure 7a,b, the latent motifs that we learn with the no-folding mask (4) have comparatively sparser off-chain entries, which are dominated by the on-chain entries. However, by comparing Figure 7a with Figure 7b and Figure 7c with Figure 7d, we also see that the choice between the approximate pivot chain and the Glauber chain as the MCMC algorithm for Algorithm 1 may not affect the latent motifs as much as whether we use the identity mask or the no-folding mask.

Appendix E. Algorithms for Network Denoising and Reconstruction (NDR)
E.1. Algorithm overview and statement. The standard pipeline for image denoising and reconstruction [7, 33, 35] is to uniformly randomly sample a large number of k × k overlapping patches of an image and then average their associated approximations at each pixel to obtain a reconstructed version of the original image. A reasonable network analogue of this pipeline is as follows. Given a network G = (V, A), a motif F = ([k], A_F), and a network dictionary with latent motifs [L_1, ..., L_r], we uniformly randomly sample a large number T of homomorphisms x_t: F → G. To simplify the present discussion, we assume that each x_t uses k distinct nodes of G in its image V(t) := {x_t(a) ∈ V | a ∈ {1, ..., k}}. (We do not make this assumption elsewhere in the paper.) This yields T k-node subnetworks G(t) = (V(t), A(t)), where A(t) is the mesoscale patch A_{x_t} in (2). We then approximate each weight matrix A(t) by a nonnegative linear combination Â(t) of the latent motifs L_i. We then define a network G_recons = (V, A_recons), where we set A_recons(p, q) for each p, q ∈ V to be the mean of the approximate weights Â(t)(p, q) over all t ∈ {1, ..., T} such that p, q ∈ V(t).

Our network denoising and reconstruction (NDR) algorithm (see Algorithm 2) builds on the idea in the preceding paragraph. Suppose that we have a network G = (V, A), a motif F = ([k], A_F), and a network dictionary W that consists of r nonnegative k × k matrices L_1, ..., L_r. First, because uniformly randomly sampling a homomorphism x_t: F → G is not as straightforward as uniformly randomly sampling k × k patches of an image, we generate a sequence (x_t)_{t∈{1,...,T}} of homomorphisms using an MCMC motif-sampling algorithm (see Algorithms MP and MG). For each t ≥ 1, we approximate the mesoscale patch A_{x_t} (see (2)) by a nonnegative linear combination of the latent motifs L_i, and we then take the mean of the approximate values of each entry A(a, b) up to time t.
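One can maintain these running means without storing past patches: whenever a new approximate weight Â(t)(p, q) arrives for a node pair (p, q), one updates the current mean with the standard incremental formula, exactly as in the last step of Algorithm 2 below. A minimal sketch (which uses Python dictionaries in place of the matrices A_recons and A_count) is the following.

```python
def update_running_mean(A_recons, A_count, p, q, value):
    """Incrementally average the approximate weights for the node pair (p, q).

    A_recons, A_count : dicts that map node pairs to the running mean and to
                        the number of contributions, respectively
    value             : the new approximate weight for the pair (p, q)
    """
    j = A_count.get((p, q), 0) + 1
    A_count[(p, q)] = j
    mean = A_recons.get((p, q), 0.0)
    # (1 - 1/j) * old mean + (1/j) * new value, as in Algorithm 2
    A_recons[(p, q)] = (1 - 1 / j) * mean + value / j
    return A_recons[(p, q)]
```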
Algorithm 2. Network Denoising and Reconstruction (NDR)
Input: network G = (V, A) and network dictionary W ∈ R^{k²×r}_{≥0}
Parameters: F = ([k], A_F) (a motif), T ∈ ℕ (the number of iterations), λ ≥ 0 (the coefficient of an L¹-regularizer), θ ∈ [0, 1] (an edge threshold)
Options: denoising ∈ {T, F}, mask ∈ {Id, NF}, MCMC ∈ {Pivot, PivotApprox, Glauber}
Requirement: there exists at least one homomorphism F → G
Initialization:
  A_recons, A_count: V × V → {0} (matrices with 0 entries)
  Sample a homomorphism x_0: F → G by the rejection sampling in Algorithm A3
For t = 1, 2, ..., T:
  MCMC update and mesoscale-patch extraction:
    x_t ← updated homomorphism that we obtain by applying
      Algorithm MP with AcceptProb = Exact, if MCMC = Pivot
      Algorithm MP with AcceptProb = Approximate, if MCMC = PivotApprox
      Algorithm MG, if MCMC = Glauber
    A_{x_t} ← k × k mesoscale patch of G that is induced by x_t (see (2)) with
      Φ_{F,x_t} = identity mask in (3), if mask = Id; no-folding mask in (4), if mask = NF
    X_t ← k² × 1 matrix that we obtain by vectorizing A_{x_t} (using Algorithm A4)
  Mesoscale reconstruction:
    X̃_t ← X_t and W̃ ← W, if denoising = F
    X̃_t ← (X_t)_off and W̃ ← (W)_off using Algorithm 2a, if denoising = T
    H_t ← argmin_{H ∈ R^{r×1}_{≥0}} (‖X̃_t − W̃H‖_F² + λ‖H‖_1) and X̂_t ← W̃H_t
    Â_{x_t;W} ← k × k matrix that we obtain by reshaping the k² × 1 matrix X̂_t using Algorithm A5
  Update of the global reconstruction:
    For a, b ∈ {1, ..., k}:
      If (denoising = F or A_F(a, b) = 0) and Φ_{F,x_t}(a, b) = 1:
        A_count(x_t(a), x_t(b)) ← A_count(x_t(a), x_t(b)) + 1
        j ← A_count(x_t(a), x_t(b))
        A_recons(x_t(a), x_t(b)) ← (1 − j^{−1}) A_recons(x_t(a), x_t(b)) + j^{−1} Â_{x_t;W}(x_t(a), x_t(b))
Output: reconstructed networks G_recons = (V, A_recons) and G_recons(θ) = (V, 1(A_recons > θ))

As in Algorithm 1, we require in Algorithm 2 that there exists at least one homomorphism F → G. This condition is satisfied when F = ([k], A_F) is a chain motif and G has at least one edge; it holds for all of our experiments in the present paper. As in the first line of (12), the problem of finding H_t in the mesoscale-reconstruction step of Algorithm 2 is a standard convex problem, which one can solve by using Algorithm A1.

Algorithm 2a. Off-Chain Projection
Input: matrix Y ∈ R^{k²×m} and motif F = ([k], A_F)
Do:
  Let Y′ be the k × k × m tensor that we obtain by reshaping each column of Y using Algorithm A5
  Let Y′′ be the k × k × m tensor that we obtain from Y′ by Y′′(a, b, c) = Y′(a, b, c) · 1(A_F(a, b) = 0) for all a, b ∈ {1, ..., k} and c ∈ {1, ..., m}
  Let (Y)_off be the k² × m matrix that we obtain from Y′′ by vectorizing each of its slices Y′′[:, :, c] for c ∈ {1, ..., m} using Algorithm A4
Output: matrix (Y)_off ∈ R^{k²×m}

There are two variants of the NDR algorithm. The variant is specified by the Boolean variable denoising. The NDR algorithm with denoising = F is identical to the network-reconstruction algorithm in [30, Algorithm 2], except for the thresholding step. The NDR algorithm with denoising = T is a new variant of NDR that we introduce in the present work for the purpose of network denoising.
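In NumPy, the off-chain projection of Algorithm 2a amounts to zeroing out the entries (a, b) with A_F(a, b) ≠ 0 in each reshaped column. A minimal sketch, which assumes column-wise vectorization as in Algorithm A4, is the following.

```python
import numpy as np

def off_chain_projection(Y, A_F):
    """Zero out the on-chain entries of each vectorized k x k patch.

    Y   : (k*k, m) matrix whose columns are column-wise vectorized patches
    A_F : (k, k) adjacency matrix of the k-chain motif F
    """
    mask = (A_F == 0).flatten(order="F").astype(float)  # 1 on off-chain entries
    return Y * mask[:, None]
```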
E.2. Further discussion of the denoising variant of the NDR algorithm. We now give a detailed discussion of Algorithm 2 with denoising = T for network-denoising applications. Recall that the network-denoising problem that we consider is to reconstruct a true network G_true = (V, A) from an observed network G_obs = (V, A′). The scheme that we used to produce Figure 4 is the following:

D.1 Learn a network dictionary W ∈ R^{k²×r}_{≥0} from the observed network G_obs = (V, A′) using NDL (see Algorithm 1).
D.2 Compute a reconstructed network G_recons = (V, A_recons) using NDR (see Algorithm 2) with input G_obs = (V, A′) and W.
D.3 Fix an edge threshold θ ∈ [0, 1]. If G_obs is G_true with additive (respectively, subtractive) noise, we classify each edge (respectively, non-edge) (p, q) as ‘positive’ if and only if A_recons(p, q) > θ.

The NDR algorithm was first introduced in [30]. The version of the NDR algorithm from [30] is Algorithm 2 with the options mask = Id and denoising = F. It has a competitive performance for denoising −50% subtractive noise on the networks SNAP FB, H. sapiens, and arXiv in comparison to the methods
Spectral Clustering [41], DeepWalk [45], LINE [50], and node2vec [11]. In all of these methods, one first obtains a 128-dimensional vector representation of the nodes in a network; this is called a ‘node embedding’ of the network. One then uses this node embedding to compute vector representations of the edges using binary operations such as the Hadamard product. (See [11] for details.) One can then use a binary-classification algorithm (e.g., a support vector machine [42]) to attempt to detect the false edges. Spectral Clustering uses the leading eigenvectors of the normalized Laplacian matrix of G_obs to learn vector embeddings of the nodes. (See [51] for details.) The other three benchmark methods first generate sequences of nodes using random-walk sampling and then use a word-embedding technique (see, e.g., [39]) to learn a node embedding.

In Table 1, we compare the AUC scores of our NDL and NDR approach that we described at the beginning of Section E.2 for our network-denoising tasks to the AUC scores that one obtains using the above four existing methods. As was discussed in [30, Remark 4], a limitation of using NDR with mask = Id and denoising = F for network denoising is that one needs to invert what it means to classify successfully depending on whether the noise is additive or subtractive. Specifically, [30] used network denoising with NDR for additive noise, but it is then necessary to classify each non-edge (p, q) as ‘positive’ if A_recons(p, q) < θ. Therefore, the AUC scores for −50% noise in the fifth row of Table 1 are ‘flipped’, and the same is true for the sixth row of the table (for mask = NF and denoising = F). Our NDR algorithm with denoising = T addresses this directionality issue and allows us to use the unified classification scheme above for both additive and subtractive noise.

The idea behind NDR with denoising = T is to handle an issue that arises when one denoises additive noise in sparse real-world networks and that does not arise in the image-denoising setting. Suppose that we obtain G_obs by adding some false edges to a sparse binary network G_true. The on-chain entries of the mesoscale patches A_x are always equal to 1. Therefore, the latent motifs that we learn from G_obs have constant on-chain entries (see, e.g., Figures 2 and 6).
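As an illustration of step (D.3), the following sketch scores candidate node pairs with the reconstructed weights and computes the AUC with scikit-learn. The variable names are hypothetical; in the additive-noise setting, the candidates would be the edges of G_obs (with the true edges labeled 1), and in the subtractive-noise setting they would be the non-edges of G_obs (again with the true edges labeled 1).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def denoising_auc(A_recons, candidates, labels):
    """AUC of classifying candidate node pairs by their reconstructed weight.

    A_recons   : (n, n) reconstructed weight matrix from Algorithm 2
    candidates : list of node pairs (p, q) to classify
    labels     : 1 for pairs that belong to the true network, 0 otherwise
    """
    scores = np.array([A_recons[p, q] for (p, q) in candidates])
    return roc_auc_score(labels, scores)
```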
Algorithm                      |  SNAP FB        |  H. sapiens     |  arXiv
Noise                          |  +50%    −50%   |  +50%    −50%   |  +50%    −50%
Spectral Clustering            |  –       0.619  |  –       0.492  |  –       0.574
DeepWalk                       |  –       0.968  |  –       0.744  |  –       0.934
LINE                           |  –       0.949  |  –       0.725  |  –       0.890
node2vec                       |  –       0.968  |  –       0.772  |  –       0.934
NDL + NDR (mk = Id, dn = F)    |  0.845          |                 |
NDL + NDR (mk = NF, dn = F)    |  0.898          |                 |
NDL + NDR (mk = Id, dn = T)    |  0.943   0.980  |  0.677          |
NDL + NDR (mk = NF, dn = T)    |                 |                 |
Table 1. Area-under-the-curve (AUC) scores of the ROC curves for our network-denoising experiments using NDL (see Algorithm 1) and NDR (see Algorithm 2). As in Figure 4, we first use NDL to learn latent motifs from a corrupted network, and we then reconstruct the network using NDR to assign a confidence value to each potential edge. We use the networks SNAP FB, H. sapiens, and arXiv with −50% and +50% noise on the edges. In the last four rows, mk stands for the choice of mask in NDR and dn stands for the denoising option in NDR. We use mask = NF for NDL in each of the last four rows. For both NDL and NDR, we use MCMC = PivotApprox. The last row is a duplicate from Figure 4, and we obtain the results in rows five–seven using the same parameter choices as in Figure 4. For the last four rows, except for the six instances in italics (i.e., for dn = F and −50% noise), we use the classification scheme (D.3). For these six instances, we compute the AUC of the ‘flipped’ ROC curve, as the algorithm technically performs the opposite classification task. When dn = T, we do not flip the classification task. In the first four rows, we construct 128-dimensional vector representations of the networks using the indicated methods, and we then use them for edge classification.

Consequently, a linear approximation of the mesoscale patches A_x of G_obs by the latent motifs that we learn from G_obs cannot distinguish between true and false on-chain entries. Furthermore, because G_obs is sparse, there are many fewer positive off-chain entries in A_x than on-chain entries of A_x. Therefore, linear approximations of A_x using the latent motifs are likely to assign larger weights to reconstructing the on-chain entries of A_x than the off-chain entries. The resulting reconstruction of G_obs is thus similar to G_obs, and it is very hard to detect any false edges in G_obs. Using the option denoising = T prevents this issue by ignoring all on-chain entries, both for each sampled mesoscale patch A_x and for each latent motif in the network dictionary W that we use for denoising. For example, using denoising = T instead of denoising = F for SNAP FB and arXiv (see Table 1) yields a performance gain of about 10% for +50% noise.

Another issue arises when one uses NDR with denoising = F for denoising subtractive noise in sparse real-world networks.
In many of our experiments in this situation, our network-denoising scheme in (D.1)–(D.3) seems to give ‘flipped’ ROC curves that lie below the diagonal line y = x that represents the baseline ROC curve, and the AUC score is thus close to 0 instead of 1. Therefore, in the reconstructed network G_recons, false non-edges have larger weights than true non-edges. For the six instances in italics (for denoising = F and −50% noise) in Table 1, we give AUC scores after flipping the ROC curves, such that we classify a non-edge (p, q) as ‘positive’ if A_recons(p, q) < θ, which is the opposite of (D.3), which we used for additive noise in all cases. It is unfortunate to have to use the opposite classification scheme for different types of noise, especially when one may not know the type of noise in advance. We suspect that the sparsity of real-world networks may lead to such ‘opposite directionality’; this phenomenon requires further investigation. By contrast, the ROC curves for denoising = T are always above the diagonal in all experiments in Table 1 (as well as in Figure 4), regardless of whether the noise is additive or subtractive, and we always obtain the AUC scores in Table 1 using the scheme in (D.3).

Appendix F. Experimental details
F.1. Data sets. We now describe the eight real-world networks that we examined in the main manuscript:

(1) Caltech: This connected network, which is part of the Facebook100 data set [56] (and previously was studied as part of the Facebook5 data set [55]), has 762 nodes and 16,651 edges. Nodes represent users in the Facebook network of Caltech on one day in fall 2005, and edges represent Facebook ‘friendships’ between these users.
(2) MIT: This connected network, which is part of the Facebook100 data set [56], has 6,402 nodes and 251,230 edges. Nodes represent users in the Facebook network of MIT on one day in fall 2005, and edges represent Facebook ‘friendships’ between these users.
(3) UCLA: This connected network, which is part of the Facebook100 data set [56], has 20,453 nodes and 747,604 edges. Nodes represent users in the Facebook network of UCLA on one day in fall 2005, and edges represent Facebook ‘friendships’ between these users.
(4) Harvard: This connected network, which is part of the Facebook100 data set [56], has 15,086 nodes and 824,595 edges. Nodes represent users in the Facebook network of Harvard on one day in fall 2005, and edges represent Facebook ‘friendships’ between these users.
(5) SNAP Facebook (SNAP FB) [24]: This connected network has 4,039 nodes and 88,234 edges. This network is a Facebook network that has been used as a benchmark example for edge inference [11].
(6) arXiv ASTRO-PH (arXiv) [11, 23]: This network has 18,722 nodes and 198,110 edges. Its largest connected component has 17,903 nodes and 197,031 edges. We use the full network in our experiments. It is a collaboration network between authors of papers in astrophysics that were posted to the arXiv preprint server. Nodes represent scientists, and edges indicate coauthorship relationships.
(7) Coronavirus PPI (Coronavirus): This connected network, which was curated by BioGRID [10, 44, 52] from 142 publications and preprints, has 1,546 proteins that are related to coronaviruses and 2,481 protein–protein interactions (in the form of physical contacts) between them. We downloaded this data set on 24 July 2020. Among the 2,481 interactions, 1,546 are for SARS-CoV-2 and were reported by 44 publications and preprints; the rest are related to the coronaviruses that cause Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS).
(8) Homo sapiens PPI (H. sapiens) [11, 44, 53]: This network has 24,407 nodes and 390,420 edges. Its largest connected component has 24,379 nodes and 390,397 edges. We use the full network in our experiments. The nodes represent proteins in the organism Homo sapiens, and the edges represent physical interactions between these proteins.

We now describe the six synthetic networks that we examined in the main manuscript:

(9) ER₁ and ER₂: An Erdős–Rényi (ER) network [8, 40], which we denote by ER(n, p), is a random-graph model. The parameter n is the number of nodes, and the parameter p is the independent, homogeneous probability that each pair of distinct nodes has an edge between them. The networks ER₁ and ER₂ are individual graphs that we draw from ER(5000, p) with two different choices of the connection probability p.
(10) WS₁ and WS₂: A Watts–Strogatz (WS) network, which we denote by WS(n, k, p), is a random-graph model for studying the small-world phenomenon [40, 57]. In the version of WS networks that we use, we start with an n-node ring network in which each node is adjacent to its k nearest neighbors. With independent probability p, we then remove each edge and rewire it to a pair of distinct nodes that we choose uniformly at random. The networks WS₁ and WS₂ are individual graphs that we draw from the WS model with n = 5000 nodes and two different parameter choices.
(11) BA₁ and BA₂: A Barabási–Albert (BA) network, which we denote by BA(n, m), is a random-graph model with a linear preferential-attachment mechanism [2, 40]. In the version of BA networks that we use, we start with m isolated nodes, and we introduce new nodes with m new edges each that attach preferentially (with a probability that is proportional to node degree) to existing nodes until we have a total of n nodes. The networks BA₁ and BA₂ are individual graphs that we draw from BA(5000, m) with two different choices of m.

F.2. Figures 2, 6, 8, 9, 10, 11, 12, 13, and 14.
These figures give latent motifs of the networks that we described in Section F.1, which we compute using Algorithm 1 with various parameter choices. In all of these figures, we use a chain motif with the corresponding network F = ([k], A_F) for T = 100 iterations, N = 100 homomorphisms per iteration, an L¹-regularizer with coefficient λ = 1, a mask of mask = Id, and an MCMC motif-sampling algorithm of MCMC = Pivot. We specify the number r of latent motifs and the scale k in the caption of each figure.

F.3. Figure 1.
In this figure, we illustrate latent motifs that we learn from UCLA and Caltech, and we compare them to an image dictionary. We use the following parameters in Algorithm 1 to generate these results: a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = 100 iterations, N = 100 homomorphisms per iteration, r = 25 latent motifs, an L¹-regularizer with coefficient λ = 1, a mask of mask = Id, and an MCMC motif-sampling algorithm of MCMC = Pivot. The pivot chain in Algorithm MP uses AcceptProb = Approximate. The image dictionary for the artwork Cycle in Figure 1 uses an algorithm that is similar to Algorithm 1, except that we uniformly randomly sample 21 × 21 square patches of the image instead of k × k mesoscale patches of a network.

F.4. Figure 7.
This figure compares the r = 25 latent motifs at scale k = 21 that we learn from Coronavirus PPI using Algorithm 1 with the MCMC motif-sampling algorithms MCMC ∈ {Glauber, PivotApprox} and the masks mask ∈ {Id, NF}. We specify the parameters in Algorithm 1 that we use to generate the results in Figure 7 in the caption of that figure.

F.5. Figure 3.
To generate Figure 3, we first apply the NDL algorithm (see Algorithm 1) to each network that we consider in the figure to learn r = 25 latent motifs using a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = 100 iterations, N = 100 homomorphisms per iteration, an L¹-regularizer with coefficient λ = 1, a mask of mask = NF, and MCMC = PivotApprox. For each self-reconstruction X ← X (see the caption of Figure 3), we apply the NDR algorithm (see Algorithm 2) with a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = ⌊n ln n⌋ iterations (where n is the number of nodes in the network), N = 100 homomorphisms per iteration, r = 25 latent motifs, an L¹-regularizer with coefficient λ = 0 (i.e., no regularization), a mask of mask = Id, an MCMC motif-sampling algorithm of MCMC = PivotApprox, and denoising = F. For each cross-reconstruction Y ← X (see the caption of Figure 3), we apply the NDR algorithm (see Algorithm 2) with a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = ⌊n ln n⌋ time steps (where n is the number of nodes in the network), N = 100 homomorphisms, an edge-threshold value of θ = 0.5, an L¹-regularizer with coefficient λ = 0 (i.e., no regularization), a mask of mask = NF, an MCMC motif-sampling algorithm of MCMC = PivotApprox, and denoising = F. We use multiple different choices of the number r of latent motifs; we indicate them in the caption of Figure 3.

In the main manuscript, we mentioned that the following claims follow from the reconstruction accuracies that we reported in Figure 3 in conjunction with the latent motifs in Figures 1, 2, 8, and 10. We now provide their justifications.
(1) The mesoscale structure of Caltech is rather different from those of Harvard, UCLA, and MIT at scale k = 21.

• In Figure 3c, we observe that the accuracy of the cross-reconstruction Y ← X is consistently higher for X ∈ {UCLA, Harvard, MIT} than for X = Caltech for all values of r. For instance, at r = 9, we can reconstruct UCLA with more than 90% accuracy, and we can reconstruct Harvard and MIT with more than 80% accuracy. However, the latent motifs that we learn from Caltech for r = 9 yield substantially lower accuracies for reconstructing UCLA, Harvard, and MIT. This indicates that Caltech has a significantly different mesoscale structure than the other three universities’ Facebook networks at scale k = 21. Indeed, from Figures 1, 2, 6, and 8, we see that the r = 25 latent motifs of Caltech at scale k = 21 have larger off-chain entries than those of UCLA, MIT, and Harvard.

(2) The mesoscale structure of Caltech at scale k = 21 has a higher dimension than those of the other three universities’ Facebook networks.

• Consider the cross-reconstructions Caltech ← X for X ∈ {UCLA, Harvard, MIT} in Figure 3b. The accuracies at r = 9 are low, even for the self-reconstruction that uses the latent motifs that we learn from Caltech itself. By contrast, the self-reconstruction accuracies X ← X for the Facebook networks of the other universities are noticeably higher. In other words, r = 9 latent motifs at scale k = 21 cannot approximate the mesoscale structures of Caltech as well as those of the other three universities’ Facebook networks. This indicates that the dimension of the mesoscale structures of Caltech at scale k = 21 is larger than those of the other three universities’ Facebook networks.

(3) The networks BA₁ and BA₂ are better than ER₁, ER₂, WS₁, and WS₂ at capturing the mesoscale structures of MIT, Harvard, and UCLA at scale k = 21.

• From the reconstruction accuracies for Y ← X in Figure 3b,c, where X is one of the six synthetic networks (ER_i, WS_i, and BA_i for i ∈ {1, 2}), we observe that the two BA networks have higher accuracies than the networks from the ER and WS models for Y ∈ {UCLA, Harvard, MIT}. This suggests that the mesoscale structures of UCLA, Harvard, and MIT may be more similar in some respects to those of the BA_i than to those of the ER_i and WS_i. The latent motifs of the BA networks in Figures 2 and 8 at k = 21 have characteristics that we also observe in UCLA, Harvard, and MIT. (Specifically, they have hub nodes and off-chain entries that are much smaller (so they are lighter in color) than the on-chain entries.) By contrast, in Figure 14, we see that the latent motifs for the ER networks have sparse but seemingly randomly distributed off-chain connections and that the ones for the WS networks have strongly interconnected communities (see the diagonal blocks of black entries). These patterns differ from the ones that we observe in the latent motifs of UCLA, MIT, and Harvard (see Figure 8).

(4) If we uniformly sample a walk of k = 21 nodes, then it is more likely that there are large communities in the induced subnetwork for Caltech than is the case for UCLA, Harvard, and MIT.

• From the reconstruction accuracies for Y ← X in Figure 3b,c, where X is one of the six synthetic networks (ER_i, WS_i, and BA_i for i ∈ {1, 2}), we observe that the WS networks outperform both the BA and ER networks in reconstructing Caltech, but they are among the lowest-performing networks for reconstructing the Facebook networks of the other three universities. In other words, nonnegative linear combinations of the latent motifs of the WS_i can better approximate the mesoscale patches of Caltech than they can those of UCLA, Harvard, and MIT. Recall that most latent motifs of the WS_i at scale k = 21 have large blocks of black entries (see Figure 10); these blocks correspond to adjacency matrices of communities. Therefore, such community structure should be more likely to occur in subnetworks that are induced from uniform samples of k = 21-node walks in Caltech than from such samples in UCLA, Harvard, or MIT.

F.6. Figure 4.
To generate Figure 4, we first apply the NDL algorithm (see Algorithm 1) to each corrupted network that we consider in the figure to learn r = 25 latent motifs using a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = 400 iterations, N = 1000 homomorphisms per iteration, an L¹-regularizer with coefficient λ = 1, a mask of mask = NF, and an MCMC motif-sampling algorithm of MCMC = PivotApprox. The NDR algorithm (see Algorithm 2) that we use to generate the results in Figure 4 uses r = 25 latent motifs with a chain motif with the corresponding network F = ([k], A_F), a scale of k = 21, T = 400,000 iterations for H. sapiens and T = 200,000 iterations for all other networks, an L¹-regularizer with coefficient λ = 1, a mask of mask = NF, MCMC = PivotApprox, and denoising = T. For Figure 4, we did not conduct the denoising experiment for Coronavirus PPI with −50% noise because the resulting network (with 1,536 nodes and 1,232 edges) cannot be connected. (To be connected, its spanning trees need to have 1,535 edges.)

We obtain the AUC scores in Figure 4 for the methods node2vec [11], DeepWalk [45], and LINE [50] for the task of denoising subtractive noise for SNAP Facebook, H. sapiens, and arXiv from [11]. In [11], the ROCs were computed from a ‘balanced test set’ that was chosen uniformly at random from all sets that include all of the |E|/2 false non-edges and |E|/2 true non-edges, where |E| denotes the number of edges in the network. By contrast, we compute our ROCs in Figure 4 using all of the |E|/2 false non-edges and all of the true non-edges, of which there are often very many. For instance, recall that arXiv has |V| = 18,722 nodes and |E| = 198,110 edges. Therefore, after deleting |E|/2 edges to create a corrupted network, there are |E|/2 = 99,055 false non-edges and |V|(|V| − 1)/2 − |E| = 175,049,171 true non-edges. Such an imbalance in a data set does not affect the ROCs and hence does not affect the AUCs. Consequently, the true-positive rates and false-positive rates do not change even if we compute them after independently sampling |E|/2 true non-edges. Therefore, our results are directly comparable to the AUCs in [11] after uniformly sampling a balanced test set that consists of |E|/2 true non-edges and |E|/2 false non-edges.

Appendix G. Convergence Analysis
In this section, we give rigorous convergence guarantees for our main algorithms for NDL and NDR. In [30, Corollary 6.1], Lyu et al. obtained a convergence guarantee for the original NDL algorithm ([30, Algorithm 1]) for non-bipartite networks G with the MCMC motif-sampling algorithms MCMC ∈ {Pivot, Glauber}. Theorem G.2(i) in the present paper establishes the same result for the NDL algorithm (see Algorithm 1) that uses mesoscale patches (2) with either the identity mask (3) or the no-folding mask (4). Theorem G.2(ii) gives a similar convergence result for the NDL algorithm with the approximate pivot chain (i.e., for MCMC = PivotApprox). Theorem G.5 establishes a similar convergence result for NDL for bipartite networks G. Theorems G.7 and G.10 establish convergence and error bounds for the NDR algorithm (see Algorithm 2). Except for Theorem G.2(i), all of these theoretical results are novel results of the present paper.

Let F = ([k], A_F) be the k-chain motif, and let G = (V, A) be a network. Let Ω ⊆ V^{[k]} denote the set of all homomorphisms x: F → G. Algorithm 1 generates three stochastic sequences. The first one is the sequence (x_t)_{t≥0} of homomorphisms F → G that we generate using the pivot chain (see Algorithm MP). The second one is the sequence (X_t)_{t≥1} of k² × N data matrices whose columns encode N mesoscale patches of G. More precisely, for each y_1, ..., y_N ∈ Ω, we write Ψ(y_1, ..., y_N) ∈ R^{k²×N}_{≥0} for the k² × N matrix whose i-th column is the vectorization (using Algorithm A4) of the corresponding k × k mesoscale patch A_{y_i} of G. For each y ∈ Ω, define

  X^{(N)}(y) := Ψ(y_1, ..., y_N) ∈ R^{k²×N}_{≥0},

where we generate y_1, ..., y_N using the pivot chain when it starts at y. It then follows that X_t = X^{(N)}(x_{Nt}) for each t ≥ 1, where x_{Nt} is the state of the pivot chain at time Nt. For the third (and final) sequence that we generate using Algorithm 1, let (W_t)_{t≥0} denote the sequence of dictionary matrices, where we define each W_t = W_t(x_0) via (12) with an initial homomorphism x_0: F → G that we sample using Algorithm A3.

G.1. Convergence of the approximate pivot chain.
We prove Proposition G.1, which states the convergence of the approximate pivot chain and gives an explicit formula for its unique stationary distribution.

Proposition G.1. Fix a network G = (V, A) and the k-chain motif F = ([k], A_F). Let (x_t)_{t≥0} denote a sequence of homomorphisms x_t: F → G that we generate using the approximate pivot chain (in which we use Algorithm MP with AcceptProb = Approximate). Suppose that

(a) the weight matrix A is ‘bidirectional’ (i.e., A(a, b) > 0 implies that A(b, a) > 0 for all a, b ∈ V) and the undirected and binary graph (V, 1(A > 0)) is connected and non-bipartite.

It then follows that (x_t)_{t≥0} is an irreducible and aperiodic Markov chain with the unique stationary distribution π̂_{F→G} that we defined in (11).

Proof. We follow the proof of [29, Thm. 5.8]. Let P: V × V → [0, 1] be the matrix with entries

  P(a, b) := A(a, b) / ∑_{c∈V} A(a, c),  a, b ∈ V.

This is the transition matrix of the standard random walk on the network G. By hypothesis (a), P is irreducible and aperiodic. Additionally, it has the unique stationary distribution (see [25, Ch. 9])

  π^{(1)}(v) := ∑_{c∈V} A(v, c) / ∑_{c,c′∈V} A(c, c′).

The approximate pivot chain generates a move x_t(1) ↦ x_{t+1}(1) of the pivot according to the distribution P(x_t(1), ·). We accept this move of the pivot, independently of everything else, with the approximate acceptance probability α in (8). If we were to always accept each move of the pivot, then the pivot would perform a random walk on G with the unique stationary distribution π^{(1)}. We compute the acceptance probability α using the Metropolis–Hastings algorithm (see [25, Sec. 3.3]), and we thereby modify the stationary distribution of the pivot from π^{(1)} to the uniform distribution on V. (See the discussion in [29, Sec. 5].) Therefore, (x_t(1))_{t≥0} is an irreducible and aperiodic Markov chain on V that has the uniform distribution as its unique stationary distribution. Because we sample the locations x_{t+1}(i) ∈ V of the subsequent nodes i = 2, 3, ..., k independently, conditional on the location x_{t+1}(1) of the pivot, it follows that the approximate pivot chain (x_t)_{t≥0} is also an irreducible and aperiodic Markov chain with a unique stationary distribution, which we denote by π̂_{F→G}.

To compute the stationary distribution π̂_{F→G}, we decompose (x_t)_{t≥0} into the return times of the pivot x_t(1) to a fixed node x_1 ∈ V of G. Specifically, let τ(j) be the j-th return time of x_t(1) to x_1. By the independence of sampling x_t(2), ..., x_t(k) for each t, the strong law of large numbers yields

  lim_{M→∞} (1/M) ∑_{j=1}^{M} 1(x_{τ(j)}(2) = x_2, ..., x_{τ(j)}(k) = x_k) = ∏_{i=2}^{k} A(x_{i−1}, x_i) / ∑_{y_2,...,y_k∈V} A(x_1, y_2) A(y_2, y_3) ⋯ A(y_{k−1}, y_k).

For each fixed homomorphism x: F → G, i ↦ x_i, we use the Markov-chain ergodic theorem (see, e.g., [5, Theorem 6.2.1 and Example 6.2.4] or [38, Theorem 17.1.7]) to obtain

  π̂_{F→G}(x) = lim_{N→∞} (1/N) ∑_{t=0}^{N} 1(x_t = x)
             = lim_{N→∞} [∑_{t=0}^{N} 1(x_t = x) / ∑_{t=0}^{N} 1(x_t(1) = x_1)] · [(1/N) ∑_{t=0}^{N} 1(x_t(1) = x_1)]
             = P(x_t(2) = x_2, ..., x_t(k) = x_k | x_t(1) = x_1) · (1/|V|)
             = [∏_{i=2}^{k} A(x_{i−1}, x_i) / ∑_{y_2,...,y_k∈V} A(x_1, y_2) A(y_2, y_3) ⋯ A(y_{k−1}, y_k)] · (1/|V|).

This proves the assertion. ∎
G.2. Convergence of the NDL algorithm. Recall the problem statement for NDL in (6). Informally, we seek to learn r latent motifs L_1, ..., L_r ∈ R^{k×k}_{≥0} that minimize the expectation of the error of approximating the mesoscale patch A_x by a nonnegative combination of the motifs L_i, where x: F → G is a random homomorphism that we sample from the distribution π_{F→G} in (1). We reformulate this problem as the following matrix-factorization problem, which generalizes (6). Let C^{dict} denote the set of all matrices W ∈ R^{k²×r}_{≥0} whose columns have a Frobenius norm of at most 1. The matrix-factorization problem is then

  argmin_{W ∈ C^{dict} ⊆ R^{k²×r}_{≥0}} ( f(W) := E_{x∼π_{F→G}} [ ℓ(X^{(N)}(x), W) ] ),    (14)

where we define the loss function

  ℓ(X, W) := inf_{H ∈ R^{r×N}_{≥0}} ‖X − WH‖_F² + λ‖H‖_1,  X ∈ R^{k²×N}, W ∈ R^{k²×r}.    (15)

The parameters N ∈ ℕ and λ ≥ 0 appear in Algorithm 1. The former is the number of homomorphisms that we sample at each iteration of Algorithm 1, and the latter is the coefficient of the L¹-regularizer that we use to find the code matrix H_t in (12). In the special case in which N = 1 and λ = 0, the problem (14) is equivalent to the problem (6) because X^{(1)}(x) and the columns of W are vectorizations (using Algorithm A4) of the mesoscale patch A_x and the latent motifs L_1, ..., L_r, respectively.

Theorems G.2 and G.5 imply that our NDL algorithm (see Algorithm 1) finds a sequence (W_t)_{t≥0} of dictionary matrices such that, almost surely, W_t is asymptotically a stationary point of the objective function f in the optimization problem (14). The objective function f is non-convex, so it is generally difficult to find a global optimum of f. In practice, however, such stationary points have often been good enough for practical applications, such as image restoration [7, 35]. We find that this is also the case for our network-denoising problem (see Figure 4).

We are now ready to state our first convergence result for the NDL algorithm (see Algorithm 1).

Theorem G.2 (Convergence of the NDL Algorithm for Non-Bipartite Networks). Let F = ([k], A_F) be the k-chain motif, and let G = (V, A) be a network that satisfies the following properties:

(a) The weight matrix A is ‘bidirectional’ (i.e., A(a, b) > 0 implies that A(b, a) > 0 for all a, b ∈ V), and the undirected and binary graph (V, 1(A > 0)) is connected and non-bipartite.
(b) For all t ≥ 1, there exists a unique solution H_t in (12).
(c) For all t ≥ 1, the eigenvalues of the positive semidefinite matrix P_t that we define in (12) are at least as large as some constant κ > 0.

Let (W_t)_{t≥0} denote the sequence of dictionary matrices that we generate using Algorithm 1. The following claims hold:
(i) For MCMC ∈ {Pivot, Glauber}, we have almost surely as t → ∞ that W_t converges to the set of stationary points of the objective function f that we defined in (14). Furthermore, if f has finitely many stationary points in C^{dict}, then W_t converges to a single stationary point of f almost surely as t → ∞.

(ii) For MCMC = PivotApprox, we have almost surely as t → ∞ that W_t converges to the set of stationary points of the objective function

  f̂(W) := E_{x∼π̂_{F→G}} [ ℓ(X^{(N)}(x), W) ],

where the distribution π̂_{F→G} is defined in (11). Furthermore, if f̂ has finitely many stationary points in C^{dict}, then W_t converges to a single stationary point of f̂ almost surely as t → ∞.
Remark G.3. Assumptions (a)–(c) in Theorem G.2 are all reasonable and are easy to satisfy. Assumption (a) is satisfied if G is undirected, binary, and connected, which is the case for all of our examples in the present paper. Assumptions (b) and (c) are standard assumptions in the study of online dictionary learning [30, 31, 32]. For instance, (b) is a common assumption in methods such as least-angle regression (LARS) [6] that aim to find good solutions to problems of the form (15). Additionally, in practice, one can verify (c) experimentally after a few iterations of Algorithm 1 for a reasonable choice of the initial dictionary (e.g., r samples of mesoscale patches). See [32, Sec. 4.1] and [30, Sec. 4.1] for more detailed discussions of these assumptions.
Remark G.4. It is also possible to slightly modify both the optimization problem (14) and our NDL algorithm so that Theorem G.2 holds for the modified problem and algorithm without needing to assume (b) and (c). The modified problem is

  argmin_{W ∈ C^{dict} ⊆ R^{k²×r}_{≥0}} ( E_{x∼π} [ inf_{H ∈ R^{r×N}_{≥0}} ‖X^{(N)}(x) − WH‖_F² + λ‖H‖_1 + λ′‖H‖_F² + κ‖W‖_F² ] ),    (16)

where π = π̂_{F→G} if MCMC = PivotApprox and π = π_{F→G} otherwise. Note that (16) is the same as (14) with additional quadratic penalization terms for both H and W in the loss function ℓ that we defined in (15). Correspondingly, consider the modification of the NDL algorithm (see Algorithm 1) in which the objective function for H_t in (12) has the additional term λ′‖H‖_F² and we replace P_t in (12) by P_t + κI. Assuming that λ′, κ > 0, the modified objective function for H_t is strictly convex, so H_t is uniquely defined. Therefore, it satisfies the uniqueness condition (b) in Theorem G.2. Additionally, the smallest eigenvalue of each matrix P_t + κI that we compute using the modified NDL algorithm has a lower bound of κ for all t, so it satisfies condition (c) in Theorem G.2. One can then show that the statement of Theorem G.2 holds for the modified problem (16) and the modified NDL algorithm without assumptions (b) and (c). The argument, which we omit, is almost identical to the one for Theorem G.2.

Proof of Theorem G.2. The proof of the first part of (i) is identical to the proof of [30, Corollary 6.1]. For the first part of (ii), we can use the same essential argument as in the proof of [30, Corollary 6.1]. However, because our assertion is for the approximate pivot chain that we propose in the present article (see Algorithm MP with
AcceptProb = Approximate ), we need to use Proposition G.1 (instead of [29, Prop.5.8]) to establish irreducibility and convergence of our Markov chain. The proof of the second parts of both (i) and (ii) are identical.We give a detailed proof of (ii) . Let π = ˆ π F →G if MCMC = PivotApprox and π = π F →G otherwise (see(1) and (11)). We define ( x t ) t ≥ , ( X t ) t ≥ , and ( W t ) t ≥ as before. We use a general convergence result foronline NMF for Markovian data [30, Theorem 4.1]. We first observe that the matrices X t ∈ R k × N ≥ that we compute in line 15 of Algorithm 1 do notnecessarily form a Markov chain, because the forward evolution of the Markov chain depends both on theinduced mesoscale patches and on the actual homomorphisms ( x s ) N ( t −
Define the map $\mathrm{Flip} \colon \mathbb{R}^{k^2 \times r} \to \mathbb{R}^{k^2 \times r}$ that maps $W \mapsto \overline{W}$, where the $j$th column $\overline{W}(:, j)$ of $\overline{W}$ is defined by
\[
\overline{W}(:, j) := \mathrm{vec} \circ \mathrm{rev} \circ \mathrm{reshape}\,(W(:, j))\,, \qquad j \in \{1, \ldots, r\}\,,
\]
where $W(:, j)$ denotes the $j$th column of $W$, $\mathrm{reshape} \colon \mathbb{R}^{k^2} \to \mathbb{R}^{k \times k}$ is the reshaping operator that we defined in Algorithm A5, $\mathrm{rev}$ maps a $k \times k$ matrix $K$ to the $k \times k$ matrix $(\overline{K}_{ab})_{1 \leq a, b \leq k}$ with entries $\overline{K}_{ab} = K(k - a + 1, k - b + 1)$, and $\mathrm{vec}$ denotes the vectorization operator in Algorithm A4. Applying Flip twice gives the identity map.
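In code, Flip amounts to reversing both axes of each column after a column-major reshape. A short Python sketch (our own illustration):

    import numpy as np

    def flip(W, k):
        # Apply vec o rev o reshape (Algorithms A4 and A5) to each column.
        W_bar = np.empty_like(W)
        for j in range(W.shape[1]):
            K = W[:, j].reshape((k, k), order="F")   # reshape
            K_rev = K[::-1, ::-1]                    # rev: (a, b) -> (k-a+1, k-b+1)
            W_bar[:, j] = K_rev.flatten(order="F")   # vec
        return W_bar

    k, r = 5, 3
    W = np.random.default_rng(2).random((k * k, r))
    assert np.allclose(flip(flip(W, k), k), W)   # applying Flip twice is the identity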
Theorem G.5 (Convergence of the NDL Algorithm for Bipartite Networks). Let $F = ([k], A_F)$ be the $k$-chain motif, and let $G = (V, A)$ be a network that satisfies the following properties:
(a') $A$ is symmetric and the undirected and binary graph $(V, \mathbf{1}(A > 0))$ is connected and bipartite.
(b) For all $t \geq 1$, there exists a unique solution $H_t$ in (12).
(c) For all $t \geq 1$, the eigenvalues of the positive semidefinite matrix $A_t$ in (12) are at least as large as some constant $\kappa > 0$.
Let $(W_t)_{t \geq 1}$ denote the sequence of dictionary matrices that we generate using Algorithm 1. We then have the following properties:
(i) Suppose that MCMC ∈ {Pivot, Glauber}. For each $i \in \{1, 2\}$, conditional on $x_0 \in \Omega_i$, the sequence $W_t$ of dictionary matrices converges almost surely as $t \to \infty$ to the set of stationary points of the associated conditional expected loss function $f^{(i)}$ that we defined in (18). If MCMC = PivotApprox, then the same statement holds with $f^{(i)}$ replaced by the function $\hat{f}^{(i)}$ that we defined in (19). If we also assume that $f^{(i)}$ (respectively, $\hat{f}^{(i)}$), with $i \in \{1, 2\}$, has only finitely many stationary points in $\mathcal{C}^{\mathrm{dict}}$, then it follows that $W_t$ converges to a single stationary point of $f^{(i)}$ (respectively, $\hat{f}^{(i)}$) almost surely as $t \to \infty$.
(ii) Suppose that
MCMC = Glauber in Algorithm 1 and that $k$ is even. Assume that $x_0 \in \Omega_1$. It then follows that, almost surely, the sequences of dictionary matrices $W_t$ and $\overline{W}_t := \mathrm{Flip}(W_t)$ converge simultaneously to the sets of stationary points of the expected loss functions $f^{(1)}$ and $f^{(2)}$, respectively. Moreover, $f^{(1)}(W_t) = f^{(2)}(\overline{W}_t)$ for all $t \geq 1$. If we also assume that $f^{(i)}$ (with $i \in \{1, 2\}$) has only finitely many stationary points in $\mathcal{C}^{\mathrm{dict}}$, then it follows that both $W_t$ and $\overline{W}_t$ converge to single stationary points of $f^{(1)}$ and $f^{(2)}$, respectively, almost surely as $t \to \infty$.

Proof. We first prove (i). Fix $i \in \{1, 2\}$, and recall the conditional stationary distribution $\pi^{(i)}_{F \to G}$ from (17). Conditional on $x_0 \in \Omega_i$, the Markov chain $(x_t)_{t \geq 0}$ is irreducible and aperiodic with a unique stationary distribution $\pi^{(i)}_{F \to G}$. Recall that the conclusion of Theorem G.2 holds as long as the underlying Markov chain is irreducible. Therefore, $W_t$ converges almost surely to the set of stationary points of the associated conditional expected loss function $f^{(i)}$ that we defined in (18). The same argument verifies the case in which MCMC = PivotApprox.

We now verify (ii). Define the notation $\mu_i := \pi^{(i)}_{F \to G}$ for $i \in \{1, 2\}$, and suppose that $k$ is even. For each homomorphism $x \colon F \to G$, we define a map $\bar{x} \colon [k] \to V$ by $\bar{x}(j) := x(k - j + 1)$ for all $j \in \{1, \ldots, k\}$. For even $k$, we have that $x \in \Omega_1$ if and only if $\bar{x} \in \Omega_2$. Because $A$ is symmetric, it follows that
\[
\prod_{j=1}^{k-1} A(x(j), x(j+1)) = \prod_{j=1}^{k-1} A(x(j+1), x(j)) = \prod_{j=1}^{k-1} A(\bar{x}(j), \bar{x}(j+1))\,.
\]
Therefore, $Z_1 = Z_2 = Z/2$. Consequently, for each $x \in \Omega_1$, (17) implies that
\[
\mu_1(x) = \mu_2(\bar{x}) = 2\,\pi_{F \to G}(x)\,. \quad (20)
\]
Consider two Glauber chains, $(x_t)_{t \geq 0}$ and $(x'_t)_{t \geq 0}$, where $x_0 = y$ and $x'_0 = \bar{y}$. We evolve these two Markov chains using a common source of randomness so that individually they have Glauber-chain trajectories but they also satisfy the relation $x'_t = \bar{x}_t$ for all $t \geq 0$. (This is typically called a 'coupling argument' in the probability literature; see [25, Sec. 4.2].) Specifically, suppose that $x'_t = \bar{x}_t$. For each update $x_t \mapsto x_{t+1}$ and $x'_t \mapsto x'_{t+1}$, we choose a node $v \in [k]$ uniformly at random and sample $z \in V$ according to the conditional distribution (10). We define
\[
x_{t+1}(v) = z \quad \text{and} \quad x_{t+1}(u) = x_t(u) \text{ for } u \neq v\,,
\]
\[
x'_{t+1}(k - v + 1) = z \quad \text{and} \quad x'_{t+1}(u) = x'_t(u) \text{ for } u \neq k - v + 1\,.
\]
We then have that $x_t \mapsto x_{t+1}$ follows the Glauber-chain update in Algorithm MG. We also have the desired relation $x'_{t+1} = \bar{x}_{t+1}$ because
\[
x'_{t+1}(k - v + 1) = z = x_{t+1}(v) = \bar{x}_{t+1}(k - v + 1)\,,
\]
\[
x'_{t+1}(u) = x'_t(u) = \bar{x}_t(u) = x_t(k - u + 1) = x_{t+1}(k - u + 1) = \bar{x}_{t+1}(u) \quad \text{for } u \neq k - v + 1\,.
\]
Finally, we need to verify that $x'_t \mapsto x'_{t+1}$ also follows the Glauber-chain update in Algorithm MG. It suffices to check that $z \in V$ has the same distribution as $x'_{t+1}(k - v + 1)$. Because $v$ is uniformly distributed on $[k]$, so is $k - v + 1$. The distribution of $z \in V$ is determined by
\[
p(z) \propto
\begin{cases}
A(z, x_t(2)) = A(z, \bar{x}_t(k - 1)) & \text{if } v = 1\,, \\
A(x_t(v - 1), z)\, A(z, x_t(v + 1)) = A(\bar{x}_t(k - v), z)\, A(z, \bar{x}_t(k - v + 2)) & \text{if } v \in \{2, \ldots, k - 1\}\,, \\
A(x_t(k - 1), z) = A(\bar{x}_t(2), z) & \text{if } v = k\,.
\end{cases}
\]
Because $x'_t = \bar{x}_t$, it follows that $z$ is distributed as the conditional distribution (10) of $x'_{t+1}(k - v + 1)$, as desired.

For the two Glauber chains, $(x_t)_{t \geq 0}$ and $(x'_t)_{t \geq 0}$, we observe that
\[
X^{(N)}(x'_t) = \overline{X^{(N)}(x_t)} \quad (21)
\]
almost surely, where the overline denotes the application of $\mathrm{vec} \circ \mathrm{rev} \circ \mathrm{reshape}$ to each column (as in the definition of Flip). This result follows from the facts that $x'_t = \bar{x}_t$ for all $t \geq 0$ and $\mathrm{rev}(A_x) = A_{\bar{x}}$ for all $x \in \Omega$. (See (2) for the definition of $A_x$.) From this, we note that
\[
f^{(1)}(W) = \mathbb{E}_{x \sim \pi}\Bigl[ \ell\bigl(X^{(N)}(x), W\bigr) \,\Big|\, x \in \Omega_1 \Bigr] \quad (22)
\]
\[
= \mathbb{E}_{x \sim \pi}\Bigl[ \ell\bigl(\overline{X^{(N)}(x)}, \overline{W}\bigr) \,\Big|\, x \in \Omega_1 \Bigr]
= \mathbb{E}_{x \sim \pi}\Bigl[ \ell\bigl(X^{(N)}(\bar{x}), \overline{W}\bigr) \,\Big|\, x \in \Omega_1 \Bigr]
\]
\[
= \mathbb{E}_{x \sim \pi}\Bigl[ \ell\bigl(X^{(N)}(x), \overline{W}\bigr) \,\Big|\, x \in \Omega_2 \Bigr]
= f^{(2)}(\overline{W})\,.
\]
The first and the last equalities use the second equality in (17). The second equality uses the fact that $\ell(X, W) = \ell(\overline{X}, \overline{W})$. The third equality follows from (21). The fourth equality follows from the change of variables $x \mapsto \bar{x}$ and the fact that $x \sim \mu_1$ if and only if $\bar{x} \sim \mu_2$ (see (20)).

We now complete the proof of (ii). Its first part follows immediately from (i) and the above construction of the Glauber chains $x_t$ and $x'_t$ that satisfy $x'_t = \bar{x}_t$ for all $t \geq 0$. Specifically, let $W_t = W_t(x_0)$ and $W'_t = W'_t(x'_0)$ denote the sequences of dictionary matrices that we compute using Algorithm 1 with initial homomorphisms $x_0$ and $x'_0$, respectively. Suppose that $x_0 \in \Omega_1$, from which we see that $x'_0 = \bar{x}_0 \in \Omega_2$. By (i), $W_t$ and $W'_t$ converge almost surely to the sets of stationary points of the associated conditional expected loss functions $f^{(1)}$ and $f^{(2)}$, respectively. We complete the proof of the first part of (ii) by observing that, almost surely,
\[
W'_t = \overline{W}_t \quad \text{for all } t \geq 1\,. \quad (23)
\]
The second part of (ii) follows immediately from (22).

We still need to verify (23). Roughly, the argument is that all $k \times k$ mesoscale patches $A_{\bar{x}_t} = \mathrm{rev}(A_{x_t})$ have the reversed row and column ordering from the original ordering, so the $k \times k$ latent motifs that we train on such matrices also have the reversed ordering of rows and columns. More concretely, one can check this claim by induction on $t$ together with (21) and the uniqueness assumption (b). We omit the details. □
Remark G.6. Our proofs of Theorems G.2 and G.5 do not depend on the particular choice of mask $\Phi_{F, x}$ that we use to define the mesoscale patches $A_x$ in (2).
G.3. Convergence of the NDR algorithm.
We prove convergence results for the NDR algorithm (see Algorithm 2) in Theorem G.7. Specifically, we show that the reconstructed network that we obtain using Algorithm 2 at iteration $t$ converges almost surely to some limiting network as $t \to \infty$, and we give a closed-form expression for the limiting network. We also give a bound for the 'distance' between the original network and the limiting reconstructed network in terms of a 'fitness' of the network dictionary that we use in the reconstruction algorithm. We measure this fitness using the expected 1-norm between the sampled $k \times k$ mesoscale patch and its best nonnegative linear approximation using our network dictionary.

Let denoising denote the Boolean variable in Algorithm 2. Fix a network $G = (V, A)$, the $k$-chain motif $F = ([k], A_F)$, and a homomorphism $x \colon F \to G$. Let $\Phi_{F, x} \in \{0, 1\}^{k \times k}$ denote the no-folding mask that we defined in (4). For each matrix $B \colon V \times V \to [0, \infty)$ and a node map $x \colon [k] \to V$, define the $k \times k$ matrix $B_x$ by
\[
B_x(a, b) := B(x(a), x(b))\, \Phi_{F, x}(a, b) \quad \text{for all } a, b \in \{1, \ldots, k\}\,.
\]
If $B = A$, then $B_x = A_x$ equals the mesoscale patch of $G$ that is induced by $x$ (see (2)). Additionally, given a network $G = (V, A)$, a motif $F = ([k], A_F)$, a homomorphism $x \colon F \to G$, and a nonnegative matrix $W \in \mathbb{R}^{k^2 \times r}_{\geq 0}$, let $\hat{A}_{x; W}$ denote the $k \times k$ matrix that we defined in line 11 of Algorithm 2. This matrix depends on the Boolean variable denoising. Recall that $\hat{A}_{x; W}$ is a nonnegative linear approximation of $A_x$ that uses $W$. We introduce the event $(p, q) \xleftarrow{x} (a, b)$ using the following indicator function:
\[
\mathbf{1}\bigl((p, q) \xleftarrow{x} (a, b)\bigr) := \mathbf{1}\bigl(x(a) = p,\, x(b) = q\bigr)\, \mathbf{1}\bigl(\text{denoising} = \text{False or } A_F(a, b) = 0\bigr)\, \Phi_{F, x}(a, b)\,, \quad (24)
\]
where $\Phi_{F, x}$ is the no-folding mask that we defined in (4). For each homomorphism $x \colon F \to G$ and $p, q \in V$, we say that the pair $(p, q)$ is visited by $(a, b)$ through $x$ whenever the indicator on the left-hand side of (24) is $1$. Additionally,
\[
N_{pq}(x) := \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}\bigl((p, q) \xleftarrow{x} (a, b)\bigr) \quad (25)
\]
is the total number of visits to $(p, q)$ through $x$. When $N_{pq}(x) > 0$, we say that the pair $(p, q)$ is visited by $x$. In Algorithm 2, observe that both $A_{\mathrm{count}}(p, q)$ and $A_{\mathrm{recons}}(p, q)$ change at iteration $t$ if and only if $N_{pq}(x_t) > 0$. Finally,
\[
\Omega_{pq} := \bigl\{ x \colon F \to G \,\big|\, N_{pq}(x) > 0 \bigr\} \quad (26)
\]
is the set of all homomorphisms $x \colon F \to G$ that visit the pair $(p, q)$.

Theorem G.7 (Convergence of the NDR Algorithm (see Algorithm 2) for Non-Bipartite Networks). Let $F = ([k], A_F)$ be the $k$-chain motif, and fix a network $G = (V, A)$ and a network dictionary $W \in \mathbb{R}^{k^2 \times r}_{\geq 0}$. We use Algorithm 2 with inputs $G$, $F$, and $W$ and the parameter value $T = \infty$. Let $\hat{G}_t = (V, \hat{A}_t)$ denote the network that we reconstruct at iteration $t$, and suppose that $G$ satisfies assumption (a) of Theorem G.2. Let $\pi = \hat{\pi}_{F \to G}$ if MCMC = PivotApprox and $\pi = \pi_{F \to G}$ otherwise. The following statements hold:
(i) The network $\hat{G}_t$ converges almost surely to some limiting network $\hat{G}_\infty = (V, \hat{A}_\infty)$ in the sense that $\lim_{t \to \infty} \hat{A}_t(p, q) = \hat{A}_\infty(p, q) \in [0, \infty)$ almost surely for all $p, q \in V$.
(ii)
Let $\hat{A}_\infty$ denote the limiting matrix in (i). For each $p, q \in V$, we then have that
\[
\hat{A}_\infty(p, q) = \sum_{y \in \Omega_{pq}} \sum_{a, b \in \{1, \ldots, k\}} \hat{A}_{y; W}(a, b)\, \mathbf{1}\bigl((p, q) \xleftarrow{y} (a, b)\bigr)\, \frac{\pi(y)}{\mathbb{E}_{x \sim \pi}[N_{pq}(x)]}\,. \quad (27)
\]
(iii) Let $\hat{A}_\infty$ be as in (ii). For any network $G' = (V, B)$, we have
\[
\sum_{p, q \in V} \bigl| B(p, q) - \hat{A}_\infty(p, q) \bigr|\, \mathbb{E}_{x \sim \pi}[N_{pq}(x)] \leq
\begin{cases}
\mathbb{E}_{x \sim \pi}\bigl[ \|B_x - \hat{A}_{x; W}\|_1 \bigr] & \text{if denoising = False}\,, \\
\mathbb{E}_{x \sim \pi}\bigl[ \|B_x - \hat{A}_{x; W}\|_{1, F} \bigr] & \text{if denoising = True}\,,
\end{cases}
\]
where $\|R\|_{1, F} := \sum_{1 \leq a, b \leq k} |R(a, b)|\, \mathbf{1}(A_F(a, b) = 0)$ for each $R \in \mathbb{R}^{k \times k}$.
(iv) Let $\hat{A}_\infty$ be as in (ii) and suppose that denoising = False. It then follows that
\[
\sum_{p, q \in V} \bigl| A(p, q) - \hat{A}_\infty(p, q) \bigr|\, \mathbb{E}_{x \sim \pi}[N_{pq}(x)] \leq \sqrt{k}\; \mathbb{E}\Bigl[ \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr] \quad (28)
\]
\[
\leq 2\sqrt{k}\, \Bigl( \mathbb{E}_{x \sim \pi}\bigl[ \ell(\mathrm{vec}(A_x), W) \bigr] \sup_{x \colon F \to G} \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr)^{1/3}\,. \quad (29)
\]
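The following sketch makes the bookkeeping in (24)–(26) concrete (our own illustration; `x` is a homomorphism stored as an integer array, `A_F` is the adjacency matrix of the $k$-chain motif, and `Phi` is the no-folding mask from (4)):

    import numpy as np

    def visit_counts(x, A_F, Phi, n, denoising=False):
        # N_pq(x) from (25): count the pairs (a, b) that visit (p, q)
        # through x, i.e., that satisfy the indicator in (24).
        k = len(x)
        N = np.zeros((n, n), dtype=int)
        for a in range(k):
            for b in range(k):
                if (not denoising or A_F[a, b] == 0) and Phi[a, b] > 0:
                    N[x[a], x[b]] += 1
        return N   # the pair (p, q) is visited by x if N[p, q] > 0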
Proof. Let $x$ denote a random homomorphism $F \to G$ with distribution $\pi$, and let $\mathbb{P}$ and $\mathbb{E}$ denote the associated probability measure and expectation, respectively.

We first verify (i) and (ii) simultaneously. Let $(x_t)_{t \geq 0}$ denote the Markov chain that we generate during the reconstruction process (see Algorithm 2). We fix $p, q \in V$ and let
\[
M_t := \sum_{s=1}^t \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}\bigl((p, q) \xleftarrow{x_s} (a, b)\bigr) = \sum_{s=1}^t N_{pq}(x_s)\,,
\]
where we defined the indicator $\mathbf{1}\bigl((p, q) \xleftarrow{x} (a, b)\bigr)$ in (24). The key observation is that
\[
\hat{A}_t(p, q) = \frac{1}{M_t} \sum_{s=1}^t \sum_{a, b \in \{1, \ldots, k\}} \hat{A}_{x_s; W}(a, b)\, \mathbf{1}\bigl((p, q) \xleftarrow{x_s} (a, b)\bigr) \quad (30)
\]
\[
= \sum_{y \in \Omega_{pq}} \sum_{a, b \in \{1, \ldots, k\}} \hat{A}_{y; W}(a, b)\, \mathbf{1}\bigl((p, q) \xleftarrow{y} (a, b)\bigr)\, \frac{t}{M_t}\, \frac{1}{t} \sum_{s=1}^t \mathbf{1}(x_s = y)\,.
\]
With assumption (a), the Markov chain $(x_t)_{t \geq 0}$ of homomorphisms $F \to G$ is irreducible and aperiodic with $\pi$ (see (1)) as its unique stationary distribution. By the Markov-chain ergodic theorem (see, e.g., [5, Theorem 6.2.1 and Example 6.2.4] or [38, Theorem 17.1.7]), it follows that
\[
\lim_{t \to \infty} \frac{t}{M_t}\, \frac{1}{t} \sum_{s=1}^t \mathbf{1}(x_s = y) = \frac{\mathbb{P}(x = y)}{\mathbb{E}[N_{pq}(x)]}\,.
\]
This proves both (i) and (ii).

We now verify (iii). For each $a, b \in \{1, \ldots, k\}$ and $p, q \in V$, let $\Omega_{ab \to pq}$ denote the set of all homomorphisms $x \colon F \to G$ such that $\mathbf{1}\bigl((p, q) \xleftarrow{x} (a, b)\bigr) = 1$. By changing the order of the sums, we rewrite the formula in (27) as
\[
\hat{A}_\infty(p, q) = \sum_{a, b \in \{1, \ldots, k\}} \sum_{y \in \Omega_{ab \to pq}} \hat{A}_{y; W}(a, b)\, \frac{\mathbb{P}(x = y)}{\mathbb{E}[N_{pq}(x)]}\,.
\]
For each $a, b \in \{1, \ldots, k\}$, define the indicator function
\[
\mathbf{1}_{ab} := \mathbf{1}\bigl(\text{denoising} = \text{False or } A_F(a, b) = 0\bigr)\,.
\]
Observe that
\[
\mathbb{E}[N_{pq}(x)] = \sum_{a, b \in \{1, \ldots, k\}} \mathbb{E}\bigl[ \mathbf{1}(x(a) = p,\, x(b) = q)\, \mathbf{1}_{ab}\, \Phi_{F, x}(a, b) \bigr] = \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}_{ab} \sum_{y \in \Omega_{ab \to pq}} \mathbb{P}(x = y)\,.
\]
Indeed, the indicator $\mathbf{1}_{ab}$ does not depend on the homomorphism $y$, and $\Phi_{F, y}(a, b) = 1$ if $y \in \Omega_{ab \to pq}$.
We then calculate
\[
\sum_{p, q \in V} \bigl| B(p, q) - \hat{A}_\infty(p, q) \bigr|\, \mathbb{E}[N_{pq}(x)]
\]
\[
= \sum_{p, q \in V} \Bigl| B(p, q)\, \mathbb{E}[N_{pq}(x)] - \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}_{ab} \sum_{y \in \Omega_{ab \to pq}} \hat{A}_{y; W}(a, b)\, \mathbb{P}(x = y) \Bigr|
\]
\[
= \sum_{p, q \in V} \Bigl| \sum_{a, b \in \{1, \ldots, k\}} \mathbf{1}_{ab} \sum_{y \in \Omega_{ab \to pq}} \bigl( B(p, q) - \hat{A}_{y; W}(a, b) \bigr)\, \mathbb{P}(x = y) \Bigr|
\]
\[
\leq \sum_{p, q \in V} \sum_{a, b \in \{1, \ldots, k\}} \sum_{y \in \Omega_{ab \to pq}} \bigl| B(y(a), y(b))\, \mathbf{1}_{ab} - \hat{A}_{y; W}(a, b)\, \mathbf{1}_{ab} \bigr|\, \mathbb{P}(x = y)
\]
\[
= \sum_{p, q \in V} \sum_{a, b \in \{1, \ldots, k\}} \sum_{y \in \Omega} \bigl| B(y(a), y(b))\, \mathbf{1}_{ab}\, \Phi_{F, y}(a, b) - \hat{A}_{y; W}(a, b)\, \mathbf{1}_{ab}\, \Phi_{F, y}(a, b) \bigr|\, \mathbb{P}(x = y)\, \mathbf{1}(y(a) = p,\, y(b) = q)
\]
\[
= \sum_{y \in \Omega} \mathbb{P}(x = y) \sum_{a, b \in \{1, \ldots, k\}} \bigl| B_y(a, b)\, \mathbf{1}_{ab} - \hat{A}_{y; W}(a, b)\, \mathbf{1}_{ab} \bigr| \sum_{p, q \in V} \mathbf{1}(y(a) = p,\, y(b) = q)
\]
\[
= \sum_{y \in \Omega} \sum_{a, b \in \{1, \ldots, k\}} \bigl| B_y(a, b)\, \mathbf{1}_{ab} - \hat{A}_{y; W}(a, b)\, \mathbf{1}_{ab} \bigr|\, \pi(y)\,.
\]
This verifies (iii).

Finally, we prove (iv). First, by the Cauchy–Schwarz inequality,
\[
\mathbb{E}\bigl[ \|A_x - \hat{A}_{x; W}\|_1 \bigr] \leq \sqrt{k}\; \mathbb{E}\bigl[ \|A_x - \hat{A}_{x; W}\|_F \bigr] \leq \sqrt{k}\; \mathbb{E}\Bigl[ \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr]\,, \quad (31)
\]
where $\ell$ denotes the loss function that we defined in (15) and $\mathrm{vec}$ denotes the vectorization operator in Algorithm A4. By Markov's inequality, we have for each $\delta > 0$ that
\[
\mathbb{P}\bigl( \ell(\mathrm{vec}(A_x), W) \geq \delta \bigr) \leq \frac{\mathbb{E}[\ell(\mathrm{vec}(A_x), W)]}{\delta}\,.
\]
We define the notation $M := \sup_{x \colon F \to G} \sqrt{\ell(\mathrm{vec}(A_x), W)}$, which is finite because $\ell(\mathrm{vec}(A_x), W) \leq \|A_x\|_F^2$ and there are only finitely many homomorphisms $x \colon F \to G$. By conditioning on whether or not $\ell(\mathrm{vec}(A_x), W) \geq \delta$, it follows that
\[
\mathbb{E}\Bigl[ \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr] \leq \sqrt{\delta}\; \mathbb{P}_{x \sim \pi}\bigl( \ell(\mathrm{vec}(A_x), W) < \delta \bigr) + \frac{\mathbb{E}[\ell(\mathrm{vec}(A_x), W)]}{\delta}\, M \quad (32)
\]
\[
\leq \sqrt{\delta} + \frac{\mathbb{E}[\ell(\mathrm{vec}(A_x), W)]}{\delta}\, M\,.
\]
The last expression in (32) is minimized when $\delta = \bigl(2 M\, \mathbb{E}[\ell(\mathrm{vec}(A_x), W)]\bigr)^{2/3}$. This yields
\[
\mathbb{E}\Bigl[ \sqrt{\ell(\mathrm{vec}(A_x), W)} \Bigr] \leq \bigl( 2^{1/3} + 2^{-2/3} \bigr)\, \bigl( \mathbb{E}[\ell(\mathrm{vec}(A_x), W)]\, M \bigr)^{1/3}\,.
\]
Noting that $2^{1/3} + 2^{-2/3} < 2$ and combining (31) and (32) with (iii) then verifies (iv). □
Remark G.8. Suppose that $G' = G$ and denoising = False in Theorem G.7 (iv). The left-hand side of (28) is a measure of the difference between the original network $G = (V, A)$ and the limiting reconstructed network $\hat{G}_\infty = (V, \hat{A}_\infty)$ that we compute using the NDR algorithm (see Algorithm 2) with a network dictionary $W \in \mathbb{R}^{k^2 \times r}_{\geq 0}$. Recall that the columns of $W$ encode $r$ latent motifs $L_1, \ldots, L_r \in \mathbb{R}^{k \times k}_{\geq 0}$ (see Section B.4). According to (28), $G = \hat{G}_\infty$ if the right-hand side of (28) is $0$. This is the case if $\sup_{x \colon F \to G} \ell(\mathrm{vec}(A_x), W) = 0$, which means that $W$ can perfectly approximate all mesoscale patches $A_x$ of $G$. However, the right-hand side of (28) can still be small if the worst-case approximation error $\sup_{x \colon F \to G} \ell(\mathrm{vec}(A_x), W)$ is large but the expected approximation error $\mathbb{E}_{x \sim \pi}[\ell(\mathrm{vec}(A_x), W)]$ is small (i.e., when $W$ is effective at approximating most of the mesoscale patches).

How can we find a network dictionary $W$ that minimizes the right-hand side of (28)? Although it is difficult to find a globally optimal network dictionary $W$ that minimizes the non-convex objective function in the right-hand side of (28), Theorems G.2 and G.5 guarantee that our NDL algorithm (see Algorithm 1) always finds a locally optimal network dictionary. Indeed, from these theorems, the NDL algorithm with $N = 1$ computes a network dictionary $W$ that is approximately a local optimum of the following expected loss function:
\[
f(W) = \mathbb{E}_{x \sim \pi}\bigl[ \ell(\mathrm{vec}(A_x), W) \bigr]\,, \quad (33)
\]
where $\pi = \hat{\pi}_{F \to G}$ if MCMC = PivotApprox and $\pi = \pi_{F \to G}$ otherwise. The function $f$ in (33) appears in the upper bound in (29). In our experiments, we found that our NDL algorithm produces network dictionaries that are effective at minimizing the reconstruction error in the left-hand side of (28); see, e.g., Figure 3.

Another implication of Theorem G.7 (iii) is also relevant to network denoising (see Figure 4). Suppose that we have an uncorrupted network $G' = (V, B)$ and a corrupted network $G = (V, A)$. Additionally, suppose that we have trained the network dictionary $W$ on the uncorrupted network $G'$, but that we use it to reconstruct the corrupted network $G$. Although $\hat{A}_{x; W}$ is a nonnegative linear approximation of the $k \times k$ matrix $A_x$ of a mesoscale patch of the corrupted network $G$, it may be close to the corresponding mesoscale patch $B_x$ of the uncorrupted network $G'$, because we used the network dictionary $W$ that we learned from the uncorrupted network $G'$. In that case, Theorem G.7 (iii) guarantees that the network $\hat{G}_\infty$ that we reconstruct for the corrupted network $G$ using the uncorrupted-network dictionary $W$ is close to the uncorrupted network $G'$.
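In practice, one can estimate the expected loss (33) by averaging the nonnegative least-squares error over sampled mesoscale patches. A Monte Carlo sketch (assuming a user-supplied sampler `sample_patch` that returns one $k \times k$ mesoscale patch per call; both names are ours):

    import numpy as np
    from scipy.optimize import nnls

    def estimate_expected_loss(sample_patch, W, n_samples=1000):
        # Monte Carlo estimate of f(W) = E[ell(vec(A_x), W)] in (33),
        # with lambda = 0 so that ell is a nonnegative least-squares error.
        total = 0.0
        for _ in range(n_samples):
            A_x = sample_patch()
            _, residual = nnls(W, A_x.flatten(order="F"))
            total += residual**2
        return total / n_samples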
Remark G.9. The update step (see line 17) for global reconstruction in Algorithm 2 indicates that we loop over all node pairs $(a, b)$ in the $k$-chain motif and that we update the weight of the edge $(x_t(a), x_t(b))$ in the reconstructed network using the homomorphism $x_t \colon F \to G$. There may be multiple node pairs $(a, b)$ in $F$ that contribute to the edge $(p, q)$ in the reconstructed network, because $x_t(a) = p$ and $x_t(b) = q$ can occur for multiple choices of $(a, b)$. The output of this update step does not depend on the ordering of $a, b \in \{1, \ldots, k\}$, as one can see from the expressions in (30).

One can also consider the following alternative update step for global reconstruction. We first choose two nodes $p, q$ of the reconstructed network in the image $\{x_t(j) \mid j \in \{1, \ldots, k\}\}$ of the homomorphism $x_t$ and average over all pairs $(a, b) \in [k]^2$ such that $(p, q)$ is visited by $(a, b)$ through $x_t$, and we then update the weight of $(p, q)$ in the reconstructed network with this mean contribution from $x_t$. Specifically, for each $a, b \in \{1, \ldots, k\}$, let $\mathbf{1}\bigl((p, q) \xleftarrow{x_t} (a, b)\bigr)$ denote the indicator that we defined in (24), and let $N_{pq}(x_t) \geq 0$ be the number of visits of $x_t$ to $(p, q)$ (see (25)). We can then replace line 17 in Algorithm 2 with the following lines:

Alternative update for global reconstruction: For $p, q \in V$ such that $N_{pq}(x_t) > 0$:
\[
\widetilde{A}_{x_t; W}(p, q) \leftarrow \frac{\sum_{1 \leq a, b \leq k} \hat{A}_{x_t; W}(a, b)\, \mathbf{1}\bigl((p, q) \xleftarrow{x_t} (a, b)\bigr)}{\sum_{1 \leq a, b \leq k} \mathbf{1}\bigl((p, q) \xleftarrow{x_t} (a, b)\bigr)}\,,
\]
\[
j \leftarrow A_{\mathrm{count}}(p, q) + 1\,,
\]
\[
A_{\mathrm{recons}}(p, q) \leftarrow (1 - j^{-1})\, A_{\mathrm{recons}}(p, q) + j^{-1}\, \widetilde{A}_{x_t; W}(p, q)\,.
\]
For the alternative NDR algorithm that we just described (see also the sketch below), we can establish a convergence result that is analogous to Theorem G.7, using a similar argument as the one in our proof of that theorem. Specifically, (i) holds for the alternative NDR algorithm, so there exists a limiting reconstructed network. In the proof of (ii), the formula for the limiting reconstructed network is now
\[
\hat{A}_\infty(p, q) = \sum_{y \in \Omega_{pq}} \widetilde{A}_{y; W}(p, q)\, \mathbb{P}_{x \sim \pi}\bigl( x = y \,\big|\, x \in \Omega_{pq} \bigr) \quad \text{for all } p, q \in V\,,
\]
where $\Omega_{pq}$ is the set of all homomorphisms that visit $(p, q)$ (see (26)). In particular, if $G$ is an undirected and binary graph, then
\[
\hat{A}_\infty(p, q) = \frac{1}{|\Omega_{pq}|} \sum_{y \in \Omega_{pq}} \widetilde{A}_{y; W}(p, q) \quad \text{for all } p, q \in V\,.
\]
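A sketch of this alternative update in Python (our own illustration; `A_recons` and `A_count` are dictionaries keyed by node pairs, and `A_hat` is the $k \times k$ approximation $\hat{A}_{x_t; W}$ from line 11 of Algorithm 2):

    import numpy as np

    def alternative_global_update(A_recons, A_count, A_hat, x, A_F, Phi,
                                  denoising=False):
        k = len(x)
        num, den = {}, {}
        for a in range(k):
            for b in range(k):
                # the indicator (24)
                if (not denoising or A_F[a, b] == 0) and Phi[a, b] > 0:
                    pq = (x[a], x[b])
                    num[pq] = num.get(pq, 0.0) + A_hat[a, b]
                    den[pq] = den.get(pq, 0) + 1
        for pq, n_visits in den.items():       # pairs with N_pq(x_t) > 0
            mean_contribution = num[pq] / n_visits
            j = A_count.get(pq, 0) + 1
            A_count[pq] = j
            A_recons[pq] = ((1 - 1.0 / j) * A_recons.get(pq, 0.0)
                            + mean_contribution / j)
        return A_recons, A_count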
In the proofs of (iii) and (iv), the same error bounds hold with $\mathbb{E}_{x \sim \pi}[N_{pq}(x)]$ replaced by $\mathbb{P}_{x \sim \pi}(x \in \Omega_{pq})$. We omit the details of the proofs of the above statements for this alternative NDR algorithm.

We now discuss convergence results of Algorithm 2 for a bipartite network $G$. Recall the notation and discussions about bipartite networks above Theorem G.5. Additionally, recall for our bipartite networks that there exist disjoint subsets $\Omega_1$ and $\Omega_2$ of the set $\Omega$ of all homomorphisms $F \to G$ such that (1) $\Omega = \Omega_1 \cup \Omega_2$ and (2) the Markov chain $(x_t)_{t \geq 0}$ restricted to each $\Omega_i$ (with $i \in \{1, 2\}$) is irreducible but is not irreducible on the set $\Omega$.

Theorem G.10 (Convergence of the NDR Algorithm (see Algorithm 2) for Bipartite Networks). Let $F = ([k], A_F)$ be the $k$-chain motif, and let $G = (V, A)$ be a network that satisfies assumption (a') in Theorem G.5. Let $\hat{G}_t = (V, \hat{A}_t)$ denote the network that we reconstruct using Algorithm 2 at iteration $t$ with a fixed network dictionary $W \in \mathbb{R}^{k^2 \times r}_{\geq 0}$. Fix $i \in \{1, 2\}$ and an initial homomorphism $x_0 \in \Omega_i$. Let $\pi = \hat{\pi}_{F \to G}$ if MCMC = PivotApprox and $\pi = \pi_{F \to G}$ otherwise. The following properties hold:
(i) The network $\hat{G}_t$ converges almost surely to some limiting network $\hat{G}_\infty = (V, \hat{A}_\infty)$ in the sense that $\lim_{t \to \infty} \hat{A}_t(p, q) = \hat{A}_\infty(p, q)$ almost surely for all $p, q \in V$.
(ii)–(iv)
The same statements as in Theorem G.7 (ii)–(iv) hold with the expectation $\mathbb{E}_{x \sim \pi}$ replaced by the conditional expectation $\mathbb{E}_{x \sim \pi}[\,\cdot \mid x \in \Omega_i]$.
(v) The results in (ii)–(iv) do not depend on $i \in \{1, 2\}$ if $k$ is even.

Proof. The proofs of statements (i)–(iv) are identical to those for Theorem G.7. Statement (v) follows from a similar argument as in the proof of Theorem G.5 (ii) by constructing coupled Markov chains $(x_t)_{t \geq 0}$ and $(x'_t)_{t \geq 0}$ such that $x'_t = \bar{x}_t$ for all $t \geq 0$. □
Remark G.11. Our proofs of Theorems G.7 and G.10 do not depend on the particular choice of mask $\Phi_{F, x}$ that we use to define the mesoscale patches $A_x$ in (2).
Remark G.12. In Theorem G.10 (ii), let $\hat{G}^{(i)}_\infty = (V, \hat{A}^{(i)}_\infty)$ denote the limiting reconstructed network for $G$ conditional on the Markov chain being initialized in $\Omega_i$ for $i \in \{1, 2\}$. When $k$ is even, Theorem G.10 (v) implies that $\hat{G}^{(1)}_\infty = \hat{G}^{(2)}_\infty$. When $k$ is odd, we run the NDR algorithm (see Algorithm 2) twice, with the Markov chain initialized in both $\Omega_1$ and $\Omega_2$. We then define the network $\hat{G}_\infty := (V, (\hat{A}^{(1)}_\infty + \hat{A}^{(2)}_\infty)/2)$, whose weight matrix is the mean of those of the two limiting reconstructed networks $\hat{G}^{(i)}_\infty$ for $i \in \{1, 2\}$. We obtain a similar error bound as in Theorem G.10 (iii) for this mean limiting reconstructed network. In practice, one can obtain a sequence of reconstructed networks that converges to the mean reconstructed network $\hat{G}_\infty$ by reinitializing the Markov chain every $\tau$ iterations of the reconstruction procedure for any fixed $\tau$.

Appendix H. Auxiliary Algorithms
We now present the auxiliary algorithms that we use to solve subproblems of Algorithms 1 and 2. Let $\Pi_S$ denote the projection operator onto a subset $S$ of the ambient space. For each matrix $A$, let $[A]_{\bullet i}$ (respectively, $[A]_{i \bullet}$) denote the $i$th column (respectively, $i$th row) of $A$.

Algorithm A1. Coding
Input: Data matrix $X \in \mathbb{R}^{d \times b}$; dictionary matrix $W \in \mathbb{R}^{d \times r}$
Parameters: $T \in \mathbb{N}$ (the number of iterations); $\lambda \geq 0$ (the coefficient of an $L_1$-regularizer); $\mathcal{C}^{\mathrm{code}} \subseteq \mathbb{R}^{r \times b}$ (convex constraint set of codes)
Initialize $H \in \mathcal{C}^{\mathrm{code}}$ (e.g., $H = 0$)
For $t = 1, \ldots, T$: Do:
\[
H \leftarrow \Pi_{\mathcal{C}^{\mathrm{code}}}\left( H - \frac{1}{\mathrm{tr}(W^T W)} \left( W^T W H - W^T X + \lambda J \right) \right),
\]
where $J \in \mathbb{R}^{r \times b}$ is the matrix with all entries equal to $1$
Output: $H \in \mathcal{C}^{\mathrm{code}} \subseteq \mathbb{R}^{r \times b}$
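A direct NumPy transcription of Algorithm A1 (a sketch; we take $\mathcal{C}^{\mathrm{code}}$ to be the nonnegative orthant, which is the relevant choice for NDL):

    import numpy as np

    def coding(X, W, T=100, lam=1.0):
        # Solve min_{H >= 0} ||X - W H||_F^2 + lam * ||H||_1 by projected
        # gradient descent with step size 1 / tr(W^T W).
        r, b = W.shape[1], X.shape[1]
        H = np.zeros((r, b))
        WtW, WtX = W.T @ W, W.T @ X
        step = 1.0 / np.trace(WtW)
        for _ in range(T):
            grad = WtW @ H - WtX + lam        # "+ lam" adds lam * J entrywise
            H = np.maximum(H - step * grad, 0.0)   # projection onto C_code
        return H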
Algorithm A2. Dictionary-Matrix Update
Input: Previous dictionary matrix $W_{t-1} \in \mathcal{C}^{\mathrm{dict}} \subseteq \mathbb{R}^{k^2 \times r}$; previous aggregate matrices $(P_t, Q_t) \in \mathbb{R}^{r \times r} \times \mathbb{R}^{r \times k^2}$
Parameters: $\mathcal{C}^{\mathrm{dict}} \subseteq \mathbb{R}^{k^2 \times r}$ (compactness and convexity constraint for dictionary matrices); $T \in \mathbb{N}$ (the number of iterations)
$W \leftarrow W_{t-1}$
For $t = 1, \ldots, T$:
  For $j = 1, 2, \ldots, r$:
\[
W(:, j) \leftarrow \Pi_{\mathcal{C}^{\mathrm{dict}}}\left( W(:, j) - \frac{1}{P_t(j, j) + 1} \left( W P_t(:, j) - Q_t^T(:, j) \right) \right)
\]
Output: $W_t = W \in \mathcal{C}^{\mathrm{dict}} \subseteq \mathbb{R}^{k^2 \times r}_{\geq 0}$
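A corresponding sketch of Algorithm A2 (our own transcription; the projection clips each column to be nonnegative and rescales it to norm at most 1, which is the exact projection onto the constraint set $\mathcal{C}^{\mathrm{dict}}$ that we defined above):

    import numpy as np

    def dictionary_update(W_prev, P_t, Q_t, T=1):
        W = W_prev.copy()
        for _ in range(T):
            for j in range(W.shape[1]):
                grad_j = W @ P_t[:, j] - Q_t.T[:, j]
                W[:, j] -= grad_j / (P_t[j, j] + 1.0)
                # Project column j onto C_dict: nonnegative, norm at most 1.
                W[:, j] = np.maximum(W[:, j], 0.0)
                norm_j = np.linalg.norm(W[:, j])
                if norm_j > 1.0:
                    W[:, j] /= norm_j
        return W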
Algorithm A3. Rejection Sampling of Homomorphisms
Input: Network $G = (V, A)$; motif $F = ([k], A_F)$
Requirement: There exists at least one homomorphism $F \to G$
Repeat:
  Sample $x = [x(1), x(2), \ldots, x(k)] \in V^{[k]}$ so that the quantities $x(i)$ are independent and identically distributed
  If $\prod_{i, j \in \{1, \ldots, k\}} A(x(i), x(j))^{A_F(i, j)} > 0$: Return $x \colon F \to G$ and Terminate
Output: Homomorphism $x \colon F \to G$
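A Python sketch of Algorithm A3 (we draw each $x(i)$ uniformly at random from $V$, which is the natural choice of i.i.d. proposal distribution):

    import numpy as np

    def rejection_sample_homomorphism(A, A_F, rng=None):
        rng = rng or np.random.default_rng()
        n, k = A.shape[0], A_F.shape[0]
        while True:
            x = rng.integers(0, n, size=k)   # i.i.d. uniform nodes
            # Accept if A(x(i), x(j)) > 0 whenever A_F(i, j) > 0, which is
            # equivalent to prod_{i,j} A(x(i), x(j))^{A_F(i,j)} > 0.
            if all(A[x[i], x[j]] > 0
                   for i in range(k) for j in range(k) if A_F[i, j] > 0):
                return x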
Algorithm A4. Vectorization
Input: Matrix $X \in \mathbb{R}^{k_1 \times k_2}$
Output: Matrix $Y \in \mathbb{R}^{k_1 k_2 \times 1}$, where $Y(k_1(j - 1) + i,\, 1) = X(i, j)$ for all $i \in \{1, \ldots, k_1\}$ and $j \in \{1, \ldots, k_2\}$
Algorithm A5. Reshaping
Input: Matrix $X \in \mathbb{R}^{k_1 k_2 \times 1}$; a pair $(k_1, k_2)$ of integers
Output: Matrix $Y \in \mathbb{R}^{k_1 \times k_2}$, where $Y(i, j) = X(k_1(j - 1) + i,\, 1)$ for all $i \in \{1, \ldots, k_1\}$ and $j \in \{1, \ldots, k_2\}$
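Algorithms A4 and A5 are column-major (Fortran-order) vectorization and reshaping and are mutually inverse; in NumPy (our illustration):

    import numpy as np

    X = np.arange(6).reshape(2, 3)          # a k1 x k2 matrix with k1 = 2, k2 = 3
    y = X.flatten(order="F")                # Algorithm A4: Y(k1*(j-1) + i) = X(i, j)
    X_back = y.reshape((2, 3), order="F")   # Algorithm A5 inverts Algorithm A4
    assert np.array_equal(X, X_back)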
Appendix I. Additional Figures
We show additional figures of our network dictionaries.

Figure 8.
The $r = 25$ latent motifs at scales $k = 6, 11, 21, 51, 101$ that we learn from the networks Caltech, MIT, UCLA, and Harvard. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for the details of these experiments.

Figure 9.
The $r = 25$ latent motifs at scales $k = 6, 11, 21, 51, 101$ that we learn from the networks Coronavirus PPI, SNAP Facebook, arXiv ASTRO-PH, and Homo sapiens PPI. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for the details of these experiments.
Figure 10.
The $r = 25$ latent motifs at scales $k = 6, 11, 21, 51, 101$ that we learn from the networks WS$_1$, WS$_2$, BA$_1$, and BA$_2$. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of the experiments.

Figure 11.
The latent motifs for $r \in \{9, 16, 25, 36, 49\}$ at scale $k = 21$ that we learn from the networks Caltech, MIT, UCLA, and Harvard. The $r = 25$ column is identical to the $k = 21$ column in Figure 8. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of the experiments.
Figure 12.
The latent motifs for $r \in \{9, 16, 25, 36, 49\}$ at scale $k = 21$ that we learn from the networks Coronavirus PPI, SNAP Facebook, arXiv ASTRO-PH, and Homo sapiens PPI. The $r = 25$ column is identical to the $k = 21$ column in Figure 9. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of the experiments.

Figure 13.
The latent motifs for $r \in \{9, 16, 25, 36, 49\}$ at scale $k = 21$ that we learn from the networks WS$_1$, WS$_2$, BA$_1$, and BA$_2$. The $r = 25$ column is identical to the $k = 21$ column in Figure 10. The numbers underneath the latent motifs give their dominance scores (see Section D.2). See Section F for details of our experiments.

Figure 14. (Top two rows) The $r = 25$ latent motifs at scales $k = 6, 11, 21, 51, 101$ that we learn from the networks ER$_1$ and ER$_2$. (Bottom two rows) The latent motifs for $r \in \{9, 16, 25, 36, 49\}$ at scale $k = 21$ that we learn from the networks ER$_1$ and ER$_2$. The $k = 21$ column of the first two rows is identical to the $r = 25$ column of the last two rows.