Random Walk Sampling for Big Data over Networks
Saeed Basirian, Alexander Jung
Department of Computer Science, Aalto University, Finland; firstname.lastname(at)aalto.fi
ABSTRACT
It has been shown recently that graph signals with small total variation can be accurately recovered from only a few samples if the sampling set satisfies a certain condition, referred to as the network nullspace property. Based on this recovery condition, we propose a sampling strategy for smooth graph signals based on random walks. Numerical experiments demonstrate the effectiveness of this approach for graph signals obtained from a synthetic random graph model as well as a real-world dataset.
Index Terms — compressed sensing, big data, graph signal processing, total variation, complex networks

I. INTRODUCTION
Modern information processing systems are generating massive datasets which are partially labeled mixtures of different media (audio, video, text). Many successful approaches to such datasets are based on representing the data as networks or graphs. In particular, within (semi-)supervised machine learning, we represent the datasets by graph signals defined over an underlying graph, which reflects the similarity relations between individual data points. These graph signals often conform to a smoothness hypothesis, i.e., the signal values of close-by nodes are similar.

Two key problems related to processing these datasets are (i) how to sample them, i.e., which nodes provide the most information about the entire dataset, and (ii) how to recover the entire graph signal representation of the dataset from these samples. These problems have been studied in [3], which proposed a convex optimization method for recovering a graph signal from a small number of samples. Moreover, a sufficient condition for this recovery method to be accurate has been presented. This condition is a reformulation of the stable nullspace property of compressed sensing to the graph signal setting.
Contribution.
Based on the intuition provided by the recently derived network nullspace property, we propose a sampling strategy based on random walks. The effectiveness of this approach is confirmed via numerical experiments based on synthetic graph signals obtained from a particular random graph model, i.e., the assortative planted partition model, and graph signals induced by a real-world dataset containing product rating information of an online retail shop.
Notation.
Vectors and matrices are denoted by boldface lower-case and upper-case letters, respectively. The vector with all entries equal to one (zero) is denoted 1 (0). The ℓ_1 and ℓ_2 norms of a vector x = (x_1, ..., x_N)^T are denoted by ‖x‖_1 and ‖x‖_2, respectively.

Outline.
The problem setup is discussed in Section II, where we formulate the problem of recovering a smooth graph signal as a convex optimization problem. Our main contribution is contained in Section III, where we present the random walk sampling method and discuss its properties in the context of the assortative planted partition model. The results of illustrative numerical experiments are presented in Section IV. We finally conclude in Section V.
II. PROBLEM FORMULATION
We consider massive heterogeneous datasets with intrinsic network structure represented by a graph G = (V, E). The graph G consists of the nodes V = {1, ..., N}, which are connected by undirected edges {i, j} ∈ E. Each node i ∈ V represents an individual data point and an edge {i, j} ∈ E connects nodes representing similar data points. For a given node i ∈ V, we define its neighbourhood as

N(i) := {j ∈ V : {i, j} ∈ E}. (1)

The degree d_i := |N(i)| of node i ∈ V counts the number of its neighbours.

Within (semi-)supervised learning, we associate each data point i ∈ V with a label x[i] ∈ R. These labels induce a graph signal x[·] : V → R defined over the graph G underlying the dataset.

We aim at recovering a smooth graph signal x based on observing its values x[i] for all nodes i ∈ V which belong to the sampling set

M := {i_1, ..., i_M} ⊆ V. (2)

The size M := |M| of the sampling set is typically much smaller than the overall dataset, i.e., M ≪ N. For a fixed sampling budget M it is important to choose the sampling set such that the information obtained is sufficient to recover the overall graph signal. By considering a particular recovery method, called sparse label propagation (SLP), [3] presents the network nullspace property as a sufficient condition on the sampling set such that SLP recovers the overall graph signal from the samples.

The SLP recovery method is based on a smoothness hypothesis, which requires signal values of nodes belonging to the same cluster to be similar. This smoothness hypothesis then suggests to search for the particular graph signal which is consistent with the observed signal samples and, moreover, has minimum total variation (TV)

‖x‖_TV := Σ_{{i,j} ∈ E} |x[j] − x[i]|, (3)

which quantifies signal smoothness. Thus the recovery problem amounts to the convex optimization problem

x̂ ∈ arg min_{x̃} ‖x̃‖_TV s.t. x̃_M = x_M. (4)

The SLP algorithm is nothing but the primal-dual optimization method of Chambolle and Pock [2] applied to the problem (4).

Let us from now on assume that the true underlying graph signal x is clustered, i.e.,

x = Σ_{C ∈ F} a_C t_C, (5)

with the cluster indicator signals

t_C[i] = 1 if i ∈ C, and t_C[i] = 0 otherwise. (6)

For a partition F = {C_1, ..., C_|F|} consisting of disjoint clusters C_l with small cut-sizes, the TV ‖x‖_TV is relatively small. Thus, we expect recovery based on TV minimization (4) to be accurate for signals of the type (5). Indeed, a sufficient condition for the solution x̂ of (4) to coincide with x = Σ_{C ∈ F} a_C t_C can be formulated as the following lemma.

Lemma 1.
We observe a clustered signal x of the form (5) on the sampling set M ⊆ V. If each boundary edge {i, j} with i ∈ C_a, j ∈ C_b is connected to two sampled nodes in each of the two clusters, i.e.,

|M ∩ C_a ∩ N(i)| ≥ 2 and |M ∩ C_b ∩ N(j)| ≥ 2, (7)

then (4) has a unique solution which, moreover, coincides with the true graph signal x.

III. RANDOM WALK SAMPLING
We now present a particular strategy (summarized in Algorithm 1 below) for choosing the sampling set M of nodes at which the graph signal should be sampled to obtain the observations {x[i]}_{i ∈ M}. Our strategy is based on parallel random walks which are started at randomly selected seed nodes. The endpoints of these random walks, which are run for a fixed number L of steps, constitute the sampling set M. In Figure 1 we illustrate the construction of the sampling set via the random walks P_j. Each random walk P_j forms a finite sequence (v_0 = r_j, ..., v_L = i_j) of nodes that are visited in successive steps of the walk, starting from the seed node r_j and ending at the sampled node i_j.

Algorithm 1
Random Walk Sampling
Input: random walk length L, sampling budget M.
Initialize: sampling set M := ∅.
for j = 1 : M do
    randomly select a seed (start) node r_j
    perform a length-L random walk P_j = (v_0 = r_j, ..., v_L = i_j)
    M := M ∪ {i_j}
end for
Output: sampling set M.

Fig. 1. Clustered graph signal (5) defined over a graph composed of two clusters C_1 and C_2.

The sampling strategy of Algorithm 1 is appealing since it allows for an efficient implementation, as the random walks can be followed in parallel. Moreover, for a particular random graph model, the sampling set M delivered by Algorithm 1 conforms with Lemma 1. According to Lemma 1, we have to select from each cluster C_l a number of sampled nodes which is proportional to its cut-size |∂C_l|. Thus, we have to sample more densely in those clusters which have large cut-size. We now show that the sampling set M obtained by Algorithm 1 follows this rationale for graph signals obtained from the stochastic block model (SBM) [6].

For a given partition F = {C_1, ..., C_|F|} of the graph G into clusters C_l of size N_l := |C_l|, the SBM is a generative stochastic model for the edge set E of the graph G. In its simplest form, which is called the assortative planted partition model (APPM) [6], the SBM is defined by two parameters p and q which specify the probability that two particular nodes i, j of the graph are connected by an edge {i, j}. In particular, two nodes i, j ∈ C_a out of the same cluster are connected by an edge with probability p, i.e., P{{i, j} ∈ E} = p for i, j ∈ C_a. Two nodes i ∈ C_a, j ∈ C_b from different clusters C_a and C_b are connected by an edge with probability q, i.e., P{{i, j} ∈ E} = q for i ∈ C_a and j ∈ C_b.

Elementary derivations yield the expected degree d̄_r of any node i ∈ C_r belonging to cluster C_r as

d̄_r = E{d_i} = p(N_r − 1) + q(N − N_r). (8)

On the other hand, by similarly elementary calculations, the expected cut-size C_r := |∂C_r| satisfies

C_r = q N_r (N − N_r). (9)

Now consider a particular random walk P_j which is run in Algorithm 1. For a fixed node i ∈ V, let p_l(i) denote the probability that the random walk visits node i in the l-th step. A fundamental result in the theory of random walks over graphs states [8, page 159]

lim_{l→∞} p_l(i) = d_i / (2|E|). (10)

Thus, by running the random walks in Algorithm 1 sufficiently long (choosing L sufficiently large), the probability that the delivered sampling set M contains a node i ∈ C_r from cluster C_r satisfies

P{i ∈ M} ≈ (p(N_r − 1) + q(N − N_r)) / (2|E|). (11)

Contrasting (11) with (9) reveals that the sampling set delivered by Algorithm 1 indeed conforms with Lemma 1, which requires clusters with larger cut-size to be sampled more densely.

IV. NUMERICAL RESULTS
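As a preliminary to the experiments, the sampling strategy of Algorithm 1 can be sketched in a few lines of Python. The sketch assumes the graph is given as a hypothetical adjacency-list dictionary (node → list of neighbours); these names are ours, not from the paper.

```python
import random
from collections import defaultdict

def random_walk_sampling(neighbors, walk_length, budget, rng=None):
    """Algorithm 1 (sketch): run `budget` random walks of `walk_length`
    uniformly random hops from random seed nodes; the walk endpoints
    form the sampling set."""
    rng = rng or random.Random(0)
    nodes = list(neighbors)
    sampling_set = set()
    for _ in range(budget):
        v = rng.choice(nodes)            # random seed node r_j
        for _ in range(walk_length):     # follow L uniformly random hops
            v = rng.choice(neighbors[v])
        sampling_set.add(v)              # endpoint i_j joins M (set union)
    return sampling_set

# toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge {2,3}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
neighbors = defaultdict(list)
for i, j in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

M = random_walk_sampling(neighbors, walk_length=10, budget=4)
```

Since M is formed by a set union, coinciding endpoints are merged and the delivered sampling set can be smaller than the budget M. By (10), a sufficiently long walk ends at a node with probability roughly proportional to its degree, which is what makes the endpoints concentrate in clusters with large cut-size.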
We tested the effectiveness of the sampling method given by Algorithm 1 by applying it to different graph signals and using sparse label propagation (SLP) as the recovery method for obtaining the original graph signal from the samples. The SLP algorithm, derived in [3], is restated as Algorithm 2 for convenience. In Algorithm 2, we make use of the clipping operator T : R^|E| → R^|E| for edge signals, defined element-wise as (T(x̃))[e] = x̃[e] / max{|x̃[e]|, 1}.

Algorithm 2
Sparse Label Propagation [3]
Input: data graph G, sampling set M, signal samples {x[i]}_{i ∈ M}.
Initialize: k := 0, D := incidence matrix of G for some arbitrary orientation, z^(0) := 0, x^(0) := 0, x̂^(0) := 0, y^(0) := 0, maximum node degree d_max := max_{i ∈ V} d_i.
repeat
    y^(k+1) := T(y^(k) + (1/√d_max) D z^(k))
    r := x^(k) − (1/√d_max) D^T y^(k+1)
    x^(k+1)[i] := x[i] for i ∈ M, and x^(k+1)[i] := r[i] otherwise
    z^(k+1) := 2 x^(k+1) − x^(k)
    x̂^(k+1) := x̂^(k) + x^(k+1)
    k := k + 1
until stopping criterion is satisfied
Output: x̂^(k) := (1/k) x̂^(k)

Our numerical experiments involved independent simulation runs. Each simulation run is based on randomly generating an instance (see Figure 2) of the APPM for fixed parameter values p = 3/ , q = 5/ and a partition consisting of four clusters with sizes |C_1| = 10, |C_2| = 20, |C_3| = 30, |C_4| = 40 (cf. Section III).

Fig. 2. An APPM instance with 60 nodes and three clusters. Node colours represent the signal values.

We then generated a clustered graph signal x of the form (5) by choosing the cluster values a_C as independent random variables a_C ∼ U(0, 1) (cf. (5)). For each realization of the APPM, we constructed a sampling set M using Algorithm 1, which was then used to obtain the signal samples {x[i]}_{i ∈ M} and subsequently recover the entire graph signal x via Algorithm 2. We measured the recovery accuracy obtained by Algorithm 2 via the normalized empirical mean squared error (NMSE) of the signal estimate x̂, i.e.,

ε̂^(l) := ‖x̂^(l) − x^(l)‖_2^2 / ‖x^(l)‖_2^2. (12)

Here, ε̂^(l), x^(l) and x̂^(l) denote the NMSE, the original and the recovered graph signal, respectively, obtained in the l-th simulation run. Note that ε̂ is random, and often we are interested in its empirical mean

ε̄ := (1/ ) Σ_{l=1} ε̂^(l). (13)

We evaluated the quality of the sampling set provided by Algorithm 1 for varying sampling budgets M and a fixed length L = 10 of the random walks P_j. In Table I, we report the mean and standard deviation of the NMSE of x̂ for different sampling budgets M.

Table I. Average NMSE ε̄ obtained for different sampling budgets M ∈ {10, 20, 30, 40, 50}. STD indicates the empirical standard deviation of the NMSE ε̂.

Besides the expected decrease in error with an increasing number of samples, it shows that by sampling around half of the graph nodes, we obtain ε̂ ≈ . .

We also investigated the effect of a varying random walk length L in Algorithm 1, for a fixed sampling budget M = 10. In Table II, we display the mean and standard deviation of the NMSE for different values of L.

Table II. Average NMSE ε̄ obtained for different lengths L ∈ {20, 40, 80, 160, 320} of the random walks. STD indicates the empirical standard deviation of the NMSE ε̂.

It shows that, for this range of values, the length of the walks has a relatively insignificant effect on the outcome.
This can be partially explained by the fact that the mixing time of random walks (i.e., the number of steps before they reach the stationary distribution) in some cases may be much smaller than the size N of the graph [5].

The fluctuations of the NMSE, as indicated by the values of the empirical standard deviation in Tables I and II, are on the order of the average NMSE. We expect the reason for this rather large amount of fluctuation to be the small number of simulation runs. However, due to resource constraints, we have not been able to increase the number of runs significantly.

In the final experiment, we challenged the hypothesis that the sampling strategy conforms to the intuition, suggested by Lemma 1, of taking more samples in clusters with larger cut-size (cf. Section III). For this purpose, the same procedure as in the first two experiments was repeated for L = 10 and M = 50, and the number of samples in each cluster and its cut-size were recorded in each run. In Figure 3, we report the obtained results, which indicate that the mean sample counts |M ∩ C_r| are approximately proportional to the cluster cut-sizes |∂C_r|.

IV-A. Real-World Data Set
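Recovery in the experiments above, as in the following real-world test, relies on Algorithm 2. The NumPy sketch below follows its iterations, with one deliberate deviation: to stay safely inside the standard step-size condition of the Chambolle–Pock method, it uses a conservative step size 0.9/‖D‖_2 instead of the paper's 1/√d_max. The helper `incidence_matrix` and all variable names are ours.

```python
import numpy as np

def incidence_matrix(edges, n_nodes):
    """|E| x N incidence matrix D for an arbitrary edge orientation."""
    D = np.zeros((len(edges), n_nodes))
    for e, (i, j) in enumerate(edges):
        D[e, i], D[e, j] = 1.0, -1.0
    return D

def slp(D, sampled, x_obs, n_iter=20000):
    """Sketch of Algorithm 2 (sparse label propagation)."""
    n_edges, n_nodes = D.shape
    c = 0.9 / np.linalg.norm(D, 2)          # conservative step size (sketch)
    x = np.zeros(n_nodes)
    z = np.zeros(n_nodes)
    y = np.zeros(n_edges)
    x_hat = np.zeros(n_nodes)
    for _ in range(n_iter):
        y = y + c * (D @ z)
        y = y / np.maximum(np.abs(y), 1.0)  # clipping operator T
        r = x - c * (D.T @ y)
        x_new = r.copy()
        x_new[sampled] = x_obs              # keep observed samples fixed
        z = 2.0 * x_new - x
        x = x_new
        x_hat += x
    return x_hat / n_iter                   # ergodic average of the iterates

# two-cluster toy graph; two sampled neighbours of each boundary node,
# so condition (7) of Lemma 1 holds for the single boundary edge {2, 3}
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
D = incidence_matrix(edges, 6)
rec = slp(D, sampled=[0, 1, 4, 5], x_obs=np.array([1.0, 1.0, 5.0, 5.0]))
```

On this toy graph the TV solution of (4) is the clustered signal (1, 1, 1, 5, 5, 5), and the averaged iterate `rec` should approach it; the sampled entries are reproduced exactly by construction.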
We also tested our approach on the Amazon co-purchase dataset from the Stanford Network Analysis Platform [4]. The dataset consists of a collection of products purchased on the Amazon website. For each product, it provides a list of other products that are frequently co-purchased with it, as well as an average user rating. We first extracted an undirected graph underlying the full dataset (excluding nodes with no co-purchase information), which includes an edge {i, j} if product j is co-purchased with product i or vice versa. Subsequently, we selected a subgraph via a random walk, including all the nodes on the path and their neighbours, resulting in a graph with N = 5227 nodes and edges. The graph signal is given by the average user ratings of the products.

The sampling set was extracted using the random walk method with the sampling ratio M/N = 0. and L = 20. The SLP algorithm was then applied for recovering the graph signal. This resulted in a mean NMSE of . ± . over 10 runs. For comparison, we also tested three graph clustering algorithms (also referred to as community detection algorithms) for selecting the sampling set. This comprised first finding a partitioning of the nodes using the clustering algorithms and then randomly sampling from each cluster, where the number of samples per cluster was allocated according to its cut-size. For finding the clusters, we used an algorithm by Blondel et al. (also known as the Louvain method) [1], an algorithm by Newman [7], and one by Ronhovde et al. [9]. Choosing the sampling set via these methods and applying SLP for recovering the graph signal resulted in an NMSE of 0.369, 0.478, and 0.364 for the Louvain, Newman, and Ronhovde methods, respectively (the value for the Ronhovde method is the average over 5 different clusterings corresponding to 5 values of its gamma parameter, equally spaced between 0.1 and 0.5).
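The cut-size-based allocation used for these clustering baselines can be sketched as follows; the cluster memberships and cut-size values below are hypothetical placeholders, not data from the experiment.

```python
import random

def allocate_by_cutsize(clusters, cut_sizes, budget, rng=None):
    """Split a budget of samples across clusters proportionally to their
    cut-sizes, then draw that many distinct nodes uniformly per cluster."""
    rng = rng or random.Random(0)
    total = sum(cut_sizes.values())
    sampling_set = set()
    for name, nodes in clusters.items():
        m_c = max(1, round(budget * cut_sizes[name] / total))
        sampling_set.update(rng.sample(nodes, min(m_c, len(nodes))))
    return sampling_set

# hypothetical partition: a small cluster and a large one
clusters = {"C1": list(range(0, 10)), "C2": list(range(10, 40))}
cuts = {"C1": 5, "C2": 15}                  # hypothetical cut-sizes |∂C|
S = allocate_by_cutsize(clusters, cuts, budget=8)
```

With these placeholder cut-sizes, C2 receives three times as many samples as C1, mirroring the rationale of Lemma 1 that clusters with larger cut-size must be sampled more densely.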
We conclude that, in this case, our random walk method performs similarly to the more computationally demanding clustering-based approaches for sampling the graph signal.

V. CONCLUSIONS
We proposed a novel random walk strategy for sampling graph signals representing massive datasets with intrinsic network structure. This strategy conforms with the rationale, which is supported by the recently derived network nullspace property, to sample more densely in clusters with large cut-size. The proposed sampling method has been tested on synthetic graph signals generated via an APPM. Our numerical experiments demonstrated that, by combining our sampling strategy with the SLP recovery algorithm, it is possible to recover graph signals with small error from only a few samples. The effectiveness of our sampling strategy has also been verified numerically for graph signals obtained from a real-world dataset containing product rating information of an online retail shop.
Fig. 3. The mean number of samples and the mean cut-size of each cluster; M = 50, L = 10.

REFERENCES
[1] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[2] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vision, 40(1):120–145, 2011.
[3] A. Jung. Sparse label propagation. ArXiv e-prints, Dec. 2016.
[4] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[5] L. Lovász. Random walks on graphs: a survey. Combinatorics, Paul Erdős is Eighty, 2:1–46, 1993.
[6] E. Mossel, J. Neeman, and A. Sly. Stochastic block models and reconstruction. ArXiv e-prints, Aug. 2012.
[7] M. E. J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69(6):066133, 2004.
[8] M. E. J. Newman. Networks: An Introduction. Oxford Univ. Press, 2010.
[9] P. Ronhovde and Z. Nussinov. Local resolution-limit-free Potts model for community detection. Physical Review E, 81(4):046114, 2010.