Simulating Random Walks on Graphs in the Streaming Model
Ce Jin
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. [email protected]
Abstract
We study the problem of approximately simulating a t-step random walk on a graph where the input edges come from a single-pass stream. The straightforward algorithm using reservoir sampling needs O(nt) words of memory. We show that this space complexity is near-optimal for directed graphs. For undirected graphs, we prove an Ω(n√t)-bit space lower bound, and give a near-optimal algorithm using O(n√t) words of space with 2^{−Ω(√t)} simulation error (defined as the ℓ₁-distance between the output distribution of the simulation algorithm and the distribution of perfect random walks). We also discuss extending the algorithms to the turnstile model, where both insertion and deletion of edges can appear in the input stream.

2012 ACM Subject Classification: Theory of computation → Streaming models
Keywords and phrases streaming models, random walks, sampling
Acknowledgements
I would like to thank Professor Jelani Nelson for introducing this problem to me, advising this project, and giving many helpful comments on my writeup.
Graphs of massive size are used for modeling complex systems that emerge in many different fields of study. Challenges arise when computing with massive graphs under memory constraints. In recent years, graph streaming has become an important model for computation on massive graphs. Many space-efficient streaming algorithms have been designed for solving classical graph problems, including connectivity [2], bipartiteness [2], minimum spanning tree [2], matching [8, 12, 1], spectral sparsifiers [14, 13], etc. We will define the streaming model in Section 1.1.

Random walks on graphs are stochastic processes that have many applications, such as connectivity testing [17], clustering [18, 3, 4, 5], sampling [11] and approximate counting [10]. Since random walks are a powerful tool in algorithm design, it is interesting to study them in the streaming setting. A natural problem is to find the space complexity of simulating random walks in graph streams. Das Sarma et al. [7] gave a multi-pass streaming algorithm that simulates a t-step random walk on a directed graph using O(√t) passes and only O(n) space. By further extending this algorithm and combining it with other ideas, they obtained space-efficient algorithms for estimating PageRank on graph streams. However, their techniques crucially rely on reading multiple passes of the input stream.

In this paper, we study the problem of simulating random walks in the one-pass streaming model. We show space lower bounds for both directed and undirected versions of the problem, and present algorithms that nearly match the lower bounds. We summarize our results in Section 1.3.

Let G = (V, E) be a graph with n vertices. In the insertion-only model, the input graph G is defined by a stream of edges (e₁, ..., e_m) seen in arbitrary order, where each edge e_i is specified by its two endpoints u_i, v_i ∈ V. An algorithm must process the edges of G in the order that they appear in the input stream. The edges can be directed or undirected, depending on the problem setting. Sometimes we allow multiple edges in the graph, where the multiplicity of an edge equals its number of occurrences in the input stream.

In the turnstile model, we allow both insertion and deletion of edges. The input is a stream of updates ((e₁, Δ₁), (e₂, Δ₂), ...), where e_i encodes an edge and Δ_i ∈ {1, −1}. The multiplicity of edge e is f(e) = Σ_{i : e_i = e} Δ_i. We assume f(e) ≥ 0 at all times.

Let f(u, v) denote the multiplicity of edge (u, v). The degree of u is defined by d(u) = Σ_{v ∈ V} f(u, v). A t-step random walk starting from a vertex s ∈ V is a random sequence of vertices v₀, v₁, ..., v_t where v₀ = s and v_i is a vertex randomly chosen from the vertices that v_{i−1} connects to, i.e., P[v_i = v | v_{i−1} = u] = f(u, v)/d(u). Let RW_{s,t} : V^{t+1} → [0, 1] denote the distribution of t-step random walks starting from s, defined by

    RW_{s,t}(v₀, ..., v_t) = [v₀ = s] · ∏_{i=0}^{t−1} f(v_i, v_{i+1}) / d(v_i).    (1)
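To keep definition (1) concrete, the following minimal Python sketch (ours, not one of the paper's algorithms) samples a walk exactly from RW_{s,t} given offline access to the whole multigraph; the streaming algorithms below approximate exactly this distribution.

    import random
    from collections import defaultdict

    def perfect_random_walk(arcs, s, t, rng=None):
        """Sample a t-step walk from RW_{s,t}, given the full list of arcs (u, v).
        For an undirected multigraph, list each edge {u, v} as both (u, v) and (v, u)."""
        rng = rng or random.Random(0)
        adj = defaultdict(list)              # adj[u] holds one entry per arc (u, v)
        for u, v in arcs:
            adj[u].append(v)
        walk = [s]
        for _ in range(t):
            u = walk[-1]
            if not adj[u]:                   # d(u) = 0: the walk is not well-defined
                return None
            walk.append(rng.choice(adj[u]))  # v is chosen with probability f(u, v)/d(u)
        return walk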
For two distributions P, Q, we denote by |P − Q|₁ their ℓ₁ distance. We say that a randomized algorithm can simulate a t-step random walk starting from v₀ within error ε if the distribution P_w of its output w ∈ V^{t+1} satisfies |P_w − RW_{v₀,t}|₁ ≤ ε. We say the random walk simulation is perfect if ε = 0.

We study the problem of simulating a t-step random walk within error ε in the streaming model using small space. We assume the length t is specified at the beginning. Then the algorithm reads the input stream. When a query with parameter v₀ comes, the algorithm should simulate and output a t-step random walk starting from vertex v₀.

It is without loss of generality to assume that the input graph has no self-loops. If we can simulate a random walk on the graph with self-loops removed, we can then turn it into a random walk of the original graph by simply inserting self-loops after u with probability d_self(u)/d(u). The values d_self(u), d(u) can be easily maintained by a streaming algorithm using O(n) words.

The random walk is not well-defined when it starts from a vertex u with d(u) = 0. For undirected graphs, this can only happen at the beginning of the random walk, and we simply let our algorithm return Fail if d(v₀) = 0. For directed graphs, one way to fix this is to continue the random walk from v₀, by adding an edge (u, v₀) for every vertex u with d(u) = 0. We will not deal with d(u) = 0 in the following discussion.

We will use log x = log₂ x throughout this paper. For a statement p, define [p] = 1 if p is true, and [p] = 0 if p is false.

The following two theorems give space lower bounds on the directed and undirected versions of the problem. Note that the lower bounds hold even for simple graphs (graphs with no multiple edges).

◮ Theorem 1.
For t ≤ n/2, simulating a t-step random walk on a simple directed graph in the insertion-only model within error ε = 1/3 requires Ω(nt log(n/t)) bits of memory.

◮ Theorem 2.
For t = O(n), simulating a t-step random walk on a simple undirected graph in the insertion-only model within error ε = 1/3 requires Ω(n√t) bits of memory.

Theorem 3 and Theorem 4 give near-optimal space upper bounds for the problem in the insertion-only streaming model.

◮ Theorem 3.
We can simulate a t-step random walk on a directed graph in the insertion-only model perfectly using O(nt) words of memory. For simple directed graphs, the memory can be reduced to O(nt log(n/t)) bits, assuming t ≤ n/2.

◮ Theorem 4.
We can simulate a t-step random walk on an undirected graph in the insertion-only model within error ε using O(n√t · q/log q) words of memory, where q = 2 + log(1/ε)/√t. In particular, the algorithm uses O(n√t) words of memory when ε = 2^{−Θ(√t)}.

Our algorithms also extend to the turnstile model.

◮ Theorem 5.
We can simulate a t-step random walk on a directed graph in the turnstile model within error ε using O(n(t + log(1/ε)) log² max{n, 1/ε}) bits of memory.

◮ Theorem 6.
We can simulate a t-step random walk on an undirected graph in the turnstile model within error ε using O(n(√t + log(1/ε)) log² max{n, 1/ε}) bits of memory.

The simplest algorithm uses O(n²) words of space (or only O(n²) bits, if we assume the graph is simple) to store the adjacency matrix of the graph; throughout, a word has Θ(log max{n, m}) bits. When t ≪ n, a better solution is to use reservoir sampling.

◮ Lemma 7 (Reservoir sampling). Given a stream of n items as input, we can uniformly sample m of them without replacement using O(m) words of memory.

We can also sample m items from the stream with replacement in O(m) words of memory, using m independent reservoir samplers each with capacity 1.
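As a concrete illustration, here is a minimal Python sketch (ours; names are ours) of a capacity-1 reservoir sampler and the with-replacement variant from Lemma 7.

    import random

    class Reservoir1:
        """Capacity-1 reservoir sampler: after feeding a stream of items,
        `sample` is a uniformly random item from the stream."""
        def __init__(self, rng):
            self.rng = rng
            self.count = 0
            self.sample = None

        def feed(self, item):
            self.count += 1
            # Replace the current sample with probability 1/count.
            if self.rng.randrange(self.count) == 0:
                self.sample = item

    def sample_with_replacement(stream, m, seed=0):
        """m independent capacity-1 reservoirs = m uniform samples with replacement."""
        rng = random.Random(seed)
        reservoirs = [Reservoir1(rng) for _ in range(m)]
        for item in stream:
            for r in reservoirs:
                r.feed(item)
        return [r.sample for r in reservoirs]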
◮ Theorem 8. We can simulate a t-step random walk on a directed graph in the insertion-only model perfectly using O(nt) words of memory.

Proof.
For each vertex u ∈ V, we sample t edges e_{u,1}, ..., e_{u,t} outgoing from u with replacement. Then we perform a random walk using these edges: when u is visited for the i-th time (i ≤ t), we go along edge e_{u,i}. ◭

By treating an undirected edge as two opposite directed edges, we can achieve the same space complexity on undirected graphs.
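For concreteness, the proof of Theorem 8 can be rendered in a few lines of Python (our sketch; variable names are ours). Each vertex keeps t independent capacity-1 reservoirs over its out-edges, and the i-th visit to u consumes the i-th sample.

    import random
    from collections import defaultdict

    def simulate_directed_walk(stream, t, s, seed=0):
        """One-pass O(nt)-word simulation from the proof of Theorem 8."""
        rng = random.Random(seed)
        outdeg = defaultdict(int)
        samples = defaultdict(lambda: [None] * t)  # samples[u][i]: i-th out-edge sample of u
        for u, v in stream:                        # directed edge u -> v
            outdeg[u] += 1
            for i in range(t):                     # t independent capacity-1 reservoirs
                if rng.randrange(outdeg[u]) == 0:
                    samples[u][i] = v
        walk, visits = [s], defaultdict(int)
        for _ in range(t):
            u = walk[-1]
            if outdeg[u] == 0:
                return None                        # d(u) = 0: walk undefined
            walk.append(samples[u][visits[u]])     # fresh uniform out-neighbour per visit
            visits[u] += 1
        return walk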
Now we show a space lower bound for the problem. We will use a standard result from communication complexity.

◮ Definition 9. In the Index problem, Alice has an n-bit vector X ∈ {0, 1}^n and Bob has an index i ∈ [n]. Alice sends a message to Bob, and then Bob should output the bit X_i.

◮ Lemma 10 ([15]). For any constant 1/2 < c ≤ 1, solving the Index problem with success probability c requires sending Ω(n) bits.

◮ Theorem 11.
For t ≤ n/2, simulating a t-step random walk on a simple directed graph in the insertion-only model within error ε = 1/3 requires Ω(nt log(n/t)) bits of memory.

Proof.
We prove this by showing a reduction from the Index problem. Before the protocol starts, Alice and Bob agree on a family F of t-subsets of [n] = {1, 2, ..., n} (a t-subset is a subset of size t) such that the condition |S ∩ S′| < t/2 holds for all S, S′ ∈ F with S ≠ S′. For two independent uniformly random t-subsets S, S′ ⊆ [n], let p = P[|S ∩ S′| ≥ t/2] ≤ (t choose t/2)(t/n)^{t/2} < (4t/n)^{t/2}. By a union bound over all pairs of subsets, a randomly generated family F satisfies the condition with probability at least 1 − (|F| choose 2) · p, which is positive when |F| = ⌈1/√p⌉ ≥ (n/(4t))^{t/4}. So we can choose such a family F with log |F| = Ω(t log(n/t)).

Assume |F| is a power of two. Alice encodes n log |F| bits as follows. Let G be a directed graph with vertex set {v₁, v₂, ..., v_{2n}}. For each vertex u ∈ {v_{n+1}, v_{n+2}, ..., v_{2n}}, Alice chooses a set S_u ∈ F, and inserts an edge (u, v_i) for every i ∈ S_u.

Suppose Bob wants to query S_u. He adds an edge (v, u) for every v ∈ {v₁, v₂, ..., v_n}, and then simulates a random walk starting from v₁. The random walk visits u every two steps, and each time it next visits v_i for an independent uniformly random i ∈ S_u. At least t/2 distinct elements of S_u are seen among 2t such samples with probability at least 1 − (t choose t/2)(1/2)^{2t} ≥ 1 − 2^{−t}; since distinct sets in F share fewer than t/2 elements, S_u can then be uniquely determined from an O(t)-step random walk (simulated within error ε) with probability 1 − 2^{−t} − ε > 1/2. By Lemma 10, the space usage for simulating the O(t)-step random walk is at least Ω(n log |F|) = Ω(nt log(n/t)) bits. The theorem is proved by scaling down n and t by a constant factor. ◭

For simple graphs, we can achieve a matching upper bound of O(nt log(n/t)) bits.

◮ Theorem 12.
For t ≤ n/2, we can simulate a t-step random walk on a simple directed graph in the insertion-only model perfectly using O(nt log(n/t)) bits of memory.

Proof.
For every u ∈ V, we run a reservoir sampler with capacity t, which samples (at most) t edges from u's outgoing edges without replacement. After reading the entire input stream, we begin simulating the random walk. When u is visited during the simulation, in the next step we choose uniformly at random a previously used outgoing edge with probability d_used(u)/d(u), or an unused edge from the reservoir sampler with probability 1 − d_used(u)/d(u), where d_used(u) is the number of edges in u's sampler that were previously used in the simulation. We maintain a t-bit vector to keep track of these used samples.

The number of different possible states of a sampler is at most Σ_{0 ≤ i ≤ t} (n choose i) ≤ (t + 1)(en/t)^t, so it can be encoded using ⌈log((t + 1)(en/t)^t)⌉ = O(t log(n/t)) bits. The total space is O(nt log(n/t)) bits. ◭
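The used/unused mixing rule is what makes Theorem 12's simulation perfect. Here is a short Python rendering of the proof (ours; names are ours). We shuffle each stored sample list once, so that taking "the next unused sample" in stored order is a uniformly random unused out-edge.

    import random
    from collections import defaultdict

    def process_stream(stream, t, rng):
        """Per vertex: degree count plus a capacity-t reservoir (uniform t-subset)."""
        deg, res = defaultdict(int), defaultdict(list)
        for u, v in stream:                  # directed edge u -> v
            deg[u] += 1
            if len(res[u]) < t:
                res[u].append(v)
            else:
                j = rng.randrange(deg[u])    # Algorithm R: keep a uniform t-subset
                if j < t:
                    res[u][j] = v
        return deg, res

    def simulate(deg, res, s, t, rng):
        for u in res:
            rng.shuffle(res[u])              # reveal sampled edges in uniform order
        used = defaultdict(list)             # out-edges of u already walked
        walk = [s]
        for _ in range(t):
            u = walk[-1]
            if deg[u] == 0:
                return None
            k = len(used[u])
            if rng.randrange(deg[u]) < k:    # w.p. k/d(u): repeat a used edge
                v = rng.choice(used[u])
            else:                            # otherwise: the next fresh sample
                v = res[u][k]
                used[u].append(v)
            walk.append(v)
        return walk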
◮ Theorem 13. For t = O(n), simulating a t-step random walk on a simple undirected graph in the insertion-only model within error ε = 1/3 requires Ω(n√t) bits of memory.

[Figure 1: illustration for the proof of Theorem 13, showing the blocks A_j, B_j (with queried vertices a ∈ A_j, b ∈ B_j), the block A_{j+1}, and the block V₀ containing the starting vertex v₀.]
Proof.
Again we show a reduction from the Index problem.

Alice encodes Ω(n√t) bits as follows. Let G be an undirected graph with vertex set V₀ ∪ V₁ ∪ ... ∪ V_{n/√t}, where each V_j has size 2√t, and the starting vertex is v₀ ∈ V₀. For each j ≥ 1, V_j is divided into two subsets A_j, B_j of size √t each, and Alice encodes |A_j| × |B_j| = t bits by inserting a subset of the edges {(u, v) : u ∈ A_j, v ∈ B_j}. In total she encodes t · n/√t = n√t bits.

Suppose Bob wants to query some bit, i.e., he wants to see whether a and b are connected by an edge, where (a, b) ∈ A_j × B_j. He adds an edge (u, v) for every u ∈ A_j and every v ∈ V₀ (see Figure 1). A perfect random walk starting from v₀ ∈ V₀ stays inside the bipartite subgraph (A_j, B_j ∪ V₀). Suppose the current vertex of the perfect random walk is v_i ∈ A_j. If a, b are connected by an edge, then

    P[(v_{i+2}, v_{i+3}) = (a, b) | v_i]
      ≥ P[v_{i+1} ∈ V₀ | v_i] · P[v_{i+2} = a | v_{i+1} ∈ V₀] · P[v_{i+3} = b | v_{i+2} = a]
      ≥ (|V₀| / (|V₀| + |B_j|)) · (1/|A_j|) · (1/(|V₀| + |B_j|))
      ≥ 2/(9t),

so in every four steps the edge (a, b) is traversed with probability Ω(1/t). Then an O(t)-step perfect random walk passes through the edge (a, b) with probability at least 0.9. Hence Bob can decide whether the edge (a, b) exists by looking at the random walk (simulated within error ε) with success probability 0.9 − ε > 1/2. By Lemma 10, the space usage for simulating the O(t)-step random walk is at least Ω(n√t) bits. The theorem is proved by scaling down n and t by a constant factor. ◭

Now we describe our algorithm for undirected graphs in the insertion-only model. As a warm-up, we consider simple graphs in this section. We will deal with multi-edges in Section 3.3.
We start by informally explaining the intuition behind our algorithm for simple undirected graphs. We maintain a subset of O(n√t) edges from the input graph, and use them to simulate the random walk after reading the entire input stream.

For a vertex u with degree smaller than √t, we can afford to store all its neighboring edges in memory. For u with degree greater than √t, we can only sample and store O(√t) of its neighboring edges. During the simulation, at every step we first toss a coin to decide whether the next vertex has small degree or large degree. In the latter case, we have to pick a sampled neighboring edge and walk along it. If all sampled neighboring edges have already been used, our algorithm fails. Using the large degree and the fact that edges are undirected, we can show that the failure probability is low.

Description of the algorithm
We divide the vertices into two types according to their degrees: the set of big vertices B = {u ∈ V : d(u) ≥ C + 1}, and the set of small vertices S = {u ∈ V : d(u) ≤ C}, where the parameter C is a positive integer to be determined later.

We use arc (u, v) to refer to an edge when we want to specify the direction u → v. So an undirected edge (u, v) corresponds to two different arcs, arc (u, v) and arc (v, u).

We say an arc (u, v) is important if v ∈ S, and unimportant if v ∈ B. Denote the set of important arcs by E₁, and the set of unimportant arcs by E₀. The total number of important arcs equals Σ_{s ∈ S} d(s) ≤ |S| · C, so it is possible to store E₁ in O(nC) words of space.

The set E₀ of unimportant arcs can be huge, so we only store a subset of E₀. For every vertex u, we sample with replacement C unimportant arcs outgoing from u, denoted by a_{u,1}, ..., a_{u,C}.

To maintain the set E₁ of important arcs and the samples of unimportant arcs after every edge insertion, we need to handle the events when some small vertex becomes big. This procedure is straightforward, as described by ProcessInput in Figure 2. Since |E₁| never exceeds nC, and each of the n samplers uses O(C) words of space, the overall space complexity is O(nC) words.

We begin simulating the random walk after ProcessInput finishes. When the current vertex of the random walk is v, with probability d₁(v)/d(v) the next step will be along an important arc, where d₁(v) denotes the number of important arcs outgoing from v. In this case we simply choose a uniformly random vertex from {u : (v, u) ∈ E₁} as the next vertex. However, if the next step is along an unimportant arc, we need to choose an unused sample a_{v,j} and go along this arc. If at this time all C samples a_{v,j} are already used, then our algorithm fails (and is allowed to return an arbitrary walk). The pseudocode of this simulating procedure is given in Figure 3.

In a walk w = (v₀, ..., v_t), we say vertex u fails if |{i : v_i = u and (v_i, v_{i+1}) ∈ E₀}| > C. (We have assumed no self-loops exist, so u ≠ v in any arc above.) If no vertex fails in w, then our algorithm will successfully return w with probability RW_{v₀,t}(w). Otherwise our algorithm will fail after some vertex runs out of sampled unimportant arcs. To ensure the output distribution is ε-close to RW_{v₀,t} in ℓ₁ distance, it suffices to make our algorithm fail with probability at most ε/2, by choosing a large enough capacity C.

    procedure InsertArc(u, v)
        d(v) ← d(v) + 1
        if d(v) = C + 1 then                     ⊲ v changes from small to big
            for x ∈ V such that (x, v) ∈ E₁ do   ⊲ arc (x, v) becomes unimportant
                E₁ ← E₁ \ {(x, v)}
                Feed arc (x, v) into x's sampler
            end for
        end if
        if d(v) ≤ C then                         ⊲ v ∈ S
            E₁ ← E₁ ∪ {(u, v)}
        else                                     ⊲ v ∈ B
            Feed arc (u, v) into u's sampler
        end if
    end procedure

    procedure ProcessInput
        E₁ ← ∅                                   ⊲ set of important arcs
        for u ∈ V do
            d(u) ← 0
            Initialize u's sampler (initially empty), which maintains a_{u,1}, ..., a_{u,C}
        end for
        for undirected edge (u, v) in the input stream do
            InsertArc(u, v)
            InsertArc(v, u)
        end for
    end procedure

Figure 2: Pseudocode for processing the input stream (for simple undirected graphs)
    procedure SimulateRandomWalk(v₀, t)
        for v ∈ V do
            c(v) ← 0                             ⊲ counter of used samples
        end for
        for i = 0, ..., t − 1 do
            N ← {u : (v_i, u) ∈ E₁}
            x ← uniformly random integer from {1, 2, ..., d(v_i)}
            if x ≤ |N| then
                v_{i+1} ← uniformly random vertex from N
            else
                j ← c(v_i) + 1
                c(v_i) ← j
                if j > C then
                    return Fail
                else
                    v_{i+1} ← u, where (v_i, u) = a_{v_i, j}
                end if
            end if
        end for
        return (v₀, ..., v_t)
    end procedure

Figure 3: Pseudocode for simulating a t-step random walk starting from v₀
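For reference, the following Python rendering (ours; class and variable names are ours) combines Figures 2 and 3 for simple undirected graphs; it assumes hashable, comparable vertex labels, and C is chosen as in Lemma 15 below. Usage: sim = WalkSimulator(C); sim.process_input(edges); sim.simulate(v0, t).

    import random
    from collections import defaultdict

    class WalkSimulator:
        """Executable sketch of Figures 2 and 3 (simple undirected graphs)."""
        def __init__(self, C, seed=0):
            self.C = C
            self.rng = random.Random(seed)
            self.deg = defaultdict(int)      # d(v)
            self.E1 = defaultdict(set)       # important arcs: E1[u] = {v : (u, v) in E1}
            self.into = defaultdict(set)     # into[v] = {x : (x, v) in E1}
            self.fed = defaultdict(int)      # arcs fed into u's sampler so far
            self.samples = {}                # C with-replacement samples per vertex

        def _feed(self, u, v):               # feed unimportant arc (u, v) to u's sampler
            if u not in self.samples:
                self.samples[u] = [None] * self.C
            self.fed[u] += 1
            for i in range(self.C):          # C independent capacity-1 reservoirs
                if self.rng.randrange(self.fed[u]) == 0:
                    self.samples[u][i] = v

        def _insert_arc(self, u, v):         # InsertArc(u, v) from Figure 2
            self.deg[v] += 1
            if self.deg[v] == self.C + 1:    # v turns from small to big:
                for x in list(self.into[v]):   # arcs (x, v) become unimportant
                    self.E1[x].discard(v)
                    self._feed(x, v)
                self.into[v].clear()
            if self.deg[v] <= self.C:
                self.E1[u].add(v)
                self.into[v].add(u)
            else:
                self._feed(u, v)

        def process_input(self, edges):      # ProcessInput from Figure 2
            for u, v in edges:
                self._insert_arc(u, v)
                self._insert_arc(v, u)

        def simulate(self, v0, t):           # SimulateRandomWalk from Figure 3
            c = defaultdict(int)
            walk = [v0]
            for _ in range(t):
                v = walk[-1]
                if self.deg[v] == 0:
                    return None
                N = sorted(self.E1[v])       # important neighbours of v
                if self.rng.randrange(self.deg[v]) < len(N):
                    walk.append(self.rng.choice(N))
                else:
                    c[v] += 1
                    if c[v] > self.C:
                        return "Fail"
                    walk.append(self.samples[v][c[v] - 1])
            return walk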
To bound the probability P[at least one vertex fails | v₀ = s], we will bound the individual failure probability of every vertex, and then use a union bound.

◮ Lemma 14. Suppose for every u ∈ V, P[u fails | v₀ = u] ≤ δ. Then for any starting vertex s ∈ V, P[at least one vertex fails | v₀ = s] ≤ tδ.

Proof.
Fix a starting vertex s. For any particular u ∈ V,

    P[u fails | v₀ = s]
      = P[u fails and ∃ i ≤ t − 1, v_i = u | v₀ = s]
      = P[∃ i ≤ t − 1, v_i = u | v₀ = s] · P[u fails | v₀ = s and ∃ i ≤ t − 1, v_i = u]
      ≤ P[∃ i ≤ t − 1, v_i = u | v₀ = s] · P[u fails | v₀ = u]
      ≤ P[∃ i ≤ t − 1, v_i = u | v₀ = s] · δ,

where the first inequality holds because, conditioned on the first visit to u, the remaining walk is a random walk started at u, and whether u fails depends only on this suffix. By a union bound,

    P[at least one vertex fails | v₀ = s] ≤ Σ_{u ∈ V} P[u fails | v₀ = s]
      ≤ Σ_{u ∈ V} P[∃ i ≤ t − 1, v_i = u | v₀ = s] · δ
      = E[number of distinct vertices visited in {v₀, ..., v_{t−1}} | v₀ = s] · δ
      ≤ tδ.

(When not specified otherwise, the probability space is over all t-step random walks (v₀, ..., v_t) starting from v₀.) ◭

◮ Lemma 15.
We can choose an integer parameter C = O(√t · q/log q), where q = 2 + log(1/δ)/√t, so that P[u fails | v₀ = u] ≤ δ holds for every u ∈ V.

Proof.
Let d₀(u) = |{v : (u, v) ∈ E₀}| denote the number of unimportant arcs outgoing from u. For any u ∈ V,

    P[u fails | v₀ = u] ≤ P[u fails | v₀ = u, (v₀, v₁) ∈ E₀].

We rewrite this probability as the sum of probabilities of possible random walks in which u fails. Recall that u fails if and only if |{i : v_i = u, (v_i, v_{i+1}) ∈ E₀}| ≥ C + 1. In the summation over possible random walks, we only keep the shortest prefix (v₀, ..., v_k) in which u fails, i.e., the last step (v_{k−1}, v_k) is the (C+1)-st time walking along an unimportant arc outgoing from u. We have

    P[u fails | v₀ = u, (v₀, v₁) ∈ E₀]
      = Σ_{k ≤ t} Σ_{walk (v₀,...,v_k)} [v₀ = v_{k−1} = u; (v₀, v₁), (v_{k−1}, v_k) ∈ E₀; |{i : v_i = u, (v_i, v_{i+1}) ∈ E₀}| = C + 1] · (1/d₀(u)) ∏_{i=1}^{k−1} 1/d(v_i)
      = Σ_{k ≤ t} Σ_{walk (v₀,...,v_{k−1})} [v₀ = v_{k−1} = u; (v₀, v₁) ∈ E₀; |{i : v_i = u, (v_i, v_{i+1}) ∈ E₀}| = C] · ∏_{i=1}^{k−1} 1/d(v_i).    (2)

Let v′_i = v_{k−1−i}. Since the graph is undirected, the vertex sequence (v′_0, ..., v′_{k−1}) (the reversal of the walk (v₀, ..., v_{k−1})) is also a walk starting from and ending at u. So the summation (2) equals

    Σ_{k ≤ t} Σ_{walk (v′_0,...,v′_{k−1})} [v′_0 = v′_{k−1} = u; (v′_{k−1}, v′_{k−2}) ∈ E₀; |{i : v′_i = u, (v′_i, v′_{i−1}) ∈ E₀}| = C] · ∏_{i=0}^{k−2} 1/d(v′_i)
      = P_{random walk (v′_0, ..., v′_{t−1})}[ |{i : v′_i = u, (v′_i, v′_{i−1}) ∈ E₀}| ≥ C | v′_0 = u ].

Recall that (v′_i, v′_{i−1}) ∈ E₀ if and only if v′_{i−1} ∈ B. For any 1 ≤ i ≤ t − 1 and any fixed prefix v′_0, ..., v′_{i−1},

    P[v′_i = u, (v′_i, v′_{i−1}) ∈ E₀ | v′_0, ..., v′_{i−1}] ≤ [v′_{i−1} ∈ B] · 1/d(v′_{i−1}) < 1/C.    (3)
Hence the probability that |{1 ≤ i ≤ t − 1 : v′_i = u, (v′_i, v′_{i−1}) ∈ E₀}| ≥ C is at most

    (t − 1 choose C) (1/C)^C ≤ (e(t − 1)/C)^C (1/C)^C < (et/C²)^C.

We set C = ⌈√t · q/log q⌉, where q = 2 + log(1/δ)/√t > 2. Notice that q/log q > 1, and

    C log(C²/(et)) ≥ (√t q/log q) · log(q²/(e log² q)) > (√t q/log q) · log(4q/e) > √t q > log(1/δ),

so (et/C²)^C < δ. Hence we have made P[u fails | v₀ = u] < δ by choosing C = O(√t · q/log q). ◭
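As a quick numeric sanity check (ours, with arbitrary example values), one can search for the smallest C achieving the tail bound (et/C²)^C < δ and compare it against the √t · q/log q scale:

    import math

    def min_capacity(t, delta):
        """Smallest integer C with (e*t/C^2)^C < delta, by direct search."""
        C = math.isqrt(t) + 2
        while C * math.log(C * C / (math.e * t)) <= math.log(1 / delta):
            C += 1
        return C

    t = 10_000
    delta = 2.0 ** (-math.sqrt(t))            # delta = 2^(-sqrt(t)) = 2^(-100)
    q = 2 + math.log2(1 / delta) / math.sqrt(t)
    print(min_capacity(t, delta))             # -> 197
    print(math.sqrt(t) * q / math.log2(q))    # -> about 189.3

For these values the exact search and the O(√t · q/log q) formula agree up to a small constant factor, as the lemma predicts.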
◮ Theorem 16. We can simulate a t-step random walk on a simple undirected graph in the insertion-only model within error ε using O(n√t · q/log q) words of memory, where q = 2 + log(1/ε)/√t.

Proof.
The theorem follows from Lemma 14 and Lemma 15 by setting δ = ε/(2t). ◭

When the undirected graph contains multiple edges, condition (3) in the proof of Lemma 15 may not hold, so we need to slightly modify our algorithm.

We still maintain the multiset E₁ of important arcs. Whether an arc is important will be determined by our algorithm. (This is different from the previous algorithm, where important arcs were simply defined as (u, v) with d(v) ≤ C.) We will ensure that condition (3) still holds, i.e., for any u ∈ V and any fixed prefix of the random walk v₀, ..., v_{i−1},

    P[(v_i, v_{i−1}) ∉ E₁ and v_i = u | v₀, ..., v_{i−1}] < 1/C.    (4)

Note that there can be both important arcs and unimportant arcs from u to v. Let f(u, v) denote the number of undirected edges between u and v. Then there are f(u, v) arcs (u, v). Suppose f₁(u, v) of these arcs are important, and f₀(u, v) = f(u, v) − f₁(u, v) of them are unimportant. Then we can rewrite condition (4) as

    f₀(u, v_{i−1}) / d(v_{i−1}) < 1/C,    (5)

for every u, v_{i−1} ∈ V.

Similarly as before, we need to store the multiset E₁ using only O(nC) words of space. And we need to sample with replacement C unimportant arcs a_{u,1}, ..., a_{u,C} outgoing from u, for every u ∈ V. Finally we use the procedure SimulateRandomWalk in Figure 3 to simulate a random walk.
    procedure InsertArc(u, v)
        d(v) ← d(v) + 1
        if u ∈ L_v then
            A_v(u) ← A_v(u) + 1
        else
            Insert u into L_v
            A_v(u) ← 1
            if |L_v| ≥ C + 1 then
                for w ∈ L_v do
                    Feed arc (w, v) into w's sampler
                    A_v(w) ← A_v(w) − 1
                    if A_v(w) = 0 then
                        Remove w from L_v
                    end if
                end for
            end if
        end if
    end procedure

    procedure ProcessInput
        for u ∈ V do
            d(u) ← 0
            Initialize u's sampler (initially empty), which maintains a_{u,1}, ..., a_{u,C}
            Initialize empty list L_u
        end for
        for undirected edge (u, v) in the input stream do
            InsertArc(u, v)
            InsertArc(v, u)
        end for
        E₁ ← ⋃_{v ∈ V} ⋃_{u ∈ L_v} {A_v(u) copies of arc (u, v)}    ⊲ multiset of important arcs
    end procedure

Figure 4: Pseudocode for processing the input stream (for undirected graphs with possibly multiple edges)
The multiset E₁ is determined as follows. For every vertex v ∈ V, we run the Misra-Gries algorithm [16] on the sequence of all of v's neighbors (with multiplicity). We obtain a list L_v of at most C vertices, such that for every vertex u ∉ L_v, f(u, v)/d(v) < 1/C. Moreover, we get a frequency estimate A_v(u) > 0 for every u ∈ L_v, such that 0 ≤ f(u, v) − A_v(u) < d(v)/C. Assuming A_v(u) = 0 for u ∉ L_v, we can satisfy condition (5) for all u ∈ V by setting f₁(u, v) = A_v(u).

Hence we have determined all the important arcs, and they can be stored in O(Σ_v |L_v|) = O(nC) words. To sample from the unimportant arcs, we simply insert the arcs discarded by the Misra-Gries algorithm into the samplers. The pseudocode is given in Figure 4.
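For reference, here is a standalone Python sketch (ours) of the Misra-Gries routine with C counters, also returning the discarded items that Figure 4 feeds into the samplers. In Figure 4, each vertex v runs one such instance over the stream of its neighbors, and each discarded arc (w, v) goes into w's reservoir sampler.

    def misra_gries(stream, C):
        """Misra-Gries with C counters: every item x with true count f(x) gets an
        estimate A[x] with 0 <= f(x) - A[x] <= N/(C+1), where N is the stream
        length; items absent from A satisfy f(x) <= N/(C+1)."""
        A = {}
        discarded = []                 # items removed by decrements (fed to samplers)
        for x in stream:
            if x in A:
                A[x] += 1
            elif len(A) < C:
                A[x] = 1
            else:                      # C+1 distinct items: decrement all, incl. the new one
                discarded.append(x)
                for y in list(A):
                    A[y] -= 1
                    discarded.append(y)
                    if A[y] == 0:
                        del A[y]
        return A, discarded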
◮ Lemma 17. After ProcessInput (in Figure 4) finishes, we have |L_v| ≤ C. For every u ∈ L_v, 0 ≤ f(u, v) − A_v(u) ≤ d(v)/(C + 1). For every u ∉ L_v, f(u, v) ≤ d(v)/(C + 1).

Proof.
Every time the for loop in procedure InsertArc finishes, the newly added vertex u must have been removed from L_v again (its counter was set to 1 and immediately decremented to 0), so |L_v| ≤ C still holds. Let W = {w₁, ..., w_{C+1}} be the set of vertices in L_v before such a for loop begins. Then for every u ∈ V, f(u, v) − A_v(u) equals the number of such sets W that contain u (assuming A_v(u) = 0 for u ∉ L_v). This is at most the number of executions of the for loop, which equals (1/(C+1)) Σ_W |W| ≤ d(v)/(C+1), since each decrement consumes a distinct counter increment and the total number of increments is d(v). ◭

◮ Corollary 18.
Procedure ProcessInput in Figure 4 computes the multiset E₁ of important arcs and stores it using O(nC) words. It also samples with replacement C unimportant arcs a_{u,1}, ..., a_{u,C} outgoing from u, for every u ∈ V. Moreover, f₀(u, v)/d(v) < 1/C holds for every u, v ∈ V.
SimulateRandomWalk (in Figure 3), similarto Lemma 15. ◮ Lemma 19.
We can choose an integer parameter C = O(√t · q/log q), where q = 2 + log(1/δ)/√t, so that P[u fails | v₀ = u] ≤ δ holds for every u ∈ V.

Proof.
Let d₀(u) = Σ_{v ∈ V} f₀(u, v). As before, we rewrite this probability as a sum over possible random walks. Here we distinguish between important and unimportant arcs. Denote s_i = [step (v_{i−1}, v_i) is along an important arc]. Then for any u ∈ V,

    P[u fails | v₀ = u] ≤ P[u fails | v₀ = u, arc (v₀, v₁) is unimportant]
      = (d(u)/d₀(u)) Σ_{k ≤ t} Σ_{(v₀,...,v_k)} Σ_{s₁,...,s_k} [v₀ = v_{k−1} = u; s₁ = s_k = 0; |{i : v_i = u, s_{i+1} = 0}| = C + 1] · ∏_{i=0}^{k−1} f_{s_{i+1}}(v_i, v_{i+1}) / d(v_i)
      = Σ_{k ≤ t} Σ_{(v₀,...,v_{k−1})} Σ_{s₁,...,s_{k−1}} [v₀ = v_{k−1} = u; s₁ = 0; |{i : v_i = u, s_{i+1} = 0}| = C] · ∏_{i=0}^{k−2} f_{s_{i+1}}(v_i, v_{i+1}) / d(v_i).

Let v′_i = v_{k−1−i} and s′_i = s_{k−i}. Then this sum equals

    Σ_{k ≤ t} Σ_{(v′_0,...,v′_{k−1})} Σ_{s′_1,...,s′_{k−1}} [v′_0 = v′_{k−1} = u; s′_{k−1} = 0; |{i : s′_i = 0, v′_i = u}| = C] · ∏_{i=1}^{k−1} f_{s′_i}(v′_i, v′_{i−1}) / d(v′_{i−1})
      = P_{random walk (v′_0, ..., v′_{t−1})}[ |{i : v′_i = u, arc (v′_i, v′_{i−1}) is unimportant}| ≥ C | v′_0 = u ].

Notice that for any i and any fixed prefix v′_0, ..., v′_{i−1},

    P[v′_i = u, arc (v′_i, v′_{i−1}) is unimportant | v′_0, v′_1, ..., v′_{i−1}] = f₀(u, v′_{i−1}) / d(v′_{i−1}) < 1/C

by Corollary 18. The rest of the proof is the same as in Lemma 15. ◭

◮ Theorem 20.
We can simulate a t-step random walk on an undirected graph with possibly multiple edges in the insertion-only model within error ε using O(n√t · q/log q) words of memory, where q = 2 + log(1/ε)/√t.

Proof.
The theorem follows from Lemma 14 and Lemma 19 by setting δ = ε/(2t). ◭

In this section we consider the turnstile model, where both insertion and deletion of edges can appear.

◮ Lemma 21 (ℓ₁ sampler in the turnstile model, [9]). Let f ∈ R^n be a vector defined by a stream of updates to its coordinates of the form f_i ← f_i + Δ, where Δ can be either positive or negative. There is an algorithm which reads the stream and returns an index i ∈ [n] such that for every j ∈ [n],

    P[i = j] = |f_j| / ‖f‖₁ + O(n^{−c}),    (6)

where c ≥ 1 is some arbitrarily large constant. It is allowed to output Fail with probability δ, and in this case it will not output any index. The space complexity of this algorithm is O(log² n · log(1/δ)) bits.

◮ Remark.
For ε ≪ 1/n, the O(n^{−c}) error term in (6) can be reduced to O(ε^c) by running the ℓ₁ sampler on f ∈ R^{⌈1/ε⌉}, using O(log²(1/ε) log(1/δ)) bits of space.

We will use the ℓ₁ sampler for sampling neighbors (with possibly multiple edges) in the turnstile model. The error term O(n^{−c}) (or O(ε^c)) in (6) can be ignored in the following discussion, by choosing a sufficiently large constant c and scaling down ε by a constant.
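To fix the interface, here is a toy stand-in (ours) for the ℓ₁ sampler of Lemma 21. It stores the vector exactly, so it has none of the space guarantees of [9]; it only models the update/query behavior that Theorems 22 and 24 rely on.

    import random
    from collections import defaultdict

    class ToyL1Sampler:
        """Toy ℓ1 sampler: turnstile updates f[i] += delta; query() returns index j
        with probability |f_j| / ||f||_1. It stores f exactly, so it does NOT have
        the small space of Lemma 21; it only models the interface."""
        def __init__(self, rng):
            self.f = defaultdict(int)
            self.rng = rng

        def update(self, i, delta):
            self.f[i] += delta

        def query(self):
            items = [(i, abs(w)) for i, w in self.f.items() if w != 0]
            if not items:
                return None                  # nothing to sample ("Fail")
            total = sum(w for _, w in items)
            r = self.rng.randrange(total)    # weighted choice proportional to |f_j|
            for i, w in items:
                if r < w:
                    return i
                r -= w

In Theorem 22 below, each vertex u keeps C′ independent samplers over the vector (f(u, v))_{v ∈ V}; an edge update ((u, v), Δ) becomes update(v, Δ) on u's samplers.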
◮ Theorem 22. We can simulate a t-step random walk on a directed graph in the turnstile model within error ε using O(n(t + log(1/ε)) log² max{n, 1/ε}) bits of memory.

Proof.
For every u ∈ V, we run C′ = 2t + 16 log(2t/ε) independent ℓ₁ samplers, each having failure probability δ = 1/2. We use them to sample the outgoing edges of u (as in the algorithm of Theorem 8). By a Chernoff bound, the probability that fewer than t of the samplers succeed is at most ε/(2t).

We say a vertex u fails if u has fewer than t successful samplers and u ∈ {v₀, v₁, ..., v_{t−1}} (where v₀, v₁, ..., v_t is the random walk). Then P[u fails] ≤ (ε/(2t)) · P[u ∈ {v₀, ..., v_{t−1}}]. By a union bound, P[at least one vertex fails] ≤ (ε/(2t)) Σ_{u ∈ V} P[u ∈ {v₀, ..., v_{t−1}}] ≤ ε/2. Hence, with probability 1 − ε/2, every visited vertex u (except possibly the last one) has at least t outgoing edges sampled, so our simulation succeeds. The space usage is O(nC′ log² max{n, 1/ε} log(1/δ)) = O(n(t + log(1/ε)) log² max{n, 1/ε}) bits. ◭
We slightly modify the ProcessInput procedure of our previous algorithm in Section 3.3. We will use an ℓ₁ heavy hitter algorithm in the turnstile model.

◮ Lemma 23 (ℓ₁ heavy hitter, [6]). Let f ∈ R^n be a vector defined by a stream of updates to its coordinates of the form f_i ← f_i + Δ, where Δ can be either positive or negative. There is a randomized algorithm which reads the stream and returns a subset L ⊆ [n] such that i ∈ L for every |f_i| ≥ ‖f‖₁/k, and i ∉ L for every |f_i| ≤ ‖f‖₁/(2k). Moreover it returns a frequency estimate f̃_i for every i ∈ L, which satisfies 0 ≤ f_i − f̃_i ≤ ‖f‖₁/k. The failure probability of this algorithm is O(n^{−c}). The space complexity is O(k log² n) bits.

◮ Remark.
For ε ≪ 1/n, the O(n^{−c}) failure probability of this ℓ₁ heavy hitter algorithm can be reduced to O(ε^c) by running the algorithm on f ∈ R^{⌈1/ε⌉}, using O(k log²(1/ε)) bits of space. In the following discussion, this failure probability can be ignored by making the constant c sufficiently large.

◮ Theorem 24.
We can simulate a t-step random walk on an undirected graph in the turnstile model within error ε using O(n(√t + log(1/ε)) log² max{n, 1/ε}) bits of memory.

Proof.
Similarly to the previous insertion-only algorithm (in Figure 4), we perform two arc updates ((u, v), Δ), ((v, u), Δ) when we read an edge update ((u, v), Δ) from the stream.

For every u ∈ V, we run C′ = 2C + 16 log(2t/ε) independent ℓ₁ samplers, each having failure probability δ = 1/2, where C is the same parameter as in the proof of Lemma 19 and Theorem 20. By a Chernoff bound, the probability that fewer than C of the samplers succeed is at most ε/(2t). For every arc update ((u, v), Δ), we send the update (v, Δ) to u's ℓ₁ samplers.

In addition, for every v ∈ V, we run the ℓ₁ heavy hitter algorithm with k = C. For every arc update ((u, v), Δ), we send the update (u, Δ) to v's heavy hitter algorithm. In the end, we get a frequency estimate A_v(u) for every u ∈ V, such that f(u, v) − d(v)/C ≤ A_v(u) ≤ f(u, v). We then insert A_v(u) copies of arc (u, v) into E₁ (the multiset of important arcs), and send the update (v, −A_v(u)) to u's ℓ₁ samplers. Then we use the ℓ₁ samplers to sample unimportant arcs for every u.

As before, we use the procedure SimulateRandomWalk (in Figure 3) to simulate the random walk. The analysis of the failure probability of the ℓ₁ samplers is the same as in Theorem 22. The analysis of the failure probability of procedure SimulateRandomWalk is the same as in Lemma 19. The space usage of the algorithm is O(nC′ log² max{n, 1/ε} log(1/δ)) = O(n(√t + log(1/ε)) log² max{n, 1/ε}) bits. ◭

We end our paper by discussing some related questions for future research.

The output distribution of our insertion-only algorithm for undirected graphs is ε-close to the random walk distribution. What if the output is required to be perfectly random, i.e., ε = 0?

For insertion-only simple undirected graphs, we proved an Ω(n√t)-bit space lower bound. Our algorithm uses O(n√t log n) bits (for not too small ε). Can we close the gap between the lower bound and the upper bound, as in the case of directed graphs?

In the undirected version, suppose the starting vertex v₀ is drawn from a distribution (for example, the stationary distribution of the graph) rather than being specified. Is it possible to obtain a better algorithm in this new setting? Notice that our proof of the Ω(n√t) lower bound does not work here, since it requires v₀ to be specified.

We required the algorithm to output all vertices on the random walk. If only the last vertex is required, can we get a better algorithm or prove non-trivial lower bounds?

References

[1] Kook Jin Ahn and Sudipto Guha. Linear programming in the semi-streaming model with application to the maximum matching problem. Information and Computation, 222:59-79, 2013. doi:10.1016/j.ic.2012.10.006.
[2] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Analyzing graph structure via linear measurements. In Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 459-467, 2012. doi:10.1137/1.9781611973099.40.
[3] Reid Andersen, Fan Chung, and Kevin Lang. Using PageRank to locally partition a graph. Internet Mathematics, 4(1):35-64, 2007. doi:10.1080/15427951.2007.10129139.
[4] Reid Andersen and Yuval Peres. Finding sparse cuts locally using evolving sets. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), pages 235-244, 2009. doi:10.1145/1536414.1536449.
[5] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing (STOC), pages 30-39, 2003. doi:10.1145/780542.780548.
[6] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58-75, 2005. doi:10.1016/j.jalgor.2003.12.001.
[7] Atish Das Sarma, Sreenivas Gollapudi, and Rina Panigrahy. Estimating PageRank on graph streams. Journal of the ACM, 58(3):13, 2011. doi:10.1145/1970392.1970397.
[8] Leah Epstein, Asaf Levin, Julián Mestre, and Danny Segev. Improved approximation guarantees for weighted matching in the semi-streaming model. SIAM Journal on Discrete Mathematics, 25(3):1251-1265, 2011. doi:10.1137/100801901.
[9] Rajesh Jayaram and David P. Woodruff. Perfect Lp sampling in a data stream. In Proceedings of the 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 544-555, 2018. doi:10.1109/FOCS.2018.00058.
[10] Mark Jerrum and Alistair Sinclair. Approximating the permanent. SIAM Journal on Computing, 18(6):1149-1178, 1989. doi:10.1137/0218077.
[11] Mark R. Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169-188, 1986. doi:10.1016/0304-3975(86)90174-X.
[12] Michael Kapralov. Better bounds for matchings in the streaming model. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1679-1697, 2013. doi:10.1137/1.9781611973105.121.
[13] Michael Kapralov, Yin Tat Lee, Cameron Musco, Christopher Musco, and Aaron Sidford. Single pass spectral sparsification in dynamic streams. SIAM Journal on Computing, 46(1):456-477, 2017. doi:10.1137/141002281.
[14] Jonathan A. Kelner and Alex Levin. Spectral sparsification in the semi-streaming setting. Theory of Computing Systems, 53(2):243-262, 2013. doi:10.1007/s00224-012-9396-1.
[15] Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On data structures and asymmetric communication complexity. Journal of Computer and System Sciences, 57(1):37-49, 1998. doi:10.1006/jcss.1998.1577.
[16] J. Misra and David Gries. Finding repeated elements. Science of Computer Programming, 2(2):143-152, 1982. doi:10.1016/0167-6423(82)90012-0.
[17] Omer Reingold. Undirected connectivity in log-space. Journal of the ACM, 55(4):17, 2008. doi:10.1145/1391289.1391291.
[18] Daniel A. Spielman and Shang-Hua Teng. A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM Journal on Computing, 42(1):1-26, 2013. doi:10.1137/080744888.