[PDF] Tracking the Frequency Moments at All Times

Abstract

The traditional requirement for a randomized streaming algorithm is just {\em one-shot}, i.e., algorithm should be correct (within the stated $\eps$-error bound) at the end of the stream. In this paper, we study the {\em tracking} problem, where the output should be correct at all times. The standard approach for solving the tracking problem is to run O(logm) independent instances of the one-shot algorithm and apply the union bound to all m time instances. In this paper, we study if this standard approach can be improved, for the classical frequency moment problem. We show that for the F p problem for any 1<p≤2 , we actually only need O(loglogm+logn) copies to achieve the tracking guarantee in the cash register model, where n is the universe size. Meanwhile, we present a lower bound of Ω(logmloglogm) bits for all linear sketches achieving this guarantee. This shows that our upper bound is tight when n=(logm ) O(1) . We also present an Ω( log 2 m) lower bound in the turnstile model, showing that the standard approach by using the union bound is essentially optimal.

Full PDF

aa r X i v : . [ c s . D S ] D ec Tracking the Frequency Moments at All Times

Zengfeng Huang Wai Ming Tai Ke Yi

Abstract

The traditional requirement for a randomized streaming algorithm is just one-shot , i.e., algorithmshould be correct (within the stated ε -error bound) at the end of the stream. In this paper, we studythe tracking problem, where the output should be correct at all times. The standard approach forsolving the tracking problem is to run O (log m ) independent instances of the one-shot algorithmand apply the union bound to all m time instances. In this paper, we study if this standard approachcan be improved, for the classical frequency moment problem. We show that for the F p problem forany < p ≤ , we actually only need O (log log m + log n ) copies to achieve the tracking guaranteein the cash register model, where n is the universe size. Meanwhile, we present a lower bound of Ω(log m log log m ) bits for all linear sketches achieving this guarantee. This shows that our upperbound is tight when n = (log m ) O (1) . We also present an Ω(log m ) lower bound in the turnstilemodel, showing that the standard approach by using the union bound is essentially optimal. All classical randomized streaming algorithms provide a one-shot probabilistic guarantee, i.e., the outputof the algorithm at the end of the stream is within the stated ε -error bound with a constant probability.In many practical applications where one wants to monitor the status of the stream continuously as itevolves over time, such a one-shot guarantee is too weak. Instead, a stronger guarantee, which requiresthat the algorithm be correct at all times, would be desired. We refer to this stronger guarantee the tracking problem. The standard approach for solving the tracking problem is to simply reduce thefailure probability of the one-shot algorithm to O (1 /m ) , where m is the length of the stream. Thiscan be achieved by running O (log m ) independent instances of the algorithm and returning the median.Then by the union bound, with at least constant probability, the output is correct (i.e., within the stated ε -error bound) at all times. However, the union bound may be far from being tight as the m time instancesare highly correlated. Thus, the question we ask in this paper is: Can this O (log m ) factor be furtherimproved?We consider this question with the classical frequency moments problem, which is one of the mostextensively studied problems in the streaming literature. Let S = ( a , a , ..., a m ) be a stream of items,where a i ∈ [ n ] for all i . Let f = ( f , . . . , f n ) denote the frequency vector of S , i.e., f i = |{ j : a j = i }| is the number of occurrences of i in the stream S . The p -th frequency moment of f is F p ( f ) = n X i =1 f pi . In particular, F = m and F is the number of distinct items in S . This model is also known as the cashregister model . In the related turnstile model, we also allow deletion of items, i.e., each element in thestream is a pair ( a i , u i ) , where a i ∈ [ n ] and u i ∈ {− , +1 } . The frequency vector is then deﬁned as f i = | P j : a j = i u j | . 1 ur results. In Section 2 we consider the F tracking problem. The classical AMS sketch [1] gives aone-shot estimate to the F with ε relative error with constant probability. In the turnstile model, it uses O (log m + log log n ) bits of space, which is optimal [3]. (For simplicity of presentation, we suppressthe dependency on ε in stating the bounds.) In the cash register model, it is also possible to implementthe sketch with O (log log m + log n ) bits [1] using probabilistic counting [2], so the space needed is O (min { log m + log log n, log log m + log n } ) , which is also optimal . Directly using the union boundfor the tracking problem would need O (log m ) independent copies of the AMS sketch, but we show thatin the cash register model, only O (log log m + log n ) copies are actually needed. The log n factor canbe replaced by log F , so this bound is never worse than that obtained by the union bound since F ≤ n ,and can be much smaller when m ≫ n .We also provide lower bounds for the F tracking problem, though our lower bounds require thatthe sketch has to be linear, i.e., it can be written as Af where A is some random matrix and f isthe frequency vector. In the cash register model, we show that any linear sketch for the F trackingproblem must use Ω(log m log log m ) bits. As the O (log log m + log n ) -bit implementation of theAMS sketch uses probabilistic counting, it is no longer a linear sketch, so our upper bound for the F tracking problem when restricted to a linear sketch is O ((log m + log log n )(log log m + log n )) , whichmatches the lower bound when n = (log m ) O (1) . For non-linear sketches, the upper bound can be O ((log log m ) ) , so the same lower bound cannot hold, but we currently do not have a lower bound fornon-linear sketches. For the turnstile model, we show a lower bound of Ω(log m ) bits. This means thatthe standard solution of running O (log m ) copies of the AMS sketch and applying the union bound isalready optimal.Our upper bound analysis extends to any F p , < p ≤ , while our lower bounds hold for any F p , < p ≤ . F The well-known (fast) AMS sketch [1, 5] can be used to obtain a one-shot estimate of the F with con-stant probability. It uses two hash functions: a 4-wise independent hash function g : [ n ] → { +1 , − } and a pairwise independent hash function h : [ n ] → [ k ] . Given a frequency vector f = ( f , . . . , f n ) of some S , it computes k counters c j = P i ∈ [ n ] ,h ( i )= j f i g ( i ) , j = 1 , . . . , k , and returns ˆ X = P kj =1 c j as the estimate of F ( S ) . It has been shown that for k = O (1 /ε ) , the AMS sketch returns an ε -approximation of F ( S ) with constant probability. The success probability can be boosted to − δ by maintaining O (log(1 /δ )) independent copies of the sketch and returning the median. To solve thetracking problem, one could pick δ = Θ(1 /m ) and apply the union bound, which implies that O (log m ) copies would be needed. Below, we give a tighter analysis showing that only O (log F + log log m +log(1 /ε )) copies are actually needed, where F is the number of distinct elements in S . Theorem 2.1.

Given a stream S = ( a , a , ..., a m ) where a i ∈ [ n ] , let S i = ( a , . . . , a i ) . The stream isfed to O (log F + log log m + log(1 /ε )) independent copies of the AMS sketch, where F is the numberof distinct elements in S and ε > is any small positive real. Let ˆ X i be the median estimate of thesketches after processing S i , then Pr (cid:16)V mi =1 | ˆ X i − F ( S i ) | < ǫF ( S i ) (cid:17) > / . We will consider every frequency vector as a n -dimensional point. The basic idea of the proof isthus to show that nearby points are highly correlated: If the AMS sketch produces an accurate estimateat one point a , then with good probability it is also accurate at all points within a ball centered at a . An Ω(min { log m, log n } ) lower bound is shown in [1]; the Ω(log log m ) lower bound holds trivially since the outputhas at least so many bits if it is a constant-approximation of F ; an Ω(log log n ) lower bound is shown for the turnstile modelin [3], but it actually also holds for the cash register model for any small constant ε . f lying on the n -dimensional Euclidean space R n . Fora frequency vectors x = ( x , . . . , x n ) ∈ R n , the approximation ratio of the AMS sketch using hashfunctions g and h is F g,h ( x ) = ( k X j =1 n X i =1 g ( i ) I ( h ( i ) = j ) x i ! ) / ( x t x ) = x t Hx/x t x, where H i,j = g ( i ) g ( j ) I ( h ( i ) = h ( j )) .We use F ( x ) to denote the random variable F g,h ( x ) when g, h are randomly chosen. For any a ∈ R n and r > , denote by B ( a, r ) the ball centered at a with radius r (using 1-norm distance). Let T be theset of distinct elements appearing in S ; note that | T | = F . Denote by P the subspace of R n spannedby the elements of T , i.e., P = { x = ( x , . . . , x n ) | x i ∈ R if i ∈ S , else x i = 0 } . For j = 1 , . . . , k ,let T j = { i ∈ T | h ( i ) = j } . Then, expand T j to T ′ j by inserting elements that also map to j under h sothat | T ′ | = | T ′ | = ... = | T ′ k | = b . Clearly, b F . Therefore, the approximation ratio can be rewrittenas F g,h ( x ) = x t H ′ x/x t x, where H ′ i,j = g ( i ) g ( j ) I ( h ( i ) = h ( j )) I ( i, j ∈ ∪ kj =1 T ′ y ) .The main technical lemma needed for the proof of Theorem 2.1 is the following, which essentiallysays that all points inside any small ball are “bundled” together. Lemma 2.2.

For any a ∈ F − g,h ([1 − ǫ , ǫ ]) ∩ P , Pr [ | F ( x ) − | ≤ ǫ for all x ∈ B ( a, r ) ∩ P ] ≥ ,where r = Ω( k a k ǫpoly ( F ) ) . Given a point a ∈ R n , hash functions g, h , and any − < ε < , denote by d g,h ( a, ǫ ) the minimum1-norm distance between a and ( F − g,h (1 + ǫ ) ∪ F − g,h (1 − ǫ )) ∩ P . Thus it is the minimum 1-norm distancefrom a to the boundary of “correct region” using g and h . Note that a itself may or may not be inside the“correct region”. Before proving Lemma 2.2, we ﬁrst establish the following lower bound on d g,h ( a, ε ) . Lemma 2.3.

For any a ∈ F − g,h ([1 − ǫ , ǫ ]) ∩ P, − < ε < , d g,h ( a, ε ) = Ω( k a k ǫpoly ( F ) ) .Proof. Let x ∗ ∈ ( F − g,h (1 + ǫ ) ∪ F − g,h (1 − ǫ )) ∩ P such that d g,h ( a, ε ) = k x ∗ − a k . ǫ < | F g,h ( x ∗ ) − F g,h ( a ) | = | x ∗ t H ′ x ∗ x ∗ t x ∗ − a t H ′ aa t a | < ( max c ∈ [0 , k∇ y y t H ′ yy t y | y =(1 − c ) a + cx ∗ k ) k x ∗ − a k < ( max c ∈ [0 , k y t y ) H ′ y − ( y t H ′ y ) y )( y t y ) | y =(1 − c ) a + cx ∗ k ) k x ∗ − a k < ( max c ∈ [0 , k H ′ k k y k | y =(1 − c ) a + cx ∗ ) k x ∗ − a k < O ( F k a k d g,h ( a, ε )) In the last inequality, we have k x ∗ − a k k x ∗ − a k = d g,h ( a, ε ) and k y k > k a k > √ F k a k .To compute k H ′ k , decompose H ′ into H ′ = U DU t , where D = diag[ b, b, ..., b | {z } k , , ..., , and U = u u ... u m ] . For i = 1 , . . . , k , we set u i = √ b  g (1) I (1 ∈ S ′ i ) g (2) I (2 ∈ S ′ i ) ...g ( m ) I ( m ∈ S ′ i )  ; note that these u i ’s are or-thonormal. It implies k H ′ k b F .We use d ( a, ε ) to denote the random variable of d g,h ( a, ε ) when g and h are randomly chosen. Weare now ready to prove Lemma 2.2. Proof. (of Lemma 2.2) We ﬁrst rewrite the probability Pr ( | F ( x ) − | ≤ ǫ for all x ∈ B ( a, r ) ∩ P )= Pr ( | F ( a ) − | ≤ ǫ ∧ d ( a, ǫ ) ≥ r ∧ d ( a, − ǫ ) ≥ r )=1 − Pr ( | F ( a ) − | > ǫ ∨ d ( a, ǫ ) < r ∨ d ( a, − ǫ ) < r )=1 − Pr (( | F ( a ) − | ≤ ǫ ∧ d ( a, ǫ ) < r ) ∨ ( | F ( a ) − | ≤ ǫ ∧ d ( a, − ǫ ) < r ) ∨ | F ( a ) − | > ǫ ) . Next, consider the event d ( a, ǫ ) ≤ r ∧ | F ( a ) − | < ǫ . By Lemma 2.3, this event implies that ǫ/ < F ( a ) − < ǫ . Similarly, the event d ( a, − ǫ ) < r ∧| F ( a ) − | < ǫ implies − ǫ/ > F ( a ) − > − ǫ .Therefore, Pr (( | F ( a ) − | ≤ ǫ ∧ d ( a, ǫ ) < r ) ∨ ( | F ( a ) − | ≤ ǫ ∧ d ( a, − ǫ ) < r ) ∨ | F ( a ) − | > ǫ ) ≤ Pr ( ǫ/ < F ( a ) − < ǫ ∨ − ǫ/ > F ( a ) − > − ǫ ∨ | F ( a ) − | > ǫ )= Pr ( | F ( a ) − | > ǫ/ ≤ / , where the last inequality follows from the error guarantee of the AMS sketch, when using k = c/ε counters for an appropriate constant c .We are now ready to ﬁnish off the proof of Theorem 2.1. Proof. (of Theorem 2.1) Set r = Ω( k a k ǫpoly ( F ) ) as in Lemma 2.2. We divide the stream into epochs suchthat all frequency vectors inside one epoch are within a ball of radius r . Let f and f +∆ f be respectivelythe frequency vectors at the start and the end of an epoch. It is sufﬁcient to have k ∆ f k ≤ r , whichmeans that the ℓ -norm of the frequency vector increases by a factor of ǫpoly ( F ) every epoch. Thisleads to a total of O (cid:0) F ǫ log m (cid:1) epochs.Suppose we run l independent copies of the AMS sketch and always return the median estimate.Consider any one epoch. Lemma 2.2 has established that any one AMS sketch is good for the entireepoch with probability at least / . If at any time instance, the median estimate is outside the errorrequirement, then that means at least half of the sketches are not good for the epoch, which happenswith probability at most − Ω( l ) by a standard Chernoff argument. Finally, by the union bound, thefailure probability of the entire stream is − Ω( l ) · O (cid:0) F ǫ log m (cid:1) , meaning it is sufﬁcient to have l = O (cid:0) log (cid:0) F ǫ log m (cid:1)(cid:1) = O (log F + log log m + log(1 /ε )) . F p with p ∈ (1 , Indyk’s algorithm works as following. Given l = Θ( ǫ log δ ) , initialize nl independent p -stable distri-bution random variable X ji , where i ∈ [ n ] and j ∈ [ l ] . Maintain the vector y = Ax , where x is the4requency vector and A j,i = X ji . For query, output s -quantile of | y j | for some suitable s . This estimatorreturns ǫ -approximation with error probability δ . Similar to F , we have the following theorem. Theorem 3.1.

Given a stream S = ( a , a , ..., a m ) where a i ∈ [ n ] , let S i = ( a , . . . , a i ) . If l = O ( ε (log F + log log m + log(1 /ε ))) , let ˆ X i be the output of the sketches after processing S i , then Pr (cid:16)V mi =1 | ˆ X i − F p ( S i ) | < ǫF p ( S i ) (cid:17) > / .Proof. Basically, the idea is very similar to the proof for F so we point out the main different.Given A , a ∈ P and any − < ε < , deﬁne F A ( a ) = s − quantileA j a k a k p be the approximation ratio,where A j is j -th row of A , and d A ( a, ε ) be the minimum 1-norm distance between a and ( F − A (1 + ǫ ) ∪ F − A (1 − ǫ )) ∩ P . Also, denote x ∗ ∈ ( F − A (1 + ǫ ) ∪ F − A (1 − ǫ )) ∩ P such that d A ( a, ε ) = k x ∗ − a k .For ﬁxed j , given any y , y ∈ P such that k y k > k y k , | A j y k y k p − A j y k y k p | < n X i =1 | A ji || (( y ) j k y k p − (( y ) j k y k p | < ( max i | ( y ) i k y k p − ( y ) i k y k p | )( n X i =1 | A ji | ) < ( max i | ( y ) j ( k y k p − k y k p k y k p k y k p ) | + | ( y − y ) j k y k p | )( n X i =1 | A ji | ) < ( max i | ( y ) j ( k y − y k p k y k p k y k p ) | + | ( y − y ) j k y k p | )( n X i =1 | A ji | ) < ( 2 k y − y k k y k p )( n X i =1 | A ji | ) Here, the inequality k y k p − k y k p k y − y k p holds when p ∈ (1 , .Suppose a ∈ F − A ([1 − ǫ , ǫ ]) ∩ P , consider the line segment between a and x ∗ , let a = a, a , ..., a q = x ∗ be the ”switching” point when s -quantile is switched in between different j . ε < q − X k =0 ( 2 k a k +1 − a k k k a k k p )( n X i =1 | A ji | ) < q − X k =0 ( 2 k a k +1 − a k k k a k p )( n X i =1 | A ji | ) < O (( k x ∗ − a k k a k p )( X i,j | A ji | )) < O ( poly ( F , l ) k a k d A ( a, ε )) In second last inequality, grouping all the terms with same j . In the last inequality, we have k a k p > F − p k a k . For the term P i,j | A ji | , as they are independent p -stable distribution random variable, P i,j | a ij | < C ( F l ) p for some large constant C with constant probability. Hence, we can conclude that d A ( a, ε ) = Ω( ε k a k poly ( F ,l ) ) O (cid:16) poly ( F ,l ) ǫ log m (cid:17) . The error probability for each epoch is at most − Ω( ε l ) . Therefore, by taking l = O ( ε (log F + log log m + log(1 /ε ))) , the ﬁnal error probability is Θ(1) .Remarks. There are two F p algorithm in [3] which is more complicated. Our technique may alsoapplied to these algorithms while we left it as future work. We ﬁrst review the deﬁnition of the Augmented-Indexing problem AI ( k, N ) . In this problem, Alice has a ∈ [ k ] N , and Bob has t ∈ [ N ] , a · · · , a t − and q ∈ [ k ] . (We use b to denote the input of Bob). Thefunction f AI ( a, b ) evaluates to if a t = q , and otherwise it evaluates to . The input distribution ν ofthe problem deﬁned as follows. a is a uniformly random vector, and t ∈ R [ N ] . Set q = a t with / probability and set q randomly with probability / .We deﬁne the following communication game, and assume N ≥ k . We have k + 1 players { Q, P · · · , P k } . Player Q gets a vector x ∈ [ k ] N . Let v ∈ [ N ] k be a vector of k distinct indices and y ∈ [ k ] k . Each player P i gets ( v i , y i ) , and also a set of pairs { ( v j , y j ) | v j > v i } and a preﬁx of x , i.e. x v i − . P i needs to decide whether x v i = y i . Further more, all the players have to answer correctlysimultaneously. The communication is one-way, i.e., only player Q sends a message to each of the otherplayers. We use AI → k ( k, N ) to denote this communication problem, and we will show that the it hascommunication complexity Ω( kN log k ) . Lemma 4.1.

Let Π be private coin randomized protocol for AI → k ( k, N ) with error probability at most δ ≤ / for any input, then the communication complexity of Π is Ω( kN log k ) .Proof. We deﬁne the input distribution µ as follows. Pick x uniformly randomly, and the distributionof v is uniform conditioned on all entries in v are distinct. Then for each i , with / probability, set y i = x v i and with / probability pick y i randomly. We will use capital letters to denote correspondingrandom variables. Let { M , · · · , M k } be the set of messages Q sends to each player respectively. Giventhe input is sampled from µ , we will show that H ( M i ) = Ω( N log k ) for at least a constant fractionof these messages. In the rest of the proof, the probability is over the random coins in Π and the inputdistribution.The proof follows the framework of [4]. In our communication problem, each player will know moreinformation about x than Bob in the Augmented-Indexing problem, which introduce more complication.Let L i = { j | v j > v i } , and E i be the event that P i answer correctly. We deﬁne a set of events F i = { E j | j ∈ L i } . Given V = v , we can apply the chain rule Pr ( E , · · · , E k | V = v ) = Π i Pr ( E i | F i , V = v ) . By our assumption, Pr ( E , · · · , E k | V = v ) ≥ − δ . Using the bound p ≤ e − (1 − p ) (valid for all p ∈ [0 , ), we have X i ( Pr ( E i | F i , V = v ) − ≥ ln(1 − δ ) ≥ − δ, where the last inequality uses the ﬁrst-order approximation of ln at . Multiplying both side of the aboveinequality by Pr ( V = v ) and then sum over all possible v , we have X i ( Pr ( E i | F i ) − ≥ − δ.

6y Markov’s inequality, for at least half of the indices i we have Pr ( E i | F i ) ≥ − δ/k. We call such indices good .Next we give a reduction, using Π to solve Augmented-Indexing problem. We hardwire a good index i , and call this protocol Π i . In this protocol, Bob will simulate the behavior of P i and Alice simulatethe rest of the players. Given an input of the Augmented-Indexing problem a and b = ( t, a t , and if so, output ’abort’, and this happens with probability at most δ . Bobthen runs Π simulating player P i , and outputs whatever P i outputs, because Bob has v and y .Notice that o j is correct for all j ∈ L i if and only if s < t , and in this case all the events in F i happen. So the probability that the above protocol outputs ’abort’ is at most / . Conditioned on’abort’ does not happen, we have s < t , which implies that all events in F i happen, and the successprobability of Π i is at least Pr ( E i | F i ) ≥ − δ/k , since i is good.We next analyze the information cost of the protocol Π i for a good i . By deﬁnition I ( M i , S ; X | V, Y ) = H ( X | V, Y ) − H ( X | M i , V, Y, S ) . It is easy to see H ( X | V, Y ) ≥ N log k , since we assume N ≥ k . So we only need to upperbound H ( X | M i , V, Y, S ) . In Π i , the message send by Alice is M i , V − i , Y − i , S , then Bob ’abort’ withprobability at most / , and condition on not ’abort’, Bob outputs a correct answer with probability atleast − / k . With such properties, we have the following lemma, which is shown in [4]. Lemma 4.2. H ( X | M i , V − i , Y − i , S ) ≤ N log k . For completeness, we provide a proof of the above lemma in Appendix A. By property of entropy, H ( X | M i , V, Y, S ) ≤ H ( X | M i , V − , Y − i , S ) , so we have I ( M i , S ; X | V, Y ) = Ω( N log k ) . Because S only takes log N bits, we get H ( M i ) ≥ H ( M i , S ) − log N ≥ I ( M i , S ; X | V, Y ) − log N = Ω( N log k ) . We have shown that there are at least k/ good i , so we prove that the communication complexity of Π is Ω( kN log k ) . 7 Lower bound of tracking F p in the cash register model We will give a reduction from AI → k ( k, N ) . For convenience, we change the deﬁnition of AI → k ( k, N ) slightly. Here each player P j gets ( v j , y j ) , and also a set of pairs { ( v ℓ , y ℓ ) | v ℓ < v j } and a sufﬁx of x ,i.e. x v j +1: N . P j needs to decide whether x v j = y j . Clear this new problem is equivalent to AI → k ( k, N ) .For < p ≤ and p = 1 , we deﬁne t = p | p − − | and q = t /p . We have the following lower bound fortracking F p . Theorem 5.1.

For any linear sketch based algorithm which can track F p continuously within accuracy (1 ± p − p +3 ) in the cash register model, the space used is at least Ω(log m log log m/ log q ) bits.Proof. Given an linear sketch algorithm which can track F within error (1 ± ε ) of an incrementalstream at all time with probability − δ , we show how to use this algorithm to solve AI → k ( k, N ) deﬁned above. We use L to denote the algorithm, and L ( f ) to denote the memory of the algorithmwhen the input frequency vector is f . For linear sketch, the current state of algorithm does not dependon the order of the stream, and only depends on the current frequency vector. Let O ( L ( f )) denote theoutput of the algorithm, when the current memory state is L ( f ) .Given x , Q runs the following reduction, which use similar ideas as in [3]. Each item in the streamis a pair ( i, x i ) . For i ∈ [ N ] , Q insert ⌊ q i ⌋ items ( i, x i ) . We use f ( x ) to denote the frequency vectorof this stream, and we can also view f : [ k ] N → N Nk as a linear transformation. Then Q runs thestreaming algorithm to process the stream just constructed, and sends the memory content L ( f ( x )) toeach of the other players.For each j , P j ﬁrst computes L ( f ( x ≤ v j )) = L ( f ( x − x >v j )) . Since the L and f are linear, we cando this. Then P j inserts ⌊ q v ℓ ⌋ copies of ( v ℓ , y ℓ ) for all ℓ such that v ℓ < v j . P j can do this because heknows all ( v ℓ , y ℓ ) with v ℓ < v j . Now the frequency vector is u = f ( x ≤ v j + X ℓ : v ℓ . We have F p ( f ( x t i , q i ) , and they want to compute a k -bit vector o , such that o i = f AI ( a i , b i ) for all i ∈ [ k ] .We call this problem AI k ( k, N ) , and we have the following results from [4]. Theorem 6.1.

For any integer k and N , the communication complexity of solving AI k ( k, N ) with con-stant probability is Ω( kN log k ) . Theorem 6.2.

For any linear sketch based algorithm which can track F p continuously within accuracy (1 ± . in the turnstile model, the space used is at least Ω( p log m ) bits.Proof. For < p ≤ , we deﬁne q = 2 /p . Let f : [ k ] N → N Nk be the same linear functiondeﬁned as above. We next give a reduction from AI k ( k, N ) to tracking F p in the turnstile model. Let L be a linear sketch. Alice sends L ( f ( a i )) for i ∈ [ k ] to Bob. For each i , Bob compute a sketch Γ i = L ( f ( a i − a i>t i − q i e t i )) , where e j is the j th vector in the standard basis. It is easy to verify that t i / p ≤ F p ( f ( a it i − q i e t i ) = f ( a it i − q i e t i )) ≥ · t i . So the if the output O (Γ i ) is within accuracy (1 ± . F p , then Bob can distinguish these two cases, and output f AI ( a i , b i ) correctly.Now we need to prove that Bob can solve k instances simultaneously. As in the proof in cash registermodel, we construct a imaginary stream such that the frequency vector each Γ i sketched appear in thestream at sometime, but now we can use negative updates in the stream.The stream has k phases, and the i th phase corresponds to ( a i , b i ) . In the i th phase, we ﬁrst insertssets of positive updates, so that at the end the frequency vector is f ( a i ≤ t i ) , then inserts ⌊ q t i ⌋ copiesof (( t i , q i ) , − . The last step in this phase is to reverse all the above updates, so that the frequencyvector becomes . Clearly, for each i , the frequency vector before the cleaning step in phase i is exactly f ( a i − a i>t i − q i e t i ) , which is the vector sketched by Γ i . We call the above stream S . By our assumption,the algorithm L is correct at any time during the stream S , so Bob solves AI k ( k, N ) . The communication9ost is kT , where T is the amount of space used by L , and thus T = Ω( N log k ) . The number of updatesin S is at most k N/p . Given m , we set k = √ m , and N = p log m , so that the number of update isbounded by m , and we have T = Ω( p log m ) . References [1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency mo-ments.

Journal of Computer and System Sciences , 58(1):137–147, 1999.[2] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications.

Journalof Computer and System Sciences , 31(2):182–209, 1985.[3] D. M. Kane, J. Nelson, and D. P. Woodruff. On the exact space complexity of sketching andstreaming small norms. In

Proc. ACM-SIAM Symposium on Discrete Algorithms , 2010.[4] M. Molinaro, D. P. Woodruff, and G. Yaroslavtsev. Beating the direct sum theorem in communi-cation complexity with implications for sketching. In

Proc. ACM-SIAM Symposium on DiscreteAlgorithms , 2013.[5] M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second momentestimation. In

Proc. ACM-SIAM Symposium on Discrete Algorithms , 2004.

A Proof of Lemma 4.2

Proof.

We focus on tuples ( t, a, r ) , where r is the private coins used by Alice and Bob in Π i , includingthe randomness used in Π and v − i , y − i which are sampled by Alice. Let U = { ( t, a, r ) : Π i ( a, t, a t , r ) = ′ abort ′ } . Here we use Π i ( a, t, a t , r ) to denote the output of Π i with input a, t, a t (yes instance) and random coins r . We use f ( a, t, q ) to denote the corresponding function of Augmented-indexing problem. We deﬁne U = { ( t, a, r ) : ∃ q st Π i ( a, t, q, r ) = f ( a, t, q ) ∧ Π i ( a, t, q, r ) = ′ abort ′ } . We say a tuple good if it does not belong to either U or U . Notice that if ( t, a, r ) is good, then: (1) Π i ( a, t, a t , r ) = 1 ; (2) for every q = a t , Π i ( a, t, q, r ) = 1 . Lemma A.1.

For every index t ∈ N , there is a predictor g t such that Pr ( g t ( M i ( A ) , A

We set g ′ t ( M i ( a ) , a

P r (( T, A, R ) is not good ) ≤ / .Proof. By union bound, we only need to show that the probability

P r (( T, A, R ) ∈ U ) + P r (( T, A, R ) ∈ U ) ≤ / . We have

P r (( T, A, R ) ∈ U ) = Pr (Π i ( A, T, A T , R ) = abort )= Pr (Π i ( A, T, Q, R ) = abort | Q = A T )= Pr ( protocol aborts | Q = A T ) . Since Pr ( Q = A T ) = 1 / and Pr ( protocol aborts ) ≤ / , we have Pr (( T, A, R ) ∈ U ) ≤ / . Wealso have Pr (( T, A, R ) ∈ U ) = Pr (cid:2) ∨ q ∈ [ k ] (Π i ( A, T, q, R ) = f ( A, T, q ) ∧ Π i ( A, T, q, R ) = abort ) (cid:3) ≤ X q ∈ [ k ] Pr [Π i ( A, T, q, R ) = f ( A, T, q ) ∧ Π i ( A, T, q, R ) = abort ] ≤ X q ∈ [ k ] Pr [Π i ( A, T, q, R ) = f ( A, T, q ) | Π i ( A, T, q, R ) = abort ] ≤ k · Pr [Π i ( A, T, Q, R ) = f ( A, T, Q ) | Π i ( A, T, Q, R ) = abort ] ≤ k · k = 1 / . So P r (( T, A, R ) is not good ) ≤ / .As the distribution of T is uniform, we have N X t =1 Pr (( t, A, V − i , Y − i ) is not good ) = N · Pr (( T, A, V − i , Y − i ) is not good ) ≤ N/20