Perfect $L_p$ Sampling in a Data Stream∗
Rajesh Jayaram, Carnegie Mellon University, [email protected]
David P. Woodruff†, Carnegie Mellon University, [email protected]

Abstract
In this paper, we resolve the one-pass space complexity of perfect $L_p$ sampling for $p \in (0,2)$ in a stream. Given a stream of updates (insertions and deletions) to the coordinates of an underlying vector $f \in \mathbb{R}^n$, a perfect $L_p$ sampler must output an index $i$ with probability $|f_i|^p/\|f\|_p^p$, and is allowed to fail with some probability $\delta$. So far, for $p > 0$ no perfect $L_p$ sampler using only $\mathrm{poly}(\log n)$ bits of space was known. In 2010, Monemizadeh and Woodruff introduced an approximate $L_p$ sampler, which outputs $i$ with probability $(1 \pm \nu)|f_i|^p/\|f\|_p^p$, using space polynomial in $\nu^{-1}$ and $\log n$. The space complexity was later reduced by Jowhari, Sağlam, and Tardos to roughly $O(\nu^{-p}\log^2 n \log \delta^{-1})$ for $p \in (0,2)$, which matches the $\Omega(\log^2 n \log \delta^{-1})$ lower bound in terms of $n$ and $\delta$, but is loose in terms of $\nu$. Given these nearly tight bounds, it is perhaps surprising that no lower bound exists in terms of $\nu$; not even a bound of $\Omega(\nu^{-1})$ is known. In this paper, we explain this phenomenon by demonstrating the existence of an $O(\log^2 n \log \delta^{-1})$-bit perfect $L_p$ sampler for $p \in (0,2)$. Thus $\nu$ need not factor into the space of an $L_p$ sampler, which closes the complexity of the problem for this range of $p$. For $p = 2$, our bound is $O(\log^3 n \log \delta^{-1})$ bits, which matches the prior best known upper bound of $O(\nu^{-2}\log^3 n \log \delta^{-1})$, but has no dependence on $\nu$. For $p < 2$, our bound holds in the random oracle model, matching the lower bounds in that model. However, we show that our algorithm can be derandomized with only an $O((\log \log n)^2)$ blow-up in the space (and no blow-up for $p = 2$). Our derandomization technique is quite general, and can be used to derandomize a large class of linear sketches, including the more accurate count-sketch variant of Minton and Price [MP14], resolving an open question in that paper. Finally, we show that a $(1 \pm \epsilon)$ relative error estimate of the frequency $f_i$ of the sampled index $i$ can be obtained using an additional $O(\epsilon^{-p}\log^2 n)$ bits of space for $p < 2$, and $O(\epsilon^{-2}\log^2 n)$ bits for $p = 2$, which was possible before only by running the prior algorithms with $\nu = \epsilon$.

∗A preliminary version of this work appeared in FOCS 2018.
†The authors thank the partial support by the National Science Foundation under Grant No. CCF-1815840.
Introduction
The streaming model of computation has become increasingly important for the analysis of massive datasets, where the sheer size of the input imposes stringent restrictions on the resources available to algorithms. Examples of such datasets include internet traffic logs, sensor networks, financial transaction data, database logs, and scientific data streams (such as huge experiments in particle physics, genomics, and astronomy). Given their prevalence, there is a large body of literature devoted to designing extremely efficient one-pass algorithms for analyzing data streams. We refer the reader to [BBD+02, M+05] for surveys of these algorithms and their applications.

More recently, the technique of sampling has proven to be tremendously powerful for the analysis of data streams. Substantial literature has been devoted to the study of sampling for problems in big data [M+05, Haa16, Coh15, CDK+09, CDK+14, CCD11, EV03, GM98a, Knu98, MM12, Vit85b, CCD12, GLH08, GLH06], with applications to network traffic analysis [TLJ10, HNG+07, GKMS01, MCS+06, Duf04], databases [Olk93, Haa16, HNSS96, HS92, LNS90, LN95], distributed computation [WZ16, CMYZ10, CMYZ12, TW11], and low-rank approximation [WZ16, FKV04, DV06]. While several models for sampling in data streams have been proposed [BDM02, AKO10, CMYZ10], one of the most widely studied are the $L_p$ samplers introduced in [MW10]. Roughly speaking, given a vector $f \in \mathbb{R}^n$, the goal of an $L_p$ sampler is to return an index $i \in \{1, 2, \ldots, n\}$ with probability $|f_i|^p/\|f\|_p^p$. In the data stream setting, the vector $f$ is given by a sequence of updates (insertions or deletions) to its coordinates of the form $f_i \leftarrow f_i + \Delta$, where $\Delta$ can either be positive or negative. A one-pass $L_p$ sampler must return an index given only one pass through the updates of the stream.

Since their introduction, $L_p$ samplers have been utilized to develop alternative algorithms for important streaming problems, such as the heavy hitters problem, $L_p$ estimation, cascaded norm estimation, and finding duplicates in data streams [AKO10, MW10, JST11, BOZ12]. For the case of $p = 1$ and insertion-only streams, where the updates to $f$ are strictly positive, the problem is easily solved using $O(\log n)$ bits of space with the well-known reservoir sampling algorithm [Vit85a]. When deletions to the stream are allowed, or when $p \neq 1$, the problem is more complicated. In fact, the question of whether such samplers even exist was posed by Cormode, Muthukrishnan, and Rozenbaum in [CMR05]. Later on, Monemizadeh and Woodruff demonstrated that, if one permits the sampler to be approximately correct, such samplers are indeed possible [MW10]. We formally state the guarantee given by an approximate $L_p$ sampler below.

Definition 1.
Let $f \in \mathbb{R}^n$ and $\nu \in [0,1]$. For $p > 0$, an approximate $L_p$ sampler with $\nu$-relative error is an algorithm which returns an index $i \in \{1, 2, \ldots, n\}$ such that for every $j \in \{1, 2, \ldots, n\}$,
$$\Pr[i = j] = (1 \pm \nu)\frac{|f_j|^p}{\|f\|_p^p} + O(n^{-c}),$$
where $c \geq 1$ is some arbitrarily large constant. For $p = 0$, the problem is to return each $j$ in the support of $f$ with probability $(1 \pm \nu)|\{j' : f_{j'} \neq 0\}|^{-1} + O(n^{-c})$. If $\nu = 0$, then the sampler is said to be perfect.

An $L_p$ sampler is allowed to output FAIL with some probability $\delta$. However, in this case it must not output any index.

The one-pass approximate $L_p$ sampler introduced in [MW10] requires $\mathrm{poly}(\nu^{-1}, \log n)$ space, albeit with rather large exponents. Later on, in [AKO10], the complexity was reduced significantly to $O(\nu^{-p}\log^3 n \log(1/\delta))$ bits for $p \in [1,2]$, using a technique called precision sampling.¹ Roughly, the technique of precision sampling consists of scaling the coordinates $f_i$ by random variable coefficients $1/t_i$ as the updates arrive, resulting in a new stream vector $z \in \mathbb{R}^n$ with $z_i = f_i/t_i$. The algorithm then searches for all $z_i$ which cross a certain threshold $T$. Observe that if $t_i = u_i^{1/p}$, where $u_i$ is uniform on $[0,1]$, then the probability that $f_i/t_i \geq T$ is precisely $\Pr[u_i < |f_i|^p/T^p] = |f_i|^p/T^p$. By running an $L_p$ estimation algorithm to obtain $T = \Theta(\|f\|_p)$, an $L_p$ sampler can then return any $i$ with $z_i \geq T$ as its output. These heavy coordinates can be found using any of the well-known $\eta$-heavy hitters algorithms for a sufficiently small precision $\eta$.

¹We note that previous works [JST11, KNP+17] have cited the sampler of [AKO10] as using $O(\log^3 n)$ bits of space; however, the space bound given in their paper is in machine words, and is therefore an $O(\log^4 n)$ bit bound with $\delta = 1/\mathrm{poly}(n)$. In order to obtain an $O(\log^3 n \log(1/\delta))$ bit sampler, their algorithm must be modified to use fewer repetitions.

| $L_p$ sampling upper bound (bits) | Range of $p$ | Notes | Citation |
| $O(\log^3 n)$ | $p = 0$ | perfect $L_0$ sampler, $\delta = 1/\mathrm{poly}(n)$ | [FIS08] |
| $O(\log^2 n \log(1/\delta))$ | $p = 0$ | perfect $L_0$ sampler | [JST11] |
| $\mathrm{poly}(\nu^{-1}, \log n)$ | $p \in [0,2]$ | $\delta = 1/\mathrm{poly}(n)$ | [MW10] |
| $O(\nu^{-p}\log^3 n \log(1/\delta))$ | $p \in [1,2]$ | $(1\pm\nu)$-relative error | [AKO10] |
| $O(\nu^{-\max\{1,p\}}\log^2 n \log(1/\delta))$ | $p \in (0,2)\setminus\{1\}$ | $(1\pm\nu)$-relative error | [JST11] |
| $O(\nu^{-1}\log(\nu^{-1})\log^2 n \log(1/\delta))$ | $p = 1$ | $(1\pm\nu)$-relative error | [JST11] |
| $O(\log^2 n \log(1/\delta))$ | $p \in (0,2)$ | perfect $L_p$ sampler, random oracle model, matches lower bound | This work |
| $O(\log^2 n \log(1/\delta)(\log\log n)^2)$ | $p \in (0,2)$ | perfect $L_p$ sampler | This work |
| $O(\log^3 n \log(1/\delta))$ | $p = 2$ | perfect $L_2$ sampler | This work |
| $O(\log^3 n)$ | $p \in (0,2)$ | perfect $L_p$ sampler, $\delta = 1/\mathrm{poly}(n)$ | This work |

Figure 1: Evolution of one-pass $L_p$ sampling upper bounds, with the best known lower bound of $\Omega(\log^2 n \log(1/\delta))$ for $p \geq 0$ [KNP+17] (see also [JST11] for a lower bound for constant $\delta$).

Using a tighter analysis of this technique with the same scaling variables $t_i = u_i^{1/p}$, Jowhari, Sağlam, and Tardos reduced the space complexity of $L_p$ sampling for $p < 2$ to $O(\nu^{-\max\{1,p\}}\log^2 n \log(1/\delta))$ bits for $p \in (0,2)\setminus\{1\}$, and $O(\nu^{-1}\log(\nu^{-1})\log^2 n \log(1/\delta))$ bits of space for $p = 1$ [JST11]. Roughly speaking, their improvements result from a more careful consideration of the precision $\eta$ needed to determine when a $z_i$ crosses the threshold, which they do via the tighter tail-error guarantee of the well-known count-sketch heavy hitters algorithm [CCFC02a]. In addition, they give an $O(\log^2 n \log(1/\delta))$ perfect $L_0$ sampler, and demonstrated an $\Omega(\log^2 n)$-bit lower bound for $L_p$ samplers for any $p \geq 0$. Recently, this lower bound was extended to $\Omega(\log^2 n \log(1/\delta))$ [KNP+
17] bits, which closes the complexity of the problem for $p = 0$.

For $p \in (0,2]$, the known upper bounds for approximate $L_p$ samplers are tight in terms of $n, \delta$, but a gap exists in the dependency on $\nu$. This being the case, it would seem natural to search for an $\Omega(\nu^{-p}\log^2 n \log(1/\delta))$ lower bound to close the complexity of the problem. It is perhaps surprising, therefore, that no lower bound in terms of $\nu$ exists; not even an $\Omega(\nu^{-1})$ bound is known. This poses the question of whether the $\Omega(\log^2 n \log(1/\delta))$ lower bound is in fact the correct complexity of the problem.

In this paper, we explain the phenomenon of the lack of an $\Omega(\nu^{-1})$ lower bound by showing that $\nu$ need not enter the space complexity of an $L_p$ sampler at all. In other words, we demonstrate the existence of perfect $L_p$ samplers using $O(\log^2 n \log(1/\delta)(\log\log n)^2)$ bits of space for $p \in (0,2)$, with no dependence on $\nu$ whatsoever. In the random oracle model, where we are given random access to an arbitrarily long tape of random bits which do not count against the space of the algorithm, our upper bound is $O(\log^2 n \log(1/\delta))$, which matches the lower bound in the random oracle model. For $p = 2$, our space is $O(\log^3 n \log(1/\delta))$ bits, which matches the best known upper bounds in terms of $n, \delta$, yet again has no dependency on $\nu$. In addition, for $p < 2$ and $\delta < 1/n$, we obtain an $O(\log^2 n \log(1/\delta))$-bit perfect $L_p$ sampler, which also tightly matches the lower bound without paying the extra $(\log\log n)^2$ factor. A summary of the prior upper bounds for $L_p$ sampling, along with the contributions of this work, is given in Figure 1.

In addition to outputting a perfect sample $i$ from the stream,
for $p \in (0,2)$ we also show that, conditioned on an index being output, given an additional additive $O(\min\{\epsilon^{-2}, \epsilon^{-p}\log(\frac{1}{\delta})\}\log n \log(1/\delta))$ bits of space we can provide a $(1 \pm \epsilon)$ approximation of the frequency $|f_i|$ with probability $1 - \delta$. This separates the space dependence on $\log^2 n$ and $\epsilon$ for frequency approximation, allowing us to obtain a $(1 \pm \epsilon)$ approximation of $|f_i|$ in $O(\log^2 n + \epsilon^{-p}\log n)$ bits of space with constant probability, whereas before this required $O(\epsilon^{-p}\log^2 n)$ bits of space. For $p = 2$, our bound is $O(\epsilon^{-2}\log^2 n \log(1/\delta))$, which still improves upon the prior best known bounds for estimating the frequency by an $O(\log n)$ factor. Finally, we show an $\Omega(\epsilon^{-p}\log n \log(1/\delta))$ bits of space lower bound for producing the $(1 \pm \epsilon)$ estimate (conditioned on an index being returned).

Since their introduction, it has been observed that $L_p$ samplers can be used as a building block in algorithms for many important streaming problems, such as finding heavy hitters, $L_p$-norm estimation, cascaded norm estimation, and finding duplicates in data streams [AKO10, MW10, JST11, BOZ12]. $L_p$ samplers, particularly for $p = 1$, are often used as a black-box subroutine to design representative histograms of $f$ on which more complicated algorithms are run [GMP, GM98b, Olk93, GKMS02, HNG+07, CMR05]. For these black-box applications, the only property of the samplers needed is the distribution of their samples. Samplers with relative error are statistically biased and, in the analysis of more complicated algorithms built upon such samplers, this bias and its propagation over multiple samples must be accounted for and bounded. The analysis and development of such algorithms would be simplified dramatically, therefore, with the assumption that the samples were truly uniform (i.e., drawn from a perfect $L_p$ sampler). In this case, no error terms or variational distance need be accounted for. Our results show that such an assumption is possible without affecting the space complexity of the sampler.

Note that in Definition 1, we allow a perfect sampler to have $n^{-c+1}$ variation distance to the true $L_p$ distribution. We note that this definition is in line with prior work, observing that even the perfect $L_0$ sampler of [JST11] incurs such an error from derandomizing with Nisan's PRG. Nevertheless, this error will never be detected if the sampler is run polynomially many times in the course of constructing a histogram, and such a sampler is therefore statistically indistinguishable from a truly uniform sampler and can be used as a black box.

Another motivation for utilizing perfect $L_p$ samplers comes from applications in privacy. Here $f \in \mathbb{R}^n$ is some underlying dataset, and we would like to reveal a sample $i \in [n]$ drawn from the $L_p$ distribution over $f$ to some external party without revealing too much global information about $f$ itself.²

²A previous version of this work claimed $O(\log^2 n \log(1/\delta))$ bits of space for $p < 2$, but contained an error in the derandomization. Thus, this bound only held in the random oracle model. In the present version we correct this derandomization using a slightly different algorithm, albeit with a $(\log\log n)^2$ blow-up in the space. The algorithm from the previous version can be found in Appendix A, along with a new analysis of its derandomization which allows it to run in $O(\log^2 n (\log\log n)^2)$ bits of space.
Using an approximate $L_p$ sampler introduces a $(1 \pm \nu)$ multiplicative bias into the sampling probabilities, and this bias can depend on global properties of the data. For instance, such a sampler might bias the sampling probabilities of a large set $S$ of coordinates by a $(1+\nu)$ factor if a certain global property $P$ holds for $f$, and may instead bias them by $(1-\nu)$ if a disjoint property $P'$ holds. Using only a small number of samples, an adversary would then be able to distinguish whether $P$ or $P'$ holds by determining how these coordinates were biased. On the other hand, the bias in the samples produced by a perfect $L_p$ sampler is polynomially small, and thus the leakage of global information could be substantially smaller when using one, though one would need to formally define a notion of leakage and privacy for the given application.

Our main algorithm is inspired by the precision sampling technique used in prior works [AKO10, JST11], but with some marked differences. To describe how our sampler achieves the improvements mentioned above, we begin by observing that all $L_p$ sampling algorithms since [AKO10] have adhered to the same algorithmic template (shown in Figure 2). This template employs the classic count-sketch algorithm of [CCFC02b] as a subroutine, which is easily introduced. For $k \in \mathbb{N}$, let $[k]$ denote the set $\{1, 2, \ldots, k\}$. Given a precision parameter $\eta$, count-sketch selects pairwise independent hash functions $h_j : [n] \to [6/\eta^2]$ and $g_j : [n] \to \{1, -1\}$, for $j = 1, 2, \ldots, d$, where $d = \Theta(\log n)$. Then for all $i \in [d]$, $j \in [6/\eta^2]$, it computes the linear function $A_{i,j} = \sum_{k \in [n],\, h_i(k) = j} g_i(k) f_k$, and outputs an approximation $y$ of $f$ given by $y_k = \mathrm{median}_{i \in [d]}\{g_i(k) A_{i,h_i(k)}\}$. We will discuss the estimation guarantee of count-sketch at a later point.

Input: $f \in \mathbb{R}^n$. Output: a sampled index $i^* \in [n]$.
1. Perform a linear transformation on $f$ to obtain $z$.
2. Run an instance $A$ of count-sketch on $z$ to obtain the estimate $y$.
3. Find $i^* = \arg\max_i |y_i|$. Then run a statistical test on $y$ to decide whether to output $i^*$ or FAIL.

Figure 2: Algorithmic template for $L_p$ sampling.

The algorithmic template is as follows. First, perform some linear transformation on the input vector $f$ to obtain a new vector $z$. Next, run an instance $A$ of count-sketch on $z$ to obtain the estimate $y$. Finally, run some statistical test on $y$. If the test fails, then output FAIL; otherwise, output the index of the largest coordinate (in magnitude) of $y$. We first describe how the sampler of [JST11] implements the steps in this template. Afterwards, we describe the different implementation decisions made in our algorithm that allow it to overcome the limitations of prior approaches.
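To make the count-sketch subroutine concrete, the following is a minimal Python sketch of the update and median-decode procedures just described. The class name is ours, and explicit random tables stand in for the pairwise independent hash functions $h_j$ and sign functions $g_j$; this is an illustration under those simplifying assumptions, not the implementation used in the paper.

```python
import random
import statistics

class CountSketch:
    """Illustrative count-sketch: a d x k table of signed bucket sums."""

    def __init__(self, n, d, k):
        self.d, self.k = d, k
        self.A = [[0.0] * k for _ in range(d)]
        # Explicit random tables in place of pairwise independent hashes.
        self.h = [[random.randrange(k) for _ in range(n)] for _ in range(d)]
        self.g = [[random.choice((1, -1)) for _ in range(n)] for _ in range(d)]

    def update(self, v, delta):
        # Stream update (v, delta): A_{i, h_i(v)} += delta * g_i(v) per row.
        for i in range(self.d):
            self.A[i][self.h[i][v]] += delta * self.g[i][v]

    def estimate(self, j):
        # y_j = median over rows i of g_i(j) * A_{i, h_i(j)}.
        return statistics.median(self.g[i][j] * self.A[i][self.h[i][j]]
                                 for i in range(self.d))
```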
Prior Algorithms. The samplers of [JST11, AKO10] utilize the technique known as precision sampling, which employs the following linear transformation. The algorithms first generate random variables $(t_1, \ldots, t_n)$ with limited independence, where each $t_i \sim \mathrm{Uniform}[0,1]$. Each coordinate $f_i$ is scaled by the coefficient $1/t_i^{1/p}$ to obtain the transformed vector $z \in \mathbb{R}^n$ given by $z_i = f_i/t_i^{1/p}$, thus completing Step 1 of Figure 2. For simplicity, we now restrict to the case of $p = 1$ and the algorithm of [JST11]. The goal of the algorithm is then to return an item $z_i$ that crosses the threshold $|z_i| > \nu^{-1}R$, where $R = \Theta(\|f\|_1)$ is a constant factor approximation of the $L_1$ norm. Note the probability that this occurs is proportional to $\nu|f_i|/\|f\|_1$.

Next, implementing the second step of Figure 2, the vector $z$ is hashed into count-sketch to find an item that has crossed the threshold. Using the stronger tail guarantee of count-sketch, the estimate vector $y$ satisfies $\|y - z\|_\infty \leq \sqrt{\eta}\,\|z_{\mathrm{tail}(1/\eta)}\|_2$, where $z_{\mathrm{tail}(1/\eta)}$ is $z$ with the $1/\eta$ largest coordinates (in magnitude) set to $0$. Now the algorithm runs into trouble when it incorrectly identifies $z_i$ as crossing the threshold when it has not, or vice versa. However, if the tail error $\sqrt{\eta}\,\|z_{\mathrm{tail}(1/\eta)}\|_2$ is at most $O(\|f\|_1)$, then since $t_i$ is a uniform variable the probability that $z_i$ is close enough to the threshold to be misidentified is $O(\nu)$, which results in at most $(1 \pm \nu)$ relative error in the sampling probabilities. Thus it will suffice to have $\sqrt{\eta}\,\|z_{\mathrm{tail}(1/\eta)}\|_2 = O(\|f\|_1)$ with probability $1 - \nu$. To show that this is the case, consider the level sets
$$I_k = \{z_i \mid |z_i| \in (2^{-(k+1)/p}\|f\|_p,\; 2^{-k/p}\|f\|_p]\},$$
and note $\mathbb{E}[|I_k|] \approx 2^k$. We observe here that the results of [JST11] can be partially attributed to the fact that for $p < 2$, the total contribution $\Theta(2^{-2k/p}\|f\|_p^2\,|I_k|)$ of the level sets to $\|z\|_2^2$ decreases geometrically with $k$, and so with constant probability we have $\|z\|_2 = O(\|f\|_p)$. Moreover, if one removes the top $\log(1/\nu)$ largest items, the contribution of the remaining items to the $L_2$ is $O(\|f\|_1)$ with probability $1 - \nu$. So taking $1/\eta = \log(1/\nu)$, the tail error from count-sketch has the desired size. Since the tail error does not include the $1/\eta$ largest coordinates, this holds even conditioned on a fixed value $t_{i^*}$ of the maximizer.

Now with probability $\nu$ the guarantee on the error from the prior paragraph does not hold, and in this case one cannot still output an index $i$, as this would result in a $\nu$-additive error sampler. Thus, as in Step 3 of Figure 2, the algorithm must implement a statistical test to check that the guarantee holds. To do this, using the values of the largest $1/\eta$ coordinates of $y$, they produce an estimate of the tail error and output FAIL if it is too large. Otherwise, the item $i^* = \arg\max_i |y_i|$ is output if $|y_{i^*}| > \nu^{-1}R$. The whole algorithm is run $O(\nu^{-1}\log(1/\delta))$ times so that an index is output with probability $1 - \delta$.
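As a self-contained illustration of the threshold behavior just described, here is a minimal sketch (ours, not from the paper) of a single offline trial of precision sampling for $p = 1$. The threshold $T$ and the explicit storage of $z$ are simplifying assumptions; the actual streaming algorithm recovers the threshold-crossers approximately from a count-sketch of the scaled stream.

```python
import random

def precision_sample_l1(f, T):
    """One offline trial of precision sampling for p = 1 (a sketch).
    Each coordinate is scaled by 1/t_i with t_i ~ Uniform[0, 1]; for
    |f_i| <= T,
        Pr[|f_i| / t_i >= T] = Pr[t_i <= |f_i| / T] = |f_i| / T,
    so each index crosses the threshold with probability proportional
    to its weight |f_i|."""
    # Guard against the (probability ~2^-53) event t_i == 0.0.
    z = [abs(fi) / (random.random() or 1e-12) for fi in f]
    # A streaming algorithm would find these crossers via count-sketch
    # rather than storing z explicitly.
    return [i for i, zi in enumerate(z) if zi >= T]
```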
Our Algorithm. Our first observation is that, in order to obtain a truly perfect sampler, one needs to use different scaling variables $t_i$. Notice that the approach of scaling by inverse uniform variables and returning a coordinate which reaches a certain threshold $T$ faces the obvious issue of what to return when more than one of the variables $|z_i|$ crosses $T$. This is solved by simply outputting the maximum of all such coordinates. However, the probability of an index becoming the maximum and reaching a threshold is drawn from an entirely different distribution, and for uniform variables $t_i$ this distribution does not appear to be the correct one. To overcome this, we must use a distribution where the maximum index $i$ of the variables $(|f_1 t_1^{-1/p}|, |f_2 t_2^{-1/p}|, \ldots, |f_n t_n^{-1/p}|)$ is drawn exactly according to the $L_p$ distribution $|f_i|^p/\|f\|_p^p$. We observe that the distribution of exponential random variables has precisely this property, and thus to implement Step 1 of Figure 2 we set $z_i = f_i/t_i^{1/p}$, where $t_i$ is an exponential random variable. We remark that exponential variables have been used in the past, such as for $F_p$ moment estimation, $p > 2$, in [AKO10] and regression in [WZ13]. However, it appears that their applicability to sampling has never before been exploited.

Next, we carry out the count-sketch step by hashing our vector $z$ into a count-sketch data structure $A$. Because we are only interested in the maximizer of $z$, we develop a modified version of count-sketch, called count-max. Instead of producing an estimate $y$ such that $\|y - z\|_\infty$ is small, count-max simply checks, for each $i \in [n]$, how many times $z_i$ hashed into the largest bucket (in absolute value) of a row of $A$. If this number is at least a $4/5$ fraction of the rows, count-max declares that $z_i$ is the maximizer of $z$. We show that with high probability, count-max never incorrectly declares an item to be the maximizer, and moreover, if $|z_i| > 20(\sum_{j \neq i} z_j^2)^{1/2}$, then count-max will declare $i$ to be the maximizer. Using the min-stability property of exponential random variables, we can show that the maximum item $|z_{i^*}| = \max_i\{|z_i|\}$ is distributed as $\|f\|_p/E_1^{1/p}$, where $E_1$ is another exponential random variable. Thus $|z_{i^*}| = \Omega(\|f\|_p)$ with constant probability. Using a more general analysis of the $L_2$ norm of the level sets $I_k$, we can show that $(\sum_{j \neq i^*} z_j^2)^{1/2} = O(\|f\|_p)$. If all these events occur together (with sufficiently large constants), count-max will correctly determine the coordinate $i^* = \arg\max_i\{|z_i|\}$. However, just as in [JST11], we cannot output an index anyway if these conditions do not hold, so we will need to run a statistical test to ensure that they do.
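The property driving this choice of scaling variables can be checked directly in a few lines: scaling by inverse exponentials makes the argmax land on index $i$ with probability exactly $|f_i|^p/\|f\|_p^p$ (a consequence of min-stability, made precise in Section 3). The following self-contained simulation (ours, for intuition only) demonstrates this.

```python
import random

def sample_index(f, p):
    """Return argmax_i |f_i| / t_i^(1/p) for i.i.d. t_i ~ Exponential(1).
    Equivalently: t_i / |f_i|^p is exponential with rate |f_i|^p, and the
    minimum of independent exponentials lands on index i with probability
    proportional to its rate."""
    keys = [random.expovariate(1.0) / abs(fi) ** p if fi != 0 else float("inf")
            for fi in f]
    return min(range(len(f)), key=keys.__getitem__)

# Empirical check of the sampling distribution (illustration only):
f, p, trials = [3.0, -1.0, 2.0], 1.0, 100_000
counts = [0, 0, 0]
for _ in range(trials):
    counts[sample_index(f, p)] += 1
# counts[i] / trials approaches |f_i|^p / ||f||_p^p = (1/2, 1/6, 1/3).
```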
The Statistical Test. To implement Step 3 of the template, our algorithm simply tests whether count-max declares any coordinate $i \in [n]$ to be the maximizer, and we output FAIL if it does not. This approach guarantees that we correctly output the maximizer conditioned on not failing. The primary technical challenge will be to show that, conditioned on $i = \arg\max_{i'}\{|z_{i'}|\}$ for some $i$, the probability of failing the statistical test does not depend on $i$. In other words, conditioning on $|z_i|$ being the maximum does not change the failure probability. Let $z_{D(k)}$ be the $k$-th order statistic of $z$ (i.e., $|z_{D(1)}| \geq |z_{D(2)}| \geq \cdots \geq |z_{D(n)}|$). Here the $D(k)$'s are known as anti-ranks. To analyze the conditional dependence, we must first obtain a closed form for $z_{D(k)}$ which separates the dependencies on $k$ and $D(k)$. Hypothetically, if $z_{D(k)}$ depended only on $k$, then our statistical test would be completely independent of $D(1)$, in which case we could safely fail whenever such an event occurred. Of course, in reality this is not the case. Consider the vector $f = (100n, 1, 1, \ldots, 1) \in \mathbb{R}^n$ and $p = 1$. Clearly we expect $z_1$ to be the maximizer, and moreover we expect a gap of $\Theta(n)$ between $z_1$ and $z_{D(2)}$. On the other hand, if you were told that $D(1) \neq 1$, it is tempting to think that $z_{D(1)}$ just barely beat out $z_1$ for its spot as the max, and so $z_1$ would not be far behind. Indeed, this intuition would be correct, and one can show that the probability that $z_{D(1)} - z_{D(2)} > n$ conditioned on $D(1) = i$ changes by an additive constant depending on whether or not $i = 1$. Conditioned on this gap being smaller or larger, we are more or less likely (respectively) to output FAIL. In this setting, the probability of conditional failure can change by an $\Omega(1)$ factor depending on the value of $D(1)$.

To handle scenarios of this form, our algorithm will utilize an additional linear transformation in Step 1 of the template. Instead of only scaling by the random coefficients $1/t_i^{1/p}$, our algorithm first duplicates the coordinates $f_i$ to remove all heavy items from the stream. If $f$ is the vector from the example above and $F$ is the duplicated vector, then after $\mathrm{poly}(n)$ duplications all copies of the heavy item $f_1$ will have weight at most $|f_1|/\|F\|_1 < 1/\mathrm{poly}(n)$. By uniformizing the relative weight of the coordinates, this washes out the dependency of $|z_{D(2)}|$ on $D(1)$, since $\|F_{-D(1)}\|_p^p = (1 \pm n^{-\Omega(c)})\|F_{-j}\|_p^p$ after $n^c$ duplications, for any $j \in [n^c]$. Notice that this transformation blows up the dimension of $f$ by a $\mathrm{poly}(n)$ factor. However, since our space usage is always $\mathrm{poly}(\log n)$, the result is only a constant factor increase in the complexity.

After duplication, we scale $F$ by the coefficients $1/t_i^{1/p}$, and the rest of the algorithm proceeds as described above. Using expressions for the order statistics $z_{D(k)}$ which separate the dependence into the anti-ranks $D(j)$ and a set of exponentials $E_1, E_2, \ldots, E_n$ independent of the anti-ranks, after duplication we can derive tight concentration of the $z_{D(k)}$'s conditioned on fixed values of the $E_i$'s. Using this concentration result, we decompose our count-max data structure $A$ into two component variables: one independent of the anti-ranks (the independent component), and a small adversarial noise of relative weight $n^{-c}$. In order to bound the effect of the adversarial noise on the outcome of our tests, we must 1) randomize the threshold of our failure condition, and 2) demonstrate the anti-concentration of the resulting distribution over the independent components of $A$. This will demonstrate that with high probability, the result of the statistical test is completely determined by the value of the independent component, which allows us to fail without affecting the conditional probability of outputting $i \in [n]$.

Derandomization. Now the correctness of our sampler crucially relies on the full independence of the $t_i$'s to show that the variable $D(1)$ is drawn from precisely the correct distribution (namely, the $L_p$ distribution $|f_i|^p/\|f\|_p^p$). This being the case, we cannot directly implement our algorithm using any method of limited independence. In order to derandomize the algorithm from requiring full independence, we will use a combination of Nisan's pseudorandom generator [Nis92], as well as an extension of the recent PRG of [GKM15] which fools certain classes of Fourier transforms. We first use a closer analysis of the seed length Nisan's generator requires to fool the randomness needed for the count-max data structure, which avoids the standard $O(\log n)$-space blow-up that would be incurred by using Nisan's PRG as a black box. Once count-max has been derandomized, we demonstrate how the PRG of [GKM15] can be used to fool arbitrary functions of $d$ half-spaces, so long as these half-spaces have bounded bit complexity. We use this result to derandomize the exponential variables $t_i$ with a seed of length $O(\log^2 n (\log\log n)^2)$, which will allow for the total derandomization of our algorithm for $\delta = \Theta(1)$ and $p < 2$. Our techniques in fact show that any streaming algorithm which stores a linear sketch $A \cdot f$, where the entries of $A$ are independent and can be sampled from with $O(\log n)$ bits, can be derandomized with only an $O((\log\log n)^2)$-factor increase in the space requirements (see Theorem 5). This improves the $O(\log n)$ blow-up incurred from black-box usage of Nisan's PRG. As an application, we derandomize the count-sketch variant of Minton and Price [MP14] to use $O(\epsilon^{-2}\log^2 n (\log\log n)^2)$ bits of space, which gives improved concentration results for count-sketch when the hash functions are fully independent. The problem of improving the derandomization of [MP14] beyond the black-box application of Nisan's PRG was an open problem. We remark that using $O(\epsilon^{-2}\log^2 n)$ bits of space in the classic count-sketch of [CCFC02b] has strictly better error guarantees than those obtained from derandomizing [MP14] with Nisan's PRG to run in the same space. Our derandomization, in contrast, demonstrates a strong improvement on this, obtaining the same bounds with an $O((\log\log n)^2)$ instead of an $O(\log n)$ factor blow-up.
Case of p = 2. Recall that for $p < 2$, we could show that the $L_2$ norm of the level sets $I_k$ decays geometrically with $k$. More precisely, for any $\gamma$ we have $\|z_{\mathrm{tail}(\gamma)}\|_2 = O(\|F\|_p\,\gamma^{-1/p+1/2})$ with probability $1 - O(e^{-\gamma})$. Using this, we actually do not need the tight concentration of the $z_{D(k)}$'s, since we can show that the top $n^{c/2}$ coordinates change by at most $(1 \pm n^{-\Omega(c)})$ depending on $D(1)$, and the $L_2$ norm of the remaining coordinates is only an $n^{-\Omega(c)}$ fraction of the whole $L_2$, and can thus be absorbed into the adversarial noise. For $p = 2$, however, each level set $I_k$ contributes weight $O(\|F\|_p^2)$ to $\|z\|_2^2$, so $\|z_{\mathrm{tail}(\gamma)}\|_2 = O(\sqrt{\log n}\,\|F\|_p)$ even for $\gamma = \mathrm{poly}(n)$. Therefore, for $p = 2$ it is essential that we show concentration of the $z_{D(k)}$'s for nearly all $k$. Since $\|z\|_2$ will now be larger than $\|F\|_2$ by a $\sqrt{\log n}$ factor with high probability, count-max will only succeed in outputting the largest coordinate when it is an $O(\sqrt{\log n})$ factor larger than expected. This event occurs with probability $1/\log n$, so we will need to run the algorithm $\log n$ times in parallel to get constant success probability, for a total of $O(\log^3 n)$ bits of space. Using the same $O(\log^2 n)$-bit Nisan PRG seed for all $O(\log n)$ repetitions, we show that the entire algorithm for $p = 2$ can be derandomized to run in $O(\log^3 n \log(1/\delta))$ bits of space.
Optimizing the Runtime. In addition to our core sampling algorithm, we show how the linear transformation step to construct $z$ can be implemented via a parameterized rounding scheme to improve the update time of the algorithm without affecting the space complexity, giving a runtime/relative sampling error trade-off. By rounding the scaling variables $1/t_i^{1/p}$ to powers of $(1+\nu)$, we discretize their support to have size $O(\nu^{-1}\log n)$. We then simulate the update procedure by sampling from the distribution over updates to our count-max data structure $A$ that results from duplicating an update and hashing each duplicate independently into $A$. Our simulation utilizes results on the efficient generation of binomial random variables, through which we can iteratively reconstruct the updates to $A$ bin-by-bin instead of duplicate-by-duplicate. In addition, by using an auxiliary heavy-hitter data structure, we can improve our query time from the naïve $O(n)$ to $O(\mathrm{poly}(\log n))$ without increasing the space complexity.
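To illustrate the bin-by-bin simulation, the following sketch distributes the duplicates of a single update across the buckets of one count-max row by sampling binomial counts, rather than hashing duplicate-by-duplicate. This is our own simplified rendering of the idea: it omits the Gaussian coefficients and the rounding of scaling factors, and `random.binomialvariate` (Python 3.12+) stands in for the efficient binomial generators referenced above.

```python
import random

def hash_duplicates_binomially(num_duplicates, num_buckets):
    """Distribute num_duplicates copies of one update over num_buckets
    buckets of a count-max row. Bucket by bucket, the count is binomial
    conditioned on what remains, so the whole row is reconstructed in
    num_buckets draws instead of num_duplicates hashes."""
    counts, remaining = [], num_duplicates
    for b in range(num_buckets - 1):
        # Conditioned on the remaining duplicates, bucket b receives
        # Binomial(remaining, 1 / (num_buckets - b)) of them.
        c = random.binomialvariate(remaining, 1.0 / (num_buckets - b))
        counts.append(c)
        remaining -= c
    counts.append(remaining)
    return counts
```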
Estimating the Frequency. We show that, allowing an additional additive $O(\min\{\epsilon^{-2}, \epsilon^{-p}\log(\frac{1}{\delta})\}\log n \log \delta^{-1})$ bits of space, we can provide an estimate $\tilde{f_i} = (1 \pm \epsilon)f_i$ of the outputted frequency $f_i$ with probability $1 - \delta$ when $p < 2$. To achieve this, we use our more general analysis of the contribution of the level sets $I_k$ to $\|z\|_2$, and give concentration bounds on the tail error when the top $\epsilon^{-p}$ items are removed. When $p = 2$, for similar reasons as described in the sampling algorithm, we require another $O(\log n)$ factor in the space complexity to obtain a $(1 \pm \epsilon)$ estimate. Finally, we demonstrate an $\Omega(\epsilon^{-p}\log n \log \delta^{-1})$ lower bound for this problem, which is nearly tight when $p < 2$. To do so, we adapt a communication problem introduced in [JW13], known as Augmented-Indexing on Large Domains. We weaken the problem so that it need only succeed with constant probability, and then show that the same lower bound still holds. Using a reduction from this problem, we show that our lower bound for $L_p$ samplers holds even if the output index is drawn from a distribution with constant additive error from the true $L_p$ distribution $|f_i|^p/\|f\|_p^p$.

Notation. For $a, b, \epsilon \in \mathbb{R}$, we write $a = b \pm \epsilon$ to denote the containment $a \in [b - \epsilon, b + \epsilon]$. For a positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For any vector $v \in \mathbb{R}^n$, we write $v_{(k)}$ to denote the $k$-th largest coordinate of $v$ in absolute value. In other words, $|v_{(1)}| \geq |v_{(2)}| \geq \cdots \geq |v_{(n)}|$. For any $\gamma \in [n]$, we define $v_{\mathrm{tail}(\gamma)}$ to be $v$ but with the top $\gamma$ coordinates (in absolute value) set equal to $0$. For any $i \in [n]$, we define $v_{-i}$ to be $v$ with the $i$-th coordinate set to $0$. We write $|v|$ to denote the entry-wise absolute value of $v$, so $|v|_j = |v_j|$ for all $j \in [n]$. All space bounds stated will be in bits. For our runtime complexity, we assume the unit-cost RAM model, where a word of $O(\log n)$ bits can be operated on in constant time, where $n$ is the dimension of the input streaming vector. Finally, we will use $\tilde{O}$ notation to hide $\mathrm{poly}(\log n)$ factors; in other words, $O(\log^c n) = \tilde{O}(1)$ for any constant $c$.

Formally, a data stream is given by an underlying vector $f \in \mathbb{R}^n$, called the frequency vector, which is initialized to $0^n$. The frequency vector then receives a stream of $m$ updates of the form $(i_t, \Delta_t) \in [n] \times \{-M, \ldots, M\}$ for some $M > 0$ and $t \in [m]$. The update $(i_t, \Delta_t)$ causes the change $f_{i_t} \leftarrow f_{i_t} + \Delta_t$. For simplicity, we make the common assumption ([BCIW16]) that $\log(mM) = O(\log n)$, though our results generalize naturally to arbitrary $n, m$. In this paper, we will need Khintchine's and McDiarmid's inequalities.

Fact 1 (Khintchine's inequality [Haa81]). Let $x \in \mathbb{R}^n$ and $Q = \sum_{i=1}^n \varphi_i x_i$ for i.i.d. random variables $\varphi_i$ uniform on $\{1, -1\}$. Then $\Pr[|Q| > t\|x\|_2] < 2e^{-t^2/2}$.

Fact 2 (McDiarmid's inequality [McD89]). Let $X_1, X_2, \ldots, X_n$ be independent random variables, and let $\psi(x_1, \ldots, x_n)$ be any function that satisfies
$$\sup_{x_1, \ldots, x_n, \hat{x}_i}\big|\psi(x_1, \ldots, x_i, \ldots, x_n) - \psi(x_1, \ldots, x_{i-1}, \hat{x}_i, x_{i+1}, \ldots, x_n)\big| \leq c_i \quad \text{for } 1 \leq i \leq n.$$
Then for any $\epsilon > 0$, we have
$$\Pr\Big[\big|\psi(X_1, \ldots, X_n) - \mathbb{E}[\psi(X_1, \ldots, X_n)]\big| \geq \epsilon\Big] \leq 2\exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\Big).$$
Definition 2. A distribution $\mathcal{D}_p$ is said to be $p$-stable if, whenever $X_1, \ldots, X_n \sim \mathcal{D}_p$ are drawn independently, we have
$$\sum_{i=1}^n a_i X_i = \|a\|_p X$$
for any fixed vector $a \in \mathbb{R}^n$, where $X \sim \mathcal{D}_p$ is again distributed as a $p$-stable (equality here is in distribution). In particular, the Gaussian random variables $N(0,1)$ are $p$-stable for $p = 2$ (i.e., $\sum_i a_i g_i = \|a\|_2 g$, where $g, g_1, \ldots, g_n$ are i.i.d. Gaussian).

Our sampling algorithm will utilize a modification of the well-known data structure known as count-sketch (see [CCFC02b] for further details). We now introduce the description of count-sketch which we will use for the remainder of the paper. The count-sketch data structure is a table $A$ with $d$ rows and $k$ columns. When run on a stream $f \in \mathbb{R}^n$, for each row $i \in [d]$, count-sketch picks a uniform random mapping $h_i : [n] \to [k]$ and $g_i : [n] \to \{1, -1\}$. Generally, $h_i$ and $g_i$ need only be 4-wise independent hash functions, but in this paper we will use fully independent hash functions (and later relax this condition when derandomizing). Whenever an update $\Delta$ to item $v \in [n]$ occurs, count-sketch performs the following updates:
$$A_{i,h_i(v)} \leftarrow A_{i,h_i(v)} + \Delta\, g_i(v) \quad \text{for } i = 1, 2, \ldots, d.$$
Note that while we will not implement the $h_i$'s as explicit hash functions, and will instead generate i.i.d. random variables $h_i(1), \ldots, h_i(n)$, we will still use the terminology of hash functions. In other words, by hashing the update $(v, \Delta)$ into the row $A_i$ of count-sketch, we mean that we are updating $A_{i,h_i(v)}$ by $\Delta g_i(v)$. By hashing the coordinate $f_v$ into $A$, we mean updating $A_{i,h_i(v)}$ by $g_i(v)f_v$ for each $i = 1, 2, \ldots, d$. Using this terminology, each row of count-sketch corresponds to randomly hashing the indices in $[n]$ into $k$ buckets, and each bucket in the row is then a sum of the frequencies $f_i$ of the items which hashed to it, multiplied by random $\pm 1$ signs. The classic count-sketch query procedure produces an estimate $y \in \mathbb{R}^n$ such that $\|y - f\|_\infty$ is small. Here the estimate $y$ is given by $y_j = \mathrm{median}_{i \in [d]}\, A_{i,h_i(j)}(g_i(j))^{-1}$ for all $j \in [n]$. This vector $y$ satisfies the following guarantee.

Theorem 1. If $d = \Theta(\log(1/\delta))$ and $k = 6/\epsilon^2$, then for a fixed $i \in [n]$ we have $|y_i - f_i| < \epsilon\|f_{\mathrm{tail}(1/\epsilon^2)}\|_2$ with probability $1 - \delta$. Moreover, if $d = \Theta(\log n)$ and $c \geq 1$ is any constant, then we have $\|y - f\|_\infty < \epsilon\|f_{\mathrm{tail}(1/\epsilon^2)}\|_2$ with probability $1 - n^{-c}$. Furthermore, if we instead set $y_j = \mathrm{median}_{i \in [d]}|A_{i,h_i(j)}|$, then the same two bounds above hold replacing $f$ with $|f|$.

In this work, however, we are only interested in determining the index of the heaviest item in $f$, that is $i^* = \arg\max_i |f_i|$. So we utilize a simpler estimation algorithm, based on the count-sketch data structure, that tests whether a fixed $j \in [n]$ satisfies $j = \arg\max_i |f_i|$. For analysis purposes, instead of having the $g_i$'s be random signs, we draw the $g_i(v) \sim N(0,1)$ as i.i.d. Gaussian variables. Then for a fixed $j$, set
$$\alpha_j = \big|\{i \in [d] \mid |A_{i,h_i(j)}| = \max_{r \in [k]}|A_{i,r}|\}\big|,$$
and we declare $j = i^*$ to be the maximizer if $\alpha_j > \frac{4}{5}d$. The algorithm computes $\alpha_j$ for all $j \in [n]$, and outputs the first index $j$ that satisfies $\alpha_j > \frac{4}{5}d$ (there will only be one with high probability). To distinguish this modified querying protocol from the classic count-sketch, we refer to this algorithm as count-max. To refer to the data structure $A$ itself, we will use the terms count-sketch and count-max interchangeably.

We will prove our result for the guarantee of count-max in the presence of the following generalization. Before computing the values of $\alpha_j$ and reporting a maximizer as above, we will scale each bucket $A_{i,j}$ of count-max by a uniform random variable $\mu_{i,j} \sim \mathrm{Uniform}[\frac{1}{2}, 1]$. This generalization will be used for technical reasons in our analysis of Lemma 3. Namely, we will need it to ensure that the failure threshold of our algorithm is randomized, which will allow us to handle small adversarial error.
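For concreteness, the count-max decision rule can be sketched as follows. This is our own illustrative code: it assumes the table A and the hash assignments h are stored explicitly, and that any bucket scaling by $\mu_{i,j}$ has already been applied to A.

```python
def countmax_declares(A, h, j):
    """Count-max test for a candidate index j: count the rows in which
    j's bucket has the largest magnitude in its row (alpha_j), and
    declare j the maximizer iff alpha_j exceeds a 4/5 fraction of the
    d rows. A is the d x k table; h[i][j] is j's bucket in row i."""
    d = len(A)
    alpha_j = sum(1 for i in range(d)
                  if abs(A[i][h[i][j]]) == max(abs(x) for x in A[i]))
    return alpha_j > (4 / 5) * d
```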
Lemma 1. Let $c \geq 1$ be an arbitrarily large constant, set $d = \Theta(\log n)$ and $k = 2$, and let $A$ be a $d \times k$ instance of count-max run on $f \in \mathbb{R}^n$ using fully independent hash functions $h_i$ and Gaussian random variables $g_i \sim N(0,1)$. Then with probability $1 - n^{-c}$ the following holds: for every $i \in [n]$, if $|f_i| > 20\|f_{-i}\|_2$ then count-max declares $i$ to be the maximum, and if $|f_i| \leq \max_{j \in [n]\setminus\{i\}}|f_j|$, then count-max does not declare $i$ to be the maximum. Thus if count-max declares $|f_i|$ to be the largest coordinate of $f$, it will be correct with high probability. Moreover, this result still holds if each bucket $A_{i,j}$ is scaled by a $\mu_{i,j} \sim \mathrm{Uniform}[\frac{1}{2}, 1]$ before reporting.

Proof. First suppose $|f_i| > 20\|f_{-i}\|_2$, and consider a fixed row $j$ of $A$. WLOG $i$ hashes to $A_{j,1}$, thus $A_{j,1} = \mu_{j,1}\big(g_j(i)f_i + \sum_{t \neq i:\, h_j(t)=1} g_j(t)f_t\big)$ and $A_{j,2} = \mu_{j,2}\big(\sum_{t:\, h_j(t)=2} g_j(t)f_t\big)$. By 2-stability (Definition 2), the probability that $|A_{j,2}| > |A_{j,1}|$ is at most the probability that one $N(0,1)$ Gaussian is $19$ times larger in magnitude than another, which can be bounded by $15/100$ by direct computation. Thus $i$ hashes into the max bucket in a row of $A$ with probability at least $85/100$. Since $d = \Omega(c\log n)$, with probability $1 - n^{-2c}$ we have that $f_i$ is in the largest bucket in at least a $4/5$ fraction of the rows. Now suppose that $i$ is not a unique max, and let $i^*$ be such that $|f_{i^*}|$ is maximal. Then conditioned on $i, i^*$ not hashing to the same bucket, the probability that $f_i$ hashes to a larger bucket than $f_{i^*}$ is at most $1/2$. To see this, note that conditioned on this, one bucket is distributed as $g_j(i^*)f_{i^*} + G$ and the other as $g_j(i)f_i + G'$, where $G, G'$ are identically distributed random variables. Thus the probability that $f_i$ is in the maximal bucket is at most $3/4$, and so by Chernoff bounds $f_i$ will hash to strictly less than $(4d/5)$ of the maximal buckets with probability $1 - n^{-2c}$. Union bounding over all $i \in [n]$ gives the desired result.

Corollary 1.
In the setting of Lemma 1, with probability $1 - O(n^{-c})$, count-max will never report an index $i \in [n]$ as being the maximum if $|f_i| < \frac{1}{20}\|f\|_2$.

Proof. Suppose $|f_i| < \frac{1}{20}\|f\|_2$, and in a given row $j$, WLOG $i$ hashes to $A_{j,1}$. Then we have $A_{j,1} = g_j(i)f_i + g_1\|f^1\|_2$ and $A_{j,2} = g_2\|f^2\|_2$, where $f^k$ is $f$ restricted to the coordinates (other than $i$) that hash to bucket $k$, and $g_1, g_2 \sim N(0,1)$. Since $f^1, f^2$ are identically distributed, with probability $1/2$ we have $\|f^2\|_2 > \|f^1\|_2$. Conditioned on this, we have $\|f^2\|_2 \geq \|f_{-i}\|_2/\sqrt{2} > 10|f_i|$. So conditioned on $\|f^2\|_2 > \|f^1\|_2$, we have $|A_{j,1}| < |A_{j,2}|$ whenever one Gaussian is $(71/70)$ times larger than another in magnitude, which occurs with probability greater than $1/2 - 1/25$. So $i$ hashes into the max bucket with probability at most $79/100$, and so by Chernoff bounds, taking $c$ sufficiently large and union bounding over all $i \in [n]$, $i$ will hash into the max bucket in at most a $795/1000 < 4/5$ fraction of the rows with probability $1 - O(n^{-c})$, as needed.

Exponential Order Statistics

In this section, we discuss several useful properties of the order statistics of $n$ independent, non-identically distributed exponential random variables. Let $(t_1, \ldots, t_n)$ be independent exponential random variables, where $t_i$ has mean $1/\lambda_i$ (equivalently, $t_i$ has rate $\lambda_i$). Recall that $t_i$ is given by the cumulative distribution function $\Pr[t_i < x] = 1 - e^{-\lambda_i x}$. Our main $L_p$ sampling algorithm will require a careful analysis of the distribution of the values $(t_1, \ldots, t_n)$, which we will now describe. We begin by noting that constant factor scalings of an exponential variable result in another exponential variable.

Fact 3 (Scaling of exponentials). Let $t$ be exponentially distributed with rate $\lambda$, and let $\alpha > 0$. Then $\alpha t$ is exponentially distributed with rate $\lambda/\alpha$.

Proof.
The cdf of $\alpha t$ is given by $\Pr[\alpha t < x] = \Pr[t < x/\alpha] = 1 - e^{-\lambda x/\alpha}$, which is the cdf of an exponential with rate $\lambda/\alpha$.

We would now like to study the order statistics of the variables $(t_1, \ldots, t_n)$, where $t_i$ has rate $\lambda_i$. To do so, we introduce the anti-rank vector $(D(1), D(2), \ldots, D(n))$, where for $k \in [n]$, $D(k) \in [n]$ is a random variable which gives the index of the $k$-th smallest exponential.

Definition 3. Let $(t_1, \ldots, t_n)$ be independent exponentials. For $k = 1, 2, \ldots, n$, we define the $k$-th anti-rank $D(k) \in [n]$ of $(t_1, \ldots, t_n)$ to be the index $D(k)$ such that $t_{D(1)} \leq t_{D(2)} \leq \cdots \leq t_{D(n)}$.

Using the structure of the anti-rank vector, it has been observed [Nag06] that there is a simple form for describing the distribution of $t_{D(k)}$ as a function of $(\lambda_1, \ldots, \lambda_n)$ and the anti-rank vector.

Fact 4 ([Nag06]). Let $(t_1, \ldots, t_n)$ be independently distributed exponentials, where $t_i$ has rate $\lambda_i > 0$. Then for any $k = 1, 2, \ldots, n$, we have
$$t_{D(k)} = \sum_{i=1}^{k}\frac{E_i}{\sum_{j=i}^{n}\lambda_{D(j)}},$$
where the $E_1, E_2, \ldots, E_n$ are i.i.d. exponential variables with mean $1$, and are independent of the anti-rank vector $(D(1), D(2), \ldots, D(n))$.

Fact 5 ([Nag06]). For any $i = 1, 2, \ldots, n$, we have
$$\Pr[D(1) = i] = \frac{\lambda_i}{\sum_{j=1}^n \lambda_j}.$$

We now describe how these properties will be useful to our sampler. Let $f \in \mathbb{R}^n$ be any vector presented in a general turnstile stream. We can generate i.i.d. exponentials $(t_1, \ldots, t_n)$, each with rate $1$, and construct the random variable $z_i = f_i/t_i^{1/p}$, which can be obtained in a stream by scaling updates to $f_i$ by $1/t_i^{1/p}$ as they arrive. By Fact 3, the variable $|z_i|^{-p} = t_i/|f_i|^p$ is exponentially distributed with rate $\lambda_i = |f_i|^p$. Now let $(D(1), \ldots, D(n))$ be the anti-rank vector of the exponentials $(t_1/|f_1|^p, \ldots, t_n/|f_n|^p)$. By Fact 5, we have
$$\Pr[D(1) = i] = \Pr[i = \arg\min_j\{|z_j|^{-p}\}] = \Pr[i = \arg\max_j\{|z_j|\}] = \frac{\lambda_i}{\sum_j \lambda_j} = \frac{|f_i|^p}{\|f\|_p^p}.$$
In other words, the probability that $|z_i| = \arg\max_j\{|z_j|\}$ is precisely $|f_i|^p/\|f\|_p^p$, so for a perfect $L_p$ sampler it suffices to return an $i \in [n]$ with $|z_i|$ maximum. Now note $|z_{D(1)}| \geq |z_{D(2)}| \geq \cdots \geq |z_{D(n)}|$, and in this scenario the statement of Fact 4 becomes
$$z_{D(k)} = \Big(\sum_{i=1}^{k}\frac{E_i}{\sum_{j=i}^{n}\lambda_{D(j)}}\Big)^{-1/p} = \Big(\sum_{i=1}^{k}\frac{E_i}{\sum_{j=i}^{n}|f_{D(j)}|^p}\Big)^{-1/p},$$
where the $E_i$'s are i.i.d. exponential random variables with mean $1$, independent of the anti-rank vector $(D(1), \ldots, D(n))$. We call the exponentials $E_i$ the hidden exponentials, as they do not appear in the actual execution of the algorithm, and will be needed for analysis purposes only.
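Facts 4 and 5 translate directly into a sequential simulation of the order statistics: by memorylessness, the gap between consecutive order statistics is an independent mean-one exponential divided by the total surviving rate, and the next anti-rank is chosen with probability proportional to the surviving rates. The sketch below (ours, for intuition only) generates the pairs $(D(k), t_{D(k)})$ jointly in this way.

```python
import random

def order_statistics(lams):
    """Jointly generate the anti-ranks D and the sorted values
    t_{D(1)} <= ... <= t_{D(n)} for independent exponentials t_i with
    rates lams[i], following Fact 4: the k-th gap is E_k divided by the
    total surviving rate, and (Fact 5, applied to the survivors) the
    next anti-rank is proportional to the surviving rates."""
    survivors = list(range(len(lams)))
    total = sum(lams)
    t, D, ts = 0.0, [], []
    while survivors:
        t += random.expovariate(1.0) / total   # E_k / sum_{j>=k} lam_{D(j)}
        i = random.choices(survivors,
                           weights=[lams[s] for s in survivors])[0]
        D.append(i)
        ts.append(t)
        survivors.remove(i)
        total -= lams[i]
    return D, ts
```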
The Sampling Algorithm

We now provide intuition for the workings of our main sampling algorithm. Our algorithm scales the input stream by inverse exponentials to obtain a new vector $z$. We have seen in the prior section that we can write the order statistics $z_{D(k)}$ as a function of the anti-rank vector $D$, where $D(k)$ gives the index of the $k$-th largest coordinate in $z$, and the hidden exponentials $E_i$, which describe the "scale" of the order statistics. Importantly, the hidden exponentials are independent of the anti-ranks. We would like to determine the index $i$ for which $D(1) = i$; however, this may not always be possible. This is the case when the largest element $|z_{D(1)}|$ is not sufficiently larger than the remaining $L_2$ mass $(\sum_{j>1}|z_{D(j)}|^2)^{1/2}$. In such a case, count-max will not declare any index to be the largest, and we would therefore like to output FAIL. Note that this event is more likely when there is another element $|z_{D(2)}|$ which is very close to $|z_{D(1)}|$ in size, as whenever the two elements do not collide in count-max, it is less likely that $|z_{D(1)}|$ will be in the max bucket.

Now consider the trivial situation where $f_1 = f_2 = \cdots = f_n$. Here the variables $z_{D(k)}$ have no dependence at all on the anti-rank vector $D$. In this case, the condition of failing is independent of $D(1)$, so we can safely fail whenever we cannot determine the maximum index. On the other hand, if the values $|f_i|$ vary wildly, the variables $z_{D(k)}$ will depend highly on the anti-ranks. In fact, if there exists an $f_i$ with $|f_i|^p \geq \epsilon\|f\|_p^p$, then the probability that $|z_{D(1)}| - |z_{D(2)}|$ is above a certain threshold can change by a $(1 \pm \epsilon)$ factor conditioned on $D(1) = i$, as opposed to $D(1) = j$ for a smaller $|f_j|$. Given this, the probability that we fail can change by a multiplicative $(1 \pm \epsilon)$ conditioned on $D(1) = i$ as opposed to $D(1) = j$. In this case, we cannot output FAIL when count-max does not report a maximizer, lest we suffer a $(1 \pm \epsilon)$ error in outputting an index with the correct probability.

To handle this, we must remove the heavy items from the stream to weaken the dependence of the values $z_{D(k)}$ on the anti-ranks, which we carry out by duplication of coordinates. For the purposes of efficiency, we carry out the duplication via a rounding scheme which will allow us to generate and quickly hash updates into our data structures (Section 5). We will show that, conditioned on fixed values of the $E_i$'s, the variables $z_{D(k)}$ are highly concentrated, and therefore nearly independent of the anti-ranks ($z_{D(k)}$ depends only on $k$ and not $D(k)$). By randomizing the failure threshold to be anti-concentrated, the small adversarial dependence of $z_{D(k)}$ on $D(k)$ cannot non-trivially affect the conditional probabilities of failure, leading to small relative error in the resulting output distribution.
The $L_p$ Sampler. We now describe our sampling algorithm, as shown in Figure 3. Let $f \in \mathbb{R}^n$ be the input vector of the stream. As the stream arrives, we duplicate updates to each coordinate $f_i$ a total of $n^{c-1}$ times to obtain a new vector $F \in \mathbb{R}^{n^c}$. More precisely, for $i \in [n]$ we set $i_j = (i-1)n^{c-1} + j$ for $j = 1, 2, \ldots, n^{c-1}$, and then we will have $F_{i_j} = f_i$ for all $i \in [n]$ and $j \in [n^{c-1}]$. We then call $F_{i_j}$ a duplicate of $f_i$. Whenever we use $i_j$ as a subscript in this way it will refer to a duplicate of $i$, whereas a single subscript $i$ will be used both to index into $[n]$ and $[n^c]$. Note that this duplication has the effect that $|F_i|^p \leq n^{-c+1}\|F\|_p^p$ for all $p > 0$ and $i \in [n^c]$.

We then generate i.i.d. exponential rate $1$ random variables $(t_1, \ldots, t_{n^c})$, and define the vector $z \in \mathbb{R}^{n^c}$ by $z_i = F_i/t_i^{1/p}$. As shown in Section 3, we have $\Pr[i_j = \arg\max_{i',j'}\{|z_{i'_{j'}}|\}] = |F_{i_j}|^p/\|F\|_p^p$. Since $\sum_{j \in [n^{c-1}]}|F_{i_j}|^p/\|F\|_p^p = |f_i|^p/\|f\|_p^p$, it will therefore suffice to find the $i_j \in [n^c]$ for which $i_j = \arg\max_{i',j'}\{|z_{i'_{j'}}|\}$, and return the index $i \in [n]$. The assumption that the $t_i$'s are i.i.d. will later be relaxed in Section 5 while derandomizing the algorithm. In Section 5, we also demonstrate that all relevant continuous distributions can be made discrete without affecting the perfect sampling guarantee.
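As a small illustration of the duplication step (Step 2 of Figure 3 below), the following is our own sketch of the index map; in the actual algorithm these duplicates are never materialized explicitly, but are handled via the rounding and simulation techniques of Section 5.

```python
def duplicate_update(i, delta, n, c):
    """Expand one update (i, delta) to f, with i in {1, ..., n}, into the
    n^(c-1) updates to F that it induces, using the indexing
    i_j = (i - 1) * n^(c-1) + j for j = 1, ..., n^(c-1)."""
    base = (i - 1) * n ** (c - 1)
    return [(base + j, delta) for j in range(1, n ** (c - 1) + 1)]
```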
$L_p$ Sampler

1. Set $d = \Theta(\log n)$, instantiate a $d \times 2$ count-max data structure $A$, and draw $\mu_{i,j} \sim \mathrm{Uniform}[\frac{1}{2},1]$ for each $(i,j) \in [d] \times [2]$.
2. Duplicate updates to $f$ to obtain the vector $F \in \mathbb{R}^{n^c}$, so that $f_i = F_{i_j}$ for all $i \in [n]$ and $j = 1, 2, \ldots, n^{c-1}$, for some fixed constant $c$.
3. Choose i.i.d. exponential random variables $t = (t_1, t_2, \ldots, t_{n^c})$, and construct the stream $\zeta_i = F_i \cdot \mathrm{rnd}_\nu(1/t_i^{1/p})$.
4. Run $A$ on the stream $\zeta$. Upon the end of the stream, set $A_{i,j} \leftarrow \mu_{i,j}A_{i,j}$ for all $(i,j) \in [d] \times [2]$.
5. If count-max declares that an index $i_j \in [n^c]$ is the max for some $j \in [n^{c-1}]$ based on the data structure $A$, then output $i \in [n]$. If $A$ does not declare any index to be the max, output FAIL.

Figure 3: Our main $L_p$ sampling algorithm.

Now fix any sufficiently large constant $c$, and fix $\nu > n^{-c}$. To speed up the update time, instead of explicitly scaling $F_i$ by $1/t_i^{1/p}$ to construct the stream $z$, our algorithm instead scales $F_i$ by $\mathrm{rnd}_\nu(1/t_i^{1/p})$, where $\mathrm{rnd}_\nu(x)$ rounds $x > 0$ down to the nearest value in $\{\ldots, (1+\nu)^{-1}, 1, (1+\nu), (1+\nu)^2, \ldots\}$. In other words, $\mathrm{rnd}_\nu(x)$ rounds $x$ down to the nearest power of $(1+\nu)^j$ (for $j \in \mathbb{Z}$). This results in a separate stream $\zeta \in \mathbb{R}^{n^c}$ where $\zeta_i = F_i \cdot \mathrm{rnd}_\nu(1/t_i^{1/p})$. Note $\zeta_i = (1 \pm O(\nu))z_i$ for all $i \in [n^c]$. Importantly, note that this rounding is order preserving. Thus, if $\zeta$ has a unique largest coordinate $|\zeta_{i^*}|$, then $|z_{i^*}|$ will be the unique largest coordinate of $z$.

Having constructed the transformed stream $\zeta$, we then run a $d \times 2$ instance $A \in \mathbb{R}^{d \times 2}$ of count-max (from Section 2.1), with $d = \Theta(\log n)$, on $\zeta$. At the end of the stream, we scale each bucket $A_{i,j}$ by a uniform random variable $\mu_{i,j}$ from the interval $[\frac{1}{2}, 1]$. This step ensures that the failure threshold is randomized, so that a small adversarial error can only affect the output of the algorithm with extremely low probability (see Lemma 3). Now recall that count-max will either declare an index $i_j \in [n^c]$ as being the maximum, or report nothing. If an index $i_j$ is returned, where $i_j$ is the $j$-th copy of index $i \in [n]$, then our algorithm outputs the index $i$. If count-max does not report an index, we return FAIL. Let $i^* = \arg\max_i|\zeta_i| = D(1)$ (where $D(1)$ is the first anti-rank as in Section 3). By the guarantee of Lemma 1, we know that if $|\zeta_{i^*}| \geq 20\|\zeta_{-i^*}\|_2$, then with probability $1 - n^{-c}$ count-max will return the index $i^* \in [n^c]$. Moreover, with the same probability, count-max will never return an index which is not the unique maximizer. To prove correctness, therefore, it suffices to analyze the conditional probability of failure given $D(1) = i$.

Let $N = |\{i \in [n^c] \mid F_i \neq 0\}|$ ($N$ is the support size of $F$). We can assume that $N \neq 0$ (to check this one could run, for instance, the $O(\log^2 n)$-bit support sampler of [JST11]). Note that $n^{c-1} \leq N \leq n^c$.
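A minimal sketch of the rounding function $\mathrm{rnd}_\nu$ from Step 3 (ours; floating-point logarithms are used for brevity, so the result is exact only up to floating-point error, whereas the actual algorithm works over the discretized support directly):

```python
import math

def rnd(x, nu):
    """Round x > 0 down to the nearest power of (1 + nu). The image of
    the scaling factors 1/t_i^(1/p) under this map has only
    O(nu^-1 log n) distinct values with high probability, which is what
    enables the fast update-time simulation of Section 5."""
    j = math.floor(math.log(x) / math.log(1.0 + nu))
    return (1.0 + nu) ** j
```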
The following fact is straightforward given the duplication structure of $F$.

Fact 6. For $p \in (0,2]$, suppose that we choose the constant $c$ such that $mM \leq n^{c/20}$, where we note that $|F_i| \leq mM$ for all $i \in [N]$. Then if $S \subset \{i \in [n^c] \mid F_i \neq 0\}$ is any subset, we have
$$\sum_{i \in S}|F_i|^p \geq \frac{|S|}{N}n^{-c/10}\|F\|_p^p.$$

Proof. We know that $|F_i|^p \leq (mM)^p \leq n^{c/10}$, using $p \leq 2$. Then each non-zero value $|F_i|^p$ is at least an $n^{-c/10}$ fraction of any other item $|F_j|^p$, and in particular of the average item weight. It follows that $|F_i|^p \geq n^{-c/10}\frac{\|F\|_p^p}{N}$ for all $i \in [N]$, which results in the stated fact.

As in Section 3, we now use the anti-rank vector $D(k)$ to denote the index of the $k$-th largest value of $z_i$ in absolute value. In other words, $D(k)$ is the index such that $|z_{D(k)}|$ is the $k$-th largest value in the set $\{|z_1|, |z_2|, \ldots, |z_{n^c}|\}$. Note that the $D(k)$'s are also the anti-ranks of the vector $\zeta$, since rounding $z$ into $\zeta$ preserves partial ordering. For the following lemma, it suffices to consider only the exponentials $t_i$ with $F_i \neq 0$, and we thus consider only values of $k$ between $1$ and $N$. Thus $|z_{D(1)}| \geq |z_{D(2)}| \geq \cdots \geq |z_{D(N)}|$. Moreover, we have that $|z_{D(k)}|^{-p} = t_{D(k)}/|F_{D(k)}|^p$ is the $k$-th smallest of all the $t_i/|F_i|^p$'s, and by the results of Section 3 can be written as
$$|z_{D(k)}|^{-p} = \sum_{\tau=1}^{k}\frac{E_\tau}{\sum_{j=\tau}^{N}|F_{D(j)}|^p},$$
where the $E_\tau$ are i.i.d. mean-$1$ exponentials that are independent of the anti-rank vector $D$. We will make use of this in the following lemma.

Lemma 2.
For every $1 \leq k < N - n^{c/2}$, we have
$$|z_{D(k)}| = \Big[(1 \pm O(n^{-\Omega(c)}))\sum_{\tau=1}^{k}\frac{E_\tau}{\mathbb{E}[\sum_{j=\tau}^{N}|F_{D(j)}|^p]}\Big]^{-1/p}$$
with probability $1 - O(e^{-n^{\Omega(c)}})$.

Proof. Let $\tau < N - n^{c/2}$. We can write $\sum_{j=\tau}^N |F_{D(j)}|^p$ as a deterministic function $\psi(t_1, \ldots, t_N)$ of the random scaling exponentials $t_1, \ldots, t_N$ corresponding to the coordinates with $F_i \neq 0$. We first argue that
$$|\psi(t_1, \ldots, t_N) - \psi(t_1, \ldots, t_{i-1}, t_i', t_{i+1}, \ldots, t_N)| \leq 2\max_j\{|F_j|^p\} \leq 2n^{-c+1}\|F\|_p^p.$$
This can be seen from the fact that changing a value of $t_i$ can only have the effect of adding (or removing) $|F_i|^p$ to the sum $\sum_{j=\tau}^N |F_{D(j)}|^p$ and removing (or adding) a different $|F_l|^p$ from the sum. The resulting change in the sum is at most $2\max_j\{|F_j|^p\}$, which is at most $2n^{-c+1}\|F\|_p^p$ by duplication. Set $T = N - \tau + 1$. Since the $t_i$'s are independent, we apply McDiarmid's inequality (Fact 2) to obtain
$$\Pr\Big[\Big|\sum_{j=\tau}^N |F_{D(j)}|^p - \mathbb{E}\Big[\sum_{j=\tau}^N |F_{D(j)}|^p\Big]\Big| > \epsilon T n^{-c}\|F\|_p^p\Big] \leq 2\exp\Big(-\frac{2(\epsilon T n^{-c}\|F\|_p^p)^2}{n^c(2n^{-c+1}\|F\|_p^p)^2}\Big) \leq 2\exp\big(-\epsilon^2 T^2 n^{-c-2}/2\big).$$
Setting $\epsilon = n^{-\Omega(c)}$ appropriately and using $T > n^{c/2}$, this is at most $2\exp(-n^{\Omega(c)})$. To show concentration up to a $(1 \pm O(n^{-\Omega(c)}))$ factor, it remains to show that $\mathbb{E}[\sum_{j=\tau}^N |F_{D(j)}|^p] = \Omega(n^{-c/10}\,T n^{-c}\|F\|_p^p)$. This follows from Fact 6, which gives $\sum_{j=0}^{T-1}|F_{D(N-j)}|^p \geq n^{-c/10}(T n^{-c}\|F\|_p^p)$ deterministically. Now recall that $|z_{D(k)}| = [\sum_{\tau=1}^k \frac{E_\tau}{\sum_{j=\tau}^N |F_{D(j)}|^p}]^{-1/p}$. We have just shown that $\sum_{j=\tau}^N |F_{D(j)}|^p = (1 \pm O(n^{-\Omega(c)}))\mathbb{E}[\sum_{j=\tau}^N |F_{D(j)}|^p]$ for each fixed $\tau$, so we can union bound over all $\tau = 1, 2, \ldots, N - n^{c/2}$ to obtain
$$|z_{D(k)}| = \Big[(1 \pm O(n^{-\Omega(c)}))\sum_{\tau=1}^{k}\frac{E_\tau}{\mathbb{E}[\sum_{j=\tau}^{N}|F_{D(j)}|^p]}\Big]^{-1/p}$$
for all $k \leq N - n^{c/2}$, with probability $1 - O(n^c e^{-n^{\Omega(c)}}) = 1 - O(e^{-n^{\Omega(c)}})$.

We use this result to show that our failure condition is nearly independent of the value of $D(1)$. Let $\mathcal{E}$ be the event that Lemma 2 holds. Let $\neg$FAIL be the event that the algorithm $L_p$ Sampler does not output
FAIL.

Lemma 3.
For $p \in (0,2)$ a constant bounded away from $2$, and any $\nu \geq n^{-\Omega(c)}$, we have
$$\Pr[\neg\mathrm{FAIL} \mid D(1)] = \Pr[\neg\mathrm{FAIL}] \pm \tilde{O}(\nu)$$
for every possible value of $D(1) \in [N]$.

Proof. By Lemma 2, conditioned on $\mathcal{E}$, for every $k < N - n^{c/2}$ we have $|z_{D(k)}| = U_{D(k)}^{1/p}(1 \pm O(n^{-\Omega(c)}))^{1/p} = U_{D(k)}^{1/p}(1 \pm O(\frac{1}{p}n^{-\Omega(c)}))$ (using the identity $(1+x) \leq e^x$ and the Taylor expansion of $e^x$), where
$$U_{D(k)} = \Big(\sum_{\tau=1}^{k}\frac{E_\tau}{\mathbb{E}[\sum_{j=\tau}^{N}|F_{D(j)}|^p]}\Big)^{-1}$$
is independent of the anti-rank vector $D$ (in fact, it is totally determined by $k$ and the hidden exponentials $E_i$). Then for $c$ sufficiently large, we have $|\zeta_{D(k)}| = U_{D(k)}^{1/p}(1 \pm O(\nu))$, and so for all $p \in (0,2]$ and $k < N - n^{c/2}$,
$$|\zeta_{D(k)}| = U_{D(k)}^{1/p} + U_{D(k)}^{1/p}V_{D(k)},$$
where $V_{D(k)}$ is some random variable that satisfies $|V_{D(k)}| = O(\nu)$. Now consider a bucket $A_{i,j}$ for $(i,j) \in [d] \times [2]$. Let $\sigma_k = \mathrm{sign}(z_k) = \mathrm{sign}(\zeta_k)$ for $k \in [n^c]$. Then we can write
$$A_{i,j}/\mu_{i,j} = \sum_{k \in B_{ij}}\sigma_{D(k)}|\zeta_{D(k)}|g_i(D(k)) + \sum_{k \in S_{ij}}\sigma_{D(k)}|\zeta_{D(k)}|g_i(D(k)),$$
where $B_{ij} = \{k \leq N - n^{c/2} \mid h_i(D(k)) = j\}$ and $S_{ij} = \{n^c \geq k > N - n^{c/2} \mid h_i(D(k)) = j\}$. Here we define $\{D(N+1), \ldots, D(n^c)\}$ to be the set of indices $i$ with $F_i = 0$ (in any ordering, as they contribute nothing to the sum). Also recall that $g_i(D(k)) \sim N(0,1)$ is the i.i.d. Gaussian coefficient associated to item $D(k)$ in row $i$ of $A$. So
$$A_{i,j}/\mu_{i,j} = \sum_{k \in B_{ij}}g_i(D(k))\sigma_{D(k)}U_{D(k)}^{1/p} + \sum_{k \in B_{ij}}g_i(D(k))\sigma_{D(k)}U_{D(k)}^{1/p}V_{D(k)} + \sum_{k \in S_{ij}}g_i(D(k))\zeta_{D(k)}.$$
Importantly, observe that since the variables $h_i(D(k))$ are fully independent, the sets $B_{i,j}, S_{i,j}$ are independent of the anti-rank vector $D$. In other words, the values $h_i(D(k))$ are independent of the values $D(k)$ (and of the entire anti-rank vector), since $\{h_i(1), \ldots, h_i(n^c)\} = \{h_i(D(1)), \ldots, h_i(D(n^c))\}$ are i.i.d. Note that this would not necessarily be the case if $\{h_i(1), \ldots, h_i(n^c)\}$ were only $\ell$-wise independent for some $\ell = o(n^c)$. So we can condition on a fixed set of values $\{h_i(D(1)), \ldots, h_i(D(n^c))\}$ now, which fixes the sets $B_{i,j}, S_{i,j}$. Now let $U^*_{i,j} = |\sum_{k \in B_{ij}}g_i(D(k))\sigma_{D(k)}U_{D(k)}^{1/p}|$.

Claim 1. For all $(i,j) \in [d] \times [2]$ and $p \in (0,2)$,
$$\Big|\sum_{k \in B_{ij}}g_i(D(k))\sigma_{D(k)}U_{D(k)}^{1/p}V_{D(k)} + \sum_{k \in S_{ij}}g_i(D(k))\zeta_{D(k)}\Big| = O(\nu(|A_{i,1}| + |A_{i,2}|))$$
with probability $1 - O(\log(n)n^{-\Omega(c)})$.

Proof. By the 2-stability of Gaussians (Definition 2), we have $|\sum_{k \in S_{ij}}g_i(D(k))\zeta_{D(k)}| = O(\sqrt{\log n}\,(\sum_{k \in S_{i,j}}(2z_{D(k)})^2)^{1/2})$ with probability $1 - n^{-c}$. This is a sum over a subset of the $n^{c/2}$ smallest items $|z_i|$, and thus $\sum_{k \in S_{i,j}}z_{D(k)}^2 < \frac{n^{c/2}}{N}\|z\|_2^2$, giving $|\sum_{k \in S_{ij}}g_i(D(k))\zeta_{D(k)}| = O(\sqrt{\log n}\,n^{-\Omega(c)}\|z\|_2)$. Now WLOG $A_{i,1}$ is such that $\sum_{k \in B_{i,1}\cup S_{i,1}}\zeta_{D(k)}^2 \geq \frac{1}{2}\|\zeta\|_2^2$. Then $|A_{i,1}| \geq |g|\|z\|_2/2$ for a Gaussian $g \sim N(0,1)$, so it follows with probability $1 - O(n^{-\Omega(c)})$ that $|A_{i,1}| > n^{-\Omega(c)}\|z\|_2 = \Omega\big((n^{\Omega(c)}/\sqrt{\log n})\,|\sum_{k \in S_{ij}}g_i(D(k))\zeta_{D(k)}|\big)$. Scaling $\nu$ by a $\log n$ factor then gives $|\sum_{k \in S_{ij}}g_i(D(k))\zeta_{D(k)}| = O(\nu|A_{i,1}|)$. Next, using that $|V_{D(k)}| = O(\nu)$, we have $|\sum_{k \in B_{ij}}g_i(D(k))\sigma_{D(k)}U_{D(k)}^{1/p}V_{D(k)}| = O(\nu)\,|\sum_{k \in B_{ij}}g_i(D(k))\sigma_{D(k)}U_{D(k)}^{1/p}| = O(\nu U^*_{i,j})$. Combined with the prior paragraph, we have $U^*_{i,j} = O(|A_{i,1}| + |A_{i,2}|)$, as needed. Note there are only $O(\log n)$ terms $(i,j)$ to union bound over, from which the claim follows.

Call the event where Claim 1 holds $\mathcal{E}_2$. Conditioned on $\mathcal{E}_2$, we can decompose $|A_{i,j}|/\mu_{i,j}$ for all $i,j$ into $U^*_{i,j} + \mathcal{V}_{i,j}$, where $\mathcal{V}_{i,j}$ is some random variable satisfying $|\mathcal{V}_{i,j}| = O(\nu(|A_{i,1}| + |A_{i,2}|))$ and $U^*_{i,j}$ is independent of the anti-rank vector $D$ (it depends only on the hidden exponentials $E_k$ and the uniformly random Gaussians $g_i(D(k))$). Now fix any realization of the count-max randomness, let $E = (E_1, \ldots, E_N)$ be the hidden exponential vector, $\mu = \{\mu_{i,1}, \mu_{i,2}\}_{i \in [d]}$, and $D = (D(1), D(2), \ldots, D(N))$, and observe
$$\Pr\big[\neg\mathrm{FAIL} \mid D(1)\big] = \sum_{E,\mu}\Pr\big[\neg\mathrm{FAIL} \mid D(1), E, \mu\big]\Pr\big[E, \mu\big].$$
Here we have used the fact that $E, \mu$ are independent of the anti-ranks $D$. Thus, it will suffice to bound the probability of obtaining $E, \mu$ such that the event of failure can be determined by the realization of $D$. So consider any row $i$, and consider the event $Q_i$ that $|\mu_{i,1}U^*_{i,1} - \mu_{i,2}U^*_{i,2}| < 2(|\mathcal{V}_{i,1}| + |\mathcal{V}_{i,2}|) = O(\nu(|A_{i,1}| + |A_{i,2}|))$ (where here we have conditioned on the high probability event $\mathcal{E}_2$). WLOG $U^*_{i,1} \geq U^*_{i,2}$, giving $U^*_{i,1} = \Theta(|A_{i,1}| + |A_{i,2}|)$. Since the $\mu_{i,j}$'s are uniform, $\Pr[Q_i] = O(\nu(|A_{i,1}| + |A_{i,2}|)/U^*_{i,1}) = O(\nu)$, and by a union bound $\Pr[\cup_{i \in [d]}Q_i] = O(\log(n)\nu)$. Thus, conditioned on $\mathcal{E}\cap\mathcal{E}_2$ and $\neg(\cup_{i \in [d]}Q_i)$, the event of failure is completely determined by the values $E, \mu$, and in particular is independent of the anti-rank vector $D$. Thus
$$\Pr\big[\neg\mathrm{FAIL} \mid D(1), E, \mu, \neg(\cup_{i \in [d]}Q_i), \mathcal{E}\cap\mathcal{E}_2\big] = \Pr\big[\neg\mathrm{FAIL} \mid E, \mu, \neg(\cup_{i \in [d]}Q_i), \mathcal{E}\cap\mathcal{E}_2\big].$$
So averaging over all
E, µ : Pr h ¬ FAIL | D (1) i = Pr h ¬ FAIL | D (1) , ¬ ( ∪ i ∈ [ d ] Q i ) , E ∩ E ∈ i + O (log( n ) ν )= Pr h ¬ FAIL |¬ ( ∪ i ∈ [ d ] Q i ) , E ∩ E ∈ i + O (log( n ) ν )= Pr h ¬ FAIL i + O (log( n ) ν )As needed.In Lemma 3, we demonstrated that the probability of failure can only change by an additive˜ O ( ν ) term given that any one value of i ∈ [ N ] achieved the maximum (i.e., D (1) = i ). Thisproperty will translate into a (1 ± ˜ O ( ν ))- relative error in our sampler, where the space complexityis independent of ν . To complete the proof of correctness of our algorithm, we now need to boundthe probability that we fail at all. To do so, we first prove the following fact about k z tail(s) k , orthe L norm of z with the top s largest (in absolute value) elements removed. Proposition 1.
Proposition 1. For any $s = 2^j \leq n^{c-1}$ for some $j \in \mathbb{N}$, we have $\sum_{i=4s}^{N} z_{D(i)}^2 = O(\|F\|_p^2 / s^{2/p - 1})$ if $p \in (0,2)$ is a constant bounded below $2$, and $\sum_{i=4s}^{N} z_{D(i)}^2 = O(\log(n)\|F\|_p^2)$ if $p = 2$, with probability $1 - O(e^{-s})$.

Proof. Let $I_k = \{i \in [N] \mid |z_i| \in (\|F\|_p/2^{(k+1)/p}, \|F\|_p/2^{k/p})\}$ for $k = 0, 1, \dots, O(\log(n))$ (where we have $\log(\|F\|_p^p) = O(\log(n))$). Note that $\Pr[i \in I_k] = \Pr[t_i \in (2^k \frac{F_i^p}{\|F\|_p^p}, 2^{k+1}\frac{F_i^p}{\|F\|_p^p})] < 2^k \frac{F_i^p}{\|F\|_p^p}$, where the inequality follows from the fact that the p.d.f. $e^{-x}$ of the exponential distribution is upper bounded by $1$. Thus $\mathbb{E}[|I_k|] < 2^k$, so for every $k \geq \log(s) = j$, we have $\Pr[|I_k| > 4 \cdot 2^k] < e^{-s}e^{-(k-j)}$. By a union bound, the probability that $|I_k| > 4 \cdot 2^k$ for any $k \geq \log(s)$ is at most $e^{-s}\sum_{i=0}^{O(\log(n))} e^{-i} \leq 2e^{-s}$. Now observe $\Pr[|z_i| > \|F\|_p/s^{1/p}] < s\frac{F_i^p}{\|F\|_p^p}$, so $\mathbb{E}[|\{i \mid |z_i| > \|F\|_p/s^{1/p}\}|] < s$, and again by Chernoff bounds the number of such $i$ with $|z_i| > \|F\|_p/s^{1/p}$ is at most $4s$ with probability $1 - e^{-s}$. Conditioning on this, $\sum_{i=4s}^{N} z_{D(i)}^2$ does not include the weight of any of these items, so
$$\sum_{i=4s}^{N} z_{D(i)}^2 \leq \sum_{k=\log(s)}^{O(\log(n))} |I_k| \frac{\|F\|_p^2}{2^{2k/p}} \leq \sum_{k=0}^{O(\log(n))} 4\|F\|_p^2 \, 2^{-(\log(s)+k)(2/p - 1)}$$
First, if $p < 2$ is a constant bounded below $2$, the above sum is geometric and converges to at most $\frac{4\|F\|_p^2 \, 2^{-\log(s)(2/p-1)}}{1 - 2^{-(2/p-1)}} = O(\|F\|_p^2 / s^{2/p - 1})$. If $p = 2$ or is arbitrarily close to $2$, then each term is at most $4\|F\|_p^2$, and the sum is upper bounded by $O(\log(n)\|F\|_p^2)$ as stated. Altogether, the total probability of failure is at most $O(e^{-s})$ by a union bound.
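Proposition 1 is also easy to confirm numerically. The sketch below (illustrative only; the distribution of $F$ and the parameter values are assumptions) prints the ratio of the empirical tail mass to the claimed bound for $p = 1$; the ratios remain bounded as $s$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check of Proposition 1: for constant p < 2 the mass
# sum_{i > 4s} z_{D(i)}^2 should be O(||F||_p^2 / s^{2/p - 1}),
# i.e. the printed ratios stay bounded by a constant.
p, N = 1.0, 200000
F = rng.uniform(1, 2, size=N)
z = np.sort(np.abs(F / rng.exponential(size=N) ** (1.0 / p)))[::-1]
norm_p = np.sum(F ** p) ** (1.0 / p)                  # ||F||_p
for s in (16, 64, 256, 1024):
    tail = np.sum(z[4 * s:] ** 2)
    print(s, tail / (norm_p ** 2 / s ** (2.0 / p - 1)))
```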
Lemma 4. For $0 < p < 2$ a constant bounded away from $0$ and $2$, the probability that $L_p$-Sampler outputs FAIL is at most $1 - \Omega(1)$, and for $p = 2$ it is at most $1 - \Omega(1/\log(n))$.

Proof. By Proposition 1, with probability $1 - e^{-4} > .98$, we have $\|z_{\mathrm{tail}(16)}\|_2 = O(\|F\|_p)$ for $p < 2$, and $\|z_{\mathrm{tail}(16)}\|_2 = O(\sqrt{\log(n)}\|F\|_p)$ when $p = 2$. Observe that for $t = 2, 3, \dots, 16$, we have $|z_{D(t)}| < \|F\|_p(\sum_{\tau=1}^{t} E_\tau)^{-1/p}$, and with probability $99/100$ we have $E_t > 1/100$ for all $t \in [16]$, giving $|z_{D(t)}| = O(\|F\|_p)$ for all $t \in [16]$. Conditioned on this, we have $\|z_{\mathrm{tail}(2)}\|_2 < q\|F\|_p$ where $q$ is a constant when $p < 2$, and $q = \Theta(\sqrt{\log(n)})$ when $p = 2$. Now $|z_{D(1)}| = \|F\|_p E_1^{-1/p}$, and using the fact that the p.d.f. of an exponential random variable around $0$ is bounded above by a constant, we will have $|z_{D(1)}| > 2\|z_{-D(1)}\|_2$ with probability $\Omega(1)$ when $p < 2$, and with probability $\Omega(1/\log(n))$ when $p = 2$. Conditioned on this, by Lemma 1, count-max will return the index $D(1)$ with probability $1 - n^{-c}$, and thus the sampling algorithm will not fail.

Putting together the results of this section, we obtain the correctness of our algorithm as stated in Theorem 2. In Section 5, we will show that the algorithm can be implemented to have $\tilde{O}(\nu^{-1})$ update and $\tilde{O}(1)$ query time, and that the entire algorithm can be derandomized to use $O(\log^2(n)(\log\log n)^2)$ bits of space for $p \in (0,2)$ and $O(\log^3(n))$ bits for $p = 2$.
Theorem 2. Given any constant $c \geq 2$, $\nu \geq n^{-c}$, and $0 < p \leq 2$, there is a one-pass $L_p$ sampler which returns an index $i \in [n]$ such that $\Pr[i = j] = \frac{|f_j|^p}{\|f\|_p^p}(1 \pm \nu) \pm n^{-c}$ for all $j \in [n]$, and which fails with probability $\delta > 0$. The space required is $O(\log^2(n)\log(1/\delta)(\log\log n)^2)$ bits for $p < 2$, and $O(\log^3(n)\log(1/\delta))$ bits for $p = 2$. For $p < 2$ and $\delta = 1/\mathrm{poly}(n)$, the space is $O(\log^3(n))$-bits. The update time is $\tilde{O}(\nu^{-1})$, and the query time is $\tilde{O}(1)$.

Proof. Conditioned on not failing, by Lemma 1, with probability $1 - n^{-c}$ we have that the output $i_j \in [n^c]$ of count-max will in fact be equal to $\arg\max_i\{|\zeta_i|\}$. Recall that $\zeta_i = (1 \pm O(\nu))z_i$ for all $i \in [n^c]$ (and this rounding of $z$ to $\zeta$ is order preserving). By Lemma 1, count-max only outputs a coordinate which is the unique maximizer of $\zeta$. Now if there is a unique maximizer of $\zeta$, there must also be a unique maximizer in $z$, from which it follows that $i_j = \arg\max_i\{|z_i|\}$.

Now Lemma 3 states for any $i_j \in [n^c]$ that $\Pr[\neg FAIL \mid i_j = \arg\max_{i',j'}\{|z_{i'_{j'}}|\}] = \Pr[\neg FAIL] \pm \tilde{O}(\nu) = q \pm \tilde{O}(\nu)$, where $q = \Pr[\neg FAIL] = \Omega(1)$ for $p < 2$, and $q = \Omega(1/\log(n))$ for $p = 2$, both of which follow from Lemma 4, which does not depend on any of the randomness in the algorithm. Since conditioned on not failing, the output $i_j$ of count-max satisfies $i_j = \arg\max_i\{|z_i|\}$, the probability the output is $i_j \in [n^c]$ is $\Pr[\neg FAIL \cap i_j = \arg\max\{|z_i|\}]$, so the probability our final algorithm outputs $i \in [n]$ is
$$\sum_{j \in [n^{c-1}]} \Pr[\neg FAIL \mid i_j = \arg\max_{i',j'}\{|z_{i'_{j'}}|\}]\,\Pr[i_j = \arg\max_{i',j'}\{|z_{i'_{j'}}|\}] = \sum_{j \in [n^{c-1}]} \frac{|f_i|^p}{\|F\|_p^p}(q \pm \tilde{O}(\nu)) = \frac{|f_i|^p}{\|f\|_p^p}(q \pm \tilde{O}(\nu))$$
Note that we can scale the $c$ value used in the algorithm by a factor of $60$, so that the statement of Lemma 3 holds for any $\nu \geq n^{-c}$. The potential failure of the various high probability events that we conditioned on only adds another additive $O(n^{-c})$ term to the error. Thus, conditioned on an index $i$ being returned, we have $\Pr[i = j] = \frac{|f_j|^p}{\|f\|_p^p}(1 \pm \tilde{O}(\nu)) \pm n^{-c}$ for all $j \in [n]$, which is the desired result after scaling $\nu$ by a $\mathrm{poly}(\log(n))$ term. Running the algorithm $O(\log(\delta^{-1}))$ times in parallel for $p < 2$, and $O(\log(n)\log(\delta^{-1}))$ times for $p = 2$, it follows that at least one index will be returned with probability $1 - \delta$.

For the complexity, the update time of the count-max data structure $A$ follows from the routine Fast-Update of Lemma 6, and the query time follows from Lemma 9. Theorem 7 shows that the entire algorithm can be derandomized to use a random seed with $O(\log^2(n)(\log\log(n))^2)$-bits, so to complete the claim it suffices to note that using $O(\log(n))$-bit precision as required by Fast-Update (Lemma 6), our whole data structure $A$ can be stored with $O(\log^2(n))$ bits, which is dominated by the cost of storing the random seed. This gives the stated space after taking $O(\log(\delta^{-1}))$ parallel repetitions for $p < 2$. For $p = 2$, we only need a random seed of length $O(\log^3(n))$ for all $O(\log(n)\log(\delta^{-1}))$ repetitions by Corollary 4, which gives $O(\log^3(n)\log(\delta^{-1}) + \log^3(n)) = O(\log^3(n)\log(1/\delta))$ bits of space for $p = 2$ as stated. Similarly for the case of $p < 2$ and $\delta = 1/\mathrm{poly}(n)$, the stated space follows from Corollary 4.

In particular, it follows that perfect $L_p$ samplers exist using $O(\log^2(n)\log(1/\delta)(\log\log n)^2)$ and $O(\log^3(n)\log(1/\delta))$ bits of space for $p < 2$ and $p = 2$ respectively.
Theorem 3. Given $0 < p \leq 2$, for any constant $c \geq 2$ there is a perfect $L_p$ sampler which returns an index $i \in [n]$ such that $\Pr[i = j] = \frac{|f_j|^p}{\|f\|_p^p} \pm O(n^{-c})$ for all $j \in [n]$, and which fails with probability $\delta > 0$. The space required is $O(\log^2(n)\log(1/\delta)(\log\log n)^2)$ bits for $p < 2$, and $O(\log^3(n)\log(1/\delta))$ bits for $p = 2$. For $p < 2$ and $\delta = 1/\mathrm{poly}(n)$, the space is $O(\log^3(n))$-bits.

Finally, we note that the cause of having to pay an extra $(\log\log n)^2$ factor in the space complexity for $p < 2$ is the derandomization: if the algorithm is given access to a $\mathrm{poly}(n)$-length random tape which does not count against its space requirement, the space is an optimal $O(\log^2(n)\log(1/\delta))$. We remark that the $\Omega(\log^2(n)\log(1/\delta))$ lower bound of [KNP+17] also holds in the random oracle model.
Corollary 2. For $p \in (0, 2)$, in the random oracle model, there is a perfect $L_p$ sampler which fails with probability $\delta > 0$ and uses $O(\log^2(n)\log(1/\delta))$ bits of space.

Remark 1. Note that for $p$ arbitrarily close to $2$, the bound on $\|z_{\mathrm{tail}(s)}\|_2$ of Proposition 1 as used in Lemma 4 degrades, as the sum of the squared $L_2$ norms of the level sets is no longer geometric, and must be bounded by $O(\sqrt{\log(n)}\|F\|_2)$. In this case, the failure probability from Lemma 4 goes to $\Theta(1/\log(n))$, and so we must use the upper bound for $p = 2$. Similarly, for $p$ arbitrarily close to $0$, the bound also degrades since the values $V_{D(k)}$ in Lemma 3 blow up. For such non-constant $p$ arbitrarily close to $0$, we direct the reader to the $O(\log^2(n))$-bit perfect $L_0$ sampler of [JST11].

5 Time and Space Complexity
In this section, we will show that our algorithm can be implemented with the desired space and time complexity. First, in Section 5.1, we show how $L_p$-Sampler can be implemented with the update procedure Fast-Update to result in $\tilde{O}(\nu^{-1})$ update time. Next, in Section 5.2, we show that the algorithm $L_p$-Sampler with Fast-Update can be derandomized to use a random seed of length $O(\log^2(n)(\log\log n)^2)$-bits, which will give the desired space complexity. Finally, in Section 5.3, we show how, using an additional heavy-hitters data structure as a subroutine, we can obtain $\tilde{O}(1)$ update time as well. This additional data structure will not increase the space or update time complexity of the entire algorithm, and does not need to be derandomized.

In this section we prove Lemma 6. Our algorithm utilizes a single data structure run on the stream $\zeta$, which is the count-max matrix $A \in \mathbb{R}^{d \times 2}$ where $d = \Theta(\log(n))$. We will introduce an update procedure Fast-Update which updates the data structure $A$ of $L_p$-Sampler in $\tilde{O}(\nu^{-1})$ time. We assume the unit-cost RAM model of computation, where a word of length $O(\log(n))$-bits can be operated on in $O(1)$ time (note that replacing $O(1)$ with $\mathrm{poly}(\log(n))$ time here would not affect our results, as the additional cost would be hidden in the $\tilde{O}$). Throughout this section, we will refer to the original algorithm as the algorithm which implements the $L_p$ sampler by individually generating each scaling exponential $t_i$ for $i \in [n^c]$, and hashing them individually into $A$ (naïvely taking $n^c$ update time). Our procedure will utilize the following result about efficiently sampling binomial random variables, which can be found in [BKP+].
Proposition 2. For any constant $c > 0$, there is an algorithm that can draw a sample $X \sim \mathrm{Bin}(n, 1/2)$ in expected $O(1)$ time in the unit-cost RAM model. Moreover, it can be sampled in time $\tilde{O}(1)$ with probability $1 - n^{-c}$. The space required is $O(\log(n))$-bits.

Proof. The proof of the running time bounds and correctness can be found in [BKP+]. We assume that $n$ is even; otherwise we could sample $\mathrm{Bin}(n, q) \sim \mathrm{Bin}(n-1, q) + \mathrm{Bin}(1, q)$, where the latter can be sampled in constant time (unit-cost RAM model) and $O(\log(n))$-bits of space. The algorithm first computes $m \in [\sqrt{n}, \sqrt{n} + 3]$, which can be done via any rough approximation of the function $\sqrt{x}$, and requires only $O(\log(n))$-bits. Define the block $B_k = \{km, km+1, \dots, km + m - 1\}$ for $k \in \mathbb{Z}$, and set
$$f(i) = \frac{4}{2^{\max\{k, -k-1\}}\, m} \;\text{ s.t. } i \in B_k, \qquad p(i) = 2^{-n}\binom{n}{n/2 + i}$$
Note that given $i$, $f(i)$ can be computed in constant time and $O(\log(n))$ bits of space. The algorithm then performs the following loop:
1. Sample $i$ via the normalized probability distribution $\bar{f} = f/16$.
2. Output $n/2 + i$ with probability $p(i)/f(i)$.
3. Else, reject $i$ and return to Step 1.
To compute the first step, the symmetry of $\bar{f}$ around $0$ is utilized. We flip unbiased coins $C_0, C_1, \dots$ until we obtain a $C_{t+1}$ which lands tails, and pick $i$ uniformly from block $B_t$ or $B_{-t-1}$ (where the choice is decided by a single coin flip). The procedure requires at most $O(\log(n))$-bits to store the index $t$. Next, to perform the second step, we obtain $2^{-L}$ additive error approximations $\tilde{q}$ of $q = p(i)/f(i)$ for $L = 1, 2, \dots$, which (using the fact that $0 \leq q \leq 1$) can be done by obtaining a $2^{-L}$-relative error approximation of $q$. Then we flip $L$ random bits to obtain a uniform $\tilde{R} \in [0,1)$ to precision $2^{-L}$, and check whether $|\tilde{R} - \tilde{q}| > 2^{-L}$. If so, we can either accept or reject $i$ based on whether $\tilde{R} > \tilde{q} + 2^{-L}$ or not; otherwise we repeat with $L \leftarrow L + 1$.

To obtain $\tilde{q}$, it suffices to obtain a $2^{-L}$ relative error approximation of the factorial function $x!$. To do so, the $2^{-L}$ approximation
$$x! \approx (x + L)^{x + 1/2}\, e^{-(x+L)}\Big[\sqrt{2\pi} + \sum_{k=1}^{L-1} \frac{c_k}{x + k}\Big]$$
is used, where $c_k = \frac{(-1)^{k-1}}{(k-1)!}(L - k)^{k - 1/2}\, e^{L-k}$. This requires estimating the functions $e^x$, $\sqrt{x}$, and $\pi$, all of which, as well as each term in the sum, need only be estimated to $O(L)$-bits of accuracy (as demonstrated in [BKP+]). Thus the entire procedure is carried out with $O(L) = O(\log(n))$-bits of space ($L$ can never exceed $O(\log(n))$, as $q$ is specified with at most $O(\log(n))$ bits), which completes the proof.
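The rejection loop above is straightforward to prototype. The following sketch implements the block-envelope scheme directly; note it is illustrative only, since it evaluates $p(i)$ in floating point via `lgamma` rather than comparing against $p(i)/f(i)$ one bit at a time as the streaming algorithm must.

```python
import numpy as np
from math import lgamma, log, exp

rng = np.random.default_rng(0)

def sample_bin_half(n):
    """Rejection sampler for Bin(n, 1/2), n even, following the block
    scheme above: propose an offset i from the mean n/2 out of the
    envelope f(i) = 4 * 2^{-max(k,-k-1)} / m for i in block B_k of
    width m ~ sqrt(n), and accept with probability p(i)/f(i)."""
    m = int(n ** 0.5) + 1                     # block width, in [sqrt(n), sqrt(n)+3]
    while True:
        t = 0
        while rng.integers(0, 2) == 1:        # Pr[t] = 2^{-(t+1)}
            t += 1
        k = t if rng.integers(0, 2) == 0 else -t - 1
        i = k * m + int(rng.integers(0, m))   # uniform position inside block B_k
        if abs(i) > n // 2:
            continue                          # outside the support: p(i) = 0
        log_p = (lgamma(n + 1) - lgamma(n // 2 + i + 1)
                 - lgamma(n // 2 - i + 1) - n * log(2))
        f = 4.0 * 2.0 ** (-max(k, -k - 1)) / m
        if rng.random() < exp(log_p) / f:     # accept w.p. p(i)/f(i) <= 1
            return n // 2 + i

print(np.mean([sample_bin_half(10 ** 4) for _ in range(300)]))  # ~ 5000
```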
We now utilize a straightforward reduction from the case of sampling from $\mathrm{Bin}(n, q)$ for any $q \in [0,1]$ to sampling several times from $\mathrm{Bin}(n', 1/2)$ where $n' \leq n$. This reduction has been observed before [FCT15]; however, we state it here to clearly demonstrate our desired space and time bounds.
Lemma 5. For any constant $c > 0$ and $q \in [0,1]$, there is an algorithm that can draw a sample $X \sim \mathrm{Bin}(n, q)$ in expected $O(1)$ time in the unit-cost RAM model. Moreover, it can be sampled in time $\tilde{O}(1)$ with probability $1 - n^{-c}$, and the space required is $O(\log(n))$-bits.

Proof. The reduction is as follows (for a more detailed proof of correctness, see [FCT15]). We sample $\mathrm{Bin}(n, q)$ by determining how many of the $n$ trials were successful. This can be done by generating variables $u_1, \dots, u_n$ uniform on $[0,1]$ and counting how many satisfy $u_i < q$. We do this without generating all the variables $u_i$ explicitly as follows. First write $q$ in binary as $q = (0.q_1 q_2 \dots)_2$. Set $b \leftarrow 0$, $j \leftarrow 1$, $n_1 \leftarrow n$, and sample $b_j \sim \mathrm{Bin}(n_j, 1/2)$. If $q_j = 1$, then set $b \leftarrow b + b_j$, as the corresponding $b_j$ trials $u_i$ with the $j$-th bit set to $0$ will all be successful trials given that $q_j = 1$. Then set $n_{j+1} \leftarrow n_j - b_j$ and repeat with $j \leftarrow j + 1$. Otherwise, if $q_j = 0$, then we set $n_{j+1} \leftarrow n_j - (n_j - b_j) = b_j$, since this represents the fact that $(n_j - b_j)$ of the variables $u_i$ will be larger than $q$. With probability $1 - n^{-c}$, we reach the point where $n_j = 0$ within $O(\log(n))$ iterations, and we return the value stored in $b$ at this point. By Proposition 2, each iteration requires $\tilde{O}(1)$ time, and thus the entire procedure is $\tilde{O}(1)$. For space, note that we need only store $q$ to its first $O(\log(n))$ bits, since the procedure terminates with high probability within $O(\log(n))$ iterations. Then the entire procedure requires $O(\log(n))$ bits, since each sample of $\mathrm{Bin}(n_j, 1/2)$ requires only $O(\log(n))$ space by Proposition 2.
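The bit-by-bit reduction of Lemma 5 translates directly into code. In the sketch below, numpy's `rng.binomial(m, 0.5)` stands in for the $O(1)$-time $\mathrm{Bin}(m, 1/2)$ sampler of Proposition 2, and the binary expansion of $q$ is capped at float precision; both are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_binomial(n, q, bits=53):
    """Sample X ~ Bin(n, q) using only Bin(m, 1/2) draws, following the
    reduction of Lemma 5: process the binary digits of q one at a time,
    tracking the trials whose uniforms are still tied with q."""
    successes, remaining = 0, n
    for j in range(1, bits + 1):
        if remaining == 0:
            break
        q_j = int(q * 2 ** j) & 1             # j-th binary digit of q
        b = rng.binomial(remaining, 0.5)      # tied trials whose j-th bit is 0
        if q_j == 1:
            successes += b                    # their uniforms are certainly < q
            remaining -= b
        else:
            remaining = b                     # trials with bit 1 exceed q: discard
    return successes

print(np.mean([sample_binomial(1000, 0.3) for _ in range(2000)]))  # ~ 300
```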
The Fast-Update procedure. We are now ready to describe the implementation of our update-time algorithm. Recall that our algorithm utilizes just a single data structure on the stream $\zeta$: the $d \times 2$ count-max table $A$ (where $d = \Theta(\log(n))$). Upon receiving an update $(i, \Delta)$ to a coordinate $f_i$ for $i \in [n]$, we proceed as follows. Our goal is to compute the set $\{\mathrm{rnd}_\nu(1/t_{i_1}^{1/p}), \mathrm{rnd}_\nu(1/t_{i_2}^{1/p}), \dots, \mathrm{rnd}_\nu(1/t_{i_{n^{c-1}}}^{1/p})\}$, and update each row of $A$ accordingly, in $\tilde{O}(\nu^{-1})$ time. Naïvely, this could be done by computing each value individually and then updating each row of $A$ accordingly; however, this would require $O(n^{c-1})$ time. To avoid this, we exploit the fact that the support size of $\mathrm{rnd}_\nu(x)$ for $1/\mathrm{poly}(n) \leq x \leq \mathrm{poly}(n)$ is $\tilde{O}(\nu^{-1})$, so it will suffice to determine how many variables $\mathrm{rnd}_\nu(1/t_{i_j}^{1/p})$ are equal to each value in the support of $\mathrm{rnd}_\nu(x)$.

Our update procedure is then as follows. Let $I_j = (1 + \nu)^j$ for $j = -\Pi, -\Pi + 1, \dots, \Pi - 1, \Pi$, where $\Pi = O(\log(n)\nu^{-1})$. We utilize the c.d.f. $\psi(x) = e^{-x^{-p}}$ of the $1/p$-th power of the inverse exponential distribution $t^{-1/p}$ (here $t$ is exponentially distributed). Then beginning with $j = -\Pi, -\Pi + 1, \dots, \Pi$, we compute the probability $q_j = \psi(I_{j+1}) - \psi(I_j)$ that $\mathrm{rnd}_\nu(1/t^{1/p}) = I_j$, and then compute the number $Q_j$ of values in $\{\mathrm{rnd}_\nu(1/t_{i_1}^{1/p}), \dots, \mathrm{rnd}_\nu(1/t_{i_{n^{c-1}}}^{1/p})\}$ which are equal to $I_j$. With probability $1 - O(n^{-c})$, we know that $1/\mathrm{poly}(n) \leq t_i \leq \mathrm{poly}(n)$ for all $i \in [N]$, and thus conditioned on this, we will have completely determined the values of the items in $\{\mathrm{rnd}_\nu(1/t_{i_1}^{1/p}), \dots, \mathrm{rnd}_\nu(1/t_{i_{n^{c-1}}}^{1/p})\}$ by looking at the number equal to $I_j$ for $j = -\Pi, \dots, \Pi$.

Now we know that there are $Q_j$ updates which we need to hash into $A$ (along with i.i.d. Gaussian scalings), each with the same value $\Delta I_j$. This is done by the procedure Fast-Update-CS (Figure 5), which computes the number $b_{k,\theta}$ that hash to each bucket $A_{k,\theta}$ by drawing binomial random variables. Once this is done, we know that the value of $A_{k,\theta}$ should be updated by the value $\sum_{t=1}^{b_{k,\theta}} g_t \Delta I_j$, where each $g_t \sim \mathcal{N}(0,1)$ is i.i.d. Naïvely, computing $\sum_{t=1}^{b_{k,\theta}} g_t \Delta I_j$ would involve generating $b_{k,\theta}$ random Gaussians. To avoid this, we utilize the 2-stability of Gaussians (Definition 2), which asserts that $\sum_{t=1}^{b_{k,\theta}} g_t \Delta I_j \sim g\sqrt{b_{k,\theta}}\Delta I_j$, where $g \sim \mathcal{N}(0,1)$. Thus we store the Gaussian $g$ associated with the item $i \in [n]$, rounding $I_j$, and bucket $A_{k,\theta}$, and on each update $\Delta$ to $f_i$ we can update $A_{k,\theta}$ by $g\sqrt{b_{k,\theta}}\Delta I_j$.

Finally, once the number of values in $\{\mathrm{rnd}_\nu(1/t_{i_1}^{1/p}), \dots, \mathrm{rnd}_\nu(1/t_{i_{n^{c-1}}}^{1/p})\}$ which are left to determine is less than $K$ for some $K = \Theta(\log(n))$, we simply generate and hash each of the remaining variables individually. The generation process is the same as before, except that for each of these at most $K$ remaining items we associate a fixed index $i_j$ for $j \in [n^{c-1}]$, and store the relevant random variables $h_\ell(i_j), g_\ell(i_j)$ for $\ell \in [d]$. Since the value of $j$ which is chosen for each of these coordinates does not affect the behavior of the algorithm -- in other words, the index of the duplicate which is among the $K$ largest is irrelevant -- we can simply choose these indices to be $i_1, i_2, \dots, i_K \in [N]$, so that the first item hashed individually via step 3 corresponds to $\zeta_{i_1}$, the second to $\zeta_{i_2}$, and so on.

Fast-Update($i, \Delta, A, B$)
Set $L = n^{c-1}$, and fix $K = \Theta(\log(n))$ with a large enough constant. For $j = -\Pi, -\Pi + 1, \dots, \Pi - 1, \Pi$:
1. Compute $q_j = \psi(I_{j+1}) - \psi(I_j)$.
2. Draw $Q_j \sim \mathrm{Bin}(L, q_j)$.
3. If $L < K$, hash the $Q_j$ items individually into each row $A_\ell$ using explicitly stored uniform i.i.d. random variables $h_\ell: [n^c] \to [2]$ and Gaussians $g_\ell(j)$ for $\ell \in [d]$.
4. Else: update the count-max table $A$ via Fast-Update-CS($A, Q_j, I_j, \Delta, i$).
5. $L \leftarrow L - Q_j$.

Figure 4: Algorithm to update count-max $A$

Fast-Update-CS($A, Q, I, \Delta, i$)
Set $W_k = Q$ for $k = 1, \dots, d$. For $k = 1, \dots, d$:
1. For $\theta = 1, 2$:
(a) Draw $b_{k,\theta} \sim \mathrm{Bin}(W_k, \frac{1}{2 - \theta + 1})$.
(b) Draw and store $g_{k,\theta,I,i} \sim \mathcal{N}(0,1)$, reusing it on future calls to Fast-Update-CS with the same parameters $(k, \theta, I, i)$.
(c) Set $A_{k,\theta} \leftarrow A_{k,\theta} + g_{k,\theta,I,i}\sqrt{b_{k,\theta}}\,\Delta I$.
(d) $W_k \leftarrow W_k - b_{k,\theta}$.

Figure 5: Update $A$ via updates to $Q$ coordinates, each with a value of $\Delta I$

Note that the randomness used to process an update corresponding to a fixed $i \in [n]$ is stored so it can be reused to generate the same updates to $A$ whenever an update to $i$ is made. Thus, each time an update $+1$ is made to a coordinate $i \in [n]$, each bucket of count-max is updated by the same value. When an update of size $\Delta$ comes, this update to the count-max buckets is scaled by $\Delta$.
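The following sketch puts Figures 4 and 5 together for a $d \times 2$ table. It is illustrative only: the constants, the grid range `PI`, and the cache layout are assumptions, and the binomial counts are drawn with the remaining probability mass in the denominator so that the $Q_j$'s are jointly multinomial over the grid cells.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of Fast-Update / Fast-Update-CS. For each grid value
# I_j = (1+nu)^j we draw how many of the L duplicates round to I_j,
# split them over the two buckets of each row with another binomial,
# and collapse each bucket's Gaussian inner product to g * sqrt(b) by
# 2-stability. Per-coordinate randomness is cached so that later
# updates to the same coordinate replay identical values.
p, nu, d, L0 = 1.0, 0.1, 25, 10 ** 6
PI = 200
grid = (1.0 + nu) ** np.arange(-PI, PI + 1)
psi = lambda x: np.exp(-x ** (-p))           # Pr[t^{-1/p} <= x] for t ~ Exp(1)
A = np.zeros((d, 2))
cache = {}                                   # i -> list of (I_j, d x 2 coefficients)

def plan_updates(i):
    plan, L, rem = [], L0, 1.0
    for j in range(len(grid) - 1):
        q = psi(grid[j + 1]) - psi(grid[j])
        Q = rng.binomial(L, min(q / rem, 1.0)) if rem > 1e-12 and L > 0 else 0
        L, rem = L - Q, rem - q
        if Q == 0:
            continue
        b = rng.binomial(Q, 0.5, size=d)     # per-row split over the 2 buckets
        coeff = np.stack([np.sqrt(b), np.sqrt(Q - b)], axis=1)
        coeff = coeff * rng.standard_normal((d, 2))   # 2-stability collapse
        plan.append((grid[j], coeff))
    return plan

def fast_update(i, delta):
    if i not in cache:
        cache[i] = plan_updates(i)
    for I, coeff in cache[i]:
        A[:] += coeff * (delta * I)

fast_update(7, 1.0); fast_update(7, -1.0)    # cached randomness: net effect is zero
print(np.abs(A).max())
```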
For each $i \in [n]$, let $K_i$ denote the size of $L$ when step 3 of Figure 4 was first executed while processing an update to $i$. In other words, the coordinates $\zeta_{i_1}, \dots, \zeta_{i_{K_i}}$ were hashed into each row $\ell \in [d]$ of $A$ using explicitly stored random variables $h_\ell(i_j), g_\ell(i_j)$. Let $\mathcal{K} = \cup_{i \in [n]} \cup_{j=1}^{K_i} \{i_j\}$. Then on the termination of the algorithm, to find the maximizer of $\zeta$, the count-max algorithm checks, for each $i \in \mathcal{K}$, whether $i$ hashed to the largest bucket (in absolute value) in at least a constant fraction of the rows. Count-max then returns the first such $i$ which satisfies this, or FAIL if none do. In other words, the count-max algorithm decides to fail or output an index $i$ based on computing the fraction of rows for which $i$ hashes into the largest bucket; now it only computes these values for $i \in \mathcal{K}$ instead of $i \in [n^c]$, and thus count-max can only return a value $i \in \mathcal{K}$. We now argue that the distribution of the output of our algorithm is not changed by using the update procedure Fast-Update. This will involve showing that $\arg\max_i\{|\zeta_i|\} \in \mathcal{K}$ whenever our algorithm was to return a coordinate originally.
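A sketch of this restricted query rule follows. It is illustrative: `buckets[r][i]` is a hypothetical lookup giving $i$'s bucket in row $r$, and `frac` stands in for the constant row fraction that count-max requires (fixed earlier in the paper).

```python
import numpy as np

def countmax_query(A, candidates, buckets, frac=0.8):
    """Count-max reporting restricted to the stored candidate set K:
    return the first candidate holding the strictly largest bucket, in
    absolute value, in at least frac * d rows; otherwise None (FAIL)."""
    d = A.shape[0]
    for i in candidates:
        hits = 0
        for r in range(d):
            row = np.abs(A[r])
            b = buckets[r][i]
            if row[b] > max(row[o] for o in range(len(row)) if o != b):
                hits += 1
        if hits >= frac * d:
            return i
    return None
```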
Lemma 6. Running the $L_p$ sampler with the update procedure given by Fast-Update results in the same distribution over the count-max table $A$ and the $L_2$ estimation vector $B$ as the original algorithm. Moreover, conditioned on a fixed realization of $A, B$, the output of the original algorithm will be the same as the output of the algorithm using Fast-Update. For a given $i \in [n]$, Fast-Update requires $\tilde{O}(\nu^{-1})$ random bits, and runs in time $\tilde{O}(\nu^{-1})$.

Proof. To hash an update $\Delta$ to a coordinate $f_i$, the procedure Fast-Update computes the number $Q_j$ of variables in the set $\{\mathrm{rnd}_\nu(1/t_{i_1}^{1/p}), \dots, \mathrm{rnd}_\nu(1/t_{i_{n^{c-1}}}^{1/p})\}$ which are equal to $I_j$ for each $j \in \{-\Pi, \dots, \Pi\}$. Instead of computing $Q_j$ by individually generating the variables and rounding them, we utilize a binomial random variable to determine $Q_j$, which results in the same distribution over $\{\mathrm{rnd}_\nu(1/t_{i_1}^{1/p}), \dots, \mathrm{rnd}_\nu(1/t_{i_{n^{c-1}}}^{1/p})\}$. As noted, with probability $1 - n^{-c}$, none of the variables $\mathrm{rnd}_\nu(1/t_{i_j}^{1/p})$ will be equal to $I_k$ for $|k| > \Pi$, which follows from the fact that $n^{-c} < t_i < O(\log(n))$ with probability $1 - n^{-c}$ after union bounding over all $n^c$ exponential variables $t_i$. So we can safely ignore this low probability event.

Once $Q_j$ is computed, we can easily sample the number of the $Q_j$ items that go into each bucket $A_{k,\theta}$, which is the value $b_{k,\theta}$ in Fast-Update-CS (Figure 5). By the 2-stability of Gaussians (Definition 2), we can update each bucket $A_{k,\theta}$ by $g_{k,\theta,I_j,i}\sqrt{b_{k,\theta}}\Delta I_j$, which is distributed precisely the same as if we had individually generated each of the $b_{k,\theta}$ Gaussians and taken their inner product with the vector of $\Delta I_j$'s. Storing the explicit values $h_\ell(i_j)$ for the top $K$ largest values of $\mathrm{rnd}_\nu(1/t_{i_j}^{1/p})$ does not affect the distribution, but only allows the algorithm to determine the indices of the largest coordinates $i_j$ corresponding to each $i \in [n]$ at the termination of the algorithm. Thus the distribution of updates to $A$ is unchanged by the Fast-Update procedure.

We now show that the output of the algorithm run with this update procedure is the same as it would have been had all the random variables been generated and hashed individually. First observe that for $\nu < 1/2$, no value $q_j = \psi(I_{j+1}) - \psi(I_j)$ is greater than $1/2$. Thus at any iteration, if $L > K$ then $L - \mathrm{Bin}(L, q_j) > L/3$ with probability $1 - n^{-c}$ by Chernoff bounds (using that $K = \Omega(\log(n))$). Thus at the first iteration at which $L$ drops below $K$, we will have $L > K/3$. So for each $i \in [n]$, the top $K/3$ largest duplicate values $\zeta_{i_j}$ will be hashed into each row $A_\ell$ using stored random variables $h_\ell(i_j)$, so $K_i > K/3 = \Omega(\log(n))$ for all $i \in [n]$. In particular, $K_i \geq 1$ for all $i \in [n]$.

Now the only difference between the output procedure of the original algorithm and that of the efficient-update-time algorithm is that in the latter we only compute the values
$$\alpha_{i_j} = \big|\{t \in [d] \mid |A_{t, h_t(i_j)}| = \max_{r \in \{1,2\}} |A_{t,r}|\}\big|$$
for the $i_j \in [n^c]$ corresponding to the $K_i$ largest values $t_{i_j}^{-1/p}$ in the set $\{t_{i_1}^{-1/p}, \dots, t_{i_{n^{c-1}}}^{-1/p}\}$, whereas in the former all values of $\alpha_{i_j}$ are computed to find a potential maximizer. In other words, count-max with Fast-Update only searches through the subset $\mathcal{K} \subset [n^c]$ for a maximizer instead of searching through all of $[n^c]$ (here $\mathcal{K}$ is as defined earlier in this section). Since count-max never outputs an index $i_j$ that is not a unique maximizer with high probability, we know that the output of the original algorithm, if it does not fail, must be the $i_j$ with $j = \arg\max_{j'}\{t_{i_{j'}}^{-1/p}\}$, and therefore $i_j \in \mathcal{K}$. Note the $n^{-c}$ failure probability can be safely absorbed into the additive $n^{-c}$ error of the perfect $L_p$ sampler. Thus the new algorithm will also output $i_j$. Since the new algorithm with Fast-Update searches over the subset $\mathcal{K} \subset [n^c]$ for a maximizer, if the original algorithm fails then certainly so will the algorithm with Fast-Update. Thus the output of the algorithm using Fast-Update is distributed identically (up to $n^{-c}$ additive error) to the output of the original algorithm, which completes the proof.
Runtime & Random Bits. For the last claim, first note that it suffices to generate all continuous random variables used up to $(nmM)^{-c} = 1/\mathrm{poly}(n)$ precision, which is $1/\mathrm{poly}(n)$ additive error after conditioning on the event that all random variables are at most $\mathrm{poly}(n)$ in magnitude (which occurs with probability $1 - n^{-c}$), and recalling that the length of the stream $m$ satisfies $m < \mathrm{poly}(n)$ for a suitably smaller $\mathrm{poly}(n)$ than in the additive error. More formally, we truncate the binary representation of every continuous random variable (both the exponentials and Gaussians) after $O(\log(n))$-bits with a sufficiently large constant. This will result in at most an additive $1/\mathrm{poly}(n)$ error for each bucket $A_{i,j}$ of $A$, which can be absorbed by the adversarial error $\mathcal{V}_{i,j}$ with $|\mathcal{V}_{i,j}| = O(\nu(|A_{i,1}| + |A_{i,2}|))$ that we incur in each of these buckets already in Lemma 3. Thus each random variable requires $O(\log(n))$ bits to specify. Similarly, a precision of at most $(nmM)^{-c}$ is needed in the computation of the $q_j$'s in Figure 4 by Lemma 5, since the routine to compute $\mathrm{Bin}(n, q_j)$ will terminate with probability $1 - n^{-c}$ after querying at most $O(\log(n))$ bits of $q_j$. Now there are at most $2\Pi = O(\nu^{-1}\log(n))$ iterations of the loop in Fast-Update. Within each, our call to sample a binomial random variable is carried out in $\tilde{O}(1)$ time with high probability by Lemma 5 (and thus uses at most $\tilde{O}(1)$ random bits), and there are $\tilde{O}(1)$ entries in $A$ to update (which upper bounds the running time and randomness requirements of Fast-Update-CS).

Note that since the stream has length $m = \mathrm{poly}(n)$, and there are at most $\tilde{O}(\nu^{-1})$ calls made to sample binomial random variables in each update, we can union bound over each call to guarantee that each returns in $\tilde{O}(1)$ time with probability $1 - n^{-c}$. Since $K = \tilde{O}(1)$, we must store an additional $\tilde{O}(1)$ random bits for the individual random variables $h_\ell(i_j)$ for $i_j \in \{i_1, \dots, i_{K_i}\}$. Similarly, we must store $\tilde{O}(\nu^{-1})$ independent Gaussians for the procedure Fast-Update-CS, which also terminates in $\tilde{O}(1)$ time per call (noting that $r = O(\log(n))$), which completes the proof.

5.2 Derandomizing the Algorithm
We now show that our algorithm $L_p$-Sampler with Fast-Update can be derandomized without affecting the space or time complexity. To do this, we use a combination of Nisan's pseudorandom generator (PRG) [Nis92] and the PRG of Gopalan, Kane, and Meka [GKM15]. We begin by introducing Nisan's PRG, which is a deterministic map $G: \{0,1\}^\ell \to \{0,1\}^T$, where $T \gg \ell$ (here we think of $T = \mathrm{poly}(n)$ and $\ell = O(\log^2(n))$). Let $\sigma: \{0,1\}^T \to \{0,1\}$ be an efficiently computable tester. For the case of Nisan's PRG, $\sigma$ must be a tester which reads its random $T$-bit input in a stream, left to right, and outputs either $0$ or $1$ at the end. Nisan's PRG can be used to fool any such tester, which means:
$$\big|\Pr[\sigma(U_T) = 1] - \Pr[\sigma(G(U_\ell)) = 1]\big| < \frac{1}{T^c}$$
where $U_t$ indicates $t$ uniformly random bits for any $t$, and $c$ is a sufficiently large constant. Here the probability is taken over the choice of the random bits $U_T$ and $U_\ell$. In other words, the probability that $\sigma$ outputs $1$ is nearly the same when it is given random input as opposed to input from Nisan's generator. Nisan's theorem states that if $\sigma$ has at most $\mathrm{poly}(T)$ states and uses a working memory tape of size at most $O(\log(T))$, then a seed length of $\ell = O(\log^2(T))$ suffices for the above result [Nis92]. Thus Nisan's PRG fools space-bounded testers $\sigma$ that read their randomness in a stream.
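For intuition, a minimal sketch of Nisan-style stretching follows. It is illustrative only: the parameters are toy values, not the exact construction or analysis of [Nis92], and the `lambda`-based hash family is an assumption standing in for pairwise-independent functions over a field.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of Nisan's recursive doubling: from one word x and k pairwise
# independent hash functions h_1..h_k, define G_0(x) = x and
# G_i(x) = G_{i-1}(x) || G_{i-1}(h_i(x)). A seed of O(k * w) bits thus
# expands into 2^k words of output.
P = 2 ** 31 - 1                               # arithmetic over a prime field

def make_hash():
    a, b = int(rng.integers(1, P)), int(rng.integers(0, P))
    return lambda x: (a * x + b) % P          # pairwise-independent family

def nisan(x, hs):
    if not hs:
        return [x]
    head, h = hs[:-1], hs[-1]
    return nisan(x, head) + nisan(h(x), head)

hs = [make_hash() for _ in range(10)]
stream = nisan(int(rng.integers(0, P)), hs)   # 1024 pseudorandom words
print(len(stream), stream[:3])
```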
Half-space fooling PRGs. Our derandomization crucially uses the PRG of Gopalan, Kane, and Meka [GKM15], which fools a certain class of Fourier transforms. Utilizing the results of [GKM15], we will design a PRG that can fool arbitrary functions of $\lambda = O(\log(n))$ halfspaces, using a seed of length $O(\log^2(n)(\log\log(n))^2)$. We remark that in [GKM15] it is shown how to obtain such a PRG for a function of a single half-space. Using extensions of the techniques in that paper, we demonstrate that the same PRG with a smaller precision $\epsilon$ can be used to fool functions of more half-spaces. We now introduce the main result of [GKM15]. Let $\mathbb{C}_{\leq 1} = \{c \in \mathbb{C} \mid |c| \leq 1\}$.

Definition 4 (Definition 1 [GKM15]). An $(m,n)$-Fourier shape $f: [m]^n \to \mathbb{C}_{\leq 1}$ is a function of the form $f(x_1, \dots, x_n) = \prod_{j=1}^{n} f_j(x_j)$ where each $f_j: [m] \to \mathbb{C}_{\leq 1}$.

Theorem 4 (Theorem 1.1 [GKM15]). There is a PRG $G: \{0,1\}^\ell \to [m]^n$ that fools all $(m,n)$-Fourier shapes $f$ with error $\epsilon$ using a seed of length $\ell = O(\log(mn/\epsilon)(\log\log(mn/\epsilon))^2)$, meaning:
$$\big|\mathbb{E}[f(x)] - \mathbb{E}[f(G(y))]\big| \leq \epsilon$$
where $x$ is uniformly chosen from $[m]^n$ and $y$ from $\{0,1\}^\ell$.

For any $a_1, \dots, a_\lambda \in \mathbb{Z}^n$ and $\theta_1, \dots, \theta_\lambda \in \mathbb{Z}$, let $H_i: \mathbb{R}^n \to \{0,1\}$ be the function given by $H_i(X_1, \dots, X_n) = \mathbb{1}[a_{i1}X_1 + a_{i2}X_2 + \dots + a_{in}X_n > \theta_i]$, where $\mathbb{1}$ is the indicator function. We now define the notion of a $\lambda$-halfspace tester, and what it means to fool one.

Definition 5 ($\lambda$-halfspace tester). A $\lambda$-halfspace tester is any function $\sigma_H: \mathbb{R}^n \to \{0,1\}$ which, on input $X = (X_1, \dots, X_n)$, outputs $\sigma'_H(H_1(X), \dots, H_\lambda(X)) \in \{0,1\}$, where $\sigma'_H$ is any fixed function $\sigma'_H: \{0,1\}^\lambda \to \{0,1\}$. In other words, the Boolean-valued function $\sigma_H(X)$ only depends on the values $(H_1(X), \dots, H_\lambda(X))$. A $\lambda$-halfspace tester is said to be $M$-bounded if all the half-space coefficients $a_{ij}$ and $\theta_i$ are integers of magnitude at most $M$, and each $X_i$ is drawn from a discrete distribution $D$ with support contained in $\{-M, \dots, M\} \subset \mathbb{Z}$.

Definition 6 (Fooling a $\lambda$-halfspace tester). A PRG $G: \{0,1\}^\ell \to \mathbb{R}^n$ is said to $\epsilon$-fool the class of $\lambda$-halfspace testers under a distribution $D$ over $\mathbb{R}^n$ if for every set of $\lambda$ halfspaces $H = (H_1, \dots, H_\lambda)$ and every $\lambda$-halfspace tester $\sigma_H: \mathbb{R}^n \to \{0,1\}$, we have:
$$\big|\Pr_{X \sim D}[\sigma_H(X) = 1] - \Pr_{y \sim \{0,1\}^\ell}[\sigma_H(G(y)) = 1]\big| < \epsilon$$
Here $\ell$ is the seed length of $G$.

We will consider only product distributions $D$. In other words, we assume that each coordinate $X_i$ is drawn i.i.d. from a fixed distribution $D$ over $\{-M, \dots, M\} \subset \mathbb{Z}$. We consider PRGs $G: \{0,1\}^\ell \to \{-M, \dots, M\}^n$ which take in a random seed of length $\ell$ and output an $X' \in \{-M, \dots, M\}^n$ such that any $M$-bounded $\lambda$-halfspace tester will be unable to distinguish $X'$ from $X \sim D^n$ (where $D^n$ is the product distribution of $D$, such that each $X_i \sim D$ independently). The following lemma demonstrates that the PRG of [GKM15] can be used to fool $M$-bounded $\lambda$-halfspace testers. The authors would like to thank Raghu Meka for providing us with a proof of Lemma 7.
Lemma 7. Suppose $X_i \sim D$ is a distribution on $\{-M, \dots, M\}$ that can be sampled from with $\log(M') = O(\log(M))$ random bits. Then, for any $\epsilon > 0$ and constant $c \geq 1$, there is a PRG $G: \{0,1\}^\ell \to \{-M, \dots, M\}^n$ which $\epsilon(nM)^{-c\lambda}$-fools the class of all $M$-bounded $\lambda$-halfspace testers on input $X \sim D^n$ with a seed of length $\ell = O(\lambda\log(nM/\epsilon)(\log\log(nM/\epsilon))^2)$ (assuming $\lambda \leq n$). Moreover, if $G(y) = X' \in \{-M, \dots, M\}^n$ is the output of $G$ on random seed $y \in \{0,1\}^\ell$, then each coordinate $X'_i$ can be computed in $O(\ell)$-space and in $\tilde{O}(1)$ time, where $\tilde{O}$ hides $\mathrm{poly}(\log(nM))$ factors.

Proof. Let $X = (X_1, \dots, X_n)$ be uniformly chosen from $[M']^n$ for some $M' = \mathrm{poly}(M)$, and let $Q: [M'] \to \{-M, \dots, M\}$ be such that $Q(X_i) \sim D$ for each $i \in [n]$. Let $a_1, \dots, a_\lambda \in \mathbb{Z}^n$, $\theta_1, \dots, \theta_\lambda \in \mathbb{Z}$ be $\log(M)$-bit integers, where $H_i(x) = \mathbb{1}[\langle a_i, x\rangle > \theta_i]$. Let $Y_i = \langle Q(X), a_i\rangle - \theta_i$. Note that $Y_i \in [-2M^2 n, 2M^2 n]$. So fix any $\alpha_i \in [-2M^2 n, 2M^2 n]$ for each $i \in [\lambda]$, and let $\alpha = (\alpha_1, \dots, \alpha_\lambda)$. Let $h_\alpha(x) = \mathbb{1}(Y_1 = \alpha_1)\cdot\mathbb{1}(Y_2 = \alpha_2)\cdots\mathbb{1}(Y_\lambda = \alpha_\lambda)$, where $\mathbb{1}(\cdot)$ is the indicator function. Now define $f(x) = \sum_{j=1}^{\lambda} (2M^2 n)^{j-1}\langle a_j, x\rangle$ for any $x \in \mathbb{Z}^n$. Note that $f(Q(X)) \in \{-(M^2 n)^{O(\lambda)}, \dots, (M^2 n)^{O(\lambda)}\}$. We define the Kolmogorov distance between two integer-valued random variables $Z, Z'$ by $d_K(Z, Z') = \max_{k \in \mathbb{Z}}(|\Pr[Z \leq k] - \Pr[Z' \leq k]|)$. Let $X' \in [M']^n$ be generated via the $(M', n)$-Fourier shape PRG of [GKM15] with error $\epsilon'$ (Theorem 1.1 [GKM15]). Observe $\mathbb{E}[h_\alpha(Q(X))] = \Pr[f(Q(X)) = \sum_{j=1}^{\lambda}(2M^2 n)^{j-1}\alpha_j]$, so
$$\big|\mathbb{E}[h_\alpha(Q(X))] - \mathbb{E}[h_\alpha(Q(X'))]\big| \leq 2\, d_K\big(f(Q(X)), f(Q(X'))\big)$$
Now by Lemma 9.2 of [GKM15], $d_K(f(Q(X)), f(Q(X'))) = O\big(\lambda\log(M^2 n)\, d_{FT}(f(Q(X)), f(Q(X')))\big)$, where for integer-valued $Z, Z'$ we define $d_{FT}(Z, Z') = \max_{\beta \in [0,1]}|\mathbb{E}[\exp(2\pi i\beta Z)] - \mathbb{E}[\exp(2\pi i\beta Z')]|$. Now $\exp(2\pi i\beta f(Q(X))) = \prod_{i=1}^{n}\exp\big(2\pi i\beta(\sum_{j=1}^{\lambda}(2M^2 n)^{j-1} a_{ji})Q(X_i)\big)$, which is an $(M', n)$-Fourier shape as in Definition 4. Thus by Theorem 4 (Theorem 1.1 of [GKM15]), we have $d_{FT}(f(Q(X)), f(Q(X'))) \leq \epsilon'$. Thus
$$\big|\mathbb{E}[h_\alpha(Q(X))] - \mathbb{E}[h_\alpha(Q(X'))]\big| = O(\lambda\log(M^2 n)\,\epsilon')$$
Now let $\sigma_H(x) = \sigma'_H(H_1(x), \dots, H_\lambda(x))$ be any $M$-bounded $\lambda$-halfspace tester on $x \sim D^n$. Since the inputs to the halfspaces $H_i$ of $\sigma'_H$ are all integers of magnitude at most $M^2 n$, let $A$ be the set of $\alpha$ such that $Y = (Y_1, \dots, Y_\lambda) = \alpha$ implies that $\sigma_H(Q(X)) = 1$, where $Q(X) \sim D^n$ as above. Recall here that $Y_i = \langle Q(X), a_i\rangle - \theta_i$. Then we can think of $\sigma_H(X)$ as $\sigma''_H(Y_1, \dots, Y_\lambda)$ for some function $\sigma''_H: \{-2M^2 n, \dots, 2M^2 n\}^\lambda \to \{0,1\}$, and in this case we have $A = \{\alpha \mid \sigma''_H(\alpha) = 1\}$. Then
$$\big|\mathbb{E}[\sigma_H(Q(X))] - \mathbb{E}[\sigma_H(Q(X'))]\big| \leq \sum_{\alpha \in A}\big|\mathbb{E}[h_\alpha(Q(X))] - \mathbb{E}[h_\alpha(Q(X'))]\big| \leq \sum_{\alpha \in A} O(\lambda\log(M^2 n)\,\epsilon')$$
Now note that $|A| = (nM)^{O(\lambda)}$, so setting $\epsilon' = \epsilon(nM)^{-O(\lambda)}$ with a suitably large constant, we obtain $|\mathbb{E}[\sigma_H(Q(X))] - \mathbb{E}[\sigma_H(Q(X'))]| \leq \epsilon(nM)^{-c\lambda}$ as needed. By Theorem 4, the seed required is of length $\ell = O(\lambda\log(nM/\epsilon)(\log\log(nM/\epsilon))^2)$ as needed. The space and time required to compute each coordinate follow from Proposition 3 below.
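The crux of the proof above is that the joint value of all $\lambda$ linear forms can be folded into a single integer whose complex exponential factors coordinate-wise, i.e. is a Fourier shape. The following tiny numerical check (illustrative parameters only) confirms this factorization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Check: with f(x) = sum_j (2*M^2*n)^(j-1) * <a_j, x>, for any beta
# exp(2*pi*i*beta*f(x)) equals the product over coordinates i of
# exp(2*pi*i*beta * (sum_j (2*M^2*n)^(j-1) * a_{j,i}) * x_i).
n, lam, M = 4, 2, 2
a = rng.integers(-M, M + 1, size=(lam, n))
x = rng.integers(-M, M + 1, size=n)
w = (2 * M * M * n) ** np.arange(lam)         # weights (2*M^2*n)^(j-1)
beta = 0.377
lhs = np.exp(2j * np.pi * beta * (w @ (a @ x)))
rhs = np.prod([np.exp(2j * np.pi * beta * (w @ a[:, i]) * x[i]) for i in range(n)])
print(np.allclose(lhs, rhs))                  # True
```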
Proposition 3. In the setting of Lemma 7, if $G(y) = X' \in \{-M, \dots, M\}^n$ is the output of $G$ on random seed $y \in \{0,1\}^\ell$, then each coordinate $X'_i$ can be computed in $O(\ell)$-space and in $\tilde{O}(1)$ time, where $\tilde{O}$ hides $\mathrm{poly}(\log(nM))$ factors.

Proof. In order to analyze the space complexity and runtime needed to compute a coordinate $X'_i$, we must describe the PRG of Theorem 4. The Gopalan-Kane-Meka PRG has 3 main components, which themselves use other PRGs, such as Nisan's PRG, as subroutines. Recall that the PRG generates a pseudo-uniform element $X \sim [m]^n$ that fools a class of Fourier shapes $f: [m]^n \to \mathbb{C}$ on truly uniform input in $[m]^n$. Note that because of the definition of a Fourier shape, if we wish to sample from a distribution $X \sim D$ over $\{-m, \dots, m\}^n$ that is not uniform, but such that each $X_i$ can be sampled with $\log(m')$-bits, we can first fool Fourier shapes $f: [m']^n \to \mathbb{C}$, and then use a function $Q: [m'] \to \{-m, \dots, m\}$ which samples $X_i \sim D$ given $\log(m')$ uniformly random bits. We then fool Fourier shapes $F(x) = \prod_{j=1}^{n} f_j(Q(x_j))$ where $x$ is uniform, and thus $Q(x) \sim D^n$. Thus it will suffice to fool $(m', n)$-Fourier shapes on uniform distributions. For simplicity, for the most part we will omit the parameter $\epsilon$ in this discussion.

The three components of the PRG appear in Sections 5, 6, and 7 of [GKM15] respectively. In this proof, when we write Section $x$ we are referring to the corresponding section of [GKM15]. They consider two main cases: one where the function $f$ has high variance (for some notion of variance), and one where it has low variance. The PRGs use two main pseudo-random primitives: $\delta$-biased and $k$-wise independent hash function families. Formally, a family $\mathcal{H} = \{h: [n] \to [m]\}$ is said to be $\delta$-biased if for all $r \leq n$ distinct indices $i_1, \dots, i_r \in [n]$ and $j_1, \dots, j_r \in [m]$ we have
$$\Pr_{h \sim \mathcal{H}}[h(i_1) = j_1 \wedge \dots \wedge h(i_r) = j_r] = \frac{1}{m^r} \pm \delta$$
The family is said to be $k$-wise independent if this holds with $\delta = 0$ for all $r \leq k$. It is standard that $k$-wise independent families can be generated by taking a polynomial of degree $k$ over a suitably large finite field (requiring space $O(k\log(mn))$). Furthermore, a value $h(i)$ from a $\delta$-biased family can be generated by taking products of two $O(\log(n/\delta))$-bit integers over a suitable finite field [Kop13] (requiring space $O(\log(n/\delta))$). So in both cases, computing a value $h(i)$ can be done in space and time that is linear in the space required to store the hash functions (or $O(\log(n/\delta))$-bit integers). Thus, any nested sequence of such hash functions used to compute a given coordinate $X'_i$ can be carried out in space linear in the size required to store all the hash functions.

Now the first PRG (Section 5 [GKM15]) handles the high variance case. The PRG first subsamples the $n$ coordinates at $\log(n)$ levels using a pairwise hash function (note that a 2-wise permutation is used in [GKM15], which reduces to the computation of a 2-wise hash function). In each level $S_j$ of subsampling, it uses $O(1)$-wise independent hash functions to generate the coordinates $X_i$ for $i \in S_j$. So if we want to compute a value $X_i$, we can carry out one hash function computation $h(i)$ to determine the $j$ such that $X_i \in S_j$, and then carry out another hash function computation $h_j(i) = X_i$.
Instead of using $\log(n)$ independent hash functions $h_j$, each of size $O(\log(nm))$, for each of the buckets $S_j$, they derandomize this with the PRG of Nisan and Zuckerman [NZ96] to use a single seed of length $O(\log(n))$. Now the PRG of Nisan and Zuckerman can be evaluated online, in the sense that it reads its random bits in a stream and writes its pseudo-random output on a one-way tape, and runs in space linear in the seed required to store the generator itself (see Definition 4 of [NZ96]). Such generators are composed to yield the final PRG of Theorem 2 [NZ96]; however, by Lemma 4 of that paper, such online generators are composable. Thus the entire generator of [NZ96] is online, and so any substring of the pseudorandom output can be computed in space linear in the seed of the generator by a single pass over the random input. Moreover, by Theorem 1 of [NZ96] in the setting of [GKM15], such a substring can be computed in $\tilde{O}(1)$ time, since it is only generating $\tilde{O}(1)$ random bits to begin with.

On top of this, the PRG of Section 5 [GKM15] first splits the coordinates $[n]$ via a limited-independence hash function into $\mathrm{poly}(\log(1/\epsilon))$ buckets, and applies the algorithm described above on each. To do this second layer of bucketing and not need fresh randomness for each bucket, they use Nisan's PRG [Nis92] with a seed of length $\log(n)\log\log(n)$. Now any bit of Nisan's PRG can be computed by several nested hash function computations, carried out in space linear in the seed required to store the PRG. Thus any substring of Nisan's PRG can be computed in space linear in the seed and time $\tilde{O}(1)$. Thus to compute $X'_i$, we first determine which bucket it hashes to, which involves computing random bits from Nisan's PRG. Then we determine a second partitioning, which is done via a 2-wise hash function, and finally we compute the value of $X'_i$ via an $O(1)$-wise hash function, where the randomness for this hash function is stored in a substring output by the PRG of [NZ96]. Altogether, we conclude that the PRG of Section 5 [GKM15] is such that a value $X'_i$ can be computed in space linear in the seed length and $\tilde{O}(1)$ time.

Next, in Section 6 of [GKM15], another PRG is introduced which reduces the problem to the case of $m \leq \mathrm{poly}(n)$. Assuming a PRG $G_1$ is given which fools $(m, n)$-Fourier shapes, they design a PRG $G_2$ using $G_1$ which fools $(m^2, n)$-Fourier shapes. Applying this $O(\log\log(m))$ times reduces to the case of $m \leq n$. The PRG is as follows. Let $G_1, \dots, G_t$ be the iteratively composed generators, where $t = O(\log\log(m))$. To compute the value of $(G_i)_j \in [m]$, where $(G_i)_j$ is the $j$-th coordinate of $G_i \in [m]^n$, the algorithm first implicitly generates a matrix $Z \in [m]^{\sqrt{m} \times m}$. An entry $Z_{p,q}$ is generated as follows. First one applies a $k$-wise hash function $h(q)$ (for some $k$), and uses the $O(\log(m))$-bit value of $h(q)$ as a seed for a second 2-wise independent hash function $h'_{h(q)}$. Then $Z_{p,q} = h'_{h(q)}(p)$. Thus within a column $q$ of $Z$, the entries are 2-wise independent, and separate columns of $Z$ are $k$-wise independent. This requires $O(k\log(m))$-space to store, and the nested hash functions can be computed in $O(k\log(m))$-space. Thus computing $Z_{i,j}$ is done in $\tilde{O}(1)$ time and space linear in the seed length. Then we set $(G_i)_j = Z_{(G_{i-1})_j, j}$ for each $j \in [n]$. Thus $(G_i)_j$ only depends on $(G_{i-1})_j$ and the random seeds stored for the two hash functions used to evaluate entries of $Z$.
So altogether, the final output coordinate $(G_t)_j$ can be computed in space linear in the seed length required to store all required hash functions, and in time $\tilde{O}(1)$. Note importantly that the recursion is linear, in the sense that computing $(G_i)_j$ involves only one query to compute $(G_{i-1})_{j'}$ for some $j'$.

Next, in Section 7 of [GKM15], another PRG is introduced for the low-variance case, which reduces the size of $n$ to $\sqrt{n}$, but blows up $m$ polynomially in the process. Formally, it shows that given a PRG $G'_1$ that fools $(\mathrm{poly}(n), \sqrt{n})$-Fourier shapes, one can design a PRG $G'_2$ that fools $(m, n)$-Fourier shapes with $m < n$ (here the $\mathrm{poly}(n)$ can be much larger than $n$). To do so, the PRG first hashes the $n$ coordinates into $\sqrt{n}$ buckets $k$-wise independently, and then in each bucket uses $k$-wise independence to generate the value of the coordinate. A priori, this requires $\sqrt{n}$ independent seeds for the hash functions in each of the buckets. To remove this requirement, it uses $G'_1$ to generate the $\sqrt{n}$ seeds required from a smaller seed. Thus to compute a coordinate $i$ of $G'_2$, simply evaluate a $k$-wise independent hash function on $i$ to determine which bucket $j \in [\sqrt{n}]$ the item $i$ is hashed into. Then evaluate $G'_1(j)$ to obtain the seed required for the $k$-wise hash function $h_j$, and the final result is given by $h_j(i)$. Note that this procedure only requires one query to the prior generator $G'_1$. The space required to do so is linear in the space required to store the hash functions and the space required to evaluate a coordinate of the output of $G'_1$, which will be linear in the size used to store $G'_1$ by induction.

Finally, the overall PRG composes the PRGs from Sections 6 and 7 to fool larger $n, m$ in the case of low variance. Suppose we are given a PRG $G_1$ which fools $(m'', \sqrt{n'})$-Fourier shapes for some $m'' < (n')^2$. We show how to construct a PRG which fools $(m', n')$-Fourier shapes for any $m' \leq (n')^2$. Let $G_3$ be the PRG obtained by first applying the PRG from Section 6 on $G_1$ as an initial point, which gives a PRG that fools $(\mathrm{poly}(n'), \sqrt{n'})$-Fourier shapes, and then applying the PRG from Section 7 on top, which now fools $(m', n')$-Fourier shapes (with low variance). Let $G_4$ be the generator from Section 5 which fools $(m', n')$-Fourier shapes with high variance. The final algorithm for fooling the class of all $(m', n')$-Fourier shapes given $G_1$ computes a generator $G_5$ such that the $i$-th coordinate is $(G_5)_i = (G_3)_i \oplus (G_4)_i$, where $\oplus$ is addition mod $m'$. This allows one to simultaneously fool high- and low-variance Fourier shapes of the desired $m', n'$. If $m > (n')^2$, one can apply the PRG from Section 6 one last time on top of $G_5$ to fool arbitrary $m$. Thus if, for any $i$, the $i$-th coordinates of $G_3$ and $G_4$ can be computed in $\tilde{O}(1)$ time and space linear in the size required to store the random seed, then so can $(G_5)_i$. Thus going from $G_1$ to $G_5$ takes a generator that fools $(m'', \sqrt{n'})$-Fourier shapes to one that fools $(m', n')$-Fourier shapes, and similarly we can compose this to design a generator that fools $(m', (n')^2)$-Fourier shapes. Composing this $t = O(\log\log n)$ times, we obtain a $G_t$ which fools $(m, n)$-Fourier shapes for any $m, n$. As a base case (to define the first PRG $G_1$), the PRG of [NZ96] is used, which we have already discussed can be evaluated online in space linear in the seed required to store it, and in time polynomial in the length of the seed.

Now we observe an important property of this recursion.
At every step of the recursion, one is tasked with computing the $j$-th coordinate output by some PRG for some $j$, and the result will depend only on a query for the $j'$-th coordinate of another PRG for some $j'$ (as well as some additional values which are computed using the portion of the random seed dedicated to this step in the recursion). Thus at every step of the recursion, only one query is made for a coordinate of a PRG at a lower level of the recursion. Thus the recursion is linear, in the sense that the computation path has only $L$ nodes instead of $2^L$ (which would occur if two queries to coordinates $j', j''$ were made to a PRG at a lower level). Since each level of the recursion itself uses $O(\log\log(nm))$ levels of recursion, and also has the property that each level queries the lower level at only one point, it follows that the total depth of the recursion is $O((\log\log(nm))^2)$. At each point, storing the information required for this recursion on the stack requires only $O(\log(nm))$-bits of space to store the relevant information identifying the instance of the PRG in the recursion, along with its associated portion of the random seed. Thus the total space required to compute a coordinate via these $O((\log\log(nm))^2)$ recursions is $O(\log(nm)(\log\log(nm))^2)$, which is linear in the seed length. Moreover, the total time is $\tilde{O}(1)$, since each step of the recursion requires $\tilde{O}(1)$ time.

We use the prior technique to derandomize a wide class of linear sketches $A \cdot f$ such that the entries of $A$ are independent and can be sampled using $O(\log(n))$-bits, and such that the behavior of the algorithm only depends on the sketch $Af$. It is well known that there are strong connections between turnstile streaming algorithms and linear sketches, insofar as practically all turnstile streaming algorithms are in fact linear sketches. The equivalence of turnstile algorithms and linear sketches has even been formalized [LNW14], with some restrictions. Our results show that all such sketches that use independent, efficiently sampled entries in their sketching matrix $A$ can be derandomized with our techniques. As an application, we derandomize the count-sketch variant of Minton and Price [MP14], a problem which to the best of the authors' knowledge was hitherto open.
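The linear-recursion property is worth seeing concretely. The toy sketch below (purely illustrative; `sha256` merely stands in for the hash-family evaluations at each level) shows why one output coordinate costs a path of length equal to the recursion depth rather than an exponentially large tree.

```python
import hashlib

def h(seed, *args):
    """Stand-in for the O(1)-space hash evaluations at one level."""
    s = f"{seed}:{args}".encode()
    return int.from_bytes(hashlib.sha256(s).digest()[:4], "big")

def coord(level, j, seeds):
    # Each level makes exactly ONE query to the level below (a linear
    # recursion): total work is O(levels), not O(2^levels).
    if level == 0:
        return h(seeds[0], j)
    return h(seeds[level], coord(level - 1, j, seeds), j)

seeds = list(range(8))
print(coord(7, 42, seeds))
```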
Lemma 8. Let ALG be any streaming algorithm which, on stream vector $f \in \{-M, \dots, M\}^n$ for some $M = \mathrm{poly}(n)$, stores only a linear sketch $A \cdot f$ such that the entries of the random matrix $A \in \mathbb{R}^{k \times n}$ are i.i.d. and can be sampled using $O(\log(n))$-bits. Fix any constant $c \geq 1$. Then ALG can be implemented using a random matrix $A'$ using $O(k\log(n)(\log\log n)^2)$ bits of space, such that for every vector $y \in \mathbb{R}^k$ with entry-wise bit complexity of $O(\log(n))$,
$$\Big|\Pr[Af = y] - \Pr[A'f = y]\Big| < n^{-ck}$$

Proof.
We can first scale all entries of the algorithm by the bit complexity so that each entry of $A$ is an $O(\log(n))$-bit integer. Then by Lemma 7, we can store the randomness needed to compute each entry of $A'$ with $O(k\log(n)(\log\log n)^2)$-bits of space, such that $A'$ $n^{-ck}$-fools the class of all $O(k)$-halfspace testers, in particular the one which checks, for each coordinate $i \in [k]$, whether both $(A'f)_i < y_i + 1$ and $(A'f)_i > y_i - 1$, and accepts only if both hold for all $i \in [k]$. By Proposition 3, the entries of $A'$ can be computed in space linear in the size of the random seed required to store $A'$. Since we have scaled all values to be integers, $n^{-ck}$-fooling this tester is equivalent to the theorem statement. Note that the test $(A'f)_i < y_i + 1$ can be made into a half-space test as follows. Let $X^i \in \mathbb{R}^{nk}$ be the vector such that $X^i_{j + (i-1)n} = f_j$ for all $j \in [n]$ and $X^i_j = 0$ otherwise. Let $\mathrm{vec}(A) \in \mathbb{R}^{nk}$ be the vectorization of $A$. Then $(Af)_i = \langle \mathrm{vec}(A), X^i\rangle$, and all the entries of $\mathrm{vec}(A)$ are i.i.d., which allows us to make the stated constraints into the desired half-space constraints.

Observe that the above lemma derandomized the linear sketch $Af$ by writing each coordinate $(Af)_i$ as a linear combination of the random entries of $\mathrm{vec}(A)$. Note, however, that the above proof would hold if we added the values of any $O(k)$ additional linear combinations $\langle X^j, \mathrm{vec}(A)\rangle$ to the lemma, where each $X^j \in \{-M, \dots, M\}^{kn}$. This will be useful, since the behavior of some algorithms, for instance count-sketch, may depend not only on the sketch $Af$ but also on certain values or linear combinations of values within the sketch $A$. This is formalized in the following corollary.
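The vectorization identity used above is a one-liner to confirm (illustrative sizes only):

```python
import numpy as np

rng = np.random.default_rng(0)

# With vec(A) the row-major vectorization of A, and X^i the vector
# placing f in the i-th length-n block, (Af)_i = <vec(A), X^i>, so each
# equality test on a sketch coordinate becomes a pair of halfspace
# tests over the i.i.d. entries of vec(A).
k, n = 3, 5
A = rng.integers(-3, 4, size=(k, n))
f = rng.integers(-3, 4, size=n)
i = 1
Xi = np.zeros(k * n, dtype=int)
Xi[i * n:(i + 1) * n] = f
print((A @ f)[i] == A.reshape(-1) @ Xi)       # True
```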
Corollary 3. Let the entries of $A \in \mathbb{R}^{k \times n}$ be drawn i.i.d. from a distribution which can be sampled using $O(\log n)$-bits, and let $\mathrm{vec}(A) \in \mathbb{R}^{nk}$ be the vectorization of $A$. Let $X \in \mathbb{R}^{t \times nk}$ be any fixed matrix with entries contained within $\{-M, \dots, M\}$, where $M = \mathrm{poly}(n)$. Then there is a distribution over random matrices $A' \in \mathbb{R}^{k \times n}$ which can be generated and stored using $O(t\log(n)(\log\log n)^2)$ bits of space, such that for every vector $y \in \mathbb{R}^t$ with entry-wise bit complexity of $O(\log(n))$,
$$\Big|\Pr[X \cdot \mathrm{vec}(A) = y] - \Pr[X \cdot \mathrm{vec}(A') = y]\Big| < n^{-ct}$$

Proof.
The proof is nearly identical to that of Lemma 8, where we first scale the entries to be $O(\log(n))$-bit integers, and then apply two half-space tests to each coordinate of $X \cdot \mathrm{vec}(A')$.
Theorem 5. Let ALG be any streaming algorithm which, on stream vector $f \in \{-M, \dots, M\}^n$ and fixed matrix $X \in \mathbb{R}^{t \times nk}$ with entries contained within $\{-M, \dots, M\}$, for some $M = \mathrm{poly}(n)$, outputs a value that only depends on the sketches $A \cdot f$ and $X \cdot \mathrm{vec}(A)$. Assume that the entries of the random matrix $A \in \mathbb{R}^{k \times n}$ are i.i.d. and can be sampled using $O(\log(n))$-bits. Let $\sigma: \mathbb{R}^k \times \mathbb{R}^t \to \{0,1\}$ be any tester which measures the success of ALG, namely $\sigma(Af, X \cdot \mathrm{vec}(A)) = 1$ whenever ALG succeeds. Fix any constant $c \geq 1$. Then ALG can be implemented using a random matrix $A'$ using a random seed of length $O((k + t)\log(n)(\log\log n)^2)$, such that:
$$\Big|\Pr[\sigma(Af, X \cdot \mathrm{vec}(A)) = 1] - \Pr[\sigma(A'f, X \cdot \mathrm{vec}(A')) = 1]\Big| < n^{-c(k+t)}$$
and such that each entry of $A'$ can be computed in time $\tilde{O}(1)$ and using working space linear in the seed length.

Proof. As in Lemma 8, we first scale all entries of the algorithm by the bit complexity so that each entry of $A$ is an $O(\log(n))$-bit integer. Then there is an $M' = \mathrm{poly}(n)$ such that each entry of $A \cdot f$ and $X \cdot \mathrm{vec}(A)$ will be an integer of magnitude at most $M'$. First note that the sketches $A \cdot f$ and $X \cdot \mathrm{vec}(A)$ can be written as one linear sketch $X_0 \cdot \mathrm{vec}(A)$ where $X_0 \in \mathbb{R}^{(k+t) \times kn}$. Then $\sigma$ can be written as a function $\sigma_0: \mathbb{R}^{k+t} \to \{0,1\}$ evaluated on $X_0 \cdot \mathrm{vec}(A)$. Let $S = \{y \in \{-M', \dots, M'\}^{k+t} \mid \sigma_0(y) = 1\}$. Then by Corollary 3 (applied with a sufficiently large constant $c$), we have $|\Pr[X_0 \cdot \mathrm{vec}(A) = y] - \Pr[X_0 \cdot \mathrm{vec}(A') = y]| < n^{-2c(k+t)}$ for all $y \in S$. Noting that $|S| = n^{O(k+t)}$, we have $\Pr[\sigma_0(X_0 \cdot \mathrm{vec}(A)) = 1] = \sum_{y \in S}\Pr[X_0 \cdot \mathrm{vec}(A) = y] = \sum_{y \in S}(\Pr[X_0 \cdot \mathrm{vec}(A') = y] \pm n^{-2c(k+t)}) = \Pr[\sigma_0(X_0 \cdot \mathrm{vec}(A')) = 1] \pm n^{-c(k+t)}$ as desired. The final claim follows from Proposition 3.

We now show how this general derandomization procedure can be used to derandomize the count-sketch variant of Minton and Price [MP14]. Minton and Price's analysis shows improved concentration bounds for count-sketch when the random signs $g_i(k) \in \{1, -1\}$ are fully independent. They demonstrate that in this setting, if $y \in \mathbb{R}^n$ is the count-sketch estimate of a stream vector $f$, with $k$ columns and $d$ rows, then for any $t \leq d$ and index $i \in [n]$ we have:
$$\Pr\Big[(f_i - y_i)^2 > \frac{t}{dk}\|f_{\mathrm{tail}(k)}\|_2^2\Big] \leq e^{-\Omega(t)}$$
In order to apply this algorithm in $o(n)$ space, however, one must first derandomize it from using fully independent random signs. To the best of the authors' knowledge, the best derandomization procedure known before was a black-box application of Nisan's PRG, which results in $O(\epsilon^{-2}\log^3(n))$-bits of space. For the purposes of the theorem, we replace the notation $1/\epsilon^2$ with $k$ (the number of columns of count-sketch, up to a constant).
The count-sketch variant of [MP14] can be implemented so that if A ∈ R^{d×k} is a count-sketch table, then for any t ≤ d and index i ∈ [n] we have:

Pr[(f_i − y_i)^2 > (t/(dk)) ‖f_tail(k)‖_2^2] ≤ e^{−Ω(t)}

and such that the total space required is O(kd log(n)(log log n)^2).

Proof. We first remark that the following modification to the count-sketch procedure does not affect the analysis of [MP14]. Let A ∈ R^{d×k} be a d × k count-sketch matrix. The modification is as follows: instead of each variable h_i(k) being uniformly distributed in {1, 2, . . . , k}, we replace them with variables h_{i,j,k} ∈ {0, 1} for (i, j, k) ∈ [d] × [k] × [n], such that the h_{i,j,k} are all i.i.d. and equal to 1 with probability 1/k. We also let g_{i,j,k} ∈ {1, −1} be i.i.d. Rademacher variables (1 with probability 1/2). The buckets are then given by A_{i,j} = Σ_{k=1}^n f_k g_{i,j,k} h_{i,j,k}, and the estimate y_t of f_t for t ∈ [n] is given by:

y_t = median_{(i,j) : h_{i,j,t}=1} {g_{i,j,t} A_{i,j}}

Thus the element f_t can be hashed into multiple buckets in the same row of A, or even be hashed into none of the buckets in a given row. By Chernoff bounds, |{(i, j) : h_{i,j,t} = 1}| = Θ(d) with high probability for all t ∈ [n]. Observe that the marginal distribution of each bucket is the same as in the count-sketch used in [MP14], and moreover separate buckets are fully independent. The key property used in the analysis of [MP14] is that the final estimator is a median over estimators whose error is independent and symmetric, and therefore the bounds stated in the theorem still hold after this modification [Pri18].

Given this, the entire sketch stored by the streaming algorithm is B · f, where B ∈ R^{dk×n} is the random matrix with vec(A) = B · f; the entries of B are i.i.d., and can be sampled with O(log(k)) ≤ O(log(n)) bits. Now note that for a fixed i, to test the statement that (f_i − y_i)^2 > (t/(dk)) ‖f_tail(k)‖_2^2, one needs to know both the value of the sketch B · f and the value of the i-th column of B, since the estimate can be written as y_i = median_{j∈[kd], B_{j,i}≠0} {B_{j,i} · (Bf)_j}. Note that the i-th column of B (which has kd entries) can simply be written as a sketch of the form X · vec(B), where X ∈ R^{kd×dkn} is a fixed matrix such that X · vec(B) is the i-th column of B, so we also need to store X · vec(B). Thus by Theorem 5, the algorithm can be derandomized to use O(kd log(n)(log log n)^2) bits of space, and such that for any t ≤ d and any i ∈ [n] we have Pr[(f_i − y_i)^2 > (t/(dk)) ‖f_tail(k)‖_2^2] ≤ e^{−Ω(t)} ± n^{−Ω(dk)}.
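To make the modification concrete, here is a small dense simulation of the modified table (our own toy code with hypothetical parameter choices; a streaming implementation would generate the h and g values pseudorandomly on demand rather than storing them).

```python
import numpy as np

def build_table(f, d, k, rng):
    """A[i, j] = sum_t f_t * g[i,j,t] * h[i,j,t], with h[i,j,t] ~ Bernoulli(1/k)
    i.i.d. (an item may land in several buckets of a row, or in none) and
    g[i,j,t] i.i.d. Rademacher signs."""
    n = len(f)
    h = rng.random((d, k, n)) < 1.0 / k
    g = rng.choice([-1.0, 1.0], size=(d, k, n))
    return (g * h * f).sum(axis=2), g, h

def estimate(A, g, h, t):
    """y_t = median of g[i,j,t] * A[i,j] over the Theta(d) buckets with h[i,j,t] = 1."""
    mask = h[:, :, t]
    vals = (g[:, :, t] * A)[mask]
    return float(np.median(vals)) if vals.size else 0.0

rng = np.random.default_rng(1)
f = rng.normal(size=1000)
f[7] = 50.0                                    # a heavy coordinate to recover
A, g, h = build_table(f, d=30, k=20, rng=rng)
print(estimate(A, g, h, 7), "vs true", f[7])
```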
The Derandomization. We now introduce the notation which will be used in our derandomization. Our L_p sampler uses two sources of randomness for which we must construct PRGs. The first, r_e, is the randomness needed to construct the exponential random variables t_i, and the second, r_c, is the randomness needed for the fully random hash functions and signs used in count-max. Note that r_e, r_c both require poly(n) bits by Lemma 6. From here on, we fix any index i ∈ [n]. Our L_p sampler can then be thought of as a tester A(r_e, r_c) ∈ {0, 1}, which tests, on inputs r_e, r_c, whether the algorithm will output i ∈ [n]. Let G(x) be Nisan's PRG, and let G(y) be the half-space PRG of [GKM15] (we distinguish the two generators by the name of their seed, x or y). For two values a, b ∈ R, we write a ∼_ε b to denote |a − b| < ε. Our goal is to show that

Pr_{r_e, r_c}[A(r_e, r_c)] ∼_{n^{−c}} Pr_{x,y}[A(G(y), G(x))]

where x, y are seeds of length O(log^2(n)(log log n)^2), and c is an arbitrarily large constant.
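Unpacking the notation, the plan is a standard two-step hybrid: replace r_c by G(x) while r_e stays truly random, then replace r_e by G(y) while x is fixed. In LaTeX, the triangle inequality underlying both averaging arguments below reads:

```latex
\left|\Pr_{r_e,r_c}\!\big[A(r_e,r_c)\big]-\Pr_{x,y}\!\big[A(G(y),G(x))\big]\right|
\;\le\;
\left|\Pr_{r_e,r_c}\!\big[A(r_e,r_c)\big]-\Pr_{r_e,x}\!\big[A(r_e,G(x))\big]\right|
\;+\;
\left|\Pr_{r_e,x}\!\big[A(r_e,G(x))\big]-\Pr_{x,y}\!\big[A(G(y),G(x))\big]\right|
```

Each term on the right is bounded by n^{−c}: the first by fooling the tester A_{r_e}(·) for every fixed r_e, and the second by fooling the tester A_{G(x)}(·) for every fixed seed x.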
Theorem 7. A single instance of the algorithm L_p Sampler using Fast-Update as its update procedure can be derandomized using a random seed of length O(log^2(n)(log log n)^2), and thus can be implemented in this space. Moreover, this does not affect the time complexity as stated in Lemma 6.

Proof. First note that by Lemma 6, we require Õ(ν^{−1}) random bits for each i ∈ [n], and thus we require a total of Õ(nν^{−1}) = poly(n) random bits to be generated. Since Nisan's PRG requires the tester to read its random input in a stream, we can use a standard reordering trick on the elements of the stream, so that all the updates to a given coordinate i ∈ [n] occur at the same time (see [Ind06]). This does not affect the output distribution of our algorithm, since linear sketches do not depend on the ordering of the stream. Now let c′ be the constant such that the algorithm L_p Sampler duplicates each coordinate n^{c′−1} times. In other words, count-max is run on the stream vector F ∈ R^{n^{c′}}; let N = n^{c′}. Now, as above, we fix any index i ∈ [N], and attempt to fool the tester which checks whether, on a given random string, our algorithm would output i. For any fixed randomness r_e for the exponentials, let A_{r_e}(r_c) be the tester which tests whether our L_p sampler would output the index i, where now the bits r_e are hard-coded into the tester, and the random bits r_c are taken as input and read in a stream. We first claim that this tester can be implemented in O(log(n)) space.

To see this, note that A_{r_e}(r_c) must simply count the number of rows of count-max such that item i is hashed into the largest bucket (in absolute value) of that row, and output 1 if this number is at least (4/5)d, where d is the number of rows in count-max. To do this, A_{r_e}(r_c) can break r_c into d blocks of randomness, where the j-th block is used only for the j-th row of count-max. It can then fully construct the values of the counters in a row, one row at a time, reading the bits of r_c in a stream. To build a bucket, it looks at the first element of the stream, uses r_c to find the bucket it hashes to and the Gaussian scaling it receives, adds this value to that bucket, and then continues with the next element. Note that since r_e is hard-coded into the tester, we can assume the entire stream vector ζ is hard-coded into the tester. Once it constructs a row of count-max, it checks whether i is in the largest bucket in absolute value, and increments an O(log(d))-bit counter if so. Note that it can determine which bucket i hashes to in this row while reading off the block of randomness corresponding to that row. Then it throws out the values of this row and the index of the bucket i hashed to in this row, and builds the next row. Since each row has O(1) buckets, A_{r_e}(r_c) only uses O(log(n)) bits of space at a time. Then using G(x) as Nisan's generator with a random seed x of length O(log^2(n)) bits, we have Pr[A_{r_e}(r_c)] ∼_{n^{−c}} Pr[A_{r_e}(G(x))], where the constant c is chosen to be sufficiently larger than the constant in the n^{−c} additive error of our perfect sampler, as well as the constant c′. Moreover:

Pr[A(r_e, r_c)] = Σ_{r_e} Pr[A_{r_e}(r_c)] Pr[r_e]
= Σ_{r_e} (Pr[A_{r_e}(G(x))] ± n^{−c}) Pr[r_e]
= Pr[A(r_e, G(x))] ± n^{−c}
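The following toy sketch (ours, not the paper's code; Rademacher signs stand in for the Gaussian scalings, and the 4/5 threshold follows the reconstruction above) mimics how A_{r_e}(r_c) rebuilds count-max one row at a time, consuming its random bits in a single pass and retaining only the current row plus a counter; ζ is passed in fixed, mirroring the fact that it is hard-coded once r_e is fixed.

```python
import numpy as np

def tester(zeta, i, d, b, rand_bits):
    """Count, keeping only one row of buckets plus an O(log d)-bit counter,
    the rows in which coordinate i falls into the bucket of largest absolute
    value; accept if this happens in at least a 4/5 fraction of the d rows."""
    wins = 0
    for _ in range(d):
        buckets = np.zeros(b)
        bucket_of_i = -1
        for k, zk in enumerate(zeta):      # zeta is fixed once r_e is hard-coded
            bk, gk = next(rand_bits)       # this row's randomness, read in order
            buckets[bk] += gk * zk
            if k == i:
                bucket_of_i = bk
        if np.argmax(np.abs(buckets)) == bucket_of_i:
            wins += 1
        # the row's buckets are discarded here before the next row is built
    return wins >= (4 * d) / 5

def rand_stream(rng, b):
    while True:
        yield int(rng.integers(b)), float(rng.choice([-1.0, 1.0]))

rng = np.random.default_rng(2)
zeta = rng.normal(size=256)
zeta[17] = 40.0                            # a planted maximizer
print(tester(zeta, 17, d=25, b=10, rand_bits=rand_stream(rng, 10)))
```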
Now fix any Nisan seed x, and consider the tester A_{G(x)}(r_e), which, on fixed count-max randomness G(x), tests whether the algorithm will output i ∈ [n] on the random input r_e for the exponential variables. We first observe that it seems unlikely that A_{G(x)}(r_e) can be implemented in log(n) space while reading its random bits r_e in a stream. This is because each row of count-max depends on the same random bits in r_e used to construct the exponentials t_i; thus it seems A_{G(x)}(r_e) would need to store all log^2(n) bits of count-max at once. However, we will now demonstrate that A_{G(x)}(r_e) is in fact a poly(n)-bounded O(d)-halfspace tester (as defined earlier in this section), where d is the number of rows of count-max, and therefore can be derandomized with the PRG of [GKM15]. By the Runtime & Random bits analysis in Lemma 6, it suffices to take all random variables in the algorithm to be O(log(n))-bit rational numbers. Scaling by a sufficiently large poly(n), we can assume that 1/t_j^{1/p} is a discrete distribution supported on {−T, . . . , T}, where T ≤ poly(n) for a sufficiently large poly(n). We can then remove all values in the support which occur with probability less than 1/poly(n), which only adds an n^{−c} additive error to our sampler. After this, the distribution can be sampled from with poly(T) = poly(n) random bits, which is as needed for the setting of Lemma 7. Note that we can also apply this scaling to the Gaussians in count-max, so that they too are integers of magnitude at most poly(n).

Given this, the distribution of the variables 1/t_j^{1/p} satisfies the conditions of Lemma 7, in particular being poly(n)-bounded; thus we must now show that A_{G(x)}(r_e) is indeed an O(d)-halfspace tester, with integer-valued half-spaces bounded by poly(n). First consider a given row of count-max, and let the buckets be B_1, . . . , B_10. WLOG i hashes into B_1, and we must check whether |B_1| > |B_t| for t = 2, 3, . . . , 10. Let g_j be the random count-max signs (as specified by G(x)), and let S, S_t be the sets of indices which hash to B_1 and B_t respectively. We can run the following 6 half-space tests to test whether |B_1| > |B_t|:

1. Σ_{j∈S} g_j f_j (1/t_j^{1/p}) > 0
2. Σ_{j∈S_t} g_j f_j (1/t_j^{1/p}) > 0
3. a_1 Σ_{j∈S} g_j f_j (1/t_j^{1/p}) + a_2 Σ_{j∈S_t} g_j f_j (1/t_j^{1/p}) > 0, where a_1, a_2 range over all values in {1, −1}.

The tester can decide whether |B_1| > |B_t| by letting a_1 ∈ {1, −1} be the truth value of test 1 (where −a_2 is the truth value of test 2), interpreting "true" as 1 and "false" as −1. It then lets b_t ∈ {0, 1} be the truth value of test 3 on the resulting a_1, a_2 values, and it can correctly declare |B_1| > |B_t| iff b_t = 1. Thus for each of the 9 pairs (B_1, B_t), the tester uses 6 half-space testers to determine whether |B_1| > |B_t|, and so can determine whether i hashed to the max bucket with O(1) half-space tests. So A_{G(x)}(r_e) can test whether the algorithm will output i by testing whether i hashed to the max bucket in at least a 4/5 fraction of the d rows of count-max, using O(d) = O(log(n)) half-space tests in total. Note that by the scaling performed in the prior paragraphs, all coefficients of these half-spaces are integers of magnitude at most poly(n). So by Lemma 7, the PRG G(y) of [GKM15] fools A_{G(x)}(r_e) with a seed y of O(log^2(n)(log log n)^2) bits. So Pr[A_{G(x)}(r_e)] ∼_{n^{−c}} Pr[A_{G(x)}(G(y))], and so by the same averaging argument as used for the Nisan PRG above, we have Pr[A(r_e, G(x))] ∼_{n^{−c}} Pr[A(G(y), G(x))], and so Pr[A(r_e, r_c)] ∼_{2n^{−c}} Pr[A(G(y), G(x))] as desired.

Now fixing any i ∈ [n], let A′_i(r_e, r_c) be the event that the overall algorithm outputs the index i. In other words, A′_i(r_e, r_c) = 1 if A_{i_j}(r_e, r_c) = 1 for some j ∈ [n^{c′−1}], where A_{i_j}(r_e, r_c) = 1 is the event that count-max declares that i_j is the maximum in algorithm L_p Sampler. Thus the probability that the algorithm outputs a non-duplicated coordinate i ∈ [n] is given by:

Pr[A′_i(r_e, r_c)] = Σ_{j=1}^{n^{c′−1}} Pr[A_{i_j}(r_e, r_c)]
= Σ_{j=1}^{n^{c′−1}} (Pr[A_{i_j}(G(y), G(x))] ± n^{−c})
= Pr[A′_i(G(y), G(x))] ± n^{−c+c′}    (4)

where in the last line we take the PRG constant c sufficiently larger than c′ plus the desired additive error constant of our main sampler. In conclusion, replacing the count-max randomness with Nisan's PRG and the exponential random variable randomness with the half-space PRG G(y), we can fool the algorithm which tests the output of our algorithm with a total seed length of O(log^2(n)(log log n)^2).

To show that the stated update time of Lemma 6 is not affected, we first remark that Nisan's PRG simply involves performing O(log(n)) nested hash computations on a string of length O(log(n)) in order to obtain any arbitrary substring of O(log(n)) bits. Thus the runtime of such a procedure is Õ(1) to obtain the randomness needed in each update of a coordinate i ∈ [n^c]. By Lemma 7, the PRG of [GKM15] requires Õ(1) time to sample the O(log(n))-bit string needed to generate an exponential, and moreover can be computed with working space linear in the size of the random seed (note that this is also true of Nisan's PRG, which just involves O(log(n)) nested hash function computations). Thus the update time is only blown up by an Õ(1) factor, which completes the proof.
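The six-test decomposition at the heart of the proof is easy to sanity-check numerically. The sketch below (our own illustration) decides |B_1| > |B_t| exactly as described: two sign tests plus the four non-adaptive instances of the third test, one of which is selected by the observed signs.

```python
import numpy as np

def abs_greater(B1, Bt):
    """Decide |B1| > |Bt| via the 6 half-space tests from the proof:
    test 1 fixes a1 with a1*B1 = |B1|, test 2 fixes a2 with a2*Bt = -|Bt|
    (a2 is the negated truth value of test 2), and the four non-adaptive
    instances of test 3 cover every sign pattern."""
    a1 = 1.0 if B1 > 0 else -1.0          # test 1
    a2 = -1.0 if Bt > 0 else 1.0          # test 2 (negated truth value)
    t3 = {(s1, s2): s1 * B1 + s2 * Bt > 0
          for s1 in (1.0, -1.0) for s2 in (1.0, -1.0)}
    return t3[(a1, a2)]                   # a1*B1 + a2*Bt = |B1| - |Bt|

rng = np.random.default_rng(3)
for _ in range(10000):
    x, y = rng.normal(size=2)
    assert abs_greater(x, y) == (abs(x) > abs(y))
```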
Corollary 4. For p = 2, the entire algorithm can be derandomized to run using O(log^3(n) log(1/δ)) bits of space with failure probability δ. For p < 2, the algorithm can be derandomized to run using O(log^3(n)) bits of space with δ = 1/poly(n).

Proof. We can simply derandomize a single instance of our sampling algorithm using Nisan's PRG as in Theorem 7, except that we derandomize all the randomness in the algorithm at once. Since such an instance requires O(log^2(n)) bits of space, using Nisan's PRG blows up the complexity to O(log^3(n)) (the tester can simply simulate our entire algorithm in O(log^2(n)) bits of space, reading the randomness in a stream by the reordering trick of [Ind06]). Since the randomness for separate parallel instances of the main sampling algorithm is disjoint and independent, this same O(log^2(n))-bit tester can test the entire output of the algorithm by testing each parallel instance one by one, and terminating on the first instance that returns an index i ∈ [n]. Thus the same O(log^3(n))-bit random seed can be used to randomize all parallel instances of our algorithm. For p < 2, we can run O(log(n)) parallel instances to get 1/poly(n) failure probability in O(log^3(n)) bits of space as stated. For p = 2, we can run the O(log(n) log(1/δ)) parallel repetitions needed to get failure probability δ using the same random string, for a total space of O(log^3(n) log(1/δ) + log^3(n)) = O(log^3(n) log(1/δ)) as stated. As noted in the proof of Theorem 7, computing a substring of O(log(n)) bits from Nisan's PRG can be done in Õ(1) time and using space linear in the seed length, which completes the proof.

We will now show the modifications to our algorithm necessary to obtain Õ(1) query time. Recall that our algorithm maintains a count-max matrix A. Our algorithm then searches over all indices i ∈ K to check whether i hashed into the maximum bucket in a row of A at least a 4/5 fraction of the time. Since |K| = Õ(n), running this procedure requires Õ(n) time to produce an output on a given query. To avoid this and obtain Õ(1) running time, we will utilize the heavy hitters algorithm of [LNNT16], which has Õ(1) update and query time, and which does not increase the complexity of our algorithm.

Theorem 8 ([LNNT16]). For any precision parameter 0 < ε < 1/2, given a general turnstile stream x ∈ R^n there is an algorithm, ExpanderSketch, which with probability 1 − n^{−c} for any constant c, returns a set S ⊂ [n] of size |S| = O(ε^{−2}) which contains all indices i such that |x_i| ≥ ε‖x‖_2. The update time is O(log(n)), the query time is Õ(ε^{−2}), and the space required is O(ε^{−2} log^2(n)) bits.

Using ExpanderSketch to speed up query time.
The modifications to our main algorithm L_p Sampler with Fast-Update are as follows. We run our main algorithm as before, maintaining the same count-max data structure A. Upon initialization of our algorithm, we also initialize an instance ExSk of ExpanderSketch as in Theorem 8, with the precision parameter ε = 1/100. During the Fast-Update procedure, for each i ∈ [n] we hash the top K_i = O(log(n)) largest duplicates ζ_{i_j} corresponding to f_i individually, and store the random variables h_ℓ(i_j) that determine which buckets in A they hash to. While processing updates to our algorithm at this point, we make the modification of additionally sending these top K_i items to ExSk to be sketched. More formally, we run ExSk on the stream ζ_K, where ζ_K is the vector ζ projected onto the coordinates of K. Since K_i = Õ(1), this requires making Õ(1) calls to update ExSk on different coordinates, which only increases our update time by an Õ(1) additive term.

On termination, we obtain the set S containing all items ζ_i such that i ∈ K and |ζ_i| ≥ (1/100)‖ζ_K‖_2. Instead of searching through all coordinates of K to find a maximizer, we simply search through the coordinates in S, which takes Õ(|S|) = Õ(1) time; a toy illustration of this speedup follows below. We now argue that the output of our algorithm does not change with these new modifications. We refer collectively to the new algorithm with these modifications as L_p Sampler with Fast-Update and ExSk, and the algorithm of Section 5.1 as simply L_p Sampler with Fast-Update.
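In the toy code below (ours; the exact heavy-hitters guarantee is computed directly rather than by ExpanderSketch), any index the sampler could return is a (1/100)-heavy coordinate of ζ_K, so the query scans the O(1)-size candidate set S instead of all Õ(n) coordinates of K.

```python
import numpy as np

rng = np.random.default_rng(4)
zeta_K = rng.normal(size=10_000)
zeta_K[123] = 500.0                        # the planted maximizer

eps = 1 / 100                              # ExpanderSketch precision parameter
# stand-in for the ExpanderSketch query: all eps-heavy coordinates of zeta_K
S = np.flatnonzero(np.abs(zeta_K) >= eps * np.linalg.norm(zeta_K))

best = S[np.argmax(np.abs(zeta_K[S]))]     # scan only S, not all of K
assert best == 123 and len(S) <= int(1 / eps ** 2)
print(f"|S| = {len(S)}; maximizer recovered by scanning {len(S)} candidates")
```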
Lemma 9. For any constant c > 0, with probability 1 − n^{−c}, the algorithm L_p Sampler with Fast-Update and ExSk as described in this section returns the same output (an index i ∈ [n] or FAIL) as L_p Sampler using Fast-Update but without ExSk. The space and update time are not increased by using ExSk, and the query time is now Õ(1).

Proof. We condition on the event that S contains all items i such that i ∈ K and |ζ_i| ≥ (1/100)‖ζ_K‖_2, which occurs with probability 1 − n^{−c} by Theorem 8. Since L_p Sampler already uses at least O(log^2(n)) bits of space, the additional O(log^2(n)) bits of overhead required to run an instance ExSk of ExpanderSketch with sensitivity parameter ε = 1/100 does not increase the space complexity. Furthermore, as mentioned above, the update time is blown up by a factor of at most Õ(1), since we make K_i = Õ(1) calls to update ExSk, which has an update time of Õ(1) by Theorem 8. Furthermore, our algorithm does not require any more random bits, as it only uses ExpanderSketch as a subroutine, and thus no further derandomization is required. Thus the complexity guarantees of Lemma 6 are unchanged. For the query time, we note that obtaining S requires Õ(1) time (again by Theorem 8), and querying each of the |S| = O(1) items in our count-max A requires Õ(1) time. To complete the proof, we now consider the output of our algorithm. Since we are searching through a strict subset S ⊂ [n^c], it suffices to show that if the original algorithm output an i_j ∈ [n^c], then so will we. As argued in Lemma 6, such a coordinate must be contained in K. By Corollary 1, we must have |ζ_{i_j}| > (1/100)‖ζ‖_2 ≥ (1/100)‖ζ_K‖_2 with probability 1 − n^{−c} (scaling the constant c by 100 here), and thus i_j ∈ S, which completes the proof.

In this section, we will show how, conditioned on our algorithm L_p Sampler returning an index i ∈ [n], we can obtain an estimate f̃_i = (1 ± ε)f_i with probability 1 − δ_2. We now describe how to do this. Our algorithm, in addition to the count-max matrix A used by L_p Sampler, stores a count-sketch matrix A′ with d′ = O(log(1/δ_2)) rows and O(γ) = O(min{ε^{−2}, ε^{−p} log(1/δ_2)}) columns. Recall that in our Fast-Update procedure, for each i ∈ [n] we hash the top K_i = O(log(n)) largest duplicates ζ_{i_j} corresponding to f_i individually into A, and store the random variables h_ℓ(i_j) that determine which buckets in A they hash to. Thus if count-max outputs an i_j ∈ [n^c], we know that i_j ∈ K, where K = ∪_{i∈[n]} ∪_{j=1}^{K_i} {i_j} as in Section 5 (since our algorithm only searches through K to find a maximizer). Thus it suffices to run the count-sketch instance A′ on the stream ζ_K, where ζ_K is the vector ζ with the coordinates not in K set to 0. Since K_i = Õ(1), we perform at most Õ(1) updates to count-sketch at every step in the stream. This requires making Õ(1) calls to update count-sketch on each stream update, which only increases our update time by an Õ(1) additive term.

Now if L_p Sampler returns i_j ∈ [n^c] (corresponding to some duplicate i_j of i), then we must have i_j ∈ K. Thus we can query A′ for a value ỹ_{i_j} such that |ỹ_{i_j} − ζ_{i_j}| < √(1/γ) ‖ζ_tail(1/γ)‖_2 with probability 1 − δ_2 by Theorem 1. Furthermore, since i_j ∈ K, we can compute the value I_k such that I_k = rnd_ν(1/t_{i_j}) by simulating the Fast-Update procedure on an update to i. We will argue that the estimate f̃ = ỹ_{i_j} (rnd_ν(1/t_{i_j}))^{−1} satisfies f̃ = (1 ± ε)f_i. Putting this together with Theorem 2, we will obtain the following result.
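Before stating the theorem, here is a compact numerical sketch of the estimation step (our own illustration; rnd_ν is modeled as rounding down to a power of (1 + ν), one plausible reading of the rounding procedure defined earlier in the paper): because the algorithm can recompute the rounded scaling applied to ζ_{i_j}, a (1 ± ε) estimate of ζ_{i_j} inverts to a (1 ± ε) estimate of f_i, independently of the exponential t_{i_j}.

```python
import numpy as np

def rnd(x, nu):
    """Round x down to a power of (1 + nu) -- a stand-in for rnd_nu."""
    return (1.0 + nu) ** np.floor(np.log(x) / np.log(1.0 + nu))

rng = np.random.default_rng(5)
p, nu, f_i = 1.5, 1e-4, 42.0
t = rng.exponential()                      # exponential attached to duplicate i_j
scale = rnd(1.0 / t ** (1.0 / p), nu)      # recomputable from the stored randomness
zeta_ij = f_i * scale                      # the scaled stream coordinate
zeta_hat = zeta_ij * 1.01                  # count-sketch A' returns (1 +- eps) * zeta_ij
f_hat = zeta_hat / scale                   # invert the known scaling
print(abs(f_hat / f_i - 1.0))              # ~= eps = 0.01, regardless of t
```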
Theorem 9. There is an algorithm A which, on a general turnstile stream f, outputs i ∈ [n] with probability (1 ± ν)|f_i|^p/‖f‖_p^p + O(n^{−c}), and outputs FAIL with probability at most δ_1. Conditioned on outputting some i ∈ [n], A will then output f̃ such that f̃ = (1 ± ε)f_i with probability 1 − δ_2. The space required is O((log^2(n)(log log n)^2 + β log(n) log(1/δ_2)) log(1/δ_1)) for p ∈ (0, 2), and O((log^3(n) + ε^{−2} log^2(n) log(1/δ_2)) log(1/δ_1)) for p = 2, where β = min{ε^{−2}, ε^{−p} log(1/δ_2)}. The update time is Õ(ν^{−1}) and the query time is Õ(1).

Proof.
We first consider the complexity. The first term in each of the upper bounds follows from Theorem 2, as does the outer log(1/δ_1) factor, which comes from repeating the entire algorithm log(1/δ_1) times for p < 2, and log(n) log(1/δ_1) times for p = 2. The second term in each space bound results from storing the d′ × γ count-sketch table A′, which takes O(γ log(n) log(1/δ_2)) bits per instance as stated. Moreover, the update time for the new data structure is at most Õ(1), since the only additional work we do on each update is to hash K_i = O(log(n)) items into d′ = O(log(n)) rows of A′. Furthermore, the query time just requires computing a median of O(log(n)) entries of A′. Each of these actions takes Õ(1) time in the unit-cost RAM model, so the additional update and query time is Õ(1). The remaining Õ(ν^{−1}) update time follows from Lemma 6.

For correctness, note that if L_p Sampler does not fail and instead outputs i_j ∈ [n^c], we know that |ζ_{i_j}| > (1/100)‖ζ‖_2. Furthermore, we have |ỹ_{i_j} − ζ_{i_j}| < √(1/γ) ‖ζ_tail(1/γ)‖_2 ≤ √(1/γ) ‖ζ‖_2 with probability 1 − δ_2, so setting γ = Θ(1/ε^2) sufficiently large, it follows that ỹ_{i_j} = (1 ± O(ε))ζ_{i_j}. Then ỹ_{i_j} (rnd_ν(1/t_{i_j}))^{−1} = (1 ± ε)f_i follows immediately from the fact that f_i = ζ_{i_j} (rnd_ν(1/t_{i_j}))^{−1} (and a rescaling of ε by a constant). This shows that O(ε^{−2}) is always an upper bound on the value of γ = Θ(β) needed, for all p ∈ (0, 2].

We now consider the second term in the min defining β (for the case p < 2). Define T_γ ⊂ [n^c] as the set of the n^c − γ smallest coordinates (in absolute value) of z; in other words, z_{T_γ} = z_tail(γ), where for any set S ⊆ [n^c], z_S denotes z projected onto the coordinates of S. Note that if S is any set of size n^c − s and v ∈ R^{n^c} is any vector, we have ‖v_tail(s)‖_2 ≤ ‖v_S‖_2. Then by Proposition 1, using the fact that ζ_i = (1 ± O(ν))z_i for all i ∈ [n^c], we have ‖ζ_tail(γ)‖_2 ≤ ‖ζ_{T_γ}‖_2 ≤ 2‖z_tail(γ)‖_2 = O(‖F‖_p γ^{−1/p+1/2}) for p < 2, with probability 1 − O(e^{−γ}) > 1 − δ_2, where now we are setting γ = Θ(ε^{−p} log(1/δ_2)). Condition on this now. Then we obtain error |ỹ_{i_j} − ζ_{i_j}| < √(1/γ) ‖ζ_tail(1/γ)‖_2 = O(‖F‖_p γ^{−1/p}) = O(ε (log(1/δ_2))^{−1/p} ‖F‖_p) from our second count-sketch A′. Now z_{D(1)} = ‖F‖_p / E_1^{1/p}, which is at least Ω(‖F‖_p / (log(1/δ_2))^{1/p}) with probability greater than 1 − δ_2, using the pdf of an exponential. Conditioned on this, the error from our second count-sketch A′ gives, in fact, a (1 ± O(ε)) relative error approximation of ζ_{i_j}, which is the desired result. Note that we conditioned only on our count-sketch achieving the error |ỹ_{i_j} − ζ_{i_j}| < √(1/γ) ‖ζ_tail(1/γ)‖_2, on ‖z_tail(γ)‖_2 = O(‖F‖_p γ^{−1/p+1/2}), and on E_1 = O(log(1/δ_2)), each of which holds with probability at least 1 − O(δ_2), so the theorem follows after a union bound.

In this section, we obtain a lower bound for providing relative error approximations of the frequency of a sampled item. Our lower bound is derived from one-way two-party communication complexity. Let X, Y be input domains to a two-party communication complexity problem. Alice is given x ∈ X and Bob is given y ∈ Y. Their goal is to solve some relational problem Q ⊆ X × Y × O, where for each (x, y) ∈ X × Y the set Q_{xy} = {z | (x, y, z) ∈ Q} represents the set of correct solutions to the communication problem.

In a one-way communication protocol P, Alice must send a single message M to Bob (depending on her input x), from which Bob must output an answer o ∈ O depending on his input y and the message M. The maximum possible length (in bits) of M over all inputs (x, y) ∈ X × Y is the communication cost of the protocol P.
Communication protocols are allowed to be randomized, where each player has private access to an unlimited supply of random bits. The protocol P is said to solve the communication problem Q if Bob's output o belongs to Q_{xy} with failure probability at most δ < 1/2. The one-way communication complexity of Q, denoted R^→_δ(Q), is the minimum communication cost of a protocol which solves Q with failure probability δ.

Now a similar measure of complexity is the distributional complexity D^→_{µ,δ}(Q), where µ is a distribution over X × Y, which denotes the minimum communication cost of the best deterministic protocol for Q with failure probability at most δ when the inputs (x, y) ∼ µ. By Yao's Lemma, we have that R^→_δ(Q) = max_µ D^→_{µ,δ}(Q). We first review some basic facts about entropy and mutual information (see Chapter 2 of [CT12] for proofs of these facts). Proposition 4.
1. Entropy Span: if X takes on at most s values, then 0 ≤ H(X) ≤ log s.
2. I(X : Y) := H(X) − H(X | Y) ≥ 0, that is, H(X | Y) ≤ H(X).
3. Chain Rule: I(X_1, X_2, . . . , X_n : Y | Z) = Σ_{i=1}^n I(X_i : Y | X_1, . . . , X_{i−1}, Z).
4. Subadditivity: H(X, Y | Z) ≤ H(X | Z) + H(Y | Z), and equality holds if and only if X and Y are independent conditioned on Z.
5. Fano's Inequality: Let M be a predictor of X; in other words, there exists a function g such that Pr[g(M) = X] > 1 − δ, where δ < 1/2. Let U denote the support of X, where |U| ≥ 2. Then H(X | M) ≤ δ log(|U| − 1) + h(δ), where h(δ) := δ log(1/δ) + (1 − δ) log(1/(1 − δ)) is the binary entropy function.

We now define the information cost of a protocol P: Definition 7.
Let µ be a distribution on the input domain X × Y of a communication problem Q. Suppose the inputs (X, Y) are chosen according to µ, and let M be Alice's message to Bob, interpreted as a random variable which is a function of X and Alice's private coins. Then the information cost of a protocol P for Q is defined as I(X : M).

The one-way information complexity of Q with respect to µ and δ, denoted by IC^→_{µ,δ}(Q), is the minimum information cost of a one-way protocol under µ that solves Q with failure probability at most δ.

Note that by Proposition 4, we have I(X : M) = H(M) − H(M | X) ≤ H(M) ≤ |M|, where |M| is the length of the message M in bits. This results in the following proposition. Proposition 5.
For every probability distribution µ on inputs, R^→_δ(Q) ≥ IC^→_{µ,δ}(Q).

We now introduce the following communication problem, known as Augmented Index on Large Domains. Our communication problem is derived from the communication problem (of the same name) introduced in [JW13], but we modify the guarantee of the output required so that constant probability of error is allowed. The problem is as follows.

Definition 8. Let U be an alphabet with |U| = k ≥ 2. Alice is given a string x ∈ U^d, and Bob is given i ∈ [d] along with the values x_{i+1}, x_{i+2}, . . . , x_d. Alice must send a message M to Bob, and then Bob must output the value x_i ∈ U with probability 3/4. We refer to this problem as the augmented-index problem on large domains, and denote it by ind^d_U.

Note that in [JW13], a correct protocol is only required to determine whether x_i = a for some fixed input a ∈ U given only to Bob; however, such a protocol must succeed with probability 1 − δ. For the purposes of both problems, it is taken that |U| = Θ(1/δ). In this scenario, we note that the guarantee of our communication problem is strictly weaker, since if one had a protocol that determined whether x_i = a for a given a ∈ U with probability 1 − δ, one could run it on all a ∈ U and union bound over all |U| trials, from which the exact value of x_i could be determined with probability 3/4, thereby solving the form of the communication problem we have described. We show, nevertheless, that the same lower bound on the communication cost of our protocol holds as the lower bound in [JW13].

Let X be the set of all x ∈ U^d, let Y = [d], and define µ to be the uniform distribution over X × Y. Lemma 10.
Suppose |U| ≥ c for some sufficiently large constant c. Then IC^→_{µ,1/4}(ind^d_U) ≥ d log(|U|)/2.

Proof. Fix any protocol P for ind^d_U which fails with probability at most 1/4. Let X = (X_1, X_2, . . . , X_d) denote Alice's input as chosen via µ, and let M be Alice's message to Bob given X. By Proposition 4,

I(X : M) = Σ_{i=1}^d I(X_i : M | X_{i+1}, . . . , X_d)
= Σ_{i=1}^d (H(X_i | X_{i+1}, . . . , X_d) − H(X_i | M, X_{i+1}, . . . , X_d))

First note that since X_i is independent of X_j for all j ≠ i, we have H(X_i | X_{i+1}, . . . , X_d) = H(X_i) = log(|U|). Now since the protocol P is correct on ind^d_U, the variables M, X_{i+1}, . . . , X_d must be a predictor for X_i with failure probability 1/4 (Bob outputs X_i with probability 3/4 given M, X_{i+1}, . . . , X_d and his private, independent randomness). So by Fano's inequality (Proposition 4), we have

H(X_i | M, X_{i+1}, . . . , X_d) ≤ (1/4) log(|U| − 1) + h(1/4) ≤ (1/2) log(|U|)

which holds when |U| is sufficiently large. Putting this together, we obtain I(X : M) ≥ d log(|U|)/2. Corollary 5.
We have R^→_{1/4}(ind^d_U) = Ω(d log(|U|)).

We now use this lower bound on ind^d_U to show that, even when the index output is from a distribution with constant additive error from the true L_p distribution, returning an estimate with probability 1 − δ still requires Ω(ε^{−p} log(n) log(1/δ)) bits of space. Theorem 10.
Fix any constant p > 0 bounded away from zero, and let ε < 1/3 with ε^{−p} = o(n). Then any L_p sampling algorithm that outputs FAIL with probability at most 1/100, and otherwise returns an item ℓ ∈ [n] such that Pr[ℓ = l] = |f_l|^p/‖f‖_p^p ± 1/50 for all l ∈ [n], along with an estimate f̃_ℓ such that f̃_ℓ = (1 ± ε)f_ℓ with probability 1 − δ, requires Ω(ε^{−p} log(n) log(1/δ)) bits of space.

Proof. We reduce via ind^d_U. Suppose we have a streaming algorithm A which satisfies all the properties stated in the theorem. Set |U| = 1/(10δ), and let X ∈ U^d be Alice's input, where d = rs with r = 1/(10^{p+1}ε^p) and s = log(n). Alice conceptually divides X into s blocks X^1, . . . , X^s, each containing r items X^i = X^i_1, X^i_2, . . . , X^i_r ∈ U. Fix some labeling U = {σ_1, . . . , σ_k}, and let π(X^i_j) ∈ [k] be such that X^i_j = σ_{π(X^i_j)}. Then each X^i_j can be thought of naturally as a binary vector in R^{rsk} with support 1, where (X^i_j)_t = 1 when t = (i − 1)rk + (j − 1)k + π(X^i_j), and (X^i_j)_t = 0 otherwise. Set n′ = rsk, which satisfies n′ < n for ε^{−p} = o(n). Using this interpretation of X^i_j ∈ R^{rsk}, we define the vector f ∈ R^{rsk} by

f = Σ_{i=1}^s Σ_{j=1}^r B^i X^i_j

where B = 10^{1/p}. Alice can construct a stream with the frequency vector f by making the necessary insertions, and then send the state of the streaming algorithm A to Bob. Now Bob has some index i* ∈ [d] = [rs], and his goal is to output the value of X^{i′}_{j′} = X_{i*}, where i* = (i′ − 1)r + j′. Since Bob knows X^i_j for all (i, j) with i > i′, he can delete off the corresponding values of B^i X^i_j from the stream, leaving the stream with the frequency vector

f = Σ_{i=1}^{i′} Σ_{j=1}^r B^i X^i_j

For j ∈ [k], let γ^j ∈ R^{rsk} be the binary vector with γ^j_t = B^{i′}/(10ε) at the coordinate t = (i′ − 1)rk + (j′ − 1)k + j, and γ^j_t = 0 at all other coordinates. Bob then constructs the streams f^j = f + γ^j for j = 1, . . . , k sequentially. After he constructs f^j, he runs A on f^j to obtain an output (ℓ^j, f̃^j_{ℓ_j}) ∈ ([n′] × R) ∪ ({FAIL} × {FAIL}) from the streaming algorithm, where if the algorithm did not fail we have that ℓ^j ∈ [n′] is the index output and f̃^j_{ℓ_j} is the estimate of f^j_{ℓ_j}. By union bounding over the guarantee of A, we have that if ℓ^j ≠ FAIL then f̃^j_{ℓ_j} = (1 ± ε)f^j_{ℓ_j} for all j = 1, 2, . . . , k with probability 1 − kδ > 9/10. Call this event E_1. Conditioned on E_1, it follows for each j with ℓ^j = (i′ − 1)rk + (j′ − 1)k + j that if X^{i′}_{j′} = σ_j then

f̃^j_{ℓ_j} > B^{i′}(1 + 1/(10ε))(1 − ε) > B^{i′}/(10ε) + (9/10)B^{i′} − εB^{i′}

On the other hand, if X^{i′}_{j′} ≠ σ_j, then we will have

f̃^j_{ℓ_j} < (B^{i′}/(10ε))(1 + ε) = B^{i′}/(10ε) + B^{i′}/10 < B^{i′}/(10ε) + (9/10)B^{i′} − εB^{i′}

using that ε < 1/3. Thus if ℓ^j = (i′ − 1)rk + (j′ − 1)k + j, Bob can correctly determine whether or not X^{i′}_{j′} = σ_j.

Now suppose that, in actuality, Alice's item was X^{i′}_{j′} = σ_τ ∈ U for some τ ∈ [k]. Set λ = (i′ − 1)rk + (j′ − 1)k + τ. To complete the proof, it suffices to lower bound the probability that ℓ^τ = λ. Thus we consider only the event of running A on f^τ. We know that with probability 99/100, ℓ^τ ≠ FAIL; we write E_2 to denote the event that ℓ^τ ≠ FAIL. Let f^τ_{−λ} be equal to f^τ everywhere except with the coordinate λ set equal to 0. Then

‖f^τ_{−λ}‖_p^p < Σ_{i=1}^{i′} Σ_{j=1}^r (B^p)^i = r Σ_{i=1}^{i′} 10^i ≤ (1/(10^{p+1}ε^p)) (10^{i′+1}/9)

whereas |f^τ_λ|^p ≥ 10^{i′} (10ε)^{−p}, so that

|f^τ_λ|^p / ‖f^τ_{−λ}‖_p^p ≥ 10^{i′} (10ε)^{−p} · 9 · 10^{p+1} ε^p · 10^{−(i′+1)} = 9

Thus |f^τ_λ|^p / ‖f^τ‖_p^p ≥ 9/10, and since A has additive error at most 1/50 on its sampling probabilities, we have Pr[ℓ^τ = λ] > 9/10 − 1/50 = 22/25; call the event that this occurs E_3. Then conditioned on E = E_1 ∩ E_2 ∩ E_3, Bob successfully recovers the value of X^{i′}_{j′} = X_{i*}, and thus solves the communication problem. Note that the probability of success is Pr[E] > 1 − (1/10 + 1/100 + 3/25) > 3/4, and thus this protocol solves ind^d_U. So by Corollary 5, it follows that any such streaming algorithm A requires Ω(rs log(|U|)) = Ω(ε^{−p} log(n) log(1/δ)) bits of space. Note that the stream f in question had length n′ < n for p constant bounded away from 0, and no coordinate in the stream ever had a value greater than poly(n), and thus the stream in question is valid in the given streaming model.
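The reduction is concrete enough to simulate. The toy code below (ours; small parameters rather than the r = 1/(10^{p+1}ε^p), s = log n of the proof, and 0-indexed coordinates) builds Alice's vector, performs Bob's deletions and the single probe matching Alice's true symbol, and verifies that the probed coordinate carries nearly all of the L_p mass, so a sampler with small additive error reveals the symbol.

```python
import numpy as np

p, eps, k, r, s = 1.0, 1e-3, 8, 4, 3       # toy parameters
B = 10.0 ** (1.0 / p)
rng = np.random.default_rng(6)
X = rng.integers(k, size=(s, r))           # Alice's symbols, coded as 0..k-1

f = np.zeros(r * s * k)                    # Alice's frequency vector
for i in range(s):
    for j in range(r):
        f[i * r * k + j * k + X[i, j]] += B ** (i + 1)

i1, j1 = 1, 2                              # Bob's index i* = i1*r + j1
g = f.copy()
for i in range(i1 + 1, s):                 # Bob deletes the blocks he knows
    for j in range(r):
        g[i * r * k + j * k + X[i, j]] -= B ** (i + 1)

lam = i1 * r * k + j1 * k + X[i1, j1]      # coordinate of Alice's true symbol
g[lam] += B ** (i1 + 1) / (10 * eps)       # the probe f^tau = f + gamma^tau
ratio = abs(g[lam]) ** p / np.sum(np.abs(g) ** p)
print(ratio)                               # -> close to 1 as eps shrinks
```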
This work demonstrates the existence of perfect L_p samplers for p ∈ (0, 2) using O(log^2(n) log(1/δ)) bits of space in the random oracle model. This bound is tight in terms of both n and δ. However, to derandomize our algorithm for p < 2, our space increases by an O((log log n)^2) factor, which is perhaps unnecessary. There are also several other open problems for L_p samplers which this work does not close. Notably, there is still a log(n) factor gap between the upper and lower bounds for L_2 samplers, as the best known lower bound for any p ≥ 0 is Ω(log^2(n)), compared to our upper bound of O(log^3(n)). While perfect L_2 samplers using polylogarithmic space were not known before this work, our upper bound matches the best upper bounds of prior approximate L_2 samplers with constant ν = Ω(1). It is therefore an open question whether this additional factor of log(n) is required in the space complexity of an L_2 sampler, perfect or otherwise.

Secondly, one notable shortcoming of the perfect sampler presented in this paper is the large update time. To obtain a perfect sampler as defined in the introduction, the algorithm in this paper takes polynomial (in n) time to update its data structures after each entry in the stream. This is clearly non-ideal, since most streaming applications demand constant or polylogarithmic update time. Using our rounding procedure, we can obtain a (1 ± 1/poly(log n)) relative error sampler with polylogarithmic update time (and the same space as the perfect sampler), but it is still an open problem to design a perfect L_p sampler with optimal space dependency as well as polylogarithmic update time.

Finally, there are several gaps in the dependency on ε, δ_1, δ_2 in our procedure which, in addition to outputting an index i ∈ [n], also outputs a (1 ± ε) estimate of the frequency f_i. Taking Theorem 10 along with the known lower bounds for L_p sampling, our best lower bound for the problem is Ω(log^2(n) log(1/δ_1) + ε^{−p} log(n) log(1/δ_2)), where δ_1 is the probability that the sampler fails to output an index i. On the other hand, our best upper bound is O((log^2(n) + β log(n) log(1/δ_2)) log(1/δ_1)) for p ∈ (0, 2), and O((log^3(n) + ε^{−2} log^2(n) log(1/δ_2)) log(1/δ_1)) for p = 2, where β = min{ε^{−2}, ε^{−p} log(1/δ_2)}. Notably, the log(1/δ_1) multiplies the log(1/δ_2) term in the upper bound but not in the lower bound. We leave it as an open problem to determine precisely the right dependencies of such an algorithm on ε, δ_1, δ_2.
The authors would like to thank Raghu Meka for a helpful explanation of the [GKM15] PRG,and for pointing out how the arguments could be extended to fooling functions of multiple half-spaces (Lemma 7). The authors would also like to thank Ryan O’Donnell for a useful discussionon pseudo-random generators in general.
References [AKO10] Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithmsfrom precision sampling. arXiv preprint arXiv:1011.1263 , 2010.[BBD +
02] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom.Models and issues in data stream systems. In
Proceedings of the twenty-first ACMSIGMOD-SIGACT-SIGART symposium on Principles of database systems , pages 1–16. ACM, 2002.[BCIW16] Vladimir Braverman, Stephen R Chestnut, Nikita Ivkin, and David P Woodruff. Beat-ing countsketch for heavy hitters in insertion streams. In
Proceedings of the forty-eighthannual ACM symposium on Theory of Computing , pages 740–753. ACM, 2016.[BDM02] Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving windowover streaming data. In
Proceedings of the thirteenth annual ACM-SIAM symposium on iscrete algorithms , pages 633–634. Society for Industrial and Applied Mathematics,2002.[BKP +
14] Karl Bringmann, Fabian Kuhn, Konstantinos Panagiotou, Ueli Peter, and HenningThomas. Internal dla: Efficient simulation of a physical growth model. In
InternationalColloquium on Automata, Languages, and Programming , pages 247–258. Springer,2014.[BOZ12] Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. Optimal sampling fromsliding windows.
Journal of Computer and System Sciences , 78(1):260–272, 2012.[CCD11] Edith Cohen, Graham Cormode, and Nick G. Duffield. Structure-aware sampling:Flexible and accurate summarization.
PVLDB , 4(11):819–830, 2011.[CCD12] Edith Cohen, Graham Cormode, and Nick Duffield. Don’t let the negatives bring youdown: sampling from streams of signed updates.
ACM SIGMETRICS PerformanceEvaluation Review , 40(1):343–354, 2012.[CCFC02a] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items indata streams. In
Proceedings of the 29th International Colloquium on Automata, Lan-guages and Programming , ICALP ’02, pages 693–703, London, UK, UK, 2002. Springer-Verlag.[CCFC02b] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items indata streams.
Automata, languages and programming , pages 784–784, 2002.[CDK +
09] Edith Cohen, Nick Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Streamsampling for variance-optimal estimation of subset sums. In
Proceedings of the twentiethAnnual ACM-SIAM Symposium on Discrete Algorithms , pages 1255–1264. Society forIndustrial and Applied Mathematics, 2009.[CDK +
14] Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Al-gorithms and estimators for summarization of unaggregated data streams.
J. Comput.Syst. Sci. , 80(7):1214–1244, 2014.[CMR05] Graham Cormode, S Muthukrishnan, and Irina Rozenbaum. Summarizing and mininginverse distributions on data streams via dynamic inverse sampling. In
Proceedingsof the 31st international conference on Very large data bases , pages 25–36. VLDBEndowment, 2005.[CMYZ10] Graham Cormode, S Muthukrishnan, Ke Yi, and Qin Zhang. Optimal sampling fromdistributed streams. In
Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems , pages 77–86. ACM, 2010.[CMYZ12] Graham Cormode, S Muthukrishnan, Ke Yi, and Qin Zhang. Continuous samplingfrom distributed streams.
Journal of the ACM (JACM) , 59(2):10, 2012.[Coh15] Edith Cohen. Stream sampling for frequency cap statistics. In
Proceedings of the 21thACM SIGKDD International Conference on Knowledge Discovery and Data Mining ,pages 159–168. ACM, 2015.[CT12] Thomas M Cover and Joy A Thomas.
Elements of information theory. John Wiley & Sons, 2012. [Duf04] Nick Duffield. Sampling for passive internet measurement: A review.
Statistical Sci-ence , pages 472–498, 2004.[DV06] Amit Deshpande and Santosh Vempala. Adaptive sampling and fast low-rank matrixapproximation. In
Approximation, Randomization, and Combinatorial Optimization.Algorithms and Techniques , pages 292–303. Springer, 2006.[EV03] Cristian Estan and George Varghese. New directions in traffic measurement and ac-counting: Focusing on the elephants, ignoring the mice.
ACM Trans. Comput. Syst. ,21(3):270–313, 2003.[FCT15] Mart´ın Farach-Colton and Meng-Tsung Tsai. Exact sublinear binomial sampling.
Al-gorithmica , 73(4):637–651, 2015.[FIS08] Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streamsand applications.
International Journal of Computational Geometry & Applications ,18(01n02):3–28, 2008.[FKV04] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast monte-carlo algorithms forfinding low-rank approximations.
Journal of the ACM (JACM), 51(6):1025–1041, 2004. [GKM15] Parikshit Gopalan, Daniel Kane, and Raghu Meka. Pseudorandomness via the discrete Fourier transform. In
Foundations of Computer Science (FOCS), 2015 IEEE 56thAnnual Symposium on , pages 903–922. IEEE, 2015.[GKMS01] Anna C Gilbert, Yannis Kotidis, S Muthukrishnan, and Martin Strauss. Quicksand:Quick summary and analysis of network data. Technical report, 2001.[GKMS02] Anna C Gilbert, Yannis Kotidis, S Muthukrishnan, and Martin J Strauss. How tosummarize the universe: Dynamic maintenance of quantiles. In
VLDB’02: Proceedingsof the 28th International Conference on Very Large Databases , pages 454–465. Elsevier,2002.[GLH06] Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas. A dip in the reservoir: Main-taining sample synopses of evolving datasets. In
Proceedings of the 32nd InternationalConference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006 , pages595–606, 2006.[GLH08] Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas. Maintaining bounded-size sam-ple synopses of evolving datasets.
VLDB J. , 17(2):173–202, 2008.[GM98a] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for im-proving approximate query answers. In
SIGMOD 1998, Proceedings ACM SIGMODInternational Conference on Management of Data, June 2-4, 1998, Seattle, Washing-ton, USA. , pages 331–342, 1998.[GM98b] Phillip B Gibbons and Yossi Matias. New sampling-based summary statistics for im-proving approximate query answers. In
ACM SIGMOD Record, volume 27, pages 331–342. ACM, 1998. [GMP] Phillip B Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. [Haa81] Uffe Haagerup. The best constants in the Khintchine inequality.
Studia Mathematica ,70(3):231–283, 1981.[Haa16] Peter J. Haas. Data-stream sampling: Basic techniques and results. In
Data StreamManagement - Processing High-Speed Data Streams , pages 13–44. 2016.[HNG +
07] Ling Huang, XuanLong Nguyen, Minos Garofalakis, Joseph M Hellerstein, Michael IJordan, Anthony D Joseph, and Nina Taft. Communication-efficient online detectionof network-wide anomalies. In
INFOCOM 2007. 26th IEEE International Conferenceon Computer Communications. IEEE , pages 134–142. IEEE, 2007.[HNSS96] Peter J Haas, Jeffrey F Naughton, S Seshadri, and Arun N Swami. Selectivity andcost estimation for joins based on random sampling.
Journal of Computer and SystemSciences , 52(3):550–569, 1996.[HS92] Peter J Haas and Arun N Swami.
Sequential sampling procedures for query size esti-mation , volume 21. ACM, 1992.[Ind06] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and datastream computation.
Journal of the ACM (JACM), 53(3):307–323, 2006. [JST11] Hossein Jowhari, Mert Sağlam, and Gábor Tardos. Tight bounds for Lp samplers, finding duplicates in streams, and related problems. In
Proceedings of the Thirtieth ACMSIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems , PODS’11, pages 49–58, New York, NY, USA, 2011. ACM.[JW13] Thathachar S Jayram and David P Woodruff. Optimal bounds for johnson-lindenstrauss transforms and streaming problems with subconstant error.
ACM Trans-actions on Algorithms (TALG) , 9(3):26, 2013.[KNP +
17] Michael Kapralov, Jelani Nelson, Jakub Pachocki, Zhengyu Wang, David P Woodruff,and Mobin Yahyazadeh. Optimal lower bounds for universal relation, and for samplersand finding duplicates in streams. arXiv preprint arXiv:1704.00633 , 2017.[Knu98] Donald Ervin Knuth.
The art of computer programming, Volume II: SeminumericalAlgorithms, 3rd Edition . Addison-Wesley, 1998.[Kop13] Swastik Kopparty. Lecture 7: eps-biased and almost k-wise independent spaces. http://sites.math.rutgers.edu/~sk1233/courses/topics-S13/lec7.pdf , 2013.[LN95] Richard J Lipton and Jeffrey F Naughton. Query size estimation by adaptive sampling.
Journal of Computer and System Sciences, 51(1):18–25, 1995. [LNNT16] Kasper Green Larsen, Jelani Nelson, Huy L. Nguyen, and Mikkel Thorup. Heavy hitters via cluster-preserving clustering. In
Foundations of Computer Science (FOCS), 2016IEEE 57th Annual Symposium on , pages 61–70. IEEE, 2016.[LNS90] Richard J Lipton, Jeffrey F Naughton, and Donovan A Schneider.
Practical selectivityestimation through adaptive sampling , volume 19. ACM, 1990.[LNW14] Yi Li, Huy L Nguyen, and David P Woodruff. Turnstile streaming algorithms mightas well be linear sketches. In
Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 174–183. ACM, 2014. [M +
05] Shanmugavelayutham Muthukrishnan et al. Data streams: Algorithms and applica-tions.
Foundations and Trends in Theoretical Computer Science, 1(2):117–236, 2005. [McD89] Colin McDiarmid. On the method of bounded differences, pages 148–188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989. [MCS +
06] Jianning Mai, Chen-Nee Chuah, Ashwin Sridharan, Tao Ye, and Hui Zang. Is sampleddata sufficient for anomaly detection? In
Proceedings of the 6th ACM SIGCOMMconference on Internet measurement , pages 165–176. ACM, 2006.[MM12] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over datastreams.
PVLDB , 5(12):1699, 2012.[MP14] Gregory T Minton and Eric Price. Improved concentration bounds for count-sketch. In
Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms ,pages 669–686. Society for Industrial and Applied Mathematics, 2014.[MW10] Morteza Monemizadeh and David P Woodruff. 1-pass relative-error lp-sampling withapplications. In
Proceedings of the twenty-first annual ACM-SIAM symposium onDiscrete Algorithms , pages 1143–1160. SIAM, 2010.[Nag06] HN Nagaraja. Order statistics from independent exponential random variables andthe sum of the top order statistics.
Advances in Distribution Theory, Order Statistics,and Inference , pages 173–185, 2006.[Nis92] Noam Nisan. Pseudorandom generators for space-bounded computation.
Combinator-ica , 12(4):449–461, 1992.[NZ96] Noam Nisan and David Zuckerman. Randomness is linear in space.
Journal of Com-puter and System Sciences , 52(1):43–52, 1996.[Olk93] Frank Olken.
Random sampling from databases . PhD thesis, University of California,Berkeley, 1993.[Pri18] Eric Price. Personal communication. November, 2018.[TLJ10] Marina Thottan, Guanglei Liu, and Chuanyi Ji. Anomaly detection approaches forcommunication networks. In
Algorithms for Next Generation Networks , pages 239–261. Springer, 2010.[TW11] Srikanta Tirthapura and David P Woodruff. Optimal random sampling from dis-tributed streams revisited. In
International Symposium on Distributed Computing ,pages 283–297. Springer, 2011.[Vit85a] Jeffrey S Vitter. Random sampling with a reservoir.
ACM Transactions on Mathe-matical Software (TOMS) , 11(1):37–57, 1985.[Vit85b] Jeffrey Scott Vitter. Random sampling with a reservoir.
ACM Trans. Math. Softw., 11(1):37–57, 1985. [WZ13] David P. Woodruff and Qin Zhang. Subspace embeddings and lp-regression using exponential random variables. CoRR, abs/1305.5580, 2013. [WZ16] David P Woodruff and Peilin Zhong. Distributed low rank approximation of implicit functions of a matrix. In
Data Engineering (ICDE), 2016 IEEE 32nd InternationalConference on , pages 847–858. IEEE, 2016.
A Original L p Sampling via Count-Sketch
In a previous version of this work, we used a slightly different testing algorithm for the L_p sampler. Namely, we used the classic count-sketch estimation procedure of Theorem 1 to obtain a y such that ‖y − ζ‖_∞ is small. We then take the largest coordinate of y as our guess of the maximizer in ζ. The algorithm presented in the current version has the advantage of being slightly simpler, and does not incur the (log log n)^2 blow-up in space for p = 2 from the derandomization. In this section, we show how the algorithm in the original version can be derandomized using the general derandomization results for linear sketches of Theorem 5. First, we introduce a few preliminary tools that we will need. A.1 Preliminaries
We first introduce the L_2 estimation algorithm of [Ind06]. To estimate ‖f‖_2 for f ∈ R^n, we generate i.i.d. Gaussians ϕ_{i,j} ∼ N(0, 1) for i ∈ [n] and j ∈ [r], where r = Θ(log(n)). We will later derandomize this assumption. We then store the vector B ∈ R^r, where B_j = Σ_{i=1}^n f_i ϕ_{i,j} for j = 1, . . . , r, which can be computed update by update throughout the stream. We return the estimate R = (5/4) median_j |B_j|. Lemma 11.

For any constant c > 0, the value of R as computed in the above algorithm satisfies (1/2)‖f‖_2 ≤ R ≤ 2‖f‖_2 with probability 1 − n^{−c}.

Proof. Each coordinate B_j is distributed as |B_j| = |g_j| ‖f‖_2, where the g_j are i.i.d. Gaussian random variables. A simple computation shows that Pr[|g_j| ∈ [2/5, 8/5]] > 0.55, and thus Pr[(5/4)|B_j| ∈ [(1/2)‖f‖_2, 2‖f‖_2]] > 0.55. Then by Chernoff–Hoeffding bounds, the median of O(log(n)) repetitions satisfies this bound with probability 1 − n^{−c} as stated.

Finally, we remark that making a simple modification to the classic count-sketch algorithm (see Theorem 1) still results in the same error guarantee. Let A ∈ R^{d×k} be a d × k count-sketch matrix. The modification is as follows: instead of each variable h_i(k) being uniformly distributed in {1, 2, . . . , k}, we replace them with variables h_{i,j,k} ∈ {0, 1} for (i, j, k) ∈ [d] × [k] × [n], such that the h_{i,j,k} are all i.i.d. and equal to 1 with probability 1/k. We also let g_{i,j,k} ∈ {1, −1} be i.i.d. Rademacher variables (1 with probability 1/2). The buckets are then A_{i,j} = Σ_{k=1}^n f_k g_{i,j,k} h_{i,j,k}, and the estimate y_k of f_k is given by:

y_k = median_{(i,j) : h_{i,j,k}=1} {g_{i,j,k} A_{i,j}}

Thus the element f_k can be hashed into multiple buckets in the same row of A, or even be hashed into none of the buckets in a given row. By Chernoff bounds, |{(i, j) : h_{i,j,k} = 1}| = Θ(d) with high probability for all k ∈ [n]. Observe that the marginal distribution of each bucket is the same as before, and thus the original analysis of count-sketch ([CCFC02b]) is unchanged, as it only relies on taking the median of Θ(d) buckets, each of which independently succeeds in giving a good estimate with probability at least 2/3, as is the case here. Thus the bounds of Theorem 1 apply as usual.
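For concreteness, here is a compact simulation of this estimator (our own code, not the paper's):

```python
import numpy as np

def l2_estimate(f, r, rng):
    """Indyk-style L2 estimate: B_j = <phi_j, f> with i.i.d. N(0,1) entries,
    R = (5/4) * median_j |B_j|. Since |B_j| = |g_j| * ||f||_2 for a standard
    Gaussian g_j, the median concentrates, and R lands in [||f||_2/2, 2||f||_2]
    with probability 1 - n^{-c} for r = Theta(log n)."""
    Phi = rng.normal(size=(r, len(f)))     # maintained incrementally in a stream
    return 1.25 * np.median(np.abs(Phi @ f))

rng = np.random.default_rng(7)
f = rng.normal(size=5000)
print(l2_estimate(f, r=64, rng=rng) / np.linalg.norm(f))   # typically in [0.5, 2]
```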
Theorem 11.
Let A ∈ R^{d×k} be the modified count-sketch as described above. If d = Θ(log(n)), k = 6/ε^2, and c ≥ 1 is any constant, then we have ‖y − f‖_∞ < ε‖f_tail(1/ε^2)‖_2 with probability 1 − n^{−c}.

L_p Sampler
1. For 0 < p < 2, set ε = Θ(1), and for p = 2 set ε = Θ(√(1/log(n))). Let d = Θ(log(n)), instantiate a d × 6/ε^2 count-sketch table A, and set µ ∼ Uniform[1/2, 3/2].
2. Duplicate updates to f to obtain the vector F ∈ R^{n^c}, so that f_i = F_{i_j} for all i ∈ [n] and j = 1, 2, . . . , n^{c−1}, for some fixed constant c.
3. Choose i.i.d. exponential random variables t = (t_1, t_2, . . . , t_{n^c}), and construct the stream ζ_i = F_i · rnd_ν(1/t_i^{1/p}).
4. Run A on ζ to obtain the estimate y with ‖y − |ζ|‖_∞ < ε‖ζ_tail(1/ε^2)‖_2 as in Theorem 11.
5. Run the L_2 estimator on ζ to obtain R ∈ [(1/2)‖ζ‖_2, 2‖ζ‖_2] with high probability.
6. If y_(1) − y_(2) < 100µεR or if y_(2) < 100µεR, report FAIL; else return i ∈ [n] such that y_{i_j} = y_(1) for some j ∈ [n^{c−1}].

Figure 6: Our main L_p sampling algorithm
We begin by describing the original sampling algorithm, as shown in Figure 6. The algorithmduplicates coordinates just as the Sampler of Figure 3, and scales it by inverse 1 /p -th powers ofi.i.d. exponentials 1 /t /pi . We also perform the same rounding procedure, turning z into ζ . Havingconstructed the transformed stream ζ , we then run a Θ(log( n )) × /ǫ instance A of count-sketchon ζ to obtain an estimate vector y with k y − | ζ |k ∞ < ǫ k ζ tail(1 /ǫ ) k with probability 1 − n − c (as inTheorem 11). Here, for a vector v ∈ R n , | v | ∈ R n is the vector such that ( | v | ) i = | v i | for all i ∈ [ n ].Thus y j is an estimate of the absolute value ζ j , and is always positive. This is simply accomplishedby taking the absolute value of the usual estimate y obtained from count-sketch.Then for 0 < p <
2, we set ǫ = Θ(1), and for p = 2 we set ǫ = Θ(1 / p log( n )). Next, we obtainestimates R ∈ [ k ζ k , k ζ k ] via the algorithm of Lemma 11 with high probability. The algorithmthen finds y (1) , y (2) (the two largest coordinates of y ), and samples µ ∼ Uniform[1 / , / y (1) − y (2) < µǫR or if y (2) < ǫµR , and reports FAIL if either occur, otherwise itreturns i ∈ [ n ] with y i j = y (1) for some j ∈ [ n c − ].Let i ∗ ∈ [ n c ] be the index of the maximizer in y , so y i ∗ = y (1) . By checking that y (1) − y (2) > µǫR , noting that 100 µǫR ≥ k y − | ζ |k ∞ and z k = (1 ± ν ) ζ k for all k ∈ [ n c ], for ν < ǫ sufficientlysmall we ensure that | z i ∗ | is also the maximum element in z . The necessity for the test y (2) ≥ ǫµR is less straightforward (see Remark 2 for justification). To prove correctness, we need to analyzethe conditional probability of failure given D (1) = i . Let N = |{ i ∈ [ n c ] | F i = 0 }| ( N is the supportsize of F ). We can assume that N = 0 (to check this one could run, for instance, the O (log ( n ))-bitsupport sampler of [JST11]). Note that n c − ≤ N ≤ n c . We now will prove the propositions andlemmas needed to demosntrate correctness of this sampler. Lemmas 12 and 13 are the analogousresults to Lemmas 3 and 4 in Section 4, and will follow nearly the same proofs. Proposition 6.
Let X, Y ∈ R^d be random variables with Z = X + Y. Suppose X is independent of some event E, and let M > 0 be such that for every i ∈ [d] and every a < b we have Pr[a ≤ X_i ≤ b] ≤ M(b − a). Suppose further that ‖Y‖_∞ ≤ ε. Then if I = I_1 × I_2 × ··· × I_d ⊂ R^d, where each I_j = [a_j, b_j] ⊂ R, −∞ ≤ a_j < b_j ≤ ∞, is a (possibly unbounded) interval, then

Pr[Z ∈ I | E] = Pr[Z ∈ I] + O(εdM)

Proof. For j ∈ [d], let I_j^+ = [a_j − ε, b_j + ε] and I_j^− = [a_j + ε, b_j − ε], and let I^+ = I_1^+ × ··· × I_d^+ and I^− = I_1^− × ··· × I_d^−. If one of the endpoints is unbounded we simply use the convention ∞ ± c = ∞ and −∞ ± c = −∞ for any real c. Then

Pr[Z ∈ I | E] ≤ Pr[X ∈ I^+ | E] = Pr[X ∈ I^+] ≤ Pr[X ∈ I^−] + Pr[∪_{j=1}^d {X_j ∈ I_j^+ \ I_j^−}]

By the union bound, this is at most Pr[X ∈ I^−] + 4dεM ≤ Pr[Z ∈ I] + 4dεM. Similarly, Pr[Z ∈ I | E] ≥ Pr[X ∈ I^− | E] = Pr[X ∈ I^−] ≥ Pr[X ∈ I^+] − 4dεM ≥ Pr[Z ∈ I] − 4dεM. Lemma 12.
For p ∈ (0 , a constant bounded away from and any ν ≥ n − c , Pr [ ¬ FAIL | D (1)] = Pr [ ¬ FAIL ] ± O (log( n ) ν ) for every possible D (1) ∈ [ N ] .Proof. By Lemma 2, conditioned on E , for every k < N − n c/ we have | z D ( k ) | = U /pD ( k ) (1 ± O ( n − c/ )) /p = U /pD ( k ) (1 ± O ( p n − c/ )) (using the identity (1 + x ) ≤ e x and the Taylor expansionof e x ), where U D ( k ) = ( P kτ =1 E τ E [ P Nj = τ | F D ( j ) | p ] ) − is independent of the anti-rank vector D (in fact, itis totally determined by k and the hidden exponentials E i ). Then for c sufficiently large, we have | ζ D ( k ) | = U /pD ( k ) (1 ± O ( ν )), and so for all p ∈ (0 ,
2] and k < N − n c/ | ζ D ( k ) | = U /pD ( k ) + U /pD ( k ) V D ( k ) Where V D ( k ) is some random variable that satisfies | V D ( k ) | = O ( ν ). Now consider a bucket A i,j for ( i, j ) ∈ [ d ] × [6 /ǫ ]. Let σ k = sign ( z k ) = sign ( ζ k ) for k ∈ [ n c ]. Then we write A i,j = P k ∈ B ij σ D ( k ) | ζ D ( k ) | g i,j,D ( k ) + P k ∈ S ij σ D ( k ) | ζ D ( k ) | g i,j,D ( k ) where B ij = { k ≤ N − n c/ | h i,j,D ( k ) = 1 } and S ij = { n c ≥ k > N − n c/ | h i,j,D ( k ) = 1 } (see notation above Theorem 11). Here we define { D ( N + 1) , . . . , D ( n c ) } to be the set of indices i with F i = 0 (in any ordering, as they contributenothing to the sum). So A i,j = X k ∈ B ij g i,j,D ( k ) σ D ( k ) U /pD ( k ) + X k ∈ B ij g i,j,D ( k ) σ D ( k ) U /pD ( k ) V D ( k ) + X k ∈ S ij g i,j,D ( k ) ζ D ( k ) Importantly, observe that since the variables h i,j,D ( k ) are fully independent, the sets B i,j , S i,j areindependent of the anti-rank vector D . In other words, the values h i,j,D ( k ) are independent of thevalues D ( k ) (and of the entire anti-rank vector). Note that this would not necessarily be the caseif these were only ℓ -wise independent for some ℓ = o ( n c ). So we can condition on a fixed set ofvalues { h i,j,D (1) , . . . , h i,j,D ( n c ) } now, which fixes the sets B i,j , S i,j . Claim 2.
Claim 2. For all $i, j$ and $p \in (0, 2]$, we have
$$\Big|\sum_{k \in B_{ij}} g_{i,j,D(k)} \sigma_{D(k)} U_{D(k)}^{1/p} V_{D(k)}\Big| + \Big|\sum_{k \in S_{ij}} g_{i,j,D(k)} \zeta_{D(k)}\Big| = O(\sqrt{\log(n)}\, \nu \|z\|_2)$$
with probability $1 - O(\log^2(n)\, n^{-c})$.

Proof.
By Khintchine's inequality (Fact 1), we have $|\sum_{k \in S_{ij}} g_{i,j,D(k)} \zeta_{D(k)}| = O(\sqrt{\log(n)})\big(\sum_{k \in S_{ij}} (2 z_{D(k)})^2\big)^{1/2}$ with probability $1 - n^{-c}$. This is a sum over a subset of the $n^{c/2}$ smallest items $|z_i|$, and thus $\sum_{k \in S_{ij}} z_{D(k)}^2 < \frac{n^{c/2}}{N} \|z\|_2^2$, giving $|\sum_{k \in S_{ij}} g_{i,j,D(k)} \zeta_{D(k)}| = O(\sqrt{\log(n)}\, n^{-c/5} \|z\|_2) = O(\sqrt{\log(n)}\, \nu \|z\|_2)$ for $c$ sufficiently large. Furthermore, using the fact that for $k \leq N - n^{c/2}$ we have $|\zeta_{D(k)}| < 2 U_{D(k)}^{1/p}$ and $|V_{D(k)}| = O(\nu)$, we have $|\sum_{k \in B_{ij}} g_{i,j,D(k)} \sigma_{D(k)} U_{D(k)}^{1/p} V_{D(k)}| = O(\sqrt{\log(n)}\, \nu \|z\|_2)$ with probability $1 - n^{-c}$, again by Khintchine's inequality, as needed. Note that only $O(\epsilon^{-2}\log(n)) = O(\log^2(n))$ terms $|\sum_{k \in B_{ij}} g_{i,j,D(k)} \sigma_{D(k)} U_{D(k)}^{1/p} V_{D(k)}| + |\sum_{k \in S_{ij}} g_{i,j,D(k)} \zeta_{D(k)}|$ ever occur among all of the $A_{i,j}$'s (only $O(\log(n))$ for $p < 2$), since the count-sketch has size $O(\epsilon^{-2}\log(n))$. Union bounding over these buckets and taking $c$ sufficiently large, the claim follows.

Call the event that Claim 2 holds $\mathcal{E}_2$. Conditioned on $\mathcal{E}_2$, we can decompose $|A_{i,j}|$ for all $i,j$ as
$$|A_{i,j}| = \Big|\sum_{k \in B_{ij}} g_{i,j,D(k)} \sigma_{D(k)} U_{D(k)}^{1/p}\Big| + \mathcal{V}_{ij}$$
where $\mathcal{V}_{ij}$ is some random variable satisfying $|\mathcal{V}_{ij}| = O(\sqrt{\log(n)}\, \nu \|z\|_2)$, and $\sum_{k \in B_{ij}} g_{i,j,D(k)} \sigma_{D(k)} U_{D(k)}^{1/p}$ is independent of the anti-rank vector $D$ (it depends only on the hidden exponentials $E_k$ and the uniformly random signs $g_{i,j,D(k)} \sigma_{D(k)}$). Let $U^*_{ij} = |\sum_{k \in B_{ij}} g_{i,j,D(k)} \sigma_{D(k)} U_{D(k)}^{1/p}|$, and let $\Gamma(k) = \{(i,j) \in [d] \times [6/\epsilon^2] \mid h_{i,j,D(k)} = 1\}$. Then our estimate of $|\zeta_{D(k)}|$ is
$$y_{D(k)} = \mathrm{median}_{(i,j) \in \Gamma(k)} \{U^*_{ij} + \mathcal{V}_{ij}\} = \mathrm{median}_{(i,j) \in \Gamma(k)} \{U^*_{ij}\} + \mathcal{V}^*_{D(k)}$$
where $|\mathcal{V}^*_{D(k)}| = O(\sqrt{\log(n)}\, \nu \|z\|_2)$ for all $k \in [n^c]$.

We now consider our $L_2$ estimate, which is given by $R = \mathrm{median}_j \{|\sum_{k \in [n^c]} \varphi_{kj} \zeta_k|\}$, where the $\varphi_{kj}$'s are i.i.d. standard Gaussians. We can write this as
$$R = \mathrm{median}_j \Big\{\Big|\sum_{k \in B} \varphi_{D(k)j} \sigma_{D(k)} U_{D(k)}^{1/p} + \Big(\sum_{k \in B} \varphi_{D(k)j} \sigma_{D(k)} U_{D(k)}^{1/p} V_{D(k)} + \sum_{k \in S} \varphi_{D(k)j} \zeta_{D(k)}\Big)\Big|\Big\}$$
where $B = \cup_{ij} B_{ij}$ and $S = [n^c] \setminus B$. Now the $\varphi_{D(k)j}$'s are not $\pm 1$ random variables, but an analogue of Khintchine's inequality holds for them: if $\varphi_1, \ldots, \varphi_n$ are i.i.d. Gaussian, then by $2$-stability $\mathbf{Pr}[|\sum_i \varphi_i a_i| > O(\sqrt{\log(n)}) \|a\|_2] = \mathbf{Pr}[|\varphi| \|a\|_2 > O(\sqrt{\log(n)}) \|a\|_2]$, where $\varphi$ is again Gaussian. This latter probability can be bounded by $n^{-c}$ via the pdf of a Gaussian, which is the same bound as Khintchine's inequality gives. So applying the same argument as in Claim 2, with probability $1 - O(n^{-c})$ we have
$$R = \mathrm{median}_j \Big\{\Big|\sum_{k \in B} \varphi_{D(k)j} \sigma_{D(k)} U_{D(k)}^{1/p}\Big|\Big\} + \mathcal{V}_R$$
where $|\mathcal{V}_R| = O(\sqrt{\log(n)}\, \nu \|z\|_2)$. Call this event $\mathcal{E}_3$. By the symmetry of Gaussians, the value $\varphi_{D(k)j} \sigma_{D(k)}$ is again an i.i.d. Gaussian, so $|\sum_{k \in B} \varphi_{D(k)j} \sigma_{D(k)} U_{D(k)}^{1/p}|$ is independent of the anti-rank vector.

Let $U^*_{D(k)} = \mathrm{median}_{(i,j) \in \Gamma(k)} \{U^*_{ij}\}$ for $k \in [n^c]$, and let $U^*_R = \mathrm{median}_j \{|\sum_{k \in B} \varphi_{D(k)j} \sigma_{D(k)} U_{D(k)}^{1/p}|\}$. Then both $U^*_{D(k)}$ and $U^*_R$ are independent of the anti-ranks $D(k)$ (the former does, however, depend on $k$), and $y_{D(k)} = U^*_{D(k)} + \mathcal{V}^*_{D(k)}$. Now to analyze our failure condition, we define a deterministic function $\Lambda(x, v) \in \mathbb{R}^2$: for a vector $x$ and a scalar $v$, set $\Lambda(x,v)_1 = x_{(1)} - x_{(2)} - 100\epsilon v$ and $\Lambda(x,v)_2 = x_{(2)} - 50\epsilon v$. Note that $\Lambda(y, \mu R) \geq \vec{0}$ entry-wise if and only if we do not FAIL.
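The Gaussian analogue of Khintchine's inequality invoked above is just $2$-stability plus a standard Gaussian tail bound; a quick numerical check with assumed sizes is:

```python
import math, random

# Numerical check of the Gaussian analogue of Khintchine's inequality:
# by 2-stability, sum_i phi_i * a_i has the law of phi * ||a||_2 for a single
# standard Gaussian phi, so its tail beyond t * ||a||_2 with t = 2 sqrt(ln n)
# is at most exp(-t^2 / 2) = n^{-2}.
random.seed(2)
n, trials = 64, 50_000
a = [random.random() for _ in range(n)]
norm = math.sqrt(sum(x * x for x in a))
t = 2.0 * math.sqrt(math.log(n))

exceed = 0
for _ in range(trials):
    s = sum(random.gauss(0.0, 1.0) * x for x in a)
    exceed += abs(s) > t * norm

print("empirical tail probability:", exceed / trials)
print("Gaussian tail bound n^-2  :", math.exp(-t * t / 2.0))
```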
Claim 3. Conditioned on $\mathcal{E} \cap \mathcal{E}_2 \cap \mathcal{E}_3$, we have the decomposition $\Lambda(y, \mu R) = \Lambda(\vec{U}^*, \mu U^*_R) + V$, where the first term is independent of the anti-rank vector $D$ and $\|V\|_\infty = O(\sqrt{\log(n)}\, \nu \|z\|_2)$.

Proof.
We have shown that $|\mathcal{V}_R|$ and $|\mathcal{V}^*_{D(k)}|$ are both $O(\sqrt{\log(n)}\, \nu \|z\|_2)$ for all $k \in [n^c]$, conditioned on $\mathcal{E} \cap \mathcal{E}_2 \cap \mathcal{E}_3$. We have $y = \vec{U}^* + \vec{\mathcal{V}}^*$, where $\vec{U}^*_{D(k)} = U^*_{D(k)}$ and $\vec{\mathcal{V}}^*_{D(k)} = \mathcal{V}^*_{D(k)}$, so $\vec{\mathcal{V}}^*$ can change the value of the two largest coordinates of $y$ by at most $\|\vec{\mathcal{V}}^*\|_\infty = O(\sqrt{\log(n)}\, \nu \|z\|_2)$. Similarly, $|\mathcal{V}_R|$ can change the value of $R$ by at most $O(\sqrt{\log(n)}\, \nu \|z\|_2)$, which completes the proof of the decomposition. To see the claim of independence, note that $\Lambda(\vec{U}^*, \mu U^*_R)$ is a deterministic function of the hidden exponentials $E_1, \ldots, E_N$, the random signs $g$, and the uniform random variable $\mu$, the joint distribution of all of which is marginally independent of the anti-rank vector $D$, which completes the claim.

To complete the proof of the lemma, it suffices to show the anti-concentration of $\Lambda(\vec{U}^*, \mu U^*_R)$. Now for any interval $I$,
$$\mathbf{Pr}[\Lambda(\vec{U}^*, \mu U^*_R)_1 \in I] = \mathbf{Pr}[\mu \in I'/(100\epsilon U^*_R)] = O(|I|/(\epsilon U^*_R))$$
and
$$\mathbf{Pr}[\Lambda(\vec{U}^*, \mu U^*_R)_2 \in I] = \mathbf{Pr}[\mu \in I''/(50\epsilon U^*_R)] = O(|I|/(\epsilon U^*_R))$$
where $I'$ and $I''$ are the results of shifting the interval $I$ by a term which is independent of $\mu$. Here $|I| \in [0, \infty]$ denotes the length of the interval $I$. Thus it suffices to lower bound $U^*_R$. We have $2U^*_R > R > \frac{1}{2}\|z\|_2$ after conditioning on the success of our $L_2$ estimator, an event we call $\mathcal{E}_4$, which holds with probability $1 - n^{-c}$ by Lemma 11. Thus $\mathbf{Pr}[\Lambda(\vec{U}^*, \mu U^*_R)_1 \in I] = O(\epsilon^{-1}|I|/\|z\|_2)$ and $\mathbf{Pr}[\Lambda(\vec{U}^*, \mu U^*_R)_2 \in I] = O(\epsilon^{-1}|I|/\|z\|_2)$ for any interval $I$. So by Proposition 6, conditioned on $\mathcal{E} \cap \mathcal{E}_2 \cap \mathcal{E}_3 \cap \mathcal{E}_4$ we have
$$\mathbf{Pr}\big[\Lambda(y, \mu R) \geq \vec{0} \;\big|\; D(1)\big] = \mathbf{Pr}\big[\Lambda(y, \mu R) \geq \vec{0}\big] \pm O(\log(n)\nu) \quad (5)$$
Note that $\mathcal{E} \cap \mathcal{E}_2 \cap \mathcal{E}_3 \cap \mathcal{E}_4$ holds with probability $1 - O(n^{-c+1})$, so choosing $c$ large enough that $n^{-c+1} < \log(n)\nu$, Equation 5 holds without the conditioning on $\mathcal{E} \cap \mathcal{E}_2 \cap \mathcal{E}_3 \cap \mathcal{E}_4$, which completes the proof of the lemma.

Lemma 13. If $y$ is the vector obtained via count-sketch as in the algorithm $L_p$-Sampler, and $0 < p \leq 2$ is a constant, then we have
$$\mathbf{Pr}\big[y_{(1)} - y_{(2)} > 100\epsilon\mu R \;\wedge\; y_{(2)} > 50\epsilon\mu R\big] \geq 3/4$$
where $\epsilon = \Theta(1)$ when $p < 2$, and $\epsilon = \Theta(1/\sqrt{\log(n)})$ when $p = 2$.

Proof. By Proposition 1, with probability $1 - e^{-\Omega(1)} > 9/10$ we have $\|z_{\mathrm{tail}(16)}\|_2 = O(\|F\|_p)$ for $p < 2$, and $\|z_{\mathrm{tail}(16)}\|_2 = O(\sqrt{\log(n)}\, \|F\|_p)$ when $p = 2$. Observe that for $t \in [16]$ we have $|z_{D(t)}| < \|F\|_p / (\sum_{\tau=1}^{t} E_\tau)^{1/p}$, and with probability $99/100$ we have $E_1 > 1/100$, in which case $|z_{D(t)}| = O(\|F\|_p)$ for all $t \in [16]$. Conditioned on this, we have $\|z\|_2 < q \|F\|_p$, where $q$ is a constant when $p < 2$ and $q = \Theta(\sqrt{\log(n)})$ when $p = 2$. In either case, we know that the estimate $y$ from count-sketch satisfies $\|y - |\zeta|\|_\infty < \epsilon\|\zeta\|_2 < 2\epsilon\|z\|_2 = O(\|F\|_p)$. Thus, conditioning on the high probability event that $R = \Theta(\|\zeta\|_2)$, we have that $100\epsilon\mu R = O(\|F\|_p)$, where we can rescale this quantity down by any constant factor via a suitable rescaling of $\epsilon$.

Now note that $|z_{D(1)}| = \|F\|_p / E_1^{1/p}$ and $|z_{D(2)}| = \|F\|_p / (E_1 + E_2(1 \pm n^{-c+1}))^{1/p}$, where $E_1, E_2$ are independent exponentials. So with probability $7/8$ we have all of $|z_{D(1)}| = \Theta(\|F\|_p)$, $|z_{D(2)}| = \Theta(\|F\|_p)$, and $|z_{D(1)}| - |z_{D(2)}| = \Theta(\|F\|_p)$ with suitably scaled constants, so scaling $\nu$ by a sufficiently small constant we have $|\zeta_{D(1)}| = \Theta(\|F\|_p)$, $|\zeta_{D(1)}| - |\zeta_{D(2)}| = \Theta(\|F\|_p)$, and $|\zeta_{D(2)}| = \Theta(\|F\|_p)$. Conditioned on the event of the prior paragraph and on the high probability success of our $L_2$ estimation algorithm and our count-sketch error, our estimates of $|\zeta_{D(1)}|, |\zeta_{D(2)}|$ via $y$ are $\Theta(1)$-relative error estimates, so for $\epsilon$ small enough the maximum indices in $y$ and $\zeta$ will coincide, and we will have both $y_{(1)} - y_{(2)} > 100\epsilon\mu R$ and $y_{(2)} > 50\epsilon\mu R$ (each right-hand side being $O(\|F\|_p)$ with a constant we may rescale). By a union bound, it follows that this condition holds with probability at least $1 - (1/10 + 1/100 + 1/8 + O(n^{-c})) > 3/4$.
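The structural facts about the top two scaled coordinates can be checked empirically; the following sketch assumes the simplest case $F = (1, \ldots, 1) \in \mathbb{R}^N$ and $p = 1$, with illustrative thresholds standing in for the $\Theta(\cdot)$ constants.

```python
import heapq, random

# Empirical check of Lemma 13's structural facts for F = all-ones, p = 1
# (so ||F||_p = N): the two largest values of z_k = 1 / E_k are both Theta(N)
# and their gap is Theta(N), with constant probability over the exponentials.
random.seed(3)
N, trials, good = 2_000, 1_000, 0
for _ in range(trials):
    e1, e2 = heapq.nsmallest(2, (random.expovariate(1.0) for _ in range(N)))
    z1, z2 = 1.0 / e1, 1.0 / e2
    # illustrative stand-ins for "Theta(||F||_p) with suitable constants"
    if z1 <= 100 * N and z2 >= N / 100 and z1 - z2 >= N / 100:
        good += 1
print("fraction with Theta(N) top two and Theta(N) gap:", good / trials)
```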
Theorem 12. Given any constant $c \geq 2$, any $\nu \geq n^{-c}$, and any $0 < p \leq 2$, there is a one-pass $L_p$ sampler which returns an index $i \in [n]$ such that $\mathbf{Pr}[i = j] = \frac{|f_j|^p}{\|f\|_p^p}(1 \pm \nu) + n^{-c}$ for all $j \in [n]$, and which fails with probability at most $\delta$, for any given $\delta > 0$. The space required is $O(\log^2(n)\log(1/\delta)(\log\log n)^2)$ bits for $p < 2$, and $O(\log^3(n)\log(1/\delta)(\log\log n)^2)$ bits for $p = 2$.

Proof. We claim that, conditioned on not failing, we have $i^* = \arg\max_i \{y_i\} = \arg\max_i \{|z_i|\}$. First, condition on the success of our count-sketch estimator and on the guarantees of our estimate $R$, which together occur with probability $1 - O(n^{-c})$. Since the gap between the two largest coordinates of $y$ is at least $100\epsilon\mu R \geq 50\epsilon\|\zeta\|_2 \geq 20\|y - |\zeta|\|_\infty$ ($20$ times the additive error in estimating $|\zeta|$), it cannot be the case that the index of the maximum coordinate of $y$ differs from the index of the maximum coordinate (in absolute value) of $\zeta$; moreover, both $y$ and $\zeta$ must have a unique maximizer. Then we have $|\zeta_{i^*}| - |\zeta_{(2)}| = |\zeta_{(1)}| - |\zeta_{(2)}| > 20\epsilon\|\zeta\|_2$, and since $z_i = (1 \pm O(\nu))\zeta_i$ for all $i$, we have $\||\zeta| - |z|\|_\infty \leq O(\nu)\|\zeta\|_2$. Scaling $\nu$ down by a factor of $\epsilon = \Omega(1/\sqrt{\log(n)})$ (which is absorbed into the $\tilde{O}(\nu^{-1})$ update time), the gap between the top two items of $\zeta$ is $18$ times larger than the additive error in estimating $z$ via $\zeta$. Thus we must have $i^* = \arg\max_i\{|\zeta_i|\} = \arg\max_i\{|z_i|\}$, which completes the proof of the claim.

Now Lemma 12 states, for any $i_j \in [n^c]$, that $\mathbf{Pr}[\neg\mathrm{FAIL} \mid i_j = \arg\max_{i',j'}\{|z_{i'_{j'}}|\}] = \mathbf{Pr}[\neg\mathrm{FAIL}] \pm O(\log(n)\nu) = q \pm O(\log(n)\nu)$, where $q = \mathbf{Pr}[\neg\mathrm{FAIL}] = \Omega(1)$ is a fixed constant (by Lemma 13) which does not depend on any of the randomness in the algorithm. Since, conditioned on not failing, we have $\arg\max_i\{y_i\} = \arg\max_i\{|z_i|\}$, the probability that we output $i_j \in [n^c]$ is $\mathbf{Pr}[\neg\mathrm{FAIL} \cap \{y_{i_j} \text{ is the max in } y\}] = \mathbf{Pr}[\neg\mathrm{FAIL} \cap \{|z_{i_j}| \text{ is the max in } |z|\}]$ (conditioned on the high probability events of the prior paragraph), so the probability that our final algorithm outputs $i \in [n]$ is
$$\sum_{j \in [n^{c-1}]} \mathbf{Pr}\big[\neg\mathrm{FAIL} \;\big|\; i_j = \arg\max_{i',j'}\{|z_{i'_{j'}}|\}\big]\, \mathbf{Pr}\big[i_j = \arg\max_{i',j'}\{|z_{i'_{j'}}|\}\big] = \sum_{j \in [n^{c-1}]} \frac{|f_i|^p}{\|F\|_p^p}\big(q \pm O(\log(n)\nu)\big) = \frac{|f_i|^p}{\|f\|_p^p}\big(q \pm O(\log(n)\nu)\big)$$
The potential failure of the various high probability events that we conditioned on adds only another additive $O(n^{-c})$ term to the error. Thus, conditioned on an index $i$ being returned, we have $\mathbf{Pr}[i = j] = \frac{|f_j|^p}{\|f\|_p^p}(1 \pm O(\log(n)\nu)) \pm n^{-c}$ for all $j \in [n]$, which is the desired result after rescaling $\nu$ down by a factor of $\Omega(1/\log(n))$ (we need only scale down by $\Omega(1/\sqrt{\log(n)})$ after already rescaling by $\epsilon = \Theta(1/\sqrt{\log(n)})$ when $p = 2$). Running the algorithm $O(\log(\delta^{-1}))$ times in parallel, it follows that at least one index is returned with probability $1 - \delta$.

Theorem 13 shows that the entire algorithm can be derandomized to use a random seed of $O(\log^2(n)(\log\log n)^2)$ bits for $p < 2$ and $O(\log^3(n)(\log\log n)^2)$ bits for $p = 2$, which dominates the space required to store the sketches of the sampling algorithm themselves. Repeating $O(\log(1/\delta))$ times to obtain failure probability $\delta$ gives the stated space bounds.
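The last calculation rests on the basic identity that the argmax of the exponentially scaled coordinates is a perfect $L_p$ sample; this can be verified empirically (the toy vector and seed below are assumptions):

```python
import random
from collections import Counter

# Empirical check: if z_k = |f_k| / E_k^{1/p} with i.i.d. E_k ~ Exp(1), then
# argmax_k z_k equals k with probability exactly |f_k|^p / ||f||_p^p.
random.seed(4)
f, p, trials = [1.0, 2.0, 3.0], 2.0, 200_000
total = sum(abs(x) ** p for x in f)

counts = Counter()
for _ in range(trials):
    z = [abs(x) / random.expovariate(1.0) ** (1.0 / p) for x in f]
    counts[max(range(len(f)), key=z.__getitem__)] += 1

for k, x in enumerate(f):
    print(k, "empirical:", round(counts[k] / trials, 4),
          "target:", round(abs(x) ** p / total, 4))
```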
Remark 2. Using roughly the same update procedures and a similar analysis as in Section 5, one can implement the above $L_p$ sampling algorithm with $\tilde{O}(\nu^{-1})$ update time and $\tilde{O}(1)$ report time, just as in Theorem 2. The only differences are the use of Rademacher $\{1, -1\}$ variables in the count-sketch instead of Gaussians, and the change making the variables $g_{i,j,k}$ fully independent. These Rademacher variables are easier to handle, as one can simply compute, for a given bucket $A_{i,j}$ of count-sketch, the number of items which hash into that bucket with a $+1$ sign and the number with a $-1$ sign. The independent variables $g_{i,j,k}$ can be handled in Fast-Update by modifying the procedure to draw a binomial determining how many items hash into each bucket $A_{i,j}$, independently for each $j \in [6/\epsilon^2]$ (see the sketch below). This is in contrast to the Fast-Update of Figure 5, which only allows an item to be hashed into a single bucket in each row of $A$. In other words, we change Figure 5 to deal with the modified variables $g_{i,j,k}$ by simply removing step 1(d), which decrements the value of $W_k$, the counter of items left to be hashed in row $k$ of $A$.

To show that the output of this algorithm is the same when only searching through a subset $K$ of the coordinates (where $K$ is as in Section 5) for the maximizers $y_{(1)}, y_{(2)}$, observe that the test $y_{(2)} \geq 50\epsilon\mu R$ enforces that, conditioned on not failing, both $y_{(1)}$ and $y_{(2)}$ are large enough to be contained in the set $K$. Thus we can safely implement the Fast-Update procedure to obtain the improved update time, and the ExpanderSketch of Theorem 8 to obtain the improved query time.
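A minimal sketch of this modified Fast-Update step, under the assumption that $W$ identical undetermined items of a common value are being hashed into one row; the helper names and the naive binomial sampler are illustrative.

```python
import random

# Sketch of the modified Fast-Update from this remark: for a row holding W
# identical undetermined items of a common value, the number landing in bucket
# (i, j) is Binomial(W, eps^2/6), drawn independently per bucket, and the
# Rademacher signs reduce to one more binomial draw per bucket.

def binomial(n_trials, q):
    """Naive Binomial(n_trials, q) sampler, kept simple for illustration."""
    return sum(random.random() < q for _ in range(n_trials))

def fast_update_row(row, W, value, eps):
    """Hash W identical items of the given value into one row of buckets."""
    q = eps ** 2 / 6
    for j in range(len(row)):
        m = binomial(W, q)                    # how many of the W items hit bucket j
        m_plus = binomial(m, 0.5)             # how many of those receive sign +1
        row[j] += (2 * m_plus - m) * value    # (#plus - #minus) * value

random.seed(5)
buckets = [0.0] * 24
fast_update_row(buckets, W=1000, value=1.0, eps=0.5)
print(buckets[:6])
```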
B Derandomizing the Original Algorithm
We now show how our original algorithm can be derandomized using the same techniques as in Section 5. For this section, we let $B \in \mathbb{R}^{O(\log(n))}$ be the sketch stored for the high probability $L_2$ estimation used in the $L_p$ sampler, as in Lemma 11. Note that $B = G \cdot \zeta$, where $G$ is a matrix of i.i.d. Gaussian variables.
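The derandomization below treats the entire data structure as one linear sketch applied to $\zeta$; a small dense illustration of the stacking $Z \cdot \zeta = [\mathrm{vec}(A); \mathrm{vec}(B)]$, with all dimensions and the seed assumed for illustration:

```python
import random

# A small dense illustration of the stacked linear sketch Z * zeta =
# [vec(A); vec(B)]: count-sketch rows (random signs thinned by the bucket
# indicators) on top of the Gaussian rows of the L2-estimation sketch B.
random.seed(6)
n, cs_rows, g_rows, eps = 50, 30, 20, 0.5
q = eps ** 2 / 6                                  # Pr[h_{i,j,k} = 1]

Z_cs = [[random.choice((-1, 1)) if random.random() < q else 0 for _ in range(n)]
        for _ in range(cs_rows)]                  # count-sketch randomness r_c
Z_g = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(g_rows)]
Z = Z_cs + Z_g                                    # the fixed matrix Z

zeta = [random.gauss(0.0, 1.0) for _ in range(n)]
sketch = [sum(row[k] * zeta[k] for k in range(n)) for row in Z]
vec_A, vec_B = sketch[:cs_rows], sketch[cs_rows:]
print(len(vec_A), "count-sketch entries,", len(vec_B), "L2-sketch entries")
```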
Theorem 13. The algorithm of Section A.2 can be derandomized to run in $O(\log^2(n)\log(1/\delta)(\log\log(n))^2)$ bits of space for $p < 2$, and $O(\log^3(n)\log(1/\delta)(\log\log(n))^2)$ bits of space for $p = 2$.

Proof. We use the same notation $A(r_e, r_c)$ as in Theorem 7. Recall here that $r_e$ is the randomness required for the exponentials, and $r_c$ is the randomness required for count-sketch (and now $r_c$ must also include the randomness required for the $L_2$ estimation sketch $B$). For any fixed randomness $r_c$, let $A_{r_c}(r_e)$ be the tester which tests whether our $L_p$ sampler would output the index $i$, where now the bits $r_c$ are hard-coded into the tester, and the random bits $r_e$ for the exponentials are taken as input.

Now note that the entire sketch stored by our algorithm can be written as $Z \cdot \zeta$, where $Z \in \mathbb{R}^{O(\log(n)/\epsilon^2) \times n^c}$ is a fixed matrix defined by the count-sketch randomness $r_c$, and $\zeta$ is the scaled (by inverse exponentials) and rounded stream vector of the algorithm. Here $Z \cdot \zeta = [\mathrm{vec}(A); \mathrm{vec}(B)]$, where $\mathrm{vec}(A)$ denotes the vectorization of the count-sketch matrix $A$ (and respectively $B$), and $[x; y]$ is the vector which stacks $x$ on top of $y$. Note that we can pull the scalings by $F$ into the matrix $Z$ (making it into a new fixed matrix $Z'$), so our sketch can be written as $Z' \cdot t$, where for $j \in [n^c]$ we have $t_j = \mathrm{rnd}_\nu(1/E_j^{1/p})$ and the $E_j$'s are the i.i.d. exponentials.

Since we are rounding the exponentials to powers of $(1 + \nu)$ anyway, we can restrict the coordinates of $t$ to a discrete support of size $O(\mathrm{poly}(n))$ such that each value occurs with probability at least $1/\mathrm{poly}(n)$, for a suitably large $\mathrm{poly}(n)$. This allows us to sample the variables $\mathrm{rnd}_\nu(1/E_i^{1/p})$ using $O(\log(n))$ bits of space, as needed for Lemma 7. Thus our entire algorithm requires $\mathrm{poly}(n)$ random bits to be generated for the exponentials. Similarly, for the random Gaussians used to estimate the $L_2$ in the sketch $B$, one can truncate them to $O(\log(n))$ bits, incurring only an additive $n^{-c}$ error in these buckets, which can be absorbed into the adversarial error already handled in Lemma 12. Restricting the support of the Gaussians so that each value occurs with probability at least $1/\mathrm{poly}(n)$, it follows that these Gaussians can also be sampled using $O(\log(n))$ bits each. The only remaining randomness consists of the random signs $g_{i,j,k}$ and the indicators $h_{i,j,k}$ in count-sketch, each of which has a support of size $2$ and can be sampled with $O(\log(n))$ bits. So using Lemma 7, we can fool the tester which tests whether $Z' \cdot t = y$, for any $y$ of $O(\log(n))$-bounded bit complexity, using a seed of $O(\log^2(n)(\log\log n)^2)$ bits (and $O(\log^3(n)(\log\log n)^2)$ bits for $p = 2$). Then, as in Theorem 5, since we can fool $\mathbf{Pr}[Z' \cdot t = y]$, we can also fool any tester which takes as input $y = Z' \cdot t$ and outputs whether or not our algorithm would output $i \in [n]$ on input $y$. Thus, if $G_1(x_1)$ is one instance of the PRG from Lemma 7, we have $\mathbf{Pr}[A_{r_c}(r_e)] \sim_{n^{-O(\log n)}} \mathbf{Pr}[A_{r_c}(G_1(x_1))]$, and similarly, as in Theorem 7,
$$\mathbf{Pr}[A(r_e, r_c)] = \sum_{r_c} \mathbf{Pr}[A_{r_c}(r_e)]\, \mathbf{Pr}[r_c] = \sum_{r_c} \big(\mathbf{Pr}[A_{r_c}(G_1(x_1))] \pm n^{-O(\log n)}\big)\, \mathbf{Pr}[r_c]$$
$$= \sum_{r_c} \mathbf{Pr}[A_{r_c}(G_1(x_1))]\, \mathbf{Pr}[r_c] \pm \sum_{r_c} n^{-O(\log n)}\, \mathbf{Pr}[r_c] \sim_{n^{-O(\log n)}} \mathbf{Pr}[A(G_1(x_1), r_c)]$$
Now fix any seed $G_1(x_1)$, and consider $A_{G_1(x_1)}(r_c)$, which on fixed exponential randomness $G_1(x_1)$ and fresh count-sketch randomness $r_c$ tests whether our algorithm would output $i \in [n]$.
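As an aside, the discretization described above can be pictured with the following toy helpers; the names $\mathrm{rnd}_\nu$ and the bit-truncation routine are our own illustrative stand-ins, not the paper's definitions.

```python
import math, random

# Toy picture of the discretization used in this proof: rnd_nu rounds a
# positive value down to a power of (1 + nu), and a Gaussian is truncated to
# O(log n) binary digits, so each variable has a poly(n)-size support.
def rnd_nu(value, nu):
    """Round a positive value down to a power of (1 + nu)."""
    e = math.floor(math.log(value) / math.log(1.0 + nu))
    return (1.0 + nu) ** e

def truncate_bits(x, bits):
    """Keep `bits` binary digits after the point (additive error < 2^-bits)."""
    scale = 2 ** bits
    return math.floor(x * scale) / scale

random.seed(7)
n, c, nu, p = 1024, 3, 0.01, 1.0
bits = c * int(math.log2(n))                    # O(log n) bits of precision
E = random.expovariate(1.0)
print("rounded 1 / E^(1/p):", rnd_nu(1.0 / E ** (1.0 / p), nu))
print("truncated Gaussian :", truncate_bits(random.gauss(0.0, 1.0), bits))
```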
Note that this tester simply maintains the same sketch $Z \cdot \zeta = [\mathrm{vec}(A); \mathrm{vec}(B)]$ as above, and that the entries of $Z$ are of two forms: the i.i.d. count-sketch randomness and the i.i.d. Gaussians needed for the sketch $B$. By Theorem 5, we can derandomize both of these separately via two more instances $G_2(x_2), G_3(x_3)$ of the PRG of Lemma 7, each using a seed $x_2, x_3$ of $O(\log^2(n)(\log\log n)^2)$ bits for $p < 2$, and $O(\log^3(n)(\log\log n)^2)$ bits for $p = 2$. So if $Z_1$ is the first set of rows of $Z$, corresponding to the count-sketch randomness, and $Z_2$ is the remaining set of rows, containing the i.i.d. Gaussians, then for all $y, y'$ of $O(\log(n))$-entrywise bounded bit complexity we have $\mathbf{Pr}[Z_1 \cdot \zeta = y] \sim_{n^{-O(\log n)}} \mathbf{Pr}[G_2(x_2) \cdot \zeta = y]$ and $\mathbf{Pr}[Z_2 \cdot \zeta = y'] \sim_{n^{-O(\log n)}} \mathbf{Pr}[G_3(x_3) \cdot \zeta = y']$. Here we are abusing notation and thinking of the PRG outputs $G_2(x_2), G_3(x_3)$ as being formed into the matrices which they define.

Since $G_2(x_2)$ is independent of $G_3(x_3)$, for any $y$ of $O(\log(n))$-entrywise bounded bit complexity we have $\mathbf{Pr}[Z \cdot \zeta = y] \sim_{n^{-O(\log n)}} \mathbf{Pr}[[G_2(x_2); G_3(x_3)] \cdot \zeta = y]$. Thus we fool the entire tester $A_{G_1(x_1)}(r_c)$ with $A_{G_1(x_1)}(G_2(x_2) \cup G_3(x_3))$, meaning $\mathbf{Pr}[A_{G_1(x_1)}(r_c)] \sim_{n^{-O(\log n)}} \mathbf{Pr}[A_{G_1(x_1)}(G_2(x_2) \cup G_3(x_3))]$, and by an averaging argument similar to the one above, we have $\mathbf{Pr}[A(G_1(x_1), r_c)] \sim_{n^{-O(\log n)}} \mathbf{Pr}[A(G_1(x_1), G_2(x_2) \cup G_3(x_3))]$. Thus $\mathbf{Pr}[A(r_e, r_c)] \sim_{n^{-O(\log n)}} \mathbf{Pr}[A(G_1(x_1), G_2(x_2) \cup G_3(x_3))]$, so the three seeds $x_1, x_2, x_3$ together suffice to run the entire algorithm, giving the stated space bounds.