Annotations for Sparse Data Streams
Amit Chakrabarti ∗ Graham Cormode † Navin Goyal ‡ Justin Thaler § August 27, 2018
Abstract
Motivated by the surging popularity of commercial cloud computing services, a number of recent works have studied annotated data streams and variants thereof. In this setting, a computationally weak verifier (cloud user), lacking the resources to store and manipulate his massive input locally, accesses a powerful but untrusted prover (cloud service). The verifier must work within the restrictive data streaming paradigm. The prover, who can annotate the data stream as it is read, must not just supply the final answer but also convince the verifier of its correctness. Ideally, both the amount of annotation from the prover and the space used by the verifier should be sublinear in the relevant input size parameters.
A rich theory of such algorithms, which we call schemes, has started to emerge. Prior work has shown how to leverage the prover’s power to efficiently solve problems that have no non-trivial standard data stream algorithms. However, even though optimal schemes are now known for several basic problems, such optimality holds only for streams whose length is commensurate with the size of the data universe. In contrast, many real-world data sets are relatively sparse, including graphs that contain only $o(n^2)$ edges, and IP traffic streams that contain far fewer than the total number of possible IP addresses, $2^{128}$ in IPv6.
Here we design the first annotation schemes that allow both the annotation and the space usage to be sublinear in the total number of stream updates rather than the size of the data universe. We solve significant problems, including variations of INDEX, SET-DISJOINTNESS, and FREQUENCY-MOMENTS, plus several natural problems on graphs. On the other hand, we give a new lower bound that, for the first time, rules out smooth tradeoffs between annotation and space usage for a specific problem. Our technique brings out new nuances in Merlin–Arthur communication complexity models, and provides a separation between online versions of the MA and AMA models.
The surging popularity of commercial cloud computing services has rendered the following scenario increasingly plausible. A business, call it AliceSystems, processes billions or trillions of transactions a day. The volume is sufficiently high that AliceSystems cannot or will not store and process the transactions on its own. Instead, it offloads the processing to a commercial cloud computing service.
The offloading of any computation raises issues of trust. AliceSystems may be concerned about relatively benign errors: perhaps the cloud dropped some of the transactions, executed a buggy algorithm, or experienced an uncorrected hardware fault. Alternatively, AliceSystems may be more cautious and fear that the cloud operator is deliberately deceptive or has been externally compromised. Either way, each time
∗ Department of Computer Science, Dartmouth College. Supported in part by NSF grant CCF-1217375.
† AT&T Labs-Research.
‡ Microsoft Research India.
§ School of Engineering and Applied Sciences, Harvard University. Supported by an NSF Graduate Research Fellowship and NSF grants CNS-1011840 and CCF-0915922.
verifier (modeling AliceSystems in the above scenario), who lacks the resources to store the entire input locally, is given access to a powerful but untrusted prover (modeling the cloud computing service). The verifier must execute within the confines of the restrictive data streaming paradigm, i.e., it must process the input sequentially in whatever order it arrives, using space that is substantially sublinear in the total size of the input. The prover is allowed to annotate the data stream as it is read, with the goal of convincing the verifier of the correct answer.
The streaming restriction for the verifier fits the cloud computing setting well, as the verifier’s streaming pass over the input can occur while uploading data to the cloud.
Prior work [2, 7, 9, 10, 22, 24] has provided considerable understanding of the power of annotated data streams, revealing a surprisingly rich theory. A number of fundamental problems that possess no non-trivial algorithms in the standard streaming model do have efficient schemes when the data stream may be annotated by a prover: the term “scheme” refers to an algorithm involving verifier-prover interaction as above. By exploiting powerful algebraic techniques originally developed in the literature on interactive proofs [18, 26], these works have achieved essentially optimal tradeoffs between annotation size and the space usage of the verifier for problems ranging from frequency moments to bipartite perfect matching.
However, these schemes are only optimal for streams for which the total number of updates is large relative to the size of the data universe. In contrast, many real-world data sets are sparse: for example, many real-world graphs, though large, contain far fewer than the maximum possible number $\binom{n}{2}$ of edges, and IP traffic streams contain far fewer than the total number of possible IP addresses, $2^{128}$ in IPv6.
In this paper, we give the first schemes in the annotations model that allow both the annotation size and space usage to be sublinear in the number of items with non-zero frequency in the data stream, rather than the size of the data universe n. On the negative side, we also give a new lower bound that for the first time rules out smooth tradeoffs between annotation size and space usage for a specific problem. The latter result is derived from a new lower bound in the Merlin–Arthur (MA) communication model that may be of independent interest.
Aaronson and Wigderson [2] gave a beautiful MA communication protocol for the
SET-DISJOINTNESS problem (henceforth, DISJ) using algebraic techniques analogous to those in the famous “sum-check protocol” from the world of interactive proofs and probabilistically checkable proofs [18]. Their protocol is nearly optimal, essentially matching a lower bound of Klauck [22]. The Aaronson–Wigderson protocol has served as the starting point for many schemes for annotated data streams. We will refer to such schemes as sum-check schemes; a typical example is Proposition 4.1 in this work.
Aaronson [1] studied the hardness of the INDEX problem in a restricted version of the MA communication model, as well as in a quantum variant of this model. His classical model is similar to the online MA communication model that we consider.
Annotated data streams were introduced by Chakrabarti et al. [7], and studied further by Cormode et al. [9]. These two papers gave essentially optimal annotation schemes for problems ranging from exact computation of Heavy Hitters and Frequency Moments to graph problems like Bipartite Perfect Matching and Shortest s-t Path. Cormode, Thaler and Yi [11] later extended the annotations model to allow the prover and verifier to have a conversation, and dubbed this interactive model streaming interactive proofs. They demonstrated that streaming interactive proofs can have exponentially smaller space and communication costs than annotated data streams, and showed that a number of powerful protocols from the literature on interactive proofs can be made to work with streaming verifiers; in particular, this applies to a powerful general-purpose interactive proof protocol due to Goldwasser, Kalai, and Rothblum [20]. Cormode, Mitzenmacher, and Thaler [10] implemented a number of protocols in both the annotated data streams and streaming interactive proof settings, demonstrating genuine scalability in many cases. In particular, they developed an implementation of the Goldwasser et al. protocol [20] that approaches practicality. Most relevant to our work on annotated data streams, Cormode, Mitzenmacher, and Thaler also used sophisticated FFT algorithms to drastically reduce the prover’s runtime in the sum-check schemes, which we make frequent use of.
Two recent works have considered variants of the annotated data stream model. Klauck and Prakash [24] study a restricted version of the annotations model in which the annotation must essentially end by the final stream update.
Gur and Raz [21] give protocols for a class of problems in a model that is similar to annotated data streams, but more powerful in that the verifier has access to both public and private randomness. This corresponds to the AMA communication model. We consider protocols in this model in Section 7.2.
Early work on interactive proof systems studied the power of space-bounded verifiers (the survey by Condon [8] provides a comprehensive overview), but many of the protocols developed in this line of work require the verifier to store the input, and therefore do not work in the annotations model, where the verifier must be streaming. An exception is work by Lipton [17], who relied on fingerprinting techniques to allow a log-space streaming verifier to ensure that the prover correctly plays back the transcript of an algorithm in an appropriate computational model. This approach does not lead to protocols with sublinear annotation length. More recently, Das Sarma et al. studied the “best order streaming model,” which can be thought of as the annotations model where the annotation is restricted to be a permutation of the input [13].
We give an informal overview of our results and the techniques we use to obtain them. Throughout, n will denote the size of the data universe and m the number of items with non-zero frequency at the end of a data stream (we refer to m as the “sparsity” of the stream). A scheme in which the streaming verifier uses at most $c_v$ bits of storage and requires at most $c_a$ bits of annotation from the prover is called a $(c_a, c_v)$-scheme. Section 2 defines our models of computation carefully and sets up terminology.
Section 3 contains our first set of results. We begin by precisely characterizing the complexity of the sparse POINTQUERY problem (a natural variant of the well-known INDEX problem from communication complexity), giving an $(x \log n, y \log n)$-scheme whenever $xy \ge m$. We give similar upper bounds for the related problems SELECTION and HEAVYHITTERS. We also prove a lower bound showing that any $(c_a, c_v)$-scheme for these problems requires $c_a c_v = \Omega(m \log(n/m))$, improving by a $\log(n/m)$ factor over lower bounds that follow from prior work on “dense” streams. By a dense stream we mean one where n is not much larger than m. This $\log(n/m)$ factor may seem minor, but a striking consequence is that the (very) sparse INDEX problem, where Alice’s n-bit string has Hamming weight $O(\log n)$, has one-way randomized communication complexity that is within a logarithmic factor of its online MA communication complexity. This implies that no non-trivial tradeoffs between Merlin’s and Alice’s message sizes are possible for this problem; to our knowledge this is the first problem that provably exhibits this phenomenon.
Our scheme for sparse POINTQUERY relies on universe reduction: the prover succinctly describes a mapping $h : [n] \to [r]$ that maps the input stream, which is defined over the huge data universe $[n]$, down to a derived stream defined over a smaller universe $[r]$. By design, if the prover is honest and the mapping h does not cause “too many collisions,” then the answer on the original stream can be determined from the answer on the derived stream. We then efficiently apply known schemes for dense streams to the derived stream.
For our lower bound in Section 3, we give a novel reduction from the standard (dense) INDEX problem to sparse
INDEX that is tailored to the MA communication model. We then apply known lower bounds for dense INDEX. Our technique also gives what is to our knowledge the first polynomial separation between the online MA and AMA communication complexities of a specific (and natural) problem.
For clarity, the remainder of this overview omits factors logarithmic in n and m when stating the costs of schemes. Though these factors are important for Section 3 (the consequences of our lower bound being most significant when $n = m^{\omega(1)}$), we anticipate that in practice n and m will usually be polynomially related.
Sections 4 and 5 contain our most interesting and technically involved results, namely, efficient schemes for
SIZE-m-SET-DISJOINTNESS (henceforth, m-DISJ) and kth Frequency Moments (henceforth, $F_k$). The schemes here are substantially more complex than those in Section 3 and represent the main technical contributions of this paper.
Section 4 gives $(m^{2/3}, m^{2/3})$-schemes for both problems, but the schemes rely on “prescient” annotation, i.e., annotation provided at the start of the stream that depends on the stream itself. The even more complex schemes of Section 5 eliminate the need for prescient annotation and also achieve much more general tradeoffs between annotation length and space usage. Specifically, Section 5 gives $(m c_v^{-1/2}, c_v)$-schemes for m-DISJ and $F_k$ for any $c_v < m$. Notice that one recovers the costs achieved in Section 4 by setting $c_v = m^{2/3}$.
These schemes are the first for these problems that allow both the annotation length and space usage to be sublinear in m. At a very high level, there are three interlocking ideas that allow us to achieve this.
1. The first idea is a careful application of universe reduction. We were able to use a simple version of this idea to derive the upper bound for the POINTQUERY problem in Section 3, but in the case of DISJ and $F_k$ the universe-reduction mapping $h : [n] \to [r]$ specified by the prover is more complicated, and requires refinement in the form of the additional ideas described below.
2. The second idea addresses how to ensure that the prover performed the universe-reduction step in an honest manner, in the sense that the answer on the original stream can indeed be determined from the answer on the derived stream. The difficulty of ensuring P is honest varies depending on the structure of the problem at hand. For $F_k$, the verifier has to make sure that the universe-reduction mapping h is injective on the items appearing in the data stream. This requires developing an efficient way for V to detect collisions under h, even though V does not have the space to store all of the values $h(x_i)$ for stream updates $x_i$. For m-DISJ, a notion weaker than injectiveness is sufficient.
3. The third idea pertains to allowing P to specify the universe-reduction mapping h online. That is, for many problems it would be much simpler if P could determine the mapping h in advance, i.e., if P could be prescient, and send h to V at the start of the stream so that V can determine the derived “mapped-down” stream on her own (this is the approach taken in Section 4). When P must specify h in an online fashion, additional insight is required. At a high level, our approach is to have P specify a “guess” as to the right hash function at the beginning of the stream, and retroactively modify the hash function after the stream has been observed. The challenging aspect of this approach is to ensure that P’s retroactive modification of the hash function is consistent with the observed data stream, even though V cannot refer back to the stream to enforce this.
We exploit similar ideas to allow V to avoid storing the universe-reduction mapping h herself; this is the key to achieving general tradeoffs between annotation length and space usage in Section 5.
In some schemes, storing this mapping h would be the bottleneck in V’s space usage. We show how V can store only a partial description of h, and ask P to fill in the remainder of the description when necessary.
Section 6 exploits all of these results, applying them to several graph problems, including counting triangles and demonstrating a perfect matching. Our schemes have costs that depend on the number of edges in the graph, rather than the total number of possible edges, and demonstrate that the ideas underlying our m-DISJ and $F_k$ schemes are broadly applicable. We state clearly how our schemes improve over prior work throughout.
Section 7 considers a more general stream update model, which allows items to have negative frequencies. These negative frequencies potentially break the “collision detection” sub-protocol used in the previous sections, so we show how to exploit a source of public randomness to allow these protocols to be carried out. Essentially, the public randomness specifies a remapping of the input, so that the prover is highly unlikely to be able to use negative frequencies to “hide” collisions. Because the protocols of Section 7 require public randomness, they work in the AMA communication and streaming models, as opposed to the MA models in which all of our other protocols operate.
Many of the algorithms (schemes) in this paper use randomization in subtle ways, making it important to properly formalize several models of computation. We begin with Merlin–Arthur communication models, a topic first studied by Babai, Frankl and Simon [3], which we eventually use to derive lower bounds. We then turn to annotated data stream models. At the end of the section we set up some notation and terminology for the rest of the paper. Some of our discussion in this section borrows from prior work [7].
Let $F : X \times Y \to \{0,1\}$ be a function, where X and Y are both finite sets. This naturally gives a 2-player number-in-hand communication problem, where the first player, Alice, holds an input $x \in X$, and the second player, Bob, holds an input $y \in Y$. The players wish to compute $F(x, y)$ by executing a (possibly randomized) communication protocol that correctly outputs $F(x, y)$ with “high” probability. In Merlin–Arthur communication, there is additionally a “super-player,” called Merlin, who knows the entire input $(x, y)$, and can help Alice and Bob by interacting with them. The precise pattern of interaction matters greatly and gives rise to distinct models. Merlin’s goal is to get Alice and Bob to output “1” regardless of the actual value of $F(x, y)$, and so Merlin is not to be blindly trusted.
One important departure we make from prior work is that we allow Merlin to use private random coins during the protocol. Most prior work on MA (and AM) communication [3, 22, 23] defined Merlin to be deterministic, which does not make a difference in the basic setting. But in this work we are concerned with “online MA” models, where the distinction does matter, and these online MA models are in close correspondence with the annotated data stream models that are our eventual topic of study.
MA Communication.
In a Merlin–Arthur protocol (henceforth, “MA protocol”) for F, Merlin begins by sending a help message $h(x, y, r_M)$, using a private random string $r_M$, that is seen by both Alice and Bob. Then Alice and Bob (the pair that constitutes the entity “Arthur”) run a randomized communication protocol P, using a public random string $r_A$, eventually outputting a bit $\mathrm{out}^P(x, y, r_A, h)$. Importantly, $r_A$ is not known to Merlin at the time he sends h. The protocol P is $\delta_s$-sound and $\delta_c$-complete if there exists a function $h : X \times Y \times \{0,1\}^* \to \{0,1\}^*$ such that the following conditions hold.
1. If $F(x, y) = 1$ then $\Pr_{r_M, r_A}[\mathrm{out}^P(x, y, r_A, h(x, y, r_M)) = 0] \le \delta_c$.
2. If $F(x, y) = 0$ then $\forall h' \in \{0,1\}^* : \Pr_{r_A}[\mathrm{out}^P(x, y, r_A, h') = 1] \le \delta_s$.
We define $\mathrm{err}(P)$ to be the minimum value of $\max\{\delta_s, \delta_c\}$ such that the above conditions hold. Following [7], we define the help cost $\mathrm{hcost}(P)$ to be $1 + \max_{x, y, r_M} |h(x, y, r_M)|$ (forcing $\mathrm{hcost} \ge 1$, even for traditional Merlin-free protocols), and the verification cost $\mathrm{vcost}(P)$ to be the maximum number of bits communicated by Alice and Bob over all x, y and $r_A$. We define $\mathrm{MA}_\delta(F) = \min\{\mathrm{vcost}(P) + \mathrm{hcost}(P) : P$ is an MA protocol for F with $\mathrm{err}(P) \le \delta\}$, and $\mathrm{MA}(F) = \mathrm{MA}_{1/3}(F)$.
Online MA Communication. An online MA protocol is defined to be an MA protocol, as above, but with the communication pattern required to obey the following sequence. (1) Input x is revealed to Alice and Merlin; (2) Merlin sends Alice a help message $h_1(x, r_M)$ using a private random string $r_M$; (3) Input y is revealed to Bob; (4) Merlin sends Bob a help message $h_2(x, y, r_M)$; (5) Alice sends a public-coin randomized message to Bob, who then gives a 1-bit output. We see this model as the natural MA variant of one-way communication, and the analogy with the gradual revelation of a streamed input should be obvious.
For such a protocol P, we define $\mathrm{hcost}(P)$ to be $1 + \max_{x, y, r_M} (|h_1(x, r_M)| + |h_2(x, y, r_M)|)$. We define soundness, completeness, $\mathrm{err}(P)$, and $\mathrm{vcost}(P)$ as for MA. Define $\mathrm{MA}^{\to}_\delta(F) = \min\{\mathrm{hcost}(P) + \mathrm{vcost}(P) : P$ is an online MA protocol for F with $\mathrm{err}(P) \le \delta\}$ and write $\mathrm{MA}^{\to}(F) = \mathrm{MA}^{\to}_{1/3}(F)$.
Online AMA Communication.
An online AMA protocol is a souped-up version of an online MA protocol, where public random coins can be tossed at the start, before any input is revealed. The number of such coin tosses is added to the vcost of the protocol. This models the cost of an initial round of communication between Arthur (i.e., Alice + Bob) and Merlin. Note that the second public random string, used when Alice talks to Bob, does not count towards the vcost.
On Merlin’s Use of Randomness.
In an MA protocol, Merlin can deterministically choose a help message that maximizes Arthur’s acceptance probability. However, Merlin cannot do so in the online MA model, because he does not know the entire input when he talks to Alice. This is why we allow Merlin to use randomness in these definitions.
Two recent papers [7, 24] use “online MA” to mean a more restrictive model where a deterministic Merlin talks only to Bob and not to Alice. With Merlin required to be deterministic, this communication restriction is irrelevant, as Merlin cannot tell Alice anything she does not already know. However, we permit Merlin to be probabilistic, and in this case we do not know that Merlin can avoid talking to Alice.
As noted earlier, our goal in defining the communication models this way is to closely correspond to annotated data stream models. In many of our online schemes (see, e.g., Section 5), the helper provides initial annotation that specifies a random “hash” function, h, and the completeness guarantee of the subsequent protocol depends crucially on h having “low collision” properties. Since h must be chosen without seeing all of the input, such low collision properties cannot be guaranteed by picking a fixed h in advance. However, if the helper chooses h at random, then we do have such guarantees for each fixed input, with high probability.
We now define our annotated data stream models. Recall that a (traditional) data stream algorithm computes a function F of an input sequence $x \in U^N$, where N is the number of stream updates, and U is some data universe, such as $\{0,1\}^b$ or $[n] = \{0, \ldots, n-1\}$: the algorithm uses a limited amount of working memory and has access to a random string. The function F may or may not be Boolean.
An annotated data stream algorithm, or a scheme, is a pair $A = (h, V)$, consisting of a help function $h : U^N \times \{0,1\}^* \to \{0,1\}^*$ used by a prover (henceforth, P) and a data stream algorithm run by a verifier, V.
Prover P provides $h(x, r_P)$ as annotation to be read by V. We think of h as being decomposed into $(h_1, \ldots, h_N)$, where the function $h_i : U^N \to \{0,1\}^*$ specifies the annotation supplied to V after the arrival of the ith token $x_i$. That is, h acts on x (using $r_P$) to create an annotated stream $x^{h, r_P}$ defined as follows: $x^{h, r_P} := (x_1, h_1(x, r_P), x_2, h_2(x, r_P), \ldots, x_N, h_N(x, r_P))$. Note that this is a stream over
$U \cup \{0,1\}$, of length $N + \sum_i |h_i(x, r_P)|$. The streaming verifier V, who uses w bits of working memory and has oracle access to a (private) random string $r_V$, then processes this annotated stream, eventually giving an output $\mathrm{out}^V(x^{h, r_P}, r_V)$.
Prescient Schemes. The scheme $A = (h, V)$ is said to be $\delta_s$-sound and $\delta_c$-complete for the function F if the following conditions hold:
1. For all $x \in U^N$, we have $\Pr_{r_P, r_V}[\mathrm{out}^V(x^{h, r_P}, r_V) \ne F(x)] \le \delta_c$.
2. For all $x \in U^N$, $h' = (h'_1, h'_2, \ldots, h'_N) \in (\{0,1\}^*)^N$, we have $\Pr_{r_V}[\mathrm{out}^V(x^{h'}, r_V) \notin \{F(x)\} \cup \{\bot\}] \le \delta_s$.
If $\delta_c = 0$, the scheme satisfies perfect completeness; otherwise it has imperfect completeness. An output of “$\bot$” indicates that V rejects P’s claims in trying to convince V to output a particular value for $F(x)$.
We note two important things. First, the definition of a scheme allows the annotation $h_i(x, r_P)$ to depend on the entire stream x, thus modeling prescience: the advice from the prover can depend on data which the verifier has not seen yet. Second, P must convince V of the value of $F(x)$ for all x. This is stricter than the traditional definitions of interactive proofs and MA communication complexity (including our own, above) for decision problems, which place different requirements on the cases $F(x) = 0$ and $F(x) = 1$. In Section 6, we briefly consider a relaxed definition of schemes that is in the spirit of the traditional definition.
We define $\mathrm{err}(A)$ to be the minimum value of $\max\{\delta_s, \delta_c\}$ such that the above conditions are satisfied. We define the annotation length $\mathrm{hcost}(A) = \max_{x, r_P} \sum_i |h_i(x, r_P)|$, the total size of P’s communications, and the verification space cost $\mathrm{vcost}(A) = w$, the space used by the verifier V. We say that A is a prescient $(c_a, c_v)$-scheme if $\mathrm{hcost}(A) = O(c_a)$, $\mathrm{vcost}(A) = O(c_v)$ and $\mathrm{err}(A) \le 1/3$.
Online Schemes.
We call $A = (h, V)$ a $\delta$-error online scheme for F if, in addition to the conditions in the previous definition, each function $h_i$ depends only on $(x_1, \ldots, x_i)$. We define error, hcost, and vcost as above and say that A is an online $(c_a, c_v)$-scheme if $\mathrm{hcost}(A) = O(c_a)$, $\mathrm{vcost}(A) = O(c_v)$, and $\mathrm{err}(A) \le 1/3$.
Unlike prior work [7], we do not always assume that the universe size n and stream length N are polynomially related; it is possible that $\log N = o(\log n)$. Therefore we must be much more careful about logarithmic factors than in prior work. We do assume that $N < n$ always, because our focus is on sparse streams.
Notice that the help function can be made deterministic in a prescient scheme, but not necessarily so in an online scheme. This is directly analogous to the situation for MA and online MA communication models, as discussed at the end of Section 2.1.
AMA Schemes.
We also consider what we call AMA schemes, where there is a common source of public randomness, in addition to the verifier’s private random coins. The AMA scheme model is identical to the one considered by Gur and Raz [21], who referred to it as the “Arthur–Merlin streaming model.”
An online AMA scheme is identical to a (standard) online scheme, except that the data stream algorithm and help function both have access to a source of public random bits. The number of random bits used is also counted in both the hcost and the vcost of the scheme.
On Practicality and the Plausibility of Prescience.
Although our definition of a scheme allows annotation to be sent after each stream update, all the schemes we in fact design in this paper only require annotation before the start or after the end of the stream. As a practical matter, this avoids the need for fine-grained coordination between the annotation and the data stream.
Online annotation schemes have the appealing property that the prover need not “see into the future” to execute them; at any time t, the prover’s message only depends on stream updates that arrived before time t. While the online restriction appears most natural, prescient schemes may still be suitable in some settings, such as when P has already seen the full input prior to V beginning to read it. Consider a volunteer computing scenario where the verifier farms out many computations to volunteers, and only inspects a particular input if a volunteer has already looked at that input and claims to have found something interesting. In brief, in some settings the prover may naturally see the input before the verifier, and in this case a prescient scheme will be feasible.
See, for example, http://boinc.berkeley.edu/.
2.3 Relationship Between MA Protocols and Schemes
Any prescient (resp. online) $(c_a, c_v)$-scheme $A = (h, V)$ for a function F can be converted into an MA (resp. online MA) protocol for F in the natural way: Merlin sends the output of the ith help function $h_i$ to Alice (who receives a prefix of the input stream) or Bob, depending on which of the players possesses the ith piece of the input. Alice runs the streaming algorithm V on her input as well as any annotation she received, and sends the state of the algorithm to Bob. Bob uses this state to continue running V on his input and the annotation he received, and then outputs the end result. The hcost of this protocol is at most $c_a \log N$, since Merlin has to specify which stream update i each piece of annotation is associated with, and the vcost of this protocol is at most $c_v$.
Thus, lower bounds on usual (resp. online) MA communication protocols imply related lower bounds on the costs of prescient (resp. online) annotated data stream algorithms.
A data stream specifies an input x incrementally. Typically, x can be thought of as a vector (although more generally it may represent a graph or a matrix). Each update in the stream is of the form $(i, \delta)$ where $i \in U$ identifies an element of the universe, and $\delta \in \mathbb{Z}$ describes the change to the frequency of i. The frequency of universe item i is defined as $f_i(x) := \sum_{(j_k, \delta_k) \in x \,:\, j_k = i} \delta_k$. We refer to the vector $f(x) = (f_1(x), \ldots, f_n(x))$ as the frequency vector of x, where n denotes the size of the data universe.
We consider several different update models. In the most general update model, the non-strict turnstile model, the $\delta$ values may be negative, and so $f_i$ may also be negative. In the strict turnstile model, the $\delta$ values may be negative, but it is assumed that the frequencies $f_i$ always remain non-negative. In the insert-only model, the $\delta$ values must be non-negative. Orthogonal to these, in the unit-update version of each model, the $\delta$ values are assumed to have absolute value 1. Each of our results applies to a subset of these models, and we specify within the statement of each theorem which update models it applies to.
Throughout, n will denote the size of the data universe, N will denote the total number of stream updates, m will denote the total number of items with non-zero frequency at the end of the stream, and M will refer to the total number of distinct items that ever appear within some stream update. We will refer to N as the length of the stream, to m as the sparsity of the stream, and to M as the footprint of the stream. Notice that it is always the case that $m \le M \le N$. In the case of insert-only streams, $m = M$, but for streams in the (strict or general) turnstile models it is possible for m to be much smaller than M.
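The quantities N, M, and m can be made concrete with a small sketch. The function name `stream_statistics` and the sample updates are hypothetical, introduced only for illustration. Unlike a streaming verifier, this helper stores the whole frequency vector, so it shows the definitions rather than a space-efficient algorithm.

```python
from collections import defaultdict


def stream_statistics(updates):
    """Return (f, N, M, m) for a list of (item, delta) turnstile updates.

    f: final frequency of every item that ever appeared
    N: length (number of updates)
    M: footprint (distinct items appearing in some update)
    m: sparsity (items with non-zero final frequency)
    """
    freq = defaultdict(int)
    for item, delta in updates:
        freq[item] += delta          # f_i is the sum of deltas for item i
    N = len(updates)
    M = len(freq)                    # every item seen has an entry
    m = sum(1 for v in freq.values() if v != 0)
    return dict(freq), N, M, m


# A strict-turnstile example: item 7 is inserted and later deleted.
updates = [(7, +1), (3, +2), (7, -1), (9, +1), (3, -1)]
f, N, M, m = stream_statistics(updates)
print(f, N, M, m)
```

Here item 7 contributes to the footprint M but not to the sparsity m, which is exactly the gap between m and M that turnstile streams permit.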
Note also that while we talk about “sparse” streams, this refers to the relative size of n and m, not the absolute size. Indeed, we assume that m is typically large, too large for V to store the stream explicitly (else the problems can become trivial).
We often make use of fingerprint functions of streams, which enable a streaming verifier to test whether two large streams have the same frequency vector. The verifier chooses a fingerprint function $g(x)$ at random from some family of functions satisfying the property that (over the random selection of the function g),
$\Pr[g(x) = g(y) \mid f(x) \ne f(y)] < 1/p$
for a parameter p. Typically, $g(x)$ is an element of a finite field of size $\mathrm{poly}(p)$, and hence the number of bits required to store the value $g(x)$ (as well as g itself) is $O(\log p)$. Further, there are known constructions of fingerprint functions where $g(x)$ can be computed in space $O(\log p)$ by a streaming algorithm in the non-strict turnstile update model [7].
Our first result is an efficient online annotation scheme for the POINTQUERY problem, a generalization of the familiar INDEX problem.
Definition 3.1.
In the P
OINT Q UERY problem, the data stream x consists of a sequence of updates of theform ( i , d ) , followed by an index i , and the goal is to determine the frequency f i ( x ) = (cid:229) ( j k , d k ) ∈ x : j k = i d k .A prescient ( log n , log n ) -scheme for this problem is trivial as P can just tell V the index i at the start ofthe stream, and V can track the frequency of i while observing the stream. The vcost can be improved to O ( log m ) if V retains a hashed value of i , and tracks the frequency of matching updates. The first schemehas perfect completeness, while the second has completeness error polynomially small in m .The costs of the scheme below are in terms of the stream sparsity m , and not the stream length N orthe stream footprint M ; this is significant if m ≪ M , which is the case, e.g., for the well-known stragglerand set-reconciliation problems that have been studied in traditional streaming and communication models[14, 19]. Our lower bound in Theorem 3.9 shows our scheme is essentially optimal for moderate universesizes, i.e. when the universe size n is sub-exponential in the sparsity m . Theorem 3.2.
For any pair $(c_a, c_v)$ such that $c_a \cdot c_v \ge m$, there is an online $(c_a \log n, c_v \log n)$-scheme in the non-strict turnstile update model for the POINTQUERY problem with imperfect completeness. Any online $(c_a, c_v)$ scheme with $c_a \ge \log n$ for this problem requires $c_a \cdot c_v = \Omega(m \log(n/m))$.

Proof. V requires P to specify at the start of the stream a hash function $h : [n] \to [c_v]$. V requires $h$ to have description length $O(c_a)$, rejecting if this is not the case. We define the derived streams $x^j \in U^N$ based on $h$: we set $x^j_k = x_k$ iff $h(x_k) = j$, and 0 otherwise. Intuitively, the hash function $h$ partitions the stream updates in $x$ into $c_v$ disjoint buckets, and the vector $x^j$ describes the contents of the $j$th bucket. V maintains fingerprints over a field of size $\mathrm{poly}(n)$ of each of the $c_v$ different $x^j$ vectors.

At the end of the stream, given the desired index $i$, P provides a description of the (claimed) frequency vector in the $h(i)$th derived stream, $f(x^{h(i)})$. V computes a fingerprint of the claimed frequency vector, and compares it to the fingerprint she computed from the data stream, accepting if and only if the fingerprints match. Since each $x^j$ is sparse in expectation, the cost of this description can be low: provided $h$ does not map more than $O(c_a)$ items with non-zero frequency to $h(i)$, P can just specify the item id and frequency of the items with non-zero frequency in $f(x^{h(i)})$. In this case, the annotation size is just $O(c_a \log n)$. If P exceeds this amount of annotation, V will halt and reject (output $\perp$).

Soundness follows from the fingerprinting guarantee: if P does not honestly provide $x^{h(i)}$, V's fingerprint of $x^{h(i)}$ computed from the data stream will not match her fingerprint of the claimed vector of frequencies.

To show (imperfect) completeness, we study the probability that the output of an honest prover is rejected.
This happens only if $m(x^{h(i)})$, the number of non-zero entries in $x^{h(i)}$, is much larger than its expectation. By the pairwise independence of $h$, $E[m(x^{h(i)})] = m(x)/c_v \le c_a$. Thus, by Markov's inequality, $\Pr[m(x^{h(i)}) > 10 c_a] < 1/10$. So by specifying a hash function chosen at random from a pairwise independent hash family, and then honestly playing back the items that map to the same region as $i$, P can convince V to accept with probability 9/10.

V does not need to enforce that P picks the hash function $h$ at random from a pairwise independent hash family, as P has no incentive not to pick the hash function in this way. That is, since V will reject if too many items map to the same region as $i$, it is sufficient for P to pick $h$ at random from a pairwise independent hash family in order to convince V to accept with constant probability. But it is equally acceptable if P wants to pick $h$ another way; if he does so, P just risks that V will reject with a higher probability.

The lower bound follows from Theorem 3.9, which we prove in Section 3.2.

The scheme of Theorem 3.2 yields nearly optimal schemes for the HEAVYHITTERS and SELECTION problems, described below. Table 1 summarizes these results and compares to prior work.

| Problem | Scheme Costs | Completeness | Prescience | Source |
|---|---|---|---|---|
| POINTQUERY | $(\log n, \log n)$ | Perfect | Prescient | [7] |
| POINTQUERY | $(m \log n, \log n)$ | Perfect | Online | [7] |
| POINTQUERY | $(c_a \log n, c_v \log n)$ : $c_a c_v \ge n$ | Perfect | Online | [7] |
| POINTQUERY | $(c_a \log n, c_v \log n)$ : $c_a c_v \ge m$ | Imperfect | Online | Theorem 3.2 |
| SELECTION | $(c_a \log n, c_v \log n)$ : $c_a c_v \ge n$ | Perfect | Online | [7] |
| SELECTION | $(m \log n, \log n)$ | Perfect | Online | [7] |
| SELECTION | $(c_a \log n, c_v \log n)$ : $c_a c_v \ge m \log n$ | Imperfect | Online | Corollary 3.4 |
| $\phi$-HEAVYHITTERS | $(\phi^{-1} \log n, \phi^{-1} \log n)$ : $c_a c_v \ge n$ | Perfect | Prescient | [7] |
| $\phi$-HEAVYHITTERS | $(\phi^{-1} c_a \log n, c_v \log n)$ : $c_a c_v \ge n$ | Perfect | Online | [7] |
| $\phi$-HEAVYHITTERS | $(m \log n, \log n)$ | Perfect | Online | [7] |
| $\phi$-HEAVYHITTERS | $(\phi^{-1} c_a \log n, c_v \log n)$ : $c_a c_v \ge m \log n$ | Imperfect | Online | Corollary 3.6 |
| $\phi$-HEAVYHITTERS | $(\phi^{-1} \log n + c_a \log n, c_v \log n)$ : $c_a c_v \ge m \log n$ | Imperfect | Online | Corollary 5.6 |

Table 1: Comparison of our schemes to prior work. For all three problems, ours are the first online schemes to achieve both annotation and space usage sublinear in the stream sparsity $m$ when $m \ll \sqrt{n}$, and we strictly improve over the online MA communication cost of prior schemes whenever $m = o(n)$. For brevity, we omit factors of $\log_{c_v}(m)$ from the statement of costs of the $\phi$-HEAVYHITTERS scheme due to Corollary 5.6.
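The POINTQUERY scheme of Theorem 3.2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the fingerprint is a standard polynomial fingerprint $\sum_i f_i \rho^i \bmod q$ (the paper only cites [7] for fingerprint constructions), the hash family is the Carter-Wegman style map $((a i + b) \bmod p) \bmod c_v$, and all names (`make_hash`, `PointQueryVerifier`, `honest_annotation`, the moduli `Q` and `P_HASH`) are hypothetical.

```python
import random

Q = (1 << 61) - 1       # prime modulus for fingerprints (illustrative choice)
P_HASH = (1 << 31) - 1  # prime modulus for the hash family (illustrative choice)

def make_hash(cv, rng):
    """Draw h : [n] -> [cv] from a Carter-Wegman style family (assumes n < P_HASH)."""
    a, b = rng.randrange(P_HASH), rng.randrange(P_HASH)
    return lambda i: ((a * i + b) % P_HASH) % cv

class PointQueryVerifier:
    def __init__(self, h, cv, cap, rng):
        self.h, self.cap = h, cap
        self.rho = rng.randrange(1, Q)  # secret evaluation point of the fingerprint
        self.fp = [0] * cv              # one fingerprint per bucket (derived stream)

    def update(self, i, delta):
        j = self.h(i)
        self.fp[j] = (self.fp[j] + delta * pow(self.rho, i, Q)) % Q

    def query(self, i, claimed):
        """claimed: dict item -> frequency for bucket h(i). Returns f_i, or None to reject."""
        j = self.h(i)
        if len(claimed) > self.cap or any(self.h(t) != j for t in claimed):
            return None  # too much annotation, or an item from the wrong bucket
        fp = sum(f * pow(self.rho, t, Q) for t, f in claimed.items()) % Q
        return claimed.get(i, 0) if fp == self.fp[j] else None

def honest_annotation(updates, h, i):
    """What an honest prover sends: the non-zero frequencies in the bucket of i."""
    freq = {}
    for t, d in updates:
        freq[t] = freq.get(t, 0) + d
    return {t: f for t, f in freq.items() if f != 0 and h(t) == h(i)}

rng = random.Random(0)
updates = [(5, 3), (12, 1), (5, -1), (40, 2), (12, -1)]
h = make_hash(cv=4, rng=rng)
V = PointQueryVerifier(h, cv=4, cap=8, rng=rng)
for i, d in updates:
    V.update(i, d)
answer = V.query(5, honest_annotation(updates, h, 5))
```

The verifier stores only $c_v$ field elements plus $\rho$, while the prover's message has one (item, frequency) pair per surviving item in the queried bucket; the `cap` parameter plays the role of the $O(c_a)$ annotation limit.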
Our definition of the SELECTION problem assumes all frequencies $f_i := \sum_{(j_k, \delta_k) \,:\, j_k = i} \delta_k$ are non-negative, and so this definition is only valid for the strict turnstile update model.

Definition 3.3. The SELECTION problem is defined in terms of the quantity $N = \sum_{i \in [n]} f_i$, the sum of all the frequencies. Given a desired rank $r \in [N]$, output an item $j$ from the stream $x = \langle (j_1, \delta_1), \ldots, (j_m, \delta_m) \rangle$, such that $\sum_{(j_k, \delta_k) \,:\, j_k < j} \delta_k < r$ and $\sum_{(j_k, \delta_k) \,:\, j_k > j} \delta_k \le N - r$.

Corollary 3.4. For any pair $(c_a, c_v)$ such that $c_a c_v \ge m \log n$, there is an online $(c_a \log n, c_v \log n)$-scheme for SELECTION in the strict turnstile update model.
The corollary follows from a standard observation to reduce SELECTION to answering prefix sum queries, and hence to multiple instances of the POINTQUERY problem. V treats each stream update $(i, \delta)$ in the stream $x$ as an update to $O(\log n)$ dyadic ranges, where a dyadic range is a range of the form $[j 2^k, (j+1) 2^k - 1]$ for some $j$ and $k$. Thus, we can view the set of dyadic range updates implied by $x$ as a derived stream of sparsity $m \log n$. Notice we are using the fact that this transformation from the original stream of sparsity $m$ results in a derived stream of sparsity at most $m \log n$; a different derived stream was used in [7] to address the SELECTION problem, but the sparsity of that derived stream could be substantially larger than the sparsity of the original stream.

For any $i$, the quantity $T_i := \sum_{(j, \delta) \,:\, j \le i} \delta$ can be written as the sum of the counts of $O(\log n)$ dyadic ranges. Thus, at the end of the stream P can convince V that item $i$ has the desired $T_i$ value by running $\log n$ POINTQUERY protocols as in Theorem 3.2 in parallel on the derived stream of sparsity $m \log n$. The verifier's space usage is the same as for a single POINTQUERY instance on this stream: V fingerprints each of the derived streams $x^j$ defined in the proof of Theorem 3.2, and uses these fingerprints in all $\log n$ instances of the POINTQUERY scheme. The annotation length is $\log n$ times larger than that required for a single POINTQUERY instance because P may have to describe the frequency vectors of up to $\log n$ derived streams.

Thus, we get an online $(c_a \log n, c_v \log n)$-scheme as long as $c_a c_v = \Omega(m \log n)$.

3.1.2 Frequent Items

Our definition of the $\phi$-HEAVYHITTERS problem also assumes all frequencies $f_i := \sum_{(j_k, \delta_k) \,:\, j_k = i} \delta_k$ are non-negative, and so this definition is only valid for the strict turnstile update model.

Definition 3.5.
The $\phi$-HEAVYHITTERS problem (also known as frequent items) is to list those items $i$ such that $f_i \ge \phi N$, i.e. whose frequency of occurrence is at least a $\phi$ fraction of the total count $N = \sum_{i \in [n]} f_i$.

We give a preliminary result for the $\phi$-HEAVYHITTERS problem in Corollary 3.6 below. We give a substantially improved scheme in Section 5 using the ideas underlying our online scheme for frequency moments.
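The dyadic-range derived stream used in the SELECTION reduction above can be sketched as follows. This is an illustrative sketch only: it assumes a 0-based universe $\{0, \ldots, 2^{\log n} - 1\}$, identifies the range $[j 2^k, (j+1) 2^k - 1]$ by the pair `(k, j)`, and the function names are hypothetical.

```python
def dyadic_updates(i, delta, log_n):
    """Expand update (i, delta) into one update per dyadic level.

    The pair (k, j) stands for the range [j * 2^k, (j + 1) * 2^k - 1].
    """
    return [((k, i >> k), delta) for k in range(log_n + 1)]

def prefix_ranges(i, log_n):
    """Dyadic ranges whose disjoint union is the prefix [0, i]."""
    ranges, start, length = [], 0, i + 1
    for k in range(log_n, -1, -1):
        if length & (1 << k):          # take one aligned block of size 2^k
            ranges.append((k, start >> k))
            start += 1 << k
    return ranges

def prefix_sum(updates, i, log_n):
    """T_i computed only from counts of dyadic ranges in the derived stream."""
    counts = {}
    for item, delta in updates:
        for range_id, d in dyadic_updates(item, delta, log_n):
            counts[range_id] = counts.get(range_id, 0) + d
    return sum(counts.get(r, 0) for r in prefix_ranges(i, log_n))

updates = [(0, 2), (3, 1), (5, 4), (6, 1), (3, -1)]
t5 = prefix_sum(updates, 5, log_n=3)
```

Each original update touches `log_n + 1` ranges, and each prefix is covered by at most `log_n + 1` ranges, which is where the $m \log n$ sparsity of the derived stream and the $\log n$ parallel point queries come from.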
Corollary 3.6. For all $c_a, c_v$ such that $c_a c_v \ge m \log n$, there is an online $(c_a \phi^{-1} \log n, c_v \log n)$-scheme for solving $\phi$-HEAVYHITTERS in the strict turnstile update model.

Corollary 3.6 follows from the following analysis. [7, Theorem 6.1] describes how to reduce $\phi$-HEAVYHITTERS to demonstrating the frequencies of $O(\phi^{-1})$ items in a derived stream. Moreover, the derived stream has sparsity $O(m \log n)$ if the original stream has sparsity $m$. We use the POINTQUERY scheme of Theorem 3.2. As in Corollary 3.4, the annotation length blows up by a factor $\phi^{-1}$ relative to a single POINTQUERY, but the space usage of V can remain the same as in a single POINTQUERY instance. Hence, we obtain an online $(c_a \phi^{-1} \log n, c_v \log n)$-scheme for any $c_a c_v \ge m \log n$.

In this section, we prove a new lower bound on the online MA communication complexity of the $(m, n)$-Sparse INDEX problem.
Definition 3.7. In the $(m, n)$-Sparse INDEX problem, Alice is given a vector $x \in \{0, 1\}^n$ of Hamming weight at most $m$, and Bob is given an index $i$. Their goal is to output the value $x_i$.

We prove our lower bound by reducing the (dense) INDEX problem (i.e. the $(m, n)$-Sparse INDEX problem with $m = \Theta(n)$) in the MA communication model to the $(m, n)$-Sparse INDEX problem for small $m$. The idea is to replace Alice's dense input with a sparser input over a bigger universe, and then take advantage of our sparse POINTQUERY protocol. A lower bound on the online MA communication complexity of the dense INDEX problem was proven in [7, Theorem 3.1]; there, it was shown that any online MA communication protocol $P$ requires $\mathrm{hcost}(P) \cdot \mathrm{vcost}(P) \ge n$. Combining this with our reduction of the dense INDEX problem to the sparse version, we conclude that any protocol for sparse INDEX must be costly.
Lemma 3.8. [7, Theorem 3.1] Any online MA communication protocol $P$ for the $(n, n)$-Sparse INDEX problem must have $\mathrm{hcost}(P) \cdot \mathrm{vcost}(P) = \Omega(n)$.

Remark 1. The lower bound of Lemma 3.8 was originally proved by Chakrabarti et al. [7] in the communication model in which Merlin cannot send any message to Alice. However, the proof easily extends to our online MA communication model (where Merlin can send a message to Alice, but that message cannot depend on Bob's input).
Theorem 3.9. Any online MA communication protocol $P$ for the $(m, n)$-Sparse INDEX problem for which $\mathrm{hcost}(P) \ge \log n$ must have $\mathrm{hcost}(P) \cdot \mathrm{vcost}(P) = \Omega(m \log(n/m))$.

Proof. Assume we have an online MA communication protocol $P$ for $(m, n)$-Sparse INDEX. We describe how to use this online MA protocol for the sparse INDEX problem to design one for the dense INDEX problem on vectors of length $n' = m \log(n/m)$.

Let $k = \log(n/m)$. Given an input $x$ to the dense INDEX problem, Alice partitions $x$ into $n'/k$ blocks of length $k$, and constructs a 0-1 vector $y$ of Hamming weight $n'/k$ over the universe $\{0, 1\}^{(n'/k) \cdot 2^k} = \{0, 1\}^n$ as follows. She replaces each block $B_i$ with a 1-sparse vector $v_i \in \{0, 1\}^{2^k}$, where each entry of $v_i$ corresponds to one of the $2^k$ possible values of block $B_i$. That is, if block $B_i$ of $x$ equals the binary representation of the number $j \in [2^k]$, then Alice replaces block $B_i$ with the vector $e_j \in \{0, 1\}^{2^k}$, where $e_j$ denotes the vector with a 1 in coordinate $j$ and 0s elsewhere.

Alice now has an $n'/k = m$-sparse derived input $y$ over the universe $\{0, 1\}^n$. Merlin looks at Bob's input to see what is the index $i$ of the dense vector $x$ that Bob is interested in. Merlin then tells Bob the index $\ell$ such that $\ell = 2^k (i - 1) + j$, where $B_i$ is the block that $i$ is located in, and block $B_i$ of Alice's input $x$ equals the binary representation of the number $j \in [2^k]$. Notice that Merlin can specify $\ell$ using $\log n$ bits. If Bob is convinced that $y_\ell = 1$, then Bob can deduce the value of all the bits in block $B_i$ of the original dense vector $x$, and in particular, the value of $x_i$.

The parties then run the assumed online MA protocol for $(m, n)$-Sparse INDEX. The total hcost of this protocol is $\mathrm{hcost}(P) + \log n = O(\mathrm{hcost}(P))$, and the total vcost is $\mathrm{vcost}(P)$. Thus, by Lemma 3.8, $\mathrm{hcost}(P) \cdot \mathrm{vcost}(P) = \Omega(n') = \Omega(m \log(n/m))$ as claimed.

Theorem 3.9 should be contrasted with the following well-known upper bound.

Theorem 3.10.
Assume $n < m^m$. Then the one-way randomized communication complexity of the $(m, n)$-Sparse INDEX problem is $O(m \log m)$.

Proof. Alice chooses a hash function $h : [n] \to [m^c]$, for a sufficiently large constant $c$, at random from a pairwise independent family and uses $h$ to perform "universe reduction". That is, she sends $h$ along with the set $S$ of at most $m$ values $\{h(j) : x_j = 1\}$. Notice $h$ can be specified with $O(\log n) = O(m \log m)$ bits, and $S$ can be specified with $O(m \log m)$ bits. Bob outputs 1 if $h(i) \in S$, and 0 otherwise. The correctness of the protocol follows from the pairwise independence property of $h$: if $x_i = 0$, then with high probability $i$ will not collide under $h$ with any $j$ such that $x_j = 1$. The total communication is $O(m \log m)$.

Our lower bound in Theorem 3.9 has interesting consequences when it is combined with the upper bound in Theorem 3.10. Consider in particular the $(m, n)$-Sparse INDEX problem, where $n = 2^m$. Theorem 3.10 implies that the one-way randomized communication complexity of this problem is $O(m \log m)$; that is, without any need of Merlin, Alice and Bob can solve the problem with $O(m \log m)$ communication.

Meanwhile, Theorem 3.9 implies that even if Merlin's message to Bob has length $\Omega(\log n) = \Omega(m)$, Alice's message to Bob must have length $\Omega(m \log(n/m) / m) = \Omega(m)$. Indeed, Theorem 3.9 shows that for any protocol $P$, if $\mathrm{hcost}(P) \ge \log n = m$, then we must have $\mathrm{hcost}(P) \cdot \mathrm{vcost}(P) = \Omega(m \log(n/m)) = \Omega(m^2)$. In particular, this means that if $\mathrm{hcost}(P) = m$, then $\mathrm{vcost}(P)$ must be $\Omega(m)$. This trivially implies that for any protocol $P$ with $\mathrm{hcost}(P)$ less than $m$, $\mathrm{vcost}(P)$ must still be $\Omega(m)$; otherwise we could achieve a protocol with $\mathrm{hcost}(P) = m$ and $\mathrm{vcost}(P) = o(m)$ simply by running $P$ and adding extraneous bits to the proof to bring the proof length up to $m$.

Consequently, the online MA communication complexity of this problem is $\Omega(m)$, which is at most a logarithmic factor smaller than the one-way randomized communication complexity. To our knowledge, this is the first problem that provably exhibits this behavior. Specifically, this rules out smooth tradeoffs between annotation size and space usage in any annotated streaming protocol for the $(m, 2^m)$-Sparse INDEX problem.
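The block encoding at the heart of the proof of Theorem 3.9 can be made concrete with a short sketch. It uses 0-based indices (so Merlin's index is $\ell = 2^k \cdot \mathrm{blk} + j$ rather than the 1-based $2^k(i-1) + j$), and it adds an explicit check by Bob that $\ell$ falls in the block containing $i$; both choices, and all function names, are assumptions made for illustration.

```python
def block_value(x_bits, k, blk):
    """Integer whose k-bit binary representation is block number blk of x."""
    return int("".join(str(b) for b in x_bits[blk * k:(blk + 1) * k]), 2)

def sparse_encoding(x_bits, k):
    """Positions of the 1-entries of y: one per block, universe size (n'/k) * 2^k."""
    assert len(x_bits) % k == 0
    return {blk * (1 << k) + block_value(x_bits, k, blk)
            for blk in range(len(x_bits) // k)}

def merlin_index(x_bits, k, i):
    """The index ell that an honest Merlin sends for Bob's query i."""
    blk = i // k
    return blk * (1 << k) + block_value(x_bits, k, blk)

def bob_decode(ell, k, i):
    """Given that y_ell = 1 has been verified, recover x_i (None if ell is in the wrong block)."""
    blk, j = divmod(ell, 1 << k)
    if blk != i // k:
        return None
    return (j >> (k - 1 - i % k)) & 1

x = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]   # dense input, n' = 12
k = 3
y_ones = sparse_encoding(x, k)              # Hamming weight n'/k = 4, universe size 4 * 2^3 = 32
```

A dense vector of length $n' = m \log(n/m)$ thus becomes an $m$-sparse vector over a universe of size $n$, and a single verified point query on $y$ reveals an entire block of $x$.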
Corollary 3.11. The one-way randomized communication complexity of the $(m, 2^m)$-Sparse INDEX problem is $O(m \log m)$. The online Merlin-Arthur communication complexity is $\Omega(m)$.

3.3.1 Other Sparse Problems

A number of lower bounds in [7] are proved via reductions from INDEX that preserve stream length up to logarithmic factors. This holds for SELECTION and HEAVYHITTERS, as well as for the problem of determining the existence of a triangle in a graph. For all such problems, the lower bound of Theorem 3.9 implies corresponding new lower bounds for sparse streams, i.e. streams for which $m = o(n)$. We omit the details for brevity.

Another implication of Theorem 3.9 is a polynomial separation between online MA communication complexity and online AMA communication complexity. Indeed, there is an online AMA protocol of cost $\tilde{O}(\sqrt{m})$ for the $(m, 2^{\sqrt{m}})$-Sparse INDEX problem, where the $\tilde{O}$ notation hides factors polylogarithmic in $m$: the first message, which consists of public random coins, is used to specify a hash function $h : [n] \to [m^c]$, for a sufficiently large constant $c$, from a pairwise independent hash family; this message has length $O(\log n) = O(\sqrt{m})$. With high probability, $h$ is injective on the set $\{j : x_j = 1\}$. The parties then run the online MA communication protocol of Theorem 3.2 on the inputs $h(x)$ and $h(i)$ and output the result. The total cost of this protocol is $\tilde{O}(\sqrt{m})$ as claimed. In Appendix A, we in fact show that up to logarithmic factors in $m$, this online AMA protocol is optimal.

Meanwhile, the lower bound of Theorem 3.9 implies that the online MA communication complexity of this problem is $\Omega(m^{3/4})$. Indeed, if we have a protocol $P$ with $\mathrm{hcost}(P) = m^{3/4} > \log n$, Theorem 3.9 implies that $\mathrm{hcost}(P) \cdot \mathrm{vcost}(P) = \Omega(m \log(n/m)) = \Omega(m^{3/2})$, and hence $\mathrm{vcost}(P) = \Omega(m^{3/4})$.

To our knowledge, this is the first such separation between online AMA and online MA communication complexity (we remark that polynomial separations between online MA and MAMA communication complexity were already known, for problems including INDEX and
DISJ [2, 7]). Indeed, all previous lower bound methods that apply to online MA communication complexity, such as the proof of [7, Theorem 3.1] and the methods of Klauck and Prakash [24], in fact yield equivalent AMA lower bounds. At a high level, the reason is that these methods work via round reduction: they remove the need for Merlin's message. They therefore turn any online MA protocol for a function $F$ into an online "A" protocol for $F$, which is really just a one-way randomized protocol without a prover, allowing one to invoke a known lower bound on the one-way randomized communication complexity of $F$. Similarly, they turn an online AMA protocol for $F$ into an online AA protocol, which is also just a one-way randomized protocol for $F$.

The reason Theorem 3.9 is capable of separating online AMA from MA communication complexity is that the reduction in the proof of Theorem 3.9 turns an online MA protocol for the $(m, n)$-Sparse INDEX problem into an online MA protocol for the (dense) INDEX problem with related costs. However, the natural variant of the reduction applied to an online AMA protocol for the $(m, n)$-Sparse INDEX problem yields an online MAMA protocol for the dense INDEX problem, not an online AMA protocol (see Appendix A for details). And the dense INDEX problem has an online MAMA protocol that is polynomially more efficient than any online AMA protocol (see e.g. [2, 11]).
In this section and the next, we describe schemes for the $m$-Disjointness ($m$-DISJ) and Frequency Moment ($F_k$) problems. These schemes contain the main ideas of the paper.

| Scheme Costs | Completeness | Prescience | Source |
|---|---|---|---|
| $((m \log m)^{2/3}, (m \log m)^{2/3})$ : $m = \Omega(\log n)$ | Perfect | Prescient | Theorem 4.3 |
| $(c_a \log n, c_v \log n)$ : $c_a c_v \ge n$ | Perfect | Online | [7] |
| $(m \log n, \log n)$ | Perfect | Online | [7] |
| $(c_a \log n \log_{c_v} m, c_v \log n \log_{c_v} m)$ : $c_a = m c_v^{-1/2}$ | Imperfect | Online | Theorem 5.1 |

Table 2: Comparison of our $m$-DISJ schemes to prior work. Ours are the first schemes to achieve annotation length and space usage that are both sublinear in $m$ for $m \ll \sqrt{n}$, and we strictly improve over the MA communication cost (online or prescient) of prior schemes whenever $m = o(n)$.

We begin with a scheme achieving optimal tradeoffs between annotation length and space usage for a broad class of dense problems. Though this scheme follows readily from prior work [7, 9], we describe it in detail for completeness. This scheme is a good example of a sum-check scheme as described in Section 1.1, and is based on the Aaronson–Wigderson MA protocol for DISJ [2].
Proposition 4.1. Let $f^{(1)}, \ldots, f^{(\ell)}$ denote the frequency vectors of $\ell$ data streams, each over the universe $[n]$. Let $g$ be an $\ell$-variate polynomial of total degree $d$ over the integers. Let $F = \sum_{i=1}^{n} g(f^{(1)}_i, \ldots, f^{(\ell)}_i)$, and let $o$ be an a priori upper bound on $|F|$. Then for positive integers $c_a, c_v$ with $c_a c_v \ge n$, there is an online $(d c_a (\log n + \log o), \ell c_v (\log n + \log o))$-scheme for computing $F$ in the non-strict turnstile update model.

Proof. We work on $\mathbb{F}_q$, the finite field with $q$ elements, for a suitably large prime $q$; the choice $q > d(n + o)$ suffices. V treats each $n$-dimensional vector $f^{(j)}$ as a $c_a \times c_v$ array with entries in $\mathbb{F}_q$, using any canonical bijection between $[c_a] \times [c_v]$ and $[n]$, and interpreting integers as elements of $\mathbb{F}_q$ in the natural way. Through interpolation, this defines a unique bivariate polynomial $\tilde{f}^{(j)}(X, Y) \in \mathbb{F}_q[X, Y]$ of degree $c_a - 1$ in $X$ and $c_v - 1$ in $Y$, such that for all $x \in [c_a]$, $y \in [c_v]$, $\tilde{f}^{(j)}(x, y) = f^{(j)}(x, y)$.

The polynomials $\tilde{f}^{(j)}$ can then be evaluated at locations outside $[c_a] \times [c_v]$, so in the scheme V picks a random position $r \in \mathbb{F}_q$, and evaluates $\tilde{f}^{(j)}(r, y)$ for all $j \in [\ell]$ and $y \in [c_v]$; V can do this using $c_v$ words of memory per vector $f^{(j)}$ in a streaming manner [7, Theorem 4.1]. Let $\tilde{g}$ denote the total-degree-$d$ polynomial over $\mathbb{F}_q$ that agrees with $g$ at all inputs in $\mathbb{F}_q^{\ell}$. P then presents a polynomial $b(X)$ of degree at most $d(c_a - 1)$ that is claimed to be identical to $\sum_{y \in [c_v]} \tilde{g}(\tilde{f}^{(1)}(X, y), \ldots, \tilde{f}^{(\ell)}(X, y))$. V checks that $b(r) = \sum_{y \in [c_v]} \tilde{g}(\tilde{f}^{(1)}(r, y), \ldots, \tilde{f}^{(\ell)}(r, y))$. If this sum check passes, then V believes P's claim and accepts $\sum_{x \in [c_a]} b(x)$ as the correct answer. It is evident that this scheme satisfies perfect completeness. The proof of soundness follows from the Schwartz-Zippel lemma: if P's claim is false, then
\[ \Pr\Big[\, b(r) = \sum_{y \in [c_v]} \tilde{g}\big(\tilde{f}^{(1)}(r, y), \ldots, \tilde{f}^{(\ell)}(r, y)\big) \Big] \le d(c_a - 1)/q. \]

An important special case of the communication problem DISJ is when Alice's and Bob's input sets are promised to be small, i.e., have size at most $m \ll n$. These should be thought of as sparse instances. The sparsity parameter $m$ has typically been denoted by the letter $k$ in the communication complexity literature, and the problem has typically been referred to as $k$-DISJ rather than $m$-DISJ; we use $m$ rather than $k$ for consistency with our notation in the rest of the paper (where $m$ denotes the sparsity of a data stream).

Among the original motivations for studying this variant is its relation to the clique-vs.-independent-set problem introduced by Yannakakis [27] to study linear programming formulations for combinatorial optimization problems. More recent motivations include connections to property testing [4]. A clever protocol of Håstad and Wigderson [16] gives an optimal $O(m)$ communication protocol for $m$-DISJ, improving upon the trivial $O(m \log n)$ and the easy $O(m \log m)$ bounds. This protocol requires considerable interaction between Alice and Bob, a feature that turns out to be necessary. Recent results of Buhrman et al. [6] and Dasgupta et al. [12] give tight $\Theta(m \log m)$ bounds for $m$-DISJ in the one-way model. Very recently, Brody et al. [5] and Sağlam and Tardos [25] have given tight rounds-vs.-communication tradeoffs for $m$-DISJ.

Here we obtain the first nontrivial bounds for $m$-DISJ in the annotated streams model, and thus also in the online MA communication model.
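Since the sparse schemes below invoke the dense sum-check scheme of Proposition 4.1 as a subroutine, a small sketch of that scheme may help. It is specialized to a single stream ($\ell = 1$) and $g(z) = z^k$, so it computes $F_k$; it uses a 0-based grid $\{0, \ldots, c_a - 1\} \times \{0, \ldots, c_v - 1\}$, represents $b(X)$ by its evaluations at $0, \ldots, k(c_a - 1)$, and fixes an illustrative prime `Q`. These choices and all names are assumptions, not the authors' code. It needs Python 3.8 or later for modular inverses via `pow(x, -1, Q)`.

```python
import random

Q = (1 << 61) - 1  # prime field size q (illustrative; the text needs q "suitably large")

def interpolate_at(values, r):
    """Evaluate at r the polynomial of degree < len(values) with p(t) = values[t]."""
    total = 0
    for t, vt in enumerate(values):
        num = den = 1
        for s in range(len(values)):
            if s != t:
                num = num * (r - s) % Q
                den = den * (t - s) % Q
        total = (total + vt * num * pow(den, -1, Q)) % Q
    return total

def chi(x, ca, r):
    """Lagrange basis polynomial for point x over {0, ..., ca - 1}, evaluated at r."""
    return interpolate_at([1 if s == x else 0 for s in range(ca)], r)

class Verifier:
    def __init__(self, ca, cv, k, rng):
        self.ca, self.cv, self.k = ca, cv, k
        self.r = rng.randrange(Q)      # random position r in F_q
        self.row = [0] * cv            # row[y] holds f~(r, y)

    def update(self, i, delta):
        x, y = divmod(i, self.cv)      # canonical bijection [n] -> [ca] x [cv]
        self.row[y] = (self.row[y] + delta * chi(x, self.ca, self.r)) % Q

    def check(self, b_values):
        """b_values[t] = b(t) for t = 0..k(ca-1). Returns F_k mod Q, or None to reject."""
        if len(b_values) != self.k * (self.ca - 1) + 1:
            return None
        if interpolate_at(b_values, self.r) != sum(pow(v, self.k, Q) for v in self.row) % Q:
            return None
        return sum(b_values[:self.ca]) % Q

def honest_prover(freq, ca, cv, k):
    """Evaluations of b(X) = sum_y f~(X, y)^k at X = 0..k(ca-1)."""
    out = []
    for t in range(k * (ca - 1) + 1):
        basis = [chi(x, ca, t) for x in range(ca)]
        col = [sum(freq[x * cv + y] * basis[x] for x in range(ca)) % Q for y in range(cv)]
        out.append(sum(pow(c, k, Q) for c in col) % Q)
    return out

rng = random.Random(7)
ca, cv, k = 3, 4, 2
updates = [(0, 2), (5, 1), (11, 3), (5, -1), (7, 4)]
freq = [0] * (ca * cv)
verifier = Verifier(ca, cv, k, rng)
for i, delta in updates:
    freq[i] += delta
    verifier.update(i, delta)
proof = honest_prover(freq, ca, cv, k)
result = verifier.check(proof)         # F_2 of the stream, reduced mod Q
```

The verifier keeps only the $c_v$ values $\tilde{f}(r, y)$, while the proof consists of $k(c_a - 1) + 1$ field elements, matching the $c_a$ versus $c_v$ tradeoff in the proposition.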
Definition 4.2. In the $m$-DISJ problem, the data stream specifies two multi-sets $S, T \subseteq [n]$, with $\|S\|, \|T\| \le m$, where $\|S\|$ denotes the number of distinct items in $S$. Each update has the form $((b, i), \delta)$, where the tag $b$ takes one of two values: one value marks an insertion of $\delta$ copies of item $i$ into $S$, and the other marks an insertion of $\delta$ copies of item $i$ into $T$. The goal is to determine whether or not $S$ and $T$ are disjoint.

Notice Definition 4.2 allows $S$ and $T$ to be multi-sets, but assumes the strict turnstile update model, where the frequency of each item is non-negative.

Theorem 4.3. Assume $m > \log n$. There is a prescient $((m \log m)^{2/3}, (m \log m)^{2/3})$-scheme for $m$-DISJ with perfect completeness in the strict turnstile update model. In particular, the MA-communication complexity of $m$-DISJ is $O((m \log m)^{2/3})$. Any prescient $(c_a, c_v)$ protocol requires $c_a c_v = \Omega(m)$.

Proof. Obviously if $S$ and $T$ are not disjoint, the prescient prover can provide an item $i \in S \cap T$ at the start of the stream and the verifier can check that $i$ indeed appears in both $S$ and $T$. The total space usage and annotation length is just $O(\log n)$ in this case.

Suppose now that $S$ and $T$ are disjoint. We first recall that a $(\sqrt{n} \log n, \sqrt{n} \log n)$-scheme for DISJ follows from Proposition 4.1, with $f^{(1)}$ and $f^{(2)}$ set to the indicator vectors of $S$ and $T$ respectively, and $g$ equal to the product function. We refer to this as the dense DISJ scheme because its cost does not improve if $|S|$ and $|T|$ are both $o(n)$.

Our prescient scheme for $m$-DISJ works as follows. At the start of the stream, the prover describes a hash function $h : [n] \to [r]$, for some smaller universe $[r]$, with the property that $h$ is injective on $S \cup T$. We will write $h(S)$ to denote the result of applying $h$ to every member of $S$. The parties can now run the dense DISJ scheme whereby P convinces V that $h(S)$ and $h(T)$ are disjoint. Given the existence of an injective function $h$, perfect completeness follows from the fact that if $S$ and $T$ are disjoint, so are $h(S)$ and $h(T)$, combined with the perfect completeness of the dense DISJ scheme. Soundness follows from the fact that if $i \in S \cap T$, then $h(i) \in h(S) \cap h(T)$, i.e. if $S$ and $T$ are not disjoint, then the same holds trivially for $h(S)$ and $h(T)$.

The dense DISJ scheme run on $h(S)$ and $h(T)$ requires annotation length and space usage $O(\sqrt{r} \log r)$. We now show that, for a suitable choice of $r$, P's description of $h$ is also limited to $O(\sqrt{r} \log r)$ communication, balancing out the cost of the rest of the scheme.

A family of functions $\mathcal{F} \subseteq [r]^{[n]}$ is said to be $k$-perfect if, for all $S \subseteq [n]$ with $|S| \le k$, there exists a function $h \in \mathcal{F}$ that is injective when restricted to $S$. Fredman and Komlós [15] have shown that for all $n \ge r \ge k$, there exists a $k$-perfect family $\mathcal{F}$, with
\[ |\mathcal{F}| \le (1 + o(1)) \left( \frac{k \log n}{-\log(1 - t(r, k))} \right), \quad \text{where } t(r, k) := \prod_{j=1}^{k-1} \left( 1 - \frac{j}{r} \right). \]
For $r \ge 2k$, we can use the crude approximation
\[ -\log(1 - t(r, k)) \ge t(r, k) \ge \left( 1 - \frac{k}{r} \right)^{k} \ge e^{-2k^2/r} \]
to obtain the bound $|\mathcal{F}| = O(k e^{2k^2/r} \log n)$, which implies $\log |\mathcal{F}| = O(k^2/r)$, for $k^2/r = \Omega(\log k)$ and $k = \Omega(\log n)$.

| Scheme Costs | Completeness | Prescience | Source |
|---|---|---|---|
| $(k^2 c_a \log n, k c_v \log n)$ : $c_a c_v \ge n$ | Perfect | Online | [7] |
| $(m \log n, \log n)$ | Perfect | Online | [7] |
| $(k^2 m^{2/3} \log n, k m^{2/3} \log n)$ | Perfect | Prescient | Theorem 4.5 |
| $(k^2 m c_v^{-1/2} \log n \log_{c_v} m, k c_v \log n \log_{c_v} m)$ : $c_v > 1$ | Imperfect | Online | Theorem 5.1 |

Table 3: Comparison of our $F_k$ schemes to prior work. Ours are the first schemes to achieve annotation length and space usage that are both sublinear in $m$ for $m \ll \sqrt{n}$, and we strictly improve over the MA communication cost of prior protocols (online or prescient) whenever $m = o(n)$.

Let us pick a family $\mathcal{F}$ that is $(2m)$-perfect. Once P and V agree upon such a family $\mathcal{F}$, the prover, upon seeing the input sets $S$ and $T$, can pick $h \in \mathcal{F}$ that is injective on $S \cup T$. Describing $h$ requires $O(m^2/r)$ bits; P sends this to V before the stream is seen, and V stores it while observing the stream in order to run the dense DISJ scheme on $h(S)$ and $h(T)$. To balance out this communication with the $O(\sqrt{r} \log r)$ cost of running the dense DISJ scheme on $h(S)$ and $h(T)$, we choose $r$ so that
\[ \frac{m^2}{r} = \Theta(\sqrt{r} \log r). \]
This is achieved by setting $r = m^{4/3} / \log^{2/3} m$. The resulting upper bound is that both the annotation length and verifier's space usage are $O\big((m \log m)^{2/3}\big)$.

The lower bound follows from known lower bounds for dense streams [7].
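The balancing step at the end of the proof of Theorem 4.3 can be checked directly. The short derivation below is a worked sketch under two assumptions drawn from the surrounding argument: the description of $h$ costs $O(m^2/r)$ bits (from $\log|\mathcal{F}| = O(k^2/r)$ with $k = 2m$), and $r$ is set to $m^{4/3}/\log^{2/3} m$, so that $\log r = \Theta(\log m)$.

```latex
\begin{align*}
\frac{m^2}{r}
  &= \frac{m^2 \log^{2/3} m}{m^{4/3}}
   = m^{2/3} \log^{2/3} m
   = (m \log m)^{2/3}, \\
\sqrt{r}\,\log r
  &= \frac{m^{2/3}}{\log^{1/3} m} \cdot \Theta(\log m)
   = \Theta\!\left( (m \log m)^{2/3} \right).
\end{align*}
```

Both terms therefore match up to constant factors, which is why neither the hash description nor the dense DISJ scheme dominates the total cost.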
We now present prescient schemes for the $k$th Frequency Moment problem, $F_k$.

Definition 4.4. In the $F_k$ problem, the data stream $x$ consists of a sequence of updates of the form $(i, \delta)$, and the frequency of item $i$ is defined to be $f_i = \sum_{(j_\ell, \delta_\ell) \in x \,:\, j_\ell = i} \delta_\ell$. The goal is to compute $F_k = \sum_{i \in [n]} f_i^k$.

The idea behind the scheme, as in the case of $m$-DISJ, is that P is supposed to specify a "hash function" $h$ to reduce the universe size in a way that does not introduce false collisions. However, for $F_k$ it is essential that V ensure $h$ is truly injective on the items appearing in the data stream. This is in contrast to $m$-DISJ, where a weaker notion than injectiveness was sufficient to guarantee soundness. The fundamental difference between the two problems is that for $m$-DISJ, collisions only "hurt the prover's claim" that the two sets are disjoint, whereas for $F_k$ the prover could try to use collisions to convince the verifier that the answer to the query is higher or lower than the true answer.

Theorem 4.5. There is a prescient $(k^2 m^{2/3} \log n, k m^{2/3} \log n)$-scheme for computing $F_k$ over a data stream of sparsity $m$ in the strict turnstile update model. This scheme has perfect completeness. Any prescient $(c_a, c_v)$ protocol requires $c_a c_v = \Omega(m)$.

Proof. The idea is to have the prover specify for the verifier a perfect hash function $h : [n] \to [r]$, where $r$ is to be determined later, i.e. P specifies a hash function $h$ such that for all $x \ne y$ appearing in at least one update in the data stream, $h(x) \ne h(y)$. The verifier stores the description of $h$, and while observing the stream runs the dense $F_k$ scheme of Proposition 4.1 on the derived stream in which each update $(i, \delta)$ is replaced with the update $(h(i), \delta)$.

As discussed above, it is essential that V ensure $h$ is injective on the set of items that have non-zero frequency, as otherwise P could try to introduce collisions to try to trick the verifier. To deal with this, we introduce a mechanism by which V can "detect" collisions.

Definition 4.6.
Define the problem INJECTION as follows. We observe a stream of tuples $t_i = ((x_i, b_i), \delta_i)$. Each $t_i$ indicates that $\delta_i$ copies of item $x_i$ are placed in bucket $b_i \in [r]$. We allow $\delta_i$ to be negative, modeling deletions, and refer to the quantity $f(j, b) = \sum_{i \,:\, (x_i, b_i) = (j, b)} \delta_i$ as the count of pair $(j, b)$. We assume the strict turnstile model, so that for all pairs $(j, b)$ we have $f(j, b) \ge 0$. We say the stream defines an injection if, for every two pairs $(j, b)$ and $(j', b)$ with positive counts, it holds that $j = j'$. Define the output as 1 if the stream defines an injection, and 0 otherwise.

Lemma 4.7. For any $c_a c_v \ge r$, there is an online $(c_a \log r, c_v \log r)$-scheme for determining whether a stream in the strict turnstile model is an injection.

Proof. Say that bucket $b$ is pure if there is at most one $j \in [n]$ such that $f(j, b) > 0$. The stream defines an injection if and only if every bucket $b$ is pure.

Notice that a bucket $b$ is pure if and only if the variance of the item identifiers mapping to the bucket with positive count is zero. Intuitively, our scheme will compute the sum of these variances across all buckets $b$; this sum will be zero if and only if the stream defines an injection. Details follow.

Define three $r$-dimensional vectors $u, v, w$ as follows:
\[ u_b = \sum_{j \in [n]} f(j, b), \qquad v_b = \sum_{j \in [n]} f(j, b)\, j, \qquad w_b = \sum_{j \in [n]} f(j, b)\, j^2. \]
It is easy to see that if bucket $b$ is pure then $v_b^2 = u_b \cdot w_b$. Moreover, if bucket $b$ is impure then $v_b^2 < u_b w_b$; this holds by the Cauchy-Schwarz inequality applied to the $n$-dimensional vectors whose $j$th entries are $\sqrt{f(j, b)}$ and $\sqrt{f(j, b)} \cdot j$ respectively (the strict inequality holds because for an impure bucket $b$, the vector given by $\sqrt{f(j, b)} \cdot j$ is not a scalar multiple of the vector given by $\sqrt{f(j, b)}$). Here, we are exploiting the assumption that $f(j, b) \ge 0$ for all pairs $(j, b)$, as this allows us to conclude that all $\sqrt{f(j, b)}$ values are real numbers.

It follows that $\sum_{b \in [r]} v_b^2 = \sum_{b \in [r]} u_b \cdot w_b$ if and only if the stream defines an injection. Both quantities can be computed using the "dense" scheme of Proposition 4.1. Notice that each update $t_i = ((x_i, b_i), \delta_i)$ contributes independently to each of the vectors $u$, $v$, and $w$, and hence it is possible for V to run the scheme of Proposition 4.1 on these vectors as required. This yields an online $(c_a \log r, c_v \log r)$-scheme for the injection problem for any $c_a c_v \ge r$ as claimed.

Returning to our $F_k$ scheme, P specifies a hash function $h$ claimed to be one-to-one on the set of items that appear in one or more updates of the stream $x$.
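The algebraic test behind Lemma 4.7 can be illustrated without any annotation machinery. The sketch below simply maintains $u$, $v$, $w$ explicitly and compares $\sum_b v_b^2$ with $\sum_b u_b w_b$; in the actual scheme these two sums would be verified through Proposition 4.1 rather than stored, and the function name `defines_injection` and 0-based bucket numbering are illustrative assumptions.

```python
def defines_injection(tuples, r):
    """Return True iff the strict-turnstile stream of ((item, bucket), delta) tuples
    places at most one item with positive count in each of the r buckets."""
    u, v, w = [0] * r, [0] * r, [0] * r
    for (x, b), delta in tuples:
        u[b] += delta            # total count in bucket b
        v[b] += delta * x        # count-weighted sum of item ids
        w[b] += delta * x * x    # count-weighted sum of squared item ids
    # Cauchy-Schwarz gives v_b^2 <= u_b * w_b, with equality exactly for pure buckets.
    return sum(vb * vb for vb in v) == sum(ub * wb for ub, wb in zip(u, w))

pure = [((5, 0), 2), ((9, 1), 1), ((5, 0), 1)]
impure = [((5, 0), 1), ((7, 0), 1)]
repaired = impure + [((7, 0), -1)]   # deleting item 7 makes bucket 0 pure again
```

Because every update changes $u$, $v$ and $w$ linearly, the same test works in the presence of deletions, provided all final counts are non-negative as the strict turnstile model guarantees.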
V verifies that $h$ is injective using the scheme of Lemma 4.7. If this claim is true, then $F_k(x) = F_k(h(x))$, the frequency moment of the mapped-down stream, and P can prove this by running the scheme of [7, Theorem 4.1] on the derived stream $h(x)$.

Perfect completeness follows from P’s ability to find a perfect hash function just as in Theorem 4.3. Soundness follows from the soundness of the INJECTION scheme of Lemma 4.7, in addition to the soundness property of the $F_k$ scheme of [7, Theorem 4.1].

To analyze the costs, note that by using the hash family of Fredman and Komlós [15], the annotation length and space cost due to specifying and storing the hash function $h$ is $O(m^2 \log n / r)$. The annotation length and space cost of the dense $F_k$ scheme of Proposition 4.1 are $O(k^2 c_a \log r)$ and $O(k c_v \log r)$ for any $c_a c_v \ge r$. The annotation length and space cost of the INJECTION scheme can be set to $O(c_a \log r)$ and $O(c_v \log r)$ respectively. Setting $r = m^{4/3}$ and $c_a = c_v = m^{2/3}$ yields the desired costs.

We now give an online version of the $F_k$ scheme of Theorem 4.5. A simple modification of this scheme yields the scheme for $m$-DISJ with analogous costs as claimed in Row 4 of Table 2. In addition to avoiding the use of prescience, our online scheme avoids requiring V to explicitly store the hash function sent by P, allowing us to achieve a much wider range of tradeoffs between annotation size and space usage relative to Theorems 4.3 and 4.5.

Theorem 5.1.
For any $c_v > 1$, there is an online $(k^2 m c_v^{-1/2} \log n \log_{c_v} m,\; k c_v \log n \log_{c_v} m)$-scheme for $F_k$ in the strict turnstile model for a stream of sparsity $m$ over a universe of size $n$. Any online $(c_a, c_v)$-scheme for this problem with $c_a \ge \log n$ requires $c_a c_v = \Omega(m \log(n/m))$.

Notice that the annotation length is less than $m \log n$ for any $c_v = m^{\Omega(1)}$, and therefore this protocol is not subsumed by the simple “sparse” scheme (second row of Table 3) in which P just replays the entire stream in a sorted order, and V checks this is done correctly using fingerprints. Notice also that the product of the space usage and annotation length is $k^3 m c_v^{1/2} \log^2 n \log_{c_v}^2 m$, which is in $o(n)$ for many interesting parameter settings. This improves upon the dense sum-check scheme (first row of Table 3) in such cases.

In order to achieve an online scheme, we examine how to construct perfect hash functions such as those used in the prescient $F_k$ scheme of Theorem 4.5. Let $S$ be the set of $m$ items with non-zero frequency at the end of the stream: we want the hash function to be one-to-one on $S$. Choose a hash function $h$ at random from a pairwise independent hash family mapping $[n]$ to $[r]$, for $r$ to be specified later; this requires just $O(\log n)$ bits to specify. We only expect $O(m^2/r)$ pairs to collide under $h$, which means that with constant probability there will be $O(m^2/r)$ collisions if $h$ is chosen as specified. The final hash function $h^*$ is specified by writing down $h$ (which takes only $O(\log n)$ bits), followed by the items involved in a collision and some special locations for them. The total (expected) bit length to specify this hash function is $O(m^2 \log(n) / r)$.

In our online $F_k$ scheme, P will send such an $h$ at the start of the stream. Notice $h$ does not depend on the stream itself (it is just a random pairwise independent hash function), so P is not using prescience.
P also has no incentive not to choose $h$ at random from a pairwise independent hash family, since the only purpose of choosing $h$ in this manner is to minimize the number of collisions under $h$. If P chooses $h$ in a different way, P simply risks that there are too many collisions under $h$, causing V to reject.

Now while V observes the stream, she runs the online sum-check scheme for $F_k$ given in Proposition 4.1 on the mapped-down universe of size $r$, using $h$ as the mapping-down function. At the end of the stream, P is asked to retroactively specify a hash function $h^*$ that is one-to-one on $S$ as follows. P provides a list $L$ of all items in $S$ that were involved in a collision under $h$, accompanied by their frequencies. Assuming that these items and their frequencies are honestly specified by P, V can compute their contribution to $F_k$ and remove them from the stream. By design, $h^*$ is then (claimed to be) injective on the remaining items. V can confirm this tentatively using the INJECTION scheme of Lemma 4.7.

The remainder of the scheme is devoted to making the correctness a certainty by ensuring that the items in $L$ and their frequencies are as claimed (we stress that while our exposition of the scheme is modular, all parts of the scheme are executed in parallel, with no communication ever occurring from V to P). A naive approach to checking the frequencies of the items in $L$ would be to run $|L|$ independent POINTQUERY schemes, one for each item in $L$; however there are too many items in $L$ for this to be cost-effective. Instead, we check all of the frequencies as a batch, with a (sub-)scheme whose cost is roughly equal to that of a single INJECTION query.

This (sub-)scheme can be understood as proceeding in stages, with each stage $i$ using a different pairwise independent hash function $h_i$ to map down the full original input.
Say that an item $j$ is isolated by $h_i$ if $j$ is not involved in a collision under $h_i$ with any other item with non-zero frequency in the original data stream $x$. The goal of stage $i$ is to isolate a large fraction of items which were not isolated by any previous stage.

A key technical insight is that at each stage $i$, it is possible for V to “ignore” all items that are not isolated at that stage. This enables V to check that the frequencies of all items that are isolated at stage $i$ are as claimed. We bound the number of stages that are required to isolate all items if P behaves as prescribed; if P reaches an excessive number of stages, then V will simply reject.

Proof of Theorem 5.1:
Let $r = m c_v^{1/2}$. P sends a hash function $h : [n] \to [r]$ at the start of the stream, claimed to be chosen at random from a pairwise independent hash family. While observing the stream, V runs the dense online sum-check scheme for $F_k$ given in Proposition 4.1 on the mapped-down universe $[r]$. Let $S$ be the set of items with non-zero frequency at the end of the stream. After the stream is observed, P is asked to provide a list $L$ of all items with nonzero frequency that were involved in a collision, followed by a claimed frequency $f^*_i$ for each $i \in L$.

Assuming that these items and their frequencies are honestly specified in $L$ by P, V can compute their contribution $C_1 = \sum_{i \in L} (f^*_i)^k$ to $F_k$ and then remove them from the stream by processing updates $U = \{(i, -f^*_i) : i \in L\}$ within the dense $F_k$ scheme. By the definition of $L$, $h$ is injective on the remaining items. V can confirm this using the INJECTION scheme of Lemma 4.7 (conditioned on the assumed correctness of $L$). Thus the dense $F_k$ scheme will output $C_2 = \sum_{i \notin L} f_i^k$. Assuming all of V’s checks within the dense $F_k$ scheme pass, V outputs $C_1 + C_2$ as the answer.

The remainder of the scheme is directed towards determining that the frequencies of items in $L$ are correctly reported. We abstract this goal as the following problem.

Definition 5.2.
Define the $\ell$-MULTIINDEX problem as follows. Consider a data stream $x \circ L$, where $\circ$ denotes concatenation. $x$ is a usual data stream in the strict turnstile model, while $L$ is a list of $\ell$ pairs $(i, f^*_i)$. Let $f$ be the frequency vector of $x$. The desired output is 1 if $f_i = f^*_i$ for all $i \in L$, and 0 otherwise.

We defer our solution to the $\ell$-MULTIINDEX problem to Section 5.3. For now, we state our main result about the problem in the following lemma.
Lemma 5.3.
For all $c_v > 1$, $\ell$-MULTIINDEX has an online $(m c_v^{-1/2} \log n \log_{c_v} \ell,\; c_v \log n \log_{c_v} \ell)$-scheme in the strict turnstile update model.

Analysis of Costs.
Let $S$ be the set of items with non-zero frequency when the stream ends. First, we argue that if $r$ is the size of the mapped-down universe, and P chooses the hash function $h$ at random from a pairwise independent hash family, then with probability at least 9/10, there will be at most $10 m^2 / r$ items in $S$ that collide under $h$. Indeed, by a union bound, the probability any item $i$ with non-zero count is involved in a collision is at most $m/r$, and hence by linearity of expectation, the expected number of items involved in a collision is at most $m^2/r$.

So by Markov’s inequality, with probability at least 9/10, the total number of items involved in a collision will be at most $10 m^2 / r = O(m c_v^{-1/2})$ under the setting $r = m c_v^{1/2}$. Conditioned on this event, P can specify the list $L$ and the associated frequencies with annotation length $O(m c_v^{-1/2} \log n)$, and V can use the MULTIINDEX scheme of Lemma 5.3 with $\ell = O(m c_v^{-1/2})$ to verify the frequencies of the items in $L$ are as claimed.

For any $c_v > 1$, Lemma 5.3 under this setting of $\ell$ yields an $(m c_v^{-1/2} \log n \cdot \log_{c_v} \ell,\; c_v \log n \cdot \log_{c_v} \ell)$-scheme. Running all of the sum-check schemes (i.e., the INJECTION scheme and the $F_k$ scheme itself) on the mapped-down universe requires annotation $O(k^2 r c_v^{-1} \log r)$ and space $O(k c_v \log r)$ for V; in total, this provides an online $(m^2 \log n / r + k^2 r \log n / c_v + k m c_v^{-1/2} \log n \cdot \log_{c_v} m,\; k c_v \log n \cdot \log_{c_v} m)$-scheme.

Since we set $r = m c_v^{1/2}$, we obtain an online $(k^2 m c_v^{-1/2} \log n \log_{c_v}(m),\; k c_v \log n \log_{c_v}(m))$-scheme for any $c_v > 1$.

$(m, n)$-sparse INDEX problem.
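The collision bound used in the cost analysis above can be explored empirically. The sketch below is illustrative only: it uses the common construction $h(x) = ((ax + b) \bmod p) \bmod r$ as a stand-in for a pairwise independent family (it is only approximately uniform over $[r]$), and the helper names, the prime `1_000_003`, and the parameter values are assumptions chosen for the example rather than anything prescribed by the scheme.

```python
import random

def make_pairwise_hash(p, r, rng):
    """Return h(x) = ((a*x + b) mod p) mod r with a, b drawn from rng.

    p is assumed to be a prime larger than every item identifier.
    """
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % r

def colliding_items(items, h):
    """Return the set of items that share a bucket with another item under h."""
    buckets = {}
    for x in items:
        buckets.setdefault(h(x), []).append(x)
    return {x for group in buckets.values() if len(group) > 1 for x in group}

rng = random.Random(0)
n, m, c_v = 10**6, 1000, 10_000
r = int(m * c_v ** 0.5)        # r = m * sqrt(c_v), as in the proof of Theorem 5.1
S = rng.sample(range(n), m)    # the m items with non-zero frequency
h = make_pairwise_hash(1_000_003, r, rng)
L = colliding_items(S, h)      # the list the prover would have to send
print(len(L), "colliding items; Markov bound 10*m^2/r =", 10 * m * m // r)
```

With these parameters the expected number of colliding items is at most $m^2/r = 10$, so the list $L$ is typically far shorter than the stream itself.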
Before presenting an efficient online scheme for the $\ell$-MULTIINDEX problem, we define two “sub”-problems, which apply a function to only a subset of the desired input.
Definition 5.4.
Define the problem SUBINJECTION as follows. We observe a stream of tuples $t_i = (x_i, b_i, \delta_i)$, followed by a vector $z \in \{0, 1\}^r$. As in the INJECTION problem, each $t_i$ indicates that $\delta_i$ copies of item $x_i$ are placed in bucket $b_i \in [r]$.

We say that the stream defines a subinjection based on $z$ if for every $b$ such that $z_b \ge 1$, for every two pairs $(x, b)$ and $(y, b)$ with positive counts, it holds that $x = y$. The SUBINJECTION problem is to decide whether the stream defines a subinjection based on $z$.

Notice that the INJECTION problem is a special case of the SUBINJECTION problem with $z_i = 1$ for all $i$.

Lemma 5.5.
For any $c_a c_v \ge r$, there is an online $(c_a \log r, c_v \log r)$-scheme for SUBINJECTION in the strict turnstile update model. Moreover, for any constant $c > 0$, this scheme can be instantiated to have soundness error $1/r^c$.

Proof. Define vectors $u$, $v$, and $w$ exactly as in the proof of Lemma 4.7, and observe that the stream defines a sub-injection if and only if $\sum_{b \in [r]} z_b v_b^2 = \sum_{b \in [r]} z_b u_b w_b$. V can compute both quantities using the dense scheme of Proposition 4.1, with the same asymptotic costs as the scheme of Lemma 4.7. The soundness error can be made smaller than $1/r^c$ for any constant $c$ by running the scheme of Proposition 4.1 over a finite field of size $\mathrm{poly}(r)$, for a sufficiently fast-growing polynomial in $r$.

We similarly define the problem SUBF$_2$ over a data universe of size $n$ based on a vector $z \in \{0, 1\}^n$ as $\sum_{i \in [n]} z_i f_i^2$, the sum of squared frequencies of items indicated by $z$. This too is a low-degree polynomial function of the input values, and so Proposition 4.1 implies SUBF$_2$ can be computed by an online $(c_a \log r, c_v \log r)$-scheme in the general turnstile update model for any $c_a, c_v$ such that $c_a c_v \ge r$ (and the soundness error in this protocol can be made smaller than $1/r^c$ for any desired constant $c$).

Online scheme for $\ell$-MULTIINDEX. The scheme can be thought of as proceeding in $t$ stages ($t$ will be specified later), although these stages merely serve to partition the annotation: there is no communication from V to P during these stages. Each stage $j$ makes use of a corresponding hash function $h_j : [n] \to [r]$ for $r = m c_v^{1/2}$. The $t$ hash functions are provided by P at the start of the stream, so that V has access to them throughout the stream. Each $h_j$ is claimed to be chosen at random from a pairwise independent hash family: if they are, then there are unlikely to be too many collisions, so P has no incentive not to choose $h_j$ at random.
Let f denote the vector of frequencies defined by the input stream, and let f ( ) denote the vectorsatisfying f ( ) i = f i for i ∈ L , and f ( ) i = i L .Stage j begins with a list L j − of items. We will refer to these items as “exceptions”. P provides a newlist L j ⊆ L j − of items which remain exceptions in stage j ; P implicitly claims that no items in L j − \ L j h j . Let z ( j ) denote the indicator vector of the list ofbuckets corresponding to L j − \ L j , i.e. z ( j ) h j ( i ) = i ∈ L j − \ L j , and z ( j ) entries are 0 otherwise. To checkthat no items in L j − \ L j collide under h j , V will use the S UB I NJECTION scheme based on the indicatorvector z ( j ) over the full original input f as mapped by the hash function h j . Note that since the original inputstream is in the strict turnstile update model, so is the stream on which the S UB I NJECTION scheme is run (asthe S UB I NJECTION scheme is simply run on the original input stream as mapped by the hash function h j ,based on the vector z ( j ) ). Note also that L j − and L j are provided explicitly, so V can compute z ( j ) easily. Having established that the items in L j − \ L j are no longer exceptions, V also wants to ensure that thefrequencies of these items were reported correctly in L . To do so, V run the S UB F scheme over the vector f − f ∗ as mapped by h j to r buckets, based on the z ( j ) indicator vector. The result is zero if and only if f i = f ( j ) i for all i where z ( j ) i = L j = /0, and there are no more exceptions. Provided all schemes concludecorrectly, and the number of stages to reach L j = /0 is at most t , V can accept the result, and output 1 for theanswer to the M ULTI I NDEX decision problem.Lastly, note that V does not need to explicitly store any of the lists L j . 
In fact, P can implicitly specify all of the lists $L_j$ while playing the list $L$: for each item $i \in L$, he provides a number $j$, thereby implicitly claiming that $i \in L_{j'}$ for $j' \le j$, and $i \notin L_{j'}$ for $j' > j$.

Analysis of costs. If $h_j$ is chosen at random from a pairwise independent hash family, the probability an item $i$ in $L_{j-1}$ is involved in a collision with the original stream $f$ under $h_j$ is $O(m/r) = O(c_v^{-1/2})$. Consider the probability that any item $i$ survives as an exception to stage $t$. The probability of this is $O(c_v^{-t/2})$, and summed over all $\ell$ items, the expected number is $O(\ell c_v^{-t/2})$. Invoking Markov’s inequality, with constant probability it suffices to set $t = O(\log_{c_v} \ell)$ to ensure that we need at most $t$ stages before no more exceptions need to be reported.

In stage $j$, the SUBINJECTION and SUBF$_2$ schemes cost $(m c_v^{-1/2} \log n,\; c_v \log n)$. Summing over the $t$ stages, we achieve for any $c_v > 1$ an $(m c_v^{-1/2} \log(n) \cdot \log_{c_v}(\ell),\; c_v \log(n) \cdot \log_{c_v}(\ell))$-scheme as claimed in the statement of Lemma 5.3.

Formal Proof of Soundness.
The soundness error of the protocol can be bounded by the probability any invocation of the SUBINJECTION scheme or the SUBF$_2$ scheme returns an incorrect answer. The soundness errors of both the SUBINJECTION scheme and the SUBF$_2$ scheme can be made smaller than $1/r^c$ for any constant $c > 0$, and therefore a union bound over all $t = O(\log_{c_v} \ell)$ invocations of each protocol implies that with high probability, no invocation of either scheme returns an incorrect answer.

Our online scheme for $F_k$ in Theorem 5.1 has a number of important consequences.

Inner Product and Hamming Distance.
Chakrabarti et al. [7] point out that computing inner products and Hamming Distance can be directly reduced to (exact) computation of the second Frequency Moment $F_2$, and so Theorems 4.5 and 5.1 immediately yield schemes for these problems of identical cost.

An improved scheme for $\phi$-HEAVYHITTERS. We can use Lemma 5.3 to yield an online scheme for the $\phi$-HEAVYHITTERS problem.
Corollary 5.6.
For all $c_a, c_v$ such that $c_a c_v \ge m \log n$, there is an online $(c_a \log n \cdot \log_{c_v}(m) + \phi^{-1} \log n,\; c_v \log n \log_{c_v}(m))$-scheme for solving $\phi$-HEAVYHITTERS in the strict turnstile update model.

For example, V can add one to the corresponding entry of $z^{(j)}$ for each item that is marked as an exception. This will cause $z^{(j)}$ to count the number of exceptions in each bucket, rather than indicate them, but this does not affect the correctness.

$\phi$-HEAVYHITTERS to demonstrating the frequencies of $O(\phi^{-1})$ items in a derived stream. Moreover, the derived stream has sparsity $O(m \log n)$ if the original stream has sparsity $m$. We use the MULTIINDEX scheme of Lemma 5.3 to verify these claimed frequencies.
Frequency-based functions.
Chakrabarti et al. [7, Theorem 4.5] also explain how to extend the sum-checkscheme of Proposition 4.1 to efficiently compute arbitrary frequency-based functions , which are functionsof the form F ( x ) = (cid:229) i ∈ [ n ] g ( f i ( x )) for an arbitrary g : ( − [ N ] ∪ [ N ]) → Z . A similar but more involvedextension applies in our setting, by replacing the dense F k scheme implied by Proposition 4.1 with the densefrequency-based functions scheme of [7, Theorem 4.5]. We spell out the details below, restricting ourselvesto the prescient case for brevity; an online scheme with essentially identical costs follows by using the ideasunderlying Theorem 5.1. Corollary 5.7.
Let F ( x ) = (cid:229) i ∈ [ n ] g ( f i ( x )) be a frequency-based function. Then there is a prescient ( N / log n , N / log n ) -scheme for computing F ( x ) in the strict unit-update turnstile model. This scheme satisfies perfectcompleteness.Proof. We use a natural modification of the frequency-based functions scheme of [7, Theorem 4.5]. P specifies a hash function h at the start of the stream mapping the universe [ n ] into [ N / ] ; P chooses h to be injective on the set of items that have non-zero frequency at the end of the stream. Using the per-fect hash functions of Fredman and Koml´os [15], h can be represented with O ( N / r log n ) = O ( N / log n ) bits. V stores h explicitly. After the stream is observed, P and V run the f -H EAVY H ITTERS scheme ofCorollary 5.6, with f = N − / . Using the fact that (cid:229) i f i < N , by setting the parameters of Corollary 5.6appropriately we can ensure that this part of the scheme requires annotation length O ( N / log n ) and hasspace cost O ( N / log n ) . This scheme also allows V to determine the exact frequencies of the items in H ,allowing V to compute cont ( H ) : = (cid:229) i ∈ H g ( f i ( x )) , which gives the contribution of the items in H to the out-put F ( x ) . Moreover, whenever V learns the frequency f i of an item in i ∈ H , V treats this as a deletion of f i occurrences of item i , thereby obtaining a derived stream z in which all frequencies have absolute value atmost N / . P and V now run the polynomial-agreement scheme that was first presented in [9, Theorem 4.6] on the“mapped-down” input h ( z ) over the universe [ N / ] . For any c a c v ≥ r , the polynomial agreement scheme canachieve cost ( F max ( z ) c a log n , c v log n ) , where F max ( z ) denotes max i | f i ( z ) | , the largest frequency in absolutevalue of any item. Setting c v = N / and c a = N / , we obtain a prescient ( N / log n , N / log n ) -scheme asclaimed. 
V computes the final answer as F ( x ) = cont ( H ) + F ( h ( z )) − | H | g ( ) .The final issue is that V needs to verify that h is actually injective over the items that appear in x . V canaccomplish this using the I NJECTION scheme of Lemma 4.7. This does not affect the asymptotic costs ofour scheme, as the I
NJECTION scheme can support annotation cost c a log r and space cost c v log r for any c a c v = W ( N / ) .Finally, we provide one additional corollary, which describes a protocol that will be useful in the nextsection when building graph schemes. Theorem 5.8.
Let X , Y ⊆ [ n ] be sets with | X | ≤ | Y | ≤ m. Then given a stream in the strict turnstile updatemodel with elements of X and Y arbitrarily interleaved, there is an online ( mc − / v · log ( n ) · log c v ( m ) , c v · log ( n ) · log c v ( m )) -scheme for determining whether X ⊆ Y for any c v > .Proof. If X Y , P can specify an x ∈ X \ Y and prove that x is indeed in X and not Y with two point queriesusing the scheme of Theorem 3.2. For the other case, Chakrabarti et al. show how to directly reduce thecase X ⊆ Y to computation of frequency moments [7]. The claimed costs follow from Theorem 5.1.Table 4 provides a comparison of schemes for the S UBSET problem in the dense and sparse cases.22cheme Costs Completeness Online/Prescient Source ( | X | log n , log n ) Perfect Prescient [7] ( c a log n , c v log n ) : c a c v ≥ n Perfect Online [7] ( m log n , log n ) Perfect Online [7] ( mc − / v log c v ( m ) log n , c v log n log c v m ) : c v > UBSET scheme to prior work. Ours is the first online scheme to achieveannotation length and space usage that are both sublinear in m for m ≪ √ n , and strictly improves over theonline MA communication cost of prior protocols whenever m = o ( n ) . We now describe some applications of the techniques developed above to graph problems. The main purposeof this section is to demonstrate that the techniques developed within the F k and m - DISJ schemes are broadlyapplicable to a range of settings.We begin with several non-trivial graph schemes that are direct consequences of the Subset scheme ofTheorem 5.8. Recall that our definition of a scheme for a function F requires a convincing proof of the valueof F ( x ) for all values F ( x ) . This is stricter than the traditional definition of interactive proofs for decisionproblems, which just require that if F ( x ) = F ( x ) = A = ( h , V ) satisfy:1. For all x s.t. F ( x ) =
1, we have Pr r P , r V [ out V ( x h , r P , r V ) = ] ≤ / x s.t. F ( x ) = h ′ = ( h ′ , h ′ , . . . , h ′ N ) ∈ ( { , } ∗ ) N , we have Pr r V [ out V ( x h ′ , r V ) = ] ≤ / Theorem 6.1.
Under the above relaxed definition of a scheme, each of the problems PERFECT-MATCHING, CONNECTIVITY, and NON-BIPARTITENESS has an $(n \log n + m c_v^{-1/2} \log n \log_{c_v} m,\; c_v \log n \log_{c_v} m)$-scheme on graphs with $n$ vertices and $m$ edges for all $c_v > 1$. All three schemes work in the strict turnstile update model and improve over prior work if $c_v = \omega(\log m)$ and $c_v = o(m)$.

Proof. In the case of perfect matching, the prover can prove a perfect matching exists by sending a matching $M$, which requires $n \log n$ bits of annotation. In order to prove $M$ is a valid perfect matching, P needs to prove that every node appears in exactly one edge of $M$, and that $M \subseteq E$, where $E$ is the set of edges appearing in the stream. V can check the first condition by comparing a fingerprint of the nodes in $M$ to a fingerprint of the set $\{1, \ldots, n\}$. V can check that $M \subseteq E$ using Theorem 5.8.

In the case of connectivity, the prover demonstrates the graph is connected by specifying a spanning tree $T$. V needs to check $T$ is spanning, which can be done as in [7, Theorem 7.7], and needs to check that $T \subseteq E$, which can be done using Theorem 5.8.

In the case of non-bipartiteness, P demonstrates an odd cycle $C$. V needs to check $C$ is a cycle, $C$ has an odd number of edges, and that $C \subseteq E$. The first condition can be checked by requiring P to play the edges of $C$ in the natural order. The second condition can be checked by counting. The third condition can be checked using Theorem 5.8.

Counting Triangles.
Returning to our strict definition of a scheme, we give an online scheme for countingthe number of triangles in a graph.
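The proof of Theorem 6.2 below relies on a reduction from triangle counting to the first three frequency moments of a derived stream of sparsity $m(n-2)$. The following Python sketch shows one natural reduction with that sparsity; it is an assumption that this matches the construction of [7, Theorem 7.4] in its details, and the function names are hypothetical. Each edge $(u, v)$ of a simple graph inserts the $n - 2$ unordered triples $\{u, v, w\}$, so a triple has frequency 3 exactly when it is a triangle, and $c(c-1)(c-2)/6$ picks out those triples.

```python
def derived_stream(n, edges):
    """For each edge (u, v), yield the n - 2 unordered triples {u, v, w}."""
    for u, v in edges:
        for w in range(n):
            if w != u and w != v:
                yield tuple(sorted((u, v, w)))

def count_triangles(n, edges):
    """Count triangles of a simple graph from F_1, F_2, F_3 of the derived stream."""
    freq = {}
    for triple in derived_stream(n, edges):
        freq[triple] = freq.get(triple, 0) + 1
    f1 = sum(freq.values())
    f2 = sum(c ** 2 for c in freq.values())
    f3 = sum(c ** 3 for c in freq.values())
    # c*(c-1)*(c-2)/6 equals 1 when c == 3 and 0 when c is 0, 1 or 2,
    # and expands to (c^3 - 3c^2 + 2c)/6.
    return (f3 - 3 * f2 + 2 * f1) // 6

k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
path = [(0, 1), (1, 2)]
print(count_triangles(4, k4), count_triangles(3, path))
```

In a verified setting the three moments would be obtained through the $F_k$ scheme rather than from an explicitly stored frequency table as done here.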
Theorem 6.2.
For any $c_v > 1$, there is an online $(c_a \log n \log m,\; c_v \log n \log m)$-scheme, with imperfect completeness, for counting the number of triangles in a graph on $n$ nodes and $m$ edges, where $c_a = m n c_v^{-1/2}$. The scheme is valid in the strict turnstile update model.

Scheme Costs | Completeness | Online/Prescient | Source
$(c_a \log n, c_v \log n)$ : $c_a c_v \ge n^3$ | Perfect | Online | [7]
$(n^2 \log n, \log n)$ | Perfect | Online | [7]
$(c_a \log n, c_v \log n)$ : $c_a = m n c_v^{-1/2}$ | Imperfect | Online | Theorem 6.2

Table 5: Comparison of prior work to our scheme for counting the number of triangles in a graph with $n$ nodes and $m$ edges. For concreteness, notice that by setting $c_v = n$, Theorem 6.2 achieves a $(m n^{1/2} \log n,\; n \log n)$-scheme, which improves over prior work as long as $m \ll n^{3/2}$.

Proof.
Chakrabarti et al. [7, Theorem 7.4] show how to reduce counting the number of triangles in a graph to computing the first three frequency moments of a derived stream. The derived stream has sparsity $m(n-2)$. Using the online scheme of Theorem 5.1 to compute the relevant frequency moments of the derived stream yields the claimed bounds.

The scheme of Theorem 6.2 should be compared to the $(n^2, \log n)$-scheme from [7, Theorem 7.2] based on matrix multiplication, referenced in Row 2 of Table 5, and the $(c_a, c_v)$-scheme for any $c_a c_v \ge n^3$ from [7, Theorem 7.3], referenced in Row 1 of Table 5. To compare to the former, notice that Theorem 6.2 yields a $(c_a \log n, c_v \log n)$-scheme with $c_a < n^2$ as long as $m < n \sqrt{c_v}$. To compare to the latter, note that in our new scheme, $c_a c_v = m n c_v^{1/2}$, which is less than $n^3$ as long as $c_v^{1/2} < n^2 / m$. In particular, if we set $c_v = n$, then Theorem 6.2 improves over both old schemes as long as $m < n^{3/2}$.

Unfortunately, Theorem 6.2 does not yield a non-trivial MA-protocol for showing no triangle exists. Indeed, equalizing annotation length and space usage in our new protocol occurs by setting both quantities to $(mn)^{2/3}$. But $(mn)^{2/3} < m$ only when $m > n^2$, which is to say that the MA communication complexity of this protocol is always larger than $m$, a cost that can be achieved by the trivial MA protocol where Merlin is ignored and Alice just sends her whole input to Bob. That is, the interest in the new protocol is that it can lower the space usage of V to less than $m$ without drastically blowing up the message length of P to $n^2$ as in the matrix-multiplication based protocol from [7].

All schemes in Sections 4 and 5 work in the strict turnstile update model. The reason these schemes require this update model is that they use the INJECTION and SUBINJECTION schemes of Lemmata 4.7 and 5.5 as sub-routines, and these sub-routines assume the strict turnstile update model.

In this section, we consider two ways to circumvent this issue. To focus the discussion, we concentrate on the online $F_k$ protocol of Theorem 5.1.

One simple method for handling streams in the non-strict turnstile update model is the following. We use the scheme of Theorem 5.1, but within the SUBINJECTION sub-routine, we treat deletions of items in the input stream as insertions of items into the derived stream of $(x_i, b_i, \delta_i)$ updates. This ensures that the INJECTION and SUBINJECTION schemes correctly output 1 if the derived stream is a subinjection (and the remainder of the scheme computes the correct answer on the original stream). However, it increases the expected number of collisions under the universe-reduction mappings $h_i$, from $m \cdot |L_{i-1}| / r$ to $M \cdot |L_{i-1}| / r$. The result is that we achieve the same costs as Theorem 5.1, except the costs depend on the stream footprint $M$ rather than the stream sparsity $m$ (see Section 2.4).

Corollary 7.1. For any $c_v > 1$, there is a $(k^2 M c_v^{-1/2} \cdot \log(n) \cdot \log_{c_v}(M),\; k c_v \cdot \log(n) \cdot \log_{c_v}(M))$ online scheme for $F_k$ in the non-strict turnstile update model over a stream with footprint $M$ over a universe of size $n$.

In this section, we describe an AMA scheme for the INJECTION problem that works in the non-strict turnstile stream update model, i.e., the input may define a frequency vector where some elements end with negative frequency. The scheme for INJECTION of Lemma 4.7 breaks down here, since there may be some cases where the checks performed by the protocol indicate that a bucket is pure, when this is not the case: cancellations of item weights in the bucket may give the appearance of purity. To address this, we use public randomness, thereby yielding an AMA scheme. In essence, the verifier asks the prover to demonstrate the purity of each of the $r$ buckets via fingerprints of the bucket contents. However, if we allow the prover to choose the fingerprint function, P could pick a function which leads to false conclusions. Instead, V chooses the fingerprinting function using public randomness. The players then execute a new INJECTION protocol using the data remapped under the fingerprint function, which is intended to convince V of the purity of the buckets. This then allows us to construct protocols with costs that depend on the stream sparsity $m$ rather than the footprint $M$ as in Corollary 7.1.

In detail, the new AMA scheme proceeds as follows. Consider the INJECTION problem as defined in Definition 4.6, but generalized to allow items with arbitrary integer counts. Consider again a bucket $b$, and for $1 \le j \le \log n$ define $b^{j=\ell}$ to be the frequency vector of the subset of stream updates $(x_k, b, \delta_k)$ placing items into bucket $b$, subject to the restriction that the $j$’th bit of $x_k$ is equal to $\ell$. We observe the following property: if bucket $b$ is pure, then one of $b^{j=0}$ and $b^{j=1}$ must be the zero vector, for each $j$. Moreover, if $b$ is not pure, then there exists a $j$ such that both $b^{j=0}$ and $b^{j=1}$ are not the zero vector.

A natural way to compactly test whether these vectors are equal to zero (probabilistically) is to use fingerprinting (discussed in Section 2.4). The verifier V could do this unaided for a single bucket, but we wish to run this test in parallel for $r$ buckets. At a high level, we achieve this as follows.
Given a stream of updates $(x_k, b, \delta_k)$, we define two vectors $z$ and $o$ of length $r \log n$, such that each coordinate of $z$ and $o$ corresponds to a (bucket, coordinate) pair $(b, j) \in [r] \times [\log n]$. In more detail, we will define $z$ and $o$ such that for each bucket $b$ and coordinate $j \in [\log n]$, the $(b, j)$th entry of $z$ is a fingerprint of the vector $b^{j=0}$, and the $(b, j)$th entry of $o$ is a fingerprint of the vector $b^{j=1}$.

We choose the fingerprinting functions to satisfy two properties.

1. The fingerprint of the all-zeros vector is always 0. This ensures that if all buckets are pure, then the inner product of $z$ and $o$ is 0, as $z_{b,j} \cdot o_{b,j}$ is 0 for all pairs $(b, j) \in [r] \times [\log n]$.
2. If there is an impure bucket, then the inner product of $z$ and $o$ will be non-zero with high probability over the choice of fingerprint functions.

Therefore, in order to determine whether the stream defines an injection, it suffices to compute

$$\sum_{(b,j) \in [r] \times [\log n]} z_{b,j} \cdot o_{b,j},$$

which can be computed using Proposition 4.1 with annotation length $c_a \log n$ and space cost $c_v \log n$ for any $c_a \cdot c_v \ge r \log n$.

The idea allowing us to achieve the second property is as follows. If bucket $b$ is impure, then there is at least one coordinate $j \in [\log n]$ such that $b^{j=0}$ and $b^{j=1}$ are both not equal to the all-zeros vector. By basic properties of fingerprints, this ensures that both $z_{b,j}$ and $o_{b,j}$ are non-zero with high probability over the choice of fingerprint functions. Moreover, we choose the fingerprinting functions in such a way that non-zero terms in the sum $\sum_{(b,j) \in [r] \times [\log n]} z_{b,j} \cdot o_{b,j}$ are unlikely to “cancel out” to zero.

Consequently, we can state an analog of Lemma 4.7.

Lemma 7.2.
For any $c_a c_v \ge r \log n$, there is an online $(c_a \log n, c_v \log n)$-scheme for determining whether a stream in the non-strict turnstile model is an injection.

Proof. Let $\mathbb{F}_q$ be a finite field of size $q = \mathrm{poly}(n)$, where the subsequent analysis determines the required magnitude of $q$. V uses public randomness to choose two field elements $\alpha$ and $\beta$ uniformly at random from $\mathbb{F}_q$. For each bucket $b \in [r]$, and each coordinate $j \in [\log n]$, we define two “fingerprinting” functions $g_{b,j,\alpha}$ and $g_{b,j,\beta}$ mapping an $n$-dimensional frequency vector to $\mathbb{F}_q$ as follows:

$$g_{b,j,\alpha}(x) = \alpha^{n(b \cdot \log n + j)} \sum_{\ell \in [n]} x_\ell\, \alpha^{\ell}, \qquad g_{b,j,\beta}(x) = \beta^{n(b \cdot \log n + j)} \sum_{\ell \in [n]} x_\ell\, \beta^{\ell},$$

where each entry $x_\ell$ of $x$ is treated as an element of $\mathbb{F}_q$ in the natural manner.

We now (conceptually) construct two vectors $z$ and $o$ of dimension $r \log n$, where for each $(b, j) \in [r] \times [\log n]$, $z_{b,j} = g_{b,j,\alpha}(b^{j=0})$ and $o_{b,j} = g_{b,j,\beta}(b^{j=1})$. That is, the $(b, j)$th entry of $z$ equals the fingerprint of the frequency vector of items mapping to bucket $b$ with a 0 in the $j$th bit of their binary representation, and the $(b, j)$th entry of $o$ is the analogous fingerprint for items with a 1 in the $j$th bit. Observe that $g_{b,j,\alpha}(0) = g_{b,j,\beta}(0) = 0$ for all $(b, j) \in [r] \times [\log n]$, as required by Property 1 above.

We now show that Property 2 holds, i.e. if there is an impure bucket, then the inner product of $z$ and $o$ will be non-zero with high probability over the choice of $\alpha$ and $\beta$. In the following, for an item $\ell \in [n]$ and bucket $b \in [r]$, we let $f_\ell(b)$ denote the frequency with which item $\ell$ is mapped to bucket $b$, and we let $\ell_j$ denote the $j$’th bit in the binary representation of $\ell$. We can write the inner product of $z$ and $o$ as

$$\sum_{(b,j) \in [r] \times [\log n]} g_{b,j,\alpha}(b^{j=0})\, g_{b,j,\beta}(b^{j=1})$$
$$= \sum_{(b,j) \in [r] \times [\log n]} \alpha^{n(b \cdot \log n + j)} \beta^{n(b \cdot \log n + j)} \Big( \sum_{\ell \in [n],\, \ell_j = 0} f_\ell(b)\, \alpha^{\ell} \Big) \Big( \sum_{\ell \in [n],\, \ell_j = 1} f_\ell(b)\, \beta^{\ell} \Big)$$
$$= \sum_{(b,j) \in [r] \times [\log n]} \alpha^{n(b \cdot \log n + j)} \beta^{n(b \cdot \log n + j)} \sum_{(\ell, \ell') :\, \ell_j = 0,\, \ell'_j = 1} f_\ell(b)\, f_{\ell'}(b)\, \alpha^{\ell} \beta^{\ell'}.$$

We therefore see that the inner product of $z$ and $o$ is a polynomial in $\alpha$ and $\beta$ of degree $O(n r \log n)$ in each variable. Moreover, the coefficient of the term $\alpha^{n(b \cdot \log n + j) + \ell}\, \beta^{n(b \cdot \log n + j) + \ell'}$ is precisely $f_\ell(b) \cdot f_{\ell'}(b)$ if $\ell_j = 0$ and $\ell'_j = 1$, and is 0 otherwise.

Recall that if bucket $b$ is not pure, then there is at least one coordinate $j \in [\log n]$, and items $\ell, \ell' \in [n]$ with $\ell_j = 0$ and $\ell'_j = 1$, such that $f_\ell(b) \ne 0$ and $f_{\ell'}(b) \ne 0$. The above analysis implies that $z \cdot o$ is a non-zero polynomial in $\alpha$ and $\beta$, as the coefficient of $\alpha^{n(b \cdot \log n + j) + \ell}\, \beta^{n(b \cdot \log n + j) + \ell'}$ is non-zero. Hence, by the Schwartz-Zippel lemma, the probability over a random choice of $\alpha$ and $\beta$ that $z \cdot o = 0$ is $O(n r \log n / q)$. Setting $q$ to be a sufficiently large polynomial in $n$, there is only negligible probability (over the choice of $\alpha$ and $\beta$) that $z \cdot o$ is zero if the stream is not an injection.

Finally, notice that the verifier can apply the scheme of Proposition 4.1 to compute $\sum_{(b,j) \in [r] \times [\log n]} z_{b,j} \cdot o_{b,j}$, as each stream update $(x_k, b, \delta_k)$ can be treated as $\log n$ updates to the vectors $z$ and $o$. For example, if the $j$th bit of $x_k$ is 0, then update $(x_k, b, \delta_k)$ causes $z_{b,j}$ to be incremented by $\delta_k \cdot \alpha^{n(b \cdot \log n + j) + x_k}$.

Applications.
We can apply this online scheme to compute Frequency Moments (and Inner Product, Hamming Distance, Heavy Hitters, etc.) over sparse data in the non-strict turnstile update model. The costs of the resulting online AMA scheme are similar to the costs of the online schemes for the same problems developed in previous sections. The only difference is that we have scaled $m$ up by a $\log n$ factor, to account for the fact that within the new AMA sub-scheme for INJECTION, we must run the dense protocol of Proposition 4.1 on vectors $z$ and $o$ of length $r \log n$, rather than on vectors of length $r$ as in prior sections, and substitute the bounds from Lemma 7.2. For example, the analog of Theorem 5.1 is that for any $c_v > 1$, there is a $(k^2 m c_v^{-1/2} \cdot \log(n) \cdot \log_{c_v}(m),\ k c_v \cdot \log(n) \cdot \log_{c_v}(m))$ online AMA scheme for $F_k$ in the non-strict turnstile model.

We have presented a number of protocols in the annotated data streaming model that for the first time allow both the annotation length and the space usage of the verifier to be sublinear in the stream sparsity, rather than just the size of the data universe. Our protocols substantially improve on the applicability of prior work in natural settings where data streams are defined over very large universes, such as IP packet flows and sparse graph data.

A number of interesting questions remain for future work. The biggest open question is to determine the precise dependence on the stream sparsity in problems such as $m$-DISJ and frequency moments. When setting the annotation length and the space usage of the verifier to be equal, our protocols have cost roughly $m^{2/3}$, where $m$ is the sparsity of the data stream. The best known lower bound is roughly $m^{1/2}$. We conjecture that our upper bound is tight up to logarithmic factors, but proving any Merlin-Arthur communication lower bound larger than $m^{1/2}$ will require new lower bound techniques in communication complexity. Another interesting open question is to give improved protocols for multiplying an $n \times n$ matrix $A$ by a vector $x$, when $A$ is sparse (i.e., has $o(n^2)$ non-zero entries), but $x$ may be dense. Achieving this would yield improved protocols for proving disconnectedness, bipartiteness, or the non-existence of a perfect matching in a bipartite graph. Currently we do not know of any protocols for these problems that leverage graph sparsity in any way.

References

[1] S. Aaronson. QMA/qpoly ⊆ PSPACE/poly: De-Merlinizing Quantum Protocols. In
CCC, pages 261–273, 2006.
[2] S. Aaronson and A. Wigderson. Algebrization: a new barrier in complexity theory. ACM Trans. Comput. Theory, 1(1):1–54, 2009. Preliminary version appeared in STOC [...].
[3] [...] FOCS, pages 337–347, 1986.
[4] E. Blais, J. Brody, and K. Matulef. Property testing lower bounds via communication complexity. Computational Complexity, 21:311–358, 2012.
[5] J. Brody, A. Chakrabarti, and R. Kondapally. Certifying equality with limited interaction. Technical Report TR12-153, ECCC, 2012.
[6] H. Buhrman, D. García-Soriano, A. Matsliah, and R. de Wolf. The non-adaptive query complexity of testing k-parities. arXiv preprint arXiv:1209.3849, 2012.
[7] A. Chakrabarti, G. Cormode, A. McGregor, and J. Thaler. Annotations in data streams. Electronic Colloquium on Computational Complexity (ECCC), 19:22, 2012. A preliminary version of this paper by A. Chakrabarti, G. Cormode, and A. McGregor appeared in ICALP [...].
[8] [...] Complexity Theory: Current Research, S. Homer, U. Schöning and K. Ambos-Spies (Eds.), Cambridge University Press, pages 147–190, 1993.
[9] G. Cormode, M. Mitzenmacher, and J. Thaler. Streaming graph computations with a helpful advisor. Algorithmica, 65(2):409–442, 2013. A preliminary version of this paper appeared in ESA, 2010.
[10] G. Cormode, M. Mitzenmacher, and J. Thaler. Practical verified computation with streaming interactive proofs. In ITCS, 2012.
[11] G. Cormode, J. Thaler, and K. Yi. Verifying computations with streaming interactive proofs. PVLDB, 5(1):25–36, 2011.
[12] A. Dasgupta, R. Kumar, and D. Sivakumar. Sparse and lopsided set disjointness via information theory. In [...], volume 7409, pages 517–528, 2012.
[13] A. Das Sarma, R.J. Lipton, and D. Nanongkai. Best-order streaming model. Theor. Comput. Sci., 412(23):2544–2555, 2011.
[14] D. Eppstein and M.T. Goodrich. Straggler Identification in Round-Trip Data Streams via Newton's Identities and Invertible Bloom Filters. IEEE Trans. Knowl. Data Eng. [...].
[15] [...] SIAM J. Algebra. Discr., 5(1):61–68, 1984.
[16] J. Håstad and A. Wigderson. The randomized communication complexity of set disjointness. Theory of Computing, pages 211–219, 2007.
[17] R. J. Lipton. Efficient Checking of Computations. STACS, pages 207–215, 1990.
[18] C. Lund, L. Fortnow, H. Karloff, and N. Nisan. Algebraic methods for interactive proof systems. J. ACM, 39(4):859–868, 1992.
[19] Y. Minsky, A. Trachtenberg, and R. Zippel. Set reconciliation with nearly optimal communication complexity. IEEE Transactions on Information Theory [...].
[20] [...] STOC, pages 113–122, 2008.
[21] T. Gur and R. Raz. Arthur-Merlin Streaming Complexity. Electronic Colloquium on Computational Complexity (ECCC). Available online at http://eccc.hpi-web.de/report/2013/020/, 2013.
[22] H. Klauck. Rectangle Size Bounds and Threshold Covers in Communication Complexity. In CCC, pages 118–134, 2003.
[23] H. Klauck. On Arthur Merlin Games in Communication Complexity. In CCC, pages 189–199, 2011.
[24] H. Klauck and V. Prakash. Streaming Computations With a Loquacious Prover. In ITCS, pages 305–320, 2013.
[25] M. Sağlam and G. Tardos. On the communication complexity of sparse set disjointness. Manuscript, privately communicated, 2012.
[26] A. Shamir. IP = PSPACE. J. ACM, 39(4):869–877, 1992.
[27] M. Yannakakis. Expressing combinatorial optimization problems by linear programs. J. Comput. Syst. Sci., 43(3):441–466, 1991.
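As a supplementary illustration of the injection test used in the proof of Lemma 7.2, the following Python sketch computes the vectors $z$ and $o$ explicitly and returns their inner product over a prime field. It is only a sketch under stated assumptions and not the scheme itself: the verifier of Lemma 7.2 never stores $z$ and $o$, but delegates the inner product through Proposition 4.1. The function name `injection_inner_product`, the zero-based indexing of items, buckets, and bit positions, and the choice of the Mersenne prime $2^{61} - 1$ as the field size are assumptions made for this example.

```python
import random

Q = (1 << 61) - 1  # Mersenne prime; an assumed field size for this sketch


def injection_inner_product(updates, n, r, q=Q, rng=None):
    """Return the inner product z . o over F_q for (item, bucket, delta) updates.

    Items lie in range(n) and buckets in range(r) (zero-based, an assumption).
    The result is always 0 when every bucket is pure. When some bucket is
    impure, the result is non-zero except with small probability over the
    random field elements alpha and beta.
    """
    if rng is None:
        rng = random.Random()
    bits = max(1, (n - 1).bit_length())  # number of bit positions (log n)
    alpha = rng.randrange(q)
    beta = rng.randrange(q)
    z = [[0] * bits for _ in range(r)]
    o = [[0] * bits for _ in range(r)]
    for item, b, delta in updates:
        # Each stream update becomes one update per bit position j.
        for j in range(bits):
            exponent = n * (b * bits + j) + item
            if (item >> j) & 1 == 0:
                z[b][j] = (z[b][j] + delta * pow(alpha, exponent, q)) % q
            else:
                o[b][j] = (o[b][j] + delta * pow(beta, exponent, q)) % q
    return sum(z[b][j] * o[b][j] for b in range(r) for j in range(bits)) % q


if __name__ == "__main__":
    pure = [(5, 0, 3), (5, 0, -1), (9, 1, 2)]        # one item per bucket
    cancelled = [(5, 0, 1), (6, 0, 1), (6, 0, -1)]   # item 6 ends at frequency 0
    impure = [(5, 0, 1), (6, 0, 1)]                  # two items share bucket 0
    for name, stream in [("pure", pure), ("cancelled", cancelled), ("impure", impure)]:
        print(name, injection_inner_product(stream, n=16, r=2))
```

The second stream shows why the test fits the non-strict turnstile model: an item whose insertions and deletions cancel contributes nothing to either fingerprint, so the bucket still counts as pure.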
A An Online AMA Lower Bound for $(m, \sqrt{m})$-Sparse INDEX
We prove that the online $\tilde{O}(\sqrt{m})$ protocol for the $(m, \sqrt{m})$-Sparse INDEX problem is essentially optimal. Our lower bound follows from a natural variant of the reduction in Theorem 3.9. That is, we turn an online AMA protocol for the $(m, \sqrt{m})$-Sparse INDEX problem into an online MAMA protocol for the dense INDEX problem. We then invoke a lower bound on the online MAMA communication complexity of the INDEX problem due to Klauck and Prakash [24].

Theorem A.1. The online AMA protocol complexity of the $(m, \sqrt{m})$-Sparse INDEX problem is $\tilde{\Omega}(\sqrt{m})$.

Proof. Let $n = \sqrt{m}$. Assume we have an online AMA communication protocol $P$ for $(m, n)$-sparse INDEX with $\mathrm{hcost}(P) = \Omega(\sqrt{m})$. We describe how to use this protocol for the sparse INDEX problem to design one for the dense