SALSA: Self-Adjusting Lean Streaming Analytics
Ran Ben Basat
University College London
Gil Einziger
Ben Gurion University
Michael Mitzenmacher
Harvard University
Shay Vargaftik
VMware Research
Abstract—Counters are the fundamental building block of many data sketching schemes, which hash items to a small number of counters and account for collisions to provide good approximations for frequencies and other measures. Most existing methods rely on fixed-size counters, which may be wasteful in terms of space, as counters must be large enough to eliminate any risk of overflow. Instead, some solutions use small, fixed-size counters that may overflow into secondary structures.
This paper takes a different approach. We propose a simple and general method called SALSA for dynamic re-sizing of counters and show its effectiveness. SALSA starts with small counters, and overflowing counters simply merge with their neighbors. SALSA can thereby allow more counters for a given space, expanding them as necessary to represent large numbers. Our evaluation demonstrates that, at the cost of a small overhead for its merging logic, SALSA significantly improves the accuracy of popular schemes (such as Count-Min Sketch and Count Sketch) over a variety of tasks. Our code is released as open source [1].
I. INTRODUCTION
Analysis of large data streams is essential in many domains, including natural language processing [2], load balancing [3], and forensic analysis [4]. Typically, the data volume renders exact analysis algorithms too expensive. However, it is often sufficient to estimate measurements such as per-item frequency [5], item distribution entropy [6], or top-k/heavy hitters [7] by using approximation algorithms, often referred to as sketches. Sketching schemes reduce the space requirements by sharing counters that keep frequency counts of the (potentially multiple) associated items [8], [9]. That is, rather than use a counter for each item, which may be space-prohibitive, sketches bound the effect of collisions to guarantee good approximations.
A common approach for sketch design is to consider counters as the basic building block. Namely, the goal is to optimize the accuracy for a given number of counters (e.g., [7], [10]). However, these works do not discuss how many bits each counter should have, a quantity whose optimal value depends on the workload and optimization metric. For fixed-size counters, if they are too large, space is wasted; conversely, if they are too small, they risk overflowing. Instead, some solutions use small fixed-size counters that may overflow into secondary structures (e.g., [11], [12]).
Our Contributions:
We present Self-Adjusting Lean Streaming Analytics (SALSA), a simple and general framework for dynamic re-sizing of counters. In a nutshell, SALSA starts with small (e.g., 8-bit) counters and merges overflowing ones with their neighbors to represent larger numbers. This way, more counters fit in a given space without limiting the counting range. To do so efficiently, we employ novel methods for representing merges with low memory and computation overheads. These methods also respect byte boundaries, making them readily implementable in software and some hardware platforms.
SALSA integrates with popular sketches and probabilistic counter-compression techniques to improve their precision-to-memory tradeoff. We prove that SALSA stochastically improves the accuracy of standard schemes, including the Count Min Sketch [10], the Conservative Update Sketch [13], and the Count Sketch [14]. Using different workloads, metrics, and tasks, we also show significant accuracy improvements for the above schemes as well as for state-of-the-art solutions like UnivMon [15], Cold Filter [9], and AEE [16]. We also compare against Pyramid Sketch [12] and ABC [17], recent variable-counter-size solutions, and show that SALSA is more accurate than both. Finally, we release our code as open source [1].
II. RELATED WORK
The term sketch here informally describes an algorithm that uses shared counters, such that each item is associated with a subset of the counters via hash functions [10], [13]–[15]. Sketches offer tradeoffs between update speed, accuracy, and space, where each of these parameters is important in some scenarios. For example, in software-based network measurement, we are primarily concerned about update speed [18]. Conversely, in hardware-based measurements, space is often the bottleneck [19], [20].
Some sketches optimize the update speed at the expense of space. For example, Randomized Counter Sharing [21] uses multiple hash functions but only updates a random one. NitroSketch [18] extends this idea and only performs updates for sampled packets, using a novel sampling technique that asymptotically improves over uniform sampling. Other solutions aim to maximize the accuracy for a given space allocation. For example, Counter Braids [22] and Counter Tree [23] aim to fit into the fast static RAM (SRAM) while optimizing the precision. These solutions estimate element sizes using complex offline procedures that, while being highly accurate, may be too slow for online applications.
Most relevant to our setting are ABC [17] and Pyramid Sketch [12], which vary the size of counters on the fly. In ABC, an overflowing counter is allowed to "borrow" bits from the next counter. If there are not enough bits to represent both values, the counters "combine" to create a larger counter. However, the encoding of ABC is cumbersome. It requires three bits to mark combined counters, reducing the counting range (e.g., when starting with 8-bit counters, a combined counter can count only up to $2^{13}-1$), and it slows the sketch down significantly (see Section VI). Moreover, it does not allow counters to combine more than once. Pyramid Sketch [12] has several layers for extending overflowing counters. An overflowing counter increases a counter at the next layer.
Each pair of same-layer counters is associated with a single counter at the next layer. If both overflow, they share their most significant bits in that counter while keeping the least significant bits separate. Critically, the counters of all layers are pre-allocated regardless of the access patterns. This results in inferior memory utilization, since many of the upper layers' counters may never be used. Further, when reading a counter, Pyramid may make multiple non-sequential memory accesses, thus slowing the processing down. SALSA improves over these solutions due to its efficient encoding and the fact that its counting range is not limited by the initial configuration (e.g., counter size).
An orthogonal line of work reduces the size of counters by using probabilistic estimators that only increment their value with a certain probability on an update [16], [24]–[26]. Such an approach saves space, as estimators can represent large numbers with fewer bits, at the cost of a higher error.
III. PRELIMINARIES
We consider a data stream $S$ consisting of updates of the form $\langle x, v\rangle$, where $x \in U$ is an element (or item) and $v \in \mathbb{Z}$ is a value. Here, $U \triangleq \{1, \ldots, u\}$ is the universe and $u$ is the universe size. For $x \in U$, $f_x \triangleq \sum_{\langle x,v\rangle \in S} v$ denotes the frequency of $x$. Additionally, $f \triangleq \langle f_1, \ldots, f_u\rangle$ is the frequency vector of $S$. We denote by $N \triangleq \sum_{x \in U} |f_x|$ the volume of the stream. The above is called the Turnstile model. Other models include the Strict Turnstile model, where frequencies are non-negative at all times, and the Cash Register model, where updates are strictly positive.
The $p$'th moment of the frequency vector is defined as $F_p \triangleq \sum_{x \in U} |f_x|^p$ (e.g., $F_1 = N$) and the $p$'th norm (defined for $p \ge 1$) is $L_p \triangleq \sqrt[p]{F_p}$. We say that an algorithm estimates frequencies with an $(\epsilon, \delta)$ $L_p$ guarantee if for any element $x \in U$ it produces an estimate $\widehat{f}_x$ that satisfies $\Pr\big[\,|\widehat{f}_x - f_x| \le \epsilon L_p\big] \ge 1 - \delta$. Throughout the paper, we assume the standard RAM model and that each counter value fits into $O(1)$ machine words.
We survey several popular sketches that SALSA extends.
Count Min Sketch (CMS) [10]: CMS is arguably the simplest and most popular sketch. It provides an $L_1$ guarantee in the Strict Turnstile model. The sketch consists of a $d \times w$ matrix $C$ of counters and $d$ random hash functions $h_1, \ldots, h_d : U \to [w]$ that map elements into counters. Each element $x$ is associated with one counter in each row: $C[1, h_1(x)], \ldots, C[d, h_d(x)]$. When processing the update $\langle x, v\rangle$, CMS adds $v$ to all of $x$'s counters. Since CMS operates in the Strict Turnstile model where all frequencies are non-negative, each of $x$'s counters provides an over-estimation of its true frequency (i.e., $\forall i \in [d] : C[i, h_i(x)] \ge f_x$). Therefore, CMS uses the minimum of $x$'s counters to estimate $f_x$. That is, $\widehat{f}_x \triangleq \min_{i \in [d]} C[i, h_i(x)]$.
For its analysis, denote by $E_i \triangleq C[i, h_i(x)] - f_x \ge 0$ the estimation error of the $i$'th counter of $x$. Notice that $\mathbb{E}[E_i] = \frac{N - f_x}{w} \le \frac{N}{w}$, and according to Markov's inequality we have that
$$\forall c > 1,\ i \in [d] : \Pr[E_i \ge N \cdot c/w] \le 1/c. \quad (1)$$
We note that CMS, like all the algorithms below, provides a curve of guarantees, in that setting $\delta$ determines the $\epsilon$ value for which we have an $(\epsilon, \delta)$ guarantee with the $d \times w$ configuration. Setting $\epsilon = \delta^{-1/d}/w$ and $c = \delta^{-1/d}$, Equation (1) gives that $\Pr[E_i \ge N\epsilon] \le \delta^{1/d}$, and as the $d$ rows are independent we get that $\Pr[\forall i : E_i \ge N\epsilon] = (\Pr[E_i \ge N\epsilon])^d \le \delta$. For fixed $(\epsilon, \delta)$ values, setting $w = e/\epsilon$ and $d = \ln \delta^{-1}$ minimizes the space required by the sketch, but CMS is often configured with a smaller number of rows $d$, since its update and query time are $O(d)$.
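To make the CMS mechanics concrete, the following is a minimal, illustrative Python sketch of the update/query logic just described. It is not the paper's implementation: the hashing below uses Python's built-in `hash` seeded per row for brevity, which is not the pairwise-independent family the analysis assumes.

```python
import random

class CountMinSketch:
    """Minimal Count-Min Sketch: a d x w counter matrix. In the Strict
    Turnstile model each of x's counters over-estimates f_x, so the
    query returns the minimum over x's d counters."""

    def __init__(self, w, d, seed=42):
        self.w, self.d = w, d
        rng = random.Random(seed)
        self.row_seeds = [rng.randrange(1 << 30) for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, i, x):
        # Stand-in for the pairwise-independent h_i : U -> [w].
        return hash((self.row_seeds[i], x)) % self.w

    def update(self, x, v=1):
        for i in range(self.d):
            self.C[i][self._h(i, x)] += v

    def query(self, x):
        return min(self.C[i][self._h(i, x)] for i in range(self.d))

cms = CountMinSketch(w=256, d=4)
for _ in range(100):
    cms.update("heavy")
cms.update("light")
assert cms.query("heavy") >= 100  # never under-estimates
```

Note that the estimate is only guaranteed to be an over-estimate; collisions determine how tight it is.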
Conservative Update Sketch (CUS) [13]: CUS improves the accuracy of CMS but is restricted to the Cash Register model. Intuitively, when all the update values are positive, we may not need to increase all the counters of the current element. For example, assume that $C[1, h_1(x)] = 7$ and $C[2, h_2(x)] = 4$, and the update $\langle x, 1\rangle$ arrives. In such a scenario, we know that $f_x \le 4$ before the update, and thus should not increase $C[1, h_1(x)]$. In general, given an update $\langle x, v\rangle$, CUS sets each counter $C[i, h_i(x)]$ to $\max\big\{C[i, h_i(x)],\ v + \widehat{f}_x\big\}$, where $\widehat{f}_x = \min_{i \in [d]} C[i, h_i(x)]$ is the estimate for $x$ before the update. While CUS improves the accuracy of CMS, its updates are slower due to the need to compute $\widehat{f}_x$ before increasing the counters. Since an estimate of CUS is always bounded by CMS's estimate from above (and by $f_x$ from below), the analysis of CMS holds for CUS as well. We refer the reader to [27] for a refined analysis.
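The conservative update rule can be sketched analogously (again with illustrative, not pairwise-independent, hashing; only positive updates are allowed, per the Cash Register model):

```python
import random

class ConservativeUpdateSketch:
    """Conservative Update: on <x, v>, each of x's counters is raised
    only up to v plus x's estimate *before* the update, never decreased."""

    def __init__(self, w, d, seed=7):
        self.w, self.d = w, d
        rng = random.Random(seed)
        self.row_seeds = [rng.randrange(1 << 30) for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, i, x):
        return hash((self.row_seeds[i], x)) % self.w

    def query(self, x):
        return min(self.C[i][self._h(i, x)] for i in range(self.d))

    def update(self, x, v=1):
        assert v > 0, "Cash Register model: strictly positive updates"
        target = v + self.query(x)  # estimate before the update
        for i in range(self.d):
            j = self._h(i, x)
            self.C[i][j] = max(self.C[i][j], target)

cus = ConservativeUpdateSketch(w=256, d=4)
for _ in range(10):
    cus.update("x")
assert cus.query("x") >= 10  # still an over-estimate, typically tighter than CMS
```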
Count Sketch (CS) [14]: CS works in the more general Turnstile model and provides the stronger $L_2$ guarantee. As with CMS and CUS, each element $x$ is associated with a set of counters $\{C[i, h_i(x)]\}_{i \in [d]}$. However, the update process is slightly different. Each row $i \in [d]$ in CS has another pairwise independent hash function $g_i : U \to \{+1, -1\}$ that associates each element with a sign. When processing an update $\langle x, v\rangle$, CS increases each counter $C[i, h_i(x)]$ by $v \cdot g_i(x)$. Intuitively, this "unbiases" the noise from all other elements, as they increase or decrease the counters with equal probabilities. As a result, each counter now gives an unbiased estimate, and therefore CS estimates the size as $\widehat{f}_x \triangleq \mathrm{median}\{C[i, h_i(x)] \cdot g_i(x)\}_{i \in [d]}$.
Assuming without loss of generality that $g_i(x) = 1$, the standard CS analysis bounds the error of the $i$'th row, $E_i \triangleq C[i, h_i(x)] - f_x$, by showing that $\mathrm{Var}[E_i] \le F_2/w$. Therefore, using Chebyshev's inequality we get that $\Pr\big[|E_i| \ge cL_2/\sqrt{w}\big] \le \Pr\big[|E_i| \ge c\sqrt{\mathrm{Var}[E_i]}\big] \le 1/c^2$. By setting $w = \Theta(\epsilon^{-2})$, we can get $\Pr[|E_i| \ge L_2 \cdot \epsilon] \le 1/2 - \Omega(1)$, and then use a Chernoff bound to show that $d = O(\log \delta^{-1})$ rows are enough for an $(\epsilon, \delta)$ guarantee.
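A minimal Python rendering of the Count Sketch update and median-based query may help; as in the earlier snippets, the seeded use of Python's `hash` merely stands in for the pairwise-independent families $h_i$ and $g_i$:

```python
import random
import statistics

class CountSketch:
    """Count Sketch: row i adds v * g_i(x) at position h_i(x); the
    estimate is the median over rows of C[i][h_i(x)] * g_i(x)."""

    def __init__(self, w, d, seed=3):
        self.w, self.d = w, d
        rng = random.Random(seed)
        self.row_seeds = [rng.randrange(1 << 30) for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, i, x):
        return hash((self.row_seeds[i], x, "bucket")) % self.w

    def _g(self, i, x):
        # Stand-in for the sign hash g_i : U -> {+1, -1}.
        return 1 if hash((self.row_seeds[i], x, "sign")) & 1 else -1

    def update(self, x, v=1):
        for i in range(self.d):
            self.C[i][self._h(i, x)] += v * self._g(i, x)

    def query(self, x):
        return statistics.median(
            self.C[i][self._h(i, x)] * self._g(i, x) for i in range(self.d))

cs = CountSketch(w=512, d=5)
for _ in range(1000):
    cs.update("heavy")
assert cs.query("heavy") == 1000  # exact here: no other items, hence no noise
```

Unlike CMS/CUS, negative updates are allowed, so the sketch also works in the full Turnstile model.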
Universal Sketch (UnivMon) [15], [28]: UnivMon summarizes the data once and supports many functions of the frequency vector (e.g., its entropy or number of non-zero entries) in the Cash Register model. Importantly, when using UnivMon, we provide a function $G : \mathbb{Z} \to \mathbb{R}$ as an input and estimate the G-sum, given by $\sum_{x \in U} G(f_x)$. Not all functions of the frequency vector can be computed in poly-log space in a one-pass streaming setting (those that can form a class called Stream-PolyLog). The surprising result of [28] is that any function $G$ in Stream-PolyLog is supported by UnivMon. UnivMon leverages $O(\log u)$ sketches with an $L_2$ guarantee (e.g., Count Sketch), which are applied to different subsets of the universe. We refer the reader to [15], [28] for details.
Cold Filter [9]: Cold Filter is a recent framework for fast and accurate stream processing. It consists of two stages, where the first stage is designed to filter cold items and the second measures heavy hitters accurately. To accelerate the computation, it uses an aggregation buffer and employs SIMD parallelism.
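As a rough illustration of the two-stage idea only (this is not Cold Filter's actual data structure; the saturation cap and the exact dictionary used for the second stage below are our own simplifications):

```python
class TwoStageFilter:
    """Schematic two-stage design in the spirit of Cold Filter: a small
    first stage absorbs cold items; once an item's first-stage counter
    saturates at `cap`, further updates are routed to a second, more
    accurate stage (a plain dict here, purely for illustration)."""

    def __init__(self, w, cap):
        self.w, self.cap = w, cap
        self.stage1 = [0] * w     # one row of small shared counters
        self.stage2 = {}          # accurate stage for hot items

    def update(self, x, v=1):
        j = hash(x) % self.w
        if self.stage1[j] < self.cap:          # item still looks cold
            self.stage1[j] = min(self.cap, self.stage1[j] + v)
        else:                                  # hot: bypass the filter
            self.stage2[x] = self.stage2.get(x, 0) + v

    def query(self, x):
        return self.stage1[hash(x) % self.w] + self.stage2.get(x, 0)

f = TwoStageFilter(w=1024, cap=16)
for _ in range(100):
    f.update("hot")
assert f.query("hot") == 100  # 16 absorbed by stage 1, 84 in stage 2
```

The benefit is that cold items never touch (or pollute) the accurate second stage.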
Finding Heavy Hitters:
Often, we care about finding the most significant elements in a data stream, which has applications in load balancing [3], accounting, and security. That is, in addition to estimating the frequency of elements, we wish to track the most frequent elements without needing to query each $x \in U$. For $p \ge 1$, the $L_p$-heavy hitter problem asks to return all elements with frequency larger than $\theta L_p$ and no element smaller than $(\theta - \epsilon)L_p$, where $\theta \in [0, 1]$ is given at query time. In the Cash Register model, we can store a min-heap with the $1/\epsilon$ elements with the highest estimates. Whenever an update arrives, we query the item and update the heap if necessary. As a result, we can find the $L_1$ heavy hitters using CMS and CUS, or the $L_2$ heavy hitters using CS.
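The candidate-tracking scheme described above can be sketched as follows. For brevity we keep the candidates in a dict and evict the minimum directly, and exact counts stand in for the sketch query (a real implementation would query CMS/CUS/CS and use a heap):

```python
import collections

class TopKTracker:
    """Track the k items with the highest estimates (k ~ 1/epsilon).
    Exact counts stand in for a sketch query; eviction removes the
    candidate with the smallest estimate."""

    def __init__(self, k):
        self.k = k
        self.counts = collections.Counter()  # stand-in for a sketch
        self.cand = {}                       # candidate -> last estimate

    def update(self, x, v=1):
        self.counts[x] += v
        self.cand[x] = self.counts[x]        # query after the update
        if len(self.cand) > self.k:
            victim = min(self.cand, key=self.cand.get)
            del self.cand[victim]

    def heavy_hitters(self):
        return set(self.cand)

tracker = TopKTracker(k=3)
for _ in range(50):
    for item in ("x", "y", "z"):
        tracker.update(item)
for i in range(10):
    tracker.update(f"rare{i}")
assert tracker.heavy_hitters() == {"x", "y", "z"}
```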
Counting Distinct Items: Estimating the number of distinct items in a data stream (defined as $F_0 \equiv \|f\|_0$) is a fundamental primitive for applications such as discovering denial-of-service attacks [29]. While UnivMon can natively support such a function, we can also estimate it from CMS and CUS. By observing the fraction $p$ of zero-valued counters in a sketch's row, we can estimate the number of distinct elements (as additional occurrences of the same element would not change this quantity). Specifically, a common approach (e.g., [30]) is to use the Linear Counting algorithm [31], which estimates the distinct count as $\frac{\log p}{\log(1 - 1/w)} \approx -w \log p$. Such an estimate has a standard error of $\sqrt{w \cdot \left(e^{F_0/w} - F_0/w - 1\right)}\big/F_0$ [31], which improves as $w$ grows.
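For instance, a Linear Counting estimate from a single row looks as follows (the simulation uses random bucket placement as a stand-in for hashing distinct items into the row):

```python
import math
import random

def linear_counting(row):
    """Estimate the number of distinct items that hit this row from the
    fraction p of zero counters: log(p) / log(1 - 1/w) ~= -w * log(p)."""
    w = len(row)
    zeros = sum(1 for c in row if c == 0)
    if zeros == 0:
        raise ValueError("row is saturated; Linear Counting needs zero counters")
    p = zeros / w
    return math.log(p) / math.log(1.0 - 1.0 / w)

# Simulate 100 distinct items hashed into a row of w = 1000 counters.
rng = random.Random(0)
row = [0] * 1000
for item in range(100):
    row[rng.randrange(1000)] += 1
estimate = linear_counting(row)
assert 60 <= estimate <= 140  # concentrates around the true value, 100
```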
IV. TECHNIQUES
The description of the above sketches does not address the fundamental question of sizing the counters. A common practice is to assume some upper bound on the maximal frequency (e.g., $N$) and allocate each counter with $n = O(\log N)$ bits. For performance, this upper bound is often rounded up to be a multiple of the word size. For example, practitioners often allocate 32-bit counters when estimating the unit-count of elements, and 64-bit counters for measuring their weighted frequency (e.g., [32], [33]). When space is tight, estimators are sometimes integrated into sketches to allow a smaller (e.g., 8-bit) per-counter overhead at the cost of additional error [16]. However, these solutions miss the potential of allowing counters' bit sizes to vary and adjust dynamically. Intuitively, the largest counter value is often considerably larger than the average value, especially in highly skewed workloads, where many counter values remain small as most of the volume belongs to a small set of heavy hitters.
Alternatively, one can use address-calculation coding (e.g., see [34], [35]) to encode a variable-length counter array in near-optimal space (compared to the information-theoretic lower bound). Such schemes require an upper bound $N_{\max}$ on the volume, and use $w \log_2(1 + N_{\max}/w) + O(w)$ bits. However, the update time of such an encoding is $\Omega(\log N_{\max})$, which may be prohibitive for high-performance applications. To the best of our knowledge, no implementation that combines such encoding with sketches has been proposed. In comparison, SALSA allows for dynamic counter sizing by merging overflowing counters with their neighbors, and optimizes for performance by respecting word alignments. A simple SALSA encoding requires one bit per counter, and an optimized encoding requires less than 0.6 bits per counter while still allowing for constant-time read and update operations.
Importantly, SALSA resolves overflows without dynamic memory allocations (e.g., [36]), without relying on additional data structures (as in [20]), and without requiring global rescaling operations for all the counters (e.g., [16]).
The SALSA encoding: SALSA starts with all counters having $s$ bits (e.g., $s = 8$), where $2^s$ may be significantly smaller than the intended counting range (e.g., $N = 2^{32}$). Here, we describe an encoding that requires one bit of overhead per counter (e.g., 12.5% for $s = 8$ bit counters); we later explain how to reduce it to less than 0.6 bits (7.5% for $s = 8$). Each counter $i$ is associated with a merge bit $m_i$. Once a counter needs to represent a value of $2^s$, we say that the counter overflows. In principle, an overflowing counter can merge with its left-neighbor or right-neighbor. In SALSA, we select the merge direction to maximize byte and word alignment, which improves performance. We also make counters grow in powers of two (e.g., from $s$ bits to $2s$, then to $4s$, etc.). In Section IV, we explore a slower but more fine-grained approach. Specifically, when an $s$-bit counter $i$ overflows, it merges with counter $i + (1 - 2 \cdot (i \bmod 2))$. For example, if counter 6 overflows, it merges with 7, while if counter 7 overflows, it merges with 6. More generally, when an $s \cdot 2^\ell$-bit counter with indices $\langle i \cdot 2^\ell, i \cdot 2^\ell + 1, \ldots, (i+1) \cdot 2^\ell - 1\rangle$ overflows, it merges with the counter-set at indices $\langle j \cdot 2^\ell, j \cdot 2^\ell + 1, \ldots, (j+1) \cdot 2^\ell - 1\rangle$, for $j = i + (1 - 2 \cdot (i \bmod 2))$. As an example, if we started from $s = 8$ bit counters and counter 6 overflows, it right-merges with 7 to create a 16-bit counter with indices $\langle 6, 7\rangle$. If this counter overflows, it left-merges into a 32-bit counter with indices $\langle 4, 5, 6, 7\rangle$, and if this overflows, it left-merges into a 64-bit counter with indices $\langle 0, \ldots, 7\rangle$.
To encode that $\langle i \cdot 2^\ell, i \cdot 2^\ell + 1, \ldots, (i+1) \cdot 2^\ell - 1\rangle$ are merged into a single $s \cdot 2^\ell$-bit counter, SALSA sets $m_{i \cdot 2^\ell + 2^{\ell-1} - 1} = 1$. For example, to encode that $\langle 6, 7\rangle$ are merged, we have $(i = 3, \ell = 1)$ and thus set $m_{3 \cdot 2 + 2^0 - 1} = m_6 = 1$; when $\langle 4, 5, 6, 7\rangle$ are merged, we have $(i = 1, \ell = 2)$ and thus we set $m_{1 \cdot 4 + 2^1 - 1} = m_5 = 1$; and when $\langle 0, \ldots, 7\rangle$ are merged, we have $(i = 0, \ell = 3)$ and thus we set $m_{0 \cdot 8 + 2^2 - 1} = m_3 = 1$. We can compute the counter size by testing the relevant merge bits. We demonstrate this encoding in Figure 1. All the computations involved in determining the counter size and offset can be efficiently implemented using bit operations, especially if $s$ is a power of two.
Fig. 1: SALSA encoding for an array with a basic counter size of s = 8 bits (the figure shows the Indices, Values, and Merges rows); notice that large counters consume more indices than small counters due to merge operations.
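Under the marker rule above (the level-$\ell$ block starting at index $i \cdot 2^\ell$ is marked by bit $i \cdot 2^\ell + 2^{\ell-1} - 1$), the size and start of the merged counter containing a given base index can be decoded by testing one marker bit per level. The following is our own illustrative Python rendering of that decode, not the paper's bit-twiddled implementation:

```python
def counter_extent(m, j):
    """Return (start, length) in base-counter units of the merged SALSA
    counter containing base index j, given the merge-bit array m.
    A level-l block <i*2^l, ..., (i+1)*2^l - 1> is merged iff
    m[i*2^l + 2**(l-1) - 1] is set; we test levels from the largest
    down, since markers of blocks absorbed by a larger merge may be
    unset."""
    n = len(m)
    for level in range(n.bit_length() - 1, 0, -1):
        start = (j >> level) << level          # enclosing block at this level
        marker = start + (1 << (level - 1)) - 1
        if marker < n and m[marker]:
            return start, 1 << level
    return j, 1                                # unmerged s-bit counter

# State from the running example: <6,7> merged (m6), then <4..7> (m5),
# then <0,...,7> (m3).
m = [0, 0, 0, 1, 0, 1, 1, 0]
assert counter_extent(m, 2) == (0, 8)

m = [0, 0, 0, 0, 0, 0, 1, 0]   # only <6,7> merged
assert counter_extent(m, 6) == (6, 2)
assert counter_extent(m, 0) == (0, 1)
```

A convenient property of this encoding is that each bit position can mark only one level (position $p$ marks level $\ell$ exactly when $p \equiv 2^{\ell-1} - 1 \pmod{2^\ell}$), so the per-level tests never conflict.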
Reducing the Encoding Overhead: The encoding we used in SALSA so far is efficient as well as amenable to simple implementation. The cost of this encoding is a single merge bit per counter. This is, in fact, within a factor of 2 of the optimal encoding, as we show in Appendix A. That is, we prove that any encoding for SALSA must use at least roughly 0.5 overhead bits per counter, and show a somewhat more complex $O(1)$-time encoding with at most 0.6 overhead bits per counter. For a given memory allocation, this encoding provides improved accuracy, as the lower overhead allows fitting more counters, but may be somewhat slower.
Fine-grained Counter Merges: The SALSA encoding we presented in Section IV doubles the counter size upon an overflow, which may be wasteful when the overflowing counter could benefit from a smaller increase in size. Thus, we suggest the more refined Tango algorithms to explore the benefits of a more fine-grained merging strategy. In Tango, counters can be merged into sizes that are arbitrary multiples of $s$. For example, if we start from $s = 8$ bit counters, Tango can merge a 16-bit counter into a 24-bit counter, while SALSA would merge from 16 bits to 32. The encoding of Tango is simple: each counter $j$ is associated with a merge bit $m_j$ that denotes whether the counter is merged with its right-neighbor. To compute the counter size and offset in Tango of $j = h(x)$, we scan the number of set bits to the left and right of $m_j$ until we hit a zero at both sides. For example, if $j = 5$ and $m_2 = m_3 = m_4 = m_5 = 1$ while $m_1 = m_6 = 0$, then the counter consists of $s \cdot 5$ bits, spanning $\langle 2, 3, 4, 5, 6\rangle$.
In general, one can use complex logic to decide whether to merge with the left or right neighbor once a counter overflows. However, we design Tango to evaluate the potential benefits of fine-grained merging and therefore enforce a merging logic that mimics SALSA. Specifically, Tango always tries to be aligned to the smallest possible power of two. For example, if counter 2 overflows, it merges with 3 to be aligned with the 2-block $\langle 2, 3\rangle$. If it overflows again, it merges with 1 (creating an $s \cdot 3$ bits sized counter) and then with 0. If more bits are needed, it will merge with 4, then with 5, 6, and 7 (being aligned to the 8-block $\langle 0, \ldots, 7\rangle$). Then it merges with 8, 9, etc. Notice that at every point in time, the Tango counters are contained in the corresponding SALSA counters. In particular, this allows us to produce an estimate that is at least as accurate as SALSA's. We note that Tango poses a tradeoff: while it allows more accurate sketches (e.g., a counter may never exceed $2^{16} - 1$, and thus it could be wasteful to merge it into 32 bits), it also has slower decoding time and cannot use the efficient encoding of the previous section.
Fig. 2: Sum and Max merge in SALSA CMS with s = 8. (a) Sum merging of counters. (b) Max merging of counters.
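Tango's decode is a bidirectional scan over the merge bits, which can be sketched as follows (an illustrative rendering, not the paper's implementation):

```python
def tango_extent(m, j):
    """Return (start, length) in base-counter units of the Tango counter
    containing base index j. Merge bit m[i] set means base counter i is
    merged with its right neighbor, so the counter spans the maximal run
    of set merge bits around j."""
    left = j
    while left > 0 and m[left - 1]:
        left -= 1
    right = j
    while right < len(m) - 1 and m[right]:
        right += 1
    return left, right - left + 1

# The example above: j = 5 with m2 = m3 = m4 = m5 = 1 and m1 = m6 = 0
# yields a counter of 5 base slots spanning <2, 3, 4, 5, 6>.
m = [0, 0, 1, 1, 1, 1, 0, 0]
assert tango_extent(m, 5) == (2, 5)
assert tango_extent([0] * 8, 3) == (3, 1)  # unmerged counter
```

The linear scan is what makes Tango's decode slower than SALSA's constant number of marker-bit tests.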
V. SALSA-FYING SKETCHES
We now describe how SALSA integrates with existing sketches, and specifically how to set the value of merged counters in each sketch. We also state and prove accuracy guarantees for the resulting SALSA sketches. We employ hash functions $h_i : U \to [w]$ similarly to the original sketches. Given a merged counter with indices $\langle L, L+1, \ldots, R\rangle$, we consider all elements $x$ with $L \le h_i(x) \le R$ to be mapped into it. Hereafter, we often refer to the underlying sketch, defined as follows: if the largest merged counter size is $s \cdot 2^\ell$, the underlying sketch is a vanilla (fixed counter size) sketch where each counter is of size $s \cdot 2^\ell$ and its hashes are $\big\{\tilde{h}_i(x) \triangleq \lfloor h_i(x)/2^\ell\rfloor \mid i \in [d]\big\}$.
Count Min Sketch (CMS): SALSA CMS and Tango CMS are identical to CMS as long as no counter overflows. We have already defined the merge operation with regard to encoding (Section IV), and with regard to hash mapping in the previous section. However, we still need to define how we determine the value of a merged counter, which provides a degree of freedom we leverage to increase measurement accuracy according to the specific model requirements. A natural merging operation is to sum the merged counter values (illustrated in Figure 2a). We formalize the correctness of this approach for the Strict Turnstile model via the following theorem.
Theorem V.1.
Assume that SALSA and Tango use sum merge to unify counters. Let $2^\ell \cdot s$ be the maximal bit-size of any counter in SALSA CMS, and $\forall i \in [d]$ let $\tilde{h}_i(x) = \lfloor h_i(x)/2^\ell\rfloor$ be hash functions that map items into a standard CMS with $(2^\ell \cdot s)$-sized counters. Then for any $x \in U$: $f_x \le \widehat{f}^{\,Tango}_x \le \widehat{f}^{\,SALSA}_x \le \widehat{f}^{\,CMS}_x$, where $\widehat{f}^{\,Tango}_x$, $\widehat{f}^{\,SALSA}_x$, and $\widehat{f}^{\,CMS}_x$ are the estimates of Tango, SALSA, and the underlying CMS (with functions $\tilde{h}_i(x)$).
Proof. The sum merge maintains an invariant whereby the value of each merged counter is the total frequency of all elements mapped to it. In the worst case, a merge results in a counter of size equal to that of the corresponding counter in the underlying CMS; in this case, the values of the counters are identical. Otherwise, the value of a Tango counter is upper bounded by the corresponding SALSA counter which, in turn, is upper bounded by the corresponding value in the underlying CMS.
For Cash Register streams (with only positive updates), rather than sum the counters when merging, we can take the maximum value of the merged counters to gain more accuracy (exemplified in Figure 2b) while maintaining guarantees, as formalized in the following theorem.
Theorem V.2.
Assume that SALSA and Tango use max merge to unify counters. Let $2^\ell \cdot s$ be the maximal bit-size of any counter in SALSA CMS, and $\forall i \in [d]$ let $\tilde{h}_i(x) = \lfloor h_i(x)/2^\ell\rfloor$ be hash functions that map items into a standard CMS with $(2^\ell \cdot s)$-sized counters. Then for any $x \in U$: $f_x \le \widehat{f}^{\,Tango}_x \le \widehat{f}^{\,SALSA}_x \le \widehat{f}^{\,CMS}_x$, where $\widehat{f}^{\,Tango}_x$, $\widehat{f}^{\,SALSA}_x$, and $\widehat{f}^{\,CMS}_x$ are the estimates of Tango, SALSA, and the underlying CMS (with functions $\tilde{h}_i(x)$).
Proof. After each merge, the counter value upper bounds the frequency of any element mapped to the hash range of the merged counter. In addition, the values of SALSA and Tango counters when using the max merge are upper bounded by the corresponding values of SALSA and Tango counters when using the sum merge.
Theorems V.1 and V.2 show that SALSA CMS and Tango CMS are at least as accurate as the underlying CMS for both merge operations. Intuitively, by sum-merging every pair of consecutive $n$-bit counters, we obtain estimates that are identical to those of a CMS sketch that uses $2n$-bit counters. Therefore, sum-merging SALSA's estimates are upper bounded by the CMS estimates. In Cash Register streams, max-merging estimates are upper bounded by the sum-merging ones. Finally, for any given element, the estimates of SALSA CMS and Tango CMS are lower bounded by its true frequency, which implies that our approach provides the same error guarantee as the underlying sketch.
SALSA CMS also improves the performance of count-distinct queries for Linear Counting [37] using CMS. Recall that Linear Counting estimates the number of distinct items using the fraction $p$ of zero counters. We consider running Linear Counting using SALSA CMS starting with $s = 8$ bit counters, compared to a standard CMS implementation using 32-bit counters. Unlike standard CMS, SALSA may be unable to determine the exact number of ($s$-bit) counters that remain zero, as some are merged into other counters. Instead, we compute the fraction $f$ of $s$-bit counters that remained zero out of the overall number of counters that did not merge. For every counter that is the result of one or more merges, we know that at least one of its sub-counters is not zero; we optimistically assume that a fraction $f$ of its remaining sub-counters are zero. So, for example, our estimate of the number of zero counters is the number of $s$-bit counters that remained zero, plus $f$ times the number of $2s$-bit counters, plus $3f$ times the number of $4s$-bit counters, and so on if there are larger counters. Note that this approach is heuristic, and its accuracy guarantees are left as future work.
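The heuristic above amounts to the following computation (our own formulation; merged counter sizes are given in base-counter slots):

```python
def estimated_zero_base_counters(unmerged_zeros, unmerged_total, merged_slot_sizes):
    """Heuristic zero-counter count for Linear Counting over SALSA CMS:
    f is the zero fraction among *unmerged* base counters; each merged
    counter of b base slots has at least one nonzero sub-counter, and we
    optimistically count f * (b - 1) of its remaining slots as zero."""
    f = unmerged_zeros / unmerged_total
    estimate = float(unmerged_zeros)
    for b in merged_slot_sizes:
        estimate += f * (b - 1)
    return estimate

# 100 unmerged 8-bit counters of which 50 are zero (f = 0.5), plus one
# 16-bit (2-slot) and one 32-bit (4-slot) merged counter:
assert estimated_zero_base_counters(50, 100, [2, 4]) == 50 + 0.5 * 1 + 0.5 * 3
```

The resulting count can then be fed into the Linear Counting formula in place of the exact number of zero counters.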
Conservative Update Sketch (CUS): SALSA CUS is similar to the standard CUS: whenever an update $\langle x, v\rangle$ arrives, each counter $C[i, h_i(x)]$ is set to $\max\big\{C[i, h_i(x)],\ v + \widehat{f}_x\big\}$, with $\widehat{f}_x = \min_{i \in [d]} C[i, h_i(x)]$ being the previous frequency estimate for $x$. Unlike the CMS variant, the correctness of SALSA CUS is not immediate, as not all counters are increased for each packet. Theorem V.3 shows that SALSA CUS is correct in the Cash Register model when working with the max-merge method.
Theorem V.3.
Let $2^\ell \cdot s$ be the maximal bit-size of any counter in max-merge SALSA CUS, and $\forall i \in [d]$ let $\tilde{h}_i(x) = \lfloor h_i(x)/2^\ell\rfloor$ be hash functions that map items into a standard CUS with $(2^\ell \cdot s)$-sized counters. Then for any $x \in U$: $f_x \le \widehat{f}^{\,SALSA}_x \le \widehat{f}^{\,CUS}_x$, where $\widehat{f}^{\,SALSA}_x$ and $\widehat{f}^{\,CUS}_x$ are the estimates of SALSA and the underlying CUS (with functions $\tilde{h}_i(x)$).
Proof. It is sufficient to consider only updates with $v = 1$, since each $\langle x, v\rangle$ update is identical to $v$ consecutive $\langle x, 1\rangle$ updates. The proof is by induction on the number of updates. Specifically, we show that after each update it holds that
$$\forall x,\ i \in [d] : C^{SALSA}[i, h_i(x)] \le C^{CUS}[i, \tilde{h}_i(x)], \quad (2)$$
where we denote by $C^{SALSA}$ and $C^{CUS}$ the counters of SALSA and the underlying CUS, respectively. As a base case, initially $C^{SALSA}[i, h_i(x)] = C^{CUS}[i, \tilde{h}_i(x)] = 0$ for all $i \in [d]$. We show that if Equation (2) holds, it continues to hold after an additional update.
Case 1: $C^{SALSA}[i, h_i(x)] = C^{CUS}[i, \tilde{h}_i(x)]$. In this case, on update $\langle x, 1\rangle$, $C^{CUS}[i, \tilde{h}_i(x)]$ is increased by CUS. Therefore, the claim trivially holds if there is no overflow in SALSA. If there is an overflow, the claim holds by virtue of the max-merge: the value of the merged counter grows by exactly 1. This also means that the inequality holds for all counters involved in this merge, since they are all upper bounded by their corresponding underlying CUS counters prior to the update.
Case 2: $C^{SALSA}[i, h_i(x)] < C^{CUS}[i, \tilde{h}_i(x)]$. In this case, on update $\langle x, 1\rangle$, the claim trivially holds if there is no overflow in SALSA. If there is an overflow, by virtue of the max-merge, the value of the merged counter still only grows by 1. This also means that the inequality holds for all counters involved in this merge, since they are all upper bounded by their corresponding underlying CUS counters prior to the update, and therefore remain upper bounded after it.
Count Sketch (CS): SALSA can also extend the CS, with a minor modification. Unlike most existing implementations, which use the standard two's complement encoding, SALSA CS uses a sign-magnitude representation of counters (as counters can be negative), with the most significant bit for the sign and the rest as magnitude. While two's complement represents values in the range $\{-2^{s-1}, \ldots, 2^{s-1} - 1\}$, sign-magnitude does not allow a representation of $-2^{s-1}$. However, our use of sign-magnitude is critical to ensure that the overflow event is sign-symmetric, which allows us to prove that our sketch is unbiased. When an $s \cdot 2^\ell$-bit counter exceeds an absolute value of $2^{s \cdot 2^\ell - 1} - 1$, we merge to double its size. When merging counters in SALSA CS, we use sum-merge; note that max-merge may not be correct, as counters may have opposite signs. We prove the correctness of SALSA CS. For simplicity, we focus on the main variant. That is, a counter merges at most twice, starting from $s = 8$ bits and assuming that no counter reaches an absolute value of $2^{31}$, which is the common implementation assumption.
Let $x \in U$ be an element mapped to counter $A$, which may be merged with counter $B$ to create $\langle A, B\rangle$, which in turn may later merge with $\langle C, D\rangle$ to make the $4s$-bit counter $\langle A, B, C, D\rangle$.
We wish to show that the estimates of each row in SALSA CS are unbiased, and further that each estimate has variance with SALSA CS that is no larger than the corresponding variance with CS. As mentioned, here we give a full analysis for starting with $s$-bit counters and allowing counters to grow to $4s$ bits, as this is the focus of our implementation, but the approach generalizes readily to additional levels. We now introduce some notation to analyze SALSA CS.
We use $O_{AB}$ to denote (the indicator of) the event that $A$ and $B$ have been merged at query time (either into $\langle A, B, C, D\rangle$ or just as $\langle A, B\rangle$) and $O_{ABCD}$ for the event that $A$, $B$, $C$, and $D$ have been merged into $\langle A, B, C, D\rangle$. We also denote the value of $A$ by (the random variable) $X_A$, the value of $\langle A, B\rangle$ by $X_{AB}$, and similarly for $X_B$, $X_{CD}$, and $X_{ABCD}$. We emphasize that $X_S$ represents the value of the count mapped to $S$, regardless of whether the counter overflows (e.g., $X_A \triangleq \sum_{y \in U : h(y) = A} f_y g(y)$ even if $|X_A| \ge 2^{s-1}$). Without loss of generality, we also assume that the sign of $x$ is $g(x) = 1$ (and thus $\mathbb{E}[X_A] = \mathbb{E}[X_{AB}] = \mathbb{E}[X_{ABCD}] = f_x$). This allows us to express the estimate given in a row for SALSA CS as:
$$\widehat{f}_x = X_A(1 - O_{AB}) + X_{AB} \cdot O_{AB} \cdot (1 - O_{ABCD}) + X_{ABCD} \cdot O_{ABCD}$$
$$= X_A(1 - O_{AB}) + X_{AB} \cdot O_{AB} - X_{AB} \cdot O_{AB} \cdot O_{ABCD} + X_{ABCD} \cdot O_{ABCD}.$$
Observe that $O_{ABCD} \subseteq O_{AB}$ and thus $O_{AB} \cdot O_{ABCD} = O_{ABCD}$. This implies (since $X_{AB} - X_A = X_B$ and $X_{ABCD} - X_{AB} = X_{CD}$) that:
$$\widehat{f}_x = X_A + X_B \cdot O_{AB} + X_{CD} \cdot O_{ABCD}. \quad (3)$$
We continue by proving that the estimate is unbiased.
Lemma V.4.
SALSA CS is unbiased, i.e., $\mathbb{E}[\hat f_x] = f_x$.

Proof. Due to the sign-symmetry of the sign function $g$, we have that $\mathbb{E}[X_B \mid O_{AB}] = 0$ and $\mathbb{E}[X_{CD} \mid O_{ABCD}] = 0$ (as $\forall i: \Pr[X_B = i \mid O_{AB}] = \Pr[X_B = -i \mid O_{AB}]$ and $\Pr[X_{CD} = i \mid O_{ABCD}] = \Pr[X_{CD} = -i \mid O_{ABCD}]$). Thus, according to (3): $\mathbb{E}[\hat f_x] = \mathbb{E}[X_A + X_B O_{AB} + X_{CD} O_{ABCD}] = \mathbb{E}[X_A] + \mathbb{E}[X_B \mid O_{AB}] \Pr[O_{AB}] + \mathbb{E}[X_{CD} \mid O_{ABCD}] \Pr[O_{ABCD}] = f_x$.

We next show that SALSA reduces the variance in each row.
Lemma V.5.
$\mathrm{Var}[\hat f_x] \le \mathrm{Var}[CS]$, where $\mathrm{Var}[CS] \triangleq \mathrm{Var}[X_{ABCD}]$ is the variance of the underlying Count Sketch.

Proof. Let us prove that $\mathrm{Var}[CS] - \mathrm{Var}[\hat f_x] \ge 0$. Observe that since CS and SALSA CS are unbiased, we have that:

$\mathrm{Var}[CS] - \mathrm{Var}[\hat f_x] = \mathbb{E}\big[(X_{ABCD} - f_x)^2\big] - \mathbb{E}\big[(\hat f_x - f_x)^2\big] = \mathbb{E}\big[X_{ABCD}^2 - \hat f_x^2 - 2 f_x X_{ABCD} + 2 f_x \hat f_x\big] = \mathbb{E}\big[X_{ABCD}^2\big] - \mathbb{E}\big[\hat f_x^2\big] + 2 f_x \mathbb{E}\big[\hat f_x - X_{ABCD}\big]$.

Due to unbiasedness, $\mathbb{E}[\hat f_x] = \mathbb{E}[X_{ABCD}] = f_x$ and thus

$\mathrm{Var}[CS] - \mathrm{Var}[\hat f_x] = \mathbb{E}\big[X_{ABCD}^2\big] - \mathbb{E}\big[\hat f_x^2\big]$.  (4)

We continue by simplifying the expression for $\mathbb{E}[\hat f_x^2]$:

$\mathbb{E}\big[\hat f_x^2\big] = \mathbb{E}\big[(X_A + X_B O_{AB} + X_{CD} O_{ABCD})^2\big] = \mathbb{E}\big[X_A^2\big] + \mathbb{E}\big[X_B^2 O_{AB}\big] + \mathbb{E}\big[X_{CD}^2 O_{ABCD}\big] + 2\big(\mathbb{E}[X_A X_B O_{AB}] + \mathbb{E}[X_A X_{CD} O_{ABCD}] + \mathbb{E}[X_B X_{CD} O_{ABCD}]\big) = \mathbb{E}\big[X_A^2\big] + \mathbb{E}\big[X_B^2 \mid O_{AB}\big] \Pr[O_{AB}] + \mathbb{E}\big[X_{CD}^2 \mid O_{ABCD}\big] \Pr[O_{ABCD}] + 2\big(\mathbb{E}[X_A X_B \mid O_{AB}] \Pr[O_{AB}] + \mathbb{E}[X_A X_{CD} \mid O_{ABCD}] \Pr[O_{ABCD}] + \mathbb{E}[X_B X_{CD} \mid O_{ABCD}] \Pr[O_{ABCD}]\big)$.

Since $x$ is mapped to $A$, we can use the sign-symmetry of $X_B$ and $X_{CD}$ to get $\mathbb{E}[X_A X_B \mid O_{AB}] = \mathbb{E}[X_A X_{CD} \mid O_{ABCD}] = \mathbb{E}[X_B X_{CD} \mid O_{ABCD}] = 0$, which gives

$\mathbb{E}\big[\hat f_x^2\big] = \mathbb{E}\big[X_A^2\big] + \mathbb{E}\big[X_B^2 \mid O_{AB}\big] \Pr[O_{AB}] + \mathbb{E}\big[X_{CD}^2 \mid O_{ABCD}\big] \Pr[O_{ABCD}] = \mathbb{E}\big[X_A^2\big] + \mathbb{E}\big[X_B^2\big] + \mathbb{E}\big[X_{CD}^2\big] - \big(\mathbb{E}\big[X_B^2 \mid \neg O_{AB}\big] \Pr[\neg O_{AB}] + \mathbb{E}\big[X_{CD}^2 \mid \neg O_{ABCD}\big] \Pr[\neg O_{ABCD}]\big) \le \mathbb{E}\big[X_A^2\big] + \mathbb{E}\big[X_B^2\big] + \mathbb{E}\big[X_{CD}^2\big]$.  (5)

Now, notice that $X_{ABCD}^2 = (X_A + X_B + X_{CD})^2 = X_A^2 + X_B^2 + X_{CD}^2 + 2(X_A X_B + X_A X_{CD} + X_B X_{CD})$. Due to the sign-symmetry of $g$, we have $\mathbb{E}[X_A X_B] = \mathbb{E}[X_A X_{CD}] = \mathbb{E}[X_B X_{CD}] = 0$ (this assumes that the variables remain symmetric conditioned on the overflow events, which is correct for SALSA CS due to our sign-magnitude representation) and thus $\mathbb{E}\big[X_{ABCD}^2\big] = \mathbb{E}\big[X_A^2\big] + \mathbb{E}\big[X_B^2\big] + \mathbb{E}\big[X_{CD}^2\big] \ge \mathbb{E}\big[\hat f_x^2\big]$, where the inequality follows from (5). Together with (4), this concludes the proof.

Because the theorem shows the error variance is no larger for each row, following the same analysis as for CS (using Chebyshev's inequality to bound the error of a row and then Chernoff's inequality to bound the error of the median) yields the same error bounds for SALSA CS. Indeed, we expect better estimates using SALSA, as the inequality from the proof ($\mathrm{Var}[CS] \ge \mathrm{Var}[\hat f_x]$) is usually strict. In our experimental evaluation, we show that SALSA CS obtains better estimates than CS.

Theorem V.6.
Let $2^\ell \cdot s$ be the maximal bit-size of any counter in sum-merge SALSA CS, and for all $i \in [d]$ let $\tilde h_i(x) = \lfloor h_i(x)/2^\ell \rfloor$ be hash functions that map items into a standard CS with $(2^\ell \cdot s)$-bit counters. Then for any $x \in U, i \le d$: $\mathbb{E}[C_{SALSA}[i, h_i(x)] \cdot g_i(x)] = f_x$ and $\mathrm{Var}[C_{SALSA}[i, h_i(x)] \cdot g_i(x) - f_x] \le \mathrm{Var}[C_{CS}[i, \tilde h_i(x)] \cdot g_i(x) - f_x]$, where $C_{SALSA}[i, h_i(x)]$ and $C_{CS}[i, \tilde h_i(x)]$ are the counters of SALSA and the underlying CS (with functions $\tilde h_i$).

We note that SALSA CS can also provide other derived results, similarly to CS. For example, by using a heap, we can find the $L_2$-heavy hitters in Cash Register streams similarly to the original version.

Universal Sketch (UnivMon):
The universal monitoring sketch (UnivMon) uses several $L_2$ (CS) sketches that are applied on different subsets of the universe. By improving the accuracy of CS, we can also improve the performance of UnivMon. We note that since SALSA CS provides an accuracy guarantee that is at least as good as the underlying sketch's, SALSA UnivMon provides the same accuracy guarantee as the vanilla UnivMon.
Given streams $A, B$ and their sketches $s(A), s(B)$, we may wish to derive statistics on $A \cup B$ (for example, we can parallelize the sketching of $A$ and $B$ and then merge them), or on $A \setminus B$ (for example, to detect changes in our network traffic compared to the previous epoch). By $A \setminus B$, we refer to computing the frequency difference; e.g., if $x$ appeared twice in $A$ and three times in $B$, its frequency in $A \setminus B$ is $-1$. Most standard sketches are linear, and can be naturally summed/subtracted counter-wise to obtain sketches $s(A \cup B) \equiv s(A) + s(B)$ and $s(A \setminus B) \equiv s(A) - s(B)$ if they share the same hash functions and work in the Turnstile model. SALSA can also merge and subtract sketches. For merging $s(A)$ and $s(B)$, SALSA traverses the counters and merges them according to sum-merging. Specifically, each counter in the merged sketch has a size at least as large as its size in $s(A)$ and its size in $s(B)$. Additionally, when summing or subtracting counters, an overflow may occur, triggering another merge to make sure we have enough bits to encode the resulting values. CS, as a Turnstile sketch, also supports general subtraction, which is done similarly to merging.
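As an illustration, the counter-wise merge can be sketched as follows (a simplified model of ours, not the SALSA code: counters are Python ints with an explicit bit-width, and we assume the two sketches already share the same counter layout):

```python
def merge_rows(row_a, row_b, subtract=False, max_bits=32):
    """Counter-wise sum-merge of two aligned sketch rows.

    Each row is a list of (value, bits) pairs for logical counters that
    occupy the same positions in both sketches (same hash functions).
    The merged counter is at least as wide as in either input, and a sum
    (or difference, for the s(A \\ B) sketch) that overflows triggers a
    further doubling of the counter size, mirroring an overflow merge.
    """
    merged = []
    for (va, ba), (vb, bb) in zip(row_a, row_b):
        bits = max(ba, bb)
        value = va - vb if subtract else va + vb
        # Sign-magnitude range for CS counters: |value| <= 2**(bits-1) - 1.
        while abs(value) > (1 << (bits - 1)) - 1 and bits < max_bits:
            bits *= 2
        merged.append((value, bits))
    return merged
```

For example, summing the 8-bit counters holding 100 and 50 yields 150, which exceeds the 8-bit sign-magnitude range, so the merged counter is widened to 16 bits.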
Fig. 3: An example of $s = 8$-bit SALSA CS merging and subtracting Turnstile sketches $s(A)$ and $s(B)$.

CMS (which works in the Strict Turnstile model), in contrast, can compute $s(A \setminus B)$ given a guarantee that $B \subseteq A$. These operations are illustrated in Figure 3.

Integrating Estimators into SALSA:
Thus far, we have described a single strategy to handle overflows: when a counter reaches a value that cannot be represented with the current number of bits, it merges with a neighbor. However, there are alternatives that allow one to increase the counting range. Specifically, estimators can represent large numbers using a smaller number of bits, at the cost of introducing an error. The state-of-the-art Additive Error Estimators (AEE) [16] offer a simple and efficient technique to increase the counting range. For simplicity, we describe the technique for CMS and unit-weight streams (where all updates are of the form $\langle x, 1\rangle$), although AEE can support weighted updates and other L1 sketches as well. Throughout the execution, incoming updates are sampled with probability $p$. If an update is sampled, it increases the sketch, and otherwise, it is ignored. Whenever a counter overflows, a downsampling event happens. When downsampling, $p$ is halved and any counter $C[i,j]$ is replaced by either $\mathrm{Bin}(C[i,j], 1/2)$ (called probabilistic downsampling) or by $\lfloor C[i,j]/2 \rfloor$ (deterministic downsampling). Since the counter values are reduced as a result of the downsampling, new updates can be processed, and no additional counter bits are needed. For any $\delta_{est} > 0$, we have an implied estimation error for AEE given by $\epsilon_{est} \triangleq \sqrt{p^{-1} \ln(2/\delta_{est})/N}$, such that $\Pr\big[\big|\hat C[i,j] - C[i,j]\big| \ge N \epsilon_{est}\big] = \Pr\big[\big|\hat C[i,j] - C[i,j]\big| \ge \sqrt{N p^{-1} \ln(2/\delta_{est})}\big] \le \delta_{est}$. Another motivation for AEE comes from the processing speed. Since the sampling probability is independent of the value of the current counter, one can compute the hash functions $h_i(x)$ only if a packet is sampled.
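The sampling-and-downsampling loop can be sketched as follows (our simplified single-row model of the technique, not the code of [16]; an unbiased estimate of a count is the counter value divided by $p$):

```python
import random

def aee_update(counters, idx, p, counter_max):
    """Apply one unit update at counters[idx] under AEE-style sampling.

    The update is applied only with probability p. When the target
    counter is full, a downsampling event halves p and replaces every
    counter C by Bin(C, 1/2) (probabilistic downsampling); using C // 2
    instead would give the deterministic variant. Returns the possibly
    halved sampling probability.
    """
    if random.random() >= p:
        return p  # update not sampled: the sketch is untouched
    if counters[idx] >= counter_max:
        p /= 2
        for i, c in enumerate(counters):
            counters[i] = sum(1 for _ in range(c) if random.random() < 0.5)
    counters[idx] += 1
    return p
```

Because the counters shrink while $p$ is halved, the quantity `counters[i] / p` remains an (approximately unbiased) estimate of the true count, at the cost of the sampling error described above.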
Since hash functions are a major bottleneck for sketches [18], AEE is faster than the baseline sketches. Another version of the estimator, called AEE MaxSpeed, aims to maximize the processing speed while bounding the error. Therefore, instead of waiting for a counter to overflow, it downsamples all counters once enough updates have been processed. In comparison with the original variant (called AEE MaxAccuracy), MaxSpeed is faster but less accurate [16]. Intuitively, downsampling and merging increase the error in different ways. While downsampling increases the inherent error of a counter, merging adds noise from other elements that previously have not collided with the counter. SALSA selects how to handle overflows in a way that minimizes the theoretical error increase from either downsampling or merging. Specifically, as our accuracy theorems suggest, the sketch error in SALSA depends on the size of the largest counter. Therefore, unless a largest counter overflows, SALSA opts for merging as its overflow strategy. When a largest counter overflows, SALSA computes the estimator error difference $\Delta_{est} = (\sqrt 2 - 1) \cdot \epsilon_{est}$, which is the increase in error if we downsample (halving $p$ grows $\epsilon_{est}$ by a $\sqrt 2$ factor). Similarly, if the currently largest counter is of size $s \cdot 2^\ell$, SALSA computes $\epsilon_{CMS} \triangleq \delta^{-1/d} \cdot 2^\ell / w$, which is the current accuracy guarantee (see Theorem V.1 and Section III), and $\Delta_{CMS} = \epsilon_{CMS}$ is then the difference in error guarantee that results from merging, as merging doubles the size of the largest counter. We pick $\delta_{est} = \delta/d$ to allow all counters of the current element to be estimated within $\epsilon_{est}$ with probability $1 - \delta$. Finally, SALSA chooses to merge if $\Delta_{CMS} \le \Delta_{est}$, and otherwise it downsamples. As a result, SALSA estimates element sizes to within $N \cdot (\epsilon_{est} + \epsilon_{CMS})$ with a probability of at least $1 - 2\delta$. As an optimization, when downsampling, SALSA may be able to split counters if the resulting values can be represented using fewer bits.
For example, if $s = 8$ and a value of at least $2^8$ was represented in a 16-bit (merged) counter, and it is then downsampled to a value below $2^8$, we can split the counter and set both of its 8-bit halves to the downsampled value. We note that this only works for max-merging, where the accuracy guarantees seamlessly follow.

VI. EVALUATION
In this section, we extensively evaluate SALSA's performance on real and synthetic datasets and compare it to that of the underlying sketches. We first document the methodology.
Sketch Configuration Parameters:
Unless specified otherwise, all CMS and CUS sketches are configured with $d = 4$ rows, as is used, e.g., in the Caffeine caching library [38]. Since CS requires taking a median over the rows, all CS experiments are configured with $d = 5$ rows, as done, e.g., in [39]. We configure UnivMon with CS instances, each configured with $d = 5$ and a heap, following the implementation of [15]. Such settings are standard for applications that aim for speed rather than being memory-optimal. For the ABC [17] and Pyramid [12] sketches, as well as the Cold Filter [9] framework, we use the configurations recommended by the authors. We pick $s = 8$-bit counters as the default configuration of SALSA, motivated by the synthetic results. We use the simple encoding (1 bit of overhead per counter) of Section IV, which uses slightly more space but is faster. The Baseline implementations use 32-bit counters, a choice we justify later in Figure 6, and that is also common in existing implementations [18], [33]. Nonetheless, our SALSA implementation allows counters to grow further. For implementation efficiency, all row widths $w$ are powers of two. When we give figures where the $x$-axis is allocated memory, we include the encoding overheads. For the integration with AEE, we configure SALSA AEE with $\delta = 4 \cdot \delta_{est}$ (see Section V).

Datasets:
We evaluate our algorithms using four real datasets and several synthetic ones. In particular, we use three network packet traces: two from major backbone routers in the US, denoted NY18 [40] and CH16 [41], and a data center network trace denoted Univ2 [42]. In these traces, we define items using the "5-tuples" of the packets (srcip, dstip, srcport, dstport, proto). Additionally, we use a YouTube video trace [43, US category]. As the video data does not have a recorded order (just view-counts), we use a random order where each item is a video independently sampled according to the view-count distribution. Finally, we use random-order Zipfian traces. All traces have 98M elements for consistency with the shortest real dataset. In our evaluation, we use unit-weight Cash Register streams (i.e., all updates are of the form $\langle x, 1\rangle$). We also experiment with the task of change detection, which requires a SALSA sketch under the Turnstile model.

Metrics:
For frequency estimates, we use the On-arrival model that asks for an estimate of the size of each arriving element (e.g., [5], [7], [16], [44]). Intuitively, this model is motivated by the need to take per-packet actions in networking, e.g., to restrict the allowed bandwidth to prevent denial of service attacks. Given a stream with $n$ updates, we obtain errors $e_1, e_2, \ldots, e_n$; the Mean Square Error is defined as $MSE \triangleq n^{-1} \cdot \sum_i e_i^2$, the Root Mean Square Error is then $RMSE \triangleq \sqrt{MSE}$, while the Normalized RMSE is $NRMSE \triangleq n^{-1} \cdot RMSE$. Similar metrics are used, e.g., in [5], [7], [16], [44]. Notice that NRMSE is a unitless quantity in the interval $[0, 1]$. For fairness, we also evaluate using the error metrics used in Pyramid and ABC: Average Absolute Error (AAE) and Average Relative Error (ARE). AAE averages the error over all the elements with non-zero frequency, i.e., $AAE \triangleq \frac{1}{|U_{>0}|} \sum_{x \in U_{>0}} |\hat f_x - f_x|$, where $U_{>0} \triangleq \{x \in U : f_x > 0\}$. Similarly, ARE is defined as $\frac{1}{|U_{>0}|} \sum_{x \in U_{>0}} \frac{|\hat f_x - f_x|}{f_x}$. For tasks such as Count Distinct, Entropy, and Frequency Moments estimation, we use the Average Relative Error (ARE) metric that averages over the relative error of the ten runs. For turnstile evaluation, we evaluate the capability of SALSA to improve sketches for the Change Detection task (e.g., see [15], [45]), in which we partition the workload into two equal-length parts $A$ and $B$, sketch each, and test the NRMSE of the estimates of the frequency changes between $A$ and $B$. Each data point is the result of ten trials; we report the mean and 95% confidence intervals according to Student's t-test [46].

Implementation:
We leverage existing CMS and CUS implementations from [16] and extend them to implement SALSA. We also extend these to create a Baseline and SALSA implementation of CS. We also used the authors' code for the Pyramid [12], ABC [17], and Cold Filter [9] algorithms. Particularly, for error measurements, we used the code as-is, while for speed measurements, we applied our optimizations for a fair comparison. All sketches use the same hash functions (BobHash) and index computation methods. (This is different from the ARE used for frequency estimation, where the averaging is done over all elements with positive frequency.)
Fig. 4: Speed and accuracy of SALSA CMS and SALSA CS for the synthetic datasets: (a) Error, Count Min Sketch (2MB); (b) Error, Count Sketch (2.5MB). The Baseline uses $2^{17}$ 32-bit counters in each row, for a total of 2MB of space in CMS and 2.5MB in CS; SALSA$_s$ uses $s$-bit counters in rows of the same total counter space.
Fig. 5: Accuracy of SALSA CMS with Sum merge vs. Max merge: (a) Error, NY18; (b) Error, Zipf (2MB).
When evaluating against the AEE estimators [16], we use the provided open-source code. Similarly, we obtained the UnivMon code from [39] and replaced its CS sketches with SALSA CS to create 'SALSA UnivMon'. All speed measurements were performed using a single core on a PC with an Intel Core i7-7700 CPU @3.60GHz (256KB L1 cache, 1MB L2 cache, and 8MB L3 cache) and 32GB DDR3 2133MHz RAM.
How to Configure SALSA?
We perform preliminary experiments to determine the default SALSA configuration.
How Large Should Counters Be?
We first determine the most effective minimal counter size ($s$) for SALSA. Intuitively, for a fixed row width $w$, smaller $s$ results in lower error but also in larger encoding overheads. With this tradeoff, it may not be profitable to reduce $s$. In this experiment, we fixed the memory of the counters and deliberately ignored the encoding overheads for SALSA. The goal is to quantify the attainable improvement from using smaller counters. We used synthetic Zipfian traces with skews varying from 0.6 to 1.4. As shown in Figure 4, most of the improvement comes from the first halving of the counter size. These results were consistent across different memory footprints. This is not surprising, as almost all counters merge at least once, but then many do not overflow further. We also observe that SALSA offers more gains for low-skew traces and that CS is
Fig. 6: SALSA CMS vs. CMS with 8-, 16-, and 32-bit counters (2MB): (a) varying the threshold $\varphi$; (b) varying the stream length.
better suited for lower skews, while CMS offers comparable accuracy with less space for high skew. Hereafter, we use $s = 8$ bits as the default SALSA configuration. While SALSA with $s = 4$ bits is slightly more accurate for low-skew workloads and high memory footprints, its encoding overhead of about 25% of the sketch size (compared to 12.5% for $s = 8$) is too large to justify the benefits.

Fig. 7: Accuracy of SALSA CMS (with $s = 8$ bits) vs. Tango CMS: (a) Error, NY18; (b) Error, Zipf.

Which Merging Should We Use?
As mentioned above, SALSA CS must use sum-merging, and so does SALSA CMS for Strict Turnstile streams. Similarly, in SALSA CUS, we need to use max-merging. This leaves only the choice for SALSA CMS in Cash Register streams, where we can use either sum-merging or max-merging. We quantify the difference in accuracy in Figure 5. As shown, max-merging is slightly more accurate, especially for low-skew workloads. We conclude that if one only targets Cash Register streams, it is better to use max-merging, but the accuracy of sum-merging is not far behind.
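The two rules differ only in how a pair of neighboring counters collapses into one double-width counter; a toy illustration of ours:

```python
def overflow_merge(c1, c2, rule):
    """Merge two neighboring non-negative counters into one wider counter.

    'sum' preserves the total and therefore supports later subtraction
    (needed for CS and for Strict Turnstile CMS), while 'max' keeps only
    the larger value, which preserves the CMS/CUS over-estimation
    guarantee in Cash Register streams with a smaller resulting value.
    """
    assert rule in ("sum", "max")
    return c1 + c2 if rule == "sum" else max(c1, c2)

# An item counted by either counter appeared at most max(c1, c2) times in
# its own counter, so both rules still upper-bound its frequency, but
# max gives the tighter bound:
assert overflow_merge(255, 3, "sum") == 258
assert overflow_merge(255, 3, "max") == 255
```

This is the intuition behind max-merge being slightly more accurate when subtraction is never needed.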
Is Fine-grained Merging Worth It?
To understand the accuracy improvement attainable by fine-grained merging (as opposed to SALSA's approach of doubling the counter size at each overflow), we compare SALSA with Tango. As the results in Figure 7 indicate, Tango also offers the best accuracy-space tradeoff when starting with $s = 8$ bits (Tango$_8$ is equivalent to SALSA and is omitted). However, while it is slightly more accurate, the gains seem marginal considering the computationally expensive operations of determining the counter's size and offset. Further, Tango has a larger per-counter overhead and does not obviously allow an efficient

Fig. 8: Comparing the performance of the SALSA, Pyramid [12], ABC [17], and Baseline versions of CMS: (a) Speed, NY18; (b) Speed, CH16; (c) NRMSE Error, NY18; (d) NRMSE Error, CH16; (e) AAE Error, NY18; (f) AAE Error, CH16; (g) ARE Error, NY18; (h) ARE Error, CH16.
Fig. 9: The error distribution of the Baseline, Pyramid, ABC, and SALSA algorithms (2MB): (a) NY18; (b) CH16.

encoding like SALSA does (Section IV).
Can one simply use small counters?
In our evaluation, SALSA starts from $s = 8$-bit counters. We now compare SALSA with a baseline sketch that uses small fixed-size counters (as proposed, e.g., in [47]). In such a sketch, a counter is only incremented if it does not overflow (i.e., its value is bounded by $2^b - 1$ for $b$-bit counters). We show that such solutions cannot capture the sizes of the heavy hitters (elements whose frequency is at least a $\varphi$ fraction of the stream), which are often considered the most important elements [48]. First, we show (Figure 6a) that when estimating heavy hitters, even with a loose definition of $\varphi$, it is best to use 32-bit counters for CMS. Similarly, as shown in Figure 6b, when the measurement is longer than 10M elements, the small-counter variants become less accurate. Figure 6 is depicted for a Zipfian trace with skew 1, and we observed similar behavior for other traces, thresholds, and memory footprints.
We used the authors' original implementations for both Pyramid Sketch and ABC. We present results for CMS on the NY18 and CH16 datasets; similar results are obtained for additional sketches and workloads. As shown in Figures 8a and 8b, Pyramid Sketch and SALSA are about 20% slower than the baseline, while ABC is about 75% slower. Intuitively, the slowdown is expected, as all these algorithms add complexity to the baseline. ABC is significantly slower due to its additional encoding overheads, as its bit-borrowing technique does not allow byte-alignment of counters, forcing it to perform additional bitwise operations for reading and updating counters. In terms of the NRMSE metric (8c and 8d), SALSA achieves the best results, followed by the baseline, Pyramid Sketch, and ABC. The on-arrival NRMSE metric gives more weight to the frequent elements, and is more sensitive to larger errors than AAE and ARE. Our results indicate that SALSA is also more accurate than Pyramid Sketch and the baseline in terms of AAE and ARE over the entire memory range. Note that Pyramid Sketch is better than the baseline in the memory range 0.5MB-2MB, which is the range it is optimized for according to the paper. ABC is slightly more accurate than SALSA for small memory sizes but less accurate for large memory sizes, and is comparable in between. Our conclusion is that SALSA is the best in the NRMSE metric and is competitive in the AAE, ARE, and speed metrics.
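To make the metric differences concrete, the definitions above can be computed as follows (our toy code, computing both metrics over the same error list for illustration; note how squaring makes a single large heavy-hitter error dominate the (N)RMSE while averaging without squaring dilutes it):

```python
from math import sqrt

def nrmse(errors):
    """Normalized RMSE over on-arrival errors e_1..e_n: sqrt(MSE) / n."""
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return sqrt(mse) / n

def avg_abs_error(errors):
    """Average absolute error over the same per-element errors."""
    return sum(abs(e) for e in errors) / len(errors)

# One large error among many small ones moves NRMSE far more than the
# plain average: with 99 errors of 1 and one error of 100, the average
# stays below 2 while the NRMSE grows by an order of magnitude.
small = [1.0] * 99
assert avg_abs_error(small + [100.0]) < 2.0
assert nrmse(small + [100.0]) > 5 * nrmse(small + [1.0])
```

This is why algorithms with a few very badly estimated heavy hitters can still look competitive under AAE/ARE.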
Understanding the differences:
Our first observation is that the AAE and ARE metrics are not suitable when estimating the size of the heavy hitters, which are often considered the most important elements [48]. This is because both metrics give equal weight to all items, making the impact of the errors on the largest ones vanish due to averaging. This is evident from Figure 6, in which the leftmost point corresponds to the AAE and ARE metrics (as all items are accounted for). As shown, in such a case, using small counters yields lower error rates. Nevertheless, such a solution cannot count beyond its maximal counter value, which results in excessive error for the heavy hitters. In fact, as we show in Appendix B, for CMS and this dataset, it is then better to output a fixed estimate for all sizes without performing any measurement.

Fig. 10: Speed and accuracy of SALSA CMS and SALSA CUS for the real datasets: (a)-(d) Error and (e)-(h) Speed, for NY18, CH16, Univ2, and YouTube. Notice the log-scale of the error plots.

To illustrate the differences that make Pyramid Sketch and ABC competitive in the AAE/ARE metrics, but not in NRMSE, we visualize the errors of estimating individual element frequencies. We
sampled one random element from each possible frequency to reduce clutter. The results, shown in Figure 9, demonstrate the differences between the algorithms. SALSA has a low error-variance and is consistently more accurate than the Baseline. In contrast, Pyramid Sketch (as shown in region A) has much higher variance, as elements whose counters overflow share the most significant bits with other elements. ABC, as evident in region B, has a high error on heavy hitters, as its counters can at most double in size by combining with their neighbors. We configured ABC to start with small counters as suggested by the authors, limiting its estimation range, as three bits are spent on overhead. While one could use larger counters, it would decrease their number and diminish the benefit over the baseline sketch. To conclude, both ABC and Pyramid Sketch have elements with high estimation errors, making them less attractive for (Mean Square Error)-like metrics.

Fig. 11: Accuracy of SALSA CS for the real datasets: (a) Error, NY18; (b) Error, CH16; (c) Error, Univ2; (d) Error, YouTube.

L1 Sketches:
We proceed by testing the impact that SALSA (with $s = 8$-bit counters) has on the accuracy and speed of L1 sketches, such as CMS and CUS. The results, depicted in Figure 10, show that SALSA CMS is substantially more accurate (roughly requiring half the space for the same error) than the Baseline for the NY18, CH16, and YouTube datasets. For Univ2, SALSA's improvement is less noticeable, and due to its encoding overheads, the tradeoff is not statistically significant. SALSA CUS is better than the Baseline on all traces, and often requires half the space for a given error. SALSA's accuracy comes at the cost of additional operations that are required to maintain the counter layout. We measured SALSA to be 17%-23% slower than the corresponding Baseline variants, but it can nonetheless handle 10-17.5 million elements per second on a single core, which is sufficient to support the high link rate forwarding at modern large-scale clusters, such as Google's, which is estimated at 9M packets per second (see [49, Sec. 3.2]). We note that by combining SALSA with estimators (Section VI), we can make faster counter sketches. We conclude that SALSA offers an appealing accuracy-to-space tradeoff.

Count Sketch:
Next, we evaluate SALSA for Count Sketch, whose L2 guarantee is important for low-skew workloads and for more complex algorithms such as UnivMon. As shown in Figure 11, SALSA offers a statistically significant improvement for the NY18, CH16, and YouTube datasets. For Univ2, the accuracy improvement is offset by the encoding overhead, and it is not clear which variant is better.
UnivMon:
We use SALSA CS to extend the Universal Monitoring (UnivMon) sketch that supports estimating a wide variety of functions of the frequency vector. Our experiment includes estimating the element size entropy and $F_p$ moments. The results in Figure 12 indicate that SALSA improves the accuracy of both tasks. Interestingly, for entropy estimation, we observe that SALSA's accuracy (and variance) improve when using smaller ($s = 2$ or $s = 4$ bit) counters. When using a large amount of memory, SALSA has roughly the same accuracy as the baseline, as both hit a bottleneck
Fig. 12: Accuracy of SALSA UnivMon for the NY18 dataset: (a) Entropy Estimation; (b) Frequency Moment Estimation (400KB).
Fig. 13: Accuracy of SALSA Cold Filter for the NY18 dataset (AAE and ARE vs. memory).

in the size of the sketches' heaps (as in the implementation of [18]). For estimating $F_p$ moments, we measure similar accuracy for small values of $p$, while SALSA improves the accuracy for large $p$ values. To explain this, notice that the element size estimates mainly affect $F_p$ for large $p$, while for $p \approx 0$, the value is determined primarily by the cardinality.

Cold Filter:
We extend Cold Filter by replacing its second-stage CUS (denoted CM-CU in the original paper) algorithm with our SALSA variant. The results in Figure 13 use the AAE and ARE metrics suggested by its authors [9]. The results show that SALSA saves up to 50% of the space for a similar error. However, the improvement is more evident when the allocated memory is small, as in these cases the second-stage algorithm plays a significant role. When the memory size is large (compared to the measurement length), the first-stage algorithm handles most of the flows, and improving the second-stage CUS algorithm yields marginal benefits. We observed negligible differences in processing speed, which is expected, as many elements only touch the first stage and do not reach the second. We also tested Cold Filter versus its SALSA variant using the NRMSE metric; there, SALSA yields even larger accuracy gains. However, Cold Filter's aggregation buffer needs to be drained upon query, which negates its speedup potential in the on-arrival model.
Count Distinct and Heavy Hitters using Count Min:
We evaluate the performance of SALSA CMS on additional applications, such as counting distinct elements and estimating the size of the heavy hitters. As shown in the count distinct results (Figures 14(a)-(c)), neither SALSA CMS nor the Baseline is effective with low memory footprints. This is because no counters remain zero-valued, and the Linear Counting estimator fails. Nevertheless, SALSA CMS can work with less memory (4.5MB for NY18 and 1.125MB for CH16) and reduces the estimation error when the Baseline does produce estimates. Intuitively, Linear Counting with $w$ buckets can count up to $w \ln w$ elements, so the number of elements in the datasets (6.5M for NY18 and 2.5M for CH16) imposes a lower bound on the amount of space needed. We evaluate the accuracy of estimating the frequencies of the heavy hitters (elements with a frequency of at least $\varphi \cdot N$), while varying the threshold $\varphi$ as in [48]. SALSA CMS is more accurate, especially for small values of $\varphi$.

Fig. 14: Accuracy of SALSA CMS on ((a)-(c)) counting distinct elements ((a) NY18; (b) CH16; (c) Zipf (8MB)) and ((d)-(f)) estimating the size of heavy hitters with 2MB ((d) NY18; (e) CH16; (f) Zipf).

Fig. 15: Accuracy of SALSA CS for the Top-$k$ ((a) Top-$k$, NY18 (640KB); (b) Top-1024, Zipf (640KB)) and Change Detection ((c) NY18; (d) Zipf (2.5MB)) tasks.
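The Linear Counting estimate referenced above can be sketched as follows (our illustration): a row of $w$ counters is treated as a bitmap, and the count-distinct estimate is $w \ln(w/z)$, where $z$ is the number of zero counters, so the estimate tops out at $w \ln w$ (when $z = 1$) and fails once $z = 0$.

```python
from math import log

def linear_counting(counters):
    """Count-distinct estimate from a row of counters used as a bitmap.

    Returns w * ln(w / z), where z is the number of zero-valued counters,
    or None once no counter is zero and the estimator fails -- which is
    why a minimum amount of memory is needed before any estimate can be
    produced at all.
    """
    w = len(counters)
    z = sum(1 for c in counters if c == 0)
    if z == 0:
        return None  # estimator saturated
    return w * log(w / z)
```

For example, a row of four counters with two zeros yields an estimate of $4 \ln 2 \approx 2.77$ distinct elements.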
The smaller improvement for large φ values can be explained by noting that once φ · N exceeds the largest value a partially-merged counter can hold, all such heavy hitters cause their counters to merge to the full counter width (the same as the Baseline's), so SALSA loses its advantage for them. The plot of Figure 14(d) stops at the frequency of the most frequent element, as no element in the NY18 dataset has a larger frequency.

Top-k and Change Detection using Count Sketch: We also examine SALSA's effect on other uses of CS, such as Top-k and Change Detection (which requires Turnstile support). For Top-k, our experiments indicate that given sufficient memory (e.g., 2MB), the Baseline CS detects the largest elements accurately for reasonable k values. Therefore, we focus on a constrained memory setting (640KB). As shown in Figure 15 (a) and (b), SALSA detects the top-k accurately, especially for large values of k and low-skew workloads.

We also evaluate SALSA CS on a Change Detection task. Here, we split the input into two equal-length intervals A and B, and the algorithm needs to estimate the change in the frequency of an element x between the first and second halves. To that end, we create sketches s(A) and s(B) and the difference sketch s(A − B) as described in Section V. Intuitively, the frequency difference can be small compared with the frequencies within each interval, and thus directly subtracting the estimates of s(A) and s(B) could yield a poor result compared to querying the difference sketch (as the desired error is a fraction of the norm of the frequency difference). In Figure 15 (c) and (d), we compute the NRMSE over the set of elements that appear in either A or B. As shown, SALSA provides a statistically significant accuracy improvement in all tested memory allocations and dataset skews.

Fig. 16: Comparison with estimator algorithms (CM sketch): (a) Error, NY18; (b) Error, CH16; (c) Speed, NY18; (d) Speed, CH16. Compared methods: Baseline, AEE MaxAccuracy, AEE MaxSpeed, SALSA, and SALSA AEE.

Estimators:
We now experiment with integrating estimators, specifically AEE [16], into SALSA CMS. Under certain conditions, AEE increases both the accuracy and the processing speed of the sketch. Intuitively, the accuracy can increase because the sketch can fit more estimators than it would counters, and the speed increases because some packets are ignored without updating the estimators. Our estimator-integrated solution, SALSA AEE (from Section V), optimizes the accuracy by interleaving estimator downsampling and estimator merges. Roughly speaking, SALSA AEE aims to be at least as accurate as the best of SALSA CMS and AEE MaxAccuracy by choosing the best method to cope with each overflow. Similarly to AEE MaxSpeed, we create a speed-optimized variant, called SALSA AEE_d, that downsamples on the first d overflows (and selects whether to merge or downsample afterward according to the logic presented in Section V). This allows the algorithm to reach a sampling rate of 2^−d and thus obtain speedups by reducing hash computations. (Note that this is not on-arrival computation, and the results are not comparable with those obtained in Figures 10 and 11.)

The results, shown in Figure 16, illustrate that SALSA AEE is always at least as accurate as SALSA (when SALSA only merges counters), and more accurate for small amounts of memory. For large amounts of memory, SALSA AEE only merges, and therefore its accuracy is identical to SALSA's while it is slightly slower due to the added logic. Compared with AEE MaxAccuracy, SALSA AEE has comparable accuracy for small memory allocations (where it is mostly better to downsample than to merge). Further, for large memory allocations (e.g., 100KB or higher), SALSA AEE is more accurate than AEE MaxAccuracy, as in such scenarios it is better to merge than to downsample. Compared with AEE MaxSpeed, SALSA AEE provides improved accuracy (by up to 25%), especially for small amounts of memory, while also being faster (by up to 7%), except when using large space (2MB+ in this experiment).

Fig. 17: Effect of splitting counters in SALSA AEE (CM sketch): (a) Error, NY18; (b) Error, CH16.

Should We Split Counters?
Finally, we check the accuracy gains obtainable by splitting counters. Intuitively, once a counter is downsampled, it may require fewer bits to represent. Therefore, if the counter previously occupied s · 2^ℓ bits (for ℓ ≥ 1) and its downsampled value fits in s · 2^{ℓ−1} bits (i.e., is smaller than 2^{s · 2^{ℓ−1}}), we can split it into two s · 2^{ℓ−1}-bit counters. As a result, there are fewer collisions between elements, and SALSA AEE has better accuracy. However, as the results in Figure 17 suggest, this effect is minor, and in most cases the accuracy gains are insignificant.

VII. CONCLUSIONS
We have presented SALSA, an efficient framework for dynamically re-sizing counters in sketching algorithms, extending counters only when needed to represent larger numbers. SALSA starts from small counters and gradually adapts its memory layout to optimize the space-accuracy tradeoff. By evaluating across multiple real-world traces, sketches, and tasks, we have shown that, for a small overhead for its merging logic, SALSA considerably reduces the measurement error. In particular, our evaluation indicates that SALSA improves on state-of-the-art solutions such as Pyramid Sketch [12], ABC [17], Cold Filter [9], and the AEE estimators [16]. We believe that SALSA can replace and enhance existing sketches in more complex algorithms, such as Lp-samplers [50] and database systems (e.g., [51], [52]). All of our code is released as open source [1].

REFERENCES

[1] "SALSA open-source code," 2020, https://github.com/SALSA-ICDE2021.
[2] A. Goyal, H. Daumé III, and G. Cormode, "Sketch algorithms for estimating point queries in NLP," in EMNLP-CoNLL, 2012.
[3] G. Dittmann and A. Herkersdorf, "Network processor load balancing for high-speed links," in SPECTS, 2002.
[4] A. K. Kaushik, E. S. Pilli, and R. C. Joshi, "Network forensic analysis by correlation of attacks with network attributes," in Information and Communication Technologies, 2010.
[5] R. Ben-Basat, G. Einziger, I. Keslassy, A. Orda, S. Vargaftik, and E. Waisbard, "Memento: Making sliding windows efficient for heavy hitters," in ACM CoNEXT, 2018.
[6] A. Lall, V. Sekar, M. Ogihara, J. J. Xu, and H. Zhang, "Data streaming algorithms for estimating entropy of network traffic," in ACM SIGMETRICS/Performance, 2006.
[7] R. Ben-Basat, X. Chen, G. Einziger, R. Friedman, and Y. Kassner, "Randomized admission policy for efficient top-k, frequency, and volume estimation," IEEE/ACM Transactions on Networking, 2019.
[8] P. Roy, A. Khan, and G. Alonso, "Augmented sketch: Faster and more accurate stream processing," in ACM SIGMOD, 2016.
[9] T. Yang, J. Jiang, Y. Zhou, L. He, J. Li, B. Cui, S. Uhlig, and X. Li, "Fast and accurate stream processing by filtering the cold," VLDB J., 2019.
[10] G. Cormode and S. Muthukrishnan, "An improved data stream summary: The count-min sketch and its applications," J. Algorithms, 2004.
[11] N. Hua, B. Lin, J. J. Xu, and H. C. Zhao, "BRICK: A novel exact active statistics counter architecture," in ACM/IEEE ANCS, 2008.
[12] T. Yang, Y. Zhou, H. Jin, S. Chen, and X. Li, "Pyramid sketch: A sketch framework for frequency estimation of data streams," 2017, code available: https://github.com/zhouyangpkuer/Pyramid_Sketch_Framework.
[13] C. Estan and G. Varghese, "New directions in traffic measurement and accounting," ACM SIGCOMM, 2002.
[14] M. Charikar, K. Chen, and M. Farach-Colton, "Finding frequent items in data streams," in EATCS ICALP, 2002.
[15] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman, "One sketch to rule them all: Rethinking network flow monitoring with UnivMon," in ACM SIGCOMM, 2016.
[16] R. B. Basat, G. Einziger, M. Mitzenmacher, and S. Vargaftik, "Faster and more accurate measurement through additive-error counters," in IEEE INFOCOM, 2020, code: https://github.com/additivecounters/AEE.
[17] J. Gong, T. Yang, Y. Zhou, D. Yang, S. Chen, B. Cui, and X. Li, "ABC: A practicable sketch framework for non-uniform multisets," 2017.
[18] Z. Liu, R. Ben-Basat, G. Einziger, Y. Kassner, V. Braverman, R. Friedman, and V. Sekar, "NitroSketch: Robust and general sketch-based monitoring in software switches," in ACM SIGCOMM, 2019.
[19] R. Ben Basat, X. Chen, G. Einzinger, and O. Rottenstreich, "Efficient measurement on programmable switches using probabilistic recirculation," in IEEE ICNP, 2018.
[20] L. Yang, W. Hao, P. Tian, D. Huichen, L. Jianyuan, and L. Bin, "CASE: Cache-assisted stretchable estimator for high speed per-flow measurement," in IEEE INFOCOM, 2016.
[21] T. Li, S. Chen, and Y. Ling, "Per-flow traffic measurement through randomized counter sharing," IEEE/ACM Trans. on Networking, 2012.
[22] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani, "Counter braids: A novel counter architecture for per-flow measurement," in ACM SIGMETRICS, 2008.
[23] M. Chen and S. Chen, "Counter tree: A scalable counter architecture for per-flow traffic measurement," in IEEE ICNP, 2015.
[24] E. Tsidon, I. Hanniel, and I. Keslassy, "Estimators also need shared values to grow together," in IEEE INFOCOM, 2012.
[25] C. Hu and B. Liu, "Self-tuning the parameter of adaptive non-linear sampling method for flow statistics," in CSE, 2009.
[26] R. Morris, "Counting large numbers of events in small registers," Commun. ACM, 1978.
[27] G. Einziger and R. Friedman, "A formal analysis of conservative update based approximate counting," in ICNC, 2015.
[28] V. Braverman and R. Ostrovsky, "Zero-one frequency laws," in ACM STOC, 2010.
[29] P. Garcia-Teodoro, J. E. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, "Anomaly-based network intrusion detection: Techniques, systems and challenges," Computers and Security, 2009.
[30] T. Yang, J. Jiang, P. Liu, Q. Huang, J. Gong, Y. Zhou, R. Miao, X. Li, and S. Uhlig, "Elastic sketch: Adaptive and fast network-wide measurements," in Proc. of ACM SIGCOMM, 2018.
[31] K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor, "A linear-time probabilistic counting algorithm for database applications," ACM Transactions on Database Systems (TODS), 1990.
[32] R. B. Basat, G. Einziger, and R. Friedman, "Fast flow volume estimation," Pervasive and Mobile Computing, 2018, code available: https://github.com/ranbenbasat/FAST.
[33] G. Cormode, "Implementation of heavy hitter algorithms." [Online]. Available: http://hadjieleftheriou.com/frequent-items/
[34] J. Teuhola, "Interpolative coding of integer sequences supporting log-time random access," Information Processing & Management, 2011.
[35] A. Elmasry, J. Katajainen, and J. Teuhola, "Improved address-calculation coding of integer arrays," in SPIRE, 2012.
[36] S. Cohen and Y. Matias, "Spectral Bloom filters," in ACM SIGMOD, 2003.
[37] G. S. Manku and R. Motwani, "Approximate frequency counts over data streams," in VLDB, 2002.
[38] B. Manes, "Caffeine: A high performance caching library for Java 8," https://github.com/ben-manes/caffeine.
[39] N. Ivkin, R. B. Basat, Z. Liu, G. Einziger, R. Friedman, and V. Braverman, "I know what you did last summer: Network monitoring using interval queries," in ACM SIGMETRICS, 2020.
[40] "The CAIDA Equinix-NewYork packet trace, 20181220-130000," 2018.
[41] "The CAIDA Equinix-Chicago packet trace, 20160406-130000," 2016.
[42] T. Benson, A. Akella, and D. A. Maltz, "Network traffic characteristics of data centers in the wild," in ACM IMC.
[43]–[44] IEEE INFOCOM, 2016.
[45] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, "Sketch-based change detection: Methods, evaluation, and applications," in ACM IMC, 2003.
[46] Student, "The probable error of a mean," Biometrika, 1908.
[47] J. Qi, W. Li, T. Yang, D. Li, and H. Li, "Cuckoo counter: A novel framework for accurate per-flow frequency estimation in network measurement," in ACM/IEEE ANCS, 2019.
[48] G. Cormode and M. Hadjieleftheriou, "Methods for finding frequent items in data streams," J. VLDB, 2010.
[49] D. E. Eisenbud, C. Yi, C. Contavalli, C. Smith, R. Kononov, E. Mann-Hielscher, A. Cilingiroglu, B. Cheyney, W. Shang, and J. D. Hosein, "Maglev: A fast and reliable software network load balancer," in USENIX NSDI, 2016.
[50] G. Cormode and H. Jowhari, "Lp samplers and their applications: A survey," ACM Comput. Surv., 2019.
[51] M. Tirmazi, R. Ben Basat, J. Gao, and M. Yu, "Cheetah: Accelerating database queries with switch pruning," in ACM SIGMOD, 2020.
[52] "Apache Spark 3 support for Count-Min Sketch," 2020, https://spark.apache.org/docs/3.0.0-preview/api/scala/org/apache/spark/util/sketch/CountMinSketch.html.

APPENDIX A
IMPROVED ENCODING
Here, we lower bound the space required to encode SALSA, and then suggest a near-optimal encoding that has O(1)-time operations.

Lower bound.
For n ∈ ℕ, we denote by a_n the number of possible layouts for a consecutive block of 2^n · s bits (i.e., a block that started from 2^n counters of size s bits each). For example, we have a_2 = 5, since the possible combinations for four consecutive counters are ⟨{a}, {b}, {c}, {d}⟩, ⟨{a, b}, {c}, {d}⟩, ⟨{a}, {b}, {c, d}⟩, ⟨{a, b}, {c, d}⟩, and ⟨{a, b, c, d}⟩. Observe that either all 2^n counters are merged together, or it is enough to specify the layouts of the first 2^{n−1} counters and the last 2^{n−1} counters. Therefore, we get the recursive relation a_n = a_{n−1}^2 + 1 with a_0 = 1. Given that we start from w counters, this implies that the number of possible layouts is lower bounded by a_{⌊log w⌋}.

Lemma A.1. ∀n ∈ ℕ: ⌊c^{2^n}⌋ ≤ a_n < c^{2^n} for a constant c ≈ 1.502.

Proof. The inequality is easy to verify for small n. By induction, one can then prove it for all larger n.

This suggests a lower bound of ⌈log a_{⌊log w⌋}⌉ ≈ 2^{⌊log w⌋} · log c bits. Specifically, for w values which are powers of 2, any encoding must use at least log c ≈ 0.587 bits per counter.

Fig. 18: An encoding example for m = 5 (2^5 = 32 counters). The layout is encoded by X_5 = 449527 < a_5. To compute the size of counter 8, we first check that X_5 < a_5 − 1, and thus not all counters are merged. Next, we have that X_4 = ⌊X_5/a_4⌋ = 663 < a_4 − 1, and thus counters 0–15 are not all merged. Then, we check that X'_3 = X_4 mod a_3 = 13 < a_3 − 1, which means that counters 8–15 are not all merged. We continue with X_2 = ⌊X'_3/a_2⌋ = 2 < a_2 − 1 (thus 8–11 are not all merged) and finally get X_1 = ⌊X_2/a_1⌋ = 1 = a_1 − 1, which implies that counter 8 is merged with counter 9.

Near-optimal encoding.
Denote by m the maximal number of merges a single counter may go through during the execution. We note that m = O(1) as we assumed the final counters must fit into O(1) machine words. For example, if we start from s = 2-bit counters and assume that counters grow up to 128 bits, then m = 6. Intuitively, we encode every 2^m counters separately, thereby allowing O(2^m) = O(1)-time size computation. According to Lemma A.1, z_m ≜ ⌈log a_m⌉ bits are enough to encode the counter-set layout; for example, z_5 = 19 bits are enough to encode the layout of 2^5 = 32 counters. Specifically, for n = m, m−1, ..., 1, we write a z_n-bit value X_n such that X_n = a_n − 1 means that all 2^n counters are merged, and otherwise X_{n−1} ≜ ⌊X_n/a_{n−1}⌋ encodes the layout of the first 2^{n−1} counters while X'_{n−1} ≜ X_n mod a_{n−1} encodes the layout of the rest (i.e., they are the base-a_{n−1} digits of X_n). As a result, we use z_m bits for each consecutive set of 2^m counters, giving an overhead of z_m/2^m. For n ≥ 5, we have that z_n/2^n < 0.6, i.e., we require at most 0.6 overhead bits per counter. Computing the size of a counter then becomes simple: we start from n = m, and at every step either check if the value is a_n − 1 or recurse into the left or right half depending on the counter index. An example of this process is illustrated in Figure 18. While this approach reduces the overhead, the decoding process involves division and modulo operations that may reduce the speed.

APPENDIX B
UNDERSTANDING THE DIFFERENCES – EXTENDED RESULTS
For completeness, we repeat the experiment of Figure 6 using even smaller counters. The results are shown in Figures 19 and 20. We measured the error on all heavy hitters, i.e., elements larger than a φ fraction of the input.

Fig. 19: Running CMS with a small number of bits and the "0" algorithm for estimating heavy hitter sizes (2MB) using the average relative error metric. The leftmost point corresponds to the standard ARE metric (used in Figures 8g and 8h), which considers all flows.

Fig. 20: Running CMS with a small number of bits and the "0" algorithm for estimating heavy hitter sizes (2MB) using the average absolute error metric. The leftmost point corresponds to the standard AAE metric (used in Figures 8e and 8f), which considers all flows.

The leftmost point of Figure 19 corresponds to the ARE metric (i.e., all flows are considered). As shown, in this case the best "algorithm" is 0, which corresponds to returning an estimate of 0 for all element sizes. That is, according to this metric, one can reduce the error by not running measurements at all. A similar result was observed for AAE in Figure 20.
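The threshold-restricted metric used in these experiments can be sketched as follows. This is our own illustrative helper, not the paper's code: for a threshold φ, the ARE is averaged only over elements whose true frequency is at least φ · N, and for φ small enough to admit every flow it coincides with the standard ARE. It also makes the "0" observation concrete: the all-zero estimator has relative error exactly 1 on every flow, so a sketch that grossly overestimates small flows can score worse.

```python
def are_above_threshold(true_freq, est_freq, phi):
    """Average Relative Error over heavy hitters (true frequency >= phi * N)."""
    n = sum(true_freq.values())
    hh = [x for x, f in true_freq.items() if f >= phi * n]
    if not hh:
        return None  # no element passes the threshold
    return sum(abs(est_freq.get(x, 0) - true_freq[x]) / true_freq[x]
               for x in hh) / len(hh)

# Toy frequencies (hypothetical): one elephant flow and two mice.
true_freq = {"elephant": 900, "mouse1": 2, "mouse2": 1}
zero_est = {}                                        # the "0" algorithm
inflated = {"elephant": 905, "mouse1": 40, "mouse2": 39}  # tiny-counter-style overestimates
```

Restricting the average to large φ keeps only the elephant, where the inflated estimates are relatively accurate; at the smallest φ the mice dominate the average and the zero estimator wins.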