The Online Event-Detection Problem
Michael A. Bender, Jonathan W. Berry, Martin Farach-Colton, Rob Johnson, Thomas M. Kroeger, Prashant Pandey, Cynthia A. Phillips, Shikha Singh
∗ Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-2424 USA. Email: [email protected].
† MS 1326, PO Box 5800, Albuquerque, NM 87185 USA. Email: {jberry, caphill}@sandia.gov.
‡ Department of Computer Science, Rutgers University, Piscataway, NJ 08854 USA. Email: [email protected].
§ VMware Research, Creekside F, 3425 Hillview Ave, Palo Alto, CA 94304 USA. Email: [email protected].
¶ MS 9011, PO Box 969, Livermore, CA 94551 USA. Email: [email protected].
‖ Department of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA. Email: [email protected].
∗∗ Department of Computer Science, Wellesley College, Wellesley, MA 02481-8203 USA. Email: [email protected].

Abstract

Given a stream $S = (s_1, s_2, \ldots, s_N)$, a $\phi$-heavy hitter is an item $s_i$ that occurs at least $\phi N$ times in $S$. The problem of finding heavy hitters has been extensively studied in the database literature. In this paper, we study a related problem. We say that there is a $\phi$-event at time $t$ if $s_t$ occurs exactly $\lceil \phi N \rceil$ times in $(s_1, s_2, \ldots, s_t)$. Thus, for each $\phi$-heavy hitter there is a single $\phi$-event, which occurs when its count reaches the reporting threshold $T = \lceil \phi N \rceil$. We define the online event-detection problem (oedp) as: given $\phi$ and a stream $S$, report all $\phi$-events as soon as they occur.

Many real-world monitoring systems demand event detection where all events must be reported (no false negatives), in a timely manner, with no non-events reported (no false positives), and with a low reporting threshold. As a result, the oedp requires a large amount of space ($\Omega(N)$ words) and is not solvable in the streaming model or via standard sampling-based approaches. Since the oedp requires large space, we focus on cache-efficient algorithms in the external-memory model.

We provide algorithms for the oedp that are within a log factor of optimal. Our algorithms are tunable: their parameters can be set to allow for bounded false positives and a bounded delay in reporting. None of our relaxations allow false negatives, since reporting all events is a strict requirement for our applications. Finally, we show improved results when the count of items in the input stream follows a power-law distribution.

1 Introduction

Real-time monitoring of high-rate data streams, with the goal of detecting and preventing malicious events, is a critical component of defense systems for cybersecurity [47, 39, 50] and physical systems, such as water or power distribution [15, 36, 40]. In such a monitoring system, changes of state are inferred from the stream elements. Each detected/reported event triggers an intervention. Analysts use more specialized tools to gauge the actual threat level. Newer systems are even beginning to take defensive actions, such as blocking a remote host, automatically based on detected events [43, 34]. When used in an automated system, accuracy (i.e., few false positives and no false negatives) and timeliness of event detection are essential.

Motivated by these applications, we define and study the online event-detection problem (oedp). Roughly speaking, the oedp seeks to report all anomalous events (events that cross a predetermined safety threshold) as soon as they occur in the input stream. The related problem of finding the most frequent elements or heavy hitters in streams has been extensively studied in the database literature [29, 30, 4, 24, 19, 18, 38, 16, 32, 42, 17, 31, 14]. More formally, given a stream $S = (s_1, s_2, \ldots, s_N)$, a $\phi$-heavy hitter is an element $s$ that occurs at least $\phi N$ times in $S$. Here we focus on the problem of finding $\phi$-events, where we say that there is a $\phi$-event at time step $t$ if $s_t$ occurs exactly $\lceil \phi N \rceil$ times in $(s_1, s_2, \ldots, s_t)$. Thus, for each $\phi$-heavy hitter there is a single $\phi$-event, which occurs when its count reaches the reporting threshold $T = \lceil \phi N \rceil$.

Formally, we define the online event-detection problem (oedp) as: given a stream $S = (s_1, s_2, \ldots, s_N)$, for each $i \in [1, N]$, report if there is a $\phi$-event at time $i$ before seeing $s_{i+1}$. A solution to the online event-detection problem must report
(a) all events (no False Negatives),
(b) with no non-events and no duplicates (no False Positives),
(c) as soon as an element crosses the threshold (Online).
Furthermore, an online event detector must scale to
(d) small reporting thresholds $T$ and large $N$, i.e., very small $\phi$ (Scalable).

In this paper, we present algorithms for the oedp. We also give solutions that relax conditions (b) and (c). However, our solutions are motivated by cybersecurity applications where (a) and (d) are strict requirements. Next, we discuss how each of these conditions relates to our approach and results. See Section 6 for more details about the application that motivates the oedp and its constraints.
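To pin down these reporting semantics, here is a minimal reference implementation (illustrative Python of our own; not one of this paper's algorithms). It keeps an exact counter per item and thus uses $\Omega(N)$ words in the worst case, which is precisely the cost that the rest of the paper makes I/O-efficient.

```python
import math
from collections import defaultdict

def oedp_naive(stream, phi):
    """Reference semantics for the oedp: report each phi-event as it occurs.

    Keeps an exact counter per item, so it uses Omega(N) words in the worst
    case -- a baseline for the semantics, not a scalable solution.
    """
    N = len(stream)
    threshold = math.ceil(phi * N)  # reporting threshold T
    counts = defaultdict(int)
    for t, item in enumerate(stream, start=1):
        counts[item] += 1
        # A phi-event: the count reaches exactly T, so each heavy hitter is
        # reported exactly once (no duplicates), immediately (online).
        if counts[item] == threshold:
            yield (t, item)

# Example: with phi = 0.3 and N = 10, T = 3, so 'a' is reported at its
# third occurrence (t = 5) and 'b' at its third occurrence (t = 9).
print(list(oedp_naive(["a","b","a","c","a","b","a","a","b","c"], 0.3)))
```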
No false negatives.
We are motivated by monitoring systems for national security [1, 5]. The events in this context have especially high consequences, so it is worth investing extra resources to detect them. We therefore do not allow false negatives (i.e., condition (a) is strict); see Section 6 for more details. This rules out sampling-based approaches for the oedp, which necessarily incur false negatives.
Scalability.
Scalability (condition (d)) is essential in the broader context of detecting anomalies in network streams, since anomalies are often small-sized events that develop slowly, appearing normal in the midst of large amounts of legitimate traffic [41, 49]. As an example of the demands placed on event-detection systems, the US Department of Defense (DoD) and Sandia National Laboratories developed the FireHose streaming benchmark suite [1, 5] to measure the performance of oedp algorithms. In the FireHose benchmark, the reporting threshold is preset to the representative value $T = 24$, which translates to $\phi = 24/N = o(1)$; thus a DoD benchmark enforces condition (d).

Scalable solutions to the oedp require a large amount of space, ruling out the streaming model [6, 51], where the available memory is small (usually just $\mathrm{polylog}(N)$). In particular, streaming algorithms for the heavy-hitters problem assume $\phi > 1/\mathrm{polylog}(N)$, so that all candidates fit in memory. Even if some false positives are allowed, as in the $(\varepsilon, \phi)$-heavy hitters problem (given a stream of size $N$ and $1/N \leq \varepsilon < \phi \leq 1$, report every item that occurs at least $\phi N$ times and no item that occurs at most $(\phi - \varepsilon)N$ times; it is optional whether to report items with counts between $(\phi - \varepsilon)N$ and $\phi N$; we drop the $\phi$ when it is understood), Bhattacharyya et al. [16] proved a lower bound of $(1/\varepsilon)\log(1/\phi) + (1/\phi)\log|U| + \log\log N$ bits. Thus the space requirement is large when $\varepsilon$ is small, as is the case for Scalable solutions where $\phi$ is small, since $\varepsilon < \phi$.

Bounded false positives.
Our algorithms for the oedp are tunable: parameters can be set to allow bounded false positives (relaxing condition (b)). We show that allowing some false positives results in fewer I/Os per element.

Allowing False Positives does not, however, lead to substantial space savings. If we allow $O(1 + \beta t)$ false positives in a stream with $t$ true positives, for any constant $\beta$, a bound of $\Omega(N \log N)$ bits follows via a standard communication-complexity reduction from the probabilistic indexing problem [48, 37]. Moreover, as argued above, Scalable solutions to the heavy-hitters problem require large space even when False Positives are allowed.
Bounded reporting delay.
The national-security monitoring systems we are interested in (see Section 6) can tolerate a slight delay in reporting when the high-risk event gives sufficient warning for intervention. We show that allowing a bounded delay in reporting (relaxing condition (c)) allows us to circumvent the lower bounds on $\phi$ imposed by our online solution. Thus, bounded delay is especially desirable when we want our oedp algorithm to scale to arbitrarily small reporting thresholds.

Finally, we note that in a security setting like ours, all events need to be detected in real time to mitigate the associated risk. Thus, streaming algorithms for the heavy-hitters problem that require multiple passes over the data are not applicable.

1.1 Online Event Detection in External Memory
In this paper, we make the large space requirement ($\Omega(N)$ words) of the oedp more palatable by shifting most of the storage from expensive RAM to lower-cost external storage, such as SSDs or hard drives. In particular, we give cache-efficient algorithms for the oedp in the external-memory model. In the external-memory model, RAM has size $M$, storage has unbounded size, and any I/O access to external memory transfers blocks of size $B$. Typically, blocks are large, i.e., $B > \log N$ [33, 3].

At first, it may appear trivial to detect heavy hitters using external memory: we can store the entire stream, so what is there to solve? And this would be true in an offline setting: we could find all events by logging the stream to disk and then sorting it.

The technical challenge to online event detection in external memory is that searches are slow. A straw-man solution is to maintain an external-memory dictionary to keep track of the count of every item, and to query the dictionary after each stream item arrives. But this approach is bottlenecked on dictionary searches. In a comparison-based dictionary, queries take $\Omega(\log_B N)$ I/Os, and there are many data structures that match this bound [26, 7, 22, 9]. This yields an I/O complexity of $O(N \log_B N)$. Even if we use external-memory hashing, queries still take $\Omega(1)$ I/Os, which still gives a complexity of $\Omega(N)$ I/Os [35, 27]. Both of these solutions are bottlenecked on the latency of storage, which is far too slow for stream processing.

Data ingestion is not the bottleneck in external memory. Optimal external-memory dictionaries (including write-optimized dictionaries such as $B^\varepsilon$-trees [22, 11], COLAs [10], xDicts [21], buffered repository trees [23], write-optimized skip lists [13], log-structured merge trees [46], and optimal external-memory hash tables [35, 27]) can perform inserts and deletes extremely quickly. The fastest can index using $O\left(\frac{1}{B}\log\frac{N}{M}\right)$ I/Os per stream element, which is far less than one I/O per item. In practice, this means that even a system with just a single disk can ingest hundreds of thousands of items per second. For example, at Supercomputing 2017, a single computer was easily able to maintain a $B^\varepsilon$-tree [22] index of all connections on a 600 gigabit/sec network [8]. The system could also efficiently answer offline queries. What the system could not do, however, was detect events online.

In this paper, we show how to achieve online (or nearly online) event detection for essentially the same cost as simply inserting the data into a $B^\varepsilon$-tree or other optimal external-memory dictionary.
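For intuition, here is a back-of-the-envelope comparison of the straw-man (one dictionary query per stream item) against buffered, write-optimized ingestion. The parameters below are purely illustrative placeholders, not measurements.

```python
import math

# Hypothetical parameters (illustrative only): stream length, items that
# fit in RAM, and items per disk block.
N = 10**9
M = 2**25
B = 2**12

# Straw-man: Omega(1) I/Os per item for an external-memory hash-table query.
straw_man_total = 1.0 * N

# Write-optimized ingestion: O(log(N/M)/B) amortized I/Os per item.
ingest_per_item = math.log2(N / M) / B
ingest_total = ingest_per_item * N

print(f"straw-man queries : ~{straw_man_total:.2e} I/Os")
print(f"buffered ingestion: ~{ingest_total:.2e} I/Os "
      f"({ingest_per_item:.2e} I/Os per item)")
```

With these placeholder values the buffered structure performs roughly a thousand times fewer I/Os, which is the gap between storage latency and storage bandwidth that the rest of the paper exploits.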
1.2 Results

As our main result, we present an external-memory algorithm that solves the oedp, for $\phi$ that is sufficiently large, at an amortized I/O cost that is substantially cheaper than performing one query for each item in the stream.

Result 1. Given a stream $S$ of size $N$ and $\phi > 1/M + \Omega(1/N)$, the online event-detection problem can be solved at an amortized cost of $O\left(\left(\frac{1}{B} + \frac{M}{(\phi M - 1)N}\right)\log\frac{N}{M}\right)$ I/Os per stream item.

To put this in context, suppose that $\phi > 2/M$ and ($\phi > 2B/N$ or $N > MB$). Then the I/O cost of solving the oedp is $O\left(\frac{1}{B}\log\frac{N}{M}\right)$, which is only a logarithmic factor larger than the naïve scanning lower bound. In this case, we eliminate the query bottleneck and match the data-ingestion rate of $B^\varepsilon$-trees.

Our algorithm builds on the classic Misra-Gries algorithm [28, 44], and thus supports its generalizations. In particular, similar to the $(\varepsilon, \phi)$-heavy hitters problem, our algorithm can also be relaxed so that items with frequency between $(\phi - \varepsilon)N$ and $\phi N$ may be reported. Allowing false positives lowers the amortized I/O cost to $O\left(\left(\frac{1}{B} + \frac{M}{(\phi M - 1)N}\right)\log\frac{1}{\varepsilon M}\right)$; see Theorem 1. For the oedp (i.e., no False Positives), we set $\varepsilon = 1/N$.

Next, we show that, by allowing a bounded delay in reporting, we can extend this result to arbitrarily small $\phi$. Intuitively, we allow the reporting delay for an event $s_t$ to be proportional to the time it took for the element $s_t$ to go from 1 to $\phi N$ occurrences. More formally, for a $\phi$-event $s_t$, define the flow time of $s_t$ to be $F_t = t - t_0$, where $t_0$ is the time step of $s_t$'s first occurrence. We say that an event-detection algorithm has time stretch $\alpha$ if it reports each event $s_t$ at or before time $t + \alpha F_t = t_0 + (1 + \alpha)F_t$.

Result 2.
Given a stream $S$ of size $N$ and $\alpha > 0$, the oedp can be solved for any $\phi \geq 1/N$ with time stretch $\alpha$ at an amortized cost of $O\left(\frac{\alpha+1}{\alpha}\cdot\frac{\log(N/M)}{B}\right)$ I/Os per stream item.
For constant $\alpha$, this is asymptotically as fast as simply ingesting and indexing the data [22, 10, 23]. This algorithm can also be relaxed to allow false positives and achieve an improved I/O complexity. Thus, this result yields an almost-online solution to the $(\varepsilon, \phi)$-heavy hitters problem for arbitrarily small $\varepsilon$ and $\phi$; see Theorem 2.

Finally, we consider input distributions where the counts of items are drawn from a power-law distribution. Berinde et al. [14] show that the Misra-Gries algorithm gives improved guarantees for the heavy-hitters problem when the input follows a Zipfian distribution with exponent $\alpha > 1$. The item counts in a stream follow a Zipfian distribution with exponent $\alpha$ if and only if they follow a power-law distribution with exponent $\theta = 1 + 1/\alpha$ [2]. (Zipf and power-law distributions are often used interchangeably in the literature; they are different ways to model the same phenomenon; see [2] and Section 5 for details.) As our algorithms are based on Misra-Gries, we automatically get the same improvements when the power-law exponent satisfies $\theta \leq 2$, i.e., $\alpha \geq 1$.

We also give an algorithm for the oedp that supports a smaller threshold $\phi$ than in Result 1 and achieves a better I/O complexity when the counts of items in the stream follow a power-law distribution with exponent $\theta > 2 + 1/(\log(N/M))$. For a representative specification of a 1TB hard drive and 32GB of RAM, our algorithm is performant for Zipfian distributions with $\alpha \leq 0.94$, a range that is frequently observed in practical data [25, 14, 45, 20, 2].

Result 3. Given a stream $S$ of size $N$, where the counts of items follow a power-law distribution with exponent $\theta > 2$, and $\phi = \Omega(\gamma/N)$, where $\gamma = 2(N/M)^{1/(\theta-1)}$, the oedp can be solved at an amortized I/O complexity of $O\left(\left(\frac{1}{B} + \frac{1}{(\phi N - \gamma)^{\theta-1}}\right)\log\frac{N}{M}\right)$ per stream item.

In contrast to the worst-case solution (Result 1), Result 3 allows thresholds $\phi$ smaller than $1/M$ and an improved I/O complexity when the power-law exponent $\theta > 2 + 1/(\log(N/M))$. (This is because $\gamma/N < 1/M$ in this case; see Section 5 for details.)

2 The Misra-Gries Algorithm

This section reviews the Misra-Gries heavy-hitters algorithm [44], a building block of our algorithms in Section 3 and Section 4.
The Misra-Gries frequency estimator.
The Misra-Gries (MG) algorithm estimates the frequency of items in a stream. Given an estimation error bound $\varepsilon$ and a stream $S$ of $N$ items from a universe $U$, the MG algorithm uses a single pass over $S$ to construct a table $C$ with at most $\lceil 1/\varepsilon \rceil$ entries. Each table entry is an item $s \in U$ with a count, denoted $C[s]$. For each $s \in U$ not in table $C$, let $C[s] = 0$. Let $f_s$ be the number of occurrences of item $s$ in stream $S$. The MG algorithm guarantees that $C[s] \leq f_s < C[s] + \varepsilon N$ for all $s \in U$.

MG initializes $C$ to an empty table and then processes the items in the stream one after another, as described below. For each $s_i$ in $S$:
• If $s_i \in C$, increment counter $C[s_i]$.
• If $s_i \notin C$ and $|C| < \lceil 1/\varepsilon \rceil$, insert $s_i$ into $C$ and set $C[s_i] \leftarrow 1$.
• If $s_i \notin C$ and $|C| = \lceil 1/\varepsilon \rceil$, then for each $x \in C$, decrement $C[x]$ and delete its entry if $C[x]$ becomes 0.

We now argue that $C[s] \leq f_s < C[s] + \varepsilon N$. We have $C[s] \leq f_s$ because $C[s]$ is incremented only for an occurrence of $s$ in the stream. MG underestimates counts only through the decrements in the third case above. This step decrements $\lceil 1/\varepsilon \rceil + 1$ counts at once: the item $s_i$ that caused the decrement (since it is never added to the table) and each item in the table. There can be at most $\lfloor N/(\lceil 1/\varepsilon \rceil + 1) \rfloor < \varepsilon N$ executions of this decrement step in the algorithm. Thus, $f_s < C[s] + \varepsilon N$.
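As a concrete reference, here is the one-pass estimator in Python (illustrative code with our own function names), together with the reporting rule used for the $(\varepsilon, \phi)$-heavy hitters problem described next.

```python
import math

def misra_gries(stream, eps):
    """One-pass Misra-Gries frequency estimator with ceil(1/eps) counters.

    Guarantees C[s] <= f_s < C[s] + eps * N for every item s.
    """
    k = math.ceil(1 / eps)  # table capacity
    C = {}
    for s in stream:
        if s in C:
            C[s] += 1
        elif len(C) < k:
            C[s] = 1
        else:
            # Decrement all k counters (the arriving item's instance is
            # implicitly decremented too); drop entries that reach zero.
            for x in list(C):
                C[x] -= 1
                if C[x] == 0:
                    del C[x]
    return C

def eps_phi_heavy_hitters(stream, eps, phi):
    """Report every item with f_s >= phi*N; never report f_s <= (phi-eps)*N."""
    N = len(stream)
    C = misra_gries(stream, eps)
    return [s for s, c in C.items() if c > (phi - eps) * N]
```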
The (ε, φ)-heavy hitters problem.
The MG algorithm can be used to solve the $(\varepsilon, \phi)$-heavy hitters problem, which requires us to report all items $s$ with $f_s \geq \phi N$ and not to report any item $s$ with $f_s \leq (\phi - \varepsilon)N$. Items that occur strictly between $(\phi - \varepsilon)N$ and $\phi N$ times in $S$ are neither required nor forbidden in the reported set.

To solve the problem, run the MG algorithm on the stream with error parameter $\varepsilon$. Then iterate over the set $C$ and report any item $s$ with $C[s] > (\phi - \varepsilon)N$. Correctness follows because (1) if $f_s \leq (\phi - \varepsilon)N$, then $s$ will not be reported, since $C[s] \leq f_s \leq (\phi - \varepsilon)N$, and (2) if $f_s \geq \phi N$, then $s$ will be reported, since $C[s] > f_s - \varepsilon N \geq \phi N - \varepsilon N$.

Approximate online-event detection.
Analogous to the $(\varepsilon, \phi)$-heavy hitters problem, we define the approximate oedp as:
• Report all $\phi$-events $s_t$ at time $t$.
• Do not report any item $s_i$ with count at most $(\phi - \varepsilon)N$.
• Items with count greater than $(\phi - \varepsilon)N$ and less than $\phi N$ are neither required nor forbidden from being reported.

All the errors with respect to the oedp in the $(\varepsilon, \phi)$-heavy hitters problem and the approximate oedp are false positives, that is, non-events (items with frequency between $(\phi - \varepsilon)N$ and $\phi N$) that get reported as $\phi$-events. No false negatives are allowed, as all $\phi$-heavy hitters and $\phi$-events must be reported. In the rest of the paper, the term error refers only to false-positive errors.
Space usage of the MG algorithm.
For a frequency-estimation error of $\varepsilon$, Misra-Gries uses $O(\lceil 1/\varepsilon \rceil)$ words of storage, assuming each stream item and each count occupy $O(1)$ words. Bhattacharyya et al. [16] showed that, by using hashing, sampling, and allowing a small probability of error, Misra-Gries can be extended to solve the $(\varepsilon, \phi)$-heavy hitters problem using $1/\phi$ slots that store counts and an additional $(1/\varepsilon)\log(1/\phi) + \log\log N$ bits, which they show is optimal. For the exact $\phi$-heavy hitters problem, that is, for $\varepsilon = 1/N$, the space requirement is large: $N$ slots. Even the optimal algorithm of Bhattacharyya et al. uses $\Omega(N)$ bits of storage in this case, regardless of $\phi$.

3 Event Detection in External Memory

In this section, we design an efficient external-memory version of the core Misra-Gries frequency estimator. This immediately gives an efficient external-memory algorithm for the $(\varepsilon, \phi)$-heavy hitters problem. We then extend our external-memory Misra-Gries algorithm to support I/O-efficient immediate event reporting, e.g., for online event detection.

When $\varepsilon = o(1/M)$, simply running the standard Misra-Gries algorithm can result in a cache miss for every stream element, incurring an amortized cost of $\Omega(1)$ I/Os per element. Our construction reduces this to $O\left(\frac{\log(1/(\varepsilon M))}{B}\right)$, which is $o(1)$ when $B = \omega\left(\log\frac{1}{\varepsilon M}\right)$.

3.1 An External-Memory Misra-Gries Sketch

Our external-memory Misra-Gries data structure is a sequence of Misra-Gries tables $C_0, \ldots, C_{L-1}$, where $L = 1 + \lceil \log_r(1/(\varepsilon M)) \rceil$ and $r > 1$ is a parameter we set later. The size of the table $C_i$ at level $i$ is $r^i M$, so the size of the last level is at least $1/\varepsilon$.

Each level acts as a Misra-Gries data structure. Level 0 receives the input stream. Level $i > 0$ receives its input from level $i - 1$, the level above. Whenever the standard Misra-Gries algorithm running on the table $C_i$ at level $i$ would decrement an item's count, the new data structure decrements that item's count by one on level $i$ and sends one instance of that item to the level below ($i + 1$).

The external-memory MG algorithm processes the input stream by inserting each item in the stream into $C_0$. To insert an item $x$ into level $i$, do the following:
• If $x \in C_i$, then increment $C_i[x]$.
• If $x \notin C_i$ and $|C_i| \leq r^i M - 1$, then set $C_i[x] \leftarrow 1$.
• If $x \notin C_i$ and $|C_i| = r^i M$, then, for each $x'$ in $C_i$, decrement $C_i[x']$ and remove it from $C_i$ if $C_i[x']$ becomes 0. If $i < L - 1$, recursively insert $x'$ into $C_{i+1}$.

We call the process of decrementing the counts of all the items at level $i$ and incrementing all the corresponding item counts at level $i + 1$ a flush.
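The cascade is easy to state in code. The sketch below simulates it in memory, one dict per level (class and method names are ours); an I/O-efficient implementation would back each level with a B-tree or sorted array and perform flushes in bulk, as analyzed below.

```python
class CascadeMG:
    """In-memory simulation of the external-memory Misra-Gries cascade.

    Level i is an MG table of capacity r^i * M; a decrement at level i
    sends one instance of the decremented item down to level i + 1, so
    no count is lost before the last level.
    """

    def __init__(self, M, r, L):
        self.L = L
        self.caps = [M * r**i for i in range(L)]   # |C_i| <= r^i * M
        self.levels = [dict() for _ in range(L)]   # C_0, ..., C_{L-1}

    def insert(self, x, i=0):
        C = self.levels[i]
        if x in C:
            C[x] += 1
        elif len(C) < self.caps[i]:
            C[x] = 1
        else:
            # Flush: decrement every item at level i (the arriving instance
            # of x is dropped, as in classic MG); each decremented instance
            # cascades one level down.
            for y in list(C):
                C[y] -= 1
                if C[y] == 0:
                    del C[y]
                if i < self.L - 1:
                    self.insert(y, i + 1)

    def estimate(self, x):
        # \hat{C}_{L-1}[x]: the sum of x's counts over all levels.
        return sum(C.get(x, 0) for C in self.levels)
```

Here flushes are simulated one item at a time; the real structure rebuilds whole levels so that a flush from level $i$ costs $O(r^{i+1}M/B)$ I/Os, which is what Lemma 2 below charges.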
Correctness.
We first show that the external-memory MG algorithm still meets the guarantees of the Misra-Gries frequency-estimation algorithm. In fact, we show that every prefix of levels $C_0, \ldots, C_j$ is a Misra-Gries frequency estimator, with the accuracy of the frequency estimates increasing with $j$.

Lemma 1. Let $\widehat{C}_j[x] = \sum_{i=0}^{j} C_i[x]$ (where $C_i[x] = 0$ if $x \notin C_i$). Then the following holds:
• $\widehat{C}_j[x] \leq f_x < \widehat{C}_j[x] + N/(r^j M)$, and
• $\widehat{C}_{L-1}[x] \leq f_x < \widehat{C}_{L-1}[x] + \varepsilon N$.

Proof. Decrementing the count for an element $x$ on a level $i < j$ and inserting it on the next level does not change $\widehat{C}_j[x]$. This means that $\widehat{C}_j[x]$ changes only when we insert an item $x$ from the input stream into $C_0$ or when we decrement the count of an element on level $j$. Thus, as in the original Misra-Gries algorithm, $\widehat{C}_j[x]$ is only incremented when $x$ occurs in the input stream, and is decremented only when the counts for $r^j M$ other elements are also decremented. Following the same arguments as for the MG algorithm, this suffices to establish the first inequality. The second inequality follows from the first and the fact that $r^{L-1} M \geq 1/\varepsilon$.
Heavy hitters.
Since our external-memory Misra-Gries data structure matches the original Misra-Gries error bounds, it can be used to solve the $(\varepsilon, \phi)$-heavy hitters problem when the regular Misra-Gries algorithm requires more than $M$ space. First, insert each element of the stream into the data structure. Then, iterate over the sets $C_i$ and report any element $x$ with counter $\widehat{C}_{L-1}[x] > (\phi - \varepsilon)N$.

I/O complexity.
We now analyze the I/O complexity of our external-memory Misra-Gries algorithm. For concreteness, we assume each level is implemented as a B-tree, although the same basic algorithm works with sorted arrays (with fractional cascading from one level to the next, similar to cache-oblivious lookahead arrays [10]) or hash tables with linear probing and a consistent hash function across levels (similar to cascade filters [12]).
Lemma 2.
For a given $\varepsilon \geq 1/N$, the amortized I/O cost of insertion in the external-memory Misra-Gries data structure is $O\left(\frac{1}{B}\log\frac{1}{\varepsilon M}\right)$.

Proof. Recall that the process of decrementing the counts of all the items at level $i$ and incrementing all the corresponding item counts at level $i + 1$ is a flush. A flush can be implemented by rebuilding the B-trees at both levels, which can be done in $O(r^{i+1}M/B)$ I/Os. Each flush from level $i$ to level $i + 1$ moves $r^i M$ stream elements down one level, so the amortized cost to move one stream element down one level is $O\left(\frac{r^{i+1}M}{B} / (r^i M)\right) = O(r/B)$ I/Os. Each stream element can be moved down at most $L$ levels. Thus, the overall amortized I/O cost of an insert is $O(rL/B) = O((r/B)\log_r(1/(\varepsilon M)))$, which is minimized at $r = e$.

When no false positives are allowed, that is, $\varepsilon = 1/N$, the I/O complexity of the external-memory MG algorithm is $O\left(\frac{1}{B}\log\frac{N}{M}\right)$.

3.2 Online Event Detection

We now extend our external-memory Misra-Gries data structure to solve the online event-detection problem. In particular, we show that for a threshold $\phi$ that is sufficiently large, we can report $\phi$-events as soon as they occur.

A first attempt to add immediate reporting to our external-memory Misra-Gries algorithm is to compute $\widehat{C}_{L-1}[s_i]$ for each stream item $s_i$ and report $s_i$ as soon as $\widehat{C}_{L-1}[s_i] > (\phi - \varepsilon)N$. However, this requires querying $C_i$ for $i = 0, \ldots, L - 1$ on every arrival, at a cost of $O(\log(1/(\varepsilon M)))$ I/Os per stream item.

We avoid these expensive queries by using the properties of the in-memory Misra-Gries frequency estimator $C_0$. If $C_0[s_i] \leq (\phi - 1/M)N$, then we know that $f_{s_i} < \phi N$, and we therefore do not have to report $s_i$, regardless of the count for $s_i$ in the lower levels on disk of the external-memory data structure.
Online event detection in external memory.
We modify our external-memory Misra-Gries algorithm to support online event detection as follows. Whenever we increment $C_0[s_i]$ from a value that is at most $(\phi - 1/M)N$ to a value that is greater than $(\phi - 1/M)N$, we compute $\widehat{C}_{L-1}[s_i]$ and report $s_i$ if $\widehat{C}_{L-1}[s_i] = \lceil (\phi - \varepsilon)N \rceil$. For each entry $C_0[x]$, we store a bit indicating whether we have performed a query for $\widehat{C}_{L-1}[x]$. As in our basic external-memory Misra-Gries data structure, if the count for an entry $C_0[x]$ becomes 0, we delete that entry. This means we might query for the same item more than once if its in-memory count crosses the $(\phi - 1/M)N$ threshold, it gets removed from $C_0$, and then its count crosses the $(\phi - 1/M)N$ threshold again. As we will see below, this has no effect on the overall I/O cost of the algorithm. (It is possible to prevent repeated queries for an item, but we allow them, as they do not hurt the asymptotic performance.)

To avoid reporting the same item more than once, we can store, with each entry in $C_i$, a bit indicating whether that item has already been reported. Whenever we report an item $x$, we set the bit in $C_0[x]$. Whenever we flush an item from level $i$ to level $i + 1$, we set the bit for that item on level $i + 1$ if it is set on level $i$. When we delete the entry for an item that has the bit set on level $L - 1$, we add an entry for that item on a new level $C_L$. This new level contains only items that have already been reported. When we are checking whether to report an item during a query, we stop checking further and omit reporting as soon as we reach a level where the bit is set. None of these changes affect the I/O complexity of the algorithm.
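In the style of the CascadeMG sketch above, the query-sparsification logic looks as follows (illustrative code; the query and reported-bit bookkeeping are elided). Only a level-0 crossing of $(\phi - 1/M)N$ triggers the expensive multi-level sum.

```python
def on_arrival(ds, x, phi, eps, N):
    """Process one stream item x in the cascade `ds` (a CascadeMG) and
    decide whether to query the disk levels.

    Only when x's level-0 count crosses (phi - 1/M) * N do we pay for a
    full estimate; otherwise Lemma 1 already rules out a phi-event.
    """
    M = ds.caps[0]
    before = ds.levels[0].get(x, 0)
    ds.insert(x)
    after = ds.levels[0].get(x, 0)
    trigger = (phi - 1.0 / M) * N
    if before <= trigger < after:              # level-0 count just crossed
        if ds.estimate(x) >= (phi - eps) * N:  # the O(L)-I/O query on disk
            return x                           # report (dedup bits elided)
    return None
```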
I/O complexity.
We assume that computing $\widehat{C}_{L-1}[x]$ requires $O(L)$ I/Os. This is true if the levels of the data structure are implemented as sorted arrays with fractional cascading. We first state the result for the approximate version of the online event-detection problem, which allows elements with frequency between $(\phi - \varepsilon)N$ and $\phi N$ to be reported as false positives. Then we set $\varepsilon = 1/N$ to get the result for the oedp.

Theorem 1.
Given a stream $S$ of size $N$ and parameters $\varepsilon$ and $\phi$, where $1/N \leq \varepsilon < \phi < 1$ and $\phi > 1/M + \Omega(1/N)$, the approximate oedp can be solved at an amortized I/O complexity of $O\left(\left(\frac{1}{B} + \frac{M}{(\phi M - 1)N}\right)\log\frac{1}{\varepsilon M}\right)$ per stream item.

Proof. Correctness follows from the arguments above; we need only analyze the I/O costs. We analyze the I/O costs of the insertions and the queries separately. The amortized cost of performing insertions is $O\left(\frac{1}{B}\log\frac{1}{\varepsilon M}\right)$.

To analyze the query costs, let $\varepsilon_0 = 1/M$, i.e., the frequency-approximation error of the in-memory level of our data structure. Since we perform at most one query each time an item's count in $C_0$ goes from 0 to $(\phi - \varepsilon_0)N$, the total number of queries is at most $N/((\phi - \varepsilon_0)N) = 1/(\phi - \varepsilon_0) = M/(\phi M - 1)$. Since each query costs $O(\log(1/(\varepsilon M)))$ I/Os, the overall amortized I/O complexity of the queries is $O\left(\frac{M}{(\phi M - 1)N}\log\frac{1}{\varepsilon M}\right)$.
Exact reporting.
If no false positives are allowed, we set $\varepsilon = 1/N$ in Theorem 1. For error-free reporting, we must store all the items, which increases the number of levels and thus the I/O cost. In particular, we have the following result on the oedp.

Corollary 1.
Given a stream $S$ of size $N$ and $\phi > 1/M + \Omega(1/N)$, the oedp can be solved at an amortized I/O complexity of $O\left(\left(\frac{1}{B} + \frac{M}{(\phi M - 1)N}\right)\log\frac{N}{M}\right)$ per stream item.

Summary.
The external-memory MG algorithm supports a throughput at least as fast as optimal write-optimized dictionaries [22, 11, 10, 21, 23, 13], while estimating the counts as well as an enormous RAM. It maintains count estimates at different granularities across the levels. Not all of these estimates are equally important for the oedp. The smallest MG sketch (which fits in memory) is the most important estimator here, because it serves to sparsify queries to the rest of the structure. When such a query gets triggered, we need the total counts from the remaining $\log\frac{N}{M}$ levels for the (exact) online event-detection problem, but only $\log\frac{1}{\varepsilon M}$ levels when approximate thresholds are permitted. In the next two sections, we exploit other advantages of this cascading technique to support much lower $\phi$ without sacrificing I/O efficiency.

4 Event Detection with Time Stretch

The external-memory Misra-Gries algorithm described in Section 3.2 reports events immediately, albeit at a higher amortized I/O cost for each stream item. In this section, we show that, by allowing a bounded delay in the reporting of events, we can perform event detection asymptotically as cheaply as if we reported all events only at the end of the stream.
Time-stretch filter.
We design a new data structure, the time-stretch filter, to guarantee time stretch. Recall that, in order to guarantee a time stretch of $\alpha$, we must report an item $x$ no later than time $t_0 + (1 + \alpha)F_t$, where $t_0$ is the time of the first occurrence of $x$ and $F_t$ is the flow time of $x$.

Similar to the external-memory MG structure, the time-stretch filter consists of $L = \log_r(1/(\varepsilon M))$ levels $C_0, \ldots, C_{L-1}$. The $i$th level has size $r^i M$. Items are flushed from lower levels to higher levels.

Unlike the data structure in Section 3.2 for the oedp, all events are detected during the flush operations. Thus, we never need to perform point queries. This means that (1) we can use simple sorted arrays to represent each level, and (2) we do not need to maintain the invariant that level 0 is a Misra-Gries data structure on its own.

Layout and flushing schedule.
We split the table at each level $i$ into $q = (\alpha + 1)/\alpha$ equal-sized bins $b_1^i, \ldots, b_q^i$, each of size $\frac{\alpha}{\alpha+1}(r^i M)$. The capacity of a bin is defined by the sum of the counts of the items in that bin; i.e., a bin at level $i$ can become full because it contains $\frac{\alpha}{\alpha+1}(r^i M)$ items, each with count 1, or one item with count $\frac{\alpha}{\alpha+1}(r^i M)$, or any other such combination.

We maintain a strict flushing schedule to obtain the time-stretch guarantee. The flushes are performed at the granularity of bins (rather than entire levels). Each stream item is inserted into $b_1^0$. Whenever bin $b_1^i$ becomes full (i.e., the sum of the counts of the items in the bin is equal to its size), we shift all the bins on level $i$ over by one (i.e., bin 1 becomes bin 2, bin 2 becomes bin 3, etc.), and we move all the items in $b_q^i$ into bin $b_1^{i+1}$. Since the bins in level $i + 1$ are $r$ times larger than the bins in level $i$, bin $b_1^{i+1}$ becomes full after exactly $r$ flushes from $b_q^i$. When this happens, we perform a flush on level $i + 1$, and so on. Starting from the beginning, every $r^{i-1}M$ elements from the stream cause a flush that involves level $i$.

Finally, during a flush involving levels $0, \ldots, i$, where $i \leq L - 1$, we scan these levels, and for each item $k$ in the input levels, we sum the counts of each instance of $k$. If the total count is greater than $(\phi - \varepsilon)N$ and we have not reported $k$ before (we set a flag on each reported item to avoid duplicate reporting of events), then we report $k$.
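A sketch of the bin layout and flushing schedule follows (illustrative code with our own naming; it assumes $q = (\alpha+1)/\alpha$ is an integer, i.e., $\alpha = 1/k$ for integer $k$, and it elides the per-flush reporting scan behind a callback).

```python
class TimeStretchFilter:
    """Sketch of the time-stretch filter's bin layout and flushing schedule.

    Each level i has q bins, each of capacity r^i * M / q, where capacity
    is the SUM of item counts in the bin, not the number of distinct items.
    Events are detected only during flushes.
    """

    def __init__(self, M, r, L, q, report):
        self.q, self.L, self.report = q, L, report
        self.bin_cap = [r**i * M / q for i in range(L)]    # capacity per bin
        # bins[i][0] is the newest bin b_1^i; bins[i][-1] is the oldest b_q^i.
        self.bins = [[dict() for _ in range(q)] for _ in range(L)]

    def insert(self, x, count=1, level=0):
        b1 = self.bins[level][0]
        b1[x] = b1.get(x, 0) + count
        if sum(b1.values()) >= self.bin_cap[level]:
            self._flush(level)

    def _flush(self, i):
        # Shift the bins by one; the oldest bin spills down a level. (In the
        # real structure the last level is sized so it never spills.)
        oldest = self.bins[i].pop()
        self.bins[i].insert(0, dict())
        if i + 1 < self.L:
            for x, c in oldest.items():
                self.insert(x, c, i + 1)
        # A real flush also scans levels 0..i, sums each item's counts, and
        # reports items whose total exceeds the threshold (elided here).
        self.report(i)
```

The design point to notice: flushing is driven purely by the age of items (bins fill and shift on a fixed schedule), not by MG-style decrements, which is exactly what yields the time-stretch guarantee proved next.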
Correctness.
We first prove correctness of the time-stretch filter.
Lemma 3.
The time-stretch filter reports each $\phi$-event $s_t$ occurring at time $t$ at or before $t + \alpha F_t$, where $F_t$ is the flow time of $s_t$.

Proof. In the time-stretch filter, each item inserted at level $i$ waits in $1/\alpha$ bins until it reaches the last bin; that is, it waits at least $r^i/\alpha$ flushes (from main memory) before it is moved down to level $i + 1$. This ensures that items that are placed on a deeper level have aged sufficiently that we can afford not to see them again for a while.

Consider an item $s_t$ with flow time $F_t = t - t_0$, where $t$ is a $\phi$-event and $t_0$ is the time step of the first occurrence of $s_t$. Let $\ell \in \{0, 1, \ldots, L-1\}$ be the largest level containing an instance of $s_t$ at time $t$, when $s_t$ has its $\lceil \phi N \rceil$th occurrence. The flushing schedule guarantees that the item $s_t$ must have survived at least $r^{\ell-1}M/\alpha$ time steps since it was first inserted in the data structure. Thus, $r^{\ell-1}M/\alpha \leq F_t$.

Furthermore, level $\ell$ is involved in a flush again after $t_\ell = r^{\ell-1}M \leq \alpha F_t$ time steps. During that flush, all counts of the item are consolidated into a total count estimate $\tilde{c}$. Note that $\ell \leq L$, and the count-estimate error of $s_t$ can be at most $\varepsilon N_{t_\ell}$, where $N_{t_\ell}$ is the number of stream items seen up to that flush. Thus, we have $\phi N \leq \tilde{c} + \varepsilon N_{t_\ell} \leq \tilde{c} + \varepsilon N$. That is, $\tilde{c} \geq (\phi - \varepsilon)N$, which means that $s_t$ gets reported during the flush at time $t_\ell$, which is at most $\alpha F_t$ time steps after $t$.
I/O complexity.
Next, we analyze the I/O complexity of the time-stretch filter. We treat each level of the filter as a sorted array.
Theorem 2.
Given a stream $S$ of size $N$ and parameters $\varepsilon$ and $\phi$, where $1/N \leq \varepsilon < \phi < 1$, the approximate oedp can be solved with time stretch $\alpha$ at an amortized I/O complexity of $O\left(\frac{\alpha+1}{\alpha}\cdot\frac{1}{B}\log\frac{1}{\varepsilon M}\right)$ per stream item.

Proof. A flush from level $i$ to $i + 1$ costs $O(r^{i+1}M/B)$ I/Os and moves $\frac{\alpha}{\alpha+1}r^i M$ stream items down one level, so the amortized cost to move one stream item down one level is $O\left(\frac{r^{i+1}M}{B} / \frac{\alpha}{\alpha+1}r^i M\right) = O\left(\frac{\alpha+1}{\alpha}\cdot\frac{r}{B}\right)$ I/Os. Each stream item can be moved down at most $L$ levels; thus the overall amortized I/O cost of an insert is $O\left(\frac{\alpha+1}{\alpha}\cdot\frac{rL}{B}\right) = O\left(\frac{\alpha+1}{\alpha}\cdot\frac{r}{B}\log_r\frac{1}{\varepsilon M}\right)$, which is minimized at $r = e$.
Exact reporting with time stretch.
Similar to Section 3.2, if we do not want any false positives among the reported events, we set $\varepsilon = 1/N$. The cost of error-free reporting is that we have to store all the items, which increases the number of levels and thus the I/O cost. In particular, we have the following result on the oedp.

Corollary 2. Given $\alpha > 0$ and a stream $S$ of size $N$, the oedp can be solved with time stretch $\alpha$ at an amortized cost of $O\left(\frac{\alpha+1}{\alpha}\cdot\frac{\log(N/M)}{B}\right)$ I/Os per stream item.
Summary.
By allowing a little delay, we can solve the timely event-detection problem at the same asymptotic cost as simply indexing our data [22, 11, 10, 21, 23, 13].

Recall that in the online solution, the increments and decrements of the MG algorithm determined the flushes from one level to the next. In contrast, the flushing decisions in the time-stretch solution were based entirely on the age of the items; the MG-style count estimates came essentially for free from the size and cascading nature of the levels. Thus, we get different reporting guarantees depending on whether we flush based on age or count.

Finally, our results on the oedp and the oedp with time stretch show that there is a spectrum between completely online and completely offline, and it is tunable with little I/O cost.
5 Power-Law Distributions
In this section, we present a data structure that solves the oedp on streams where the counts of items follow a power-law distribution. There is no assumption on the order of arrivals, which can be adversarial. In contrast to worst-case count distributions, our data structure for power-law inputs can support smaller reporting thresholds and achieve better I/O performance.

We note that previous work has analyzed the performance of Misra-Gries-style algorithms on similar input distributions. In particular, Berinde et al. [14] consider streams where the item counts follow a Zipfian distribution, whose assumptions are similar to, but distinct from, a power law. Next, we briefly review the distinction and relationship between Zipfian and power-law distributions; this will allow us to compare Berinde et al.'s result to our work. For a detailed review of these distributions, see [45, 25, 20, 2].
Zipfian vs. power-law distributions.
Let $f_1, \ldots, f_u$ be the ranked-frequency vector, that is, $f_1 \geq f_2 \geq \ldots \geq f_u$, of the $u$ distinct items in a stream of size $N$, where $u = |U|$. The item counts in the stream follow a Zipfian distribution with exponent $\alpha > 0$ if $f_i = Z \cdot i^{-\alpha}$, where $Z$ is the normalization constant. In contrast, the item counts in the stream follow a power-law distribution with exponent $\theta > 1$ if the probability that an item has count $c$ is equal to $Z \cdot c^{-\theta}$, where $Z$ is the normalization constant.

A stream follows a Zipfian distribution with exponent $\alpha$ if and only if it follows a power-law distribution with exponent $\theta = 1 + 1/\alpha$; see [2] for details on this conversion.

Berinde et al. [14] show that if the item counts in the stream follow a Zipfian distribution with $\alpha > 1$, then the MG algorithm can solve the $\varepsilon$-approximate heavy-hitters problem using only $\varepsilon^{-1/\alpha}$ words. Alternatively, on such Zipfian distributions, the MG algorithm achieves an improved error bound of $\varepsilon^\alpha$ using $1/\varepsilon$ words. Since all our algorithms so far use the MG algorithm as a building block, we automatically achieve these improved bounds for Zipf exponents $\alpha > 1$, that is, for power-law exponents $\theta \leq 2$. In this section, we instead focus on the range $\theta > 2$; in particular, we solve the oedp with improved guarantees when the power-law exponent $\theta \geq 2 + 1/(\log(N/M))$.
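The exponent conversion is worth pinning down in code (trivial, but it keeps the two parameterizations straight; the function names are ours):

```python
def zipf_to_powerlaw(alpha):
    """Convert a Zipfian rank-frequency exponent to the equivalent
    power-law count exponent, theta = 1 + 1/alpha (see [2])."""
    return 1 + 1 / alpha

def powerlaw_to_zipf(theta):
    """Inverse conversion: alpha = 1 / (theta - 1)."""
    return 1 / (theta - 1)

# alpha > 1 (where Berinde et al.'s improvements apply) corresponds to
# theta < 2; the regime targeted in this section, theta > 2, corresponds
# to alpha < 1.
assert zipf_to_powerlaw(2.0) == 1.5
assert powerlaw_to_zipf(3.0) == 0.5
```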
Preliminaries.
We use the continuous power-law definition [45]: the count of an item drawn from a power-law distribution has probability $p(x)\,dx$ of taking a value in the interval from $x$ to $x + dx$, where $p(x) = Z \cdot x^{-\theta}$, $\theta > 1$, and $Z$ is the normalization constant. In general, the power-law distribution on $x$ may hold only above some minimum value $c_{\min}$ of $x$; for simplicity, we let $c_{\min} = 1$. (In principle, one could have power-law distributions with $\theta < 1$, but these distributions cannot be normalized and are not common [45].) The normalization constant $Z$ is calculated as follows:

$$1 = \int_1^\infty p(x)\,dx = Z \int_1^\infty x^{-\theta}\,dx = \frac{Z}{\theta - 1}\left[-x^{-(\theta-1)}\right]_1^\infty = \frac{Z}{\theta - 1}.$$

Thus, $Z = \theta - 1$. We will use the cumulative distribution of a power law, that is,

$$\mathrm{Prob}(x > c) = \int_c^\infty (\theta - 1)\,x^{-\theta}\,dx = \left[-x^{-(\theta-1)}\right]_c^\infty = \frac{1}{c^{\theta-1}}. \qquad (1)$$
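A quick numerical sanity check of Equation (1) (illustrative code; the integration cutoff and grid are arbitrary choices of ours):

```python
import math

def tail_prob(c, theta, upper=1e9, steps=200000):
    """Numerically verify Equation (1): Prob(x > c) = c^(1 - theta) for
    p(x) = (theta - 1) * x^(-theta) on [1, inf). We substitute x = e^u and
    use the midpoint rule on a log grid, where the integrand is smooth;
    the tail beyond `upper` is negligible for theta > 1."""
    a, b = math.log(c), math.log(upper)
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        x = math.exp(a + (k + 0.5) * h)
        total += (theta - 1) * x ** (1 - theta)  # p(x) * dx/du = p(x) * x
    return total * h

theta = 2.5
for c in (1, 10, 100):
    print(c, round(tail_prob(c, theta), 6), round(c ** (1 - theta), 6))
```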
5.1 Power-Law Filter

First, we present the layout of our data structure, the power-law filter; then we present its main algorithm, the shuffle merge; and finally we analyze its performance.

Layout.
The power-law filter consists of a cascade of Misra-Gries tables, where $M$ is the size of the table in RAM and there are $L = \log_r(2/(\varepsilon M))$ levels on disk, where the size of level $i$ is $2/(r^{L-i}\varepsilon)$. Each level on disk has an explicit upper bound on the number of instances of an item that can be stored on that level. This is different from the MG algorithm, where this upper bound is implicit, based on the level's size. In particular, each level $i$ in the power-law filter has a level threshold $\tau_i$, for $1 \leq i \leq L$ ($\tau_1 \geq \tau_2 \geq \ldots \geq \tau_L$), indicating that the maximum count on level $i$ can be $\tau_i$.
We maintain the invariant that at most $\tau_i$ instances of an item can be stored on level $i$. Later, we show how to set the $\tau_i$'s based on the item-count distribution.

Shuffle merge.
The external-memory MG data structure and the time-stretch filter use two different flushing strategies; here we present a third for the power-law filter.

The level in RAM receives inputs from the stream one at a time. When attempting to insert into a level $i$ that is at capacity, instead of flushing items to the next level, we find the smallest level $j > i$ that has enough empty space to hold all items from levels $0, 1, \ldots, i$. We aggregate the count of each item $k$ on levels $0, \ldots, j$, resulting in a consolidated count $c_k^j$. If $c_k^j \geq (\phi - \varepsilon)N$, we report $k$. Otherwise, we pack the instances of $k$ in a bottom-up fashion on levels $j, \ldots, 0$, while maintaining the threshold invariants. In particular, we place $\min\{c_k^j, \tau_j\}$ instances of $k$ on level $j$, and $\min\{c_k^j - \sum_{\ell=y+1}^{j}\tau_\ell,\ \tau_y\}$ instances of $k$ on level $y$, for $0 \leq y \leq j - 1$. A sketch of this packing appears below.

Items with high counts may become pinned, that is, they cannot be flushed out of a level. Specifically, we say an item is pinned at level $\ell$ if its count exceeds $\sum_{i=\ell+1}^{L}\tau_i$. Too many pinned items at a level can clog the data structure. In Lemma 4, we show that if the item counts in the stream follow a power-law distribution with exponent $\theta$, we can set the thresholds based on $\theta$ in a way that no level has too many pinned items.
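Here is a sketch of the packing step of a shuffle merge (illustrative code with our own naming; `tau[i]` is the threshold of level $i$, set as in Lemma 4 below, `tau[0]` is treated as unbounded for RAM, and report deduplication is elided):

```python
def shuffle_merge(levels, tau, j, phi, eps, N, report):
    """Consolidate levels 0..j and repack counts bottom-up (a sketch).

    `levels` is a list of dicts (level 0 is RAM); `tau[i]` is the maximum
    count storable on level i.
    """
    # Aggregate each item's count across levels 0..j.
    consolidated = {}
    for i in range(j + 1):
        for k, c in levels[i].items():
            consolidated[k] = consolidated.get(k, 0) + c
        levels[i].clear()

    for k, c in consolidated.items():
        if c >= (phi - eps) * N:
            report(k)          # dedup of repeated reports elided
            continue
        # Pack bottom-up: level j takes min(c, tau[j]); each higher level y
        # takes min(remaining, tau[y]); any leftover stays pinned in RAM.
        remaining = c
        for y in range(j, 0, -1):
            placed = min(remaining, tau[y])
            if placed > 0:
                levels[y][k] = placed
            remaining -= placed
        if remaining > 0:
            levels[0][k] = remaining   # pinned at level 0 (RAM)
```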
Online event detection.
As soon as the count of an item $k$ in RAM (level 0) reaches a threshold of $\phi N - 2\tau_1$, the data structure triggers a sweep of all $L$ levels, consolidating the count estimates of $k$ at all levels. If the consolidated count reaches $(\phi - \varepsilon)N$, we report $k$; otherwise, we update $k$'s consolidated count in RAM and "pin" $k$ in RAM, that is, mark a bit to ensure $k$ does not participate in future shuffle merges. Reported items are remembered, so that each event gets reported exactly once.

Setting thresholds.
We now show how to set the level thresholds based on the power-law exponent so that the data structure does not get "clogged" even though the high-frequency items are being sent to higher levels of the data structure.
Lemma 4.
Let the item counts in a stream $S$ of size $N$ be drawn from a power-law distribution with exponent $\theta > 2$. Let $\tau_i = r^{1/(\theta-1)}\tau_{i+1}$ for $1 \leq i \leq L - 1$ and $\tau_L = (r\varepsilon N)^{1/(\theta-1)}$. Then the number of keys pinned at any level $i$ is at most half its size, i.e., $1/(r^{L-i}\varepsilon)$.

Proof. We prove the claim by induction on the number of levels, starting at level $L - 1$. An item is pinned at level $L - 1$ if its count exceeds $\tau_L = (r\varepsilon N)^{1/(\theta-1)}$. By Equation (1), there can be at most $N/\tau_L^{\theta-1} = N/(r\varepsilon N) = 1/(r\varepsilon)$ such items, which proves the base case.

Now suppose the lemma holds for level $i + 1$; we show that it holds for level $i$. An item gets pinned at level $i + 1$ if its count is greater than $\sum_{\ell=i+2}^{L}\tau_\ell$. Using Equation (1) again, the expected number of such items is at most $N/(\sum_{\ell=i+2}^{L}\tau_\ell)^{\theta-1} < N/\tau_{i+2}^{\theta-1}$. By the induction hypothesis, this is at most half the size of level $i + 1$, that is, $N/\tau_{i+2}^{\theta-1} \leq 1/(\varepsilon r^{L-i-1})$. Using this, we prove that the expected number of items pinned at level $i$ is at most $1/(r^{L-i}\varepsilon)$:

$$\frac{N}{\left(\sum_{\ell=i+1}^{L}\tau_\ell\right)^{\theta-1}} < \frac{N}{\tau_{i+1}^{\theta-1}} = \frac{N}{\left(r^{1/(\theta-1)}\cdot\tau_{i+2}\right)^{\theta-1}} = \frac{1}{r}\cdot\frac{N}{\tau_{i+2}^{\theta-1}} \leq \frac{1}{r\varepsilon r^{L-i-1}} = \frac{1}{r^{L-i}\varepsilon}.$$

Next, we prove correctness of the power-law filter and analyze its I/O complexity. We first establish notation. Let $S$ be a stream of size $N$ where the counts of items follow a power-law distribution with exponent $\theta > 2$. For simplification, we use $\gamma = 2(N/M)^{1/(\theta-1)}$ in the analysis.

Correctness.
Next, we prove that the power-law filter reports all $\phi$-events as soon as they occur. In the approximate oedp, it may report false positives, that is, items with frequency between $(\phi - \varepsilon)N$ and $\phi N$. As before, for error-free reporting we set $\varepsilon = 1/N$.

Lemma 5. The power-law filter solves the approximate oedp on $S$.

Proof. Let $\tilde{c}_i$ denote the count estimate of an item $i$ in RAM in the power-law filter, and let $f_i$ be the frequency of $i$ in the stream. Since at most $\sum_{\ell=1}^{L}\tau_\ell < 2\tau_1$ instances of a key can be stored on disk, we have $\tilde{c}_i \leq f_i \leq \tilde{c}_i + 2\tau_1$.

Suppose item $s_t$ reaches the threshold $\phi N$ at time $t$. Then its count estimate in RAM must be at least $\tilde{c} \geq \phi N - 2\tau_1 = \phi N - \gamma$ (recall $\gamma = 2(N/M)^{1/(\theta-1)}$). This is exactly when we trigger a sweep of the data structure consolidating the count of $s_t$ across all $L$ levels; if the consolidated count reaches $(\phi - \varepsilon)N$, we report it. This proves correctness, as the consolidated count can have an error of at most $\varepsilon N$.
I/O complexity.
We now analyze the I/O complexity of the power-law filter. Similar to Section 3.2, we assume each level is implemented as a B-tree, although the same basic algorithm works with sorted arrays (with fractional cascading from one level to the next, similar to cache-oblivious lookahead arrays [10]).
Theorem 3.
Let $S$ be a stream of size $N$ where the counts of items follow a power-law distribution with exponent $\theta > 2$. Let $\gamma = 2(N/M)^{1/(\theta-1)}$. Given $S$, $\varepsilon$, and $\phi$, such that $1/N \leq \varepsilon < \phi$ and $\phi = \Omega(\gamma/N)$, the approximate oedp can be solved at an amortized I/O complexity of $O\left(\left(\frac{1}{B} + \frac{1}{(\phi N - \gamma)^{\theta-1}}\right)\log\frac{1}{\varepsilon M}\right)$ per stream item.

Proof. The insertions cost $O(rL/B) = O\left(\frac{r}{B}\log_r\frac{1}{\varepsilon M}\right)$ amortized I/Os, as, by Lemma 4, we are always able to flush out a constant fraction of a level during a shuffle merge. This cost is minimized at $r = e$. We perform at most one query each time an item's count in RAM reaches $\phi N - \gamma$, and the total number of items in the stream with count at least $\phi N - \gamma$ is at most $N/(\phi N - \gamma)^{\theta-1}$. Since each query costs $O(\log(1/(\varepsilon M)))$ I/Os, the overall amortized I/O complexity of the queries is $O\left(\frac{1}{(\phi N - \gamma)^{\theta-1}}\log\frac{1}{\varepsilon M}\right)$.
Exact reporting.
To forbid false positives, we set $\varepsilon = 1/N$ and get the following corollary.

Corollary 3. Let $S$ be a stream of size $N$ where the counts of items follow a power-law distribution with exponent $\theta > 2$. Let $\gamma = 2(N/M)^{1/(\theta-1)}$. Given $\phi = \Omega(\gamma/N)$, the oedp can be solved at an amortized I/O complexity of $O\left(\left(\frac{1}{B} + \frac{1}{(\phi N - \gamma)^{\theta-1}}\right)\log\frac{N}{M}\right)$ per stream item.

Remark on scalability.
Notice that, on a stream with a power-law distribution, the power-law filter allows for strictly smaller thresholds $\phi$ than Theorem 1 and Corollary 1 do on worst-case distributions, when $\theta > 2 + 1/(\log(N/M))$. Recall that we need $\phi \geq \Omega(1/M)$ for solving the oedp on worst-case streams. In contrast, in Theorem 3 and Corollary 3, we need only $\phi \geq \Omega(\gamma/N)$, and for a power-law distribution with $\theta > 2 + 1/(\log(N/M))$ we have

$$\frac{\gamma}{N} = \frac{2}{M^{1/(\theta-1)}\,N^{(\theta-2)/(\theta-1)}} < \frac{1}{M}.$$
Remark on dynamic thresholds.
Finally, we argue that the level thresholds of the power-law filter can be set dynamically when the power-law exponent $\theta$ is not known ahead of time. Initially, each level on disk has threshold 0 (i.e., $\tau_i = 0$ for all $i \in 1, \ldots, L$). During the first shuffle merge involving RAM and the first level on disk, we determine the minimum threshold $\tau_1$ for level 1 required in order to move at least half of the items from RAM to the first level on disk. When multiple levels $0, 1, \ldots, i$ are involved in a shuffle merge, we use a bottom-up strategy to assign thresholds: we determine the minimum threshold $\tau_i$ required for the bottommost level involved in the shuffle merge to flush at least half the items from the level just above it, and then apply the same strategy to increment the thresholds of levels $i - 1, \ldots, 1$.

With this strategy, the $\tau_i$'s for levels $1, \ldots, L$ increase monotonically over the course of the stream. Moreover, during shuffle merges, we increase the thresholds of the levels involved from the bottom up, and only to the minimum values needed to avoid clogging the data structure, which means that the $\tau_i$'s take their minimum possible values. Thus, if the $\tau_i$'s have a feasible setting, then this adaptive strategy will find it.
With a power-law distribution, we can support a much lower threshold $\phi$ for the online event-detection problem. In the external-memory MG sketch from Section 3.1, the upper bounds on the counts at each level are implicit. In this section, we get better estimates by making these bounds explicit. Moreover, the data structure can learn these bounds adaptively; thus, it can automatically tailor itself to the power-law exponent without needing to be told the exponent explicitly.

6 The Motivating Application

In this section, we describe the more complex national-security setting that motivates our constraints. We describe FireHose [1, 5], a clean benchmark that captures the fundamental elements of this setting. The oedp in this paper in turn distills the most difficult part of the FireHose benchmark. Therefore, our solutions have a direct line of sight to important national-security applications.

Figure 1: The analysis pipeline that motivates our oedp solution. Analysts associate a multi-piece pattern, represented by the 4-piece puzzle, with a high-consequence event. The pieces arrive slowly over time, mixed with innocent traffic in a high-throughput "firehose" stream. Our database stores many partial matches to the pattern, reporting all instances of the pattern. There still may be a fair number of matches, which are pared down by an automated system to a small number (essentially droplets compared to the original stream) of matches worthy of human inspection.

We are motivated by monitoring systems for national security [1, 5], where experts associate special patterns in a cyberstream with rare, high-consequence real-life events. These patterns are formed by a small number of "puzzle pieces," as shown in Figure 1. Each piece is associated with a key, such as an IP address or a hostname. The pieces arrive over time. When an entire puzzle associated with a particular key is complete, this is an event, which should be reported as soon as the final puzzle piece falls into place. In Figure 1, the first stage is like our oedp algorithm, except that it must store puzzle pieces with each key rather than a count, and the reporting trigger is a complete puzzle, not a count threshold.

There can still be a fair number of matches to this special pattern, most of which are still not the critically bad event. This might overwhelm a human analyst, who would then not use the system. However, automated tools, shown in the second stage of Figure 1, can pare these down to the few events worthy of analyst attention.

The first-stage filter, like our oedp solution, must struggle to handle a massively large, fast stream. It is reasonable to allow a few false positives in the first stage to improve its speed. The second stage can screen out almost all of these false positives as long as the stream is significantly reduced. The second stage is a slower, more careful tool that cannot keep up with the initial stream. This second tool cannot, however, repair false negatives, since anything the first filter misses is gone forever. So the first tool cannot drop any matches to the pattern. Experts have gone to great effort to find a pattern that is a good filter for the high-consequence events. We do not allow false negatives because the high-consequence events that match this carefully crafted pattern can and must be detected.

Each of these patterns is small with respect to the stream size, so the detection algorithm must be scalable, that is, must be able to support a small $\phi$. The consequences of missing an event (a false negative) are so severe that it is not reasonable to risk facing those consequences just to save a little space. Thus we must save all partial patterns, motivating our use of external memory.

The DoD FireHose benchmark captures the essence of this setting [1]. In FireHose, the input stream has (key, value) pairs. When a key is seen for the 24th time, the system must return a function of the associated 24 values. The most difficult part of this is determining when the 24th instance of a key arrives. Thus, like FireHose, the oedp captures the essence of the motivating application.

7 Conclusion
Our results show that, by enlisting the power of external memory, we can solve online event-detection problems at a level of precision that is not possible in the streaming model, and with little or no sacrifice in the timeliness of reports.

Even though streaming algorithms, such as Misra-Gries, were developed for a space-constrained setting, they are nonetheless useful in external memory, where storage is plentiful but I/Os are expensive. Furthermore, using external memory for problems that have traditionally been analyzed in the streaming setting enables solutions that can scale beyond the provable limits of fast RAM.
Acknowledgments
We would like to thank Tyler Mayer for many helpful discussions in earlier stages of this project. In Figure 1, the full-puzzle icon is from theme4press.com, the fire-hydrant icon is from https://hanslodge.com, and the water-drop icon is from stockio.com.

References

[1] FireHose streaming benchmarks.
[2] L. A. Adamic. Zipf, power-laws, and Pareto: a ranking tutorial. HP Labs, 2000.
[3] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988.
[4] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. 28th Annual ACM Symposium on Theory of Computing, pages 20–29, 1996.
[5] K. Anderson and S. Plimpton. FireHose streaming benchmarks. Technical report, Sandia National Laboratory, 2015.
[6] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. 21st Symposium on Principles of Database Systems, pages 1–16, 2002.
[7] R. Bayer and E. M. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1:173–189, 1972.
[8] M. A. Bender, J. W. Berry, M. Farach-Colton, J. Jacobs, R. Johnson, T. M. Kroeger, T. Mayer, S. McCauley, P. Pandey, C. A. Phillips, A. Porter, S. Singh, J. Raizes, H. Xu, and D. Zage. Advanced data structures for improved cyber resilience and awareness in untrusted environments: LDRD report. Technical Report SAND2018-5404, Sandia National Laboratories, May 2018.
[9] M. A. Bender, E. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In Proc. 41st Annual Symposium on Foundations of Computer Science, pages 399–409, 2000.
[10] M. A. Bender, M. Farach-Colton, J. T. Fineman, Y. R. Fogel, B. C. Kuszmaul, and J. Nelson. Cache-oblivious streaming B-trees. In Proc. 19th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 81–92, 2007.
[11] M. A. Bender, M. Farach-Colton, W. Jannen, R. Johnson, B. C. Kuszmaul, D. E. Porter, J. Yuan, and Y. Zhan. An introduction to Bε-trees and write-optimization. ;login: magazine, 40(5):22–28, October 2015.
[12] M. A. Bender, M. Farach-Colton, R. Johnson, B. C. Kuszmaul, D. Medjedovic, P. Montes, P. Shetty, R. P. Spillane, and E. Zadok. Don't thrash: How to cache your hash on flash. In Proc. 3rd USENIX Workshop on Hot Topics in Storage (HotStorage), June 2011.
[13] M. A. Bender, M. Farach-Colton, R. Johnson, S. Mauras, T. Mayer, C. A. Phillips, and H. Xu. Write-optimized skip lists. In Proc. 36th Symposium on Principles of Database Systems, pages 69–78, 2017.
[14] R. Berinde, P. Indyk, G. Cormode, and M. J. Strauss. Space-optimal heavy hitters with strong error bounds. ACM Transactions on Database Systems, 35(4):26, 2010.
[15] J. Berry, R. D. Carr, W. E. Hart, V. J. Leung, C. A. Phillips, and J.-P. Watson. Designing contamination warning systems for municipal water networks using imperfect sensors. Journal of Water Resources Planning and Management, 135, 2009.
[16] A. Bhattacharyya, P. Dey, and D. P. Woodruff. An optimal algorithm for ℓ1-heavy hitters in insertion streams and related problems. In Proc. 35th ACM Symposium on Principles of Database Systems, pages 385–400, 2016.
[17] P. Bose, E. Kranakis, P. Morin, and Y. Tang. Bounds for frequency estimation of packet streams. In SIROCCO, pages 33–42, 2003.
[18] V. Braverman, S. R. Chestnut, N. Ivkin, J. Nelson, Z. Wang, and D. P. Woodruff. BPTree: an ℓ2 heavy hitters algorithm using constant memory. arXiv preprint arXiv:1603.00759, 2016.
[19] V. Braverman, S. R. Chestnut, N. Ivkin, and D. P. Woodruff. Beating CountSketch for heavy hitters in insertion streams. In Proc. 48th Annual Symposium on Theory of Computing, pages 740–753, 2016.
[20] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In Proc. Annual Joint Conference of the IEEE Computer and Communications Societies, volume 1, pages 126–134, 1999.
[21] G. S. Brodal, E. D. Demaine, J. T. Fineman, J. Iacono, S. Langerman, and J. I. Munro. Cache-oblivious dynamic dictionaries with update/query tradeoffs. In Proc. 21st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1448–1456, 2010.
[22] G. S. Brodal and R. Fagerberg. Lower bounds for external memory dictionaries. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 546–554, 2003.
[23] A. L. Buchsbaum, M. Goldwasser, S. Venkatasubramanian, and J. R. Westbrook. On external memory graph traversal. In Proc. 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 859–860, 2000.
[24] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. International Colloquium on Automata, Languages, and Programming, pages 693–703, 2002.
[25] A. Clauset, C. R. Shalizi, and M. E. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
[26] D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):121–137, June 1979.
[27] A. Conway, M. Farach-Colton, and P. Shilane. Optimal hashing in external memory. In Proc. 45th International Colloquium on Automata, Languages, and Programming, pages 39:1–39:14, 2018.
[28] G. Cormode. Misra-Gries summaries. Encyclopedia of Algorithms, pages 1–5, 2008.
[29] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[30] G. Cormode and S. Muthukrishnan. What's hot and what's not: tracking most frequent items dynamically. ACM Transactions on Database Systems, 30(1):249–278, 2005.
[31] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proc. European Symposium on Algorithms, pages 348–360, 2002.
[32] X. Dimitropoulos, P. Hurley, and A. Kind. Probabilistic lossy counting: an efficient algorithm for finding heavy hitters. ACM SIGCOMM Computer Communication Review, 38(1):5–5, 2008.
[33] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. ACM Transactions on Algorithms, 8(1):4, 2012.
[34] J. M. Gonzalez, V. Paxson, and N. Weaver. Shunting: A hardware/software architecture for flexible, high-performance network intrusion prevention. In Proc. 14th ACM Conference on Computer and Communications Security, pages 139–149, 2007.
[35] J. Iacono and M. Pătrașcu. Using hashing to solve the dictionary problem. In Proc. 23rd Annual ACM-SIAM Symposium on Discrete Algorithms, pages 570–582, 2012.
[36] M. Kezunovic. Monitoring of power system topology in real-time. In Proc. 39th Annual Hawaii International Conference on System Sciences, volume 10, pages 244b–244b, January 2006.
[37] E. Kushilevitz. Communication complexity. In Advances in Computers, volume 44, pages 331–360. Elsevier, 1997.
[38] K. G. Larsen, J. Nelson, H. L. Nguyen, and M. Thorup. Heavy hitters via cluster-preserving clustering. In Proc. 57th Annual IEEE Symposium on Foundations of Computer Science, pages 61–70, 2016.
[39] Q. Le Sceller, E. B. Karbab, M. Debbabi, and F. Iqbal. SONAR: Automatic detection of cyber security events over the twitter stream. In Proc. 12th International Conference on Availability, Reliability and Security, 2017.
[40] E. Litvinov. Real-time stability in power systems: Techniques for early detection of the risk of blackout [book review]. IEEE Power and Energy Magazine, 4(3):68–70, May 2006.
[41] J. Mai, C.-N. Chuah, A. Sridharan, T. Ye, and H. Zang. Is sampled data sufficient for anomaly detection? In Proc. 6th ACM SIGCOMM Conference on Internet Measurement, pages 165–176, 2006.
[42] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. 28th International Conference on Very Large Data Bases, pages 346–357, 2002.
[43] C. R. Meiners, J. Patel, E. Norige, E. Torng, and A. X. Liu. Fast regular expression matching using small TCAMs for network intrusion detection and prevention systems. In Proc. 19th USENIX Conference on Security, 2010.
[44] J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2(2):143–152, 1982.
[45] M. E. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351, 2005.
[46] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
[47] S. Raza, L. Wallgren, and T. Voigt. SVELTE: Real-time intrusion detection in the internet of things. Ad Hoc Networks, 11(8):2661–2674, 2013.
[48] T. Roughgarden. Communication complexity (for algorithm designers). Foundations and Trends in Theoretical Computer Science, 11(3–4):217–404, 2016.
[49] S. Venkataraman, D. Song, P. B. Gibbons, and A. Blum. New streaming algorithms for fast detection of superspreaders. In Proc. Network and Distributed System Security Symposium (NDSS), 2005.
[50] H. Yan, R. Oliveira, K. Burnett, D. Matthews, L. Zhang, and D. Massey. BGPmon: A real-time, scalable, extensible monitoring system. In Proc. Cybersecurity Applications and Technology Conference for Homeland Security, pages 212–223, March 2009.
[51] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. International Workshop on Randomization and Approximation Techniques in Computer Science (RANDOM), 2002.