Low-Rate Overuse Flow Tracer (LOFT): An Efficient and Scalable Algorithm for Detecting Overuse Flows
Simon Scherrer, Che-Yu Wu, Yu-Hsi Chiang, Benjamin Rothenberger, Daniele E. Asoni, Arish Sateesan, Jo Vliegen, Nele Mentens, Hsu-Chun Hsiao, Adrian Perrig
Department of Computer Science, ETH Zurich, Switzerland
Email: {simon.scherrer,benjamin.rothenberger,daniele.asoni,adrian.perrig}@inf.ethz.ch
National Taiwan University, Taiwan
Email: {r06922021,r06922023,hchsiao}@ntu.edu.tw
ESAT, KU Leuven, Belgium
Email: {arish.sateesan,jo.vliegen,nele.mentens}@kuleuven.be

Abstract—Current probabilistic flow-size monitoring can only detect heavy hitters (e.g., flows utilizing 10 times their permitted bandwidth), but cannot detect smaller overuse (e.g., flows utilizing 50-100% more than their permitted bandwidth). Thus, these systems lack accuracy in the challenging environment of high-throughput packet processing, where fast-memory resources are scarce. Nevertheless, many applications rely on accurate flow-size estimation, e.g., for network monitoring, anomaly detection, and Quality of Service.
We design, analyze, implement, and evaluate LOFT, a new approach for efficiently detecting overuse flows that achieves dramatically better properties than prior work. LOFT can detect 1.5x overuse flows in one second, whereas prior approaches fail to detect 2x overuse flows within a timeout of 300 seconds. We demonstrate LOFT's suitability for high-speed packet processing with implementations in the DPDK framework and on an FPGA.
I. INTRODUCTION
The problem of detecting network flows whose size exceeds a certain share of link bandwidth, so-called large flows or heavy hitters, has received significant attention since the seminal paper by Estan and Varghese in 2003 [16] and has recently experienced renewed interest, partly thanks to the emergence of programmable data planes [5], [7], [33], [40]. In this formulation of the heavy-hitter problem, the assumption is that network operators define a flow-size threshold and want to identify all the flows that violate this threshold. Previous work [16], [5], [7], [11], [13], [38], [37], [33], [40] focused on detecting heavy hitters, e.g., flows that are 10x larger than the threshold, such that a large measurement error can be tolerated. In contrast, the problem of detecting moderately large flows, i.e., flows that send slightly more (e.g., 1.5x) than the threshold flow size, is still in need of an effective solution. Throughout this work, we refer to these large-flow types as high-rate and low-rate overuse flows, respectively.
While previous approaches to probabilistic flow-size estimation could be made arbitrarily accurate if abundant memory were available, these approaches suffer from a lack of accuracy under stringent constraints regarding memory and computation. These constraints introduce a large measurement error such that the detection of overuse flows becomes unreliable. In practice, only flows that exceed the target threshold by more than the measurement error (which itself can amount to a significant share of link bandwidth) can be reliably detected, resulting in many undetected overuse flows.
This failure of probabilistic flow-size monitoring is especially severe under tight hardware constraints, e.g., on routers that have an aggregate capacity of several terabits per second (Tbps), handle millions of concurrent flows, and thus require high-speed packet processing (on the order of 100 ns processing time per packet).
These heavy constraints only allow for restricted flow monitoring with an especially large estimation error, which makes the detection of low-rate overuse flows extremely difficult. At the same time, many applications that strengthen network dependability rely on accurate flow-size estimation: network operators make heavy use of tools such as network monitoring, traffic engineering (e.g., flow-size-aware routing), anomaly detection, Quality of Service (QoS), network provisioning, and security applications, all of which require accurate insight into the flow-size distribution.
As one example of a dependability-enhancing application relying on accurate flow-size monitoring, consider bandwidth-reservation systems [6], [30], [3]. These approaches allocate the available bandwidth on a link to flows according to purchasable reservations. In case of scarce link capacity (e.g., during a distributed denial-of-service (DDoS) attack), flows with a reservation can continue sending to the extent of the reserved allowance, whereas other traffic might be dropped. If probabilistic flow-size monitoring is used for allowance policing, systems that can only detect high-rate overuse flows, but disregard the detection of low-rate overuse flows, allow an adversary to "fly under the radar". Similar to attacks such as Coremelt [35] and Crossfire [20], an attacker could wrongfully consume link bandwidth by creating a large number of flows that only slightly exceed the corresponding reservation. Thus, detecting low-rate overuse flows is essential to uphold the guarantees of bandwidth-reservation systems.
The biggest challenge in creating highly accurate flow-size measurement on high-speed routers is the scarcity of fast memory compared to the enormous number of flows handled by these routers. Individual-flow resource accounting is either too expensive (due to the high cost of fast SRAM memory for caches) or too slow (e.g., keeping per-flow information in DRAM [1], [10]).
To reduce fast-memory usage, a line of previous research devises sketches, which use a small number of counters and map every flow to a random subset of these counters. The size of each flow is then estimated based on the values of the counters corresponding to that flow [16], [13], [37], [29], [21]. Flows with an estimated size exceeding a pre-defined threshold are considered overusing.
Our key insight is that, due to the high variance of the counter values (referred to as counter noise in the remainder of this paper), these shared-counter approaches fail to distinguish low-rate overuse flows from non-overusing flows. This counter noise originates from two different sources: (1) the uneven size of flows within a counter, which leads to non-overuse flows being mistaken for overuse flows if they are mapped to the same counter as an overuse flow, and (2) the uneven number of flows across counters, which leads to non-overuse flows being mistaken for overuse flows if they are mapped to the same counter as many other non-overuse flows. Due to counter noise, a sketch cannot distinguish a 1.5x overuse flow from a non-overuse flow within reasonable memory limits (cf. §II-C1). Hence, the core research challenge becomes how to effectively counteract the noise while using limited computing and storage resources.
In this paper, we propose LOFT, a lightweight detector that can detect low-rate overuse flows significantly more quickly and reliably than prior approaches, while conforming to strict requirements regarding time and memory complexity. LOFT reduces the counter noise by using a multi-stage approach that is aware of both the traffic volume and the number of flows in any counter and aggregates these values over time.
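To make the two noise sources concrete, the following toy simulation (our own illustration, not part of LOFT; all parameters are assumed for the example) maps 10,000 unit-rate flows plus one 1.5x overuse flow onto 1,024 shared counters and shows that the overuse flow's counter value is indistinguishable from that of many ordinary flows:

```python
import random

# Toy demonstration of counter noise under assumed parameters:
# 10,000 unit-rate flows, one 1.5x overuse flow, 1,024 shared counters.
random.seed(1)
N, W = 10_000, 1024
flow_size = {f: 1.0 for f in range(N)}
flow_size[0] = 1.5                      # the low-rate overuse flow

counter = [0.0] * W
slot = {f: random.randrange(W) for f in range(N)}   # random flow->counter map
for f, s in flow_size.items():
    counter[slot[f]] += s

# A flow's naive estimate is its counter value: noise source (1) is the
# other flows' volume in the same counter; noise source (2) is the uneven
# number of flows per counter.
est_overuse = counter[slot[0]]
est_normal = [counter[slot[f]] for f in range(1, N)]
overestimated = sum(1 for e in est_normal if e >= est_overuse)
print(f"overuse-flow counter value: {est_overuse:.1f}")
print(f"normal flows whose counter is at least as large: {overestimated}")
```

Many non-overuse flows end up with counter values at or above the overuse flow's counter, so thresholding the raw counter values cannot separate the two classes.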
Moreover, LOFT requires fewer operations per packet than conventional schemes and thereby enables high-speed packet processing, which we demonstrate with implementations for the DPDK framework [28] and for a Xilinx Virtex UltraScale+ FPGA on the Netcope NFB 200G2QL platform [2].
Our evaluation based on both real and synthetic traffic traces shows that LOFT outperforms prior work. LOFT is at least 300 times faster than prior approaches in detecting 1.5-2x overuse flows: LOFT can reliably detect 1.5x overuse flows in one second, whereas prior approaches fail to detect even 2x overuse flows within a time of 300 s.

II. PROBLEM DEFINITION AND BACKGROUND

A network flow (or flow for short) is a sequence of packets with common characteristics. For example, NetFlow uniquely defines a flow by source and destination IP addresses, source and destination transport-layer ports, protocol, ingress interface, and type of service [12].

Fig. 1. Router packet processing.

An overuse flow is a flow that consumes more bandwidth than it was permitted by the traversed network. More concretely, if a flow's permitted rate is γ and permitted burst size is β, then a flow overuses its allocation if it sends more than γt + β in any time interval t. An ℓ-fold overuse flow is a flow that sends at rate ℓγ. The permitted amount (i.e., the overuse-flow threshold) can be pre-determined by bandwidth allocation or determined by the router based on its resource usage.
In this section, we review the flow-policing model, with a focus on how the overuse-flow detection component interacts with other components. Then we highlight important properties desirable for overuse-flow detectors.

A. Flow Policing Model
A typical flow-policing mechanism consists of the following four components (shown in Figure 1): (1) a flow Classifier that extracts each flow's ID and determines its permitted bandwidth, (2) a Blacklist that filters out blacklisted flows, (3) a Probabilistic Overuse Flow Detector (OFD) tasked with finding suspicious flows that could potentially be overusing, and (4) a Precise monitoring component which analyzes individual suspicious flows to determine which ones are actually misbehaving and should be added to the blacklist. The precise-monitoring component can access a limited amount of fast memory (e.g., on the order of the amount used in the overuse-flow detector). This limited fast memory restricts the number of suspicious flows that can be simultaneously monitored by the detector, leading to false negatives.
While stateful monitoring of individual flows is practical at the edge of the network, it has an untenable fast-memory consumption on routers with Tbps capacity. Moreover, schemes with per-flow state in fast memory are vulnerable to memory-exhaustion attacks in which an attacker creates a high number of flows and thereby depletes the available fast memory. Hence, only probabilistic monitoring is a viable option on high-speed routers.

B. Desired Properties
To ensure accurate and timely detection of overuse flows without affecting the regular packet processing, an overuse-flow detection algorithm should satisfy the following properties:
High-speed packet processing on routers. High-speed packet processing is required for a system that is deployed in the network core.¹

¹As the CAIDA dataset suggests a flow concurrency of 10 million flows on a switch supporting an aggregate bandwidth of 1 Tbps [9], individual-flow monitoring can be expected to require 80 MB of fast memory, assuming 4-byte flow IDs and 4-byte counters.
Low false-positive and low false-negative rates. In the context of overuse-flow detection, a false positive (FP) is a misclassification of a non-overuse flow as an overuse flow. Conversely, a false negative (FN) is a misclassification of an overuse flow as a non-overuse flow, i.e., the failure to detect an overuse flow. Although the precise-monitoring component of a flow-policing mechanism can prevent non-overuse flows from being falsely blacklisted, the detector itself should still ensure a low FP rate so as not to evict suspicious flows in the precise-monitoring component, which would increase the FN rate.
Detection of low-rate overuse flows. Overuse flows that are sending slightly above their permitted sending rate should be detected with high probability after a short amount of time. Our goal is to detect, within seconds, flows that are sending at 1.5x their permitted rate; current state-of-the-art algorithms typically assume 10-1000-fold sending rates of overuse flows when budgeting resources.
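The fast-memory figure in the footnote above can be checked with a one-line calculation (a sketch of the arithmetic, using the stated assumptions of 4-byte flow IDs and 4-byte counters):

```python
# Reproduces the footnote's estimate: per-flow state for 10 million
# concurrent flows, with a 4-byte flow ID and a 4-byte counter per flow.
num_flows = 10_000_000
bytes_per_flow = 4 + 4
total_mb = num_flows * bytes_per_flow / 10**6
print(total_mb, "MB")  # 80.0 MB of fast memory
```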
C. Overuse Flows vs. Large Flows
This work aims to efficiently detect overuse flows, and is inspired by algorithms for detecting large flows, i.e., flows which use a significant fraction of the link bandwidth. However, it is important to note the significant differences between the two problems. In our context, large flows correspond to high-rate overuse flows, i.e., flows sending at rates at least 10-1000 times higher than the average flow (for example, flows violating TCP fairness). The goal of previous work is to quickly identify the large flows to throttle or block them, thus preventing them from harming the other flows. These algorithms have a (more or less explicit) threshold above which a flow is considered large, but below which flows are allowed to send. This threshold is usually up to three orders of magnitude larger than the average flow's sending rate, so a few flows sending around the threshold rate would rapidly exhaust link capacity and lead to congestion.
In the rest of this section, we briefly introduce two kinds of large-flow detection algorithms, namely sketch-based approaches and selective individual-flow monitoring. In the following, we discuss why they are inadequate for solving overuse-flow detection in our target scenario.
1) Sketches:
Individual-flow monitoring tracks the size of flows with a counter per flow, i.e., a memory cell that is increased by the packet size every time a packet of the corresponding flow arrives. To monitor flows using limited fast memory, sketches use each counter to track multiple flows. Two algorithms in this category are the Count-Min (CM) Sketch [13] (also known as Multistage Filters [16]) and Adaptive Multistage Filters (AMF) [14]. They rely on a relatively simple concept, which we illustrate with the example of the CM Sketch.
In the CM Sketch, flows are randomly mapped to counters, and each counter aggregates the volume of all flows that are assigned to it. If a large flow is mapped to a certain counter, this counter value is expected to be higher than the other counters, as it includes the contribution of the large flow. To increase precision, the CM Sketch uses multiple stages, i.e., multiple counter arrays, and for each stage the flows are mapped to counters in a different way (e.g., using different hash functions). The CM Sketch classifies a flow as large if and only if the minimum value of the counters to which the flow is mapped exceeds a certain threshold. For non-large flows, the probability that all associated counters exceed the threshold is low, decreasing exponentially in the number of stages.
However, achieving high accuracy with a CM Sketch is only possible with an untenable amount of memory. According to the theoretical work on the CM Sketch [13], the measurement error of the CM Sketch can be related to the amount of available memory. Concretely, a CM Sketch with ⌈ln(1/δ)⌉ stages, each with ⌈e/ε⌉ counters, guarantees a probability of less than δ that the overestimation error amounts to more than a share ε of total traffic. Assuming that a switch with an aggregate bandwidth of 1 Tbps handles 10 million flows [9] and that the overestimation should almost never exceed 50% of the average flow size (hence, δ = 0.01 and ε = 0.5/10⁷), the CM Sketch would require around 250 million counters. This memory consumption is even higher than allocating a counter per flow, which demonstrates that the CM Sketch is inaccurate on high-capacity routers.
Moreover, conventional sketches have been shown to be too inefficient to keep up with usual line speeds [22].
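A minimal Count-Min Sketch can be sketched in a few lines of Python. The class below is our own illustration (stage and width parameters are arbitrary), followed by the sizing rule from [13] applied to the 1 Tbps / 10-million-flow scenario discussed above:

```python
import hashlib
import math

class CountMinSketch:
    """Minimal Count-Min Sketch: d stages of w shared counters;
    a flow's estimate is the minimum over its d counters."""
    def __init__(self, stages=4, width=1024):   # illustrative parameters
        self.d, self.w = stages, width
        self.ctr = [[0] * width for _ in range(stages)]

    def _slot(self, flow_id, stage):
        # one independent hash function per stage
        h = hashlib.blake2b(f"{stage}:{flow_id}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def update(self, flow_id, nbytes):
        for i in range(self.d):
            self.ctr[i][self._slot(flow_id, i)] += nbytes

    def estimate(self, flow_id):
        return min(self.ctr[i][self._slot(flow_id, i)] for i in range(self.d))

cms = CountMinSketch()
cms.update("flowA", 1500)
cms.update("flowA", 1500)
cms.update("flowB", 400)
print(cms.estimate("flowA"))  # at least 3000: CM never underestimates

# Sizing per the guarantee cited above: ceil(ln(1/delta)) stages of
# ceil(e/eps) counters each, for the 1 Tbps / 10 million flow scenario.
delta, eps = 0.01, 0.5 / 10**7
counters = math.ceil(math.log(1 / delta)) * math.ceil(math.e / eps)
print(counters)  # on the order of the ~250 million counters cited above
```

The estimate is the minimum over the flow's counters, so collisions can only inflate it, never deflate it; this is the one-sided error that the ⌈ln(1/δ)⌉ × ⌈e/ε⌉ sizing bounds.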
To achieve line rate, accuracy has to be traded for speed, which exacerbates the estimation error of sketches. In this work, we refine the sketch-based approach to achieve high accuracy with low processing complexity.
2) Selective Individual-Flow Monitoring:
Another category of large-flow detection algorithms dynamically selects a subset of flows for individual-flow monitoring. In the following, we illustrate the general idea of these schemes using the example of EARDet [38], which is one specimen of this category. Other examples include HashPipe [33] and HeavyKeeper [40].
EARDet is based on the Misra-Gries (MG) algorithm [25], which finds the exact set of frequent items (i.e., items making up more than a 1/k-share of the stream) in two passes with limited counters. At a high level, the MG algorithm uses an array of counters to track frequent-item candidates. By adjusting the counter values and associated items, the MG algorithm guarantees that every frequent item will occupy one counter after the first pass. The second pass is nevertheless required to remove falsely included infrequent items.

Fig. 2. Details of the structure and information flow of the LOFT Probabilistic OFD component (cf. also Figure 1).

For each item in the stream, the MG algorithm adjusts the counters as follows. It first checks whether the item has occupied a counter in the array. If so, the corresponding counter is increased by one. Else, if there is a non-occupied counter, the MG algorithm assigns that non-occupied counter to track this item (and also increases it by one). Otherwise, it decreases all counters by one. The intuition is that infrequent items, should they be assigned to a counter, are likely to be evicted (i.e., their counter becomes zero) very quickly. By contrast, frequent items (which are already more likely to be assigned to a counter to begin with) are guaranteed to remain assigned to that counter, since their frequency compensates for the occasional counter decreases.
EARDet enhances the MG algorithm to identify large network flows in one pass. EARDet is based on two flow specifications of the form γt + β, where γ is the allowed rate and β is the allowed burst size. The adaptations guarantee that a flow sending less traffic than γ_l·t + β_l during any time window with length t will not be falsely blocked (no false positive), and all large flows sending more than γ_h·t + β_h in any time window with duration t will be caught (no false negative).
To catch overuse flows, EARDet could be configured to have γ_h·t + β_h set to the permitted bandwidth. However, given a low permitted bandwidth, this approach either demands many counters or suffers from high false-negative rates, as EARDet recommends using at least linkBW/γ_h − 1 counters to achieve guaranteed detection. If γ_h equals the maximum size of a non-overuse flow, fast memory would need to accommodate almost a counter per flow, which is infeasible on routers that handle millions of flows (cf.
§II-A).
In addition to their high fast-memory consumption, the overhead of selective individual-flow monitoring, i.e., continuously deciding which flows to monitor closely, is prohibitively expensive in terms of processing complexity, which results in an insufficient throughput of such schemes (cf. §IV-H).

III. LOFT ALGORITHM
LOFT is a novel design approach for a probabilistic overuse-flow detector (OFD), the core component of a flow-policing model in router packet processing. We first provide an overview of LOFT (§III-A) before describing the design in more detail (§III-B-III-D). We also provide a complexity analysis of the algorithm (§III-E).
TABLE I
NOTATION USED IN THIS PAPER

Symbol     Description
N          Number of flows
γ, β       Rate and burst threshold of the flow specification
ℓ          Overuse ratio
θ          Number of minor cycles since last reset
ω          Number of minor cycles per second
Z          Number of minor cycles per major cycle
θ_reset    Reset cycle
λ          Sample rate
W          Number of counters in fast memory
W_fm       Number of precisely monitored flows
U          Flow-size estimate
A          Accumulated flow size
C          Accumulated flow count
A. Overview
In this section, we describe our design for the probabilistic OFD component depicted in Figure 1. As Figure 2 shows, the LOFT OFD contains four components: the update algorithm, the estimate algorithm, the sampler, and the flow table. Packets not rejected by the blacklist are forwarded to the update algorithm and the sampler. The estimate algorithm consumes their output, updates the flow table, and creates a list of suspicious flows for precise monitoring.
The LOFT update algorithm targets one short time interval (e.g., 12.5 ms) at a time, which we call a minor cycle. For each minor cycle, the update algorithm collects aggregated traffic information over groups of flows by using a single counter array. For every packet, LOFT maps the packet's flow ID to one counter in the current counter array and increases that counter by the packet size. At the end of the minor cycle, the counter array is passed to the estimate algorithm.
The estimate algorithm operates on a larger time scale, at intervals (e.g., with duration 250 ms) that we call major cycles (see Figure 3): every time a major cycle is concluded, the estimate algorithm analyzes the counter arrays stored from all minor cycles during the major cycle, extracts estimates for the bandwidth utilization of every flow, and stores the estimates in a flow table. From this, the algorithm produces a list of suspected overuse flows, which is handed to the precise-monitoring component. The sampler provides a list of active flows which the estimate algorithm uses in its analysis.
In summary, the estimate algorithm uses the sequence of counter arrays generated by the update algorithm to create a final flow estimate in each major cycle. Figure 3 visualizes how these algorithm components interact. By depicting packets of two different flows, the figure shows how flows are mapped to different counters in each minor cycle. Based on a packet's flow ID, the update algorithm increases the associated counter by the packet size.
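The per-packet update step just described can be sketched in a few lines of Python (our own illustration; the array width, hash stand-in, and packet sizes are assumed values, not LOFT's parameters):

```python
# Sketch of the LOFT update step: one counter array per minor cycle, with
# the flow-to-counter mapping re-randomized by seeding the hash with the
# (major cycle, minor cycle) pair. All names and sizes are illustrative.
W = 8  # counters per array (tiny, for illustration)

def H(j, k, flow_id):
    # stand-in for the per-minor-cycle hash function H_{j,k}
    return hash((j, k, flow_id)) % W

def update(ctr, j, k, flow_id, pkt_size):
    ctr[H(j, k, flow_id)] += pkt_size

# one minor cycle (j=0, k=0): each packet lands in its flow's counter
ctr = [0] * W
update(ctr, 0, 0, "a", 1500)
update(ctr, 0, 0, "a", 100)
update(ctr, 0, 0, "b", 40)
print(ctr[H(0, 0, "a")])  # 1600 for flow "a", plus 40 if "b" collides
```

At the end of the minor cycle, `ctr` would be moved to main memory and replaced by a fresh array, while the next cycle uses a new (j, k) seed and hence a new flow-to-counter mapping.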
Every major cycle, the estimate algorithm first updates the flow table based on the counter arrays generated in the most recent Z minor cycles, recomputes the estimate for every active flow, and creates a watchlist from the W_fm largest flows.

Fig. 3. LOFT Timeline. ① A hash value is computed on the flow ID of the incoming packet, and the corresponding counter is increased by the packet size. ② A different set of counters and hash function are used when the minor cycle changes. ③ After Z minor cycles, the estimate algorithm aggregates the counter values and tries to identify a group of overuse flows.

The flows on the watchlist undergo precise monitoring in the subsequent major cycle. This precise monitoring is performed by the leaky-bucket algorithm [36], which detects any violation of a flow specification in the form γt + β (cf. §II) without false positives. Any flow that is found misbehaving by precise monitoring can thus be blocked by inserting it into the blacklist.
In the following, we explain the design of the update algorithm (§III-B), the estimate algorithm (§III-C), and the sampler (§III-D). The pseudocode of the whole system is presented in Algorithm 1. The relevant notation is presented in Table I.

B. Update Algorithm
Similar to a sketch, the LOFT update algorithm maps each flow to one counter in a counter array. Each counter tracks the aggregate bandwidth of all the flows mapped to that counter during a minor cycle. The association of flows to counters is randomized by changing the hash function for every minor cycle. When a minor cycle ends, the counter array is moved to main memory and an empty counter array is initialized in fast memory.
More formally, let ctr_{j,k} be the counter array generated in the k-th minor cycle of major cycle j, and H_{j,k} be the hash function used in that minor cycle. The value of ctr_{j,k}[x] is the total packet size of flows that have been mapped to x by H_{j,k}, i.e., all flows f for which H_{j,k}(f) = x.

C. Estimate Algorithm
At the end of a major cycle, which contains a certain number Z of minor cycles, the estimate algorithm performs a flow-size estimation for every flow asynchronously (e.g., in user space of the router), while the update algorithm continues to aggregate traffic information. The flow-size estimate builds on two values: the volume sum and the cardinality sum.
For the volume sum, the estimate algorithm sums up the values of all the counters to which a flow was mapped. This aggregation over time reduces the counter noise in the sense of uneven flow size within the same counter: intuitively, an overuse flow, unlike a non-overuse flow, will be consistently associated with large counter values, which results in a large volume sum for that flow.
Although using multiple counters can reduce counter noise, it is neither sufficient nor innovative, as the Count-Min Sketch [13] uses the same idea (although by applying different hash functions concurrently instead of sequentially) and delivers insufficient performance. Indeed, the key to reducing counter noise lies in the cardinality sum. To compute this sum, an active-flow list is consulted to determine how many flows are associated with each counter within each minor cycle, i.e., the counter cardinality. For every flow, the estimate algorithm sums up the cardinality values of all counters associated with the flow. The cardinality sum reduces the distortion created by the varying cardinalities of counters: intuitively, an overuse flow will be associated with a large counter value even when the number of flows in that counter is small.
When dividing the volume sum by the cardinality sum, the strongest increases in a flow's estimate are produced when the flow is mapped to high-value counters that contain a small number of flows. Indeed, flows with these characteristics are highly likely to be the largest flows among the investigated flows and are thus candidates for more precise monitoring.
Formally, after major cycle j, we define the estimate of a flow f to be U^(j)_f = A^(j)_f / C^(j)_f, where

A^(j)_f = Σ_{j' ∈ J^(j)_f} Σ_{k=1}^{Z} ctr_{j',k}[H_{j',k}(f)]    (III.1a)

C^(j)_f = Σ_{j' ∈ J^(j)_f} Σ_{k=1}^{Z} |ctr_{j',k}[H_{j',k}(f)]|    (III.1b)

where J^(j)_f contains all major cycles j' ≤ j in which flow f was active. The term |ctr_{j',k}[x]| denotes the number of flows that have been mapped to counter x in minor cycle k of major cycle j' (counter cardinality). A^(j)_f is the value aggregate of the counters that f has been mapped to (volume sum) and C^(j)_f is the summed count of the flows in these counters (cardinality sum). To avoid preserving counter arrays from past major cycles, the terms A_f and C_f are kept in the flow table and updated after every major cycle, i.e.,

table[f].A ← table[f].A + Σ_{k=1}^{Z} ctr_{j,k}[H_{j,k}(f)]    (III.2)

and analogously for C^(j)_f. These updates are made for all flows f that were active in the most recent major cycle and are thus in the active-flow list generated by the sampler (cf. §III-D).
These estimates need to be adjusted when some flows send intermittently. For example, suppose flow f1 sends x GB in the first and the third major cycle and nothing in the second major cycle, and flow f2 sends x GB in each of the first to the third major cycle, i.e., J^(3)_1 = {1, 3} and J^(3)_2 = {1, 2, 3}. Then A^(3)_1/C^(3)_1 and A^(3)_2/C^(3)_2 will be almost the same. Suppose all counters contain exactly y flows; then A^(3)_1/C^(3)_1 = (x + x)/(y + y) = x/y = (x + x + x)/(y + y + y) = A^(3)_2/C^(3)_2.
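A quick numeric check of this intermittency example (our own sketch, taking x = 1 and y = 10 and folding the Z minor cycles of each major cycle into a single aggregated counter) confirms that the raw estimates coincide, and that the |J_f|/j correction introduced next separates the two flows:

```python
from fractions import Fraction

# Flow f1 is active in major cycles {1, 3}; flow f2 in {1, 2, 3}. Each
# active cycle contributes x to the volume sum and y to the cardinality
# sum (assuming exactly y flows per counter; Z folded into the units).
x, y, j = Fraction(1), Fraction(10), 3
A1, C1 = 2 * x, 2 * y   # f1: two active cycles, 2x total traffic
A2, C2 = 3 * x, 3 * y   # f2: three active cycles, 3x = 1.5 * 2x total
print(A1 / C1 == A2 / C2)  # True: the raw estimates cannot tell them apart

# The |J_f|/j correction scales each estimate by the fraction of major
# cycles in which the flow was active (Eq. III.3):
u1 = Fraction(2, j) * A1 / C1
u2 = Fraction(3, j) * A2 / C2
print(u2 / u1)  # 3/2, matching the 1.5x traffic ratio
```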
However, the total traffic sent by flow f2 in these three cycles is actually 1.5 times larger than that of flow f1 and should result in a higher flow-size estimate.
To fix this problem, we reduce a flow-size estimate U_f relative to the number of major cycles in which flow f was not active, i.e.,

U^(j)_f = (|J^(j)_f| / j) · A^(j)_f / C^(j)_f    (III.3)

To enable this computation, the flow table must track |J_f| for every flow f.
Another issue is that, as A_f and C_f are accumulated, the estimate algorithm is effectively computing their average over time. An attacker can take advantage of this approach by sending low-rate traffic for an initial period of time and then switching to bursty traffic. Such a flow may not be detected by our system, as its long-term average looks the same as that of a non-overuse flow, so old values must be discarded at some point. Therefore, we define the reset cycle θ_reset and clear all the data every θ_reset minor cycles. This reset may sound risky, as an overuse flow could send its traffic around the reset point so that its estimated size is reset before being detected by our system. However, an attacker does not know the reset point. Moreover, even if an attacker could infer the reset point, the overuse traffic sent by a flow with such a strategy is bounded, as we show in the mathematical analysis (Appendix A).

D. Sampler
LOFT requires a list of active flows for which an estimate must be computed. To generate such an active-flow list, we use sampling and limit the number of sampled packets per second to λ. LOFT uses a randomized sampling period, drawn from an exponential distribution with mean 1/λ. This randomization prevents an attacker flow from circumventing the sampling by sending at the appropriate moments.
Having an active-flow list for a major cycle j also allows computing the cardinality |ctr_{j,k}[x]| of counters in the estimate algorithm, namely by counting how many active flows were mapped to each counter with the respective hash function H_{j,k} for any minor cycle k. An alternative to this reconstruction of counter cardinality would consist in measuring counter cardinality within the update algorithm, for example using a Bloom filter [8] or HyperLogLog register [17] per counter.

Algorithm 1
LOFT algorithm.

procedure PROCESS(pkt)
    if pkt.flowID ∈ blacklist then return
    if pkt.flowID ∈ watchlist then MONITOR(pkt)
    j ← GETMAJORCYCLE()
    k ← GETMINORCYCLE()
    SAMPLER(pkt)
    UPDATE(pkt, j, k)
    if Z · j ≥ θ_reset then RESET()

procedure SAMPLER(pkt)
    if current time ≥ sample_time then
        activeFlow ← activeFlow ∪ {pkt.flowID}
        u sampled uniformly from U(0, 1)
        sample_time ← sample_time − ln(u)/λ

procedure UPDATE(pkt, j, k)
    x ← H_{j,k}(pkt.flowID)
    ctr_{j,k}[x] ← ctr_{j,k}[x] + pkt.size

procedure ESTIMATE()
    j ← GETMAJORCYCLE() − 1
    for k = 1 to Z do
        for f ∈ activeFlow do
            x ← H_{j,k}(f)
            numFlow[x] ← numFlow[x] + 1
        for f ∈ activeFlow do
            x ← H_{j,k}(f)
            A[f] ← A[f] + ctr_{j,k}[x]
            C[f] ← C[f] + numFlow[x]
    N ← |activeFlow|
    for f ∈ activeFlow do
        table[f].A ← table[f].A + A[f]
        table[f].C ← table[f].C + C[f]
        table[f].numJ ← table[f].numJ + 1
    activeFlow ← ∅
    return W_fm flows with largest (table[f].numJ / j) · table[f].A / table[f].C

However, as fast memory is the bottleneck resource, additional computational complexity in the estimate algorithm is preferable to estimating cardinality in fast memory.
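The SAMPLER procedure can be exercised in isolation. The following sketch (our own, with an assumed rate λ and a synthetic packet stream) draws exponential inter-sample gaps exactly as in Algorithm 1:

```python
import math
import random

random.seed(7)
lam = 1000.0        # target samples per second (assumed value)
sample_time = 0.0   # next sampling instant
active_flows = set()

def sampler(now, flow_id):
    # mirrors SAMPLER(pkt): sample when the clock passes sample_time,
    # then advance it by an Exp(lambda)-distributed gap (-ln(u)/lambda)
    global sample_time
    if now >= sample_time:
        active_flows.add(flow_id)
        u = random.random()  # u ~ U(0, 1)
        sample_time = sample_time - math.log(u) / lam

# synthetic stream: 10,000 packets over one second, 50 flows round-robin
for i in range(10_000):
    sampler(i * 1e-4, f"flow{i % 50}")
print(len(active_flows))  # distinct flows that ended up on the active list
```

With roughly λ samples per second, a moderate number of active flows is discovered quickly, while the randomized gaps leave an attacker no predictable sampling instants to avoid.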
E. Complexity Analysis
In summary, the update algorithm uses O(W) fast-memory entries, where W is the width of a counter array. The number of read and write operations is linear in the number of packets. The estimate algorithm uses O(ZW + N) entries in DRAM, and the number of accesses is O(ZN).
1) Time complexity:
In the update algorithm, each packet requires a single read and a single write operation to fast memory. Moreover, the algorithm requires a single hash-function computation, which results in the major advantage that LOFT achieves line rate (see §IV-H), whereas other sketch-based algorithms update multiple counter arrays per packet and therefore fall short of that goal [22]. The hash computation itself can be performed in hardware or using an efficient software implementation such as the murmur3 hash function.
The estimate algorithm uses multiple accesses to main memory. However, the number of accesses is linear in the number of active flows, which may be much smaller than the number of packets. In each major cycle, we need to sum up the corresponding counters in each minor cycle for each flow. Suppose there are N active flows; then there are O(ZN) DRAM reads to compute the updates to the estimate components. After obtaining these values, we need to update the flow table. For each active flow, the algorithm performs one lookup and update to the hash table. We use a Cuckoo hash table [27], which has worst-case constant lookup time. Although the worst-case insertion time of the Cuckoo table might be long, its expected complexity is amortized constant. Additionally, the number of insertions is much lower than the number of lookups, as each flow is inserted only once, when it is first seen. Therefore, the update takes O(N) time.
In the sampler, for every sampled packet, there is one insertion to the active-flow list stored in DRAM, for which we again use a Cuckoo hash table. By properly setting the table capacity and sample rate λ, our experiments show that the sampler is still fast enough to keep up with line speeds.
2) Space complexity:
Only the counters of the current minor cycle reside in fast memory, which is O(W) and depends on the size of a counter. Our analysis shows that, if all flows send almost at threshold rate, a small W (e.g., 8192) can lead to considerable detection delay.

Other counters are kept in main memory before being handled by the estimate algorithm, which takes O(ZW) space. The active-flow list and the flow table are also in DRAM. The active-flow list requires O(N) entries, and the space complexity of the flow table is also linear in N (using a Cuckoo table). Therefore, the total number of main-memory entries used (including counters) is O(ZW + N).

IV. EVALUATION
The evaluation of LOFT is conducted through two implementations and a simulation. First, the behavior of LOFT in a real-world environment is evaluated using the implementations and a testbed that supports up to 4x40 Gbps traffic volume. Second, to evaluate the accuracy of LOFT and compare it to EARDet, AMF, HeavyKeeper, and HashPipe, simulations with a traffic volume equivalent to 4x100 Gbps are used.
A. Implementation
For the scalability experiments with DPDK, we implemented LOFT in C on the Intel DPDK framework [28]. The application uses n worker threads that execute the update algorithm and a separate thread running the estimation algorithm every major cycle. The major and minor cycle indices are computed based on a monotonic clock with nanosecond resolution.

For further scalability experiments, we also implemented the update algorithm of LOFT on a Xilinx Virtex UltraScale+ FPGA on the Netcope NFB 200G2QL platform [2] containing two 100 Gbps NICs. Presenting the detailed contributions of the FPGA implementation would exceed the scope of this paper; therefore, a separate paper illustrating the design of the FPGA implementation has been submitted to a specialized conference [4].

To test the accuracy of LOFT, we evaluated LOFT and the other algorithms in simulations using Rust (LOFT, EARDet, AMF) and Golang (HeavyKeeper, HashPipe).

B. Experiment Setup
Test Setup. The testbed for the DPDK scalability experiment consists of two machines connected using 4x40 Gbps Ethernet connections. The traffic is generated using a dedicated traffic generator (Spirent TestCenter N4U [34]) and sent to a commodity machine with an Intel Xeon E5-2680 CPU. For the simulations, we execute LOFT on an Intel Xeon 8124M machine and process synthesized traffic equivalent to 4x100 Gbps.
Traffic Generation. To evaluate the scalability of LOFT with respect to flow volumes, we generated network traces with uniform packet sizes, as well as traces based on an iMix traffic distribution (avg. size: 353 B) [26].
C. Evaluation Metric: Detection Delay
The detection delay of an overuse flow is the time elapsed between the first violation of the flow specification and the time the flow is caught by a detector. A detection delay longer than the simulation timeout results in a false negative.

When presenting detection rates and false negatives, it is always crucial to present the corresponding false-positive rate, i.e., the number of non-overuse flows incorrectly marked as malicious. We note, however, that LOFT is designed to have no false positives, because benign flows that happen to be flagged as suspicious by the estimation algorithm will be exonerated by precise monitoring.
D. Parameter Selection
Several of LOFT's parameters can be tuned to fit the hardware restrictions of a specific deployment. In the following, we describe how to experimentally determine these parameters so that they comply with the hardware's computational constraints even under worst-case traffic patterns.

To experimentally determine the sampling rate, we increase the sampling rate of the sampler until its CPU core is fully utilized, such that the accuracy of the flow-ID list is maximized. With this determined sampling rate and an estimated maximum number of flows, we calculate the maximum number of major cycles per second such that there are enough samples between two executions of the estimation algorithm to build the flow-ID list with the desired accuracy. Finally, we increase the number of minor cycles per second until the CPU core of the estimation algorithm is fully utilized. Table II summarizes these hardware-related parameters used in our experiments.

The flow monitor is set to monitor 64 flows simultaneously in all experiments, which is small and fast enough to keep up with a high-bandwidth link. Finally, with the determined parameters in Table II, the reset cycle parameter θ_reset in each experiment is adjusted to achieve 95% detection probability (detailed in the mathematical analysis in Appendix A) under the given experiment setting.

TABLE II
HARDWARE-RELATED LOFT PARAMETERS.

Parameter                  Value
Sampling rate (λ)          8 · 10^5 samp./s
Number of minor cycles     64 cycles/s
Number of major cycles     1 – 4 cycles/s

E. Comparison: Fully Utilized Traffic Trace
We first evaluate LOFT in a setting where every flow sends at a rate close to the maximum allowed threshold, fully utilizing the reserved bandwidth. We simulate a configuration with 4x100 Gbps links and an aggregate number of 130'000 flows, where each flow requires 3 Mbps, e.g., for high-quality video streaming. Then, a misbehaving flow with an overuse ratio ℓ is injected. Each detector allocates 16'448 counters in fast memory. LOFT, AMF, HashPipe, and HeavyKeeper use 64 counters (out of 16'448) as flow monitors. To optimize detector performance for AMF, HashPipe, and HeavyKeeper, the fast-memory counters are structured as counter arrays. LOFT is configured to run 4 cycles of the estimation algorithm per second, which reaches the computation limit on our machine.

Figure 4(a) shows the detection delays of LOFT, EARDet, HashPipe, HeavyKeeper, and AMF under different overuse ratios ℓ on a log-log scale. Each data point is averaged over 100 runs. We find that LOFT detects a 1.50x overuse flow in less than one second, whereas all other detectors fail to detect it before the 300 s timeout. For larger overuse flows, LOFT still outperforms AMF, EARDet, and HashPipe/HeavyKeeper when the overuse ratio is less than 400x, 7x, and 3x, respectively. As opposed to HashPipe and HeavyKeeper, LOFT delivers high accuracy at a lower variance. Moreover, LOFT can achieve much higher throughput (cf. §IV-H).

The reason that LOFT is slower in detecting high-rate overuse flows is that it needs at least one major cycle, which takes 0.25 s, to select the overuse flow. These results confirm that LOFT can efficiently detect overuse flows using a small amount of router resources.
While heavy-hitter detection schemes perform better than LOFT regarding the extremely large flows that they are designed to catch, these schemes are ineffective in low-rate overuse flow detection, which is the goal of this paper.

As Figure 4(b) shows, these results are confirmed by performing the same experiments with 10 million flows, which is the number of flows to be expected on a Tbps link (cf. Section II-A). For 10 million flows, the higher accuracy of LOFT is even more prominent, as even the best other schemes (i.e., HashPipe and HeavyKeeper) fail to detect the overuse flows for all overuse ratios below 20x.
F. Comparison: CAIDA Traffic Trace
In addition to using synthesized background traffic in which every non-overuse flow fully utilizes the reserved bandwidth, we also compare detectors on an OC192 link using a real traffic trace, namely the CAIDA New York Anonymized Internet Trace [9]. In the trace, the majority of the flows are smaller than 512 bytes per second, and 95% of the flows are smaller than 14'000 bytes per second.

Fig. 4. Detection delay of overuse flow detectors under different overuse ratios and flow numbers N: (a) N = 130 K, (b) N = 10 M. The error bars show the minimum and maximum values over 100 runs.

We first regulate every flow in the CAIDA trace with the permitted bandwidth γ. Flows that use more than the permitted bandwidth are governed by dropping overuse packets. Then, a 2-fold overuse flow is added to the traffic trace. This setting reflects the scenario where an ISP wants to mitigate DDoS by putting a bandwidth cap on individual flows. Because the number of flows in the CAIDA traffic is smaller than in the fully utilized traffic trace, on this smaller network we allocate only 2048 + 64 counters to all detectors (64 flow monitors). LOFT is configured to run 4 slices of the estimation algorithm per second.

Figure 5 shows the detection delays of LOFT, EARDet, AMF, HashPipe, and HeavyKeeper under different permitted bandwidths on a log-log scale. Each data point is averaged over 100 runs. LOFT detects the 2-fold overuse flow in 0.30 – 1.50 s. EARDet and AMF can quickly and reliably detect the 2-fold overuse flow only when it is much larger than the majority of the non-overuse flows. HashPipe and HeavyKeeper also catch the overuse flow given a low threshold, but these schemes still require a multiple of LOFT's detection time to identify the overuse flow. Furthermore, the throughput of these schemes is substantially lower than the throughput of LOFT (cf. §IV-H).
Fig. 5. Detection time of overuse flow detectors to catch a 2-fold overuse flow given CAIDA traffic with different regulation thresholds γ (4K – 448K bytes/s) and β = 1500 for all.

G. LOFT Sensitivity Tests
We have shown that, given the same memory budget, LOFT can detect low-rate overuse flows much faster than AMF, EARDet, HashPipe, and HeavyKeeper. The following experiments further investigate LOFT's effectiveness by varying its parameters under a background-traffic setting that is challenging for all evaluated detectors. LOFT is configured to run one slice of the estimation algorithm per second, since it has to process more flows in the following tests.

We consider a half-utilization scenario, in which half of the non-overuse flows send up to the permitted bandwidth and the rest send almost negligible traffic. This is a more challenging scenario for LOFT than full utilization, because the variance between counters is affected not only by the number of flows aggregated but also by the variance of flow sizes. This half-utilization scenario can capture the behavior of typical streaming traffic, in which one direction is used to send data and the other to send ACKs. Half of the non-overuse flows send traffic up to the flow specification γ with β = 1500, and the other half send 25 times less traffic. As before, one overuse flow is injected in the simulation; this ℓ-fold overuse flow sends at ℓ times the permitted rate γ, with β = 1500. Each data point of detection delay is the average of 100 simulations.

Flow counting drastically reduces the detection time. Figure 6 shows that the estimator with flow counting and dividing significantly outperforms the one without counting. This result supports our perspective in Section III-A that the detection accuracy of sketches suffers from not taking the number of flows into account. In other words, LOFT is highly accurate thanks to its efforts to reduce the counter noise that stems from the variance of counter cardinality.
Fig. 6. LOFT's detection time of an overuse flow with and without counting. N = 130 K, W = 16'384, ℓ = 1.50.

Memory budget. In this experiment, we investigate the impact of fast-memory size on the detection delay. Figure 7(a) shows the cumulative distribution of the detection delay given different numbers of fast-memory counters, ranging from 1024 to 16'384. As the number of fast-memory counters is doubled, LOFT's detection speed increases, since lowering the number of flows sharing a counter reduces the variance of each counter. However, even with the maximum number of counters, and including the memory required for monitoring suspicious flows, the fast-memory consumption of LOFT is around 130 kB, which represents more than an order of magnitude reduction compared to the 3 MB of fast memory needed for individual-flow monitoring under the same traffic conditions (cf. §II-A).

Number of non-overuse flows. We evaluate the impact of the number of non-overuse flows on the detection delay. Using 1024 fast-memory counters and one 2-fold overuse flow, Figure 7(b) shows the cumulative detection delay given different numbers of non-overuse flows, ranging from 100'000 to 400'000. The detection delay grows with the number of non-overuse flows, because the variance of counter cardinality increases as the total number of flows grows.
Imprecise active-flow list. In our previous experiments, the fixed sampling rate (as defined in Table II) is sufficient to maintain a precise active-flow list. Given that maintaining this list is bound by the hardware-limited sampling rate, the number of active flows might be too high to build a precise active-flow list. For example, for 400'000 active flows and a sampling rate of 800'000 samples per major cycle, the active-flow list will miss about 15% of active flows.

To understand the impact of an imprecise active-flow list on LOFT, we evaluate different miss rates of active-flow lists. Figure 7(c) shows that, in the half-utilization scenario, LOFT performs worse with increasing imprecision of the active-flow list. This is because LOFT will use an inaccurate number of flows to calculate estimates, which leads to higher variance. Nevertheless, even when missing 20% of active flows, LOFT can still catch the overuse flow within 14 s with 95% probability.
H. Scalability of LOFT
1) DPDK Implementation:
To understand the scalability of LOFT in a DPDK environment, we evaluate the maximum packet rate with respect to the packet size and the number of cores that execute the update algorithm concurrently. Figure 8(a) shows that LOFT is able to achieve line rate for iMix-distributed traffic using 16 cores that execute the update algorithm and one core that runs the estimation algorithm. For smaller numbers of cores, line rate can only be achieved for larger packet sizes (1024 B).
Fig. 7. CDF of LOFT's detection delay given the number of counters (W), the number of flows (N), and the sampling miss rate (r). (a) N = 400 K, r = 0%, varying W. (b) W = 1024, r = 0%, varying N. (c) N = 400 K, W = 16'384, varying r.
Fig. 8. DPDK implementation results. (a) Throughput given number of logical cores and packet sizes (64 B – 1500 B and iMix). (b) Overhead compared to regular L3 forwarding for 8+1 cores.

Since traffic flows with small packet sizes perform considerably worse than flows with large packet sizes, we additionally evaluate the overhead introduced by LOFT by comparing it to regular L3 packet forwarding in DPDK. Figure 8(b) shows the throughput of regular L3 forwarding and the throughput of forwarding with additional LOFT processing for different packet sizes. Moreover, the figure gives the LOFT throughput as a percentage of the corresponding base throughput. As visible in the figure, even regular L3 packet forwarding using eight processing cores cannot achieve line rate for packet sizes smaller than 1024 B. Compared to regular packet forwarding, LOFT introduces overhead for small packet sizes, which results in a maximum packet rate of ~50 million packets per second (Mpps) using eight processing cores, i.e., ~6 Mpps per core. As this maximum packet rate is not exhausted for larger packet sizes, the effect diminishes. However, LOFT still achieves a much higher packet rate than the alternative schemes with the best accuracy, i.e., HashPipe and HeavyKeeper: prior work has shown a packet rate of ~2 Mpps per core for HashPipe [39] and ~2.50 Mpps per core for HeavyKeeper [40].
2) FPGA Implementation:
We also implemented the fast-path component of LOFT (i.e., the update algorithm using 16'384 counters) on a Xilinx Virtex UltraScale+ FPGA on the Netcope NFB 200G2QL platform [2] with two 100 Gbps NICs and an operating frequency of 200 MHz. The LOFT implementation can process a packet in every cycle. For minimum-size packets of 64 B, each NIC manages to transfer one packet per cycle to the LOFT implementation, which allows achieving a packet rate of 200 Mpps per NIC. As the FPGA platform contains two NICs, it achieves a total packet rate of ~400 Mpps. This high throughput demonstrates that LOFT is suitable for high-speed packet processing if implemented on programmable NICs. The full implementation is described in a paper submitted to a specialized conference [4], as the FPGA-specific implementation details represent a contribution that goes beyond the scope of this paper.

V. THEORETICAL ANALYSIS RESULTS
In the mathematical analysis provided in Appendix A, we analyze the overuse-flow detection and derive a lower bound (Equation A.15) on the probability that an overuse flow is detected within a reset cycle, i.e., before a reset. This bound depends on a number of LOFT's parameters, the most important being the duration of a reset cycle T_reset (where T_reset ≜ ω − θ_reset) and the number of fast-memory counters W.

By requiring that the lower bound be equal to a desired detection probability, we compute an upper bound on the required duration T_reset of a reset cycle for a given number of counters in order to achieve that detection probability. Table III shows the reset cycle duration T_reset calculated under the settings from the experiment of Figure 7(a), compared to the real detection delays as obtained by our analysis in Section IV-G. The result shows that our analysis does not underestimate the detection delay, since the upper bound on the reset-cycle duration, which is the worst-case estimate of the 95th percentile of the detection delay, is always higher than the values obtained in our evaluation. The reasons why the estimated delay is 2 – 5x higher than the real one are twofold. First, the synthesized traffic in our experiments does not always represent the worst-case scenario. Second, LOFT ranks and detects overuse flows at the end of every run of the estimation algorithm, but our analysis ignores the probability that overuse flows might be caught at some point before θ_reset, which loosens the bound.

TABLE III
CALCULATED RESET CYCLE DURATION (T_reset) AND THE REAL 95TH PERCENTILE OF THE DETECTION DELAY FROM OUR EVALUATION WITH DIFFERENT NUMBERS OF COUNTERS IN FAST MEMORY IN THE EXPERIMENT OF FIGURE 7(a).

Num. of counters     1024    2048    4096    8192   16'384
Reset cycle (s)       231     116      58      29       15
Detection delay (s)  43.03   20.48   14.04    8.67     6.52
VI. RELATED WORK
Despite a significant amount of research on the problem of detecting high-rate overuse flows, the problem of efficiently detecting low-rate overuse flows has so far been neglected. A number of related schemes have been proposed to address similar problems, such as large-flow detection and top-k flow detection, and their ideas might be applicable to detecting overuse flows. However, as we discuss in the following paragraphs, they are either orthogonal to our research direction or insufficient to solve our problem as stated in Section II.

The problem of detecting top-k flows aims to identify the k flows that consume most of a link's bandwidth. A recent proposal called HashPipe [33] tackles the heavy-hitter detection problem on programmable hardware. To ensure line-rate detection given a limited amount of fast memory, HashPipe constructs a pipeline of hash tables to efficiently implement the Space Saving algorithm [24], such that heavier flows are more likely to be kept in the next stage of the pipeline. However, our results in §IV show that HashPipe fails to detect the overusing flow in cases where the difference between overuse and non-overuse flows is low (i.e., with an overuse ratio of 1.50 – 2). Since HeavyKeeper [40], another top-k detection scheme, seems to suffer from the same weakness, these systems appear unable to effectively filter out the counter noise.

Similar to top-k flow detection, large-flow detection algorithms can be applied to detecting overuse flows. Large-flow detection algorithms identify flows that use more than a threshold amount of bandwidth, and it is common that the higher the threshold, the better the performance.
Besides AMF (introduced in Section II-C), CLEF [37] proposes to detect low-rate large flows, which are similar to the low-rate overuse flows in our work, using recursive division and by combining two detectors with complementary properties. Our evaluation shows that LOFT outperforms EARDet, one of the detectors used by CLEF, when the overuse ratio is lower than 7x. The other detector used by CLEF is a sketch and thus inherits the limitations explained in Section II-C.

Hybrid SRAM/DRAM-based architectures for exact counting have been proposed by Shah et al. [32] and were further improved by Ramabhadran and Varghese [29] and by Zhao et al. [41]. By default, these schemes only consider the number of packets per flow, but not the flow sizes. Even though the authors propose an extension to consider flow sizes, the use of probabilistic counting in these schemes introduces a high counter variance, making low-rate overuse flows hard to detect. Lall et al. [21] propose another SRAM/DRAM hybrid data structure for the efficient detection of both medium and large flows. However, because the proposed flow-monitoring solution uses shared counters (in the form of spectral Bloom filters), it performs poorly in catching low-rate overuse flows.

Another approach to tackle the aforementioned problems uses sampling, where only a small subset of packets is used for flow accounting. Sampled NetFlow [12] is a widely deployed solution that collects one out of every n packets and estimates statistics of the original population by extrapolating from the sample. Researchers have proposed advanced sampling algorithms tailored to catching large flows. For example, Sample and Hold [16] and Sticky Sampling [23] are designed to bias toward large flows. Instead of using a static sampling rate, several adaptive sampling algorithms dynamically adjust the sampling rate so as to keep resource consumption under a fixed memory limitation [15], [31].
However, without a sufficiently high sampling rate (resulting in a large amount of fast memory), sampling-based algorithms are prone to false positives and false negatives, as shown by Estan and Varghese [16]. LOFT relies on sampling to generate a list of active flows (if the list is not provided). However, LOFT only requires that at least one packet of a flow appears in a sample, thus requiring a lower sampling rate than accurate flow accounting.

A recent series of works including SketchVisor [19], ElasticSketch [39], and NitroSketch [22] devises techniques to speed up the updating of sketch-based detector data structures. However, these techniques all trade off detection accuracy against processing speed, i.e., these algorithms achieve even lower accuracy than unaltered sketches like AMF, evaluated in Section IV. In contrast, LOFT achieves both high accuracy and a low per-packet overhead.

VII. CONCLUSIONS
Due to the limitations of previous approaches to probabilistic flow monitoring, network operators have so far lacked effective measurement tools that could give accurate insight into the flow-size distribution on high-capacity routers. Given router constraints regarding fast memory and computation, existing schemes suffer from a large measurement error, which only allows the reliable detection of extremely large flows, but not of flows with a small amount of overuse. In this work, we show that the source of this measurement error in sketch-based schemes is counter noise, i.e., the high variance of counter values. Using this insight, we develop LOFT, a sketch-based approach that counteracts the counter noise while respecting the stringent complexity constraints of high-speed routers.

As a result, the measurement error of LOFT is so small that low-rate overuse flows (i.e., flows only 50 – 100% larger than the average flow) can be reliably detected with a small amount of fast memory. Concretely, LOFT can reliably identify a flow that is only 50% larger than the average flow within one second, whereas all other investigated schemes fail to identify such a flow even within 300 s. Moreover, LOFT accomplishes this high accuracy while reducing the fast-memory requirement by more than one order of magnitude in comparison with individual-flow monitoring. We also investigate the scalability and overhead of LOFT with a DPDK and an FPGA implementation, and show that LOFT enables line-rate forwarding of a realistic traffic mix.

With these demonstrated properties, LOFT can serve as a powerful flow-monitoring tool, which will allow network operators to improve the efficacy of existing applications based on flow-size estimation (e.g., flow-size aware routing) and to enable new applications based on such estimates (e.g., reservation-based DDoS defense).

REFERENCES

[1] Intel Skylake CPU architecture characteristics. Accessed August 2018.
[2] Netcope NFB-200G2QL FPGA platform equipped with Virtex UltraScale+ FPGA chip. Accessed September 2020.
[3] Werner Almesberger, Tiziana Ferrari, and J.-Y. Le Boudec. Scalable resource reservation for the internet. In Proceedings of the International Conference on Protocols for Multimedia Systems - Multimedia Networking, pages 18–27. IEEE, 1997.
[4] Anonymous. Paper presenting the efficient FPGA implementation of the LOFT update component (currently under review), 2020.
[5] Ran Ben Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. Randomized admission policy for efficient top-k and frequency estimation. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pages 1–9. IEEE, 2017.
[6] Cristina Basescu, Raphael M. Reischuk, Pawel Szalachowski, Adrian Perrig, Yao Zhang, Hsu-Chun Hsiao, Ayumu Kubota, and Jumpei Urakawa. SIBRA: Scalable internet bandwidth reservation architecture. In Proceedings of the Symposium on Network and Distributed System Security (NDSS), February 2016.
[7] Ran Ben Basat, Gil Einziger, Roy Friedman, Marcelo C. Luizelli, and Erez Waisbard. Constant time updates in hierarchical heavy hitters. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 127–140, 2017.
[8] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[9] CAIDA. The CAIDA UCSD Anonymized Internet Traces - Oct. 18th, 2018.
[10] Kevin K. Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, and Onur Mutlu. Understanding latency variation in modern DRAM chips: Experimental characterization, analysis, and optimization. In Proceedings of ACM SIGMETRICS, June 2016.
[11] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, 2002.
[12] B. Claise. Cisco Systems NetFlow Services Export Version 9. RFC 3954 (Informational), October 2004.
[13] Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[14] C. Estan. Internet Traffic Measurement: What's Going on in my Network? PhD thesis, 2003.
[15] Cristian Estan, Ken Keys, David Moore, and George Varghese. Building a better NetFlow. In ACM SIGCOMM, 2004.
[16] Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems, 21(3):270–313, 2003.
[17] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science, pages 137–156, 2007.
[18] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[19] Qun Huang, Xin Jin, Patrick P. C. Lee, Runhui Li, Lu Tang, Yi-Chao Chen, and Gong Zhang. SketchVisor: Robust network measurement for software packet processing. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 113–126. ACM, 2017.
[20] Min Suk Kang, Soo Bum Lee, and Virgil D. Gligor. The Crossfire attack. In Proceedings of the 2013 IEEE Symposium on Security and Privacy, SP '13, pages 127–141. IEEE Computer Society, 2013.
[21] Ashwin Lall, Mitsunori Ogihara, and Jun Xu. An efficient algorithm for measuring medium- to large-sized flows in network traffic. In IEEE INFOCOM, 2009.
[22] Zaoxing Liu, Ran Ben-Basat, Gil Einziger, Yaron Kassner, Vladimir Braverman, Roy Friedman, and Vyas Sekar. NitroSketch: Robust and general sketch-based monitoring in software switches. In Proceedings of the ACM Special Interest Group on Data Communication, pages 334–350, 2019.
[23] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of VLDB, 2002.
[24] A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pages 398–412. Springer, 2005.
[25] J. Misra and David Gries. Finding repeated elements. Science of Computer Programming, 2(2):143–152, 1982.
[26] A. Morton. IMIX genome: Specification of variable packet sizes for additional testing. RFC 6985, July 2013.
[27] R. Pagh and F. F. Rodler. Cuckoo hashing. In European Symposium on Algorithms, pages 121–133. Springer, 2001.
[28] DPDK Project. Data Plane Development Kit, 2020.
[29] Sriram Ramabhadran and George Varghese. Efficient implementation of a statistics counter architecture. In ACM SIGMETRICS Performance Evaluation Review, volume 31, pages 261–271. ACM, 2003.
[30] Nageswara S. V. Rao and Stephen Gordon Batsell. QoS routing via multiple paths using bandwidth reservation. In Proceedings of IEEE INFOCOM '98, volume 1, pages 11–18. IEEE, 1998.
[31] Josep Sanjuas-Cuxart, Pere Barlet-Ros, Nick Duffield, and Ramana Kompella. Cuckoo sampling: Robust collection of flow aggregates under a fixed memory budget. In IEEE INFOCOM, 2012.
[32] Devavrat Shah, Sundar Iyer, Balaji Prabhakar, and Nick McKeown. Analysis of a statistics counter architecture. In Hot Interconnects, volume 9, pages 107–111, 2001.
[33] Vibhaalakshmi Sivaraman, Srinivas Narayana, Ori Rottenstreich, S. Muthukrishnan, and Jennifer Rexford. Heavy-hitter detection entirely in the data plane. In Proceedings of the Symposium on SDN Research, pages 164–176. ACM, 2017.
[34] Spirent. TestCenter N4U datasheet, 2020.
[35] Ahren Studer and Adrian Perrig. The Coremelt attack. In Computer Security - ESORICS 2009, pages 37–52. Springer, 2009.
[36] Jonathan Turner. New directions in communications (or which way to the information age?). IEEE Communications Magazine, 1986.
[37] Hao Wu, Hsu-Chun Hsiao, Daniele Enrico Asoni, Simon Scherrer, Adrian Perrig, and Yih-Chun Hu. CLEF: Limiting the damage caused by large flows in the internet core. In International Conference on Cryptology and Network Security (CANS), 2018.
[38] Hao Wu, Hsu-Chun Hsiao, and Yih-Chun Hu. Efficient large flow detection over arbitrary windows: An algorithm exact outside an ambiguity region. In Proceedings of the 2014 Internet Measurement Conference (IMC), pages 209–222. ACM, 2014.
[39] Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 561–575. ACM, 2018.
[40] Tong Yang, Haowei Zhang, Jinyang Li, Junzhi Gong, Steve Uhlig, Shigang Chen, and Xiaoming Li. HeavyKeeper: An accurate algorithm for finding top-k elephant flows. IEEE/ACM Transactions on Networking, 27(5):1845–1858, 2019.
[41] Qi Zhao, Jun Xu, and Zhen Liu. Design of a novel statistics counter architecture with optimal space and time efficiency. ACM SIGMETRICS Performance Evaluation Review, 34(1):323–334, 2006.

APPENDIX
We would like to know the probability that LOFT catches overuse flows, and how much damage such flows can inflict before being caught. Our analysis first derives a lower bound on the probability that an overuse flow's estimate exceeds the estimate of a non-overuse flow. We then use this bound to calculate the probability that the overuse flow's estimate falls within the top-$W_{fm}$ largest estimates, which is exactly its probability of being monitored.

Given the complexity of this problem, we make the following assumptions throughout the analysis. We require that the number of flows $N$ in the traffic is fixed during each detection period, and that $N$ is large enough that an attacker cannot manipulate either $N$ or the overall traffic-size distribution. In addition, we assume that the flow estimates, which are random variables in our model, have only weak pairwise dependence, and we therefore treat them as i.i.d. in the following analysis.

A. Lower Bound of the Detection Probability
First, we derive a lower bound on the detection probability given the amount of traffic an overuse flow has sent after LOFT starts a reset cycle. This lower bound lets us analyze how long LOFT needs to catch the overuse flow with sufficiently high probability. By setting a proper reset cycle, our analysis then determines the maximum damage an overuse flow can inflict, and guarantees that LOFT catches the overuse flow with high probability if it exceeds this damage limit.

The detection probability of an overuse flow is exactly the probability that LOFT chooses to monitor the flow and the flow monitor catches its overusing behavior. Here we ignore the time an overuse flow spends in the flow monitor after being reported to it; analyzing and improving the flow monitor is outside the scope of this paper. The detection probability is therefore equal to the probability that the overuse flow is selected by LOFT.

We define $U_i$, $A_i$, and $C_i$ as the estimate, accumulator, and counter of flow $i$ in our algorithm, respectively. They are related as follows:

$U_i = \frac{A_i}{C_i}$ (A.1)

In each minor cycle, every flow except flow $i$ has probability $1/W$ of contributing part of its flow size to $A_i$. We define a flow segment $X_j$ as the amount contributed by another flow $j$ to $A_i$ in a minor cycle. Every $X_j$ can be seen as a random variable with unknown distribution, since it is selected randomly by a uniform hash function. More precisely, suppose $S_a = \{X_1, X_2, \dots, X_a\}$ and $S_b = \{X_{a+1}, X_{a+2}, \dots, X_{a+b}\}$ are added to $A_i$ in minor cycles $x$ and $x+1$, respectively. $S_a$ consists of exactly $a$ uniform samples, drawn without replacement, from the $N$ flow segments sent by the $N$ flows during minor cycle $x$. Likewise, $S_b$ consists of another $b$ uniform samples from minor cycle $x+1$ and is independent of the prior samples $S_a$, since $S_a$ and $S_b$ are sampled independently from two different sets of flow segments.
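To make the counter model above concrete, the following sketch simulates a single LOFT counter under the stated assumptions. The function name, parameter defaults, and the uniform segment sizes are illustrative assumptions for this sketch, not LOFT's actual implementation: flow $i$ contributes its own traffic every minor cycle, while each of the other $N-1$ flows is hashed to this counter with probability $1/W$ and, when it is, adds a segment $X_j \in [0, M]$.

```python
import random

def simulate_estimate(theta=50, N=1000, W=64, F_i=100.0, M=2.0, seed=1):
    """Toy model of one LOFT counter (simplified, assumed update rule):
    flow i adds F_i/theta per minor cycle to its own accumulator; each
    other flow lands here with probability 1/W per cycle and contributes
    a segment X_j drawn uniformly from [0, M] (cf. Eq. A.2).
    Returns the estimate U_i = A_i / C_i (Eq. A.1)."""
    rng = random.Random(seed)
    A, C = 0.0, 0
    for _ in range(theta):            # theta minor cycles
        A += F_i / theta              # flow i's own traffic
        C += 1
        for _ in range(N - 1):        # segments X_j from the other flows
            if rng.random() < 1.0 / W:
                A += rng.uniform(0.0, M)
                C += 1
    return A / C
```

Note how $C_i$ collects roughly $\theta + (N-1)\theta/W$ increments, $\theta$ of them from flow $i$ itself, matching the $c_i - \theta$ foreign segments in Equation A.3.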
These properties allow us to apply Hoeffding's inequality [18] and derive the bound in Equation A.8. Because we assume that overuse flows are a minority and cannot affect the traffic distribution, and because all other flows do not violate the flow specification, $X_j$ also satisfies the following bounds, where $M \triangleq \gamma\omega^{-1} + \beta$:

$0 \le X_j, \quad E[X_j] \le M$ (A.2)

Suppose LOFT has run $\theta$ minor cycles and $c_i$ flow segments have accumulated in $C_i$, and let $F_i$ be the size of flow $i$. Under this condition, we can represent $U_i$ as the random variable

$U_i \mid (C_i = c_i) = \frac{1}{c_i}\Big(F_i + \sum_{j=1}^{c_i - \theta} X_j\Big)$ (A.3)

By Equation A.3, $U_i$ is modeled as the sum of several random variables, and is hence itself a random variable with unknown distribution. As mentioned previously, different estimates $U$ are treated as i.i.d. This assumption is reasonable because, under random dispatching, the probability that any pair of estimates shares many segments $X_j$ (and thus builds strong dependence) is extremely low; a more precise analysis is left for future work.

Let $U_l$ and $U_b$ be the estimates of an overuse flow and a non-overuse flow, respectively. By Equation A.3, we can enumerate all conditions and derive the following equation (in the equations below, we abbreviate $C_i = c_i$ to $c_i$):

$P(U_l > U_b) = \sum_{c_l=1}^{N\theta} \sum_{c_b=1}^{N\theta} P(U_l > U_b \mid c_l, c_b)\, P(c_l, c_b)$ (A.4)

Instead of bounding Equation A.4 directly, we choose a threshold $\tau$ and simplify the bound with the following inequality, which holds since $U_l$ and $U_b$ are i.i.d. under our assumption:

$P(U_l > U_b \mid c_l, c_b) \ge P(U_l > \tau \mid c_l)\, P(U_b < \tau \mid c_b)$ (A.5)

Combining this with Equation A.4, we obtain

$P(U_l > U_b) \ge \sum_{c_l=1}^{N\theta} \sum_{c_b=1}^{N\theta} P(U_l > \tau \mid c_l)\, P(U_b < \tau \mid c_b)\, P(c_l, c_b)$ (A.6)

The goal is now to bound $P(U_i > \tau \mid c_i)$. Suppose $c_b - \theta \ge 1$; for the estimate of the non-overuse flow $b$, combining with Equation A.3, we have the following derivation.
$P(U_b \ge \tau \mid c_b) = P\Big(\frac{1}{c_b}\Big(F_b + \sum_{j=1}^{c_b-\theta} X_j\Big) \ge \tau\Big)$
$= P\Big(\frac{F_b}{c_b} + \frac{c_b-\theta}{c_b} \cdot \frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j \ge \tau\Big)$
$= P\Big(\frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j \ge \frac{c_b}{c_b-\theta}\Big(\tau - \frac{F_b}{c_b}\Big)\Big)$
$= P\Big(\Big(\frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j\Big) - E[X] \ge \frac{c_b}{c_b-\theta}\Big(\tau - \frac{F_b}{c_b}\Big) - E[X]\Big)$ (A.7)

Let $v \triangleq \frac{c_b}{c_b-\theta}\big(\tau - \frac{F_b}{c_b}\big) - E[X]$. If $v \ge 0$, then since $X_j$ is constrained by the inequality in Equation A.2 and satisfies the sampling properties discussed above, we can apply Hoeffding's inequality to obtain an upper bound on Equation A.7:

$P\Big(\Big(\frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j\Big) - E[X] \ge v\Big) \le \exp\Big(-\frac{2(c_b-\theta)\, v^2}{M^2}\Big)$ (A.8)

We choose $\tau$ as the midpoint between the expected values of $U_l \mid c_l$ and $U_b \mid c_b$. Hence, $\tau$ for given $c_l, c_b$ can be represented as

$\tau_{c_l c_b} = \frac{1}{2}\big(E[U_l \mid c_l] + E[U_b \mid c_b]\big) = E[X] + \frac{1}{2 c_l c_b}\big(-\theta E[X](c_l + c_b) + c_b F_l + c_l F_b\big)$ (A.9)

Finally, substituting Equation A.9 for $\tau$ in Inequality A.8 yields the following bound:

$P(U_b \ge \tau_{c_l c_b} \mid c_b) \le P\Big(\Big(\frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j\Big) - E[X] \ge v_b\Big) \le \exp\Big(-\frac{2(c_b-\theta)\, v_b^2}{M^2}\Big)$,
where $v_b \triangleq \frac{1}{2 c_l (c_b-\theta)}\big[\theta E[X](c_l - c_b) + c_b F_l - c_l F_b\big]$ (A.10)

For the estimate $U_l$ of the overuse flow, the same method yields the bound

$P(U_l \le \tau_{c_l c_b} \mid c_l) \le \exp\Big(-\frac{2(c_l-\theta)\, v_l^2}{M^2}\Big)$,
where $v_l \triangleq \frac{1}{2 c_b (c_l-\theta)}\big[\theta E[X](c_l - c_b) + c_b F_l - c_l F_b\big]$ (A.11)

Although Equations A.10 and A.11 contain the unknown quantity $E[X]$, and the inequalities hold only under certain conditions, we know that $0 \le E[X] \le M$; we can therefore obtain worst-case probabilities by choosing the $E[X]$ that minimizes $v_b$ and $v_l$. Equations A.10 and A.11 can thus be rewritten as the following two functions, which give worst-case bounds without knowledge of $E[X]$ and without any assumption on $c_b$ and $c_l$.
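As a numerical sanity check on the algebra in Equations A.7–A.10, the sketch below (with illustrative helper names) computes the midpoint threshold $\tau$, the deviation $v_b$, and the Hoeffding tail bound, and verifies that $v_b$ indeed equals the scaled gap $\frac{c_b}{c_b-\theta}(\tau - \frac{F_b}{c_b}) - E[X]$ that Equation A.8 requires:

```python
import math

def tau_midpoint(theta, c_l, c_b, F_l, F_b, EX):
    """Threshold tau (Eq. A.9): midpoint of E[U_l|c_l] and E[U_b|c_b]."""
    return EX + (-theta * EX * (c_l + c_b) + c_b * F_l + c_l * F_b) / (2.0 * c_l * c_b)

def v_b(theta, c_l, c_b, F_l, F_b, EX):
    """Deviation v_b (Eq. A.10)."""
    return (theta * EX * (c_l - c_b) + c_b * F_l - c_l * F_b) / (2.0 * c_l * (c_b - theta))

def hoeffding_tail(n, v, M):
    """Hoeffding bound (Eq. A.8): P(sample mean of n values in [0, M]
    exceeds its expectation by v) <= exp(-2 n v^2 / M^2), for v >= 0."""
    return math.exp(-2.0 * n * v * v / (M * M))

# Cross-check with arbitrary example values: v_b must equal
# (c_b / (c_b - theta)) * (tau - F_b / c_b) - E[X].
theta, c_l, c_b, F_l, F_b, EX = 10, 50, 40, 200.0, 80.0, 1.0
tau = tau_midpoint(theta, c_l, c_b, F_l, F_b, EX)
direct = (c_b / (c_b - theta)) * (tau - F_b / c_b) - EX
assert abs(v_b(theta, c_l, c_b, F_l, F_b, EX) - direct) < 1e-9
```

Because $v_b$ is linear in $E[X]$, its minimum over $[0, M]$ is attained at an endpoint, which is what makes the worst-case functions in Equation A.12 easy to evaluate.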
$\hat{P}_b(\theta, c_b, c_l, F_b, F_l) \triangleq \begin{cases} 1 & \text{if } \min v_b < 0 \;\vee\; c_b - \theta < 1 \\ \exp\big(-\frac{2(c_b-\theta)(\min v_b)^2}{M^2}\big) & \text{otherwise} \end{cases}$

$\hat{P}_l(\theta, c_b, c_l, F_b, F_l) \triangleq \begin{cases} 1 & \text{if } \min v_l < 0 \;\vee\; c_l - \theta < 1 \\ \exp\big(-\frac{2(c_l-\theta)(\min v_l)^2}{M^2}\big) & \text{otherwise} \end{cases}$ (A.12)

For the pmf $P(c_l, c_b)$, we assume that our random flow-to-counter mapping in each minor cycle makes $C_i$ follow a binomial distribution, so it can be represented as

$P(\theta, c_l, c_b) = P\big(B(N\theta, \tfrac{1}{W}) = c_l\big)\, P\big(B(N\theta, \tfrac{1}{W}) = c_b\big)$ (A.13)

Combining all equations above, given the amounts $F_l$ and $F_b$ sent by an overuse flow and a non-overuse flow in $\theta$ minor cycles, we can derive the lower-bound function $\hat{P}_{win}$ of $P(U_l > U_b)$:

$\hat{P}_{win}(\theta, F_l, F_b) \triangleq \sum_{c_l=1}^{N\theta} \sum_{c_b=1}^{N\theta} \big(1 - \hat{P}_b(\theta, c_b, c_l, F_b, F_l)\big)\big(1 - \hat{P}_l(\theta, c_b, c_l, F_b, F_l)\big)\, P(\theta, c_l, c_b)$ (A.14)

The worst case of $\hat{P}_{win}$ occurs when the amount $F_b$ of the non-overuse flow equals its maximum permitted size. Therefore, $\hat{P}_{win}(\theta, F_l, \gamma\omega^{-1}\theta + \beta)$ gives the worst-case lower bound when the behavior of the non-overuse flows is unknown.

The probability that an overuse flow is selected into the flow monitor equals the probability that it loses to fewer than $W_{fm}$ non-overuse flows in the ranking. Since we have assumed that the estimates $U$ are i.i.d., this probability is lower-bounded by

$P_{mon}(\theta, F_l) \ge \sum_{k=0}^{W_{fm}-1} \binom{N}{k} (1 - \tilde{P}_{win})^k (\tilde{P}_{win})^{N-k}$, where $\tilde{P}_{win} \triangleq \hat{P}_{win}(\theta, F_l, \gamma\omega^{-1}\theta + \beta)$
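The final bound is straightforward to evaluate once $\tilde{P}_{win}$ is known. A minimal sketch, assuming a precomputed lower bound p_win on $P(U_l > U_b)$ (the function names are illustrative):

```python
import math

def binom_pmf(n, p, k):
    """Binomial pmf, as used for P(B(N*theta, 1/W) = c) in Eq. A.13."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def p_mon_lower_bound(N, W_fm, p_win):
    """Lower bound on the monitoring probability (final display): the
    overuse flow enters the flow monitor if fewer than W_fm of the N
    flows out-rank its estimate; p_win lower-bounds P(U_l > U_b)."""
    return sum(math.comb(N, k) * (1.0 - p_win)**k * p_win**(N - k)
               for k in range(W_fm))
```

As expected, the bound approaches 1 as p_win approaches 1, i.e., a sufficiently over-sending flow is monitored almost surely once its estimate dominates those of conforming flows.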