Low-Rate Overuse Flow Tracer (LOFT): An Efficient and Scalable Algorithm for Detecting Overuse Flows
Simon Scherrer, Che-Yu Wu, Yu-Hsi Chiang, Benjamin Rothenberger, Daniele E. Asoni, Arish Sateesan, Jo Vliegen, Nele Mentens, Hsu-Chun Hsiao, Adrian Perrig
Department of Computer Science, ETH Zurich, Switzerland
Email: {simon.scherrer,benjamin.rothenberger,daniele.asoni,adrian.perrig}@inf.ethz.ch
National Taiwan University, Taiwan
Email: {r06922021,r06922023,hchsiao}@ntu.edu.tw
ESAT, KU Leuven, Belgium
Email: {arish.sateesan,jo.vliegen,nele.mentens}@kuleuven.be

Abstract—Current probabilistic flow-size monitoring can only detect heavy hitters (e.g., flows utilizing 10 times their permitted bandwidth), but cannot detect smaller overuse (e.g., flows utilizing 50-100% more than their permitted bandwidth). Thus, these systems lack accuracy in the challenging environment of high-throughput packet processing, where fast-memory resources are scarce. Nevertheless, many applications rely on accurate flow-size estimation, e.g., for network monitoring, anomaly detection, and Quality of Service.
We design, analyze, implement, and evaluate LOFT, a new approach for efficiently detecting overuse flows that achieves dramatically better properties than prior work. LOFT can detect 1.5x overuse flows in one second, whereas prior approaches fail to detect 2x overuse flows within a timeout of 300 seconds. We demonstrate LOFT's suitability for high-speed packet processing with implementations in the DPDK framework and on an FPGA.
I. INTRODUCTION
The problem of detecting network flows whose size exceeds a certain share of link bandwidth, so-called large flows or heavy hitters, has received significant attention since the seminal paper by Estan and Varghese in 2003 [16] and has recently experienced renewed interest, partly thanks to the emergence of programmable data planes [5], [7], [33], [40]. In this formulation of the heavy-hitter problem, the assumption is that network operators define a flow-size threshold and want to identify all the flows that violate this threshold. Previous work [16], [5], [7], [11], [13], [38], [37], [33], [40] focused on detecting heavy hitters, e.g., flows that are 10x larger than the threshold, such that a large measurement error can be tolerated. In contrast, the problem of detecting moderately large flows, i.e., flows that send slightly more (e.g., 1.5x) than the threshold flow size, is still in need of an effective solution. Throughout this work, we refer to these large-flow types as high-rate and low-rate overuse flows, respectively.
While previous approaches to probabilistic flow-size estimation could be made arbitrarily accurate if abundant memory were available, these approaches suffer from a lack of accuracy under stringent constraints regarding memory and computation. These constraints introduce a large measurement error such that the detection of overuse flows becomes unreliable. In practice, only flows that exceed the target threshold by more than the measurement error (which itself can amount to a significant share of link bandwidth) can be reliably detected, resulting in many undetected overuse flows.
This failure of probabilistic flow-size monitoring is especially severe under tight hardware constraints, e.g., on routers that have an aggregate capacity of several terabits per second (Tbps), handle millions of concurrent flows, and thus require high-speed packet processing (on the order of 100 ns processing time per packet).
These heavy constraints only allow for restricted flow monitoring with an especially large estimation error, which makes the detection of low-rate overuse flows extremely difficult. At the same time, many applications that strengthen network dependability rely on accurate flow-size estimation: network operators make heavy use of tools such as network monitoring, traffic engineering (e.g., flow-size-aware routing), anomaly detection, Quality of Service (QoS), network provisioning, and security applications, all of which require accurate insight into the flow-size distribution.
As one example of a dependability-enhancing application relying on accurate flow-size monitoring, consider bandwidth-reservation systems [6], [30], [3]. These approaches allocate the available bandwidth on a link to flows according to purchasable reservations. In case of scarce link capacity (e.g., during a distributed denial-of-service (DDoS) attack), flows with a reservation can continue sending to the extent of the reserved allowance, whereas other traffic might be dropped. If probabilistic flow-size monitoring is used for allowance policing, systems that can only detect high-rate overuse flows, but disregard the detection of low-rate overuse flows, allow an adversary to "fly under the radar". Similar to attacks such as Coremelt [35] and Crossfire [20], an attacker could wrongfully consume link bandwidth by creating a large number of flows that only slightly exceed the corresponding reservation. Thus, detecting low-rate overuse flows is essential to uphold the guarantees of bandwidth-reservation systems.
The biggest challenge in creating highly accurate flow-size measurement on high-speed routers is the scarcity of fast memory compared to the enormous number of flows handled by these routers. Individual-flow resource accounting is either too expensive (due to the high cost of fast SRAM memory for caches) or too slow (e.g., keeping per-flow information in DRAM [1], [10]).
To reduce fast-memory usage, a line of previous research devises sketches, which use a small number of counters and map every flow to a random subset of these counters. The size of each flow is then estimated based on the values of the counters corresponding to that flow [16], [13], [37], [29], [21]. Flows with an estimated size exceeding a pre-defined threshold are considered overusing.
Our key insight is that, due to the high variance of the counter values (referred to as counter noise in the remainder of this paper), these shared-counter approaches fail to distinguish low-rate overuse flows from non-overusing flows. This counter noise originates from two different sources: (1) the uneven size of flows within a counter, which leads to non-overuse flows being mistaken for overuse flows if they are mapped to the same counter as an overuse flow, and (2) the uneven number of flows across counters, which leads to non-overuse flows being mistaken for overuse flows if they are mapped to the same counter as many other non-overuse flows. Due to counter noise, a sketch cannot distinguish a 1.5x overuse flow from a non-overuse flow within reasonable memory limits (cf. §II-C1). Hence, the core research challenge becomes how to effectively counteract the noise while using limited computing and storage resources.
In this paper, we propose LOFT, a lightweight detector that can detect low-rate overuse flows significantly more quickly and reliably than prior approaches, while conforming to strict requirements regarding time and memory complexity. LOFT reduces the counter noise by using a multi-stage approach that is aware of both the traffic volume and the number of flows in any counter and aggregates these values over time.
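To make the two noise sources concrete, the following toy simulation (our own illustration, not part of LOFT; all parameters are assumed for the example) maps 10,000 unit-rate flows plus one 1.5x overuse flow onto 1,024 shared counters and shows that the overuse flow's counter value is indistinguishable from that of many ordinary flows:

```python
import random

# Toy demonstration of counter noise under assumed parameters:
# 10,000 unit-rate flows, one 1.5x overuse flow, 1,024 shared counters.
random.seed(1)
N, W = 10_000, 1024
flow_size = {f: 1.0 for f in range(N)}
flow_size[0] = 1.5                      # the low-rate overuse flow

counter = [0.0] * W
slot = {f: random.randrange(W) for f in range(N)}   # random flow->counter map
for f, s in flow_size.items():
    counter[slot[f]] += s

# A flow's naive estimate is its counter value: noise source (1) is the
# other flows' volume in the same counter; noise source (2) is the uneven
# number of flows per counter.
est_overuse = counter[slot[0]]
est_normal = [counter[slot[f]] for f in range(1, N)]
overestimated = sum(1 for e in est_normal if e >= est_overuse)
print(f"overuse-flow counter value: {est_overuse:.1f}")
print(f"normal flows whose counter is at least as large: {overestimated}")
```

Many non-overuse flows end up with counter values at or above the overuse flow's counter, so thresholding the raw counter values cannot separate the two classes.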
Moreover, LOFT requires fewer operations per packet than conventional schemes and thereby enables high-speed packet processing, which we demonstrate with implementations for the DPDK framework [28] and for a Xilinx Virtex UltraScale+ FPGA on the Netcope NFB 200G2QL platform [2].
Our evaluation based on both real and synthetic traffic traces shows that LOFT outperforms prior work. LOFT is at least 300 times faster than prior approaches in detecting 1.5-2x overuse flows: LOFT can reliably detect 1.5x overuse flows in one second, whereas prior approaches fail to detect even 2x overuse flows within a time of 300 s.

II. PROBLEM DEFINITION AND BACKGROUND

A network flow (or flow for short) is a sequence of packets with common characteristics. For example, NetFlow uniquely defines a flow by source and destination IP addresses, source and destination transport-layer ports, protocol, ingress interface, and type of service [12].

Fig. 1. Router packet processing.

An overuse flow is a flow that consumes more bandwidth than it was permitted by the traversed network. More concretely, if a flow's permitted rate is γ and permitted burst size is β, then a flow overuses its allocation if it sends more than γt + β in any time interval t. An ℓ-fold overuse flow is a flow that sends at rate ℓγ. The permitted amount (i.e., the overuse-flow threshold) can be pre-determined by bandwidth allocation or determined by the router based on its resource usage.
In this section, we review the flow-policing model, with a focus on how the overuse-flow detection component interacts with other components. Then we highlight important properties desirable for overuse-flow detectors.

A. Flow Policing Model
A typical flow-policing mechanism consists of the following four components (shown in Figure 1): (1) a flow Classifier that extracts each flow's ID and determines its permitted bandwidth, (2) a Blacklist that filters out blacklisted flows, (3) a Probabilistic Overuse Flow Detector (OFD) tasked with finding suspicious flows that could potentially be overusing, and (4) a Precise monitoring component which analyzes individual suspicious flows to determine which ones are actually misbehaving and should be added to the blacklist. The precise-monitoring component can access a limited amount of fast memory (e.g., on the order of the amount used in the overuse-flow detector). This limited fast memory restricts the number of suspicious flows that can be simultaneously monitored by the detector, leading to false negatives.
While stateful monitoring of individual flows is practical at the edge of the network, it has an untenable fast-memory consumption on routers with Tbps capacity. Moreover, schemes with per-flow state in fast memory are vulnerable to memory-exhaustion attacks in which an attacker creates a high number of flows and thereby depletes the available fast memory. Hence, only probabilistic monitoring is a viable option on high-speed routers.

B. Desired Properties
To ensure accurate and timely detection of overuse flows without affecting the regular packet processing, an overuse-flow detection algorithm should satisfy the following properties:
High-speed packet processing on routers. High-speed packet processing is required for a system that is deployed in the network core.¹

¹As the CAIDA dataset suggests a flow concurrency of 10 million flows on a switch supporting an aggregate bandwidth of 1 Tbps [9], individual-flow monitoring can be expected to require 80 MB of fast memory, assuming 4-byte flow IDs and 4-byte counters.
Low false-positive and low false-negative rates. In the context of overuse-flow detection, a false positive (FP) is a misclassification of a non-overuse flow as an overuse flow. Conversely, a false negative (FN) is a misclassification of an overuse flow as a non-overuse flow, i.e., the failure to detect an overuse flow. Although the precise-monitoring component of a flow-policing mechanism can prevent non-overuse flows from being falsely blacklisted, the detector itself should still ensure a low FP rate so as not to evict suspicious flows in the precise-monitoring component, which would increase the FN rate.
Detection of low-rate overuse flows. Overuse flows that are sending slightly above their permitted sending rate should be detected with high probability after a short amount of time. Our goal is to detect, within seconds, flows that are sending at 1.5x their permitted rate; current state-of-the-art algorithms typically assume 10-1000-fold sending rates of overuse flows when budgeting resources.
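The fast-memory figure in the footnote above can be checked with a one-line calculation (a sketch of the arithmetic, using the stated assumptions of 4-byte flow IDs and 4-byte counters):

```python
# Reproduces the footnote's estimate: per-flow state for 10 million
# concurrent flows, with a 4-byte flow ID and a 4-byte counter per flow.
num_flows = 10_000_000
bytes_per_flow = 4 + 4
total_mb = num_flows * bytes_per_flow / 10**6
print(total_mb, "MB")  # 80.0 MB of fast memory
```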
C. Overuse Flows vs. Large Flows
This work aims to efficiently detect overuse flows, and is inspired by algorithms for detecting large flows, i.e., flows which use a significant fraction of the link bandwidth. However, it is important to note the significant differences between the two problems. In our context, large flows correspond to high-rate overuse flows, i.e., flows sending at rates at least 10-1000 times higher than the average flow (for example, flows violating TCP fairness). The goal of previous work is to quickly identify the large flows to throttle or block them, thus preventing them from harming the other flows. These algorithms have a (more or less explicit) threshold above which a flow is considered large, but below which flows are allowed to send. This threshold is usually up to three orders of magnitude larger than the average flow's sending rate, so a few flows sending around the threshold rate would rapidly exhaust link capacity and lead to congestion.
In the rest of this section, we briefly introduce two kinds of large-flow detection algorithms, namely sketch-based approaches and selective individual-flow monitoring. In the following, we discuss why they are inadequate for solving overuse-flow detection in our target scenario.
1) Sketches:
Individual-flow monitoring tracks the size of flows with a counter per flow, i.e., a memory cell that is increased by the packet size every time a packet of the corresponding flow arrives. To monitor flows using limited fast memory, sketches use each counter to track multiple flows. Two algorithms in this category are the Count-Min (CM) Sketch [13] (also known as Multistage Filters [16]) and Adaptive Multistage Filters (AMF) [14]. They rely on a relatively simple concept, which we illustrate with the example of the CM Sketch.
In the CM Sketch, flows are randomly mapped to counters, and each counter aggregates the volume of all flows that are assigned to it. If a large flow is mapped to a certain counter, this counter value is expected to be higher than the other counters, as it includes the contribution of the large flow. To increase precision, the CM Sketch uses multiple stages, i.e., multiple counter arrays, and for each stage the flows are mapped to counters in a different way (e.g., using different hash functions). The CM Sketch classifies a flow as large if and only if the minimum value of the counters to which the flow is mapped exceeds a certain threshold. For non-large flows, the probability that all associated counters exceed the threshold is low, decreasing exponentially in the number of stages.
However, achieving high accuracy with a CM Sketch is only possible with an untenable amount of memory. According to the theoretical work on the CM Sketch [13], the measurement error of the CM Sketch can be related to the amount of available memory. Concretely, a CM Sketch with ⌈ln(1/δ)⌉ stages, each with ⌈e/ε⌉ counters, guarantees a probability of less than δ that the overestimation error amounts to more than a share ε of total traffic. Assuming that a switch with an aggregate bandwidth of 1 Tbps handles 10 million flows [9] and that the overestimation should almost never exceed 50% of the average flow size (hence, δ = 0.01 and ε = 0.5/10⁷), the CM Sketch would require around 250 million counters. This memory consumption is even higher than allocating a counter per flow, which demonstrates that the CM Sketch is inaccurate on high-capacity routers.
Moreover, conventional sketches have been shown to be too inefficient to keep up with usual line speeds [22].
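A minimal Count-Min Sketch can be sketched in a few lines of Python. The class below is our own illustration (stage and width parameters are arbitrary), followed by the sizing rule from [13] applied to the 1 Tbps / 10-million-flow scenario discussed above:

```python
import hashlib
import math

class CountMinSketch:
    """Minimal Count-Min Sketch: d stages of w shared counters;
    a flow's estimate is the minimum over its d counters."""
    def __init__(self, stages=4, width=1024):   # illustrative parameters
        self.d, self.w = stages, width
        self.ctr = [[0] * width for _ in range(stages)]

    def _slot(self, flow_id, stage):
        # one independent hash function per stage
        h = hashlib.blake2b(f"{stage}:{flow_id}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def update(self, flow_id, nbytes):
        for i in range(self.d):
            self.ctr[i][self._slot(flow_id, i)] += nbytes

    def estimate(self, flow_id):
        return min(self.ctr[i][self._slot(flow_id, i)] for i in range(self.d))

cms = CountMinSketch()
cms.update("flowA", 1500)
cms.update("flowA", 1500)
cms.update("flowB", 400)
print(cms.estimate("flowA"))  # at least 3000: CM never underestimates

# Sizing per the guarantee cited above: ceil(ln(1/delta)) stages of
# ceil(e/eps) counters each, for the 1 Tbps / 10 million flow scenario.
delta, eps = 0.01, 0.5 / 10**7
counters = math.ceil(math.log(1 / delta)) * math.ceil(math.e / eps)
print(counters)  # on the order of the ~250 million counters cited above
```

The estimate is the minimum over the flow's counters, so collisions can only inflate it, never deflate it; this is the one-sided error that the ⌈ln(1/δ)⌉ × ⌈e/ε⌉ sizing bounds.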
To achieve line rate, accuracy has to be traded for speed, which exacerbates the estimation error of sketches. In this work, we refine the sketch-based approach to achieve high accuracy with low processing complexity.
2) Selective Individual-Flow Monitoring:
Another category of large-flow detection algorithms dynamically selects a subset of flows for individual-flow monitoring. In the following, we illustrate the general idea of these schemes using the example of EARDet [38], which is one specimen of this category. Other examples include HashPipe [33] and HeavyKeeper [40].
EARDet is based on the Misra-Gries (MG) algorithm [25], which finds the exact set of frequent items (i.e., items making up more than a 1/k-share of the stream) in two passes with limited counters. At a high level, the MG algorithm uses an array of counters to track frequent-item candidates. By adjusting the counter values and associated items, the MG algorithm guarantees that every frequent item will occupy one counter after the first pass. The second pass is nevertheless required to remove falsely included infrequent items.

Fig. 2. Details of the structure and information flow of the LOFT Probabilistic OFD component (cf. also Figure 1).

For each item in the stream, the MG algorithm adjusts the counters as follows. It first checks whether the item has occupied a counter in the array. If so, the corresponding counter is increased by one. Else, if there is a non-occupied counter, the MG algorithm assigns that non-occupied counter to track this item (and also increases it by one). Otherwise, it decreases all counters by one. The intuition is that infrequent items, should they be assigned to a counter, are likely to be evicted (i.e., their counter becomes zero) very quickly. By contrast, frequent items (which are already more likely to be assigned to a counter to begin with) are guaranteed to remain assigned to that counter, since their frequency compensates for the occasional counter decreases.
EARDet enhances the MG algorithm to identify large network flows in one pass. EARDet is based on two flow specifications of the form γt + β, where γ is the allowed rate and β is the allowed burst size. The adaptations guarantee that a flow sending less traffic than γ_l·t + β_l during any time window with length t will not be falsely blocked (no false positive), and all large flows sending more than γ_h·t + β_h in any time window with duration t will be caught (no false negative).
To catch overuse flows, EARDet could be configured to have γ_h·t + β_h set to the permitted bandwidth. However, given a low permitted bandwidth, this approach either demands many counters or suffers from high false-negative rates, as EARDet recommends using at least linkBW/γ_h − 1 counters to achieve guaranteed detection. If γ_h equals the maximum size of a non-overuse flow, fast memory would need to accommodate almost a counter per flow, which is infeasible on routers that handle millions of flows (cf.
§II-A).
In addition to their high fast-memory consumption, the overhead of selective individual-flow monitoring, i.e., continuously deciding which flows to monitor closely, is prohibitively expensive in terms of processing complexity, which results in an insufficient throughput of such schemes (cf. §IV-H).

III. LOFT ALGORITHM
LOFT is a novel design approach for a probabilistic overuse-flow detector (OFD), the core component of a flow-policing model in router packet processing. We first provide an overview of LOFT (§III-A) before describing the design in more detail (§III-B-III-D). We also provide a complexity analysis of the algorithm (§III-E).
TABLE I
NOTATION USED IN THIS PAPER

Symbol     Description
N          Number of flows
γ, β       Rate and burst threshold of the flow specification
ℓ          Overuse ratio
θ          Number of minor cycles since last reset
ω          Number of minor cycles per second
Z          Number of minor cycles per major cycle
θ_reset    Reset cycle
λ          Sample rate
W          Number of counters in fast memory
W_fm       Number of precisely monitored flows
U          Flow-size estimate
A          Accumulated flow size
C          Accumulated flow count
A. Overview
In this section, we describe our design for the probabilistic OFD component depicted in Figure 1. As Figure 2 shows, the LOFT OFD contains four components: the update algorithm, the estimate algorithm, the sampler, and the flow table. Packets not rejected by the blacklist are forwarded to the update algorithm and the sampler. The estimate algorithm consumes their output, updates the flow table, and creates a list of suspicious flows for precise monitoring.
The LOFT update algorithm targets one short time interval (e.g., 12.5 ms) at a time, which we call a minor cycle. For each minor cycle, the update algorithm collects aggregated traffic information over groups of flows by using a single counter array. For every packet, LOFT maps the packet's flow ID to one counter in the current counter array and increases that counter by the packet size. At the end of the minor cycle, the counter array is passed to the estimate algorithm.
The estimate algorithm operates on a larger time scale, at intervals (e.g., with duration 250 ms) that we call major cycles (see Figure 3): every time a major cycle is concluded, the estimate algorithm analyzes the counter arrays stored from all minor cycles during the major cycle, extracts estimates for the bandwidth utilization of every flow, and stores the estimates in a flow table. From this, the algorithm produces a list of suspected overuse flows, which is handed to the precise-monitoring component. The sampler provides a list of active flows which the estimate algorithm uses in its analysis.
In summary, the estimate algorithm uses the sequence of counter arrays generated by the update algorithm to create a final flow estimate in each major cycle. Figure 3 visualizes how these algorithm components interact. By depicting packets of two different flows, the figure shows how flows are mapped to different counters in each minor cycle. Based on a packet's flow ID, the update algorithm increases the associated counter by the packet size.
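The per-packet update step just described can be sketched in a few lines of Python (our own illustration; the array width, hash stand-in, and packet sizes are assumed values, not LOFT's parameters):

```python
# Sketch of the LOFT update step: one counter array per minor cycle, with
# the flow-to-counter mapping re-randomized by seeding the hash with the
# (major cycle, minor cycle) pair. All names and sizes are illustrative.
W = 8  # counters per array (tiny, for illustration)

def H(j, k, flow_id):
    # stand-in for the per-minor-cycle hash function H_{j,k}
    return hash((j, k, flow_id)) % W

def update(ctr, j, k, flow_id, pkt_size):
    ctr[H(j, k, flow_id)] += pkt_size

# one minor cycle (j=0, k=0): each packet lands in its flow's counter
ctr = [0] * W
update(ctr, 0, 0, "a", 1500)
update(ctr, 0, 0, "a", 100)
update(ctr, 0, 0, "b", 40)
print(ctr[H(0, 0, "a")])  # 1600 for flow "a", plus 40 if "b" collides
```

At the end of the minor cycle, `ctr` would be moved to main memory and replaced by a fresh array, while the next cycle uses a new (j, k) seed and hence a new flow-to-counter mapping.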
Every major cycle, the estimate algorithm first updates the flow table based on the counter arrays generated in the most recent Z minor cycles, recomputes the estimate for every active flow, and creates a watchlist from the W_fm largest flows.

Fig. 3. LOFT Timeline. ① A hash value is computed on the flow ID of the incoming packet, and the corresponding counter is increased by the packet size. ② A different set of counters and hash function are used when the minor cycle changes. ③ After Z minor cycles, the estimate algorithm aggregates the counter values and tries to identify a group of overuse flows.

The flows on the watchlist undergo precise monitoring in the subsequent major cycle. This precise monitoring is performed by the leaky-bucket algorithm [36], which detects any violation of a flow specification in the form γt + β (cf. §II) without false positives. Any flow that is found misbehaving by precise monitoring can thus be blocked by inserting it into the blacklist.
In the following, we explain the design of the update algorithm (§III-B), the estimate algorithm (§III-C), and the sampler (§III-D). The pseudocode of the whole system is presented in Algorithm 1. The relevant notation is presented in Table I.

B. Update Algorithm
Similar to a sketch, the LOFT update algorithm maps each flow to one counter in a counter array. Each counter tracks the aggregate bandwidth of all the flows mapped to that counter during a minor cycle. The association of flows to counters is randomized by changing the hash function for every minor cycle. When a minor cycle ends, the counter array is moved to main memory and an empty counter array is initialized in fast memory.
More formally, let ctr_{j,k} be the counter array generated in the k-th minor cycle of major cycle j, and H_{j,k} be the hash function used in that minor cycle. The value of ctr_{j,k}[x] is the total packet size of flows that have been mapped to x by H_{j,k}, i.e., all flows f for which H_{j,k}(f) = x.

C. Estimate Algorithm
At the end of a major cycle, which contains a certain number Z of minor cycles, the estimate algorithm performs a flow-size estimation for every flow asynchronously (e.g., in user space of the router), while the update algorithm continues to aggregate traffic information. The flow-size estimate builds on two values: the volume sum and the cardinality sum.
For the volume sum, the estimate algorithm sums up the values of all the counters to which a flow was mapped. This aggregation over time reduces the counter noise in the sense of uneven flow size within the same counter: intuitively, an overuse flow, unlike a non-overuse flow, will be consistently associated with large counter values, which results in a large volume sum for that flow.
Although using multiple counters can reduce counter noise, it is neither sufficient nor innovative, as the Count-Min Sketch [13] uses the same idea (although by applying different hash functions concurrently instead of sequentially) and delivers insufficient performance. Indeed, the key to reducing counter noise lies in the cardinality sum. To compute this sum, an active-flow list is consulted to determine how many flows are associated with each counter within each minor cycle, i.e., the counter cardinality. For every flow, the estimate algorithm sums up the cardinality values of all counters associated with the flow. The cardinality sum reduces the distortion created by the varying cardinalities of counters: intuitively, an overuse flow will be associated with a large counter value even when the number of flows in that counter is small.
When dividing the volume sum by the cardinality sum, the strongest increases in a flow's estimate are produced when the flow is mapped to high-value counters that contain a small number of flows. Indeed, flows with these characteristics are highly likely to be the largest flows among the investigated flows and are thus candidates for more precise monitoring.
Formally, after major cycle j, we define the estimate of a flow f to be U^(j)_f = A^(j)_f / C^(j)_f, where

A^(j)_f = Σ_{j' ∈ J^(j)_f} Σ_{k=1}^{Z} ctr_{j',k}[H_{j',k}(f)]    (III.1a)

C^(j)_f = Σ_{j' ∈ J^(j)_f} Σ_{k=1}^{Z} |ctr_{j',k}[H_{j',k}(f)]|    (III.1b)

where J^(j)_f contains all major cycles j' ≤ j in which flow f was active. The term |ctr_{j',k}[x]| denotes the number of flows that have been mapped to counter x in minor cycle k of major cycle j' (counter cardinality). A^(j)_f is the value aggregate of the counters that f has been mapped to (volume sum) and C^(j)_f is the summed count of the flows in these counters (cardinality sum). To avoid preserving counter arrays from past major cycles, the terms A_f and C_f are kept in the flow table and updated after every major cycle, i.e.,

table[f].A ← table[f].A + Σ_{k=1}^{Z} ctr_{j,k}[H_{j,k}(f)]    (III.2)

and analogously for C^(j)_f. These updates are made for all flows f that were active in the most recent major cycle and are thus in the active-flow list generated by the sampler (cf. §III-D).
These estimates need to be adjusted when some flows send intermittently. For example, suppose flow f1 sends x GB in the first and the third major cycle and nothing in the second major cycle, and flow f2 sends x GB in each of the first to the third major cycle, i.e., J^(3)_1 = {1, 3} and J^(3)_2 = {1, 2, 3}. Then A^(3)_1/C^(3)_1 and A^(3)_2/C^(3)_2 will be almost the same. Suppose all counters contain exactly y flows; then A^(3)_1/C^(3)_1 = (x + x)/(y + y) = x/y = (x + x + x)/(y + y + y) = A^(3)_2/C^(3)_2.
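A quick numeric check of this intermittency example (our own sketch, taking x = 1 and y = 10 and folding the Z minor cycles of each major cycle into a single aggregated counter) confirms that the raw estimates coincide, and that the |J_f|/j correction introduced next separates the two flows:

```python
from fractions import Fraction

# Flow f1 is active in major cycles {1, 3}; flow f2 in {1, 2, 3}. Each
# active cycle contributes x to the volume sum and y to the cardinality
# sum (assuming exactly y flows per counter; Z folded into the units).
x, y, j = Fraction(1), Fraction(10), 3
A1, C1 = 2 * x, 2 * y   # f1: two active cycles, 2x total traffic
A2, C2 = 3 * x, 3 * y   # f2: three active cycles, 3x = 1.5 * 2x total
print(A1 / C1 == A2 / C2)  # True: the raw estimates cannot tell them apart

# The |J_f|/j correction scales each estimate by the fraction of major
# cycles in which the flow was active (Eq. III.3):
u1 = Fraction(2, j) * A1 / C1
u2 = Fraction(3, j) * A2 / C2
print(u2 / u1)  # 3/2, matching the 1.5x traffic ratio
```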
However, the total traffic sent by flow f2 in these three cycles is actually 1.5 times larger than that of flow f1 and should result in a higher flow-size estimate.
To fix this problem, we reduce a flow-size estimate U_f relative to the number of major cycles in which flow f was not active, i.e.,

U^(j)_f = (|J^(j)_f| / j) · A^(j)_f / C^(j)_f    (III.3)

To enable this computation, the flow table must track |J_f| for every flow f.
Another issue is that, as A_f and C_f are accumulated, the estimate algorithm is effectively computing their average over time. An attacker can take advantage of this approach by sending low-rate traffic for an initial period of time and then switching to bursty traffic. Such a flow may not be detected by our system, as its long-term average looks the same as that of a non-overuse flow, so old values must be discarded at some point. Therefore, we define the reset cycle θ_reset and clear all the data every θ_reset minor cycles. This reset may sound risky, as an overuse flow could send its traffic around the reset point so that its estimated size is reset before being detected by our system. However, an attacker does not know the reset point. Moreover, even if an attacker could infer the reset point, the overuse traffic sent by a flow with such a strategy is bounded, as we show in the mathematical analysis (Appendix A).

D. Sampler
LOFT requires a list of active flows for which an estimate must be computed. To generate such an active-flow list, we use sampling and limit the number of sampled packets per second to λ. LOFT uses a randomized sampling period, drawn from an exponential distribution with mean 1/λ. This randomization prevents an attacker flow from circumventing the sampling by sending at the appropriate moments.
Having an active-flow list for a major cycle j also allows computing the cardinality |ctr_{j,k}[x]| of counters in the estimate algorithm, namely by counting how many active flows were mapped to each counter with the respective hash function H_{j,k} for any minor cycle k. An alternative to this reconstruction of counter cardinality would consist in measuring counter cardinality within the update algorithm, for example using a Bloom filter [8] or HyperLogLog register [17] per counter.

Algorithm 1
LOFT algorithm.

procedure PROCESS(pkt)
    if pkt.flowID ∈ blacklist then return
    if pkt.flowID ∈ watchlist then MONITOR(pkt)
    j ← GETMAJORCYCLE()
    k ← GETMINORCYCLE()
    SAMPLER(pkt)
    UPDATE(pkt, j, k)
    if Z · j ≥ θ_reset then RESET()

procedure SAMPLER(pkt)
    if current time ≥ sample_time then
        activeFlow ← activeFlow ∪ {pkt.flowID}
        u sampled uniformly from U(0, 1)
        sample_time ← sample_time − ln(u)/λ

procedure UPDATE(pkt, j, k)
    x ← H_{j,k}(pkt.flowID)
    ctr_{j,k}[x] ← ctr_{j,k}[x] + pkt.size

procedure ESTIMATE()
    j ← GETMAJORCYCLE() − 1
    for k = 1 to Z do
        for f ∈ activeFlow do
            x ← H_{j,k}(f)
            numFlow[x] ← numFlow[x] + 1
        for f ∈ activeFlow do
            x ← H_{j,k}(f)
            A[f] ← A[f] + ctr_{j,k}[x]
            C[f] ← C[f] + numFlow[x]
    N ← |activeFlow|
    for f ∈ activeFlow do
        table[f].A ← table[f].A + A[f]
        table[f].C ← table[f].C + C[f]
        table[f].numJ ← table[f].numJ + 1
    activeFlow ← ∅
    return W_fm flows with largest (table[f].numJ / j) · table[f].A / table[f].C

However, as fast memory is the bottleneck resource, additional computational complexity in the estimate algorithm is preferable to estimating cardinality in fast memory.
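The SAMPLER procedure can be exercised in isolation. The following sketch (our own, with an assumed rate λ and a synthetic packet stream) draws exponential inter-sample gaps exactly as in Algorithm 1:

```python
import math
import random

random.seed(7)
lam = 1000.0        # target samples per second (assumed value)
sample_time = 0.0   # next sampling instant
active_flows = set()

def sampler(now, flow_id):
    # mirrors SAMPLER(pkt): sample when the clock passes sample_time,
    # then advance it by an Exp(lambda)-distributed gap (-ln(u)/lambda)
    global sample_time
    if now >= sample_time:
        active_flows.add(flow_id)
        u = random.random()  # u ~ U(0, 1)
        sample_time = sample_time - math.log(u) / lam

# synthetic stream: 10,000 packets over one second, 50 flows round-robin
for i in range(10_000):
    sampler(i * 1e-4, f"flow{i % 50}")
print(len(active_flows))  # distinct flows that ended up on the active list
```

With roughly λ samples per second, a moderate number of active flows is discovered quickly, while the randomized gaps leave an attacker no predictable sampling instants to avoid.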
E. Complexity Analysis
In summary, the update algorithm uses O(W) fast-memory entries, where W is the width of a counter array. The number of read and write operations is linear in the number of packets. The estimate algorithm uses O(ZW + N) entries in DRAM, and the number of accesses is O(ZN).
1) Time complexity:
In the update algorithm, each packet requires a single read and a single write operation to fast memory. Moreover, the algorithm requires a single hash-function computation, which results in the major advantage that LOFT achieves line rate (see §IV-H), whereas other sketch-based algorithms update multiple counter arrays per packet and therefore fall short of that goal [22]. The hash computation itself can be performed in hardware or using an efficient software implementation such as the murmur3 hash function.
The estimate algorithm uses multiple accesses to main memory. However, the number of accesses is linear in the number of active flows, which may be much smaller than the number of packets. In each major cycle, we need to sum up the corresponding counters in each minor cycle for each flow. Suppose there are N active flows; then there are O(ZN) DRAM reads to compute the updates to the estimate components. After obtaining these values, we need to update the flow table. For each active flow, the algorithm performs one lookup and update to the hash table. We use a Cuckoo hash table [27], which has worst-case constant lookup time. Although the worst-case insertion time of the Cuckoo table might be long, its expected complexity is amortized constant. Additionally, the number of insertions is much lower than the number of lookups, as each flow is inserted only once, when it is first seen. Therefore, the update takes O(N) time.
In the sampler, for every sampled packet, there is one insertion to the active-flow list stored in DRAM, for which we again use a Cuckoo hash table. By properly setting the table capacity and sample rate λ, our experiments show that the sampler is still fast enough to keep up with line speeds.
2) Space complexity:
Only the counters of the current minor cycle reside in fast memory, which is O(W) and depends on the size of a counter. Our analysis shows that, if all flows send almost at threshold rate, a small W (e.g., 8192) can lead to considerable detection delay.

Other counters are kept in main memory before being handled by the estimate algorithm, which takes O(ZW) space. The active-flow list and the flow table are also in DRAM. The active-flow list requires O(N) entries, and the space complexity of the flow table is also linear in N (using a Cuckoo table). Therefore, the total number of main-memory entries used (including counters) is O(ZW + N).

IV. EVALUATION
The evaluation of LOFT is conducted through two implementations and a simulation. First, the behavior of LOFT in a real-world environment is evaluated using the implementations and a testbed that supports up to 4x40 Gbps traffic volume. Second, to evaluate the accuracy of LOFT and compare it to EARDet, AMF, HeavyKeeper, and HashPipe, simulations with a traffic volume equivalent to 4x100 Gbps are used.
A. Implementation
For the scalability experiments with DPDK, we implemented LOFT in C on the Intel DPDK framework [28]. The application uses n worker threads that execute the update algorithm and a separate thread running the estimation algorithm every major cycle. The major and minor cycle indices are computed based on a monotonic clock with nanosecond resolution.

For further scalability experiments, we also implemented the update algorithm of LOFT on a Xilinx Virtex UltraScale+ FPGA on the Netcope NFB 200G2QL platform [2] containing two 100 Gbps NICs. Presenting the detailed contributions of the FPGA implementation would exceed the scope of this paper; therefore, a separate paper illustrating the design of the FPGA implementation has been submitted to a specialized conference [4].

To test the accuracy of LOFT, we evaluated LOFT and the other algorithms in simulations using Rust (LOFT, EARDet, AMF) and Golang (HeavyKeeper, HashPipe).

B. Experiment Setup
Test Setup. The testbed for the DPDK scalability experiment consists of two machines connected using 4x40 Gbps Ethernet connections. The traffic is generated using a dedicated traffic generator (Spirent TestCenter N4U [34]) and sent to a commodity machine with an Intel Xeon E5-2680 CPU. For the simulations, we execute LOFT on an Intel Xeon 8124M machine and process synthesized traffic equivalent to 4x100 Gbps.
Traffic Generation. To evaluate the scalability of LOFT with respect to flow volumes, we generated network traces with uniform packet sizes, as well as traces based on an iMix traffic distribution (avg. size: 353 B) [26].
C. Evaluation Metric: Detection Delay
The detection delay of an overuse flow is the time elapsed between the first violation of the flow specification and the time the flow is caught by a detector. A detection delay longer than the simulation timeout results in a false negative.

When presenting detection rates and false negatives, it is always crucial to present the corresponding false-positive rate, i.e., the number of non-overuse flows incorrectly marked as malicious. We note, however, that LOFT is designed to have no false positives, because benign flows that happen to be flagged as suspicious by the estimation algorithm will be exonerated by precise monitoring.
D. Parameter Selection
Several of LOFT's parameters can be tuned to fit the hardware restrictions of a specific deployment. In the following, we describe how to experimentally determine these parameters so that they comply with the hardware's computational constraints even under worst-case traffic patterns.

To experimentally determine the sampling rate, we increase the sampling rate of the sampler until its CPU core is fully utilized, such that the accuracy of the flow-ID list is maximized. With this determined sampling rate and an estimated maximum number of flows, we calculate the maximum number of major cycles per second such that there are enough samples between two executions of the estimation algorithm to build the flow-ID list with the desired accuracy. Finally, we increase the number of minor cycles per second until the CPU core of the estimation algorithm is fully utilized. Table II summarizes these hardware-related parameters used in our experiments.

The flow monitor is set to monitor 64 flows simultaneously in all experiments, which is small and fast enough to keep up with a high-bandwidth link. Finally, with the determined parameters in Table II, the reset cycle parameter θ_reset in each experiment is adjusted to achieve 95% detection probability (detailed in the mathematical analysis in Appendix A) under the given experiment setting.

TABLE II
HARDWARE-RELATED LOFT PARAMETERS.

Parameter                  Value
Sampling rate (λ)          8 · 10^5 samp./s
Number of minor cycles     64 cycles/s
Number of major cycles     1 – 4 cycles/s

E. Comparison: Fully Utilized Traffic Trace
We first evaluate LOFT in a setting where every flow sends at a rate close to the maximum allowed threshold, fully utilizing the reserved bandwidth. We simulate a configuration with 4x100 Gbps links and an aggregate number of 130'000 flows, where each flow requires 3 Mbps, e.g., for high-quality video streaming. Then, a misbehaving flow with an overuse ratio ℓ is injected. Each detector allocates 16'448 counters in fast memory. LOFT, AMF, HashPipe, and HeavyKeeper use 64 counters (out of 16'448) as flow monitors. To optimize detector performance for AMF, HashPipe, and HeavyKeeper, the fast-memory counters are structured as counter arrays. LOFT is configured to run 4 cycles of the estimation algorithm per second, which reaches the computation limit on our machine.

Figure 4(a) shows the detection delays of LOFT, EARDet, HashPipe, HeavyKeeper, and AMF under different overuse ratios ℓ on a log-log scale. Each data point is averaged over 100 runs. We find that LOFT detects a 1.50x overuse flow in less than one second, whereas all other detectors fail to detect it before the 300 s timeout. For larger overuse flows, LOFT still outperforms AMF, EARDet, and HashPipe/HeavyKeeper when the overuse ratio is less than 400x, 7x, and 3x, respectively. As opposed to HashPipe and HeavyKeeper, LOFT delivers high accuracy at a lower variance. Moreover, LOFT can achieve much higher throughput (cf. §IV-H).

The reason that LOFT is slower in detecting high-rate overuse flows is that it needs at least one major cycle, which takes 0.25 s, to select the overuse flow. These results confirm that LOFT can efficiently detect overuse flows using a small amount of router resources.
While heavy-hitter detection schemes perform better than LOFT regarding the extremely large flows that they are designed to catch, these schemes are ineffective in low-rate overuse flow detection, which is the goal of this paper.

As Figure 4(b) shows, these results are confirmed by performing the same experiments with 10 million flows, which is the number of flows to be expected on a Tbps link (cf. Section II-A). For 10 million flows, the higher accuracy of LOFT is even more prominent, as even the best other schemes (i.e., HashPipe and HeavyKeeper) fail to detect the overuse flows for all overuse ratios below 20x.
F. Comparison: CAIDA Traffic Trace
In addition to using synthesized background traffic in which every non-overuse flow fully utilizes the reserved bandwidth, we also compare detectors on an OC192 link using a real traffic trace, namely the CAIDA New York Anonymized Internet Trace [9]. In the trace, the majority of the flows are smaller than 512 bytes per second, and 95% of the flows are smaller than 14'000 bytes per second.

Fig. 4. Detection delay of overuse flow detectors under different overuse ratios and flow numbers N: (a) N = 130 K, (b) N = 10 M. The error bars show the minimum and maximum values over 100 runs.

We first regulate every flow in the CAIDA trace with the permitted bandwidth γ. Flows that use more than the permitted bandwidth are governed by dropping overuse packets. Then, a 2-fold overuse flow is added to the traffic trace. This setting reflects the scenario where an ISP wants to mitigate DDoS by putting a bandwidth cap on individual flows. Because the number of flows in the CAIDA traffic is smaller than in the fully utilized traffic trace, on this smaller network we allocate only 2048 + 64 counters to all detectors (64 flow monitors). LOFT is configured to run 4 slices of the estimation algorithm per second.

Figure 5 shows the detection delays of LOFT, EARDet, AMF, HashPipe, and HeavyKeeper under different permitted bandwidths on a log-log scale. Each data point is averaged over 100 runs. LOFT detects the 2-fold overuse flow in 0.30 – 1.50 s. EARDet and AMF can quickly and reliably detect the 2-fold overuse flow only when it is much larger than the majority of the non-overuse flows. HashPipe and HeavyKeeper also catch the overuse flow given a low threshold, but these schemes still require a multiple of LOFT's detection time to identify the overuse flow. Furthermore, the throughput of these schemes is substantially lower than the throughput of LOFT (cf. §IV-H).
Fig. 5. Detection time of overuse flow detectors to catch a 2-fold overuse flow given CAIDA traffic with different regulation thresholds γ (4K – 448K bytes/s) and β = 1500 for all.

G. LOFT Sensitivity Tests
We have shown that, given the same memory budget, LOFT can detect low-rate overuse flows much faster than AMF, EARDet, HashPipe, and HeavyKeeper. The following experiments further investigate LOFT's effectiveness by varying its parameters under a background-traffic setting that is challenging for all evaluated detectors. LOFT is configured to run one slice of the estimation algorithm per second, since it has to process more flows in the following tests.

We consider a half-utilization scenario, in which half of the non-overuse flows send up to the permitted bandwidth and the rest send almost negligible traffic. This is a more challenging scenario for LOFT than full utilization, because the variance between counters is affected not only by the number of flows aggregated but also by the variance of flow sizes. This half-utilization scenario can capture the behavior of typical streaming traffic, in which one direction is used to send data and the other to send ACKs. Half of the non-overuse flows send traffic up to the flow specification γ with β = 1500, and the other half send 25 times less traffic. As before, one overuse flow is injected in the simulation; this ℓ-fold overuse flow sends at ℓ times the permitted rate γ, with β = 1500. Each data point of detection delay is the average of 100 simulations.

Flow counting drastically reduces the detection time. Figure 6 shows that the estimator with flow counting and dividing significantly outperforms the one without counting. This result supports our perspective in Section III-A that the detection accuracy of sketches suffers from not taking the number of flows into account. In other words, LOFT is highly accurate thanks to its efforts to reduce the counter noise that stems from the variance of counter cardinality.
Fig. 6. LOFT's detection time of an overuse flow with and without counting. N = 130 K, W = 16'384, ℓ = 1.50.

Memory budget. In this experiment, we investigate the impact of fast-memory size on the detection delay. Figure 7(a) shows the cumulative distribution of the detection delay given different numbers of fast-memory counters, ranging from 1024 to 16'384. As the number of fast-memory counters is doubled, LOFT's detection speed increases, since lowering the number of flows sharing a counter reduces the variance of each counter. However, even with the maximum number of counters, and including the memory required for monitoring suspicious flows, the fast-memory consumption of LOFT is around 130 kB, which represents more than an order of magnitude reduction compared to the 3 MB of fast memory needed for individual-flow monitoring under the same traffic conditions (cf. §II-A).

Number of non-overuse flows. We evaluate the impact of the number of non-overuse flows on the detection delay. Using 1024 fast-memory counters and one 2-fold overuse flow, Figure 7(b) shows the cumulative detection delay given different numbers of non-overuse flows, ranging from 100'000 to 400'000. The detection delay grows with the number of non-overuse flows, because the variance of counter cardinality increases as the total number of flows grows.
Imprecise active-flow list. In our previous experiments, the fixed sampling rate (as defined in Table II) is sufficient to maintain a precise active-flow list. Given that maintaining this list is bound by the hardware-limited sampling rate, the number of active flows might be too high to build a precise active-flow list. For example, for 400'000 active flows and a sampling rate of 800'000 samples per major cycle, the active-flow list will miss about 15% of active flows.

To understand the impact of an imprecise active-flow list on LOFT, we evaluate different miss rates of active-flow lists. Figure 7(c) shows that, in the half-utilization scenario, LOFT performs worse with increasing imprecision of the active-flow list. This is because LOFT will use an inaccurate number of flows to calculate estimates, which leads to higher variance. Nevertheless, even when missing 20% of active flows, LOFT can still catch the overuse flow within 14 s with 95% probability.
H. Scalability of LOFT
1) DPDK Implementation:
To understand the scalability of LOFT in a DPDK environment, we evaluate the maximum packet rate with respect to the packet size and the number of cores that execute the update algorithm concurrently. Figure 8(a) shows that LOFT is able to achieve line rate for iMix-distributed traffic using 16 cores that execute the update algorithm and one core that runs the estimation algorithm. For smaller numbers of cores, line rate can only be achieved for larger packet sizes (1024 B).
Fig. 7. CDF of LOFT's detection delay given the number of counters (W), the number of flows (N), and the sampling miss rate (r). (a) N = 400 K, r = 0%, varying W. (b) W = 1024, r = 0%, varying N. (c) N = 400 K, W = 16'384, varying r.
Fig. 8. DPDK implementation results. (a) Throughput given number of logical cores and packet sizes (64 B – 1500 B and iMix). (b) Overhead compared to regular L3 forwarding for 8+1 cores.

Since traffic flows with small packet sizes perform considerably worse than flows with large packet sizes, we additionally evaluate the overhead introduced by LOFT by comparing it to regular L3 packet forwarding in DPDK. Figure 8(b) shows the throughput of regular L3 forwarding and the throughput of forwarding with additional LOFT processing for different packet sizes. Moreover, the figure gives the LOFT throughput as a percentage of the corresponding base throughput. As visible in the figure, even regular L3 packet forwarding using eight processing cores cannot achieve line rate for packet sizes smaller than 1024 B. Compared to regular packet forwarding, LOFT introduces overhead for small packet sizes, which results in a maximum packet rate of ~50 million packets per second (Mpps) using eight processing cores, i.e., ~6 Mpps per core. As this maximum packet rate is not exhausted for larger packet sizes, the effect diminishes. However, LOFT still achieves a much higher packet rate than the alternative schemes with the best accuracy, i.e., HashPipe and HeavyKeeper: prior work has shown a packet rate of ~2 Mpps per core for HashPipe [39] and ~2.50 Mpps per core for HeavyKeeper [40].
2) FPGA Implementation:
We also implemented the fast-path component of LOFT (i.e., the update algorithm using 16'384 counters) on a Xilinx Virtex UltraScale+ FPGA on the Netcope NFB 200G2QL platform [2] with two 100 Gbps NICs and an operating frequency of 200 MHz. The LOFT implementation can process a packet in every cycle. For minimum-size packets of 64 B, each NIC manages to transfer one packet per cycle to the LOFT implementation, which allows achieving a packet rate of 200 Mpps per NIC. As the FPGA platform contains two NICs, it achieves a total packet rate of ~400 Mpps. This high throughput demonstrates that LOFT is suitable for high-speed packet processing if implemented on programmable NICs. The full implementation is described in a paper submitted to a specialized conference [4], as the FPGA-specific implementation details represent a contribution that goes beyond the scope of this paper.

V. THEORETICAL ANALYSIS RESULTS
In the mathematical analysis provided in Appendix A, we analyze the overuse-flow detection and derive a lower bound (Equation A.15) on the probability that an overuse flow is detected within a reset cycle, i.e., before a reset. This bound depends on a number of LOFT's parameters, the most important being the duration of a reset cycle T_reset (where T_reset ≜ ω − θ_reset) and the number of fast-memory counters W.

By requiring that the lower bound be equal to a desired detection probability, we compute an upper bound on the required duration T_reset of a reset cycle for a given number of counters in order to achieve that detection probability. Table III shows the reset cycle duration T_reset calculated under the settings from the experiment of Figure 7(a), compared to the real detection delays as obtained by our analysis in Section IV-G. The result shows that our analysis does not underestimate the detection delay, since the upper bound on the reset-cycle duration, which is the worst-case estimate of the 95th percentile of the detection delay, is always higher than the values obtained in our evaluation. The reasons why the estimated delay is 2 – 5x higher than the real one are twofold. First, the synthesized traffic in our experiments does not always represent the worst-case scenario. Second, LOFT ranks and detects overuse flows at the end of every run of the estimation algorithm, but our analysis ignores the probability that overuse flows might be caught at some point before θ_reset, which loosens the bound.

TABLE III
CALCULATED RESET CYCLE DURATION (T_reset) AND THE REAL 95TH PERCENTILE OF THE DETECTION DELAY FROM OUR EVALUATION WITH DIFFERENT NUMBERS OF COUNTERS IN FAST MEMORY IN THE EXPERIMENT OF FIGURE 7(a).

Num. of counters     1024    2048    4096    8192   16'384
Reset cycle (s)       231     116      58      29       15
Detection delay (s)  43.03   20.48   14.04    8.67     6.52
VI. RELATED WORK
Despite a significant amount of research on the problem of detecting high-rate overuse flows, the problem of efficiently detecting low-rate overuse flows has so far been neglected. A number of related schemes have been proposed to address similar problems, such as large-flow detection and top-k flow detection, and their ideas might be applicable to detecting overuse flows. However, as we discuss in the following paragraphs, they are either orthogonal to our research direction or insufficient to solve our problem as stated in Section II.

The problem of detecting top-k flows aims to identify the k flows that consume most of a link's bandwidth. A recent proposal called HashPipe [33] tackles the heavy-hitter detection problem on programmable hardware. To ensure line-rate detection given a limited amount of fast memory, HashPipe constructs a pipeline of hash tables to efficiently implement the Space Saving algorithm [24], such that heavier flows are more likely to be kept in the next stage of the pipeline. However, our results in §IV show that HashPipe fails to detect the overusing flow in cases where the difference between overuse and non-overuse flows is low (i.e., with an overuse ratio of 1.50 – 2). Since HeavyKeeper [40], another top-k detection scheme, seems to suffer from the same weakness, these systems appear unable to effectively filter out the counter noise.

Similar to top-k flow detection, large-flow detection algorithms can be applied to detecting overuse flows. Large-flow detection algorithms identify flows that use more than a threshold amount of bandwidth, and it is common that the higher the threshold, the better the performance.
Besides AMF (introduced in Section II-C), CLEF [37] proposes to detect low-rate large flows, which are similar to the low-rate overuse flows in our work, using recursive division and by combining two detectors with complementary properties. Our evaluation shows that LOFT outperforms EARDet, one of the detectors used by CLEF, when the overuse ratio is lower than 7x. The other detector used by CLEF is a sketch and thus inherits the limitations explained in Section II-C.

Hybrid SRAM/DRAM-based architectures for exact counting have been proposed by Shah et al. [32] and were further improved by Ramabhadran and Varghese [29] and by Zhao et al. [41]. By default, these schemes only consider the number of packets per flow, but not the flow sizes. Even though the authors propose an extension to consider flow sizes, the use of probabilistic counting in these schemes introduces a high counter variance, making low-rate overuse flows hard to detect. Lall et al. [21] propose another SRAM/DRAM hybrid data structure for the efficient detection of both medium and large flows. However, because the proposed flow-monitoring solution uses shared counters (in the form of spectral Bloom filters), it performs poorly in catching low-rate overuse flows.

Another approach to tackle the aforementioned problems uses sampling, where only a small subset of packets is used for flow accounting. Sampled NetFlow [12] is a widely deployed solution that collects one out of every n packets and estimates statistics of the original population by extrapolating from the sample. Researchers have proposed advanced sampling algorithms tailored to catching large flows. For example, Sample and Hold [16] and Sticky Sampling [23] are designed to bias toward large flows. Instead of using a static sampling rate, several adaptive sampling algorithms dynamically adjust the sampling rate so as to keep resource consumption under a fixed memory limitation [15], [31].
However, without a sufficiently high sampling rate (resulting in a large amount of fast memory), sampling-based algorithms are prone to false positives and false negatives, as shown by Estan and Varghese [16]. LOFT relies on sampling to generate a list of active flows (if the list is not provided). However, LOFT only requires that at least one packet of a flow appears in a sample, thus requiring a lower sampling rate than accurate flow accounting.

A recent series of works including SketchVisor [19], ElasticSketch [39], and NitroSketch [22] devises techniques to speed up the updating of sketch-based detector data structures. However, these techniques all trade off detection accuracy against processing speed, i.e., these algorithms achieve even lower accuracy than unaltered sketches like AMF, evaluated in Section IV. In contrast, LOFT achieves both high accuracy and a low per-packet overhead.

VII. CONCLUSIONS
Due to the limitations of previous approaches to probabilistic flow monitoring, network operators have so far lacked effective measurement tools that could give accurate insight into the flow-size distribution on high-capacity routers. Given router constraints regarding fast memory and computation, existing schemes suffer from a large measurement error, which only allows the reliable detection of extremely large flows, but not of flows with a small amount of overuse. In this work, we show that the source of this measurement error in sketch-based schemes is counter noise, i.e., the high variance of counter values. Using this insight, we develop LOFT, a sketch-based approach that counteracts the counter noise while respecting the stringent complexity constraints of high-speed routers.

As a result, the measurement error of LOFT is so small that low-rate overuse flows (i.e., flows only 50 – 100% larger than the average flow) can be reliably detected with a small amount of fast memory. Concretely, LOFT can reliably identify a flow that is only 50% larger than the average flow within one second, whereas all other investigated schemes fail to identify such a flow even within 300 s. Moreover, LOFT accomplishes this high accuracy while reducing the fast-memory requirement by more than one order of magnitude in comparison with individual-flow monitoring. We also investigate the scalability and overhead of LOFT with a DPDK and an FPGA implementation, and show that LOFT enables line-rate forwarding of a realistic traffic mix.

With these demonstrated properties, LOFT can serve as a powerful flow-monitoring tool, which will allow network operators to improve the efficacy of existing applications based on flow-size estimation (e.g., flow-size aware routing) and to enable new applications based on such estimates (e.g., reservation-based DDoS defense).

REFERENCES

[1] Intel Skylake CPU architecture characteristics. Accessed August 2018.
[2] Netcope NFB-200G2QL FPGA platform equipped with Virtex UltraScale+ FPGA chip. Accessed September 2020.
[3] Werner Almesberger, Tiziana Ferrari, and J.-Y. Le Boudec. Scalable resource reservation for the internet. In Proceedings of the International Conference on Protocols for Multimedia Systems - Multimedia Networking, pages 18–27. IEEE, 1997.
[4] Anonymous. Paper presenting the efficient FPGA implementation of the LOFT update component (currently under review), 2020.
[5] Ran Ben Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. Randomized admission policy for efficient top-k and frequency estimation. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pages 1–9. IEEE, 2017.
[6] Cristina Basescu, Raphael M. Reischuk, Pawel Szalachowski, Adrian Perrig, Yao Zhang, Hsu-Chun Hsiao, Ayumu Kubota, and Jumpei Urakawa. SIBRA: Scalable internet bandwidth reservation architecture. In Proceedings of the Symposium on Network and Distributed System Security (NDSS), February 2016.
[7] Ran Ben Basat, Gil Einziger, Roy Friedman, Marcelo C. Luizelli, and Erez Waisbard. Constant time updates in hierarchical heavy hitters. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 127–140, 2017.
[8] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, 1970.
[9] CAIDA. The CAIDA UCSD Anonymized Internet Traces - Oct. 18th, 2018.
[10] Kevin K. Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, and Onur Mutlu. Understanding latency variation in modern DRAM chips: Experimental characterization, analysis, and optimization. In Proceedings of ACM SIGMETRICS, June 2016.
[11] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, 2002.
[12] B. Claise. Cisco Systems NetFlow Services Export Version 9. RFC 3954 (Informational), October 2004.
[13] Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[14] C. Estan. Internet Traffic Measurement: What's Going on in my Network? PhD thesis, 2003.
[15] Cristian Estan, Ken Keys, David Moore, and George Varghese. Building a better NetFlow. In ACM SIGCOMM, 2004.
[16] Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems, 21(3):270–313, 2003.
[17] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science, pages 137–156, 2007.
[18] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[19] Qun Huang, Xin Jin, Patrick P. C. Lee, Runhui Li, Lu Tang, Yi-Chao Chen, and Gong Zhang. SketchVisor: Robust network measurement for software packet processing. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 113–126. ACM, 2017.
[20] Min Suk Kang, Soo Bum Lee, and Virgil D. Gligor. The Crossfire attack. In Proceedings of the 2013 IEEE Symposium on Security and Privacy, SP '13, pages 127–141. IEEE Computer Society, 2013.
[21] Ashwin Lall, Mitsunori Ogihara, and Jun Xu. An efficient algorithm for measuring medium- to large-sized flows in network traffic. In IEEE INFOCOM, 2009.
[22] Zaoxing Liu, Ran Ben-Basat, Gil Einziger, Yaron Kassner, Vladimir Braverman, Roy Friedman, and Vyas Sekar. NitroSketch: Robust and general sketch-based monitoring in software switches. In Proceedings of the ACM Special Interest Group on Data Communication, pages 334–350, 2019.
[23] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of VLDB, 2002.
[24] A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pages 398–412. Springer, 2005.
[25] J. Misra and David Gries. Finding repeated elements. Science of Computer Programming, 2(2):143–152, 1982.
[26] A. Morton. IMIX genome: Specification of variable packet sizes for additional testing. RFC 6985, July 2013.
[27] R. Pagh and F. F. Rodler. Cuckoo hashing. In European Symposium on Algorithms, pages 121–133. Springer, 2001.
[28] DPDK Project. Data Plane Development Kit, 2020.
[29] Sriram Ramabhadran and George Varghese. Efficient implementation of a statistics counter architecture. In ACM SIGMETRICS Performance Evaluation Review, volume 31, pages 261–271. ACM, 2003.
[30] Nageswara S. V. Rao and Stephen Gordon Batsell. QoS routing via multiple paths using bandwidth reservation. In Proceedings of IEEE INFOCOM '98, volume 1, pages 11–18. IEEE, 1998.
[31] Josep Sanjuas-Cuxart, Pere Barlet-Ros, Nick Duffield, and Ramana Kompella. Cuckoo sampling: Robust collection of flow aggregates under a fixed memory budget. In IEEE INFOCOM, 2012.
[32] Devavrat Shah, Sundar Iyer, Balaji Prabhakar, and Nick McKeown. Analysis of a statistics counter architecture. In Hot Interconnects, volume 9, pages 107–111, 2001.
[33] Vibhaalakshmi Sivaraman, Srinivas Narayana, Ori Rottenstreich, S. Muthukrishnan, and Jennifer Rexford. Heavy-hitter detection entirely in the data plane. In Proceedings of the Symposium on SDN Research, pages 164–176. ACM, 2017.
[34] Spirent. TestCenter N4U datasheet, 2020.
[35] Ahren Studer and Adrian Perrig. The Coremelt attack. In Computer Security - ESORICS 2009, pages 37–52. Springer, 2009.
[36] Jonathan Turner. New directions in communications (or which way to the information age?). IEEE Communications Magazine, 1986.
[37] Hao Wu, Hsu-Chun Hsiao, Daniele Enrico Asoni, Simon Scherrer, Adrian Perrig, and Yih-Chun Hu. CLEF: Limiting the damage caused by large flows in the internet core. In International Conference on Cryptology and Network Security (CANS), 2018.
[38] Hao Wu, Hsu-Chun Hsiao, and Yih-Chun Hu. Efficient large flow detection over arbitrary windows: An algorithm exact outside an ambiguity region. In Proceedings of the 2014 Internet Measurement Conference (IMC), pages 209–222. ACM, 2014.
[39] Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 561–575. ACM, 2018.
[40] Tong Yang, Haowei Zhang, Jinyang Li, Junzhi Gong, Steve Uhlig, Shigang Chen, and Xiaoming Li. HeavyKeeper: An accurate algorithm for finding top-k elephant flows. IEEE/ACM Transactions on Networking, 27(5):1845–1858, 2019.
[41] Qi Zhao, Jun Xu, and Zhen Liu. Design of a novel statistics counter architecture with optimal space and time efficiency. ACM SIGMETRICS Performance Evaluation Review, 34(1):323–334, 2006.

APPENDIX
We would like to know the probability that LOFT catches overuse flows, and how much damage such flows can inflict before being caught. Our analysis first derives a lower bound on the probability that an overuse flow's estimate exceeds the estimate of a non-overuse flow. We then use this bound to calculate the probability that the overuse flow's estimate falls within the top-$W_{fm}$ largest estimates, which is exactly its probability of being monitored.

Given the complexity of this problem, we make the following assumptions throughout the analysis. We require that the number of flows $N$ in the traffic is fixed during each detection period, and that $N$ is large enough that an attacker cannot manipulate either $N$ or the overall traffic-size distribution. In addition, we assume that the flow estimates, which are random variables in our model, have only weak pairwise dependence, and we therefore treat them as i.i.d. in the following analysis.

A. Lower Bound of the Detection Probability
First, we derive a lower bound on the detection probability given the amount of traffic an overuse flow has sent after LOFT starts a reset cycle. This lower bound lets us analyze how long LOFT needs to catch the overuse flow with sufficiently high probability. By setting a proper reset cycle, our analysis then determines the maximum damage an overuse flow can inflict, and guarantees that LOFT catches the overuse flow with high probability if it exceeds this damage limit.

The detection probability of an overuse flow is exactly the probability that LOFT chooses to monitor the flow and the flow monitor catches its overusing behavior. Here we ignore the time an overuse flow spends in the flow monitor after being reported to it; analyzing and improving the flow monitor is outside the scope of this paper. The detection probability is therefore equal to the probability that the overuse flow is selected by LOFT.

We define $U_i$, $A_i$, and $C_i$ as the estimate, accumulator, and counter of flow $i$ in our algorithm, respectively. They are related as follows:

$U_i = \frac{A_i}{C_i}$ (A.1)

In each minor cycle, every flow except flow $i$ has probability $1/W$ of contributing part of its flow size to $A_i$. We define a flow segment $X_j$ as the amount contributed by another flow $j$ to $A_i$ in a minor cycle. Every $X_j$ can be seen as a random variable with unknown distribution, since it is selected randomly by a uniform hash function. More precisely, suppose $S_a = \{X_1, X_2, \dots, X_a\}$ and $S_b = \{X_{a+1}, X_{a+2}, \dots, X_{a+b}\}$ are added to $A_i$ in minor cycles $x$ and $x+1$, respectively. $S_a$ consists of exactly $a$ uniform samples, drawn without replacement, from the $N$ flow segments sent by the $N$ flows during minor cycle $x$. Likewise, $S_b$ consists of another $b$ uniform samples from minor cycle $x+1$ and is independent of the prior samples $S_a$, since $S_a$ and $S_b$ are sampled independently from two different sets of flow segments.
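To make the counter model above concrete, the following sketch simulates a single LOFT counter under the stated assumptions. The function name, parameter defaults, and the uniform segment sizes are illustrative assumptions for this sketch, not LOFT's actual implementation: flow $i$ contributes its own traffic every minor cycle, while each of the other $N-1$ flows is hashed to this counter with probability $1/W$ and, when it is, adds a segment $X_j \in [0, M]$.

```python
import random

def simulate_estimate(theta=50, N=1000, W=64, F_i=100.0, M=2.0, seed=1):
    """Toy model of one LOFT counter (simplified, assumed update rule):
    flow i adds F_i/theta per minor cycle to its own accumulator; each
    other flow lands here with probability 1/W per cycle and contributes
    a segment X_j drawn uniformly from [0, M] (cf. Eq. A.2).
    Returns the estimate U_i = A_i / C_i (Eq. A.1)."""
    rng = random.Random(seed)
    A, C = 0.0, 0
    for _ in range(theta):            # theta minor cycles
        A += F_i / theta              # flow i's own traffic
        C += 1
        for _ in range(N - 1):        # segments X_j from the other flows
            if rng.random() < 1.0 / W:
                A += rng.uniform(0.0, M)
                C += 1
    return A / C
```

Note how $C_i$ collects roughly $\theta + (N-1)\theta/W$ increments, $\theta$ of them from flow $i$ itself, matching the $c_i - \theta$ foreign segments in Equation A.3.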
These properties allow us to apply Hoeffding's inequality [18] and derive the bound in Equation A.8. Because we assume that overuse flows are a minority and cannot affect the traffic distribution, and because all other flows do not violate the flow specification, $X_j$ also satisfies the following bounds, where $M \triangleq \gamma\omega^{-1} + \beta$:

$0 \le X_j, \quad E[X_j] \le M$ (A.2)

Suppose LOFT has run $\theta$ minor cycles and $c_i$ flow segments have accumulated in $C_i$, and let $F_i$ be the size of flow $i$. Under this condition, we can represent $U_i$ as the random variable

$U_i \mid (C_i = c_i) = \frac{1}{c_i}\Big(F_i + \sum_{j=1}^{c_i - \theta} X_j\Big)$ (A.3)

By Equation A.3, $U_i$ is modeled as the sum of several random variables, and is hence itself a random variable with unknown distribution. As mentioned previously, different estimates $U$ are treated as i.i.d. This assumption is reasonable because, under random dispatching, the probability that any pair of estimates shares many segments $X_j$ (and thus builds strong dependence) is extremely low; a more precise analysis is left for future work.

Let $U_l$ and $U_b$ be the estimates of an overuse flow and a non-overuse flow, respectively. By Equation A.3, we can enumerate all conditions and derive the following equation (in the equations below, we abbreviate $C_i = c_i$ to $c_i$):

$P(U_l > U_b) = \sum_{c_l=1}^{N\theta} \sum_{c_b=1}^{N\theta} P(U_l > U_b \mid c_l, c_b)\, P(c_l, c_b)$ (A.4)

Instead of bounding Equation A.4 directly, we choose a threshold $\tau$ and simplify the bound with the following inequality, which holds since $U_l$ and $U_b$ are i.i.d. under our assumption:

$P(U_l > U_b \mid c_l, c_b) \ge P(U_l > \tau \mid c_l)\, P(U_b < \tau \mid c_b)$ (A.5)

Combining this with Equation A.4, we obtain

$P(U_l > U_b) \ge \sum_{c_l=1}^{N\theta} \sum_{c_b=1}^{N\theta} P(U_l > \tau \mid c_l)\, P(U_b < \tau \mid c_b)\, P(c_l, c_b)$ (A.6)

The goal is now to bound $P(U_i > \tau \mid c_i)$. Suppose $c_b - \theta \ge 1$; for the estimate of the non-overuse flow $b$, combining with Equation A.3, we have the following derivation.
$P(U_b \ge \tau \mid c_b) = P\Big(\frac{1}{c_b}\Big(F_b + \sum_{j=1}^{c_b-\theta} X_j\Big) \ge \tau\Big)$
$= P\Big(\frac{F_b}{c_b} + \frac{c_b-\theta}{c_b} \cdot \frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j \ge \tau\Big)$
$= P\Big(\frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j \ge \frac{c_b}{c_b-\theta}\Big(\tau - \frac{F_b}{c_b}\Big)\Big)$
$= P\Big(\Big(\frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j\Big) - E[X] \ge \frac{c_b}{c_b-\theta}\Big(\tau - \frac{F_b}{c_b}\Big) - E[X]\Big)$ (A.7)

Let $v \triangleq \frac{c_b}{c_b-\theta}\big(\tau - \frac{F_b}{c_b}\big) - E[X]$. If $v \ge 0$, then since $X_j$ is constrained by the inequality in Equation A.2 and satisfies the sampling properties discussed above, we can apply Hoeffding's inequality to obtain an upper bound on Equation A.7:

$P\Big(\Big(\frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j\Big) - E[X] \ge v\Big) \le \exp\Big(-\frac{2(c_b-\theta)\, v^2}{M^2}\Big)$ (A.8)

We choose $\tau$ as the midpoint between the expected values of $U_l \mid c_l$ and $U_b \mid c_b$. Hence, $\tau$ for given $c_l, c_b$ can be represented as

$\tau_{c_l c_b} = \frac{1}{2}\big(E[U_l \mid c_l] + E[U_b \mid c_b]\big) = E[X] + \frac{1}{2 c_l c_b}\big(-\theta E[X](c_l + c_b) + c_b F_l + c_l F_b\big)$ (A.9)

Finally, substituting Equation A.9 for $\tau$ in Inequality A.8 yields the following bound:

$P(U_b \ge \tau_{c_l c_b} \mid c_b) \le P\Big(\Big(\frac{1}{c_b-\theta} \sum_{j=1}^{c_b-\theta} X_j\Big) - E[X] \ge v_b\Big) \le \exp\Big(-\frac{2(c_b-\theta)\, v_b^2}{M^2}\Big)$,
where $v_b \triangleq \frac{1}{2 c_l (c_b-\theta)}\big[\theta E[X](c_l - c_b) + c_b F_l - c_l F_b\big]$ (A.10)

For the estimate $U_l$ of the overuse flow, the same method yields the bound

$P(U_l \le \tau_{c_l c_b} \mid c_l) \le \exp\Big(-\frac{2(c_l-\theta)\, v_l^2}{M^2}\Big)$,
where $v_l \triangleq \frac{1}{2 c_b (c_l-\theta)}\big[\theta E[X](c_l - c_b) + c_b F_l - c_l F_b\big]$ (A.11)

Although Equations A.10 and A.11 contain the unknown quantity $E[X]$, and the inequalities hold only under certain conditions, we know that $0 \le E[X] \le M$; we can therefore obtain worst-case probabilities by choosing the $E[X]$ that minimizes $v_b$ and $v_l$. Equations A.10 and A.11 can thus be rewritten as the following two functions, which give worst-case bounds without knowledge of $E[X]$ and without any assumption on $c_b$ and $c_l$.
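As a numerical sanity check on the algebra in Equations A.7–A.10, the sketch below (with illustrative helper names) computes the midpoint threshold $\tau$, the deviation $v_b$, and the Hoeffding tail bound, and verifies that $v_b$ indeed equals the scaled gap $\frac{c_b}{c_b-\theta}(\tau - \frac{F_b}{c_b}) - E[X]$ that Equation A.8 requires:

```python
import math

def tau_midpoint(theta, c_l, c_b, F_l, F_b, EX):
    """Threshold tau (Eq. A.9): midpoint of E[U_l|c_l] and E[U_b|c_b]."""
    return EX + (-theta * EX * (c_l + c_b) + c_b * F_l + c_l * F_b) / (2.0 * c_l * c_b)

def v_b(theta, c_l, c_b, F_l, F_b, EX):
    """Deviation v_b (Eq. A.10)."""
    return (theta * EX * (c_l - c_b) + c_b * F_l - c_l * F_b) / (2.0 * c_l * (c_b - theta))

def hoeffding_tail(n, v, M):
    """Hoeffding bound (Eq. A.8): P(sample mean of n values in [0, M]
    exceeds its expectation by v) <= exp(-2 n v^2 / M^2), for v >= 0."""
    return math.exp(-2.0 * n * v * v / (M * M))

# Cross-check with arbitrary example values: v_b must equal
# (c_b / (c_b - theta)) * (tau - F_b / c_b) - E[X].
theta, c_l, c_b, F_l, F_b, EX = 10, 50, 40, 200.0, 80.0, 1.0
tau = tau_midpoint(theta, c_l, c_b, F_l, F_b, EX)
direct = (c_b / (c_b - theta)) * (tau - F_b / c_b) - EX
assert abs(v_b(theta, c_l, c_b, F_l, F_b, EX) - direct) < 1e-9
```

Because $v_b$ is linear in $E[X]$, its minimum over $[0, M]$ is attained at an endpoint, which is what makes the worst-case functions in Equation A.12 easy to evaluate.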
$\hat{P}_b(\theta, c_b, c_l, F_b, F_l) \triangleq \begin{cases} 1 & \text{if } \min v_b < 0 \;\vee\; c_b - \theta < 1 \\ \exp\big(-\frac{2(c_b-\theta)(\min v_b)^2}{M^2}\big) & \text{otherwise} \end{cases}$

$\hat{P}_l(\theta, c_b, c_l, F_b, F_l) \triangleq \begin{cases} 1 & \text{if } \min v_l < 0 \;\vee\; c_l - \theta < 1 \\ \exp\big(-\frac{2(c_l-\theta)(\min v_l)^2}{M^2}\big) & \text{otherwise} \end{cases}$ (A.12)

For the pmf $P(c_l, c_b)$, we assume that our random flow-to-counter mapping in each minor cycle makes $C_i$ follow a binomial distribution, so it can be represented as

$P(\theta, c_l, c_b) = P\big(B(N\theta, \tfrac{1}{W}) = c_l\big)\, P\big(B(N\theta, \tfrac{1}{W}) = c_b\big)$ (A.13)

Combining all equations above, given the amounts $F_l$ and $F_b$ sent by an overuse flow and a non-overuse flow in $\theta$ minor cycles, we can derive the lower-bound function $\hat{P}_{win}$ of $P(U_l > U_b)$:

$\hat{P}_{win}(\theta, F_l, F_b) \triangleq \sum_{c_l=1}^{N\theta} \sum_{c_b=1}^{N\theta} \big(1 - \hat{P}_b(\theta, c_b, c_l, F_b, F_l)\big)\big(1 - \hat{P}_l(\theta, c_b, c_l, F_b, F_l)\big)\, P(\theta, c_l, c_b)$ (A.14)

The worst case of $\hat{P}_{win}$ occurs when the amount $F_b$ of the non-overuse flow equals its maximum permitted size. Therefore, $\hat{P}_{win}(\theta, F_l, \gamma\omega^{-1}\theta + \beta)$ gives the worst-case lower bound when the behavior of the non-overuse flows is unknown.

The probability that an overuse flow is selected into the flow monitor equals the probability that it loses to fewer than $W_{fm}$ non-overuse flows in the ranking. Since we have assumed that the estimates $U$ are i.i.d., this probability is lower-bounded by

$P_{mon}(\theta, F_l) \ge \sum_{k=0}^{W_{fm}-1} \binom{N}{k} (1 - \tilde{P}_{win})^k (\tilde{P}_{win})^{N-k}$, where $\tilde{P}_{win} \triangleq \hat{P}_{win}(\theta, F_l, \gamma\omega^{-1}\theta + \beta)$
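The final bound is straightforward to evaluate once $\tilde{P}_{win}$ is known. A minimal sketch, assuming a precomputed lower bound p_win on $P(U_l > U_b)$ (the function names are illustrative):

```python
import math

def binom_pmf(n, p, k):
    """Binomial pmf, as used for P(B(N*theta, 1/W) = c) in Eq. A.13."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def p_mon_lower_bound(N, W_fm, p_win):
    """Lower bound on the monitoring probability (final display): the
    overuse flow enters the flow monitor if fewer than W_fm of the N
    flows out-rank its estimate; p_win lower-bounds P(U_l > U_b)."""
    return sum(math.comb(N, k) * (1.0 - p_win)**k * p_win**(N - k)
               for k in range(W_fm))
```

As expected, the bound approaches 1 as p_win approaches 1, i.e., a sufficiently over-sending flow is monitored almost surely once its estimate dominates those of conforming flows.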