Taurus: An Intelligent Data Plane
Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, and Kunle Olukotun
Stanford University
ABSTRACT
Emerging applications—cloud computing, the internet of things, and augmented/virtual reality—need responsive, available, secure, ubiquitous, and scalable datacenter networks. Network management currently uses simple, per-packet, data-plane heuristics (e.g., ECMP and sketches) under an intelligent, millisecond-latency control plane that runs data-driven performance and security policies. However, to meet users' quality-of-service expectations in a modern data center, networks must operate intelligently at line rate.

In this paper, we present Taurus, an intelligent data plane capable of machine-learning inference at line rate. Taurus adds custom hardware based on a map-reduce abstraction to programmable network devices, such as switches and NICs; this new hardware uses pipelined and SIMD parallelism for fast inference. Our evaluation of a Taurus-enabled switch ASIC—supporting several real-world benchmarks—shows that Taurus operates three orders of magnitude faster than a server-based control plane, while increasing area by 24% and latency, on average, by 178 ns.

On the long road to self-driving networks, Taurus is the equivalent of adaptive cruise control: deterministic rules steer flows, while machine learning tunes performance and heightens security.
The tremendous scale of modern data centers—tens of thousands of servers, connected by elaborate networks [45, 78, 95]—causes many logistical and technical challenges [10, 38, 88]. Moreover, the high-throughput and low-latency requirements of emerging workloads (e.g., cloud computing, the internet of things, and augmented/virtual reality) make managing such large, complex networks challenging [10, 45, 88]. When implementing management policies (e.g., for performance or security), network operators face a dichotomy: they must choose between line-rate execution and computational complexity.

Data-plane devices (e.g., switches and NICs) can react in nanoseconds to network conditions, but have a limited programming model designed to forward packets at line rate (e.g., flow tables [13]). This restricts network operations to simple heuristics [5, 60, 65] in data-plane devices and purpose-built tasks in fixed-function hardware (e.g., middleboxes [22, 68]). A security policy for anomaly detection, for example, would re-use flow tables—intended as L3 routing or L2 forwarding tables—to implement blacklists or Access Control Lists (ACLs). Such policies, therefore, operate within the constraints of current data-plane abstractions, which set forth a binary world: packets matching a blacklist are dropped, with all others forwarded. Nevertheless, data-plane devices process every packet, so they can capture fine-grained statistics (using counters and sketches) and make a new decision for each packet.

Control-plane servers can make complicated, data-driven decisions, but only for a few packets (e.g., the first of every flow). Later packets match the cached decisions—installed in the data plane as flow rules—and are forwarded directly by the data plane. By using more data, a centralized control plane can make better decisions, providing better performance and security.
For example, servers (possibly with accelerators [58, 79]) can implement learning anomaly-detection algorithms like clustering, support-vector machines, and neural networks [71, 81, 102]; these algorithms can automatically find latent non-linear correlations between features.

Ideally, network processing would be data-driven and react to every packet—all packets could be sent through the control plane, or data-plane devices could be more flexible. Caching data-driven, per-packet decisions would provide per-packet reactivity, but header instability would effectively result in all packets being processed in the control plane. This approach would decrease performance by about three orders of magnitude, precluding data-driven performance tuning and restricting data-driven anomaly detection to the most hardened networks. The better approach is a more flexible data plane: by adding a new abstraction designed for decision-making, not packet forwarding, switches and NICs can improve their functionality with minimal hardware (compared to intelligent decision-making with flow tables).

Data planes, today, use only three abstractions to bridge the programmer-hardware gap—packet parsing maps to Finite State Machines (FSMs) [37], flow rules map to Match-Action Tables (MATs) [13], and scheduling maps to Push-In First-Out (PIFO) [98]—so any new abstraction must also be ubiquitous, general-purpose, and provide a coherent high- and low-level interface. Machine Learning (ML) provides a broad high-level interface suitable for many applications, including supervised and reinforcement learning. Anomaly detection would use supervised learning: operators identify anomalous packets after the fact, which lets a model learn to predict other anomalies.
Reinforcement Learning (RL) is more useful for automatic performance tuning: by automatically trying small tweaks to a running model and seeing which ones improve performance, the system adapts itself.

Most ML algorithms are built around linear algebra, which uses a significant amount of repetitive computation, performed on a small number of weights, with regular communication. Unnecessary flexibility, such as the all-to-all VLIW communication [120], large memories, and ternary CAMs in MATs [13], consumes chip area without benefiting ML; prior attempts at ML in switches have failed due to this inefficiency [97]. Map-reduce, on the other hand, is a good low-level abstraction for ML because it provides the necessary computations (large numbers of multiplies and adds) and no unnecessary flexibility.

Although ML can make data-driven decisions, it cannot handle all network functionality. ML is suited for decisions currently made by (approximate) heuristics, such as congestion control, ECMP, and anomaly detection; these decisions impact only networks' performance and security, not their core packet-forwarding behavior. Networks built using ML will therefore use flow rules to express a range of valid decisions (e.g., output ports), and the ML model will optimize best-case while bounding worst-case performance by selecting from a pre-determined set of decisions. An intelligent control plane will thus compile user programs phrased as constrained optimization problems: for example, minimizing congestion while ensuring a certain bandwidth for high-priority flows.

In this paper, we present Taurus, a data plane augmented with a new ML abstraction and programmable map-reduce hardware for intelligent (data-driven) packet forwarding. The control plane receives telemetry data from the entire network (e.g., via In-Band Network Telemetry, or INT [61]), trains new switch weights, and installs them in Taurus alongside traditional flow rules for packet forwarding.
To operate at line rate, Taurus's map-reduce block implements only the multiply and add operations needed for ML. The map-reduce block works alongside parsers, MATs, and the scheduler to forward packets, with MATs connecting map-reduce to the pipeline: pre-processing MATs extract numeric input features from packets, the map-reduce block uses these features and an ML model to generate a numeric result, and post-processing MATs transform this output into a packet-forwarding decision.

Recent coarse-grained accelerators for data analytics [44, 86, 104] underpin Taurus's map-reduce block: a user-defined program graph is spatially mapped to a reconfigurable array, where data flows through the array. Taurus's map-reduce block is tailored for line-rate inference: unnecessary operations, including DRAM access and floating-point operations, are eliminated; a gridded organization of compute and memory units is maintained, with 16 SIMD lanes per compute unit and 16 banks per memory unit. We evaluate the overhead of Taurus's map-reduce block as an addition to a programmable switch, and we demonstrate that the average added latency is 178 ns and the added area is 24% to implement a range of proposed algorithms.

By enabling data-plane ML with low overhead and a clear abstraction, Taurus moves data-driven processing from a per-flow to a per-packet level and lets complex performance and security policies run at line rate.

In summary, we make the following contributions:

• A Taurus logical pipeline using a map-reduce abstraction for line-rate, per-packet inference (§3).

• A hardware design of a Taurus-enabled switch with a reconfigurable SIMD dataflow engine [86] for map-reduce (§4).

• Analysis of the design using ASIC synthesis and a 28 nm generic library [40] to determine area and power overheads relative to commercially available switches (§5.1).
• Evaluation with real ML networking applications (§5.2) and microbenchmarks (§5.3), showing that Taurus supports the functions common in modern ML at line rate (1 GPkt/s).

We begin by motivating the need for an intelligent data plane (§2) and highlighting both the importance of per-packet ML and the limitations of existing data- (§2.1) and control-plane ML approaches (§2.2).
Ethics:
This work does not raise any ethical issues. This research has no human subjects, and formal institutional review is not required.
Taurus is an intelligent data plane that runs ML at line rate for every packet and uses ML's output to optimize forwarding decisions. Machine learning provides significant improvements in traffic engineering [114], scheduling [112, 115], and security [4, 26, 56, 71, 102]. SIMON [36] also reconstructs queuing delays in network switches with higher accuracy than edge-based methods [46, 76]. Furthermore, decision trees (Remy [112]) and recurrent neural networks (Pantheon [115]) for congestion control have a throughput-latency frontier beyond that of many human-designed algorithms [27, 48, 116]. Many of these algorithms make use of sub-flow features. For example, Remy uses RTTs and ACKs, and anomaly detection (e.g., KDD Cup entries [103]) uses packet-level features like connection duration. Boutaba et al. [15] survey ML-based networking applications and find that tasks like traffic classification make liberal use of packet-level features [7, 29–32, 67, 75, 82, 101, 117–119]. Even for encrypted traffic, packet-level features like inter-arrival times and packet sizes allow classification [6, 11].

Level    Accuracy (%)    F1 Score    Missed Anomalies
Flow     48.8            58.1        11,392
Packet   75.0            78.3        5,273

Table 1: A comparison of flow- and packet-level anomaly-detection DNNs.
Per-Packet ML: A Case Study.
To highlight the importance of packet-level features, we build a simple example using an anomaly-detection DNN [102] and the updated NSL-KDD [103] intrusion-detection data set. The DNN uses TCP-level features available in the data set (e.g., the current connection duration or the number of observed packets with an urgent TCP flag set [42]), but we exclude features only available after a flow ends (e.g., total source and destination bytes transferred). We measure the DNN's performance with and without packet-level features after training for ten epochs. Packet-level features improve our model's accuracy by over 25% and reduce the number of missed anomalies (i.e., false negatives) by a factor of two, as shown in Table 1. In short, packet-level events let ML models better understand network behavior and make more accurate decisions.
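The effect in this case study can be reproduced in miniature with a toy experiment. The sketch below is not the paper's NSL-KDD setup: it uses synthetic data in which the label depends mostly on a packet-level feature, with a plain logistic-regression model standing in for the DNN. All feature names and coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
# Flow-level feature carries a weak signal; packet-level feature a strong one.
flow_feat = rng.normal(size=n)
pkt_feat = rng.normal(size=n)
label = (0.4 * flow_feat + 1.6 * pkt_feat + rng.normal(scale=0.5, size=n) > 0)
label = label.astype(float)

def train_accuracy(X, y, epochs=200, lr=0.5):
    """Full-batch gradient-descent logistic regression; returns training accuracy."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return float(((X @ w + b > 0) == y).mean())

acc_flow_only = train_accuracy(flow_feat[:, None], label)
acc_with_pkt = train_accuracy(np.stack([flow_feat, pkt_feat], axis=1), label)
print(acc_flow_only, acc_with_pkt)  # the packet-level feature helps markedly
```

With the strong signal hidden in the packet-level feature, the flow-only model hovers near chance while the augmented model approaches the noise floor, mirroring the accuracy gap in Table 1.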
There have been a number of recent attempts to use current switch abstractions (i.e., MATs) and specialized hardware [36] to support per-packet ML models.
The match-action abstraction is insufficient for line-rate ML in modern data-plane devices, due to both missing operations (especially loops and multiplication) and inefficient MAT pipelines [13]. Binary neural networks have been implemented—using tens of MATs each—but they lack the precision needed for practical deployments [97]. Likewise, an SVM for IoT classification [113] consumes most of the memory of a NetFPGA switch—an experimental research platform [2, 69]—and has not been mapped onto a real switch ASIC. As these techniques use switches' VLIW pipelines to implement simpler, SIMD programs (with lower memory requirements), they use only a small fraction of the MAT hardware while rendering the entire stage unavailable.
VLIW Parallelism.
The difference in communication requirements between a VLIW model and a SIMD model is shown in Figure 1. VLIW models, used in current switch MATs [13], have multiple logically-independent instructions per stage operating in parallel, reading from and writing to arbitrary locations. This all-to-multiple input communication and multiple-to-all output communication requires large crossbars. For example, a 16-issue VLIW processor has 20× as much control logic as an equally-powerful cluster of eight dual-issue processors [120]. VLIW's overhead thus limits the number of instructions per stage. Barefoot's Tofino chip only executes 12 operations per stage: four of each of 8, 16, and 32 bits [47]. A typical DNN layer may require 72 multiplications and 144 additions [102]; even if multiplication were added to MATs, this would be 18 stages (most of the pipeline).

Figure 1: A comparison between VLIW and SIMD communication, with lines showing possible paths. The additional communication possible with VLIW increases overhead and is unnecessary for ML.

Traditional accelerators, like TPUs [58], GPUs [79], and FPGAs [33], could extend the data-plane pipeline as bump-in-the-wire inference engines, connected via PCIe or Ethernet. In most accelerators, inputs are batched to increase parallelism: larger batch sizes boost throughput by enabling more-efficient matrix-matrix multiplication. However, to provide reliably-low per-packet latency, unbatched (matrix-vector) execution is needed; otherwise, packets would be delayed while waiting for a batch to fill. Moreover, adding another physically-separate accelerator would either consume switch ports (wasting transceivers) or replicate switch functions like packet parsing and match-action rules for feature extraction; separate accelerators would add area, decrease throughput, and consume power.
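The 18-stage figure quoted above for a DNN layer in MATs is simple arithmetic; assuming the 12-operations-per-stage budget reported for Tofino [47], a quick check:

```python
import math

# Back-of-the-envelope check of the stage count: a typical DNN layer needs
# 72 multiplications and 144 additions [102], while a Tofino-like MAT stage
# issues at most 12 ALU operations [47].
ops_per_stage = 12
multiplications, additions = 72, 144
stages = math.ceil((multiplications + additions) / ops_per_stage)
print(stages)  # → 18
```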
An alternative is to cache inference results in MATs [73]. In a caching scheme, ML training and inference run in the control plane, while inference results are stored in the data plane as flow-table rules. However, ML models with frequently-changing inputs, like packet size—which provide greater accuracy—would experience excessive cache misses.
Cache Miss Rates.
To demonstrate this, we build a simple model to predict the cache miss rate as a function of header entropy (i.e., how frequently a header's value changes across packets); matching on high-entropy fields results in more misses than matching on low-entropy fields [94]. We use the five-tuple and a variable number of unstable headers (e.g., packet sizes) as input features and sample flow lengths from an empirical traffic distribution [59]. We assume infinite switch memory to eliminate capacity-driven cache misses (i.e., all rules remain in cache once installed). Figure 2 shows the cache miss rates for different numbers of header fields and levels of entropy. The miss rate increases linearly for a single header field but grows super-linearly as more fields are added. When using eight fields (corresponding to a small ML model [102]), almost all packets traverse the control plane (a cache hit rate of zero).

Figure 2: Cache miss rates with an increasing number of header fields and infinite rule space.

Accelerator        Latency (ms)
Broadwell Xeon     0.67
Tesla T4 GPU       1.15
Cloud TPU v2-8     3.51

Table 2: Inference time for control-plane accelerators.

Cache-based inference, therefore, would be limited to only a few low-entropy headers [84, 92]. This effectively prevents ML models from using per-packet features and decreases their accuracy.
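The miss-rate model above can be sketched in a few lines. This is a simplified stand-in for the paper's simulation: fixed flow lengths and a single synthetic entropy parameter replace the empirical traffic distribution [59], and the rule table is unbounded.

```python
import random

def miss_rate(n_flows, pkts_per_flow, unstable_fields, entropy_values, seed=0):
    """A packet's match key is its flow identity (the stable five-tuple)
    plus fresh draws for each unstable, high-entropy field; a miss occurs
    the first time a key is seen, and rules never age out."""
    rng = random.Random(seed)
    cache = set()
    misses = packets = 0
    for flow in range(n_flows):
        for _ in range(pkts_per_flow):
            key = (flow,) + tuple(rng.randrange(entropy_values)
                                  for _ in range(unstable_fields))
            packets += 1
            if key not in cache:
                misses += 1
                cache.add(key)
    return misses / packets

stable_only = miss_rate(100, 50, unstable_fields=0, entropy_values=256)
one_unstable = miss_rate(100, 50, unstable_fields=1, entropy_values=256)
print(stable_only, one_unstable)  # one miss per flow vs. nearly one per packet
```

With only stable fields, exactly the first packet of each flow misses; adding a single high-entropy field (e.g., packet size) pushes most packets into the control plane, matching the trend in Figure 2.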
Rule Insertion Time.
Per-packet ML with caching systems would also suffer from high installation latencies for match-action rules, which grow with flow-table sizes [63]. Given a limit on table sizes, flow insertion completes in several milliseconds (e.g., 3 ms for TCAMs [18]). However, because per-packet ML would generate multiple decisions per flow, installation times would increase and interrupt each flow repeatedly. For packet-level decisions, frequent installations taking milliseconds would be prohibitive—to meet end-to-end Service Level Objectives (SLOs), switches must process packets in hundreds of nanoseconds.
Inference Compute Time.
Control-plane inference, even using ML accelerators, would increase latency; accelerators use batched processing and have software overheads. Table 2 benchmarks the latency for the anomaly-detection DNN [102] on an Intel Broadwell Xeon CPU running vectorized TensorFlow [3], an NVIDIA Tesla T4 GPU with ML-optimized Tensor cores [79], and a Google Cloud TPU v2-8 [58] for unbatched inference. This latency comes from hardware and software (e.g., TensorFlow [3]), which is necessary to set up these throughput-oriented devices; the lowest-latency design, a vectorized CPU, takes 0.67 ms.

Figure 3: The impact of control- and data-plane ML on flow-completion times in a minimally-loaded system.
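A back-of-the-envelope model using the latencies quoted in this section shows the same effect as Figure 3. It charges the control-plane path one rule insertion (roughly 3 ms [18]) plus one unbatched inference (0.67 ms, Table 2) per packet, and charges Taurus 178 ns; transmission and queueing delays are ignored, so the resulting ratio overshoots the paper's measured end-to-end 1500× figure.

```python
# Per-packet costs taken from this section; the flow length is illustrative.
RULE_INSERT_MS = 3.0      # TCAM rule installation [18]
INFERENCE_MS = 0.67       # fastest unbatched control-plane inference (Table 2)
TAURUS_NS = 178           # added data-plane latency per packet

def completion_time_ms(flow_len_pkts, per_pkt_ms):
    """Total per-packet processing cost over a flow (no queueing, no wire time)."""
    return flow_len_pkts * per_pkt_ms

flow_len = 1000
control_plane = completion_time_ms(flow_len, RULE_INSERT_MS + INFERENCE_MS)
data_plane = completion_time_ms(flow_len, TAURUS_NS * 1e-6)
print(control_plane / data_plane)  # a gap of three-plus orders of magnitude
```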
We now study the impact of caching control-plane ML decisions on end-to-end flow-completion times for the anomaly-detection DNN [102]. In our simulation, a host sends packets, drawn from an empirical flow distribution [105], to another host over a switch. In both schemes, the first packet of each flow is sent to the control plane for a forwarding decision. For data-plane ML, no further packets traverse the control plane, but the caching scheme must process virtually all packets in the control plane. The cache miss rate, rule-insertion time, and compute-inference time of a control-plane ML scheme increase the end-to-end completion time of long flows by 1500× (Figure 3). This simulation is run at near-zero load, so no delays occur due to queueing; as more flows are added, queues would build and the control-plane performance would decrease.

To achieve network flexibility and reactivity, we design Taurus to run line-rate inference entirely in the data plane, while training—a non-critical-path operation—remains in the control plane. This is similar to Software-Defined Networking (SDN): the control plane gathers a global view of the network and trains ML models to optimize QoS metrics, while the data plane uses these models to make line-rate, data-driven decisions. Unlike traditional SDN, the control plane now installs both weights and flow rules into switches (Figure 4). Weights are more space-efficient than flow rules: for example, matching the behavior of our anomaly-detection DNN would require 12 MB of flow rules (the full dataset), but only 5.6 kB of weights—a 2135× reduction in memory usage. Furthermore, using monitoring frameworks like Deep Insight [1], the control plane can collect fine-grained performance statistics and use them to identify the impact of ML decisions and optimize weights.

Figure 4: Training Taurus—hosts randomly mark packets to trace in the network, and traced switch decisions and measured QoS are used to update weights.

We now describe the logical components of the Taurus data-plane pipeline, as shown in Figure 5. As packets enter a switch, FSMs parse them into Packet Header Vectors (PHVs) [13], a fixed-layout, structured format. Then, switches use the match-action abstraction, looking up each header field in a table and performing a corresponding operation; we allocate several MATs for Taurus's preprocessing. Taurus then uses map-reduce to evaluate an ML model on the extracted features, and postprocessing MATs transform the model's output into a forwarding decision. Finally, the packet is scheduled based on the collective decisions of the match-action pipeline and map-reduce block.
Taurus preprocesses raw packet headers into a canonical form before inference: additional data may be added to augment the packet, and some fields may need repair to correct abnormal values. Furthermore, data preprocessing uses rules (implemented with MATs) to convert header fields into features for the ML model. For our anomaly-detection example, IP addresses could be matched against autonomous-system subnets and replaced with features indicating ownership or geographic location. The anomaly-detection network would then evaluate the relationships between numeric features to provide an anomaly score.

Taurus replaces categorical relationships with simpler numeric relationships using lookup tables; e.g., a table transforms port numbers into a linear likelihood value, which is easier to infer from [24]. Moreover, preprocessing can invert the probability distribution underlying a sampled value. Taking the logarithm of an exponentially distributed variable flattens its heavy tail into a far more uniform distribution, which an ML model can process with fewer layers [91]. Such feature engineering transfers load from an ML model to its designer; using better features increases models' accuracy without increasing their size [14, 91].

Figure 5: The logical steps for data-plane ML in Taurus: parse (interpret packet), preprocess (integrate and augment data), infer (matrix-vector multiply and nonlinear functions), postprocess (interpret prediction), and schedule (send to destination).

Lastly, In-Band Network Telemetry (INT)—local state embedded into packets—provides switches with a view of global network state [61], which they can process using MATs. Taurus devices are therefore not limited to inference using switches' local state: instead, models can use the packet's entire history (using INT) and the flow's entire history (using stateful registers), greatly increasing their predictive power.
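The log-transform trick above is easy to demonstrate. Strictly speaking, applying the CDF (1 − e^(−x/mean)) is what produces an exactly uniform output; the logarithm is the cheaper approximation, and it still removes most of the exponential's heavy-tailed skew:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # heavy-tailed raw feature

def skewness(v):
    """Sample skewness: third central moment over variance^(3/2)."""
    v = v - v.mean()
    return (v**3).mean() / (v**2).mean() ** 1.5

# The exponential's skewness is ~2; after the log transform the
# distribution is far more symmetric (lower absolute skewness).
print(abs(skewness(x)), abs(skewness(np.log(x))))
```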
MATs can also interpret ML decisions. For example, if our anomaly-detection model outputs 0.9, indicating a likely-anomalous packet, MATs decide how the packet should be handled: it can be dropped, flagged, or quarantined. In Taurus, these postprocessing MATs connect inference to scheduling, which uses an abstraction (e.g., PIFO [98]) to support a variety of scheduling algorithms.
For each packet, inference combines cleaned features and model weights to make a decision. Traditional ML algorithms, like Support Vector Machines (SVMs) and neural networks, use matrix-vector linear-algebra operations and element-wise non-linear operations [43, 53]. Non-linear operations let models learn non-linear semantics; otherwise, the output would be a linear combination of the inputs.

Unlike header processing, ML computation is very regular, using many multiply-add operations. In the more computationally-taxing linear portion of a single DNN neuron, input features are each multiplied by a weight, then added to yield a scalar value. Generalizing this operation, vector-to-vector (map) and vector-to-scalar (reduce) operations suffice for the computationally-intensive linear portions of a neuron. This motivates the need for a new data-plane abstraction, map-reduce, that is flexible enough to express a variety of ML models but specific enough to allow efficient hardware development.

Figure 6: The compute graph of a single perceptron, with the breakdown between map, reduce, and activation functions (outer-loop map) shown.
Our design uses map-reduce SIMD parallelism to provide high computational throughput cheaply. Map operations are element-wise vector operations, such as addition, multiplication, or non-linear operations. Reduce operations combine a vector of elements into a scalar value using associative operations like addition and multiplication. Figure 6 shows how map and reduce are used to compute a single neuron (dot product); such neurons can be combined into large neural networks. Map-reduce is a popular form for ML models: map-reduce can accelerate ML both in distributed systems [39] and at a finer granularity [21].

By supporting common primitives, we support a set of applications broader than ML, including Virtual Network Functions (VNFs) at the switch and NIC [85]. For example, Elastic RSS (eRSS) uses map-reduce for consistent hashing to schedule packets and cores: map is used to evaluate cores' suitability, and reduce selects the closest core [89]. Map-reduce also supports sketching algorithms, including Count-Min Sketches (CMS) [23] for flow-size estimation. Furthermore, recent research shows that Bloom filters can also benefit from, or be replaced by, neural networks [87].
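The single neuron of Figure 6 can be written with literal map and reduce primitives. The sketch below uses Python's built-ins, with a sigmoid activation and made-up weights rather than a trained model:

```python
from functools import reduce
import math

def perceptron(features, weights, bias):
    """One neuron as in Figure 6: an element-wise map (multiply each feature
    by its weight), an associative reduce (sum plus bias), and an activation."""
    products = list(map(lambda wx: wx[0] * wx[1], zip(weights, features)))  # map
    z = reduce(lambda a, b: a + b, products, bias)                          # reduce
    return 1.0 / (1.0 + math.exp(-z))                                       # activation

out = perceptron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2)
print(out)  # sigmoid(0.5*1 - 0.25*2 + 0.1*3 + 0.2) = sigmoid(0.5)
```

Because the reduce uses an associative operator, the hardware is free to evaluate it as a tree across SIMD lanes rather than sequentially.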
Integrating Map-Reduce with P4.
To program Taurus, we propose a dedicated P4 control block (like the ones used for checksums and egress computations [12]). P4 already expresses three logically-separate abstractions: parsing, match-action, and scheduling. By adding a fourth block programmed using a map-reduce abstraction (e.g., Spatial [62]), we extend SDN's flexibility for a new class of applications. The only additional primitives needed are arrays, map, and reduce (as well as loading weights from the control plane). Our proposed syntax is shown in Figure 7, which describes a single DNN layer. The outermost map iterates over all the layer's neurons, while the inner map-reduce pair performs the linear operation for each neuron. A final map instruction applies an activation function.

Control MapReduce(inout metadata FeatureSet,
                  inout metadata Output) {
  Weights = loadModelFromFile(Model.csv)
  LinearResults = Map(sizeOf(Weights[0])) { i =>
    Mult_Results = Map(sizeOf(Weights[1])) { j =>
      Weights[i,j] * FeatureSet[j]
    }
    Reduce(Mult_Results) { (x,y) => x + y }
  }
  Output = Map(sizeOf(LinearResults)) { k =>
    ReLU(LinearResults[k])
  }
}

Figure 7: Map-reduce syntax for a DNN layer based on Spatial [62].
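For readers more familiar with array notation, the Figure 7 layer collapses to a matrix-vector multiply followed by an element-wise ReLU. The NumPy rendering below uses placeholder weights standing in for Model.csv:

```python
import numpy as np

def dnn_layer(weights, features):
    """The Figure 7 computation: the outer map over neurons plus the inner
    map-reduce dot product is a matrix-vector multiply; the final map is ReLU."""
    linear = weights @ features
    return np.maximum(linear, 0.0)

W = np.array([[0.5, -1.0],   # placeholder weights (one row per neuron)
              [2.0,  0.25]])
x = np.array([1.0, 2.0])     # preprocessed feature vector
print(dnn_layer(W, x))       # → [0.  2.5]
```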
Map-reduce is general enough to support target-independent optimizations: optimizations that consider available execution resources (parallelization factors, bandwidth, and more) without considering hardware-specific design details. Parallelizing map-reduce programs by unrolling loops in space speeds up execution: if sufficient hardware resources are available, a model can have all map and reduce loops laid out spatially for maximum throughput. Because parallelization factors are compile-time constants, Taurus has deterministic throughput: a static profile of the whole network, accounting for any decreased throughput, can be created, allowing operators to easily analyze performance. This static line-rate reduction is not new: it occurs in RMT recirculation [13], link oversubscription [45, 78], and elsewhere.

As packet latencies in switches must be low (on the order of hundreds of nanoseconds), latency, not just area, limits switch-level neural networks. Latency increases with depth, so a switch-level ML accelerator can handle a limited number of layers; thus, datacenters' SLOs essentially force small models in switches, regardless of the resource constraint. By preprocessing features with MATs, we provide high performance with low latency: the model only has to learn relationships between features, not the mapping from header fields to features.
ML models only provide probabilistic guarantees; therefore, we must constrain their behavior with deterministic bounds to ensure robust network operation. In a Taurus system, the user specifies high-level safety (no incorrect behavior) and liveness (correct behavior happens eventually) properties to the control plane. The control plane then compiles these high-level constraints into per-switch constraints, which are used as part of post-processing. By constraining the ML model's decision boundary, we ensure correct network behavior without complicated model verification.
Starvation.
Congestion control is a promising feature for in-network ML. However, if an ML model were given free rein over per-flow scheduling decisions, it may (erroneously) decide that some flows should receive a zero or near-zero bandwidth allocation, effectively blocking them from the network. The simplest solution to starvation is guaranteeing each flow a fixed minimum bandwidth, but setting the wrong minimum could be problematic: too small, and flows may be starved; too large, and ML's optimization potential is limited. A better option is blending ML and an existing queueing algorithm, like earliest deadline first or least attained service, which are already supported by the PIFO scheduler [98]. By operating in a range set using heuristics, ML can optimize bandwidth while providing a reliable worst case from low loads to high loads.
Incorrect Decisions.
Anomaly detection using ML has a potentially catastrophic pathology: allowing an anomalous packet that compromises a system. Network operators currently define anomalous packets using Access Control Lists (ACLs), which explicitly specify forbidden packets; if ML were used to approximate an ACL, forbidden packets might be forwarded. Instead, the ACL can be used as a safety guarantee, in addition to labeling packets for ML training. Incoming packets first run through an ML model and are then compared against the ACL: they are considered anomalous if flagged by either, making the network more secure than using an ACL alone.
Oscillation.
A flow may frequently cross a model's decision boundary. For example, if ML is used to select between upstream ports for ECMP, a flow may be sent over several ports in quick succession, increasing the burden on end hosts to reorder packets. The simplest option is a timeout, which guarantees a minimum number of packets per decision and decreases flow breaks. Hysteresis is a better option: once the ML model has made a decision, the decision boundary is shifted slightly using post-processing to make that decision more likely. Then, if the flow's decision oscillates immaterially around the original decision boundary, the new decision boundary ensures that the switch's output never changes. However, if the ML model's output changes significantly, hysteresis lets the switch's output change immediately.
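The hysteresis described above amounts to a Schmitt trigger on the model's score; the threshold and margin values below are illustrative, not anything Taurus specifies.

```python
class HysteresisDecision:
    """Binary decision with a boundary shifted in favor of the last decision,
    so small oscillations around the nominal threshold do not flip the output."""

    def __init__(self, threshold=0.5, margin=0.1):
        self.threshold = threshold
        self.margin = margin
        self.decision = False

    def update(self, score):
        # Move the effective boundary away from the current decision.
        if self.decision:
            boundary = self.threshold - self.margin
        else:
            boundary = self.threshold + self.margin
        self.decision = score > boundary
        return self.decision

h = HysteresisDecision()
# Scores wobbling around 0.5 keep the decision stable; a large drop flips it.
outputs = [h.update(s) for s in [0.65, 0.55, 0.45, 0.55, 0.2]]
print(outputs)  # → [True, True, True, True, False]
```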
The complete physical data-plane pipeline of a Taurus de-vice is shown in Figure 8, consisting of blocks for packetparsing, ML with map-reduce, packet forwarding with MATs,and scheduling. Taurus’s packet parser, pre- and post-processing
Parse MAT
Map-Reduce
MAT Scheduler
Figure 8: Taurus’s modified data-plane pipeline.
CUMUCU MUCUMU CUMUCU MUCUMU CUMUCU
HeadersPHV In: FeaturesHeadersPHV Out: OutputNon-feature headersbypass map-reduce.
Figure 9: Taurus’s map-reduce block and its interfaceto the rest of the pipeline.
MATs, and scheduler use existing hardware implementa-tions [13, 37, 98]. We base Taurus’s map-reduce block onPlasticine, a Coarse-Grained Reconfigurable Array (CGRA)composed of a sea of compute and memory units, which arereconfigurable to match applications’ dataflow graphs [86].The fraction of the PHV containing features enters the map-reduce block, while other headers are bypassed directly tothe postprocessing MATs as shown in Figure 9.Each Compute Unit (CU, Figure 10) is composed of Func-tional Units (FUs) organized in lanes and stages and per-forms a map, a reduction, or both. Within a CU stage, alllanes execute the same instruction and read the same rela-tive location. CUs have pipeline registers between stages, soevery FU is active on every cycle; pipelining also occurs ata higher level between CUs. We use Memory Units (MUs),which are interspersed with CUs in a checkerboard patternfor locality, to store the weights of ML models (Figure 9).This also allows coarse-grained pipelining, where CUs per-form operations and MUs act as pipeline registers. However,as models in network applications have a low memory foot-print, the sizes of the MUs are negligible (less than 0.02%overhead for our largest application benchmark, §5). Mul-tiple levels of pipelining within each CU allow our designto run at a 1 GHz fixed clock—a crucial factor for matchingthe line rate of high-end switch hardware [13, 98]. By usingMATs (VLIW) for data cleaning and map-reduce (SIMD) forinference, Taurus uses different models of parallelism tobuild a fast and flexible data-plane pipeline. ane 0Lane 1Lane 2Lane 3 Stage 0 Stage 1 Stage 2 FU PRFU PRFU PRFU PR FU PRFU PRFU PRFU PR FU PRFU PRFU PR FU PR
Figure 10: A three-stage CU pipeline, composed of Functional Units (FUs) and Pipeline Registers (PRs). The third stage supports map and sparse reductions.
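To make the execution model concrete, the following sketch (ours, not the authors' implementation) models a single 16-lane CU computing a dot product: a map stage in which every lane executes the same multiply, followed by a reduction tree across the lanes.

```python
# Sketch of a CU's SIMD execution model (illustrative, not the Taurus RTL).
# A CU has LANES lanes and a few stages; each stage applies one instruction
# to all lanes, and a reduce stage folds the lanes into a single value.

LANES = 16

def cu_map_reduce(features, weights):
    """Dot product mapped onto one CU: stage 0 maps (multiply per lane),
    stage 1 reduces (tree of adds across lanes)."""
    assert len(features) == LANES and len(weights) == LANES
    # Stage 0 (map): all lanes execute the same multiply instruction.
    products = [f * w for f, w in zip(features, weights)]
    # Stage 1 (reduce): a log2(LANES)-deep adder tree across the lanes.
    vals = products
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

print(cu_map_reduce(list(range(16)), [1] * 16))  # → 120 (sum of 0..15)
```

The same nested structure repeats at a coarser grain: CUs act as pipeline stages, with MUs buffering intermediate values between them.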
Precision  Area (µm²)  Power (µW)  fix8  fix16  fix32
20 203 759
Table 3: Area and power scaling (per-FU) at the target design (16 lanes, 2 stages) for different precisions. Scaling is shown relative to the 8-bit design.
Target-Dependent Compilation.
A variety of programming languages natively support map-reduce [52, 74, 80, 106]. To support our Plasticine-based fabric, we implement Taurus with Spatial, a map-reduce DSL for fast and efficient hardware [62]. Spatial supports target-dependent optimizations for Plasticine as well as target-independent optimizations (discussed in §3.3.2). In Spatial, map-reduce patterns are represented as nested loops and use per-loop controllers to sequence execution. Programs are compiled to a streaming dataflow graph from this hierarchy: innermost loops become SIMD operations within a CU, and outer loops are mapped over multiple CUs. Then, overly-large patterns (those requiring too many compute stages, inputs, or memory banks) are split into smaller patterns that fit in CUs and MUs; this is necessary to map some activation functions with long basic blocks. Finally, the resulting graph is placed and routed on the map-reduce block's interconnect.
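Spatial programs are written in a Scala-embedded DSL, so the snippet below is only a Python analogy (names and shapes are ours) of the nested-loop structure the compiler lowers: the inner reduce becomes SIMD lanes plus an adder tree inside one CU, while the outer map is spread across CUs.

```python
# Python model of the nested map-reduce loop structure that Spatial compiles
# (the real Spatial programs are a Scala-embedded DSL; this is an analogy).

def perceptron_layer(x, W, b):
    # Outer loop (map over output neurons): parallelized across CUs.
    out = []
    for j in range(len(W)):
        # Inner loop (reduce): becomes a SIMD multiply plus an adder tree
        # within a single CU.
        acc = sum(x[i] * W[j][i] for i in range(len(x)))
        out.append(acc + b[j])
    return out

# Two neurons, three inputs.
print(perceptron_layer([1, 2, 3], [[1, 0, 1], [2, 2, 2]], [0, 1]))  # → [4, 13]
```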
We first justify our map-reduce block's configuration by analyzing its power and area overheads. Next, we evaluate Taurus's performance by running several recently proposed networking ML applications [71, 102, 113, 115]. Finally, we demonstrate Taurus's flexibility by evaluating common ML components, which can be composed to express a variety of ML algorithms.
Figure 11: Area and power consumption per FU for various CU configurations (lanes and stages). (a) Area. (b) Power (10% switching).
Taurus's map-reduce block is parameterized by precision, lane count, and stage count; these parameters are selected to optimize line-rate inference. To quantify Taurus's area, we use ASIC synthesis with a 28 nm standard-cell library [40]. We first study the impact of arithmetic precisions ranging from 8 to 32 bits on area and power; as floating-point support is expensive and nonessential for inference, we restrict Taurus to fixed-point arithmetic. We investigate differing lane (4–32) and stage (2–6) counts and determine that an 8-bit data path with 16 lanes and 2 stages is the ideal configuration to support today's network-inference applications.
For ML inference, fixed-point arithmetic is faster than floating point with equivalent accuracy [49, 58]. We believe that 8-bit precision suffices for ML (compressed models use even fewer bits [110]) and use 8-bit precision for Taurus; however, several industrial designs, such as Google's TPU, use 16-bit data paths [58]. We therefore evaluate alternate designs with greater precision and show that precision has a roughly linear cost: going from 8-bit to 16-bit data widths corresponds to a proportional (2×) increase in area and power (Table 3).

As CU lane and stage counts increase, the number of FUs, and therefore area, will increase; however, if we were to simply add CUs, area would also increase. Therefore, we normalize CU area and power by FU count to investigate the relative efficiency of different CU designs. Figure 11 shows that per-FU area and dynamic power decrease with lane count, because adding lanes or stages decreases the amount of control logic and overhead per FU. However, small models cannot be efficiently mapped to large CUs: if less application-level parallelism is available than there are CU lanes, some lanes will be unused. Likewise, stages in a CU beyond those needed for a basic block will also be unused: each basic block has its own controller, and the CU only has hardware support for one control hierarchy.

Figure 12: Area needed for activation functions as the number of stages varies from 2 to 6. All functions run at line rate, except sigmoid and tanh, which operate at half the speed.

The anomaly-detection DNN is our largest model requiring line-rate operation, so we use it to set the ideal lane count. The DNN's largest layer has 12 hidden units, so the largest dot-product calculations involve 12 elements; the 16-lane configuration fully unrolls the dot product within a single CU while minimizing underutilization.
Currently, the 16-lane configuration balances area overhead, power, and mapping efficiency, but optimal lane counts may change as data-plane ML models evolve. Because map-reduce programs are hardware agnostic, programs can run on new configurations unmodified; the compiler will handle the differences in unrolling factors as needed.
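The 8-bit fixed-point data path can be illustrated with a small sketch; the paper fixes only the 8-bit width, so the Q-format (6 fractional bits) and saturation behavior below are our assumptions.

```python
# Illustrative 8-bit fixed-point arithmetic (the format is our assumption;
# the paper only states an 8-bit fixed-point data path). Here: signed fix8
# with 6 fractional bits, i.e., value = int8 / 64.

FRAC_BITS = 6
SCALE = 1 << FRAC_BITS

def to_fix8(x):
    v = int(round(x * SCALE))
    return max(-128, min(127, v))        # saturate to the int8 range

def fix8_mul(a, b):
    return max(-128, min(127, (a * b) >> FRAC_BITS))  # rescale the product

a, b = to_fix8(0.5), to_fix8(-1.25)
print(fix8_mul(a, b) / SCALE)  # → -0.625
```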
We perform a scaling study to quantify the impact of CU stage counts on area (Figure 12). For this study, we use activation functions, as they have the deepest compute graphs; we sweep the CU stage count and report the area of the smallest array that maps each function. For Taylor-series approximations (SigmoidExp and TanhExp), stages added to CUs are used to map computation, but overall area remains flat: adding stages is roughly equivalent to adding CUs. Furthermore, for activation functions with shallow compute graphs (e.g., ReLU),
App      Model   Perf. (GPkt/s)   Lat. (ns)   Area (mm²)   Area (+%)   Power (mW)   Power (+%)
IoT      KMeans  1.00             76          2.48         3.3         142          0.56
Anomaly  SVM     1.00             68          4.59         6.1         263          1.1
Anomaly  DNN     1.00             188         8.80         11.7        506          2.0
Indigo   LSTM    0.08             380         17.73        23.6        1018         4.1
Table 4: Performance and resource overheads of several application models. Overheads are calculated relative to a 300 mm² chip with 4 reconfigurable pipelines [47], each drawing an estimated 25 W.

adding stages decreases efficiency: the later stages are not mapped. Dot products require only two stages: one for the map/multiply and one for the reduce/add. As theoretical area and energy efficiency increase with stage count, we want to increase the stage count for better efficiency. However, more stages are not useful for functions like LUTs, ReLU, and linear algebra, so we use two stages. We end up with a CU configuration that has 16 lanes, 2 stages, and an 8-bit fixed-point data path; each CU takes 0.124 mm², with a single FU taking 3877 µm². Our Taurus parameters are based on the applications and functions in use today: as ML for networking grows, we may need to revisit these parameters. Regardless, our parameterized design shows that map-reduce can be supported with a small amount of additional hardware.

We evaluate Taurus using four ML models: an IoT traffic-classification model [113], two anomaly-detection models [71, 102], and a model that learns congestion-control windows [115]. The IoT traffic classification implements KMeans clustering, using 11 packet- and flow-level features, to classify IoT traffic into five categories. The first anomaly-detection algorithm is an SVM [71] that uses offline dimensionality reduction to select eight key features of the 41 in the KDD intrusion-detection data set [4, 26]. The SVM uses a radial basis function to model nonlinear relationships. Our second anomaly-detection algorithm is a DNN that takes six input features (also a subset of KDD features) and produces two outputs: the probability of a malicious packet and the probability of a safe packet. The DNN has three intermediate layers with 12, 6, and 3 hidden units, respectively [102]. Finally, the online congestion-control algorithm (Indigo [115]) is an LSTM-based network.
Indigo uses 32 LSTM units followed by a softmax layer and is designed to run at an endpoint.

Table 4 shows the performance, area overheads, and power requirements of our benchmarks on Taurus, compared against a 300 mm² [37] programmable switch chip with an RMT-based pipeline [13]. By mapping traffic classification and anomaly detection to Taurus, we show that real models can run at line rate in switches. Both anomaly-detection applications learn to detect malicious packets with accuracies better than non-ML solutions when running on Taurus, with each using a different ML algorithm. The ability to run multiple ML models for one problem shows Taurus's generality: after network-specific pre/postprocessing, any map-reduce model can be used, allowing network operators to select an optimal model. With the congestion-control model, we investigate a neural network running at short intervals, instead of per packet; operating at only a small fraction of line rate still yields major improvements over the Indigo software. We examine overall area and power with respect to an existing programmable switch ASIC to show the additional cost of implementation. Table 4 reports the area of only the CUs needed to implement each operation; therefore, the actual area of a prototype for these benchmarks is the area of the largest benchmark, with unused CUs disabled for smaller benchmarks. Simple models, like SVM-based anomaly detection, have as little as 6.1% area overhead and 1.1% power overhead. Indigo, our largest model, consumes an additional 23.6% area and 4.1% power because it is not fully unrolled. Therefore, we choose Taurus's map-reduce block area as 17.73 mm². If switch designers choose to only support smaller models, KMeans, SVMs, and DNNs add only 12% more area and 2% more power. KMeans, the SVM, and the DNN process one packet's headers per cycle (line rate); they do not affect throughput, and latency remains in the nanosecond range (Table 4).
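Table 4's relative overheads can be reproduced, to within rounding, from the baseline in its caption: each of the four reconfigurable pipelines occupies 75 mm² (300 mm² / 4) and draws an estimated 25 W, and latency overheads assume a 1 µs switch transit. A quick check in Python:

```python
# Reproducing Table 4's overhead percentages (a back-of-the-envelope check).
PIPE_AREA_MM2 = 300 / 4      # 300 mm2 chip, 4 reconfigurable pipelines
PIPE_POWER_MW = 25_000       # estimated 25 W per pipeline
SWITCH_LAT_NS = 1_000        # assumed datacenter switch latency (1 us)

models = {          # name: (latency ns, area mm2, power mW), from Table 4
    "KMeans": (76, 2.48, 142),
    "SVM":    (68, 4.59, 263),
    "DNN":    (188, 8.80, 506),
    "LSTM":   (380, 17.73, 1018),
}
for name, (lat, area, power) in models.items():
    print(f"{name}: area +{100 * area / PIPE_AREA_MM2:.1f}%, "
          f"power +{100 * power / PIPE_POWER_MW:.2f}%, "
          f"latency +{100 * lat / SWITCH_LAT_NS:.1f}%")
```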
Assuming a datacenter switch latency of 1 µs [28], KMeans, the SVM, and the DNN add 7.6%, 6.8%, and 18.8% more latency, respectively. We also use Indigo to estimate the performance of models doing periodic, not per-packet, control updates within a network; these models provide more detailed updates for real-time events, like link congestion. In software, the Indigo LSTM network significantly improves application-level throughput and latency [115], operating in 10 ms intervals, likely slowed by the LSTM's computational requirements. In Taurus, Indigo can produce a decision every 12.5 ns, with each step taking 380 ns: this allows the LSTM network to react more quickly to changes in load and better control tail latency. Overall, Taurus can run per-packet models with minimal performance impact and allows periodic models to make decisions orders of magnitude faster than software.

µbmark      Area (mm²)   Lat. (ns)
Linear
Conv1D      4.93         47
Percept     0.78         16
SVMLin      1.82         30
LSTMLin     2.34         29
GRULin      2.34         29
Nonlinear
LeakyReLU   0.78         21
ReLU        0.52         20
SigmoidLUT  0.52         27
TanhLUT     0.52         27
Table 5: Area and latency of each microbenchmark, running at line rate in a 16-lane, 2-stage CU.
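Table 5's latencies are consistent with a simple cycle model (our reconstruction from the description in this section): at 1 GHz, one map cycle plus a four-level reduction tree gives the five-cycle map-reduce minimum, and each data movement adds roughly five more cycles.

```python
import math

# Back-of-the-envelope latency model for a single-CU map-reduce at 1 GHz
# (1 cycle = 1 ns). The constants follow the paper's description; the exact
# per-kernel breakdown is our reconstruction.
LANES = 16
MAP_CYCLES = 1
REDUCE_CYCLES = int(math.log2(LANES))        # 4-level adder tree for 16 lanes
MOVE_CYCLES = 5                              # ~5 cycles per data movement

compute = MAP_CYCLES + REDUCE_CYCLES         # 5-cycle map-reduce minimum
total = MOVE_CYCLES + compute + MOVE_CYCLES  # input -> CU -> output
print(compute, total)  # → 5 15
```

The 15 ns estimate is close to the perceptron's measured 16 ns; kernels that spill across CUs pay additional inter-CU hops.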
Figure 13: A small DNN, broken down into independent microbenchmarks.
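Because Taurus is spatially reconfigurable, a composed model's area is simply the sum of its kernels' areas from Table 5. The decomposition below is illustrative; the exact layer counts of Figure 13's DNN are our assumption.

```python
# Estimating a composed model's area from Table 5's per-kernel areas (mm^2).
AREA = {"Percept": 0.78, "ReLU": 0.52, "SigmoidLUT": 0.52}

def composed_area(parts):
    """Taurus is spatially reconfigurable, so a model's area is the
    sum of its constituent kernels' areas."""
    return sum(AREA[p] for p in parts)

# A small DNN in the spirit of Figure 13: two perceptron+ReLU layers
# followed by a perceptron with a sigmoid output (layer counts are ours).
dnn = ["Percept", "ReLU", "Percept", "ReLU", "Percept", "SigmoidLUT"]
print(f"{composed_area(dnn):.2f} mm^2")  # → 3.90 mm^2
```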
Finally, we evaluate Taurus on a variety of microbenchmarks to investigate the key hardware features driving application performance. Smaller dataflow programs can be composed into a single, large program: for example, Figure 13 shows a DNN built from several perceptron layers fused with nonlinear activation functions. Taurus is spatially reconfigurable; hence, the area overhead of any model is the sum of its constituent parts, and these parts define the hardware needed to implement the model. By evaluating these building blocks of ML applications, we provide general results that can be adapted to a variety of design points.

We divide microbenchmarks into two categories, linear and nonlinear functions, which play different roles in a model and have different implementation characteristics. Linear functions are notable because they are not perfectly parallel: they include a reduction network that limits the degree of communication-free parallelism. Conversely, nonlinear functions can be perfectly SIMD-parallelized because there is no interaction between adjacent data elements. For example, if the output of 16 different perceptrons is input to a ReLU, we simply map the ReLU over the 16 outputs, which are then computed in parallel. If fully unrolled, the latency of this operation is the sum of the perceptron and ReLU execution time.

µbmark      Unroll   Line Rate   Area (mm²)
Conv1D
SVMLin
Perceptron – 1 0.78
Table 6: Throughput and area scaling of microbenchmarks with unrolling factors from 1 to 8.
Linear Operations.
Our primary linear microbenchmarks are a one-dimensional convolution, a perceptron kernel, and a linear SVM. We also evaluate the linear components of LSTM and GRU cells, which have an underlying computation similar to the perceptron. The convolution kernel captures position-invariant features and is frequently used to find spatial or temporal correlations [64]. Table 5 shows the area required for each microbenchmark when unrolled to run at line rate. Because the convolution does not map well to vectorized map-reduce (there are multiple small inner reductions), it requires 8× unrolling and much chip area. However, the SVM and perceptron run at line rate with less than 2 mm² of additional chip area; they can be efficiently composed into high-performance deep neural networks.

The latencies imposed by each microbenchmark are also shown in Table 5. The convolution and SVM kernels have the highest latencies: each has a small loop that is unrolled across CUs. This adds another stage of inter-CU communication, and therefore latency; because the perceptron runs entirely within a single CU, it has the lowest latency. The minimum latency for a 16-lane CU to perform a map-reduce is five cycles: one cycle for the map and four cycles for the reduce, using different fractions of a single stage for each reduction cycle (Figure 10). The remaining latency comes from data movement from the input to the CU and then to the output; Taurus takes roughly five cycles for each data movement, a result of its spatially distributed dataflow layout.
Unrolling.
Table 6 shows the area and throughput impact of outer-loop unrolling on a selection of linear microbenchmarks. Not all benchmarks can have their outer loop unrolled: for example, our untiled perceptron has no outer loop. The iterative (i.e., loop-based) versions of the SVM and the convolution run at one-half and one-eighth of line rate, respectively; this corresponds to two and eight loop iterations per packet. By unrolling the SVM twice, throughput improves to line rate at the cost of a 40% area increase; however, unrolling the convolution to meet line rate results in a 6.3× area increase. Using map-reduce's target-independent optimization, large ML models can loop, thus running over multiple cycles with a corresponding reduction in the number of packets forwarded per second.
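The throughput numbers in Table 6 follow a simple relation: a kernel needing k loop iterations per packet runs at 1/k of line rate, and unrolling by a factor u cuts the iteration count to ceil(k/u) at a roughly proportional area cost. A sketch (ours):

```python
import math

# Throughput vs. unrolling (illustrative). A kernel needing k loop
# iterations per packet runs at 1/k of line rate; unrolling by u cuts
# the count to ceil(k/u) at the cost of roughly u times the loop's area.

def rate_fraction(iters_per_packet, unroll=1):
    return 1.0 / math.ceil(iters_per_packet / unroll)

print(rate_fraction(2))      # SVM, iterative: 0.5 (half line rate)
print(rate_fraction(2, 2))   # SVM, unrolled 2x: 1.0 (line rate)
print(rate_fraction(8))      # Conv1D, iterative: 0.125 (one-eighth)
print(rate_fraction(8, 8))   # Conv1D, unrolled 8x: 1.0
```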
Nonlinear Operations.
Activation functions are necessary to learn nonlinear behavior; otherwise, the entire neural network would collapse into a single linear function. Different activation functions are used for different purposes: tanh is used in LSTMs to implement gating [54], while DNNs use ReLU and Leaky ReLU, which are easier to implement [77]. The area and latency results for nonlinear operations are also shown in Table 5. The most efficient functions, including ReLU and Leaky ReLU, take under 1 mm²; they do not use lookup tables. More complicated functions, including sigmoid and tanh, have several versions: Taylor series, piecewise approximations, and lookup tables [49, 111]. Taylor-series and piecewise approximations require 2–5 times as many resources as other activation functions (Figure 12). LUT-based functions need memory, but only a small amount: each table stores 1024 8-bit entries; even when replicated for parallel lookups, this is a trivial fraction of a switch chip's total memory. Therefore, we present microbenchmark results for LUT-based sigmoid and tanh. Latencies of nonlinear kernels are lower than those of linear kernels because there is no inter-CU communication generated by loop unrolling.

In this paper, we show that Taurus enables inference for per-packet ML algorithms in the data plane; with it, a wide variety of network ML research directions become available.
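A LUT-based activation is cheap because the table is tiny: 1024 8-bit entries is 1 KB, even before replication. The sketch below builds such a sigmoid table; the input range of [-8, 8) and the rounding scheme are our assumptions, as the paper specifies only the table geometry.

```python
import math

# Building a 1024-entry, 8-bit sigmoid lookup table. The input range
# [-8, 8) and the rounding scheme are our assumptions; the paper states
# only the table geometry (1024 entries of 8 bits each).
ENTRIES = 1024

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

lut = [int(sigmoid(-8 + 16 * i / ENTRIES) * 255 + 0.5) for i in range(ENTRIES)]

print(len(lut), lut[0], lut[ENTRIES // 2], lut[-1])  # → 1024 0 128 255
```

At inference time, a quantized input simply indexes the table, replacing the exponential with a single memory read.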
Dimensionality Reduction for Data Augmentation.
Data augmentation, joining input data with statically known relationships to aid ML, becomes challenging as the number of new features grows. For example, a network operator using IP-correlated data to precisely model a packet's source may add dozens of derived features. Storing these in MATs would be too expensive; however, dimensionality reduction can reduce feature counts while maintaining the underlying information [108].

The benefits of dimensionality reduction are twofold: it reduces the amount of preprocessing data and decreases model sizes. However, dimensionality reduction cannot replace ML, due to the cross-product explosion: multiple fields could be reduced at once, but because of exact/wildcard matching (binary or ternary), flow-table sizes grow exponentially with the number of fields. Therefore, in Taurus, dimensionality reduction is best suited to providing additional information about input fields (e.g., IP address or port number), while ML identifies relationships between fields.

Shrinking Models.
A major Taurus application will be network control and coordination. Neural networks can solve a variety of control problems [9, 99, 107] and are getting smaller. For example, structured control nets [99] for nonlinear control perform almost as well as 512-neuron DNNs using as few as four neurons per layer. With such small networks, Taurus can run multiple models simultaneously (e.g., one model for intrusion detection and another for traffic optimization). In addition, techniques like quantization, pruning, and distillation can further reduce a model's size [8, 51, 57, 109].
Learned Traffic Management.
For Taurus to run ML models accurately, training models properly is paramount. Simultaneous training for learned congestion control lets devices make decisions using knowledge of other devices' policies [112]. Using data-plane ML models, we can force all data-plane functions to take a global view. For example, a learned scheduling algorithm could be bootstrapped using a simulated (or emulated) data center (like CrystalNet [66]): realistic traffic would drive the simulation, while switch weights are trained to route more optimally and improve throughput. Further improvements would occur online, using sampled traces from switches to gradually adapt to changes in traffic. Effective network training must optimize globally: if devices are trained in isolation, they will behave greedily, lowering efficiency and quality of experience.
Architectures for ML.
While Taurus uses Plasticine, a vectorized CGRA [86], as the basis of its map-reduce block, it could feasibly be implemented with other fabrics. The most widely available reconfigurable architectures are Field-Programmable Gate Arrays (FPGAs), which are used as both custom accelerators [93] and prototyping tools (e.g., the NetFPGA [2, 69]). However, FPGAs' on-chip interconnects consume up to 70% of total chip power [16], and their variable, slow clock frequencies complicate interfacing and operating at network switch speeds (multiple terabits per second). CGRAs are optimized for arithmetic to better support ML and typically have a fast, fixed clock frequency that allows seamless integration between the map-reduce block and MATs in Taurus [25, 35, 41, 70, 72, 96]. Other architectures, like Eyeriss [19], Brainwave [34], and EIE [50], achieve high efficiency by focusing on specific algorithms. These could be used for in-switch ML but are too rigid: if a specific accelerator were standardized, networks would be unable to benefit from future ML research due to the lack of a flexible abstraction (like map-reduce).
ML For Networking.
Many networking applications can benefit from ML. For example, learned algorithms for congestion control [112, 115] have been shown to outperform their human-designed counterparts [27, 48, 116]. In addition, Boutaba et al. [15] identify ML use cases for network tasks such as traffic classification [29, 31], traffic prediction [17, 20], active queue management [100, 121], and security [83]. All of these applications could immediately be deployed in networks today using Taurus.
Networking For ML.
Specialized networks can also accelerate ML algorithms themselves. With minor enhancements to modern data-plane hardware, switches can aggregate gradients in-network, accelerating training by up to 300% [90]. Gaia, a system for distributed ML [55], also accounts for wide-area network bandwidth and regulates the movement of gradients during the training process. While Taurus is not explicitly designed to accelerate distributed training, map-reduce supports aggregating numeric weights contained in packets more efficiently than MATs.
Self-driving networks, which make observations about their performance and improve themselves, would increase efficiency and users' quality of experience in modern and future data centers; but neither the programming abstraction nor the hardware exists today to realize such a network. Bridging this gap, Taurus is the equivalent of adaptive cruise control: automatically adjusting parameters in response to changing network conditions. We demonstrate that Taurus operates at line rate and adds minimal overhead to a programmable switch pipeline (e.g., RMT), 24% more area and 178 ns of latency on average, while accelerating several recently proposed networking benchmarks. Taurus replaces data-plane heuristics with learned functions and can interoperate with existing data-plane devices. Given a mixture of Taurus and traditional networking hardware (e.g., using only Taurus NICs or ToR switches), Taurus's ML models will make optimal decisions accounting for existing heuristics.

We hope that Taurus will eventually enable full network automation, beyond just performance tuning and learned network security. Operators could use a fully autonomous ML model for packet forwarding, with tight bounds on its output, like training wheels. The bounds would allow the autonomous model to make decisions and serve as the initial labeling function for bad decisions. As the model becomes more reliable, the bounds could be relaxed until ML is making virtually all packet-forwarding decisions. To build a self-driving network, hardware must be deployed before large-scale training can begin: Taurus gives a foothold for in-network ML with hardware that can be installed, and improve performance and security, in next-generation data planes.

REFERENCES
USENIX OSDI (2016).[4] Aggarwal, P., and Sharma, S. K. Analysis of KDD DatasetAttributes-Class Wise For Intrusion Detection.
Computer Science 57 (2015), 842–851.[5] Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R.,Chu, K., Fingerhut, A., Lam, V. T., Matus, F., Pan, R., Yadav, N.,and Varghese, G. CONGA: Distributed Congestion-aware Load Bal-ancing for Datacenters. In
ACM SIGCOMM (2014).[6] Alshammari, R., and Zincir-Heywood, A. N. Machine LearningBased Encrypted Traffic Classification: Identifying SSH and Skype.In
IEEE CISDA (2009).[7] Auld, T., Moore, A. W., and Gull, S. F. Bayesian Neural NetworksFor Internet Traffic Classification.
IEEE Transactions on Neural Net-works 18 , 1 (2007), 223–239.[8] Ba, J., and Caruana, R. Do Deep Nets Really Need to be Deep? In
NeurIPS (2014).[9] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. TheArcade Learning Environment: An Evaluation Platform for GeneralAgents.
Journal of Artificial Intelligence Research (JAIR) 47 (2013),253–279.[10] Benson, T., Akella, A., and Maltz, D. A. Network Traffic Charac-teristics of Data Centers in the Wild. In
ACM IMC (2010).[11] Bernaille, L., Teixeira, R., Akodkenou, I., Soule, A., and Salama-tian, K. Traffic Classification on the Fly.
ACM SIGCOMM ComputerCommunication Review (CCR) 36 , 2 (2006), 23–26.[12] Bosshart, P., Daly, D., Gibb, G., Izzard, M., McKeown, N., Rex-ford, J., Schlesinger, C., Talayco, D., Vahdat, A., Varghese, G.,et al. P4: Programming protocol-independent packet processors.
ACM SIGCOMM Computer Communication Review 44 , 3 (2014), 87–95.[13] Bosshart, P., Gibb, G., Kim, H.-S., Varghese, G., McKeown, N., Iz-zard, M., Mujica, F., and Horowitz, M. Forwarding Metamorpho-sis: Fast Programmable Match-Action Processing in Hardware forSDN. In
ACM SIGCOMM
Journal of Internet Services and Applica-tions (JISA) 9 , 1 (2018), 16.[16] Calhoun, B. H., Ryan, J. F., Khanna, S., Putic, M., and Lach, J.Flexible Circuits and Architectures for Ultralow Power.
Proceedingsof the IEEE 98 , 2 (2010), 267–282.[17] Chabaa, S., Zeroual, A., and Antari, J. Identification and Predic-tion of Internet Traffic Using Artificial Neural Networks.
Journalof Intelligent Learning Systems and Applications (JILSA) 2 , 03 (2010),147.[18] Chen, H., and Benson, T. The Case for Making Tight Control PlaneLatency Guarantees in SDN Switches. In
ACM SOSR (2017).[19] Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks.
IEEE Journal of Solid-State Circuits 52 , 1 (2016), 127–138.[20] Chen, Z., Wen, J., and Geng, Y. Predicting Future Traffic UsingHidden Markov Models. In
IEEE ICNP (2016).[21] Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Olukotun, K.,and Ng, A. Y. Map-Reduce for Machine Learning on Multicore. In
NeurIPS (2007), pp. 281–288.[22] Cisco Systems, I. Cisco Meraki (MX450): Powerful Security and SD-WAN for the Branch & Campus. https://meraki.cisco.com/products/appliances/mx450. Accessed on 02/07/2020.[23] Cormode, G., and Muthukrishnan, S. An Improved Data StreamSummary: The Count-Min Sketch and its Applications.
Journal ofAlgorithms 55 , 1 (2005), 58–75.[24] Covington, P., Adams, J., and Sargin, E. Deep Neural Networksfor Youtube Recommendations. In
ACM RecSys (2016).[25] Cronqist, D. C., Fisher, C., Figueroa, M., Franklin, P., and Ebel-ing, C. Architecture Design of Reconfigurable Pipelined Datapaths.In
IEEE ARVLSI (1999).[26] Dhanabal, L., and Shantharajah, S. A Study on NSL-KDDDataset for Intrusion Detection System Based on Classification Al-gorithms.
International Journal of Advanced Research in Computerand Communication Engineering (IJARCCE) 4 , 6 (2015), 446–452.[27] Dong, M., Li, Q., Zarchy, D., Godfrey, P. B., and Schapira, M.PCC: Re-architecting Congestion Control for Consistent High Per-formance. In
USENIX NSDI (2015).[28] EMC, D. Data Center Switching Quick Reference Guide.https://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-Networking-Data-Center-Quick-Reference-Guide.pdf. Accessed on 02/07/2020.[29] Erman, J., Arlitt, M., and Mahanti, A. Traffic Classification UsingClustering Algorithms. In
ACM MineNet (2006).[30] Erman, J., Mahanti, A., and Arlitt, M. Qrp05-4: Internet TrafficIdentification Using Machine Learning. In
IEEE Globecom (2006).[31] Erman, J., Mahanti, A., Arlitt, M., and Williamson, C. Identi-fying and Discriminating Between Web and Peer-to-Peer Traffic inthe Network Core. In
WWW (2007).[32] Este, A., Gringoli, F., and Salgarelli, L. Support Vector MachinesFor TCP Traffic Classification.
Computer Networks 53 , 14 (2009),2476–2490.[33] Firestone, D., Putnam, A., Mundkur, S., Chiou, D., Dabagh, A.,Andrewartha, M., Angepat, H., Bhanu, V., Caulfield, A., Chung,E., et al. Azure Accelerated Networking: SmartNICs in the PublicCloud. In
USENIX NSDI (2018).[34] Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu,M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al.A Configurable Cloud-Scale DNN Processor for Real-Time AI. In
IEEE ISCA (2018).[35] Gao, M., and Kozyrakis, C. HRL: Efficient and Flexible Reconfig-urable Logic for Near-Data Processing. In
IEEE HPCA (2016).[36] Geng, Y., Liu, S., Yin, Z., Naik, A., Prabhakar, B., Rosenblum, M.,and Vahdat, A. SIMON: A Simple and Scalable Method for Sensing,Inference and Measurement in Data Center Networks. In
USENIXNSDI (2019).[37] Gibb, G., Varghese, G., Horowitz, M., and McKeown, N. Designprinciples for packet parsers. In
ACM/IEE ANCS (2013).[38] Gill, P., Jain, N., and Nagappan, N. Understanding Network Fail-ures in Data Centers: Measurement, Analysis, and Implications. In
ACM SIGCOMM (2011).[39] Gillick, D., Faria, A., and DeNero, J. Mapreduce: Distributed Com-puting for Machine Learning.
Berkley, Dec 18 (2006).[40] Goldman, R., Bartleson, K., Wood, T., Kranen, K., Melikyan, V.,and Babayan, E. 32/28nm Educational Design Kit: Capabilities, De-ployment and Future. In
IEEE PrimeAsia (2013).
[41] Goldstein, S. C., Schmit, H., Budiu, M., Cadambi, S., Moe, M., and Taylor, R. R. PipeRench: A Reconfigurable Architecture and Compiler.
Computer 33 , 4 (2000), 70–77.[42] Gont, F., and Yourtchenko, A. On the Implementation of the TCPUrgent Mechanism. RFC 6093, Jan. 2011.[43] Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y.
DeepLearning , vol. 1. MIT Press, Cambridge, 2016.[44] Govindaraju, V., Ho, C.-H., Nowatzki, T., Chhugani, J., Satish,N., Sankaralingam, K., and Kim, C. Dyser: Unifying Functionalityand Parallelism Specialization for Energy-Efficient Computing.
IEEEMicro 32 , 5 (2012), 38–51.[45] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C.,Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A Scal-able and Flexible Data Center Network. In
ACM SIGCOMM (2009).[46] Guo, C., Yuan, L., Xiang, D., Dang, Y., Huang, R., Maltz, D., Liu,Z., Wang, V., Pang, B., Chen, H., et al. Pingmesh: A Large-ScaleSystem for Data Center Network Latency Measurement and Analy-sis. In
ACM SIGCOMM (2015).[47] Gurevich, V. Programmable Data Plane at Terabit Speeds, May 2017.Accessed on 02/07/2020.[48] Ha, S., Rhee, I., and Xu, L. CUBIC: A New TCP-Friendly High-SpeedTCP Variant.
ACM SIGOPS Operating Systems Review 42 , 5 (2008), 64–74.[49] Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao,S., Wang, Y., et al. ESE: Efficient Speech Recognition Engine withSparse LSTM on FPGA. In
ACM/SIGDA FPGA (2017).[50] Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., andDally, W. J. EIE: Efficient Inference Engine on Compressed DeepNeural Network. In
ACM/IEEE ISCA (2016).[51] Han, S., Pool, J., Tran, J., and Dally, W. Learning Both Weightsand Connections for Efficient Neural Network. In
NeurIPS (2015).[52] Harper, R., MacQueen, D., and Milner, R.
Standard ML . Depart-ment of Computer Science, University of Edinburgh, 1986.[53] Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., and Scholkopf,B. Support Vector Machines (SVMs).
IEEE Intelligent Systems andtheir Applications 13 , 4 (1998), 18–28.[54] Hochreiter, S., and Schmidhuber, J. Long Short-Term Memory(LSTM).
Neural Computation 9 , 8 (1997), 1735–1780.[55] Hsieh, K., Harlap, A., Vijaykumar, N., Konomis, D., Ganger, G. R.,Gibbons, P. B., and Mutlu, O. Gaia: Geo-Distributed MachineLearning Approaching LAN Speeds. In
USENIX NSDI (2017).[56] Ingre, B., and Yadav, A. Performance Analysis of NSL-KDD Dataset Using ANN. In (2015), IEEE, pp. 92–96.[57] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In
IEEE CVPR (2018).[58] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G.,Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In
IEEE ISCA (2017).[59] Jurkiewicz, P., Rzym, G., and Boryło, P. How Many Mice Make anElephant? Modelling Flow Length and Size Distribution of InternetTraffic. arXiv:1809.03486 (2019).[60] Katta, N., Hira, M., Kim, C., Sivaraman, A., and Rexford, J. HULA:Scalable Load Balancing Using Programmable Data Planes. In
ACMSOSR (2016).[61] Kim, C., Sivaraman, A., Katta, N., Bas, A., Dixit, A., and Wobker,L. J. In-Band Network Telemetry via Programmable Dataplanes. In
ACM SIGCOMM (Demo) (2015).[62] Koeplinger, D., Feldman, M., Prabhakar, R., Zhang, Y., Hadjis, S., Fiszel, R., Zhao, T., Nardi, L., Pedram, A., Kozyrakis, C., andOlukotun, K. Spatial: A Language and Compiler for ApplicationAccelerators. In
ACM/SIGPLAN PLDI (2018).[63] Kuźniar, M., Perešíni, P., and Kostić, D. What You Need to KnowAbout SDN Flow Tables. In
PAM (2015), Springer.[64] LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker,J., Drucker, H., Guyon, I., Muller, U., Sackinger, E., Simard, P.,and Vapnik, V. Comparison of Learning Algorithms for Handwrit-ten Digit Recognition. In
ICANN (1995).[65] Li, Y., Miao, R., Liu, H. H., Zhuang, Y., Feng, F., Tang, L., Cao, Z.,Zhang, M., Kelly, F., Alizadeh, M., and Yu, M. HPCC: High Preci-sion Congestion Control. In
ACM SIGCOMM (2019).[66] Liu, H. H., Zhu, Y., Padhye, J., Cao, J., Tallapragada, S., Lopes,N. P., Rybalchenko, A., Lu, G., and Yuan, L. CrystalNet: FaithfullyEmulating Large Production Networks. In
ACM SOSP (2017).[67] Liu, Y., Li, W., and Li, Y.-C. Network Traffic Classification UsingK-Means Clustering. In
IEEE IMSCCS
IEEE MSE (2007).
[70] Marshall, A., Stansfield, T., Kostarnov, I., Vuillemin, J., and Hutchings, B. A Reconfigurable Arithmetic Array for Multimedia Applications. In IEEE FPGA (1999).
[71] Mehmood, T., and Rais, H. B. M. SVM for Network Anomaly Detection using ACO Feature Subset. In IEEE iSMSC (2015).
[72] Mei, B., Vernalde, S., Verkest, D., De Man, H., and Lauwereins, R. DRESC: A Retargetable Compiler for Coarse-Grained Reconfigurable Architectures. In IEEE FPT (2002).
[73] Mestres, A., Rodriguez-Natal, A., Carner, J., Barlet-Ros, P., Alarcón, E., Solé, M., Muntés-Mulero, V., Meyer, D., Barkai, S., Hibbett, M. J., et al. Knowledge-Defined Networking. ACM SIGCOMM Computer Communication Review (CCR) 47, 3 (2017), 2–10.
[74] Minsky, Y., Madhavapeddy, A., and Hickey, J. Real World OCaml: Functional Programming for the Masses. O’Reilly Media, Inc., 2013.
[75] Moore, A. W., and Zuev, D. Internet Traffic Classification using Bayesian Analysis Techniques. In ACM SIGMETRICS (2005).
[76] Moshref, M., Yu, M., Govindan, R., and Vahdat, A. Trumpet: Timely and Precise Triggers in Data Centers. In ACM SIGCOMM (2016).
[77] Nair, V., and Hinton, G. E. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML (2010).
[78] Niranjan Mysore, R., Pamboris, A., Farrington, N., Huang, N., Miri, P., Radhakrishnan, S., Subramanya, V., and Vahdat, A. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. In ACM SIGCOMM (2009).
[80] Odersky, M., Spoon, L., and Venners, B. Programming in Scala. Artima Inc, 2008.
[81] Papalexakis, E. E., Beutel, A., and Steenkiste, P. Network Anomaly Detection using Co-Clustering. In IEEE/ACM ASONAM (2012).
[82] Park, J., Tyan, H.-R., and Kuo, C.-C. J. Internet Traffic Classification for Scalable QoS Provision. In IEEE ICME (2006).
[83] Perdisci, R., Ariu, D., Fogla, P., Giacinto, G., and Lee, W. McPAD: A Multiple Classifier System for Accurate Payload-Based Anomaly Detection. Computer Networks 53, 6 (2009), 864–881.
[84] Pfaff, B., Pettit, J., Koponen, T., Jackson, E., Zhou, A., Rajahalme, J., Gross, J., Wang, A., Stringer, J., Shelar, P., Amidon, K., and Casado, M. The Design and Implementation of Open vSwitch. In USENIX NSDI (2015).
[85] Ports, D. R., and Nelson, J. When Should The Network Be The Computer? In ACM HotOS (2019).
[86] Prabhakar, R., Zhang, Y., Koeplinger, D., Feldman, M., Zhao, T., Hadjis, S., Pedram, A., Kozyrakis, C., and Olukotun, K. Plasticine: A Reconfigurable Architecture for Parallel Patterns. In ACM/IEEE ISCA (2017).
[87] Rae, J. W., Bartunov, S., and Lillicrap, T. P. Meta-Learning Neural Bloom Filters. arXiv:1906.04304 (2019).
[88] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the Social Network’s (Datacenter) Network. In ACM SIGCOMM (2015).
[89] Rucker, A., Swamy, T., Shahbaz, M., and Olukotun, K. Elastic RSS: Co-Scheduling Packets and Cores Using Programmable NICs. In ACM APNet (2019).
[90] Sapio, A., Canini, M., Ho, C.-Y., Nelson, J., Kalnis, P., Kim, C., Krishnamurthy, A., Moshref, M., Ports, D. R., and Richtárik, P. Scaling Distributed Machine Learning with In-Network Aggregation. arXiv:1903.06701 (2019).
[91] Sarkar, D. Continuous Numeric Data – Strategies for Working with Continuous, Numerical Data. https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b, 2018. Accessed on 02/07/2020.
[92] Shahbaz, M., Choi, S., Pfaff, B., Kim, C., Feamster, N., McKeown, N., and Rexford, J. PISCES: A Programmable, Protocol-Independent Software Switch. In ACM SIGCOMM (2016).
[93] Shawahna, A., Sait, S. M., and El-Maleh, A. FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review. IEEE Access 7 (2018), 7823–7859.
[94] Shelly, N., Jackson, E. J., Koponen, T., McKeown, N., and Rajahalme, J. Flow Caching for High Entropy Packet Fields. In ACM HotSDN (2014).
[95] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., Kanagala, A., Provost, J., Simmons, J., Tanda, E., Wanderer, J., Hölzle, U., Stuart, S., and Vahdat, A. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network. In ACM SIGCOMM (2015).
[96] Singh, H., Lee, M.-H., Lu, G., Kurdahi, F. J., Bagherzadeh, N., and Chaves Filho, E. M. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Transactions on Computers 49, 5 (2000), 465–481.
[97] Siracusano, G., and Bifulco, R. In-Network Neural Networks. arXiv:1801.05731 (2018).
[98] Sivaraman, A., Subramanian, S., Alizadeh, M., Chole, S., Chuang, S.-T., Agrawal, A., Balakrishnan, H., Edsall, T., Katti, S., and McKeown, N. Programmable Packet Scheduling at Line Rate. In
ACM SIGCOMM (2016).
[99] Srouji, M., Zhang, J., and Salakhutdinov, R. Structured Control Nets for Deep Reinforcement Learning. arXiv:1802.08311 (2018).
[100] Sun, J., and Zukerman, M. An Adaptive Neuron AQM for a Stable Internet. In International Conference on Research in Networking (2007), Springer.
[101] Sun, R., Yang, B., Peng, L., Chen, Z., Zhang, L., and Jing, S. Traffic Classification Using Probabilistic Neural Networks. In IEEE ICNC (2010).
[102] Tang, T. A., Mhamdi, L., McLernon, D., Zaidi, S. A. R., and Ghogho, M. Deep Learning Approach for Network Intrusion Detection in Software Defined Networking. In IEEE WINCOM (2016).
[103] Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A. A Detailed Analysis of the KDD CUP 99 Data Set. In IEEE CISDA (2009).
[104] Taylor, M. B., Kim, J., Miller, J., Wentzlaff, D., Ghodrat, F., Greenwald, B., Hoffman, H., Johnson, P., Lee, J.-W., Lee, W., et al. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro 22, 2 (2002), 25–35.
[106] Thompson, S. Haskell: The Craft of Functional Programming, vol. 2. Addison-Wesley, 2011.
[107] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A Physics Engine for Model-Based Control. In IEEE IROS (2012).
[108] Van Der Maaten, L., Postma, E., and Van den Herik, J. Dimensionality Reduction: A Comparative Review. Journal of Machine Learning Research (JMLR) 10 (2009), 66–71.
[109] Wang, E., Davis, J. J., Zhao, R., Ng, H.-C., Niu, X., Luk, W., Cheung, P. Y., and Constantinides, G. A. Deep Neural Network Approximation for Custom Hardware: Where We’ve Been, Where We’re Going. ACM Computing Surveys (CSUR) 52, 2 (2019), 1–39.
[110] Wang, N., Choi, J., Brand, D., Chen, C.-Y., and Gopalakrishnan, K. Training Deep Neural Networks with 8-bit Floating Point Numbers. In NeurIPS (2018).
[111] Wang, S., Li, Z., Ding, C., Yuan, B., Qiu, Q., Wang, Y., and Liang, Y. C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs. In ACM/SIGDA FPGA (2018).
[112] Winstein, K., and Balakrishnan, H. TCP Ex Machina: Computer-Generated Congestion Control. In ACM SIGCOMM (2013).
[113] Xiong, Z., and Zilberman, N. Do Switches Dream of Machine Learning? Toward In-Network Classification. In ACM HotNets (2019).
[114] Xu, Z., Tang, J., Meng, J., Zhang, W., Wang, Y., Liu, C. H., and Yang, D. Experience-Driven Networking: A Deep Reinforcement Learning Based Approach. In IEEE INFOCOM (2018).
[115] Yan, F. Y., Ma, J., Hill, G. D., Raghavan, D., Wahby, R. S., Levis, P., and Winstein, K. Pantheon: The Training Ground for Internet Congestion-Control Research. In USENIX ATC (2018).
[116] Zaki, Y., Pötsch, T., Chen, J., Subramanian, L., and Görg, C. Adaptive Congestion Control for Unpredictable Cellular Networks. In ACM SIGCOMM (2015).
[117] Zander, S., Nguyen, T., and Armitage, G. Automated Traffic Classification and Application Identification Using Machine Learning. In IEEE LCN (2005).
[118] Zhang, J., Chen, C., Xiang, Y., Zhou, W., and Xiang, Y. Internet Traffic Classification by Aggregating Correlated Naïve Bayes Predictions. IEEE Transactions on Information Forensics and Security 8, 1 (2012), 5–15.
[119] Zhang, J., Chen, X., Xiang, Y., Zhou, W., and Wu, J. Robust Network Traffic Classification. IEEE/ACM Transactions on Networking 23, 4 (2014), 1257–1270.
[120] Zhong, H., Fan, K., Mahlke, S., and Schlansker, M. A Distributed Control Path Architecture for VLIW Processors. In IEEE PACT (2005).
[121] Zhou, C., Di, D., Chen, Q., and Guo, J. An Adaptive AQM Algorithm Based on Neuron Reinforcement Learning. In IEEE ICCA (2009).