An Exhaustive Survey on P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends

Abstract

Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols. This closed-design paradigm has limited the capability of the switches to proprietary implementations which are hardcoded by vendors, inducing a lengthy, costly, and inflexible process. Recently, data plane programmability has attracted significant attention from both the research community and the industry, permitting operators and programmers in general to run customized packet processing functions. This open-design paradigm is paving the way for an unprecedented wave of innovation and experimentation by reducing the time of designing, testing, and adopting new protocols; enabling a customized, top-down approach to develop network applications; providing granular visibility of packet events defined by the programmer; reducing complexity and enhancing resource utilization of the programmable switches; and drastically improving the performance of applications that are offloaded to the data plane. Despite the impressive advantages of programmable data plane switches and their importance in modern networks, the literature has been missing a comprehensive survey. To this end, this paper provides a background encompassing an overview of the evolution of networks from legacy to programmable, describing the essentials of programmable switches, and summarizing their advantages over Software-defined Networking (SDN) and legacy devices. The paper then presents a unique, comprehensive taxonomy of applications developed with the P4 language; surveying, classifying, and analyzing more than 150 articles; discussing challenges and considerations; and presenting future perspectives and open research issues.


Elie F. Kfoury*, Jorge Crichigno*, Elias Bou-Harb†
* College of Engineering and Computing, University of South Carolina, Columbia, USA
† The Cyber Center For Security and Analytics, University of Texas at San Antonio, USA


Index Terms: Programmable switches, P4 language, Software-defined Networking, data plane, custom packet processing, taxonomy.

I. INTRODUCTION

Since the emergence of the world wide web and the explosive growth of the Internet in the 1990s, the networking industry has been dominated by closed and proprietary hardware and software. Consider the observations made by McKeown [1] and the illustration in Fig. 1, which shows the cumulative number of Request For Comments (RFCs) [2]. While at first an increase in RFCs may appear encouraging, it has actually represented an entry barrier to the network market. The progressive reduction in the flexibility of protocol design caused by standardized requirements, which cannot be easily removed to enable protocol changes, has perpetuated the status quo. This protocol ossification [3, 4] has been characterized by a slow innovation pace at the hands of a few network vendors. As an example, after being initially conceived by Cisco and VMware [5], the Application Specific Integrated Circuit (ASIC) implementation of the Virtual Extensible LAN (VXLAN) [6], a simple frame encapsulation protocol, took several years, a process that could have been reduced to weeks by software implementations. (The RFC and VXLAN observations are extracted from Dr. McKeown's presentation in [1].)

Fig. 1. Cumulative number of RFCs.

Protocol ossification has been challenged first by Software-defined Networking (SDN) [7, 8] and then by the recent advent of programmable switches. SDN fostered major advances by explicitly separating the control and data planes, and by implementing the control plane intelligence as software outside of the switches. While SDN reduced network complexity and spurred control plane innovation at the speed of software development, it did not wrest control of the actual packet processing functions away from network vendors. Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols (e.g., IP, Ethernet). The design cycle of switch ASICs has been characterized by a lengthy, closed, and proprietary process that usually takes years. Such a process contrasts with the agility of the software industry.

Programmable forwarding can be viewed as a natural evolution of SDN, where the software that describes the behavior of how packets are processed can be conceived, tested, and deployed in a much shorter time span by operators, engineers, researchers, and practitioners in general. The de facto standard for defining the forwarding behavior is the P4 language [9], which stands for Programming Protocol-independent Packet Processors. Essentially, P4 programmable switches have removed the entry barrier to network design, previously reserved to network vendors.

Fig. 2. Paper roadmap.
Section I: Introduction. Protocol ossification; evolution of SDN; rise of P4 and programmable data planes; paper contributions.
Section II: Related Surveys. Comparison of aspects covered in previous surveys; analysis and limitations of existing surveys.
Section III: Traditional Control Plane and SDN. Comparison between traditional, SDN, and programmable devices; analogy with other domain specific processors.
Section IV: Programmable Switches. PISA-based data plane; programmable switch features; P4 language.
Section V: Methodology and Taxonomy. Survey methodology; proposed taxonomy; year-based distribution of the surveyed work; implementation platform distribution.
Sections VI-XII: Surveyed Work. Background and literature review; intra-category comparison and discussions; comparison with legacy.
Section XIII: Challenges and Future Trends. General challenges and future trends; memory availability; arithmetic computations; network-wide cooperation, etc.

The momentum of programmable switches is reflected in the global ecosystem around P4. Operators such as AT&T [10], Comcast [11], NTT [12], KPN [13], Turk Telekom [14], Deutsche Telekom [15], and China Unicom [14] are now using P4-based platforms and applications to optimize their networks. Companies with large data centers such as Facebook [16], Alibaba [17], and Google [18] operate on programmable platforms running customized software, a contrast from the fully proprietary implementations of just a few years ago [19]. Switch manufacturers such as Edgecore [20], Stordis [21], Cisco [22], Arista [23], Juniper [24], and Interface Masters [25] are now manufacturing P4 programmable switches with multiple deployment models, from fully programmable or white boxes to hybrid schemes. Chip manufacturers such as Barefoot Networks (Intel) [26], Xilinx [27], Pensando [28], Mellanox [29], and Innovium [30] have embraced programmable data planes without compromising performance.

The availability of tools and the agility of software development have opened an unprecedented possibility of experimentation and innovation by enabling network owners to build custom protocols and process them using protocol-independent primitives, reprogram the data plane in the field, and run P4 code on diverse platforms. Major agencies supporting engineering research and education worldwide are investing in programmable networks as well [31–34].

A. Contribution

Despite the increasing interest in P4 switches, previous work has only partially covered this technology. As shown in Table I, there is currently no updated and comprehensive material. Thus, this paper addresses this gap by providing an overview of the evolution of networks from legacy to programmable; describing the essentials of programmable switches and P4; and summarizing the advantages of programmable switches over SDN and legacy devices. The paper continues by presenting a taxonomy of applications developed with P4; surveying, classifying, analyzing, and comparing more than 150 articles; discussing challenges and considerations; and putting forward future perspectives and open research issues.

B. Paper Organization

The roadmap of this survey is illustrated in Fig. 2. Section II studies and compares existing surveys on various P4-related topics and demonstrates the added value of the offered work. Section III describes the traditional and SDN devices, and the evolution toward programmable data planes. Section IV introduces programmable switches and their features and explains the Protocol Independent Switch Architecture (PISA), a pipeline forwarding model. Section V describes the survey methodology and the proposed taxonomy. Subsequent sections (from Section VI to Section XII) explore the works pertaining to various categories proposed in the taxonomy, and compare the P4 approaches in each category, as well as with the legacy-enabled solutions. Section XIII outlines challenges and considerations extracted and induced from the literature, and pinpoints directions that can be explored in the future to ameliorate the state-of-the-art solutions. Finally, Section XIV concludes the survey. The abbreviations used in this article are summarized in Table XIV, at the end of the article.

II. RELATED SURVEYS

The advantages of programmable switches attracted considerable attention from the research community. They were described in previous surveys.

Stubbe et al. [35] discussed various P4 compilers and interpreters in a short survey. This work provided a background on the P4 language and demonstrated the main building blocks that describe packet processing in a programmable switch. It outlined reference hardware and software programmable switch implementations. The survey lacks discussions on existing application schemes, challenges, and potential future work.

Dargahi et al. [36] focused on stateful data planes and their security implications. The survey has two main objectives. First, it introduces the reader to recent trends and technologies pertaining to stateful data planes. Second, it discusses relevant security issues by analyzing selected use cases. The scope of the survey is not limited to P4 for programming the data plane. Instead, it describes other schemes such as OpenState [44], Flow-level State Transitions (FAST) [45], etc. When reviewing the security properties of stateful data planes, the authors described a mapping between potential attacks and corresponding vulnerabilities.

Cordeiro et al. [37] discussed the evolution of SDN from OpenFlow to data plane programmability. The survey briefly explained the layout of a P4 program and how it is mapped to the abstract forwarding model. It then listed various compilers, tools, simulators, and frameworks for P4 development. The authors categorized the literature into two categories: 1) programmable security and dependability management; 2) enhanced accounting and performance management. In the first category, the authors listed works pertaining to policy modeling, analysis, and verification, as well as intrusion detection and prevention, and network survivability. In the second category, the authors focused on network monitoring, traffic engineering, and load balancing. The survey only lists a limited set of papers without providing much detail on them or on how they differ from each other. Moreover, the survey was published in 2017, so a significant percentage of the P4-related works published since then are missing.

TABLE I
COMPARISON WITH RELATED SURVEYS

Column groups: Programmable switches and P4 language (Evolution, Description, Features); Taxonomy (Background, Literature, Intra-category comparison, Comparison with legacy); Discussions (Challenges, Future directions).
Legend: Yes = covered in the survey; Partial = partially covered in the survey; No = not covered in the survey.

Paper | Evolution | Description | Features | Background | Literature | Intra-category comparison | Comparison with legacy | Challenges | Future directions
[35] | Partial | Partial | Partial | No | No | No | No | No | No
[36] | Yes | Partial | Partial | Partial | Partial | No | No | No | Partial
[37] | Yes | Partial | Partial | Partial | Yes | No | No | Partial | Partial
[38] | Partial | Partial | Partial | No | No | No | No | No | No
[39] | Yes | No | No | Partial | Partial | No | No | Partial | Partial
[40] | Yes | No | Partial | No | Partial | No | No | No | No
[41] | Yes | No | No | Yes | Partial | No | No | No | No
[42] | Partial | Partial | No | Partial | Partial | Partial | No | Partial | Partial
[43] | Yes | Partial | Partial | Partial | Partial | No | No | Partial | Partial
This paper | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes

Satapathy et al. [38] presented a short description of the pitfalls of traditional networks and the evolution of SDN. The report briefly described elements of the P4 language. The authors then discussed the control plane and P4Runtime [46], and enumerated three use cases of P4 applications. The report concludes with potential future work.

The short survey presented by Bifulco et al. [39] reviews the trends and issues of abstractions and architectures that realize programmable networks. The authors discussed the motivation of packet processing devices in the networking field and described the anatomy of a programmable switch. The proposed taxonomy categorizes the literature as state-based, abstraction-based, implementation-based, and layer-based. The layer-based category consists of the control/intent layer and the data plane layer; the implementation-based category encompasses software and hardware switches; the abstraction-based category includes data flow graphs and match-action pipelines; and the state-based category differentiates between stateful and stateless data planes.

Kaljic et al. [40] presented a survey on data plane flexibility and programmability in SDN networks.
The authors evaluated data plane architectures through several definitions of flexibility and programmability. In general, flexibility in SDN refers to the ability of the network to adapt its resources (e.g., changes in the topology or the network requirements). Afterwards, the authors identified key factors that influence the deviation from the original data plane given with OpenFlow. The survey concludes with future research directions.

Kannan et al. [41] presented a short survey related to the evolution of programmable networks. This work described the pre-SDN model and the evolution to SDN and programmable data planes. The authors highlighted some features of programmable switches such as stateful processing, accurate timing information, and flexible packet cloning and recirculation. The survey categorized data plane applications into two categories, namely, network monitoring and in-network computing. While this survey listed a considerable number of papers belonging to these categories, it barely explained the operation and main ideas of each paper.

Tan et al. [42] presented a survey describing In-band Network Telemetry (INT). The survey explained the development stages and classifications of network measurement (traditional, SDN-based, and P4-based). It also outlined some existing applications that leverage INT such as congestion control, troubleshooting, etc. The survey concludes with discussions and potential future work related to INT.

Zhang et al. [43] presented a survey that focuses on stateful data planes. The survey starts with an overview of stateless and stateful data planes, then overviews and compares some stateful platforms (e.g., OpenState, FAST, FlowBlaze, etc.). The paper reviews a handful of stateful data plane applications and discusses challenges and future perspectives.

Table I summarizes the topics and the features described in the related surveys. It also highlights how this paper differs from the existing surveys.
All previous surveys lack a detailed comparison of the works within each category. Also, none of them compare switch-based schemes against legacy server-based schemes. To the best of the authors' knowledge, this work is the first to exhaustively explore the whole programmable data plane ecosystem. Specifically, the paper describes P4 switches and provides a detailed taxonomy of applications using P4 switches. It categorizes and compares the applications within each category as well as with legacy approaches, and provides challenges and future perspectives.

III. TRADITIONAL CONTROL PLANE AND SDN

A. Traditional and SDN Devices

With traditional devices, networks are connected using protocols such as Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP) [47] running in the control plane at each device. Both control and data planes are under full control of vendors. On the other hand, SDN delineates a clear separation between the control plane and the data plane, and consolidates the control plane so that a single centralized controller can control multiple remote data planes. The controller is implemented in software, under the control of the network owner. The controller computes the tables used by each switch and distributes them via a well-defined Application Programming Interface (API), such as OpenFlow [48]. While SDN allows for the customization of the control plane, it is limited to the OpenFlow specifications and the fixed-function data plane.

TABLE II
FEATURES, TRADITIONAL, SDN, AND P4 PROGRAMMABLE DEVICES

Feature | Traditional | SDN | P4 programmable
Control - data plane separation | No clear separation | Well-defined separation | Well-defined separation
Control and data plane interface | Proprietary | Standardized APIs (e.g., OpenFlow) | Standardized (e.g., OpenFlow, P4Runtime) and program-dependent APIs
Control and data plane program-dependent APIs | NA/Proprietary | NA/Proprietary | Target independent
Functionality separation at control plane | No modular separation of functions | Modular separation: (1) functions to build topology view (state) and (2) algorithms to operate on network state | Same as SDN networks
Customization of control plane | No | Yes | Yes
Visibility of events at data plane | Low | Low | High
Flexibility to define and parse new fields and protocols | Not flexible, fixed | Subject to OpenFlow extensions | Easy, programmable by user
Customization of data plane | No | No | Yes
ASIC packet processing complexity | High, hard-coded | High, hard-coded | Low, defined by user's source code
Data plane match-action stages | Proprietary | OpenFlow assumes in series match-action stages | In series and/or in parallel
Data plane actions | Protocol-dependent primitives | Protocol-dependent primitives | Protocol-independent primitives
Infield runtime reprogrammability | No | No | Yes
Customer support | High | Medium | Low
Technology maturity | High | Medium | Low
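The division of labor described above for SDN, where a centralized controller computes the tables and pushes them to each switch, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four-switch topology, the `Switch` class, and the `install_table` method are hypothetical names and do not correspond to OpenFlow or any real controller API. Breadth-first search stands in for whatever path computation a real controller would run.

```python
from collections import deque


class Switch:
    """Hypothetical fixed-function switch: it only stores the table it is given."""

    def __init__(self, name):
        self.name = name
        self.table = {}

    def install_table(self, table):
        # Stand-in for the controller-to-switch API call (an assumption).
        self.table = dict(table)


def compute_forwarding_tables(topology):
    """Return {switch: {destination: next_hop}} using BFS shortest paths."""
    tables = {}
    for src in topology:
        first_hop = {}
        visited = {src}
        queue = deque()
        # Seed the search with direct neighbors; each is its own first hop.
        for nbr in sorted(topology[src]):
            visited.add(nbr)
            first_hop[nbr] = nbr
            queue.append(nbr)
        while queue:
            node = queue.popleft()
            for nbr in sorted(topology[node]):
                if nbr not in visited:
                    visited.add(nbr)
                    # Inherit the first hop used to reach the parent node.
                    first_hop[nbr] = first_hop[node]
                    queue.append(nbr)
        tables[src] = first_hop
    return tables


# Hypothetical topology: a square of four switches.
TOPOLOGY = {
    "s1": {"s2", "s3"},
    "s2": {"s1", "s4"},
    "s3": {"s1", "s4"},
    "s4": {"s2", "s3"},
}

# Controller side: compute all tables centrally, then distribute them.
TABLES = compute_forwarding_tables(TOPOLOGY)
SWITCHES = {name: Switch(name) for name in TOPOLOGY}
for name, table in TABLES.items():
    SWITCHES[name].install_table(table)
```

In this sketch the switches hold no routing logic of their own, which mirrors the control plane customization that SDN offers. The lookup behavior inside each switch remains fixed, which is the limitation that programmable data planes address.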

B. Comparison of Traditional, SDN, and Programmable Data Plane Devices

Table II contrasts the main characteristics of traditional, SDN, and P4 programmable devices. In the latter, the forwarding behavior is defined by the user's code. Other advantages include the program-dependent APIs, where the same P4 program running on different targets requires no modifications in the runtime applications (i.e., the control plane and the interface between control and data planes are target agnostic); the protocol-independent primitives used to process packets; the more powerful computation model where the match-action stages can not only be in series but also in parallel; and the infield reprogrammability at runtime. On the other hand, the technology maturity and support for P4 devices can still be considered low in contrast to traditional and SDN devices.

C. Network Evolution and Analogy with other Domain Specific Processors

The introduction of general-purpose computers in the early 1970s enabled programmers to develop applications running on CPUs. The use of high-level languages accelerated innovation by hiding the target hardware (e.g., x86). In signal processing, Digital Signal Processors (DSPs) were developed in the late 1970s and early 1980s with instruction sets optimized for digital signal processing. Matlab is used for developing DSP applications. In graphics, Graphics Processing Units (GPUs) were developed in the late 1990s and early 2000s with instruction sets for graphics. Open Computing Language (OpenCL) is one of the main languages for developing graphics applications. In machine learning, Tensor Processing Units (TPUs) and TensorFlow were developed in the mid 2010s with instruction sets optimized for machine learning.

Programmable forwarding is part of the larger information technology evolution observed above. Specifically, over the last few years, a group of researchers developed a machine model for networking, namely the Protocol Independent Switch Architecture (PISA) [49]. PISA was designed with instruction sets optimized for network operations. The high-level language for programming PISA devices is P4.

IV. PROGRAMMABLE SWITCHES

A. PISA Architecture

PISA is a packet processing model that includes the following elements: programmable parser, programmable match-action pipeline, and programmable deparser (see Fig. 3). The programmable parser permits the programmer to define the headers (according to custom or standard protocols) and to parse them. The parser can be represented as a state machine. The programmable match-action pipeline executes the operations over the packet headers and intermediate results. A single match-action stage has multiple memory blocks (tables, registers) and Arithmetic Logic Units (ALUs), which allow for simultaneous lookups and actions. Since some action results may be needed for further processing (e.g., data dependencies), stages are arranged sequentially. The programmable deparser assembles the packet headers back and serializes them for transmission. A PISA device is protocol-independent.

Fig. 3. A PISA-based data plane and its interaction with the control plane. Control plane: a software-based centralized controller running applications (App-1 to App-n) that reaches the switch through PD-API or P4Runtime. Data plane: a switch ASIC, configured by compiling the P4 program, in which a packet traverses the programmable parser, the programmable match-action pipeline (Stage 1 to Stage n, each with memory and ALUs), and the programmable deparser. A program-defined local table holds keys (header fields, tuples, etc.), actions (e.g., Forward(), Mark(), Drop()), and action data (e.g., Dst IP=IP1, Dst port=p2).

In Fig. 3, the P4 program defines the format of the keys used for lookup operations. Keys can be formed using the packet header's information. The control plane populates table entries with keys and action data. Keys are used for matching packet information (e.g., destination IP address) and action data is used for operations (e.g., output port).
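The parser, match-action, and deparser elements described above can be mimicked in a few lines of Python. This is only a software sketch under stated assumptions, not how a PISA ASIC or a P4 compiler works internally: the `Table` class, the `forward` and `drop` actions, and the `build_packet` helper are hypothetical names, only Ethernet and option-free IPv4 are parsed, there is a single exact-match stage, and the IPv4 checksum is deliberately left untouched after the TTL change.

```python
import socket
import struct

ETH_FMT = "!6s6sH"           # dst MAC, src MAC, EtherType
IPV4_FMT = "!BBHHHBBH4s4s"   # fixed 20-byte IPv4 header, no options (assumption)
ETH_LEN = struct.calcsize(ETH_FMT)
IPV4_LEN = struct.calcsize(IPV4_FMT)
TTL, DST_IP = 5, 9           # field positions inside the unpacked IPv4 header


def parse(packet):
    """Parser as a state machine: start -> ethernet -> (ipv4) -> accept."""
    headers = {"ethernet": list(struct.unpack(ETH_FMT, packet[:ETH_LEN]))}
    offset = ETH_LEN
    if headers["ethernet"][2] == 0x0800:  # transition selected by EtherType
        headers["ipv4"] = list(struct.unpack(IPV4_FMT, packet[offset:offset + IPV4_LEN]))
        offset += IPV4_LEN
    return headers, packet[offset:]


def forward(headers, meta, port):
    """Action: set the output port (action data) and decrement the TTL."""
    meta["egress_port"] = port
    headers["ipv4"][TTL] -= 1  # checksum update omitted in this sketch


def drop(headers, meta):
    """Action: mark the packet to be discarded."""
    meta["drop"] = True


class Table:
    """Exact-match table; entries are populated by the control plane."""

    def __init__(self, default_action):
        self.entries = {}
        self.default = (default_action, ())

    def add_entry(self, key, action, *action_data):
        self.entries[key] = (action, action_data)

    def apply(self, key, headers, meta):
        action, action_data = self.entries.get(key, self.default)
        action(headers, meta, *action_data)


def deparse(headers, payload):
    """Deparser: serialize the (possibly modified) headers back onto the wire."""
    out = struct.pack(ETH_FMT, *headers["ethernet"])
    if "ipv4" in headers:
        out += struct.pack(IPV4_FMT, *headers["ipv4"])
    return out + payload


def pipeline(packet, table):
    """Run one packet through parser, one match-action stage, and deparser."""
    headers, payload = parse(packet)
    meta = {"egress_port": None, "drop": False}
    if "ipv4" in headers:
        table.apply(headers["ipv4"][DST_IP], headers, meta)  # key = destination IP
    else:
        drop(headers, meta)
    if meta["drop"]:
        return None, None
    return meta["egress_port"], deparse(headers, payload)


def build_packet(dst_ip, ttl=64, payload=b"data"):
    """Hypothetical helper that builds a test frame."""
    eth = struct.pack(ETH_FMT, b"\x00" * 5 + b"\x02", b"\x00" * 5 + b"\x01", 0x0800)
    ipv4 = struct.pack(IPV4_FMT, 0x45, 0, IPV4_LEN + len(payload), 0, 0, ttl, 17, 0,
                       socket.inet_aton("10.0.0.1"), socket.inet_aton(dst_ip))
    return eth + ipv4 + payload


# Control plane: populate the table with a key and its action data.
IPV4_TABLE = Table(default_action=drop)
IPV4_TABLE.add_entry(socket.inet_aton("10.0.0.2"), forward, 2)
```

Calling `pipeline(build_packet("10.0.0.2"), IPV4_TABLE)` follows the hit path (key matched, action data supplies the port), while a destination with no entry falls through to the default `drop` action. The split between `add_entry` and `apply` mirrors the split between the control plane, which populates entries, and the data plane, which only performs lookups.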

B. Programmable Switch Features

The main features of programmable switches are [50]:
• Agility: the programmer can design, test, and adopt new protocols and features in significantly shorter times (i.e., weeks or months rather than years).
• Top-down design: for decades the networking industry operated in a bottom-up approach. Fixed-function ASICs are at the bottom and enforce available protocols and features to the programmer at the top. With programmable switches, the programmer describes protocols and features in the ASICs. Note that the physical layer and parts of the MAC layer may not be programmable.
• Visibility: programmable switches provide greater visibility into the behavior of the network. INT is an example of a framework to collect and retrieve information from the data plane, without intervention of the control plane.
• Reduced complexity: fixed-function switches incorporate a large superset of protocols. These protocols consume resources and add complexity to the processing logic, which is hard-coded in silicon. With programmable switches, the programmer has the option to implement only those protocols that are needed.
• Differentiation: the customized protocol or feature implemented by the programmer need not be shared with the chip manufacturer.
• Enhanced performance: programmable switches do not introduce a performance penalty. On the contrary, they may produce better performance than fixed-function switches. Table III shows a comparison between a programmable switch and a fixed-function switch, reproduced from [51]. Note the enhanced performance of the former (e.g., maximum forwarding rate, latency, power draw).

TABLE III
COMPARISON BETWEEN A P4 PROGRAMMABLE SWITCH AND A FIXED-FUNCTION SWITCH [51]

Characteristic | Programmable | Fixed-function
Throughput | 6.4 Tb/s | 6.4 Tb/s
Number of 100G ports | 64 | 64
Max forwarding rate | 4.8B pps | 4.2B pps
Max 25G/10G ports | 256/258 | 128/130
Programmable | Yes (P4) | No
Power draw | 4.2 W per port | 4.9 W per port
Large scale NAT | Yes (100k) | No
Large scale stateful ACL | Yes (100k) | No
Large scale tunnels | Yes (192k) | No
Packet buffers | Unified | Segmented
LAG/ECMP | Full entropy, programmable | Hash seed, reduced entropy
ECMP | 256-way | 128-way
Telemetry | Line-rate per flow stats | SFlow (sampled)
Latency | Under 400 ns | Under 450 ns
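To put the figures of Table III in relative terms, the short Python sketch below computes the percentage difference of the programmable switch with respect to the fixed-function one for the three metrics the text highlights. The input numbers are taken from Table III; the dictionary keys and function name are illustrative. Two assumptions are labeled in the comments: the latency entries are upper bounds ("under"), so only the bounds are compared, and the chassis-level power estimate assumes the per-port figure applies uniformly to all 64 100G ports.

```python
# (programmable, fixed-function) pairs copied from Table III.
TABLE_III = {
    "max_forwarding_rate_bpps": (4.8, 4.2),
    "power_draw_w_per_port": (4.2, 4.9),
    "latency_bound_ns": (400, 450),  # upper bounds, not measured latencies
}


def relative_change_percent(programmable, fixed_function):
    """Percentage change of the programmable value relative to fixed-function."""
    return (programmable - fixed_function) / fixed_function * 100.0


CHANGES = {
    name: relative_change_percent(programmable, fixed_function)
    for name, (programmable, fixed_function) in TABLE_III.items()
}

# Assumption: the per-port power figure holds for each of the 64 ports.
NUM_PORTS = 64
POWER_SAVED_W = NUM_PORTS * (4.9 - 4.2)

for name, change in CHANGES.items():
    print(f"{name}: {change:+.1f}%")
print(f"estimated power saved across {NUM_PORTS} ports: {POWER_SAVED_W:.1f} W")
```

Under these assumptions the forwarding rate is roughly 14% higher and the per-port power draw roughly 14% lower for the programmable switch, while the stated latency bound is about 11% lower.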

C. P4 Language

P4 has a reduced instruction set and has the following goals:
• Reconfigurability: the parser and the processing logic can be redefined in the field.
• Protocol independence: the switch is protocol-agnostic. The programmer defines the protocols, the parser, and the operations to process the headers.
• Target independence: the underlying ASIC is hidden from the programmer. The compiler takes the switch's capabilities into account when turning a target-independent P4 program into a target-dependent binary.

Fig. 4. (a) Distribution of surveyed data plane research works per year (2016 to 2020; axes: year, number of papers). (b) Implementation platform distribution: software 48.5%, ASIC 38.6%, NetFPGA 7.9%, SmartNICs 5%. The shares are calculated based on the studied papers in this survey.

The original specification of the P4 language was released in 2014, and is referred to as P4_14. In 2016, a new version of the language was drafted, which is referred to as P4_16. P4_16 is a more mature language which extended the P4 language to broader underlying targets: ASICs, Field-Programmable Gate Arrays (FPGAs), Network Interface Cards (NICs), etc.

V. METHODOLOGY AND TAXONOMY

This section describes the systematic methodology that was adopted to generate the proposed taxonomy. The results of this literature survey represent findings derived by thoroughly exploring more than 150 data plane-related research works published from 2016 up to late 2020, the distribution of which is summarized in Fig. 4 (a).

Fig. 4 (b) depicts the share of each implementation platform used in the surveyed papers, grouped by software (e.g., BMv2, PISCES), ASIC (e.g., Tofino, Cavium), NetFPGA (e.g., NetFPGA SUME), and SmartNICs (e.g., Netronome NFP). The graph shows that software switches account for the largest single share of the works. Note that behavioral software switches (e.g., BMv2 [203]) are not suitable indicators of whether the program could run on a hardware target; they are typically used for prototyping ideas and to foster innovation. On the other hand, non-behavioral software switches (e.g., PISCES [204], derived from Open vSwitch (OVS) [205]) are production-grade and can be deployed in data centers.

Each hardware target category constitutes a smaller share of the platform distribution than software switches. A possible reason is that the technology is still recent and targets are not yet widely available for public sale. For example, to acquire a switch equipped with a Tofino chip (e.g., Edgecore Wedge100BF-32 [20]), and to get the development environment and the customer support, a Non-Disclosure Agreement (NDA) with Barefoot Networks should be signed. Additionally, the client should attend a training course (e.g., [206]) to understand the architecture and the specifics of the platform. This process is considered lengthy and costly, and not every institution is capable of affording it.

The proposed taxonomy is shown in Fig. 5. The taxonomy was meticulously designed to cover the most significant works related to data plane programmability and P4. The aim is to categorize the surveyed works based on various high-level disciplines. The taxonomy provides a clear separation of categories so that a reader interested in a specific discipline can read only the works pertaining to that discipline. The correctness of the taxonomy was verified by carefully examining the related work of each paper to correlate them into high-level categories. Each high-level category is further divided into sub-categories. For instance, various measurement works belong to the sub-category "Measurements" under the high-level category "Network Performance".

Fig. 5. Taxonomy of programmable switches literature based upon relevant, explored research areas.
• In-Band Network Telemetry (INT): Variations [52–57]; Collectors and Solutions [58–62].
• Network Performance: Congestion Control [63–68]; Measurements [69–90]; AQM [91–95]; QoS and TM [96–99]; Multicast [100–102].
• Middlebox Functions: Load Balancing [103–109]; Caching [110–117]; Telecom Services [118–124]; Pub/Sub [125–128].
• Accelerated Computations: Consensus [129–136]; Machine Learning [137–142]; Miscellaneous [143–151].
• Internet of Things (IoT): Aggregation [152–155]; Service Automation [156, 157].
• Security: Heavy Hitter [158–164]; Cryptography [165–168]; Anonymity [169–172]; Access Control [173–176]; Attacks and Defenses [177–188].
• Testing: Troubleshoot [189–193]; Verification [194–202].

Further, the survey compares the results and the features offered by programmable data plane approaches (intra-category), as well as with those of the contemporary and legacy ones. This detailed comparison is elaborated upon for each sub-category, giving the interested reader a comprehensive view of the state-of-the-art findings of that sub-category. Additionally, the survey presents various challenges and considerations, as well as some current and future trends that could be explored as future work.

VI. IN-BAND NETWORK TELEMETRY (INT)

Conventional monitoring and collecting tools and protocols (e.g., ping, traceroute, Simple Network Management Protocol (SNMP), NetFlow, sFlow) are by no means sufficiently accurate to troubleshoot the network, especially in the presence of congestion. These methods provide millisecond accuracy at best and cannot capture events that happen on a microsecond scale. Moreover, they cannot provide per-packet visibility across the network.

Fig. 6. In-band Network Telemetry (INT). The INT source adds telemetry instructions and its own metadata to the original packet headers; each INT transit hop adds its switch metadata; the INT sink extracts the metadata and sends it to the INT collector.

In-band Network Telemetry (INT) [207] is one of the earliest key applications of programmable data plane switches. It enables querying the internal state of the switch and provides fine-grained and precise telemetry measurements (e.g., queue occupancy, link utilization, queuing latency, etc.). INT handles events that occur on a microsecond scale, also known as microbursts. Collecting and reporting the network state is performed entirely by the data plane, without any intervention from the control plane. Due to the increased visibility achieved with INT, network operators are able to troubleshoot problems more efficiently. Additionally, it is possible to perform instant processing in the data plane after measuring telemetry data (e.g., reroute flows when a link is congested), without having to interact with the control plane. Fig. 6 shows an INT-enabled network. INT enables network administrators to determine the following:
• The path a packet took when traversing the network (see Fig. 7). Such information is difficult to learn using existing technologies when multi-path routing strategies (e.g., Equal-cost Multi-Path Routing (ECMP) [208], flowlet switching [209]) are used.
• The matched rules that forwarded the packets (e.g., ACL entry, routing lookup).
• The time a packet spent in the queue of each switch.
• The flows that shared the queue with a certain packet.

The P4 Applications Working Group developed the INT telemetry specifications [210] with contributions from key enablers of the P4 language such as Barefoot Networks, VMware, Alibaba, and others.

INT allows instrumenting the metadata to be monitored without modifying the application layer. The metadata to be inserted depends on the use case; for example, if congestion was the main concern to monitor, the programmer inserts queue metadata and transit latency. An INT-enabled network has the following entities: 1) INT source: a trusted entity that inserts the initial instruction set, which tells other INT-capable devices what metadata to add into the packet; 2) INT transit hop: a device adding its own metadata to an INT packet after examining the INT instructions inserted by the INT source; 3) INT sink: a trusted entity that extracts the INT headers in order to keep the INT operation transparent for upper-layer applications; and 4) INT collector: a device that receives and processes INT packets.

The location of an INT header in the packet is intentionally not enforced in the specifications document. For example, it can be inserted as a payload on top of TCP, UDP, and NSH, as a Geneve option on top of Geneve, and as a VXLAN payload on top of VXLAN.

[Figure 7 labels: Sender; S1 (INT source); S2, S3 (INT transit hops); S4 (INT sink); Receiver; INT Collector. The INT header stack carried with the packet headers and data grows per hop: [S1], then [S2][S1], then [S3][S2][S1], then [S4][S3][S2][S1].]
Fig. 7. Example of how INT can be used to provide the path traversed by a packet in the network. The INT source inserts its label (S1) as well as the INT headers to instruct subsequent switches about the required operations (i.e., push their labels). Finally, switch S4 strips the INT headers from the packet and forwards them to a collector, while forwarding the original packet to the receiver.
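The roles of the INT source, transit hop, and sink described above can be illustrated with a small simulation of the path-tracing example in Fig. 7. This is a minimal sketch, not the INT specification's header format: the `Packet` class, the field names, and the `queue_depth` instruction are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: str
    # Instructions written by the INT source (hypothetical field names).
    int_instructions: tuple = ()
    # Per-hop metadata records, most recent hop first.
    int_stack: list = field(default_factory=list)


def int_transit(pkt, switch_id, state):
    """Transit hop: read the instructions and push this switch's metadata."""
    record = {"switch_id": switch_id}
    for name in pkt.int_instructions:
        record[name] = state.get(name)
    pkt.int_stack.insert(0, record)
    return pkt


def int_source(pkt, switch_id, instructions, state):
    """Source: write the instruction set, then add its own metadata."""
    pkt.int_instructions = tuple(instructions)
    return int_transit(pkt, switch_id, state)


def int_sink(pkt, switch_id, state):
    """Sink: add its metadata, strip all INT data, and build a collector report."""
    int_transit(pkt, switch_id, state)
    report = list(pkt.int_stack)
    pkt.int_stack = []
    pkt.int_instructions = ()
    return pkt, report


pkt = Packet(payload="data")
int_source(pkt, "S1", ["queue_depth"], {"queue_depth": 3})
int_transit(pkt, "S2", {"queue_depth": 0})
int_transit(pkt, "S3", {"queue_depth": 7})
pkt, report = int_sink(pkt, "S4", {"queue_depth": 1})
path = [r["switch_id"] for r in report]
```

The receiver sees the packet without any INT data, while the report handed to the collector lists the hops in the same order as the header stack in Fig. 7.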

A. Postcard-based Telemetry (PBT)

INT provides the exact forwarding path, the timestamp and latency at each network node, and other information. Such detailed information is derived by augmenting user packets with data collected by each switch. Postcard-based Telemetry (PBT) is an alternative to INT that does not modify user packets. Fig. 8 shows an example of PBT. As a user packet traverses the network, each switch generates a postcard and sends it to the monitor. The event that triggers the generation of the postcard is defined by the programmer, according to the application's needs. Examples include the start and/or end of a flow, sampling (e.g., one report per second), a packet dropped by the switch, and queue congestion.

[Figure 8 labels: Host 1; Host 2; Original packet; Flow watchlist; Event detection; Event detected; Original headers with switch telemetry info; Postcard-based Telemetry; INT Collector]
Fig. 8. Postcard-based telemetry (PBT).

TABLE IV
INT VARIATIONS COMPARISON

| Variation | Name | Overhead reduction strategy | Metadata collection | Operator intervention | Implementation |
|---|---|---|---|---|---|
| [52] | NetVision | On-demand probing | Active (segment routing) | High; telemetry through queries | Mininet |
| [53] | N/A | Flow subset selection by the knowledge plane | Passive | Low; closed-loop network | Software (BMv2) w/ ONOS controller |
| [54] | sINT | Monitoring ratio adjustment based on network changes | Passive | Low; telemetry based on network behavior | Software (BMv2) |
| [55] | INTO | Telemetry orchestration based on heuristics | Passive | High; telemetry specified by operators | N/A |
| [56] | ML-INT | Per-flow packet subset selection through sampling | Passive | High; telemetry specified by operators | ASIC (Tofino) and SmartNIC (NFP-4000) |
| [57] | PINT | Telemetry encoding on multiple packets | Passive | High; telemetry through queries | ASIC (Tofino) |

B. INT Variations

B.1. Background

Despite the improvements that INT brings compared to legacy monitoring schemes, it introduces bandwidth overhead when enabled unconditionally by network operators. In such scenarios, INT headers are added to every packet traversing the switch, and the added bytes decrease the overall network throughput. To mitigate this limitation, conditional statements are included in the P4 program to send reports only when certain events occur (e.g., queue utilization exceeds a threshold). This solution requires network operators to adjust thresholds and parameters manually based on the usual network traffic patterns. Consequently, several variations of INT have been developed, aiming at customizing its functionalities and addressing its limitations. Recent works mainly focus on minimizing the bandwidth overhead of INT by adjusting thresholds and parameters automatically, based on measured traffic patterns and the desired application type.
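The difference between unconditional and event-triggered reporting can be made concrete with a short calculation. The sketch below is illustrative only: the function names, the 12-byte metadata size, and the queue-depth trace are assumptions, not values from the INT specification or from any of the surveyed systems.

```python
def should_report(queue_depth, threshold):
    """Event condition: report only when the queue depth exceeds the threshold."""
    return queue_depth > threshold


def report_overhead_bytes(queue_depths, threshold, metadata_bytes):
    """Return (unconditional, event_based) telemetry bytes for a packet trace."""
    unconditional = len(queue_depths) * metadata_bytes
    event_based = sum(
        metadata_bytes for depth in queue_depths if should_report(depth, threshold)
    )
    return unconditional, event_based


# Hypothetical trace: queue depth observed by five consecutive packets.
trace = [1, 2, 50, 60, 3]
unconditional, event_based = report_overhead_bytes(trace, threshold=40, metadata_bytes=12)
```

With this trace, only the two packets that observe a queue depth above the threshold carry telemetry, so the event-based overhead is a fraction of the unconditional one. The cost is that the threshold has to be chosen, which is the manual tuning the paragraph above refers to.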

B.2. Literature Review

Liu et al. [52] proposed NetVision, a telemetry system that aims at minimizing the traffic overhead of INT by using probing. NetVision actively sends probe packets whose number and format are tailored to the telemetry application (e.g., traffic engineering, network visualization). Hyun et al. [53] proposed an architecture for self-driving networks that uses INT to collect packet-level network telemetry, and Knowledge-Defined Networking (KDN) to add intelligence to network management based on the collected telemetry data. KDN accepts the network information as input and generates policies to improve the network performance. Kim et al. [54] proposed selective INT (sINT), a scheme that dynamically adjusts the insertion frequency of INT headers. A monitoring engine observes changes in consecutive INT metadata and applies a heuristic algorithm to compute the insertion ratio. Marques et al. [55] described the orchestration problem in INT, which is associated with the optimal use of network resources for collecting the state and behavior of forwarding devices through INT. Niu et al. [56] proposed multilayer INT (ML-INT), a system that visualizes IP-over-optical networks in real time. The proposed system encodes INT headers in a subset of packets pertaining to an IP flow. The encoded headers contain metadata that describes statistics of electrical and optical network elements on the flow's routing path. Ben Basat et al. [57] proposed Probabilistic INT (PINT), an approach that probabilistically adds telemetry information into a collection of packets to minimize the per-packet overhead associated with regular INT.
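To give an intuition for how telemetry can be spread probabilistically over many packets, the sketch below uses reservoir sampling: each packet carries a single metadata slot, and hop i overwrites it with probability 1/i, so every hop on the path is equally likely to be the one reported. This is a simplified illustration under that assumption and is not the encoding used by PINT [57]; all names are hypothetical.

```python
import random


def forward_with_sampling(path, rng):
    """Carry one metadata slot; hop i overwrites it with probability 1/i."""
    slot = None
    for hop_index, switch_id in enumerate(path, start=1):
        # The first hop always writes, since rng.random() is always below 1.0.
        if rng.random() < 1.0 / hop_index:
            slot = switch_id
    return slot


def collect(path, num_packets, seed=0):
    """Count how often each switch is the one reported to the collector."""
    rng = random.Random(seed)
    counts = {switch_id: 0 for switch_id in path}
    for _ in range(num_packets):
        counts[forward_with_sampling(path, rng)] += 1
    return counts


path = ["S1", "S2", "S3", "S4"]
counts = collect(path, num_packets=4000)
```

Each packet carries one record instead of four, yet after enough packets the collector has seen every switch on the path roughly equally often. The trade-off is that the full path is only known after several packets of the flow have arrived.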

B.3. INT Variations, Comparison, and Discussions

Table IV compares the aforementioned INT variations. The main motivation behind these solutions is that the majority of applications that leverage INT (e.g., congestion control, fast reroute) only require approximations of the telemetry data and therefore do not need to gather per-packet, per-hop INT information. NetVision uses probing to reduce the overhead of INT. The main limitation of this approach is that probing might result in poor accuracy and timeliness, as the probes might experience different network conditions than actual packets. All other works collect INT information passively. [53] and sINT select flows based on current network conditions, ML-INT uses a fixed sampling scheme to select a small portion of packets in a flow, and PINT uses a probabilistic approach to encode telemetry on multiple packets. Sampling and anomaly-based monitoring might lead to information loss since not all packets are being reported. Some solutions require manual intervention from the operators to configure the telemetry process. The simplicity of the configuration interface is vital to make the solution attractive to network operators. Finally, some solutions were implemented on software switches, while others were implemented on hardware. It is important to note that not all software implementations can fit into the pipeline of the hardware.

B.4. INT, PBT, and Traditional Telemetry Comparison

Table V compares INT, PBT, and traditional telemetry. INT has higher potential vulnerabilities than PBT, such as eavesdropping and tampering. Adding extra protective measures (e.g., encryption) is difficult on the fast data path. On the other hand, PBT packets tolerate additional processing to enhance security. The flow tracking process is simpler with INT than with PBT. The latter requires the server receiving the reports (i.e., the INT collector, explained in Section VI-C) to correlate multiple postcards of a single flow packet passing through the network, to form the packet history at the monitor. This process also adds delay in reporting and tracking. Legacy schemes that rely on sampling and polling suffer from accuracy issues, especially when links are congested. INT on the other hand is push-based, has better accuracy, and is more granular (microseconds scale). Reports sent by an INT-capable device contain rich information (e.g., the path a packet took) that can aid in troubleshooting the network. Such visibility is minimal in legacy monitoring schemes. Programmable switches permit reporting telemetry after the occurrence of specific events (e.g., congestion). Moreover, they provide flexibility in programming reactive logic that executes promptly in the data plane. One drawback of INT is that it imposes bandwidth overhead if configured to report for every packet; however, when event-based reports are considered, the bandwidth overhead significantly decreases.

TABLE V
IN-BAND, POSTCARD-BASED, AND TRADITIONAL NETWORK TELEMETRY

| Feature | INT | PBT | Traditional |
|---|---|---|---|
| User packet modification | Yes | No | No |
| User packet overhead | Yes | No | No |
| Potential vulnerabilities | Higher | Lower | Lower |
| Flow tracking process | Simpler | More complex | More complex |
| Delay in reporting, tracking | Lowest | Low | High |
| Microbursts detection | Yes | Yes | No |
| Accuracy | Higher | Higher | Lower; especially with congested links |
| Reporting type | Push-based, initiated by the data plane | Push-based | Polling (e.g., SNMP), initiated by the control plane; sampling (e.g., NetFlow), initiated by the data plane |
| Troubleshoot problems | Easier and cheaper | Easier and cheaper | Harder and more expensive |
| Granularity | Higher; microseconds scale | Higher | Lower; milliseconds scale at best |
| Event-based monitoring | Customizable based on conditions and thresholds | Customizable | Not possible |
| Reactive processing | Faster; reactive processing is executed in the data plane | Faster | Slower; reactive processing is executed in the control plane |
| Bandwidth overhead | High when all packets are reported, low when reported based on events | Higher than INT | Lowest |
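The extra work that PBT places on the collector, namely correlating the postcards of one packet into a per-packet history, can be sketched as follows. The postcard fields (`packet_id`, `switch_id`, `timestamp_us`) are hypothetical; real deployments would need some packet identifier, but the form shown here is an assumption made for illustration.

```python
from collections import defaultdict


def correlate_postcards(postcards):
    """Group postcards by packet and order each group by switch timestamp."""
    by_packet = defaultdict(list)
    for card in postcards:
        by_packet[card["packet_id"]].append(card)
    return {
        packet_id: [c["switch_id"] for c in sorted(cards, key=lambda c: c["timestamp_us"])]
        for packet_id, cards in by_packet.items()
    }


# Postcards may reach the collector out of order and interleaved across packets.
postcards = [
    {"packet_id": 7, "switch_id": "S2", "timestamp_us": 120},
    {"packet_id": 9, "switch_id": "S1", "timestamp_us": 105},
    {"packet_id": 7, "switch_id": "S1", "timestamp_us": 100},
    {"packet_id": 7, "switch_id": "S3", "timestamp_us": 140},
    {"packet_id": 9, "switch_id": "S3", "timestamp_us": 131},
]
history = correlate_postcards(postcards)
```

With INT, the same history arrives in a single report, which is why Table V lists its flow tracking process as simpler and its reporting delay as lowest.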

C. INT Collectors

C.1. Background

An INT collector is a component in the network that processes telemetry reports produced by INT devices. It parses and filters metrics from the collected reports, then optionally stores the results persistently in a database. Since a large number of reports is typically produced in INT, having a high-performance collector is essential to avoid missing important network events. To this end, a number of research works focus on developing and enhancing the performance of INT collectors running on commodity servers.

C.2. Literature Review

IntMon [58] is an ONOS-based collector application for INT reports. It includes a web-based interface that allows controlling which flows to monitor and the specific metadata to collect. Another INT collector is the Prometheus INT exporter [59], which extracts information from every INT packet and pushes it to a gateway. A database server then periodically pulls information from the gateway. INTCollector [60] is a collector that extracts events, which are important network information, from INT raw data. It uses in-kernel processing to further improve the performance. INTCollector has two processing flows: the fast path, which processes INT reports and needs to execute quickly, and the normal path, which processes events sent from the fast path and stores information in the database. Deep Insight [61] is a proprietary solution provided by Barefoot Networks that leverages INT capabilities to provide services such as real-time anomaly detection, congestion analysis, and packet-drop analysis. Another proprietary solution is BroadView Analytics, used on Broadcom Trident 3 devices by Broadcom [62].
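The split between a fast path that filters reports and a normal path that stores events can be sketched as below. This is a hedged illustration of the general idea of change-based event detection, not INTCollector's actual implementation; the function names, the threshold rule, and the dictionary used as a database are all assumptions.

```python
def fast_path(reports, threshold):
    """Emit an event only when a metric changes by more than the threshold."""
    last_published = {}
    events = []
    for key, value in reports:
        previous = last_published.get(key)
        if previous is None or abs(value - previous) > threshold:
            events.append((key, value))
            last_published[key] = value  # compare future reports to this value
    return events


def normal_path(events, database):
    """Store each event; this path runs far less often than the fast path."""
    for key, value in events:
        database.setdefault(key, []).append(value)
    return database


# Hypothetical per-switch queue latency reports, in microseconds.
reports = [("s1", 10), ("s1", 11), ("s1", 30), ("s2", 5), ("s1", 31)]
events = fast_path(reports, threshold=5)
database = normal_path(events, {})
```

Five reports result in three stored events, because small fluctuations around the last published value are dropped before they reach the database.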

C.3. INT Collectors Comparison, Discussions, and Limitations

Fig. 9 and Table VI compare the aforementioned INT collectors. IntMon and Prometheus INT exporter were among the earliest collectors. Both have low processing rates since they are implemented without kernel or hardware acceleration. Also, they are very limited with respect to the features they provide (e.g., lack of event detection, limited analytics, historical data unavailability). Prometheus INT exporter also suffers from the increased overhead of sending the data for every INT packet to the gateway, and the potential loss of network events as the database only stores the latest data pulled from the gateway. INTCollector on the other hand has a higher rate and uses the eXpress Data Path (XDP) [211] to accelerate the packet processing in the kernel space. It filters the data to be published based on significant changes in the network through its event detection mechanism. DeepInsight Analytics has a modular architecture and runs on commodity servers. It executes the Barefoot SPRINT data plane telemetry, which consists of a P4 program (INT.p4) encompassing intelligent triggers. It also provides open northbound RESTful APIs that allow customers to integrate their third-party network management solutions. DeepInsight Analytics is advanced with respect to the features it provides (real-time anomaly detection, congestion analysis, packet-drop analysis, etc.). However, it is a closed-source solution and lacks reports of performance benchmarks.

Fig. 9 demonstrates the CPU efficiency of three INT collectors (IntMon, Prometheus INT exporter, and INTCollector) [60]. IntMon has the lowest throughput, and is 57 times slower than Prometheus INT exporter. INTCollector on the other hand has the highest throughput and is 27 times faster than Prometheus INT exporter.

Fig. 9. CPU efficiency with the three INT collectors. Source: INTCollector paper [60].

TABLE VI
INT COLLECTORS COMPARISON

| Collector | Name | Rate | Event detection | Processing acceleration | Open source | Historical data availability | Analytics | Implementation notes |
|---|---|---|---|---|---|---|---|---|
| [58] | IntMon | 0.1 Kpps | × | × | ✓ | × | Low | ONOS-BMv2 subsystem (ONOS 1.6) |
| [59] | Prometheus INT exporter | 6.4 Kpps | × | × | ✓ | × | Low | ONOS P4 Brigade project |
| [60] | INTCollector | 154.8 Kpps | ✓ | Yes; fast path with XDP | ✓ | ✓ | Medium | C language, XDP for in-kernel processing |
| [61] | DeepInsight | N/A | ✓ | N/A | × | ✓ | High | SPRINT data plane telemetry (INT.p4) |

C.4. Collectors in INT and Legacy Monitoring Schemes Comparison

Generally, collectors used with both INT and legacy monitoring schemes run on general-purpose CPUs, and hence have comparable performance. INT produces far more reports than legacy monitoring schemes (e.g., NetFlow), and therefore requires a collector with high processing capability. INT-based collectors are typically accelerated with in-kernel fast packet processing technologies (e.g., XDP) and kernel-bypass packet processing frameworks (e.g., Data Plane Development Kit (DPDK)).

D. Summary and Lessons Learned

Legacy telemetry tools and protocols are capable of neither capturing microbursts nor providing fine-grained telemetry measurements. INT was developed to address these challenges; it enables the data plane developer to query the internal state of switches with high precision. Telemetry data are then embedded into packets and forwarded to a high-performance collector. The collector typically performs analysis and applies actions accordingly (e.g., informs the control plane to update table entries). Current research efforts mainly focus on developing variations of INT to decrease its telemetry traffic overhead, considering the overhead-accuracy trade-off. Other works aim at accelerating INT collectors to handle large volumes of traffic (in the scale of Kpps). Future work could investigate further improvements for INT such as compressing packets' headers, broadening coverage and visibility, enriching the telemetry information, and simplifying the deployment.

VII. NETWORK PERFORMANCE

Measuring and improving network performance is critical in today's infrastructures. Low latency and high bandwidth are key requirements to operate modern applications that continuously generate enormous amounts of data [212]. Congestion control (CC), which aims at avoiding network overload, is critical to meet these requirements. Another important concept for expediting these applications is managing the queues that form in routers and switches through Active Queue Management (AQM) algorithms. This section explores the literature related to measuring and improving the performance of programmable networks.

A. Congestion Control (CC)

A.1. Background

One of the most challenging tasks in the Internet today is congestion control and collapse avoidance [213]. The difficulty in controlling the congestion is increasing due to factors such as high-speed links, traffic diversity and burstiness, and buffer sizes [63]. Today's CC algorithms aim at shortening delays, maximizing throughput, and improving the fairness and utilization of network resources.

A tremendous amount of research has been done on congestion control, including end-host algorithms such as loss-based CC algorithms (e.g., CUBIC [214], Hamilton TCP (HTCP) [215]), model-based algorithms (e.g., Bottleneck Bandwidth and Round-trip Time (BBR) [216]), congestion-signalling mechanisms (e.g., Explicit Congestion Notification (ECN) [217]), data-center specific schemes (e.g., TIMELY [218], Data Center Quantized Congestion Notification (DCQCN) [219], Data Center TCP (DCTCP) [220], pFabric [221], Performance-oriented Congestion Control (PCC) [222]), and application-specific schemes (e.g., QUIC [223]).

With the advent of programmable data plane switches, researchers are investigating new methods to provide network-assisted congestion feedback for end-hosts.

[Figure 10 labels: Sender; Receiver; Packet; INT; ACK; Adjust rate per ACK]
Fig. 10. HPCC: INT-based high precision congestion control.

A.2. Literature Review

Handley et al. [63] proposed NDP, a novel protocol architecture for datacenters that aims at achieving low completion latency for short flows and high throughput for longer flows. NDP avoids core network congestion by applying per-packet multipath load balancing, which comes at the cost of reordering. It also trims the payloads of packets, similar to what is done in Cut Payload (CP) [224], whenever the queues of the switches become saturated. Once the payload is trimmed, the headers are forwarded using high-priority queues. Consequently, a Negative ACK (NACK) is generated and sent through high-priority queues so that a retransmission is sent before draining the low-priority queue. Similarly, Feldmann et al. [66] proposed a method that uses network-assisted congestion feedback (NCF) in the form of NACKs generated entirely in the data plane. NACKs are sent to throttle elephant-flow senders in case of congestion. The method maintains three separate queues for mice flows, elephant flows, and control packets to ensure fair sharing of resources.

Li et al. [65] proposed High Precision Congestion Control (HPCC), a new CC mechanism that leverages INT-based data added by P4 switches to obtain precise link load information. HPCC computes an accurate flow rate by using only one rate update, as opposed to legacy approaches that require a large number of iterations to determine the rate. HPCC provides near-zero queueing, while being almost parameterless. Fig. 10 shows the mechanism of HPCC. The switches add INT headers to every packet, and then the INT information is piggybacked onto the TCP/RDMA Acknowledgement (ACK) packet. The end-hosts then use this information to adjust the sending rate through their smart Network Interface Controllers (NICs).

Kfoury et al. [67] proposed a P4-based method to automate end-hosts' TCP pacing. It supplies the bottleneck bandwidth and the number of elephant flows to senders so that they can pace their rates to safe targets, avoiding filling routers' buffers.

Turkovic et al. [64] proposed a P4-based method that reroutes flows to backup paths during congestion. The system detects congestion by continuously monitoring the queueing delays of latency-critical flows. The same authors [68] proposed a method that separates the senders based on their congestion control algorithm. Each congestion control algorithm uses a separate queue in order to enforce fairness among its competing flows.
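The way an end-host can turn per-hop INT data into a single rate update can be sketched as follows. The code loosely follows the window update described for HPCC [65], in which the sender normalizes each link's load by its bandwidth and base RTT and scales its window by the most loaded link; it omits details of the published algorithm (such as how updates are staged per RTT). The function names, the 95% target utilization, and the 1500-byte additive term are illustrative assumptions.

```python
def link_utilization(qlen_bytes, tx_rate_bps, bandwidth_bps, base_rtt_s):
    """Normalized load of one link: queued bytes plus bytes in flight."""
    bdp_bytes = bandwidth_bps * base_rtt_s / 8  # bandwidth-delay product
    return qlen_bytes / bdp_bytes + tx_rate_bps / bandwidth_bps


def update_window(window_bytes, hops, target_util=0.95, additive_bytes=1500):
    """Scale the window by the most loaded hop, then add a small increment."""
    max_util = max(link_utilization(**hop) for hop in hops)
    return window_bytes / (max_util / target_util) + additive_bytes


# Hypothetical INT records for a 100 Gbps link with a 10 microsecond base RTT.
busy = dict(qlen_bytes=0, tx_rate_bps=100e9, bandwidth_bps=100e9, base_rtt_s=10e-6)
congested = dict(qlen_bytes=125000, tx_rate_bps=100e9, bandwidth_bps=100e9, base_rtt_s=10e-6)
idle = dict(qlen_bytes=0, tx_rate_bps=47.5e9, bandwidth_bps=100e9, base_rtt_s=10e-6)

w_busy = update_window(95000, [busy])
w_congested = update_window(95000, [busy, congested])
w_idle = update_window(10000, [idle])
```

Because the feedback states how far the link is from the target utilization, one update is enough to move the window close to its final value: the window roughly halves when the normalized load is 2, and roughly doubles when the link is half as loaded as the target.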

A.3. CC Schemes Comparison, Discussions, and Limitations

Table VII compares the aforementioned CC schemes. NDP and NCF are similar in the sense that both use NACKs as congestion feedback. NDP avoids congestion by applying per-packet multipath load balancing. This approach works adequately with symmetric topologies, but fails when topologies are asymmetric (e.g., BCube, Jellyfish), especially during heavy network load. Another limitation of NDP is the excessive retransmissions produced by the server. NCF adopted the idea of packet trimming from NDP, but generates a NACK from the trimmed packet and sends it directly to the sender. This approach removes the receiver from the feedback loop, improving the sender's reaction time. One limitation of NCF is that it requires operators to manually tune some of the predefined parameters (e.g., threshold, queue size). Additionally, NCF might disclose network congestion information, making it less attractive to operators. Finally, the authors of NCF claim that the approach works with both datacenters and Internet-wide scenarios. However, no implementation results were presented to evaluate the effectiveness of the solution.

HPCC leverages INT data to control network congestion. It enhances the convergence time by using a Multiplicative-Increase Multiplicative-Decrease (MIMD) scheme. Note that previous TCP variants use Additive-Increase Multiplicative-Decrease (AIMD), which is conservative when increasing the rate, and hence has a slow convergence time. AIMD schemes are slow because they rely on single-bit congestion information (packet loss, ECN). With HPCC, end-hosts can perform an aggressive increase as INT metadata encompasses precise link utilization and timely queue statistics. HPCC demonstrated promising results with respect to latency, bandwidth, and convergence time. The authors however did not evaluate HPCC alongside conventional congestion control algorithms used in the Internet (e.g., CUBIC, BBR). Note that achieving inter-protocol fairness is essential for the solution to be adopted by operators.

The method in [67] uses TCP pacing. Pacing decreases throughput variations and traffic burstiness, and hence minimizes queuing delays. However, this method works well only in networks where the number of senders of large flows is small (e.g., in a science Demilitarized Zone (DMZ) [212]).

P4Air, which applies traffic separation, demonstrated significant improvements in fairness compared to contemporary solutions. However, it requires allocating a queue for each congestion control algorithm group (e.g., loss-based (Cubic), delay-based (TCP Vegas)). Note that the number of queues is limited in switches, and production networks often reserve them for other applications' QoS [65].

Note that some schemes require modifying the end-hosts (e.g., HPCC) while others are fully in-network (e.g., P4Air).

TABLE VII
CONGESTION CONTROL SCHEMES COMPARISON

| Scheme | Name | Strategy | Congestion feedback | Feedback information | Rerouting | Traffic separation | End-device modification | Implementation |
|---|---|---|---|---|---|---|---|---|
| [63] | NDP | Trim packets to headers and priority forward | ✓ | NACKs | ✓ | ✓ | ✓ | NetFPGA SUME |
| [64] | N/A | Monitor queue latency to reroute traffic on congestion | × | N/A | ✓ | × | × | BMv2 |
| [65] | HPCC | Use INT data to compute sending rate | ✓ | INT | × | × | ✓ | Tofino |
| [66] | NCF | Throttle elephant flows with NACKs | ✓ | NACKs | × | ✓ | × | N/A |
| [67] | N/A | Pace TCP traffic of elephant flows to safe targets | ✓ | Flow count and BW | × | × | ✓ | BMv2 |
| [68] | P4Air | Separate flows according to their congestion control group | × | N/A | × | ✓ | × | Tofino |

TABLE VIII
CONGESTION CONTROL SCHEMES: 1) PROGRAMMABLE SWITCHES (HPCC); 2) END-HOSTS; AND 3) LEGACY NETWORK-ASSISTED (ECN)

| Characteristic | Programmable switch | End-hosts | Legacy network-assisted (ECN) |
|---|---|---|---|
| Accuracy | Higher, INT-based, microbursts are detected and reported | Low, packet loss (e.g., CUBIC); Medium, estimated RTT and btlbw (e.g., BBR) | Lower with classic ECN; High with L4S |
| Required modifications | Switches, end-hosts | None; distributed nature of AIMD does not require storing state of flows | Minimal if ECN is used (most equipment have classic ECN implemented); High if L4S is used |
| Convergence | Faster (MIMD) | Slower (AIMD) | Adequate with ECN; Fast with L4S ECN |
| Queue utilization | Near-zero | High; possibility of Bufferbloat (e.g., CUBIC) | Low |
| Parameterization | Few | None | Few (e.g., thresholds) |
| Congestion information | Several fields (e.g., queue occupancy, link utilization, flow share, etc.) | Packets drop | 1-bit ECN mark |
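The convergence gap between additive and multiplicative increase discussed in this subsection can be seen with a short calculation. The sketch below counts how many update rounds a sender needs to grow from a low rate to a target rate under each policy; the rates, step size, and factor are arbitrary illustrative values, and the loop ignores decrease events entirely.

```python
def rounds_to_reach(target, start, step):
    """Count how many rate updates are needed to reach the target rate."""
    rate, rounds = start, 0
    while rate < target:
        rate = step(rate)
        rounds += 1
    return rounds


# Grow from 10 to 1000 rate units (units are arbitrary).
additive_rounds = rounds_to_reach(1000, 10, lambda rate: rate + 10)
multiplicative_rounds = rounds_to_reach(1000, 10, lambda rate: rate * 2)
```

Additive increase needs 99 rounds while multiplicative increase needs 7. A multiplicative increase is only safe when the sender knows how much headroom the link has, which is the information that INT metadata provides and a single congestion bit does not.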

A.4. End-hosts, Programmable Switches, and Legacy Devices' CC Schemes

Table VIII compares the CC schemes assisted by programmable switches (e.g., HPCC) with end-host CC algorithms (e.g., CUBIC) and legacy congestion signalling schemes (e.g., ECN). End-host CC algorithms infer congestion through packet drops and estimations (e.g., btlbw and Round-trip Time (RTT) estimation with BBR), which is not always sufficient to infer the existence of congestion. Legacy devices use classic ECN to signal congestion so that end-hosts slow down their transmission rates. Classic ECN is limited as it only marks a single bit to signal congestion, and the resulting feedback is neither aggressive nor immediate. Programmable switches on the other hand use fine-grained, prompt measurements to signal congestion (e.g., INT metadata), which results in higher detection accuracy, near-zero queueing delays, and faster convergence time. The distributed nature of end-host CC schemes allows them to operate without modifying the network infrastructure and without tweaking parameters. ECN-enabled devices and programmable switches on the other hand require a few parameters (e.g., marking threshold) to adapt to different network conditions.

B. Measurements

B.1. Background

Gaining an overall understanding of the network behavior is an increasingly complex task, especially when the size of the network is large and the bandwidth is high. Legacy measurement schemes have accuracy limitations since they rely on polling and sampling-based methods to gather traffic statistics. Typically, sampling methods sample sparsely (e.g., one in every 30,000 packets) and polling methods have large polling intervals. The literature [225] has shown that such methods are only suitable for coarse-grained visibility. The accuracy limitation of sampling and polling techniques hampers the development of measurement applications. For instance, it is not possible to accurately measure frequently changing TCP-specific fields such as congestion window, receive window, and sending rate.

Data streaming or sketching algorithms [226–230] were proposed to address the limitations of sampling and polling. They address the following problem: an algorithm is allowed to perform a constant number of passes over a data stream (input sequence of items) while using sub-linear space compared to the dataset and the dictionary sizes; desired statistical properties (e.g., median) of the data stream are then estimated by the algorithm. The main problem with such algorithms is that they are tightly coupled to the metrics of interest. This means that switch vendors should build specialized algorithms, data structures, and hardware for specific monitoring tasks. With the constraints of CPU and memory in networking devices, it is challenging to support a wide spectrum of monitoring tasks that satisfy all customers. Legacy devices also lack the capability of customizing the processing behavior so that switches cooperate in the measurement process.

With the emergence of programmable switches, it is now possible to perform fine-grained measurements in the data plane at line rate. Moreover, data structures such as sketches and Bloom filters can be easily implemented and customized for specific metrics of interest. Programmable switches pave the way for new areas of research in measurements since they not only provide flexibility in inspecting the traffic statistics with high accuracy, but also allow programmers to express reactive processing in real time (e.g., dropping a packet when a threshold is exceeded, as done in Random Early Detection (RED) [231]).

B.2. Literature Review

INT provides path-level metrics, with data similar to that of polling-based techniques. Note that the metrics themselves are fixed; for instance, it is possible to determine the flow-level latency, but not the latency variation (jitter) [71]. The fixed metrics of INT also prevent performing network-wide measurements; note that the INT standard specification document does not mention methods to aggregate metadata and perform complex analytics in the data plane.

This section focuses on techniques that provide measurements that go beyond the fixed metrics extracted from the internal state of the switch.

Generic Query-based Monitoring.

Operators constantly change their monitoring specifications. Adding new monitoring requirements on the fixed-function switching ASIC is expensive. Recent work explored the idea of providing a query-driven interface that allows operators to express their monitoring requirements. The queries can then be converted into switch programs (e.g., P4) to be deployed in the network. Alternatively, the queries can be executed on the control plane using the measured information extracted from the data plane.

A simplistic attempt is FlowRadar [69], a system that stores counters for all flows in the data plane with a low memory footprint, then exports them periodically (every 10 ms) to a remote collector. Liu et al. [70] proposed Universal Monitoring (UnivMon), an application-agnostic monitoring framework that provides accuracy and generality across a wide range of monitoring tasks. UnivMon benefits from the granularity of the data plane to improve accuracy and runs different estimation algorithms on the control plane. Narayana et al. [71] presented Marple, a query language based on common query constructs (i.e., map, filter, group by). Marple allows performing advanced aggregation (e.g., moving average of latencies) at line rate in the data plane. Similarly, Sonata [79] provides a unified query interface that uses common dataflow operators, and partitions each query across the stream processor and the data plane. PacketScope [85] also uses dataflow constructs but allows querying the internal switch processing, both in the ingress and the egress pipelines.

Many of the previous works use the sketch data structure. The work in [88] extended the sketching approach used in previous works to support the notion of time. The motivation of this work is that recently captured traffic trends are the most relevant in network monitoring. Huang et al. [89] proposed OmniMon, an architectural design that coordinates flow-level network telemetry operations between programmable switches, end-hosts, and controllers. Such coordination aims at achieving high accuracy while maintaining low resource overhead. Chen et al. [90] proposed BeauCoup, a P4-based measurement system that handles multiple heterogeneous queries in the data plane. It offers a general query abstraction that counts the attributes across related packets identified by keys, and flags packets that surpass a defined threshold.

Other approaches such as Elastic sketch [73] perform measurements that are adaptive to changes in network conditions (e.g., bandwidth, packet rate, and flow size distribution). *Flow [77] supports concurrent measurements and dynamic queries. This approach aims at minimizing the concurrency problems and the network disruption resulting from compiling excessive queries into the data plane. TurboFlow [78] aims at achieving high coverage without sacrificing information richness. Bai et al. [86] proposed FastFE, a system that performs traffic feature extraction by leveraging programmable data planes. Features are then used by ML techniques for traffic analysis and behavior detection.
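Since many of the systems above build on sketches, a minimal Count-Min sketch is shown below as one well-known example of the data structure. It is a generic software illustration and not the design of any specific surveyed system; on a switch the rows would typically map to register arrays indexed by hardware hash functions. The class name, the dimensions, and the use of salted BLAKE2b hashes are choices made here for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in depth x width cells (sub-linear space)."""

    def __init__(self, depth=4, width=1024):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One independent hash per row, derived by salting with the row number.
        digest = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([row])).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.rows[row][self._index(row, key)] for row in range(self.depth))


sketch = CountMinSketch()
sketch.update("10.0.0.1->10.0.0.2", 5)
only_flow = sketch.estimate("10.0.0.1->10.0.0.2")
for i in range(100):
    sketch.update(f"10.0.1.{i}->10.0.0.2")
after_noise = sketch.estimate("10.0.0.1->10.0.0.2")
```

The estimate never undercounts a flow; it can only overcount when other flows collide with it in every row. Memory is fixed by `depth * width` regardless of the number of flows, which is what makes the structure attractive for the limited memory of a switch pipeline.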

Performance Diagnosis Systems.

Recent works leverage programmable data planes to diagnose network performance. The main motivation here is that fine-grained information can be monitored at line rate, mitigating the slow reaction to “gray failures” experienced by end-host-based diagnosis in legacy approaches.

Ghasemi et al. [72] proposed Dapper, an in-network TCP performance diagnosis system. Dapper analyzes packets in real time, and identifies and pinpoints the root cause of the bottleneck (sender, network, or receiver). Blink [82] also diagnoses TCP-related issues. In particular, it detects failures in the data plane based on retransmissions and, consequently, reroutes traffic. Other approaches attempt to diagnose performance degradation manifested by an increase in latency. Wang et al. [84] proposed SpiderMon, a system that performs network-wide performance degradation diagnosis. The key idea is to have every switch maintain fine-grained telemetry data for a short period of time, and upon detecting performance degradation (e.g., increased delay), the information is offloaded to a collector. Liu et al. [81] proposed a memory-efficient approach for network performance monitoring. This solution only monitors the top-k problematic flows.

Queue and Other Metrics Measurement.

Programmable data planes allow querying the internal state of the queue with fine-grained visibility. Recent works leveraged this feature to provide better queueing information, which can be used by various applications (e.g., AQMs, congestion control). Chen et al. [80] proposed ConQuest, a P4-based queue measurement solution that determines the size of flows occupying the queue in real time, and identifies flows that are grabbing a significant portion of the queue. Joshi et al. [75] proposed BurstRadar, a system that uses programmable switches to monitor microbursts in the data plane. Microbursts are events of sporadic congestion that last for tens or hundreds of microseconds. Microbursts increase latency, jitter, and packet loss, especially when link speeds are high and switch buffers are small.

Other works enabled measuring further metrics. For instance, Ding et al. [83] proposed P4Entropy, an algorithm to estimate network traffic entropy (Shannon entropy) in the data plane. Tracking entropy is useful for calculating traffic distribution in order to understand the network behavior. Another example is the system proposed by Chen et al. [87], which passively measures the RTT of TCP traffic in ISP networks. RTT measurement is important for detecting spoofing and routing attacks, ensuring Service Level Agreement (SLA) compliance, measuring the Quality of Experience (QoE), improving congestion control, and many others.

TABLE IX: Measurements Schemes Comparison

Generic query-based monitoring

| Ref | Name | Core idea | Approx. | External computation | Data structure | Network wide | Platform (HW or SW) |
|---|---|---|---|---|---|---|---|
| [89] | OmniMon | Coordinates flow-level telemetry among devices | × | ✓ | Slots (bloom filter) | ✓ | ✓ |
| [79] | Sonata | Uses scalable stream processor | ✓ | ✓ | Sketch | × | ✓ |
| [69] | FlowRadar | Stores flow counters and periodically exports results | ✓ | ✓ | Bloom filter | ✓ | ✓ |
| [73] | ElasticSketch | Adapts to changing network conditions | ✓ | ✓ | Sketch | ✓ | ✓ |
| [71] | Marple | Aggregates based on "map, filter, group by" constructs | × | ✓ | Key-value store | ✓ | ✓ |
| [90] | BeauCoup | Enables simultaneous multiple distinct counting queries | ✓ | × | Coupon collector (bloom filter) | × | ✓ |
| [70] | UnivMon | Provides application-agnostic monitoring | ✓ | ✓ | Universal sketches | ✓ | ✓ |
| [77] | *Flow | Groups traffic in the switch and computes statistics in servers | × | ✓ | GPV (register arrays) | × | ✓ |
| [78] | TurboFlow | Produces fine-grained and unsampled flow records | × | ✓ | Hash table | × | ✓ |
| [88] | N/A | Enables time-aware monitoring | ✓ | ✓ | Time-aware sketch | × | ✓ |
| [85] | PacketScope | Monitors packets' lifecycle inside the switch | ✓ | ✓ | Key-value store (hash table) | × | ✓ |
| [86] | FastFE | Extracts traffic features for ML models | × | ✓ | Key-value store | × | ✓ |

Performance diagnosis systems

| Ref | Name | Core idea | Scope | Reactive processing | Measured information | Network wide | Platform (HW or SW) |
|---|---|---|---|---|---|---|---|
| [72] | Dapper | Diagnoses TCP performance issues in the data plane | Identifies TCP bottleneck | N/A | Flight size, MSS, sender's reaction time, loss, RTT, CWND, RWND | × | ✓ |
| [84] | SpiderMon | Diagnoses latency with small memory footprint | Identifies flows affecting latency | Limits rate | Queue latency | ✓ | ✓ |
| [82] | Blink | Detects failures based on the predictable behavior of TCP | Identifies retransmitters | Reroutes traffic | RTO-induced retransmissions | × | ✓ |
| [81] | N/A | Improves monitoring scalability by measuring a subset of flows | Identifies top-k influential flows | N/A | Retransmissions, latency, packet loss, out-of-order | × | ✓ |

Queue/other measurement

| Ref | Name | Core idea | Passive measurement | Analysis | Measured information | Data structure | Platform (HW or SW) |
|---|---|---|---|---|---|---|---|
| [80] | ConQuest | Identifies flows contributing heavily to the queue | ✓ | Data plane | Queue occupancy | Count-min sketch | ✓ |
| [87] | N/A | Measures the RTT of TCP traffic in ISP networks | ✓ | Data plane | RTT from an ISP vantage point | Hash table | ✓ |
| [75] | BurstRadar | Monitors microbursts and captures telemetry for the contributing packets | × | Control plane | Queue occupancy | Ring buffer | ✓ |
| [83] | P4Entropy | Estimates network traffic entropy | × | Data plane | Shannon entropy | Count-min sketch | ✓ |

B.3. Measurements Schemes Comparison, Discussions, and Limitations

Table IX compares the aforementioned measurement schemes.

Generic Query-based Monitoring.

Some schemes (e.g., Sonata, FlowRadar, UnivMon) approximate the metrics by using probabilistic data structures (e.g., sketches, bloom filters), sampling methods, and top-k counting. In addition, some focus on a subset of traffic by leveraging event matching techniques. Such techniques are primarily used to achieve high resource efficiency (i.e., low memory footprint), but cannot achieve full accuracy. On the other hand, systems like OmniMon carefully coordinate the collaboration among different types of entities in the network. Such coordination results in efficient resource utilization and full accuracy. OmniMon follows a split-merge strategy where the split operation decomposes telemetry operations into partial operations and schedules them among the entities (switches, end-hosts, and controller), and the merge operation coordinates the collaboration among these entities. The idea is to leverage the strengths of the data plane in the switches and end-hosts (i.e., per-flow measurements with high accuracy) and of the control plane (i.e., network-wide collaboration). OmniMon also ensures consistency through a synchronization mechanism and accountability through a system of linear equations that considers packet loss and other data center characteristics. Results show that OmniMon reduces the memory by 33%-96% and the number of actions by 66%-90% when compared to state-of-the-art solutions.

Another criterion that differentiates the measurement schemes is whether computations are performed outside the data plane. Most of the systems use the control plane or external servers to perform complex computations since the data plane has limited support for complex arithmetic functions. While some systems (e.g., BeauCoup) do not require an external computation device, they often support fewer measurement operations.

The selection of the data structure used in the data plane strongly affects the measurement features supported by a scheme. For instance, the goal of BeauCoup is to enable simultaneous distinct counting queries; for this task, the authors based their design on the coupon-collection problem [232], which computes the number of random draws from n coupons such that all coupons are drawn at least once. For example, if the threshold of distinct destination IPs for detecting superspreaders is 130, instead of recording all distinct destination IPs, 32 coupons are defined (collecting all of n = 32 equally likely coupons takes n·H_n ≈ 130 draws on average, where H_n is the n-th harmonic number). Consequently, the destination IPs of incoming packets are mapped to those 32 coupons. While this data structure uses less memory than other state-of-the-art measurement sketches, it is limited to specific objectives (distinct counting). Other works (e.g., UnivMon) focused on generalizing the measurement scenarios and hence used universal sketches as data structures.

Qiu et al. [88] focused on capturing the traffic trends that are the most relevant in network monitoring and attack detection. The notion of time is not supported by native streaming algorithms. For instance, the count-min sketch, a data structure that uses a constant amount of memory to record data, is oblivious to the passage of time. Existing solutions that consider recency are easily implemented in software, but not on programmable ASICs. For example, resetting a sketch after a timer expires requires iterating over the elements in the sketch, an operation that cannot be implemented in the data plane due to the lack of loops. Likewise, creating multiple sketches requires additional stages, which are limited in hardware. Time-adaptive sketches utilize the idea of Dolby noise reduction [233, 234]; a pre-emphasis function inflates the update when a new key is inserted and a de-emphasis function restores the original value. This mechanism ages old events over time and therefore improves the accuracy of recent events.
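The pre-emphasis/de-emphasis idea can be illustrated with a small Python model. This is a simplified sketch under stated assumptions, not the design of [88]: the emphasis function is assumed to be f(t) = 2^t over an integer epoch t, only the key (not the epoch) is hashed, and the class and parameter names are hypothetical.

```python
import hashlib


class TimeAdaptiveSketch:
    """Count-min sketch whose updates are inflated by an assumed f(t) = 2**t."""

    def __init__(self, depth=3, width=256):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, key, epoch, count=1):
        # Pre-emphasis: a left shift multiplies the increment by 2**epoch.
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count << epoch

    def query(self, key, epoch):
        # De-emphasis: a right shift divides by 2**epoch, which restores counts
        # inserted at `epoch` and shrinks whatever older epochs left behind.
        smallest = min(self.rows[row][self._index(row, key)] for row in range(self.depth))
        return smallest >> epoch


# A single shared counter forces a collision so the aging effect is visible.
tiny = TimeAdaptiveSketch(depth=1, width=1)
tiny.update("old-flow", epoch=0, count=100)
tiny.update("new-flow", epoch=4, count=3)
recent_estimate = tiny.query("new-flow", epoch=4)  # (100 + 3 * 16) >> 4 == 9
```

In this forced-collision case a plain count-min sketch would report 103 for the recent flow, whereas the de-emphasized estimate is 9 against a true count of 3: the 100 stale packets contribute an error of only 6. Both operations are shifts, and no loop over the sketch is needed to age old entries.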
The authors implemented the pre-emphasis function in the data plane using simple bit shifts, and the de-emphasis function in the control plane.

Finally, some systems considered network-wide monitoring, while others restricted their capabilities to local per-switch measurements. Network-wide measurement is essential and can significantly improve the visibility of traffic, as discussed in Section XIII-D.

Performance Diagnosis Systems.

Some performance diagnosis schemes restricted their scope to troubleshooting TCP. For instance, Dapper infers sending rate, Maximum Segment Size (MSS), sender's reaction time (time between a received ACK and a new transmission), loss rate, latency, congestion window (CWND), receiver window (RWND), and delayed ACKs. Based on the inferred variables, Dapper can identify the root cause of the bottleneck. Similarly, the authors in [81] monitored conditions such as retransmissions, packet loss, round-trip time, and out-of-order packets to identify the top-k problematic flows. Furthermore, Blink detects failures based on the predictable behavior of TCP, which, in the presence of a failure, retransmits packets at epochs exponentially spaced in time. Other schemes (e.g., SpiderMon) identify failures based on the increase of latency.

Some schemes use reactive processing to mitigate the network performance issue. For instance, Blink promptly reroutes traffic whenever failure signals are generated by the data plane, while SpiderMon limits the sending rate of the root-cause hosts.

Finally, it is worth mentioning that some systems (e.g., Blink, Dapper) considered traces from real-world captures, such as the ones provided by CAIDA, for evaluation. Using real-world traces gives more credibility to the proposed solution.
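The exponentially spaced retransmission epochs mentioned for Blink follow from the standard TCP behavior of doubling the retransmission timeout (RTO) after each unacknowledged retransmission. The short Python sketch below computes when those retransmissions would occur; the function name and the 200 ms initial RTO are assumptions for illustration, and real stacks also cap the RTO.

```python
def retransmission_offsets(initial_rto_ms, attempts):
    """Times (ms after the original send) at which successive retransmissions fire."""
    offsets = []
    elapsed = 0
    rto = initial_rto_ms
    for _ in range(attempts):
        elapsed += rto        # the timer expires and the segment is resent
        offsets.append(elapsed)
        rto *= 2              # exponential backoff: the next timeout is doubled
    return offsets


epochs = retransmission_offsets(initial_rto_ms=200, attempts=4)  # [200, 600, 1400, 3000]
```

Flows that lose connectivity at the same instant and share a similar RTO retransmit at nearly the same offsets, so a link failure shows up in the data plane as a burst of retransmissions across many flows.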

Queue and other Metrics Measurement.

Understanding the occupancy of the queue is useful for use cases such as mitigating congestion-based attacks, avoiding conflicting workloads, implementing new AQMs, optimizing switch configurations, debugging switch implementations, and off-path monitoring of queues in legacy devices. ConQuest performs queue measurements and identifies flows depending on the purpose (e.g., detecting bursty connections). It maintains compact snapshots of the queue, updated on each incoming packet. The snapshots are then aggregated in a round-robin fashion to approximate the queue occupancy. Afterwards, it cleans the previous snapshots to reuse them for further packets. Similarly, BurstRadar detects microbursts, which can increase latency, jitter, and packet loss, especially when link speeds are high and switch buffers are small. It is almost impossible to detect microbursts in legacy switches, which use sampling and polling-based techniques. BurstRadar detects microbursts and captures a snapshot of the telemetry information of all the involved packets. Afterwards, an analysis is conducted on the snapshot to identify the microburst-contributing flows and the burst characteristics. Note that BurstRadar does not support measuring the queues of legacy devices passively, but ConQuest does. In addition, BurstRadar performs the analysis on the control plane, while ConQuest uses the data plane for analysis.
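The capture-then-analyze pattern described for BurstRadar can be sketched in a few lines of Python. This is a hedged illustration rather than the system's algorithm: the threshold rule, the `(flow_id, queue_depth)` record format, and both function names are assumptions; only the use of a ring buffer is taken from Table IX.

```python
from collections import Counter, deque


def capture_burst_packets(packets, depth_threshold, capacity=1024):
    """Data-plane side: keep telemetry only for packets seen while the queue is deep.

    `packets` is an iterable of (flow_id, queue_depth) records (hypothetical format).
    """
    ring = deque(maxlen=capacity)  # bounded buffer; oldest entries are overwritten
    for flow_id, queue_depth in packets:
        if queue_depth >= depth_threshold:
            ring.append((flow_id, queue_depth))
    return ring


def top_contributors(ring, n=1):
    """Control-plane side: rank flows by how many captured packets they own."""
    return Counter(flow_id for flow_id, _ in ring).most_common(n)


trace = [("a", 2), ("b", 30), ("b", 40), ("c", 35), ("a", 1)]
snapshot = capture_burst_packets(trace, depth_threshold=25)
culprits = top_contributors(snapshot)  # [("b", 2)]
```

Filtering by queue depth keeps the exported volume proportional to the burst rather than to total traffic, which is what makes per-packet capture affordable.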

B.4. In-Network versus Legacy Measurements

Fig. 11 compares legacy measurements to those conducted on programmable switches. There are two main classes of legacy measurement techniques. First, there are techniques that rely on polling and sampling (e.g., NetFlow). The differences between in-network measurements and polling/sampling-based schemes are closely related to the differences between legacy measurements and INT (see Table V). For instance, the granularity of the measurements conducted in the data plane is much higher than that of traditional measurements (e.g., NetFlow). Further, it is not possible to conduct event-based monitoring in legacy approaches, whereas with in-network measurements, the programmer has the flexibility of customizing the monitoring based on conditions and thresholds. Second, there are techniques that rely on sketching or streaming algorithms to estimate the metric of interest. Such methods are tightly coupled with the metric, which forces hardware vendors to invest time and effort in building customized algorithms and data structures that might not be used by various customers. Moreover, with the constraints of routers and switches, it is not possible to implement a variety of monitoring tasks while still supporting the standard routing/switching functionalities. Therefore, such approaches are not scalable in the long run.

With programmable switches, it is possible to customize the monitoring tasks by implementing customized sketching/streaming algorithms as P4 programs. This advantage improves scalability as the operator can always modify the algorithms whenever needed.

Fig. 11. (a) Traditional measurements with sampling/polling. The switch uses sampling and polling protocols (e.g., NetFlow, SNMP) to generate fixed network flow records. Instead of collecting every packet, sampling collects only one out of every N packets. Records are then exported to an external server for further analysis. (b) Measurements with programmable switches (e.g., UnivMon [70]). The switch runs a universal algorithm over a universal data structure (e.g., universal sketch). The control plane then estimates a wide range of metrics for various applications. Note that this is not the only design possible for measurement tasks with programmable switches. The programmer has the flexibility to use customized algorithms that run at line rate in the data plane. Such algorithms can leverage various data structures in the P4 program (e.g., sketch, bloom filter) to store flow statistics. The switch then pushes statistics reports to the control plane for further analysis and reactive processing.

C. Active Queue Management (AQM)

C.1. Background

A fundamental component in network devices is the queue, which temporarily buffers packets. As data traffic is inherently bursty, routers have been provisioned with large queues to absorb this burstiness and to maintain high link utilization. The majority of delays encountered in a communication session are a result of large backlogs formed in queues. Legacy devices are limited in the visibility of the queue as they provide little or no insight about which flows are occupying or sharing the queue [80]. Consequently, researchers have been investigating queue management algorithms to shorten the delay and mitigate packet losses, while providing fairness among flows.

AQM is a set of algorithms designed to shorten the queueing delay by preventing buffers on devices from becoming full. The undesirable latency that results from a device buffering too much data is known as "Bufferbloat". Bufferbloat not only increases the end-to-end delay, but also decreases the throughput and increases the jitter of a communication session. Modern AQMs help in mitigating the bufferbloat problem [235–238]. Unfortunately, modern AQMs are typically not available in state-of-the-art network equipment; for instance, the Controlled Delay (CoDel) AQM, which was proposed in 2013 and was proven in the literature to be effective in mitigating Bufferbloat [239], is still not available in most network equipment. With programmable switches, it is now possible to implement AQMs as P4 programs, which not only accelerates support for new AQMs, but also provides means to customize their parameters programmatically in response to network traffic. Moreover, programmable switches foster innovation on newer AQMs that can be easily implemented and rapidly tested.
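To make concrete the kind of logic an AQM such as CoDel applies at dequeue time, the following Python model sketches its core decision. It is a deliberately simplified illustration, not the P4 implementation of any surveyed work: it keeps the 5 ms target and 100 ms interval commonly used as CoDel defaults and the interval/sqrt(count) control law, but omits refinements of the full algorithm (for example, reusing the drop count across nearby dropping episodes). The class and method names are hypothetical.

```python
import math


class SimplifiedCoDel:
    """Per-packet drop decision driven by queue sojourn time (milliseconds)."""

    def __init__(self, target_ms=5.0, interval_ms=100.0):
        self.target_ms = target_ms
        self.interval_ms = interval_ms
        self.first_above_time = None  # deadline after which high delay counts as persistent
        self.dropping = False
        self.drop_next = 0.0
        self.count = 0

    def _delay_persistently_high(self, sojourn_ms, now_ms):
        if sojourn_ms < self.target_ms:
            self.first_above_time = None          # delay recovered; reset the deadline
            return False
        if self.first_above_time is None:
            self.first_above_time = now_ms + self.interval_ms
            return False
        return now_ms >= self.first_above_time

    def should_drop(self, sojourn_ms, now_ms):
        high = self._delay_persistently_high(sojourn_ms, now_ms)
        if self.dropping:
            if not high:
                self.dropping = False             # queue drained; leave the dropping state
                return False
            if now_ms >= self.drop_next:
                self.count += 1
                # Control law: successive drops get closer together.
                self.drop_next += self.interval_ms / math.sqrt(self.count)
                return True
            return False
        if high:
            self.dropping = True
            self.count = 1
            self.drop_next = now_ms + self.interval_ms
            return True
        return False


aqm = SimplifiedCoDel()
# One packet dequeued every 10 ms, each having waited 20 ms in the queue.
drop_times = [t for t in range(0, 300, 10) if aqm.should_drop(20.0, float(t))]
```

The state is four scalars and the per-packet work is a handful of comparisons and additions; the square root is the one operation that typically needs an approximation (such as a lookup table) on a switch pipeline.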

C.2. Literature Review

Kundel et al. [91] implemented the CoDel queueing discipline on a programmable switch. CoDel eliminates Bufferbloat, even in the presence of large buffers [240]. Sharma et al. [92] proposed Approximate Fair Queueing (AFQ), a mechanism built on top of programmable switches that approximates fair queueing at line rate. Fair Queueing (FQ) aims at fairly dividing the bandwidth allocation among active flows. Laki et al. [93] described an AQM evaluation testbed with P4 in a demo paper. The authors tested the framework with two AQMs: Proportional Integral Controller Enhanced (PIE) and RED. Mushtaq et al. [241] approximated Shortest Remaining Processing Time (SRPT). Papagianni et al. [94] implemented the Proportional Integral PI2 AQM on a programmable switch. PI2 is an extension of the PIE AQM to support coexistence between classic and scalable congestion controls in the public Internet. Kumazoe et al. [95] implemented the MTQ/QTL scheme on P4.

C.3. AQM Schemes Comparison, Discussions, and Limitations

Table X compares the aforementioned AQM schemes. Some schemes require tuning a number of parameters and thresholds so that they operate well in certain network conditions. It is worth mentioning that a scheme becomes hard to manage and less autonomous when the number of parameters and thresholds is high.

TABLE X: AQM Schemes Comparison

| Scheme | Name | Idea | Params & thresholds | Multiple queues | Data structure | Implementation |
|---|---|---|---|---|---|---|
| [91] | P4-CoDel | Implementation of CoDel on P4 | 2 | × | Registers | BMv2 |
| [92] | AFQ | Approximate fair queueing in the switch | 4 | ✓ | Count-min sketch | Cavium OCTEON |
| [93] | N/A | Evaluation testbed for PIE and RED | RED 1, PIE 5 | × | Registers | BMv2 |
| [94] | PI2 for P4 | Implementation of PI2 on P4 | 3 | × | Registers | BMv2 |
| [95] | MTQ/QTL | Implementation of MTQ/QTL on P4 | 3 | × | Registers | BMv2 |

Some schemes are simple to implement in the data plane. CoDel's algorithm can be easily expressed in the data plane as it consists of comparisons, counting, basic arithmetic, and dropping packets. Similarly, PI2 is simple to implement as it is mostly based on basic bit manipulations. FQ algorithms, on the other hand, are difficult to implement in hardware as they require complex flow classification, per-packet scheduling, and buffer allocation. Such requirements make FQ algorithms expensive to implement on high-speed devices. AFQ aims at approximating fair queueing by using programmable switches' features such as mutating switch state, performing basic calculations, and selecting the egress queue of a packet. AFQ's operations can be summarized as follows: 1) per-flow state, which includes the number and timing information of the previous packet pertaining to that flow, is approximated; 2) the position of each packet in the output schedule is determined; 3) the egress queue to use is selected; and 4) the packet is dequeued based on the approximate sorted order. Note that AFQ uses a probabilistic data structure (count-min sketch) since it only approximates the states, and uses multiple queues in its implementation.

C.4. AQMs on Programmable Switches and Fixed-function Devices

Inventing novel AQMs that control queueing delay, mitigate bufferbloat, and achieve fairness under different network conditions (e.g., short/long RTTs, lossy networks, WANs) is an active research area. Typically, new AQMs are implemented and tested in software (e.g., as a Linux queueing discipline (qdisc) used with traffic control (tc)), which is limiting when the objective is to deploy the AQMs on production networks. With programmable switches, AQMs are implemented in P4 programs, which fosters innovation and enhances testing with production networks. Additionally, operators can create their own customized AQMs that perform efficiently with their typical network traffic. Historically, deploying AQMs on network devices has been a lengthy and costly process; once an effective AQM is published and thoroughly tested, equipment vendors start investigating whether it is feasible to implement it on future devices. Such a process might take years to finish, and by then, new network conditions evolve, requiring new AQMs. With programmable switches, this process is cost-efficient and relatively fast (it can be completed in weeks). Table XI compares the features of AQMs on programmable switches versus fixed-function devices.

TABLE XI: AQMs on Programmable and Fixed-function Switches

| Feature | Programmable switches | Fixed-function devices |
|---|---|---|
| Innovation | Higher; new AQMs are expressed in P4 programs | Lower; only developed by equipment vendors |
| Exclusivity | Higher; operators can implement their own custom AQMs without disclosing technical information | Lower; most supported AQMs are standards |
| Readiness | Faster (weeks to months); once an AQM is expressed in P4, it can be immediately available | Slower (years) |
| Cost | Lower | Higher |
| Tweakable | Higher; even standard AQMs can be customized and tweaked based on network traffic | Lower; only through parameters |

D. Quality of Service and Traffic Management

D.1. Background

Meeting diverse Quality of Service (QoS) requirements is a fundamental challenge in today's networks. Traffic Management (TM) provides access control that guarantees that the traffic admitted to the network conforms to the defined QoS specifications. TM often regulates the rate of a flow by applying traffic policing. The new generation of programmable switches facilitates traffic policing and differentiation by allowing network operators to express their logic in a programming language (P4). This section explores the works on programmable switches that involve QoS and TM.

D.2. Literature Review

Bhat et al. [96] described a system where programmable switches route traffic intelligently by inspecting application headers (layer-5) to improve users' QoE. Lee et al. [97] implemented a traffic meter based on Multi-Color Markers (MCM) on programmable switches to support multi-tenancy environments. Tokmakov et al. [98] proposed RL-SP-DRR, a traffic management system that combines Rate-limited Strict Priority (RL-SP) and Deficit Round-Robin (DRR) to achieve low latency and fair scheduling while improving link utilisation, prioritization, and scalability. Chen et al. [99] proposed a bandwidth manager for end-to-end QoS provisioning using programmable switches. The system classifies packets into different categories based on their QoS demands and usages, and uses a two-level queue when prioritizing.

TABLE XII: QoS/TM Schemes Comparison

| Ref | Idea | Input | Multiple queues | Platform: HW | Platform: SW |
|---|---|---|---|---|---|
| [96] | Application-layer headers inspection | Layer-5 headers | × | | ✓ |
| [97] | MCM-based traffic meter | Traffic rate, VN ID | × | | ✓ |
| [98] | Traffic mgmt. (RL-SP and DRR) | Traffic rate | ✓ | | ✓ |
| [99] | BW manager for e2e QoS | Flow ID, min/max rate | ✓ | ✓ | |

D.3. QoS/TM Schemes Comparison, Discussions, and Limitations

Table XII compares the QoS/TM schemes. The main idea in [96] is to translate application-layer header information into link-layer headers (Q-in-Q 802.1ad) for the core network in order to perform QoS routing and provisioning. The authors adopted Adaptive Bit Rate (ABR) video streaming as a use case to showcase the QoS improvements and the flexibility of traffic management. Such an approach is interesting since switches are inspecting higher layers in the protocol stack. This capability is not available in non-programmable devices. Note, however, that the solution was only implemented on a software switch (BMv2). On hardware switches, the solution might face challenges to run at line rate when processing L5 headers. Therefore, the authors left the hardware implementation as future work.

The other approaches considered traffic rates as inputs rather than inspecting application-layer headers. The work in [97] focused on isolating virtual networks (VNs). A VN has to have its own dedicated bandwidth (i.e., other networks' traffic should not impact the bandwidth) and should be able to differentiate priorities in order to provide QoS for its flows. While the solution was not implemented on hardware (the authors left the hardware implementation as future work), it is worth noting that this system relies on metering primitives which are available in today's hardware targets (e.g., meters in Tofino). Similarly, [98] was only implemented on a software switch (BMv2) and was evaluated by comparison against standard priority-based and best-effort scheduling. This system uses multiple priority queues, a feature supported in hardware targets. Therefore, the system could be implemented on hardware switches. The approach in [99] aims at limiting the maximum allowed rate and at maximizing bandwidth utilization. This is the only work that was implemented on a hardware switch (Tofino), and its design was compared against approaches based on OpenFlow.
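The metering primitives referred to above are commonly built on token buckets. The Python sketch below models a single-rate, two-color meter as an illustration; it is an assumption-laden simplification (the multi-color markers used in [97] add further rates and colors), and the class name, parameters, and color labels are hypothetical.

```python
class TokenBucketMeter:
    """Single-rate, two-color meter: conforming packets are green, the rest red."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bytes_per_s = rate_bps / 8
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)  # the bucket starts full
        self.last_time_s = 0.0

    def color(self, now_s, packet_bytes):
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now_s - self.last_time_s
        self.last_time_s = now_s
        self.tokens = min(self.burst_bytes, self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return "green"
        return "red"


meter = TokenBucketMeter(rate_bps=8000, burst_bytes=1500)  # 1000 bytes per second
colors = [
    meter.color(0.0, 1000),  # fits in the initial burst
    meter.color(0.0, 1000),  # only 500 tokens remain
    meter.color(1.0, 1000),  # one second of refill restores enough tokens
]
```

A policer would drop or remark the red packets; giving each virtual network its own meter instance is one way to keep tenants from consuming each other's bandwidth.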

D.4. Comparison of QoS/TM between Legacy and Programmable Networks

The ability to perform QoS-based traffic management in legacy networks is restricted to algorithms that consider standard header fields (e.g., differentiated services [242]). On the other hand, programmable switches can parse, modify, and process customized protocols. Hence, operators now have the ability to perform TM by inspecting custom header fields. Moreover, it is possible to extract, with high granularity and at line rate, metadata pertaining to the state of the switch (e.g., queue occupancy, packet sojourn time). Such information can significantly help switches make better decisions while performing traffic management.

E. Multicast

E.1. Background

Multicast routing enables a source node to send a copy of a packet to a group of nodes. Multicast uses in-network traffic replication to ensure that at most a single copy of a packet traverses each link of the multicast tree. Perhaps the most widely deployed multicast routing protocol in traditional networks is the Protocol-Independent Multicast (PIM) protocol [243]. PIM and other multicast routing protocols require a signaling protocol such as the Internet Group Management Protocol (IGMP) [244] to create, change, and tear down the multicast tree. Traditional multicast presents some challenges. For example, it is not suitable for environments where multicast group members constantly move (e.g., virtual machine migration and allocation). In such cases, the multicast tree must be updated dynamically, which may require substantial time and overhead. Also, some routers support a limited number of group-table entries, which does not scale in environments such as datacenters. Additionally, the signaling protocol and multicast algorithm are hard coded in the router, which reduces flexibility in building and managing the tree. Finally, it is not possible to implement multicast based on non-standard header fields.

E.2. Literature Review

Shahbaz et al. [100] presented ELMO, a multicast scheme based on programmable P4 switches for datacenter applications. ELMO encodes the multicast tree in the packet header, as opposed to maintaining group-table entries inside routers. Kadosh et al. [101] implemented ELMO using a hybrid dataplane with programmable and non-programmable elements. ELMO is intended for multi-tenant datacenter applications requiring high scalability. Braun et al. [102] presented an implementation of the Bit Index Explicit Replication (BIER) architecture [245] with extensions for traffic engineering. Similar to ELMO, BIER removes the per-multicast group state information from switches by adding a BIER header, which is used to forward packets. BIER does not require a signaling protocol for building, managing, and tearing down trees.

E.3. Multicast Schemes Comparison, Discussions, and Limitations

Table XIII compares the aforementioned multicast schemes. Both ELMO and BIER are source-routed multicast schemes. In BIER, group members are encoded as bit strings, which are then inspected by switches to identify the output ports. Such a scheme requires heavy processing on the switch, hampering execution at line rate. Consequently, the authors only implemented BIER on a software switch (BMv2). ELMO, on the other hand, has no restrictions on the group and network sizes, and was implemented on a hardware switch, running at line rate.

TABLE XIII: Multicast Schemes Comparison (source: [100])

| Scheme | Name | Group size limit | Network size limit | Heavy processing | Platform: HW | Platform: SW |
|---|---|---|---|---|---|---|
| [100] | ELMO | None | None | × | ✓ | |
| [102] | BIER | 2.6K | 2.6K | ✓ | | ✓ |

E.4. Comparison of P4-based and Traditional Multicast

Table XIV compares P4-based multicast and traditional multicast. The main advantages of implementing multicast routing with programmable P4 switches are: i) the group membership is encoded in the packet itself, which permits the creation of arbitrary multicast trees based on the application. For example, a multicast tree for device software updates may prioritize bandwidth over latency, while one for media traffic may prioritize latency; ii) switches do not need to store per-group state information, although tables can be customized and used in conjunction with the tree encoded in the packet header; iii) groups can be reconfigured easily by changing the information in the header of the packet; and iv) the elimination of the signaling protocol to build, manage, and tear down the tree results in considerable simplification and flexibility for the operator.
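Advantage i) can be illustrated with a BIER-style bit string, where each bit position stands for one receiver. The Python sketch below is a minimal model under stated assumptions: the port names, the bit assignment, and the `neighbor_masks` table are hypothetical, and the per-neighbor masking follows the general idea of bit-indexed replication rather than the exact P4 implementation in [102].

```python
def replicate(bitstring, neighbor_masks):
    """Return {next_hop: bitstring_for_that_copy} for one incoming packet.

    `bitstring` has bit i set if receiver i must get the packet.
    `neighbor_masks` maps each next hop to the receivers reachable through it.
    """
    copies = {}
    remaining = bitstring
    for next_hop, mask in neighbor_masks.items():
        served = remaining & mask
        if served:
            copies[next_hop] = served  # the copy only carries receivers behind this hop
            remaining &= ~mask         # never serve the same receiver twice
    return copies


# Receivers 0 and 1 sit behind port1, receivers 2 and 3 behind port2.
masks = {"port1": 0b0011, "port2": 0b1100}
copies = replicate(0b1011, masks)  # the packet is addressed to receivers 0, 1, and 3
```

The switch holds one mask per neighbor regardless of how many groups exist, and changing the group only means changing bits in the header, which matches advantages ii) and iii) above.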

F. Summary and Lessons Learned

Performing network-wide monitoring and measurements is of utmost importance for network operators to diagnose performance degradation. A wide range of research efforts harness streaming methods that utilize various data structures (e.g., sketches, bloom filters) and approximation algorithms. Further, the majority of measurement works provide a query-based language to specify the monitoring tasks. Future measurement works should consider generalizing the monitoring jobs, reducing storage requirements, managing the accuracy-memory trade-off, extending monitoring primitives, minimizing controller intervention, and optimizing the placement of switches in a legacy network. Another line of research aims at combating congestion and reducing packet losses by analyzing measurements collected in the data plane and by applying queue management policies. Congestion control is enhanced by adopting techniques such as throttling senders, cutting payloads, enforcing sending rates by leveraging telemetry data, and separating traffic into different queues. Furthermore, a handful of works are investigating methods to improve QoS by applying traffic policing and management. Techniques adopted include application-layer inspection, traffic metering, traffic separation, and bandwidth management. Finally, the scalability concerns of multicast in legacy networks are being mitigated with programmable switches. Recent efforts proposed encoding multicast trees into the headers of packets, and using programmable switches to parse these headers and to determine the multicast groups. Future endeavours should investigate incremental deployment (i.e., interworking with legacy multicast schemes) and reliability enhancement (e.g., by adopting layering protocols such as Pragmatic General Multicast (PGM) and Scalable Reliable Multicast (SRM)).

TABLE XIV: Comparison between P4-based and Traditional Multicast

| Feature | P4-based multicast | Traditional multicast |
|---|---|---|
| Scalability | High; no state information required in switches | Low; per-group state information required in switches |
| Tree management | Flexible; custom multicast algorithm and features can be implemented | Inflexible; signaling protocol required and hard coded in the switch |
| Packet overhead | High; multicast tree is encoded in packet header | No packet overhead |
| Dynamic tree updates | Easy; packet header carries update information | Complex; topology changes may trigger time-consuming tree changes |
| IP address constraint | Flexible; switch can multicast packets independently of the type of IP address | Fixed; switch is hard-coded to only multicast packets with destination IP address in the range 224.0.0.0 - 239.255.255.255 |

VIII. MIDDLEBOX FUNCTIONS

RFC 3234 [246] defines a middlebox as a device that performs functions other than the standard functions of an IP router between a source and a destination host. In legacy devices, middlebox functions are designed and implemented by manufacturers. Hence, they are limited in the functionalities they provide, and typically include standard well-known functions (e.g., NAT, protocol converters (6to4/4to6), etc.). To overcome this limitation, the trend moved towards implementing middleboxes in x86-based servers and in data centers as Network Function Virtualization (NFV). While this shift accelerated innovation and introduced a wide range of new applications, there were some performance implications resulting from operating systems' scheduling delays, interrupt processing latency, pre-emptions, and other low-level OS functions. Since programmable switches offer the flexibility of inspecting and modifying packets' headers based on custom logic, they are excellent candidates for enabling middlebox functions, while operating at line rate without performance implications.

A. Load Balancing

A.1. Background

A cloud data center, such as a Google or Facebook data center, provides many applications concurrently, such as email and video applications. To support requests from external clients, each application is associated with a publicly visible IP address to which clients send their requests and from which they receive responses. This IP address is referred to as the Virtual IP (VIP) address. The external requests are then directed to a software load balancer whose task is to distribute requests to the servers, balancing the load across them. The load balancer is also referred to as a layer-4 load balancer because it makes decisions based on the 5-tuple: source IP address and port, destination IP address and port, and transport-layer protocol. This state information is stored in a connection table containing the 5-tuple and the Direct IP (DIP) address of the server serving that connection. State information is needed to avoid disruptions caused by changes in the DIP pool (e.g., server failures, addition of new servers). The load balancer also provides a translation functionality, translating the VIP to the internal DIP, and then translating back for packets traveling in the reverse direction back to the clients. The traditional software-based load balancer is illustrated in Fig. 12(a).

Fig. 12. (a) Traditional software-based load balancing. (b) Load balancing system implemented by a programmable switch.

TABLE XV
LOAD BALANCING SCHEMES COMPARISON

| Scheme | Name | Stateful | Centralized | Active probing | MPTCP support | Failure handling | Platform (HW / SW) |
|---|---|---|---|---|---|---|---|
| [103] | HULA | ✓ | × | ✓ | × | ✓ | ✓ |
| [104] | SilkRoad | ✓ | × | × | × | ✓ | ✓ |
| [105] | MP-HULA | ✓ | × | ✓ | ✓ | ✓ | ✓ |
| [106] | Beamer | × | ✓ | × | ✓ | ✓ | ✓ ✓ |
| [108] | Dash | ✓ | × | ✓ | ✓ | × | ✓ |
| [109] | Contra | ✓ | × | ✓ | × | ✓ | ✓ |
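The connection-table behavior described in this background can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design of any surveyed system: the class name `ConnectionTableLB`, the hash-based choice of DIP for new connections, and the example addresses are all hypothetical. It shows why stored state matters: once a 5-tuple is mapped to a DIP, later changes to the DIP pool do not move the connection.

```python
import hashlib
from typing import Dict, List, Tuple

# (source IP, source port, destination IP i.e. the VIP, destination port, protocol)
FiveTuple = Tuple[str, int, str, int, str]


class ConnectionTableLB:
    """Hypothetical layer-4 load balancer that pins each 5-tuple to one DIP."""

    def __init__(self, dips: List[str]) -> None:
        self.dips = list(dips)                      # current DIP pool
        self.conn_table: Dict[FiveTuple, str] = {}  # 5-tuple -> DIP

    def _pick_dip(self, flow: FiveTuple) -> str:
        # Assumption: new connections are spread by hashing the 5-tuple.
        digest = hashlib.sha256(repr(flow).encode()).digest()
        return self.dips[int.from_bytes(digest[:4], "big") % len(self.dips)]

    def forward(self, flow: FiveTuple) -> str:
        """Return the DIP for a packet, creating an entry on the first packet."""
        dip = self.conn_table.get(flow)
        if dip is None:
            dip = self._pick_dip(flow)
            self.conn_table[flow] = dip
        return dip

    def add_server(self, dip: str) -> None:
        # A pool change affects only connections that arrive afterwards.
        self.dips.append(dip)

    def close(self, flow: FiveTuple) -> None:
        # Remove the entry when the flow ends.
        self.conn_table.pop(flow, None)
```

Without the table, a pool change would alter the result of the hash for existing connections and could redirect them to a different server mid-flow.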

A.2. Literature Review

Recent works presented schemes where load balancing functionality is implemented in programmable P4 switches. The main idea consists of storing state information directly in the switch's data plane. The connection table is managed by the software load balancer, which can be implemented either in the switch's control plane or as an external device, as shown in Fig. 12(b). The software load balancer adds new entries to the switch's table as new connections arrive, and removes old entries as flows end.

Katta et al. [103] proposed HULA, a load balancing scheme where switches store the best path to the destination via their neighboring switches. This strategy avoids storing the congestion status of all paths in leaf switches. Bennet et al. [105] extended this approach to support multi-path transport protocols (e.g., Multi-path TCP (MPTCP)). Another significant work is SilkRoad [104], a load balancer that provides a direct path between application traffic and servers. Other mechanisms such as DistCache [107] enable load balancing for storage systems through a distributed caching method. DASH [108] proposed a data structure that leverages multiple pipeline stages and per-stage SALUs to dynamically balance data across multiple paths. The aforementioned approaches work under specific assumptions about the network topology, routing constraints, and performance. Contra [109] generalized load balancing to work with various topologies and under multiple constraints by using a performance-aware routing mechanism.

Beamer [106] takes a different approach to load balancing by being stateless. Instead of storing the state in the switch, Beamer leverages the connection state already stored in backend servers to perform the forwarding.
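As a rough illustration of the per-destination best-path idea attributed to HULA above, the sketch below keeps, for each destination, only the best next hop and the utilization of the path through it, and updates that entry when a probe arrives. The class name, the probe arguments, and the refresh rule (an update from the current best hop is always accepted so that stale values are replaced) are assumptions made for this sketch; they are not taken from the P4 code of [103].

```python
from typing import Dict, Tuple


class BestHopTable:
    """Hypothetical per-switch state: one (next hop, path utilization) per destination."""

    def __init__(self) -> None:
        self.best: Dict[str, Tuple[str, float]] = {}

    def on_probe(self, dst: str, neighbor: str,
                 probe_util: float, link_util: float) -> float:
        """Process a probe for `dst` received from `neighbor`.

        probe_util: utilization of the most loaded link seen so far by the probe.
        link_util:  utilization of the local link towards `neighbor`.
        Returns the path utilization that would be advertised further upstream.
        """
        # The path is as congested as its most utilized link.
        path_util = max(probe_util, link_util)
        current = self.best.get(dst)
        if (current is None
                or path_util < current[1]   # strictly better path found
                or current[0] == neighbor):  # refresh from the current best hop
            self.best[dst] = (neighbor, path_util)
        return path_util

    def next_hop(self, dst: str) -> str:
        return self.best[dst][0]
```

The point of the design is that the table size grows with the number of destinations, not with the number of paths, which matches the motivation stated above of not storing the congestion status of all paths in leaf switches.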

A.3. Load Balancing Schemes Comparison, Discussions, and Limitations

Table XV compares the aforementioned load balancing schemes. The key idea of switch-based load balancing is to eliminate the need for a software layer while mapping a connection to the same server, ensuring the Per-Connection Consistency (PCC) property. The majority of the proposed approaches are stateful, meaning that the switches store information locally to perform load balancing. The exception here is Beamer, which relies on the connection state already stored in backend servers to ensure that connections are never dropped under churn. Another significant shift from the previous solutions is the decentralized nature of Beamer. Some approaches (e.g., HULA, MP-HULA, Contra, Dash) use active probing to collect network performance metrics. Such metrics are then analyzed by the switches to make load balancing decisions.

In the presence of multi-path transport protocols (e.g., MPTCP), systems such as HULA provide sub-optimal forwarding decisions when several subflows pertaining to a single connection are pinned on the same bottleneck link. As a result, schemes such as MP-HULA, Contra, and Dash were proposed to support multi-path transport protocols. For instance, MP-HULA is a transport-layer multi-path aware load balancing scheme that uses the best-k paths to the destination through the neighbor switches.

Finally, it is important for a load balancing scheme to handle network failures. Most of the discussed systems considered mitigating failures, with the exception of DASH.

A.4. Comparison between Switch-based and Server-based Load Balancer

Table XVI shows a comparison between switch-based and server-based load balancers. There is a significant improvement in throughput when load balancing is offloaded to the switches; for instance, SilkRoad [104], which is a load balancing scheme in the data plane, achieves 10 billion packets per second (pps) while operating at line rate. Software load balancers, on the other hand, achieve a much lower throughput, nine million pps on average. Software-based load balancers also incur additional latency overhead when processing new requests. It is relatively easy to install additional software load balancers, which makes them more scalable than switch-based load balancing schemes. Moreover, software load balancers are more flexible in assigning flow identification policies. Finally, switch-based schemes are simpler as the whole logic is expressed in a program (customized parser and match-action tables), whereas server-based balancers might require additional coordination with routers (e.g., tunneling).

TABLE XVI
SWITCH-BASED AND SERVER-BASED LOAD BALANCERS

| Feature | Switch-based | Server-based |
|---|---|---|
| Throughput | Higher (e.g., SilkRoad with a 6.4 Tbps ASIC can achieve about 10 Gpps) | Lower (e.g., 9 Mpps per core [247]) |
| Latency | Lower; sub-microsecond from ingress to egress | Higher; additional latency when processing new requests ∗ |
| Scalability | Lower; connection state is stored in limited SRAM | Higher |
| Policy flexibility | Limited; hash-based flow assignments may lead to imbalance | Flexible; policies can be written in software |
| System complexity | Simpler; it requires a customized parser and match-action tables | More complex; it requires coordination with routers, tunneling (e.g., GRE encapsulation) |

∗ After the first packet is processed, no additional latency is observed [247].

B. Caching

B.1. Background

Modern applications (e.g., online banking, social networks) rely on key-value stores. For example, retrieving a single web page may require thousands of storage accesses. As the number of users increases to millions or billions, higher throughput and lower latency are needed. A challenge of key-value stores is that items are not accessed uniformly: popular items, referred to as "hot items", receive more queries than others. Furthermore, popular items may change rapidly due to popular posts, limited-time offers, and trending events [110]. Fig. 13(a) shows a typical skewed key-value store system, which presents load imbalance among the servers storing key-value objects. Such systems may suffer from reduced throughput and long latencies. For example, server 2 may add substantial latency as a result of storing a hot item and being over-utilized, while server 1 is under-utilized.

B.2. Literature Review

Fig. 13(b) illustrates a system where a programmable switch receives queries before forwarding them to the server storing the key. The switch is used as an "in-network cache", where the hottest items are stored. When a read request for a hot key is received, the switch consults its local table and returns the value corresponding to that key. If the key is missed (i.e., the case for non-hot keys), then the switch forwards the request to the appropriate server. When a write request is received, the switch checks its local table and evicts the entry if the key is stored there. It then forwards the request to the appropriate backend server. A controller periodically collects statistics to update the cache with the current hot items.

Fig. 13. (a) Traditional software-based caching. (b) Switch-based caching.

A noteworthy approach is NetCache [110], an in-network architecture that uses programmable switches to store hot items and balance the load across storage nodes. Similarly, Liu et al. [112] proposed IncBricks, a caching fabric for key-value pairs with basic computing primitives in the data plane. Cidon et al. [111] proposed AppSwitch, a packet switch that performs load balancing for key-value storage systems. Signorello et al. [113] developed a preliminary implementation of a Named Data Networking (NDN) instance using P4. Grigoryan et al. [114] proposed a system that caches the most popular Forwarding Information Base (FIB) entries in fast memory in order to minimize TCAM consumption and to avoid the TCAM overflow problem. Zhang et al. [115] proposed B-Cache, a framework that bypasses the original processing pipeline to improve the performance of caching. Vestin et al. [116] proposed FastReact, a system that enables caching for industrial control networks. Finally, Woodruff et al. [117] proposed P4DNS, an in-network cache for Domain Name System (DNS) entries.
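The generic in-network cache behavior described at the start of this review (serve hot keys from the switch, forward misses, evict on write, refresh hot items periodically) can be sketched as follows. This is a simplified model under stated assumptions rather than the design of NetCache or any other surveyed system: the class `SwitchCache`, the plain dictionary standing in for the backend servers, and the "top-N most queried keys" policy used by the controller are all hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


class SwitchCache:
    """Hypothetical model of a switch acting as an in-network cache."""

    def __init__(self, backend: Dict[str, int], capacity: int) -> None:
        self.backend = backend            # stands in for the storage servers
        self.capacity = capacity          # number of entries the switch can hold
        self.table: Dict[str, int] = {}   # cached hot items
        self.stats: Counter = Counter()   # per-key query counts for the controller

    def read(self, key: str) -> Tuple[int, str]:
        """Return (value, where it was served from)."""
        self.stats[key] += 1
        if key in self.table:
            return self.table[key], "switch"   # cache hit: answered by the switch
        return self.backend[key], "server"     # miss: forwarded to the server

    def write(self, key: str, value: int) -> None:
        # Evict any cached copy, then forward the write to the backend.
        self.table.pop(key, None)
        self.backend[key] = value

    def controller_update(self) -> None:
        # Assumption: the controller installs the most queried keys seen
        # since the last update, then resets the statistics.
        hot = [k for k, _ in self.stats.most_common(self.capacity)]
        self.table = {k: self.backend[k] for k in hot}
        self.stats.clear()
```

Evicting on write keeps the sketch simple and avoids serving stale values; a written key is served by its server until the next controller update installs it again.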

B.3. Caching Schemes Comparison, Discussions, and Limitations

Table XVII compares the aforementioned caching schemes. Schemes can be separated based on the type of data they aim to cache. For instance, NetCache, AppSwitch, and IncBricks cache arbitrary key-value pairs, while NDN.p4 caches only NDN names. Further, some schemes (e.g., NetCache, P4DNS, etc.) automatically index entries to be cached based on their access frequencies, while others require the operators to manually specify the entries. Another important distinction is whether the scheme uses a custom protocol or not. For instance, switches in NetCache parse a custom protocol that carries key-value pairs, while switches in P4DNS parse standard DNS headers.

The main motivation of switch-based caching schemes is to address the performance issues of server-based schemes. For instance, NetCache, which efficiently detects hot key-value items and serves them in the data plane, was capable of handling two billion queries per second for 64,000 items with 16-byte keys and 128-byte values. Compared to commodity servers, NetCache improves the throughput by 3-10 times and reduces the latency of 40% of queries by 50%. In addition to the throughput, the latency of the queries is also a major metric to improve. In IncBricks, the latency of requests is reduced by over 30% compared to client-side caching systems.

Similarly, B-Cache aims at improving the performance by caching into a single cache match-action table. The motivation behind B-Cache is that the performance of the data plane decreases significantly as the complexity of the P4 program and the packet processing pipeline grows. When a match occurs, the packet bypasses the original pipeline, making the performance of caching independent of the pipeline length. Note however that this system was evaluated on a software switch (BMv2), and it is not certain whether this design is always feasible on hardware targets.

Other caching schemes are more targeted for specific applications. As examples, FastReact enables caching for industrial control networks, while P4DNS caches DNS entries. Note that some schemes require a custom protocol to operate (e.g., NetCache), while others (e.g., P4DNS) work with standard protocols (e.g., DNS). Finally, some schemes offer multi-level caching (e.g., level-1 and level-2 caches).

TABLE XVII
CACHING SCHEMES COMPARISON

| Scheme | Name | Cached data | Network accelerator needed | Automatic entry indexing | Custom protocol | Multi-level cache | Platform (HW / SW) |
|---|---|---|---|---|---|---|---|
| [110] | NetCache | Key-value | × | ✓ | ✓ | × | ✓ |
| [111] | AppSwitch | Key-value | × | × | ✓ | × | ✓ |
| [112] | IncBricks | Key-value | ✓ | × | ✓ | × | ✓ |
| [113] | NDN.p4 | NDN names | × | ✓ | ✓ | × | ✓ |
| [114] | PFCA | Routes (FIB entries) | × | ✓ | × | ✓ | ✓ |
| [115] | B-Cache | FIB entries | × | ✓ | × | × | ✓ |
| [116] | FastReact | Sensor readings | × | × | ✓ | × | ✓ |
| [117] | P4DNS | DNS entries | × | ✓ | × | × | ✓ |

B.4. Comparison between Switch-based and Server-based Caching

Table XVIII compares the switch-based versus server-based caching schemes. The throughput when data is cached on the switch is an order of magnitude larger than that of general-purpose servers. The latency is also reduced by 50%, and most of it is induced by the client. Switch-based caching solves the load imbalance problem and is simpler as the whole logic is expressed in a program. Server-based caching, on the other hand, is more flexible regarding cache policies, as well as key, value, and table sizes.

TABLE XVIII
SWITCH-BASED AND SERVER-BASED CACHING

| Feature | Switch-based | Server-based |
|---|---|---|
| Throughput | Higher (e.g., NetCache, 2 BQPS) | Lower; 0.2 BQPS |
| Latency | Lower (e.g., NetCache; μs scale, mostly caused by the client) | Higher; μs scale |
| Key size | Not flexible (limited by packet header length) | Arbitrary |
| Value size | Not flexible (limited by the amount of state accessed when processing a packet) | Arbitrary |
| Load imbalance | No | Yes |
| System complexity | Simpler; it requires a customized parser and match-action tables | More complex; it requires coordination with routers, tunneling (e.g., GRE encapsulation) |
| Table size | Limited by RAM | Arbitrary |
| Cache policies | Limited by table size | Arbitrary |

BQPS: Billion Queries Per Second.

C. Telecommunication Services

C.1. Background

The evolution of the current mobile network to the emerging Fifth-Generation (5G) technology implies significant improvements of the network infrastructure. Such improvements are necessary in order to meet the Key Performance Indicators (KPIs) and requirements of 5G [248]. 5G requires ultra-reliable operation with low latency and jitter (microseconds-scale). As programmable switches fulfill these requirements, researchers are investigating the idea of offloading telecom-oriented VNFs running on x86 servers to programmable hardware.

C.2. Literature Review

Ricart-Sanchez et al. [118] proposed a system that uses the programmable data plane to enhance the performance of the data path from the edge to the core network, also known as the backhaul, in a 5G multi-tenant network. The same authors [119] proposed a 5G firewall that detects, differentiates, and selectively blocks 5G network traffic in the backhaul network.

In parallel, attempts such as TurboEPC [120] proposed offloading a subset of user state in the mobile packet core to programmable switches in order to perform signaling in the data plane. Similarly, Singh et al. [121] designed a P4-based element of the 5G Mobile Packet Core (MPC) that merges the functions of both the Serving Gateway (SGW) and the Packet Data Network Gateway (PGW). Additionally, Voros et al. [122] proposed a hybrid next-generation NodeB (gNB) that combines the capabilities of P4 switches and the external services built on top of NIC accelerators (DPDK).

Another important function required in 5G is handover. Palagummi et al. [123] proposed SMARTHO, a system that uses programmable switches to perform handover efficiently in a wireless network.

Finally, Kfoury et al. [124] proposed a system for offloading conversational media traffic (e.g., Voice over IP (VoIP), Voice over LTE (VoLTE), WebRTC, media conferencing, etc.) from x86-based relay servers to programmable switches. While this system is not tailored for 5G networks specifically, it provides significant performance improvements for Over-The-Top (OTT) VoIP systems.

TABLE XIX
TELECOM SCHEMES COMPARISON

| Scheme | Core idea | Deployment | 5G-centric | Reported latency scale | Concurrent users evaluated | Implementation (HW / SW) |
|---|---|---|---|---|---|---|
| [118] | Enhances the data path in 5G multi-tenants | Backhaul | ✓ | Microseconds | N/A | ✓ |
| [119] | Implements a 5G firewall in the switch | Backhaul | ✓ | Microseconds | 1K | ✓ |
| [123] | Provides smart handover for mobile UE | Between CU and DU | ✓ | N/A | N/A | ✓ |
| [121] | Offloads MPC user plane functions to switch | Core network | ✓ | Microseconds | 65K-1M | ✓ |
| [124] | Offloads media traffic relay to switch | Edge | × | Nanoseconds | 65K-1M | ✓ |
| [120] | Performs signaling in the data plane | Core | ✓ | Milliseconds | 65K | ✓ |

Fig. 14. CDF of delay and packet loss rate of 900 offloaded VoIP calls [124].

C.3. Telecom Schemes Comparison, Discussions, and Limitations

Table XIX compares the aforementioned telecom schemes on P4. In general, all schemes aim at offloading various functionalities originally executed on x86-based servers to the data plane. Such a strategy improves the network performance (e.g., latency, throughput) significantly and aims at achieving the KPIs of 5G. For instance, the experiments conducted in [118] show that the attained QoS metrics meet the latency requirements of 5G. Similarly, the results reported in [119] demonstrate that the system meets the reliability KPI of 5G, which states that the network should be secured with zero downtime. Furthermore, the results reported in [123] show that there are 18% and 25% reductions in handover time with respect to legacy approaches, for two- and three-handover sequences, respectively. The system in [124] emulates the behavior of the relay server, which is primarily used to solve the NAT problem. Results show that ultra-low latency and jitter (nanoseconds-scale) are achieved with programmable switches, as opposed to x86-based relay servers where the latency and the jitter are in the milliseconds-scale (see Fig. 14). The solution also improves the packet loss rate, the CPU usage of the server, and the Mean Opinion Score (MOS), and can scale to more than one million concurrent sessions, with additional resources to spare in the switch.

Other systems allow offloading the signaling part to the data plane. For instance, TurboEPC offloads messages that constitute a significant portion of the total signaling traffic in the packet core, aiming at improving the throughput and latency of the control plane's processing.

C.4. Switch-based and Server-based Media Relay

Offloading media traffic from general-purpose servers to programmable switches greatly improves the quality of service. Table XX shows the metrics achieved when media is relayed by a relay server versus when it is relayed by the switch, based on [124]. The results show that the latency, jitter, and packet loss rates are significantly lower when media is being relayed by the switch. Not only are the QoS metrics improved, but so is the maximum number of concurrent sessions. With a 3.2 Tbps Tofino, more than one million sessions were accommodated in the switch's SRAM, with additional resources to spare for other functionalities. On the other hand, only one thousand sessions per CPU core were handled in the server-based relay before QoS starts to degrade. The drawback of offloading media traffic to the switch is that some functionalities are complex to implement in the data plane (e.g., media mixing for conference calls).

TABLE XX
SWITCH-BASED AND SERVER-BASED MEDIA RELAYING

| Metric | Switch-based relay [124] | Server-based relay |
|---|---|---|
| Relay server CPU | Lower; negligible with 900 active sessions | Higher; averages 50% for 900 active sessions |
| Latency | Lower; almost constant at 440 ns with 900 sessions | Higher; from 0.2 ms to 17 ms with 900 sessions |
| Jitter | Lower; negligible with 900 active sessions | Higher; ranges from 100 μs to 3 ms |
| Packet loss | None contributed by the switch | High; increases as the number of sessions increases |
| Maximum number of sessions | Higher; more than one million with additional resources to spare | Lower; one thousand sessions per core before QoS degrades |
| Mean opinion score (MOS) | Higher; maximum MOS (4.4) with 1800 concurrent sessions | Lower; for 1800 sessions, 50% of sessions have a MOS score below 3.7 |
| Table size | Limited by SRAM | Arbitrary |
| Additional functions | Limited to relaying | Arbitrary; e.g., media mixing, lawful interception |

D. Publish/Subscribe

D.1. Background

Emerging network architectures (e.g., [249]) promote content-centric networking, a model where the addressing scheme is based on named data rather than named hosts. In other words, users specify the data they are interested in instead of specifying where to get the data from. A branch of content-centric networking is the publish/subscribe (pub/sub) model. The goal of the model is to provide a scalable and robust communication channel between producers and consumers of information. A large fraction of today's Internet applications follow the publish/subscribe paradigm. With the IoT, this paradigm proliferated as sensors/actuators are often deployed in dynamic environments. Other applications that use the pub/sub model include instant messaging, Really Simple Syndication (RSS) feeds, presence servers, telemetry, and others. Current approaches to content-centric networking use software-based middleboxes, which limits the performance in terms of throughput and latency. Recent works are leveraging programmable switches to overcome the performance limitations of software-based pub/sub middleboxes.

D.2. Literature Review

Jepsen et al. [125] presented "packet subscription", a new abstraction that generalizes the forwarding rules by evaluating stateful predicates on input packets. Wernecke et al. [126, 127] presented distribution strategies for content-based publish/subscribe systems using programmable switches. The authors described a system where the notification distribution tree (i.e., the subscribers that should receive the notification) is encoded in the packet headers, similar to multicast source routing. Similarly, Kundel et al. [128] implemented a publish/subscribe system on programmable switches. The system is flexible in encoding attributes/values in packet headers.

D.3. Publish/Subscribe Schemes Comparison, Discussions, and Limitations

Table XXI compares the aforementioned pub/sub schemes. In [125], the authors described a compiler that generates P4 tables from logical predicates. It utilizes a novel algorithm based on Binary Decision Diagrams (BDD) to conserve switch resources (TCAM and SRAM). This feature simplifies the configuration as operators do not need to manually install table entries on switches, which is a cumbersome process when the topology is large. The prototype was evaluated on a hardware switch (Tofino), and the authors considered Nasdaq's ITCH protocol as the pub/sub use case. Results show that the system was able to process messages at line rate while using the full switch capacity (6.5 Tbps). The other systems considered different encoding strategies. For example, in [126, 127], the authors described a system where the notification distribution tree (i.e., the subscribers that should receive the notification) is encoded in the packet headers, similar to multicast source routing. The advantage of storing the distribution tree in the packet header instead of storing it in the switch is that rules in the switches do not need to be updated when subscriptions change. Other distinctions between the pub/sub systems are whether they require a dedicated language to describe the subscriptions, and their configuration complexity.

TABLE XXI
PUBLISH/SUBSCRIBE SCHEMES COMPARISON

| Scheme | Dedicated language | Config complexity | Encoding structure | Platform (HW / SW) |
|---|---|---|---|---|
| [125] | ✓ | Medium | Hierarchical (BDD) | ✓ |
| [126][127] | × | High | Distribution tree | ✓ |
| [128] | × | High | Attribute-value pair | ✓ |
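To make the idea of forwarding on predicates over packet contents more concrete, the sketch below evaluates a list of subscriptions against a packet and returns the output ports whose predicate matches. It is only an illustration under stated assumptions: the field names, the three comparison operators, and the linear scan are hypothetical, and the sketch does not reproduce the BDD-based compilation of [125], which exists precisely to avoid evaluating every predicate separately.

```python
import operator
from typing import Any, Dict, List, Tuple

# A predicate is a conjunction of (field, operator, constant) terms.
Term = Tuple[str, str, Any]
Subscription = Tuple[List[Term], int]  # (predicate, output port)

OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}


def matches(packet: Dict[str, Any], predicate: List[Term]) -> bool:
    """True if every term of the predicate holds for the packet's fields."""
    return all(
        field in packet and OPS[op](packet[field], value)
        for field, op, value in predicate
    )


def forward_ports(packet: Dict[str, Any],
                  subscriptions: List[Subscription]) -> List[int]:
    """Ports of all subscribers whose predicate matches the packet."""
    return sorted({port for predicate, port in subscriptions
                   if matches(packet, predicate)})


# Hypothetical market-data style subscriptions.
subs: List[Subscription] = [
    ([("stock", "==", "GOOGL"), ("price", ">", 100)], 1),
    ([("stock", "==", "MSFT")], 2),
    ([("price", ">", 50)], 3),
]
print(forward_ports({"stock": "GOOGL", "price": 120}, subs))
```

A real switch cannot loop over subscriptions per packet, which is why compiling many predicates into a compact set of match-action tables matters for resource usage.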

D.4. Comparison between Switch-based and Server-based Pub/Sub Systems

Fig. 15 illustrates the operations of traditional software-based pub/sub systems (a) and switch-based pub/sub systems (b). Latency and its variations are significantly reduced when the switch acts as a pub/sub broker. However, the size of the memory in the switch limits the amount of data to be distributed. Moreover, implementing in hardware the features provided by software-based pub/sub systems, such as QoS levels, session persistence, message retaining, and last will and testament (notifying users after a device disconnects), is challenging.

E. Summary and Lessons Learned

Programmable switches offer the flexibility of customizing the data plane to enable middlebox functions. A middlebox can be defined as a device that performs functions that are beyond the standard capabilities of routers and switches. A number of works demonstrated the implementation of middlebox functions such as caching, load balancing, offloading services, and others on programmable switches. The majority of load balancing schemes took advantage of the stateful nature of the data plane to store the load balancing connection table. Future work should consider minimizing the storage requirement to improve the scalability, supporting flow priority, and developing further variations for novel multipath transport protocols such as multipath QUIC.

The switch can also act as an "in-network cache" that serves hot items at line rate. Some schemes index entries automatically, while others require the operator's intervention. Future endeavours could investigate item compression, communication minimization, priority-based caching, and aggregated computations caching (e.g., caching the average of hot items).

An additional middlebox application is offloading telecom functions. The switch is capable of relaying media traffic and performing user plane functions. Future work could investigate scalability improvement (i.e., to accommodate more concurrent sessions), offloading signaling traffic, and in-network media mixing.

Finally, the switch can also act as a broker to distribute packets in a publish/subscribe system. Future work could investigate reliability assurance (e.g., packet delivery guarantees), message retaining, and QoS differentiation (e.g., the QoS features of MQTT).

Fig. 15. (a) Traditional software-based pub/sub architecture. (b) Pub/sub implemented on a programmable switch.

IX. NETWORK-ACCELERATED COMPUTATIONS

Programmable switches offer the flexibility of offloading some upper-layer logic to the ASIC, also referred to as in-network computation. Since switch ASICs are designed to process packets at terabits-per-second rates, in-network computation can result in an order of magnitude or more of improvement in throughput when compared to applications implemented in software. The potential performance improvement has motivated programmers to build in-network computation for different purposes, including consensus, machine learning acceleration, stream processing, and others.

The idea of delegating computations to networking devices dates back to Active Networks [250], where packets are replaced with small programs ("capsules") that are executed in each traversed device along the path. However, traditional network devices were not capable of performing computations. With the recent advancements in programmable switches, performing computations is now a possibility.

A. Consensus

A.1. Background

Consensus algorithms are common in distributed systems where machines collectively achieve agreement on a single data value, or on the current state of a distributed system. Reliability is achieved with consensus algorithms, even in the presence of some malicious or faulty processes. Consensus algorithms are used in applications such as blockchain [251], load balancing, clock synchronization, and others [252].

Latency has always been a bottleneck with consensus algorithms, as protocols require expensive coordination on every request. Lately, researchers have started investigating how programmable switches can be leveraged to operate consensus protocols in order to increase throughput and decrease latency. Fig. 16 shows a consensus model in the data plane.

Fig. 16. Consensus protocol in the data plane model [130]. The consensus protocol (e.g., Paxos) runs in the network. An application sends a request to the proposer, which resides on a commodity server. The proposer then creates a Paxos message and sends it to the coordinator, running in the data plane. The role of the coordinator is to be the broker of requests on behalf of proposers. Afterwards, the acceptors, which also run in the data plane, receive the messages from the coordinator and ensure consistency through the system by deciding whether to accept or reject proposals. Finally, learners provide replication by learning the result of consensus.
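The message flow between the roles in Fig. 16 can be sketched as plain Python objects. This is a deliberately reduced model under stated assumptions: it covers only the phase-2 (accept) exchange of Paxos with a single coordinator and a fixed round number, it omits phase 1, recovery, and packet loss, and the message field names (`inst`, `rnd`, `value`, `acceptor`) are hypothetical. It is not the P4 implementation of [130].

```python
from typing import Dict, Optional, Set, Tuple


class Coordinator:
    """Brokers requests: assigns each one the next consensus instance number."""

    def __init__(self, rnd: int = 0) -> None:
        self.rnd = rnd
        self.next_inst = 0

    def on_request(self, value: str) -> dict:
        msg = {"type": "2A", "inst": self.next_inst,
               "rnd": self.rnd, "value": value}
        self.next_inst += 1
        return msg


class Acceptor:
    """Accepts a phase-2A message unless it promised a higher round."""

    def __init__(self, acceptor_id: int) -> None:
        self.acceptor_id = acceptor_id
        self.promised: Dict[int, int] = {}               # inst -> highest round seen
        self.accepted: Dict[int, Tuple[int, str]] = {}   # inst -> (round, value)

    def on_phase2a(self, msg: dict) -> Optional[dict]:
        inst, rnd = msg["inst"], msg["rnd"]
        if rnd < self.promised.get(inst, 0):
            return None                                  # stale round: reject
        self.promised[inst] = rnd
        self.accepted[inst] = (rnd, msg["value"])
        return {"type": "2B", "inst": inst, "rnd": rnd,
                "value": msg["value"], "acceptor": self.acceptor_id}


class Learner:
    """Learns a value once a majority of acceptors voted in the same round."""

    def __init__(self, n_acceptors: int) -> None:
        self.quorum = n_acceptors // 2 + 1
        self.votes: Dict[Tuple[int, int], Set[int]] = {}

    def on_phase2b(self, msg: dict) -> Optional[str]:
        voters = self.votes.setdefault((msg["inst"], msg["rnd"]), set())
        voters.add(msg["acceptor"])
        return msg["value"] if len(voters) >= self.quorum else None
```

In this sketch the per-message work of the coordinator and acceptors is a counter increment, a comparison, and a small state update per instance, which suggests why these two roles are the ones placed in the data plane in Fig. 16.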

A.2. Literature Review

Li et al. [129] proposed Network-Ordered Paxos (NOPaxos), a P4-based Paxos [253] system that applies replication in the data center to reduce the latency imposed by communication overhead. Similarly, Dang et al. [130] presented an implementation of Paxos using P4 on the data plane. Dang et al. [134] also proposed Partitioned Paxos, a P4-based system that separates the two aspects of Paxos, namely, agreement and execution, and optimizes them separately. Furthermore, the same authors also proposed P4xos [136], a P4-based solution that executes Paxos logic directly in switch ASICs, without strengthening assumptions about the network (e.g., ordered delivery, packet loss, etc.). Jin et al. [133] proposed NetChain, a variant of the Paxos protocol that provides scale-free sub-RTT coordination in data centers. It is strongly consistent, fault-tolerant, and presents an in-network key-value store.

Another line of research focused on consensus algorithms other than Paxos. Li et al. [131] proposed Eris, a P4-based solution that avoids replication and transaction coordination overhead. It processes a large class of distributed transactions in a single round trip, without any additional coordination between shards and replicas. Sakic et al. [135] proposed P4 Byzantine Fault Tolerance (P4BFT), a system that is based on BFT-enabled SDN, where controllers act as replicated state machines. The system offloads the comparison of controllers' outputs required for correct BFT operations to programmable switches. Finally, Han et al. [132] offloaded part of the Raft consensus algorithm [254] to programmable switches in order to improve its performance. The authors selected Raft due to the fact that it has been formally proven to be safer than Paxos, and it has been implemented on popular SDN controllers.

TABLE XXII
CONSENSUS SCHEMES COMPARISON

| Scheme | Name | Algo. | Weak assumpt. | Full proto. | Platform (HW / SW) |
|---|---|---|---|---|---|
| [129] | NOPaxos | Paxos | × | × | ✓ |
| [130] | N/A | Paxos | ✓ | × | ✓ |
| [131] | Eris | Novel | ✓ | ✓ | ✓ |
| [132] | N/A | Raft | ✓ | × | ✓ |
| [133] | NetChain | Novel | × | ✓ | ✓ |
| [134] | Partitioned Paxos | Paxos | ✓ | ✓ | ✓ |
| [135] | P4BFT | BFT | ✓ | ✓ | ✓ ✓ |
| [136] | P4xos | Paxos | ✓ | ✓ | ✓ |

A.3. Consensus Schemes Comparison, Discussions, and Limitations

Table XXII compares the aforementioned consensus schemes. In general, consensus algorithms such as Paxos are complex and cannot be easily implemented within the constraints of the data plane. For instance, [130] only implemented the phase-2 logic of Paxos leaders and acceptors. Similarly, NetChain uses a variant of the Paxos protocol that divides it into two parts: steady state and reconfiguration. This variant is known as Vertical Paxos, and is relatively simple to implement in the network as the two parts can be mapped to the control plane and the data plane.

Unordered and completely asynchronous networks require the full implementation and complexity of Paxos. NOPaxos suggests that the communication layer should provide a new Ordered Unreliable Multicast (OUM) primitive; that is, receivers are guaranteed to process the multicast messages in the same order, though messages can be lost. NOPaxos relies on the network to deliver ordered messages in order to avoid coordination entirely. Dropped packets, on the other hand, are handled through coordination with the application. Other systems like Eris avoid replication and transaction coordination overhead. The main contribution of Eris compared to NOPaxos is that it establishes a consistent ordering across messages delivered to many destination shards. Eris also allows receivers to detect dropped messages.

Partitioned Paxos [134] addresses limitations of the existing systems. The motivation behind Partitioned Paxos is that existing network-accelerated approaches do not address the problem of how a replicated application can cope with the high rate of consensus messages; NOPaxos only processes 13,000 transactions per second since it presents a new bottleneck at the host side. Other systems (e.g., NetChain) are specialized replication services and cannot be used by any off-the-shelf application.

Finally, P4xos improves both the latency and the tail latency. The throughput is also improved compared to hardware servers, which require additional memory management and safety features (e.g., user and kernel separation). P4xos was implemented on a hardware switch (Tofino), and results show that it reduces the latency by a factor of three compared to traditional approaches, and that it can process over 2.5 billion consensus messages per second (four orders of magnitude improvement).
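To make the phase-2 logic mentioned above more concrete, the following is a minimal Python sketch of what a Paxos acceptor and learner do with accept requests. It is not the implementation of [130] or P4xos; the class and function names, the dictionaries standing in for switch register arrays, and the message tuple layout are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vote = Tuple[int, int, bytes]  # (instance, round, value); hypothetical layout


@dataclass
class Acceptor:
    """Phase-2 acceptor state, kept per consensus instance.

    The dictionaries stand in for register arrays indexed by the
    instance number (an assumption, not the layout used in [130]).
    """
    promised_round: Dict[int, int] = field(default_factory=dict)
    accepted: Dict[int, Tuple[int, bytes]] = field(default_factory=dict)

    def on_phase2a(self, instance: int, rnd: int, value: bytes) -> Optional[Vote]:
        """Handle an accept request; return a phase-2b vote, or None if rejected."""
        if rnd < self.promised_round.get(instance, 0):
            return None  # a higher round was already seen: reject
        self.promised_round[instance] = rnd
        self.accepted[instance] = (rnd, value)
        return (instance, rnd, value)


def learn(votes: List[Optional[Vote]], quorum: int) -> Optional[bytes]:
    """Learner: a value is chosen once a quorum of identical votes is seen.

    Assumes each vote in the list comes from a distinct acceptor.
    """
    counts: Dict[Vote, int] = {}
    for vote in votes:
        if vote is None:
            continue
        counts[vote] = counts.get(vote, 0) + 1
        if counts[vote] >= quorum:
            return vote[2]
    return None
```

With three acceptors and a quorum of two, a single accept request forwarded by the coordinator is enough for the learner to decide, while a request carrying a stale round number is rejected without changing the acceptor state.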

A.4. Network-Assisted and Legacy Consensus Comparison

Consensus algorithms have traditionally been implemented as applications on general-purpose CPUs. Such an architecture inherently induces latency overhead (e.g., a Paxos coordinator has a minimum latency of 96 µs [255]). There are numerous performance benefits gained when consensus algorithms are implemented in programmable devices. When consensus messages are processed on the wire, the latency decreases significantly (a Paxos coordinator had a minimum latency of 340 ns [255]). Moreover, when compared to legacy consensus deployments, network-assisted consensus requires fewer hop traversals.

B. Machine Learning

B.1. Background

The remarkable success of Machine Learning (ML) today has been enabled by a synergy between developments in hardware and advancements in machine learning techniques. Increasingly complex ML models are being developed to handle large datasets, and hardware accelerators (e.g., GPU, TPU) were introduced to speed up the training. These accelerators are installed in large clusters and collaborate through distributed training to exploit parallelism. Nevertheless, training ML models is time consuming and can last for weeks depending on the complexity and the size of the datasets. Researchers have traditionally investigated methods to accelerate the computation process, but not the communication in distributed learning. With the advancements in programmable switches, it is now possible to accelerate the ML training process through the network.

B.2. Literature Review

The literature can be divided into two main categories: accelerating training and accelerating inference. Sapio et al. [137] proposed DAIET, a system that performs in-network data aggregation to accelerate applications that follow a partition/aggregate workload pattern. Similarly, Yang et al. [140] proposed SwitchAgg, a system that performs similar functions as DAIET, but with a higher data reduction rate. Perhaps the most significant work in the training acceleration literature is SwitchML [141], a system that performs in-network aggregation for ML model updates sent from workers on external servers.

On the other hand, other schemes speed up the inference process by leveraging programmable switches. Siracusano et al. [138] proposed N2Net, a system that runs simplified neural networks (NN) on programmable switches. Sanvito et al. [139] proposed BaNaNa Split, a solution that evaluates the conditions under which programmable switches can act as CPUs' co-processors for the processing of Neural Networks (e.g., CNN). Finally, Xiong et al. [142] proposed IIsy, a system that enables programmable switches to perform in-network classification. The system maps trained ML classification models to match-action pipelines.

TABLE XXIII
MACHINE LEARNING SCHEMES COMPARISON

| Scheme | Name | Core idea | Inference | Training | Evaluated model/algorithm | Quantization | Platform (HW, SW) |
|---|---|---|---|---|---|---|---|
| [137] | DAIET | In-network computation for partition/aggregate work pattern | × | ✓ | SGD, Adam | N/A | ✓ |
| [138] | N2Net | In-network classification using BNN | ✓ | × | Binary neural networks | ✓ | × × |
| [139] | BaNaNa Split | NN processing division between switches and CPUs | ✓ | × | Binary neural networks | ✓ | × |
| [140] | SwitchAgg | In-network aggregation without modifying the network | × | ✓ | MapReduce-like system | N/A | × |
| [141] | SwitchML | Accelerates distributed parallel training in ML | × | ✓ | Synchronous SGD | ✓ | × |
| [142] | IIsy | Maps trained ML classification models to match-action pipeline | ✓ | × | Decision tree, SVM, naïve Bayes, k-means | × | × |

B.3. ML Schemes Comparison, Discussions, and Limitations

Table XXIII compares the aforementioned ML schemes. While the goal of DAIET is to discuss what computations the network can perform, the authors did not design a complete system, nor did they address the major challenges of supporting ML applications. Moreover, their proof-of-concept presented a simple MapReduce application on a software switch, and it is not certain whether the system can be implemented on a hardware switch. Compared to DAIET, SwitchAgg does not require modifying the network architecture, and offers better processing abilities with a significant data reduction rate. Moreover, SwitchAgg was implemented on an FPGA, and the results show that the job completion time can be reduced by as much as 50%.

SwitchML extended the literature on accelerating ML model training by providing a complete implementation and evaluation on a hardware switch. A commonly used training technique for deep neural networks is synchronous stochastic gradient descent [257]. In this technique, each worker has a copy of the model that is being trained. The training is an iterative process where each iteration consists of: 1) reading a sample of the dataset and locally performing computation-intensive learning using the worker's accelerators, which yields a gradient vector; and 2) updating the model by computing the mean of all gradient vectors. The main motivation of this idea is that the aggregation is computationally cheap (it takes 100 ms), but is communication-intensive (hundreds of megabytes are transferred in each iteration). SwitchML uses computation on the switch to aggregate model updates in the network as the workers are sending them (see Fig. 17). An advantage is that there is minimal communication; each worker sends its update vector and receives back the aggregated updates. The design challenges of this system include: 1) the limited storage available on the switch, addressed by using a streaming approach; 2) the fact that switches cannot perform much computation per packet, addressed by partitioning the work between the switch and the workers; 3) the use of floating point numbers by ML systems, addressed by quantization approaches; and 4) the need for failure recovery to ensure correctness. The system is implemented on a hardware switch (Tofino), and results show that the system speeds up training by up to 300% compared to existing distributed learning approaches.

Fig. 17. (a) ML model updates in legacy networks. The aggregation process is communication-intensive and follows an all-to-all communication pattern. This means that the workers should receive all the other workers' updates. Since accelerators on end-hosts are becoming faster, the network should speed up so that it does not become the bottleneck. Therefore, it is expensive to deploy additional accelerators since it requires re-architecting the network. The red arrow in (a) shows that the bottleneck source is the network. (b) ML model updates accelerated by the network. Aggregation is performed in the network by the programmable switches while the workers are sending their updates. The workers do not need to obtain the updates of all other workers, hence there is minimal communication. They only obtain the aggregated model from the switch. The red arrow in (b) shows that the bottleneck source is the worker rather than the network [141, 256].

TABLE XXIV
SWITCH-BASED AND SERVER-BASED ML APPROACHES

| Feature | Inference: Switch-based | Inference: Server-based | Training: Switch-based | Training: Server-based |
|---|---|---|---|---|
| Speed | Faster, inference at line rate | Slower | Faster, aggregations at line rate | Slower; aggregations on an x86 server |
| Complex computations support | Lower, basic arithmetic and bitwise logic function | Higher | Lower | Higher |
| Communication overhead | Low | Low | Lower, switch is the centralized aggregator | Higher, updates are exchanged with a remote aggregator |
| Storage | Lower | Higher | Lower, update is not stored entirely at once | Higher |
| Encrypted traffic | Difficult | Easy | Difficult | Easy |

With respect to in-network inference, it is challenging to implement full-fledged models as they require extensive computations (e.g., multiplications and activation functions). Simple variations such as the Binary Neural Network (BNN) only require bitwise logic functions (e.g., XNOR, POPCNT, SIGN). N2Net provides a compiler that translates a given BNN model to a switching chip's configuration (P4 program). The authors did not mention on which platform N2Net was evaluated; however, based on their evaluations, they concluded that a BNN can be implemented on most current switching chips, and that, with small additions to the chip design, more complex models can be implemented. IIsy studied other ML models. The authors of IIsy acknowledged that the work is limited in scope as it does not address popular ML algorithms such as neural networks. Furthermore, it is bounded by the type of features it can extract (i.e., packet headers), and has accuracy limitations. IIsy tries to find a balance between the limited resources on the switch and the classification accuracy. Finally, BaNaNa Split took a different approach by partitioning the processing of the NN to offload a subset of layers from the CPU to a different processor. Note that the solution is far from complete, and the authors evaluated a single binary fully connected layer with 4096 neurons using a network processor-based SmartNIC.
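The reason a BNN fits the bitwise operations named above can be illustrated with a small Python sketch. It is not N2Net's compiler output; the function names, the bit encoding (1 for +1, 0 for -1), and the choice to map a zero sum to +1 are assumptions made for illustration.

```python
from typing import List


def bnn_neuron(inputs: int, weights: int, n_bits: int) -> int:
    """Evaluate one binary neuron over n_bits packed inputs and weights.

    Each bit encodes +1 (bit set) or -1 (bit clear).
    """
    mask = (1 << n_bits) - 1
    agree = ~(inputs ^ weights) & mask   # XNOR: 1 where input and weight match
    matches = bin(agree).count("1")      # POPCNT: number of matching positions
    # Dot product of the +/-1 vectors is matches - (n_bits - matches).
    return 1 if 2 * matches >= n_bits else 0  # SIGN (zero mapped to +1 here)


def bnn_layer(inputs: int, weight_rows: List[int], n_bits: int) -> int:
    """Evaluate a fully connected binary layer; outputs are packed into an int."""
    out = 0
    for weights in weight_rows:
        out = (out << 1) | bnn_neuron(inputs, weights, n_bits)
    return out
```

No multiplication or floating-point operation appears anywhere in the sketch, which is why such a layer is a plausible candidate for a match-action pipeline restricted to bitwise logic and simple arithmetic.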

C. Comparison between Switch-based and Server-based ML

Table XXIV shows a comparison between switch-based and server-based ML approaches. ML works that were extracted from the literature can be divided into two main categories: 1) expedited inference in the data plane, and 2) accelerated training in the network. The main advantage of switch-based over server-based inference is the ability to execute at line rate, and hence to provide faster results to the clients. Complex computations in the switch are achieved through estimations, and hence are limited. Moreover, the SRAM capacity of the switch is small, impeding the storage of large models. Such limitations are not problematic with server-based inference approaches.

Distributed training can be significantly faster when aggregations are offloaded to a centralized switch. However, due to the small capacity of the switch memory, it is not possible to store the whole model update at once. Additionally, encrypted traffic remains a challenge when inference or training is handled by the switch.
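The combination of a streaming approach and quantization, which lets a switch with little memory and integer-only arithmetic aggregate model updates, can be sketched in Python as follows. This is only an illustration under stated assumptions, not SwitchML's protocol: the scaling factor `SCALE`, the slot count `SLOTS`, and all function names are hypothetical.

```python
from typing import List

SCALE = 1 << 16  # fixed-point scaling factor (illustrative assumption)
SLOTS = 4        # integer slots available on the "switch" (illustrative assumption)


def quantize(values: List[float]) -> List[int]:
    """Worker side: convert floats to fixed-point integers."""
    return [round(v * SCALE) for v in values]


def dequantize_mean(sums: List[int], n_workers: int) -> List[float]:
    """Worker side: convert aggregated integers back to the mean gradient."""
    return [s / SCALE / n_workers for s in sums]


def aggregate_streaming(updates: List[List[float]]) -> List[float]:
    """Aggregate equally sized update vectors one chunk of SLOTS values at a time."""
    n_workers = len(updates)
    length = len(updates[0])
    result: List[float] = []
    for start in range(0, length, SLOTS):
        slots = [0] * SLOTS  # the same small memory is reused for every chunk
        for worker_update in updates:
            chunk = quantize(worker_update[start:start + SLOTS])
            for i, value in enumerate(chunk):
                slots[i] += value  # integer addition only, as on the switch
        width = min(SLOTS, length - start)
        result.extend(dequantize_mean(slots[:width], n_workers))
    return result
```

The "switch" never holds more than `SLOTS` integers, regardless of the size of the model update, while the floating-point work (scaling and division) stays on the workers.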

D. Summary and Lessons Learned

Accelerating computations by leveraging programmable switches is becoming a trend in data centers and backbone networks. Although switches only support basic and limited operations, the literature has shown that the performance of various tasks (e.g., consensus, training models in machine learning) could significantly improve if computations are delegated to the network.

The majority of the in-network consensus works aim at implementing common consensus protocols such as Paxos and Raft in the data plane. Due to the hardware constraints, current schemes implement only simplified variations of the protocols. Future work could investigate implementing novel consensus algorithms that diverge from the existing complex ones. Further, such schemes should encompass failure recovery mechanisms.

Another interesting in-network application is ML training/inference acceleration. The literature has shown that significant performance improvements are attained when the switch aggregates model updates or classifies new samples. Future systems could explore developing further ML models for various tasks such as classification, regression, clustering, etc.

In addition to the aforementioned categories, data plane programming is being used for stream processing [143, 144], parallel processing [145], string searching [146], erasure coding [147], in-network lock managers [148], database queries acceleration [149], in-network compression [150], and computer vision offloading [151].

X. INTERNET OF THINGS (IOT)

The Internet of Things (IoT) is a novel paradigm in which pervasive devices equipped with sensors and actuators collect physical environment information and control the outside world. IoT applications include smart water utilities, smart grid, smart manufacturing, smart gas, smart metering, and many others. Typical IoT scenarios entail a large number of devices periodically transmitting their sensors' readings to remote servers. Data received on those collectors is then processed and analyzed to assist organizations in making data-driven, intelligent decisions.

A. Aggregation

A.1. Background

Since IoT devices are constrained in size and processing capabilities, they typically generate packets that carry small payloads (e.g., temperature sensor readings). While such packets are small in size, their headers occupy a significant portion of the total packet size. For instance, Sigfox Low-Power Wide Area Network (LPWAN) [258] can support a maximum payload size of 12 bytes per packet. The overhead of headers is 42 bytes (Ethernet 14 bytes + IP 20 bytes + UDP 8 bytes), which represents approximately 78% of the total packet size. When numerous devices continuously transmit IoT packets, a significant percentage of network bandwidth is wasted on transmitting these headers. Packet aggregation is a mechanism in which the payloads of small packets are aggregated into a single larger packet in order to mitigate the bandwidth overhead caused by transmitting multiple headers. Legacy packet aggregation mechanisms operate on the CPUs of servers or on the control plane of switches [259–264]. While legacy mechanisms reduce the overhead of packet headers, they increase the end-to-end latency and decrease the throughput. As a result, some studies have suggested aggregating only packets that are not real-time.

TABLE XXV
IOT AGGREGATION SCHEMES COMPARISON

Columns: Scheme; Evaluation (Theoretical, Implementation); Constraints (Same payload size, Payload <= 16 bytes, Number of packets); Line rate (Aggregation, Disaggregation); Platform (HW, SW)
[152] ✓ ✓ ✓ ✓ × ✓
[153] ✓ × × Up to MTU ✓ ✓ ✓
[154] ✓ ✓ ✓ × × ✓

A.2. Literature Review

Wang et al. [152] presented an approach where small IoT packets are aggregated into a larger packet in the switch data plane (see Fig. 18). The goal of performing this aggregation is to minimize the bandwidth overhead of packets' headers. The same authors [153] extended this work to relax some constraints related to the payload size and the number of aggregated packets. Similarly, Madureira et al. [155] proposed IoTP, a layer-2 communication protocol that enables the aggregation of IoT data in programmable switches. The solution gathers network information that includes the Maximum Transmission Unit (MTU), link bandwidths, underlying protocol, and delays. These properties are used to empower the aggregation algorithm.

A.3. Aggregation Schemes Comparison, Discussions, and Limitations

Fig. 18. IoT packets aggregation [152]. Frequent small IoT packets are aggregated by a P4 switch and encapsulated in a larger packet. Another switch across the WAN disaggregates the large packet to restore the original IoT packets. Such a mechanism prevents the frequent transmission of headers, and thus minimizes the bandwidth overhead.

Table XXV compares the aforementioned IoT aggregation schemes. [152] and [153] operate in the same way. Upon receiving a packet, the P4 switch parses its headers and identifies whether the packet is an IoT packet. If the packet is identified as an IoT packet, the switch parses and extracts the payload. Afterwards, the payload is stored in switch registers along with some other metadata, and the packet is dropped. Once packets are aggregated, the resulting packet is sent across the WAN to reach the remote server. Before the packet reaches the server, it is disaggregated by another P4 switch situated close to the server, and several packets identical to the original ones are generated. An important observation is that the aggregation/disaggregation processes are transparent to both the IoT devices and the servers; hence, no modifications are required on either end. The main advantages of [153] over [152] are: 1) packets can have different payload sizes; 2) the payload size is no longer limited to 16 bytes; 3) the number of packets is dynamic and only limited by the packet MTU; and 4) both the disaggregation and the aggregation run at line rate.
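The bandwidth argument and the aggregation/disaggregation round trip described above can be illustrated with a short Python sketch. It assumes equal-sized payloads and a hypothetical two-byte aggregation header (payload count and payload size); this layout and all function names are assumptions for illustration and are not the packet format of [152] or [153].

```python
from typing import List

HEADER_BYTES = 14 + 20 + 8  # Ethernet + IP + UDP, as in the 42-byte example


def header_overhead(payload_bytes: int, n_payloads: int = 1) -> float:
    """Fraction of the packet occupied by Ethernet/IP/UDP headers.

    The hypothetical two-byte aggregation header is ignored for simplicity.
    """
    return HEADER_BYTES / (HEADER_BYTES + payload_bytes * n_payloads)


def aggregate(payloads: List[bytes]) -> bytes:
    """Pack equal-sized payloads behind a count byte and a size byte."""
    size = len(payloads[0])
    if any(len(p) != size for p in payloads):
        raise ValueError("this sketch only handles equal-sized payloads")
    return bytes([len(payloads), size]) + b"".join(payloads)


def disaggregate(packet: bytes) -> List[bytes]:
    """Restore the original payloads from an aggregated packet."""
    count, size = packet[0], packet[1]
    body = packet[2:]
    return [body[i * size:(i + 1) * size] for i in range(count)]
```

For a 12-byte payload, `header_overhead(12)` is 42/54, or roughly 78%, matching the figure quoted in the background. Aggregating eight such payloads into one packet lowers the share of the headers to 42/138, or roughly 30%.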

A.4. Comparison between Server-based and Switch-based Aggregation

Table XXVI shows a comparison between switch-based and server-based packet aggregation. When aggregation is performed on the switch (ASIC), the throughput is higher while the latency and jitter are lower than those of the server-based approaches (e.g., switch CPU or x86-based server). On the other hand, server-based aggregation has more flexibility in defining the number of packets and the amount of data that can be aggregated.

B. Service Automation

B.1. Background

Low-power low-range IoT communication technologies (e.g., Bluetooth Low Energy (BLE) [265], Zigbee [266], Z-wave [267]) typically follow a peer-to-peer model. IoT devices in such technologies can be divided into two distinct types, peripheral and central. Peripheral devices, which consist of sensors and actuators, receive commands and execute subsequent actions. Central devices, on the other hand, run applications that analyze information collected from peripheral devices and subsequently issue commands.

The interconnection of devices and services can follow a Peer-to-Peer (P2P) model or a cloud-centric approach. In the P2P model, the automation service runs on the central device, which processes and analyzes sensor data published by peripheral devices in order to issue commands. The main advantages of the P2P model include the low end-to-end latency and the low power consumption, as devices are physically close to each other. The drawbacks of the P2P model include poor scalability, short reachability, and inflexibility of policy enforcement. The cloud-centric model addresses the limitations of the P2P model by adding a gateway node that connects peripheral devices to a middleware hosted on the cloud (Internet). While this approach solves the poor scalability and the policy enforcement flexibility issues, it incurs additional delays and jitter in collecting and reacting to data. Moreover, the middleware represents a single point of failure which can shut down the whole service in the event of an outage. With programmable switches, researchers are investigating in-network approaches to manage transactional relationships between low-power, low-range IoT devices.

TABLE XXVI
SWITCH-BASED AND SERVER-BASED PACKET AGGREGATION

| Feature | Switch-based (ASIC) | Server-based (CPU) |
|---|---|---|
| Throughput | Higher (e.g., [152], 100 Gbps, i.e., max capacity) | Lower (e.g., [152], 2.58 Gbps) |
| Latency and jitter | Lower | Higher |
| Count of packets to be aggregated | Not flexible (limited by the switch SRAM) | Arbitrary |
| Amount of data to be aggregated | Not flexible (limited by the switch SRAM, parsing capacity) | Arbitrary |

B.2. Literature Review

Uddin et al. [156] proposed Bluetooth Low Energy Service Switch (BLESS), a programmable switch that automates IoT application services by encoding their transactions in the data plane. It maintains link-layer connections to the devices to support P2P connectivity. The same authors proposed Muppet [157], an extension to BLESS to support multiple non-IP protocols.

B.3. Service Automation Comparison, Discussions, and Limitations

In BLESS, the data plane operations are performed at the Attribute Protocol (ATT) service layer, which consists of three operations: read attributes, write attributes, and attributes' notification. BLESS parses ATT packets, then processes and forwards them to the devices. The control plane, on the other hand, is responsible for address assignment, device and service discovery, policy enforcement, and subscription management. BLESS was implemented on a software switch (PISCES), and the results show that it combines the advantages of the P2P and the cloud-centric approaches. Specifically, it achieves small communication latency, low device power consumption, high scalability, and flexible policy enforcement. Muppet extended this approach to support multiple IoT protocols. The system studied two popular IoT protocols, namely BLE and Zigbee. Being in the middle, the Muppet switch is responsible for translating actions (e.g., on/off switch of a light bulb) between the Zigbee and BLE protocols, as well as logging important events to a database which resides on the Internet via the Hypertext Transfer Protocol (HTTP). Note that parsers and action policies have to be implemented for each supported protocol. Another difference from BLESS is that the implementation of Muppet's control plane leverages the ONOS controller with the Protocol Independent (PI) framework.

TABLE XXVII
SWITCH-BASED, P2P, AND CLOUD SERVICE AUTOMATION

| Feature | Switch-based | Peer-to-peer | Cloud-based |
|---|---|---|---|
| Latency | Low | Low | High |
| IoT energy | Low | Low | High |
| Scalability | High | Low | High |
| Reachability | High | Low | High |

B.4. Comparison between Server-based and Switch-based Service Automation

Table XXVII shows a comparison between switch-based, P2P, and cloud-based service automation. Generally, the switch-based approach overcomes the limitations of both the P2P and the cloud-based approaches. It achieves the low energy and latency characteristics of P2P while increasing scalability and reachability.

C. Summary and Lessons Learned

In the context of IoT, the surveyed works fall broadly into two categories, namely, packet aggregation and service automation. The goal of packet aggregation is to minimize the overhead of IoT packets' headers. Typically, headers in IoT packets represent a significant portion of the whole packet size. By aggregating several packets into a single packet, the bandwidth overhead is reduced. Future work should study the performance side-effects (e.g., delay, jitter, loss rate, retransmission) that aggregation causes to packets. Furthermore, timers should be implemented to avoid excessive delays resulting from waiting for enough packets to be aggregated.

With respect to service automation, the goal is to automate IoT application services by encoding their transactions in the data plane while improving scalability, reachability, energy consumption, and latency. Future work should design and develop translators for non-IP IoT protocols so that applications on various devices that run different protocols can exchange data. Additionally, production-grade software switches should be leveraged to support non-Ethernet IoT protocols.

Other works that involve IoT include flowlet-based stateful multipath forwarding [268] and an SDN/NFV-based architecture for IoT networks [269].

XI. CYBERSECURITY

Extensive research efforts have been devoted to deploying programmable switches to perform various security-related functions in the data plane. Such functions include heavy hitter detection, traffic engineering, DDoS attack detection and mitigation, anonymity, and cryptography. Fig. 19 demonstrates the difference between contemporary security appliances and programmable switches with respect to layer inspection in the OSI model. Although programmable switches are limited in computation power, they are capable of inspecting upper layers (e.g., the application layer) at line rate. Such functionality is not available in any of the existing solutions.

Fig. 19. Layers inspection in the OSI model. (a) Contemporary security appliances. (b) Programmable switch.

A. Heavy Hitter

A.1. Background

Heavy hitters are a small number of flows that constitute most of the network traffic over a certain amount of time. They are identified based on the port speed, network RTT, traffic distribution, application policy, and others. Heavy hitters increase the flow completion time for delay-sensitive mice flows, and represent the major source of congestion. It is important to promptly detect heavy hitters in order to react to them; for instance, to redirect them to a low priority queue, perform rate control and traffic engineering, block volumetric DDoS attacks, and diagnose congestion. Traditionally, packet sampling techniques (e.g., NetFlow) were used to detect heavy hitters. The main problem with such techniques is the limited accuracy due to the CPU and bandwidth overheads of processing samples in software. Advancements in programmable switches paved the way to detect heavy hitters in the data plane, which is not only orders of magnitude faster than sampling, but also enables additional applications (e.g., flow-size aware routing).

A.2. Literature Review

Sivaraman et al. [158] proposed HashPipe, a heavy hitter detection algorithm that operates entirely in the data plane. It detects the k heaviest flows within the constraints of programmable switches while achieving high accuracy. A subsequent work proposed by Harrison et al. [159] considers network-wide distributed heavy hitter detection. Furthermore, Kučera et al. [160] proposed Elastic Trie, a solution that detects hierarchical heavy hitters, in-network traffic changes, and superspreaders in the data plane. Hierarchical heavy hitters include the total activity of all traffic matching relevant IP prefixes. Basat et al. [161] proposed PRECISION, a heavy hitter detection algorithm that probabilistically recirculates a fraction of packets for a second pipeline traversal. The recirculation idea greatly simplifies the memory access pattern without significantly degrading throughput. Ding et al. [162] proposed an approach for incrementally deploying programmable switches in a network consisting of legacy devices with the goal of monitoring as many distinct network flows as possible. Tang et al. [163] proposed MV-Sketch, a solution that exploits the idea of majority voting to track the candidate heavy flows inside the sketch data structure. Finally, Silva et al. [164] proposed a solution that identifies elephant flows in Internet eXchange Point (IXP) networks.

A.3. Heavy Hitter Detection Comparison, Limitations, and Discussions

Table XXVIII compares the aforementioned heavy hitter schemes. The main criterion that differentiates the solutions is the selection and the implementation of the data structure. Hash tables and sketches are frequently used to store counters for heavy flows. Note that several variations of such data structures are used in the literature, mainly to tackle the memory-accuracy tradeoff; the choice of data structure affects the accuracy of the performed measurements. For example, probabilistic data structures provide only approximations.

TABLE XXVIII
HEAVY HITTER SCHEMES COMPARISON

| Scheme | Name | Core idea | Data structure | Network-wide | Adaptive thresholds | Approximations | Platform (HW, SW) |
|---|---|---|---|---|---|---|---|
| [158] | HashPipe | Maintains counts of heavy flows in a pipeline of hash tables | Hash tables | × | × | × | ✓ |
| [159] | N/A | Switches store the counts locally; a coordinator aggregates the results | Hash tables | ✓ | ✓ | ✓ | ✓ |
| [160] | Elastic Trie | Detects hierarchical heavy hitters using hashtable prefix tree | Prefix tree | × | ✓ | ✓ | ✓ |
| [161] | PRECISION | Recirculates a small fraction of packets to simplify memory access | Hash tables | × | × | ✓ | ✓ |
| [162] | N/A | Monitors distinct flows using HyperLogLog algorithm | HyperLogLog | ✓ | ✓ | ✓ | ✓ |
| [163] | MV-Sketch | Supports the queries of recovering all heavy flows in a sketch | Invertible sketches | ✓ | × | ✓ | ✓ |
| [164] | N/A | Identifies elephant flows using dynamic thresholds in IXPs | Hash tables | × | ✓ | × | ✓ |

TABLE XXIX
CRYPTOGRAPHY SCHEMES COMPARISON

| Scheme | Name | Core idea | Conf. | Integ. | Auth. | Computations (ASIC, CPU) | Algorithms | Platform (HW, SW) |
|---|---|---|---|---|---|---|---|---|
| [165] | N/A | Implementations of cryptographic hash functions | × | × | ✓ | ✓ | SipHash-2-4, Poly1305-AES, BLAKE2b, HMAC-SHA256-512 | ✓ |
| [166] | P4-IPsec | Implementation of host-to-site IPsec in P4 switches | ✓ | ✓ | ✓ | ✓ | AES-CTR, HMAC-MD5 | ✓ |
| [167] | P4-MACsec | Implementation of MACsec on P4 switches | ✓ | ✓ | × | ✓ | AES-GCM | ✓ |
| [168] | N/A | AES implementation using scrambled lookup table | ✓ | × | × | ✓ | AES-128, AES-192, AES-256 | ✓ |

In HashPipe, the programmable switch stores the flow identifiers and their byte counts in a pipeline of hash tables. HashPipe adapts the space saving algorithm described in [270]. The system was evaluated using an ISP trace provided by CAIDA (400,000 flows), and the results show that HashPipe needed only 80 KB of memory to identify the 300 heaviest flows with an accuracy of 95%. Another hashtable-based solution is Elastic Trie, which consists of a prefix tree that expands or collapses to focus only on the prefixes that grab a large share of the network traffic. The data plane informs the control plane about high-volume traffic clusters in an event-based push approach only when some conditions are met. Other systems explored different data structures for the task. For instance, in [162] the authors used the HyperLogLog algorithm [271], which approximates the number of distinct elements in a multiset. The solution is capable of detecting heavy hitters by using only partial input from the data plane.

Another important criterion is whether the scheme tracks heavy hitters across the whole network. For example, unlike HashPipe, which considers a single switch, [159] tracks network-wide heavy hitters. Tracking network-wide heavy hitters is important as some applications (e.g., port scanners, superspreaders, etc.) can go undetected when observed from a single location. Moreover, considering the results of each switch separately is not sufficient for detecting heavy hitters; flows might not exceed a threshold locally, but when the total volume is considered, the threshold might be crossed.
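The idea of keeping heavy flows in a pipeline of hash tables can be sketched in Python as follows. This is a simplified illustration in the spirit of HashPipe, not the authors' P4 code: it counts packets rather than bytes, uses a stage-salted CRC32 in place of the switch hash units, and all names and default sizes are assumptions.

```python
import zlib
from typing import List, Optional, Tuple

Slot = Optional[Tuple[str, int]]  # (flow key, count) or empty


class HashPipe:
    """Pipeline of hash tables; each stage holds one (key, count) per slot."""

    def __init__(self, stages: int = 2, slots: int = 64) -> None:
        self.tables: List[List[Slot]] = [[None] * slots for _ in range(stages)]

    def _index(self, stage: int, key: str) -> int:
        # Stage-salted CRC32 stands in for per-stage hash functions (assumption).
        return zlib.crc32(f"{stage}:{key}".encode()) % len(self.tables[stage])

    def update(self, key: str) -> None:
        """Process one packet of flow `key`."""
        carried: Tuple[str, int] = (key, 1)
        for stage, table in enumerate(self.tables):
            i = self._index(stage, carried[0])
            slot = table[i]
            if slot is None:
                table[i] = carried
                return
            if slot[0] == carried[0]:
                table[i] = (slot[0], slot[1] + carried[1])
                return
            # First stage: always insert the new flow and evict the old entry.
            # Later stages: keep the larger count and carry the smaller one.
            if stage == 0 or slot[1] < carried[1]:
                table[i], carried = carried, slot
        # Whatever is still carried after the last stage is dropped.

    def estimate(self, key: str) -> int:
        """Sum the counts recorded for `key` across all stages."""
        total = 0
        for stage, table in enumerate(self.tables):
            slot = table[self._index(stage, key)]
            if slot is not None and slot[0] == key:
                total += slot[1]
        return total
```

Because counters are only ever created and incremented by packets of their own flow, `estimate` can undercount a flow (when an entry is evicted and dropped) but never overcounts it, which reflects the memory-accuracy tradeoff discussed above.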

A.4. Comparison between P4-based and Traditional Heavy Hitter Detection

The main advantage of heavy hitter detection schemes in the data plane over sampling-based approaches is the ability to operate at line rate. This means that every packet is considered by the detection algorithm, which improves both the accuracy and the speed of detection. Moreover, additional applications that exploit reactive processing can be implemented. For instance, switches can perform flow-size-aware routing to redirect traffic upon detecting a heavy hitter.
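To make the idea of counting every packet in a pipeline of hash tables more concrete, the following is a simplified software sketch in the spirit of HashPipe [158]. It is an illustration under assumptions, not the authors' P4 program: the class name, the CRC-based index function, and the table sizes are hypothetical choices. The first stage always admits the incoming flow key; later stages keep the entry with the larger count and carry the smaller one forward.

```python
import zlib

class HashPipe:
    """Simplified pipeline of hash tables for heavy flow counting."""

    def __init__(self, stages=4, slots=64):
        self.slots = slots
        # Each slot holds None or a (key, count) tuple.
        self.tables = [[None] * slots for _ in range(stages)]

    def _index(self, stage, key):
        # A different hash per stage, derived by prefixing the stage number.
        return zlib.crc32(f"{stage}:{key}".encode()) % self.slots

    def update(self, key):
        # Stage 0: always insert the new key, evicting any other occupant.
        i = self._index(0, key)
        slot = self.tables[0][i]
        if slot is not None and slot[0] == key:
            self.tables[0][i] = (key, slot[1] + 1)
            return
        self.tables[0][i] = (key, 1)
        if slot is None:
            return
        carried = slot
        # Later stages: keep the larger count, carry the smaller one on.
        for s in range(1, len(self.tables)):
            i = self._index(s, carried[0])
            slot = self.tables[s][i]
            if slot is None:
                self.tables[s][i] = carried
                return
            if slot[0] == carried[0]:
                self.tables[s][i] = (slot[0], slot[1] + carried[1])
                return
            if carried[1] > slot[1]:
                self.tables[s][i], carried = carried, slot
        # Whatever is still carried after the last stage is dropped.

    def estimate(self, key):
        """Sum the counts recorded for the key across all stages."""
        total = 0
        for s, table in enumerate(self.tables):
            slot = table[self._index(s, key)]
            if slot is not None and slot[0] == key:
                total += slot[1]
        return total
```

Because counts are only ever dropped, never attributed to another flow, the estimate in this sketch can undercount a flow but cannot overcount it.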

B. Cryptography

B.1. Background

Performing cryptographic functions in the data plane is useful for a variety of applications (e.g., protecting Layer 2 with cryptographic integrity checks and encryption, mitigating hash collisions, etc.). Computations in cryptographic operations (e.g., hashing, encryption, decryption) are known to be complex and resource-intensive. The supported operations in switch targets and in the P4 language are limited to basic arithmetic (e.g., additions, subtractions, bit concatenation, etc.). Recently, a handful of works have started studying the possibility of performing cryptographic functions in the data plane.

B.2. Literature Review

The authors in [165] argue for the need to implement cryptographic hash functions in the data plane to mitigate potential attacks targeting hash collisions. Consequently, they presented prototype implementations of cryptographic hash functions on three different P4 target platforms (CPU, SmartNIC, NetFPGA SUME). Another work by Hauser et al. [166] implemented host-to-site IPsec in P4 switches. For simplification, only Encapsulating Security Payload (ESP) in tunnel mode with different cipher suites is implemented. The same authors also proposed P4-MACsec [167], an implementation of MACsec on P4 switches. MACsec is an IEEE standard for securing Layer 2 infrastructure by encrypting, decrypting, and performing integrity checks on packets.

The previous works delegated the complex computations to the control plane. Chen et al. [168] implemented the Advanced Encryption Standard (AES) in the data plane using scrambled lookup tables. AES is one of the most widely used symmetric cryptography algorithms; it applies several encryption rounds to 128-bit input data blocks.

B.3. Cryptography Schemes Comparison, Discussions and Limitations

Table XXIX compares the aforementioned cryptography schemes. With respect to hashing, P4 currently implements hash functions that do not have the characteristics of cryptographic hashing. For example, Cyclic Redundancy Check (CRC), which is commonly used in P4 targets, was originally developed for error detection. CRC can be easily implemented in embedded hardware, and is computationally much less complex than cryptographic hash functions (e.g., Secure Hash Algorithm (SHA)-256); however, it is not secure and has a high collision rate. Evaluation results in [165] show that 1) implementing cryptographic hash functions on a CPU is easy, but has high latency (several milliseconds); 2) SmartNICs have the highest throughput, but can only process packets up to 900 bytes; and 3) NetFPGA has the lowest latency, but cannot be integrated using native P4 features. The authors found that the performance of hashing is highly dependent on the application, the input type, and the hashing algorithm, and therefore there is no single solution that fits all requirements. However, P4 targets should benefit from the characteristics of each solution (CPU, SmartNICs, FPGA, and ASICs) to implement cryptographic hashing.

As for more complex protocol suites (e.g., IPsec), Hauser et al. [166] only implemented ESP in tunnel mode for simplification. The Security Policy Database (SPD) and the Security Association Database (SAD) are represented as match-action tables in the P4 switch. To avoid complex key exchange protocols such as the Internet Key Exchange (IKE), this work delegates runtime management operations to the control plane. Moreover, since encryption and decryption are not supported by P4, the authors relied on user-defined P4 externs to perform complex computations. Note that implementing user-defined externs is not applicable to ASICs (e.g., Tofino); consequently, the main CPU module of the switch is used for performing encryption/decryption computations, at the cost of increased latency and degraded throughput. The same ideas are applied to P4-MACsec by the same authors.

The system proposed by Chen et al. [168] has significant performance advantages as it is fully implemented in the data plane. The idea of the proposed system is to apply lookup tables that are permuted using an encryption key. The authors found that a single switch pipeline is capable of performing two AES rounds. Consequently, the system leverages packet recirculation, a technique that re-injects the packet into the pipeline. By doing so, it is possible to complete the 10 rounds of encryption required by the AES-128 algorithm using five pipeline passes. Note that recirculation uses loopback ports and hence is limited by their bandwidth. The implementation on the Tofino chip shows that ≈

B.4. Comparison between In-network and Contemporary Cryptography

Cryptographic primitives often require performing complex arithmetic operations on data. Implementing such computations on general-purpose servers is simple, since memory and processing units are not constrained. The literature has shown that there is a need to implement cryptographic functions in the data plane. For instance, cryptographic hash functions can significantly improve existing data plane applications with respect to collisions, and encryption can protect confidential information from being exposed to the public. However, switches have limitations when it comes to computing. The hash functions supported in P4 are non-cryptographic (e.g., CRC) and therefore produce collisions when the table is not large. Consequently, researchers are continuously investigating techniques to perform such operations in the data plane.
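One way to see why CRC is not a cryptographic hash is its algebraic structure: for messages of equal length, the CRC of the XOR of three messages equals the XOR of their individual CRCs. The following minimal sketch demonstrates this with CRC-32 from Python's standard `zlib` module; the message contents and the helper `xor_bytes` are hypothetical choices for illustration. An adversary can exploit this kind of predictability to craft inputs with related or colliding hash values, which a cryptographic hash function is designed to prevent.

```python
import zlib

def xor_bytes(*chunks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

# Three hypothetical flow keys of equal length.
a = b"flow-key-0001"
b = b"flow-key-0002"
c = b"flow-key-0003"

# The CRC of the combined message is predictable from the individual CRCs,
# without ever hashing the combined message itself.
predicted = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
actual = zlib.crc32(xor_bytes(a, b, c))
```

Here `predicted` and `actual` are equal, which would not be expected of a cryptographic hash such as SHA-256.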

C. Privacy and Anonymity

C.1. Background

Packets in a network carry information that can potentially identify users and their online behavior. Therefore, user privacy and anonymity have been extensively studied in the past (e.g., Tor and onion routing [272]). However, existing solutions have several limitations: 1) poor performance, since overlay proxy servers are maintained by volunteers and have no performance guarantees; 2) deployability challenges, since some solutions require modifying the whole Internet architecture, which is highly unlikely; 3) no clear partial deployment pathway; and 4) most solutions are software-based. Consequently, recent works started investigating methods that exploit programmable switches to develop partially deployable, low-latency, and lightweight anonymity systems.

With respect to anonymity and privacy in the network, a new class of attacks that target the topology requires the attacker to know the topology and understand its forwarding behavior. Such attacks can be mitigated by obfuscating (hiding) the topology from external users. P4-based schemes are also being developed to achieve this goal.

TABLE XXX
PRIVACY AND ANONYMITY SCHEMES COMPARISON

| Name/Scheme | Goal | Strategy | Platform (HW / SW) |
|---|---|---|---|
| NetHide [169] | Mitigate topology attacks | Topology obfuscation | × / × |
| PANEL [170] | Protect Internet users' identities | Source info rewriting | ✓ |
| ONTAS [171] | Protect PII in packet traces | Header fields hashing | ✓ |
| SPINE [172] | Protect Internet users' identities | Header fields concealing | ✓ |

C.2. Literature Review

Meier et al. [169] proposed NetHide, a P4-based solution that obfuscates network topologies to mitigate topology-centric attacks such as Link-Flooding Attacks (LFAs). Kim et al. [171], on the other hand, proposed the Online Network Traffic Anonymization System (ONTAS), a system that anonymizes traffic online using P4 switches.

Another line of research focused on protecting the identity of Internet users. Moghaddam et al. [170] proposed Practical Anonymity at the NEtwork Level (PANEL), a lightweight and low-overhead in-network solution that builds anonymity into the Internet forwarding infrastructure. Likewise, Datta et al. [172] proposed Surveillance Protection in the Network Elements (SPINE), a system that anonymizes traffic by concealing IP addresses and relevant TCP fields (e.g., sequence number) from adversarial Autonomous Systems (ASes) on the data plane.

C.3. Privacy and Anonymity Schemes Discussions

Table XXX compares the privacy and anonymity schemes. NetHide aims at mitigating attacks targeting the network topology. The solution formulates network obfuscation as a multi-objective optimization problem, and uses accuracy (hard constraints) and utility (soft constraints) as metrics. The system then uses an ILP solver and heuristics. The P4 switches in this system capture and modify tracing traffic at line rate. The specifics of the implementation were not disclosed, but the authors claim that the system was evaluated on realistic topologies (more than 150 nodes), and that more than 90% of link failures were detected by operators despite obfuscation.

ONTAS has a slightly different goal; it aims at protecting personally identifiable information (PII) in online traces. The system overcomes the limitations of existing systems, which either require network operators to anonymize packet traces before sharing them with other researchers and analysts, or anonymize traffic online but with significant overhead. ONTAS provides a policy language used by operators for expressing anonymization tasks, which makes the system flexible and scalable. The system was implemented and tested on a hardware switch, and results show that ONTAS entails 0% packet processing overhead and requires half the storage of existing offline tools. A limitation of this system is that it does not anonymize TCP/UDP field values. Another limitation is that it does not support applying multiple privacy policies concurrently.

Another line of research (i.e., PANEL, SPINE) focused on protecting the identities of Internet users. PANEL overcomes the performance limitations of popular anonymity systems (e.g., Tor), and does not require entirely modifying the Internet routing and forwarding protocols as proposed in [273] and [274]. Partial deployment is possible as PANEL can co-exist with legacy devices. The solution involves: 1) source address rewriting to hide the origin of the packet; 2) source information normalization (IP identification and TCP sequence randomization) to mitigate fingerprinting attacks; and 3) path information hiding (TTL randomization) to hide the distance to the original sender at any given vantage point.

Fig. 20. SPINE architecture [172]. Diagram labels: unmodified device, trusted entity 1, trusted entity 2, untrusted entity, {keys, version number}, original traffic, SPINE traffic.

As for SPINE, it does not require cooperation between switches and end-hosts, but assumes that at least two entities (typically two ASes or two ISPs) are trusted. Fig. 20 shows the SPINE architecture. The solution encrypts the IP addresses before the packets enter the intermediary ASes. Therefore, adversarial devices only see the encrypted addresses in the headers. It also encrypts the TCP sequence and ACK numbers to prevent packets from being attributed to flows. SPINE transforms IPv4 headers into IPv6 headers when packets leave the trusted entity and restores the IPv4 headers upon entering the trusted entity. These operations enable routing to be performed in intermediary networks. The encrypted IPv4 address is inserted in the last 32 bits of the IPv6 destination address. The encryption works by XORing the IP address with the hash of a pre-shared key and a nonce. The system uses SipHash since it is easily implemented in the data plane.
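The address-concealing step described above can be sketched in a few lines. This is an illustration under stated assumptions rather than SPINE's implementation: Python's standard library does not expose SipHash, so keyed BLAKE2s truncated to 32 bits stands in for it; the function names, the documentation prefix `2001:db8::/96`, and the way the nonce is passed around are all hypothetical.

```python
import hashlib
import ipaddress

def keystream(key: bytes, nonce: bytes) -> int:
    """32-bit one-time pad from a keyed hash (stand-in for SipHash)."""
    digest = hashlib.blake2s(nonce, key=key, digest_size=4).digest()
    return int.from_bytes(digest, "big")

def encrypt_ipv4(addr: str, key: bytes, nonce: bytes) -> int:
    """XOR the IPv4 address with the hash of the key and nonce."""
    return int(ipaddress.IPv4Address(addr)) ^ keystream(key, nonce)

def to_ipv6(prefix: str, encrypted_v4: int) -> ipaddress.IPv6Address:
    """Place the encrypted address in the last 32 bits of an IPv6 address."""
    network = ipaddress.IPv6Network(prefix)  # assumed to be a /96 prefix
    return ipaddress.IPv6Address(int(network.network_address) | encrypted_v4)

def from_ipv6(addr6: ipaddress.IPv6Address, key: bytes, nonce: bytes) -> str:
    """Recover the original IPv4 address at the other trusted entity."""
    encrypted_v4 = int(addr6) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(encrypted_v4 ^ keystream(key, nonce)))
```

A trusted entity that holds the same key and nonce can invert the transformation, for example `from_ipv6(to_ipv6("2001:db8::/96", encrypt_ipv4("198.51.100.7", key, nonce)), key, nonce)` returns `"198.51.100.7"`, while intermediary networks only see the IPv6 address with the concealed suffix.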

C.4. Privacy and Anonymity in Switch-based and Legacy Systems

Contemporary approaches that provide privacy and anonymity in the Internet use special routing overlay networks to hide the physical location of each node from other participants (e.g., Tor). Such approaches have performance limitations as proxy servers (overlays) are maintained by volunteers and have no performance guarantees. Moreover, they often require performing advanced encryption routines to obfuscate where the packet originated (e.g., the onion routing technique used by Tor involves encapsulating messages in several layers of encryption). On the other hand, approaches that are based on programmable switches often rely on header modification and simplified encryption and hashing to conceal information (e.g., SPINE [172]).

Fig. 21. Overview of Poise [175]. A compiler translates high-level policies into P4 programs and device configurations. Context packets are continuously sent from the clients to the network, where the switches enforce the policies.

D. Access Control

D.1. Background

The selective restriction of access to digital resources is known as access control in cybersecurity. Typically, access control begins with “authentication” in order to verify the identity of a party. Afterwards, “authorization” is enforced through policies that specify access rights to resources. To authenticate parties, methods such as passwords, biometric analysis, cryptographic keys, and others are used. With respect to authorization, methods such as ACLs are used to describe what operations are allowed on given objects.

With the advent of programmable switches, it is now possible to delegate authentication and authorization to the data plane. As a result, access can be promptly granted or denied at line rate, before reaching the target server. A clear advantage of this approach is that servers are no longer busy processing access verification routines, which increases their service throughput.

D.2. Literature Review

Datta et al. [173] presented P4Guard, a P4-based configurable firewall that acts based on predefined policies set by the controller. Kang et al. [175] presented a scheme that implements context-aware security policies (see Fig. 21). The policies are applicable to enterprise and campus networks with diverse devices, i.e., Bring Your Own Device (BYOD) (e.g., laptops, mobile devices, tablets, etc.).

Almain et al. [174] proposed delegating the authentication of end hosts to the data plane. The method is based on port knocking, in which hosts deliver a sequence of packets addressed to an ordered list of closed ports. If the ports match the ones configured by the network administrators, then the end host is authenticated, and subsequent packets are allowed. Finally, Bai et al. [176] presented P40f, a tool that performs OS fingerprinting on programmable switches and consequently applies security policies (e.g., allow, drop, redirect) at line rate.

TABLE XXXI
ACCESS CONTROL SCHEMES COMPARISON

| Scheme | Goal | Strategy | Scope | Limitations | Platform (HW / SW) |
|---|---|---|---|---|---|
| [173] | Simple firewall-based access control | Translates from high-level security policies to table entries | Header-based firewall (layer-4) | Lacks NGFW capabilities | ✓ |
| [174] | User authentication in the data plane | Uses port knocking technique for authentication | Unencrypted sequence-based authentication | Unencrypted sequence vulnerable to packet sniffing | ✓ |
| [175] | Context-aware policies enforcement | Translates from high-level security policies to P4 programs | CAS dynamic policies based on runtime contexts | External encryptions are slow; lack of authentication | ✓ |
| [176] | OS fingerprinting and policy enforcement | Compares TCP/IP headers to a fingerprint database file | Uses p0f to filter connections | Lack of advanced built-in actions (e.g., rate-limiting) | ✓ |
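The port-knocking authentication described in the literature review above amounts to a small per-host state machine, which a switch could keep in registers indexed by source address. The following software sketch illustrates the logic only; the class name, the knock sequence, and the restart-on-mismatch behavior are assumptions, not details taken from [174].

```python
class PortKnockAuthenticator:
    """Tracks, per source IP, progress through a secret port sequence."""

    def __init__(self, sequence):
        if not sequence:
            raise ValueError("knock sequence must not be empty")
        self.sequence = list(sequence)
        self.progress = {}          # source IP -> index of next expected port
        self.authenticated = set()  # source IPs that completed the sequence

    def on_packet(self, src_ip, dst_port):
        """Return True if the packet should be forwarded, False if dropped."""
        if src_ip in self.authenticated:
            return True
        step = self.progress.get(src_ip, 0)
        if dst_port == self.sequence[step]:
            step += 1
            if step == len(self.sequence):
                # Sequence completed: later packets from this host pass.
                self.authenticated.add(src_ip)
                self.progress.pop(src_ip, None)
            else:
                self.progress[src_ip] = step
        else:
            # Wrong port: restart, counting it only if it is the first knock.
            self.progress[src_ip] = 1 if dst_port == self.sequence[0] else 0
        # Knock packets themselves target closed ports and are not forwarded.
        return False
```

Note that the state is keyed only by source IP and the knocks travel in cleartext, which is exactly why the approach is exposed to spoofing and sniffing.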

D.3. Access Control Comparison, Discussions, and Limitations

Table XXXI compares the aforementioned access control schemes. P4Guard provides access control by translating high-level security policies into table entries. Note that P4Guard only operates up to the transport layer (e.g., source/destination IP addresses, source/destination ports, protocol, etc.), similar to a traditional firewall. While programmable switches provide increased flexibility in the parser (e.g., parsing beyond the transport layer) and in the packet processing logic, P4Guard did not leverage such capabilities. It would be interesting to investigate additional capabilities such as those enabled by next-generation firewalls (NGFW).

The solution in [174] controls access by performing authentication in the data plane. The solution has several limitations since it relies on port knocking, a technique that has several security implications. For instance, programmable switches do not use cryptographic hashes, making the solution vulnerable to IP address spoofing attacks. Additionally, unencrypted port knocking is vulnerable to packet sniffing. Furthermore, port knocking relies on security through obscurity.

In [175], the scheme dynamically enforces access control for users based on contexts (e.g., if the user's device uses Secure Shell (SSH) 2.0 or higher, then the switch forwards the packets of this flow; otherwise, the switch drops the packets). The scheme requires user devices to run an application that communicates with the switch using a custom protocol (context packets). The context packets are generated on a per-flow basis. The switch tracks flows using a match-action table and registers in the data plane. The actions that can be applied to a packet are dropping, allowing, and forwarding to other appliances for deep packet inspection. Data packets are not modified. Evaluations show that the proposed approach can operate (install new flows and update rules) with minimal latency, even under heavy DoS attacks. On the other hand, such attacks can decimate similar SDN-based systems. One of the main drawbacks of the proposed system is the lack of authentication, integrity, and confidentiality of the context packets. Thus, the system can be subject to attacks such as snooping (i.e., eavesdropping) on the communication between user devices and the switch, impersonation, and others.

Finally, [176] proposes fingerprinting OSs in the data plane. The main motivation behind this work is that software-based passive fingerprinting tools (e.g., p0f [275]) are neither practical nor sufficient with large amounts of traffic on high-speed links. Furthermore, out-of-band monitoring systems cannot promptly take actions (e.g., drop, forward, rate-limit) on traffic at line rate. The main drawback of the solution is that it lacks sophisticated policies that involve rate-limiting traffic.

D.4. Comparison between Switch-based and Server-based Access Control

Controlling access to resources often starts with authentication. While server-based approaches are more flexible in the methods of authentication they can provide, they typically require client connections to reach the server before the communication starts. In switch-based approaches, the authentication can be done in-network at the edge, eliminating unnecessary latency incurred from traversing the network and from software processing.

Access to resources can be controlled after fingerprinting end-hosts' OSs. Software-based passive fingerprinting tools cannot keep up with high loads (gigabits/s links). The literature has shown that such tools lead to a 38% degradation in throughput [276]. Additionally, such tools are out-of-band, meaning that it is not possible to apply policies on traffic (e.g., after fingerprinting an OS). On the other hand, switch hardware is able to perform OS fingerprinting and apply security policies at line rate.

Context-aware policies applied on nodes (clients/servers) have only local visibility. A newer approach is to use a centralized SDN controller (e.g., [277]), but such a scheme is vulnerable to control plane saturation attacks and is subject to increased delays. Switch-based schemes, on the other hand, are able to provide access control at line rate.

E. Defenses

E.1. Background

DDoS attacks remain among the top security concerns despite the continuous efforts towards the development of detection and mitigation schemes. This concern is exacerbated not only by the frequency of said attacks, but also by their high volumes and rates. Recent attacks (e.g., [278, 279]) reached the order of terabits per second, a rate that existing defense mechanisms cannot keep up with.

TABLE XXXII
DEFENSES SCHEMES COMPARISON

| Name & scheme | Mitigated attacks | Attack coverage (Specific / Generic) | External computations | Network-wide | Limitations | Platform (HW / SW) |
|---|---|---|---|---|---|---|
| NETHCF [177] | IP-spoofing | ✓ | ✓ | × | Hop-counts incorrectness with the presence of NAT | ✓ |
| FastFlex [178] | Availability attacks | ✓ | × | ✓ | Cross-domain federation complexity and security | ✓ |
| [179] | Sensitivity attacks | ✓ | × | × | Limited evaluation on complex data plane systems | ✓ |
| [180] | SIP DDoS | ✓ | ✓ | × | No support for encrypted packets (e.g., SIP/TLS) | ✓ |
| [181] | DDoS anomalies | ✓ | × | × | Not adaptable to traffic patterns (fixed thresholds) | ✓ |
| ML-Pushback [182] | DDoS anomalies | ✓ | ✓ | × | Depends heavily on external computations | × / × |
| [183] | SYN floods | ✓ | ✓ | × | Lack of cryptographic hash functions | ✓ |
| Poseidon [184] | Volumetric DDoS | ✓ | ✓ | × | Human intervention for writing the defense policies | ✓ |
| [185] | Volumetric and stealthy DDoS | ✓ | ✓ | × | Only synthetic evaluations; no extensive experimentation | ✓ |
| NetWarden [186] | Network covert channels | ✓ | ✓ | × | Slowpath/fastpath communication latency | ✓ |
| [187] | ECN protocol abuse | ✓ | × | × | Small subset of attack space | ✓ |
| Ripple [188] | Link-flooding | ✓ | × | ✓ | Lack of comparison with other P4 approaches | ✓ |

There are two main concerns with existing defense methods handled by end-hosts or deployed as middlebox functions on x86-based servers. First, they dramatically degrade throughput and increase latency and jitter, impacting the performance of the network. Second, they present severe consequences on the network operation when they are installed at the last mile (i.e., far from the edge).

The escalation of volumetric DDoS attacks and the lack of robust and efficient defense mechanisms motivated the idea of architecting defenses into the network. Up until recently, in-network security methods were restricted to simple access control lists encoded into the switching and routing devices. The main reason is that the data plane was fixed in function, which prevented the development of customized and dynamic algorithms that can assist in detecting attacks. With the advent of programmable data planes, it is possible to develop systems that detect and mitigate various types of attacks without imposing significant overhead on the network.

E.2. Literature Review

Li et al. [177] presented NETHCF, a Hop-Count Filtering (HCF) defense mechanism that mitigates spoofed IP traffic. HCF schemes filter spoofed traffic with an IP-to-hop-count mapping table. Another attack-specific scheme, proposed by Febro et al. [180], mitigates distributed SIP DDoS in the data plane. Furthermore, Scholz et al. [183, 280] presented a scheme that defends against SYN flood attacks.

Alternatively, some schemes are generic and aim at addressing multiple attacks concurrently. For instance, Xing et al. [178] proposed FastFlex, an abstraction that architects defenses into the network paths based on changing attacks. Kang et al. [179] presented an automated approach for discovering sensitivity attacks targeting data plane programs. Sensitivity attacks in this context are intelligently crafted traffic patterns that exploit the behavior of the P4 program. Lapolli et al. [181] implemented a mechanism to perform real-time DDoS attack detection based on entropy changes. Such changes are used to compute anomaly detection thresholds. Mi et al. [182] proposed ML-Pushback, a P4-based implementation of the Pushback method [281].

Zhang et al. [184] proposed Poseidon, a system that mitigates volumetric DDoS attacks through programmable switches. It provides a language in which operators can express a range of security policies. Friday et al. [185] proposed a unified in-network DDoS detection and mitigation strategy that considers both volumetric and slow/stealthy DDoS attacks. Xing et al. [186] proposed NetWarden, a broad-spectrum defense against network covert channels in a performance-preserving manner. The method in [187] models a stateful security monitoring function as an Extended Finite State Machine (EFSM) and expresses the EFSM using P4 abstractions. Finally, Ripple [188] provides decentralized link-flooding defense against dynamic adversaries.
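The IP-to-hop-count check behind HCF can be sketched as follows. The sketch assumes the commonly described approach of inferring the hop count from the received TTL by guessing the sender's initial TTL; the set of initial TTL values, the function names, and the tolerance parameter are illustrative assumptions and are not taken from NETHCF [177].

```python
# Assumed set of default initial TTL values used by common operating systems.
COMMON_INITIAL_TTLS = (32, 64, 128, 255)

def infer_hop_count(observed_ttl):
    """Guess the initial TTL as the smallest common value >= observed TTL."""
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial - observed_ttl

def is_spoofed(src_ip, observed_ttl, ip2hc, tolerance=0):
    """Compare the inferred hop count with the IP-to-hop-count table."""
    expected = ip2hc.get(src_ip)
    if expected is None:
        # Unknown source: this sketch has no basis for filtering it.
        return False
    return abs(infer_hop_count(observed_ttl) - expected) > tolerance

# Hypothetical mapping learned from legitimate traffic.
ip2hc = {"192.0.2.7": 12}
```

With this table, a packet claiming source `192.0.2.7` that arrives with TTL 52 is consistent with 12 hops, whereas one arriving with TTL 120 implies 8 hops and would be flagged. As Table XXXII notes, such inference can be wrong in the presence of NAT.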

E.3. Defense Schemes Comparison, Discussions, and Limitations

Table XXXII compares the aforementioned defense schemes. Broadly, defense schemes can be grouped into two main categories: attack-specific and generic. The attack-specific category consists of works that address a specific attack (e.g., NETHCF for IP spoofing, [180] for SIP DDoS, etc.), while the generic category aims at addressing various types of attacks (e.g., FastFlex for various availability attacks, Ripple for link-flooding attacks, etc.).

The significant advantage of architecting defenses in the data plane is the performance improvement of the application. For instance, NETHCF is motivated by the fact that traditional HCF-based schemes are implemented on end-hosts, which delays the filtering of spoofed packets and increases the bandwidth overhead. Moreover, since traditional schemes are implemented in server-based middleboxes, low latency and minimal jitter are hard to achieve. Similarly, FastFlex advocates offloading the defenses to the data plane. Specifically, it tackles the following key challenges that are faced when programming defenses in the data plane: 1) resource multiplexing; 2) optimal placement; 3) distributed control; and 4) dynamic scaling.

When deploying defenses in the data plane, operators must be aware of the capabilities of the constrained targets. Many operations that require extensive computations cannot be easily implemented in the data plane. Existing works either approximate the computations in the data plane (considering the trade-off between computation complexity and measurement accuracy), or delegate the computations to external processors (e.g., CPU on the switch, external server, SDN controller, etc.). For instance, NETHCF decouples the HCF defense into a cache running in the data plane and a mirror in the control plane. The cache serves the legitimate packets at line rate, while the mirror processes the missed packets, maintains the IP-to-hop-count mapping table, and adjusts the state of the system based on network dynamics. In Poseidon, the defense primitives are partitioned to be executed on switches and on servers, based on their properties. On the other hand, in [181], the authors estimated the entropies of source and destination IP addresses of incoming packets for consecutive partitions (observation windows) in the data plane, without consulting external devices.

Network-wide defenses are those that are not restricted to a single switch, and require multiple switches to cooperate in the attack detection and mitigation phases. Such cooperation significantly improves the accuracy and the promptness of the detection. More details on network-wide data plane systems are given in Section XIII-D.

Finally, Table XXXII lists some limitations of the existing schemes, which can be explored in future work to advance the state of the art.

E.4. Comparison between P4-based and Traditional Defense Schemes

Network attacks such as large-scale DDoS and link flooding may have a substantial impact on network operation. For such attacks, server-based defenses deployed at the last mile are problematic and inherently insufficient, especially when attacks target the network core. Moreover, it is not feasible to detect and mitigate large volumes of attack traffic (e.g., SYN floods) on end-hosts without impacting the throughput of the network. When defenses are architected into the network (i.e., detection and mitigation are programmed into the forwarding devices), it is easy to detect, throttle, or drop suspicious traffic at any vantage point, at line rate.
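As one concrete instance of detection logic that can be programmed into forwarding devices, the window-based entropy estimation attributed to [181] in Section E.3 can be sketched in software. The sketch computes exact Shannon entropy in floating point, which a switch would have to approximate; the function names and the window size are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def window_entropies(packets, window):
    """Entropy of source and destination IPs per observation window.

    packets is a list of (src_ip, dst_ip) tuples; a trailing partial
    window is ignored.
    """
    results = []
    for start in range(0, len(packets) - window + 1, window):
        chunk = packets[start:start + window]
        results.append((
            shannon_entropy(src for src, _ in chunk),
            shannon_entropy(dst for _, dst in chunk),
        ))
    return results
```

A detector could then compare each window's entropies against thresholds derived from earlier windows and flag sharp changes; how those thresholds are chosen is outside the scope of this sketch.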

F. Summary and Lessons Learned

In the context of cybersecurity, a wide range of works leveraged programmable switches to achieve the following goals: 1) detect heavy hitters and apply countermeasures; 2) execute cryptographic primitives in the data plane to enable further applications; 3) protect the identity and the behavior of end-hosts, as well as obfuscate the network topology; 4) enforce access control policies in the network while considering network dynamics; and 5) architect defenses in the data plane to accelerate the detection and mitigation processes.

Identifying heavy hitters at line rate has several advantages. Recent works considered various data structures and streaming algorithms to detect heavy hitters. Future systems could explore more complex data structures that reduce the amount of state storage required on the switches. Furthermore, novel systems must minimize the false positives and the false negatives compared to both P4-based and legacy heavy hitter detection systems. Finally, new schemes should explore strategies for incremental deployment while maximizing flow visibility across the network.

There is an absolute necessity to implement cryptographic functions (e.g., hash, encrypt, decrypt) in the data plane. Such functions can be used by various applications that require low hashing collisions (e.g., load balancing) and strong data protection. Most existing efforts delegate the complex computations to the control plane. However, recent systems have demonstrated that AES, a well-known symmetric key encryption algorithm, can be implemented in the data plane.

Another interesting line of work provided privacy and anonymity to the network. Recent efforts obfuscated the network topology in order to mitigate topology-centric attacks (e.g., LFA). Such systems must preserve the practicality of path tracing tools, while being robust against obfuscation inversion. Additionally, link failures in the physical topology should remain visible after obfuscation. Furthermore, when randomizing identifiers to achieve session unlinkability, the identifiers must fit into the small fixed header space so that compatibility with legacy networks is preserved. Other efforts considered rewriting source information and concealing headers to protect the identity of Internet users.

Finally, access control methods and in-network defenses were proposed. Future access control schemes should explore further in-network methods to authenticate users. Additionally, since switches are capable of inspecting upper-layer headers, it is worth exploring offloading some next-generation firewall functionalities to the data plane. For instance, in [146], the authors proposed a system that allows searching for keywords in the payload of the packet. Similar techniques could be leveraged to achieve URL filtering at line rate. Additionally, schemes should mitigate stealthy DDoS attacks.

XII. NETWORK TESTING

Although programmable switches provide flexibility in defining the packet processing logic, they introduce potential risks of having erroneous and buggy programs. Such bugs may cause fatal damage, especially when they are unexpectedly triggered in production networks. In such scenarios, the network starts experiencing a degradation in performance as well as disruption in its operation. Bugs can occur in various phases of the P4 program development workflow (e.g., in the P4 program itself, in the controller updating data plane table entries, in the target compiler, etc.). Bugs usually manifest after processing a sequence of packets with certain combinations not envisioned by the designer of the code. This section gives an overview of the troubleshooting and verification schemes for P4 programmable networks.

TABLE XXXIII
TROUBLESHOOTING SCHEMES COMPARISON

| Name & scheme | Core idea | Fault detection (Passive / Proactive) | Memory requirements | Platform (HW / SW) |
|---|---|---|---|---|
| P4DB [189] | On-the-fly runtime debugging using watch, break, and next primitives | Passive | High | ✓ |
| P4Tester [190] | Probing-based troubleshooting using BDD | Proactive | Low | ✓ |
| [191] | Targets' behavior examination when undesired actions are triggered | N/A | N/A | ✓ / ✓ |
| [192] | Execution paths profiling using Ball-Larus encoding | Passive | Low | ✓ |
| KeySight [193] | Probing-based troubleshooting using PEC | Proactive | Low | ✓ |

A. Troubleshooting

A.1. Background

Network troubleshooting has drawn intensive research interest. Previous efforts are mainly based on passive packet behavior tracking through monitoring technologies (e.g., NetSight [282], EverFlow [283]). Other techniques (e.g., Automatic Test Packet Generation (ATPG) [284]) send probing packets to proactively detect network bugs. Such techniques have two main problems. First, the number of probe packets increases exponentially as the size of the network increases. Second, the coverage is limited by the number of probe-generating servers. Despite the flexibility that programmable switches offer, writing data plane programs increases the chance of introducing bugs into the network. Programs are inevitably prone to faults which could significantly compromise the performance of the network and incur high penalty costs.

A.2. Literature Review

Zhang et al. [189] proposed P4DB, an on-the-fly runtime debugging platform. The system debugs P4 programs in three levels of visibility by provisioning operator-friendly primitives: watch, break, and next. Zhou et al. [190] proposed P4Tester, a troubleshooting system for data plane runtime faults. It generates an intermediate representation of P4 programs and table rules based on the BDD data structure. Dumitru et al. [191] examined how three different targets, BMv2, P4-NetFPGA, and Barefoot's Tofino, behave when undesired behaviours are triggered. Kodeswaran et al. [192] proposed a data plane primitive for detecting and localizing bugs as they occur in real time. Finally, Zhou et al. [193] proposed KeySight, a platform that troubleshoots programmable switches with high scalability and high coverage. It uses the Packet Equivalence Class (PEC) abstraction when generating probes.

A.3. Troubleshooting Schemes Comparison, Discussions, and Limitations

Table XXXIII compares the aforementioned troubleshooting schemes. Essentially, the schemes either passively track how packets are processed inside switches (e.g., [189, 192]) or diagnose faults by injecting probes (e.g., [190, 193]). The main limitation of passive detection is that schemes can only detect rule faults that have been triggered by existing packets, and cannot check the correctness of all table rules. On the other hand, probing-based schemes may incur large control and probe overheads.

Examples of probing-based schemes include P4Tester and KeySight. P4Tester generates an intermediate representation of P4 programs and table rules based on the BDD data structure. Afterwards, it performs an automated analysis to generate probes. Probes are sent using source routing to achieve high rule coverage while maintaining low overheads. The system was prototyped on a hardware switch (Tofino), and results show that it can check all rules efficiently and that the probe count is smaller than that of server-based probe injection systems (i.e., ATPG and Pronto).

Other schemes that use passive fault detection (e.g., P4DB) assume that packets consistently trigger the runtime bugs. P4DB debugs P4 programs in three levels of visibility by provisioning operator-friendly primitives: watch, break, and next. P4DB does not require modifying the implementation of the data plane. It was implemented and evaluated on a software switch (BMv2), and the results show that it is capable of troubleshooting runtime bugs with a small throughput penalty and little latency increase.

Another important criterion that differentiates the troubleshooting schemes is the memory footprint they require. Some schemes (e.g., P4DB) require more memory than others (e.g., KeySight), which bound their memory usage.

Finally, the work in [191] is different from the others. The authors examined how three different targets, BMv2, P4-NetFPGA, and Barefoot's Tofino, behave when undesired behaviours are triggered. The authors first developed buggy programs in order to observe the actual behavior of the targets. Then, they examined the most complex P4 program publicly available, switch.p4, and found that it can be exploited when attackers know the specifics of the implementation. In summary, the paper suggests that BMv2 leaks information from previous packets. This behavior is not observed with the other two targets. Furthermore, the authors were able to perform privilege escalation on switch.p4 due to a header intended to ensure communication between the CPU and the P4 data plane.

A.4. Comparison Legacy vs. P4-based Debugging

In legacy networks, network devices are equipped with fixed-function services that operate on standard protocols. Troubleshooting these networks often involves testing protocols and typical data plane functions (e.g., layer-3 routing) through rigid probing. On the other hand, with programmable networks, since operators have the flexibility of defining custom data plane functions and protocols, testing is more complex and is program-dependent. Probing-based approaches should craft patterns depending on the deployed P4 program. Other approaches proposed primitives that increase the levels of visibility when debugging P4 programs. Research work extracted from the literature shows that it is essential to develop flexible mechanisms that operate dynamically on diverse P4 programs and targets.

B. Verification

B.1. Background

Program verification consists of tools and methods that ensure correctness of programs with respect to specifications and properties. Verification of P4 programs is an active area as bugs can cause faults that have drastic impacts on the performance and the security of networking systems. Static P4 verification handles programs before deployment to the network, and hence, cannot detect faults that occur at runtime. On the other hand, runtime verification uses passive measurements and proactive network testing. This section describes the major verification work pertaining to P4 programs.

B.2. Literature Review

Lopes et al. [194] proposed P4NOD, a tool that compiles P4 specifications to Datalog rules. The main motivation behind this work is that existing static checking tools (e.g., Header Space Analysis (HSA) [285], VeriFlow [286]) are not capable of handling changes to forwarding behaviors without reprogramming tool internals. The authors introduced the "well-formedness" bugs, a class of bugs arising from the capability to modify and add headers.

Another interesting work is ASSERT-P4 [195, 196], a network verification technique that checks at compile-time the correctness and the security properties of P4 programs. ASSERT-P4 offers a language with which programmers express their intended properties as assertions. After the program is annotated, a symbolic execution takes place, with all the assertions being checked while the paths are tested.

Further, Liu et al. [197] proposed p4v, a practical verification tool for P4. It allows the programmer to annotate the program with Hoare logic clauses in order to perform static verification. To improve scalability, the system suggests adding assumptions about the control plane and domain-specific optimizations. The control plane interface is manually written by the programmer and is not verified, which makes it error-prone and cumbersome. The authors evaluated p4v on both open source and proprietary P4 programs (e.g., switch.p4) that have different sizes and complexities.

Nötzli et al. [198] proposed p4pktgen, a tool that automatically generates test cases for P4 programs using symbolic execution and concrete paths. The tool accepts as input a JSON representation of the P4 program (output of the p4c compiler for BMv2), and generates test cases. These test cases consist of packets, table configurations, and expected paths. Similarly, Lukács et al. [199] described a framework for verifying functional and non-functional requirements of protocols in P4. The system translates a P4 program into a versatile symbolic formula to analyze various performance costs. The proposed approach estimates the performance cost of a P4 program prior to its execution.

Stoenescu et al. [200] proposed Vera, a symbolic execution-based verification tool for P4 programs. The authors argue that a data plane program should be verified before deployment to ensure safe operation. Vera accepts as input a P4 program, and translates it to a network verification language, SEFL. It then relies on SymNet [287], a network static analysis tool based on symbolic execution, to analyze the behavior of the resulting program. Essentially, Vera generates all possible packet layouts after inspecting the program's parser and assumes that the header fields can accept any value. Afterwards, it tracks the paths taken when processing these packets in the program, following all branches to completion. To improve scalability, Vera utilizes a novel match-forest data structure to optimize updates and verification time. Parsing/deparsing errors, invalid memory accesses, and loops, among others, can be detected by Vera.

A different approach that uses reinforcement learning is P4RL [201], a fuzz testing system that automatically verifies P4 switches at runtime. The authors described a query language, p4q, in which operators express their intended switch behavior. A prototype that executes verification on a layer-3 switch was implemented, and results show that P4RL detects various bugs and outperforms the baseline approach.

Finally, Dumitrescu et al. [202] proposed bf4, an end-to-end P4 program verification tool. It aims at guaranteeing that deployed P4 programs are bug-free. First, bf4 finds potential bugs at compile-time. Second, it automatically generates predicates that must be satisfied by the controller whenever a rule is to be inserted. Third, it proposes code changes if additional bugs remain reachable. bf4 executes a monitor at runtime that inspects the rules inserted by the controller and raises an exception whenever a predicate is not satisfied. The authors executed bf4 on various data plane programs and discovered interesting bugs that were not detected by state-of-the-art approaches.
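The runtime part of this workflow (a monitor that checks controller-inserted rules against predicates and raises an exception on a violation) can be illustrated with a minimal Python sketch. It is not bf4's implementation: the `Rule` structure, the `RuleMonitor` class, the table name `ipv4_lpm`, and the example port-range predicate are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical table entry as a controller might insert it."""
    table: str
    match: Dict[str, int]
    action: str
    params: Dict[str, int]


class PredicateViolation(Exception):
    """Raised when an inserted rule does not satisfy a predicate."""


class RuleMonitor:
    """Checks every rule against per-table predicates before accepting it."""

    def __init__(self) -> None:
        self._predicates: Dict[str, List[Callable[[Rule], bool]]] = {}
        self.installed: List[Rule] = []

    def require(self, table: str, predicate: Callable[[Rule], bool]) -> None:
        # In the surveyed tool the predicates are generated automatically;
        # here they are registered by hand (an assumption of this sketch).
        self._predicates.setdefault(table, []).append(predicate)

    def insert(self, rule: Rule) -> None:
        for predicate in self._predicates.get(rule.table, []):
            if not predicate(rule):
                raise PredicateViolation(f"rule rejected: {rule}")
        self.installed.append(rule)


if __name__ == "__main__":
    monitor = RuleMonitor()
    # Hypothetical predicate: set_nhop must carry an egress port in [0, 64).
    monitor.require(
        "ipv4_lpm",
        lambda r: r.action != "set_nhop" or 0 <= r.params.get("port", -1) < 64,
    )
    monitor.insert(Rule("ipv4_lpm", {"dst": 0x0A000001}, "set_nhop", {"port": 3}))
    try:
        monitor.insert(Rule("ipv4_lpm", {"dst": 0x0A000002}, "set_nhop", {"port": 512}))
    except PredicateViolation as err:
        print("blocked:", err)
```

The design point is that the check sits on the rule-insertion path, so a rule that would make a known bug reachable never reaches the data plane.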

B.3. Verification Schemes Discussions

Table XXXIV compares the aforementioned verification schemes. Essentially, some schemes translate P4 programs to verification languages and engines. For instance, in [194], P4 programs are translated to Datalog to verify reachability and well-formedness. Similarly, in [197], P4 programs are converted into Guarded Command Language (GCL) models, and then the theorem prover Z3 is used to verify that several safety, architectural, and program-specific properties hold. Other schemes (e.g., p4pktgen, Vera) use symbolic execution to generate test cases for P4 programs.

The verification schemes were evaluated on different P4 programs from the literature. A program that was evaluated by most schemes is switch.p4, which implements various networking features needed for typical cloud data centers, including Layer 2/3 functionalities, ACL, QoS, etc. It is recommended that future schemes evaluate switch.p4 as well as other programs from the literature. Finally, P4RL detects path-related inconsistencies between the data and control planes.

TABLE XXXIV
VERIFICATION SCHEMES COMPARISON

Scheme | Name | Engine, language | Evaluated programs | Inconsistency detection
[194] | P4NOD | NOD | 2 | ×
[195] | ASSERT-P4 | KLEE | 5 | ×
[197] | p4v | Z3 | 23 | ×
[198] | p4pktgen | SMT | 4 | ×
[199] | N/A | Pure | 0 | ×
[200] | Vera | SEFL | 11 | ×
[201] | P4RL | DDQN | 1 | ✓
[202] | bf4 | Z3 | 21 | ×

B.4. P4-based and Traditional Network Verification

Traditional verification techniques that address the security properties in computer networks are mainly related to host reachability, isolation, blackholes, and loop-freedom. Techniques that check for the aforementioned properties include Anteater [288], which models the data plane as boolean functions to be used in a Boolean Satisfiability Problem (SAT) solver, NetPlumber [289], which uses header space algebra [285], and others (e.g., VeriFlow [286], DeltaNet [290], Flover [291], and VMN [292]).

Since P4 programs incorporate customized protocols and processing logic to be used in the data plane, traditional tools are not capable of handling changes to forwarding behaviors without reprogramming their internals. Therefore, verification techniques in programmable networks rely on analyzing the P4 programs themselves since they define the behavior of the data plane.
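To make "analyzing the program itself" concrete, the following Python sketch enumerates every header layout a toy parser can produce and follows every reachable branch, flagging reads of headers that are not valid. This mirrors, in a very simplified way, the idea attributed to Vera above (generate all packet layouts, then follow all branches). It is an exhaustive enumeration over a hand-written model rather than real symbolic execution, and the parser layouts, branch names, and the injected bug are hypothetical.

```python
from itertools import product
from typing import FrozenSet, List, Tuple

# Hypothetical parser model: each layout is the set of headers marked valid.
PARSER_LAYOUTS: List[FrozenSet[str]] = [
    frozenset({"ethernet"}),
    frozenset({"ethernet", "ipv4"}),
    frozenset({"ethernet", "ipv4", "tcp"}),
]

# Hypothetical control block: (branch name, guard header, headers read).
# The guard is the header whose validity the program checks before taking
# the branch; the reads are the headers the branch body touches.
BRANCHES: List[Tuple[str, str, Tuple[str, ...]]] = [
    ("l2_forward", "ethernet", ("ethernet",)),
    ("l3_route", "ipv4", ("ipv4",)),
    ("acl_on_port", "ipv4", ("ipv4", "tcp")),  # bug: tcp validity never checked
]


def find_invalid_reads() -> List[Tuple[str, str, Tuple[str, ...]]]:
    """Return (branch, header, layout) for every read of an invalid header."""
    findings = []
    for layout, (name, guard, reads) in product(PARSER_LAYOUTS, BRANCHES):
        if guard not in layout:
            continue  # branch is unreachable for this layout
        for header in reads:
            if header not in layout:
                findings.append((name, header, tuple(sorted(layout))))
    return findings


if __name__ == "__main__":
    for branch, header, layout in find_invalid_reads():
        print(f"{branch} reads invalid header {header!r} for layout {layout}")
```

A traditional reachability checker would not see this class of bug, because it depends on the program's own parser and control logic rather than on forwarding state alone.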

C. Summary and Lessons Learned

Network testing can generally be divided into debugging/troubleshooting network problems and verifying the behavior of forwarding devices. While traditional tools and techniques were adequate for non-programmable networks, they are insufficient for programmable ones due to their inability to handle changes to forwarding behaviors without reprogramming and restructuring their internals. A variety of works were proposed to analyze and model P4 programs in order to troubleshoot and verify the correctness of networks' operations.

XIII. CHALLENGES AND FUTURE TRENDS

In this section, a number of research and operational challenges that correspond to the proposed taxonomy are outlined. The challenges were extracted after comprehensively reviewing each work in the described literature. Further, the section discusses and pinpoints several initiatives for future work that could be worth pursuing in the field of programmable switches. The challenges and the future trends are illustrated in Fig. 22.

[Figure 22: diagram of data plane challenges and trends. Recoverable labels include Interoperability, Arithmetic computations, Network-wide cooperation, and Programming simplicity and modularity.]

Fig. 22. Challenges and future trends. The references represent examples of existing works that tackle the corresponding future trends.

A. Memory Capacity (SRAM and TCAM)

Stateful processing is a key enabler for programmable data planes as it allows applications to store and retrieve data across different packets. This advantage enabled a wide range of novel applications (e.g., in-network caching, fine-grained measurements, stateful load balancing, etc.) that were not possible in non-programmable networks. The amount of data stored in the switch is limited by the size of the on-chip memory, which ranges from tens to hundreds of megabytes at most. Consequently, the majority of stateful applications face trade-offs between performance and memory usage. For instance, the efficiency of caching, which is determined by the hit rate, is directly affected by the memory size. Furthermore, the vast majority of measurement applications require storing statistics in the data plane (e.g., byte/packet counters). The number of flows to be measured and the richness of the measurement information are bound by the size of the memory in the switch.
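The dependence of the cache hit rate on memory size can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it uses an LRU replacement policy and a synthetic skewed workload (key popularity proportional to 1/rank), neither of which is taken from the surveyed systems, and the cache sizes are arbitrary.

```python
import random
from collections import OrderedDict
from typing import List


def skewed_requests(num_keys: int, num_requests: int, seed: int = 0) -> List[int]:
    """Synthetic workload: key i is requested with weight 1 / (i + 1)."""
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=num_requests)


def lru_hit_rate(capacity: int, requests: List[int]) -> float:
    """Fraction of requests served from an LRU cache with `capacity` entries."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests)


if __name__ == "__main__":
    trace = skewed_requests(num_keys=1000, num_requests=20000)
    for capacity in (10, 100, 500):
        print(capacity, round(lru_hit_rate(capacity, trace), 3))
```

On the same trace, a larger LRU cache never has a lower hit rate than a smaller one, which is the memory/performance trade-off the paragraph describes.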

Current and Future Initiatives.

A notable work by Kim et al. [295, 296] suggests accessing remote Dynamic Random Access Memory (DRAM) installed on data center servers purely from the data plane to expand the available memory on the switch. The bandwidth of the chip is traded for the bandwidth needed to access the external DRAM. The approach is cheap and flexible since it reuses existing resources in commodity hardware without adding infrastructure costs. The system is realized by allowing the data plane to access remote memory through an access channel (RDMA over Converged Ethernet (RoCE)) as shown in Fig. 23. The implementation shows that the proposal achieves throughput close to the line rate, and incurs only 1-2 microseconds of extra latency (Fig. 24).

[Figure 23: a switch ASIC connected over RDMA (RoCE) to a general-purpose DRAM pool on commodity servers, acting as remote table servers, remote state stores, and remote buffer servers.]

Fig. 23. Expanding switch memory by leveraging remote DRAM on commodity servers [295].

There are some limitations in this approach that can be explored in the future.
• The current implementation only supports address-based memory access, and hence, complicated data layouts and ternary matching in remote memory should be explored.
• Frequent updates in the remote memory require several packets for fetching and adding. This is common in measurement applications where counters are continuously incremented. A possible solution to the bandwidth overhead is aggregating updates into a single operation. This comes at the cost of delayed updates.
• Packet loss between the switch and the remote memory should be handled; otherwise, the performance of the application and the freshness of the remote values might be affected.
• The interaction between general data plane applications and the remote memory is challenging. A potential improvement is designing well-defined APIs to facilitate the interaction.
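The second limitation above (aggregating counter updates into a single operation, at the cost of delayed updates) can be sketched as follows. This is a hypothetical Python model, not the design of [295, 296]: the class name, the `batch_size` parameter, and the dictionary standing in for remote DRAM are assumptions made for illustration.

```python
from typing import Dict


class BatchedRemoteCounters:
    """Accumulates increments locally and writes them remotely in batches."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.pending: Dict[str, int] = {}   # local, not yet written
        self.pending_updates = 0
        self.remote: Dict[str, int] = {}    # stands in for remote DRAM
        self.remote_writes = 0              # number of remote operations issued

    def increment(self, flow: str, nbytes: int) -> None:
        self.pending[flow] = self.pending.get(flow, 0) + nbytes
        self.pending_updates += 1
        if self.pending_updates >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # One remote operation per flow with pending data, instead of one
        # remote operation per packet.
        for flow, delta in self.pending.items():
            self.remote[flow] = self.remote.get(flow, 0) + delta
            self.remote_writes += 1
        self.pending.clear()
        self.pending_updates = 0


if __name__ == "__main__":
    counters = BatchedRemoteCounters(batch_size=4)
    for flow, size in [("f1", 100), ("f1", 200), ("f1", 50), ("f2", 10)]:
        counters.increment(flow, size)
    print(counters.remote, counters.remote_writes)
```

Until a flush happens, the remote value is stale, which is exactly the update delay the text mentions; a larger batch saves more bandwidth but increases that staleness.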

B. Resources Accessibility

Besides the size limitation of the on-chip memory, there are other restrictions that data plane developers should take into account [297, 304]. First, since the table memory is local to each stage in the pipeline, a stage cannot reclaim unused memory from other stages. As a result, memory and match/action processing are fused, making the placement of tables challenging. Second, the sequential execution of operations in the pipeline leads to poor utilization of resources, especially when the matches and the actions are imbalanced (i.e., the presence of default actions that do not need a match).

Current and Future Initiatives.

An interesting work by Chole et al. [297] explored the idea of disaggregating the memory and compute resources of a programmable switch. The main notion of this work is to centralize the memory as a pool that is accessed through a crossbar. By doing so, each pipeline stage no longer has local memory. Additionally, this work solves the sequential execution limitation by creating a cluster of processors used to execute operations in any order. The main limitation of this approach is the lack of adoption by hardware vendors. Most switch vendors (e.g., Cavium's XPliant and Barefoot's Tofino) do not implement the disaggregation model and follow the regular Reconfigurable Match-action Tables (RMT) model. The implementation and analysis of the disaggregation model on hardware targets should be explored in the future.

Fig. 24. Accessing remote DRAM latency overhead. Achieved throughput close to the line rate (≈ …).

C. Arithmetic Computations

There are several challenges that must be handled when dealing with arithmetic computations in the data plane. First, programmable switches support a small set of simple arithmetic computations that operate on non-floating-point values. Second, only a few operations are supported per packet to guarantee execution at line rate. Typically, a packet should only spend tens of nanoseconds in the processing pipeline. Third, computations in the data plane consume significant hardware resources, hampering the possibility of other programs executing concurrently. A wide range of applications suffer from the lack of complex computations in the data plane. For instance, some operations required by AQMs (e.g., the square root function in the CoDel algorithm) are complex to implement in P4. Additionally, the majority of machine learning frameworks and models operate on floating-point values while the supported arithmetic operations on the switch operate on integer values. In-network aggregation of model updates requires calculating the average over a set of floating-point vectors.

Current and Future Initiatives.

Existing methods to overcome the computation limitations include approximation and pre-computation. In the approximation method, the application designer relies on the small set of supported operations to approximate the desired value, at the cost of sacrificing precision. For example, approximating the square root function can be achieved by counting the number of leading zeros through longest prefix match [91]. It would be beneficial for P4 developers to have access to a community-maintained library of P4 code that approximates various complex functions. In the pre-computation method, values are computed by the control plane (e.g., the switch CPU) and stored in match-action tables or registers. Future work can explore methods that automatically identify the complex computations that can be pre-evaluated in the control plane. After identification, the data plane code and its corresponding control plane APIs can be automatically generated.
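Both methods can be combined in one small illustration. The Python sketch below approximates an integer square root from the number of leading zeros of the input, and reads the result from a table that a control plane would fill in advance. It is a sketch under assumptions and not the implementation of [91]: the 32-bit width, the choice of the range midpoint as the representative value, and the dictionary standing in for a longest-prefix-match table are all illustrative.

```python
import math
from typing import Dict

WIDTH = 32  # assumed register width in bits


def build_sqrt_table(width: int = WIDTH) -> Dict[int, int]:
    """Control-plane step: one precomputed estimate per leading-zero count.

    All inputs with the same number of leading zeros lie in
    [2**msb, 2**(msb + 1)); the square root of the range midpoint is used
    as the estimate for the whole range.
    """
    table = {}
    for msb in range(width):
        midpoint = (3 << msb) >> 1  # floor(1.5 * 2**msb)
        table[width - 1 - msb] = math.isqrt(midpoint)
    return table


def approx_sqrt(x: int, table: Dict[int, int], width: int = WIDTH) -> int:
    """Data-plane step: a single lookup keyed by the leading-zero count.

    On a switch the leading-zero count would come from a longest prefix
    match on x; here it is derived from int.bit_length().
    """
    if x <= 0:
        return 0
    leading_zeros = width - x.bit_length()
    return table[leading_zeros]


if __name__ == "__main__":
    table = build_sqrt_table()
    for value in (1000, 65536, 1_000_000):
        print(value, approx_sqrt(value, table), round(math.sqrt(value), 1))
```

The estimate is coarse (within roughly a quarter of the true value for moderately large inputs), which is the precision cost the text refers to; a finer table keyed on more leading bits would trade memory for accuracy.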

D. Network-wide Cooperation

The SDN architecture suggests using a centralized controller for network-wide switch management. Through centralization, the state of each programmable switch can be shared with other switches. Consequently, applications will have the ability to make better decisions as network-wide data is available locally on the switch. The problem with such an architecture is the requirement of a continuous exchange of packets with a software-based system. As an alternative, switches can exchange messages to synchronize their states in a decentralized manner.

[Figure 25: two panels showing switches between a DDoS initiator (A) and the Internet, each keeping a table of IP, ID, and Count. In (a) each local count stays below the threshold T; in (b) the summed counts exceed T.]

Fig. 25. (a) Local detection of DDoS attacks. (b) Network-wide detection of DDoS attacks.

Consider Fig. 25, which shows an in-network DDoS defense solution. Each switch maintains a list of senders and their corresponding numbers of bytes. A switch compares the number of bytes transmitted from a given flow to a threshold. When the threshold is crossed, the flow is blocked and the device is identified as a malicious DDoS sender. Assume that the network implements a load balancing mechanism that distributes traffic across the switches. In the scenario where switches do not consider the byte counts of other switches (Fig. 25 (a)), the traffic of a DDoS device might remain under the threshold on every switch. On the other hand, when switches synchronize their states by sharing the byte counts (Fig. 25 (b)), the total number of bytes is compared against the threshold. Consequently, the total load of a DDoS device is considered. This example demonstrates an application that heavily depends on network-wide cooperation and hence motivates the need for state synchronization.
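The difference between the two scenarios can be reproduced with a few lines of Python. The byte counts, the threshold, and the function names below are made up for illustration; the sketch only contrasts per-switch thresholding with thresholding on the summed counts.

```python
from typing import Dict, List, Set


def local_detection(per_switch: List[Dict[str, int]], threshold: int) -> List[Set[str]]:
    """Each switch flags senders using only its own byte counts."""
    return [
        {src for src, count in counts.items() if count > threshold}
        for counts in per_switch
    ]


def network_wide_detection(per_switch: List[Dict[str, int]], threshold: int) -> Set[str]:
    """Senders are flagged on the sum of byte counts across all switches."""
    totals: Dict[str, int] = {}
    for counts in per_switch:
        for src, count in counts.items():
            totals[src] = totals.get(src, 0) + count
    return {src for src, total in totals.items() if total > threshold}


if __name__ == "__main__":
    # Hypothetical counts: load balancing splits sender A across two switches.
    s1 = {"A": 600, "B": 100}
    s2 = {"A": 700, "C": 200}
    print(local_detection([s1, s2], threshold=1000))         # nothing flagged
    print(network_wide_detection([s1, s2], threshold=1000))  # A is flagged
```

With these numbers, sender A stays under the threshold on each switch individually but exceeds it once the counts are combined.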

Current and Future Initiatives.

Arashloo et al. [298] proposed SNAP, a centralized stateful programming model that aims at solving the synchronization problem. SNAP introduced the idea of writing programs for "one big switch" instead of many. Essentially, developers write stateful applications without caring about the distribution, placement, and optimization of access to resources. SNAP is limited to one replica of each state in the network. Sviridov et al. [299, 300] proposed LODGE and LOADER to extend SNAP and enable multiple replicas. Luo et al. [301] proposed Swing State, a framework for runtime state migration and management. This approach leverages existing traffic to piggyback state updates between cooperating switches. Swing State overcomes the challenges of the SDN-based architecture by synchronizing the states entirely in the data plane, at line rate, and without intervention from the control plane. There are several limitations with this approach. First, there are no message delivery guarantees (i.e., dropped or reordered packets are not retransmitted), leading to inconsistency in the states among the switches. Second, it does not merge the states if two switches share common states. Third, the overhead can significantly increase if a single state is mirrored several times. Finally, there is no authentication of data or senders. Xing et al. [302] proposed P4Sync, a system that migrates states between switches in the data plane while guaranteeing the authenticity of the senders and the exchanged data. P4Sync addresses the limitations of existing approaches. It guarantees the completeness of the migration, ensuring that the snapshot transfer is completed. Moreover, it solves the overhead of repeatedly retransmitted updates. An interesting aspect of P4Sync is its ability to control the migration traffic rate depending on the changing network conditions. Zeno et al. [303] presented a design of SwiShmem, a management layer that facilitates the deployment of network functions (NFs) on multiple switches by managing the distributed shared states.

Future work in this area should consider handling frequent state migrations. Some systems require migration packets to be generated each RTT, causing increased traffic overhead and additional expensive authentication operations. For instance, P4Sync uses public key cryptography in the control plane to sign and verify the end of the migration sequence chain (2.15 ms for signing and 0.07 ms for verifying using an RSA-2048 signature). Frequent migrations would cause this signature to be computed repeatedly. Another major concern that should be handled in future work is denial of service. Even with authentication of migration updates, changes in the packets cause the receiver to reject updates, leading to state inconsistency among switches.

E. Control Plane Intervention

Delegating tasks to the control plane incurs latency and affects the application's performance. For instance, in congestion control, rerouting-based schemes often use tables to store alternative routes. Since the data plane cannot directly modify table entries, intervention from the control plane is required. The interaction with the control plane in this application hampers the promptness of rerouting. Other examples are methods that use collision-free hashing. For example, cuckoo hashing [305], which rearranges items to resolve collisions, uses a complex search algorithm that cannot run on the switch ASIC, and is often executed on the switch CPU. Ideally, control plane intervention should be minimized when possible. For example, to synchronize the state among switches, in-network cooperation should be considered.
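To show why cuckoo insertion is awkward for a fixed pipeline, the following Python sketch implements a textbook-style two-table cuckoo hash. It is not taken from [305] or from any surveyed system: the two hash functions, the table size, and the `max_kicks` bound are illustrative assumptions, and the sketch does not handle updates of existing keys or rehashing.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, str]


class CuckooTable:
    """Two-table cuckoo hash with a bounded displacement loop."""

    def __init__(self, size: int, max_kicks: int = 16) -> None:
        self.size = size
        self.max_kicks = max_kicks
        self.slots: List[List[Optional[Entry]]] = [[None] * size, [None] * size]

    def _index(self, which: int, key: int) -> int:
        # Two simple, illustrative hash functions.
        if which == 0:
            return key % self.size
        return (key // self.size) % self.size

    def lookup(self, key: int) -> Optional[str]:
        # A lookup touches at most two slots, which suits a pipeline.
        for which in (0, 1):
            entry = self.slots[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key: int, value: str) -> int:
        """Insert a new key; return how many entries were displaced."""
        entry: Optional[Entry] = (key, value)
        which = 0
        for kicks in range(self.max_kicks + 1):
            idx = self._index(which, entry[0])
            # Place the entry and pick up whatever occupied the slot.
            entry, self.slots[which][idx] = self.slots[which][idx], entry
            if entry is None:
                return kicks
            which = 1 - which  # displaced entry tries its other table
        # The last displaced entry is left out; a real table would rehash.
        raise RuntimeError("insertion failed; a rehash would be needed")


if __name__ == "__main__":
    table = CuckooTable(size=4)
    for key, value in [(1, "a"), (5, "b"), (9, "c")]:
        print(key, "displacements:", table.insert(key, value))
    print(table.lookup(1), table.lookup(5), table.lookup(9))
```

The insertion loop has a data-dependent length and rewrites several slots, whereas a lookup reads a fixed number of slots; this asymmetry is one way to see why insertions tend to be delegated to the switch CPU.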

Current and Future Initiatives.

The design of the interaction between the control plane and the data plane is fully decided by the developer. Experienced developers might have enough background to minimize such interaction from the start. Future work should devise algorithms and tools that automatically detect excessive interaction between the control and data planes, and suggest alternative workflows (ideally, as generated code) to minimize such interaction.

F. Security

When designing a system for the data plane, the developer must envision the kind of traffic a malicious user can initiate to corrupt the operation of the system. This class of attacks is referred to as sensitivity attacks, as coined in [179]. Essentially, an attacker can intelligently craft traffic patterns to trigger unexpected behaviors of a system in the data plane. For instance, a load balancer that balances traffic by hashing packet headers without cryptographic support (e.g., a modulo operator on the number of available paths) can be tricked by an attacker that crafts skewed traffic patterns. This results in traffic being forwarded to a single path, leading to congestion, link saturation, and denial of service. Another example is attacks against in-network caching. Caching in the data plane performs well when requests are mostly reads rather than writes. If an attacker continuously generates highly skewed write requests, the load on the storage servers would be imbalanced. If the system is designed to handle write queries on hot items in the switch, a random failure in the switch causes data to be lost. Further, an attacker can also exploit the memory limitation of the switch and request diverse values, causing the pre-cached values to be evicted.
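The load balancer example can be made concrete with a short Python sketch. It assumes a path selector that hashes the flow identifier with CRC32 and takes the result modulo the number of paths; the field layout, addresses, and function names are hypothetical. Because the hash is public and unkeyed, an attacker can search offline for source ports that all land on the same path.

```python
import zlib
from typing import List


def pick_path(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
              num_paths: int) -> int:
    """Unkeyed path selection: CRC32 of the flow identifier modulo num_paths."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_paths


def craft_colliding_ports(src_ip: str, dst_ip: str, dst_port: int,
                          num_paths: int, target_path: int,
                          count: int) -> List[int]:
    """Attacker's offline search for source ports that map to one path."""
    ports = []
    for port in range(1024, 65536):
        if pick_path(src_ip, port, dst_ip, dst_port, num_paths) == target_path:
            ports.append(port)
            if len(ports) == count:
                break
    return ports


if __name__ == "__main__":
    ports = craft_colliding_ports("10.0.0.1", "10.0.0.2", 80,
                                  num_paths=4, target_path=0, count=50)
    paths = {pick_path("10.0.0.1", p, "10.0.0.2", 80, 4) for p in ports}
    print(len(ports), "flows, paths used:", paths)
```

All crafted flows select the same path, so the balancer no longer spreads the load. A keyed hash with a secret that the attacker cannot observe would make this offline search impractical, at the cost of extra data plane resources.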

Current and Future Initiatives.

To mitigate sensitivity attacks, a developer attempts to discover various unpredicted traffic patterns and, accordingly, develops defense strategies. Such a solution is highly unreliable, time-consuming, and error-prone. Recent efforts [179] aimed at automatically discovering sensitivity attacks in the data plane. Essentially, the proposed system aims at deriving traffic patterns that would drive the program away from common-case behavior as much as possible. Other efforts focused on architecting defenses in the data plane that perform distributed mode changes upon attack discovery [178]. Future work in this direction should consider achieving high assurance by formally verifying the code. Additionally, the stability of the data plane should be carefully handled with fast mode changes; future work could consider integrating self-stabilizing systems for this purpose. Finally, future work should provide security interfaces for collaborating switches that belong to different domains. It is also worth exposing sensitivity attack patterns for different application types so that data plane developers can avoid the vulnerabilities that trigger those attacks in their code.

G. Interoperability

Programmable switches pave the way for a wide range of innovative in-network applications. The literature has shown that significant performance improvements are achieved when applications offload their processing logic to the network. Despite this, it is very unlikely that mobile operators will replace their current infrastructure with programmable switches in one shot, because doing so would incur major operational and budgeting costs.

Current and Future Initiatives.

Network operators might deploy programmable switches in an incremental fashion. That is, P4 switches will be added to the network alongside the existing legacy devices. While this solution seems simple at first, studies have shown that partial deployment leads to reduced effectiveness [162]. For instance, the accuracy of heavy hitter detection schemes is strongly affected by the flow visibility. The work in [162] devised a greedy algorithm that attempts to strategically position P4 switches in the network, with the goal of monitoring as many distinct network flows as possible. The F1 score is used to quantify the correctness of the switch placement. Future work in this area should consider generalizing and enhancing this approach to work with any P4 application, and not only heavy hitter detection. For instance, future work could suggest the positioning of P4 switches in applications such as in-network caching, accelerated consensus, and in-network defenses, while taking into account the current topology consisting of legacy devices.
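The placement goal described above (monitor as many distinct flows as possible with a limited number of P4 switches) resembles the classic maximum coverage problem. The Python sketch below applies the standard greedy heuristic for that problem. The details of the algorithm in [162] are not given here, so this should be read as a generic illustration; the switch names, flow sets, and `budget` parameter are hypothetical.

```python
from typing import Dict, List, Set


def greedy_placement(flows_through: Dict[str, Set[str]], budget: int) -> List[str]:
    """Pick up to `budget` locations, each time adding the most new flows."""
    chosen: List[str] = []
    covered: Set[str] = set()
    candidates = dict(flows_through)
    for _ in range(budget):
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] - covered),
                   default=None)
        if best is None or not (candidates[best] - covered):
            break  # no remaining location adds visibility
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical topology: which flows traverse each candidate location.
    flows = {
        "s1": {"f1", "f2", "f3"},
        "s2": {"f3", "f4"},
        "s3": {"f1", "f2"},
        "s4": {"f5"},
    }
    print(greedy_placement(flows, budget=2))
```

Generalizing beyond heavy hitter detection would mainly change the objective inside the `key` function, for example to expected cache hits or to the number of protected hosts.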

H. Programming Simplicity and Modularity

Writing in-network applications using the P4 language is not an easy task. Recent studies have shown that many existing P4 programs have several bugs that might lead to network disruption [191]. For several decades, the networking industry operated in a bottom-up approach, where switches are equipped with fixed-function ASICs. Consequently, little to no programming skills were needed by network operators. With the advent of programmable switches, operators are now expected to have experience in programming the ASIC.

Note that most vendors (e.g., Barefoot Networks) provide a program (switch.p4) that expresses the forwarding plane of a switch, with the typical features of an advanced layer-2 and layer-3 switch. If the goal is to simply deploy a switch with no in-network applications, then the operators are not required to program the chip. They just need to learn the interaction between the control plane and the data plane (e.g., to populate table entries).

Current and Future Initiatives.

Since programming the ASIC is not a straightforward task, future research endeavours should consider simplifying the programming workflow for operators and generating code (e.g., [293]). For instance, graphical tools can be developed to translate workflows (e.g., flowcharts) to P4 programs that can fit into the hardware. Further, future work should develop tools that allow operators to enable features (i.e., program modules) that will translate to P4 programs. As an analogy, consider the mobile application stores (e.g., Play store, Apple store). The user simply downloads and installs an application on the device, without having to understand anything about programming. An interesting work could investigate the idea of creating a store for P4 applications where operators select the "apps" they want to activate, and the result is a generated P4 program optimized to fit in the hardware, considering the different targets available in the market today (e.g., Tofino). Recent efforts attempted to merge and test modular programs in P4 [294].

XIV. CONCLUSIONS

This article presents an exhaustive survey on programmable data planes. The survey describes the evolution of networking by discussing the traditional control plane and the transition to SDN. Afterwards, the survey motivates the need for programming the data plane and delves into the general architecture of a programmable switch (PISA). A brief description of P4, the de facto language for programming the data plane, is presented. Motivated by the increasing trend in programming the data plane, the survey provides a taxonomy that sheds light on numerous significant works and compares schemes within each category in the taxonomy and with those in legacy approaches. The survey concludes by discussing challenges and considerations as well as various future trends and initiatives.

ACKNOWLEDGEMENT

This material is based upon work supported by the National Science Foundation under grant numbers 1925484 and 1829698, funded by the Office of Advanced Cyberinfrastructure (OAC).

ABBREVIATIONS USED IN THIS ARTICLE

Abbreviation: Term
ABR: Adaptive Bit Rate
ACK: Acknowledgement
ACL: Access Control List
AFQ: Approximate Fair Queueing
AIMD: Additive Increase Multiplicative Decrease
ALU: Arithmetic Logical Unit
API: Application Programming Interface
AQM: Active Queue Management
AS: Autonomous System
ASIC: Application-specific Integrated Circuit
ATPG: Automatic Test Packet Generation
ATT: Attribute Protocol
BBR: Bottleneck Bandwidth and Round-trip Time
BDD: Binary Decision Diagram
BFT: Byzantine Fault Tolerance
BGP: Border Gateway Protocol
BIER: Bit Index Explicit Replication
BLE: Bluetooth Low Energy
BLESS: Bluetooth Low Energy Service Switch
BMv2: Behavioral Model Version 2
BNN: Binary Neural Network
BQPS: Billion Queries Per Second
BYOD: Bring Your Own Device
CAIDA: Center of Applied Internet Data Analysis
CC: Congestion Control
CNN: Convolutional Neural Network
CoDel: Controlled Delay
CPU: Central Processing Unit
CRC: Cyclic Redundancy Check
CWND: Congestion Window
DCQCN: Data Center Quantized Congestion Notification
DCTCP: Data Center Transmission Control Protocol
DDoS: Distributed Denial-of-Service
DIP: Direct Internet Protocol
DMA: Direct Memory Access
DMZ: Demilitarized Zone
DNS: Domain Name Server
DPDK: Data Plane Development Kit
DRAM: Dynamic Random Access Memory
DSP: Digital Signal Processors
ECMP: Equal-Cost Multi-Path Routing
ECN: Explicit Congestion Notification
ESP: Encapsulating Security Payload
FAST: Flow-level State Transitions
FCT: Flow Completion Time
FIB: Forwarding Information Base
FPGA: Field-programmable Gate Array
FQ: Fair Queueing
GPU: Graphics Processing Unit
GRE: Generic Routing Encapsulation
HCF: Hop-Count Filtering
HSA: Header Space Analysis
HTCP: Hamilton Transmission Control Protocol
HTTP: Hypertext Transfer Protocol
IDS: Intrusion Detection System
IGMP: Internet Group Management Protocol
IKE: Internet Key Exchange
ILP: Integer Linear Programming
INT: In-band Network Telemetry
IoT: Internet of Things
IP: Internet Protocol
ISP: Internet Service Provider
JSON: JavaScript Object Notation
KDN: Knowledge-defined Networking
KPI: Key Performance Indicator
LAN: Local Area Network
LFA: Link Flooding Attack
LPM: Longest Prefix Match
LPWAN: Low Power Wide Area Network
LTE: Long Term Evolution
MAC: Medium Access Control
MAU: Match-Action Unit
MCM: Multicolor Markers
MIMD: Multiplicative Increase Multiplicative Decrease
ML: Machine Learning
MOS: Mean Opinion Score
MPC: Mobile Packet Core
MQTT: Message Queueing Telemetry Transport
MSS: Maximum Segment Size
MPTCP: Multipath Transmission Control Protocol
MTU: Maximum Transmission Unit
NACK: Negative Acknowledgement
NAT: Network Address Translation
NDA: Non-disclosure Agreement
NDN: Named Data Networking
NFV: Network Functions Virtualization
NIC: Network Interface Controller
NN: Neural Networks
NSH: Network Service Header
ONOS: Open Network Operating System
OSPF: Open Shortest Path First
OUM: Ordered Unreliable Multicast
OVS: Open Virtual Switch
P2P: Peer-to-peer
PBT: Postcard-Based Telemetry
PCC: Performance-oriented Congestion Control
PCC: Per-Connection Consistency
PD: Program Dependent
PGW: Packet Data Network Gateway
PI: Protocol Independent
PIE: Proportional Integral Controller Enhanced
PISA: Protocol Independent Switch Architecture
QoE: Quality of Experience
QoS: Quality of Service
RAM: Random-Access Memory
RDMA: Remote Direct Memory Access
RED: Random Early Detection
REST: Representational State Transfer
RFC: Request for Comments
RMT: Reconfigurable Match-action Tables
RSA: Rivest-Shamir-Adleman
RSS: Really Simple Syndication
RTT: Round-trip Time
RWND: Receiver Window
SAD: Security Association Database
SAT: Boolean Satisfiability Problem
SDN: Software Defined Networking
SHA: Secure Hash Algorithm
SIP: Session Initiation Protocol
SLA: Service Level Agreement
SNMP: Simple Network Management Protocol
SPD: Security Policy Database
SRAM: Static Random-Access Memory
SSH: Secure Shell
TCAM: Ternary Content-Addressable Memory
TCP: Transmission Control Protocol
TM: Traffic Management
ToR: The Onion Router
TPU: Tensor Processing Unit
TTL: Time-to-Live
UDP: User Datagram Protocol
UE: User Equipment
VIP: Virtual Internet Protocol
VMN: Verifying Mutable Networks
VN: Virtual Network
VoLTE: Voice over Long-term Evolution
VXLAN: Virtual eXtensible Local Area Network
WAN: Wide Area Network
XDP: eXpress Data Path

REFERENCES

[4] G. Papastergiou, G. Fairhurst, D. Ros, A. Brunstrom, K.-J. Grinnemo, P. Hurtig, N. Khademi, M. Tüxen, M. Welzl, D. Damjanovic, and S. Mangiante, “De-ossifying the Internet transport layer: a survey and future perspectives,” IEEE Communications Surveys & Tutorials

ACM SIGCOMM Computer Communication Review, vol. 37, no. 4, pp. 1–12, 2007.
[8] D. Kreutz, F. M. Ramos, P. E. Verissimo, C. E. Rothenberg, S. Azodolmolky, and S. Uhlig, “Software-defined networking: a comprehensive survey,” Proceedings of the IEEE, vol. 103, no. 1, pp. 14–76, 2014.
[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, and G. Varghese, “P4: programming protocol-independent packet processors,” ACM SIGCOMM Computer Communication Review
SC19 Network Research Exhibition
Future Internet (FI) and Innovative Internet Technologies and Mobile Communication (IITM), vol. 47, 2017.
[36] T. Dargahi, A. Caponi, M. Ambrosin, G. Bianchi, and M. Conti, “A survey on the security of stateful SDN data planes,” IEEE Communications Surveys & Tutorials, vol. 19, no. 3, pp. 1701–1725, 2017.
[37] W. L. da Costa Cordeiro, J. A. Marques, and L. P. Gaspary, “Data plane programmability beyond OpenFlow: opportunities and challenges for network and service operations and management,” Journal of Network and Systems Management, vol. 25, no. 4, pp. 784–818, 2017.
[38] A. Satapathy, “Comprehensive study of P4 programming language and software-defined networks,” 2018. [Online]. Available: https://tinyurl.com/y4d4zma9.
[39] R. Bifulco and G. Rétvári, “A survey on the programmable data plane: abstractions, architectures, and open problems,” in , pp. 1–7, IEEE, 2018.
[40] E. Kaljic, A. Maric, P. Njemcevic, and M. Hadzialic, “A survey on data plane flexibility and programmability in software-defined networking,” IEEE Access, vol. 7, pp. 47804–47840, 2019.
[41] P. G. Kannan and M. C. Chan, “On programmable networking evolution,” CSI Transactions on ICT, vol. 8, no. 1, pp. 69–76, 2020.
[42] L. Tan, W. Su, W. Zhang, J. Lv, Z. Zhang, J. Miao, X. Liu, and N. Li, “In-band network telemetry: A survey,” Computer Networks, p. 107763, 2020.
[43] X. Zhang, L. Cui, K. Wei, F. P. Tso, Y. Ji, and W. Jia, “A survey on stateful data plane in software defined networks,” Computer Networks, p. 107597, 2020.
[44] G. Bianchi, M. Bonola, A. Capone, and C. Cascone, “OpenState: programming platform-independent stateful OpenFlow applications inside the switch,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 2, pp. 44–51, 2014.
[45] M. Moshref, A. Bhargava, A. Gupta, M. Yu, and R. Govindan, “Flow-level state transition as a new switch primitive for SDN,” in Proceedings of the third workshop on Hot topics in software defined networking

ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp. 69–74, 2008.
[49] N. McKeown, “Why does the Internet need a programmable forwarding plane.” [Online]. Available: https://tinyurl.com/y6x7qqpm.
[50] A. Shapiro, “P4-programming data plane use-cases.” in P4 Expert Roundtable Series, April 28-29, 2020. [Online]. Available: https://tinyurl.com/y5n4k83h.
[51] C. Kim, “Evolution of networking, Networking Field Day 21, 2:01,” 2019. [Online]. Available: https://tinyurl.com/y9fkj7qx.
[52] Z. Liu, J. Bi, Y. Zhou, Y. Wang, and Y. Lin, “Netvision: towards network telemetry as a service,” in , pp. 247–248, IEEE, 2018.
[53] J. Hyun, N. Van Tu, and J. W.-K. Hong, “Towards knowledge-defined networking using in-band network telemetry,” in NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium, pp. 1–7, IEEE, 2018.
[54] Y. Kim, D. Suh, and S. Pack, “Selective in-band network telemetry for overhead reduction,” in , pp. 1–3, IEEE, 2018.
[55] J. A. Marques, M. C. Luizelli, R. I. T. da Costa Filho, and L. P. Gaspary, “An optimization-based approach for efficient network monitoring using in-band network telemetry,” Journal of Internet Services and Applications, vol. 10, no. 1, p. 12, 2019.
[56] B. Niu, J. Kong, S. Tang, Y. Li, and Z. Zhu, “Visualize your IP-over-optical network in realtime: a P4-based flexible multilayer in-band network telemetry (ML-INT) system,” IEEE Access, vol. 7, pp. 82413–82423, 2019.
[57] R. Ben Basat, S. Ramanathan, Y. Li, G. Antichi, M. Yu, and M. Mitzenmacher, “PINT: probabilistic in-band network telemetry,” in Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pp. 662–680, 2020.
[58] N. Van Tu, J. Hyun, and J. W.-K. Hong, “Towards ONOS-based SDN monitoring using in-band network telemetry,” in , pp. 76–81, IEEE, 2017.
[59] Serkant, “Prometheus INT exporter.” [Online]. Available: https://github.com/serkantul/prometheus_int_exporter/.
[60] N. Van Tu, J. Hyun, G. Y. Kim, J.-H. Yoo, and J. W.-K. Hong, “IntCollector: a high-performance collector for in-band network telemetry,” in , pp. 10–18, IEEE, 2018.
[61] Barefoot Networks, “Barefoot Deep Insight - product brief.” [Online]. Available: https://tinyurl.com/u2ncvry.
[62] Broadcom, “BroadView Analytics, Trident 3 in-band telemetry.” [Online]. Available: https://tinyurl.com/yxr2qydb.
[63] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore, G. Antichi, and M. Wójcik, “Re-architecting datacenter networks and stacks for low latency and high performance,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 29–42, 2017.
[64] B. Turkovic, F. Kuipers, N. van Adrichem, and K. Langendoen, “Fast network congestion detection and avoidance using P4,” in Proceedings of the 2018 Workshop on Networking for Emerging Applications and Technologies, pp. 45–51, 2018.
[65] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh, and M. Yu, “HPCC: high precision congestion control,” in Proceedings of the ACM Special Interest Group on Data Communication, pp. 44–58, 2019.
[66] A. Feldmann, B. Chandrasekaran, S. Fathalli, and E. N. Weyulu, “P4-enabled network-assisted congestion feedback: a case for NACKs,” 2019.
[67] E. F. Kfoury, J. Crichigno, E. Bou-Harb, D. Khoury, and G. Srivastava, “Enabling TCP pacing using programmable data plane switches,” in , pp. 273–277, IEEE, 2019.
[68] B. Turkovic and F. Kuipers, “P4air: Increasing fairness among competing congestion control algorithms,” 2020.
[69] Y. Li, R. Miao, C. Kim, and M. Yu, “Flowradar: A better NetFlow for data centers,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 311–324, 2016.
[70] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman, “One sketch to rule them all: rethinking network flow monitoring with UnivMon,” in

Proceedings of the 2016 ACM SIGCOMM Conference, pp. 101–114, 2016.
[71] S. Narayana, A. Sivaraman, V. Nathan, P. Goyal, V. Arun, M. Alizadeh, V. Jeyakumar, and C. Kim, “Language-directed hardware design for network performance monitoring,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 85–98, 2017.
[72] M. Ghasemi, T. Benson, and J. Rexford, “Dapper: data plane performance diagnosis of TCP,” in Proceedings of the Symposium on SDN Research, pp. 61–74, 2017.
[73] T. Yang, J. Jiang, P. Liu, Q. Huang, J. Gong, Y. Zhou, R. Miao, X. Li, and S. Uhlig, “Elastic sketch: adaptive and fast network-wide measurements,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 561–575, 2018.
[74] N. Yaseen, J. Sonchack, and V. Liu, “Synchronized network snapshots,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 402–416, 2018.
[75] R. Joshi, T. Qu, M. C. Chan, B. Leong, and B. T. Loo, “Burstradar: practical real-time microburst monitoring for datacenter networks,” in Proceedings of the 9th Asia-Pacific Workshop on Systems, pp. 1–8, 2018.
[76] M. Lee and J. Rexford, “Detecting violations of service-level agreements in programmable switches,” 2018. [Online]. Available: https://p4campus.cs.princeton.edu/pubs/mackl_thesis_paper.pdf.
[77] J. Sonchack, O. Michel, A. J. Aviv, E. Keller, and J. M. Smith, “Scaling hardware accelerated network monitoring to concurrent and dynamic queries with *flow,” in , pp. 823–835, 2018.
[78] J. Sonchack, A. J. Aviv, E. Keller, and J. M. Smith, “Turboflow: Information rich flow record generation on commodity switches,” in Proceedings of the Thirteenth EuroSys Conference, pp. 1–16, 2018.
[79] A. Gupta, R. Harrison, M. Canini, N. Feamster, J. Rexford, and W. Willinger, “Sonata: query-driven streaming network telemetry,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 357–371, 2018.
[80] X. Chen, S. L. Feibish, Y. Koral, J. Rexford, O. Rottenstreich, S. A. Monetti, and T.-Y. Wang, “Fine-grained queue measurement in the data plane,” in Proceedings of the 15th International Conference on Emerging Networking Experiments And Technologies, pp. 15–29, 2019.
[81] Z. Liu, S. Zhou, O. Rottenstreich, V. Braverman, and J. Rexford, “Memory-efficient performance monitoring on programmable switches with lean algorithms,” in Symposium on Algorithmic Principles of Computer Systems (APoCS), 2020.
[82] T. Holterbach, E. C. Molero, M. Apostolaki, A. Dainotti, S. Vissicchio, and L. Vanbever, “Blink: fast connectivity recovery entirely in the data plane,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 161–176, 2019.
[83] D. Ding, M. Savi, and D. Siracusa, “Estimating logarithmic and exponential functions to track network traffic entropy in P4,” in IEEE/IFIP Network Operations and Management Symposium (NOMS), 2019.
[84] W. Wang, P. Tammana, A. Chen, and T. E. Ng, “Grasp the root causes in the data plane: diagnosing latency problems with SpiderMon,” in Proceedings of the Symposium on SDN Research, pp. 55–61, 2020.
[85] R. Teixeira, R. Harrison, A. Gupta, and J. Rexford, “PacketScope: monitoring the packet lifecycle inside a switch,” in Proceedings of the Symposium on SDN Research, pp. 76–82, 2020.
[86] J. Bai, M. Zhang, G. Li, C. Liu, M. Xu, and H. Hu, “FastFE: accelerating ML-based traffic analysis with programmable switches,” in Proceedings of the Workshop on Secure Programmable Network Infrastructure, SPIN ’20, p. 1–7, Association for Computing Machinery, 2020.
[87] X. Chen, H. Kim, J. M. Aman, W. Chang, M. Lee, and J. Rexford, “Measuring TCP round-trip time in the data plane,” in Proceedings of the Workshop on Secure Programmable Network Infrastructure, pp. 35–41, 2020.
[88] Y. Qiu, K.-F. Hsu, J. Xing, and A. Chen, “A feasibility study on time-aware monitoring with commodity switches,” in Proceedings of the Workshop on Secure Programmable Network Infrastructure, pp. 22–27, 2020.
[89] Q. Huang, H. Sun, P. P. Lee, W. Bai, F. Zhu, and Y. Bao, “OmniMon: re-architecting network telemetry with resource efficiency and full accuracy,” in

Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pp. 404–421, 2020.
[90] X. Chen, S. Landau-Feibish, M. Braverman, and J. Rexford, “BeauCoup: answering many network traffic queries, one memory update at a time,” in Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pp. 226–239, 2020.
[91] R. Kundel, J. Blendin, T. Viernickel, B. Koldehofe, and R. Steinmetz, “P4-CoDel: active queue management in programmable data planes,” in , pp. 1–4, IEEE, 2018.
[92] N. K. Sharma, M. Liu, K. Atreya, and A. Krishnamurthy, “Approximating fair queueing on reconfigurable switches,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 1–16, 2018.
[93] S. Laki, P. Vörös, and F. Fejes, “Towards an AQM evaluation testbed with P4 and DPDK,” in Proceedings of the ACM SIGCOMM 2019 Conference Posters and Demos, pp. 148–150, 2019.
[94] C. Papagianni and K. De Schepper, “PI2 for P4: an active queue management scheme for programmable data planes,” in Proceedings of the 15th International Conference on emerging Networking EXperiments and Technologies, pp. 84–86, 2019.
[95] K. Kumazoe and M. Tsuru, “P4-based implementation and evaluation of adaptive early packet discarding scheme,” in International Conference on Intelligent Networking and Collaborative Systems, pp. 460–469, Springer, 2020.
[96] D. Bhat, J. Anderson, P. Ruth, M. Zink, and K. Keahey, “Application-based QoE support with P4 and OpenFlow,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 817–823, IEEE, 2019.
[97] S. S. Lee and K.-Y. Chan, “A traffic meter based on a multicolor marker for bandwidth guarantee and priority differentiation in SDN virtual networks,” IEEE Transactions on Network and Service Management, vol. 16, no. 3, pp. 1046–1058, 2019.
[98] K. Tokmakov, M. Sarker, J. Domaschka, and S. Wesner, “A case for data centre traffic management on software programmable ethernet switches,” in , pp. 1–6, IEEE, 2019.
[99] Y.-W. Chen, L.-H. Yen, W.-C. Wang, C.-A. Chuang, Y.-S. Liu, and C.-C. Tseng, “P4-Enabled bandwidth management,” in , pp. 1–5, IEEE, 2019.
[100] M. Shahbaz, L. Suresh, J. Rexford, N. Feamster, O. Rottenstreich, and M. Hira, “Elmo: Source routed multicast for public clouds,” in Proceedings of the ACM Special Interest Group on Data Communication, pp. 458–471, 2019.
[101] M. Kadosh, Y. Piasetzky, B. Gafni, L. Suresh, M. Shahbaz, S. Banerjee, “Realizing source routed multicast using Mellanox’s programmable hardware switches, P4 Expert Roundtable Series, Apr. 2020.” [Online]. Available: https://tinyurl.com/y8dfcsum.
[102] W. Braun, J. Hartmann, and M. Menth, “Scalable and reliable software-defined multicast with BIER and P4,” in , pp. 905–906, IEEE, 2017.
[103] N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford, “Hula: scalable load balancing using programmable data planes,” in Proceedings of the Symposium on SDN Research, pp. 1–12, 2016.
[104] R. Miao, H. Zeng, C. Kim, J. Lee, and M. Yu, “SilkRoad: making stateful layer-4 load balancing fast and cheap using switching ASICs,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 15–28, 2017.
[105] C. H. Benet, A. J. Kassler, T. Benson, and G. Pongracz, “MP-HULA: multipath transport aware load balancing using programmable data planes,” in Proceedings of the 2018 Morning Workshop on In-Network Computing, pp. 7–13, 2018.
[106] V. Olteanu, A. Agache, A. Voinescu, and C. Raiciu, “Stateless datacenter load-balancing with beamer,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 125–139, 2018.
[107] Z. Liu, Z. Bai, Z. Liu, X. Li, C. Kim, V. Braverman, X. Jin, and I. Stoica, “Distcache: provable load balancing for large-scale storage systems with distributed caching,” in USENIX Conference on File and Storage Technologies (FAST), pp. 143–157, 2019.
[108] K.-F. Hsu, P. Tammana, R. Beckett, A. Chen, J. Rexford, and D. Walker, “Adaptive weighted traffic splitting in programmable data planes,” in Proceedings of the Symposium on SDN Research, pp. 103–109, 2020.
[109] K.-F. Hsu, R. Beckett, A. Chen, J. Rexford, and D. Walker, “Contra: A programmable system for performance-aware routing,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 701–721, 2020.
[110] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and I. Stoica, “Netcache: balancing key-value stores with fast in-network caching,” in Proceedings of the 26th Symposium on Operating Systems Principles, pp. 121–136, 2017.
[111] E. Cidon, S. Choi, S. Katti, and N. McKeown, “AppSwitch: application-layer load balancing within a software switch,” in

Proceedings of the First Asia-Pacific Workshop on Networking, pp. 64–70, 2017.
[112] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy, and K. Atreya, “Incbricks: toward in-network computation with an in-network cache,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 795–809, 2017.
[113] S. Signorello, R. State, J. François, and O. Festor, “NDN.p4: programming information-centric data-planes,” in , pp. 384–389, IEEE, 2016.
[114] G. Grigoryan and Y. Liu, “PFCA: a programmable FIB caching architecture,” in Proceedings of the 2018 Symposium on Architectures for Networking and Communications Systems, pp. 97–103, 2018.
[115] C. Zhang, J. Bi, Y. Zhou, K. Zhang, and Z. Ma, “B-cache: a behavior-level caching framework for the programmable data plane,” in , pp. 00084–00090, IEEE, 2018.
[116] J. Vestin, A. Kassler, and J. Åkerberg, “FastReact: in-network control and caching for industrial control networks using programmable data planes,” in , vol. 1, pp. 219–226, IEEE, 2018.
[117] J. Woodruff, M. Ramanujam, and N. Zilberman, “P4DNS: in-network DNS,” in , pp. 1–6, IEEE, 2019.
[118] R. Ricart-Sanchez, P. Malagon, P. Salva-Garcia, E. C. Perez, Q. Wang, and J. M. A. Calero, “Towards an FPGA-accelerated programmable data path for edge-to-core communications in 5G networks,” Journal of Network and Computer Applications, vol. 124, pp. 80–93, 2018.
[119] R. Ricart-Sanchez, P. Malagon, J. M. Alcaraz-Calero, and Q. Wang, “Hardware-accelerated firewall for 5G mobile networks,” in , pp. 446–447, IEEE, 2018.
[120] R. Shah, V. Kumar, M. Vutukuru, and P. Kulkarni, “TurboEPC: leveraging dataplane programmability to accelerate the mobile packet core,” in Proceedings of the Symposium on SDN Research, pp. 83–95, 2020.
[121] S. K. Singh, C. E. Rothenberg, G. Patra, and G. Pongracz, “Offloading virtual evolved packet gateway user plane functions to a programmable ASIC,” in Proceedings of the 1st ACM CoNEXT Workshop on Emerging in-Network Computing Paradigms, pp. 9–14, 2019.
[122] P. Vörös, G. Pongrácz, and S. Laki, “Towards a hybrid next generation nodeb,” in Proceedings of the 3rd P4 Workshop in Europe, pp. 56–58, 2020.
[123] P. Palagummi and K. M. Sivalingam, “SMARTHO: a network initiated handover in NG-RAN using P4-based switches,” in , pp. 338–342, IEEE, 2018.
[124] E. Kfoury, J. Crichigno, and E. Bou-Harb, “Offloading media traffic to programmable data plane switches,” in ICC 2020 IEEE International Conference on Communications (ICC), IEEE, 2020.
[125] T. Jepsen, M. Moshref, A. Carzaniga, N. Foster, and R. Soulé, “Packet subscriptions for programmable ASICs,” in Proceedings of the 17th ACM Workshop on Hot Topics in Networks, pp. 176–183, 2018.
[126] C. Wernecke, H. Parzyjegla, G. Mühl, P. Danielis, and D. Timmermann, “Realizing content-based publish/subscribe with P4,” in , pp. 1–7, IEEE, 2018.
[127] C. Wernecke, H. Parzyjegla, G. Mühl, E. Schweissguth, and D. Timmermann, “Flexible notification forwarding for content-based publish/subscribe using P4,” in , pp. 1–5, IEEE, 2019.
[128] R. Kundel, C. Gärtner, M. Luthra, S. Bhowmik, and B. Koldehofe, “Flexible content-based publish/subscribe over programmable data planes,” in NOMS 2020-2020 IEEE/IFIP Network Operations and Management Symposium, pp. 1–5, IEEE, 2020.
[129] J. Li, E. Michael, N. K. Sharma, A. Szekeres, and D. R. Ports, “Just say NO to paxos overhead: replacing consensus with network ordering,” in USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 467–483, 2016.
[130] H. T. Dang, M. Canini, F. Pedone, and R. Soulé, “Paxos made switch-y,”

ACM SIGCOMM Computer Communication Review, vol. 46, no. 2, pp. 18–24, 2016.
[131] J. Li, E. Michael, and D. R. Ports, “Eris: coordination-free consistent transactions using in-network concurrency control,” in Proceedings of the 26th Symposium on Operating Systems Principles, pp. 104–120, 2017.
[132] B. Han, V. Gopalakrishnan, M. Platania, Z.-L. Zhang, and Y. Zhang, “Network-assisted raft consensus protocol,” Feb. 13 2020. US Patent App. 16/101,751.
[133] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soulé, C. Kim, and I. Stoica, “Netchain: scale-free sub-rtt coordination,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 35–49, 2018.
[134] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman, H. Weatherspoon, M. Canini, F. Pedone, and R. Soulé, “Partitioned Paxos via the network data plane,” arXiv preprint arXiv:1901.08806, 2019.
[135] E. Sakic, N. Deric, E. Goshi, and W. Kellerer, “P4BFT: hardware-accelerated byzantine-resilient network control plane,” arXiv preprint arXiv:1905.04064, 2019.
[136] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman, H. Weatherspoon, M. Canini, F. Pedone, and R. Soulé, “P4xos: Consensus as a network service,” IEEE/ACM Transactions on Networking, 2020.
[137] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, and P. Kalnis, “In-network computation is a dumb idea whose time has come,” in Proceedings of the 16th ACM Workshop on Hot Topics in Networks, pp. 150–156, 2017.
[138] G. Siracusano and R. Bifulco, “In-network neural networks,” arXiv preprint arXiv:1801.05731, 2018.
[139] D. Sanvito, G. Siracusano, and R. Bifulco, “Can the network be the AI accelerator?,” in Proceedings of the 2018 Morning Workshop on In-Network Computing, pp. 20–25, 2018.
[140] F. Yang, Z. Wang, X. Ma, G. Yuan, and X. An, “SwitchAgg: a further step towards in-network computation,” arXiv preprint arXiv:1904.04024, 2019.
[141] A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. Ports, and P. Richtárik, “Scaling distributed machine learning with in-network aggregation,” arXiv preprint arXiv:1903.06701, 2019.
[142] Z. Xiong and N. Zilberman, “Do switches dream of machine learning? toward in-network classification,” in Proceedings of the 18th ACM Workshop on Hot Topics in Networks, pp. 25–33, 2019.
[143] T. Jepsen, M. Moshref, A. Carzaniga, N. Foster, and R. Soulé, “Life in the fast lane: a line-rate linear road,” in Proceedings of the Symposium on SDN Research, pp. 1–7, 2018.
[144] T. Kohler, R. Mayer, F. Dürr, M. Maaß, S. Bhowmik, and K. Rothermel, “P4CEP: towards in-network complex event processing,” in Proceedings of the 2018 Morning Workshop on In-Network Computing, pp. 33–38, 2018.
[145] L. Chen, G. Chen, J. Lingys, and K. Chen, “Programmable switch as a parallel computing device,” arXiv preprint arXiv:1803.01491, 2018.
[146] T. Jepsen, D. Alvarez, N. Foster, C. Kim, J. Lee, M. Moshref, and R. Soulé, “Fast string searching on PISA,” in Proceedings of the 2019 ACM Symposium on SDN Research, pp. 21–28, 2019.
[147] Y. Qiao, X. Kong, M. Zhang, Y. Zhou, M. Xu, and J. Bi, “Towards in-network acceleration of erasure coding,” in Proceedings of the Symposium on SDN Research, pp. 41–47, 2020.
[148] Z. Yu, Y. Zhang, V. Braverman, M. Chowdhury, and X. Jin, “NetLock: fast, centralized lock management using programmable switches,” in Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pp. 126–138, 2020.
[149] M. Tirmazi, R. Ben Basat, J. Gao, and M. Yu, “Cheetah: Accelerating database queries with switch pruning,” in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 2407–2422, 2020.
[150] S. Vaucher, N. Yazdani, P. Felber, D. E. Lucani, and V. Schiavoni, “Zipline: in-network compression at line speed,” in Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies, pp. 399–405, 2020.
[151] R. Glebke, J. Krude, I. Kunze, J. Rüth, F. Senger, and K. Wehrle, “Towards executing computer vision functionality on programmable network devices,” in Proceedings of the 1st ACM CoNEXT Workshop on Emerging in-Network Computing Paradigms, pp. 15–20, 2019.
[152] S.-Y. Wang, C.-M. Wu, Y.-B. Lin, and C.-C. Huang, “High-speed data-plane packet aggregation and disaggregation by P4 switches,” Journal of Network and Computer Applications, vol. 142, pp. 98–110, 2019.
[153] S.-Y. Wang, J.-Y. Li, and Y.-B. Lin, “Aggregating and disaggregating packets with various sizes of payload in P4 switches at 100 Gbps line rate,”

Journal of Network and Computer Applications, p. 102676, 2020.
[154] Y.-B. Lin, S.-Y. Wang, C.-C. Huang, and C.-M. Wu, “The SDN approach for the aggregation/disaggregation of sensor data,” Sensors, vol. 18, no. 7, p. 2025, 2018.
[155] A. L. R. Madureira, F. R. C. Araújo, and L. N. Sampaio, “On supporting IoT data aggregation through programmable data planes,” Computer Networks, p. 107330, 2020.
[156] M. Uddin, S. Mukherjee, H. Chang, and T. Lakshman, “SDN-based service automation for IoT,” in , pp. 1–10, IEEE, 2017.
[157] M. Uddin, S. Mukherjee, H. Chang, and T. Lakshman, “SDN-based multi-protocol edge switching for IoT service automation,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 12, pp. 2775–2786, 2018.
[158] V. Sivaraman, S. Narayana, O. Rottenstreich, S. Muthukrishnan, and J. Rexford, “Heavy-hitter detection entirely in the data plane,” in Proceedings of the Symposium on SDN Research, pp. 164–176, 2017.
[159] R. Harrison, Q. Cai, A. Gupta, and J. Rexford, “Network-wide heavy hitter detection with commodity switches,” in Proceedings of the Symposium on SDN Research, pp. 1–7, 2018.
[160] J. Kučera, D. A. Popescu, G. Antichi, J. Kořenek, and A. W. Moore, “Seek and push: detecting large traffic aggregates in the dataplane,” arXiv preprint arXiv:1805.05993, 2018.
[161] R. Ben-Basat, X. Chen, G. Einziger, and O. Rottenstreich, “Efficient measurement on programmable switches using probabilistic recirculation,” in , pp. 313–323, IEEE, 2018.
[162] D. Ding, M. Savi, G. Antichi, and D. Siracusa, “An incrementally-deployable P4-enabled architecture for network-wide heavy-hitter detection,” IEEE Transactions on Network and Service Management, vol. 17, no. 1, pp. 75–88, 2020.
[163] L. Tang, Q. Huang, and P. P. Lee, “A fast and compact invertible sketch for network-wide heavy flow detection,” IEEE/ACM Transactions on Networking, vol. 28, no. 5, pp. 2350–2363, 2020.
[164] M. V. B. da Silva, J. A. Marques, L. P. Gaspary, and L. Z. Granville, “Identifying elephant flows using dynamic thresholds in programmable IXP networks,” Journal of Internet Services and Applications, vol. 11, no. 1, pp. 1–12, 2020.
[165] D. Scholz, A. Oeldemann, F. Geyer, S. Gallenmüller, H. Stubbe, T. Wild, A. Herkersdorf, and G. Carle, “Cryptographic hashing in P4 data planes,” in , pp. 1–6, IEEE, 2019.
[166] F. Hauser, M. Häberle, M. Schmidt, and M. Menth, “P4-IPsec: implementation of IPsec gateways in P4 with SDN control for host-to-site scenarios,” arXiv preprint arXiv:1907.03593, 2019.
[167] F. Hauser, M. Schmidt, M. Häberle, and M. Menth, “P4-MACsec: dynamic topology monitoring and data layer protection with MACsec in P4-based SDN,” IEEE Access, 2020.
[168] X. Chen, “Implementing AES encryption on programmable switches via scrambled lookup tables,” in Proceedings of the Workshop on Secure Programmable Network Infrastructure, SPIN ’20, p. 8–14, Association for Computing Machinery, 2020.
[169] R. Meier, P. Tsankov, V. Lenders, L. Vanbever, and M. Vechev, “NetHide: secure and practical network topology obfuscation,” in USENIX Security Symposium (USENIX Security 18), pp. 693–709, 2018.
[170] H. M. Moghaddam and A. Mosenia, “Anonymizing masses: practical light-weight anonymity at the network level,” arXiv preprint arXiv:1911.09642, 2019.
[171] H. Kim and A. Gupta, “ONTAS: flexible and scalable online network traffic anonymization system,” in Proceedings of the 2019 Workshop on Network Meets AI & ML, pp. 15–21, 2019.
[172] T. Datta, N. Feamster, J. Rexford, and L. Wang, “SPINE: surveillance protection in the network elements,” in USENIX Workshop on Free and Open Communications on the Internet (FOCI), 2019.
[173] R. Datta, S. Choi, A. Chowdhary, and Y. Park, “P4Guard: designing P4 based firewall,” in MILCOM 2018-2018 IEEE Military Communications Conference (MILCOM), pp. 1–6, IEEE, 2018.
[174] A. Almaini, A. Al-Dubai, I. Romdhani, and M. Schramm, “Delegation of authentication to the data plane in software-defined networks,” in , pp. 58–65, IEEE, 2019.
[175] Q. Kang, L. Xue, A. Morrison, Y. Tang, A. Chen, and X. Luo, “Programmable in-network security for context-aware BYOD policies,” arXiv preprint arXiv:1908.01405, 2019.
[176] S. Bai, H. Kim, and J. Rexford, “Passive OS fingerprinting on commodity switches,”
[177] G. Li, M. Zhang, C. Liu, X. Kong, A. Chen, G. Gu, and H. Duan, “NetHCF: enabling line-rate and adaptive spoofed IP traffic filtering,” in , pp. 1–12, IEEE, 2019.
[178] J. Xing, W. Wu, and A. Chen, “Architecting programmable data plane defenses into the network with FastFlex,” in Proceedings of the 18th ACM Workshop on Hot Topics in Networks, pp. 161–169, 2019.
[179] Q. Kang, J. Xing, and A. Chen, “Automated attack discovery in data plane systems,” in USENIX Workshop on Cyber Security Experimentation and Test (CSET), 2019.
[180] A. Febro, H. Xiao, and J. Spring, “Distributed SIP DDoS defense with P4,” in , pp. 1–8, IEEE, 2019.
[181] Â. C. Lapolli, J. A. Marques, and L. P. Gaspary, “Offloading real-time DDoS attack detection to programmable data planes,” in , pp. 19–27, IEEE, 2019.
[182] Y. Mi and A. Wang, “ML-pushback: machine learning based pushback defense against DDoS,” in Proceedings of the 15th International Conference on emerging Networking EXperiments and Technologies, pp. 80–81, 2019.
[183] D. Scholz, S. Gallenmüller, H. Stubbe, B. Jaber, M. Rouhi, and G. Carle, “Me love (SYN-) cookies: SYN flood mitigation in programmable data planes,” arXiv preprint arXiv:2003.03221, 2020.
[184] M. Zhang, G. Li, S. Wang, C. Liu, A. Chen, H. Hu, G. Gu, Q. Li, M. Xu, and J. Wu, “Poseidon: mitigating volumetric DDoS attacks with programmable switches,” in

Proceedings of NDSS , 2020.[185] K. Friday, E. Kfoury, E. Bou-Harb, and J. Crichigno, “Towards auniﬁed in-network DDoS detection and mitigation strategy,” in , pp. 218–226, 2020.[186] J. Xing, Q. Kang, and A. Chen, “NetWarden: mitigating network covertchannels while preserving performance,” in { USENIX } SecuritySymposium ( { USENIX } Security 20) , 2020.[187] A. Laraba, J. François, I. Chrisment, S. R. Chowdhury, and R. Boutaba,“Defeating protocol abuse with p4: Application to explicit conges-tion notiﬁcation,” in ,pp. 431–439, IEEE, 2020.[188] “Ripple: A programmable, decentralized link-ﬂooding defense againstadaptive adversaries,” in , (Vancouver, B.C.), USENIX Association, 2021.[189] C. Zhang, J. Bi, Y. Zhou, J. Wu, B. Liu, Z. Li, A. B. Dogar, andY. Wang, “P4DB: on-the-ﬂy debugging of the programmable dataplane,” in , pp. 1–10, IEEE, 2017.[190] Y. Zhou, J. Bi, Y. Lin, Y. Wang, D. Zhang, Z. Xi, J. Cao, and C. Sun,“P4tester: efﬁcient runtime rule fault detection for programmable dataplanes,” in

Proceedings of the International Symposium on Quality ofService , pp. 1–10, 2019.[191] M. V. Dumitru, D. Dumitrescu, and C. Raiciu, “Can we exploit buggyP4 programs?,” in

Proceedings of the Symposium on SDN Research ,pp. 62–68, 2020.[192] S. Kodeswaran, M. T. Arashloo, P. Tammana, and J. Rexford, “TrackingP4 program execution in the data plane,” in

Proceedings of theSymposium on SDN Research , pp. 117–122, 2020.[193] Y. Zhou, J. Bi, T. Yang, K. Gao, C. Zhang, J. Cao, and Y. Wang,“Keysight: Troubleshooting programmable switches via scalable high-coverage behavior tracking,” in , pp. 291–301, IEEE, 2018.[194] N. Lopes, N. Bjørner, N. McKeown, A. Rybalchenko, D. Talayco,and G. Varghese, “Automatically verifying reachability and well-formedness in P4 networks,”

Technical Report, Tech. Rep , 2016.[195] L. Freire, M. Neves, L. Leal, K. Levchenko, A. Schaeffer-Filho, and M. Barcellos, “Uncovering bugs in P4 programs with assertion-basedveriﬁcation,” in

Proceedings of the Symposium on SDN Research ,pp. 1–7, 2018.[196] M. Neves, L. Freire, A. Schaeffer-Filho, and M. Barcellos, “Veriﬁcationof P4 programs in feasible time using assertions,” in

Proceedings of the14th International Conference on emerging Networking EXperimentsand Technologies , pp. 73–85, 2018.[197] J. Liu, W. Hallahan, C. Schlesinger, M. Sharif, J. Lee, R. Soulé,H. Wang, C. Ca¸scaval, N. McKeown, and N. Foster, “P4v: practicalveriﬁcation for programmable data planes,” in

Proceedings of the 2018Conference of the ACM Special Interest Group on Data Communica-tion , pp. 490–503, 2018.[198] A. Nötzli, J. Khan, A. Fingerhut, C. Barrett, and P. Athanas, “P4pktgen:automated test case generation for P4 programs,” in

Proceedings of theSymposium on SDN Research , pp. 1–7, 2018.[199] D. Lukács, M. Tejfel, and G. Pongrácz, “Keeping P4 switches fast andfault-free through automatic veriﬁcation,”

Acta Cybernetica , vol. 24,no. 1, pp. 61–81, 2019.[200] R. Stoenescu, D. Dumitrescu, M. Popovici, L. Negreanu, and C. Raiciu,“Debugging P4 programs with Vera,” in

Proceedings of the 2018 Con-ference of the ACM Special Interest Group on Data Communication ,pp. 518–532, 2018.[201] A. Shukla, K. N. Hudemann, A. Hecker, and S. Schmid, “Runtime ver-iﬁcation of P4 switches with reinforcement learning,” in

Proceedingsof the 2019 Workshop on Network Meets AI & ML , pp. 1–7, 2019.[202] D. Dumitrescu, R. Stoenescu, L. Negreanu, and C. Raiciu, “bf4: to-wards bug-free P4 programs,” in

Proceedings of the Annual conferenceof the ACM Special Interest Group on Data Communication on theapplications, technologies, architectures, and protocols for computercommunication , pp. 571–585, 2020.[203] A. Bas and A. Fingerhut, “P4 tutorial, slide 22.” [Online]. Available:https://tinyurl.com/tb4m749.[204] M. Shahbaz, S. Choi, B. Pfaff, C. Kim, N. Feamster, N. McKeown, andJ. Rexford, “PISCES: A programmable, protocol-independent softwareswitch,” in

Proceedings of the 2016 ACM SIGCOMM Conference ,pp. 525–538, 2016.[205] B. Pfaff, J. Pettit, T. Koponen, E. Jackson, A. Zhou, J. Rajahalme,J. Gross, A. Wang, J. Stringer, P. Shelar, et al. , “The design andimplementation of open vswitch,” in { USENIX } Symposium onNetworked Systems Design and Implementation (NSDI)

ACMSIGCOMM , 2015.[208] C. Hopps et al. , “Analysis of an equal-cost multi-path algorithm,” tech.rep., RFC 2992, November, 2000.[209] S. Sinha, S. Kandula, and D. Katabi, “Harnessing TCP’s burstinesswith ﬂowlet switching,” in

Proc. 3rd ACM Workshop on Hot Topics inNetworks (Hotnets-III) , Citeseer, 2004.[210] C. Kim, P. Bhide, E. Doe, H. Holbrook, A. Ghanwani, D. Daly,M. Hira, and B. Davie, “In-band network telemetry (INT),” technicalspeciﬁcation , 2016.[211] M. A. Vieira, M. S. Castanho, R. D. Pacíﬁco, E. R. Santos, E. P. C.Júnior, and L. F. Vieira, “Fast packet processing with eBPF and XDP:concepts, code, challenges, and applications,”

ACM Computing Surveys(CSUR) , vol. 53, no. 1, pp. 1–36, 2020.[212] J. Crichigno, E. Bou-Harb, and N. Ghani, “A comprehensive tutorialon science DMZ,”

IEEE Communications Surveys & Tutorials , vol. 21,no. 2, pp. 2041–2078, 2018.[213] J. F. Kurose and K. W. Ross, “Computer networking a top downapproach featuring the intel,” 2016.[214] S. Ha, I. Rhee, and L. Xu, “CUBIC: a new TCP-friendly high-speedTCP variant,”

ACM SIGOPS operating systems review , vol. 42, no. 5,pp. 64–74, 2008.[215] D. Leith and R. Shorten, “H-TCP: TCP congestion control forhigh bandwidth-delay product paths,” draft-leith-tcp-htcp-06 (work inprogress) , 2008.[216] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson,“BBR: congestion-based congestion control,”

Communications of theACM , vol. 60, no. 2, pp. 58–66, 2017.[217] S. Floyd, “TCP and explicit congestion notiﬁcation,”

ACM SIGCOMMComputer Communication Review , vol. 24, no. 5, pp. 8–23, 1994.[218] R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi,A. Vahdat, Y. Wang, D. Wetherall, and D. Zats, “TIMELY: RTT-basedcongestion control for the data center,”

ACM SIGCOMM Computer Communication Review , vol. 45, no. 4, pp. 537–550, 2015.[219] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron,J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang, “Congestion controlfor large-scale RDMA deployments,”

ACM SIGCOMM ComputerCommunication Review , vol. 45, no. 4, pp. 523–536, 2015.[220] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prab-hakar, S. Sengupta, and M. Sridharan, “Data Center TCP (DCTCP),”in

Proceedings of the ACM SIGCOMM 2010 conference , pp. 63–74,2010.[221] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar,and S. Shenker, “pFabric: minimal near-optimal datacenter transport,”

ACM SIGCOMM Computer Communication Review , vol. 43, no. 4,pp. 435–446, 2013.[222] M. Dong, Q. Li, D. Zarchy, P. B. Godfrey, and M. Schapira, “ { PCC } :Re-architecting congestion control for consistent high performance,”in { USENIX } Symposium on Networked Systems Design andImplementation (NSDI) , pp. 395–408, 2015.[223] A. Langley, A. Riddoch, A. Wilk, A. Vicente, C. Krasic, D. Zhang,F. Yang, F. Kouranov, I. Swett, J. Iyengar, et al. , “The QUIC transportprotocol: design and Internet-scale deployment,” in

Proceedings of theConference of the ACM Special Interest Group on Data Communica-tion , pp. 183–196, 2017.[224] P. Cheng, F. Ren, R. Shu, and C. Lin, “Catch the whole lot in an action:rapid precise packet loss notiﬁcation in data center,” in { USENIX } Symposium on Networked Systems Design and Implementation (NSDI) ,pp. 17–28, 2014.[225] A. Ramachandran, S. Seetharaman, N. Feamster, and V. Vazirani, “Fastmonitoring of trafﬁc subpopulations,” in

Proceedings of the 8th ACMSIGCOMM conference on Internet measurement , pp. 257–270, 2008.[226] N. Alon, Y. Matias, and M. Szegedy, “The space complexity ofapproximating the frequency moments,”

Journal of Computer andsystem sciences , vol. 58, no. 1, pp. 137–147, 1999.[227] V. Braverman and R. Ostrovsky, “Zero-one frequency laws,” in

Pro-ceedings of the forty-second ACM symposium on Theory of computing ,pp. 281–290, 2010.[228] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent itemsin data streams,” in

International Colloquium on Automata, Languages,and Programming , pp. 693–703, Springer, 2002.[229] G. Cormode and S. Muthukrishnan, “An improved data stream sum-mary: the count-min sketch and its applications,”

Journal of Algorithms ,vol. 55, no. 1, pp. 58–75, 2005.[230] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining streamstatistics over sliding windows,”

SIAM journal on computing , vol. 31,no. 6, pp. 1794–1813, 2002.[231] S. Floyd and V. Jacobson, “Random early detection gateways forcongestion avoidance,”

IEEE/ACM Transactions on networking , vol. 1,no. 4, pp. 397–413, 1993.[232] P. Flajolet, D. Gardy, and L. Thimonier, “Birthday paradox, couponcollectors, caching algorithms and self-organizing search,”

DiscreteApplied Mathematics , vol. 39, no. 3, pp. 207–229, 1992.[233] R. Dolby, “Noise reduction systems,” Nov. 5 1974. US Patent3,846,719.[234] S. V. Vaseghi,

Advanced digital signal processing and noise reduction .John Wiley & Sons, 2008.[235] J. Gettys, “Bufferbloat: dark buffers in the Internet,”

IEEE InternetComputing , no. 3, p. 96, 2011.[236] M. Allman, “Comments on bufferbloat,”

ACM SIGCOMM ComputerCommunication Review , vol. 43, no. 1, pp. 30–37, 2013.[237] Y. Gong, D. Rossi, C. Testa, S. Valenti, and M. D. Täht, “Fighting thebufferbloat: on the coexistence of AQM and low priority congestioncontrol,”

Computer Networks , vol. 65, pp. 255–267, 2014.[238] C. Staff, “Bufferbloat: what’s wrong with the Internet?,”

Communica-tions of the ACM , vol. 55, no. 2, pp. 40–47, 2012.[239] V. G. Cerf, “Bufferbloat and other internet challenges,”

IEEE InternetComputing , vol. 18, no. 5, pp. 80–80, 2014.[240] F. Schwarzkopf, S. Veith, and M. Menth, “Performance analysis ofCoDel and PIE for saturated TCP sources,” in , vol. 1, pp. 175–183, IEEE, 2016.[241] A. Mushtaq, R. Mittal, J. McCauley, M. Alizadeh, S. Ratnasamy,and S. Shenker, “Datacenter congestion control: identifying what isessential and making it practical,”

ACM SIGCOMM Computer Com-munication Review , vol. 49, no. 3, pp. 32–38, 2019.[242] K. Nichols, S. Blake, F. Baker, and D. Black, “Deﬁnition of thedifferentiated services ﬁeld (DS ﬁeld) in the IPv4 and IPv6 headers,”RFC8376. [Online]. Available: https://tools.ietf.org/html/rfc8376.[243] B. Fenner, M. Handley, H. Holbrook, I. Kouvelas, R. Parekh, Z. Zhang,and L. Zheng, “Protocol independent multicast-sparse mode (PIM-SM): protocol speciﬁcation (revised).,” [Online]. Available: https://tools.ietf.org/html/rfc7761.[244] H. Holbrook, B. Cain, and B. Haberman, “Using Internet group man-agement protocol version 3 (IGMPv3) and multicast listener discoveryprotocol version 2 (MLDv2) for source-speciﬁc multicast,”

RFC 4604(Proposed Standard), Internet Engineering Task Force , 2006.[245] I. Wijnands, E. C. Rosen, A. Dolganow, T. Przygienda, and S. Aldrin,“Multicast using bit index explicit replication (BIER),” in

RFC Editor ,2017.[246] B. Carpenter and S. Brim, “Middleboxes: taxonomy and issues,” 2002.[Online]. Available: https://tools.ietf.org/html/rfc3234.[247] J. McCauley, A. Panda, A. Krishnamurthy, and S. Shenker, “Thoughtson load distribution and the role of programmable switches,”

ACMSIGCOMM Computer Communication Review , vol. 49, no. 1, pp. 18–23, 2019.[248] T. Norp, “5G Requirements and key performance indicators,”

Journalof ICT Standardization , vol. 6, no. 1, pp. 15–30, 2018.[249] G. Xylomenos, C. N. Ververidis, V. A. Siris, N. Fotiou, C. Tsilopou-los, X. Vasilakos, K. V. Katsaros, and G. C. Polyzos, “A surveyof information-centric networking research,”

IEEE communicationssurveys & tutorials , vol. 16, no. 2, pp. 1024–1049, 2013.[250] D. L. Tennenhouse and D. J. Wetherall, “Towards an active networkarchitecture,” in

Proceedings DARPA Active Networks Conference andExposition , pp. 2–15, IEEE, 2002.[251] E. F. Kfoury, J. Gomez, J. Crichigno, E. Bou-Harb, and D. Khoury,“Decentralized distribution of PCP mappings over blockchain forend-to-end secure direct communications,”

IEEE Access , vol. 7,pp. 110159–110173, 2019.[252] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn,“Ceph: A scalable, high-performance distributed ﬁle system,” in

Pro-ceedings of the 7th symposium on Operating systems design andimplementation , pp. 307–320, 2006.[253] L. Lamport et al. , “Paxos made simple,”

ACM Sigact News , vol. 32,no. 4, pp. 18–25, 2001.[254] D. Ongaro and J. Ousterhout, “In search of an understandable con-sensus algorithm,” in { USENIX } Annual Technical Conference(USENIX ATC 14) , pp. 305–319, 2014.[255] Huynh Tu Dang, “Consensus as a network service.” [Online]. Avail-able: https://tinyurl.com/y2t9plsu.[256] J. Nelson, “SwitchML scaling distributed machine learning with in net-work aggregation.” [Online]. Available: https://tinyurl.com/y53upm7k.[257] D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan,D. Kalamkar, B. Kaul, and P. Dubey, “Distributed deep learn-ing using synchronous stochastic gradient descent,” arXiv preprintarXiv:1602.06709 , 2016.[258] S. Farrell, “Low-power wide area network (LPWAN) overview,”RFC8376. [Online]. Available: https://tools.ietf.org/html/rfc8376.[259] A. Koike, T. Ohba, and R. Ishibashi, “IoT network architecture usingpacket aggregation and disaggregation,” in , pp. 1140–1145,IEEE, 2016.[260] J. Deng and M. Davis, “An adaptive packet aggregation algorithmfor wireless networks,” in , pp. 1–6, IEEE, 2013.[261] Y. Yasuda, R. Nakamura, and H. Ohsaki, “A probabilistic interestpacket aggregation for content-centric networking,” in ,vol. 2, pp. 783–788, IEEE, 2018.[262] A. S. Akyurek and T. S. Rosing, “Optimal packet aggregation schedul-ing in wireless networks,”

IEEE Transactions on Mobile Computing ,vol. 17, no. 12, pp. 2835–2852, 2018.[263] K. Zhou and N. Nikaein, “Packet aggregation for machine type commu-nications in LTE with random access channel,” in , pp. 262–267,IEEE, 2013.[264] A. Majeed and N. B. Abu-Ghazaleh, “Packet aggregation in multi-rate wireless LANs,” in , pp. 452–460, IEEE, 2012.[265] D. SIG, “Bluetooth core speciﬁcation version 4.2,”

Speciﬁcation of theBluetooth System , 2014.[266] S. Farahani,

ZigBee wireless networks and transceivers . Newnes, 2011.[267] O. Hersent, D. Boswarthick, and O. Elloumi,

The Internet of things:key applications and protocols . John Wiley & Sons, 2011.[268] J. Shi, W. Quan, D. Gao, M. Liu, G. Liu, C. Yu, and W. Su,“Flowlet-based stateful multipath forwarding in heterogeneous Internetof things,”

IEEE Access , vol. 8, pp. 74875–74886, 2020. [269] S. Do, L.-V. Le, B.-S. P. Lin, and L.-P. Tung, “SDN/NFV-based networkinfrastructure for enhancing IoT gateways,” in , pp. 1135–1142, IEEE, 2019.[270] A. Metwally, D. Agrawal, and A. El Abbadi, “Efﬁcient computationof frequent and top-k elements in data streams,” in InternationalConference on Database Theory , pp. 398–412, Springer, 2005.[271] S. Heule, M. Nunkesser, and A. Hall, “HyperLogLog in practice:algorithmic engineering of a state of the art cardinality estimationalgorithm,” in

Proceedings of the 16th International Conference onExtending Database Technology , pp. 683–692, 2013.[272] M. G. Reed, P. F. Syverson, and D. M. Goldschlag, “Anonymousconnections and onion routing,”

IEEE Journal on Selected areas inCommunications , vol. 16, no. 4, pp. 482–494, 1998.[273] V. Liu, S. Han, A. Krishnamurthy, and T. Anderson, “Tor instead of IP,”in

Proceedings of the 10th ACM Workshop on Hot Topics in Networks ,pp. 1–6, 2011.[274] C. Chen, D. E. Asoni, D. Barrera, G. Danezis, and A. Perrig, “HOR-NET: high-speed onion routing at the network layer,” in

Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity , pp. 1441–1454, 2015.[275] M. Zalewski and W. Stearns, “p0f,” see http://lcamtuf. coredump.cx/p0f3 , 2006.[276] J. Barnes and P. Crowley, “k-p0f: A high-throughput kernel passive OSﬁngerprinter,” in

Architectures for Networking and CommunicationsSystems , pp. 113–114, IEEE, 2013.[277] S. Hong, R. Baykov, L. Xu, S. Nadimpalli, and G. Gu, “Towards SDN-deﬁned programmable BYOD (bring your own device) security,” in

NDSS , 2016.[278] S. Hilton, “Dyn analysis summary of Friday October 21Attack, 2016..” [Online]. Available: https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/.[279] S. Kottler, “February 28th DDoS incident report, March, 2018.” [On-line]. Available: https://githubengineering.com/ddos-incident-report/.[280] D. Scholz, S. Gallenmüller, H. Stubbe, and G. Carle, “Syn ﬂood defensein programmable data planes,” in

Proceedings of the 3rd P4 Workshopin Europe , pp. 13–20, 2020.[281] J. Ioannidis and S. M. Bellovin, “Implementing pushback: router-baseddefense against DDoS attacks,” in

NDSS , 2016.[282] N. Handigol, B. Heller, V. Jeyakumar, D. Mazières, and N. McKeown,“I know what your packet did last hop: using packet histories totroubleshoot networks,” in { USENIX } Symposium on NetworkedSystems Design and Implementation ( { NSDI } , pp. 71–85, 2014.[283] Y. Zhu, N. Kang, J. Cao, A. Greenberg, G. Lu, R. Mahajan, D. Maltz,L. Yuan, M. Zhang, B. Y. Zhao, and H. Zheng, “Packet-level telemetryin large datacenter networks,” in Proceedings of the 2015 ACM Confer-ence on Special Interest Group on Data Communication , pp. 479–491,2015.[284] H. Zeng, P. Kazemian, G. Varghese, and N. McKeown, “Automatic testpacket generation,” in

Proceedings of the 8th international conferenceon Emerging networking experiments and technologies , pp. 241–252,2012.[285] P. Kazemian, G. Varghese, and N. McKeown, “Header space anal-ysis: static checking for networks,” in

Presented as part of the 9th { USENIX } Symposium on Networked Systems Design and Implemen-tation ( { NSDI } , pp. 113–126, 2012.[286] A. Khurshid, X. Zou, W. Zhou, M. Caesar, and P. B. Godfrey,“Veriﬂow: verifying network-wide invariants in real time,” in Presentedas part of the 10th { USENIX } Symposium on Networked SystemsDesign and Implementation (NSDI) , pp. 15–27, 2013.[287] R. Stoenescu, M. Popovici, L. Negreanu, and C. Raiciu, “Symnet:scalable symbolic execution for modern networks,” in

Proceedings ofthe 2016 ACM SIGCOMM Conference , pp. 314–327, 2016.[288] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T.King, “Debugging the data plane with Anteater,”

ACM SIGCOMMComputer Communication Review , vol. 41, no. 4, pp. 290–301, 2011.[289] P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, andS. Whyte, “Real time network policy checking using header spaceanalysis,” in

Presented as part of the 10th { USENIX } Symposium onNetworked Systems Design and Implementation (NSDI) , pp. 99–111,2013.[290] A. Horn, A. Kheradmand, and M. Prasad, “Delta-net: real-time networkveriﬁcation using atoms,” in { USENIX } Symposium on NetworkedSystems Design and Implementation (NSDI) , pp. 735–749, 2017.[291] S. Son, S. Shin, V. Yegneswaran, P. Porras, and G. Gu, “Model checking invariant security properties in OpenFlow,” in , pp. 1974–1979,IEEE, 2013.[292] A. Panda, O. Lahav, K. Argyraki, M. Sagiv, and S. Shenker, “Verifyingreachability in networks with mutable datapaths,” in { USENIX } Symposium on Networked Systems Design and Implementation (NSDI) ,pp. 699–718, 2017.[293] X. Gao, T. Kim, M. D. Wong, D. Raghunathan, A. K. Varma, P. G.Kannan, A. Sivaraman, S. Narayana, and A. Gupta, “Switch codegeneration using program synthesis,” in

IEEE Journal on SelectedAreas in Communications , vol. 38, no. 7, pp. 1432–1447, 2020.[295] D. Kim, Y. Zhu, C. Kim, J. Lee, and S. Seshan, “Generic externalmemory for switch data planes,” in

Proceedings of the 17th ACMWorkshop on Hot Topics in Networks , pp. 1–7, 2018.[296] D. Kim, Z. Liu, Y. Zhu, C. Kim, J. Lee, V. Sekar, and S. Seshan, “TEA:enabling state-intensive network functions on programmable switches,”in

Proceedings of the 2020 ACM SIGCOMM Conference , 2020.[297] S. Chole, A. Fingerhut, S. Ma, A. Sivaraman, S. Vargaftik, A. Berger,G. Mendelson, M. Alizadeh, S.-T. Chuang, I. Keslassy, et al. , “dRMT:disaggregated programmable switching,” in

Proceedings of the Con-ference of the ACM Special Interest Group on Data Communication ,pp. 1–14, 2017.[298] M. T. Arashloo, Y. Koral, M. Greenberg, J. Rexford, and D. Walker,“SNAP: stateful network-wide abstractions for packet processing,” in

Proceedings of the 2016 ACM SIGCOMM Conference , pp. 29–43,2016.[299] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco,and G. Bianchi, “LODGE: Local decisions on global states in pro-grammable data planes,” in , pp. 257–261, IEEE, 2018.[300] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco,and G. Bianchi, “Local decisions on replicated states (LOADER) inprogrammable data planes: programming abstraction and experimentalevaluation,” arXiv preprint arXiv:2001.07670 , 2020.[301] S. Luo, H. Yu, and L. Vanbever, “Swing state: consistent updatesfor stateful and programmable data planes,” in

Proceedings of theSymposium on SDN Research , pp. 115–121, 2017.[302] J. Xing, A. Chen, and T. E. Ng, “Secure state migration in the dataplane,” in

Proceedings of the Workshop on Secure ProgrammableNetwork Infrastructure , pp. 28–34, 2020.[303] L. Zeno, D. R. Ports, J. Nelson, and M. Silberstein, “Swishmem:Distributed shared state abstractions for programmable switches,” in

Proceedings of the 19th ACM Workshop on Hot Topics in Networks ,pp. 160–167, 2020.[304] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Iz-zard, F. Mujica, and M. Horowitz, “Forwarding metamorphosis: fastprogrammable match-action processing in hardware for SDN,”

ACMSIGCOMM Computer Communication Review , vol. 43, no. 4, pp. 99–110, 2013.[305] R. Pagh and F. F. Rodler, “Cuckoo hashing,”