An Exhaustive Survey on P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends
Elie F. Kfoury∗, Jorge Crichigno∗, Elias Bou-Harb†
∗College of Engineering and Computing, University of South Carolina, Columbia, USA
†The Cyber Center For Security and Analytics, University of Texas at San Antonio, USA
Abstract: Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols. This closed-design paradigm has limited the capability of the switches to proprietary implementations which are hard-coded by vendors, inducing a lengthy, costly, and inflexible process. Recently, data plane programmability has attracted significant attention from both the research community and the industry, permitting operators and programmers in general to run customized packet processing functions. This open-design paradigm is paving the way for an unprecedented wave of innovation and experimentation by reducing the time of designing, testing, and adopting new protocols; enabling a customized, top-down approach to develop network applications; providing granular visibility of packet events defined by the programmer; reducing complexity and enhancing resource utilization of the programmable switches; and drastically improving the performance of applications that are offloaded to the data plane. Despite the impressive advantages of programmable data plane switches and their importance in modern networks, the literature has been missing a comprehensive survey. To this end, this paper provides a background encompassing an overview of the evolution of networks from legacy to programmable, describing the essentials of programmable switches, and summarizing their advantages over Software-defined Networking (SDN) and legacy devices. The paper then presents a unique, comprehensive taxonomy of applications developed with P4 language; surveying, classifying, and analyzing more than 150 articles; discussing challenges and considerations; and presenting future perspectives and open research issues.
Index Terms: Programmable switches, P4 language, Software-defined Networking, data plane, custom packet processing, taxonomy.
I. INTRODUCTION
Since the emergence of the world wide web and the explosive growth of the Internet in the 1990s, the networking industry has been dominated by closed and proprietary hardware and software. Consider the observations made by McKeown [1] and the illustration in Fig. 1, which shows the cumulative number of Request For Comments (RFCs) [2]. While at first an increase in RFCs may appear encouraging, it has actually represented an entry barrier to the network market. The progressive reduction in the flexibility of protocol design caused by standardized requirements, which cannot be easily removed to enable protocol changes, has perpetuated the status quo. This protocol ossification [3, 4] has been characterized by a slow innovation pace at the hands of a few network vendors. As an example, after being initially conceived by Cisco and VMware [5], the Application Specific Integrated Circuit (ASIC) implementation of the Virtual Extensible LAN (VXLAN) [6], a simple frame encapsulation protocol, took several years, a process that could have been reduced to weeks by software implementations.
Footnote: The RFC and VXLAN observations are extracted from Dr. McKeown's presentation in [1].
Fig. 1. Cumulative number of RFCs.
Protocol ossification has been challenged first by Software-defined Networking (SDN) [7, 8] and then by the recent advent of programmable switches. SDN fostered major advances by explicitly separating the control and data planes, and by implementing the control plane intelligence as software outside of the switches. While SDN reduced network complexity and spurred control plane innovation at the speed of software development, it did not wrest control of the actual packet processing functions away from network vendors.
Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols (e.g., IP, Ethernet). The design cycle of switch ASICs has been characterized by a lengthy, closed, and proprietary process that usually takes years. Such a process contrasts with the agility of the software industry.
Programmable forwarding can be viewed as a natural evolution of SDN, where the software that describes how packets are processed can be conceived, tested, and deployed in a much shorter time span by operators, engineers, researchers, and practitioners in general. The de facto standard for defining the forwarding behavior is the P4 language [9], which stands for Programming Protocol-independent Packet Processors. Essentially, P4 programmable switches have removed the entry barrier to network design, previously reserved to network vendors.
Fig. 2. Paper roadmap.
The momentum of programmable switches is reflected in the global ecosystem around P4. Operators such as AT&T [10], Comcast [11], NTT [12], KPN [13], Turk Telekom [14], Deutsche Telekom [15], and China Unicom [14] are now using P4-based platforms and applications to optimize their networks. Companies with large data centers such as Facebook [16], Alibaba [17], and Google [18] operate on programmable platforms running customized software, in contrast to the fully proprietary implementations of just a few years ago [19]. Switch manufacturers such as Edgecore [20], Stordis [21], Cisco [22], Arista [23], Juniper [24], and Interface Masters [25] are now manufacturing P4 programmable switches with multiple deployment models, from fully programmable or white boxes to hybrid schemes. Chip manufacturers such as Barefoot Networks (Intel) [26], Xilinx [27], Pensando [28], Mellanox [29], and Innovium [30] have embraced programmable data planes without compromising performance.
The availability of tools and the agility of software development have opened an unprecedented possibility of experimentation and innovation by enabling network owners to build custom protocols and process them using protocol-independent primitives, reprogram the data plane in the field, and run P4 code on diverse platforms. The main agencies supporting engineering research and education worldwide are investing in programmable networks as well [31–34].
A. Contribution
Despite the increasing interest in P4 switches, previous work has only partially covered this technology. As shown in Table I, there is currently no up-to-date and comprehensive material. This paper addresses the gap by providing an overview of the evolution of networks from legacy to programmable; describing the essentials of programmable switches and P4; and summarizing the advantages of programmable switches over SDN and legacy devices. The paper continues by presenting a taxonomy of applications developed with P4; surveying, classifying, analyzing, and comparing more than 150 articles; discussing challenges and considerations; and putting forward future perspectives and open research issues.
B. Paper Organization
The roadmap of this survey is illustrated in Fig. 2. Section II studies and compares existing surveys on various P4-related topics and demonstrates the added value of this work. Section III describes the traditional and SDN devices, and the evolution toward programmable data planes. Section IV introduces programmable switches and their features and explains the Protocol Independent Switch Architecture (PISA), a pipeline forwarding model. Section V describes the survey methodology and the proposed taxonomy. Subsequent sections (from Section VI to Section XII) explore the works pertaining to the categories proposed in the taxonomy, and compare the P4 approaches in each category with one another, as well as with the legacy-enabled solutions. Section XIII outlines challenges and considerations extracted and induced from the literature, and pinpoints directions that can be explored in the future to improve the state-of-the-art solutions. Finally, Section XIV concludes the survey. The abbreviations used in this article are summarized in Table XIV, at the end of the article.
II. RELATED SURVEYS
The advantages of programmable switches have attracted considerable attention from the research community and were described in previous surveys.
Stubbe et al. [35] discussed various P4 compilers and interpreters in a short survey. This work provided a background on the P4 language and demonstrated the main building blocks that describe packet processing in a programmable switch. It outlined reference hardware and software programmable switch implementations. The survey lacks discussions on existing application schemes, challenges, and potential future work.
Dargahi et al. [36] focused on stateful data planes and their security implications. The survey has two main objectives. First, it introduces the reader to recent trends and technologies pertaining to stateful data planes. Second, it discusses relevant security issues by analyzing selected use cases. The scope of the survey is not limited to P4 for programming the data plane. Instead, it also describes other schemes such as OpenState [44], Flow-level State Transitions (FAST) [45], etc. When reviewing the security properties of stateful data planes, the authors described a mapping between potential attacks and corresponding vulnerabilities.
Cordeiro et al. [37] discussed the evolution of SDN from OpenFlow to data plane programmability. The survey briefly explained the layout of a P4 program and how it is mapped to the abstract forwarding model. It then listed various compilers, tools, simulators, and frameworks for P4 development. The authors categorized the literature into two categories: 1) programmable security and dependability management; 2) enhanced accounting and performance management. In the first category, the authors listed works pertaining to policy modeling, analysis, and verification, as well as intrusion detection and prevention, and network survivability. In the second category, the authors focused on network monitoring, traffic engineering, and load balancing. The survey only lists a limited set of papers without providing much detail or explaining how the papers differ from each other. Moreover, the survey was published in 2017, so a significant share of the P4-related work that has appeared since then is missing.
TABLE I
COMPARISON WITH RELATED SURVEYS
Column groups: Programmable switches and P4 language (Evolution, Description, Features); Taxonomy (Background, Literature, Intra-category comparison, Comparison with legacy); Discussions (Challenges, Future directions).
Paper | Evolution | Description | Features | Background | Literature | Intra-category comparison | Comparison with legacy | Challenges | Future directions
[35] | Partial | Partial | Partial | No | No | No | No | No | No
[36] | Yes | Partial | Partial | Partial | Partial | No | No | No | Partial
[37] | Yes | Partial | Partial | Partial | Yes | No | No | Partial | Partial
[38] | Partial | Partial | Partial | No | No | No | No | No | No
[39] | Yes | No | No | Partial | Partial | No | No | Partial | Partial
[40] | Yes | No | Partial | No | Partial | No | No | No | No
[41] | Yes | No | No | Yes | Partial | No | No | No | No
[42] | Partial | Partial | No | Partial | Partial | Partial | No | Partial | Partial
[43] | Yes | Partial | Partial | Partial | Partial | No | No | Partial | Partial
This paper | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Legend: Yes = covered in this survey; No = not covered in this survey; Partial = partially covered in this survey.
Satapathy et al. [38] presented a short description of the pitfalls of traditional networks and the evolution of SDN. The report briefly described elements of the P4 language. The authors then discussed the control plane and P4Runtime [46], and enumerated three use cases of P4 applications. The report concludes with potential future directions.
The short survey presented by Bifulco et al. [39] reviews the trends and issues of abstractions and architectures that realize programmable networks. The authors discussed the motivation for packet processing devices in the networking field and described the anatomy of a programmable switch. The proposed taxonomy categorizes the literature as state-based, abstraction-based, implementation-based, and layer-based. The layer-based category consists of the control/intent layer and the data plane layer; the implementation-based category encompasses software and hardware switches; the abstraction-based category includes data flow graphs and match-action pipelines; and the state-based category differentiates between stateful and stateless data planes.
Kaljic et al. [40] presented a survey on data plane flexibility and programmability in SDN networks.
The authors evaluated data plane architectures through several definitions of flexibility and programmability. In general, flexibility in SDN refers to the ability of the network to adapt its resources (e.g., changes in the topology or the network requirements). Afterwards, the authors identified key factors that influence the deviation from the original data plane given with OpenFlow. The survey concludes with future research directions.
Kannan et al. [41] presented a short survey related to the evolution of programmable networks. This work described the pre-SDN model and the evolution to SDN and the programmable data plane. The authors highlighted some features of programmable switches such as stateful processing, accurate timing information, and flexible packet cloning and recirculation. The survey categorized data plane applications into two categories, namely, network monitoring and in-network computing. While this survey listed a considerable number of papers belonging to these categories, it barely explained the operation and main ideas of each paper.
Tan et al. [42] presented a survey describing In-band Network Telemetry (INT). The survey explained the development stages and classifications of network measurement (traditional, SDN-based, and P4-based). It also outlined some existing applications that leverage INT such as congestion control, troubleshooting, etc. The survey concludes with discussions and potential future work related to INT.
Zhang et al. [43] presented a survey that focuses on stateful data planes. The survey starts with an overview of stateless and stateful data planes, then overviews and compares some stateful platforms (e.g., OpenState, FAST, FlowBlaze, etc.). The paper reviews a handful of stateful data plane applications and discusses challenges and future perspectives.
Table I summarizes the topics and the features described in the related surveys. It also highlights how this paper differs from the existing surveys. All previous surveys lack a microscopic comparison between the intra-category works. Also, none of them compare switch-based schemes against legacy server-based schemes. To the best of the authors' knowledge, this work is the first to exhaustively explore the whole programmable data plane ecosystem. Specifically, the paper describes P4 switches and provides a detailed taxonomy of applications using P4 switches. It categorizes and compares the applications within each category as well as with legacy approaches, and provides challenges and future perspectives.
III. TRADITIONAL CONTROL PLANE AND SDN
A. Traditional and SDN Devices
With traditional devices, networks are connected using protocols such as Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP) [47] running in the control plane at each device. Both control and data planes are under full control of vendors. On the other hand, SDN delineates a clear separation between the control plane and the data plane, and consolidates the control plane so that a single centralized controller can control multiple remote data planes. The controller is implemented in software, under the control of the network owner. The controller computes the tables used by each switch and distributes them via a well-defined Application Programming Interface (API), such as OpenFlow [48]. While SDN allows for the customization of the control plane, it is limited to the OpenFlow specifications and the fixed-function data plane.
TABLE II
FEATURES, TRADITIONAL, SDN, AND P4 PROGRAMMABLE DEVICES
Feature | Traditional | SDN | P4 programmable
Control - data plane separation | No clear separation | Well-defined separation | Well-defined separation
Control and data plane interface | Proprietary | Standardized APIs (e.g., OpenFlow) | Standardized (e.g., OpenFlow, P4Runtime) and program-dependent APIs
Control and data plane program-dependent APIs | NA/Proprietary | NA/Proprietary | Target independent
Functionality separation at control plane | No modular separation of functions | Modular separation: (1) functions to build topology view (state) and (2) algorithms to operate on network state | Same as SDN networks
Customization of control plane | No | Yes | Yes
Visibility of events at data plane | Low | Low | High
Flexibility to define and parse new fields and protocols | Not flexible, fixed | Subject to OpenFlow extensions | Easy, programmable by user
Customization of data plane | No | No | Yes
ASIC packet processing complexity | High, hard-coded | High, hard-coded | Low, defined by user's source code
Data plane match-action stages | Proprietary | OpenFlow assumes in series match-action stages | In series and/or in parallel
Data plane actions | Protocol-dependent primitives | Protocol-dependent primitives | Protocol-independent primitives
Infield runtime reprogrammability | No | No | Yes
Customer support | High | Medium | Low
Technology maturity | High | Medium | Low
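The centralized-controller model described in this subsection can be illustrated with a minimal sketch. It is not taken from the surveyed work or from any SDN controller: the topology format, the function name compute_forwarding_tables, and the use of breadth-first search over unweighted, bidirectional links are all assumptions made for illustration. The controller holds a global topology view, computes one forwarding table per switch, and would then push each table to its switch over an API such as OpenFlow (the push step is omitted).

```python
from collections import deque


def compute_forwarding_tables(topology, hosts):
    """Compute per-switch forwarding tables from a global topology view.

    topology: {switch: {neighbor_switch: local_out_port}} (hypothetical format,
              links assumed bidirectional and unweighted)
    hosts:    {host_ip: (attached_switch, host_facing_port)}
    returns:  {switch: {host_ip: out_port}}
    """
    tables = {switch: {} for switch in topology}
    for host_ip, (edge_switch, host_port) in hosts.items():
        # The switch the host is attached to delivers directly.
        tables[edge_switch][host_ip] = host_port
        # Breadth-first search outward from the destination's switch: every
        # newly reached switch forwards toward the switch it was reached from,
        # which lies on a shortest (hop-count) path to the destination.
        visited = {edge_switch}
        queue = deque([edge_switch])
        while queue:
            current = queue.popleft()
            for neighbor in topology[current]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    tables[neighbor][host_ip] = topology[neighbor][current]
                    queue.append(neighbor)
    return tables


# Illustrative three-switch line topology: s1 -- s2 -- s3.
example_topology = {
    "s1": {"s2": 1},
    "s2": {"s1": 1, "s3": 2},
    "s3": {"s2": 1},
}
example_hosts = {"10.0.0.1": ("s1", 3), "10.0.0.3": ("s3", 3)}
example_tables = compute_forwarding_tables(example_topology, example_hosts)
```

In this sketch the switches hold no routing logic of their own; they only receive the resulting tables, which mirrors the fixed-function data plane that the text identifies as the remaining limitation of SDN.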
B. Comparison of Traditional, SDN, and Programmable Data Plane Devices
Table II contrasts the main characteristics of traditional, SDN, and P4 programmable devices. In the latter, the forwarding behavior is defined by the user's code. Other advantages include the program-dependent APIs, where the same P4 program running on different targets requires no modifications in the runtime applications (i.e., the control plane and the interface between control and data planes are target agnostic); the protocol-independent primitives used to process packets; the more powerful computation model where the match-action stages can not only be in series but also in parallel; and the infield reprogrammability at runtime. On the other hand, the technology maturity and support for P4 devices can still be considered low in contrast to traditional and SDN devices.
C. Network Evolution and Analogy with other Domain Specific Processors
The introduction of general-purpose computers in the early 1970s enabled programmers to develop applications running on CPUs. The use of high-level languages accelerated innovation by hiding the target hardware (e.g., x86). In signal processing, Digital Signal Processors (DSPs) were developed in the late 1970s and early 1980s with instruction sets optimized for digital signal processing. Matlab is used for developing DSP applications. In graphics, Graphics Processing Units (GPUs) were developed in the late 1990s and early 2000s with instruction sets for graphics. Open Computing Language (OpenCL) is one of the main languages for developing graphics applications. In machine learning, Tensor Processing Units (TPUs) and TensorFlow were developed in the mid 2010s with instruction sets optimized for machine learning.
Programmable forwarding is part of the larger information technology evolution observed above. Specifically, over the last few years, a group of researchers developed a machine model for networking, namely the Protocol Independent Switch Architecture (PISA) [49]. PISA was designed with instruction sets optimized for network operations. The high-level language for programming PISA devices is P4.
IV. PROGRAMMABLE SWITCHES
A. PISA Architecture
PISA is a packet processing model that includes the following elements: programmable parser, programmable match-action pipeline, and programmable deparser (see Fig. 3). The programmable parser permits the programmer to define the headers (according to custom or standard protocols) and to parse them. The parser can be represented as a state machine. The programmable match-action pipeline executes the operations over the packet headers and intermediate results. A single match-action stage has multiple memory blocks (tables, registers) and Arithmetic Logic Units (ALUs), which allow for simultaneous lookups and actions. Since some action results may be needed for further processing (e.g., data dependencies), stages are arranged sequentially. The programmable deparser assembles the packet headers back and serializes them for transmission. A PISA device is protocol-independent.
Fig. 3. A PISA-based data plane and its interaction with the control plane.
In Fig. 3, the P4 program defines the format of the keys used for lookup operations. Keys can be formed using the packet headers' information. The control plane populates table entries with keys and action data. Keys are used for matching packet information (e.g., destination IP address) and action data is used for operations (e.g., output port).
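The three PISA elements described above can be mimicked in software to make the division of labor concrete. The following is a minimal sketch in Python rather than P4, and it models no particular target: the header layouts cover only Ethernet and a 20-byte IPv4 header without options, the single table keyed on the destination IP address is an illustrative choice, and all function names (parse, match_action_stage, deparse, process, build_packet) are hypothetical. The IPv4 checksum is deliberately not recomputed after the TTL changes.

```python
import struct

ETH_FMT = "!6s6sH"            # destination MAC, source MAC, EtherType
IPV4_FMT = "!BBHHHBBH4s4s"    # fixed 20-byte IPv4 header (no options)
IPV4_FIELDS = ("ver_ihl", "tos", "length", "ident", "frag",
               "ttl", "proto", "csum", "src", "dst")


def parse(packet):
    """Programmable parser modeled as a state machine over header types."""
    headers, offset, state = {}, 0, "ethernet"
    while state != "accept":
        if state == "ethernet":
            dst, src, ethertype = struct.unpack_from(ETH_FMT, packet, offset)
            headers["ethernet"] = {"dst": dst, "src": src, "ethertype": ethertype}
            offset += struct.calcsize(ETH_FMT)
            state = "ipv4" if ethertype == 0x0800 else "accept"
        elif state == "ipv4":
            values = struct.unpack_from(IPV4_FMT, packet, offset)
            headers["ipv4"] = dict(zip(IPV4_FIELDS, values))
            offset += struct.calcsize(IPV4_FMT)
            state = "accept"
    return headers, packet[offset:]


def forward(headers, meta, port):
    """Action: set the output port (action data) and decrement the TTL."""
    meta["egress_port"] = port
    headers["ipv4"]["ttl"] = max(headers["ipv4"]["ttl"] - 1, 0)


def drop(headers, meta):
    """Action: mark the packet to be discarded."""
    meta["drop"] = True


ACTIONS = {"forward": forward, "drop": drop}


def match_action_stage(table, headers, meta):
    """Look up the key (destination IP) and run the matched action."""
    if "ipv4" not in headers:
        drop(headers, meta)
        return
    action_name, action_data = table.get(headers["ipv4"]["dst"], ("drop", ()))
    ACTIONS[action_name](headers, meta, *action_data)


def deparse(headers, payload):
    """Deparser: serialize the (possibly modified) headers, then the payload."""
    eth = headers["ethernet"]
    out = struct.pack(ETH_FMT, eth["dst"], eth["src"], eth["ethertype"])
    if "ipv4" in headers:
        out += struct.pack(IPV4_FMT, *(headers["ipv4"][f] for f in IPV4_FIELDS))
    return out + payload


def process(packet, table):
    """Run one packet through parser, one match-action stage, and deparser."""
    headers, payload = parse(packet)
    meta = {"egress_port": None, "drop": False}
    match_action_stage(table, headers, meta)
    if meta["drop"]:
        return None, None
    return meta["egress_port"], deparse(headers, payload)


def build_packet(src_ip, dst_ip, payload, ttl=64):
    """Helper for experiments: build an Ethernet/IPv4 frame (checksum left 0)."""
    eth = struct.pack(ETH_FMT, b"\x02" * 6, b"\x04" * 6, 0x0800)
    ip = struct.pack(IPV4_FMT, 0x45, 0, 20 + len(payload), 0, 0, ttl, 17, 0,
                     bytes(src_ip), bytes(dst_ip))
    return eth + ip + payload


# The "control plane" populates the table: key -> (action, action data).
example_table = {bytes((10, 0, 0, 2)): ("forward", (2,))}
example_port, example_out = process(
    build_packet((10, 0, 0, 1), (10, 0, 0, 2), b"data"), example_table)
```

Here the table contents come from outside the pipeline, while the key format and the available actions are fixed by the program, which is the split between control plane and P4 program that Fig. 3 depicts.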
B. Programmable Switch Features
The main features of programmable switches are [50]:
• Agility: the programmer can design, test, and adopt new protocols and features in significantly shorter times (i.e., weeks or months rather than years).
• Top-down design: for decades the networking industry operated in a bottom-up approach. Fixed-function ASICs are at the bottom and impose the available protocols and features on the programmer at the top. With programmable switches, the programmer describes protocols and features in the ASICs. Note that the physical layer and parts of the MAC layer may not be programmable.
• Visibility: programmable switches provide greater visibility into the behavior of the network. INT is an example of a framework to collect and retrieve information from the data plane, without intervention of the control plane.
• Reduced complexity: fixed-function switches incorporate a large superset of protocols. These protocols consume resources and add complexity to the processing logic, which is hard-coded in silicon. With programmable switches, the programmer has the option to implement only those protocols that are needed.
• Differentiation: the customized protocol or feature implemented by the programmer need not be shared with the chip manufacturer.
• Enhanced performance: programmable switches do not introduce a performance penalty. On the contrary, they may produce better performance than fixed-function switches. Table III shows a comparison between a programmable switch and a fixed-function switch, reproduced from [51]. Note the enhanced performance of the former (e.g., maximum forwarding rate, latency, power draw).
TABLE III
COMPARISON BETWEEN A P4 PROGRAMMABLE SWITCH AND A FIXED-FUNCTION SWITCH [51]
Characteristic | Programmable | Fixed-function
Throughput | 6.4 Tb/s | 6.4 Tb/s
Number of 100G ports | 64 | 64
Max forwarding rate | 4.8B pps | 4.2B pps
Max 25G/10G ports | 256/258 | 128/130
Programmable | Yes (P4) | No
Power draw | 4.2 W per port | 4.9 W per port
Large scale NAT | Yes (100k) | No
Large scale stateful ACL | Yes (100k) | No
Large scale tunnels | Yes (192k) | No
Packet buffers | Unified | Segmented
LAG/ECMP | Full entropy, programmable | Hash seed, reduced entropy
ECMP | 256-way | 128-way
Telemetry | Line-rate per flow stats | SFlow (sampled)
Latency | Under 400 ns | Under 450 ns
C. P4 Language
P4 has a reduced instruction set and has the following goals:
• Reconfigurability: the parser and the processing logic can be redefined in the field.
• Protocol independence: the switch is protocol-agnostic. The programmer defines the protocols, the parser, and the operations to process the headers.
• Target independence: the underlying ASIC is hidden from the programmer. The compiler takes the switch's capabilities into account when turning a target-independent P4 program into a target-dependent binary.
Fig. 4. (a) Distribution of surveyed data plane research works per year (2016 to 2020). (b) Implementation platform distribution: software 48.5%, ASIC 38.6%, NetFPGA 7.9%, SmartNICs 5%. The shares are calculated based on the studied papers in this survey.
The original specification of the P4 language was released in 2014, and is referred to as P4_14. In 2016, a new version of the language was drafted, which is referred to as P4_16. P4_16 is a more mature language which extended the P4 language to broader underlying targets: ASICs, Field-Programmable Gate Arrays (FPGAs), Network Interface Cards (NICs), etc.
V. METHODOLOGY AND TAXONOMY
This section describes the systematic methodology that was adopted to generate the proposed taxonomy. The results of this literature survey represent findings derived by thoroughly exploring more than 150 data plane-related research works from 2016 up to late 2020. Their distribution per year is summarized in Fig. 4 (a).
Fig. 4 (b) depicts the share of each implementation platform used in the surveyed papers, grouped by software (e.g., BMv2, PISCES), ASIC (e.g., Tofino, Cavium), NetFPGA (e.g., NetFPGA SUME), and SmartNICs (e.g., Netronome NFP). The graph shows that the largest share of the works (48.5%) was implemented on software switches. Note that behavioral software switches (e.g., BMv2 [203]) are not suitable indicators of whether the program could run on a hardware target; they are typically used for prototyping ideas and to foster innovation. On the other hand, non-behavioral software switches (e.g., PISCES [204], derived from Open vSwitch (OVS) [205]) are production-grade and can be deployed in data centers.
Each category of hardware target accounts for a smaller share of the platform distribution than software switches. A possible reason is that the technology is still recent and targets are still not widely available for sale to the public. For example, to acquire a switch equipped with a Tofino chip (e.g., Edgecore Wedge100BF-32 [20]), and to get the development environment and the customer support, a Non-Disclosure Agreement (NDA) with Barefoot Networks should be signed. Additionally, the client should attend a training course (e.g., [206]) to understand the architecture and the specifics of the platform. This process is considered lengthy and costly, and not every institution is capable of affording it.
The proposed taxonomy is shown in Fig. 5. The taxonomy was meticulously designed to cover the most significant works related to data plane programmability and P4. The aim is to categorize the surveyed works based on various high-level disciplines.
The taxonomy provides a clear separation of categories so that a reader interested in a specific discipline can read only the works pertaining to that discipline. The correctness of the taxonomy was verified by carefully examining the related work of each paper to correlate the papers into high-level categories. Each high-level category is further divided into sub-categories. For instance, various measurement works belong to the sub-category "Measurements" under the high-level category "Network Performance".
Further, the survey compares the results and the features offered by programmable data plane approaches (intra-category), as well as with those of the contemporary and legacy ones. This detailed comparison is elaborated upon for each sub-category, giving the interested reader a comprehensive view of the state-of-the-art findings of that sub-category. Additionally, the survey presents various challenges and considerations, as well as some current and future trends that could be explored as future work.
Fig. 5. Taxonomy of programmable switches literature based upon relevant, explored research areas. The categories and sub-categories are:
• In-Band Network Telemetry (INT): Variations [52–57]; Collectors and Solutions [58–62].
• Network Performance: Congestion Control [63–68]; Measurements [69–90]; AQM [91–95]; QoS and TM [96–99]; Multicast [100–102].
• Middlebox Functions: Load Balancing [103–109]; Caching [110–117]; Telecom Services [118–124]; Pub/Sub [125–128].
• Accelerated Computations: Consensus [129–136]; Machine Learning [137–142]; Miscellaneous [143–151].
• Internet of Things (IoT): Aggregation [152–155]; Service Automation [156, 157].
• Security: Heavy Hitter [158–164]; Cryptography [165–168]; Anonymity [169–172]; Access Control [173–176]; Attacks and Defenses [177–188].
• Testing: Troubleshoot [189–193]; Verification [194–202].
Fig. 6. In-band Network Telemetry (INT).
VI. IN-BAND NETWORK TELEMETRY (INT)
Conventional monitoring and collecting tools and protocols (e.g., ping, traceroute, Simple Network Management Protocol (SNMP), NetFlow, sFlow) are not sufficiently accurate to troubleshoot the network, especially in the presence of congestion. These methods provide millisecond accuracy at best and cannot capture events that happen on the microsecond scale. Moreover, they cannot provide per-packet visibility across the network.
In-band Network Telemetry (INT) [207] is one of the earliest key applications of programmable data plane switches. It enables querying the internal state of the switch and provides fine-grained and precise telemetry measurements (e.g., queue occupancy, link utilization, queuing latency, etc.). INT handles events that occur on the microsecond scale, also known as microbursts. Collecting and reporting the network state is performed entirely by the data plane, without any intervention from the control plane. Due to the increased visibility achieved with INT, network operators are able to troubleshoot problems more efficiently. Additionally, it is possible to perform instant processing in the data plane after measuring telemetry data (e.g., reroute flows when a link is congested), without having to interact with the control plane. Fig. 6 shows an INT-enabled network. INT enables network administrators to determine the following:
• The path a packet took when traversing the network (see Fig. 7). Such information is difficult to learn using existing technologies when multi-path routing strategies (e.g., Equal-cost Multi-Path Routing (ECMP) [208], flowlet switching [209]) are used.
• The matched rules that forwarded the packets (e.g., ACL entry, routing lookup).
• The time a packet spent in the queue of each switch.
• The flows that shared the queue with a certain packet.

The P4 Applications Working Group developed the INT telemetry specifications [210] with contributions from key enablers of the P4 language such as Barefoot Networks, VMware, Alibaba, and others.

INT allows instrumenting the metadata to be monitored without modifying the application layer. The metadata to be inserted depends on the use case; for example, if congestion is the main concern, the programmer inserts queue metadata and transit latency. An INT-enabled network has the following entities: 1) INT source: a trusted entity that instruments packets with the initial instruction set specifying what metadata should be added by other INT-capable devices; 2) INT transit hop: a device that adds its own metadata to an INT packet after examining the INT instructions inserted by the INT source; 3) INT sink: a trusted entity that extracts the INT headers in order to keep the INT operation transparent to upper-layer applications; and 4) INT collector: a device that receives and processes INT packets.

[Fig. 7 diagram: Sender → S1 (INT source) → S2, S3 (INT transit hops) → S4 (INT sink) → Receiver. The INT metadata stack carried after the packet headers grows hop by hop from [S1] to [S4][S3][S2][S1]; S4 sends the INT header and stack to the INT collector and the original packet to the receiver.]

Fig. 7. Example of how INT can be used to provide the path traversed by a packet in the network. The INT source inserts its label (S1) as well as the INT headers to instruct subsequent switches about the required operations (i.e., push their labels). Finally, switch S4 strips the INT headers from the packet and forwards them to a collector, while forwarding the original packet to the receiver.

The location of an INT header in the packet is intentionally not enforced in the specifications document. For example, it can be inserted as a payload on top of TCP, UDP, and NSH, as a Geneve option on top of Geneve, and as a VXLAN payload on top of VXLAN.
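The roles of the INT source, transit hop, and sink in the path-tracing example of Fig. 7 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Packet` class, the function names, and the `queue_depth` field are hypothetical and do not follow the header layout of the INT specification [210].

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Packet:
    payload: str
    # None means that the packet carries no INT header (hypothetical model).
    int_instructions: Optional[Tuple[str, ...]] = None
    int_stack: List[Dict[str, object]] = field(default_factory=list)


def collect(switch_id, state, instructions):
    """Return only the metadata fields requested by the INT source."""
    record = {"switch_id": switch_id}
    for name in instructions:
        if name != "switch_id":
            record[name] = state[name]
    return record


def int_source(pkt, switch_id, state, instructions):
    """Insert the instruction set and push the first metadata record."""
    pkt.int_instructions = tuple(instructions)
    pkt.int_stack.insert(0, collect(switch_id, state, instructions))
    return pkt


def int_transit(pkt, switch_id, state):
    """Push this hop's metadata if the packet carries INT instructions."""
    if pkt.int_instructions is not None:
        pkt.int_stack.insert(0, collect(switch_id, state, pkt.int_instructions))
    return pkt


def int_sink(pkt, switch_id, state):
    """Push the last record, strip INT from the packet, and build a report."""
    pkt = int_transit(pkt, switch_id, state)
    report = list(pkt.int_stack)      # sent to the INT collector
    pkt.int_instructions = None       # original packet goes to the receiver
    pkt.int_stack = []
    return pkt, report


def trace_path():
    """Replay the S1 -> S2 -> S3 -> S4 example of Fig. 7."""
    pkt = Packet(payload="data")
    pkt = int_source(pkt, "S1", {"queue_depth": 3}, ["switch_id", "queue_depth"])
    pkt = int_transit(pkt, "S2", {"queue_depth": 7})
    pkt = int_transit(pkt, "S3", {"queue_depth": 1})
    pkt, report = int_sink(pkt, "S4", {"queue_depth": 0})
    path = [hop["switch_id"] for hop in reversed(report)]
    return pkt, report, path
```

The newest record is pushed at the front of the stack, which mirrors the [S4][S3][S2][S1] ordering shown in Fig. 7; the collector reverses the report to obtain the path in forwarding order.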
A. Postcard-based Telemetry (PBT)
INT provides the exact forwarding path, the timestamp and latency at each network node, and other information. Such detailed information is derived by augmenting user packets with data collected by each switch. Postcard-based Telemetry (PBT) is an alternative to INT which does not modify user packets. Fig. 8 shows an example of PBT. As a user packet traverses the network, each switch generates a postcard and sends it to the monitor. The event that triggers the generation of the postcard is defined by the programmer, according to the application’s need. Examples include the start and/or end of a flow, sampling (e.g., one report per second), a packet dropped by the switch, queue congestion, etc.

[Fig. 8 diagram: the original packet travels from Host 1 to Host 2; each switch applies a flow watchlist and event detection and, when an event is detected, sends a postcard (original headers with switch telemetry info) to the INT collector.]

Fig. 8. Postcard-based telemetry (PBT).

TABLE IV
INT VARIATIONS COMPARISON

| Variation | Name | Overhead reduction strategy | Metadata collection | Operator intervention | Implementation |
|---|---|---|---|---|---|
| [52] | NetVision | On-demand probing | Active (segment routing) | High; telemetry through queries | Mininet |
| [53] | N/A | Flow subset selection by the knowledge plane | Passive | Low; closed-loop network | Software (BMv2) w/ ONOS controller |
| [54] | sINT | Monitoring ratio adjustment based on network changes | Passive | Low; telemetry based on network behavior | Software (BMv2) |
| [55] | INTO | Telemetry orchestration based on heuristics | Passive | High; telemetry specified by operators | N/A |
| [56] | ML-INT | Per-flow packet subset selection through sampling | Passive | High; telemetry specified by operators | ASIC (Tofino) and SmartNIC (NFP-4000) |
| [57] | PINT | Telemetry encoding on multiple packets | Passive | High; telemetry through queries | ASIC (Tofino) |
B. INT Variations

B.1. Background
Despite the improvements that INT brings compared to legacy monitoring schemes, it introduces bandwidth overhead when enabled unconditionally by network operators. In such scenarios, INT headers are added to every packet traversing the switch, and the added bytes reduce the overall network throughput. To mitigate this limitation, conditional statements are included in the P4 program to send reports only when certain events occur (e.g., queue utilization exceeds a threshold). This solution requires network operators to adjust thresholds and parameters manually based on the usual network traffic patterns. Consequently, several variations of INT have been developed, aiming at customizing its functionalities and addressing its limitations. Recent works mainly focus on minimizing the bandwidth overhead of INT by adjusting thresholds and parameters automatically, based on measured traffic patterns and the desired application type.
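The difference between unconditional and event-based reporting can be made concrete with a short sketch. The function names, the 0.8 utilization threshold, and the sample trace are assumptions chosen for illustration; in practice the condition would be written in the P4 program and evaluated per packet in the data plane.

```python
def report_on_event(queue_depth, queue_capacity, threshold=0.8):
    """Return True when queue utilization exceeds the configured threshold."""
    if queue_capacity <= 0:
        raise ValueError("queue_capacity must be positive")
    return queue_depth / queue_capacity > threshold


def telemetry_reports(queue_depths, queue_capacity, threshold=0.8):
    """Return (unconditional_reports, event_based_reports) for a trace.

    Unconditional INT reports once per packet; the event-based variant
    reports only for packets that observe a queue above the threshold.
    """
    unconditional = len(queue_depths)
    event_based = sum(
        1 for depth in queue_depths
        if report_on_event(depth, queue_capacity, threshold)
    )
    return unconditional, event_based
```

For a trace in which only a few packets see a congested queue, the event-based count is a small fraction of the per-packet count, which is the overhead reduction the variations below try to achieve without manual threshold tuning.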
B.2. Literature Review
Liu et al. [52] proposed NetVision, a telemetry system that aims at minimizing the traffic overhead of INT by using probing. NetVision actively sends the right amount and format of probe packets depending on the telemetry application (e.g., traffic engineering, network visualization). Hyun et al. [53] proposed an architecture for self-driving networks that uses INT to collect packet-level network telemetry, and Knowledge-Defined Networking (KDN) to add intelligence to network management based on the collected telemetry data. KDN accepts the network information as input and generates policies to improve the network performance. Kim et al. [54] proposed selective INT (sINT), a scheme that dynamically adjusts the insertion frequency of INT headers. A monitoring engine observes changes in consecutive INT metadata and applies a heuristic algorithm to compute the insertion ratio. Marques et al. [55] described the orchestration problem in INT, which is associated with the optimal use of network resources for collecting the state and behavior of forwarding devices through INT. Niu et al. [56] proposed multilayer INT (ML-INT), a system that visualizes IP-over-optical networks in real time. The proposed system encodes INT headers in a subset of packets pertaining to an IP flow. The encoded headers contain metadata that describes statistics of electrical and optical network elements on the flow’s routing path. Ben et al. [57] proposed Probabilistic INT (PINT), an approach that probabilistically adds telemetry information into a collection of packets to minimize the per-packet overhead associated with regular INT.
B.3. INT Variations, Comparison, and Discussions
Table IV compares the aforementioned INT variations. The main motivation behind these solutions is that the majority of applications that leverage INT (e.g., congestion control, fast reroute) only require approximations of the telemetry data and therefore do not need to gather per-packet, per-hop INT information. NetVision uses probing to reduce the overhead of INT. The main limitation of this approach is that probing might result in poor accuracy and timeliness, as the probes might experience different network conditions than actual packets. All other works collect INT information passively. [53] and sINT select flows based on current network conditions, ML-INT uses a fixed sampling scheme to select a small portion of packets in a flow, and PINT uses a probabilistic approach to encode telemetry on multiple packets. Sampling and anomaly-based monitoring might lead to information loss since not all packets are reported. Some solutions require manual intervention from the operators to configure the telemetry process. The simplicity of the configuration interface is vital to make the solution attractive to network operators. Finally, some solutions were implemented on software switches, while others were implemented on hardware. It is important to note that not all software implementations can fit into the pipeline of the hardware.
B.4. INT, PBT, and Traditional Telemetry Comparison
Table V compares INT, PBT, and traditional telemetry. INT has higher potential vulnerabilities than PBT, such as eavesdropping and tampering. Adding extra protective measures (e.g., encryption) is difficult on the fast data path. On the other hand, PBT packets tolerate additional processing to enhance security. The flow tracking process is simpler with INT than with PBT. The latter requires the server receiving the reports (i.e., INT collector, explained in Section VI-C) to correlate multiple postcards of a single flow packet passing through the network, to form the packet history at the monitor. This process also adds delay in reporting and tracking. Legacy schemes that rely on sampling and polling suffer from accuracy issues, especially when links are congested. INT on the other hand is push-based, has better accuracy, and is more granular (microseconds scale). Reports sent by an INT-capable device contain rich information (e.g., the path a packet took) that can aid in troubleshooting the network. Such visibility is minimal in legacy monitoring schemes. Programmable switches permit reporting telemetry after the occurrence of specific events (e.g., congestion). Moreover, they provide flexibility in programming reactive logic that executes promptly in the data plane. One drawback of INT is that it imposes bandwidth overhead if configured to report for every packet; however, when event-based reports are considered, the bandwidth overhead significantly decreases.

TABLE V
IN-BAND, POSTCARD-BASED, AND TRADITIONAL NETWORK TELEMETRY

| Feature | INT | PBT | Traditional |
|---|---|---|---|
| User packet modification | Yes | No | No |
| User packet overhead | Yes | No | No |
| Potential vulnerabilities | Higher | Lower | Lower |
| Flow tracking process | Simpler | More complex | More complex |
| Delay in reporting, tracking | Lowest | Low | High |
| Microbursts detection | Yes | Yes | No |
| Accuracy | Higher | Higher | Lower; especially with congested links |
| Reporting type | Push-based, initiated by the data plane | Push-based | Polling (e.g., SNMP), initiated by the control plane; sampling (e.g., NetFlow), initiated by the data plane |
| Troubleshoot problems | Easier and cheaper | Easier and cheaper | Harder and more expensive |
| Granularity | Higher; microseconds scale | Higher | Lower; milliseconds scale at best |
| Event-based monitoring | Customizable based on conditions and thresholds | Customizable | Not possible |
| Reactive processing | Faster; reactive processing is executed in the data plane | Faster | Slower; reactive processing is executed in the control plane |
| Bandwidth overhead | High when all packets are reported, low when reported based on events | Higher than INT | Lowest |
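The correlation step that PBT requires at the collector can be sketched as follows. The postcard fields (`packet_id`, `switch_id`, `timestamp_us`) are hypothetical; the sketch also assumes that every postcard carries an identifier shared by all postcards of the same packet and that switch clocks are synchronized well enough to order the hops by timestamp.

```python
from collections import defaultdict


def build_packet_histories(postcards):
    """Group postcards by packet and order each group by timestamp.

    Returns a dict that maps each packet identifier to the list of
    switches the packet traversed, in forwarding order.
    """
    by_packet = defaultdict(list)
    for card in postcards:
        by_packet[card["packet_id"]].append(card)

    histories = {}
    for packet_id, cards in by_packet.items():
        ordered = sorted(cards, key=lambda card: card["timestamp_us"])
        histories[packet_id] = [card["switch_id"] for card in ordered]
    return histories
```

With INT, the same history arrives in a single report, which is why Table V lists the flow tracking process as simpler for INT and more complex for PBT.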
C. INT Collectors

C.1. Background
An INT collector is a component in the network that processes telemetry reports produced by INT devices. It parses and filters metrics from the collected reports, then optionally stores the results persistently in a database. Since a large number of reports is typically produced in INT, having a high-performance collector is essential to avoid missing important network events. To this end, a number of research works focus on developing and enhancing the performance of INT collectors running on commodity servers.
C.2. Literature Review
IntMon [58] is an ONOS-based collector application for INT reports. It includes a web-based interface that allows controlling which flows to monitor and the specific metadata to collect. Another INT collector is the Prometheus INT exporter [59], which extracts information from every INT packet and pushes it to a gateway. A database server then periodically pulls information from the gateway. INTCollector [60] is a collector that extracts events, which are important network information, from INT raw data. It uses in-kernel processing to further improve the performance. INTCollector has two processing flows: the fast path, which processes INT reports and needs to execute quickly, and the normal path, which processes events sent from the fast path and stores information in the database. Deep Insight [61] is a proprietary solution provided by Barefoot Networks that leverages INT capabilities to provide services such as real-time anomaly detection, congestion analysis, packet-drop analysis, etc. Another proprietary solution is BroadView Analytics [62], used by Broadcom on its Trident 3 devices.
C.3. INT Collectors Comparison, Discussions, and Limitations
Fig. 9 and Table VI compare the aforementioned INT collectors. IntMon and Prometheus INT exporter were among the earliest collectors. Both have low processing rates since they are implemented without kernel or hardware acceleration. Also, they are very limited with respect to the features they provide (e.g., lack of event detection, limited analytics, historical data unavailability, etc.). Prometheus INT exporter also suffers from the increased overhead of sending the data for every INT packet to the gateway, and from the potential loss of network events, as the database only stores the latest data pulled from the gateway. INTCollector on the other hand has a higher rate and uses the eXpress Data Path (XDP) [211] to accelerate the packet processing in the kernel space. It filters the data to be published based on significant changes in the network through its event detection mechanism. DeepInsight Analytics has a modular architecture and runs on commodity servers. It executes the Barefoot SPRINT data plane telemetry, which consists of a P4 program (INT.p4) encompassing intelligent triggers. It also provides open northbound RESTful APIs that allow customers to integrate their third-party network management solutions. DeepInsight Analytics is advanced with respect to the features it provides (real-time anomaly detection, congestion analysis, packet-drop analysis, etc.). However, it is a closed-source solution and lacks reports of performance benchmarks.

Fig. 9 demonstrates the CPU efficiency of three INT collectors (IntMon, Prometheus INT exporter, and INTCollector) [60]. IntMon has the lowest throughput, and is 57 times slower than Prometheus INT exporter. INTCollector on the other hand has the highest throughput and is 27 times faster than Prometheus INT exporter.

Fig. 9. CPU efficiency with the three INT collectors. Source: INTCollector paper [60].

TABLE VI
INT COLLECTORS COMPARISON

| Collector | Name | Rate | Event detection | Processing acceleration | Open source | Historical data availability | Analytics | Implementation notes |
|---|---|---|---|---|---|---|---|---|
| [58] | IntMon | 0.1 Kpps | × | × | ✓ | × | Low | ONOS-BMv2 subsystem (ONOS 1.6) |
| [59] | Prometheus INT exporter | 6.4 Kpps | × | × | ✓ | × | Low | ONOS P4 Brigade project |
| [60] | INTCollector | 154.8 Kpps | ✓ | Yes; fast path with XDP | ✓ | ✓ | Medium | C language, XDP for in-kernel processing |
| [61] | DeepInsight | N/A | ✓ | N/A | × | ✓ | High | SPRINT data plane telemetry (INT.p4) |
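The idea of publishing only significant changes, as described for the event detection mechanism of INTCollector, can be sketched with a simple change filter. This is not INTCollector's actual code (which, per Table VI, is written in C with XDP); the class name, the absolute-difference rule, and the threshold are assumptions made for illustration.

```python
class EventFilter:
    """Publish a metric only when it changes significantly (hypothetical rule)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_published = {}   # key -> last value sent to the database

    def process(self, key, value):
        """Return True if this sample should be forwarded as an event."""
        previous = self.last_published.get(key)
        if previous is None or abs(value - previous) >= self.threshold:
            self.last_published[key] = value
            return True
        return False
```

A filter of this kind would sit in the fast path, so that only the samples for which `process` returns True reach the slower path that writes to the database.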
C.4. Collectors in INT and Legacy Monitoring Schemes Comparison
Generally, collectors used with both INT and legacy monitoring schemes run on general-purpose CPUs, and hence have comparable performance. INT produces far more reports than legacy monitoring schemes (e.g., NetFlow), and therefore requires a collector with high processing capability. INT-based collectors are typically accelerated with in-kernel fast packet processing technologies (e.g., XDP) and kernel-bypass packet processing frameworks (e.g., Data Plane Development Kit (DPDK)).
D. Summary and Lessons Learned
Legacy telemetry tools and protocols are capable of neither capturing microbursts nor providing fine-grained telemetry measurements. INT was developed to address these challenges; it enables the data plane developer to query the internal state of switches with high precision. Telemetry data are then embedded into packets and forwarded to a high-performance collector. The collector typically performs analysis and applies actions accordingly (e.g., informs the control plane to update table entries). Current research efforts mainly focus on developing variations of INT to decrease its telemetry traffic overhead, considering the overhead-accuracy trade-off. Other works aim at accelerating INT collectors to handle large volumes of traffic (in the scale of Kpps). Future work could investigate further improvements for INT such as compressing packets’ headers, broadening coverage and visibility, enriching the telemetry information, and simplifying the deployment.

VII. NETWORK PERFORMANCE
Measuring and improving network performance is critical in today’s infrastructures. Low latency and high bandwidth are key requirements to operate modern applications that continuously generate enormous amounts of data [212]. Congestion control (CC), which aims at avoiding network overload, is critical to meet these requirements. Another important concept for expediting these applications is managing the queues that form in routers and switches through Active Queue Management (AQM) algorithms. This section explores the literature related to measuring and improving the performance of programmable networks.
A. Congestion Control (CC)

A.1. Background
One of the most challenging tasks in the Internet today is congestion control and collapse avoidance [213]. The difficulty in controlling congestion is increasing due to factors such as high-speed links, traffic diversity and burstiness, and buffer sizes [63]. Today’s CC algorithms aim at shortening delays, maximizing throughput, and improving the fairness and utilization of network resources.

A tremendous amount of research has been done on congestion control, including end-host algorithms such as loss-based CC algorithms (e.g., CUBIC [214], Hamilton TCP (HTCP) [215], etc.), model-based algorithms (e.g., Bottleneck Bandwidth and Round-trip Time (BBR) [216]), congestion-signalling mechanisms (e.g., Explicit Congestion Notification (ECN) [217]), data-center specific schemes (e.g., TIMELY [218], Data Center Quantized Congestion Notification (DCQCN) [219], Data Center TCP (DCTCP) [220], pFabric [221], Performance-oriented Congestion Control (PCC) [222], etc.), and application-specific schemes (e.g., QUIC [223]).

With the advent of programmable data plane switches, researchers are investigating new methods to provide network-assisted congestion feedback for end-hosts.

[Fig. 10 diagram: packets travel from the sender to the receiver and collect INT data at the switches; the receiver returns the INT data in ACKs, and the sender adjusts its rate per ACK.]

Fig. 10. HPCC: INT-based high precision congestion control.
A.2. Literature Review
Handley et al. [63] proposed NDP, a novel protocol architecture for datacenters that aims at achieving low completion latency for short flows and high throughput for longer flows. NDP avoids core network congestion by applying per-packet multipath load balancing, which comes at the cost of reordering. It also trims the payloads of packets, similar to what is done in Cut Payload (CP) [224], whenever the queues of the switches become saturated. Once the payload is trimmed, the headers are forwarded using high-priority queues. Consequently, a Negative ACK (NACK) is generated and sent through high-priority queues so that a retransmission is sent before the low-priority queue drains. Similarly, Feldmann et al. [66] proposed a method that uses network-assisted congestion feedback (NCF) in the form of NACKs generated entirely in the data plane. NACKs are sent to throttle elephant-flow senders in case of congestion. The method maintains three separate queues for mice flows, elephant flows, and control packets to ensure fair sharing of resources.

Li et al. [65] proposed High Precision Congestion Control (HPCC), a new CC mechanism that leverages INT-based data added by P4 switches to obtain precise link load information. HPCC computes an accurate flow rate by using only one rate update, as opposed to legacy approaches that require a large number of iterations to determine the rate. HPCC provides near-zero queueing, while being almost parameterless. Fig. 10 shows the mechanism of HPCC. The switches add INT headers to every packet, and then the INT information is piggybacked onto the TCP/RDMA Acknowledgement (ACK) packet. The end-hosts then use this information to adjust the sending rate through their smart Network Interface Controllers (NICs).

Kfoury et al. [67] proposed a P4-based method to automate end-hosts’ TCP pacing. It supplies the bottleneck bandwidth and the number of elephant flows to senders so that they can pace their rates to safe targets, avoiding filling routers’ buffers. Turkovic et al. [64] proposed a P4-based method that reroutes flows to backup paths during congestion. The system detects congestion by continuously monitoring the queueing delays of latency-critical flows. The same authors [68] proposed P4Air, a method that separates the senders based on their congestion control algorithm. Each congestion control group uses a separate queue in order to enforce fairness among its competing flows.
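To make the pacing idea in [67] more tangible, the sketch below derives a per-flow pacing rate from the two values the switch supplies: the bottleneck bandwidth and the number of elephant flows. The equal split among flows is an illustrative assumption and not necessarily the rule used in [67]; the function names and the 1500-byte packet size are also hypothetical.

```python
def pacing_rate_bps(bottleneck_bps, elephant_flows):
    """Fair-share pacing target, assuming an equal split among elephant flows."""
    if elephant_flows <= 0:
        raise ValueError("elephant_flows must be positive")
    return bottleneck_bps / elephant_flows


def inter_packet_gap_s(rate_bps, packet_bytes=1500):
    """Time between packet departures needed to hold the given pacing rate."""
    if rate_bps <= 0:
        raise ValueError("rate_bps must be positive")
    return packet_bytes * 8 / rate_bps
```

A sender that spaces its packets by `inter_packet_gap_s(pacing_rate_bps(...))` avoids sending bursts at line rate, which is how pacing keeps router buffers from filling.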
A.3. CC Schemes Comparison, Discussions, and Limitations
Table VII compares the aforementioned CC schemes. NDP and NCF are similar in the sense that both use NACKs as congestion feedback. NDP avoids congestion by applying per-packet multipath load balancing. This approach works adequately with symmetric topologies, but fails when topologies are asymmetric (e.g., BCube, Jellyfish), especially during heavy network load. Another limitation of NDP is the excessive retransmissions produced by the server. NCF adopted the idea of packet trimming from NDP, but generates the NACK from the trimmed packet and sends it directly to the sender. This approach removes the receiver from the feedback loop, improving the sender’s reaction time. One limitation of NCF is that it requires operators to manually tune some of the predefined parameters (e.g., threshold, queue size, etc.). Additionally, NCF might disclose network congestion information, making it less attractive to operators. Finally, the authors of NCF claim that the approach works with both datacenters and Internet-wide scenarios. However, no implementation results were presented to evaluate the effectiveness of the solution.

HPCC leverages INT data to control network congestion. It improves the convergence time by using a Multiplicative-Increase Multiplicative-Decrease (MIMD) scheme. Note that previous TCP variants use Additive-Increase Multiplicative-Decrease (AIMD), which is conservative when increasing the rate, and hence has a slow convergence time. The reason AIMD schemes are slow is that they use single-bit congestion information (packet loss, ECN). With HPCC, end-hosts can perform aggressive increase as INT metadata encompasses precise link utilization and timely queue statistics. HPCC demonstrated promising results with respect to latency, bandwidth, and convergence time. The authors however did not evaluate how HPCC performs alongside conventional congestion control algorithms used in the Internet (e.g., CUBIC, BBR). Note that achieving inter-protocol fairness is essential for the solution to be adopted by operators.

The method in [67] uses TCP pacing. Pacing decreases throughput variations and traffic burstiness, and hence minimizes queuing delays. However, this method works well only in networks where the number of large-flow senders is small (e.g., in science Demilitarized Zone (DMZ) [212]).

P4Air, which applies traffic separation, demonstrated significant improvements in fairness compared to contemporary solutions. However, it requires allocating a queue for each congestion control algorithm group (e.g., loss-based (CUBIC), delay-based (TCP Vegas), etc.). Note that the number of queues is limited in switches, and production networks often reserve them for other applications’ QoS [65].

Note that some schemes require modifying the end-hosts (e.g., HPCC) while others are fully in-network (e.g., P4Air).

TABLE VII
CONGESTION CONTROL SCHEMES COMPARISON

| Scheme | Name | Strategy | Congestion feedback | Feedback information | Rerouting | Traffic separation | End-device modification | Implementation |
|---|---|---|---|---|---|---|---|---|
| [63] | NDP | Trim packets to headers and priority forward | ✓ | NACKs | ✓ | ✓ | ✓ | NetFPGA SUME |
| [64] | N/A | Monitor queue latency to reroute traffic on congestion | × | N/A | ✓ | × | × | BMv2 |
| [65] | HPCC | Use INT data to compute sending rate | ✓ | INT | × | × | ✓ | Tofino |
| [66] | NCF | Throttle elephant flows with NACKs | ✓ | NACKs | × | ✓ | × | N/A |
| [67] | N/A | Pace TCP traffic of elephant flows to safe targets | ✓ | Flow count and BW | × | × | ✓ | BMv2 |
| [68] | P4Air | Separate flows according to their congestion control group | × | N/A | × | ✓ | × | Tofino |

TABLE VIII
CONGESTION CONTROL SCHEMES. 1) PROGRAMMABLE SWITCHES (HPCC); 2) END-HOSTS; AND 3) LEGACY NETWORK-ASSISTED (ECN)

| Characteristic | Programmable switch | End-hosts | Legacy network-assisted (ECN) |
|---|---|---|---|
| Accuracy | Higher, INT-based, microbursts are detected and reported | Low, packet loss (e.g., CUBIC); Medium, estimated RTT and btlbw (e.g., BBR) | Lower with classic ECN; High with L4S |
| Required modifications | Switches, end-hosts | None; distributed nature of AIMD does not require storing state of flows | Minimal if ECN is used (most equipment have classic ECN implemented); High if L4S is used |
| Convergence | Faster (MIMD) | Slower (AIMD) | Adequate with ECN; Fast with L4S ECN |
| Queue utilization | Near-zero | High; possibility of Bufferbloat (e.g., CUBIC) | Low |
| Parameterization | Few | None | Few (e.g., thresholds) |
| Congestion information | Several fields (e.g., queue occupancy, link utilization, flow share, etc.) | Packets drop | 1-bit ECN mark |
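The convergence gap between additive and multiplicative increase can be shown with a few lines of arithmetic. The sketch below only models the increase phase toward a target rate; it is not HPCC’s rate computation, and the unit step, the doubling factor, and the function name are assumptions chosen for illustration.

```python
def rounds_to_reach(target, start, step):
    """Count update rounds until the rate grows from start to at least target."""
    if start <= 0:
        raise ValueError("start must be positive")
    rate, rounds = start, 0
    while rate < target:
        rate = step(rate)
        rounds += 1
    return rounds


# Additive increase: one unit per round (AIMD-style probing).
additive_rounds = rounds_to_reach(1000, 1, lambda rate: rate + 1)

# Multiplicative increase: double per round (MIMD-style probing).
multiplicative_rounds = rounds_to_reach(1000, 1, lambda rate: rate * 2)
```

Starting from 1 unit and aiming for 1000, the additive rule needs 999 rounds while doubling needs 10. Aggressive increase of this kind is safe only when the sender has precise feedback, which is what the INT metadata provides in HPCC.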
A.4. End-hosts, Programmable Switches, and Legacy Devices’ CC Schemes
Table VIII compares the CC schemes assisted by programmable switches (e.g., HPCC) with end-host CC algorithms (e.g., CUBIC) and legacy congestion signalling schemes (e.g., ECN). End-host CC algorithms infer congestion through packet drops and estimations (e.g., btlbw and Round-trip Time (RTT) estimation with BBR), which is not always sufficient to infer the existence of congestion. Legacy devices use classic ECN to signal congestion so that end-hosts slow down their transmission rates. Classic ECN is limited as it only marks a single bit to signal congestion, and is neither aggressive nor immediate. Programmable switches on the other hand use fine-grained, prompt measurements to signal congestion (e.g., INT metadata), which results in higher detection accuracy, near-zero queueing delays, and faster convergence time. The distributed nature of end-host CC schemes allows them to operate without modifying the network infrastructure and without tweaking parameters. ECN-enabled devices and programmable switches on the other hand require a few parameters (e.g., marking threshold) to adapt to different network conditions.
B. Measurements

B.1. Background
Gaining an overall understanding of the network behavior is an increasingly complex task, especially when the size of the network is large and the bandwidth is high. Legacy measurement schemes have accuracy limitations since they rely on polling and sampling-based methods to gather traffic statistics. Typically, sampling methods have low sampling rates (e.g., one in every 30,000 packets) and polling methods have large polling intervals. The literature [225] has shown that such methods are only suitable for coarse-grained visibility. The accuracy limitation of sampling and polling techniques hampers the development of measurement applications. For instance, it is not possible to accurately measure frequently changing TCP-specific fields such as congestion window, receive window, and sending rate.

Data streaming or sketching algorithms [226–230] were proposed to address the limitations of sampling and polling. They address the following problem: an algorithm is allowed to perform a constant number of passes over a data stream (input sequence of items) while using sub-linear space compared to the dataset and the dictionary sizes; desired statistical properties (e.g., median) on the data stream are then estimated by the algorithm. The main problem with such algorithms is that they are tightly coupled to the metrics of interest. This means that switch vendors should build specialized algorithms, data structures, and hardware for specific monitoring tasks. With the constraints of CPU and memory in networking devices, it is challenging to support a wide spectrum of monitoring tasks that satisfy all customers. Legacy devices also lack the capability of customizing the processing behavior so that switches co-operate in the measurement process.

With the emergence of programmable switches, it is now possible to perform fine-grained measurements in the data plane at line rate. Moreover, data structures such as sketches and bloom filters can be easily implemented and customized for specific metrics of interest. Programmable switches pave the way for new areas of research in measurements since they not only provide flexibility in inspecting the traffic statistics with high accuracy, but also allow programmers to express reactive processing in real time (e.g., dropping a packet when a threshold is exceeded, as done in Random Early Detection (RED) [231]).

B.2. Literature Review
INT provides path-level metrics, with data similar to that of polling-based techniques. Note that the metrics themselves are fixed; for instance, it is possible to determine the flow-level latency, but not the latency variation (jitter) [71]. The fixed metrics of INT also prevent performing network-wide measurements; note that the INT standard specification document does not mention methods to aggregate metadata and perform complex analytics in the data plane.

This section focuses on techniques that provide measurements that go beyond the fixed metrics extracted from the internal state of the switch.
Generic Query-based Monitoring. Operators constantly change their monitoring specifications. Adding new monitoring requirements on the fixed-function switching ASIC is expensive. Recent work explored the idea of providing a query-driven interface that allows operators to express their monitoring requirements. The queries can then be converted into switch programs (e.g., P4) to be deployed in the network. Alternatively, the queries can be executed on the control plane considering the measured information extracted from the data plane.

A simple attempt is FlowRadar [69], a system that stores counters for all flows in the data plane with a low memory footprint, then exports them periodically (every 10 ms) to a remote collector. Liu et al. [70] proposed Universal Monitoring (UnivMon), an application-agnostic monitoring framework that provides accuracy and generality across a wide range of monitoring tasks. UnivMon benefits from the granularity of the data plane to improve accuracy and runs different estimation algorithms on the control plane. Narayana et al. [71] presented Marple, a query language based on common query constructs (i.e., map, filter, group by). Marple allows performing advanced aggregation (e.g., moving average of latencies) at line rate in the data plane. Similarly, Sonata [79] provides a unified query interface that uses common dataflow operators, and partitions each query across the stream processor and the data plane. PacketScope [85] also uses dataflow constructs but allows querying the internal switch processing, both in the ingress and the egress pipelines.

Many of the previous works use the sketch data structure. The work in [88] extended the sketching approach used in previous works to support the notion of time. The motivation of this work is that recently captured traffic trends are the most relevant in network monitoring. Huang et al. [89] proposed OmniMon, an architectural design that coordinates flow-level network telemetry operations between programmable switches, end-hosts, and controllers. Such coordination aims at achieving high accuracy while maintaining low resource overhead. Chen et al. [90] proposed BeauCoup, a P4-based measurement system that handles multiple heterogeneous queries in the data plane. It offers a general query abstraction that counts the attributes across related packets identified by keys, and flags packets that surpass a defined threshold.

Other approaches such as Elastic Sketch [73] perform measurements that are adaptive to changes in network conditions (e.g., bandwidth, packet rate, and flow size distribution). *Flow [77] supports concurrent measurements and dynamic queries. Such an approach aims at minimizing the concurrency problems and the network disruption resulting from compiling excessive queries into the data plane. TurboFlow [78] aims at achieving high coverage without sacrificing information richness. Bai et al. [86] proposed FastFE, a system that performs traffic feature extraction by leveraging programmable data planes. Features are then used by traffic analysis and behavior detection ML techniques.
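Several of the systems above rely on sketches to keep per-flow statistics in sub-linear space. As a generic illustration, and not the data structure of any particular cited system, the following is a minimal count-min sketch. The SHA-256 based row hashes are an assumption made for readability; a hardware data plane would typically use its own hash units and register arrays instead.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in fixed space; never underestimates."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Derive one hash function per row by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, key, count=1):
        """Add count to one counter in every row."""
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key):
        """Return the smallest counter for key, an upper bound on its count."""
        return min(
            self.rows[row][self._index(row, key)] for row in range(self.depth)
        )
```

Collisions can only inflate a counter, so taking the minimum across rows gives an estimate that is at least the true count. This fixed coupling between the structure and the metric (here, per-key counts) is the limitation that generic query-based systems try to overcome.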
Performance Diagnosis Systems.
Recent works are leverag-ing programmable data planes to diagnose network perfor-mance. The main motivation here is that fine-grained infor-mation can be monitored at line rate, mitigating the slowreaction to “gray failures” experienced by diagnosing end-hosts in legacy approaches.Ghasemi et al. [72] proposed Dapper, an in-network TCPperformance diagnosis system. Dapper analyzes packets in realtime, and identifies and pinpoints the root cause of the bottle-neck (sender, network, or receiver). Blink [82] also diagnosesTCP-related issues. In particular, it detects failures in the dataplane based on retransmissions, and consequently, reroutestraffic. Other approaches attempt to diagnose performancedegradation manifested by an increase of latency. Wang et al.[84] proposed SpiderMon, a system that performs network-wide performance degradation diagnosis. The key idea is tohave every switch maintain fine-grained telemetry data for ashort period of time, and upon detecting performance degra-dation (e.g., increased delay), the information is offloadedto a collector. Liu et al. [81] proposed a memory-efficientapproach for network performance monitoring. This solutiononly monitors the top- k problematic flows. Queue and Other Metrics Measurement.
Programmable data planes allow querying the internal state of the queue with fine-grained visibility. Recent works leveraged this feature to provide better queueing information, which can be used by various applications (e.g., AQMs, congestion control, etc.). Chen et al. [80] proposed ConQuest, a P4-based queue measurement solution that determines the size of flows occupying the queue in real time, and identifies flows that are grabbing a significant portion of the queue. Joshi et al. [75] proposed BurstRadar, a system that uses programmable switches to monitor microbursts in the data plane. Microbursts are events of sporadic congestion that last for tens or hundreds of microseconds. Microbursts increase latency, jitter, and packet loss, especially when links' speeds are high and switch buffers are small.
Other works enabled measuring further metrics. For instance, Ding et al. [83] proposed P4Entropy, an algorithm to estimate network traffic entropy (Shannon entropy) in the data plane. Tracking entropy is useful for calculating traffic distribution in order to understand the network behavior. Another example is the system proposed by Chen et al. [87], which passively measures the RTT of TCP traffic in ISP networks. RTT measurement is important for detecting spoofing and routing attacks, ensuring Service Level Agreements (SLAs) compliance, measuring the Quality of Experience (QoE), improving congestion control, and many others.

TABLE IX. MEASUREMENTS SCHEMES COMPARISON

Generic query-based monitoring

| Ref | Name | Core idea | Approx. | External computation | Data structure | Network wide | Platform (HW or SW) |
|---|---|---|---|---|---|---|---|
| [89] | OmniMon | Coordinates flow-level telemetry among devices | × | ✓ | Slots (bloom filter) | ✓ | ✓ |
| [79] | Sonata | Uses scalable stream processor | ✓ | ✓ | Sketch | × | ✓ |
| [69] | FlowRadar | Stores flow counters and periodically exports results | ✓ | ✓ | Bloom filter | ✓ | ✓ |
| [73] | ElasticSketch | Adapts to network changing conditions | ✓ | ✓ | Sketch | ✓ | ✓ |
| [71] | Marple | Aggregates based on “map, filter, group by” constructs | × | ✓ | Key-value store | ✓ | ✓ |
| [90] | BeauCoup | Enables simultaneous multiple distinct counting queries | ✓ | × | Coupon collect (bloom filter) | × | ✓ |
| [70] | UnivMon | Provides application-agnostic monitoring | ✓ | ✓ | Universal sketches | ✓ | ✓ |
| [77] | *Flow | Groups traffic in the switch and computes statistics in servers | × | ✓ | GPV (register arrays) | × | ✓ |
| [78] | TurboFlow | Produces fine-grained and unsampled flow records | × | ✓ | Hash table | × | ✓ |
| [88] | N/A | Enables time-aware monitoring | ✓ | ✓ | Time-aware sketch | × | ✓ |
| [85] | PacketScope | Monitors packets' lifecycle inside the switch | ✓ | ✓ | Key-value store (hash table) | × | ✓ |
| [86] | FastFE | Extracts traffic features for ML models | × | ✓ | Key-value store | × | ✓ |

Performance diagnosis systems

| Ref | Name | Core idea | Scope | Reactive processing | Measured information | Network wide | Platform (HW or SW) |
|---|---|---|---|---|---|---|---|
| [72] | Dapper | Diagnoses TCP performance issues in the data plane | Identifies TCP bottleneck | N/A | Flight size, MSS, sender's reaction time, loss, RTT, CWND, RWND | × | ✓ |
| [84] | SpiderMon | Diagnoses latency with small memory footprint | Identifies flows affecting latency | Limits rate | Queue latency | ✓ | ✓ |
| [82] | Blink | Detects failures based on the predictable behavior of TCP | Identifies retransmitters | Reroutes traffic | RTO-induced retransmissions | × | ✓ |
| [81] | N/A | Improves monitoring scalability by measuring subset of flows | Identifies top-k influential flows | N/A | Retransmissions, latency, packet loss, out-of-order | × | ✓ |

Queue/other measurement

| Ref | Name | Core idea | Passive measurement | Analysis | Measured information | Data structure | Platform (HW or SW) |
|---|---|---|---|---|---|---|---|
| [80] | ConQuest | Identifies flows contributing heavily to the queue | ✓ | Data plane | Queue occupancy | Count-min sketch | ✓ |
| [87] | N/A | Measures the RTT of TCP traffic in ISP networks | ✓ | Data plane | RTT from an ISP vantage point | Hash table | ✓ |
| [75] | BurstRadar | Monitors microbursts and captures telemetry for the contributing packets | × | Control plane | Queue occupancy | Ring buffer | ✓ |
| [83] | P4Entropy | Estimates network traffic entropy | × | Data plane | Shannon entropy | Count-min sketch | ✓ |
B.3. Measurements Schemes Comparison, Discussions, and Limitations
Table IX compares the aforementioned measurements schemes.
Generic Query-based Monitoring.
Some schemes (e.g., Sonata, FlowRadar, UnivMon) performed approximations of the metrics by using probabilistic data structures (e.g., sketch, bloom filter, etc.), sampling methods, and top-k counting. In addition, some focused on a subset of traffic by leveraging event matching techniques. Such techniques are primarily used to achieve high resource efficiency (i.e., low memory footprint), but cannot achieve full accuracy. On the other hand, systems like OmniMon carefully coordinate the collaboration among different types of entities in the network. Such coordination results in efficient resource utilization and full accuracy. OmniMon follows a split-merge strategy where the split operation decomposes telemetry operations into partial operations and schedules them among the entities (switches, end-hosts, and controller), and the merge operation coordinates the collaboration among these entities. The idea is to leverage the strength of the data plane in the switches and end-hosts (i.e., per-flow measurements with high accuracy) and the control plane (i.e., network-wide collaboration). OmniMon also ensures consistency through a synchronization mechanism and accountability through a system of linear equations that considers packet loss and other data center characteristics. Results show that OmniMon reduces the memory by 33%-96% and the number of actions by 66%-90% when compared to state-of-the-art solutions.
Another criterion that differentiates the measurements schemes is whether computations are performed outside the data plane. Most of the systems use the control plane or external servers to perform complex computations since the data plane has limited support for complex arithmetic functions. While some systems (e.g., BeauCoup) do not require an external computation device, they often support fewer measurement operations.
The selection of the data structure to be used in the data plane strongly affects the measurement features supported by a certain scheme.
For instance, the goal of BeauCoup is to enable simultaneous distinct counting queries; for such a task, the authors based their design on the coupon-collection problem [232], which computes the expected number of random draws from n coupons such that all coupons are drawn at least once. For example, if the threshold of distinct destination IPs for detecting superspreaders is 130, instead of recording all distinct destination IPs, 32 coupons are defined. Consequently, the destination IPs of incoming packets are mapped to those 32 coupons. While this data structure uses less memory than the other state-of-the-art measurement sketches, it is limited to specific objectives (distinct counting). Other works (e.g., UnivMon) focused on generalizing the measurement scenarios, and hence, used universal sketches as data structures.
Qiu et al. [88] focused on capturing traffic trends that are the most relevant in network monitoring and attack detection. The notion of time is not supported by native streaming algorithms. For instance, count-min sketch, which is a data structure that uses a constant amount of memory to record data, is oblivious to the passage of time. Existing solutions that consider recency are easily implemented in software, but not on programmable ASICs. For example, resetting a sketch after a timer expires requires iterating over the elements in the sketch, an operation that cannot be implemented in the data plane due to the lack of loops. Likewise, creating multiple sketches requires additional stages, which are limited in the hardware. Time-adaptive sketches utilize the idea of Dolby noise reduction [233, 234]; a pre-emphasis function inflates the update when a new key is inserted and a de-emphasis function restores the original value. This mechanism ages the old events over time, and therefore, improves the accuracy of recent events.
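As a worked check of the BeauCoup example given earlier in this discussion (a threshold of 130 mapped to 32 coupons), the following minimal sketch evaluates the textbook coupon-collector expectation n * H_n. It assumes uniformly likely coupons and that all of them must be collected; the function name `expected_draws` is hypothetical, and BeauCoup's actual coupon configuration may use different probabilities.

```python
from fractions import Fraction


def expected_draws(num_coupons: int) -> float:
    """Expected number of uniform random draws needed to see every one of
    `num_coupons` coupons at least once: n * H_n (textbook result)."""
    harmonic = sum(Fraction(1, k) for k in range(1, num_coupons + 1))
    return float(num_coupons * harmonic)


if __name__ == "__main__":
    # With 32 coupons the expectation lands close to the threshold of 130
    # used in the example above.
    print(round(expected_draws(32), 2))
```

Under these assumptions, collecting all 32 coupons takes about 130 draws on average, which is why a 32-coupon collector can stand in for a distinct-count threshold of 130 without storing the distinct values themselves.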
The authors implemented the pre-emphasis function in the data plane using simple bit shifts, and the de-emphasis function in the control plane.
Finally, some systems considered network-wide monitoring, while others restricted their capabilities to local per-switch measurements. Network-wide measurement is essential and can significantly improve the visibility of traffic, as discussed in Section XIII-D.
Performance Diagnosis Systems.
Some performance diagnosis schemes restricted their scope to troubleshooting TCP. For instance, Dapper infers sending rate, Maximum Segment Size (MSS), sender's reaction time (time between received ACK and new transmission), loss rate, latency, congestion window (CWND), receiver window (RWND), and delayed ACKs. Based on the inferred variables, Dapper can identify the root cause of the bottleneck. Similarly, the authors in [81] monitored conditions such as retransmissions, packet loss, round-trip time, and out-of-order packets to identify the top-k problematic flows. Furthermore, Blink detects failures based on the predictable behavior of TCP, which, in the presence of a failure, retransmits packets at epochs exponentially spaced in time. Other schemes (e.g., SpiderMon) identify failures based on the increase of latency.
Some schemes use reactive processing to mitigate the network performance issue. For instance, Blink promptly reroutes traffic whenever failure signals are generated by the data plane, while SpiderMon limits the sending rate of the root cause hosts.
Finally, it is worth mentioning that some systems (e.g., Blink, Dapper) considered traces from real-world captures, such as the ones provided by CAIDA, for evaluation. Using real-world traces gives more credibility to the proposed solution.
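The "exponentially spaced epochs" that Blink relies on follow from TCP doubling its retransmission timeout (RTO) after each unanswered retransmission. The sketch below only illustrates that spacing; the function name and the 200 ms initial RTO are illustrative assumptions, not values taken from Blink.

```python
def retransmission_epochs_ms(initial_rto_ms: int, retries: int) -> list:
    """Times (ms since the original transmission) at which a TCP sender
    retransmits the same segment when the RTO doubles after every timeout."""
    epochs = []
    elapsed = 0
    rto = initial_rto_ms
    for _ in range(retries):
        elapsed += rto          # the timer fires after the current RTO
        epochs.append(elapsed)
        rto *= 2                # exponential backoff
    return epochs


if __name__ == "__main__":
    # Assumed initial RTO of 200 ms: gaps of 200, 400, 800, 1600 ms.
    print(retransmission_epochs_ms(200, 4))
```

Because many flows crossing a failed path time out and retransmit in this predictable pattern, a burst of such retransmissions observed in the data plane is a usable failure signal.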
Queue and other Metrics Measurement.
Understanding the occupancy of the queue is useful for use cases such as mitigating congestion-based attacks, avoiding conflicting workloads, implementing new AQMs, optimizing switch configurations, debugging switch implementation, off-path monitoring of queues in legacy devices, etc. ConQuest performs queue measurements and identifies flows depending on the purpose (e.g., detecting bursty connections). It maintains compact snapshots of the queue, updated on each incoming packet. The snapshots are then aggregated in a round-robin fashion to approximate the queue occupancy. Afterwards, it cleans the previous snapshots to reuse them for further packets. Similarly, BurstRadar detects microbursts, which can increase latency, jitter, and packet loss, especially when links' speeds are high and switch buffers are small. It is almost impossible to detect microbursts in legacy switches, which use sampling and polling-based techniques. BurstRadar detects microbursts and captures a snapshot of the telemetry information of all the involved packets. Afterwards, an analysis is conducted on the snapshot to identify the microburst-contributing flow and the burst characteristics. Note that BurstRadar does not support measuring the queues of legacy devices passively, but ConQuest does. In addition, BurstRadar performs the analysis on the control plane, while ConQuest uses the data plane for analysis.
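Table IX lists the count-min sketch as the data structure behind ConQuest and P4Entropy. The following is a minimal software sketch of that structure, not the code of either system; the class name, the SHA-256-based row hashing, and the default width and depth are illustrative assumptions (a switch would use hardware hash units and register arrays instead).

```python
import hashlib


class CountMinSketch:
    """Fixed-memory counter array: `depth` rows of `width` counters.
    Estimates never undercount; hash collisions can only inflate them."""

    def __init__(self, width: int = 1024, depth: int = 3):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One independent-looking hash per row, derived by salting with the row id.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key: str, amount: int = 1) -> None:
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += amount

    def estimate(self, key: str) -> int:
        # The minimum over rows is the least collision-inflated counter.
        return min(self.rows[row][self._index(row, key)]
                   for row in range(self.depth))


if __name__ == "__main__":
    cms = CountMinSketch()
    for _ in range(3):
        cms.add("flowA", 1500)   # bytes of a hypothetical flow in the queue
    cms.add("flowB", 64)
    print(cms.estimate("flowA"), cms.estimate("flowB"))
```

Memory use depends only on `width` and `depth`, not on the number of flows, which is what makes the structure attractive under the register and stage limits of a switch pipeline.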
B.4. In-Network versus Legacy Measurements
Fig. 11 compares the legacy measurements to those conducted on programmable switches. There are two main classes of legacy measurement techniques. First, there are techniques that rely on polling and sampling (e.g., NetFlow). The differences between in-network measurements and polling/sampling-based schemes are closely related to the differences between legacy measurements and INT (see Table V). For instance, the granularity of the measurements conducted in the data plane is much higher than that of the measurements collected with traditional techniques (e.g., NetFlow). Further, it is not possible to conduct event-based monitoring in legacy approaches, whereas with in-network measurements, the programmer has the flexibility of customizing the monitoring based on conditions and thresholds. Second, there are techniques that rely on sketching or streaming algorithms to estimate the metric of interest. Such methods are tightly coupled with the metric, which forces hardware vendors to invest time and effort in building customized algorithms and data structures that might not be used by various customers. Moreover, with the constraints of routers and switches, it is not possible to implement a variety of monitoring tasks while still supporting the standard routing/switching functionalities. Therefore, such approaches are not scalable in the long run.
With programmable switches, it is possible to customize the monitoring tasks by implementing customized sketching/streaming algorithms as P4 programs. This advantage improves scalability as the operator can always modify the algorithms whenever needed.

Fig. 11. (a) Traditional measurements with sampling/polling. The switch uses sampling and polling protocols (e.g., NetFlow, SNMP) to generate fixed network flow records. Instead of collecting every packet, sampling collects only one out of every N packets. Records are then exported to an external server for further analysis. (b) Measurements with programmable switches (e.g., UnivMon [70]). The switch runs a universal algorithm over a universal data structure (e.g., universal sketch). The control plane then estimates a wide range of metrics for various applications. Note that this is not the only design possible for measurement tasks with programmable switches. The programmer has the flexibility to use customized algorithms that run at line rate in the data plane. Such algorithms can leverage various data structures in the P4 program (e.g., sketch, bloom filter) to store flow statistics. The switch then pushes statistics reports to the control plane for further analysis and reactive processing.

C. Active Queue Management (AQM)
C.1. Background
A fundamental component in network devices is the queue, which temporarily buffers packets. As data traffic is inherently bursty, routers have been provisioned with large queues to absorb this burstiness and to maintain high link utilization. The majority of the delay encountered in a communication session is a result of large backlogs formed in queues. Legacy devices are limited in the visibility of the queue as they provide little or no insight about which flows are occupying or sharing the queue [80]. Consequently, researchers have been investigating queue management algorithms to shorten the delay and mitigate packet losses, while providing fairness among flows. AQM is a set of algorithms designed to shorten the queueing delay by preventing buffers on devices from becoming full. The undesirable latency that results from a device buffering too much data is known as "Bufferbloat". Bufferbloat not only increases the end-to-end delay, but also decreases the throughput and increases the jitter of a communication session. Modern AQMs help in mitigating the Bufferbloat problem [235–238]. Unfortunately, modern AQMs are typically not available in state-of-the-art network equipment; for instance, the Controlled Delay (CoDel) AQM, which was proposed in 2013 and was proven in the literature to be effective in mitigating Bufferbloat [239], is still not available in most network equipment. With programmable switches, it is now possible to implement AQMs as P4 programs, which not only accelerates support for new AQMs, but also provides means to customize their parameters programmatically in response to network traffic. Moreover, programmable switches encourage innovation on newer AQMs that can be easily implemented and rapidly tested.
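To make the idea of an AQM expressed as per-packet logic concrete, the following is a deliberately simplified sketch of CoDel's drop decision, loosely following the pseudocode of RFC 8289. It is an illustration under stated assumptions, not the P4 implementation surveyed in this section: it is written in Python rather than P4, uses the RFC's default target (5 ms) and interval (100 ms), and omits details such as the minimum-backlog check and the drop-count memory across dropping episodes. The class and method names are hypothetical.

```python
import math

TARGET_MS = 5.0      # acceptable standing queue delay (RFC 8289 default)
INTERVAL_MS = 100.0  # window over which delay must stay above target


class SimplifiedCoDel:
    """Per-packet drop decision based on sojourn (queueing) time."""

    def __init__(self):
        self.first_above_time = None  # when delay may first be judged persistent
        self.dropping = False
        self.drop_next = 0.0
        self.count = 0                # drops in the current dropping episode

    def _ok_to_drop(self, now_ms, sojourn_ms):
        if sojourn_ms < TARGET_MS:
            self.first_above_time = None
            return False
        if self.first_above_time is None:
            self.first_above_time = now_ms + INTERVAL_MS
            return False
        return now_ms >= self.first_above_time

    def on_dequeue(self, now_ms, sojourn_ms):
        """Return True if the packet being dequeued should be dropped."""
        ok = self._ok_to_drop(now_ms, sojourn_ms)
        if self.dropping:
            if not ok:
                self.dropping = False          # delay recovered
                return False
            if now_ms >= self.drop_next:
                self.count += 1
                # Control law: drops get closer together as count grows.
                self.drop_next += INTERVAL_MS / math.sqrt(self.count)
                return True
            return False
        if ok:
            self.dropping = True
            self.count = 1
            self.drop_next = now_ms + INTERVAL_MS / math.sqrt(self.count)
            return True
        return False


if __name__ == "__main__":
    codel = SimplifiedCoDel()
    # One packet every 10 ms, each having waited 20 ms in the queue.
    drops = [t for t in range(0, 301, 10) if codel.on_dequeue(t, 20.0)]
    print(drops)
```

The logic uses only comparisons, a counter, and simple arithmetic on timestamps, which is the kind of state and computation a programmable data plane can hold in registers; the square root would typically be approximated on a switch.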
C.2. Literature Review
Kundel et al. [91] implemented the CoDel queueing discipline on a programmable switch. CoDel eliminates Bufferbloat, even in the presence of large buffers [240]. Sharma et al. [92] proposed Approximate Fair Queueing (AFQ), a mechanism built on top of programmable switches that approximates fair queueing at line rate. Fair Queueing (FQ) aims at fairly dividing the bandwidth allocation among active flows. Laki et al. [93] described an AQM evaluation testbed with P4 in a demo paper. The authors tested the framework with two AQMs: Proportional Integral Controller Enhanced (PIE) and RED. Mushtaq et al. [241] approximated Shortest Remaining Processing Time (SRPT). Papagianni et al. [94] implemented the Proportional Integral PI2 AQM on a programmable switch. PI2 is an extension of the PIE AQM to support coexistence between classic and scalable congestion controls in the public Internet. Kumazoe et al. [95] implemented the MTQ/QTL scheme in P4.
C.3. AQM Schemes Comparison, Discussions, and Limitations
Table X compares the aforementioned AQM schemes. Some schemes require tuning a number of parameters and thresholds so that they operate well in certain network conditions. It is worth mentioning that a scheme becomes hard to manage and less autonomous when the number of parameters and thresholds is high.
Some schemes are simple to implement in the data plane. CoDel's algorithm can be easily expressed in the data plane as it consists of comparisons, counting, basic arithmetic, and dropping packets. Similarly, PI2 is simple to implement as it is mostly based on basic bit manipulations. FQ algorithms, on the other hand, are difficult to implement on hardware as they require complex flow classification, per-packet scheduling, and buffer allocation. Such requirements make FQ algorithms expensive to implement on high-speed devices. AFQ aims at approximating fair queueing by using programmable switches' features such as mutating switch state, performing basic calculations, and selecting the egress queue of a packet. AFQ's operations can be summarized as follows: 1) per-flow state, which includes the number and timing information of the previous packet pertaining to that flow, is approximated; 2) the position of each packet in the output schedule is determined; 3) the egress queue to use is selected; and 4) the packet is dequeued based on the approximate sorted order. Note that AFQ uses a probabilistic data structure (count-min sketch) since it only approximates the states, and uses multiple queues in its implementation.

TABLE X. AQM SCHEMES COMPARISON

| Scheme | Name | Idea | Params & thresholds | Multiple queues | Data structure | Implementation |
|---|---|---|---|---|---|---|
| [91] | P4-CoDel | Implementation of CoDel on P4 | 2 | × | Registers | BMv2 |
| [92] | AFQ | Approximate fair queueing in the switch | 4 | ✓ | Count-min sketch | Cavium OCTEON |
| [93] | N/A | Evaluation testbed for PIE and RED | RED 1, PIE 5 | × | Registers | BMv2 |
| [94] | PI2 for P4 | Implementation of PI2 on P4 | 3 | × | Registers | BMv2 |
| [95] | MTQ/QTL | Implementation of MTQ/QTL on P4 | 3 | × | Registers | BMv2 |

C.4. AQMs on Programmable Switches and Fixed-function Devices
Inventing novel AQMs that control queueing delay, mitigate Bufferbloat, and achieve fairness under different network conditions (e.g., short/long RTTs, lossy networks, WANs) is an active research area. Typically, new AQMs are implemented and tested in software (e.g., as a Linux queueing discipline (qdisc) used with traffic control (tc)), which is limiting when the objective is to deploy the AQMs on production networks. With programmable switches, AQMs are implemented in P4 programs, which fosters innovation and enhances testing with production networks. Additionally, operators can create their own customized AQMs that perform efficiently with their typical network traffic. Historically, deploying AQMs on network devices has been a lengthy and costly process; once an effective AQM is published and thoroughly tested, equipment vendors start investigating whether it is feasible to implement it on future devices. Such a process might take years to finish, and by then, new network conditions evolve, requiring new AQMs. With programmable switches, this process is cost-efficient and relatively fast (it can be completed in weeks). Table XI compares the features of AQMs on programmable switches versus fixed-function devices.

TABLE XI. AQMS ON PROGRAMMABLE AND FIXED-FUNCTION SWITCHES

| Feature | Programmable switches | Fixed-function devices |
|---|---|---|
| Innovation | Higher; new AQMs are expressed in P4 programs | Lower; only developed by equipment vendors |
| Exclusivity | Higher; operators can implement their own custom AQMs without disclosing technical information | Lower; most supported AQMs are standards |
| Readiness | Faster (weeks to months); once an AQM is expressed in P4, it can be immediately available | Slower (years) |
| Cost | Lower | Higher |
| Tweakable | Higher; even standard AQMs can be customized and tweaked based on network traffic | Lower; only through parameters |

D. Quality of Service and Traffic Management
D.1. Background
Meeting diverse Quality of Service (QoS) requirements is a fundamental challenge in today's networks. Traffic Management (TM) provides access control that guarantees that the traffic admitted to the network conforms to the defined QoS specifications. TM often regulates the rate of a flow by applying traffic policing. The new generation of programmable switches facilitates traffic policing and differentiation by allowing network operators to express their logic in a programming language (P4). This section explores the works on programmable switches that involve QoS and TM.
D.2. Literature Review
Bhat et al. [96] described a system where programmable switches route traffic intelligently by inspecting application headers (layer-5) to improve users' QoE. Lee et al. [97] implemented a traffic meter based on Multi-Color Markers (MCM) on programmable switches to support multi-tenancy environments. Tokmakov et al. [98] proposed RL-SP-DRR, a traffic management system that combines Rate-limited Strict Priority (RL-SP) and Deficit Round-Robin (DRR) to achieve low latency and fair scheduling while improving link utilisation, prioritization, and scalability. Chen et al. [99] proposed a bandwidth manager for end-to-end QoS provisioning using programmable switches. The system classifies packets into different categories based on their QoS demands and usages, and uses a two-level queue when prioritizing.

TABLE XII. QOS/TM SCHEMES COMPARISON

| Ref | Idea | Input | Multiple queues | Platform HW | Platform SW |
|---|---|---|---|---|---|
| [96] | Application-layer headers inspection | Layer-5 headers | × | | ✓ |
| [97] | MCM-based traffic meter | Traffic rate, VN ID | × | | ✓ |
| [98] | Traffic mgmt. (RL-SP and DRR) | Traffic rate | ✓ | | ✓ |
| [99] | BW manager for e2e QoS | Flow ID, min/max rate | ✓ | ✓ | |

D.3. QoS/TM Schemes Comparison, Discussions, and Limitations
Table XII compares the QoS/TM schemes. The main idea in [96] is to translate application-layer header information into link-layer headers (Q-in-Q 802.1ad) for the core network in order to perform QoS routing and provisioning. The authors adopted Adaptive Bit Rate (ABR) video streaming as a use case to showcase the QoS improvements and the flexibility of traffic management. Such an approach is interesting since switches are inspecting higher layers in the protocol stack. This capability is not available in non-programmable devices. Note, however, that the solution was only implemented on a software switch (BMv2). When it comes to hardware switches, the solution might face challenges to run at line rate when processing L5 headers. Therefore, the authors left the hardware implementation as future work.
The other approaches considered traffic rates as inputs rather than inspecting application-layer headers. The work in [97] focused on isolating virtual networks (VNs). A VN has to have its own dedicated bandwidth (i.e., other networks' traffic should not impact the bandwidth) and should be able to differentiate priorities in order to provide QoS for its flows. While the solution was not implemented on hardware (the authors left the hardware implementation as future work), it is worth noting that this system relies on metering primitives which are available in today's hardware targets (e.g., meters in Tofino). Similarly, [98] was only implemented on a software switch (BMv2) and was evaluated by comparison against standard priority-based and best-effort scheduling. This system uses multiple priority queues, a feature supported in hardware targets. Therefore, the system could be implemented on hardware switches. The approach in [99] aims at limiting the maximum allowed rate and at maximizing bandwidth utilization. This is the only work that was implemented on a hardware switch (Tofino), and its design was compared against approaches based on OpenFlow.
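The metering primitive mentioned above can be illustrated with one standard multi-color marker, the single-rate three-color marker (srTCM) of RFC 2697 in color-blind mode. Whether [97] uses this exact variant is not stated here, so the following should be read as a generic sketch; the class name, the batched token refill, and the parameter values in the usage example are assumptions for illustration.

```python
class SingleRateThreeColorMarker:
    """Color-blind srTCM: a committed bucket (size CBS) and an excess
    bucket (size EBS), both fed at the committed information rate (CIR)."""

    def __init__(self, cir_bytes_per_s: float, cbs: float, ebs: float):
        self.cir = cir_bytes_per_s
        self.cbs = cbs
        self.ebs = ebs
        self.tc = cbs          # committed tokens, buckets start full
        self.te = ebs          # excess tokens
        self.last = 0.0        # time of the previous update (seconds)

    def _refill(self, now: float) -> None:
        tokens = (now - self.last) * self.cir
        self.last = now
        to_committed = min(tokens, self.cbs - self.tc)
        self.tc += to_committed
        # Tokens that do not fit in the committed bucket spill into the excess one.
        self.te = min(self.ebs, self.te + (tokens - to_committed))

    def mark(self, now: float, size: int) -> str:
        self._refill(now)
        if self.tc >= size:
            self.tc -= size
            return "green"
        if self.te >= size:
            self.te -= size
            return "yellow"
        return "red"


if __name__ == "__main__":
    meter = SingleRateThreeColorMarker(cir_bytes_per_s=1000, cbs=1500, ebs=1500)
    print([meter.mark(0.0, 1500) for _ in range(3)], meter.mark(1.5, 1500))
```

A policer can then forward green packets, deprioritize yellow ones, and drop red ones; giving each virtual network its own meter instance is one way to obtain the per-VN bandwidth isolation described above.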
D.4. Comparison of QoS/TM between Legacy and Programmable Networks
The ability to perform QoS-based traffic management in legacy networks is restricted to algorithms that consider standard header fields (e.g., differentiated services [242]). On the other hand, programmable switches can parse, modify, and process customized protocols. Hence, operators now have the ability to perform TM by inspecting custom header fields. Moreover, it is possible to extract, with high granularity and at line rate, metadata pertaining to the state of the switch (e.g., queue occupancy, packet sojourn time, etc.). Such information can significantly help switches make better decisions while performing traffic management.
E. Multicast
E.1. Background
Multicast routing enables a source node to send a copy of a packet to a group of nodes. Multicast uses in-network traffic replication to ensure that at most a single copy of a packet traverses each link of the multicast tree. Perhaps the most widely deployed multicast routing protocol in traditional networks is the Protocol-Independent Multicast (PIM) protocol [243]. PIM and other multicast routing protocols require a signaling protocol such as the Internet Group Management Protocol (IGMP) [244] to create, change, and tear down the multicast tree. Traditional multicast presents some challenges. For example, it is not suitable for environments where multicast group members constantly move (e.g., virtual machine migration and allocation). In such cases, the multicast tree must be updated dynamically, which may require substantial time and overhead. Also, some routers support a limited number of group-table entries, which does not scale in environments such as datacenters. Additionally, the signaling protocol and multicast algorithm are hard coded in the router, which reduces flexibility in building and managing the tree. Finally, it is not possible to implement multicast based on non-standard header fields.
E.2. Literature Review
Shahbaz et al. [100] presented ELMO, a multicast scheme based on programmable P4 switches for datacenter applications. ELMO encodes the multicast tree in the packet header, as opposed to maintaining group-table entries inside routers. Kadosh et al. [101] implemented ELMO using a hybrid data plane with programmable and non-programmable elements. ELMO is intended for multi-tenant datacenter applications requiring high scalability. Braun et al. [102] presented an implementation of the Bit Index Explicit Replication (BIER) architecture [245] with extensions for traffic engineering. Similar to ELMO, BIER removes the per-multicast group state information from switches by adding a BIER header, which is used to forward packets. BIER does not require a signaling protocol for building, managing, and tearing down trees.
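How a header-carried bit string can replace per-group state is illustrated below with a sketch of the BIER forwarding loop, following the general procedure of RFC 8279 rather than the P4 implementation in [102]. The function name, the dictionary-based forwarding table, and the neighbor names are hypothetical; local delivery and traffic-engineering extensions are left out.

```python
from typing import Dict, List, Tuple


def bier_forward(bitstring: int,
                 bift: Dict[int, Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return one (neighbor, bitstring) pair per packet copy to send.

    `bitstring` has bit i-1 set when the egress router with id i must
    receive the packet. `bift` maps a router id to (next-hop neighbor,
    forwarding bit mask of all ids reachable through that neighbor)."""
    copies = []
    remaining = bitstring
    while remaining:
        lowest = remaining & -remaining       # isolate the lowest set bit
        router_id = lowest.bit_length()       # 1-based position of that bit
        entry = bift.get(router_id)
        if entry is None:                     # unknown destination: skip it
            remaining &= ~lowest
            continue
        neighbor, mask = entry
        copies.append((neighbor, remaining & mask))
        # Destinations served by this copy must not be served again.
        remaining &= ~(mask | lowest)
    return copies


if __name__ == "__main__":
    # Ids 1 and 2 are reached through neighbor "B", id 3 through "C".
    table = {1: ("B", 0b011), 2: ("B", 0b011), 3: ("C", 0b100)}
    print(bier_forward(0b111, table))
```

The table is indexed by egress router rather than by multicast group, so its size does not grow with the number of groups; the price is the per-packet loop over set bits, which is the processing cost discussed in the comparison that follows.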
E.3. Multicast Schemes Comparison, Discussions, and Limitations
Table XIII compares the aforementioned multicast schemes. Both ELMO and BIER are source-routed multicast schemes. In BIER, group members are encoded as bit strings, which are then inspected by switches to identify the output ports. Such a scheme requires heavy processing on the switch, hampering the execution at line rate. Consequently, the authors only implemented BIER on a software switch (BMv2). ELMO, on the other hand, has no restrictions on the group and network sizes, and was implemented on a hardware switch, running at line rate.

TABLE XIII. MULTICAST SCHEMES COMPARISON (SOURCE: [100])

| Scheme | Name | Group size limit | Network size limit | Heavy processing | Platform HW | Platform SW |
|---|---|---|---|---|---|---|
| [100] | ELMO | None | None | × | ✓ | |
| [102] | BIER | 2.6K | 2.6K | ✓ | | ✓ |

E.4. Comparison of P4-based and Traditional Multicast
Table XIV compares P4-based multicast and traditional multicast. The main advantages of implementing multicast routing with programmable P4 switches are: i) the group membership is encoded in the packet itself, which permits the creation of arbitrary multicast trees based on the application. For example, a multicast tree for distributing software updates may prioritize bandwidth over latency, while one for media traffic may prioritize latency; ii) switches do not need to store per-group state information, although tables can be customized and used in conjunction with the tree encoded in the packet header; iii) groups can be reconfigured easily by changing the information in the header of the packet; and iv) the elimination of the signaling protocol to build, manage, and tear down the tree results in considerable simplification and flexibility for the operator.
F. Summary and Lessons Learned
Performing network-wide monitoring and measurements is of utmost importance for network operators to diagnose performance degradation. A wide range of research efforts harness streaming methods that utilize various data structures (e.g., sketches, bloom filters, etc.) and approximation algorithms. Further, the majority of measurement works provide a query-based language to specify the monitoring tasks. Future measurement works should consider generalizing the monitoring jobs, reducing storage requirements, managing the accuracy-memory trade-off, extending monitoring primitives, minimizing controller intervention, and optimizing the placement of switches in a legacy network. Another line of research aims at combating congestion and reducing packet losses by analyzing measurements collected in the data plane and by applying queue management policies. Congestion control is enhanced by adopting techniques such as throttling senders, cutting payloads, enforcing sending rates by leveraging telemetry data, and separating traffic into different queues. Furthermore, a handful of works are investigating methods to improve QoS by applying traffic policing and management. Techniques adopted include application-layer inspection, traffic metering, traffic separation, and bandwidth management. Finally, the scalability concerns of multicast in legacy networks are being mitigated with programmable switches. Recent efforts proposed encoding multicast trees into the headers of packets, and using programmable switches to parse these headers and to determine the multicast groups. Future endeavours should investigate incremental deployment (i.e., interworking with legacy multicast schemes) and reliability enhancement (e.g., by adopting layering protocols such as Pragmatic General Multicast (PGM) and Scalable Reliable Multicast (SRM)).

TABLE XIV. COMPARISON BETWEEN P4-BASED AND TRADITIONAL MULTICAST

| Feature | P4-based multicast | Traditional multicast |
|---|---|---|
| Scalability | High; no state information required in switches | Low; state information required in switches per-group |
| Tree management | Flexible; custom multicast algorithm and features can be implemented | Inflexible; signaling protocol required and hard coded in the switch |
| Packet overhead | High; multicast tree is encoded in packet header | No packet overhead |
| Dynamic tree updates | Easy; packet header carries update information | Complex; topology changes may trigger time-consuming tree changes |
| IP address constraint | Flexible; switch can multicast packets independently of the type of IP address | Fixed; switch is hard-coded to only multicast packets with destination IP address in the range 224.0.0.0 - 239.255.255.255 |

VIII. MIDDLEBOX FUNCTIONS
RFC 3234 [246] defines middlebox as a device that performsfunctions other than the standard functions of an IP routerbetween a source and a destination host. In legacy devices,middlebox functions are designed and implemented by man-ufacturers. Hence, they are limited in the functionalities theyprovide, and typically include standard well-known functions(e.g., NAT, protocol converters (6to4/4to6), etc.). To overcomethis limitation, the trend moved towards implementing mid-dleboxes in x86-based servers and in data centers as NetworkFunction Virtualization (NFVs). While this shift acceleratedinnovation and introduced a wide range of new applications,there was some performance implications resulting from op-erating systems’ scheduling delays, interrupt processing la-tency, pre-emptions, and other low-level OS functions. Sinceprogrammable switches offer the flexibility of inspecting andmodifying packets’ headers based on custom logic, they areexcellent candidates for enabling middlebox functions, whileoperating at line rate without performance implications.
A. Load Balancing
A.1. Background
TABLE XV
LOAD BALANCING SCHEMES COMPARISON

| Scheme | Name | Stateful | Centralized | Active probing | MP-TCP support | Failure handling | Platform (HW, SW) |
|---|---|---|---|---|---|---|---|
| [103] | HULA | ✓ | × | ✓ | × | ✓ | ✓ |
| [104] | SilkRoad | ✓ | × | × | × | ✓ | ✓ |
| [105] | MP-HULA | ✓ | × | ✓ | ✓ | ✓ | ✓ |
| [106] | Beamer | × | ✓ | × | ✓ | ✓ | ✓ ✓ |
| [108] | Dash | ✓ | × | ✓ | ✓ | × | ✓ |
| [109] | Contra | ✓ | × | ✓ | × | ✓ | ✓ |

Fig. 12. (a) Traditional software-based load balancing. (b) Load balancing system implemented by a programmable switch.

A cloud data center, such as a Google or Facebook data center, provides many applications concurrently, such as email and video applications. To support requests from external clients, each application is associated with a publicly visible IP address to which clients send their requests and from which they receive responses. This IP address is referred to as the Virtual IP (VIP) address. The external requests are then directed to a software load balancer whose task is to distribute requests to the servers, balancing the load across them. The load balancer is also referred to as a layer-4 load balancer because it makes decisions based on the 5-tuple: source IP address and port, destination IP address and port, and transport-layer protocol. This state information is stored in a connection table containing the 5-tuple and the Direct IP (DIP) address of the server serving that connection. State information is needed to avoid disruptions caused by changes in the DIP pool (e.g., server failures, addition of new servers). The load balancer also provides a translation functionality, translating the VIP to the internal DIP, and then translating back for packets traveling in the reverse direction back to the clients. The traditional software-based load balancer is illustrated in Fig. 12(a).
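The role of the connection table can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed systems: the class name `Layer4LoadBalancer`, its methods, the hash-based DIP selection, and the example addresses are all assumptions made for illustration. The sketch only shows why per-connection state keeps an established connection on the same DIP when the pool changes.

```python
import hashlib
from typing import Dict, List, Tuple

# (source IP, source port, destination IP, destination port, protocol)
FiveTuple = Tuple[str, int, str, int, str]


class Layer4LoadBalancer:
    """Toy layer-4 load balancer; class and method names are hypothetical."""

    def __init__(self, vip: str, dips: List[str]) -> None:
        self.vip = vip
        self.dips = list(dips)                      # current DIP pool
        self.conn_table: Dict[FiveTuple, str] = {}  # 5-tuple -> DIP

    def _pick_dip(self, flow: FiveTuple) -> str:
        # Hash the 5-tuple onto the current pool (used for new connections only).
        digest = hashlib.sha256(repr(flow).encode()).digest()
        return self.dips[int.from_bytes(digest[:4], "big") % len(self.dips)]

    def forward(self, flow: FiveTuple) -> str:
        """Return the DIP that serves a packet sent to the VIP."""
        if flow[2] != self.vip:
            raise ValueError("packet is not addressed to the VIP")
        if flow not in self.conn_table:             # first packet of a connection
            self.conn_table[flow] = self._pick_dip(flow)
        return self.conn_table[flow]                # later packets stay pinned

    def add_server(self, dip: str) -> None:
        self.dips.append(dip)                       # existing entries are untouched

    def close(self, flow: FiveTuple) -> None:
        self.conn_table.pop(flow, None)             # drop the entry when a flow ends


lb = Layer4LoadBalancer("203.0.113.10", ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
flows = [("198.51.100.7", 40000 + i, "203.0.113.10", 443, "tcp") for i in range(100)]
before = {flow: lb.forward(flow) for flow in flows}
lb.add_server("10.0.0.4")                           # the DIP pool changes
after = {flow: lb.forward(flow) for flow in flows}
print(before == after)                              # compares the two mappings
```

Without the table, hashing directly onto the enlarged pool would remap many of the established connections, which is the disruption the stored state is meant to avoid.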
A.2. Literature Review
Recent works presented schemes where load balancing functionality is implemented in programmable P4 switches. The main idea consists of storing state information directly in the switch's data plane. The connection table is managed by the software load balancer, which can be implemented either in the switch's control plane or as an external device, as shown in Fig. 12(b). The software load balancer adds entries to the switch's table as new connections arrive, and removes old entries as flows end.

Katta et al. [103] proposed HULA, a load balancing scheme where switches store the best path to the destination via their neighboring switches. This strategy avoids storing the congestion status of all paths in leaf switches. Benet et al. [105] extended this approach to support multi-path transport protocols (e.g., Multi-path TCP (MPTCP)). Another significant work is SilkRoad [104], a load balancer that provides a direct path between application traffic and servers. Other mechanisms such as DistCache [107] enable load balancing for storage systems through a distributed caching method. DASH [108] proposed a data structure that leverages multiple pipeline stages and per-stage SALUs to dynamically balance data across multiple paths. The aforementioned approaches work under specific assumptions about the network topology, routing constraints, and performance. Contra [109] generalized load balancing to work with various topologies and under multiple constraints by using a performance-aware routing mechanism.

Beamer [106] takes a different, stateless approach to load balancing. Instead of storing the state in the switch, Beamer leverages the connection state already stored in backend servers to perform the forwarding.
A.3. Load Balancing Schemes Comparison, Discussions, and Limitations
Table XV compares the aforementioned load balancing schemes. The key idea of switch-based load balancing is to eliminate the need for a software layer while always mapping a connection to the same server, which ensures the Per-Connection Consistency (PCC) property. The majority of the proposed approaches are stateful, meaning that the switches store information locally to perform load balancing. The exception here is Beamer, which relies on the connection state already stored in backend servers to ensure that connections are never dropped under churn. Another significant shift from the previous solutions is the decentralized nature of Beamer. Some approaches (e.g., HULA, MP-HULA, Contra, Dash) use active probing to collect network performance metrics. Such metrics are then analyzed by the switches to make load balancing decisions.

In the presence of multi-path transport protocols (e.g., MPTCP), systems such as HULA provide sub-optimal forwarding decisions when several subflows pertaining to a single connection are pinned on the same bottleneck link. As a result, schemes such as MP-HULA, Contra, and Dash were proposed to support multi-path transport protocols. For instance, MP-HULA is a transport-layer multi-path aware load balancing scheme that uses the best-k paths to the destination through the neighbor switches.

Finally, it is important for a load balancing scheme to handle network failures. Most of the discussed systems considered mitigating failures, with the exception of DASH.
A.4. Comparison between Switch-based and Server-based Load Balancer
Table XVI shows a comparison between switch-based and server-based load balancers. There is a significant improvement in the throughput when load balancing is offloaded to the switches; for instance, SilkRoad [104], which is a load balancing scheme in the data plane, achieves about 10 billion packets per second (pps) while operating at line rate. Software load balancers, on the other hand, achieve a much lower throughput, about nine million pps per core [247]. Software-based load balancers also incur additional latency overhead when processing new requests. It is relatively easy to install additional software load balancers, which makes them more scalable than switch-based load balancing schemes. Moreover, software load balancers are more flexible in assigning flow identification policies. Finally, switch-based schemes are simpler as the whole logic is expressed in a program (customized parser and match-action tables), whereas server-based balancers might require additional coordination with routers (e.g., tunneling).

TABLE XVI
SWITCH-BASED AND SERVER-BASED LOAD BALANCERS

| Feature | Switch-based | Server-based |
|---|---|---|
| Throughput | Higher (e.g., SilkRoad with a 6.4 Tbps ASIC can achieve about 10 Gpps) | Lower (e.g., 9 Mpps per core [247]) |
| Latency | Lower; sub-microsecond from ingress to egress | Higher; additional latency when processing new requests* |
| Scalability | Lower; connection state is stored in limited SRAM | Higher |
| Policy flexibility | Limited; hash-based flow assignments may lead to imbalance | Flexible; policies can be written in software |
| System complexity | Simpler; it requires a customized parser and match-action tables | More complex; it requires coordination with routers, tunneling (e.g., GRE encapsulation) |

* After the first packet is processed, no additional latency is observed [247].
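As a rough sanity check of the throughput figures quoted above, the following Python sketch converts an ASIC bandwidth into a packet rate. The source does not state how the 10 Gpps figure was derived; the calculation below assumes minimum-size 64-byte Ethernet frames plus 20 bytes of preamble and inter-frame gap on the wire, and the function name is hypothetical.

```python
ASIC_BITS_PER_SECOND = 6.4e12      # 6.4 Tbps ASIC, as quoted for SilkRoad
SOFTWARE_PPS_PER_CORE = 9e6        # 9 Mpps per core, as quoted from [247]
MIN_FRAME_BYTES = 64               # assumption: minimum-size Ethernet frames
WIRE_OVERHEAD_BYTES = 20           # 8-byte preamble/SFD + 12-byte inter-frame gap


def max_packets_per_second(bits_per_second: float,
                           frame_bytes: int = MIN_FRAME_BYTES,
                           overhead_bytes: int = WIRE_OVERHEAD_BYTES) -> float:
    """Upper bound on the packet rate for a given line rate and frame size."""
    bits_per_packet = (frame_bytes + overhead_bytes) * 8
    return bits_per_second / bits_per_packet


switch_pps = max_packets_per_second(ASIC_BITS_PER_SECOND)
equivalent_cores = switch_pps / SOFTWARE_PPS_PER_CORE
print(f"{switch_pps / 1e9:.2f} Gpps, about {equivalent_cores:.0f} software cores")
```

Under these assumptions the bound is roughly 9.5 Gpps, which is in line with the "about 10 Gpps" entry, and matching it in software would take on the order of a thousand cores at the quoted per-core rate.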
B. Caching
B.1. Background
Modern applications (e.g., online banking, social networks) rely on key-value stores. For example, retrieving a single web page may require thousands of storage accesses. As the number of users increases to millions or billions, the need for higher throughput and lower latency grows. A challenge of key-value stores is that items are not accessed uniformly: popular items, referred to as "hot items", receive more queries than others. Furthermore, popular items may change rapidly due to popular posts, limited-time offers, and trending events [110]. Fig. 13(a) shows a typical skewed key-value store system, which presents load imbalance among servers storing key-value objects. Such systems may suffer from reduced throughput and long latencies. For example, server 2 may add substantial latency as a result of storing a hot item and being over-utilized, while server 1 is under-utilized.
B.2. Literature Review
Fig. 13. (a) Traditional software-based caching. (b) Switch-based caching.

Fig. 13(b) illustrates a system where a programmable switch receives a query before forwarding it to the server storing the key. The switch is used as an "in-network cache", where the hottest items are stored. When a read request for a hot key is received, the switch consults its local table and returns the value corresponding to that key. If the key is missed (i.e., the case for non-hot keys), then the switch forwards the request to the appropriate server. When a write request is received, the switch checks its local table and evicts the entry if the key is stored there. It then forwards the request to the appropriate backend server. A controller periodically collects statistics to update the cache with the current hot items.

A noteworthy approach is NetCache [110], an in-network architecture that uses programmable switches to store hot items and balance the load across storage nodes. Similarly, Liu et al. [112] proposed IncBricks, a caching fabric for key-value pairs with basic computing primitives in the data plane. Cidon et al. [111] proposed AppSwitch, a packet switch that performs load balancing for key-value storage systems. Signorello et al. [113] developed a preliminary implementation of a Named Data Networking (NDN) instance using P4. Grigoryan et al. [114] proposed a system that caches the most popular Forwarding Information Base (FIB) entries in fast memory in order to minimize the TCAM consumption and to avoid the TCAM overflow problem. Zhang et al. [115] proposed B-Cache, a framework that bypasses the original processing pipeline to improve the performance of caching. Vestin et al. [116] proposed FastReact, a system that enables caching for industrial control networks. Finally, Woodruff et al. [117] proposed P4DNS, an in-network cache for Domain Name System (DNS) entries.
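The read, write, and controller-update behavior of an in-network cache described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the design of NetCache or any other surveyed system: the class `InNetworkCache`, the use of a plain dictionary for the backend servers, and the "most-read keys" refresh policy are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


class InNetworkCache:
    """Toy model of a switch cache in front of key-value servers (hypothetical)."""

    def __init__(self, servers: Dict[str, bytes], capacity: int) -> None:
        self.servers = servers              # stands in for the backend storage servers
        self.capacity = capacity            # entries that fit in the switch table
        self.table: Dict[str, bytes] = {}   # the switch's local key-value table
        self.read_counts: Counter = Counter()

    def read(self, key: str) -> Optional[bytes]:
        self.read_counts[key] += 1          # statistics later used by the controller
        if key in self.table:
            return self.table[key]          # hit: answered by the switch
        return self.servers.get(key)        # miss: forwarded to the server

    def write(self, key: str, value: bytes) -> None:
        self.table.pop(key, None)           # evict the entry if the key is cached
        self.servers[key] = value           # then forward the write to the server

    def controller_update(self) -> None:
        # Periodic step: install the currently hottest keys and reset the counters.
        hottest = [k for k, _ in self.read_counts.most_common(self.capacity)]
        self.table = {k: self.servers[k] for k in hottest if k in self.servers}
        self.read_counts.clear()


cache = InNetworkCache({"a": b"1", "b": b"2", "c": b"3"}, capacity=1)
for _ in range(5):
    cache.read("a")                         # "a" becomes the hot key
cache.read("b")
cache.controller_update()                   # controller installs "a" in the switch
hit = cache.read("a")                       # served from the switch table
cache.write("a", b"9")                      # write evicts the cached copy
fresh = cache.read("a")                     # miss: value comes from the server
```

Evicting on write keeps the switch from serving a stale value; the entry is only re-installed by the next controller update.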
B.3. Caching Schemes Comparison, Discussions, and Limitations
Table XVII compares the aforementioned caching schemes. Schemes can be separated based on the type of data they aim to cache. For instance, NetCache, AppSwitch, and IncBricks cache arbitrary key-value pairs, while NDN.p4 caches only NDN names. Further, some schemes (e.g., NetCache, P4DNS, etc.) automatically index entries to be cached based on their access frequencies, while others require the operators to manually specify the entries. Another important distinction is whether the scheme uses a custom protocol or not. For instance, switches in NetCache parse a custom protocol that carries key-value pairs, while switches in P4DNS parse standard DNS headers.

The main motivation of switch-based caching schemes is to address the performance issues of server-based schemes. For instance, NetCache, which efficiently detects hot key-value items and serves them in the data plane, was capable of handling two billion queries per second for 64,000 items with 16-byte keys and 128-byte values. Compared to commodity servers, NetCache improves the throughput by 3-10 times and reduces the latency of 40% of queries by 50%. In addition to the throughput, the latency of the queries is also a major metric to improve. In IncBricks, the latency of requests is reduced by over 30% compared to client-side caching systems.

Similarly, B-Cache aims at improving the performance by caching into a single cache match-action table. The motivation behind B-Cache is that the performance of the data plane decreases significantly as the complexity of the P4 program and the packet processing pipeline grows. When a match occurs, the packet bypasses the original pipeline, making the performance of caching independent of the pipeline length. Note however that this system was evaluated on a software switch (BMv2), and it is not certain whether this design is always feasible on hardware targets.

Other caching schemes are more targeted for specific applications. As examples, FastReact enables caching for industrial control networks, while P4DNS caches DNS entries. Note that some schemes require a custom protocol to operate (e.g., NetCache), while others (e.g., P4DNS) work with standard protocols (e.g., DNS). Finally, some schemes offer multi-level caching (e.g., level-1 and level-2 caches).

TABLE XVII
CACHING SCHEMES COMPARISON

| Scheme | Name | Cached data | Network accelerator needed | Automatic entry indexing | Custom protocol | Multi-level cache | Platform (HW, SW) |
|---|---|---|---|---|---|---|---|
| [110] | NetCache | Key-value | × | ✓ | ✓ | × | ✓ |
| [111] | AppSwitch | Key-value | × | × | ✓ | × | ✓ |
| [112] | IncBricks | Key-value | ✓ | × | ✓ | × | ✓ |
| [113] | NDN.p4 | NDN names | × | ✓ | ✓ | × | ✓ |
| [114] | PFCA | Routes (FIB entries) | × | ✓ | × | ✓ | ✓ |
| [115] | B-Cache | FIB entries | × | ✓ | × | × | ✓ |
| [116] | FastReact | Sensor readings | × | × | ✓ | × | ✓ |
| [117] | P4DNS | DNS entries | × | ✓ | × | × | ✓ |

B.4. Comparison between Switch-based and Server-based Caching
Table XVIII compares the switch-based versus server-based caching schemes. The throughput when data is cached on the switch is an order of magnitude larger than that of general purpose servers. The latency is also reduced by 50%, and most of it is induced by the client. Switch-based caching solves the load imbalance problem and is simpler as the whole logic is expressed in a program. Server-based caching on the other hand is more flexible regarding cache policies, as well as key, value, and table sizes.

TABLE XVIII
SWITCH-BASED AND SERVER-BASED CACHING

| Feature | Switch-based | Server-based |
|---|---|---|
| Throughput | Higher (e.g., NetCache, 2 BQPS) | Lower; 0.2 BQPS |
| Latency | Lower (e.g., NetCache, μs, mostly caused by the client) | Higher; μs |
| Key size | Not flexible (limited by packet header length) | Arbitrary |
| Value size | Not flexible (limited by the amount of state accessed when processing a packet) | Arbitrary |
| Load imbalance | No | Yes |
| System complexity | Simpler; it requires a customized parser and match-action tables | More complex; it requires coordination with routers, tunneling (e.g., GRE encapsulation) |
| Table size | Limited by RAM | Arbitrary |
| Cache policies | Limited by table size | Arbitrary |

BQPS: Billion Queries Per Second.
C. Telecommunication Services
C.1. Background
The evolution of the current mobile network to the emerging Fifth-Generation (5G) technology implies significant improvements of the network infrastructure. Such improvements are necessary in order to meet the Key Performance Indicators (KPIs) and requirements of 5G [248]. 5G requires ultra-reliable communication with low latency and jitter (microseconds-scale). As programmable switches fulfill these requirements, researchers are investigating the idea of offloading telecom-oriented VNFs running on x86 servers to programmable hardware.
C.2. Literature Review
Ricart-Sanchez et al. [118] proposed a system that uses a programmable data plane to enhance the performance of the data path from the edge to the core network, also known as the backhaul, in a 5G multi-tenant network. The same authors [119] proposed a 5G firewall that detects, differentiates, and selectively blocks 5G network traffic in the backhaul network.

In parallel, attempts such as TurboEPC [120] proposed offloading a subset of user state in the mobile packet core to programmable switches in order to perform signaling in the data plane. Similarly, Singh et al. [121] designed a P4-based element of the 5G Mobile Packet Core (MPC) that merges the functions of both the Serving Gateway (SGW) and the Packet Data Network Gateway (PGW). Additionally, Voros et al. [122] proposed a hybrid next-generation NodeB (gNB) that combines the capabilities of P4 switches and the external services built on top of NIC accelerators (DPDK).

Another important function required in 5G is handover. Palagummi et al. [123] proposed SMARTHO, a system that uses programmable switches to perform handover efficiently in a wireless network.

Finally, Kfoury et al. [124] proposed a system for offloading conversational media traffic (e.g., Voice over IP (VoIP), Voice over LTE (VoLTE), WebRTC, media conferencing, etc.) from x86-based relay servers to programmable switches. While this system is not tailored for 5G networks specifically, it provides significant performance improvements for Over-The-Top (OTT) VoIP systems.

TABLE XIX
TELECOM SCHEMES COMPARISON

| Scheme | Core idea | Deployment | 5G-centric | Reported latency scale | Concurrent users evaluated | Implementation (HW, SW) |
|---|---|---|---|---|---|---|
| [118] | Enhances the data path in 5G multi-tenants | Backhaul | ✓ | Microseconds | N/A | ✓ |
| [119] | Implements a 5G firewall in the switch | Backhaul | ✓ | Microseconds | 1K | ✓ |
| [123] | Provides smart handover for mobile UE | Between CU and DU | ✓ | N/A | N/A | ✓ |
| [121] | Offloads MPC user plane functions to switch | Core network | ✓ | Microseconds | 65K-1M | ✓ |
| [124] | Offloads media traffic relay to switch | Edge | × | Nanoseconds | 65K-1M | ✓ |
| [120] | Performs signaling in the data plane | Core | ✓ | Milliseconds | 65K | ✓ |
Fig. 14. CDF of delay and packet loss rate of 900 offloaded VoIP calls [124].
C.3. Telecom Schemes Comparison, Discussions, and Limitations
Table XIX compares the aforementioned telecom schemes on P4. In general, all schemes aim at offloading various functionalities originally executed on x86-based servers to the data plane. Such a strategy improves the network performance (e.g., latency, throughput) significantly and aims at achieving the KPIs of 5G. For instance, the experiments conducted in [118] show that the attained QoS metrics meet the latency requirements of 5G. Similarly, the results reported in [119] demonstrate that the system meets the reliability KPI of 5G, which states that the network should be secured with zero downtime. Furthermore, the results reported in [123] show that there are 18% and 25% reductions in handover time with respect to legacy approaches, for two- and three-handover sequences, respectively. The system in [124] emulates the behavior of the relay server, which is primarily used to solve the NAT problem. Results show that ultra-low latency and jitter (nanoseconds-scale) are achieved with programmable switches, as opposed to x86-based relay servers where the latency and the jitter are in the milliseconds-scale (see Fig. 14). The solution also improves the packet loss rate, the CPU usage of the server, and the Mean Opinion Score (MOS), and can scale to more than one million concurrent sessions, with additional resources to spare in the switch.

Other systems allow offloading the signaling part to the data plane. For instance, TurboEPC offloads messages that constitute a significant portion of the total signaling traffic in the packet core, aiming at improving throughput and latency of the control plane's processing.
C.4. Switch-based and Server-based Media Relay
Offloading media traffic from general purpose servers to programmable switches greatly improves the quality of service. Table XX shows the metrics achieved when media is relayed by a relay server versus when it is relayed by the switch, based on [124]. The results show that the latency, jitter, and packet loss rates are significantly lower when media is being relayed by the switch. Not only are the QoS metrics improved, but so is the maximum number of concurrent sessions. With a 3.2 Tbps Tofino switch, more than one million sessions were accommodated in the switch's SRAM, with additional resources to spare for other functionalities. On the other hand, only one thousand sessions per CPU core were handled in the server-based relay before QoS starts to degrade. The drawback of offloading media traffic to the switch is that some functionalities are complex to implement in the data plane (e.g., media mixing for conference calls).

TABLE XX
SWITCH-BASED AND SERVER-BASED MEDIA RELAYING

| Metric | Switch-based relay [124] | Server-based relay |
|---|---|---|
| Relay server CPU | Lower; negligible with 900 active sessions | Higher; averages 50% for 900 active sessions |
| Latency | Lower; almost constant at 440 ns with 900 sessions | Higher; from 0.2 ms to 17 ms with 900 sessions |
| Jitter | Lower; negligible with 900 active sessions | Higher; ranges from 100 μs to 3 ms |
| Packet loss | None contributed by the switch | High; increases as the number of sessions increases |
| Maximum number of sessions | Higher; more than one million with additional resources to spare | Lower; a thousand sessions per core before QoS degrades |
| Mean opinion score (MOS) | Higher; maximum MOS (4.4) with 1800 concurrent sessions | Lower; for 1800 sessions, 50% of sessions have a MOS score below 3.7 |
| Table size | Limited by SRAM | Arbitrary |
| Additional functions | Limited to relaying | Arbitrary; e.g., media mix, lawful interception |
D. Publish/Subscribe
D.1. Background
Emerging network architectures (e.g., [249]) promote content-centric networking, a model where the addressing scheme is based on named data rather than named hosts. In other words, users specify the data they are interested in instead of specifying where to get the data from. A branch of content-centric networking is the publish/subscribe (pub/sub) model. The goal of the model is to provide a scalable and robust communication channel between producers and consumers of information. A large fraction of today's Internet applications follow the publish/subscribe paradigm. With the IoT, this paradigm proliferated as sensors/actuators are often deployed in dynamic environments. Other applications that use the pub/sub model include instant messaging, Really Simple Syndication (RSS) feeds, presence servers, telemetry, and others. Current approaches to content-centric networking use software-based middleboxes, which limits the performance in terms of throughput and latency. Recent works are leveraging programmable switches to overcome the performance limitations of software-based pub/sub middleboxes.

D.2. Literature Review
Jepsen et al. [125] presented "packet subscription", a new abstraction that generalizes the forwarding rules by evaluating stateful predicates on input packets. Wernecke et al. [126, 127] presented distribution strategies for content-based publish/subscribe systems using programmable switches. The authors described a system where the notification distribution tree (i.e., the subscribers that should receive the notification) is encoded in the packet headers, similar to multicast source routing. Similarly, Kundel et al. [128] implemented a publish/subscribe system on programmable switches. The system is flexible in encoding attributes/values in packet headers.
D.3. Publish/Subscribe Schemes Comparison, Discussions, and Limitations
Table XXI compares the aforementioned pub/sub schemes. In [125], the authors described a compiler that generates P4 tables from logical predicates. It utilizes a novel algorithm based on Binary Decision Diagrams (BDD) to preserve switch resources (TCAM and SRAM). This feature simplifies the configuration as operators do not need to manually install table entries in switches, which is a cumbersome process when the topology is large. The prototype was evaluated on a hardware switch (Tofino), and the authors considered Nasdaq's ITCH protocol as the pub/sub use case. Results show that the system was able to process messages at line rate while using the full switch capacity (6.5 Tbps). The other systems considered different encoding strategies. For example, in [126, 127], the authors described a system where the notification distribution tree (i.e., the subscribers that should receive the notification) is encoded in the packet headers, similar to multicast source routing. The advantage of storing the distribution tree in the packet header instead of storing it in the switch is that rules in the switches do not need to be updated when subscriptions change. Another distinction between the pub/sub systems is whether they require a dedicated language to describe the subscriptions, and the configuration complexity.

TABLE XXI
PUBLISH/SUBSCRIBE SCHEMES COMPARISON

| Scheme | Dedicated language | Config complexity | Encoding structure | Platform (HW, SW) |
|---|---|---|---|---|
| [125] | ✓ | Medium | Hierarchical (BDD) | ✓ |
| [126][127] | × | High | Distribution tree | ✓ |
| [128] | × | High | Attribute-value pair | ✓ |
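The idea of carrying the distribution tree in the packet header can be illustrated with a small Python sketch. The header layout used here (a map from switch identifier to output ports), the topology, and the function name `distribute` are assumptions made for illustration and do not reproduce the encoding of [126, 127]. The point is only that the switches hold static wiring while the per-notification tree travels with the packet.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical header layout: switch id -> output ports for this notification.
TreeHeader = Dict[str, List[int]]

# Static wiring known to the switches: (switch id, port) -> attached node.
LINKS: Dict[Tuple[str, int], str] = {
    ("s1", 1): "s2", ("s1", 2): "s3",
    ("s2", 1): "h1", ("s2", 2): "h2",
    ("s3", 1): "h3", ("s3", 2): "h4",
}
SWITCHES: Set[str] = {"s1", "s2", "s3"}


def distribute(header: TreeHeader, ingress: str) -> List[str]:
    """Follow the tree encoded in the header and return the hosts reached.

    Assumes the header encodes a tree (no forwarding loops).
    """
    delivered: List[str] = []
    pending = [ingress]
    while pending:
        switch = pending.pop()
        # Each switch reads only its own entry; it stores no subscription state.
        for port in header.get(switch, []):
            neighbor = LINKS[(switch, port)]
            if neighbor in SWITCHES:
                pending.append(neighbor)
            else:
                delivered.append(neighbor)
    return sorted(delivered)


# Notification for subscribers h1 and h3.
first = distribute({"s1": [1, 2], "s2": [1], "s3": [1]}, ingress="s1")
# Subscriptions change: only the header differs, LINKS is untouched.
second = distribute({"s1": [1, 2], "s2": [2], "s3": [1, 2]}, ingress="s1")
```

The trade-off is header size: every packet carries forwarding information that a stateful design would keep in switch tables.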
D.4. Comparison between Switch-based and Server-based Pub/Sub Systems
Fig. 15 illustrates the operations of traditional software-based pub/sub systems (a) and switch-based pub/sub systems (b). Latency and its variations are significantly reduced when the switch acts as a pub/sub broker. However, the size of memory in the switch limits the amount of data to be distributed. Moreover, implementing features provided by software-based pub/sub systems such as QoS levels, session persistence, message retaining, and last will and testament (notifying users after a device disconnects) in hardware is challenging.
E. Summary and Lessons Learned
Fig. 15. (a) Traditional software-based pub/sub architecture. (b) Pub/sub implemented on a programmable switch.

Programmable switches offer the flexibility of customizing the data plane to enable middlebox functions. A middlebox can be defined as a device that performs functions that are beyond the standard capabilities of routers and switches. A number of works demonstrated the implementation of middlebox functions such as caching, load balancing, offloading services, and others on programmable switches. The majority of load balancing schemes took advantage of the stateful nature of the data plane to store the load balancing connection table. Future work should consider minimizing the storage requirement to improve the scalability, supporting flow priority, and developing further variations for novel multipath transport protocols such as multipath QUIC.

The switch can also act as an "in-network cache" that serves hot items at line rate. Some schemes index entries automatically, while others require operator intervention. Future endeavours could investigate item compression, communication minimization, priority-based caching, and aggregated computations caching (e.g., caching the average of hot items).

An additional middlebox application is offloading telecom functions. The switch is capable of relaying media traffic and performing user plane functions. Future work could investigate scalability improvement (i.e., to accommodate more concurrent sessions), offloading signalling traffic, and in-network media mixing.

Finally, the switch can also act as a broker to distribute packets in a publish/subscribe system. Future work could investigate reliability assurance (e.g., packet delivery guarantees), message retaining, and QoS differentiation (e.g., QoS features of MQTT).

IX. NETWORK-ACCELERATED COMPUTATIONS
Programmable switches offer the flexibility of offloading some upper-layer logic to the ASIC, also referred to as in-network computation. Since switch ASICs are designed to process packets at terabits per second rates, in-network computation can result in an order of magnitude or more of improvement in throughput when compared to applications implemented in software. The potential performance improvement has motivated programmers to build in-network computation for different purposes, including consensus, machine learning acceleration, stream processing, and others.

The idea of delegating computations to networking devices was first conceived with Active Networks [250], where packets are replaced with small programs ("capsules") that are executed in each traversed device along the path. However, traditional network devices were not capable of performing computations. With the recent advancements in programmable switches, performing computations is now a possibility.
A. Consensus
A.1. Background
Consensus algorithms are common in distributed systems where machines collectively achieve agreement on a single data value, or on the current state of a distributed system. Reliability is achieved with consensus algorithms, even in the presence of some malicious or faulty processes. Consensus algorithms are used in applications such as blockchain [251], load balancing, clock synchronization, and others [252].

Latency has always been a bottleneck with consensus algorithms as protocols require expensive coordination on every request. Lately, researchers have started investigating how programmable switches can be leveraged to operate consensus protocols in order to increase throughput and decrease latency. Fig. 16 shows a consensus model in the data plane.
Fig. 16. Consensus protocol in the data plane model [130]. An application sends a request to the proposer, which resides on a commodity server. The proposer then creates a Paxos message and sends it to the coordinator, running in the data plane. The role of the coordinator is to be the broker of requests on behalf of proposers. Afterwards, the acceptor, which also runs in the data plane, receives the messages from the coordinator, and ensures consistency through the system by deciding whether to accept or reject proposals. Finally, learners provide replication by learning the result of consensus.
A.2. Literature Review
Li et al. [129] proposed Network-Ordered Paxos (NOPaxos), a P4-based Paxos [253] system that applies replication in the data center to reduce the latency imposed from communication overhead. Similarly, Dang et al. [130] presented an implementation of Paxos using P4 on the data plane. Dang et al. [134] also proposed Partitioned Paxos, a P4-based system that separates the two aspects of Paxos, namely, agreement and execution, and optimizes them separately. Furthermore, the same authors also proposed P4xos [136], a P4-based solution that executes Paxos logic directly in switch ASICs, without strengthening assumptions about the network (e.g., ordered delivery, packet loss, etc.). Jin et al. [133] proposed NetChain, a variant of the Paxos protocol that provides scale-free sub-RTT coordination in data centers. It is strongly-consistent, fault-tolerant, and presents an in-network key-value store.

Another line of research focused on consensus algorithms other than Paxos. Li et al. [131] proposed Eris, a P4-based solution that avoids replication and transaction coordination overhead. It processes a large class of distributed transactions in a single round trip, without any additional coordination between shards and replicas. Sakic et al. [135] proposed P4 Byzantine Fault Tolerance (P4BFT), a system that is based on BFT-enabled SDN, where controllers act as replicated state machines. The system offloads the comparison of controllers' outputs required for correct BFT operations to programmable switches. Finally, Han et al. [132] offloaded part of the Raft consensus algorithm [254] to programmable switches in order to improve its performance. The authors selected Raft due to the fact that it has been formally proven to be safer than Paxos, and it has been implemented on popular SDN controllers.

TABLE XXII
CONSENSUS SCHEMES COMPARISON

| Scheme | Name | Algo. | Weak assumpt. | Full proto. | Platform (HW, SW) |
|---|---|---|---|---|---|
| [129] | NOPaxos | Paxos | × | × | ✓ |
| [130] | N/A | Paxos | ✓ | × | ✓ |
| [131] | Eris | Novel | ✓ | ✓ | ✓ |
| [132] | N/A | Raft | ✓ | × | ✓ |
| [133] | NetChain | Novel | × | ✓ | ✓ |
| [134] | Partitioned Paxos | Paxos | ✓ | ✓ | ✓ |
| [135] | P4BFT | BFT | ✓ | ✓ | ✓ ✓ |
| [136] | P4xos | Paxos | ✓ | ✓ | ✓ |

A.3. Consensus Schemes Comparison, Discussions, and Limitations
Table XXII compares the aforementioned consensus schemes. In general, consensus algorithms such as Paxos are complex and cannot be easily implemented within the constraints of the data plane. For instance, [130] only implemented the phase-2 logic of Paxos leaders and acceptors. Similarly, NetChain uses a variant of the Paxos protocol that divides it into two parts: steady state and reconfiguration. This variant is known as Vertical Paxos, and is relatively simple to implement in the network, as the two parts can be mapped to the control plane and the data plane.
Unordered and completely asynchronous networks require the full implementation and complexity of Paxos. NOPaxos suggests that the communication layer should provide a new Ordered Unreliable Multicast (OUM) primitive; that is, receivers are guaranteed to process the multicast messages in the same order, though messages can be lost. NOPaxos relies on the network to deliver ordered messages in order to avoid coordination entirely. Dropped packets, on the other hand, are handled through coordination with the application.
Other systems like Eris avoid replication and transaction coordination overhead. The main contribution of Eris compared to NOPaxos is that it establishes a consistent ordering across messages delivered to many destination shards. Eris also allows receivers to detect dropped messages.
Partitioned Paxos [134] improved on the existing systems. The motivation behind Partitioned Paxos is that existing network-accelerated approaches do not address the problem of how a replicated application can cope with the high rate of consensus messages; NOPaxos only processes 13,000 transactions per second since it presents a new bottleneck at the host side. Other systems (e.g., NetChain) are specialized replication services and cannot be used by any off-the-shelf application.
Finally, P4xos improves both the latency and the tail latency. The throughput is also improved compared to hardware servers, which require additional memory management and safety features (e.g., user and kernel separation). P4xos was implemented on a hardware switch (Tofino), and results show that it reduces the latency by three times compared to traditional approaches, and that it can process over 2.5 billion consensus messages per second (a four orders of magnitude improvement).
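The ordering guarantee that NOPaxos expects from the network can be illustrated with a small sketch. It assumes, for illustration only, that a sequencer in the network stamps every multicast message with a consecutive number and that messages arrive in stamp order but may be lost; the class and method names are hypothetical and do not come from the surveyed systems.

```python
from dataclasses import dataclass, field


@dataclass
class Sequencer:
    """Hypothetical in-network sequencer: stamps each multicast message."""
    next_seq: int = 0

    def stamp(self, payload):
        msg = (self.next_seq, payload)
        self.next_seq += 1
        return msg


@dataclass
class Receiver:
    """Delivers messages in stamp order and records gaps as drops."""
    expected: int = 0
    delivered: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    def receive(self, msg):
        seq, payload = msg
        if seq < self.expected:
            return  # stale or duplicate message, ignore it
        # Any skipped sequence numbers were lost in the network.
        self.dropped.extend(range(self.expected, seq))
        self.delivered.append(payload)
        self.expected = seq + 1


def demo():
    """Two receivers see the same stream; one message is lost toward r2."""
    sequencer = Sequencer()
    msgs = [sequencer.stamp(p) for p in ["a", "b", "c", "d"]]
    r1, r2 = Receiver(), Receiver()
    for m in msgs:
        r1.receive(m)
    for m in msgs:
        if m[0] != 2:  # simulate the loss of message 2 on the path to r2
            r2.receive(m)
    return r1, r2
```

Both receivers deliver the surviving messages in the same order, and the second one learns from the gap that message 2 was dropped, which is the point at which coordination would be needed.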
A.4. Network-Assisted and Legacy Consensus Comparison
Consensus algorithms have traditionally been implemented as applications on general-purpose CPUs. Such an architecture inherently induces latency overhead (e.g., a Paxos coordinator has a minimum latency of 96 µs [255]). There are numerous performance benefits gained when consensus algorithms are implemented in programmable devices. When consensus messages are processed on the wire, the latency decreases significantly (a Paxos coordinator had a minimum latency of 340 ns [255]). Moreover, compared to legacy consensus deployments, network-assisted consensus requires fewer hop traversals.
B. Machine Learning
B.1. Background
The remarkable success of Machine Learning (ML) today has been enabled by a synergy between developments in hardware and advancements in machine learning techniques. Increasingly complex ML models are being developed to handle large datasets and to accelerate the training process. Hardware accelerators (e.g., GPU, TPU) were introduced to speed up the training. These accelerators are installed in large clusters and collaborate through distributed training to exploit parallelism. Nevertheless, training ML models is time consuming and can last for weeks depending on the complexity of the model and the size of the datasets. Researchers have traditionally investigated methods to accelerate the computation process, but not the communication in distributed learning. With the advancements in programmable switches, it is now possible to accelerate the ML training process through the network.
B.2. Literature Review
The literature can be divided into two main categories: accelerating training and accelerating inference. Sapio et al. [137] proposed DAIET, a system that performs in-network data aggregation to accelerate applications that follow a partition/aggregate workload pattern. Similarly, Yang et al. [140] proposed SwitchAgg, a system that performs functions similar to those of DAIET, but with a higher data reduction rate. Perhaps the most significant work in the training acceleration literature is SwitchML [141], a system that performs in-network aggregation of ML model updates sent from workers on external servers.
On the other hand, other schemes aim to speed up the inference process by leveraging programmable switches. Siracusano et al. [138] proposed N2Net, a system that runs simplified neural networks (NN) on programmable switches. Sanvito et al. [139] proposed BaNaNa Split, a solution that evaluates the conditions under which programmable switches can act as co-processors to CPUs for the processing of neural networks (e.g., CNNs). Finally, Xiong et al. [142] proposed IIsy, a system that enables programmable switches to perform in-network classification. The system maps trained ML classification models to match-action pipelines.
TABLE XXIII. MACHINE LEARNING SCHEMES COMPARISON
Scheme | Name | Core idea | Objective: Inference | Objective: Training | Evaluated model/algorithm | Quantization | Platform (HW, SW)
[137] | DAIET | In-network computation for partition/aggregate work pattern | × | ✓ | SGD, Adam | N/A | ✓
[138] | N2Net | In-network classification using BNN | ✓ | × | Binary neural networks | ✓ | × ×
[139] | BaNaNa Split | NN processing division between switches and CPUs | ✓ | × | Binary neural networks | ✓ | ×
[140] | SwitchAgg | In-network aggregation without modifying the network | × | ✓ | MapReduce-like system | N/A | ×
[141] | SwitchML | Accelerates distributed parallel training in ML | × | ✓ | Synchronous SGD | ✓ | ×
[142] | IIsy | Maps trained ML classification models to match-action pipeline | ✓ | × | Decision tree, SVM, naïve Bayes, k-means | × | ×
B.3. ML Schemes Comparison, Discussions, and Limitations
Table XXIII compares the aforementioned ML schemes. While the goal of DAIET is to discuss what computations the network can perform, the authors did not design a complete system, nor did they address the major challenges of supporting ML applications. Moreover, their proof of concept presented a simple MapReduce application on a software switch, and it is not certain whether the system can be implemented on a hardware switch. Compared to DAIET, SwitchAgg does not require modifying the network architecture, and offers better processing abilities with a significant data reduction rate. Moreover, SwitchAgg was implemented on an FPGA, and the results show that the job completion time can be reduced by as much as 50%.
SwitchML extended the literature on accelerating the training of ML models by providing a complete implementation and evaluation on a hardware switch. A commonly used training technique for deep neural networks is synchronous stochastic gradient descent [257]. In this technique, each worker has a copy of the model that is being trained. The training is an iterative process where each iteration consists of: 1) reading a sample of the dataset and locally performing some computation-intensive learning using the worker’s accelerators, which yields a gradient vector; and 2) updating the model by computing the mean of all gradient vectors. The main motivation of this idea is that the aggregation is computationally cheap (it takes 100 ms), but is communication-intensive (it transfers hundreds of megabytes each iteration). SwitchML uses computation on the switch to aggregate model updates in the network as the workers are sending them (see Fig. 17). An advantage is that there is minimal communication; each worker sends its update vector and receives back the aggregated updates.
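The aggregation step described above can be sketched in a few lines. This is a minimal illustration and not the SwitchML implementation: it assumes that workers convert gradients to integers with a fixed scaling factor (`SCALE`) because the switch only adds integers, and that updates are streamed in small chunks (`CHUNK`) so that the switch never holds a full update. Both constants and all function names are hypothetical.

```python
SCALE = 1 << 16  # hypothetical fixed-point scaling factor
CHUNK = 4        # hypothetical number of values carried per packet


def quantize(grad):
    """Worker side: turn floating point gradients into integers."""
    return [round(g * SCALE) for g in grad]


def switch_aggregate(worker_chunks):
    """Switch side: integer element-wise sum of one chunk per worker."""
    return [sum(vals) for vals in zip(*worker_chunks)]


def all_reduce_mean(gradients):
    """Return the mean gradient, aggregating one chunk at a time."""
    n = len(gradients)
    quantized = [quantize(g) for g in gradients]
    length = len(quantized[0])
    mean = []
    for start in range(0, length, CHUNK):
        chunks = [q[start:start + CHUNK] for q in quantized]
        summed = switch_aggregate(chunks)  # performed "in the switch"
        # Workers convert back to floating point and divide by n.
        mean.extend(s / (SCALE * n) for s in summed)
    return mean
```

Each worker sends its vector once and receives one aggregated vector back, in contrast to exchanging updates with every other worker.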
The design challenges of this system include: 1) the limited storage available on the switch, addressed by using a streaming approach; 2) the fact that switches cannot perform much computation per packet, addressed by partitioning the work between the switch and the workers; 3) the use of floating point numbers by ML systems, addressed by quantization approaches; and 4) the need for failure recovery to ensure correctness. The system is implemented on a hardware switch (Tofino), and results show that the system speeds up training by up to 300% compared to existing distributed learning approaches.
[Figure 17: panel (a) shows the updates of workers 1 to N crossing a legacy switch with all-to-all communication (fast GPUs, bottleneck on the network); panel (b) shows a programmable switch performing in-network aggregation, where each worker sends its update vector and receives the aggregated updates.]
Fig. 17. (a) ML model updates in legacy networks. The aggregation process is communication-intensive and follows an all-to-all communication pattern. This means that the workers should receive all the other workers’ updates. Since accelerators on end-hosts are becoming faster, the network should speed up so that it does not become the bottleneck. Therefore, it is expensive to deploy additional accelerators since it requires re-architecting the network. The red arrow in (a) shows that the bottleneck source is the network. (b) ML model updates accelerated by the network. Aggregation is performed in the network by the programmable switches while the workers are sending the updates. The workers do not need to obtain the updates of all other workers, hence there is minimal communication. They only obtain the aggregated model from the switch. The red arrow in (b) shows that the bottleneck source is the worker rather than the network [141, 256].
TABLE XXIV. SWITCH-BASED AND SERVER-BASED ML APPROACHES
Feature | Inference: Switch-based | Inference: Server-based | Training: Switch-based | Training: Server-based
Speed | Faster, inference at line rate | Slower | Faster, aggregations at line rate | Slower; aggregations on an x86 server
Complex computations support | Lower, basic arithmetic and bitwise logic functions | Higher | Lower | Higher
Communication overhead | Low | Low | Lower, switch is the centralized aggregator | Higher, updates are exchanged with a remote aggregator
Storage | Lower | Higher | Lower, update is not stored entirely at once | Higher
Encrypted traffic | Difficult | Easy | Difficult | Easy
With respect to in-network inference, it is challenging to implement full-fledged models as they require extensive computations (e.g., multiplications and activation functions). Simple variations such as the Binary Neural Network (BNN) only require bitwise logic functions (e.g., XNOR, POPCNT, SIGN). N2Net provides a compiler that translates a given BNN model into a switching chip’s configuration (a P4 program). The authors did not mention on which platform N2Net was evaluated; however, based on their evaluations, they concluded that a BNN can be implemented on most current switching chips, and that with small additions to the chip design, more complex models can be implemented. IIsy studied other ML models. The authors of IIsy acknowledged that the work is limited in scope as it does not address popular ML algorithms such as neural networks. Furthermore, it is bound to the type of features it can extract (i.e., packet headers), and has accuracy limitations. IIsy tries to find a balance between the limited resources on the switch and the classification accuracy. Finally, BaNaNa Split took a different approach by partitioning the processing of an NN to offload a subset of layers from the CPU to a different processor. Note that the solution is far from complete, and the authors evaluated a single binary fully connected layer with 4096 neurons using a network processor-based SmartNIC.
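To see why a BNN fits the operations available on a switch, the following sketch computes one binary neuron with only XNOR, POPCNT, and SIGN, and checks it against the equivalent computation with explicit multiplications. The bit encoding (1 for +1, 0 for -1) and the function names are assumptions made for this illustration; they are not taken from N2Net.

```python
def bnn_neuron(x_bits, w_bits, n):
    """Binary neuron over n-bit packed inputs and weights.

    Bit value 1 encodes +1 and bit value 0 encodes -1.
    """
    mask = (1 << n) - 1
    agree = ~(x_bits ^ w_bits) & mask  # XNOR: 1 where the signs agree
    popcnt = bin(agree).count("1")     # POPCNT
    dot = 2 * popcnt - n               # equals the +/-1 dot product
    return 1 if dot >= 0 else 0        # SIGN


def reference_neuron(x_bits, w_bits, n):
    """Same neuron computed with explicit +/-1 multiplications."""
    def to_pm1(value, i):
        return 1 if (value >> i) & 1 else -1

    dot = sum(to_pm1(x_bits, i) * to_pm1(w_bits, i) for i in range(n))
    return 1 if dot >= 0 else 0
```

The bitwise version avoids multiplications entirely, which is the property that makes this model family a candidate for match-action pipelines.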
C. Comparison between Switch-based and Server-based ML
Table XXIV shows a comparison between switch-based and server-based ML approaches. ML works that were extracted from the literature can be divided into two main categories: 1) expedited inference in the data plane, and 2) accelerated training in the network. The main advantage of switch-based over server-based inference is the ability to execute at line rate, and hence to provide faster results to the clients. Complex computations in the switch are achieved through approximations, and hence are limited. Moreover, the SRAM capacity of the switch is small, impeding the storage of large models. Such limitations are not problematic with server-based inference approaches.
Distributed training can be significantly faster when aggregations are offloaded to a centralized switch. However, due to the small capacity of the switch memory, it is not possible to store the whole model update at once. Additionally, encrypted traffic remains a challenge when inference or training is handled by the switch.
D. Summary and Lessons Learned
Accelerating computations by leveraging programmable switches is becoming a trend in data centers and backbone networks. Although switches only support basic and limited operations, the literature has shown that the performance of various tasks (e.g., consensus, training models in machine learning) could significantly improve if computations are delegated to the network.
The majority of the in-network consensus works aim at implementing common consensus protocols such as Paxos and Raft in the data plane. Due to the hardware constraints, current schemes implement only simplified variations of the protocols. Future work could investigate implementing novel consensus algorithms that diverge from the existing complex ones. Further, such schemes should encompass failure recovery mechanisms.
Another interesting in-network application is ML training/inference acceleration. The literature has shown that significant performance improvements are attained when the switch aggregates model updates or classifies new samples. Future systems could explore developing further ML models for various tasks such as classification, regression, and clustering.
In addition to the aforementioned categories, data plane programming is being used for stream processing [143, 144], parallel processing [145], string searching [146], erasure coding [147], in-network lock managers [148], database query acceleration [149], in-network compression [150], and computer vision offloading [151].
X. INTERNET OF THINGS (IOT)
The Internet of Things (IoT) is a novel paradigm in which pervasive devices equipped with sensors and actuators collect information about the physical environment and control the outside world. IoT applications include smart water utilities, smart grid, smart manufacturing, smart gas, smart metering, and many others. Typical IoT scenarios entail a large number of devices periodically transmitting their sensors’ readings to remote servers. Data received on those collectors is then processed and analyzed to assist organizations in making intelligent, data-driven decisions.
A. Aggregation
A.1. Background
Since IoT devices are constrained in size and processing capabilities, they typically generate packets that carry small payloads (e.g., temperature sensor readings). While such packets are small in size, their headers occupy a significant portion of the total packet size. For instance, the Sigfox Low-Power Wide Area Network (LPWAN) [258] supports a maximum payload size of 12 bytes per packet. The header overhead is 42 bytes (Ethernet 14 bytes + IP 20 bytes + UDP 8 bytes), which represents approximately 78% of the total packet size. When numerous devices continuously transmit IoT packets, a significant percentage of the network bandwidth is wasted on transmitting these headers. Packet aggregation is a mechanism in which the payloads of small packets are aggregated into a single larger packet in order to mitigate the bandwidth overhead caused by transmitting multiple headers. Legacy packet aggregation mechanisms operate on the CPUs of servers or on the control plane of switches [259–264]. While legacy mechanisms reduce the overhead of packet headers, they unquestionably increase the end-to-end latency and decrease the throughput. As a result, some studies have suggested aggregating only packets that are not real-time.
TABLE XXV. IOT AGGREGATION SCHEMES COMPARISON
Columns: Scheme; Evaluation (Theoretical, Implementation); Constraints (Same payload size, Payload <= 16 bytes, Number of packets); Line rate (Aggregation, Disaggregation); Platform (HW, SW)
[152] ✓ ✓ ✓ ✓ × ✓
[153] ✓ × × Up to MTU ✓ ✓ ✓
[154] ✓ ✓ ✓ × × ✓
A.2. Literature Review
Wang et al. [152] presented an approach where small IoT packets are aggregated into a larger packet in the switch data plane (see Fig. 18). The goal of performing this aggregation is to minimize the bandwidth overhead of the packets’ headers. The same authors [153] extended this work to solve some constraints related to the payload size and the number of aggregated packets. Similarly, Madureira et al. [155] proposed IoTP, a layer-2 communication protocol that enables the aggregation of IoT data in programmable switches. The solution gathers network information that includes the Maximum Transmission Unit (MTU), link bandwidths, underlying protocol, and delays. These properties are used to empower the aggregation algorithm.
A.3. Aggregation Schemes Comparison, Discussions, and Limitations
[Figure 18: IoT devices send IoT packets to a P4 switch that performs aggregation; the aggregated packet crosses the WAN to a second P4 switch that performs disaggregation in front of the server.]
Fig. 18. IoT packets aggregation [152]. Frequent small IoT packets are aggregated by a P4 switch and encapsulated in a larger packet. Another switch across the WAN disaggregates the large packet to restore the original IoT packets. Such a mechanism prevents the frequent transmission of headers, and thus minimizes the bandwidth overhead.
Table XXV compares the aforementioned IoT aggregation schemes. [152] and [153] operate in the same way. Upon receiving a packet, the P4 switch parses its headers and identifies whether the packet is an IoT packet. If so, the switch parses and extracts the payload. Afterwards, the payload is stored in switch registers along with some other metadata, and the packet is dropped. Once packets are aggregated, the resulting packet is sent across the WAN to reach the remote server. Before the packet reaches the server, it is disaggregated by another P4 switch situated close to the server, and several packets identical to the original ones are generated. An important observation is that the aggregation/disaggregation processes are transparent to both the IoT devices and the servers; hence, no modifications are required on either end. The main advantages of [153] over [152] are: 1) packets can have different payload sizes; 2) the payload size is no longer limited to 16 bytes; 3) the number of packets is dynamic and only limited by the packet MTU; and 4) both the disaggregation and the aggregation run at line rate.
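The bandwidth argument and the aggregation/disaggregation round trip can be made concrete with a short sketch. The 42-byte header comes from the Ethernet, IP, and UDP sizes given in the Background; the 1-byte length prefix used to delimit payloads is a hypothetical encoding chosen for this example and is not the format used in [152] or [153].

```python
HEADER_BYTES = 14 + 20 + 8  # Ethernet + IP + UDP, as in the Background


def header_overhead(payload_sizes):
    """Fraction of bytes spent on headers for one packet that carries
    all the given payloads (length prefixes are ignored here)."""
    total = HEADER_BYTES + sum(payload_sizes)
    return HEADER_BYTES / total


def aggregate(payloads):
    """Pack payloads, each behind a hypothetical 1-byte length prefix.

    The 1-byte prefix limits each payload to 255 bytes.
    """
    out = bytearray()
    for p in payloads:
        out.append(len(p))
        out += p
    return bytes(out)


def disaggregate(blob):
    """Recover the original payloads from an aggregated blob."""
    payloads, i = [], 0
    while i < len(blob):
        size = blob[i]
        payloads.append(blob[i + 1:i + 1 + size])
        i += 1 + size
    return payloads
```

For a single 12-byte payload, `header_overhead([12])` is 42/54, or about 78%; carrying eight such payloads in one packet gives 42/138, or about 30%, before accounting for any per-payload framing.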
A.4. Comparison between Server-based and Switch-based Aggregation
Table XXVI shows a comparison between switch-based and server-based packet aggregation. When aggregation is performed on the switch (ASIC), the throughput is higher while the latency and jitter are lower than those of the server-based approaches (e.g., switch CPU or x86-based server). On the other hand, server-based aggregation has more flexibility in defining the number of packets and the amount of data that can be aggregated.
B. Service Automation
B.1. Background
Low-power, low-range IoT communication technologies (e.g., Bluetooth Low Energy (BLE) [265], Zigbee [266], Z-wave [267]) typically follow a peer-to-peer model. IoT devices in such technologies can be divided into two distinct types, peripheral and central. Peripheral devices, which consist of sensors and actuators, receive commands and execute subsequent actions. Central devices, on the other hand, run applications that analyze information collected from peripheral devices and subsequently issue commands.
The interconnection of devices and services can follow a Peer-to-Peer (P2P) model or a cloud-centric approach. In the P2P model, the automation service runs on the central device, which processes and analyzes sensor data published by peripheral devices in order to issue commands. The main advantages of the P2P model include the low end-to-end latency and the low power consumption, as devices are physically close to each other. Its drawbacks include poor scalability, short reachability, and inflexibility of policy enforcement. The cloud-centric model addresses the limitations of the P2P model by adding a gateway node that connects peripheral devices to a middleware hosted on the cloud (Internet). While this approach solves the poor scalability and the policy enforcement flexibility issues, it incurs additional delay and jitter in collecting and reacting to data. Moreover, the middleware represents a single point of failure which can shut down the whole service in the event of an outage. With programmable switches, researchers are investigating in-network approaches to manage transactional relationships between low-power, low-range IoT devices.
TABLE XXVI. SWITCH-BASED AND SERVER-BASED PACKET AGGREGATION
Feature | Switch-based (ASIC) | Server-based (CPU)
Throughput | Higher (e.g., [152], 100 Gbps, i.e., max capacity) | Lower (e.g., [152], 2.58 Gbps)
Latency and jitter | Lower | Higher
Count of packets to be aggregated | Not flexible (limited by the switch SRAM) | Arbitrary
Amount of data to be aggregated | Not flexible (limited by the switch SRAM, parsing capacity) | Arbitrary
B.2. Literature Review
Uddin et al. [156] proposed Bluetooth Low Energy Service Switch (BLESS), a programmable switch that automates IoT application services by encoding their transactions in the data plane. It maintains link-layer connections to the devices to support P2P connectivity. The same authors proposed Muppet [157], an extension to BLESS that supports multiple non-IP protocols.
B.3. Service Automation Comparison, Discussions, and Limitations
In BLESS, the data plane operations are performed at the Attribute Protocol (ATT) service layer, which consists of three operations: read attributes, write attributes, and attribute notifications. BLESS parses ATT packets, then processes and forwards them to the devices. The control plane, on the other hand, is responsible for address assignment, device and service discovery, policy enforcement, and subscription management. The switch was implemented on a software switch (PISCES), and the results show that BLESS combines the advantages of the P2P and the cloud-centric approaches. Specifically, it achieves small communication latency, low device power consumption, high scalability, and flexible policy enforcement. Muppet extended this approach to support multiple IoT protocols. The system studied two popular IoT protocols, namely BLE and Zigbee. Being in the middle, the Muppet switch is responsible for translating actions (e.g., on/off switch of a light bulb) between the Zigbee and BLE protocols, as well as logging important events to a database which resides on the Internet via the Hypertext Transfer Protocol (HTTP). Note that parsers and action policies have to be implemented for each supported protocol. Another difference from BLESS is that the implementation of Muppet’s control plane leverages the ONOS controller with the Protocol Independent (PI) framework.
TABLE XXVII. SWITCH-BASED, P2P, AND CLOUD SERVICE AUTOMATION
Feature | Switch-based | Peer-to-peer | Cloud-based
Latency | Low | Low | High
IoT energy | Low | Low | High
Scalability | High | Low | High
Reachability | High | Low | High
B.4. Comparison between Server-based and Switch-based Service Automation
Table XXVII shows a comparison between switch-based, P2P, and cloud-based service automation. Generally, the switch-based approach overcomes the limitations of both the P2P and the cloud-based approaches. It achieves the low energy and latency characteristics of P2P while increasing scalability and reachability.
C. Summary and Lessons Learned
The IoT literature broadly falls into two categories, namely packet aggregation and service automation. The goal of packet aggregation is to minimize the overhead of IoT packets’ headers. Typically, headers in IoT packets represent a significant portion of the whole packet size. By aggregating several packets into a single packet, the bandwidth overhead is reduced. Future work should study the performance side effects (e.g., delay, jitter, loss rate, retransmission) that aggregation causes to packets. Furthermore, timers should be implemented to avoid excessive delays resulting from waiting for enough packets to be aggregated.
With respect to service automation, the goal is to automate IoT application services by encoding their transactions in the data plane while improving scalability, reachability, energy consumption, and latency. Future work should design and develop translators for non-IP IoT protocols so that applications on various devices that run different protocols can exchange data. Additionally, production-grade software switches should be leveraged to support non-Ethernet IoT protocols.
Other works that involve IoT include flowlet-based stateful multipath forwarding [268] and an SDN/NFV-based architecture for IoT networks [269].
XI. CYBERSECURITY
Extensive research efforts have been devoted to deploying programmable switches to perform various security-related functions in the data plane. Such functions include heavy hitter detection, traffic engineering, DDoS attack detection and mitigation, anonymity, and cryptography. Fig. 19 demonstrates the difference between contemporary security appliances and programmable switches with respect to layer inspection in the OSI model. Although programmable switches are limited in computation power, they are capable of inspecting upper layers (e.g., application layer) at line rate. Such functionality is not available in any of the existing solutions.
[Figure 19: both panels list the OSI layers (Application, Presentation, Session, Transport, Network, Data Link, Physical). Panel (a) maps them to ACL and packet filter; traditional firewall and flow-based IDS; and next-generation firewall and IDS/IPS. Panel (b) maps them to a programmable switch. The legend distinguishes software inspection from hardware inspection.]
Fig. 19. Layers inspection in the OSI model. (a) Contemporary security appliances. (b) Programmable switch.
A. Heavy Hitter
A.1. Background
Heavy hitters are a small number of flows that constitute most of the network traffic over a certain amount of time. They are identified based on the port speed, network RTT, traffic distribution, application policy, and others. Heavy hitters increase the flow completion time for delay-sensitive mice flows, and represent the major source of congestion. It is important to promptly detect heavy hitters in order to react to them; for instance, to redirect them to a low-priority queue, perform rate control and traffic engineering, block volumetric DDoS attacks, and diagnose congestion. Traditionally, packet sampling techniques (e.g., NetFlow) were used to detect heavy hitters. The main problem with such techniques is the limited accuracy due to the CPU and bandwidth overheads of processing samples in software. Advancements in programmable switches paved the way to detect heavy hitters in the data plane, which is not only orders of magnitude faster than sampling, but also enables additional applications (e.g., flow-size aware routing).
A.2. Literature Review
Sivaraman et al. [158] proposed HashPipe, a heavy hitter detection algorithm that operates entirely in the data plane. It detects the top-k heavy hitter flows within the constraints of programmable switches while achieving high accuracy. A subsequent work proposed by Harrison et al. [159] considers network-wide distributed heavy hitter detection. Furthermore, Kučera et al. [160] proposed Elastic Trie, a solution that detects hierarchical heavy hitters, in-network traffic changes, and superspreaders in the data plane. Hierarchical heavy hitters include the total activity of all traffic matching relevant IP prefixes. Basat et al. [161] proposed PRECISION, a heavy hitter detection algorithm that probabilistically recirculates a fraction of packets for a second pipeline traversal. The recirculation idea greatly simplifies the memory access pattern without significantly degrading throughput. Ding et al. [162] proposed an approach for incrementally deploying programmable switches in a network consisting of legacy devices with the goal of monitoring as many distinct network flows as possible. Tang et al. [163] proposed MV-Sketch, a solution that exploits the idea of majority voting to track the candidate heavy flows inside the sketch data structure. Finally, Silva et al. [164] proposed a solution that identifies elephant flows in Internet eXchange Point (IXP) networks.
A.3. Heavy Hitter Detection Comparison, Limitations, and Discussions
TABLE XXVIII. HEAVY HITTER SCHEMES COMPARISON
Scheme | Name | Core idea | Data structure | Network-wide | Adaptive thresholds | Approximations | Platform (HW, SW)
[158] | HashPipe | Maintains counts of heavy flows in a pipeline of hash tables | Hash tables | × | × | × | ✓
[159] | N/A | Switches store the counts locally; a coordinator aggregates the results | Hash tables | ✓ | ✓ | ✓ | ✓
[160] | Elastic Trie | Detects hierarchical heavy hitters using a hashtable prefix tree | Prefix tree | × | ✓ | ✓ | ✓
[161] | PRECISION | Recirculates a small fraction of packets to simplify memory access | Hash tables | × | × | ✓ | ✓
[162] | N/A | Monitors distinct flows using the HyperLogLog algorithm | HyperLogLog | ✓ | ✓ | ✓ | ✓
[163] | MV-Sketch | Supports the queries of recovering all heavy flows in a sketch | Invertible sketches | ✓ | × | ✓ | ✓
[164] | N/A | Identifies elephant flows using dynamic thresholds in IXPs | Hash tables | × | ✓ | × | ✓
TABLE XXIX. CRYPTOGRAPHY SCHEMES COMPARISON
Scheme | Name | Core idea | Conf. | Integ. | Auth. | Computations (ASIC, CPU) | Algorithms | Platform (HW, SW)
[165] | N/A | Implementations of cryptographic hash functions | × | × | ✓ | ✓ | SipHash-2-4, Poly1305-AES, BLAKE2b, HMAC-SHA256-512 | ✓
[166] | P4-IPsec | Implementation of host-to-site IPsec in P4 switches | ✓ | ✓ | ✓ | ✓ | AES-CTR, HMAC-MD5 | ✓
[167] | P4-MACsec | Implementation of MACsec on P4 switches | ✓ | ✓ | × | ✓ | AES-GCM | ✓
[168] | N/A | AES implementation using scrambled lookup table | ✓ | × | × | ✓ | AES-128, AES-192, AES-256 | ✓
Table XXVIII compares the aforementioned heavy hitter schemes. The main criterion that differentiates the solutions is the selection and the implementation of the data structure. Hash tables and sketches are frequently used to store counters for heavy flows. Note that several variations of such data structures are being used in the literature, mainly to tackle the memory-accuracy tradeoff; the choice of data structure affects the accuracy of the performed measurements. For example, with probabilistic data structures, only approximations are obtained.
In HashPipe, the programmable switch stores the flow identifiers and their byte counts in a pipeline of hash tables. HashPipe adapts the Space Saving algorithm, which is described in [270]. The system was evaluated using an ISP trace provided by CAIDA (400,000 flows), and the results show that HashPipe needed only 80 KB of memory to identify the 300 heaviest flows with an accuracy of 95%. Another hashtable-based solution is Elastic Trie, which consists of a prefix tree that expands or collapses to focus only on the prefixes that account for a large share of the network traffic. The data plane informs the control plane about high-volume traffic clusters in an event-based push approach only when some conditions are met. Other systems explored different data structures for the task. For instance, in [162] the authors used the HyperLogLog algorithm [271], which approximates the number of distinct elements in a multiset. The solution is capable of detecting heavy hitters by only using partial input from the data plane.
Another important criterion is whether the scheme tracks heavy hitters across the whole network. For example, unlike HashPipe, which considers a single switch, [159] tracks network-wide heavy hitters. Tracking network-wide heavy hitters is important as some applications (e.g., port scanners, superspreaders) can go undetected when observed from a single location. Moreover, detecting heavy hitters separately at each switch and aggregating the results is not sufficient; flows might not exceed a threshold locally, but when the total volume is considered, the threshold might be crossed.
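The pipeline-of-hash-tables idea behind HashPipe can be sketched as follows. This is a simplified illustration rather than the authors' P4 program: the first stage always admits the incoming flow and evicts whatever occupied the slot, while later stages keep the larger of the resident and carried counters and pass the smaller one on. The per-stage hash (CRC32 salted with the stage number), the table sizes, and the class name are assumptions made for the example.

```python
import zlib


class HashPipe:
    """Simplified HashPipe-style table pipeline (illustrative only)."""

    def __init__(self, stages=3, slots=64):
        self.stages = stages
        self.slots = slots
        # Each slot holds a (flow_key, count) tuple or None.
        self.tables = [[None] * slots for _ in range(stages)]

    def _index(self, stage, key):
        # Hypothetical per-stage hash: CRC32 salted with the stage number.
        return zlib.crc32(f"{stage}:{key}".encode()) % self.slots

    def update(self, key, weight=1):
        # Stage 0: always insert the incoming key, evicting any other key.
        i = self._index(0, key)
        slot = self.tables[0][i]
        if slot is not None and slot[0] == key:
            self.tables[0][i] = (key, slot[1] + weight)
            return
        self.tables[0][i] = (key, weight)
        carried = slot
        # Later stages: keep the larger count, carry the smaller onward.
        for s in range(1, self.stages):
            if carried is None:
                return
            i = self._index(s, carried[0])
            slot = self.tables[s][i]
            if slot is None:
                self.tables[s][i] = carried
                return
            if slot[0] == carried[0]:
                self.tables[s][i] = (slot[0], slot[1] + carried[1])
                return
            if carried[1] > slot[1]:
                self.tables[s][i], carried = carried, slot
        # A key still carried here is dropped (a source of approximation).

    def top(self, k):
        """Return the k largest flows, merging duplicates across stages."""
        totals = {}
        for table in self.tables:
            for slot in table:
                if slot is not None:
                    totals[slot[0]] = totals.get(slot[0], 0) + slot[1]
        return sorted(totals.items(), key=lambda kv: -kv[1])[:k]
```

Because small flows lose the comparison in later stages and eventually fall off the end of the pipeline, the memory is spent on the flows with the largest counters. The `weight` argument can carry a byte count when byte volumes rather than packet counts are tracked.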
A.4. Comparison between P4-based and Traditional Heavy Hitter Detection
The main advantage of heavy hitter detection schemes in the data plane over sampling-based approaches is the ability to operate at line rate. This means that every packet is considered in the detection algorithm, which improves the accuracy and the speed of detection. Moreover, additional applications that exploit reactive processing can be implemented. For instance, switches can perform flow-size aware routing to redirect traffic upon detecting a heavy hitter.
B. Cryptography
B.1. Background
Performing cryptographic functions in the data plane is useful for a variety of applications (e.g., protecting layer 2 with cryptographic integrity checks and encryption, mitigating hash collisions, etc.). Computations in cryptographic operations (e.g., hashing, encryption, decryption) are known to be complex and resource-intensive. The supported operations in switch targets and in the P4 language are limited to basic arithmetic (e.g., additions, subtractions, bit concatenation, etc.). Recently, a handful of works have started studying the possibility of performing cryptographic functions in the data plane.
B.2. Literature Review
The authors in [165] argue for the need to implement cryptographic hash functions in the data plane to mitigate potential attacks targeting hash collisions. Consequently, they presented prototype implementations of cryptographic hash functions on three different P4 target platforms (CPU, SmartNIC, NetFPGA SUME). Another work by Hauser et al. [166] attempted to implement host-to-site IPsec in P4 switches. For simplification, only Encapsulating Security Payload (ESP) in tunnel mode with different cipher suites is implemented. The same authors also proposed P4-MACsec [167], an implementation of MACsec on P4 switches. MACsec is an IEEE standard for securing Layer 2 infrastructure by encrypting, decrypting, and performing integrity checks on packets.
The previous works delegated the complex computations to the control plane. Chen et al. [168] implemented the Advanced Encryption Standard (AES) in the data plane using scrambled lookup tables. AES is one of the most widely used symmetric cryptography algorithms; it applies several encryption rounds on 128-bit input data blocks.
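Since AES applies a fixed number of rounds per key size (10, 12, and 14 for 128-, 192-, and 256-bit keys), a pipeline that fits only a few rounds per traversal has to recirculate the packet. The sketch below computes the number of traversals under that assumption; the default of two rounds per pass is the figure reported for the Tofino prototype in the discussion that follows, and the function names are hypothetical.

```python
import math

# AES rounds per key size in bits (as defined by the AES standard).
AES_ROUNDS = {128: 10, 192: 12, 256: 14}


def pipeline_passes(key_bits, rounds_per_pass=2):
    """Pipeline traversals needed to finish all AES rounds."""
    return math.ceil(AES_ROUNDS[key_bits] / rounds_per_pass)


def recirculations(key_bits, rounds_per_pass=2):
    """Times the packet is re-injected after its first traversal."""
    return pipeline_passes(key_bits, rounds_per_pass) - 1
```

With two rounds per pass, AES-128 needs five traversals, that is, four recirculations through the loopback ports.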
B.3. Cryptography Schemes Comparison, Discussions and Limitations
Table XXIX compares the aforementioned cryptography schemes. With respect to hashing, P4 currently implements hash functions that do not have the characteristics of cryptographic hashing. For example, Cyclic Redundancy Check (CRC), which is commonly used in P4 targets, was originally developed for error detection. CRC can be easily implemented in embedded hardware and is computationally much less complex than cryptographic hash functions (e.g., Secure Hash Algorithm (SHA)-256); however, it is not secure and has a high collision rate. Evaluation results in [165] show that 1) implementing cryptographic hash functions on a CPU is easy, but incurs high latency (several milliseconds); 2) SmartNICs have the highest throughput, but can only process packets up to 900 bytes; and 3) NetFPGA has the lowest latency, but cannot be integrated using native P4 features. The authors found that the performance of hashing is highly dependent on the application, the input type, and the hashing algorithm, and therefore there is no single solution that fits all requirements. However, P4 targets should benefit from the characteristics of each solution (CPU, SmartNICs, FPGA, and ASICs) to implement cryptographic hashing.

As for more complex protocol suites (e.g., IPsec), Hauser et al. [166] only implemented Encapsulating Security Payload (ESP) in tunnel mode for simplification. The Security Policy Database (SPD) and the Security Association Database (SAD) are represented as match-action tables in the P4 switch. To avoid complex key exchange protocols such as the Internet Key Exchange (IKE), this work delegates runtime management operations to the control plane. Moreover, since encryption and decryption are not supported by P4, the authors relied on user-defined P4 externs to perform complex computations. Note that implementing user-defined externs is not possible on ASICs (e.g., Tofino), and consequently, the main CPU module of the switch is used for performing encryption/decryption computations, at the cost of increased latency and degraded throughput. The same ideas are applied to P4-MACsec by the same authors.

The system proposed by Chen et al. [168] has significant performance advantages as it is fully implemented in the data plane. The idea of the proposed system is to apply lookup tables that are permuted using an encryption key. The authors found that a single switch pipeline is capable of performing two AES rounds. Consequently, the system leverages packet recirculation, a technique which re-injects the packet into the pipeline. By doing so, it is possible to complete the 10 rounds of encryption required by the AES-128 algorithm in five pipeline passes. Note that recirculation uses loopback ports and hence is limited by their bandwidth. The implementation on the Tofino chip shows that ≈

B.4. Comparison between In-network and Contemporary Cryptography
Cryptographic primitives often require performing complex arithmetic operations on data. Implementing such computations on general-purpose servers is simple, since memory and processing units are not constrained. The literature has shown that there is a need to implement cryptographic functions in the data plane. For instance, cryptographic hash functions can significantly improve existing data plane applications with respect to collisions; encryption can protect confidential information from being exposed to the public. However, switches have limitations when it comes to computing. Supported hash functions in P4 are non-cryptographic (e.g., CRC), and therefore produce collisions when the table is not large. Consequently, researchers are continuously investigating techniques to perform such operations in the data plane.
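To illustrate why CRC is unsuitable where cryptographic properties are required, the following minimal Python sketch checks a well-known structural property: for equal-length inputs, CRC-32 is affine over XOR, so the checksum of a XOR b XOR c can be predicted from the three individual checksums without seeing the combined input. SHA-256 has no such structure. The helper names (`xor_bytes`, `crc32_is_affine`, `sha256_is_affine`) and the sample flow keys are hypothetical and are not taken from [165]; the sketch uses the standard-library `zlib.crc32`, not a switch implementation.

```python
import hashlib
import zlib


def xor_bytes(*chunks: bytes) -> bytes:
    """XOR several equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, value in enumerate(chunk):
            out[i] ^= value
    return bytes(out)


def crc32_is_affine(a: bytes, b: bytes, c: bytes) -> bool:
    """True if CRC32(a^b^c) equals CRC32(a)^CRC32(b)^CRC32(c)."""
    predicted = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
    return predicted == zlib.crc32(xor_bytes(a, b, c))


def sha256_is_affine(a: bytes, b: bytes, c: bytes) -> bool:
    """Same check for SHA-256; expected to fail for arbitrary inputs."""
    predicted = xor_bytes(
        hashlib.sha256(a).digest(),
        hashlib.sha256(b).digest(),
        hashlib.sha256(c).digest(),
    )
    return predicted == hashlib.sha256(xor_bytes(a, b, c)).digest()


# Hypothetical equal-length flow keys used only for illustration.
keys = (b"flow-key-0001", b"flow-key-0002", b"flow-key-0003")
print("CRC-32 predictable from XOR:", crc32_is_affine(*keys))
print("SHA-256 predictable from XOR:", sha256_is_affine(*keys))
```

An adversary who can predict checksums in this way can craft inputs that land in a chosen table slot, which is the kind of collision attack that motivates cryptographic hashing in the data plane.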
C. Privacy and Anonymity
C.1. Background
Packets in a network carry information that can potentially identify users and their online behavior. Therefore, user privacy and anonymity have been extensively studied in the past (e.g., Tor and onion routing [272]). However, existing solutions have several limitations: 1) poor performance, since overlay proxy servers are maintained by volunteers and have no performance guarantees; 2) deployability challenges, since some solutions require modifying the whole Internet architecture, which is highly unlikely; 3) no clear partial deployment pathway; and 4) most solutions are software-based. Consequently, recent works started investigating methods that exploit programmable switches to develop partially deployable, low-latency, and lightweight anonymity systems.

With respect to anonymity and privacy in the network, a new class of attacks that target the topology requires the attacker to know the topology and understand its forwarding behavior. Such attacks can be mitigated by obfuscating (hiding) the topology from external users. P4-based schemes are also being developed to achieve this goal.

TABLE XXX
PRIVACY AND ANONYMITY SCHEMES COMPARISON

Name/Scheme | Goal | Strategy | Platform (HW, SW)
NetHide [169] | Mitigate topology attacks | Topology obfuscation | ×, ×
PANEL [170] | Protect Internet users' identities | Source info rewriting | ✓
ONTAS [171] | Protect PII in packet traces | Header fields hashing | ✓
SPINE [172] | Protect Internet users' identities | Header fields concealing | ✓
C.2. Literature Review
Meier et al. [169] proposed NetHide, a P4-based solution that obfuscates network topologies to mitigate topology-centric attacks such as Link-Flooding Attacks (LFAs). On the other hand, Kim et al. [171] proposed the Online Network Traffic Anonymization System (ONTAS), a system that anonymizes traffic online using P4 switches.

Another line of research focused on protecting the identity of Internet users. Moghaddam et al. [170] proposed Practical Anonymity at the NEtwork Level (PANEL), a lightweight and low-overhead in-network solution that builds anonymity into the Internet forwarding infrastructure. Likewise, Datta et al. [172] proposed Surveillance Protection in the Network Elements (SPINE), a system that anonymizes traffic by concealing IP addresses and relevant TCP fields (e.g., sequence number) from adversarial Autonomous Systems (ASes) on the data plane.
C.3. Privacy and Anonymity Schemes Discussions
Table XXX compares the privacy and anonymity schemes. NetHide aims at mitigating attacks that target the network topology. The solution formulates network obfuscation as a multi-objective optimization problem, and uses accuracy (hard constraints) and utility (soft constraints) as metrics. The system then uses an ILP solver and heuristics. The P4 switches in this system capture and modify tracing traffic at line rate. The specifics of the implementation were not disclosed, but the authors claim that the system was evaluated on realistic topologies (more than 150 nodes), and more than 90% of link failures were detected by operators, despite obfuscation.

ONTAS had a slightly different goal; it aims at protecting personally identifiable information (PII) in online traces. The system overcomes the limitations of existing systems, which either require network operators to anonymize packet traces before sharing them with other researchers and analysts, or anonymize traffic online but with significant overhead. ONTAS provides a policy language used by operators for expressing anonymization tasks, which makes the system flexible and scalable. The system was implemented and tested on a hardware switch, and results show that ONTAS entails 0% packet processing overhead and requires half the storage of existing offline tools. A limitation of this system is that it does not anonymize TCP/UDP field values. Another limitation is that it does not support applying multiple privacy policies concurrently.

Another line of research (i.e., PANEL, SPINE) focused on protecting the identities of Internet users. PANEL overcomes the performance limitations of popular anonymity systems (e.g., Tor), and does not require entirely modifying the Internet routing and forwarding protocols as proposed in [273] and [274]. Partial deployment is possible as PANEL can co-exist with legacy devices. The solution involves: 1) source address rewriting to hide the origin of the packet; 2) source information normalization (IP identification and TCP sequence randomization) to mitigate fingerprinting attacks; and 3) path information hiding (TTL randomization) to hide the distance to the original sender at any given vantage point.

Fig. 20. SPINE architecture [172]. (Labeled elements: unmodified devices, trusted entity 1, trusted entity 2, untrusted entity, {keys, version number}, original traffic, SPINE traffic.)

As for SPINE, it does not require cooperation between switches and end-hosts, but assumes that at least two entities (typically two ASes or two ISPs) are trusted. Fig. 20 shows the SPINE architecture. The solution encrypts the IP addresses before the packets enter the intermediary ASes. Therefore, adversarial devices only see the encrypted addresses in the headers. It also encrypts the TCP sequence and ACK numbers to prevent packets from being attributed to flows.
SPINE transforms IPv4 headers into IPv6 headers when packets leave the trusted entity and restores the IPv4 headers upon entering the trusted entity. These operations enable routing to be performed in intermediary networks. The encrypted IPv4 address is inserted in the last 32 bits of the IPv6 destination address. The encryption works by XORing the IP address with the hash of a pre-shared key and a nonce. The system uses SipHash since it is easily implemented in the data plane.
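The address concealment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPINE implementation: the standard library has no SipHash, so keyed BLAKE2b truncated to 32 bits stands in for it; the IPv6 prefix `2001:db8::` (a documentation prefix), the way the nonce is supplied, and the function names `keyed_mask`, `conceal_ipv4`, and `reveal_ipv4` are all hypothetical.

```python
import hashlib
import ipaddress


def keyed_mask(key: bytes, nonce: bytes) -> int:
    """32-bit mask from a keyed hash of the nonce.

    Assumption: keyed BLAKE2b is a stand-in for the SipHash used by SPINE.
    """
    digest = hashlib.blake2b(nonce, key=key, digest_size=4).digest()
    return int.from_bytes(digest, "big")


def conceal_ipv4(ipv4: str, key: bytes, nonce: bytes,
                 prefix: str = "2001:db8::") -> ipaddress.IPv6Address:
    """XOR the IPv4 address with the mask and place it in the last 32 bits
    of an IPv6 address (the prefix is a hypothetical placeholder)."""
    masked = int(ipaddress.IPv4Address(ipv4)) ^ keyed_mask(key, nonce)
    base = int(ipaddress.IPv6Address(prefix)) & ~0xFFFFFFFF
    return ipaddress.IPv6Address(base | masked)


def reveal_ipv4(ipv6: ipaddress.IPv6Address, key: bytes,
                nonce: bytes) -> ipaddress.IPv4Address:
    """Recover the original IPv4 address at the other trusted entity."""
    masked = int(ipv6) & 0xFFFFFFFF
    return ipaddress.IPv4Address(masked ^ keyed_mask(key, nonce))


key = b"pre-shared-key"          # hypothetical shared secret
nonce = b"\x00\x01\x02\x03"      # hypothetical per-packet nonce
concealed = conceal_ipv4("192.0.2.10", key, nonce)
print(concealed, "->", reveal_ipv4(concealed, key, nonce))
```

Because XOR is its own inverse, the same mask conceals and reveals the address, which keeps the per-packet work to one hash and one XOR, operations that fit the constraints of a switch pipeline.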
C.4. Privacy and Anonymity in Switch-based and Legacy Systems
Contemporary approaches that provide privacy and anonymity on the Internet use special routing overlay networks to hide the physical location of each node from other participants (e.g., Tor). Such approaches have performance limitations, as proxy servers (overlays) are maintained by volunteers and have no performance guarantees. Moreover, they often require performing advanced encryption routines to obfuscate where the packet originated (e.g., the onion routing technique used by Tor involves encapsulating messages in several layers of encryption). On the other hand, approaches that are based on programmable switches often rely on header modification and simplified encryption and hashing to conceal information (e.g., SPINE [172]).

Fig. 21. Overview of Poise [175]. A compiler translates high-level policies into P4 programs and device configurations. Context packets are continuously sent from the clients to the network, where the switches enforce the policies.
D. Access Control
D.1. Background
The selective restriction of access to digital resources is known as access control in cybersecurity. Typically, access control begins with "authentication" in order to verify the identity of a party. Afterwards, "authorization" is enforced through policies that specify access rights to resources. To authenticate parties, methods such as passwords, biometric analysis, cryptographic keys, and others are used. With respect to authorization, methods such as Access Control Lists (ACLs) are used to describe what operations are allowed on given objects.

With the advent of programmable switches, it is now possible to delegate authentication and authorization to the data plane. As a result, access can be promptly granted or denied at line rate, before traffic reaches the target server. A clear advantage of this approach is that servers are no longer busy processing access verification routines, which increases their service throughput.
D.2. Literature Review
Datta et al. [173] presented P4Guard, a P4-based configurable firewall that acts based on predefined policies set by the controller. Kang et al. [175] presented a scheme that implements context-aware security policies (see Fig. 21). The policies are applicable to enterprise and campus networks with diverse devices, i.e., Bring Your Own Device (BYOD) (e.g., laptops, mobile devices, tablets, etc.).

Almain et al. [174] proposed delegating the authentication of end hosts to the data plane. The method is based on port knocking, in which hosts deliver a sequence of packets addressed to an ordered list of closed ports. If the ports match the ones configured by the network administrators, then the end host is authenticated, and subsequent packets are allowed. Finally, Bai et al. [176] presented P40f, a tool that performs OS fingerprinting on programmable switches, and consequently applies security policies (e.g., allow, drop, redirect) at line rate.

TABLE XXXI
ACCESS CONTROL SCHEMES COMPARISON

Scheme | Goal | Strategy | Scope | Limitations | Platform (HW, SW)
[173] | Simple firewall-based access control | Translates from high-level security policies to table entries | Header-based firewall (layer 4) | Lacks NGFW capabilities | ✓
[174] | User authentication in the data plane | Uses port knocking technique for authentication | Unencrypted sequence-based authentication | Unencrypted sequence vulnerable to packet sniffing | ✓
[175] | Context-aware policies enforcement | Translates from high-level security policies to P4 programs | CAS dynamic policies based on runtime contexts | External encryptions are slow; lack of authentication | ✓
[176] | OS fingerprinting and policy enforcement | Compares TCP/IP headers to a fingerprint database file | Uses p0f to filter connections | Lack of advanced built-in actions (e.g., rate-limiting) | ✓
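The port knocking logic described above amounts to a small per-source state machine. The sketch below models it in Python as an illustration only; it is not the P4 implementation of [174]. The class name `PortKnockGate`, the example port sequence, and the choice to reset progress on any wrong knock are assumptions. In a switch, the per-source progress would typically live in registers indexed by a hash of the source address.

```python
class PortKnockGate:
    """Per-source port knocking state machine (illustrative model)."""

    def __init__(self, secret_ports):
        # Ordered list of closed ports configured by the administrator.
        self.secret_ports = tuple(secret_ports)
        self.progress = {}          # source -> number of correct knocks so far
        self.authenticated = set()  # sources that completed the sequence

    def on_packet(self, src: str, dst_port: int) -> bool:
        """Return True if the packet is allowed, False if it is dropped."""
        if src in self.authenticated:
            return True

        idx = self.progress.get(src, 0)
        if dst_port == self.secret_ports[idx]:
            idx += 1
            if idx == len(self.secret_ports):
                self.authenticated.add(src)
                self.progress.pop(src, None)
            else:
                self.progress[src] = idx
        else:
            # Wrong knock: restart, counting it if it matches the first port.
            self.progress[src] = 1 if dst_port == self.secret_ports[0] else 0

        # Knock packets target closed ports, so they are never forwarded.
        return False


gate = PortKnockGate([7000, 8000, 9000])   # hypothetical knock sequence
for port in (7000, 8000, 9000):
    gate.on_packet("10.0.0.1", port)
print(gate.on_packet("10.0.0.1", 22))      # allowed after a correct sequence
```

The model also makes the weaknesses of the approach visible: the sequence travels in cleartext, and the state is keyed only on the source address, so an observer who sniffs the knocks or spoofs the address can replay them.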
D.3. Access Control Comparison, Discussions, and Limitations
Table XXXI compares the aforementioned access control schemes. P4Guard provides access control based on table entries translated from high-level security policies. Note that P4Guard only operates up to the transport layer (e.g., source/destination IP addresses, source/destination ports, protocol, etc.), similar to a traditional firewall. While programmable switches provide increased flexibility in the parser (e.g., parsing beyond the transport layer) and the packet processing logic, P4Guard did not leverage such capabilities. It would be interesting to investigate additional capabilities such as those enabled by next-generation firewalls (NGFW).

The solution in [174] controls access by performing authentication in the data plane. The solution has several limitations since it relies on port knocking, a technique that has several security implications. For instance, programmable switches do not use cryptographic hashes, making the solution vulnerable to IP address spoofing attacks. Additionally, unencrypted port knocking is vulnerable to packet sniffing. Furthermore, port knocking relies on security through obscurity.

In [175], the scheme dynamically enforces access control to users based on contexts (e.g., if the user's device uses Secure Shell (SSH) 2.0 or higher, then the switch forwards the packets of this flow; otherwise, the switch drops the packets). The scheme requires user devices to run an application which communicates with the switch using a custom protocol (context packets). The context packets are generated on a per-flow basis. The switch tracks flows using a match-action table and registers in the data plane. The possible actions over a packet are dropping, allowing, and forwarding to other appliances for deep packet inspection. Data packets are not modified. Evaluations show that the proposed approach can operate (i.e., install new flows and update rules) with minimal latency, even under heavy DoS attacks.
On the other hand, such attacks can decimate similar SDN-based systems. One of the main drawbacks of the proposed system is the lack of authentication, integrity, and confidentiality of the context packets. Thus, the system can be subject to attacks such as snooping (i.e., eavesdropping) on communication between user devices and the switch, impersonation, and others.

Finally, [176] proposes fingerprinting OSs in the data plane. The main motivation behind this work is that software-based passive fingerprinting tools (e.g., p0f [275]) are neither practical nor sufficient with large amounts of traffic on high-speed links. Furthermore, out-of-band monitoring systems cannot promptly take actions (e.g., drop, forward, rate-limit) on traffic at line rate. The main drawback of the solution is that it lacks sophisticated policies that involve rate-limiting traffic.
D.4. Comparison between Switch-based and Server-based Access Control
Controlling access to resources often starts with authentication. While server-based approaches are more flexible in the methods of authentication they can provide, they typically require client connections to reach the server before the communication starts. In switch-based approaches, the authentication can be done in-network at the edge, eliminating unnecessary latency incurred from traversing the network and from software processing.

Access to resources can be controlled after fingerprinting the OSs of end-hosts. Software-based passive fingerprinting tools cannot keep up with high loads (gigabits/s links). The literature has shown that such tools lead to a 38% degradation in throughput [276]. Additionally, such tools are out-of-band, meaning that it is not possible to apply policies on traffic (e.g., after fingerprinting an OS). On the other hand, switch hardware is able to perform OS fingerprinting and apply security policies at line rate.

Context-aware policies applied on nodes (clients/servers) have only local visibility. A newer approach is to use a centralized SDN controller (e.g., [277]), but such a scheme is vulnerable to control plane saturation attacks and is subject to increased delay. Switch-based schemes, on the other hand, are able to provide access control at line rate.
E. Defenses
E.1. Background
DDoS attacks remain among the top security concerns despite the continuous efforts towards the development of detection and mitigation schemes. This concern is exacerbated not only by the frequency of said attacks, but also by their high volumes and rates. Recent attacks (e.g., [278, 279]) reached the order of terabits per second, a rate that existing defense mechanisms cannot keep up with.

TABLE XXXII
DEFENSES SCHEMES COMPARISON

Name & scheme | Mitigated attacks | Attack coverage (Specific, Generic) | External computations | Network-wide | Limitations | Platform (HW, SW)
NETHCF [177] | IP spoofing | ✓ | ✓ | × | Incorrect hop counts in the presence of NAT | ✓
FastFlex [178] | Availability attacks | ✓ | × | ✓ | Cross-domain federation complexity and security | ✓
[179] | Sensitivity attacks | ✓ | × | × | Limited evaluation on complex data plane systems | ✓
[180] | SIP DDoS | ✓ | ✓ | × | No support for encrypted packets (e.g., SIP/TLS) | ✓
[181] | DDoS anomalies | ✓ | × | × | Not adaptable to traffic patterns (fixed thresholds) | ✓
ML-Pushback [182] | DDoS anomalies | ✓ | ✓ | × | Depends heavily on external computations | ×, ×
[183] | SYN floods | ✓ | ✓ | × | Lack of cryptographic hash functions | ✓
Poseidon [184] | Volumetric DDoS | ✓ | ✓ | × | Human intervention for writing the defense policies | ✓
[185] | Volumetric and stealthy DDoS | ✓ | ✓ | × | Only synthetic evaluations; no extensive experimentation | ✓
NetWarden [186] | Network covert channels | ✓ | ✓ | × | Slowpath/fastpath communication latency | ✓
[187] | ECN protocol abuse | ✓ | × | × | Small subset of attack space | ✓
Ripple [188] | Link-flooding | ✓ | × | ✓ | Lack of comparison with other P4 approaches | ✓
There are two main concerns with existing defense methods handled by end-hosts or deployed as middlebox functions on x86-based servers. First, they dramatically degrade the throughput and increase latency and jitter, impacting the performance of the network. Second, they have severe consequences for the network operation when they are installed at the last mile (i.e., far from the edge).

The escalation of volumetric DDoS attacks and the lack of robust and efficient defense mechanisms motivated the idea of architecting defenses into the network. Up until recently, in-network security methods were restricted to simple access control lists encoded into the switching and routing devices. The main reason is that the data plane was fixed in function, impeding the development of customized and dynamic algorithms that can assist in detecting attacks. With the advent of programmable data planes, it is possible to develop systems that detect and mitigate various types of attacks without imposing significant overhead on the network.
E.2. Literature Review
Li et al. [177] presented NETHCF, a Hop-Count Filtering (HCF) defense mechanism that mitigates spoofed IP traffic. HCF schemes filter spoofed traffic with an IP-to-hop-count mapping table. Another attack-specific scheme proposed by Febro et al. [180] mitigates distributed SIP DDoS in the data plane. Furthermore, Scholz et al. [183, 280] presented a scheme that defends against SYN flood attacks.

Alternatively, some schemes are generic and aim at addressing multiple attacks concurrently. For instance, Xing et al. [178] proposed FastFlex, an abstraction that architects defenses into the network paths based on changing attacks. Kang et al. [179] presented an automated approach for discovering sensitivity attacks targeting data plane programs. Sensitivity attacks in this context are intelligently crafted traffic patterns that exploit the behavior of the P4 program.

Lapolli et al. [181] implemented a mechanism to perform real-time DDoS attack detection based on entropy changes. Such changes are used to compute anomaly detection thresholds. Mi et al. [182] proposed ML-Pushback, a P4-based implementation of the Pushback method [281].

Zhang et al. [184] proposed Poseidon, a system that mitigates volumetric DDoS attacks through programmable switches. It provides a language in which operators can express a range of security policies. Friday et al. [185] proposed a unified in-network DDoS detection and mitigation strategy that considers both volumetric and slow/stealthy DDoS attacks. Xing et al. [186] proposed NetWarden, a broad-spectrum defense against network covert channels in a performance-preserving manner. The method in [187] models a stateful security monitoring function as an Extended Finite State Machine (EFSM) and expresses the EFSM using P4 abstractions. Finally, Ripple [188] provides decentralized link-flooding defense against dynamic adversaries.
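As an illustration of the IP-to-hop-count mapping used by HCF schemes, the following Python sketch infers a hop count from the observed TTL and compares it against a learned value. It follows the general HCF idea of guessing the sender's initial TTL from a small set of common defaults; the set `COMMON_INITIAL_TTLS`, the class `HopCountFilter`, and the policy of letting unknown sources through to a slow path are assumptions, not details of NETHCF [177].

```python
# Assumption: a commonly cited set of default initial TTL values.
COMMON_INITIAL_TTLS = (30, 32, 60, 64, 128, 255)


def infer_hop_count(observed_ttl: int) -> int:
    """Guess the hop count as (smallest common initial TTL >= observed) - observed."""
    if not 0 <= observed_ttl <= 255:
        raise ValueError("TTL must fit in 8 bits")
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial - observed_ttl


class HopCountFilter:
    """IP-to-hop-count table with a simple match/mismatch decision."""

    def __init__(self):
        self.table = {}  # source IP -> learned hop count

    def learn(self, src_ip: str, observed_ttl: int) -> None:
        """Record the hop count seen on traffic believed to be legitimate."""
        self.table[src_ip] = infer_hop_count(observed_ttl)

    def is_legitimate(self, src_ip: str, observed_ttl: int) -> bool:
        """A packet passes if its inferred hop count matches the stored one."""
        expected = self.table.get(src_ip)
        if expected is None:
            return True  # unknown source: hypothetical policy, defer to slow path
        return infer_hop_count(observed_ttl) == expected


hcf = HopCountFilter()
hcf.learn("198.51.100.7", 118)                  # 128 - 118 = 10 hops
print(hcf.is_legitimate("198.51.100.7", 118))   # same path length
print(hcf.is_legitimate("198.51.100.7", 120))   # spoofed from a closer host
```

An attacker can forge the source address but usually cannot control how many hops the packet travels, so a mismatch between the stored and inferred hop counts is a signal of spoofing.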
E.3. Defense Schemes Comparison, Discussions, and Limitations
Table XXXII compares the aforementioned defense schemes. Broadly, defense schemes can be grouped into two main categories: attack-specific and generic. The attack-specific category consists of works that address a specific attack (e.g., NETHCF for IP spoofing, [180] for SIP DDoS, etc.), while the generic category aims at addressing various types of attacks (e.g., FastFlex for various availability attacks, Ripple for link flooding attacks, etc.).

The significant advantage of architecting defenses in the data plane is the performance improvement of the application. For instance, NETHCF is motivated by the fact that traditional HCF-based schemes are implemented on end-hosts, which delays the filtering of spoofed packets and increases the bandwidth overhead. Moreover, since traditional schemes are implemented in server-based middleboxes, low latency and minimal jitter are hard to achieve. Similarly, FastFlex advocates offloading the defenses to the data plane. Specifically, it tackles the following key challenges that are faced when programming defenses in the data plane: 1) resource multiplexing; 2) optimal placement; 3) distributed control; and 4) dynamic scaling.

When deploying defenses in the data plane, operators must be aware of the capabilities of the constrained targets. Many operations that require extensive computations cannot be easily implemented in the data plane. The existing works either approximate the computations in the data plane (considering the trade-off between computation complexity and measurement accuracy), or delegate the computations to external processors (e.g., CPU on the switch, external server, SDN controller, etc.). For instance, NETHCF decouples the HCF defense into a cache running in the data plane and a mirror in the control plane. The cache serves the legitimate packets at line rate, while the mirror processes the missed packets, maintains the IP-to-hop-count mapping table, and adjusts the state of the system based on network dynamics.
In Poseidon, the defense primitives are partitioned to be executed on switches and on servers, based on their properties. On the other hand, in [181], the authors estimated the entropies of source and destination IP addresses of incoming packets for consecutive partitions (observation windows) in the data plane, without consulting external devices.

Network-wide defenses are those that are not restricted to a single switch, and require multiple switches to co-operate in the attack detection and mitigation phases. Such co-operation significantly improves the accuracy and the promptness of the detection. More details on network-wide data plane systems are given in Section XIII-D.

Finally, Table XXXII lists some limitations of the existing schemes, which can be explored in future work to advance the state-of-the-art.

E.4. Comparison between P4-based and Traditional Defense Schemes
Network attacks such as large-scale DDoS and link flooding may have a substantial impact on the network operation. For such attacks, server-based defenses deployed at the last mile are problematic and inherently insufficient, especially when attacks target the network core. Moreover, it is not feasible to detect and mitigate a large volume of attack traffic (e.g., SYN flood) on end-hosts without impacting the throughput of the network. When defenses are architected into the network (i.e., detection and mitigation are programmed into the forwarding devices), it is easy to detect, throttle, or drop suspicious traffic at any vantage point, at line rate.
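As a concrete instance of the detection that can be programmed into forwarding devices, the windowed entropy estimation attributed to [181] in the discussion above can be modeled as follows. This Python sketch computes exact Shannon entropies per observation window and is a reference model only; a switch has no floating-point unit and would have to approximate these values, and the function names and the window size are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values) -> float:
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def window_entropies(packets, window_size: int):
    """Split (src, dst) pairs into consecutive windows and return
    (source entropy, destination entropy) for each full window."""
    results = []
    for start in range(0, len(packets) - window_size + 1, window_size):
        window = packets[start:start + window_size]
        results.append((
            shannon_entropy(src for src, _ in window),
            shannon_entropy(dst for _, dst in window),
        ))
    return results


# Hypothetical trace: one benign window, then a flood toward a single victim.
benign = [("10.0.0.%d" % i, "192.0.2.%d" % i) for i in range(4)]
flood = [("203.0.113.%d" % i, "192.0.2.1") for i in range(4)]
for h_src, h_dst in window_entropies(benign + flood, window_size=4):
    print(f"H(src)={h_src:.2f}  H(dst)={h_dst:.2f}")
```

During a distributed flood, many sources converge on few destinations, so the destination entropy drops while the source entropy stays high; comparing each window against thresholds derived from earlier windows is what turns these values into an anomaly signal.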
F. Summary and Lessons Learned
In the context of cybersecurity, a wide range of works leveraged programmable switches to achieve the following goals: 1) detect heavy hitters and apply countermeasures; 2) execute cryptographic primitives in the data plane to enable further applications; 3) protect the identity and the behavior of end-hosts, as well as obfuscate the network topology; 4) enforce access control policies in the network while considering network dynamics; and 5) architect defenses in the data plane to accelerate the detection and mitigation processes.

Identifying heavy hitters at line rate has several advantages. Recent works considered various data structures and streaming algorithms to detect heavy hitters. Future systems could explore more complex data structures that reduce the amount of state storage required on the switches. Furthermore, novel systems must minimize the false positives and the false negatives compared to both P4-based and legacy heavy hitter detection systems. Finally, new schemes should explore strategies for incremental deployment while maximizing flow visibility across the network.

There is a clear need to implement cryptographic functions (e.g., hash, encrypt, decrypt) in the data plane. Such functions can be used by various applications that require low hash collision rates (e.g., load balancing) and strong data protection. Most existing efforts delegate the complex computations to the control plane. However, recent systems have demonstrated that AES, a well-known symmetric key encryption algorithm, can be implemented in the data plane.

Another interesting line of work provided privacy and anonymity to the network. Recent efforts obfuscated the network topology in order to mitigate topology-centric attacks (e.g., LFA). Such systems must preserve the practicality of path tracing tools, while being robust against obfuscation inversion. Additionally, link failures in the physical topology should remain visible after obfuscation.
Furthermore, when randomizing identifiers to achieve session unlinkability, the identifiers must fit into the small fixed header space so that compatibility with legacy networks is preserved. Other efforts considered rewriting source information and concealing headers to protect the identity of Internet users.

Finally, access control methods and in-network defenses were proposed. Future access control schemes should explore further in-network methods to authenticate users. Additionally, since switches are capable of inspecting upper-layer headers, it is worth exploring offloading some next-generation firewall functionalities to the data plane. For instance, in [146], the authors proposed a system that allows searching for keywords in the payload of the packet. Similar techniques could be leveraged to achieve URL filtering at line rate. Additionally, schemes should mitigate stealthy DDoS attacks.

XII. NETWORK TESTING
Although programmable switches provide flexibility in defining the packet processing logic, they introduce the risk of erroneous and buggy programs. Such bugs may cause severe damage, especially when they are unexpectedly triggered in production networks. In such scenarios, the network starts experiencing a degradation in performance as well as disruption in its operation. Bugs can occur in various phases of the P4 program development workflow (e.g., in the P4 program itself, in the controller updating data plane table entries, in the target compiler, etc.). Bugs usually manifest after processing a sequence of packets with certain combinations not envisioned by the designer of the code. This section gives an overview of the troubleshooting and verification schemes for P4 programmable networks.

TABLE XXXIII
TROUBLESHOOTING SCHEMES COMPARISON

Name & scheme | Core idea | Fault detection | Memory requirements | Platform (HW, SW)
P4DB [189] | On-the-fly runtime debugging using watch, break, and next primitives | Passive | High | ✓
P4Tester [190] | Probing-based troubleshooting using BDD | Proactive | Low | ✓
[191] | Targets' behavior examination when undesired actions are triggered | N/A | N/A | ✓, ✓
[192] | Execution paths profiling using Ball-Larus encoding | Passive | Low | ✓
KeySight [193] | Probing-based troubleshooting using PEC | Proactive | Low | ✓
A. Troubleshooting
A.1. Background
Network troubleshooting has attracted intensive research interest. Previous efforts are mainly based on passive packet behavior tracking through the usage of monitoring technologies (e.g., NetSight [282], EverFlow [283]). Other techniques (e.g., Automatic Test Packet Generation (ATPG) [284]) send probing packets to proactively detect network bugs. Such techniques have two main problems. First, the number of probe packets increases exponentially as the size of the network increases. Second, the coverage is limited by the number of probe-generating servers. Despite the flexibility that programmable switches offer, writing data plane programs increases the chance of introducing bugs into the network. Programs are inevitably prone to faults which could significantly compromise the performance of the network and incur high penalty costs.
A.2. Literature Review
Zhang et al. [189] proposed P4DB, an on-the-fly runtime debugging platform. The system debugs P4 programs at three levels of visibility by provisioning operator-friendly primitives: watch, break, and next. Zhou et al. [190] proposed P4Tester, a troubleshooting system for data plane runtime faults. It generates an intermediate representation of P4 programs and table rules based on the BDD data structure. Dumitru et al. [191] examined how three different targets, BMv2, P4-NetFPGA, and Barefoot's Tofino, behave when undesired behaviors are triggered. Kodeswaran et al. [192] proposed a data plane primitive for detecting and localizing bugs as they occur in real time. Finally, Zhou et al. [193] proposed KeySight, a platform that troubleshoots programmable switches with high scalability and high coverage. It uses the Packet Equivalence Class (PEC) abstraction when generating probes.

A.3. Troubleshooting Schemes Comparison, Discussions, and Limitations
Table XXXIII compares the aforementioned troubleshooting schemes. Essentially, the schemes either passively track how packets are processed inside switches (e.g., [189, 192]) or diagnose faults by injecting probes (e.g., [190, 193]). The main limitation of passive detection is that schemes can only detect rule faults that have been triggered by existing packets, and cannot check the correctness of all table rules. On the other hand, probing-based schemes may incur large control and probe overheads.

Examples of probing-based schemes include P4Tester and KeySight. P4Tester generates an intermediate representation of P4 programs and table rules based on the BDD data structure. Afterwards, it performs an automated analysis to generate probes. Probes are sent using source routing to achieve high rule coverage while maintaining low overheads. The system was prototyped on a hardware switch (Tofino), and results show that it can check all rules efficiently and that the probe count is smaller than that of server-based probe injection systems (i.e., ATPG and Pronto).

Other schemes that use passive fault detection (e.g., P4DB) assume that packets consistently trigger the runtime bugs. P4DB debugs P4 programs at three levels of visibility by provisioning operator-friendly primitives: watch, break, and next. P4DB does not require modifying the implementation of the data plane. It was implemented and evaluated on a software switch (BMv2), and the results show that it is capable of troubleshooting runtime bugs with a small throughput penalty and little latency increase.

Another important criterion that differentiates the troubleshooting schemes is the memory footprint they require. Some schemes (e.g., P4DB) require more memory than others (e.g., KeySight), which bound their memory usage.

Finally, the work in [191] is different from the others. The authors examined how three different targets, BMv2, P4-NetFPGA, and Barefoot's Tofino, behave when undesired behaviors are triggered.
The authors first developed buggy programs in order to observe the actual behavior of targets. Then, they examined the most complex P4 program publicly available, switch.p4, and found that it can be exploited when attackers know the specifics of the implementation. In summary, the paper suggests that BMv2 leaks information from previous packets. This behavior is not observed with the other two targets. Furthermore, the authors were able to perform privilege escalation on switch.p4 due to a header intended for communication between the CPU and the P4 data plane.
A.4. Comparison Legacy vs. P4-based Debugging
In legacy networks, network devices are equipped with fixed-function services that operate on standard protocols. Troubleshooting these networks often involves testing protocols and typical data plane functions (e.g., layer-3 routing) through rigid probing. On the other hand, with programmable networks, since operators have the flexibility of defining custom data plane functions and protocols, testing is more complex and is program-dependent. Probing-based approaches should craft patterns depending on the deployed P4 program. Other approaches proposed primitives that increase the levels of visibility when debugging P4 programs. Research work extracted from the literature shows that it is essential to develop flexible mechanisms that operate dynamically on diverse P4 programs and targets.
B. Verification
B.1. Background
Program verification consists of tools and methods that ensure correctness of programs with respect to specifications and properties. Verification of P4 programs is an active area as bugs can cause faults that have drastic impacts on the performance and the security of networking systems. Static P4 verification handles programs before deployment to the network, and hence, cannot detect faults that occur at runtime. On the other hand, runtime verification uses passive measurements and proactive network testing. This section describes the major verification work pertaining to P4 programs.
B.2. Literature Review
Lopes et al. [194] proposed P4NOD, a tool that compiles P4 specifications to Datalog rules. The main motivation behind this work is that existing static checking tools (e.g., Header Space Analysis (HSA) [285], VeriFlow [286]) are not capable of handling changes to forwarding behaviors without reprogramming tool internals. The authors introduced “well-formedness” bugs, a class of bugs arising from the capability of modifying and adding headers.
Another interesting work is ASSERT-P4 [195, 196], a network verification technique that checks at compile time the correctness and the security properties of P4 programs. ASSERT-P4 offers a language with which programmers express their intended properties as assertions. After the program is annotated, symbolic execution takes place, with all the assertions being checked while the paths are tested.
Further, Liu et al. [197] proposed p4v, a practical verification tool for P4. It allows the programmer to annotate the program with Hoare logic clauses in order to perform static verification. To improve scalability, the system suggests adding assumptions about the control plane and domain-specific optimizations. The control plane interface is manually written by the programmer and is not verified, which makes it error-prone and cumbersome. The authors evaluated p4v on both open source and proprietary P4 programs (e.g., switch.p4) that have different sizes and complexities.
Nötzli et al. [198] proposed p4pktgen, a tool that automatically generates test cases for P4 programs using symbolic execution and concrete paths. The tool accepts as input a JSON representation of the P4 program (output of the p4c compiler for BMv2), and generates test cases. These test cases consist of packets, table configurations, and expected paths. Similarly, Lukács et al. [199] described a framework for verifying functional and non-functional requirements of protocols in P4.
The system translates a P4 program into a versatile symbolic formula to analyze various performance costs. The proposed approach estimates the performance cost of a P4 program prior to its execution.
Stoenescu et al. [200] proposed Vera, a symbolic execution-based verification tool for P4 programs. The authors argue that a data plane program should be verified before deployment to ensure safe operations. Vera accepts as input a P4 program, and translates it to a network verification language, SEFL. It then relies on SymNet [287], a network static analysis tool based on symbolic execution, to analyze the behavior of the resulting program. Essentially, Vera generates all possible packet layouts after inspecting the program’s parser and assumes that the header fields can accept any value. Afterwards, it tracks the paths when processing these packets in the program, following all branches to completion. For scalability improvements, Vera utilizes a novel match-forest data structure to optimize updates and verification time. Parsing/deparsing errors, invalid memory accesses, and loops, among others, can be detected by Vera.
A different approach, which uses reinforcement learning, is P4RL [201], a fuzz testing system that automatically verifies P4 switches at runtime. The authors described a query language, p4q, in which operators express their intended switch behavior. A prototype that executes verification on a layer-3 switch was implemented, and results show that P4RL detects various bugs and outperforms the baseline approach.
Finally, Dumitrescu et al. [202] proposed bf4, an end-to-end P4 program verification tool. It aims at guaranteeing that deployed P4 programs are bug-free. First, bf4 finds potential bugs at compile time. Second, it automatically generates predicates that must be followed by the controller whenever a rule is to be inserted. Third, it proposes code changes if additional bugs remain reachable.
bf4 executes a monitor at runtime that inspects the rules inserted by the controller and raises an exception whenever a predicate is not satisfied. The authors executed bf4 on various data plane programs and discovered interesting bugs that state-of-the-art approaches had not detected.
B.3. Verification Schemes Discussions
Table XXXIV compares the aforementioned verification schemes. Essentially, some schemes translate P4 programs to verification languages and engines. For instance, in [194], P4 programs are translated to Datalog to verify reachability and well-formedness. Similarly, in [197], P4 programs are converted into Guarded Command Language (GCL) models, and then the Z3 theorem prover is used to verify that several safety, architectural, and program-specific properties hold. Other schemes (e.g., p4pktgen, Vera) use symbolic execution to generate test cases for P4 programs.
The verification schemes were evaluated on different P4 programs from the literature. A program that was evaluated by most schemes is switch.p4, which implements various networking features needed for typical cloud data centers, including Layer 2/3 functionalities, ACL, QoS, etc. It is recommended that future schemes be evaluated on switch.p4 as well as other programs from the literature. Finally, P4RL detects path-related inconsistencies between the data and control planes.
TABLE XXXIV
VERIFICATION SCHEMES COMPARISON
Scheme  Name       Engine, language  Evaluated programs  Inconsistency detection
[194]   P4NOD      NOD               2                   ×
[195]   ASSERT-P4  KLEE              5                   ×
[197]   p4v        Z3                23                  ×
[198]   p4pktgen   SMT               4                   ×
[199]   N/A        Pure              0                   ×
[200]   Vera       SEFL              11                  ×
[201]   P4RL       DDQN              1                   ✓
[202]   bf4        Z3                21                  ×
B.4. P4-based and Traditional Network Verification
Traditional verification techniques that address the security properties in computer networks are mainly related to host reachability, isolation, blackholes, and loop-freedom. Techniques that check for the aforementioned properties include Anteater [288], which models the data plane as boolean functions to be used in a Boolean Satisfiability Problem (SAT) solver, NetPlumber [289], which uses header space algebra [285], and others (e.g., VeriFlow [286], DeltaNet [290], Flover [291], and VMN [292]).
Since P4 programs incorporate customized protocols and processing logic to be used in the data plane, traditional tools are not capable of handling changes to forwarding behaviors without reprogramming their internals. Therefore, verification techniques in programmable networks rely on analyzing the P4 programs themselves since they define the behavior of the data plane.
C. Summary and Lessons Learned
Network testing can generally be divided into debugging/troubleshooting network problems and verifying the behavior of forwarding devices. While traditional tools and techniques were adequate for non-programmable networks, they are insufficient for programmable ones due to their inability to handle changes to forwarding behaviors without reprogramming and restructuring their internals. A variety of works were proposed to analyze and model P4 programs in order to troubleshoot and verify the correctness of network operations.
XIII. CHALLENGES AND FUTURE TRENDS
In this section, a number of research and operational challenges that correspond to the proposed taxonomy are outlined. The challenges are extracted after comprehensively reviewing and diving into each work in the described literature. Further, the section discusses and pinpoints several initiatives for future work which could be worthy of being pursued in this imperative field of programmable switches. The challenges and the future trends are illustrated in Fig. 22.
Fig. 22. Challenges and future trends. The references represent examples of existing works that tackle the corresponding future trends. (Figure labels: data plane challenges and trends, including interoperability, arithmetic computations, network-wide cooperation, and programming simplicity and modularity; example references: [178, 179], [162], [293, 294], [295, 296], [297], [83, 91], [298–303], [293].)
A. Memory Capacity (SRAM and TCAM)
Stateful processing is a key enabler for programmable data planes as it allows applications to store and retrieve data across different packets. This advantage enabled a wide range of novel applications (e.g., in-network caching, fine-grained measurements, stateful load balancing, etc.) that were not possible in non-programmable networks. The amount of data stored in the switch is limited by the size of the on-chip memory, which ranges from tens to hundreds of megabytes at most. Consequently, the majority of stateful applications face trade-offs between performance and memory usage. For instance, the efficiency of caching, which is determined by the hit rate, is directly affected by the memory size. Furthermore, the vast majority of measurement applications require storing statistics in the data plane (e.g., byte/packet counters). The number of flows to be measured and the richness of measurement information are bound by the size of the memory in the switch.
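The dependence of the cache hit rate on memory size can be illustrated with a small simulation. The following is a minimal sketch, not taken from any surveyed system: it assumes an LRU replacement policy and a synthetic, deterministic skewed workload, and the names `lru_hit_rate` and `skewed_trace` are hypothetical helpers introduced only for this example.

```python
from collections import OrderedDict


def lru_hit_rate(requests, capacity):
    """Replay `requests` through an LRU cache holding `capacity` >= 1 entries
    and return the fraction of requests served from the cache."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return hits / len(requests)


def skewed_trace(num_keys=64, rounds=50):
    """Synthetic workload (an assumption): in every round, key k is requested
    num_keys // (k + 1) times, so low-numbered keys are much more popular."""
    trace = []
    for _ in range(rounds):
        for k in range(num_keys):
            trace.extend([k] * (num_keys // (k + 1)))
    return trace


if __name__ == "__main__":
    trace = skewed_trace()
    for capacity in (4, 16, 64):
        print(capacity, round(lru_hit_rate(trace, capacity), 3))
```

Under these assumptions the hit rate never decreases as the capacity grows, which mirrors the trade-off described above: a switch with less memory for cached items serves fewer requests directly.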
Current and Future Initiatives.
A notable work by Kim et al. [295, 296] suggests accessing remote Dynamic Random Access Memory (DRAM) installed on data center servers purely from the data plane to expand the available memory on the switch. The bandwidth of the chip is traded for the bandwidth needed to access the external DRAM. The approach is cheap and flexible since it reuses existing resources in commodity hardware without adding infrastructure costs. The system is realized by allowing the data plane to access remote memory through an access channel (RDMA over Converged Ethernet (RoCE)) as shown in Fig. 23. The implementation shows that the proposal achieves throughput close to the line rate, and incurs only 1-2 microseconds of extra latency (Fig. 24). There are some limitations in this approach that can be explored in the future.
• The current implementation only supports address-based memory access, and hence, complicated data layouts and ternary matching in remote memory should be explored.
• Frequent updates in the remote memory require several packets for fetching and adding. This is common in measurement applications where counters are continuously incremented. A possible solution to the bandwidth overhead is aggregating updates into a single operation. This comes at the cost of delays in the updates.
• Packet loss between the switch and the remote memory should be handled; otherwise, the performance of the application and the freshness of the remote values might be affected.
• The interaction between general data plane applications and the remote memory is challenging. A potential improvement is designing well-defined APIs to facilitate the interaction.
Fig. 23. Expanding switch memory by leveraging remote DRAM on commodity servers [295]. (Figure labels: ASIC; RDMA/RoCE; general-purpose DRAM pool on commodity servers; remote table servers, remote state stores, remote buffer servers.)
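The idea of aggregating counter updates before writing them to remote memory, mentioned in the limitations above, can be sketched as follows. This is an illustrative model only and not the mechanism of [295, 296]: the class `BatchedRemoteCounters`, its byte-based flush threshold, and the dictionary standing in for remote DRAM are all assumptions made for the example.

```python
class BatchedRemoteCounters:
    """Accumulate per-flow increments locally and push them to the (simulated)
    remote memory as one update once enough bytes have been collected."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # bytes to collect before flushing
        self.pending = {}     # local, not yet written increments per flow
        self.remote = {}      # stand-in for counters kept in remote DRAM
        self.remote_ops = 0   # number of remote update operations issued

    def add(self, flow, nbytes):
        self.pending[flow] = self.pending.get(flow, 0) + nbytes
        if self.pending[flow] >= self.flush_threshold:
            self.flush(flow)

    def flush(self, flow):
        delta = self.pending.pop(flow, 0)
        if delta:
            self.remote[flow] = self.remote.get(flow, 0) + delta
            self.remote_ops += 1

    def flush_all(self):
        for flow in list(self.pending):
            self.flush(flow)


if __name__ == "__main__":
    counters = BatchedRemoteCounters(flush_threshold=1000)
    for _ in range(10):
        counters.add("flow-1", 150)
    print(counters.remote, counters.pending, counters.remote_ops)
```

In this sketch, ten packets produce a single remote operation instead of ten, while the bytes still held in `pending` are exactly the staleness that the text identifies as the cost of aggregation.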
B. Resources Accessibility
Besides the size limitation of the on-chip memory, there are other restrictions that data plane developers should take into account [297, 304]. First, since the table memory is local to each stage in the pipeline, a stage cannot reclaim non-utilized memory in other stages. As a result, memory and match/action processing are fused, making the placement of tables challenging. Second, the sequential execution of operations in the pipeline leads to poor utilization of resources, especially when the matches and the actions are imbalanced (i.e., the presence of default actions that do not need a match).
Current and Future Initiatives.
An interesting work by Chole et al. [297] explored the idea of disaggregating the memory and compute resources of a programmable switch. The main notion of this work is to centralize the memory as a pool that is accessed through a crossbar. By doing so, each pipeline stage no longer has local memory. Additionally, this work solves the sequential execution limitation by creating a cluster of processors used to execute operations in any order. The main limitation of this approach is the lack of adoption by hardware vendors. Most switch vendors (e.g., Cavium’s XPliant and Barefoot’s Tofino) do not implement the disaggregation model and follow the regular Reconfigurable Match-action Tables (RMT) model. The implementation and analysis of the disaggregation model on hardware targets should be explored in the future.
Fig. 24. Accessing remote DRAM latency overhead. Achieved throughput close to the line rate.
C. Arithmetic Computations
There are several challenges that must be handled when dealing with arithmetic computations in the data plane. First, programmable switches support a small set of simple arithmetic computations that operate on non-floating point values. Second, only a few operations are supported per packet to guarantee execution at line rate. Typically, a packet should only spend tens of nanoseconds in the processing pipeline. Third, computations in the data plane consume significant hardware resources, hampering the possibility of other programs executing concurrently. A wide range of applications suffer from the lack of complex computations in the data plane. For instance, some operations required by AQMs (e.g., the square root function in the CoDel algorithm) are complex to implement in P4. Additionally, the majority of machine learning frameworks and models operate on floating point values while the supported arithmetic operations on the switch operate on integer values. In-network aggregation of model updates requires calculating the average over a set of floating-point vectors.
Current and Future Initiatives.
Existing methods to overcome the computation limitations include approximation and pre-computation. In the approximation method, the application designer relies on the small set of supported operations to approximate the desired value, at the cost of sacrificing precision. For example, approximating the square root function can be achieved by counting the number of leading zeros through longest prefix match [91]. It would be beneficial for P4 developers to have access to a community-maintained library which encompasses P4 code that approximates various complex functions. In the pre-computation method, values are computed by the control plane (e.g., switch CPU) and stored in match-action tables or registers. Future work can explore methods that automatically identify the complex computations that can be pre-evaluated in the control plane. After identification, the data plane code and its corresponding control plane APIs can be automatically generated.
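To make the approximation and pre-computation ideas above concrete, the sketch below approximates an integer square root from the number of leading zeros and stores the results in a table indexed by that count, the way a control plane could pre-populate a longest-prefix-match table. It is an illustration under stated assumptions and not necessarily the construction used in [91]: the 32-bit width, the power-of-two output, and the names `SQRT_TABLE`, `leading_zeros`, and `approx_sqrt` are all choices made for this example.

```python
WIDTH = 32  # assumed bit width of the value being processed


def leading_zeros(x, width=WIDTH):
    """Number of leading zero bits of x when written with `width` bits.
    A longest-prefix-match lookup on x yields the same information."""
    if not 0 < x < (1 << width):
        raise ValueError("x must be in the range 1 .. 2**width - 1")
    return width - x.bit_length()


# Pre-computed table, as a control plane could install it: one entry per
# leading-zero count, holding 2**(floor(log2(x)) // 2).
SQRT_TABLE = {lz: 1 << ((WIDTH - 1 - lz) // 2) for lz in range(WIDTH)}


def approx_sqrt(x):
    """Power-of-two approximation a of sqrt(x), with a <= sqrt(x) < 2 * a."""
    if x == 0:
        return 0
    return SQRT_TABLE[leading_zeros(x)]


if __name__ == "__main__":
    for value in (1, 16, 100, 1_000_000):
        print(value, approx_sqrt(value))
```

The per-packet work reduces to one table lookup, and the price is precision: the result can be off by up to a factor of two, which illustrates the accuracy sacrificed by approximation methods.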
D. Network-wide Cooperation
The SDN architecture suggests using a centralized controller for network-wide switch management. Through centralization, the state of each programmable switch can be shared with other switches. Consequently, applications will have the ability to make better decisions as network-wide data is available locally on the switch. The problem with such an architecture is the requirement of having a continuous exchange of packets with a software-based system. As an alternative, switches can exchange messages to synchronize their states in a decentralized manner.
Fig. 25. (a) Local detection of DDoS attacks. (b) Network-wide detection of DDoS attacks. (In (a), each switch compares only its own count for sender A against the threshold T; in (b), the sum of the counts across switches is compared against T.)
Consider Fig. 25, which shows an in-network DDoS defense solution. Each switch maintains a list of senders and their corresponding byte counts. A switch compares the number of bytes transmitted by a given flow to a threshold. When the threshold is crossed, the flow is blocked and the device is identified as a malicious DDoS sender. Assume that the network implements a load balancing mechanism that distributes traffic across the switches. In the scenario where switches do not consider the byte counts of other switches (Fig. 25 (a)), the traffic of a DDoS device might remain under the threshold. On the other hand, when switches synchronize their states by sharing the byte counts (Fig. 25 (b)), the total number of bytes is compared against the threshold. Consequently, the total load of a DDoS device is considered. This example demonstrates an application that heavily depends on network-wide cooperation and hence motivates the need for state synchronization.
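The difference between the two scenarios of Fig. 25 can be shown with a few lines of code. The sketch below is a simplified model, not an implementation from the surveyed works: the per-switch counters are plain dictionaries, and the threshold value, the sender names, and the function names are assumptions made for illustration.

```python
def local_detection(per_switch_counts, threshold):
    """Senders flagged when each switch looks only at its own byte counters."""
    flagged = set()
    for counts in per_switch_counts:
        for sender, nbytes in counts.items():
            if nbytes > threshold:
                flagged.add(sender)
    return flagged


def network_wide_detection(per_switch_counts, threshold):
    """Senders flagged when switches share counters and compare the total."""
    totals = {}
    for counts in per_switch_counts:
        for sender, nbytes in counts.items():
            totals[sender] = totals.get(sender, 0) + nbytes
    return {sender for sender, total in totals.items() if total > threshold}


if __name__ == "__main__":
    threshold = 1000  # assumed value of T in bytes
    # Load balancing splits the traffic of sender A across two switches.
    s1 = {"A": 600, "B": 100}
    s2 = {"A": 700, "B": 50}
    print(local_detection([s1, s2], threshold))
    print(network_wide_detection([s1, s2], threshold))
```

With these example counts, sender A stays below the threshold on each switch individually but exceeds it once the counts are summed, which is the case that state synchronization is meant to catch.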
Current and Future Initiatives.
Arashloo et al. [298] proposed SNAP, a centralized stateful programming model that aims at solving the synchronization problem. SNAP introduced the idea of writing programs for “one big switch” instead of many. Essentially, developers write stateful applications without caring about the distribution, placement, and optimization of access to resources. SNAP is limited to one replica of each state in the network. Sviridov et al. [299, 300] proposed LODGE and LOADER to extend SNAP and enable multiple replicas. Luo et al. [301] proposed Swing State, a framework for runtime state migration and management. This approach leverages existing traffic to piggyback state updates between cooperating switches. Swing State overcomes the challenges of the SDN-based architecture by synchronizing the states entirely in the data plane, at line rate, and without intervention from the control plane. There are several limitations with this approach. First, there are no message delivery guarantees (i.e., dropped or reordered packets are not retransmitted), leading to inconsistency in the states among the switches. Second, it does not merge the states if two switches share common states. Third, the overhead can significantly increase if a single state is mirrored several times. Finally, there is no authentication of data or senders. Xing et al. [302] proposed P4Sync, a system that migrates states between switches in the data plane while guaranteeing the authenticity of the senders and the exchanged data. P4Sync addresses the limitations of existing approaches. It guarantees the completeness of the migration, ensuring that the snapshot transfer is completed. Moreover, it solves the overhead of repeatedly retransmitted updates. An interesting aspect of P4Sync is its ability to control the migration traffic rate depending on the changing network conditions. Zeno et al. [303] presented a design of SwiShmem, a management layer that facilitates the deployment of network functions (NFs) on multiple switches by managing the distributed shared states.
Future work in this area should consider handling frequent state migrations. Some systems require migration packets to be generated each RTT, causing increased traffic overhead and additional expensive authentication operations. For instance, P4Sync uses public key cryptography in the control plane to sign and verify the end of the migration sequence chain (2.15 ms for signing and 0.07 ms for verifying with an RSA-2048 signature). Frequent migrations would cause this signature to be computed repeatedly. Another major concern that should be handled in future work is denial of service. Even when migration updates are authenticated, changes in the packets cause the receiver to reject updates, leading to state inconsistency among switches.
E. Control Plane Intervention
Delegating tasks to the control plane incurs latency and affects the application’s performance. For instance, in congestion control, rerouting-based schemes often use tables to store alternative routes. Since the data plane cannot directly modify table entries, intervention from the control plane is required. The interaction with the control plane in this application hampers the promptness of rerouting. Another example is methods that use collision-free hashing. For example, cuckoo hashing [305], which rearranges items to solve collisions, uses a complex search algorithm that cannot run on the switch ASIC, and is often executed on the switch CPU. Ideally, the control plane intervention should be minimized when possible. For example, to synchronize the state among switches, in-network cooperation should be considered.
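The cuckoo hashing example can be made concrete with a short sketch. It is a textbook-style two-table variant written for illustration, not the implementation of any surveyed system: the integer keys, the two simple hash functions, the `max_kicks` bound, and the `stash` for a key that could not be placed are all assumptions of this example.

```python
class CuckooTable:
    """Two-table cuckoo hash set for non-negative integer keys."""

    def __init__(self, size, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks
        self.slots = [[None] * size, [None] * size]
        self.kicks = 0    # evictions performed so far
        self.stash = []   # keys left homeless; a real system would rehash

    def _index(self, which, key):
        # Two simple, deterministic hash functions (assumed for the example).
        if which == 0:
            return key % self.size
        return (key // self.size) % self.size

    def lookup(self, key):
        # Constant work: exactly two probes, one per table.
        return any(self.slots[w][self._index(w, key)] == key for w in (0, 1))

    def insert(self, key):
        # Data-dependent loop: each eviction moves a key to its other table.
        if self.lookup(key):
            return True
        which = 0
        for _ in range(self.max_kicks):
            idx = self._index(which, key)
            key, self.slots[which][idx] = self.slots[which][idx], key
            if key is None:
                return True
            self.kicks += 1
            which = 1 - which
        self.stash.append(key)
        return False


if __name__ == "__main__":
    table = CuckooTable(size=4)
    for k in (1, 5, 9):
        table.insert(k)
    print(table.lookup(5), table.lookup(13), table.kicks)
```

The sketch shows why the work is typically split: a lookup needs two fixed probes, whereas an insertion may trigger a chain of evictions whose length depends on the stored data, which is the kind of loop that the text describes as being executed on the switch CPU.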
Current and Future Initiatives.
The design of the interaction between the control plane and the data plane is fully decided by the developer. Experienced developers might have enough background to immediately minimize such interaction. Future work should devise algorithms and tools that automatically detect excessive interaction between the control and data planes, and suggest alternative workflows (ideally, as generated code) to minimize such interaction.
F. Security
When designing a system for the data plane, the developer must envision the kind of traffic a malicious user can initiate to corrupt the operation of the system. This class of attacks is referred to as sensitivity attacks, a term coined in [179]. Essentially, an attacker can intelligently craft traffic patterns to trigger unexpected behaviors of a system in the data plane. For instance, a load balancer that balances traffic through packet header hashing without cryptographic support (e.g., a modulo operator on the number of available paths) can be tricked by an attacker that crafts skewed traffic patterns. This results in traffic being forwarded to a single path, leading to congestion, link saturation, and denial of service. Another example is attacks against in-network caching. Caching in the data plane performs well when requests are mostly reads rather than writes. If an attacker continuously generates highly skewed write requests, the load on the storage servers would be imbalanced. If the system is designed to handle write queries on hot items in the switch, a random failure in the switch causes data to be lost. Further, an attacker can also exploit the memory limitation of the switch and request diverse values, causing the pre-cached values to be evicted.
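The load balancer example above can be reproduced in a few lines. The sketch assumes a balancer that picks a path as CRC32 of a flow identifier modulo the number of paths; this hash, the 4-byte flow identifiers, and the function names are illustrative assumptions and are not taken from [179] or from any specific switch.

```python
import zlib


def pick_path(flow_id: bytes, num_paths: int) -> int:
    """Non-cryptographic path selection: CRC32 of the flow id modulo paths."""
    return zlib.crc32(flow_id) % num_paths


def craft_colliding_flows(target_path, num_paths, how_many):
    """Enumerate 4-byte flow ids and keep those that map to `target_path`.
    An attacker who knows the hash can do this offline."""
    flows = []
    candidate = 0
    while len(flows) < how_many:
        flow_id = candidate.to_bytes(4, "big")
        if pick_path(flow_id, num_paths) == target_path:
            flows.append(flow_id)
        candidate += 1
    return flows


if __name__ == "__main__":
    flows = craft_colliding_flows(target_path=0, num_paths=4, how_many=100)
    loads = [0] * 4
    for flow_id in flows:
        loads[pick_path(flow_id, 4)] += 1
    print(loads)
```

All crafted flows land on the same path while the other paths stay idle, which is the skewed pattern described above. A keyed hash unknown to the attacker would make this offline search ineffective, at the cost of extra switch resources.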
Current and Future Initiatives.
To mitigate sensitivity attacks, a developer attempts to discover various unpredicted traffic patterns, and accordingly, develops defense strategies. Such a solution is highly unreliable, time consuming, and error-prone. Recent efforts [179] aimed at automatically discovering sensitivity attacks in the data plane. Essentially, the proposed system aims at deriving traffic patterns that would drive the program away from common case behavior as much as possible. Other efforts focused on architecting defenses in the data plane that perform distributed mode changes upon attack discovery [178]. Future work in this direction should consider achieving high assurance by formally verifying the code. Additionally, the stability of the data plane should be carefully handled with fast mode changes; future work could consider integrating self-stabilizing systems for such purpose. Finally, future work should provide security interfaces for collaborating switches that belong to different domains. It is also worth exposing sensitivity attack patterns for different application types so that data plane developers can avoid the vulnerabilities that trigger those attacks in their code.
G. Interoperability
Programmable switches pave the way for a wide range of innovative in-network applications. The literature has shown that significant performance improvements are brought when applications offload their processing logic to the network. Despite such facts, it is very unlikely that mobile operators will replace their current infrastructure with programmable switches in one shot. This unlikelihood comes from the fact that major operational and budgeting costs will be incurred.
Current and Future Initiatives.
Network operators might deploy programmable switches in an incremental fashion. That is, P4 switches will be added to the network alongside the existing legacy devices. While this solution seems simple at first, studies have shown that partial deployment leads to reduced effectiveness [162]. For instance, the accuracy of heavy hitter detection schemes is strongly affected by the flow visibility. The work in [162] devised a greedy algorithm that attempts to strategically position P4 switches in the network, with the goal of monitoring as many distinct network flows as possible. The F1 score is used to quantify the correctness of switch placement. Future work in this area should consider generalizing and enhancing this approach to work with any P4 application, and not only heavy hitter detection. For instance, future work could suggest the positioning of P4 switches for applications such as in-network caching, accelerated consensus, and in-network defenses, while taking into account the current topology consisting of legacy devices.
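A greedy placement strategy of the kind mentioned above can be sketched as a standard maximum-coverage heuristic. This is not the algorithm of [162], whose details are not reproduced in this section; the input format (a mapping from candidate node to the set of flow identifiers traversing it), the node names, and the tie-breaking by node name are assumptions made for the example.

```python
def greedy_placement(flows_through, budget):
    """Pick up to `budget` nodes, each time choosing the node that adds the
    largest number of not yet monitored flows (ties broken by node name)."""
    chosen = []
    covered = set()
    candidates = dict(flows_through)
    for _ in range(budget):
        best = max(
            sorted(candidates),
            key=lambda node: len(candidates[node] - covered),
            default=None,
        )
        if best is None or not (candidates[best] - covered):
            break  # no remaining node adds any new flow
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered


if __name__ == "__main__":
    # Hypothetical topology: which flows traverse which legacy switch.
    flows_through = {
        "s1": {1, 2, 3},
        "s2": {3, 4},
        "s3": {4, 5, 6, 7},
        "s4": {1, 7},
    }
    print(greedy_placement(flows_through, budget=2))
```

In this toy topology, replacing two of the four switches is already enough to observe every flow. Extending the objective beyond flow visibility, for example to cache hit rate or consensus latency, is the generalization that the text proposes as future work.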
H. Programming Simplicity and Modularity
Writing in-network applications in the P4 language is not an easy task. Recent studies have shown that many existing P4 programs have several bugs that might lead to network disruption [191]. For several decades, the networking industry operated in a bottom-up approach, where switches are equipped with fixed-function ASICs. Consequently, little to no programming skills were needed by network operators. With the advent of programmable switches, operators are now expected to have experience in programming the ASIC.
Current and Future Initiatives.
Since programming the ASIC is not a straightforward task, future research endeavours should consider simplifying the programming workflow for the operators and generating code (e.g., [293]). For instance, graphical tools can be developed to translate workflows (e.g., flowcharts) to P4 programs that can fit into the hardware. Further, future work should develop tools that allow operators to enable features (i.e., program modules) that will translate to P4 programs. As an analogy, consider the mobile application stores (e.g., Play store, Apple store). The user simply downloads and installs an application on the device, without having to understand anything about programming. An interesting work could investigate the idea of creating a store for P4 applications where operators select the “apps” they want to activate, and the result is a generated P4 program optimized to fit in the hardware, considering the different targets available in the market today (e.g., Tofino). Recent efforts attempted to merge and test modular programs in P4 [294].
XIV. CONCLUSIONS
This article presents an exhaustive survey on programmable data planes. The survey describes the evolution of networking by discussing the traditional control plane and the transition to SDN. Afterwards, the survey motivates the need for programming the data plane and delves into the general architecture of a programmable switch (PISA). A brief description of P4, the de facto language for programming the data plane, is presented. Motivated by the increasing trend in programming the data plane, the survey provides a taxonomy that sheds light on numerous significant works and compares schemes within each category in the taxonomy and with those in legacy approaches. The survey concludes by discussing challenges and considerations as well as various future trends and initiatives.
Footnote: Note that most vendors (e.g., Barefoot Networks) provide a program (switch.p4) that expresses the forwarding plane of a switch, with the typical features of an advanced layer-2 and layer-3 switch. If the goal is to simply deploy a switch with no in-network applications, then the operators are not required to program the chip. They just need to learn the interaction between the control plane and the data plane (e.g., to populate table entries).
ACKNOWLEDGEMENT
This material is based upon work supported by the National Science Foundation under grant numbers 1925484 and 1829698, funded by the Office of Advanced Cyberinfrastructure (OAC).
ABBREVIATIONS USED IN THIS ARTICLE
Abbreviation  Term
ABR  Adaptive Bit Rate
ACK  Acknowledgement
ACL  Access Control List
AFQ  Approximate Fair Queueing
AIMD  Additive Increase Multiplicative Decrease
ALU  Arithmetic Logical Unit
API  Application Programming Interface
AQM  Active Queue Management
AS  Autonomous System
ASIC  Application-specific Integrated Circuit
ATPG  Automatic Test Packet Generation
ATT  Attribute Protocol
BBR  Bottleneck Bandwidth and Round-trip Time
BDD  Binary Decision Diagram
BFT  Byzantine Fault Tolerance
BGP  Border Gateway Protocol
BIER  Bit Index Explicit Replication
BLE  Bluetooth Low Energy
BLESS  Bluetooth Low Energy Service Switch
BMv2  Behavioral Model Version 2
BNN  Binary Neural Network
BQPS  Billion Queries Per Second
BYOD  Bring Your Own Device
CAIDA  Center of Applied Internet Data Analysis
CC  Congestion Control
CNN  Convolutional Neural Network
CoDel  Controlled Delay
CPU  Central Processing Unit
CRC  Cyclic Redundancy Check
CWND  Congestion Window
DCQCN  Data Center Quantized Congestion Notification
DCTCP  Data Center Transmission Control Protocol
DDoS  Distributed Denial-of-Service
DIP  Direct Internet Protocol
DMA  Direct Memory Access
DMZ  Demilitarized Zone
DNS  Domain Name Server
DPDK  Data Plane Development Kit
DRAM  Dynamic Random Access Memory
DSP  Digital Signal Processors
ECMP  Equal-Cost Multi-Path Routing
ECN  Explicit Congestion Notification
ESP  Encapsulating Security Payload
FAST  Flow-level State Transitions
FCT  Flow Completion Time
FIB  Forwarding Information Base
FPGA  Field-programmable Gate Array
FQ  Fair Queueing
GPU  Graphics Processing Unit
GRE  Generic Routing Encapsulation
HCF  Hop-Count Filtering
HSA  Header Space Analysis
HTCP  Hamilton Transmission Control Protocol
HTTP  Hypertext Transfer Protocol
IDS  Intrusion Detection System
IGMP  Internet Group Management Protocol
IKE  Internet Key Exchange
ILP  Integer Linear Programming
INT  In-band Network Telemetry
IoT  Internet of Things
IP  Internet Protocol
ISP  Internet Service Provider
JSON  JavaScript Object Notation
KDN  Knowledge-defined Networking
KPI  Key Performance Indicator
LAN  Local Area Network
LFA  Link Flooding Attack
LPM  Longest Prefix Match
LPWAN  Low Power Wide Area Network
LTE  Long Term Evolution
MAC  Medium Access Control
MAU  Match-Action Unit
MCM  Multicolor Markers
MIMD  Multiplicative Increase Multiplicative Decrease
ML  Machine Learning
MOS  Mean Opinion Score
MPC  Mobile Packet Core
MQTT  Message Queueing Telemetry Transport
MSS  Maximum Segment Size
MPTCP  Multipath Transmission Control Protocol
MTU  Maximum Transmission Unit
NACK  Negative Acknowledgement
NAT  Network Address Translation
NDA  Non-disclosure Agreement
NDN  Named Data Networking
NFV  Network Functions Virtualization
NIC  Network Interface Controller
NN  Neural Networks
NSH  Network Service Header
ONOS  Open Network Operating System
OSPF  Open Shortest Path First
OUM  Ordered Unreliable Multicast
OVS  Open Virtual Switch
P2P  Peer-to-peer
PBT  Postcard-Based Telemetry
PCC  Performance-oriented Congestion Control
PCC  Per-Connection Consistency
PD  Program Dependent
PGW  Packet Data Network Gateway
PI  Protocol Independent
PIE  Proportional Integral Controller Enhanced
PISA  Protocol Independent Switch Architecture
QoE  Quality of Experience
QoS  Quality of Service
RAM  Random-Access Memory
RDMA  Remote Direct Memory Access
RED  Random Early Detection
REST  Representational State Transfer
RFC  Request for Comments
RMT  Reconfigurable Match-action Tables
RSA  Rivest-Shamir-Adleman
RSS  Really Simple Syndication
RTT  Round-trip Time
RWND  Receiver Window
SAD  Security Association Database
SAT  Boolean Satisfiability Problem
SDN  Software Defined Networking
SHA  Secure Hash Algorithm
SIP  Session Initiation Protocol
SLA  Service Level Agreement
SNMP  Simple Network Management Protocol
SPD  Security Policy Database
SRAM  Static Random-Access Memory
SSH  Secure Shell
TCAM  Ternary Content-Addressable Memory
TCP  Transmission Control Protocol
TM  Traffic Management
ToR  The Onion Router
TPU  Tensor Processing Unit
TTL  Time-to-Live
UDP  User Datagram Protocol
UE  User Equipment
VIP  Virtual Internet Protocol
VMN  Verifying Mutable Networks
VN  Virtual Network
VoLTE  Voice over Long-term Evolution
VXLAN  Virtual eXtensible Local Area Network
WAN  Wide Area Network
XDP  eXpress Data Path
REFERENCES
[4] G. Papastergiou, G. Fairhurst, D. Ros, A. Brunstrom, K.-J. Grinnemo, P. Hurtig, N. Khademi, M. Tüxen, M. Welzl, D. Damjanovic, and S. Mangiante, “De-ossifying the Internet transport layer: a survey and future perspectives,”
ACMSIGCOMM , 2015.[208] C. Hopps et al. , “Analysis of an equal-cost multi-path algorithm,” tech.rep., RFC 2992, November, 2000.[209] S. Sinha, S. Kandula, and D. Katabi, “Harnessing TCP’s burstinesswith flowlet switching,” in
Proc. 3rd ACM Workshop on Hot Topics inNetworks (Hotnets-III) , Citeseer, 2004.[210] C. Kim, P. Bhide, E. Doe, H. Holbrook, A. Ghanwani, D. Daly,M. Hira, and B. Davie, “In-band network telemetry (INT),” technicalspecification , 2016.[211] M. A. Vieira, M. S. Castanho, R. D. Pacífico, E. R. Santos, E. P. C.Júnior, and L. F. Vieira, “Fast packet processing with eBPF and XDP:concepts, code, challenges, and applications,”
ACM Computing Surveys(CSUR) , vol. 53, no. 1, pp. 1–36, 2020.[212] J. Crichigno, E. Bou-Harb, and N. Ghani, “A comprehensive tutorialon science DMZ,”
IEEE Communications Surveys & Tutorials , vol. 21,no. 2, pp. 2041–2078, 2018.[213] J. F. Kurose and K. W. Ross, “Computer networking a top downapproach featuring the intel,” 2016.[214] S. Ha, I. Rhee, and L. Xu, “CUBIC: a new TCP-friendly high-speedTCP variant,”
ACM SIGOPS operating systems review , vol. 42, no. 5,pp. 64–74, 2008.[215] D. Leith and R. Shorten, “H-TCP: TCP congestion control forhigh bandwidth-delay product paths,” draft-leith-tcp-htcp-06 (work inprogress) , 2008.[216] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson,“BBR: congestion-based congestion control,”
Communications of theACM , vol. 60, no. 2, pp. 58–66, 2017.[217] S. Floyd, “TCP and explicit congestion notification,”
ACM SIGCOMMComputer Communication Review , vol. 24, no. 5, pp. 8–23, 1994.[218] R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi,A. Vahdat, Y. Wang, D. Wetherall, and D. Zats, “TIMELY: RTT-basedcongestion control for the data center,”
ACM SIGCOMM Computer Communication Review , vol. 45, no. 4, pp. 537–550, 2015.[219] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron,J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang, “Congestion controlfor large-scale RDMA deployments,”
ACM SIGCOMM ComputerCommunication Review , vol. 45, no. 4, pp. 523–536, 2015.[220] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prab-hakar, S. Sengupta, and M. Sridharan, “Data Center TCP (DCTCP),”in
Proceedings of the ACM SIGCOMM 2010 conference , pp. 63–74,2010.[221] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar,and S. Shenker, “pFabric: minimal near-optimal datacenter transport,”
ACM SIGCOMM Computer Communication Review , vol. 43, no. 4,pp. 435–446, 2013.[222] M. Dong, Q. Li, D. Zarchy, P. B. Godfrey, and M. Schapira, “ { PCC } :Re-architecting congestion control for consistent high performance,”in { USENIX } Symposium on Networked Systems Design andImplementation (NSDI) , pp. 395–408, 2015.[223] A. Langley, A. Riddoch, A. Wilk, A. Vicente, C. Krasic, D. Zhang,F. Yang, F. Kouranov, I. Swett, J. Iyengar, et al. , “The QUIC transportprotocol: design and Internet-scale deployment,” in
Proceedings of theConference of the ACM Special Interest Group on Data Communica-tion , pp. 183–196, 2017.[224] P. Cheng, F. Ren, R. Shu, and C. Lin, “Catch the whole lot in an action:rapid precise packet loss notification in data center,” in { USENIX } Symposium on Networked Systems Design and Implementation (NSDI) ,pp. 17–28, 2014.[225] A. Ramachandran, S. Seetharaman, N. Feamster, and V. Vazirani, “Fastmonitoring of traffic subpopulations,” in
Proceedings of the 8th ACMSIGCOMM conference on Internet measurement , pp. 257–270, 2008.[226] N. Alon, Y. Matias, and M. Szegedy, “The space complexity ofapproximating the frequency moments,”
Journal of Computer andsystem sciences , vol. 58, no. 1, pp. 137–147, 1999.[227] V. Braverman and R. Ostrovsky, “Zero-one frequency laws,” in
Pro-ceedings of the forty-second ACM symposium on Theory of computing ,pp. 281–290, 2010.[228] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent itemsin data streams,” in
International Colloquium on Automata, Languages,and Programming , pp. 693–703, Springer, 2002.[229] G. Cormode and S. Muthukrishnan, “An improved data stream sum-mary: the count-min sketch and its applications,”
Journal of Algorithms ,vol. 55, no. 1, pp. 58–75, 2005.[230] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining streamstatistics over sliding windows,”
SIAM journal on computing , vol. 31,no. 6, pp. 1794–1813, 2002.[231] S. Floyd and V. Jacobson, “Random early detection gateways forcongestion avoidance,”
IEEE/ACM Transactions on networking , vol. 1,no. 4, pp. 397–413, 1993.[232] P. Flajolet, D. Gardy, and L. Thimonier, “Birthday paradox, couponcollectors, caching algorithms and self-organizing search,”
DiscreteApplied Mathematics , vol. 39, no. 3, pp. 207–229, 1992.[233] R. Dolby, “Noise reduction systems,” Nov. 5 1974. US Patent3,846,719.[234] S. V. Vaseghi,
Advanced digital signal processing and noise reduction .John Wiley & Sons, 2008.[235] J. Gettys, “Bufferbloat: dark buffers in the Internet,”
IEEE InternetComputing , no. 3, p. 96, 2011.[236] M. Allman, “Comments on bufferbloat,”
ACM SIGCOMM ComputerCommunication Review , vol. 43, no. 1, pp. 30–37, 2013.[237] Y. Gong, D. Rossi, C. Testa, S. Valenti, and M. D. Täht, “Fighting thebufferbloat: on the coexistence of AQM and low priority congestioncontrol,”
Computer Networks , vol. 65, pp. 255–267, 2014.[238] C. Staff, “Bufferbloat: what’s wrong with the Internet?,”
Communica-tions of the ACM , vol. 55, no. 2, pp. 40–47, 2012.[239] V. G. Cerf, “Bufferbloat and other internet challenges,”
IEEE InternetComputing , vol. 18, no. 5, pp. 80–80, 2014.[240] F. Schwarzkopf, S. Veith, and M. Menth, “Performance analysis ofCoDel and PIE for saturated TCP sources,” in , vol. 1, pp. 175–183, IEEE, 2016.[241] A. Mushtaq, R. Mittal, J. McCauley, M. Alizadeh, S. Ratnasamy,and S. Shenker, “Datacenter congestion control: identifying what isessential and making it practical,”
ACM SIGCOMM Computer Com-munication Review , vol. 49, no. 3, pp. 32–38, 2019.[242] K. Nichols, S. Blake, F. Baker, and D. Black, “Definition of thedifferentiated services field (DS field) in the IPv4 and IPv6 headers,”RFC8376. [Online]. Available: https://tools.ietf.org/html/rfc8376.[243] B. Fenner, M. Handley, H. Holbrook, I. Kouvelas, R. Parekh, Z. Zhang,and L. Zheng, “Protocol independent multicast-sparse mode (PIM-SM): protocol specification (revised).,” [Online]. Available: https://tools.ietf.org/html/rfc7761.[244] H. Holbrook, B. Cain, and B. Haberman, “Using Internet group man-agement protocol version 3 (IGMPv3) and multicast listener discoveryprotocol version 2 (MLDv2) for source-specific multicast,”
RFC 4604(Proposed Standard), Internet Engineering Task Force , 2006.[245] I. Wijnands, E. C. Rosen, A. Dolganow, T. Przygienda, and S. Aldrin,“Multicast using bit index explicit replication (BIER),” in
RFC Editor ,2017.[246] B. Carpenter and S. Brim, “Middleboxes: taxonomy and issues,” 2002.[Online]. Available: https://tools.ietf.org/html/rfc3234.[247] J. McCauley, A. Panda, A. Krishnamurthy, and S. Shenker, “Thoughtson load distribution and the role of programmable switches,”
ACMSIGCOMM Computer Communication Review , vol. 49, no. 1, pp. 18–23, 2019.[248] T. Norp, “5G Requirements and key performance indicators,”
Journalof ICT Standardization , vol. 6, no. 1, pp. 15–30, 2018.[249] G. Xylomenos, C. N. Ververidis, V. A. Siris, N. Fotiou, C. Tsilopou-los, X. Vasilakos, K. V. Katsaros, and G. C. Polyzos, “A surveyof information-centric networking research,”
IEEE communicationssurveys & tutorials , vol. 16, no. 2, pp. 1024–1049, 2013.[250] D. L. Tennenhouse and D. J. Wetherall, “Towards an active networkarchitecture,” in
Proceedings DARPA Active Networks Conference andExposition , pp. 2–15, IEEE, 2002.[251] E. F. Kfoury, J. Gomez, J. Crichigno, E. Bou-Harb, and D. Khoury,“Decentralized distribution of PCP mappings over blockchain forend-to-end secure direct communications,”
IEEE Access , vol. 7,pp. 110159–110173, 2019.[252] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn,“Ceph: A scalable, high-performance distributed file system,” in
Pro-ceedings of the 7th symposium on Operating systems design andimplementation , pp. 307–320, 2006.[253] L. Lamport et al. , “Paxos made simple,”
ACM Sigact News , vol. 32,no. 4, pp. 18–25, 2001.[254] D. Ongaro and J. Ousterhout, “In search of an understandable con-sensus algorithm,” in { USENIX } Annual Technical Conference(USENIX ATC 14) , pp. 305–319, 2014.[255] Huynh Tu Dang, “Consensus as a network service.” [Online]. Avail-able: https://tinyurl.com/y2t9plsu.[256] J. Nelson, “SwitchML scaling distributed machine learning with in net-work aggregation.” [Online]. Available: https://tinyurl.com/y53upm7k.[257] D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan,D. Kalamkar, B. Kaul, and P. Dubey, “Distributed deep learn-ing using synchronous stochastic gradient descent,” arXiv preprintarXiv:1602.06709 , 2016.[258] S. Farrell, “Low-power wide area network (LPWAN) overview,”RFC8376. [Online]. Available: https://tools.ietf.org/html/rfc8376.[259] A. Koike, T. Ohba, and R. Ishibashi, “IoT network architecture usingpacket aggregation and disaggregation,” in , pp. 1140–1145,IEEE, 2016.[260] J. Deng and M. Davis, “An adaptive packet aggregation algorithmfor wireless networks,” in , pp. 1–6, IEEE, 2013.[261] Y. Yasuda, R. Nakamura, and H. Ohsaki, “A probabilistic interestpacket aggregation for content-centric networking,” in ,vol. 2, pp. 783–788, IEEE, 2018.[262] A. S. Akyurek and T. S. Rosing, “Optimal packet aggregation schedul-ing in wireless networks,”
IEEE Transactions on Mobile Computing ,vol. 17, no. 12, pp. 2835–2852, 2018.[263] K. Zhou and N. Nikaein, “Packet aggregation for machine type commu-nications in LTE with random access channel,” in , pp. 262–267,IEEE, 2013.[264] A. Majeed and N. B. Abu-Ghazaleh, “Packet aggregation in multi-rate wireless LANs,” in , pp. 452–460, IEEE, 2012.[265] D. SIG, “Bluetooth core specification version 4.2,”
Specification of theBluetooth System , 2014.[266] S. Farahani,
ZigBee wireless networks and transceivers . Newnes, 2011.[267] O. Hersent, D. Boswarthick, and O. Elloumi,
The Internet of things:key applications and protocols . John Wiley & Sons, 2011.[268] J. Shi, W. Quan, D. Gao, M. Liu, G. Liu, C. Yu, and W. Su,“Flowlet-based stateful multipath forwarding in heterogeneous Internetof things,”
IEEE Access , vol. 8, pp. 74875–74886, 2020. [269] S. Do, L.-V. Le, B.-S. P. Lin, and L.-P. Tung, “SDN/NFV-based networkinfrastructure for enhancing IoT gateways,” in , pp. 1135–1142, IEEE, 2019.[270] A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient computationof frequent and top-k elements in data streams,” in InternationalConference on Database Theory , pp. 398–412, Springer, 2005.[271] S. Heule, M. Nunkesser, and A. Hall, “HyperLogLog in practice:algorithmic engineering of a state of the art cardinality estimationalgorithm,” in
Proceedings of the 16th International Conference onExtending Database Technology , pp. 683–692, 2013.[272] M. G. Reed, P. F. Syverson, and D. M. Goldschlag, “Anonymousconnections and onion routing,”
IEEE Journal on Selected areas inCommunications , vol. 16, no. 4, pp. 482–494, 1998.[273] V. Liu, S. Han, A. Krishnamurthy, and T. Anderson, “Tor instead of IP,”in
Proceedings of the 10th ACM Workshop on Hot Topics in Networks ,pp. 1–6, 2011.[274] C. Chen, D. E. Asoni, D. Barrera, G. Danezis, and A. Perrig, “HOR-NET: high-speed onion routing at the network layer,” in
Proceedings ofthe 22nd ACM SIGSAC Conference on Computer and CommunicationsSecurity , pp. 1441–1454, 2015.[275] M. Zalewski and W. Stearns, “p0f,” see http://lcamtuf. coredump.cx/p0f3 , 2006.[276] J. Barnes and P. Crowley, “k-p0f: A high-throughput kernel passive OSfingerprinter,” in
Architectures for Networking and CommunicationsSystems , pp. 113–114, IEEE, 2013.[277] S. Hong, R. Baykov, L. Xu, S. Nadimpalli, and G. Gu, “Towards SDN-defined programmable BYOD (bring your own device) security,” in
NDSS , 2016.[278] S. Hilton, “Dyn analysis summary of Friday October 21Attack, 2016..” [Online]. Available: https://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/.[279] S. Kottler, “February 28th DDoS incident report, March, 2018.” [On-line]. Available: https://githubengineering.com/ddos-incident-report/.[280] D. Scholz, S. Gallenmüller, H. Stubbe, and G. Carle, “Syn flood defensein programmable data planes,” in
Proceedings of the 3rd P4 Workshopin Europe , pp. 13–20, 2020.[281] J. Ioannidis and S. M. Bellovin, “Implementing pushback: router-baseddefense against DDoS attacks,” in
NDSS , 2016.[282] N. Handigol, B. Heller, V. Jeyakumar, D. Mazières, and N. McKeown,“I know what your packet did last hop: using packet histories totroubleshoot networks,” in { USENIX } Symposium on NetworkedSystems Design and Implementation ( { NSDI } , pp. 71–85, 2014.[283] Y. Zhu, N. Kang, J. Cao, A. Greenberg, G. Lu, R. Mahajan, D. Maltz,L. Yuan, M. Zhang, B. Y. Zhao, and H. Zheng, “Packet-level telemetryin large datacenter networks,” in Proceedings of the 2015 ACM Confer-ence on Special Interest Group on Data Communication , pp. 479–491,2015.[284] H. Zeng, P. Kazemian, G. Varghese, and N. McKeown, “Automatic testpacket generation,” in
Proceedings of the 8th international conferenceon Emerging networking experiments and technologies , pp. 241–252,2012.[285] P. Kazemian, G. Varghese, and N. McKeown, “Header space anal-ysis: static checking for networks,” in
Presented as part of the 9th { USENIX } Symposium on Networked Systems Design and Implemen-tation ( { NSDI } , pp. 113–126, 2012.[286] A. Khurshid, X. Zou, W. Zhou, M. Caesar, and P. B. Godfrey,“Veriflow: verifying network-wide invariants in real time,” in Presentedas part of the 10th { USENIX } Symposium on Networked SystemsDesign and Implementation (NSDI) , pp. 15–27, 2013.[287] R. Stoenescu, M. Popovici, L. Negreanu, and C. Raiciu, “Symnet:scalable symbolic execution for modern networks,” in
Proceedings ofthe 2016 ACM SIGCOMM Conference , pp. 314–327, 2016.[288] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T.King, “Debugging the data plane with Anteater,”
ACM SIGCOMMComputer Communication Review , vol. 41, no. 4, pp. 290–301, 2011.[289] P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, andS. Whyte, “Real time network policy checking using header spaceanalysis,” in
Presented as part of the 10th { USENIX } Symposium onNetworked Systems Design and Implementation (NSDI) , pp. 99–111,2013.[290] A. Horn, A. Kheradmand, and M. Prasad, “Delta-net: real-time networkverification using atoms,” in { USENIX } Symposium on NetworkedSystems Design and Implementation (NSDI) , pp. 735–749, 2017.[291] S. Son, S. Shin, V. Yegneswaran, P. Porras, and G. Gu, “Model checking invariant security properties in OpenFlow,” in , pp. 1974–1979,IEEE, 2013.[292] A. Panda, O. Lahav, K. Argyraki, M. Sagiv, and S. Shenker, “Verifyingreachability in networks with mutable datapaths,” in { USENIX } Symposium on Networked Systems Design and Implementation (NSDI) ,pp. 699–718, 2017.[293] X. Gao, T. Kim, M. D. Wong, D. Raghunathan, A. K. Varma, P. G.Kannan, A. Sivaraman, S. Narayana, and A. Gupta, “Switch codegeneration using program synthesis,” in
Proceedings of the Annualconference of the ACM Special Interest Group on Data Communicationon the applications, technologies, architectures, and protocols forcomputer communication , pp. 44–61, 2020.[294] P. Zheng, T. A. Benson, and C. Hu, “Building and testing modularprograms for programmable data planes,”
IEEE Journal on SelectedAreas in Communications , vol. 38, no. 7, pp. 1432–1447, 2020.[295] D. Kim, Y. Zhu, C. Kim, J. Lee, and S. Seshan, “Generic externalmemory for switch data planes,” in
Proceedings of the 17th ACMWorkshop on Hot Topics in Networks , pp. 1–7, 2018.[296] D. Kim, Z. Liu, Y. Zhu, C. Kim, J. Lee, V. Sekar, and S. Seshan, “TEA:enabling state-intensive network functions on programmable switches,”in
Proceedings of the 2020 ACM SIGCOMM Conference , 2020.[297] S. Chole, A. Fingerhut, S. Ma, A. Sivaraman, S. Vargaftik, A. Berger,G. Mendelson, M. Alizadeh, S.-T. Chuang, I. Keslassy, et al. , “dRMT:disaggregated programmable switching,” in
Proceedings of the Con-ference of the ACM Special Interest Group on Data Communication ,pp. 1–14, 2017.[298] M. T. Arashloo, Y. Koral, M. Greenberg, J. Rexford, and D. Walker,“SNAP: stateful network-wide abstractions for packet processing,” in
Proceedings of the 2016 ACM SIGCOMM Conference , pp. 29–43,2016.[299] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco,and G. Bianchi, “LODGE: Local decisions on global states in pro-grammable data planes,” in , pp. 257–261, IEEE, 2018.[300] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco,and G. Bianchi, “Local decisions on replicated states (LOADER) inprogrammable data planes: programming abstraction and experimentalevaluation,” arXiv preprint arXiv:2001.07670 , 2020.[301] S. Luo, H. Yu, and L. Vanbever, “Swing state: consistent updatesfor stateful and programmable data planes,” in
Proceedings of theSymposium on SDN Research , pp. 115–121, 2017.[302] J. Xing, A. Chen, and T. E. Ng, “Secure state migration in the dataplane,” in
Proceedings of the Workshop on Secure ProgrammableNetwork Infrastructure , pp. 28–34, 2020.[303] L. Zeno, D. R. Ports, J. Nelson, and M. Silberstein, “Swishmem:Distributed shared state abstractions for programmable switches,” in
Proceedings of the 19th ACM Workshop on Hot Topics in Networks ,pp. 160–167, 2020.[304] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Iz-zard, F. Mujica, and M. Horowitz, “Forwarding metamorphosis: fastprogrammable match-action processing in hardware for SDN,”
ACMSIGCOMM Computer Communication Review , vol. 43, no. 4, pp. 99–110, 2013.[305] R. Pagh and F. F. Rodler, “Cuckoo hashing,”