Efficient, Dynamic Multi-tenant Edge Computation in EdgeOS
Yuxin Ren, Vlad Nitu, Guyue Liu, Gabriel Parmer, Timothy Wood, Alain Tchana, Riley Kennedy
Yuxin Ren
The George Washington University [email protected]
Vlad Nitu
EPFL Lausanne [email protected]
Guyue Liu
The George Washington University [email protected]
Gabriel Parmer
The George Washington University [email protected]
Timothy Wood
The George Washington University [email protected]
Alain Tchana
Toulouse University
Riley Kennedy
The George Washington University [email protected]
Abstract
In the future, computing will be immersed in the world around us – from augmented reality to autonomous vehicles to the Internet of Things. Many of these smart devices will offer services that respond in real time to their physical surroundings, requiring complex processing with strict performance guarantees. Edge clouds promise a pervasive computational infrastructure a short network hop away from end devices, but today's operating systems are a poor fit to meet the goals of scalable isolation, dense multi-tenancy, and predictable performance required by these emerging applications. In this paper we present EdgeOS, a micro-kernel based operating system that meets these goals by blending recent advances in real-time systems and network function virtualization. EdgeOS introduces a Featherweight Process model that offers lightweight isolation and supports extreme scalability even under high churn. Our architecture provides efficient communication mechanisms and low-overhead per-client isolation. To achieve high performance networking, EdgeOS employs kernel bypass paired with the isolation properties of Featherweight Processes. We have evaluated our EdgeOS prototype for running high scale network middleboxes using the Click software router and endpoint applications using memcached. EdgeOS reduces startup latency by 170X compared to Linux processes and over five orders of magnitude compared to containers, while providing three orders of magnitude latency improvement when running 300 to 1000 edge-cloud memcached instances on one server.
There is a growing desire to deploy software services closer to users. Cellular providers must run mobility management services near customers to properly maintain connectivity for cell phone users. The Internet of Things foretells the deployment of billions of devices producing data streams, which often require processing close to the data source to avoid excess bandwidth consumption in the network core. Latency sensitive cyber physical systems desire communication and processing at millisecond scale, preventing the use of centralized cloud infrastructures. These applications and many more demand an efficient and scalable "edge cloud" infrastructure, where computational resources are available on demand, as close to users as possible.

Unfortunately, edge clouds pose major challenges for traditional operating system and virtualization architectures. First, an edge cloud must support dense multi-tenancy: each edge cloud site is expected to be a tiny fraction of the size of a centralized cloud, yet it may need to host many carefully isolated services for the users connected to it. Thus rather than run thousands of servers each supporting a dozen services in virtual machines (VMs), as is common in today's centralized cloud data centers, an edge cloud site might only run a dozen servers, each supporting thousands of diverse services. Even lightweight virtualization platforms such as Linux Containers have trouble scaling to these extremes [26].

Second, the combination of limited resources and mobile users means that edge cloud workloads are likely to see extremely high churn. Maintaining a large number of long running yet infrequently accessed services will not be efficient in such an environment, so services will instead need to be instantiated and terminated frequently on demand. In the extreme case, this may require dynamically starting a new service for each incoming user connection.

The overarching concerns of dense multi-tenancy and high churn are compounded by the latency sensitivity and network intensive nature of many edge cloud services. This is particularly challenging since virtualization adds overhead for I/O tasks [14, 17]. Recent support for HW virtualization, such as SR-IOV capable NICs, reduces virtualization layer costs, but comes at the expense of scalability (e.g., only a few dozen virtual devices per port). Thus current OS and HW virtualization techniques lack scalability and often suffer from performance unpredictability, which can be a major concern for latency sensitive applications utilizing the edge.

Prior work has investigated portions of these problems in contexts such as cloud computing, network function virtualization (NFV), or real-time systems. Lightweight virtualization techniques based on unikernels [26] and hypervisor optimizations [30] have been proposed to reduce boot times, but do not address providing many isolated clients with high throughput. Recent NFV platforms achieve high throughput with the use of kernel-bypass networking, but they often trade isolation for performance [15, 50]. Similarly, predictable performance is the hallmark of real-time systems, but these systems generally rely on conservative resource overprovisioning, which is counter to the goals of an efficient edge cloud.

In this paper we explore how a clean-slate OS can provide a flexible infrastructure that can securely and efficiently support a large number of isolated services, while offering strict performance guarantees.
By using a µ-kernel based design, EdgeOS provides a customizable architecture tuned for network intensive workloads, while providing stronger isolation and latency guarantees than existing approaches. EdgeOS uses a "Feather-Weight Process" (FWP) abstraction to provide fine grained isolation at low cost, with support for FWP caching to assist with fast startup under high churn. Despite its radical design, EdgeOS is able to run several common applications, including middleboxes from the Click software router [20] and the memcached key-value store. These network functions and endpoint servers can be flexibly combined to build complex services, while still providing strong isolation for both application and network data.

EdgeOS makes the following contributions:
• A Feather-Weight Process abstraction built on a µ-kernel-based OS that supports orders of magnitude greater density than prior approaches.
• Efficient mechanisms so that message data can be securely communicated through service chains.
• FWP chain caching to support microsecond speed initialization of complex services in high churn environments.

We have implemented EdgeOS by extending the Composite component-based operating system [48]. EdgeOS integrates with the Data Plane Development Kit (DPDK) to provide high performance network I/O. Our evaluation illustrates how EdgeOS can offer dramatically better scale, density, and performance predictability than traditional approaches. We execute 1000s of FWPs per host, instantiate them 170X faster than a Linux process, and maintain a memcached latency under 1 millisecond even when running 600 isolated instances on a single host. EdgeOS provides performance on par with state-of-the-art NFV platforms, while offering stronger isolation and greater agility.

In EdgeOS we consider edge clouds in the context of 5G networks, which will allow large numbers of mobile devices to connect with low latency and high bandwidth to nearby access points [43]. An access point (or perhaps a nearby telco central office [37]) can contain an edge cloud site, i.e., a tiny data center offering compute and storage capabilities to connected devices. Edge clouds enable requests to be serviced, filtered, and transformed before they traverse the WAN, thus avoiding computation in a centralized cloud and/or reducing core bandwidth usage. However, given the large number of edge cloud sites, each is expected to only have a small number of servers due to space, power, and cost constraints. Since edge clouds are likely to be deployed first by telco operators, it is expected that early use cases will focus on NFV middleboxes, such as cellular mobility management, DDoS detection, etc. Here we discuss how the scale, churn, and performance requirements of edge clouds pose major challenges to existing platforms, motivating the need for a redesign of the underlying OS primitives and communication mechanisms.
Given the increasing number of stakeholders that can benefit from edge cloud execution, support for multi-tenant execution is critical. However, today's common infrastructures built on containers or virtual machines may add prohibitive cost for edge workloads. Though past research has increased the agility of such infrastructures by optimizing the startup/shutdown costs [21, 26, 30], the overhead of creating and deleting isolated execution environments can still be significant.

The costs of inter-tenant isolation, especially with high churn – the rate of client arrival and exit – can severely limit the workloads a system can handle. A number of factors are increasing the importance of systems that can handle increased churn at the network edge: (1) serverless computing has been made popular by platforms like Amazon Lambda and the open-source OpenWhisk, and leverages transient computations without local permanent state to increase agility and consolidation; (2) middleboxes focus on efficient and low-latency network computation, and benefit from per-flow isolation; (3) the number of clients accessing the infrastructure is both increasing and becoming more transient with mobile computing [26]; and (4) large volumes of sporadically network-connected embedded (IoT) devices are expected to generate a majority of the world's network traffic.
Churn and isolation overheads. When new clients require isolated computation in the edge cloud, namespace, memory, and CPU isolation provide the requisite separation between tenants. Unfortunately, even relatively efficient mechanisms such as containers rely on layers of abstraction such as the Linux Virtual File System (VFS), and on the management of a large number of namespaces (including those for processes, network, and shared memory) that impose significant overhead. As the churn of a system increases, the overheads of container creation are amplified.
Figure 1. Round-trip latency of N netperf or memcached instances. Compared with the 1ms round-trip of 5G networks, netperf latencies represent a 2x/8x latency increase using one/sixteen cores, while memcached exhibits a 1000x latency increase.

docker start can take hundreds of milliseconds due to the cost of initializing namespaces and setting up Docker metadata. Linux fork() has a much lower cost than Docker, but it still exhibits high variance, with the 90th percentile being over 20 times slower than the median. In contrast, EdgeOS improves median start time by 5X compared to Linux, and has minimal variability. As discussed in the remainder of the paper, we can improve EdgeOS by another order of magnitude by maintaining a cache of services that can be started near instantaneously. We achieve this through lightweight, yet strong isolation mechanisms, and a clean separation of the control and data paths.
Lightweight isolation mechanisms such as containers facilitate running large numbers of applications (e.g., hundreds of Docker containers per server), but they cannot provide performance predictability as the scale rises. This leads to the second key challenge in edge infrastructures: predictable performance, particularly latency, at large scale.
Scaling isolation facilities. Unfortunately, current infrastructures suffer poor performance not only under churn, but also at high scale. Both VMs and containers see overheads due to the expense of traversing the host's software switch to determine the appropriate destination to deliver incoming data to. This is exacerbated with new convenient, yet expensive, networking abstractions such as overlay networking provided by Docker. While an approach such as SR-IOV can provide high performance networking to VMs or containers, it does so by dedicating virtual hardware functions that are a limited resource, preventing high scalability.

To evaluate the latency behavior of today's infrastructure, we adjust the number of netperf servers sharing a single core (netperf-SC) or spread across multiple cores (netperf-MC), and the number of memcached instances spread across multiple cores. A second, well provisioned host transmits traffic to the test server over a 10 Gbps link. The overhead, even in a prevalent and widespread system such as Linux, can be significant. Using multiple cores still cannot achieve ideal latency due to poor scalability, as shown in Figure 1. Real applications such as memcached are quickly overwhelmed and can only support a hundred or fewer instances (full details in Section 5.4). This illustrates the inability of existing OS isolation mechanisms to provide fine grained performance isolation at high scale. EdgeOS is designed to support isolation with both high scalability and predictability.

Figure 2. EdgeOS Control and Data Plane Architecture.
As shown in Figure 2, EdgeOS is designed around: 1) DPDK-based IO gateways that efficiently receive and send packets with kernel-bypass, 2) a Feather-Weight Process (FWP) abstraction that provides fine grained isolation at low cost, 3) a Memory Movement Accelerator (MMA) that securely copies messages between FWPs arranged in chains, and 4) a control plane that manages the FWP-based data plane by providing the high level policies, and offering management functions like FWP template caching for fast startup.
Traditional UNIX processes maintain not only memory protection using virtual address space page-tables, but also additional abstractions including file system (FS) hierarchy visibility, file descriptor namespaces, and signal status. Further, mechanisms optimized for fork performance such as copy-on-write, and for exec performance such as demand loading, add unnecessary and unpredictable overheads.

In contrast, Feather-Weight Processes (FWPs) in EdgeOS are a minimal abstraction wrapping only memory and a small set of simple kernel resources. This is partially motivated by the growing usage of stateless computation and the adoption of middlebox network functions into cloud infrastructures, signaling a growing prevalence of services that depend on external databases to store persistent state. This enables a very tightly constrained execution environment that focuses mainly on the communication of messages (e.g., network packets) between many, possibly untrusting FWPs. EdgeOS optimizes around this trend. As shown in Figure 3, FWPs have access only to their own memory (including stack and heap), memory for storing messages, and a number of communication end-points used to ask the EdgeOS system for services. Notably absent are default access to a file system, dynamic linking facilities, and high-level networking layers such as a TCP/IP stack. The relative simplicity of the FWP abstraction enables the efficient start-up and tear-down of computation in response to client demands.

Figure 3. FWPs can be middleboxes (e.g., Firewalls or Intrusion Detection Systems) or endpoints (e.g., memcached), and can be composed into chains or even replicated for every new client.
Resource access control. Access to all of an FWP's resources relies on capability-based access control [6] using kernel-mediated references, removing any ambient authority [28]. These resources include the message pool that is used to receive and send data, communication end-points used to trigger the message communication, and synchronous communication end-points to request operations from system-level services. These capabilities restrict messages to be only between FWPs in defined chains, which can be shared by many clients, or instantiated on demand for each new connection (see Figure 3). FWPs are provided minimal sets of resources consistent with the principle of least privilege [40], which, paired with strict resource management, enables the scalable execution of isolated computations for many tenants.
Programming API. Since an FWP's primary focus is the processing of data streams (e.g., network packets), its programming API centers on event notification and the reception and transmission of messages, summarized in Table 1. Each FWP provides a callback at initialization that is triggered upon message reception. Memory allocation functions distinguish between standard local memory (following a malloc-based interface) and message memory which is integrated with the communication system. While our current implementation uses a single thread per FWP, the underlying Composite system supports hierarchical scheduling [35], which could be adapted for multi-threaded FWPs.
Rethinking processes for scalable isolation. It is important to contrast the isolation properties and programming model of FWPs with those of existing abstractions such as containers [7, 38] and virtual machines [8]. While containers rely on process abstractions for memory isolation, they add namespace partitioning and resource rate consumption limitations [1]. They rely on the system call layer and Linux's monolithic kernel, and thus have a large Trusted Computing Base [40] whereby a single bug in the large kernel can compromise isolation. In contrast, virtual machine hypervisors expose an interface to virtual machines that mimics the native hardware, or is extended to include paravirtualization extensions [8]. The hypervisor is often smaller and has a smaller attack surface compared to the extended POSIX interface of a system like Linux. Virtual machines are often scheduled by the hypervisor as a collective abstraction of their applications using a virtual CPU (VCPU), thus focusing on inter-VM isolation.

In contrast to approaches that support a standard API (e.g., POSIX or x86), the FWP abstraction focuses on minimizing the FWP API down to the bare necessities required for network intensive edge computations. The API is focused on enabling different FWPs to coordinate and compose for complex functionality – similar in concept to UNIX pipelines. In this way, EdgeOS shares the philosophical design of µ-kernels that "a concept is tolerated inside the µ-kernel only if moving it outside the kernel...would prevent the implementation of the system's required functionality" [22], but extends it to the core edge computing primitives. EdgeOS's system services focus on simplicity of implementation and are limited to scheduling, inter-core coordination, low-level network interfaces, FWPs, and the capability-based access control to scope access to each. The obvious downside of this approach is decreased legacy support. However, we have successfully ported the Click software router and memcached key-value store to EdgeOS. Further, we have prototype implementations of POSIX unikernels [25] (based on NetBSD rumpkernels), but a discussion of these is beyond this paper's scope.

EdgeOS's design departs from heavyweight VM or container abstractions to enable scale and minimize the width of the system API to increase security. Though process abstractions have often been cast aside in favor of VMs [27], containers [50], or language-based techniques [33], EdgeOS demonstrates that simplified process abstractions, with tailored minimal APIs and focused optimizations for churn and communication, can scale to a large number of tenants and clients while maintaining strong isolation for edge computation.
Receiving and transmitting packets with the NIC has traditionally required kernel intervention to manipulate the hardware. EdgeOS embraces the recent trend towards kernel-bypass to reduce this overhead by allowing user-space management of message buffers and network card DMA rings. Though FWPs have isolated local memory, the memory used for message passing between FWPs exposes a trade-off between performance and isolation. Existing high throughput systems often eschew isolation and use shared memory to pass data by reference. This is the design chosen by high-throughput networking stacks and software middleboxes [32, 33, 50]. In EdgeOS, we leverage data copying between separate FWPs to maintain strong mutual isolation. Data copying can be a very expensive operation as it can dirty caches. Thus, EdgeOS pairs strong isolation with Memory Movement Accelerators (MMAs) that decouple copying from the FWP fast-path.
Network Gateways. EdgeOS's microkernel design is a natural fit for user-space packet processing frameworks such as DPDK. In and Out gateway services run on dedicated cores and pull packets into message pools with no kernel interactions. Input packet processing maintains rules dictated by the control plane to match packets to a destination FWP service. Depending on the rule specification, it may be necessary to instantiate a new FWP chain in order to handle the incoming request. FWP chains are a core abstraction in EdgeOS as the entire chain can be created to service a new client.

Table 1. FWP Programming Interface.
eos_postinit(fn_t callback, void *data): Provides the function that is triggered after FWP initialization has completed
eos_receive_fn(fn_t callback, void *data): Sets the callback function invoked on each message reception; that function is passed both the message and its source end-point
msg_t eos_recv(rcv_ep_t): A lower-level API for retrieving a message from an end-point (ep)
eos_send(send_ep_t, msg_t): Send a message to an egress end-point
msg_t eos_msg_alloc(size_t): Allocate a new message in message memory
eos_msg_free(msg_t): Free a message in message memory (eos_send is much more common)
eos_sbrk(size_t): Allocate local memory into the heap
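To make the interface in Table 1 concrete, the sketch below shows the shape of a trivial FWP that forwards every received message downstream. Only the function names come from Table 1; the concrete types (fn_t, msg_t, end-point identifiers), the callback signatures, and the end-point numbering are assumptions made for illustration.

/* A minimal FWP sketch against the Table 1 interface. Type definitions
 * and the DOWNSTREAM_EP constant are illustrative assumptions. */
#include <stddef.h>

typedef void *msg_t;
typedef int   rcv_ep_t;
typedef int   send_ep_t;
typedef void (*fn_t)(msg_t m, rcv_ep_t src, void *data);

extern void  eos_postinit(fn_t callback, void *data);
extern void  eos_receive_fn(fn_t callback, void *data);
extern void  eos_send(send_ep_t ep, msg_t m);
extern void *eos_sbrk(size_t sz);

#define DOWNSTREAM_EP 0                /* assumed: next FWP in the chain */

struct fwp_state { unsigned long nmsgs; };

/* Runs once after initialization; the FWP manager checkpoints the FWP
 * image at this point so future instances can skip this work. */
static void post_init(msg_t m, rcv_ep_t src, void *data)
{
	(void)m; (void)src;
	((struct fwp_state *)data)->nmsgs = 0;
}

/* Invoked for each received message: count it and pass it downstream. */
static void on_msg(msg_t m, rcv_ep_t src, void *data)
{
	(void)src;
	((struct fwp_state *)data)->nmsgs++;
	eos_send(DOWNSTREAM_EP, m);
}

void fwp_main(void)
{
	/* Local (non-message) state lives in heap memory from eos_sbrk. */
	struct fwp_state *s = eos_sbrk(sizeof(*s));

	eos_receive_fn(on_msg, s);
	eos_postinit(post_init, s);
	/* From here on the FWP is event driven: the scheduler activates it
	 * whenever the MMA places messages into its reception ring. */
}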
FWP Memory and Isolation. An FWP's memory is separated into message memory that is used for message passing between FWPs, and local memory that backs each FWP's data-structures. This separation enables memory allocations to be optimized for the purpose and use of the memory. Though future optimizations might relax isolation, EdgeOS focuses on strong protection between FWPs, and employs copying to safely transfer data between them. When messages are passed between FWPs, a trusted system component must be involved, as neither FWP has the access rights to copy into, or from, the other FWP's memory.
Efficient message passing with the MMA. A key EdgeOS design decision is to move message copying off the fast-path of FWP message processing, as we have observed that even a single in-line copy can prevent line-rate processing in many cases. Toward this, EdgeOS employs a Memory Movement Accelerator (MMA) whose focus is on efficiently copying messages between FWPs. The MMA retrieves messages from an upstream FWP's ring buffers, copies them and adds them into a downstream FWP's ring buffers, and alerts the scheduler that the destination FWP needs to be activated to receive them. The MMA acts as a software DMA engine to move message data between FWPs, and runs on one or more dedicated cores in order to perform out-of-band data movement. In contrast to long-standing networking subsystem guidance that dictates that zero-copy is necessary [16, 47] – often at the price of isolation – EdgeOS optimizes the MMA and treats it as a specialized processor that can push data significantly faster than line-rate, while maintaining strong isolation.
Similar to the approach taken in Software Defined Networks (SDN) and split-OS designs such as Arrakis [36], EdgeOS separates the data plane processing (implemented with FWPs, MMAs, and network gateways) from control functions that determine request routing, security policies, and resource management (implemented as user space components extending the Composite µ-kernel). As shown in Figure 2, EdgeOS's control plane is composed of three major components: (1) the EOS Controller that maps incoming flows to FWP chains, (2) the FWP Manager that controls the lifecycle of FWPs and optimizes their startup, and (3) the Scheduler that determines which FWP to run on each core and activates them in response to incoming messages.
Flow matching with the EdgeOS Controller. When new requests arrive from connected client devices, they need to be routed to the appropriate FWP chain. The EdgeOS Controller allows administrators to define FWP chains and the packet filtering rules that specify what traffic should be routed to them. These rules are pushed to the Net-In data plane component. Net-In applies rules similar to SDN match-action rules: packets are split into flows based on the header n-tuple (e.g., src/dest IP and protocol) and a rule is found that matches the flow. The rules indicate the FWP chain that will process that flow. Since our focus is on fine-grained isolation and high scale, a rule can indicate whether all flows that match the rule should be handled by a single chain, or if each flow should be given a dynamically started instance of the chain. (Our implementation currently assumes flow rules are statically preconfigured, but this could be extended to support on-demand flow lookups similar to SDN controllers, with a northbound interface to application logic that would assign a rule dynamically to each flow.)
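The sketch below illustrates the shape such a match-action rule could take, including the per-rule choice between a shared chain and a per-flow chain instance. The field names, the wildcard mask encoding, and the linear scan are illustrative assumptions; Net-In may organize and index its rules differently (e.g., as a hash on the n-tuple).

/* Sketch of an n-tuple match-action rule and a simple lookup. */
#include <stdint.h>
#include <stddef.h>

struct flow_tuple {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	uint8_t  proto;
};

enum chain_policy {
	CHAIN_SHARED,    /* all matching flows share one FWP chain    */
	CHAIN_PER_FLOW,  /* each new flow gets its own chain instance */
};

struct flow_rule {
	struct flow_tuple match;     /* pre-masked fields to compare    */
	struct flow_tuple mask;      /* zero bits are wildcards         */
	enum chain_policy policy;
	int               chain_id;  /* which FWP-chain template to use */
};

static int rule_matches(const struct flow_rule *r, const struct flow_tuple *t)
{
	return ((t->src_ip   & r->mask.src_ip)   == r->match.src_ip)   &&
	       ((t->dst_ip   & r->mask.dst_ip)   == r->match.dst_ip)   &&
	       ((t->src_port & r->mask.src_port) == r->match.src_port) &&
	       ((t->dst_port & r->mask.dst_port) == r->match.dst_port) &&
	       ((t->proto    & r->mask.proto)    == r->match.proto);
}

/* Return the first matching rule, or NULL if the flow is unhandled. */
static const struct flow_rule *
flow_lookup(const struct flow_rule *rules, size_t n, const struct flow_tuple *t)
{
	for (size_t i = 0; i < n; i++)
		if (rule_matches(&rules[i], t))
			return &rules[i];
	return NULL;
}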
FWP Lifecycle and Caching. The creation of FWPs on the fly in response to the arrival of a new flow requires a cascade of activity: the instantiation of a set of new FWPs (including memory initialization, kernel data-structure management, and thread creation), connecting the FWPs together with communication channels (ring buffers, kernel end-points, and MMA integration), and finally, the creation of the message memory regions for the FWPs.

The FWP Manager orchestrates the lifecycle of FWP chains, which is illustrated in Figure 4. Similar to a Linux process, an FWP starts as an object file, which must be loaded into memory. Once execution begins, FWPs typically perform some initialization routines (e.g., parsing configuration files and allocating initial data structures). Rather than repeat such computation every time a new FWP of the same type must be instantiated, EdgeOS optimizes startup with an FWP checkpoint cache. Thus, we utilize the eos_postinit() API to allow FWPs to first initialize, then to take a checkpoint that defines the state of an FWP ready to process new data.

Figure 4. Lifecycle of an FWP-chain: Dotted lines indicate FWP manager operations conducted once to load and then checkpoint an FWP-chain, or to reclaim the FWP's resources when memory pressure exists. Dashed lines indicate operations to re-initialize terminated FWP-chains for future use. Solid lines are data-path operations performed on the critical path of FWP execution.
Since we anticipate many complex services will require multiple FWPs arranged in a chain, the Manager employs an FWP-chain cache that caches entire chains of FWPs, their interconnections, and their message memory. As new flows arrive, they are paired with corresponding FWP-chains from the cache. The selected FWPs will be Activated, allowing them to process messages or transition to the Blocked state, before eventually Terminating when they are no longer needed. When an FWP chain terminates, the Manager reuses the chain by Restoring it back into the FWP-chain cache. In doing so, EdgeOS must guarantee that the memory of the cached computation represents the checkpointed, post-initialization state. As this places data-structures into a known and safe state, it ensures the integrity of future FWP-chain instances. EdgeOS avoids control operations in the data-path, thus the Manager's checkpoint and restore operations run in parallel to FWP message passing. If memory pressure exists in the system, cached FWP templates and chains are Reclaimed.
Scheduling and inter-FWP coordination. Once a set of FWPs are activated, they are distributed across cores, and partitioned scheduling (i.e., without task migrations) multiplexes the core's processing time. Each scheduler requires global context on which FWPs are assigned to its core, which are runnable, and which are blocked awaiting messages.

Traditional systems often use direct coordination between cores via shared data-structures and explicit notification using Inter-Processor Interrupts (IPIs). For example, Linux provides notifications to activate threads (via futexes, or pipes) by accessing that thread's data-structure directly to see if it is already awake, and if not, an IPI is sent. The resulting cache-coherency traffic for access to shared data-structures, then the IPI overheads, can be significant, especially if used for message notifications arriving over a network at line rate. FWP-chains can be spread across cores, only increasing the cost. Motivated by these overheads, NFV platforms based on DPDK such as OpenNetVM [50] use active polling for communication between threads on different cores, thus entirely avoiding blocking. However, as the number of processes ("network functions" in OpenNetVM) grows beyond the number of cores, spin-based event notification is inefficient. To avoid the large overheads of shared resources, all inter-scheduler coordination in EdgeOS is via message passing.
Figure 5. EdgeOS Timeline.

When an FWP-chain is activated, a message to that effect is sent to the scheduler controlling the core hosting the FWP. Additionally, when a message is sent to an FWP and its ring buffer is empty, a message is sent to the corresponding scheduler. On the other hand, when an FWP has processed all of its pending messages, instead of spinning awaiting more, the eos_recv operation will invoke the scheduler (which uses IPC to the scheduler component) asking to block.
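One way to picture this message-based coordination is sketched below: a producer (the MMA, or a scheduler on another core) appends wake-up events to a per-core, single-producer/single-consumer ring, and the owning scheduler drains it before its next dispatch decision. All structure and function names here are assumptions; only the pattern of replacing shared run-queue state and IPIs with message passing follows the text.

/* Sketch of per-core scheduler wake-up events delivered by message
 * passing rather than shared data-structures and IPIs. */
#include <stdint.h>

#define EVT_RING_SZ 256                  /* power of two, assumed        */

struct sched_event {
	uint32_t fwp_id;                 /* FWP to activate on this core */
};

struct sched_evt_ring {
	volatile uint32_t head;          /* written by the producer       */
	volatile uint32_t tail;          /* written by the core scheduler */
	struct sched_event evt[EVT_RING_SZ];
};

/* Producer side: wake an FWP on a remote core without touching that
 * core's scheduler state directly. Returns -1 if the ring is full. */
static int sched_notify(struct sched_evt_ring *r, uint32_t fwp_id)
{
	uint32_t h = r->head;
	if (h - r->tail == EVT_RING_SZ) return -1;
	r->evt[h & (EVT_RING_SZ - 1)].fwp_id = fwp_id;
	__atomic_store_n(&r->head, h + 1, __ATOMIC_RELEASE);
	return 0;
}

/* Consumer side: the per-core scheduler drains pending wake-ups before
 * choosing the next FWP to dispatch. */
static void sched_drain(struct sched_evt_ring *r, void (*wake)(uint32_t fwp_id))
{
	uint32_t h = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);
	while (r->tail != h) {
		wake(r->evt[r->tail & (EVT_RING_SZ - 1)].fwp_id);
		r->tail++;
	}
}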
EdgeOS Timeline Summary. Figure 5 shows the complete timeline for receiving and processing a packet. 1) A packet reception at the Net-In gateway causes a flow lookup to decide which FWP chain should process the packet. 2) If there is a miss and no FWP is currently allocated, the FWP Manager spawns one from its cache. 3) A message is sent to the MMA causing it to copy the packet into the destination FWP's pool. 4) A message is added to the FWP's ring and 5) the MMA messages the scheduler on the FWP's core to activate it. 6) The FWP processes the packet and 7) asks the output gateway to DMA the packet out the NIC.
In this section we describe how our EdgeOS design is implemented and the key optimizations we make to achieve predictable, high performance. We plan to release our source code and experiment templates for repeatable research.
Composite is an open source µ-kernel that externalizes traditionally core kernel features into user-level components that define the resource management and isolation policies [48]. Components interact through highly-optimized Inter-Process Communication (IPC) to leverage system logic and resources. Similar to Eros [41] and seL4 [11], Composite is based on a capability-based protection model that controls component access to kernel resources. These resources include threads, communication end-points (synchronous and asynchronous), page-tables, capability-tables, temporal capabilities [13], and memory frames. The kernel includes no scheduling policies, instead implementing schedulers at user-level [34]. The Composite kernel (composite.seas.gwu.edu) scales well to multiple cores as it has no locks and is designed entirely around store-free common-paths, wait-free data-structures, and quiescence for data-structure consistency [48].

EdgeOS builds on these underlying facilities to provide: (1) FWP management and caching capabilities, (2) a DPDK-compatible userspace networking module, (3) new communication mechanisms built around the MMA, and (4) a scheduler that is integrated with the communication and DPDK modules. EdgeOS is implemented as a component consisting of these main system modules. Co-location of these in a component is convenient and simplifies their communication, but is not necessary. Together, they provide the abstractions to execute FWPs as isolated components with only a limited number of synchronous communication channels to EdgeOS, corresponding to the functions in Table 1. Thus, the attack surface of any given FWP is restricted and small.

The current Composite implementation is for 32 bit x86. Though this limits the scale of the system due to memory limitations, our prototype demonstrates the core functionality of EdgeOS. Ports to other platforms such as ARM and x86-64 are in progress by the Composite developers, and we expect EdgeOS would exhibit similar behavior on them.
FWP checkpointing. EdgeOS caches the images of chains of FWP binaries so they are ready for prompt activation. These ready-to-execute images are asynchronously prepared, thus moving the overhead for FWP preparation off the fast-path. The cache contains full FWP chains so that complete services can be quickly deployed. The cached FWPs represent their execution immediately following the eos_postinit function, thus capturing the initialized state of a ready-to-execute FWP. This avoids redundant initialization computation. For example, our Click network functions trigger the checkpoint only after loading and parsing their configuration file from disk.

However, the mechanisms to prepare FWP-chains (in the FWP Manager) still must be efficient to maintain a high churn rate. Thus, we utilize a few optimizations: (1) the post-initialization checkpoint of the FWP-chain is laid out contiguously in memory so that re-initializing a chain is bounded mainly by memcpy and memset overheads (for which we use the musl libc, unoptimized versions); (2) we do not reclaim – and thus later re-allocate – heap memory from terminated FWPs, instead only zeroing it out and using it to satisfy future eos_sbrk calls; (3) we reuse the threads active in each FWP, instead only resetting their instruction pointer to the appropriate post-initialization execution point, which has the side effect of avoiding thread allocation and scheduling overheads beyond suspending the thread. These optimizations culminate in a system that can handle exceedingly high churn and scalability – FWP chain initialization converges on memcpy overheads, and chain activation in response to a new client takes low 10s of microseconds.
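The restore fast-path implied by optimizations (1)-(3) can be sketched as follows. The layout of an FWP instance and the thd_reset primitive are assumptions; the point is that re-initialization reduces to a memcpy of the contiguous template, a memset of only the heap that was actually used, and rewinding the existing thread's instruction pointer.

/* Sketch of restoring a terminated FWP back to its checkpointed state. */
#include <string.h>
#include <stddef.h>

struct fwp_template {
	void  *img;        /* checkpointed, post-eos_postinit() memory image */
	size_t img_sz;
};

struct fwp_instance {
	void  *mem;        /* contiguous local memory of the FWP             */
	size_t mem_sz;
	size_t heap_used;  /* bytes handed out via eos_sbrk since restore    */
	void  *entry_ip;   /* post-initialization instruction pointer        */
	int    thd_id;     /* kernel thread reused across activations        */
};

/* Assumed Composite/EdgeOS primitive: park a thread at a given IP. */
extern void thd_reset(int thd_id, void *ip);

static void fwp_restore(struct fwp_instance *f, const struct fwp_template *t)
{
	/* (1) Re-initialization is bounded by a memcpy of the template... */
	memcpy(f->mem, t->img, t->img_sz);
	/* (2) ...plus zeroing only the heap the previous activation touched
	 * (the heap is assumed to follow the image within f->mem). */
	memset((char *)f->mem + t->img_sz, 0, f->heap_used);
	f->heap_used = 0;
	/* (3) Reuse the existing thread; just rewind it to the checkpoint. */
	thd_reset(f->thd_id, f->entry_ip);
}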
FWP scheduling. We specialize the user-level scheduling policies within EdgeOS to manage untrusted FWPs that require low-latency computations. The scheduling policy aims to prevent any FWP from monopolizing the CPU, and from interfering with the progress of other FWPs. Additionally, as all scheduling operations represent overhead that can impact system throughput, they must be as rare and efficient as possible, while maintaining inter-FWP isolation. Given these goals, in the current work we focus on simplicity in the scheduling policy, and on the careful usage of timer interrupts to balance each FWP's progress with scheduler overhead.

Each core separately schedules the FWPs assigned to it using a fixed-priority, round-robin scheduling policy. The quantum chosen to preempt an executing FWP is specifically calibrated to enable the average FWP to complete its execution cooperatively (thus avoiding timer overheads), and round-robin prevents starvation. To implement this, user-level schedulers use the kernel's facilities to dispatch to a thread and pass the time that the next timer interrupt should fire. We use modern x86 processor local-APIC support for specifying one-shot timer interrupts with cycle-accuracy (called "TSC Deadline Timers" in Intel documents). Each scheduler receives messages from the MMA to activate its FWPs. The simplicity of the scheduling policy and our optimized use of timer interrupts together enable the necessary efficiency for line-rate computations, while guaranteeing progress and performance predictability in spite of the large-scale, multi-tenant environment.
To support multi-tenancy, FWPs provide isolation for local memory, CPU processing, and access to system resources. However, message pool management provides both inter-FWP isolation and coordination.
Ring-buffers for both coordination and memory management. Each FWP's message pool is associated with two ring buffers that track both how to transmit and receive messages, and the allocation and deallocation of messages. These ring buffers are similar to NIC DMA ring buffers. However, unlike traditional driver ring buffers, EdgeOS makes the observations that (1) general purpose memory allocation facilities (malloc/free) can have significant overhead for high message arrival rates, and complicate the coordinated memory management between the MMA and FWPs; and (2) the ring buffers can be organized to track not only incoming and outgoing messages, but also free memory.

A reception ring buffer contains a set of references to message slots into which incoming data can be copied, and the transmission ring buffer contains references to messages to move downstream in the FWP chain. The MMA dequeues messages from an FWP's transmit ring, copies the data, and enqueues a message in the recipient's ring. In this way, the MMA acts directly as a software DMA accelerator between FWPs. Each ring buffer entry has a set of bits that tracks the state of the entry: transmit – ready to send the message, receive – empty message to transmit into, ready – populated message ready for processing, free – ready to be reallocated by the FWP, or unused – an unused ring buffer entry (with an ignored pointer). Thus, the MMA transitions ring buffer entries in transmit rings from the transmit to the free state after copying the message, signaling that the message can be reused; and it transitions receive ring entries from receive to ready after copying data into the message.

Message pools are managed by FWPs as a span of MTU-sized message slots, and unlike traditional NIC DMA ring buffers, the ring buffers include an entry for each message slot. When a message arrives in a message pool, the FWP dequeues it from its receive ring – transitioning the ring entry from ready to unused – processes it, and later adds it to the transmission ring buffer – transitioning the entry from unused to transmit. An FWP must maintain a sufficient number of messages in reception rings in the receive state to compensate for the scheduling latencies due to multiplexing the CPU among many FWPs. Thus, after it finishes processing pending messages, it will move freed messages from the transmit ring (free → unused) into the reception ring (unused → receive). In this way, message liveness is managed indirectly through the ring buffers.
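The entry states and the transitions performed by the MMA and by an FWP can be sketched as below. The field widths and helper names are assumptions; the five states and which side performs each transition follow the text.

/* Sketch of ring-buffer entries and their state transitions. */
#include <stdint.h>

enum ring_ent_state {
	ENT_UNUSED,    /* entry not in use (pointer ignored)              */
	ENT_RECEIVE,   /* empty message slot the MMA may copy into        */
	ENT_READY,     /* populated message ready for FWP processing      */
	ENT_TRANSMIT,  /* message ready for the MMA to copy downstream    */
	ENT_FREE,      /* copied out; slot may be reallocated by the FWP  */
};

struct ring_ent {
	uint32_t msg_off;   /* offset of the MTU-sized slot in the pool   */
	uint32_t state;     /* one of enum ring_ent_state                 */
};

/* MMA side: release an upstream slot after copying it downstream, and
 * publish a receive slot once data has been copied into it. */
static void mma_done_copy_out(struct ring_ent *tx_ent) { tx_ent->state = ENT_FREE;  }
static void mma_done_copy_in(struct ring_ent *rx_ent)  { rx_ent->state = ENT_READY; }

/* FWP side: consume a ready message, later queue it for transmission,
 * and recycle freed transmit slots back into the receive ring so enough
 * empty slots remain to absorb scheduling latency. */
static void fwp_consume(struct ring_ent *rx_ent)  { rx_ent->state = ENT_UNUSED;   }
static void fwp_transmit(struct ring_ent *tx_ent) { tx_ent->state = ENT_TRANSMIT; }
static void fwp_recycle(struct ring_ent *tx_ent, struct ring_ent *rx_ent,
                        uint32_t msg_off)
{
	tx_ent->state   = ENT_UNUSED;
	rx_ent->msg_off = msg_off;
	rx_ent->state   = ENT_RECEIVE;
}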
Message pools and isolation. The ring buffer design decouples the message memory from the meta-data used to coordinate data movement and liveness between FWPs and the MMA. In doing so, EdgeOS avoids lock-based protection of the rings, instead relying on wait-free mechanisms that guarantee execution progress of both FWPs and the MMA. This has the benefit of minimizing coherency overheads in ring coordination, and avoiding critical sections that could starve the MMA. Additionally, it enables FWPs to have more restrictive access rights to the pool than the ring buffer, for example, providing integrity by mapping the pool read-only.
Our initial experiments showed that naively copying packets between stages in a DPDK-based NFV pipeline decreased throughput by more than 50%. However, we also found that a core devoted to data movement has a throughput of around 30 Gb/s, which is sufficient for line-rate. By using the parallelism of the underlying processor and specializing cores to run the MMA, we achieve both isolation and high throughput by taking message movement out of the critical path.

The MMA has read-write access to all message pools. It maintains a mapping between pairs of transmit and receive ring buffers and their associated pools, and continuously iterates through all such pairs, transferring messages when it finds a transmission. The MMA provides two essential services: data-movement by copying transmitted messages, and event notification of the receiving FWPs. The MMA's FWP event notification is efficient as it simply sends a message to the scheduler controlling the FWP's core. Though the current system uses only a single MMA, more cores can be devoted to this, should it require more memory movement throughput in the future.
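Building on the ring-entry sketch above, the MMA's scan-and-copy loop might look roughly as follows. The pairing structure, the index-aligned rings, and the notification helper are simplifying assumptions made for illustration.

/* Sketch of the MMA scan loop: iterate transmit/receive ring pairs, copy
 * message data, flip entry states, and notify the downstream scheduler.
 * Reuses struct ring_ent, the ENT_* states, and the mma_done_* helpers
 * from the previous sketch. */
#include <string.h>
#include <stddef.h>
#include <stdint.h>

struct mma_pair {
	struct ring_ent *tx;       /* upstream FWP's transmit ring         */
	struct ring_ent *rx;       /* downstream FWP's receive ring        */
	char            *tx_pool;  /* upstream message pool base           */
	char            *rx_pool;  /* downstream message pool base         */
	uint32_t         n_ents;   /* entries per ring (same for both)     */
	uint32_t         dst_fwp;  /* FWP to wake via its core's scheduler */
	size_t           slot_sz;  /* MTU-sized message slot               */
};

extern int sched_notify_fwp(uint32_t fwp_id);      /* assumed wrapper */

static void mma_scan(struct mma_pair *pairs, size_t npairs)
{
	for (size_t p = 0; p < npairs; p++) {
		struct mma_pair *mp = &pairs[p];
		int copied = 0;

		for (uint32_t i = 0; i < mp->n_ents; i++) {
			struct ring_ent *t = &mp->tx[i], *r = &mp->rx[i];

			if (t->state != ENT_TRANSMIT || r->state != ENT_RECEIVE)
				continue;
			memcpy(mp->rx_pool + r->msg_off,
			       mp->tx_pool + t->msg_off, mp->slot_sz);
			mma_done_copy_out(t);   /* transmit -> free  */
			mma_done_copy_in(r);    /* receive  -> ready */
			copied = 1;
		}
		/* Wake the downstream FWP only if new messages arrived. */
		if (copied) sched_notify_fwp(mp->dst_fwp);
	}
}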
MMA optimizations. The MMA is on the data-path of all FWP interactions, including message reception, thus it must be able to move messages at faster than line rate. The MMA iterates through all FWP transmit rings, and (1) copies data between message pools while updating rings, and (2) activates the downstream FWP by sending an event (through a ring buffer) to the scheduler on that FWP's core. The data-structures linking transmit and reception rings are laid out in an array to leverage the processor's prefetcher as the MMA iterates over them. The initial implementation of the operations on the ring buffers was straight-forward, but cache-coherency overheads, possibly for each ring entry, hurt throughput. To address this, we added the following optimizations (see the sketch after this list):
• Double-cache-line (128B) caches are added to both the enqueue and dequeue operations. These caches are in local memory outside of the ring, thus their modifications are free of coherency traffic. Transmitting a message adds it to the transmit queue cache, and only when it is full is it flushed to the ring buffer. This batches what would be eight separate ring updates into essentially a single memcpy of 128 bytes. To avoid cached entries that are not yet transferred into the ring from having delayed (or starved) processing, when an FWP has completed processing and is going to block, it flushes its cache to its transmit ring buffer. Similarly, when the ring buffers are dequeued, entries are copied out into a double-cache-line cache, and subsequent accesses first check the cache. The caches are 128B to match the Intel policy of fetching double-cache-lines at a time.
• These caches enable messages to be viewed in batches. This enables a second optimization to use explicit software prefetch instructions to load all referenced messages into the core's cache. This optimization is particularly effective as the processing of the messages is temporally proximate.
• Naming of different messages uses direct virtual addresses. Though the MMA is isolated from FWPs, they share a single virtual address space [5, 9]. To maintain protection, all local memory for both the MMA and each FWP is isolated and uses overlapping addresses, and when the MMA and FWPs pass a message, they validate that it lies within the message pool's boundaries.
These optimizations contribute to EdgeOS' high message throughput. However, should they be insufficient due to too many FWPs or long chains, the MMA can trivially partition the ring buffers, and thus scale to multiple cores.
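The double-cache-line staging from the first bullet can be sketched as follows; the 16-byte entry size, the names, and the omission of ring wraparound are simplifications for illustration.

/* Sketch of a 128B (double cache line) enqueue cache: ring entries are
 * staged in core-local memory and flushed to the shared ring in one
 * 128-byte memcpy, batching eight coherence-visible updates into one. */
#include <string.h>
#include <stdint.h>

#define ENT_SZ     16                      /* assumed entry size          */
#define BATCH_ENTS (128 / ENT_SZ)          /* 8 entries per flush         */

struct tx_cache {
	uint8_t   buf[128];     /* local staging, no coherency traffic     */
	uint32_t  n;            /* number of staged entries                */
	uint8_t  *ring_slot;    /* next flush position in the shared ring
	                         * (wraparound handling elided)            */
};

static void tx_cache_flush(struct tx_cache *c)
{
	if (!c->n) return;
	/* One bulk copy into the shared ring instead of per-entry stores. */
	memcpy(c->ring_slot, c->buf, (size_t)c->n * ENT_SZ);
	c->ring_slot += (size_t)c->n * ENT_SZ;
	c->n = 0;
}

static void tx_cache_enqueue(struct tx_cache *c, const void *ent)
{
	memcpy(&c->buf[(size_t)c->n * ENT_SZ], ent, ENT_SZ);
	if (++c->n == BATCH_ENTS) tx_cache_flush(c);
}

/* Before blocking in eos_recv, an FWP calls tx_cache_flush() so that a
 * partially filled batch is not delayed or starved. */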
EdgeOS uses DPDK for direct access to the NIC via kernel-bypass. Our port of DPDK to EdgeOS is conducted mainly as a new Environment Abstraction Layer (EAL), thus minimizing the impact on the DPDK code-base. DPDK transmits and receives packets via the MMA, but, unlike other FWPs, it has a number of heightened privileges. First, DPDK is used in poll-mode, and we devote a core to polling for and receiving packets, and another to transmitting them.

Packet reception. Incoming packets are demultiplexed to their corresponding FWPs via the flow mapping facilities in DPDK. In this way, EdgeOS has mechanisms to maintain the mappings of IPs and ports to specific FWP-chains, but we leave the policy of creating those mappings to a cluster manager such as Kubernetes or a Software-Defined Networking (SDN) controller. If a flow maps to an FWP-chain that is not yet active, a chain is retrieved from the FWP-chain cache and activated. The FWP-chain cache is populated with FWP chains by the FWP manager. DPDK packet pools are treated as EdgeOS message pools, and the MMA copies packets into downstream FWPs. Intelligent hardware with built-in flow direction might enable zero-copy here [42], and EdgeOS could be modified to use this support in the future. Flows that map to an active FWP-chain are placed in a message pool transmit ring buffer, and the MMA copies the data accordingly.
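A rough sketch of the Net-In fast path is shown below. rte_eth_rx_burst and rte_pktmbuf_mtod are standard DPDK calls; the flow_to_chain, chain_cache_activate, and mma_enqueue helpers stand in for EdgeOS-side logic and are assumptions.

/* Sketch of the Net-In poll loop: burst-receive, look up the FWP chain,
 * activate a cached chain on a miss, and hand the packet to the MMA. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_BURST 32

struct fwp_chain;                               /* opaque EdgeOS handle  */
extern struct fwp_chain *flow_to_chain(const struct rte_mbuf *m); /* NULL on miss */
extern struct fwp_chain *chain_cache_activate(const struct rte_mbuf *m);
extern void              mma_enqueue(struct fwp_chain *c, const void *pkt,
                                     uint16_t len);

static void netin_poll(uint16_t port, uint16_t queue)
{
	struct rte_mbuf *pkts[RX_BURST];

	for (;;) {
		uint16_t n = rte_eth_rx_burst(port, queue, pkts, RX_BURST);

		for (uint16_t i = 0; i < n; i++) {
			struct fwp_chain *c = flow_to_chain(pkts[i]);

			/* Miss: pull a pre-initialized chain from the cache. */
			if (!c) c = chain_cache_activate(pkts[i]);
			if (c)  mma_enqueue(c, rte_pktmbuf_mtod(pkts[i], void *),
			                    pkts[i]->data_len);
			/* The DPDK packet pool doubles as a message pool; the
			 * mbuf is reclaimed after the MMA copies it (elided). */
		}
	}
}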
Packet transmission. A final optimization avoids a packet copy on the transmit path. When the last FWP in the chain transmits to DPDK, the MMA omits the copy, and instead enables DPDK to add a direct reference to the packet to its own DMA ring buffers. Later, when the NIC signals the successful transmission of the packet, DPDK signals the transmission to the message pool so that the packet can be reclaimed.
All experiments are run on CloudLab Wisconsin c220g1 series nodes. These are two socket, 8 core Intel Xeon E5-2630 v3 @ 2.40GHz processors with 128GB ECC memory (8x 16 GB DDR4 1866 MHz dual rank RDIMMs). Systems are connected via dual-port Intel X520-DA2 10Gb NICs (PCIe v3.0, 8 lanes). For EdgeOS we use less than 1GB of the system's memory due to the underlying µ-kernel's 32-bit address space limitations.

We first evaluate the latency and performance predictability of
EdgeOS compared to other high performance networking platforms. Figure 6(a) shows the response time distribution (in microseconds) for an ICMP ping response Click [20] element implemented as either: a DPDK process, an OpenNetVM NF (ONVM), a standard Linux process with kernel-based IO, a ClickOS NF in a Xen VM, or an FWP in EdgeOS. The results show that EdgeOS significantly outperforms all of these techniques (by up to 3.8X in average latency), except for DPDK. DPDK is slightly better because it can run only a single service at a time and thus does not need to copy packets from the initial receive DMA ring to a separate pool. In contrast, EdgeOS provides a platform to potentially run thousands of distinct services, and thus needs to offer stronger isolation via copying.

Figure 6(b) shows the maximum throughput of different approaches when forwarding traffic from pktgen, a high speed packet generator. EdgeOS again provides better performance than ClickOS, while offering stronger isolation than DPDK and ONVM, which rely on globally shared memory pools for zero-copy IO.

Next we evaluate the performance of EdgeOS communication by comparing with ONVM. We run a chain of NFs on the same core that each forward small (64B) or big (1024B) packets, thus both systems have context switch overhead when passing a packet to the next NF. In addition, EdgeOS has copying overhead from the MMA to enforce isolation. The results in Figure 6(c) show that as the chain length increases, the throughput of 64B packets drops for both EdgeOS and ONVM, each affected by different overheads. The main overhead of EdgeOS is data copying, while the overhead of Linux context switches and scheduling dominates in ONVM. When the chain length is smaller than 3, the overhead of copying is less than 8%, and EdgeOS outperforms ONVM when the chain is longer as the Linux system overheads increase. The throughput with 1024B packets maintains line rate for both systems when the chain length is smaller than 6, but EdgeOS sees a throughput decrease when the chain is longer, as one MMA is not able to handle copies for all FWPs.
In Linux, initializing a process involves calling fork (and possibly execve). For Docker containers, a docker run command is similar, but includes additional system calls to configure namespaces and maintain container metadata. In order to optimize the fast path of readying a cached FWP, EdgeOS separates creation from activation. For EdgeOS, creation involves transitioning from the Object File to Cached state in Figure 4, including setting up page tables, capability tables, and thread creation. We record the start time for 10,000 iterations of starting a container, process, or FWP and report the median in Figure 7(a). Note the log scale; we use median time values since, as described below, container creation becomes slower over time, so the average is skewed by these outliers. We compare against two variants of Linux processes: "fork + exec" loads a different binary whereas "fork + faults" mimics loading the service's working set by issuing writes to 8 different pages to trigger page faults. These approaches are 5-20X slower than the comparable "EOS create" approach (dashed lines in Figure 4).

Once an FWP has been created, EdgeOS keeps copies of it in a cache which can be quickly activated on demand (solid lines in Figure 4). Cached activation improves EdgeOS performance by another order of magnitude, allowing new processing entities to be instantiated in 6.2 microseconds. Figure 7(b) presents a CDF of these approaches, including the activation cost for starting a full chain of 10 FWPs, which remains an order of magnitude faster than fork+exec.
Figure 6. (a) EdgeOS provides substantially better latency and reduced jitter compared to Linux processes and NFV platforms like OpenNetVM and ClickOS. (b) Throughput of each system with different packet sizes. (c) EdgeOS provides isolation and adds negligible overheads compared to OpenNetVM (no isolation) for different chain lengths, for messages of size 64 and 1024 bytes.

Figure 7. EdgeOS provides orders of magnitude better startup time than other approaches and does not suffer from scalability problems when starting larger numbers of FWPs.

FWP Scalability. Further, we have found that containers suffer from poor scalability – as the number of containers rises, the start time worsens. Similar behavior has been shown previously for virtual machines [26]. In Figure 7(c) we show the time to start a new container, exec a process, or activate an FWP, when up to 2200 are started incrementally. The Container case gradually drifts upward before hitting a step after 2000 containers. The cost of starting the last container is 1.368 seconds versus 0.467 seconds for the first. The standard deviation for containers is 236 ms versus only 0.08 ms for
FWP activation. As long as sufficient FWPs are available in the template cache, EdgeOS provides nearly constant start time regardless of scale; if additional templates are needed, the FWP manager can create them in parallel to the data path. The EdgeOS timeline has a few outlier points (11 out of 15K measurements are at 2ms), which we believe to be Non-Maskable Interrupts, or a bug in our scheduling logic.
To evaluate the impact of client churn in edge environments, we mimic an experiment from the LightVM paper [26]. Clients send requests to an EdgeOS based service at a configurable interval, and we assume that each new client request requires its own FWP to be instantiated. The new FWP receives the incoming packet, produces a reply, and then terminates, representing a worst case churn scenario. Figure 8 shows a response time CDF for EdgeOS under different client arrival patterns. The results show that even when a new client arrives every millisecond, 90% of requests are serviced within 50 microseconds. Although we have not been able to successfully run the LightVM software on our testbed, we note that their paper reports a 90th percentile response time of 20 milliseconds (more than 400X worse) with clients arriving 10 times less frequently (10ms interval). The EdgeOS performance advantage comes from our extremely lightweight FWP abstraction and our template cache that allows nearly instant instantiation.

Figure 8. EdgeOS just in time service instantiation for mobile clients with varying client inter-arrival rates.

Figure 9. Routing and processing latency for routing netperf traffic for an increasing number of clients.
Multi-Tenancy and Customer Isolation. An important job of edge-cloud systems is acting as a middlebox to route a subset of requests to the cloud. Figure 9 depicts the latency of processing and routing requests between netperf client and server machines for an increasing number of concurrent clients. We use three nodes, two running netperf clients and servers, and the third running EdgeOS or ONVM in the middle to act as the middlebox. The systems run either a single firewall to filter flows or a 2 FWP chain of firewall plus monitor, all implemented in Click, to further maintain statistics about flows. Each customer is serviced by its own separate firewall or chain, and thus is isolated from the others. We measure the middlebox latency overhead (i.e., the added cost versus direct client/server connections from Figure 1) as we increase the number of clients, and thus the number of FWPs (EdgeOS) and Network Functions (NFs in ONVM).

Though ONVM is a highly optimized middlebox infrastructure, it relies on containers and expensive coordination mechanisms between NFs and the management layer. Because of this, ONVM cannot scale past around 820 containers or 410 chains, and the added latency rises quickly with each new client. FWPs enable the system to scale past 2000 customers with an average increase in latency of only around 0.3 µs per additional client. Chaining in EdgeOS adds negligible latency overhead thanks to our efficient scheduler notification and context switches, while ONVM sees an increasing gap since it relies on Linux's more heavyweight futexes and its underlying scheduling.

Figure 10. Single memcached instance on one core.

Figure 11. Multiple memcached instances (1 per client) on 16 cores.
Finally, we evaluate how EdgeOS can provide a platform for low latency endpoint applications. We implement an FWP capable of parsing memcached UDP requests and use it to replace the standard socket interface in the memcached server. The EOS controller can then be used to map incoming requests either to a single memcached FWP (e.g., representing an edge cloud data cache) or one FWP per client (e.g., representing private data stores for edge-connected IoT devices). We compare EdgeOS against Linux, either using a single memcached server or multiple. Our workload, inspired by [29], uses 135 byte value sizes and a 95% get, 5% set request mix generated by the mcblaster client.
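For reference, the sketch below shows the first step such an FWP performs on each datagram: stripping the 8-byte frame header that memcached's UDP protocol prepends (request id, sequence number, datagram count, reserved) before handing the ASCII command to the cache logic. The function and structure names are our own; the header layout follows the memcached protocol description.

/* Sketch of parsing the memcached UDP frame header inside an FWP. */
#include <stdint.h>
#include <stddef.h>

struct mc_udp_hdr {
	uint16_t req_id;     /* opaque id echoed back to the client   */
	uint16_t seq;        /* datagram sequence number within reply */
	uint16_t n_dgrams;   /* total datagrams in this message       */
	uint16_t reserved;   /* must be zero                          */
};

/* Returns a pointer to the memcached command text, or NULL if the
 * datagram is too short to contain a frame header. Fields are assembled
 * byte-by-byte since they are in network (big-endian) byte order. */
static const char *mc_udp_payload(const void *dgram, size_t len,
                                  struct mc_udp_hdr *out)
{
	const uint8_t *b = dgram;

	if (len < sizeof(struct mc_udp_hdr)) return NULL;
	out->req_id   = (uint16_t)((b[0] << 8) | b[1]);
	out->seq      = (uint16_t)((b[2] << 8) | b[3]);
	out->n_dgrams = (uint16_t)((b[4] << 8) | b[5]);
	out->reserved = (uint16_t)((b[6] << 8) | b[7]);
	return (const char *)(b + sizeof(struct mc_udp_hdr));
}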
Single instance. Figure 10 shows the throughput and latency when all clients connect to a single memcached server instance pinned to one core. We use a 16-core server as the client, running one mcblaster process per core to ensure the client will not be the bottleneck. Each client process sends requests at a configurable rate and we report the aggregate throughput and average latency of successful requests (i.e., dropped requests do not impact latency). From Figure 10 we see that EdgeOS can support a throughput of up to 1.4 million requests per second, a nearly 5X increase compared to Linux. The response time of EdgeOS is also substantially lower than Linux, and it can handle 8X the client request rate before seeing an increase in latency. With very low client request rates, both systems perform similarly because EdgeOS cannot take advantage of batching. Since Linux is not able to keep up, it drops a large number of requests, e.g., 5.2% at a 320K req/sec client rate. In contrast, EdgeOS does not see any request drops at a 1.2M req/sec client rate.
Multiple instances.
We next run a scalability test where each client is paired with its own memcached instance, running either as a Linux process or as an FWP. The Linux processes are started in advance, whereas the FWPs must be activated from the cache on the first request from a client. Each client sends at a fixed rate of 10 Mbit/s, and the clients are distributed across four hosts to prevent them from being the bottleneck. We distribute the memcached server instances evenly across the available cores of the host server: for Linux all 16 cores are available, whereas for EdgeOS only 12 cores run FWPs and 4 are used for the system services. As we increase the number of clients, the aggregate request rate rises, with Linux hitting its peak throughput at 300 memcached clients and servers. EdgeOS is able to scale substantially further, hitting a maximum throughput of over 4M req/sec with 800 clients. The Linux server, overwhelmed by the number of memcached processes, has a response time of nearly 1 second, whereas EdgeOS maintains an average latency below 1 millisecond for up to 600 memcached instances. From the latency CDF, we observe that even with only 100 memcached instances Linux has much higher tail latency than EdgeOS, and that with 800 instances Linux's tail latency is more than three orders of magnitude worse. Keep in mind that these latency metrics ignore dropped requests: with 800 instances, EdgeOS drops 13% of requests, whereas Linux drops 66%.
TCP vs UDP.
Our memcached implementation is based on UDP since EdgeOS does not provide a TCP stack. Adding a high-performance DPDK-based TCP stack [19] would be straightforward, and we expect the performance difference between Linux and EdgeOS to grow even larger in that case. In the current UDP implementation, the high drop rate seen in Linux has no impact on its throughput or latency, but with TCP these drops would trigger congestion control and retransmissions, leading to even worse performance.
Scalable multi-tenant isolation.
Significant research addresses the increasing churn seen in serverless computing [21, 24, 26, 30] by decreasing the startup and teardown costs of virtual machines. Lightweight systems such as unikernels [25, 26] further increase the agility of these systems. In contrast, EdgeOS is motivated by the potentially enormous churn and large-scale isolation requirements of the edge cloud, which provides service to transient mobile and IoT devices. The FWP abstraction, activation based on the FWP-chain cache, and message pools for effective communication provide low-overhead isolation that can handle this unprecedented churn. Denali [49] separates the protection provided by a VMM from the abstractions within the VM, and enables lightweight VM contexts that scale from tens to low hundreds of VMs. EdgeOS focuses on extremely fast FWP activation times for on-the-fly instantiation, and on MMA-coordinated communication through chains of FWPs to enable service composition from multiple tenants. Multiple projects have increased the efficiency of containers by specializing the environment for more efficient boot-up. Cntr [45] includes only the application-specific context in a container, while SOCK [31] specializes the container to use efficient kernel operations and uses a Zygote mechanism paired with a cache to accelerate container creation for stateless computations. For isolated edge computation instantiation, EdgeOS compares favorably to forking minimal Linux processes (two orders of magnitude faster start time), which is the lower bound for many such techniques. These projects have startup latencies in the milliseconds, versus FWPs in the tens of microseconds. Additionally, due to FWP optimizations, EdgeOS also maintains significantly lower edge application latencies at scale than Linux (hundreds of microseconds vs. a second for memcached).
Lightweight isolation.
Wedge [4], Light-weight Contexts [23], and SpaceJMP [10] expand the UNIX interface to include lightweight facilities for controlling and changing protection domains. Similarly, Dune [2] uses hardware virtualization support to provide user-level control over page tables, and dIPC [46] uses hardware support to bypass the kernel during inter-protection-domain communication. EdgeOS instead relies on a highly optimized µ-kernel's core support for secure and efficient control-flow management, protection domains, and capability-based access control. We target abstractions that support immense churn rates, and efficient communication with complete isolation via the MMA. To efficiently use the limited resources in the edge cloud, EdgeOS leverages this support to scale to more than two thousand FWPs in less than 1GB of RAM while maintaining line-rate communication.
User-level, high-performance networking.
User-level network processing has long been proposed [47] to better utilize hardware and reach line-rate throughput. Isolation is provided when it is paired with early demultiplexing of network packets [12, 44]. Shared memory for zero-copy communication and batched processing have pushed these techniques into Gb-level networking [3, 16, 39]. DPDK and other kernel-bypass techniques have also pushed middlebox network function processing effectively into VMs [27] and containers [50]. EdgeOS expands on these techniques by integrating them with large-scale multi-tenancy via the MMA, and with the strong isolation of FWPs. NetBricks [33] implements network processing functions in a memory-safe language (Rust), thus relying on software isolation within a single thread. EdgeOS effectively uses the parallelism of the underlying hardware and relies on the MMA to maintain memory safety, but it also provides temporal isolation by executing all FWPs in separate threads that are explicitly scheduled by the run-time.
Hardware NIC demultiplexing.
Hardware-based early demultiplexing of networking packets has enabled isolated, high-performance library-based system services [36]. While this avoids the use of shared memory pools, it relies on network hardware support for multiple queues to isolate principals. Such support is limited, e.g., the common Intel 82599 chipset for 10Gbps NICs only supports up to 128 queues [18]. Intelligent NICs take this idea further by supporting demultiplexing with higher fidelity [42]. EdgeOS supports the high level of scalability required for the multi-tenant edge cloud, so it uses software techniques to safely demultiplex packets by devoting cores to act as MMAs, without relying on specialized hardware. Our results show that the system can maintain line rate despite using these software accelerators.
The increasing prevalence of mobile computation and the Internet of Things requires both scalable isolation facilities for multi-tenancy in the edge and the agility to handle high churn. This paper has described EdgeOS, an OS for edge cloud computation that introduces a Featherweight Process abstraction for low-overhead isolation, paired with a cache of post-initialization checkpointed FWP-chains to provide the microsecond-scale activation times necessary to handle high churn. Isolation is facilitated by a specialized core devoted to accelerating the movement of messages between FWPs while maintaining isolation. We show that EdgeOS provides more than a 3.8X reduction in ping latency and more than a 2X throughput increase compared to ClickOS (a system that also provides isolated computation) for middlebox computations. More importantly, EdgeOS can create FWPs for client computation in 25-50 microseconds, even when they are created every millisecond, and can scale to over 2000 FWPs while maintaining low latency, even with a very limited amount of memory. For edge applications like memcached, EdgeOS achieves more than three orders of magnitude lower latency when running over 300 server instances simultaneously. We believe that EdgeOS paves the way for closely integrating the edge cloud into, and augmenting the capabilities of, the growing population of mobile and embedded devices.
References
[1] Gaurav Banga, Peter Druschel, and Jeffrey C. Mogul. 1999. Resource containers: a new facility for resource management in server systems. In OSDI '99: Proceedings of the Third Symposium on Operating Systems Design and Implementation. USENIX Association, Berkeley, CA, USA, 45–58.
[2] Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. 2012. Dune: Safe User-level Access to Privileged CPU Features. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12), Hollywood, CA, USA, October 8-10.
[3] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI).
[4] Andrea Bittau, Petr Marchenko, Mark Handley, and Brad Karp. 2008. Wedge: Splitting Applications into Reduced-privilege Compartments. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI).
[5] Jeffrey S. Chase, Miche Baker-Harvey, Henry M. Levy, and Edward D. Lazowska. 1992. Opal: A Single Address Space System for 64-Bit Architectures. Operating Systems Review 26, 2 (1992), 9. citeseer.ist.psu.edu/58003.html
[6] Jack B. Dennis and Earl C. Van Horn. 1983. Programming semantics for multiprogrammed computations. Commun. ACM 26, 1 (1983), 29–35. https://doi.org/10.1145/357980.357993
In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP).
[9] Peter Druschel and Larry L. Peterson. 1993. Fbufs: A High-Bandwidth Cross-Domain Transfer Facility. In Symposium on Operating Systems Principles. 189–202.
[10] Izzat El Hajj, Alexander Merritt, Gerd Zellweger, Dejan Milojicic, Reto Achermann, Paolo Faraboschi, Wen-mei Hwu, Timothy Roscoe, and Karsten Schwan. 2016. SpaceJMP: Programming with Multiple Virtual Address Spaces. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[11] Kevin Elphinstone and Gernot Heiser. 2013. From L3 to seL4: What have we learnt in 20 years of L4 microkernels? In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP).
[12] Dawson R. Engler, Frans Kaashoek, and James O'Toole. 1995. Exokernel: An Operating System Architecture for Application-Level Resource Management. In Proceedings of the 15th ACM Symposium on Operating System Principles. ACM, Copper Mountain Resort, Colorado, USA, 251–266.
[13] Phani Kishore Gadepalli, Robert Gifford, Lucas Baier, Michael Kelly, and Gabriel Parmer. 2017. Temporal Capabilities: Access Control for Time. In Proceedings of the 38th IEEE Real-Time Systems Symposium.
[14] Diwaker Gupta, Ludmila Cherkasova, Rob Gardner, and Amin Vahdat. 2006. Enforcing Performance Isolation Across Virtual Machines in Xen. In Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware (Middleware '06). Springer-Verlag New York, Inc., New York, NY, USA, 342–362. http://dl.acm.org/citation.cfm?id=1515984.1516011
[15] Sangjin Han, Keon Jang, Aurojit Panda, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy. 2015. SoftNIC: A Software NIC to Augment Hardware. Technical Report UCB/EECS-2015-155. EECS Department, University of California, Berkeley.
[16] Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. 2012. MegaPipe: A New Programming Interface for Scalable Network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation.
[17] Yang Hu, Mingcong Song, and Tao Li. 2017. Towards "Full Containerization" in Containerized Network Function Virtualization. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, New York, NY, USA, 467–481. https://doi.org/10.1145/3037697.3037713
[19] EunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. 2014. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). USENIX, Seattle, WA, 489–502.
[20] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. 2000. The Click modular router. ACM Transactions on Computer Systems 18, 3 (August 2000), 263–297.
[21] Horacio Andrés Lagar-Cavilla, Joseph Andrew Whitney, Adin Matthew Scannell, Philip Patchin, Stephen M. Rumble, Eyal de Lara, Michael Brudno, and Mahadev Satyanarayanan. 2009. SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing. In Proceedings of the 4th ACM European Conference on Computer Systems (EuroSys).
[22] J. Liedtke. 1995. On Micro-Kernel Construction. In Proceedings of the 15th ACM Symposium on Operating System Principles (SOSP '95), Copper Mountain Resort, Colorado, USA, December 3-6.
[23] James Litton, Anjo Vahldiek-Oberwagner, Eslam Elnikety, Deepak Garg, Bobby Bhattacharjee, and Peter Druschel. 2016. Light-weight Contexts: An OS Abstraction for Safety and Performance. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI).
[24] Anil Madhavapeddy, Thomas Leonard, Magnus Skjegstad, Thomas Gazagnaire, David Sheets, Dave Scott, Richard Mortier, Amir Chaudhry, Balraj Singh, Jon Ludlam, Jon Crowcroft, and Ian Leslie. 2015. Jitsu: Just-in-time Summoning of Unikernels. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation (NSDI '15). USENIX Association, Berkeley, CA, USA, 559–573. http://dl.acm.org/citation.cfm?id=2789770.2789809
[25] Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. 2013. Unikernels: Library Operating Systems for the Cloud. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '13). ACM, New York, NY, USA, 461–472. https://doi.org/10.1145/2451116.2451167
[26] Filipe Manco, Costin Lupu, Florian Schmidt, Jose Mendes, Simon Kuenzer, Sumit Sati, Kenichi Yasukata, Costin Raiciu, and Felipe Huici. 2017. My VM is Lighter (and Safer) Than Your Container. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP).
[27] Joao Martins, Mohamed Ahmed, Costin Raiciu, Vladimir Olteanu, Michio Honda, Roberto Bifulco, and Felipe Huici. 2014. ClickOS and the Art of Network Function Virtualization. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI).
[28] Mark S. Miller, Ka-Ping Yee, and Jonathan Shapiro. 2003. Capability Myths Demolished. Technical Report SRL2003-02. Johns Hopkins University Systems Research Laboratory, Mountain View, CA (USA).
[29] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. 2013. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). USENIX, Lombard, IL, 385–398.
[30] Vlad Nitu, Pierre Olivier, Alain Tchana, Daniel Chiba, Antonio Barbalace, Daniel Hagimont, and Binoy Ravindran. 2017. Swift Birth and Quick Death: Enabling Fast Parallel Guest Boot and Destruction in the Xen Hypervisor. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/3050748.3050758
[31] Edward Oakes, Leon Yang, Dennis Zhou, Kevin Houck, Tyler Harter, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. 2018. SOCK: Rapid Task Provisioning with Serverless-Optimized Containers. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC '18).
[32] Shoumik Palkar, Chang Lan, Sangjin Han, Keon Jang, Aurojit Panda, Sylvia Ratnasamy, Luigi Rizzo, and Scott Shenker. 2015. E2: A Framework for NFV Applications. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP).
[33] Aurojit Panda, Sangjin Han, Keon Jang, Melvin Walls, Sylvia Ratnasamy, and Scott Shenker. 2016. NetBricks: Taking the V out of NFV. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI).
[34] Gabriel Parmer and Richard West. 2008. Predictable Interrupt Management and Scheduling in the Composite Component-based System. In Proceedings of the 29th IEEE Real-Time Systems Symposium (RTSS '08), Barcelona, Spain, November 30 - December 3.
[35] Gabriel Parmer and Richard West. 2011. HiRes: A System for Predictable Hierarchical Resource Management. In Proceedings of the 17th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS).
[36] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2015. Arrakis: The Operating System Is the Control Plane. ACM Trans. Comput. Syst. 33, 4 (Nov. 2015).
[37] Larry Peterson. 2015. CORD: Central Office Re-architected as a Datacenter. Open Networking Lab white paper (2015).
[38] Daniel Price and Andrew Tucker. 2004. Solaris Zones: Operating System Support for Consolidating Commercial Workloads. In Proceedings of the 18th USENIX Conference on System Administration (LISA).
[39] Luigi Rizzo. 2012. Netmap: A Novel Framework for Fast Packet I/O. In Proceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC).
[40] J. Saltzer and M. Schroeder. 1975. The protection of information in computer systems. Proceedings of the IEEE 63, 9 (1975).
[41] Jonathan S. Shapiro, Jonathan M. Smith, and David J. Farber. 1999. EROS: A Fast Capability System. In Proceedings of the 17th ACM Symposium on Operating System Principles (SOSP '99), Kiawah Island Resort, South Carolina, USA, December 12-15.
[42] Naveen Kr. Sharma, Antoine Kaufmann, Thomas Anderson, Changhoon Kim, Arvind Krishnamurthy, Jacob Nelson, and Simon Peter. 2017. Evaluating the Power of Flexible Packet Processing for Network Resource Allocation. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (NSDI).
[43] T. Taleb, K. Samdanis, B. Mada, H. Flinck, S. Dutta, and D. Sabella. 2017. On Multi-Access Edge Computing: A Survey of the Emerging 5G Network Edge Cloud Architecture and Orchestration. IEEE Communications Surveys & Tutorials 19, 3 (2017), 1657–1681. https://doi.org/10.1109/COMST.2017.2705720
[44] David Tennenhouse. 1989. Layered Multiplexing Considered Harmful. In Protocols for High-Speed Networks. North Holland, Amsterdam, 143–148.
[45] Jörg Thalheim, Pramod Bhatotia, Pedro Fonseca, and Baris Kasikci. 2018. Cntr: Lightweight OS Containers. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC '18).
[46] Lluís Vilanova, Marc Jordà, Nacho Navarro, Yoav Etsion, and Mateo Valero. 2017. Direct Inter-Process Communication (dIPC): Repurposing the CODOMs Architecture to Accelerate IPC. In Proceedings of the Twelfth European Conference on Computer Systems (EuroSys).
[47] Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. 1995. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles. ACM, 40–53.
[48] Qi Wang, Yuxin Ren, Matt Scaperoth, and Gabriel Parmer. 2015. Speck: A Kernel for Scalable Predictability. In Proceedings of the 21st IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS '15), Seattle, WA, USA, April 13-16.
[49] A. Whitaker, M. Shaw, and S. Gribble. 2002. Denali: Lightweight Virtual Machines for Distributed and Networked Applications. citeseer.ist.psu.edu/whitaker02denali.html
[50] Wei Zhang, Guyue Liu, Wenhui Zhang, Neel Shah, Phillip Lopreiato, Gregoire Todeschi, K.K. Ramakrishnan, and Timothy Wood. 2016. OpenNetVM: A Platform for High Performance Network Service Chains. In Proceedings of the 2016 ACM SIGCOMM Workshop on Hot Topics in Middleboxes and Network Function Virtualization. ACM. http://faculty.cs.gwu.edu/timwood/papers/16-HotMiddlebox-onvm.pdf