SEUSS: Rapid serverless deployment using environment snapshots
James Cadden, Thomas Unger, Yara Awad, Han Dong, Orran Krieger, Jonathan Appavoo
Department of Computer Science, Boston University
{jmcadden,tommyu,awadyn,handong,okrieg,jappavoo}@bu.edu
Serverless computing has been heralded as a general-purpose programming model that enables easy access to on-demand cloud computation [2]. This is best illustrated by the Function-as-a-Service (FaaS) model, wherein application logic is composed of short, high-level scripts (aka serverless functions) that are deployed automatically by a remote FaaS platform in response to demand. In serverless functions, application developers find a fine-grain programming primitive that is easy to use, naturally parallelizable, and can grow and shrink across arbitrary scales.

Work remains to bring serverless computing out of its current computational niche and toward a larger superset of general computing. We imagine a FaaS platform wherein functions start as quickly and pack as densely as processes. To achieve this, we assert that a FaaS platform must be able to support two broad categories of computation: first, the common case of repeatable, loop-driven sequences and fixed-degree parallelism; second, elastic computation characterized by request bursts, large-scale parallel fanouts, invocation diversity, and singleton execution.

Modern FaaS systems perform well in the case of repeat executions when function working sets stay small. However, these platforms are less effective when applied to more complex, large-scale, and dynamic workloads [9, 13]. At the root of the problem is a lack of elasticity in the underlying operating system mechanisms used to deploy serverless function executions [10]. While great for maintaining individualized execution state for rapid re-invocation of functions, these coarse-grain isolation mechanisms are slow to construct, have large memory footprints, and are burdened by legacy software bottlenecks.

In this paper, we introduce a new system-level approach for rapidly deploying serverless functions. Through our approach, we demonstrate orders-of-magnitude improvements both in function start times and in function cacheability, which improves common-case re-execution paths while also unlocking previously-unsupported elastic FaaS workloads.

Our model, SEUSS (serverless execution via unikernel snapshot stacks), combines the following three OS techniques to create a novel approach for deploying isolated function executions:

1. Functions are deployed within unikernels equipped with a high-level language interpreter and a POSIX-like environment with common features like networking and a filesystem. Unikernels provide us with strong isolation of untrusted guest execution [23]. Furthermore, the unikernel serializes the execution environment into a single flat address space, enabling our use of unikernel snapshots.

2. Initialized unikernel environments are captured in-memory as snapshot images, which are then used as templates to construct additional environments for new function executions. We use pre-initialized environments to dramatically reduce the start time for deploying functions.

3. Unikernel environments undergo aggressive sharing of system state, which drastically reduces the memory footprint of snapshot images and unikernel instances, enabling a high-density caching strategy for fast re-deployment of function executions from pre-initialized state.

Unikernels have been previously explored as a potential solution for rapid serverless deployments [10, 14]. In these cases, the unikernels had tiny memory footprints and booted in milliseconds, but only supported a highly-restricted execution model.
In contrast, a unikernel that supports a POSIX-like environment with a full-featured language interpreter is both larger (100+ MB) and slower to boot (~300 ms). Thus, it is through unikernel snapshots that SEUSS achieves its advantage. Furthermore, our design reveals a surprising contribution: it is possible to enable a high-performance FaaS backend from general-purpose software combined with a simple set of OS techniques.

To demonstrate the advantages of our approach, we have implemented SEUSS within a prototype kernel for virtualized x86 environments. When compared with the standard approach of deploying functions inside of Linux containers, SEUSS demonstrates: (1) a 50x reduction in the minimal deployment time for an uncached function; (2) 17x more concurrent function environments active on a node. When our SEUSS prototype is used as a drop-in replacement for Linux in a multi-node Apache OpenWhisk deployment, we demonstrate: (1) better platform throughput when running uncached workloads; (2) support for large-scale invocation bursts that overwhelm the unmodified platform.
Figure 1: Stages of a function invocation (construct environment, initialize runtime, import code, generate bytecode, import run arguments, run), showing the cold, warm, and hot paths and their respective cache points (the runtime cache point and the function cache point).
In the FaaS model, clients upload the source code of their functions to the FaaS platform, and, in return, the client is given a handle used to signal a run of the function on a given set of inputs. Once signaled, the FaaS platform is responsible for quickly scheduling and deploying the corresponding function code, a process we call a function invocation (Figure 1). To ensure safe execution on a multi-tenant FaaS platform, function executions take place inside of isolated containers or lightweight virtual machine instances, which are typically dedicated to the function or client application [2].

A general technique taken by FaaS platforms to improve function start times is to maintain caches of initialized runtime environments (i.e., pools of idle VMs or containers) that sit ready for use in an upcoming function deployment. From this perspective, an invocation cold start is effectively a cache miss, as it occurs when no environment exists ready for use in the invocation. In a cold start, the complete set of initialization steps is required before the function can start running. This adds multiple seconds to the deployment time, as a new container or VM instance must be created.

Comparatively, an invocation warm start is a cache hit on an isolated runtime environment ready for the function sources to be imported. In the case of interpreted runtimes like Node.js and Python, an intermediate compilation step is required to process the imported source code into executable bytecode prior to it being run (adding further overheads to the invocation paths). The overheads of importing and compiling the source code are specific to the function being invoked. However, even a simple "hello world" function is likely to add a few hundred milliseconds of overhead (and possibly much more for functions which import complex libraries [8]).

Thus, the fastest and most optimal hot start invocation is enabled using an environment that has been fully initialized with the runtime and interpreted bytecode specific to the function being invoked. This can be accomplished by caching idle environments after they have been used for a function execution, such that they may be reused across future invocations of that given function. In this case, a new set of input arguments can be imported into the environment, and the function can start running in a few milliseconds (or less).
Motivation
To better understand the opportunities that exist for specialization at the system level, we next highlight three characteristics of the serverless/FaaS model that simplify the role of the OS and unlock new opportunities for performance optimizations in function start times and high-density caching of function state.

First, the operational shift from the application developer onto the remote FaaS platform enables the platform to adopt new and highly-specialized approaches for how serverless functions are deployed and isolated. For example, serverless deployment strategies have been proposed that use lightweight virtualization [6, 14], lightweight containers [5], application-level isolation [8], and language-level isolation.

Second, in the serverless model, function executions are defined to be independent, transient, and stateless. For example, there is no implicit sharing across parallel executions, and local modifications are assumed valid for only as long as the particular execution is alive. This simplified execution model avoids many of the features (and much of the complexity) provided by standard operating system mechanisms. Instead, it introduces the opportunity for designing simplified OS mechanisms that focus entirely on the rapid deployment of independent, short-lived executions.

Third, the FaaS platform defines a small set of language runtimes that function code must be written to. This can be seen in modern FaaS platforms that maintain "warm pools" of idle VMs that have been pre-initialized with a particular runtime environment [20]. A limited number of supported runtimes means that all function execution is derived from a small set of base environments and configurations. The implication of this is that a large amount of state is likely to be identical across parallel executions of functions. Thus, an opportunity exists for fine-grain sharing of kernel state, filesystem state (e.g., runtime binaries and shared libraries), and all the way down to the pages of the runtime processes themselves.

In SEUSS, we exploit these three opportunities in a new deployment strategy designed to dramatically shorten function start times: serverless execution via unikernel snapshots. The use of unikernels provides us with a strong isolation technique together with a simple self-contained abstraction, thereby enabling new performant techniques for the capture and deployment of unikernel snapshots. Furthermore, the single-address-space representation of unikernels allows us to make extensive use of sharing between similar environments, thus decreasing the memory footprints of cached function state.
SEUSS is not alone in arguing for system-level solutions towards improved FaaS performance. In particular, SOCK [5] and SAND [8] similarly strive for faster start times and improved cacheability through the use of lightweight isolation strategies and copy-on-write sharing across parallel executions. What sets SEUSS apart from these similarly-acronym'd systems is our attempt at a general solution that is readily applicable across language runtimes and function invocation types. For example, deploying from snapshots can be applied to any environment run within a unikernel. In addition, sharing across unikernel snapshots provides an approach for page saving that is more comprehensive than that of forking processes.

The underlying techniques of SEUSS have similarities (and applicability) to use cases beyond serverless computing. For example, edge computing benefits from the safe, low-latency, event-triggered deployment of unikernels [12]. Execution snapshotting has been previously explored in the context of process migration [1], fast reinitialization [21], and replay debugging [18]. The isolation properties of a libraryOS deployed behind a narrow domain interface is a technique shown to be effective against various threat models [3, 6, 7, 15, 22, 24].

The system-level approach of SEUSS has been motivated by previous cloud research systems that tailor their design to the characteristics of an application, with the goal of significantly increasing platform utilization and throughput [4, 11, 19, 22]. In particular, our use of page-level sharing via snapshot stacks is most similar to page-level sharing between VMs in Potemkin [19]. Much akin to the distributed paging techniques employed by Snowflock [11] and Kaleidoscope [4], we view the natural evolution of our cache model to be expansion across a cluster of compute nodes, thus introducing distribution and replication to SEUSS. At which point we will be obliged to rename it to DR-SEUSS.
SEUSS Execution Model
Three techniques underpin the SEUSS execution model.
Unikernel Contexts (UCs) are used to circumscribe all necessary function state into a flat address space. Snapshots are used to capture and deploy functions at arbitrary points during their execution. Snapshots enable fast function deployment and amortize deterministic startup costs. The resulting fast function start times introduce elasticity into the FaaS platform, especially for the execution of uncached functions. Furthermore, Snapshot Stacks enable fine-grain memory sharing between Snapshots with shared lineages. This dramatically increases the FaaS platform's ability to cache function-specific environments, ready for immediate reuse.
Nearly every operating system has a representation for the state of a running program; UCs comprise our system abstraction for isolating running functions as manageable system objects. UCs are wrappers for 1) the unikernels that hold function state, as well as 2) a small set of system meta-state for handling them.

Unikernels are natively executable binaries that contain an application, its execution dependencies, and a kernel-level support library (also called a LibraryOS), all linked together into a single address space. For use in FaaS executions, every UC instance contains a unikernel equipped with a full-featured language runtime environment. As opposed to a minimal-footprint or domain-specific unikernel, we chose to adopt a feature-rich unikernel that provides a POSIX-like environment and a common set of features like a full network stack, shared libraries, and a root filesystem. This software stack is not optimized for speed or size; in fact, it takes hundreds of milliseconds to boot a Node.js unikernel ready to receive input over the network. But performance is won back through our use of unikernel snapshots.

A critical advantage of a unikernel approach is that its address space contains the vast majority of the mechanisms required for executing arbitrary function code. Thus, the amount of execution-specific state that must be managed outside the unikernel is minimal. The implication of this is a major simplification for snapshotting: the function state is the address space of the unikernel. There is no marshalling, quiescing, or synchronizing around in-kernel data structures, enabling snapshotting and deployments to be sub-millisecond operations (§7.3).

A Snapshot is an immutable data object which captures a time point in the lifetime of a UC. From a traditional OS perspective, snapshots are read-only memory images packaged together with the register state and the system-level meta-state of an executing process. In SEUSS, Snapshots act as templates from which many additional UCs can be deployed. By capturing snapshots at strategic points along commonly-used code paths, we significantly shorten the initialization steps required to deploy new executions.

Furthermore, Snapshots enable powerful anticipatory optimization. This is done by snapshotting after speculatively executing code paths likely to be used during UC execution. The goal here is to accumulate a significant amount of allocated state into the "base" snapshot, reducing the memory footprint of the UCs branched from it. In our system, we use these techniques to exercise the network stack, the interpreter compiler, and the interpreter execution path. By doing initialization and "warm-up" work prior to capturing the base Snapshot, SEUSS significantly reduces the latency and memory required to store a Snapshot and to run a UC while maintaining identical external effects.
In SEUSS, we optimize Snapshots by factoring out the common execution state into Snapshot Stacks, enabling multiple Snapshots and UCs to be backed by the same physical memory pages. This dramatically decreases the memory footprint of Snapshots and UCs and increases the amount that can be stored on a single machine. Snapshot Stacks are enabled by capturing into a Snapshot only the pages modified since the UC was created. For this, we use traditional copy-on-write semantics, enabled by hardware, to track recently-changed memory state. In this way, a Snapshot Stack is an ordered sequence of Snapshots, with each entry acting as a page-level diff of the previous. For example, a single heavyweight Snapshot of the JavaScript interpreter can be used as a base Snapshot, with further Snapshots containing only the function-specific memory state.

To better understand the advantage of Snapshot Stacks, consider the following example: a FaaS platform needs snapshots to capture the fully-initialized state of JavaScript functions Foo() and Bar(). Armed with only a snapshot mechanism, the platform requires two UC snapshots, one for each function. With Snapshot Stacks, three snapshots are required: one for the initialized JavaScript runtime unikernel, and one for each specific function. The JavaScript snapshot, which is created ahead of time, is used to deploy UCs for each of the two functions at the time of invocation. As each UC moves forward, its writes to memory accumulate a delta relative to the read-only state of the base snapshot it was deployed from. Therefore, when the time comes to capture the two function-specific snapshots, only the difference between the base snapshot and the memory of the targeted UC will be copied into the new Snapshot image. This Snapshot now constitutes a Snapshot Stack, representing both the base Snapshot and the function Snapshot. In our SEUSS OS prototype, Snapshot Stacks account for a savings of over 100 MB per snapshot for functions deployed using a Node.js unikernel (§7, Table 2).
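To make the layering concrete, the following minimal C++ sketch (our illustration, not the SEUSS sources; all names are hypothetical) models a Snapshot as an immutable page-level diff plus a pointer to its parent, so that resolving a page walks down the stack:

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

using PageNum  = uint64_t;             // virtual page number within the UC address space
using PageData = std::vector<uint8_t>; // contents of one 4 KiB page

// A Snapshot is immutable: the pages written since its parent, plus
// captured register state and system meta-state (omitted here).
struct Snapshot {
    std::shared_ptr<const Snapshot> parent;            // nullptr => base (boot image)
    std::unordered_map<PageNum, PageData> dirty_pages; // page-level diff of the parent

    // Resolve a page by walking down the stack of diffs.
    const PageData* find(PageNum n) const {
        for (const Snapshot* s = this; s != nullptr; s = s->parent.get()) {
            auto it = s->dirty_pages.find(n);
            if (it != s->dirty_pages.end()) return &it->second;
        }
        return nullptr; // page comes from the unikernel binary itself
    }
};
```

Under this layout, the function-specific snapshots for Foo() and Bar() each own only their small diffs and share the large runtime snapshot through the parent pointer, which is why the per-function memory cost collapses.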
SEUSS in action!

We next describe the invocation procedures (hot, warm, and cold) for starting a serverless function using SEUSS. Figure 2 depicts the high-level procedure of SEUSS on a dedicated compute node. Once a request is received, the local system has a choice of three pathways to process the invocation, the choice of which is determined by the presence of cached state specific to the function being invoked; a sketch of this dispatch logic follows the path descriptions below. Cached state may be found in either the warm pool or the UC hot tub.

On SEUSS boot (label "B" in Figure 2), runtime Snapshots are taken for each interpreter during system boot. Each runtime Snapshot is of a runtime waiting for user source to be imported, so that it can be compiled and run on demand.
Cold path (label "C" in Figure 2): If no other cached function state exists, a runtime Snapshot is used to deploy a UC. The function source is imported and compiled by the runtime. Once the compilation is finished, a function-specific Snapshot is created ("S" in the figure) and placed into the warm pool. Finally, the function run arguments are imported and the execution of the function begins. Once finished, this UC can be destroyed, or optionally cached in the UC hot tub for reuse across further invocations of that function. In our prototype, a cold start of a NOP JavaScript function finishes in 8 ms.
Warm path (label "W" in Figure 2): When a function request hits in the warm pool cache, a new UC is created from a function-specific Snapshot. In this case, compilation is skipped, run arguments are accepted, and execution begins. As in the cold path, the used UC can then be deleted or optionally entered into the hot tub. The warm pool is comprised of immutable memory images (at most one per user function), from which any number of UCs may be launched (at any degree of parallelism). Nothing is "consumed" when a UC is created from a warm pool Snapshot. In our prototype, a warm start of a NOP JavaScript function finishes in 3 ms.
Hot path (label "H" in Figure 2): When a function request hits in the hot tub, an existing UC is taken out of the hot tub, the run arguments are imported into the UC, and execution begins. The UC may be re-entered into the hot tub upon completion. The hot tub works differently from the warm pool, as its entries are "live" UCs, not immutable Snapshots. Thus, a cached UC is consumed for the duration of a single execution. Further, UCs may accumulate significant memory state after multiple usages, so naturally, the hot tub has a shorter recommended occupancy than the warm pool. In our prototype, a hot start of a NOP JavaScript function finishes in less than 1 ms.
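The three-way dispatch just described can be summarized in a short sketch (hypothetical structures and names; the real SEUSS OS bookkeeping differs):

```cpp
#include <deque>
#include <memory>
#include <string>
#include <unordered_map>

struct UC {};        // a live unikernel context (hot tub entry)
struct Snapshot {};  // an immutable snapshot image (warm pool entry)

struct Caches {
    std::unordered_map<std::string, std::deque<std::unique_ptr<UC>>> hot_tub;
    std::unordered_map<std::string, std::shared_ptr<Snapshot>> warm_pool;
    std::shared_ptr<Snapshot> runtime_snapshot; // e.g., an initialized Node.js runtime
};

enum class Path { Hot, Warm, Cold };

// Choose the invocation path for function `fn`.
Path dispatch(Caches& c, const std::string& fn) {
    auto h = c.hot_tub.find(fn);
    if (h != c.hot_tub.end() && !h->second.empty())
        return Path::Hot;   // consume a live UC for this one execution
    if (c.warm_pool.count(fn))
        return Path::Warm;  // deploy a fresh UC from the function snapshot
    return Path::Cold;      // deploy from the runtime snapshot, import and
                            // compile sources, then capture a warm snapshot
}
```

Note the asymmetry the sketch makes explicit: a hot start consumes a live UC from the hot tub, whereas any number of UCs can be launched in parallel from a single warm pool Snapshot.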
In this section, we describe our implementation of the SEUSS model within a prototype kernel designed to run on the compute nodes within a large-scale FaaS platform architecture. In addition to the implementation details of UCs, Snapshots, and Snapshot Stacks, we also describe our prototype solution for scalable UC networking and how we integrate a SEUSS node into an Apache OpenWhisk cluster (for our evaluations in §8).
Figure 2: The high-level procedures of SEUSS on a dedicated compute node. An IO core receives requests and returns results; worker cores run UCs out of per-core UC caches. On boot ("B"), unikernel binaries are initialized and runtime snapshots are captured into the snapshot cache; the cold ("C"), warm ("W"), and hot ("H") paths enter the invocation lifetime (construct environment, initialize runtime, import sources, generate bytecode, import run arguments, run) at successively later points, with "S" marking the capture of a function snapshot.
Overview
1. The SEUSS OS is a lightweight multicore kernel that implements SEUSS in a native x86 virtualized environment.
2. To deploy new executions, SEUSS OS maintains a cache of Snapshots as well as per-core caches of idle UCs.
3. The software stack of a UC is implemented using the Rumprun unikernel linked together with an interpreted language runtime (Python or JavaScript) and an invocation driver.
4. UCs execute entirely in user mode (ring 3) with page-table-based hardware protection on top of a minimal domain interface.
5. A network layer masquerades traffic going in and out of UCs, allowing outgoing network connections to be initiated from within the guest functions.
Figure 2 depicts the high-level operations of SEUSS within our purpose-built operating system, SEUSS OS. The low-level kernel manages the local resources of the node, which are partitioned across worker cores dedicated to running UCs and IO cores dedicated to external networking. The low-level approach of SEUSS OS demonstrates the ability to selectively apply OS-level specialization to an existing FaaS platform by targeting only the nodes where function execution takes place (and leaving everything else unchanged). In this way, with SEUSS OS it is possible to construct a high-density execution plane out of a standard set of virtual machines.
As part of the SEUSS OS startup procedure, the base unikernels are booted from their ELF binary images and initialized. Next, a runtime snapshot is captured for each unikernel. At this point, the kernel registers itself with the FaaS platform and waits for invocation requests to arrive.
Invocation requests are received from a remote FaaS controller through an IO core. Requests are then placed onto a shared work queue that idle worker cores pull from. Once a request is pulled by a core, the core checks for cached state to accelerate the invocation, following the procedures outlined in "SEUSS in action!" above.

Once an idle UC is selected or a new UC is constructed from a Snapshot, the worker then connects to the running UC and issues commands. Initialized as part of the runtime bring-up, the invocation driver deploys an HTTP/REST endpoint within the UC that the worker connects to. This connection is used by the worker to import function code and arguments and to start the execution of the function. Once the execution is finished, the results are passed to the worker and the connection is closed. The UC, which now contains the interpreted bytecode and imports generated during execution, is returned to a blocked listening state, allowing the worker to quickly re-execute the function with a new set of arguments (i.e., a hot start).

Figure 2 shows the sizes (in MB) of different snapshot images and the relative deployment overheads of the invocation paths measured from within our prototype (full analysis is provided in §7).

Unikernels execute entirely in user mode (ring 3) with page-table-based hardware protection. A highly-restricted domain interface sits between the UC and the OS, which acts as a "narrow" attack surface that the untrusted UC has access to. The isolation properties of a libraryOS deployed behind a narrow domain interface is a technique shown to be effective against various threat models [3, 6, 7, 15, 22, 24].
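Putting the pieces together, the per-worker control flow described above amounts to the following loop (a sketch under assumed interfaces; pull_request, deploy_from_snapshot, and the other helpers are hypothetical stand-ins for the prototype's internal calls):

```cpp
#include <string>

struct Request { std::string fn_id, code, args; };
struct UC;

// Hypothetical stand-ins for the prototype's internal operations.
Request     pull_request();                               // block on the shared work queue
UC*         lookup_idle_uc(const std::string& fn);        // hot tub lookup
bool        have_function_snapshot(const std::string& fn);
UC*         deploy_from_snapshot(const std::string& fn);  // warm pool or runtime snapshot
void        import_code(UC* uc, const std::string& code); // cold path: compile + snapshot
std::string run(UC* uc, const std::string& args);         // import args over the driver's
                                                          // HTTP/REST endpoint and execute
void        park_or_destroy(UC* uc);                      // optionally re-enter the hot tub
void        reply(const std::string& result);             // results return via the IO core

void worker_core_loop() {
    for (;;) {
        Request r = pull_request();
        UC* uc = lookup_idle_uc(r.fn_id);           // hot start: reuse a live UC
        if (uc == nullptr) {
            bool warm = have_function_snapshot(r.fn_id);
            uc = deploy_from_snapshot(r.fn_id);     // warm or cold start
            if (!warm) import_code(uc, r.code);     // cold: compile, capture warm snapshot
        }
        reply(run(uc, r.args));
        park_or_destroy(uc);
    }
}
```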
The SEUSS x86_64 prototype is designed to run natively inside of a kvm-qemu virtual machine. The bottom-most layer of the prototype uses the EbbRT LibraryOS framework [17], which provides a multicore event-driven runtime, bootstrapping logic, a virtio paravirtualized NIC, and a native TCP/IP network stack. On top of this foundation, we have built the components responsible for implementing the SEUSS execution model and enabling our use of cached Snapshots and the safe deployment of functions in UCs. The EbbRT framework has been extended to support multiple protection modes via the x86_64 sysret and syscall instructions.
Figure 3: Vertical slice of the single-core system stack. A Rumprun unikernel hosting Node.js, the V8 JavaScript engine, the invocation driver (invocationdriver.js), and user function code (<*.js>) runs over the Solo5 interface in ring 3; the SEUSS operating system, built on the EbbRT LibraryOS with virtio TX/RX queues, runs in ring 0 inside a KVM-QEMU VCPU. The snapshot capture region encloses the unikernel state.
The SEUSS prototype uses the Rumprun unikernel linked with a port of Node.js or Python. The Rumprun unikernel provides a general-purpose POSIX-like execution environment based on the NetBSD kernel, which provides a common set of shared system libraries and a ramdisk filesystem [16]. Rumprun is deployed on top of the Solo5 unikernel monitor, which distills the low-level machine interface of the unikernel down to a minimal set of paravirtualized hypercalls used by the unikernel to access the outside world. The minimal interface of Solo5 was instrumental in enabling the simple and concise address space abstraction that acts as the foundation for UCs and Snapshots.
Next, we describe our technique for implementing Snapshots and Snapshot Stacks. Our prototype uses direct access to hardware page tables to capture Snapshots, deploy UCs, and enable fine-grain page-level sharing across "stacks" of Snapshots and UCs.
Crucial to our design was a mechanism to allow triggering the creation of a Snapshot from within the unikernel itself. Doing so allows us to capture a Snapshot at precise moments specified within the high-level invocation driver source code. Language-level snapshot creation also opens the door to future research into higher-level tooling that enables developers to directly instrument Snapshot points within their code, perhaps amortizing expensive data structure initialization.

We make use of language-level snapshotting by capturing new snapshots immediately prior to when the invocation driver blocks on the listening socket. When we deploy a new UC from a snapshot, the invocation driver begins ready to accept a connection over a local network port.

In our SEUSS prototype, the Snapshot trigger is an expedient hack involving an x86 debug register, which enables us to trigger a hardware exception prior to the execution of an instruction at a preconfigured linear address (which corresponds to a designated symbol within the language runtime included in the unikernel). When the exception occurs, execution switches into kernel mode and calls into a kernel handler that records the state of the UC into a new Snapshot. When finished, execution transitions back into user mode and continues from the instruction where the exception was triggered, entirely transparent to the running UC.
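As a rough illustration of this "expedient hack", the kernel arms an x86 hardware breakpoint on the chosen linear address. The DR0/DR7 encodings below follow the Intel SDM, but the surrounding code is our hypothetical sketch, not the SEUSS sources:

```cpp
#include <cstdint>

// Write DR0 (breakpoint linear address) and DR7 (debug control) from ring 0.
static inline void write_dr0(uint64_t v) { asm volatile("mov %0, %%dr0" :: "r"(v)); }
static inline void write_dr7(uint64_t v) { asm volatile("mov %0, %%dr7" :: "r"(v)); }

// Arm an instruction-execution breakpoint on `addr`, e.g., the address of the
// designated snapshot-trigger symbol inside the language runtime.
void arm_snapshot_trigger(uint64_t addr) {
    write_dr0(addr);
    // DR7: bit 0 (L0) locally enables breakpoint 0; R/W0 = 00 (bits 16-17)
    // selects "break on instruction execution", and LEN0 = 00 (bits 18-19)
    // is required for execution breakpoints, so a value of 1 suffices.
    write_dr7(1ull << 0);
}

// The #DB (vector 1) handler then runs in ring 0: it records the UC's address
// space and register file into a new Snapshot, arranges to skip the breakpoint
// on resume, and returns to the same instruction, transparent to the UC.
```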
Figure 4: Page sharing across Snapshots and UCs. Each layer (unikernel binary, runtime Snapshot, function Snapshot, unikernel context) owns only the pages it has modified within each region (.text, .init, .rodata, .data, .got, .bss, stack, heap); unmodified pages are references to the layer beneath.
The procedure for deploying function execution from a Snapshot starts with creating a new UC. The TLB is flushed and the root of the UC page table structure is mapped to the core. Once this is complete, execution switches into the new UC by triggering a breakpoint exception and overwriting the exception frame with the corresponding register values from the source snapshot. The interrupt service routine does the work of popping the registers back onto the core, and execution transfers to the exact instruction where the snapshot was captured.
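Schematically, the deployment sequence might be expressed as follows (a hypothetical sketch; the helper functions stand in for the prototype's internals):

```cpp
#include <cstdint>

struct Snapshot;   // captured register file plus page-level memory image
struct TrapFrame;  // the exception frame pushed on entry to the ISR

// Hypothetical helpers standing in for the prototype's internals.
uint64_t clone_page_tables(const Snapshot& s);  // COW clone; write bits cleared
void     load_page_table_root(uint64_t root);   // write CR3 (flushes the TLB)
void     frame_from_snapshot(TrapFrame* f, const Snapshot& s);

// Runs on a worker core to start a new UC from a snapshot.
void deploy_uc(const Snapshot& snap) {
    load_page_table_root(clone_page_tables(snap));
    // Trigger a breakpoint exception; the ISR calls frame_from_snapshot() to
    // overwrite the saved frame with the snapshot's register values, so that
    // returning from the exception resumes the UC at the exact instruction
    // where the snapshot was captured.
    asm volatile("int3");
}
```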
Snapshots are kept lightweight by only capturing the pages modified since the UC was created. In our model, a Snapshot Stack is an ordered sequence of Snapshots, with each entry acting as a page-level diff of the previous (Figure 4). To achieve this, we use traditional copy-on-write semantics, enabled by hardware, to track recently-changed memory state. This is accomplished by capturing the complete page table structure in each snapshot, but cloning only the pages that have been recently written (designated on x86 with the Dirty bit). In addition, snapshots record the location of the top-level Snapshot or unikernel binary which the captured UC was spawned from, which is then used to resolve faults during execution of a UC.

When a new UC is created, the procedure requires cloning the page table structure from within the snapshot. The Dirty bits are cleared for all table entries, and the Read/Write bits are unset, enabling copy-on-write faults. Page faults that occur within a UC are processed by a kernel fault handler that is aware of the Snapshot Stack backing the UC. Depending on the semantics of the fault, the kernel handler may allocate a new page, clone a page from within the backing Snapshot Stack, or resolve the fault with a read-only mapping to a page within the source Snapshot Stack. (Given these semantics, a Snapshot can only be deleted when it is known that no other Snapshots or UCs depend on it, e.g., via a reference count. We avoid this concern in our prototype by only deleting function-specific snapshots that have no corresponding UCs.)

One issue with extensive use of copy-on-write is that memory becomes highly overcommitted, and application demand may cause the system to run out of physical memory. This is a problem, for example, with the heavy use of fork() in Linux, whereby an OOM daemon may be triggered and may kill system-critical processes. This is not a problem in our design, because UCs for function invocations are transient and can always be killed by the system without impacting the system's ability to make forward progress. While more complexity may be necessary in the future, our OOM daemon for SEUSS is trivial: it reclaims idle UCs that do not currently host a live invocation as soon as the available physical memory drops below a pre-defined threshold.
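The fault-resolution policy described above can be sketched as follows (our approximation with hypothetical names; the handler distinguishes writes, which need a private copy, from reads, which can share a snapshot page read-only):

```cpp
#include <cstdint>

struct Page;
struct Snapshot {
    Snapshot* parent;                 // next entry down the stack
    Page*     lookup(uint64_t vpage); // page owned by this diff, or nullptr
};
struct UC { Snapshot* backing; };

// Hypothetical mapping primitives.
Page* alloc_zero_page();
Page* clone_page(Page* src);
void  map(UC* uc, uint64_t vpage, Page* p, bool writable);

void handle_page_fault(UC* uc, uint64_t vpage, bool is_write) {
    // Walk the Snapshot Stack for the nearest copy of the page.
    Page* src = nullptr;
    for (Snapshot* s = uc->backing; s != nullptr && src == nullptr; s = s->parent)
        src = s->lookup(vpage);

    if (is_write)
        // Writes get a private copy (or a fresh page if never backed).
        map(uc, vpage, src ? clone_page(src) : alloc_zero_page(), /*writable=*/true);
    else
        // Reads can share the snapshot's page read-only.
        map(uc, vpage, src ? src : alloc_zero_page(), /*writable=*/false);
}
```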
Each UC is configured with an identical IP and MAC address, thus enabling Snapshots to be trivially deployed across time and in parallel across cores, and opening the potential for us to migrate Snapshots across machines. A network layer monitors traffic going in and out of the UCs and enables internal and external communication. The network proxy maintains mappings, for both the internal and external networks, for each unikernel instance active on a core, and routes traffic accordingly. Incoming traffic is screened, and the traffic destined for unikernels is sent through an additional translation process to determine the worker core where the UC is resident. TCP destination ports act as the unique key for mapping packets to an active UC. We currently do not support port mapping of UDP or IPv6 packets, but the approach would be similar. This design only supports outgoing TCP connections initiated from within the unikernel. To support listening sockets within the unikernel would require some out-of-band communication between the unikernel and the network proxy to signal intent, but due to the transient nature of serverless functions we have yet to give this much thought.
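Because every UC shares one IP and MAC address, the proxy's demultiplexing state reduces to a simple port-keyed table, roughly as follows (a hypothetical sketch):

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

struct UCLocation { uint32_t core; uint32_t uc_id; };

// One table per node: TCP destination port -> resident UC.
// Ports are assigned when a UC opens an outgoing connection, which is why
// only UC-initiated (outgoing) TCP flows can be mapped on the return path.
class NatTable {
    std::unordered_map<uint16_t, UCLocation> by_port_;
public:
    void bind(uint16_t port, UCLocation loc) { by_port_[port] = loc; }
    void release(uint16_t port)              { by_port_.erase(port); }
    std::optional<UCLocation> route(uint16_t dst_port) const {
        auto it = by_port_.find(dst_port);
        if (it == by_port_.end()) return std::nullopt; // screened and dropped
        return it->second; // forward to the worker core hosting this UC
    }
};
```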
OpenWhisk provides us with a modern, full-featured FaaS computing platform, which provides many functions beyond the invocation procedures targeted by SEUSS. For example, OpenWhisk includes management and authentication of users, uploading and editing of functions, the enforcement of usage quotas, whitelists and blacklists, long-term storage, etc. To preserve this important platform functionality, we have designed the SEUSS prototype kernel to act as a protocol-compliant "drop-in" replacement for the OpenWhisk compute nodes, while all other platform functionality remains unaffected.

To accomplish this, we have built an intermediate shim process, written in C++ and run on Linux, responsible for reading requests off the OpenWhisk message bus (Kafka) and translating each request into an internal message which is sent from the shim process to the SEUSS VM deployed on the same machine. The advantage of this approach is that it avoids the need to support a platform-specific protocol within our lightweight OS runtime. Instead, the shim processes can be written in any language to take advantage of the wealth of existing client libraries for the services that make up the FaaS platform (e.g., Kafka, CouchDB, ZooKeeper). In our prototype, our shim process registers itself as an OpenWhisk invoker and receives action requests through the Kafka message bus. Function code is read out from CouchDB and passed to the native VM for execution. Like the OpenWhisk Invoker, the shim process caches function code in-memory to avoid unnecessary calls to the remote database.

While the shim layer gives us a number of engineering advantages, it also introduces a performance penalty due to the additional network hop between the Linux process and the SEUSS VM. The internal messaging system, part of the EbbRT framework, sends each invocation request as a separate message over an established TCP connection. No coalescing is done to optimize message throughput. We explore the penalties of this design in more detail in our evaluations (§8.5).
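In outline, the shim is a translate-and-forward bridge. The sketch below uses hypothetical wrapper types; a real implementation would sit on existing client libraries (e.g., a Kafka client such as librdkafka and a CouchDB HTTP client), which is precisely the engineering advantage described above:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical wrappers over off-the-shelf Linux client libraries.
struct BusConsumer  { std::string next_invocation(); };                    // Kafka topic
struct FunctionDB   { std::string fetch_code(const std::string& fn_id); }; // CouchDB
struct SeussChannel { std::string invoke(const std::string& fn_id,
                                         const std::string& code,
                                         const std::string& args); };      // TCP to the VM
struct BusProducer  { void publish_result(const std::string& r); };

void shim_loop(BusConsumer& bus, FunctionDB& db, SeussChannel& vm, BusProducer& out) {
    // Local in-memory code cache avoids repeated database reads,
    // mirroring the behavior of the stock OpenWhisk invoker.
    std::unordered_map<std::string, std::string> code_cache;
    for (;;) {
        std::string req = bus.next_invocation();   // read a request off Kafka
        std::string fn  = req;                     // placeholder: parse fn_id from req
        auto it = code_cache.find(fn);
        if (it == code_cache.end())
            it = code_cache.emplace(fn, db.fetch_code(fn)).first;
        // Translate to the internal message format and forward to SEUSS OS.
        out.publish_result(vm.invoke(fn, it->second, /*args=*/req));
    }
}
```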
To reiterate, the purpose of SEUSS is to generalize the set of workloads FaaS systems can support by introducing elasticity at the OS level. In this section, we consider the operational overheads of these various techniques. We demonstrate that SEUSS supports radically faster function start times, especially in the uncached function case. This is how SEUSS enables elastic workloads. Further, we argue SEUSS radically improves cache density, making SEUSS favorable in the common case of repeat function execution.

We perform a set of microbenchmarks that evaluate the SEUSS primitive (UCs deployed from Snapshots of Rumprun unikernels) and compare it with the deployment techniques of Linux processes, Docker containers, and lightweight virtual machines. We conclude this evaluation with a discussion of the upper and lower bounds of the invocation paths for deploying a serverless function on SEUSS versus Linux containers.
We target a full-featured Node.js runtime environment that is deployed with the invocation driver used in OpenWhisk. We first record the maximum number of Node.js environment instances our Linux VM node can support by sequentially deploying instances until the memory of the VM is saturated. Table 1 presents the per-node density of Node.js environments deployed using various deployment types on Linux, along with the time to saturate the VM when deploying instances in parallel across all 16 cores.

Node.js Deployment           Saturation Time   Instance Density
Process (Linux)              93 s              4200
Container (Docker/overlay2)  569 s             3000
microVM (Firecracker)        340 s             450
SEUSS unikernel (Rumprun)    -                 52000

Table 1: Density limit and parallel saturation time for the Node.js runtime environments deployed using various isolation techniques on an 88 GB, 16-CPU Linux VM. The SEUSS unikernel result is measured using SEUSS OS.
For system software, we use Ubuntu 18.04 LTS (Bionic Beaver), Linux kernel v4.15.0, and Docker v18.09. We start by evaluating standard isolation technologies that can be used on Linux to deploy and cache function execution state.
Processes are not a sufficient isolation mechanism for running untrusted execution in a multi-tenant cloud environment; the value of this result is that it helps orient the conversation with an optimistic "best case" on Linux with respect to memory sharing and startup latency. We are able to deploy 4200 processes of the Node.js runtime on our VM (Table 1). In parallel, it took about 93 s to saturate the machine using processes.
Using Docker containers and the overlay2 storage driver, we are able to deploy around 3000 Node.js container instances before saturating the VM. The additional memory consumption of containers over processes is associated with the per-container filesystem state and the loss of DSO sharing across containerized processes.

We observe that the creation overheads for an individual container are proportional to the number of total container instances active in the system (an observation that has also been corroborated by others [14]). In particular, the creation time for a single Node.js container increased linearly with the total container count, starting from 541 ms with no other containers and rising steadily up to the ~3000-container saturation point. Creation times for containers also suffered relative to the number of parallel creations taking place: we saw the minimum creation time increase by 60 ms for every additional concurrent creation. In our 16-way parallel creation, the 99% tail latency for deploying a Node.js container was higher still.

Containers have become ubiquitous because of their value in cloud orchestration platforms, but that does not imply they are the right fit for serverless. These two significant non-scalabilities lead us to conclude that containers are not well suited for the highly-parallel, ephemeral, and latency-critical requirements of FaaS systems.

Recently, there has been a movement away from containers as the means to isolate untrusted cloud workloads and towards more secure, hardware-enforced virtualization techniques. Lightweight VMs (aka microVMs), together with a "single-user" build of the Linux kernel, are combined to create a hardware-enforced isolation mechanism with performance characteristics similar to those of containers. Using the Kata Container runtime together with the
Firecracker hypervisor, we deploy the Node.js container isolated within a dedicated "microVM". Unsurprisingly, the virtual machine (along with its dedicated Linux kernel) resulted in a substantial increase in the per-instance memory footprint, thereby limiting the number of cached instances that can be deployed on our VM to around 450 (Table 1). In addition, the latency to deploy a single Node.js instance via this method is over 3 seconds, primarily due to the overhead of booting the dedicated guest kernel. (The use of Node.js containers in our Apache OpenWhisk experiments (§8) is subject to the same performance penalties as shown here, with additional limitations introduced by the in-kernel virtual LAN emulation, as described in §8.2.1.)
From the perspective of the FaaS platform, the technique used to encapsulate execution state should be fast to deploy, should have a lightweight memory footprint (for caching), and must be secure. From a security perspective, hardware virtualization is superior to Linux containers, as the wide system call interface exposed by containers presents a large surface for kernel vulnerabilities [10]. However, from our results, the use of lightweight VMs over containers introduces overheads that are detrimental to our target of low-latency, large-scale serverless invocations.

The comparative scalability of containers is more opaque. The time to construct container instances increased considerably as more containers were added to the system, which suggests that the system would perform worse for new invocations as additional function state becomes cached. We did not see the same relative increase in creation latency for the VM instances; however, significantly fewer VM instances were simultaneously deployable in our tests. We believe this difference is caused because the container abstraction is implemented across various subsystems of the (host) Linux kernel. This makes the operations to create and delete individual container instances both non-parallel and non-scalable. While the scalability of container creation will most likely improve in time, note that the comparable operations to add or remove virtual domains (via kvm, in our case) are fundamentally less intertwined with the various internal structures of the host kernel, and thus are likely to remain both faster and more scalable. In conclusion, it is not the isolation technology of the lightweight VM that is the bottleneck, but the general-purpose kernel that is deployed within it [14].
We measure the time to deploy execution from SEUSS Snapshots and UCs of a Node.js Rumprun unikernel (Table 2). In a saturation test, we are able to deploy over 52,000 individual Node.js UC instances, each internally blocked on a port with a unique mapping within our NAT layer that allows it to be scheduled and referenced. This extreme density is enabled by the low memory footprints of the deployed UC instances, which are highly redundant and therefore gain a major advantage from COW page sharing (§6.3.4).

Next, we evaluate the time to deploy, interpret, and execute a minimal JavaScript NOP function (i.e., code which simply returns `true`) on SEUSS OS. For this NOP function, we measure the end-to-end invocation latency from the moment the invocation request is received by a SEUSS OS worker core to the moment the function has finished executing and the result is returned to the worker core. While clearly a microbenchmark, measuring the NOP provides the clearest picture of system-induced overheads.

The bottom of Table 2 contains the measurements for deploying the NOP function via the hot, warm, and cold invocation paths. The cold and warm paths include deploying the UC from a Snapshot, while the hot path uses an existing UC which sits idle on that core. The latency includes the time to set up a TCP connection between the worker core and the invocation driver, as well as the time to pass in the invocation arguments. In addition, the cold start path includes the time to pass in and compile the function code as well as the time to capture the warm start snapshot.

The cold start invocation begins from a Node.js runtime snapshot, captured once during system boot; this amortizes the 270 ms required to bring up the Node.js unikernel and the invocation driver, permanently removing it from all invocation paths. The resulting end-to-end invocation latency is 7.67 ms. Similarly, warm start invocations begin from an existing function-specific snapshot. Since warm starts do not require function code to be passed in and interpreted (interpretation time is minimal for the NOP function, but is likely to be much larger in practice), invocation overheads are only 2.95 ms. The hot start invocation, which calls into a blocked UC, takes only 0.82 ms to execute and return. This optimal path removes UC creation from the execution path, but this is not the main savings. The real benefit here is that the user code has already run once: initialization has been done, and state has accumulated from user code, kernel paths, and, perhaps most importantly, JIT optimizations. Note it is considered safe to re-execute user functions in this way because of the "stateless" assumption of the serverless model. A similar optimization is used with containers to avoid expensive pause/unpause operations.
Rumprun Unikernel         Boot time (ms)   Snapshot size
Hello World               9.2              2.1 MB
Python Hello World        393.3            10.8 MB
Python Invocation Driver  1072.3           31.8 MB
Nodejs Hello World        123.9            103.9 MB
Nodejs Invocation Driver  279.6            114.5 MB

Invocation Path   Runtime (ms)   Allocated Pages
hot start         0.82           13
warm start        2.95           391
cold start        7.67           527

Table 2: Boot times and snapshot size taken after boot of various Rumprun unikernels deployed on SEUSS (top). Runtime and page allocation averages across 475 invocations of a NOP JavaScript function (bottom).
Through a series of anticipatory optimizations, we were able to significantly decrease invocation latencies and memory footprints to what is shown at the bottom of Table 2. Prior to making these optimizations, our results were roughly 4-5 times slower: a warm start invocation took 13 ms as opposed to 2.95 ms, and a cold start took 40 ms as opposed to 7.67 ms.

The observation was that by sending, compiling, and executing an empty JavaScript file prior to capturing the first (runtime) snapshot, we are able to effectively capture into the snapshot a "warmed up" version of, for example, the unikernel network stack, as well as the interpreter's compilation and execution paths. As a result, the initial runtime snapshot is considerably larger, but each of the function-specific snapshots decreases in size. Note that in our prototype, a machine holds a single copy of each runtime snapshot, while every one of the possibly tens of thousands of function-specific snapshots can benefit from the memory and latency savings. In our evaluation of OpenWhisk with a SEUSS node, the smaller snapshot sizes result in a doubling of our cache size during a high-density throughput test (§8.3).
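Conceptually, the warm-up is a dry run through the full invocation path before the runtime snapshot is captured (hypothetical names; the actual boot-time logic lives in the prototype kernel and invocation driver):

```cpp
#include <string>

// Hypothetical stand-ins for the prototype's boot-time operations.
struct UC;
UC*  boot_runtime_unikernel();  // e.g., Rumprun + Node.js + invocation driver
void invoke_over_network(UC* uc, const std::string& code, const std::string& args);
void capture_runtime_snapshot(UC* uc);

void build_warmed_runtime_snapshot() {
    UC* uc = boot_runtime_unikernel();
    // Exercise the network stack plus the interpreter's compile and execute
    // paths with an empty function, so that the state these paths allocate
    // lands in the shared runtime snapshot rather than in every per-function diff.
    invoke_over_network(uc, /*code=*/"", /*args=*/"{}");
    capture_runtime_snapshot(uc);  // larger base, but smaller per-function diffs
}
```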
Baseline Invocation Overheads
Figure 5: Start times for different execution environments (New Container, Rumprun, Paused Container, Snapshot, and UC or process). X-axis is log scale (ms).
To conclude these microbenchmark evaluations, we discuss the expected best- and worst-case overheads across the two deployment strategies we employ in our evaluation of OpenWhisk: SEUSS Snapshots versus Docker containers. Figure 5 presents a summary of the various deployment overheads (graphed on a log scale) for the Node.js runtime evaluated throughout this section.
In the optimal case, function execution can begin in just a few hundred microseconds. This can be achieved on both SEUSS OS and Linux by a UC or unpaused container that has been fully initialized for the particular function. As shown in §8.3, this optimal case is more easily achievable in SEUSS OS across different function invocations due to the increased cache density.
The next best option for Linux is the use of a pre-initialized environment cached inside of a paused container. In our evaluations, the time to unpause a Node.js Docker container took between 30 ms and 50 ms, depending on the level of concurrency. In comparison, SEUSS can deploy a similarly-initialized UC from a snapshot in well under a millisecond. In addition, our technique of deploying from a snapshot is far more flexible, as multiple UCs can be deployed in parallel from a single shared snapshot (while a paused container is consumed as part of the invocation).

The worst case on Linux requires a new container instance to be constructed and initialized, which we observe takes upwards of 540 ms (not including the time to import and interpret the function code and arguments). Comparably, the worst case on SEUSS OS is to deploy an initialized runtime from a snapshot prior to importing the function code, which also takes well under a millisecond. Note that without snapshotting, the SEUSS cold path is comparable to container creation (Figure 5, "Rumprun" bar). The runtime snapshot permanently removes this cost from all paths.

The previous microbenchmark evaluation (§7) provides intuition for where we can expect to see benefit from the SEUSS approach when applied to optimize invocations within a FaaS platform: namely, in the radically faster function deployment path (especially for cold starts) and the dramatically denser function caches.

In this second evaluation, we analyze the performance of requests to a multi-node FaaS cluster (Apache OpenWhisk) with our SEUSS kernel prototype used as a drop-in replacement for Linux on a virtual compute node. Using a custom benchmark tool that we developed, we evaluate the platform performance across a variety of invocation patterns and function types. The following highlights the primary results:

1. In the face of increasingly diverse invocation patterns, the SEUSS backend is able to sustain high throughput long after the Linux node hits a performance wall due to cache thrashing. (§8.3)
2. SEUSS radically outperforms Linux when it comes to deploying new environments for uncached functions, but cached functions are also processed faster, as Linux quickly gets bogged down with cache invalidations and container construction overheads (i.e., when many different functions are being invoked).
3. SEUSS can support large-scale request bursts that fully overwhelm the Linux backend. (§8.4)
4. The SEUSS prototype provides a scalable solution to multiplexing function access to the external network. However, in our existing implementation, stressing the external networking ingress and egress paths shows non-scalabilities that limit the network throughput of functions when compared to Linux containers. (We believe this can be addressed through implementation work and is not inherent to the SEUSS model.)

We run our evaluations on a four-node cluster of physical machines connected via a 10GbE network on a private VLAN through a commodity switch. Each machine contains two 8-core Intel Xeon E5-2660 processors (16 cores in total) running at 2.2 GHz and a Solarflare Communications SFC9120 Ethernet card.
The processors have been configured to disable Turbo Boost, hyper-threading, and dynamic frequency scaling. For software, we use Ubuntu 18.04 LTS (Bionic Beaver), Linux kernel v4.15.0, Docker v18.09, and OpenWhisk v0.9.0. We dedicate two of the four physical machines to host OpenWhisk, one machine to host the benchmark, and one machine to host an external HTTP server (used in §8.4). Of the two OpenWhisk machines, the first hosts the control plane components (i.e., the platform controller, API server, message service, and internal databases) deployed within containers on the Linux host. The second OpenWhisk machine hosts a single qemu-kvm VM instance which acts as the OpenWhisk compute node across all experiments. We use a VM to maintain identical hardware specifications for both the SEUSS OS and Linux compute nodes. The VM is configured with 16 VCPUs, 88 GB of memory, a virtio/vhost paravirtualized network device, and an in-memory (ramdisk) filesystem.
For these evaluations, we have built a custom OpenWhisk load-generation benchmark. The benchmark works by sending a series of parallel, synchronous invocation requests to the OpenWhisk API gateway. For each request, we record the round-trip time for OpenWhisk to process the request, run the function, and return the result to the benchmark.

Each trial of the benchmark has three main parameters: Invocation Count (N), Unique Function Set Size (M), and Concurrent Requests (C). The benchmark generates a work queue of N invocation requests across the set of M functions in a random order. (For a fair comparison, the random ordering of invocations is pre-determined and held consistent across runs of the benchmarked systems.) The evaluation trial involves C worker threads pulling invocation requests (one at a time) from the shared queue and starting a new HTTP connection with the API server to process each request; the maximum number of requests in flight at a given time is therefore C. The benchmark reports the RTT latency for each invocation as well as the aggregate throughput achieved by all C threads. The trial is finished once a response has been received for each of the N requests.

Each trial of the benchmark is performed on a fresh deployment of OpenWhisk that has been populated with the set of M user functions run by the benchmark. We have disabled all platform-enforced quotas and rate limits in OpenWhisk. We also prevent Docker containers from being paused when they are not in use (thereby granting the Linux VM a slight performance boost under high-throughput evaluation).
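The core of the benchmark, parameterized by (N, M, C), can be sketched as follows (a simplified, hypothetical rendering; our actual tool differs in details such as HTTP handling and latency recording):

```cpp
#include <atomic>
#include <random>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one synchronous invocation via the API gateway;
// returns the measured round-trip time in milliseconds.
double invoke_and_time(const std::string& fn_name);

void run_trial(int N, int M, int C) {
    // Pre-generate the random (but reproducible) order of N requests over M functions.
    std::vector<int> work(N);
    std::mt19937 rng(42);  // fixed seed: ordering held consistent across systems
    std::uniform_int_distribution<int> pick(0, M - 1);
    for (auto& w : work) w = pick(rng);

    std::atomic<size_t> next{0};
    auto worker = [&] {  // C workers => at most C requests in flight
        for (size_t i; (i = next.fetch_add(1)) < work.size(); )
            invoke_and_time("fn" + std::to_string(work[i]));  // record RTT latency
    };
    std::vector<std::thread> threads;
    for (int c = 0; c < C; ++c) threads.emplace_back(worker);
    for (auto& t : threads) t.join();  // trial ends once all N responses arrive
}
```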
As a stand-alone evaluation, we measure the control plane overheads of our OpenWhisk cluster by modifying the invocation service to immediately reply with 'success' to each invocation request. Using a sequential stream of requests, we observed the mean latency for an individual request to be 60 ms, with a tail latency of 64 ms. This latency accounts for the time for the request message to pass from the benchmark, through the OpenWhisk control plane, to the invocation node, and back again. In addition, we observed the control plane throughput to plateau at a peak of 220 requests per second as the number of parallel requests grew (without any function executions).
Docker containers, and the max cachesize limit is set to containers. For the throughputtests (§8.3), we have disabled the OpenWhisks "warmpool" container cache, as we observed that the automaticinitialization of containers hurt platform throughput whenunder heavy load. The warm pool container cache is en-abled for the burst experiment (§8.4). For a fair comparison, the random ordering of invocations is pre-determined and held consistent across runs of the benchmark systems The maximum number of requests in flight at a given time is C The container cache size limit of is considerablylower than the max container density of about 3000 in-stances (Table 1). We originally set the cache size closerto the observed density limit and found that a majority ofinvocation requests result in an error. Upon investigation,we observed high rates of dropped packets on the inter-nal network bridge that the platform uses to communicatewith the active containers. Due to dropped packets, theconnection between the platform and the invocation driverinside the container repeatedly fails and the invocation iseventually aborted.Here we witness another limitation in the legacy sys-tem being deployed. The use of virtual Ethernet meansthat packet processing between the connected endpointstakes place entirely within the Linux kernel. Therefore,a single broadcast (e.g. ARP, DHCP) packet sent overa bridge with N connected endpoints must be processedby the kernel N separate times . With a max cache sizeof 1024 containers—the default limit of endpoints on aLinux bridge—we still witness connections failures in ourlarge-scale evaluations (§ 8.4).
Figure 6: End-to-end request latencies for a single stream of NOP JavaScript executions across three function set sizes (64, 512, and 4096), shown for Linux cold starts, SEUSS cold starts, Linux hot starts, and SEUSS hot starts. Each distribution shows the 1st, 25th, 50th, 75th, and 99th percentiles and the mean (dot); request latency is in ms.
Figure 7: OpenWhisk throughput (req/s) across an increasingly diverse set of invocations (unique NOP functions, ranging from 64 to 65536), for SEUSS and Linux; the Linux cache limit and SEUSS cache limit are marked.
This experiment is designed to stress the FaaS platform's ability to effectively cache and deploy function execution when exposed to increasingly diverse function invocations. To provide "diversity", we increase the set of unique functions that are invoked across each trial, while keeping the total number of invocations consistent. While each function is logically different, the code being run is an identical JavaScript NOP. We use a NOP function to minimize the time spent in function-specific work, thereby stressing the invocation path overheads of the platform.

Figure 7 presents the throughput achieved by OpenWhisk using the SEUSS OS or Linux compute node. Trials are represented as points along the x-axis. For each trial, we doubled the number of unique functions that the benchmark invokes (ranging from 64 to 65536). For each trial, the benchmark sends a continuous stream of invocation requests from 32 threads until the measured throughput reaches a point of stability.

The throughput of the trials with low diversity (the left-most points on the lines) represents a "best-case scenario" for the platform's performance, as the vast majority of requests result in "hot" repeat invocations and the caches remain underutilized. In contrast, as diversity increases past the relative cache limits (the right-most points on the lines), it presents a "worst-case scenario" for the system, as an increasing majority of requests require cold path invocations. As shown in Figure 7, the overall throughput of the FaaS platform depends heavily on the total amount of unique function state that can be efficiently cached at the site of the invocation. For example, throughput on the Linux node drops severely once the container cache is saturated, around 512 functions (with 32-request concurrency). SEUSS, being designed for efficient caching, can hold over 40,000 Snapshots and UCs of the minimal NOP function before the memory of the VM is exhausted.

On an all-unique workload (zero repeat invocations), the SEUSS OS achieves a substantial speedup over Linux. This is primarily due to the cold invocation paths on SEUSS OS beginning from an initialized Node.js runtime snapshot. Conversely, when the cache is saturated on Linux, there is no longer an opportunity to pre-initialize containers for not-yet-seen function invocations. Instead, every cold start requires both a cache eviction (container deletion) and a new container creation. As demonstrated in our container microbenchmarks (§7), the performance of container operations suffers proportionally to the container occupancy. The comparable difference in cold path invocation latencies, when the cache is underutilized (64 functions) and oversaturated (2048 functions), can be seen in Figure 6.

In the low-diversity trials, the throughput of Linux is higher than that of SEUSS OS. This is because the Linux node provides better hot start latencies while at low cache utilization (Figure 6). In the SEUSS prototype, the hot start pathways are burdened with additional packet processing and a network hop between the SEUSS OS shim process and the OpenWhisk control plane (§6.5), which adds measurable overhead to each round-trip request. We believe this gap can be reduced through simple implementation changes to the SEUSS OS shim process. However, low-diversity performance is not the focus of the SEUSS OS design.

In the next experiment, we aim to stress the platform's capacity for caching and invocation when exposed to the sudden arrival of parallel bursts of invocation requests.
In the next experiment, we aim to stress the platform's capacity for caching and invocation when exposed to the sudden arrival of parallel bursts of invocation requests. The experiment involves giving the system a moderate level of "background" request load and then hitting it with repeated large-scale batches of parallel requests. Instead of NOP functions, this experiment runs a combination of CPU-heavy functions (to model a computational workload) and a mixture of functions blocked on external IO (representing a more "normal" background behavior).
The background utilization consists of a continuous stream of (hot start) invocations made to a set of IO-bound functions. Each function makes an external network call to a remote HTTP server, which waits 250 ms before sending an OK reply (the function finishes once the reply is received). To generate the background utilization, we configure the benchmark to make requests across a total of 16 unique IO functions. The benchmark is rate-throttled to a limit of 72 requests per second (a fraction of the observed throughput of the platform). On Linux, a portion of the container cache is consumed by the background stream of requests.

On top of the background stream, we issue a series of invocation bursts, each consisting of a batch of invocation requests sent in parallel to a unique, never-before-seen function. Bursts are sent at a fixed frequency of 32, 16, or 8 seconds between bursts, and we observe whether the background load is affected and whether the system succeeds in handling all the parallel requests. All functions in the bursts do CPU-bound computation that takes around 150 ms to complete. Invocations within a single burst are to the same function, while functions across bursts are unique. Sketches of the two action types follow; Figures 8, 9, and 10 present the results.
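Minimal sketches of the two action types used in this experiment are shown below (names, URLs, and parameters are illustrative assumptions, not the benchmark's actual code):

    // io_bound.js -- sketch of an IO-bound "background" action. It
    // issues a GET to an external HTTP server that delays its OK reply
    // by 250 ms; the action completes once the reply is received.
    const http = require('http');
    function main(params) {
      return new Promise((resolve, reject) => {
        // params.url is assumed to name the delaying server
        http.get(params.url, (res) => {
          res.on('data', () => {});   // drain the body
          res.on('end', () => resolve({ status: res.statusCode }));
        }).on('error', reject);
      });
    }

    // cpu_bound.js -- sketch of a CPU-bound burst action: spin on
    // arithmetic until roughly 150 ms of wall-clock time has elapsed.
    function main(params) {
      const start = Date.now();
      let acc = 0;
      while (Date.now() - start < 150) {
        acc += Math.sqrt(acc + 1);    // busy work
      }
      return { result: acc };
    }

Because the IO-bound action spends nearly all of its 250 ms blocked on the network, a stream of such invocations occupies cache entries without consuming much CPU, which is what makes it a plausible "background" load.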
On Linux, we configure OpenWhisk to maintain a cache of 256 "warm start" containers of the Node.js runtime. This container cache provides warm-path invocations for the sudden bursts of never-before-seen invocations. However, these pre-initialized containers compete directly with the other containers on the node, affecting both container creation times and the number of "cached" containers that can be used to enable hot starts.

Figure 8: 32 s bursts. Linux (top) and SEUSS (bottom). Request latency (log scale) is measured on the y-axis, trial wall-clock time on the x-axis (340 s total). Each plotted point represents a successful completion, with x being the time the request was sent and y being the RTT latency. Failed requests are marked with an 'x'.

With a frequency of 32 s between bursts (Figure 8), the Linux container cache can be repopulated between bursts, and so the sudden arrival of requests is met by warm start invocations. However, beginning around the 6th burst, the container cache limit is hit and some of the requests begin to return an error (marked in the figures as a black 'x'). When the burst frequency is increased to 16 s and then 8 s (Figures 9 and 10), the container cache is not able to repopulate between bursts. In these cases, only the first burst sees warm start invocations, while all additional bursts see cold start overheads (of up to 10 s) and failed requests. In all cases, the reliability of the Linux node suffers as the container cache becomes saturated, resulting in a majority of request failures after only a few bursts. When the bursts continue after the container cache is saturated, OpenWhisk stops processing requests altogether for a number of minutes, presumably in a recovery mode.
The same experiment of OpenWhisk with a SEUSS OS node presents a stark difference in results.

Figure 9: 16 s bursts. Linux (top) and SEUSS (bottom).

With SEUSS OS, OpenWhisk is able to handle every request we send (no request returns an error to our benchmark). Furthermore, as every new invocation is handled from a snapshot, request latencies remain consistently low across all three burst frequencies. Only at the highest frequency (8 seconds between bursts) does the background stream become disturbed (Figure 10). This is because IO tasks get queued behind CPU tasks, which are run to completion. This is not inherent to the design; there is no known significant benefit SEUSS derives from this policy, and addressing it with scheduling would be a natural step for future work. Finally, as each burst adds only one additional snapshot to the cache, we would have to process many thousands of additional bursts on SEUSS OS before the cache would be full. Note the difference from the Linux approach, where each truly parallel execution of each function requires its own container.
Figure 10: 8 s bursts. Linux (top) and SEUSS (bottom).

One shortcoming of our system design is the processing overhead required to masquerade external TCP/IP traffic in and out of the running UCs (§6.4). To evaluate the impact of these overheads, we deploy an I/O-heavy JavaScript function that reads a 1KB, 10KB, or 100KB payload from an external HTTP server (a sketch of this action appears at the end of this discussion). We deploy the function repeatedly on an otherwise quiet deployment of OpenWhisk and measure the average RTT latencies. For the 1KB and 10KB payloads, the function took about 20% longer using the SEUSS OS compute node. When increasing the payload size to 100KB, the network throughput was an order of magnitude better on Linux.

Unsurprisingly, these results are symptomatic of one of our major design decisions: to deploy general-use unikernels in user space and to multiplex network traffic at layer 3, effectively doubling the per-packet processing overhead. In a sense, this is an old problem in the world of virtualization, to which multiple hardware and software solutions have been presented. However, in our case, a simple and scalable solution to this design puzzle remains an ongoing topic of debate among our group.
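For reference, a minimal sketch of the payload-reading action used in this measurement is shown below (the URL parameter and field names are illustrative assumptions):

    // payload.js -- sketch of the I/O-heavy measurement action. It
    // fetches a payload of the requested size (1KB, 10KB, or 100KB)
    // from an external HTTP server and reports the bytes received.
    const http = require('http');
    function main(params) {
      return new Promise((resolve, reject) => {
        let bytes = 0;
        http.get(params.url, (res) => {
          res.on('data', (chunk) => { bytes += chunk.length; });
          res.on('end', () => resolve({ bytes: bytes }));
        }).on('error', reject);
      });
    }

Since every packet of the reply traverses the shim's layer-3 masquerading path, per-packet processing overhead accumulates with payload size, consistent with the throughput gap we observe at 100KB.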
As evidenced by our evaluations, modern FaaS platforms achieve fast deployments when they reuse cached execution state of previously run functions, but suffer disproportionately when no cached state is available. In SEUSS, we improve on the caching approach by deploying execution from Snapshots, which enables function environments to be created faster and more environments to be cached on a node.

Deploying execution from Snapshots in SEUSS benefits both hot and cold invocation paths. Cold starts have a unique advantage on SEUSS, as they can deploy from a reusable runtime snapshot. In our results, the cold start overhead for deploying a NOP function from a snapshot is less than 8 ms. On Linux, cold start overheads begin with 500 ms to 4 s to deploy a new containerized environment (§7.2).

For hot starts, SEUSS' higher-density cache allows a FaaS compute node to improve the probability of a hit. In our results, we demonstrate that a SEUSS node can cache at least an order of magnitude more isolated execution environments than Docker containers, and more than two orders of magnitude more than with lightweight VMs.

By combining the benefits of low-latency cold starts and simplified immutable memory caching, SEUSS runs high-demand bursty workloads that are currently unsupported by the traditional approach (§8.4).

Our results show that the serverless function need no longer be considered a heavyweight computational primitive. When parallel executions can be deployed in milliseconds (from snapshots which can be created even faster), the serverless function becomes an intuitive way to enact computational elasticity at arbitrary scales. With the function as a performant base primitive, it will be possible for FaaS platforms to unlock a long-promised mutualism: users get instantaneous access to on-demand parallelism, while providers manage a much finer-grained resource bin-packing problem, trading hundreds of VMs for tens of thousands of functions. SEUSS demonstrates that it is possible to bump FaaS out of its current computational niche by introducing elasticity at the OS level.

References

[1] Checkpoint/Restore In Userspace. (Accessed 4/24/2019).

[2] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, Feb 2009.

[3] Andrew Baumann, Marcus Peinado, and Galen Hunt. Shielding Applications from an Untrusted Cloud with Haven.
ACM Trans. Comput. Syst., 33(3):8:1–8:26, August 2015.

[4] Roy Bryant, Alexey Tumanov, Olga Irzak, Adin Scannell, Kaustubh Joshi, Matti Hiltunen, H. Andres Lagar-Cavilla, and Eyal de Lara. Kaleidoscope: Cloud micro-elasticity via VM state coloring. In European Conference on Computer Systems (EuroSys), Salzburg, Austria, April 2011.

[5] Edward Oakes, Leon Yang, Dennis Zhou, Kevin Houck, Tyler Harter, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. SOCK: Rapid task provisioning with serverless-optimized containers. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 57–70, Boston, MA, 2018. USENIX Association.

[6] Google. gVisor. (Accessed 4/24/2019).

[7] Jon Howell, Bryan Parno, and John R. Douceur. Embassies: Radically refactoring the web. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 529–545, Lombard, IL, 2013. USENIX.

[8] Istemi Ekin Akkus, Ruichuan Chen, Ivica Rimac, Manuel Stein, Klaus Satzke, Andre Beck, Paarijaat Aditya, and Volker Hilt. SAND: Towards high-performance serverless computing. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 923–935, Boston, MA, 2018. USENIX Association.

[9] Murad Kablan, Azzam Alsudais, Eric Keller, and Franck Le. Stateless network functions: Breaking the tight coupling of state and processing. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017.

[10] Ricardo Koller and Dan Williams. Will serverless end the dominance of Linux in the cloud? In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, HotOS '17, pages 169–173, New York, NY, USA, 2017. ACM.

[11] Horacio Andrés Lagar-Cavilla, Joseph Andrew Whitney, Adin Matthew Scannell, Philip Patchin, Stephen M. Rumble, Eyal de Lara, Michael Brudno, and Mahadev Satyanarayanan. SnowFlock: Rapid virtual machine cloning for cloud computing. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys '09, pages 1–12, New York, NY, USA, 2009. ACM.

[12] Anil Madhavapeddy, Thomas Leonard, Magnus Skjegstad, Thomas Gazagnaire, David Sheets, Dave Scott, Richard Mortier, Amir Chaudhry, Balraj Singh, Jon Ludlam, Jon Crowcroft, and Ian Leslie. Jitsu: Just-in-time summoning of unikernels. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 559–573, Oakland, CA, 2015. USENIX Association.

[13] Maciej Malawski. Towards serverless execution of scientific workflows - HyperFlow case study. CEUR Workshop Proceedings, 1800:25–33, 2016.

[14] Filipe Manco, Costin Lupu, Florian Schmidt, Jose Mendes, Simon Kuenzer, Sumit Sati, Kenichi Yasukata, Costin Raiciu, and Felipe Huici. My VM is lighter (and safer) than your container. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 218–233, New York, NY, USA, 2017. ACM.

[15] Donald E. Porter, Silas Boyd-Wickizer, Jon Howell, Reuben Olinsky, and Galen C. Hunt. Rethinking the Library OS from the Top Down. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 291–304. ACM, 2011.

[16] Rump Kernels. (Accessed 4/24/2019).

[17] Dan Schatzberg, James Cadden, Han Dong, Orran Krieger, and Jonathan Appavoo. EbbRT: A framework for building per-application library operating systems. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI '16, pages 671–688, Berkeley, CA, USA, 2016. USENIX Association.

[18] Dirk Vogt, Cristiano Giuffrida, Herbert Bos, and Andrew S. Tanenbaum. Lightweight memory checkpointing. In Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN '15, pages 474–484, Washington, DC, USA, 2015. IEEE Computer Society.

[19] Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Scalability, fidelity, and containment in the Potemkin virtual honeyfarm. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP '05. ACM Press, 2005.

[20] Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. Peeking behind the curtains of serverless platforms. In Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '18, pages 133–145, Berkeley, CA, USA, 2018. USENIX Association.

[21] Yi-Min Wang, Yennun Huang, Kiem-Phong Vo, Pe-Yu Chung, and C. Kintala. Checkpointing and its applications. In Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing, FTCS '95, pages 22–, Washington, DC, USA, 1995. IEEE Computer Society.

[22] Andrew Whitaker, Marianne Shaw, and Steven D. Gribble. Scale and performance in the Denali isolation kernel. SIGOPS Oper. Syst. Rev., 36(SI):195–209, December 2002.

[23] Dan Williams, Ricardo Koller, Martin Lucina, and Nikhil Prakash. Unikernels as processes. In Proceedings of the ACM Symposium on Cloud Computing, SoCC '18, pages 199–211, New York, NY, USA, 2018. ACM.

[24] Yiming Zhang, Jon Crowcroft, Dongsheng Li, Chengfen Zhang, Huiba Li, Yaozheng Wang, Kai Yu, Yongqiang Xiong, and Guihai Chen. KylinX: A dynamic library operating system for simplified and efficient cloud virtualization. In 2018 USENIX Annual Technical Conference (USENIX ATC 18).