Serverless Supercomputing: High Performance Function as a Service for Science
Ryan Chard, Tyler J. Skluzacek, Zhuozhao Li, Yadu Babuji, Anna Woodard, Ben Blaiszik, Steven Tuecke, Ian Foster, Kyle Chard
Ryan Chard∗, Argonne National Laboratory
Tyler J. Skluzacek∗, University of Chicago
Zhuozhao Li, University of Chicago
Yadu Babuji, University of Chicago
Anna Woodard, University of Chicago
Ben Blaiszik, University of Chicago
Steven Tuecke, University of Chicago
Ian Foster, Argonne & University of Chicago
Kyle Chard, University of Chicago

∗Both authors contributed equally to the paper.
ABSTRACT
Growing data volumes and velocities are driving exciting new methods across the sciences in which data analytics and machine learning are increasingly intertwined with research. These new methods require new approaches for scientific computing in which computation is mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), or be offloaded to specialized accelerators. They also require new design approaches in which monolithic applications can be decomposed into smaller components that may in turn be executed separately and on the most efficient resources. To address these needs we propose funcX—a high-performance function-as-a-service (FaaS) platform that enables intuitive, flexible, efficient, scalable, and performant remote function execution on existing infrastructure including clouds, clusters, and supercomputers. It allows users to register and then execute Python functions without regard for the physical resource location, scheduler architecture, or virtualization technology on which the function is executed—an approach we refer to as "serverless supercomputing." We motivate the need for funcX in science, describe our prototype implementation, and demonstrate, via experiments on two supercomputers, that funcX can process millions of functions across more than 65,000 concurrent workers. We also outline five scientific scenarios in which funcX has been deployed and highlight the benefits of funcX in these scenarios.
The idea that one should be able to compute wherever makes the most sense—wherever a suitable computer is available, software is installed, or data are located, for example—is far from new: indeed, it predates the Internet [28, 47], and motivated initiatives such as grid [31] and peer-to-peer computing [44]. But in practice remote computing has long been complex and expensive, due to, for example, slow and unreliable network communications, security challenges, and heterogeneous computer architectures.

Now, however, with quasi-ubiquitous high-speed communications, universal trust fabrics, and containerization, computation can occur essentially anywhere: for example, where data or specialized software are located, or where computing is fast, plentiful, and/or cheap. Commercial cloud services have embraced this new reality [56], in particular via their function as a service (FaaS) [21, 33] offerings that make invoking remote functions trivial. Thus one simply writes client.invoke(FunctionName="F", Payload=D) to invoke a remote function
F(D) on the AWS cloud from a Python program. These developments are transforming how computing is deployed and applied. For example, Netflix uses Amazon Lambda to encode thousands of small video chunks, make data archiving decisions, and validate that cloud instances adhere to security policies [13]. In effect, they transformed a monolithic application into one that uses event-based triggers to dispatch tasks to where data are located, or where execution is more efficient and reliable.

There is growing awareness of the benefits of FaaS in science and engineering [30, 32, 39, 42, 53], as researchers realize that their applications, too, can benefit from decomposing monolithic applications into functions that can be more efficiently executed on remote computers, the use of specialized hardware and/or software that is only available on remote computers, moving data to compute and vice versa, and the ability to respond to event-based triggers for computation. Increasingly, scientists are aware of the need for computational fluidity. For example, physicists at Fermilab report that a data analysis task that takes two seconds on a CPU can be dispatched to an FPGA device on the AWS cloud, where it takes 30 msec to execute, for a total of 50 msec once a round-trip latency of 20 msec to Virginia is included: a speedup of 40× [27]. Such examples arise in many scientific domains. However, until now, managing such fluid computations has required herculean efforts to develop customized infrastructure to allow such offloading.

In many ways research cyberinfrastructure (CI) is lagging with respect to the perpetually evolving requirements of scientific computing.
We observe a collection of crucial challenges that lead to a significant impedance mismatch between sporadic research workloads and research CI, including the technical gulf between batch jobs and function-based workloads, inflexible authentication and authorization models, and unpredictable scheduling delays for provisioning resources, to name just a few. We are motivated therefore by the need to overcome these challenges and enable computation of short-duration tasks (i.e., at the level of programming functions) with low latency and at scale across a diverse range of existing infrastructure, including clouds, clusters, and supercomputers. Such needs arise when executing machine learning inference tasks [37], processing data streams generated by instruments [42], running data transformation and manipulation tasks on edge devices [46], or dispatching expensive computations from edge devices to more capable systems elsewhere in the computing continuum.

In response to these challenges we have developed a flexible, scalable, and high-performance function execution platform, funcX, that adapts the powerful and flexible FaaS model to support science workloads, and in particular data and learning system workloads, across diverse research CI.

funcX leverages modern programming practices to allow researchers to register functions (implemented in Python) and then invoke those functions on supplied input JSON documents. funcX manages the deployment and execution of those functions on remote resources: provisioning resources, staging function code and input documents, managing safe and secure execution sandboxes using containers, monitoring execution, and returning output documents to users. Functions are able to execute on any compute resource where funcX endpoint software is installed and that a requesting user is authorized to access.
funcX agents can turn any existing resource (e.g., cloud, cluster, supercomputer, or container orchestration cluster) into a FaaS endpoint.

The contributions of our work are as follows:
• A survey of commercial and academic FaaS platforms and a discussion of their suitability for science use cases on HPC.
• A FaaS platform that can: be deployed on research CI, handle dynamic resource provisioning and management, use various container technologies, and facilitate secure, scalable, and federated function execution.
• Design and evaluation of performance enhancements for function serving on research CI, including memoization, function warming, batching, and prefetching.
• Experimental studies showing that funcX delivers execution latencies comparable to those of commercial FaaS platforms and scales to 1M+ functions across 65K active workers on two supercomputers.
• Description of five scientific use cases that make use of funcX, and analysis of what these use cases reveal concerning the advantages and disadvantages of FaaS.

The remainder of this paper is organized as follows. §2 presents a brief survey of FaaS platforms. §3 outlines three systems upon which funcX builds. §4 presents a conceptual model of funcX. §5 describes the funcX system architecture. §6 and §7 evaluate the performance of funcX and present five scientific case studies, respectively. Finally, §8 summarizes our contributions.
FaaS platforms have proved wildly successful in industry as a way to reduce costs and the need to manage infrastructure. Here we present a brief survey of FaaS platforms, summarized in Table 1. We broadly categorize platforms as commercial, open source, or academic, and further compare them based on the following categories.
• Languages: The programming languages that can be used to define functions.
• Infrastructure: Where the FaaS platform is deployed and where functions are executed, e.g., cloud, Kubernetes.
• Virtualization: The virtualization technology used to isolate and deploy functions.
• Triggers: How functions are invoked and whether specific event sources are supported.
• Walltime: How long functions are permitted to execute.
• Billing: What billing models are used to recoup costs.
Most commercial cloud providers offer FaaS capabilities. Here wecompare three platforms offered by Amazon, Microsoft, and Google.
Amazon Lambda [2] pioneered the FaaS paradigm in 2014 and has since been used in many industry [13] and academic [24] use cases. Lambda is a hosted service that supports a multitude of function languages and trigger sources (Web interface, CLI, SDK, and other AWS services). Tight integration with the wider AWS ecosystem means Lambda functions can be associated with triggers from other AWS services, such as CloudWatch, S3, API gateways, SQS queues, and Step Functions. Functions are billed based on their memory allocation and for every 100 ms of execution time. Once a function is defined, Lambda uses a custom virtualization technology built on KVM, called Firecracker, to create lightweight micro-virtual machines (microVMs). These microVMs then persist in a warmed state for five minutes and continue to serve requests. While Lambda is provided as a hosted service, functions can be deployed locally or to edge devices via the Greengrass [1] IoT platform.
Google Cloud Functions [7] is differentiated by its tight coupling to Google Cloud Storage, Firebase mobile backends, and custom IoT configurations via Google's globally distributed message bus (Cloud Pub/Sub). Like Lambda, Google Cloud Functions also supports triggers from arbitrary HTTP webhooks. Further, users can trigger functions through a number of third-party systems including GitHub, Slack, and Stripe. While Google Cloud Functions applies a similar pricing model to Lambda, the model is slightly more expensive for high-volume, less computationally intensive tasks, as Lambda has lower per-request costs after the first two million invocations (with similar compute duration costs).
Azure Functions [11] allows users to create functions in a native language through either the Web interface or the CLI. Functions are packaged and may be tested locally using a local web service before being uploaded to the Azure platform. Azure Functions integrates with other Azure products through triggers. Triggers are provided from CosmosDB, Blob storage, and Azure storage queues, in addition to custom HTTP and time-based triggers. Azure price-matches AWS for compute and storage (as of November 2018).
Open-source FaaS platforms resolve two of the key challenges to using FaaS for scientific workloads: they can be deployed on-premises and can be customized to meet the requirements of data-intensive workloads, without imposing any pricing model.
Apache OpenWhisk [3] is the most well-known open source FaaS platform. OpenWhisk is the basis of IBM Cloud Functions [8]. OpenWhisk clearly defines an event-based programming model, consisting of Actions, which are stateless, runnable functions; Triggers, which are the types of events OpenWhisk may track; and Rules, which associate one trigger with one action. OpenWhisk can be deployed locally as a service using a Kubernetes cluster. However, deploying OpenWhisk is non-trivial, requiring installation of dependencies and administrator access to the cluster.

Fn [6] is a powerful open-source software from Oracle that can be deployed on any Linux-based compute resource having administrator access to run Docker containers. Applications—or groups of functions—allow users to logically group functions to build runtime utilities (e.g., dependency downloads in custom Docker containers) and other resources (e.g., a trained machine learning model file) to support functions in the group. Moreover, Fn supports fine-grained logging and metrics, and is one of few open source FaaS platforms deployable on Windows. Fn can be deployed locally or on a Kubernetes cluster. In our experience, one can deploy a fully-functional Fn server in minutes.

Kubeless [9] is a native Kubernetes FaaS platform that takes advantage of built-in Kubernetes primitives. Kubeless uses Apache Kafka for messaging, provides a CLI that mirrors that of AWS Lambda, and supports fine-grained monitoring. Users can invoke functions via the CLI, HTTP, and via a Pub/Sub mechanism. Like Fn, Kubeless allows users to define function groups that share resources. Like OpenWhisk, Kubeless is reliant on Kubernetes and cannot be deployed on other resources.

Table 1: Taxonomic survey of common FaaS platforms.

Platform | Languages | Intended infrastructure | Virtualization | Triggers | Max walltime (s) | Billing
Amazon Lambda | C | | | | |
Google Cloud Functions | BASH, Go, Node.js, Python | Public cloud | Undefined | HTTP, Pub/Sub, storage | 540 | Requests, runtime, memory
Azure Functions | C | | | | |
OpenWhisk | Ballerina, Go, Java, Node.js, Python | Kubernetes, private cloud, public cloud | Docker | HTTP, IBM Cloud, OW-CLI | 300 | IBM Cloud: requests, runtime; local: NA
Kubeless | Node.js, Python, .NET, Ruby, Ballerina, PHP | Kubernetes | Docker | HTTP, scheduled, Pub/Sub | Undefined | NA
SAND | C, Go, Java, Node.js, Python | Public cloud, private cloud | Docker | HTTP, internal event | Undefined | Triggers
Fn | Go, Java, Ruby, Node.js, Python | Public cloud, Kubernetes | Docker | HTTP, direct trigger | 300 | NA
Abaco | Container | TACC clusters | Docker | HTTP | Undefined | Undefined
funcX | Python | Local, clouds, clusters, supercomputers | Singularity, Shifter, Docker | HTTP, Globus Automate | No limit | HPC SUs, cloud credits; local: NA
The success of FaaS in industry has spurred academic exploration ofFaaS. Two systems that have resulted from that work are SAND [17]and Actor Based Co(mputing)ntainers (Abaco) [54].
SAND [17] is a lightweight, low-latency FaaS platform from Nokia Labs that provides application-level sandboxing and a hierarchical message bus. The authors state that they achieve a 43% speedup and a 22x latency reduction over Apache OpenWhisk in commonly-used image processing applications. Further, SAND provides support for function (or grain) chaining via user-submitted workflows. At the time of their writing, it appears that SAND does not support multi-tenancy, providing isolation only at the application level. SAND is closed source and, as far as we know, cannot be downloaded and installed locally.
Abaco [54] supports functions written in a wide range of programming languages and supports automatic scaling. Abaco implements the Actor model, in which an actor is an Abaco runtime mapped to a specific Docker image. Each actor executes in response to messages posted to its inbox. Moreover, Abaco provides fine-grained monitoring of container, state, and execution events and statistics. Abaco is deployable via Docker Compose.
Commercial cloud providers implement high-performance and reliable FaaS models that are used by huge numbers of users. However, for science use cases they are unable to make use of existing infrastructure, they do not integrate with the science ecosystem (e.g., in terms of data and authentication models), and they can be costly. Open source and academic frameworks support on-premise deployments and can be configured to address a range of use cases. However, each of the systems surveyed is Docker-based and therefore requires administrator privileges to be deployed on external systems. Furthermore, the reliance on Docker prohibits use in most computing centers, which instead support user-space containers. In most cases, these systems have been implemented to rely on Kubernetes (or other container orchestration models such as Mesos and OpenShift), which means they cannot be adapted to existing HPC and HTC environments.

funcX provides a scalable, low-latency FaaS platform that can be applied to existing HPC resources with minimal effort. It employs user-space containers to isolate and execute functions, avoiding the security concerns that prohibit other FaaS platforms from being used. Finally, it provides an intuitive interface for executing scientific workloads and includes a number of performance optimizations to support broad scientific use cases.
FaaS builds upon a large amount of related work, including in Grid and cloud computing, container orchestration, and analysis systems. Grid computing [31] laid the foundation for remote, federated computations, most often applying federated batch submission [40]. GridRPC [51] defines an API for executing functions on remote servers, requiring that developers implement the client and the server code. funcX extends these ideas to allow interpreted functions to be registered and subsequently to be dynamically executed within sandboxed containers via a standard endpoint API.

Container orchestration systems [35, 36, 50] allow users to scale deployment of containers while managing scheduling, fault tolerance, resource provisioning, and addressing other user requirements. These systems primarily rely on dedicated, cloud-like infrastructure and cannot be directly applied to HPC resources. funcX provides similar functionality; however, it focuses on scheduling and managing functions, which are deployed across a pool of containers. We apply approaches from container orchestration systems (e.g., warming) to improve performance.

Data-parallel systems such as Hadoop [15] and Spark [16] enable map-reduce style analyses. Unlike funcX, these systems dictate a particular programming model on dedicated clusters. Parallel computing libraries such as Dask [5], Parsl [20], and Ray [45] support parallel execution of scripts, and selected functions within those scripts, on clusters and clouds. funcX uses Parsl to manage function execution in containers.
We build funcX on a foundation of existing work, including the Parsl parallel scripting library [20] and Globus [23].
Parsl is a parallel scripting library that augments Python with simple, scalable, and flexible constructs for encoding parallelism. Parsl is designed for scalable execution of Python-based workflows on a variety of resources—from laptops to clouds and supercomputers. It includes an extensible set of executors tailored to different use cases, such as low-latency, high-throughput, or extreme-scale execution. Parsl's modular executor architecture enables users to port scripts between different resources, and scale from small clusters through to the largest supercomputers with many thousands of nodes and tens of thousands of workers. Here we use Parsl's high-throughput executor as the base for the funcX endpoint software, as it provides scalable and reliable execution of functions.

Parsl is designed to execute workloads on various resource types, such as AWS, Google Cloud, Slurm, PBS, Condor, and many others. To do so, it defines a common provider interface that can acquire (e.g., via a submission script or cloud API call), monitor, and manage resources. Parsl relies on a Python configuration object to define and configure the provider. funcX uses Parsl to connect to various resources and adopts Parsl's configuration object to define how a deployed endpoint should use its local resources.
Globus Auth [19] provides authentication and authorization platform services designed to support an ecosystem of services, applications, and clients for the research community. It allows external services (e.g., the funcX service and the funcX endpoints) to outsource authentication processes such that users may authenticate using one of more than 600 supported identity providers (e.g., Google, ORCID, and campus credentials). Services can also be registered as Globus Auth resource servers, each with one or more unique scopes (e.g., execute_function). Other applications and services may then obtain delegated access tokens (after consent by a user or client) to securely access other services as that user (e.g., to register or invoke a function). We rely on Globus Auth throughout the funcX architecture, in particular to provide user/client authentication with the system and to support secure endpoint registration and operations with the funcX service.
We first describe the conceptual model behind funcX, to provide context for the implementation architecture. funcX allows users to register and then execute functions on arbitrary endpoints. All user interactions with funcX are performed via a REST API implemented by a cloud-hosted funcX service. Interactions between users, the funcX service, and endpoints are subject to Globus Auth-based authentication and authorization.
Functions: funcX is designed to execute functions—snippets of Python code that perform some activity. A funcX function explicitly defines a function body that contains the entire function, takes a JSON object as input, and may return a JSON object. The function body must specify all imported modules. Functions must be registered before they can be invoked by the registrant or, if permitted, other users. An example function for processing raw tomographic data is shown in Listing 1. This function is used to create a tomographic preview image from an HDF5 input file. The function's input specifies the file and parameters to identify and read a projection. It uses the automo Python package to read the data, normalize the projection, and then save the preview image. The function returns the name of the saved preview image.
Listing 1: Python function to create neurocartography preview images from tomographic data.

    def automo_preview(event):
        import numpy, tomopy
        from automo.util import (read_data_adaptive, save_png)
        data = event['data']
        proj, flat, dark, _ = read_data_adaptive(
            data['fname'],
            proj=(data['st'], data['end'], data['step']))
        proj_norm = tomopy.normalize(proj, flat, dark)
        flat = flat.astype('float16')
        save_png(flat.mean(axis=0), fname='prev.png')
        return {'filename': 'prev.png'}
Endpoints: A funcX endpoint is a logical interface to a computational resource that allows the funcX service to dispatch function invocations to that resource. The endpoint handles authentication and authorization, provisioning of nodes on the compute resource, and various monitoring and management functions. Users can download the funcX endpoint software, deploy it on a target resource, and register it with funcX by supplying connection information and metadata (e.g., name and description). Each registered endpoint is assigned a unique identifier for subsequent use.
Function execution: Authorized users may invoke a registered function on a selected endpoint. To do so, they issue a request via the funcX service which identifies the function and endpoint to be used as well as an input JSON document to be passed to the function. Optionally, the user may specify a container image to be used. This allows users to construct environments with appropriate dependencies (system packages and Python libraries) required to execute the function. Functions may be executed synchronously or asynchronously; in the latter case the invocation returns an identifier via which progress may be monitored and results retrieved.
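The asynchronous mode can be sketched with a toy future-like handle. The status and result method names echo the funcX SDK, but the class and its threading are purely illustrative, not how funcX implements asynchronous invocation.

```python
import threading

class ToyAsyncResult:
    """Minimal stand-in for an asynchronous invocation handle."""
    def __init__(self, func, payload):
        self._result = None
        self._done = threading.Event()
        # Run the function in the background, as a remote endpoint would.
        threading.Thread(target=self._run, args=(func, payload)).start()

    def _run(self, func, payload):
        self._result = func(payload)
        self._done.set()

    def status(self):
        return 'SUCCEEDED' if self._done.is_set() else 'RUNNING'

    def result(self, timeout=None):
        self._done.wait(timeout)   # block until the function completes
        return self._result

res = ToyAsyncResult(lambda d: d['x'] + 1, {'x': 41})
print(res.result())  # 42
```

The caller polls status (or blocks on result) exactly as the identifier-based interface described above allows.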
Web service: The funcX service exposes a REST API for registering functions and endpoints, and for executing functions, managing their execution, and retrieving results. The Web service is paired with accessible endpoints via the endpoint registration process. The funcX service is a Globus Auth resource server and thus enables users to log in using an external identity, and supports programmatic access via OAuth access tokens.
User interface: funcX is designed to be used via the REST API or the funcX Python SDK that wraps the REST API. Listing 2 shows an example of how the SDK can be used to invoke a registered function on a specific endpoint. The example first imports the FuncXClient; it then constructs a client, defaulting to the address of the public funcX web service. It then invokes a registered function using the run command, passing the unique function identifier, a JSON document with input data (in this case the path to a file), the endpoint id on which to execute the function, and the funcx_python3.6 container in which the function will be executed, and it also sets the interaction to be asynchronous. Finally, the example shows that the function can be monitored using status and the asynchronous results retrieved using result.

Listing 2: Example use of the funcX SDK to invoke a registered function in the funcx_python3.6 container.

    from funcx import FuncXClient

    fx = FuncXClient()
    func_id = '6d79-...-764bb'
    container_name = 'funcx_python3.6'
    endpoint_id = '863d-...-d820d'
    data = {'input': '/projects/funcX/test.h5'}
    func_res = fx.run(func_id, data, endpoint_id,
                      container_name, async=True)
    func_res.status()
    func_res.result()
The funcX system combines a cloud-hosted management service with software agents—funcX endpoints—deployed on remote resources. The cloud-hosted funcX service implements endpoint management and function registration, execution, and management. funcX's primary interface is the hosted REST API; a Python SDK supports use in programming environments and integration in external applications. The advantages of such a service-oriented model are well-known and include ease of use, availability, reliability, and reduced software development and maintenance costs. An overview of funcX's architecture is depicted in Figure 1.
Figure 1: funcX architecture showing the funcX service on the left and two funcX endpoints deployed on a cloud and an HPC cluster on the right. Each endpoint's manager is responsible for coordinating execution of functions via executors deployed on nodes.

funcX Service
The funcX service maintains a registry of funcX endpoints and registered functions. The service provides a REST API to register and manage endpoints, register functions, and execute, monitor, and retrieve the output from functions. The funcX service is secured using Globus Auth, allowing users to authenticate with it directly (e.g., via the native app flow in a Jupyter notebook) or via external clients that can call the REST API directly. It also allows endpoints, registered as Globus Auth clients, to call the API to register themselves with the service. The funcX service is implemented in Python as a Flask application; it is deployed on AWS and relies on Amazon Relational Database Service (RDS) to store registered endpoints and functions.

funcX uses containers to package function code that is to be deployed on a compute resource. Key requirements for a packaging technology include portability (i.e., a package can be deployed in many different environments with little or no change), completeness (all code and dependencies required to run a function can be captured), performance (minimal startup and execution overhead; small storage size), and safety (unwanted interactions between function and environment can be avoided). Container technology meets these needs well.

Our review of container technologies, including Docker [43], LXC [10], Singularity [41], Shifter [38], and CharlieCloud [49], leads us to adopt Docker, Singularity, and Shifter in the first instance. Docker works well for local and cloud deployments, whereas Singularity and Shifter are designed for use in HPC environments and are supported at large-scale computing facilities (e.g., Singularity at ALCF and Shifter at NERSC). Singularity and Shifter implement similar models and thus it is easy to convert from a common representation (i.e., a Dockerfile) to both formats.

funcX requires that each container includes a base set of software, including Python 3 and funcX worker software.
In addition, any other system libraries or Python modules needed for function execution must be added manually to the container. When invoking a function, users must specify the container to be used for execution; if no container is specified, funcX uses a base funcX image. In future work, we intend to make this process dynamic, using repo2docker [29] to build Docker images and convert them to site-specific container formats when needed.

funcX Endpoint
The funcX endpoint represents the remote computational resource (e.g., cloud, cluster, or supercomputer) upon which it is deployed. The endpoint is designed to deliver high-performance execution of functions in a secure, scalable, and reliable manner. The endpoint architecture, depicted in Figure 2, is comprised of three components, which are discussed below:
• Manager: queues and forwards function execution requests and results, interacts with resource schedulers, and batches and load balances requests.
• Executor: creates and manages a pool of workers on a node.
• Worker: executes functions within a container.

The Manager is the daemon that is deployed by a user on an HPC system (often on a login node) or on a dedicated cloud node. It authenticates with the funcX service and upon registration acts as a conduit for routing functions and results between the service and workers. A manager is responsible for managing resources on its system by working with the local scheduler or cloud API to deploy executors on compute nodes. The manager uses a pilot job model [55] to connect to and manage resources in a uniform manner, irrespective of the resource type (cloud or cluster) or local resource manager (e.g., Slurm, PBS, Cobalt). As each executor is launched on a compute node, it connects to and registers with the manager. The manager then uses ZeroMQ sockets to communicate with its executors. To minimize blocking, all communication is managed by threads using asynchronous communication patterns. The manager uses a randomized scheduling algorithm to allocate functions to executors.

To provide fault tolerance and robustness, for example with respect to node failures, the manager uses heartbeats and a watchdog process to detect failures or lost executors. The manager tracks tasks that have been distributed to executors so that when failures do occur, lost tasks can be re-executed (if permitted). Communication from the funcX service to managers uses the reliable Majordomo broker pattern in ZeroMQ. Loss of a manager is terminal and relayed to the user. To reduce overheads, the manager can shut down executors when they are not needed; suspend executors to prevent further tasks being scheduled to failed executors; and monitor resource capacity to aid scaling decisions.
Executors represent, and communicate on behalf of, the collective capacity of the workers on a single node, thereby limiting the number of sockets used to just two per node. Executors determine the available CPU/memory resources on a node and partition the node amongst the workers. Once all workers connect to the executor, it registers itself with the manager. Executors advertise available capacity to the manager, which enables batching on the executor.
Workers persist within containers and each executes one function at a time. Since workers have a single responsibility, they use blocking communication to wait for functions from the executor. Once a function is received, it is deserialized and executed, and the serialized results are returned via the executor.
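The worker's receive-deserialize-execute-return loop can be sketched with in-process queues. This is only an illustration of the pattern: funcX workers receive serialized tasks from their executor over ZeroMQ sockets, not a queue.Queue, and the use of pickle here for payloads is an assumption for the sketch.

```python
import pickle
import queue

task_q = queue.Queue()     # stands in for the executor-to-worker channel
result_q = queue.Queue()   # stands in for the worker-to-executor channel

def worker_loop(func, task_q, result_q, n_tasks):
    """Worker: block for a task, deserialize the payload, execute the
    function, and ship the serialized result back, one task at a time."""
    for _ in range(n_tasks):
        payload = pickle.loads(task_q.get())   # blocking wait
        result_q.put(pickle.dumps(func(payload)))

def square(event):
    return event['x'] ** 2

# The executor side serializes payloads before dispatching them.
for x in range(3):
    task_q.put(pickle.dumps({'x': x}))
worker_loop(square, task_q, result_q, n_tasks=3)
print([pickle.loads(result_q.get()) for _ in range(3)])  # [0, 1, 4]
```

Because each worker handles exactly one task at a time, blocking reads are safe and keep the worker code trivially simple, which matches the single-responsibility design described above.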
The target computational resources for funcX range from local deployment to clusters, clouds, and supercomputers, each with distinct modes of access. As funcX workloads are often sporadic, resources must be provisioned as needed so as to reduce startup overhead and wasted allocations. funcX uses Parsl's provider interface [20] to interact with various resources, specify resource-specific requirements (e.g., allocations, queues, limits, or cloud instance types), and define the rules for automatic scaling (i.e., limits and scaling aggressiveness). With this interface, funcX can be deployed on batch schedulers such as Slurm, Torque, Cobalt, SGE, and Condor as well as the major cloud vendors such as AWS, Azure, and Google Cloud.
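A provider-style scaling rule might look like the following sketch. The thresholds, the "block" unit (one scheduler job's worth of workers), and the ceiling-division policy are all illustrative assumptions, not funcX's or Parsl's actual scaling algorithm.

```python
def scaling_decision(pending_tasks, active_workers,
                     tasks_per_worker=1, max_blocks=10,
                     workers_per_block=64, current_blocks=0):
    """Return how many additional blocks (scheduler jobs) to request,
    bounded by a user-configured limit on total blocks."""
    capacity = active_workers * tasks_per_worker
    if pending_tasks <= capacity:
        return 0                                  # enough workers already
    deficit = pending_tasks - capacity
    needed = -(-deficit // workers_per_block)     # ceiling division
    return min(needed, max_blocks - current_blocks)

# 200 pending tasks, 64 busy workers: 136 uncovered -> 3 blocks of 64.
print(scaling_decision(pending_tasks=200, active_workers=64))  # 3
```

Capping requests at max_blocks models the "limits and scaling aggressiveness" settings mentioned above, which keep a sporadic workload from draining an allocation.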
Figure 2: The funcX endpoint.
We apply several optimizations to enable high-performance function serving in a wide range of computational environments. We briefly describe five optimization methods employed in funcX.

Memoization involves returning a cached result when the input document and function body have been processed previously. funcX supports memoization by hashing the function body and input document and storing a mapping from hash to computed results. Memoization is only used if explicitly set by the user.
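A minimal sketch of this memoization scheme, assuming the cache key combines a hash of the function body with the serialized input (here the function's compiled bytecode stands in for the registered source, and JSON for the input document):

```python
import hashlib
import json

_memo_cache = {}

def memoized_invoke(func, payload):
    """Return a cached result when this function body and input were seen before."""
    # Hash the function body (its bytecode here) together with the input document.
    key = hashlib.sha256(
        func.__code__.co_code + json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    if key not in _memo_cache:
        _memo_cache[key] = func(payload)  # executed only on a cache miss
    return _memo_cache[key]
```

Because results are keyed on both body and input, editing the function or changing its input forces re-execution. As in funcX, caching of this kind is only safe for deterministic functions, which is why it is opt-in.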
Container warming is used by cloud FaaS platforms to improve performance [57]. Function containers are kept warm by leaving them running for a short period of time (5-10 minutes) following the execution of a function. This is in contrast to terminating containers at the completion of a function. Warm containers remove the need to instantiate a new container to execute a function, significantly reducing latency. This need is especially evident on HPC resources for several reasons: first, loading many concurrent Python environments and containers puts a strain on large, shared file systems; second, many HPC centers have their own methods for instantiating containers that may place limitations on the number of concurrent requests; and third, individual cores are often slower in many-core architectures like Xeon Phis. As a result, the start time for containers can be much larger than what would be seen locally.
Batching requests enables funcX to amortize costs across many function requests. funcX implements two batching models: first, batching to enable executors to request many tasks on behalf of their workers, minimizing network communication costs; second, user-driven batching of function inputs, allowing the user to manage the tradeoff between more efficient execution and increased per-function latency by choosing to create fewer, larger requests. Both techniques can increase overall throughput.

Prefetching is a technique for requesting more tasks than can be satisfied immediately in anticipation of availability in the near future. funcX executors use prefetching to improve performance by requesting tasks while workers are busy with execution, thus interleaving network communication with computation. This can improve performance for short, latency-sensitive functions.
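User-driven batching can be illustrated with a simple helper. This is a sketch under the assumption that each request carries one batch and the remote side maps the function over it; the names here are illustrative and not part of the funcX API:

```python
def invoke_batched(func, inputs, batch_size):
    """Pack up to batch_size inputs into each request, amortizing per-request cost.

    Returns the results and the number of requests issued; fewer, larger
    requests trade per-function latency for overall throughput.
    """
    results, requests = [], 0
    for i in range(0, len(inputs), batch_size):
        batch = inputs[i:i + batch_size]
        requests += 1                            # one round trip per batch
        results.extend(func(x) for x in batch)   # the "remote" side maps over the batch
    return results, requests

# 10 inputs in batches of 4 incur 3 requests instead of 10.
results, requests = invoke_batched(lambda x: x + 1, list(range(10)), 4)
```

With a fixed per-request overhead, the overhead per function falls roughly in proportion to the batch size, which is the tradeoff evaluated in the batching experiments later in the paper.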
Asynchronous messaging is a technique for hiding network latencies. funcX uses asynchronous messaging patterns provided by ZeroMQ to implement end-to-end, socket-based inter-process communication. By avoiding blocking communication patterns, funcX ensures that even when components connected over widely varying networks are involved, performance will not be bottlenecked by the slowest connection.
FaaS is often used for automated processing in response to various events (e.g., data acquired from an instrument). To facilitate event-based execution in research scenarios we have integrated funcX with the Globus Automate platform [18]. To do so we have implemented the ActionProvider interface in funcX by creating REST endpoints to start, cancel, release, and check the status of a task. Exposing funcX as an ActionProvider allows automation flows to execute functions on behalf of a user. The API uses Globus Auth to determine the identity of the user that owns the flow, and uses their authentication tokens to execute functions via the funcX service and endpoint. When specifying the action in a flow, the user must define the function ID, input JSON document, and endpoint ID for execution. When the flow invokes the function, the funcX service creates an identifier to return to the automation platform for monitoring of that step of the workflow.
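An action step in a flow therefore needs roughly the following information. The field names below are illustrative assumptions for exposition, not the exact Globus Automate ActionProvider schema, and the URL and UUID placeholders are hypothetical:

```json
{
  "Type": "Action",
  "ActionUrl": "https://funcx-actionprovider.example/run",
  "Parameters": {
    "function_id": "<registered-function-uuid>",
    "endpoint_id": "<target-endpoint-uuid>",
    "payload": {"input": "..."}
  }
}
```

The funcX service would return a task identifier for this step, which the automation platform then polls via the status endpoint.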
Secure, auditable, and safe function execution is crucial to funcX. We implement a comprehensive security model to ensure that functions are executed by authenticated and authorized users and that one function cannot interfere with another. We rely on two proven security-focused technologies: Globus Auth [19] and containers.

funcX uses Globus Auth for authentication, authorization, and protection of all APIs. The funcX service is represented as a Globus Auth resource server, allowing users to authenticate using a supported Globus Auth identity (e.g., institution, Google, ORCID) and enabling various OAuth-based authentication flows (e.g., confidential client credentials, native client) for different scenarios. It also has its own unique Globus Auth scopes (e.g., "urn:globus:auth:scope:funcx.org:register_function") via which other services (e.g., Globus Automate) may obtain authorizations for programmatic access. funcX endpoints are registered as Globus Auth clients, each dependent on the funcX scopes, which can then be used to connect to the funcX service. Each endpoint is configured with a Globus Auth client_id/secret pair which is used for constructing REST requests. The connection between the funcX service and endpoints is established using ZeroMQ. Communication addresses are exchanged as part of the registration process. Inbound traffic from endpoints to the cloud-hosted service is limited to known IP addresses.

All functions are executed in isolated containers to ensure that functions cannot access data or devices outside that context. In HPC environments we use Singularity and Shifter. funcX also integrates additional sandboxing procedures to isolate functions executing within containers, namely, creating namespaced directories within the containers in which to capture files that are read/written. To enable fine-grained tracking of execution, we store execution request histories in the funcX service and in logs on funcX endpoints.
We evaluate the performance of funcX in terms of latency, scalability, throughput, and fault tolerance. We also explore the effect of batching, memoization, and prefetching.
To evaluate funcX's latency we compare it with commercial FaaS platforms by measuring the time required for single function invocations. We have created and deployed the same Python function (Listing 3) on Amazon Lambda, Google Cloud Functions, Microsoft Azure Functions, and funcX. To minimize unnecessary overhead we use the same payload when invoking each function: the string "hello-world." Each function simply prints and returns the string.
Listing 3: Python function to calculate latency.

def hello_world(event):
    print(event)
    return event
Although each provider operates its own data centers, we attempt to standardize network latencies by placing functions in an available US East region (between South Carolina and Virginia). We deploy the funcX service and endpoint on two AWS EC2 instances (m5.large) in the US East region. We use an HTTP trigger to invoke the function on each of the FaaS platforms. We then measure latency as the round-trip time to submit, execute, and return a result from the function. We submit all requests from the login node of Argonne National Laboratory's Cooley cluster in Chicago, IL (20.5 ms latency to the funcX service). The experiment configuration is shown in Figure 3.
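The measurement itself reduces to timing a blocking round trip. A minimal harness of this kind might look like the following sketch, where `invoke` is a stand-in for each platform's HTTP trigger client:

```python
import time

def measure_round_trip(invoke, payload, runs):
    """Time submit -> execute -> return for each invocation, in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke(payload)  # blocking call that returns the function's result
        latencies.append((time.perf_counter() - start) * 1000.0)
    mean = sum(latencies) / len(latencies)
    stdev = (sum((x - mean) ** 2 for x in latencies) / len(latencies)) ** 0.5
    return mean, stdev
```

Reporting the mean and standard deviation over many runs, as done below, smooths out transient network jitter between the client and each service.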
Figure 3: Comparative latency experiment architecture.

Figure 4: Average task latency (s) over functions.

For each FaaS service we compare the cold start time and warm start time. The cold start time aims to capture the scenario where a function is first executed and the function code and execution environment must be configured. To capture this in funcX we restart the service and measure the time taken to launch the first function. For the other services, we simply invoke functions every 10 minutes and 1 second (providers report maximum cache times of 10 minutes, 5 minutes, and 5 minutes for Google, Amazon, and Azure, respectively) in order to ensure that each function starts cold. We execute the cold start functions 40 times, and the warmed functions 2000 times. We report the mean completion time and standard deviation for each FaaS platform in Figure 4. We notice that Lambda, Google Functions, and Azure Functions exhibit warmed round-trip times of 116 ms, 122 ms, and 126 ms, respectively. funcX proves to be considerably faster, running warm functions in 76 ms. We suspect this is due to funcX's minimal overhead, as, for example, requests are sent directly to the funcX service rather than through elastic load balancers (e.g., AWS ELB for Lambda), and also likely incur fewer logging and resiliency overheads. When comparing cold start performance, we find that Lambda, Google Functions, Azure Functions, and funcX exhibit cold round-trip times of 175 ms, 160 ms, 2748 ms, and 2886 ms, respectively. Google and Lambda exhibit significantly lower cold start times, perhaps as a result of the simplicity of our function (which requires only standard Python libraries and therefore could be served on a standard container) or perhaps due to the low overhead of these proprietary container technologies [57]. In the case of funcX this overhead is primarily due to the startup time of the container (see Table 4).

We next break down the latency of each function invocation for each FaaS service.
Table 2 shows the total time for warm and cold functions in terms of overhead and function execution time. For the closed-source, commercial FaaS systems we obtain function execution time from execution logs and compute overhead as any additional time spent invoking the function. As expected, overheads consume much of the invocation time. Somewhat surprisingly, we observe that Lambda has much faster function execution time for cold than for warm containers, perhaps as a result of the way Amazon reports usage. We further explore latency for funcX by instrumenting the system. The results are shown in Figure 5 for a warm container. Here we consider the following times: t_c, the round-trip time between the funcX client on Cooley and the funcX service; t_w, the web service latency to dispatch the request to an endpoint (and then to return the result); t_m, the endpoint connection latency from receiving the request (including data transfer and queue processing) until it is passed to an executor; and t_e, the function execution time. We observe that t_e is fast relative to the overall system latency. t_c is mostly made up of the communication time from Cooley to AWS (measured at 20.5 ms), while t_m includes only minimal communication time due to AWS-to-AWS connections (measured at 1 ms). Most of the funcX overhead is therefore captured in t_w, as a result of database access and endpoint routing, and t_m, as a result of internal queuing and Parsl dispatching.

Table 2: FaaS latency breakdown (in ms).

                Overhead   Function    Total
Azure   warm       112.0       13.6    125.6
        cold      2720.0       28.0   2748.0
Google  warm       117.0        5.0    122.0
        cold       136.0       24.0    160.0
Lambda  warm       116.0        0.3    116.3
        cold       174.0        0.5    174.5
funcX   warm        74.6        1.3     75.9
        cold      2882.0        4.2   2886.0

Figure 5: funcX latency breakdown for a warm container.
We study the strong and weak scaling of funcX using Argonne National Laboratory's Theta [14] and NERSC's Cori [4] supercomputers. Theta is an 11.69-petaflop system based on the second-generation Intel Xeon Phi "Knights Landing" (KNL) processor. The system is equipped with 4392 nodes, each containing a 64-core processor with 16 GB MCDRAM and 192 GB of DDR4 RAM, interconnected with high-speed InfiniBand. Cori consists of an Intel Xeon "Haswell" partition and an Intel Xeon Phi KNL partition. Our tests were conducted on the KNL partition. Cori's KNL partition has 9688 nodes in total, each containing a 68-core processor (with 272 hardware threads) with six 16 GB DIMMs and 96 GB DDR4 RAM, interconnected with a Dragonfly topology. We perform experiments using 64 Singularity containers on each Theta node and 256 Shifter containers on each Cori node. Due to a limited allocation on Cori we use the four hardware threads per core to deploy more containers than cores.
Figure 6: Strong and weak scaling of funcX.

Strong scaling evaluates performance when the total number of function invocations is fixed; weak scaling evaluates performance when the average number of functions executed on each container is fixed. To measure scalability we created functions of various durations: a 0-second "no-op" function that exits immediately, a 1-second "sleep" function, and a 1-minute CPU "stress" function that keeps a CPU core at 100% utilization. For each case, we measured completion time of a batch of functions as we increased the number of total containers. Notice that the completion time of running M "no-op" functions on N workers indicates the overhead of funcX to distribute the M functions to N containers. Due to limited allocation we did not execute sleep or stress functions on Cori, nor did we execute stress functions for strong scaling on Theta.

Figure 6(a) shows the completion time of 100 000 concurrent function requests with an increasing number of containers. On both Theta and Cori the completion time decreases as the number of containers increases until we reach 256 containers for the "no-op" function, and 2048 containers for the 1-second "sleep" function on Theta. As reported by Wang et al. [57] and Microsoft [12], for a single function, Amazon Lambda achieves good scalability to more than 200 containers, Microsoft Azure Functions can scale up to 200 containers, and Google Cloud Functions does not scale very well, especially beyond 100 containers. While these results do not necessarily indicate the maximum number of containers that can be used for a single function, and likely include some per-user limits imposed by the platform, we believe that these results show that funcX scales similarly to commercial platforms.
To conduct the weak scaling tests we performed concurrent function requests such that each container receives, on average, 10 requests. Figure 6(b) shows the weak scaling for "no-op," 1-second "sleep," and 1-minute "stress" functions. For "no-op" functions, the completion time increases with more containers on both Theta and Cori. This reflects the time required to distribute requests to all of the containers. On Cori, funcX scales to 131 072 concurrent containers and executes more than 1.3 million "no-op" functions. Again, we see that the completion time for the 1-second "sleep" remains close to constant up to 2048 containers, and the completion time for the 1-minute "stress" remains close to constant up to 16 384 containers. Thus, we expect a function of several minutes' duration would scale well to many more containers.
We observe a maximum throughput (computed as the number of function requests divided by completion time) of 1694 and 1466 requests per second on Theta and Cori, respectively.
Our results show that funcX (i) scales to 65 000+ containers for a single function; (ii) exhibits good scaling performance up to approximately 2048 containers for a 1-second function and 16 384 containers for a 1-minute function; and (iii) provides similar scalability and throughput using both Singularity and Shifter containers on Theta and Cori.

funcX uses heartbeats to detect and respond to executor failures. To evaluate fault tolerance we simulate an executor failing and recovering while executing a workload of sleep functions. To conduct this experiment we deployed funcX with two executors and launched a stream of 100 ms functions at a uniform rate such that the system is at capacity. We trigger a failure of an executor two seconds into the test. Figure 7 illustrates the task latencies measured as the experiment progresses.

We set the heartbeat rate to two seconds in this experiment, causing at least a two-second additional latency for functions that were in flight during the failure. Following the failure, latencies increase due to demand exceeding capacity until a replacement executor rejoins the pool, after which task latencies stabilize.
Figure 7: The latency required to process 100 ms functions when an executor fails (2 seconds) and recovers (4 seconds).
In this section we evaluate the effect of our optimization mechanisms. In particular, we investigate how memoization, container initialization, batching, and prefetching impact performance.
To measure the effect of memoization, we create a function that sleeps for one second and returns the input multiplied by two. We submit 100 000 concurrent function requests to funcX. Table 3 shows the completion time of the 100 000 requests as the percentage of repeated requests is increased. We see that as the percentage of repeated functions increases, the completion time decreases dramatically. This highlights the significant performance benefits of memoization for workloads with repeated deterministic function invocations.
Table 3: Completion time vs. number of repeated requests.

Repeated requests (%)    Completion time (s)
To understand the time to instantiate various container technologies on different execution resources, we measure the time it takes to start a container and execute a Python command that imports funcX's worker modules—the baseline steps that would be taken by every cold funcX function. We deploy the containers on an EC2 m5.large instance and on compute nodes on Theta and Cori, following best practices laid out in facility documentation. Table 4 shows the results. We speculate that the significant performance deterioration of container instantiation on HPC systems can be attributed to a combination of slower clock speed on KNL nodes and shared file system contention when fetching images. These results highlight the need to apply function warming approaches to reduce overheads.
To evaluate the effect of executor-side batching we submit 10 000 concurrent "no-op" function requests and measure the completion time when executors can request one function at a time (batching disabled) vs. when they can request many functions at a time based on the number of idle containers (batching enabled). We use 4 nodes (64 containers each) on Theta. We observe that the completion time with batching enabled is 6.7 s (compared to 118 s when disabled).

Table 4: Cold container instantiation time for different container technologies on different resources.

System   Container     Min (s)   Max (s)   Mean (s)
Theta    Singularity     9.83     14.06      10.40
Cori     Shifter         7.25     31.26       8.49
EC2      Docker          1.74      1.88       1.79
EC2      Singularity     1.19      1.26       1.22
To evaluate the effect of user-driven batching we explore the scientific use cases discussed in §7. These use cases represent various scientific functions, ranging in execution time from half a second through to almost one minute, and provide perspective on the real-world effects of batching on different types of functions. The batch size is defined as the number of requests transmitted to the container for execution. Figure 8 shows the average latency per request (total completion time of the batch divided by the batch size) as the batch size is increased. We observe that batching provides enormous benefit for the shortest-running functions and reduces the average latency dramatically when combining tens or hundreds of requests. However, larger batches provide little benefit, implying it would be better to distribute the requests to additional workers. Similarly, long-running functions do not benefit, as the communication and startup costs are small compared to the computation time.
Figure 8: Effect of batching on each of the scientific use cases. Batch sizes vary between 1 and 1024.
To measure the effect of prefetching, we create "no-op" and "sleep" functions of different durations (i.e., 1, 10, and 100 ms), and measure the completion time of 10 000 concurrent function requests as the prefetch count per node is increased. Figure 9 shows the results for each function with 4 nodes (64 containers each) on Theta. We observe that the completion time decreases dramatically as the prefetch count increases. This benefit starts diminishing when the prefetch count is greater than 64, which implies that a good setting of the prefetch count would be close to the number of containers per node.
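The observation that a good prefetch count is close to the number of containers per node suggests a simple capacity rule for how many tasks an executor should request at once. The following is an illustrative sketch of such a rule, not funcX's actual policy:

```python
def tasks_to_request(idle_workers, outstanding, prefetch_count):
    """Tasks an executor should pull: enough to fill idle workers plus a prefetch buffer.

    outstanding counts tasks already requested but not yet completed, so the
    executor keeps at most idle_workers + prefetch_count tasks in flight.
    """
    return max(0, idle_workers + prefetch_count - outstanding)

# With 64 containers per node and a matching prefetch count, a fully busy
# node still queues up to 64 tasks, interleaving communication with compute.
print(tasks_to_request(idle_workers=0, outstanding=0, prefetch_count=64))  # → 64
```

Setting the prefetch buffer to roughly one task per container keeps every worker fed without over-committing tasks to a single node, consistent with the diminishing returns seen beyond 64 in Figure 9.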
To demonstrate the benefits of funcX in science we describe five case studies in which it is being used: scalable metadata extraction, machine learning inference as a service, synchrotron serial crystallography, neuroscience, and correlation spectroscopy. Figure 10 shows execution time distributions for each case study. These short-duration tasks exemplify opportunities for FaaS in science.
Figure 9: Effect of prefetching.

Figure 10: Distribution of latencies for 100 function calls, for each of the five use cases described in the text.

Metadata Extraction:
The effects of high-velocity data expansion are making it increasingly difficult to organize and discover data. Edge file systems and data repositories now store petabytes of data, and new data is created and existing data modified at an alarming rate [48]. To make sense of these repositories and file systems, systems such as Skluma [52] are used to crawl file systems and extract metadata. Skluma is comprised of a set of general and specialized metadata extractors, ranging from those designed to process tabular data through to those that identify locations in maps. All are implemented in Python, with various dependencies, and each executes for between 3 milliseconds and 15 seconds. Skluma uses funcX to execute metadata extraction functions directly on the endpoint on which data reside, without moving them to the cloud.
Machine Learning Inference:
As ML becomes increasingly pervasive, new systems are required to support model-in-the-loop scientific processes. DLHub [25] is one such tool designed to enable the use of ML in science by supporting the publication and serving of ML models for on-demand inference. ML models are often represented as functions, with a set of dependencies that can be included in a container. DLHub's publication tools help users describe their models using a defined metadata schema. Once described, model artifacts are published in the DLHub catalog by uploading the raw model (e.g., PyTorch, TensorFlow) and model state (e.g., training data, hyperparameters). DLHub uses this information to create a container for the model using repo2docker [29] that contains all model dependencies, necessary model state, as well as funcX software to invoke the model. DLHub then uses funcX to manage the execution of model inference tasks. In Figure 10 we show the execution time when invoking the MNIST digit identification model. While the MNIST model runs for less than two seconds, many of the other DLHub models execute for several minutes. funcX provides several advantages to DLHub, most notably that it allows DLHub to use remote compute resources via a simple interface, and includes performance optimizations (e.g., batching and caching) that improve overall inference performance.

Synchrotron Serial Crystallography (SSX) is a new technique that can image small crystal samples 1-2 orders of magnitude faster than other methods [22, 59] and that offers biologists many new capabilities, such as imaging of conformation changes, very low X-ray doses for sensitive samples, room temperature for more biologically relevant environments, radiation sensitivity for metalloproteins, and discovery of redox potentials in active sites.
To keep pace with the increased data production, SSX researchers require new automated methods of computing that can process the resulting data with great rapidity: for example, to count the bright spots in an image ("stills processing") within seconds, both for quality control and as a first step in structure determination. We have deployed the DIALS [58] crystallography processing tools as funcX functions. funcX allows SSX researchers to submit the same stills process function either to a local endpoint to perform data validation, or to offload large batches of invocations to HPC resources to process entire datasets and derive crystal structures.
Quantitative neurocartography and connectomics involve the mapping of the neurological connections in the brain—a compute- and data-intensive process that requires processing ~20 GB every minute during experiments. We have used funcX as part of an automated workflow to perform quality control on raw images (to validate that the instrument and sample are correctly configured), apply ML models to detect image centers for subsequent reconstruction, and generate preview images to guide positioning. funcX has proven to be a significant improvement over previous practice, which depended on batch computing jobs that were subject to long scheduling delays and required frequent manual intervention for authentication, configuration, and failure resolution. funcX allows these workloads to be more flexibly implemented, making use of a variety of available computing resources, and removing the overheads of managing compute environments manually. Further, it allows these researchers to integrate computing into their automated visualization and analysis workflows (e.g., TomoPy [34] and Automo [26]) via programmatic APIs.
X-ray Photon Correlation Spectroscopy (XPCS) is an experimental technique used at Argonne's Advanced Photon Source to study the dynamics in materials at the nanoscale by identifying correlations in time series of area detector images. This process involves analyzing the pixel-by-pixel correlations for different time intervals. The current detector can acquire megapixel frames at 60 Hz (~120 MB/sec). Computing correlations at these data rates is a challenge that requires HPC resources but also rapid response time. We deployed XPCS-eigen's corr function as a funcX function to evaluate the rate at which data can be processed. Corr is able to process a dataset in ~50 seconds. Images can be processed in parallel by using funcX to invoke corr functions on demand.
Lessons learned:
We briefly conclude by describing our experiences applying funcX to the five scientific case studies. Before using funcX, these types of use cases would rely on manual development and deployment of software on batch submission systems.

Based on discussion with these researchers we have identified the following benefits of the funcX approach in these scenarios. First, funcX abstracts the complexity of using HPC resources. Researchers were able to incorporate scalable analyses without having to know anything about the computing environment (submission queues, container technology, etc.) that was being used. Further, they did not have to use cumbersome two-factor authentication, manually scale workloads, or map their applications to batch jobs. This was particularly beneficial to the SSX use case, as it was trivial to scale the analysis from one to thousands of images. Many of these use cases use funcX to enable event-based processing. We found that the funcX model lends itself well to such use cases, as it allows for the execution of sporadic workloads. For example, the neurocartography, XPCS, and SSX use cases all exhibit such characteristics, requiring compute resources only when experiments are running. Finally, funcX allowed users to securely share their codes, allowing other researchers to easily (without needing to set up environments) apply functions to their own datasets. This was particularly useful in the XPCS use case, as many researchers share access to the same instrument.

While initial feedback has been encouraging, our experiences also highlighted several challenges that need to be addressed. For example, while it is relatively easy to debug a running funcX function, it can be difficult to determine why a function fails when first published. Similarly, containerization does not necessarily provide entirely portable codes that can be run on arbitrary resources, due to the need to compile and link resource-specific modules. For example, in the XPCS use case we needed to compile codes specifically for a target resource. Finally, the current funcX endpoint software does not yet support multiple allocations. To accommodate more general use of funcX for distinct projects we need to develop a model to specify an allocation and provide accounting and billing models to report usage on a per-user and per-function basis.
Here we presented funcX—a FaaS platform designed to enable the low-latency, scalable, and secure execution of functions on almost any accessible computing resource. funcX can be deployed on existing HPC infrastructure to enable "serverless supercomputing." We demonstrated that funcX provides comparable latency to that of cloud-hosted FaaS platforms and showed that funcX can execute 1M tasks over 65 000 concurrent workers when deployed on 1024 nodes of the Theta supercomputer. Based on early experiences using funcX in five scientific use cases, we have found that the approach is not only performant but also flexible in terms of the diverse requirements it can address. In future work we will extend funcX's container management capabilities to dynamically create containers based on function requirements and stage them to endpoints on demand. We will also explore techniques to share containers between functions with similar dependencies. We plan to design customized, resource-aware scheduling algorithms to further improve performance. Finally, we are actively developing a multi-tenant endpoint and the additional isolation techniques necessary to provide safe and secure execution. funcX is open source and available on GitHub.

ACKNOWLEDGMENT
This work was supported in part by Laboratory Directed Research and Development funding from Argonne National Laboratory under U.S. Department of Energy Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. We thank the Argonne Leadership Computing Facility for access to the PetrelKube Kubernetes cluster and Amazon Web Services for providing research credits to enable rapid service prototyping.
REFERENCES
USENIX Annual Technical Conference. 923–935.
[18] Rachana Ananthakrishnan, Ben Blaiszik, Kyle Chard, Ryan Chard, Brendan McCollam, et al. 2018. Globus Platform Services for Data Publication. In Practice and Experience on Advanced Research Computing. ACM, Article 14, 7 pages.
[19] Rachana Ananthakrishnan, Kyle Chard, Ian Foster, Mattias Lidman, Brendan McCollam, et al. 2016. Globus Auth: A research identity and access management platform.
[20] Yadu Babuji, Anna Woodard, Zhuozhao Li, Ben Clifford, Rohan Kumar, et al. 2019. Parsl: Pervasive Parallel Programming in Python. In ACM International Symposium on High-Performance Parallel and Distributed Computing.
[21] Ioana Baldini, Paul Castro, Kerry Chang, Perry Cheng, Stephen Fink, et al. 2017. Serverless computing: Current trends and open problems. In Research Advances in Cloud Computing. Springer, 1–20.
[22] Sébastien Boutet, Lukas Lomb, Garth J Williams, Thomas RM Barends, Andrew Aquila, et al. 2012. High-resolution protein structure determination by serial femtosecond crystallography. Science.
[23] IEEE Cloud Computing 1, 3 (2014), 46–55.
[24] R. Chard, K. Chard, J. Alt, D. Y. Parkinson, S. Tuecke, et al. 2017. Ripple: Home Automation for Research Data Management. 389–394. https://doi.org/10.1109/ICDCSW.2017.30
[25] Ryan Chard, Zhuozhao Li, Kyle Chard, Logan T. Ward, Yadu N. Babuji, et al. 2019. DLHub: Model and data serving for science.
[26] Francesco De Carlo. Automo. https://automo.readthedocs.io. Accessed April 10, 2019.
[27] Javier Duarte, Song Han, Philip Harris, Sergo Jindariani, Edward Kreinar, et al. 2018. Fast inference of deep neural networks in FPGAs for particle physics. Journal of Instrumentation 13, 07 (2018), P07027.
[28] Robert M Fano. 1965. The MAC system: The computer utility approach. IEEE Spectrum 2, 1 (1965), 56–64.
[29] Jessica Forde, Tim Head, Chris Holdgraf, Yuvi Panda, Gladys Nalvarete, et al. 2018. Reproducible research environments with repo2docker. (2018).
[30] Ian Foster and Dennis B Gannon. 2017. Cloud Computing for Science and Engineering. MIT Press.
[31] Ian Foster, Carl Kesselman, and Steve Tuecke. 2001. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of Supercomputer Applications 15, 3 (2001), 200–222.
[32] Geoffrey Fox and Shantenu Jha. 2017. Conceptualizing a Computing Platform for Science Beyond 2020: To Cloudify HPC, or HPCify Clouds? IEEE, 808–810.
[33] Geoffrey C Fox, Vatche Ishakian, Vinod Muthusamy, and Aleksander Slominski. 2017. Status of serverless computing and function-as-a-service (FaaS) in industry and research. arXiv preprint arXiv:1708.08028 (2017).
[34] Doga Gürsoy, Francesco De Carlo, Xianghui Xiao, and Chris Jacobsen. 2014. TomoPy: A framework for the analysis of synchrotron tomographic data. Journal of Synchrotron Radiation 21, 5 (2014), 1188–1193.
[35] Kelsey Hightower, Brendan Burns, and Joe Beda. 2017. Kubernetes: Up and Running: Dive into the Future of Infrastructure (1st ed.). O'Reilly Media, Inc.
[36] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, et al. 2011. Mesos: A Platform for Fine-grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11). USENIX Association, Berkeley, CA, USA, 295–308. http://dl.acm.org/citation.cfm?id=1972457.1972488
[37] V. Ishakian, V. Muthusamy, and A. Slominski. 2018. Serving Deep Learning Models in a Serverless Platform. 257–262. https://doi.org/10.1109/IC2E.2018.00052
[38] Douglas M Jacobsen and Richard Shane Canon. 2015. Contain this, unleashing Docker for HPC. Cray User Group (2015).
[39] Gregory Kiar, Shawn T Brown, Tristan Glatard, and Alan C Evans. 2019. A Serverless Tool for Platform Agnostic Computational Experiment Management. Frontiers in Neuroinformatics 13 (2019), 12.
[40] Klaus Krauter, Rajkumar Buyya, and Muthucumaru Maheswaran. 2002. A taxonomy and survey of grid resource management systems for distributed computing. Software: Practice and Experience 32, 2 (2002), 135–164. https://doi.org/10.1002/spe.432
[41] Gregory M Kurtzer, Vanessa Sochat, and Michael W Bauer. 2017. Singularity: Scientific containers for mobility of compute. PLoS ONE 12, 5 (2017), e0177459.
[42] Maciej Malawski. 2016. Towards Serverless Execution of Scientific Workflows - HyperFlow Case Study. In WORKS@SC. 25–33.
[43] Dirk Merkel. 2014. Docker: Lightweight Linux containers for consistent development and deployment. Linux Journal 239 (2014), 2.
[44] Dejan S Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, et al. Peer-to-peer computing.
[45] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, et al. 2018. Ray: A distributed framework for emerging AI applications. In
OSDI-13 . 561–577.[46] Apostolos Papageorgiou, Bin Cheng, and Ernö Kovacs. 2015. Real-time data reduc-tion at the network edge of Internet-of-Things systems. In . IEEE, 284–291.[47] D Parkhill. 1966.
The challenge of the computer utility.
Addison-Wesley.[48] Arnab K. Paul, Steven Tuecke, Ryan Chard, Ali R. Butt, Kyle Chard, et al. 2017.Toward Scalable Monitoring on Large-scale Storage for Software Defined Cy-berinfrastructure. In . ACM, New York,NY, USA, 49–54. https://doi.org/10.1145/3149393.3149402[49] Reid Priedhorsky and Tim Randles. 2017. CharlieCloud: Unprivileged containersfor user-defined software stacks in HPC. In
International Conference for HighPerformance Computing, Networking, Storage and Analysis . ACM, 36.[50] Maria A. Rodriguez and Rajkumar Buyya. 2019. Container-based clusterorchestration systems: A taxonomy and future directions.
Software: Prac-tice and Experience
49, 5 (2019), 698–719. https://doi.org/10.1002/spe.2660arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/spe.2660[51] Keith Seymour, Hidemoto Nakada, Satoshi Matsuoka, Jack Dongarra, Craig Lee,et al. 2002. Overview of GridRPC: A Remote Procedure Call API for Grid Com-puting. In
Grid Computing — GRID 2002 , Manish Parashar (Ed.). Springer BerlinHeidelberg, Berlin, Heidelberg, 274–278.[52] Tyler J Skluzacek, Rohan Kumar, Ryan Chard, Galen Harrison, Paul Beckman,et al. 2018. Skluma: An extensible metadata extraction pipeline for disorganizeddata. In . IEEE,256–266.[53] Josef Spillner, Cristian Mateos, and David A Monge. 2017. Faaster, better, cheaper:The prospect of serverless scientific computing and HPC. In
Latin American HighPerformance Computing Conference . Springer, 154–168.[54] Joe Stubbs, Rion Dooley, and Matthew Vaughn. 2017. Containers-as-a-servicevia the Actor Model. In .[55] Matteo Turilli, Mark Santcroos, and Shantenu Jha. 2018. A comprehensiveperspective on pilot-job systems.
Comput. Surveys
51, 2 (2018), 43.[56] Blesson Varghese, Philipp Leitner, Suprio Ray, Kyle Chard, Adam Barker, et al.2019. Cloud Futurology.
Computer (2019).
57] Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and MichaelSwift. 2018. Peeking behind the curtains of serverless platforms. In
USENIXAnnual Technical Conference . 133–146.[58] David G Waterman, Graeme Winter, James M Parkhurst, Luis Fuentes-Montero,Johan Hattne, et al. 2013. The DIALS framework for integration software.
CCP4Newslett. Protein Crystallogr
49 (2013), 13–15.[59] Max O. Wiedorn, Dominik Oberthür, Richard Bean, Robin Schubert, NadineWerner, et al. 2018. Megahertz serial crystallography.
Nature Communications9,1 (2018).