BeFaaS: An Application-Centric Benchmarking Framework for FaaS Platforms
Martin Grambow, Tobias Pfandzelter, Luk Burchard, Carsten Schubert, Max Zhao, David Bermbach
TU Berlin & Einstein Center Digital Future, Mobile Cloud Computing Research Group
Berlin, Germany
{mg, tp, lubu, casc, mazh, db}@mcc.tu-berlin.de

Abstract
Following the increasing interest and adoption of FaaS systems, benchmarking frameworks for determining non-functional properties have also emerged. While existing (microbenchmark) frameworks only evaluate single aspects of FaaS platforms, a more holistic, application-driven approach is still missing.

In this paper, we design and present BeFaaS, an application-centric benchmarking framework for FaaS environments that focuses on the evaluation with realistic and typical use cases for FaaS applications. BeFaaS comes with two built-in benchmarks (an e-commerce and an IoT application), is extensible for new workload profiles and new platforms, supports federated benchmark runs in which the benchmark application is distributed over multiple providers, and supports a fine-grained result analysis. Our evaluation compares three major FaaS providers in single cloud provider setups and analyzes the traces of a federated fog setup. It shows that BeFaaS is capable of running each benchmark automatically with minimal configuration effort and providing detailed insights for each interaction.
Introduction

All major cloud providers offer Function-as-a-Service (FaaS) solutions where users only have to take care of their source code (functions) while the underlying infrastructure and environment is abstracted away by the provider. FaaS-based applications are split by their business functionality into individual functions which are deployed on a FaaS platform which, e.g., handles the execution and automatic scaling. The developer does not have any direct control over the infrastructure and can only define high-level parameters, such as the region in which the function should run. This complicates an already challenging comparison of cloud providers [5, 22], as the cloud variability is further compounded by an additional, unknown infrastructure component.

Existing work dealing with benchmarking of FaaS platforms focuses on the execution of small, so-called microbenchmarks which deploy and call a simple function (e.g., a matrix multiplication [3] or a random number generator [23]). While microbenchmarks are useful for studying and comparing specific characteristics, they can give only focused and limited insights into the platform behavior that applications can expect [10]. An application-centric benchmark, in contrast, mimics the behavior of a realistic application while closely observing the platform behavior. This allows developers to better compare different service options, a strategy also taken by the TPC benchmarks. To the best of our knowledge, such an application-centric benchmark for FaaS platforms does not exist yet.

To address this gap, we here propose BeFaaS, an extensible framework for executing application-centric benchmarks against FaaS platforms which comes with two realistic example benchmarks – an e-commerce and an IoT application. BeFaaS is also the first benchmarking framework with out-of-the-box support for federated cloud [20] setups which allows us to evaluate complex configurations in which an application is distributed over multiple FaaS platforms running on a mixture of cloud, edge, and fog nodes. Beyond this, BeFaaS is focused on ease-of-use and collects fine-grained measurements which can be used for a detailed post-experiment drill-down analysis, e.g., to identify cold starts or other request-level effects.

In this regard, we make the following contributions:

• We derive requirements for an application-centric FaaS benchmarking framework.
• We propose BeFaaS, an extensible framework for the execution of application-centric FaaS benchmarks and describe two example benchmarks.
• We present our proof-of-concept prototype which is available as open source and currently supports six FaaS platforms.
• We run a number of experiments and use them to compare three public FaaS offerings. We also showcase how BeFaaS can evaluate mixed cloud/edge deployments.

This paper is structured as follows: After outlining the related work in Section 2 and deriving the requirements for an application-centric FaaS benchmark in Section 3, we present the design, architecture, and features of BeFaaS in Section 4. Next, we describe our implementation of BeFaaS including the two built-in benchmarks in Section 5, which we then use to evaluate three FaaS platforms and to showcase the benchmarking of a mixed cloud/edge deployment (Section 6). Finally, we discuss the current limitations and future work in Section 7 before concluding in Section 8.
Related Work

Existing research on benchmarking of FaaS environments has so far focused on microbenchmarks. Application-centric benchmarks that consider the overall performance of multiple functions, the interaction with external services, and the effects of different application load profiles are mostly still missing.

Microbenchmarks call single functions repeatedly and evaluate the resulting metrics. These functions are often designed for a specific purpose, e.g., to stress the CPU of the test system or to evaluate the test system with a disk-intensive workload. Multiple performance evaluation studies are based on microbenchmarks which compare FaaS vendors, e.g., [3, 14, 21, 23–25, 34, 35]. Besides scaling of functions, cold start latency, and instance lifetimes, the studies also evaluate metrics such as CPU utilization, network throughput, and costs. Almost all experiments, however, focus on single isolated aspects and do not create comparability of platforms for FaaS application developers.

Some studies also consider more complex applications such as image processing [19], analyze chained functions, or deploy real-world applications on serverless platforms [35]. While these papers also use application-centric workloads for experiments, their goal was not to propose a comprehensive framework for the execution of application-centric FaaS benchmarks.

PanOpticon [32] uses a deployment, workload, and metrics module to evaluate chained functions and a simple chat server on two different FaaS vendors. Although PanOpticon has similar goals as BeFaaS, it neither supports detailed drill-down analysis nor federated multi-provider setups. Also, van Eyk et al. [33] developed a high-level architecture and stated requirements for serverless benchmarking. While their project has a similar goal as BeFaaS, it unfortunately seems to still be in a vision state. Existing preliminary source code components are, according to the paper, not available online, whereas we publish BeFaaS as an open-source research prototype.

Beyond FaaS, there are a number of application-centric benchmarking frameworks in other domains, e.g., for database and storage systems [7, 13] or for virtual machines [11]. These can, however, not easily be adapted to FaaS platforms.
Requirements
While microbenchmarks are highly useful for studying individual features of a system-under-test (SUT), application-centric benchmarks support end-to-end comparison of different platforms and configurations. Aside from standard benchmarking requirements such as portability or fairness [7, 8, 10, 15, 18], an application-centric FaaS benchmarking framework needs to fulfill a number of specific requirements which we describe in this section.
R1 – Realistic Benchmark Application:
The performance of a FaaS platform depends on the application that is deployed on it. For instance, an application that frequently causes cold starts through a growing request rate will be better off on AWS Lambda while an application that frequently causes cold starts through short temporary load spikes will be better off on Apache OpenWhisk due to their different request queuing mechanisms [6]. This means that the benchmark application should be as close as possible to the real application for which the analysis is made [10], e.g., in line with the findings of [30]. A key requirement is, hence, that a FaaS benchmark should mimic real applications as closely as possible.

R2 – Extensibility for New Workloads:
FaaS platforms are highly flexible and can be used for a wide variety of applications, so the world of FaaS applications is evolving rapidly. As such, any set of “typical” FaaS applications – and thus the workload profile for a FaaS platform – can only be considered a snapshot in time. Likewise, the load profiles of existing FaaS applications, i.e., the amount and type of requests that the application handles, are likely to evolve over time. Therefore, we argue that a FaaS benchmarking framework should be easily extensible in terms of adding new benchmark applications and updating load profiles for existing benchmarks.

R3 – Support for Modern Deployments:
FaaS is often used as the “glue” between cloud services, web APIs, and legacy systems. Thus, a benchmarking framework must also consider these links and support external services. Furthermore, today's applications are often distributed over cloud, edge, and fog resources [9, 36]. Here, for example, hybrid clouds can keep sensitive functions on premises while non-critical functions are hosted in a public cloud; similar setups exist for edge and fog computing use cases [2, 17, 27]. As such, assuming a single-cloud deployment is unrealistic for benchmarks aiming to be as similar as possible to realistic applications. A benchmarking framework needs to support external services and federated setups in which application functions are deployed on one or more FaaS platforms distributed across cloud, edge, and fog.

R4 – Extensibility for New Platforms:
Today, all major cloud service providers offer FaaS platforms and there is a growing range of open-source FaaS systems, for example, systems that specifically target the edge [16, 28]. As interfaces are constantly evolving and new platforms are introduced, a cross-platform benchmarking framework needs to be extensible to support future FaaS platforms.

R5 – Support for Drill-down Analysis:
An application-centric FaaS benchmark can help to evaluate the suitability of different sets and configurations of FaaS platforms for a specific application. What it can usually not provide are explanations for its findings, e.g., the different cold start management behavior of AWS Lambda and Apache OpenWhisk mentioned above [6]. To facilitate root cause analysis and help evaluators explain the patterns they see in the benchmark results, we argue that an application-centric FaaS benchmarking framework should support drill-down analysis by logging fine-grained measurement results including typical metrics of microbenchmarks.

R6 – Minimum Required Configuration Overhead:
An application-centric FaaS benchmarking framework should be easy to use and provide reproducible results. This includes configuration, deployment, execution, as well as collection and analysis of results, e.g., based on infrastructure automation. Hence, a FaaS benchmarking framework should be designed to require as little manual effort as possible.
Figure 1: High-level overview of the BeFaaS architecture.
BeFaaS Design

In this section, we give an overview of the BeFaaS design, starting with an overview of the BeFaaS architecture and components (Section 4.1) before describing the key features of BeFaaS (Sections 4.2 to 4.5).
In BeFaaS, the execution of functions of a benchmark application is the workload that actually benchmarks the FaaS platform, i.e., executing a function creates stress on the SUT. Since functions do not “self-start” executing, we need an additional load generator that invokes the FaaS functions of our benchmark application; see also Figure 1 for a high-level architecture overview.

For a benchmark run, BeFaaS requires three inputs: (i) the source code of the FaaS functions forming the benchmark application, (ii) a load profile for the load generator, and (iii) a deployment configuration that describes the environment configuration for each function and FaaS platform (the SUTs).

For a benchmark run, application code and deployment configuration are initially converted into deployment artifacts by the Deployment Compiler. The Deployment Compiler instruments and wraps each function's code with BeFaaS library calls and injects vendor-specific instructions defined in deployment adapters, which enables request tracing and fine-grained metrics. The resulting deployment artifacts are passed to the Benchmark Manager.

The Benchmark Manager orchestrates the experiment: First, it sets up the SUT by deploying each function based on the information in the respective artifact. In the second step, it initializes the Load Generator with the workload information described in a load profile. Then, the benchmark run is triggered and the Load Generator invokes the functions of the benchmark application, which log every request in detail including timestamps, origin function, and called functions (if applicable). Finally, once the benchmark run is completed, the Benchmark Manager collects the log files from all FaaS platforms used, aggregates them into a joint results file, and destroys all provisioned resources; see Figure 2 for an overview of the components in the BeFaaS framework and their interactions.

Figure 2: The Deployment Compiler transforms application code into individual deployment artifacts based on a deployment configuration. These are then deployed and benchmarked by the Load Generator. Finally, the Benchmark Manager aggregates and reports fine-grained results.
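To make this workflow concrete, the following is a minimal sketch of what a deployment configuration could look like. The structure and field names are illustrative assumptions, not the actual BeFaaS configuration format.

```javascript
// Hypothetical deployment configuration: maps each function of the
// benchmark application to a target FaaS platform and region, and
// declares external services. Field names are illustrative only.
module.exports = {
  benchmark: 'webshop',
  functions: {
    frontend:       { provider: 'aws',   region: 'eu-west-1',    memory: 256 },
    cart:           { provider: 'aws',   region: 'eu-west-1',    memory: 256 },
    recommendation: { provider: 'gcp',   region: 'europe-west1', memory: 256 },
    checkout:       { provider: 'azure', region: 'westeurope',   memory: 256 },
  },
  services: {
    // External state store used by the benchmark application.
    redis: { provider: 'aws', region: 'eu-west-1' },
  },
};
```

In such a setup, the Deployment Compiler would read this mapping to decide which deployment adapter to apply per function and which endpoints to compile into the artifacts.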
To provide a relevant and realistic application-centric benchmark (R1), BeFaaS comes with two built-in benchmarks which represent two typical use cases for FaaS applications: an e-commerce and an IoT application (these applications are explained in further detail in Section 5.1). Both adhere to the empirical findings of Shahrad et al. [30], are composed of several functions that interact with each other to form function chains, and use external services such as a database system for persistence. Moreover, every benchmark application comes with a default load profile that covers all relevant aspects of the respective application as well as several further load profiles to emphasize selected stress situations, e.g., to provoke more cold starts. In combination, each benchmark represents a complete FaaS application: load balancing at the provider endpoints, interconnected calls of several functions, calls to external services such as database systems, and multiple load profiles which, e.g., provoke scaling of resources.

The modular design of BeFaaS, however, also allows us to easily add further benchmark applications and load profiles or to adapt existing ones to the concrete needs of the developer (R2). For adding a new benchmark, the respective application only needs to use the BeFaaS library (described in Section 5) for function calls and to have unique function names.

To support portability of benchmarks and federated deployments, BeFaaS relies on unique function names, individual deployment artifacts for every function, and a single endpoint for every deployed function (R3): With globally unique function names, the endpoints of the deployed functions are already known during the compilation phase. The Deployment Compiler maps these endpoints to the canonical function names (defined in the application) and compiles them into the source code. Moreover, the compiler also injects endpoints to external services such as database systems. This decouples the ability of a function to call another function or a platform service from its deployment location. This allows BeFaaS to support arbitrarily complex deployments: it is indeed possible to run every function on a different FaaS platform. In combination with open-source FaaS platforms, this also allows users to explore mixed cloud/edge/fog deployments as we will later demonstrate.

Each FaaS platform offers a different interface for life-cycle and configuration management of functions. As the smallest common interface, BeFaaS requires that each platform provides API-based access to (i) deploying functions, (ii) retrieving log entries from the standard logging interface, and (iii) removing functions. The Deployment Compiler wraps this functionality using an adapter mechanism and selects the appropriate instructions for the target platform specified in the deployment configuration. Additional FaaS platforms that fulfill this minimal interface can easily be added by implementing a corresponding adapter (R4).

To enable a detailed drill-down analysis of experiment results (R5), the Deployment Compiler injects and wraps code that collects detailed measurements during the benchmark run: The compiler adds timestamping to determine start, end, and latency of calls to functions and external services. Besides these timestamps, the compiler also injects code that generates context IDs and pair IDs to assign individual calls to their respective context later on. Here, a context ID is generated once for each function chain (the first function call) which is propagated to every subsequent call to other functions. To link the individual calls of a function chain, the compiler injects source code to create pair IDs of randomly generated keys that link calling and called function. Thus, it is possible to trace every single request through the benchmark application and to generate call trees for every context and function chain.

Finally, to independently and reliably detect cold starts, the Deployment Compiler also injects code that evaluates a local variable on the executor at the provider side. If this variable is not present, the function runs on a new executor (cold start), the variable is created, filled with a randomly generated key, and the cold start is logged.

All data that enable fine-grained results (timestamps, context IDs, pair IDs, and executor keys) are recorded on the console using the standard logging interface of the respective FaaS vendor. Initial experiments with Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure) have shown that the cost of logging is at most in the microsecond range.
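To make the instrumentation idea more concrete, the following sketch shows how an injected wrapper might detect cold starts and emit one structured log line per request. All names, the wrapper shape, and the log format are assumptions for illustration and do not reflect the actual BeFaaS code.

```javascript
const crypto = require('crypto');

// Module scope is preserved across warm invocations on the same executor:
// if this variable is unset, the function runs on a fresh executor (cold start).
let executorKey;

function wrapHandler(fnName, handler) {
  return async function instrumented(event) {
    const coldStart = executorKey === undefined;
    if (coldStart) {
      executorKey = crypto.randomBytes(8).toString('hex');
    }

    // Context ID: created once at the start of a function chain and then
    // propagated unchanged; pair ID: set by the caller so that calling and
    // called function can be linked when building call trees.
    const contextId = event.contextId || crypto.randomBytes(8).toString('hex');
    const pairId = event.pairId || null;

    const start = Date.now();
    const result = await handler({ ...event, contextId });
    const end = Date.now();

    // One structured line per request on the provider's standard logging
    // interface; the Benchmark Manager collects these after the run.
    console.log(JSON.stringify({
      fn: fnName, contextId, pairId, executorKey, coldStart, start, end,
    }));
    return result;
  };
}

module.exports = { wrapHandler };
```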
The BeFaaS framework requires only the application code, a deployment configuration, and a load profile to automatically perform the benchmark experiment (R6). First, all business logic, dependencies, and BeFaaS instrumentation logic are bundled into a single deployment artifact by the Deployment Compiler. Next, the Benchmark Manager orchestrates the experiment and provides a simple interface for starting the benchmark run, monitoring its progress, and collecting fine-grained results for further analysis.

Implementation

Our open-source prototype implementation of BeFaaS (https://github.com/Be-FaaS) includes (i) the BeFaaS library, (ii) six deployment adapters, (iii) the Deployment Compiler, (iv) the Benchmark Manager, (v) two realistic benchmark applications, and (vi) several load profiles for the benchmark applications (see Figure 2). The BeFaaS library is written in JavaScript and handles calls to other functions depending on their canonical name, generates tracing IDs, and takes timestamps. BeFaaS deployment adapters are implemented using Terraform commands. Currently, BeFaaS thus supports three major cloud offerings (AWS Lambda, Google Cloud Functions, and Azure Functions) as well as the three open-source systems tinyFaaS [28], OpenFaaS, and OpenWhisk [4], which support the deployment of functions on private infrastructure, including edge or fog nodes. The Deployment Compiler is a shell script that uses several tools to build the deployment adapters for the respective platforms, parses and injects information from the deployment configuration, and generates the deployment artifacts from the application code. The Benchmark Manager uses Terraform to create the infrastructure, collect the logs, and later remove provisioned resources. Both benchmark applications are written in JavaScript and include calls to external services such as a Redis (https://redis.io/) instance. The Load Generator uses Artillery (https://artillery.io/) to call the benchmark application. New load profiles can easily be added by specifying new Artillery load descriptions.
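As an illustration of how a benchmark function might use the library to call another function by its canonical name, consider the sketch below. The `call` helper, its signature, and the payload shape are assumptions for illustration; the actual BeFaaS library API may differ.

```javascript
// Hypothetical benchmark function: the callee is referenced only by its
// canonical name, so the Deployment Compiler can later bind "currency" to
// whatever endpoint the deployment configuration assigns to that function.
const lib = require('./befaas-lib'); // placeholder for the BeFaaS library

module.exports = async function checkout(event) {
  // Cross-function call by canonical name; the library wraps the actual
  // HTTP request to the compiled-in endpoint and records tracing data.
  const price = await lib.call('currency', {
    amount: event.amount,
    target: event.preferredCurrency,
  });

  const cart = await lib.call('cart', { userId: event.userId, op: 'list' });

  return { status: 'ordered', items: cart.items, total: price };
};
```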
Benchmark Applications

Each benchmark suite consists of a FaaS application, a realistic default load profile that stresses relevant aspects of the respective application, and several additional load profiles that emphasize specific stress situations, e.g., to provoke more cold starts.

The modular design of BeFaaS also allows the integration of external services. Both benchmark applications use Redis as an external service to persist state; currently, the Redis instance can be deployed on three major cloud providers: AWS, Microsoft Azure, or Google Cloud.

The Load Generator for both benchmarks uses Artillery running in a Docker container that can be deployed on an arbitrary instance.

E-Commerce Application (Webshop)
Our e-commerce benchmark implements a webshop that is inspired by Google's microservice demo application (https://github.com/GoogleCloudPlatform/microservices-demo). Our corresponding benchmark implementation follows the typical request-response-based invocation style and comprises 17 functions as well as a Redis instance (see Figure 3). Besides functions that provide recommendations and advertising, customers can log in, set their preferred currency, view products, fill a virtual shopping cart, check out orders, and finally observe the shipping. Each task is implemented in a separate function (in the figure, we grouped some functions to increase legibility) and all requests arrive at a single function, the frontend, which takes the customer calls and routes them to the respective backend functions. There are blocking synchronous calls to other functions as well as asynchronous call blocks that idle until all functions returned.

The default load profile simulates four different customer workflows and constant traffic for 15 minutes. The benchmark also includes alternative load profiles for a growth workload which linearly ramps up the load to 20 workflows per second over 15 minutes and a spike workload which suddenly increases the load from 3.5 to 20 workflows per second after five minutes, retains the high load for ten minutes, and finally continues with the lower load (3.5 workflows per second) for five minutes.

The e-commerce benchmark is particularly well suited for comparing different cloud providers but can also be used to explore federated cloud deployments, e.g., for scenarios in which the application is running on multiple cloud platforms.

Figure 3: The e-commerce application implements a webshop in 17 functions. The frontend serves as a single entry point and an external database is used to store state.
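To illustrate the single-entry-point structure, the following sketch routes incoming customer requests from the frontend to backend functions. The operation names, the routing table, and the `lib.call` helper are illustrative assumptions rather than the actual benchmark code.

```javascript
// Hypothetical frontend function of the webshop benchmark: every customer
// request arrives here and is forwarded to the responsible backend function.
const lib = require('./befaas-lib'); // placeholder for the BeFaaS library

const routes = {
  listProducts: (req) => lib.call('productcatalog', { op: 'list' }),
  getProduct:   (req) => lib.call('productcatalog', { op: 'get', id: req.id }),
  addCartItem:  (req) => lib.call('cart', { op: 'add', userId: req.userId, id: req.id }),
  checkout:     (req) => lib.call('checkout', { userId: req.userId }),
};

module.exports = async function frontend(request) {
  const handler = routes[request.operation];
  if (!handler) {
    return { status: 400, error: `unknown operation ${request.operation}` };
  }
  // Blocking, synchronous call into the respective backend function chain.
  return { status: 200, body: await handler(request) };
};
```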
IoT Application (Smart Traffic Light)
Although several IoT applications and use cases already exist in research (e.g., [1, 12, 17, 26, 29]), none of them could directly be used or adapted as a FaaS application. Thus, we designed our benchmark application around typical IoT patterns and implemented a use case based on a smart traffic control scenario, mostly inspired by [1, 12] and TU Vienna's InTraSafEd5G project (https://newsroom.magenta.at/2020/01/16/5g-anwendungen/).

The benchmark application implements an IoT use case with a smart traffic light which adapts its light phase based on traffic sensors, a camera, and weather inputs (see Figure 4). The functions initially filter incoming data streams and perform object recognition on camera footage to create a movement plan, detect ambulance/emergency cars, and maintain a traffic statistic. The regular light phase is then determined based on this movement plan, road conditions, and the current light phase. Emergency services can override this regular phase at any time by raising an emergency event that stops all other traffic.

The load profile for this application emulates sensor data and injects emergency events. The traffic sensor sends ten updates per second to the Traffic Sensor Filter, the Object Recognition processes four images per second, and the weather is updated every ten seconds. Furthermore, the Load Generator also injects an emergency event every two minutes which lasts five seconds each. This default load profile runs for 15 minutes. As this use case will in practice typically have a very predictable and stable load profile, we did not implement alternative load profiles – benchmark users can, however, easily add them if needed.

The IoT benchmark is particularly well suited for comparing different deployments across cloud, edge, and fog.

Figure 4: The IoT application implements a smart traffic light scenario in 9 functions. The Load Generator emulates sensor data and sends them to three different entry points.
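A sketch of how the final stage of this function chain might combine its inputs is shown below. The decision logic, field names, and thresholds are invented for illustration and are not the actual benchmark implementation.

```javascript
// Hypothetical light phase calculation: combines the movement plan, road
// conditions, and the current phase, and lets emergency events override
// the regular schedule.
module.exports = async function calculateLightPhase(event) {
  const { movementPlan, roadCondition, currentPhase, emergency } = event;

  // An active emergency event stops all other traffic immediately.
  if (emergency && emergency.active) {
    return { phase: 'all-red', reason: 'emergency-override' };
  }

  // Regular operation: give the direction with the most waiting traffic a
  // green phase, extended when road conditions are poor (e.g., rain or ice).
  const busiest = Object.entries(movementPlan.waitingVehicles)
    .sort(([, a], [, b]) => b - a)[0][0];
  const baseDuration = 20; // seconds, illustrative default
  const factor = roadCondition === 'poor' ? 1.5 : 1.0;

  return {
    phase: `green-${busiest}`,
    durationSeconds: Math.round(baseDuration * factor),
    previousPhase: currentPhase,
  };
};
```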
Evaluation

We evaluate BeFaaS in two different ways. We start by presenting the results of several experiments in which we use BeFaaS to stress different FaaS platforms (Section 6.1). Afterwards, in Section 6.2, we discuss to which degree BeFaaS fulfills our requirements from Section 3.
To showcase the broad applicability of BeFaaS, we run experiments with two different scenarios: First, in single cloud provider setups in which all functions of the respective benchmark application are deployed on a single provider. Here, we deploy the e-commerce benchmark on three major cloud providers (namely AWS, Azure, and GCP) and use the default load profile to compare them. In the second scenario, we deploy the IoT benchmark in a federated fog setup in which some functions are running in the cloud (GCP) and others on the edge (tinyFaaS).
With BeFaaS, running the exact same benchmark configuration on different platforms is easy, which we use to compare three cloud providers.

Figure 5 shows the basic setup of our cloud experiments: We deploy the Load Generator on a (vastly over-provisioned) virtual machine (2 vCPUs and 4 GB RAM) and let it execute the default load profile against the e-commerce application deployed in either eu-west-1 for AWS, westeurope for Azure, or europe-west1 for GCP. Moreover, the Redis database system used by the webshop also runs on an over-provisioned virtual machine (2 vCPUs and 4 GB RAM; t3a.medium at AWS, Standard B2S in Azure, and e2-medium at GCP) at the respective provider site. This ensures that the database instance and Load Generator will not be a bottleneck during the experiment [10]. During each experiment, the Load Generator executes 18,000 workflows, which each consist of 1 to 9 requests, over a time span of 15 minutes. Since the focus of this paper is on BeFaaS and its features and not on providing an in-depth performance analysis of different cloud providers, we decided not to repeat the experiment several times.
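For reference, 18,000 workflows over a 15-minute (900 s) run correspond to an average arrival rate of 18,000 / 900 = 20 workflows per second.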
Figure 5: As part of the FaaS application, the database instance is deployed in the same region and on the same provider as the rest of the webshop.

Figure 6 shows the execution duration of four selected functions which are called from the frontend function (as boxplots, boxes represent quartiles, whiskers show the minimum and maximum values without outliers beyond 1.5 times the Inter Quartile Range). For the four functions exam-