BeFaaS: An Application-Centric Benchmarking Framework for FaaS Platforms
Martin Grambow, Tobias Pfandzelter, Luk Burchard, Carsten Schubert, Max Zhao, David Bermbach
TU Berlin & Einstein Center Digital Future, Mobile Cloud Computing Research Group
Berlin, Germany
{mg, tp, lubu, casc, mazh, db}@mcc.tu-berlin.de

Abstract
Following the increasing interest and adoption of FaaS systems, benchmarking frameworks for determining non-functional properties have also emerged. While existing (microbenchmark) frameworks only evaluate single aspects of FaaS platforms, a more holistic, application-driven approach is still missing.

In this paper, we design and present BeFaaS, an application-centric benchmarking framework for FaaS environments that focuses on the evaluation with realistic and typical use cases for FaaS applications. BeFaaS comes with two built-in benchmarks (an e-commerce and an IoT application), is extensible for new workload profiles and new platforms, supports federated benchmark runs in which the benchmark application is distributed over multiple providers, and supports a fine-grained result analysis. Our evaluation compares three major FaaS providers in single cloud provider setups and analyzes the traces of a federated fog setup. It shows that BeFaaS is capable of running each benchmark automatically with minimal configuration effort and providing detailed insights for each interaction.
Introduction

All major cloud providers offer Function-as-a-Service (FaaS) solutions where users only have to take care of their source code (functions) while the underlying infrastructure and environment is abstracted away by the provider. FaaS-based applications are split by their business functionality into individual functions which are deployed on a FaaS platform which, e.g., handles the execution and automatic scaling. The developer does not have any direct control over the infrastructure and can only define high-level parameters, such as the region in which the function should run. This complicates an already challenging comparison of cloud providers [5, 22], as the cloud variability is further compounded by an additional, unknown infrastructure component.

Existing work dealing with benchmarking of FaaS platforms focuses on the execution of small, so-called microbenchmarks which deploy and call a simple function (e.g., a matrix multiplication [3] or a random number generator [23]). While microbenchmarks are useful for studying and comparing specific characteristics, they can give only focused and limited insights into the platform behavior that applications can expect [10]. An application-centric benchmark, in contrast, mimics the behavior of a realistic application while closely observing the platform behavior. This allows developers to better compare different service options, a strategy also taken by the TPC benchmarks. To the best of our knowledge, such an application-centric benchmark for FaaS platforms does not exist yet.

To address this gap, we here propose BeFaaS, an extensible framework for executing application-centric benchmarks against FaaS platforms which comes with two realistic example benchmarks – an e-commerce and an IoT application. BeFaaS is also the first benchmarking framework with out-of-the-box support for federated cloud [20] setups which allows us to evaluate complex configurations in which an application is distributed over multiple FaaS platforms running on a mixture of cloud, edge, and fog nodes. Beyond this, BeFaaS is focused on ease-of-use and collects fine-grained measurements which can be used for a detailed post-experiment drill-down analysis, e.g., to identify cold starts or other request-level effects.

In this regard, we make the following contributions:

• We derive requirements for an application-centric FaaS benchmarking framework.
• We propose BeFaaS, an extensible framework for the execution of application-centric FaaS benchmarks and describe two example benchmarks.
• We present our proof-of-concept prototype which is available as open source and currently supports six FaaS platforms.
• We run a number of experiments and use them to compare three public FaaS offerings. We also showcase how BeFaaS can evaluate mixed cloud/edge deployments.

This paper is structured as follows: After outlining the related work in Section 2 and deriving the requirements for an application-centric FaaS benchmark in Section 3, we present the design, architecture, and features of BeFaaS in Section 4. Next, we describe our implementation of BeFaaS including the two built-in benchmarks in Section 5, which we then use to evaluate three FaaS platforms and to showcase the benchmarking of a mixed cloud/edge deployment (Section 6). Finally, we discuss the current limitations and future work in Section 7 before concluding in Section 8.
Related Work

Existing research on benchmarking of FaaS environments has so far focused on microbenchmarks. Application-centric benchmarks that consider the overall performance of multiple functions, the interaction with external services, and the effects of different application load profiles are mostly still missing.

Microbenchmarks call single functions repeatedly and evaluate the resulting metrics. These functions are often designed for a specific purpose, e.g., to stress the CPU of the test system or to evaluate the test system with a disk-intensive workload. Multiple performance evaluation studies are based on microbenchmarks which compare FaaS vendors, e.g., [3, 14, 21, 23–25, 34, 35]. Besides scaling of functions, cold start latency, and instance lifetimes, the studies also evaluate metrics such as CPU utilization, network throughput, and costs. Almost all experiments, however, focus on single isolated aspects and do not create comparability of platforms for FaaS application developers.

Some studies also consider more complex applications such as image processing [19], analyze chained functions, or deploy real-world applications on serverless platforms [35]. While these papers also use application-centric workloads for experiments, their goal was not to propose a comprehensive framework for the execution of application-centric FaaS benchmarks.

PanOpticon [32] uses a deployment, workload, and metrics module to evaluate chained functions and a simple chat server on two different FaaS vendors. Although PanOpticon has similar goals as BeFaaS, it neither supports detailed drill-down analysis nor federated multi-provider setups. Also, van Eyk et al. [33] developed a high-level architecture and stated requirements for serverless benchmarking. While their project has a similar goal as BeFaaS, it unfortunately seems to still be in a vision state. Existing preliminary source code components are, according to the paper, not available online, whereas we publish BeFaaS as an open-source research prototype.

Beyond FaaS, there are a number of application-centric benchmarking frameworks in other domains, e.g., for database and storage systems [7, 13] or for virtual machines [11]. These can, however, not easily be adapted to FaaS platforms.
Requirements
While microbenchmarks are highly useful for studying individual features of a system-under-test (SUT), application-centric benchmarks support end-to-end comparison of different platforms and configurations. Aside from standard benchmarking requirements such as portability or fairness [7, 8, 10, 15, 18], an application-centric FaaS benchmarking framework needs to fulfill a number of specific requirements which we describe in this section.
R1 – Realistic Benchmark Application:
The performance of a FaaS platform depends on the application that is deployed on it. For instance, an application that frequently causes cold starts through a growing request rate will be better off on AWS Lambda while an application that frequently causes cold starts through short temporary load spikes will be better off on Apache OpenWhisk due to their different request queuing mechanisms [6]. This means that the benchmark application should be as close as possible to the real application for which the analysis is made [10], e.g., in line with the findings of [30]. A key requirement is, hence, that a FaaS benchmark should mimic real applications as closely as possible.

R2 – Extensibility for New Workloads:
FaaS platforms are highly flexible and can be used for a wide variety of applications, so the world of FaaS applications is evolving rapidly. As such, any set of “typical” FaaS applications – and thus the workload profile for a FaaS platform – can only be considered a snapshot in time. Likewise, the load profiles of existing FaaS applications, i.e., the amount and type of requests that the application handles, are likely to evolve over time. Therefore, we argue that a FaaS benchmarking framework should be easily extensible in terms of adding new benchmark applications and updating load profiles for existing benchmarks.

R3 – Support for Modern Deployments:
FaaS is often used as the “glue” between cloud services, web APIs, and legacy systems. Thus, a benchmarking framework must also consider these links and support external services. Furthermore, today's applications are often distributed over cloud, edge, and fog resources [9, 36]. Here, for example, hybrid clouds can keep sensitive functions on premises while non-critical functions are hosted in a public cloud; similar setups exist for edge and fog computing use cases [2, 17, 27]. As such, assuming a single-cloud deployment is unrealistic for benchmarks aiming to be as similar as possible to realistic applications. A benchmarking framework needs to support external services and federated setups in which application functions are deployed on one or more FaaS platforms distributed across cloud, edge, and fog.

R4 – Extensibility for New Platforms:
Today, all major cloud service providers offer FaaS platforms and there is a growing range of open-source FaaS systems, for example, systems that specifically target the edge [16, 28]. As interfaces are constantly evolving and new platforms are introduced, a cross-platform benchmarking framework needs to be extensible to support future FaaS platforms.

R5 – Support for Drill-down Analysis:
An application-centric FaaS benchmark can help to evaluate the suitability of different sets and configurations of FaaS platforms for a specific application. What it can usually not provide are explanations for its findings, e.g., the different cold start management behavior of AWS Lambda and Apache OpenWhisk mentioned above [6]. To facilitate root cause analysis and help evaluators explain the patterns they see in the benchmark results, we argue that an application-centric FaaS benchmarking framework should support drill-down analysis by logging fine-grained measurement results including typical metrics of microbenchmarks.

R6 – Minimum Required Configuration Overhead:
An application-centric FaaS benchmarking framework should be easy to use and provide reproducible results. This includes configuration, deployment, execution, as well as collection and analysis of results, e.g., based on infrastructure automation. Hence, a FaaS benchmarking framework should be designed to require as little manual effort as possible.
Figure 1: High-level overview of the BeFaaS architecture.
BeFaaS Design

In this section, we give an overview of the BeFaaS design, starting with an overview of the BeFaaS architecture and components (Section 4.1) before describing the key features of BeFaaS (Sections 4.2 to 4.5).
In BeFaaS, the execution of functions of a benchmark application is the workload that actually benchmarks the FaaS platform, i.e., executing a function creates stress on the SUT. Since functions do not “self-start” executing, we need an additional load generator that invokes the FaaS functions of our benchmark application; see also Figure 1 for a high-level architecture overview.

For a benchmark run, BeFaaS requires three inputs: (i) the source code of the FaaS functions forming the benchmark application, (ii) a load profile for the load generator, and (iii) a deployment configuration that describes the environment configuration for each function and FaaS platform (the SUTs).

For a benchmark run, application code and deployment configuration are initially converted into deployment artifacts by the Deployment Compiler. The Deployment Compiler instruments and wraps each function's code with BeFaaS library calls and injects vendor-specific instructions defined in deployment adapters, which enables request tracing and fine-grained metrics. The resulting deployment artifacts are passed to the Benchmark Manager.

The Benchmark Manager orchestrates the experiment: First, it sets up the SUT by deploying each function based on the information in the respective artifact. In the second step, it initializes the Load Generator with the workload information described in a load profile. Then, the benchmark run is triggered and the Load Generator invokes the functions of the benchmark application, which log every request in detail including timestamps, origin function, and called functions (if applicable). Finally, once the benchmark run is completed, the Benchmark Manager collects the log files from all FaaS platforms used, aggregates them into a joint results file, and destroys all provisioned resources; see Figure 2 for an overview of the components in the BeFaaS framework and their interactions.

Figure 2: The Deployment Compiler transforms application code into individual deployment artifacts based on a deployment configuration. These are then deployed and benchmarked by the Load Generator. Finally, the Benchmark Manager aggregates and reports fine-grained results.
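To make this workflow concrete, the following is a minimal sketch of what a deployment configuration could look like. The structure and field names are illustrative assumptions, not the actual BeFaaS configuration format.

```javascript
// Hypothetical deployment configuration: maps each function of the
// benchmark application to a target FaaS platform and region, and
// declares external services. Field names are illustrative only.
module.exports = {
  benchmark: 'webshop',
  functions: {
    frontend:       { provider: 'aws',   region: 'eu-west-1',    memory: 256 },
    cart:           { provider: 'aws',   region: 'eu-west-1',    memory: 256 },
    recommendation: { provider: 'gcp',   region: 'europe-west1', memory: 256 },
    checkout:       { provider: 'azure', region: 'westeurope',   memory: 256 },
  },
  services: {
    // External state store used by the benchmark application.
    redis: { provider: 'aws', region: 'eu-west-1' },
  },
};
```

In such a setup, the Deployment Compiler would read this mapping to decide which deployment adapter to apply per function and which endpoints to compile into the artifacts.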
To provide a relevant and realistic application-centric benchmark (R1), BeFaaS comes with two built-in benchmarks which represent two typical use cases for FaaS applications: an e-commerce and an IoT application (these applications are explained in further detail in Section 5.1). Both adhere to the empirical findings of Shahrad et al. [30], are composed of several functions that interact with each other to form function chains, and use external services such as a database system for persistence. Moreover, every benchmark application comes with a default load profile that covers all relevant aspects of the respective application as well as several further load profiles to emphasize selected stress situations, e.g., to provoke more cold starts. In combination, each benchmark represents a complete FaaS application: load balancing at the provider endpoints, interconnected calls of several functions, calls to external services such as database systems, and multiple load profiles which, e.g., provoke scaling of resources.

The modular design of BeFaaS, however, also allows us to easily add further benchmark applications and load profiles or to adapt existing ones to the concrete needs of the developer (R2). For adding a new benchmark, the respective application only needs to use the BeFaaS library (described in Section 5) for function calls and to have unique function names.

To support portability of benchmarks and federated deployments, BeFaaS relies on unique function names, individual deployment artifacts for every function, and a single endpoint for every deployed function (R3): With globally unique function names, the endpoints of the deployed functions are already known during the compilation phase. The Deployment Compiler maps these endpoints to the canonical function names (defined in the application) and compiles them into the source code. Moreover, the compiler also injects endpoints to external services such as database systems. This decouples the ability of a function to call another function or a platform service from its deployment location. This allows BeFaaS to support arbitrarily complex deployments: it is indeed possible to run every function on a different FaaS platform. In combination with open-source FaaS platforms, this also allows users to explore mixed cloud/edge/fog deployments as we will later demonstrate.

Each FaaS platform offers a different interface for life-cycle and configuration management of functions. As the smallest common interface, BeFaaS requires that each platform provides API-based access to (i) deploying functions, (ii) retrieving log entries from the standard logging interface, and (iii) removing functions. The Deployment Compiler wraps this functionality using an adapter mechanism and selects the appropriate instructions for the target platform specified in the deployment configuration. Additional FaaS platforms that fulfill this minimal interface can easily be added by implementing a corresponding adapter (R4).

To enable a detailed drill-down analysis of experiment results (R5), the Deployment Compiler injects and wraps code that collects detailed measurements during the benchmark run: The compiler adds timestamping to determine start, end, and latency of calls to functions and external services. Besides these timestamps, the compiler also injects code that generates context IDs and pair IDs to assign individual calls to their respective context later on. Here, a context ID is generated once for each function chain (the first function call) which is propagated to every subsequent call to other functions. To link the individual calls of a function chain, the compiler injects source code to create pair IDs of randomly generated keys that link calling and called function. Thus, it is possible to trace every single request through the benchmark application and to generate call trees for every context and function chain.

Finally, to independently and reliably detect cold starts, the Deployment Compiler also injects code that evaluates a local variable on the executor at the provider side. If this variable is not present, the function runs on a new executor (cold start), the variable is created, filled with a randomly generated key, and the cold start is logged.

All data that enable fine-grained results (timestamps, context IDs, pair IDs, and executor keys) are recorded on the console using the standard logging interface of the respective FaaS vendor. Initial experiments with Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure) have shown that the cost of logging is at most in the microsecond range.
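To make the instrumentation idea more concrete, the following sketch shows how an injected wrapper might detect cold starts and emit one structured log line per request. All names, the wrapper shape, and the log format are assumptions for illustration and do not reflect the actual BeFaaS code.

```javascript
const crypto = require('crypto');

// Module scope is preserved across warm invocations on the same executor:
// if this variable is unset, the function runs on a fresh executor (cold start).
let executorKey;

function wrapHandler(fnName, handler) {
  return async function instrumented(event) {
    const coldStart = executorKey === undefined;
    if (coldStart) {
      executorKey = crypto.randomBytes(8).toString('hex');
    }

    // Context ID: created once at the start of a function chain and then
    // propagated unchanged; pair ID: set by the caller so that calling and
    // called function can be linked when building call trees.
    const contextId = event.contextId || crypto.randomBytes(8).toString('hex');
    const pairId = event.pairId || null;

    const start = Date.now();
    const result = await handler({ ...event, contextId });
    const end = Date.now();

    // One structured line per request on the provider's standard logging
    // interface; the Benchmark Manager collects these after the run.
    console.log(JSON.stringify({
      fn: fnName, contextId, pairId, executorKey, coldStart, start, end,
    }));
    return result;
  };
}

module.exports = { wrapHandler };
```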
The BeFaaS framework requires only the application code, a deployment configuration, and a load profile to automatically perform the benchmark experiment (R6). First, all business logic, dependencies, and BeFaaS instrumentation logic are bundled into a single deployment artifact by the Deployment Compiler. Next, the Benchmark Manager orchestrates the experiment and provides a simple interface for starting the benchmark run, monitoring its progress, and collecting fine-grained results for further analysis.

Implementation

Our open-source prototype implementation of BeFaaS (https://github.com/Be-FaaS) includes (i) the BeFaaS library, (ii) six deployment adapters, (iii) the Deployment Compiler, (iv) the Benchmark Manager, (v) two realistic benchmark applications, and (vi) several load profiles for the benchmark applications (see Figure 2). The BeFaaS library is written in JavaScript and handles calls to other functions depending on their canonical name, generates tracing IDs, and takes timestamps. BeFaaS deployment adapters are implemented using Terraform commands. Currently, BeFaaS thus supports three major cloud offerings (AWS Lambda, Google Cloud Functions, and Azure Functions) as well as the three open-source systems tinyFaaS [28], OpenFaaS, and OpenWhisk [4], which support the deployment of functions on private infrastructure, including edge or fog nodes. The Deployment Compiler is a shell script that uses several tools to build the deployment adapters for the respective platforms, parses and injects information from the deployment configuration, and generates the deployment artifacts from the application code. The Benchmark Manager uses Terraform to create the infrastructure, collect the logs, and later remove provisioned resources. Both benchmark applications are written in JavaScript and include calls to external services such as a Redis (https://redis.io/) instance. The Load Generator uses Artillery (https://artillery.io/) to call the benchmark application. New load profiles can easily be added by specifying new Artillery load descriptions.
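As an illustration of how a benchmark function might use the library to call another function by its canonical name, consider the sketch below. The `call` helper, its signature, and the payload shape are assumptions for illustration; the actual BeFaaS library API may differ.

```javascript
// Hypothetical benchmark function: the callee is referenced only by its
// canonical name, so the Deployment Compiler can later bind "currency" to
// whatever endpoint the deployment configuration assigns to that function.
const lib = require('./befaas-lib'); // placeholder for the BeFaaS library

module.exports = async function checkout(event) {
  // Cross-function call by canonical name; the library wraps the actual
  // HTTP request to the compiled-in endpoint and records tracing data.
  const price = await lib.call('currency', {
    amount: event.amount,
    target: event.preferredCurrency,
  });

  const cart = await lib.call('cart', { userId: event.userId, op: 'list' });

  return { status: 'ordered', items: cart.items, total: price };
};
```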
Benchmark Applications

Each benchmark suite consists of a FaaS application, a realistic default load profile that stresses relevant aspects of the respective application, and several additional load profiles that emphasize specific stress situations, e.g., to provoke more cold starts.

The modular design of BeFaaS also allows the integration of external services. Both benchmark applications use Redis as an external service to persist state; currently, the Redis instance can be deployed on three major cloud providers: AWS, Microsoft Azure, or Google Cloud.

The Load Generator for both benchmarks uses Artillery running in a Docker container that can be deployed on an arbitrary instance.

E-Commerce Application (Webshop)
Our e-commerce benchmark implements a webshop that is inspired by Google's microservice demo application (https://github.com/GoogleCloudPlatform/microservices-demo). Our corresponding benchmark implementation follows the typical request-response-based invocation style and comprises 17 functions as well as a Redis instance (see Figure 3). Besides functions that provide recommendations and advertising, customers can log in, set their preferred currency, view products, fill a virtual shopping cart, check out orders, and finally observe the shipping. Each task is implemented in a separate function (in the figure, we grouped some functions to increase legibility) and all requests arrive at a single function, the frontend, which takes the customer calls and routes them to the respective backend functions. There are blocking synchronous calls to other functions as well as asynchronous call blocks that idle until all functions returned.

The default load profile simulates four different customer workflows and constant traffic for 15 minutes. The benchmark also includes alternative load profiles for a growth workload which linearly ramps up the load to 20 workflows per second over 15 minutes and a spike workload which suddenly increases the load from 3.5 to 20 workflows per second after five minutes, retains the high load for ten minutes, and finally continues with the lower load (3.5 workflows per second) for five minutes.

The e-commerce benchmark is particularly well suited for comparing different cloud providers but can also be used to explore federated cloud deployments, e.g., for scenarios in which the application is running on multiple cloud platforms.

Figure 3: The e-commerce application implements a webshop in 17 functions. The frontend serves as a single entry point and an external database is used to store state.
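To illustrate the single-entry-point structure, the following sketch routes incoming customer requests from the frontend to backend functions. The operation names, the routing table, and the `lib.call` helper are illustrative assumptions rather than the actual benchmark code.

```javascript
// Hypothetical frontend function of the webshop benchmark: every customer
// request arrives here and is forwarded to the responsible backend function.
const lib = require('./befaas-lib'); // placeholder for the BeFaaS library

const routes = {
  listProducts: (req) => lib.call('productcatalog', { op: 'list' }),
  getProduct:   (req) => lib.call('productcatalog', { op: 'get', id: req.id }),
  addCartItem:  (req) => lib.call('cart', { op: 'add', userId: req.userId, id: req.id }),
  checkout:     (req) => lib.call('checkout', { userId: req.userId }),
};

module.exports = async function frontend(request) {
  const handler = routes[request.operation];
  if (!handler) {
    return { status: 400, error: `unknown operation ${request.operation}` };
  }
  // Blocking, synchronous call into the respective backend function chain.
  return { status: 200, body: await handler(request) };
};
```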
IoT Application (Smart Traffic Light)
Although several IoT applications and use cases already exist in research (e.g., [1, 12, 17, 26, 29]), none of them could directly be used or adapted as a FaaS application. Thus, we designed our benchmark application around typical IoT patterns and implemented a use case based on a smart traffic control scenario, mostly inspired by [1, 12] and TU Vienna's InTraSafEd5G project (https://newsroom.magenta.at/2020/01/16/5g-anwendungen/).

The benchmark application implements an IoT use case with a smart traffic light which adapts its light phase based on traffic sensors, a camera, and weather inputs (see Figure 4). The functions initially filter incoming data streams and perform object recognition on camera footage to create a movement plan, detect ambulance/emergency cars, and maintain a traffic statistic. The regular light phase is then determined based on this movement plan, road conditions, and the current light phase. Emergency services can override this regular phase at any time by raising an emergency event that stops all other traffic.

The load profile for this application emulates sensor data and injects emergency events. The traffic sensor sends ten updates per second to the Traffic Sensor Filter, the Object Recognition processes four images per second, and the weather is updated every ten seconds. Furthermore, the Load Generator also injects an emergency event every two minutes which lasts five seconds each. This default load profile runs for 15 minutes. As this use case will in practice typically have a very predictable and stable load profile, we did not implement alternative load profiles – benchmark users can, however, easily add them if needed.

The IoT benchmark is particularly well suited for comparing different deployments across cloud, edge, and fog.

Figure 4: The IoT application implements a smart traffic light scenario in 9 functions. The Load Generator emulates sensor data and sends them to three different entry points.
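A sketch of how the final stage of this function chain might combine its inputs is shown below. The decision logic, field names, and thresholds are invented for illustration and are not the actual benchmark implementation.

```javascript
// Hypothetical light phase calculation: combines the movement plan, road
// conditions, and the current phase, and lets emergency events override
// the regular schedule.
module.exports = async function calculateLightPhase(event) {
  const { movementPlan, roadCondition, currentPhase, emergency } = event;

  // An active emergency event stops all other traffic immediately.
  if (emergency && emergency.active) {
    return { phase: 'all-red', reason: 'emergency-override' };
  }

  // Regular operation: give the direction with the most waiting traffic a
  // green phase, extended when road conditions are poor (e.g., rain or ice).
  const busiest = Object.entries(movementPlan.waitingVehicles)
    .sort(([, a], [, b]) => b - a)[0][0];
  const baseDuration = 20; // seconds, illustrative default
  const factor = roadCondition === 'poor' ? 1.5 : 1.0;

  return {
    phase: `green-${busiest}`,
    durationSeconds: Math.round(baseDuration * factor),
    previousPhase: currentPhase,
  };
};
```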
Evaluation

We evaluate BeFaaS in two different ways. We start by presenting the results of several experiments in which we use BeFaaS to stress different FaaS platforms (Section 6.1). Afterwards, in Section 6.2, we discuss to which degree BeFaaS fulfills our requirements from Section 3.
To showcase the broad applicability of BeFaaS, we run experiments with two different scenarios: First, in single cloud provider setups in which all functions of the respective benchmark application are deployed on a single provider. Here, we deploy the e-commerce benchmark on three major cloud providers (namely AWS, Azure, and GCP) and use the default load profile to compare them. In the second scenario, we deploy the IoT benchmark in a federated fog setup in which some functions are running in the cloud (GCP) and others on the edge (tinyFaaS).
With BeFaaS, running the exact same benchmark configuration on different platforms is easy, which we use to compare three cloud providers.

Figure 5 shows the basic setup of our cloud experiments: We deploy the Load Generator on a (vastly over-provisioned) virtual machine (2 vCPUs and 4 GB RAM) and let it execute the default load profile against the e-commerce application deployed in either eu-west-1 for AWS, westeurope for Azure, or europe-west1 for GCP. Moreover, the Redis database system used by the webshop also runs on an over-provisioned virtual machine (2 vCPUs and 4 GB RAM; t3a.medium at AWS, Standard B2S in Azure, and e2-medium at GCP) at the respective provider site. This ensures that the database instance and Load Generator will not be a bottleneck during the experiment [10]. During each experiment, the Load Generator executes 18,000 workflows, which each consist of 1 to 9 requests, over a time span of 15 minutes. Since the focus of this paper is on BeFaaS and its features and not on providing an in-depth performance analysis of different cloud providers, we decided not to repeat the experiment several times.
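For reference, 18,000 workflows over a 15-minute (900 s) run correspond to an average arrival rate of 18,000 / 900 = 20 workflows per second.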
Figure 5: As part of the FaaS application, the database instance is deployed in the same region and on the same provider as the rest of the webshop.

Figure 6 shows the execution duration of four selected functions which are called from the frontend function (as boxplots, boxes represent quartiles, whiskers show the minimum and maximum values without outliers beyond 1.5 times the Inter Quartile Range). For the four functions exam-