COCOA: Cold Start Aware Capacity Planning for Function-as-a-Service Platforms
Alim Ul Gias
Department of Computing, Imperial College London
London, UK
[email protected]

Giuliano Casale
Department of Computing, Imperial College London
London, UK
[email protected]
Abstract—Function-as-a-Service (FaaS) is increasingly popular in the software industry due to the implied cost savings in event-driven workloads and its synergy with DevOps. To size an on-premise FaaS platform, it is important to estimate the required CPU and memory capacity to serve the expected loads. Given the service-level agreements, it is however challenging to take the cold start issue into account during the sizing process. We have investigated the similarity of this problem with the hit rate improvement problem in TTL caches and concluded that solutions for TTL caches, although potentially applicable, lead to over-provisioning in FaaS. Thus, we propose a novel approach, COCOA, to solve this issue. COCOA uses a queueing-based approach to assess the effect of cold starts on FaaS response times. It also considers different memory consumption values depending on whether a function is idle or in execution. Using FaasSim, an event-driven FaaS simulator we have developed, we show that COCOA can reduce over-provisioning by over 70% in some workloads, while satisfying the service-level agreements.
Index Terms—Function-as-a-Service, serverless computing, cold start, sizing, layered queueing network
I. INTRODUCTION
Function-as-a-Service (FaaS) platforms, based on the serverless execution model [1], allow developers to deploy their code as individual functions without having to deal with the underlying infrastructure management. This facilitates DevOps practices [2] by providing more flexibility to each development team and increasing the pace of delivery of code updates. The availability of open source platforms, like OpenFaaS and OpenLambda, has made it possible to install on-premise FaaS platforms, which calls for dedicated sizing and resource allocation methods in order to meet service-level agreements (SLAs).

FaaS platforms are designed to implement event-driven applications, which react to a change of state as a result of events generated by the environment and execute the associated business logic. In a FaaS platform, this logic is termed a function, which is usually packaged as a container. To reduce resource wastage, FaaS containers are offloaded from memory if they remain idle for a specific time period. When a new request for an offloaded function arrives, the
(Footnote) A. Gias is a Commonwealth scholar, funded by the UK government. The work of G. Casale is partially supported by RADON, funded by the European Union's Horizon 2020 research and innovation program under grant agreement No. 825040.

request is blocked until the function is loaded again. This issue is known as the cold start issue [1], [3].

During the capacity planning process, the cold start issue can pose a significant trade-off between latency and memory allocation optimization. A cold start occurs when a function is invoked while the corresponding container is not yet loaded in memory, which adds to the response a delay needed to spin up the container and the function runtime dependencies. Despite hurting performance, this mechanism aims at reducing memory consumption by offloading functions that are idle for a sufficiently long time. This poses a trade-off between response time SLAs and the memory available to support concurrent execution of more functions, which needs to be considered upon sizing an on-premise installation.

To address this issue, we can draw parallels between a FaaS platform and a Time to Live (TTL) cache. Similar to a FaaS platform, a TTL caching system also periodically offloads its cached objects, so that the cold start issue resembles the object hit rate improvement problem in a TTL cache, in which one needs to decide on the optimal time to keep objects in cache [4]. Thus, analysis methods from TTL cache research, such as the characteristic time approximation in [5], may in principle be applicable also to FaaS sizing in order to estimate the required memory capacity. However, from our study we have identified two limitations of such an approach. First, contrary to TTL cache misses, the latency incurred by function cold start times can vary widely.
Next, while a large fraction of TTL research considers equal-sized objects, a function consumes a different amount of memory depending on whether it is idle or in execution.

In this paper, we present COCOA, a sizing method that leverages a stochastic modeling approach based on layered queueing networks (LQN) [6] and M/G/1-type queueing systems for capacity prediction. To consider the effect of cold starts, we have incorporated the probability of each function experiencing a cold start into the LQN model. These probabilities are estimated from an M/M/1/setup/delayedoff model, a variant of the M/M/k/setup class of models [7], which we solve using matrix-analytic methods as a special case of M/G/1-type system. Setup models can thus approximate the cold start probability of each function. To predict the required capacity, COCOA follows an iterative process. It repeatedly solves the LQN model to find a set of function idle times and a CPU configuration such that the function response times are just below the SLA. To accelerate the search process, we have designed a parallel algorithm, where each parallel branch utilizes binary search. Once the idle times and CPU configuration are obtained, COCOA estimates the CPU utilization value for each function. These estimates are integrated with a capacity estimation method for TTL caches [8] to predict the required memory capacity.

Overall, we summarize our contributions as follows:

• We investigate in Section II the similarity between a FaaS platform and a TTL cache from the cold start perspective and illustrate that TTL cache analysis, despite being promising, is alone insufficient for FaaS capacity estimation.

• We present in Section III an LQN-based performance modeling technique for FaaS platforms that captures the effect of cold starts on function response times, correcting the limitations of TTL cache analysis when applied to this setting.
• In Section IV we propose COCOA, a sizing method for on-premise FaaS platforms leveraging our LQN model, and demonstrate its effectiveness in reducing resource over-provisioning while ensuring response time SLAs.

Lastly, in Section V we validate our framework against data from simulation. Sections VI and VII respectively position the work against the state of the art and conclude the paper.

II. SIMILARITY BETWEEN FAAS AND TTL CACHES
A. Analogy
TTL caches, used in content delivery networks, allow faster page loading and a reduction of load at the origin server. In such caching systems, each cached object is associated with a TTL, after which the object is evicted [4]. If the cache can serve the request for an object, it is termed a cache hit. The fraction of requests served, for a particular period, is called the hit rate. Longer TTL values can improve the hit rates but are more costly as they require a larger cache size.

Similarly to TTL caches, to reduce the number of cold starts, a possible solution, from the point of view of the end user, is to keep the functions in memory for longer periods. However, this significantly increases the required memory capacity since most of the functions always remain loaded. A way around this problem could be to determine an optimal idle time that ensures a certain degree of availability and reduces the number of cold starts to an acceptable limit. We notice the analogy of this problem with the configuration of TTL caches [4]. In such systems, the goal is to determine a characteristic time [5], which is set as the TTL value, that maximizes the cache hit rates under a given space constraint. The hit rate (h_i) for an object is defined by (1), considering that its TTL (T_i) value is reset upon a new request [9].

h_i = 1 − e^(−λ_i T_i)    (1)

This TTL value is similar to the idle time of the functions. It represents a period for which an object is kept in the cache,
even though no new request is received, which ensures a certain degree of availability. Similarly, for FaaS, an idle time represents a period for which the functions are kept loaded in memory even though they are idle. Therefore, we can use this concept of TTL to estimate the idle times of the functions. To realize this, we can simply solve (1) to get a TTL value for a particular hit rate. This value can be set as the idle time of the function. It will ensure the needed degree of availability and reduce the number of cold starts. Consequently, this will help to satisfy the response time constraints.

Fig. 1. Evaluation of an availability-aware approach in estimating the capacity of a FaaS platform and ensuring the SLA for response time: (a) hit rates and response time; (b) maximum memory consumption.
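Solving (1) for T_i given a target hit rate gives the idle time in closed form; a minimal sketch (the numeric values are purely illustrative):

```python
import math

def idle_time_for_hit_rate(arrival_rate, target_hit_rate):
    """Invert Eq. (1), h = 1 - exp(-lambda * T), for the TTL/idle time T."""
    return -math.log(1.0 - target_hit_rate) / arrival_rate

# Illustrative values: a function invoked on average once every 10 s,
# sized for a 95% hit rate (availability) target.
T = idle_time_for_hit_rate(arrival_rate=0.1, target_hit_rate=0.95)  # ~30 s
```

Plugging the resulting T back into (1) reproduces the target hit rate, which is the availability guarantee discussed above.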
B. Example
To illustrate the concept, we have developed a discrete-event simulator for a FaaS platform, referred to in the rest of the paper as FaasSim. Developing FaasSim was necessary since popular performance modeling tools, like JMT [10], cannot model the cases we need to consider for FaaS: the cold starts, and modeling both CPU and memory consumption.

In the simulation, we have considered an open workload model where the requests arrive following a Poisson process. To introduce popularity among the functions, meaning that their invocation probabilities will differ, we have used the Zipf distribution [11]. The function service times are set such that they are at most half of the SLA value of 2 seconds when there is no resource contention. The cold start times are chosen from a recent study on popular FaaS platforms [12] that, apart from the platform, also considered factors like programming languages and deployment sizes, which affect the magnitude of cold starts.

We have run the simulation in three settings with 16, 32 and 48 functions and observed the effect of different hit rates on the function response times. The function idle times are set by solving (1) for the specific hit rate. This hit rate has also been used to estimate the memory capacity. The hit rate is related to the average runtime memory consumption (m) as m = Σ_i h_i θ_i, where θ_i is the memory requirement of each function [8]. We have used this value of m as the memory capacity and compared it with the actual memory consumption value obtained from FaasSim. The findings from the simulation are presented in Fig. 1.

(Footnote) The simulator is available for download at https://github.com/alimulgias/FaasSim.

C. Observations
In Fig. 1a, we plot the response times of each of the 48 functions for different hit rates. We see that even with a 95% hit rate, there are response times that violate the SLA. However, for the 95% hit rate, more than half of the function response times are much lower than the SLA. This indicates that not all the functions require the same hit rate to ensure the SLA. In Fig. 1b, we present a comparison between the estimated memory capacity and the maximum consumption for the 95% hit rate. Although the capacity notably increases with the number of functions, the consumption is less sensitive to it. This is because memory consumption is primarily dependent on the workload parameters. In addition, the consumption is not very high since most of the functions remain idle while resident in memory, which is not considered during the estimation.

From these observations, it is clear that an availability-aware approach is not adequate for an optimal capacity estimation that ensures the SLA for response time. Such an approach only considers the volume of cold starts, whereas we also need to consider their effect on the response time. For a particular workload, firstly, we should know the cold start probabilities of the functions for different idle times. Subsequently, depending on these probabilities and the severity of cold starts, we need to approximate the function response times. Thus, we need a performance model incorporating all these factors. The model will also help in fine-grained capacity estimation by providing resource utilization estimates. In the following section we present our performance model.

III. MODELING COLD STARTS IN FAAS

A. Estimating Cold Start Probabilities
Unlike commercial FaaS platforms, open source platforms like OpenFaaS allow concurrent function execution in the same container [13]. We focus on this function concurrency approach. We propose to consider, from a modeling standpoint, the function as the server of a queueing model, representing the admission control buffer to the function, and the cold start delay as the initial setup time of the server before beginning service. The functions also have an idle time, which is equivalent to the idle server waiting time before it is shut down. Considering these similarities, a cold start may be modeled with an M/M/1/setup/delayedoff model, which is a variant of the M/M/k/setup class of models [7]. The M/M/k/setup models consider a setup cost, usually in the form of a time delay, when turning a server on. The "delayedoff" variant considers an idle time before turning the server off.

Although in [7] the exact solution is provided for an M/M/k/setup/delayedoff model, this applies when the number of servers is k ≥ 2. However, in our case we need to model each function separately. Thus, we have a function representing a single server, which can be either on or off. To get different performance indices for such a model, we may directly solve its underlying Continuous Time Markov Chain (CTMC).

Fig. 2. The M/M/1/setup/delayedoff model for a function.

The CTMC transitions are presented in Fig. 2. Each CTMC state (i, j) has two parameters: i tracks whether the function resides in the memory or not (i ∈ {0, 1}), while j tracks the number of jobs (j ∈ Z*) in the admission queue to enter service in the function. A transition from (i, n) to (i, n + 1) occurs with rate λ, a transition from (i, n + 1) to (i, n) occurs with rate µ, and a transition from (0, j) to (1, j) occurs with rate α. These rates correspond to the mean inter-arrival time (1/λ), mean service time (1/µ) and mean cold start time (1/α) respectively. There is a special transition from (1, 0) to the initial state (0, 0) with rate β, which describes the function idle time (1/β). In a CTMC, all holding times are considered to be exponentially distributed. However, in a real system the idle time of a function is set to a deterministic value. To address this issue, we can use the method of phases and make this transition Erlang-k distributed with rate kβ per phase. To realize this, we introduce k − 1 extra states between (1, 0) and (0, 0). The transitions between all these states occur with rate kβ. This keeps the mean identical to the original exponential but reduces the variance by a factor of k. Thus, for large enough k, the transition will display a behavior close to deterministic.

The effect of cold starts varies depending on the sequence of request arrivals. If a request arrives when the function is being loaded into memory due to a recent request, the response time of that request will be affected to some extent. The severity of the queueing overhead will depend on the residual cold start time of the previous request. However, this does not need to be modeled explicitly thanks to the memoryless property of the exponential distribution. Considering this, as shown in Fig. 2, it is clear that the cold start states are (0, j), ∀j. We can calculate the cold start probability of the functions from the stationary distribution (π) of their CTMC.
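As a concrete illustration, the sketch below builds the generator matrix of a truncated version of this CTMC and solves πQ = 0 numerically for the total probability of the off states. The truncation level J, the plain exponential timeout (no Erlang-k phases), and the dense linear solve are simplifications of ours; the paper instead solves the exact chain with matrix-analytic methods (MAMSolver).

```python
import numpy as np

def cold_start_probability(lam, mu, alpha, beta, J=80):
    """Cold start probability of one function from the stationary
    distribution of a truncated M/M/1/setup/delayedoff CTMC.

    States (i, j): i = 0 function offloaded, i = 1 loaded; j = queued jobs.
    Assumptions of this sketch: the queue is truncated at J jobs, and the
    deterministic idle timeout is approximated by an exponential rate beta.
    """
    idx = {(i, j): 2 * j + i for i in (0, 1) for j in range(J + 1)}
    n = len(idx)
    Q = np.zeros((n, n))
    for j in range(J + 1):
        if j < J:
            Q[idx[(0, j)], idx[(0, j + 1)]] += lam  # arrival while offloaded
            Q[idx[(1, j)], idx[(1, j + 1)]] += lam  # arrival while loaded
        if j >= 1:
            Q[idx[(0, j)], idx[(1, j)]] += alpha    # setup (cold start) completes
            Q[idx[(1, j)], idx[(1, j - 1)]] += mu   # service completion
    Q[idx[(1, 0)], idx[(0, 0)]] += beta             # idle timeout: offload
    Q -= np.diag(Q.sum(axis=1))                     # generator rows sum to 0

    # Stationary distribution: solve pi Q = 0 with sum(pi) = 1.
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Cold start states are (0, j) for all j.
    return float(sum(pi[idx[(0, j)]] for j in range(J + 1)))
```

For a stable function (λ < µ) the truncation error is negligible for moderate J; as expected, a shorter idle lifetime (larger β) yields a higher cold start probability.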
Denote by π_{i,j} the probability of state (i, j); then the cold start probability is defined as Σ_j π_{0,j}. We can get the stationary distribution by solving the CTMC. This can be done efficiently using the matrix-analytic method, since the CTMC sparsity structure makes it equivalent to an M/G/1-type process [14]. The latter is analyzed using the implementation in MAMSolver [15], [16].

B. Predicting Response Time

By solving the CTMC we can get the cold start probabilities for each of the functions. However, our eventual goal is to predict the response time of each function considering the cold starts. For that, besides the cold start probabilities, we need a performance model of the functions, typically running
Solving the CTMC we can get the cold start probabilitiesfor each of the functions. However, our eventual goal is topredict the response time of each functions considering thecold starts. For that, beside the cold start probabilities, weneed a performance model of the functions, typically running lient
Proc. Disp.
Proc.ColdPool
WarmPoolInvoke
Client
Invoke
Client ……. …..…... …… Dispatcher dispatch f ][ dispatch d dispatchN f ][ dispatchN d )( p )( N p cold f coldN f ][ cold d ][ coldN d }{ K / KZ = )( cold p )( coldN p warm f warmN f ][ warm d ][ warmN d )1( )1( warm { } cold { } coldN p − ( ) coldN p − ( ) cold p − ( ) cold p − ( ) Func.
Proc. ocf Pr } { Fig. 3. The LQN model for a FaaS platform in containers, contending for the CPU. Each of these functionscontend for CPU times to execute two types of jobs, the regulartasks when the function is warm and service restarting whenthe function is cold. To ensure scalability of the model analysis[17], we use LQNs as reference modeling formalism.The proposed LQN model is presented in Fig. 3. The modelhas two main building blocks - the tasks and the processors. InLQN models, tasks translate into different system resources,usually the software resources. They carry out different opera-tions which are defined by their entries. The tasks are executedon the processors, which represent the physical entity, like theCPU, that carries out the physical executions. Although eachof the functions is a software resource, we defined them by theentries rather than the tasks. The reason behind this choice istwofold. Firstly, it makes the LQN model more compact andmanageable. Secondly, it reduces the model solving delay asthe number of function increases.Since each function has two types of jobs, we use two tasks, ColdPool and
WarmPool. The entries in the ColdPool define the cold jobs for all the functions. Similarly, the entries in the WarmPool define the warm jobs. Since every cold job is followed by a warm job, there is a call from the cold entries to the warm entries. The proportion of cold and warm jobs is controlled by the Dispatcher task based on the cold start probabilities. This is done by setting the cold start probability of each function as the call mean value from its Dispatcher entry to the ColdPool entry. The percentage of calls to each function, based on its popularity, is modeled using the reference task Client by setting the percentage value as the call mean from the Client entry to the Dispatcher entry.

TABLE I
SIMULATION PARAMETERS FOR MODEL VALIDATION

  N   Number of functions   16, 32, 64, 96, 128
  η   Zipf parameter        0.6, 1.0, 1.4
  λ   Arrival rate          0.2, 0.5, 0.8
  µ   Service rate          [1, 2]
  α   Cold start rate       [0.037, 0.5]
  β   Idle lifetime rate    [0.00083, 0.00556]

The LQN model requires two parameters, namely the service demands of the activities and the multiplicities of the modeling constructs. The service demand for a job is the total service time across all visits when there is no resource contention. Each of the functions has different service demands for its cold and warm jobs. These values should be set in the activities of the corresponding entries. The service demand can be estimated using state-of-the-art techniques based on utilization or response time [18]. The multiplicities translate into different system entities depending on the modeling constructs. The multiplicity of the reference task indicates the number of clients present in the system, considering the system as a closed network [19]. However, we can also consider the system as open, like our FaasSim simulator. To do so, we have adapted the think time Z as K/λ, where K and λ represent the total number of clients and the open arrival rate [20]. The multiplicities of the processors indicate the number of available CPU cores. Since we do not consider the Dispatcher a bottleneck, we assume it executes separately from the functions, on a single CPU core. The multiplicities of the ColdPool
and WarmPool indicate the number of process threads available for the function containers. Container platforms like Docker allow this on a per-container basis, which means that we can put a limit on how many threads a container can create. However, in LQNs, entries do not have a multiplicity property, which we are using to model the functions. Thus, in the model, we consider that the functions share two thread pools for cold and warm jobs. This assumption does not significantly affect the performance estimates if the numbers of threads in both pools are sufficiently large to start processing a job immediately.

C. Model Validation
We have used the LINE performance modeling tool [21] to build our model and validated it with the FaasSim simulator. The simulation parameters for the experiments are presented in Table I. We have considered larger-scale settings compared to Section II, with up to 128 functions. We have also considered different popularity parameters, which are common in cache-based studies [22]. We have considered each combination of number of functions (N), Zipf parameter (η) and arrival rate (λ) from the table. For each of those combinations we have generated 30 models. In each of the models, we have chosen the service (µ), cold start (α) and idle lifetime (β) rates for the functions randomly from the given ranges. The ranges for the service and cold start rates are the same as in Section II. The idle lifetime rates are chosen from [12] such that they can trigger a cold start.

(Footnote) Such limits are put to prevent unnecessary thread creation causing memory leaks. However, the limits are never too small to affect the concurrency.

TABLE II
PERCENT ERROR IN ESTIMATING THE RESPONSE TIME OF EACH FUNCTION, ACROSS ALL THE ZIPF PARAMETERS IN TABLE I

Since cold starts affect the response time, we are concerned with how accurately our model captures that effect and estimates the response time of each function. Thus, we have considered the percent error in estimating the function response times. We present the results in Table II. From the table, we see that the increase in the number of functions has a negligible effect on the error. The maximum error across all the parameters is 2.38%. The maximum average error and the 95th percentile of the error are 1.29% and 2.08% respectively. Such errors are not significant and thus we conclude that the LQN model can accurately estimate the response times for each function considering the cold starts.

We leverage this model in our capacity estimation method, COCOA, which we present in the following section.

IV. COLD START AWARE CAPACITY PLANNING
A. Overview
COCOA provides resource allocation decisions for a FaaS platform in terms of its memory and CPU configurations. It also provides the function idle times that can ensure the SLA, given that the suggested configuration is applied. These idle times allow to control the magnitude of cold starts, which in turn aids in governing the response times. The configurations can be applied either at the hardware level or at the software level. This means that the memory and CPU constraints can be applied on the physical server or on a container platform like Docker. COCOA has multiple components, each focusing on a particular task and expecting different inputs. These components, their expected inputs and their outputs are illustrated in Fig. 4.

Since COCOA is a model-based approach, we need a set of parameters to instantiate its model. Firstly, we need the service demands of each function. These can be estimated for each function individually while they are being developed in a test environment. The workload parameters, like the arrival rate and function popularity, can be estimated from performance requirements or historical data. Once the parameters are estimated, they are passed to the LQN Model Generator component. The Model Generator also requires the architecture of the FaaS platform, particularly containing the information about how the functions communicate.

Based on these inputs, the CTMC and LQN models are generated and forwarded to the next component, the Optimal Strategy Generator. It utilizes the models to provide memory and CPU configurations. It needs the SLAs for function response times and both of the memory requirements, when they are
idle and in execution. The generator then searches for the idle times, under different CPU configurations, that do not violate the SLA with minimal memory consumption. These idle times are used to estimate the maximum memory consumption, based on which the memory capacity is suggested.

Fig. 4. An overview of the COCOA approach (components: Service Demand Estimator, System Monitor, LQN Model Generator and Optimal Strategy Generator; inputs: service demands, SLAs and memory requirements, FaaS architecture and performance requirements; outputs: memory and CPU configuration and maximum idle times).
B. Problem Statement
We consider a system of N functions. These functions are executed on a multi-core CPU with C cores. Each of the functions has two different memory usages: one while in execution (θ_i^on) and the other while being idle (θ_i^off). Considering a request for a function f_i, if the function is not in the memory, a cold start occurs. Due to this cold start, the request experiences an extra delay, incurred to load the function into the memory. When a function is loaded in the memory, it is associated with a timeout value T_i. While the function is still in the memory, for each new request the timeout is reset to its original value. A function is removed from the memory if it reaches the timeout limit.

We focus on two specific costs, the cost of the CPU and the cost of the memory. We define the per-unit CPU and memory costs as τ_c and τ_m respectively. Thus, the cost for the CPU will be B = τ_c C. The memory cost is calculated based on the maximum memory consumption. To estimate this, we incorporate the idea of a different memory usage when the function is idle into the capacity estimator for TTL caches [8]. Based on this, given that the function CPU utilization is ρ_i, the average memory consumption (m) may be estimated as m = Σ_i h_i (ρ_i θ_i^on + (1 − ρ_i) θ_i^off).

The system's memory capacity should be adequate when there is a spike in memory consumption. This occurs when there is a surge in requests within a short period. This increases the memory consumption because more functions start executing, for which the memory requirement is much higher than when idle. It is sufficient to consider this increase in memory consumption by the functions in execution. Denoting the memory consumption by the functions in execution as U, its expectation is defined as E[U] = Σ_i h_i ρ_i θ_i^on. We can approximate the maximum consumption as v = κE[U].
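In code, the average and approximate-maximum memory estimators read as follows. The choice κ = 1/ε is our reading of the Markov-inequality construction for κ, and all numbers are illustrative:

```python
def memory_estimates(h, rho, theta_on, theta_off, eps=0.05):
    """Average (m) and approximate maximum (m_max) memory consumption.

    h[i]:        hit rate of function i, from Eq. (1)
    rho[i]:      CPU utilization of function i
    theta_on[i]: memory while in execution; theta_off[i]: memory while idle
    eps:         target upper bound on P(U >= v); Markov's inequality gives
                 P(U >= kappa * E[U]) <= 1/kappa, so kappa = 1/eps is one
                 natural choice (an assumption of this sketch).
    """
    kappa = 1.0 / eps
    m_avg = sum(hi * (ri * on + (1.0 - ri) * off)
                for hi, ri, on, off in zip(h, rho, theta_on, theta_off))
    m_max = sum(hi * (kappa * ri * on + (1.0 - ri) * off)
                for hi, ri, on, off in zip(h, rho, theta_on, theta_off))
    return m_avg, m_max

# Illustrative single-function example: always resident (h = 1),
# 50% utilized, 100 MB while executing, 20 MB while idle.
m_avg, m_max = memory_estimates([1.0], [0.5], [100.0], [20.0])
```

The m_max expression inflates only the in-execution term by κ, matching the observation that request surges raise consumption mainly through functions entering execution.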
The value κ is calculated using Markov's inequality [23], such that the upper bound of P(U ≥ v) is a negligible value ε. Based on this, we define the approximation of the maximum memory consumption (m_max) as m_max = Σ_i h_i (κρ_i θ_i^on + (1 − ρ_i) θ_i^off), so that the memory cost may be defined as A = τ_m m_max.

Considering these cost functions A and B, our objective function z is defined in (2). Here, our goal is to find T, a vector including the idle times T_i of all the functions, and C, the number of CPU cores, that minimize a weighted sum of the normalized memory and CPU costs:

z = min_(T,C) ω_A Â + ω_B B̂    (2)

subject to:

C ≤ C_max    (3)

W_i(T, C) ≤ W*, ∀i    (4)

T ∈ R_+^N, C ∈ Z_+

The constraints for the objective function are provided in (3) and (4). The first constraint, in (3), concerns the maximum number of allowed CPU cores. This applies when the CPU constraint is imposed at the software level and the total physical CPU capacity is not accessible. The second constraint, in (4), addresses the SLA for response time. Here, W_i is a function of T and C which returns the response time of a platform function f_i. This response time should be less than the limit W* mentioned in the SLA.

C. Optimal Strategy Generation
Using the objective function in (2), COCOA provides an optimal strategy, including the memory and CPU capacity and the function idle times (T). It starts searching for an optimal strategy with an initial instance of T. This is obtained by a characteristic time approximation technique for CDN caches [5]. It requires solving m = Σ_i h_i, where h_i is defined in (1), for a particular value of m. However, since we have a large pool of functions, as suggested in [24], we have estimated a single value T* for all the functions instead of approximating each T_i ∈ T individually. Thus, here we have used a second definition of h_i, replacing T_i with T* in (1).

After the initialization, COCOA fine-tunes the idle times, such that the function response times are just under the SLA limit, to ensure minimal memory consumption. For this, it solves the LQN model iteratively, adjusting the idle times and observing their effect on the response times. The idle times are adjusted using the concept of binary search. It starts with an initial search interval, (0, T*], for each T_i and reduces the length of the interval by half on each iteration. The endpoints of the intervals are adjusted depending on whether the response time constraint is satisfied or not. The value of T_i is updated with the midpoint of the search interval. For this process to work, the initial value T* should be sufficiently large so that there are no cold starts and thus the response time is not affected. For this purpose, we have solved m = Σ_i h_i by setting m to a value close to N.
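One branch of this search can be sketched as follows; here response_time stands in for solving the LQN model (it maps a dict of idle times to a dict of per-function response times), and all names are illustrative rather than COCOA's actual implementation:

```python
def tune_idle_times(funcs, T_star, response_time, sla, iters=30):
    """Binary-search each function's idle time within (0, T_star].

    Response times are assumed monotonically decreasing in the idle time,
    since longer idle times mean fewer cold starts. This serial sketch
    mirrors a single branch of the parallel search, for a fixed CPU
    configuration.
    """
    lo = {f: 0.0 for f in funcs}
    hi = {f: float(T_star) for f in funcs}  # T_star assumed feasible (no cold starts)
    for _ in range(iters):
        mid = {f: (lo[f] + hi[f]) / 2.0 for f in funcs}
        W = response_time(mid)              # one LQN solve per iteration
        for f in funcs:
            if W[f] <= sla:
                hi[f] = mid[f]   # SLA met: try offloading sooner
            else:
                lo[f] = mid[f]   # SLA violated: keep the function warm longer
    return hi                    # smallest idle times found that meet the SLA
```

With a toy oracle such as W(T) = c/T, the search converges to the minimal idle time that still satisfies the SLA, which is exactly the point of minimal memory consumption.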
Fig. 5. Comparing the memory capacity predicted by COCOA and the availability-aware approaches with λ = 0. and η = 1.

COCOA runs this fine-tuning process for different CPU configurations (numbers of CPU cores). Although the number of CPU cores can be any integer, practically we only need to consider some common options, like multiples of 2, with 32 as the limit. This accelerates the analysis process. In addition, for each configuration, this process is run in parallel, making it even faster. For each run, if a T is found that does not violate the SLA, it is considered a candidate solution. After completing the process, the optimal solution is selected by comparing the memory and CPU costs. Its corresponding CPU configuration and idle times are suggested as well. However, the memory capacity is suggested by considering the value min(m_max, Σ_i θ_i^on) as an upper bound and calculating the aggregated size of the required number of RAM modules.

V. EVALUATION
A. Experimental Setup
We have evaluated COCOA using the FaasSim simulator, considering the parameters from Table I with 64, 96 and 128 functions. However, the idle times (β) from Table I are not used. Instead, these have been estimated with COCOA such that the response time constraints are satisfied with minimal memory and CPU requirements. We have set the function memory requirements following the limits in AWS Lambda [25]. The percentage (0-1) of idle function memory consumption is considered to be log-normally distributed with a desired mean of 0.2. From the experiments, we aim to answer the following research questions:

• RQ1:
Can COCOA reduce memory over-provisioning compared to availability-aware approaches?

• RQ2: Can COCOA predict the required memory capacity that meets the maximum demand?

• RQ3: Can COCOA predict the memory and CPU capacity to satisfy the SLA for response time?
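Pinning a lognormal to a desired mean fixes only one of its two parameters; the experimental setup above can be reproduced with a sketch like the following, where σ = 0.5 is our own illustrative choice (the paper specifies only the mean of 0.2):

```python
import math
import random

def idle_memory_fraction(mean=0.2, sigma=0.5, rng=random):
    """Sample a function's idle-to-execution memory fraction in (0, 1].

    A lognormal has E[X] = exp(mu + sigma**2 / 2), so the desired mean
    fixes mu = ln(mean) - sigma**2 / 2. The clamp to 1.0 trims a
    negligible upper tail so the value stays a valid fraction.
    """
    mu = math.log(mean) - sigma ** 2 / 2.0
    return min(1.0, rng.lognormvariate(mu, sigma))
```

Averaging many samples recovers the configured mean of 0.2 to within sampling error.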
B. Results
To answer
RQ1 , we have compared COCOA and theavailability-aware approach with two hit rates, 0.8 and 0.95.We present the result for a single experiment in Fig. 5. We cansee that the required memory capacity estimated by COCOA is We have considered that each of the memory module is 8GB but this isconfigurable depending on the availability of RAM modules.ABLE IIIC
COMPARING THE PREDICTED MEMORY CAPACITY OF DIFFERENT APPROACHES, AVERAGED ACROSS THE ZIPF PARAMETERS FROM TABLE I

          λ = 0.2                     λ = 0.5                     λ = 0.8
N     COCOA  80% h.r.  95% h.r.  COCOA  80% h.r.  95% h.r.  COCOA  80% h.r.  95% h.r.
64     32     88       104       45.3    80        93.3     42.7    85.3     101.3
96     40    128       152       58.7   125.3     146.7     48     120       141.3
128    48    157.3     184       58.7   157.3     184       64     168       197.3
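COCOA inflates the provisioned memory above the expected demand by a coefficient κ derived from an upper bound on the overflow probability P(U ≥ v). As a minimal sketch, assuming a Cantelli-style one-sided tail bound (the paper's exact bound may differ), κ can be obtained as follows; note how a larger admissible bound ε yields a smaller κ:

```python
import math


def kappa_from_tail_bound(epsilon):
    """Safety coefficient from Cantelli's one-sided inequality:
    P(U >= mu + kappa*sigma) <= 1 / (1 + kappa^2) = epsilon,
    which gives kappa = sqrt((1 - epsilon) / epsilon)."""
    return math.sqrt((1.0 - epsilon) / epsilon)


def memory_capacity(mean_u, std_u, epsilon):
    """Provision the mean plus kappa standard deviations of demand U."""
    return mean_u + kappa_from_tail_bound(epsilon) * std_u
```

Under this form, raising ε (the permitted overflow probability) shrinks κ and hence the provisioned headroom, consistent with the discussion of high arrival rates below Table III.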
Fig. 6. Comparing the runtime memory consumption obtained from the analytical approximation and simulation for λ = 0. and η = 1.: (a) average memory consumption and (b) capacity vs. maximum consumption, against the number of functions (64, 96, 128).

Fig. 7. Illustrating that COCOA can meet the maximum memory demand for different fractions (0-1) of memory consumed while the functions are idle: (a) 64 functions and (b) 128 functions.

much lower than the other two approaches. The results for all the parameters are presented in Table III. In all cases, the estimates from COCOA are much lower than those of the other two approaches; considering the 95% hit rate, the capacity estimated by COCOA is 51-74% less. The reason is easy to understand: COCOA can take "well-informed" decisions by leveraging its performance model, which is not possible for the availability-aware approaches.

From Table III, we see that, in two cases, COCOA predicts a higher capacity for λ = 0. than for λ = 0., which is counter-intuitive. This is because we have used a different upper bound of P(U ≥ v) to estimate κ for λ = 0. (for λ = 0. it is 0., but for λ = 0. and 0. it is 0.). The reason is that, for high arrival rates, the spike of memory consumption above its expectation is smaller than for low arrival rates; here, κ is the coefficient representing this extent. A larger upper bound of P(U ≥ v) results in a smaller value of κ, so for higher arrival rates, to reduce over-provisioning, κ should be approximated with a larger upper bound of P(U ≥ v).

To answer RQ2, we have compared the memory consumption values from the simulation with the values from COCOA. From Fig. 6a, we can see that the average memory consumption values from the simulation agree with the analytical approximation from COCOA. From Fig. 6b, we see that the memory capacity also meets the maximum demand.
Fig. 8. Function response times and hit rates when the SLA is 2 seconds: (a) response time (sec.) and (b) hit rate, for the 80% h.r., 95% h.r., and COCOA techniques.
Fig. 9. Function response times and hit rates when the SLA is 1.5 seconds: (a) response time (sec.) and (b) hit rate, for the 80% h.r., 95% h.r., and COCOA techniques.

From all the experiments, we observe only 5 cases where there is a memory deficit greater than 0.5 GB, with a maximum value of 3.2 GB. We have also performed a sensitivity analysis, changing the desired mean of the fraction of memory consumed by idle functions. We used two settings, with 64 and 128 functions, with λ = 0. and η = 1. As seen from Fig. 7, in both cases COCOA can satisfy the maximum demand.

To answer RQ3, we have investigated the response time of each of the functions. We have seen that, across all the parameters, COCOA can ensure the SLA for response time. We present the response time of each function, for a single experiment, in Fig. 8. The SLA in this case is 2 seconds, and COCOA satisfies it for all the functions with hardly any variance. On the other hand, even the 95% hit rate approach has violations; the violations are even higher, at 45% (57 out of 128 functions), when the SLA is 1.5 seconds.

To illustrate how COCOA ensures the SLA without over-provisioning, we have investigated the hit rate of each function obtained from the simulation. In Fig. 8b and 9b, we have plotted the hit rates corresponding to the experiments in Fig. 8a and 9a. As expected, the hit rates are fixed for the 80% and 95% hit rate approaches. However, COCOA adjusts the idle times of the functions such that the hit rates are just sufficient to satisfy the SLA. This reduces the memory consumption when the functions are idle, and thus COCOA suggests a much lower memory capacity. For a 2-second SLA, the lowest hit rate of any function is 46%. However, COCOA can also increase the hit rates if required, as seen in Fig. 9b: here the hit rates of some functions are even higher than 95%, to satisfy the stricter SLA of 1.5 seconds.

VI. RELATED WORK
FaaS platforms, leveraging serverless computing, have gained the attention of many researchers. Here we particularly focus on works involving cost, resource management, or cold starts, as such works are most relevant in our context. From the perspective of cost, researchers have focused on various issues. In [26], the authors present a technique that predicts the cost of function workflows. The authors in [27] identify different operating regimes that optimize the cost for both customer and provider. In [28], the authors propose an algorithm that optimizes the cost of function workflows through function fusion and placement. From the perspective of resource management, researchers have mainly focused on runtime CPU allocation considering QoS [29], [30].

The authors in [1] and [3] are among the first to investigate function latency considering cold and warm states. In recent works, researchers have also proposed different solutions to this problem. In [31], the authors address cold starts from the end-user perspective and mitigate them by periodically sending low-cost service requests. The authors in [32] pre-initialize resources, like networking elements, and associate them with containers as required. In [33], the authors provision containers in advance by leveraging function composition knowledge. The authors in [34] propose a window-based approach to load or unload functions by analyzing their invocation patterns. However, none of these works model function memory consumption and response time, making them inapplicable to capacity planning.

VII. CONCLUSION AND FUTURE WORK
We have presented COCOA, a cold-start aware sizing method for on-premise FaaS platforms. COCOA leverages an LQN model and M/M/k setup models to obtain different performance estimates and, consequently, predict the required system capacity. We have illustrated the improvements yielded by COCOA with multiple experiments, showing that COCOA can help in provisioning FaaS systems that satisfy SLAs. A future research direction is incorporating workload burstiness, which triggers more resource-intensive actions, and dealing with autoscaling scenarios where multiple function replicas need to be instantiated.
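In the single-server case, the setup models referenced above admit a well-known decomposition: the response time of an M/M/1 queue whose server turns off when idle and needs an exponential(α) setup (a cold start) is the M/M/1 response time plus the mean setup, E[T] = 1/(μ − λ) + 1/α [7], [19]. A minimal sketch of this closed form (illustrative only; COCOA combines such models with its LQN, not this formula alone):

```python
def mm1_setup_response_time(lam, mu, alpha):
    """Mean response time of an M/M/1 queue with exponential setup:
    the server shuts off when idle and needs an exp(alpha) setup
    (cold start) to restart, giving E[T] = 1/(mu - lam) + 1/alpha.
    """
    if lam >= mu:
        raise ValueError("unstable system: need lam < mu")
    return 1.0 / (mu - lam) + 1.0 / alpha


# service rate 1 req/s, arrival rate 0.5 req/s, 2 s average container start
t = mm1_setup_response_time(0.5, 1.0, 0.5)  # = 2.0 + 2.0 = 4.0 seconds
```

The formula makes the cold-start penalty explicit: the 1/α term can dominate whenever container start-up is slow relative to service.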
REFERENCES
[1] L. Wang et al., "Peeking Behind the Curtains of Serverless Platforms," in Proc. of USENIX ATC, 2018, pp. 133–146.
[2] R. Jabbari et al., "Towards a benefits dependency network for DevOps based on a systematic literature review," Journal of Software: Evolution and Process, vol. 30, no. 11, p. e1957, 2018.
[3] W. Lloyd et al., "Serverless Computing: An Investigation of Factors Influencing Microservice Performance," in Proc. of IC2E. IEEE, 2018, pp. 159–169.
[4] S. Basu et al., "Adaptive TTL-Based Caching for Content Delivery," Trans. on Networking, vol. 26, no. 3, pp. 1063–1077, 2018.
[5] H. Che, Y. Tung, and Z. Wang, "Hierarchical Web Caching Systems: Modeling, Design and Experimental Results," IEEE JSAC, vol. 20, no. 7, pp. 1305–1314, 2002.
[6] G. Franks et al., "Enhanced Modeling and Solution of Layered Queueing Networks," Trans. on Soft. Eng., vol. 35, no. 2, pp. 148–161, 2008.
[7] A. Gandhi et al., "Exact Analysis of the M/M/k/setup Class of Markov Chains via Recursive Renewal Reward," in Proc. of SIGMETRICS. ACM, 2013, pp. 153–166.
[8] C. Fricker, P. Robert, and J. Roberts, "A Versatile and Accurate Approximation for LRU Cache Performance," in Proc. of ITC. IEEE, 2012, pp. 1–8.
[9] M. Dehghan et al., "A Utility Optimization Approach to Network Cache Design," Trans. on Networking, vol. 27, no. 3, pp. 1013–1027, 2019.
[10] M. Bertoli, G. Casale, and G. Serazzi, "JMT: Performance Engineering Tools for System Modeling," ACM SIGMETRICS PER, vol. 36, no. 4, pp. 10–15, 2009.
[11] S. Glassman, "A caching relay for the world wide web," Computer Networks and ISDN Systems, vol. 27, no. 2, pp. 165–173, 1994.
[12] M. Shilkov, "Comparison of Cold Starts in Serverless Functions across AWS, Azure, and GCP," https://mikhail.io/serverless/coldstarts/big3, 2019, Accessed: 2020-06-30.
[13] I. Akkus et al., "SAND: Towards High-Performance Serverless Computing," in Proc. of USENIX ATC, 2018, pp. 923–935.
[14] G. Latouche and V. Ramaswami, Introduction to Matrix Analytic Methods in Stochastic Modeling. SIAM, 1999.
[15] A. Riska and E. Smirni, "MAMSolver: A Matrix Analytic Methods Tool," in Proc. of Modelling Techniques and Tools for Computer Performance Evaluation. Springer, 2002, pp. 205–211.
[16] A. Riska and E. Smirni, "ETAQA Solutions for Infinite Markov Processes with Repetitive Structure," INFORMS Journal on Computing, vol. 19, no. 2, pp. 215–228, 2007.
[17] M. Tribastone, P. Mayer, and M. Wirsing, "Performance prediction of service-oriented systems with layered queueing networks," in Proc. of ISoLA. Springer, 2010, pp. 51–65.
[18] A. U. Gias, G. Casale, and M. Woodside, "ATOM: Model-Driven Autoscaling for Microservices," in Proc. of ICDCS. IEEE, 2019, pp. 1994–2004.
[19] M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge Univ. Press, 2013.
[20] C. Shousha et al., "Applying performance modelling to a telecommunication system," in Proc. of WOSP. ACM, 1998, pp. 1–6.
[21] G. Casale, "Automated Multi-paradigm Analysis of Extended and Layered Queueing Models with LINE," in Comp. Proc. of ICPE. ACM/SPEC, 2019, pp. 37–38.
[22] G. Casale, "Analyzing replacement policies in list-based caches with non-uniform access costs," in Proc. of INFOCOM. IEEE, 2018, pp. 432–440.
[23] R. Nelson, Probability, Stochastic Processes, and Queueing Theory: The Mathematics of Computer Performance Modeling. Springer, 2013.
[24] V. Martina, M. Garetto, and E. Leonardi, "A unified approach to the performance analysis of caching systems," in Proc. of INFOCOM. IEEE, 2014, pp. 2040–2048.
[25] "AWS Lambda limits," https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html, Accessed: 2020-06-30.
[26] S. Eismann et al., "Predicting the Costs of Serverless Workflows," in Proc. of ICPE. ACM/SPEC, 2020, pp. 265–276.
[27] K. Mahajan et al., "Optimal Pricing for Serverless Computing," in Proc. of GLOBECOM. IEEE, 2019.
[28] T. Elgamal et al., "Costless: Optimizing Cost of Serverless Computing through Function Fusion and Placement," in Proc. of SEC. IEEE/ACM, 2018, pp. 300–312.
[29] M. R. HoseinyFarahabady et al., "A QoS-Aware Resource Allocation Controller for Function as a Service (FaaS) Platform," in Proc. of ICSoC. Springer, 2017, pp. 241–255.
[30] Y. K. Kim et al., "Dynamic Control of CPU Usage in a Lambda Platform," in Proc. of CLUSTER. IEEE, 2018, pp. 234–244.
[31] W. Lloyd et al., "Improving Application Migration to Serverless Computing Platforms: Latency Mitigation with Keep-Alive Workloads," in Com. Proc. of UCC. IEEE/ACM, 2018, pp. 195–200.
[32] A. Mohan et al., "Agile Cold Starts for Scalable Serverless," in Proc. of HotCloud. USENIX, 2019, pp. 1–6.
[33] D. Bermbach, A.-S. Karakaya, and S. Buchholz, "Using Application Knowledge to Reduce Cold Starts in FaaS Services," in Proc. of SAC. ACM, 2020, pp. 134–143.
[34] M. Shahrad et al., "Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider," in