ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources
Mohamed Handaoui∗‡, Jean-Emile Dartois∗†, Jalil Boukhobza∗‡, Olivier Barais∗†, Laurent d'Orazio∗†
∗ b<>com Institute of Research and Technology, † Univ. Rennes, Inria, CNRS, IRISA, ‡ Univ Brest, Lab-STICC, CNRS, UMR 6285, F-29200 Brest, France
Email: {mohamed.handaoui, jean-emile.dartois}@b-com.com, [email protected], {olivier.barais, laurent.dorazio}@irisa.fr

Abstract—Cloud data center capacities are over-provisioned to handle demand peaks and hardware failures, which leads to low resource utilization. One way to improve resource utilization and thus reduce the total cost of ownership is to offer unused resources (referred to as ephemeral resources) at a lower price. However, reselling resources needs to meet the expectations of its customers in terms of Quality of Service. The goal is thus to maximize the amount of reclaimed resources while avoiding SLA penalties. To achieve that, cloud providers have to estimate their future utilization to provide availability guarantees. The prediction should consider a safety margin for resources to react to unpredictable workloads. The challenge is to find the safety margin that provides the best trade-off between the amount of resources to reclaim and the risk of SLA violations. Most state-of-the-art solutions consider a fixed safety margin for all types of metrics (e.g., CPU, RAM). However, a unique fixed margin does not account for the variations of workloads over time, which may lead to SLA violations and/or poor utilization. To tackle these challenges, we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the utilization of ephemeral resources in the cloud. ReLeaSER dynamically tunes the safety margin at the host level for each resource metric. The strategy learns from past prediction errors (that caused SLA violations). Our solution reduces the SLA violation penalties significantly, on average by 2.7× and up to 3.4×. It also improves the CPs' potential savings considerably, by 27.6% on average and up to 43.6%.

Index Terms—Cloud, Ephemeral Resources, Resource Optimization, SLA, Safety Margin, Reinforcement Learning.
I. INTRODUCTION
Cloud Providers (CPs) aim to offer resources such as virtual machines or containers with the best Quality of Service (QoS) possible. To do so, data centers are dimensioned according to peak resource usage, with the downside of having a low average resource utilization. The low resource utilization increases the Total Cost of Ownership (TCO), which has made reclaiming unused resources a pressing research topic [1]–[3]. Resource reclamation is generally made possible thanks to prediction techniques [1], [2], [4]. They are usually used to forecast future resource utilization according to customers' behavior in order to infer the unused part (i.e., ephemeral resources) and sell it at a lower price.

Customers' workloads (i.e., applications) running on cloud resources are known to experience sudden variations [2]. These occur due to several factors, such as the user request rate and the workload types, and cause resource utilization to increase or decrease in a manner that current predictions cannot always account for. This means that some workload variations are unpredictable, or that the predictor has failed to discover the hidden patterns [5], [6]. These sudden variations may cause substantial overestimation or underestimation of resource usage. Overestimation may reduce resource utilization, but underestimation may imply overselling resources and thus cause SLA violations and a potential cost deficit, which is critical.

In case future resource utilization is unpredictable, a preventive mechanism should be used, such as a safety margin [7], [8]. A safety margin is a proportion of free resources that are left unused to absorb sudden variations of customers' workloads or prediction errors in order to guarantee the SLA. The safety margin may be applied at different granularities: a datacenter, a host, or a resource. Choosing the right safety margin value and its granularity is crucial for reducing SLA violations and increasing the CP's savings.

The safety margin may be a static value, that is, a fixed proportion of resources applied all the time, for all hosts and resource metrics. This strategy was used in Cuckoo [8] and Salamander [9], where fixed proportions were empirically tested to select the best one. Although this strategy does decrease potential SLA violations, a substantial amount of resources remained unused due to resource usage overestimations [9]. Moreover, the prediction accuracy of the CPU proved to be lower than that of the RAM [2], which means that the safety margin should be customized for each resource metric. In Scavenger [3], the authors propose a solution that uses both the mean and the standard deviation of past usage for each resource metric with a fixed sliding-window size. Even if this method gives a specific margin for each resource, it requires an additional mechanism to account for a sudden increase in resource utilization.

In this paper, we argue that a dynamic safety margin needs to be employed instead of a static one in order to reduce SLA violations and potentially increase cloud providers' savings. A dynamic solution must consider the following three intrinsic properties of the Cloud environment considered: (1) volatility of the resources caused by unpredictable workload changes, (2) heterogeneity of the hosts in terms of available resources, and (3) complexity of the Cloud dynamics [5], [6], which makes it hard to draw an exact model of the variables in play. Our solution is based on Reinforcement Learning (RL) in order to adjust the size of the safety margin according to the observed prediction errors and violations of customers' SLA. The choice of an RL technique answers the aforementioned properties as follows:

1) Volatility: the reclaimed resources are constantly changing and uncertain. RL is known to be able to reason under such uncertainty [10] and can adapt and self-configure as resources fluctuate.
2) Heterogeneity: taking the heterogeneity of Cloud hosts into account is mandatory since it impacts the performance of workloads. Indeed, RL can be used to make decisions for each host separately when properly trained on sufficient data.
3) Complexity: the Cloud environment cannot be represented with an exact model due to its dynamic and stochastic nature. Thus, in many cases, we tend to assume that some variables are known, which may impact performance. However, RL does not require an exact model of the environment in order to learn [10].

Our strategy, named ReLeaSER, consists of a predictive and reactive approach that dynamically adjusts the safety margin at the host level for each resource metric, such as CPU and RAM. In this solution, we suppose that future resource predictions for reclaiming the unused part already exist. The RL solution observes the prediction errors that occurred when using the reclaimed resources and generates the appropriate safety margin. Using the safety margin, we compute both the penalties and the savings for selling the allocated resources. ReLeaSER was compared to four different strategies for adjusting the safety margin. The comparison was done using real production traces from three datacenters over a 6-month period. The results show that our solution decreases SLA violation penalties by 2.7× on average compared to state-of-the-art strategies and increases the CPs' savings by 27.6% on average and up to 43.6%.

The remainder of this paper is organized as follows. Section II provides the motivation for using a dynamic safety margin. Section III details our contribution. Then, Section IV describes the experimental evaluation and the results obtained. In Section V, we discuss the related work. Finally, Section VI concludes the paper.

II. MOTIVATION
To motivate our study of the use of a dynamic safety margin, we analyzed two in-production traces for CPU and RAM over a 6-month period. Our study relies on previous work [2] that predicts future resource usage. These predictions are used to reclaim unused resources in order to allocate them to customers. The prediction shows good general accuracy and was used in other studies for scheduling big data applications [8], [9]. We followed two steps in this section:

1) Using the predictions alongside the in-production traces, we analyzed the prediction errors at different granularities: datacenter, host, and resource metric. The intuition behind this analysis is to assess the levels at which the safety margin should be tuned.

2) We evaluated the CPs' savings when selling the reclaimed unused resources. We also evaluated the impact of prediction errors on SLA violations. Here, we allocate all the available resources according to the predictions of future utilization. The resources are allocated using a single container configuration (detailed in Section IV). The prediction error is computed as error = prediction − usage, and the total savings are computed with: overall savings = savings − SLA penalties (both quantities are sketched below).

To summarize, we focus on three points: 1) the distribution of prediction errors across hosts and resource metrics (i.e., CPU, RAM), 2) the CP's savings from the reclaimed resources, and 3) the cost of violating SLA guarantees. The setup, datasets used, and cost models are detailed in Section IV.
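To make these two quantities concrete, the following minimal sketch computes them for one host; it is illustrative only, with synthetic data and placeholder dollar amounts rather than the paper's traces:

    import numpy as np

    # Hypothetical 3-minute samples over one day (480 steps) for one host (%).
    rng = np.random.default_rng(0)
    prediction = rng.uniform(40, 60, 480)          # predicted utilization
    usage = prediction + rng.normal(0, 5, 480)     # observed utilization

    # Prediction error as defined above: error = prediction - usage.
    # Negative values are underestimations, the ones that can violate SLA.
    error = prediction - usage
    underestimated_steps = int((error < 0).sum())

    # overall savings = savings - SLA penalties (dollar values come from the
    # pricing and penalty models of Section IV; placeholders here).
    savings, sla_penalties = 100.0, 35.0
    overall_savings = savings - sla_penalties
    print(underestimated_steps, overall_savings)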
A. Prediction errors
In this section, we analyze the prediction errors across hosts for both the CPU and RAM. Fig. 1 represents the Cumulative Distribution Function (CDF) of the errors for the University and Private Company 1 (PC-1) datasets over a 6-month period. The CDF allows us to estimate the likelihood (i.e., percentage) of occurrence of a given prediction error. We only show the underestimation part of the prediction errors, as it is the factor causing SLA violations.
Fig. 1: CDF of resource prediction errors (likelihood of occurrence vs. error in %): (a) University: CPU errors; (b) University: RAM errors; (c) PC-1: CPU errors; (d) PC-1: RAM errors.

When observing the CDF of the RAM in Fig. 1d and Fig. 1b, we notice that the likelihood of occurrence of prediction errors is up to 0.3. Some hosts show much larger underestimations (e.g., Host-1 in University, Host-2 in PC-1), but with a lower likelihood of occurrence of around 0.01. These observations show that the error distributions differ across hosts and between the CPU and the RAM, which suggests tuning the safety margin per host and per resource metric.
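The curves in Fig. 1 can be reproduced from the raw error series. A minimal sketch, assuming a per-host array of errors in percent (the data here is synthetic):

    import numpy as np

    def underestimation_cdf(errors):
        # Empirical CDF restricted to underestimations (error < 0), so the
        # curve tops out at the fraction of underestimated samples, as in Fig. 1.
        under = np.sort(errors[errors < 0])
        likelihood = np.arange(1, len(under) + 1) / len(errors)
        return under, likelihood

    errors = np.random.normal(2, 8, 10_000)   # hypothetical error samples (%)
    x, y = underestimation_cdf(errors)
    print(f"fraction of underestimated samples: {y[-1]:.3f}")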
B. Cost related to the reclaimed resources
In the following, we study the economic cost of using the reclaimed resources, for both the savings and the SLA violation penalties. The cost model of the SLA is detailed in Section IV. This model is commonly used by cloud providers such as Google [12] and Amazon [13]. Fig. 2 represents the potential savings and the SLA violation penalties (in dollars) for each host over a 6-month period.

Fig. 2: Potential savings and the SLA violation penalties per host (in dollars): (a) University; (b) Private Company 1.

In Fig. 2b, we observe that the penalties in each host are different, mainly due to the heterogeneity of their resources. This confirms the previous statement about setting an appropriate safety margin per host. We also observe that the savings follow a similar trend to the penalties. The penalty cost is high when compared to the potential savings of the resources. The total savings in the presented case study are not huge since the datacenters have a small number of hosts. We can reasonably assume that the savings increase linearly with the number of hosts and become considerable for large datacenters.
Fig. 3: Overview of the architecture that deploys ReLeaSER, the Safety Margin Selector module.

We conclude that resource utilization is not optimized due to the prediction errors that cause SLA violation penalties. Reducing the SLA violations can increase the potential savings. It also increases the reliability of applications. Indeed, SLA violations can have a major impact on the performance and reliability of the running applications. Big data applications, in particular, have to be restarted in case of SLA violation [7], [8], [14]. This contributes heavily to wasting resources and may increase the penalties. Thus, providing a dynamic safety margin that is specifically tuned for each host and resource metric may solve the problem.

III. RELEASER: A REINFORCEMENT LEARNING STRATEGY FOR OPTIMIZING EPHEMERAL CLOUD RESOURCES
Our goal is to build a solution that optimizes the utilization of unused Cloud ephemeral resources while reducing the risk of violating SLAs. This is done by dynamically adjusting the safety margin applied to the resources. The safety margin is a proportion of free resources that are left unused to absorb sudden variations of customers' workloads or prediction errors. This problem is considered a control problem, since the safety margin is adjusted according to previous errors and potentially other external factors. Note that we aim to dynamically adjust the safety margin to a near-perfect value, which should reduce the SLA violations and increase the CPs' savings. In the following, we present the considered architecture and its modules. Then, we formulate the problem of adjusting the safety margin and describe the solving algorithm.
A. Architecture overview
Fig. 3 depicts an overview of the architecture that deploys our Safety Margin Selector module, named ReLeaSER, for adjusting the safety margin. There are three main actors:

• Farmers: datacenter owners; they seek to reduce their TCO by offering unused resources to customers.
• Customers: we focus here on customers that request unused volatile resources on the Cloud at a lower cost (we do not consider reserved resources).
• Operator: acts as the interface between farmers and customers; it aims at minimizing the farmers' TCO by offering unused resources to customers with SLA requirements.

The solution is built upon three main modules (see Fig. 3):

• Forecasting Builder: this module was introduced in [2] and does not constitute a contribution of this paper. It provides predictions of resource utilization for each host resource metric. This module is not detailed in this paper.
• QoS Controller: this module was introduced in previous work [8], [9]. We use this module to monitor the utilization of ephemeral resources. It computes the prediction errors, considering the safety margins, to detect any SLA violations.
• ReLeaSER, the Safety Margin Selector: this module represents the core contribution of this paper. It houses the Reinforcement Learning algorithm responsible for adjusting the safety margins. It does so by continuously observing, through the QoS Controller, the resource prediction errors of the Forecasting Builder (that caused SLA violations) and acting accordingly.

In Fig. 3, the process begins with the customers submitting (1) their requests for resources (i.e., containers). The operator receives the requests in addition to two other inputs: the predictions of resource utilization (2b) from the Forecasting Builder and the safety margins (2a) from the Safety Margin Selector. The operator decides which resources can be allocated and sends (3) a response to the customers. In this paper, we do not use a specific container placement algorithm. Since we are evaluating the impact of the safety margin, we suppose that we fully allocate the predicted available resources. That being said, a scheduling component can easily be added when needed, as in [8], [9]. After that, if the customers' requests can be satisfied, the operator proceeds to allocate (4) the required resources. Meanwhile, the QoS Controller monitors the resource utilization of the host. It checks for underestimated predictions compared to the resource utilization to detect potential SLA violations. To do so, it receives the resource predictions and the safety margins (2c, 2d) at the same time as the operator. If any SLA violation is detected, the prediction errors are sent to the Safety Margin Selector (5a) to adjust the future values of the safety margin. The errors are also sent to the operator (5b) so that it can act on the SLA violations. The sketch below illustrates how the operator combines these two inputs.
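To make steps (2a) and (2b) concrete, here is one plausible reading of how the operator combines them, consistent with the error computation in Algorithm 1 (prediction + sm − usage); the function and its inputs are illustrative, not the paper's implementation:

    def sellable_capacity(predicted_usage: float, safety_margin: float,
                          total: float = 100.0) -> float:
        """Host capacity (in % of total) that can be offered to customers.

        predicted_usage: predicted utilization from the Forecasting Builder (2b).
        safety_margin: proportion of the host in [0, 1) from the Safety Margin
                       Selector (2a), kept free to absorb prediction errors.
        """
        return max(0.0, total - (predicted_usage + safety_margin * total))

    # Example: 60% of the host predicted used, 5% safety margin -> 35% sellable.
    print(sellable_capacity(60.0, 0.05))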
B. ReLeaSER: the Safety Margin Selector
In what follows, we describe the Safety Margin Selector module. First, we give some background on Reinforcement Learning. Then, we present our problem formulation. Next, we detail the formulation of the reward function (i.e., the objective function). Finally, we describe the solving algorithm.
1) Reinforcement Learning:
RL is an area of machine learning [10] that can be used to solve problems requiring a series of decisions. The algorithm learns what action to take so as to maximize a numerical reward signal. The algorithm is not told which actions to take (from the predefined set of actions) but instead must discover which ones yield the highest reward by trying them. RL is based on the Markov Decision Process (MDP) [10]. An MDP is a discrete-time stochastic control process [15]. It offers a mathematical framework for modeling problems where the results are sometimes random and sometimes under the control of a decision-maker. The main concepts of an MDP are the following: i) Agent: the decision-maker, which sets the size of the safety margin; ii) Environment: a Cloud host, the interface that the agent interacts with; iii) State: describes the environment (i.e., host) properties at a given time, which can be observed by the agent; iv) Action: what the agent can do in each state (i.e., change the size of the safety margin); v) Reward: a feedback signal from the environment to the agent (i.e., SLA violation cost, allocated resources cost).

The objective of solving an MDP is to find the optimal policy π (a function that specifies the action to take in each state) that maximizes the sum of expected future rewards.
2) Problem formulation: we formulate the problem of adjusting the safety margin using the MDP framework with the tuple {S, A, R, P}. It specifies the state of the environment, the action of our agent, the reward function, and the state transitions:

• S = {errors}: S represents the current state of the environment (i.e., host). The state indicates the previous prediction errors during a predefined time window. Here errors = [e(t − w, m), ..., e(t, m)] is a sliding-window vector of size w, where each value is computed as e(t, m) = u(t, m) − p(t, m), with e(t, m), u(t, m), and p(t, m) representing the prediction error, the host's real utilization, and the prediction, respectively, for resource metric m at time t. The size of the error window w was set to one hour in order to have a reactive strategy that adapts quickly to workload changes.

• A = {sm ∈ ℝ | 0 ≤ sm < 1} is the action set, which consists of the possible percentage values of the safety margin. The safety margin is generated at each time step t. The time step is set to 3 minutes, similarly to the prediction sampling. This allows the algorithm to adjust quickly, since the prediction error changes at each step.

• P : S × A × S → [0, 1] is the probability that the environment transitions from state s to a new state s′ when action a is performed (e.g., an increase in resource utilization when placing a container). MDP modeling requires this variable, but in our case we used RL algorithms that implicitly consider these transitions [16]. Indeed, due to the complexity of the cloud environment, it is hard, if not impossible, to model its state transitions precisely.

• R : S → ℝ is the reward function expressing the expected reward when the system is in state s. The reward function does not depend on the action a (i.e., the safety margin), as SLA penalties at t are not necessarily a result of the immediate previous action (t − 1). They are also due to mispredictions, which may occur at any moment. Thus, the reward signal is not immediate but delayed according to the applied safety margin at t. The reward function is detailed below.

A sketch of the state construction follows this list.
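The following minimal sketch illustrates the sliding error window; it assumes a 10-sample window to match the actor's input size given in Section IV (the one-hour window at 3-minute sampling would give 20 samples, so the exact size is an assumption), and the names are illustrative:

    from collections import deque

    WINDOW = 10  # assumed window size w, matching the actor's input layer

    class ErrorState:
        """Sliding window of prediction errors e(t, m) = u(t, m) - p(t, m)."""

        def __init__(self, window: int = WINDOW):
            self.errors = deque([0.0] * window, maxlen=window)

        def push(self, usage: float, prediction: float) -> list:
            # Record e(t, m) and return the state [e(t - w, m), ..., e(t, m)].
            self.errors.append(usage - prediction)
            return list(self.errors)

    state = ErrorState()
    s = state.push(usage=55.0, prediction=50.0)  # hypothetical CPU samples (%)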
3) Reward function: in what follows, we detail the reward function. The idea is to reward the agent when allocating resources but penalize it in case of SLA violations. Thus, the reward function can be formulated as the total savings from the resources while considering SLA penalties:

c_savings(h, d) = c_potential_saving(h, d) − c_penalty(h, d)   (1)

with c_savings(h, d) representing the savings for a given host h and day d, c_potential_saving(h, d) the potential savings, when no SLA violation occurs, from allocating the available resources in host h on day d, and c_penalty(h, d) the penalties due to SLA violations in host h during day d.

The potential savings c_potential_saving(h, d) of a Cloud provider for the allocated resources are formulated as follows:

c_potential_saving(h, d) = Σ_t nb_container(h, d, t) × ppm   (2)

summing over the time steps t of day d, with nb_container(h, d, t) being the number of containers in host h at time t, and ppm the price per minute of hosting a container. The price depends on the size of the allocated container and its price per hour pph(container_size).

The penalty of SLA violations c_penalty(h, d) is computed using a discount percentage, which is deduced according to the duration of violations in a 24-hour window (e.g., see Table II):

c_penalty(h, d) = c_potential_saving(h, d) × discount(T_violation(h, d))   (3)

where discount(T_violation(h, d)) is the discount percentage according to the measured duration of SLA violations. The violation duration T_violation(h, d) is incremented at every time step for which a violation is observed:

if p(h, d, t) < u(h, d, t) then T_violation(h, d) += ts   (4)

with p(h, d, t) being the prediction of the resource usage u(h, d, t) for host h during day d at time t, and ts the time step in minutes. This computation is sketched below.
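A minimal sketch of equations (1)-(4), using the discount schedule of Table II and the Spot-like hourly price of Section IV; the conversion of ppm into a per-step cost (the ts factor) is an assumption about the sampling, and the series names are illustrative:

    def discount(violation_minutes: float) -> float:
        # Delay-dependent discount from Table II (24-hour window).
        if violation_minutes > 720:
            return 0.30
        if violation_minutes > 120:
            return 0.15
        if violation_minutes > 15:
            return 0.10
        return 0.0

    def daily_savings(nb_containers, predictions, usages, pph=0.0317, ts=3):
        """c_savings(h, d) for one host and one day: eq. (1) = eq. (2) - eq. (3)."""
        ppm = pph / 60.0                                          # price per minute
        potential = sum(n * ppm * ts for n in nb_containers)      # eq. (2)
        t_violation = sum(ts for p, u in zip(predictions, usages)
                          if p < u)                               # eq. (4)
        penalty = potential * discount(t_violation)               # eq. (3)
        return potential - penalty                                # eq. (1)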
4) The solving algorithm: one of the main criteria for choosing an RL algorithm is the type of the action space [10] (i.e., discrete or continuous), in our case the safety margin. On the one hand, a discrete-action algorithm outputs an action from a finite set of possible actions. For our problem, this means we would need to discretize the safety margin space. If we chose a 5% step, we would need to train the algorithm with 20 actions (from 0 to 100%), which can be time-consuming [17]. On the other hand, a continuous-action algorithm outputs an action with real values. This means that one action as the output would suffice, as it can be any real value in the interval [0, 1] for the safety margin percentage.

Fig. 4: Overview diagram of the DDPG algorithm (the actor learns the policy, the critic the Q-function; observations and rewards come from the environment).

We chose one of the state-of-the-art algorithms [16], called Deep Deterministic Policy Gradient (DDPG) [18], which is efficient for continuous action space problems [19]. DDPG is a Reinforcement Learning algorithm that concurrently learns a Q-function (i.e., the value that represents the quality of state-action pairs) and a policy (i.e., the mapping of states to actions). DDPG uses two neural networks, called Actor and Critic, which are represented in Fig. 4. The actor is used to learn the policy (i.e., choosing a value of the safety margin), whereas the critic computes the Q-function value of the actor's action, which is used in updating the networks.

Using the DDPG algorithm, we can integrate the formulated problem, namely the observation of prediction errors, the selection of safety margins, and the reward function. Algorithm 1 shows the pseudo-code used for configuring the safety margin during the learning and testing phases.

First, we initialize two variables (lines 1-2) used by DDPG: i) the discount factor γ, which balances the importance of immediate and future rewards; we set it to a value close to 1, which prioritizes future rewards; ii) the learning rate α, which should balance the convergence accuracy of the algorithm and the learning speed; lower values slow down the learning but improve the convergence of the agent. We empirically selected a small α by evaluating different values that trade learning speed for better convergence.

We then initialize a replay buffer (line 3) that stores previous experience for faster convergence during training. Then, we create the DDPG model (line 4), which contains both the actor and critic networks. The algorithm loops over the traces by day and then by a predefined time step. In each step, we get the resource predictions (line 9) and usages (line 10). We compute the prediction errors using the previously computed safety margin (line 11). The errors are then used to compute the reward function (line 12). The DDPG agent uses the observed prediction errors to select a safety margin (line 14). Initially, the algorithm does not have any experience; thus, a random process must be used to take random actions for exploration purposes. For efficient learning, we have to balance exploration (i.e., searching for new knowledge) and exploitation (i.e., improving upon the current knowledge). In the function executed to select a safety margin (line 14), we used the Ornstein–Uhlenbeck process [20], an algorithm used for the exploration/exploitation problem in the case of continuous action spaces. Finally, when training the algorithm, we store the previous experience in the replay buffer (line 16). From the replay buffer, we randomly select a batch of previous experiences in order to update the agent's model (lines 17-19).
Algorithm 1: Pseudo-code of configuring the safety margin using DDPG

    1:  α = 0.…;                            // learning rate
    2:  γ = 0.…;                            // discount factor
    3:  experience = initializeReplayBuffer();
    4:  agent = DdpgModel(α, γ);
    5:  for d = 1, D do                     // loop over days
    6:      for t = 0, 24h; time step do
    7:          reward = 0;
    8:          for h in hosts do
    9:              predictions = getResourcePrediction(h, d, t);
    10:             usages = getResourceUsage(h, d, t);
    11:             errors = predictions + sm - usages;
    12:             reward += computeRewardValue(errors);
    13:         end
    14:         sm = agent.selectSafetyMargin(errors);
    15:         if training then
    16:             experience.store(errors, sm, reward);
    17:             if updateRequired() then
    18:                 batch = experience.randomSamples();
    19:                 agent.update(batch);
    20:             end
    21:         end
    22:     end
    23: end

IV. EXPERIMENTAL VALIDATION
In this section, we detail the experimental setup and the results used to validate the efficiency of our contribution, and we try to answer the following questions:
Q1: What is the overall performance of ReLeaSER compared to other strategies in terms of savings and SLA penalties?
Q2: What are the potential gains of ReLeaSER on larger production datacenters?
Q3: How is the safety margin adapted for each datacenter/host/resource metric?
Experiment metrics and strategies: comparing the proposed solution to other strategies is done using the same metrics presented in the Motivation Section II, namely: 1) the cost of SLA violations, 2) the total savings related to the reclaimed resources. Both of these metrics are used to assess the quality of the selected safety margin. Finding a trade-off between SLA violation penalties and the total savings determines the performance of the strategy. ReLeaSER is compared to the following strategies, whose margin policies are sketched after this list:

• Random: this strategy sets the safety margin randomly. It was evaluated to observe whether our strategy effectively learns rather than choosing random actions.
• Fixed: this strategy empirically selects the best safety margin for all hosts and resources. It was used in [8], [9]; the best safety margin value for the tested solutions and datacenters was 5%.
• Simple feedback: this strategy adds the prediction error from the previous time step to the 5% safety margin of the fixed strategy.
• Scavenger: this strategy uses both the mean of the resource usage during a time window and its standard deviation to build an interval of future utilization. The value of the standard deviation can be used as a safety margin. Scavenger [3] was used to reduce interference between applications.
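For illustration, minimal sketches of the four baseline margin policies, with margins expressed as fractions in [0, 1); the clamping and window handling are assumptions, as the paper does not specify them:

    import random
    import statistics

    def random_margin() -> float:
        return random.uniform(0.0, 1.0)

    def fixed_margin() -> float:
        return 0.05  # best empirical value reported for [8], [9]

    def simple_feedback_margin(previous_error: float) -> float:
        # The fixed 5% margin plus the previous step's prediction error.
        return 0.05 + previous_error

    def scavenger_margin(usage_window: list) -> float:
        # Standard deviation of recent usage serves as the margin (Scavenger [3]).
        return statistics.pstdev(usage_window)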
A. Implementation
ReLeaSER was implemented using Keras-rl [21], which implements state-of-the-art deep reinforcement learning algorithms in Python. It is based on Keras [22], a framework used to develop deep machine learning models. Keras is built on top of Google's open-source framework TensorFlow [23]; we used the GPU build of TensorFlow. The configuration of additional parameters is required [21]:

• Replay memory (number of steps): limit = 100000.
• Ornstein–Uhlenbeck process: size = 1, mu = 0, with fixed theta and sigma values.
• Number of warm-up steps (actor/critic).
• Batch size.
• Error metric: Mean Absolute Error (MAE):
MAE = (1/n) Σ_{j=1}^{n} |y_j − ŷ_j|

with y_j the target value and ŷ_j the observed value.

• Training/testing ratio: training = 0.8, testing = 0.2.
• Target model update: after 10 windows of 24 hours.

A neural network [24] is composed of neurons arranged in layers, namely an Input layer, an Output layer, and intermediate layers called Hidden layers. Layers are interconnected with a specific type of connection, such as dense, where all neurons of two layers are fully connected. Finally, each layer has an activation function that controls the output. The neural network architecture of the agent is similar to the DDPG Pendulum example of Keras-rl [25]:

• Actor's architecture:
  – Input layer: dense layer, 10 neurons (state input size), ReLU activation,
  – Hidden layers: two dense layers, 16 neurons, ReLU activation,
  – Output layer: dense layer, one neuron (action), linear activation.
• Critic's architecture:
  – Input layer: dense layer, 11 neurons (state input size + actor's action), ReLU activation,
  – Hidden layers: two dense layers, 32 neurons, ReLU activation,
  – Output layer: dense layer, one neuron (Q-value of the action), linear activation.

A Keras-rl sketch of this agent is given at the end of this subsection.

TABLE I: Total capacities and average resource utilization of the datacenters [2]

Datacenter   Number of hosts   CPU (cores)   Average CPU usage   RAM (TB)   Average RAM usage
PC-1         9                 120           14.6%               1.2        55.7%
PC-2         27                230           10.3%               3.8        43%
University   6                 72            9.8%                1.5        60.4%

TABLE II: Discount percentage in case of violations during a 24-hour day [2]

Violation duration (minutes)   Discount
>15 to ≤120                    10%
>120 to ≤720                   15%
>720                           30%

To train the algorithm, we split the dataset (called PC-2) into 80% for the training phase and 20% for the testing phase. The split is done over the 6-month period of the dataset, comprising 27 hosts. We used the PC-2 dataset because it has the highest number of hosts. After training, random actions are no longer performed, only the learned strategy, to assess exactly the performance of the algorithm. We also evaluate the two additional datasets of PC-1 and University.
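For concreteness, here is a minimal sketch of how this actor/critic pair and the DDPG agent could be assembled with Keras-rl, following the Pendulum example [25]. The gamma, learning rate, theta/sigma, warm-up, batch-size, and target-update values below are placeholders for the values elided above, not the paper's exact settings:

    from keras.layers import Concatenate, Dense, Flatten, Input
    from keras.models import Model, Sequential
    from keras.optimizers import Adam
    from rl.agents import DDPGAgent
    from rl.memory import SequentialMemory
    from rl.random import OrnsteinUhlenbeckProcess

    STATE_SIZE = 10   # error-window state, per the actor's input size
    NB_ACTIONS = 1    # one continuous action: the safety margin

    # Actor: state -> safety margin (two hidden dense layers of 16 neurons).
    actor = Sequential([
        Flatten(input_shape=(1, STATE_SIZE)),
        Dense(16, activation='relu'),
        Dense(16, activation='relu'),
        Dense(NB_ACTIONS, activation='linear'),
    ])

    # Critic: (action, state) -> Q-value (two hidden dense layers of 32 neurons).
    action_input = Input(shape=(NB_ACTIONS,), name='action_input')
    observation_input = Input(shape=(1, STATE_SIZE), name='observation_input')
    x = Concatenate()([action_input, Flatten()(observation_input)])
    x = Dense(32, activation='relu')(x)
    x = Dense(32, activation='relu')(x)
    critic = Model(inputs=[action_input, observation_input],
                   outputs=Dense(1, activation='linear')(x))

    memory = SequentialMemory(limit=100000, window_length=1)   # replay buffer
    noise = OrnsteinUhlenbeckProcess(size=NB_ACTIONS, theta=0.15,
                                     mu=0.0, sigma=0.3)        # placeholder values

    agent = DDPGAgent(nb_actions=NB_ACTIONS, actor=actor, critic=critic,
                      critic_action_input=action_input, memory=memory,
                      nb_steps_warmup_actor=100, nb_steps_warmup_critic=100,
                      random_process=noise, gamma=0.99, batch_size=32,
                      target_model_update=1e-3)
    agent.compile(Adam(lr=1e-4), metrics=['mae'])  # MAE as the error metric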
B. Experimental setup
A summary of the datasets is presented in Table I; it shows the overall capacity and average utilization of the datacenters, which are heterogeneous. PC-1 (i.e., Private Company 1) has 6 different configurations among its 9 hosts. PC-2 has 13 different configurations among its 27 hosts. University has 6 different configurations. More details can be found in [2]. In order to compute the potential savings, we used the following models from [2]:

• Leasing model: a unique model is used for simplicity, namely a container with 2 vCPUs and 8 GB of RAM.
• Pricing model: a fixed price for leasing one container, based on a pay-as-you-go model. The price was fixed to 0.0317$/hour, as for an Amazon Spot Instance [11].
• Penalty model: a delay-dependent penalty for SLA violations, for which the discount is relative to the CP's response delay. Table II shows the discounts applied according to the accumulated time of SLA violations during a day.
C. Experiment results

1) Q1 - Cost of allocating resources: in this experiment, we evaluate the cost of allocating the reclaimed resources. We compare both the overall savings of CPs and the SLA violation penalties. Fig. 5 shows stacked-bar graphs representing the SLA penalties and the overall savings for the different strategies and datacenters.

SLA penalties: when observing the SLA penalties, we notice that all the strategies are able to reduce the penalties compared to the values seen in Section II. This is expected as long as the value of the safety margin is greater than zero. Among the strategies, the random one performed the worst, then the fixed one, followed by the simple feedback strategy. The latter performed better because it considers a minimum value for the safety margin. However, the two top-performing strategies are ReLeaSER, with the least SLA violation penalties, followed by Scavenger. Indeed, when comparing ReLeaSER to Scavenger, it reduces penalties on average by 2.7× (and up to 3.4×) across PC-1, PC-2, and University.

Overall savings: when observing the overall savings, we note that the random strategy also performs the worst, since it has the highest violation rate. Both the fixed and simple feedback strategies have comparable savings in spite of the difference in SLA violations. This can be explained by the size of the selected safety margin: choosing a larger safety margin does reduce SLA penalties but does not necessarily improve savings. A trivial example that showcases this is a safety margin of 100%, which leads to no penalties but also no savings. When comparing ReLeaSER to Scavenger, we observe an improvement in the overall savings of 27.6% on average (and up to 43.6%) across PC-1, PC-2, and University. Note that although ReLeaSER reduces penalties by up to 3.4× compared to Scavenger, the savings improve by at most 43.6%. This depends heavily on the penalty model used, as one model can be more penalizing than another.
2) Q2 - Extrapolation to larger datacenters: the previous evaluation was done on relatively small datacenters compared to what Amazon and Google operate. Hence, the savings computed in the previous experiment only allow for an objective comparison. Here, we extrapolate the savings of both ReLeaSER and Scavenger to an Amazon datacenter configuration. Each Amazon datacenter has between 50,000 and 80,000 hosts [26]. Using the average saving per host and per month of each strategy, we computed the approximate monthly savings for a datacenter with 50,000 hosts. Our solution increases the total potential savings by 21% per month compared to Scavenger. Even though the extrapolation may seem naive, it gives a rough idea of the savings that can be achieved.
3) Q3 - Analysis of the selected safety margins: in this section, we analyze the safety margins selected by ReLeaSER. The goal is to understand how it performed and where the gains come from. Fig. 6 shows boxplots of the selected safety margins for the different datacenters, for both the CPU (Fig. 6a) and the RAM (Fig. 6b). Table III shows the minimum, median, and 75th percentile of the safety margins for the University hosts.

Fig. 5: Comparison of the overall savings and SLA violation penalties of the reclaimed resources, per strategy (ReLeaSER, Scavenger, Feedback, Fixed, Random): (a) Private Company 1; (b) Private Company 2; (c) University.

Fig. 6: Safety margins selected by ReLeaSER: (a) CPU safety margins; (b) RAM safety margins.

TABLE III: Safety margins for University hosts

                 Host-1  Host-2  Host-3  Host-4  Host-5  Host-6
Minimum          0%      0%      0%      1%      2%      1%
Median           3%      2%      4%      5%      6%      4%
75th percentile  4%      4%      7%      10%     18%     9%

The first observation that can be drawn from Fig. 6 is that the median safety margins for the CPU are higher than for the RAM. This confirms that the RAM should be tuned separately, as seen in Section II. Moreover, the median safety margin of the CPU is around 5%, which aligns with the best safety margin of the fixed strategy. We observe that the safety margins change from one datacenter to another, with PC-2 giving the highest values; each datacenter thus uses different safety margin values to reduce the penalties. This difference demonstrates that each datacenter has a different behavior in terms of resource utilization. Finally, we observe some outliers for all the datacenters considered (see Fig. 1), which are due to high prediction errors. However, their likelihood of occurrence is low.

In Table III, we observe the different safety margin levels set for each host. The minimum, median, and 75th-percentile values vary from one host to another. Host-5 has the highest median and 75th percentile of the safety margins, while Host-1 and Host-2 have the lowest and most similar safety margins, which may mean that they have similar predictability. These results confirm that the safety margin should be tuned at the host level.

V. RELATED WORK
The safety margin of resources is used in a variety of applications (where it is also referred to as headroom). In big data applications, such as Hadoop [27], a user-configurable safety margin can be used for each host. This safety margin, however, is mainly used for decisions such as re-prioritizing sub-tasks to take advantage of currently allocated containers. In Pado [7], Cuckoo [8], and Salamander [9], a fixed safety margin is used. As shown in the motivation Section II, the safety margin should be configured for each host and resource metric, since they exhibit different behaviors. In [1], the authors propose a safety margin tailored to the job execution time: the higher the execution time of the job, the larger the safety margin. However, this technique is specific to big data jobs. This means that, if multiple jobs are executed, the safety margin is set for the longest job even if most jobs have a short execution time. Rhythm [28] and CLITE [29] are two frameworks used for optimizing resource utilization by co-locating latency-critical applications. Rhythm uses a load limit, which is the upper limit of request load that allows the co-location. The number of co-located applications is controlled by the lower limit of request load, called slack. Similarly, CLITE computes a maximum request load by evaluating the latency of each application it runs. CLITE also evaluates the maximum load of memcached that guarantees the QoS requirements. However, both Rhythm and CLITE need to build a catalog of applications that can be co-located, which is limiting and time-consuming to extend. Our strategy, ReLeaSER, adjusts the safety margin dynamically without specifying the type of the running workloads. Instead, it relies only on the host resource utilization and its prediction to reduce SLA violations and increase savings.

VI. CONCLUSION
Using reclaimed resources is important for Cloud providers in order to increase their savings. However, allocating reclaimed resources should be done while guaranteeing customers' SLA, which is challenging. In addition, resource reclamation may rely on prediction mechanisms that are error-prone given the stochastic nature of Cloud workloads.

To address these challenges, we proposed ReLeaSER, a Reinforcement Learning strategy for optimizing the utilization of ephemeral resources in the cloud. The strategy consists in setting a dynamic safety margin at the host level for each resource metric. The strategy learns from the prediction errors and improves the size of the safety margin accordingly. This is done to reduce the SLA violation penalties and increase the potential savings of Cloud providers.

We evaluated ReLeaSER against four other strategies for adjusting the safety margin. The results show that we reduce the SLA violation penalties considerably, on average by 2.7 times. The improvements are also considerable for the CPs' savings, with an average of 27.6%. Furthermore, ReLeaSER increases the potential monthly savings substantially when linearly extrapolated to a single Amazon datacenter.

For future work, we plan to extend our work to additional resource metrics such as network and storage. We also plan to evaluate the strategy with higher volatility of resources. In addition, we did not consider the starting-time variations of the containers and virtual machines; this may have an impact on the relevance of the chosen strategy, which might reduce the efficiency of ReLeaSER. We plan to upgrade our agent model to consider such starting-time fluctuations. Finally, we plan to implement Safe Reinforcement Learning [30], which is used to avoid random actions. This can be useful for giving the RL agent the chance to improve and adapt its strategy while in production without impacting performance.

ACKNOWLEDGMENT
This work was supported by the Institute of Research and Technology b<>com, dedicated to digital technologies, funded by the French government through the ANR Investment referenced ANR-A0-AIRT-07.

REFERENCES

[1] Y. Zhang, G. Prekas, G. M. Fumarola, M. Fontoura, Í. Goiri, and R. Bianchini, "History-based harvesting of spare cycles and storage in large-scale datacenters," 2016, pp. 755–770.
[2] J.-E. Dartois, A. Knefati, J. Boukhobza, and O. Barais, "Using quantile regression for reclaiming unused cloud resources while achieving sla," IEEE, 2018, pp. 89–98.
[3] S. A. Javadi, A. Suresh, M. Wajahat, and A. Gandhi, "Scavenger: A black-box batch workload resource manager for improving utilization in cloud environments," in Proceedings of the ACM Symposium on Cloud Computing, 2019, pp. 272–285.
[4] Y. Yang, L. Zhao, Z. Li, L. Nie, P. Chen, and K. Li, "Elax: Provisioning resource elastically for containerized online cloud services," IEEE, 2019, pp. 1987–1994.
[5] J. Cao, J. Fu, M. Li, and J. Chen, "Cpu load prediction for cloud environment based on a dynamic ensemble model," in Software: Practice and Experience, 2014, pp. 793–804.
[6] A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica et al., "Above the clouds: A berkeley view of cloud computing," Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS, 2009.
[7] Y. Yang, G.-W. Kim, W. W. Song, Y. Lee, A. Chung, Z. Qian, B. Cho, and B.-G. Chun, "Pado: A data processing engine for harnessing transient resources in datacenters," in Proceedings of the Twelfth European Conference on Computer Systems, 2017, pp. 575–588.
[8] J.-E. Dartois, H. B. Ribeiro, J. Boukhobza, and O. Barais, "Cuckoo: Opportunistic mapreduce on ephemeral and heterogeneous cloud resources," IEEE, 2019, pp. 396–403.
[9] M. Handaoui, J.-E. Dartois, L. Lemarchand, and J. Boukhobza, "Salamander: a holistic scheduling of mapreduce jobs on ephemeral cloud resources," in The 20th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2020, pp. 320–329.
[10] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," in Journal of Artificial Intelligence Research, 1996, pp. 237–285.
[11] "Amazon ec2 spot instances pricing," https://aws.amazon.com/ec2/spot/pricing/, accessed: 2020-06-11.
[12] "Cloud dedicated and partner interconnect service level agreement (sla)," https://cloud.google.com/interconnect/sla, accessed: 2020-06-10.
[13] "Amazon compute service level agreement," https://aws.amazon.com/compute/sla/, accessed: 2020-06-10.
[14] Y. Yan, Y. Gao, Y. Chen, Z. Guo, B. Chen, and T. Moscibroda, "Tr-spark: Transient computing for big data analytics," in Proceedings of the Seventh ACM Symposium on Cloud Computing, 2016, pp. 484–496.
[15] R. Bellman, "A markovian decision process," in Indiana University Mathematics Journal, 1957, pp. 679–684.
[16] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," in IEEE Signal Processing Magazine, 2017, pp. 26–38.
[17] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, "Deep reinforcement learning in large discrete action spaces," in arXiv, 2015.
[18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in arXiv preprint arXiv:1509.02971, 2015.
[19] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[20] J. L. Doob, "The brownian movement and stochastic equations," in Annals of Mathematics, 1942, pp. 351–369.
[21] M. Plappert, "keras-rl," https://github.com/keras-rl/keras-rl, 2016.
[22] F. Chollet et al., "Keras," https://keras.io, 2015.
[23] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," in arXiv preprint arXiv:1603.04467, 2016.
[24] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning. MIT Press, Cambridge, 2016, vol. 1.
[25] "Ddpg pendulum example using keras-rl," https://github.com/keras-rl/keras-rl/blob/master/examples/ddpg_pendulum.py, accessed: 2020-06-11.
[26] B. King, A. Lark, A. Lightman, and J. Rangaswami, Augmented: Life in the Smart Lane. Marshall Cavendish International Asia Pte Ltd, 2016.
[27] K. Shvachko, H. Kuang, S. Radia, R. Chansler et al., "The hadoop distributed file system," in IEEE 26th Symposium on Mass Storage Systems and Technologies, 2010, pp. 1–10.
[28] L. Zhao, Y. Yang, K. Zhang, X. Zhou, T. Qiu, K. Li, and Y. Bao, "Rhythm: component-distinguishable workload deployment in datacenters," in Proceedings of the Fifteenth European Conference on Computer Systems, 2020, pp. 1–17.
[29] T. Patel and D. Tiwari, "Clite: Efficient and qos-aware co-location of multiple latency-critical jobs for warehouse scale computers," IEEE, 2020, pp. 193–206.
[30] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," in