Using Application Data for SLA-aware Auto-scaling in Cloud Environments
Andre Abrantes D. P. Souza, Marco A. S. Netto
IBM Research
Abstract—With the establishment of cloud computing as the environment of choice for most modern applications, auto-scaling is an economic matter of great importance. For applications like stream computing that process ever-changing amounts of data, modifying the number and configuration of resources to meet performance requirements becomes essential. Current solutions on auto-scaling are mostly rule-based, using infrastructure-level metrics such as CPU/memory/network utilization and system-level metrics such as throughput and response time. In this paper, we introduce a study on how effective auto-scaling can be when using data generated by the application itself. To make this assessment, two algorithms are proposed that use a priori knowledge of the data stream and use sentiment analysis from soccer-related tweets, triggering auto-scaling operations according to rapid changes in the public sentiment about the soccer players that happen just before big bursts of messages. Our application-based auto-scaling was able to reduce the number of SLA violations by up to 95% and reduce resource requirements by up to 33%.
I. INTRODUCTION
Cloud was initially created to host web applications but has become mature enough to host more complex applications, such as those in the big data space. Due to the large resource consumption of these new cloud applications, users are cautious about how much they spend in the cloud to meet their QoS requirements. In this scenario, auto-scaling, also known as elasticity [1], is an important technique to help users configure resource allocation dynamically.

There is a large body of work in the literature on auto-scaling solutions [2]–[7]. Most of the existing solutions are based on rules [8] that assess system or infrastructure level variables. An example of a CPU-based threshold rule is: "increase 10% of resources if CPU usage is above 80% for the last 5 minutes". Other examples of auto-scaling metrics are memory, network, and storage usage, response time, and throughput.

Another source of metrics to trigger auto-scaling operations comes from the applications themselves. A signal inside the data generated by an application can serve as an early indicator that there will be a load change in the near future. This signal can be more effective than waiting for CPU or network utilization to reach undesirable levels. Examples of signals are (i) a relevant piece of news on a web site that was just published and may increase user access to the site; (ii) a data mining application that reaches an intermediate result that intensifies the use of computing power to further explore a search area; and
(iii) a financial application that detects an unexpected trend that requires additional simulations to handle it.

In this paper, we carry out a study on using application data as a trigger for auto-scaling operations. Our hypothesis is that this approach meets QoS requirements more efficiently than using auto-scaling triggers based on infrastructure or system metrics. Therefore, our contributions are:
• Identification of auto-scaling triggers that use the correlation between data produced by the application and the volume of data to be processed (§ III);
• Two auto-scaling triggers based on application data, including one with user sentiment analysis (§ IV);
• An extensive evaluation of the auto-scaling triggers using millions of tweets from the 2013 FIFA Confederations Cup and an application that calculates public sentiment changes during soccer matches. We use a CPU-based threshold algorithm for comparison purposes (§ V).
II. BACKGROUND
Auto-scaling is an important part of cloud computing as it serves both to keep up with high resource utilization and to save money when resources are underutilized. Manually managing resources is far too inefficient for most applications as performance and input size usually vary over time. Automatically scaling applications is preferable because resources can be deployed faster and scaling can follow a great array of performance parameters beyond ordinary human capabilities.

The main auto-scaling operations are scale-in, scale-out, scale-up, and scale-down. Scale-in/out shrinks and expands the number of computing resources, and scale-up/down expands and shrinks the computing power of existing resources. The first two operations are known as horizontal auto-scaling, whereas the last two are known as vertical auto-scaling.

There are efforts in auto-scaling from both industry and academia. Amazon CloudWatch [9] is a monitoring system that helps users decide when cloud resources need to be modified. In this system, users specify upper and lower bounds for monitored metrics such as memory and CPU usage. The Microsoft Azure auto-scaling system [10] also allows users to specify these auto-scaling parameters. Scryer [11], from Netflix, is an auto-scaling engine that uses predictive models to know when resources should be added or removed. Its auto-scaling strategy is not exposed to users, so they do not need to interact with it or specify auto-scaling thresholds and policies.

Mao et al. [5] proposed an architecture that deals with auto-scaling focusing on meeting user deadlines. Shen et al. [7] presented a system to automate elastic resource scaling for cloud computing environments. Their system does not require prior knowledge about the applications running in the cloud. Other projects consider auto-scaling in different scenarios, such as auto-scaling for MapReduce applications [3], [12], vertical versus horizontal auto-scaling [6], operational costs [4], and integer-model-based auto-scaling [5]. Ali-Eldin et al. [2] introduced a tool to analyze and classify workloads and assign the most suitable auto-scaling controllers based on workload characteristics. Ali-Eldin et al. also identified the challenging aspect of developing workload predictors. Cunha et al. [13] explored the use of user patience information to make better auto-scaling decisions. Netto et al. [14] introduced the concept of an Auto-scaling Demand Index to determine how well auto-scaling operations are performing and presented a study to help users configure auto-scaling parameters.

From all these works, it can be noticed that traditional auto-scaling techniques are similar for both PaaS and IaaS, i.e., simple threshold-based rules using, for instance, CPU and memory as metrics to be monitored. Other parameters could be used for auto-scaling, for instance, application parameters. While in IaaS the cloud infrastructure should only be aware of the OS level, in PaaS there is the possibility of the application being aware of the cloud infrastructure needs for auto-scaling. In order to simplify resource allocation decisions, Copil et al. [15] introduced a framework to advise on elasticity operations via time series analysis, and Leitner et al. [16] explored application data and domain experts to avoid Service Level Agreement (SLA) violations.

Data stream applications, in particular, can be scaled in more than one dimension. They can be scaled by parallelizing operators or by increasing the quota of resources available to the application.
Parallelization of operators, however, does not tend to deliver a significant benefit to the user if the operator is CPU-bound since, most of the time, a single operator is capable of maximizing the usage of the available resources. There are also efforts on auto-scaling for data streaming applications [17]–[19].

For data streaming applications, the resource management software can provide system-related data such as input and output rates and queue sizes. This would most likely already improve auto-scaling systems: trends in input rates could be found and output-based SLAs could be used. But there is still a third level of data that could be used for this kind of application: their own output.

The main novelty of this paper is to provide a real use case on how application data can be used for auto-scaling in practice and how beneficial this approach is compared to auto-scaling based on common infrastructure and system metrics.

III. USE CASE APPLICATION
We used an in-house application [20], [21] to study the impact of using application data as a trigger for auto-scaling. This application is based on IBM Streams and evaluates tweet sentiment in real time. The scenario explored here is in the context of analyzing public sentiment about players during soccer matches.

Fig. 1: Sentiment analysis application graph [20], [21]: (1) read and parse tweet; (2) check if it is about soccer; (3) get topics and sentiment; (4) extract terms; (5) accumulate statistics.

The application uses Twitter APIs to continuously read a live stream of tweets. To set up the reading stream, the application passes a set of keywords and a target language so that every tweet matching those criteria is sent to the client. Tweets come JSON-encoded and with a variety of data and metadata, such as the author username and profile.

Figure 1 shows the application operator graph. Each block is a Processing Element (PE), i.e., a set of operators abstracted to a higher level. Arrows represent the stream of data among the PEs and thus the different paths a tweet can take when traversing the graph. We define, in the context of this paper, that the path the tweet takes through the graph defines its class.

Tweets that are completely used by the pipeline go through all PEs. However, most tweets are discarded in the process, e.g., a tweet might contain a particular keyword but actually have a subject other than soccer. All discarded tweets are nevertheless sent to the final statistics accumulator node.

PEs (2), (3), and (4) in Figure 1 are heavily parallelized so they can better benefit from multiple CPUs and hosts. The source and sink PEs, (1) and (5), on the other hand, process one tweet at a time. But since their job is much simpler than that of the other PEs, they are not bottlenecks in the graph.

The way the sentiment analysis application is implemented, the sentiment-related data is loaded once and the application does not need to consult databases or external APIs at runtime. Therefore, the application is not I/O-bound; the Twitter API is its only possible I/O bottleneck and hence the application is mostly CPU-bound.

Since the application monitors live soccer matches, its infrastructure must support the variable and sometimes huge volume of tweets posted and deliver sentiment analysis in real time. That is a requisite from clients, and the usual SLA agreed is that every tweet must be processed under 5 minutes.

Fig. 2: Relationship of the average (non-neutral) sentiment on a given minute with the volume of tweets posted on the next minute for the Brazil vs Spain match.
A. Sentiment analysis and tweet volume relationship
For each tweet analyzed, sentiment is given as three real numbers, interpreted as the probability that the tweet is positive, negative, or neutral. These three numbers always sum to 1. The probability calculation is given by a machine-learning-based sentiment analysis, which is part of the in-house application [20], [21].

To smooth out periods of high fluctuation in the sentiment time series, an exponential moving average is used (as sketched below). Using a window of one minute, a considerable correlation has been found between the sentiment at a given time and the number of tweets posted in the following minutes.

Figure 2 shows the correlation between tweet sentiment and volume for the Brazil vs Spain match. There is a clear tendency that the more intense the sentiment, the more tweets are posted. From the figure, it also seems that the points are divided into two clusters. The first is a well-behaved set of points with moderate sentiment. The second set, however, is spread over a broader area but has consistently higher tweet volumes.

Although the tweet volume is not easily predictable from the current sentiment level, there is a clear relationship between sentiment intensity and tweet volume in the following minutes. Table I makes that correlation clear by showing the Pearson correlation coefficient between the average sentiment level of a minute and the volume of tweets in the near future. The correlation of sentiment at time t with the volume at time t has the highest value and decays slowly for the next 6 minutes before a significant variation.

The sentiment is above 0.4 for most of each match and for most matches. This makes it hard to detect sudden bursts of tweets simply by looking at the average sentiment score (the tweet's probability of being positive or negative). By analyzing the variation in the sentiment time series, we observe that bursts of tweets are preceded by a high sentiment variation.
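As an illustration of the smoothing and lag analysis, the following sketch computes an exponential moving average and the lagged Pearson correlations; the smoothing factor and the input series are placeholders rather than the values used in our experiments.

# Sketch of the smoothing and lagged-correlation analysis.
# The smoothing factor alpha and the synthetic series are assumptions.
import numpy as np

def ema(series, alpha=0.3):
    """Exponential moving average over a per-minute sentiment series."""
    out = np.empty(len(series))
    out[0] = series[0]
    for i in range(1, len(series)):
        out[i] = alpha * series[i] + (1 - alpha) * out[i - 1]
    return out

# Per-minute average non-neutral sentiment and tweet volume (synthetic here)
sentiment = np.random.rand(120)
volume = np.random.randint(100, 10000, size=120)

# Pearson correlation between smoothed sentiment at time t and volume at t + lag
smoothed = ema(sentiment)
n = len(sentiment)
for lag in range(7):
    r = np.corrcoef(smoothed[:n - lag], volume[lag:])[0, 1]
    print(f"lag {lag} min: r = {r:.3f}")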
TABLE I: Pearson correlation of the tweet volume at times t through t + 10 with the sentiment at time t.

Fig. 3: Sentiment variation and bursts of tweets (tweets per minute and sentiment score variation).

Figure 3 shows how that happens over a period of 100 minutes of the Brazil vs Spain match. Although there are some false positives and a false negative in the example, peaks of sentiment variation tend to appear just a minute or two before peaks of tweets.

Therefore, the data suggest that monitoring sentiment during a match is a way to detect bursts of tweets just a couple of minutes before they happen. Sudden sentiment variations even happen before any trend in the tweet volume time series is observable.
B. Workload overview
We used a set of games from the 2013 FIFA Confederations Cup to study the sentiment-volume relationship, derive statistics and models, and feed the sentiment analysis tool. The data is a set of dumps of tweets from 7 matches of the Brazilian soccer team: five matches from the FIFA Confederations Cup plus two friendly matches weeks prior to the main event. The first three matches of the cup were for the group phase, while the last two matches were the semi-final and the final. Table II shows all matches and the total number of tweets read during the execution of the sentiment analysis tool.

The two friendly matches were the ones with fewer tweets. They were also monitored for shorter periods of time. When the Confederations Cup began, games were monitored for longer and users showed more interest in tweeting about the games. Figure 4 shows the time series of the volumes of tweets captured for all matches.
Fig. 4: Tweets captured during the seven matches (tweets per minute): (a) Brazil vs England, (b) Brazil vs France, (c) Brazil vs Japan, (d) Brazil vs Mexico, (e) Brazil vs Italy, (f) Brazil vs Uruguay, (g) Brazil vs Spain.

TABLE II: Matches information.
BRA vs    Date       Total tweets  Length (hours)  Tweets per hour
England   June 2nd   370,471       2.62            141,401
France    June 9th   281,882       2.93            96,205
Japan     June 15th  736,171       4.08            180,434
Mexico    June 19th  615,831       3.79            162,488
Italy     June 22nd  518,952       3.42            151,740
Uruguay   June 26th  1,763,353     3.44            512,602
Spain     June 30th  4,309,863     4.18            1,031,067
Time series peaks indicate a sudden increase in user interest in the match and are normally a consequence of notable events. Experience has shown that controversial events, like a goal saved at the last second, generate more tweets than goals.

Both friendly games have peaks only close to the end of the monitoring, indicating that those games did not have much repercussion among social network users. Later games show more peaks during the match, reflecting the increasing user enthusiasm as the cup advances.

IV. EVALUATION TOOL AND AUTO-SCALING TRIGGERS
In order to evaluate several repeatable scenarios with different computing configurations, we created a simulation tool based on the in-house application for sentiment analysis. This section describes how the tool was created and validated, as well as the auto-scaling algorithms based on system and application metrics.
A. Simulator
A stream computing application can intuitively be thought of as a network of queues, like the classic M/M/1, with a queue for each operator. But modeling each node in this network would require a great amount of effort and would possibly lead to behaviors very different from those in the original matches.

The main purpose of the simulator is to test new auto-scaling techniques on real-world scenarios. Therefore, only limited randomness is desired to differentiate simulation from the real matches. So instead of building a full-featured sentiment analysis application over a Streams simulator, the idea is to randomize only the processing delay of the tweets, not their volume or distribution.

A tracer was attached to the in-house sentiment analysis application's code to monitor how tweets move through the processors. It logged the tweet id and the clock every time a tweet was parsed and every time it finished being processed by the sink. It also logged from which PE the tweet came before reaching the sink, so it would be easy to know its class, i.e., the path it took, and whether/where it was discarded.

To model the delay distributions of a real instance of the sentiment analysis application, a test bed comprising a PC with a 2.6 GHz CPU and 1 GB of memory was used. The application was slightly adapted to read tweets from the dumps instead of Twitter. This way, the system could read all tweets at once and process them as fast as its CPU was able to. The memory was enough for the application and no other storage was used during runtime.

One at a time, all seven dumps were given to the system and the same behavior was observed every time: an almost constant number of tweets was processed in the system simultaneously (Figure 5). Sampling on 1-second windows, the average number of tweets being processed by the system was around 15,000, with an average input rate of about 82 tweets/second and an average processing delay of roughly three minutes. These numbers closely follow Little's Law: L = λW.

Fig. 5: Number of tweets being processed simultaneously.

A trend is observable when grouping the tweets by their classes and analyzing their delay distributions. Figure 6 shows
the delay distribution of tweets that were considered off-topic and did not have their sentiment analyzed. After trying to fit different distributions to the histogram, the best match, judged by the normalized root mean square error, was the Weibull distribution.

Fig. 6: Weibull distribution of tweets with different topics (probability density and cumulative probability vs. delay in seconds).

Tweets that were discarded by PE (1) from Figure 1 had such a small delay (average below 1 second) that they were simply given a zero-delay distribution in the simulator. As for the other paths, Weibull was also the best fit.

CPU utilization by the Streams process averaged 97.95% and its memory usage remained stable. From those statistics, and from the fact that Streams showed a predictable behavior while processing tweets in parallel, if it is assumed that CPU cycles are uniformly distributed among the tweets, there is a reasonable way to convert the delay distributions into CPU cycle distributions. That allows the extrapolation of the experiments to other machine configurations, making it possible to simulate any number of CPU frequencies and cores.
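The following sketch illustrates this fitting and conversion step; the synthetic delays, bin count, and concurrency level are placeholders, and the delay-to-cycles conversion reflects the uniform-sharing assumption above rather than the exact code used.

# Sketch of fitting a Weibull distribution to per-tweet delays and
# scoring the fit with a normalized RMSE. Inputs are synthetic.
import numpy as np
from scipy import stats

# Synthetic delays stand in for the per-class tracer measurements
delays = stats.weibull_min.rvs(1.5, scale=30, size=5000)

shape, loc, scale = stats.weibull_min.fit(delays, floc=0)

hist, edges = np.histogram(delays, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2
pdf = stats.weibull_min.pdf(centers, shape, loc, scale)
nrmse = np.sqrt(np.mean((hist - pdf) ** 2)) / (hist.max() - hist.min())

# Converting a sampled delay to CPU cycles for a 2.6 GHz host, assuming
# cycles are shared uniformly among in-flight tweets; n_concurrent is
# an illustrative value.
n_concurrent = 15000
cycles = stats.weibull_min.rvs(shape, loc, scale) * 2.6e9 / n_concurrent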
B. Simulator internals
For the simulations, tweet data from different sources was consolidated into a CSV file for each match. From the dumped JSON files came the tweet id and post time. From the real processing in the sentiment analysis application came the tweet's class, processing delay, and sentiment score. Only the class and the processing delay are necessary for the derivation of the distribution parameters. The class, post time, and sentiment scores were used for the simulations. Before a simulation begins, all tweets are read from the CSV file and a random number of cycles is assigned to each tweet following its class distribution.

Unfortunately, running a discrete event simulator proved challenging in terms of algorithmic complexity, so a simpler discrete-time model was adopted. The simulator uses a certain time window on each iteration. By default, the simulation step is one second. This means that all tweets that arrived during that time slot are read and the CPU cycles available for a whole second are distributed among the current tweets.

The simulator has an internal clock that is incremented by the simulation step on each iteration of the main loop. The clock is initialized with the timestamp of the post time of the first tweet of the dump. Since it is not an objective to also simulate the network delay, a constant delay of zero is assumed and the tweet arrival time is considered equal to the post time.

To simulate a limited input rate like Streams does, an input queue is used. All tweets posted during a simulation step are inserted into the queue, but only a configurable number of tweets per second is read from the queue to be processed.

The beginning of the main loop is dedicated to reading all the tweets that were posted during that window. Tweets are read from the input queue, respecting the input rate if one is configured, and are then stored in an internal processing structure where they will compete for resources. Internally, this structure is a queue ordered increasingly by post time. This helps the next part of the main simulation loop: distributing CPU cycles among the current tweets. If a tweet needs fewer cycles than are available, the excess cycles are equally distributed among the other current tweets. This is accomplished by Algorithm 1 (Section IV-C).

The third part of the main simulation loop is getting rid of the tweets that are done being processed. Tweets that have used all the cycles they require are removed from the internal processing queue and saved to a history log, from which statistics can later be taken: mean queue time, mean processing time, etc.

The last part is reacting to the current scenario by starting an upscale or downscale. This is not done on every simulation step, but rather only every few minutes. This adaptation frequency is configurable, as is the provisioning time. For example, using the default values, the situation is evaluated every minute: sentiment and tweet volume for the last minutes are analyzed and a reaction might be issued for upscaling or downscaling. After requesting or releasing resources, another amount of time will pass before they become available.
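A minimal sketch of this main loop follows; the names are illustrative, and the cycle-distribution routine is an equal-share placeholder for Algorithm 1 (Section IV-C).

# Sketch of the simulator's discrete-time main loop; names are
# illustrative, not taken from the original simulator.
from dataclasses import dataclass

SIM_STEP = 1.0  # seconds per iteration

@dataclass
class Tweet:
    post_time: float    # arrival time, assumed equal to post time
    cycles_left: float  # sampled from the tweet class's distribution

def distribute_cycles(current, budget):
    """Equal-share placeholder for Algorithm 1."""
    if current:
        share = budget / len(current)
        for t in current:
            t.cycles_left -= share

def simulate(tweets, cycles_per_step):
    """tweets must be sorted increasingly by post_time."""
    clock = tweets[0].post_time
    pending, processing, history = list(tweets), [], []
    while pending or processing:
        # (1) read all tweets posted during this window
        while pending and pending[0].post_time < clock + SIM_STEP:
            processing.append(pending.pop(0))
        # (2) distribute one step's worth of CPU cycles among current tweets
        distribute_cycles(processing, cycles_per_step)
        # (3) retire finished tweets to the history log for later statistics
        history += [t for t in processing if t.cycles_left <= 0]
        processing = [t for t in processing if t.cycles_left > 0]
        # (4) every adapt period, an up/downscale decision would run here
        clock += SIM_STEP
    return history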
C. Auto-scaling algorithms
Two auto-scaling trigger algorithms are proposed based on a priori knowledge of the application:
1) load algorithm: knows the processing delay distributions of the sentiment analysis application;
2) appdata algorithm: only deals with peaks, is oblivious to ordinary increases in traffic, and runs alongside the load algorithm; it uses the sentiment analysis data generated by the application itself.

Algorithm 1 CPU cycle distribution algorithm.
Require: cyclesPerStep
Require: tweetList
  numberOfCurrentTweets = length(tweetList)
  tweetsToProcess = numberOfCurrentTweets
  cyclesPerTweet = cyclesPerStep / numberOfCurrentTweets
  sort tweetList increasingly by remaining cycles
  for each tweet in tweetList do
    if tweet.cyclesLeft < cyclesPerTweet then
      excessCycles = cyclesPerTweet - tweet.cyclesLeft
      tweet.cyclesLeft = 0
      tweetsToProcess -= 1
      cyclesPerTweet += excessCycles / tweetsToProcess
    else
      tweet.cyclesLeft -= cyclesPerTweet
    end if
  end for

The load algorithm is based on the expected time to process all current tweets versus the given SLA. The estimated delay is calculated from the quantile function of the delay distribution of the different tweet classes and from the proportion of each class's size. The quantile value is a parameter of the simulator. A quantile of 0.5 is the median and roughly means a delay that is greater than or equal to half of the observable delays. A quantile of 0.9 will return a delay estimate that covers as much as 90% of the tweets. The higher the quantile, the more pessimistic the model is and the more likely it is to react before the SLA is actually violated. On the other hand, a higher quantile will also spend more resources. Each class's estimated delay is then weighted according to the class size known from the training data.

Since this algorithm is proposed as a simple reactive algorithm, no prediction of the future tweet volume is attempted. Instead, if the expected delay is above the SLA, more resources are allocated, and if the expected delay is below half the SLA, resources are released. Downscaling is limited to a single CPU being returned at a time, so sudden increases in tweet volume have less impact. For upscaling, an estimate of the necessary resources is calculated from the proportion of the expected delay to the SLA over the currently available resources, as shown in the formula below:

cpus_nextPeriod = ceil(cpus * (expectedDelay / SLA))

The appdata algorithm analyzes the average sentiment score of the last minutes and compares it to the average sentiment of the minutes before. If the sentiment score increases by a given threshold or more, a predefined quantity of new CPUs is allocated.

The two proposed algorithms are compared against the classic and largely adopted auto-scaling algorithm: the CPU usage threshold algorithm. The way this algorithm was implemented in the simulator, every time the average CPU usage goes above a certain predefined threshold, an extra CPU is allocated. On the other hand, every time the CPU usage is below 50%, a CPU is released.
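The following sketch summarizes the three scaling decisions; the names are illustrative, estimate_delay() stands in for the quantile-based, class-weighted estimate described above, and the appdata sentiment threshold is an assumed value.

# Illustrative sketch of the three triggers; not the original code.
import math
from scipy import stats

def estimate_delay(class_fractions, class_params, quantile=0.99999):
    """Class-weighted delay estimate from the fitted Weibull quantile functions."""
    return sum(frac * stats.weibull_min.ppf(quantile, *params)
               for frac, params in zip(class_fractions, class_params))

def load_trigger(cpus, expected_delay, sla):
    """Load algorithm: scale up by the delay/SLA ratio, release one CPU at a time."""
    if expected_delay > sla:
        return math.ceil(cpus * expected_delay / sla)
    if expected_delay < sla / 2 and cpus > 1:
        return cpus - 1
    return cpus

def appdata_trigger(cpus, avg_sentiment_now, avg_sentiment_before,
                    delta=0.05, extra_cpus=1):
    """Appdata algorithm: allocate extra CPUs on a sudden sentiment rise.
    The threshold delta is an assumed value."""
    if avg_sentiment_now - avg_sentiment_before >= delta:
        return cpus + extra_cpus
    return cpus

def threshold_trigger(cpus, cpu_usage, threshold=0.90):
    """Classic rule: add a CPU above the threshold, release one below 50% usage."""
    if cpu_usage > threshold:
        return cpus + 1
    if cpu_usage < 0.50 and cpus > 1:
        return cpus - 1
    return cpus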
V. SIMULATION RESULTS AND ANALYSIS

The goal of the experiments presented in this section is to compare the performance of the load and appdata algorithms against the classic CPU usage threshold algorithm.

For the CPU usage threshold algorithm, thresholds of 60%, 70%, 80%, 90%, and 99% are used. For the load algorithm, the quantile values are 90%, 99%, 99.9%, 99.99%, and 99.999%. The appdata algorithm was run alongside the load algorithm with a quantile of 99.999% and different numbers of extra CPUs allocated when peaks were detected: from 1 to 10.

All simulations were run with the configuration described in Table III. All scenarios were repeated until the length of the 95% confidence interval was smaller than 10% of the mean (a sketch of this stopping rule follows Table III).

TABLE III: Basic configuration for all simulation scenarios.
Variable                  Value
CPU frequency             2.0 GHz
starting CPUs             1
simulation step           1 second
SLA                       300 seconds
adapt frequency           60 seconds
resource allocation time  60 seconds
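A minimal sketch of the repetition rule, where run_once() stands in for a single stochastic simulation run:

# Repeat each scenario until the 95% confidence interval is shorter
# than 10% of the mean; run_once() is a stand-in for one simulation.
import numpy as np
from scipy import stats

def repeat_until_tight(run_once, min_runs=3, max_runs=1000):
    samples = [run_once() for _ in range(min_runs)]
    while len(samples) < max_runs:
        mean = np.mean(samples)
        half_width = stats.sem(samples) * stats.t.ppf(0.975, len(samples) - 1)
        if 2 * half_width < 0.10 * abs(mean):
            break
        samples.append(run_once())
    return np.mean(samples)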
A. Load algorithm performance
Simulations were first run to compare the performance, in terms of quality and cost, of the load algorithm and the classic CPU usage threshold algorithm. Figure 7 was built from the resulting data and shows the quality and the cost for each match as a function of the algorithms and their parameters. Quality is shown in terms of the percentage of tweets that took longer than the SLA requirement to be processed.

The matches of Brazil against England and France were left out of the figure as there was close to no difference between the algorithms to be shown. In those matches, the volume of tweets was not as significant as in the other matches, which made it easier for the auto-scaling algorithms to react to the relatively small variations in volume. In fact, both the threshold and the load algorithms performed perfectly for both matches, and not a single tweet took longer than the SLA to be processed in any simulated scenario.

The load algorithm had a fairly constant cost across all used quantiles for the England and France matches; the cost differences between quantiles are insignificant. This behavior repeats in all seven matches, as seen in Figure 7, and shows how predictable the algorithm is in terms of cost.

Fig. 7: Comparison of the performance of the threshold and load algorithms for five of the seven games, each panel showing QoS (% of tweets out of SLA) and cost (CPU hours): (a)/(b) Brazil vs Japan, (c)/(d) Brazil vs Mexico, (e)/(f) Brazil vs Italy, (g)/(h) Brazil vs Uruguay, (i)/(j) Brazil vs Spain, for the threshold and load algorithms, respectively.

The threshold algorithm is more expensive for both of those matches, with its highest cost at a threshold of 60% and its lowest at a threshold of 99%, for both the England and the France match. The cost as a function of the CPU usage threshold always decreases as the threshold increases, as observable in all other matches shown in Figure 7.

The three matches of the group phase showed similar patterns and volumes of tweets. But while the threshold algorithm was able to perform perfectly for the matches against Japan and Italy, it did not show the same performance for the Mexico match. For this match, only a threshold of 60% CPU usage came close to completely meeting the SLA, with only 0.04% of tweets above the target processing time.

For those three matches, the load algorithm was able to perform well, although not perfectly, on the quality side. In general terms, the higher the quantile used, the better the algorithm performs, with an insignificant increase in cost. The load algorithm was able to always deliver lower costs, an advantage present in every simulated scenario. Nevertheless, the load algorithm was able to perform better than the threshold algorithm for the Mexico match.

The reason for the generally better performance of the load algorithm in the Mexico game is the great peak of tweets that happens around 180 minutes into the monitoring of the match (refer to Figure 4). Even if the peak does not seem very different from peaks in the other matches, it happens more abruptly, while the others have a small increase just before. The load algorithm performs better because it has the ability to upscale the number of CPUs faster. While the threshold algorithm can only increase the number of CPUs by one per observation, the load algorithm increases it by as many times as the proportion of the estimated delay to the SLA (as seen in Section IV-C), an ability that comes from the a priori knowledge of the delay distribution. Those peaks are events the threshold and the load algorithms were not designed to deal with, and the reason the appdata algorithm is proposed.

The last two matches had by far the most tweets and also the most significant peaks. Neither of the two algorithms performed perfectly for them, but this time the load algorithm performed significantly better when configured with higher quantiles while using far fewer resources. Those two matches were especially challenging for the algorithms thanks to the large amounts and great bursts of tweets posted by the fans that were watching the final games of the championship. While the threshold algorithm was still able to perform reasonably for the Uruguay match, the final match had the highest number of peaks of all games, and the load algorithm's capacity to upscale fast was decisive in making it outperform the threshold algorithm.

In the Brazil vs Uruguay match, comparing the scenario configurations with the best performances, the load algorithm with the 99.999% quantile delivered 0.05% of tweets above the SLA while costing 7.14 CPU hours. The threshold algorithm with a 60% CPU usage threshold had 0.25% of the tweets missing the SLA at a cost of 12.46 CPU hours.
For the final match against Spain and the same scenarios, the load algorithm had 1.67% of tweets above the SLA with a cost of 20.97 CPU hours, while the threshold algorithm let 2.52% of the tweets miss the SLA with a cost of 31.04 CPU hours.

For the Brazil vs Uruguay match, replacing the traditional threshold algorithm with a 60% threshold by the load algorithm means saving 43% of the CPU hours with a slight improvement in quality. For the Spain game, the savings are 33%. It is important, however, to note that such a low threshold is rarely used for ordinary jobs in the cloud.

B. Appdata algorithm performance
The appdata algorithm detects peaks through the analysis of the live stream of sentiment taken from the tweets being processed. Its use was put to the test together with the load algorithm with a 99.999% quantile and a number of extra CPUs varying from 1 to 10.

As shown in Section III-A, peaks of tweets can be detected by analyzing sudden changes in user sentiment. CPUs allocated preemptively are available when peaks occur and more resources are necessary, preventing quality loss. In that context, a window of 60 seconds is compared to a previous window of the same size. Peaks are consequences of certain events, and the first few tweets related to the event that come before the peak are the key to detecting them. Older tweets, from before the event, that just happened to take longer to process cannot be confused with those first few peak tweets, even if they finish being processed at the same time. For this, care must be taken that it is not the time the tweet finishes being processed that is used to analyze the sentiment time series, but the tweet's post time.

In practice, windows of 60 seconds are not large enough for efficiently detecting peaks. If, at a given time, only tweets that were posted at most 60 seconds earlier are considered for a window, very few will be taken into account, as very few are done being processed within these 60 seconds. After testing different window lengths, the one that rendered the best results was 120 seconds. With that size, even if most tweets are not done being processed, a sufficiently large number of tweets with sentiment are available for detecting peaks.
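A sketch of this window comparison follows; the variation threshold is an assumed value, while the 120-second window comes from the text above.

# Sketch of the peak detector: compare the average sentiment of the
# last window against the previous one, indexed by post time.
WINDOW = 120.0   # seconds
DELTA = 0.05     # assumed minimum sentiment rise that signals a peak

def detect_peak(scored_tweets, now):
    """scored_tweets: iterable of (post_time, sentiment_score) pairs for
    tweets whose sentiment is already available."""
    recent, previous = [], []
    for post_time, score in scored_tweets:
        if now - WINDOW <= post_time < now:
            recent.append(score)
        elif now - 2 * WINDOW <= post_time < now - WINDOW:
            previous.append(score)
    if not recent or not previous:
        return False
    return sum(recent) / len(recent) - sum(previous) / len(previous) >= DELTA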
Figure 8 shows the results of running the appdata algorithm allocating a varying number of extra CPUs when peaks were detected. Just like CPUs allocated by the load algorithm, these CPUs take 60 seconds to become available. The test bed chosen for the algorithm was the final match of the Confederations Cup: Brazil vs Spain. That is the most challenging match of the seven, with the most tweets and the highest peaks, and where this algorithm is most necessary.

Fig. 8: Appdata: Brazil vs Spain (% of tweets out of SLA and cost in CPU hours per number of extra CPUs).

The appdata algorithm was able to deliver better results already with one extra CPU. Compared to the load algorithm alone, the number of tweets above the SLA dropped from 1.67% to 1.23% while the cost increased from 20.97 to 21.27 CPU hours. When more extra CPUs are used, the quality consistently increases while the cost increases. At 10 extra CPUs, only 0.12% of the tweets miss the SLA, but at a considerably higher cost of 34.78 CPU hours. At that point, this means a quality improvement of 92.81% with a cost increase of 63.52%. When compared to the threshold algorithm, the quality improvement was 95.24% with a cost increase of only 12.05%.

Even if the quality improvement is greater than the cost increase, it is important to note that while the percentage of tweets above the SLA seems to fall linearly, the cost seems to increase exponentially. But since the SLA is very close to being completely met, it is probable that the cost-benefit will still be favorable when this happens.

The current peak detection algorithm has false negatives, and that is the reason a number of tweets still miss the SLA. It also has false positives, which result in some CPUs being unnecessarily allocated, and, since the algorithm only releases a single CPU at once, excess CPUs can take long to disappear. While the excess CPUs are the reason why costs rise so rapidly in the graph, they are also the reason why the number of tweets missing the SLA decreases: excess CPUs can compensate for an undetected peak if present at the right time.
VI. CONCLUSION
Elasticity is a key feature of cloud computing to meet SLA and budget constraints. This paper introduced a detailed case study of using the data generated by the application itself to trigger auto-scaling operations. We used data from Twitter generated during the 2013 FIFA Confederations Cup and an application that calculates the sentiment of users watching the matches. Here are the main lessons from our study.

The load algorithm consistently spends fewer resources than the threshold algorithm and is able to react faster by allocating a variable amount of resources at a time. That can only happen because of the knowledge of the delay distribution. Also, basic communication between the application and the PaaS or IaaS level is necessary so the current number of tweets in the system is reported.

The threshold algorithm still presents better quality for events with moderate tweet volumes, but its best performance is with a threshold of 60% CPU usage, well below the most common value of 90%. For jobs processing fast-changing amounts of data, smaller thresholds behave better. The choice of the threshold algorithm's parameter must be made carefully: the value of the threshold has a direct impact on the cost of running the application.

For monitoring events with smaller volumes of data, any algorithm performs well, but the load algorithm consumes fewer resources than the other algorithms. For moderately sized events, the threshold algorithm is able to perform slightly better, but the load algorithm uses fewer resources. For big events with significant bursts, the appdata algorithm is preferred, as it is able to predict peaks and prevent many SLA violations. Though it uses more resources than the load algorithm, it is more likely to meet the SLA. The balance between cost and the necessity of SLA adherence must be considered when choosing the algorithm for such events.

Apart from performance, using application data to trigger auto-scaling operations can open possibilities for service managers to configure their dynamic resource requirements in a different way. Instead of trying to define system-level metrics such as CPU or memory consumption, these service managers can focus more on application characteristics and how they are performing over time.
ACKNOWLEDGMENT
We thank Paulo Rodrigo Cavalin for his help with the sentiment analysis application. This work has been supported and partially funded by FINEP/MCTI, under subcontract no. 03.14.0062.00.
REFERENCES

[1] K. Hwang, X. Bai, Y. Shi, M. Li, W. Chen, and Y. Wu, "Cloud performance modeling and benchmark evaluation of elastic scaling strategies," IEEE Transactions on Parallel and Distributed Systems, 2015.
[2] A. Ali-Eldin, J. Tordsson, E. Elmroth, and M. Kihl, "Workload classification for efficient auto-scaling of cloud resources," Technical Report.
[3] in Proc. of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'11), 2011, pp. 454–463.
[4] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," in Proc. of the Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 2011.
[5] M. Mao, J. Li, and M. Humphrey, "Cloud auto-scaling with deadline and budget constraints," in Proc. of the 11th IEEE/ACM International Conference on Grid Computing (GRID). IEEE/ACM, 2010, pp. 41–48.
[6] M. Sedaghat, F. Hernandez-Rodriguez, and E. Elmroth, "A virtual machine re-packing approach to the horizontal vs. vertical elasticity trade-off for cloud autoscaling," in Proc. of the ACM Cloud and Autonomic Computing Conference (CAC'13). ACM, 2013.
[7] Z. Shen, S. Subbiah, X. Gu, and J. Wilkes, "Cloudscale: elastic resource scaling for multi-tenant cloud systems," in Proc. of the 2nd ACM Symposium on Cloud Computing. ACM, 2011, p. 5.
[8] T. Lorido-Botran, J. Miguel-Alonso, and J. A. Lozano, "A review of auto-scaling techniques for elastic applications in cloud environments," Journal of Grid Computing.
[9] Amazon CloudWatch.
[10] Microsoft Azure Auto-scaling.
[11] Scryer: Netflix's predictive auto-scaling engine, Netflix Tech Blog.
[12] Proc. VLDB Endow., 2012.
[13] R. L. F. Cunha, M. D. Assunção, C. Cardonha, and M. A. S. Netto, "Exploiting user patience for scaling resource capacity in cloud services," in Proc. of the 7th IEEE International Conference on Cloud Computing (CLOUD'14), 2014.
[14] M. A. S. Netto, C. Cardonha, R. Cunha, and M. D. Assunção, "Evaluating auto-scaling strategies for cloud computing environments," in Proc. of the IEEE 22nd International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'14), 2014.
[15] G. Copil, D. Trihinas, H.-L. Truong, D. Moldovan, G. Pallis, S. Dustdar, and M. Dikaiakos, "ADVISE–a framework for evaluating cloud service elasticity behavior," in Proc. of the 12th International Conference on Service-Oriented Computing (ICSOC), 2014.
[16] P. Leitner, J. Ferner, W. Hummer, and S. Dustdar, "Data-driven and automated prediction of service level agreement violations in service compositions," Distributed and Parallel Databases, vol. 31, no. 3, pp. 447–470, 2013.
[17] A. Kumbhare, Y. Simmhan, and V. K. Prasanna, "Exploiting application dynamism and cloud elasticity for continuous dataflows," in Proc. of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC). ACM, 2013, p. 57.
[18] T. Heinze, V. Pappalardo, Z. Jerzak, and C. Fetzer, "Auto-scaling techniques for elastic data stream processing," in Proc. of the 8th ACM International Conference on Distributed Event-Based Systems, 2014.
[19] T. Heinze, Z. Jerzak, G. Hackenbroich, and C. Fetzer, "Latency-aware elastic scaling for distributed data stream processing systems," in Proc. of the 8th ACM International Conference on Distributed Event-Based Systems, 2014.
[20] P. R. Cavalin, M. A. de C. Gatti, C. N. dos Santos, and C. Pinhanez, "Real-time sentiment analysis in social media streams: The 2013 Confederations Cup case," in Proc. of BRACIS/ENIAC, 2014.
[21] P. R. Cavalin, M. A. C. Gatti, T. G. P. Moraes, F. S. Oliveira, C. S. Pinhanez, A. Rademaker, and R. A. de Paula, "A scalable architecture for real-time analysis of microblogging data."