Dithen: A Computation-as-a-Service Cloud Platform For Large-Scale Multimedia Processing
Joseph Doyle, Vasileios Giotsas, Mohammad Ashraful Anam and Yiannis Andreopoulos, Senior Member, IEEE
Abstract—We present Dithen, a novel computation-as-a-service (CaaS) cloud platform specifically tailored to the parallel execution of large-scale multimedia tasks. Dithen handles the upload/download of both multimedia data and executable items, the assignment of compute units to multimedia workloads, and the reactive control of the available compute units to minimize the cloud infrastructure cost under deadline-abiding execution. Dithen combines three key properties: (i) the reactive assignment of individual multimedia tasks to available computing units according to availability and predetermined time-to-completion constraints; (ii) optimal resource estimation based on Kalman-filter estimates; (iii) the use of the additive increase multiplicative decrease (AIMD) algorithm for the allocation or termination of compute units according to the expected workload.
Index Terms—computation-as-a-service, big data, multimedia computing, cloud computing, Amazon EC2, spot instances.
I. INTRODUCTION

INFRASTRUCTURE-AS-A-SERVICE (IaaS) providers, such as Amazon Elastic Compute Cloud (EC2), Google Compute Engine (GCE), IBM Bluemix and Rackspace, now allow for the flexible reservation of compute units (CUs) in the cloud (i.e., pre-established sets of processor cores, memory, storage and operating systems), with yearly, daily, hourly or even minute-by-minute billing [1], [2]. This has led to the explosion of Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS) offerings [3], [4]. Within PaaS systems, the user is able to develop and execute processing tasks on distributed computing environments (e.g., Apache Hadoop/Mesos, Google App Engine) on IaaS providers, albeit at the cost of converting the multimedia processing software to code that can be scaled up by the PaaS infrastructure (for example, converting the operations to Map and Reduce steps in Hadoop). In SaaS, a provider licenses a specific set of applications to customers (e.g., pre-established word processing software, a fixed set of video transcoding or video streaming toolboxes, etc.) either as a service on demand, through a subscription, or in a pay-as-you-go model [5]. In the multimedia systems domain, this provides the opportunity to use transcoding or signal processing algorithms and toolboxes directly [4], [6]–[10], and has led to the development of related commercial services.
A. From Platform and Software-as-a-Service to Computation-as-a-Service
This evolution of IaaS, PaaS and SaaS is now beginning to lead to Computation-as-a-Service (CaaS) [11], where users can upload multimedia (e.g., image, audio or video) files and scripts or binary files prepared in their local environment [4], [6], [10], [12], [13] in order to be executed on CUs in the cloud directly, i.e., without having to develop and manage any infrastructure or convert their software to a format amenable to distributed computing environments. CaaS provides a useful compromise between the generality of IaaS and PaaS offerings and the ease-of-use of SaaS: the end user can deploy and scale any desktop multimedia application of their choosing without needing to adapt its codebase. This differs from SaaS in that the user can simply execute any Matlab, C/C++, Java, OpenCV, Javascript/Python based code and scripts of their local platform on the CaaS platform without any modification, by following a simple set of rules. The CaaS platform can then handle the scheduling and parallelization of multiple multimedia workloads without any user intervention, via the appropriate reservation (or bidding) of resources from IaaS providers, e.g., Amazon EC2 spot instance bidding or GCE CU reservation.
B. Related Work
Workload scheduling on CaaS systems has some resemblance to well-studied scheduling problems for large computing clusters [14]–[16]. However, two major differences between the two domains are that resources of cluster computing systems are persistent and "prepaid", i.e., the number of CUs does not fluctuate during the execution of a workload and there is no penalty for unused CUs. On the other hand, because CaaS resources are billed according to compute instance reservation, a CaaS provider often initiates or terminates CUs during the execution of a workload in order to minimize the monetary cost incurred, while abiding by the agreed workload completion time. To this end, there have been numerous recent proposals for cloud resource management. Gandhi et al. propose their own version of Autoscale, which stops servers that have been idle for more than a specified time, while concentrating jobs on fewer CUs to reduce cost [17]. Paya et al. expand on this by proposing a system that uses multiple sleep states to improve performance [18]. Song et al. propose optimal allocation of CUs according to pricing and demand distributions [1]. Ranjan et al. investigate architectural elements of content-delivery networks with cloud-computing support [19]. Jung et al. propose multi-user workload scheduling on various CUs based on genetic algorithms [20]. Beyond resource allocation and scheduling, a major challenge in CaaS frameworks is the varying delay in the completion of various multimedia processing workloads [4], [21]. The processing delay primarily depends on: the workload specifics, the CU reservation mechanism employed, and the transport-layer jitter (if data is continuously transported to/from users and cloud providers) [5]. This is the primary reason why all real-world CaaS platforms only provide "best effort" service level agreements (SLAs) for large workload execution, without considering a predetermined time-to-completion (TTC) estimate. Recent research work on this front proposes the use of particle swarm optimisation to derive viable schedules [22] and the use of the earliest-deadline-first algorithm [23]. While all such proposals are effective in their resource provisioning for TTC-abiding execution, they assume that the system has accurate estimates of the computation required to complete each workload. However, this is unlikely to be the case in practice, particularly at the start of a workload's execution. Therefore, our proposal considers the realistic scenario where no estimates for the computational requirements are available at the start of each workload's execution; i.e., in conjunction with resource provisioning, our framework performs an adaptive resource estimation during the execution of each workload.

Finally, the first commercial CaaS offerings are now beginning to emerge. The key representatives are: (i) the recently-announced AWS Lambda service, where users can submit individual Javascript items and be billed at a fixed rate per 100 ms of Lambda service usage under a best-effort SLA; (ii)
PiCloud, a service for flexible scheduling of batch processing tasks via a terminal command-line interface; (iii) Parse, a software development environment for Javascript execution on cloud-computing infrastructures; and (iv) Amazon EC2 Autoscale, a service that automatically scales application deployment over Amazon EC2 according to processor and network utilization constraints. In all these deployments, the comparative metric for workload analysis is the required processing time in terms of the number of seconds a single core was occupied until the workload is successfully completed. We therefore quantify the resource reservation in the IaaS provider via compute-unit seconds (CUSs), i.e., the product of the total cores used with the time they were reserved for (e.g., four cores reserved for 30 minutes amount to 4 × 1800 = 7200 CUSs), since charges will be applied for them by the IaaS provider regardless of whether the CaaS system actually used them to their full capacity or not.
C. Contribution
While the current research and commercial efforts in CaaS frameworks are a promising start, they do not consider the reactive estimation of the required CUSs to process submitted workloads, or they assume that the CUS metric per workload is known [4], [22], [23]. In addition, current CaaS frameworks do not consider on-demand CU provisioning (e.g., EC2 spot instances or GCE CUs with minute-level increments) under TTC constraints, where it is imperative to control both the allocation and termination of new instances in order to reduce the infrastructure cost while providing for TTC-abiding execution. Finally, at the moment there are very limited options for CaaS frameworks to develop and benchmark multimedia cloud-computing services, and the multimedia systems community would benefit from new efforts on this front.

In this light, we present Dithen, a new cloud computing service that scales small and medium-level execution of data processing workloads to big data under TTC constraints. For example, algorithms for video transcoding, image classification, object recognition, etc., that run on small volumes of input images/videos on a desktop computing system can be directly scaled up via Dithen (i.e., without any code modifications) to operate on big datasets comprising millions of input images and videos, with a-priori established completion times. Dithen meets the requirements for such large-scale data-intensive processing by combining the following novel aspects:

1) It supports the direct upload and execution of bash, Python, Java, Javascript and Matlab scripts, as well as the execution of statically-built binaries for 32-bit or 64-bit Ubuntu Linux or Microsoft Windows, on any number of EC2 spot instances.

2) Each submitted workload is separated into individually-executable tasks, which are then allocated to available CUs with proportionally-fair scheduling in order to: (i) maximize the available CU utilization and (ii) abide by the confirmed TTC value for the workload. The fine-grain partitioning of each workload into tasks allows each user to check that the output results are being produced correctly by Dithen during execution, and to cancel the workload execution otherwise.

3) Estimates of the required CUSs until the completion of each task type in each workload are derived based on Kalman-filter estimators, which are shown to significantly outperform other ad-hoc estimators.

4) Based on the estimation of the required CUSs, Dithen uses the Additive Increase Multiplicative Decrease (AIMD) algorithm [24] for the allocation or termination of CUs according to the expected workload. While AIMD is a well-known control mechanism for network resource utilization, e.g., within the transport control protocol (TCP), to the best of our knowledge, this is the first time it is proposed for CaaS provisioning.

Finally, beyond describing Dithen, we also provide free access to it: each new user account gets an amount of free credit to spend on the service.
TABLE I
NOMENCLATURE AND NOTATIONAL CONVENTIONS

Key Concept | Definition
t | monitoring time instant
W[t] | total workloads in Dithen at time instant t
M[t] | total media types at time t
m_{w,k}[t] | remaining media items of type k to be processed within workload w at time t (1 ≤ k ≤ M[t], 1 ≤ w ≤ W[t])
I | total types of instances in the cloud infrastructure
p_i | compute units (CUs), i.e., processor cores, available within instance type i, 1 ≤ i ≤ I
n_i[t], N_tot | number of instances of type i (1 ≤ i ≤ I) reserved at time t; total number of CUs in Dithen
a_{i,j}[t] | remaining time for the j-th instance of type i before additional billing is incurred by the cloud provider
c_tot[t], c_min, c_max | total compute-unit-seconds (CUSs) available in Dithen, and lower/upper limits for CUSs in Dithen
d_w[t] | time-to-completion (TTC) for workload w at time t
b̂_{w,k}[t] | CUS estimate to process a media item of type k of workload w
r_w[t] | required CUSs for the completion of workload w
s_w[t] | service rate, i.e., CUs allocated for workload w
z_{w,k}[t], v_{w,k}[t] | CUS process and measurement noise instantiations of media type k of workload w
α, β | additive increase and multiplicative decrease parameters of AIMD

Notation | Explanation
uppercase Roman letters | random variables
lowercase Greek letters | moments of probability distributions, stochastic parameters of Kalman filters, or AIMD and ARMA parameters
b̃ | measurement of quantity b
b̂ | estimation of quantity b

The remainder of this paper is organized as follows. Section II presents the architecture of Dithen. Sections III and IV present the key elements of the proposed CUS estimation and AIMD framework, while Section V presents experimental results and comparisons of different CU allocation strategies for Amazon EC2 spot instances. Finally, Section VI presents some concluding remarks.

II. ANATOMY OF THE DITHEN ARCHITECTURE
The architecture of Dithen is pictorially illustrated in Fig. 1. It comprises five elements: the Front End (FE), the Cloud Storage and Instance Types (CS-IT), the Monitoring Element (ME), and the Local and Global Controller Instances (LCI and GCI). Their functionality is detailed in the following subsections. To aid the exposition, Table I summarizes the nomenclature and notational conventions used.
A. Overview of the Operation of Dithen
Fig. 1. Dithen architecture. At each monitoring instant t, each workload w (1 ≤ w ≤ W[t]) contains several media types. In addition, n_i[t] spot instances of type i (1 ≤ i ≤ I) are reserved and can be used to process workloads.

Fig. 2. Example workload structure:
./input: pic1.png ... picN.png (input images); vid1.mp4 ... vidM.mp4 (input videos); dat1.gzip ... datK.gzip (input archives with multiple media files)
./footprint: pic5.png, vid3.mp4, dat7.gzip (selected files used for initial CUS prediction)
./main: main_split.sh (script which processes any media element found in the ./input folder and places results in the ./aggregation folder); main_merge.sh (script which processes any results found in the ./aggregation folder in predetermined chunk sizes and places results in the ./output folder); vj (executable file that is called by main_split.sh and processes any media elements found in the ./input folder)
./aggregation: tempRes1.txt ... tempResR.txt (temporary output created by the main_split.sh script)
./output: res1.txt ... resL.txt (final output created by the main_merge.sh script)

In order to illustrate the roles of the different components of our system, we first present an overview of how Dithen processes a workload. A user submits a workload via the FE (Section II-B) and can request a particular TTC value for this workload, or a TTC is allocated by Dithen. As shown in Fig. 2, the workload may comprise multiple input media files (e.g., JPEG images, MP4 video files, etc.), as well as scripts and executable files to process the inputs. An executable file can be a Linux or Windows binary, and the script can be a Linux shell script, Windows batch file or Matlab/Octave script file. Dithen typically handles workloads where each input is processed independently from the other inputs. This is the norm for large-scale multimedia cloud computing, where a user typically wants to carry out a certain task (e.g., face recognition, transcoding, etc.) on a large cache of input images or videos. Nevertheless, as explained in later sections (and as
shown in Fig. 2), it is also possible to: (i) package multiple inputs together in a single input archive (e.g., a gzip file, the contents of which are automatically extracted by Dithen in the utilized CU prior to processing) for concurrent processing of batches of input media; (ii) execute "Split-Merge" tasks akin to MapReduce code, i.e., code that performs parallel processing of inputs (Split step) followed by aggregation of multiple outputs in order to produce the final results (Merge step).

Once the GCI detects that a new workload has been added, it assigns a small percentage of the inputs of the workload (e.g., 5% of the submitted inputs) to LCIs in a "footprinting" stage. The compute instances of the LCIs execute the submitted code on their assigned inputs and provide the corresponding execution times to the ME (Section II-D) and, via that, to the GCI. These measurements and the logs of the execution status (e.g., 0 for normal and -1 for abnormal termination) are used by the GCI to: (i) confirm that the workload processing is carried out without errors or crashes in the submitted code; (ii) derive an initial Kalman-based estimate of the required CUSs to complete this workload (Section II-E-1). This estimate is used to confirm that the requested workload TTC is achievable by Dithen, or else to adjust the confirmed TTC accordingly (Section II-E-2). The GCI continues to derive CUS estimates per workload in order to: (i) assign a service rate per workload, according to which all LCIs can process workload tasks via their corresponding instances (Section III); (ii) determine whether the number of CUs should be scaled up or down (via the proposed AIMD algorithm of Section IV) so that all confirmed TTCs are met without excessive billing from the cloud provider. Finally, the results produced by all CUs are uploaded to the Amazon Simple Storage Service (S3, see Section II-C), where they can be viewed by the user through the Dithen FE.
B. Front End and Workload Processing Modes
The FE of Dithen provides for workload uploading, launching, monitoring of execution, basic text file and image viewing and editing functionalities (e.g., for log file or launch script viewing and editing), and downloading of the results of individual tasks as they are being produced by CUs. Users can utilize the FE to cancel pending workloads if the results are deemed to be unsatisfactory or incorrectly executed (e.g., due to unsupported runtime components, or crashes/errors in the user's executed code). The simplicity of the FE of Dithen is evident from Fig. 3: the buttons allow for a point-and-click interface via a web browser. A basic mobile FE interface is also available.
1. Code & data elements and baseline workload processing mode: In the basic mode of operation, Dithen assumes that the user provides a "main" bash/batch script (i.e., main.sh in Linux or main.bat in Microsoft Windows) for each application, which invokes all the required Matlab, Python, Java/Javascript code, or application binaries. All provided multimedia elements must be available in the folder ./input/ within each application. The results are produced in the locations created and specified within the user's own code (typically in application folders such as ./output/ or ./results/, etc.). The only constraint imposed by Dithen is that results must not be created within the input folder, which should be reserved solely for input files. An example of a typical input/output application structure is given in Fig. 3(a) and an example of the FE for job monitoring is given in Fig. 3(b).

Each new user account created by the FE includes by default (at least) four example image processing applications that illustrate how to use the service: (i) Ex1_face_detection: automated detection of human faces within individual images using a stand-alone C++ implementation of the Viola-Jones algorithm [25]; (ii)
Ex2_template_match: template matching between input images and an existing library of template images using the ImageMagick compare tool [26]; (iii)
Ex3_image_merge: merging of groups of input JPEG files into a single animated GIF file using the ImageMagick convert tool [26]; (iv)
Ex4_Matlab_SIFT: image salient point detection and description using the SIFT algorithm [27] via a Matlab implementation compiled to a stand-alone binary with the Mathworks deploytool.

In all cases, the provided code utilizes each input media file independently and populates the output folder(s) with the results of the executed algorithm(s). Each such execution comprises a media processing task. The entire volume of independently-processed inputs, along with the user's application code, comprises a workload. If a number of media inputs (e.g., images, videos and/or audio files) must be processed together by the provided application code, they must be packaged into a single file container (e.g., TAR, ZIP or RAR); Dithen will automatically extract all such compressed-format containers prior to calling the main script.
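To make these conventions concrete, the following is a minimal sketch of a task script in the baseline mode, assuming main.sh simply runs "python3 main.py"; the ./main/vj binary and the handled file extensions are illustrative placeholders rather than part of Dithen itself.

# Hypothetical main.py invoked by the user's main.sh; only the ./input
# and ./output folder conventions are mandated by Dithen.
import pathlib
import subprocess

INPUT_DIR = pathlib.Path("./input")
OUTPUT_DIR = pathlib.Path("./output")
OUTPUT_DIR.mkdir(exist_ok=True)

for media in sorted(INPUT_DIR.iterdir()):
    if media.suffix.lower() in (".png", ".jpg", ".jpeg"):
        # Each independently-processed input constitutes one media
        # processing task; results must never be written under ./input.
        result = OUTPUT_DIR / (media.stem + ".txt")
        subprocess.run(["./main/vj", str(media), str(result)], check=True)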
2. Advanced workload processing mode: To allow for more complex interactions between inputs and subsequent results, such as split and merge tasks that are similar to MapReduce code [28], Dithen provides an advanced workload processing mode, specifically designed to allow for the execution of Split-Merge tasks, e.g., the parallel execution of visual feature extraction from input JPEG images followed by aggregation of the produced feature points into a single feature matrix of reduced dimensions [29], [30]. This mode is triggered by the user uploading two "main" scripts, called main_split.sh and main_merge.sh (see also Fig. 2). When these scripts are found in the application folder, the first one is executed as in the basic mode of operation described previously but, instead of returning the results to the FE, it uploads them to a specially-designated "aggregation" spot instance in Amazon EC2 that runs the second (i.e., "Merge") script. In this way, the latter script can invoke any aggregation code and the final results are provided to the FE. For example, within image retrieval or face recognition applications [29]–[31], the user can parallelize the computation of a very large number of image covariance matrices or vectors of local features via the "Split" script and then perform a large singular value decomposition (SVD) of the results (once they become available) by having the "Merge" script periodically poll for the full set of results of the Split step and invoke the SVD code on them. (This is illustrated in example Ex3_image_merge.) The data and executable items corresponding to such an example are illustrated in Fig. 2: the Split step is launched simultaneously by multiple instances running the main_split.sh script (each containing subsets of inputs), and each instance calculates one or more of the tempRes1.txt...tempResR.txt results and places them in the /aggregation folder of a specially-designated "Merge" instance. This instance runs the main_merge.sh script periodically in order to poll the /aggregation folder and, once sufficient outputs are detected, produces each of the res1.txt...resL.txt files based on groups of such outputs. The rule of how many (and which) outputs to poll for, as well as the polling frequency (e.g., once per minute), is set by the user within the main_merge.sh script.

Fig. 3. (a) Example of front-end contents for a user workload executing the SIFT feature extraction on a large image dataset. The input image dataset is contained in the ./input folder and the output results are produced in the ./output folder. (b) Example of the workload execution indicating the source and destination folders and the confirmed TTC.
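As a rough sketch of the Merge-side polling in this mode (not the actual Dithen code), main_merge.sh could delegate to a Python helper along the following lines; R_CHUNK and POLL_PERIOD_S are user-chosen placeholders, and aggregate() stands in for the user's own Merge code (e.g., a large SVD over collected features).

# Illustrative polling helper for main_merge.sh.
import glob
import time

R_CHUNK = 100        # number of Split outputs aggregated per res<k>.txt
POLL_PERIOD_S = 60   # poll the ./aggregation folder once per minute

def aggregate(chunk, out_index):
    # Placeholder Merge step: concatenate one group of Split outputs.
    with open(f"./output/res{out_index}.txt", "w") as out:
        for name in chunk:
            with open(name) as part:
                out.write(part.read())

merged, out_index = set(), 1
while True:                      # stopping rule left to the user's code
    pending = [f for f in sorted(glob.glob("./aggregation/tempRes*.txt"))
               if f not in merged]
    while len(pending) >= R_CHUNK:
        chunk, pending = pending[:R_CHUNK], pending[R_CHUNK:]
        aggregate(chunk, out_index)
        merged.update(chunk)
        out_index += 1
    time.sleep(POLL_PERIOD_S)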
C. Cloud Storage and Instance Types
The CS-IT deployment depends on the possibilities available from the IaaS provider. In this paper, we evaluate Dithen with Amazon EC2 spot instances and Amazon S3 storage. We opt for EC2 spot instances as they provide for a wide variety of available configurations and for flexible billing based on hourly reservations. Future evaluations can incorporate other IaaS providers, like GCE, IBM Bluemix and Rackspace. Dithen uses Ubuntu Linux and MS Windows spot instances that have been set up with bash shell support (or batch script file support for MS Windows), Python, Java, Javascript, Matlab (via the Mathworks Matlab Compiler Runtime), and OpenCV and ImageMagick library support. As shown in Fig. 1, depending on the type of scripts and executables submitted by the user, either of these instances can be spawned into any number of spot instances of type i, out of I total instance types. This is performed by bidding for an appropriate spot instance type in Amazon's EC2 launch process (out of more than twenty types available) and specifying the reserved Amazon machine image id to be spawned. We denote the number of CUs per instance type by p_i, 1 ≤ i ≤ I. Moreover, at the t-th time instant, the Dithen architecture contains n_i[t] total instances of type i. Finally, for the j-th instance out of the n_i[t] ones, the remaining time until the next billing increment (e.g., until the time when the IaaS provider will bill for the next hour in the corresponding spot instance) is denoted by a_{i,j}[t] (in seconds). Spot instances are requested using the requestSpotInstance() function, which is part of the EC2 class of the AWS SDK. Instances are terminated using the terminateInstances() function of the same class. Finally, the number of active spot instances is monitored using the describeInstances() function of the same class.
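For reference, a rough Python (boto3) analogue of the three AWS SDK calls named above could look as follows; the region, bid price, AMI id and instance ids are placeholders, and this is a sketch rather than the Dithen implementation.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # region is illustrative

# Bid for single-CU spot instances spawned from a reserved machine image.
ec2.request_spot_instances(
    SpotPrice="0.02",                  # bid in $/hour (placeholder)
    InstanceCount=5,
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",     # reserved Amazon machine image id
        "InstanceType": "m3.medium",   # the single-CU type used in Sec. V
    },
)

# Monitor the currently active spot instances.
active = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]},
             {"Name": "instance-lifecycle", "Values": ["spot"]}])

# Terminate selected instances when scaling down.
ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])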
D. Monitoring Element

In order to observe the CU utilization, Dithen includes a monitoring element that measures processor utilization within each spot instance via the mpstat
Linux command (or wmic cpu in MS Windows). Monitoring and reactive control of the execution of workloads take place at discrete "monitoring" time instants, typically every 1–5 minutes. The ensemble of all the currently-executing workloads within Dithen at the t-th monitoring instant includes M[t] different media types (e.g., images, audio, video files, or container files comprising composite media and data types). The ME keeps track of a number of operational parameters described below (and summarized in Table I).

At every monitoring instant t, and within each workload w (1 ≤ w ≤ W[t]), the ME keeps track of the number of remaining elements to be processed, m_{w,k}[t], as well as the estimated CUSs required to complete the processing of each media type k within the workload, b̂_{w,k}[t]. The values of m_{w,k}[t] are determined using an SQL database that records which tasks have been processed and the getIterator('ListObjects') function of the S3 class of the AWS SDK. The estimates b̂_{w,k}[t] are derived based on the estimation process described previously. Typically, the SLA for each workload includes execution within a predetermined TTC value, d_w[t], which is confirmed after an initial CUS estimate is available for the workload. To this end, the ME continuously keeps track of the required CUSs to complete each workload w, r_w[t], which can be estimated by:

r_w[t] = Σ_{k=1}^{M[t]} m_{w,k}[t] b̂_{w,k}[t].   (1)

Finally, the ME keeps track of the total number of active CUs in Dithen by:

N_tot[t] = Σ_{i=1}^{I} p_i n_i[t],   (2)

as well as the total compute-unit seconds billed (i.e., already paid to the IaaS provider and available to use) within the Dithen architecture at any given instant t:

c_tot[t] = Σ_{i=1}^{I} Σ_{j=1}^{n_i[t]} p_i a_{i,j}[t].   (3)

Effectively, c_tot[t] and N_tot[t] represent a "snapshot" of the compute resources in Dithen at the t-th time instant, as they comprise the available CUSs and CUs under the already-billed EC2 instances.
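A minimal sketch of this bookkeeping, with dictionaries mirroring the notation of Table I (populated in practice from the ME's SQL/S3 records), is:

def required_cus(m, b_hat, w):
    # (1): r_w[t] = sum over k of m_{w,k}[t] * b̂_{w,k}[t]
    return sum(m[w][k] * b_hat[w][k] for k in m[w])

def total_cus(p, n):
    # (2): N_tot[t] = sum over i of p_i * n_i[t]
    return sum(p[i] * n[i] for i in p)

def total_cuss(p, a):
    # (3): c_tot[t] = sum over i, j of p_i * a_{i,j}[t]
    return sum(p[i] * sum(a[i]) for i in p)

# Example: one instance type with one CU per instance (I = 1, p_1 = 1).
p = {1: 1}                        # CUs per instance type
n = {1: 3}                        # three reserved instances
a = {1: [1800, 900, 3500]}        # seconds left before next hourly billing
m = {"w1": {"jpeg": 200}}         # 200 remaining images in workload w1
b_hat = {"w1": {"jpeg": 12.0}}    # estimated CUSs per image

print(required_cus(m, b_hat, "w1"))        # 2400.0 CUSs still required
print(total_cus(p, n), total_cuss(p, a))   # 3 CUs, 6200 CUSs available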
E. Local and Global Controller Instances

The main tasks at every monitoring time instant t are: (i) to ensure that each workload w is executed within its confirmed TTC, d_w[t], and (ii) to match c_tot[t] to Σ_{w=1}^{W[t]} r_w[t]. Both must be met with the minimum billing from the IaaS provider. These two tasks are accomplished by the LCI and GCI components of Fig. 1, respectively. Towards this end, the most crucial aspects are: (i) defining reliable CUS estimates, b̂_{w,k}[t], for each media type k within each workload w; (ii) confirming the feasibility of each workload's TTC value and selecting the appropriate service rate (i.e., selecting how many CUSs should be allocated to each workload's tasks); and (iii) devising and executing an algorithm to initialize or terminate CUs according to the demand volume. The first two items allow for microscale (i.e., local) control of Dithen, and they are discussed in parts 3 and 4 of this subsection, as well as in Sections III and IV. The last item allows for macroscale (i.e., global) control of the workload execution within Dithen; solutions for these aspects are analyzed in the first two parts of this subsection.
1. Task allocation and tracker operation via the GCI: To achieve each workload's TTC, the GCI divides the workload into chunks and sends these chunks to be processed by the spot instances of the LCIs. Specifically, when a workload is submitted, the GCI examines the ./input folder to determine how many individually-executable tasks are present in the workload. Once this has been determined, the GCI executes a small number of the tasks in a "footprinting" stage. The goal of the "footprinting" stage is to determine: (i) an initial workload CUS estimate per input type; (ii) what chunk size to use (i.e., how many inputs to group together for execution by a single spot instance) such that the chunk processing time is comparable to the time interval between monitoring instants (described in Section II-D). Importantly, while the initial CUS estimate forms the basis for resource estimation in Dithen, it is often inaccurate, because it is difficult to select a representative subset of the tasks when the execution time of these tasks is data-dependent. For example, in many face detection [25] or transcoding workloads [32], the estimate that uses only "footprinting" data can be 50% higher than the final measured value because of the data dependency of these tasks. Another reason for such inaccuracy is that, when considering small subsets of tasks in some workloads (like Matlab-based applications), the CUS estimation is significantly offset by the disproportionate amount of time needed to set up the execution environment (a.k.a. "deadband" time) in comparison to the actual code execution. Long deadband times in tasks mandate the grouping of several tasks into large chunks. Once the chunk size has been determined, the GCI connects to the LCIs via the XML-RPC protocol and instructs each LCI to execute the tasks in its chunk. The LCI writes entries to a MySQL database detailing the status of each task as it processes them, as well as execution-time measurements once each task is completed. These are used in the Kalman estimation process of Subsection II-E-3. The GCI uses this database to determine which task should be placed in a chunk for an LCI, in a manner analogous to a BitTorrent tracker [33]: the controller connects to the database to determine which tasks appear as "pending", "processing" and "completed" and, based on the workload service rates (Section III), carries out the chunk allocation to available LCIs. This decoupling between the database writing by the LCIs and the database reading by the GCI prevents bottlenecks and minimizes the network traffic between the GCI and the LCIs. A workload is marked as "completed" once the GCI detects that all tasks in the workload have been completed.
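A sketch of this tracker-style interaction is given below; the schema and column names are illustrative, and sqlite3 is used only to keep the sketch self-contained (Dithen itself uses MySQL).

import sqlite3

def next_chunk(db, workload, chunk_size):
    # The GCI only reads the statuses written by the LCIs ("pending",
    # "processing", "completed"), mirroring the BitTorrent-tracker style
    # decoupling between the database writers and the single reader.
    rows = db.execute(
        "SELECT task_id FROM tasks "
        "WHERE workload = ? AND status = 'pending' LIMIT ?",
        (workload, chunk_size)).fetchall()
    return [task_id for (task_id,) in rows]

def workload_completed(db, workload):
    (remaining,) = db.execute(
        "SELECT COUNT(*) FROM tasks "
        "WHERE workload = ? AND status != 'completed'",
        (workload,)).fetchone()
    return remaining == 0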
2. Spot instance initiation and termination via the GCI: A direct way to implement the scaling of the required instances is for the global controller instance to constantly match the total CUs billed in Dithen [c_tot[t] of (3)] to the total CUs required by all workloads (Σ_{w=1}^{W[t]} r_w[t]) at each time instant t by initializing or terminating spot instances (a.k.a. "reactive" control [34]). However, such an approach is not optimal for the following reasons: (i) Σ_{w=1}^{W[t]} r_w[t] depends on the estimated CUSs required to complete the processing of each media type k within each workload w; these estimates will not be accurate for all time instants and media types, and this will lead to unnecessary expenditure to initiate and pay for instances that may never be used due to estimation mismatch; (ii) due to the CU billing for large time intervals (e.g., Amazon EC2 spot instances are billed per hour and GCE instances are billed in 10-minute slots), as well as the associated delay in the initialization or termination of instances (in the order of minutes), rapid fluctuations in Σ_{w=1}^{W[t]} r_w[t] (e.g., due to new workloads or workload cancellations by users) will cause bursts of initiation or termination requests and substantially increased "dead" time, which will be billed by the IaaS provider; (iii) without a control mechanism in place to absorb rapid fluctuations in demand, a flurry of spot instance requests may inadvertently cause unwanted spikes in spot instance pricing [1]. In the next two sections, we present our GCI proposal for best-effort TTC-abiding execution that ensures proportional fairness amongst all submitted workloads in Dithen.
3. Reliable CUS estimates for media types via Kalman-filter realization: Due to the aforementioned inaccuracy of the CUS estimation based on the "footprinting" process, we propose the use of an adaptive CUS estimator that runs continuously during the execution of each workload. In our proposal, each LCI measures the average CUSs, b̃_{w,k}, required for each media type k of each workload w running on its instance types, by measuring the time to complete tasks between the previous and the current monitoring instant (t−1 and t) and refining the measurement. We model this measurement operation mathematically by:

∀w, k, t:  b̃_{w,k}[t] = b̂_{w,k}[t] + v_{w,k}[t],   (4)

where v_{w,k}[t] is the measurement noise that deviates b̃_{w,k}[t] from the ideal CUS estimate b̂_{w,k}[t] at time instant t. We assume that v_{w,k}[t] can be modeled by independent, identically distributed (i.i.d.), zero-mean Gaussian random variables, i.e., ∀w, k: V_{w,k} ~ N(0, σ_v²).

We express the LCI estimation of the required CUSs for each workload and task type at time t by:

∀w, k, t:  b̂_{w,k}[t] = b̂_{w,k}[t−1] + z_{w,k}[t],   (5)

with z_{w,k}[t] the process noise [34], expressing variability in the execution time of each task type in each workload across time. We assume that z_{w,k}[t] can be modelled by i.i.d., zero-mean Gaussian random variables, i.e., ∀w, k: Z_{w,k} ~ N(0, σ_z²). Given (4) and (5) and the fact that all noise terms are i.i.d., the noise variances are E{V²_{w,k}} = σ_v² and E{Z²_{w,k}} = σ_z², and the noise covariance is E{V_{w,k} Z_{w,k}} = 0.

For the measurement and estimation model of (4) and (5), the optimal estimator for b̂_{w,k}[t] is known to be the Kalman filter [34], which provides the following two time-update equations for our case (∀w, k, t):

π⁻_{w,k}[t] = π_{w,k}[t−1] + σ_z²,   (6)

κ_{w,k}[t] = π⁻_{w,k}[t] / (π⁻_{w,k}[t] + σ_v²),   (7)

where π⁻ represents the initial update of the process covariance noise π, and κ_{w,k}[t] is the Kalman gain of the k-th task type of the w-th workload at time instant t. Based on (6) and (7), the estimation of b̂_{w,k}[t] and the noise covariance update can be written as (∀w, k, t):

b̂_{w,k}[t] = b̂_{w,k}[t−1] + κ_{w,k}[t] (b̃_{w,k}[t−1] − b̂_{w,k}[t−1]),   (8)

π_{w,k}[t] = (1 − κ_{w,k}[t]) π⁻_{w,k}[t].   (9)

Initialization of the proposed CUS estimator per workload and task type: For t = 0 and ∀w, k, the GCI initializes each Kalman-filter estimator with b̃_{w,k}[0], established via the initial "footprinting" measurement per workload and input type, and sets: b̂_{w,k}[0] = π[0] = 0, and σ_z = σ_v = 0.

GCI-based CUS estimation steps for each monitoring time instant t, t ≥ 1 and ∀w, k: (i) retrieve (via the ME) the CUS measurements per workload and task type to establish b̃_{w,k}[t−1]; (ii) perform the estimation of (6)–(9); (iii) retain the value of the estimated CUS per workload via (8) and (1).
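For concreteness, a minimal sketch of the per-(workload, media type) estimator of (6)–(9) follows; sigma_z2 and sigma_v2 denote the variances σ_z² and σ_v² fixed at initialization.

class CusKalmanEstimator:
    # One estimator per (workload w, media type k). Per the text, the
    # first measurement fed to update() is the footprinting value
    # b̃_{w,k}[0] supplied by the GCI.
    def __init__(self, sigma_z2, sigma_v2):
        self.b_hat = 0.0        # b̂_{w,k}[0] = 0
        self.pi = 0.0           # π_{w,k}[0] = 0
        self.sigma_z2 = sigma_z2
        self.sigma_v2 = sigma_v2

    def update(self, b_meas):
        pi_prior = self.pi + self.sigma_z2             # (6)
        kappa = pi_prior / (pi_prior + self.sigma_v2)  # (7)
        self.b_hat += kappa * (b_meas - self.b_hat)    # (8)
        self.pi = (1.0 - kappa) * pi_prior             # (9)
        return self.b_hat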
4. TTC confirmation and service rate per workload: Let us assume that a reliable CUS estimation becomes available for workload w, 1 ≤ w ≤ W[t_init], at monitoring time instant t_init. (The practical method to determine t_init is described in Section V.) The GCI can then confirm that d_w[t_init] (the requested TTC for workload w at t_init) is achievable by Dithen under appropriate adjustment of the workload service rate, s_w[t], for each monitoring time t, t ≥ t_init. The service rate s_w[t] corresponds to the number of CUs allocated to workload w for the time interval between monitoring instants t and t+1. Fractional values of s_w[t] indicate that one CU is allocated to workload w for the corresponding fraction of the time between t and t+1. If the combination of d_w[t_init] with the workload CUS estimate leads to s_w[t_init] > N_{w,max}, with N_{w,max} a predetermined CU upper limit (∀w: N_{w,max} = 10 in our experiments), d_w[t_init] is extended such that s_w[t_init] = N_{w,max}. This process confirms d_w[t_init] (or its extension) as the TTC for workload w.

The algorithm to determine s_w[t] for each workload w and each t ≥ t_init is presented in Section III and is carried out by the GCI based on the estimated CUS per workload. All LCIs of Dithen are given individual tasks from each workload w according to s_w[t] by the GCI.
III. WORKLOAD EXECUTION WITH CONFIRMED TTC

The GCI of Dithen ensures that each workload is executed within its remaining TTC by an allocation mechanism based on proportional fairness. The proportional fairness goal can be stated as follows: at each monitoring instant t and for each workload w (1 ≤ w ≤ W[t]), the GCI maximizes an objective function of the service rate, s_w[t], that ensures all workloads are served proportionally to their CUS requirement, r_w[t] [given by (1)], and inversely-proportionally to their TTC, d_w[t]. The latter is defined via an appropriate SLA mechanism once a workload is submitted for execution and an initial workload CUS estimate becomes available. In this work, we adopt the objective function:

f(s_w[t]) = r_w[t] ln(s_w[t]) − d_w[t] s_w[t].   (10)

The subtraction in (10) contrasts the workload's CUS requirement, r_w[t], with the TTC requirement, d_w[t]. In addition, following proportional fairness problems of other resource allocation work (notably in cellular network scheduling algorithms [35]), we opted for the use of the natural logarithm in the demand side of the objective function and pursue the maximization of f(s_w[t]). Specifically, when the condition Σ_{w=1}^{W[t]} r_w[t] ≤ c_tot[t] is satisfied, it is straightforward to show that the optimal solution to the maximization of (10) is (∀ s_w[t] > 0):

s*_w[t] = argmax{f(s_w[t])} = r_w[t] / d_w[t].   (11)

This corresponds to the case where enough CUs are available to accommodate the demand and, therefore, the allocation of service rates is carried out according to the required CUSs and TTC per workload at each monitoring time instant t. We can then calculate the total required CUs for optimal operation as:

N*_tot[t] = Σ_{w=1}^{W[t]} s*_w[t] = Σ_{w=1}^{W[t]} r_w[t] / d_w[t].   (12)

However, due to volatility in both workload submission and CU availability in Dithen, it is likely that, for most monitoring instants t, N*_tot[t] differs from N_tot[t] [the actual number of CUs, calculated by (2)]. In such cases, we can adjust the optimal service rates of (11) proportionally to the relative distance between N*_tot[t] and N_tot[t]. Specifically, if N*_tot[t] > N_tot[t] + α, with α the AIMD additive constant defined in the next section (α > 0), we downscale the optimal service rate of each workload to:

∀w:  s⁻_w[t] = (r_w[t] / d_w[t]) (1 − (N*_tot[t] − N_tot[t] − α) / N*_tot[t]) = ((N_tot[t] + α) / N*_tot[t]) s*_w[t].   (13)

If N*_tot[t] < βN_tot[t], with β the AIMD scaling constant defined in the next section (0 < β < 1), we upscale the optimal service rate of each workload to:

∀w:  s⁺_w[t] = (r_w[t] / d_w[t]) (1 + (βN_tot[t] − N*_tot[t]) / N*_tot[t]) = (βN_tot[t] / N*_tot[t]) s*_w[t].   (14)

Finally, if βN_tot[t] ≤ N*_tot[t] ≤ N_tot[t] + α, the service rates of (11) are used. The use of α and β in (13) and (14) ensures that the service rate adjustment considers the possible additive increase or multiplicative decrease that may occur via the AIMD algorithm after the service rate allocation is established for the interval between t and t+1.
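A compact sketch of this service-rate computation, following (11)–(14), is:

def service_rates(r, d, n_tot, alpha, beta):
    # r[w]: required CUSs r_w[t]; d[w]: remaining TTC d_w[t];
    # n_tot: currently billed CUs N_tot[t]; alpha, beta: AIMD constants.
    if not r:
        return {}, 0.0
    s_opt = {w: r[w] / d[w] for w in r}      # (11): s*_w = r_w / d_w
    n_opt = sum(s_opt.values())              # (12): N*_tot
    if n_opt > n_tot + alpha:                # demand exceeds capacity
        scale = (n_tot + alpha) / n_opt      # (13): downscale all rates
    elif n_opt < beta * n_tot:               # capacity exceeds demand
        scale = beta * n_tot / n_opt         # (14): upscale all rates
    else:
        scale = 1.0                          # optimal rates are feasible
    return {w: scale * s for w, s in s_opt.items()}, n_opt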
IV. SCALING WITH ADDITIVE INCREASE MULTIPLICATIVE DECREASE

For any CaaS system, N*_tot[t] of (12) and N_tot[t] of (2) must be tightly coupled in order to ensure that the available compute-unit time can meet the service demand and TTC requirements at any instant. This is because, if N*_tot[t] is substantially higher than N_tot[t], the delay to complete pending workloads can increase significantly and workload TTCs may be violated. Conversely, when N*_tot[t] is significantly smaller than N_tot[t], several CUs may be billed on the service unnecessarily. Therefore, and in conjunction with the fact that billing comes in hourly increments for Amazon EC2 spot instances, sudden surges or dips in demand will have a detrimental effect on the delay or cost of the deployment of Dithen. Hence, the goal of the GCI component of Dithen is to maintain the resource reservation and workload service rates at the correct level. To this end, we propose the AIMD algorithm of Fig. 4. By controlling the additive and scaling constants, α and β respectively, we can examine the behavior of Dithen under a wide variety of workload submissions. It should be noted that the corresponding problem of selecting which spot instances to terminate in the event that N_tot[t] > N*_tot[t] is trivial: per instance type, the prudent action is always to terminate the spot instances with the smallest remaining time before renewal. We refer to the work of Shorten et al. [24] for details on the stability and convergence properties of AIMD algorithms. A key aspect of their analysis is that fast convergence to an equilibrium state is achieved if β is small, and smoother transitions are expected if β is close to unity [24]. After extensive experimentation, we opted for the values of β = 0. and α = 5, which exhibit sufficiently-fast convergence while at the same time ensuring that CUs are not released prematurely.

1: at each monitoring instant t:
2:   if N_tot[t] ≤ N*_tot[t]
3:     incr = TRUE
4:   else
5:     incr = FALSE
6:   % set N_tot for the next instant
7:   if incr == TRUE
8:     N_tot[t+1] = min{N_tot[t] + α, N_max}   % add more CUs
9:   else
10:    N_tot[t+1] = max{βN_tot[t], N_min}   % remove CUs

Fig. 4. Proposed AIMD algorithm; α is a positive constant, β is a constant such that 0 < β ≤ 1, and N_max and N_min are the upper and lower bounds for N_tot[t].

While the AIMD algorithm tunes the total CU value, N_tot[t], it does not select which instance types to deploy out of the I possible. As detailed in Appendix A, the recent status of Amazon spot instance pricing provides for a proportional increase of pricing according to the number of compute units per instance. Moreover, the single-CU instance type exhibits the minimum price volatility, thereby making it the safest instance type to use. Therefore, we opt to use only single-CU instances in our experiments, i.e., I = 1 and p_1 = 1, which alleviates the problem of selecting amongst a variety of instance types. However, depending on the evolution of pricing data from the IaaS provider, future work will expand our results to a variety of instance types.
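The following sketch combines one step of the AIMD rule of Fig. 4 with the termination rule discussed above (releasing, per instance type, the spot instances with the least remaining prepaid time):

def aimd_step(n_tot, n_opt, alpha, beta, n_min, n_max):
    # One iteration of Fig. 4 at monitoring instant t.
    if n_tot <= n_opt:
        return min(n_tot + alpha, n_max)   # additive increase
    return max(beta * n_tot, n_min)        # multiplicative decrease

def pick_victims(remaining_time, count):
    # remaining_time: {instance_id: a_{i,j}[t], seconds until the next
    # billing increment}; terminate those closest to renewal first.
    return sorted(remaining_time, key=remaining_time.get)[:count]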
V. EXPERIMENTS

In order to examine our proposals, we have deployed Dithen using single-CU m3.medium spot instances of Amazon EC2 (see Appendix A for more details). As discussed in Section II, each instance has a corresponding LCI that is given new tasks to process once the GCI detects that it is idle. In addition, one reserved EC2 instance, serving as the GCI, calculates the Kalman-filter estimates based on the CUS measurements per task. Under a predetermined TTC per workload (which is confirmed by Dithen after an initial CUS estimation becomes available for the workload), it then derives the service rate per workload in fixed time periods (i.e., within 1–5 minute intervals), as described in Section III. This is communicated to the ME and the LCIs (see Fig. 1). The GCI also carries out the AIMD algorithm of Section IV in order to control the increase or decrease of spot instances according to the demand. The utilized AIMD parameters for all experiments were set to: α = 5, β = 0., N_min = 10, N_max = 100 and ∀w: N_{w,max} = 10 (maximum service rate per workload). An SQL database is used by the ME to keep track of the tasks completed per workload. Finally, the produced results, as well as a summary of the intermediate progress, are communicated to the user by the web interface of the FE (Fig. 3).
Fig. 5. Size of inputs for each of the thirty workloads used in our experiments.
A. Utilized Workloads
All multimedia inputs, processing scripts and executable files are placed on Amazon S3 via the uploading service ("add" button) available within the FE of Dithen [Fig. 3(a)]. Thirty different workloads, each with a random number of tasks, were used in our experiments. Eight of the workloads were scripts running the Viola-Jones classifier [25] for face detection in images. The range of possible values for the number of inputs (i.e., images or videos) for these workloads was between 1 and 1000. Eight of the workloads were scripts using FFMPEG to transcode videos to different bitrates via a variety of codecs [32]. Each workload had between 1 and 20 videos to transcode, and we also added two large transcoding workloads with 200 and 300 videos. These were used to examine the responsiveness of the Dithen system under sudden spikes of demand. Seven of the workloads were using the OpenCV BRISK keypoint detector and descriptor extractor [36]. Finally, seven workloads used the Scale Invariant Feature Transform (SIFT) salient point descriptor [27], which was deployed as compiled Matlab code with the Mathworks deploytool. The total size of the inputs per workload is given in Fig. 5. Workloads were introduced once every five minutes in the order depicted in Fig. 5.
B. Performance of Kalman-based CUS Estimation
The proposed Kalman-based CUS estimation process of Section III is compared against the "ad-hoc" estimator that carries out the CUS estimation of (8), albeit with the scaling coefficient set to the fixed value κ_{w,k}[t] = 0., which was shown to perform best amongst other settings. Moreover, as an external comparison, we also utilize the well-known second-order autoregressive moving average (ARMA) estimator of Roy et al. [37], which has been shown to perform well for workload forecasting. ARMA estimates the CUSs required to complete a workload at time t+1 via:

b̂_{w,k}[t+1] = δ b_{norm,w,k}[t] + γ b_{norm,w,k}[t−1] + (1 − δ − γ) b_{norm,w,k}[t−2],   (15)

where b_{norm,w,k}[t], b_{norm,w,k}[t−1] and b_{norm,w,k}[t−2] are calculated by summing the total execution time of media type k of workload w at times t, t−1 and t−2, and dividing it by the percentage of the workload that has been completed until then; and δ and γ are scalars having the values recommended by Roy et al. [37]. We chose ARMA as the most suitable benchmark because other workload forecasting methods (like the ARIMA model [38], [39]) require extensive past measurements from previous executions of other workloads, as well as a long sequence of measurements, in order to produce reliable estimates, thereby making them unsuitable for our case.

Two representative examples of the convergence behaviors of all methods under comparison are given in Fig. 6 and Fig. 7. As illustrated in the figures, the Kalman and ad-hoc estimators exhibit an underdamped behavior until convergence. We can therefore use the slope of the CUS estimation across time to determine the monitoring time instant t_init at which the proposed Kalman and the ad-hoc estimator can provide a reliable CUS estimation per workload and task type. Specifically, when the slope of the CUS estimation becomes negative for the first time, each estimator establishes a CUS estimate for each workload with acceptable accuracy. However, ARMA does not exhibit such underdamped behavior, since it is a moving-average based estimator. Therefore, we relied on a conventional convergence detection criterion for ARMA: when the ARMA estimate deviation within the window of the last three measurements is found not to exceed 20% from the mean value derived from the values of the window (ten measurements are used for the case of 1-min monitoring), we determine that the estimate is reliable enough to be used. The setup for the window size and variability threshold was selected after testing with a variety of possible values. In the examples of Fig. 6 and Fig. 7, the time instant when each method reaches its reliable estimate under the described setup is marked with the red dotted vertical lines.
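A sketch of the ARMA predictor of (15) is given below; the default values of delta and gamma shown are placeholders standing in for those recommended by Roy et al. [37].

def arma_estimate(b_norm_t, b_norm_t1, b_norm_t2, delta=0.8, gamma=0.15):
    # b_norm_*: total execution time of media type k of workload w at
    # times t, t-1, t-2, each divided by the completed workload fraction.
    return (delta * b_norm_t + gamma * b_norm_t1
            + (1.0 - delta - gamma) * b_norm_t2)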
Table II presents the average time each estimator took to reach its CUS estimate for each workload type, as well as the percentile mean absolute error (MAE) of the CUS estimate. The summary over all workloads (per monitoring interval) is given at the bottom of the table. Evidently, the proposed Kalman-based approach reduces the average time to reach a reliable estimate by more than 20% in comparison to the other estimators and is found to be the quickest estimator in all but one case. At the same time, the proposed estimator attains comparable accuracy to the ad-hoc estimator and is found to be significantly superior to ARMA. This is especially pronounced in the case of 1-minute monitoring, where the use of the proposed Kalman-based approach instead of an ARMA approach provides for a 38% reduction in estimation time and decreases the average estimation error from 16.4% to 4.5%. This indicates that, under the usage of the proposed CUS estimator and 1-minute monitoring, the GCI is expected to have reliable estimates per workload (and thereby confirm that its requested TTC is achievable) within 6–11 minutes from its launch. Finally, when we compare the performance of one-minute monitoring to five-minute monitoring, Table II shows that an increase in the measurement granularity results in significant improvement in both the accuracy and the time required to reach a reliable estimate. Specifically, for the proposed Kalman estimator, the increased monitoring frequency reduces the average estimation time
by 44% and reduces the overall MAE from 13.1% to 4.5%.

Fig. 6. Example of the convergence of various CUS estimation methods for the case of an FFMPEG workload under a 1-min monitoring interval.

Fig. 7. Example of the convergence of various CUS estimation methods for the case of a SIFT descriptor (Matlab-based) workload under a 1-min monitoring interval.
C. Results for Cumulative Cost of Workload Execution
We now investigate the management of spot instances so that each workload is completed under a fixed TTC that is sufficiently large to allow for fluctuation in the number of utilized instances.

As external comparisons, our first choice is Amazon's Autoscale service (termed "Amazon AS"), which is widely deployed in practice [40]. Amazon AS does not carry out CUS estimation or TTC-abiding execution, and one can only control the number of instances based on CPU utilization and bandwidth constraints. Therefore, under these conditions, we configured all workloads to execute within an Amazon AS group that examines the average CPU usage at all utilized CUs in five-minute intervals. If the group detected that the average CPU utilization was more than 20%, new instances were started; otherwise, Amazon AS terminated some of the active instances. (After extensive experimentation, the value of 20% was found to provide the best results with Amazon AS. This is because average utilization values between 18% and 22% represent the average CPU usage observed within active time intervals, when an instance alternates between downloading files (2%–10% CPU utilization) and actually executing a compute-intensive task (close to 100% CPU utilization).) We then executed all workloads in Amazon AS and measured the longest time to complete a workload under two scaling policies. The first represented a conservative approach where reducing the execution time is not of critical importance. In this case, a single instance is added or removed when a monitoring interval occurs. The longest completion time was found to be 2 hr 7 min. The second scaling policy started and stopped ten instances instead of one, to represent a scenario where reduced execution time is of importance. In this case, the longest time to complete a workload was found to be 1 hr 37 min. Both of these times were then used as the two fixed TTC settings for all workloads in Dithen.

Beyond Amazon AS, in order to benchmark our AIMD-based scaling of Fig. 4 against other alternatives for CU adjustment, we utilized the mean-weighted-average and linear-regression methods of Gandhi, Krioukov et al. [17], [41] (termed "MWA" and "LR", respectively) to set the number of CUs for the next monitoring interval, N_tot[t+1]. We selected MWA and LR for our comparisons because previous work [17] has shown them to be amongst the most accurate predictive resource controllers. Both MWA and LR utilized the proposed Kalman-based CUS estimation process and the service rate allocation of (12) to determine when to increase or decrease CUs. Specifically: (i) MWA sets the number of CUs via

N_tot[t+1] = (1/6) Σ_{i=t−5}^{t} N*_tot[i],   (16)

where N*_tot is the optimal number of CUs derived via (12) for each monitoring time instant; (ii) LR sets N_tot[t+1] to be the result of extrapolating the line derived via linear regression from {N*_tot[t], ..., N*_tot[t−5]} (the current plus five previous CU settings). Finally, in order to assess the performance of the direct-compensation approach, we also utilized the case where no filtering or other adjustment is used and we simply set N_tot[t+1] = N*_tot[t] (termed "Reactive").
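For reference, minimal sketches of the two predictive baselines, MWA of (16) and LR over the current plus five previous optimal CU settings, are:

import statistics

def mwa(n_opt_history):
    # (16): average of N*_tot over the last six monitoring instants.
    window = n_opt_history[-6:]
    return sum(window) / len(window)

def lr(n_opt_history):
    # Fit a least-squares line through {N*_tot[t-5], ..., N*_tot[t]} and
    # extrapolate one step ahead to obtain N_tot[t+1].
    window = n_opt_history[-6:]
    if len(window) < 2:
        return window[-1]
    xs = range(len(window))
    x_bar, y_bar = statistics.mean(xs), statistics.mean(window)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, window))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar + slope * (len(window) - x_bar)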
Figure 8 and Figure 9 show the cumulative cost of each approach during the course of both experiments with the two TTC values. Evidently, the cost of Amazon AS is significantly higher than that of all other approaches. This is primarily because Amazon AS is the only approach that does not use CUS estimations and instead bases its decisions solely on CPU utilization. Therefore, it continues to scale up the number of instances even when it is nearing completion of the workloads' processing, and only scales down after workloads have been completed and CPU utilization decreases due to inactivity. Amongst MWA, LR and Reactive, MWA is superior as it incurs less cost for the majority of the experiment (and, as expected, Reactive is the worst). However, all three methods end up incurring very comparable cost for the completion of all workloads. Interestingly, Reactive turns out to be (marginally) the cheapest of the three for this experiment, even though it
uses the largest number of instances of the three methods at one point. The reason for this is that, while Reactive scales up very quickly, it also scales down rapidly and, for this particular experiment, this behaviour worked in its favour. However, this is not expected to always be the case, as Reactive does leave many instances idle for a large portion of their billed time.

TABLE II
AVERAGE TIME TO REACH CUS ESTIMATION PER TYPE OF WORKLOAD AND PERCENTILE MEAN ABSOLUTE ERROR (MAE) OF THE DERIVED ESTIMATE. THE LAST COLUMN PRESENTS THE PERCENTILE TIME REDUCTION WHEN SWITCHING FROM 5-MIN MONITORING TO 1-MIN MONITORING INTERVALS. THE BEST RESULT PER CATEGORY IS INDICATED IN BOLDFACE FONT.

Workload / Estimator | 5-min Time | 5-min MAE (%) | 1-min Time | 1-min MAE (%) | Time Reduction (%)
Face Detection: Kalman-based | 13m 45s | | 10m 38s | 4.6 |
Face Detection: Ad-hoc | 17m 53s | 5.3 | | | 36.4
Face Detection: ARMA | 23m 08s | 22.1 | 12m 08s | 27.8 | 47.6
Transcoding: Kalman-based | 16m 53s | | 07m 54s | |
Transcoding: Ad-hoc | | | 10m 36s | |
Feat. Extraction: Kalman-based | 13m 34s | | | |
SIFT: Kalman-based | 21m 26s | | 06m 18s | |
SIFT: Ad-hoc | | | 15m 00s | 7.6 | 25.0
Overall Average: Kalman-based | 16m 25s | 13.1 | 09m 11s | 4.5 | 44
Overall Average: Ad-hoc | | | 14m 15s | |
Overall Average: ARMA | | | | 16.4 |

TABLE III
OVERALL COST OF DIFFERENT METHODS AND COMPARISON AGAINST THE PROPOSED METHOD AND THE LOWER BOUND (LB).

System | AIMD (proposed) | Reactive | MWA | LR | AS | LB
Overall cost ($) | 0.41 | 0.51 | 0.52 | 0.53 | 1.02 | 0.22
Average cost reduction of proposed vs. other methods (%) | – | 20 | 21 | 23 | 60 | –
Average cost increase vs. LB (%) | 86 | 132 | 136 | 141 | 364 | –
Max. instances used | | | | | |

The proposed AIMD-based scaling initially scales up when it detects the large workloads, then maintains this level, and then begins to scale down as it nears the completion of the experiment. For the experiments of Figure 8, this leads to overall savings of 30% against MWA, 29% against LR, 27% against Reactive and 38% against Amazon AS. For the experiments of Figure 9, the equivalent savings were 14%, 15%, 12% and 69%, respectively. Overall, beyond the advantage of providing for scaled-up execution under TTC constraints, the 38%–69% savings demonstrated in Figure 8 and Figure 9 allow for a significant profit margin for cloud service providers that would deploy large-scale multimedia applications via the techniques used in Dithen, versus utilizing Amazon AS directly. (It should be noted that the controller does incur some overhead cost. If we were to subtract this cost from Amazon AS (Reactive, MWA and LR also require a controller and thus have the same overhead as AIMD), it would not improve its performance by more than 5%, with this percentage diminishing as the workload size increases. Finally, it is also important to note that the controller instance does not have to run in AWS; instead, it could operate under a captive computing environment, thereby incurring no billing cost from the cloud provider.)

The overall savings for both experiments, as well as the maximum number of instances used by the proposed algorithm against all other benchmarks, are summarized in Table III. It should be emphasized that, beyond the cost savings, all the workloads in the proposed AIMD approach finished before
It should be noted that the controller does incur some overhead cost. However, even if this cost were subtracted from that of Amazon AS (Reactive, MWA and LR also require a controller and thus carry the same overhead as AIMD), the performance of Amazon AS would improve by no more than 5%, with this percentage diminishing as the workload size increases. It is also important to note that the controller instance does not have to run in AWS; instead, it could operate in a captive computing environment, thereby incurring no billing cost from the cloud provider.

Fig. 9. Cumulative cost of processing all workloads of Fig. 5 under a fixed TTC of 1 hr 37 min per workload. LB indicates the lower bound.

Finally, the bottom right of Figure 8 and Figure 9 includes a red horizontal line indicating the estimated billing if all workloads were processed such that all billed instances were occupied 100% of the time. This constitutes the lower bound for the billing cost (termed "LB"), as no operational approach can achieve a lower cost. Evidently, the proposed approach incurs 68%–91% higher cost than LB, while all other approaches incur 135%–510% higher cost than LB. It should be noted that both the LB and all the examined approaches include the delay of transporting data to and from the instances; if this were removed, all costs would be lowered by approximately 27%. Overall, the results of Figure 8 and Figure 9 demonstrate that the proposed AIMD-based scaling of CUs is a simple and effective method for approaching the lowest possible cost incurred from the cloud computing infrastructure, while at the same time satisfying the TTC constraint of each workload.
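The LB arithmetic can be made concrete with a short sketch, assuming hourly billing at a single instance price; the task durations and the $0.0081/hr rate (the m3.medium spot price listed in Table V of the Appendix) are purely illustrative.

    import math

    def lower_bound_cost(task_durations_s, price_per_hour):
        """Billing lower bound: every billed instance-hour is 100% occupied.

        task_durations_s: per-task compute plus data-transport times (s).
        Since the total busy time is assumed to be packed perfectly into
        billed hours, no operational policy can be billed less than this.
        """
        busy_hours = sum(task_durations_s) / 3600.0
        return math.ceil(busy_hours) * price_per_hour  # whole billed hours

    # Hypothetical example: 2,000 tasks of 90 s each at $0.0081/hr
    # -> 50 fully occupied instance-hours -> $0.405.
    print(lower_bound_cost([90] * 2000, 0.0081))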
D. Comparison Against Amazon Lambda
Recently, Amazon began offering its own CaaS service for the execution of Javascript code via its Lambda service. Although this service is more limiting, due to the inefficiency of Javascript code, we compared the cost of running three large Javascript-based workloads on Dithen and Lambda. In this experiment we ran the "blur", "rotate" and "convolve" operations from the Javascript version of the widely-used ImageMagick image manipulation program [26]. We chose these functions as they represent a cross-section of the computational requirements of the various ImageMagick functions. Each function was executed on 25,000 images encompassing a wide variety of sizes and pixel counts. We also opted for the 1024MB-memory configuration for all Lambda functions, in order to avoid any memory bottlenecks during execution. Again, Dithen was tuned to match the execution time of each workload in Lambda. This was done because the latter depends on how quickly requests can be sent to call the functions through the Amazon Web Service Command Line Interface (or any other such API), while the execution time for workloads in Dithen is completely tunable based on their specified TTC. This flexibility of TTC-abiding execution per workload is an advantage of our proposal over Lambda.
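To see why task duration drives this comparison, consider the sketch below; the rate of $0.00001667 per GB-second reflects Lambda's published compute pricing at the time of these experiments, while the durations, the 2x Lambda slowdown and the spot price are illustrative assumptions.

    LAMBDA_RATE_PER_GB_S = 1.667e-5   # Lambda compute charge ($ per GB-second)

    def lambda_cost(duration_s, memory_gb):
        """Approximate Lambda compute charge for one invocation
        (per-request fees and the free tier are ignored)."""
        return duration_s * memory_gb * LAMBDA_RATE_PER_GB_S

    def dedicated_core_cost(duration_s, price_per_hour=0.0081):
        """Cost of the same task on a dedicated spot-instance core
        (m3.medium spot price of Table V), ignoring hourly rounding."""
        return (duration_s / 3600.0) * price_per_hour

    # A compute-heavy task taking 20 s on a full core but 40 s on Lambda
    # (which grants only a fraction of a core at 1 GB of memory):
    print(lambda_cost(40, 1.0))       # ~ $0.00067
    print(dedicated_core_cost(20))    # ~ $0.000045

The longer Lambda stretches a compute-bound task, the wider this gap becomes, which matches the per-function ratios reported in Table IV.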
TABLE IV
AVERAGE COST OF IMAGEMAGICK FUNCTIONS PER IMAGE OF THE DATASET FOR DITHEN AND AMAZON'S LAMBDA.

Function    Lambda Cost ($)   Dithen Cost ($)   Ratio
Blur        n/a               n/a               3.34
Rotate      n/a               n/a               n/a
Convolve    n/a               n/a               n/a

A comparison of the cost of executing the workloads is given in Table IV. It is interesting to notice that, as the run time of a function decreases, Lambda becomes a more viable option. For example, the average cost of running the most compute-intensive function (the blur function in Table IV) was 3.34 times higher on Lambda than it was on Dithen. In contrast, the average cost of running the fastest and least compute-intensive function (the rotate function) was found to be slightly lower on Lambda than on Dithen. This result can be understood as follows. AWS Lambda allocates cores based on memory consumption. For example, if the Lambda functions run on an EC2 instance with 4 GB of memory and 2 cores, and the functions require only 1 GB of memory, Lambda will allocate only (1/4) × 2 = 0.5 cores, so it will not utilize the full processing power of the instance, thereby making the functions run longer. This implies that, when Lambda handles low-load tasks (i.e., tasks easily executable even when only a non-dedicated core is available), it holds an advantage over Dithen. However, when high-load tasks are executed, the pricing and core allocation of Lambda become less advantageous, since the execution time of complex tasks is significantly prolonged in comparison to Dithen (which always allocates an entire core per task, regardless of the task's complexity). Therefore, beyond simple web front-end types of tasks (the ideal application domain for AWS Lambda, around which its design and pricing are built), Lambda is not advantageous for the vast majority of more advanced computing tasks handled by a more generic and extendable CaaS platform like Dithen. Overall, we were able to run the workloads on Dithen at more than 2.5 times lower cost (a 60% reduction) in comparison to Amazon Lambda. This provides a substantial profit margin for a cloud service provider deploying a large-scale multimedia application via the proposed approach instead of Lambda.

E. Deep Learning and Split-Merge Workloads
We conclude our experiments by examining the performance of our platform when processing more complicated workloads, namely: (i) an image classification application based on a group of deep convolutional neural networks (CNNs) [42] that have been trained on ImageNet [43], and (ii) a large-scale word histogram calculation, which is the standard example used with MapReduce-type processing [28].
Fig. 10. Cumulative cost of an image classification workload based on deep CNNs. LB indicates the lower bound.

Fig. 11. Cumulative cost of a word histogram calculation workload based on Split-Merge (the workload is the standard one used within MapReduce testing [28]). LB indicates the lower bound.
The first example is a representative case of a Split-Merge workload, since multiple deep CNNs are used to classify each input image during the Split stage and their results are aggregated via a voting process in the Merge stage, in order to produce the final classification result per image [43]. The inputs used for this workload comprised all images of the "Holidays" dataset [30], as well as 50,000 additional images from ImageNet.

In the second example, the workload counted the occurrences of words in text files, and the counts were then aggregated into a word histogram by a separate Reduce (or Merge) instance. This is similar to a number of text-based workloads where text data is analysed in order to gain insights into trends in market sentiment. The inputs used to test this workload were a selection from the Project Gutenberg library [44], comprising approximately 14,000 text files and 5.5 GB of data. This example is used in order to demonstrate that, while Dithen is more amenable to multimedia workloads (where the partitioning is inherent), it can also be used for more general workloads, such as market sentiment analysis and the semantic analysis of text; a minimal sketch of such a Split-Merge word histogram is given below.
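The sketch assumes a simple tokenization rule and hypothetical file names; in Dithen, each split_count call would run as a separate task on its own compute unit.

    import collections
    import re

    def split_count(text):
        """Split stage: word histogram of one text file (one task per CU)."""
        return collections.Counter(re.findall(r"[a-z']+", text.lower()))

    def merge_counts(partial_counts):
        """Merge stage: a single Reduce instance aggregates the partial
        histograms produced by the Split tasks."""
        total = collections.Counter()
        for counts in partial_counts:
            total.update(counts)
        return total

    # Hypothetical usage over two of the ~14,000 Gutenberg text files:
    partials = [split_count(open(path, errors="ignore").read())
                for path in ["book1.txt", "book2.txt"]]
    histogram = merge_counts(partials)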
It should be noted that, beyond testing, Dithen could be used for deep learning training workloads as well. For example, TensorFlow could be used with batches of training sets, and the results of such batch-based training could be merged at a later stage, after several iterations have been carried out in batch mode. We plan to report on such experiments in a future paper.

The experiments were invoked via the front end following the process described in Section II-B, and experimental benchmarking of the incurred cost was carried out as described in Section V-C. Specifically, the workloads were first executed using Amazon's Autoscaling service (commonly used for such systems [45]), which was used to determine the TTC to use for our platform (since Amazon AS does not allow for TTC-abiding execution). Based on this process, the TTC was set to 1 hr 35 min for the first example and 1 hr 05 min for the second example. In order to account for the time taken by the Merge step of each of the two workloads, the TTC for each Split stage was set to 90% of the overall TTC.

The cumulative cost of the image classification workload can be seen in Figure 10. Similar to previous examples, the cost of Amazon AS is 38% higher than the cost of the AIMD approach of Dithen. We can also see that the cost of this workload in Dithen is only 21% higher than the lower bound, while the cost of the Amazon AS approach is 70% higher than the lower bound.

The cumulative cost of the word histogram calculation workload is depicted in Figure 11. In this case, the cost of Amazon AS is six times that of the Dithen platform and the lower bound. Interestingly, in this case, the cost incurred by the AIMD approach of Dithen is extremely close to the lower bound (less than $0.005 higher) and remains constant at 3 cents. This result is achieved because, in this particular case, Dithen was able to quickly and reliably identify the CUs required to complete the Split tasks and determined that 3 spot instances suffice for the completion of the workload within the predetermined TTC, and below the 1-hour mark (at which point additional charges are levied by AWS). Therefore, it avoided the unnecessary launch of new instances and its cost remained constant at 3 cents, since the Split-Merge workload execution finished in 55 minutes.

Overall, these examples show that the platform can be used to substantially lower the execution costs of complex workloads and that, in certain circumstances, it is even possible for Dithen to approach the lower bound for the incurred cost.

(For simplicity and brevity in our exposition, we did not include results for the remaining methods (MWA, LR and Reactive) in these comparisons, as they were found to incur similar overhead as in the previous experiments. Furthermore, no results are presented for Amazon Lambda, as Lambda cannot support such complex processing tasks.)
VI. CONCLUSIONS

We present Dithen, a novel Computation-as-a-Service framework, which supports the direct upload and execution of multimedia processing workloads. The Dithen architecture comprises multiple spot instances that execute tasks within the workloads until their compute units are fully utilized. Dithen uses the Additive Increase Multiplicative Decrease (AIMD) algorithm for the allocation or termination of compute units.
Fig. 12. Spot price for various instance types (m3.medium, m3.large, m3.xlarge, m3.2xlarge, m4.4xlarge, m4.10xlarge) from the 11th of April to the 11th of July 2015.
APPENDIX A

We briefly analyze the computation costs of Linux instances on AWS EC2, as EC2 is considered to be the largest public cloud service provider today [46] and our system is tested and deployed on the EC2 infrastructure. A comparison of the cost and EC2 compute units (ECUs) of various instance types is given in Table V. An ECU is defined as the "equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor". The m3.medium instance (utilized in this paper) is a single-CU instance with a clock speed of 3.0–3.6 GHz. From the table we can also see that the larger instances consist of increasing numbers of CUs (i.e., virtual cores available for computations) with similar clock speeds. We can also see that the "On Demand" cost and the spot price are both linearly dependent on the number of CUs. Thus, we can conclude that it is more efficient to use a large number of cheaper instances than a small number of more expensive instances, as this allows for greater granularity when controlling the number of active instances, without any corresponding increase in cost. (Table V does not include all instance types available on Amazon's EC2; however, all non-included instances are memory, computation or storage variants of the instances depicted in Table V. The spot prices depicted in Table V were taken on the 10th of July 2015.)

From Table V we can also see the difference between the "On Demand" cost and the spot price. Spot instances are instances that will only function when a user's bid is greater than the current spot price. Essentially, the user gives up the certainty of having computational resources available in exchange for a significant reduction in cost. We can see from Table V that this reduction ranges from 78% to 89%. However, it is difficult to run a CaaS service without guarantees of the availability of computational resources, so an analysis of the fluctuation of the spot price is necessary to determine whether spot instances should be utilized.

The spot instance price for various instance types in the three-month period from the 11th of April to the 11th of July 2015 is shown in Figure 12 (in the case of m4.4xlarge and m4.10xlarge, data was only available from the 11th of June). Evidently, the volatility of the spot price is proportional to the number of CUs that an instance possesses. Therefore, while it would be difficult to rely on an m4.10xlarge spot instance, the spot price of the m3.medium spot instance is remarkably stable. Specifically, at no point in the three-month period does the m3.medium spot price exceed $ . . Therefore, we can conclude that a significant reduction in cost can be achieved by using m3.medium spot instances with little effect on the reliability of the service, and with more flexibility than when using larger spot instances with more CUs.

TABLE V
COST OF VARIOUS LINUX INSTANCES ON THE AMAZON EC2 PLATFORM IN THE NORTH VIRGINIA REGION.

Instance Type                         m3.medium   m3.large   m3.xlarge   m3.2xlarge   m4.4xlarge   m4.10xlarge
EC2 compute units (ECUs)              3           6.5        13          26           53.5         124.5
CUs                                   1           2          4           8            16           40
On-demand cost ($)                    0.067       0.133      0.266       0.532        1.008        2.52
Spot price ($)                        0.0081      0.0173     0.0333      0.066        0.1097       0.5655
Cost reduction when using spot (%)    88          87         87          88           89           78
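Such an analysis can be reproduced directly from the EC2 API; the sketch below scans the spot price history for an instance type via boto3's describe_spot_price_history paginator (a standard EC2 operation), with the region and product description chosen to match Table V.

    import boto3
    from datetime import datetime

    ec2 = boto3.client("ec2", region_name="us-east-1")  # North Virginia

    def max_spot_price(instance_type, start, end):
        """Highest spot price observed for instance_type in [start, end]."""
        paginator = ec2.get_paginator("describe_spot_price_history")
        highest = 0.0
        for page in paginator.paginate(InstanceTypes=[instance_type],
                                       ProductDescriptions=["Linux/UNIX"],
                                       StartTime=start, EndTime=end):
            for record in page["SpotPriceHistory"]:
                highest = max(highest, float(record["SpotPrice"]))
        return highest

    # e.g., the three-month window of Figure 12 (note that EC2 retains only
    # a limited price-history window, so such a scan must be run within it):
    # max_spot_price("m3.medium", datetime(2015, 4, 11), datetime(2015, 7, 11))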
REFERENCES

[1] Y. Song, M. Zafer, and K.-W. Lee, "Optimal bidding in spot instance market," in Proc. IEEE INFOCOM, 2012, pp. 190–198.
[2] L. Zhang, Z. Li, and C. Wu, "Dynamic resource provisioning in cloud computing: A randomized auction approach," in Proc. IEEE INFOCOM, 2014.
[3] X. Nan, Y. He, and L. Guan, "Optimal allocation of virtual machines for cloud-based multimedia applications," in Proc. IEEE Int. Conf. Multimedia Signal Proc. (MMSP), 2012, pp. 175–180.
[4] T. Hobfeld, R. Schatz, M. Varela, and C. Timmerer, "Challenges of QoE management for cloud applications," IEEE Comm. Mag., vol. 50, no. 4, pp. 28–36, 2012.
[5] S. Islam and J.-C. Grégoire, "Giving users an edge: A flexible cloud model and its application for multimedia," Fut. Gen. Comp. Syst., vol. 28, no. 6, pp. 823–832, 2012.
[6] Y. Andreopoulos and M. van der Schaar, "Incremental refinement of computation for the discrete wavelet transform," IEEE Trans. on Signal Process., vol. 56, no. 1, pp. 140–157, 2008.
[7] V. Spiliotopoulos et al., "Quantization effect on VLSI implementations for the 9/7 DWT filters," in Proc. IEEE Int. Conf. on Acoust., Speech, and Signal Process. (ICASSP'01), 2001, vol. 2, pp. 1197–1200.
[8] Y. Andreopoulos et al., "A local wavelet transform implementation versus an optimal row-column algorithm for the 2-D multilevel decomposition," in Proc. IEEE Int. Conf. on Image Process. (ICIP 2001), 2001, vol. 3, pp. 330–333.
[9] Y. Andreopoulos et al., "A new method for complete-to-overcomplete discrete wavelet transforms," in Proc. 14th IEEE Int. Conf. on Digital Signal Process. (DSP 2002), 2002, vol. 2, pp. 501–504.
[10] N. Kontorinis et al., "Statistical framework for video decoding complexity modeling and prediction," IEEE Trans. on Circ. and Syst. for Video Technol., vol. 19, no. 7, pp. 1000–1013, 2009.
[11] K. Masiyev et al., "Cloud computing for business," in Proc. IEEE Int. Conf. Appl. of Inf. and Comm. Tech. (AICT), 2012, pp. 1–4.
[12] I. Andreopoulos et al., "A hybrid image compression algorithm based on fractal coding and wavelet transform," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS 2000), 2000, vol. 3, pp. 37–40.
[13] Y. Andreopoulos et al., "High-level cache modeling for 2-D discrete wavelet transform implementations," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 34, no. 3, pp. 209–226, 2003.
[14] M. Schwarzkopf et al., "Omega: Flexible, scalable schedulers for large compute clusters," in Proc. 8th ACM European Conference on Computer Systems (EuroSys '13), 2013, pp. 351–364.
[15] E. Boutin et al., "Apollo: Scalable and coordinated scheduling for cloud-scale computing," in Proc. USENIX Symp. on Operating Systems Design and Implementation (OSDI), 2014.
[16] K. Ousterhout et al., "Sparrow: Distributed, low latency scheduling," in Proc. 24th ACM Symposium on Operating Systems Principles, 2013, pp. 69–84.
[17] A. Gandhi et al., "Autoscale: Dynamic, robust capacity management for multi-tier data centers," ACM Trans. Comput. Syst., vol. 30, no. 4, pp. 14:1–14:26, Nov. 2012.
[18] A. Paya and D. Marinescu, "Energy-aware load balancing and application scaling for the cloud ecosystem," IEEE Trans. on Cloud Comp., vol. PP, no. 99, pp. 1–1, 2015.
[19] R. Ranjan, K. Mitra, and D. Georgakopoulos, "MediaWise cloud content orchestrator," J. of Int. Serv. and Appl., vol. 4, no. 1, pp. 1–14, 2013.
[20] D. Jung et al., "An estimation-based task load balancing scheduling in spot clouds," in Network and Parallel Computing, pp. 571–574, Springer, 2014.
[21] V. Gulisano et al., "StreamCloud: An elastic and scalable data streaming system," IEEE Trans. Par. and Distr. Syst., vol. 23, no. 12, pp. 2351–2365, 2012.
[22] M. A. Rodriguez and R. Buyya, "Deadline based resource provisioning and scheduling algorithm for scientific workflows on clouds," IEEE Trans. on Cloud Comp., vol. 2, no. 2, pp. 222–235, April 2014.
[23] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," in Proc. 2011 Int. Conf. High Perf. Comp., Netw., Stor. and Anal., 2011, pp. 49:1–49:12.
[24] R. Shorten, F. Wirth, and D. Leith, "A positive systems model of TCP-like congestion control: Asymptotic results," IEEE/ACM Trans. Netw., vol. 14, no. 3, pp. 616–629, 2006.
[25] P. Viola and M. J. Jones, "Robust real-time face detection," Int. J. of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[26] ImageMagick, http://www.imagemagick.org.
[27] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[28] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[29] A. Abbas, N. Deligiannis, and Y. Andreopoulos, "Vectors of locally aggregated centers for compact video representation," in Proc. IEEE Int. Conf. Multimedia and Expo (ICME'15), 2015, pp. 1–6.
[30] A. Chadha and Y. Andreopoulos, "Region-of-interest retrieval in large image datasets with Voronoi VLAD," in Computer Vision Systems, pp. 218–227, Springer, 2015.
[31] J. Yang et al., "Two-dimensional PCA: A new approach to appearance-based face representation and recognition," IEEE Trans. Patt. Anal. and Machine Intel., vol. 26, no. 1, pp. 131–137, 2004.
[32] A. Garcia, H. Kalva, and B. Furht, "A study of transcoding on cloud environments for video content delivery," in Proc. 2010 ACM Multim. Workshop on Mob. Cloud Media Comput., 2010, pp. 13–18.
[33] J. Pouwelse et al., "The Bittorrent P2P file-sharing system: Measurements and analysis," in Peer-to-Peer Systems IV, vol. 3640 of Lecture Notes in Computer Science, pp. 205–216, Springer Berlin Heidelberg, 2005.
[34] B. D. O. Anderson and J. B. Moore, Optimal Filtering, Courier Corporation, 2012.
[35] R. Margolies, A. Sridharan, et al., "Exploiting mobility in proportional fair cellular scheduling: Measurements and algorithms," in Proc. IEEE INFOCOM, 2014, pp. 1339–1347.
[36] S. Leutenegger, M. Chli, and R. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in Proc. IEEE Int. Conf. Comp. Vis. (ICCV), 2011, pp. 2548–2555.
[37] N. Roy, A. Dubey, and A. Gokhale, "Efficient autoscaling in the cloud using predictive models for workload forecasting," in Proc. IEEE Int. Conf. on Cloud Comp. (CLOUD), 2011, pp. 500–507.
[38] V. Debusschere, S. Bacha, et al., "Hourly server workload forecasting up to 168 hours ahead using seasonal ARIMA model," in Proc. IEEE Int. Conf. on Industr. Technol., 2012.
[39] R. Calheiros et al., "Workload prediction using ARIMA model and its impact on cloud applications' QoS," IEEE Trans. on Cloud Comp., to appear.
[40] M. Tighe and M. Bauer, "Integrating cloud application autoscaling with dynamic VM allocation," in Proc. IEEE Netw. Oper. and Manag. Symp. (NOMS), May 2014, pp. 1–9.
[41] A. Krioukov et al., "NapSAC: Design and implementation of a power-proportional web cluster," ACM SIGCOMM Comp. Comm. Rev., vol. 41, no. 1, pp. 102–108, 2011.
[42] K. Chatfield et al., "Return of the devil in the details: Delving deep into convolutional nets," arXiv preprint arXiv:1405.3531, 2014.
[43] A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," in Adv. in Neural Inf. Process. Syst. (NIPS'12), 2012.
[44] Project Gutenberg, http://www.gutenberg.org.
[45] Proc. 28th Annual ACM Symp. on Applied Comp. (SAC'13), ACM, 2013, pp. 411–414.
[46] S. Choy et al., "A hybrid edge-cloud architecture for reducing on-demand gaming latency," Multim. Syst. J., vol. 20, no. 2, March 2014.
Joseph Doyle graduated from Trinity College Dublin in 2009 with a B.A.I., B.A. degree in Computer and Electronic Engineering as a gold medalist. He was awarded a Ph.D. in 2013 from Trinity College Dublin. He was a post-doctoral researcher in Trinity College Dublin and University College London from 2013 to 2014 and 2014 to 2016, respectively. He is a cofounder of Dithen Ltd. (London, U.K.) and is also a Senior Lecturer at the University of East London, London, U.K. His research interests include cloud computing, cognitive autoscaling, green computing, and network optimization.
Vasileios Giotsas graduated with distinction from University College London (UCL) in 2008 with an MSc in Data Communications, Networks and Distributed Systems. He was awarded a Ph.D. from University College London in 2014. He is a co-founder of Dithen Ltd. (London, U.K.) and is also a postdoctoral scientist at the UCSD Center for Applied Internet Data Analysis (CAIDA), La Jolla, CA. His research interests span the areas of distributed systems, cloud computing, routing protocols, Internet measurements, and Internet economics.
Mohammad Ashraful Anam obtained the Ph.D. in Electronic Engineering from University College London (Lombardi Prize for the Best Ph.D. Thesis in Electronic Engineering) and is a cofounder of Dithen Ltd. (London, U.K.), as well as a post-doctoral research associate in the Department of Electronic and Electrical Engineering, University College London, London, U.K. His research interests are in error-tolerant computing and reliable cloud computing.